create table mil ( id int, val text );
insert into mil select i, i::text || '-val' from
generate_series(1,1000000) as gs(i);
postgres=# select * from mil tablesample system ( 0.04 );
id | val
So, what I didn't understand at first was how rows are selected for TABLESAMPLE SYSTEM. Since SYSTEM is page-based, I assumed it computed the requested percentage of the table's pages and then picked that many pages at random. Since this table holds exactly 185 rows per page, that should return 370 rows (2 pages) every time. But that's not what happened. In fact, running the following query repeatedly, I got a variety of counts:
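(As a sanity check, you can estimate the rows-per-page figure from the planner statistics after an ANALYZE; this is just an estimate based on reltuples/relpages, using the mil table created above:)

```sql
-- Sanity check: estimate rows per page from planner stats.
-- 1,000,000 tuples spread over ~5406 pages works out to ~185 rows per page.
SELECT reltuples, relpages,
       round(reltuples / relpages) AS rows_per_page
FROM pg_class
WHERE relname = 'mil';
```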
SELECT count(*) FROM (select * from mil tablesample system ( 0.04 ) ) as a;
370, 370, 370, 555, 555, 185, 0, 925
925? 0? What the hey?
What's really happening is that pages for SYSTEM are selected a different way: each page is checked against the sampling probability once, independently. This means that while on average you'll get the number of pages you're expecting, the count will vary quite a bit from request to request.
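You can see the effect by simulating that per-page coin flip in plain SQL. This is just a sketch of the behavior, not the actual sampling code; it assumes the ~5406 pages and 185 rows per page of the table above:

```sql
-- Flip a 0.04% coin once per page; each page that passes
-- contributes all 185 of its rows to the sample.
SELECT count(*) AS pages_kept,
       count(*) * 185 AS rows_returned
FROM generate_series(1, 5406) AS gs(page)
WHERE random() < 0.0004;
```

Run it a few times and you get the same jumpy pattern: 0, 185, 370, 555, and so on, always in multiples of a full page.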
This also means that SYSTEM is a bad choice for really small sample sizes, like 0.01%. BERNOULLI is better, because it checks each row individually, so the size of the returned sample is much more predictable. It still varies a bit, though; in my testing, +/- 10% on a few hundred rows.
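The same simulation at the row level shows why BERNOULLI is steadier (again a sketch, assuming the 1,000,000-row table above):

```sql
-- Flip the 0.04% coin once per row instead of once per page.
-- The count clusters around 400, varying by a handful of rows
-- rather than by whole 185-row pages.
SELECT count(*) AS rows_returned
FROM generate_series(1, 1000000) AS gs(i)
WHERE random() < 0.0004;
```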
Gulcin Yildirim has a great explanation of this on the 2nd Quadrant blog.
So, what if you need TABLESAMPLE to return a specific number of rows? Well, that's why Petr Jelinek wrote the optional (loadable) SYSTEM_ROWS sampling method, available as the tsm_system_rows extension in 9.5.
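Using it looks like this:

```sql
-- tsm_system_rows ships in contrib; load it once per database.
CREATE EXTENSION tsm_system_rows;

-- Returns exactly 500 rows (unless the table has fewer).
SELECT * FROM mil TABLESAMPLE SYSTEM_ROWS(500);
```

Note that it's still block-based underneath, so the rows come back in page-sized clumps from randomly chosen pages rather than being independently random across the table.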
Hopefully that helps folks be less confused than I was about how TABLESAMPLE works.