Friday, August 7, 2015

Understanding Unintuitive TABLESAMPLE Results

In 9.5 Alpha 2:

create table mil ( id int, val text );
insert into mil select i, i::text || '-val'  from
generate_series(1,1000000) as gs(i);
analyze;

postgres=# select * from mil tablesample system ( 0.04 );
 id | val
----+-----
(0 rows)


Huh, what?

So, what I didn't understand here is the way rows are selected for TABLESAMPLE SYSTEM.  Since SYSTEM is page-based, I thought we calculated the requested % of pages and then picked that many pages at random.  This table has exactly 185 rows per page, so a 0.04% sample should have returned 370 rows (2 pages) every time.  But that's not what happened. In fact, running the following query repeatedly, I got a variety of counts:

SELECT count(*) FROM (select * from mil tablesample system ( 0.04 ) ) as a;
370
370
370
555
555
185
0
925
 
925?  0?  What the hey?

What's really happening is that pages for SYSTEM are selected a different way.  Each page in the table is checked against the sampling probability once, independently.  This table has about 5,400 pages, and each one gets a 0.04% chance of being picked, so the expected number of selected pages is about 2, but the actual count is binomial.  That's why the row counts come back as multiples of 185, anywhere from 0 pages (0 rows) up to 5 pages (925 rows) in my runs.  On average you'll get the number of pages you're expecting, but the numbers will vary from request to request quite a bit.
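To convince myself, I ran a quick simulation of that per-page coin flip.  This is just a sketch of the idea, not the actual sampler code; the 5,406 page count and 185 rows per page come from my test table above:

-- flip a 0.04% coin for each of the table's ~5,406 pages, then multiply
-- the winning pages by 185 rows per page; run it a few times and you get
-- 0, 185, 370, 555, 740, 925 ... just like the real thing
SELECT count(*) FILTER (WHERE random() < 0.0004) * 185 AS simulated_rows
FROM generate_series(1, 5406);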

This also means that SYSTEM is a bad choice for really small sample sizes, like 0.01%.  BERNOULLI is better, because it checks row by row, and therefore the size of the returned sample is much more predictable.  It will still have a bit of variation, though; in my testing, +/- 10% on a few hundred rows.
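For comparison, here's the same 0.04% sample using BERNOULLI; since 0.04% of a million rows is 400, the counts should cluster fairly tightly around 400 instead of jumping in multiples of 185:

SELECT count(*) FROM (select * from mil tablesample bernoulli ( 0.04 ) ) as a;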

Gulcin Yildirim has a great explanation of this on the 2nd Quadrant blog.

So, what if you need TABLESAMPLE to get a very specific number of rows for you?  Well, that's why Petr Jelinek wrote the optional (loadable) SYSTEM_ROWS sampling method.  This can be loaded as the tsm_system_rows extension in 9.5.
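Loading and using it looks roughly like this (SYSTEM_ROWS takes a row count rather than a percentage):

CREATE EXTENSION tsm_system_rows;

SELECT count(*) FROM (select * from mil tablesample system_rows ( 400 ) ) as a;
-- 400, every time

Note that it's still block-based under the hood, so the result isn't as statistically random as a BERNOULLI sample, but the number of rows you get back is exact.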

Hopefully that helps folks be less confused than I was about how TABLESAMPLE works.

4 comments:

  1. Is this behaviour of SYSTEM mandated by the standard, or just an implementation detail? The intuitive behaviour (first deciding how many pages to fetch, then selecting that many pages at random) not only yields more consistent results, it seems like it would be less CPU work for PG.

    Replies
    1. According to the author, it's not actually less work. If we're picking a set number of pages, we need to avoid duplicates, empty pages, and a few other things.

  2. If I understand this correctly, it is possible that not all pages contain the same number of 'live' tuples, e.g. after a DELETE. So in order to guarantee a correct percentage, PostgreSQL would have to keep track of how many rows were already sampled and somehow choose the next pages so that it approximates the target as well as possible. I doubt that this would be less CPU work, and it would no longer qualify as a random process.

    Replies
    1. That is also an issue, but not the one I'm describing above. What I'm explaining above is that for SYSTEM, there's a set % chance that any particular page will be included, which leads to randomness in how many pages are actually included.
