Thursday, January 29, 2015

Tag all the Things, part 2

continued from part 1.

So the real test of the different tagging data structures and types is how they perform.  The tests below were run on a 10 million document database, on an AWS m3.large virtual server.  Importantly, the total database size is around 7GB on a machine with 3.75GB RAM, so I've deliberately sized things so that none of the tables will fit into shared_buffers.

So our first check was for size, including indexes.  All other things being equal, a smaller table and its indexes are faster to query, and leave more RAM free to cache other data.  So smaller is better.

    tagtest=# select relname,
        pg_size_pretty(pg_total_relation_size(relid))
    from pg_stat_user_tables;
        relname     | pg_size_pretty
    ----------------+----------------
     tags           | 96 kB
     doc_tags_id    | 1551 MB
     doc_tags_text  | 2106 MB
     doc_tags_json  | 1285 MB
     doc_tags_array | 1036 MB

How about a graph of that?

The text array is a clear winner here, half the size of the default text tag option.  Both of the advanced data types are smaller than any of the "normalized" versions.  This is largely because of per-row overhead, which is 24 bytes per row, and really adds up.

So for our first exercise, we'll want to compare the amount of time required to retrieve all of the tags for a single document, which is probably the most common single operation.   In this case I tested both documents with 1 tag (75% of all documents) and documents with 9 tags (< 1% of documents). 
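As a sketch of what those lookups might look like (column names such as tag and tag_id are assumed from the schemas in part 1, and doc_id 12345 is just a placeholder):

    -- normalized text tags: one row per tag
    select tag from doc_tags_text where doc_id = 12345;

    -- surrogate-key IDs: requires a join to resolve the tag names
    select t.tag
    from doc_tags_id dt
    join tags t on t.tag_id = dt.tag_id
    where dt.doc_id = 12345;

    -- JSONB and array: a single-row lookup returns all tags at once
    select tags from doc_tags_json where doc_id = 12345;
    select tags from doc_tags_array where doc_id = 12345;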

As you can see, these are all pretty close, except for the IDs method.  That's because the doc_tags_id table needs to join against the tags table to look up any tag; that's a major drawback of the surrogate key approach.  We'll see that penalty in other tests.  Also, notice that both the IDs and the text approach take more time the more tags a document has, whereas the JSONB and array data take constant time, because they are each looking up one row regardless.

The next test is to simulate looking up paginated search results when searching by tag.  This comes in three parts: first, we search for a common tag and retrieve the first "page" of 25 results.  Then we search for the 11th page of 25 results, since sometimes high offsets can trigger very different query plans, and are a common issue in searches.    The queries for JSONB and arrays use the "contains" operator to do a GIN index search, and look like this:

    select doc_id
    from doc_tags_json
    where tags @> '["beta"]'::jsonb
    order by doc_id limit 25 offset 0;

    select doc_id
    from doc_tags_array
    where tags @> ARRAY['math']
    order by doc_id limit 25 offset 0;

This shows us more or less what we'd expect.  The ID approach is very slow because filtering via a JOIN is an order of magnitude slower than filtering in-table.  Surprisingly, the JSONB and array approaches are slightly faster than TEXT for the 11th page test, even though they are (as expected) slower for the first page test.  I'm not sure why that is, but it stayed that way across multiple test runs.
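For comparison, the ID-based search has to go through the tags table; it looks something like this (join columns assumed from the schema in part 1):

    select dt.doc_id
    from doc_tags_id dt
    join tags t on t.tag_id = dt.tag_id
    where t.tag = 'math'
    order by dt.doc_id limit 25 offset 0;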

Then we search for an uncommon tag, where to get 25 results requires checking an entire index or the base table.

Whoa!  What's going on here?  What happened to JSONB?  We expected IDs to be slow, but not JSONB.  Let's check the query plan for the uncommon tag search using explain.depesz.  See here:

    >  Index Scan using doc_tags_json_doc_id_idx on doc_tags_json

Why is it using the doc_id index when the JSONB index is available?  Well, when we implemented JSONB for 9.4, one of the unimplemented features was having a stats model for containment searches.  As we've done with new data types in the past, we simply use "contsel" as our selectivity estimate, which means that all "contains" queries on JSONB fields estimate that 1% of rows will be returned.  This causes the planner to mis-estimate that it can get the first 25 rows faster by searching the doc_id index and applying the JSONB condition as a filter. 
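You can see the mis-estimate yourself by running EXPLAIN on the rare-tag query; because of contsel, the row estimate should come out to roughly 1% of the table no matter how rare the tag actually is ('rare_tag' here is a placeholder):

    explain
    select doc_id
    from doc_tags_json
    where tags @> '["rare_tag"]'::jsonb
    order by doc_id limit 25 offset 0;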

I can force the planner to use the JSONB index for rare tags, but then it will use that index for common tags as well.  That's bad; the doc_id index is actually much faster for common tags.  So I can have fast searches for rare tags, or fast searches for common tags, but not both.  Looks like that knocks JSONB out of the running.
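One way to force the GIN index, sketched below, is to order on an expression so the planner can't use the doc_id index to satisfy the ORDER BY; but as noted above, this applies to common tags too, so it just flips the trade-off:

    -- doc_id + 0 defeats the doc_id index for ordering,
    -- so the planner falls back to the GIN index on tags
    select doc_id
    from doc_tags_json
    where tags @> '["rare_tag"]'::jsonb
    order by doc_id + 0 limit 25 offset 0;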

Continued in Part 3.


  1. Thank you for this comparative analysis for a very common problem!

    I looked it up and pg_total_relation_size includes indexes, so I'm impressed with how small the GIN indexes must be for JSONB and array.

  2. I think there's a typo: the per-row overhead is 24 bytes, not 14.

  3. Can you show a representative sample of the tags, and some details of the distribution? e.g. the top ten tags by usage and their count, and the top ten counts, the median+mean tag usage, the total number of tags, the total number of docs, the top ten tag counts on a single doc, the median+mean tags per doc, and the average tag length.

    That would make it easier to judge whether this test is representative for a reader's use case.