Friday, January 30, 2015

Tag All The Things, part 3

Continued from Part 2.

The next test is two tags combined.  This is where the alternative approaches really pull ahead of the traditional tagging approaches.

For example, here's the text tag query:

    select doc_id
    from doc_tags_text dt1
        join doc_tags_text dt2
        using (doc_id)
    where dt1.tag = 'math'
        and dt2.tag = 'physics'
    order by doc_id limit 25;


The query for tag IDs is even worse:

    select di1.doc_id
    from doc_tags_id di1
        join doc_tags_id di2
        on di1.doc_id = di2.doc_id
        join tags tags1
        on di1.tag_id = tags1.tag_id
        join tags tags2
        on di2.tag_id = tags2.tag_id
    where tags1.tag = 'thrift shop'
        and tags2.tag = 'blogging'
    order by di1.doc_id limit 25;


Imagine how either of these would look for three tags, or four.   Now compare that with the JSONB and array queries:

    select doc_id
    from doc_tags_array
    where tags @> array['math','physics']
    order by doc_id limit 25;

    with find_docs as (
      select doc_id
      from doc_tags_json
      where tags @> '[ "thrift shop", "blogging" ]'
    )
    select * from find_docs
    order by doc_id limit 25;


(the dodge with the WITH clause is to force use of the JSONB index, per Part 2)

Big difference, eh?  It can probably be taken as a given that if you need to do searches which involve combining two or more tags, you really want to use a GIN indexed approach just for code maintainability.  Just in case, though, let's look at performance, both for combining two common tags ("math" and "physics", for 957 hits), and two rare tags ("thrift shop" and "blogging", which only have 5 hits).  The differences in performance in the approaches were so extreme, I have to use a logarithmic scale for this graph, and am providing the raw numbers:




That's a huge difference.  The "JSONB fixed" numbers are for a query where I force the planner to use the JSONB index instead of letting it choose its own path (using the doc_id index).  As you can see, two-tag search via GIN index is an order of magnitude faster for common tags, and three orders of magnitude faster for rare tags.  And the ID approach completely bombs for two-tag search.

For our final test, we'll build a tag cloud from scratch.  This involves pulling counts of all distinct tags and then taking the top 100.  In a real production environment, you wouldn't do things this way; you'd maintain counts by trigger, or use HyperLogLog, or something similar.  But it makes a good test of mass index access and/or table scans for the various approaches.

Here the JSONB and array queries get annoying.  Examples:

    select count(*) as tag_count, tag
    from doc_tags_json
        join tags on doc_tags_json.tags @> to_json(tags.tag)::jsonb
    group by tag
    order by tag_count desc limit 100;

    select count(*) as tag_count, tag
    from doc_tags_array
        join tags on doc_tags_array.tags @> array[tags.tag::text]
    group by tag
    order by tag_count desc limit 100;


As I mentioned at the beginning of this series, arrays don't automatically prevent duplicate entries.  If I don't care that much about accuracy (and generally for tag clouds one doesn't) I can use a faster query with UNNEST:

    select count(*) as tag_count, ut.tag
    from doc_tags_array,
       lateral unnest(doc_tags_array.tags) as ut(tag)
    group by ut.tag
    order by tag_count desc limit 100;




(the lateral unnest is a 9.4 feature, and darned useful).  Let's look at execution times:



So, here you can see that the array and JSONB approaches lose; they simply can't be as fast as a plain text column for building a count.  This does mean that if your application spends a lot of its time dynamically building tag clouds, arrays might not be the way to go and JSONB certainly isn't.

Conclusion: the overall winner is an array of text, with a GIN index.  This is better for one-tag searches, worlds faster for two-tag searches, and competitive at other tasks.  It's also the smallest representation, and becomes smaller and faster still if you actually put the array of tags in the documents table.  Still, there are times that you would want to use the traditional child table with plain text tags: if you build tag clouds a lot or if you never search for two tags and your ORM can't deal with Postgres arrays.

Sadly, I have to conclude that my personal favorite, JSONB, isn't quite ready for this use.  See discussions on the pgsql-performance list regarding how to estimate selectivity for JSONB contains.

I realize that I did not test comparative write speeds.  Maybe next time; I've already torn down the AWS instance.

20 comments:

  1. Josh, we'll work on jsonb stats for 9.6. Today we discusses this with Alexander and decided to experiment with several approaches.

    ReplyDelete
    Replies
    1. Great! See also the discussion on -performance about this.

      Delete
  2. Hi Josh,

    first of all, thanks a lot for this informative article series! As chance would have it, we've recently implemented tags in our document management system Pepa using the traditional approach with a tags relation and a join table as we are not yet at the point of optimizing performance and this is the most flexible approach. However, we've come up with a different solution for querying for multiple tags using something like GROUP BY doc_id HAVING array_agg(tag) @> array['thrift shop', 'blogging']. My guess is that it doesn't perform terribly well either but at least it makes the query much more pleasant to write (or generate) than having to join with the tags table again for every queried tag. Should you continue with your benchmarks I'd be curious how it compares, though!

    Cheers
    Moritz

    ReplyDelete
    Replies
    1. Thanks Moritz. We'll see if I get back to doing benchmarks again.

      Delete
  3. This series is great. Thanks for taking the time. I'm implementing tags for a high-traffic production API built on Postgres 9.4, and you've convinced me to store tags in an array in the document table, together with triggers to update tag counts in a separate table (along with other info about each tag).

    I had previously been planning to use the standard junction table method since I have other data associated with each tag.

    One question: You mention repeatedly that JSON arrays stored in JSONB are order-agnostic and automatically de-duplicate. But isn't that only true of JSON objects/hashes, not arrays?

    Array preserve order and duplicated. For example:

    postgres=> select '["a", "b"]'::jsonb = '["b", "a"]'::jsonb;
    ?column?
    ----------
    f
    (1 row)

    postgres=> SELECT '["a", "b", "a"]'::JSONB;
    jsonb
    -----------------
    ["a", "b", "a"]
    (1 row)

    ReplyDelete
    Replies
    1. Yes, I confused things between keys and array elements; see previous post. Keys are deduplicated, array elements aren't.

      Delete
  4. So questions: Why are you introducing the same table multiple times in your tag-id query?

    In other words, what does your existing query:

    select di1.doc_id
    from doc_tags_id di1
    join doc_tags_id di2
    on di1.doc_id = di2.doc_id
    join tags tags1
    on di1.tag_id = tags1.tag_id
    join tags tags2
    on di2.tag_id = tags2.tag_id
    where tags1.tag = 'thrift shop'
    and tags2.tag = 'blogging'
    order by di1.doc_id limit 25;

    Accomplish over this new query that I would propose, which only introduces each table once?

    select dt.doc_id
    from doc_tags_id dt
    join tags t on (dt.tag_id = t.tag_id)
    where t.tag in ('blogging', 'thrift shop')
    order by 1 limit 25;

    Maybe I'm mistaken, but it looks like the query is being made less efficient than it could be, so naturally it would have a worse run time...

    ReplyDelete
    Replies
    1. Your query asks "give me all the documents with EITHER the tags 'blogging' or 'thrift shop'" What I want is "give me all the documents with BOTH the tags 'blogging' and 'thrift shop'".

      You could do this with a group by and an COUNT(), but in my testing that wasn't any faster.

      Delete
    2. Ah. Makes sense. Alright, thanks

      Delete
  5. The standard join table might fare a lot better if the lookup-by-tag index had included the doc_id.

    ReplyDelete
    Replies
    1. Eamon: you're claiming that having an index on (tag_id, doc_id) would make a significant difference, even though we already have an index on (doc_id, tag_id)? Can you run some tests?

      Delete
  6. Did you do any testing when using the tag itself as the primary key in your tags table? Using a natural pkey would probably speed things up as you can avoid a join when querying the tags no?

    ReplyDelete
    Replies
    1. In Postgres, there's no difference between a PK and any other unique index, in terms of performance. Also, the "text tags" setup *is* the one using natural keys for tags -- see Part I.

      Delete
  7. Josh, thank you for this article! I've been experimenting with json and text array columns like you are showing here in Postgres 9.3. I’m loving it. I look forward to your findings on jsonb in the future. I really hope Postgres continues to create these new column types.

    The project I'm currently building uses Postgres 9.3 and from a coder and sysadmin PoV it has been wonderful. I’m definitely a Postgres head going forward. Big thanks to the Postgres team, I'm looking forward to using 9.4

    ReplyDelete
  8. I assume that a known limitation of not doing the join is that you cannot rename a tag? Or at least to rename it you would have to update all the rows that use it.

    I have that requirement so I am thinking of using the array approach, but still have a separate tag_lookup table that I would never join against, but rather query that table first to get the tag IDs from the tag names. Then query against the tags column using the tag IDs. Do you see any issues with this approach?

    ReplyDelete
    Replies
    1. Correct. So if renaming tags is a major part of your application, then the array approach is probably not for you.

      Delete
  9. Thank you for your great post.
    I did obtain the same results when benchmarking some time ago (postgreSQL 9.2, so only "normalized" versus arrays).

    But I obtained better performances yet (IIRC) with the intarray extension.

    It means there's a need for a "tags" table, and even some code at the app level to construct the query, but it lets you create fast and complex queries as :

    select * from doc where tags @@ '1&(2|3)'

    to search for tags 1 and (2 or 3).

    And it's really small in the DB !

    I would love to see it added in your benchmark.

    ReplyDelete
    Replies
    1. For what it's worth, I just did basic performance testing using intarray extension vs text[]. I didn't see any substantive difference and when I add in the overhead of having to dereference the ids to tags, using text[] seems a better choice.

      Delete
  10. Hi,
    I think your benchmark doesn't cover one quite common case that if usually very heavy on SQL.
    Get all documents tagged by X or Y or Z.
    I'll give real life example: you build portfolio of 10 stocks and you want to see feed of all content related to your stocks.
    We're currently using Sphinx (full text search engine) for this purpose and it works surprisingly fast when searching by less than 30 tags.
    I've done a bit of benchmark vs GIN index on int array with intersection operator and it also works fast, but slower than sphinx.

    Do you maybe know a better solution in this case, or ways to tweak GIN index to optimize it for intersection operations?

    ReplyDelete
    Replies
    1. Can you post your actual setup and numbers?

      And frankly, I'd expect sphinx to be faster. It's a dedicated tool which only does one thing. If it couldn't do a tag search faster than Postgres, it would be pretty sucky.

      Delete