Database Soup: Tag All The Things, part 3

Friday, January 30, 2015

That's a huge difference. The "JSONB fixed" numbers are for a query where I force the planner to use the JSONB index instead of letting it choose its own path (using the doc_id index). As you can see, two-tag search via GIN index is an order of magnitude faster for common tags, and three orders of magnitude faster for rare tags. And the ID approach completely bombs for two-tag search.

For our final test, we'll build a tag cloud from scratch. This involves pulling counts of all distinct tags and then taking the top 100. In a real production environment, you wouldn't do things this way; you'd maintain counts by trigger, or use HyperLogLog, or something similar. But it makes a good test of mass index access and/or table scans for the various approaches.

Here the JSONB and array queries get annoying. Examples:

    select count(*) as tag_count, tag
    from doc_tags_json
        join tags on doc_tags_json.tags @> to_json(tags.tag)::jsonb
    group by tag
    order by tag_count desc limit 100;

    select count(*) as tag_count, tag
    from doc_tags_array
        join tags on doc_tags_array.tags @> array[tags.tag::text]
    group by tag
    order by tag_count desc limit 100;

As I mentioned at the beginning of this series, arrays don't automatically prevent duplicate entries. If I don't care that much about accuracy (and generally for tag clouds one doesn't) I can use a faster query with UNNEST:

    select count(*) as tag_count, ut.tag
    from doc_tags_array,
       lateral unnest(doc_tags_array.tags) as ut(tag)
    group by ut.tag
    order by tag_count desc limit 100;

(the lateral unnest is a 9.4 feature, and darned useful). Let's look at execution times:

So, here you can see that the array and JSONB approaches lose; they simply can't be as fast as a plain text column for building a count. This does mean that if your application spends a lot of its time dynamically building tag clouds, arrays might not be the way to go and JSONB certainly isn't.

Conclusion: the overall winner is an array of text, with a GIN index. This is better for one-tag searches, worlds faster for two-tag searches, and competitive at other tasks. It's also the smallest representation, and becomes smaller and faster still if you actually put the array of tags in the documents table. Still, there are times that you would want to use the traditional child table with plain text tags: if you build tag clouds a lot or if you never search for two tags and your ORM can't deal with Postgres arrays.

Sadly, I have to conclude that my personal favorite, JSONB, isn't quite ready for this use. See discussions on the pgsql-performance list regarding how to estimate selectivity for JSONB contains.

I realize that I did not test comparative write speeds. Maybe next time; I've already torn down the AWS instance.

20 comments:

Oleg BartunovJanuary 30, 2015 at 12:07 PM
Josh, we'll work on jsonb stats for 9.6. Today we discusses this with Alexander and decided to experiment with several approaches.
ReplyDelete
Replies
DerGuteMoritzJanuary 31, 2015 at 5:03 AM
Hi Josh,

first of all, thanks a lot for this informative article series! As chance would have it, we've recently implemented tags in our document management system Pepa using the traditional approach with a tags relation and a join table as we are not yet at the point of optimizing performance and this is the most flexible approach. However, we've come up with a different solution for querying for multiple tags using something like GROUP BY doc_id HAVING array_agg(tag) @> array['thrift shop', 'blogging']. My guess is that it doesn't perform terribly well either but at least it makes the query much more pleasant to write (or generate) than having to join with the tags table again for every queried tag. Should you continue with your benchmarks I'd be curious how it compares, though!

Cheers
Moritz
ReplyDelete
Replies
UnknownFebruary 6, 2015 at 8:09 AM
This series is great. Thanks for taking the time. I'm implementing tags for a high-traffic production API built on Postgres 9.4, and you've convinced me to store tags in an array in the document table, together with triggers to update tag counts in a separate table (along with other info about each tag).

I had previously been planning to use the standard junction table method since I have other data associated with each tag.

One question: You mention repeatedly that JSON arrays stored in JSONB are order-agnostic and automatically de-duplicate. But isn't that only true of JSON objects/hashes, not arrays?

Array preserve order and duplicated. For example:

postgres=> select '["a", "b"]'::jsonb = '["b", "a"]'::jsonb;
?column?
----------
f
(1 row)

postgres=> SELECT '["a", "b", "a"]'::JSONB;
jsonb
-----------------
["a", "b", "a"]
(1 row)
ReplyDelete
Replies
AnonymousFebruary 6, 2015 at 3:02 PM
So questions: Why are you introducing the same table multiple times in your tag-id query?

In other words, what does your existing query:

select di1.doc_id
from doc_tags_id di1
join doc_tags_id di2
on di1.doc_id = di2.doc_id
join tags tags1
on di1.tag_id = tags1.tag_id
join tags tags2
on di2.tag_id = tags2.tag_id
where tags1.tag = 'thrift shop'
and tags2.tag = 'blogging'
order by di1.doc_id limit 25;

Accomplish over this new query that I would propose, which only introduces each table once?

select dt.doc_id
from doc_tags_id dt
join tags t on (dt.tag_id = t.tag_id)
where t.tag in ('blogging', 'thrift shop')
order by 1 limit 25;

Maybe I'm mistaken, but it looks like the query is being made less efficient than it could be, so naturally it would have a worse run time...
ReplyDelete
Replies
Eamon NerbonneFebruary 8, 2015 at 1:01 AM
The standard join table might fare a lot better if the lookup-by-tag index had included the doc_id.
ReplyDelete
Replies
Tim UckunFebruary 16, 2015 at 2:48 AM
Did you do any testing when using the tag itself as the primary key in your tags table? Using a natural pkey would probably speed things up as you can avoid a join when querying the tags no?
ReplyDelete
Replies
UnknownFebruary 21, 2015 at 4:09 AM
Josh, thank you for this article! I've been experimenting with json and text array columns like you are showing here in Postgres 9.3. I’m loving it. I look forward to your findings on jsonb in the future. I really hope Postgres continues to create these new column types.

The project I'm currently building uses Postgres 9.3 and from a coder and sysadmin PoV it has been wonderful. I’m definitely a Postgres head going forward. Big thanks to the Postgres team, I'm looking forward to using 9.4
ReplyDelete
Replies
AnonymousMarch 12, 2015 at 10:02 AM
I assume that a known limitation of not doing the join is that you cannot rename a tag? Or at least to rename it you would have to update all the rows that use it.

I have that requirement so I am thinking of using the array approach, but still have a separate tag_lookup table that I would never join against, but rather query that table first to get the tag IDs from the tag names. Then query against the tags column using the tag IDs. Do you see any issues with this approach?
ReplyDelete
Replies
DrasillMarch 25, 2015 at 3:30 PM
Thank you for your great post.
I did obtain the same results when benchmarking some time ago (postgreSQL 9.2, so only "normalized" versus arrays).

But I obtained better performances yet (IIRC) with the intarray extension.

It means there's a need for a "tags" table, and even some code at the app level to construct the query, but it lets you create fast and complex queries as :

select * from doc where tags @@ '1&(2|3)'

to search for tags 1 and (2 or 3).

And it's really small in the DB !

I would love to see it added in your benchmark.
ReplyDelete
Replies
IlyaNovember 3, 2015 at 1:01 AM
Hi,
I think your benchmark doesn't cover one quite common case that if usually very heavy on SQL.
Get all documents tagged by X or Y or Z.
I'll give real life example: you build portfolio of 10 stocks and you want to see feed of all content related to your stocks.
We're currently using Sphinx (full text search engine) for this purpose and it works surprisingly fast when searching by less than 30 tags.
I've done a bit of benchmark vs GIN index on int array with intersection operator and it also works fast, but slower than sphinx.

Do you maybe know a better solution in this case, or ways to tweak GIN index to optimize it for intersection operations?
ReplyDelete
Replies

Add comment