Friday, January 30, 2015

Tag All The Things, part 3

Continued from Part 2.

The next test is two tags combined.  This is where the alternative approaches really pull ahead of the traditional tagging approaches.

For example, here's the text tag query:

    select doc_id
    from doc_tags_text dt1
        join doc_tags_text dt2
        using (doc_id)
    where dt1.tag = 'math'
        and dt2.tag = 'physics'
    order by doc_id limit 25;


The query for tag IDs is even worse:

    select di1.doc_id
    from doc_tags_id di1
        join doc_tags_id di2
        on di1.doc_id = di2.doc_id
        join tags tags1
        on di1.tag_id = tags1.tag_id
        join tags tags2
        on di2.tag_id = tags2.tag_id
    where tags1.tag = 'thrift shop'
        and tags2.tag = 'blogging'
    order by di1.doc_id limit 25;


Imagine how either of these would look for three tags, or four.   Now compare that with the JSONB and array queries:

    select doc_id
    from doc_tags_array
    where tags @> array['math','physics']
    order by doc_id limit 25;

    with find_docs as (
      select doc_id
      from doc_tags_json
      where tags @> '[ "thrift shop", "blogging" ]'
    )
    select * from find_docs
    order by doc_id limit 25;


(the dodge with the WITH clause is to force use of the JSONB index, per Part 2)

Big difference, eh?  It can probably be taken as a given that if you need to do searches which involve combining two or more tags, you really want to use a GIN indexed approach just for code maintainability.  Just in case, though, let's look at performance, both for combining two common tags ("math" and "physics", for 957 hits), and two rare tags ("thrift shop" and "blogging", which only have 5 hits).  The differences in performance in the approaches were so extreme, I have to use a logarithmic scale for this graph, and am providing the raw numbers:




That's a huge difference.  The "JSONB fixed" numbers are for a query where I force the planner to use the JSONB index instead of letting it choose its own path (using the doc_id index).  As you can see, two-tag search via GIN index is an order of magnitude faster for common tags, and three orders of magnitude faster for rare tags.  And the ID approach completely bombs for two-tag search.

For our final test, we'll build a tag cloud from scratch.  This involves pulling counts of all distinct tags and then taking the top 100.  In a real production environment, you wouldn't do things this way; you'd maintain counts by trigger, or use HyperLogLog, or something similar.  But it makes a good test of mass index access and/or table scans for the various approaches.

Here the JSONB and array queries get annoying.  Examples:

    select count(*) as tag_count, tag
    from doc_tags_json
        join tags on doc_tags_json.tags @> to_json(tags.tag)::jsonb
    group by tag
    order by tag_count desc limit 100;

    select count(*) as tag_count, tag
    from doc_tags_array
        join tags on doc_tags_array.tags @> array[tags.tag::text]
    group by tag
    order by tag_count desc limit 100;


As I mentioned at the beginning of this series, arrays don't automatically prevent duplicate entries.  If I don't care that much about accuracy (and generally for tag clouds one doesn't) I can use a faster query with UNNEST:

    select count(*) as tag_count, ut.tag
    from doc_tags_array,
       lateral unnest(doc_tags_array.tags) as ut(tag)
    group by ut.tag
    order by tag_count desc limit 100;




(the lateral unnest is a 9.4 feature, and darned useful).  Let's look at execution times:



So, here you can see that the array and JSONB approaches lose; they simply can't be as fast as a plain text column for building a count.  This does mean that if your application spends a lot of its time dynamically building tag clouds, arrays might not be the way to go and JSONB certainly isn't.

Conclusion: the overall winner is an array of text, with a GIN index.  This is better for one-tag searches, worlds faster for two-tag searches, and competitive at other tasks.  It's also the smallest representation, and becomes smaller and faster still if you actually put the array of tags in the documents table.  Still, there are times that you would want to use the traditional child table with plain text tags: if you build tag clouds a lot or if you never search for two tags and your ORM can't deal with Postgres arrays.

Sadly, I have to conclude that my personal favorite, JSONB, isn't quite ready for this use.  See discussions on the pgsql-performance list regarding how to estimate selectivity for JSONB contains.

I realize that I did not test comparative write speeds.  Maybe next time; I've already torn down the AWS instance.

Thursday, January 29, 2015

Tag all the Things, part 2

continued from part 1.

So the real test of the different tagging data structures and types is how they perform.  The tests below were run on a 10 million document database, on an AWS m3.large virtual server.  Importantly, the total database size is around 7GB on a machine with 3.75GB RAM, so I've deliberately sized things so that none of the tables will fit into shared_buffers.

So our first check was for size, including indexes.  All other things being equal, a smaller table and indexes is faster to query, and takes up less RAM you'd like to use to cache other data.  So smaller is better. 

    tagtest=# select relname,
    pg_size_pretty(pg_total_relation_size(relid)) 
    from pg_stat_user_tables;
        relname     | pg_size_pretty
    ----------------+----------------
    tags            | 96 kB
    doc_tags_id     | 1551 MB
    doc_tags_text   | 2106 MB
    doc_tags_json   | 1285 MB
    doc_tags_array  | 1036 MB



How about a graph of that?


The text array is a clear winner here, half the size of the default text tag option.  Both of the advanced data types are smaller than any of the "normalized" versions.  This is largely because of per-row overhead, which is 24 bytes per row, and really adds up.

So for our first exercise, we'll want to compare the amount of time required to retrieve all of the tags for a single document, which is probably the most common single operation.   In this case I tested both documents with 1 tag (75% of all documents) and documents with 9 tags (< 1% of documents). 



As you can see, these are all pretty close, except for the IDs method.  That's because the doc_tags_id table needs to join against the tags table to look up any tag; that's a major drawback of the surrogate key approach.  We'll see that penalty in other tests.  Also, notice that both the IDs and the text approach take more time the more tags a document has, whereas the JSONB and array data take constant time, because they are each looking up one row regardless.

The next test is to simulate looking up paginated search results when searching by tag.  This comes in three parts: first, we search for a common tag and retrieve the first "page" of 25 results.  Then we search for the 11th page of 25 results, since sometimes high offsets can trigger very different query plans, and are a common issue in searches.    The queries for JSONB and arrays use the "contains" operator to do a GIN index search, and look like this:

    select doc_id
    from doc_tags_json
    where tags @> '["beta"]'::jsonb
    order by doc_id limit 25 offset 0;


    select doc_id
    from doc_tags_array
    where tags @> ARRAY['math']
    order by doc_id limit 25 offset 0;




This shows us more or less what we'd expect.  The ID approach is very slow because of filtering via a JOIN is an order of magnitude slower than filtering in-table.  Surprisingly, the JSONB and array approaches are slightly faster than TEXT for the 11th page test, even though they are (as expected) slower for the first page test.  I'm not sure why that is, but it stayed that way across multiple test runs.

Then we search for an uncommon tag, where to get 25 results requires checking an entire index or the base table.




Whoa!  What's going on here?  What happened to JSONB?  We expected IDs to be slow, but not JSONB.  Let's check the query plan for the uncommon tag search using explain.depesz.  See here:

    >  Index Scan using doc_tags_json_doc_id_idx on doc_tags_json

Why is it using the doc_id index when the JSONB index is available?  Well, when we implemented JSONB for 9.4, one of the unimplemented features was having a stats model for containment searches.  As we've done with new data types in the past, we simply use "contsel" as our selectivity estimate, which means that all "contains" queries on JSONB fields estimate that 1% of rows will be returned.  This causes the planner to mis-estimate that it can get the first 25 rows faster by searching the doc_id index and applying the JSONB condition as a filter. 

I can force the planner to use the JSONB index for rare tags, but then it will use that index for common tags as well.  That's bad; the doc_id index is actually much faster for common tags.  So I can have fast searches for rare tags, or fast searches for common tags, but not both.  Looks like that knocks JSONB out of the running.

Continued in Part 3.


pgCon Talk Review

It's that time again, I'm reviewing talks for pgCon.  If you didn't submit already ... well, it's too late.  Unless you want to do a workshop/tutorial, since it looks like we will have more workshop slots this year; if so, contact me.

The good news: submissions from people we don't normally see behind the podium at pgCon (Asians, South Americans, women, etc.) are up, so hopefully this will help make it a more diverse pgCon this year.

Now, the bad news: I am seeing the same issue I saw last year, which is long-time PostgreSQL community members submitting talks with a one or two line description.  Yes, we know you're a great person and know lots about Postgres.  But if we don't know what you're really going to talk about, we can't take it anyway, especially over other submitters who have carefully prepared a detailed description and thus shown that they plan to put time and effort into their talks.  If this is you, you can still log in and improve the description of your talk.

Wednesday, January 28, 2015

Tag All The Things

Many web applications are designed with a "tags" feature, where users can add arbitrary string "tags" to  each document (or post or picture or event or whatever).  Like other user-extensible data, tags provide special challenges for SQL/relational databases.  They can be awkward to write queries for, and at the high end perform really poorly.  PostgreSQL 9.4 offers some data structures to take the awkwardness out of managing and quering tags ... but which one is best, and how do they perform?  I'll answer that over a few blog posts.

To test various ways of tagging, I created a test server on AWS (an m3.large) and put PostgreSQL on local storage to eliminate latency as a consideration.  I then took a population of tags from two real databases: a document management application and a recipe database.  This gave me a population of 165 unique tags to work with.

The simplest way to store tags is just the basic text tags table:

    table doc_tags_text (
        doc_id int not null references documents(doc_id),
        tag text not null
    )
    unique index doc_tags_text_doc_id_tag on (doc_id, tag)
    index doc_tags_text_tag on (tag)


This has the advantage of being simple, direct, and fast for some queries.  I also created the "normalized" version of the above, which is equally common:

    table tags (
        tag_id serial not null primary key,
        tag text not null unique
    )

    table doc_tags_id (
        doc_id int not null references documents(doc_id),
        tag_id int not null references tags(tag_id)
    )
    unique index doc_tags_id_doc_id_tag_id on (doc_id, tag_id)
    index doc_tags_id_tag_id on (tag_id)


I put "normalized" in quotes because using a surrogate integer key doesn't actually make our schema more normal in any real sense.  Nor does it make it smaller or faster, as we will soon see.

Now for the advanced Postgres magic.  Of course, one of the first things I looked at was 9.4's JSONB.  Particularly, JSON arrays stored in JSONB seemed to have everything I was looking for: they can store arrays of strings or numbers, automatically deduplicate, and are order-independant.  As long as we're doing arrays, though, I should also use a standard TEXT array.  That has some disadvantages; sometimes it is order-dependant, and it doesn't automatically remove duplicates.  But PostgreSQL offers GIN indexing for all one-dimensional arrays, which makes that at least theoretically useful for tags.

    table doc_tags_json (
        doc_id int not null references documents(doc_id),
        tags jsonb
    )
    unique index doc_tags_id_doc_id on (doc_id)
    index doc_tags_id_tags using gin (tags)

    table doc_tags_array (
        doc_id int not null references documents(doc_id),
        tags text[] not null default '{}'
    )
    unique index doc_tags_id_doc_id on (doc_id)
    index doc_tags_id_tags using gin (tags)


I did consider also trying a TSVECTOR, which would seem obvious for a bag of text strings.  But there are some issues with using TSearch for tags: tags would be stemmed, and multiword tags would get broken up into individual terms.  This is not how most people expect tags to work.   While I could work around those issues, it just wasn't worth testing.

Then I populated them.  In real applications, tags distribution is heavily skewed.   Some tags show up on 15% of all documents while others are used on only one, and most documents have only one tag while a few have a dozen.  As such, I wrote a quick script to distribute the tags geometrically across the documents.  Thanks to Steve Atkins for the math on this, and I'll share the functions.  The first one returns a number between 1 and 14, heavily skewed towards 1, and the second a number between 1 and 150, again with tag #1 ("technology") 10,000 times as common as tag #150 ("Summer 2015 Summit").  I also randomized document and tag ordering so that none of the queries could take advantage of data more ordered than it would be in a real application.

    create or replace function geornd_10()
    returns int
    language sql
    as $f$
    select round(log(random()::numeric)/log(0.3))::INT + 1;
    $f$;

    create or replace function geornd_100()
    returns int
    language sql
    as $f$
    select round(log(random()::numeric)/log(0.9))::INT + 1;
    $f$;


To get a large enough population to show some performance differences, I created 10 million "documents" and tagged them all, assigning each of the four data structures the exact same document ids and tags.  Now, it's time to test.

Can you guess which one turned out to be the best all-around?

Continued in the next post.

Also, if anyone wants the tools I used to generate the test case, I can package them up, but it would be some work, so you gotta ask.

Sunday, January 4, 2015

NZPUG meetup in Wellington January 20th

Thanks to Derek Sievers of CDBaby, we've scheduled and found a venue for an NZPUG meetup in Wellington while I'm there.  The meetup will be from 6pm to 9pm on January 20th, and I will be presenting on PostgreSQL 9.4's new features.

I've created a wiki page with the details of all of the happenings in New Zealand during my Auckland/Wellington trip; please RSVP if you are attending so we can get a headcount.