Wednesday, January 28, 2015

Tag All The Things

Many web applications are designed with a "tags" feature, where users can add arbitrary string "tags" to each document (or post or picture or event or whatever).  Like other user-extensible data, tags present special challenges for SQL/relational databases: they can be awkward to write queries for, and at the high end they perform really poorly.  PostgreSQL 9.4 offers some data structures to take the awkwardness out of managing and querying tags ... but which one is best, and how do they perform?  I'll answer that over a few blog posts.

To test various ways of tagging, I created a test server on AWS (an m3.large) and put PostgreSQL on local storage to eliminate latency as a consideration.  I then took a population of tags from two real databases: a document management application and a recipe database.  This gave me a population of 165 unique tags to work with.

The simplest way to store tags is just the basic text tags table:

    create table doc_tags_text (
        doc_id int not null references documents(doc_id),
        tag text not null
    );
    create unique index doc_tags_text_doc_id_tag on doc_tags_text (doc_id, tag);
    create index doc_tags_text_tag on doc_tags_text (tag);
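
For example, the two basic lookups are simple index scans (the doc id and tag value below are just illustrations from my test population):

    -- all tags on one document
    select tag from doc_tags_text where doc_id = 12345;

    -- all documents carrying a given tag
    select doc_id from doc_tags_text where tag = 'technology';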


This has the advantage of being simple, direct, and fast for some queries.  I also created the "normalized" version of the above, which is equally common:

    create table tags (
        tag_id serial not null primary key,
        tag text not null unique
    );

    create table doc_tags_id (
        doc_id int not null references documents(doc_id),
        tag_id int not null references tags(tag_id)
    );
    create unique index doc_tags_id_doc_id_tag_id on doc_tags_id (doc_id, tag_id);
    create index doc_tags_id_tag_id on doc_tags_id (tag_id);
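
Note that fetching documents by tag now requires joining through the tags table (a sketch; the tag value is illustrative):

    -- all documents carrying a given tag: one extra join
    select dt.doc_id
    from doc_tags_id dt
        join tags t on t.tag_id = dt.tag_id
    where t.tag = 'technology';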


I put "normalized" in quotes because using a surrogate integer key doesn't actually make our schema more normal in any real sense.  Nor does it make it smaller or faster, as we will soon see.

Now for the advanced Postgres magic.  Of course, one of the first things I looked at was 9.4's JSONB.  In particular, JSON arrays stored in JSONB seemed to have everything I was looking for: they can store arrays of strings or numbers, automatically deduplicate, and are order-independent.  As long as we're doing arrays, though, I should also test a standard TEXT array.  That has some disadvantages; sometimes it is order-dependent, and it doesn't automatically remove duplicates.  But PostgreSQL offers GIN indexing for all one-dimensional arrays, which makes that at least theoretically useful for tags.

    create table doc_tags_json (
        doc_id int not null references documents(doc_id),
        tags jsonb
    );
    create unique index doc_tags_json_doc_id on doc_tags_json (doc_id);
    create index doc_tags_json_tags on doc_tags_json using gin (tags);

    create table doc_tags_array (
        doc_id int not null references documents(doc_id),
        tags text[] not null default '{}'
    );
    create unique index doc_tags_array_doc_id on doc_tags_array (doc_id);
    create index doc_tags_array_tags on doc_tags_array using gin (tags);
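
Both of these support indexed containment searches with the @> operator, which is where GIN indexes shine (a sketch; the tag values are illustrative):

    -- documents with a given tag, JSONB version
    select doc_id from doc_tags_json
    where tags @> '["technology"]';

    -- documents with a given tag, text array version
    select doc_id from doc_tags_array
    where tags @> array['technology'];

    -- documents with all of several tags, in a single indexed search
    select doc_id from doc_tags_array
    where tags @> array['technology', 'recipes'];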


I also considered trying a TSVECTOR, which would seem obvious for a bag of text strings.  But there are some issues with using TSearch for tags: tags would be stemmed, and multiword tags would get broken up into individual terms.  This is not how most people expect tags to work.  While I could work around those issues, it just wasn't worth testing.
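
To see why, consider what the full-text parser does to a multiword tag (my own example, using the default english configuration):

    select to_tsvector('english', 'machine learning');
    -- result: 'learn':2 'machin':1

The tag gets split into two terms and each is stemmed, which is not what a tag search should match on.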

Then I populated them.  In real applications, tag distribution is heavily skewed.  Some tags show up on 15% of all documents while others are used on only one, and most documents have only one tag while a few have a dozen.  As such, I wrote a quick script to distribute the tags geometrically across the documents.  Thanks to Steve Atkins for the math on this; I'll share the functions.  The first one returns a number between 1 and 14, heavily skewed towards 1, and the second a number between 1 and 150, again with tag #1 ("technology") 10,000 times as common as tag #150 ("Summer 2015 Summit").  I also randomized document and tag ordering so that none of the queries could take advantage of data more ordered than it would be in a real application.

    -- geometrically distributed integer >= 1, heavily skewed
    -- towards 1: used for the number of tags per document
    create or replace function geornd_10()
    returns int
    language sql
    as $f$
    select round(log(random()::numeric)/log(0.3))::INT + 1;
    $f$;

    -- same idea with a gentler skew: used to pick which tag,
    -- by popularity rank
    create or replace function geornd_100()
    returns int
    language sql
    as $f$
    select round(log(random()::numeric)/log(0.9))::INT + 1;
    $f$;


To get a large enough population to show some performance differences, I created 10 million "documents" and tagged them all, assigning each of the four data structures the exact same document ids and tags.  Now, it's time to test.
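
The populate step itself isn't shown here, but in outline it looks something like the following sketch (my illustration, not the actual script; it assumes the tags table is numbered in popularity order, and least() caps the rank at the 165 available tags):

    -- for each document, pick geornd_10() tags, each chosen
    -- by popularity rank with geornd_100(); distinct drops
    -- any tag picked twice for the same document
    insert into doc_tags_text (doc_id, tag)
    select distinct d.doc_id, t.tag
    from documents d,
        lateral generate_series(1, geornd_10()) as g(n),
        lateral (select tag from tags
                 where tag_id = least(geornd_100(), 165)) as t(tag);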

Can you guess which one turned out to be the best all-around?

Continued in the next post.

Also, if anyone wants the tools I used to generate the test case, I can package them up, but it would be some work, so you gotta ask.

8 comments:

  1. Aagghhh why you keep me waiting? I can't handle the suspense!☺

  2. This comment has been removed by a blog administrator.

  3. You mention deduplication of JSONB arrays. I'm not exactly sure what you mean here: JSONB arrays do not remove duplicates. JSONB objects _do_ remove duplicate keys.

    Can you elaborate?

    Replies
    1. JSONB *does* remove duplicate keys.

    2. You two agree on keys, but he was making the point that duplicate JSONB Array *elements* are not removed. Did you use a JSONB Array to hold the tags or a JSONB Object? e.g. ['aaa', 'bbb', 'bbb'] vs. {'aaa':0, 'bbb':0}

  4. I wonder whether the index `index doc_tags_id_tag_id on (tag_id)` wouldn't have been better as a `unique index doc_tags_id_tag_id_doc_id on (tag_id, doc_id)`. This has a few advantages - first of all, it means that filtering by tag need not leave the index (the doc id is included in the index). Secondly, the document ids for a given tag are known unique, so the results need not be de-duplicated (in particular for pagination by row-offset this matters). Thirdly, the doc_ids are known to be ordered in the same order as the rows in the doc table itself, which allows a more efficient (ordered) join - though I'm not sure whether postgres implements that.

    Replies
    1. Please try it and publish your results.

    2. I don't use postgres regularly, nor do I have access to this test set, so that's not something I'm going to do. I'm positive MS sql server does benefit from such an index in scenarios such as yours; hence my suggestion.

      If the data and benchmark were sufficiently scripted, I could compare results on sql server if you want...
