Database Soup: appdev

Showing posts with label appdev. Show all posts

Thursday, February 12, 2015

Tree Join Tables: preventing cycles

Searching Google, I was surprised to find that there were few solutions published for a common issue: preventing users from creating a cycle when you create a self-join table. So here's one solution, which will be "good enough" for most people, but has some caveats (see below).

First, the setup: we have a table of items. Items can be in one or more collections. Each item can itself be a collection, allowing users to create collections of collections. So the first thing we need is a self-join table on the "adjacency list" model:

    create table collections (
        collection_id int not null references items(id) on delete cascade,
        item_id int not null references items(id) on delete cascade,
        constraint collections_pk primary key ( collection_id, item_id )
    );
    create index on collections(item_id);

So the first part of preventing cycles is to prevent the simplest cycle, where a collection collects itself. That can be done with a constraint:

     alter table collections add constraint
     no_self_join check ( collection_id <> item_id )

Now comes the tough part, preventing cycles of more than one, two, or N collections in a chain. This requires us to look down a chain of possible collections and make sure that each inserted tuple doesn't complete a loop. Fortunately, WITH RECURSIVE works for this provided we do it in a BEFORE trigger. If we did it in an AFTER trigger, the trigger itself would cycle, which would be no good.

    CREATE OR REPLACE FUNCTION collections_prevent_cycle ()
    returns trigger
    language plpgsql
    as $f$
    BEGIN
        -- select recusively, looking for all child items of the new collection
        -- and making sure that they don't include the new collection
        IF EXISTS ( WITH recursive colitem as (
                select collection_id, item_id
                from collections
                where collection_id = NEW.item_id
                UNION ALL
                select colitem.collection_id, collections.item_id
                from collections
                join colitem on colitem.item_id = collections.collection_id
            )
            SELECT collection_id from colitem
            WHERE item_id = NEW.collection_id
            LIMIT 1 ) THEN
                RAISE EXCEPTION 'You may not create a cycle of collections.';
        END IF;

        RETURN NEW;
    END; $f$;

    CREATE TRIGGER collections_prevent_cycle
    BEFORE INSERT OR UPDATE ON collections
    FOR EACH ROW EXECUTE PROCEDURE collections_prevent_cycle();

As I said, this solution will be "good enough" for a variety of uses. However, it has some defects:

Concurrency: It is vulnerable to concurrency failure. That is, if two users simultaneously insert "A collects B" and "B collects A", this trigger would not prevent it. The alternative is locking the entire table on each commit, which is also problematic.

Cost: we're running a pretty expensive recursive query with every insert. For applications where the tree table is write-heavy, this will decrease throughput significantly.

So my, challenge to you is this: come up with a better solution for this, which solves either the concurrency or cost problem without making the other problem worse.

P.S.: this blog has reached half a million views. Thanks, readers!

Friday, January 30, 2015

That's a huge difference. The "JSONB fixed" numbers are for a query where I force the planner to use the JSONB index instead of letting it choose its own path (using the doc_id index). As you can see, two-tag search via GIN index is an order of magnitude faster for common tags, and three orders of magnitude faster for rare tags. And the ID approach completely bombs for two-tag search.

For our final test, we'll build a tag cloud from scratch. This involves pulling counts of all distinct tags and then taking the top 100. In a real production environment, you wouldn't do things this way; you'd maintain counts by trigger, or use HyperLogLog, or something similar. But it makes a good test of mass index access and/or table scans for the various approaches.

Here the JSONB and array queries get annoying. Examples:

    select count(*) as tag_count, tag
    from doc_tags_json
        join tags on doc_tags_json.tags @> to_json(tags.tag)::jsonb
    group by tag
    order by tag_count desc limit 100;

    select count(*) as tag_count, tag
    from doc_tags_array
        join tags on doc_tags_array.tags @> array[tags.tag::text]
    group by tag
    order by tag_count desc limit 100;

As I mentioned at the beginning of this series, arrays don't automatically prevent duplicate entries. If I don't care that much about accuracy (and generally for tag clouds one doesn't) I can use a faster query with UNNEST:

    select count(*) as tag_count, ut.tag
    from doc_tags_array,
       lateral unnest(doc_tags_array.tags) as ut(tag)
    group by ut.tag
    order by tag_count desc limit 100;

(the lateral unnest is a 9.4 feature, and darned useful). Let's look at execution times:

So, here you can see that the array and JSONB approaches lose; they simply can't be as fast as a plain text column for building a count. This does mean that if your application spends a lot of its time dynamically building tag clouds, arrays might not be the way to go and JSONB certainly isn't.

Conclusion: the overall winner is an array of text, with a GIN index. This is better for one-tag searches, worlds faster for two-tag searches, and competitive at other tasks. It's also the smallest representation, and becomes smaller and faster still if you actually put the array of tags in the documents table. Still, there are times that you would want to use the traditional child table with plain text tags: if you build tag clouds a lot or if you never search for two tags and your ORM can't deal with Postgres arrays.

Sadly, I have to conclude that my personal favorite, JSONB, isn't quite ready for this use. See discussions on the pgsql-performance list regarding how to estimate selectivity for JSONB contains.

I realize that I did not test comparative write speeds. Maybe next time; I've already torn down the AWS instance.

Thursday, January 29, 2015

Tag all the Things, part 2

continued from part 1.

So the real test of the different tagging data structures and types is how they perform. The tests below were run on a 10 million document database, on an AWS m3.large virtual server. Importantly, the total database size is around 7GB on a machine with 3.75GB RAM, so I've deliberately sized things so that none of the tables will fit into shared_buffers.

So our first check was for size, including indexes. All other things being equal, a smaller table and indexes is faster to query, and takes up less RAM you'd like to use to cache other data. So smaller is better.

    tagtest=# select relname,
    pg_size_pretty(pg_total_relation_size(relid))
    from pg_stat_user_tables;
        relname     | pg_size_pretty
    ----------------+----------------
    tags            | 96 kB
    doc_tags_id     | 1551 MB
    doc_tags_text   | 2106 MB
    doc_tags_json   | 1285 MB
    doc_tags_array | 1036 MB

How about a graph of that?

The text array is a clear winner here, half the size of the default text tag option. Both of the advanced data types are smaller than any of the "normalized" versions. This is largely because of per-row overhead, which is 24 bytes per row, and really adds up.

So for our first exercise, we'll want to compare the amount of time required to retrieve all of the tags for a single document, which is probably the most common single operation. In this case I tested both documents with 1 tag (75% of all documents) and documents with 9 tags (< 1% of documents).

As you can see, these are all pretty close, except for the IDs method. That's because the doc_tags_id table needs to join against the tags table to look up any tag; that's a major drawback of the surrogate key approach. We'll see that penalty in other tests. Also, notice that both the IDs and the text approach take more time the more tags a document has, whereas the JSONB and array data take constant time, because they are each looking up one row regardless.

The next test is to simulate looking up paginated search results when searching by tag. This comes in three parts: first, we search for a common tag and retrieve the first "page" of 25 results. Then we search for the 11th page of 25 results, since sometimes high offsets can trigger very different query plans, and are a common issue in searches.    The queries for JSONB and arrays use the "contains" operator to do a GIN index search, and look like this:

    select doc_id
    from doc_tags_json
    where tags @> '["beta"]'::jsonb
    order by doc_id limit 25 offset 0;

    select doc_id
    from doc_tags_array
    where tags @> ARRAY['math']
    order by doc_id limit 25 offset 0;

This shows us more or less what we'd expect. The ID approach is very slow because of filtering via a JOIN is an order of magnitude slower than filtering in-table. Surprisingly, the JSONB and array approaches are slightly faster than TEXT for the 11th page test, even though they are (as expected) slower for the first page test. I'm not sure why that is, but it stayed that way across multiple test runs.

Then we search for an uncommon tag, where to get 25 results requires checking an entire index or the base table.

Whoa! What's going on here? What happened to JSONB? We expected IDs to be slow, but not JSONB. Let's check the query plan for the uncommon tag search using explain.depesz. See here:

> Index Scan using doc_tags_json_doc_id_idx on doc_tags_json

Why is it using the doc_id index when the JSONB index is available? Well, when we implemented JSONB for 9.4, one of the unimplemented features was having a stats model for containment searches. As we've done with new data types in the past, we simply use "contsel" as our selectivity estimate, which means that all "contains" queries on JSONB fields estimate that 1% of rows will be returned. This causes the planner to mis-estimate that it can get the first 25 rows faster by searching the doc_id index and applying the JSONB condition as a filter.

I can force the planner to use the JSONB index for rare tags, but then it will use that index for common tags as well. That's bad; the doc_id index is actually much faster for common tags. So I can have fast searches for rare tags, or fast searches for common tags, but not both. Looks like that knocks JSONB out of the running.

Continued in Part 3.

Wednesday, January 28, 2015

Tag All The Things

Many web applications are designed with a "tags" feature, where users can add arbitrary string "tags" to each document (or post or picture or event or whatever). Like other user-extensible data, tags provide special challenges for SQL/relational databases. They can be awkward to write queries for, and at the high end perform really poorly. PostgreSQL 9.4 offers some data structures to take the awkwardness out of managing and quering tags ... but which one is best, and how do they perform? I'll answer that over a few blog posts.

To test various ways of tagging, I created a test server on AWS (an m3.large) and put PostgreSQL on local storage to eliminate latency as a consideration. I then took a population of tags from two real databases: a document management application and a recipe database. This gave me a population of 165 unique tags to work with.

The simplest way to store tags is just the basic text tags table:

    table doc_tags_text (
        doc_id int not null references documents(doc_id),
        tag text not null
    )
    unique index doc_tags_text_doc_id_tag on (doc_id, tag)
    index doc_tags_text_tag on (tag)

This has the advantage of being simple, direct, and fast for some queries. I also created the "normalized" version of the above, which is equally common:

    table tags (
        tag_id serial not null primary key,
        tag text not null unique
    )

    table doc_tags_id (
        doc_id int not null references documents(doc_id),
        tag_id int not null references tags(tag_id)
    )
    unique index doc_tags_id_doc_id_tag_id on (doc_id, tag_id)
    index doc_tags_id_tag_id on (tag_id)

I put "normalized" in quotes because using a surrogate integer key doesn't actually make our schema more normal in any real sense. Nor does it make it smaller or faster, as we will soon see.

Now for the advanced Postgres magic. Of course, one of the first things I looked at was 9.4's JSONB. Particularly, JSON arrays stored in JSONB seemed to have everything I was looking for: they can store arrays of strings or numbers, automatically deduplicate, and are order-independant. As long as we're doing arrays, though, I should also use a standard TEXT array. That has some disadvantages; sometimes it is order-dependant, and it doesn't automatically remove duplicates. But PostgreSQL offers GIN indexing for all one-dimensional arrays, which makes that at least theoretically useful for tags.

    table doc_tags_json (
        doc_id int not null references documents(doc_id),
        tags jsonb
    )
    unique index doc_tags_id_doc_id on (doc_id)
    index doc_tags_id_tags using gin (tags)

    table doc_tags_array (
        doc_id int not null references documents(doc_id),
        tags text[] not null default '{}'
    )
    unique index doc_tags_id_doc_id on (doc_id)
    index doc_tags_id_tags using gin (tags)

I did consider also trying a TSVECTOR, which would seem obvious for a bag of text strings. But there are some issues with using TSearch for tags: tags would be stemmed, and multiword tags would get broken up into individual terms. This is not how most people expect tags to work.   While I could work around those issues, it just wasn't worth testing.

Then I populated them. In real applications, tags distribution is heavily skewed.   Some tags show up on 15% of all documents while others are used on only one, and most documents have only one tag while a few have a dozen. As such, I wrote a quick script to distribute the tags geometrically across the documents. Thanks to Steve Atkins for the math on this, and I'll share the functions. The first one returns a number between 1 and 14, heavily skewed towards 1, and the second a number between 1 and 150, again with tag #1 ("technology") 10,000 times as common as tag #150 ("Summer 2015 Summit"). I also randomized document and tag ordering so that none of the queries could take advantage of data more ordered than it would be in a real application.

    create or replace function geornd_10()
    returns int
    language sql
    as $f$
    select round(log(random()::numeric)/log(0.3))::INT + 1;
    $f$;

    create or replace function geornd_100()
    returns int
    language sql
    as $f$
    select round(log(random()::numeric)/log(0.9))::INT + 1;
    $f$;

To get a large enough population to show some performance differences, I created 10 million "documents" and tagged them all, assigning each of the four data structures the exact same document ids and tags. Now, it's time to test.

Can you guess which one turned out to be the best all-around?

Continued in the next post.

Also, if anyone wants the tools I used to generate the test case, I can package them up, but it would be some work, so you gotta ask.

Tuesday, June 12, 2012

Video SFPUG June 13: Build Your Application Inside Postgres

Tommorrow night ( June 13th, 7:15PM PDT ) Leon Starr will be presenting "Building Your Application Inside PostgreSQL" for the San Francisco PostgreSQL User Group. As usual, this SFPUG will be broadcast live on video.

In Leon's words, the presentation will be about:

Leon Starr has built a full featured open source Executable UML model editor entirely in plpgsql. WHY? Leon presents the case for building a tightly constrained schema and letting the RDBMS do the heavy logic lifting. Code patterns will be presented where object and db principles are intertwined with an eye toward deriving the best of both worlds. Naturally, plpgsql has its limitations as does the presenter who is relatively new to the language. You are invited to poke fun at his novice coding techniques and weigh in with your experiences and thoughts regarding the appropriateness of stored procedures vs. conventional code.

We have a new broadcaster, JustinTV, who is also physically hosting the event. JustinTV runs on PostgreSQL so they're going to be our live streaming of choice for the forseeable future. I also have a new webcam, so hopefully video quality will be better than last time.