Database Soup: Tree Join Tables: preventing cycles

Thursday, February 12, 2015

Searching Google, I was surprised to find few published solutions for a common issue: preventing users from creating a cycle in a self-join table. So here's one solution, which will be "good enough" for most people, but has some caveats (see below).

First, the setup: we have a table of items. Items can be in one or more collections. Each item can itself be a collection, allowing users to create collections of collections. So the first thing we need is a self-join table on the "adjacency list" model:

    create table collections (
        collection_id int not null references items(id) on delete cascade,
        item_id int not null references items(id) on delete cascade,
        constraint collections_pk primary key ( collection_id, item_id )
    );
    create index on collections(item_id);

So the first part of preventing cycles is to prevent the simplest cycle, where a collection collects itself. That can be done with a constraint:

    alter table collections add constraint
    no_self_join check ( collection_id <> item_id );

Now comes the tough part: preventing cycles of two, three, or N collections in a chain. This requires us to look down a chain of possible collections and make sure that each inserted tuple doesn't complete a loop. Fortunately, WITH RECURSIVE works for this, provided we do it in a BEFORE trigger. If we did it in an AFTER trigger, the trigger itself would cycle, which would be no good.

    CREATE OR REPLACE FUNCTION collections_prevent_cycle ()
    returns trigger
    language plpgsql
    as $f$
    BEGIN
        -- select recursively, looking for all child items of the new collection
        -- and making sure that they don't include the new collection
        IF EXISTS ( WITH recursive colitem as (
                select collection_id, item_id
                from collections
                where collection_id = NEW.item_id
                UNION ALL
                select colitem.collection_id, collections.item_id
                from collections
                join colitem on colitem.item_id = collections.collection_id
            )
            SELECT collection_id from colitem
            WHERE item_id = NEW.collection_id
            LIMIT 1 ) THEN
                RAISE EXCEPTION 'You may not create a cycle of collections.';
        END IF;

        RETURN NEW;
    END; $f$;

    CREATE TRIGGER collections_prevent_cycle
    BEFORE INSERT OR UPDATE ON collections
    FOR EACH ROW EXECUTE PROCEDURE collections_prevent_cycle();
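
To see what the recursive query actually checks, it can help to run it outside a trigger. The following is a minimal sketch in Python using the standard-library sqlite3 module rather than PostgreSQL, so the dialect differs slightly and there is no plpgsql. The helper names `would_create_cycle` and `add_to_collection` are made up for illustration, and the items table is left out.

```python
import sqlite3

# The recursive check from the trigger, run as a standalone query.
# SQLite dialect; the first "?" stands in for NEW.item_id and the
# second for NEW.collection_id.
CYCLE_CHECK = """
WITH RECURSIVE colitem(collection_id, item_id) AS (
    SELECT collection_id, item_id
    FROM collections
    WHERE collection_id = ?
    UNION ALL
    SELECT colitem.collection_id, collections.item_id
    FROM collections
    JOIN colitem ON colitem.item_id = collections.collection_id
)
SELECT 1 FROM colitem WHERE item_id = ? LIMIT 1
"""


def would_create_cycle(conn, collection_id, item_id):
    """True if item_id already contains collection_id, directly or indirectly."""
    row = conn.execute(CYCLE_CHECK, (item_id, collection_id)).fetchone()
    return row is not None


def add_to_collection(conn, collection_id, item_id):
    """Hypothetical helper: run the check, then insert, like the BEFORE trigger."""
    if would_create_cycle(conn, collection_id, item_id):
        raise ValueError("You may not create a cycle of collections.")
    conn.execute("INSERT INTO collections VALUES (?, ?)", (collection_id, item_id))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collections ("
    " collection_id INT NOT NULL,"
    " item_id INT NOT NULL,"
    " PRIMARY KEY (collection_id, item_id),"
    " CHECK (collection_id <> item_id))"
)

add_to_collection(conn, 1, 2)  # 1 collects 2
add_to_collection(conn, 2, 3)  # 2 collects 3
try:
    add_to_collection(conn, 3, 1)  # 3 collects 1 would close the loop
    rejected = False
except ValueError:
    rejected = True
```

The walk starts at the item being added and follows its children downward; the insert is refused only if the prospective parent turns up somewhere below it.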

As I said, this solution will be "good enough" for a variety of uses. However, it has some defects:

Concurrency: It is vulnerable to concurrency failure. That is, if two users simultaneously insert "A collects B" and "B collects A", this trigger would not prevent it, because neither transaction can see the other's uncommitted row when it runs the check. The alternative is locking the entire table for each write, which is also problematic.

Cost: we're running a pretty expensive recursive query with every insert or update. For applications where the tree table is write-heavy, this will decrease throughput significantly.
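
The concurrency defect can be illustrated without a database. This is a simplified simulation in plain Python, not a model of PostgreSQL's actual transaction machinery: each "transaction" checks against a copy of the committed rows taken before the other one commits. The function names and the set-of-tuples representation are assumptions made for the sketch.

```python
def reaches(edges, start, target):
    """True if target is reachable from start along collection -> item edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(item for coll, item in edges if coll == node)
    return False


def check_passes(snapshot, collection_id, item_id):
    """The trigger's test: the new item must not already contain the collection."""
    return not reaches(snapshot, item_id, collection_id)


committed = set()  # rows visible to every transaction

# Both transactions run their check before either one commits, so each
# sees a snapshot without the other's row.
snapshot_1 = set(committed)
snapshot_2 = set(committed)
ok_1 = check_passes(snapshot_1, "A", "B")  # "A collects B"
ok_2 = check_passes(snapshot_2, "B", "A")  # "B collects A"

if ok_1:
    committed.add(("A", "B"))
if ok_2:
    committed.add(("B", "A"))

# Both checks passed, yet the committed rows now form a cycle.
cycle_exists = reaches(committed, "A", "B") and reaches(committed, "B", "A")

# Had the second check run after the first commit (for example, behind a
# table lock), it would have been rejected.
ok_2_serialized = check_passes({("A", "B")}, "B", "A")
```

Serializing the writers fixes the race in this toy model, which is the table-lock trade-off described above.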

So my challenge to you is this: come up with a better solution, one which solves either the concurrency or the cost problem without making the other problem worse.

P.S.: this blog has reached half a million views. Thanks, readers!

22 comments:

Anonymous, February 13, 2015 at 3:09 AM
For the concurrency issue: if you're not inserting many rows per transaction, would it be any use to grab transaction-level advisory locks on both of the new ids before the exists check? Still locking, not very elegant, and potentially more problematic, but does that hold anything over a full table lock?

Anonymous, February 13, 2015 at 1:08 PM
Short-sighted yes, I was thinking of a starting and ending id, but yes you'd have to traverse the chain and lock every id.

Unknown, February 14, 2015 at 2:43 AM
Well, you can always add a "cache" of all parents, grandparents and so on for every node.

Assuming the tree is not too deep, it shouldn't be a big problem (space-wise).

Of course this makes for some "fun" when adding/removing parent from a node, but it should be rather manageable.
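
One way to picture the ancestor-cache idea from this comment is the Python sketch below. It is only an in-memory illustration under assumed names (`AncestorCache`, `add`); in a database the cache would presumably be a second table kept in sync by triggers, and removing a parent, the "fun" part, is not handled here.

```python
class AncestorCache:
    """Hypothetical model: for every node, the set of all collections above it."""

    def __init__(self):
        self.ancestors = {}  # node -> set of direct and indirect parents

    def add(self, collection_id, item_id):
        above = self.ancestors.setdefault(collection_id, set())
        self.ancestors.setdefault(item_id, set())
        # The cycle check is a set lookup instead of a recursive walk.
        if item_id == collection_id or item_id in above:
            raise ValueError("You may not create a cycle of collections.")
        # The item, and everything already below it, gains the collection
        # and all of the collection's own ancestors.
        gained = above | {collection_id}
        for node, ancs in self.ancestors.items():
            if node == item_id or item_id in ancs:
                ancs |= gained


cache = AncestorCache()
cache.add(1, 2)  # 1 collects 2
cache.add(2, 3)  # 2 collects 3
cache.add(4, 2)  # 2 is also in collection 4
try:
    cache.add(3, 1)  # 3 is already below 1, so this is refused
    refused = False
except ValueError:
    refused = True
```

The check itself becomes cheap, but each insert now has to touch every node below the item, so the cost moves from the read side to the write side.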

Anonymous, February 14, 2015 at 2:50 AM
A solution I use, when it's appropriate, is to have a slightly different constraint. In your case it would be:

CHECK (collection_id > item_id);

This is efficient to enforce and describes a directed acyclic graph.

The downside is that placing an item/collection into an existing collection A can mean renumbering A so that its id is greater than those of its new children. This doesn't usually require much computational effort but is annoying if you're depending on having stable ids; for instance, when you use that id in another table without declaring the proper FOREIGN KEY!

(I have chosen here to order parents after their children. Doing the opposite can result in much more reordering, for example, take entries in order: A, b, c, d, E, with (b,c,d) in collection A. Placing A into collection E requires moving A to the end of the list -- i.e. giving it a higher id -- which in turn means moving b, c, and d.)

This scheme also ties into versioning quite nicely. When an entry is changed, a new tree of collections is built over it, reusing existing unchanged entries where it can.

(Apologies if this post is duplicated; it got lost the first time I submitted...)
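
The reason the ordering constraint in this comment rules out cycles is that ids strictly decrease along any chain from collection to item, so a chain can never return to where it started. A small Python sketch of that argument follows; the function names and the sample rows are invented for illustration, and the depth-first search is there only to cross-check the claim.

```python
def satisfies_order_check(edges):
    """Mirror of CHECK (collection_id > item_id) applied to every row."""
    return all(collection_id > item_id for collection_id, item_id in edges)


def has_cycle(edges):
    """Plain depth-first search for a cycle among collection -> item edges."""
    children = {}
    for coll, item in edges:
        children.setdefault(coll, []).append(item)
    done, on_path = set(), set()

    def visit(node):
        if node in on_path:
            return True  # came back to a node on the current chain
        if node in done:
            return False
        on_path.add(node)
        found = any(visit(child) for child in children.get(node, []))
        on_path.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(children))


ordered_rows = [(5, 1), (5, 2), (9, 5)]  # every parent id exceeds its child's
looped_rows = [(1, 2), (2, 3), (3, 1)]   # a three-step cycle

ordered_ok = satisfies_order_check(ordered_rows) and not has_cycle(ordered_rows)
loop_caught = has_cycle(looped_rows) and not satisfies_order_check(looped_rows)

# The renumbering downside: putting item 7 into collection 5 fails the
# check, so collection 5 would first need an id greater than 7.
needs_renumbering = not satisfies_order_check([(5, 7)])
```

Any set of rows that passes the check is acyclic, but the reverse does not hold: an acyclic arrangement such as the last one can still be refused until ids are reassigned.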