Monday, May 11, 2015

Recursive cycle detection is recursive

In prior posts, I've gone over some methods to prevent cycles from being added to your database.  However, imagine that someone has handed you an existing adjacency list tree -- perhaps migrated from another DBMS -- and you need to find all of the cycles as part of data cleaning?  How do you do that?

One way, obviously, would be just explore all paths and flag the ones where any ID appeared twice:

    WITH RECURSIVE prev AS (
        SELECT folders.id, 1 AS depth, array[id] as seen, false as cycle
        FROM folders
        UNION ALL
        SELECT folders.id, prev.depth + 1, path || folders.id as seen,
            folders.id = any(seen) as cycle
        FROM prev
        INNER JOIN folders on prev.id = parent_id
    )
    SELECT *
    FROM prev;



However, the above has a serious issue: the query itself will cycle and never complete (in fact, it will error out). So we need to terminate each cycle when the first repeat happens.  Fortunately, that's easy to do, and we'll filter for only the cycles while we're at it:

    WITH RECURSIVE prev AS (
        SELECT folders.id, 1 AS depth, array[id] as seen, false as cycle
        FROM folders
        UNION ALL
        SELECT folders.id, prev.depth + 1, seen || folders.id as seen,
            folders.id = any(seen) as cycle
        FROM prev
        INNER JOIN folders on prev.id = parent_id
        AND prev.cycle = false
    )
    SELECT *
    FROM prev
    WHERE cycle = true;


The results of the above query look like this:

    id | depth |       seen       | cycle
   ----+-------+------------------+-------
    21 |     2 | {21,21}          | t
    13 |     5 | {13,14,15,11,13} | t
    14 |     5 | {14,15,11,13,14} | t
    15 |     5 | {15,11,13,14,15} | t
    11 |     5 | {11,13,14,15,11} | t
    (5 rows)


One thing to notice is that you'll get a row for every node in a cycle loop.  That's because with cycles, the choice of starting point is arbitrary.  So the above query isn't the best choice for a deep tree where you have a lot of existing cycles.  There's a 2nd way to detect cycles in a tree; lets see if any of the commentors can post that query.



9 comments:

  1. Schema is ...?
    CREATE TABLE folders (id int, parent_id int);
    insert into folders values (1,null),(2,1),(3,1),(4,3),(5,2),(6,1),(7,1),(8,7),(9,7),(10,7),(11,15),(12,2),(13,11),(14,13),(15,14),(21,21);
    (with some extra data just to keep things interesting)

    The one adjacency table I personally maintain has the feature that roots of trees are defined as nodes with parent_id is null, so my particular queries tend to not find cycles, since they've been 'pruned off' their parent trees, and I always start scanning at root. Running your query against my data, I'm clean, though (whew!)

    ReplyDelete
    Replies
    1. The main reason cycles occur in that kind of tree is that a former "grandparent" gets updated to make it a branch under an existing tree, without noticing that one of its grandchildren is already a parent of the same tree. Hence the detection query.

      I'm disappointed that nobody posted the other cycle detection query, though.

      Delete
  2. My attempt:

    ```
    with recursive
    -- first of all will try to find all nodes that are not
    -- in cycle
    not_in_cycle as (
    -- obviously nodes whose parent is null are good to
    -- us
    select f.id as node
    from folders as f
    where f.parent_id is null
    -- we would like to find other nodes too
    union all
    -- what constitutes a good node? Well node whose parent
    -- does not belong to cycle can not be in cycle
    select f.id as node
    from folders as f
    inner join not_in_cycle as c
    on f.parent_id = c.node
    )
    -- now that we know which nodes are not in cycle it is easy
    -- to find all that are
    , in_cycle as (
    -- basically all nodes what are not in relation not_in_cycle
    -- will do
    select f.id as node
    from folders as f
    except
    select n.node
    from not_in_cycle as n
    )
    -- its nice to know if we have cycle, but what if we need
    -- to know how many? Lets try to find all reachable nodes
    -- from any given one
    , cycle_paths as (
    select
    i.node, -- will start from any node
    f.parent_id as reached, -- we can reach its parent
    -- head? Well we will use this as a cycle identification
    -- later on
    greatest(i.node, f.parent_id) as head
    from in_cycle as i
    inner join folders f on i.node = f.id
    -- what else we can reach?
    union all
    -- parents of reached node - duh.
    select
    cp.node,
    f.parent_id as reached,
    greatest(cp.head, f.parent_id) as head
    from cycle_paths as cp
    inner join folders as f on f.id = cp.reached
    -- we would like end this recursive call
    where cp.node != f.parent_id
    )
    -- what was way to much of data - because we started from all
    -- possible nodes we managed to repeat some cycles multiple times
    -- :/ - lets fix that
    , unique_cycles as (
    -- but how? In cycle we can start from node which has the
    -- largest id - it uniquely identifies the cycle
    select cp.head as cycle_id, cp.reached
    from cycle_paths as cp
    where cp.head = cp.node
    -- also add head to this list (note that loops will be already
    -- counted in)
    union
    select cp.head as cycle_id, cp.head as reached
    from cycle_paths as cp
    where cp.head = cp.node
    )
    -- but tuples are sometimes annoying so give user nice arrays
    , cycles as (
    select distinct i.* from (
    select
    array_agg(ucp.reached) over (partition by ucp.cycle_id) as cycle
    from unique_cycles as ucp
    order by ucp.cycle_id
    ) as i
    )
    select * from cycles;
    ```

    ReplyDelete
  3. More readable version http://pastie.org/10189263 (without window functions)

    ReplyDelete
    Replies
    1. Wow, nice!

      Yeah, that's what I was looking for, but your version is way fancier. If I see you at a Postgres conference, you deserve a prize!

      Delete
    2. Thanks ;) But my code seems to have a bug. It correctly reports whatever graph contains cycle or not, but it fails to print it correctly. If we have a graph of form 14->13->15->11->13 it will report whole thing as a cycle. That is we need to prune graph both from top and from bottom (or do what you did with arrays). And currently I am thinking that such pruning is impossible.

      Oh well, at least its easy to fix it so that it would not go infinitive loop on such occasions http://pastie.org/10189441 :)

      Delete
    3. Ok, my final solution - http://pastie.org/10189969 (probably should stop spamming this post :)). Basically instead of pruning from bottom abused properties of a union to help find repeated elements on a path. Tried against a bunch of random tests and it seems to work practically (it may hang if ids are repeated, though).

      Delete