Monday, October 7, 2013

Fancy SQL Monday: Finding broken keys

One of the things which usually happens on the rare occasions when you have PostgreSQL database corruption is that indexes become corrupted so that index entries no longer match the underlying table.  For regular indexes, this can be fixed with a REINDEX or pg_repack.  But it becomes a real problem when the index was used for a primary key, unique index or foreign key, because then duplicate or broken data gets inserted into the tables, and attempts to rebuild the index will fail.

Database-wide, the easiest way to find these broken indexes is to do a pg_dump and pg_restore of your whole database; data which violates keys will show up as errors on the pg_restore.

The second thing we need to do is find the duplicate or missing data so that we can "clean" it and then rebuild the index.  First, we'll need to disable index access to make sure that we don't hit the index and get erroneous information back:

SET enable_indexscan = off;
SET enable_bitmapscan = off;
SET enable_indexonlyscan = off;  -- 9.2+ only!

So for our first SQL trick, we want to query a table with a simple SERIAL surrogate key and find duplicate IDs.  Then, for each of these IDs, we want to get the entire referenced row so we can look at it and decide how to fix the data.  One way to do that is using a WITH clause and an aggregate:

WITH dups AS (
    SELECT ticketid, count(*) 

    FROM tickets 
    GROUP BY ticketid 
    HAVING count(*) > 1 )
SELECT * 

FROM tickets 
   JOIN dups USING ( ticketid )
ORDER BY ticketid;

And that gives us a list of duplicate ticketids.  From there, we can choose to delete rows, assign new IDs, or do something else, depending on the nature of the duplication.   Of course, if you never created any real keys (in addition to the surrogate key), you're going to have a devil of a time figuring things out.

However, the above query format isn't the only way we can search for duplicate IDs.  We can also use a windowing query, which is often easier to type if we have multicolumn unique constraints:

WITH dupcount AS (
   SELECT comments.*,
       count(*) OVER (
            PARTITION BY post, commentor, commented_at)
            AS entrycount 
   FROM comments )
SELECT * FROM dupcount
WHERE  entrycount > 1
ORDER BY post, commentor, commented_at;

Finding broken foreign keys is a bit more straightforwards.  In this case, you just want to use an outer join to find the missing parent records:

SELECT shoppingcart_item.*
FROM shoppingcart_item
   LEFT OUTER JOIN shoppingcart
   ON shoppingcart_item.shoppingcart_id 
          = shoppingcart.shoppingcart_id
WHERE shoppingcart.shoppingcart_id IS NULL
ORDER BY shoppingcart_item.shoppingcart_id;

Once you've found those, though, you can't do much more than dump them to a file and delete the records, since there's no way to find out what the missing parent record data was.

Happy de-corrupting!

6 comments:

  1. I think you missed out a line in the second SELECT statement ... Probably you meant something like this...

    WITH dupcount AS (
    SELECT comments.*,
    count(*) OVER (
    PARTITION BY post, commentor, commented_at)
    AS entrycount
    FROM comments)
    SELECT * FROM dupcount
    WHERE entrycount > 1
    ORDER BY post, commentor, commented_at;

    ReplyDelete
    Replies
    1. Oh, you are correct! Paste-O. Fixed, thanks.

      Delete
  2. Hi Josh! Great article! My question is: what is the reason of this corruptions? Is there any way to avoid that in the first place?

    ReplyDelete
    Replies
    1. Well, two of the three cases I had from last week were user-initiated. One allowed the DB server to run out of disk space and ignored the monitoring alerts, and then tried to free up disk space by deleting files. Another user remounted the SAN volume while Postgres was running. And a third just had RAID failure. So, monitor your hardware and don't do stupid things.

      Delete
    2. The opening paragraph does kinda imply that data corruption is a very common situation - I'm guessing that Josh didn't actually mean to give that impression. In my experience it is quite uncommon (funnily enough, I'm dealing with one such now)!

      Delete
    3. Oh, I see what you mean! I've changed two words to fix that.

      Delete