Thursday, March 29, 2012

Postgres Update Release, Data Loss, and a Million Monkeys

One of the major bugfixes in the most recent PostgreSQL update release (9.1.3 et al.) is one of those annoying timing-based issues which you'd expect to be very rare in production.  And it is rare, as measured against the total number of inserts you do in a database.  Unfortunately, it's also a "million monkeys" issue, which means that it's not that rare as measured by the number of databases in which it occurs.  For that reason, I'd urge all users to update their PostgreSQL releases as soon as they can.  Let me explain.

The issue, as I understand it, was that autovacuum has been (since 8.3.0) re-using a pointer for multiple passes over the same leaf page while pruning dead nodes in b-tree indexes.  If a concurrent process inserted a new index pointer into dead space on that same leaf page at just the wrong moment, autovacuum could cause the new index pointer to be flagged dead and removed.  As a result, a row would exist in the table but not in the index -- a form of mild data loss or data corruption.

The above depends on split-millisecond timing to occur, which is how we were able to miss the issue for so long.  The odds of it happening for any particular insert, even on a busy system, were minuscule, maybe 1 in a million.  Further, it required frequent inserts into a table which also had heavy updates and deletes.  Not much to worry about overall, yes?  Well, no.

This is where the million monkeys come in.  Just as the apocryphal mega-primate literary club will, eventually, through random typewriter-key-mashing, produce War and Peace, with enough database activity your chances of hitting this bug move from "highly improbable" to "nearly certain".  Every high-traffic production database has at least one table which receives heavy inserts, updates, and deletes (think queue table), and over a year or so it might see a couple billion writes and a million autovacuum runs.  Suddenly "one in a million" timing becomes 99.5% probability.
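(If you want the back-of-the-envelope math behind that claim, treating the 1-in-a-million figure as purely illustrative: if each qualifying insert has an independent chance p of hitting the race, then over n such inserts the probability of hitting it at least once is

    P = 1 - (1 - p)^n

With p = 0.000001, n only has to reach roughly 5.3 million qualifying inserts before P hits about 99.5%, and a couple billion writes puts you far beyond that.)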

Indeed, on examination, we've found this kind of index corruption on four production customer databases.  And we haven't checked nearly all of them.

Obviously this means you should update your production databases as soon as you can (downtime this weekend, perhaps?).  Folks have also asked me about how to check for this form of index corruption.  It's actually quite simple:

First, do a count of the rows in the table by full table scan:

staffos=# SELECT count(*) FROM candidates;
 count
-------
  1449
(1 row)

Then, force an index scan, and do a count(*) using a condition you know to be universally true for an indexed column in the table:

staffos=# set enable_seqscan = false;
SET
staffos=# select count(*) from candidates where id > 0;
 count
-------
  1448
(1 row)
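You have to run that second count once for each index on the table, with a condition on whichever column that index covers -- again, one you know is true for every row (a NOT NULL column makes this easy).  Purely as a sketch, if the table also had an index on a hypothetical non-null email column, the extra check would look something like:

select count(*) from candidates where email >= '';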

Results like the above, where the index scan comes up short, indicate that you've encountered the index corruption bug.  Note that you need to test each index this way to be sure.  You fix a corrupted index by reindexing:

staffos=# REINDEX INDEX candidates_pkey;
REINDEX
staffos=# select count(*) from candidates where id > 0;
 count
-------
  1449
(1 row)
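A couple of closing notes, based on general REINDEX behavior rather than anything specific to this bug: remember to turn sequential scans back on once you're done checking, and if you'd rather not hunt down indexes one at a time, REINDEX will also rebuild every index on a table (or an entire database) in one command:

staffos=# RESET enable_seqscan;
RESET
staffos=# REINDEX TABLE candidates;
REINDEX

From the shell, the reindexdb utility that ships with PostgreSQL does the same for a whole database (something like "reindexdb staffos").  Just keep in mind that REINDEX locks out writes to the table while it runs, so it's best done in the same maintenance window as the update itself.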

Hopefully this helps with the bug.  Our apologies for not catching this issue earlier, but in our defense, PostgreSQL's record on subtle bugs like this is much better than that of competing databases.  This does show you why it's vital to stay up to date with the most recent update releases, though!

9 comments:

  1. Replies
    1. No. Per the explanation above, it affects all currently maintained versions, back to 8.3. So you should apply the most recent patch release from PostgreSQL.org.

  2. Has anyone written an automated tool yet that checks all indexes? This would take forever to do by hand.

    PS: Your new blog design is awful. Give me back my Web 1.0! :)

  3. The check you mention above will not work unless you either lock the tables or go into SERIALIZABLE mode, no?

  4. Greg,

    Yes, that pretty much presumes that user access has been shut down.

  5. Is 1 in a million rhetorical or a semi-sane likelihood?

    Replies
    1. It's rhetorical. The actual probability very much depends on your use-case. However, hitting the issue depends on extremely fine (as in sub-millisecond) timing, so it would never be *likely*.

    2. Actually, to give you a concrete example: we tested reproducing the issue on a test database. It took an average of 8 days of fast-as-we-can inserts/updates/deletes in a queueing table (around 12 million of them) to reproduce the issue.

  6. Is there a bug number that provides more details and where this is fixed?
    Thanks.
