Thursday, December 5, 2013

Why you need to apply today's update real soon

So, another month, another PostgreSQL update.  This one is a lot more critical than most because it patches up to three data loss issues, depending on which version of PostgreSQL you're on and which features you're using.  While all users need to plan to apply this update sometime during the next couple of weeks, two kinds of users need to schedule an update downtime for this weekend, if at all possible:
  1. Users on version 9.3
  2. Users of binary replication
While all three data loss issues depend on timing, and are thus low-probability, all three are impossible to predict in advance and difficult to detect once they occur.  As such, they should be considered high-risk and fixed urgently.  Of the three issues, the replication issue affects 9.0, 9.1, 9.2 and 9.3; one of the VACUUM issues affects all versions; and the other affects only 9.3.
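
If you're not sure exactly which minor release a given server is running, one quick way to check from psql (standard commands, nothing specific to this update) is:

SHOW server_version;  -- the minor (last) number should bump after the update
SELECT version();     -- full build string, including platform details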

Annoyingly, you'll have to do some additional stuff after you update:
  • If using binary replication, you need to take a new base backup of each replica after updating.
  • You should run the following on each production database after updating:
SET vacuum_freeze_table_age = 0;
VACUUM; -- optionally, ANALYZE as well

This second step is critical for users on 9.3, and a generally good idea for users on other versions. Note that, while VACUUM is a non-blocking operation, it can create a lot of IO, so it's a good idea to do this during a slow traffic period and monitor it for pile-ups.
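
If you want something concrete to watch for those pile-ups, a query along these lines against pg_stat_activity shows sessions that are stuck waiting while the VACUUM runs (column names are as of 9.2 and 9.3; on 9.0 and 9.1 the equivalents are procpid and current_query, and there is no state column):

-- sessions currently waiting (usually on locks), and how long they've been running
SELECT pid, usename, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE waiting
ORDER BY query_start;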

More information about the replication issue is here.

So, how did this happen?  Well, all three issues were unexpected side effects of fixes we applied for other issues in earlier versions.  For example, the replication issue is the result of the interaction of two independent fixes for failover issues, both of which were themselves necessary.  Since all of these issues depend on both timing and heavy traffic, and their effects are subtle (as in, a few missing or extra rows), none of our existing testing was capable of catching them.

If anyone wants to devise destruction tests to detect subtle data corruption issues in PostgreSQL before we release code -- if anyone can -- please offer to help!

5 comments:

  1. Josh,

    A few years ago the Slony guys built a testing framework to help track down race conditions that were causing them grief. It might be worth reaching out to them to see if there is anything there that still exists and could be leveraged.

  2. My question about the replication bug concerns warm vs. hot standby.
    It isn't clearly stated whether the bug affects warm standby, but as far as I can see from the hackers discussion, it affects only hot standby.

    As I have very strict maintenance windows, for now I've rebuilt my failovers and I'm keeping them in warm standby.

    As all the failovers are on 9.2.6, and as far as I can see the issue is in the standby code, this should give me some peace of mind before the master's upgrade.

    Ta!
    Federico

    Replies
    1. Federico, make sure you have hot_standby=off in your postgresql.conf when restoring. hot_standby=on is what triggers the bug.

  3. > while VACUUM is a non-blocking operation, it can create a lot of IO, so it's a good idea to do this during a slow traffic period

    Note that you can also use vacuum_cost_delay (and other vacuum_cost_* settings) to throttle vacuum. Here's what I used:

    SET vacuum_cost_delay = 5;
    SET vacuum_freeze_table_age = 0;
    VACUUM VERBOSE;
