Thursday, March 29, 2012

Postgres Update Release, Data Loss, and a Million Monkeys

One of the major bugfixes in the most recent PostgreSQL update release (9.1.3 et al.) is one of those annoying timing-based issues which you'd expect to be very rare in production.  And it is rare, as measured against the total number of inserts you do in a database.  Unfortunately, it's also a "million monkeys" issue, which means that it's not that rare as measured by the number of databases in which it occurs.   For that reason, I'd urge all users to update their PostgreSQL releases as soon as they can.  Let me explain.

The issue, as I understand it, was that autovacuum had been (since 8.3.0) re-using a pointer for multiple passes over the same leaf page while pruning dead nodes in b-tree indexes.  If this occurred while a concurrent process was inserting a new index pointer into dead space on the same leaf page, autovacuum could cause that index pointer to be flagged dead and removed.  As a result, a row would exist in the table but not in the index -- a form of mild data loss or data corruption.

The above depends on split-millisecond timing to occur, which is how we were able to miss the issue for so long.  The odds of it happening for any particular insert, even on a busy system, were minuscule, maybe one in a million.  Further, it required frequent inserts into a table which also had heavy updates and deletes. Not much to worry about overall, yes?  Well, no.

This is where the million monkeys come in.  Just as the apocryphal mega-primate literary club will, eventually, due to random typewriter-key-mashing, produce War and Peace, with enough database activity your chances of hitting this bug move from "highly improbable" to "nearly certain".  Every high-traffic production database has at least one table which receives heavy inserts, updates, and deletes (think queue table), and given a year or so might have a couple billion writes and a million autovacuum runs.  Suddenly "one in a million" timing becomes 99.5% probability.
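The arithmetic behind that jump is just the complement rule: if each opportunity has probability p of hitting the race condition, then after n independent opportunities the chance of hitting it at least once is 1 - (1 - p)^n.  A quick sketch (the one-in-a-million odds and the trial counts are illustrative, not measured figures):

```python
# Chance of hitting a 1-in-a-million race condition at least once
# over n independent opportunities: P = 1 - (1 - p)**n
p = 1e-6  # illustrative per-event odds, not a measured figure

for n in (1_000, 1_000_000, 5_300_000):
    prob = 1 - (1 - p) ** n
    print(f"{n:>10,} events: {prob:.1%}")
# → prints 0.1%, 63.2%, 99.5%
```

Note that even a million opportunities already gives you roughly two-in-three odds; a few million pushes you past 99%.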

Indeed, on examination, we've found this kind of index corruption on four production customer databases.  And we haven't checked nearly all of them.

Obviously this means you should update your production databases as soon as you can (downtime this weekend, perhaps?).  Folks have also asked me about how to check for this form of index corruption.  It's actually quite simple:

First, do a count of the rows in the table by full table scan:

staffos=# SELECT count(*) FROM candidates;
 count
-------
  3446
(1 row)

Then, force an index scan, and do a count(*) using a condition you know to be universally true for an indexed column in the table:

staffos=# SET enable_seqscan = false;
SET
staffos=# SELECT count(*) FROM candidates WHERE id > 0;
 count
-------
  3443
(1 row)

If the two counts don't match, you've encountered the index corruption bug.  Note that you need to test each index on the table to be sure.  You fix the corruption by reindexing:

staffos=# REINDEX INDEX candidates_pkey;
REINDEX
staffos=# SELECT count(*) FROM candidates WHERE id > 0;
 count
-------
  3446
(1 row)

Hopefully this helps with the bug.  Our apologies for not catching this issue earlier, but in our defense, PostgreSQL's record on subtle bugs like this is much better than that of competing databases.  This does show you why it's vital to stay up to date with the most recent update releases, though!

Wednesday, March 28, 2012

GSOC applications are open

Reminder: Google Summer of Code is open for student applications.  If you're a student, or if you're a professor or otherwise connected with a university, make sure students you know are aware of the opportunity to get paid for hacking on PostgreSQL for the summer.

Of course, we're looking for core PostgreSQL hacking.  But we're also looking for cool tools, extensions, and drivers.  Wanna work on JSON support, an HTTP-SQL proxy, or an OSX-native GUI?  Apply to GSOC!

Tuesday, March 27, 2012

PostgreSQL needs a new load balancer

We do a lot of high-availability PostgreSQL deployments for clients.  As of 9.1, streaming replication (SR) is excellent for this, and can scale sufficiently to spread a client across an AWS node cluster with some simple tools to help manage it.  But where we're keenly feeling the lack is simple load balancing and failover.

We use pgPool a lot for this, and once you've set it up, it works.  But pgPool suffers from runaway sporkism: it's a load balancer and a failover tool and multimaster replication and data partitioning and a cache, and it's compatible with SR and Slony and Bucardo.  If you need all of those things, it's great, but if you only need one of them, you still get the complexity of all of them.  It also suffers from having been designed around the needs of specific SRA customers, and not really for general use.  We've been tinkering with it for a while, and I just don't see a way to "fix" pgPool for the general use case; we need something new, designed from scratch to be simple and limited to the 80% use case.

What we really need is a simple tool, or rather a pair of tools, designed to do only failover and read load-balancing, and only for PostgreSQL streaming replication.  These tools should be stackable in any combination the user wants, like pgBouncer (and, for that matter, with pgBouncer).  They should provide information via a web service, and be configurable via a web service.

I'll call the first tool "pgFailover".  The purpose of this tool is to manage a master-replica group of servers, including managing both planned and unplanned failovers.  You would only have one active pgFailover node within a group, in order to avoid "split-brain" issues.  It would not handle database connections at all.

pgFailover would track a master and several replicas.  The status of each server would be monitored by polling both the replication information on each replica and pg_stat_replication on the master.  pgFailover would publish this information to the network via a web service, and would accept commands via the same web service as well as on the local command line.

Based on user-configurable criteria, pgFailover would carry out any of the following operations: fail over to a new master; remaster the other replicas; add a new replica, with or without a data sync; resync a replica; or shut down a replica.  It would also handle some situations automatically.  In the event that user-configurable conditions of nonresponsiveness were met, it would fail over the master to the "best" replica, decided either by configuration or by which replica is most caught up according to replication timestamps.  Likewise, replicas would be dropped from the availability list if they stopped replicating for a certain period or became nonresponsive.
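The "best replica" choice above fits in a few lines.  The following is a hypothetical sketch of that policy only (prefer a configured candidate if it's healthy, otherwise promote the most caught-up healthy replica); no such pgFailover code actually exists yet, and all names here are invented:

```python
# Hypothetical sketch of pgFailover's promotion choice: prefer an
# explicitly configured failover target if it's healthy; otherwise
# promote the healthy replica that is most caught up (latest replayed
# commit timestamp, as reported by the replicas themselves).

def choose_new_master(replicas, preferred=None):
    """replicas: dict mapping name -> {'healthy': bool, 'replayed': float},
    where 'replayed' is the last-replayed commit timestamp (epoch seconds)."""
    healthy = {name: r for name, r in replicas.items() if r["healthy"]}
    if not healthy:
        return None  # nothing safe to promote; alert the operator instead
    if preferred in healthy:
        return preferred
    return max(healthy, key=lambda name: healthy[name]["replayed"])

replicas = {
    "replica1": {"healthy": True,  "replayed": 1332800000.0},
    "replica2": {"healthy": True,  "replayed": 1332800004.5},
    "replica3": {"healthy": False, "replayed": 1332800009.0},
}
print(choose_new_master(replicas))              # replica2: most caught-up healthy one
print(choose_new_master(replicas, "replica1"))  # replica1: configured preference wins
```

The important design point is that an unhealthy replica never wins, even if it is configured as the preferred target or has the newest timestamp.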

The second tool I'll call "pgBalancer", after the unreleased tool from Skype.  pgBalancer would just do load-balancing of database connections across the replicated servers.  It wouldn't deal with failover or monitoring the servers at all; instead, it would rely entirely on pgFailover for that.  This allows users to have several separate pgBalancer servers, supporting both high availability and complex load-balancing configurations.

Since automated separation of read and write queries is an impossible problem, we won't even try to solve it.  Instead, we'll rely on the idea that the application knows whether it's reading or writing, and provide two separate database connection ports: one for read-write (RW) connections, and one for read-only (RO) connections.  pgBalancer would obtain information on which servers to connect to either from a configuration file, or by querying a pgFailover server via its web service.  RO connections would be load-balanced across available servers, based on a simple "least active" plus "round-robin" algorithm.
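A "least active" plus "round-robin" pick could look something like this.  This is a hypothetical illustration of the RO-routing idea, assuming pgBalancer tracks an active-connection count per backend; it isn't code from any existing tool:

```python
from itertools import count

# Hypothetical sketch of pgBalancer's read-only routing: pick the server
# with the fewest active connections, breaking ties round-robin so that
# equally loaded servers share new connections evenly.

class ReadBalancer:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}  # active RO connections per server
        self._rr = count()                     # round-robin tie-breaker

    def acquire(self):
        least = min(self.active.values())
        candidates = [s for s in self.active if self.active[s] == least]
        server = candidates[next(self._rr) % len(candidates)]
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

lb = ReadBalancer(["replica1", "replica2"])
print(lb.acquire())  # replica1: round-robin among equally idle servers
print(lb.acquire())  # replica2
lb.release("replica2")
print(lb.acquire())  # replica2: now the least active
```

Connection counts alone are a crude load signal, of course; a production version would probably want to weight by server capacity as well.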

pgBalancer would also accept a variety of commands via web service, including: suspend a service or all services; disconnect a specific connection or all connections; fail over the write node to a specific new server; drop a server from the load balancing list; add a server to the load balancing list; give a list of connections; give a list of servers.

If we could build two tools which more-or-less match the specification above, I think they would go a lot further towards supporting the kinds of high-availability, horizontally-scaled PostgreSQL server clusters we want to use for the applications of the future. Time permitting, I'll start work on pgFailover.  A geek can dream, can't he?

Monday, March 26, 2012

Travel March to June 2012

For the next four months, I will be travelling and speaking at the following venues:

pgDay Austin: this Wednesday (March 28th).  "Super Jumbo Deluxe", about doing big databases on PostgreSQL.

FOSS4G-NA, Washington D.C., April 10-12: I will be keynoting at the premier open source geospatial conference in the USA, as well as giving a talk on upcoming developments in PostgreSQL.

pgCon, Ottawa, May 15-18: I will be doing "9.2 Gran Turismo", a racing review of PostgreSQL 9.2 features.

SELF and OpenDBCamp, Charlotte, NC, June 8-10: talks not selected yet. OpenDBCamp is an unconference for database hackers.

For SELF, I'm looking for more PostgreSQL people who want to speak or work a booth.  If you're available, please contact me or speak up on the pgsql-advocacy mailing list.

So, if you're in Austin, D.C., Ottawa or Charlotte, I'll see you there!

Sunday, March 25, 2012

How can we get software vendors to update?

Now that PostgreSQL is becoming the database of choice for independent software vendors, we're developing a new problem: software vendors do not apply updates.  Within the last month, we've had the exact same conversation with four different ISV customers:

Customer: we have an instance of data corruption on one of our client's machines.  detailed description follows

pgExperts: yes, that sounds like data corruption.  What version of PostgreSQL are you running on that machine?

Customer: 8.4.1.

pgExperts: 8.4.1 is missing 2 years of patch updates, including fixes for several data corruption issues.  You should have updated to 8.4.11. 

Customer: so can you fix it?

pgExperts: you need to apply the update to the current PostgreSQL patch version first.

Customer: we can't do that.  Can you fix it?

pgExperts: not for a reasonable cost, no.

It seems that many ISVs regularly deploy databases for which they have neither a mechanism nor a regular practice of applying updates and patches.   This could stem from a practice of avoiding bad patches (like those from certain major database and OS vendors), poor QA and testing, lack of remote access, inability to schedule downtimes, or some other issue.  The only strange thing is the level of resistance ISVs have to the idea of applying updates, as if they'd never heard of it before.  Regardless, the result is the same: the user's data is lost, corrupted, or hacked, and PostgreSQL gets blamed.

I doubt we're the only PostgreSQL support provider to encounter this.  My question is, what can we do to educate vendors about the need to apply updates regularly, promptly, and throughout their customer base?

Tuesday, March 20, 2012

New Blog Location

After six years at ToolBox, I've moved my blog here.  It was time.

The archives of six years of blogging on databases and related things will remain at ToolBox, but all new posts will be here.

Expect to see much more frequent updates from me in the future!

P.S. I also have a separate blog about food and cooking, and I expect to resume writing there sometime soon.