Monday, October 8, 2012

Determining Furthest Ahead Replica

Here's a common situation for PostgreSQL binary replication setups with multiple replicas: the master is down, and you want to promote the "farthest ahead" replica so that you can remaster the other replicas from it with a minimum of fuss and data loss.   Sadly, this is something we make stupidly hard to do in PostgreSQL 9.0, 9.1 and 9.2.  While our project is great at making things possible, we're not so great at making them easy.

Checking Log Position

By "furthest ahead", we mean "which replica has received the most recent data from the master before the master went down?"  The relevant place to check this is the replica's transaction log buffer, which is where data goes which has been received from the master. 

When you start out, it looks simple.  You check the last XLOG receive location on each replica, and take the "highest" one.  Let's try that!

replica1=# select pg_last_xlog_receive_location();

replica2=# select pg_last_xlog_receive_location();

Uhhh ...

Yeah, not helpful at all, is it?  We give you a completely undocumented pair of hexidecimal values.  To make things even more fun, Postgres doesn't have any built-in functions for converting hexidecimal to numeric.

Checking in PostgreSQL 9.1

In 9.1 we added pg_last_xact_replay_timestamp(), which gives you a timestamp for the replica replay:

replica1=# select pg_last_xact_replay_timestamp();
 2012-10-05 10:35:47.527-07

Yay!  That solves everything.  I can just take the replay timestamp from each replica and pick the latest one, right?

Well, no.  Sorry.

First, that's the replay timestamp, and what we're more interested in is the received timestamp, which we can't get.  WAL records are first received, and then replayed, so if a replica is under load the replay location can be behind the received location, sometimes considerably.  When remastering, we're going to terminate any load on the new master and let it catch up, so the replay location matters less than the received one in limiting data loss.

More importantly, a replica with a higher received location will refused to remaster from a replica with a lower one, even if it's not completely caught up on replay.  This means that promoting the replica with the highest replay timestamp does not necessarily give you the replica from which all other replicas can remaster. Checking replay timestamp also has a race condition built in if all replicas aren't done replaying yet.  You could wait for all replicas to catch up on replay, but if your master is down, you probably don't want to.

The remaining issue is that it's theoretically possible to have to servers with the same received timestamp, but with different log locations, especially if the external code you're using to compare them rounds down the milleseconds.

Still, pg_last_xact_replay_timestamp() is an OK way to find the furthest ahead replica if you have no other choice.  It's just not ideal or determinative.

Checking in 9.2

In 9.2, we supply pg_xlog_location_diff(), which lets us compare two xlog locations to see which one is higher:

replica2=# select pg_xlog_location_diff('1/10808DE8','1/FFD1560');

If it's positive, the first value is higher.  If it's negative, the second.  The problem with this approach becomes obvious if you have, say, seven replicas; you'd have to run 1:1 comparisons on for each pair of replicas, which means doing six individual comparisons, each of which comparisons involves a database call, and a clever piece of code which can treat these comparisons as a sort routine.

More importantly to a lot of users, this function is only available to 9.2, and most users haven't upgraded to 9.2 yet.  So this is a determinative solution, but won't really work for a lot of people.

The reason I can't use 9.2's pg_xlogfile_name_offset() for comparisons is that it doesn't work on replicas, making the function of questionable utility.

Workaround Solution

What we could really use is a workaround which would do all of the following:

  1. Give us the received location of each replica as an externally sortable, preferably numeric, value.
  2. Do so with a single, simple database call to each replica.
  3. Not require installation of external procedural languages or C functions, so that there aren't issues with installing or compiling software on each node.
Happily, we can fulfill all three of the above conditions using some ugly-but-functional PL/pgSQL functions, available here.

The way you would use them is as follows:

  1. The master goes down.
  2. Check xlog_location_numeric() on each backend.
  3. Pick the backend with the highest (or tied for highest) numeric position, and check how far behind it is in replay using replay_lag_mb(), but see below.
  4. If the highest replica isn't too far behind on replay, promote it.
  5. If the highest replica is too far behind, drop to the next-highest and check replay lag.
The purpose of the check in replay is to avoid promoting a replica which happens to have the highest received location, but for some reason (such as a really long-running transaction) is actually hours behind on replay.  This means making a business decision between data loss and speed of failover, so only you can set the threshold here.

One way you could check the replay lag is using pg_last_xact_replay_timestamp(), at least on 9.1 and later.  However, since the master is down and you don't necessarily know the exact time it went down, that means checking the timestamp on all replicas, and comparing against the most recent one.  It also means waiting a bit for the replicas to catch up in order to avoid race conditions.

A second, and my preferred, way to check how far the candidate new master is behind in replay is to check how many bytes different pg_last_xlog_replay_location() is from pg_last_xlog_receive_location().  Not only does this not require checking all replicas (i.e. you can test only your candidate new master), but the number of bytes is more directly related to required replay time than the clock time lag is.  A replica requires no time to replay portions of the log which relate to idle periods on the master.

In 9.2, this is easily done using pg_xlog_location_diff:

replica2=# SELECT pg_xlog_location_diff(pg_xlog_last_receive_location(), pg_xlog_last_replay_location());

When this value reaches zero, you know this replica is completely caught up.  You can also set a threshold for how far you want a replica to be behind on replay before promoting it, such as 1GB (the numeric value is roughly equivalent to bytes).

However, pg_xlog_location_diff doesn't exist in 9.0 or 9.1.  So I've provided my own function, replay_lag_mb(), which gives the approximate replay lag in megabytes, regardless of PostgreSQL version.

bench=# select replay_lag_mb();

I've also provided a simpler function, all_replayed(), which returns true if replay is caught up.

Hopefully this provides a solution for many of you.  And hopefully this leads to us having something simpler built into 9.3!

Notes on the Functions

Link again, in case you didn't see it above: replay location functions.

First, thanks very much to Andrew Geirth ("RhodiumToad") for verifying my logic and correcting my math.  And, of course, Simon Riggs, Fujii Masao, Robert Haas, Heikki Linnakangas,  Magnus Hagander, and all of the other developers who worked on binary replication.
  • all_replayed() and replay_lag_mb() are designed to be run only on replicas.  They will return NULL on standalone or master servers.
  • these functions will return the same results regardless of which database they're installed in.  However, they can only be called from the database in which they are installed.   So you might want to install them in the "postgres" scratch database.
  • xlog_location_numeric() returns a numeric value which can exceed a 64-bit integer in size.  So make sure your calling and sorting code is prepared to handle a larger-than-eight-byte number.
  • The math for numeric log position is only valid through 9.2.  In 9.3, we change the multiplier for the logfile, so we'll need new functions (or, hopefully, something built in).
  • Yes, I really do have to do that horrible ugly conversion (string to hex to bitmap to bigint to numeric) to turn the hex values to numeric.
If anyone writes administrative code or recipes utilizing these functions, please link it in the comments.

Also, please note that if your real concern is to have a failover replica with the lowest data loss and fastest failover, the best answer is to have a synchronous standby which  receives no query load, and fail over to that.  Assuming you can afford the extra server, that's better than any of the workarounds above.


  1. Great. I don't understand how we have gone this long without being forced to provide more helpful tooling for this.

    1. perhaps because the pg community is becoming aware of other environments that simply civil.

      bouncability has ALWAYS been important, which is what i believe josh is addressing in this great post.

      from the front-jmscott

  2. Sorry, I'm kind of new in Postgresql setup and wonder if there is some automatic solution for failover.

    the master is down, and you want to promote the "farthest ahead" replica so that you can remaster the other replicas from it with a minimum of fuss and data loss.

    Do you really expect a human sitting and waiting for master failing and then fix all manually? Is there no software for it?

    1. Well, manual failover is the solution for some users. If you don't have the tooling to prevent split-brain, then it's a bad idea to do automatic failover.

      For folks who set up automated failover, there are a variety of packages, including PGX's internal code (soon to be open sourced), RepMgr, OmniPITR, pgPool2 and others, which can be integrated with monitoring software and network control software to manage failover.

      Even in that case, though, the software needs a determinative algorithm to determine which replica to fail over *to*, which is what I'm trying to provide.

      The problem with providing packaged one-size-fits-all software is that no two setups with multiple replicas are the same, especially when you take other systems tools into account. I agree that we could stand to have some one-size-fits-some software for some of the most obvious cases, such as AWS web load balancing. There's a lot more work to be done there, certainly.

    2. I see. Thanks for your post and the answer!

  3. > xlog_position_numeric() returns a numeric value which can exceed a 64-bit
    > integer in size. So make sure your calling and sorting code is prepared
    > to handle a larger-than-eight-byte number.

    An XLogRecPtr is internally two 32-bit integers in PostgreSQL 9.2, and will be a single 64-bit integer in PostgreSQL 9.3. Given that, I don't see how xlog_position_numeric() can return a value that doesn't fit in a 64-bit integer. Theoretically I suppose it might not fit in a *signed* 64-bit integer, but getting the XLOG position into that range would require generating 8 exabytes of WAL data, and once you get there you'd start to get uncomfortably close to reaching the hard 2^64 limit anyway.

    1. Heikki,

      I'm definitely able to get a larger-than-32-bit signed integer from the offset field. Are you sure of the above?

  4. Thanks so much for your post it was very helpful. I've created a basic script that will use the functions against a list of nodes to perform the checks to see if the node is furthest ahead and has the least replay lag.

    I'm looking at trying to integrate or create a OCF resource agent that will combine this logic also for a pacemaker cluster.

  5. In your post, you refer to "xlog_position_numeric()", but in the code, the function is "xlog_location_numeric()"

  6. great article. and a late question , even with sync standby(with streaming replication) the standby servers might fall out of sync. so for, 9.3 are there any updates you can suggest related to streaming replication on how to find the most up-to-date slave?