Database Soup: Determining Furthest Ahead Replica

Here's a common situation for PostgreSQL binary replication setups with multiple replicas: the master is down, and you want to promote the "farthest ahead" replica so that you can remaster the other replicas from it with a minimum of fuss and data loss. Sadly, this is something we make stupidly hard to do in PostgreSQL 9.0, 9.1 and 9.2. While our project is great at making things possible, we're not so great at making them easy.

Checking Log Position

By "furthest ahead", we mean "which replica has received the most recent data from the master before the master went down?" The relevant place to check this is the replica's transaction log buffer, which is where data goes which has been received from the master.

When you start out, it looks simple. You check the last XLOG receive location on each replica, and take the "highest" one. Let's try that!

replica1=# select pg_last_xlog_receive_location();

        pg_last_xlog_receive_location 

       -------------------------------

        41A/10808DE8

replica2=# select pg_last_xlog_receive_location();

        pg_last_xlog_receive_location 

       -------------------------------

        41A/FFD1560

Uhhh ...

Yeah, not helpful at all, is it? We give you a completely undocumented pair of hexidecimal values. To make things even more fun, Postgres doesn't have any built-in functions for converting hexidecimal to numeric.

Checking in PostgreSQL 9.1

In 9.1 we added pg_last_xact_replay_timestamp(), which gives you a timestamp for the replica replay:

replica1=# select pg_last_xact_replay_timestamp();

 pg_last_xact_replay_timestamp 

-------------------------------

 2012-10-05 10:35:47.527-07

Yay! That solves everything. I can just take the replay timestamp from each replica and pick the latest one, right?

Well, no. Sorry.

First, that's the replay timestamp, and what we're more interested in is the received timestamp, which we can't get. WAL records are first received, and then replayed, so if a replica is under load the replay location can be behind the received location, sometimes considerably. When remastering, we're going to terminate any load on the new master and let it catch up, so the replay location matters less than the received one in limiting data loss.

More importantly, a replica with a higher received location will refused to remaster from a replica with a lower one, even if it's not completely caught up on replay. This means that promoting the replica with the highest replay timestamp does not necessarily give you the replica from which all other replicas can remaster. Checking replay timestamp also has a race condition built in if all replicas aren't done replaying yet. You could wait for all replicas to catch up on replay, but if your master is down, you probably don't want to.

The remaining issue is that it's theoretically possible to have to servers with the same received timestamp, but with different log locations, especially if the external code you're using to compare them rounds down the milleseconds.

Still, pg_last_xact_replay_timestamp() is an OK way to find the furthest ahead replica if you have no other choice. It's just not ideal or determinative.

Checking in 9.2

In 9.2, we supply pg_xlog_location_diff(), which lets us compare two xlog locations to see which one is higher:

replica2=# select pg_xlog_location_diff('1/10808DE8','1/FFD1560');

 pg_xlog_location_diff 

-----------------------

8616072

If it's positive, the first value is higher. If it's negative, the second. The problem with this approach becomes obvious if you have, say, seven replicas; you'd have to run 1:1 comparisons on for each pair of replicas, which means doing six individual comparisons, each of which comparisons involves a database call, and a clever piece of code which can treat these comparisons as a sort routine.

More importantly to a lot of users, this function is only available to 9.2, and most users haven't upgraded to 9.2 yet. So this is a determinative solution, but won't really work for a lot of people.

The reason I can't use 9.2's pg_xlogfile_name_offset() for comparisons is that it doesn't work on replicas, making the function of questionable utility.

Workaround Solution

What we could really use is a workaround which would do all of the following:

Give us the received location of each replica as an externally sortable, preferably numeric, value.
Do so with a single, simple database call to each replica.
Not require installation of external procedural languages or C functions, so that there aren't issues with installing or compiling software on each node.

Happily, we can fulfill all three of the above conditions using some ugly-but-functional PL/pgSQL functions, available here.

The way you would use them is as follows:

The master goes down.
Check xlog_location_numeric() on each backend.
Pick the backend with the highest (or tied for highest) numeric position, and check how far behind it is in replay using replay_lag_mb(), but see below.
If the highest replica isn't too far behind on replay, promote it.
If the highest replica is too far behind, drop to the next-highest and check replay lag.

The purpose of the check in replay is to avoid promoting a replica which happens to have the highest received location, but for some reason (such as a really long-running transaction) is actually hours behind on replay. This means making a business decision between data loss and speed of failover, so only you can set the threshold here.

One way you could check the replay lag is using pg_last_xact_replay_timestamp(), at least on 9.1 and later. However, since the master is down and you don't necessarily know the exact time it went down, that means checking the timestamp on all replicas, and comparing against the most recent one. It also means waiting a bit for the replicas to catch up in order to avoid race conditions.

A second, and my preferred, way to check how far the candidate new master is behind in replay is to check how many bytes different pg_last_xlog_replay_location() is from pg_last_xlog_receive_location(). Not only does this not require checking all replicas (i.e. you can test only your candidate new master), but the number of bytes is more directly related to required replay time than the clock time lag is. A replica requires no time to replay portions of the log which relate to idle periods on the master.

In 9.2, this is easily done using pg_xlog_location_diff:

replica2=# SELECT pg_xlog_location_diff(pg_xlog_last_receive_location(), pg_xlog_last_replay_location());

pg_xlog_location_diff

----------------------------

16192

When this value reaches zero, you know this replica is completely caught up. You can also set a threshold for how far you want a replica to be behind on replay before promoting it, such as 1GB (the numeric value is roughly equivalent to bytes).

However, pg_xlog_location_diff doesn't exist in 9.0 or 9.1. So I've provided my own function, replay_lag_mb(), which gives the approximate replay lag in megabytes, regardless of PostgreSQL version.

bench=# select replay_lag_mb();
replay_lag_mb
---------------
93.7

I've also provided a simpler function, all_replayed(), which returns true if replay is caught up.

Hopefully this provides a solution for many of you. And hopefully this leads to us having something simpler built into 9.3!

Notes on the Functions

Link again, in case you didn't see it above: replay location functions.

First, thanks very much to Andrew Geirth ("RhodiumToad") for verifying my logic and correcting my math. And, of course, Simon Riggs, Fujii Masao, Robert Haas, Heikki Linnakangas, Magnus Hagander, and all of the other developers who worked on binary replication.

all_replayed() and replay_lag_mb() are designed to be run only on replicas. They will return NULL on standalone or master servers.
these functions will return the same results regardless of which database they're installed in. However, they can only be called from the database in which they are installed. So you might want to install them in the "postgres" scratch database.
xlog_location_numeric() returns a numeric value which can exceed a 64-bit integer in size. So make sure your calling and sorting code is prepared to handle a larger-than-eight-byte number.
The math for numeric log position is only valid through 9.2. In 9.3, we change the multiplier for the logfile, so we'll need new functions (or, hopefully, something built in).
Yes, I really do have to do that horrible ugly conversion (string to hex to bitmap to bigint to numeric) to turn the hex values to numeric.

If anyone writes administrative code or recipes utilizing these functions, please link it in the comments.

Also, please note that if your real concern is to have a failover replica with the lowest data loss and fastest failover, the best answer is to have a synchronous standby which receives no query load, and fail over to that. Assuming you can afford the extra server, that's better than any of the workarounds above.

11 comments:

Bruce MomjianOctober 8, 2012 at 11:25 AM
Great. I don't understand how we have gone this long without being forced to provide more helpful tooling for this.
AnonymousOctober 8, 2012 at 11:29 AM
Sorry, I'm kind of new in Postgresql setup and wonder if there is some automatic solution for failover.

"
the master is down, and you want to promote the "farthest ahead" replica so that you can remaster the other replicas from it with a minimum of fuss and data loss.
"

Do you really expect a human sitting and waiting for master failing and then fix all manually? Is there no software for it?
Heikki LinnakangasOctober 9, 2012 at 1:05 AM
> xlog_position_numeric() returns a numeric value which can exceed a 64-bit
> integer in size. So make sure your calling and sorting code is prepared
> to handle a larger-than-eight-byte number.

An XLogRecPtr is internally two 32-bit integers in PostgreSQL 9.2, and will be a single 64-bit integer in PostgreSQL 9.3. Given that, I don't see how xlog_position_numeric() can return a value that doesn't fit in a 64-bit integer. Theoretically I suppose it might not fit in a *signed* 64-bit integer, but getting the XLOG position into that range would require generating 8 exabytes of WAL data, and once you get there you'd start to get uncomfortably close to reaching the hard 2^64 limit anyway.
S. BamblingApril 2, 2013 at 3:22 AM
Thanks so much for your post it was very helpful. I've created a basic script that will use the functions against a list of nodes to perform the checks to see if the node is furthest ahead and has the least replay lag.

I'm looking at trying to integrate or create a OCF resource agent that will combine this logic also for a pacemaker cluster.

https://github.com/smbambling/scripts_bash/blob/master/pgsql_replica_xlog_check.sh
--AndyAugust 13, 2013 at 7:20 PM
In your post, you refer to "xlog_position_numeric()", but in the code, the function is "xlog_location_numeric()"
UnknownJanuary 23, 2015 at 7:53 AM
great article. and a late question , even with sync standby(with streaming replication) the standby servers might fall out of sync. so for, 9.3 are there any updates you can suggest related to streaming replication on how to find the most up-to-date slave?

Monday, October 8, 2012

Determining Furthest Ahead Replica