Database Soup: Cascading Replication and Cycles

Saturday, January 5, 2013

Cascading Replication and Cycles

Cascading replication is a 9.2 feature which allows you to replicate through a chain of replicas. The master replicates to replica1, which replicates to replica2, which replicates to replica3, etc. This has some tremendous advantages for large high-availability setups, including rapid failover, limiting replication streams over WAN, and knowing determinatively which replica is furthest ahead.

In PostgreSQL 9.2, cascading replication really requires file-based WAL archiving to work. As of a recent commit, PostgreSQL 9.3 will not require that: you can do cascading replication chains of whatever length you want without any WAL archiving. Score one for vastly enhanced usability! Thanks, Heikki.

Also, score one for arming additional foot-guns (see below).

First, the good news. Setting up a chain of replicas has now become fairly simple even for non-DBAs. Here's the basic steps:

Edit pg_hba.conf to enable replication connections.
Edit postgresql.conf to turn on the various replication options.
Restart the master.
Set up passwordless access to the master from the first replica.
Use pg_basebackup to snapshot the master from the replica.
Create a recovery.conf pointing at the master.
Start the replica.
Run pg_basebackup on the second replica.
Edit recovery.conf to point at the first replica.
Start the second replica.
Repeat steps 8-10 for as many replicas you want to chain.

By the time 9.3 comes out, the above may be even easier because we might fold recovery.conf into postgresql.conf, giving you one less file to edit, and pg_basebackup may create recovery.conf options for you and start the replica, which would get all of the above down to 7 steps. Frankly, setting up passwordless access is still the most annoying part.

Given this simplicity, I was able to set up an eight-server chain of replicas on the GoGrid Cloud in about an hour, half of which was waiting for GoGrid's rubber-stamping feature to copy the master once I had it configured (I really like this feature on GoGrid, for testing). In a low-traffic environment lag from master to replica7 was between 0.1 and 1.7 seconds, median 0.5 seconds. Not bad!

Thom Brown one-upped me by setting up a chain of 120 replicas. Surprisingly, this worked fine, except that replication lag to the end of the chain was several minutes.

Of course, where there are powerful new features, there are powerful foot-guns. This foot-gun is known as "cycles".

Here's how it goes: you have a replication chain where m1 replicates to r1 (synchronously) and r2 (async). r2 replicates to r3, which replicates to r4, r5 and r6. This is complex, but not unreasonable in an HA multi-datacenter setup.

Then, m1 fails. You promote r1 to be the new master, and reconfigure r2 to replicate from it. However, due to out-of-date network documentation, you accidentally connect r2 to r6 instead of r1, which 9.3 cascading replication will happily let you do without a complaint. Now you have a single independant master, and five replicas which are working perfectly well except that they are receiving no new data at all because you have a cycle. Ooops!

Troubleshooting this situation, once it happens, will be annoying: you basically have to check pg_is_in_recovery() and pg_stat_replication on each server, looking for a master which has replicas. If you don't find one, a cycle is your problem; otherwise your problem is something else.

However, it's not clear how PostgreSQL would detect a cycle; currently a replica only knows about its master and its own replicas, and nothing about masters or replicas more than one step away. We'd probably need to modify streaming replication to support a detection mechanism.

It's also not clear what we should do if a cycle is detected. Preventing a cycle entirely seems wrong, since one can imagine circumstances where a temporary cycle is a legitimate part of a transition between two different replication setups. Issuing a warning doesn't seem sufficient, though, given that many replication chains are managed by devops software which ignores warnings. Although writing a warning every 5 minutes to the logs on all replicas would at least help with troubleshooting.

Anyway, there's no plans to address cycles as an issue for 9.3, except maybe some documentation. Please comment below if you have thoughts or ideas on this particular foot-gun.

8 comments:

Adrian KlaverJanuary 5, 2013 at 5:24 PM
Hmm, the snake eating its tail. Seems the key would be the fact no new data is being generated. Now that can happen for legitimate reasons so it not in itself a flag. It could serve though as the basis for a user setting that activates the warning/error based on time/bytes passed. Otherwise as you say there would need to be a way for replication admin information to traverse the entire chain. Something that sees a cluster appearing more than once in the chain.
ReplyDelete
Replies
RobJanuary 5, 2013 at 9:24 PM
Having worked with cascading replication, I am confident that this will more often not be an anti-feature.
ReplyDelete
Replies
UnknownSeptember 25, 2013 at 6:50 PM
Are there any specific instructions on how to get this working? Been looking around and we can't seem to find any detailed instruction on how to set this up.
ReplyDelete
Replies

Add comment