Saturday, January 5, 2013

Cascading Replication and Cycles

Cascading replication is a 9.2 feature which allows you to replicate through a chain of replicas.  The master replicates to replica1, which replicates to replica2, which replicates to replica3, etc.  This has some tremendous advantages for large high-availability setups, including rapid failover, limiting replication streams over WAN, and knowing determinatively which replica is furthest ahead.

In PostgreSQL 9.2, cascading replication really requires file-based WAL archiving to work.  As of a recent commit, PostgreSQL 9.3 will not require that: you can do cascading replication chains of whatever length you want without any WAL archiving.  Score one for vastly enhanced usability!  Thanks, Heikki.

Also, score one for arming additional foot-guns (see below).

First, the good news.  Setting up a chain of replicas has now become fairly simple even for non-DBAs.  Here's the basic steps:

  1. Edit pg_hba.conf to enable replication connections.
  2. Edit postgresql.conf to turn on the various replication options.
  3. Restart the master.
  4. Set up passwordless access to the master from the first replica.
  5. Use pg_basebackup to snapshot the master from the replica.
  6. Create a recovery.conf pointing at the master.
  7. Start the replica.
  8. Run pg_basebackup on the second replica.
  9. Edit recovery.conf to point at the first replica.
  10. Start the second replica.
  11. Repeat steps 8-10 for as many replicas you want to chain.
By the time 9.3 comes out, the above may be even easier because we might fold recovery.conf into postgresql.conf, giving you one less file to edit, and pg_basebackup may create recovery.conf options for you and start the replica, which would get all of the above down to 7 steps.  Frankly, setting up passwordless access is still the most annoying part.

Given this simplicity, I was able to set up an eight-server chain of replicas on the GoGrid Cloud in about an hour, half of which was waiting for GoGrid's rubber-stamping feature to copy the master once I had it configured (I really like this feature on GoGrid, for testing).  In a low-traffic environment lag from master to replica7 was between 0.1 and 1.7 seconds, median 0.5 seconds.  Not bad!

Thom Brown one-upped me by setting up a chain of 120 replicas.  Surprisingly, this worked fine, except that replication lag to the end of the chain was several minutes.

Of course, where there are powerful new features, there are powerful foot-guns.  This foot-gun is known as "cycles".

Here's how it goes: you have a replication chain where m1 replicates to r1 (synchronously) and r2 (async).  r2 replicates to r3, which replicates to r4, r5 and r6.  This is complex, but not unreasonable in an HA multi-datacenter setup.  

Then, m1 fails.  You promote r1 to be the new master, and reconfigure r2 to replicate from it.  However, due to out-of-date network documentation, you accidentally connect r2 to r6 instead of r1, which 9.3 cascading replication will happily let you do without a complaint.   Now you have a single independant master, and five replicas which are working perfectly well except that they are receiving no new data at all because you have a cycle.  Ooops!

Troubleshooting this situation, once it happens, will be annoying: you basically have to check pg_is_in_recovery() and pg_stat_replication on each server, looking for a master which has replicas.  If you don't find one, a cycle is your problem; otherwise your problem is something else.

However, it's not clear how PostgreSQL would detect a cycle; currently a replica only knows about its master and its own replicas, and nothing about masters or replicas more than one step away.  We'd probably need to modify streaming replication to support a detection mechanism.

It's also not clear what we should do if a cycle is detected.  Preventing a cycle entirely seems wrong, since one can imagine circumstances where a temporary cycle is a legitimate part of a transition between two different replication setups.  Issuing a warning doesn't seem sufficient, though, given that many replication chains are managed by devops software which ignores warnings.  Although writing a warning every 5 minutes to the logs on all replicas would at least help with troubleshooting.

Anyway, there's no plans to address cycles as an issue for 9.3, except maybe some documentation.  Please comment below if you have thoughts or ideas on this particular foot-gun.


  1. Hmm, the snake eating its tail. Seems the key would be the fact no new data is being generated. Now that can happen for legitimate reasons so it not in itself a flag. It could serve though as the basis for a user setting that activates the warning/error based on time/bytes passed. Otherwise as you say there would need to be a way for replication admin information to traverse the entire chain. Something that sees a cluster appearing more than once in the chain.

  2. Having worked with cascading replication, I am confident that this will more often not be an anti-feature.

    1. Out of curiosity, what would be the anti-feature and why?

    2. Given a non-zero likelihood of failure of a server, introducing cascading replication will reduce overall reliability and increase complexity. If there is a problem in the intermediate layer (aka the master slave) then the replica of the master slave will also be impacted. The increase in complexity comes from having more than one replica configuration.

      This may sound hand wavey, but at my last job there was a replication chain 7 layers deep with more than 100 replica instances. Managing that was non-trivial.

      In the world of MySQL, people (developers) like to introduce partial replication and master slaves which take writes. <- shitshow

      A good use case is running out of bandwidth on the NIC of the master. IMHO, given the option upgrading to 10G is preferable as is perhaps putting the logs on something like Gluster.

    3. Rob,

      You're missing the three primary cases for cascading replication, although one of them doesn't exist in MySQL.

      1. Metered bandwidth: if you are replicating between data centers, and paying per MB for replication traffic, then cascading between the data centers makes a lot of sense.

      2. Predictable failover: when failing over to a replica, you want to fail over to the replica which is furthest ahead. In a cascading setup, the intermediate replica will always be furthest ahead, reducing complexity of your failover scripts. (this is not, AFAIK, a MySQL issue)

      3. Cloning Replicas: this is the big one. Cascading replication means that you can also hot-clone replicas instead of needing to pull data from the master (this works fine on 9.2 already).

    4. Hi Josh!
      1. Transfer across the network into something durable and then have the replica servers get their info logs from whatever is durable. Yes, this is a valid use case, I just think there are better ways to solve the problem.

      2. This is a problem in the mysql world. Applying logs and fixing a mess is much easy with one level of replication than with multiple and the perma fix is globally unique transaction id (coming soon! (though google has had it for years)). This ought to be easy to deal with in the PG world as the new master replica should have all the logs to push down to the lagged replica servers and I would hope that it is relatively easy to determine which server is farthest ahead.

      3. Wait, what? It has not been possible to build a replica from an existing replica? Really?

      Note what I said at the start. This feature does have some valid use cases, however I am predicting that most of the time it will not be used in a sane manner.

    5. Rob,

      1. I've found cascading replicas to handle the situation better that what you sugest, but YMMV.

      2. Actually, if you flip backwards through this blog, determining the furthest ahead replica in PostgreSQL is fairly complex.

      3. In 9.0 and 9.1, the only way to create a new replica from another replica has been either (a) to shut down the replica being copied, or (b) to use something which produces coherent FS snapshots. There are some clever workarounds for this, but they aren't that reliable. Hence the importance of 9.2 cascading replication, even if it didn't work streaming-only.

  3. Are there any specific instructions on how to get this working? Been looking around and we can't seem to find any detailed instruction on how to set this up.