Tuesday, March 27, 2012

PostgreSQL needs a new load balancer

We do a lot of high-availability PostgreSQL deployments for clients.  As of 9.1, streaming replication (SR) is excellent for this, and can scale sufficiently to spread a client across an AWS node cluster with some simple tools to help manage it.  But where we're keenly feeling the lack is simple load balancing and failover.

We use pgPool a lot for this, and once you've set it up it works.  But pgPool suffers from runaway sporkism: it's a load balancer and a failover tool and a multimaster replication system and a data partitioning tool and a cache, and it's compatible with SR and Slony and Bucardo.  If you need all of those things, it's great, but if you only need one of them you still get the complexity of all of them.  It also suffers from having been designed around the needs of specific SRA customers, and not really for general use.  We've been tinkering with it for a while, and I just don't see a way to "fix" pgPool for the general use case; we need something new, designed from scratch to be simple and limited to the 80% use case.

What we really need is a simple tool, or rather a pair of tools, designed to do only failover and read load-balancing, and only for PostgreSQL streaming replication.  These tools should be stackable in any combination the user wants, like pgBouncer (and, for that matter, with pgBouncer).  They should provide information via a web service, and be configurable via a web service.

I'll call the first tool "pgFailover".  Its purpose is to manage a master-replica group of servers, including both planned and unplanned failovers.  You would run only one active pgFailover node within a group, in order to avoid "split-brain" issues.  It would not handle database connections at all.

pgFailover would track a master and several replicas.  The status of each server would be monitored by polling both the replication information on each replica and pg_stat_replication on the master.  pgFailover would publish this information to the network via a web service, and would accept commands via the same web service as well as on the local command line.
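As a sketch only, that polling loop could be very small.  The version below assumes psycopg2 and a 9.1-era cluster; the DSNs and the poll interval are hypothetical, and a real daemon would need timeouts and error handling for unreachable nodes:

```python
import time

import psycopg2

MASTER_DSN = "host=master dbname=postgres"           # hypothetical
REPLICA_DSNS = ["host=replica1 dbname=postgres",     # hypothetical
                "host=replica2 dbname=postgres"]

def poll_once():
    """Collect one snapshot of replication status from every node."""
    status = {}
    # On the master, pg_stat_replication has one row per connected standby.
    conn = psycopg2.connect(MASTER_DSN)
    try:
        cur = conn.cursor()
        cur.execute("SELECT application_name, state, replay_location "
                    "FROM pg_stat_replication")
        status["master"] = cur.fetchall()
    finally:
        conn.close()
    # On each replica, confirm it is still in recovery and note when it
    # last replayed a transaction.
    for dsn in REPLICA_DSNS:
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute("SELECT pg_is_in_recovery(), "
                        "pg_last_xact_replay_timestamp()")
            status[dsn] = cur.fetchone()
        finally:
            conn.close()
    return status

if __name__ == "__main__":
    while True:
        print(poll_once())
        time.sleep(5)            # hypothetical poll interval
```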

Based on user-configurable criteria, pgFailover would carry out any of the following operations: fail over to a new master; remaster the other replicas; add a new replica, with or without a data sync; resync a replica; or shut down a replica.  It would also handle some situations automatically.  If user-configurable conditions of nonresponsiveness are met, it would fail over the master to the "best" replica, chosen either from the configuration or based on which replica is most caught up according to replication timestamps.  Likewise, replicas would be dropped from the availability list if they stop replicating for a certain period or become nonresponsive.
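To make the "best replica" rule concrete, here's one possible selection function.  The record layout (name, priority, replay timestamp, liveness flag) is an assumption for illustration, not a settled design:

```python
def pick_failover_target(replicas, prefer=None):
    """Pick the replica to promote.

    replicas: list of dicts such as
      {"name": "replica1", "priority": 1,
       "replay_ts": <datetime>, "alive": True}
    prefer: an explicitly configured failover target, if any.
    Assumes every live replica has a replay timestamp.
    """
    candidates = [r for r in replicas if r["alive"]]
    if not candidates:
        raise RuntimeError("no live replica to fail over to")
    if prefer is not None:
        for r in candidates:
            if r["name"] == prefer:
                return r
    # Otherwise take the most caught-up replica (latest replay
    # timestamp), using the configured priority (lower number wins)
    # to break ties.
    return max(candidates,
               key=lambda r: (r["replay_ts"], -r["priority"]))
```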

The second tool I'll call "pgBalancer", after the unreleased tool from Skype.  pgBalancer would do only load-balancing of database connections across the replicated servers.  It wouldn't deal with failover or monitoring the servers at all; instead, it would rely entirely on pgFailover for that.  This allows users to run several separate pgBalancer servers, supporting both high availability and complex load-balancing configurations.

Since automated separation of read and write queries is an impossible problem, we won't even try to solve it.  Instead, we'll rely on the application knowing whether it's reading or writing, and provide two separate database connection ports: one for read-write (RW) connections, and one for read-only (RO) connections.  pgBalancer would learn which servers to connect to either from a configuration file or by querying a pgFailover server via its web service.  RO connections would be load-balanced across the available servers using a simple "least active" plus "round-robin" algorithm.
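The RO selection rule could be as small as the sketch below; the connection counts would come from the balancer's own bookkeeping, and the class and method names are invented for illustration:

```python
import itertools

class ReadOnlyBalancer:
    """Least-active selection, with round-robin rotation among ties."""

    def __init__(self, servers):
        self.active = dict.fromkeys(servers, 0)        # open-connection counts
        self._cycle = itertools.cycle(sorted(servers))

    def pick(self):
        least = min(self.active.values())
        tied = [s for s, n in self.active.items() if n == least]
        # Advance the round-robin cycle until it lands on one of the
        # least-loaded servers, so repeated picks rotate among ties.
        for s in self._cycle:
            if s in tied:
                self.active[s] += 1
                return s

    def release(self, server):
        self.active[server] -= 1
```

With two idle servers, successive pick() calls alternate between them; once one server accumulates more open connections, new RO connections flow to the others until the counts even out.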

pgBalancer would also accept a variety of commands via web service, including: suspend a service, or all services; disconnect a specific connection, or all connections; fail over the write node to a specific new server; drop a server from the load-balancing list; add a server to the load-balancing list; list connections; list servers.
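None of this exists yet, of course, but the command interface could be as thin as a dispatch table over HTTP.  A standard-library sketch, with invented command names and paths:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical command names, matching the list above.
COMMANDS = {"suspend", "disconnect", "failover",
            "drop_server", "add_server", "connections", "servers"}

class BalancerCommandHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # e.g. POST /failover/replica2 or POST /suspend
        parts = self.path.strip("/").split("/")
        if parts[0] not in COMMANDS:
            self.send_error(404, "unknown command")
            return
        # A real server would dispatch to the balancer core here.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(("ok: %s\n" % self.path).encode("ascii"))

if __name__ == "__main__":
    HTTPServer(("", 8080), BalancerCommandHandler).serve_forever()
```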

If we could build two tools which more or less match the specification above, I think they would go a long way toward supporting the kinds of high-availability, horizontally scaled PostgreSQL server clusters we want to use for the applications of the future. Time permitting, I'll start work on pgFailover.  A geek can dream, can't he?

19 comments:

  1. I agree with the necessity of a simple load balancer for PostgreSQL. I'm considering enabling load balancing with a network load balancer, like UltraMonkey. Do we actually need an "SQL statement level" load balancer, or just a network load balancer? What do you think?

  2. I don't know if it is really necessary, when existing load balancers can do the job quite well. Something like pgbouncer needs to know the Postgres protocol and innards to work, but a dumb load balancer as you describe would not.

  3. Hey,

    About failover, I am currently working on an OCF resource agent for Pacemaker which would be able to track slave status, and particularly their lag behind the master. In case of automatic failover, Pacemaker would then promote the best candidate.

    In the PostgreSQL context, Pacemaker sounds really promising and powerful. I hope to have a PoC in one or two weeks, and I'll probably write some docs and give a conference talk at some point.

  4. Greg, Satoshi,

    Actually, the more sophisticated network load balancers do work for this, except that I have yet to find one which can work together, in an automated way, with a failover tool. Maybe there's something I haven't tried; suggestions?

  5. It's easy to come up with situations where pgFailover could itself be a SPoF. Would that just be a design limitation, or would you want some kind of HA/STONITH system for it eventually?

    Replies
    1. DF,

      Yeah, I've been over this at several different production sites. I don't see any way around making failover to a new pgFailover node manual. There's simply no way for pgFailover to tell the difference between an actual failure of the primary and a network partition.

      Now, in pgFailover 2.0 (or 3.0), I can imagine having a passive secondary pgFailover node which constantly reads status data from the primary node, so that it would be completely up-to-date when a manual failover is triggered.

  6. I wonder if the way to go is HAProxy with a read-only connection and a read-write connection, plus a script that monitors the health of the nodes and rewrites the HAProxy config and reloads it. You'd still need something to do the failover, of course, but that sounds like a job for heartbeat.
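    For instance, a rough sketch of that monitor script; the backend list, config path, and reload command here are all assumptions, and a real version would manage only its own generated section of the config rather than a whole file:

    ```python
    import os
    import socket
    import subprocess

    BACKENDS = [("10.0.0.11", 5432), ("10.0.0.12", 5432)]   # hypothetical
    CFG_PATH = "/etc/haproxy/conf.d/postgres_ro.cfg"        # hypothetical

    def is_up(host, port, timeout=2):
        """The same dumb TCP check haproxy itself would do."""
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except OSError:
            return False

    def render_backend():
        lines = ["backend postgres_ro", "    mode tcp"]
        for i, (host, port) in enumerate(BACKENDS):
            if is_up(host, port):
                lines.append("    server pg%d %s:%d check" % (i, host, port))
        return "\n".join(lines) + "\n"

    def main():
        new_cfg = render_backend()
        old_cfg = open(CFG_PATH).read() if os.path.exists(CFG_PATH) else ""
        # Only rewrite and reload when the set of live backends changes.
        if new_cfg != old_cfg:
            with open(CFG_PATH, "w") as f:
                f.write(new_cfg)
            subprocess.call(["service", "haproxy", "reload"])

    if __name__ == "__main__":
        main()
    ```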

    Replies
    1. This is almost what I do.

      app -> pgbouncer -> haproxy -> postgres

      The app, pgbouncer, and haproxy all run on the same host. The app talks to pgbouncer via a Unix domain socket. haproxy has a read pool and a write pool in TCP mode. Using the simple TCP health check in haproxy (attempt to open a connection; health fails if it doesn't work or times out) works reasonably well, although extending this with a script that does more significant health checking of the backend might be nice.

      Planned maintenance involves putting a backend into maintenance mode in haproxy (or setting weight 0), then waiting for the session count for that backend to go to zero (basically whatever server_idle_timeout in pgbouncer is set to, or the duration of the longest running query).

      The app is configured not to do any connection pooling of its own, and to open a new connection to pgbouncer for each query, which is really low overhead because it's a domain socket.

      I've had this configuration in production for about six months now, and it's worked pretty well. YMMV

  7. Sessions are my main problem with all database connection poolers, especially with extensions. Consider dblink's dblink_connect(text connname, text connstr), which keeps its own out-of-band hash table in the session. It's especially bad because not only can we not fix the problem (certain kinds of session state may be impossible to move; contrast that with simple "SET" statements), but we can't even detect when the user has used a "bad" feature. An error would be infinitely preferable to what we do in this case -- which is nothing -- allowing the user to silently do the wrong thing or get mysterious errors.

  8. Josh,

    In Japan, I've heard that many MySQL folks have been using Linux LVS with keepalived to load-balance across MySQL slave nodes. Theoretically, it should work with PostgreSQL too, but I haven't finished a PoC yet. I'm going to work on the PoC in the next month or two.

    I believe you can also learn something from MHA for MySQL when dealing with failover in PostgreSQL.
    http://yoshinorimatsunobu.blogspot.jp/2011/07/announcing-mysql-mha-mysql-master-high.html

  9. All of these suggestions are good, and interesting.

    However, what I really want to move away from is hackish solutions which work some of the time and put a huge configuration/setup burden on admins. We really need something which is "install this RPM, edit this simple config file, and go".

  10. Absolutely. Actually, that's the reason why I'm now interested in starting a new project that gathers such hackish things into a single package with some configuration tool.

  11. Almost all of what you're looking for from pgFailover is in repmgr 2.0. The code is out there; we just haven't gotten to packaging it up nicely and writing a tutorial on its use yet. We try to sort out split-brain issues by allowing additional witness servers to be added: nodes which just run the repmgr daemon but not a whole database. The main challenge I'm working on now is integrating better with network fencing software. Fencing is a hard problem, and a lot of the complexity that keeps this from being plug-and-play comes from there.

    Replies
    1. I look forward to it; if I'm lucky, it'll make my own efforts unnecessary.

      I do think having a utility written in a popular interpreted language (e.g. Python) and controllable via web service has a fair amount of value in addition to what repmgr currently provides. On the other hand, so does minimizing my uncompensated time ...

  12. Why is automated separation of read/write queries an impossible problem?

  13. It's been 2 years -- does anything like pgFailover and/or pgBalancer exist yet?

  14. I'm also interested in the progress of these tools. Any news?

  15. Hi,
    Now it's been 3 years ... nothing? Any news?

    We have a master database doing 20K TPS and slaves doing 2K ... the binary replication is perfect, with less than 200ms-300ms of lag ... we just need this magic thing between the application and PG to balance the read load, just that ... ~90% of our queries are reads.
