Tuesday, March 27, 2012

PostgreSQL needs a new load balancer

We do a lot of high-availability PostgreSQL deployments for clients.  As of 9.1, streaming replication (SR) is excellent for this, and can scale sufficiently to spread a client across an AWS node cluster with some simple tools to help manage it.  But where we're keenly feeling the lack is simple load balancing and failover.

We use pgPool a lot for this, and once you've set it up it works.  But pgPool suffers from runaway sporkism: it's a load balancer and a failover tool and a multimaster replication system and a data partitioning tool and a cache, and it's compatible with SR and Slony and Bucardo.  If you need all of those things, it's great, but if you only need one of them you still get the complexity of all of them.  It also suffers from having been designed around the needs of specific SRA customers, and not really for general use.  We've been tinkering with it for a while, and I just don't see a way to "fix" pgPool for the general use case; we need something new, designed from scratch to be simple and limited to the 80% use case.

What we really need is a simple tool, or rather a pair of tools, designed to do only failover and read load-balancing, and only for PostgreSQL streaming replication.  These tools should be stackable in any combination the user wants, like pgBouncer (and, for that matter, with pgBouncer).  They should provide information via a web service, and be configurable via a web service.

I'll call the first tool "pgFailover".  Its purpose is to manage a master-replica group of servers, including both planned and unplanned failovers.  You would run only one active pgFailover node within a group, in order to avoid "split-brain" issues.  It would not handle database connections at all.

pgFailover would track a master and several replicas.  The status of each server would be monitored by polling both the replication information on each replica and pg_stat_replication on the master.  pgFailover would publish this information to the network via a web service, and would accept commands via the same web service as well as on the local command line.
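As a sketch only, that polling loop could be very small.  The version below assumes psycopg2 and a 9.1-era cluster; the DSNs and the poll interval are hypothetical, and a real daemon would need timeouts and error handling for unreachable nodes:

```python
import time

import psycopg2

MASTER_DSN = "host=master dbname=postgres"           # hypothetical
REPLICA_DSNS = ["host=replica1 dbname=postgres",     # hypothetical
                "host=replica2 dbname=postgres"]

def poll_once():
    """Collect one snapshot of replication status from every node."""
    status = {}
    # On the master, pg_stat_replication has one row per connected standby.
    conn = psycopg2.connect(MASTER_DSN)
    try:
        cur = conn.cursor()
        cur.execute("SELECT application_name, state, replay_location "
                    "FROM pg_stat_replication")
        status["master"] = cur.fetchall()
    finally:
        conn.close()
    # On each replica, confirm it is still in recovery and note when it
    # last replayed a transaction.
    for dsn in REPLICA_DSNS:
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute("SELECT pg_is_in_recovery(), "
                        "pg_last_xact_replay_timestamp()")
            status[dsn] = cur.fetchone()
        finally:
            conn.close()
    return status

if __name__ == "__main__":
    while True:
        print(poll_once())
        time.sleep(5)            # hypothetical poll interval
```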

Based on user-configurable criteria, pgFailover would carry out any of the following operations: fail over to a new master; remaster the other replicas; add a new replica, with or without a data sync; resync a replica; or shut down a replica.  It would also handle some situations automatically.  If user-configurable conditions of nonresponsiveness are met, it would fail over the master to the "best" replica, chosen either from the configuration or based on which replica is most caught up according to replication timestamps.  Likewise, replicas would be dropped from the availability list if they stop replicating for a certain period or become nonresponsive.
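To make the "best replica" rule concrete, here's one possible selection function.  The record layout (name, priority, replay timestamp, liveness flag) is an assumption for illustration, not a settled design:

```python
def pick_failover_target(replicas, prefer=None):
    """Pick the replica to promote.

    replicas: list of dicts such as
      {"name": "replica1", "priority": 1,
       "replay_ts": <datetime>, "alive": True}
    prefer: an explicitly configured failover target, if any.
    Assumes every live replica has a replay timestamp.
    """
    candidates = [r for r in replicas if r["alive"]]
    if not candidates:
        raise RuntimeError("no live replica to fail over to")
    if prefer is not None:
        for r in candidates:
            if r["name"] == prefer:
                return r
    # Otherwise take the most caught-up replica (latest replay
    # timestamp), using the configured priority (lower number wins)
    # to break ties.
    return max(candidates,
               key=lambda r: (r["replay_ts"], -r["priority"]))
```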

The second tool I'll call "pgBalancer", after the unreleased tool from Skype.  pgBalancer would do only load-balancing of database connections across the replicated servers.  It wouldn't deal with failover or monitoring the servers at all; instead, it would rely entirely on pgFailover for that.  This allows users to run several separate pgBalancer servers, supporting both high availability and complex load-balancing configurations.

Since automated separation of read and write queries is an impossible problem, we won't even try to solve it.  Instead, we'll rely on the application knowing whether it's reading or writing, and provide two separate database connection ports: one for read-write (RW) connections, and one for read-only (RO) connections.  pgBalancer would learn which servers to connect to either from a configuration file or by querying a pgFailover server via its web service.  RO connections would be load-balanced across the available servers using a simple "least active" plus "round-robin" algorithm.
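The RO selection rule could be as small as the sketch below; the connection counts would come from the balancer's own bookkeeping, and the class and method names are invented for illustration:

```python
import itertools

class ReadOnlyBalancer:
    """Least-active selection, with round-robin rotation among ties."""

    def __init__(self, servers):
        self.active = dict.fromkeys(servers, 0)        # open-connection counts
        self._cycle = itertools.cycle(sorted(servers))

    def pick(self):
        least = min(self.active.values())
        tied = [s for s, n in self.active.items() if n == least]
        # Advance the round-robin cycle until it lands on one of the
        # least-loaded servers, so repeated picks rotate among ties.
        for s in self._cycle:
            if s in tied:
                self.active[s] += 1
                return s

    def release(self, server):
        self.active[server] -= 1
```

With two idle servers, successive pick() calls alternate between them; once one server accumulates more open connections, new RO connections flow to the others until the counts even out.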

pgBalancer would also accept a variety of commands via web service, including: suspend a service, or all services; disconnect a specific connection, or all connections; fail over the write node to a specific new server; drop a server from the load-balancing list; add a server to the load-balancing list; list connections; list servers.
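None of this exists yet, of course, but the command interface could be as thin as a dispatch table over HTTP.  A standard-library sketch, with invented command names and paths:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical command names, matching the list above.
COMMANDS = {"suspend", "disconnect", "failover",
            "drop_server", "add_server", "connections", "servers"}

class BalancerCommandHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # e.g. POST /failover/replica2 or POST /suspend
        parts = self.path.strip("/").split("/")
        if parts[0] not in COMMANDS:
            self.send_error(404, "unknown command")
            return
        # A real server would dispatch to the balancer core here.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(("ok: %s\n" % self.path).encode("ascii"))

if __name__ == "__main__":
    HTTPServer(("", 8080), BalancerCommandHandler).serve_forever()
```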

If we could build two tools which more or less match the specification above, I think they would go a long way toward supporting the kinds of high-availability, horizontally scaled PostgreSQL server clusters we want to use for the applications of the future. Time permitting, I'll start work on pgFailover.  A geek can dream, can't he?

19 comments:

  1. I agree with the necessity of a simple load balancer for PostgreSQL. I'm considering enabling load balancing with a network load balancer, like UltraMonkey. Do we actually need an "SQL statement level" load balancer, or just a network load balancer? What do you think?

  2. I don't know if it is really necessary, when existing load balancers can do the job quite well. Something like pgbouncer needs to know the Postgres protocol and innards to work, but a dumb load balancer as you describe would not.

  3. Hey,

    About failover, I am currently working on an OCF resource agent for Pacemaker which would be able to track slave status, and particularly their lag behind the master. In case of automatic failover, Pacemaker would then promote the best candidate.

    In the PostgreSQL context, Pacemaker sounds really promising and powerful. I hope to have a PoC in one or two weeks, and I'll probably write some docs and give a conference talk at some point.

  4. Greg, Satoshi,

    Actually, the more sophisticated network load balancers do work for this, except that I have yet to find one which can work together, in an automated way, with a failover tool. Maybe there's something I haven't tried; suggestions?

  5. It's easy to come up with situations where pgFailover could itself be a SPoF. Would that just be a design limitation, or would you want some kind of HA/STONITH system for it eventually?

    Replies
    1. DF,

      Yeah, I've been over this at several different production sites. I don't see any way around making failover to a new pgFailover node manual. There's simply no way for pgFailover to tell the difference between an actual failure of the primary and a network partition.

      Now, in pgFailover 2.0 (or 3.0), I can imagine having a passive secondary pgFailover node which constantly reads status data from the primary node, so that it would be completely up-to-date when a manual failover is triggered.

  6. I wonder if the way to go is HAProxy with a read-only connection and a read-write connection, plus a script that monitors the health of the nodes and rewrites the HAProxy config and reloads it. You'd still need something to do the failover, of course, but that sounds like a job for heartbeat.
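    For instance, a rough sketch of that monitor script; the backend list, config path, and reload command here are all assumptions, and a real version would manage only its own generated section of the config rather than a whole file:

    ```python
    import os
    import socket
    import subprocess

    BACKENDS = [("10.0.0.11", 5432), ("10.0.0.12", 5432)]   # hypothetical
    CFG_PATH = "/etc/haproxy/conf.d/postgres_ro.cfg"        # hypothetical

    def is_up(host, port, timeout=2):
        """The same dumb TCP check haproxy itself would do."""
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except OSError:
            return False

    def render_backend():
        lines = ["backend postgres_ro", "    mode tcp"]
        for i, (host, port) in enumerate(BACKENDS):
            if is_up(host, port):
                lines.append("    server pg%d %s:%d check" % (i, host, port))
        return "\n".join(lines) + "\n"

    def main():
        new_cfg = render_backend()
        old_cfg = open(CFG_PATH).read() if os.path.exists(CFG_PATH) else ""
        # Only rewrite and reload when the set of live backends changes.
        if new_cfg != old_cfg:
            with open(CFG_PATH, "w") as f:
                f.write(new_cfg)
            subprocess.call(["service", "haproxy", "reload"])

    if __name__ == "__main__":
        main()
    ```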

    Replies
    1. This is almost what I do.

      app -> pgbouncer -> haproxy -> postgres

      The app, pgbouncer, and haproxy all run on the same host. The app talks to pgbouncer via a Unix domain socket. haproxy has a read pool and a write pool in TCP mode. Using the simple TCP health check in haproxy (attempt to open a connection; health fails if it doesn't work or times out) works reasonably well, although extending this with a script that does more significant health checking of the backend might be nice.

      Planned maintenance involves putting a backend into maintenance mode in haproxy (or setting weight 0), then waiting for the session count for that backend to go to zero (basically whatever server_idle_timeout in pgbouncer is set to, or the duration of the longest running query).

      The app is configured not to do any connection pooling of its own, and to open a new connection to pgbouncer for each query, which is really low overhead because it's a domain socket.

      I've had this configuration in production for about six months now, and it's worked pretty well. YMMV

  7. Sessions are my main problem with all database connection poolers, especially with extensions. Consider dblink's dblink_connect(text connname, text connstr), which keeps its own out-of-band hash table in the session. It's especially bad because not only can we not fix the problem (certain kinds of session state may be impossible to move; contrast that with simple "SET" statements), but we can't even detect when the user has used a "bad" feature. An error would be infinitely preferable to what we do in this case -- which is nothing -- allowing the user to silently do the wrong thing or get mysterious errors.

  8. Josh,

    In Japan, I've heard that many MySQL folks have been using Linux LVS with keepalived to load-balance across MySQL slave nodes. Theoretically, it should work with PostgreSQL too, but I haven't finished a PoC yet. I'm going to work on the PoC in the next month or two.

    I believe you can also learn something from MHA for MySQL when dealing with failover in PostgreSQL.
    http://yoshinorimatsunobu.blogspot.jp/2011/07/announcing-mysql-mha-mysql-master-high.html

  9. All of these suggestions are good, and interesting.

    However, what I really want to move away from is hackish solutions which work some of the time and put a huge configuration/setup burden on admins. We really need something which is "install this RPM, edit this simple config file, and go".

  10. Absolutely. Actually, that's the reason why I'm now interested in starting a new project that gathers such hackish things into a single package with some configuration tool.

  11. Almost all of what you're looking for from pgFailover is in repmgr 2.0. The code is out there; we just haven't gotten to packaging it up nicely and writing a tutorial on its use yet. We try to sort out split-brain issues by allowing additional witness servers to be added: nodes which just run the repmgr daemon but not a whole database. The main challenge I'm working on now is integrating better with network fencing software. Fencing is a hard problem, and a lot of the complexity that keeps this from being plug-and-play comes from there.

    Replies
    1. I look forward to it; if I'm lucky, it'll make my own efforts unnecessary.

      I do think having a utility written in a popular interpreted language (e.g. Python) and controllable via web service has a fair amount of value in addition to what repmgr currently provides. On the other hand, so does minimizing my uncompensated time ...

  12. Why is automated separation of read/write queries an impossible problem?

  13. It's been 2 years -- does anything like pgFailover and/or pgBalancer exist yet?

  14. I'm also interested in the progress of these tools. Any news?

  15. Hi,
    Now it's been 3 years ... nothing? Any news?

    We have a master database doing 20K TPS and slaves doing 2K ... the binary replication is perfect, with less than 200ms-300ms of lag ... we just need this magic thing between the application and PG to balance the read load, just that ... ~90% of our queries are reads.
