Wednesday, May 15, 2013

PostgreSQL New Development Priorities: Scale It Now

The comments on my introductory post on this topic mentioned a lot of the major features which users would like to see in PostgreSQL.  Among those mentioned were:
  • Improvements to replication
  • Parallel query
  • Index-organized tables
  • Better partitioning
  • Better full-text search
  • Logical streaming replication
However, as with other projects, our perennial temptation is to listen to current users rather than potential users.  We can focus on making PostgreSQL better for the people who already use it.  It's attractive, but that way lies obsolescence.

What we need to focus on is the reasons why people don't use PostgreSQL at all.  Only by exploiting new markets -- by pushing Postgres into places which never had a database before -- do we grow the future PostgreSQL community.  And there's a bunch of ways we are failing new users.

For example, listen to Nelson Elhage, engineer at
"I love Mongo's HA story.  Out of the box I can build a 3-node Mongo cluster with a full replica set.  I can add nodes, I can fail over, without losing data."
Wouldn't it be nice if we could say the same thing about Postgres?  But we can't.

If we're looking for a #1 PostgreSQL development priority, this is it:

We need a "scale it now" button.

This is where we're losing ground to the new databases.  In every other way we are superior: durability, pluggability, standards-compliance, query sophistication, everything.  But when a PostgreSQL user outstrips the throughput of a single server or a single EC2 instance, our process for scaling out sucks.  It's complicated.  It has weird limitations.  And most of all, our scale-out requires advanced database expertise, which is expensive and in short supply.

We need some way for users to go smoothly and easily from one Postgres node to three and then to ten.  Until we do that, we will continue to lose ground to databases which do a better job at scaling, even if they suck at everything else.

I don't know what our answer to scaling out will be.  Streaming logical replication, PostgresXC, Translattice, Hadapt, something else ... I don't know.  But if we don't answer this vital question, all our incremental improvements will amount to naught.


  1. Agreed. At work we grudgingly moved our main project to MongoDB, which is a PITA on many levels but is easyer to scale than PG. I'd love to get back to PG someday.

    Postgres-XC looks promissing but its development seems slow, and the restrictions and configuration are still too burdensome. Perhaps this can be solved by an influx of developers.

    Streaming logical replication sounds easyer to achieve, but it doesn't solve the querying part.

  2. There is one thing that has been somewhere way in the back of my mind a while now; is there anything that can be done by Postgres itself to make it more accomodating to "scale-out" solutions, and perhaps even provide a "stepping-stone path" to an easy to use scale-out solution some releases in the future? Any common feature that could be pulled in and maintained as part of Postgres? Could extensions be expanded so that all one has to do to turn a common PostgreSQL database installation into a cluster node is to load the right extension? Or is each scale-out approach just too different and requires too specific configuration? I don't know the internals of the various components well enough so nothing specific comes to mind, I'm just seeing that there's a lot of side-projects related to scaling Postgres but they all add "curious setup steps and complicated configuration" to the mix.

  3. As a data architect who has successfully tried to push Postgres to be used for astronomy project with a near Petabyte scale eventually (currently 10^9 complex objects, 10^11 unnested, 10sTB currently), I would love to see improvements on all the items from your list, with a twist: I'd move improvements to replication to the last place.

    XC is certainly not a way to do replication, but for semi-analytical XLDB-like setups it might be an answer granted it is robust (enough) and supports query partitioning better.

    I could have seen spikes of development activity for XC, but at least it has been progressing, so hope this would be a better bet next year and years to come.
    XC for replication: amount of traffic to/from GTM just defies the purpose and this has been clear from the beginning of this project, it should never be run on WAN currently (and will never be unless Postgres provides better read-only logic, without need to create snapshots etc).

    There's an idea to add a 'standalone' option to XC so it could be deployed as a standard PG but it is low on the list of priorities from what I could deduct as a XC user.


  4. I completely agree with the ideas in that post. Scalability and ability to easily push and then aggregate data to several nodes is something, that is needed more and more.

    Actually PostgreSQL already has most important components that are needed and the amount of work to extend them in not really much:

    for the quick win:
    - readable postgres fdw (there)
    - writable postgres fdw (mostly there)
    - predicate pushing in fdw (there)
    - aggregate pushing in fdw (should be implemented)
    - data moving from shard to shard / re-balancing (should be implemented)

  5. One anonymous comment sent to me by email:

    "I'm running a large scale POC to bring in "new" database technology at a major firm (with over 15,000 Sybase databases), and I probably know MongoDB better than anyone here. I wanted you to know that I think you hit the nail on the head. I've also heard from friends in other sectors that they've tried PG but went back to MySQL because "failover is easier". In my experience Mongo has its ups and downs, but if it does something well, it's that you can have a replicated cluster up in about 15 minutes tops, and that when you need to failover, it almost always "just works". This nearly flat learning curve greatly helped it gain traction IMO, because it does that simply, and the rest of it is close enough to the RDBMS in principal that an average project is off to the races in no time. Bringing this kind of simplicity to HA initial setup and stability could do a massive amount of good for PG's adoption curve."

    1. Is this about MySQL replication being easier before Postgres 9.X ? Or still the case today?

  6. This comment has been removed by a blog administrator.

  7. In my dream world (where I have a pony and can fly), some simplified form of Oracle RAC would be a big win. In my day job and for my personal AWS projects I don't really require the pure scalability of Mongo/Redis/Couch/etc. They are great but for all of my projects and work and I would guess about 99% of the devs out there that if you could get PostgreSQL a good out of box HA with 2-4 with some simple form of sharded SELECT would be "good enough". For DML a single node is generally enough, or similar to RAC, whichever node you are connected to is your DML server but all of the clustered servers "could" be used for SELECTS if the query amenable to it.

    Actually I take that back, that would "friggin' awesome!". I've been interested in Postgres-XC for a while though but in talking to the owner of the company at SCALE this year the limitations are too extreme for my use cases.

  8. > For example, listen to Nelson Elhage, engineer at "I love Mongo's HA story. Out of the box I can build a 3-node Mongo cluster with a full replica set. I can add nodes, I can fail over, without losing data."

    An interesting contrast are the stories where due to the nature of MongoDB's design, data can be lost without node failures or network partitions. In some cases this is due to developers not reading the docs about the safe vs fast write options, but in others it is simply the nature of the tool which favors speed over consistency (or logical integrity).

    IMO caching systems (especially Redis) in conjunction with a relational engine like PostgreSQL are the more interesting and flexible approaches.

    1. MongoDB-s safe write opion are so slow that you really can't use them well for scaling. That is with {j=True} you get about 1/20th of write performance of PostgreSQL which likely means that you need 20+ nodes to get the same performance.
      And NoSQL-like scaling is easy - do as Skype does and use pl/proxy for near-linear scaling, easy multi-node queries, map/reduce, etc...

  9. I was just talking with Mason Sharp about this same thing last night after the NYCPUG meetup. Postgres is currently lacking in the areas of HA and Fault Tolerance, and what's worse is that the majority of Postgres developers seem almost indifferent. Given the operational demands that OmniTI and our clients have, I'm not surprised by this gap between us and some of the other support companies, but it can be frustrating at times, so I am glad to hear someone else bringing up the issue.

    That said, we (OmniTI) are working on trying to solve this; we've got trials ready to go with one of our customers on the whole automated failover / rebuild portions of this problem. There are other components in the mix as well; pgbouncer and vippy solve certain parts of the puzzle, it's a matter of finding the right mix. We were able to make WAL management and backups way better with the OmniPITR tools, and we think we close to the point where we can do the same thing for the HA story, and then hopefully tackle fault tolerance at some point after that.

  10. Amazon claims to have made the "scale it now" button: