Thursday, December 4, 2014

What's this pg_shard thing?

I've been working with CitusData for a while, and I'm excited that they've finally released pg_shard, a generic "sharding" extension for PostgreSQL databases.  The press releases have all of the noise about this, so this blog post is for the nuts-and-bolts of pg_shard.


What does pg_shard do?

pg_shard takes one or more tables in your database and distributes them over multiple databases on multiple servers, or "shards".  Rows are distributed based on a hash function on the primary key which you define.  The hash function used is Postgres' own hash function for hash indexes.

A single node is the "master node" which contains templates for the distributed tables, as and transparently handles making sure that inserts, updates, and selects go to the right shards.  To ensure redundancy, sharded tables can be defined as having multiple copies across multiple shards.

SELECTs are distributed based on comparing constants in your WHERE clause and other filter conditions against the hashed key, allowing you to query against only the shards which have compliant data.  This should speed up queries for really large tables a great deal.

Where there are multiple copies of shards, pg_shard notices timeouts in contacting the shards and marks certain shards as offline. 

All of this is done inside PostgreSQL, most of it using our defined APIs and hooks.  The only thing the Citus team had to fork was ruleutils.c, and they're hoping to change that in 9.5 so forking that isn't required either.


What does pg_shard not do?

First, since the rows are distributed using a hash function, only equality comparisons are currently supported.  This means that if you query your table on ranges (such as BETWEEN two timestamps) it will scan all shards for that data.  Adding range partitioning to pg_shard is planned.

Second, unlike Postgres-XC/Postgres-XL, transactions between shards and distributed JOINs are not supported.  Those features are among the things which distinguish CitusData's proprietary product.

Third, currently only one master node is supported.  That means that pg_shard isn't yet a solution for supporting applications which need massive numbers of connections.   Supporting multiple query nodes is planned for the future.

Complex distribution rules, like requiring each shard to be copied to a specific subset of hosts in a different availability zone or data center, is also not supported.  It's not clear when/if it will be.


So what is pg_shard useful for right now?

A lot of web applications consist of "One Big Table" (OBT) which by size is 90% or more of the database. This might be a users table, or a messages table, a log table, or a pictures table.  Years ago, this was exactly the problem we faced with Instagram.

pg_shard will allow users to distribute the OBT across many nodes in a public cloud, letting users scale to much larger sizes.  Particularly, it makes it much easier to redistribute the OBT so that it fits in RAM on each cloud node, both decreasing overall cloud hosting cost and improving throughput and response times.  And since there's shard redundancy, they are protected against certain kinds of cloud failures.

As Instagram and Wanelo have done, you can implement this kind of sharding on the application layer ... and when you get to their scale, you'll probably have to.  But most PostgreSQL users aren't at that scale yet, and pg_shard can help them get over the "hump" of going from one server to several with a lot less pain.


Can you give some examples of pg_sharding a database?

First, over the next week I'll be building a "Postgre-NoSQL" database using PostgreSQL 9.4, pg_shard and JSONB, to demonstrate that you can replace some non-relational databases with PostgreSQL.  Expect more on this blog.

If you're in the Bay Area, then join us for an SFPUG meeting next week, which will be all about pg_shard.  If possible, we'll have live video, but no promises.


  1. Nice. This is like a transparent version of my shard_manager extension. Mine's all SQL based, but that also means the metadata can be mirrored to avoid depending on a master node. Caching the shard mapping makes even that unnecessary.

    Regardless, advancements like pg_shard always excite me, because it gets us closer to truly transparent distributed databases. Thanks for all the hard work, Josh!

    1. Oh, I hadn't seem shard_manager. Given that you focus mostly on ID creation, something which is currently lacking in pg_shard, the two extensions seem like a good combination if we could work them out.

      Oh, and the Citus guys did all the hard work. I just tested.

    2. Ah. It sounded almost like you worked with them to develop it. My bad! :)

  2. Even if the City's guy did the hard work, you did the work with testing to write about this. That is also a lot of hard work, thanks for that.