Database Soup: PostgreSQL New Development Priorities 4: Parallel Query

Monday, May 20, 2013

PostgreSQL New Development Priorities 4: Parallel Query

Parallel query is the first priority from those suggested in the comments that I agree should be a major PostgreSQL development priority. I think that Joel Jacobson summarized it neatly: Bring Back Moore's Law. Vertical scaling has always been one of PostgreSQL's strengths, but we're running into hard limits as servers are getting more cores but not faster cores. We need to be able to use a server's full CPU capacity.

(note: this series of articles is my personal opinion as a PostgreSQL core team member)

The benefits of having some kind of parallel query are obvious to most users and developers today. Mostly, people tend to think of analytics and parallel query across terabyte-sized tables, and that's definitely one of the reasons we need parallel query. But possibly a stronger reason, which isn't much talked about, is CPU-heavy extensions -- chief among them, PostGIS. All of those spatial queries are very processor-heavy; a location search takes a lot of math, a spatial JOIN more so. While most users of large databases would like parallel query in order to do things a bit faster, PostGIS users need parallelism yesterday.

Fortunately, work on parallelism has already started. Even more fortunately, parallel query isn't a single monumental thing which has to be done as one big chunk; we can add parallelism piecemeal over the next few versions of Postgres. Roughly, parallel query breaks down into parallelizing all of the following operations:

Table scan
Index scan
Bitmap scan
In-memory sort
On-disk sort
Hashing
Merge Join
Nested loop join
Aggregation
Framework for parallel functions

Most of these features can be worked on independently, in any order -- dare I say, developed in parallel? Joins probably need to be done after sorts and scans, but that's pretty much it. Noah Misch has chosen to start with parallel in-memory sort, so you can probably expect that for version 9.4.
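
To make the idea concrete, here is a minimal sketch of what "parallelizing" two of the operations above (in-memory sort and aggregation) means in general: split the input, work on the pieces independently, then combine the partial results. This is an illustration only and not how PostgreSQL does or will implement it; the function names (split_chunks, parallel_sort, parallel_sum) and the workers parameter are hypothetical. It also uses a thread pool for simplicity, which in CPython shows the structure but will not give a real CPU speedup for pure-Python work; a process pool would be needed for that.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split_chunks(values, workers):
    """Split a list into at most `workers` contiguous chunks (hypothetical helper)."""
    if not values:
        return []
    chunk_size = -(-len(values) // workers)  # ceiling division
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def parallel_sort(values, workers=4):
    """Sort each chunk in its own worker, then k-way merge the sorted runs."""
    chunks = split_chunks(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, chunks))
    # heapq.merge lazily merges already-sorted inputs into one sorted stream.
    return list(heapq.merge(*runs))


def parallel_sum(values, workers=4):
    """Compute a partial sum per chunk, then combine the partial results."""
    chunks = split_chunks(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 2, 8]
    print(parallel_sort(data, workers=3))
    print(parallel_sum(data, workers=3))
```

The final combine step (the merge, or the sum of partial sums) is still serial in this sketch, which is one reason each operation in the list needs its own design work rather than a single generic switch.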

4 comments:

Unknown, May 21, 2013 at 4:49 PM
This is one area where Oracle and Microsoft SQL Server have Postgres beat. It will be a huge win if this gets in across the board.
Unknown, July 3, 2013 at 12:32 AM
This would be super exciting. It definitely feels wasteful seeing all these 32 CPU cores completely idle while one is struggling to get the query to run, especially when dealing with the SSD RAIDs that just got launched on the Joyent cloud.

Floris, March 11, 2014 at 6:07 AM
Some relevant work has been done here; it is the outcome of a big Russian research program:
PargreSQL [http://ceur-ws.org/Vol-735/paper10.pdf] and [http://www.docstoc.com/docs/152845963/ZymblerP_slides_Heidelberg-12]
The paper describes the architecture and design of the PargreSQL parallel database management system (DBMS) for distributed-memory multiprocessors. PargreSQL is based on the PostgreSQL open-source DBMS and exploits partitioned parallelism.