Monday, May 20, 2013

PostgreSQL New Development Priorities 4: Parallel Query

Parallel query is the first priority from those suggested in the comments that I agree should be a major PostgreSQL development priority.  I think that Joel Jacobson summarized it neatly: Bring Back Moore's Law.  Vertical scaling has always been one of PostgreSQL's strengths, but we're running into hard limits as servers are getting more cores but not faster cores.  We need to be able to use a server's full CPU capacity.

(note: this series of articles is my personal opinion as a PostgreSQL core team member)

The benefits of having some kind of parallel query are obvious to most users and developers today.  Mostly, people tend to think of analytics and parallel query across terabyte-sized tables, and that's definitely one of the reasons we need parallel query.  But possibly a stronger reason, which isn't much talked about, is CPU-heavy extensions -- chief among them, PostGIS.  All of those spatial queries are very processor-heavy; a location search takes a lot of math, a spatial JOIN more so.  While most users of large databases would like parallel query in order to do things a bit faster, PostGIS users need parallelism yesterday.

Fortunately, work on parallelism has already started.  Even more fortunately, parallel query isn't a single monumental thing which has to be done as one big chunk; we can add parallelism piecemeal over the next few versions of Postgres.  Roughly, parallel query breaks down into parallelizing all of the following operations:

  • Table scan
  • Index scan
  • Bitmap scan
  • In-memory sort
  • On-disk sort
  • Hashing
  • Merge Join
  • Nested loop join
  • Aggregation
  • Framework for parallel functions

Most of these features can be worked on independently, in any order -- dare I say, developed in parallel?  Joins probably need to be done after sorts and scans, but that's pretty much it.  Noah Misch has chosen to start with parallel in-memory sort, so you can probably expect that for version 9.4.
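
To make the in-memory sort case concrete, here is a minimal sketch in Python of the general shape of a parallel sort: split the input into chunks, sort each chunk in its own worker, then merge the sorted runs.  This is only an illustration of the idea, not PostgreSQL code and not the design being worked on; the function name parallel_sort and the workers parameter are made up for this example.  It uses threads to stay self-contained, so in CPython it shows the structure rather than a real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(values, workers=4):
    """Sort `values` by sorting chunks concurrently, then merging the runs.

    Hypothetical helper for illustration only; not a PostgreSQL API.
    """
    if not values:
        return []

    # Split the input into at most `workers` contiguous chunks.
    chunk_size = -(-len(values) // workers)  # ceiling division
    chunks = [values[i:i + chunk_size]
              for i in range(0, len(values), chunk_size)]

    # Sort each chunk in its own worker; each result is a sorted run.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, chunks))

    # A k-way merge combines the sorted runs into one sorted list.
    return list(heapq.merge(*runs))
```

Calling parallel_sort(rows, workers=8) returns the same result as sorted(rows).  The final merge is still serial, which is one reason a parallel sort does not scale perfectly with the number of cores.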

4 comments:

  1. This is one area where Oracle and Microsoft SQL Server have Postgres beat. It will be a huge win if this gets in across the board.

  2. This comment has been removed by the author.

  3. This would be super exciting. It definitely feels wasteful seeing all these 32 CPU cores completely idle while one is struggling to run the query, especially when dealing with the SSD RAIDs that just got launched on Joyent cloud.

  4. Some relevant work has been done here; it is the outcome of a big Russian research program:
    PargreSQL [http://ceur-ws.org/Vol-735/paper10.pdf] and [http://www.docstoc.com/docs/152845963/ZymblerP_slides_Heidelberg-12]
    The paper describes the architecture and the design of the PargreSQL parallel database management system (DBMS) for distributed memory multiprocessors. PargreSQL is based upon the PostgreSQL open-source DBMS and exploits partitioned parallelism.
