Tuesday, May 21, 2013

PostgreSQL New Development Priorities 5: New User Experience

So, I started this series looking for our five major goals for future PostgreSQL development.  The last goal is more nebulous, but I think it's just as important as the others.  It's this: improve the "new user experience".

This is not a new goal, in some ways.  Improving installation, one of our previous 5 goals, was really about improving the experience for new users.  But the new user experience goes beyond installation now, and the competition has "raised the bar".  That is, we matched MySQL, but now that's not good enough; we need to match the new databases.  It should be as easy to get started on a dev database with PostgreSQL as it is with, for example, Redis.  Let me give you a summary of the steps to get up, running, and developing an application on the two platforms (there's a code sketch after the lists to make the contrast concrete):

Redis:
  1. install Redis, either from packages or multiplatform binaries.  No root access is required for the binaries.
  2. read a 1-page tutorial
  3. run redis-server
  4. run redis-cli or install drivers for your programming language
  5. start developing
  6. when your app works, deploy to production
  7. in production, tune how much RAM Redis gets.
PostgreSQL:
  1. install PostgreSQL from packages or the one-click installer.  Root/Admin access is usually required.
  2. search the documentation to figure out how to get started. 
  3. figure out whether or not your packages automatically start Postgres.  If not, figure out how to start it.  This may require root access.
  4. Install drivers for your programming language.
  5. Figure out how to connect to PostgreSQL.  This may require making changes to configuration files.
  6. Read more pages of documentation to learn the basics of PostgreSQL's variety of SQL, or how to program an ORM which works with PostgreSQL.
  7. Start developing.
  8. Deploy to production.
  9. Read 20 pages of documentation, plus numerous blogs, wiki pages and online presentations in order to figure out how to tune PostgreSQL.
  10. Tune PostgreSQL for production workload.  Be unsure if you've done it right.
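Here is roughly what that contrast looks like from Python at the "start developing" step.  This is only a sketch, assuming the redis-py and psycopg2 drivers are installed and both servers are running; note that the PostgreSQL half also assumes authentication is already sorted out, which is exactly the step that trips new users up.

    # Redis: works immediately after "redis-server" starts (assumes redis-py)
    import redis
    r = redis.Redis()              # defaults to localhost:6379, no auth setup
    r.set('greeting', 'hello')
    print(r.get('greeting'))

    # PostgreSQL: assumes psycopg2, a running server, and a pg_hba.conf
    # that lets this user in -- steps 3 and 5 in the list above
    import psycopg2
    conn = psycopg2.connect(dbname='postgres', user='postgres')
    cur = conn.cursor()
    cur.execute("SELECT 'hello'")
    print(cur.fetchone()[0])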
The unfortunate reality is that a new user will hit a lot of points in the "getting to know PostgreSQL" process where they can get stuck, confused, and at a loss.  At any of those points, they may decide to try something else and never come back.  I've seen it happen: just last SFPUG I was talking to a guy who started on Postgres, ran into a shared memory issue, switched to Mongo, and didn't come back to Postgres for two years.

So, what can we do about it?  Well, a few things:
  • better new user tutorials, such as the ones on postgresguide.org
  • better autotuning, made a lot easier to implement as of version 9.3.
  • a "developer mode PostgreSQL"
The last would be a version of PostgreSQL which starts when the developer opens a psql prompt and shuts down when they exit, runs with minimal processes and crash safety turned off, and above all has a security configuration which allows that user to connect immediately without figuring anything else out.  With some of the current work on recovery mode supplying a single-user Postgres, this should become easier, but it still needs a lot more work.
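For the curious, you can fake something like developer mode with today's tools: create a throwaway cluster with trust authentication and crash safety off, then tear it down when you're done.  The sketch below assumes initdb and pg_ctl are on your PATH and the default port is free; the point of a real developer mode would be to make all of this invisible.

    # A rough approximation of "developer mode": a disposable cluster with
    # trust auth and crash safety turned off.  Sketch only; assumes initdb
    # and pg_ctl are on the PATH and port 5432 is free.
    import atexit, shutil, subprocess, tempfile

    datadir = tempfile.mkdtemp(prefix='pgdev_')
    subprocess.check_call(['initdb', '-D', datadir, '--auth=trust'])
    subprocess.check_call(['pg_ctl', '-D', datadir, '-w',
                           '-o', '-c fsync=off -c full_page_writes=off',
                           'start'])

    def cleanup():
        # immediate shutdown is fine here: nothing in this cluster matters
        subprocess.call(['pg_ctl', '-D', datadir, '-m', 'immediate', 'stop'])
        shutil.rmtree(datadir, ignore_errors=True)
    atexit.register(cleanup)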

Those are the five things I can see which would greatly expand the market for PostgreSQL and keep us competitive against the new databases.  Yes, I'm talking really big features, but any two out of the five would still make a big difference for us.  There may be others; now that you've seen the kind of big feature I'm talking about, put your suggestions below.

Monday, May 20, 2013

PostgreSQL New Development Priorities 4: Parallel Query

Parallel query is the first priority from those suggested in the comments that I agree should be a major PostgreSQL development priority.  I think that Joel Jacobson summarized it neatly: Bring Back Moore's Law.  Vertical scaling has always been one of PostgreSQL's strengths, but we're running into hard limits as servers are getting more cores but not faster cores.  We need to be able to use a server's full CPU capacity.

(note: this series of articles is my personal opinion as a PostgreSQL core team member)

The benefits to having some kind of parallel query are obvious to most users and developers today.  Mostly, people tend to think of analytics and parallel query across terabyte-sized tables, and that's definitely one of the reasons we need parallel query.  But possibly a stronger reason, which isn't much talked about, is CPU-heavy extensions -- chief among them, PostGIS.  All of those spatial queries are very processor-heavy; a location search takes a lot of math, and a spatial JOIN even more.  While most users of large databases would like parallel query in order to do things a bit faster, PostGIS users need parallelism yesterday.
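To see why, consider what a simple proximity search involves.  The database, table, and column names below are hypothetical, but the PostGIS functions are real: every row checked means genuine geometry math, and all of it currently runs on a single core.

    # Hypothetical PostGIS proximity search.  Each ST_DWithin test is real
    # trigonometry per row, and the whole query runs on one CPU core today.
    import psycopg2
    conn = psycopg2.connect(dbname='gisdb')        # hypothetical database
    cur = conn.cursor()
    cur.execute("""
        SELECT name
          FROM points_of_interest                  -- hypothetical table
         WHERE ST_DWithin(geom::geography,
                          ST_MakePoint(-122.42, 37.77)::geography,
                          5000)                    -- within 5km (meters)
    """)
    for (name,) in cur.fetchall():
        print(name)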

Fortunately, work on parallelism has already started.  Even more fortunately, parallel query isn't a single monumental thing which has to be done as one big chunk; we can add parallelism piecemeal over the next few versions of Postgres.  Roughly, parallel query breaks down into parallelizing all of the following operations:

  • Table scan
  • Index scan
  • Bitmap scan
  • In-memory sort
  • On-disk sort
  • Hashing
  • Merge Join
  • Nested loop join
  • Aggregation
  • Framework for parallel functions

Most of these features can be worked on independently, in any order -- dare I say, developed in parallel?  Joins probably need to be done after sorts and scans, but that's pretty much it.  Noah Misch has chosen to start with parallel in-memory sort, so you can probably expect that for version 9.4.
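To illustrate the general shape of the problem -- this is a toy sketch in Python, not how the backend works -- here is a scan plus aggregation split across worker processes: scan the partitions in parallel, then merge the partial results.  The hard part is doing exactly this inside the executor, for each of the operations listed above.

    # Toy parallel scan + aggregation: partition the "table", aggregate
    # each partition in a worker process, then merge the partial results.
    from multiprocessing import Pool

    table = list(range(1000000))      # stand-in for a large table

    def partial_sum(chunk):
        return sum(chunk)             # per-worker aggregation step

    if __name__ == '__main__':
        nworkers = 4
        size = len(table) // nworkers
        chunks = [table[i * size:(i + 1) * size] for i in range(nworkers)]
        chunks[-1].extend(table[nworkers * size:])    # leftover rows
        with Pool(nworkers) as pool:
            partials = pool.map(partial_sum, chunks)  # parallel "scans"
        print(sum(partials))          # final merge of partial aggregates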

Thursday, May 16, 2013

PostgreSQL New Development Priorities 2: Pluggable Storage

Over the last decade, Greenplum, Vertica, Everest, Paraccel, and a number of non-public projects all forked off of PostgreSQL.  In each case, one of the major changes to the forks was to radically change data storage structures in order to enable new functionality or much better performance on large data.  In general, once a Postgres fork goes through the storage change, they stop contributing back to the main project because their codebase is then different enough to make merging very difficult.

Considering the amount of venture capital money poured into these forks, that's a big loss of feature contributions for the community.  Especially when the startup in question gets bought out by a company that buries it, or loots it for IP and then kills the product.

More importantly, we have a number of people who would like to do something interesting and substantially different with PostgreSQL storage, and who will likely be forced to fork PostgreSQL to get their ideas to work.  Index-organized tables, fractal trees, JSON trees, EAV-optimized storage, non-MVCC tables, column stores, hash-distributed tables and graphs all require changes to storage which can't currently be fit into the model of index classes and blobs we offer for extensibility of data storage.  Transactional RAM and persistent RAM may prompt further incompatible storage changes in the future.

As a community, we want to capture these innovations and make them part of mainstream Postgres, and their users part of the PostgreSQL community.  The only way to do this is to have some form of pluggable storage, just like we have pluggable function languages and pluggable index types.

The direct way to do this would be to refactor our code to replace all direct manipulation of storage and data pages with a well-defined API.  This would be extremely difficult, and would produce significant performance issues in the first few versions.  It would, however, also have the advantage of allowing us to completely solve the problem of binary upgrades across page format changes.
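To give a flavor of what "a well-defined API" means here, the sketch below is purely hypothetical -- the real thing would be a table of C function pointers inside the backend, not Python -- but every operation the executor performs directly against heap pages today would have to go through something like it.

    # Purely hypothetical sketch of a pluggable storage interface.  A column
    # store, index-organized table, or non-MVCC engine would each supply its
    # own implementation of these operations.
    from abc import ABC, abstractmethod

    class TableStorage(ABC):
        @abstractmethod
        def begin_scan(self, snapshot):
            """Start a scan of the table; returns opaque scan state."""

        @abstractmethod
        def next_tuple(self, scan_state):
            """Return the next tuple visible to the snapshot, or None."""

        @abstractmethod
        def insert_tuple(self, tup, xid):
            """Store a tuple however this engine lays out data."""

        @abstractmethod
        def vacuum(self):
            """Reclaim dead space -- or do nothing, for non-MVCC storage."""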

A second approach would be to "do a MySQL", and build up Foreign Data Wrappers (FDWs) to the point where they can perform and behave like local tables.  This may be the more feasible route, because the work can be done incrementally and FDWs are already a well-defined API.  However, having Postgres run administration and maintenance on foreign tables would be a big step, and one that's conceptually difficult to imagine.
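The user-facing half of that API already exists today.  For example, contrib's file_fdw can attach a CSV file as a read-only table -- the file path and table definition below are made up, but the DDL is real.  What's missing is everything behind it: writes for most wrappers, VACUUM and ANALYZE, statistics, and so on.

    # The existing FDW DDL: attach a CSV file as a foreign table using
    # contrib's file_fdw.  The filename and columns are hypothetical.
    import psycopg2
    conn = psycopg2.connect(dbname='postgres')
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS file_fdw")
    cur.execute("CREATE SERVER files FOREIGN DATA WRAPPER file_fdw")
    cur.execute("""
        CREATE FOREIGN TABLE measurements (ts timestamptz, reading numeric)
            SERVER files
            OPTIONS (filename '/tmp/measurements.csv', format 'csv')
    """)
    conn.commit()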

Either way, this is a problem we need to solve long-term in order to continue expanding the places people can use PostgreSQL.