Database Soup: May 2013

Tuesday, May 21, 2013

PostgreSQL New Development Priorities 5: New User Experience

So, I started this looking for our five major goals for future PostgreSQL develoment. The last goal is more nebulous, but I think equally important with the other goals. It's this: improve the "new user experience".

This is not a new goal, in some ways. Improving installation, one of our previous 5 goals, was really about improving the experience for new users. But the new user experience goes beyond installation now, and competition has "raised the bar". That is, we matched MySQL, but now that's not good enough; we need to match the new databases. It should be as easy to get started on a dev database with PostgreSQL as it is with, for example, Redis. Let me give you a summary of the steps to get up, running, and developing an application in the two platforms:

Redis:

install Redis, either from packages or multiplatform binaries. No root access is required for the binaries.
read a 1-page tutorial
run redis-server
run redis-cli or install drivers for your programming language
start developing
when your app works, deploy to production
in production, tune how much RAM Redis gets.

PostgreSQL:

install PostgreSQL from packages or the one-click installer. Root/Admin access is usually required.
search the documentation to figure out how to get started.
figure out whether or not your packages automatically start Postgres. If not, figure out how to start it. This may require root access.
Install drivers for your programming language.
Figure out how to connect to PostgreSQL. This may require making changes to configuration files.
Read more pages of documentation to learn the basics of PostgreSQL's variety of SQL, or how to program an ORM which works with PostgreSQL.
Start developing.
Deploy to production.
Read 20 pages of documentation, plus numerous blogs, wiki pages and online presentations in order to figure out how to tune PostgreSQL.
Tune PostgreSQL for production workload. Be unsure if you've done it right.

The unfortunate reality is that a new user will hit a lot of points in the "getting to know PostgreSQL" where they can be stuck, confused, and at a loss. At those points, they may decide to try something else, and never come back. I've seen it happen; just last SFPUG I was talking to a guy who started on Postgres, ran into a shared memory issue, switched to Mongo, and didn't come back to Postgres for 2 years.

So, what can we do about it? Well, a few things:

better new user tutorials, such as the ones on postgresguide.org
better autotuning, made a lot easier to implement as of version 9.3.
a "developer mode PostgreSQL"

The last would be a version of PostgreSQL which starts when the developer opens a psql prompt, shuts down when they exit, starts with minimal processes and crash safety turned off, and above all with a security configuration which allows that user to immediately connect to PostgreSQL without figuring anything else out. With some of the work on recovery mode supplying a single-user Postgres, this should become easier, but it needs a lot more work.

Those are the five things I can see which would greatly expand the market for PostgreSQL and keep us competitive against the new databases. Yes, I'm talking really big features, but any two out of the five would still make a big difference for us. There may be others; now that you've seen the kind of big feature I'm talking about, put your suggestions below.

Monday, May 20, 2013

PostgreSQL New Development Priorities 4: Parallel Query

Parallel query is the first priority from those suggested in the comments that I agree should be a major PostgreSQL development priority. I think that Joel Jacobson summarized it neatly: Bring Back Moore's Law. Vertical scaling has always been one of PostgreSQL's strengths, but we're running into hard limits as servers are getting more cores but not faster cores. We need to be able to use a server's full CPU capacity.

(note: this series of articles is my personal opinion as a PostgreSQL core team member)

The benefits to having some kind of parallel query are obvious to most users and developers today. Mostly, people tend to think of analytics and parallel query across terabyte-sized tables, and that's definitely one of the reasons we need parallel query. But possibly a stronger reason, which isn't much talked about, is CPU-heavy extensions -- chief among them, PostGIS. All of those spatial queries are very processor-heavy; a location search takes a lot of math, a spatial JOIN more so. While most users of large databases would like parallel query in order to do things a bit faster, PostGIS users need parallism yesterday.

Fortunately, work on parallelism has already started. Even more fortunately, parallel query isn't a single monumental thing which has to be done as one big chunk; we can add parallelism piecemeal over the next few versions of Postgres. Rougly, parallel query breaks down into parallelizing all of the following operations:

Table scan
Index scan
Bitmap scan
In-memory sort
On-disk sort
Hashing
Merge Join
Nested loop join
Aggregation
Framework for parallel functions

Most of these features can be worked on independently, in any order -- dare I say, developed in parallel? Joins probably need to be done after sorts and scans, but that's pretty much it. Noah Misch has chosen to start with parallel in-memory sort, so you can probably expect that for version 9.4.

Saturday, May 18, 2013

PostgreSQL New Development Priorities 3: Pluggable Parser

Really, when you look at the long-term viability of any platform, pluggability is where it's at. A lot of the success of PostgreSQL to date has been built on extensions and portability, just as the success of AWS has been built on their comprehensive API. Our future success will be built on taking pluggability even further.

In addition to pluggable storage, a second thing we really need is a pluggable parser interface. That is, it should be possible to generate a statement structure, in binary form, and hand that off to libpq for execution. There was recently some discussion about this on -hackers.

If there were a way to hand off expression trees directly to the planner, then this would allow creating extensions which actually had additional syntax, without having to fork PostgreSQL. This would support most of those "compatibility" extensions, as well as potentially allowing extensions like SKYLINE OF which change SQL behavior.

It would also help support PostgreSQL-based clustered databases, by allowing all of the parsing for a particular client to happen on a remote node and get passed to the clustered backends. The pgPool2 project has asked for this for several years for that reason.

More intriguingly, it would allow for potentially creating an "ORM" which doesn't have to serialize everything to SQL, but can instead build expression trees directly based on client code. This would both improve response times, and encourage developers to use a lot of PostgreSQL's more sophisticated features since they could access them directly in their code.

Taking things a step further, we could extend this to allow users to hand a plan tree directly to the executor. This would fix things for all of the users who actually need query hints (as opposed to those who think they need them), as well as taking efficiency a step beyond cached plans.

There are a lot of reasons this would be just as difficult to do as pluggable storage. Currently parsing depends on a context-dependant knowledge of system catalogs, including things like search_path. So I have no idea what it would even look like. But a parser API is something that people who hack on Postgres and fork it will continue to ask for.

Thursday, May 16, 2013

PostgreSQL New Development Priorities 2: Pluggable Storage

Over the last decade, Greenplum, Vertica, Everest, Paraccel, and a number of non-public projects all forked off of PostgreSQL. In each case, one of the major changes to the forks was to radically change data storage structures in order to enable new functionality or much better performance on large data. In general, once a Postgres fork goes through the storage change, they stop contributing back to the main project because their codebase is then different enough to make merging very difficult.

Considering the amount of venture capital money poured into these forks, that's a big loss of feature contributions from the community. Especially when the startup in question gets bought out by a company who buries it or loots it for IP and then kills the product.

More importantly, we have a number of people who would like to do something interesting and substantially different with PostgreSQL storage, and will likely be forced to fork PostgreSQL to get their ideas to work. Index-organized tables, fractal trees, JSON trees, EAV-optimized storage, non-MVCC tables, column stores, hash-distributed tables and graphs all require changes to storage which can't currently be fit into the model of index classes and blobs we offer for extensibility of data storage. Transactional RAM and Persistent RAM in the future may urge other incompatible storage changes.

As a community, we want to capture these innovations and make them part of mainstream Postgres, and their users part of the PostgreSQL community. The only way to do this is to have some form of pluggable storage, just like we have pluggable function languages and pluggable index types.

The direct way to do this would be to refactor our code to replace all direct manipulation of storage and data pages with a well-defined API. This would be extremely difficult, and would produce large performance issues in the first few versions. It would, however, also have the advantage of allowing us to completely solve the binary upgrade of page format changes issue.

A second approach would be to do a MySQL, and build up Foreign Data Wrappers (FDWs) to the point where they could perform and behave like local tables. This may be the more feasible route because the work could be done incrementally, and FDWs are already a well-defined API. However, having Postgres run administration and maintenance of foreign tables would be a big step and is conceptually difficult to imagine.

Either way, this is a problem we need to solve long-term in order to continue expanding the places people can use PostgreSQL.

Wednesday, May 15, 2013

Unconference, 9.3 Beta

The first pgCon Unconference is only 10 days away, and we now have room numbers. If you want to lead a topic ... or, more importantly, if you want someone else to lead a topic ... please add your topic ideas to the wiki page! We're expecting over 100 people at the unconference, given that pgCon registration is up above 225, more than 10% higher than last year at this point.

Second, my article on the PostgreSQL 9.3 Beta is up at LWN.net (subscription required, or wait 2 weeks).

PostgreSQL New Development Priorities: Scale It Now

The comments on my introductory post on this topic mentioned a lot of the major features which users would like to see in PostgreSQL. Among those mentioned were:

Improvements to replication
Parallel query
Index-organized tables
Better partitioning
Better full-text search
Logical streaming replication

However, as with other projects, our perennial temptation is to listen to current users rather than potential users. We can focus on making PostgreSQL better for the people who already use it. It's attractive, but that way lies obsolescence.

What we need to focus on is the reasons why people don't use PostgreSQL at all. Only by exploiting new markets -- by pushing Postgres into places which never had a database before -- do we grow the future PostgreSQL community. And there's a bunch of ways we are failing new users.

For example, listen to Nelson Elhage, engineer at Stripe.com:

"I love Mongo's HA story. Out of the box I can build a 3-node Mongo cluster with a full replica set. I can add nodes, I can fail over, without losing data."

Wouldn't it be nice if we could say the same thing about Postgres? But we can't.

If we're looking for a #1 PostgreSQL development priority, this is it:

We need a "scale it now" button.

This is where we're losing ground to the new databases. In every other way we are superior: durability, pluggability, standards-compliance, query sophistication, everything. But when a PostgreSQL user outstrips the throughput of a single server or a single EC2 instance, our process for scaling out sucks. It's complicated. It has weird limitations. And most of all, our scale-out requires advanced database expertise, which is expensive and in short supply.

We need some way for users to go smoothly and easily from one Postgres node to three and then to ten. Until we do that, we will continue to lose ground to databases which do a better job at scaling, even if they suck at everything else.

I don't know what our answer to scaling out will be. Streaming logical replication, PostgresXC, Translattice, Hadapt, something else ... I don't know. But if we don't answer this vital question, all our incremental improvements will amount to naught.

Tuesday, May 7, 2013

Open Source NPO Accounting

The Software Freedom Conservancy has announced a campaign to fund the development of the first free and open non-profit accounting software. Given the state of open source software for nonprofits (see below), this is something I strongly support. It also touches the PostgreSQL community since our own NPOs/NGOs (Non-Profit/Non-Goverment Organizations) lack software for accounting.

For those who don't do accounting or accounting software, you might look at the wealth of ERP packages like LedgerSMB, OpenERP, xTuple, Adempiere, etc. and think we're well-covered. However, ERP packages only really work for "goods" businesses, which sell packaged products to consumers or to other businesses. The entire orientation of an ERP is around cost accounting for the cost of goods, and fulfilling numerous orders for a relatively fixed list of customers. This doesn't work well for NPOs, who receive a large number of variable donations with no cost basis from a fluid body of donors. So while it's possible that an ERP system could be adapted to non-profit accounting, it would be a substantial and difficult adaptation, probably requiring forking the parent project.

Currently, 100% of the mature software for doing non-profit accounting is proprietary. Most of this software is also quite expensive, outdated, and limiting, taking a lot of time and money away from charitable organizations who desperately need more of both. Premier NPO package Raiser's Edge, for example, generally costs non-profits as much as the salary of three staff; I know one museum who had to lay off so many people to afford it that they had nobody to enter data into the computer! NPOs have no alternative, though; while they can go without software for a lot of things, accounting isn't one of them.

Among the NPOs which lack good software to track donations, donors, expenses, funds and tax reports are our own PostgreSQL non-profits. Chief among these is Software in the Public Interest, which, as a "fiscal sponsor" of dozens of open source projects, needs much more sophisticated financial software than our smaller funds.

Of course, I'm also hoping that we end up with a PostgreSQL-based solution. But at this point I'll take anything which gets the job done and is open and free. The need is that bad.

Anyway, that's why I'm supporting the Conservancy Campaign for NPO software. You could, too.