Saturday, February 22, 2014

Why HStore2/jsonb is the most important patch of 9.4

There are a bunch of features which are pending for 9.4, still, and a bunch of features which are already committed.  Given how interesting some of those are: SET PERSISTENT, Logical Changeset Extraction, Materialized Views, etc., I think a lot of people will be surprised that I think Hstore2/jsonb is the single most important patch -- important enough that I think we shouldn't release 9.4 unless it goes in.  Why would I make this wild assertion?  Let me explain.

Open source databases rise and fall on the popularity of the programming languages which use those databases.  MySQL largely rose on the success of PHP, and it fell as PHP became marginalized.  Our current PostgreSQL salad days are based largely on the current hegemony of Python, Ruby, and Rails.  While other events have fed into changes in adoption, where the numbers of developers come from is really "what is the default database for popular language X".

While the future is unpredictable, the current momentum in programming languages is behind two platforms: Node.js and Go.  PostgreSQL already enjoys good support and adoption among Go users.  However, our adoption in the Node.js community is less encouraging.

I was given a set of statistics I'm not allowed to publish, but I can summarize them.  Two of them are fairly alarming:
  • PostgreSQL is the database for fewer than 1 out of 8 Node.js deployments which use a database.
  • The rise in popularity of MongoDB almost exactly parallels the rise in usage of Node.js.
If you've watched database adoption trends for the last 20 years like I have, this is alarming.  We are in danger of being sidelined.  If we want PostgreSQL 10.5 to enjoy the same level of adoption that version 9.3 does, then we need to appeal to Node.js users and whatever comes after them.

What do Node.js users want that we don't have?  There are three main things that I've been able to identify:
  1. A better, faster driver which fully supports asynchronous querying.
  2. Relatively painless multi-node scaling.
  3. Full, indexed support for jsonish hierarchical data and queries.
The first two points need to happen outside the core PostgreSQL project, at least for 9.4.  However, the last point is very much on the table; we have the HStore2/jsonb patch pending.  If that goes in, the PostgreSQL project will be seen as still making progress and still relevant to Node.js users and to other people who like document databases.  If it gets booted to 9.5, and there is no discernible progress on JSON features in 9.4, I believe that we will have permanently conceded the bulk of the database market to the new databases for the foreseeable future.

Oh, and if anyone wants to work on our Node driver ... please pitch in!

44 comments:

  1. Unfortunately, what's good for adoption is not what drives PG development priorities.

    Implement real partitioning, index-organized tables, bitmap indexes, MERGE, and true stored procedures, and PG would start stealing market share from Oracle at a whiplash-inducing rate. Stealing adoption from Mongo is fun and all, but the trend is that Mongo users eventually abandon it anyway because it's basically terrible. Oracle, on the other hand, 1. is here to stay for the foreseeable future and 2. creates an enormous amount of vendor lock-in that its customers gradually grow to hate. They can't escape though, because the open source alternatives don't have feature parity and they don't want to re-engineer their applications around those more limited feature sets.

    JSON support in PG is great, and I'm a big fan, but it gets worked on because OSS developers think it's sexy, not because it's what PG needs.

    Replies
    1. I really share this vision. The most important thing for Postgres is to continue to implement relational features and be a contender to Oracle. That's what most of us see as the value of PostgreSQL.
      Then, the developer community around PG is great and it's understandable that developers choose to work on their preferred themes. But this can be a double-edged sword. I would really not be interested in a PostgreSQL database that abandons its solid relational roots to follow the fashion of the moment.

    2. Noah, you've confused EnterpriseDB's business model with an adoption strategy. There's definitely a bunch of money to be made stealing Oracle users. There is no future for PostgreSQL in doing that. The Oracle market, while presently large, is not growing, at all. The actual population of Oracle users who can defect to PostgreSQL is even smaller than that. No matter how many features we ever add, we can never be better at being Oracle than Oracle can.

      Innovative technologies succeed because they reach new users and grow the market. MySQL went from being 2-developer shareware to the #3 most adopted SQL database by growing the market. MongoDB has gone from being a joke to being a serious database option for many companies by driving adoption among new developers. Outside of our industry, the iPhone succeeded because people bought them who'd never owned a high-end phone before.

      To the extent that we can grow the market and reach new users who never seriously used a database before, we will thrive. If we focus entirely on cannibalizing the existing SQL database market, we will die.

    3. ... continued.

      While PostgreSQL hackers work on the things they and their companies find important, we *collectively* make decisions on which pending patches are important. Which patches we decide are worth extra review, time, and work in the commitfests and elsewhere are *certainly* strategic project decisions. If we didn't do such things, we would not have binary replication now.

      Right now, in the CF, hstore2/jsonb is being treated as a 2nd-tier patch, something which it would be nice to have in 9.4 but isn't critically important. I'm arguing that we should change that.

    4. To clarify, the "Noah" posting above is not Noah Misch.

    5. I just launched a startup. We use Node.js and PostgreSQL extensively.
      hstore2/jsonb will be handsome.

    6. I really can't agree with this. Josh is spot on. I work at a startup and I interact with lots of other startups and we're all using Node, Python, or Go, (some are still on Ruby). We all use JSON extensively and we're also smart enough to not ditch relational databases entirely. Many of us use Postgres, and many of us feel the pain of not having fast JSON support. This isn't meant as a replacement for most workloads, just the ones where a flexible schema is necessary.

      JSON support is make or break in our application. I really hope the community is able to get this in. Thank you all for your contributions!

    7. Josh, we use hstore extensively in places where a document style makes more sense than a relational model. Typically these are configuration options for different things that you would not typically query on. Having the hstore2 features I have read about will be a big benefit for us.

    8. We are leveraging JSON in postgres 9.3 in a big way for online analysis of user activity events. While I think the system is going to hold up fine when we go live with it, my main worry otherwise is around performance of queries when I can't rely on an index of a field in the JSON column. I for one wish jsonb were top priority for 9.4. That and concurrent refresh of materialized views.

  2. This comment has been removed by the author.

    Replies
    1. This comment has been removed by the author.

    2. My voice is new to the community, though I'm a long-time PG (and Oracle) user. My hope is that the community will enjoy hearing a little more from me: I'm giving a talk at the upcoming PG conference in NYC. I've been working on some really challenging issues in a demanding environment, the details of which I'm only allowed to discuss in a hand-wavy manner due to the terms of my employment. I'm steeped in the relational, data modeling in particular is something I'm passionate about, and I promise you that I am NOT in favor of diluting the traditional vision and value proposition of PG. I very much agree that what we're discussing IS a double-edged sword and we must tread cautiously. But I am proposing what I believe is an augmentation to the vision, not a replacement. Please allow me to elaborate just a little more. I deleted my two previous comments because I realized that I'd directed an accusatory tone at Noah, and I apologize for that. While I'm asking for forgiveness, there's also the length of this post... all of which is my humble opinion, not factual, and may not even reflect anyone's version of reality other than my own. In return, I'll let all who wish to do so have the last word(s). Because of the 4096 character limit in this space, those words are in the next post down :P

    3. Thanks for making the jump.

      There are two surveys I'm aware of that show the thing most developers are interested in about so-called NoSQL is the flexibility JSON (and other approaches) offers. One of these comes from a Gartner-like research organization which, as in Josh's case, I cannot cite here. The other I conducted myself as part of my work, where years of using XML and opaque blobs of all types have lost their luster (some would make you cringe). I understand the attraction of a flexible schema for certain things: one in particular is at the beginning of a project when you are rapidly iterating, or at any point in the life of a project when you are building out new functionality. There are other classes of problems, such as configuration management or systems that require "meta-models", or user-defined K/V pairs and (gulp) user-defined semantics (the scariest thing for an information architect, ever ;) If you've ever been backed into a corner and had to use EAV or XML, you'll know what I'm talking about.

      I'm not sure that the community sees what it has with the JSON type that no other database has: the benefit (perceived or otherwise) of a document store with ACID and relational semantics, wrapped up in a track record for quality and stability. Why choose when you can have both? Interested in competing with Oracle? It can't offer both data models at the same time with an in-place upgrade path to a time-tested schema that has evolved after actual user acceptance in production with transactional DML/DDL. Neither can any document-oriented "post relational" database (a term I prefer to NoSQL, it's sarcasm masked as pretentiousness). And have you seen the emerging trend spawned by Google's Spanner and F1 papers? These projects are all years away from having the kind of maturity, quality, or community PG has had for decades. IMO, the world (of database nerds like me) is starting to recognize that it doesn't want to have to choose. What advocates of so-called polyglot persistence aren't quick to point out is the hassle of running two or more databases that need operational backing, duplicated storage, additional software upgrades, new monitoring scripts, etc. IMO, PG is the right answer. With this patch, it removes any shadow of a doubt, at the time I happen to need it: right now. A year from now gives the potential alternatives (nee competitors) too long to catch up.

      So PG can continue to compete with Oracle and all other databases by playing catch-up with the things they already have, and roll the dice trying to convince a big company that can afford expensive software licenses, not to mention the inertia of years of incumbent positions in their data centers. Good luck with that. (Please don't take me as ungrateful for that work, by the way.)

      I'm throwing my invisible gauntlet down in favor of PG competing via innovation through data types, something uniquely its own. PG has had the most powerful and extensible type system for a long time, that might be what I love most. Vis-a-vis the demand for JSON right now, this patch, delivered in 9.4 rather than a year from now, keeps the train rolling. It is said that "timing is everything", and _that_ is why I agree with Josh.

  3. We use PG with Node.js and I fully agree with your strategy!

  4. mongodb doesn't have painless horizontal sharding

    Replies
    1. No, but they're good at advertising that they do. And by the time most users find out that the horizontal scaling is actually quite painful, it's too late to change direction.

      Anyway, I said *relatively* painless horizontal scaling. By which I mean that you shouldn't have to hire PGX or 2ndQ to do your scaling; it should be possible for a talented devops person to make it work.

      Andres is hard at work on this with Logical Streaming Replication, and when that's done and tooled up I think we'll have something awesome. But that's more in the 9.5 timeline.

  5. Where can I find information about the Node driver's project? Is there a github project or something equivalent?

    Replies
    1. This is the most widely used library: https://www.npmjs.org/package/pg

    2. I know, I use it myself. I thought Josh Berkus was referring to another driver.

    3. Nope. I'm not saying the driver sucks; it's pretty good. But I've been told by several node users that it could be better, especially performance-wise.

  6. "Oh, and if anyone wants to work on our Node driver ... please pitch in!" - are you talking about https://github.com/brianc/node-postgres? 'Cause I don't see your name in the contributors list. Although I don't know any other PostgreSQL driver for Node which is so popular.

    Replies
    1. I am not presently a contributor to the driver.

    2. Brian's a friend of mine in Austin, TX :) Great guy!

  7. Many big companies already use Node.js or will switch to it in the near future.
    The rise of JS will also push document-storage engines.

  8. IMHO - there is a sunk-cost aspect to Mongo, and this project (https://github.com/umitanuki/mongres) is attempting to short-circuit it.

    It is building a PG extension that acts as a mongodb compatible layer and ensures that PG could potentially become a drop-in replacement for mongo.

    This is very, very cool !

    Replies
    1. IBM built this into DB2 10.5 in partnership with 10Gen, and even have a wire protocol listener for true turnkey replacement (which BTW makes their acquisition of Cloudant a little strange to me).

    2. take a look at http://pgre.st for "mongodb compatible layer"

  9. This comment has been removed by a blog administrator.

  10. This comment has been removed by a blog administrator.

  11. This comment has been removed by a blog administrator.

    Replies
    1. Was this multiple posting some kind of subtle hint on why upserts are important? #sadtrombone
      ;)

    2. ISS: Heh. Mind you, Google is not using PostgreSQL ....

  12. Dart is shaping up to be a good competitor to Node.js. If anyone is interested in contributing to the Dart postgresql driver, patches and bug reports are always welcome. https://github.com/xxgreg/postgresql

  13. Sounds to me like a LAMP/LAPP war. Personally, I've never chosen a technology because it is the default for the language I'm using. Therefore, while I appreciate ideas and efforts in making PostgreSQL a better database, I don't believe PostgreSQL development should be driven by such "defaults".
    It is however an interesting subject to work on.

  14. Once you give developers the ability to store data in an unstructured way, they will certainly abuse it. Later, once the dataset grows large, they will run into performance issues and Postgres will be on the hook for finding the solutions.

    It's not just about supporting indexing in JSON. Postgres will need better statistics on values in JSON (so the execution plan will be smarter) and smart locking when parts of a JSON value are updated (maybe?). I don't know what else. Does Postgres have a strategy to support all that, and to what extent?
    Maybe the better option is to provide good-enough functionality for flexible data types like json and hstore, and encourage migration to relational structures as soon as the model is stable enough?
    It's hard to believe that Postgres, a solid relational database, will be able to compete with a non-relational database which was built to work with unstructured data from the ground up. And maybe that shouldn't be the goal...imho.

    Replies
    1. Slava, we're not going to win over a lot of developers by telling them we know better than they do.

      Regarding competition, we can already outperform MongoDB on a single node, and there's no question that we're more stable and secure. Surely there's a hypothetical nonrelational database which would be better than PostgreSQL could ever achieve ... but if so, it hasn't been released yet.

    2. Sometimes the model is never stable. For CMS / knowledge base type projects the model is never done, as the world you are modelling is always changing around you. I've found the JSON type in PostgreSQL to be a great option. My database code is much simpler (no need for 30+ sets of tables for all of our versioned data types), I get transactions, and I can construct arbitrary queries into the data for reporting purposes. And with LISTEN/NOTIFY I can asynchronously feed data into Elasticsearch (because of its great support for faceted search).

  15. As further evidence of my assertions above, an anecdote: at the PostgreSQL booth at SCALE, two different developers walked up to me with the exact same question:

    "I use MongoDB but I've heard that PostgreSQL has JSON support now, and wanted to check it out."

  16. We switched from PG to Mongo in order to be able to scale up (but still have more data in PG than in Mongo), not so much because of the schema-less attributes. While JSON was considered a plus by some (not all), it's the integrated sharding that sealed the deal. JSON support in PG is already good enough for us (YMMV), but until PG gets better integrated multi-node sharding, we (sadly) can't go back to it.

  17. This comment has been removed by a blog administrator.

  18. I've been searching a lot for an alternative database to MongoDB to use with Node.js. I have never seen the developer community so polarized on a single product when it comes to the selection of the storage layer. Some literally hate it and will never look at it, some are OK with it, some love it (love it most likely applies to folks from 10gen). I am very leery of entrusting my data to MongoDB: high memory consumption, indexes much larger than in alternative storage engines, much larger disk storage, and horizontal scaling that is not as easy as it is made out to be by their marketing drum.

    In fact, I also looked at Go so I could keep using Postgres. Coming from a Java/Rails background, I was looking for both ease of use and strong storage. I don't really need to prematurely optimize. In fact, I could throw all this data into Amazon Postgres, and they support up to 3 TB in Postgres. I have a long way to go to get to 3 TB, and when I get to 2 TB, I will start to think about alternative ways of storing. Till then, I would love to keep using Postgres, but as this blog makes clear, Node.js users aren't really enamored of Postgres for whatever reason. Their default selection seems to be MongoDB. I can't for the life of me understand how a database that isn't really up to snuff can be trusted enough. What Node.js saves on resources, MongoDB takes up. It's break-even or, worse, could end up costing you more on the MongoDB instances. "Hey Ma! I am processing 1 million requests a second with just 1 Node instance, but I have 30 MongoDB instances that I need to support the single Node instance"

  19. Web Framework Benchmarks:
    MySQL vs PostgreSQL vs Mongo
    http://www.techempower.com/benchmarks/#section=data-r8&hw=i7&test=update
    "This is a performance comparison of many web application frameworks executing fundamental tasks such as JSON serialization, database access, and server-side template composition. "

  20. It would be great if you could help on this one.. http://stackoverflow.com/questions/22654170/explanation-of-jsonb-introduced-by-postgresql
