Thursday, December 18, 2014

Your Hanukkah Present: PostgreSQL 9.4

For the third night of Hanukkah, we have a nice present for you: PostgreSQL 9.4.0.  Now eat your latkes.



Let's talk about 9.4 by talking about one of my favorite features: JSONB.  A lot of people worked to bring you JSONB in this release, including my colleague Andrew Dunstan, Oleg Bartunov and Teodor Sigaev (sponsored by Engine Yard), Peter Geoghegan of Heroku.com, and several others.

For example, imagine we have a large set of JSON documents which contain publication data about various books.  One reason we might have this is that we receive the data in JSON or HTML format from various publishers, and it is not normalized, so we keep it in its original format for data mining.  We can store them in a JSONB column like so:

    -- the citext and isbn types come from the citext and isn extensions
    CREATE TABLE booksdata (
        title citext not null,
        isbn isbn not null primary key,
        pubinfo jsonb not null
    );

Some example pubinfo might be:

    {"authority" : [ "James Michaels", "Nina Totenberg" ], "cost": 16, "edition" : 3,
     "format": "Trade PB", "publisher": "HarperOne",
     "published_on": "1995-08-05", "revised" : [ "1996-06-01", "1999-01-01" ] }

You'll note that the keys are already sorted.  This is a feature of JSONB; whitespace gets cleaned up and keys get sorted on import.
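
If you want to see that normalization for yourself, here is a minimal sketch you can paste into psql; the literals are made up for illustration.  One caveat worth knowing: JSONB orders keys by length first and then by byte value, so the order you get back is not always strictly alphabetical, and when a key is repeated only the last value is kept.

```sql
-- Whitespace is discarded, keys are reordered, and the duplicate "a"
-- collapses to its last value.
SELECT '{"b": 1,    "a": 2, "a": 3}'::jsonb;
-- {"a": 3, "b": 1}

-- Shorter keys sort ahead of longer ones, so "b" comes before "aa".
SELECT '{"aa": 1, "b": 2}'::jsonb;
-- {"b": 2, "aa": 1}
```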

Then we can create a general index on the JSONB like so:

    CREATE INDEX ON booksdata USING GIN (pubinfo);

or:

    CREATE INDEX ON booksdata USING GIN (pubinfo jsonb_path_ops);

There are two different versions depending on the operations you expect to run.  The standard GIN index supports every indexable JSONB operator: containment ("@>") and the key-existence operators ("?", "?&" and "?|").  The jsonb_path_ops index supports only the containment operator "@>" (see below), but produces a smaller and faster index for these kinds of searches.  However, we're going to do a bunch of different JSON operations for data mining, so we'll use the regular GIN index.

Once we have the index, we can do arbitrary path searches on JSONB, and these will be index searches, depending on what the planner picks.  Examples of JSONB queries:

Count all books published by "It Books":

    SELECT count(*) FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }';

     count
    ───────
     41
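
Whether that count actually used the GIN index is up to the planner, so it is worth checking.  A quick sketch: put EXPLAIN in front of the same query and look for an index scan on the booksdata index in the plan.  On a small table the planner may reasonably choose a sequential scan instead.

```sql
-- Show the plan the planner picks for the containment search.
-- Use EXPLAIN ANALYZE instead to also run the query and see real timings.
EXPLAIN
SELECT count(*) FROM booksdata
WHERE pubinfo @> '{ "publisher" : "It Books" }';
```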

Get the ISBN, and extract the cost of the book as a string, for all books:

    SELECT isbn, pubinfo #>> '{"cost"}' as cost
    FROM booksdata;
    
        isbn      │ cost
    ──────────────┼──────
    0-06-203728-5 │ 16
    0-06-204980-1 │ 27
    0-06-206888-1 │ 180
    0-06-201744-6 │ 10
    0-06-162698-8 │ 20
    0-06-199895-8 │ 10
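
For a single top-level key, the "->>" operator should give the same text result with a little less punctuation, and the "#>>" path form earns its keep once you reach into nested values.  A sketch against the same booksdata table, assuming documents shaped like the example pubinfo above:

```sql
-- Same result as pubinfo #>> '{"cost"}', using the single-key operator.
SELECT isbn, pubinfo ->> 'cost' AS cost
FROM booksdata;

-- Path form: first entry (index 0) of the "revised" array, as text.
-- Yields NULL for books that have no "revised" key.
SELECT isbn, pubinfo #>> '{revised,0}' AS first_revision
FROM booksdata;
```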
 

Give me the count of all books which have both the "publisher" and "cost" keys:

    SELECT count(*) FROM booksdata
    WHERE pubinfo ?& array['publisher','cost'];

     count
    ───────
     391

That goofy "?&" is a special operator which says "has all of these keys".  There is also "?|" for "has any of these keys" and a plain "?" for "has this one key", and you can negate any of them with a NOT.
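
As a sketch of those variations, using key names taken from the example pubinfo above (which keys your own documents carry is of course an assumption):

```sql
-- Books that have at least one of the "edition" or "revised" keys.
SELECT count(*) FROM booksdata
WHERE pubinfo ?| array['edition','revised'];

-- Books that are missing the "cost" key entirely.
SELECT count(*) FROM booksdata
WHERE NOT (pubinfo ? 'cost');
```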

JSON operations can be combined with standard PostgreSQL aggregates and other query operations.  This allows us to provide far more functionality than non-relational databases have.  For example:

Give me the average cost of all books from "It Books":

    SELECT avg((pubinfo #>> '{"cost"}')::NUMERIC)
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }';

              avg        
      ─────────────────────
       35.0000000000000000


Now, you'll notice that I had to cast "cost" to numeric, even though it's stored as a number in the JSONB.  That's because the "#>>" operator always returns text, which is a current limitation.  By 9.5, expect to have more operators which use native JSON types.
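
The same cast works in a WHERE clause, which is how you would filter on a numeric range today.  A minimal sketch, with the threshold of 20 picked purely for illustration; note that the GIN index above only helps the containment and key-existence operators, so I would not expect it to speed up this comparison:

```sql
-- Books costing more than 20; the extracted value is text,
-- so cast it to numeric before comparing.
SELECT isbn, title, (pubinfo #>> '{"cost"}')::NUMERIC AS cost
FROM booksdata
WHERE (pubinfo #>> '{"cost"}')::NUMERIC > 20;
```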

Return the publisher name as a string, and the average cost for all books grouped by each publisher name:

    SELECT pubinfo #>> '{"publisher"}' as publisher,
        round(avg((pubinfo #>> '{"cost"}')::NUMERIC),2)
    FROM booksdata
    GROUP BY 1
    ORDER BY publisher;

            publisher         │ round
    ──────────────────────────┼────────
    Avon                      │  43.39
    Collins Reference         │  24.57
    Harper                    │  45.40
    HarperBusiness            │  26.29
    HarperOne                 │  21.96
    Harper Paperbacks         │  16.00
    Harper Voyager            │  29.18
    Igniter                   │  19.50
    It Books                  │  35.00
    William Morrow            │ 348.00
    William Morrow Paperbacks │  15.83


Wow, William Morrow is expensive!   I suspect some kind of data problem here.
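
One way to chase that down, sketched here on the assumption that a handful of outliers are dragging the average up, is to list the most expensive titles for that publisher and eyeball them:

```sql
-- The ten highest "cost" values among William Morrow titles.
SELECT isbn, title, (pubinfo #>> '{"cost"}')::NUMERIC AS cost
FROM booksdata
WHERE pubinfo @> '{ "publisher" : "William Morrow" }'
ORDER BY cost DESC
LIMIT 10;
```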

Further, JSONB is sortable.  It sorts in a sensible way, first by keys and then by values, as their primitive JSON types (number, boolean, string and null).  This allows JSONB values to be compared and even paginated.  For example, this query pulls the full pubinfo from all "It Books" titles, sorts them, and limits to 25:

    SELECT pubinfo
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }'
    ORDER BY pubinfo LIMIT 25 OFFSET 0;

                        pubinfo
    ─────────────────────────────────────────────────
    {"cost": 14, "format": "Trade PB", "publisher": "It Books", ...
    {"cost": 15, "format": "Hardcover", "publisher": "It Books", ...
    {"cost": 15, "format": "Trade PB", "publisher": "It Books", ...
    {"cost": 15, "format": "Trade PB", "publisher": "It Books", ...

    ...
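
To fetch the next page, bump the OFFSET by the page size.  In this sketch I have also added isbn as a tiebreaker, which is my own assumption rather than part of the query above, so that two books with identical pubinfo always come back in the same order from one page to the next:

```sql
-- Page 2: skip the first 25 rows and return the next 25.
SELECT pubinfo
FROM booksdata
WHERE pubinfo @> '{ "publisher" : "It Books" }'
ORDER BY pubinfo, isbn
LIMIT 25 OFFSET 25;
```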

This query grabs all pubinfo for "It Books" titles and then sorts them by the JSON value of the publication date.  Since JSON has no internal type for dates, the dates are compared as strings, so you'll want to be careful to store them in ISO 8601 format (YYYY-MM-DD), which sorts chronologically:

    SELECT pubinfo
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }'
    ORDER BY pubinfo #> '{"published_on"}'
    LIMIT 5 OFFSET 0;

                            pubinfo
    ──────────────────────────────────────────────────────────────
    {"cost": 260, "format": "Hardcover ", "publisher": "It Books",
        "published_on": "2006-03-01"}
    {"cost": 17, "format": "Trade PB", "publisher": "It Books",
        "published_on": "2006-03-01"}
    {"cost": 90, "format": "Hardcover ", "publisher": "It Books",
        "published_on": "2006-11-01"}
    {"cost": 15, "format": "Trade PB", "publisher": "It Books",
        "published_on": "2006-12-01"}
    {"cost": 22, "format": "Trade PB", "publisher": "It Books",
        "published_on": "2007-02-01"}

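If you cannot count on every publisher sending dates in the same sortable text format, one alternative is to extract the value as text and cast it to a real date before sorting.  This is a sketch, not a drop-in fix: the cast raises an error on any value PostgreSQL cannot parse as a date, so it assumes the data has been cleaned first.

```sql
-- Sort by publication date as a date rather than as a JSON string.
SELECT pubinfo
FROM booksdata
WHERE pubinfo @> '{ "publisher" : "It Books" }'
ORDER BY (pubinfo #>> '{"published_on"}')::date
LIMIT 5 OFFSET 0;
```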

So, some awesome functionality to let PostgreSQL 9.4 be your document database solution.  Download and enjoy!

1 comment:

  1. Hi Josh, if I wanted to filter by a certain numerical range, say cost > 20,
    can it be done without casting?
