Friday, December 26, 2014

PostgreSQL New Zealand Visit

I will be in New Zealand next month for LinuxConf AU.  In addition to a PostgreSQL replication tutorial at LCA, I will have other events to get in touch with the PostgreSQL community in Auckland and Wellington:

Auckland (Jan. 11 to 16)

Monday night (Jan. 12), at 7:30 PM, we will have a PostgreSQL 9.4 BOF at the LCA venue.  Members of the Auckland PostgreSQL community who can't otherwise make it to LCA are allowed and encouraged to attend this BOF; please contact me so I can get a headcount for pizza.

Wellington (Jan. 18 to 21) 

Tuesday, January 20th from 9am to 3pm we will have an extended PostgreSQL replication tutorial, hosted by NIWA New Zealand in downtown Wellington.  This will be the same tutorial I give at LCA, only the extended 4.5 hour version which covers things like replication slots and performance tuning replication.  Contact NIWA if you are interested in attending this; there is a fee.

On Tuesday or Monday night I would really like to have a Wellington PostgreSQL user meetup.  However, I need help finding a venue and getting in contact with PostgreSQL users in Wellington.  Please contact me if you can help, and check back here for updates on the meeting.

UPDATED: please see this wiki page for more information.

Thursday, December 18, 2014

Your Hanukkah Present: PostgreSQL 9.4

For the third night of Hanukkah, we have a nice present for you: PostgreSQL 9.4.0.  Now eat your latkes.

Let's talk about 9.4 by talking about one of my favorite features: JSONB.  A lot of people worked to bring you JSONB in this release, including my colleague Andrew Dunstan, Oleg Bartunov and Teodor Sigaev (sponsored by Engine Yard), Peter Geohegan of, and several others.

For example, imagine we have a large set of JSON documents which contain publication data about various books.  One reason we might have this is that we receive the data in JSON or HTML format from various publishers, and it is not normalized, so we keep it in its original format for data mining..  We can create store them in a JSON column like so:

    table booksdata (
        title citext not null,
        isbn isbn not null primary key,
        pubinfo jsonb not null

Some example pubinfo might be:

    {"authority" : [ "James Michaels", "Nina Totenberg" ], "cost": 16, "edition" : 3,
     "format": "Trade PB", "publisher": "HarperOne",
     "published_on": "1995-08-05", "revised" : [ "1996-06-01", "1999-01-01" ] }

You'll note that the keys are already sorted in alphabetical order.  This is a feature of JSONB; whitespace gets cleaned up and keys get storted on import.

Then we can create a general index on the JSONB like so:

    CREATE INDEX ON booksdata USING GIN (pubinfo);


    CREATE INDEX ON booksdata USING GIN (pubinfo json_path_ops);

There are two different versions depending on the operations you expect to run.  The standard GIN index supports every kind of operation we support for JSONB.  The path_ops index supports only the search path operator "@>" (see below), but produces a smaller and faster index for these kinds of searches.  However, we're going to do a bunch of different JSON operations for data mining, so we'll use the regular GIN index.

Once we have the index, we can do arbitrary path searches on JSONB, and these will be index searches, depending on what the planner picks.  Examples of JSONB queries:

Count all books published by "It Books":

    SELECT count(*) FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }';


Get the ISBN, and extract the cost of the book as a string, for all books.

    SELECT isbn, pubinfo #>> '{"cost"}' as cost
    FROM booksdata;
        isbn      │ cost
    0-06-203728-5 │ 16
    0-06-204980-1 │ 27
    0-06-206888-1 │ 180
    0-06-201744-6 │ 10
    0-06-162698-8 │ 20
    0-06-199895-8 │ 10

Give me the count of all books which have both the "publisher" and "cost" keys:

    SELECT count(*) FROM booksdata
    WHERE pubinfo ?& array['publisher','cost'];


That goofy "?&" is a special operator which says "has all of these keys".  There are other operators for "has any of these keys", and you can negate it with a NOT.

JSON operations can be combined with standard PostgreSQL aggregates and other query operations.  This allows us to provide far more functionality than non-relational databases have.  For example:

Give me the average cost of all books from "It Books":

    SELECT avg((pubinfo #>> '{"cost"}')::NUMERIC)
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }';


Now, you'll notice that I had to cast "cost" to numeric, even though it's stored as a number in the JSONB.  That's a current limitation.  By 9.5, expect to have more operators which use native JSON types.

Return the publisher name as a string, and the average cost for all books grouped by each publisher name:

    SELECT pubinfo #>> '{"publisher"}' as publisher,
        round(avg((pubinfo #>> '{"cost"}')::NUMERIC),2)
    FROM booksdata
    GROUP BY 1
    ORDER BY publisher;

            publisher         │ round
    Avon                      │  43.39
    Collins Reference         │  24.57
    Harper                    │  45.40
    HarperBusiness            │  26.29
    HarperOne                 │  21.96
    Harper Paperbacks         │  16.00
    Harper Voyager            │  29.18
    Igniter                   │  19.50
    It Books                  │  35.00
    William Morrow            │ 348.00
    William Morrow Paperbacks │  15.83

Wow, William Morrow is expensive!   I suspect some kind of data problem here.

Further, JSONB is sortable.  It sorts in a sensible way, first by keys and then by values, as their primitive JSON types (integer, numeric, boolean and text).  This allows JSONB values to be compared and even paginated.  For example, this query pulls the full pubinfo from all "It Books" titles, sorts them, and limits to 25:

    SELECT pubinfo
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }'
    ORDER BY pubinfo LIMIT 25 OFFSET 0;

    {"cost": 14, "format": "Trade PB", "publisher": "It Books", ...
    {"cost": 15, "format": "Hardcover", "publisher": "It Books", ...
    {"cost": 15, "format": "Trade PB", "publisher": "It Books", ...
    {"cost": 15, "format": "Trade PB", "publisher": "It Books", ...


This query grabs all pubinfo for "It Books" titles and then sorts them by the JSON of the publication date.  Since JSON has no internal type for dates, you'll want to be careful to use dates in Unix format:

    SELECT pubinfo
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }'
    ORDER BY pubinfo #> '{"published_on"}'

    {"cost": 260, "format": "Hardcover ", "publisher": "It Books",
        "published_on": "2006-03-01"}
    {"cost": 17, "format": "Trade PB", "publisher": "It Books",
        "published_on": "2006-03-01"}
    {"cost": 90, "format": "Hardcover ", "publisher": "It Books",
        "published_on": "2006-11-01"}
    {"cost": 15, "format": "Trade PB", "publisher": "It Books",
        "published_on": "2006-12-01"}
    {"cost": 22, "format": "Trade PB", "publisher": "It Books",
        "published_on": "2007-02-01"}

So, some awesome functionality to let PostgreSQL 9.4 be your document database solution.  Download and enjoy!

Wednesday, December 10, 2014

SQLnoSQL: pg_shard and JSONB, part 1

Of course, the first thing which occurred to me with pg_shard was to create a distributed "NoSQL" database using 9.4's JSONB.  I had to wait a bit because there were some issues with 9.4 compatibility which needed resolving, but this week I was able to build it.  I was hoping to show it off at the SFPUG meeting, but sadly the weather gods (in the form of the PostgreSQL-powered NWS) interfered.

So we'll go over it here because it makes a good introduction to pg_shard and to JSONB.

First step was to spin up an AWS m3.xlarge instance to be our master.  While I don't need much in other capabilities, I do want the master node to have AWS's best network option since it'll be doing a LOT of networking.  Do, however, make sure to configure it with instance storage because you'll want that space for temp tables and any tables which aren't sharded.  This also becomes our template for the shards, because it's easier that way.  I created the instance running Ubuntu 14.04.  Then I had to install all of the PostgreSQL 9.4 components and a few others, including:

  • postgresql-9.4, postgresql-contrib-9.4, postgresql-server-dev-9.4
  • python-psycopg2 (for my benchmark test)
  • git, make

After that I "git cloned" the pg_shard repo and did a make and make install in it.  Currently it emits some spurious warnings with 9.4; ignore those.

Next was configuring PostgreSQL for pg_shard.  This is the master node, so it's where we're going to use pg_shard.  In addition to the usual PostgreSQL configuration, you want to change two settings:

shared_preload_libraries = 'pg_shard' #load pg_shard
temp_buffers = 64MB #pg_shard uses temp tables internally

Then start (or restart) PostgreSQL.  After that, we create a "psocial" database to represent our dummy social networking application, and set up a .pgpass file and an authorized_hosts file for passwordless access by the postgres user.  Then we stop, and shut PostgreSQL down. and use the AWS API to create eight r3.large instances based on that master instance to be our shards.  This way they already have the right configuration and the psocial database created.

We then harvest the IP addresses for those shards and turn them into some local DNS, naming them shard#.demo.  After that, we can set up the pg_worker_list.conf file, which goes in the data directory for PostgreSQL (not the configuration directory, but the actual data directory).

    #hostname       port
    shard1.demo     5432
    shard2.demo     5432
    shard3.demo     5432
    shard4.demo     5432
    shard5.demo     5432
    shard6.demo     5432
    shard7.demo     5432
    shard8.demo     5432

Unfortunately, we do have to configure the shards a bit manually.  First, we'll need to mount the instance storage on each shard, and move the PGDATA directory to it.  Second, we'll need to start Postgres on each shard.  If you use Puppet, Salt, Chef, CFEngine, or similar, then it will be easy do to this; otherwise, you're in for a little SSH time.

Now we are ready to create the sharded "members" table for our app.  Let's create the pg_shard extension (this only needs to be done on the master):

    CREATE EXTENSION pg_shard;

Then we can create the master table:

    create table members (
        id text not null unique,
        email text,
        profile jsonb
    create index on members(email);
    create index on members using gin(profile);

You can see that we've created this table as a highly denormalized collection of JSON documents, using indexed JSONB.  We're doing this mainly to test the capabilities of 9.4's JSONB in a similation of a large, sharded, NoSQL application.  If this were a real application, I'd normalize it a bit.  The id column is TEXT because we're going to put a fake GUID into it.

This empty table is our template for the shards, and like a master partition acts as a target for queries which will be redirected by pg_shard to the shards.  It's not sharded yet; we have to do two things to make that happen.  One is "register" it as a master, which creates entries in pg_shard's metadata for it:

    SELECT master_create_distributed_table('members','id');

This tells pg_shard that "members" is a sharded table, and that it will be sharded on a hash of the column "id".  Now to create the actual shards:

    SELECT master_create_worker_shards('members',32,2);

The first number is the number of shards to create.  While not required, it helps for this to be evenly divisible by the number of hosts you have.  You generally want to create more shards than hosts, first to take advantage of some parallelism on each host, and second to have room for growth by "splitting" hosts.  That is, in our psocial example, we could double to 16 hosts and give each half of the shards.

If you look at the pg_shard metadata, you can see that things are now sharded.  The "partition" table holds our list of master tables:

    psocial=# select * from pgs_distribution_metadata.partition ;
     relation_id | partition_method | key
           16450 | h                | id

The "shard" table shows us the list of shards:

     id   | relation_id | storage |  min_value  |  max_value
    10000 |       16450 | t       | -2147483648 | -2013265922
    10001 |       16450 | t       | -2013265921 | -1879048195
    10002 |       16450 | t       | -1879048194 | -1744830468
    10003 |       16450 | t       | -1744830467 | -1610612741

You can see that each shard is defined as a range of hash values, hashed using PostgreSQL's internal hashing function.  Finally, the actual locations of shards are listed in the shard_placement table:

   id | shard_id | shard_state |  node_name  | node_port
    1 |    10000 |           1 | shard1.demo |      5432
    2 |    10000 |           1 | shard2.demo |      5432
    3 |    10001 |           1 | shard2.demo |      5432
    4 |    10001 |           1 | shard3.demo |      5432

You'll notice that each shard exists on two hosts.  That's that "redundancy" number we gave pg_shard when we created the shards; it says how many hosts should have a copy of each shard.  As you can see, these shards are assigned on a round-robin basis.

If we actually look on one of the shards, we see that each shard is a numbered table:

                List of relations
    Schema |     Name      | Type  |  Owner
    public | members_10000 | table | postgres
    public | members_10007 | table | postgres
    public | members_10008 | table | postgres
    public | members_10015 | table | postgres
    public | members_10016 | table | postgres
    public | members_10023 | table | postgres
    public | members_10024 | table | postgres
    public | members_10031 | table | postgres

Again, you can see that there are eight shards per host, because of the 2X redundancy we asked for.

Next post, we'll populate the shards with a bunch of dummy data.

Monday, December 8, 2014

Loading pg_partman on RDS or Heroku

One of the limitations with both RDS and Heroku is that you can only install the extensions they've approved and made available to you.  Except ... you can install "SQL-only" extensions which don't have a binary component.  Like pg_partman.

pg_partman is a set of functions and tables which help you manage your partitioned tables, including creating new partitions and triggers on the partition master.  I love it because it means I never have to write ad-hoc code for a client to do that again.  So of course I want to install it on our clients who are running on Heroku and RDS and need partitioning.  However, I have no access to "create extension partman" on the RDS command line, so how do I make it work?  Mostly by loading the extension the old way.

1) create a new schema on the target database called "partman".

2) Install PostgreSQL 9.3 on my laptop, including the development files.

3) git clone pg_partman.

4) make & make install pg_partman on my laptop.  Inside the pg_partman source directory, that will create a file in the "sql" directory called (currently): pg_partman--1.7.2.sql.

5) Since that file is designed to be installed as an extension, I need to alter it.  First, I replace all occurances of "@extschema@" with "partman".  Second, I delete the line "SELECT pg_catalog.pg_extension_config_dump('part_config', '');"

6) Load partman into your database by using "psql -f  pg_partman--1.7.2.sql YOUR_DB_NAME -h YOUR_HOST".  In the RDS case, this needs to be done as the instance superuser.

pg_partman can now be used in the target database by calling all of the functions as "partman.function", or by putting the partman schema in your search_path.

Now, this has the major drawback that upgrading pg_partman once 1.7.3 comes out will be a very manual process.  But that's better than not using it at all.

Thursday, December 4, 2014

What's this pg_shard thing?

I've been working with CitusData for a while, and I'm excited that they've finally released pg_shard, a generic "sharding" extension for PostgreSQL databases.  The press releases have all of the noise about this, so this blog post is for the nuts-and-bolts of pg_shard.


What does pg_shard do?

pg_shard takes one or more tables in your database and distributes them over multiple databases on multiple servers, or "shards".  Rows are distributed based on a hash function on the primary key which you define.  The hash function used is Postgres' own hash function for hash indexes.

A single node is the "master node" which contains templates for the distributed tables, as and transparently handles making sure that inserts, updates, and selects go to the right shards.  To ensure redundancy, sharded tables can be defined as having multiple copies across multiple shards.

SELECTs are distributed based on comparing constants in your WHERE clause and other filter conditions against the hashed key, allowing you to query against only the shards which have compliant data.  This should speed up queries for really large tables a great deal.

Where there are multiple copies of shards, pg_shard notices timeouts in contacting the shards and marks certain shards as offline. 

All of this is done inside PostgreSQL, most of it using our defined APIs and hooks.  The only thing the Citus team had to fork was ruleutils.c, and they're hoping to change that in 9.5 so forking that isn't required either.


What does pg_shard not do?

First, since the rows are distributed using a hash function, only equality comparisons are currently supported.  This means that if you query your table on ranges (such as BETWEEN two timestamps) it will scan all shards for that data.  Adding range partitioning to pg_shard is planned.

Second, unlike Postgres-XC/Postgres-XL, transactions between shards and distributed JOINs are not supported.  Those features are among the things which distinguish CitusData's proprietary product.

Third, currently only one master node is supported.  That means that pg_shard isn't yet a solution for supporting applications which need massive numbers of connections.   Supporting multiple query nodes is planned for the future.

Complex distribution rules, like requiring each shard to be copied to a specific subset of hosts in a different availability zone or data center, is also not supported.  It's not clear when/if it will be.


So what is pg_shard useful for right now?

A lot of web applications consist of "One Big Table" (OBT) which by size is 90% or more of the database. This might be a users table, or a messages table, a log table, or a pictures table.  Years ago, this was exactly the problem we faced with Instagram.

pg_shard will allow users to distribute the OBT across many nodes in a public cloud, letting users scale to much larger sizes.  Particularly, it makes it much easier to redistribute the OBT so that it fits in RAM on each cloud node, both decreasing overall cloud hosting cost and improving throughput and response times.  And since there's shard redundancy, they are protected against certain kinds of cloud failures.

As Instagram and Wanelo have done, you can implement this kind of sharding on the application layer ... and when you get to their scale, you'll probably have to.  But most PostgreSQL users aren't at that scale yet, and pg_shard can help them get over the "hump" of going from one server to several with a lot less pain.


Can you give some examples of pg_sharding a database?

First, over the next week I'll be building a "Postgre-NoSQL" database using PostgreSQL 9.4, pg_shard and JSONB, to demonstrate that you can replace some non-relational databases with PostgreSQL.  Expect more on this blog.

If you're in the Bay Area, then join us for an SFPUG meeting next week, which will be all about pg_shard.  If possible, we'll have live video, but no promises.

Friday, November 28, 2014

Setting up Ghost with PostgreSQL and Apache vhosts

I'm currently in love with Ghost, a Node.js blogging platform.  It's simple and small, uses Markdown for blog editing, and has scads of themes available, most of which are responsive and typography-centric.  If there's a project which might be a Wordpress-killer -- especially for PostgreSQL users -- Ghost is probably it.  I've set up both my food and pottery blog and the new SFPUG page using Ghost.

Ghost comes configured by default to use SQLite, though, and of course I had to run it on Postgres.  Also, I needed to set up Apache vhosts to redirect to the various Ghost instances, since one Node instance only runs one Ghost instance, and needed supervisord for autostart.  At some point later, I'll repackage all this to use docker containers for each Ghost instance, but these instructions don't cover that.

First, install Node.js.  OS packages are fine for Ubuntu 14.04 or Fedora; for other OSes you may need to get funky with source installs.  I'm not going to cover that here.  You'll also need to install the Node PostgreSQL driver, and the postgresql-server-dev package.

Second, "git clone" the stable branch of the Ghost code into an appropriate directory.  This can be directly your web directory for Ghost, or you can download, build, and then copy.  If you do the latter, be aware that Ghost has several "dot" files it needs which a regular "cp -r" won't pick up.

Now, before you build Ghost, you're going to need to make some changes to support PostgreSQL.  Edit "package.json" in the main directory, and add PostgreSQL into the "optional dependencies" section like so:

    "optionalDependencies": {
        "mysql": "2.1.1",
        "pg" : "latest"

Now, build Ghost by running the build commands from the home directory:

    npm install -g grunt-cli
    npm install
    grunt init
    grunt prod

Make sure you scroll through the output of the above commands and look for errors; they don't always fail on error.

As a note, Ghost, like Rails, has "dev", "stage", and "production" modes.  Each of these modes can use a different database, even a different DBMS.  Or you can have Ghost connect to the same database for all three modes.  By default, Ghost starts in dev mode.

You'll need to configure your database connection.  Before editing config.js, make sure that you've set up PostgreSQL to accept local connections using an md5 password.  I additionally route Ghost through pgBouncer in case of getting slammed by connections; this requires you to configure pgBouncer as well:
  1. add Ghost user to the database
  2. create a database for the Ghost
  3. configure pg_hba.conf so the new user can connect with an md5 password
  4. if using pgBouncer, configure pgbouncer.ini and users.txt with the database and the user/password you're using for Ghost.
Once that's all set up, you can configure Ghost to use Postgres.  Edit config.js, and in each section (dev, staging and prod) where you want to use a Postgres database, replace the SQLite connection information with something like this:

        database: {
            client: 'pg',
            connection: {
                host     : '',
                port     : '6432',
                user     : 'ghostblog',
                password : 'supercalifragalistic',
                database : 'ghostblog',
                charset  : 'utf8'

Then save.  As you can see, that's a configuration for pgBouncer.  If you're not using pgBouncer, then you can omit the host and port configuration.  If you're using Heroku, you'll need to point the Host at the public address for your Heroku Postgres instance. Restart Ghost at this point, or start it if you haven't already:

   npm start
   npm start --production

After this, you should be able to connect to Ghost on port 2368 of the defined interface ( if you haven't configured anything).  However, that doesn't actually get a site up.  For one, Ghost may have to share the server with non-Ghost sites and other Ghost instances.  Second, users aren't generally accustomed to looking for sites on port 2638.  Also, I'm not all that confident about Node.js security, given how new the platform is.

Enter Apache vhosts with proxying.

First, configure Ghost to listen only an alternate port and the internal IP address:

        server: {
            // Host to be passed to node's `net.Server#listen()`
            host: '',
            // Port to be passed to node's `net.Server#listen()`
            port: '4444'

Note that if you have several Ghost instances, you'll need a different port for each.  If you're running Ghost under Docker, then redirect the Ghost port to a unique high port on the host.

Next, create a new vhost file (in /etc/apache2/servers-available on Ubuntu) pointing to Ghost:

    <VirtualHost *:80>

        ProxyRequests off
        ProxyPass /
        ProxyPassReverse / http:/

        CustomLog /var/log/apache2/ combined
        ErrorLog /var/log/apache2/

Those ProxyPass lines allow bi-directional pass-through from Apache to Ghost.  Note that this prevents you from putting any other Apache directives on the Vhost other than logging, such as an authentication config; they will get ignored, because the request gets passed to Ghost first.

Now, link that to servers-enabled, and restart Ghost and Apache.  You should now be able to connect to Ghost using the URL of the vhost.

But wait ... what's keeping Ghost running?  npm will quit on error, and has no ability to restart with a server restart.   You could craft your own init file, but and alternative is to use supervisord.

First, install supervisor from packages.  Then add a config file to /etc/supervisor/conf.d:

    command = node /var/www/
    directory = /var/www/
    user = ghost
    autostart = true
    autorestart = true
    stdout_logfile = /var/log/supervisor/sfpostgres.log
    stderr_logfile = /var/log/supervisor/sfpostgres_err.log
    environment = NODE_ENV="production"

Then add the service to supervisor and enable it:

    supervisorctl add sfpostgres
    supervisorctl start sfpostgres

.... and that's it!  You should now have a Ghost instance on a custom domain with autorestart able to share a server with other websites.  And most importantly, running on PostgreSQL.

Wednesday, November 19, 2014

Good kernel, bad kernel

A month ago I got into an argument on IRC with Sergey about telling people to avoid kernel 3.2.  This turned out to be a very productive argument, because Sergey then went and did a battery of performance tests against various Linux kernels on Ubuntu. Go read it now, I'll wait.

My takeaways from this:

  • Kernel 3.2 is in fact lethally bad.
  • Kernel 3.13 is the best out of kernel 3.X so far.  I hope that this can be credited to the PostgreSQL team's work with the LFS/MM group.
  • No 3.X kernel yet has quite the throughput of 2.6.32, at least at moderate memory sizes and core counts.
  • However, kernel 3.13 has substantially lower write volumes at almost the same throughput.  This means that if you are write-bound on IO, 3.13 will improve your performance considerably.
  • If your database is mostly-reads and highly concurrent, consider enabling
Thanks a lot to Sergey for doing this testing, and thanks even more to the LFS/MM group for improving IO performance so much in 3.13.

Tuesday, November 18, 2014

pgDay SF 2015 Call For Speakers/Sponsors now open

pgDay SF 2015, to be held March 10th in Burlingame, CA, now has its Call for Speakers open.  The event is also looking for sponsors.  Come join us and meet even more Postgres folks!

Thursday, November 6, 2014

We need a webapp benchmark

I've been doing some comparative testing on different cloud platform's hosting of PostgreSQL.  And one of the deficiencies in this effort is the only benchmarking tool I have is pgbench, which doesn't reflect the kinds of workloads people would want to run on cloud hosting.  Don't get me wrong, pgbench does everything you could imagine with the simple Wisconsin benchmark, including statistics and sampling.   But the core benchmark is still something which doesn't look much like the kind of Rails and Django apps I deal with on a daily basis.

There's also TPCC-js, which is more sophisticated, but is ultimately still a transactional, back-office OLTP benchmark. 

So I'm thinking of developing a "webapp" benchmark.  Here's what I see as concepts for such a benchmark:
  • Read-mostly
  • No multi-statement transactions
  • Defined "users" concept with logins and new user registration
  • Needs a "sessions" table which is frequently updated
  • Read-write, read-only and session database connections should be separable, in order to test load-balancing optimization.
  • Queries counting, sorting and modifying content
  • Measured unit of work is the "user session" which would contain some content lookups and minor updates ("likes").
Now, one of the big questions is whether we should base this benchmark on the idea of a social networking (SN) site.  I think we should; SN sites test a number of things, including locking and joins which might not be exercised by other types of applications (and aren't by pgbench).  What do you think?  Does anyone other than me want to work on this?

Saturday, November 1, 2014

Finding Foreign Keys with No Indexes

Unlike some other SQL databases, PostgreSQL does not automatically create indexes on the "child" (in formal language "referencing") side of a foreign key.  There are some good reasons for this (see below), but it does give users another opportunity to forget something, which is indexing the foreign keys (FKs) that need it.  But which ones need it?  Well, fortunately, you can interrogate the system catalogs and find out.

I have added a query for finding foreign keys without indexes to pgx_scripts.  Indexes on the "parent", or "referenced" side of the FK are automatically indexed (they have to be, because they need to be unique), so we won't talk about them further.

Now, in order to understand how to interpret this, you have to understand why you would or would not have an index on an FK, and what sort of indexes are valid.  There's two times that indexes on the child side of FKs are used:
  • when doing JOIN and lookup queries using the FK column
  • when updating or deleting a row from the "parent" table
The second occasion is news to some DBAs.  The way it works is this: before letting you delete or change a row in the "parent" table, Postgres has to verify that there are no rows in the "child" table referencing the FK value that might be going away.  If there are, it needs to perform the action you have defined (such as CASCADE, SET NULL or RESTRICT).  If the "child" table is large, this can be substantially speeded up by having an index on the FK column.

This means that it's important to have an index on the child side of the FK if any of the following are true:
  • The child table is large and the parent table gets updates/deletes
  • The parent table is large and the FK is used for JOINs
  • The child table is large and the FK is used to filter (WHERE clause) records on the child table
This means most FKs, but not all of them.  If both tables are small, or if the parent table is small and the FK is used only to prevent bad data entry, then there's no reason to index it.  Also, if the FK is very low cardinality (like, say, only four possible values) then it's probably also a waste of resources to index it.

Now you're ready to run the query on your own database and look at the results. The query tries to filter for the best indexing candidates, but it is just a query and you need to use your judgement on what you know about the tables.  The query also filters for either the parent or child table being larger than 10MB.

Now, you might say "but I have an index on column Y" and wonder why it's appearing on the report.   That's probably because the FK does match the first columns of the index.  For example, an index on ( name, team_id ) cannot be used for an FK on team_id.

You may notice a 2nd section of the report called "questionable indexes".  These are FKs which have an index available, but that index may not be usable for JOINs and constraint enforcement, or may be very inefficient.  This includes:
  • Non-BTree indexes.  Currently other types of indexes can't be used for FK enforcement, although that is likely to change in future Postgres versions.  But they can sometimes be used for joins.
  • Indexes with more than one column in addition to the FK columns.  These indexes can be used for FKs, but they may be very inefficient at it due to the bigger size and extra index levels.
  • Partial indexes (i.e. INDEX ... WHERE).  In general, these cannot be used for FK enforcement, but they can sometimes be used for joins.
It's not quite as clear when you want to add to, or replace those indexes.  My advice is to start with the missing indexes and then move on to the more nuanced cases.

Thursday, October 30, 2014

Upcoming Seattle Visit

I will be in Seattle soon on business.  This will include two opportunties to do PostgreSQL community stuff:
If you're in the Seattle area, come chat!  We'll go out to Dilletante and have chocolate!

Tuesday, October 21, 2014

Introducing Flexible Freeze

One of the things I mentioned in my series on VACUUM FREEZE was that we really needed a Postgres utility which would opportunistically freeze tables during low traffic periods. Today I'm announcing the Flexible Freeze project, our first attempt at designing such a utility.

All that's there right now is a simple Python script.  However, that script is already a useful tool, installed at multiple production sites. Here's how the script works:
  1. identify your active databases and daily/weekly low traffic periods.
  2. create a cron job which calls with a time limit to keep it inside your low traffic window.
  3. will loop through your tables with the oldest XIDs, freezing them until it runs out of time or out of tables
There is also a second mode, using the --vacuum switch, which does VACUUM ANALYZE on the tables with the most dead rows (according to pg_stat_user_tables).  This is to help users who have a strong high/low traffic cycle and want to make sure that regular vacuuming takes place during low traffic.  If you're running both modes, we advise doing the freeze first.

Of course, I have a tanker-truck full of desired improvements/expansions to this.  So, pull requests welcome. 

If you're more into Ruby, Wanelo has rewritten flexible freeze for Rails and incorporated it into their Postmodern tool.

Friday, October 10, 2014

New Table Bloat Query

To accompany the New Index Bloat Query, I've written a New Table Bloat Query.  This also involves the launch of the pgx_scripts project on GitHub, which will include most of the "useful scripts" I talk about here, as well as some scripts from my co-workers.

The new table bloat query is different from the version in several ways:
  • Rewritten to use WITH statements for better maintainability and clarity
  • Conditional logic for old Postgres versions and 32-bit platforms taken out
  • Index bloat removed, since we have a separate query for that
  • Columns formatted to be more immediately comprehensible
In the course of building this, I found two fundamentally hard issues:
  1. Some attributes (such as JSON and polygon fields) have no stats, so those tables can't be estimated.
  2. There's no good way to estimate bloat for compressed (TOAST) attributes and rows.
Also, while I rewrote the query almost entirely, I am still relying on Greg's core math for estimating table size.  Comparing this with the results of pgstattuple, I'm seeing an error of +/- 20%, which is pretty substantial.  I'm not clear on where that error is coming from, so help improving the math is very welcome!

Results look like this:

  databasename | schemaname |   tablename   | pct_bloat | mb_bloat | table_mb   
  members_2014 | public   | current_member  |    92 |  16.98 |  18.547   
  members_2014 | public   | member_response  |    87 |  17.46 |  20.000   
  members_2014 | public   | archive_member  |    84 |  35.16 |  41.734   
  members_2014 | public   | survey      |    57 |  28.59 |  50.188   

pct_bloat is how much of the table (0 to 100) is estimated to be dead space.  MB_bloat is how many megabytes of bloat are estimated to exist.  Table_mb is the actual size of the table in megabytes.

The suggested criteria is to list tables which are either more than 50% bloat and bigger than 10MB, or more than 25% bloat and bigger than 1GB.  However, you should calibrate this according to your own database.

Thursday, October 2, 2014

JSONB and 9.4: Move Slow and Break Things

If you've been paying any attention at all, you're probably wondering why 9.4 isn't out yet.  The answer is that we had to change JSONB at the last minute, in a way that breaks compatibility with earlier betas.

In August, a beta-testing user reported that we had an issue with JSONB not compressing well.  This was because of the binary structure of key offsets at the beginning of the JSONB value, and the affects were dramatic; in worst cases, JSONB values were 150% larger than comparable JSON values.   We spent August through September revising the data structure and Heikki and Tom eventually developed one which gives better compressibility without sacrificing extraction speed.

I did a few benchmarks on the various JSONB types.  We're getting a JSONB which is both faster and smaller than competing databases, so it'll be worth the wait.

However, this means that we'll be releasing an 9.4beta3 next week, whose JSONB type will be incompatible with prior betas; you'll have to dump and reload if you were using Beta 1 or Beta 2 and have JSONB data.  It also means a delay in final release of 9.4.

Monday, September 29, 2014

Why you need to avoid Linux Kernel 3.2

In fact, you really need to avoid every kernel between 3.0 and 3.8.  While RHEL has been sticking to the 2.6 kernels (which have their own issues, but not as bad as this), Ubuntu has released various 3.X kernels for 12.04.  Why is this an issue?  Well, let me give you two pictures.

Here's private benchmark workload running against PostgreSQL 9.3 on Ubuntu 12.04 with kernel 3.2.0.  This is the IO utilization graph for the replica database, running a read-only workload:

Sorry for cropping; had to eliminate some proprietary information.  The X-axis is time. This graph shows MB/s data transfers -- in this case, reads -- for the main database disk array.  As you can see, it goes from 150MB/s to over 300MB/s.  If this wasn't an SSD array, this machine would have fallen over. 

Then we upgraded it to kernel 3.13.0, and ran the same exact workload as the previous test.  Here's the new graph:

Bit of a difference, eh?  Now we're between 40 and 60MB/s for the exact same workload: an 80% reduction in IO.   We can thank the smart folks in the Linux FS/MM group for hammering down a whole slew of performance issues.

So, check your Postgres servers and make sure you're not running a bad kernel!