Friday, November 28, 2014

Setting up Ghost with PostgreSQL and Apache vhosts

I'm currently in love with Ghost, a Node.js blogging platform.  It's simple and small, uses Markdown for blog editing, and has scads of themes available, most of which are responsive and typography-centric.  If there's a project which might be a WordPress-killer -- especially for PostgreSQL users -- Ghost is probably it.  I've set up both my food and pottery blog and the new SFPUG page using Ghost.

Ghost comes configured by default to use SQLite, though, and of course I had to run it on Postgres.  I also needed to set up Apache vhosts to redirect to the various Ghost instances, since one Node instance runs only one Ghost instance, and I needed supervisord for autostart.  At some point later, I'll repackage all of this to use Docker containers for each Ghost instance, but these instructions don't cover that.

First, install Node.js.  OS packages are fine for Ubuntu 14.04 or Fedora; for other OSes you may need to get funky with source installs.  I'm not going to cover that here.  You'll also need to install the Node PostgreSQL driver, and the postgresql-server-dev package.
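
On Ubuntu 14.04, that boils down to something like the following (package names are the 14.04 ones; postgresql-server-dev-9.3 matches the Postgres 9.3 that ships with it, so adjust for your version):

    # Node.js and npm from the OS repositories; nodejs-legacy provides
    # the "node" symlink that some tools expect
    sudo apt-get install nodejs npm nodejs-legacy
    # build headers the Node PostgreSQL driver compiles against
    sudo apt-get install postgresql-server-dev-9.3
    # the pg driver itself gets pulled in later by "npm install",
    # once it's listed in package.json as described below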

Second, "git clone" the stable branch of the Ghost code into an appropriate directory.  This can be directly your web directory for Ghost, or you can download, build, and then copy.  If you do the latter, be aware that Ghost has several "dot" files it needs which a regular "cp -r" won't pick up.

Now, before you build Ghost, you're going to need to make some changes to support PostgreSQL.  Edit "package.json" in the main directory, and add PostgreSQL into the "optionalDependencies" section like so:

    "optionalDependencies": {
        "mysql": "2.1.1",
        "pg" : "latest"
    },


Now, build Ghost by running the build commands from the Ghost directory:

    npm install -g grunt-cli
    npm install
    grunt init
    grunt prod


Make sure you scroll through the output of the above commands and look for errors; they don't always fail on error.

As a note, Ghost, like Rails, has "dev", "staging", and "production" modes.  Each of these modes can use a different database, even a different DBMS.  Or you can have Ghost connect to the same database for all three modes.  By default, Ghost starts in dev mode.
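
The mode is selected via the NODE_ENV environment variable, which is also what the supervisord config at the end of this post sets.  For example, to launch in production mode instead (a sketch; it assumes you're in the Ghost directory with config.js already set up):

    # run Ghost in production mode rather than the dev-mode default
    NODE_ENV=production node index.js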

You'll need to configure your database connection.  Before editing config.js, make sure that you've set up PostgreSQL to accept local connections using an md5 password.  I additionally route Ghost through pgBouncer in case the site gets slammed with connections; this requires you to configure pgBouncer as well (see the sketch after this list):
  1. Add a Ghost user to the database.
  2. Create a database for the Ghost instance.
  3. Configure pg_hba.conf so the new user can connect with an md5 password.
  4. If using pgBouncer, configure pgbouncer.ini and users.txt with the database and the user/password you're using for Ghost.
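
Here's a rough sketch of those four steps; the user, password, and database names match the config.js example below, and the pgBouncer file locations are the Debian/Ubuntu defaults, so adjust to taste:

    # as the postgres superuser: create the role and its database
    sudo -u postgres createuser --pwprompt ghostblog
    sudo -u postgres createdb --owner=ghostblog ghostblog

    # in pg_hba.conf, let that user connect with an md5 password:
    #   host  ghostblog  ghostblog  127.0.0.1/32  md5

    # in /etc/pgbouncer/pgbouncer.ini, under [databases]:
    #   ghostblog = host=127.0.0.1 port=5432 dbname=ghostblog

    # in /etc/pgbouncer/userlist.txt (users.txt on some setups):
    #   "ghostblog" "supercalifragalistic"
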
Once that's all set up, you can configure Ghost to use Postgres.  Edit config.js, and in each section (dev, staging and prod) where you want to use a Postgres database, replace the SQLite connection information with something like this:

        database: {
            client: 'pg',
            connection: {
                host     : '127.0.0.1',
                port     : '6432',
                user     : 'ghostblog',
                password : 'supercalifragalistic',
                database : 'ghostblog',
                charset  : 'utf8'
            }
        },


Then save.  As you can see, that's a configuration for pgBouncer.  If you're not using pgBouncer, you can omit the host and port settings.  If you're using Heroku, you'll need to point the host at the public address of your Heroku Postgres instance.  Restart Ghost at this point, or start it if you haven't already:

    npm start
    # or, to start in production mode:
    npm start --production

After this, you should be able to connect to Ghost on port 2368 of the defined interface (127.0.0.1 if you haven't configured anything).  However, that doesn't actually get a site up.  For one, Ghost may have to share the server with non-Ghost sites and other Ghost instances.  Second, users aren't generally accustomed to looking for sites on port 2368.  Also, I'm not all that confident about Node.js security, given how new the platform is.

Enter Apache vhosts with proxying.

First, configure Ghost to listen only on an alternate port and the internal IP address:

        server: {
            // Host to be passed to node's `net.Server#listen()`
            host: '127.0.0.1',
            // Port to be passed to node's `net.Server#listen()`
            port: '4444'
        }


Note that if you have several Ghost instances, you'll need a different port for each.  If you're running Ghost under Docker, then redirect the Ghost port to a unique high port on the host.
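
A minimal sketch of the Docker variant (the image name here is hypothetical; Ghost listens on port 2368 inside the container by default):

    # map container port 2368 to a unique high port on the host
    docker run -d --name sfpostgres -p 4444:2368 some-ghost-image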

Next, create a new vhost file (in /etc/apache2/sites-available on Ubuntu) pointing to Ghost:

    <VirtualHost *:80>
        ServerName sfpostgres.org
        ServerAlias www.sfpostgres.org

        ProxyRequests off
        ProxyPass / http://127.0.0.1:4444/
        ProxyPassReverse / http://127.0.0.1:4444/

        CustomLog /var/log/apache2/www.sfpostgres.org-access.log combined
        ErrorLog /var/log/apache2/www.sfpostgres.org-error.log
    </VirtualHost>


Those ProxyPass lines allow bi-directional pass-through from Apache to Ghost.  Note that this prevents you from putting any Apache directives other than logging on the vhost, such as an authentication config; they will get ignored, because the request gets passed to Ghost first.

Now, link that into sites-enabled, make sure Apache's proxy modules are enabled, and restart Ghost and Apache.
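
On Ubuntu, that's the standard helper commands (a2ensite takes the name of your vhost file, minus the .conf suffix on Apache 2.4):

    sudo a2enmod proxy proxy_http
    sudo a2ensite www.sfpostgres.org
    sudo service apache2 reload

Once Apache reloads, you should be able to connect to Ghost using the URL of the vhost.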

But wait ... what's keeping Ghost running?  npm will quit on error, and has no ability to restart when the server restarts.   You could craft your own init file, but an alternative is to use supervisord.

First, install supervisor from packages.  Then add a config file to /etc/supervisor/conf.d:

    [program:sfpostgres]
    command = node /var/www/sfpostgres.org/index.js
    directory = /var/www/sfpostgres.org
    user = ghost
    autostart = true
    autorestart = true
    stdout_logfile = /var/log/supervisor/sfpostgres.log
    stderr_logfile = /var/log/supervisor/sfpostgres_err.log
    environment = NODE_ENV="production"


Then add the service to supervisor and start it (if supervisord was already running, you may need a "supervisorctl reread" first so it picks up the new config file):

    supervisorctl add sfpostgres
    supervisorctl start sfpostgres

.... and that's it!  You should now have a Ghost instance on a custom domain, with autorestart, able to share a server with other websites.  And most importantly, running on PostgreSQL.

Wednesday, November 19, 2014

Good kernel, bad kernel

A month ago I got into an argument on IRC with Sergey about telling people to avoid kernel 3.2.  This turned out to be a very productive argument, because Sergey then went and did a battery of performance tests against various Linux kernels on Ubuntu. Go read it now, I'll wait.

My takeaways from this:

  • Kernel 3.2 is in fact lethally bad.
  • Kernel 3.13 is the best of the 3.X kernels so far.  I hope that this can be credited to the PostgreSQL team's work with the LFS/MM group.
  • No 3.X kernel yet has quite the throughput of 2.6.32, at least at moderate memory sizes and core counts.
  • However, kernel 3.13 has substantially lower write volumes at almost the same throughput.  This means that if you are write-bound on IO, 3.13 will improve your performance considerably.
  • If your database is mostly-reads and highly concurrent, consider enabling kernel.sched_autogroup_enabled (see the example after this list).
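
For reference, turning that on looks like this (add kernel.sched_autogroup_enabled = 1 to /etc/sysctl.conf to make it survive reboots):

    # enable scheduler autogrouping on the running kernel
    sudo sysctl -w kernel.sched_autogroup_enabled=1
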
Thanks a lot to Sergey for doing this testing, and thanks even more to the LFS/MM group for improving IO performance so much in 3.13.

Tuesday, November 18, 2014

pgDay SF 2015 Call For Speakers/Sponsors now open

pgDay SF 2015, to be held March 10th in Burlingame, CA, now has its Call for Speakers open.  The event is also looking for sponsors.  Come join us and meet even more Postgres folks!

Thursday, November 6, 2014

We need a webapp benchmark

I've been doing some comparative testing on different cloud platforms' hosting of PostgreSQL.  One of the deficiencies in this effort is that the only benchmarking tool I have is pgbench, which doesn't reflect the kinds of workloads people would want to run on cloud hosting.  Don't get me wrong, pgbench does everything you could imagine with the simple TPC-B-style benchmark it implements, including statistics and sampling.   But the core benchmark is still something which doesn't look much like the kind of Rails and Django apps I deal with on a daily basis.

There's also TPCC-js, which is more sophisticated, but is ultimately still a transactional, back-office OLTP benchmark. 

So I'm thinking of developing a "webapp" benchmark.  Here's what I see as concepts for such a benchmark (a rough schema sketch follows the list):
  • Read-mostly
  • No multi-statement transactions
  • Defined "users" concept with logins and new user registration
  • Needs a "sessions" table which is frequently updated
  • Read-write, read-only and session database connections should be separable, in order to test load-balancing optimization.
  • Queries counting, sorting and modifying content
  • Measured unit of work is the "user session" which would contain some content lookups and minor updates ("likes").
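
Just to make the "users" and "sessions" concepts concrete, here's the kind of core schema I have in mind -- purely a sketch, since the benchmark doesn't exist yet, and all the names are hypothetical:

    psql -d webappbench <<'EOF'
    -- registered users: read-mostly, plus new-user registration
    CREATE TABLE users (
        user_id    serial PRIMARY KEY,
        login      text NOT NULL UNIQUE,
        registered timestamptz NOT NULL DEFAULT now()
    );
    -- one row per live session, updated on nearly every request
    CREATE TABLE sessions (
        session_id uuid PRIMARY KEY,
        user_id    int NOT NULL REFERENCES users (user_id),
        last_seen  timestamptz NOT NULL DEFAULT now()
    );
    EOF
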
Now, one of the big questions is whether we should base this benchmark on the idea of a social networking (SN) site.  I think we should; SN sites test a number of things, including locking and joins which might not be exercised by other types of applications (and aren't by pgbench).  What do you think?  Does anyone other than me want to work on this?

Saturday, November 1, 2014

Finding Foreign Keys with No Indexes

Unlike some other SQL databases, PostgreSQL does not automatically create indexes on the "child" (in formal language, "referencing") side of a foreign key.  There are some good reasons for this (see below), but it does give users another opportunity to forget something: indexing the foreign keys (FKs) that need it.  But which ones need it?  Well, fortunately, you can interrogate the system catalogs and find out.

I have added a query for finding foreign keys without indexes to pgx_scripts.  The "parent", or "referenced", side of the FK is always indexed (it has to be, because the referenced columns must be unique), so we won't talk about it further.

Now, in order to understand how to interpret this, you have to understand why you would or would not have an index on an FK, and what sort of indexes are valid.  There are two occasions when indexes on the child side of FKs are used:
  • when doing JOIN and lookup queries using the FK column
  • when updating or deleting a row from the "parent" table
The second occasion is news to some DBAs.  The way it works is this: before letting you delete or change a row in the "parent" table, Postgres has to verify that there are no rows in the "child" table referencing the FK value that might be going away.  If there are, it needs to perform the action you have defined (such as CASCADE, SET NULL or RESTRICT).  If the "child" table is large, this check can be sped up substantially by an index on the FK column.

This means that it's important to have an index on the child side of the FK if any of the following are true:
  • The child table is large and the parent table gets updates/deletes
  • The parent table is large and the FK is used for JOINs
  • The child table is large and the FK is used to filter (WHERE clause) records on the child table
This means most FKs, but not all of them.  If both tables are small, or if the parent table is small and the FK is used only to prevent bad data entry, then there's no reason to index it.  Also, if the FK is very low cardinality (like, say, only four possible values) then it's probably also a waste of resources to index it.

Now you're ready to run the query on your own database and look at the results.  The query tries to filter for the best indexing candidates, but it is just a query, and you need to use your judgment on what you know about the tables.  The query also filters for either the parent or child table being larger than 10MB.

Now, you might say "but I have an index on column Y" and wonder why it's appearing on the report.   That's probably because the FK does not match the first columns of the index.  For example, an index on ( name, team_id ) cannot be used for an FK on team_id.
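
To illustrate the column-ordering rule, using made-up table and column names:

    psql -d mydb <<'EOF'
    -- NOT usable for the FK on team_id: team_id isn't the leading column
    CREATE INDEX ON members ( name, team_id );
    -- usable for the FK on team_id: the FK column comes first
    CREATE INDEX ON members ( team_id, name );
    EOF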

You may notice a second section of the report called "questionable indexes".  These are FKs which have an index available, but that index may not be usable for JOINs and constraint enforcement, or may be very inefficient.  This includes:
  • Non-BTree indexes.  Currently other types of indexes can't be used for FK enforcement, although that is likely to change in future Postgres versions.  But they can sometimes be used for joins.
  • Indexes with more than one column in addition to the FK columns.  These indexes can be used for FKs, but they may be very inefficient at it due to the bigger size and extra index levels.
  • Partial indexes (i.e. INDEX ... WHERE).  In general, these cannot be used for FK enforcement, but they can sometimes be used for joins.
It's not quite as clear-cut when you'd want to add to or replace those indexes.  My advice is to start with the missing indexes, and then move on to the more nuanced cases.