Sunday, May 25, 2014

9.4 Theme Contest Analyzed by 9.4

So a couple weeks ago I ran a little contest to see who could come up with a slogan for PostgreSQL 9.4.  Surprisingly, we got over 300 votes on various slogans, which means I need to do some statistics to analyze them -- which means I'm going to show off some of PostgreSQL's new 9.4 features as part of that!

Version 9.4 includes a number of new aggregate, array and set operations which make it vastly easier and faster to do statistical summaries and analysis.  Most of these were contributed by Andrew Gierth, including the two I'm going to use below, FILTER and WITHIN GROUP.  I'm also going to use MATERIALIZED VIEWS, developed by Kevin Grittner.  First, though, I need to import the data.  So I downloaded the survey results as a CSV, and created a table for them in PostgreSQL and loaded it up:

CREATE TABLE raw_survey (
    ts       timestamptz,
    prf      integer,
    moreways integer,
    devops   integer,
    moresql  integer,
    yesql    integer
);

\copy raw_survey from 'slogans.csv' with csv header

Now, Google's column-per-question format isn't very friendly to analysis and comparison; I want a more vertical orientation.  So I create one as a MatView.  This means that if I reimport the data in the future, or weed out obvious ballot-box stuffing, I just need to refresh it:

CREATE MATERIALIZED VIEW slogans AS
SELECT 'Performance, Reliability, Flexibility' as slogan, prf as vote
FROM raw_survey
UNION ALL
SELECT 'More Ways to Database', moreways
FROM raw_survey
UNION ALL
SELECT 'A Database for Dev and Ops', devops
FROM raw_survey
UNION ALL
SELECT 'More Than SQL', moresql
FROM raw_survey
UNION ALL
SELECT 'NoSQL + YeSQL = PostgreSQL', yesql
FROM raw_survey;

Now, for some statistics.  A total or average is easy, but it's not statistically sound.  A median is a much better statistic.  I also want to know the balance of people who hated a slogan (1) vs. loved it and put it first (5).  So, some of the new aggregates.

In the past, I've retrieved medians by either using SciPy inside PL/Python, or by doing some elaborate calculations on windowing rank.  No more.  Now I can do a simple one-line median using WITHIN GROUP.  WITHIN GROUP is a lot like a windowing aggregate, except that it returns a single summary value for the group.  Version 9.4 ships with several such ordered-set aggregates, including percentile_cont(), which returns the value at the stated fraction of a sorted group: in this case, 0.5 to get a median.  Like so:

SELECT slogan,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;

                slogan                 | percentile_cont
---------------------------------------+-----------------
 A Database for Dev and Ops            |               3
 More Than SQL                         |               3
 More Ways to Database                 |               3
 NoSQL + YeSQL = PostgreSQL            |               3
 Performance, Reliability, Flexibility |               4
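
For the curious, percentile_cont() interpolates linearly between adjacent values of the sorted group when the requested fraction falls between two rows.  Here's a minimal Python sketch of that calculation (my own illustration of the semantics, not PostgreSQL's implementation):

```python
def percentile_cont(values, fraction):
    """Continuous percentile: linear interpolation between sorted values,
    mirroring the semantics of PostgreSQL 9.4's percentile_cont()."""
    vals = sorted(values)
    # Row position for the requested fraction, 0-indexed
    pos = fraction * (len(vals) - 1)
    lower = int(pos)
    remainder = pos - lower
    if remainder == 0:
        return vals[lower]
    return vals[lower] + remainder * (vals[lower + 1] - vals[lower])

votes = [1, 2, 3, 4, 4, 5]
print(percentile_cont(votes, 0.5))  # → 3.5
```

That interpolation is why a median of an even-sized group can come out fractional, as in the box-plot results below.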

"Performance, Reliability, Flexibility" is taking a clear lead here.  Incidentally, percentile_cont() can take an array of values in order to give you a full box (remember, every time you say "big data" without drawing a box plot, God kills a kitten):

SELECT slogan,
    percentile_cont(ARRAY[0.1,0.25,0.5,0.75,0.9]) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;

                slogan                 | percentile_cont
---------------------------------------+-----------------
 A Database for Dev and Ops            | {1,2,3,3,4}
 More Than SQL                         | {1.4,2,3,4,5}
 More Ways to Database                 | {1,2,3,4,5}
 NoSQL + YeSQL = PostgreSQL            | {1,1,3,4,5}
 Performance, Reliability, Flexibility | {2,3,4,5,5}

Let's check our "loves" and "hates" to see if they tell us anything different.  Now, the old way to do this would be:

SELECT slogan,
    sum(CASE WHEN vote = 1 THEN 1 ELSE 0 END) as hates,
    sum(CASE WHEN vote = 5 THEN 1 ELSE 0 END) as loves
FROM slogans
GROUP BY slogan;

Awkward, neh?  Well, no more, thanks to the FILTER clause:

SELECT slogan,
    count(*) FILTER ( WHERE vote = 1 ) as hates,
    count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;

Isn't that way more intuitive and readable?  I think it is, anyway.  So, let's put it all together:

SELECT slogan,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY vote) as median,
    count(*) FILTER ( WHERE vote = 1 ) as hates,
    count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;

And the results:

                slogan                 | median | hates | loves
---------------------------------------+--------+-------+-------
 A Database for Dev and Ops            |      3 |    47 |    21
 More Than SQL                         |      3 |    32 |    58
 More Ways to Database                 |      3 |    39 |    55
 NoSQL + YeSQL = PostgreSQL            |      3 |    81 |    58
 Performance, Reliability, Flexibility |      4 |    11 |   118

And there we have it: "Performance, Reliability, Flexibility" is the winning theme idea for 9.4.  It wins on median, and on hates vs. loves counts.

Congratulations Korry Douglas; I'll contact you about shipping your Chelnik.  Note that the theme will be workshopped a little bit to fit in the structure of the final 9.4 release announcement (i.e. we may change it slightly to match the sections of the actual press release), but we're going with that general idea now.

Thursday, May 15, 2014

Help us choose an advocacy theme for PostgreSQL 9.4

Every year, for each PostgreSQL release, I have a "theme" which decides our graphics and presentation themes for promoting that version of PostgreSQL.   In the past, the themes have generally been my personal ideas, but this year we're putting it out to our greater community.

Five potential theme ideas have been selected from about 100 which were suggested on the pgsql-advocacy mailing list.  Now we need you to rate them, in order to decide which one we go with ... and who wins a Chelnik from Mark Wong!

Please vote on the basis of selecting a good theme/slogan for PostgreSQL 9.4 specifically, rather than just what sounds like the coolest phrase.

So, vote!

9.4 Beta, Postgres-XL, and pgCon Events

So, in case you somehow missed it, the PostgreSQL 9.4 Beta 1 is out.  Yaay!  Here's what I have to say about that:

libdata=# select title, 
    bookdata #> '{"publication_info", 0, "isbn"}' as isbn
from booksdata 
where bookdata @> '{ "publication_info" : [{"publisher": "Avon"} ] }'::jsonb 
order by bookdata #> '{"publication_info", 0, "price"}' DESC;

                 title                 |      isbn
---------------------------------------+-----------------
 The Bewitched Viking                  | "0-06-201900-7"
 When a Scot Loves a Lady              | "0-06-213120-6"
 Eternal Prey                          | "0-06-201895-7"
 My Irresistible Earl                  | "0-06-173396-2"

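For those who haven't played with the new jsonb type yet: in the query above, `#>` extracts a value by path and `@>` tests containment.  A rough Python sketch of those semantics, to make the operators concrete (my own simplification; the real operators handle more edge cases):

```python
def json_path(doc, path):
    """Rough equivalent of jsonb's #> operator: walk a list of
    keys (for objects) and indexes (for arrays); None if missing."""
    for step in path:
        if isinstance(doc, list):
            idx = int(step)
            doc = doc[idx] if 0 <= idx < len(doc) else None
        elif isinstance(doc, dict):
            doc = doc.get(step)
        else:
            return None
        if doc is None:
            return None
    return doc

def json_contains(doc, pattern):
    """Rough equivalent of jsonb's @> containment operator."""
    if isinstance(pattern, dict):
        return isinstance(doc, dict) and all(
            k in doc and json_contains(doc[k], v) for k, v in pattern.items())
    if isinstance(pattern, list):
        return isinstance(doc, list) and all(
            any(json_contains(item, p) for item in doc) for p in pattern)
    return doc == pattern

book = {"title": "The Bewitched Viking",
        "publication_info": [{"publisher": "Avon", "isbn": "0-06-201900-7"}]}
print(json_path(book, ["publication_info", "0", "isbn"]))                  # → 0-06-201900-7
print(json_contains(book, {"publication_info": [{"publisher": "Avon"}]}))  # → True
```

The big deal in 9.4 is that `@>` on jsonb can use a GIN index, so that containment search doesn't have to scan the whole table.
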
Download the beta now and test it out!  Break it!  Tell us how you broke it!  It's a beta, and it's up to you to make sure that the final release is as robust as possible.

Speaking of betas, there's a new open source big data option on the block: Postgres-XL.  This is a fork of PostgresXC, which supposedly resolves the blockers which have kept PostgresXC from being ready for production use.  I look forward to trying it out when I get a chance.

Finally, I wanted to remind everyone about the Clustering Summit, the PostgresXC Pizza Demo, and the Unconference at pgCon next week.  In particular, I still need two assistants to help me with the unconference; email me if you're available to help with setup.

Monday, May 12, 2014

cstore_fdw and big data

About a month ago, PostgreSQL fork vendor and Data Warehousing company CitusDB announced the availability of the open-source cstore_fdw.  This foreign data wrapper creates an external table with highly compressed data, which allows you to keep large amounts of archival data on your PostgreSQL server.

You can find out more if you want to tune in tomorrow night, May 13th, around 7:15PM PDT.  Tomorrow night's event will be sponsored by CitusDB and hosted by Rackspace.

First, the good stuff: compression:

phc=# select pg_size_pretty(pg_total_relation_size('postgres_log'));
 pg_size_pretty
----------------
 28 GB

ls -lh $PGDATA/base

-rw------- 1 postgres postgres 3.2G May 12 13:37 pglog.cstore
-rw------- 1 postgres postgres  12K May 12 13:37 pglog.cstore.footer

So, the "postgres_log" table from this Performance Health Check database, which has 15 million records, takes up 28GB in postgres, and 3.2GB as a cstore table ... a space savings of about 89%.  Not bad.  Especially if you consider that the cstore table already has skip indexes on all indexable columns.
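
The savings arithmetic, for the record (using the sizes reported above):

```python
postgres_gb = 28.0   # pg_total_relation_size of postgres_log
cstore_gb = 3.2      # size of pglog.cstore on disk
savings = 1 - cstore_gb / postgres_gb
print(f"{savings:.0%}")  # → 89%
```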

Now, where this space savings becomes a real benefit is if the cstore table fits in memory and the Postgres table doesn't.   I don't have a case like that, although the cstore still does show performance benefits if you have a wide table and you don't need all columns:

phc=# select count(1) from postgres_log where command_tag = 'UPDATE';
(1 row)

Time: 23746.476 ms
phc=# select count(1) from c_pglog where command_tag = 'UPDATE';
(1 row)

Time: 14059.405 ms

And even better if you can apply a relatively restrictive filter:

phc=# select count(1) from postgres_log where command_tag = 'UPDATE' and log_time BETWEEN '2014-04-16 07:15:00' and '2014-04-16 07:20:00';
(1 row)

Time: 19653.746 ms

phc=# select count(1) from c_pglog where command_tag = 'UPDATE' and log_time BETWEEN '2014-04-16 07:15:00' and '2014-04-16 07:20:00';
(1 row)

Time: 2260.891 ms

One limitation is that currently, with PostgreSQL not yet able to cleanly push down aggregation to foreign data wrappers, the actual aggregation is still done on the postgres side.  This means that large aggregates are about the same speed on cstore_fdw as they are on regular PostgreSQL tables:

phc=# select round((sum(duration)/1000)::numeric,2) from statements where command_tag = 'UPDATE';
(1 row)

Time: 2920.640 ms
phc=# select round((sum(duration)/1000)::numeric,2) from c_statements where command_tag = 'UPDATE';
(1 row)

Time: 3232.986 ms

The project plans to fix this, but until then, cstore_fdw is useful mainly for searches across really large/wide tables.  Or for seldom-touched archive tables where you want to save yourself GB or TB of disk space.

There are a bunch of other features, and a bunch of other limitations; tune in to the SFPUG event to learn more.

Friday, May 9, 2014

Remastering without restarting

Thanks to Streaming-Only Remastering, PostgreSQL 9.3 has been a boon to high-availability setups and maintenance.  You can re-arrange your replicas however you like: remastered, in a tree, in a ring, whatever.  However, there's been one wart on this free reconfiguration of Postgres replication clusters: if you want to change masters, you have to restart the replica.

This doesn't sound like a big deal, until you think about a cluster with load balancing to 16 read-only replicas.  Every one of those you restart breaks a bunch of application connections.   Looking at how timeline switch works, it didn't seem like there was even a good reason for this; really, the only thing which seemed to be blocking it was that primary_conninfo comes from recovery.conf, which only gets read on startup.  I'd hoped that the merger of recovery.conf and postgresql.conf would solve this, but that patch got punted to 9.5 due to conflicts with SET PERSISTENT.

So, I set out to find a workaround, as well as proving that it was only the deficiencies of recovery.conf preventing us from doing no-restart remastering.  And I found one, thanks to the question someone asked me at pgDay NYC.

So, in the diagram above, M1 is the current master.  M2 is a replica which is the designated failover target.  R1 and R2 are additional replicas.  "proxy" is a simple TCP proxy; in fact, I used a python proxy written in 100 lines of code for this test.  You can't use a Postgres proxy like pgBouncer because it won't accept a replication connection.
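
The proxy doesn't need to be fancy; it just shuttles bytes in both directions.  This isn't the script I used, but a minimal Python sketch of that kind of dumb TCP relay (ports and hosts are illustrative):

```python
import socket
import threading

def pipe(src, dst):
    """Relay bytes from src to dst until EOF, then half-close dst."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def serve(listen_port, target_host, target_port):
    """Accept connections on listen_port and relay each one to the target.
    'Remastering' is just restarting this process pointed at the new master."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("", listen_port))
    server.listen(5)
    while True:
        client, _ = server.accept()
        upstream = socket.create_connection((target_host, target_port))
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()
```

The key point is that the replicas' primary_conninfo points at the proxy's port, not at any particular master, so the replicas never need to change their configuration.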

Remastering time!
  1. Shut down M1
  2. Promote M2
  3. Restart the proxy, now pointing to M2
And the new configuration:

But: what happened to R1 and R2?  Did they remaster without restarting?  Yes, indeedy, they did!

LOG:  entering standby mode
LOG:  redo starts at 0/21000028
LOG:  consistent recovery state reached at 0/210000F0
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/22000000 on timeline 6
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 6 at 0/2219C1F0.
FATAL:  could not send end-of-streaming message to primary: no COPY in progress
FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "" and accepting
                TCP/IP connections on port 9999?

LOG:  fetching timeline history file for timeline 7 from primary server
LOG:  started streaming WAL from primary at 0/22000000 on timeline 6
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 6 at 0/2219C1F0.
LOG:  new target timeline is 7
LOG:  restarted WAL streaming at 0/22000000 on timeline 7

Not only does this provide us a new remastering workaround for high-availability configurations on 9.3, it also shows us that as soon as we get around to merging recovery.conf with postgresql.conf, restarting to remaster can be eliminated.

Wednesday, May 7, 2014

Why you should always set temp_file_limit

"The database is growing at 2GB a minute.  We're 40 minutes away from running out of disk space."

"Sounds like I should probably take a look."

I looked at the database size, which was 160GB.  But the database SAN share was up to 1.4TB used out of 1.6TB.  WTF?

Then I looked at the filesystem and did directory sizes.  pgsql_tmp was over a terabyte.  Oooooooohhh.

Apparently they'd accidentally pushed a new report to the application which worked OK in testing, but with certain parameters created a 15 billion item sort.  And since it was slow, users called it several times.  Ooops.

Enter temp_file_limit, a parameter added by Mark Kirkwood to PostgreSQL 9.2.   This is a limit on per-session usage of temporary files for sorts, hashes, and similar operations.  If a user goes over the limit, their query gets cancelled and they see an error.

This is an excellent way to prevent a single user, or a bad application change, from DOSing your database server.  Set it to something high; I'm using 10GB or 20GB, or 10% of available disk space, whichever is less.  But even a high limit like that will save you from some unexpected downtime.
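
In postgresql.conf, that looks like this (the value is illustrative; pick yours per the rule of thumb above):

```
# cap per-session temp file usage for sorts, hashes, etc.
# queries exceeding the limit are cancelled with an error
temp_file_limit = '20GB'
```

It can also be set per-session or per-role (e.g. ALTER ROLE reporting SET temp_file_limit = '1GB') if your batch jobs need different headroom than your web users.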

Thursday, May 1, 2014

New Finding Unused Indexes Query

As long as we're overhauling standard monitoring queries for PostgreSQL, here's another one.  This query helps you find indexes which are relatively unused, and as a result could probably be dropped.  I wrote a more complex (yes, really) version for our internal Performance Health Check suite, but the linked version is usable by anyone.

The query is also an example of why CTEs, otherwise known as "WITH statements", are a life-saver for working with complex queries.  I've only tested it on 9.2 and 9.3; I don't know if it'll work on older versions.

Before you use it, you need to check how long you've been collecting data into pg_stat_user_indexes and friends.  The default is since you created the database, but some people reset stats on a daily or monthly basis.  So it's important to know what you're looking at.   Don't make the mistake of dropping the indexes which are needed for the month-end reports!

The query divides seldom-used indexes into four groups:

Indexes Which Aren't Scanned At All: these indexes have no scans during the stats period.  These pretty much certainly can be dropped, except those on really small tables which you expect to grow later.

Seldom Used Indexes on Heavily Written Tables:  as a general rule, if you're not using an index twice as often as it's written to, you should probably drop it.  This query is a little more conservative than that.

Really Large, Seldom-Used Indexes: these indexes get used, but not that often, and they're taking up a lot of RAM and storage space.   Consider dropping them after some thought about why they were added.

Large Non-BTree Indexes On Busy Tables:  as a rule, non-BTree indexes like GiST and GIN don't accurately report usage stats.  As a result, we can't check how often they're used, just how big they are and if they're attached to tables which get a lot of writes.  This list of indexes should be very judiciously pruned.

Happy pruning!