Sunday, May 25, 2014

9.4 Theme Contest Analyzed by 9.4

So a couple weeks ago I ran a little contest to see who could come up with a slogan for PostgreSQL 9.4.  Surprisingly, we got over 300 votes on various slogans, which means I need to do some statistics to analyze them -- which means I'm going to show off some of PostgreSQL's new 9.4 features as part of that!

Version 9.4 includes a number of new aggregate, array and set operations which make it vastly easier and faster to do statistical summaries and analysis.  Most of these were contributed by Andrew Gierth, including the two I'm going to use below, FILTER and WITHIN GROUP.  I'm also going to use MATERIALIZED VIEWS, developed by Kevin Grittner.  First, though, I need to import the data.  So I downloaded the survey results as a CSV, and created a table for them in PostgreSQL and loaded it up:

CREATE TABLE raw_survey (
    ts       timestamptz,
    prf      integer,
    moreways integer,
    devops   integer,
    moresql  integer,
    yesql    integer,
);

\copy raw_survey from 'slogans.csv' with csv header

Now, Google's column-per-question format isn't very friendly to analysis and comparison; I want a more vertical orientation.  So I create one as a MatView.  This means that if I reimport the data in the future, or weed out obvious ballot-box stuffing, I just need to refresh it:

CREATE MATERIALIZED VIEW slogans AS
SELECT 'Performance, Reliability, Flexibility' as slogan, prf as vote
FROM raw_survey
UNION ALL
SELECT 'More Ways to Database', moreways
FROM raw_survey
UNION ALL
SELECT 'A Database for Dev and Ops', devops
FROM raw_survey
UNION ALL
SELECT 'More Than SQL', moresql
FROM raw_survey
UNION ALL
SELECT 'NoSQL + YeSQL = PostgreSQL', yesql
FROM raw_survey;


Now, for some statistics.  A total or average is easy, but it's not statistically sound.  A median is a much better statistic.  I also want to know the balance of people who hated a slogan (1) vs. loved it and put it first (5).  So, some of the new aggregates.

In the past, I've retrieved medians by either using SciPy inside PL/Python, or by doing some elaborate calculations on windowing rank.  No more.  Now I can do a simple one-line median using WITHIN GROUP.  WITHIN GROUP is a lot like a windowing aggregate, except that it returns a single summary aggregate.  Shipping with version 9.4 are several such aggregates, including percentile_cont() which is one of three functions which allow you to get the value at the stated percent of a sorted group: in this case, 0.5 to get a median.  Like so:

SELECT slogan,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;


slogan percentile_cont
A Database for Dev and Ops 3
More Than SQL 3
More Ways to Database 3
NoSQL + YeSQL = PostgreSQL 3
Performance, Reliability, Flexibility 4

"Performance, Reliability, Flexibility" is taking a clear lead here.  Incidentally, percentile_cont() can take an array of values in order to give you a full box (remember, every time you say "big data" without drawing a box plot, God kills a kitten):

SELECT slogan,
    percentile_cont(ARRAY[0.1,0.25,0.5,0.75,0.9]) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;


slogan percentile_cont
A Database for Dev and Ops {1,2,3,3,4}
More Than SQL {1.4,2,3,4,5}
More Ways to Database {1,2,3,4,5}
NoSQL + YeSQL = PostgreSQL {1,1,3,4,5}
Performance, Reliability, Flexibility {2,3,4,5,5}
Let's check or "loves" and "hates" to see if they tell us anything different.  Now, the old way to do this would be:

SELECT slogan,
    sum(CASE WHEN vote = 1 THEN 1 ELSE 0 END) as hates,
    sum(CASE WHEN vote = 5 THEN 1 ELSE 0 END) as loves
FROM slogans
GROUP BY slogan;


Awkward, neh?  Well, no more, thanks to the FILTER clause:

SELECT slogan,
    count(*) FILTER ( WHERE vote = 1 ) as hates,
    count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;


Isn't that way more intuitive and readable?  I think it is, anyway.  So, let's put it all together:

SELECT slogan,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY vote) as median,
    count(*) FILTER ( WHERE vote = 1 ) as hates,
    count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;


And the results:


slogan median hates loves
A Database for Dev and Ops 3 47 21
More Than SQL 3 32 58
More Ways to Database 3 39 55
NoSQL + YeSQL = PostgreSQL 3 81 58
Performance, Reliability, Flexibility 4 11 118

And there we have it: "Performance, Reliability, Flexibility" is the winning theme idea for 9.4.  It wins on median, and on hates vs. loves counts.

Congratulations Korry Douglas; I'll contact you about shipping your Chelnik.  Note that the theme will be workshopped a little bit to fit in the structure of the final 9.4 release announcement (i.e. we may change it slightly to match the sections of the actual press release), but we're going with that general idea now.

No comments:

Post a Comment