Thursday, August 14, 2014

More on 9.4 percentiles

Some of you may be following the discussion on JSONB compression on -hackers, which is the remaining major issue for 9.4.  As part of this, I needed to run some statistics on average column sizes for JSON vs JSONB tables. This makes a good example of how incredibly useful WITHIN GROUP will be in 9.4.

First, I created two versions of the data, one called "jsonic" which has text-JSON data, and one called "jsonbish" which has the JSONB version.  Then I ran some comparisons using pg_column_size on the JSON columns.

with colsizes_b as (
    select pg_column_size(jsonbish.jcol) as colsize
    from jsonbish
),
distrib_b as (
    select percentile_cont(array [0,0.25,0.5,0.75,1])
    within group (order by colsize) as colsize_distribution
    from colsizes_b
),
colsizes_j as (
    select pg_column_size(jsonic.jcol) as colsize
    from jsonic
),
distrib_j as (
    select percentile_cont(array [0,0.25,0.5,0.75,1])
    within group (order by colsize) as colsize_distribution
    from colsizes_j
)
select 'json' as thetype,
    colsize_distribution from distrib_j
union all
select 'jsonb',
    colsize_distribution
    from distrib_b;

So I'm taking the column size of each row, then sorting them by size, and then doing a percentile distribution using the new WITHIN GROUP.  There's no GROUP BY there because the group is actually the whole table.  I've chosen the usual box plot percentages: minimum, 25%, median, 75%, and maximum.

And the results:

 thetype |    colsize_distribution   
 json    | {1741,1767,1854,1904,2292}
 jsonb   | {3551,5866,5910,5958,6168}

... which demonstrates that we do indeed have a major issue with compressibility.
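
If you want to try the same box-plot pattern without the JSON test tables, here is a minimal sketch against generated data.  The series 1 through 10 is a made-up stand-in for column sizes, not data from the test above.  It also shows that percentile_cont() interpolates between neighboring values when a percentile falls between two rows:

```sql
-- Hypothetical sample data: the integers 1..10 stand in for column sizes.
-- With no GROUP BY, the aggregate runs over the whole result set.
SELECT percentile_cont(ARRAY[0, 0.25, 0.5, 0.75, 1])
       WITHIN GROUP (ORDER BY n) AS distribution
FROM generate_series(1, 10) AS t(n);
-- distribution: {1,3.25,5.5,7.75,10}
```

The 25% point lands a quarter of the way between the third and fourth values, which is why it comes back as 3.25 rather than an actual value from the set.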

Tuesday, August 12, 2014

SFPUG Video on YouTube

TL;DR: SFPUG video available on YouTube, including streaming video tonight.  I need help converting old videos and making new ones.

First, we have a few SFPUG videos available on YouTube:
While JSONB Deep Dive was also recorded, there were issues with the audio, so I don't recommend watching it, sorry.

We will have streaming video for tonight's SFPUG, which will be on Postgres-XL.   Join the Google Event to get a link to the streaming video once it starts; expected start time is 7:15PM PST.

This brings me to the third thing, which is that I could use some help with video, in two areas:
  1. I desperately need someone to help take better-quality videos of SFPUG meetings, with better focus and sound.  Right now I'm using a Logitech webcam, and it's just not cutting it.  Video will need to be compatible with Google Hangouts, unless you have another plan for broadcasting.
  2. I have 7 or 8 videos of SFPUG meetings from 2013, including Tom Lane explaining the query planner, in FLV format.  I need someone to transcode these to a YouTube format and upload them.
If you can help, email me.  Thanks!

Thursday, August 7, 2014

Call for quotes for the 9.4 release announcement

We are looking for endorsement quotes for the 9.4 release.
Specifically, we want quotes from end-users or ISVs* who have not been
featured for a previous release, and who can enthuse about one of the following:
  • Streaming Changesets/Bi-Directional Replication
  • Performance improvements in 9.4
  • Something else about Flexibility, Scalability or Performance
We need these quotes by August 15th.  Contact me if you want to offer a quote.  Thanks!

(* while we love our PostgreSQL consulting and support companies,
endorsements from them aren't news in a PR sense)

Friday, June 13, 2014

Pending Portland Postgres Presentations

This year is apparently my year to spend most of my time in Portland.  Maybe I should move there ;-)  Please read through if you are in Portland or are attending conferences in Portland for a list of PostgreSQL-related events there.

O'Reilly OSCON 2014: on Monday, July 21st, I will be presenting The Accidental DBA again at OSCON.  O'Reilly has asked me to reprise this tutorial as it was sold out last year.  This tutorial covers the care and feeding of a PostgreSQL server for people whose main job is something else, or for those new to database administration.  Will include multiple hands-on exercises using Vagrant on your own laptop.  The exercises are on github, but will be updated on or before July 10th in order to cover 9.3 and feedback from the last talk.

Note that there will also be a special PDXPUG meeting or BOF at OSCON; details TBD.  I will also be doing an Office Hours session at OSCON on July 22nd at 3:20 PM; please bring your questions about PostgreSQL scalability and replication.

I am likely to be speaking at DjangoCon in September.  As the Call for Presentations is still open, I don't know what it will be about, or have confirmation that I will, in fact, be speaking.

Right between DjangoCon and FOSS4G, PDXPUG will be hosting a full PostgreSQL Day at Portland State University on September 6.  This event is free, so if you are going to either DjangoCon or FOSS4G, extend your stay in Portland in order to attend!  RSVP required.

People are still submitting presentations for that; if you're interested in speaking, contact Mark per the wiki page.  I expect to be talking about 9.4, replication, or maybe Postgres on Amazon.

Finally, at FOSS4G, I will be doing a full day of Postgres/PostGIS workshops.  In the morning I will do Accidental DBA again, this time tailored for PostGIS administrators.  In the afternoon, I will do a hands-on "learning to do PostGIS replication" tutorial.  This does mean that I will likely not include much about replication in the morning session.  Both of these workshops are "bring your own device", which means that you will need to install the tutorial materials on your laptop in advance.

I will be spending the first half of September in Portland, so if you or your company is interested in onsite training or consulting during September, please get in touch with PGX.

Tuesday, June 3, 2014

Upcoming SFPUG Live Streams: Crontabber and JSONB

Apparently June is the month for hackers named Peter with difficult-to-spell last names.  Join us for two live streams of two different Peters giving two different PostgreSQL presentations.

First, on June 10th, we'll have Peter Bengtsson of Mozilla presenting Crontabber, a network-wide scheduled job manager written in Python and PostgreSQL.  Tune in on Air Mozilla at around 7:15pm PDT to watch this.

Second, on June 23rd, Peter Geoghegan of Heroku will do a "JSONB Deep Dive" and explain the features and internals of the new JSONB type.   Subscribe to the Google Event to be linked into the Google Hangout when this goes live around 7:15PM PDT.

See you there!

Sunday, May 25, 2014

9.4 Theme Contest Analyzed by 9.4

So a couple weeks ago I ran a little contest to see who could come up with a slogan for PostgreSQL 9.4.  Surprisingly, we got over 300 votes on various slogans, which means I need to do some statistics to analyze them -- which means I'm going to show off some of PostgreSQL's new 9.4 features as part of that!

Version 9.4 includes a number of new aggregate, array and set operations which make it vastly easier and faster to do statistical summaries and analysis.  Most of these were contributed by Andrew Gierth, including the two I'm going to use below, FILTER and WITHIN GROUP.  I'm also going to use MATERIALIZED VIEWS, developed by Kevin Grittner.  First, though, I need to import the data.  So I downloaded the survey results as a CSV, and created a table for them in PostgreSQL and loaded it up:

CREATE TABLE raw_survey (
    ts       timestamptz,
    prf      integer,
    moreways integer,
    devops   integer,
    moresql  integer,
    yesql    integer
);

\copy raw_survey from 'slogans.csv' with csv header

Now, Google's column-per-question format isn't very friendly to analysis and comparison; I want a more vertical orientation.  So I create one as a MatView.  This means that if I reimport the data in the future, or weed out obvious ballot-box stuffing, I just need to refresh it:

CREATE MATERIALIZED VIEW slogans AS
SELECT 'Performance, Reliability, Flexibility' as slogan, prf as vote
FROM raw_survey
UNION ALL
SELECT 'More Ways to Database', moreways
FROM raw_survey
UNION ALL
SELECT 'A Database for Dev and Ops', devops
FROM raw_survey
UNION ALL
SELECT 'More Than SQL', moresql
FROM raw_survey
UNION ALL
SELECT 'NoSQL + YeSQL = PostgreSQL', yesql
FROM raw_survey;
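
The refresh step mentioned above can be sketched on a toy example.  The demo_votes and demo_long names and the two-column layout are made up for illustration and are not part of the survey schema; the point is only that a MatView keeps its old contents until you explicitly refresh it:

```sql
-- Hypothetical wide table: one column per question, one row per ballot.
CREATE TABLE demo_votes (a integer, b integer);
INSERT INTO demo_votes VALUES (5, 1), (4, 2);

-- Vertical (one row per vote) version of the same data.
CREATE MATERIALIZED VIEW demo_long AS
SELECT 'a'::text AS slogan, a AS vote FROM demo_votes
UNION ALL
SELECT 'b'::text, b FROM demo_votes;

-- A new ballot arrives; the MatView still holds the old 4 rows ...
INSERT INTO demo_votes VALUES (3, 3);

-- ... until it is refreshed, after which it holds 6 rows.
REFRESH MATERIALIZED VIEW demo_long;
SELECT count(*) FROM demo_long;
```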

Now, for some statistics.  A total or average is easy, but it's not statistically sound.  A median is a much better statistic.  I also want to know the balance of people who hated a slogan (1) vs. loved it and put it first (5).  So, some of the new aggregates.

In the past, I've retrieved medians by either using SciPy inside PL/Python, or by doing some elaborate calculations on windowing rank.  No more.  Now I can do a simple one-line median using WITHIN GROUP.  WITHIN GROUP is a lot like a windowing aggregate, except that it returns a single summary aggregate.  Shipping with version 9.4 are several such aggregates, including percentile_cont() which is one of three functions which allow you to get the value at the stated percent of a sorted group: in this case, 0.5 to get a median.  Like so:

SELECT slogan,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;

 slogan                                | percentile_cont
 A Database for Dev and Ops            | 3
 More Than SQL                         | 3
 More Ways to Database                 | 3
 NoSQL + YeSQL = PostgreSQL            | 3
 Performance, Reliability, Flexibility | 4
"Performance, Reliability, Flexibility" is taking a clear lead here.  Incidentally, percentile_cont() can take an array of values in order to give you a full box (remember, every time you say "big data" without drawing a box plot, God kills a kitten):

SELECT slogan,
    percentile_cont(ARRAY[0.1,0.25,0.5,0.75,0.9]) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;

 slogan                                | percentile_cont
 A Database for Dev and Ops            | {1,2,3,3,4}
 More Than SQL                         | {1.4,2,3,4,5}
 More Ways to Database                 | {1,2,3,4,5}
 NoSQL + YeSQL = PostgreSQL            | {1,1,3,4,5}
 Performance, Reliability, Flexibility | {2,3,4,5,5}
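
A side note on the fractional 1.4 in those results: percentile_cont() interpolates between adjacent values, so it can return a number nobody actually voted.  Here is a small sketch comparing it with percentile_disc() and mode(), two other ordered-set aggregates that ship with 9.4.  The four values are invented for illustration and are not from the survey:

```sql
-- Hypothetical sample: four votes, sorted order is 1, 2, 4, 4.
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v) AS median_cont,
       percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS median_disc,
       mode()               WITHIN GROUP (ORDER BY v) AS most_common
FROM (VALUES (1), (2), (4), (4)) AS t(v);
-- median_cont = 3  (interpolated halfway between 2 and 4)
-- median_disc = 2  (an actual value from the set)
-- most_common = 4
```

If you need the result to always be one of the input values, percentile_disc() is probably the one to reach for.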

Let's check our "loves" and "hates" to see if they tell us anything different.  Now, the old way to do this would be:

SELECT slogan,
    sum(CASE WHEN vote = 1 THEN 1 ELSE 0 END) as hates,
    sum(CASE WHEN vote = 5 THEN 1 ELSE 0 END) as loves
FROM slogans
GROUP BY slogan;

Awkward, neh?  Well, no more, thanks to the FILTER clause:

SELECT slogan,
    count(*) FILTER ( WHERE vote = 1 ) as hates,
    count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;

Isn't that way more intuitive and readable?  I think it is, anyway.  So, let's put it all together:

SELECT slogan,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY vote) as median,
    count(*) FILTER ( WHERE vote = 1 ) as hates,
    count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;

And the results:

 slogan                                | median | hates | loves
 A Database for Dev and Ops            | 3      | 47    | 21
 More Than SQL                         | 3      | 32    | 58
 More Ways to Database                 | 3      | 39    | 55
 NoSQL + YeSQL = PostgreSQL            | 3      | 81    | 58
 Performance, Reliability, Flexibility | 4      | 11    | 118

And there we have it: "Performance, Reliability, Flexibility" is the winning theme idea for 9.4.  It wins on median, and on hates vs. loves counts.
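
For anyone adapting these queries: FILTER is not limited to count(*); it can be attached to other aggregates as well.  The sketch below uses a few made-up votes (not survey data) to show the old CASE form and the FILTER form returning the same counts, plus an average that ignores the "hate" votes:

```sql
-- Hypothetical votes for two placeholder slogans, 'x' and 'y'.
SELECT slogan,
       sum(CASE WHEN vote = 5 THEN 1 ELSE 0 END)   AS loves_old_way,
       count(*) FILTER (WHERE vote = 5)            AS loves,
       round(avg(vote) FILTER (WHERE vote > 1), 2) AS avg_without_hates
FROM (VALUES ('x', 1), ('x', 5), ('x', 4),
             ('y', 5), ('y', 5)) AS t(slogan, vote)
GROUP BY slogan
ORDER BY slogan;
-- x | 1 | 1 | 4.50
-- y | 2 | 2 | 5.00
```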

Congratulations Korry Douglas; I'll contact you about shipping your Chelnik.  Note that the theme will be workshopped a little bit to fit in the structure of the final 9.4 release announcement (i.e. we may change it slightly to match the sections of the actual press release), but we're going with that general idea now.

Thursday, May 15, 2014

Help us choose an advocacy theme for PostgreSQL 9.4

Every year, for each PostgreSQL release, I have a "theme" which decides our graphics and presentation themes for promoting that version of PostgreSQL.   In the past, the themes have generally been my personal ideas, but this year we're putting it out to our greater community.

Five potential theme ideas have been selected from about 100 which were suggested on the pgsql-advocacy mailing list.  Now we need you to rate them, in order to decide which one we go with ... and who wins a Chelnik from Mark Wong!

Please vote on the basis of selecting a good theme/slogan for PostgreSQL 9.4 specifically, rather than just what sounds like the coolest phrase.

So, vote!