Friday, August 23, 2013

PostgreSQL plus Vertica on Tuesday: SFPUG Live Video

This upcoming Tuesday, the 27th, SFPUG will have live streaming video of Chris Bohn from Etsy talking about how he uses PostgreSQL and Vertica together to do data analysis of Etsy's hundreds of gigabytes of customer traffic.  Barring technical difficulties with the video or the internet, of course.

The video will be on the usual SFPUG Video Channel.  It is likely to start around 7:15PM PDT.  Questions from the internet will be taken on the attached chat channel.

For those in San Francisco, this event will be held at Etsy's new downtown SF offices, and Etsy is sponsoring a Tacolicious taco bar.  Of course, the event is already full up, but you can always join the waiting list.

In other, related events, sfPython will be talking about PostgreSQL performance, and DjangoSF will be talking about multicolumn joins, both on Wednesday the 28th.  I'll be at DjangoSF, doing my "5 ways to Crash Postgres" lightning talk.

Monday, August 12, 2013

Fancy SQL Monday: ON vs. NATURAL JOIN vs. USING

First, there was the SQL89 JOIN, where there was no JOIN statement, and joins were performed by adding WHERE clauses.  To wit:

SELECT count(*)
FROM users, comments
WHERE users.user_id = comments.user_id
   AND users.status = 'active'
   AND comments.comment_date BETWEEN '2013-07-01' and '2013-07-31'

There were two problems with this kind of JOIN: you often got join conditions mixed up with other filters and accidentally did cartesian products, and there was no way to do directional (outer) joins.

Not that that keeps people from continuing to use it.
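The cartesian-product hazard is easy to demonstrate.  Here's a minimal sketch using Python's sqlite3 (table contents are made up for the illustration): the SQL89 style puts the join condition in WHERE, so leaving it out doesn't produce an error, just a silently multiplied result set.

```python
import sqlite3

# Hypothetical two-table setup to illustrate the hazard.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (user_id INTEGER, status TEXT);
    CREATE TABLE comments (comment_id INTEGER, user_id INTEGER);
    INSERT INTO users VALUES (1, 'active'), (2, 'active');
    INSERT INTO comments VALUES (10, 1), (11, 1), (12, 2);
""")

# SQL89 style: the join condition lives in the WHERE clause ...
joined = con.execute("""
    SELECT count(*) FROM users, comments
    WHERE users.user_id = comments.user_id
""").fetchone()[0]

# ... so forgetting it silently yields a cartesian product.
cartesian = con.execute(
    "SELECT count(*) FROM users, comments"
).fetchone()[0]

print(joined, cartesian)  # 3 matched rows vs. 2 * 3 = 6 rows
```

With real tables the multiplied count is millions of rows, not six, which is exactly how these bugs get noticed.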

So then we got the SQL92 JOIN:

SELECT count(*)
FROM users JOIN comments
   ON users.user_id = comments.user_id
WHERE users.status = 'active'
   AND comments.comment_date BETWEEN '2013-07-01' and '2013-07-31'

But this wasn't the end of things.  Since SQL92 allowed users to define FOREIGN KEYS, why not allow them to just join along the keys "naturally"?

SELECT count(*)
FROM users NATURAL JOIN comments
WHERE users.status = 'active'
   AND comments.comment_date BETWEEN '2013-07-01' and '2013-07-31'

Except that the SQL committee screwed up, and instead of using the declared foreign keys, NATURAL JOIN looks for columns with identical names.  And if there are several columns with identical names, there's no standard way to resolve them.  So NATURAL JOIN, which ought to have been a real keystroke-saver, instead turned out to be uselessly unreliable.
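You can watch NATURAL JOIN go wrong as soon as two tables share a second column name.  A sketch with Python's sqlite3 (hypothetical tables, both of which happen to have a status column):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users    (user_id INTEGER, status TEXT);
    CREATE TABLE comments (user_id INTEGER, status TEXT, body TEXT);
    INSERT INTO users VALUES (1, 'active');
    INSERT INTO comments VALUES (1, 'visible', 'hello'), (1, 'active', 'hi');
""")

# NATURAL JOIN matches on *every* identically named column:
# here (user_id, status), not just the foreign key you intended.
rows = con.execute(
    "SELECT count(*) FROM users NATURAL JOIN comments"
).fetchone()[0]
print(rows)  # 1 -- the 'visible' comment silently disappears
```

A join on user_id alone would return both comments; the extra status match drops one of them with no warning at all.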

One of the annoying issues about joins between two tables on columns of the same name is that SQL wants you to constantly qualify which table you're asking for the column from, even though the contents are identical:

ERROR:  column reference "user_id" is ambiguous

This was annoying, so they created the USING clause instead:

SELECT user_id, count(*)
FROM users JOIN comments USING ( user_id )
WHERE users.status = 'active'
   AND comments.comment_date BETWEEN '2013-07-01' and '2013-07-31'
GROUP BY user_id;

In Postgres, a join with USING merges each matched pair of columns into a single output column, just as NATURAL JOIN does; not every DBMS behaves this way.

Of course, USING only works if your columns do have the same names.  All the more reason to adopt that as a naming convention!
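Here's a small sketch of the merged-column behavior using Python's sqlite3, which follows the same rule (table contents invented for the demo): with USING, the bare column name is legal; with ON, the identical query is rejected as ambiguous.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users    (user_id INTEGER, status TEXT);
    CREATE TABLE comments (user_id INTEGER, body TEXT);
    INSERT INTO users VALUES (1, 'active');
    INSERT INTO comments VALUES (1, 'hello'), (1, 'world');
""")

# With USING, the bare name refers to the single merged column:
ok = con.execute("""
    SELECT user_id, count(*)
    FROM users JOIN comments USING (user_id)
    GROUP BY user_id
""").fetchall()

# With ON, the same bare reference is ambiguous and rejected:
err = None
try:
    con.execute("SELECT user_id FROM users "
                "JOIN comments ON users.user_id = comments.user_id")
except sqlite3.OperationalError as e:
    err = str(e)

print(ok, err)  # [(1, 2)] ambiguous column name: user_id
```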

Thursday, August 8, 2013

Another reason to set pg_hba.conf restrictively

pg_hba.conf can be a pain in the neck.  You're expecting some burst traffic, you deploy 8 new virtual appservers, fire them up, and ... they can't connect to Postgres.  Gah!

That kind of experience makes it tempting to set everything to:

   host  all  all  0.0.0.0/0  md5

But I just encountered another demonstration this month of why you shouldn't.  While everyone thinks of malefactors as the main reason to restrict which hosts can connect via pg_hba.conf, remember that accidents are far more common than malice.  As was the case this time.

I set up a benchmark test for a client which involved replaying 8 hours of activity logs against a test copy of their database.  What they didn't tell me (because they didn't know due to poor internal documentation) was that their database used DBlink to write data to another database several times a minute.

Thank goodness they had exercised good pg_hba.conf practices!  All I got was a bunch of DBlink errors, instead of scribbling all over production data.   Whew!
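For contrast, a restrictive pg_hba.conf looks something like the fragment below.  The database names, roles, and subnets are hypothetical; the point is that each entry names exactly who may connect from where, and anything unmatched is rejected.

```
# TYPE  DATABASE  USER     ADDRESS        METHOD
host    appdb     appuser  10.0.1.0/24    md5    # appserver subnet only
host    appdb     analyst  10.0.2.15/32   md5    # one reporting host
# connections matching no line are refused
```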

Tuesday, August 6, 2013

Speaker Boot Camp at LinuxCon

Are you technically expert but locutionarily inept?  Are you giving your first or second talk this year?  Do you present frequently, but don't feel like you're "connecting" with your audience?  Are you going to LinuxCon in New Orleans?

Then you should attend the Give a Great Tech Talk workshop the evening before LinuxCon.  It's a sort of "speaker boot camp"; if you are speaking at LinuxCon, or, better yet, are attending LinuxCon and planning to speak at other conferences in 2013-2014, you should consider attending.

So that speakers can have coaching before the beginning of the conference, the session will be from 5pm to 8:30pm on Sunday the 15th (concurrent with the Meet and Greet), before the start of LinuxCon, at the Hyatt.  The Linux Foundation will provide food.  Join us and learn to deliver a better talk!

Monday, August 5, 2013

Fancy SQL Monday: generate_series() and cross joins

So, here's the situation:  you want to create a report which shows totals of registrations by day and by category, and shows zeros for each day and category you don't have any registrations for.  Your first attempt looks like this:

SELECT date_trunc('day', registered_on) as registration_date,
      category,
      count(*) as no_registrations, 
      sum(fee) as total_earned
FROM registrations
WHERE registered_on BETWEEN '2013-07-01' and '2013-08-01'
GROUP BY date_trunc('day', registered_on), category
ORDER BY date_trunc('day', registered_on), category;

However, the results of this query have a problem: if there were no registrations in a particular category on a particular day, it doesn't display any results.  You want it to display a zero.  So, how do you do that?

Well, what you want is to multiply the set of days in the month with the set of categories, forming a matrix where every category/day is represented.  The SQL syntax to say "give me all of X multiplied by all of Y" is a CROSS JOIN.  Normally, you don't want to do a cross join because it could result in millions or billions of rows, which would eat all the memory on your server.  But this time, you want one.

To do a cross join, though, you need the set of all days.  This is easy if you have a calendar table, but if you don't ... generate_series() to the rescue!  generate_series() is a terrific utility which creates one row for each iteration of a set expressed by start, end, interval.  So, like so:

generate_series('2013-07-01','2013-08-01',interval '1 day')

Putting that together with the categories in a cross join, we get this:

SELECT category, reg_day
FROM categories
    CROSS JOIN generate_series('2013-07-01','2013-08-01',interval '1 day')
    AS gs(reg_day)

That gives us our grid.  Now we're going to use a LEFT JOIN to attach it to the set of registrations, so that we get zeros for days without registrations.  The easiest way to do this is to wrap the above query in a WITH clause, otherwise known as a CTE.

WITH cat_days AS (
  SELECT category, reg_day
  FROM categories
      CROSS JOIN generate_series('2013-07-01','2013-08-01',interval '1 day')
      AS gs(reg_day)
)
SELECT cat_days.reg_day, cat_days.category,
    count(registrations.registration_id) as registrations,
    coalesce(sum(fee),0.0) as total_earned
FROM cat_days 
   LEFT JOIN registrations ON
     cat_days.category = registrations.category
     AND cat_days.reg_day <= registrations.registered_on
     AND ( cat_days.reg_day + interval '1 day') > registrations.registered_on
GROUP BY cat_days.reg_day, cat_days.category
ORDER BY cat_days.reg_day, cat_days.category;
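If you want to kick the tires on this pattern outside Postgres, here's a self-contained sketch using Python's sqlite3.  Since SQLite has no generate_series(), a recursive CTE stands in to build the calendar of days, and dates are plain text compared by equality rather than a timestamp range; the table contents are invented for the demo.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE categories (category TEXT);
    CREATE TABLE registrations (registration_id INTEGER, category TEXT,
                                registered_on TEXT, fee REAL);
    INSERT INTO categories VALUES ('workshop'), ('tutorial');
    INSERT INTO registrations VALUES
        (1, 'workshop', '2013-07-01', 25.0),
        (2, 'workshop', '2013-07-01', 25.0),
        (3, 'tutorial', '2013-07-02', 40.0);
""")

# Recursive CTE replaces generate_series(); the CROSS JOIN /
# LEFT JOIN pattern itself is unchanged.
rows = con.execute("""
    WITH RECURSIVE days(reg_day) AS (
        SELECT '2013-07-01'
        UNION ALL
        SELECT date(reg_day, '+1 day') FROM days
        WHERE reg_day < '2013-07-03'
    ),
    cat_days AS (
        SELECT category, reg_day FROM categories CROSS JOIN days
    )
    SELECT cat_days.reg_day, cat_days.category,
           count(registrations.registration_id) AS registrations,
           coalesce(sum(fee), 0.0) AS total_earned
    FROM cat_days
        LEFT JOIN registrations ON
            cat_days.category = registrations.category
            AND cat_days.reg_day = registrations.registered_on
    GROUP BY cat_days.reg_day, cat_days.category
    ORDER BY cat_days.reg_day, cat_days.category
""").fetchall()

for r in rows:
    print(r)  # every day/category pair appears, zeros included
```

Three days times two categories yields six rows, with (0, 0.0) filled in for the day/category combinations that have no registrations.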

So: CROSS JOINs, WITH clauses, and LEFT JOINs can actually be useful.  Enjoy!


Thursday, August 1, 2013

9.4 Commitfest 1 Wrap-up

The first CommitFest of the 9.4 development cycle involved an unprecedented number of patch submissions.  As of today, 9.4CF1 is still open because of one patch which apparently only Tom Lane can commit.  However, everything else in a brutal CommitFest has been dealt with, and it's 16 days after the CF was supposed to finish, so I think I can write the wrap-up now.

First, let's talk numbers: this CF started with 106 patches, peaked at 108 patches, and finished with 102 patches (6 were moved to September).  This is over 50% more patches than we've had for the first CommitFest of the annual development cycle and more patches than we've had in any single CF before. Last year, CF1 only involved 59 patches.  So if you are a committer and you feel exhausted, that's why.  Some graphs:

As you can see, this year is a jump forward in the number of patches being submitted to the first commitfest, and if it's representative of the rest of the year, we're all going to spend a lot more time reviewing patches.

Why so many patches?  Well, I can think of a few reasons:
  • contributions from the staffs of SalesForce and Huawei
  • beginning of 2ndQuadrant's work on streaming logical replication
  • a bunch of regression test patches
  • some new major contributors, including one from Google Summer of Code
  • more people submitting WIP patches to get spec feedback
All of the above are unquestionably good things, and not anything we want to deter, so we need to find ways to shoulder the load.  On the other hand, I'm not going to pay much attention to anyone who tells me we're "making things difficult for submitters".

So what did we get out of all this patch activity?  Some highlight features:
  • new pgbench options
  • improvements in compression of BLOBs and large rows
  • a bunch more regression tests
  • create background workers at runtime
  • FILTER option for aggregates
  • Part 1 of streaming logical replication
  • WITH ORDINALITY to supply "row numbers" for set-returning functions
All in all, 49 patches were committed (I'm expecting the remaining pending patch to make a round 50).  However, 46 were sent back and an additional 6 were moved to the September CommitFest, so we need to plan for that one to be busy too.

So, what are we going to do to handle more patches?  I have some thoughts on that in my next blog post ...