Monday, August 5, 2013

Fancy SQL Monday: generate_series() and cross joins

So, here's the situation:  you want to create a report which shows totals of registrations by day and by category, and shows zeros for each day and category you don't have any registrations for.  Your first attempt looks like this:

SELECT date_trunc('day',registered_on) as registration_date,
      category,
      count(*) as no_registrations, 
      sum(fee) as total_earned
FROM registrations
WHERE registered_on BETWEEN '2013-07-01' and '2013-08-01'
GROUP BY date_trunc('day', registered_on), category
ORDER BY date_trunc('day', registered_on), category;

However, the results of this query have a problem: if there were no registrations in a particular category on a particular day, it doesn't display any results.  You want it to display a zero.  So, how do you do that?

Well, what you want is to multiply the set of days in the month with the set of categories, forming a matrix where every category/day is represented.  The SQL syntax to say "give me all of X multiplied by all of Y" is a CROSS JOIN.  Normally, you don't want to do a cross join because it could result in millions or billions of rows, which would eat all the memory on your server.  But this time, you want one.

To do a cross join, though, you need the set of all days.  This is easy if you have a calendar table, but if you don't ... generate_series() to the rescue!  generate_series() is a terrific utility which creates one row for each iteration of a set expressed by start, end, interval.  So, like so:

generate_series('2013-07-01','2013-08-01',interval '1 day')

Putting that together with the categories in a cross join, we get this:

SELECT category, reg_day
FROM categories
    CROSS JOIN generate_series('2013-07-01','2013-08-01',interval '1 day')
    AS gs(reg_day)

That give us our grid.  Now we're going to use a LEFT JOIN to attach it to the set of registrations, so that we get zeros for days without registrations.  The easiest way to do this is to wrap the above query in a WITH clause, otherwise known as a CTE.

WITH cat_days AS (
  SELECT category, reg_day
  FROM categories
      CROSS JOIN generate_series('2013-07-01','2013-08-01',interval '1 day')
      AS gs(reg_day)
)
SELECT cat_days.reg_day,
    cat_days.category,
    count(registrations.registration_id) as registrations,
    coalesce(sum(fee),0.0) as total_earned
FROM cat_days 
   LEFT JOIN registrations ON
     cat_days.category = registrations.category
     AND cat_days.reg_day <= registrations.registered_on
     AND ( cat_days.reg_day + interval '1 day') >
        registrations.registered_on
GROUP BY reg_day, cat_days.category
ORDER BY reg_day, cat_days.category;

So: CROSS JOINs, WITH clauses, and LEFT JOINs can actually be useful.  Enjoy!

 




2 comments:

  1. I made use of this very technique a week or two ago. It's always nice to get validation.

    ReplyDelete
  2. Nice post,
    But can you give sample of your tables ?
    I have some problem make me Headache
    t'x

    ReplyDelete