Tuesday, June 18, 2013

Maximum Importance Aggregation

You already know that I'm a big fan of custom aggregates.  What's even more fun is when you combine them with the ordered aggregate feature from PostgreSQL 9.0.  For example, let's take a fairly common task: selecting the item with the maximum importance.

This is a challenge for standard aggregation because you want to display a value which relates to an importance score contained in another table.  For example, the table could be a lookup list of error types, and you want to display the most severe error.  This is simple to do if you only want one row, but what if you want to group by other columns?  Well, here's a relatively neat way using ordered aggregates and custom aggregates.

First, we want to create a generic aggregate called "first", which simply returns the first non-null value in the column:

    CREATE FUNCTION first_state (
        anyelement, anyelement )
    RETURNS anyelement
    LANGUAGE SQL
    IMMUTABLE STRICT
    AS $f$
        SELECT $1;
    $f$;


    CREATE AGGREGATE first ( anyelement ) (
            SFUNC = first_state,
            STYPE = anyelement
    );


This is a generally useful aggregate to have around, btw, so I suggest adding it to your general library.

Now, let's look at our tables:

    table errors (
        error_code text primary key,
        severity int not null default 0,
        description text
    )

    table error_log (
        lo_time timestamptz,
        error_code text    

    )

So, say I wanted to see the count of errors for each hour in the log and the most severe error during that hour?  Well, what I need to do is to select the first error code which shows up when I order by severity.  For good measure, I order by error_code as well, in case there's two codes with the same severity:

    SELECT date_trunc('hour',log_time) as log_hour,
        count(*) as error_count,
        first(error_code order by severity desc, error_code) 

            as most_severe
    FROM error_log
        JOIN errors USING (error_code)
    GROUP BY date_trunc('hour',log_time)
    ORDER BY date_trunc('hour',log_time);

 
And that works a treat:

           log_hour        | error_count | most_severe
    -----------------------+-------------+-------------
    2013-06-17 00:00:00+00 |           6 | CX56B
    2013-06-17 01:00:00+00 |          16 | CX56B
    2013-06-17 02:00:00+00 |           4 | PPPMQ5
    2013-06-17 03:00:00+00 |           2 | PPPMQ5
    2013-06-17 04:00:00+00 |          24 | CX56B
    2013-06-17 05:00:00+00 |          15 | CX56B
    2013-06-17 06:00:00+00 |           5 | FM229A


I'm sure you can figure out ways to adapt this to your needs.  Enjoy!



4 comments:

  1. We really should have first() (and possibly last()) bundled in core PostgreSQL.

    ReplyDelete
    Replies
    1. count_true() is also a surprisingly handy general custom aggregate.

      Delete
  2. What's the difference between first() above and first_value() already in PG core?

    http://www.postgresql.org/docs/current/static/functions-window.html

    first_value() """returns value evaluated at the row that is the first row of the window frame"""

    (sorry if this is posted multiple times - no feedback at all on whether it worked or not)

    ReplyDelete
    Replies
    1. first_value() requires a window declaration; it doesn't work as an aggregate.

      Technically, we should support first_value(x order by y), which is effectively the same thing. But we don't.

      Delete