Wednesday, April 17, 2013

A very simple custom aggregate

Custom aggregates are one of those features which are unique to Postgres, and seem hopelessly obscure.  Once you actually create one in earnest though, you'll wonder how you ever lived without one.  To get you started, I'm going to walk you through creating a very simple custom aggregate: one which gives you the mode (most frequent value) for a boolean column.

Why would you want such a thing?  Well, imagine you're monitoring your webservers, and you want to present 1-hour summaries of whether they are up or down.  However, you have data for each 30 seconds.  For a webserver which is up most of the time you want to return TRUE; for one which was down for most of the hour you want to return FALSE.  If the monitoring system was down (and thus there's no data), you want to return NULL.

You could do this using windowing queries.  However, that doesn't work well with other cumulative statistics, such as the number of minutes up.  You want something you can display side-by-side with other aggregate stats.  Well, with PostgreSQL, it's surprisingly easy!

First, we need a "state function" which accumulates data about the boolean.  This state function generally has two parameters, a data type which accumulates value, and the data type of the column you're aggregating.  In our case, we want to accumulate two counters: a count of falses and a count of trues, which we do using an array of INT.  This can be done with a pure-SQL function:

CREATE OR REPLACE function mode_bool_state(int[], boolean)
returns int[]
language sql
as $f$
SELECT CASE $2
WHEN TRUE THEN
    array[ $1[1] + 1, $1[2] ]
WHEN FALSE THEN
    array[ $1[1], $1[2] + 1 ]
ELSE
    $1
END;
$f$;


Once both registers have been accumulated, we need to use a "final" function to compare them and decide which is the mode, which will accept the accumulation type (INT[]) and return boolean:

CREATE OR REPLACE FUNCTION mode_bool_final(INT[])
returns boolean
language sql
as $f$
SELECT CASE WHEN ( $1[1] = 0 AND $1[2] = 0 )
THEN NULL
ELSE $1[1] >= $1[2]
END;
$f$;


Then we can declare the aggregate to bring it all together:

create aggregate mode(boolean) (
    SFUNC = mode_bool_state,
    STYPE = INT[],
    FINALFUNC = mode_bool_final,
    INITCOND = '{0,0}'
);


SFUNC and FINALFUNC refer to our two functions.  STYPE tells Postgres what state accumulator type to use, and INITCOND initializes the INT[] so that it's not NULL to start.

Let's see if it works!

SELECT server_name,
    sum(CASE WHEN server_up THEN 0.5 ELSE 0 END) as minutes_up,
    mode(server_up) as mode
FROM servers 
WHERE montime BETWEEN '2013-04-01' and '2013-04-01 01:00:00';

server_name      minutes_up       mode
web1             56.5             TRUE
web2             0.0              FALSE
web3             48.0             TRUE
web4             11.5             FALSE

So easy you'll wonder why you didn't do it before!

(thanks to David Fetter for suggesting INT[] instead of a composite type)


9 comments:

  1. Should 'mode_bool_trans' be 'mode_bool_state' in your aggregate declaration Josh?

    ReplyDelete
    Replies
    1. Yes, it should. Good catch, thanks. Obviously I'm transposing names from proprietary code ...

      Delete
  2. Awesome article!

    Should any of the functions be marked IMMUTABLE?

    ReplyDelete
  3. SQL functions (functions that use SQL language) should not be flagged, because a optimalizer can look into body of function and it understand to context - co it use well flags self. Different situations is with C or PL/pgSQL or other languages - these functions are black box for the optimizer and there flags are necessary.

    ReplyDelete
    Replies
    1. Marking as stable (which should be true for any SQL function I believe) helps in that you can EXPLAIN ANALYZE the function invocation and get a meaningful plan.

      Delete
  4. If you're interested, I also posted an article which includes the creation of a custom aggregate too, except I didn't bother explaining how the aggregate definition worked: http://thombrown.blogspot.co.uk/2010/11/countif-expression.html

    ReplyDelete
  5. Oracle has user-defined aggregate functions as well.

    ReplyDelete
    Replies
    1. I've been told this multiple times, but I've never seen one in the wild. Oracle's custom aggs must be really hard to use, or really limited.

      Delete
  6. Don't you need a 'GROUP BY server_name' in the select query?

    ReplyDelete