Database Soup: A very simple custom aggregate

Wednesday, April 17, 2013

Custom aggregates are one of those Postgres features which seem hopelessly obscure. Once you actually create one in earnest, though, you'll wonder how you ever lived without them. To get you started, I'm going to walk you through creating a very simple custom aggregate: one which gives you the mode (most frequent value) for a boolean column.

Why would you want such a thing? Well, imagine you're monitoring your webservers, and you want to present 1-hour summaries of whether they are up or down. However, you have a data point for every 30 seconds. For a webserver which was up for most of the hour you want to return TRUE; for one which was down for most of the hour you want to return FALSE. If the monitoring system was down (and thus there's no data), you want to return NULL.

You could do this using windowing queries. However, that doesn't work well with other cumulative statistics, such as the number of minutes up. You want something you can display side-by-side with other aggregate stats. Well, with PostgreSQL, it's surprisingly easy!

First, we need a "state function" which accumulates data about the boolean. A state function generally takes two parameters: the state accumulated so far, and the next value from the column you're aggregating. In our case, we want to accumulate two counters: a count of trues and a count of falses, which we keep in an array of INT. This can be done with a pure-SQL function:

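CREATE OR REPLACE FUNCTION mode_bool_state(int[], boolean)
RETURNS int[]
LANGUAGE sql
AS $f$
-- $1 is the running state: $1[1] counts TRUEs, $1[2] counts FALSEs.
-- $2 is the next value from the aggregated column.
SELECT CASE $2
WHEN TRUE THEN
    array[ $1[1] + 1, $1[2] ]
WHEN FALSE THEN
    array[ $1[1], $1[2] + 1 ]
ELSE
    -- NULL input: leave both counters unchanged
    $1
END;
$f$;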
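
To see what the state function does on its own, you can call it by hand and feed each result back in as the next state. This is only a sketch of what the aggregate machinery does for you, assuming the function above has already been created. The starting value '{0,0}' is simply an all-zero pair of counters.

```sql
-- Start from {0,0}, then feed in TRUE, FALSE and NULL in turn.
SELECT mode_bool_state('{0,0}', TRUE);    -- {1,0}
SELECT mode_bool_state('{1,0}', FALSE);   -- {1,1}
SELECT mode_bool_state('{1,1}', NULL);    -- {1,1}: NULL inputs are ignored
```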

Once both counters have been accumulated, we need a "final" function to compare them and decide which value is the mode. It accepts the accumulation type (INT[]) and returns boolean:

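CREATE OR REPLACE FUNCTION mode_bool_final(int[])
RETURNS boolean
LANGUAGE sql
AS $f$
-- $1[1] is the TRUE count, $1[2] is the FALSE count.
SELECT CASE WHEN ( $1[1] = 0 AND $1[2] = 0 )
THEN NULL            -- no non-NULL input was seen
ELSE $1[1] >= $1[2]  -- a tie counts as TRUE
END;
$f$;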
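
As a quick sanity check, the final function can also be called directly on a few hand-written states. The states below are made-up examples, not real monitoring data, and the sketch assumes the function above has been created.

```sql
SELECT mode_bool_final('{3,1}');  -- t: more TRUEs than FALSEs
SELECT mode_bool_final('{1,3}');  -- f: more FALSEs than TRUEs
SELECT mode_bool_final('{2,2}');  -- t: a tie is reported as TRUE
SELECT mode_bool_final('{0,0}');  -- NULL: nothing was counted
```

Note the tie-breaking choice: because the comparison is >=, an hour split exactly evenly between up and down is reported as up.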

Then we can declare the aggregate to bring it all together:

CREATE AGGREGATE mode(boolean) (
    SFUNC = mode_bool_state,
    STYPE = INT[],
    FINALFUNC = mode_bool_final,
    INITCOND = '{0,0}'
);

SFUNC and FINALFUNC refer to our two functions. STYPE tells Postgres what state accumulator type to use, and INITCOND initializes the INT[] so that it's not NULL to start.
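
Before pointing the aggregate at real data, it may be worth trying it on a throwaway VALUES list. This is a minimal sketch, assuming both functions and the aggregate above exist; the alias t(v) is just an illustrative name. Also note that PostgreSQL 9.4 and later ship a built-in ordered-set aggregate which is also called mode, invoked as mode() WITHIN GROUP (ORDER BY ...); the sketch assumes the plain call mode(v) resolves to the custom mode(boolean) defined in this post.

```sql
-- Two TRUEs, one FALSE, one NULL: the NULL is skipped, TRUE wins.
SELECT mode(v) FROM (VALUES (TRUE), (TRUE), (FALSE), (NULL)) AS t(v);   -- t

-- No input rows at all: the state stays at INITCOND {0,0}, so the result is NULL.
SELECT mode(v) FROM (VALUES (TRUE)) AS t(v) WHERE FALSE;                -- NULL
```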

Let's see if it works!

SELECT server_name,
    sum(CASE WHEN server_up THEN 0.5 ELSE 0 END) AS minutes_up,
    mode(server_up) AS mode
FROM servers
WHERE montime BETWEEN '2013-04-01' AND '2013-04-01 01:00:00'
GROUP BY server_name
ORDER BY server_name;

server_name      minutes_up       mode
web1             56.5             TRUE
web2             0.0              FALSE
web3             48.0             TRUE
web4             11.5             FALSE
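
If you want to try this without real monitoring data, here is one way to fake some. The table definition is an assumption: only the column names server_name, montime and server_up come from the query above, and the column types and sample values are invented for illustration.

```sql
-- Hypothetical schema; only the column names are taken from the query above.
CREATE TABLE servers (
    server_name text      NOT NULL,
    montime     timestamp NOT NULL,
    server_up   boolean
);

-- One sample every 30 seconds for an hour (120 rows).
-- The made-up server 'web1' is down for the first five minutes only.
INSERT INTO servers (server_name, montime, server_up)
SELECT 'web1', ts, ts >= timestamp '2013-04-01 00:05:00'
FROM generate_series(timestamp '2013-04-01 00:00:00',
                     timestamp '2013-04-01 00:59:30',
                     interval '30 seconds') AS ts;

-- 110 of the 120 samples are up, so this should report 55.0 minutes and mode t.
SELECT server_name,
    sum(CASE WHEN server_up THEN 0.5 ELSE 0 END) AS minutes_up,
    mode(server_up) AS mode
FROM servers
WHERE montime BETWEEN '2013-04-01' AND '2013-04-01 01:00:00'
GROUP BY server_name;
```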

So easy you'll wonder why you didn't do it before!

(thanks to David Fetter for suggesting INT[] instead of a composite type)

9 comments:

Naoshika, April 17, 2013 at 2:45 PM
Should 'mode_bool_trans' be 'mode_bool_state' in your aggregate declaration, Josh?

joevandyk, April 17, 2013 at 6:15 PM
Awesome article!

Should any of the functions be marked IMMUTABLE?

Pavel Stěhule, April 17, 2013 at 10:34 PM
SQL functions (functions written in the SQL language) should not be flagged, because the optimizer can look into the body of the function and understand its context, so it works out the right flags itself. The situation is different with C, PL/pgSQL, or other languages: those functions are a black box for the optimizer, so there the flags are necessary.

Thom, April 18, 2013 at 12:06 AM
If you're interested, I also posted an article which includes the creation of a custom aggregate, except I didn't bother explaining how the aggregate definition worked: http://thombrown.blogspot.co.uk/2010/11/countif-expression.html

Noah, April 25, 2013 at 3:01 PM
Oracle has user-defined aggregate functions as well.

joevandyk, October 5, 2014 at 11:36 AM
Don't you need a 'GROUP BY server_name' in the select query?