Wednesday, April 24, 2013

sorted_mode Aggregate Function

So a boolean mode() was very simple to construct, partly because booleans have only three potential values (TRUE, FALSE, NULL).  However, in many cases we want mode() for, say, integers.  Now, there are several constructions of mode() depending on what statistic you really want, but here I'm concerned with the simplest one: Most Common Value (MCV).

One way to get this kind of mode is to do a windowing function, but as mentioned this works poorly with other aggregates in the same result set.  So let's take the same custom aggregate approach and see if we can do better.

Now, for types other than boolean, mode() is a "two-pass aggregate".  That means it's impossible to calculate in one pass; you need two passes, one to sort the set, and one to count the items and pick the MCV.   Since we know we'll need two passes, we'll construct our aggregate to assume that it's receiving sorted data going in, and make sure it gets sorted data when we use it.  Given 9.2's new ORDER BY clause for aggregates, that's easy to ensure.

For a state type, we'll need a 4-part register.  This register will include:

  1. the last seen value
  2. the count of the last seen value
  3. the most common value so far
  4. the count of the most common value so far
Again, we'll use an array rather than a composite type so that we don't have to create a type for the function.  Then we can construct a pure-SQL state function:

create or replace function sorted_mode_state(
    tally bigint[], nextval bigint
)
returns bigint[]
language sql
immutable
as $f$

-- have we seen this value before?
SELECT CASE WHEN nextval = tally[1] THEN

    -- if weve seen it before, does it out-count the prior MCV?
    CASE WHEN tally[2] = tally[4] THEN

        -- if so, swap the MCV
        ARRAY[ nextval, tally[2] + 1, nextval, tally[2] + 1 

    ELSE
        -- if not, keep counting
        ARRAY[ nextval, tally[2] + 1, tally[3], tally[4] ]
    END

-- is the register uninitialized?  then take the first value
WHEN tally[1] IS NULL THEN
    ARRAY [ nextval, 1, nextval, 1 ]

-- skip nulls
WHEN nextval IS NULL THEN
    tally
ELSE

    -- if it's a new value, count 1
    ARRAY [ nextval, 1, tally[3], tally[4] ]
END;
$f$;








The final function just picks out tally[3], returning NULL if no MCV has been selected:

create function sorted_mode_final(
    tally bigint[]
)
returns bigint
language sql
immutable
as $f$
SELECT CASE 

-- do we have an MCV?
WHEN tally[4] > 1 THEN tally[3]
-- otherwise, do we have a single-value set?
WHEN tally[1] = tally[3] THEN tally[3]
-- is there no clear mcv?  return null
ELSE NULL END;
$f$;


Then we can build our aggregate:

CREATE AGGREGATE sorted_mode ( bigint ) (
    SFUNC = sorted_mode_state,
    STYPE = BIGINT[],
    FINALFUNC = sorted_mode_final,
    INITCOND = '{}'
);


Usage:

SELECT server, date_trunc('hour', poll_time) as thehour,
     sorted_mode(status ORDER BY status) as status_mode
FROM server_status
GROUP BY server, date_trunc('hour', poll_time)
ORDER BY date_trunc('hour', poll_time), server;

This approach has some inherent limitations, of course:
  • if no value appears more than once, it returns NULL, unless there's only one value in the set;
  • if several values appear the same number of times, it returns the first one in the sort order;
  • it can't do estimated mode or other mode constructions.
However, it's "good enough" for a lot of applications which want the most common value for display.  And it's very fast compared with other approaches.




No comments:

Post a Comment