Thursday, April 16, 2015

Expressions VS advanced aggregates

So ... you're using some of 9.4's new advanced aggregates, including FILTER and WITHIN GROUP.  You want to take some statistical samples of your data, including median, mode, and a count of validated rows.  However, your incoming data is floats and you want to store the samples as INTs, since the source data is actually whole numbers.  Also, COUNT(*) returns BIGINT by default, and you want to round it to INT as well.  So you do this:

    SELECT
        device_id,
        count(*)::INT as present,
        count(*)::INT FILTER (WHERE valid) as valid_count,
        mode()::INT WITHIN GROUP (order by val) as mode,
        percentile_disc(0.5)::INT WITHIN GROUP (order by val)
          as median
    FROM dataflow_0913
    GROUP BY device_id
    ORDER BY device_id;


And you get this unhelpful error message:

    ERROR:  syntax error at or near "FILTER"
    LINE 4:         count(*)::INT FILTER (WHERE valid)
            as valid_count,


And your first thought is that you're not using 9.4, or you got the filter clause wrong.  But that's not the problem.  The problem is that "aggregate() FILTER (where clause)" is a syntactical unit, and cannot be broken up by other expressions.  Hence the syntax error.  The correct expression is this one, with parens around the whole expression and then a cast to INT:

    SELECT
        device_id,
        count(*)::INT as present,
        (count(*) FILTER (WHERE valid))::INT as valid_count,
        (mode() WITHIN GROUP (order by val))::INT as mode,
        (percentile_disc(0.5) WITHIN GROUP (order by val))::INT
           as median
    FROM dataflow_0913
    GROUP BY device_id
    ORDER BY device_id;


If you don't understand this, and you use calculated expressions, you can get a worse result: one which does not produce an error but is nevertheless wrong.  For example, imagine that we were, for some dumb reason, calculating our own average over validated rows.  We might do this:

    SELECT
        device_id,
        sum(val)/count(*) FILTER (WHERE valid) as avg
    FROM dataflow_0913
    GROUP BY device_id
    ORDER BY device_id;


... which would execute successfully, but would give us the total of all rows divided by the count of validated rows. That's because the FILTER clause applies only to the COUNT, and not to the SUM.  If we actually wanted to calculate our own average, we'd have to do this:

    SELECT
        device_id,
        sum(val) FILTER (WHERE valid)
            / count(*) FILTER (WHERE valid) as avg
    FROM dataflow_0913
    GROUP BY device_id
    ORDER BY device_id;


Hopefully that helps everyone who is using the new aggregates to use them correctly and not get mysterious errors.  In the meantime, we can see about making the error messages more helpful.