Thursday, February 26, 2015

Why you might need statement_cost_limit

Here's a commonplace ops crisis: the developers push a new dashboard display widget for user homepages on your application.  It seems to work fine in testing, and they push it out to production ... not realizing that for some large subset of users dissimilar from your test data, the generated query triggers a sequential scan on the second-largest table in the database.   Suddenly your database servers are paralyzed by load, and you have to shut down the whole site and back out the change.

Wouldn't it be nice if you could just tell the database server "don't run expensive queries for the 'web' user"?  Well, thanks to my colleague Andrew Dunstan, who wrote plan_filter with support from Twitch.TV, now you can.

Sort of.  Let me explain.

PostgreSQL has had statement_timeout for a while, which can be set on a per-user basis (or in other places) to keep runaway queries from running for hours.  However, this doesn't really solve the overload issue, because the query still runs for that length of time, gobbling resources until it's terminated.  What you really want is to return an error immediately if a query is going to be too costly.
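For comparison, a per-user timeout looks like this (the 30-second value is purely illustrative, not a recommendation):

    ALTER USER web SET statement_timeout = '30s';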

plan_filter is a loadable module which lets you set a limit on the estimated cost of the queries you can execute.  It works, as far as we know, with all versions of Postgres starting at 9.0 (we've tested 9.1, 9.3, and 9.4).

Let me show you.  First, you have to load the module in postgresql.conf:

    shared_preload_libraries = 'plan_filter'

Then you alter the "web" user to have a strict limit:

    ALTER USER web SET plan_filter.statement_cost_limit = 200000.0;

Then try some brain-dead query as that user, like a blanket select from the 100m-row "edges" graph table:

    \c - web
    SELECT * FROM edges;

    STATEMENT:  select * from edges;
    ERROR:  plan cost limit exceeded
    HINT:  The plan for your query shows that it would probably
    have an excessive run time. This may be due to a logic error
    in the SQL, or it may be just a very costly query. Rewrite
    your query or increase the configuration parameter
    "plan_filter.statement_cost_limit".

Obviously, your application needs to handle this error gracefully, especially since you'll likely get it for hundreds or thousands of queries at once if you're sending bad queries due to a code change. But a bunch of errors is definitely better than having to restart your whole app cluster.   It's comparatively easy to just display a broken widget icon.

So why did I say "sort of", and why aren't we submitting this as a feature for PostgreSQL 9.5?

Well, there are some issues with limiting by plan cost.  The first is that if you can't run the query due to the cost limit, you also can't run an EXPLAIN to see why the query is so costly in the first place.  You'd need to set plan_filter.statement_cost_limit = 0 in your session to get the plan.
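For example, to actually see the plan for the rejected query above, you'd do something like this in a session as that user:

    SET plan_filter.statement_cost_limit = 0;
    EXPLAIN SELECT * FROM edges;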

The second, and much bigger, issue is that plan cost estimates are just that: estimates.  They don't necessarily reflect how long the query will actually take.  Also, unless you do a lot of cost tuning, costs do not necessarily scale consistently between very different queries.   Worst of all, some types of queries, especially those with LIMIT clauses, can return a cost in the plan which is much higher than the real cost, because the planner expects to abort the query early.

So you're looking at a strong potential for false positives with statement_cost_limit.  This means that you need to start with the limit set very high (like 5000000), work your way down, and test on your staging cluster to make sure that you're not bouncing lots of legitimate queries.  Overall, statement_cost_limit is mainly useful to DBAs who know their query workloads really well.
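In practice, that calibration looks something like this (the numbers are illustrative, not recommendations): tighten the limit in stages while watching the staging logs for "plan cost limit exceeded" errors from known-good queries:

    ALTER USER web SET plan_filter.statement_cost_limit = 5000000.0;
    -- run the normal workload, check the logs, then ratchet down:
    ALTER USER web SET plan_filter.statement_cost_limit = 1000000.0;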

That means it's not ready for core Postgres (assuming it ever will be).  Fortunately, PostgreSQL is extensible, so you can use it right now while you wait for it to eventually become a core feature, or to be supplanted by a better resource-control mechanism.

4 comments:

  1. I've been looking for this since 2003 primarily for one major reason: Cartesian Products. Let's face it, everyone has fat-fingered a query, or missed a WHERE clause, or a buggy app just plain has a bad query. There are some pretty clear cases, like when the cost estimate is in the trillions, that should be rejected.

    Even if this can't be used in general for preventing misuse from "web user," having it around for a high sanity threshold would be amazing. If this is available as an extension, I'm installing as of last week.

    1. It's a loadable module, not an extension (i.e. no CREATE EXTENSION), like auto_explain is. And it's available now.

    2. Good enough for me. I'm on it! I'm in transition so I've been off the mailing list for a month or so; I'm out of the loop. :)

    3. Good, you'll be able to give it a serious performance workout. Mind you, the site where we're using it already does 10K qps, so it holds up under at least moderate load.
