Here's a commonplace ops crisis: the developers push a new dashboard display widget for user homepages on your application. It seems to work fine in testing, and they push it out to production ... not realizing that for some large subset of users dissimilar from your test data, the generated query triggers a sequential scan on the second-largest table in the database. Suddenly your database servers are paralyzed with load, and you have to shut down the whole site and back out the changes.
Wouldn't it be nice if you could just tell the database server "don't run expensive queries for the 'web' user"? Well, thanks to my colleague Andrew Dunstan, who wrote plan_filter with support from Twitch.TV, now you can.
Sort of. Let me explain.
PostgreSQL has had statement_timeout for a while, which can be set on a per-user basis (or other places) to prevent application errors from running queries for hours. However, this doesn't really solve the "overload" issue, because the query runs for that length of time, gobbling resources until it's terminated. What you really want to do is return an error immediately if a query is going to be too costly.
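For comparison, here's what the statement_timeout approach looks like. Note that this only kills the query after it has already run (and consumed resources) for the full timeout; the '30s' value is just an illustrative choice:

```sql
-- Per-user timeout: the query still runs, hogging resources,
-- until the 30-second limit is hit and it gets cancelled.
ALTER USER web SET statement_timeout = '30s';
```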
plan_filter is a loadable module which allows you to set a limit on the cost of queries you can execute. It works, as far as we know, with all versions of Postgres starting at 9.0 (we've tested 9.1, 9.3 and 9.4).
Let me show you. First, you have to load the module in postgresql.conf:
shared_preload_libraries = 'plan_filter'
Then you alter the "web" user to have a strict limit:
ALTER USER web SET plan_filter.statement_cost_limit = 200000.0;
Then try some brain-dead query as that user, like a blanket select from the 100m-row "edges" graph table:
\c - web
SELECT * FROM edges;
STATEMENT: select * from edges;
ERROR: plan cost limit exceeded
HINT: The plan for your query shows that it would probably
have an excessive run time. This may be due to a logic error
in the SQL, or it may be just a very costly query. Rewrite
your query or increase the configuration parameter
plan_filter.statement_cost_limit.
Obviously, your application needs to handle this error gracefully, especially since you'll likely get it for hundreds or thousands of queries at once if you're sending bad queries due to a code change. But a bunch of errors is definitely better than having to restart your whole app cluster. It's comparatively easy to just display a broken widget icon.
So why did I say "sort of", and why aren't we submitting this as a feature for PostgreSQL 9.5?
Well, there are some issues with limiting by plan cost. The first is that if you can't run the query due to the cost limit, you also can't run an EXPLAIN to see why the query is so costly in the first place. You'd need to set plan_filter.statement_cost_limit = 0 in your session to get the plan.
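That workaround looks like this, reusing the "edges" table from the earlier example:

```sql
\c - web
-- Disable the cost limit for this session only, so the
-- per-user setting stays in force for everyone else:
SET plan_filter.statement_cost_limit = 0;
EXPLAIN SELECT * FROM edges;
```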
The second, and much bigger, issue is that plan cost estimates are just that: estimates. They don't necessarily accurately show how long the query is actually going to take. Also, unless you do a lot of cost tuning, costs do not necessarily consistently scale between very different queries. Worst of all, some types of queries, especially those with LIMIT clauses, can return a cost in the plan which is much higher than the real cost because the planner expects to abort the query early.
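As a sketch of the LIMIT problem (the column name edge_id is hypothetical here): the sort node under the LIMIT is costed as if it had to run to completion, even though only a handful of rows will actually be returned, so the plan's cost can wildly overstate the real work.

```sql
-- The underlying sort is costed for all 100m rows, even though
-- the LIMIT means the query will finish almost immediately:
EXPLAIN SELECT * FROM edges ORDER BY edge_id LIMIT 10;
```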
So you're looking at a strong potential for false positives with statement_cost_limit. This means you need to start with the limit set very high (like 5000000) and work your way down, testing on your staging cluster to make sure that you're not bouncing lots of legitimate queries. Overall, statement_cost_limit is mainly useful to DBAs who know their query workloads really well.
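The ratcheting-down process might look like this; the specific cost values are illustrative, not recommendations:

```sql
-- Start generous, so legitimate queries aren't bounced:
ALTER USER web SET plan_filter.statement_cost_limit = 5000000.0;

-- Later, after observing real plan costs on staging,
-- tighten the limit toward your actual workload ceiling:
ALTER USER web SET plan_filter.statement_cost_limit = 500000.0;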
That means it's not ready for core Postgres (assuming it ever is). Fortunately, PostgreSQL is extensible so you can use it right now while you wait for it to eventually become a feature, or to be supplanted by a better mechanism of resource control.