Thursday, February 26, 2015

Why you might need statement_cost_limit

Here's a commonplace ops crisis: the developers push a new dashboard display widget for user homepages on your application.  This seems to work fine in testing, and they push it out to production ... not realizing that for some large subset of users dissimilar from your test data, the generated query triggers a sequential scan on the second-largest table in the database.  Suddenly your database servers are paralyzed with load, and you have to shut down the whole site and back out the changes.

Wouldn't it be nice if you could just tell the database server "don't run expensive queries for the 'web' user"?  Well, thanks to my colleague Andrew Dunstan, who wrote plan_filter with support from Twitch.TV, now you can.

Sort of.  Let me explain.

PostgreSQL has had statement_timeout for a while, which can be set on a per-user basis (or in other places) to keep runaway application queries from running for hours.  However, this doesn't really solve the "overload" issue, because the query still runs for that length of time, gobbling resources until it's terminated.  What you really want is to return an error immediately if a query is going to be too costly.
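For reference, the per-user version of that timeout looks something like this (a minimal sketch; the 30-second value is arbitrary):

    -- cap any single query from the "web" user at 30 seconds
    ALTER USER web SET statement_timeout = '30s';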

plan_filter is a loadable module which allows you to set a limit on the cost of queries you can execute.  It works, as far as we know, with all versions of Postgres starting at 9.0 (we've tested 9.1, 9.3 and 9.4). 

Let me show you.  First, you have to load the module in postgresql.conf:

    shared_preload_libraries = 'plan_filter'

Then you alter the "web" user to have a strict limit:

    ALTER USER web SET plan_filter.statement_cost_limit = 200000.0;

Then try some brain-dead query as that user, like a blanket select from the 100m-row "edges" graph table:

    \c - web
    SELECT * FROM edges;

    STATEMENT:  select * from edges;
    ERROR:  plan cost limit exceeded
    HINT:  The plan for your query shows that it would probably
    have an excessive run time. This may be due to a logic error
    in the SQL, or it may be just a very costly query. Rewrite 
    your query or increase the configuration parameter
    "plan_filter.statement_cost_limit".

Obviously, your application needs to handle this error gracefully, especially since you'll likely get it for hundreds or thousands of queries at once if you're sending bad queries due to a code change. But a bunch of errors is definitely better than having to restart your whole app cluster.   It's comparatively easy to just display a broken widget icon.

So why did I say "sort of", and why aren't we submitting this as a feature for PostgreSQL 9.5?

Well, there are some issues with limiting by plan cost.  The first is that if you can't run the query due to the cost limit, you also can't run an EXPLAIN to see why the query is so costly in the first place.  You'd need to set plan_filter.statement_cost_limit = 0 in your session to get the plan.
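In practice that workaround looks something like this (a sketch, run in the session where you want to see the plan, and assuming your role is allowed to change the setting):

    SET plan_filter.statement_cost_limit = 0;   -- lift the limit for this session only
    EXPLAIN SELECT * FROM edges;                -- now you can see the offending plan
    RESET plan_filter.statement_cost_limit;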

The second, and much bigger, issue is that plan cost estimates are just that: estimates.  They don't necessarily reflect how long the query will actually take.  Also, unless you do a lot of cost tuning, costs do not necessarily scale consistently between very different queries.  Worst of all, some types of queries, especially those with LIMIT clauses, can return a cost in the plan which is much higher than the real cost, because the planner expects to abort the query early.

So you're looking at a strong potential for false positives with statement_cost_limit.  This means you need to start with a very high limit (like 5000000) and work your way down, testing on your staging cluster to make sure you're not bouncing lots of legitimate queries.  Overall, statement_cost_limit is mainly useful to DBAs who know their query workloads really well.

That means it's not ready for core Postgres (assuming it ever is).  Fortunately, PostgreSQL is extensible so you can use it right now while you wait for it to eventually become a feature, or to be supplanted by a better mechanism of resource control.

Tuesday, February 17, 2015

Spring/Summer 2015 Conference Schedule

What follows is my conference travel schedule through the early summer.  I'm posting it so that local PUGs will know when I'm going to be nearby, in case you want me to come talk to your members.  Also, so folks can find me at conference booths everywhere.

This list is also for anyone who was unaware of the amount of Postgres content available this year at conferences everywhere.
  • SCALE, Los Angeles, this week: 2-day Postgres track, booth.  Use code "SPEAK" if you still haven't registered for a small discount.  I'm speaking on 9.4 (Friday), and PostgreSQL on AWS (Sunday).
  • March 10, Burlingame, CA: pgDay SF 2015.  Running the event, and giving a lightning talk.
  • March 25-27, NYC, NY: pgConf NYC: speaking on PostgreSQL on PaaS, a comparison of all the big ones.
  • April 25-26, Bellingham, WA: LinuxFest NorthWest, tentatively.  Talks haven't been chosen yet.  If I go, I'll also be working a booth no doubt.  I understand there are plans to have a bunch of Postgres stuff at this event.
  • June 16-20, Ottawa, Canada: pgCon of course.
  • July 20-24, Portland, OR: OSCON (tentatively, talks not selected).  Postgres talk of some sort, and probably booth duty.
Now you know.

Sunday, February 15, 2015

Running with scissors mode

[Photo: DBAs Running with Scissors]

Based on some comments in my post about "in-memory" databases, I realized that my post about running Postgres without disk sync was no longer available on Database Soup.  So I'm reposting the instructions here.

Since it first went up yesterday, this blog post version has picked up corrections and clarifications thanks to Andres Freund.  Thanks, Andres.

Running PostgreSQL this way was christened "running with scissors mode" by Gavin Roy, because you're operating completely without crash-safety; if something happens to the server, even a power fluctuation, your database contents are untrustworthy and may be corrupt.  However, it can be a useful way to run Postgres for extra, read-only replicas used strictly for load-balancing, or if what you're loading into Postgres is completely disposable/replaceable.

Note that these settings do not, in fact, disable all disk writes.  What they do instead is minimize disk writes and make them asynchronous, depending entirely on the OS's own memory swapping and dirty block flushing.  This is what you want; you don't want the database to halt because, for example, you simply ran out of space in memory.

So, without further ado, here are the settings:

    work_mem =  (RAM - DBsize - shared_buffers)*2 / max_connections
    temp_buffers = (RAM - DBsize - shared_buffers)*4 / max_connections
    temp_file_limit = 0

On the one hand, we want to set work_mem high in order to avoid on-disk sorts.  On the other hand, letting sort memory push the database itself out of RAM would be counterproductive.  As such, you want to set work memory to use the available RAM you don't need for database caching.  The above assumes that max_connections is set to something sensible for the number of connections you actually need.  You should really be using pgbouncer as well with this setup.
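As a purely hypothetical worked example of the formulas above (the numbers are invented to show the arithmetic): a 64GB machine holding a 10GB database, with 8GB of shared_buffers and max_connections = 100, gives roughly:

    # hypothetical: RAM 64GB, DBsize 10GB, shared_buffers 8GB, max_connections 100
    work_mem = 940MB        # (64 - 10 - 8) * 2 / 100 GB, about 0.92GB
    temp_buffers = 1880MB   # (64 - 10 - 8) * 4 / 100 GB, about 1.84GB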


Set temp_file_limit = 0 to cause queries to be cancelled instead of doing disk sorts.

     bgwriter_lru_maxpages = 0
     wal_level = minimal
     fsync = off
     synchronous_commit = off
     full_page_writes = off

     wal_log_hints = off
     wal_buffers = 64MB


Here we're minimizing the amount of writing we do to the transaction log, and making said writing completely asynchronous.  We're also disabling background writing.

     checkpoint_segments = 8 to 64
     checkpoint_timeout = 60min
     checkpoint_completion_target = 0.9

Checkpoint segments is a bit trickier.  On the one hand, you want it to be large enough that it's not cycling a lot and triggering extra disk flushes. On the other hand, you want all the segments to stay cached in RAM.  So something moderate, 256MB to 2GB, depending on how much RAM you have.  Don't set it to more than 1/32nd of RAM.  Ideally, we'd be able to disable checkpoints entirely, but currently there's no reasonable way to do that.
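As a worked example of the sizing rule (the 64GB machine is hypothetical; WAL segments are 16MB each):

    # hypothetical 64GB machine: 1/32nd of RAM = 2GB of WAL at most
    # 64 segments * 16MB per segment = 1GB of live segment files
    # (recycled segments will push the on-disk total above that)
    checkpoint_segments = 64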

     stats_temp_directory = '/pgramdisk/stats_tmp'

You will also want to move the stats file to a ramdisk so that it's not being written out.  This is a good optimization in general, even outside of running with scissors mode.


Finally, we need to ensure that PostgreSQL will not restart if the system crashes; at that point, you assume your database is corrupt and proceed to recover it from another source.  The first part of doing this is to disable any autostart in your init system configuration.  Secondly, after starting up Postgres, add a line like this to the beginning of postgresql.conf:

     DO_NOT_RESTART=True

The exact parameter you use doesn't matter; what matters is that it's not a recognized parameter, so that Postgres will error out instead of  restarting.  This does mean extra steps when you want to manually restart this node for configuration changes, but there isn't a really good way around that.

Next up, we'll run some performance tests to see how much this benefits us.

Continued in Part II.

Photo of "DBAs running with scissors" taken by Gavin Roy.

Friday, February 13, 2015

"In-memory" is not a feature, it's a bug

So, I'm hearing again about the latest generation of "in-memory databases". Apparently Gartner even has a category for them now.  Let me define an in-memory database for you:

     An in-memory database is one which lacks the capability of spilling to disk.

As far as I know from my reading of the industry literature, nobody has demonstrated any useful way in which data should be stored differently if it never spills to disk.  While the talented engineers of several database products have focused on other performance optimizations to the exclusion of making disk access work, that's not an optimization of the database; it's an optimization of engineer time.  The exact same database, with disk access capabilities, would be automatically superior to its predecessor, because users would now have more options.

PostgreSQL can be an "in-memory" database too, if you simply turn all of the disk storage features off.  This is known as "running with scissors" mode, and people do it for useful effect on public clouds with disposable replicas.

So an "in-memory" database is a database with a major limitation.  It's not a feature, any more than an incapability of supporting SQL access is a feature.  Let's define databases by their useful features, not by what they lack, please.

Besides which, with the new types of persistent memory and fast random access storage coming down the pipe in a couple years, there soon won't be any difference between disk and memory anyway.

Thursday, February 12, 2015

Tree Join Tables: preventing cycles

Searching Google, I was surprised to find that there were few solutions published for a common issue: preventing users from creating a cycle when you create a self-join table.  So here's one solution, which will be "good enough" for most people, but has some caveats (see below).

First, the setup: we have a table of items.  Items can be in one or more collections.  Each item can itself be a collection, allowing users to create collections of collections.  So the first thing we need is a self-join table on the "adjacency list" model:

    create table collections (
        collection_id int not null references items(id) on delete cascade,
        item_id int not null references items(id) on delete cascade,
        constraint collections_pk primary key ( collection_id, item_id )
    );
    create index on collections(item_id);
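(The items table itself isn't shown here; if you want a self-contained test, something minimal like the following, created before the collections table, would do.  The name column is hypothetical.)

    -- hypothetical minimal items table, just enough to satisfy the
    -- foreign keys above
    create table items (
        id serial primary key,
        name text
    );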

So the first part of preventing cycles is to prevent the simplest cycle, where a collection collects itself.  That can be done with a constraint:

     alter table collections add constraint
     no_self_join check ( collection_id <> item_id );

Now comes the tough part: preventing cycles that span two, three, or N collections in a chain.  This requires us to look down a chain of possible collections and make sure that each inserted tuple doesn't complete a loop.  Fortunately, WITH RECURSIVE works for this, provided we do it in a BEFORE trigger.  If we did it in an AFTER trigger, the trigger itself would cycle, which would be no good.

    CREATE OR REPLACE FUNCTION collections_prevent_cycle ()
    returns trigger
    language plpgsql
    as $f$
    BEGIN
        -- select recursively, looking for all child items of the new collection
        -- and making sure that they don't include the new collection
        IF EXISTS ( WITH recursive colitem as (
                select collection_id, item_id
                from collections
                where collection_id = NEW.item_id
                UNION ALL
                select colitem.collection_id, collections.item_id
                from collections
                join colitem on colitem.item_id = collections.collection_id
            )
            SELECT collection_id from colitem
            WHERE item_id = NEW.collection_id
            LIMIT 1 ) THEN
                RAISE EXCEPTION 'You may not create a cycle of collections.';
        END IF;
       
        RETURN NEW;
    END; $f$;

    CREATE TRIGGER collections_prevent_cycle
    BEFORE INSERT OR UPDATE ON collections
    FOR EACH ROW EXECUTE PROCEDURE collections_prevent_cycle();
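A quick sanity check, using hypothetical item ids 1 and 2 (both assumed to already exist in items):

    -- item 1 collects item 2: allowed
    INSERT INTO collections (collection_id, item_id) VALUES (1, 2);
    -- item 2 collects item 1: completes a cycle, so the trigger raises
    -- "You may not create a cycle of collections."
    INSERT INTO collections (collection_id, item_id) VALUES (2, 1);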

As I said, this solution will be "good enough" for a variety of uses.  However, it has some defects:

Concurrency: It is vulnerable to concurrency failure.  That is, if two users simultaneously insert "A collects B" and "B collects A", this trigger would not prevent it.  The alternative is locking the entire table on each commit (see the sketch below), which is also problematic.

Cost: we're running a pretty expensive recursive query with every insert.  For applications where the tree table is write-heavy, this will decrease throughput significantly.
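For completeness, the table-locking alternative mentioned under Concurrency would look roughly like this, done by the writing transaction itself.  It's a sketch of the problematic approach, not a fix, since it serializes every writer on the table and so makes the cost problem worse:

    BEGIN;
    -- take the table lock before inserting, so that concurrent
    -- cycle-creating inserts wait and then see each other's committed
    -- rows; this blocks other writers until commit, but allows reads
    LOCK TABLE collections IN SHARE ROW EXCLUSIVE MODE;
    INSERT INTO collections (collection_id, item_id) VALUES (1, 2);
    COMMIT;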

So my challenge to you is this: come up with a better solution, one which solves either the concurrency or the cost problem without making the other problem worse.

P.S.: this blog has reached half a million views.  Thanks, readers!

Friday, February 6, 2015

A statement on recent conference events

The PostgreSQL user group in Moscow is currently conducting their first-ever PostgreSQL-themed conference, which has been a tremendous success.  Unfortunately, the venue booked by the conference chose to include inappropriate dancers as part of their entertainment package.  The conference organizers and the Russian PostgreSQL community were not aware of the nature of the entertainment supplied ahead of time.

The PostgreSQL Core Team believes there is no place for inappropriate or discriminatory behaviour at PostgreSQL conferences and tries to ensure that all our conferences are suitable for anyone to attend. As PostgreSQL is an Open Source project with volunteer contributors and a federated organizational structure, we do not have supervisory control over how individual conferences are organized, which means that sometimes they do not benefit from general community experience.

The Russian conference organizers are expected to comment on this unforeseen incident once the conference is concluded.  The international community will be working with them to make sure that this mistake is not repeated.

Josh Berkus
On Behalf of the PostgreSQL Core Team
and the PostgreSQL Global Development Project

Thursday, February 5, 2015

Some notes on today's update release

We released a security and cumulative bugfix release to all supported versions today.  That means it's update time.  What follows is my personal advice on the update.

For the first time in a while, we have a bunch of "low-risk" security fixes in this release, but no "critical" security fixes.  The reason I put those terms in quotes is that it doesn't matter how critical the fixes are in general; it matters how critical they are to you.  So you should definitely read over the release notes and the CVE notices to check how they affect you.

All five of the security holes patched require prior authentication.  Four of the five have not been proven to have an actual privilege escalation vector; they may be only denial-of-service attacks.  And the fifth security issue only affects you if you are using per-column privileges for columns with constraints on them.  That's why I regard these issues as relatively "low-risk".

There are also some important fixes to performance and replication for versions 9.4 and 9.3, so users of those versions should apply the update soon.  For other users, unless you live in the Fiji Islands or other places affected by timezone changes, you can probably wait for your next scheduled maintenance window.  You do have scheduled maintenance windows, yes?

Other people who might care to apply this update sooner rather than later include:
  • Users who have already had issues with autovacuum
  • People using the new logical decoding
  • Users who have a single archive which is shared between master and replicas
  • Folks who create a bunch of tablespaces
  • Developers who use tsquery, xpath(), and/or complex regular expression searches
  • JSONB users
  • Norwegians who use Postgres on Windows
  • Users who have reported bugs with explicit locking and deadlocking in the last few months
Again, though, read the release notes.  Because it's always possible that we fixed a bug that already affects you.