Thursday, March 19, 2015

PostgreSQL data migration hacks from Tilt

Since the folks at Tilt aren't on Planet Postgres, I thought I'd link their recent blog post on cool data migration hacks.  Tilt is a YCombinator company, and an SFPUG supporter.

Monday, March 16, 2015

Benchmarking Postgres in the Cloud, part 1

In 2008, when Heroku started, there was only one real option for cloud hosting PostgreSQL: roll-your-own on EC2, or a couple other not-very-competitive platforms.  Since then, we've seen the number of cloud hosting providers explode, and added several "PostgreSQL-As-A-Service" providers as well: first Heroku, then Gandi, CloudFoundry, RDS, OpenShift and more.  This has led many of pgExperts' clients to ask: "Where should I be hosting my PostgreSQL?"

So to provide a definitive answer to that question, for the past several weeks I've been doing some head-to-head testing of different cloud hosting options for PostgreSQL.  Even more work has been done by my collaborator, Ruben Rudio Rey.  I will be presenting the results of this testing in a series of blog posts, together with a series of presentations starting at SCALE and going through pgConf NYC, LinuxFest NorthWest, and culminating at pgCon.  Each presentation will add new tests and new data.

Here are my slides from SCALE, which compare AWS, RDS, and Heroku, if you want to get some immediate data.

What We're Testing

The idea is to run benchmarks against ephemeral instances of PostgreSQL 9.3 on each cloud or service.  Our main goal is to collect performance figures, since while features and pricing are publicly available, performance information is not.  And even when the specification is the same, the actual throughput is not.  From each cloud or service, we are testing two different instance sizes:

Small: 1-2 cores, 3 to 4GB RAM, low throughput storage (compare EC2's m3.medium).  This is the "economy" instance for running PostgreSQL; it's intended to represent what people with non-critical PostgreSQL instances buy, and to answer the question of "how much performance can I get for cheap".

Large: 8-16 cores, 48 to 70GB RAM, high throughput storage (compare EC2's r3.2xlarge).  This is the maximum for a "high end" instance which we could afford to test in our test runs. 

The clouds we're testing or plan to test include:
  • AWS EC2 "roll-your-own".
  • Amazon RDS PostgreSQL
  • Heroku
  • Google Compute Engine
  • DigitalOcean
  • Rackspace Cloud
  • OpenShift PostgreSQL Cartridge
  • (maybe Joyent, not sure)
Note that in many cases we're working with the cloud vendor to achieve maximum performance results.  Our goal here isn't to "blind test" the various clouds, but rather to try to realistically deliver the best performance we can get on that platform.  In at least one case, our findings have resulted in the vendor making improvements to their cloud platform, which then allowed us to retest with better results.

The tests we're running include three pgbench runs:

  • In-Memory, Read-Write (IMRW): pgbench database 30% to 60% of the size of RAM, full transaction workload
  • In-Memory, Read-Only (IMRO): pgbench database 30% to 60% of RAM, read-only queries
  • On-Disk, Read-Write (ODRW): pgbench database 150% to 250% of RAM, full transactions
The idea here is to see the different behavior profiles with WAL-bound, CPU-bound, and storage-bound workloads.  We're also recording the load time for each database, since bulk loading behavior is useful information for each platform. 
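
For a rough idea of how those size targets translate into pgbench scale factors, here is a minimal Python sketch.  It assumes about 15MB of data per scale unit, which is a common rule of thumb and not a number measured in these tests, and it simply picks the midpoint of each range.  The function names are made up for illustration.

```python
import math

# Assumption: each pgbench scale unit adds roughly 15 MB of tables and indexes.
# This is a rule of thumb, not a figure from the tests described in the post.
MB_PER_SCALE_UNIT = 15.0


def pgbench_scale(ram_gb, fraction_of_ram):
    """Return a pgbench -s value for a database of about fraction_of_ram * RAM."""
    target_mb = ram_gb * 1024 * fraction_of_ram
    return max(1, math.ceil(target_mb / MB_PER_SCALE_UNIT))


def plan_scales(ram_gb):
    """Pick the midpoint of the size range used for each of the three workloads."""
    in_memory = pgbench_scale(ram_gb, 0.45)  # middle of 30% to 60% of RAM
    on_disk = pgbench_scale(ram_gb, 2.0)     # middle of 150% to 250% of RAM
    return {"IMRW": in_memory, "IMRO": in_memory, "ODRW": on_disk}
```

Calling `plan_scales(4)` for a 4GB "small" instance gives one scale factor to pass to `pgbench -i -s` for the two in-memory tests and a larger one for the on-disk test.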

Each combination of cloud/size/test needs to then be run at least 5 times in order to get a statistically useful sample.  As I will document later, often the difference between runs on the same cloud was greater than the difference between clouds.
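
One way to check whether run-to-run noise swamps the difference between two platforms is sketched below in Python.  This is only an illustration of the comparison described above, not the analysis used for these tests; the function names and the choice of sample standard deviation as the measure of spread are assumptions.

```python
import statistics


def summarize_runs(tps_runs):
    """Return mean, sample standard deviation, and coefficient of variation."""
    if len(tps_runs) < 5:
        raise ValueError("need at least 5 runs per cloud/size/test combination")
    mean = statistics.mean(tps_runs)
    stdev = statistics.stdev(tps_runs)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


def spread_exceeds_gap(runs_a, runs_b):
    """True when the larger run-to-run spread exceeds the gap between the means."""
    a = summarize_runs(runs_a)
    b = summarize_runs(runs_b)
    return max(a["stdev"], b["stdev"]) > abs(a["mean"] - b["mean"])
```

If `spread_exceeds_gap` returns True for two sets of TPS figures, the difference between the two platforms is smaller than the noise within at least one of them.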

Issues with pgbench as a Test Tool

One of the early things I discovered was some of the limitations of what pgbench could tell us.  Its workload is 100% random access and homogeneous one-liner queries.  It's also used extensively and automatically to test PostgreSQL performance.  As a result, we found that postgresql.conf tuning made little or no difference at all, so our original plan to test "tuned" vs. "untuned" instances went by the wayside.

We also found on public clouds that, because of the rapidfire nature of pgbench queries, performance was dominated by network response times more than anything on most workloads.  We did not use pgbench_tools, because that is concerned with automating many test runs against one host rather than a few test runs against many hosts.

For this reason, we also want to run a different, more "serious" benchmark which works out other performance areas.  To support this, I'm working on deploying Jignesh's build of DVDStore so that I can do that benchmark against the various platforms.  This will require some significant work to make a reality, though; I will need to create images or deployment tools on all of the platforms I want to test before I can do it.

To be continued ...

Friday, March 6, 2015

Fancy SQL Friday: subtracting arrays

Here's one which just came up:  how to see all of the elements in a new array which were not in the old array.  This isn't currently supported by any of PostgreSQL's array operators, but thanks to UNNEST() and custom operators, you can create your own:

    -- Returns the distinct elements of the first array ($1, "new")
    -- which do not appear in the second array ($2, "old").
    create or replace function diff_elements_text (
        text[], text[] )
    returns text[]
    language sql
    as $f$
    SELECT array_agg(DISTINCT new_arr.elem)
    FROM
        unnest($1) as new_arr(elem)
        LEFT OUTER JOIN
        unnest($2) as old_arr(elem)
        ON new_arr.elem = old_arr.elem
    WHERE old_arr.elem IS NULL;
    $f$;

    create operator - (
        procedure = diff_elements_text,
        leftarg = text[],
        rightarg = text[]
    );

Now you can just subtract text arrays:

    josh=# select array['n','z','d','e'] - array['a','n','z'];
     ?column?
    ----------
     {d,e}
    (1 row)

Unfortunately, you'll need to create a new function and operator for each base  type; I haven't been able to get it to work with "anyarray".  But this should save you some time/code on array comparisons.  Enjoy!
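
If it helps to see the semantics outside of SQL, here is a small Python model of what the function computes.  It is a sketch under a few assumptions: the arrays contain no NULLs, the result is sorted for a stable answer (SQL does not promise any particular order from array_agg), and the name diff_elements is made up.

```python
def diff_elements(new_arr, old_arr):
    """Distinct elements of new_arr that do not appear in old_arr.

    Mirrors the anti-join in the SQL function: keep rows from the new array
    that found no match in the old one, then de-duplicate. Returns None for
    an empty result, since array_agg over zero rows yields NULL.
    """
    old = set(old_arr)
    result = sorted({elem for elem in new_arr if elem not in old})
    return result or None
```

For example, `diff_elements(['n', 'z', 'd', 'e'], ['a', 'n', 'z'])` keeps only the elements that are new.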

Thursday, February 26, 2015

Why you might need statement_cost_limit

Here's a commonplace ops crisis: the developers push a new dashboard display widget for user homepages on your application.  This seems to work fine in testing, and they push it out to production ... not realizing that for some large subset of users dissimilar from your tests, the generated query triggers a sequential scan on the second-largest table in the database.  Suddenly your database servers are paralyzed with load, and you have to shut down the whole site and back out the changes.

Wouldn't it be nice if you could just tell the database server "don't run expensive queries for the 'web' user"?  Well, thanks to my colleague Andrew Dunstan, who wrote plan_filter with support from Twitch.TV, now you can.

Sort of.  Let me explain.

PostgreSQL has had statement_timeout for a while, which can be set on a per-user basis (or other places) to prevent application errors from running queries for hours.  However, this doesn't really solve the "overload" issue, because the query runs for that length of time, gobbling resources until it's terminated.  What you really want to do is return an error immediately if a query is going to be too costly.

plan_filter is a loadable module which allows you to set a limit on the cost of queries you can execute.  It works, as far as we know, with all versions of Postgres starting at 9.0 (we've tested 9.1, 9.3 and 9.4). 

Let me show you.  First, you have to load the module in postgresql.conf:

    shared_preload_libraries = 'plan_filter'

Then you alter the "web" user to have a strict limit:

    ALTER USER web SET plan_filter.statement_cost_limit = 200000.0;

Then try some brain-dead query as that user, like a blanket select from the 100m-row "edges" graph table:

    \c - web
    SELECT * FROM edges;

    STATEMENT:  select * from edges;
    ERROR:  plan cost limit exceeded
    HINT:  The plan for your query shows that it would probably
    have an excessive run time. This may be due to a logic error
    in the SQL, or it maybe just a very costly query. Rewrite 
    your query or increase the configuration parameter
    "plan_filter.statement_cost_limit".

Obviously, your application needs to handle this error gracefully, especially since you'll likely get it for hundreds or thousands of queries at once if you're sending bad queries due to a code change. But a bunch of errors is definitely better than having to restart your whole app cluster.   It's comparatively easy to just display a broken widget icon.
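
A minimal sketch of that kind of graceful handling, in Python, might look like the following.  The names render_widget, run_query, and render are hypothetical, and matching on the error message text is an assumption made to keep the example driver-independent; a real application would use whatever error details its database driver exposes.

```python
# Text of the error raised by plan_filter, as shown in the output above.
PLAN_LIMIT_MESSAGE = "plan cost limit exceeded"


def render_widget(run_query, render, fallback="[widget unavailable]"):
    """Run the widget's query; show a placeholder if the cost limit rejects it.

    run_query: hypothetical callable that executes the query and returns rows.
    render: hypothetical callable that turns rows into widget output.
    """
    try:
        rows = run_query()
    except Exception as exc:
        # Only swallow the cost-limit error; anything else is a real failure.
        if PLAN_LIMIT_MESSAGE in str(exc):
            return fallback
        raise
    return render(rows)
```

The point of re-raising everything else is that only the cost-limit rejection should degrade to a broken-widget icon; other errors still need to surface.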

So why did I say "sort of", and why aren't we submitting this as a feature for PostgreSQL 9.5?

Well, there are some issues with limiting by plan cost.  The first is that if you can't run the query due to the cost limit, you also can't run an EXPLAIN to see why the query is so costly in the first place.  You'd need to set plan_filter.statement_cost_limit = 0 in your session to get the plan.

The second, and much bigger, issue is that plan cost estimates are just that: estimates.  They don't necessarily accurately show how long the query is actually going to take.  Also, unless you do a lot of cost tuning, costs do not necessarily consistently scale between very different queries.   Worst of all, some types of queries, especially those with LIMIT clauses, can return a cost in the plan which is much higher than the real cost because the planner expects to abort the query early.

So you're looking at a strong potential for false positives with statement_cost_limit.  This means that you need to start with a very high limit (like 5000000) and work your way down, testing on your staging cluster at each step to make sure that you're not bouncing lots of legitimate queries.  Overall, statement_cost_limit is mainly useful to DBAs who know their query workloads really well.
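
The "work your way down" step can be sketched in Python as follows, assuming you have collected estimated plan costs for known-good queries from EXPLAIN on staging.  The function names are invented for illustration, and the assumption that a query is rejected only when its cost is strictly greater than the limit is mine, not something stated above.

```python
def rejected_queries(plan_costs, cost_limit):
    """Return names of queries whose estimated plan cost exceeds the limit.

    plan_costs: dict of query name -> estimated total cost from EXPLAIN.
    A limit of 0 is treated as "no limit", matching the session setting
    used above to get a plan for a blocked query.
    """
    if cost_limit <= 0:
        return []
    return sorted(name for name, cost in plan_costs.items() if cost > cost_limit)


def lowest_safe_limit(plan_costs, candidates):
    """Walk candidate limits from high to low; stop before one bounces a query."""
    safe = None
    for limit in sorted(candidates, reverse=True):
        if rejected_queries(plan_costs, limit):
            break
        safe = limit
    return safe
```

Since the costs are only estimates, whatever limit this produces is a starting point for staging tests and not a guarantee.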

That means it's not ready for core Postgres (assuming it ever is).  Fortunately, PostgreSQL is extensible so you can use it right now while you wait for it to eventually become a feature, or to be supplanted by a better mechanism of resource control.

Tuesday, February 17, 2015

Spring/Summer 2015 Conference Schedule

What follows is my conference travel schedule through the early summer.  I'm posting it so that local PUGs will know when I'm going to be nearby, in case you want me to come talk to your members.  Also, so folks can find me at conference booths everywhere.

This list is also for anyone who was unaware of the amount of Postgres content available this year at conferences everywhere.
  • SCALE, Los Angeles, this week: 2-day Postgres track, booth.  Use code "SPEAK" for a small discount if you still haven't registered.  I'm speaking on 9.4 (Friday), and PostgreSQL on AWS (Sunday).
  • March 10, Burlingame, CA: pgDay SF 2015.  Running the event, and giving a lightning talk.
  • March 25-27, NYC, NY: pgConf NYC: speaking on PostgreSQL on PAAS: a comparison of all the big ones.
  • April 25-26, Bellingham, WA: LinuxFest NorthWest, tentatively.  Talks haven't been chosen yet.  If I go, I'll also be working a booth no doubt.  I understand there are plans to have a bunch of Postgres stuff at this event.
  • June 16-20, Ottawa, Canada: pgCon of course.
  • July 20-24, Portland, OR: OSCON (tentatively, talks not selected).  Postgres talk of some sort, and probably booth duty.
Now you know.

Sunday, February 15, 2015

Running with scissors mode

DBAs Running with Scissors

Based on some comments in my post about "in-memory" databases, I realized that my post about running Postgres without disk sync was no longer available on Database Soup.  So I'm reposting the instructions here.

This blog post version has corrections and clarifications thanks to Andres Freund, since it first went up yesterday.  Thanks, Andres.

Running PostgreSQL this way was christened "running with scissors mode" by Gavin Roy, because you're operating completely without crash-safety; if something happens to the server, even a power fluctuation, your database contents are untrustworthy and may be corrupt.  However, it can be a useful way to run Postgres for extra, read-only replicas used strictly for load-balancing, or if what you're loading into Postgres is completely disposable/replaceable.

Note that these settings do not, in fact, disable all disk writes.  What they do instead is minimize disk writes, and make all disk writes asynchronous, dependent entirely on the OS's own memory swapping and dirty block flushing for any disk writes.  This is what you want; you don't want the database to halt because, for example, you simply ran out of space in memory.

So, without further ado, here are the settings:

    work_mem =  (RAM - DBsize - shared_buffers)*2 / max_connections
    temp_buffers = (RAM - DBsize - shared_buffers)*4 / max_connections
    temp_file_limit = 0

On the one hand, we want to set work_mem high in order to avoid on-disk sorts. On the other hand, having pinned RAM for sorts push the database out of memory would be counterproductive.  As such, you want to set up work memory to use available RAM you don't need for database caching.  The above assumes that max_connections is set to something sensible for the number of connections you actually need.  You should really be using pgbouncer as well with this setup.

Set temp_file_limit = 0 to cause queries to be cancelled instead of doing disk sorts.
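
As a worked example of the arithmetic, here is a small Python sketch that applies the two formulas above.  The function name, the use of megabytes throughout, and the rounding down are assumptions for illustration only.

```python
def scissors_memory_settings(ram_mb, db_size_mb, shared_buffers_mb, max_connections):
    """Apply the work_mem and temp_buffers formulas above; all sizes in MB."""
    spare_mb = ram_mb - db_size_mb - shared_buffers_mb
    if spare_mb <= 0:
        raise ValueError("no RAM left over after database and shared_buffers")
    # Round down so the totals never exceed what the formulas allow.
    work_mem_mb = max(1, spare_mb * 2 // max_connections)
    temp_buffers_mb = max(1, spare_mb * 4 // max_connections)
    return {
        "work_mem": f"{work_mem_mb}MB",
        "temp_buffers": f"{temp_buffers_mb}MB",
        "temp_file_limit": 0,
    }
```

For a machine with 64GB of RAM, a 20GB database, 16GB of shared_buffers, and 100 connections, `scissors_memory_settings(65536, 20480, 16384, 100)` works from 28GB of spare RAM.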

    bgwriter_lru_maxpages = 0
    wal_level = minimal
    fsync = off
    synchronous_commit = off
    full_page_writes = off

    wal_log_hints = off
    wal_buffers = 64MB

Here we're minimizing the amount of writing we do to the transaction log, and making said writing completely asynchronous.  We're also disabling background writing.

    checkpoint_segments = 8 to 64
    checkpoint_timeout = 60min
    checkpoint_completion_target = 0.9

Checkpoint segments is a bit trickier.  On the one hand, you want it to be large enough that it's not cycling a lot and triggering extra disk flushes. On the other hand, you want all the segments to stay cached in RAM.  So something moderate, 256MB to 2GB, depending on how much RAM you have.  Don't set it to more than 1/32nd of RAM.  Ideally, we'd be able to disable checkpoints entirely, but currently there's no reasonable way to do that.

    stats_temp_directory = '/pgramdisk/stats_tmp'

You will also want to move the stats file to a ramdisk so that it's not being written out.  This is a good optimization in general, even outside of running with scissors mode.

Finally, we need to ensure that PostgreSQL will not restart if the system crashes; at that point, you assume your database is corrupt and proceed to recover it from another source.  The first part of doing this is to disable any autostart in your init system configuration.  Secondly, after starting up Postgres, add a line like this to the beginning of postgresql.conf:

    DO_NOT_RESTART = true

The exact parameter you use doesn't matter; what matters is that it's not a recognized parameter, so that Postgres will error out instead of  restarting.  This does mean extra steps when you want to manually restart this node for configuration changes, but there isn't a really good way around that.

Next up, we'll run some performance tests to see how much this benefits us.

Continued in Part II.

Photo of "DBAs running with scissors" taken by Gavin Roy.

Friday, February 13, 2015

"In-memory" is not a feature, it's a bug

So, I'm hearing again about the latest generation of "in-memory databases". Apparently Gartner even has a category for them now.  Let me define an in-memory database for you:

     An in-memory database is one which lacks the capability of spilling to disk.

As far as I know from my reading of the industry literature, nobody has demonstrated any useful way in which data should be stored differently if it never spills to disk.  While the talented engineers of several database products have focused on other performance optimizations to the exclusion of making disk access work, that's not an optimization of the database; it's an optimization of engineer time.  The exact same database, with disk access capabilities, would be automatically superior to its predecessor, because users would now have more options.

PostgreSQL can be an "in-memory" database too, if you simply turn all of the disk storage features off.  This is known as "running with scissors" mode, and people use it to good effect on public clouds with disposable replicas.

So an "in-memory" database is a database with a major limitation.  It's not a feature, any more than an incapability of supporting SQL access is a feature.  Let's define databases by their useful features, not by what they lack, please.

Besides which, with the new types of persistent memory and fast random access storage coming down the pipe in a couple years, there soon won't be any difference between disk and memory anyway.