
Thursday, March 31, 2016

9.5.2 update release and corrupt indexes

We've released an off-schedule update release today, because of a bug in one of 9.5's features which has forced us to partially disable the feature.  This is, obviously, not the sort of thing we do lightly.

One of the performance features in 9.5 was an optimization, contributed by Peter Geoghegan, which sped up sorts of text and numeric values.  This was an awesome feature which sped up sorts across the board by 50% to 2000%, and since databases do a lot of sorting, it was an overall speed increase for PostgreSQL.  It was especially effective in speeding up index builds.

That feature depends on a built-in glibc function, strxfrm(), which can be used to transform a string into a binary sort key.  Now, the POSIX standard says that strxfrm() + strcmp() should produce sorting results identical to the strcoll() function.  And in our general tests, it did.

However, there are dozens of versions of glibc in the field, and hundreds of collations, and it's computationally unreasonable to test all combinations.  Which is how we missed the problem until a user reported it.  It turns out that for certain releases of glibc (particularly anything before 2.22 on Linux or BSD), with certain collations, strxfrm() and strcoll() return different results due to bugs.  Which can result in an index lookup failing to find rows which are actually there.  In the bug report, for example, an index on a German text column on RedHat Enterprise Linux 6.5 would fail to find many rows in a "between" search.

As a result, we've disabled the feature in 9.5.2 for all indexes which are on collations other than the simplified "C" collation.  This sucks.

Also, if you're on 9.5.0 or 9.5.1 and you have indexes on columns with real collations (i.e. not "C" collation), then you should REINDEX (or CREATE INDEX CONCURRENTLY + DROP INDEX CONCURRENTLY) each of those indexes.  Which really sucks.
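For example, a non-blocking rebuild of one affected index might look something like this (the table and index names here are made up purely for illustration; substitute your own):

    -- hypothetical names, shown only to illustrate the concurrent swap
    CREATE INDEX CONCURRENTLY docs_title_idx_new ON docs (title);
    DROP INDEX CONCURRENTLY docs_title_idx;
    ALTER INDEX docs_title_idx_new RENAME TO docs_title_idx;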

Of course we're discussing ways to bring back the feature, but nobody has a solution yet. In the meantime, you can read more about the problem on the wiki page.

Friday, January 8, 2016

Configuration changes in 9.5: transaction log size

If you downloaded 9.5 after yesterday's release, you might have noticed some changes to postgresql.conf, especially if you copied over your favorite 9.4 config and it refused to start up.  In particular, the parameter checkpoint_segments is gone, and has been replaced by min_wal_size and max_wal_size.  What does this mean?

Well, checkpoint_segments was the old way we had for users to determine how "big" the transaction log should be.  Users with a large server or a lot of transactions per second would set it high, and users with small VMs and a slow database could set it low.  This had some problems, though:

  1. The default setting worked for pretty much nobody, so you always had to adjust it.
  2. The WAL*  always used the maximum space available, even if it wasn't needed.
  3. Figuring out the largest size your WAL could be required some math and a knowledge of version-specific PostgreSQL internals.
The last was the most confusing part for users; the calculation for maximum WAL size was:

   ( ( checkpoint_segments * 2 ) + 1 ) * 16MB  +  ( wal_keep_segments * 16MB )

... which meant that people generally sized it by picking an arbitrary number and then adjusting up or down based on feedback.
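To see why, take a hypothetical (and entirely made-up) configuration of checkpoint_segments = 32 and wal_keep_segments = 128.  The formula above comes out to:

    ( ( 32 * 2 ) + 1 ) * 16MB  +  ( 128 * 16MB )  =  1040MB + 2048MB  =  roughly 3GB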

The new parameters are way simpler:
  • min_wal_size: the minimum size the transaction log will be;
  • max_wal_size: the maximum size the transaction log will be (but see below)
This means that your transaction log on disk shouldn't ever be larger than ( max_wal_size + wal_keep_segments ).  It is a "soft" limit though; if PostgreSQL gets really behind, or if archiving is failing, it will get higher than max_wal_size.

However, that isn't the really cool part.  Heikki did an amazing thing, in that the WAL is sized dynamically based on how much was used during the previous cycles.  So you can set max_wal_size to some high value (default is 1GB),  and not worry about PostgreSQL using a bunch of extra disk space if it's not needed.  This means that we can set a default which will be "good enough" for 80% of our users, and we have.  This makes me do a little happy dance.

The other 20% may still want to tune, though, so here are some guidelines:
  • if you know your database write traffic is "bursty", with long periods of inactivity followed by furious writes, increase min_wal_size;
  • if you do bulk loads larger than 1GB, increase max_wal_size to the size of a bulk load;
  • if you write more than 1GB of data every 10 minutes, increase max_wal_size;
  • if you get "checkpoints occurring too frequently" error messages, try increasing both parameters incrementally.
Most users won't need to touch those parameters at all, though.  Which is as it should be.
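As an illustration, a hypothetical postgresql.conf for a server with bursty writes and occasional multi-gigabyte bulk loads might contain something like the following (the values are invented for the example, not recommendations):

    # hypothetical values for a bursty, bulk-loading server
    min_wal_size = 1GB        # keep this much WAL recycled between bursts
    max_wal_size = 4GB        # leave headroom for bulk loads bigger than 1GB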

* WAL == Write Ahead Log == Transaction Log == XLog



Monday, March 16, 2015

Benchmarking Postgres in the Cloud, part 1

In 2008, when Heroku started, there was really only one option for cloud hosting PostgreSQL: roll-your-own, on EC2 or a couple of other not-very-competitive platforms.  Since then, we've seen the number of cloud hosting providers explode, and several "PostgreSQL-As-A-Service" providers have appeared as well: first Heroku, then Gandi, CloudFoundry, RDS, OpenShift and more.  This has led many of pgExperts' clients to ask: "Where should I be hosting my PostgreSQL?"

So to provide a definitive answer to that question, for the past several weeks I've been doing some head-to-head testing of different cloud hosting options for PostgreSQL.  Even more work has been done by my collaborator, Ruben Rudio Rey of ManageACloud.com.  I will be presenting on the results of this testing in a series of blog posts, together with a series of presentations starting at SCALE and going through pgConf NYC, LinuxFestNorthWest, and culminating at pgCon.   Each presentation will add new tests and new data.

Here are my slides from SCALE, which compare AWS, RDS, and Heroku, if you want to get some immediate data.

What We're Testing

The idea is to run benchmarks against ephemeral instances of PostgreSQL 9.3 on each cloud or service.  Our main goal is to collect performance figures, since while features and pricing are publicly available, performance information is not.  And even when the specification is the same, the actual throughput is not.  From each cloud or service, we are testing two different instance sizes:

Small: 1-2 cores, 3 to 4GB RAM, low throughput storage (compare EC2's m3.medium).  This is the "economy" instance for running PostgreSQL; it's intended to represent what people with non-critical PostgreSQL instances buy, and to answer the question of "how much performance can I get for cheap".

Large: 8-16 cores, 48 to 70GB RAM, high throughput storage (compare EC2's r3.2xlarge).  This is the maximum for a "high end" instance which we could afford to test in our test runs. 

The clouds we're testing or plan to test include:
  • AWS EC2 "roll-your-own".
  • Amazon RDS PostgreSQL
  • Heroku
  • Google Compute Engine
  • DigitalOcean
  • Rackspace Cloud
  • OpenShift PostgreSQL Cartridge
  • (maybe Joyent, not sure)
Note that in many cases we're working with the cloud vendor to achieve maximum performance results.  Our goal here isn't to "blind test" the various clouds, but rather to try to realistically deliver the best performance we can get on that platform.  In at least one case, our findings have resulted in the vendor making improvements to their cloud platform, which then allowed us to retest with better results.

The tests we're running include three pgbench runs:

  • In-Memory, Read-Write (IMRW): pgbench database 30% to 60% of the size of RAM, full transaction workload
  • In-Memory, Read-Only (IMRO): pgbench database 30% to 60% of RAM, read-only queries
  • On-Disk, Read-Write (ODRW): pgbench database 150% to 250% of RAM, full transactions
The idea here is to see the different behavior profiles with WAL-bound, CPU-bound, and storage-bound workloads.  We're also recording the load time for each database, since bulk loading behavior is useful information for each platform. 
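For reference, the pgbench invocations for these profiles look roughly like the following; the scale factor and client counts shown here are illustrative placeholders, since the real values were chosen per instance size (each pgbench scale unit is very roughly 15MB of data):

    # initialize a test database at a chosen scale (size)
    pgbench -i -s 1000 bench
    # read-write run: full transaction workload for 10 minutes
    pgbench -c 16 -j 4 -T 600 bench
    # read-only run: same, but SELECT-only transactions
    pgbench -c 16 -j 4 -T 600 -S bench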

Each combination of cloud/size/test needs to then be run at least 5 times in order to get a statistically useful sample.  As I will document later, often the difference between runs on the same cloud was greater than the difference between clouds.

Issues with pgBench as a Test Tool

One of the early things I discovered was some of the limitations of what pgbench could tell us.  Its workload is 100% random access and homogeneous one-liner queries.  It's also used extensively and automatically to test PostgreSQL performance, which means PostgreSQL's defaults are already reasonably well adapted to it.  As a result, we found that postgresql.conf tuning made little or no difference at all, so our original plan to test "tuned" vs. "untuned" instances went by the wayside.

We also found on public clouds that, because of the rapid-fire nature of pgbench queries, performance on most workloads was dominated by network response times more than anything else.  We did not use pgbench_tools, because that is concerned with automating many test runs against one host rather than a few test runs against many hosts.

For this reason, we also want to run a different, more "serious" benchmark which works out other performance areas.  To support this, I'm working on deploying Jignesh's build of DVDStore so that I can do that benchmark against the various platforms.  This will require some significant work to make a reality, though; I will need to create images or deployment tools on all of the platforms I want to test before I can do it.

To be continued ...

Thursday, February 26, 2015

Why you might need statement_cost_limit

Here's a commonplace ops crisis: the developers push a new dashboard display widget for user homepages on your application.  This seems to work fine in testing, and they push it out to production ... not realizing that for some large subset of users dissimilar from your tests, the generated query triggers a sequential scan on the second-largest table in the database.  Suddenly your database servers are paralyzed with load, and you have to shut down the whole site and back out the changes.

Wouldn't it be nice if you could just tell the database server "don't run expensive queries for the 'web' user"?  Well, thanks to my colleague Andrew Dunstan, who wrote plan_filter with support from Twitch.TV, now you can.

Sort of.  Let me explain.

PostgreSQL has had statement_timeout for a while, which can be set on a per-user basis (or other places) to prevent application errors from running queries for hours.  However, this doesn't really solve the "overload" issue, because the query runs for that length of time, gobbling resources until it's terminated.  What you really want to do is return an error immediately if a query is going to be too costly.
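For reference, the per-user form of statement_timeout looks like this (using the same "web" role as the examples below, and an arbitrary 30-second limit):

    ALTER ROLE web SET statement_timeout = '30s';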

plan_filter is a loadable module which allows you to set a limit on the cost of queries you can execute.  It works, as far as we know, with all versions of Postgres starting at 9.0 (we've tested 9.1, 9.3 and 9.4). 

Let me show you.  First, you have to load the module in postgresql.conf:

    shared_preload_libraries = 'plan_filter'

Then you alter the "web" user to have a strict limit:

    ALTER USER web SET plan_filter.statement_cost_limit = 200000.0

Then try some brain-dead query as that user, like a blanket select from the 100m-row "edges" graph table:

    \c - web
    SELECT * FROM edges;

    STATEMENT:  select * from edges;
    ERROR:  plan cost limit exceeded
    HINT:  The plan for your query shows that it would probably
    have an excessive run time. This may be due to a logic error
    in the SQL, or it maybe just a very costly query. Rewrite 
    your query or increase the configuration parameter
    "plan_filter.statement_cost_limit".

Obviously, your application needs to handle this error gracefully, especially since you'll likely get it for hundreds or thousands of queries at once if you're sending bad queries due to a code change. But a bunch of errors is definitely better than having to restart your whole app cluster.   It's comparatively easy to just display a broken widget icon.

So why did I say "sort of", and why aren't we submitting this as a feature for PostgreSQL 9.5?

Well, there are some issues with limiting by plan cost.  The first is that if you can't run the query due to the cost limit, you also can't run an EXPLAIN to see why the query is so costly in the first place.  You'd need to set plan_filter.statement_cost_limit = 0 in your session to get the plan.
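In practice that workaround looks something like this (same hypothetical "edges" table as above):

    -- lift the limit in this session only, then inspect the plan
    SET plan_filter.statement_cost_limit = 0;
    EXPLAIN SELECT * FROM edges;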

The second, and much bigger, issue is that plan cost estimates are just that: estimates.  They don't necessarily accurately show how long the query is actually going to take.  Also, unless you do a lot of cost tuning, costs do not necessarily consistently scale between very different queries.   Worst of all, some types of queries, especially those with LIMIT clauses, can return a cost in the plan which is much higher than the real cost because the planner expects to abort the query early.

So you're looking at a strong potential for false positives with statement_cost_limit.  This means that you need to start with the limit set very high (like 5000000) and work your way down, and to test it on your staging cluster to make sure that you're not bouncing lots of legitimate queries.  Overall, statement_cost_limit is mainly useful to DBAs who know their query workloads really well.

That means it's not ready for core Postgres (assuming it ever is).  Fortunately, PostgreSQL is extensible so you can use it right now while you wait for it to eventually become a feature, or to be supplanted by a better mechanism of resource control.

Sunday, February 15, 2015

Running with scissors mode

DBAs Running with Scissors

Based on some comments in my post about "in-memory" databases, I realized that my post about running Postgres without disk sync was no longer available on Database Soup.  So I'm reposting the instructions here.

This blog post version has corrections and clarifications thanks to Andres Freund, since it first went up yesterday.  Thanks, Andres.

Running PostgreSQL this way was christened "running with scissors mode" by Gavin Roy, because you're operating completely without crash-safety; if something happens to the server, even a power fluctuation, your database contents are untrustworthy and may be corrupt.  However, it can be a useful way to run Postgres for extra, read-only replicas used strictly for load-balancing, or if what you're loading into Postgres is completely disposable/replaceable.

Note that these settings do not, in fact, disable all disk writes.  What they do instead is minimize disk writes and make them all asynchronous, depending entirely on the OS's own memory swapping and dirty block flushing.  This is what you want; you don't want the database to halt because, for example, you simply ran out of space in memory.

So, without further ado, here are the settings:

    work_mem =  (RAM - DBsize - shared_buffers)*2 / max_connections
    temp_buffers = (RAM - DBsize - shared_buffers)*4 / max_connections
    temp_file_limit = 0

On the one hand, we want to set work_mem high in order to avoid on-disk sorts. On the other hand, having RAM pinned for sorts push the database cache out of memory would be counterproductive.  As such, you want work memory to use the RAM you don't need for database caching.  The above assumes that max_connections is set to something sensible for the number of connections you actually need.  You should really be using pgbouncer as well with this setup.
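To make that concrete, here's a hypothetical worked example (my numbers, not a recommendation): on a 64GB machine with a 32GB database, 8GB of shared_buffers, and max_connections = 100, the formulas above work out to:

    work_mem     = ( 64GB - 32GB - 8GB ) * 2 / 100  =  about 490MB
    temp_buffers = ( 64GB - 32GB - 8GB ) * 4 / 100  =  about 980MB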


Set temp_file_limit = 0 to cause queries to be cancelled instead of doing disk sorts.

     bgwriter_lru_maxpages = 0
     wal_level = minimal
     fsync = off
     synchronous_commit = off
     full_page_writes = off

     wal_log_hints = off
     wal_buffers = 64MB


Here we're minimizing the amount of writing we do to the transaction log, and making said writing completely asynchronous.  We're also disabling background writing.

     checkpoint_segments = 8 to 64
     checkpoint_timeout = 60min
     checkpoint_completion_target = 0.9

Checkpoint segments is a bit trickier.  On the one hand, you want it to be large enough that it's not cycling a lot and triggering extra disk flushes. On the other hand, you want all the segments to stay cached in RAM.  So something moderate, 256MB to 2GB, depending on how much RAM you have.  Don't set it to more than 1/32nd of RAM.  Ideally, we'd be able to disable checkpoints entirely, but currently there's no reasonable way to do that.
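As a hypothetical example of that 1/32nd rule: on a 64GB machine the ceiling would be 2GB of WAL, which by the usual ((checkpoint_segments * 2) + 1) * 16MB estimate puts checkpoint_segments at around 60.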

   stats_temp_directory = '/pgramdisk/stats_tmp'

You will also want to move the stats file to a ramdisk so that it's not being written out.  This is a good optimization in general, even outside of running with scissors mode.


Finally, we need to ensure that PostgreSQL will not restart if the system crashes; at that point, you assume your database is corrupt and proceed to recover it from another source.  The first part of doing this is to disable any autostart in your init system configuration.  Secondly, after starting up Postgres, add a line like this to the beginning of postgresql.conf:

     DO_NOT_RESTART=True

The exact parameter you use doesn't matter; what matters is that it's not a recognized parameter, so that Postgres will error out instead of  restarting.  This does mean extra steps when you want to manually restart this node for configuration changes, but there isn't a really good way around that.

Next up, we'll run some performance tests to see how much this benefits us.

Continued in Part II.

Photo of "DBAs running with scissors" taken by Gavin Roy.

Friday, February 13, 2015

"In-memory" is not a feature, it's a bug

So, I'm hearing again about the latest generation of "in-memory databases". Apparently Gartner even has a category for them now.  Let me define an in-memory database for you:

     An in-memory database is one which lacks the capability of spilling to disk.

As far as I know from my reading of the industry literature, nobody has demonstrated any useful way in which data should be stored differently if it never spills to disk.  While the talented engineers of several database products have focused on other performance optimizations to the exclusion of making disk access work, that's not an optimization of the database; it's an optimization of engineer time.  The exact same database, with disk access capabilities, would be automatically superior to its predecessor, because users would now have more options.

PostgreSQL can be an "in-memory" database too, if you simply turn all of the disk storage features off.  This is known as "running with scissors" mode, and people do it for useful effect on public clouds with disposable replicas.

So an "in-memory" database is a database with a major limitation.  It's not a feature, any more than an incapability of supporting SQL access is a feature.  Let's define databases by their useful features, not by what they lack, please.

Besides which, with the new types of persistent memory and fast random access storage coming down the pipe in a couple years, there soon won't be any difference between disk and memory anyway.

Friday, January 30, 2015

Tag All The Things, part 3

Continued from Part 2.

The next test is two tags combined.  This is where the alternative approaches really pull ahead of the traditional tagging approaches.

For example, here's the text tag query:

    select doc_id
    from doc_tags_text dt1
        join doc_tags_text dt2
        using (doc_id)
    where dt1.tag = 'math'
        and dt2.tag = 'physics'
    order by doc_id limit 25;


The query for tag IDs is even worse:

    select di1.doc_id
    from doc_tags_id di1
        join doc_tags_id di2
        on di1.doc_id = di2.doc_id
        join tags tags1
        on di1.tag_id = tags1.tag_id
        join tags tags2
        on di2.tag_id = tags2.tag_id
    where tags1.tag = 'thrift shop'
        and tags2.tag = 'blogging'
    order by di1.doc_id limit 25;


Imagine how either of these would look for three tags, or four.   Now compare that with the JSONB and array queries:

    select doc_id
    from doc_tags_array
    where tags @> array['math','physics']
    order by doc_id limit 25;

    with find_docs as (
      select doc_id
      from doc_tags_json
      where tags @> '[ "thrift shop", "blogging" ]'
    )
    select * from find_docs
    order by doc_id limit 25;


(the dodge with the WITH clause is to force use of the JSONB index, per Part 2)

Big difference, eh?  It can probably be taken as a given that if you need to do searches which involve combining two or more tags, you really want to use a GIN indexed approach just for code maintainability.  Just in case, though, let's look at performance, both for combining two common tags ("math" and "physics", for 957 hits), and two rare tags ("thrift shop" and "blogging", which only have 5 hits).  The differences in performance in the approaches were so extreme, I have to use a logarithmic scale for this graph, and am providing the raw numbers:




That's a huge difference.  The "JSONB fixed" numbers are for a query where I force the planner to use the JSONB index instead of letting it choose its own path (using the doc_id index).  As you can see, two-tag search via GIN index is an order of magnitude faster for common tags, and three orders of magnitude faster for rare tags.  And the ID approach completely bombs for two-tag search.

For our final test, we'll build a tag cloud from scratch.  This involves pulling counts of all distinct tags and then taking the top 100.  In a real production environment, you wouldn't do things this way; you'd maintain counts by trigger, or use HyperLogLog, or something similar.  But it makes a good test of mass index access and/or table scans for the various approaches.

Here the JSONB and array queries get annoying.  Examples:

    select count(*) as tag_count, tag
    from doc_tags_json
        join tags on doc_tags_json.tags @> to_json(tags.tag)::jsonb
    group by tag
    order by tag_count desc limit 100;

    select count(*) as tag_count, tag
    from doc_tags_array
        join tags on doc_tags_array.tags @> array[tags.tag::text]
    group by tag
    order by tag_count desc limit 100;


As I mentioned at the beginning of this series, arrays don't automatically prevent duplicate entries.  If I don't care that much about accuracy (and generally for tag clouds one doesn't) I can use a faster query with UNNEST:

    select count(*) as tag_count, ut.tag
    from doc_tags_array,
       lateral unnest(doc_tags_array.tags) as ut(tag)
    group by ut.tag
    order by tag_count desc limit 100;




(the lateral unnest relies on LATERAL, a 9.3 feature, and is darned useful).  Let's look at execution times:



So, here you can see that the array and JSONB approaches lose; they simply can't be as fast as a plain text column for building a count.  This does mean that if your application spends a lot of its time dynamically building tag clouds, arrays might not be the way to go and JSONB certainly isn't.

Conclusion: the overall winner is an array of text, with a GIN index.  This is better for one-tag searches, worlds faster for two-tag searches, and competitive at other tasks.  It's also the smallest representation, and becomes smaller and faster still if you actually put the array of tags in the documents table.  Still, there are times that you would want to use the traditional child table with plain text tags: if you build tag clouds a lot or if you never search for two tags and your ORM can't deal with Postgres arrays.

Sadly, I have to conclude that my personal favorite, JSONB, isn't quite ready for this use.  See discussions on the pgsql-performance list regarding how to estimate selectivity for JSONB contains.

I realize that I did not test comparative write speeds.  Maybe next time; I've already torn down the AWS instance.

Wednesday, January 28, 2015

Tag All The Things

Many web applications are designed with a "tags" feature, where users can add arbitrary string "tags" to each document (or post or picture or event or whatever).  Like other user-extensible data, tags provide special challenges for SQL/relational databases.  They can be awkward to write queries for, and at the high end perform really poorly.  PostgreSQL 9.4 offers some data structures to take the awkwardness out of managing and querying tags ... but which one is best, and how do they perform?  I'll answer that over a few blog posts.

To test various ways of tagging, I created a test server on AWS (an m3.large) and put PostgreSQL on local storage to eliminate latency as a consideration.  I then took a population of tags from two real databases: a document management application and a recipe database.  This gave me a population of 165 unique tags to work with.

The simplest way to store tags is just the basic text tags table:

    table doc_tags_text (
        doc_id int not null references documents(doc_id),
        tag text not null
    )
    unique index doc_tags_text_doc_id_tag on (doc_id, tag)
    index doc_tags_text_tag on (tag)


This has the advantage of being simple, direct, and fast for some queries.  I also created the "normalized" version of the above, which is equally common:

    table tags (
        tag_id serial not null primary key,
        tag text not null unique
    )

    table doc_tags_id (
        doc_id int not null references documents(doc_id),
        tag_id int not null references tags(tag_id)
    )
    unique index doc_tags_id_doc_id_tag_id on (doc_id, tag_id)
    index doc_tags_id_tag_id on (tag_id)


I put "normalized" in quotes because using a surrogate integer key doesn't actually make our schema more normal in any real sense.  Nor does it make it smaller or faster, as we will soon see.

Now for the advanced Postgres magic.  Of course, one of the first things I looked at was 9.4's JSONB.  In particular, JSON arrays stored in JSONB seemed to have everything I was looking for: they can store arrays of strings or numbers, automatically deduplicate, and are order-independent.  As long as we're doing arrays, though, I should also use a standard TEXT array.  That has some disadvantages; sometimes it is order-dependent, and it doesn't automatically remove duplicates.  But PostgreSQL offers GIN indexing for all one-dimensional arrays, which makes that at least theoretically useful for tags.

    table doc_tags_json (
        doc_id int not null references documents(doc_id),
        tags jsonb
    )
    unique index doc_tags_json_doc_id on (doc_id)
    index doc_tags_json_tags using gin (tags)

    table doc_tags_array (
        doc_id int not null references documents(doc_id),
        tags text[] not null default '{}'
    )
    unique index doc_tags_array_doc_id on (doc_id)
    index doc_tags_array_tags using gin (tags)


I did consider also trying a TSVECTOR, which would seem obvious for a bag of text strings.  But there are some issues with using TSearch for tags: tags would be stemmed, and multiword tags would get broken up into individual terms.  This is not how most people expect tags to work.   While I could work around those issues, it just wasn't worth testing.

Then I populated them.  In real applications, tag distribution is heavily skewed.  Some tags show up on 15% of all documents while others are used on only one, and most documents have only one tag while a few have a dozen.  As such, I wrote a quick script to distribute the tags geometrically across the documents.  Thanks to Steve Atkins for the math on this, and I'll share the functions.  The first one returns a number between 1 and 14, heavily skewed towards 1, and the second a number between 1 and 150, again with tag #1 ("technology") 10,000 times as common as tag #150 ("Summer 2015 Summit").  I also randomized document and tag ordering so that none of the queries could take advantage of data more ordered than it would be in a real application.

    create or replace function geornd_10()
    returns int
    language sql
    as $f$
    select round(log(random()::numeric)/log(0.3))::INT + 1;
    $f$;

    create or replace function geornd_100()
    returns int
    language sql
    as $f$
    select round(log(random()::numeric)/log(0.9))::INT + 1;
    $f$;


To get a large enough population to show some performance differences, I created 10 million "documents" and tagged them all, assigning each of the four data structures the exact same document ids and tags.  Now, it's time to test.

Can you guess which one turned out to be the best all-around?

Continued in the next post.

Also, if anyone wants the tools I used to generate the test case, I can package them up, but it would be some work, so you gotta ask.

Wednesday, November 19, 2014

Good kernel, bad kernel

A month ago I got into an argument on IRC with Sergey about telling people to avoid kernel 3.2.  This turned out to be a very productive argument, because Sergey then went and did a battery of performance tests against various Linux kernels on Ubuntu. Go read it now, I'll wait.

My takeaways from this:

  • Kernel 3.2 is in fact lethally bad.
  • Kernel 3.13 is the best out of kernel 3.X so far.  I hope that this can be credited to the PostgreSQL team's work with the LFS/MM group.
  • No 3.X kernel yet has quite the throughput of 2.6.32, at least at moderate memory sizes and core counts.
  • However, kernel 3.13 has substantially lower write volumes at almost the same throughput.  This means that if you are write-bound on IO, 3.13 will improve your performance considerably.
  • If your database is mostly-reads and highly concurrent, consider enabling kernel.sched_autogroup_enabled.
Thanks a lot to Sergey for doing this testing, and thanks even more to the LFS/MM group for improving IO performance so much in 3.13.

Thursday, November 6, 2014

We need a webapp benchmark

I've been doing some comparative testing on different cloud platforms' hosting of PostgreSQL.  And one of the deficiencies in this effort is that the only benchmarking tool I have is pgbench, which doesn't reflect the kinds of workloads people would want to run on cloud hosting.  Don't get me wrong, pgbench does everything you could imagine with its simple TPC-B-style benchmark, including statistics and sampling.  But the core benchmark is still something which doesn't look much like the kind of Rails and Django apps I deal with on a daily basis.

There's also TPCC-js, which is more sophisticated, but is ultimately still a transactional, back-office OLTP benchmark. 

So I'm thinking of developing a "webapp" benchmark.  Here's what I see as concepts for such a benchmark:
  • Read-mostly
  • No multi-statement transactions
  • Defined "users" concept with logins and new user registration
  • Needs a "sessions" table which is frequently updated
  • Read-write, read-only and session database connections should be separable, in order to test load-balancing optimization.
  • Queries counting, sorting and modifying content
  • Measured unit of work is the "user session" which would contain some content lookups and minor updates ("likes").
Now, one of the big questions is whether we should base this benchmark on the idea of a social networking (SN) site.  I think we should; SN sites test a number of things, including locking and joins which might not be exercised by other types of applications (and aren't by pgbench).  What do you think?  Does anyone other than me want to work on this?

Saturday, November 1, 2014

Finding Foreign Keys with No Indexes

Unlike some other SQL databases, PostgreSQL does not automatically create indexes on the "child" (in formal language "referencing") side of a foreign key.  There are some good reasons for this (see below), but it does give users another opportunity to forget something, which is indexing the foreign keys (FKs) that need it.  But which ones need it?  Well, fortunately, you can interrogate the system catalogs and find out.

I have added a query for finding foreign keys without indexes to pgx_scripts.  The "parent", or "referenced", side of the FK is automatically indexed (it has to be, because it needs to be unique), so we won't talk about it further.

Now, in order to understand how to interpret this, you have to understand why you would or would not have an index on an FK, and what sort of indexes are valid.  There are two times that indexes on the child side of FKs are used:
  • when doing JOIN and lookup queries using the FK column
  • when updating or deleting a row from the "parent" table
The second occasion is news to some DBAs.  The way it works is this: before letting you delete or change a row in the "parent" table, Postgres has to verify that there are no rows in the "child" table referencing the FK value that might be going away.  If there are, it needs to perform the action you have defined (such as CASCADE, SET NULL or RESTRICT).  If the "child" table is large, this can be substantially speeded up by having an index on the FK column.

This means that it's important to have an index on the child side of the FK if any of the following are true:
  • The child table is large and the parent table gets updates/deletes
  • The parent table is large and the FK is used for JOINs
  • The child table is large and the FK is used to filter (WHERE clause) records on the child table
This means most FKs, but not all of them.  If both tables are small, or if the parent table is small and the FK is used only to prevent bad data entry, then there's no reason to index it.  Also, if the FK is very low cardinality (like, say, only four possible values) then it's probably also a waste of resources to index it.

Now you're ready to run the query on your own database and look at the results. The query tries to filter for the best indexing candidates, but it is just a query and you need to use your judgement on what you know about the tables.  The query also filters for either the parent or child table being larger than 10MB.
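If you just want a rough idea before pulling down pgx_scripts, here is a much-simplified sketch of the concept (it only handles single-column FKs and ignores table sizes, unlike the real query):

    -- simplified sketch, not the pgx_scripts query: single-column FKs
    -- with no index whose leading column matches the FK column
    SELECT con.conrelid::regclass AS child_table,
           con.conname            AS fk_name,
           att.attname            AS fk_column
    FROM pg_constraint con
    JOIN pg_attribute att
      ON att.attrelid = con.conrelid AND att.attnum = con.conkey[1]
    WHERE con.contype = 'f'
      AND array_length(con.conkey, 1) = 1
      AND NOT EXISTS (
            SELECT 1 FROM pg_index i
            WHERE i.indrelid = con.conrelid
              AND i.indkey[0] = con.conkey[1]   -- indkey subscripts are 0-based
          );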

Now, you might say "but I have an index on column Y" and wonder why it's appearing on the report.   That's probably because the FK doesn't match the first column(s) of the index.  For example, an index on ( name, team_id ) cannot be used for an FK on team_id.

You may notice a 2nd section of the report called "questionable indexes".  These are FKs which have an index available, but that index may not be usable for JOINs and constraint enforcement, or may be very inefficient.  This includes:
  • Non-BTree indexes.  Currently other types of indexes can't be used for FK enforcement, although that is likely to change in future Postgres versions.  But they can sometimes be used for joins.
  • Indexes with more than one column in addition to the FK columns.  These indexes can be used for FKs, but they may be very inefficient at it due to the bigger size and extra index levels.
  • Partial indexes (i.e. INDEX ... WHERE).  In general, these cannot be used for FK enforcement, but they can sometimes be used for joins.
It's not quite as clear when you want to add to, or replace those indexes.  My advice is to start with the missing indexes and then move on to the more nuanced cases.




Friday, October 10, 2014

New Table Bloat Query

To accompany the New Index Bloat Query, I've written a New Table Bloat Query.  This also involves the launch of the pgx_scripts project on GitHub, which will include most of the "useful scripts" I talk about here, as well as some scripts from my co-workers.

The new table bloat query is different from the check_postgres.pl version in several ways:
  • Rewritten to use WITH statements for better maintainability and clarity
  • Conditional logic for old Postgres versions and 32-bit platforms taken out
  • Index bloat removed, since we have a separate query for that
  • Columns formatted to be more immediately comprehensible
In the course of building this, I found two fundamentally hard issues:
  1. Some attributes (such as JSON and polygon fields) have no stats, so those tables can't be estimated.
  2. There's no good way to estimate bloat for compressed (TOAST) attributes and rows.
Also, while I rewrote the query almost entirely, I am still relying on Greg's core math for estimating table size.  Comparing this with the results of pgstattuple, I'm seeing an error of +/- 20%, which is pretty substantial.  I'm not clear on where that error is coming from, so help improving the math is very welcome!

Results look like this:

  databasename | schemaname |    tablename    | pct_bloat | mb_bloat | table_mb
 --------------+------------+-----------------+-----------+----------+----------
  members_2014 | public     | current_member  |        92 |    16.98 |   18.547
  members_2014 | public     | member_response |        87 |    17.46 |   20.000
  members_2014 | public     | archive_member  |        84 |    35.16 |   41.734
  members_2014 | public     | survey          |        57 |    28.59 |   50.188

pct_bloat is how much of the table (0 to 100) is estimated to be dead space.  mb_bloat is how many megabytes of bloat are estimated to exist.  table_mb is the actual size of the table in megabytes.

The suggested criteria are to list tables which are either more than 50% bloat and bigger than 10MB, or more than 25% bloat and bigger than 1GB.  However, you should calibrate this according to your own database.
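If you want to apply those criteria mechanically, here's a hypothetical sketch, assuming you've saved the bloat query's output into a view called table_bloat (that view name is mine, not part of pgx_scripts) and reading "bigger than" as the table size:

    -- hypothetical filter using the output columns shown above
    SELECT databasename, schemaname, tablename, pct_bloat, mb_bloat
    FROM table_bloat
    WHERE ( pct_bloat > 50 AND table_mb > 10 )
       OR ( pct_bloat > 25 AND table_mb > 1024 )
    ORDER BY mb_bloat DESC;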

Monday, September 29, 2014

Why you need to avoid Linux Kernel 3.2

In fact, you really need to avoid every kernel between 3.0 and 3.8.  While RHEL has been sticking to the 2.6 kernels (which have their own issues, but not as bad as this), Ubuntu has released various 3.X kernels for 12.04.  Why is this an issue?  Well, let me give you two pictures.

Here's a private benchmark workload running against PostgreSQL 9.3 on Ubuntu 12.04 with kernel 3.2.0.  This is the IO utilization graph for the replica database, running a read-only workload:



Sorry for cropping; had to eliminate some proprietary information.  The X-axis is time. This graph shows MB/s data transfers -- in this case, reads -- for the main database disk array.  As you can see, it goes from 150MB/s to over 300MB/s.  If this wasn't an SSD array, this machine would have fallen over. 

Then we upgraded it to kernel 3.13.0, and ran the same exact workload as the previous test.  Here's the new graph:



Bit of a difference, eh?  Now we're between 40 and 60MB/s for the exact same workload: an 80% reduction in IO.   We can thank the smart folks in the Linux FS/MM group for hammering down a whole slew of performance issues.

So, check your Postgres servers and make sure you're not running a bad kernel!