Friday, September 28, 2012

Freezing Your Tuples Off, Part 1

Given that it's currently around 50F, damp and foggy in San Francisco (so much for Indian Summer), my thoughts turn to freezing.   Specifically, freezing transaction IDs.  There's three settings associated with this behavior in postgresql.conf, which I'll be discussing below:
  •     autovacuum_freeze_max_age
  •     vacuum_freeze_min_age
  •     vacuum_freeze_table_age
At the least, you'll need to understand freezing and what these options do.  Many users will also want to change them away from the default values.

The Need To Freeze

Every write transaction or standalone statement in PostgreSQL gets assigned a unique transaction ID (or XID).  This XID determines the precedence visibility, determining which other concurrent users can see row changes associated with the transaction.  Read-only transactions do not get assigned an XID.  These XIDs get stamped on each row in storage, in the form of xmin and xmax, denoting the minimum and maximum concurrent transactions which can "see" the row.  By putting these in the row headers, we decentralize transaction conflict resolution, which supports PostgreSQL's very high concurrency capabilities.

However, there's a problem with XIDs: they're a four-byte integer.  This means that after 2.1 billion transactions, they need to wrap around.  And by the time the XID counter wraps around, the old XIDs need to be gone from all of the old row headers.  This means removing the XID from xmin and xmax and replacing it with the special RelFrozenXID value, indicating a transaction which is so old everyone can see it.

Now, 2.1 billion transactions seems like a lot, but it's not really.  If you're doing just 100 write transactions per second (average), you'll go through that in about 10 months.  This means that we need to be constantly cleaning old data pages of old XIDs to be ready for wraparound.  And, indeed, PostgreSQL is prepared to do this, through the autovacuum facility.


The first autovacuum setting is autovacuum_freeze_max_age, which is an absolute ceiling on how long autovacuum will let you go before it kicks in and starts exhaustively vacuum freezing the old XIDs on your tables with old rows.  Hitting this threshold isn't a great thing, because it can generate lots of IO at unexpected times, and freeze autovacuums aren't cancellable (if you cancel them, they come right back).

You can see how close you are to autovacuum freeze anywhere with this query:

select max(age(datfrozenxid)) from pg_database;

However, you don't generally really care about that; what you care about is "how close am I to forced autovacuum freeze on a large table?"  Here's that:

SELECT relname, age(relfrozenxid) as xid_age,
    pg_size_pretty(pg_table_size(oid)) as table_size
FROM pg_class
WHERE relkind = 'r' and pg_table_size(oid) > 1073741824
ORDER BY age(relfrozenxid) DESC LIMIT 20;

More specifically, that says "give me the top 20 tables over 1GB, sorted by the age of their oldest XID".  Results look something like this:

        relname         |  xid_age  | table_size
 postgres_log           | 199785216 | 12 GB
 statements             |   4551790 | 1271 MB
 normal_statement_times |        31 | 12 GB

Now, that above is trouble brewing.  We see that postgres_log is at an XID age of 199 million, which is awfully close to the default autovacuum_freeze_max_age of 200 million.  And having autovac decide to read and rewrite an entire 12GB table, possibly at peak usage times, can cause unexpected poor application performance.

One thing we can do is raise autovacuum_freeze_max_age.  The default, 200m, is very conservative; it's only 10% of our shut-down-the-database threshold.  On high-transaction-volume databases, I generally raise it to 1 billion, which is still only the 50% mark.

However, that's just putting off the problem.  What we should do is manually freeze the table at at time which is convenient for the application, i.e. during a low activity period or a scheduled downtime.  Then we run:

VACUUM FREEZE postgres_log;

Even better, we'd like to prevent ourselves from getting that close to the threshold in the first place.  How do we do that?  Well, there's some other tuning parameters. 

To be continued in Part 2 ...

Monday, September 10, 2012

PostgreSQL 9.2 out the gate

And ... we're off!  PostgreSQL 9.2.0 is finally out, and you can now deploy it for production applications.  So lemme talk a little about one of my favorite features from the release, as well as linking you to PostgreSQL 9.2 clothing (yes, really, see bottom of post).



Back in version 8.4, Itagaki Takahiro introduced a new Extension called pg_stat_statements.   This extension creates a dynamic system view which shows you recently executed queries and aggregates both counts and total time to execute.  The idea was to give you rapid feedback on which of your queries where "hot" and consuming a lot of system resources, instead of forcing you to wait for logging and log processing and get information the next day.

However, pg_stat_statements suffered a major limitation; unlike log processors like pgfouine and pgbadger, pg_stat_statements could only aggregate queries with the same structure and different constants if they were called using the PostgreSQL prepared plans interface.   This meant that for the majority of applications (everything except Java, really), queries were not aggregated at all, making pg_stat_statements not that useful in practice.

Peter Geoghegan decided to fix that in this release (funded by the folks at, using the query plan for each query to "normalize" out the constants so that identical queries could he aggregated, regardless of how they were sent to the database.  Suddenly, that means I can depend a lot less on log processing in order to tell clients what their worst queries are, and I can even troubleshoot potential bad queries in a new application deployment within minutes of it going live.

We're going to be installing pg_stat_statements on most of our clients who upgrade to 9.2, and apparently Heroku is going to make it available by default.

Peter went a bit further, though, so pg_stat_statements also now collects cache hit/dirty and IO timing information.   This gives us ability to diagnose issues with bad application queries which we couldn't do before, even with the delay of log processing.

So the new pg_stat_statements looks something like this:

userid              | 10
dbid                | 18018
query               | SELECT tbalance - sum(delta) FROM pgbench_tellers JOIN pgbench_history USING (tid) WHERE pg
bench_tellers.tid = ? group by tid, tbalance;
calls               | 158207
total_time          | 5999651.60500004
rows                | 158084
shared_blks_hit     | 80551793
shared_blks_read    | 2
shared_blks_dirtied | 0
shared_blks_written | 0
local_blks_hit      | 0
local_blks_read     | 0
local_blks_dirtied  | 0
local_blks_written  | 0
temp_blks_read      | 0
temp_blks_written   | 0
blk_read_time       | 169.666
blk_write_time      | 0

Notice all the new columns.  blk_read_time is particularly interesting; it shows that we're doing logical IO to access the records we need.  Lemme see which queries are doing a lot of logical IO relative to the amount of data they need:

SELECT query, total_time, 
    blk_read_time / calls as read_time_query, 
    blk_read_time / rows as read_time_row
FROM pg_stat_statements 
WHERE calls > 0 AND rows > 0 
ORDER BY blk_read_time / calls desc LIMIT 5;

Out of those results, this one is not a surprise:

query           | select count(*) from pgbench_accounts;
total_time      | 33765.04
read_time_query | 7934.87833333333
read_time_row   | 7934.87833333333

But this one is:

query           | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
total_time      | 18529.712
read_time_query | 6.67768562644125
read_time_row   | 6.67768562644125

From that, it seems like we're often updating rows which aren't in cache, and paying a heavy price for it.


Buy 9.2 Shirts!


To promote/celebrate/cash in on the 9.2 release, I've posted tshirts and hoodies on the PostgreSQL Zazzle store.  These feature an original design by cartoonist Chris Lowrance.  15% of the price of these items goes towards buying swag for PostgreSQL advocacy volunteers.  So, get one now!

Sunday, September 9, 2012

SFPUG Streaming: PostgreSQL 9.2

This Tuesday (7:15 PM PDT), we'll be broadcasting the SFPUG meeting live on JustinTV.   I'll be presenting "PostgreSQL 9.2: Full Throttle Database", showing off all the features of the new version which is coming out very soon.  If you can't make it to the meeting in person, then tune in and watch!

If you can make it to the meeting (RSVP here), or at least join the waiting list, then we've got a fun and exciting meeting lined up for you.  The meeting is hosted at the new Disqus offices, and we'll have beer and pizza (sponsored by, and cake (sponsored by EngineYard). 

We'll also have a fundraising auction with some of the best auction items ever, so bring your wallet and/or checkbook (checks, cash preferred):
Mark Wong's hand-crocheted blue elephant (one of these went for $500 at pgCon)

Linux skateboard, courtesy of The Linux Foundation.