Friday, October 4, 2013

De-corrupting TOAST Tables

In a database which has been mildly corrupted, the second issue you're liable to run into is corrupt TOAST tables.  You'll know you have one because you'll be running a SELECT against the table or a pg_dump, and you'll get an error message like this one:

ERROR:  unexpected chunk size 226 (expected 175) in final chunk 1 for toast value 8846932 in pg_toast_2877446 

TOAST tables are where PostgreSQL stores offline compressed data for large values, like big text fields, geometries, large arrays, and BLOBs.   What the server is telling you in that error message is that the size of the compressed block in the TOAST table doesn't match its metadata, and as a result Postgres can't decompress it.

Among other things, this is generally an indication that either you've found a PostgreSQL bug (which you should report, if you're running on the latest update version), or you've had a serious corruption event, if you didn't know already.  So you should definitely do a full corruption check on the database.

But, what to do about the bad TOAST record?

Obviously, if you have a backup from just before the corruption occurred the answer is to restore that backup.  But if you don't: this is fortunately hand-correctable.

The first thing to do is to find the actually row or rows where the corruption is.  I do that with a simple DO script like this:

DO $f$
    curid INT := 0;
    vcontent TEXT;
    badid INT;
FOR badid IN SELECT message_id FROM messages LOOP
    curid = curid + 1;
    if curid % 100000 = 0 then
        raise notice '% rows inspected', curid;
    end if;
        SELECT contents
        INTO vcontent
        FROM messages where message_id = badid;

        vcontent := substr(vcontent,1000,2000);
        when others then
            raise notice 'data for message % is corrupt', badid;
end loop;

The goal here is to reach into the TOASTed value for each row (the large-text contents field) and de-compress it, which should raise any errors you're going to get.  Of course, you need to tailor the above for your own tables.

Once you've identified one or more rows, there are 3 steps you need to take to clean up the corrupted TOAST record.  Note that this does not involve recovering the corrupted value: that's already lost, and you'll do yourself a favor by not trying to hang onto it.  Get that from your backups.

Once you've located the row or rows which have corrupt TOAST data, you need to do three things to make sure that the DB is clean:
  1. Re-index the TOAST table (in case it's the index which is corrupt)
  2. Break the link between the live row and the corrupt TOAST record.
  3. Rebuild the regular table in order to expunge the bad TOAST pointer.
All TOAST tables are located in the pg_toast schema.  And the error message already gave us the name of the bad TOAST table.  So we reindex by:

REINDEX TABLE pg_toast.pg_toast_2877446;

Breaking the link is the most obvious step:

UPDATE messages SET contents = '' WHERE message_id = 432212;

By setting the TOASTED column to a new value, we make sure that the "current" version of the row does not link to the bad TOAST entry.  I did say we'd be destroying data here, yes?

But that's not all we have to do.  The old row version still exists, and will continue to produce errors every time VACUUM visits it (preventing VACUUMs of that table from completing).  So we need to build a new version of the table which does not include that old row version.

Fortunately, there is a utility to do this: the super-awesome pg_repack, by Daniele Varrazzo, Josh Kupershmidt and Itagaki Takahiro. This utility will save your life someday, so buy Daniele, Josh and/or Itagaki a drink of their choice if you see them.

pg_repack -n -t messages messagedb

Now, while pg_repack does not need an exclusive lock for most of the rebuild, it does need three things, which you need to be prepared for:
  1. enough disk space to make a full copy of the table and its indexes.
  2. a short exclusive lock at the very end of processing, including on related tables by FK.
  3. to wait for all long-running transactions to complete
However, once you've done this, you should have a clean table free of TOAST corruption.

Wait, though, what's this?

ERROR:  could not create unique index "messages_pkey"
DETAIL:  Key (message_id)=(795437) is duplicated.

Oh-oh.  Looks like we have more work to do.  See you Monday!


  1. How do i do to run that 'do' script? Im very new at this...

    1. On the psql command line, like a query.

      DO scripts are a way of running a one-time stored procedure.

  2. Josh, would it be possible to eliminate the corruption (understood, there is data loss) with zero_damaged_pages=on and vacuuming the table? Instead of rebuilding the table with pg_repack. Thanks!

    1. As a note on this, zero damaged pages deletes all information on a page when that page is damaged (i.e. database rows cannot be read out of it because of actual corruption errors on the page level).

      Here there are no problems on the page level. The problems are with missing rows in TOAST. To the pages are intact. But rows are missing. So page-level corrections would not be applied. And if they were they'd lose way more info than the approach discussed here.