Thursday, October 3, 2013

First thing you do is take a backup

Last week seems to have been a week for "pilot error": not just on Southwest, but among our clientele at PGX.  Three customers in two days managed to corrupt their databases through sysadmin mistakes, such as remounting the SAN with PostgreSQL running.  Which reminded me of an important tip for what you should do whenever anything goes really wrong with your database server:
  1. Shut down PostgreSQL if it's running.
  2. Check if you have current valid backups and/or replicas which are not already corrupt. If so, make sure these are protected.
  3. If there are no backups, make a full copy of the database files and all transaction logs.
You'll be tempted to do everything you can to get the main production server up and running as soon as possible, rather than waiting for backups, especially if your database is quite large.  Resist that temptation.
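To make step 3 concrete, here's a minimal sketch of that emergency file-level copy, in Python.  The data directory and destination paths are assumptions (adjust them for your installation), and it assumes pg_ctl is on your PATH; the only important ideas are that the server is stopped first and that the copy goes to different, known-good storage.

```python
import shutil
import subprocess
from datetime import datetime
from pathlib import Path

# These paths are assumptions -- adjust them for your own installation.
PGDATA = Path("/var/lib/postgresql/9.3/main")  # cluster data directory
DEST = Path("/mnt/rescue")                     # different, known-good storage

def emergency_copy():
    # Step 1: make sure PostgreSQL is not running before touching the files.
    try:
        subprocess.check_call(["pg_ctl", "stop", "-D", str(PGDATA), "-m", "fast"])
    except subprocess.CalledProcessError:
        pass  # already stopped (or not responding) -- either way, don't start it

    # Step 3: copy the whole data directory, transaction logs included, so that
    # nothing you try later can destroy the only copy you have.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = DEST / ("pgdata-copy-" + stamp)
    shutil.copytree(str(PGDATA), str(target), symlinks=True)
    # Note: if pg_xlog (or a tablespace) is a symlink to another volume,
    # symlinks=True copies only the link -- copy that directory separately too.
    print("Copied cluster to", target)

if __name__ == "__main__":
    emergency_copy()
```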

Many things you will try to rescue a corrupt database or a damaged server can cause further destruction of data.  Worse, certain kinds of corruption (caused by bad storage or bad RAM, for example) will spread to the replicas and/or your backups given time.  In a worst case, you can fiddle with your damaged DB server long enough that the regularly scheduled backup kicks in ... and overwrites a good backup with a corrupt one.

Of course, maintaining a few days or weeks of rotational backups is an even better idea, and protects you against many forms of data loss.  I recommend Barman or pgRollDump for this purpose.
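Barman does this properly, but as a rough illustration of the rotation idea, here's a hedged sketch that takes a nightly base backup with pg_basebackup and prunes anything older than a week.  The paths, the role name, and the retention window are all assumptions, and it presumes pg_basebackup can make a replication connection as that role.

```python
import shutil
import subprocess
import time
from datetime import datetime
from pathlib import Path

# All of these values are assumptions -- adjust to suit your environment.
BACKUP_ROOT = Path("/srv/pg_backups")
KEEP_DAYS = 7

def nightly_backup():
    BACKUP_ROOT.mkdir(parents=True, exist_ok=True)
    target = BACKUP_ROOT / datetime.now().strftime("base-%Y%m%d")

    # Full base backup, with the transaction logs needed to restore it.
    subprocess.check_call([
        "pg_basebackup",
        "-D", str(target),  # destination (must be empty or nonexistent)
        "-F", "tar",        # one tar file per tablespace
        "-X", "fetch",      # include the required WAL in the backup
        "-U", "postgres",   # assumed role with replication privileges
    ])

    # Rotate: drop anything older than the retention window, but only
    # after the new backup has completed successfully.
    cutoff = time.time() - KEEP_DAYS * 86400
    for old in BACKUP_ROOT.glob("base-*"):
        if old.stat().st_mtime < cutoff:
            shutil.rmtree(str(old))

if __name__ == "__main__":
    nightly_backup()
```

Run nightly from cron, the date-stamped directories guard against exactly the scenario above: a freshly taken corrupt backup never overwrites last night's good one.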

1 comment:

  1. Yep, really any system and any systems administrator should have automatic rotating backups of everything, and any sysadmin worth his salt should be taking a backup before performing maintenance, repairs, upgrades, etc.  I know that for many sysadmins, after doing this job for a long time, this can seem trivial when performing something they've done 1000 times flawlessly before.

    But a nightmare just takes one time.
