Monday, July 28, 2014

Improved TOAST corruption function

Recently I had reason to try to locate TOAST corruption on an older server (running 8.4.0), and found that the prior script I posted didn't always find the corruption.  As such, below is an improved function, which can easily be changed to a DO statement or whatever you want.

Note that TOAST corruption isn't a common thing; in fact, if you've kept up with your PostgreSQL updates, like the update released last Thursday, you should never see it.  The bug behind this particular corruption was fixed in 8.4.3, but the server's owner never applied that update.

Oh, also note that version 8.4 is now EOL.  If you haven't upgraded already, you should have a schedule for doing so.

This function assumes that you have a primary key column which is a SERIAL.  It would need to be altered for other designs.  It also assumes the postgres user can write to /tmp/testout.  See the original posts for how to use this.

create or replace function find_bad_toast2 (
   tablename text,
   pk_col text
)
returns text
language plpgsql
as
$f$
declare
   curid BIGINT := 0;  -- progress counter
   badid BIGINT;       -- current primary key value
begin
-- walk the table one primary key value at a time
FOR badid IN EXECUTE 'SELECT ' || pk_col || ' FROM ' || tablename LOOP
   curid := curid + 1;
   if curid % 100000 = 0 then
       raise notice '% rows inspected', curid;
   end if;
   begin
       -- COPYing the row forces full detoasting of every column;
       -- corrupt TOAST data raises an error, which we trap below
       EXECUTE 'COPY ( SELECT * FROM ' || tablename || ' WHERE ' ||
            pk_col || ' = ' || cast(badid as text) || ') TO ''/tmp/testout'';';
   exception
       when others then
           raise notice 'data for id % is corrupt', badid;
           continue;
   end;
end loop;
return 'done checking all rows';
end;
$f$;
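
Usage is the same as in the original posts; for example, against a hypothetical table orders with a SERIAL primary key column order_id (both names are illustrative, substitute your own):

-- table and column names here are placeholders
SELECT find_bad_toast2('orders', 'order_id');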

Friday, June 13, 2014

Pending Portland Postgres Presentations

This year is apparently my year to spend most of my time in Portland.  Maybe I should move there ;-)  If you are in Portland, or are attending conferences there, please read on for a list of PostgreSQL-related events.

O'Reilly OSCON 2014: on Monday, July 21st, I will be presenting The Accidental DBA again at OSCON.  O'Reilly has asked me to reprise this tutorial because it sold out last year.  This tutorial covers the care and feeding of a PostgreSQL server for people whose main job is something else, or for those new to database administration.  It will include multiple hands-on exercises using Vagrant on your own laptop.  The exercises are on github, but will be updated on or before July 10th in order to cover 9.3 and feedback from the last talk.

Note that there will also be a special PDXPUG meeting or BOF at OSCON; details TBD.  I will also be doing an Office Hours session at OSCON on July 22nd at 3:20 PM; please bring your questions about PostgreSQL scalability and replication.

I am likely to be speaking at DjangoCon.us in September.  As the Call for Presentations is still open, I don't yet know what the talk will be about, nor do I have confirmation that I will, in fact, be speaking.

Right between DjangoCon.us and FOSS4G, PDXPUG will be hosting a full PostgreSQL Day at Portland State University on September 6.  This event is free, so if you are going to either DjangoCon or FOSS4G, extend your stay in Portland in order to attend!  RSVP required.

People are still submitting presentations for that; if you're interested in speaking, contact Mark per the wiki page.  I expect to be talking about 9.4, replication, or maybe Postgres on Amazon.

Finally, at FOSS4G, I will be doing a full day of Postgres/PostGIS workshops.  In the morning I will do Accidental DBA again, this time tailored for PostGIS administrators.  In the afternoon, I will do a hands-on "learning to do PostGIS replication" tutorial.  This does mean that I will likely not include much about replication in the morning session.  Both of these workshops are "bring your own device", which means that you will need to install the tutorial materials on your laptop in advance.

I will be spending the first half of September in Portland, so if you or your company is interested in onsite training or consulting during September, please get in touch with PGX.

Tuesday, June 3, 2014

Upcoming SFPUG Live Streams: Crontabber and JSONB

Apparently June is the month for hackers named Peter with difficult-to-spell last names.  Join us for two live streams of two different Peters giving two different PostgreSQL presentations.

First, on June 10th, we'll have Peter Bengtsson of Mozilla presenting Crontabber, a network-wide scheduled job manager written in Python and PostgreSQL.  Tune in on Air Mozilla at around 7:15pm PDT to watch this.

Second, on June 23rd, Peter Geoghegan of Heroku will do a "JSONB Deep Dive" and explain the features and internals of the new JSONB type.   Subscribe to the Google Event to be linked into the Google Hangout when this goes live around 7:15PM PDT.

See you there!


Sunday, May 25, 2014

9.4 Theme Contest Analyzed by 9.4

So a couple weeks ago I ran a little contest to see who could come up with a slogan for PostgreSQL 9.4.  Surprisingly, we got over 300 votes on various slogans, which means I need to do some statistics to analyze them -- which means I'm going to show off some of PostgreSQL's new 9.4 features as part of that!

Version 9.4 includes a number of new aggregate, array and set operations which make it vastly easier and faster to do statistical summaries and analysis.  Most of these were contributed by Andrew Gierth, including the two I'm going to use below, FILTER and WITHIN GROUP.  I'm also going to use MATERIALIZED VIEWS, developed by Kevin Grittner.  First, though, I need to import the data.  So I downloaded the survey results as a CSV, and created a table for them in PostgreSQL and loaded it up:

CREATE TABLE raw_survey (
    ts       timestamptz,
    prf      integer,
    moreways integer,
    devops   integer,
    moresql  integer,
    yesql    integer
);

\copy raw_survey from 'slogans.csv' with csv header

Now, Google's column-per-question format isn't very friendly to analysis and comparison; I want a more vertical orientation.  So I created one as a MatView.  This means that if I reimport the data in the future, or weed out obvious ballot-box stuffing, I just need to refresh it:

CREATE MATERIALIZED VIEW slogans AS
SELECT 'Performance, Reliability, Flexibility' as slogan, prf as vote
FROM raw_survey
UNION ALL
SELECT 'More Ways to Database', moreways
FROM raw_survey
UNION ALL
SELECT 'A Database for Dev and Ops', devops
FROM raw_survey
UNION ALL
SELECT 'More Than SQL', moresql
FROM raw_survey
UNION ALL
SELECT 'NoSQL + YeSQL = PostgreSQL', yesql
FROM raw_survey;
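
That refresh, by the way, is a single statement:

-- rebuild the matview contents from raw_survey
REFRESH MATERIALIZED VIEW slogans;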


Now, for some statistics.  A total or an average is easy to compute, but neither is statistically sound; a median is a much better summary statistic.  I also want to know the balance of people who hated a slogan (rating 1) vs. loved it and put it first (rating 5).  So, some of the new aggregates.

In the past, I've retrieved medians either by using SciPy inside PL/Python, or by doing elaborate calculations on windowing rank.  No more.  Now I can do a simple one-line median using WITHIN GROUP.  WITHIN GROUP is a lot like a windowing aggregate, except that it returns a single summary value per group.  Version 9.4 ships with several such aggregates, including percentile_cont(), which gives you the value at a stated fraction of a sorted group: in this case, 0.5 to get the median.  Like so:

SELECT slogan,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;


                 slogan                 | percentile_cont
----------------------------------------+-----------------
 A Database for Dev and Ops             |               3
 More Than SQL                          |               3
 More Ways to Database                  |               3
 NoSQL + YeSQL = PostgreSQL             |               3
 Performance, Reliability, Flexibility  |               4
(5 rows)

"Performance, Reliability, Flexibility" is taking a clear lead here.  Incidentally, percentile_cont() can take an array of values in order to give you a full box (remember, every time you say "big data" without drawing a box plot, God kills a kitten):

SELECT slogan,
    percentile_cont(ARRAY[0.1,0.25,0.5,0.75,0.9]) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;


                 slogan                 | percentile_cont
----------------------------------------+-----------------
 A Database for Dev and Ops             | {1,2,3,3,4}
 More Than SQL                          | {1.4,2,3,4,5}
 More Ways to Database                  | {1,2,3,4,5}
 NoSQL + YeSQL = PostgreSQL             | {1,1,3,4,5}
 Performance, Reliability, Flexibility  | {2,3,4,5,5}
(5 rows)

Let's check our "loves" and "hates" to see if they tell us anything different.  Now, the old way to do this would be:

SELECT slogan,
    sum(CASE WHEN vote = 1 THEN 1 ELSE 0 END) as hates,
    sum(CASE WHEN vote = 5 THEN 1 ELSE 0 END) as loves
FROM slogans
GROUP BY slogan;


Awkward, neh?  Well, no more, thanks to the FILTER clause:

SELECT slogan,
    count(*) FILTER ( WHERE vote = 1 ) as hates,
    count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;


Isn't that way more intuitive and readable?  I think it is, anyway.  So, let's put it all together:

SELECT slogan,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY vote) as median,
    count(*) FILTER ( WHERE vote = 1 ) as hates,
    count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;


And the results:


                 slogan                 | median | hates | loves
----------------------------------------+--------+-------+-------
 A Database for Dev and Ops             |      3 |    47 |    21
 More Than SQL                          |      3 |    32 |    58
 More Ways to Database                  |      3 |    39 |    55
 NoSQL + YeSQL = PostgreSQL             |      3 |    81 |    58
 Performance, Reliability, Flexibility  |      4 |    11 |   118
(5 rows)

And there we have it: "Performance, Reliability, Flexibility" is the winning theme idea for 9.4.  It wins on median, and on hates vs. loves counts.

Congratulations, Korry Douglas; I'll contact you about shipping your Chelnik.  Note that the theme will be workshopped a little to fit the structure of the final 9.4 release announcement (i.e. we may change it slightly to match the sections of the actual press release), but we're going with that general idea now.

Thursday, May 15, 2014

Help us choose an advocacy theme for PostgreSQL 9.4

Every year, for each PostgreSQL release, I have a "theme" which decides our graphics and presentation themes for promoting that version of PostgreSQL.   In the past, the themes have generally been my personal ideas, but this year we're putting it out to our greater community.

Five potential theme ideas have been selected from about 100 which were suggested on the pgsql-advocacy mailing list.  Now we need you to rate them, in order to decide which one we go with ... and who wins a Chelnik from Mark Wong!

Please vote on the basis of selecting a good theme/slogan for PostgreSQL 9.4 specifically, rather than just what sounds like the coolest phrase.

So, vote!

9.4 Beta, Postgres-XL, and pgCon Events

So, in case you somehow missed it, the PostgreSQL 9.4 Beta 1 is out.  Yaay!  Here's what I have to say about that:

libdata=# select title, 
    bookdata #> '{"publication_info", 0, "isbn"}' as isbn
from booksdata 
where bookdata @> '{ "publication_info" : [{"publisher": "Avon"} ] }'::jsonb 
order by bookdata #> '{"publication_info", 0, "price"}' DESC;

                 title                |    isbn    
--------------------------------------+-----------------
 The Bewitched Viking                 | "0-06-201900-7"
 When a Scot Loves a Lady             | "0-06-213120-6"
 Eternal Prey                         | "0-06-201895-7"
 My Irresistible Earl                 | "0-06-173396-2"
...
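
Incidentally, the @> containment test in that query is indexable in 9.4: a GIN index on the jsonb column lets containment searches skip the full table scan.  A minimal sketch (the index name is mine):

-- GIN index over the whole jsonb column; supports @> containment queries
CREATE INDEX booksdata_bookdata_idx ON booksdata USING gin (bookdata);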


Download the beta now and test it out!  Break it!  Tell us how you broke it!  It's a beta, and it's up to you to make sure that the final release is as robust as possible.

Speaking of betas, there's a new open-source big data option on the block: Postgres-XL.  This is a fork of PostgresXC which supposedly resolves the blockers that have kept PostgresXC from being ready for production use.  I look forward to trying it out when I get a chance.

Finally, I wanted to remind everyone about the Clustering Summit, the PostgresXC Pizza Demo, and the Unconference at pgCon next week.  In particular, I still need two assistants to help me with the unconference.  Email me at josh-at-postgresql.org if you're available to help with setup.

Monday, May 12, 2014

cstore_fdw and big data

About a month ago, PostgreSQL fork vendor and data warehousing company CitusDB announced the availability of the open-source cstore_fdw.  This foreign data wrapper lets you create foreign tables of highly compressed, column-oriented data, which allows you to keep large amounts of archival data on your PostgreSQL server.
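
Setup is short; here's a rough sketch based on the cstore_fdw README, with the table definition abbreviated to the two columns used in the queries below:

-- one-time setup: load the extension and define a server
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- a compressed, column-oriented foreign table (most columns omitted here)
CREATE FOREIGN TABLE c_pglog (
    log_time     timestamptz,
    command_tag  text
)
SERVER cstore_server
OPTIONS (compression 'pglz');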

You can find out more by tuning in tomorrow night, May 13th, around 7:15PM PDT.  Tomorrow night's event will be sponsored by CitusDB and hosted by Rackspace.

First, the good stuff: compression:

phc=# select pg_size_pretty(pg_total_relation_size('postgres_log'));
 pg_size_pretty
----------------
 28 GB 


ls -lh $PGDATA/base

-rw------- 1 postgres postgres 3.2G May 12 13:37 pglog.cstore
-rw------- 1 postgres postgres  12K May 12 13:37 pglog.cstore.footer


So, the "postgres_log" table from this Performance Health Check database, which has 15 million records, takes up 28GB in postgres, and 3.2GB as a cstore table ... a space savings of about 89%.  Not bad.  Especially if you consider that the cstore table already has skip indexes on all indexable columns.

Now, where this space savings becomes a real benefit is if the cstore table fits in memory and the Postgres table doesn't.   I don't have a case like that, although the cstore still does show performance benefits if you have a wide table and you don't need all columns:

phc=# select count(1) from postgres_log where command_tag = 'UPDATE';
 count 
--------
 986390
(1 row)

Time: 23746.476 ms
phc=# select count(1) from c_pglog where command_tag = 'UPDATE';
 count 
--------
 986390
(1 row)

Time: 14059.405 ms


And even better if you can apply a relatively restrictive filter:

phc=# select count(1) from postgres_log where command_tag = 'UPDATE' and log_time BETWEEN '2014-04-16 07:15:00' and '2014-04-16 07:20:00';
 count
-------
 84982
(1 row)

Time: 19653.746 ms


phc=# select count(1) from c_pglog where command_tag = 'UPDATE' and log_time BETWEEN '2014-04-16 07:15:00' and '2014-04-16 07:20:00';
 count
-------
 84982
(1 row)

Time: 2260.891 ms


One limitation is that currently, with FDWs not able to cleanly push down aggregation to the foreign data wrapper, the actual aggregation is still done on the postgres side.  This means that large aggregates are about the same speed on cstore_fdw as they are for PostgreSQL tables:

phc=# select round((sum(duration)/1000)::numeric,2) from statements where command_tag = 'UPDATE';
 round 
--------
 444.94
(1 row)

Time: 2920.640 ms
phc=# select round((sum(duration)/1000)::numeric,2) from c_statements where command_tag = 'UPDATE';
 round 
--------
 444.94
(1 row)

Time: 3232.986 ms


The project plans to fix this, but until then, cstore_fdw is useful mainly for searches across really large/wide tables, or for seldom-touched archive tables where you want to save yourself GB or TB of disk space.

There are a bunch of other features, and a bunch of other limitations; tune in to the SFPUG event to learn more.