Monday, May 12, 2014

cstore_fdw and big data

About a month ago, PostgreSQL fork vendor and data warehousing company CitusDB announced the availability of the open-source cstore_fdw.  This foreign data wrapper creates a foreign table backed by highly compressed columnar storage, which allows you to keep large amounts of archival data on your PostgreSQL server.
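For context, creating a cstore table follows the usual FDW pattern.  This is a minimal sketch based on the cstore_fdw README; the column list is abbreviated and illustrative:

```sql
-- One-time setup: load the extension and define a server for it.
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- An illustrative archive table; the real column list is much longer.
CREATE FOREIGN TABLE c_pglog (
    log_time     timestamptz,
    command_tag  text
) SERVER cstore_server
OPTIONS (compression 'pglz');   -- compression defaults to 'none'

-- Data is then loaded with COPY, e.g.:
-- COPY c_pglog FROM '/path/to/pglog.csv' WITH CSV;
```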

You can find out more by tuning in tomorrow night, May 13th, around 7:15PM PDT.  Tomorrow night's event will be sponsored by CitusDB and hosted by Rackspace.

First, the good stuff: compression:

phc=# select pg_size_pretty(pg_total_relation_size('postgres_log'));
 pg_size_pretty
----------------
 28 GB 


ls -lh $PGDATA/base

-rw------- 1 postgres postgres 3.2G May 12 13:37 pglog.cstore
-rw------- 1 postgres postgres  12K May 12 13:37 pglog.cstore.footer


So, the "postgres_log" table from this Performance Health Check database, which has 15 million records, takes up 28GB in Postgres and 3.2GB as a cstore table ... a space savings of about 89%.  Not bad.  Especially if you consider that the cstore table already has skip indexes on all indexable columns.
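As a quick sanity check on that figure (Python used here purely as a calculator):

```python
# 28 GB as a regular Postgres table vs. 3.2 GB as a cstore table.
postgres_gb = 28.0
cstore_gb = 3.2
savings = 1 - cstore_gb / postgres_gb
print(f"space savings: {savings:.0%}")  # -> space savings: 89%
```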

Now, where this space savings becomes a real benefit is when the cstore table fits in memory and the Postgres table doesn't.  I don't have a case like that here, but the cstore table still shows performance benefits if you have a wide table and don't need all of its columns:

phc=# select count(1) from postgres_log where command_tag = 'UPDATE';
 count 
--------
 986390
(1 row)

Time: 23746.476 ms
phc=# select count(1) from c_pglog where command_tag = 'UPDATE';
 count 
--------
 986390
(1 row)

Time: 14059.405 ms


And it's even better if you can apply a relatively restrictive filter, which lets the skip indexes do their work:

phc=# select count(1) from postgres_log where command_tag = 'UPDATE' and log_time BETWEEN '2014-04-16 07:15:00' and '2014-04-16 07:20:00';
 count
-------
 84982
(1 row)

Time: 19653.746 ms


phc=# select count(1) from c_pglog where command_tag = 'UPDATE' and log_time BETWEEN '2014-04-16 07:15:00' and '2014-04-16 07:20:00';
 count
-------
 84982
(1 row)

Time: 2260.891 ms


One limitation is that PostgreSQL currently cannot cleanly push aggregation down to a foreign data wrapper, so the actual aggregation is still done on the Postgres side.  This means that large aggregates are about the same speed on cstore_fdw as they are for PostgreSQL tables:

phc=# select round((sum(duration)/1000)::numeric,2) from statements where command_tag = 'UPDATE';
 round 
--------
 444.94
(1 row)

Time: 2920.640 ms
phc=# select round((sum(duration)/1000)::numeric,2) from c_statements where command_tag = 'UPDATE';
 round 
--------
 444.94
(1 row)

Time: 3232.986 ms


The project plans to fix this, but until then, cstore_fdw is useful mainly for searches across really large or wide tables, or for seldom-touched archive tables where you want to save yourself gigabytes or terabytes of disk space.

There are a bunch of other features, and a bunch of other limitations; tune in to the SFPUG event to learn more.
