Thursday, November 12, 2015

PostgreSQL link round-up

First, in case you somehow missed it, PostgreSQL 9.5 Beta 2 is now out.  Time for another round of testing!  There's fewer and fewer bugs found, so we're not far from a final release.  I don't know about anyone else, but we're going into production on Beta 2 in a couple places.  Can't wait any longer for Upsert and RLS.

Second, pgConfSV has announced its keynotes, from CitusData, Pivotal, and -- as a bit of a surprise -- from streaming data thing Kafka (before you ask, I didn't pick the keynotes). I believe registration is still open, so you can still go to the conference next week if you act quickly.

Thirdly, I have a roundup of some cool blogs covering PostgreSQL which aren't on Planet and you may have missed: did a terrific two-part article on why PostgreSQL is the best open source database.  Read it here: Part I  Part II

They also covered using JSON in Postgres with Python.

In stranger news, there's an amazingly strange story about Soylent, PostgreSQL, and auto-responding to poisoning accusations.  Yes, Soylent Inc. uses PostgreSQL, why not?  Read the whole weird tale here on Zapier's blog.

That's it for my round up ... if you have some good links, post them in the comments.

Tuesday, November 10, 2015

Ready to be an Accidental DBA at pgConfSV?

My AccidentalDBA tutorial materials have been updated for pgConfSV next week.  If you are planning to take this tutorial, please read and follow the setup instructions before you get to the conference.

I created this tutorial in 2009 because of the number of people these days who find themselves in charge of a PostgreSQL server ... or many servers ... without any DBA background, training, or desire to be ops.  The tutorial is intended to teach you the minimum you need to keep your PostgreSQL server from falling over so you can get back to whatever your main job is.

You can take it self-paced if you can't make it to pgConfSV.  The tutorial includes all materials and notes.  In this version, I've added the option of a Docker container (preferred), pg_stat_statements, and a restore-to-point-in-time exercise.

If you are going to attend this tutorial, note that it requires advance setup which will be harder to perform once you get to your hotel, and near-impossible if you wait until 10 minutes before it starts.  So read up and install now.

And ... I believe there's still a few registrations left for the tutorial day, so consider signing up.  You have nothing to lose but your fear of database admin!

Monday, August 31, 2015

Lock-polling script for ALTER TABLE

One of PostgreSQL's advantages over many other SQL databases is letting users modify database objects with a minimum of locking and fuss, and do it in a transaction so it can be rolled back.  However, a minimum of locking isn't no locking, so one still has to plan for getting a lock to push through the database migration.

For example, say I wanted to add two new columns to a table.  I already know that by making the column nullable, and not setting a default, I can add them with a metadata-only change which requires only an update to the system catalogs.  However, this table change does require an ACCESS EXCLUSIVE lock for a few milleseconds to go through, which can be tricky.  See, an access exclusive lock blocks everything, including reads and autovacuum, which can be a bit messy if your timing is bad.  You can end up waiting for that lock behind a long-running query, and then all the other database traffic can pile up behind you.

Now, if we had ALTER TABLE CONCURRENTLY it would take care of this for you.  But we don't (yet), so you need to fake it with a DO script.  Here's an example DO script which polls for a lock every 2 seconds until it can get one, then pushes through the table change:

    DO $f$
    DECLARE ntries INT := 10;
        sleepytime INT := 2;

    FOR get_lock IN 1 .. ntries LOOP
                    LOCK TABLE mytable
                    ALTER TABLE mytable ADD COLUMN new_col1 INT,
                            ADD COLUMN new_col2 VARCHAR;
                    RAISE INFO 'table updated';
                    WHEN lock_not_available THEN
                            PERFORM pg_sleep(sleepytime);

    RAISE INFO 'unable to obtain lock after % tries', ntries;


The idea here is that you keep trying to LOCK ... NOWAIT, which will throw and error if it can't get the lock immediately.  Then it sleeps and tries again 2 seconds later.  If you're using 9.3 or later, you can take an even better approach, using lock_timeout:

    DO $f$
    DECLARE ntries INT := 10;
        sleepytime INT := 2;
    SET lock_timeout = '100ms';
    FOR get_lock IN 1 .. ntries LOOP
                    LOCK TABLE mytable IN ACCESS EXCLUSIVE MODE;
                    ALTER TABLE mytable ADD COLUMN a int,
                            ADD COLUMN b INT,
                            ADD COLUMN c INT;
                    RAISE INFO 'table updated';
                    WHEN lock_not_available THEN
                            PERFORM pg_sleep(sleepytime);

    RAISE INFO 'unable to obtain lock after % tries', ntries;


The advantage of this is that it doesn't have to get the lock immediately; it waits 100ms, then gives up on getting the lock.  That helps in situations where the table is never completely free of read queries, without risking a serious pile-up of requests.

You should be able to adapt this approach. to the migrations you actually need to push through.

Friday, August 28, 2015

Stupid Hacks: Dictionary replace function

I write a lot more PL/pgSQL than I'd like to.  Not that I don't like SQL, but as a language PL/pgSQL really shows its thrown-together origin; it's like 80's primitive.  One thing that PL/pgSQL lacks is good string manipulation tools, which is particularly tragic given that the #1 thing to do in PL/pgSQL is to generate queries from parameters and run EXECUTE.

Postgres has two built-in ways to do string substitution: concatenation and format().  Both have drawbacks.  Let me give you an example:

EXECUTE 'SELECT ' || col1 || ', ' || col2 || ' FROM ' || userschema ||
  '.accounts WHERE ' || filterclause || ' ORDER BY ' || col1 || ',' || col2;

EXECUTE format('SELECT %s, %s FROM %s.accounts WHERE %s ORDER BY %s, %s', col1, col2, userschema, filterclause, col1, col2);

You can see the problem here.  Both formats are hard to read and hard to maintain.  Python and Perl have a good fix for this: dictionary/hash-based string substitution, where you can swap in the dictionary keys for the values.  So I wrote up a quick hack to do this in PL/pgSQL.

Here's the dict-replace function.

Using it, you'd replace the above with:

EXECUTE replace_vars('SELECT ${col1}, ${col2} FROM ${userschema}.accounts
  WHERE ${filterclause} ORDER BY ${col1}, ${col2}', vardict);

Of course, you need to first set up the vardict as a JSON value, in the form:

vardict := '{ "col1" : "username", 'col2' : "branchname", ...}'

Still, much more readable, eh?  No refcounting, no repeating variables, no string breaks.  I used Bash's variable substitution syntax of ${var} because it seemed like the thing least likely to conflict with user text, unlike anything involving %.

There's some caveats, though: it's not performant, and you could probably find strings which will break it, particularly if you're swapping in JSON values.  It's not SQL-injection safe, so don't use it for stuff handling user input.  You still have to do your own quoting of strings.  And if you have access to PL/Perl or PL/Python you don't need this nonsense.

But for the minimal case, it should help.

Friday, August 14, 2015

Most of the talks chosen for pgConfSV

Since we're still working on the schedule, I wanted to give folks a preview of the talks we've picked for pgConfSV.  We still have a few more talks to pick and the tutorials to finalize.  Regardless, it's looking like a really exciting lineup! 
We'll have folks from Heap, Wanelo, TreasureData, Rackspace,, EMC, Joyent, Square, and more.   The sad part will be missing three talks in each timeslot.

Note that this set of talks is not final; we're still swapping a couple of things, and some speakers have not confirmed.

pgConfSV is November 17th and 18th in South San Francisco.  If you're going to AnsibleCon, that's the day after, so you can do both!

Wednesday, August 12, 2015

Two Great Tastes That Taste Great Together: cstore + Pipeline

cstore_fdw, the column-store extension for PostgreSQL by CitusData, is a really good way to add compressed storage for archival data, and analytic data intended to be aggregated, to your application.  Because it's a column store, though, cstore wants new data added in batches, the bigger the better.  This means that you need to find some way to batch-up incoming data, preferably one able to accept a lot of new data rapidly. 

This is where PipelineDB comes in.  Since PipelineDB is open source now, and based on 9.4.4, I can add extensions to it, including cstore_fdw.  I've done so with the PipelineDB 0.7.7a Docker container, so if you use Docker it's simply available.

As a demonstration of this, I'll set up some fake clickstream data, archived to a cstore table hourly.  First, I wrote a quick python script to generate it continuously and push it to a Pipeline stream.

Then I created the stream and continuous view in PipelineDB:

    CREATE STREAM seenstream ( user_id int, page_id int, ts timestamptz );

    CREATE CONTINUOUS VIEW seenfeed as select user_id, page_id, ts
    FROM seenstream
    WHERE arrival_timestamp > ( clock_timestamp() - interval '90 minutes' );

Next, I created the cstore table:

    CREATE EXTENSION cstore_fdw;

    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

        user_id int,
        page_id int,
        ts timestamp tz
    SERVER cstore_server
    OPTIONS (compression 'pglz');

Finally, I added a simple script which ran the following query once per hour:

    INSERT INTO seen
    SELECT user_id, page_id, ts
    FROM seenfeed
    WHERE ts >= ( now() - interval '1 hour' )
    ORDER BY user_id, page_id, ts;

... and then I started everything in motion.

Now, for cstore the ORDER BY is vitally important; it determines how the blocks you create for the column store are organized.  In this particular case, I knew that I would be doing more analysis by user.  But most users would do ORDER BY ts instead.

After a few hours, I checked back, and now I can run some analytical queries on the cstore table.  For example, user activity:

    select user_id, count(distinct page_id), max(ts) 

    from seen group by user_id;

    user_id | count |              max             
          1 |    92 | 2015-08-11 22:59:51.504777+00
          2 |    86 | 2015-08-11 22:54:09.77318+00
          3 |    89 | 2015-08-11 22:59:14.574697+00

page activity:

    select page_id, count(distinct user_id) as duv, max(ts) as last
    from seen group by page_id order by duv desc;

    page_id | duv |             last             
         71 |  96 | 2015-08-11 22:59:38.690743+00
         99 |  96 | 2015-08-11 22:58:43.004618+00
          4 |  96 | 2015-08-11 22:57:45.95007+00

... and more. 

Now, in a production environment, you'll want to do more than that.  Depending on traffic, you might batch inserts daily instead of hourly.  You might want to use several continuous views to store different summaries of the data instead of raw data.  But the above should be enough to show you how well these two tools go together.

Tuesday, August 11, 2015

Please, security test our code!

Since Oracle was so nice as to remind everyone what software security is really like with closed-source software, I wanted to remind people how finding and reporting security issues works in PostgreSQL:
  1. Feel free to "reverse engineer" the code.  In fact, here it is on github if you want to scrutinize it.
  2. We generally credit security researchers who find real security holes (with limitations for duplicates, etc.).  Over the last few years, some of the most critical bugs in PostgreSQL were found by professional security researchers doing things like fuzz testing.
  3. If you think you've found a security issue, please report it to If it turns out to be a non-issue, we'll tell you, and you can report it as a regular bug.
  4. Be prepared to answer questions about your report.  We showed you our code, you can show us yours.
Our open approach to security is the reason why PostgreSQL was rated by the Database Hacker's Handbook as "the most secure by default": more secure than Oracle.  And why for five Defcons in a row, security hackers have been unable to crack PostgreSQL in the annual Schemaverse competition.

And please ... update your servers! We send out those incremental updates for a reason, and often as not, there are security patches.   More information on our Security Page.

Note: the above is my personal opinion and is not the opinion of The PostgreSQL Project or any other organization.