Database Soup: 2015

Thursday, December 31, 2015

Speaker Practice Survey

VM Brasseur and I are doing a speaker training tutorial at SCALE14, entitled "10 Step Program for Great Tech Talks." As experienced speakers who go to a lot of conferences, we have way too much material for a three hour tutorial, and need to know what to cut. So she had the idea of doing a survey of conference attendees, which we conducted over a couple weeks. The survey was meant to find out what frequent conference attendees liked about good talks, and didn't like about not-so-good talks.

You can check out the original survey here, although we're no longer paying attention to responses.

Demographics

First let's find out who our surveyees are. 108 people responded. Given that VM and I broadcasted the survey over Twitter, as did Josh Simmons and Gareth Greenaway, I expect that our sample will be fairly skewed; let's see. First we asked people whether they just attend talks, or whether they both give and attend them.

As you can see, our sample skews heavily towards presenters, with a minority of pure audience members. A few wiseacres said they only give talks, and don't attend them, which wasn't supported by their other answers. We also asked how many conferences and talks folks went to in the last year:

So, most of our respondees are frequent conference and/or talk attendees. This colors the rest of the survey; what we're looking at is what a group of experienced people who go to a lot of talks, and present more than a few, think of other speakers' performance. I suspect that if we did a survey of folks at their very first tech conference, we would see somewhat different data.

The Good

We asked "Thinking of the presentations you attended with topics and/or speakers you've most enjoyed, what speaker behaviors below do you feel substantially added to the experience?". The idea of this question was to find out what skills or actions by the speaker were the most important for making a talk valuable. Each item was rated on a 1-5 scale.

The above graph requires some explanation. The length of the bars is the average score. The bars are ranked from the highest to lowest rated practice. The color of the bars indicates the median score, for cases where the average is deceptively skewed: Red 5, Blue 4, Green 3, and Black 2.

We can see a few things from this. First, no practice was considered completely unimportant; nothing averaged below 2.1. Good narrative structure was clearly the best liked practice, with deep subject knowledge and being an energetic speaker as our second and third ranked items.

If you look at the overall rankings, you can see that "content" items are the best loved, whereas talk "extras" and technical improvements are the least important. Slide visual design, which many speakers spend a lot of time sweating over, just averages 3.0, and the three items involving technical interaction with the audience (demos, participation, and exercises), come in last. I do wonder whether those three are last because they are so seldom done well, even by experienced speakers, or because they really are unimportant. Let me know what you think in the comments.

The Bad

To balance this out, we asked surveyees, "Thinking of the presentations you attended with topics and/or speakers you liked, what speaker behaviors below do you feel substantially detracted from, or even ruined, an otherwise good presentation?" The idea was not to find out why sucky presentations sucked, but what things prevented an acceptable talk from being good. Here's the ratings, which are graphed the same way as the "good" ratings:

The top two speaking problems were not being able to hear or understand the speaker, and the presenter being bored and/or distracted. The first shows how important diction, projection, and AV systems are to having a good talk -- especially if you're speaking in a language which isn't your primary one. The second is consistent with the good speaker ratings: people love energetic speakers, and if speakers are listless, so is the audience. So make sure to shoot that triple espresso before you go on stage.

There were a few surprises in here for me. Unreadable code on slides, one of my presonal pet peeves, was a middle-ranker. Running over or under time, which I expected to be a major talk-ruiner, was not considered to be. Offensive content, on the other hand, is a much more widespread problem than I would have assumed.

And "ums" and body language problems, however, ranked at the bottom. From my own observation, I know that these are common issues, but apparently they don't bother people that much. Or do they?

The Ugly

Finally, we asked, "If you could magically eliminate just one bad speaker behavior from all presenters everywhere, which one would it be?" The idea was to find out what one thing was really driving audiences bananas, it was so bad. Here's the responses:

So, first, there are a few things which are inconsistent with the ratings. Unpracticed presentations were the #1 issue, and some other items which looked unimportant by rating, like saying "um" a lot, show up as a top pet peeve. So apparently people hate "um" and interruptions a lot, but they don't see them ruining that many otherwise good talks. The other top issues match the ratings we already saw.

About a seventh of the responses were ones which received only one or two votes, including an assortment of write-in responses. Here's a few of the more interesting ones:

"Reading slide content to audience" (two votes, including a page-long tirade)
"Giving the talk using slides that are really optimized to be downloaded by someone who isn't able to attend the talk. Presentation slides and downloadable slides are different beasts."
"Long personal company or employment history introductions"

On to the Tutorial

So this survey will give us enough information to know what things we can cut back on in order to make our timeslot ... and even a few things to spend more time on! If you want to learn more about how to be a great speaker, join us at SCALE14 on the morning of Thursday, January 21.

If you just like the graphs, both the graphs and the IPython notebook used to produce them are on our Github repo.

Wednesday, December 23, 2015

A Christmas present: PostgreSQL 9.5 RC1

It's been much-delayed, but PostgreSQL 9.5 RC1 is finally available. This release contains not only fixes for all known issues in 9.5 features, it also contains an "ultimate solution" to multixact truncation handling. Multixacts were the cause of multiple bugfix releases in versions 9.3 and 9.4, and several hackers have taken the opportunity in 9.5 to implement a new design for this. This has been a large part of the delay in releasing 9.5.

So: if you plan to upgrade to 9.5, download and test now. Final will be released in early January, barring discovery of new ghastly bugs in the RC.

To make it easier for you to test, I've updated the postgres95-test Docker image, ready for you to download and try out whatever you want.

9.5 brings us a lot of cool features which are likely to justify upgrading for you:

UPSERT
Row Level Security
Grouping Sets/CUBE/ROLLUP
Faster sorts for text
FDW push-down and partitioning support
BRIN indexes

If you want to read up on the features so you know what to test:

What's New in 9.5
My articles on 9.5 for LWN: Part 1 and Part 2
Discussion on Multxact fixes
Guide to 9.5 beta testing

Tuesday, December 8, 2015

Meet the newest member of the PostgreSQL community: MongoDB, Inc.

Well, this story is better told by the guy who broke it, John De Goes.

The TL;DR is that MongoDB will be shipping PostgreSQL as its "legacy BI connector" in version 3.2, using PostgreSQL+Multicorn FDW as a wrapper to connect SQL-based BI systems to Mongo.

I have two thoughts about this:

Welcome to the PostgreSQL community, MongoDB staff!
Multicorn? I mean, not that Multicorn isn't great, but don't you think you should be using a C-based FDW driver for performance?

Anyway, hopefully the folks at MongoDB will put in a talk at pgNYC about this. We're always keen to hear about creative uses of FDWs, and if this isn't one, I don't know what is.

PostgreSQL, The Center Of Your Dataverse™

(even if you're using Mongo)

Thursday, November 12, 2015

PostgreSQL link round-up

First, in case you somehow missed it, PostgreSQL 9.5 Beta 2 is now out. Time for another round of testing! There's fewer and fewer bugs found, so we're not far from a final release. I don't know about anyone else, but we're going into production on Beta 2 in a couple places. Can't wait any longer for Upsert and RLS.

Second, pgConfSV has announced its keynotes, from CitusData, Pivotal, and -- as a bit of a surprise -- from streaming data thing Kafka (before you ask, I didn't pick the keynotes). I believe registration is still open, so you can still go to the conference next week if you act quickly.

Thirdly, I have a roundup of some cool blogs covering PostgreSQL which aren't on Planet and you may have missed:

Compose.io did a terrific two-part article on why PostgreSQL is the best open source database. Read it here: Part I Part II.

They also covered using JSON in Postgres with Python.

In stranger news, there's an amazingly strange story about Soylent, PostgreSQL, and auto-responding to poisoning accusations. Yes, Soylent Inc. uses PostgreSQL, why not? Read the whole weird tale here on Zapier's blog.

That's it for my round up ... if you have some good links, post them in the comments.

Tuesday, November 10, 2015

Ready to be an Accidental DBA at pgConfSV?

My AccidentalDBA tutorial materials have been updated for pgConfSV next week. If you are planning to take this tutorial, please read and follow the setup instructions before you get to the conference.

I created this tutorial in 2009 because of the number of people these days who find themselves in charge of a PostgreSQL server ... or many servers ... without any DBA background, training, or desire to be ops. The tutorial is intended to teach you the minimum you need to keep your PostgreSQL server from falling over so you can get back to whatever your main job is.

You can take it self-paced if you can't make it to pgConfSV. The tutorial includes all materials and notes. In this version, I've added the option of a Docker container (preferred), pg_stat_statements, and a restore-to-point-in-time exercise.

If you are going to attend this tutorial, note that it requires advance setup which will be harder to perform once you get to your hotel, and near-impossible if you wait until 10 minutes before it starts. So read up and install now.

And ... I believe there's still a few registrations left for the tutorial day, so consider signing up. You have nothing to lose but your fear of database admin!

Monday, August 31, 2015

Lock-polling script for ALTER TABLE

One of PostgreSQL's advantages over many other SQL databases is letting users modify database objects with a minimum of locking and fuss, and do it in a transaction so it can be rolled back. However, a minimum of locking isn't no locking, so one still has to plan for getting a lock to push through the database migration.

For example, say I wanted to add two new columns to a table. I already know that by making the column nullable, and not setting a default, I can add them with a metadata-only change which requires only an update to the system catalogs. However, this table change does require an ACCESS EXCLUSIVE lock for a few milleseconds to go through, which can be tricky. See, an access exclusive lock blocks everything, including reads and autovacuum, which can be a bit messy if your timing is bad. You can end up waiting for that lock behind a long-running query, and then all the other database traffic can pile up behind you.

Now, if we had ALTER TABLE CONCURRENTLY it would take care of this for you. But we don't (yet), so you need to fake it with a DO script. Here's an example DO script which polls for a lock every 2 seconds until it can get one, then pushes through the table change:

    DO $f$
    DECLARE ntries INT := 10;
        sleepytime INT := 2;
    BEGIN

    FOR get_lock IN 1 .. ntries LOOP
            BEGIN
                    LOCK TABLE mytable
                    IN ACCESS EXCLUSIVE MODE NOWAIT;
                    ALTER TABLE mytable ADD COLUMN new_col1 INT,
                            ADD COLUMN new_col2 VARCHAR;
                    RAISE INFO 'table updated';
                    RETURN;
            EXCEPTION
                    WHEN lock_not_available THEN
                            PERFORM pg_sleep(sleepytime);
            END;
    END LOOP;

    RAISE INFO 'unable to obtain lock after % tries', ntries;

    END;$f$;

The idea here is that you keep trying to LOCK ... NOWAIT, which will throw and error if it can't get the lock immediately. Then it sleeps and tries again 2 seconds later. If you're using 9.3 or later, you can take an even better approach, using lock_timeout:

    DO $f$
    DECLARE ntries INT := 10;
        sleepytime INT := 2;
    BEGIN
    SET lock_timeout = '100ms';
    FOR get_lock IN 1 .. ntries LOOP
            BEGIN
                    LOCK TABLE mytable IN ACCESS EXCLUSIVE MODE;
                    ALTER TABLE mytable ADD COLUMN a int,
                            ADD COLUMN b INT,
                            ADD COLUMN c INT;
                    RAISE INFO 'table updated';
                    RETURN;
            EXCEPTION
                    WHEN lock_not_available THEN
                            PERFORM pg_sleep(sleepytime);
            END;
    END LOOP;

    RAISE INFO 'unable to obtain lock after % tries', ntries;

    END;$f$;

The advantage of this is that it doesn't have to get the lock immediately; it waits 100ms, then gives up on getting the lock. That helps in situations where the table is never completely free of read queries, without risking a serious pile-up of requests.

You should be able to adapt this approach. to the migrations you actually need to push through.

Friday, August 28, 2015

Stupid Hacks: Dictionary replace function

I write a lot more PL/pgSQL than I'd like to. Not that I don't like SQL, but as a language PL/pgSQL really shows its thrown-together origin; it's like 80's primitive. One thing that PL/pgSQL lacks is good string manipulation tools, which is particularly tragic given that the #1 thing to do in PL/pgSQL is to generate queries from parameters and run EXECUTE.

Postgres has two built-in ways to do string substitution: concatenation and format(). Both have drawbacks. Let me give you an example:

EXECUTE 'SELECT ' || col1 || ', ' || col2 || ' FROM ' || userschema ||
'.accounts WHERE ' || filterclause || ' ORDER BY ' || col1 || ',' || col2;

EXECUTE format('SELECT %s, %s FROM %s.accounts WHERE %s ORDER BY %s, %s', col1, col2, userschema, filterclause, col1, col2);

You can see the problem here. Both formats are hard to read and hard to maintain. Python and Perl have a good fix for this: dictionary/hash-based string substitution, where you can swap in the dictionary keys for the values. So I wrote up a quick hack to do this in PL/pgSQL.

Here's the dict-replace function.

Using it, you'd replace the above with:

EXECUTE replace_vars('SELECT ${col1}, ${col2} FROM ${userschema}.accounts
WHERE ${filterclause} ORDER BY ${col1}, ${col2}', vardict);

Of course, you need to first set up the vardict as a JSON value, in the form:

vardict := '{ "col1" : "username", 'col2' : "branchname", ...}'

Still, much more readable, eh? No refcounting, no repeating variables, no string breaks. I used Bash's variable substitution syntax of ${var} because it seemed like the thing least likely to conflict with user text, unlike anything involving %.

There's some caveats, though: it's not performant, and you could probably find strings which will break it, particularly if you're swapping in JSON values. It's not SQL-injection safe, so don't use it for stuff handling user input. You still have to do your own quoting of strings. And if you have access to PL/Perl or PL/Python you don't need this nonsense.

But for the minimal case, it should help.

Friday, August 14, 2015

Most of the talks chosen for pgConfSV

Since we're still working on the schedule, I wanted to give folks a preview of the talks we've picked for pgConfSV. We still have a few more talks to pick and the tutorials to finalize. Regardless, it's looking like a really exciting lineup!
We'll have folks from Heap, Wanelo, TreasureData, Rackspace, Turnitin.com, EMC, Joyent, Square, and more. The sad part will be missing three talks in each timeslot.

Note that this set of talks is not final; we're still swapping a couple of things, and some speakers have not confirmed.

pgConfSV is November 17th and 18th in South San Francisco. If you're going to AnsibleCon, that's the day after, so you can do both!

Wednesday, August 12, 2015

Two Great Tastes That Taste Great Together: cstore + Pipeline

cstore_fdw, the column-store extension for PostgreSQL by CitusData, is a really good way to add compressed storage for archival data, and analytic data intended to be aggregated, to your application. Because it's a column store, though, cstore wants new data added in batches, the bigger the better. This means that you need to find some way to batch-up incoming data, preferably one able to accept a lot of new data rapidly.

This is where PipelineDB comes in. Since PipelineDB is open source now, and based on 9.4.4, I can add extensions to it, including cstore_fdw. I've done so with the PipelineDB 0.7.7a Docker container, so if you use Docker it's simply available.

As a demonstration of this, I'll set up some fake clickstream data, archived to a cstore table hourly. First, I wrote a quick python script to generate it continuously and push it to a Pipeline stream.

Then I created the stream and continuous view in PipelineDB:

    CREATE STREAM seenstream ( user_id int, page_id int, ts timestamptz );

    CREATE CONTINUOUS VIEW seenfeed as select user_id, page_id, ts
    FROM seenstream
    WHERE arrival_timestamp > ( clock_timestamp() - interval '90 minutes' );

Next, I created the cstore table:

    CREATE EXTENSION cstore_fdw;

    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE seen (
        user_id int,
        page_id int,
        ts timestamp tz
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz');

Finally, I added a simple script which ran the following query once per hour:

    INSERT INTO seen
    SELECT user_id, page_id, ts
    FROM seenfeed
    WHERE ts >= ( now() - interval '1 hour' )
    ORDER BY user_id, page_id, ts;

... and then I started everything in motion.

Now, for cstore the ORDER BY is vitally important; it determines how the blocks you create for the column store are organized. In this particular case, I knew that I would be doing more analysis by user. But most users would do ORDER BY ts instead.

After a few hours, I checked back, and now I can run some analytical queries on the cstore table. For example, user activity:

    select user_id, count(distinct page_id), max(ts)
    from seen group by user_id;

    user_id | count |              max
    --------+-------+-------------------------------
          1 |    92 | 2015-08-11 22:59:51.504777+00
          2 |    86 | 2015-08-11 22:54:09.77318+00
          3 |    89 | 2015-08-11 22:59:14.574697+00

page activity:

    select page_id, count(distinct user_id) as duv, max(ts) as last
from seen group by page_id order by duv desc;

    page_id | duv |             last
    --------+-----+-------------------------------
         71 | 96 | 2015-08-11 22:59:38.690743+00
         99 | 96 | 2015-08-11 22:58:43.004618+00
          4 | 96 | 2015-08-11 22:57:45.95007+00

... and more.

Now, in a production environment, you'll want to do more than that. Depending on traffic, you might batch inserts daily instead of hourly. You might want to use several continuous views to store different summaries of the data instead of raw data. But the above should be enough to show you how well these two tools go together.

Tuesday, August 11, 2015

Please, security test our code!

Since Oracle was so nice as to remind everyone what software security is really like with closed-source software, I wanted to remind people how finding and reporting security issues works in PostgreSQL:

Feel free to "reverse engineer" the code. In fact, here it is on github if you want to scrutinize it.
We generally credit security researchers who find real security holes (with limitations for duplicates, etc.). Over the last few years, some of the most critical bugs in PostgreSQL were found by professional security researchers doing things like fuzz testing.
If you think you've found a security issue, please report it to security@postgresql.org. If it turns out to be a non-issue, we'll tell you, and you can report it as a regular bug.
Be prepared to answer questions about your report. We showed you our code, you can show us yours.

Our open approach to security is the reason why PostgreSQL was rated by the Database Hacker's Handbook as "the most secure by default": more secure than Oracle. And why for five Defcons in a row, security hackers have been unable to crack PostgreSQL in the annual Schemaverse competition.

And please ... update your servers! We send out those incremental updates for a reason, and often as not, there are security patches. More information on our Security Page.

Note: the above is my personal opinion and is not the opinion of The PostgreSQL Project or any other organization.

Friday, August 7, 2015

Understanding Unintuitive TABLESAMPLE Results

In 9.5 Alpha 2:

create table mil ( id int, val text );
insert into mil select i, i::text || '-val' from
generate_series(1,1000000) as gs(i);
analyze;

postgres=# select * from mil tablesample system ( 0.04 );
id | val
----+-----
(0 rows)

Huh, what?

So, what I didn't understand here is the way rows are selected for TABLESAMPLE SYSTEM. Since SYSTEM is page-based, I thought that we selected the requested % of pages, and then pick that many pages at random. Since this table had exactly 185 rows per page, it should return 370 rows every time (2 pages). But that's not what happened. In fact, running the following query I got a variety of counts:

SELECT count(*) FROM (select * from mil tablesample system ( 0.04 ) ) as a;

925? 0? What the hey?

What's really happening is that pages for SYSTEM are selected a different way. Each page is checked against the probability once. This means that, while on average you'll get the number of pages you're expecting, the numbers will vary from request to request quite a bit.

This also means that SYSTEM is a bad choice for really small sample sizes, like 0.01%. BERNOULLI is better, because it'll be checking by row, and therefore the size of the return sample will be much more predictable. It will still have a bit of variation, though; in my testing, +/- 10% on a few hundred rows.

Gulcin Yildirim has a great explanation of this on the 2nd Quadrant blog.

So, what if you need TABLESAMPLE to get a very specific number of rows for you? Well, that's why Petr Jelinek wrote the optional (loadable) SYSTEM_ROWS sampling method. This can be loaded as the tsm_system_rows extension in 9.5.

Hopefully that helps folks be less confused than me about how TABLESAMPLE works.

Thursday, August 6, 2015

Time to test 9.5 Alpha 2

PostgreSQL 9.5 Alpha 2 is out. We're still calling this one an Alpha instead of a Beta, because major replumbing is still underway. However, most of the issues reported with Alpha 1 have been fixed. A bunch of really useful bug reports came from regular users who tested out the alpha with their own workloads or tests, so please keep up the good work!

And, if you didn't test Alpha 1, well, we still need your help testing Alpha 2. Here's some ideas for what to test.

Again, to make testing easier for folks, I've created a docker image of Alpha 2. If you already run docker, simply run jberkus/postgres95-test and you can test our features without needing to worry about how to install Postgres. Specific notes on this test image are on the wiki.

As part of updating to alpha2, I managed to shrink the image by about 200MB by getting rid of a lot of build code. However, I did not remove 150MB of build and compile tools because I'm expecting testers to want to install additional extensions or tools in the image.

Thursday, July 30, 2015

Texas Trip: DjangoCon and Postgres Open

This September, I will be having a great Texas adventure, and you're invited along.

September 8-10, Austin: DjangoCon.US. I will be presenting "PostgreSQL Performance in 15 Minutes" on the 9th.

September 9th or 10th, Austin: I will speak at AustinPUG. Date, location and exact topic still TBA. Follow the AustinPUG Meetup, or check back here in a week for an update.

September 16-18, Dallas: Postgres Open: I will be doing my Replication 101 tutorial, and then Explaining Explain.

September is a great time to be in Texas: temperatures are reasonable, Texans are friendly, and there's lots of great Mexican food and barbecue. So, register for DjangoCon and/or Postgres Open today and join me!

Oh, and the photo? That's one of the hand-thrown DjangoCon pint cups I'm making as speaker gifts for DjangoCon. Don't you wish you'd submitted a talk, now?

Tuesday, July 28, 2015

PipelineDB: streaming Postgres

If you've been following the tech news, you might have noticed that we have a new open source PostgreSQL fork called "PipelineDB". Since I've joined the advisory board of PipelineDB, I thought I'd go over what it is, what it does, and what you'd use it for. If you're not interested in Pipeline, or Postgres forks, you can stop reading now.

PipelineDB is a streaming database version of PostgreSQL. The idea of a streaming database, first introduced in the PostgreSQL fork TelegraphCQ back in 2003, is that queries are run against incoming data before it is stored, as a kind of stream processing with full query support. If the idea of a standard database is "durable data, ephemeral queries" the idea of a streaming database is "durable queries, ephemeral data". This was previously implemented in StreamBase, StreamSQL, and the PostgreSQL fork Truviso. In the Hadoop world, the concept is implemented in Apache SparkSQL.

On a practical level, what streaming queries do is allow you to eliminate a lot of ETL and redundant or temporary storage.

PipelineDB 0.7.7 is 100% of PostgreSQL 9.4, plus the ability to create Continuous Views, which are actually standing queries which produce different data each time you query them depending on the incoming stream. The idea is that you create the queries which filter and/or summarize the data you're looking for in the stream, and store only the data you want to keep, which can go in regular PostgreSQL tables.

As an example of this, we're going to use PipelineDB to do tag popularity counts on Twitter. Twitter has a nice streaming API, which gives us some interesting stream data to work with. First I spun up a PipelineDB Docker container. Connecting to it, I created the "twitter" database and a static stream called "tweets":

Creating a static stream isn't, strictly speaking, necessary; you can create a Continuous View without one. As a career DBA, though, implied object names give me the heebie-jeebies. Also, in some future release of PipelineDB, static streams will have performance optimizations, so it's a good idea to get used to creating them now.

    docker run pipelinedb/pipelinedb
    josh@Radegast:~$ psql -h 172.17.0.88 -p 6543 -U pipeline
    Password for user pipeline:
    psql (9.4.1, server 9.4.4)
    Type "help" for help.
    pipeline=# create database twitter;
    CREATE DATABASE
    pipeline=# \c twitter
    twitter=# create stream tweets ( content json );
    CREATE STREAM

Then I created a Continous View which pulls out all identified hashtags from each tweet. To do this, I have to reach deep inside the JSON of the tweet structure and use json_array_elements to expand that into a column. Continuous Views also automatically add a timestamp column called "arrival_timestamp" which is the server timestamp when that particular streaming row showed up. We can use this to create a 1-hour sliding window over the stream, by comparing it to clock_timestamp(). Unlike regular views, volatile expressions are allowed in Continuous Views.

    CREATE CONTINUOUS VIEW tagstream as
    SELECT json_array_elements(content #>
      ARRAY['entities','hashtags']) ->> 'text' AS tag
    FROM tweets
    WHERE arrival_timestamp >
          ( clock_timestamp() - interval '1 hour' );

This pulls a continous column of tags which appear in the San Francisco stream.

Then I created a linked Docker container with all of the Python tools I need to use TwitterAPI, and then wrote this little script based on a TwitterAPI example. This pulls a stream of tweets with geo turned on and identified as being in the area of San Francisco. Yes, INSERTing into the stream is all that's required for a client to deliver stream data. If you have a high volume of data, it's better to use the COPY interface if your language supports it.

Then I started it up, and it started pushing tweets into my stream in PipelineDB. After that, I waited an hour for the stream to populate with an hour's worth of data.

Now, let's do some querying:

    twitter=# select * from tagstream limit 5;
        tag
    ------------
    Glendale
    Healthcare
    Job
    Jobs
    Hiring

How about the 10 most popular tags in the last hour?

    twitter=# select tag, count(*) as tag_count from tagstream group
              by tag order by tag_count desc limit 10;
         tag       | tag_count
    ---------------+-----------
    Hiring         |       211
    Job            |       165
    CareerArc      |       163
    Jobs           |       154
    job            |       138
    SanFrancisco   |        69
    hiring         |        60
    FaceTimeMeNash |        42
    CiscoRocks     |        35
    IT             |        35

I detect a theme here. Namely, one of sponsored tweets by recruiters.

Now, obviously you could do this by storing all of the tweets in an unlogged table, then summarizing them into another table, etc. However, using continuous views avoids a bunch of disk-and-time-wasting store, transform, store again if you're uninterested in the bulk of the data (and tweet metadata is bulky indeed). Further, I can create more continuous views based on the same stream, and pull different summary information on it in parallel.

So, there you have it: PipelineDB, the latest addition to the PostgreSQL family. Download it or install it using the Docker container I built for it.

Friday, July 24, 2015

Using Docker to isolate VPN clients

As a consultant, I need to use VPNs to log into many of our client environments. As one of our data security policies, our staff are not allowed to connect to multiple customer VPNs from the same machine at the same time; plus many VPNs do not permit this. However, this causes some work efficiency issues whenever we need to run a job for a customer which involves a lot of wait time, such as building an index.

I've used VirtualBox VMs as isolation environments, but VBox consumes a lot of memory and system resources. So I started thinking: could I use Docker containers to supply some network/file isolation while connected to multiple VPNs? The answer is yes.

Let's take the difficult case first: a Cisco IPSec VPN. First, I exported the VPN configuration from Gnome's network manager, which creates a PCF file. Next, I start Docker container based on the official Ubuntu Trusty images, giving it the net_admin Linux cap to allow the VPN to work. I also need to mount the new PCF file into a directory in the container so I can access it.

     docker run -it --cap-add net_admin \
        --volume /home/josh/vpns:/mnt/vpn ubuntu:trusty

After this I need to install the connection software:

   apt-get update
   apt-get install vpnc ssh

Once vpnc is installed, I need to convert that PCF file to a vpnc file and install it:

     cd /mnt/vpn
     pcf2vpnc vpn.pcf /etc/vpnc/vpnc.conf
     chmod 600 /etc/vpnc/vpnc.conf

Once that's all set up, I can start the VPN:

     root@c5b7802ca120:/# vpnc vpnc
     Enter password for jberkus@10.1.1.15:
     VPNC started in background (pid: 78)...

Success! Sometimes, though, you'll see this error message:

         Cannot open "/proc/sys/net/ipv4/route/flush"

I haven't pinned down what causes that yet, but it seems to have no effect on actual ability to use the vpnc VPN. Probably I need one more --cap-add, but I don't yet know what it is.

Anyway, once you have that running, docker commit the image and you have a saved VPN image for that particular VPN (just remember not to upload it to Docker Hub). Then you can run the image whenever you need that VPN.

OpenVPN is actually more complicated, because device access as well as net access is required. On the up side, you only need the config file to actually connect. See this project on how to set up your container, and then run:

       openvpn --config /mnt/vpn/config.ovpn

Now, the caveat: Docker containers do not supply nearly the levels of isolation which a full VM would, especially since we have to grant the "net_admin" cap to get the VPN to work. While containers prevent accidental network cross-connection, they would not block targeted malware which attacks the network stack. In our case, that means I can use this technique for our regular clients, but not for PCI-compliant customers or other clients with strong regulated data isolation requirements.

But for the others, I can now easily run 2 VPNs at a time, while protecting them from cross-connection, and me from accidentally targeting the wrong customer's server.

Thanks to Rikaard Hosein, programmerq, and other helpful folks on IRC channel #docker for advice on this.

Thursday, July 23, 2015

unsafe UPSERT using WITH clauses

By now you've read about PostgreSQL 9.5 and our shiny new UPSERT feature. And you're thinking, "hey, 9.5 is still in alpha, how do I get me some UPSERT action now?" While there are some suggestions for workarounds on the PostgreSQL wiki, I'm going to show you one more method for approximating UPSERT while you help us test 9.5 so that it will come out sooner.

The general suggested way for handling UPSERT situations is one of the following:

Try to do an INSERT. On error, have the application retry with an UPDATE.
Write a PL/pgSQL procedure which does insert-then-update or update-then-insert.

Both of these approaches have the drawback of being very high overhead: the first involves multiple round-trips to the database and lots of errors in the log, and the second involves major subtransaction overhead. Neither is concurrency-safe, but then the method I'm about to show you isn't either. At least this method avoids a lot of the overhead, though.

What's the method? Using writeable WITH clauses. This feature, introduced in 9.1, allows you to do a multi-step write transaction as a single query. For an example, let's construct a dummy table with a unique key on ID and a value column, then populate it:

     create table test_upsert ( id int not null primary key,
        val text );
     insert into test_upsert select i, 'aaa'
       from generate_series (1, 100) as gs(i);

Now, let's say we wanted to update ID 50, or insert it if it doesn't exist. We can do that like so:

    WITH
    newrow ( id, val ) as (
        VALUES ( 50::INT, 'bbb'::TEXT ) ),
    tryupdate as (
        UPDATE test_upsert SET val = newrow.val
        FROM newrow
        WHERE test_upsert.id = newrow.id
        RETURNING test_upsert.id
    )
    INSERT INTO test_upsert
    SELECT id, val
        FROM newrow
    WHERE id NOT IN ( SELECT id FROM tryupdate );

The above tries to update ID=50. If no rows are updated, it inserts them. This also works for multiple rows:

    WITH
    newrow ( id, val ) as (
        VALUES ( 75::INT, 'ccc'::TEXT ),
( 103::INT, 'ccc'::TEXT )
    ),
    tryupdate as (
        UPDATE test_upsert SET val = newrow.val
        FROM newrow
        WHERE test_upsert.id = newrow.id
        RETURNING test_upsert.id
    )
    INSERT INTO test_upsert
    SELECT id, val
        FROM newrow
    WHERE id NOT IN ( SELECT id FROM tryupdate );

... and will update or insert each row as called for.

Given that we can do the above, why do we need real UPSERT? Well, there's some problems with this approximate method:

It's not concurrency-safe, and can produce unexpected results given really bad timing of multiple connections wanting to update the same rows.
It will still produce key violation errors given bad concurrency timing, just fewer of them than method 1 above.
It's still higher overhead than 9.5's UPSERT feature, which is optimized.
It will return INSERT 0 0 from calls that do only updates, possibly making the app think the upsert failed.
It's not safe to use with INSERT/UPDATE triggers on the same table.

But ... if you can't wait for 9.5, then at least it's a temporary workaround for you. In the meantime, download 9.5 alpha and get testing the new UPSERT.

Thursday, July 2, 2015

Test PostgreSQL 9.5 Alpha in a Docker container

Yay, the first alpha of PostgreSQL 9.5 is out! This has lots of cool stuff in it, which you can read about elsewhere. What I'm posting here is a new experiment: offering up Docker images for testing the alphas and betas.

TL:DR = Image here, information and instructions on the wiki.

This occurred to me last week at DockerCon (naturally enough); one of the reasons more people don't test the PostgreSQL beta releases is that it can be a pain to install them, especially with all of the stuff you want to test. So I've just put a couple days' work into making it easier for you: all you gotta do is install the image and you can test away.

And, since you're testing in a container, you don't have to worry about messing up the PostgreSQL installation on the desktop/laptop/server you normally use. If you feel like doing your testing in a public cloud, many of them have ways to run containers as cloud servers, so that's easy to do.

I created this image crammed full of stuff: all of contrib, jsquery, pgxn, PL/python and PL/perl, a sample database, etc., so that you'd have something to test. I wasn't able to include PL/Lua due to build issues, nor PostGIS because I'm building from source, and building PostGIS from source is kinda scary. If you want to help me build a better test image, though, the Dockerfile project for it is here.

In the meantime, I've removed another excuse for not testing PostgreSQL 9.5. So what are you waiting for? Test already!

Friday, June 26, 2015

NDAs at SFPUG: Survey Results

It's boomtime in San Francisco, which means we're also full swing into The Cult of the NDA. This includes many of our venues for SFPUG meetups; they require signing a confidentiality disclaimer before entering their office. While I question the utility of this, since these hosts are providing us with free space, food, and drink, I'm really not in a position to argue. So I launched a survey a month ago to see how much of a problem this is for our members. I thought it might be useful to share the results with other PUG leaders so that they can also develop policies around this.

Here's the results. First, let me give you the overall results in a pie chart. Scroll down to the bottom of my post for my suggested policy conclusions.

Incidentally, these are all graphed using iPython Notebook and Pylab, which is awesome way to do one-off graphs. Here's the code for that graph:

    %matplotlib inline
    import psycopg2
    from pylab import *

    conn=psycopg2.connect('dbname=sfpug host=127.0.0.1')
    cur = conn.cursor()
    cur.execute("""SELECT att, count(*) as members
        FROM ndasurvey GROUP BY att ORDER BY att""");

    labels = []
    fracs = []
    explode = []
    for rec in cur:
        labels.append(rec[0])
        fracs.append(rec[1])
        explode.append(0.05)

    figure(1, figsize=(6,6))
    pie(fracs, explode=explode, labels=labels,
                    autopct='%1.0f%%', shadow=True, startangle=90)
    title('Attitudes Towards Venue NDAs: Overall')
    show()

So overall we have a somewhat split distribution. BTW, here's the definitions of the attitudes:

won't attend: I won't go to a meetup which requires signing
needs review: I/my employer must review the agreement first
depends on text: depends on what agreement says
go anway: I don't like them, but I'll still go
don't care: I don't care, whatever

Does this differ for which of our three divisions (San Francisco, East Bay, and South Bay) it is?

So, East Bay attendees don't seem to care in general, and South Bay attendees are much more concerned about reviewing the text of the confidentiality statement. Which makes sense, if you think about it.

The other division we have is how frequent of an attendee someone is; is there a difference in attitude about our regulars? Apparently there is:

... so the "won't attend" group is also, generally, a group which doesn't regularly go to SFPUG meetups. What does this tell us for a policy for SFPUG?

Well, the 5 options actually fall into three broader groups: those who will go regardless, those who need to see the text of the agreement, and those who won't go. What if we group it that way?

So from the look of things, 83% of SFPUG will go to a meetup with an confidentiality agreement, provided that they get to review it first. This suggests that a good policy for SFPUG should be:

"Venues which require confidentiality agreements are acceptable, if not preferred, provided that the text of the agreement is disclosed to the group beforehand."

It also suggests that I and the other leaders should be dilligent in finding out about such agreements whenever booking a venue.

Oh, and for full data, here's the population breakdown of the above: