Thursday, July 30, 2015

Texas Trip: DjangoCon and Postgres Open



This September, I will be having a great Texas adventure, and you're invited along. 

September 8-10, Austin: DjangoCon.US.   I will be presenting "PostgreSQL Performance in 15 Minutes" on the 9th. 

September 9th or 10th, Austin: I will speak at AustinPUG.  Date, location and exact topic still TBA.  Follow the AustinPUG Meetup, or check back here in a week for an update.

September 16-18, Dallas: Postgres Open:  I will be doing my Replication 101 tutorial, and then Explaining Explain.

September is a great time to be in Texas: temperatures are reasonable, Texans are friendly, and there's lots of great Mexican food and barbecue.  So, register for DjangoCon and/or Postgres Open today and join me!

Oh, and the photo? That's one of the hand-thrown DjangoCon pint cups I'm making as speaker gifts for DjangoCon.  Don't you wish you'd submitted a talk, now?

Tuesday, July 28, 2015

PipelineDB: streaming Postgres

If you've been following the tech news, you might have noticed that we have a new open source PostgreSQL fork called "PipelineDB".  Since I've joined the advisory board of PipelineDB, I thought I'd go over what it is, what it does, and what you'd use it for.  If you're not interested in Pipeline, or Postgres forks, you can stop reading now.

PipelineDB is a streaming database version of PostgreSQL.  The idea of a streaming database, first introduced in the PostgreSQL fork TelegraphCQ back in 2003, is that queries are run against incoming data before it is stored, as a kind of stream processing with full query support.  If the idea of a standard database is "durable data, ephemeral queries" the idea of a streaming database is "durable queries, ephemeral data".  This was previously implemented in StreamBase, StreamSQL, and the PostgreSQL fork Truviso. In the Hadoop world, the concept is implemented in Apache SparkSQL.

On a practical level, what streaming queries do is allow you to eliminate a lot of ETL and redundant or temporary storage.

PipelineDB 0.7.7 is 100% of PostgreSQL 9.4, plus the ability to create Continuous Views, which are actually standing queries which produce different data each time you query them depending on the incoming stream.  The idea is that you create the queries which filter and/or summarize the data you're looking for in the stream, and store only the data you want to keep, which can go in regular PostgreSQL tables.

As an example of this, we're going to use PipelineDB to do tag popularity counts on Twitter.  Twitter has a nice streaming API, which gives us some interesting stream data to work with.  First I spun up a PipelineDB Docker container.  Connecting to it, I created the "twitter" database and a static stream called "tweets":

Creating a static stream isn't, strictly speaking, necessary; you can create a Continuous View without one.  As a career DBA, though, implied object names give me the heebie-jeebies.  Also, in some future release of PipelineDB, static streams will have performance optimizations, so it's a good idea to get used to creating them now.

    docker run pipelinedb/pipelinedb
    josh@Radegast:~$ psql -h 172.17.0.88 -p 6543 -U pipeline
    Password for user pipeline:
    psql (9.4.1, server 9.4.4)
    Type "help" for help.
    pipeline=# create database twitter;
    CREATE DATABASE
    pipeline=# \c twitter
    twitter=# create stream tweets ( content json );
    CREATE STREAM


Then I created a Continous View which pulls out all identified hashtags from each tweet.  To do this, I have to reach deep inside the JSON of the tweet structure and use json_array_elements to expand that into a column.  Continuous Views also automatically add a timestamp column called "arrival_timestamp" which is the server timestamp when that particular streaming row showed up.  We can use this to create a 1-hour sliding window over the stream, by comparing it to clock_timestamp().  Unlike regular views, volatile expressions are allowed in Continuous Views.

    CREATE CONTINUOUS VIEW tagstream as
    SELECT json_array_elements(content #>

      ARRAY['entities','hashtags'] ->> 'text' AS tag
    FROM tweets
    WHERE arrival_timestamp >
          ( clock_timestamp() - interval '1 hour' );


This pulls a continous column of tags which appear in the San Francisco stream.

Then I created a linked Docker container with all of the Python tools I need to use TwitterAPI, and then wrote this little script based on a TwitterAPI example.  This pulls a stream of tweets with geo turned on and identified as being in the area of San Francisco.  Yes, INSERTing into the stream is all that's required for a client to deliver stream data.  If you have a high volume of data, it's better to use the COPY interface if your language supports it.

Then I started it up, and it started pushing tweets into my stream in PipelineDB.  After that, I waited an hour for the stream to populate with an hour's worth of data.

Now, let's do some querying:

    twitter=# select * from tagstream limit 5;
        tag    
    ------------
    Glendale
    Healthcare
    Job
    Jobs
    Hiring


How about the 10 most popular tags in the last hour?

    twitter=# select tag, count(*) as tag_count from tagstream group
              by tag order by tag_count desc limit 10;
         tag       | tag_count
    ---------------+-----------
    Hiring         |       211
    Job            |       165
    CareerArc      |       163
    Jobs           |       154
    job            |       138
    SanFrancisco   |        69
    hiring         |        60
    FaceTimeMeNash |        42
    CiscoRocks     |        35
    IT             |        35


I detect a theme here.  Namely, one of sponsored tweets by recruiters.

Now, obviously you could do this by storing all of the tweets in an unlogged table, then summarizing them into another table, etc.  However, using continuous views avoids a bunch of disk-and-time-wasting store, transform, store again if you're uninterested in the bulk of the data (and tweet metadata is bulky indeed). Further, I can create more continuous views based on the same stream, and pull different summary information on it in parallel. 

So, there you have it: PipelineDB, the latest addition to the PostgreSQL family.  Download it or install it using the Docker container I built for it.

Friday, July 24, 2015

Using Docker to isolate VPN clients

As a consultant, I need to use VPNs to log into many of our client environments.  As one of our data security policies, our staff are not allowed to connect to multiple customer VPNs from the same machine at the same time; plus many VPNs do not permit this.  However, this causes some work efficiency issues whenever we need to run a job for a customer which involves a lot of wait time, such as building an index.

I've used VirtualBox VMs as isolation environments, but VBox consumes a lot of memory and system resources.  So I started thinking: could I use Docker containers to supply some network/file isolation while connected to multiple VPNs?  The answer is yes.

Let's take the difficult case first: a Cisco IPSec VPN.  First, I exported the VPN configuration from Gnome's network manager, which creates a PCF file.  Next, I start Docker container based on the official Ubuntu Trusty images, giving it the net_admin Linux cap to allow the VPN to work.  I also need to mount the new PCF file into a directory in the container so I can access it.

     docker run -it --cap-add net_admin \
        --volume /home/josh/vpns:/mnt/vpn ubuntu:trusty

After this I need to install the connection software:

     apt-get update
     apt-get install vpnc ssh

Once vpnc is installed, I need to convert that PCF file to a vpnc file and install it:

     cd /mnt/vpn
     pcf2vpnc vpn.pcf /etc/vpnc/vpnc.conf
     chmod 600 /etc/vpnc/vpnc.conf

Once that's all set up, I can start the VPN:

     root@c5b7802ca120:/# vpnc vpnc
     Enter password for jberkus@10.1.1.15:
     VPNC started in background (pid: 78)...


Success!  Sometimes, though, you'll see this error message:

         Cannot open "/proc/sys/net/ipv4/route/flush"

I haven't pinned down what causes that yet, but it seems to have no effect on actual ability to use the vpnc VPN.  Probably I need one more --cap-add, but I don't yet know what it is.

Anyway, once you have that running, docker commit the image and you have a saved VPN image for that particular VPN (just remember not to upload it to Docker Hub).  Then you can run the image whenever you need that VPN.

OpenVPN is actually more complicated, because device access as well as net access is required.  On the up side, you only need the config file to actually connect.  See this project on how to set up your container, and then run:

       openvpn --config /mnt/vpn/config.ovpn

Now, the caveat:  Docker containers do not supply nearly the levels of isolation which a full VM would, especially since we have to grant the "net_admin" cap to get the VPN to work.  While containers prevent accidental network cross-connection, they would not block targeted malware which attacks the network stack. In our case, that means I can use this technique for our regular clients, but not for PCI-compliant customers or other clients with strong regulated data isolation requirements.

But for the others, I can now easily run 2 VPNs at a time, while protecting them from cross-connection, and me from accidentally targeting the wrong customer's server.

Thanks to Rikaard Hosein, programmerq, and other helpful folks on IRC channel #docker for advice on this.

Thursday, July 23, 2015

unsafe UPSERT using WITH clauses

By now you've read about PostgreSQL 9.5 and our shiny new UPSERT feature.  And you're thinking, "hey, 9.5 is still in alpha, how do I get me some UPSERT action now?"  While there are some suggestions for workarounds on the PostgreSQL wiki, I'm going to show you one more method for approximating UPSERT while you help us test 9.5 so that it will come out sooner.

The general suggested way for handling UPSERT situations is one of the following:
  1. Try to do an INSERT.  On error, have the application retry with an UPDATE.
  2. Write a PL/pgSQL procedure which does insert-then-update or update-then-insert.
Both of these approaches have the drawback of being very high overhead: the first involves multiple round-trips to the database and lots of errors in the log, and the second involves major subtransaction overhead.  Neither is concurrency-safe, but then the method I'm about to show you isn't either.  At least this method avoids a lot of the overhead, though.

What's the method?  Using writeable WITH clauses.  This feature, introduced in 9.1, allows you to do a multi-step write transaction as a single query.  For an example, let's construct a dummy table with a unique key on ID and a value column, then populate it:

     create table test_upsert ( id int not null primary key, 
        val text );
     insert into test_upsert select i, 'aaa'
       from generate_series (1, 100) as gs(i);


Now, let's say we wanted to update ID 50, or insert it if it doesn't exist.  We can do that like so:

    WITH
    newrow ( id, val ) as (
        VALUES ( 50::INT, 'bbb'::TEXT ) ),
    tryupdate as (
        UPDATE test_upsert SET val = newrow.val
        FROM newrow
        WHERE test_upsert.id = newrow.id
        RETURNING test_upsert.id
    )
    INSERT INTO test_upsert
    SELECT id, val
        FROM newrow
    WHERE id NOT IN ( SELECT id FROM tryupdate );


The above tries to update  ID=50.  If no rows are updated, it inserts them.  This also works for multiple rows:

    WITH
    newrow ( id, val ) as (
        VALUES ( 75::INT, 'ccc'::TEXT ),

                      ( 103::INT, 'ccc'::TEXT )
    ),
    tryupdate as (
        UPDATE test_upsert SET val = newrow.val
        FROM newrow
        WHERE test_upsert.id = newrow.id
        RETURNING test_upsert.id
    )
    INSERT INTO test_upsert
    SELECT id, val
        FROM newrow
    WHERE id NOT IN ( SELECT id FROM tryupdate );


... and will update or insert each row as called for.

Given that we can do the above, why do we need real UPSERT?  Well, there's some problems with this approximate method:
  • It's not concurrency-safe, and can produce unexpected results given really bad timing of multiple connections wanting to update the same rows.
  • It will still produce key violation errors given bad concurrency timing, just fewer of them than method 1 above.
  • It's still higher overhead than 9.5's UPSERT feature, which is optimized.
  • It will return INSERT 0 0  from calls that do only updates, possibly making the app think the upsert failed.
  • It's not safe to use with INSERT/UPDATE triggers on the same table.
But ... if you can't wait for 9.5, then at least it's a temporary workaround for you.  In the meantime, download 9.5 alpha and get testing the new UPSERT.

Thursday, July 2, 2015

Test PostgreSQL 9.5 Alpha in a Docker container

Yay, the first alpha of PostgreSQL 9.5 is out!  This has lots of cool stuff in it, which you can read about elsewhere.  What I'm posting here is a new experiment: offering up Docker images for testing the alphas and betas.

TL:DR = Image here, information and instructions on the wiki.

This occurred to me last week at DockerCon (naturally enough); one of the reasons more people don't test the PostgreSQL beta releases is that it can be a pain to install them, especially with all of the stuff you want to test.  So I've just put a couple days' work into making it easier for you: all you gotta do is install the image and you can test away.

And, since you're testing in a container, you don't have to worry about messing up the PostgreSQL installation on the desktop/laptop/server you normally use.  If you feel like doing your testing in a public cloud, many of them have ways to run containers as cloud servers, so that's easy to do.

I created this image crammed full of stuff: all of contrib, jsquery, pgxn, PL/python and PL/perl, a sample database, etc., so that you'd have something to test.  I wasn't able to include PL/Lua due to build issues, nor PostGIS because I'm building from source, and building PostGIS from source is kinda scary.  If you want to help me build a better test image, though, the Dockerfile project for it is here.

In the meantime, I've removed another excuse for not testing PostgreSQL 9.5.  So what are you waiting for?  Test already!

Friday, June 26, 2015

NDAs at SFPUG: Survey Results

It's boomtime in San Francisco, which means we're also full swing into The Cult of the NDA.  This includes many of our venues for SFPUG meetups; they require signing a confidentiality disclaimer before entering their office.  While I question the utility of this, since these hosts are providing us with free space, food, and drink, I'm really not in a position to argue.  So I launched a survey a month ago to see how much of a problem this is for our members.  I thought it might be useful to share the results with other PUG leaders so that they can also develop policies around this.

Here's the results. First, let me give you the overall results in a pie chart.  Scroll down to the bottom of my post for my suggested policy conclusions.


Incidentally, these are all graphed using iPython Notebook and Pylab, which is awesome way to do one-off graphs.  Here's the code for that graph:

    %matplotlib inline
    import psycopg2
    from pylab import *

    conn=psycopg2.connect('dbname=sfpug host=127.0.0.1')
    cur = conn.cursor()
    cur.execute("""SELECT att, count(*) as members
        FROM ndasurvey GROUP BY att ORDER BY att""");

    labels = []
    fracs = []
    explode = []
    for rec in cur:
        labels.append(rec[0])
        fracs.append(rec[1])
        explode.append(0.05)

    figure(1, figsize=(6,6))
    pie(fracs, explode=explode, labels=labels,
                    autopct='%1.0f%%', shadow=True, startangle=90)
    title('Attitudes Towards Venue NDAs: Overall')
    show()


So overall we have a somewhat split distribution.  BTW, here's the definitions of the attitudes:

  • won't attend: I won't go to a meetup which requires signing
  • needs review: I/my employer must review the agreement first
  • depends on text: depends on what agreement says
  • go anway: I don't like them, but I'll still go
  • don't care: I don't care, whatever
Does this differ for which of our three divisions (San Francisco, East Bay, and South Bay) it is?






So, East Bay attendees don't seem to care in general, and South Bay attendees are much more concerned about reviewing the text of the confidentiality statement.  Which makes sense, if you think about it.

The other division we have is how frequent of an attendee someone is; is there a difference in attitude about our regulars?  Apparently there is:




... so the "won't attend" group is also, generally, a group which doesn't regularly go to SFPUG meetups.  What does this tell us for a policy for SFPUG? 

Well, the 5 options actually fall into three broader groups: those who will go regardless, those who need to see the text of the agreement, and those who won't go.  What if we group it that way?





So from the look of things, 83% of SFPUG will go to a meetup with an confidentiality agreement, provided that they get to review it first.  This suggests that a good policy for SFPUG should be:

"Venues which require confidentiality agreements are acceptable, if not preferred, provided that the text of the agreement is disclosed to the group beforehand."

It also suggests that I and the other leaders should be dilligent in finding out about such agreements whenever booking a venue.

Oh, and for full data, here's the population breakdown of the above: