Monday, May 11, 2015

Recursive cycle detection is recursive

In prior posts, I've gone over some methods to prevent cycles from being added to your database.  However, imagine that someone has handed you an existing adjacency list tree -- perhaps migrated from another DBMS -- and you need to find all of the cycles as part of data cleaning.  How do you do that?

One way, obviously, would be to just explore all paths and flag the ones where any ID appears twice:

    WITH RECURSIVE prev AS (
        SELECT id, 1 AS depth, array[id] as seen, false as cycle
        FROM folders
        UNION ALL
        SELECT folders.id, prev.depth + 1, seen || folders.id as seen,
            folders.id = any(seen) as cycle
        FROM prev
        INNER JOIN folders ON prev.id = folders.parent_id
    )
    SELECT *
    FROM prev;

However, the above has a serious issue: the query itself will cycle and never complete (in fact, it will error out). So we need to terminate each cycle when the first repeat happens.  Fortunately, that's easy to do, and we'll filter for only the cycles while we're at it:

    WITH RECURSIVE prev AS (
        SELECT id, 1 AS depth, array[id] as seen, false as cycle
        FROM folders
        UNION ALL
        SELECT folders.id, prev.depth + 1, seen || folders.id as seen,
            folders.id = any(seen) as cycle
        FROM prev
        INNER JOIN folders ON prev.id = folders.parent_id
            AND prev.cycle = false
    )
    SELECT *
    FROM prev
    WHERE cycle = true;

The results of the above query look like this:

    id | depth |       seen       | cycle
    ---+-------+------------------+-------
    21 |     2 | {21,21}          | t
    13 |     5 | {13,14,15,11,13} | t
    14 |     5 | {14,15,11,13,14} | t
    15 |     5 | {15,11,13,14,15} | t
    11 |     5 | {11,13,14,15,11} | t
    (5 rows)

One thing to notice is that you'll get a row for every node in a cycle loop.  That's because, with cycles, the choice of starting point is arbitrary.  So the above query isn't the best choice for a deep tree where you have a lot of existing cycles.  There's a second way to detect cycles in a tree; let's see if any of the commenters can post that query.
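If you've already exported the tree, the same walk can be sketched in application code. A minimal Python sketch, assuming the tree is available as (id, parent_id) pairs; the function name and row format are hypothetical:

```python
def find_cycles(rows):
    """Find the members of cycles in an adjacency list given as
    (id, parent_id) pairs.  Mirrors the recursive query: walk from
    every node, carrying the list of ids already seen, and stop at
    the first repeat."""
    parent = {node_id: parent_id for node_id, parent_id in rows}
    cycles = set()
    for start in parent:
        seen = [start]
        current = parent.get(start)
        while current is not None:
            if current in seen:
                # everything from the first repeat onward is the loop
                # itself, matching the query's one-row-per-member output
                cycles.update(seen[seen.index(current):])
                break
            seen.append(current)
            current = parent.get(current)
    return cycles
```

As with the query, every member of a loop is reported, while nodes that merely point into a loop are not.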

Thursday, May 7, 2015

Fancy SQL Thursday: row-to-column transformation

So, this question came up on IRC today:

"How can I take a one-week date range, and have each date in a separate column?"

This is a good question for an example of row-to-column transformation using Postgres' built-in functions.  While you could use the tablefunc extension to do crosstabs, you can also do this on your own in SQL.

First, let's generate a list of formatted dates using Postgres' built-in set-returning function, generate_series():

    with days as (
        select d, to_char(d, 'Mon DD') as label
        from generate_series($1,$2,interval '1 day') as gs(d)

That generates days in the format "Apr 05" between the dates $1 and $2.  Now comes the tricky part: we're going to roll those dates up into an array so that we can transform them horizontally:

    dagg as (
        select array_agg(label order by d) as ld
        from days

So we're using array_agg to make the day label into an array of labels.  We also add the "order by d" to the aggregate to make sure that those days stay in date order.

Once we've got that, then we can just select each array element as a column:

    select ld[1] as d1,
        ld[2] as d2,
        ld[3] as d3,
        ld[4] as d4,
        ld[5] as d5,
        ld[6] as d6,
        ld[7] as d7
    from dagg

Now, this has the limitation that we need to know how many days we're selecting before running the query, but pretty much any method we use requires that if we want columns as output.  So, putting it all together with some sample dates:

    with days as (
        select d, to_char(d, 'Mon DD') as label
        from generate_series('2015-04-01','2015-04-07',interval '1 day') as gs(d)
    ), dagg as (
        select array_agg(label order by d) as ld
        from days
    )
    select ld[1] as d1,
        ld[2] as d2,
        ld[3] as d3,
        ld[4] as d4,
        ld[5] as d5,
        ld[6] as d6,
        ld[7] as d7
    from dagg;

And the result:

      d1   |   d2   |   d3   |   d4   |   d5   |   d6   |   d7
    --------+--------+--------+--------+--------+--------+--------
    Apr 01 | Apr 02 | Apr 03 | Apr 04 | Apr 05 | Apr 06 | Apr 07

That should help you figure out how to do row-to-column transformations for your own queries.  Enjoy!

Thursday, April 16, 2015

Expressions VS advanced aggregates

So ... you're using some of 9.4's new advanced aggregates, including FILTER and WITHIN GROUP.  You want to take some statistical samples of your data, including median, mode, and a count of validated rows.  However, your incoming data is floats and you want to store the samples as INTs, since the source data is actually whole numbers.  Also, COUNT(*) returns BIGINT by default, and you want to cast it to INT as well.  So you do this:

    SELECT device_id,
        count(*)::INT as present,
        count(*)::INT FILTER (WHERE valid) as valid_count,
        mode()::INT WITHIN GROUP (order by val) as mode,
        percentile_disc(0.5)::INT WITHIN GROUP (order by val)
          as median
    FROM dataflow_0913
    GROUP BY device_id
    ORDER BY device_id;

And you get this unhelpful error message:

    ERROR:  syntax error at or near "FILTER"
    LINE 4:         count(*)::INT FILTER (WHERE valid)
            as valid_count,

And your first thought is that you're not using 9.4, or you got the filter clause wrong.  But that's not the problem.  The problem is that "aggregate() FILTER (where clause)" is a syntactical unit, and cannot be broken up by other expressions.  Hence the syntax error.  The correct expression is this one, with parens around the whole expression and then a cast to INT:

    SELECT device_id,
        count(*)::INT as present,
        (count(*) FILTER (WHERE valid))::INT as valid_count,
        (mode() WITHIN GROUP (order by val))::INT as mode,
        (percentile_disc(0.5) WITHIN GROUP (order by val))::INT
           as median
    FROM dataflow_0913
    GROUP BY device_id
    ORDER BY device_id;

If you don't understand this, and you use calculated expressions, you can get a worse result: one which does not produce an error but is nevertheless wrong.  For example, imagine that we were, for some dumb reason, calculating our own average over validated rows.  We might do this:

    SELECT device_id,
        sum(val)/count(*) FILTER (WHERE valid) as avg
    FROM dataflow_0913
    GROUP BY device_id
    ORDER BY device_id;

... which would execute successfully, but would give us the total of all rows divided by the count of validated rows. That's because the FILTER clause applies only to the COUNT, and not to the SUM.  If we actually wanted to calculate our own average, we'd have to do this:

    SELECT device_id,
        sum(val) FILTER (WHERE valid)
            / count(*) FILTER (WHERE valid) as avg
    FROM dataflow_0913
    GROUP BY device_id
    ORDER BY device_id;
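You can sanity-check the arithmetic of the two versions outside the database. A small Python sketch with hypothetical (val, valid) rows:

```python
def averages(rows):
    """rows: (val, valid) pairs.  Shows why FILTER placement matters:
    filtering only the COUNT leaves invalid values in the numerator."""
    total_all = sum(val for val, _ in rows)
    total_valid = sum(val for val, valid in rows if valid)
    n_valid = sum(1 for _, valid in rows if valid)
    wrong = total_all / n_valid    # sum(val) / (count(*) FILTER ...)
    right = total_valid / n_valid  # both aggregates filtered
    return wrong, right
```

With rows [(10, True), (20, True), (100, False)], the misplaced FILTER gives 65.0 while the correct version gives 15.0.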

Hopefully that helps everyone who is using the new aggregates to use them correctly and not get mysterious errors.  In the meantime, we can see about making the error messages more helpful.

Tuesday, March 31, 2015

Primary Keyvil, reprised

Primary Keyvil was one of the most popular posts on my old blog.  Since the old blog has become somewhat inaccessible, and I recently did my Keyvil lightning talk again at pgConf NYC, I thought I'd reprint it here, updated and consolidated.

Two actual conversations I had on IRC ages ago (channel #postgresql), handles changed to protect the ignorant and edited for brevity:

    newbie1: schema design:

    agliodbs: hmmm ... why do you have an ID column
        in "states"?  You're not using it.

    newbie1: because I have to.

    agliodbs: you what?

    newbie1: the ID column is required for normalization.

    agliodbs chokes

    newbie2: how do I write a query to remove the duplicate rows?

    agliodbs: please post your table definition


    agliodbs: What's the key for "sessions"?

    newbie2: it has an "id" column

    agliodbs: Yes, but what's the real key? 
        Which columns determine a unique row?

    newbie2: I told you, the "id" column. 
        It's a primary key and everything.

    agliodbs: That's not going to help you 

        identify duplicate sessions. You need another
        key ... a unique constraint on real data columns, 
        not just an "id" column.

    newbie2: no I don't

    agliodbs: Good luck with your problem then.

The surrogate numeric key has been a necessary evil for as long as we've had SQL. It was set into SQL89 because the new SQL databases had to interact with older applications which expected "row numbers," and it continues because of poor vendor support for features like CASCADE.

Inevitably, practices which are "necessary evils" tend to become "pervasive evils" in the hands of the untrained and the lazy. Not realizing that ID columns are a pragmatic compromise with application performance, many frameworks and development pragmas have enshrined numeric IDs as part of the logical and theoretical model of their applications. Worse yet, even RDBMS book authors have instructed their readers to "always include an ID column," suturing this misunderstanding into the body of industry knowledge like a badly wired cybernetic implant.

What Are Numeric Surrogate Primary Keys, Exactly?

Before people post a lot of irrelevant arguments, let me be definitional: I'm talking about auto-numbering "ID" columns, like PostgreSQL's or Oracle's SERIAL and MySQL's AUTO_INCREMENT, or the various systems of UUID. Such columns are known as "surrogate keys" because they provide a unique handle for the row which has nothing to do with the row's data content. It is the abuse of these "numeric surrogate keys" which I am attacking in this column, not any other type of key.

Further, "keys" are real: a "key" is any combination of columns which forms a "predicate", or a set which uniquely identifies a tuple or row, of which there should be at least one per table. The concept of "primary key," however, has no intrinsic meaning in relational theory -- all keys are equal and no one of them is "primary". Instead, the idea of a "primary key" is based on the idea that one and only one key determines the physical order of the tuples on disk, something which relational theory specifically says we should ignore in the logical model of our data. Therefore primary keys are a specific violation of relational theory, a legacy of the days when most SQL databases were index-ordered.  Mind you, some of them still are.

Theory-Schmeery. Why Should We Care?

Since there has been a relational theory for over thirty years and an ANSI SQL standard for longer than we've had PCs, it's easy to forget that E.F. Codd created the relational model in order to cure major, chronic data management problems on the mainframes at IBM. Careless abandonment of the tenets of the relational model, then, risks repeating those same data management problems. These are not theoretical issues; these are real data issues that can cost your company a lot of money and you many weekends and late nights.

I'll give you an example from my own work history. We once developed a rather sophisticated web-based legal calendaring system for some multi-state, multi-firm litigation involving thousands of plaintiffs. Since there were multiple law firms involved, and some of them had a significant amount of partner turnover, the project suffered from horrible "spec drift," going 2 years late and $100,000 over budget. In the course of several hundred revisions to the database schema, the unique constraint to the central "events" table got removed. The spec committee didn't see this as a problem, because there was still the "event_id" column, which was the "primary key."

Then the duplicates started to appear.

It didn't take long (about 2 months) to discover that there was a serious problem with having "id" as the only unique column. We got multiple hearings scheduled on the calendar, in the same docket, on the same date or in the same place. Were these duplicates or two different hearings? We couldn't tell. The calendaring staff had to hire an extra person for six weeks just to call the legal staff on each case and weed out the duplicates. In the meantime, several attorneys drove hundreds of miles to show up for hearings which had been rescheduled or cancelled. The lead firm probably spent $40,000 getting the duplicates problem under control, not including wasted attorney time.

The essential problem is that an autonumber "id" column contains no information about the record to which it's connected, and tells you nothing about that record. It could be a duplicate, it could be unique, it could have ceased to exist if some idiot deleted the foreign key constraint.

A second example occurred when I and a different partner were working on an accounting application. We spent, off and on, about 6 weeks trying to track down an elusive error that would throw disbursements out of balance. When we found it, it turned out the problem was assigning an ID from a transaction record to a variable meant to hold a sub-transaction record, causing part of the disbursement to be assigned to the wrong transaction. Since all of the IDs in question were 4-byte integers, who could tell? They all looked the same, even in debug mode.

I am not saying that you should avoid autonumber surrogate keys like an Uber driver with a claw hammer. The danger is not in their use but in their abuse. The "events_id" column in the "events" table didn't give us any trouble until we began to rely on it as the sole key for the table. The accounting application gave us problems because we were using the ID as the entire handle for the records. That crossed the line from use to misuse, and we suffered for it.

Unfortunately, I'm seeing the design mistakes that I made in the past not only repeated wholesale by younger developers, but the rationales for them defended vigorously on the Internet and elsewhere.

Reasons to Use an Autonumber Surrogate Key

What follows are a number of reasons people have given me, on IRC and the mailing lists, for using autonumber keys. Some of them are "good" reasons which demonstrate an understanding of the costs and benefits. Others are "bad" reasons based on sloppy thinking or lack of training.  Form your own opinions before scrolling down.

Many-Column Keys
    The real key of the table has 3 or more columns and makes writing queries painful.

Table Size
    Since integers are smaller than most other types, using them makes the table, and my database, smaller. And a smaller database is faster.

Framework
    My web development framework requires that all tables have integer primary keys to do code generation from my database.

No Good Key
    My table has no combination of columns that makes a good natural key or unique index. Therefore I need to just use an ID.

Design Consistency
    Our technical specification requires all tables except join and load tables to have an "id" and a "name" column, each of which is unique.

Join/Sort Performance
    Integers sort and join much faster than large column types. So using integer primary keys gives me better query performance.

Design Principles
    Using an ID column in each table is an important principle of good relational database design, and is required for "normalization".  I read a book/web page/magazine article that says so.

DB Abstraction/ORM
    The database abstraction library (like PDO or ActiveRecord) I use requires integer primary keys.

SQL Standard
    The SQL Standard requires an ID column in each table.

Programmer Demands
    The PHP/Python/JS/Java guys on the interface team refuse to deal with different data types and multi-column keys, and/or want an integer to use as an "object ID."

Mutability
    The natural keys in my table can change, and IDs aren't allowed to change.

Reasons to Use an Autonumber Surrogate Key, Evaluated

Here's my evaluation of the various reasons above. You'll have your own opinions, of course, but read through this list to make sure that your design decisions are well-grounded.

Many-Column Keys
    It Depends. As much as this shouldn't be a reason, the rather verbose SQL join syntax and multicolumn index performance make it one. If SQL were more terse, and query executors better, this would evaporate as a reason.  Note that in some cases, though, it can still be better to use the multicolumn key, especially if you're partitioning on some of the inherited key values.

No Real Key
    Very, Very Bad. This is an example of exactly the kind of very bad database design that puts the application designers into several weekends of overtime down the line. Without any natural key ... even if you use a surrogate key for joins, etc. ... you have no way of telling which rows in your table are duplicates. Which means that you will get duplicates, many of them, and be unable to fix the data without significant and costly fieldwork to reexamine the sources of your data ("Mary? Do we have one or two John MacEnroes working for us?")
    Worse, these indicate that the developer does not really know his data and that the spec was never really hashed out. When I interrogate people claiming that there's "no real key" I generally find that it's not actually the case that there aren't any unique keys, it's that the developer doesn't know what they are. This is a weathervane for far more serious design problems.
    As Jeff Davis likes to say, "Conflicts will get worked out somewhere.  In general, it's far less expensive to work them out in the database than in the real world."
     Note that thanks to Exclusion Constraints, GIN indexes, and functional unique indexes, PostgreSQL is able to support complex criteria as keys of which other databases would not be capable.  So if you're using something else, there is the possibility of "I know the real key, but my database engine doesn't support it."
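To make the point concrete, here's a minimal sketch (using SQLite for a self-contained example; the table and columns are invented, loosely based on the calendaring story above) of how a natural unique constraint catches exactly the duplicates a surrogate key waves through:

```python
import sqlite3

# The surrogate "event_id" is always unique, so it blocks nothing; the
# unique constraint on real data columns is what stops duplicate rows.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_id     INTEGER PRIMARY KEY,  -- surrogate: says nothing about the row
        docket       TEXT NOT NULL,
        hearing_date TEXT NOT NULL,
        location     TEXT NOT NULL,
        UNIQUE (docket, hearing_date, location)  -- the real key
    )""")
conn.execute("INSERT INTO events (docket, hearing_date, location) "
             "VALUES ('14-1234', '2015-03-31', 'Dept 22')")
try:
    conn.execute("INSERT INTO events (docket, hearing_date, location) "
                 "VALUES ('14-1234', '2015-03-31', 'Dept 22')")
    duplicate_blocked = False
except sqlite3.IntegrityError:
    duplicate_blocked = True
```

Without the UNIQUE constraint, both inserts succeed and you're back to calling the legal staff to weed out duplicates by hand.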

External Requirements
    It Depends. The ORM, DB Abstraction and Programmer Demands arguments all amount to external requirements to use integer keys. Certainly a degree of genericization is necessary for any multi-purpose tool or code set. This is the most common reason for me when I succumb to using autonumber IDs. However, this should be a compelling reason only after you've evaluated the ORM, DB abstraction library and/or the staff involved to make sure that integer keys are a real requirement and that the tool/person will actually push your project forwards instead of becoming an obstacle.

Design Consistency
    Usually Bad. A scrupulous adherence to consistent design standards is generally a good thing.  However, the ID/Name requirement suggests that the architects haven't looked very hard at the application's actual requirements or the real structure of the data.

Standard Practice
    Bad. Both the SQL Standard and the Design Principles arguments are based on ignorance. Generally the developer using these rationales heard from a friend of a colleague who read someone's blog who took a course at the university that ID columns were a good idea. That some of these ignorant designers are also book and article authors is really tragic. For the record, neither the SQL Standard nor relational theory compels the use of surrogate keys. In fact, the papers which established relational theory don't even mention surrogate keys.

Mutability
    It Depends. It's an unfortunate reality that many SQL DBMSes do not support ON UPDATE CASCADE for foreign keys, and even those which do tend to be inefficient in executing it (this may be a reason to switch to PostgreSQL). As a result, real keys which change very frequently in large databases are generally not usable as join keys. However, I've seen this argument used for values which change extremely infrequently in small databases (like full names or SSNs in a small company personnel directory), which makes it just an excuse.
    Sometimes, however, this argument is based completely on the misinformation that keys are supposed to be invariable and immutable for the life of the record. Where this idea came from I'm not sure; certainly not from either the SQL standard or the writings of E.F. Codd. It's probably unthinking bleed-over from mis-applied OO design. If this is your reason for not using a real key, it's wrong.
    The other practical reason to require immutable keys is if you're using the keys as part of a cache invalidation or generic sharding system.  However, a smart design for such a system still doesn't use autonumber surrogate keys; instead, you have a synthetic key which carries information about the entity to which it is attached in compressed form, such as an application-specific hash or addressing system.
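A sketch of what such an informative synthetic key might look like; everything here (function name, shard scheme, digest length) is a hypothetical illustration:

```python
import hashlib

def entity_key(shard, entity_type, natural_key):
    """Hypothetical synthetic key: embeds routing information (shard,
    entity type) plus a short digest of the natural key, instead of a
    meaningless counter."""
    digest = hashlib.sha1(natural_key.encode("utf-8")).hexdigest()[:10]
    return f"{shard:02d}-{entity_type}-{digest}"
```

Unlike an autonumber, such a key tells you where the row lives and what it describes; note that a truncated digest can collide, so a real system would size the digest to its data.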

Performance
    Usually Bad. I've saved the Table Size and Join/Sort Performance arguments for last because performance is the most complex issue. The reason I say "usually bad" is that 80% of the time, the developer making these arguments has not actually tested his performance claims, on this or any other database. Premature optimization is the hobgoblin of database design.
    For data warehouses, the Table Size argument can be compelling, although it needs to be balanced against the need for more joins in real performance tests. For any other type of application ... or any database smaller than 10GB, period ... this argument is nonsense. Whether your web application database is 200MB or 230MB is not going to make an appreciable performance difference on any modern machine, but poor design choices can make an appreciable difference in downtime.
    Join/Sort performance is a bit more of a serious argument, depending on the size of the database, the number of columns, and the data types of the real key.  Note that using the natural key can, in many cases, allow you to avoid doing a join altogether, which can result in query speedups which outstrip any slowness due to column size.  I have refactored data warehouses to use natural keys precisely for this reason.
    If you think you need to use a surrogate key for database performance, test, then make a decision.


As I said earlier, it's not using autonumber surrogate keys (which are a necessary, pragmatic evil) but misusing them that causes pain, late nights and budget overruns. Just make sure that you're using integer keys in the right ways and for the right reasons, and always ask yourself if an autonumber surrogate key is really needed in each table where you put one.

Saturday, March 28, 2015

Crazy SQL Saturday: replacing SciPy with SQL

I have a data analytics project which produces multiple statistical metrics for a large volume of sensor data.  This includes percentiles (like median and 90%) as well as average, min and max.  Originally this worked using PL/R, which was pretty good except that some of the R modules were crashy, which was not so great for uptime.

This is why, two years ago, I ripped out all of the PL/R and replaced it with PL/Python and SciPy.  I love SciPy because it gives me everything I liked about R, without most of the things I didn't like.  But now, I've ripped out the SciPy as well.  What am I replacing it with?  Well, SQL.

In version 9.4, Andrew Gierth added support for percentiles to PostgreSQL via WITHIN GROUP aggregates. As far as I'm concerned, this is second only to JSONB in reasons to use 9.4.

Now, one of the more complicated uses I make of aggregates is doing "circular" aggregates, that is producing percentiles for a set of circular directions in an effort to determine the most common facings for certain devices.  Here's the PL/Python function I wrote for this, which calculates circular aggregates using the "largest gap" method.  This algorithm assumes that the heading measurements are essentially unordered, so to find the endpoints of the arc we look for two measurements which are the furthest apart on the circle.  This means shifting the measurements to an imaginary coordinate system where the edge of this gap is the low measurement, calculating percentiles, and then shifting it back.  Note that this method produces garbage if the device turned around a complete circle during the aggregate period.
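Since the linked function isn't reproduced here, this is a rough Python sketch of the largest-gap idea as described (hypothetical helper, headings in degrees in [0, 360)):

```python
import math

def circular_percentile(headings, pct):
    """Largest-gap method: find the widest gap between sorted headings,
    treat the heading just past it as the low end of the arc, shift onto
    a linear scale, take a discrete percentile, and shift back."""
    hs = sorted(h % 360 for h in headings)
    n = len(hs)
    # gap after each heading, including the wrap-around from last to first
    gaps = [(hs[(i + 1) % n] - hs[i]) % 360 for i in range(n)]
    widest = max(range(n), key=gaps.__getitem__)
    origin = hs[(widest + 1) % n]  # low edge just past the widest gap
    shifted = sorted((h - origin) % 360 for h in hs)
    # discrete percentile, like percentile_disc: first value at or past pct
    idx = max(math.ceil(pct * n) - 1, 0)
    return (shifted[idx] + origin) % 360
```

For headings clustered around north, e.g. [350, 355, 0, 5, 10], the 50th percentile comes out as 0 rather than the nonsensical linear median of 10. And as noted above, the method produces garbage if the device swept a full circle during the period.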

Now, that SciPy function was pretty good and we used it for quite a while.  But we were unhappy with two things: first, SciPy is rather painful as a dependency because the packaging for it is terrible; second, having PostgreSQL call out to SciPy for each iteration isn't all that efficient.

So, since 9.4 has percentiles now, I started writing a function based on the built-in SQL percentiles.  Initially I was thinking it would be a PL/pgSQL function, but was pleasantly surprised to find that I could write it entirely as a SQL function!  Truly, Postgres's SQL dialect is Turing-complete.

So here's the new all-SQL function, with some helper functions.

Then I performance tested it, and was pleasantly surprised again.  The SciPy version took 2.6 seconds* to aggregate 100,000 sets of 20 measurements.  The new SQL version takes 40 milliseconds, cutting response time by 98%.  Wow!

And I've eliminated a hard-to-install dependency.  So it's all win.  Of course, if anyone has ideas on making it even faster, let me know.

Pushing the limits of SQL to the edge of insanity.

(* note: I expect that most of the extra time for the SciPy version is in calling out to Python through PL/Python, rather than in SciPy itself.)

Thursday, March 19, 2015

PostgreSQL data migration hacks from Tilt

Since the folks at Tilt aren't on Planet Postgres, I thought I'd link their recent blog post on cool data migration hacks.  Tilt is a YCombinator company, and a SFPUG supporter.

Monday, March 16, 2015

Benchmarking Postgres in the Cloud, part 1

In 2008, when Heroku started, there were few real options for cloud-hosting PostgreSQL: roll-your-own on EC2, or a couple of other not-very-competitive platforms.  Since then, we've seen the number of cloud hosting providers explode, and several "PostgreSQL-as-a-Service" providers appear as well: first Heroku, then Gandi, CloudFoundry, RDS, OpenShift and more.  This has led many of pgExperts' clients to ask: "Where should I be hosting my PostgreSQL?"

So, to provide a definitive answer to that question, for the past several weeks I've been doing head-to-head testing of different cloud hosting options for PostgreSQL.  Even more work has been done by my collaborator, Ruben Rudio Rey.  I will be presenting the results of this testing in a series of blog posts, together with a series of presentations starting at SCALE and going through pgConf NYC, LinuxFest Northwest, and culminating at pgCon.  Each presentation will add new tests and new data.

Here's my slides from SCALE, which compare AWS, RDS, and Heroku, if you want to get some immediate data.

What We're Testing

The idea is to run benchmarks against ephemeral instances of PostgreSQL 9.3 on each cloud or service.  Our main goal is to collect performance figures, since while features and pricing are publicly available, performance information is not.  And even when the specification is the same, the actual throughput is not.  From each cloud or service, we are testing two different instance sizes:

Small: 1-2 cores, 3 to 4GB RAM, low throughput storage (compare EC2's m3.medium).  This is the "economy" instance for running PostgreSQL; it's intended to represent what people with non-critical PostgreSQL instances buy, and to answer the question of "how much performance can I get for cheap".

Large: 8-16 cores, 48 to 70GB RAM, high throughput storage (compare EC2's r3.2xlarge).  This is the maximum for a "high end" instance which we could afford to test in our test runs. 

The clouds we're testing or plan to test include:
  • AWS EC2 "roll-your-own".
  • Amazon RDS PostgreSQL
  • Heroku
  • Google Compute Engine
  • DigitalOcean
  • Rackspace Cloud
  • OpenShift PostgreSQL Cartridge
  • (maybe Joyent, not sure)
Note that in many cases we're working with the cloud vendor to achieve maximum performance results.  Our goal here isn't to "blind test" the various clouds, but rather to try to realistically deliver the best performance we can get on that platform.  In at least one case, our findings have resulted in the vendor making improvements to their cloud platform, which then allowed us to retest with better results.

The tests we're running include three pgbench runs:

  • In-Memory, Read-Write (IMRW): pgbench database 30% to 60% of the size of RAM, full transaction workload
  • In-Memory, Read-Only (IMRO): pgbench database 30% to 60% of RAM, read-only queries
  • On-Disk, Read-Write (ODRW): pgbench database 150% to 250% of RAM, full transactions
The idea here is to see the different behavior profiles with WAL-bound, CPU-bound, and storage-bound workloads.  We're also recording the load time for each database, since bulk loading behavior is useful information for each platform. 

Each combination of cloud/size/test needs to then be run at least 5 times in order to get a statistically useful sample.  As I will document later, often the difference between runs on the same cloud was greater than the difference between clouds.

Issues with pgBench as a Test Tool

One of the early things I discovered was some of the limitations of what pgbench could tell us.  Its workload is 100% random access and homogeneous one-liner queries.  It's also used extensively and automatically to test PostgreSQL performance.  As a result, we found that postgresql.conf tuning made little or no difference at all, so our original plan to test "tuned" vs. "untuned" instances went by the wayside.

We also found on public clouds that, because of the rapid-fire nature of pgbench queries, performance was dominated by network response times more than anything else on most workloads.  We did not use pgbench-tools, because it is concerned with automating many test runs against one host, rather than a few test runs against many hosts.

For this reason, we also want to run a different, more "serious" benchmark which works out other performance areas.  To support this, I'm working on deploying Jignesh's build of DVDStore so that I can do that benchmark against the various platforms.  This will require some significant work to make a reality, though; I will need to create images or deployment tools on all of the platforms I want to test before I can do it.

To be continued ...