Tuesday, March 31, 2015

Primary Keyvil, reprised

Primary Keyvil was one of the most popular posts on my old blog.  Since the old blog has become somewhat inaccessible, and I recently did my Keyvil lightning talk again at pgConf NYC, I thought I'd reprint it here, updated and consolidated.

Two actual conversations I had on IRC ages ago, handles changed to protect the ignorant, and edited for brevity (irc.freenode.net, channel #postgresql):

    newbie1: (pastes schema design)

    agliodbs: hmmm ... why do you have an ID column
        in "states"?  You're not using it.

    newbie1: because I have to.

    agliodbs: you what?

    newbie1: the ID column is required for normalization.

    agliodbs chokes

    newbie2: how do I write a query to remove the duplicate rows?

    agliodbs: please post your table definition

    newbie2: http://www.rafb.net/paste/results/Hk90fz88.html

    agliodbs: What's the key for "sessions"?

    newbie2: it has an "id" column

    agliodbs: Yes, but what's the real key? 
        Which columns determine a unique row?

    newbie2: I told you, the "id" column. 
        It's a primary key and everything.

    agliodbs: That's not going to help you
        identify duplicate sessions. You need another
        key ... a unique constraint on real data columns,
        not just an "id" column.

    newbie2: no I don't

    agliodbs: Good luck with your problem then.

The surrogate numeric key has been a necessary evil for as long as we've had SQL. It was set into SQL89 because the new SQL databases had to interact with older applications which expected "row numbers," and it continues because of poor vendor support for features like CASCADE.

Inevitably, practices which are "necessary evils" tend to become "pervasive evils" in the hands of the untrained and the lazy. Not realizing that ID columns are a pragmatic compromise with application performance, many frameworks and development pragmas have enshrined numeric IDs as part of the logical and theoretical model of their applications. Worse yet, even RDBMS book authors have instructed their readers to "always include an ID column," suturing this misunderstanding into the body of industry knowledge like a badly wired cybernetic implant.

What Are Numeric Surrogate Primary Keys, Exactly?

Before people post a lot of irrelevant arguments, let me be definitional: I'm talking about auto-numbering "ID" columns, like PostgreSQL's SERIAL, MySQL's AUTO_INCREMENT, and Oracle's sequences, or the various flavors of UUID. Such columns are known as "surrogate keys" because they provide a unique handle for the row which has nothing to do with the row's data content. It is the abuse of these "numeric surrogate keys" which I am attacking in this column, not any other type of key.

Further, "keys" are real: a "key" is any combination of columns which uniquely identifies a tuple, or row, and every table should have at least one. The concept of "primary key," however, has no intrinsic meaning in relational theory -- all keys are equal and no one of them is "primary". Instead, the idea of a "primary key" is based on the idea that one and only one key determines the physical order of the tuples on disk, something which relational theory specifically says we should ignore in the logical model of our data. Therefore primary keys are a specific violation of relational theory, a legacy of the days when most SQL databases were index-ordered.  Mind you, some of them still are.

Theory-Schmeery. Why Should We Care?

Since there has been a relational theory for over thirty years and an ANSI SQL standard for longer than we've had PCs, it's easy to forget that E.F. Codd created the relational model in order to cure major, chronic data management problems on the mainframes at IBM. Careless abandonment of the tenets of the relational model, then, risks repeating those same data management problems. These are not theoretical issues; these are real data issues that can cost your company a lot of money and you many weekends and late nights.

I'll give you an example from my own work history. We once developed a rather sophisticated web-based legal calendaring system for some multi-state, multi-firm litigation involving thousands of plaintiffs. Since there were multiple law firms involved, and some of them had a significant amount of partner turnover, the project suffered from horrible "spec drift," going 2 years late and $100,000 over budget. In the course of several hundred revisions to the database schema, the unique constraint to the central "events" table got removed. The spec committee didn't see this as a problem, because there was still the "event_id" column, which was the "primary key."

Then the duplicates started to appear.

It didn't take long (about 2 months) to discover that there was a serious problem with having "id" as the only unique column. We got multiple hearings scheduled on the calendar, in the same docket, on the same date or in the same place. Were these duplicates or two different hearings? We couldn't tell. The calendaring staff had to hire an extra person for six weeks just to call the legal staff on each case and weed out the duplicates. In the meantime, several attorneys drove hundreds of miles to show up for hearings which had been rescheduled or cancelled. The lead firm probably spent $40,000 getting the duplicates problem under control, not including wasted attorney time.

The essential problem is that an autonumber "id" column contains no information about the record to which it's connected, and tells you nothing about that record. It could be a duplicate, it could be unique, it could have ceased to exist if some idiot deleted the foreign key constraint.

A second example occurred when I and a different partner were working on an accounting application. We spent, off and on, about 6 weeks trying to track down an elusive error that would throw disbursements out of balance. When we found it, it turned out the problem was assigning an ID from a transaction record to a variable meant to hold a sub-transaction record, causing part of the disbursement to be assigned to the wrong transaction. Since all of the IDs in question were 4-byte integers, who could tell? They all looked the same, even in debug mode.

I am not saying that you should avoid autonumber surrogate keys like an Uber driver with a claw hammer. The danger is not in their use but in their abuse. The "events_id" column in the "events" table didn't give us any trouble until we began to rely on it as the sole key for the table. The accounting application gave us problems because we were using the ID as the entire handle for the records. That crossed the line from use to misuse, and we suffered for it.

Unfortunately, I'm seeing design mistakes that I made in the past not only being repeated wholesale by younger developers, but the rationales for them being defended vigorously on the Internet and elsewhere.

Reasons to Use an Autonumber Surrogate Key

What follows are a number of reasons people have given me, on IRC and the PostgreSQL.org mailing lists, for using autonumber keys. Some of them are "good" reasons which demonstrate an understanding of the costs and benefits. Others are "bad" reasons based on sloppy thinking or lack of training.  Form your own opinions before scrolling down.

Many-Column Keys
    The real key of the table has 3 or more columns and makes writing queries painful.

Table Size
    Since integers are smaller than most other types, using them makes the table, and my database, smaller. And a smaller database is faster.

    My web development framework requires that all tables have integer primary keys to do code generation from my database.

No Good Key
    My table has no combination of columns that makes a good natural key or unique index. Therefore I need to just use an ID.

    Our technical specification requires all tables except join and load tables to have an "id" and a "name" column, each of which is unique.

Join/Sort Performance
    Integers sort and join much faster than large column types. So using integer primary keys gives me better query performance.

Design Principles
    Using an ID column in each table is an important principle of good relational database design, and is required for "normalization".  I read a book/web page/magazine article that says so.

DB Abstraction/ORM
    The database abstraction library (like PDO or ActiveRecord) I use requires integer primary keys.

SQL Standard
    The SQL Standard requires an ID column in each table.

Programmer Demands
    The PHP/Python/JS/Java guys on the interface team refuse to deal with different data types and multi-column keys, and/or want an integer to use as an "object ID."

    The natural keys in my table can change, and IDs aren't allowed to change.

Reasons to Use an Autonumber Surrogate Key, Evaluated

Here's my evaluation of the various reasons above. You'll have your own opinions, of course, but read through this list to make sure that your design decisions are well-grounded.

Many-Column Keys
    It Depends. As much as this shouldn't be a reason, the rather verbose SQL join syntax and multicolumn index performance make it one. If SQL were more terse, and query executors better, this would evaporate as a reason.  Note that in some cases, though, it can still be better to use the multicolumn key, especially if you're partitioning on some of the inherited key values.

No Real Key
    Very, Very Bad. This is an example of exactly the kind of very bad database design that puts the application designers into several weekends of overtime down the line. Without any natural key ... even if you use a surrogate key for joins, etc. ... you have no way of telling which rows in your table are duplicates. Which means that you will get duplicates, many of them, and be unable to fix the data without significant and costly fieldwork to reexamine the sources of your data ("Mary? Do we have one or two John MacEnroes working for us?")
    Worse, these indicate that the developer does not really know his data and that the spec was never really hashed out. When I interrogate people claiming that there's "no real key" I generally find that it's not actually the case that there aren't any unique keys, it's that the developer doesn't know what they are. This is a weathervane for far more serious design problems.
    As Jeff Davis likes to say, "Conflicts will get worked out somewhere.  In general, it's far less expensive to work them out in the database than in the real world."
    Note that thanks to Exclusion Constraints, GIN indexes, and functional unique indexes, PostgreSQL can support complex criteria as keys which other databases cannot.  So if you're using something else, there is the possibility of "I know the real key, but my database engine doesn't support it."

External Requirements
    It Depends. The ORM, DB Abstraction and Programmer Demands arguments all amount to external requirements to use integer keys. Certainly a degree of genericization is necessary for any multi-purpose tool or code set. This is the most common reason for me when I succumb to using autonumber IDs. However, this should be a compelling reason only after you've evaluated the ORM, DB abstraction library and/or the staff involved to make sure that integer keys are a real requirement and that the tool/person will actually push your project forwards instead of becoming an obstacle.

    Usually Bad. A scrupulous adherence to consistent design standards is generally a good thing.  However, the ID/Name requirement suggests that the architects haven't looked very hard at the application's actual requirements or the real structure of the data.

Standard Practice
    Bad. Both the SQL Standard and the Design Principles arguments are based on ignorance. Generally the developer using these rationales heard from a friend of a colleague who read someone's blog who took a course at university that ID columns were a good idea. That some of these ignorant designers are also book and article authors is really tragic. For the record, neither the SQL Standard nor relational theory compels the use of surrogate keys. In fact, the papers which established relational theory don't even mention surrogate keys.

    It Depends. It's an unfortunate reality that many SQL DBMSes do not support ON UPDATE CASCADE for foreign keys, and even those which do tend to be inefficient in executing it (this may be a reason to switch to PostgreSQL). As a result, real keys which change very frequently in large databases are generally not usable as join keys. However, I've seen this argument used for values which change extremely infrequently in small databases (like for full names or SSNs in a small company personnel directory), which makes it just an excuse.
    Sometimes, however, this argument is based completely on the misinformation that keys are supposed to be invariable and immutable for the life of the record. Where this idea came from I'm not sure; certainly not from either the SQL standard or the writings of E.F. Codd. It's probably unthinking bleed-over from mis-applied OO design. If this is your reason for not using a real key, it's wrong.
    The other practical reason to require immutable keys is if you're using the keys as part of a cache invalidation or generic sharding system.  However, a smart design for such a system still doesn't use autonumber surrogate keys; instead, you have a synthetic key which carries information about the entity to which it is attached in compressed form, such as an application-specific hash or addressing system.
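As an illustration of that last idea (and only an illustration -- the column names and key format here are hypothetical, not from any system described above), a synthetic key derived by hashing the natural key columns might look like this in Python:

```python
import hashlib

def synthetic_key(docket: str, hearing_date: str, location: str) -> str:
    """Derive a deterministic synthetic key from the natural key columns.

    Unlike an autonumber ID, the same real-world entity always hashes to
    the same key, so accidental duplicates collide instead of hiding.
    """
    natural_key = "|".join([docket, hearing_date, location])
    return hashlib.sha256(natural_key.encode("utf-8")).hexdigest()[:16]

# The same natural key always yields the same synthetic key ...
assert synthetic_key("14-cv-0231", "2015-03-16", "Oakland") == \
       synthetic_key("14-cv-0231", "2015-03-16", "Oakland")

# ... while a row with different real data gets a different key.
assert synthetic_key("14-cv-0231", "2015-03-17", "Oakland") != \
       synthetic_key("14-cv-0231", "2015-03-16", "Oakland")
```

The point is that the key is a function of the data, so it can never silently diverge from it the way a bare autonumber can.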

    Usually Bad. I've saved the Table Size and Join/Sort Performance for last because performance is the most complex issue. The reason I say "usually bad" is that 80% of the time the developer making these arguments has not actually tested his performance claims, on this or any other database. Premature optimization is the hobgoblin of database design.
    For data warehouses, the Table Size argument can be compelling, although it needs to be balanced against the need for more joins in real performance tests. For any other type of application ... or any database smaller than 10GB, period ... this argument is nonsense. Whether your web application database is 200MB or 230MB is not going to make an appreciable performance difference on any modern machine, but poor design choices can make an appreciable difference in downtime.
    Join/Sort performance is a bit more of a serious argument, depending on the size of the database, the number of columns and data types of the real key.  Note that using the natural key can, in many cases, allow you to avoid doing a join altogether, which can result in query speedups which outstrip any slowness due to column size.  I have refactored data warehouses to use natural keys precisely for this reason.
    If you think you need to use a surrogate key for database performance, test, then make a decision.


As I said earlier, it's not using autonumber surrogate keys (which are a necessary, pragmatic evil) but misusing them that causes pain, late nights and budget overruns. Just make sure that you're using integer keys in the right ways and for the right reasons, and always ask yourself if an autonumber surrogate key is really needed in each table where you put one.

Saturday, March 28, 2015

Crazy SQL Saturday: replacing SciPy with SQL

I have a data analytics project which produces multiple statistical metrics for a large volume of sensor data.  This includes percentiles (like median and 90%) as well as average, min and max.  Originally this worked using PL/R, which was pretty good except that some of the R modules were crashy, which was not so great for uptime.

This is why, two years ago, I ripped out all of the PL/R and replaced it with PL/Python and SciPy.  I love SciPy because it gives me everything I liked about R, without most of the things I didn't like.  But now, I've ripped out the SciPy as well.  What am I replacing it with?  Well, SQL.

In version 9.4, Andrew Gierth added support for percentiles to PostgreSQL via WITHIN GROUP aggregates. As far as I'm concerned, this is second only to JSONB in reasons to use 9.4.

Now, one of the more complicated uses I make of aggregates is doing "circular" aggregates, that is producing percentiles for a set of circular directions in an effort to determine the most common facings for certain devices.  Here's the PL/Python function I wrote for this, which calculates circular aggregates using the "largest gap" method.  This algorithm assumes that the heading measurements are essentially unordered, so to find the endpoints of the arc we look for two measurements which are the furthest apart on the circle.  This means shifting the measurements to an imaginary coordinate system where the edge of this gap is the low measurement, calculating percentiles, and then shifting it back.  Note that this method produces garbage if the device turned around a complete circle during the aggregate period.
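The original PL/Python function isn't reproduced here, but the largest-gap method described above can be sketched in plain Python roughly like this (an illustrative approximation, not the production code):

```python
def circular_percentiles(headings, pcts=(0.25, 0.5, 0.75)):
    """Percentiles of compass headings (degrees) via the largest-gap method.

    Find the widest empty arc on the circle, treat the heading just past it
    as the zero point of a linear scale, take ordinary interpolated
    percentiles there, then shift the results back to true headings.
    """
    hs = sorted(h % 360 for h in headings)
    n = len(hs)
    # Gaps between consecutive sorted headings, including the wraparound gap.
    gaps = [(hs[(i + 1) % n] - hs[i]) % 360 for i in range(n)]
    widest = max(range(n), key=gaps.__getitem__)
    # The measurement just past the widest gap becomes the origin.
    origin = hs[(widest + 1) % n]
    shifted = sorted((h - origin) % 360 for h in hs)
    results = []
    for p in pcts:
        # Linear interpolation between ranks, like percentile_cont.
        idx = p * (n - 1)
        lo, hi = int(idx), min(int(idx) + 1, n - 1)
        frac = idx - lo
        val = shifted[lo] + (shifted[hi] - shifted[lo]) * frac
        results.append((val + origin) % 360)
    return results

# Headings straddling north: the median of 350, 10, 20 comes out as 10,
# not the naive linear median of 20.
print(circular_percentiles([350, 10, 20], pcts=(0.5,)))  # [10.0]
```

As noted above, this produces garbage if the device swept a full circle during the period, since then there is no meaningful "largest gap."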

Now, that SciPy function was pretty good and we used it for quite a while.  But we were unhappy with two things: first, SciPy is rather painful as a dependency because the packaging for it is terrible; second, having PostgreSQL call out to SciPy for each iteration isn't all that efficient.

So, since 9.4 has percentiles now, I started writing a function based on the built-in SQL percentiles.  Initially I was thinking it would be a PL/pgSQL function, but was pleasantly surprised to find that I could write it entirely as a SQL function!  Truly, Postgres's SQL dialect is Turing-complete.

So here's the new all-SQL function, with some helper functions.

Then I performance tested it, and was pleasantly surprised again.  The SciPy version took 2.6 seconds* to aggregate 100,000 sets of 20 measurements.  The new SQL version takes 40 milliseconds, cutting response time by 98%.  Wow!

And I've eliminated a hard-to-install dependency.  So it's all win.  Of course, if anyone has ideas on making it even faster, let me know.

Pushing the limits of SQL to the edge of insanity.

(* note: I expect that most of the extra time for the SciPy version is in calling out to Python through PL/Python, rather than in SciPy itself.)

Thursday, March 19, 2015

PostgreSQL data migration hacks from Tilt

Since the folks at Tilt.com aren't on Planet Postgres, I thought I'd link their recent blog post on cool data migration hacks.  Tilt is a YCombinator company, and a SFPUG supporter.

Monday, March 16, 2015

Benchmarking Postgres in the Cloud, part 1

In 2008, when Heroku started, there were few real options for cloud-hosting PostgreSQL: roll-your-own on EC2, or a couple of other not-very-competitive platforms.  Since then, we've seen the number of cloud hosting providers explode, and several "PostgreSQL-as-a-Service" providers appear as well: first Heroku, then Gandi, CloudFoundry, RDS, OpenShift and more.  This has led many of pgExperts' clients to ask: "Where should I be hosting my PostgreSQL?"

So to provide a definitive answer to that question, for the past several weeks I've been doing some head-to-head testing of different cloud hosting options for PostgreSQL.  Even more work has been done by my collaborator, Ruben Rudio Rey of ManageACloud.com.  I will be presenting on the results of this testing in a series of blog posts, together with a series of presentations starting at SCALE and going through pgConf NYC, LinuxFestNorthWest, and culminating at pgCon.   Each presentation will add new tests and new data.

Here's my slides from SCALE, which compare AWS, RDS, and Heroku, if you want to get some immediate data.

What We're Testing

The idea is to run benchmarks against ephemeral instances of PostgreSQL 9.3 on each cloud or service.  Our main goal is to collect performance figures, since while features and pricing are publicly available, performance information is not.  And even when the specification is the same, the actual throughput is not.  From each cloud or service, we are testing two different instance sizes:

Small: 1-2 cores, 3 to 4GB RAM, low throughput storage (compare EC2's m3.medium).  This is the "economy" instance for running PostgreSQL; it's intended to represent what people with non-critical PostgreSQL instances buy, and to answer the question of "how much performance can I get for cheap".

Large: 8-16 cores, 48 to 70GB RAM, high throughput storage (compare EC2's r3.2xlarge).  This is the maximum for a "high end" instance which we could afford to test in our test runs. 

The clouds we're testing or plan to test include:
  • AWS EC2 "roll-your-own".
  • Amazon RDS PostgreSQL
  • Heroku
  • Google Compute Engine
  • DigitalOcean
  • Rackspace Cloud
  • OpenShift PostgreSQL Cartridge
  • (maybe Joyent, not sure)
Note that in many cases we're working with the cloud vendor to achieve maximum performance results.  Our goal here isn't to "blind test" the various clouds, but rather to try to realistically deliver the best performance we can get on that platform.  In at least one case, our findings have resulted in the vendor making improvements to their cloud platform, which then allowed us to retest with better results.

The tests we're running include three pgbench runs:

  • In-Memory, Read-Write (IMRW): pgbench database 30% to 60% of the size of RAM, full transaction workload
  • In-Memory, Read-Only (IMRO): pgbench database 30% to 60% of RAM, read-only queries
  • On-Disk, Read-Write (ODRW): pgbench database 150% to 250% of RAM, full transactions
The idea here is to see the different behavior profiles with WAL-bound, CPU-bound, and storage-bound workloads.  We're also recording the load time for each database, since bulk loading behavior is useful information for each platform. 
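As a rough illustration of the sizing above, here's how one might pick a pgbench scale factor for a target database-to-RAM ratio. This assumes the commonly cited approximation that each pgbench scale unit adds on the order of 15MB of data; measure on your own system before relying on that number:

```python
def pgbench_scale(ram_gb: float, pct_of_ram: float,
                  mb_per_scale: float = 15.0) -> int:
    """Rough pgbench scale factor so the database is a target fraction of RAM.

    mb_per_scale is an approximation (~15MB of data per scale unit);
    the real figure varies by PostgreSQL version and fill factor.
    """
    target_mb = ram_gb * 1024 * pct_of_ram
    return max(1, round(target_mb / mb_per_scale))

# e.g. a "small" instance with 4GB RAM, in-memory test at 45% of RAM:
print(pgbench_scale(4, 0.45))   # a scale factor in the low hundreds

# "large" instance with 64GB RAM, on-disk test at 200% of RAM:
print(pgbench_scale(64, 2.0))
```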

Each combination of cloud/size/test needs to then be run at least 5 times in order to get a statistically useful sample.  As I will document later, often the difference between runs on the same cloud was greater than the difference between clouds.
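To illustrate that last point with made-up numbers (these are not our actual results), Python's statistics module makes the within-cloud versus between-cloud comparison easy to sketch:

```python
import statistics

# Hypothetical pgbench results (TPS), 5 runs per cloud -- illustrative only.
runs = {
    "cloud_a": [1480, 1720, 1350, 1810, 1540],
    "cloud_b": [1510, 1690, 1400, 1760, 1580],
}

means = {cloud: statistics.mean(r) for cloud, r in runs.items()}
stdevs = {cloud: statistics.stdev(r) for cloud, r in runs.items()}

# Here the gap between the two clouds' means is tiny compared to the
# run-to-run spread within either cloud, so 5+ samples are essential
# before declaring a winner.
gap_between = abs(means["cloud_a"] - means["cloud_b"])
for cloud in runs:
    print(cloud, round(means[cloud]), "+/-", round(stdevs[cloud]))
print("gap between clouds:", gap_between)
```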

Issues with pgBench as a Test Tool

One of the early things I discovered was some of the limitations of what pgbench could tell us.  Its workload is 100% random access and homogeneous one-liner queries.  It's also used extensively and automatically to test PostgreSQL performance, which means PostgreSQL's defaults are already well-adapted to it.  As a result, we found that postgresql.conf tuning made little or no difference at all, so our original plan to test "tuned" vs. "untuned" instances went by the wayside.

We also found on public clouds that, because of the rapid-fire nature of pgbench queries, performance was dominated by network response times more than anything else on most workloads.  We did not use pgbench_tools, because that is concerned with automating many test runs against one host rather than a few test runs against many hosts.

For this reason, we also want to run a different, more "serious" benchmark which works out other performance areas.  To support this, I'm working on deploying Jignesh's build of DVDStore so that I can do that benchmark against the various platforms.  This will require some significant work to make a reality, though; I will need to create images or deployment tools on all of the platforms I want to test before I can do it.

To be continued ...

Friday, March 6, 2015

Fancy SQL Friday: subtracting arrays

Here's one which just came up:  how to see all of the elements in a new array which were not in the old array.  This isn't currently supported by any of PostgreSQL's array operators, but thanks to UNNEST() and custom operators, you can create your own:

    create or replace function diff_elements_text (
        text[], text[] )
    returns text[]
    language sql
    as $f$
    SELECT array_agg(DISTINCT new_arr.elem)
    FROM unnest($1) as new_arr(elem)
        LEFT OUTER JOIN
        unnest($2) as old_arr(elem)
        ON new_arr.elem = old_arr.elem
    WHERE old_arr.elem IS NULL;
    $f$;

    create operator - (
        procedure = diff_elements_text,
        leftarg = text[],
        rightarg = text[]
    );
Now you can just subtract text arrays:

    josh=# select array['n','z','d','e'] - array['a','n','z'];
     ?column?
    ----------
     {d,e}
    (1 row)

Unfortunately, you'll need to create a new function and operator for each base type; I haven't been able to get it to work with "anyarray".  But this should save you some time/code on array comparisons.  Enjoy!
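Incidentally, the semantics here are just set difference, which may help if you're porting logic out of the database; a Python equivalent for comparison (ordering aside -- illustrative only):

```python
def diff_elements(new_arr, old_arr):
    """Elements of new_arr that are absent from old_arr, deduplicated.

    Mirrors the SQL diff_elements_text function above: a set difference
    of the two arrays, with duplicates collapsed.
    """
    return sorted(set(new_arr) - set(old_arr))

# Mirrors: select array['n','z','d','e'] - array['a','n','z'];
print(diff_elements(['n', 'z', 'd', 'e'], ['a', 'n', 'z']))  # ['d', 'e']
```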