Tuesday, March 31, 2015

Primary Keyvil, reprised

Primary Keyvil was one of the most popular posts on my old blog.  Since the old block has become somewhat inaccessible, and I recently did my Keyvil lightning talk again at pgConf NYC, I thought I'd reprint it here, updated and consolidated.

Two actual conversations I had on IRC ages ago, handles changed to protect the ignorant, and edited for brevity (irc.freenode.net, channel #postgresql):


    newbie1: schema design:
      http://www.rafb.net/paste/results/Bk90sz89.html

    agliodbs: hmmm ... why do you have an ID column
        in "states"?  You're not using it.

    newbie1: because I have to.

    agliodbs: you what?

    newbie1: the ID column is required for normalization.

    agliodbs chokes




    newbie2: how do I write a query to remove the duplicate rows?

    agliodbs: please post your table definition

    newbie2: http://www.rafb.net/paste/results/Hk90fz88.html

    agliodbs: What's the key for "sessions"?

    newbie2: it has an "id" column

    agliodbs: Yes, but what's the real key? 
        Which columns determine a unique row?

    newbie2: I told you, the "id" column. 
        It's a primary key and everything.

    agliodbs: That's not going to help you 

        identify duplicate sessions. You need another
        key ... a unique constraint on real data columns, 
        not just an "id" column.

    newbie2: no I don't

    agliodbs: Good luck with your problem then.



The surrogate numeric key has been a necessary evil for as long as we've had SQL. It was set into SQL89 because the new SQL databases had to interact with older applications which expected "row numbers," and it continues because of poor vendor support for features like CASCADE.

Inevitably, practices which are "necessary evils" tend to become "pervasive evils" in the hands of the untrained and the lazy. Not realizing that ID columns are a pragmatic compromise with application performance, many frameworks and development pragmas have enshrined numeric IDs as part of the logical and theoretical model of their applications. Worse yet, even RDBMS book authors have instructed their readers to "always include an ID column," suturing this misunderstanding into the body of industry knowledge like a badly wired cybernetic implant.

What Are Numeric Surrogate Primary Keys, Exactly?


Before people post a lot of irrelevant arguments, let me be definitional: I'm talking about auto-numbering "ID" columns, like PostgreSQL's or Oracle's SERIAL and MySQL's AUTO_INCREMENT, or the various systems of UUID. Such columns are known as "surrogate keys" because they provide a unique handle for the row which has nothing to do with the row's data content. It is the abuse of these "numeric surrogate keys" which I am attacking in this column, not any other type of key.

Further, "keys" are real: a "key" is any combination of columns which forms a "predicate", or a set which uniquely identifies a tuple or row, of which there should be at least one per table. The concept of "primary key," however, has no intrinsic meaning in relational theory -- all keys are equal and no one of them is "primary". Instead, the idea of a "primary key" is based on the idea that one and only one key determines the physical order of the tuples on disk, something which relational theory specifically says we should ignore in the logical model of our data. Therefore primary keys are a specific violation of relational theory, a legacy of the days when most SQL databases were index-ordered.  Mind you, some of them still are.

Theory-Schmeery. Why Should We Care?


Since there has been a relational theory for over thirty years and an ANSI SQL standard for longer than we've had PCs, it's easy to forget that E.F. Codd created the relational model in order to cure major, chronic data management problems on the mainframes at IBM. Careless abandonment of tenets of the relational model, then, risk repeating those same data management problems. These are not theoretical issues; these are real data issues that can cost your company a lot of money and you many weekends and late nights.

I'll give you an example from my own work history. We once developed a rather sophisticated web-based legal calendaring system for some multi-state, multi-firm litigation involving thousands of plaintiffs. Since there were multiple law firms involved, and some of them had a significant amount of partner turnover, the project suffered from horrible "spec drift," going 2 years late and $100,000 over budget. In the course of several hundred revisions to the database schema, the unique constraint to the central "events" table got removed. The spec committee didn't see this as a problem, because there was still the "event_id" column, which was the "primary key."

Then the duplicates started to appear.

It didn't take long (about 2 months) to discover that there was a serious problem with having "id" as the only unique column. We got multiple hearings scheduled on the calendar, in the same docket, on the same date or in the same place. Were these duplicates or two different hearings? We couldn't tell. The calendaring staff had to hire an extra person for six weeks just to call the legal staff on each case and weed out the duplicates. In the meantime, several attorneys drove hundreds of miles to show up for hearings which had been rescheduled or cancelled. The lead firm probably spent $40,000 getting the duplicates problem under control, not including wasted attorney time.

The essential problem is that an autonumber "id" column contains no information about the record to which it's connected, and tells you nothing about that record. It could be a duplicate, it could be unique, it could have ceased to exist if some idiot deleted the foreign key constraint.

A second example occurred when I and a different partner were working on an accounting application. We spent, off and on, about 6 weeks trying to track down an elusive error that would throw disbursements out of balance. When we found it, it turned out the problem was assigning an ID from an transaction record to a variable meant to hold a sub-transaction record, causing part of the disbursment to be assigned to the wrong transaction. Since all of the IDs in question where 4-byte integers, who could tell? They all looked the same, even in debug mode.

I am not saying that you should avoid autonumber surrogate keys like an Uber driver with a claw hammer. The danger is not in their use but in their abuse. The "events_id" column in the "events" table didn't give us any trouble until we began to rely on it as the sole key for the table. The accounting application gave us problems because we were using the ID as the entire handle for the records. That crossed the line from use to misuse, and we suffered for it.

Unfortunately, I'm seeing design mistakes that I made in the past not only repeated wholesale by younger developers, the rationales for them are being defended vigorously on the Internet and elsewhere.

Reasons to Use an Autonumber Surrogate Key


What follows are a number of reasons people have given me, on IRC and the PostgreSQL.org mailing lists, for using autonumber keys. Some of them are "good" reasons which demonstrate and understanding of the costs and benefits. Others are "bad" reasons based on sloppy thinking or lack of training.  Form your own opinions before scrolling down.

Many-Column Keys
    The real key of the table has 3 or more columns and makes writing queries painful.

Table Size
    Since integers are smaller than most other types, using them makes the table, and my database, smaller. And a smaller database is faster.

Frameworks
    My web development framework requires that all tables have integer primary keys to do code generation from my database.

No Good Key
    My table has no combination of columns that makes a good natural key or unique index. Therefore I need to just use an ID.

Consistency
    Our technical specification requires all tables except join and load tables to have an "id" and a "name" column, each of which is unique.

Join/Sort Performance
    Integers sort and join much faster than large column types. So using integer primary keys gives me better query performance.

Design Principles
    Using an ID column in each table is an important principle of good relational database design, and is required for "normalization".  I read a book/web page/magazine article that says so.

DB Abstraction/ORM
    The database abstraction library (like PDO or ActiveRecord) I use requires integer primary keys.

SQL Standard
    The SQL Standard requires an ID column in each table.

Programmer Demands
    The PHP/Python/JS/Java guys on the interface team refuse to deal with different data types and multi-column keys, and/or want an integer to use as an "object ID."

Mutability
    The natural keys in my table can change, and IDs aren't allowed to change.

Reasons to Use an Autonumber Surrogate Key, Evaluated


Here's my evaluation of the various reasons above. You'll have your own opinions, of course, but read through this list to make sure that your design decisions are well-grounded.

Many-Column Keys
    It Depends. As much as this shouldn't be a reason, the rather verbose SQL join syntax and multicolumn index performance makes it one. If SQL was more terse, and query executors better, this would evaporate as a reason.  Note that in some cases though, it can still be better to use the multicolumn key, expecially if you're partitioning on some of the inherited key values.

No Real Key
    Very, Very Bad. This is an example of exactly the kind of very bad database design that puts the application designers into several weekends of overtime down the line. Without any natural key ... even if you use a surrogate key for joins, etc. ... you have no way of telling which rows in your table are duplicates. Which means that you will get duplicates, many of them, and be unable to fix the data without significant and costly fieldwork to reexamine the sources of your data ("Mary? Do we have one or two John MacEnroes working for us?")
    Worse, these indicate that the developer does not really know his data and that the spec was never really hashed out. When I interrogate people claiming that there's "no real key" I generally find that it's not actually the case that there aren't any unique keys, it's that the developer doesn't know what they are. This is a weathervane for far more serious design problems.
    As Jeff Davis likes to say, "Conflicts will get worked out somewhere.  In general, it's far less expensive to work them out in the database than in the real world."
     Note that thanks to Exclusion Constraints, GIN indexes, and functional unique indexes, PostgreSQL is able to support complex criteria as keys of which other databases would not be capable.  So if you're using something else, there is the possibility of "I know the real key, but my database engine doesn't support it."

External Requirements
    It Depends. The ORM, DB Abstraction and Programmer Demands arguments all amount to external requirements to use integer keys. Certainly a degree of genericization is necessary for any multi-purpose tool or code set. This is the most common reason for me when I succumb to using autonumber IDs. However, this should be a compelling reason only after you've evaluated the ORM, DB abstraction library and/or the staff involved to make sure that integer keys are a real requirement and that the tool/person will actually push your project forwards instead of becoming an obstacle.

Consistency
    Usually Bad. A scrupulous adherence to consistent design standards is generally a good thing.  However, the ID/Name requirement suggests that the architects haven't looked very hard at the application's actual requirements or the real structure of the data.

Standard Practice
    Bad. Both the SQL Standard and the Design Principles arguments are based on ignorance. Generally the developer using these rationales heard from a friend of a collegue who read someone's blog who took a course at the University that ID columns were a good idea. That some of these ignorant designers are also book and article authors is really tragic. For the record, neither the SQL Standard nor relational theory compel the use of surrogate keys. In fact, the papers which established relational theory don't even mention surrogate keys.

Mutability
    It Depends. It's an unfortunate reality that many SQL DBMSes do not support ON UPDATE CASCADE for foreign keys, and even those which do tend to be inefficient in executing it (this may be a reason to switch to PostgreSQL). As a result, real keys which change very frequenty in large databases are generally not usable as join keys. However, I've seen this argument used for values which change extremely infrequently in small databases (like for full names or SSNs in a small company personnel directory), which makes it just an excuse.
    Sometimes, however, this argument is based completely on the misinformation that keys are supposed to be invariable and immutable for the life of the record. Where this idea came from I'm not sure; certainly not from either the SQL standard or the writings of E.F. Codd. It's probably unthinking bleed-over from mis-applied OO design. If this is your reason for not using a real key, it's wrong.
    The other practical reason to require immutable keys is if you're using the keys as part of a cache invalidation or generic sharding system.   However, a smart design for such a system still doesn't use autonumber surrogate keys; instead, you have a synthetic key which carries information about the entity to which it is attached in compressed form, such as a application-specific hash or addressing system.

Performance
    Usually Bad. I've saved the Table Size and Join/Sort Performance for last because performance is the most complex issue. The reason I say "usually bad" is that 80% of the time the developer making these arguments has not actually tested his performance claims, on this or any other database. Premature optimization is the hobgoblin of database design.
    For data warehouses, the Table Size argument can be compelling, although it needs to be balanced against the need for more joins in real performance tests. For any other type of application ... or any database smaller than 10GB, period ... this argument is nonsense. Whether your web application database is 200mb or 230mb is not going to make an appreciable performance difference on any modern machine, but poor design choices can make an appreciable difference in downtime.
    Join/Sort performance is a bit more of a serious argument, depending on the size of the database, the number of columns and data types of the real key.  Note that using the natural key can, in many cases, allow you to avoid doing a join althogether, which can result in query speedups which outstrip any slowness due to column size.  I have refactored data warehouses to use natural keys precisely for this reason.
   If you think you need to use a surrogate key for database performance, test, then make a decision.

Wrap-Up


As I said earlier, it's not using autonumber surrogate keys (which are a necessary, pragmatic evil) but misusing them that causes pain, late nights and budget overruns. Just make sure that you're using integer keys in the right ways and for the right reasons, and always ask yourself if an autonumber surrogate key is really needed in each table where you put one.

11 comments:

  1. Excellent, Josh. Naturally we see this often also.
    Note that in MySQL syntax the keyword is AUTO_INCREMENT

    Regards,
    Arjen.

    ReplyDelete
  2. Good artice and sensible advice but your "definition" of a surrogate is flawed and I think that weakens the message here. The distinction between a natural or a surrogate key is based on their semantics and usage and has nothing to do with the technical features used to generate key values (such as auto-numbering / auto-incrementing column types). For explanation and references see Codd's 1979 "RM/T" paper.

    For example the natural key of an invoice is an invoice number. That is the key used in the business domain outside the database to identify invoices and is therefore a natural key irrespective of how it is generated (note that personally I much prefer the term Business Key or Domain Key rather than Natural Key but it means exactly the same thing). So there are actually two distinct issues under discussion. Firstly the need for natural keys, which are essential to data integrity and effective use of data. Secondly, the suitability or otherwise of certain DBMS features as key generators, whether for natural or surrogate key use.

    ReplyDelete
    Replies
    1. Point taken, but in a pragmatic sense anyone using a "pure" surrogate key will be using SERIAL, UUID, or something similar. My point was to distiguish the real surrogate key from something like a hash or constructed node-instance-entity ID, either of which carry actual meaning and are to some degree verifiable.

      The important question is: if the ID and the rest of the row are separated, is there some way in which they can be meaningfully reunited, or at least verified?

      Delete
  3. A data warehouse is another example of where the PK/FK should always be surrogate keys?

    http://www.kimballgroup.com/1998/05/surrogate-keys/

    ReplyDelete
    Replies
    1. That's the "space" argument mentioned above: that is, an SK to a lookup table takes up less storage space in the main table than, say, a 60-character string. The problem with that argument is that the join to the lookup table itself is not free; I personally have implemented moving values back out of the lookup tables and into the fact table in two different data warehouses for performance reasons. With the advent of column storage and other low-level optimizations, converting values to SKs looks even less attractive.

      Delete
  4. This comment has been removed by the author.

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. This comment has been removed by the author.

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
  8. Great article!!

    Another argument to minimize surrogate keys :

    I once had to merge the database of a branch office with the one of the main office.
    Due to the use of identity columns we had clients, suppliers, orders, orderlines, invoices etc with the same surrogate primary keys, but of course, without any relation between them.
    We choosed to add the number 1000000 to every identity column in the branch office database before merging the databases and we added an "office key" into every table.
    So we created combined primary keys : Office code column + the identity column and all new rows started from the number 2000000.
    Of course, addingthe "office key" column implied we had to refactor a lot of client code as well!!

    Somebody mentioned we should have used GUIDs from the beginning.
    Probably, but do you want to copy a GUID into your queries when executing manual lookups, can you remember GUIDS (for security reasons that might be a good thing) and what about search performance on GUIDs? I do not want to count the number of pagesplits when using GUIDS.

    You can say I followed the examples in the books to much when starting to use identity columns in every table.

    At the other hand consider you want to analyse the metadata of all your databases. You might have the following primary keys ...


    Server : servername
    <->>
    DB-Instance : servername - instancename
    <->>
    Database : servername - instancename - databasename
    <->>
    Schema : servername - instancename - databasename - schemaname
    <->>
    Table : servername - instancename - databasename - schemaname - tablename
    <->>
    Column : servername - instancename - databasename - schemaname - tablename - columnname
    <->>
    Columnconstraints : servername - instancename - databasename - schemaname - tablename - columnname - columconstraintname

    In the RDBMS I'm using, the maximum length of an index is 900 characters and a database object can have 128 characters.
    Thus, the primary key length for the table "columnconstraints" will be 7 x 128 = 896 characters ... if we do not use unicode of course.

    Therefore, at this moment I often tend to create a clustered identity column (to assure performance, to ease join statements, ...).
    In addition I create a unique index/constraint on the natural key columns in order to avoid duplicate entries.
    I don't know whether this is the best approach but it works for me (at the moment).

    Actually I think there is no golden rule ...
    "IT DEPENDS"
    Best practices are often situational, you always have to weigh advantages and disadvantages and it's difficult to anticipate
    all possible business rule changes that might appear in the future.

    ReplyDelete