Monday, August 12, 2013

Fancy SQL Monday: ON vs. NATURAL JOIN vs. USING

First, there was the SQL89 JOIN, where there was no JOIN statement, and joins were performed by adding WHERE clauses.  To wit:

SELECT, count(*)
FROM users, comments
WHERE users.user_id = comments.user_id
   AND users.status = 'active'
   AND comments.comment_date BETWEEN '2013-07-01' and '2013-07-31'

There were two problems with this kind of JOIN: you often got join conditions mixed up with other filters and accidentally did cartesian products, and there was no way to do directional (outer) joins.

Not that that keeps people from continuing to use it.

So then we got the SQL92 JOIN:

SELECT, count(*)
FROM users JOIN comments
   ON users.user_id = comments.user_id
WHERE users.status = 'active'
   AND comments.comment_date BETWEEN '2013-07-01' and '2013-07-31'

But this wasn't the end of things.  Since SQL92 allowed users to define FOREIGN KEYS, why not allow them to just join along the keys "naturally"?

SELECT, count(*)
FROM users NATURAL JOIN comments
WHERE users.status = 'active'
   AND comments.comment_date BETWEEN '2013-07-01' and '2013-07-31'

Except that the SQL committee screwed up, and instead of using the declared foreign keys, NATURAL JOIN looks for columns with identical names.  And if there are several columns with identical names, there's no standard way to resolve them.  So, NATURAL JOIN, which ought to have been a real keyboard-saver, instead turned out to be uselessly unreliable.

One of the annoying issues about joins between two tables on columns of the same name is that SQL wants you to constantly qualify which table you're asking for the column from, even though the contents are identical:

ERROR:  column reference "user_id" is ambiguous

This was annoying, so they created the USING clause instead:

SELECT user_id,, count(*)
FROM users JOIN comments USING ( user_id )
WHERE users.status = 'active'
   AND comments.comment_date BETWEEN '2013-07-01' and '2013-07-31'
GROUP BY user_id,;

In Postgres, NATURAL JOIN also conflates matching columns, but it doesn't do this on all DBMSes.

Of course, USING only works if your columns do have the same names.  All the more reason to adopt that as a naming convention!


  1. Let me guess: we ended up with this weak definition of "natural" because some vendors wanted add that tick in their features matrix before they managed to tick the more complicated "FK" feature ?

    1. Not sure, haven't gotten Celko to comment on it. But I suspect so, especially since some vendors *still* don't have functional FKs.

  2. Minor copy/paste related nit: The last two queries need to have the first AND replaced by WHERE.

  3. I've actually found USING to be harmful. Consider the following example:

    CREATE TABLE users (userid serial primary key);
    CREATE TABLE colors (colorid serial primary key);
    CREATE TABLE cars (userid int references users(userid), colorid int references colors(colorid));

    And you write this query to get information about the cars a user owns:

    SELECT * FROM users JOIN cars USING (userid) JOIN colors USING (colorid);

    Everything works correctly, until someone wants to add a column to "colors" which specifies who added the certain color:

    ALTER TABLE colors ADD COLUMN userid integer;

    and bam, all your queries are broken and you never saw it coming. What's even worse, all views in the database using this query will continue to work, *until* you want to restore the database somewhere (e.g. during upgrade).

    Of course, in a perfect world you wouldn't call the new column "userid", but can you make sure you never add a conflicting column to any of your tables?

    Similar, less subtle caveats apply to NATURAL JOIN.