Thursday, September 18, 2014

Finding Duplicate Indexes

Recently a client asked us to help them find and weed out duplicate indexes.  We had some old queries to do this, but they tended to produce a lot of false positives, and in a database with over 2000 indexes that wasn't going to cut it.  So I rewrote those queries to make them a bit more intelligent and discriminating, and to supply more information to the user on which to base decisions about whether to drop an index.

Here's the first query, which selects only indexes which have exactly the same columns.  Let me explain the columns of output it produces:
  • schema_name, table_name, index_name: the obvious
  • index_cols: a comma-delimited list of index columns
  • indexdef: a CREATE statement for how the index was created, per pg_indexes view
  • index_scans: the number of scans on this index per pg_stat_user_indexes
Now, go run in on your own databases.  I'll wait.

So, you probably noticed that we still get some false positives, yes?  That's because an index can have all the same columns but still be different.  For example, it could use varchar_pattern_ops, GiST, or be a partial index.  However, we want to see those because often they are functionally duplicates of other indexes even though they are not exactly the same.  For example, you probably don't need both an index on ( status WHERE cancelled is null ) and on ( status ).

What about indexes which contain all of the columns of another index, plus some more?  Like if you have one index on (id, name) you probably don't need another index on just (id).  Well, here's a query to find partial matches.

This second query looks for indexes where one index contains all of the same columns as a second index, plus some more, and they both share the same first column.  While a lot of these indexes might not actually be duplicates, a lot of them will be.

Obviously, you could come up with other variations on this, for example searching for all multicolumn indexes with the same columns in a different order, or indexes with the same first two columns but others different.  To create your own variations, the key is to edit the filter criteria contained in this clause:

WHERE EXISTS ( SELECT 1
    FROM pg_index as ind2
    WHERE ind.indrelid = ind2.indrelid
    AND ( ind.indkey @> ind2.indkey
     OR ind.indkey <@ ind2.indkey )
    AND ind.indkey[0] = ind2.indkey[0]
    AND ind.indkey <> ind2.indkey
    AND ind.indexrelid <> ind2.indexrelid
)


... and change it to figure out the factors which give you the most real duplicates without missing anything.

Happy duplicate-hunting!

4 comments:

  1. Nice +1
    Full Query For Me Is:
    SELECT *
    FROM pg_index AS ind
    JOIN pg_class c ON ind.indexrelid = c.oid
    WHERE EXISTS ( SELECT 1
    FROM pg_index AS ind2
    WHERE ind.indrelid = ind2.indrelid
    AND ( ind.indkey @> ind2.indkey
    OR ind.indkey <@ ind2.indkey )
    AND ind.indkey[0] = ind2.indkey[0]
    AND ind.indkey <> ind2.indkey
    AND ind.indexrelid <> ind2.indexrelid
    )

    ReplyDelete
  2. On 9.3 I had to add some casts to the second query:

    AND ( ind.indkey::int[] @> ind2.indkey::int[]
    OR ind.indkey::int[] <@ ind2.indkey::int[] )

    ReplyDelete
    Replies
    1. You did? It just works for me, without the casts. Anything else unusual about your install?

      Delete
    2. Same here on 9.2. I believe my PG installed from apt.postgresql.org and intarrays installed via 'CREATE EXTENSION'.

      Delete