Thursday, September 18, 2014

Finding Duplicate Indexes

Recently a client asked us to help them find and weed out duplicate indexes.  We had some old queries to do this, but they tended to produce a lot of false positives, and in a database with over 2000 indexes that wasn't going to cut it.  So I rewrote those queries to make them a bit more intelligent and discriminating, and to supply more information to the user on which to base decisions about whether to drop an index.

Here's the first query, which selects only indexes which have exactly the same columns.  Let me explain the columns of output it produces:
  • schema_name, table_name, index_name: the obvious
  • index_cols: a comma-delimited list of index columns
  • indexdef: a CREATE statement for how the index was created, per pg_indexes view
  • index_scans: the number of scans on this index per pg_stat_user_indexes
Now, go run in on your own databases.  I'll wait.

So, you probably noticed that we still get some false positives, yes?  That's because an index can have all the same columns but still be different.  For example, it could use varchar_pattern_ops, GiST, or be a partial index.  However, we want to see those because often they are functionally duplicates of other indexes even though they are not exactly the same.  For example, you probably don't need both an index on ( status WHERE cancelled is null ) and on ( status ).

What about indexes which contain all of the columns of another index, plus some more?  Like if you have one index on (id, name) you probably don't need another index on just (id).  Well, here's a query to find partial matches.

This second query looks for indexes where one index contains all of the same columns as a second index, plus some more, and they both share the same first column.  While a lot of these indexes might not actually be duplicates, a lot of them will be.

Obviously, you could come up with other variations on this, for example searching for all multicolumn indexes with the same columns in a different order, or indexes with the same first two columns but others different.  To create your own variations, the key is to edit the filter criteria contained in this clause:

    FROM pg_index as ind2
    WHERE ind.indrelid = ind2.indrelid
    AND ( ind.indkey @> ind2.indkey
     OR ind.indkey <@ ind2.indkey )
    AND ind.indkey[0] = ind2.indkey[0]
    AND ind.indkey <> ind2.indkey
    AND ind.indexrelid <> ind2.indexrelid

... and change it to figure out the factors which give you the most real duplicates without missing anything.

Happy duplicate-hunting!

Tuesday, August 26, 2014

Portland schedule, revised

Just so folks know:  due to unsurmountable issues, I will not make it to DjangoCon this year.  My apologies for anyone who was planning on attending my sessions there.  However, I will be at both the Portland pgDay and FOSS4G.  See you in Portland, just later.

Monday, August 25, 2014

The great JSONB tradeoff

One thing we're waiting on to get 9.4 out the door is making a decision on JSONB storage.  Currently under debate are patches to the JSONB format which would, for users with large numbers (as in 150+) of top-level keys, cause JSONB to take up to 60% less space on disk in exchange for an up to 80% increase (that is slowdown) in response times on key extraction.  We need your feedback as potential users of the feature to make a good decision about what's best.

At stake is the question of whether the key locations should be saved as successive offsets, or as lengths.  The advantage of the former is that it speeds up extraction by making it only two lookups to locate a top-level key regardless of the number of keys you have.  The disadvantage is that a series of increasing offsets isn't very compressible.

For JSONB fields which consist of a few top-level keys and large values, this question makes no difference at all.  However, for the worst case ... 150+ top-level keys with short (under 10 bytes) values, the difference is quite dramatic.  For example, I constructed a test with 183 keys, of which 175 were NUMERIC.  I checked both table size and time to extract key #160 from 100,000 rows:

Table Size1147 MB541 MB
Extract 100K2.8s5.2s

This is a "worst case" scenario for the difference between these two designs.  Note that the extraction slowdown affects only retrieving the value to the client; it does not affect index lookups of JSONB rows, which are speedy no matter which patch is employed.

However, we're undecided on this "fix" because we don't know a couple things:

  • How likely are users to have 150+ top-level keys in one field (or keys on any single level together) with short values?
  • Is up to 60% space savings in return for up to 80% extraction slowdown a good tradeoff?  Or a bad one?
As such, I am turning to you, the users, to help us decide what tradeoff makes sense for 9.4.  Let me know in the comments.

Thursday, August 14, 2014

More on 9.4 percentiles

Some of you may be following the discussion on JSONB compression on -hackers, which is the remaining major issue for 9.4.  As part of this, I needed to run some statistics on average column sizes for JSON vs JSONB tables. This makes a good example of how incredibly useful WITHIN GROUP will be in 9.4.

First, I created two versions of the data, one called "jsonic" which has text-JSON data, and one called "jsonbish" which has the JSONB version.  Then I ran some comparisons using pg_column_size on the JSON columns.

with colsizes_b as (
    select pg_column_size(jsonbish.jcol) as colsize
    from jsonbish
distrib_b as (
    select percentile_cont(array [0,0.25,0.5,0.75,1])
    within group (order by colsize) as colsize_distribution
    from colsizes_b
colsizes_j as (
    select pg_column_size(jsonic.jcol) as colsize
    from jsonic
distrib_j as (
    select percentile_cont(array [0,0.25,0.5,0.75,1])
    within group (order by colsize) as colsize_distribution
    from colsizes_j
select 'json' as thetype,
    colsize_distribution from distrib_j
union all
select 'jsonb',
    from distrib_b;

So I'm taking the column size of each row, then sorting them by size, and then doing a percentile distribution using the new WITHIN GROUP.  There's no group there because the group is actually the whole table.  I've chosen the usual box plot percentages: minimum, 25%, median, 75%, and maximum.

And the results:

 thetype |    colsize_distribution   
 json    | {1741,1767,1854,1904,2292}
 jsonb   | {3551,5866,5910,5958,6168}

... which demonstrates that we do indeed have a major issue with compressability.

Tuesday, August 12, 2014

SFPUG Video on YouTube

TL;DR: SFPUG video available on YouTube, including streaming video tonight.  I need help converting old videos and making new ones.

First, we have a few SFPUG videos available on YouTube:
While JSONB Deep Dive was also recorded, there were issues with the audio, so I don't recommend watching it, sorry.

We will have streaming video for tonight's SFPUG, which will be on Postgres-XL.   Join the Google Event to get a link to the streaming video once it starts; expected start time is 7:15PM PST.

This brings me to the third thing, which is that I could use some help with video, in two areas:
  1. I desperately need someone to help take better-quality videos of SFPUG meetings, with better focus and sound.  Right now I'm using a Logitech webcam, and it's just not cutting it.  Video will need to be compatible with Google Hangouts, unless you have another plan for broadcasting.
  2. I have 7 or 8 videos of SFPUG meetings from 2013, including Tom Lane explaining the query planner, in FLV format.  I need someone to transcode these to a YouTube format and upload them.
If you can help, email me.  Thanks!

Thursday, August 7, 2014

Call for quotes for the 9.4 release announcement

We are looking for endorsement quotes for the 9.4 release.
Specifically, we want quotes from end-users or ISVs* who have not been
featured for a previous release, and enthuse about one of the following
  • Streaming Changesets/Bi-Directional Replication
  • Performance improvements in 9.4
  • Something else about Flexibility, Scalability or Performance
We need these quotes by August 15th.  Contact me if you want to offer a quote.  Thanks!

(* while we love our PostgreSQL consulting and support companies,
endorsements from them aren't news in a PR sense)

Friday, June 13, 2014

Pending Portland Postgres Presentations

This year is apparently my year to spend most of my time in Portland.  Maybe I should move there ;-)  Please read through if you are in Portland or are attending conferences in Portland for a list PostgreSQL-related events there.

O'Reilly OSCON 2014: on Monday, July 21st, I will be presenting The Accidental DBA again at OSCON.  O'Reilly has asked me to reprise this tutorial as it was sold out last year.  This tutorial covers the care and feeding of a PostgreSQL server for people whose main job is something else, or for those new to database administration.  Will include multiple hands-on exercises using Vagrant on your own laptop.  The exercises are on github, but will be updated on or before July 10th in order to cover 9.3 and feedback from the last talk.

Note that there will also be a special PDXPUG meeting or BOF at OSCON; details TBD.  I will also be doing an Office Hours session at OSCON on July 22nd at 3:20 PM; please bring your questions about PostgreSQL scalability and replication.

I am likely to be speaking at in September.  As the Call for Presentations is still open, I don't know what it will be about, or have confirmation that I will, in fact, be speaking.

Right between and FOSS4G, PDXPUG will be hosting a full PostgreSQL Day at Portland State University on September 6.  This event is free, so if you are going to either DjangoCon or FOSS4G, extend your stay in Portland in order to attend!  RSVP required.

People are still submitting presentations for that; if you're interested in speaking, contact Mark per the wiki page.  I expect to be talking about 9.4, replication, or maybe Postgres on Amazon.

Finally, at FOSS4G, I will be doing a full day of Postgres/PostGIS workshops.  In the morning I will do Accidental DBA again, this time tailored for PostGIS administrators.  In the afternoon, I will do a hands-on "learning to do PostGIS replication" tutorial.  This does mean that I will likely not include much about replication in the morning session.  Both of these workshops are "bring your own device", which means that you will need to install the tutorial materials on your laptop in advance.

Since I will be spending the first half of September in Portland.  If you or your company is interested in onsite training or consulting during September, please get in touch with PGX.