Sunday, May 6, 2012

Random Page Cost Revisited

Because the planner has limited information about the host system PostgreSQL is running on, we offer several planner configuration variables, which can be set either dynamically or in the postgresql.conf file.  One of these, which has been a source of much contention in the past, is Random Page Cost (or random_page_cost, hereafter RPC).  For the last couple of years RPC was not a variable you needed to worry about, but now, thanks to advances in storage technology, it's become one of the essential parameters again.

Random Page Cost is meant to represent the coefficient between the cost of looking up one row (out of many) via sequential reads, vs. the cost of looking up a single row individually using random access (disk seeks).  This factor strongly influences the planner's decisions to use indexes vs. table scan, and composite vs. simple indexes, in executing queries.

For many generations of spinning hard drives, this ratio remained about the same, around 4.0. 

Things have changed recently though.  In fact, there are five now-common types of storage where you want to change RPC in order to match a different scan/seek ratio for different hardware.

1. High-End NAS/SAN: when working well, your large storage box combines a large volatile cache with many many disks, improving the number of concurrent seeks you can do.  However, if storage is not direct-attach, you usually have limited throughput.  This means that a random_page_cost of 2.5 or 3.0 is more appropriate.

2. Amazon EBS and Heroku:  EBS, especially when "RAIDed", is an extreme example of the NAS/SAN case.  Seeks are fast with virtually unlimited concurrency, while storage bandwitdh is erratic and slow, limiting full table scans.  As such, we often use a random_page_cost of 1.1 to 2.0.

3. Bad SANs: the advent of iSCSI seems to have inspired a large variety of very poor-performing inexpensive storage units, such as the lower-end Dell Equallogics and various SATA-based Promise units.  These boxes are cheaper for a reason, folks!  Anyway, their quirks often give these boxes unique performance characteristics which strongly affect database access and prompt you to modify planner parameters.  For example, we discovered through testing that one Promise box had decent throughput, but seek rates slower than a single drive due to a combination of architecture and driver issues.  As such, we set RPC to 6.0.  You'll have to trust the results of bonnie++ and fio in order to figure out where you should set things, though; each inferior unit is inferior in a different way.

4. SSDs: in general, SSDs have not lived up to their early hype for performance; random writes and sequential scans on most SSDs are only incrementally better than on HDDs.  One area where SSDs shine, though, is random page access; a good array of SSDs can make even large index scans faster than sequential scans.  Therefore, use a random_page_cost of 1.5 to 2.5.

5. NvRAM (or NAND): durable memory, such as FusionIO and Virident drives, do deliver on most of the performance promises originally made about SSDs.  Everything is faster, sometimes orders of magnitude faster, than HDDs, most of all random access.  While the concept of NvRAM might make you inclined to set RPC to 1, forcing almost 100% index access, that's going a little far because PostgreSQL still has a disk page structure.  Therefore, I suggest 1.0 to 1.5.

Note that two other variables, effective_cache_size and effective_io_concurrency also affect seek/scan behavior.  Possibly I will blog about them at a different time.  Further, I have only been able to test with relatively few models and configurations of the above types of storage, so test, monitor and check your results before committing to a long-term configuration change.


  1. Hi, interesting article.

    But there is one question in my mind. What if i have a server with some database vm's (xen3.2.1), for example 20 databases which all use the same SSD raid beneath.
    Should I take care of this and choose a higher random_page_cost?


    1. I don't see that being on a VM makes any difference. Yes, shared IO makes throughput slower, but it also makes seeks less frequent due to concurrency issues.

  2. I have often found it necessary to set random_page_cost and seq_page_cost less than 1.0 when the data set is fully cached, a scenario that is becoming increasingly common with the very large system memories now available.

    1. If you have to set RPC to less than 1.0 in order to get reasonable behavior, then we have a bug in our planner math. 1.0 should mean "seeks are exactly as fast as scans", which they aren't, even in RAM.

      I was under the impression that changing seq_page_cost was pretty much just a cosmetic thing for your explain plans.

  3. "each inferior unit is inferior in a different way"

    The quotable Berkus...