Saturday, August 25, 2012

Wrong defaults for zone_reclaim_mode on Linux

My coworker Jeff Frost just published a writeup on "zone reclaim mode" in Linux, and how it can be a problem.  Since his post is rather detailed, I wanted to give a "do this" summary:

  1. On some Linux distributions, including Red Hat, zone_reclaim_mode defaults to 1, which is the wrong value for database servers.
  2. That default both keeps Linux from using all available RAM for caching and throttles writes.
  3. If you're running PostgreSQL, make sure that zone_reclaim_mode is set to 0 (see the commands below).
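
To check the current value and make the change permanent, something like this works (the /etc/sysctl.conf path is the usual place for the persistent setting, but your distribution may organize it differently):

  # what is it set to right now?
  $ cat /proc/sys/vm/zone_reclaim_mode

  # turn it off immediately (run as root)
  $ sysctl -w vm.zone_reclaim_mode=0

  # and make it survive a reboot
  $ echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
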
Frankly, given the documentation on how zone_reclaim_mode works, I'm baffled as to what kind of applications it would actually benefit.  Could this be another Linux misstep, like the OOM killer?

4 comments:

  1. This comment has been removed by the author.

    Replies
    1. Defaults to zero on my system.

      $ cat /etc/redhat-release
      Red Hat Enterprise Linux Server release 6.3 (Santiago)

      $ uname -a
      Linux work-desktop 2.6.32-279.19.1.el6.x86_64 #1 SMP Sat Nov 24 14:35:28 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

      $ cat /etc/sysctl.conf | grep zone

      $ cat /proc/sys/vm/zone_reclaim_mode
      0

    2. After a little more looking, it turns out both of us are correct: it can differ depending on the machine. Hopefully you can view this:

      https://access.redhat.com/site/solutions/60669

      To paraphrase:

      The commit 9eeff2395e3cfd05c9b2e6074ff943a34b0c5c21 introduced this.
      For more details, please check the upstream kernel discussion here: http://marc.info/?l=linux-kernel&m=113408418232531&w=2

      In RHEL-6.1, 'zone_reclaim_mode' was set to 1, and in RHEL-6.2 it was set back to 0.

  2. This is an old post, but since there is still a lot of confusion lingering about this setting: it exists for HPC workloads, which drove a lot of the NUMA development in the first place. HPC simulations are one example of a class of applications which (A) saturate the memory bus and (B) run in such perfect synchronization that they are highly sensitive to memory latency.

    When you run this sort of code and one NUMA node fills up, and its cores have to borrow memory bandwidth from a neighboring node, then both sets of cores start running at 50% speed at best (since that's the memory bandwidth left for each), or perhaps even slower. When one pair of cores degrades, the entire simulation slows down catastrophically, since all the simulation cells have to exchange results at every time step.
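
    For reference, a quick way to see that per-node picture on a running box, assuming the numactl package (which provides the numactl and numastat tools) is installed:

      # topology plus total and free memory per NUMA node
      $ numactl --hardware

      # per-node allocation counters, including numa_miss and numa_foreign
      $ numastat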

    Production Linux supercomputing dates back to the late '90s; even when NUMA became common in the mid-2000s, this sort of large-RAM database design wasn't dominant yet. Now that it is, the default has swung the other way.
