Friday, February 13, 2015

"In-memory" is not a feature, it's a bug

So, I'm hearing again about the latest generation of "in-memory databases". Apparently Gartner even has a category for them now.  Let me define an in-memory database for you:

     An in-memory database is one which lacks the capability of spilling to disk.

As far as I know from my reading of the industry literature, nobody has demonstrated any useful way in which data should be stored differently if it never spills to disk. While the talented engineers of several database products have focused on other performance optimizations to the exclusion of making disk access work, that's not an optimization of the database; it's an optimization of engineer time. The exact same database, with disk access capabilities, would be automatically superior to its predecessor, because users would now have more options.

PostgreSQL can be an "in-memory" database too, if you simply turn all of the disk storage features off. This is known as "running with scissors" mode, and people use it to good effect on public clouds with disposable replicas.
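A minimal sketch of what turning those features off might look like, assuming the usual non-durable settings. The post itself does not list them: the three parameter names below are real PostgreSQL settings, but the selection and the `render_conf` helper are illustrative assumptions, not a recommended configuration.

```python
# Assumption: a typical set of non-durable settings, not the author's list.
# Each one trades crash safety for speed.
NON_DURABLE_SETTINGS = {
    "fsync": "off",               # do not force writes out to disk
    "synchronous_commit": "off",  # report commits before WAL is flushed
    "full_page_writes": "off",    # skip full page images in the WAL
}


def render_conf(settings):
    """Render settings as lines suitable for a postgresql.conf fragment."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items()) + "\n"


print(render_conf(NON_DURABLE_SETTINGS), end="")
```

A server configured this way can lose or corrupt data on a crash, which is why it only makes sense for replicas you can throw away and rebuild.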

So an "in-memory" database is a database with a major limitation.  It's not a feature, any more than an incapability of supporting SQL access is a feature.  Let's define databases by their useful features, not by what they lack, please.

Besides which, with the new types of persistent memory and fast random access storage coming down the pipe in a couple years, there soon won't be any difference between disk and memory anyway.

16 comments:

  1. The data structures you use in memory are different in many cases from what you would use if you had to pay the cost of a disk request for each request.

    Replies
    1. People keep saying this, without any material demonstration of that idea. I've worked with a lot of database engines, and they all use the same basic structures.

    2. Redis offers features I have never seen in a disk-based database, and it is an example of why there are benefits to something purpose-built as memory-only.

    3. Actually, Redis is one of the ones I was thinking of; I'm casually involved in the project and know a lot about the internal architecture. There is nothing structurally inside Redis which prevents it from swapping to disk; it's strictly a matter of unimplemented code. In fact, Redis has adopted a number of data structures generally thought of as pure "disk" features, such as compressed data pages, because they make in-memory performance better as well.

      You could argue with conviction that if Salvatore had had to work out all of the disk access stuff just to get started, Redis would be years behind where it is now, feature-wise, and I think you'd be right. But as I said above, that's an optimization of developer time, not of database engines. Swap-to-disk and single-node crash safety are still on Redis's TODO list; they're just not done yet.

  2. Even for cache systems that do want in-memory-only behavior from the OS kernel, the best way to do it is still to use the file-based APIs. See https://www.varnish-cache.org/trac/wiki/ArchitectNotes for some details.

  3. Josh, what a good definition :-) Precise and kinda obvious in hindsight!

    Could you elaborate on turning off disk storage options? Turning fsync and friends off? Unlogged tables? RAM-disk?
    By the way, does PG work on a RAM disk just as it does on a spinning one? I mean that probably some common steps are useless when working on unreliable storage and could be skipped.

    Replies
    1. Oops, I thought that was in a prior post on this blog. I'll do a follow-up post which goes over those settings.

    2. I could not find a post on running a PostgreSQL database which resides on tmpfs.

      I did that, but I have not looked into:
      a) Backing up the database to a flash file system periodically
      b) Restoring the database from the flash file system into tmpfs before PostgreSQL starts up.

  4. I've used an in-memory database for years: NDB Cluster (aka MySQL Cluster), which can be used and was developed without the MySQL frontend. Everything is always in memory and partitioned over multiple machines. If it didn't spill to disk (to make it durable) it would have been useless. A newer feature is disk data tables, for which it is no longer required to have the full table in memory.

    Another thing which comes to mind is the in-memory option of SQLite, which can be used in similar situations as unlogged tables and the memory storage engine for MySQL:
    https://www.sqlite.org/inmemorydb.html
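
The SQLite in-memory option can be sketched with Python's standard `sqlite3` module. This is only an illustration under assumptions: the `events` table, the `snapshot_to_disk` helper, and the snapshot path are all hypothetical. It shows an in-memory database that can still be copied to a disk file on demand through the backup API.

```python
import os
import sqlite3
import tempfile


def snapshot_to_disk(mem_conn, path):
    """Copy the whole in-memory database into a file at `path`."""
    disk_conn = sqlite3.connect(path)
    try:
        mem_conn.backup(disk_conn)  # uses SQLite's online backup API
    finally:
        disk_conn.close()


# ":memory:" asks SQLite for a database that lives only in RAM.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
mem.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)])
mem.commit()

# Hypothetical snapshot location in a fresh temporary directory.
snapshot_path = os.path.join(tempfile.mkdtemp(), "snapshot.db")
snapshot_to_disk(mem, snapshot_path)

# Reopen the file to confirm the rows survived outside of RAM.
disk = sqlite3.connect(snapshot_path)
row_count = disk.execute("SELECT COUNT(*) FROM events").fetchone()[0]
disk.close()
mem.close()
```

The same engine and the same table layout serve both the in-memory and the on-disk copy; only the connection target differs.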

  5. If the data fits in RAM then the capability to spill to disk is a liability. What's wrong with a trade-off where you sacrifice a feature that you don't need for major performance, flexibility, productivity and maintainability benefits?

    I'm building OrigoDB, an in-memory database for .NET. The data model is user-defined with .NET types and collections. LINQ is used for queries and precompiled C# stored procedures for modifications. I can run 100K fully serialized write transactions per second AND squeeze in millions of queries while waiting for the transaction log to flush.

    Replies
    1. Yes, but if you could have the same performance while *also* having synchronous disk writes for crash-safety, wouldn't you? Or having the ability to handle data sets which don't fit in RAM on each node?

      I'm pointing out that "in-memory" by itself is not a feature, it's a deficiency. You certainly may successfully optimize other features by choosing not to deal with disk access, but the fact that you *cannot* do disk access, by itself, is not an advantage. It's those other features you should be talking about, not your limitations.

      I felt compelled to point this out because there are currently some new databases on the market which do not have any advantages over more conventional databases, and are using the "in-memory" label to claim performance they do not, in fact, have. If OrigoDB is that kick-ass, you don't want to put yourself in the same bucket with those.

    2. Just to be clear, OrigoDB transactions are fully ACID with perfect isolation.

      I wouldn't *define* in-memory as 'not being able to spill to disk', that's just one implication of many, some constraining and some enabling. NoSQL on the other hand is a hideous name.

      I think the in-memory label, compared to many other buzzwords, is an honest, descriptive characteristic. Calling a key/value store a database is considerably more deceptive, imo.

      Not counting skewed benchmarks, which new products have false performance claims? We should call them out!

  6. Constraints and limitations *can* be features. I think you need to use your imagination a bit more. There are legit reasons to run purely in memory and to consider the mere capability of using disk, even solid state, as a liability.

  7. GemFire (aka Apache Geode) has some interesting features on paper.

    In-memory processing across multiple nodes (speed and stuff), WAN replication, and a persist-to-disk backed database sync thing (Greenplum).
