Design a distributed key-value store

Systems design · Distributed building blocks · Mar 2025

A key-value store offers the smallest useful database interface, which is to put a value under a key and get it back later. On a single machine that is a hash table with persistence, and excellent single-node stores exist in exactly that shape. The interview question is what happens when the data outgrows one machine and the system must keep answering through node failures, which is the situation behind shopping carts at Amazon, session data, user profiles, and device state. The Dynamo paper, which described the store behind Amazon's cart, set the template that Cassandra, Riak, and Voldemort followed, and the question has become the canonical way to probe whether a candidate understands replication, consistency, and failure handling as one connected story rather than as a list of vocabulary words.

The hard parts are exactly the places where single-node intuition fails. Some node must own each key while nodes come and go, some number of copies must confirm a write before it counts, two clients can update the same key on opposite sides of a network partition and the system must eventually say what the key's value is, and a node that was down for an hour has to catch up without anyone pausing the world on its behalf. The storage engine inside each node is its own rich problem as well, and a strong answer connects it to the distributed layer rather than treating it as a black box, because several of the system's operational ceilings turn out to live inside that engine.

Scope and requirements

Functionally the store exposes get and put on binary keys and values, with values up to perhaps 100 KB, no schema, no secondary indexes, and no cross-key transactions, and agreeing to that narrowness early is what makes the rest of the design achievable, since every feature on the excluded list would reintroduce coordination the design is about to spend great effort avoiding. The non-functional requirements drive everything else. The data set is large, tens of terabytes and beyond what any single machine holds. Availability is paramount, meaning the store keeps accepting reads and writes through node failures and even network partitions, the situations where some machines cannot reach others. Latency targets sit at single-digit milliseconds at the median, and consistency is tunable per operation, so a shopping cart can choose to always accept writes while a password table on the same cluster chooses to always read its own latest write.

That choice locates the design on the CAP triangle, the result stating that during a network partition a system must choose between staying consistent and staying available. This design chooses availability and partition tolerance, which makes it an AP system that accepts writes during a partition and reconciles the divergence afterward. The trade is right for carts and sessions, where refusing a customer's add-to-cart costs real money while merging two cart versions costs almost nothing, and it is wrong for bank balances, where a merged answer is meaningless, so saying plainly which workloads the design suits is itself part of a strong answer.

Sizing the problem

Concrete numbers anchor the partitioning and replication choices. Assume 1 billion keys averaging 10 KB of value each, which is 10 TB of raw data, and with every item replicated to 3 nodes the cluster stores 30 TB. If each node comfortably serves 2 TB from fast storage, the cluster needs about 15 data nodes, and rounding up to 18 leaves headroom for growth and for the temporary imbalance that rebalancing creates. For traffic, assume 100,000 reads and 20,000 writes per second at peak, which spread over 18 nodes is under 7,000 operations per second per node, well within what an in-memory-indexed, log-structured engine sustains. The figure that surprises people is the internal traffic, because each write touches 3 replicas, so the cluster carries 60,000 replica writes per second, and at 10 KB per value that is about 600 MB per second of replication bandwidth flowing between nodes before a single byte of client traffic is counted, which informs network provisioning more than any other number in the design.

The interface

The external API is small, and the only subtlety worth surfacing is the version token. Because the store can hold conflicting versions of a key after a partition, reads return an opaque context that writes pass back, so the store knows which version a client thought it was updating, and a read that finds genuine conflicts returns all of them at once.

GET /v1/keys/cart:8731
→ 200 { "value": {...}, "context": "vc:A2B1" }

PUT /v1/keys/cart:8731
X-Version-Context: vc:A2B1
{ "value": {...} }
→ 200

GET /v1/keys/cart:8731     (after a conflicting write elsewhere)
→ 300 { "siblings": [ {...}, {...} ], "context": "vc:A2B1C1" }

A per-request consistency hint, asking for one acknowledgment or a quorum, rides the same API, and that is what tunable consistency means in practice, since the application chooses the trade for each call rather than the database choosing once on behalf of everyone.

Partitioning decides who owns a key

With 18 nodes, some function must map each key to its owners, and the function must not reshuffle everything when membership changes, because at 30 TB a near-total reshuffle means days of network transfer. Consistent hashing is that function. Nodes and keys hash onto the same circular space, a key belongs to the first node clockwise from its hash, and each node owns the arc between its predecessor and itself, so adding or removing a node moves only about one Nth of the keys, the ones on the affected arc, where a modulo scheme would move nearly all of them the moment N changes. The naive ring has a fairness problem worth naming, because random arc sizes vary widely, and a dead node would dump its entire arc onto its one clockwise neighbor, doubling that neighbor's load at the worst possible moment. Each physical node therefore takes many virtual positions on the ring, a few hundred small arcs instead of one large one, which evens out the randomness and lets a dead node's load disperse across many survivors in small slices. Every node knows the full ring, so any node can receive any request and route it onward, and the node that coordinates a given request is simply the first replica for that key or whichever node the client happened to reach.

Replication and quorums

Durability and availability both come from writing each key to N nodes, conventionally the next N distinct physical nodes clockwise from the key's position on the ring, a list Dynamo calls the preference list. The interesting design lever is how many of those N must respond before the operation counts, where W replicas must acknowledge a write and R replicas must answer a read, and tuning those two small integers is how the same cluster serves both a cart and a password table.

The arithmetic that makes this rigorous is overlap. With N set to 3, W to 2, and R to 2, any write lands on at least 2 replicas and any read consults at least 2, and since 2 plus 2 exceeds 3, the two sets must share at least one node, so every read sees at least one copy of the latest acknowledged write, which the version metadata then identifies as the newest. A worked failure makes it concrete. A write to key k is acknowledged by nodes A and B while C is slow, a later read contacts B and C, B's copy carries the newer version, and the coordinator returns that version and, in passing, writes it back to C, a repair that costs nothing extra. Choosing W of 1 instead makes writes fast and maximally available, surviving even when only one replica is reachable, but a read at R of 2 can then consult two replicas that both missed the only copy, which is the precise meaning of the rule that W plus R at or below N admits stale reads, and the application must tolerate them knowingly. The cart chooses W of 1 or 2 because losing an add-to-cart is worse than briefly showing an old cart, while a password-change table chooses W of 2 with R of 2 because acting on a stale credential is the worse failure.

ClientCoordinatorfirst node for key kNode Breplica 2Node Creplica 3Node Dreplica, may lag12replicate to N = 334

The client's put reaches the coordinator in step 1, and in step 2 the coordinator sends the write to all three replicas in the key's preference list. Step 3 marks the moment two acknowledgments arrive, the coordinator's own and Node B's, which satisfies W = 2, and in step 4 success returns to the client while the slow replica catches up in the background.

Versions and conflicts

An AP store accepts writes on both sides of a partition, so two clients can update the same key concurrently and the system must later decide what the key's value is. The simple policy is last-write-wins, where each version carries a timestamp and the larger timestamp survives. It is easy to implement, it is what Cassandra does by default, and its cost deserves stating plainly, because one of the two concurrent updates is silently discarded, and clock skew between nodes, the normal small disagreement between machines' clocks, can even discard the update that happened later in real time. For data where any single update is disposable, a last-seen timestamp or a presence flag, that cost rounds to zero, which is why last-write-wins persists in respectable systems.

Vector clocks keep both updates instead. A vector clock is a small map from node to counter attached to each version, incremented by the coordinating node on each write, and it makes concurrency detectable rather than guessed at, since version X descends from version Y exactly when X's counters are all greater than or equal to Y's, and when neither descends from the other the writes were concurrent. Working it through, a cart is written via node A and carries a clock with A's counter at 1. One client reads it and saves through B, producing a clock with A at 1 and B at 1, while another client who also read the version with A at 1 saves through C, producing a clock with A at 1 and C at 1. Comparing the two, B's entry is larger in one and C's entry is larger in the other, so neither dominates, the store keeps both as siblings and returns them together on the next read, and the application merges them, for a cart by taking the union of the items, then writes back a version that descends from both. The trade is real rather than theoretical, because vector clocks grow with the number of coordinating nodes, clients must handle sibling lists, and the merge logic is application code that someone must write and test, which is why Dynamo used them for the cart while many successors retreated to last-write-wins plus careful data modeling. What the user experiences in the cart case is gentle, since the worst outcome of a union merge is a deleted item reappearing, which mildly annoys, while the worst outcome of last-write-wins is an added item vanishing, which loses a sale.

Surviving failures with sloppy quorum, handoff, and repair

A strict quorum refuses writes when too few of the key's home replicas are reachable, which sacrifices exactly the availability this design promised, so Dynamo relaxes it. Under the sloppy quorum, if a home replica is down, the coordinator writes to the next healthy node further around the ring, and that substitute stores the data with a hint naming the intended owner. When the owner returns, the substitute delivers the hinted writes and deletes its copy, a mechanism called hinted handoff. The effect is that a write succeeds whenever any W healthy nodes exist anywhere on the ring, which is about as available as a write can be made, and the cost is that a read during the outage might miss data parked on a substitute, one more stale-read window the application accepted when it chose this design over a consistent one.

Hints cover outages of minutes to hours, but a node that lost its disk needs a full resynchronization, and comparing terabytes key by key across the network is too expensive to contemplate. Merkle trees make the comparison cheap. Each node builds a tree of hashes over its key range, where each leaf is the hash of a bucket of keys and each parent is the hash of its children, so two nodes can compare their trees from the root downward and descend only into branches whose hashes differ, exchanging just the buckets that actually diverge, which turns a terabyte comparison into kilobytes of hash traffic whenever the replicas mostly agree. This anti-entropy process, the background repair that catches whatever the faster mechanisms missed, runs continuously in Dynamo-style systems and bounds how long any divergence between replicas can quietly persist.

Membership itself is maintained in the same decentralized spirit. Rather than a master tracking who is alive, every node runs a gossip protocol, picking a random peer once per second and exchanging its membership list with heartbeat counters, so news of a joined or failed node spreads epidemically and reaches the whole cluster in a number of rounds that grows only logarithmically with cluster size. A node that stops updating its heartbeat is marked suspect locally and routed around, which ties failure detection back into the sloppy quorum machinery without any central authority involved, and the absence of that authority is why the cluster contains no single component whose loss stops the world.

Inside a node sits the LSM storage engine

Each node still has to persist its share of the data, and the engine of choice for write-heavy stores is the log-structured merge tree, the design behind LevelDB, RocksDB, and Cassandra's storage layer. The organizing idea is to never update disk in place. A write first appends to a write-ahead log, a sequential file whose only job is to survive a crash that strikes before memory reaches disk, and then inserts into the memtable, a sorted in-memory structure. When the memtable reaches a threshold of a few tens of megabytes, it is written to disk in one sequential pass as an SSTable, a sorted and immutable file. Sequential appends are the cheapest operation a disk can perform, on spinning disks by orders of magnitude and on SSDs still meaningfully, which is why this engine sustains write rates that update-in-place trees cannot match.

Reads pay for that write speed, and the engine's remaining design exists to keep the bill small. A get must check the memtable and then SSTables from newest to oldest, since any of them might hold the key's latest value, and two mechanisms keep that affordable. A Bloom filter per SSTable, a compact probabilistic structure that says definitely-not-here for most keys a file lacks, lets reads skip almost every irrelevant file at a cost of a few bits per key, while background compaction merges overlapping SSTables into fewer, larger, non-overlapping ones, discarding overwritten and deleted versions as it goes, so the number of files a read must consider stays bounded. Deletes, counterintuitively, are writes, because the engine records a tombstone marker and the data physically disappears only when compaction processes it, a detail with operational teeth since a burst of heavy deletes grows disk usage before it shrinks it.

Client writeWrite-ahead logsequential appendMemtablesorted map in RAMSSTable L0immutable, Bloom filterSSTables L1+compacted, sorted runsClient read123compaction456

A write appends to the write-ahead log in step 1 and inserts into the memtable in step 2, and in step 3 a full memtable flushes to disk as an immutable SSTable, which compaction later merges downward. Steps 4 to 6 trace a read, which checks the memtable first and then SSTables from newest to oldest, with each file's Bloom filter letting most files be skipped.

A latency budget for a quorum operation

Stitching the two layers together produces the latency budget, and the budget explains several design choices after the fact. A quorum write travels from client to coordinator in a fraction of a millisecond on a data center network, fans out to three replicas in parallel, and at each replica costs a sequential log append plus a memtable insert, which together run well under a millisecond, so the write's response time is the network round trip to the second-fastest replica plus the engine work, comfortably 1 to 2 milliseconds at the median. A quorum read looks similar, with the engine cost being a memtable check and perhaps one SSTable read served from the operating system's page cache, and the same overlap rule that gives correctness also hands out a latency bonus, because waiting for 2 of 3 replicas means the slowest replica's pause, a garbage collection or a compaction burst, never appears in the response time. That filtering of the slowest node is a genuine reason to run quorum reads even when staleness would be tolerable.

The tail and the failure mode both live at the engine. When a node's compaction falls behind its ingest, reads on that node consult ever more SSTables and slow down, the node starts arriving third in every quorum it participates in, and the cluster's tail latency degrades while the medians still look healthy, which is why compaction backlog deserves an alert of its own. The client-visible failure walk-through is mercifully dull, because a request that reaches a coordinator with one slow replica simply takes a few extra milliseconds, a request during a node failure is absorbed by the sloppy quorum without the client learning anything happened, and the application notices a true incident only when so many nodes are unreachable that W healthy candidates cannot be found anywhere on the ring, which at N of 3 with substitutes available requires losing a large fraction of the cluster at once.

Scaling, failures, and operations

Growth is the ring's job. A new node takes its virtual positions, streams its arcs from the current owners while they continue serving, and the cluster's capacity grows by one node's worth with about one Nth of the keys moving. Read scaling beyond that comes from lowering R for tolerant workloads or adding replicas for hot ranges, and write scaling is nearly linear in node count because writes are sequential appends spread evenly by the hash. The coordination-free design means there is no master to outgrow, and the practical ceilings are the replication bandwidth the sizing section put at 600 MB per second and the compaction throughput inside each node, since a node that cannot compact as fast as it ingests slowly accumulates read amplification until its latencies degrade.

The failure stories compose from the mechanisms above, which is the satisfying property of the design. A node down for minutes is carried by the sloppy quorum and hinted handoff, and the hints drain back on recovery. A node dead for good has its virtual positions dispersed to survivors, Merkle-tree repair rebuilds full replication in the background, and operators replace the hardware on no particular schedule. A network partition leaves both sides accepting writes for the keys they can reach, vector clocks record the divergence, and reads after healing surface siblings for merge, so the system never returns an error it could have answered, which is precisely the AP promise and also precisely why a workload that must always read its own writes has to set W and R accordingly or live in a different store. Operationally the recurring work is tuning N, W, and R per table against measured latency, watching disk headroom because compaction needs scratch space to do its merging, and exercising the repair path on a schedule, since an anti-entropy mechanism that has never run in anger is a hypothesis rather than a safeguard.

Follow-up questions

  • Why choose AP here rather than strong consistency? The motivating workloads, carts, sessions, and device state, lose more from refusing writes than from reconciling occasional conflicts after the fact. A workload where a stale or merged read is unacceptable should use a CP store or set W to N, accepting that writes stall whenever a replica is down.
  • What do N = 3, W = 2, R = 2 actually guarantee? Every read overlaps every acknowledged write in at least one replica, because 2 plus 2 exceeds 3, so some contacted node always holds the newest acknowledged version. The guarantee is overlap rather than linearizability, since in-flight writes and hinted writes parked on substitutes can still be invisible to a given read.
  • Vector clocks or last-write-wins? Vector clocks preserve concurrent updates at the cost of sibling handling and merge logic living in the application, while last-write-wins is operationally simple and silently drops one of two concurrent writes. Choose by asking what a lost update costs, which for a cart item is a lost sale and for a last-seen timestamp is nothing at all.
  • How does a read repair stale replicas? The coordinator collects versions from R replicas, returns the newest to the client, and asynchronously writes that version back to any contacted replica that returned an older one. Frequently read keys therefore converge on their own, without waiting for the slower anti-entropy pass to find them.
  • Why an LSM tree instead of a B-tree inside each node? The workload is write-heavy, and the LSM turns random writes into sequential appends, which both disks and SSDs reward, while the read cost is contained by Bloom filters and compaction. A read-mostly workload full of range scans would tilt the choice back toward B-trees, which is why general-purpose databases still default to them.
  • What limits how big one value can be? Replication and compaction copy whole values repeatedly, so multi-megabyte values inflate write amplification and stretch quorum latency. Large blobs belong in an object store, with the key-value store holding the pointer and the metadata, which keeps the hot path moving small things quickly.

References

  1. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007), the template for this design.
  2. Lakshman and Malik, Cassandra: A Decentralized Structured Storage System (2010).
  3. O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (1996), the storage engine's origin.
  4. Kleppmann, Designing Data-Intensive Applications (2017), chapters 5 and 6 on replication and partitioning.
  5. Xu, System Design Interview, Volume 1 (2020), chapter on key-value stores.