Design consistent hashing

Systems design · Distributed building blocks · Feb 2025

Consistent hashing answers the question every distributed storage and caching system faces, which is how to decide, for a given key, which of N servers should hold it in a way that survives servers joining and leaving. The naive answer hashes the key and takes the remainder modulo N, and it works perfectly until N changes, at which point almost every key suddenly belongs somewhere else. A storage cluster then spends hours shuffling data it never needed to move, while a cache fleet watches its hit rate collapse and melts the database behind it with misses. Consistent hashing rearranges the assignment so that adding or removing one server moves only the keys that server should own, roughly a 1/N fraction of the total, and leaves everything else untouched, which turns a membership change from a cluster-wide event into a local one.

Interviewers ask this both as a standalone design and as a building block inside larger questions, because it tests whether a candidate can reason about distribution, balance, and failure with actual arithmetic rather than hand-waving. The ring itself takes two minutes to describe, so the hard parts are the consequences that follow from it. Arcs come out badly uneven when there are only a few nodes, hotspots ignore the hash entirely because popularity is not spread by hashing, and replication and membership changes interact with the ring in ways that deserve careful thought. Each of those gets its own section below, with numbers attached.

Scope and requirements

The deliverable is a key-placement scheme, usually packaged as a library rather than a service, and it needs three operations, namely mapping a key to a server, adding a server, and removing one. Functionally, every client holding the same view of membership must route a given key to the same server, because two clients disagreeing about placement means a write lands where a later read will never look. Lookups must be fast enough to sit on the hot path of every request, since placement is consulted before any data is touched. The scheme must also support weighting, so that a machine with twice the capacity can take twice the keys, because real fleets are rarely uniform and hardware gets replaced generation by generation.

The non-functional requirements are where the design earns its name. When membership changes by one server, the fraction of keys that move should be close to the theoretical minimum of 1/N, because every moved key is either data to copy or a cache miss to absorb, and both cost real time and real load. Load should spread evenly, within a few percent of the mean per server, since the most loaded server sets the cluster's effective capacity and everyone else's headroom goes to waste the moment one machine saturates. The scheme should need no central coordinator on the lookup path, with each client computing placement locally from the membership list, because a placement service would add a network hop to every request and become a single point of failure in front of everything. And it must handle the practical sizes, hundreds of servers and billions of keys, without the lookup structure itself becoming a burden to store or search.

Sizing the problem

The numbers that matter are the cost of a lookup and the cost of a membership change. Take a cluster of 10 cache servers holding 50 million keys. With the ring scheme described below and 200 virtual positions per server, the lookup structure holds 10 × 200 = 2,000 entries, and at roughly 40 bytes per entry that is 80 KB, small enough to live in every client's memory and largely in its CPU cache. A lookup is a binary search over 2,000 sorted positions, costing log base 2 of 2,000, which is about 11 comparisons, well under a microsecond. Against a typical cache round trip of half a millisecond, the placement decision is hundreds of times cheaper than the request it routes, which is exactly the relationship a hot-path component should have.

Membership changes are where the scheme pays for itself. Growing from 10 servers to 11 should move about 1/11 of the keys, which is 50 million divided by 11, roughly 4.5 million keys. The modulo scheme, as the worked example below shows, would move about 80 to 90 percent of them, around 40 million keys, nearly ten times the data transfer or ten times the cache misses for the same one-server change. If each key is a 1 KB cached object, that is the difference between copying 4.5 GB in a controlled stream and refetching 40 GB through the database at whatever rate the misses arrive, which in a busy cluster is the difference between a non-event and an incident.

The interface

The whole design fits in a small class, and writing the lookup makes the data structure concrete. The ring state is nothing more than a sorted array of (position, server) pairs, and a lookup is a binary search for the first position at or after the key's hash, wrapping to the start when the search falls off the end. The bisect module used here is Python's binary search over a sorted list.

import bisect, hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, vnodes: int = 200):
        self.vnodes = vnodes
        self.positions = []           # sorted list of (hash, server)

    def add(self, server: str):
        for i in range(self.vnodes):
            bisect.insort(self.positions, (h(f"{server}#{i}"), server))

    def remove(self, server: str):
        self.positions = [(p, s) for p, s in self.positions if s != server]

    def get(self, key: str) -> str:
        i = bisect.bisect_right(self.positions, (h(key), ""))
        return self.positions[i % len(self.positions)][1]

The hash function needs to spread values uniformly across the 64-bit space but has no security role, so a fast non-cryptographic hash such as MurmurHash or xxHash is the production choice, and MD5 appears above only because it ships with the standard library. Reaching for a cryptographic hash out of habit would spend many times the CPU per lookup buying tamper resistance that nothing in this design needs, while a poorly distributed hash would cluster positions and quietly undo the balance the whole scheme depends on, so the requirement is uniformity and speed rather than secrecy.

Why hash mod N fails

The baseline deserves a worked demolition because it looks so reasonable. Assign each key to server hash(key) mod N. With N = 4, a key hashing to 17 goes to server 1, since 17 mod 4 is 1. Now add a fifth server, and the same key goes to 17 mod 5, which is server 2, even though server 1 is still alive and still holds the data. Run the same check across hashes 0 through 19, one full cycle of both moduli, and a key stays put only when its hash gives the same remainder mod 4 and mod 5, which happens for hashes 0, 1, 2, and 3 and for no others in the cycle. Four out of twenty keys stay, so 80 percent of all keys move for a change that logically required moving only 20 percent. In general, growing from N to N+1 servers relocates about N/(N+1) of the keys, a fraction that approaches everything as the cluster grows, so a 99-to-100-server change moves 99 percent of the data to gain 1 percent of capacity.

For a storage system that means a rebalancing storm of copies saturating the network for hours. For a cache it means the hit rate collapses to near zero at the moment of the change, and every miss lands on the database the cache was protecting, which from the database's point of view looks like the entire cache tier vanished at once. Both versions of the disaster are self-inflicted by the placement function rather than by the hardware change itself, and that is the whole motivation for replacing the function.

The ring

Consistent hashing maps servers and keys into the same space and makes ownership a matter of proximity. Treat the output range of the hash function as a circle, so the largest value wraps around to zero. Hash each server's identifier to get its position on the circle, and hash each key the same way. A key is owned by the first server found moving clockwise from the key's position. That single rule produces the property the modulo scheme lacked, because a server's arrival or departure only redraws the boundary of one arc, and keys everywhere else still find the same owner by the same clockwise walk. Nothing about the rule requires coordination either, since any client that knows the membership list can rebuild the circle independently and reach the same answer as every other client.

Node ANode BNode CNode Dkey u15key u42key u77clockwise to ownerto Node D

Servers and keys hash onto the same circle, and each key belongs to the first server found clockwise, so keys u15 and u42 belong to Node B while key u77 belongs to Node D.

The construction comes from Karger and colleagues' 1997 paper, written originally for distributed web caching, and it has since become the placement scheme inside Amazon's Dynamo and its descendants, Apache Cassandra's token ring, client libraries for Memcached such as Ketama, and request routing in CDN caches. That list is worth reciting in an interview because each system stresses a different property, with Dynamo leaning on replication along the ring, Cassandra on balanced token ownership, and the cache clients on stable routing computed independently by thousands of web servers that never talk to each other.

Adding and removing nodes

The payoff shows up the first time membership changes. Suppose node E joins and hashes to a position between A and B. Every key whose position falls between A and E used to walk clockwise past that point to reach B, and now stops at E first, so those keys, and only those, change owner. With five nodes placed roughly evenly, the arc between A and E holds at most about one fifth of the circle's keys and typically half an arc, so on the order of 50 million / 5 = 10 million keys move in the 50-million-key cluster, and in the general case about 1/N of the keyspace, compared with the 80 percent the modulo scheme moved for the same event. Removal is the mirror image, since a leaving node's keys simply continue clockwise to the next node, and nothing else on the circle changes at all.

Before: four nodesABCDkey kNode E joinsAfter: E owns one arcAEBCDkey k

Node E lands between A and B, and only the highlighted arc between A and E changes owner, so key k, which belonged to B before, now stops at E on its clockwise walk. Keys on every other arc keep their owners.

Operationally the data copy follows the same arithmetic, because the joining node streams its new arc from the node that previously owned it, B in the picture, while the rest of the cluster serves traffic undisturbed, and the old owner can keep serving the arc until the copy completes so that the handoff stays invisible to clients.

A join, end to end

It is worth narrating a join the way an operator would experience it, since the step-by-step detail is where designs quietly fall apart. A new server is provisioned, its 200 virtual positions are computed from its stable identifier, and the membership system, whether gossip or a coordination service, distributes the new view. Until the data arrives the new node owns arcs it cannot yet serve, so the join proceeds in two phases. During the transfer phase the new node pulls each of its slivers from the current owner while that owner keeps serving reads and writes, with writes that land mid-copy either applied to both nodes or logged and replayed at the end. Only when a sliver is fully caught up does ownership flip for that sliver, which makes the flip an atomic metadata change rather than a data move, and a client that looks up a key mid-join is always served by whichever node currently holds the authoritative copy.

The arithmetic from the sizing section says the new node receives roughly 4.5 million keys, about 4.5 GB at 1 KB each, and because virtual nodes scatter its arcs around the circle, those gigabytes arrive from many donors in parallel, perhaps 450 MB from each of the ten existing servers. At a conservative 100 MB per second of streaming per donor the transfer completes in under a minute, during which clients notice nothing beyond a slightly busier network. Run backward, the same walk describes a graceful decommission. The contrast with the modulo story, where ten times the data moves and every server is simultaneously donor and recipient, is the difference between routine capacity work and a scheduled maintenance window.

Virtual nodes and balance

The plain ring has a balance problem that the picture hides. With only a handful of nodes, the arcs are whatever random hashing produced, and randomness is lumpy, so with 4 nodes it is entirely plausible that one arc covers 40 percent of the circle while another covers 10, leaving one server with four times the load of another. The expected size of the largest arc among N random positions is on the order of (ln N)/N rather than 1/N, which means the imbalance gets relatively worse as a fraction of fair share rather than better for small clusters, and no amount of waiting fixes it because the positions are frozen at join time.

The fix is to give each server many positions instead of one. Hash the server to k positions, for example serverA#0 through serverA#199, and let the server own the union of its 200 small arcs, which is the construction everyone calls virtual nodes. Each server's total share is now the sum of many independent random arc lengths, and sums of independent random variables concentrate around their mean, with the relative deviation shrinking roughly as 1 over the square root of k. Concretely, with one position per server the per-server load can easily deviate from fair share by 50 percent or more, while k = 100 brings the standard deviation down to roughly 10 percent of the mean and k = 400 to roughly 5 percent, because the square roots of 100 and 400 are 10 and 20. Virtual nodes also give weighting almost for free, since a server with double the capacity simply registers double the positions. The cost is a larger lookup table, which the sizing section already showed is tens of kilobytes, and a membership change now redraws k small boundaries instead of one, which actually helps, because a leaving node's load disperses across many successors instead of dumping onto one unlucky neighbor.

Replication, hotspots, and bounded loads

Storage systems need each key on more than one machine, and the ring extends naturally to provide it. Store each key on the first R distinct physical servers encountered clockwise from its position, skipping virtual positions that belong to a server already in the list. With R = 3, the key u42 in the first diagram would live on B, C, and D. The skip rule matters because adjacent virtual positions frequently belong to the same physical machine, and a replica set holding two copies on one box protects against nothing. This is exactly the preference-list construction in Dynamo and Cassandra, and it means a node failure leaves every key still reachable on its other replicas while the ring repairs, so a dead server becomes a capacity event rather than a data-loss event.

Hotspots mark the limit of what any placement scheme can fix, because hashing balances key counts rather than key popularity. If one celebrity's profile receives 100,000 reads per second, the server owning that single key is overloaded no matter how evenly the keyspace is spread. Within the placement layer the tool is bounded-load consistent hashing, the variant published by Mirrokni and colleagues, which sets a cap of (1 + ε) times the average load per server, for instance 1.25 times, and walks on to the next server whenever the clockwise owner is already at its cap. Placement stays mostly consistent, no server exceeds the cap, and the analysis shows membership changes still move few keys. Beyond the placement layer the remedies are caching the hot key closer to clients and replicating it explicitly, which are application decisions rather than ring decisions, and saying so in the interview shows where the boundary of the technique sits.

Scaling, failures, and operations

The ring itself scales gracefully because it is just a sorted array recomputed from membership, so 1,000 servers at 200 virtual nodes each is 200,000 entries, a few megabytes per client and a binary search of 18 comparisons. The real scaling question is who maintains the membership list. Cassandra and Dynamo-style stores gossip membership between nodes, meaning each node periodically exchanges its view with random peers until everyone converges, which avoids any central authority at the price of convergence delay. Memcached client rings typically read a static configuration pushed by deployment tooling, which is simpler and changes only when deploys happen. Systems that want strong agreement keep membership in a coordination service such as ZooKeeper or etcd, accepting one more moving part in exchange for a single source of truth. The failure mode to call out is split views, where two clients briefly hold different membership lists and route the same key to different servers, which a cache tolerates as extra misses but a storage system must reconcile with versioning or read repair, and that is why storage systems pair the ring with quorum reads rather than trusting any single placement.

Node failure is deliberately uneventful. When a server dies, its arcs transfer to the next servers clockwise, which with virtual nodes means the load spreads thinly across many machines rather than doubling one neighbor's traffic, and with 200 virtual nodes a dead server's load lands in roughly 200 slivers. Flapping membership, where a node repeatedly leaves and rejoins as health checks disagree, causes repeated small migrations that add up to real churn, so production systems damp membership changes with timeouts before declaring a node truly gone. The last operational rule is to never derive the ring position from anything mutable, such as an IP address that changes on reboot, because a node that silently changes position orphans its old arcs and adopts arcs whose data it does not hold, and stable server identifiers prevent that whole class of problem.

Follow-up questions

  • Why not just keep a key-to-server table instead of hashing? A table gives perfect control but must be stored, replicated, and consulted on every lookup, which makes the table service its own small distributed system, and at billions of keys the table itself is large. The ring computes placement from a few kilobytes of membership data with no lookup service in the path. Some systems, HDFS and many SQL shard maps among them, still choose explicit tables when key counts are small and fine-grained control matters more than simplicity.
  • How many virtual nodes should each server get? Enough that the load imbalance stops mattering, and since the deviation shrinks like 1 over the square root of k, 100 to 200 positions brings imbalance to roughly 5 to 10 percent, with Cassandra's historical default of 256 sitting in the same range. Beyond that point the lookup table keeps growing while the balance barely improves, so the returns diminish quickly.
  • What changes if two clients disagree about membership? The same key routes differently from each client. A cache absorbs that as extra misses until the views converge, while a store must detect the divergence after the fact, which is why ring-based databases add versioning, read repair, and quorum overlap rather than relying on placement alone.
  • Does consistent hashing help with a single hot key? It does not, and saying so plainly matters, because hashing balances key counts rather than popularity. Bounded-load variants cap per-server load by spilling to the next node, and beyond that the fixes are replicating or caching the hot key explicitly, which are decisions made above the placement layer.
  • How does a new node actually get the data for its arc? It streams the arc from the previous owner while that owner continues serving, and then ownership flips atomically in the membership view. With virtual nodes the new server pulls many small slivers from many donors at once, which parallelizes the transfer and keeps any single donor from saturating.
  • Where would you not use consistent hashing? Anywhere range queries matter, because hashing destroys key order and neighboring keys land on unrelated servers. Range-partitioned systems such as Bigtable and HBase split sorted key ranges instead and rebalance by moving range boundaries, trading the ring's cheap membership changes for the ability to scan.

References

  1. Karger et al., Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web (STOC 1997), the original construction.
  2. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007), on the ring, virtual nodes, and preference lists in production.
  3. Mirrokni, Thorup, Zadimoghaddam, Consistent Hashing with Bounded Loads (2016), the (1 + ε) cap variant.
  4. Kleppmann, Designing Data-Intensive Applications (2017), chapter 6 on partitioning and rebalancing.
  5. Xu, System Design Interview, Volume 1 (2020), chapter on consistent hashing.