Design a distributed cache

Systems design · Distributed building blocks · Mar 2025

A cache keeps copies of frequently read data in memory so requests can skip the slower system behind it, and the numbers explain why every large service runs one. A query that reads from a disk-backed database typically costs 10 to 20 milliseconds once disk seeks and query execution are paid, while fetching the same bytes from a memory cache over the local network costs a few hundred microseconds, which makes the cache roughly a hundred times faster. The protection effect matters as much as the speed, because at a 95 percent hit rate only 5 percent of reads ever reach the database, so the database carries one twentieth of the read load it otherwise would, and that ratio is frequently the difference between a database tier that needs 3 machines and one that needs 60. The cache, in other words, is not merely an accelerator but the thing standing between the database and its own traffic.

Interviewers ask the question because a cache looks like a hash table until it is distributed, and then every classic distributed systems problem appears in miniature. Keys must spread across nodes without reshuffling whenever membership changes, something must be evicted when memory fills, ten thousand requests can miss the same key in the same instant, and stale data must be either tolerated knowingly or chased out deliberately. The hard parts live in the failure and contention cases rather than the happy path, and a strong answer spends its time there, because the happy path of a cache genuinely is just a hash table.

Scope and requirements

Functionally the cache offers get, set with a time-to-live, and delete, over string keys and binary values up to perhaps 1 MB, shared by many application servers rather than private to one process. It works as a look-aside accelerator for a database that remains the source of truth, which means the cache is allowed to lose data on a crash, since everything in it can be rebuilt from the database at the cost of some reads. What it is not allowed to do is serve wrong answers indefinitely or fall over in a way that takes the database down with it, and those two prohibitions quietly generate most of the design below.

The non-functional requirements follow from the job. Access must be sub-millisecond at the median, since a cache slower than that stops being worth the extra network hop. Throughput must reach hundreds of thousands of operations per second across the cluster, and capacity must exceed one machine's memory, so the data has to shard. Availability matters in an unusual way, because a cold cache does not merely fail its own users but redirects its full read load onto the database at the worst possible moment. Behavior must also stay predictable under the pathological access patterns, the hot keys, the stampedes, and the floods of misses, that production traffic eventually produces, because every one of them arrives eventually and each has a name precisely because it has ruined many an on-call night.

Sizing the problem

Take a service handling 100,000 page requests per second at peak, where rendering a page makes on average 5 cache reads, which puts 500,000 gets per second against the cluster. A single well-run cache node sustains on the order of 100,000 operations per second, so 5 to 6 nodes cover the throughput before memory even enters the picture. At a 95 percent hit rate the misses run at 25,000 per second, which is the read load the database must absorb in steady state, and it is worth pausing on how fragile that figure is, because a hit rate slipping quietly from 95 to 90 percent doubles the database's read traffic without any deploy or incident to point at. Hit rate is the cache's single most important health metric, and the alert on it deserves to page a person.

Memory follows from the working set, the subset of data that is actually hot. Suppose the catalog holds 500 million objects averaging 2 KB, which is 1 TB in total. Caching everything is possible but wasteful, because access skews heavily, and if 10 percent of the objects receive essentially all of the day's reads, the working set is 50 million objects at 2 KB, about 100 GB, which a cluster of 4 nodes with 32 GB devoted to cache each holds with headroom. Getting the estimate wrong in the small direction shows up directly in the hit rate, because a cache smaller than its working set evicts entries that are still hot, each such eviction converts a future hit into a miss, and the database quietly inherits the difference.

The interface and access patterns

The client API is tiny, and showing the canonical usage pattern matters more than the signatures, because the pattern determines the consistency story. Cache-aside, the default, puts the application in charge of the dance, where a read tries the cache first, falls through to the database on a miss, and refills the cache on the way back:

def get_product(pid: str) -> Product:
    key = f"prod:{pid}"
    if (v := cache.get(key)) is not None:
        return v                          # hit
    row = db.query("SELECT ... WHERE id = %s", pid)
    cache.set(key, row, ttl=300)          # refill, 5 min TTL
    return row

def update_product(pid: str, fields: dict):
    db.update(pid, fields)                # source of truth first
    cache.delete(f"prod:{pid}")           # invalidate, next read refills

Three alternatives shift the work around, and each loses to cache-aside for this scope in a specific way. Read-through moves the miss handling into the cache itself, which fetches from the database on the application's behalf, so the application code gets simpler while the cache layer needs a loader plugged in, and since the behavior is otherwise cache-aside's, the choice is mostly about where the code lives. Write-through updates the cache synchronously on every write as well as the database, which keeps the cache permanently fresh, and pays for that freshness by slowing every write with a second synchronous hop and by spending memory on items that may never be read at all. Write-behind acknowledges the write after updating only the cache and flushes to the database asynchronously, which gives the fastest writes and absorbs bursts beautifully, and in exchange a cache node crash loses writes that were already acknowledged to users, so it is defensible only when the data is reconstructible or the loss window is engineered down to almost nothing. For the look-aside accelerator scoped here, cache-aside with delete-on-write is the recommendation, and the eviction and invalidation sections below carry the rest of its consistency story.

Topology: how keys spread across nodes

Once the data exceeds one node, some layer must map each key to a shard, and there are three places to put that logic, each with a different operational personality. Client-side sharding builds the map into the client library, so each application server hashes the key with consistent hashing, the scheme where nodes and keys share a hash ring and a membership change moves only about one Nth of the keys, and then talks directly to the owning shard. This is the Memcached lineage, standardized by the Ketama library, and it adds zero extra network hops, while its weakness is that every client must hold a correct and current view of the node list, which makes rolling topology changes delicate because a client with a stale view quietly reads and writes the wrong shard. A proxy topology inserts a small router, Twemproxy being the well-known example, so clients speak to the proxy and the proxy owns the sharding map, which keeps clients simple and localizes topology changes to the proxy fleet, at the price of one added network hop on every operation and a new tier that must itself be deployed, monitored, and scaled. A cluster protocol builds the routing into the cache nodes themselves, as Redis Cluster does, where the keyspace divides into 16,384 slots, each node owns a range of slots, a key's slot is its hash modulo 16,384, and a client that asks the wrong node receives a redirect naming the right one, which smart clients remember, so the steady state has no extra hops and the cluster can migrate slots between nodes while serving traffic. For a new build today I would pick the cluster protocol and say that client-side sharding remains perfectly respectable wherever the operational tooling around it already exists.

App serverscache client libraryShard 1 primarykeys in hash range 1Shard 2 primarykeys in hash range 2Shard 3 primarykeys in hash range 3Replica 1Replica 2Replica 3hash(key) picks shardreplicate

The client library hashes each key to one of three shards with consistent hashing, and each shard streams its writes asynchronously to a replica that can be promoted if the primary dies.

Eviction: choosing what to forget

Memory is the budget, so the cache must constantly decide what to discard, and the default policy nearly everywhere is least recently used, which evicts the entry that has gone longest without being touched, on the theory that recent access predicts future access. The classic implementation is worth being able to draw, because it achieves constant time for every operation. A hash map gives direct access to entries, the same entries are threaded onto a doubly linked list ordered by recency, every get unlinks the touched entry and reattaches it at the head, and eviction pops from the tail. Both structures update in a constant number of pointer operations, which is why a cache can afford to run this bookkeeping on every single request it serves. Production systems shave the cost further, and Redis, for instance, approximates LRU by sampling a handful of keys and evicting the least recent among the sample, which behaves close to true LRU at a fraction of the memory overhead because it skips maintaining the global list entirely.

LRU has a known weakness in the face of scans, because a burst of one-time reads, such as a batch job walking the entire catalog, floods the recency list and evicts the genuinely hot set in favor of entries nobody will ever read twice. Least frequently used eviction resists that by counting accesses and evicting the coldest by count, and it pays with heavier bookkeeping and slow adaptation when popularity shifts, since an item that was hot last week retains its high count and squats on memory this week. Hybrids with decaying counters split the difference, and they are what a careful operator reaches for when scan traffic is a fact of life. Time-to-live is the orthogonal mechanism rather than a competitor, because every entry carries an expiry chosen at write time, which bounds staleness no matter what eviction is doing. The working-set arithmetic from the sizing section is what keeps eviction in its lane, since a cache that comfortably fits the hot 100 GB evicts only the cold tail, while a cache that does not fit churns the hot set itself and pays the difference in hit rate.

Hot keys, stampedes, and penetration

Sharding spreads the keyspace, not the popularity. When a celebrity posts, one key can draw hundreds of thousands of reads per second, all hashing to the same shard, and a shard that sustains 100,000 operations per second has just become the whole system's ceiling. The fixes stack in layers. Request coalescing inside each application server collapses concurrent identical lookups into one outbound request, so a hundred threads waiting on the same key produce one network call rather than a hundred. A small local in-process cache holding the few hottest keys for a second or two answers most reads before they touch the network at all, trading a bounded second of staleness for an enormous reduction in load, and the trade is nearly always right because a key hot enough to need this treatment is being read far too often for one second of staleness to matter. If the key is still too hot after that, replicating it under derived names, the original key plus a random suffix from a small range with every replica readable interchangeably, spreads one key's load across several shards at the cost of multiplying its invalidations by the number of suffixes.

The cache stampede, also called the thundering herd, is the sharper version of the same emergency. A popular key expires, every request that misses it simultaneously queries the database and tries to refill, and the database receives in one instant exactly the load the cache existed to absorb. The arithmetic is sobering, because with 10,000 requests per second against a key whose recompute takes 200 milliseconds, an unprotected expiry sends about 2,000 identical queries into the database before the first refill lands. Per-key locking is the direct defense, where the first request to miss acquires a short-lived lock on that key, in Redis a SET with the NX flag, which sets only if the key does not already exist, then performs the single database read and refill and releases, while everyone else either waits a few milliseconds and retries the cache or, better, is served the just-expired stale value while the refresh proceeds. Probabilistic early refresh removes the cliff altogether by having each reader volunteer, with a probability that rises as expiry approaches, to refresh the entry before the deadline, so hot keys practically never expire cold, and the published version of the idea weights the probability by the recompute cost and provably minimizes stampedes for the noise it adds.

App serversconcurrent missesCache shardhot key just expiredDatabase12SET NX lock, one winner345

Thousands of requests miss the expired hot key at once in step 1, and in step 2 each tries to acquire the per-key lock, which exactly one wins. The winner alone queries the database in step 3, refills the key with a fresh TTL and releases the lock in step 4, and in step 5 the waiters are served the refilled value, or the stale value during the refresh, while the database saw one query instead of thousands.

Cache penetration is the third pathology, consisting of requests for keys that exist nowhere, whether from typos or from attackers probing random IDs, and such requests miss the cache by definition and hit the database every single time. Negative caching stores the absence itself as a short-TTL not-found entry, so repeated probes for the same missing key stop at the cache, and the short TTL keeps a key that later comes into existence from being shadowed by its own recorded absence for long. Against attackers spraying unique keys, where negative caching never gets a second hit on anything, a Bloom filter in front of the database earns its keep, which is a compact probabilistic structure that answers definitely-absent for most keys never written while letting an occasional false positive through to the database, where it costs one harmless miss.

Invalidation and the TTL backstop

Cache-aside's consistency story is delete-on-write, where the application updates the database first and then deletes the cache key, so the next reader misses and refills from fresh data. Deleting rather than setting the new value avoids the race in which two concurrent writers set their values in the wrong order and the older one wins, but a thinner race survives, where a reader misses, reads the database just before a writer commits, and refills the cache just after the writer's delete, pinning the stale value in place until something removes it. Fully closing that window requires versioned keys or coordination machinery that costs more than it saves for most workloads, and the practical posture is to accept the small window while bounding its damage with TTLs. Every entry gets one, even entries believed to be explicitly invalidated, because the TTL is the backstop guaranteeing that no failure mode, whether a missed delete, a crashed writer, or the race above, can outlive a few minutes. Choosing the TTL is then a product decision stated in seconds, asking how stale a price, a profile, or a permission may acceptably be, and since the answer differs per key family, the TTL belongs in the code at each call site rather than in a global configuration nobody revisits.

A latency budget for one get

Following a single get through its budget keeps the sub-millisecond claim from being hand-waving. The application serializes a small request, which costs single-digit microseconds, the request crosses the local network to the owning shard in 50 to 200 microseconds each way on a modern data center fabric, and the shard's own work, a hash lookup plus the LRU bookkeeping, costs a few microseconds more. A hit therefore completes in roughly half a millisecond, and the budget exposes the real lever, which is that the network round trip dominates everything else, so batching several keys into one request and pipelining requests over a connection buy far more than any server-side tuning ever will. A miss adds the database query at 10 milliseconds or more plus the refill set, which is why a page that misses five times feels visibly slower than a page that hits five times, and why pages should issue their cache reads concurrently rather than one after another, paying one round trip of latency instead of five.

The tail deserves its own attention, because a cluster serving 500,000 gets per second produces a large absolute number of slow requests even at tiny percentages. The usual tail culprits are a garbage collection pause on a cache node, a momentarily saturated network link, or one overloaded shard, and the defense is a short client timeout, on the order of 20 to 50 milliseconds, after which the application treats the get as a miss and proceeds to the database, so a slow cache node degrades a sliver of the traffic instead of stalling all of it. That fail-open posture carries a sharp edge worth naming, since timeouts that fire too eagerly during a brief blip can themselves shovel load onto the database, which is why the timeout pairs with a circuit breaker, a component that notices a shard failing repeatedly and stops sending it traffic for a cooling-off period rather than letting every request rediscover the problem independently.

Scaling, failures, and operations

Scaling out means adding shards, and consistent hashing or slot migration keeps the reshuffle proportional to the capacity added rather than to the total, while the operational caution is that every moved key arrives cold, so large rebalances should be paced to let the hit rate recover between steps rather than executed in one heroic night. Each shard pairs with a replica receiving asynchronous updates, and when a primary dies, promotion by Redis Sentinel, the cluster protocol, or the proxy's health checks puts a warm copy in service within seconds. The alternative of restarting empty explains why replication exists at all for a system that is allowed to lose data, because a cold shard at peak traffic converts its entire throughput into database load. The sizing arithmetic makes that vivid, since one cold shard out of four turns roughly 125,000 gets per second into misses, quintupling the database's read load until the shard rewarms, so availability work in a cache is really database-protection work wearing a different name.

A full cache restart after a datacenter power event is the worst case of the same story, and mature operations pre-warm the cluster from a snapshot or replay recent keys before admitting traffic, because admitting traffic to an empty cluster simply schedules the database's collapse for a few seconds later. The metrics that matter day to day are the hit rate per key family, the eviction rate, where rising evictions alongside a falling hit rate mean the working set no longer fits, per-shard operation counts to spot hot keys before they become incidents, and the database's read queries per second, which is the cache's report card rendered on someone else's dashboard.

Follow-up questions

  • Cache-aside or write-through? Cache-aside keeps the cache optional and spends memory only on what is actually read, at the cost of a stale window managed by delete-on-write and TTLs, while write-through keeps the cache always fresh but slows every write and fills memory with entries that may never be read. The default is cache-aside, and the exception is a workload where reads must never see staleness that a TTL cannot bound.
  • Why is LRU the default, and when is it wrong? It runs in constant time, it is simple to reason about, and recency predicts future access for most workloads. It goes wrong under scans, which flood the recency list with entries nobody reads twice, and under stable popularity skews, where LFU or a sampled hybrid protects the truly hot set better.
  • How do you protect one key receiving 500,000 reads per second? Coalesce duplicate in-flight requests inside each app server, serve the key from a one-second local in-process cache, and if that still is not enough, replicate the key under suffixed names across several shards. The owning shard stops being the ceiling once most reads never reach it.
  • What is the difference between a stampede and penetration? A stampede is many concurrent misses on a key that exists, addressed with per-key locks, serving stale values during refresh, and early refresh, while penetration is misses on keys that exist nowhere, addressed with negative caching and a Bloom filter. Both overload the database, just through different doors, which is why the defenses do not interchange.
  • Is the cache a source of truth for anything? Not in this design, because every entry must be reconstructible from the database, and that property is what licenses eviction, crash tolerance without persistence, and TTLs. Write-behind designs that buffer acknowledged writes change the contract and need durability machinery this design deliberately avoids.
  • Why not make cache and database consistent with transactions? Distributed transactions across a cache and a database would cost latency and availability on every single write to close a staleness window that TTLs already bound for pennies. The proportionate tool is versioned keys for the few key families that genuinely need them, rather than global coordination imposed on everyone.

References

  1. Nishtala et al., Scaling Memcache at Facebook (NSDI 2013), on look-aside caching, leases, and hot keys at extreme scale.
  2. Redis documentation, SET, the NX option behind per-key locks.
  3. Twitter, Twemproxy, the proxy sharding topology.
  4. Xu, System Design Interview, Volume 1 (2020), on cache patterns within its system designs.