Design a distributed rate limiter

A rate limiter decides, for every incoming request, whether the caller has exceeded an allowed budget such as 100 requests per minute, and it rejects the excess before any of it reaches the rest of the system. Every public API ends up needing one. A single misconfigured client retry loop can generate as much traffic as the rest of the user base combined, a scraper can vacuum an entire catalog at line speed, and a credential-stuffing attack, in which stolen username and password pairs are replayed against a login endpoint, can hammer that endpoint until the database behind it falls over. Rate limiting also enforces fairness between paying tiers, since a customer on the free plan should not crowd out one paying for guaranteed throughput, and it caps the cloud bill for endpoints that fan out into expensive downstream work, where every uncontrolled request multiplies into real money.

Interviewers like this question because the single-machine version is a pleasant data structures exercise while the distributed version forces trade-offs that have no free answer. Counters must be shared across many limiter nodes without making every request wait on coordination, the arithmetic of each counting algorithm has visible product consequences at window boundaries, and the behavior of the limiter when its own counter store is down has to be chosen deliberately because every available choice hurts someone. The difficulty concentrates in three places, namely the counting algorithm, the correctness of counters under concurrency, and how strictly the limit must hold while the system is degraded, and I will work through each in turn.

Scope and requirements

I would first pin down what is being limited and where, because the phrase rate limiter covers everything from a network traffic shaper to a per-user API quota, and those are different designs. The brief here is a server-side limiter for an API platform, one that enforces configurable rules keyed by user ID, API key, IP address, or endpoint. A single request may be subject to several rules at once, for example 10 requests per second per user alongside 1,000 per hour per IP, and it is admitted only when every matching rule still has budget, since each rule expresses an independent protection and the strictest should win. Clients that exceed a limit should receive a clear rejection that says when to retry rather than a silent drop, because silence teaches client authors to retry blindly and worsen the very overload the limiter exists to prevent. Rules must also be editable without redeploying services, since the moment an operator most needs to tighten a limit is mid-incident, when a deploy is the slowest tool available.

The non-functional requirements are what actually shape the design. The limiter sits on the hot path, meaning the code that runs on every request, so it must add very little latency, ideally under a couple of milliseconds against a typical API budget of around 100 milliseconds. It must itself scale to the full traffic of the platform, since by definition it sees everything, and a limiter that falls over at peak protects nothing. Memory matters because tracking millions of distinct callers with heavyweight state gets expensive fast, so each per-caller record should be tens of bytes rather than kilobytes. Accuracy can be slightly relaxed, and the asymmetry is worth stating out loud. Briefly admitting a few percent over the limit during a failure or a synchronization gap is usually acceptable, whereas rejecting legitimate traffic under the limit is worse, because over-admission costs a sliver of capacity while a false rejection breaks a paying customer's workflow. Finally, the limiter must not become a new single point of failure for the API it protects.

Sizing the problem

Concrete numbers keep the design grounded. Suppose the platform serves 1 billion requests per day. A day holds 86,400 seconds, and 1 billion divided by 86,400 comes to about 11,600 requests per second on average, while peaks of five times the average are common, so the limiter must make roughly 58,000 decisions per second at the worst moment of the day. Each decision amounts to one or two operations against a counter store, since the check and the update can be fused into a single call, which puts the store at about 100,000 small operations per second. A single large Redis node sustains on that order of throughput for simple commands, so a small cluster carries the load with headroom and nothing exotic is required.

Memory stays modest because per-caller state is tiny, and working the arithmetic shows why one algorithm is an outlier. A token bucket needs a token count and a timestamp, around 50 bytes once the key string and store overhead are included. If 10 million distinct callers are active in any given hour and each matches two rules, that is 20 million counters at 50 bytes apiece, about 1 GB, which fits comfortably in memory on one machine. The most expensive algorithm discussed below keeps a timestamp for every individual request rather than one small record per caller. At 10 requests per minute per caller, 10 million active callers, and 8 bytes per timestamp, the bill comes to roughly 800 MB for each minute of window, and a five-minute window pushes that toward 4 GB of state which must also be scanned and trimmed on every decision. That gap between 1 GB in total and 800 MB per minute is why the precise algorithm is reserved for low-volume, high-value endpoints.

The interface

From the client's point of view the limiter is invisible until it triggers, so the contract that matters most is the rejection. The convention is HTTP status 429 Too Many Requests, defined in RFC 6585, together with a Retry-After header saying how many seconds to wait, plus a set of X-RateLimit headers on every response so well-behaved clients can pace themselves before they ever hit the wall. The status code matters because the alternatives mislead. Returning 403 Forbidden would conflate throttling with authorization, and returning 503 Service Unavailable would tell load balancers and monitoring that the service itself is unhealthy when it is working exactly as designed. A 429 says precisely what happened, namely that the caller sent too much, and Retry-After says what to do about it.

GET /api/v1/orders
Authorization: Bearer ...

→ 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 23
X-RateLimit-Reset: 1714066920

GET /api/v1/orders          (the 101st request this minute)

→ 429 Too Many Requests
Retry-After: 37
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
{ "error": "rate_limited", "retry_after_seconds": 37 }

Publishing the remaining budget is a real design choice rather than decoration, because clients that can watch X-RateLimit-Remaining fall will back off voluntarily, which converts hard rejections into smooth self-throttling. A client library can pace itself from the headers long before the first 429 arrives, and support load drops because developers can see their own consumption instead of guessing at it. Without the headers every limit is discovered by colliding with it, and the natural response to a mysterious rejection is an immediate retry, exactly the traffic pattern the limiter exists to suppress.

Rules and counter state

Two kinds of data live in this system, and they have opposite shapes, which is why they end up in different stores. Rules are small, read constantly, and change rarely, so they belong in a durable relational store and are cached inside every limiter node with a refresh every few seconds or a push on change, since a rule lookup must never cost a network hop on the hot path. Counters are the opposite in every dimension, huge in cardinality, written on every request, and worthless once the window passes, so they belong in an in-memory store keyed by rule and caller, each entry carrying a time-to-live, an expiry the store enforces automatically so stale entries clean themselves up.

CREATE TABLE rate_limit_rules (
  id          BIGSERIAL PRIMARY KEY,
  scope       TEXT NOT NULL,        -- 'user' | 'ip' | 'api_key'
  endpoint    TEXT NOT NULL,        -- 'POST /v1/login' or '*'
  capacity    INT  NOT NULL,        -- bucket size, e.g. 100
  refill_per_sec NUMERIC NOT NULL,  -- e.g. 1.67 for 100/minute
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Counter keys in Redis, expiring automatically:
--   rl:{rule_id}:{caller}  ->  { tokens: 23, ts: 1714066883.21 }

Placement is the other scoping decision, and three candidates usually come up. Limiting purely in the client SDK amounts to courtesy rather than enforcement, because attackers do not run your SDK. Limiting inside each backend service does work, and sometimes survives as defense in depth, but it scatters rule logic across teams, spends backend CPU on requests that were going to be rejected anyway, and leaves nobody with a single view of who is being throttled. The strongest placement is the API gateway, the single front door that already terminates connections and authenticates callers. A limiter middleware there sees every request with the caller's identity attached, rejects excess traffic before it consumes anything downstream, and keeps the rules in one place where operators can actually manage them.

The high-level architecture

The shape that falls out is small. Clients reach an API gateway, the layer that routes external requests to internal services, and a limiter middleware inside the gateway consults a rules store for configuration and a Redis cluster for live counters. Redis is an in-memory key-value store chosen for three properties, namely sub-millisecond operations, built-in key expiry, and server-side scripting that makes multi-step updates atomic. Memcached offers comparable speed but lacks the scripting and per-key data structures, and a relational database offers durability that counters do not need at a latency the hot path cannot afford. Requests under the limit continue to the protected services, and the rest never leave the gateway.

The limiter middleware in the gateway checks Redis on every request and forwards only the traffic under the limit, while rules are fetched rarely and cached locally. Dashed arrows mark flows that are periodic or conditional.

Choosing the algorithm

Five algorithms cover the design space, and their differences show up at window boundaries and in memory cost, so each deserves a worked number rather than a name-drop.

The token bucket is the workhorse. Picture a bucket that holds at most 100 tokens and refills at 100 tokens per minute, which works out to 1.67 tokens per second, where each request removes one token and is rejected if the bucket is empty. A caller idle for a while accumulates a full bucket and may burst 100 requests at once, then sustain 1.67 per second, and that profile matches how real clients behave, quiet for minutes and then a flurry when a page loads. The state is two numbers per caller, the token count and the time of the last refill, and the refill needs no background job because it is computed lazily on the next request. If 30 seconds have passed, the limiter adds 30 times 1.67, which is 50 tokens, caps the result at the capacity, and proceeds, so an idle caller costs nothing at all.

The leaky bucket is the same metaphor turned inside out, since requests enter a queue of bounded size and drain from it at a fixed rate, making the output perfectly smooth no matter how bursty the input gets. Such smoothness is exactly right when the protected resource cannot absorb bursts at all, a payment processor with a strict contractual call rate being the classic case. The price is that a burst of legitimate requests waits in the queue even when the system has spare capacity, so users feel latency a token bucket would never impose, and I would choose it only when the downstream contract genuinely demands a smooth rate.

The fixed window counter is the simplest of the five, keeping one counter per caller per clock window, incremented on each request and compared against the limit. Its flaw lives at the boundary between windows. With a limit of 100 per minute, a caller can send 100 requests in the closing second of one window and 100 more in the opening second of the next, and both windows are individually legal, yet 200 requests landed within two seconds, double the intended rate over that span. An attacker who knows where the window edges fall can schedule traffic to exploit exactly this, so for abuse protection the boundary is a real hole rather than a rounding error.

The sliding window log repairs the boundary exactly by storing the timestamp of every request and counting how many fall within the trailing window, so the count is the literal truth rather than an estimate. The expense is equally literal, since the sizing section showed timestamps for millions of callers running to hundreds of megabytes per minute of window, and every decision must scan or trim a per-caller sorted list. The log earns its cost only where precision matters more than volume, such as login or password reset, where the request rate is naturally low and every over-admission hands an attacker one more free guess at a password.

The sliding window counter is the practical compromise, blending the two adjacent fixed windows with a weighted interpolation. Keep the counts for the previous and current fixed windows, and estimate the trailing-window count as the current count plus the previous count scaled by how much of the previous window still overlaps the trailing window. With a limit of 100 per minute, suppose the previous minute saw 84 requests and we are 15 seconds into the current minute with 36 so far. The trailing 60 seconds still overlaps 45 of the previous window's 60 seconds, so the estimate is 84 × (45/60) + 36, which is 63 + 36 = 99, just under the limit, and the next request is the last one admitted. The estimate assumes requests were spread evenly across the previous window, which is rarely exact, but when Cloudflare deployed this scheme across millions of domains the measured error was small, and the state remains two integers per caller, so the compromise buys boundary protection at fixed-window prices.

I would recommend the token bucket as the default for its burst friendliness and tiny state, with the sliding window counter wherever smooth enforcement over a window is the product contract. The decision flow for a single token bucket check looks like this.

In step 1 the request reaches the gateway, and in step 2 the limiter invokes a Lua script against the caller's bucket key. During step 3 the script refills tokens for the elapsed time, tries to take one, and returns the result atomically. Step 4 sends admitted requests on to the service, while step 5 answers rejected ones with 429 and Retry-After without ever touching the service.

Making the counter atomic

With many gateway nodes sharing one Redis, the naive read-modify-write sequence has a race condition, which is the failure mode where two concurrent operations interleave and produce a result neither intended. Suppose the bucket holds one final token and two gateway nodes each run a GET at nearly the same instant. Both see 1, both decide to admit, and both run a DECR, so both requests pass and the counter lands at minus one, even though neither node's logic was wrong in isolation. Under real load this interleaving happens constantly, and the over-admission grows with the number of concurrent nodes, so the limit quietly loosens at exactly the moment traffic is heaviest.

The fix is to push the whole decision into the store and execute it as one indivisible unit. Redis runs a Lua script as a single uninterrupted operation, so the refill arithmetic, the comparison, and the decrement happen with no other command interleaved. Optimistic transactions built from WATCH and MULTI could also detect the conflict, but they respond by retrying, and retries under heavy contention on a hot key can loop many times, wasting round trips precisely when the system is busiest, whereas a script succeeds on its first attempt every time. The entire limiter core fits in about a dozen lines.

# Lua, executed atomically inside Redis (shown with Python-style comments)
# KEYS[1] = bucket key; ARGV = capacity, refill_per_sec, now
local b = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(b[1]) or capacity
local elapsed = now - (tonumber(b[2]) or now)
tokens = math.min(capacity, tokens + elapsed * refill_per_sec)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], 2 * capacity / refill_per_sec)
return { allowed, tokens }

The EXPIRE call at the end carries the memory story, because a caller who goes quiet has their key deleted automatically once enough time has passed that the bucket would have refilled anyway, so the working set tracks active callers rather than every caller ever seen. Setting the expiry to twice the full-refill time leaves margin so a key never vanishes while it still encodes useful state. For the simpler fixed and sliding window counters, a plain INCR with an expiry achieves the same atomicity without any script at all, which is why the Redis documentation presents rate limiting as the canonical INCR pattern.

Many limiter nodes, one budget

A single shared Redis gives exact global counting at the cost of one network round trip per decision, roughly 0.5 to 1 millisecond within a datacenter, usually an acceptable tax. When it is not, or when limiter nodes span regions and the round trip stretches to 50 milliseconds, the alternative is local counting with asynchronous synchronization, where each gateway node counts in its own memory and periodically reconciles with the shared store, so decisions cost nothing on the wire and the global count becomes eventually consistent, meaning nodes converge on the true total only after a delay.

The cost of that delay is bounded and worth computing out loud. With 10 gateway nodes syncing every 100 milliseconds and a caller blasting requests at every node at once, each node can keep admitting for up to 100 milliseconds after the global budget is actually spent. At a limit of 100 per minute and an attacker pushing 1,000 requests per second spread evenly, each node sees 100 requests per second and may admit about 10 extras during its blind window, so the platform briefly admits on the order of 100 extra requests, doubling the per-minute budget exactly once before every node learns the bucket is empty. For protecting backend capacity that overshoot is noise. For billing enforcement or login protection it is not, which is why precision-critical rules should stay on the synchronous shared-store path while high-volume coarse rules can ride the local-counter path, with that flag carried explicitly in the rule configuration. A middle ground used by several commercial gateways keeps the shared store authoritative but answers from a local cache whenever Redis latency spikes, accepting a moment of imprecision to protect tail latency.

Sticky routing deserves a mention because it shrinks the problem rather than solving it. If the load balancer consistently routes each caller to the same gateway node, then that node's local counter is the global truth for that caller and no synchronization is needed at all. The approach pairs naturally with consistent hashing on the caller ID, and its weakness is exactly the rebalancing moment, when a node dies or the fleet scales and callers land on fresh nodes whose counters start empty, briefly handing everyone a reset budget, a caveat worth naming before an interviewer does.

A latency budget for one decision

It helps to walk the full path of a single admitted request and account for the time, because the limiter's value depends on staying imperceptible. The request arrives at a gateway that has already terminated TLS and authenticated the caller, so finding the applicable rules is a probe into the local rule cache costing single-digit microseconds. The Redis call is the only network hop, 0.5 to 1 millisecond round trip within a datacenter, of which the Lua script accounts for perhaps 50 microseconds on the Redis CPU. Altogether the decision adds about a millisecond to a request that will typically spend 50 to 200 milliseconds in the services behind the gateway, a tax of one or two percent that no user can perceive.

The tail of the distribution is where the budget gets interesting. Redis can stall for tens of milliseconds when it forks to write a snapshot or when one slow command blocks its single-threaded event loop, and at 58,000 decisions per second even a 10 millisecond stall queues nearly 600 decisions behind it. The defense is a strict client-side timeout on the limiter's Redis call, around 5 milliseconds, paired with a predetermined answer for what happens when it fires, since the worst outcome would be the protection layer becoming the latency problem. On timeout the middleware applies the per-rule degraded policy described below, so a slow counter store costs a moment of imprecision, never a stalled request. Keeping the counters on their own Redis cluster, separate from any general-purpose cache, prevents another team's expensive query from spending the limiter's latency budget.

Scaling, failures, and operations

Each tier scales by a different rule. Gateway nodes are stateless with respect to limiting, since the counters live in Redis, so they scale horizontally behind the load balancer. Redis scales by sharding the counter keys across a cluster, and because each key is touched only by one caller's requests, the keys distribute evenly with no cross-shard operations; at 58,000 peak decisions per second, three modest shards with replicas carry the load with headroom. The rules store barely registers, since rules are cached in every node and refreshed on the order of seconds, so its query load depends on node count rather than traffic.

The failure that matters most is Redis becoming unreachable, and the design must pick its posture in advance, because improvising during the outage is how a partial failure turns into a total one. Failing open means admitting all traffic unlimited while the store is down, which preserves availability but drops the shield at the precise moment an attack may be what caused the trouble. Failing closed means rejecting everything, which is safe for the backends and terrible for users, who experience a hard outage of an API that is otherwise healthy. The defensible answer is to fail open for general traffic with an immediate page to operations, and to fall back to conservative local-only counters for abuse-sensitive endpoints such as authentication, with the posture stated per rule in configuration. A local in-process token bucket holding a small fraction of the global budget caps the worst case while the store recovers, so from the user's side the platform merely becomes slightly stricter for a few minutes rather than broken.

Operationally, the limiter needs its own observability, including decisions per rule, the rejection rate over time, and Redis latency percentiles, because a rising rejection rate is often the first visible symptom that some customer's retry logic has gone wrong, and catching that on a dashboard beats catching it in a support escalation. Rule changes should roll out in shadow mode first, logging what would have been rejected without rejecting anything, since an operator typo in a capacity field is otherwise a self-inflicted outage. Hot keys are the remaining wrinkle, because a rule keyed on one shared resource concentrates all of its traffic on whichever Redis shard owns that single key, so global rules should be split into per-node allowances that sum to the global budget.

Follow-up questions

Why not enforce limits in each backend service instead of the gateway? Enforcing only in the backends spends their CPU on requests that were going to be rejected anyway, duplicates rule logic across teams, and gives up the single place to observe and tune limits. Service-level limits still earn their keep as a second line of defense for internal callers that bypass the gateway, so the practical arrangement is gateway first with services as backstop.
Token bucket or sliding window counter? I would reach for the token bucket when bursts are a feature for clients and per-caller state must stay minimal, and for the sliding window counter when the product contract is a smooth ceiling over a window with no boundary games. The fixed window is acceptable only when the boundary burst, which can briefly double the rate across a window edge, is genuinely harmless to whatever sits downstream.
What does the Lua script buy over INCR? Atomicity for multi-step logic is the purchase, since refill, comparison, and decrement execute as one uninterrupted unit, eliminating the race where two nodes both see one remaining token and both admit. A bare INCR suffices only when the whole algorithm is a single increment, as for the window counters.
How would you rate limit by IP behind a corporate NAT? With real care, because network address translation lets thousands of users in one office share a single public address, so a strict per-IP rule punishes all of them for one person's behavior. I would prefer authenticated identity whenever it exists, set IP limits high enough to clear shared egress points, and reserve IP rules for unauthenticated abuse, combined with other signals before anything gets blocked.
Fail open or fail closed when Redis is down? The posture should be chosen per rule rather than globally. General traffic fails open with degraded local counters and an alert, while authentication and payment endpoints fail closed or drop to a strict local budget, because the cost of over-admission there is account takeover or fraud rather than a little extra load.
How do you avoid punishing clients with rejected-then-retried storms? Returning Retry-After and the X-RateLimit headers gives clients what they need to pace themselves, documenting jittered exponential backoff in the SDK makes good behavior the default, and for paying tiers it can be worth briefly queueing slightly-over-limit requests at the edge in a small leaky bucket, so the customer experiences smoothing instead of errors.

References

Stripe Engineering, Scaling your API with rate limiters (2017), on token buckets and load shedders in production.
Cloudflare, How we built rate limiting capable of scaling to millions of domains (2017), on the sliding window counter and its measured accuracy.
Xu, System Design Interview, Volume 1 (2020), chapter on rate limiter design.
Redis documentation, INCR, which presents rate limiting as the canonical atomic counter pattern.
IETF, RFC 6585: Additional HTTP Status Codes (2012), defining 429 Too Many Requests.