Rate limiting caps how many requests a client may make in a window, protecting a service from overload, runaway clients, and abuse. The token bucket is the most common algorithm because it bounds the long-run average while still permitting short bursts, which matches how real traffic behaves.
Picture a bucket that refills with tokens at rate $r$ per second up to capacity $b$; each request spends one token, and requests are rejected when the bucket is empty. The two parameters give exactly the two guarantees you want: the bucket allows an instantaneous burst of up to $b$ requests, and over any long interval the admitted rate cannot exceed the refill rate $r$. The current token count is recomputed lazily on each request as $\min(b,\ \text{tokens} + (t - t_{\text{last}})\,r)$, so no background timer is needed.
The algorithm family
Leaky bucket is the dual: it drains at a constant rate and queues or drops overflow, enforcing a strictly smooth output with no bursts. Fixed window counts requests per calendar window and is trivial, but it allows a $2\times$ burst at a boundary, since a client can send a full window at the end of one and another full window at the start of the next. Sliding window log stores each request timestamp for an exact limit at $O(\text{window})$ memory, while the sliding window counter approximates it cheaply by weighting the previous window.
Distributed rate limiting
Per-instance limiting is easy but lets a client exceed the global limit by spreading requests across servers. A global limit needs shared state, typically Redis with an atomic INCR plus EXPIRE (or a small Lua script for token-bucket math), so the count is correct under concurrency. The why is atomicity: without it, two servers can both read the same count and both admit a request that should have been the last.
Step by step
- Refill the bucket lazily based on elapsed time, capped at capacity b.
- If at least one token is available, spend it and admit the request.
- Otherwise reject (return HTTP 429) and let it refill over time.
- For a global limit, keep the counter in a shared store with atomic updates.
import time
class TokenBucket:
def __init__(self, rate, capacity):
self.rate = rate # tokens added per second
self.capacity = capacity
self.tokens = capacity
self.last = time.monotonic()
def allow(self, cost=1):
now = time.monotonic()
self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
self.last = now
if self.tokens >= cost:
self.tokens -= cost
return True
return False
Complexity (time and space)
Each check is $O(1)$ time and $O(1)$ space per client, just a token count and a timestamp. A sliding-window log is $O(1)$ amortized but $O(\text{requests-in-window})$ space, which is why high-volume limiters prefer the token bucket or the counter approximation.
Worked example
A bucket of capacity 2 admits two requests immediately, then rejects until it refills:
tb = TokenBucket(rate=1, capacity=2) # 1 token/sec, holds at most 2
print(tb.allow()) # True (2 -> 1 token)
print(tb.allow()) # True (1 -> 0 tokens)
print(tb.allow()) # False (empty; refills over time)
Follow-up questions
- What two guarantees does a token bucket give? A burst of up to capacity b, and a long-run average admitted rate no greater than the refill rate r.
- Token bucket vs leaky bucket? Token bucket permits bursts up to capacity; leaky bucket enforces a strictly smooth, constant output rate.
- Why does a fixed window allow a 2x burst? A client can send a full window at the end of one period and another at the start of the next, doubling the rate across the boundary.
- How do you rate limit across many servers? Keep the counter in a shared store like Redis with atomic INCR and EXPIRE (or a Lua script) so the global limit holds under concurrency.
- What status code should a limited request get? HTTP 429 Too Many Requests, ideally with a Retry-After header.
References
- Alex Xu, ByteByteGo System Design (vol. 1 and 2).
- Kleppmann, Designing Data-Intensive Applications (2017).