Rate limiting · Stanley Jacob

Rate limiting caps how many requests a client may make in a window, protecting a service from overload, runaway clients, and abuse. The token bucket is the most common algorithm because it bounds the long-run average while still permitting short bursts, which matches how real traffic behaves.

Picture a bucket that refills with tokens at rate $r$ per second up to capacity $b$; each request spends one token, and requests are rejected when the bucket is empty. The two parameters give exactly the two guarantees you want: the bucket allows an instantaneous burst of up to $b$ requests, and over any long interval the admitted rate cannot exceed the refill rate $r$. The current token count is recomputed lazily on each request as $\min(b,\ \text{tokens} + (t - t_{\text{last}})\,r)$, so no background timer is needed.

The algorithm family

Leaky bucket is the dual: it drains at a constant rate and queues or drops overflow, enforcing a strictly smooth output with no bursts. Fixed window counts requests per calendar window and is trivial, but it allows a $2\times$ burst at a boundary, since a client can send a full window at the end of one and another full window at the start of the next. Sliding window log stores each request timestamp for an exact limit at $O(\text{window})$ memory, while the sliding window counter approximates it cheaply by weighting the previous window.

Distributed rate limiting

Per-instance limiting is easy but lets a client exceed the global limit by spreading requests across servers. A global limit needs shared state, typically Redis with an atomic INCR plus EXPIRE (or a small Lua script for token-bucket math), so the count is correct under concurrency. The why is atomicity: without it, two servers can both read the same count and both admit a request that should have been the last.

Step by step

Refill the bucket lazily based on elapsed time, capped at capacity b.
If at least one token is available, spend it and admit the request.
Otherwise reject (return HTTP 429) and let it refill over time.
For a global limit, keep the counter in a shared store with atomic updates.

import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

Complexity (time and space)

Each check is $O(1)$ time and $O(1)$ space per client, just a token count and a timestamp. A sliding-window log is $O(1)$ amortized but $O(\text{requests-in-window})$ space, which is why high-volume limiters prefer the token bucket or the counter approximation.

Worked example

A bucket of capacity 2 admits two requests immediately, then rejects until it refills:

tb = TokenBucket(rate=1, capacity=2)   # 1 token/sec, holds at most 2
print(tb.allow())   # True   (2 -> 1 token)
print(tb.allow())   # True   (1 -> 0 tokens)
print(tb.allow())   # False  (empty; refills over time)

Follow-up questions

What two guarantees does a token bucket give? A burst of up to capacity b, and a long-run average admitted rate no greater than the refill rate r.
Token bucket vs leaky bucket? Token bucket permits bursts up to capacity; leaky bucket enforces a strictly smooth, constant output rate.
Why does a fixed window allow a 2x burst? A client can send a full window at the end of one period and another at the start of the next, doubling the rate across the boundary.
How do you rate limit across many servers? Keep the counter in a shared store like Redis with atomic INCR and EXPIRE (or a Lua script) so the global limit holds under concurrency.
What status code should a limited request get? HTTP 429 Too Many Requests, ideally with a Retry-After header.

References

Alex Xu, ByteByteGo System Design (vol. 1 and 2).
Kleppmann, Designing Data-Intensive Applications (2017).