Load balancing and horizontal scaling

Vertical scaling (a bigger machine) hits a ceiling fast; horizontal scaling (more machines) is how systems grow, and a load balancer is what makes many servers look like one. It spreads requests, removes failed servers from rotation, and often terminates TLS, so a single address fronts an elastic pool.

The cleanest designs keep servers stateless so any request can go to any server; session state then lives in a shared store or a signed token. That is what frees the balancer to use any distribution policy and to fail requests over to healthy servers without a user noticing.

Distribution algorithms

Round robin rotates through servers and is ideal when requests are uniform and servers identical; weighted round robin sends more to bigger servers; least connections routes to the server with the fewest in-flight requests, which handles uneven request costs; and hashing or sticky routing sends a client consistently to one server, often via consistent hashing, when some per-client state must stay local.

The power of two choices

A surprising and practical result governs random assignment. Placing each request on a single random server leaves the busiest server with about $\Theta\!\left(\tfrac{\log n}{\log\log n}\right)$ more load than average, but picking two servers at random and sending the request to the less loaded one cuts the maximum imbalance to about $\Theta(\log\log n)$, an exponential improvement (Mitzenmacher 2001). The why is that a second sample almost always avoids the unlucky pile-ups a single random choice creates, which is why production balancers like NGINX adopt it.

Health checks and layers

The balancer periodically health-checks each server and pulls failing ones from the pool until they recover, which is how it routes around failure. It operates at one of two layers: L4 routes by IP and port (fast, protocol-agnostic), while L7 routes by HTTP content such as path or headers, enabling smarter routing and TLS termination at higher cost.

Step by step

Front the pool with one balancer address and keep servers stateless.
Distribute by round robin, weighted, least connections, or hashing.
For random placement, sample two servers and pick the less loaded.
Health-check servers and remove failing ones until they recover.

from itertools import cycle

class RoundRobin:
    def __init__(self, servers):
        self.it = cycle(servers)

    def next(self):
        return next(self.it)

# weighted: expand the list (["A","A","B"]); least-connections: pick min by active count

Complexity (time and space)

Round robin is $O(1)$ per decision with $O(1)$ state; least connections is $O(1)$ with a per-server counter (or a heap for the minimum). The real constraint is statelessness, since it is what lets the balancer pick any server, and the power-of-two-choices policy is also $O(1)$ while delivering near-optimal balance.

Worked example

Round robin cycles through the pool in order:

rr = RoundRobin(["A", "B", "C"])
print([rr.next() for _ in range(4)])   # ['A', 'B', 'C', 'A']

Follow-up questions

Why prefer stateless servers? So any server can handle any request; state lives in a shared store or signed token, enabling free scaling and failover.
What does the power of two choices buy? Sampling two random servers and taking the less loaded cuts max imbalance from about log n / log log n to log log n, an exponential improvement.
L4 vs L7 load balancing? L4 routes by IP and port (fast, protocol-agnostic); L7 routes by HTTP content (path, headers), enabling smarter routing and TLS termination.
Least connections vs round robin? Least connections handles uneven request costs better; round robin is ideal when requests are uniform and servers identical.
How does the balancer handle a dead server? Periodic health checks remove failing servers from the pool until they pass again.

References

Mitzenmacher, The Power of Two Choices in Randomized Load Balancing (2001).
Kleppmann, Designing Data-Intensive Applications (2017).
Alex Xu, ByteByteGo System Design (vol. 1 and 2).