System design is deciding how the pieces of a system fit together: clients, services, storage, and the paths between them. Most designs come down to a handful of recurring components applied to a given set of requirements and a target scale, so the skill is less about invention than about composing known parts and naming the trade-offs each one carries.
A consistent order keeps the work tractable: settle the requirements, do rough capacity math, sketch the API and data model, draw the high-level components, then find where load concentrates and relieve it.
The framework
- Clarify requirements: functional (what it does) and non-functional (scale, latency, availability, consistency).
- Estimate: requests per second, storage, and bandwidth, with round numbers.
- Define the API: a few endpoints with their inputs and outputs.
- Design the data model and choose storage (SQL or NoSQL).
- Draw the high-level design: clients, load balancer, services, caches, databases, queues.
- Find the bottlenecks and scale them: caching, replication, sharding, async work.
Back-of-the-envelope estimation
A few formulas turn vague scale into concrete components. Request rate is daily active users times actions per user per day, divided by the 86,400 seconds in a day and multiplied by a peak factor:
$$\text{QPS} \approx \frac{\text{DAU} \times \text{actions/user/day}}{86400} \times \text{peak},\qquad \text{peak} \approx 2\text{ to }5.$$
Storage is rows written per day times bytes per row times retention in days, and bandwidth is QPS times payload size. For example, 10 million DAU at 10 actions each is 100 million requests a day, or about 1,160 average QPS and roughly 2,300 to 5,800 at peak (a factor of 2 to 5), which immediately tells you whether one server or a fleet behind a load balancer is needed. The point of round numbers is to size the design, not to be precise.
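The formulas above can be sketched as a few small functions. This is a rough calculator, not a capacity model: the function names are made up for illustration, and the row size, retention, and payload size in the second half are assumed values that do not come from the text.

```python
SECONDS_PER_DAY = 86_400

def average_qps(dau, actions_per_user_per_day):
    """Average requests per second, spread evenly over one day."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY

def peak_qps(avg_qps, peak_factor):
    """Scale the average by a peak factor (typically 2 to 5)."""
    return avg_qps * peak_factor

def storage_bytes(rows_per_day, bytes_per_row, retention_days):
    """Rows written per day times bytes per row times retention."""
    return rows_per_day * bytes_per_row * retention_days

def bandwidth_bytes_per_sec(qps, payload_bytes):
    """QPS times payload size."""
    return qps * payload_bytes

# Numbers from the text: 10 million DAU, 10 actions each.
avg = average_qps(10_000_000, 10)
print(round(avg))                                        # 1157
print(round(peak_qps(avg, 2)), round(peak_qps(avg, 5)))  # 2315 5787

# Assumed for illustration only: one 500-byte row per action,
# kept for 365 days, and a 1,000-byte response payload.
total = storage_bytes(100_000_000, 500, 365)
print(total / 1e12)                                      # 18.25 (terabytes)
print(round(bandwidth_bytes_per_sec(avg, 1_000) / 1e6, 2))  # 1.16 (MB/s average)
```

Under those assumed sizes, storage rather than request rate is the number that forces a decision, which is the kind of conclusion the estimate is meant to surface early.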
Recurring building blocks
Most designs assemble the same parts: a load balancer to spread traffic, caches in front of slow reads, read replicas and sharding to scale a database, queues to make slow or spiky work asynchronous, and a CDN to push static content to the edge. Knowing the handful of blocks and what each one costs is most of system design; the interview or the design doc is really about choosing among them deliberately.
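To make one of these blocks concrete, here is a minimal sketch of hash-based shard routing. It assumes a fixed shard count and string keys; `shard_for` and the key format are hypothetical names chosen for the example, and real systems often use consistent hashing or a lookup table instead.

```python
import hashlib
from collections import Counter

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards).

    A cryptographic digest is used only because it is stable across
    processes; Python's built-in hash() of a str is randomized per run.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Route 10,000 hypothetical user keys across 4 shards and count the spread.
counts = Counter(shard_for(f"user:{i}", 4) for i in range(10_000))
for shard in sorted(counts):
    print(shard, counts[shard])
```

The cost shows up when the shard count changes: with plain modulo routing, most keys map to a different shard afterwards, so resharding means moving most of the data.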
Trade-offs
Every choice trades something: caching adds staleness, sharding adds operational complexity, strong consistency costs availability and latency, and async work adds eventual-consistency and ordering concerns. The useful skill is naming which trade-off a design is making, and why, rather than reaching for a component reflexively.
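The staleness cost of caching is easy to see in a small sketch. The class below is a hypothetical read-through cache with a time-to-live, written only to illustrate the trade-off; it takes an injectable clock so the demo is deterministic, and a plain dict stands in for the database.

```python
import time

class TTLCache:
    """Tiny read-through cache with a time-to-live (illustrative only)."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # function that reads the source of truth
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]          # may be stale for up to ttl_seconds
        value = self._loader(key)  # miss or expired: go to the source
        self._entries[key] = (value, now + self._ttl)
        return value

# Demo with a fake clock; the dict stands in for the database.
database = {"greeting": "hello"}
fake_now = [0.0]
cache = TTLCache(loader=database.get, ttl_seconds=60, clock=lambda: fake_now[0])

print(cache.get("greeting"))   # hello (miss, loaded from the database)
database["greeting"] = "hi"    # the source of truth changes
print(cache.get("greeting"))   # hello (stale, served from the cache)
fake_now[0] = 61.0             # advance past the TTL
print(cache.get("greeting"))   # hi (expired, reloaded)
```

The TTL is the knob that names the trade-off: a longer TTL spares the database more reads and lets readers see old values for longer.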
Worked example
A small, classic design is a URL shortener. Reads (redirects) vastly outnumber writes, so the redirect lookup sits behind a cache, and short codes come from a unique counter encoded in base62, which packs far more ids per character than base10:
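import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols: a-z, A-Z, 0-9

def encode(n):
    """Encode a non-negative integer as a base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")  # the loop below would never end
    if n == 0:
        return ALPHABET[0]
    s = []
    while n:
        n, r = divmod(n, 62)  # peel off the least significant digit
        s.append(ALPHABET[r])
    return ''.join(reversed(s))  # digits were collected low to high

print(encode(125))  # 'cb' (a 7-char code covers 62**7 > 3 trillion ids)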
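As a sanity check on the encoding, a decoder can invert it. The sketch below repeats `encode` so that it runs on its own; `decode` and `INDEX` are hypothetical additions, not part of the original design.

```python
import string

ALPHABET = string.ascii_letters + string.digits  # same 62 symbols as above
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}  # symbol -> digit value

def encode(n):
    """Encode a non-negative integer as a base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    s = []
    while n:
        n, r = divmod(n, 62)
        s.append(ALPHABET[r])
    return ''.join(reversed(s))

def decode(code):
    """Invert encode(); raises KeyError on symbols outside the alphabet."""
    n = 0
    for ch in code:
        n = n * 62 + INDEX[ch]
    return n

print(decode("cb"))                          # 125
print(len(encode(10**9)), len(str(10**9)))   # 6 10

# Round trip over a range of ids.
assert all(decode(encode(i)) == i for i in range(10_000))
```

The second line of output shows the density claim directly: the id one billion takes 6 characters in base62 against 10 in decimal.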
Follow-up questions
- How do you estimate QPS? Daily active users times actions per user per day, divided by 86400, times a peak factor of 2 to 5.
- Why base62 for short codes? It packs more ids per character than base10 or base16, so codes stay short; 62 to the 7th is over 3 trillion.
- Where does a cache help in a URL shortener? In front of the redirect lookup, since reads vastly outnumber writes and hot links are read repeatedly.
- How do you scale the ID generator? Hand out id ranges to each app server, or use a dedicated key-generation service, so servers never coordinate per request.
- What are the standard building blocks? Load balancer, cache, read replicas and sharding, queues for async work, and a CDN for static content.
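The id-range answer above can be sketched as follows. This is an in-process illustration under stated assumptions: `RangeAllocator` and `AppServer` are hypothetical names, and in a real deployment the allocator would be something durable and shared, such as a database row or a coordination service, not a Python object.

```python
import threading

class RangeAllocator:
    """Hands out disjoint, half-open id ranges [lo, hi) (illustrative only)."""

    def __init__(self, block_size, start=0):
        self._block_size = block_size
        self._next_start = start
        self._lock = threading.Lock()

    def next_range(self):
        with self._lock:  # the only point where servers coordinate
            lo = self._next_start
            self._next_start += self._block_size
        return lo, lo + self._block_size

class AppServer:
    """Issues ids from its current range; not thread-safe in this sketch."""

    def __init__(self, allocator):
        self._allocator = allocator
        self._next_id, self._end = allocator.next_range()

    def new_id(self):
        if self._next_id >= self._end:  # range exhausted: fetch another
            self._next_id, self._end = self._allocator.next_range()
        value = self._next_id
        self._next_id += 1
        return value

allocator = RangeAllocator(block_size=1000)
a, b = AppServer(allocator), AppServer(allocator)
print(a.new_id(), a.new_id(), b.new_id())  # 0 1 1000
```

Each server touches the allocator once per 1,000 ids instead of once per request. The price is that ids are no longer dense or globally ordered, and a server that restarts abandons whatever is left of its range.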