Design walkthroughs of the systems behind familiar products, each written to the depth of a real thirty-minute interview answer. A system design interview gives you about half an hour to take a vague prompt and turn it into a defensible architecture, and these write-ups follow that same arc. Each one starts by agreeing on scope, turns vague scale into concrete numbers, sketches the architecture with a diagram, and then spends most of its time where a real interview is decided, in the deep dives on the parts that are genuinely hard, before closing with the follow-up questions an interviewer tends to ask. The designs range from classic warm-ups like a URL shortener to payment systems, stock exchanges, and the machine learning systems behind search and recommendations, and the summary under each title says what the design is really about, so the list reads as a study guide on its own.
Almost every system in this collection is assembled from the same small kit of concepts. Horizontal scaling means adding identical servers rather than buying a bigger one, with a load balancer spreading requests across them, and it works because the servers are kept stateless, holding no per-user state between requests. Caching keeps recently used data in fast memory so most reads never touch the database. Partitioning, often called sharding, splits data too big for one machine across many, usually by hashing a key, while replication keeps copies of each partition on several machines so one failure loses nothing. Those copies introduce the central trade of distributed systems, which is consistency against availability. When replicas disagree, the design has to choose between answering with possibly stale data and refusing to answer until the replicas agree, and there is no third option. Message queues decouple fast producers from slow consumers so heavy work happens asynchronously, off the path a user is waiting on. And indexes, from B-trees to inverted indexes to geospatial cells, are how anything is found quickly inside all that data.
The kit of parts nearly every design below assembles. Reads are answered from memory when they can be, writes land in partitioned and replicated storage, and anything slow rides the queue so a user never waits on it. Dashed arrows are taken only sometimes or asynchronously.
The walkthroughs themselves run the way the conversation runs in an interview, in four moves. First agree on scope, because "design YouTube" means nothing until upload, playback, and scale are pinned down. Then turn vague scale into numbers with back-of-the-envelope arithmetic, since 100 million new links a month sounds enormous but works out to about 40 writes per second, and the numbers decide which problems are real. Then sketch the whole architecture, from clients and load balancers through stateless services and caches down to partitioned, replicated storage and the queues that carry asynchronous work off the critical path. Only then go deep on the two or three places where that particular system is genuinely hard, which is where an interview is actually decided, and close with the follow-up questions an interviewer tends to ask next.
The four moves of a thirty-minute design conversation. Each move earns the next, and the deep dives at the end are where the interview is actually decided.
The questions that open most interview loops, small enough to finish and deep enough to show judgment.
A garage system issues tickets, assigns spots, and collects fees, and the question measures object design more than scale, since even a city operator running 200 garages sees only about 14 gate events per second. The walkthrough locks per level rather than lot-wide so two cars can never claim the last spot while gates on other floors proceed, and folds pricing behind a strategy interface so fee changes never touch gate code. The edge cases decide the interview, because lost tickets, grace periods, and overnight stays are exactly what break naive implementations.
The classic warm-up, where 100 million new links a month turns out to be only 40 writes per second against 4,000 redirects, so the whole engineering budget belongs to the read path. The walkthrough compares hashing, counters, and key generation services, works the base62 encoding of ID 11157 into 2TX by hand, and settles on leased counter ranges because truncated hashes collide once the table holds billions of rows. It answers 302 rather than 301, accepting the cost of serving every click to keep analytics, destination edits, and abuse takedown.
The components that larger designs are assembled from. Knowing these well makes every other question easier.
A rate limiter decides whether each of a billion daily requests may proceed, and the decision must cost far less than the work it protects. The walkthrough compares token bucket, sliding window, and fixed window counters, chooses token bucket for burst tolerance and constant-time refill arithmetic, and makes the counters atomic across gateway nodes with Lua scripts in Redis. Failure policy gets equal weight, failing open for general traffic so a sick Redis never takes the API down while failing closed on login endpoints, where unmetered traffic is the worse outcome.
Consistent hashing places servers and keys on the same circular hash space so growing from 10 servers to 11 moves about a tenth of the keys, where modulo assignment reshuffles up to 90 percent of them. The walkthrough sizes the ring at 200 virtual nodes per server, which evens load to within a few percent and costs only kilobytes of membership data, then follows a node join from provisioning to the ownership flip without clients noticing. It is the building block worth knowing cold, since stores, caches, and queues all reach for it the moment they shard.
Distributed systems need 64-bit IDs that are unique across machines and sortable by creation time, minted at 100,000 per second with no central counter in the hot path. The walkthrough builds the Snowflake layout, packing a 41-bit timestamp, worker bits, and a per-millisecond sequence into one integer so numeric order matches creation order for 69 years from a custom epoch. The deep dive is clock skew, where a machine whose clock moves backward waits out small regressions and refuses to issue on large ones, trading rare bounded pauses for the guarantee that no duplicate key ever silently corrupts data.
A Dynamo-style store keeps shopping carts and sessions readable and writable through node failures and network partitions, holding 30 TB across 18 nodes with three replicas of everything. The design assembles consistent hashing for placement, quorum reads and writes whose overlap makes consistency tunable per table, vector clocks to detect concurrent updates, and an LSM-tree engine where even deletes are writes. It chooses availability over consistency during partitions, because an unreachable cart loses the sale while a stale one merely merges items, and shows the same cluster serving carts and passwords by turning the quorum dials.
A cache tier absorbs reads that would otherwise need 60 database machines, answering 500,000 gets per second from memory at a 95 percent hit rate. The walkthrough commits to cache-aside with delete-on-write invalidation because it keeps the cache optional and bounds staleness with TTLs, then spends its depth on LRU eviction with constant-time bookkeeping and on stampedes, where thousands of simultaneous misses for one expired key collapse into a single database query behind a per-key lock while everyone else briefly serves the stale value.
A Kafka-style queue is a replicated append-only log that decouples producers from consumers, absorbing a gigabyte per second and retaining a week of traffic, about 1.8 PB with replication, so a slow consumer can fall behind and catch up. The walkthrough explains why strictly sequential log writes make ordinary disks fast, how the in-sync replica set makes an acknowledgment mean something precise, and how consumer groups track progress with a single offset. It partitions by key hash, giving up global ordering so each key's events stay ordered on one partition, the trade that buys consumer parallelism.
Trending lists cannot afford an exact counter per item, since a million events per second over 500 million distinct daily items would need 25 GB per window. The walkthrough builds a count-min sketch that answers within a 0.001 percent error margin in under 8 MB and never undercounts, pairing the fast approximate path with a nightly exact batch in a lambda shape. The decisive detail is partitioning by item ID rather than by source, shown with an item counted 900 times on each of three nodes that vanishes from every merged per-node top ten.
How large systems find things, from crawling and indexing the web to completing a query as a person types and searching a social graph by relationship.
Downloading a billion pages a month is 400 pages and 200 MB per second, under the constraint that no host is hit more than once every couple of seconds. Politeness shapes the whole design, forcing 800 or more hosts in flight at once and a two-stage frontier where front queues carry priority and per-host back queues carry timing. URL deduplication runs through a Bloom filter that silently skips about 1 percent of links as an accepted cost, and sharding by host keeps each node's politeness timers local, with no cross-shard coordination on the hot path.
Full-text search over 50 million documents rests on the inverted index, a map from each term to the sorted list of documents containing it, which turns a multi-word query into a linear merge. The walkthrough works BM25 scoring numerically, showing how length normalization lets short dense titles outrank long pages, and shards by document hash so every shard scores its slice and a coordinator merges. Document sharding wins over term sharding because an update lands on one shard, and pagination uses cursors after showing that offset paging materializes 20,000 candidates to serve page 100.
Completions must arrive within about 100 milliseconds of each keystroke, faster than the next key falls, across 40 million prefix requests a day. The design precomputes the top ten completions for every prefix so a query becomes one key-value lookup instead of a trie subtree walk, served through layered caches from browser to CDN to memory shards. New versions of the table build offline and swap in atomically, so shards never coordinate an update mid-keystroke, and 50 to 100 milliseconds of client-side debouncing collapses a fast typist's burst roughly fivefold before it reaches a server.
Typing a name should surface your friend above a billion strangers, ranked by friendship distance, inside a 50 millisecond budget. The walkthrough first shows why traversal at query time fails, since checking distance two at 500 friends per hop touches a quarter million nodes, then replaces it with precomputed sets, exact first-degree lists updated within seconds and Bloom filters over second-degree sets rebuilt lazily. Distance ranking becomes a few constant-time membership checks per candidate, and the filters' occasional false positive merely promotes a stranger one tier, a harmless error bought for 4 KB per user.
Feeds, timelines, notifications, and the different shapes a message can take, from durable chat history to messages that disappear.
Two hundred million daily users write 40 million posts and read timelines four billion times, and that 100 to 1 ratio decides the architecture. The design fans out on write, pushing each post's ID into followers' cached timelines, because 93,000 background cache appends per second beat millions of foreground lookups, then flips to read-time merging for accounts past roughly 100,000 followers. Time-ordered IDs double as pagination cursors that survive insertions and deletions, and viral engagement lands on sharded counters flushed in batches so a million likes never serialize onto one hot row.
A feed assembles recent posts from a couple hundred followed accounts in under 200 milliseconds, 750 million times a day. Each user gets a cached list of 800 post IDs maintained by fanout on write and merged with celebrity authors at read time, hydrated into full posts at serving so deletes and edits take effect immediately. The cap at 800 entries is the quietly load-bearing choice, holding the cluster to 2 TB because almost nobody scrolls past a few hundred items, and a synchronous insert into the author's own cache preserves read-your-own-post.
One platform service owns every push, SMS, and email so preferences, quiet hours, and per-user rate caps are enforced in one place rather than by forty producer teams independently. Averages are gentle at under 200 sends per second, but a five million user campaign compressed into ten minutes spikes to 8,300, so transactional and campaign traffic ride structurally separate queues where a password reset can never wait behind a marketing flood. Delivery is at-least-once with idempotency keys at intake and at the provider edge, and the token registry ages out dead devices from provider feedback.
Messages cross from sender to recipient in a few hundred milliseconds over WebSockets, with 12.5 million sockets open at the evening peak and two billion messages a day appended to a wide-column store. Ordering comes from per-channel sequence numbers rather than timestamps, because phone and server clocks drift while a per-conversation counter costs one atomic increment on a partition the write already touches. A session registry maps each user to the server holding their socket, a queue decouples sending from delivery, and a client message ID on every send makes retries over weak signal invisible rather than duplicated.
When messages disappear after viewing or 24 hours, deletion is the product, and the design must separate the moment content stops being readable from the moment bytes leave disks. Expiry rides native TTLs checked lazily at read time, since actively scanning hundreds of millions of daily rows would compete with live traffic for disk I/O that compaction gives away free. Crypto-shredding carries the deletion promise, storing each megabyte blob encrypted while its 64-byte key sits in the metadata row, so expiring the key neutralizes every cached copy at once, and view-once runs as a server-side compare-and-set.
A billion-account email service receives 580,000 messages per second over SMTP and ingests 2.5 PB a day, which stored naively would approach an exabyte a year. Content-addressed storage carries the design, keeping each unique body once so an attachment mailed to a thousand recipients stores a single copy, turning 5 GB of deliveries into 5 MB on disk, while the database keeps only queryable metadata and a hash pointer. Search runs on per-user inverted indexes that never cross accounts, and durability pairs a replicated intake log with at-least-once replay into idempotent delivery.
The designs that move and keep large content, covering video upload and playback, file sync across devices, and the object store that sits underneath both.
Creators upload 500 hours of video every minute and viewers watch a billion hours a day, and the two pipelines deserve opposite treatment because a waiting creator grumbles while a buffering viewer leaves. Uploads split into ten-second segments so a two-hour video transcodes in about a minute across parallel workers, players switch bitrate rungs at segment boundaries from HLS manifests, and the economics live in the CDN, where a 95 percent hit ratio turns 125 Tbps of egress into about 6 Tbps at origin. Even the view counter is designed, aggregated from event streams because one hot row cannot absorb 800 increments a second.
Sync must carry an edit on one device to the rest within seconds without ever losing anyone's work, across five exabytes of files. Files split into 4 MB content-addressed blocks, so changing one block of a gigabyte file uploads 4 MB rather than the whole thing, and identical blocks deduplicate across users. When two offline devices edit the same file, the losing commit is refused and renamed into a conflicted copy for a person to merge, because silently keeping the last writer is unforgivable in a product holding the only copy, and the garbage collector, the lone destructive process, runs behind soft-delete grace periods.
An S3-style store promises eleven nines of durability, about one object lost per year across a hundred billion stored, on disks that individually fail all the time. The design separates hot small metadata from immutable bulk data, erasure-codes each object into 8 data and 4 parity fragments so any four losses are survivable at 1.5x overhead against replication's 3x, and packs small objects into gigabyte append-only blocks so billions of files never fracture a filesystem. Writes acknowledge at two durable copies of three, keeping tail latency reasonable while the third lands seconds later.
Systems built around location, including nearby search, live location sharing among friends, road routing with traffic, and matching riders to drivers.
Finding restaurants near a point defeats ordinary indexes, because a B-tree can drive the scan on latitude or longitude but not both at once. Geohashing fixes it by naming the world in cells whose names share prefixes when the cells sit near each other, and querying the nine-cell block around the user catches the pair of points ten meters apart across a cell boundary. At a few gigabytes the whole index replicates wholesale to every node rather than sharding by region, which removes geographic hotspots, and exact haversine filtering over a few thousand candidates costs under a millisecond.
Showing which friends are within five miles means 333,000 location updates per second, each needing to reach a small audience of online friends within seconds. Updates flow through pub/sub channels pinned to nodes by hash, where standing subscriptions replace per-message registry lookups, and edge servers filter by distance locally so only the one message in ten inside the radius reaches a phone. Live positions sit in Redis under a 60 second TTL with no durable store at all, because data rewritten every 30 seconds makes durability worthless, and the expiring key doubles as the offline signal.
Maps is three systems sharing a brand, tiles that draw the world, routing that plans a path, and traffic that keeps the plan current. The tile pyramid quadruples per zoom level to over a trillion cells at level 20, shipped as vector geometry through CDNs so clients restyle without refetching. Routing cannot search a continental graph per query, so contraction hierarchies precompute shortcut edges and searches climb an importance hierarchy instead of rediscovering highways. ETAs blend live speeds with historical profiles, because the jam observed now says little about a road you reach four hours from now.
Matching riders to drivers starts with 1.25 million GPS writes per second from five million drivers, a firehose kept in memory under TTLs because positions overwrite every four seconds and durability buys nothing. The geo index shards by city so a match never crosses shards, candidates rank by pickup ETA on the road graph rather than straight-line distance, since a driver just across the river is far away no matter how close, and dispatch takes an exclusive lease per driver through atomic set-if-absent so two concurrent matchers can never offer the same car. The trip itself runs as a state machine over an append-only event log.
High-volume event streams and what gets computed from them, including the monitoring that watches everything else.
Watching 100,000 servers means ingesting a million samples per second and answering a week-wide dashboard query instantly. Gorilla-style compression, delta-of-delta timestamps with XOR-encoded values, shrinks a day's 86 billion samples to about 120 GB, and a Kafka buffer in front of storage means a shard failure or deploy never drops samples. The quiet villain is cardinality, where one added user ID label multiplies series counts by millions, and alerting carries pending states to suppress flapping plus a dead-man's switch, a rule that fires forever so an outside service can page when alerting itself goes silent.
Click counts become invoices, so the numbers must survive disputes while campaign pacing needs them within a minute or two. A billion daily clicks stream through windows keyed by event time, so a late mobile click lands in the window it belongs to, with watermarks bounding how long windows stay open and deduplication by event ID absorbing replays. A nightly batch recomputes finals from the raw log in lambda fashion, streaming and batch totals reconcile hourly to catch drift, and suspect clicks divert to a quarantine store with reasons attached so disputes resolve by recomputation rather than negotiation.
Ranking 26 million players is easy to store and hard to query, since a relational index can answer a player's exact rank only by counting everyone above them. A Redis sorted set backed by a skip list answers rank in logarithmic time, and the entire board fits in 2.6 GB of memory. Ties break deterministically by packing points into the high bits of the score and complemented season seconds into the low bits, inside the 53 bits a float carries exactly, so the earlier achiever ranks higher. The sorted set is treated as a rebuildable view over a durable event log, restored in about two minutes when a node dies.
Systems where correctness is the product itself, because double-booking, double-charging, and double-spending all have to be impossible by construction.
Selling room-nights is an inventory problem where overbooking must be impossible by construction even when two guests grab the last room in the same instant. Inventory counts room types per night rather than physical rooms, leaving the front desk free to assign numbers at check-in, and races resolve through optimistic concurrency, a conditional update that checks the version column and the overbooking cap in one statement. A hold reserves inventory for ten minutes while payment settles and expires on its own if it never does, and idempotency keys make a retried booking return the existing reservation rather than charging twice.
A payment system's first job is to never lose track of money, and at a dozen orders per second the bottleneck is correctness rather than throughput. Every movement of a cent lands in an append-only double-entry ledger, retries carry idempotency keys so a double-submitted charge executes once, a state machine forbids transitions that should never happen, and nightly reconciliation compares the ledger against the processor's settlement files. The deep dive everyone skips is the timeout, where a charge that dies mid-flight stays pending until the provider answers definitively, because guessing success ships goods free and guessing failure charges a card it must then refund.
Moving stored value between accounts at a million transfers per second forces sharding across more than a hundred database shards, and with accounts placed randomly, 99.5 percent of transfers cross shards. Two-phase commit would hold coordinator locks while throughput dies, so the design uses try-confirm-cancel, reserving funds on one shard, crediting the other, and cancelling on failure, with every intermediate state visible and auditable. Balances are never stored as truth at all but derived from an append-only event log, shards replicate by Raft, and a recovery scanner reads the journal to finish whatever a crashed coordinator left in flight.
An exchange matches orders by price-time priority in microseconds, and the surprise is that the fastest correct design is single-threaded, one core per symbol with no locks, since one core runs millions of operations a second and a full order book fits in 100 MB. A sequencer stamps every inbound message into one global order, so Raft-replicated engines consuming the same stream stay bit-identical and failover promotes an already-correct standby. The latency budget is itemized down to roughly 100 nanoseconds of matching, and fairness is engineered physically, with equal-length colocation cables and multicast market data so nobody hears prices early.
Design questions where a model sits inside a larger system, and most of the work is framing the task well, gathering features and training data, and building a serving path that answers in milliseconds.
Searching by photo means learning an embedding space where visual similarity is distance, then finding nearest neighbors among a billion catalog vectors inside 250 milliseconds. The pipeline detects and crops objects before embedding so vectors describe products rather than scenes, trains the encoder contrastively, and serves from a sharded approximate nearest neighbor index that absorbs ten million new items a day. The operational deep dive is retraining, since a new encoder invalidates every stored vector, and re-embedding a billion items takes around 14 hours on a hundred GPUs before canary queries validate the index and traffic cuts over atomically.
Blurring every face and license plate in a billion panoramas per season is a recall problem, because an over-blurred statue costs nothing while one missed face is a privacy violation already published. Offline batch processing removes latency from the equation, which buys a heavy two-stage detector ensemble whose members catch one another's misses, run over perspective tiles cut from each panorama so projection distortion never suppresses a detection. Thresholds tilt deliberately toward false positives, idempotent checkpoints let the pipeline crash and resume without double-processing, and user reports of missed faces become the next round's hardest training examples.
Text queries against a billion videos retrieve through two parallel paths that fail differently, lexical BM25 over titles and speech transcripts for exact names, and CLIP-style embeddings for queries that paraphrase what they want. The union feeds a ranker measured by nDCG, worked through numerically, and trained on click logs only after correcting position bias, since clicks concentrate on whatever ranked first for any reason at all. Freshness for breaking news comes from a near-real-time index that carries bare metadata within minutes of upload while transcripts and embeddings backfill as they finish.
Screening a billion daily posts for violence, hate, and self-harm is multi-label classification at a 0.1 percent base rate, and the real design surface is what happens after the model scores. Per-category thresholds map scores to tiered actions, automatic removal where precision is high and human review queues prioritized by severity times reach below that, and the article works the arithmetic at a 0.92 threshold to show what 0.45 recall means in violations caught and posts wrongly removed. Perceptual hashing clears known re-shared material before classifiers spend compute, modalities fuse late for operability, and reviewer decisions return as training labels.
Picking 30 homepage videos from billions in 200 milliseconds requires a funnel, where two-tower retrieval against an ANN index narrows billions to thousands, a gradient-boosted ranker narrows thousands to hundreds, and re-ranking chooses what renders. Two towers win the retrieval stage precisely because they see no cross-features, which lets item embeddings precompute into an index. Watch time alone is a corrupt objective that rewards clickbait and autoplay chains, so heads predicting clicks, watch time, likes, and survey-measured satisfaction blend under tuned weights, and one percent of impressions go to exploration so the catalog never ossifies around what the model already shows.
Events are perishable inventory, dead at start time before accumulating interaction history, which quietly breaks the collaborative filtering most recommenders lean on. The design substitutes content and context, topic embeddings, organizer track record, and a geo-time index that cuts five million live events to roughly 600 candidates in ten milliseconds. The ranker is a pointwise gradient-boosted model kept deliberately calibrated so downstream business rules can trust its probabilities, retrained nightly on registrations because those labels arrive in days while attendance arrives too late, and each user's own day-of-week registration history supplies the temporal fit.
Predicted click probability multiplies against bids inside an auction, so calibration is the product, and a model that ranks perfectly while running 20 percent hot misprices every impression silently. The article defends a logistic regression baseline trained online with FTRL, since log loss yields naturally calibrated probabilities and the sparse model stays cheap and interpretable, layering tree-derived feature crosses on only when they pay. Training downsamples the 99 percent negatives tenfold and corrects in closed form, and evaluation runs on normalized entropy, worked numerically, because AUC is blind to exactly the miscalibration that costs money.
The carousel under a vacation rental suggests alternatives, and the insight is that similarity is better learned from booking behavior than from attributes. Skip-gram embeddings train on click sessions with the eventually booked listing inserted as global context for every click, tilting the geometry from co-browsing toward conversion. Serving pulls about 200 approximate nearest neighbors from 640 MB of 32-dimensional vectors in milliseconds, applies availability and capacity as hard filters at candidate generation because recommending a listing the traveler cannot book destroys trust, and reranks the survivors with a supervised model before nightly retraining swaps the index atomically.
Each feed load ranks about 2,000 eligible posts inside a 100 millisecond model budget, scored by a multi-task network whose heads predict likes, comments, shares, dwell, and hides off one shared trunk so the sparse signals are not drowned by the frequent ones. The final ordering comes from a human-tuned value formula rather than a learned one, weighting a predicted comment 15 times a like and a predicted hide at minus 100, so a rare explicit negative vetoes a pile of weak positives and product intent stays legible. Position bias enters training as a feature and freezes to a constant at serving, with a one percent randomized slice kept for unbiased evaluation.
Connection suggestions are link prediction on a graph where triadic closure does the heavy lifting, since most future connections already share a mutual friend. Friends-of-friends expansion yields tens of thousands of candidates for a well-connected user, scored by Adamic-Adar, which weights each shared mutual inversely by the log of their own connection count, and the worked example shows a mutual with 50 connections vouching roughly three times harder than one with 50,000. Rankings precompute in daily batch because request-time expansion would blow any interactive budget, patched on signup and new-connection events, the moments freshness actually matters, and the model trains on accepted invitations rather than sent ones.
A design exercise for the fun of it, where the speed of light becomes the bottleneck.
With Earth and Mars separated by 3 to 22 light minutes and a two-week blackout at solar conjunction, no protocol that waits for an acknowledgment can work, so the governing rule is that zero synchronous coordination crosses the gap. TCP gives way to Bundle Protocol custody transfer, where each hop stores a message durably and acknowledges per link, and each planet runs as a fully autonomous region, strongly consistent inside and reconciled across. Concurrent updates merge through CRDTs, shown with a grow-only counter both planets advance through a blackout and merge identically regardless of message order, and the arithmetic that one terabyte needs nine days of continuous transfer at 10 Mbps decides what replicates at all.