Designing YouTube means building two pipelines that never touch the same bottleneck. The first is an upload path that swallows hundreds of hours of video every minute and turns each file into a dozen playable variants, and the second is a playback path that serves a billion hours of watching a day from caches sitting close to the viewers. Interviewers ask this question because it is the canonical heavy-data design, where the interesting decisions are not about request routing but about bytes, meaning where they are transformed, where they sit, and who pays to move them. The hard parts live in the processing pipeline, which has to transcode planet-scale ingest fast enough that creators are not left waiting for hours, and in the delivery economics, where the content delivery network bill dwarfs every other line item and quietly dictates the architecture. Search, recommendations, and comments are real systems too, and I would explicitly scope them out in the first minute of the interview so the remaining twenty-nine are spent where the data actually is. The walkthrough below follows that plan, sizing both pipelines, building each one, and closing with the numbers an operator would watch to know the product is healthy.
Scope and requirements
The agreed scope covers creators uploading videos up to hours long, the service processing each upload into multiple resolutions and formats, viewers watching on phones, browsers, and TVs with playback starting quickly and adapting to their bandwidth, and the system counting views. Out of scope, by explicit agreement with the interviewer, are search, recommendations, comments, and monetization, since each of those is its own interview and none of them changes where the video bytes go. Non-functionally, uploads must survive flaky consumer connections and resume rather than restart, processing should make a typical video watchable within minutes of upload, and playback should start within a second or two of the click and rarely stall thereafter. Availability matters asymmetrically between the two pipelines, because a viewer who cannot watch leaves immediately and may not return, while a creator whose upload processes slowly merely grumbles and waits, so the read path gets the stricter budget and the write path gets the cheaper hardware. That asymmetry is worth stating early because it justifies almost every cost decision that follows.
Sizing the problem
The canonical ingest figure is 500 hours of video uploaded per minute, which works out to 720,000 hours per day. An hour of 1080p source footage at consumer bitrates runs around 5 GB, so daily ingest is roughly 720,000 × 5 GB, about 3.6 PB of raw video per day before a single transcode has run. The processed renditions add less than intuition suggests, because every output rung is smaller than a 1080p source and the lower rungs shrink geometrically, so the full ladder roughly doubles stored bytes, call it 7 PB a day of storage growth, which crosses 2.5 EB within a year. A number that large explains why storage tiering and per-video popularity decisions sit at the core of the design rather than in an optimizations appendix, since even object storage pricing turns exabytes into a board-level line item.
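The ingest arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the figures quoted in the text (500 hours per minute, 5 GB per source hour, a ladder that roughly doubles stored bytes); the constant and function names are illustrative, and decimal units are assumed throughout.

```python
# Back-of-envelope ingest sizing using the figures quoted in the text.
HOURS_UPLOADED_PER_MINUTE = 500
GB_PER_SOURCE_HOUR = 5      # 1080p source at consumer bitrates
LADDER_MULTIPLIER = 2       # the full rendition ladder roughly doubles stored bytes
GB_PER_PB = 1_000_000       # decimal units (assumption)
PB_PER_EB = 1_000


def daily_ingest_hours() -> int:
    """Hours of video uploaded per day."""
    return HOURS_UPLOADED_PER_MINUTE * 60 * 24


def daily_raw_pb() -> float:
    """Petabytes of raw source video arriving per day."""
    return daily_ingest_hours() * GB_PER_SOURCE_HOUR / GB_PER_PB


def yearly_stored_eb() -> float:
    """Exabytes of storage growth per year, source plus renditions."""
    return daily_raw_pb() * LADDER_MULTIPLIER * 365 / PB_PER_EB


if __name__ == "__main__":
    print(f"{daily_ingest_hours():,} hours/day")
    print(f"{daily_raw_pb():.1f} PB/day raw")
    print(f"{yearly_stored_eb():.2f} EB/year stored")
```

Without the intermediate rounding the doubled figure is 7.2 PB a day rather than 7, which puts yearly growth slightly above the 2.5 EB quoted, so the conclusion about tiering holds either way.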
Playback is bigger still, and working the arithmetic shows why the CDN is not optional. One billion watch hours per day at an average delivered bitrate of 3 Mbps is 1,000,000,000 hours × 3,600 seconds × 3,000,000 bits, about 1.35 exabytes of egress per day, and dividing by 86,400 seconds gives an average egress rate near 125 terabits per second, with regional evening peaks well above it. No origin datacenter serves that, and no conceivable one could be built to, so the only architecture that works pushes the bytes outward to a CDN, the geographically distributed cache fleet that answers requests from a point of presence near each viewer. From that point on, the design question is not whether to cache but how to make the cache hit ratio extremely high, because every miss is origin bandwidth the company pays for twice, once to store and once to ship.
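The egress calculation is worth reproducing because the unit conversions (hours to seconds, bits to bytes) are where interview arithmetic usually slips. The sketch below uses the numbers from the paragraph above; the function names are illustrative.

```python
# Playback egress from the figures in the text: 1e9 watch hours/day at 3 Mbps.
WATCH_HOURS_PER_DAY = 1_000_000_000
AVG_BITRATE_BPS = 3_000_000
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def daily_egress_bits() -> int:
    """Total bits delivered to viewers in one day."""
    return WATCH_HOURS_PER_DAY * SECONDS_PER_HOUR * AVG_BITRATE_BPS


def daily_egress_exabytes() -> float:
    """Same quantity in decimal exabytes (8 bits per byte, 1e18 bytes per EB)."""
    return daily_egress_bits() / 8 / 1e18


def average_egress_tbps() -> float:
    """Average delivery rate over the day, in terabits per second."""
    return daily_egress_bits() / SECONDS_PER_DAY / 1e12


if __name__ == "__main__":
    print(f"{daily_egress_exabytes():.2f} EB/day")
    print(f"{average_egress_tbps():.0f} Tbps average")
```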
The upload API
Consumer uploads fail constantly, on hotel Wi-Fi and at the edge of cellular coverage, and usually somewhere deep into a multi-gigabyte file, so the upload protocol is resumable. The file travels in chunks, the server records which byte ranges have arrived, and after a disconnect the client asks where to resume instead of starting over, which means a 2 GB upload that dies at 90 percent re-sends at most the one interrupted chunk before continuing with the remaining 10 percent, rather than sending all 2 GB again. The alternative, a single long PUT, loses everything the moment the connection drops at minute forty, and creators on slow uplinks would simply never finish, so resumability is less a feature than the difference between having uploads and not having them.
POST /api/v1/videos
{ "title": "Trail ride", "size_bytes": 2147483648 }
→ 201 { "video_id": "v71x2", "upload_url": "https://upload.example.com/u/8842" }
PUT /u/8842
Content-Range: bytes 0-8388607/2147483648
→ 308 Range: bytes=0-8388607 // server confirms what it holds
PUT /u/8842
Content-Range: bytes 8388608-16777215/2147483648
→ 308 ... // repeat until the final chunk
PUT /u/8842
Content-Range: bytes 2139095040-2147483647/2147483648
→ 201 { "video_id": "v71x2", "state": "processing" }
On the final chunk the assembled file lands in object storage, a content hash is computed for deduplication so a re-uploaded identical file short-circuits processing entirely, and a record enters the metadata database with state processing while the pipeline takes over. The creator sees a progress card immediately, which matters for trust, because a creator staring at a spinner with no state will re-upload and double the load.
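The server side of the resumable protocol reduces to tracking how much of the file is safely stored. The sketch below is one hypothetical way to do that, assuming chunks arrive in order as in the exchange above; the class name, method names, and string statuses are all illustrative, and a real service would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass


@dataclass
class UploadSession:
    """Tracks the contiguous prefix of a file the server already holds."""
    total_bytes: int
    received: int = 0  # bytes [0, received) are durably stored

    def accept_chunk(self, first: int, last: int) -> str:
        """Apply a chunk covering bytes first..last inclusive."""
        if last < first or last >= self.total_bytes:
            return "rejected"   # malformed range
        if first > self.received:
            return "rejected"   # would leave a gap; client must resume earlier
        # A retried chunk may overlap stored bytes; only the new suffix counts.
        self.received = max(self.received, last + 1)
        return "complete" if self.received == self.total_bytes else "incomplete"

    def resume_offset(self) -> int:
        """First byte the client should send after a disconnect."""
        return self.received


if __name__ == "__main__":
    session = UploadSession(total_bytes=2_147_483_648)
    session.accept_chunk(0, 8_388_607)
    # After a dropped connection the client asks where to continue.
    print(session.resume_offset())
```

In this sketch "incomplete" plays the role of the 308 responses in the exchange and "complete" the role of the final 201. Accepting overlapping retries is what makes a re-sent chunk harmless.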
The processing pipeline as a DAG
Processing is modeled as a DAG, a directed acyclic graph of tasks where each task starts when its inputs are ready and independent tasks run in parallel. For one video the graph includes splitting the source, transcoding each piece at each quality level, generating thumbnails and preview strips, fingerprinting the content against a copyright reference database, running policy scans, packaging segments and manifests, and finally flipping the video's state to ready. Transcoding, the conversion of video from one encoding to another, is the expensive node, since the source must be decoded and re-encoded at each target resolution with a codec, the compression format the player must understand. H.264 plays on effectively every device ever made, while newer codecs such as VP9 and AV1 cut bitrate 30 to 50 percent at the same visual quality but cost several times more compute to encode, so the pragmatic ladder encodes everything in H.264 first and spends the expensive codecs only on videos with enough predicted viewership to repay the encode in bandwidth savings. That policy is the first of several places where popularity, not correctness, drives resource allocation in this design.
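To make the DAG concrete, the sketch below expresses the tasks named above as a dependency map and groups them into waves that can run in parallel, using the standard-library graphlib module (Python 3.9 or later). The exact edges are an assumption for illustration: the text lists the tasks but not which ones depend on which, so here everything fans out from the split and the ready flip waits on every branch.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish first (assumed edges).
PIPELINE = {
    "split": set(),
    "transcode": {"split"},
    "thumbnails": {"split"},
    "fingerprint": {"split"},
    "policy_scan": {"split"},
    "package": {"transcode"},
    "mark_ready": {"package", "thumbnails", "fingerprint", "policy_scan"},
}


def execution_waves(graph):
    """Group tasks into waves; every task in a wave can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose inputs are all done
        waves.append(ready)
        sorter.done(*ready)                 # unlock their dependents
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(number, wave)
```

Because mark_ready depends on the scanning tasks, a video cannot become ready in this model until they finish, which is the property that keeps unscanned content from being published.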
The reason the pipeline splits videos before transcoding is wall-clock time, and the arithmetic is worth doing aloud. Encoding runs near real time per core-stream at good quality settings, so a 2-hour video transcoded as one file at one quality is a 2-hour wait, multiplied across every rung of the ladder, which would make the upload-to-ready experience unacceptable for exactly the long videos creators care most about. Splitting the same video into 10-second segments yields 720 independent pieces, and with a worker per piece each takes tens of seconds, so the whole video at every quality finishes in about a minute plus scheduling overhead, a speedup of two orders of magnitude bought entirely with parallelism. Segment boundaries are placed on keyframes, the self-contained frames a decoder can start from without reference to earlier frames, so each piece encodes independently and the later concatenation is seamless to the eye. The same segmentation, conveniently, is what adaptive playback needs anyway, so the split is paid for once and used twice.
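The wall-clock argument can be sketched as a small model. It assumes encoding runs at exactly real time per worker, as the paragraph above suggests, and it ignores scheduling overhead and the final concatenation; the function names and the worker counts in the example are illustrative.

```python
import math


def segment_count(duration_s: int, segment_s: int) -> int:
    """Number of independent pieces a video splits into."""
    return math.ceil(duration_s / segment_s)


def serial_encode_seconds(duration_s: int, realtime_factor: float = 1.0) -> float:
    """Encode time for one rung when the video is processed as a single file."""
    return duration_s * realtime_factor


def parallel_encode_seconds(duration_s: int, segment_s: int, workers: int,
                            realtime_factor: float = 1.0) -> float:
    """Encode time for one rung when segments are spread across workers."""
    waves = math.ceil(segment_count(duration_s, segment_s) / workers)
    return waves * segment_s * realtime_factor


if __name__ == "__main__":
    two_hours = 2 * 3600
    print(segment_count(two_hours, 10))
    print(serial_encode_seconds(two_hours))
    print(parallel_encode_seconds(two_hours, 10, workers=720))
    print(parallel_encode_seconds(two_hours, 10, workers=120))
```

The last line shows the practical case: even with a sixth as many workers as segments, the rung finishes in about a minute of pure encode time instead of two hours.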
The upload service assembles resumable chunks into the raw store and registers the video, the DAG fans the work across parallel workers (split, transcode per rung, thumbnails, fingerprinting, packaging), and the encoded store acts as origin for the CDN, which fills on demand along the dashed path.
Adaptive bitrate playback
Viewers live on wildly different and constantly changing connections, and the answer the whole industry shipped is adaptive bitrate streaming through HLS or DASH. Both protocols work on the same principle, namely that the video exists as a ladder of renditions, say 240p at 0.4 Mbps up through 4K at 16 Mbps, each chopped into the same few-second segments, and a manifest, a small text file, lists the renditions and the URL of every segment. The player downloads the manifest, starts conservatively on a low rung so the first frame appears fast, measures the actual throughput of each segment download, and switches rungs at segment boundaries, stepping up when measured bandwidth comfortably exceeds the next rung's bitrate and dropping hard the moment the playback buffer shrinks, because a stall is perceptually far worse than a few seconds of softer picture. All the intelligence sits in the player, while the servers do nothing but serve small static files over plain HTTP, which is exactly what CDNs are best at, and that is not a coincidence but the reason the protocols were shaped this way, since a design that required smart servers at the edge would have priced itself out of existence.
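The player-side decision can be sketched as a single function. This is a deliberately simplified heuristic, not the algorithm any particular player ships: only the 0.4 Mbps and 16 Mbps endpoints of the ladder come from the text, and the intermediate bitrates, the 1.5x headroom factor, and the 5-second low-buffer threshold are assumptions chosen for illustration.

```python
# Rendition ladder in bits per second; only the first and last values
# come from the text, the middle rungs are assumed.
LADDER_BPS = [400_000, 1_000_000, 2_500_000, 5_000_000, 8_000_000, 16_000_000]


def next_rung(current: int, throughput_bps: float, buffer_s: float,
              ladder=LADDER_BPS, headroom: float = 1.5,
              low_buffer_s: float = 5.0) -> int:
    """Pick the rung index for the next segment.

    Steps up one rung when measured throughput comfortably exceeds the next
    rung's bitrate, and drops hard when the playback buffer runs low.
    """
    if buffer_s < low_buffer_s:
        # Highest rung the connection can sustain with headroom, or -1 if none.
        affordable = [i for i, bps in enumerate(ladder)
                      if bps * headroom <= throughput_bps]
        best = affordable[-1] if affordable else -1
        # Always move at least one rung down, never below the lowest rung.
        return max(0, min(current - 1, best))
    can_step_up = (current + 1 < len(ladder)
                   and throughput_bps >= ladder[current + 1] * headroom)
    return current + 1 if can_step_up else current


if __name__ == "__main__":
    print(next_rung(0, 2_000_000, buffer_s=20.0))  # healthy buffer, spare bandwidth
    print(next_rung(3, 900_000, buffer_s=2.0))     # buffer nearly dry
```

The asymmetry is intentional: upward moves are one rung at a time and require headroom, while downward moves jump straight to whatever the connection can carry, since a stall costs more than a softer picture.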
The player asks the playback service for the video (1) and receives the manifest with the rendition ladder (2), then starts fetching low-rung segments from the CDN edge (3). A cache miss fills from the origin store along the dashed path (4), and once the player has measured segment throughput it switches up to a higher rendition at a segment boundary (5).
The economics of this path reduce to one number, the cache hit ratio. Video popularity is extravagantly skewed, with a small fraction of videos producing nearly all watch time, so edges holding just the hot set answer most requests, and at a 95 percent hit ratio the origin egress is 5 percent of 125 Tbps, around 6 Tbps, which is large but feasible, while every additional point of hit ratio is measured in real money. The long tail of rarely watched videos is deliberately not cached at the edge, and is served instead from origin through regional shield caches, intermediate caches that sit between many edges and the origin so that repeat misses from different edges are absorbed once per region. Popularity is also managed actively rather than waited on, because a new upload from a creator with ten million subscribers will be hot within minutes of the notification going out, so the system prewarms it, pushing segments to edges in the regions where that creator's audience lives before the first viewer clicks. Regional skew gets the same respect, since a video viral in Brazil has no claim on cache space in Japan, and per-region admission keeps each edge's limited disk spent on its own audience.
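The relationship between hit ratio and origin load is linear, which makes the value of each additional point easy to state. The sketch below uses the 125 Tbps average from the sizing section; the function names are illustrative, and it ignores the shield tier, which in practice absorbs part of the edge misses before they reach origin.

```python
TOTAL_EGRESS_TBPS = 125.0  # average viewer-facing egress from the sizing section


def origin_egress_tbps(hit_ratio: float, total_tbps: float = TOTAL_EGRESS_TBPS) -> float:
    """Traffic that misses the cache and must be served from origin."""
    return total_tbps * (1.0 - hit_ratio)


def saving_per_point_tbps(total_tbps: float = TOTAL_EGRESS_TBPS) -> float:
    """Origin egress removed by one extra percentage point of hit ratio."""
    return total_tbps * 0.01


if __name__ == "__main__":
    for ratio in (0.90, 0.95, 0.99):
        print(f"{ratio:.0%} hit ratio -> {origin_egress_tbps(ratio):.2f} Tbps from origin")
    print(f"each point is worth {saving_per_point_tbps():.2f} Tbps")
```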
The first second of playback
Join time, the interval between the click and the first frame, deserves its own budget because it is the single number viewers feel most. The click triggers a playback service call that resolves the video to a manifest URL, costing one round trip plus a metadata cache read, perhaps 50 to 100 milliseconds. Fetching the manifest from the edge costs another round trip of 20 to 50 milliseconds when the edge is nearby, and then the player needs enough video to start, typically one or two low-rung segments, and at 240p a 10-second segment is only about 500 KB, which a modest connection pulls in a few hundred milliseconds. Stacking those gives a first frame somewhere between half a second and a full second on a healthy edge hit, which is exactly the experience the requirements asked for, and the budget makes plain why the player opens on the lowest rung, since starting at 1080p would multiply the first fetch by twenty and push the first frame seconds away.
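The join-time budget above can be added up explicitly. In this sketch the resolve and manifest latencies are the ranges quoted in the paragraph, the 240p bitrate comes from the ladder described earlier, and the 10 Mbps viewer connection is an assumption standing in for "a modest connection".

```python
def segment_bytes(bitrate_bps: int, seconds: int) -> float:
    """Size of one segment at a given rendition bitrate."""
    return bitrate_bps * seconds / 8


def fetch_ms(size_bytes: float, throughput_bps: int) -> float:
    """Transfer time for a payload, ignoring round-trip latency."""
    return size_bytes * 8 * 1000 / throughput_bps


def join_time_ms(resolve_ms: float, manifest_ms: float, first_segment_ms: float) -> float:
    """Click-to-first-frame time as the sum of the three sequential steps."""
    return resolve_ms + manifest_ms + first_segment_ms


if __name__ == "__main__":
    low_rung = segment_bytes(400_000, 10)      # 240p, 10-second segment
    first = fetch_ms(low_rung, 10_000_000)     # assumed 10 Mbps connection
    print(f"segment: {low_rung / 1000:.0f} KB, fetch: {first:.0f} ms")
    print(f"best case:  {join_time_ms(50, 20, first):.0f} ms")
    print(f"worst case: {join_time_ms(100, 50, first):.0f} ms")
```

On a slower link the segment fetch dominates the total, which is the quantitative reason the player opens on the cheapest rung.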
The same budget shows what a viewer feels when each piece fails. A manifest cache miss adds an origin round trip and stretches the start by a few hundred milliseconds, which most people never notice. A cold video in a distant region starts a second or two late while the shield fills, and the player hides most of it behind its conservative first rung. A mid-stream throughput collapse, as when a phone walks out of Wi-Fi range, shows up as the picture softening rather than stopping, because the player drops rungs ahead of the buffer running dry, and only when bandwidth falls below the lowest rung does the spinner appear. Designing so that failures degrade picture quality before they degrade continuity is the quiet principle underneath the whole playback stack.
Counting a billion views
A view counter sounds like UPDATE videos SET views = views + 1, and at this scale that statement is itself the design failure, because a viral video taking 50,000 views a minute would serialize more than 800 writes a second onto one hot row, while the count is also read on every watch page render, so the row becomes a contention point for both directions at once. Views are an aggregation problem rather than an increment problem. Players emit watch events into an event stream, a streaming job aggregates counts per video over short windows, and the aggregate is flushed to the store every few seconds, so the displayed number is eventually consistent, meaning briefly stale but always converging, which is harmless for a vanity metric and explains why the count on a live viral video visibly jumps in steps. The same event stream feeds fraud filtering, which separates plausible human views from replay loops and click farms before they ever reach the public number, and that filtering is the deeper reason view pipelines are asynchronous, because classification needs context, such as the pattern of a device's recent events, that a synchronous increment can never see. A creator's analytics ride the same stream with richer dimensions, which is why the private dashboard and the public counter can disagree for a few minutes without either being wrong.
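The aggregation idea can be shown with an in-memory stand-in for the streaming job. This is a minimal sketch, not a real stream processor: the class name, the 5-second window, and the use of a Counter as the "database" are assumptions, and fraud filtering is omitted.

```python
from collections import Counter, defaultdict


class ViewAggregator:
    """Buffers watch events per time window and flushes one write per video."""

    def __init__(self, window_s: int = 5):
        self.window_s = window_s
        self.pending = defaultdict(Counter)  # window start -> per-video counts
        self.store = Counter()               # stand-in for the metadata store

    def record(self, video_id: str, event_time_s: float) -> None:
        """Add one watch event to the window its timestamp falls in."""
        window = int(event_time_s // self.window_s) * self.window_s
        self.pending[window][video_id] += 1

    def flush(self, now_s: float) -> int:
        """Write every closed window to the store; return the number of writes."""
        closed = sorted(w for w in self.pending if w + self.window_s <= now_s)
        writes = 0
        for window in closed:
            for video_id, count in self.pending.pop(window).items():
                self.store[video_id] += count  # one increment, many views
                writes += 1
        return writes


if __name__ == "__main__":
    agg = ViewAggregator(window_s=5)
    for i in range(4000):
        agg.record("v71x2", i * 0.001)  # 4,000 views inside one window
    print(agg.flush(now_s=5.0), agg.store["v71x2"])
```

Thousands of events collapse into a single store write per video per window, which is why the public number moves in steps rather than continuously.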
Live streaming deserves its one contrasting paragraph, because it looks adjacent and inverts nearly every constraint. The pipeline must run in seconds end to end, so transcoding happens on dedicated always-on encoders rather than a batch DAG, segments shrink from ten seconds to two, manifests update continuously as new segments appear, and caches only help with the last few segments because nothing older has any audience. The shift in the latency budget changes the entire cost structure, since a video-on-demand system optimizes cost per byte over an infinite shelf life, while a live system optimizes seconds of delay over content with a shelf life of nothing, which is why the two are built by different teams on largely different infrastructure even inside one company.
Scaling, failures, and operations
Each tier scales on its own axis, and the differences are what make the system economical. Upload ingestion scales with creator count and is the least demanding tier. The transcode fleet scales with ingest hours and is an ideal fit for elastic and preemptible compute, the discounted machines a cloud can reclaim at any moment, because a killed worker costs one 10-second segment re-encoded rather than a restarted video, and the DAG scheduler retries failed tasks idempotently since every task writes to a deterministic output location, so running the fleet on cheap interruptible capacity is close to risk-free. Storage scales by adding object storage capacity, with lifecycle policies moving cold renditions to denser tiers and the coldest tail to archival storage, where a rare view pays a seconds-long retrieval that its rarity makes acceptable. The metadata database, holding video records and rendition lists, is read-heavy and shards by video ID with caches in front, and it is the one tier whose failure is visible everywhere, because a metadata outage means no watch page can resolve to a manifest even while every segment sits healthy on the CDN, an asymmetry that argues for caching manifests themselves at the edge with short revalidation so the read path can coast through a brief metadata stumble.
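The claim that retries are safe rests on each task writing to a location derived purely from its inputs. The sketch below illustrates that property with a dictionary standing in for object storage; the key layout, function names, and the encode callback are all hypothetical.

```python
from typing import Callable, Dict


def output_key(video_id: str, rung: str, segment_index: int) -> str:
    """Deterministic storage key: the same task always maps to the same key."""
    return f"encoded/{video_id}/{rung}/seg-{segment_index:05d}"


def run_transcode_task(store: Dict[str, bytes], video_id: str, rung: str,
                       segment_index: int,
                       encode: Callable[[str, str, int], bytes]) -> str:
    """Encode one segment unless an earlier attempt already produced it."""
    key = output_key(video_id, rung, segment_index)
    if key not in store:
        # A preempted worker leaves nothing behind, so the retry starts clean.
        store[key] = encode(video_id, rung, segment_index)
    return key


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    fake_encode = lambda video_id, rung, index: b"segment-bytes"
    first = run_transcode_task(store, "v71x2", "720p", 42, fake_encode)
    retry = run_transcode_task(store, "v71x2", "720p", 42, fake_encode)
    print(first == retry, len(store))
```

Running the task twice leaves one object in the store, so the scheduler can retry freely without tracking whether an earlier attempt partly succeeded.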
The failure stories each have a user on the other end, which is the right way to tell them. A dead transcode worker is retried invisibly and the creator never knows. A CDN point of presence failing means its viewers re-resolve to a neighboring edge, where they pay a latency bump and the neighbor pays a temporary hit-ratio dip while it warms with the orphaned audience's videos. An origin region failure is survived because encoded segments replicate across regions and the CDN's fill path is configured to fail over, so the symptom is slower cold starts rather than missing video. Copyright fingerprinting and policy scanning sit inside the DAG before publication, which makes the safety property structural rather than procedural, since a video that has not passed scanning has no ready state and no manifest, and therefore no race exists in which content is watchable before it is checked. The operational dashboard leans on five numbers, namely rebuffer ratio (the fraction of watch time spent stalled), join time from click to first frame, cache hit ratio by region, transcode queue depth, and upload time-to-ready by percentile, because together they describe the product exactly as the viewer and the creator experience it, and a regression in any of them is a person somewhere having a worse afternoon.
Follow-up questions
- Why split videos into segments before transcoding? Splitting buys parallelism, since a 2-hour video becomes 720 independent 10-second pieces that transcode simultaneously in about a minute instead of hours serially, and the same segmentation is what adaptive streaming needs anyway, so the work is paid once and used by both pipelines.
- Why not transcode everything to AV1 if it saves 30 to 50 percent bandwidth? The encode is several times more expensive and older devices cannot decode it, so the saving only repays the cost when a video is watched enough times. Popular videos therefore earn the expensive codecs, while the long tail stays on H.264, which every device ever shipped can play.
- Where does the first second of playback latency go? It is spent on the manifest fetch plus the first low-rung segment, which is why the player deliberately starts on a cheap rendition whose segment is a few hundred kilobytes, and then upgrades within a few segments once it has measured what the connection can really carry.
- What happens on a CDN cache miss for an unpopular video? The edge fetches from a regional shield cache, which fetches from origin object storage if needed, so the viewer pays a slower start measured in a second or two and the system pays origin egress, a trade that is acceptable precisely because the tail is rarely watched.
- Why is the view count eventually consistent? A synchronous increment would hot-spot one database row per viral video and would bypass fraud filtering entirely. Streaming aggregation batches millions of events into periodic flushes and gives classification time to discard replay loops and click farms before the public number moves.
- How would live streaming change this design? The batch DAG becomes an always-on low-latency encoder chain, segments shrink to about two seconds, manifests update continuously, and caching helps only at the live edge, so the cost structure shifts from storage and egress toward real-time compute and fanout.
References
- Pantos and May, RFC 8216: HTTP Live Streaming (2017).
- Xu, System Design Interview, Volume 1 (2020), chapter on designing YouTube.
- Netflix Technology Blog, Per-Title Encode Optimization (2015), on tailoring the encoding ladder.
- High Scalability, YouTube Architecture (2008), on the early scaling history.
- Kleppmann, Designing Data-Intensive Applications (2017), on stream processing and aggregation.