Design a video recommendation system

The YouTube homepage answers a strange question many times per second, which is how to choose, out of billions of videos, the few dozen that this particular person should see right now, with no search query to reveal what they want. Recommendation drives the majority of watch time on large video platforms, which is why the design question is an interview staple, and a generous one, because it gathers nearly every major theme of production machine learning into a single system. Retrieval must run at a scale where most models cannot run at all, ranking must be learned from data the previous ranker biased, cold start arrives on both sides of the marketplace, and a feedback loop sits underneath everything, ready to quietly eat the product if nobody names it.

The organizing idea is the funnel, and stating it with numbers up front frames everything that follows. No model worth using can score billions of items inside a single request, so the system works in stages of increasing cost applied to shrinking sets. Candidate generation reduces billions to a few thousand with math cheap enough to run against the entire corpus, ranking scores those few thousand with a model that can afford rich features because the set is now small, and re-ranking shapes the final few dozen with business and diversity rules that operate on the page as a whole. Each stage exists so the stage after it can afford to be smarter, since a model's per-item budget grows by orders of magnitude as the candidate set shrinks, roughly a millionfold between the full corpus and the few thousand candidates the ranker sees, and most of the decisions below reduce to choosing which stage a given piece of intelligence belongs in.

The walkthrough lingers on the two-tower factorization that makes retrieval possible, the value model that corrects watch time's clickbait failure, and the feedback loop between what the system shows and what it later trains on, which is the part most designs forget.

Scope and requirements

The surface under design is the signed-in homepage feed of a platform with two billion users and a corpus of several billion videos growing by hundreds of hours of new footage every minute. A request must return roughly 30 ranked videos, personalized from watch history, subscriptions, and context such as device, time of day, and language, inside a backend budget of about 200 milliseconds at the 95th percentile, because the homepage is the front door of the product and a slow front door suppresses every metric behind it. Peak traffic runs to a few hundred thousand homepage requests per second, and that number rewards multiplication, because requests at that rate, each ranking a couple of thousand candidates, come to hundreds of millions of ranking inferences every second, which makes the cost of the ranking model a first-order design input. In scope are candidate generation, ranking, re-ranking, and the training loop. Out of scope are search, which has a query of its own, the up-next panel, a sibling system sharing most of these parts, advertising, and the integrity systems that decide what is eligible to recommend, although re-ranking consumes their outputs as demotion signals.
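The multiplication is worth doing explicitly. The sketch below plugs in illustrative values from inside the ranges quoted above; the specific figures of 300,000 requests per second and 2,000 candidates per request are assumptions chosen for round numbers, not measurements.

```python
# Back-of-envelope ranking load. Both inputs are illustrative assumptions
# picked from the ranges in the text, not measured values.
PEAK_REQUESTS_PER_SECOND = 300_000   # "a few hundred thousand"
CANDIDATES_PER_REQUEST = 2_000       # "a couple of thousand"


def ranking_inferences_per_second(requests_per_second: int, candidates: int) -> int:
    """Every request scores every one of its candidates once."""
    return requests_per_second * candidates


load = ranking_inferences_per_second(PEAK_REQUESTS_PER_SECOND, CANDIDATES_PER_REQUEST)
print(f"{load:,} ranking inferences per second")
```

At these assumed inputs the product is 600 million inferences per second, so halving the ranker's per-item cost, or halving the candidate set, is a saving of roughly the same size.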

Framing the problem as a machine learning task

Candidate generation is framed as extreme-scale retrieval, meaning the system learns representations of users and videos such that, given a user, it can fetch the few thousand videos most likely to engage them without ever scoring the whole corpus. Ranking is framed as supervised prediction, where the model takes a user and video pair and predicts the probability of a click together with the expected watch time and satisfaction if the video is shown, trained from logged behavior. Two simpler framings deserve naming because each fails in an instructive way. Treating the whole problem as one classification over billions of classes fails on arithmetic alone, since a softmax output layer, the layer that assigns one probability to every class, would need billions of entries that can neither be trained nor served affordably, and the class set changes by the minute. Treating it as pure collaborative filtering, which rests on the classic insight that people who watched the same things in the past will watch the same things next, fails on freshness and cold start when used alone, because a video uploaded this morning has no co-watch history to learn from, and that blind spot covers exactly the content the platform most needs to surface. The production framing therefore combines learned retrieval with learned ranking, with the collaborative signal surviving inside the learned embeddings alongside content features.

Data and labels

Training data is the impression log, the record of every video shown on the page together with what the user did about it, whether they clicked, how long they watched, whether they liked, shared, or pressed the button marked not interested, all keyed by the position the video occupied on the page. Implicit labels dominate by volume, and the strongest simple label is watch time conditioned on impression rather than the raw click, because clicks alone teach the model that a misleading thumbnail is excellent content, since the misled user clicks, discovers the deception, and leaves, and the click is logged as a success either way. Explicit signals such as likes and survey responses arrive at a tiny fraction of the volume but anchor what satisfaction means.

Position bias infects all of it. An item shown in the first slot gets engaged with more simply for being first, regardless of quality, so a model trained naively on the log learns to predict placement as much as preference and then ratifies its own past choices. The standard corrections either include position as a training feature and neutralize it at serving time, scoring every candidate as if it sat in one fixed slot, or weight training examples by a propensity estimate, meaning an estimate of how much each position inflated the engagement. Both corrections depend on knowing how strong the bias is, and the only trustworthy measurement is to randomize the ordering on a small slice of real traffic and watch how engagement moves with position when quality is held equal. The deeper bias is that the log only contains outcomes for videos the previous system chose to show, so the data is silent about everything never tried, which is the feedback loop hazard treated in its own section below.
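The propensity-weighting correction can be sketched in a few lines. This is a minimal illustration, assuming a randomized slice logged as (position, clicked) pairs; the function names, the choice of the top slot as the reference, and the clipping value of 10 are all hypothetical choices for the example, not a prescribed method.

```python
from collections import defaultdict


def estimate_propensities(randomized_log):
    """Relative engagement propensity per slot, from randomized traffic.

    randomized_log: iterable of (position, clicked) pairs collected while
    the ordering was shuffled, so average quality is equal across slots.
    Returns {position: ctr_at_position / ctr_at_top_slot}.
    """
    shown, clicks = defaultdict(int), defaultdict(int)
    for position, clicked in randomized_log:
        shown[position] += 1
        clicks[position] += int(clicked)
    ctr = {p: clicks[p] / shown[p] for p in shown}
    top_ctr = ctr[min(ctr)]  # the slot with the smallest index is the reference
    if top_ctr == 0:
        raise ValueError("no clicks in the reference slot; need more data")
    return {p: ctr[p] / top_ctr for p in ctr}


def ipw_weight(position, propensities, max_weight=10.0):
    """Inverse-propensity weight for a clicked example, clipped to limit variance."""
    propensity = propensities[position]
    if propensity <= 0:
        return max_weight
    return min(1.0 / propensity, max_weight)


# Toy randomized slice: 100 impressions per slot, quality equal by construction.
log = (
    [(1, True)] * 20 + [(1, False)] * 80
    + [(2, True)] * 10 + [(2, False)] * 90
    + [(3, True)] * 5 + [(3, False)] * 95
)
propensities = estimate_propensities(log)
weights = {p: ipw_weight(p, propensities) for p in sorted(propensities)}
print(propensities)
print(weights)
```

In the toy data the third slot draws a quarter of the top slot's clicks at equal quality, so a click logged there is weighted four times as heavily in training. The clip exists because very small propensities would otherwise let a handful of examples dominate the loss.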

Features and the feature store

Three feature families feed the models, differing in cost as much as in content. User features include aggregates of watch and search history, topic affinities, subscription lists, and context such as device and hour of day. Video features include topic, language, duration, upload age, channel statistics, and engagement rates such as click-through rate and average completion. Cross features describe the specific pair, for example how many videos from this channel this user watched in the last month, and they are the expensive, high-value family, because they can only be computed per combination and the number of combinations is the product of two enormous sets, which is why only the ranking stage, facing a few thousand pairs rather than billions, can afford them.

These features live in a feature store, which is the system that computes each feature once and serves identical values to offline training and online inference. The store earns its place through the failure it prevents, called training-serving skew, where the training pipeline computes a feature one way and the serving path computes the supposedly identical feature in a subtly different way, perhaps a seven-day average offline against a six-day average online. Nothing errors when this happens, the model simply meets serving inputs shifted from the ones it learned on, quality sags by a few percent that no dashboard attributes to anything, and the cause is typically found months later by someone diffing feature values row by row. The discipline that travels with the store is to log the exact feature values used at serving time and train on those logged values, because recomputing history from raw events reopens every opportunity for the two paths to disagree. The store also stratifies by freshness, recomputing slow batch features such as thirty-day topic affinities daily while streaming fast counters such as a video's clicks in the last hour in near real time, because a count that arrives a day late has lost most of its value.
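The seven-day against six-day example is easy to reproduce, and so is the row-by-row diff that eventually finds it. The sketch below is illustrative only: the feature names, the daily values, and the `skew_report` helper are hypothetical, and a real check would compare logged serving values against the offline pipeline's output for the same request.

```python
def trailing_mean(daily_values, window):
    """Mean of the most recent `window` daily values."""
    recent = daily_values[-window:]
    return sum(recent) / len(recent)


def skew_report(offline_row, online_row, rel_tol=1e-6):
    """Return {feature: (offline, online)} for features whose values disagree."""
    mismatched = {}
    for name in offline_row.keys() & online_row.keys():
        a, b = offline_row[name], online_row[name]
        if abs(a - b) > rel_tol * max(abs(a), abs(b), 1.0):
            mismatched[name] = (a, b)
    return mismatched


# Hypothetical daily watch minutes for one user, oldest first.
daily_watch_minutes = [30, 42, 18, 55, 40, 61, 34]

# The offline pipeline averages seven days; the online path, by mistake, six.
offline_row = {
    "avg_watch_minutes": trailing_mean(daily_watch_minutes, 7),
    "subscriptions": 12.0,
}
online_row = {
    "avg_watch_minutes": trailing_mean(daily_watch_minutes, 6),
    "subscriptions": 12.0,
}
report = skew_report(offline_row, online_row)
print(report)
```

Neither path raises an error, and both values look plausible on their own, which is why the disagreement only shows up when the two rows are placed side by side.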

Model architecture

Candidate generation runs several sources in parallel and feeds ranking their union, because no single source covers the space. The workhorse is two-tower retrieval, which deserves a careful definition since it is the load-bearing idea of the stage. Two neural networks are trained jointly, a user tower that maps user features to an embedding, meaning a learned vector that summarizes taste as a position in a shared space, and an item tower that maps video features to an embedding in that same space, and training pushes the dot product, the simple similarity measure between two vectors, high for genuinely engaged user and video pairs and low for sampled non-engaged pairs, so proximity in the space comes to mean predicted engagement. The towers never look at each other's raw inputs, and that enforced independence is the entire trick, because it lets every video's embedding be computed offline and loaded into an approximate nearest neighbor index, a data structure that returns, with high but not perfect recall, the vectors with the highest dot product against a query while examining only a sliver of the corpus. At request time the system computes one user embedding, runs one index lookup, and holds its few thousand candidates within milliseconds.
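The split between offline and online work can be shown with toy towers. In the sketch below the two tower functions are hand-written stand-ins for trained networks, the topic list and video identifiers are invented, and an exhaustive top-k scan stands in for the approximate nearest neighbor index; only the shape of the computation is meant to carry over.

```python
import heapq

TOPICS = ["cooking", "chess"]  # hypothetical two-dimensional embedding space


def user_tower(user):
    """Stand-in for a trained network: user features -> embedding."""
    return [user["affinity"].get(topic, 0.0) for topic in TOPICS]


def item_tower(video):
    """Stand-in for a trained network: video features -> embedding."""
    return [video["topic_weights"].get(topic, 0.0) for topic in TOPICS]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_item_index(videos):
    """Offline step: embed every video once, with no user in sight."""
    return {video_id: item_tower(video) for video_id, video in videos.items()}


def retrieve(user, item_index, k):
    """Online step: one user embedding, one lookup.

    The exhaustive scan here is a placeholder for an ANN index.
    """
    u = user_tower(user)
    return heapq.nlargest(k, item_index, key=lambda vid: dot(u, item_index[vid]))


videos = {
    "v_cooking": {"topic_weights": {"cooking": 1.0}},
    "v_chess": {"topic_weights": {"chess": 1.0}},
    "v_mix": {"topic_weights": {"cooking": 0.7, "chess": 0.7}},
}
index = build_item_index(videos)  # computed before any request arrives
user = {"affinity": {"cooking": 0.9, "chess": 0.2}}
top_two = retrieve(user, index, k=2)
print(top_two)
```

Note that `item_tower` never receives the user and `user_tower` never receives a video, which is exactly what allows `build_item_index` to run ahead of time.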

The alternatives show what the independence buys. A model that reads the user and the video together would score each pair more accurately because it sees their interactions, and it loses at this stage because nothing about it can be precomputed, since every score depends on the specific pair, and scoring billions of pairs per request is exactly the impossibility the funnel exists to avoid. The older alternative, matrix factorization, learns one embedding per user identity and one per video identity from co-engagement alone, and it loses because identities are all it knows, so a new video is invisible to it, whereas the item tower places a brand-new upload into the space from its topic, language, and channel features before anyone has watched it. Alongside the towers run deliberately cheap sources, recent uploads from subscribed channels, because a subscription is an explicit request to see that channel's work, popular-in-region lists that need no personalization, and a fresh-content lane that gives new videos a route into the funnel before they have any history.

The ranking model is where the capacity is spent. A deep network with multi-task heads, meaning one shared body whose several output heads simultaneously predict click probability, expected watch time, like probability, and predicted survey satisfaction, scores each of the couple of thousand candidates with full cross features. Sharing one body lets scarce labels such as surveys ride on representations learned from the abundant ones, which is the practical argument for multi-task training over four separate models fighting for the same serving budget. The baseline worth naming first is a gradient-boosted tree ensemble or a plain logistic regression over the same features, which would serve faster and be easier to inspect, and the neural ranker justifies its keep by learning feature interactions the baseline cannot represent.

The value model and the clickbait correction

The heads produce four or five predictions per candidate while the page needs a single ordering, so a value model blends them into one score with tuned weights, and the weighting is a product decision wearing a model's clothes, because each weight encodes a judgment about what the platform wants more of. The field learned the shape of the wrong answer the expensive way. Ranking purely by predicted watch time rewards whatever keeps a session running, which includes thumbnails that overpromise, autoplay chains that escalate, and content people consume compulsively and regret afterward, and a user who feels worse after an evening on the platform rarely files a complaint but simply returns a little less often, a drift no two-week experiment can see. The correction the industry converged on gives deliberate weight to the satisfaction-flavored heads, survey responses asking whether a video was worth the viewer's time, likes, and the not-interested rate counted as a negative, so the blend pulls toward content users are glad they watched rather than merely content they watched.

A worked example shows the blend doing its job. Suppose candidate A carries a predicted twelve minutes of watch time alongside a high predicted not-interested rate and a weak predicted survey score, while candidate B carries a predicted eight minutes with strong satisfaction on both counts. Under watch time alone A wins by half again as much, and under a blend that prices the satisfaction heads meaningfully B overtakes it, which on a real page is the difference between a feed that escalates and a feed that earns the next visit. The weights are revisited against survey-based experiment results rather than set once, because the right trade between minutes today and trust next quarter moves as the catalog and the audience move.
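The same example in code, with numbers filled in. The watch-time predictions of twelve and eight minutes come from the text; the satisfaction predictions and all three weights are invented for illustration, since the real weights are a tuned product decision.

```python
# Hypothetical blend weights: one unit of value per predicted minute,
# a bonus for predicted survey satisfaction, a penalty for "not interested".
WEIGHTS = {"watch_minutes": 1.0, "survey_score": 10.0, "not_interested": -20.0}


def blended_value(predictions, weights=WEIGHTS):
    """Weighted sum of head predictions; higher ranks earlier."""
    return sum(weights[head] * predictions[head] for head in weights)


def watch_time_only(predictions):
    return predictions["watch_minutes"]


# Candidate A: more minutes, weak satisfaction signals (assumed values).
candidate_a = {"watch_minutes": 12.0, "survey_score": 0.3, "not_interested": 0.20}
# Candidate B: fewer minutes, strong satisfaction signals (assumed values).
candidate_b = {"watch_minutes": 8.0, "survey_score": 0.8, "not_interested": 0.02}

for name, candidate in (("A", candidate_a), ("B", candidate_b)):
    print(name, watch_time_only(candidate), round(blended_value(candidate), 2))
```

Under these assumed weights A scores about 11 and B about 15.6, so the ordering flips even though A still wins on minutes. Shrinking the two satisfaction weights toward zero restores A's lead, which is the sense in which the weights encode the product judgment.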

Offline evaluation

Candidate generation is measured with recall@K against held-out future behavior. The procedure takes a user's actual next-watched video from a time slice after the training window and checks whether it appears in the generator's top K candidates. If the held-out watch shows up in the top 2,000 candidates for 62 of 100 test users, recall@2000 is 0.62, and a change that lifts it to 0.66 has handed the ranker four more chances per hundred users to put the right video on the page, because no ranker can rescue a video that never entered the pool.
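The procedure is short enough to write out. This is a minimal sketch with invented user and video identifiers, assuming each test user has exactly one held-out next watch and an ordered candidate list from the generator.

```python
def recall_at_k(candidates_by_user, next_watch_by_user, k):
    """Fraction of test users whose held-out next watch is in their top-k candidates."""
    hits = sum(
        1
        for user, target in next_watch_by_user.items()
        if target in candidates_by_user.get(user, [])[:k]
    )
    return hits / len(next_watch_by_user)


# Hypothetical generator output (ordered best first) and held-out watches.
candidates_by_user = {
    "u1": ["a", "b", "c"],
    "u2": ["d", "e", "f"],
    "u3": ["g", "h", "i"],
    "u4": ["j", "k", "l"],
    "u5": ["m", "n", "o"],
}
next_watch_by_user = {"u1": "b", "u2": "z", "u3": "i", "u4": "j", "u5": "y"}

recall_3 = recall_at_k(candidates_by_user, next_watch_by_user, k=3)
recall_2 = recall_at_k(candidates_by_user, next_watch_by_user, k=2)
print(recall_3, recall_2)
```

Three of the five held-out watches appear within the top three candidates and two within the top two, so recall falls as K shrinks, which is why the metric is always quoted together with its K.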

Ranking is measured with AUC, which is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, and a tiny case makes the number concrete. Suppose two positives score 0.9 and 0.4 while two negatives score 0.7 and 0.2, giving four positive-negative pairs to check. The 0.9 positive outscores both negatives, contributing two correctly ordered pairs, the 0.4 positive outscores the 0.2 negative but loses to the 0.7 negative, contributing one more, and three correct pairs out of four gives an AUC of 0.75. Watch-time heads are regression targets scored with squared or absolute error instead, and calibration is checked separately, by binning predictions and comparing predicted probability with observed rate per bin, because the value model blends the heads as if their outputs mean what they say, and a head that drifts hot quietly tilts every ordering it touches.
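Both checks fit in a short sketch. The AUC function counts correctly ordered pairs directly, which is only practical for small samples, and scores ties as half a pair by the usual convention; the calibration helper uses equal-width bins, which is one common choice among several. The function names are illustrative.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of positive-negative pairs ordered correctly (ties count half)."""
    correct = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positive_scores) * len(negative_scores))


def calibration_table(predictions, outcomes, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# The four-example case from the text.
auc = pairwise_auc([0.9, 0.4], [0.7, 0.2])
print(auc)

# A head that runs hot in its upper bin: predicts 0.9, observes 0.5.
table = calibration_table([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 0], n_bins=2)
print(table)
```

The two metrics answer different questions. A head can order examples well, and so score a high AUC, while its probabilities are uniformly too high, and only the calibration table exposes the second problem.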

Splits are strictly time-based, training on the past and evaluating on a later slice, because a random split leaks tomorrow's engagement statistics into today's features and flatters every model. The unhappy fact to state plainly is that offline gains often shrink or vanish online. Offline evaluation can only replay what the old policy logged, so a new model that would have shown different videos gets judged only on the narrow region where the two policies overlap, and nothing offline can observe how users would have reacted to pages that were never served. Offline metrics therefore gate which models earn an online experiment, and no model ships on offline numbers alone.
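A time-based split is mechanically trivial, and writing it out makes the invariant explicit. The sketch assumes each logged example carries a `timestamp` field, here an integer day index; the field name and the sample rows are hypothetical.

```python
def time_split(examples, cutoff):
    """Train on everything strictly before `cutoff`, evaluate on the rest."""
    train = [e for e in examples if e["timestamp"] < cutoff]
    test = [e for e in examples if e["timestamp"] >= cutoff]
    return train, test


# Hypothetical impression rows with an integer day index as the timestamp.
examples = [
    {"timestamp": 1, "clicked": 0},
    {"timestamp": 2, "clicked": 1},
    {"timestamp": 3, "clicked": 0},
    {"timestamp": 4, "clicked": 1},
    {"timestamp": 5, "clicked": 1},
]
train, test = time_split(examples, cutoff=4)

# The invariant a random split would violate: no training row is later
# than any evaluation row.
latest_train = max(e["timestamp"] for e in train)
earliest_test = min(e["timestamp"] for e in test)
print(latest_train, earliest_test)
```

The same cutoff has to apply to feature computation as well as to labels, since an aggregate computed over the full log carries information from after the cutoff even when the rows themselves are split correctly.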

The serving funnel

A homepage request is worth walking end to end with the clock running. It arrives carrying the user's identity, and the first work, fetching user features and computing the user tower embedding, costs around ten milliseconds because the tower is small. The candidate sources then fan out in parallel, the approximate nearest neighbor lookup beside the subscription, regional popularity, and fresh-content sources, and since they run concurrently the stage costs what its slowest member costs, namely the index lookup at roughly ten to twenty milliseconds. Merging and deduplicating the union yields about 2,000 videos at around the 30 millisecond mark. The ranker then pulls cross features from the feature store's online tier and pushes candidates through the network in large batches, the most expensive hop at about 50 milliseconds, which is exactly why the funnel protected it from seeing more than a few thousand items. Re-ranking takes the top couple of hundred by blended score and applies the rules a per-item scoring model structurally cannot express, because they concern the composition of the page rather than the merit of any single video, enforcing diversity across channels and topics so the feed is not one channel's wall, mixing in recency, demoting borderline content flagged by the integrity systems, and suppressing items the user was recently shown and ignored. Roughly 30 videos render, the model path has spent about 100 milliseconds, and the rest of the envelope absorbs network hops, page assembly, and the occasional retry.
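The re-ranking rules can be sketched as a greedy pass over the blended scores. This is one simple way to express a per-channel cap, an integrity demotion, and a recently-ignored filter, with every identifier, multiplier, and limit invented for the example; production systems typically use richer diversity objectives.

```python
def rerank(scored, page_size, max_per_channel, recently_ignored=(), demotions=None):
    """Greedy page assembly with a channel cap, demotions, and an ignore filter.

    scored: list of {"video_id", "channel", "score"} dicts from the ranker.
    demotions: {video_id: multiplier below 1.0} supplied by integrity systems.
    """
    demotions = demotions or {}
    ignored = set(recently_ignored)
    eligible = [
        (item["score"] * demotions.get(item["video_id"], 1.0), item)
        for item in scored
        if item["video_id"] not in ignored
    ]
    eligible.sort(key=lambda pair: pair[0], reverse=True)

    page, per_channel = [], {}
    for _, item in eligible:
        channel = item["channel"]
        if per_channel.get(channel, 0) >= max_per_channel:
            continue  # this channel already fills its share of the page
        page.append(item["video_id"])
        per_channel[channel] = per_channel.get(channel, 0) + 1
        if len(page) == page_size:
            break
    return page


scored = [
    {"video_id": "a1", "channel": "A", "score": 0.9},
    {"video_id": "a2", "channel": "A", "score": 0.8},
    {"video_id": "a3", "channel": "A", "score": 0.7},
    {"video_id": "b1", "channel": "B", "score": 0.6},
    {"video_id": "c1", "channel": "C", "score": 0.5},
    {"video_id": "d1", "channel": "D", "score": 0.4},
]
page = rerank(
    scored,
    page_size=4,
    max_per_channel=2,
    recently_ignored={"d1"},
    demotions={"c1": 0.5},
)
print(page)
```

The third video from channel A is skipped despite outscoring everything below it, which is a decision no per-item score could have produced, because it depends on what else is already on the page.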

Degraded modes belong to the design. If the ranker times out, the page is served from retrieval scores alone, blander but never blank, and if the feature store's online tier is unhealthy the ranker falls back to the features carried in the request and its own learned priors, so the user sees a slightly duller homepage and almost never an error, which is the correct direction for a discovery surface to fail in.
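The ranker-timeout fallback amounts to choosing a different sort key. The sketch below assumes the ranking call signals a blown deadline by raising Python's built-in `TimeoutError`, and that each candidate still carries the score its retrieval source gave it; both are assumptions of the example, as are the function and field names.

```python
def serve_page(candidates, rank_fn, page_size):
    """Order by ranker scores, or by retrieval scores if the ranker times out.

    candidates: list of {"video_id", "retrieval_score"} dicts.
    rank_fn: callable returning {video_id: score}; may raise TimeoutError.
    Returns (ordered video ids, mode) so the degraded mode can be logged.
    """
    try:
        scores = rank_fn(candidates)
        ordered = sorted(candidates, key=lambda c: scores[c["video_id"]], reverse=True)
        mode = "ranked"
    except TimeoutError:
        ordered = sorted(candidates, key=lambda c: c["retrieval_score"], reverse=True)
        mode = "retrieval_fallback"
    return [c["video_id"] for c in ordered[:page_size]], mode


def healthy_ranker(candidates):
    """Hypothetical ranker returning fixed scores."""
    return {"x": 0.8, "y": 0.1, "z": 0.4}


def timed_out_ranker(candidates):
    """Hypothetical ranker that misses its deadline."""
    raise TimeoutError("ranker exceeded its budget")


candidates = [
    {"video_id": "x", "retrieval_score": 0.2},
    {"video_id": "y", "retrieval_score": 0.9},
    {"video_id": "z", "retrieval_score": 0.5},
]
normal_page, normal_mode = serve_page(candidates, healthy_ranker, page_size=2)
fallback_page, fallback_mode = serve_page(candidates, timed_out_ranker, page_size=2)
print(normal_page, normal_mode)
print(fallback_page, fallback_mode)
```

Returning the mode alongside the page matters operationally, because a rising share of fallback pages is itself a signal worth alerting on.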

The funnel at a glance. Parallel candidate sources, with two-tower retrieval over an approximate nearest neighbor index doing the heavy reduction from billions, merge into about 2,000 candidates, the multi-task ranker keeps a couple of hundred, and the re-ranking rules shape the final 30 for the page.

Cold start, exploration, and the feedback loop

Cold start arrives on both sides of the marketplace, and each side needs different machinery. A new user offers the user tower no history to summarize, so the first page leans on context, language, region, device, and any onboarding topic picks, blended with regional popularity, and the experience improves within the first session, because each early watch immediately becomes a feature, the embedding moves with every refresh, and by the second day the page reads as a feed rather than a popularity chart. A new video faces the harder problem, since every engagement-based source is blind to it by construction. It enters through the item tower, which places it in the embedding space from topic, language, and channel features alone, and through a deliberate exploration budget, meaning a small percentage of impressions reserved for showing videos the system is uncertain about to a random slice of plausibly matched users, which gathers the engagement evidence no amount of modeling can conjure from nothing.

Exploration is also the antidote to the feedback loop, the named hazard of this design, which deserves a slow statement. The model trains on what it showed, it shows what it scored highly, and it therefore never collects evidence about the videos it never showed, so an early error of omission becomes permanent, the catalog ossifies around its early winners, and the system grows more confident about a narrower and narrower slice of the catalog while its training metrics look superb. Seen from outside, the failure is gradual and easy to dispute, a feed that begins to feel samey, creators reporting that nothing new gains traction, topic concentration creeping upward in the audits, and by the time the trend is undeniable the log holds years of data confirming the narrow policy. The exploration slice breaks the loop by paying a small, measured engagement cost today for unbiased evidence tomorrow, and the logged randomness does double duty, since it also keeps the position-bias propensities calibrated and makes off-policy evaluation, the technique of estimating a new policy's performance from logs the old policy collected, trustworthy on at least the randomized slice. A platform that declines to fund this line item pays the same cost invisibly, in a feed that slowly narrows.
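A minimal sketch of an exploration slice follows, assuming a per-slot coin flip and a uniform draw from a pool of uncertain videos. The function name, the log fields, and the example rate are hypothetical, and real systems usually target exploration at plausibly matched users rather than drawing uniformly. The point of the sketch is the log record: the randomness is only useful later if the impression log records that it happened and at what rate.

```python
import random


def build_page(ranked_ids, explore_pool, page_size, explore_rate, rng):
    """Fill each slot from the ranking, or with probability explore_rate
    from a uniformly sampled uncertain video. Returns (page, impression_log)."""
    page, impression_log = [], []
    shown = set()
    ranked = list(ranked_ids)
    pool = list(explore_pool)
    next_ranked = 0
    for slot in range(page_size):
        pool = [v for v in pool if v not in shown]
        explore = bool(pool) and rng.random() < explore_rate
        if explore:
            video = pool[rng.randrange(len(pool))]
        else:
            # Skip ranked videos that an earlier exploration slot already used.
            while next_ranked < len(ranked) and ranked[next_ranked] in shown:
                next_ranked += 1
            if next_ranked == len(ranked):
                break  # nothing left to show
            video = ranked[next_ranked]
        shown.add(video)
        page.append(video)
        impression_log.append(
            {"slot": slot, "video_id": video, "explored": explore, "explore_rate": explore_rate}
        )
    return page, impression_log


ranked = ["r1", "r2", "r3", "r4", "r5"]
uncertain = ["new1", "new2", "new3"]
rng = random.Random(0)  # seeded so the sketch is repeatable

page, log = build_page(ranked, uncertain, page_size=5, explore_rate=0.1, rng=rng)
print(page)
print([row["explored"] for row in log])
```

Setting `explore_rate` to zero reproduces the pure ranking, and the `explored` flag lets the trainer and the off-policy evaluator separate randomized impressions from policy-chosen ones.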

The training loop. Impression logs feed the feature store, the trainer publishes versioned models, and serving writes new impressions back into the logs it will later train on. The dashed exploration path is the deliberate randomness that keeps the loop from learning only about its own past choices.

Online evaluation and operations

New models graduate through A/B tests read on long-horizon metrics, daily active users, total satisfied watch time, survey satisfaction, and retention measured over weeks, because a short-term engagement lift can be bought by spending long-term trust. A small holdout population is kept away from a change for months to expose the slow effects a two-week experiment cannot see. Deployment is conservative because the homepage is the product. A new ranker first runs on shadow traffic, meaning it scores real requests while its outputs are discarded, so its score distributions, calibration, and latency can be compared against the incumbent with zero user exposure, and only then does it ramp through small percentage slices with automated rollback wired to metric regressions, all enabled by the model registry keeping every version addressable.

Steady-state operations revolve around freshness and skew. Rankers retrain daily or continuously, because engagement features move quickly enough that a week-old ranker is measurably stale. The two-tower model retrains on a similar cadence, item embeddings are re-batched into the index every few hours so new uploads gain entry, and the index is versioned together with the tower that produced it, because a user tower from one training run paired with an item index built by another places queries and items by inconsistent geometry, and retrieval then returns plausible-looking but wrong neighbors with no error raised anywhere. The on-call engineer watches, roughly in the order each signal fires, the per-head score histograms, which shift within minutes of a drifting model or a broken upstream feature, then feature null rates and distribution drift, which catch pipeline failures while the model still coasts on its priors, then candidate source overlap, exploration slice health, and end-to-end funnel latency. The histogram alarm usually fires hours before any business metric moves, so the standing reflex for a sick model is to roll back first and diagnose afterward, since the previous version is one registry pointer away and engagement lost while reading logs never comes back. Periodic audits examine the page-level outcomes no per-item model can see, topic concentration, channel diversity drift, and the share of impressions reaching new creators, and the re-ranking rules are retuned from those audits.
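The score-histogram alarm can be approximated with a simple distance between a baseline histogram and the current one. The sketch below uses total variation distance over equal-width bins, which is one reasonable choice rather than the method the text prescribes, and the alert threshold of 0.2 is an arbitrary illustrative value that would need tuning per head.

```python
def histogram(scores, n_bins=10):
    """Normalized equal-width histogram of scores assumed to lie in [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    total = len(scores)
    return [c / total for c in counts]


def total_variation(p, q):
    """Half the L1 distance between two normalized histograms, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


ALERT_THRESHOLD = 0.2  # hypothetical; tuned per head in practice


def drifted(baseline_scores, current_scores, threshold=ALERT_THRESHOLD):
    distance = total_variation(histogram(baseline_scores), histogram(current_scores))
    return distance > threshold, distance


# Hypothetical click-head scores: yesterday's baseline and two current windows.
baseline = [0.15] * 50 + [0.35] * 50
healthy = [0.15] * 48 + [0.35] * 52
broken = [0.75] * 50 + [0.95] * 50  # e.g. an upstream feature went null

print(drifted(baseline, healthy))
print(drifted(baseline, broken))
```

Because the check needs only model outputs and no labels, it can run within minutes of a deploy or an upstream change, long before engagement metrics have accumulated enough data to move.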

Follow-up questions

Why a funnel instead of one great model? The arithmetic decides it, because scoring billions of items with a cross-feature ranker per request is impossible at any affordable price, while two-tower dot products against a precomputed index are nearly free. The funnel spends compute where the candidate set has been shrunk far enough to afford it.
Why can the ranker use cross features when the towers cannot? Two-tower retrieval depends on every item embedding being computable before any user asks, which structurally forbids features describing the user and item together. Once candidate generation has shrunk billions to a few thousand, per-pair features become affordable, and that affordability is exactly the ranker's advantage.
How do you stop watch time from rewarding clickbait? The blended objective gives real weight to heads that predict satisfaction, namely survey responses, likes, and the not-interested rate counted as a negative, and watch patterns that end in a quick bounce are labeled as the failures they are. The weights are revisited against survey-based experiment results as the catalog and the audience change.
What is the exploration budget actually buying? It buys unbiased evidence about new and under-shown videos, calibrated position-bias estimates, and protection from the feedback loop that would otherwise lock the catalog to its early winners. The price is a small, measured dip in short-term engagement, which is cheap insurance against a feed that narrows invisibly.
Why do offline wins evaporate online? Offline evaluation replays logs collected under the old policy, so a new model is judged only where the two policies overlap, and nothing in the logs reveals how users would have reacted to pages that were never served. Offline metrics decide which models earn an A/B test, and only the test decides what ships.
What breaks if the feature store is skipped? Training-serving skew does, because the offline pipeline and the online service drift into computing the supposedly identical feature differently, the model quietly underperforms with no error anywhere, and the cause surfaces months later when someone diffs feature values by hand. The store enforces one computation with two consumers, and logging values at serving time closes the last gap.

References

Covington, Adams, and Sargin, Deep Neural Networks for YouTube Recommendations (RecSys 2016).
Yan, System Design for Recommendations and Search, on the retrieval-ranking split in industry systems.
Malkov and Yashunin, Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (2016).
Aminian and Xu, Machine Learning System Design Interview (2023).