Video search is the box at the top of YouTube, where a user types "how to replace a bike chain" and expects, within a heartbeat, a ranked page of videos that actually show the repair. The query is text and the document is a moving picture with a soundtrack, and bridging that gap is the whole problem. Interviewers like the question because it forces a candidate to combine two systems that are usually taught separately, classical text retrieval and learned content understanding, and then to rank their merged output under a latency budget while the corpus grows by hours of video every second. A candidate who treats it as web search with thumbnails misses what makes the problem interesting, and a candidate who reaches straight for a giant multimodal model misses what makes it feasible.
The framing worth stating up front is that there are two complementary signal sources and the design needs both. Text metadata, meaning the title, description, tags, and uploader-provided captions, is cheap to index and often accurate, but metadata lies, because uploaders keyword-stuff, mislabel, and clickbait whenever lying pays. Content understanding, meaning signals computed from the frames and the audio themselves, cannot lie about what is in the video, but it is expensive to compute and fuzzier to match against a text query. The hard parts of the design are making these two sources cooperate, correcting the biases baked into click-based training data, and serving fresh results for queries about events that happened twenty minutes ago, and each of those gets its own treatment below.
Scope and requirements
The product takes a text query and returns a ranked list of videos from a corpus of, say, one billion videos with 500 hours of new footage uploaded every minute, which compounds to roughly 720,000 hours of new video per day and means the index never stops moving. Traffic runs on the order of tens of thousands of queries per second at peak, and the backend budget for a search results page is around 200 milliseconds at the 95th percentile, of which the retrieval and ranking core gets perhaps half, since the rest goes to page assembly, thumbnail fetches, and network hops the search team does not control. Freshness is a product requirement rather than a nicety, because a meaningful slice of queries concern breaking events where the best video on the platform is minutes old, and a search box that cannot find it teaches users to look elsewhere. In scope are query understanding, retrieval, and ranking. Out of scope here are personalization beyond coarse signals such as language and region, video recommendation, which is a related but distinct system that starts from a user rather than a query, and ads.
Framing the problem as a machine learning task
The task decomposes into retrieval plus ranking. Retrieval takes the query and narrows a billion videos to a few hundred candidates using cheap, scalable structures, and ranking applies a learned model with richer features to order those candidates carefully. The split exists because of arithmetic rather than taste, since scoring every video with the ranker would mean a billion model evaluations per query, while making retrieval perfect is unnecessary, because retrieval only needs to get good candidates into the pool for the ranker to find. This funnel logic, stages of increasing cost operating on shrinking sets, organizes most large-scale search and recommendation systems, and most of the design decisions below come down to which stage a given piece of intelligence belongs in.
Within retrieval, the input is the query text, possibly rewritten, and the output is a candidate set, while within ranking the input is a query-video pair with features and the output is a relevance score. Two alternative framings are worth naming because each fails instructively. Treating search as pure text matching over metadata is the right baseline and the wrong destination, because it inherits every metadata lie and never looks at the video itself, so the keyword-stuffed upload beats the genuinely useful one forever. Scoring video content directly against the query with a heavy multimodal model at query time would fix that and sits years of compute away from the budget, since no model that actually watches video can run per query at tens of thousands of queries per second. The production answer threads between the two by precomputing content understanding once at upload time and keeping query-time work light.
Data and labels
Training labels come almost entirely from behavior logs. A click on a result is weak evidence of relevance, a click followed by substantial watch time is stronger, and a click followed by an immediate bounce back to the results page is evidence against, so the practical label is a graded one built from click plus dwell. The grades carry real meaning, since a user who clicked a chain-repair video and watched eight minutes probably found the repair, while a user who clicked, lasted ten seconds, and came back to reformulate the query just testified that the title overpromised. Human-rated query-video pairs, judged against written guidelines by trained raters, are far cleaner and far scarcer, and they anchor evaluation and calibrate the click-derived labels rather than replacing them, because no rating budget covers the tail of queries that logs cover for free.
The dominant noise in click data is position bias, which is the tendency of top-ranked results to get clicked simply for being on top. Log studies agree that the first result draws several times the attention of the fifth even when their relevance is identical, so a model trained naively on clicks learns to imitate the old ranker's ordering rather than relevance, and each generation of model inherits the previous one's mistakes with interest. Two mitigations are standard and worth explaining plainly. Randomization swaps adjacent results for a tiny slice of traffic, which produces data where position and relevance are untangled and the true effect of position can be measured directly. Bias-corrected training then down-weights each click by the estimated probability that its position alone would have earned it, a quantity called the propensity, so a click at rank one counts for less than a click at rank eight, which was earned against the page's gravity. Proposing to train on clicks without one of these is proposing a design with a known systematic error in its labels.
Features and representation
The text side is represented with an inverted index, the classic structure that maps each term to the list of videos whose metadata contains it. Scoring uses BM25, a battle-tested formula that rewards documents containing rare query terms often while damping repetition, so the tenth occurrence of a word adds far less than the second, and normalizing for document length, so long descriptions cannot win by sheer volume. The rare-term emphasis does the real work, because matching "the" says nothing while matching "derailleur" nearly settles the question. ASR, automatic speech recognition, extends the text side powerfully, since transcribing the audio track turns the spoken content into text that is indexed like any other field, and a video that never says "bike chain" in its title but demonstrates the repair aloud becomes findable. Transcripts carry recognition errors, especially on accents, music, and jargon, so transcript matches live in a weaker-weighted field rather than being trusted like titles, and the per-word confidence the recognizer emits can gate which words enter the index at all.
The visual side is represented with embeddings, learned vectors where distance tracks semantic similarity. Frames are sampled from each video, one every few seconds plus the shot boundaries where the picture changes, and pushed through an image encoder into vectors that are pooled into a single video-level vector. The load-bearing requirement is that video vectors and query-text vectors live in the same space, which is what CLIP-style training provides, since a text encoder and an image encoder are trained jointly on hundreds of millions of paired images and captions so that matching pairs land close together, and afterward a text query can be encoded and compared directly against video embeddings with nothing more than a dot product. Searching those vectors at scale uses an ANN index, short for approximate nearest neighbor, a structure that finds close vectors while examining only a tiny fraction of the corpus, trading a few points of recall for several orders of magnitude in speed, and the trade is taken without hesitation because a retrieval stage that misses one good candidate in twenty is acceptable while one that takes a full second is not.
The ranking stage adds the remaining feature families on top of the retrieval scores, text relevance from each index field, embedding similarity, freshness, engagement statistics such as the video's historical click and watch rates, channel quality, and query-independent quality signals such as resolution and stability. Engagement features are served from a feature store, the system that computes each feature once and serves identical values to training and to serving, because the alternative, two pipelines computing the same nominal average watch rate slightly differently, produces a model that underperforms in production with no error message appearing anywhere.
Model architecture
Candidate generation is hybrid, with the text index and the vector index queried in parallel, each contributing its top few hundred, and the union deduplicated into roughly 500 candidates. The two retrievers fail differently, which is the entire point of running both. Text retrieval nails exact names and rare terms, so "ronaldo skills 2025" is its kind of query, but it misses paraphrase, while embedding retrieval handles "fix a slipped gear thingy" by landing near chain-repair videos in vector space, but blurs precise entities, since one footballer's embedding sits suspiciously close to another's. Either path alone reintroduces its own blind spots, and the union covers both for the price of a dedupe step.
The ranker is a learned model over query-video feature vectors. The strong baseline is a GBDT, a gradient-boosted decision tree ensemble that builds many small trees where each corrects the errors of the ones before, and it endures because it handles heterogeneous tabular features gracefully, trains fast enough for daily refreshes, and exposes feature attributions that let an engineer trace a strange ranking back to the feature responsible. The production-grade step up is a neural ranker that consumes the same features plus raw text and embedding inputs, often with multiple output heads predicting click, watch time, and satisfaction together, blended into one ordering score. Presenting this to an interviewer, I would ship the GBDT first, because its diagnosability matters most while the data pipelines mature, and graduate to the neural ranker once randomization and bias correction have made the labels trustworthy enough to reward the extra capacity, since a high-capacity model trained on biased labels mostly learns the bias with more conviction.
Offline evaluation
The two stages get different metrics because they have different jobs. Candidate generation is judged by recall at K, meaning the fraction of known-relevant videos that survive into the top K candidates across a set of judged queries, since anything retrieval drops is gone forever no matter how clever the ranker is. Ranking is judged by nDCG, normalized discounted cumulative gain, which rewards placing the most relevant items highest, and a worked example keeps it from staying abstract. Suppose a query has five judged results that the ranker returned in the order 2, 3, 0, 3, 1, where each number is a graded relevance judgment and 3 means excellent. Each position's gain is its relevance multiplied by a discount of 1 over log base 2 of the position plus one, which gives discount factors of 1.0, 0.631, 0.5, 0.431, and 0.387 for the five positions, falling off quickly because users read from the top. Multiplying and summing gives a DCG of 2 times 1.0, plus 3 times 0.631, plus nothing for the third slot, plus 3 times 0.431, plus 1 times 0.387, which comes to about 5.57. The ideal ordering 3, 3, 2, 1, 0 scores about 6.32 by the same arithmetic, so nDCG is 5.57 over 6.32, roughly 0.88, and the missing 0.12 is precisely the cost of ranking one excellent result fourth instead of first while weaker ones sat above it.
Split discipline matters more in search than almost anywhere, because engagement features leak the future. If Tuesday's click counts inform a model evaluated on Monday's queries, the offline numbers inflate, and the inflation evaporates at launch. Splits are therefore time-based, training on weeks one through six and evaluating on week seven, with every feature computed only from data available before each evaluation moment, which is fussy engineering and also the entire difference between offline numbers that predict launches and offline numbers that decorate slides.
The serving funnel
Query understanding runs first and stays deliberately thin, covering spelling correction, language identification, and entity recognition, which detects that "ronaldo skills 2025" contains a person whose name should weight exact matches, all in about 5 milliseconds, because every millisecond spent here is taken from retrieval and ranking. The rewritten query then fans out to both retrieval paths in parallel, BM25 over the inverted index and ANN search over the vector index, each returning its top 300 in roughly 40 milliseconds, and the 40 is overlapped rather than summed because the two run concurrently, so the slower path sets the cost. The union and dedupe stage assembles about 500 candidates in a millisecond or two, the ranker scores them in around 30 milliseconds with feature lookups batched into a handful of round trips to the feature store rather than 500 individual reads, and a light final pass applies diversity and policy rules before the page renders. Adding the stages lands the core path near 80 milliseconds against the 100 millisecond allocation, and the slack is tail insurance rather than generosity, because 95th percentile budgets are consumed by the occasional slow feature fetch, and a stage that overruns its slice gets truncated rather than waited on, since a slightly worse page now beats a perfect page late.
The serving funnel. A thin query understanding stage feeds the text and vector retrievers in parallel, their union of about 500 candidates flows into the learned ranker, and 20 videos come out. Candidate counts shrink left to right so each stage can afford a heavier model than the one before it.
The indexing side
Everything the funnel touches was precomputed at upload time. When a video arrives, three processors run on it. The metadata indexer tokenizes the title, description, tags, and uploader captions into the inverted index within seconds, since text indexing is cheap. The frame pipeline samples frames, embeds them with the image tower of the CLIP-style model, pools them into the video vector, and inserts it into the ANN index. The ASR system transcribes the audio, typically within minutes, and its transcript terms join the text index as the separate weaker-weighted field described earlier. The heavy content work, embedding and transcription, costs real GPU time per video, which is exactly why it happens once per video at upload rather than per query, since a video uploaded once may be queried a million times, and moving the cost to upload time divides it by that ratio.
Freshness gets its own lane because the batch indexes rebuild on cycles measured in hours, which is an eternity for a news query. A near-real-time index, a small append-only structure holding the newest videos indexed on metadata alone within seconds to minutes of upload, sits beside the batch indexes and is queried together with them, its entries aging out as the batch indexes absorb them. Content signals for fresh videos arrive late by construction, since the embedding and the transcript have not been computed yet, so the ranker is trained with explicit missing-feature handling and leans on metadata, channel history, and early engagement velocity until the content signals land. That training choice is also the design's answer to why freshness is a ranking problem and not merely an indexing one, because an index that contains the fresh video achieves nothing if the ranker then discards it for lacking features it has not had time to earn.
Upload-time processing. Frames are embedded into the vector index, metadata and ASR transcripts feed the text index, and the highlighted near-real-time lane indexes new videos on metadata alone within minutes so news queries can find them before the heavy content signals arrive.
Online evaluation and operations
Offline nDCG gains earn an A/B test rather than a launch, and the online metrics are behavioral. Session success, the share of search sessions that end in a meaningful watch, leads the panel, joined by time to first satisfying click, abandonment and query-reformulation rates, and watch time from search. Watch time deserves its caveat said plainly, because optimizing it alone rewards clickbait titles and artificially stretched videos, so satisfaction signals temper it, through bounce rates, explicit "not interested" feedback, surveys run on a small slice of users, and long-term return visits. A change that lifts watch time while raising immediate bounces is a regression wearing a costume, and the metric design has to be able to say so before the launch review does.
Operationally, the ranker retrains daily to weekly because engagement features drift quickly, while the embedding and ASR models refresh on a slower cadence, and any embedding model change forces re-embedding the corpus and rebuilding the vector index, which is scheduled and rehearsed like the infrastructure event it is, because query vectors from a new text encoder searched against item vectors from an old image encoder return geometric nonsense. Training-serving skew, where features are computed one way in the training pipeline and another way online, is guarded against by serving both from the one feature store and by logging the exact feature values used at serving time so that training replays reality instead of reconstructing it. Monitoring covers prediction score distributions per query segment, retrieval overlap between the two paths, fresh-index age, ASR backlog, and per-stage latency tails. The shape of a freshness incident is worth knowing in advance, since it is what the on-call engineer actually sees, with the fresh-index age alarm firing first, news queries quietly returning day-old videos while the rest of search looks perfectly healthy, and reformulation rates on event queries climbing as users fail to find what they came for, and the immediate response is failing the fresh lane over to its replica and backfilling, because the batch indexes cannot be hurried. Query distributions drift with the news cycle and the language mix, so per-language dashboards are the difference between catching a regression in a week and hearing about it from the press.
Follow-up questions
- Why run two retrieval paths instead of one good one? The two paths fail differently, since the text index wins on exact names and rare terms while the vector index wins on paraphrase and visual queries, and the union covers both blind spots for the cost of a dedupe step. Collapsing to either alone reintroduces that path's misses, and the lost candidates are unrecoverable because the ranker never sees them.
- How do you keep metadata lies from winning? Content signals, meaning embedding similarity and transcript matches, enter the ranker alongside metadata relevance, so a stuffed title has to compete with evidence computed from the actual frames and audio. Engagement features then finish the job over time, because bounce-aware labels mean a clickbaited click teaches the model the opposite of what the uploader hoped.
- Why is position bias such a big deal in training data? Clicks confound relevance with placement, so training naively on them teaches the new ranker to copy the old one's ordering, mistakes included, and the system calcifies. Randomized swaps on a small traffic slice produce data where the two are untangled, and propensity-weighted training discounts each click by how much its position alone would have earned it, which together break the circularity.
- How does a video uploaded five minutes ago become searchable? It enters through the near-real-time index, which carries metadata-only entries within seconds to minutes of upload, while embeddings and transcripts backfill over the following hour. The ranker is trained to handle those missing content features explicitly, leaning on channel history and early engagement velocity, rather than discarding fresh videos for lacking signals they have not had time to earn.
- What would you measure to decide if ASR is worth its cost? Retrieval recall and session success on the slice of queries that match transcripts but not metadata, weighed against the GPU cost of transcribing every upload. Long-tail how-to and lecture queries usually justify the spend, while transcribing music uploads mostly indexes lyrics nobody searches for, so the refinement is gating ASR by predicted speech content and spending the saved compute elsewhere.
- GBDT or neural ranker? The GBDT ships first, because it is strong on tabular features, cheap to retrain daily, and diagnosable through feature attribution when a ranking looks wrong. The neural ranker earns its complexity later, once bias-corrected labels and text-embedding inputs give its extra capacity something real to learn, since before that point the bigger model mostly memorizes the logging policy's biases more thoroughly.
References
- Radford et al., Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021).
- Elastic, Practical BM25: The BM25 Algorithm and its Variables.
- Covington, Adams, and Sargin, Deep Neural Networks for YouTube Recommendations (RecSys 2016), for the funnel pattern and watch-time labels.
- Aminian and Xu, Machine Learning System Design Interview (2023).