Every time a social app renders an ad, an auction has just finished. Advertisers bid for the slot, and the platform ranks them not by bid alone but by bid multiplied by the predicted probability that this user clicks this ad right now. That probability comes from the click-through rate model, usually called the CTR model, and it is arguably the most money-sensitive ML system in industry, because its output is multiplied by dollars on every one of hundreds of thousands of requests per second, advertisers are charged based on it, and a one percent relative improvement in its quality is worth real revenue at scale. The cast of characters matters as much as the math. The user wants ads that are at least tolerable, the advertiser wants the outcomes they paid for, and the platform sits between them selling predictions of behavior, so an error in this model is never just a modeling error, since somebody is overcharged, shown the wrong thing, or underpaid every time it happens. Interviewers ask this question because it forces a different set of priorities than a recommender does, and the candidate who treats it as just another binary classifier misses the point.
The point is calibration. A model is calibrated when its probabilities mean what they say, so that among all impressions where it predicts a 2 percent click probability, clicks actually happen about 2 percent of the time. A recommender only needs the order of its scores to be right, but an auction multiplies the score by a bid, and multiplication makes the absolute value load-bearing. Suppose ad A bids 1 dollar with a true click rate of 2 percent, and ad B bids 50 cents with a true click rate of 5 percent. Expected revenue per impression is the bid times the click probability, which comes to 2 cents for A and 2.5 cents for B, so B should win the slot. Now let the model overpredict A at 5 percent while scoring B correctly. A's apparent expected revenue becomes 5 cents, A wins the auction, the platform earns 2 cents in expectation instead of 2.5, advertiser A pays for placements that underdeliver and eventually notices, and the user sees a worse-fitting ad in the bargain. Every party loses something, and nothing in the system throws an error. Miscalibration is not a cosmetic flaw here; it silently misprices the entire marketplace.
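The arithmetic of that example can be replayed in a few lines. This is a minimal sketch that assumes the advertiser simply pays its bid per click, which ignores real auction pricing rules; the dictionary layout and the names `expected_revenue` and `winner` are illustrative, not part of any platform's API.

```python
def expected_revenue(bid: float, ctr: float) -> float:
    """Expected revenue per impression: bid times click probability."""
    return bid * ctr

# Bids in dollars. "pred_ctr" is what the model outputs, "true_ctr" is reality.
ads = {
    "A": {"bid": 1.00, "true_ctr": 0.02, "pred_ctr": 0.05},  # overpredicted
    "B": {"bid": 0.50, "true_ctr": 0.05, "pred_ctr": 0.05},  # scored correctly
}

def winner(ctr_key: str) -> str:
    """Pick the ad with the highest bid * CTR under the chosen CTR column."""
    return max(ads, key=lambda name: expected_revenue(ads[name]["bid"], ads[name][ctr_key]))

ideal = winner("true_ctr")   # who should win, given the true rates
actual = winner("pred_ctr")  # who wins when the auction trusts the model

# Revenue is realized at the true click rate, whatever the model believed.
best = expected_revenue(ads[ideal]["bid"], ads[ideal]["true_ctr"])
realized = expected_revenue(ads[actual]["bid"], ads[actual]["true_ctr"])
```

Here `ideal` is B and `actual` is A, and the platform realizes 2 cents per impression where 2.5 were available, with no exception raised anywhere.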
Scope and requirements
The system predicts the click probability for candidate ads inside the delivery path. Targeting and retrieval, meaning the machinery that finds the ads eligible for this user, budget pacing, and the auction mechanics themselves are adjacent systems with designs of their own, so this walkthrough owns the model that scores candidates, the features it eats, and the pipeline that keeps it fresh. Drawing that boundary early keeps the conversation from drifting into auction theory, which is a different interview.
Scale assumptions deserve stating before any architecture is drawn. Peak traffic runs to a few hundred thousand ad requests per second, each request scores on the order of a few hundred candidate ads, and the whole auction is allotted around 50 milliseconds, of which scoring can take perhaps 10 to 15. Multiplying those numbers out gives on the order of a hundred million model evaluations per second across the fleet, which is why every added feature and every extra layer in the model carries a hardware cost someone can compute in dollars. The output must be a calibrated probability rather than just a ranking score, for the auction reasons above, and the system must adapt within hours rather than days, because campaigns launch and exhaust budgets daily, interest follows the news cycle, and a model trained on last week's inventory has never seen much of what it scores today.
Framing the problem as a machine learning task
The framing is binary classification over (user, ad, context) triples, where the question for each impression is whether it produces a click within the attribution window, and the model outputs a probability rather than a hard yes or no. Pointwise probability output is non-negotiable here, unlike in ranking systems where pairwise objectives are an option, precisely because the auction consumes the absolute value, and a pairwise model could order ads beautifully while giving the auction nothing it can multiply by a bid. There is a plausible alternative framing as predicting downstream conversion, meaning the purchase or signup the advertiser ultimately wants, and mature platforms do run conversion models alongside, but click remains the foundation for two practical reasons. Click labels arrive in minutes rather than the days a conversion can take, and click volume is large enough to support continuous training, while conversions are sparse enough that their models end up leaning on the click model's machinery anyway.
Data and labels
Training data is the impression stream itself. Every served ad logs the full feature vector as computed at serving time, and the label arrives when a click for that impression shows up, or never does. Two properties of these labels shape the whole pipeline, and both deserve a slow look.
The first property is delay. A click can land seconds or minutes after the impression, while the user scrolls back up or finishes reading, so the training example cannot be finalized at serve time. The common approach fixes an attribution window, say five minutes, chosen by measuring the click-delay curve and cutting where it flattens, and a joining job holds impressions until the window closes before emitting them as labeled examples. The alternative is to emit every impression immediately as a negative and issue a correcting update when the late click arrives, which buys fresher data at the cost of a trainer that must handle label revisions, and it appears in the most latency-hungry systems for exactly that reason. Choosing the window length is a genuine tradeoff rather than a default, since a short window mislabels slow clickers as negatives while a long one delays every example it holds.
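The windowed join can be sketched as a small batch function. This is an illustration under simplifying assumptions: clicks arrive as an in-memory dictionary keyed by impression ID, timestamps are plain seconds, and the names `Impression` and `join_labels` are hypothetical. A production join would run over a stream with durable state.

```python
from dataclasses import dataclass

ATTRIBUTION_WINDOW_S = 300  # five minutes, the example window from the text

@dataclass
class Impression:
    impression_id: str
    served_at: float  # seconds

def join_labels(impressions, clicks, now):
    """Label impressions whose attribution window has closed.

    clicks maps impression_id -> click timestamp. Impressions still inside
    their window are returned as pending so a later pass can label them.
    """
    labeled, pending = [], []
    for imp in impressions:
        deadline = imp.served_at + ATTRIBUTION_WINDOW_S
        if now < deadline:
            pending.append(imp)  # window still open, hold the example
            continue
        clicked_at = clicks.get(imp.impression_id)
        label = 1 if clicked_at is not None and clicked_at <= deadline else 0
        labeled.append((imp.impression_id, label))
    return labeled, pending

imps = [Impression("a", 0.0), Impression("b", 0.0),
        Impression("c", 0.0), Impression("d", 200.0)]
clicks = {"a": 42.0, "c": 400.0}  # "c" clicked after its window closed
labeled, pending = join_labels(imps, clicks, now=450.0)
```

Impression "c" shows the cost of a short window: its click at 400 seconds lands after the 300 second deadline, so it is emitted as a negative, while "d" is still held.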
The second property is imbalance. Click rates run around 1 percent, so 99 percent of examples are negatives, and training on all of them costs far more than it teaches, because the marginal no-click impression mostly repeats what the previous million already said. Negatives are therefore downsampled, often keeping a tenth, which cuts training cost nearly tenfold while preserving almost all of the informative examples, and the price is that the model now learns probabilities on a reshaped distribution, creating the recalibration obligation worked through below. There is also a structural bias to acknowledge, namely that the system only ever receives labels for ads it chose to show, so the data reflects the current model's preferences, and a small exploration budget is the standing remedy.
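A minimal sketch of negative downsampling follows, run on a synthetic stream with a 1 percent click rate. The function name `downsample_negatives` and the `(features, label)` tuple layout are assumptions for illustration, and features are left as `None` because only the labels matter here.

```python
import random

def downsample_negatives(examples, keep_fraction, rng):
    """Keep every positive; keep each negative with probability keep_fraction."""
    return [
        (features, label)
        for features, label in examples
        if label == 1 or rng.random() < keep_fraction
    ]

rng = random.Random(0)
# Synthetic impressions: roughly 1 percent clicks, features omitted.
stream = [(None, 1 if rng.random() < 0.01 else 0) for _ in range(200_000)]
sampled = downsample_negatives(stream, keep_fraction=0.1, rng=rng)

raw_rate = sum(label for _, label in stream) / len(stream)
sampled_rate = sum(label for _, label in sampled) / len(sampled)
```

The sampled set is close to a tenth of the original size, and its click rate sits near 9 percent instead of 1 percent, which is the reshaped distribution the model ends up learning.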
Features and representation
Features come in three families plus one that quietly does much of the work. User features cover demographics, declared interests, and interaction history, including which ad categories this person has clicked or hidden before. Ad features cover the creative, the campaign, the advertiser, and the ad's format, which matters because video, carousel, and static formats earn structurally different click rates. Context features cover placement, such as feed versus story versus right column, along with device, hour of day, and day of week. The quiet workhorses are the historical-rate features, meaning this ad's click rate over the last few hours and days, this advertiser's recent click rate, and this user's click rate within this ad category, each smoothed toward a prior when counts are small so that an ad with three impressions does not carry a rate of zero or one. A fresh estimate of how this exact ad has been performing is hard for any model to beat from raw attributes alone, which is why these counters get their own streaming infrastructure later in the design.
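One common way to smooth a rate toward a prior is to add pseudo-counts, sketched below. The text does not say which smoothing scheme a given platform uses, so treat this as one reasonable choice; `prior_ctr` and `prior_strength` are hypothetical parameter names, with `prior_strength` acting like a number of imaginary impressions observed at the prior rate.

```python
def smoothed_ctr(clicks: int, impressions: int,
                 prior_ctr: float, prior_strength: float) -> float:
    """Shrink the observed click rate toward prior_ctr.

    With few impressions the prior dominates; with many, the data does.
    """
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)

# Three impressions, no clicks: stays near the 1 percent prior, not 0.
cold_no_clicks = smoothed_ctr(0, 3, prior_ctr=0.01, prior_strength=100)    # ~0.0097
# Three impressions, three clicks: nudged up, but nowhere near 1.0.
cold_all_clicks = smoothed_ctr(3, 3, prior_ctr=0.01, prior_strength=100)   # ~0.039
# Plenty of data: the observed 1 percent rate passes through almost unchanged.
warm = smoothed_ctr(500, 50_000, prior_ctr=0.01, prior_strength=100)       # ~0.0100
```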
Feature crosses matter because click behavior lives in the interactions. A cross is a feature formed from the joint value of two or more raw features. Suppose overall click rate on iOS is 1.1 percent and overall click rate in the top feed slot is 1.8 percent, but iOS users in the top slot click at 3.0 percent. A linear model with separate iOS and top-slot features can only stack their individual effects and lands well short of 3.0, while the crossed feature "iOS and top slot" carries the surplus directly, and the model just learns a weight for it. Classic systems hand-built thousands of such crosses and pruned them by measured value, while deep models learn the same interactions implicitly through embedding arithmetic, which is much of their appeal and much of their opacity.
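Mechanically, a hand-built cross is just one more sparse token formed from the joint value. The sketch below shows the construction only; the token format and the helper name `sparse_tokens` are assumptions, not a convention from any of the cited systems.

```python
def sparse_tokens(raw: dict, crosses: list) -> list:
    """Turn raw categorical features into tokens, then append crossed tokens."""
    tokens = [f"{name}={value}" for name, value in sorted(raw.items())]
    for left, right in crosses:
        # The cross fires only when both values co-occur, so a linear model
        # can attach a weight to the combination itself.
        tokens.append(f"{left}={raw[left]}&{right}={raw[right]}")
    return tokens

request = {"device": "ios", "slot": "top", "hour": 20}
tokens = sparse_tokens(request, crosses=[("device", "slot")])
# tokens == ["device=ios", "hour=20", "slot=top", "device=ios&slot=top"]
```

The weight learned for `device=ios&slot=top` is where the surplus beyond the two separate effects would live.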
The representation problem is cardinality. Advertiser IDs, ad IDs, and interest tokens are categorical features with millions of distinct values, far too many for one-hot encoding, where each value would get its own input dimension. Two standard answers exist, and they trade against each other. Embeddings are learned dense vectors of perhaps 16 to 64 numbers per ID such that IDs with similar click behavior end up near each other, and they carry the most signal but require a vocabulary, storage proportional to the number of IDs, and a story for IDs never seen during training. The hashing trick instead maps the millions of raw values into a fixed number of buckets, say one million, by hashing the ID, accepting occasional collisions in exchange for bounded memory and the ability to absorb never-before-seen IDs without a vocabulary update. The two compose well in practice, hashing the long tail while learning real embeddings for the head entities whose data can support them.
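The hashing trick fits in a few lines. This sketch uses MD5 from the standard library purely as a stable, well-mixed hash (Python's built-in `hash` is salted per process for strings, so it would not give the trainer and the server the same buckets); the function name `bucket` and the bucket count are illustrative.

```python
import hashlib

NUM_BUCKETS = 1_000_000  # fixed table size, the example figure from the text

def bucket(field: str, value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a raw categorical value to a bucket index that is stable across processes."""
    digest = hashlib.md5(f"{field}={value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

known = bucket("ad_id", "42")
unseen = bucket("ad_id", "never_seen_before_123")  # no vocabulary update needed
```

Including the field name in the hashed string keeps an advertiser ID and an ad ID with the same raw value from landing in the same bucket by construction, though unrelated values can still collide by chance.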
Model architecture
The classic baseline is logistic regression over sparse features and crosses, and it endured for a decade at the largest platforms for reasons worth saying out loud. It trains and serves fast, it updates online one example at a time, its probabilities come out naturally well calibrated under log loss, and a per-feature weight is something an engineer can actually read when an advertiser asks why their delivery changed yesterday. Google's published production work used exactly this with FTRL-Proximal, a follow-the-regularized-leader online optimizer designed to keep sparse models sparse while updating from a stream, and the longevity of that recipe is a useful corrective to the instinct that the newest architecture is the safest answer.
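To make "updates online one example at a time" concrete, here is a minimal sparse logistic regression trained with plain stochastic gradient descent on a synthetic stream. It is deliberately not FTRL: there is no per-coordinate learning rate and no sparsity-inducing regularization, and the class name, token names, and click rates are all made up for the illustration.

```python
import math
import random

class OnlineLogisticRegression:
    """Sparse logistic regression over binary tokens, updated per example with SGD."""

    def __init__(self, learning_rate: float = 0.05):
        self.weights = {}
        self.learning_rate = learning_rate

    def predict(self, tokens) -> float:
        z = sum(self.weights.get(t, 0.0) for t in tokens)
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, tokens, label: int) -> float:
        p = self.predict(tokens)
        step = self.learning_rate * (p - label)  # log-loss gradient per active token
        for t in tokens:
            self.weights[t] = self.weights.get(t, 0.0) - step
        return p

rng = random.Random(0)
model = OnlineLogisticRegression()
true_ctr = {"slot=top": 0.20, "slot=side": 0.02}  # synthetic ground truth
for _ in range(20_000):
    slot = rng.choice(list(true_ctr))
    label = 1 if rng.random() < true_ctr[slot] else 0
    model.update(["bias", slot], label)

p_top = model.predict(["bias", "slot=top"])
p_side = model.predict(["bias", "slot=side"])
```

After the stream, `model.weights` holds one readable number per token, which is the interpretability property the paragraph above describes.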
The famous Facebook recipe layered gradient-boosted decision trees in front of the linear model. A GBDT, an ensemble of small trees each trained to correct the errors of the previous ones, is fit to the raw features, and then each example is represented by which leaf it lands in within each tree, with those leaf indicators fed as input features to a logistic regression trained online. The trees act as an automatic feature-cross factory, because a path through a tree is precisely a conjunction of conditions like iOS user, top slot, and evening hour, and the LR layer stays fresh by updating continuously while the trees retrain more slowly. The modern successors are deep models, and the wide-and-deep idea captures the design space cleanly, pairing a wide linear part that memorizes specific feature crosses known to matter with a deep embedding part that generalizes to combinations never seen before, both feeding one output. In the interview, presenting LR as the baseline, GBDT-plus-LR as the proven middle, and a deep model as the scale play shows the landscape rather than a single fashionable answer, and the progression also maps onto how teams actually migrate, since each stage keeps serving traffic while the next one earns its way in through experiments.
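The leaf-indicator encoding is easier to see on a toy. In the sketch below, two hand-written functions stand in for trained trees, which is an assumption made so the example runs without a boosting library; in a real pipeline the trees would come from a fitted GBDT, and the resulting vector would be the input to the online logistic regression.

```python
# Two toy "trees", each returning the index of the leaf an example lands in.
def tree_0(x: dict) -> int:
    if x["device"] == "ios":
        return 0 if x["slot"] == "top" else 1  # leaf 0: ios AND top slot
    return 2

def tree_1(x: dict) -> int:
    return 0 if x["hour"] >= 18 else 1         # leaf 0: evening

TREES = [(tree_0, 3), (tree_1, 2)]  # (tree function, number of leaves)

def leaf_features(x: dict) -> list:
    """One-hot encode the leaf reached in each tree and concatenate the results."""
    encoded = []
    for tree, num_leaves in TREES:
        one_hot = [0] * num_leaves
        one_hot[tree(x)] = 1
        encoded.extend(one_hot)
    return encoded

evening_ios_top = leaf_features({"device": "ios", "slot": "top", "hour": 20})
# evening_ios_top == [1, 0, 0, 1, 0]
```

Leaf 0 of the first toy tree is exactly the conjunction "iOS user and top slot", so a weight on that indicator plays the role of a hand-built cross.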
Offline evaluation
AUC, the probability that a random positive scores above a random negative, is the reflex metric, and its blind spot must be named, because AUC is invariant to any strictly increasing transformation of the scores. A model whose every prediction is exactly half the true probability has the same AUC as a perfectly calibrated model and would still misprice the auction badly, so AUC alone can certify a model that loses money.
The metric used in Facebook's published practice is normalized entropy, the model's average log loss divided by the log loss of a constant predictor that always outputs the background click rate. Log loss penalizes a prediction by how surprised the model was by the actual outcome, and a nat is its unit when natural logarithms are used. A worked example makes the normalization concrete. With a background click rate of 2 percent, the constant predictor's average log loss is 0.098 nats per impression, which is the entropy of a coin that comes up heads 2 percent of the time. If the model's average log loss on the same data is 0.088, normalized entropy is 0.088 divided by 0.098, about 0.90, meaning the model removes about 10 percent of the uncertainty relative to just knowing the average, and the division is what makes segments with different background rates comparable at all, since raw log loss is dominated by the base rate before the model has done anything. Lower is better, and unlike AUC it punishes miscalibration.
Alongside it, calibration is checked directly by bucketing predictions and comparing predicted to realized rates per bucket, summarized as the ratio of average prediction to realized click rate, which should sit near 1.0 and gets watched per segment in production. Evaluation splits are time-based, training on days 1 to 13 and evaluating on day 14, because in a system this non-stationary a random split flatters the model with knowledge of the future and the flattery never survives contact with live traffic.
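The three measurements can be compared side by side on a tiny hand-built dataset. This is a sketch with illustrative function names and made-up data: 100 impressions predicted at 5 percent with 5 clicks, and 100 predicted at 1 percent with 1 click, scored once as-is and once with every prediction halved.

```python
import math

def baseline_entropy(rate: float) -> float:
    """Average log loss (nats) of always predicting the background rate."""
    return -(rate * math.log(rate) + (1 - rate) * math.log(1 - rate))

def log_loss(labels, preds) -> float:
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, preds)) / len(labels)

def normalized_entropy(labels, preds) -> float:
    return log_loss(labels, preds) / baseline_entropy(sum(labels) / len(labels))

def calibration_ratio(labels, preds) -> float:
    """Average prediction over realized click rate; near 1.0 when calibrated."""
    return sum(preds) / sum(labels)

def auc(labels, preds) -> float:
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1] * 5 + [0] * 95 + [1] * 1 + [0] * 99
calibrated = [0.05] * 100 + [0.01] * 100
halved = [p / 2 for p in calibrated]  # same ordering, wrong scale

same_auc = auc(labels, calibrated) == auc(labels, halved)
ne_calibrated = normalized_entropy(labels, calibrated)
ne_halved = normalized_entropy(labels, halved)
ratio_calibrated = calibration_ratio(labels, calibrated)
ratio_halved = calibration_ratio(labels, halved)
coin_entropy = baseline_entropy(0.02)  # the 0.098 nats figure from the text
```

Halving leaves AUC untouched, pushes the calibration ratio from 1.0 to 0.5, and makes normalized entropy worse, which is the reason the latter two are the metrics that guard the auction.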
The serving architecture
The serving path runs inside the auction's 50 millisecond budget, and walking the budget through stage by stage shows where the pressure concentrates. A request arrives carrying the user and context, targeting and retrieval produce a few hundred eligible candidate ads, and the scorer fetches features and runs the model on all of them in one batch. A reasonable allocation gives 10 milliseconds to retrieval, 10 to feature fetch from the online feature store, which is the low-latency store serving the same feature definitions the trainer sees, including the streaming counters behind the historical-rate features, then 15 to model inference on a few hundred candidates, 10 to the auction itself, and a few in reserve for the tail. Feature fetch is the stage that fights back hardest, since hundreds of candidates multiply every per-ad feature read, and 500 candidates with even a handful of fetched features each means thousands of key lookups per request. Ad-side features are therefore cached aggressively, embeddings for hot ads live in local memory on the scoring hosts, and user-side features are fetched once per request rather than once per candidate, which turns the multiplication back into addition. Every scored impression is logged with its exact serving-time feature values, which is the single most effective defense against training-serving skew, because the trainer then learns from what the model actually saw rather than from a later reconstruction that might quietly differ.
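The budget and the lookup multiplication are simple enough to write down. The stage allocations below are the ones from the paragraph above; the counts of 5 ad-side and 20 user-side features are hypothetical numbers chosen only to show the effect of fetching user features once per request.

```python
TOTAL_BUDGET_MS = 50
stage_ms = {"retrieval": 10, "feature_fetch": 10, "inference": 15, "auction": 10}
reserve_ms = TOTAL_BUDGET_MS - sum(stage_ms.values())  # left over for the tail

candidates = 500
ad_features = 5     # hypothetical per-ad feature reads
user_features = 20  # hypothetical per-user feature reads

# Naive: every candidate re-fetches the user features along with its own.
per_candidate_lookups = candidates * (ad_features + user_features)

# Shared: user features fetched once, only ad features scale with candidates.
shared_lookups = candidates * ad_features + user_features
```

Under these assumed counts the naive plan issues 12,500 lookups per request and the shared plan 2,520, before any caching of hot ad features reduces the remainder further.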
Figure: The auction-time path. Retrieval produces a few hundred eligible ads, the feature store supplies fresh counters, the CTR model scores the batch, and the auction ranks by bid times predicted click rate, all inside roughly 50 milliseconds.
Continuous training and recalibration
Freshness is the second defining pressure. New campaigns launch constantly, budgets exhaust mid-day, and interest shifts with the news cycle, so the model trains continuously on the click stream rather than in nightly batches. Online learning means the model's weights are updated incrementally from a stream of examples instead of being refit from scratch, and in this system the stream is the output of the label join, with impressions held until the attribution window closes, matched against any click, downsampled on the negative side, and fed to the trainer in arrival order. Updated models are snapshotted to a registry, sanity-checked against a recent holdout, and pushed to the serving fleet on the order of hours or faster. The historical-rate counters update even faster than the weights and carry much of the short-term adaptation, which is a deliberate division of labor, since a fresh counter can absorb an ad's first few hours of performance without the weights having to chase every hourly wiggle.
Downsampling creates a debt that must be repaid exactly. If only a fraction w of negatives is kept, the model learns probabilities on a reshaped distribution where clicks look roughly 1/w times more common than they are, and serving those raw numbers would overcharge and misrank systematically. The correction is closed-form:
# w = fraction of negatives kept during downsampling (e.g. 0.1)
# p = probability predicted by the model trained on downsampled data
def recalibrate(p: float, w: float) -> float:
    return p / (p + (1.0 - p) / w)
recalibrate(0.0917, 0.1) # ~0.0100
Walking the numbers through makes the formula less mysterious. With a true click rate of 1 percent and negatives kept at w = 0.1, every positive survives the sampling while only one negative in ten does, so the training data contains one click per roughly 9.9 negatives instead of one per 99, and clicks appear at about 9.2 percent. A well-trained model therefore predicts about 0.0917 on an average impression, and the correction maps it back to 0.0100 by deflating the odds by the same factor the sampling inflated them, since the odds of click to no-click were multiplied by exactly 1/w when nine of every ten negatives were dropped. Teams sometimes add a final calibration layer fitted on a recent unsampled holdout as well, which also absorbs residual miscalibration from the model itself, but the sampling correction is the part with an exact closed-form answer, and forgetting it is a classic failure mode that shows up as every prediction in the system running hot by the same factor.
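The odds argument can be checked numerically. The sketch below redefines the same correction so it runs on its own, derives the click rate the downsampled data exhibits from the true rate, and confirms that the odds were scaled by 1/w and that the correction undoes it; the helper `odds` is added for the illustration.

```python
def odds(p: float) -> float:
    return p / (1.0 - p)

def recalibrate(p: float, w: float) -> float:
    """Map a probability learned on downsampled data back to the original scale."""
    return p / (p + (1.0 - p) / w)

true_rate = 0.01
w = 0.1  # fraction of negatives kept

# Positives all survive; negatives survive with probability w.
sampled_rate = true_rate / (true_rate + (1.0 - true_rate) * w)  # ~0.0917

odds_inflation = odds(sampled_rate) / odds(true_rate)  # ~10, i.e. 1 / w
recovered = recalibrate(sampled_rate, w)               # ~0.0100
```

Because the correction acts on odds, it applies to each individual prediction and not just to the average, which is why a single closed-form function suffices.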
Figure: The continuous training loop. Impressions wait for late clicks at the join, negatives are downsampled, the trainer updates online, predictions are recalibrated exactly for the sampling, and fresh snapshots deploy within hours. The dashed path closes the loop, including the small exploration slice.
Online evaluation and operations
A/B testing is the launch gate, with revenue per thousand impressions, click rate, and advertiser outcomes as the primary readouts, but ad experiments carry a complication that content experiments do not, because treatment and control interact through shared budgets and shared auctions. If the treatment model spends an advertiser's budget more efficiently in the morning, the control group faces a depleted budget in the afternoon, and the comparison is contaminated in a way that more traffic cannot fix. The standard remedies are budget-split designs, where each campaign's budget is divided between arms so neither side can starve the other, and advertiser-level holdouts kept apart for weeks to read ecosystem effects rather than per-impression effects.
The feedback loop deserves explicit handling. The system receives labels only for ads it chose to show, so a model that scores some ad poorly today ensures no data ever arrives to change its mind, and the error becomes self-sealing. A small exploration budget, on the order of 1 percent of traffic where ranking is randomized or scores are perturbed, keeps the label distribution wide enough to learn from, and that slice doubles as the clean data for measuring position effects and calibration.
Monitoring follows the money. The ratio of predicted to realized clicks is tracked per segment, across placement, device, country, and advertiser vertical, and alarms fire on sustained departures from 1.0, because segment-level miscalibration is invisible in the global average while still mispricing whole markets. The shape of a calibration incident is worth knowing in advance, since it is what the on-call engineer actually sees. A counter pipeline stalls, the historical-rate features freeze at stale values, and predictions for the newest campaigns drift high while the global revenue dashboard still looks normal. The first alarm that fires is the predicted-to-realized ratio for the fresh-ad cohort, and without that alarm the first humans to notice are advertisers in one vertical quietly paying more per click than they did yesterday. Feature distributions are watched for drift against training snapshots, and the operational drill for a sick model is rollback first and diagnosis second, restoring a recent registry snapshot within minutes, because with continuous training a single corrupted hour of labels propagates into the weights quickly and the damage compounds while anyone is still reading logs.
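Per-segment calibration tracking reduces to two sums per segment. The sketch below works on one window of logged rows and flags segments outside a tolerance band; the row layout, the function names, and the 10 percent tolerance are assumptions, and a real monitor would require the departure to persist across several windows before paging anyone.

```python
from collections import defaultdict

def segment_calibration(rows):
    """rows: iterable of (segment, predicted_prob, clicked). Returns ratio per segment."""
    predicted = defaultdict(float)
    clicks = defaultdict(int)
    for segment, prob, clicked in rows:
        predicted[segment] += prob
        clicks[segment] += clicked
    # Segments with no clicks yet are skipped to avoid dividing by zero.
    return {s: predicted[s] / clicks[s] for s in predicted if clicks[s] > 0}

def out_of_band(ratios, tolerance=0.1):
    return sorted(s for s, r in ratios.items() if abs(r - 1.0) > tolerance)

# Synthetic window: "feed" runs hot, "story" runs cold, the total looks fine.
rows = ([("feed", 0.014, 1)] * 10 + [("feed", 0.014, 0)] * 990 +
        [("story", 0.006, 1)] * 10 + [("story", 0.006, 0)] * 990)

per_segment = segment_calibration(rows)
global_ratio = sum(p for _, p, _ in rows) / sum(c for _, _, c in rows)
flagged = out_of_band(per_segment)
```

In this synthetic window the global ratio is 1.0 while "feed" sits near 1.4 and "story" near 0.6, so only the per-segment view raises anything.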
Follow-up questions
- Why does calibration matter more here than AUC? The auction multiplies the predicted probability by a bid, so the absolute value sets prices and rankings across advertisers, and an error in scale is an error in money. AUC is unchanged by rescaling all scores, which means a model can post excellent AUC and still misprice every auction it touches.
- How is a brand-new ad scored with no history? Its content, campaign, and advertiser features carry the first prediction, with the historical-rate features falling back to advertiser-level and category-level priors instead of reading zero. The streaming counters then sharpen the estimate within hours of the first impressions arriving, so the cold period is short by design.
- Why downsample negatives at all? At a 1 percent click rate, negatives are 99 percent of the data and most of them repeat what the model already knows. Keeping a tenth cuts training cost nearly tenfold with minimal quality loss, provided the predictions are recalibrated afterward with the exact closed-form correction, which is the non-negotiable half of the bargain.
- What degrades if training lags by a day? The newest inventory suffers first, because fresh campaigns and shifted budgets are scored by a model that has never seen them, and revenue on new ads sags before anything else moves. The historical-rate counters cushion the lag by carrying recent performance even while the weights are stale, which is one reason they earn their complexity.
- How would you detect miscalibration in production? Track predicted versus realized click rate continuously, per segment rather than only globally, using the exploration slice as the least biased sample of behavior. A global ratio of 1.0 can hide a segment running at 1.4 that is quietly overcharging one market while another runs cold and underdelivers.
- Why not rank by predicted click rate alone? The platform is selling outcomes to bidders, and bid times predicted click rate (pCTR) is the expected revenue of an impression, so ranking by pCTR alone would hand slots to clicky low-value ads while ignoring what advertisers are actually willing to pay for the outcome.
References
- He et al., Practical Lessons from Predicting Clicks on Ads at Facebook (ADKDD 2014), the GBDT-plus-LR recipe, normalized entropy, and downsampling recalibration.
- McMahan et al., Ad Click Prediction: a View from the Trenches (KDD 2013), Google's production lessons on online learning with FTRL and calibration.
- Cheng et al., Wide & Deep Learning for Recommender Systems (2016), the memorization-plus-generalization architecture.
- Aminian and Xu, Machine Learning System Design Interview (2023), chapter on ad click prediction.