Design a personalized news feed

When a person opens a social app, the system has a few thousand posts it could legitimately show them, drawn from friends, pages they follow, and groups they belong to. The infrastructure that gathers those candidates, the fan-out of each new post into the inventories of everyone who follows its author, is its own design question. This article assumes that infrastructure exists and takes on the layer above it, the machine learning system that decides what to show first. That ordering decision is the product. The same two thousand posts feel like a feed that knows you when ordered well and like noise when ordered badly, and the entire difference between those two experiences is a ranking model and the apparatus around it. The person scrolling never sees any of the machinery, which is exactly why the machinery has to be careful: the only signal they send back when ranking degrades is that they open the app a little less often, and by the time that shows up in a dashboard the damage has been compounding for weeks.

The goal deserves a careful statement, because the sloppy version leads the whole design astray. The aim is to order the eligible posts so the person finds the most value in the session, and since value cannot be measured directly, it gets operationalized through predicted interactions, namely the probability that the person will like, comment on, share, hide, or spend time reading each post. That proxy is genuinely imperfect, and saying so plainly in an interview costs nothing while signaling a great deal. People interact with outrage they do not value and silently appreciate things they never click, which is why the design below keeps humans in the loop on how predictions combine into a score and why the online evaluation reaches beyond interaction counts. The interesting parts of the problem are the multi-stage funnel that fits heavy models inside a tight budget, the multi-task model at the heart of it, and the stubborn gap between offline and online truth.

Scope and requirements

Assume hundreds of millions of daily users, an average inventory of about 2,000 eligible posts per feed load, and a server-side budget of roughly 200 milliseconds to produce a ranked feed, with the machine learning portion holding to about half of that so network transfer, page assembly, and rendering keep their share. The system must incorporate signals from the current session, because the feed a person sees on their third refresh should account for what they already saw and skipped on the first two, and a feed that repeats itself within a session reads as broken even when every individual ranking decision was defensible. The design below covers the ranking funnel, the main ranking model, the value computation that turns predictions into an ordering, the integration of integrity signals, and the training pipeline. Candidate gathering, the internals of the content classifiers, and ad insertion stay out, with ads running as a separate auction whose results are interleaved into the organic ranking afterward.

Framing the problem as a machine learning task

The naive framing is a single binary model predicting engagement or not, and its weakness is what defines the real design, because interactions differ wildly in what they say about value. A comment is rarer and far more meaningful than a passive like. A share extends the post's reach to a whole new audience, which the platform values differently again. A hide is a strong negative, and a single blended engagement model would drown it out entirely, since hides are perhaps a tenth as common as likes and a model optimizing one mixed target learns mostly from whatever signal is most frequent. The production framing is therefore multi-task prediction, meaning one model with a shared network and several separate output heads, where a head is a small final layer responsible for one prediction. The heads predict P(like), P(comment), P(share), P(hide), and expected dwell time, where dwell time is how long the person will spend on the post before scrolling on. The input is a (viewer, post, context) triple, the output is a vector of predictions, and a separate value model collapses that vector into a single ranking score. Sharing the network matters statistically and not just operationally, because comments are sparse, and a comment head trained alone would starve, while one sharing a trunk with the like head inherits everything the dense like signal teaches about viewers and posts and only has to learn what makes commenting different. The alternative of five entirely separate models loses on exactly that ground, and it also multiplies the serving cost and the number of pipelines that can quietly drift apart.

Data and labels

Training data is the stream of logged impressions joined with what happened next. Each (viewer, post) impression becomes one example carrying a label per head: like or not, comment or not, share or not, hide or not, and dwell seconds. All labels are collected within an attribution window chosen by measuring how long interactions actually take to arrive and cutting the window where the curve flattens. The labels are implicit, arising from behavior rather than human annotation, which makes them plentiful and biased at the same time. The dominant bias is position bias, the tendency of posts shown at the top of the feed to receive more of every interaction simply because they were seen, so a model trained naively learns that whatever the old system ranked first was good and partly recycles the incumbent's opinions back into itself. Two standard mitigations work together. The display position is logged and included as a training-time feature that gets fixed to a constant at serving time, which lets the model attribute some of the engagement to placement rather than to content, and a small randomized slice of traffic, around 1 percent served with lightly shuffled order, provides nearly position-unbiased data that evaluation prefers. Hides and see-less-like-this responses are logged with particular care, because they are the rare explicit negatives in a sea of implicit signals, and losing even a fraction of them to a logging mistake costs the hide head more than losing ten times as many likes would cost the like head.
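The position-as-feature mitigation can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: the function names, the `display_position` key, and the choice of 0 as the serving-time constant are all assumptions made for the example.

```python
from typing import Dict

# Hypothetical constant: the position value substituted for every candidate at serving time.
SERVING_POSITION = 0


def training_features(base: Dict[str, float], logged_position: int) -> Dict[str, float]:
    """Build a training example: content features plus the position the post was shown at."""
    features = dict(base)  # copy so the caller's dict is not mutated
    features["display_position"] = float(logged_position)
    return features


def serving_features(base: Dict[str, float]) -> Dict[str, float]:
    """Build a serving example: same features, with position pinned to one constant."""
    features = dict(base)
    features["display_position"] = float(SERVING_POSITION)
    return features
```

Because every candidate gets the same position value at serving time, the position input cannot change the relative order of posts, and whatever the model learned about placement stays out of the ranking.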

Features and representation

Four feature families carry most of the weight. Actor-viewer affinity captures the relationship between the viewer and the post's author, namely how often this viewer interacts with this author, how recently, and in which ways, computed over rolling windows so a friendship that cooled two years ago does not rank like an active one. Content-type fit captures format taste, because a person who watches videos to the end but skims past text posts should see that pattern reflected, and the viewer's historical engagement rate per content type expresses it directly. Recency enters with decay rather than as a sort order, and the reasoning deserves spelling out, since a feed is partly news yet a six-hour-old post from a close friend can still beat a two-minute-old post from a distant acquaintance, which means freshness has to be a feature the model weighs against affinity rather than a rule that overrides it. Topic match uses embeddings, where an embedding is a learned list of numbers representing a piece of content so that similar content ends up with nearby vectors; the post's text maps to a vector, the viewer's interest profile is an aggregate of the vectors of posts they engaged with, and the similarity between the two becomes a feature. Alongside these four families, session context flows in through a short-term store that remembers what the person already saw this session, what they skipped, and what they lingered on minutes ago, and that store is the difference between a feed that adapts mid-session and one that greets every refresh like a stranger.
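Two of these features are simple enough to sketch directly. The snippet below assumes an exponential decay with a 6-hour half-life for recency and cosine similarity for topic match. Both are illustrative choices, not something the design above prescribes, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def recency_decay(age_hours: float, half_life_hours: float = 6.0) -> float:
    """Freshness feature: 1.0 for a brand-new post, 0.5 after one half-life (assumed 6h)."""
    return 0.5 ** (age_hours / half_life_hours)


def interest_profile(engaged_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Viewer profile as the element-wise mean of embeddings of posts they engaged with."""
    count = len(engaged_vectors)
    return [sum(column) / count for column in zip(*engaged_vectors)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Topic-match feature: cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no topic signal
    return dot / (norm_a * norm_b)
```

Both values go to the model as ordinary inputs, so the model decides how much a fresh post from a distant acquaintance should count against an older post from a close friend.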

All of this is served from a feature store, the shared system that computes and serves feature values consistently for training and serving, and consistency is the entire point of the abstraction. Training-serving skew is the failure mode in which the offline pipeline and the online path compute what is nominally the same feature differently. Suppose offline affinity gets computed over 90 days of history while the online store keeps only 30, so the model trains against one distribution and serves against another. Nothing errors, nothing pages, and quality simply sags, which is what makes skew the most insidious failure in this family of systems. The strongest guard is to log the exact feature values used at serving time and to train only on those logs, because skew cannot arise from recomputation if nothing is ever recomputed.
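The log-and-train guard might look like the following sketch. The in-memory list standing in for a log, the JSON record layout, and the function names are assumptions for illustration. A real system would write to a durable log and join labels at much larger scale.

```python
import json
from typing import Dict, List, Tuple

Features = Dict[str, float]
Labels = Dict[str, float]


def log_served_features(log: List[str], impression_id: str, features: Features) -> None:
    """Append the exact feature values the ranker saw, keyed by impression."""
    record = {"impression_id": impression_id, "features": features}
    log.append(json.dumps(record, sort_keys=True))


def build_training_examples(log: List[str], labels: Dict[str, Labels]) -> List[Tuple[Features, Labels]]:
    """Join logged features with later-arriving labels; no feature is ever recomputed."""
    examples = []
    for line in log:
        record = json.loads(line)
        label = labels.get(record["impression_id"])
        if label is not None:  # skip impressions that have no labels yet
            examples.append((record["features"], label))
    return examples
```

The training set is built only from what was logged at serving time, so a 90-day versus 30-day mismatch has nowhere to enter.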

Model architecture

The funnel has two ranking stages because the heavy model cannot afford 2,000 candidates, and the arithmetic behind that sentence is simple enough to do aloud, since running the full model over 2,000 posts costs four times the compute of running it over 500 and spends most of the extra on posts that never had a chance. The first-pass ranker is a cheap model that scores the full inventory in a few milliseconds using only precomputed features, often a distilled version of the main ranker, meaning a small model trained to imitate the big model's outputs rather than the raw labels, or a linear model where even distillation is too heavy. Its job is recall rather than precision, because it only needs to avoid discarding posts the main ranker would have loved, and a mistake at this stage is invisible to every stage after it. It keeps the top 500 or so. The main ranker is the multi-task network described above, a shared trunk of a few hidden layers over the concatenated features feeding the five heads, sized so that scoring around 500 candidates fits the latency budget on accelerated hardware.
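The two-stage shape reduces to a top-k selection followed by a sort. The sketch below is a minimal version in which `cheap_score` and `heavy_score` are hypothetical callables standing in for the first-pass ranker and the main ranker's final score.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def first_pass_prune(candidates: Sequence[T], cheap_score: Callable[[T], float], keep: int = 500) -> List[T]:
    """Keep the `keep` highest-scoring candidates according to the cheap model."""
    return heapq.nlargest(keep, candidates, key=cheap_score)


def rank_feed(
    candidates: Sequence[T],
    cheap_score: Callable[[T], float],
    heavy_score: Callable[[T], float],
    keep: int = 500,
) -> List[T]:
    """Two-stage funnel: prune with the cheap model, then order survivors with the heavy one."""
    survivors = first_pass_prune(candidates, cheap_score, keep)
    # The heavy model is evaluated once per survivor, never on the pruned candidates.
    return sorted(survivors, key=heavy_score, reverse=True)
```

With 2,000 candidates and `keep=500`, the heavy scorer runs 500 times instead of 2,000, which is the fourfold saving described above.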

The value model sits on top, and it is deliberately not learned. Each post's score is a weighted combination of the head predictions, for example P(like) times 1 plus P(comment) times 15 plus P(share) times 30 minus P(hide) times 100 plus a small dwell term. Walking one post through the arithmetic shows the intent behind the weights. Suppose the heads predict a like probability of 0.10, a comment probability of 0.02, a share probability of 0.01, and a hide probability of 0.005. The weighted sum works out to 0.10 plus 0.30 plus 0.30 minus 0.50, which leaves a score of 0.20 before the dwell term, and the instructive part is that a hide probability of half a percent has outweighed the entire like term, which is exactly what the weights were chosen to do. The weights themselves are a product decision reviewed by humans rather than parameters fit by gradient descent, because they encode what the company means by value, with comments worth more than likes and hides treated as small vetoes. Learning them from engagement data would quietly re-derive engagement maximization with extra steps, while keeping them human-owned means the objective changes through deliberate review when the product philosophy changes rather than drifting wherever the gradient points.

Figure: the multi-task ranker. One shared network feeds five prediction heads, and the value model combines them with human-reviewed weights, here P(like) × 1 + P(comment) × 15 + P(share) × 30 - P(hide) × 100 plus a dwell term, while integrity scores enter as demotions (drawn as dashed lines in the diagram).
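The worked example above can be reproduced in a few lines. The weights are the illustrative ones from the text. The dwell weight is left as a parameter defaulting to zero because the text gives no number for it, and the dictionary keys are hypothetical names.

```python
from typing import Dict

# Illustrative, human-owned weights taken from the worked example in the text.
VALUE_WEIGHTS: Dict[str, float] = {
    "like": 1.0,
    "comment": 15.0,
    "share": 30.0,
    "hide": -100.0,
}


def value_score(predictions: Dict[str, float], dwell_weight: float = 0.0) -> float:
    """Weighted sum of head predictions; dwell_weight is a placeholder, not a sourced value."""
    score = sum(weight * predictions[head] for head, weight in VALUE_WEIGHTS.items())
    return score + dwell_weight * predictions.get("dwell_seconds", 0.0)


example = {"like": 0.10, "comment": 0.02, "share": 0.01, "hide": 0.005}
print(round(value_score(example), 4))  # 0.10 + 0.30 + 0.30 - 0.50
```

Since the weights live in a plain dictionary rather than inside the model, changing what the product means by value is a reviewed change to four numbers and needs no retraining.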

Integrity integration

Separately trained classifiers score every post for clickbait, engagement bait, likely misinformation through proxies such as matches against fact-check databases, and borderline content that approaches policy lines without crossing them. These classifiers are trained on human-reviewed labels rather than engagement, because engagement is precisely the signal that fails here, given that the content they target often engages exceptionally well. Posts that actually violate policy are removed upstream and never reach ranking at all. Everything else enters the value model as demotions, which are multiplicative reductions applied to the score, rather than as hard removals, and the choice is deliberate, because a demotion strength is a number that can be reviewed, measured, and adjusted per classifier, while silent removal of non-violating content is a heavier policy with no gradations to tune. Transparency requirements push in the same direction, since a why-am-I-seeing-this feature has to trace a ranking back to its contributing causes, such as following the page or the post being popular among friends, and an architecture of explicit heads, weights, and demotions can produce that explanation, where a single end-to-end engagement score could offer nothing beyond the score itself.
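One way to express demotions as tunable numbers is sketched below. The specific form, multiplying the score by (1 - strength × classifier probability) for each classifier, is an assumption chosen for illustration, as are the names. The text only says demotions are multiplicative reductions.

```python
from typing import Dict


def apply_demotions(
    score: float,
    classifier_probs: Dict[str, float],
    strengths: Dict[str, float],
) -> float:
    """Scale a positive value score down by one factor per integrity classifier.

    Each strength is assumed to lie in [0, 1]; a classifier with no configured
    strength contributes no demotion.
    """
    if score <= 0.0:
        # Multiplying a negative score by a factor below 1 would raise it, so leave it alone.
        return score
    multiplier = 1.0
    for name, probability in classifier_probs.items():
        strength = strengths.get(name, 0.0)
        multiplier *= 1.0 - strength * probability
    return score * multiplier
```

For example, a score of 0.20 with a clickbait probability of 0.5 and a strength of 0.8 is multiplied by 0.6. The per-classifier factors are also what a why-am-I-seeing-this explanation would need to report.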

Offline evaluation

Each head is evaluated separately on a time-split holdout, training on two weeks of impressions and validating on the following days, with AUC and normalized entropy reported per head. AUC is the probability that a randomly chosen positive example outranks a randomly chosen negative one. Normalized entropy is the model's log loss divided by the log loss of a predictor that always answers the base rate, where log loss measures how surprised the model was by what actually happened, and lower is better. The like head makes a concrete example. At a 5 percent base like rate, the constant predictor's log loss comes out near 0.199, so a model achieving 0.179 has a normalized entropy of roughly 0.90, meaning it has captured about 10 percent of the uncertainty beyond what the base rate already gives away. The same model might post an AUC of 0.92 on likes and only 0.78 on hides, and that per-head breakdown is the diagnostic view that matters, because the value model amplifies the hide head a hundredfold, so a weak hide head damages feeds far out of proportion to its share of the training data. Calibration per head gets checked alongside, by bucketing predictions and comparing each bucket's average prediction against its realized rate, because the value model's weighted sum silently assumes the probabilities are comparable across heads, and a like head predicting double the true rate would distort every score it touches.
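The two per-head metrics can be computed directly from their definitions. The sketch below uses a quadratic-time pairwise AUC for clarity, where production code would use a sorted or library implementation. The function names are hypothetical, and each metric assumes the labels contain both positives and negatives.

```python
import math
from typing import Sequence

EPS = 1e-12  # clip predictions away from 0 and 1 so log() stays finite


def log_loss(labels: Sequence[int], preds: Sequence[float]) -> float:
    """Mean negative log-likelihood of binary labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def base_rate_log_loss(base_rate: float) -> float:
    """Log loss of a predictor that always answers the base rate (0 < base_rate < 1)."""
    return -(base_rate * math.log(base_rate) + (1.0 - base_rate) * math.log(1.0 - base_rate))


def normalized_entropy(labels: Sequence[int], preds: Sequence[float]) -> float:
    """Model log loss divided by the base-rate predictor's log loss; lower is better."""
    base_rate = sum(labels) / len(labels)
    return log_loss(labels, preds) / base_rate_log_loss(base_rate)


def auc(labels: Sequence[int], preds: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [p for y, p in zip(labels, preds) if y == 1]
    negatives = [p for y, p in zip(labels, preds) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Here `base_rate_log_loss(0.05)` evaluates to roughly 0.1985, and 0.179 divided by that is roughly 0.90, matching the like-head example.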

The serving architecture

At request time the funnel runs end to end, and walking the budget stage by stage is worth doing aloud. Roughly 2,000 candidates arrive from inventory. The first-pass ranker prunes them to about 500 in a handful of milliseconds, which it can only do because it restricts itself to features that were precomputed before the request arrived. Feature assembly takes the next stretch, around 25 milliseconds dominated by batched reads from the feature store and the session store, and the batching is load-bearing, because 500 sequential lookups at even a millisecond apiece would take 500 milliseconds, more than double the entire 200 millisecond envelope. The main ranker then scores all 500 candidates across its heads in around 40 milliseconds on accelerated hardware, the value model applies its weights and the integrity demotions in well under a millisecond because it is plain arithmetic, and the final feed rules run last, enforcing author diversity so no more than two consecutive posts come from the same author, mixing formats, and suppressing posts the person already saw. The ML path closes at around 80 to 100 milliseconds inside the 200 millisecond envelope, and the headroom is deliberate, since the budget is written against tail latency rather than the average. The two-stage shape is the load-bearing decision of this section, and its price is a managed recall risk, which gets monitored directly by measuring how often the main ranker's top posts on the randomized slice would have been pruned by the first pass, so the cheap model's blind spots show up as a tracked number rather than as a slow, unexplained quality drift.

Figure: the ranking funnel. A cheap first pass prunes 2,000 candidates to 500, the feature store feeds the multi-task main ranker, the value model combines head predictions with integrity demotions (drawn as dashed lines in the diagram), and diversity rules shape the final feed.
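Two of the final feed rules, seen-post suppression and the limit of two consecutive posts per author, can be sketched as a single greedy pass over the ranked list. The greedy deferral strategy and the (post_id, author_id) tuple layout are assumptions for illustration, and format mixing is left out.

```python
from typing import List, Set, Tuple

Post = Tuple[str, str]  # (post_id, author_id), a hypothetical minimal shape


def apply_feed_rules(ranked: List[Post], seen: Set[str], max_run: int = 2) -> List[Post]:
    """Drop already-seen posts and cap consecutive posts from one author at max_run."""
    feed: List[Post] = []
    deferred: List[Post] = []  # posts waiting for a different author to break the run

    def would_exceed_run(author: str) -> bool:
        if len(feed) < max_run:
            return False
        return all(post[1] == author for post in feed[-max_run:])

    def place_deferred() -> None:
        # Retry deferred posts in rank order until none of them can be placed.
        progressed = True
        while progressed:
            progressed = False
            for index, post in enumerate(deferred):
                if not would_exceed_run(post[1]):
                    feed.append(deferred.pop(index))
                    progressed = True
                    break

    for post in ranked:
        if post[0] in seen:
            continue  # suppress posts the person already saw this session
        if would_exceed_run(post[1]):
            deferred.append(post)
            continue
        feed.append(post)
        place_deferred()
    # Posts still deferred here could only be shown by breaking the rule, so they are dropped.
    return feed
```

With a ranked list of three posts by one author followed by a post from another, the third post is deferred and reappears after the other author's post.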

Online evaluation and operations

Offline metrics gate iteration, but launches are decided online, and the two regularly disagree for reasons worth naming individually. A model can win offline by predicting clicks better and lose online because what it surfaces more of is clickbait-adjacent content that drives short-term interactions and long-term fatigue. Offline evaluation also scores against logged data shaped by the old policy, so it systematically underestimates how behavior shifts once the ranking itself shifts. The online metrics that decide launches are therefore broader and slower. Sessions per user and time in app get read carefully rather than maximized blindly, meaningful interaction rates such as comments and reshares among real connections carry more launch weight than raw clicks, hide and report rates serve as guardrails that can veto a launch on their own, and direct user surveys asking whether a post was worth the person's time remain the only signal that approximates value without passing through engagement at all. Experiments run for weeks because feed changes have lagged effects, and the deepest changes are read against long-term holdouts, meaning groups of users kept on the old experience for months so that slow drifts in satisfaction stay measurable against something.

Operationally the models retrain daily or continuously, because affinity and content distributions move fast enough that a week-old model is measurably stale. A new model is validated by shadow scoring, which means running it on live traffic and logging its scores without serving them, and that single practice catches feature-pipeline mismatches and score-distribution shifts before any user sees a ranking from the new weights. Promotion then proceeds through a small percentage rollout with automated rollback triggers wired to the guardrail metrics. Monitoring covers inputs and outputs continuously, with live feature distributions compared against training-time snapshots, per-head prediction means compared against realized rates as a standing calibration check, and score distributions tracked per content type, where a sudden shift usually means an upstream pipeline change nobody announced. The shape of a data-path incident from the on-call seat justifies all of this instrumentation. The session store degrades quietly, seen-post suppression stops working, and no model alarm fires because the model is fine; the person scrolling simply meets the same posts twice, experiences the product as broken in a way they cannot quite articulate, and refreshes less. The alarm that catches it early is the one watching the fraction of served posts already seen this session, which is why the data path earns the same scrutiny as the weights, and why most quality regressions in systems like this arrive through the pipes rather than the model.
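Two of the monitors named above are small enough to sketch: the fraction of served posts already seen this session, and a per-head comparison of mean prediction against realized rate. The function names and the 2 percent alert threshold are hypothetical. A real threshold would be set from the metric's normal range.

```python
from typing import Optional, Sequence, Set


def repeat_fraction(served_post_ids: Sequence[str], seen_this_session: Set[str]) -> float:
    """Fraction of served posts the person had already seen earlier in the session."""
    if not served_post_ids:
        return 0.0
    repeats = sum(1 for post_id in served_post_ids if post_id in seen_this_session)
    return repeats / len(served_post_ids)


def calibration_ratio(preds: Sequence[float], labels: Sequence[int]) -> Optional[float]:
    """Mean predicted probability divided by realized rate; 1.0 means well calibrated."""
    positives = sum(labels)
    if positives == 0:
        return None  # the ratio is undefined when no positives were observed
    mean_prediction = sum(preds) / len(preds)
    realized_rate = positives / len(labels)
    return mean_prediction / realized_rate


def should_alert(fraction: float, threshold: float = 0.02) -> bool:
    """Fire when the repeat fraction exceeds a threshold (0.02 is an assumed placeholder)."""
    return fraction > threshold
```

In the session-store incident described above, `repeat_fraction` rises as soon as suppression stops working, while every model-side metric stays flat.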

Follow-up questions

Why multi-task instead of one engagement model? Interactions carry different meanings and wildly different base rates, and one binary target collapses them into mush, with rare strong signals like hides drowned by frequent weak ones. Separate heads keep each signal legible, share statistical strength through the common trunk, and hand the value model explicit levers to weigh, which is also what makes rankings explainable afterward.
Why are the value weights not learned? They define what the product means by value, which is a judgment rather than a statistic. Learning them from engagement data would re-derive engagement maximization with extra steps, while keeping them human-owned means the objective stays inspectable and changes through deliberate review rather than silent drift.
How do you fight position bias? Log the display position and train with it as a feature that is fixed to a constant at serving time, so the model can separate placement effects from content effects, and keep a small randomized traffic slice as a nearly unbiased evaluation set. Ignoring the bias trains the model to agree with the previous ranker rather than with users.
What is training-serving skew and one concrete guard? Skew arises when features are computed differently offline and online, such as affinity measured over 90 days in training but 30 days in serving, and it degrades quality without producing a single visible error. The concrete guard is to log the feature values used at serving time and train only on those logs, so no recomputation exists to disagree.
Why can a model win offline and lose online? Offline evaluation scores against data generated by the old policy and counts proxies rather than value. A model that predicts clicks better can shift the feed toward content that erodes sessions and trust over weeks, which only guardrail metrics, surveys, and long-running experiments reveal, and which no offline number anticipates.
Why demote integrity-flagged content instead of removing it? Content that violates policy is removed upstream before ranking ever sees it. For the borderline remainder, demotion strengths are tunable, measurable numbers compatible with transparency and appeals, while silently removing non-violating content is a heavier policy with no gradations and no explanation to offer the author.

References

Lada, Wang, and Yan, How machine learning powers Facebook's News Feed ranking algorithm (Meta Engineering, 2021), the multi-stage, multi-task production design.
Zhao et al., Recommending What Video to Watch Next: A Multitask Ranking System (RecSys 2019), multi-task heads with a combined score and position-bias handling.
Joachims et al., Accurately Interpreting Clickthrough Data as Implicit Feedback (SIGIR 2005), the foundational study of position bias in logged clicks.
Aminian and Xu, Machine Learning System Design Interview (2023), chapter on news feed ranking.