Design a harmful content detection system

Every large social platform runs an integrity system, which is the machinery that detects posts violating policy, whether that means graphic violence, hate speech, self-harm content, sexual material, or spam, and acts on them before they reach people. It is among the hardest ML system designs an interviewer can pose, and the difficulty is structural rather than algorithmic. The users producing the worst content actively adapt to evade detection, so the data distribution fights back in a way it never does in search or recommendation, where users are at worst indifferent to the model's success. Violations are a sliver of total volume, so the class imbalance is severe enough to make accuracy a meaningless number. Errors harm people in both directions, since a missed violation can mean a person sees the worst image of their life, while a false removal silences legitimate speech and teaches the silenced user that the platform is arbitrary. The policies themselves differ by category, language, and jurisdiction, so a single threshold tuned on a single curve cannot possibly be the answer. No single model solves this either. The design that works is a layered system with humans inside it, and the layering is where the interesting decisions live.

The hard parts, and the places this walkthrough spends its time, are the operating point machinery that converts model scores into tiered actions, the human review loop that belongs to the architecture rather than to an appendix, and the latency split between what must run synchronously at post time and what can follow minutes later. Getting those three right matters more than any single model choice, because a mediocre classifier inside well-designed machinery protects users better than an excellent classifier wired to one naive threshold.

Scope and requirements

The system watches every new post, spanning text, images, video, and their combinations, on a platform handling on the order of one billion posts per day. Averaged across a day that is roughly 12,000 posts per second, and peaks run several times higher, so every synchronous component must hold up at something like 50,000 posts per second without queueing into the user's publish action. The system must score content against a set of policy categories, take automated action where confidence justifies it, route uncertain cases to human reviewers, and support appeals, because wrongly actioned users need a working path back. Policy categories carry very different stakes, since imminent-harm categories such as credible threats and self-harm content need the fastest path and the highest recall the system can buy, while spam tolerates meaningful error in both directions and mostly needs to stay cheap. A reasonable violation base rate to carry through the design is 0.1 percent of posts, and that number rewards a pause, because 0.1 percent of a billion is still a million violating posts arriving every single day. No workforce reviews a million posts daily, so machines must carry nearly all of the volume, and human judgment has to be spent exactly where it changes outcomes. In scope are detection, action, and the review loop. Out of scope here are policy authoring itself, advertiser content, and live-stream moderation, which shares the models but runs under its own latency regime.
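The load figures above can be re-derived in a few lines. This is only a back-of-envelope sketch: the post volume and base rate come from the text, while `peak_multiplier` is an assumed stand-in for "several times higher".

```python
SECONDS_PER_DAY = 24 * 60 * 60

posts_per_day = 1_000_000_000   # from the text
violation_rate = 0.001          # 0.1 percent base rate, from the text
peak_multiplier = 4             # assumption standing in for "several times higher"

avg_pps = posts_per_day / SECONDS_PER_DAY
peak_pps = avg_pps * peak_multiplier
violations_per_day = posts_per_day * violation_rate

print(f"average load: {avg_pps:,.0f} posts/s")
print(f"assumed peak: {peak_pps:,.0f} posts/s")
print(f"violations per day: {violations_per_day:,.0f}")
```

Rounding the assumed peak up to 50,000 posts per second leaves a small safety margin on the synchronous tier.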

Framing the problem as a machine learning task

The core framing is multi-label classification per policy category over multi-modal inputs. Multi-label means each post receives an independent score for every category rather than one winner-take-all class, and the reason is operational rather than aesthetic, because a single post can be hateful and violent and spammy at once, the downstream actions differ per category, and the router that decides those actions needs each probability separately. The alternative framing as multi-class classification, where the model elects a single best label, loses on exactly those posts, since whichever label wins suppresses the evidence for the others and leaves the action router blind to a second violation it would have handled differently. Multi-modal means the inputs span text, image, video, audio, and context such as the posting account's history, since modern violations live in the interaction between modalities, the innocuous photo with a sinister caption, or hateful text rendered inside an image precisely so the text models never read it.
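The difference between the two framings shows up on a single post that violates two policies at once. The sketch below uses made-up logits and hypothetical category names, and compares independent sigmoid heads (multi-label) with a softmax over the same logits (multi-class).

```python
import math

CATEGORIES = ["hate", "violence", "spam"]  # hypothetical category names

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a post that is both hateful and violent.
logits = [3.0, 2.5, -4.0]

# Multi-label: each category is scored independently.
multi_label = {c: sigmoid(z) for c, z in zip(CATEGORIES, logits)}
# Multi-class: categories compete for one unit of probability mass.
multi_class = dict(zip(CATEGORIES, softmax(logits)))

for c in CATEGORIES:
    print(f"{c:9s} multi-label={multi_label[c]:.3f} multi-class={multi_class[c]:.3f}")
```

Under the softmax the violence score is pushed well below one half even though its logit is strongly positive, which is the suppression the paragraph describes.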

Combining modalities raises the fusion question, and both options deserve plain definitions because the choice shapes operations for years. Early fusion merges the raw learned representations, feeding text and image embeddings into one model that scores the post jointly, and it is the only approach that can catch meaning living between the modalities, because the model sees both signals before either has been collapsed into a verdict. Late fusion runs a separate model per modality and combines their scores at the end, which is weaker on cross-modal meaning but far easier to operate, because when a late-fusion system misfires the per-modality scores show which component erred, and one model can be retrained without disturbing the others. A defensible production answer uses late fusion as the workhorse precisely because its failures can be localized to a component, then adds an early-fusion multimodal model where cross-modal violations concentrate, in the meme-shaped content where image and caption conspire. The remaining alternative, one end-to-end model for everything, is rejected on operability grounds rather than accuracy grounds, since a monolith that drifts hands the team a single giant knob and no way to learn which modality went wrong.
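A minimal late-fusion sketch makes both the operability benefit and the cross-modal blind spot concrete. The text does not say how per-modality scores are combined, so the noisy-OR rule here is an assumption, and the modality names and scores are hypothetical.

```python
def late_fusion(scores_by_modality):
    """Combine per-modality violation probabilities with a noisy-OR.

    Returns the fused score and the modality that contributed most,
    which is what makes a misfire traceable to one component.
    """
    p_all_clean = 1.0
    for p in scores_by_modality.values():
        p_all_clean *= 1.0 - p
    fused = 1.0 - p_all_clean
    top_modality = max(scores_by_modality, key=scores_by_modality.get)
    return fused, top_modality

# One modality carries the violation: fusion fires and names the component.
fused, culprit = late_fusion({"text": 0.10, "image": 0.85, "audio": 0.05})
print(f"fused={fused:.3f}, driven by {culprit}")

# Meme-shaped case: neither modality looks bad alone, so fusion stays low.
meme_fused, _ = late_fusion({"text": 0.20, "image": 0.15})
print(f"meme fused={meme_fused:.3f}")
```

The second call is the case that motivates the added early-fusion model: no rule over collapsed per-modality scores can recover meaning that only exists in the combination.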

Data and labels

Gold labels come from human review decisions, meaning trained reviewers applying written policy to queued content and producing the per-category judgments that train and evaluate every model in the system. The supply is expensive, rate-limited, and noisy. Policy lines are genuinely contested, agreement between reviewers on categories like hate speech sits well below perfect, and a label that one reviewer assigns and another withholds is less a fact about the post than a sample from a distribution of defensible judgments. The pipeline therefore measures inter-reviewer agreement, double-reviews a sample of decisions to estimate label quality, and treats appeal outcomes, where a second reviewer overturns the first, as a free-arriving measurement of label noise. Working with noisy labels is survivable when the noise is measured, because thresholds can be chosen with the noise rate in view, whereas unmeasured noise quietly becomes a ceiling on every model's apparent quality and nobody can say why the curves stopped improving.
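One common way to put a number on inter-reviewer agreement is Cohen's kappa, which discounts the agreement two reviewers would reach by chance. The text does not name a specific statistic, so treat this as one reasonable choice; the labels below are invented for illustration.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Agreement expected if each reviewer labeled at random with their own rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Invented double-review sample: 1 = violates, 0 = clean.
reviewer_a = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
reviewer_b = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]

kappa = cohens_kappa(reviewer_a, reviewer_b)
print(f"raw agreement 0.80, kappa {kappa:.3f}")
```

Raw agreement of 80 percent shrinks to a kappa near 0.58 here, which is why the raw figure alone tends to overstate label quality.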

The class imbalance is the other defining property of the data. At a 0.1 percent violation rate, a random training sample of a million posts contains about a thousand positives, and a model can post 99.9 percent accuracy by calling everything clean, which is why accuracy never appears on an integrity dashboard. Training handles the imbalance by downsampling the clean class, keeping every labeled violation while sampling perhaps one clean post in a hundred, which lifts the apparent violation rate from one in a thousand to roughly one in eleven and lets every batch carry real signal instead of restating the word clean a thousand ways. The price is the one ad systems pay for the same trick, because the model now learns probabilities on a reshaped distribution, so outputs are corrected afterward, either with the closed-form adjustment for the sampling rate or with a calibration layer fitted on an unsampled holdout, and the correction is mandatory because everything downstream consumes the scores as probabilities. Evaluation resists the same trap by reporting precision and recall per category rather than any pooled accuracy, on labeled sets that are stratified, enriched with hard cases, and kept frozen so that quarter-over-quarter comparisons mean something.
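The closed-form adjustment follows from the fact that keeping clean posts at rate w multiplies the odds of violation by 1/w, so the true odds are the sampled odds times w. A minimal sketch, with function and variable names that are illustrative rather than taken from any library:

```python
def correct_for_downsampling(p_sampled, keep_rate):
    """Map a probability learned on negative-downsampled data back to the
    original distribution. keep_rate is the fraction of clean posts kept."""
    return p_sampled * keep_rate / (p_sampled * keep_rate + (1.0 - p_sampled))

positives = 1_000        # violations in a million-post sample, from the text
negatives = 999_000
keep_rate = 0.01         # one clean post in a hundred, from the text

# Apparent violation rate after downsampling: about one in eleven.
sampled_rate = positives / (positives + negatives * keep_rate)
print(f"sampled base rate: 1 in {1 / sampled_rate:.1f}")

# A model that outputs the sampled base rate should map back to 0.1 percent.
print(f"corrected: {correct_for_downsampling(sampled_rate, keep_rate):.4%}")

# A confident score shrinks, but stays high enough to act on.
print(f"0.99 on sampled data -> {correct_for_downsampling(0.99, keep_rate):.3f}")
```

The fitted calibration layer mentioned above does the same job empirically and also absorbs miscalibration that has nothing to do with sampling.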

User reports supply a further label stream, abundant and very noisy, since most reported posts violate nothing and reporting is itself weaponized in personal disputes, but reports still earn their keep as a candidate generator for review and as a weak training signal when aggregated. Multilingual and cultural coverage is at bottom a data problem rather than a modeling one, because the policy categories exist in every language while labels concentrate in a few, so the label budget is allocated across languages and regions deliberately, and cross-lingual transfer from multilingual encoders bridges the gap while labeled data catches up. Coded language and dialect shift make a model trained on one region's data quietly wrong in another, and the failure looks like nothing at all in aggregate metrics, which is why sliced evaluation by language is non-negotiable and why only locally sourced labels truly fix what the slices surface.

Known-content matching

Before any classifier runs, the cheapest and most precise layer asks whether this content has been seen before, because much of the worst material is re-shared rather than new, and in the severest categories the bulk of daily volume consists of copies of media the platform or its industry partners have already judged. Perceptual hashing computes a fingerprint of an image or video designed so that re-encoding, resizing, cropping, and watermarking still produce the same or a nearby hash. That goal is the exact opposite of a cryptographic hash, which is engineered so that flipping one pixel scrambles the whole digest, while a perceptual hash is engineered so that changes human eyes barely register move the digest barely at all. Matching the fingerprint against databases of known violating media, in the style of PhotoDNA, delivers near-certain detection at microseconds of cost, and because matching is an index lookup rather than a model inference it scales to every post on the platform without a GPU anywhere in the path. Industry hash-sharing programs extend the databases across companies for the most severe categories, so media removed on one platform becomes recognizable on the others within hours.
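The contrast with a cryptographic hash is easy to see on a toy example. The sketch below uses an average hash, one of the simplest perceptual hashes, on a hand-written 4x4 grayscale grid. It is not PhotoDNA, whose algorithm is not described in the text, and real systems first downscale the image to a small fixed grid.

```python
import hashlib

def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set when brighter than the mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def sha256_hex(pixels):
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()

original = [[10, 20, 200, 210], [15, 25, 205, 215],
            [12, 22, 202, 212], [18, 28, 208, 218]]
brightened = [[v + 10 for v in row] for row in original]   # mild edit
mirrored = [list(reversed(row)) for row in original]       # different picture

h = average_hash(original)
print("perceptual distance, brightened:", hamming(h, average_hash(brightened)))
print("perceptual distance, mirrored:  ", hamming(h, average_hash(mirrored)))
print("sha256 equal after brightening: ", sha256_hex(original) == sha256_hex(brightened))
```

Matching then reduces to finding known hashes within a small Hamming distance, which is an index lookup rather than a model inference.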

One step softer sits embedding similarity, where learned vector representations of media are compared in an approximate nearest neighbor index, which catches near-duplicates that survive heavier edits than hashing tolerates, at somewhat lower precision. The precision gap is handled by distance, since a match within a tight radius of a known item can act automatically while a looser match routes to human review instead of to removal. Together these layers strip away a large fraction of total violating volume before a single expensive classifier runs, and they are the reason the classifiers can afford to be heavy, because the models only ever face the residue the cheap layers could not dismiss.
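Routing by distance can be sketched as follows. A brute-force scan stands in for the approximate nearest neighbor index, and the two radii are hypothetical values chosen only to make the example readable.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def route_by_similarity(embedding, known_violating,
                        auto_radius=0.05, review_radius=0.20):
    """Return an action based on distance to the nearest known violating item.

    The radii are illustrative assumptions; in practice they would be tuned
    per category against audited precision.
    """
    nearest = min(cosine_distance(embedding, k) for k in known_violating)
    if nearest <= auto_radius:
        return "auto_remove"
    if nearest <= review_radius:
        return "human_review"
    return "pass_to_classifiers"

known = [[1.0, 0.0, 0.0]]
print(route_by_similarity([0.99, 0.05, 0.0], known))  # near-duplicate
print(route_by_similarity([0.90, 0.40, 0.0], known))  # heavier edit
print(route_by_similarity([0.00, 1.00, 0.0], known))  # unrelated
```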

Model architecture

The classifier stack is tiered by cost. The synchronous tier, which runs while the user waits for the post to publish, contains the hash and similarity matchers plus a fast text model and a lightweight image classifier. The text model is a distilled transformer, meaning a small network trained to reproduce the outputs of a much larger one so that most of the quality survives at a small fraction of the cost, and it scores every text field in a few milliseconds on ordinary CPUs. The asynchronous tier runs within minutes of publication and holds the expensive machinery, full-size multimodal transformers, video models that sample frames and score them together with the audio track, and context models that fold in account history and propagation patterns, which matter because a network of fresh accounts amplifying one post is itself evidence about the post.

Why not simply run the heavy models on everything before publishing? The arithmetic closes that door. A full multimodal pass with video sampling costs on the order of half a GPU-second per post, and a billion posts per day at that price comes to roughly 500 million GPU-seconds daily, which works out to a standing fleet of five or six thousand accelerators screening traffic that is 99.9 percent clean, while also adding seconds of latency to every publish action on the platform. The design instead accepts a window of minutes during which a violating post may be live, sizes the synchronous tier to catch the known and the obvious instantly, and spends the saved compute making the asynchronous models genuinely good. Each category gets its own calibrated head, meaning a predicted 0.9 corresponds to roughly 90 percent of such posts being true violations, and calibration is enforced rather than hoped for, because every threshold downstream consumes these scores as probabilities, and a head that drifts 20 percent hot silently moves every action boundary at once, removing posts the policy team never agreed to remove.
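The fleet estimate is a one-line division, reproduced here under the simplifying assumption of perfectly even load and full utilization, so a real fleet sized for peaks would be larger.

```python
SECONDS_PER_DAY = 24 * 60 * 60

gpu_seconds_per_post = 0.5          # from the text, order of magnitude
posts_per_day = 1_000_000_000       # from the text

gpu_seconds_per_day = gpu_seconds_per_post * posts_per_day
# Each accelerator supplies 86,400 GPU-seconds per day at 100% utilization.
fleet_size = gpu_seconds_per_day / SECONDS_PER_DAY

print(f"GPU-seconds per day: {gpu_seconds_per_day:,.0f}")
print(f"standing fleet: about {fleet_size:,.0f} accelerators")
```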

Offline evaluation

Per-category precision-recall curves are the core offline artifact. Precision is the share of flagged posts that truly violate, recall is the share of true violations that get flagged, and the curve traces the trade between them as the score threshold moves. The more familiar ROC curve, which plots the true positive rate against the false positive rate, is avoided here deliberately, because at a 0.1 percent base rate the false positive rate divides by the enormous clean class and looks microscopic even while the review queue drowns, so ROC numbers flatter every model and precision-recall numbers tell the operational truth. Because thresholds drive actions with different costs, the readings that matter come from specific points on the curve, precision at the auto-remove threshold, where false positives mean wrongly silenced users, and recall at the review threshold, where false negatives mean harm reaching people.
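A short calculation shows why a tiny false positive rate can coexist with poor precision. The true positive and false positive rates below are invented for illustration, and only the 0.1 percent base rate comes from the text.

```python
def precision_from_rates(tpr, fpr, base_rate):
    """Precision implied by a ROC operating point at a given base rate."""
    true_pos = tpr * base_rate
    false_pos = fpr * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)

tpr, fpr = 0.75, 0.001   # hypothetical ROC point: looks excellent on paper

rare = precision_from_rates(tpr, fpr, base_rate=0.001)
balanced = precision_from_rates(tpr, fpr, base_rate=0.5)

print(f"precision at 0.1% base rate: {rare:.3f}")
print(f"precision at 50% base rate:  {balanced:.3f}")
```

The same ROC point that yields near-perfect precision on balanced data flags more clean posts than violations at the platform's base rate.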

A worked example grounds the reading. Suppose the violence model at threshold 0.92 shows precision 0.98 and recall 0.45 on the frozen evaluation set. Out of 1,000 true violations in a day's traffic, recall of 0.45 means the threshold catches 450 of them. Finding the wrongful removals takes one more step, because precision speaks about everything removed rather than only the correct removals, so 450 correct removals at 98 percent precision implies about 459 total removals, and the difference of roughly nine posts is the daily count of users wrongly silenced per thousand violations at this operating point. Calibration is checked alongside by binning predictions and comparing predicted probability with observed violation rate per bin, and every evaluation is sliced by language, region, and content type, since aggregate curves hide exactly the failures that matter most. Validation splits are time-based, training on past months and evaluating on the most recent, because adversaries update their evasions weekly and a random split would let the model peek at the very tricks it will be graded on.
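The binned calibration check described above can be sketched in plain Python. The bin count and the summary statistic (expected calibration error, a count-weighted mean gap) are common conventions rather than details given in the text, and the scores and labels are synthetic.

```python
def reliability_table(scores, labels, n_bins=10):
    """Per-bin (count, mean predicted probability, observed violation rate)."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        i = min(int(s * n_bins), n_bins - 1)  # clamp s == 1.0 into the top bin
        bins[i][0] += 1
        bins[i][1] += s
        bins[i][2] += y
    return [(n, total / n, pos / n) for n, total, pos in bins if n]

def expected_calibration_error(scores, labels, n_bins=10):
    """Count-weighted average gap between predicted and observed rates."""
    table = reliability_table(scores, labels, n_bins)
    return sum(n * abs(pred - obs) for n, pred, obs in table) / len(scores)

scores = [0.9] * 10 + [0.1] * 10
calibrated_labels = [1] * 9 + [0] + [1] + [0] * 9   # 90% and 10% observed
hot_labels = [1] * 7 + [0] * 3 + [1] + [0] * 9      # 0.9 bin is only 70% true

ece_good = expected_calibration_error(scores, calibrated_labels)
ece_hot = expected_calibration_error(scores, hot_labels)
print(f"calibrated head ECE: {ece_good:.3f}")
print(f"hot head ECE:        {ece_hot:.3f}")
```

Run per category head and per slice, a table like this is what reveals a head drifting hot before the drift shows up as wrongful removals.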

The layered serving architecture

At post time, the synchronous gauntlet runs while the user's publish request is in flight, and its budget rewards a slow walk-through. Hash and similarity matching cost a millisecond or two, because both are index lookups rather than inferences. The distilled text model adds a few milliseconds, the lightweight image classifier roughly ten more, and the threshold router is a table lookup, so the listed stages sum to roughly 15 milliseconds, the entire gauntlet fits inside a budget of about 30 milliseconds with headroom to spare, and the posting user perceives nothing. A known-database match acts immediately, removing the post before a single person sees it, and the author receives a notice naming the policy along with a path to appeal. Otherwise the fast scores feed the threshold router, and the post publishes unless the synchronous tier itself crossed an action line.

Minutes later the heavy asynchronous models deliver their richer scores to the same router, which can then remove, demote, or queue content the fast tier let through, and from the author's side an asynchronous removal looks like a post that lived briefly and then vanished with a notification. The asynchronous queue is prioritized rather than first-in-first-out, because a borderline post accelerating toward a million views deserves its heavy scoring ahead of a post nobody is opening, so predicted reach buys queue position. The router itself is where policy meets probability, since per category the scores map to bands and the bands map to actions: auto-remove at high confidence, human review at middle confidence, demotion below that, which quietly reduces distribution while review catches up, and no action at the bottom.

Figure: The layered path. Known-content matching acts instantly, fast synchronous models gate publication, heavy asynchronous models deliver richer scores minutes later, and the per-category threshold router fans out to tiered actions. The dashed return path is the system feeding itself the next round of training labels.

The operating point machinery

Choosing the bands amounts to reading points off the precision-recall curve and attaching costs to them, separately per category. Continuing the violence example, threshold 0.92 with precision 0.98 and recall 0.45 defines the auto-remove band, because the platform has decided that roughly two wrongful removals per hundred automated actions is the most user trust will bear in this category. Dropping the threshold to 0.60 raises recall to 0.75 at precision 0.80, and the range between 0.60 and 0.92 becomes the human review band. Walking the day's 1,000 true violations through it, the band catches another 300 of them beyond the 450 removed automatically, and at 80 percent precision the queue must receive about 375 posts to contain those 300, which is precisely how review headcount gets sized from model quality. Scaled to the platform's million daily violations, the same proportions imply a review stream in the hundreds of thousands of posts per day, and with a trained reviewer sustaining a few hundred careful decisions in a shift, the band edges translate directly into thousands of people, which is why moving a threshold is a budgeting decision wearing a statistics costume.
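The sizing walk-through can be written down as a small calculator. It follows the text in applying the 0.80 precision to the posts inside the review band, and the figure of 300 decisions per reviewer shift is an assumption standing in for "a few hundred".

```python
def size_bands(true_violations, remove_recall, remove_precision,
               review_recall, band_precision):
    """Daily counts for the auto-remove band and the human review band."""
    auto_correct = true_violations * remove_recall
    auto_total = auto_correct / remove_precision
    band_hits = true_violations * (review_recall - remove_recall)
    band_posts = band_hits / band_precision
    return {
        "auto_correct": auto_correct,
        "auto_wrongful": auto_total - auto_correct,
        "review_hits": band_hits,
        "review_posts": band_posts,
    }

per_thousand = size_bands(1_000, 0.45, 0.98, 0.75, 0.80)
platform = size_bands(1_000_000, 0.45, 0.98, 0.75, 0.80)

decisions_per_shift = 300  # assumption for "a few hundred careful decisions"
reviewer_shifts = platform["review_posts"] / decisions_per_shift

print({k: round(v, 1) for k, v in per_thousand.items()})
print(f"platform review stream: {platform['review_posts']:,.0f} posts/day")
print(f"reviewer shifts needed: {reviewer_shifts:,.0f} per day")
```

If the 0.80 is instead read as the cumulative precision of everything scoring above 0.60, the band holds about 478 posts per thousand violations (750 / 0.80, minus the 459 removed automatically), so the reading of the curve matters for headcount.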

From 0.30 to 0.60 the score still carries signal but not enough to justify removal or scarce reviewer time, so distribution is limited while the post waits for stronger evidence, whether that is user reports accumulating or the late asynchronous scores arriving, and below 0.30 nothing happens at all. Imminent-harm categories shift every line toward recall and route to specialist queues with response targets measured in minutes, while spam shifts toward automation because both of its error types are cheap. The bands are also revisited whenever a model retrains, since a new model traces a new precision-recall curve and yesterday's thresholds land at different points on it, which is an easy detail to forget and an expensive one to get wrong.
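Put together, the bands amount to a small lookup table per category. In the sketch below the violence thresholds are the ones used in this walkthrough, while the spam row and the action names are hypothetical placeholders.

```python
# Per-category bands as (lower threshold, action), highest band first.
BANDS = {
    "violence": [(0.92, "auto_remove"), (0.60, "human_review"), (0.30, "demote")],
    # Hypothetical: spam leans toward automation, so auto-remove starts lower.
    "spam": [(0.80, "auto_remove"), (0.70, "human_review"), (0.40, "demote")],
}

def route(category, score):
    """Map one calibrated category score to a tiered action."""
    for threshold, action in BANDS[category]:
        if score >= threshold:
            return action
    return "no_action"

def route_post(scores):
    """Route every category independently, as multi-label scoring requires."""
    return {category: route(category, s) for category, s in scores.items()}

print(route_post({"violence": 0.95, "spam": 0.10}))
print(route_post({"violence": 0.75, "spam": 0.85}))
print(route_post({"violence": 0.45, "spam": 0.05}))
```

Keeping the table as data rather than code is one way to make the post-retrain threshold review a configuration change instead of a deploy.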

Figure: One category's bands, numbered top to bottom. The highest scores remove automatically at audited precision, the middle band buys recall with reviewer time, the lower band limits distribution, and the rest pass untouched. The dashed loop returns reviewer judgments as the next training set.

The human review loop

The review system deserves treatment as a first-class component, because every model in the stack trains on its output and every operating point is sized around its capacity. The queue is ordered by expected harm, meaning the severity of the suspected category multiplied by the predicted reach of the post, since reviewing a borderline post heading for a million views matters more than confirming a certain violation nobody will open. Reviewers see the post with its surrounding context, the account's recent history, and the exact policy text the model suspects was violated, because a decision made against the written line is a label the trainer can use, while an unanchored gut call is noise. Their decisions flow back as fresh gold labels within hours, and appeals run as a second pass in front of different reviewers, which both corrects individual errors and measures the false-removal rate from the inside, since the overturn rate on appeals is the most direct estimate of wrongful action the platform ever gets.
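Ordering by expected harm maps naturally onto a priority queue. The severity weights, category names, and class name below are hypothetical, and the sketch uses severity times predicted reach exactly as the paragraph defines it.

```python
import heapq

# Hypothetical severity weights per category.
SEVERITY = {"self_harm": 10, "violence": 7, "spam": 1}

class ReviewQueue:
    """Max-priority queue keyed on severity x predicted reach."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities pop in arrival order

    def push(self, post_id, category, predicted_reach):
        expected_harm = SEVERITY[category] * predicted_reach
        # heapq is a min-heap, so negate the priority.
        heapq.heappush(self._heap, (-expected_harm, self._counter, post_id))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = ReviewQueue()
queue.push("a", "spam", predicted_reach=1_000_000)
queue.push("b", "violence", predicted_reach=500)
queue.push("c", "self_harm", predicted_reach=200_000)

order = [queue.pop() for _ in range(len(queue))]
print(order)
```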

Quality and sustainability of the reviewing workforce are engineering concerns rather than staffing footnotes. Pre-labeled golden tasks are salted into the queue to measure each reviewer's agreement with policy over time, contested categories get double review with adjudication, and the measurement matters because reviewer error flows straight into the training data of every future model. Media is blurred by default and revealed deliberately, exposure to the severest categories is capped per shift, and rotation off those queues is scheduled, because the system depends on sustained human judgment, and exhausted judgment degrades in exactly the ways that are hardest to detect, with agreement drifting, queues slowing, and decisions quietly turning into rubber stamps.

Online evaluation and operations

Two north-star metrics define success. Prevalence measures how much violating content users actually see, estimated by sampling viewed content and human-labeling the sample, and it is expressed as violating views per ten thousand views, so a prevalence of five means five of every ten thousand views landed on violating content. The metric weights a million-view miss a million times more heavily than a miss viewed once, and gives no weight at all to a miss nobody saw, which matches the harm, and it is the user's experience of the system compressed into a single number. Proactive rate measures the share of actioned violations the system caught before any user report arrived, which tracks whether detection leads harm or trails it. Around the two sit per-category automated-action precision audited by ongoing sampling, review queue latency and backlog, appeal overturn rates as the running measure of wrongful action, and reviewer agreement. New models graduate through A/B tests read on these online metrics rather than on offline curves, because a model that wins offline can still lose on prevalence by shifting its errors onto high-reach content.
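Both north-star metrics are simple ratios, and the only subtlety is the sampling error on prevalence. The sketch below uses invented counts and a normal-approximation interval, which is an assumption about how the uncertainty might be reported rather than anything the text specifies.

```python
import math

def prevalence_per_10k(sampled_views, violating_views):
    """Violating views per ten thousand views in a labeled sample."""
    return 10_000 * violating_views / sampled_views

def prevalence_half_width_per_10k(sampled_views, violating_views, z=1.96):
    """Approximate 95% half-width from the binomial standard error."""
    p = violating_views / sampled_views
    return 10_000 * z * math.sqrt(p * (1.0 - p) / sampled_views)

def proactive_rate(actioned, reported_before_detection):
    """Share of actioned violations found before any user report."""
    return (actioned - reported_before_detection) / actioned

# Invented numbers: 200,000 sampled views, 100 labeled as violating.
prev = prevalence_per_10k(200_000, 100)
half_width = prevalence_half_width_per_10k(200_000, 100)
proactive = proactive_rate(actioned=1_000, reported_before_detection=50)

print(f"prevalence: {prev:.1f} +/- {half_width:.2f} per 10,000 views")
print(f"proactive rate: {proactive:.0%}")
```

Even a sample of two hundred thousand labeled views leaves an interval of about one view per ten thousand on either side, which is part of why the metric is slow and expensive to move.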

Operations are dominated by adversarial adaptation, and the shape of an evasion wave is worth knowing in advance because it is what the on-call engineer actually sees. Score distributions for one category drift downward over days as a new coding spreads, report rates rise while the model's catch rate falls, and clusters of near-duplicate content pile up just beneath the review threshold, scoring 0.55 against a band that starts at 0.60. The prevalence sample lags because it needs human labels, so it confirms the wave rather than warning of it, and the leading indicators have to carry the alarm. The runbook starts with the hash databases, because fingerprinting the wave's media once reviewers confirm a few instances is hours faster than retraining anything, and a targeted label-and-retrain cycle for the affected category follows on a weekly or faster cadence, with adversarial categories retraining fastest of all.
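One possible leading indicator for the pile-up beneath the review threshold is the share of scores landing just under it, compared against a baseline window. The margin, alarm ratio, and function names are hypothetical, and the score lists are synthetic.

```python
def near_miss_share(scores, review_threshold=0.60, margin=0.10):
    """Fraction of scores in the strip just below the review threshold."""
    lower = review_threshold - margin
    near = sum(lower <= s < review_threshold for s in scores)
    return near / len(scores)

def wave_alarm(baseline_scores, current_scores, ratio=3.0, floor=0.001):
    """Fire when the near-miss share grows by more than `ratio` over baseline.

    `floor` avoids dividing attention by zero when the baseline strip is empty.
    """
    baseline = max(near_miss_share(baseline_scores), floor)
    return near_miss_share(current_scores) > ratio * baseline

# Synthetic windows: the current one has a cluster scoring 0.55.
baseline_window = [0.05] * 95 + [0.55] * 1 + [0.70] * 4
current_window = [0.05] * 90 + [0.55] * 6 + [0.70] * 4

print(f"baseline share: {near_miss_share(baseline_window):.2f}")
print(f"current share:  {near_miss_share(current_window):.2f}")
print("alarm:", wave_alarm(baseline_window, current_window))
```

An alarm like this warns before the prevalence sample can, since it needs no human labels, but it only points at where reviewers should look first.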

The unglamorous failures need watching just as closely, because a feature pipeline that silently drops image embeddings for one language's traffic is indistinguishable from an evasion wave in the topline numbers until somebody inspects the slices. Missing-modality rates, calibration drift per category head, and per-language score distributions all carry their own alarms, and the operational reflex for a sick model is rollback before diagnosis, restoring the previous model version within minutes, because harm compounds while logs are being read and the diagnosis can proceed calmly once users are protected again.

Follow-up questions

Why not block every post until all models clear it? The heavy models add seconds of latency and would demand a standing fleet of thousands of GPUs to screen a billion daily posts, of which 99.9 percent are clean, so total synchronous screening buys very little protection at enormous cost. The design accepts a minutes-long exposure window for whatever the cheap synchronous tier cannot catch, and it shrinks that window by letting predicted reach prioritize the asynchronous queue.
Why per-category thresholds instead of one global score? The costs of the two error types differ by category, since a wrongful spam removal is a trivial annoyance while a wrongful hate-speech removal silences a person, and a missed self-harm post is a different magnitude of harm than a missed spam post. Each category's precision-recall curve therefore gets read against its own cost model, and the bands land in genuinely different places as a result.
Early or late fusion for multimodal posts? Late fusion forms the backbone, because per-modality scores localize failures and each component retrains independently, which keeps the system operable by a real team. An early-fusion multimodal model is added specifically for the cross-modal violations late fusion structurally misses, such as a benign image carrying a weaponized caption, where neither modality alone contains the violation.
How does the system handle a 0.1 percent positive rate in training? The clean class is downsampled aggressively, keeping every labeled violation while sampling clean posts at perhaps one in a hundred, and the outputs are recalibrated afterward so the probabilities stay meaningful to the threshold router. Accuracy is never reported anywhere, because per-category precision and recall on a frozen, enriched evaluation set are the only numbers that move decisions.
What keeps the human review budget from exploding? The band boundaries do, because review receives only the score range where a human judgment actually changes the outcome, the queue is ordered by severity times predicted reach, and known-content matching keeps re-shared material out of the queue entirely. Review capacity is a sized and budgeted component of the design rather than an overflow drain, and threshold moves are costed in reviewer headcount before they ship.
How do you know the system is winning against adversaries? Prevalence trending down while proactive rate trends up, sliced by category and language, is the only trustworthy answer. Model-level metrics can hold perfectly steady while coded language erodes real coverage underneath them, which is why the sampled, human-labeled prevalence estimate remains the metric of record even though it is slow and expensive to produce.

References

Microsoft, PhotoDNA, perceptual hashing for known harmful media.
Meta, Community Standards Enforcement Report, the public home of prevalence and proactive rate.
Kiela et al., The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes (2020).
Aminian and Xu, Machine Learning System Design Interview (2023).