Design a Street View blurring system

Street View publishes panoramic photographs of nearly every public road on earth, and the people and cars in those photographs never agreed to appear in a global product. The blurring system is the privacy layer that makes publication acceptable: before any panorama goes live, every human face and every license plate in it must be found and irreversibly obscured. Interviewers reach for this question because it inverts the assumptions of most ML system design problems. No user is waiting on a response, so latency per image barely matters. A single missed face among billions of images, however, is a real harm to a real person, whose walk to a clinic or whose front door is now recorded in a product anyone can browse. That asymmetry makes recall and throughput the constraints that shape everything.

The hard parts are choosing an operating point when the two error types carry wildly different costs, building training data for the long tail of things that look like faces but are not, and running a detection fleet over billions of images at a cost and timeline a business will accept. The serving section of a typical design becomes a batch pipeline here, and that single change ripples through every decision below, from the choice of detector to what the on-call engineer watches at night.

Scope and requirements

The product behavior is plain to state. Given the corpus of captured panoramas, the system produces published copies in which all faces and license plates are blurred, while originals are retained under strict access control for reprocessing and legal needs. Scale is the first number to agree on, and one billion new images per capture season is a workable figure, with each panorama being a very large image on the order of ten thousand pixels across, so the corpus is measured in petabytes rather than terabytes. The pipeline should finish processing a season's capture in days to a few weeks, because imagery freshness is part of the product's value, and the cost envelope should be a bounded GPU fleet rather than an unbounded burst, since this job repeats every season and its budget is a recurring line item someone has to defend.

The quality requirement is asymmetric, and I would name that asymmetry as the design's center of gravity. A missed face is a privacy violation with reputational and regulatory consequences for a person who never consented to the photograph, while an unnecessarily blurred statue or billboard slightly degrades one image among billions. Recall on faces and plates should therefore sit as close to perfect as the system can manage, with precision allowed to fall to whatever the imagery can tolerate, which is the opposite of how most detector deployments get tuned. Three things are out of scope: video; user-requested blurring of whole houses, which is a manual workflow reusing the same blur machinery; and any real-time path, because nothing here serves a live request.

Framing the problem as a machine learning task

The task is object detection, meaning the model takes an image and predicts a set of bounding boxes, each carrying a class, face or license plate, and a confidence score, rather than a single label for the whole image. Whole-image classification would only say that a face exists somewhere, which locates nothing to blur, and segmentation, the finer-grained framing that labels every individual pixel, buys more precision than a blur ellipse needs at meaningfully higher labeling and compute cost. Detection with boxes, slightly dilated before blurring so that hairlines and plate edges are covered, is the right granularity, and naming both rejected neighbors makes the choice look deliberate rather than default.
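The dilation step can be sketched in a few lines. This is a minimal illustration, not the production rule: the `Box` type, the `dilate` helper, and the 25 percent margin are all assumptions made for the example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical type for this sketch)."""
    x0: float
    y0: float
    x1: float
    y1: float


def dilate(box: Box, frac: float, width: int, height: int) -> Box:
    """Grow a box by `frac` of its size on every side, clamped to the image."""
    dx = (box.x1 - box.x0) * frac
    dy = (box.y1 - box.y0) * frac
    return Box(
        max(0.0, box.x0 - dx),
        max(0.0, box.y0 - dy),
        min(float(width), box.x1 + dx),
        min(float(height), box.y1 + dy),
    )


# A 20-pixel face grows by 5 pixels per side with an assumed 25 percent margin.
face = Box(100, 100, 120, 120)
padded = dilate(face, 0.25, width=1000, height=1000)
```

The margin is a recall lever of its own: a slightly loose box that covers the hairline costs a few extra blurred pixels, which the asymmetry already forgives.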

Detectors come in two broad families, and the offline setting decides between them. One-stage detectors predict boxes and classes in a single dense pass over the image, which makes them fast and the usual choice for real-time work. Two-stage detectors first propose candidate regions and then run a second, more careful classifier over each proposal, buying accuracy, particularly on small objects, at perhaps several times the compute. Because this pipeline is offline, the usual speed pressure disappears and the calculus flips, so a heavier two-stage model run at high input resolution, possibly as an ensemble of two diverse models whose detections are unioned, becomes affordable, and every point of recall it buys is the product requirement itself. Unioning helps precisely because differently built models miss different faces, so the combined miss set is smaller than either model's alone, at the cost of more false alarms the asymmetry already forgives. Faces in panoramas are often twenty pixels tall, caught at extreme angles, and partially hidden behind car doors and crowds, so the heavy model's accuracy on exactly those cases is not a luxury.
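The claim about unioning can be made concrete with set arithmetic. The sketch below assumes detections have already been matched to labeled faces, so each model is reduced to the set of face IDs it found; the IDs and the `missed_by_union` helper are hypothetical.

```python
def missed_by_union(truth: set[str], *found: set[str]) -> set[str]:
    """Faces that no model found. Unioning detections intersects the miss sets."""
    covered = set().union(*found)
    return truth - covered


truth = {"f1", "f2", "f3", "f4", "f5", "f6"}
model_a = {"f1", "f2", "f3", "f4"}  # misses f5 and f6
model_b = {"f1", "f2", "f3", "f5"}  # misses f4 and f6

missed = missed_by_union(truth, model_a, model_b)  # only f6 is missed by both
```

The gain depends on how diverse the two models are. If both fail on the same twenty-pixel profile faces, the intersection of their miss sets barely shrinks, which is why the ensemble members should differ in architecture or training data.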

Data and labels

Training data is human-labeled boxes drawn on real captured imagery, because the distribution is unusual in ways public datasets do not cover. Capture vehicles produce fisheye-adjacent projections, motion blur from a moving platform, harsh shadows, reflective glass, and faces that are small, angled, and incidental rather than posed. Public face datasets help with pretraining, but the labeling budget belongs on the platform's own panoramas, sampled to overweight hard conditions such as low light, rain, crowds, and dense parking, since those slices are where recall will be lost. Label quality gets its own check, with a fraction of images labeled twice by independent annotators so that missed-box rates among the labelers themselves are measured, because an evaluation set whose denominator is wrong makes every recall claim built on it wrong too.
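One way to turn double labeling into a miss-rate estimate is the Lincoln-Petersen capture-recapture estimator, shown here as an assumption rather than as the method the text prescribes. It presumes the two annotators miss boxes independently, which is optimistic because hard faces are hard for everyone. The counts in the example are invented for illustration.

```python
def estimate_total_objects(found_a: int, found_b: int, found_both: int) -> float:
    """Lincoln-Petersen estimate of the true object count from two independent passes."""
    if found_both == 0:
        raise ValueError("annotators share no boxes; the estimate is undefined")
    return found_a * found_b / found_both


def miss_rate(found: int, estimated_total: float) -> float:
    """Fraction of the estimated true objects that one annotator did not box."""
    return max(0.0, 1.0 - found / estimated_total)


# Hypothetical audit: annotator A boxed 95 faces, B boxed 90, and 86 were boxed by both.
total = estimate_total_objects(95, 90, 86)
miss_a = miss_rate(95, total)
miss_b = miss_rate(90, total)
```

Because annotator errors are correlated in practice, the result is better read as a lower bound on the labelers' miss rate than as a precise figure.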

Hard negatives deserve their own labeling track, since the things that fool a face detector form a predictable family of statues, mannequins, faces printed on billboards and bus wraps, murals, and reflections in shop windows. License plates bring a geographic version of the same problem, because jurisdictions vary in plate geometry, fonts, colors, and mounting positions, so the dataset needs deliberate coverage by region rather than whatever the densest capture cities happen to provide. A policy decision hides here that the interviewer will appreciate hearing said aloud. Printed faces on advertisements are usually blurred anyway, because the cost of over-blurring an ad is negligible and excluding them would teach the model a distinction it cannot reliably make, so the labeling guideline calls them faces and moves on. Augmentation rounds out the data by simulating capture conditions, with synthetic motion blur, exposure shifts, compression artifacts, and small-scale shrinking so the model keeps its recall on twenty-pixel faces.

Representation and preprocessing

Panoramas are stored in an equirectangular projection, the format that maps the full sphere onto a rectangle and badly distorts anything near the top and bottom. Running the detector directly on that projection costs recall in exactly the distorted regions, so the pipeline first re-projects each panorama into a set of overlapping perspective tiles, ordinary-looking rectangular views, runs detection per tile, and maps the detected boxes back into panorama coordinates. Tiles overlap by a couple hundred pixels so that a face straddling a tile boundary appears whole in at least one tile, and the duplicate detections the overlap creates are merged by non-maximum suppression, the standard step that collapses overlapping boxes for the same object into one. Tiling multiplies the work, since a single panorama might become eight to twelve tiles, and that multiplier belongs in the throughput math from the start rather than being discovered after the fleet is sized, because a pipeline budgeted per panorama and billed per tile runs roughly ten times over plan.
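The overlap-and-merge logic can be sketched as follows. This is a simplification under stated assumptions: it cuts flat crops along one axis and skips the spherical re-projection entirely, and the tile size, overlap, and helper names are illustrative. Boxes are `(x0, y0, x1, y1)` tuples.

```python
def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles covering [0, length) that share at least `overlap` pixels."""
    if tile >= length:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile sits flush with the far edge
    return starts


def to_global(box, x_off, y_off):
    """Shift a tile-local box back into panorama coordinates."""
    return (box[0] + x_off, box[1] + y_off, box[2] + x_off, box[3] + y_off)


def iou(a, b) -> float:
    """Intersection over union of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets, thresh: float = 0.5):
    """Greedy non-maximum suppression over (box, score) pairs."""
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < thresh for k in kept):
            kept.append((box, score))
    return kept


starts = tile_starts(10_000, tile=2048, overlap=200)
# The same face seen in two neighbouring tiles, then mapped back and merged.
dets = [
    (to_global((1900, 100, 1940, 150), starts[0], 0), 0.9),
    (to_global((53, 100, 93, 150), starts[1], 0), 0.8),
]
merged = nms(dets)
```

A real pipeline would apply the inverse projection in `to_global` rather than a plain offset, but the tile count returned by `tile_starts` is the multiplier that belongs in the throughput math either way.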

Offline evaluation

The research-style metric for detection is mAP, mean average precision, and it rests on a matching rule worth stating first. A predicted box counts as correct when its overlap with a labeled box, measured as intersection over union, the shared area divided by the combined area, exceeds a threshold such as 0.5, and every unmatched prediction is a false positive while every unmatched label is a miss. For one class, average precision then summarizes the whole precision-recall trade by ranking all predicted boxes by confidence and averaging the precision achieved at each level of recall, and the mean over classes gives mAP. A tiny example makes it concrete. Suppose an image set has 3 true faces and the model emits 4 boxes ranked by score, where the 1st, 3rd, and 4th match real faces and the 2nd is a false alarm. Precision when the first true face is recovered is 1/1 = 1.0, precision when the second arrives is 2/3 ≈ 0.67, and precision when the third arrives is 3/4 = 0.75, so average precision is (1.0 + 0.67 + 0.75) / 3 ≈ 0.81. mAP is the right number for comparing model candidates because it is threshold-free, summarizing quality across every possible operating point at once instead of at one chosen cut.
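The worked example can be checked in code. The sketch assumes the IoU matching has already been done, so each ranked prediction is reduced to a boolean, and it computes the plain, non-interpolated average precision used in the example above; benchmark tooling typically applies an interpolated variant, so the numbers will not match such tools exactly.

```python
def average_precision(matches: list[bool], num_truth: int) -> float:
    """Mean of the precision values at each rank where a true object is recovered.

    matches[i] is True when the i-th highest-scored box matched a labeled box.
    Labeled objects that are never matched contribute zero precision.
    """
    hits = 0
    total = 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            total += hits / rank
    return total / num_truth if num_truth else 0.0


# Four ranked boxes, three true faces: the 2nd box is the false alarm.
ap = average_precision([True, False, True, True], num_truth=3)
```

Here `ap` evaluates to (1/1 + 2/3 + 3/4) / 3, which rounds to the 0.81 quoted in the text.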

The metric that actually ships, though, is recall at the chosen operating point, measured on a held-out set labeled exhaustively enough to trust the denominator. Suppose the requirement is 99 percent recall on faces and the model achieves it at a confidence threshold where precision is 80 percent. Per million images containing, say, three million true faces, the system blurs 2.97 million of them and misses 30,000, and at 80 percent precision the 2.97 million true blurs arrive alongside roughly 740,000 false ones, mostly statues and posters nobody mourns. Those 30,000 misses are the number the design must keep attacking through better models, lower thresholds, and the report loop described below, while the 740,000 over-blurs are close to free. Evaluation sets should be sliced by region, capture condition, and object size, because a global 99 percent can hide an 88 percent on rainy-night imagery in one country, and the sliced view is what catches that gap before a regulator or a journalist does.
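The operating-point arithmetic, and the way a global figure can hide a weak slice, can be reproduced with a short sketch. The function names are hypothetical, and the slice counts in the second half are invented to illustrate the effect, not measured values.

```python
def operating_point(true_objects: int, recall: float, precision: float) -> dict[str, int]:
    """Counts of true blurs, misses, and false blurs at a given recall and precision."""
    blurred_true = round(true_objects * recall)
    missed = true_objects - blurred_true
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    blurred_false = round(blurred_true * (1 - precision) / precision)
    return {"blurred_true": blurred_true, "missed": missed, "blurred_false": blurred_false}


def global_recall(slices: dict[str, tuple[int, int]]) -> float:
    """Pooled recall from per-slice (found, total) counts."""
    found = sum(f for f, _ in slices.values())
    total = sum(t for _, t in slices.values())
    return found / total


counts = operating_point(3_000_000, recall=0.99, precision=0.80)

# Hypothetical slices: a small rainy-night slice at 88 percent recall.
slices = {"clear_day": (946_200, 950_000), "rainy_night": (44_000, 50_000)}
pooled = global_recall(slices)
```

With these illustrative counts the pooled recall is still about 99 percent, even though the rainy-night slice sits at 88 percent, which is exactly the gap the sliced view is meant to expose.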

The batch pipeline

The architecture is a sequence of batch stages connected by queues, with the GPU detection fleet in the middle. Ingestion lands raw panoramas in blob storage and registers them in a metadata catalog; preprocessing performs the tiling; the detector fleet consumes tiles in large batches; the blur stage applies an irreversible transform to the published copies while originals stay in an access-controlled store; QA samples the output for human audit; and publication pushes blurred imagery to the serving stacks. Every stage is idempotent and checkpointed, meaning a re-run of any shard produces the same result without double-processing, which is what makes a multi-day job over a billion images operable by ordinary on-call humans. When a worker dies at three in the morning, the queue redelivers its work unit, a replacement picks it up, and the on-call engineer reads about it over coffee instead of being paged, which is the entire point of building the pipeline this way.
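The idempotency property can be illustrated with a toy stage. This is a sketch under heavy assumptions: the in-memory dict stands in for a durable output store, the `Stage` class is hypothetical, and a real system would also need atomic writes and a deterministic transform.

```python
from typing import Callable


class Stage:
    """A batch stage whose outputs are keyed by work-unit ID."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self.fn = fn
        self.outputs: dict[str, str] = {}  # stand-in for a durable, keyed output store
        self.calls = 0  # counts real executions, for demonstration only

    def handle(self, unit_id: str, payload: str) -> str:
        # Checkpoint hit: a redelivered unit returns the stored result without rework.
        if unit_id in self.outputs:
            return self.outputs[unit_id]
        self.calls += 1
        result = self.fn(payload)
        # Writing under the unit ID means a retry can only overwrite with the same value.
        self.outputs[unit_id] = result
        return result


stage = Stage(str.upper)
first = stage.handle("shard-0001", "tile bytes")
again = stage.handle("shard-0001", "tile bytes")  # queue redelivery after a worker death
```

Keying the output by work-unit ID is what lets the queue redeliver freely: the worst case for a duplicate delivery is a cheap lookup, not a second blur pass.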

The throughput math sizes the fleet and is worth doing aloud. At 50 ms of GPU time per image, covering all of its tiles after batching efficiencies, one billion images costs 50 million GPU-seconds, which is about 580 GPU-days, so a fleet of 200 GPUs finishes in roughly 3 days of wall-clock time. The pleasant property of batch work is that this dial is fully tradable, since 50 GPUs finish in under two weeks with a quarter of the hardware for the same total GPU-days, preemptible capacity works fine because checkpointed jobs tolerate interruption, and a model twice as heavy simply doubles a known number rather than blowing a latency budget. The fleet should be utilization-bound and fed by deep queues, because idle GPUs are the main way batch pipelines waste money. The dashboard the team actually watches pairs GPU utilization with queue depth, where utilization sagging while queues stay full means a feeding stage upstream has stalled.
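The fleet-sizing arithmetic is short enough to write down. The sketch assumes perfect utilization and uses the figures from the text; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def gpu_days(images: int, gpu_seconds_per_image: float) -> float:
    """Total GPU time for the job, independent of fleet size."""
    return images * gpu_seconds_per_image / SECONDS_PER_DAY


def wall_clock_days(images: int, gpu_seconds_per_image: float, gpus: int) -> float:
    """Wall-clock duration, assuming the fleet stays fully utilized."""
    return gpu_days(images, gpu_seconds_per_image) / gpus


total = gpu_days(1_000_000_000, 0.050)                    # about 579 GPU-days
large_fleet = wall_clock_days(1_000_000_000, 0.050, 200)  # about 2.9 days
small_fleet = wall_clock_days(1_000_000_000, 0.050, 50)   # about 11.6 days
heavier = gpu_days(1_000_000_000, 0.100)                  # a model twice as heavy
```

Real fleets run below full utilization, so the wall-clock figures are a floor; dividing by an expected utilization fraction gives a more honest plan.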

Figure: The batch pipeline snakes from ingest to publication, and only blurred copies ever reach serving while originals stay access-controlled. Dashed paths carry the feedback machinery, where QA audits and user reports become labels and labels become the next model version.

Blurring, originals, and storage

The blur itself deserves a moment, because not every way of hiding pixels actually hides them. The published transform must destroy information, by averaging the pixels inside the dilated ellipse or replacing them outright, since a reversible filter applied to published imagery is a privacy incident waiting for the first person with deconvolution code. Blurring also has to reach every rendition the serving stack generates, including zoom levels and thumbnails, because a face obscured at full resolution and visible in a cached thumbnail is still a miss as far as the person in the photograph is concerned.
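A destructive transform can be sketched on a plain grid of grayscale values. This is one possible choice, assumed for illustration: every pixel inside the ellipse inscribed in the box is overwritten with a single mean value, so the original values cannot be recovered from the output. A production system would work on image arrays and would likely use a less visually harsh fill.

```python
def obscure_ellipse(img: list[list[int]], box: tuple[int, int, int, int]) -> None:
    """Overwrite pixels inside the ellipse inscribed in `box` with their mean, in place.

    `box` is (x0, y0, x1, y1) in pixel indices, assumed to lie within the image.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    rx, ry = (x1 - x0) / 2, (y1 - y0) / 2
    inside = [
        (x, y)
        for y in range(y0, y1)
        for x in range(x0, x1)
        # Test the pixel centre against the ellipse equation.
        if ((x + 0.5 - cx) / rx) ** 2 + ((y + 0.5 - cy) / ry) ** 2 <= 1.0
    ]
    if not inside:
        return
    mean = round(sum(img[y][x] for x, y in inside) / len(inside))
    for x, y in inside:
        img[y][x] = mean  # many distinct inputs collapse to one value


# A 6x6 test image whose pixel value encodes its position.
image = [[10 * y + x for x in range(6)] for y in range(6)]
obscure_ellipse(image, (1, 1, 5, 5))
```

The same function would have to run over every rendition the serving stack generates, or, more safely, every rendition would be derived from the already-obscured copy.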

Originals are kept, and keeping them is not a contradiction of the privacy goal. Model improvements only pay off if a better detector can re-run on clean pixels, since re-detecting on blurred copies would compound the previous model's errors rather than correct them, so the unblurred corpus lives in a separate store gated by access controls, audit logging, and a retention policy. Every published image additionally records which model version and threshold configuration produced its blurs, which turns the vague question of what imagery a model regression touched into an enumerable list of batches, and that bookkeeping is what makes targeted reprocessing and rollback possible at all.
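The bookkeeping can be as small as one record per published image. The field names, version strings, and threshold values below are hypothetical; the point is only that a regression's blast radius becomes a query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlurRecord:
    """Provenance stored alongside each published image (hypothetical schema)."""
    image_id: str
    batch_id: str
    model_version: str
    face_threshold: float
    plate_threshold: float


def batches_touched(records: list[BlurRecord], model_version: str) -> list[str]:
    """Enumerate the batches whose blurs were produced by a given model version."""
    return sorted({r.batch_id for r in records if r.model_version == model_version})


records = [
    BlurRecord("img-1", "batch-A", "det-v7", 0.15, 0.20),
    BlurRecord("img-2", "batch-A", "det-v7", 0.15, 0.20),
    BlurRecord("img-3", "batch-B", "det-v8", 0.15, 0.20),
    BlurRecord("img-4", "batch-C", "det-v8", 0.12, 0.20),
]
to_rerun = batches_touched(records, "det-v8")
```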

Choosing the operating point

A detector emits a confidence score per box, and the system's behavior is defined by what it does in each score band, which makes the operating point a product decision expressed as thresholds rather than a modeling fact. Everything above a low threshold gets blurred automatically, and because the operating point favors recall, that threshold sits far lower than a typical detector deployment would choose, accepting the statue-blurring false positives computed earlier. Thresholds are also set per class rather than shared, since faces and plates have differently shaped score distributions and the consequences of missing each differ by jurisdiction. A band of genuinely ambiguous detections just below the auto-blur line can be routed to human review, but the arithmetic has to be respected, because at billion-image scale even a band capturing 0.5 percent of images is five million reviews, and a reviewer who clears two thousand images a day makes that 2,500 reviewer-days. The review band must therefore be kept narrow, sampled, or reserved for high-sensitivity regions, and the design should say so out loud rather than waving at the idea that humans will check.
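The review arithmetic is worth keeping as a function, because the band width is the only input the design controls. The helper name is hypothetical and the figures are the ones quoted above.

```python
def reviewer_days(images: int, band_fraction: float, images_per_reviewer_day: int) -> float:
    """Human effort needed to review the share of images that falls in the review band."""
    return images * band_fraction / images_per_reviewer_day


full_band = reviewer_days(1_000_000_000, 0.005, 2_000)     # 0.5 percent of images
narrow_band = reviewer_days(1_000_000_000, 0.0005, 2_000)  # a band ten times narrower
```

Narrowing the band tenfold cuts the effort tenfold, which is why keeping it narrow, sampling it, or reserving it for high-sensitivity regions are the only realistic ways to afford review at this scale.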

Reading the thresholds off a precision-recall curve makes the choice concrete. If the curve shows 99 percent recall at 80 percent precision with a score cut of 0.15, and 99.5 percent recall at 55 percent precision with a cut of 0.05, then the band between 0.05 and 0.15 holds a disproportionate share of the remaining misses, and routing exactly that band to review buys most of the last half-point of recall at a bounded human cost. Below the lower cut, boxes are dropped, and the residual misses down there are what the report loop exists to catch. Every published image carries a report-a-problem affordance, so a person who finds their own face unblurred can flag it in a few taps, reported faces are blurred through a priority queue with a turnaround target of hours rather than weeks, and each report lands in the training set as a labeled false negative, which makes the loop the system's recall safety net and its best source of hard examples at the same time.
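The three score bands translate into a small router. The face cuts below are the 0.05 and 0.15 from the example curve; the plate cuts are placeholders invented to show that thresholds are set per class, and the names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    BLUR = "blur"
    REVIEW = "review"
    DROP = "drop"


# (review_cut, blur_cut) per class. Plate values are placeholders, not from the text.
THRESHOLDS: dict[str, tuple[float, float]] = {
    "face": (0.05, 0.15),
    "plate": (0.08, 0.20),
}


def route(cls: str, score: float) -> Action:
    """Map a detection to auto-blur, human review, or drop by its score band."""
    review_cut, blur_cut = THRESHOLDS[cls]
    if score >= blur_cut:
        return Action.BLUR
    if score >= review_cut:
        return Action.REVIEW
    return Action.DROP


decisions = [route("face", s) for s in (0.90, 0.15, 0.10, 0.04)]
```

Keeping the cuts in a versioned table rather than in code makes the operating point auditable, which matters because it is a product decision as much as a modeling one.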

Figure: Each tile enters the detector, every candidate box receives a confidence score, and the router applies the recall-first thresholds, auto-blurring everything plausible, sending a narrow ambiguous band to review, and dropping the rest, where the report loop stands guard.

Online evaluation and operations

There is no A/B test in the usual sense, since users do not experience model variants, so online quality means statistical auditing. A continuous sampling process draws published images, stratified by region, capture batch, and conditions, and sends them to trained reviewers who count missed faces and plates, which yields an estimated shipped recall with confidence intervals per slice. Report rates per million views serve as the always-on, if noisy, recall signal between audits, and a rising report rate in one country is usually the first alert that a regional regression shipped. For the person filing the report, the experience should be a short form, a confirmation, and a visibly blurred image within hours, because that turnaround is the only part of this entire system a harmed individual ever sees. Drift is real in this domain, because new camera generations change optics and color response, new vehicle mounts change viewing geometry, and plate designs change with policy, so the audit slices and the retraining set must track capture hardware versions explicitly. What the on-call engineer watches day to day is duller and just as load-bearing: queue depths between stages, GPU fleet utilization, per-shard failure and retry rates, and the report-queue turnaround time. A pipeline failing quietly for a week can cost a capture season its deadline.
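A per-slice recall estimate with an interval can be sketched with the Wilson score interval. The choice of Wilson is an assumption for this example, picked because it behaves well when the proportion is close to 1; the audit counts are invented.

```python
import math


def wilson_interval(found: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95 percent coverage."""
    if total == 0:
        return (0.0, 1.0)  # no evidence: the interval is uninformative
    p = found / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical audit slice: reviewers found 1,000 faces, of which 990 were blurred.
low, high = wilson_interval(990, 1_000)
# Ten times the audit volume at the same observed recall tightens the bound.
low_big, high_big = wilson_interval(9_900, 10_000)
```

The lower bound is the number to act on: a slice whose point estimate is 99 percent but whose lower bound is 98.2 percent has not yet demonstrated that it meets a 99 percent requirement.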

When a better model arrives, reprocessing the back catalog is a genuine strategic question, because re-running a billion-image fleet is not free. Model artifacts are versioned, every published image records which model version blurred it, and reprocessing is prioritized rather than blanket, so imagery in dense population centers, regions with elevated report rates, and slices where the audit shows weak recall get re-run first, while empty highway imagery from a strong model version can wait years without harm. The versioned-artifact discipline also enables the unglamorous rollback story, because if a new model ships a regression, the blast radius is the set of batches it processed, which is an enumerable and re-runnable list rather than a mystery.
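Prioritized reprocessing can be sketched as a sort over per-batch statistics. Everything here is an assumption for illustration: the fields, the normalization to the range 0 to 1, and the equal weighting of the three signals would all need tuning against real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchStats:
    """Signals used to rank a batch for reprocessing (hypothetical schema)."""
    batch_id: str
    population_density: float  # normalized to 0..1
    report_rate: float         # normalized to 0..1
    audited_recall: float      # 0..1, from the audit slice covering this batch


def priority(b: BatchStats) -> float:
    """Higher means re-run sooner. Equal weights are placeholders, not a recommendation."""
    return b.population_density + b.report_rate + (1.0 - b.audited_recall)


def reprocess_order(batches: list[BatchStats]) -> list[str]:
    """Batch IDs sorted from most to least urgent."""
    return [b.batch_id for b in sorted(batches, key=priority, reverse=True)]


order = reprocess_order([
    BatchStats("empty-highway", 0.05, 0.00, 0.995),
    BatchStats("dense-city", 0.90, 0.60, 0.970),
    BatchStats("suburb", 0.40, 0.10, 0.985),
])
```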

Follow-up questions

Why favor a heavy two-stage detector here when one-stage models dominate production serving? The pipeline is offline, so per-image latency buys nothing, while small-object accuracy buys recall, which is the product requirement itself. Compute cost scales linearly and predictably in batch, so a model twice as heavy is a budget line rather than an outage, and an ensemble that would be unthinkable at thirty frames per second is just another multiplier in the fleet-sizing arithmetic.
Why blur statues and billboard faces instead of teaching the model the difference? The cost asymmetry makes the distinction worthless, since an over-blurred ad harms nobody, while teaching the model to suppress face-like objects risks suppressing real faces that resemble them. Spending the modeling effort on recall, and letting the labeling guideline call printed faces faces, is the cheaper and safer policy.
How do you know shipped recall if you cannot label a billion images? Through stratified sampling, which means auditing a few thousand published images per region per cycle with exhaustive human labeling, reporting recall with confidence intervals per slice, and using report rates as the continuous proxy between audits. The audit cannot certify every image, but it can bound the miss rate tightly enough to act on.
Why must the blur be irreversible, and why keep originals at all? A reversible filter is a privacy incident waiting for an attacker, so the published pixels must destroy the information outright. Originals are retained separately under access control because model improvements require re-detection on clean pixels, and reprocessing already-blurred imagery would compound the previous model's errors instead of correcting them.
What breaks when a new camera generation rolls out? The input distribution shifts underneath the model, since different optics, resolution, and color response quietly lower recall with no error surfacing anywhere. The defenses are audit slices keyed to hardware version, early labeling of new-camera imagery, and holding back fleet-wide rollout until sliced recall recovers to the old hardware's level.
Where would you spend the next engineering quarter? On the report loop and the audit machinery rather than the architecture, meaning faster report-to-blur turnaround, automatic conversion of reports into hard training examples, and tighter per-slice recall estimation, because those investments compound into recall at every retrain while another architecture swap buys a single step.

References

Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015).
Lin et al., Focal Loss for Dense Object Detection (2017).
Lin et al., Microsoft COCO: Common Objects in Context (2014), the standard reference for detection evaluation and mAP.
Aminian and Xu, Machine Learning System Design Interview (2023).