Design a visual search system

Visual search is the feature behind Pinterest Lens and Google Lens. A user points a camera at a lamp, or taps a lamp inside a photo someone else posted, and the system returns a grid of visually similar lamps they can save or buy. The query is an image rather than a string of words, which removes the user's burden of describing a shape, a texture, or a style they may not have vocabulary for, and that relief from description is exactly why commerce and discovery platforms want the feature. Interviewers like the question because it cannot be solved with a classifier and a lookup table, and it forces a candidate through representation learning, billion-scale nearest-neighbor retrieval, and the unglamorous indexing machinery that keeps results fresh while the catalog changes underneath the model every day.

The hard parts live in three places. Learning an image representation in which visual similarity is actually measurable comes first, because nothing about raw pixels makes two photographs of similar lamps numerically close, and a brass lamp shot in daylight differs from the same lamp shot at dusk in almost every pixel value. Searching a billion catalog items in tens of milliseconds without comparing the query against each one of them comes second. The third part is operational rather than algorithmic, since new items arrive constantly, the embedding model itself gets retrained, and every model change quietly invalidates the entire index. The walkthrough below takes those in the order a thirty-minute interview would, with framing and data first, models and evaluation in the middle, and serving plus the machinery that keeps the system alive at the end.

Scope and requirements

I would start by pinning the product behavior, because "visual search" covers several different products and each pulls the design in a different direction. Here the user submits a photo or taps a region of an existing image, and the system returns a ranked grid of visually similar catalog items, where the catalog means pins, products, or listings depending on the platform. A photo usually contains several objects, a sofa and a lamp and a rug in a single living-room shot, so the system should let the user pick which detected object to search, and that requirement puts object detection at the front of the pipeline. The experience should feel like pointing, tapping a highlighted box, and seeing a grid appear within a beat, and results should respect availability and region, because a perfect visual match that cannot be shipped to the user's country wastes its slot, which argues for a re-ranking stage on top of pure visual similarity.

For scale, a reasonable set of numbers to agree on is a catalog of one billion items, roughly ten million new items arriving per day, and query traffic peaking around 5,000 searches per second. Ten million daily arrivals works out to about 116 items per second on average, with bursts several times higher when large sellers upload in batches, a number that matters when the insertion pipeline gets sized. The latency budget matters because this is an interactive camera feature, and a response that takes more than a third of a second reads as broken in a way the same delay in a web search does not, so I would target 250 milliseconds at the 95th percentile for the backend, leaving room for network and rendering on top. Out of scope, and worth saying so explicitly, are text search, deep personalization beyond light re-ranking signals, and exact-duplicate detection, which is a related but different problem solved with hashing rather than learned similarity.

Framing the problem as a machine learning task

The tempting first framing is classification, meaning a model that labels the query image as "mid-century table lamp" so the system can return catalog items carrying that label. It breaks down quickly, for reasons that motivate everything that follows. The catalog gains millions of items daily and the space of visual concepts is unbounded, so any fixed label set is stale the week it ships, and human labeling can never keep up with catalog churn. Classification also throws away exactly the information the user is querying with, because two lamps in the same class can look nothing alike, and the user asked for things that look like this one rather than things that share its noun. A label is a lossy compression of appearance, and this product is entirely about appearance.

The framing that works is representation learning plus nearest-neighbor retrieval. An embedding is a learned vector of numbers, say 256 floating point values, produced by a neural network from an image, with the defining property that distance between vectors tracks visual similarity, so similar-looking images land close together and dissimilar ones land far apart. Under that framing, the input to the system is the query image or a cropped object inside it, the model's output is the query's embedding, and retrieval becomes a geometry problem of finding the catalog items whose embeddings sit nearest the query embedding. New catalog items need no labels at all, only a pass through the embedding model, which is what lets the framing survive a changing catalog where classification could not. The model objective, then, is not to predict a class but to shape the embedding space so that nearness means similarity, and the next two sections cover where the training signal for that comes from and how the model is trained on it.

Data and labels

The supervision is mostly implicit, harvested from behavior rather than annotators. When a user runs a visual search and then clicks or saves one of the results, that query-result pair is weak evidence of visual similarity, and the platform collects millions of such pairs daily at no labeling cost. Co-engagement supplies a second stream, since images that users repeatedly save to the same boards, and product photos of the same item uploaded by different sellers, form natural positive pairs without any search ever happening. Data augmentation supplies a third source with no logging at all, because a crop, a horizontal flip, or a mild color shift of an image should embed near the original, and these synthetic pairs teach the model invariances, such as indifference to lighting and framing, that engagement data underrepresents.

Each source carries its own noise, and naming that noise matters. Engagement conflates similarity with attractiveness, since users click pretty results and popular items rather than the most visually faithful ones, and results shown higher get clicked more simply for being higher, which is position bias, the tendency of click data to reward rank rather than relevance. The mitigation is to log the position at which every result was shown and either down-weight clicks earned at top positions or model the position effect explicitly during training. Same-product pairs run clean but narrow, teaching instance matching, the skill of recognizing the same lamp again, when the product also needs style similarity, the skill of recognizing lamps that belong together. The practical answer mixes all three sources and keeps a separate human-labeled evaluation set, a few thousand query images each paired with the catalog items that trained raters, working from written guidelines about what counts as visually similar, marked as good matches. That rater set never trains the model; it exists so quality gets judged against something cleaner than the training signal itself.

Embeddings and representation

The central representation is the image embedding itself. A 256-dimensional vector is a common working point, L2-normalized so that every vector has length one, which makes cosine similarity, the angle between vectors, the distance measure and simplifies the index. Dimension is a real trade rather than a free parameter. Moving to 512 dimensions buys a little accuracy while doubling index memory and search cost, which at a billion items means half a terabyte of additional raw vectors, and dropping to 128 halves the footprint while giving up some recall, so 256 with normalization is the defensible middle that the sizing math later in this design assumes.

Visual similarity alone does not make a good results page, so a second family of features serves the re-ranking stage, covering item category, price, availability, seller region, and engagement statistics such as historical click and save rates. These live in a feature store, a system that computes and serves the same feature values to offline training and online serving so the model sees identical inputs in both places, and the re-ranker reads them per candidate after retrieval, when a few hundred lookups are affordable. Keeping these signals out of the embedding is deliberate, because baking popularity into the vector would distort the geometry that retrieval depends on, and a popular but visually wrong item would begin crowding out faithful matches at the retrieval stage, where no later stage could ever recover them.

Model architecture

The baseline worth stating first is an off-the-shelf backbone, a ResNet or vision transformer pretrained on a large public dataset, used as a frozen feature extractor with its penultimate layer as the embedding. It works surprisingly well, costs nothing to train, and is the right first milestone for a team standing the system up, but its notion of similarity reflects the pretraining task rather than the product's. A pretrained classifier considers two photos similar because both contain a dog, while the product needs them similar because both show low-pile geometric rugs, and closing that gap is what the training budget exists to do.

The production model is trained with contrastive learning, which is the plain idea of teaching by comparison. Take a pair of images that should be similar, an engagement pair or an augmented copy, take a set of images that should not be, for which the other images in the same training batch serve nicely, and adjust the network so the positive pair's embeddings move closer together while the negatives get pushed apart. Training on large batches gives hundreds of free negatives per example, and mining hard negatives, items that look deceptively close but were not engaged with, sharpens the space exactly where retrieval mistakes would otherwise happen. The backbone choice is a secondary decision and I would say so in the interview, since a convolutional network is cheaper to serve and well understood, a vision transformer typically wins a few points of recall at higher compute, and either slots into the same contrastive recipe.

In front of the embedding model sits an object detector, a model that outputs bounding boxes around objects rather than one label for the whole image. Real query photos contain whole rooms, and embedding the entire scene produces a vector that matches other whole rooms instead of the lamp the user cares about, a failure that shows up in testing as eerily good room-to-room matches and useless object results. The pipeline therefore detects shoppable objects, lets the user confirm or pick one, crops it, and embeds the crop, and the same detect-crop-embed treatment is applied to catalog images at indexing time so that query and index vectors both describe objects rather than scenes. Any asymmetry there, objects on one side and scenes on the other, would quietly cost recall with nothing visible to point at. The detector can stay small and fast since it only needs coarse boxes over a few dozen object groups, and its 20 milliseconds in the budget buys the entire premise of object-level search.

Offline evaluation

The retrieval stage is measured with recall@K, which asks what fraction of the items that should have been retrieved actually appear in the top K results. The human-labeled set makes this concrete. Suppose a query image of a brass floor lamp has five rater-approved similar items in the catalog, and the system's top 10 contains three of them; recall@10 for that query is 3/5 = 0.6, and averaging this over, say, 2,000 labeled queries gives the model's score. A move from 0.58 to 0.64 average recall@10 means roughly one extra good match for every other query, the kind of offline win that justifies an online test. Engagement-based evaluation sets, built from logged clicks rather than raters, are larger and cheaper but inherit the biases of the logging policy, because they can only contain items an earlier model already surfaced, so a genuinely better model gets penalized for retrieving good items the old system never showed. Reporting both numbers, with the human-labeled one as the headline, is the discipline that keeps the team from optimizing into its own logging shadow.

Validation splits need the same care that the metrics do. Splitting by query image alone leaks near-duplicates between train and test, since the same product photographed twice can land on both sides and make memorization look like generalization, so the split should group by product or by user session before dividing. Evaluation queries should also postdate the training window whenever engagement pairs are involved, because a model graded on clicks that happened before its training data closed is being graded on answers it has already seen.

The serving architecture

Serving is a short pipeline with a strict budget. The query image hits the detector, the chosen crop hits the embedding model, the embedding fans out to the ANN index shards, and the merged candidates pass through the re-ranker. ANN stands for approximate nearest neighbor search, and the idea is that instead of comparing the query vector against all one billion index vectors exactly, the index uses a precomputed structure to examine only a tiny fraction of them, trading a small loss of recall, typically a few percent, for a speedup of several orders of magnitude. HNSW builds a layered graph over the vectors and answers queries by greedily walking from node to neighboring node toward the query. IVF instead clusters the vectors ahead of time and searches only the handful of clusters whose centers lie nearest the query, which is more compact in memory but ties answer quality to how well the clusters still fit the data.

The memory math decides the shape of the fleet, so it is worth doing aloud. One billion vectors at 256 dimensions in float16, two bytes per value, comes to 1,000,000,000 × 256 × 2 bytes = 512 GB of raw vectors, and graph or cluster overhead adds roughly half again, call it 768 GB total. No single machine holds that, so the index is sharded 16 ways at about 48 GB per shard, each fitting comfortably in a 64 GB host, and every shard is replicated twice for availability and throughput. Replication is not generosity, because each of the 5,000 queries per second fans out to all 16 shards, so every shard sees the full query rate, and two replicas split that load while letting one die without dropping the shard's slice of the catalog. A query goes to all 16 shards in parallel, each returns its local top 100, and the merger keeps the global best 400 or so for re-ranking.

The latency budget walks through the path stage by stage. Detection takes about 20 ms, embedding about 10 ms on a GPU host with small-batch queueing, the ANN fan-out about 30 ms, re-ranking 400 candidates with metadata and engagement features about 15 ms, and the network hops plus serialization between stages roughly 25 ms, which lands near 100 ms and leaves headroom inside the 250 ms target for retries and queue spikes. The fan-out deserves a second look because the slowest of the 16 shards sets the pace, so per-shard tail latency matters more than the average, and one shard pausing briefly for memory management makes the whole feature stutter. Hedged requests, where the merger re-sends a shard's query to its replica if the first copy has not answered within about 20 ms, convert that tail into a modest amount of duplicate work, and when a shard misses its deadline entirely, the merger can return results from the other 15, which degrades recall by a few percent rather than failing the query.

In the serving path the detector finds objects, the chosen crop becomes a query vector, every ANN shard answers in parallel, and the re-ranker orders the merged candidates with catalog metadata, with the whole journey fitting inside roughly 100 ms.

Compressing the index

The float16 layout above is the straightforward choice, and the next rung down the cost ladder is worth knowing because index memory is usually the largest line item in this system's budget. Product quantization is a compression scheme that chops each vector into sub-vectors, say 32 chunks of 8 dimensions each, and replaces every chunk with the identifier of its nearest entry in a small learned codebook, so a vector is stored as 32 one-byte codes instead of 512 bytes of floats. That sixteenfold reduction shrinks the billion-vector index from 512 GB to about 32 GB, small enough for a pair of hosts, and distances are computed approximately against the codes at a cost of a few points of recall. The standard recovery for that lost recall is a two-pass design in which the compressed index nominates a generous candidate set, perhaps the top 1,000, and the exact float vectors, fetched for just those candidates, re-score them before the merge.

I would still ship float16 HNSW first because its operational story is simpler, and then name the conditions under which the compressed design wins, namely catalog growth toward ten billion items, multi-region replication multiplying every gigabyte, or plain cost pressure. Disk-resident ANN designs push the same trade one step further by keeping compressed vectors in memory and full vectors on fast local flash storage, serving catalogs well past memory limits at some latency cost.

The indexing pipeline

The index is a living thing, and most of the operational difficulty lives here rather than in serving. New catalog items flow through the same detect-crop-embed treatment as queries, and their vectors are inserted into the live shards within seconds to minutes of ingest, because a marketplace where today's listings are unsearchable until tomorrow loses exactly the freshest inventory users want, and sellers notice within hours when new uploads draw no traffic. HNSW supports incremental insertion naturally, since a new node simply links into the existing graph, while IVF accepts appends into existing clusters at some quality cost until the next rebuild, because the new items land in clusters that were fitted to an older snapshot of the data. Deletions for sold-out or removed items are handled with tombstones, markers that hide an entry from results until a rebuild physically removes it, which keeps deletes cheap at the price of slowly accumulating dead weight in the structure.

Periodic full rebuilds remain necessary even with streaming inserts, because incremental graph structures degrade, cluster assignments drift away from the data, and tombstones pile up, and a weekly rebuild compacts all three problems at once. The event that forces a rebuild of everything is a new embedding model version, and the reason deserves a plain statement in the interview. Two separately trained models produce vector spaces that are simply not comparable, since each training run settles on its own arbitrary orientation of the space, so a query embedded with model vN+1 lands in meaningless places in an index built from vN vectors, and nothing errors while the results turn to noise. Shipping a new model therefore means re-embedding the full billion-item catalog, and the cost is worth computing aloud, since at 5 ms per item on a GPU, one billion items is 5 million GPU-seconds, which is about 58 GPU-days, so a 100-GPU fleet finishes in roughly 14 hours. Fresh shards are built alongside the live ones, validated against a canary set of probe queries whose expected neighbors are known, and traffic is cut over atomically, with the old shards kept warm briefly so a rollback is a single switch flip rather than another 14-hour job.

The top row keeps the live index fresh as items arrive, while the bottom row retrains the embedding model from engagement pairs. Dashed paths fire on a model version change, when the catalog gets re-embedded, fresh shards are built and shipped, and the serving model is swapped.

Online evaluation and operations

Offline recall is a proxy, so changes graduate through A/B tests measured on what users actually do with the results, namely click-through rate on the result grid, save rate, and add-to-cart or purchase where commerce applies, with save rate usually the most stable signal of genuine visual relevance. Interleaving offers a sharper instrument for ranking changes, and it works by blending results from the old and new rankers into a single grid shown to the same user and observing which ranker's items earn the clicks, which detects a preference with far less traffic than a standard A/B split because every session compares both systems directly.

Retraining runs on a regular cadence, for which monthly is a reasonable default, with off-cycle retrains when monitoring says the world has moved. Training-serving skew, where the model sees differently processed inputs offline and online, is the classic silent failure for image systems. If training resized images with one library and serving resizes them with another, the embeddings shift just enough to bleed recall with no error logged anywhere, so the preprocessing code should be one shared artifact, exercised by a test that embeds a fixed image set through both stacks and asserts near-identical vectors. Production monitoring watches embedding norm and dimension statistics for drift, replays the canary probe queries against every shard, and alerts on per-shard tail latency and recall divergence. For the on-call engineer, a healthy system looks like a flat norm histogram, canary neighbors matching yesterday's, and an index age distribution tracking ingest volume, and when the insertion pipeline stalls, the age metric is what fires, hours before any seller writes in asking why this morning's listings are invisible.

Follow-up questions

Why not classify the query image and look results up by label? The catalog changes daily and the concept space is unbounded, so labels are perpetually stale, and classification discards the fine-grained appearance the user is querying with. An embedding plus nearest-neighbor design needs no labels on new items, only a forward pass through the model, which is why it survives ten million daily arrivals.
Why is a new model version so expensive to ship? Independently trained models produce incompatible vector spaces, so every catalog item must be re-embedded and every index shard rebuilt before the new model can serve a single query. The mitigation is engineering the cutover well rather than avoiding it, meaning a parallel index build, validation against the canary query set, an atomic traffic flip, and a fast rollback path, sized by the re-embedding arithmetic worked out above.
How would you handle a query photo with five objects in it? Run the detector, surface the detected boxes to the user, and search the object they pick, defaulting to the most salient box so a first result grid appears without an extra tap. Indexing applies the same per-object treatment, which keeps query and catalog vectors describing objects rather than scenes and is what makes the crops comparable in the first place.
HNSW or IVF for the index? HNSW gives better recall-latency trade-offs and natural incremental inserts, at a higher memory overhead from storing the graph, while IVF is more compact and pairs well with product quantization but wants periodic retraining of its clusters as the catalog drifts. With 64 GB hosts and a freshness requirement, HNSW per shard is the default I would pick first, revisiting IVF with compression if memory cost becomes the dominant complaint.
Where does the system quietly lose quality without erroring? Preprocessing skew between training and serving shaves recall invisibly, a stalled insert pipeline ages the index while queries keep succeeding, and engagement-based evaluation flatters the incumbent model by only containing items it chose to show. All three are monitoring problems before they are modeling problems, which is exactly why the canary queries, the index age metric, and the rater-labeled evaluation set exist.
How do you evaluate without trusting clicks? Keep a rater-labeled set of query-to-similar-item judgments as the headline recall@K metric, refreshed as the catalog's style mix changes, and treat engagement-derived sets as a large but biased supplement, since they only contain items earlier systems chose to show and therefore cannot credit a model for finding what the old one missed.

References

Zhai et al., Learning a Unified Embedding for Visual Search at Pinterest (KDD 2019).
Malkov and Yashunin, Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (2016).
Chen et al., A Simple Framework for Contrastive Learning of Visual Representations (2020).
Facebook AI Research, FAISS: a library for efficient similarity search.
Aminian and Xu, Machine Learning System Design Interview (2023).