Design a search service · Stanley Jacob

Almost every product eventually grows a search box, and behind it sits a service that takes a few words of free text and returns the most relevant documents in well under a second. The scope here is deliberately a product's own corpus, meaning the listings of a marketplace, the posts of a forum, or the items of a store, rather than the open web, because crawling and ranking the web is a different problem carrying a different article's worth of machinery. The interesting question for a product search service is how to turn "blue running shoes" into a ranked result list over fifty million documents, quickly and repeatably, while the documents themselves keep changing underneath the index.

Interviewers like this design because the database everyone already has is the wrong tool, and explaining precisely why forces real understanding rather than recitation. A SQL LIKE '%running%' query scans every row, because a leading wildcard defeats any ordinary index, it produces matches without any notion of ranking, and it has no idea that "running" and "runs" express the same idea. The hard parts of the design live in the index structure that makes text lookup fast, in the scoring function that makes the results worth reading, and in the distributed query that stitches shards together without wrecking either latency or ranking quality.

Scope and requirements

Functionally, the service indexes documents that mix text fields with structured fields such as price, category, and timestamps, answers keyword queries with ranked results, supports filters like category and price range alongside the text match, and paginates through long result lists. Updates should become searchable quickly, and the word quickly is worth negotiating early, because a few seconds of delay between writing a listing and finding it in search is acceptable for almost every product, and conceding those seconds buys the entire indexing architecture described below. Facets, the counts-by-category sidebar showing 1,204 results under one brand and 988 under another, come up in most product searches, so they stay in scope at the level of how they ride along with the main query.

For the non-functional shape, assume 50 million documents averaging 2 KB of indexable text, 20 million searches per day, and a latency target of about 200 ms at the 95th percentile for the full search round trip. Twenty million searches spread over 86,400 seconds is about 230 queries per second on average, and planning for a peak of four to five times the average means designing for roughly 1,000 queries per second. Reads dominate writes here the way they do in most search workloads, with perhaps a few hundred document updates per second arriving against a thousand queries, and that ratio is what licenses an architecture where writes do extra work, in analysis and segment building, so that reads can do almost none.

The inverted index, with a worked example

The core data structure inverts the document-to-words relationship into words-to-documents, which is why it is called an inverted index. For every term in the corpus, the index stores a postings list, which holds the IDs of the documents containing that term in sorted order, usually together with the term's frequency in each document and often the positions where it occurs. Three tiny documents make the structure concrete. Document 1 reads "blue trail running shoes," document 2 reads "red running shoes," and document 3 reads "blue suede boots." The postings list for running is [1, 2] and the postings list for blue is [1, 3]. A query for "blue running" walks both lists, and because both are sorted, it intersects them in a single linear merge, the same two-pointer scan used to merge sorted arrays, finding document 1 without ever looking at document 3's text. At fifty million documents the lists grow longer but the algorithm is unchanged, and the sortedness is what keeps ANDs, ORs, and skip-ahead operations efficient, since every list is consumed in one forward pass.

Terms do not come from the raw text directly but out of an analysis pipeline. Analysis tokenizes the text into words, lowercases them, drops stop words, the extremely common words like "the" that match everything and rank nothing, and stems each word, meaning it reduces the word to a root form so that "running," "runs," and "ran" all become run. One rule about analysis is non-negotiable, namely that exactly the same pipeline must run at index time and at query time, because the index stores run, and a query that searches for the unanalyzed Running would find nothing at all. Most search defects in production trace back to a mismatch between the two sides of that rule, often introduced when someone improves an analyzer without rebuilding the index, which is also why changing an analyzer is a rebuild-the-world operation, a point the operations section returns to with an actual procedure.

The query API

GET /v1/search?q=blue+running+shoes&category=footwear&price_max=150&size=20
→ 200 {
  "total": 8412,
  "took_ms": 41,
  "hits": [
    { "doc_id": "listing-99412", "score": 14.2, "title": "Blue trail running shoes" },
    { "doc_id": "listing-10293", "score": 12.8, "title": "Lightweight blue runners" }
  ],
  "facets": { "brand": { "asics": 1204, "nike": 988 } },
  "search_after": ["12.8", "listing-10293"]
}

The search_after cursor in the response is the pagination mechanism, and the reason it exists instead of a page number deserves its own discussion below, because it is one of the places where a distributed search service quietly diverges from single-machine intuition. Returning took_ms to clients is deliberate as well, since consuming teams build dashboards on it and the per-query timing becomes the cheapest distributed tracing the service will ever get.

The high-level architecture

The corpus is too large and too hot for one machine, so it is split into shards, each holding a full, self-contained index over a subset of the documents, with documents assigned to shards by hashing their IDs. Writes flow through a queue rather than hitting shards directly, so the services that own the documents publish creates, updates, and deletes, and an indexer consumes the stream, runs the analysis pipeline, and writes each document to the shard that owns its hash. Inside each shard, incoming documents batch in memory and open to queries on a refresh interval of about a second, which is the near-real-time compromise, where a freshly written document becomes searchable a heartbeat after it lands rather than instantly, in exchange for a read path that never takes a lock. Queries travel the opposite route through a coordinator that fans the query out to every shard and merges what comes back. This is recognizably the architecture inside Elasticsearch, where each shard is a Lucene index, and naming that embodiment in an interview anchors the design in something the interviewer can verify rather than a private invention.

The top lane carries the write path, where document changes queue up and the indexer analyzes each one and routes it to the shard owning its ID hash. The bottom lane carries the read path, where the coordinator scatters every query across all shards and merges their top results into the global answer.

Inside a shard: segments and the index structure

Within a shard, the index consists of a term dictionary pointing at postings lists, and both live in immutable segment files. New documents accumulate in an in-memory buffer and flush as a fresh small segment on each refresh, which is why writes never modify existing files. Deletes work by tombstone, a small marker recording that a document is dead, checked at query time so the document stops appearing immediately even though its bytes remain on disk, and a background merge process periodically rewrites several small segments into one larger one, dropping the tombstoned documents for good. Immutability is the load-bearing property in all of this, because queries can run lock-free against files that never change underneath them, and the operating system can cache those files as aggressively as it likes with no invalidation logic. The merge process is the price of the arrangement, a steady background I/O load that operations has to budget for, and a shard that falls behind on merging accumulates so many small segments that every query pays a per-segment overhead, which is one of the quiet ways search latency degrades while nothing is actually down.

Each dictionary entry records its document frequency, written df, which counts how many documents contain the term, and points to a postings list whose entries pair a document ID with the term frequency, written tf, inside that document. Sorted IDs are what make intersecting two lists a linear merge.

Relevance: from TF-IDF intuition to BM25

Matching finds candidates and scoring orders them, and the intuition behind scoring has two halves. A document that mentions a query term many times is probably more about that term, which is the term frequency half. A term that appears in few documents carries more information than one that appears everywhere, which is the inverse document frequency half, and a shoe store's catalog makes it vivid, since "shoes" there says almost nothing about relevance while "waterproof" narrows the field sharply. TF-IDF multiplies the two together, and BM25 is the refinement that production systems actually run, adding two corrections with a tunable knob each. The knob called k1, usually set near 1.2, saturates term frequency so that the tenth occurrence of a word adds far less than the second, because a keyword-stuffed document should not win on repetition alone. The knob called b, usually 0.75, normalizes by document length relative to the corpus average, because three mentions inside a 50-word title and blurb signal more than three mentions buried in a 500-word essay.

A small worked comparison shows the machinery moving. The term-frequency part of BM25 is tf times (k1 + 1) over (tf + k1 times (1 minus b plus b times dl over avgdl)), where dl is the document's length and avgdl the corpus average, which here is 100 words. Document A has tf 3 and length 50, so its denominator works out to 3 + 1.2 times (0.25 + 0.75 times 0.5), which is 3.75, and the term contributes 6.6 over 3.75, about 1.76 times the term's IDF. Document B has the same tf of 3 but length 500, so its denominator is 3 + 1.2 times (0.25 + 3.75), which is 7.8, and the contribution falls to 6.6 over 7.8, about 0.85 times IDF. The two documents mention the term equally often, yet B scores roughly half of A purely because its mentions are diluted across ten times the text, which matches the judgment a human reader would make. Real ranking then layers product signals on top of the text score. A recency boost multiplies scores by a decaying function of document age so that fresh listings surface, and a popularity boost folds in click or sales counts, with the standing caution that popularity feedback loops entrench incumbents, since whatever ranks first gathers the most clicks and thereby keeps ranking first, so the boost needs damping and a little deliberate exploration.

The distributed query: scatter, gather, and pagination

Because documents are sharded by hash, every shard holds a random slice of the corpus, and a query must visit all of them. The coordinator broadcasts the query, each shard runs it locally against its own index and returns only its top results with their scores, and the coordinator merges those small lists into the global answer. A two-shard example with a request for the top 3 makes the merge concrete. Shard 1 returns A at 9.1, B at 8.4, and C at 7.9, while shard 2 returns D at 8.8, E at 8.0, and F at 6.5, so the merged top 3 is A, then D, then B, assembled from twelve numbers rather than from two full result sets, and that smallness is what keeps the gather step cheap no matter how large the corpus grows. The latency consequence is less friendly, because the user waits for the slowest shard, and the 95th percentile of the maximum across 10 shards is governed by each shard's worst moments rather than its typical ones. Tail-latency techniques earn their complexity here, the most useful being hedged requests, where the coordinator sends a duplicate query to a replica when the first copy is slow to answer and takes whichever response arrives first.

Pagination hides a trap called deep paging. Page 100 with 20 results per page asks for results 1,981 through 2,000, but under scatter-gather each shard must return its own top 2,000 candidates, because any one of them might rank globally above the cutoff, so a 10-shard cluster materializes and sorts 20,000 entries to serve 20, and the cost keeps growing linearly with page depth. The fix is the search_after cursor from the API, where the client passes the sort values of the last result it has seen instead of an offset, and each shard returns only its top 20 entries ranking strictly after that boundary, which makes page 100 cost the same as page 1. The product concession is that users cannot jump to an arbitrary page number, a capability virtually no search user exercises, which is why every large search product has quietly made the same trade. Filters ride along inexpensively because the structured fields are indexed too, so a price ceiling becomes one more postings intersection, and facet counts are computed per shard and summed by the coordinator during the same gather, which is why the sidebar costs almost nothing extra.

A latency budget for one query

Walking one query through the system shows where the 200 ms target actually goes and which parts of it the design controls. The query leaves the user's device and spends 30 to 60 ms on the network reaching the API edge, a cost the service can only influence by being deployed close to its users. The coordinator parses the query and runs the analysis pipeline in well under a millisecond, then scatters to all 10 shards in parallel, paying one intra-datacenter round trip of about half a millisecond each way. Each shard intersects the postings for the analyzed terms, scores the candidates with BM25, applies the filters, and keeps its local top 20 with a small heap, which for a moderately selective query over five million documents takes 10 to 40 ms depending on how long the postings lists run and whether the segment files are warm in the page cache, the memory the operating system uses to hold recently read file data. The coordinator's merge of a few hundred score pairs costs microseconds, fetching the display fields for the final 20 documents adds a few milliseconds, and the response rides the network back. The median lands somewhere near 80 to 120 ms end to end, which leaves the 95th-percentile budget absorbing the unlucky cases.

The unlucky cases are specific enough to plan for individually. A query made of two very common words drags long postings lists and lands at the slow end of the shard work. A cold shard whose segments fell out of the page cache after a merge pays disk reads where its peers pay memory reads, and a single garbage collection pause on one shard node holds the entire gather hostage, which is the fan-out amplifying one machine's hiccup into a user-visible delay. Hedged requests cap the damage from the slow-shard cases, warming caches after merges removes the cold-start case, and the remaining tool is a per-shard timeout, where the coordinator returns slightly incomplete results rather than slow ones, a trade most products accept once they learn that users notice 800 ms but never notice one missing result at position 19. When everything works, the user sees results appear faster than their eyes move from the box to the list, and when a shard misbehaves, the experience degrades to a fractionally thinner result page rather than a spinner, which is the failure shape worth engineering toward.

Scaling, failures, and operations

Read throughput scales with replicas, since each shard keeps one or more full copies on other nodes and the coordinator load-balances queries across them, so a thousand queries per second spread over 10 shards with two replicas each works out to a modest few hundred shard-queries per second per node. Write throughput scales with shard count, but changing the shard count changes every document's hash assignment, so the count is chosen with headroom up front, ten shards of five million documents each in this design, and an actual reshard is treated as the same heavyweight operation as a full reindex. The queue in the write path is what makes indexing resilient, because when the indexer or a shard falls over, updates simply accumulate in the queue and replay once the consumer returns, while search keeps serving slightly stale results in the meantime. That is the graceful version of the failure, and the operator's view of it is a consumer-lag graph that climbs and then drains rather than an outage that pages anyone.

The operation that deserves its own rehearsed plan is changing the index itself, because analyzers and field mappings are baked into the segment files at index time. A new stemmer, a new field type, or a new shard count each require rebuilding from scratch, and the safe pattern is to build alongside and swap. The team creates a second index with the new configuration, bulk-reindexes all fifty million documents into it from the source of record while the live index keeps serving and both indexes receive the ongoing writes, verifies document counts and spot-checks rankings between old and new, and then atomically flips an alias, which is a named pointer that clients query instead of a physical index name. Rollback is the same flip performed in reverse, and the old index is kept for a few days before deletion in case a ranking regression surfaces slowly. At a few thousand documents per second of bulk indexing, the rebuild takes a few hours, which is precisely why the alias indirection exists, since no in-place migration could hold the read path stable for that long.

Failures otherwise follow the shard-replica pattern, where a node loss promotes replicas of the lost primaries and rebuilds spare copies elsewhere, a slow node is the tail-latency story already told, and the coordinator tier is stateless, scaling flat behind a load balancer. The two metrics worth alerting on are refresh-to-visible lag on the write side and per-shard p99 latency on the read side, because each one degrades quietly while every binary health check stays green, and by the time users complain about stale listings or slow searches, the graphs will show that the problem started hours earlier.

Follow-up questions

Why not serve search from the primary database? Text matching by table scan is linear in corpus size, a leading-wildcard LIKE defeats ordinary indexes, no ranking exists beyond arbitrary sort orders, and no linguistic normalization happens, so "running" never matches "runs." The inverted index makes lookup cost proportional to the result size rather than the corpus size, and BM25 supplies an ordering worth showing, which together are the whole reason dedicated search infrastructure exists.
Why shard by document instead of by term? Term sharding sends each query only to the shards owning its terms, which sounds efficient until the intersections start, because multi-term queries then need postings lists shipped across the network to be intersected, hot terms create hot shards, and a single document update touches as many shards as the document has distinct terms. Document sharding keeps every intersection local to one shard and confines an update to one place, which is why real systems almost universally choose it.
What does "near-real-time" actually mean here? A new document becomes searchable on the next refresh, about a second after the indexer writes it, because until then it sits in an in-memory buffer that queries do not see. The delay is a deliberate trade for lock-free reads over immutable files, and products that need authors to see their own writes immediately handle that single case at the application layer, for example by overlaying the user's own recent submissions onto their results.
How would you add typo tolerance? Either index character n-grams, which are overlapping fixed-length fragments of each word, or maintain a fuzzy term lookup that matches dictionary terms within edit distance 1 or 2 of the query word, then rank exact matches above fuzzy ones so correctness still wins. Fuzzy matching widens candidate sets considerably, so it is best gated to queries that returned few or no results, where the user is clearly not finding what they wanted.
Why is deep pagination expensive, and what is the fix? An offset of N forces every shard to return N plus page-size candidates, because the coordinator cannot know in advance which shard holds the true global slice, so the cost grows linearly with depth across the whole cluster. A search-after cursor keyed on the last result's sort values lets each shard skip directly to its own continuation point, keeping every page priced like the first, and the only sacrifice is random jumps to arbitrary page numbers.
How do you change an analyzer in production? Never in place, because the index files embed the old analysis and a mixed index silently corrupts matching. The procedure is to build a new index alongside with the new analyzer, reindex everything from the source of record while dual-writing to both, validate counts and rankings, then swap the serving alias atomically, keeping the old index briefly as the rollback path.

References

Manning, Raghavan, and Schütze, Introduction to Information Retrieval (2008), on inverted indexes, scoring, and analysis.
Elastic, Elasticsearch reference documentation, on shards, replicas, refresh, and aliases.
Apache Lucene, Lucene Core, the index and scoring implementation underneath.
Kleppmann, Designing Data-Intensive Applications (2017), on derived data and keeping systems in sync.