Design a search autocomplete system

Autocomplete is the dropdown that appears under a search box and offers the five to ten most likely completions of whatever the user has typed so far. It looks like a garnish on the edge of the real product, yet it quietly shapes how people search, because on the major search engines a majority of users pick a suggestion rather than typing their query to the end, which means the feature carries real traffic and real product weight. The defining constraint is time. Suggestions need to appear within about 100 milliseconds of a keystroke to feel attached to the typing rather than trailing behind it, and once the dropdown lags past that threshold users stop trusting it and type straight through. Network time to a realistic user eats 30 to 60 of those milliseconds before the server does anything at all, so the backend is left with a few tens of milliseconds, and closer to 20 at the median, to produce an answer. That budget governs everything below, and the whole design can fairly be read as a list of things computed ahead of time so that the request path does almost no work.

Interviewers like this question because the naive answer is easy to propose and easy to demolish with arithmetic. Querying the search database for every stored query that starts with the prefix, then sorting the matches by popularity, sounds reasonable until the numbers arrive, since a one or two letter prefix matches millions of rows, the scan and sort repeat on every keystroke of every user, and no database survives that politely. The repair brings together a classic data structure, an offline batch pipeline, and a layered caching story inside one design, which gives the conversation natural depth. The genuinely hard parts live in choosing what to precompute, in shipping fresh precomputed data to a serving fleet without downtime, and in keeping a deliberately stale system responsive to a world where a query can begin trending within minutes of a news event.

Scope and requirements

Functionally, the system returns up to ten ranked completions for the current contents of the search box. Matching is restricted to prefixes, meaning a suggestion must begin with exactly the characters the user typed, which keeps the scope controlled, and matching in the middle of words or phrases is acknowledged as a follow-up rather than built into the core. Ranking follows historical popularity, with a separate mechanism that lets trending queries surface quickly instead of waiting for the nightly statistics to catch up. Every suggestion must also pass a filter that keeps offensive, harmful, or policy-violating completions out of the box, and the reason this counts as a functional requirement rather than a nicety is that a suggestion is something the product says, not something the user said, so a single bad completion in a screenshot can become a news story. Personalization stays out of scope beyond noting where it would attach, namely at the final merge step where a user's own recent searches could be blended into the displayed ten.

The non-functional requirements begin with the latency target already described, about 100 ms end to end with a server budget near 20 ms at the median. Availability matters in a specific shape, because the search box must never block on a dead suggestion service, so the client treats a missing dropdown as a non-event and degrades to plain typing that the user barely notices. The third property is the great gift of this particular problem, which is staleness tolerance. A suggestion list computed yesterday is almost as good as one computed this second for everything except trending topics, since the relative popularity of queries like "weather" and "cheap flights" barely moves between Tuesday and Wednesday. Saying that tolerance out loud early matters, because it is the license for everything the design does cheaply, from aggressive caching to overnight batch rebuilds, and a candidate who treats freshness as uniformly precious ends up building a much harder system than the product needs.

Sizing the problem

Request volume multiplies out of search volume because every keystroke is a potential request, and the multiplication deserves to be done slowly. Take a product with 10 million searches per day. An average query runs about 20 characters, so a completely naive client would fire 20 requests per search, one for every keystroke along the way. Two client-side disciplines described with the serving path, debouncing and a minimum prefix length, cut that to an average of about 4 requests per search, and 4 times 10 million gives 40 million requests per day. Spreading those over 86,400 seconds yields roughly 460 requests per second on average, and search traffic follows the waking day strongly, so planning for a peak factor of 5 means designing for about 2,300 requests per second. Each response is a kilobyte or two of JSON, which makes bandwidth a non-issue and leaves the entire problem as lookup rate held against the latency budget.

Storage is the number that decides whether precomputing every answer is affordable, so it gets the same slow treatment. Suppose cleaning the query logs yields 50 million distinct queries worth suggesting, averaging 20 characters each. Storing a top-ten list at every prefix of every query gives a worst case of 50 million times 20, or one billion prefix entries, but queries share prefixes heavily, since "cheap flights" and "cheap hotels" pay for "c", "ch", and "cheap" only once between them, so the realistic count is a few hundred million. Each entry holds the prefix itself plus ten short strings with their counts, roughly 300 bytes, which puts the ceiling at one billion times 300 bytes, or 300 GB, with the realistic shared-prefix figure landing under 100 GB. That much data does not fit comfortably in one memory node once the operating system and headroom take their share, but it shards beautifully by prefix, and it is small enough to rebuild from scratch every night. Both of those properties stop being trivia and become foundations, because the architecture below leans on sharding for the serving fleet and on cheap rebuilds for the pipeline.

The trie, and why query time walks are too slow

The natural structure for prefix lookup is a trie, a tree in which each edge adds one character, so the path from the root spelling out "ca" arrives at the node that owns that prefix, and every query in the subtree below it starts with "ca". Finding suggestions then means walking the subtree under the prefix node, collecting completions with their popularity counts, and sorting to keep the best ten. That walk is exactly where the naive trie fails the budget. A two-letter prefix on a 50-million-query corpus can own hundreds of thousands of descendants, the walk touches all of them, the sort adds its own cost on top, and the identical work repeats on the very next keystroke when the user types one more letter. The repair is to do each node's work once, offline, by storing the top ten completions directly at every node, so the query path walks down exactly as many edges as the prefix has characters and reads a finished list, with no subtree visit at all.

class TrieNode:
    __slots__ = ("children", "top_k")
    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.top_k = []        # [(count, query)] precomputed, length <= 10

def suggest(root, prefix):
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []          # no query starts with this prefix
    return node.top_k          # O(len(prefix)), no subtree walk

Filling top_k for every node takes one bottom-up pass over the built trie, where each node merges its children's lists with its own terminal query and keeps the best ten, so the whole build runs in time linear in the corpus with a small constant for the merges. The same idea flattens into a plain table that maps each prefix string to its list, and the flattened form is what actually ships, because a trie is the natural structure for building while a key-value table of prefix → suggestions is the natural artifact for serving, and the storage arithmetic above has already priced that table. A worked example shows the lists in motion. If "calendar" was searched 1.2 million times last month and "calculator" 900 thousand times, then the node for "ca" holds both near the top of its list, the node for "cal" still holds both, and the node for "cale" drops "calculator" because it no longer matches the prefix. Every keystroke therefore costs one lookup against a list that was finished hours ago, which is how a 20 ms budget turns into sub-millisecond server work.

The serving path

The read path is a cache hierarchy with a sharded table at the bottom, and each layer exists to keep traffic away from the layer below it. The browser caches recent prefix results itself, so a user who backspaces and retypes never leaves the device. The CDN, the content delivery network of shared edge caches sitting close to users, holds responses for popular prefixes with a time-to-live of about a minute, which is safe precisely because of the staleness tolerance established earlier, and it matters because prefix traffic is as skewed as everything else on the internet, so a few million hot prefixes cover a large majority of requests. Behind the CDN, a suggest service checks an in-memory result cache and then reads the shard that owns the prefix. Sharding is by prefix range, so one shard might own everything from "a" to "m", though in practice hashing the first couple of characters with an explicit overflow map for the hottest prefixes balances load better than a naive alphabetical split, because "s" carries far more traffic than "x" and a static split bakes that imbalance in permanently. Each shard holds its slice of the table fully in memory, replicated across at least two machines for availability, and answers in well under a millisecond, which leaves the latency budget to the network where it belongs.

Most keystrokes never travel past the CDN. The misses that do reach the suggest service are answered from the result cache or the owning prefix shard, with trending candidates blended in along the dashed path before the response leaves.

The client carries two responsibilities that belong to the design rather than to polish. Debouncing means waiting a beat after each keystroke, typically 50 to 150 ms, before sending a request, so a fast typist generates one request for "shoes" rather than five requests for the prefixes along the way, and combined with a minimum prefix length of one or two characters it is what turned 10 million searches into only 40 million requests in the sizing arithmetic. Request cancellation handles the overlap that debouncing cannot remove. When the user types another character while a request is still in flight, the client aborts the old request, because rendering stale suggestions on top of fresher text is the most user-visible failure this product has, with the dropdown briefly offering completions of a prefix the user has already left behind. An AbortController attached to each fetch makes the discipline a one-line habit, and it is worth naming in the interview because candidates who only design the server half tend to forget that the client is where half of the latency story lives.

The offline pipeline: from logs to versioned tries

Everything served above is an artifact, and the pipeline that builds it runs far from the request path, on batch infrastructure where a slow job hurts nobody. Search query logs accumulate in blob storage as users search. An aggregation job runs daily or weekly depending on how fast the product's queries drift, counting queries over a trailing window such as 30 days with more recent days weighted higher, so the rankings track the present without forgetting the stable past. A filtering stage then removes offensive and policy-violating queries using blocklists and a classifier, and it also drops queries below a frequency floor. The floor does double duty, because it shrinks the corpus to the queries worth suggesting and it protects privacy, since a rare query containing one person's name must never be suggested back to the world, and skipping that protection has produced real incidents at real companies. The trie builder then computes the per-prefix top-ten lists, flattens them into the serving table, slices the table per shard, and writes the result to an artifact store under a new version number.

Raw logs feed the aggregator in step 1, the counts pass the policy filter in step 2, and the cleaned, weighted queries reach the trie builder in step 3. Step 4 lands the per-shard artifacts in the store under a new version, and in step 5 each shard downloads that version, loads it beside the old one, and swaps a pointer atomically.

Versioning is what makes shipping safe rather than exciting. Each shard keeps serving version 42 from memory while it downloads and loads version 43 alongside, then swaps a pointer between two requests, so deploys are invisible to users and rollback is the same pointer swap performed in reverse. A canary step serves the new version to a small slice of traffic while comparing suggestion click-through against the old version, which catches a poisoned build before it reaches everyone, and keeping the last few versions in the artifact store means the worst pipeline outcome, a corrupt build finishing at three in the morning, degrades to serving yesterday's suggestions, a condition no user can detect.

Trending is handled by a second, much smaller pipeline layered on top, rather than by rebuilding the big one faster. A streaming job counts queries over the last 10 to 30 minutes, and the suggest service blends that short-window output with the stable lists at request time, for example by promoting a query into the displayed ten when its recent velocity clears a threshold. Keeping the blend at the merge step is deliberate, because the stable trie supplies trustworthy long-run ranking while the overlay supplies reaction within minutes, and neither pipeline has to acquire the other's properties. Typo tolerance lives at the same merge step. When a prefix lookup returns few or no completions, the service retries with a handful of candidate prefixes within edit distance one, meaning one character inserted, deleted, or substituted, generated against the prefix table rather than the full corpus so the cost stays bounded, and the retry only fires when the primary lookup comes back thin, which keeps the common path untouched.

A latency budget for one keystroke

Tracing one keystroke from finger to dropdown shows where the 100 ms actually goes, and the trace justifies most of the earlier choices. The user types the fourth character of a query. The client waits out its debounce timer, which adds up to 100 ms of intentional delay but does not count against perceived latency the way network time does, because the user is still typing and the dropdown from the previous prefix is still on screen doing useful work. The request then leaves the device, and reaching a CDN edge takes 20 to 40 ms on a decent connection, with mobile networks paying more. On a CDN hit, which is the common case for short popular prefixes, the answer turns straight around, the total round trip lands near 50 to 80 ms, and the budget is met without the origin ever hearing about the keystroke.

On a miss, the request continues from the edge to the origin, which costs another 10 to 30 ms depending on geography, and that hop is the strongest argument for replicating the serving stack into several regions rather than centralizing it. Inside the origin, the suggest service checks its result cache in well under a millisecond, reads the owning shard's in-memory table in another fraction of a millisecond, spends a millisecond or two merging the trending overlay and applying the deny set, and serializes a kilobyte of JSON, so total server time stays under 5 ms and even the miss path arrives around 90 to 120 ms after the keystroke. The tail behaves worse than the median, as tails do. A cold connection pays for TLS setup, a rare prefix misses every cache layer at once, and a garbage collection pause on a shard adds tens of milliseconds at the worst percentiles. The product-shaped insight is that a late response is worth less than no response, because by the time it arrives the user has typed another character, so the client's cancellation discipline quietly converts tail latency into invisible waste rather than visible flicker, and the dropdown simply updates one keystroke later than it might have.

Scaling, failures, and operations

Every serving tier scales horizontally because every serving tier is read-only, which is the structural reward for pushing all writes into the pipeline. Prefix shards split further when a slice outgrows its memory or its traffic, replicas absorb growing read load, and the CDN absorbs the skew, since the few million hottest prefixes serve a large majority of requests without touching the origin at all. The pipeline scales like any batch system, by adding workers, and its duration affects only how fresh the stable rankings are, never whether the product works, which is an unusually forgiving relationship between an offline job and a user-facing feature.

The failure modes stay contained because each one degrades to a softer version of the product. A dead shard fails over to its replica, costing at worst a few seconds of staleness while traffic moves. A dead suggest tier degrades the search box to plain typing, and the client treats that as a non-event by rendering no dropdown rather than an error, so the user experiences a slightly more manual search instead of a broken one. A stuck pipeline freezes rankings at yesterday's truth, and the alarm for it watches artifact age directly rather than waiting for user reports, because no user will ever file one. The genuinely sharp edge is the filter stage, since a build that lets one harmful suggestion through turns a backend artifact into a screenshot with the product's name on it. Blocklist updates therefore need a fast path that does not wait for the next full build, so shards consult a small serve-time deny set, propagated within minutes, that sits in front of whatever the artifact contains, and the artifact itself is patched at the next build.

One operational habit deserves the last word, which is replaying a sample of yesterday's real prefixes against each new artifact before promoting it and comparing suggestion overlap with the previous version. The reason is that a quiet aggregation defect which empties half the prefix lists looks exactly like success to every infrastructure metric, with the service fast, the error rate at zero, and the artifact a plausible size, while being the worst regression the product can ship. Only a comparison against the previous version notices the difference, which makes the replay check the cheapest insurance in the whole design.

Follow-up questions

Why precompute top-K at every prefix instead of walking the trie per query? A short prefix can own hundreds of thousands of completions, so collecting and sorting them blows a 20 ms budget thousands of times over, and it does so again on every keystroke. Precomputation moves that cost into a nightly build where nobody is waiting, makes the read path one bounded lookup, and the storage bill of under 100 GB with prefix sharing is cheap against the latency it buys.
How do new or trending queries appear if the trie rebuilds nightly? They arrive through the short-window overlay, a streaming count over the last few minutes that the suggest service merges into responses when a query's velocity clears a threshold. The stable trie supplies trustworthy long-run rankings, the overlay supplies recency, and neither pipeline has to deliver both properties at once, which is what keeps each of them simple.
Where does debouncing belong and what does it actually save? It belongs in the client, before the request exists, because waiting roughly 100 ms after a keystroke collapses a fast typist's burst into one request, and no server-side mechanism can recover that saving once the requests are already on the wire. In the sizing it is the difference between every character of 10 million searches hitting the backend and an average of 4 requests per search, roughly a fivefold reduction.
How is an offensive suggestion prevented, and how is one removed fast? Prevention happens at build time through blocklists, a classifier, and a frequency floor. Removal cannot wait for a build cycle, so shards also check a small serve-time deny set that propagates within minutes, and the artifact gets patched at the next build, which means the deny set only ever holds the handful of entries discovered since last night.
What changes for multi-word and mid-word matching? The index gains additional entry points, with each query inserted under the prefixes of each of its words, so "weather tokyo" becomes reachable from "tok". The table grows by roughly the average number of words per query, and ranking needs a tie-break that prefers whole-query prefix matches over mid-query ones, which keeps quality predictable while recall widens.
Could you serve this from Redis sorted sets instead of custom shards? Yes, and it is a respectable mid-scale answer, with one sorted set per prefix scored by popularity and sharded by key. The custom in-memory table earns its keep when the corpus and traffic outgrow what general-purpose storage can serve inside the latency budget, or when immutable version-swapped artifacts simplify operations enough to matter, and naming that threshold is more useful than defending either choice absolutely.

References

Xu, System Design Interview, Volume 1 (2020), chapter on search autocomplete.
Sanfilippo, Auto Complete with Redis (2010).
Elastic, Suggesters documentation, on completion suggesters and fuzzy completion.
MDN, AbortController, on cancelling in-flight fetch requests.