Design social graph search

Type "jo" into the search box of a large social network and the first results are not the most famous people named Jo, they are your friends, then friends of your friends, and only then everyone else. That ordering is the product itself, because people search on a social network is really relationship search, with relevance computed mostly from graph distance, live, as the user types, under a keystroke-to-results budget of about 100 milliseconds. Interviewers like the question because it arrives dressed as a search problem and turns out to be a graph problem, and the instructive moment comes when a candidate realizes that the textbook answer, breadth-first search over the friendship graph at query time, cannot work at social network scale no matter how well it is implemented. Accepting that early forces the design toward precomputation and caching, which is the road real systems such as Facebook's typeahead actually took, and the rest of the walkthrough is about traveling that road deliberately rather than stumbling onto it after the whiteboard fills up with frontiers.

Three problems carry the difficulty. The graph holds on the order of a trillion edges and must serve any user's neighbors in one cheap read, relationship distance has to be computed without traversing the graph for every keystroke, and name prefixes must match against a billion names quickly enough that distance ranking still has time to run inside the budget. Privacy filtering wraps around all three, because blocked users and restricted profiles have to vanish from results before anything leaves the backend, and a design that solves the first three while leaking the fourth has failed at the one requirement users will never forgive.

Scope and requirements

Functionally, the system serves typeahead people search, where each keystroke sends the current prefix to the backend and the response is a ranked list of perhaps eight people whose names match it. Ordering follows graph distance first, so direct friends outrank friends of friends, who in turn outrank the global tier of everyone else, and interaction signals such as mutual friend counts break ties inside each tier. Privacy behaves as a functional requirement rather than a refinement, since blocked users must never appear and users whose visibility settings exclude the searcher must be filtered on the server, where no client misbehavior can un-hide them. Friend and unfriend events need to be reflected within seconds, because a person who just accepted a friend request will often search for that new friend moments later, and a new friend who fails to surface reads as a defect rather than as a delay. Full-text search over posts and pages is a different system with different index shapes, so I would name it explicitly and leave it out.

The non-functional side is dominated by one number. Human perception registers feedback as instant somewhere near 100 milliseconds, the network between a phone and the data center consumes around half of that before any server code runs, and the backend is therefore left with roughly 50 milliseconds per query for everything described below. Every keystroke is a query, which means the request rate tracks the aggregate typing speed of the user base rather than the number of searches, and that multiplier matters when the sizing is done. The system must also degrade rather than fail, meaning that when graph distance cannot be computed inside the deadline the service returns name-relevance results with slightly worse ordering instead of returning nothing, and that degraded mode deserves to be designed, metered, and rehearsed rather than discovered for the first time during an incident.

Sizing the problem

Assume 1 billion users with an average of 500 friends, and that branching factor of 500 is the number the rest of the article leans on. Friendship is mutual, but storing every edge in both directions makes either endpoint's friend list a single local read, so the edge store holds 1 billion users times 500 friends, which is 500 billion logical friendships and about 1 trillion directed entries once both directions are materialized. At roughly 12 bytes per entry, which covers two 8-byte IDs with compression plus per-row overhead, the raw adjacency data runs around 12 TB before indexes and replication multiply it, far beyond any single machine and naturally sharded by user ID so that one user's entire friend list lives together on one shard.

The query rate comes from typing rather than from searching. If 200 million people search on a given day, each running about 5 searches of 5 keystrokes, the platform absorbs 5 billion prefix queries per day, and dividing by 86,400 seconds gives an average near 58,000 queries per second with peaks past 150,000. That rate is why the per-query work has to amount to a handful of memory reads rather than a graph walk, because a design spending even ten store round trips per keystroke would demand more than a million store reads per second before any ranking happened. The name index is the small partner in the pair, since a billion names at a few dozen bytes each is around 30 GB of raw strings, and even with prefix structures multiplying that several times over it fits in memory across a modest shard set. The asymmetry between the two stores is the design's good fortune, because it lets name matching play the cheap opening act while the latency budget is spent on the part that differentiates the product, namely ranking by relationship.

The interface

One endpoint serves the box, and its shape is worth a moment because the client and the server share the latency budget between them. Each keystroke sends the current prefix with a small result limit, the response carries the match tier for every result so the client can render section headers without re-deriving relationships, and a session token ties the keystrokes of one typing burst together so the backend can reuse work across them. A degraded flag rides along as well, telling the client when ordering came from name relevance alone, so that it can quietly retry for the ranked version if it chooses to.

GET /api/v1/typeahead?q=jo&limit=8&session=s-4471
→ 200 {
  "results": [
    { "user_id": "u-2241", "name": "Jordan Park",  "tier": "friend",
      "mutual_friends": 41 },
    { "user_id": "u-9077", "name": "Joanna Reyes", "tier": "friend_of_friend",
      "mutual_friends": 6 },
    { "user_id": "u-5550", "name": "Jo Martin",    "tier": "global" }
  ],
  "degraded": false
}

The data model

The graph lives as adjacency lists, which means each user's record holds the list of that user's direct neighbors. That layout is the natural one when the dominant question is "give me this user's friends" rather than "does this exact edge exist", because one read keyed by user ID returns the entire neighborhood, while an edge-keyed layout would scatter a user's friendships across the keyspace and turn every friend-list fetch into hundreds of point reads. Facebook's TAO is the canonical production example of the shape, a graph API layered over sharded MySQL with a write-through cache in front, serving association lists like friend edges at extremely high read rates, and the design here borrows that posture deliberately. A sketch of the two stores:

-- Graph store, sharded by user_id (both directions materialized)
CREATE TABLE friend_edges (
  user_id    BIGINT NOT NULL,
  friend_id  BIGINT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (user_id, friend_id)
);

-- Name index, sharded by name token prefix
-- prefix "jo" → posting list of (user_id, name, popularity)
CREATE TABLE name_prefixes (
  prefix   VARCHAR(8) NOT NULL,
  user_id  BIGINT NOT NULL,
  name     TEXT   NOT NULL,
  rank_hint REAL,
  PRIMARY KEY (prefix, user_id)
);

Two details in the sketch carry more weight than they appear to. The name index is keyed by prefix rather than by user, so a prefix query touches exactly one partition while a name change touches only the handful of rows derived from the old and new spellings, and the rank hint column carries a precomputed popularity score so the index can truncate its posting lists without consulting any other service at query time. The friend edges table records creation time because recency feeds ranking, and materializing both directions costs double the storage but converts what would otherwise be a scatter-gather query for the reverse direction into the same cheap single-partition read the forward direction enjoys, a trade the sizing numbers already showed the system can easily afford.

The high-level architecture

A typeahead service orchestrates three stages inside the budget, and each stage exists to keep the next one cheap. The prefix index, which is a precomputed map from each short name prefix to the posting list of user IDs whose names begin with it, turns the hardest-looking part of the problem, matching against a billion names, into a couple of milliseconds of sharded lookups. The distance ranker then intersects those candidates with the searcher's cached first-degree and second-degree friend sets to assign each one a tier, and privacy filtering trims the list before the response is assembled. The component doing the quiet heavy lifting is the friend-list cache underneath the ranker, which holds the precomputed sets the ranker tests against and is rebuilt from the graph store whenever an edge changes, because everything fast about the query path is downstream of that cache being warm and current.

Each keystroke fans the prefix out to the name index shards, the returned candidates are tiered by the distance ranker against cached first and second degree sets, and ranked, privacy-filtered results flow back to the client. The cache rebuilds from the graph store whenever friendships change.

Why query-time BFS fails

Breadth-first search explores a graph in rings, visiting the source's direct neighbors first, then their neighbors, expanding one hop at a time until the target turns up, and on a whiteboard it is the obvious way to compute the distance between the searcher and a candidate. At this scale it dies twice over. The first death comes from fan-out arithmetic, because with a branching factor of 500 one hop reaches 500 people and two hops reach about 500 squared, which is 250,000, so merely confirming that a candidate is a friend of a friend by forward search means materializing a quarter-million-node frontier, and three hops reach 125 million people, an eighth of the entire network, all in service of one candidate under one keystroke. The second death comes from the network underneath the algorithm. Because the graph is sharded across many machines, every hop becomes a round of parallel reads across shards, and even a generous 5 milliseconds per round turns a multi-hop walk with giant frontiers into hundreds of milliseconds of shard traffic. Multiplying that by eight candidates per keystroke and 150,000 keystrokes per second platform-wide would ask the store for billions of reads per second. The budget allows perhaps two or three cache reads per query while BFS wants thousands of store reads, and a gap of that shape is structural, which means no amount of tuning, batching, or added hardware closes it, and the algorithm itself has to move somewhere cheaper.

Bidirectional BFS, and why precomputation still wins

Bidirectional BFS is the classic algorithmic remedy, expanding a frontier from both endpoints at once and stopping when the two frontiers touch. The reason it helps is that exponential growth gets paid for only half the distance from each side, so finding a shortest path of length d costs about 2 times b^d/2 node expansions instead of b^d, where b is the branching factor. Worked numbers make the gap vivid. For distance 2 with b of 500, one-sided search touches about 250,000 nodes while two-sided search touches about 500 from each end, roughly 1,000 in total and a 250-fold saving, and for distance 4 the comparison becomes 62.5 billion against about 500,000, which is the difference between impossible and merely expensive. The meeting test is itself cheap, because asking whether two 500-element sets intersect is a hash-set membership check repeated 500 times, an operation measured in microseconds that runs entirely in local memory rather than against the graph.

Each side expands one hop and the mutual friends appear where the frontiers intersect, so a distance-2 check costs two friend-list reads and a set intersection, while one-sided BFS would have materialized a frontier of about 250,000 nodes.

The production insight hiding inside the algorithm is that for distance 2, bidirectional BFS reduces to a single operation, namely intersecting my friend list with the candidate's friend list, and an operation that small is worth precomputing rather than performing per query. The system therefore keeps each user's first-degree list cached, and for active users it also maintains a compact second-degree structure built offline by expanding the cached friend lists, stored either as a sorted array of IDs or as a Bloom filter, which is a small probabilistic bitmap that answers definitely-absent or probably-present using a few bits per member. With those structures in place, the per-candidate tier check at query time becomes a pair of membership tests against data already in memory, asking first whether the candidate sits among my friends and then whether they sit in my second-degree set, with everyone else falling to the global tier. The expensive expansion now happens once per friendship change instead of once per keystroke, and that exchange rate is the entire trick, because friendship edits arrive perhaps a few thousand times per second platform-wide while keystrokes arrive 150,000 times per second and each keystroke would have triggered the work for several candidates at once. A Bloom filter does occasionally claim a stranger is a friend of a friend, at a false-positive rate the operator tunes down toward a fraction of a percent, and the system tolerates that gracefully because the consequence is one slightly over-promoted result rather than anything private being revealed.

Matching names and ranking candidates

The name side rests on a prefix index. Every display name is tokenized and lowercased, and for each token prefix up to a few characters the index stores a posting list of matching user IDs, sharded by the prefix itself so the "jo" list lives at a known address, with hot two-letter prefixes subdivided by their next character when a single posting list draws too much traffic. A global posting list for a prefix like "jo" holds millions of entries, vastly more than could be scored inside the budget, so it is truncated to a globally popular head containing the public figures any searcher might plausibly mean. Personalized coverage comes from a second source running in parallel, because the searcher's own first-degree and second-degree names are few enough to scan exhaustively, with 500 friends and perhaps 250,000 second-degree names sitting in cache, and that exhaustive scan is what guarantees a friend named Jordan always surfaces for "jo" even though Jordan is globally obscure. Candidates from both sources are unioned, tiered by the membership tests described above, and scored within tiers by interaction signals such as mutual friend counts and recency of contact, so the close friend messaged yesterday outranks the acquaintance friended a decade ago.

Privacy filtering then runs in the backend, after tiering and before the response is assembled, and the placement is not negotiable, because a result that ships to the client and is merely hidden by the interface remains a data leak regardless of what the screen shows. Block lists are small per user, cached beside the friend lists, and checked in microseconds, which removes any performance argument for postponing them. Freshness rides the write path rather than the read path, so a friend or unfriend event updates the graph store and then invalidates and rebuilds the cached structures for both endpoints, and the two structures are allowed different urgency. The first-degree list must update within seconds to keep the new-friend promise, while the second-degree structure can rebuild lazily over minutes, because a friend's new friend creeping into the middle tier sits below anyone's perception. Name changes flow to the prefix index through the same event stream, retiring postings under the old prefixes and inserting them under the new ones.

A latency budget for one keystroke

The 50 server-side milliseconds only mean something once they are divided into parts. Authentication and routing at the gateway cost a millisecond or two, after which the typeahead service issues its two lookups concurrently, sending the prefix to the right name index shard, which returns the truncated global posting list in about 2 milliseconds, while the personalized scan of cached friend and second-degree names completes in about the same window, and running the two in parallel means the service pays for the slower of them rather than the sum. Tier assignment is a few hundred membership tests against in-memory sets and costs well under a millisecond, scoring and sorting the few dozen survivors costs about as little, privacy checks add microseconds per candidate from cached block lists, and serialization rounds the warm-path server total to something near 10 milliseconds, which leaves real headroom under the deadline.

That headroom exists to absorb the cold cases. A searcher whose friend cache has been evicted, typically someone inactive for months, forces a graph store read and an on-the-fly tier computation that can eat 20 or 30 milliseconds, which still fits precisely because the warm path left room for it, and the ranker's internal deadline backstops the truly slow outliers by dropping to name relevance rather than spending the whole budget chasing perfect ordering. Session reuse funds the rest of the margin, because each keystroke usually extends the previous prefix, so the service can refine the candidate set it already holds for the session, with "jor" filtering the cached "jo" results, instead of re-running the fan-out, and that turns a five-keystroke search into one expensive query followed by four cheap ones, cutting shard traffic several-fold for the price of a small per-session cache.

Scaling, failures, and operations

Each store scales along its own key. The graph store shards by user ID and replicates within each shard, with the cache layer in front absorbing the read multiple, which is the TAO shape, while the prefix index shards by prefix and subdivides and replicates the hot ones, because prefix popularity follows a steep skew in which a handful of short prefixes carry a large share of all queries. The friend-list cache shards by user ID, and its sizing arithmetic lands comfortably, since roughly 500 entries of 8 bytes per active user is about 4 KB each, so 200 million active users fit in under a terabyte of cache memory across the cluster, with the second-degree structures adding a few kilobytes more per user when stored as Bloom filters instead of raw arrays. The typeahead service itself is stateless apart from its small session caches and scales horizontally behind the load balancer, which concentrates all the interesting failure behavior in the stateful tiers beneath it.

The failure plan is the degraded mode promised at the start, and it is worth narrating as an experience. When the friend-list cache or the graph store turns slow, the ranker reaches its internal deadline of a few milliseconds, abandons tiering, and the service returns name-relevance ordering with the degraded flag set, so the person typing still sees results land within the perceptual budget, merely ordered a little worse, with a celebrity perhaps outranking a casual acquaintance for a few minutes. Nothing spins and nothing errors, which is the entire point of designing the fallback in advance. When a prefix shard dies its replica takes over, and if an entire prefix becomes unreachable the global candidates for it vanish while personalized candidates from the friend cache still match, a failure that points in the direction users forgive, because missing a stranger is invisible while missing a friend gets reported. Operations watches keystroke-to-response p99, the ranker timeout rate, which meters how often the degraded mode actually engages, and cache rebuild lag after friend events, which measures the new-friend promise directly. A privacy regression suite runs in continuous deployment alongside those dashboards, because the one unforgivable defect in this system is a blocked user appearing in a result list, and that property has to be machine-checked on every deploy rather than entrusted to review.

Follow-up questions

Why not run BFS at query time? A branching factor of 500 makes a two-hop frontier about 250,000 nodes, every hop costs a cross-shard round trip, and the budget allows roughly 50 milliseconds at 150,000 queries per second. The arithmetic fails before any implementation detail gets a chance to matter, which is why the design moves the graph work to write time instead of optimizing the walk.
What exactly does bidirectional BFS save? Exploration drops from b^d to about 2 times b^d/2, so for b of 500 and d of 2 the search touches about 1,000 nodes instead of 250,000. Better still, the distance-2 case collapses into intersecting two cached friend lists, which is precisely the operation the production system precomputes.
How fresh must the cached friend lists be? The first-degree list must update within seconds, because a person who just made a friend will often search for them immediately and expects friend-tier ranking. Second-degree structures can rebuild lazily over minutes, since a friend's new friend creeping into the middle tier sits below anyone's perception.
How do you handle a user with millions of followers? Followers and friends receive different treatment, because distance ranking runs only on the bounded friendship graph while celebrity accounts surface through the global popularity tier of the prefix index. No per-query structure ever scales with follower count, so a famous account costs the same to rank as an obscure one.
Where does privacy filtering run, and why there? It runs in the backend after tiering and before the response is built, because results that are shipped and merely hidden by the client remain a leak regardless of the interface. Block lists are small per user and cached beside the friend lists, so the check costs microseconds and offers no performance excuse to skip.
What degrades first under load, and what does the user see? The distance ranker hits its internal deadline first, and the service falls back to name-relevance ordering with the degraded flag set. Results still appear within the budget, ordered a little worse than usual, which is the designed failure for a feature where late perfect results lose to punctual decent ones.

References

Bronson et al., TAO: Facebook's Distributed Data Store for the Social Graph (USENIX ATC 2013).
Curtiss et al., Unicorn: A System for Searching the Social Graph (PVLDB 2013), Facebook's graph-aware search backend.
Kleppmann, Designing Data-Intensive Applications (2017), on partitioning and graph data models.
Xu, System Design Interview, Volume 2 (2022), on proximity and typeahead style problems.