Design nearby friends · Stanley Jacob

Nearby friends is the feature in a messaging or social app where friends who have both opted in can see each other's live distance, refreshed every few seconds, with people appearing when they come within a few miles and quietly dropping off when they go offline. It looks like a small variation on nearby search, and the design value of the question is that it is not. A nearby business search indexes data that barely moves and serves it to many readers, while here every data point is a moving stream, with hundreds of thousands of location updates arriving each second, each of which must promptly reach the right small set of viewers. The heart of this system is fanout, the work of delivering one incoming message to its interested recipients, and the indexing tricks that dominate static search recede to a supporting role.

Interviewers reach for this question to see whether a candidate recognizes that shift, keeps ephemeral data out of durable storage, and can run the delivery arithmetic without flinching. The walkthrough below follows that arc, starting from scope and sizing, then building the socket and fanout machinery that moves the updates, and finishing with the edge behaviors and failure stories that make the feature feel dependable on a phone.

Scope and requirements

Functionally, a user who enables the feature sees a list of friends who also enabled it and are currently within a radius, say 5 miles, each with an approximate distance and a last-updated time, and the list updates within a few seconds as people move. Friendship already exists in a social graph owned by another service, and the friend count is capped by product design at a few thousand, which conveniently excludes the celebrity-broadcast problem before it starts. Location history can be recorded for optional features, but the live view is the product. Privacy is a first-class requirement rather than a checkbox at the end. Sharing is opt-in per user, distances shown are approximations rather than precise coordinates, and turning the feature off must stop sharing immediately, because a feature built on trust dies the first time it leaks a location someone meant to keep private.

The non-functional shape is low latency on a moving stream. An update should reach a nearby friend's screen within a couple of seconds, the system must tolerate users churning online and offline constantly, and absolute consistency is not required, since a distance that is 30 seconds stale is acceptable and a friend who briefly fails to appear is a nuisance rather than a correctness violation. Durability for live locations is explicitly not a requirement, and stating that early sets up the storage choice that follows, because the most expensive mistake available in this design is treating disposable data as precious.

Sizing the problem

Assume 100 million users have the feature enabled and 10 percent are active at any moment, giving 10 million concurrently online users. Each active client reports its location every 30 seconds, so the write rate is 10 million divided by 30, about 333,000 location updates per second, sustained, and that single number disqualifies most leisurely designs. The 30-second interval is itself a design decision, balancing battery drain on the phone against freshness on the friend's screen, and halving it would double every number downstream. Fanout multiplies the write rate further. Suppose an average user has 400 friends, of whom 10 percent are online, so each update is of interest to about 40 people. Multiplying 333,000 by 40 gives roughly 13 million potential deliveries per second, and only after filtering by the radius does the number an actual phone receives shrink, since most online friends are in other cities. If a tenth of those candidates are within range, the system still pushes over a million messages per second to handsets.

Storage, by contrast, is almost nothing. A live location is a user ID, latitude, longitude, and timestamp, perhaps 100 bytes with overhead, and 10 million of them total 1 GB, which fits in a single Redis instance by size, though the 333,000 writes per second argue for a small sharded cluster. The asymmetry between those two paragraphs is the lesson of this section, because the system is bounded by message movement rather than by data volume, and every later choice should favor cheap delivery over careful storage.

The interface

Polling every few seconds from 10 million phones would waste most requests, since a poll returns nothing new far more often than not, so the client holds a WebSocket, which is a persistent two-way connection that starts as an ordinary HTTP request and is then upgraded so either side can send messages at any time. The upgrade matters because it keeps the connection friendly to existing load balancers and firewalls while escaping the request-and-response rhythm that polling is stuck with. The same socket carries location reports up and friend updates down, and when the connection drops, as mobile connections constantly do, the client reconnects with a small randomized delay and the server repaints its view from the cache, so a flaky train ride degrades into brief staleness rather than errors.

GET /v1/nearby/ws HTTP/1.1
Upgrade: websocket
Connection: Upgrade
→ 101 Switching Protocols

client → server (every 30 s, or sooner on significant movement):
{ "type": "location", "lat": 37.7749, "lng": -122.4194, "ts": 1726000000 }

server → client (only for friends inside the radius):
{ "type": "friend_update", "user_id": 4821,
  "distance_mi": 0.8, "updated_at": 1726000004 }

{ "type": "friend_offline", "user_id": 4821 }

Where the data lives

Live locations belong in a cache, not a durable database, and the reasoning is worth saying explicitly in an interview. The value of a location decays to zero within a minute, every write is overwritten by the next report, and losing the data costs nothing because the next 30-second cycle repopulates it, so paying for write-ahead logs and replication in a durable store buys durability nobody asked for at 333,000 writes per second. A Redis cluster stores one entry per user, location:{user_id} mapping to coordinates and a timestamp, written with a TTL, a time-to-live after which the cache deletes the key automatically, set to a small multiple of the report interval such as 60 seconds. The TTL earns its place twice over, since it caps memory growth and its expiry doubles as the went-offline signal, because a user whose phone stops reporting simply vanishes from the cache, and the feature can tell friends they went offline without any heartbeat bookkeeping. An interviewer will sometimes press on what happens if a cache machine reboots, and the satisfying answer is that thirty seconds later the world has repainted itself, which is a recovery story very few systems get for free.

The one durable table in the system is optional location history, kept well away from the hot path, and a later section covers how it stays out of the way.

The high-level architecture

Clients connect through a load balancer to a tier of WebSocket servers, which unlike ordinary web servers are stateful in one specific way, namely that each holds the open sockets of its connected users, so a message for a given user must reach the particular server holding that socket. That single piece of state is what the rest of the architecture organizes itself around, because it rules out treating servers as interchangeable per request. Location reports are written to the cache and published to a pub/sub cluster, and the WebSocket servers subscribe to the channels of their users' friends, consult the friends service on connect to learn which channels those are, and stream history out asynchronously.

Location reports flow up the socket, into the cache, and onto pub/sub channels; deliveries flow back down the same sockets. Dashed paths are asynchronous or only sometimes taken.

Fanout through pub/sub

Pub/sub, short for publish/subscribe, is messaging where senders publish to named channels without knowing who is listening and subscribers receive everything published to channels they have joined, and it solves this system's central routing problem cleanly. Give every user a channel named by their ID. When Alice's update arrives, her WebSocket server publishes it to channel:alice once, and every server holding a socket for one of Alice's online friends has already subscribed to that channel, so the pub/sub layer routes the update to exactly the machines that need it without any server maintaining a global map of who is connected where.

The alternative worth naming is a connection registry, a shared table mapping each user to the server currently holding their socket, with senders looking up each friend and pushing directly to the right machine. It can be made to work, but every connect and disconnect must update the registry, every publish performs around 40 lookups, and stale entries misroute messages during exactly the churn storms that need the most care, so pub/sub, which maintains that routing implicitly through standing subscriptions, is the calmer design and the one I would defend.

The arithmetic shows why the cluster must be sharded, meaning the channels are divided across many pub/sub nodes by hashing the channel name. There are 10 million online users each subscribed to about 40 online friends' channels, which is 400 million standing subscriptions, and 333,000 publishes per second each fanning to about 40 subscribers makes 13 million channel messages per second flowing through the layer. No single node moves 13 million messages per second, but 100 nodes each owning a hash slice of the channels move 130,000 apiece, which is comfortable, and because a channel is pinned to a node by its hash, subscriptions and publishes for a given user always meet at the same place. Memory is similarly tractable, since 400 million subscriptions at tens of bytes each spread over the cluster is a few tens of GB total.

The delivery path then completes at the edges. A subscribed WebSocket server receiving Alice's update looks up its own connected user's last known location, which it holds in process memory, computes the distance between the two points, and forwards the update down the socket only if the result is inside the radius, so the expensive global question "which of Alice's friends are nearby" decomposes into thousands of independent local checks, each made by the one server that has both facts in hand. No central component ever computes the answer, which is precisely what lets the system scale, because a central nearby-computer would need every user's position and every friendship in one place. Updates for far-away friends die quietly at this filter, which is the majority case and the reason handset traffic stays near a million messages per second rather than 13 million.

Alice's update arrives at her server (step 1), is written to the cache with a TTL (step 2), and is published to her channel (step 3). Pub/sub then delivers it to every subscribed server (step 4), each computes the distance to its own connected user (step 5), and only in-radius updates travel down the socket to Bob (step 6).

Filtering, batching, and privacy at the edge

The edges of the system are where battery, bandwidth, and privacy are won. On the way up, the client need not report on a fixed clock, since phones can report sooner on significant movement and stretch the interval when stationary, which means a user sitting in a cafe for an hour generates a handful of updates rather than 120, and the phone's battery is the direct beneficiary. On the way down, a server can batch several friends' changes into one socket message per interval, and it can suppress an update entirely when a friend's movement would not change the displayed distance, since a value shown as 0.8 miles need not be re-sent because the true distance moved from 0.81 to 0.79. Precision rounding serves privacy as well as efficiency. The server forwards distances rounded to a coarse step such as a tenth of a mile and never forwards raw coordinates of the other person, so even a modified client cannot read a friend's exact position from the wire, only the approximate distance the product intends to show, and the privacy promise holds against the client itself rather than depending on the app behaving.

Presence is the remaining edge behavior. When a user opens the app, their server fetches the friend list, filters to those with the feature enabled, subscribes to each of their channels, and reads each friend's current entry from the location cache to paint the initial screen, so a connect costs tens to hundreds of subscribe operations plus cache reads. Morning peaks make this burst real rather than theoretical. If 100,000 users open the app in the same second, the pub/sub cluster absorbs several million subscribe operations in that second, which is why subscribe must be a cheap in-memory operation, why connect storms after an outage need jitter, meaning randomized reconnect delays spread over tens of seconds, and why a server should subscribe lazily where the product allows it. Going offline is the TTL doing its quiet work, because the cache entry expires on its own, and either the expiry event or the next failed read tells friends' servers to send friend_offline, so the feature's sense of presence is a side effect of cache hygiene rather than a separate subsystem to operate.

A latency budget from report to screen

The couple-of-seconds freshness target deserves an itemized budget, because every hop in the fanout path contributes something. Alice's phone takes roughly 50 ms to deliver a report over the socket to her server. Writing the cache entry and publishing to her channel are both single-digit millisecond operations against in-memory systems in the same data center. Pub/sub delivery to the subscribed servers adds a few more milliseconds, the distance check is microseconds of arithmetic, and the push down Bob's socket is another 50 ms of wide-area network. The end-to-end sum lands near 150 ms, which means the freshness a user actually experiences is dominated not by the pipeline but by the 30-second reporting interval itself, and that is the right place for the slack to live because the interval is the knob the product can tune deliberately. When the dashboard shows delivery latency creeping from 150 ms toward whole seconds, the usual culprits are a pub/sub node near its message ceiling or a socket server paging under too many connections, and both show up in per-node metrics well before users notice anything wrong.

Location history off the hot path

If the product wants history, perhaps to show "you and Dana were both at the park," updates are additionally enqueued to a stream and written by separate consumers into a time-series store such as Cassandra, partitioned by user with time-ordered rows. The essential discipline is that this path is fully asynchronous and lossy-tolerant, so a slow history writer can lag minutes behind without delaying a single live update, and the live path never reads from it. Writing history synchronously on the hot path would be the tempting shortcut, and it loses because it welds the latency of a disk-backed store onto a pipeline that otherwise never touches disk. History also carries the heaviest privacy obligations in the system, deserving short retention, encryption, and deletion that actually deletes rather than merely hiding.

Scaling, failures, and operations

Each tier scales along its own axis. WebSocket servers are bounded by concurrent sockets and per-update CPU, and at a conservative 100,000 sockets per node, a figure comfortably within what an event-loop server sustains when most sockets are idle at any instant, 10 million users need about 100 nodes; the load balancer must route by consistent session rather than per request, since the socket lives where it was opened. The pub/sub cluster scales by adding nodes and rehashing channel slices, the location cache shards by user ID across enough Redis nodes to absorb 333,000 writes per second, which a handful of nodes handles at tens of thousands of writes per second each with replicas for read capacity, and the friends service is read-mostly and caches well.

The failure modes have a pleasing self-healing flavor because the data refreshes every 30 seconds. A WebSocket server dies and its users reconnect through the load balancer to other nodes, which re-subscribe and repaint state from the cache, with jitter keeping the stampede polite; from the phone's side the app shows a brief reconnecting state, then the friend list returns at most half a minute stale. A pub/sub node dies and the channels it owned go silent until a replica or a rehash takes over, but the gap costs at most a cycle or two of updates rather than any data loss, since the next reports flow through the recovered channels on their own. A cache shard loss is the same story, as entries repopulate within one reporting interval. The corner that needs real care is the offline signal. A friend disappearing should mean the friend actually went offline rather than that a pub/sub slice is down, so servers should distinguish a TTL expiry from a delivery gap before telling users someone left, and the distinction matters because a false offline message actively misleads while a stale distance merely lags. Operationally, the dashboards that matter are end-to-end delivery latency from report to friend's screen, socket counts per node, pub/sub messages per second per node against capacity, and subscription churn during connect storms.

Follow-up questions

Why WebSockets instead of polling? Polling every few seconds from 10 million clients comes to roughly 3 million requests per second, almost all of them returning nothing new, and it still cannot push a timely update between polls. A persistent socket costs one connection's worth of server memory and delivers in both directions immediately, so it wins on every axis except operational familiarity, and that gap has closed as socket tiers became routine infrastructure.
Why not store live locations in a durable database? The data expires in seconds and is rewritten every 30, so durability buys nothing anyone asked for, while 333,000 writes per second is expensive against a write-ahead log but easy against memory. The TTL even doubles as the offline signal, which a durable store would need extra machinery to replicate.
What if Redis pub/sub drops messages during a failover? Accepting the loss is the right call here. The next 30-second report repairs the view on its own, which is why at-most-once delivery is the correct semantics, and a system that truly needed guaranteed delivery would reach for a replicated log like Kafka and accept the higher latency and operational weight that come with it.
How would a celebrity with 10 million followers break this? One publish would fan out to millions of subscribers and melt the node owning that channel, which is why fanout-on-write only works when the audience is capped. The product caps friends at a few thousand, and a true broadcast feature would flip to fanout-on-read, where viewers pull the location when they look, paying the cost at read time where it spreads naturally.
How does the radius check work if the receiving server only knows its own user? Each update carries the sender's coordinates through the channel, and the receiving server holds its connected user's last position in process memory, so the distance computation is local arithmetic with both operands already at hand and no extra lookup sits on the delivery path.
What changes at 10x users? Every tier multiplies nodes along its existing axis, with more socket servers, more channel shards, and more cache shards. The design holds because no component keeps global state, and the one genuinely global structure, the friend graph, is owned elsewhere and read-cached, so growth becomes a provisioning exercise rather than a redesign.

References

Xu, System Design Interview, Volume 2 (2022), chapter on nearby friends.
Redis, Redis Pub/Sub documentation, on channels and delivery semantics.
MDN, The WebSocket API, on the upgrade handshake and message framing.
IETF, RFC 6455: The WebSocket Protocol (2011).