A chat system moves short messages between people in close to real time, and WhatsApp, Messenger, and Slack are the familiar shapes of it. The product sounds simple, since one person types a message and the other person sees it, but the machinery underneath is unlike most web systems, because the server has to push data to clients the moment it arrives rather than wait to be asked for it. Interviewers reach for this question because it forces a candidate out of the comfortable stateless request-response world and into stateful connections, message ordering, fanout to many recipients, and the surprisingly tricky problem of knowing who is online. The hard parts live in the connection tier that holds millions of open sockets, in the identifiers that keep a conversation in order, and in the arithmetic that decides whether a group message is written to every member's feed or read from a shared log. I will walk the design in interview order, from requirements and sizing through the message path to the failure stories that show whether it holds together when a server full of sockets dies.
Scope and requirements
The feature list I would confirm up front covers one-to-one chat, group chat up to a few hundred members with very large groups treated as a special case, text messages with small media handled by reference, message history that survives reinstalls, sync across a user's phone and laptop, online presence indicators, and the delivery states users expect to see, namely sent, delivered, and read. When the recipient is offline, the system must hand the message to a push notification service, which is the operating-system channel (APNs on iOS, FCM on Android) that can wake an app that is not running. Confirming the list out loud matters because each item drags a subsystem behind it, presence being a fanout problem bigger than messaging itself, while media handled by reference keeps bulk bytes off the chat path entirely.
The non-functional requirements are what shape the design more than the feature list does. Delivery should feel instant, ideally a few hundred milliseconds end to end when both people are online, because somewhere past half a second a conversation stops feeling live and starts feeling like email. Messages must never be lost once the sender sees the sent state, which forces persistence to happen before acknowledgment rather than after it, an ordering decision that will reappear in the delivery sequence. Within a single conversation everyone should see messages in the same order, though no ordering promise is needed across different conversations, and relaxing that second half is precisely what lets the system scale horizontally. Finally the design must tolerate the constant churn of mobile networks, where connections drop every few minutes and clients reconnect from new IP addresses, so every protocol decision gets tested against the question of what happens when the radio cuts out halfway through an exchange.
It is equally useful to say what the design leaves out. Voice and video calling is a media relay problem with its own infrastructure and deserves its own interview. Federation with other chat networks changes the trust model completely and is out. End-to-end encryption constrains features rather than architecture, so I flag early that I will return to it near the end rather than carry it through every section.
Sizing the problem
Assume 50 million daily active users sending 40 messages each per day. Multiplying gives 2 billion messages per day, and since a day has 86,400 seconds, the average send rate works out to about 23,000 messages per second. Traffic is far from flat, because usage concentrates in the evening hours of each region, and a five times multiplier over the average is a safe planning number, which gives roughly 115,000 messages per second at peak. Every message is also delivered at least once, so the connection tier handles double that in raw socket traffic, and once delivery receipts are added later the realistic planning figure for socket frames climbs to several hundred thousand per second. That arithmetic points the pressure squarely at the connection and write path rather than at reads.
Storage turns out gentler than the request rate suggests. At 100 bytes for a typical text message plus metadata, 2 billion messages a day is 200 GB per day, which over a year accumulates to about 73 TB. Even with indexes and replication tripling it, a couple hundred terabytes a year sits comfortably within a modest storage cluster, so capacity is not the worry. What the storage tier must survive is write throughput, tens of thousands of small appends per second sustained with evening peaks several times higher, and that observation will drive the database choice directly.
The connection count is the number that makes this system different from a web service. If a quarter of daily actives are connected at the evening peak, that is 12.5 million concurrent sockets. A well-tuned server holding mostly idle connections can manage on the order of a million sockets, so the arithmetic floor is around 13 chat servers, and running 25 to 30 gives headroom for failures and rolling deploys. Holding a socket costs almost no CPU but real memory, roughly tens of kilobytes of kernel and application state per connection, so a million sockets occupies tens of gigabytes of RAM on a single box, and the operating system needs tuning for file descriptor limits, which are the per-process caps on open connections that default far too low for this workload.
Connections and the interface
The first design decision is how the server pushes a message to a recipient who did not ask for it. Plain HTTP polling, where the client asks for news every few seconds, collapses at this scale, because 50 million connected clients polling every five seconds would generate 10 million requests per second, nearly all of which return nothing, while each user still waits an average of half the polling interval before seeing a message, and the constant radio wakeups drain phone batteries. Long polling improves on that by having the server hold each request open until a message arrives or a timeout passes, which removes the empty responses and most of the perceived lag, but it still pays connection setup and header costs on every cycle and keeps one outstanding request parked per client.
The right tool is a WebSocket, which is a persistent two-way TCP connection created by upgrading an ordinary HTTP request. Once established, either side can send frames at any time with a few bytes of overhead, so the server can push a message the instant it arrives, the client can send without any setup cost, and the battery wakes only when there is something to say. Long polling stays in the design as the fallback for the small fraction of corporate networks and aging proxies that mishandle the upgrade handshake, and the client library hides the difference so the rest of the system never knows which transport a given user is on.
GET /ws HTTP/1.1
Host: chat.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
HTTP/1.1 101 Switching Protocols
// after the upgrade, JSON frames flow both ways
→ { "type": "send", "channel_id": "c917", "client_msg_id": "a1b2", "text": "hey" }
← { "type": "ack", "client_msg_id": "a1b2", "msg_id": 4821, "state": "sent" }
← { "type": "message", "channel_id": "c917", "msg_id": 4822, "from": "u042", "text": "hi" }
← { "type": "receipt", "channel_id": "c917", "msg_id": 4821, "state": "read" }
The client_msg_id field matters more than it looks. Picture a phone on a weak signal that sends a message, never receives the acknowledgment because the radio dropped, and resends after a timeout. Without protection the conversation now shows the message twice, which users notice immediately and forgive slowly. Mobile networks drop acknowledgments as readily as they drop messages, so the server treats sends as idempotent, meaning a retried request carrying the same client-generated ID has no additional effect and simply returns the original result. Implementing that costs a small recent-ID table per sender, checked before each insert, and it converts an unavoidable network behavior into something invisible to both people in the conversation.
The data model
Messages are written once, never updated, read as recent ranges within one conversation, and arrive at 23,000 per second sustained. That access pattern fits a wide-column store such as Cassandra or HBase, which is a database organized as a sorted key-value structure where all rows sharing a partition key live together in write-optimized log-structured storage, meaning writes are appended sequentially rather than placed into a tree. A relational database fits poorly here, and not because SQL cannot express the schema, which is trivial. The trouble is that sustaining this insert volume against B-tree indexes, the balanced tree structures relational engines maintain on disk, causes constant page splits and random writes, so the relational system would need heavy sharding anyway, while joins and multi-row transactions, the machinery that justifies the engine, never run on the hot path. Choosing the wide-column store means paying for exactly the two operations the workload performs, cheap appends and sequential range reads.
CREATE TABLE messages (
channel_id BIGINT, -- one-to-one pair or group
msg_id BIGINT, -- ordered within the channel
sender_id BIGINT,
body TEXT,
created_at TIMESTAMP,
PRIMARY KEY ((channel_id), msg_id)
) WITH CLUSTERING ORDER BY (msg_id DESC);
CREATE TABLE inbox (
user_id BIGINT, -- per-user delivery feed
seq BIGINT, -- per-user sequence number
channel_id BIGINT,
msg_id BIGINT,
PRIMARY KEY ((user_id), seq)
);
Partitioning by channel_id keeps each conversation's history physically together, so the common query, which asks for the latest fifty messages in this channel, becomes one sequential read from one node rather than a scatter across the cluster. The inbox table is the second half of the model and the one that multi-device sync will lean on later. Every message addressed to a user lands there with a per-user sequence number, so a device can ask for everything after sequence 90,210 and receive one ordered answer covering all of its conversations at once, which is a far better reconnect primitive than querying every channel separately. The cost of maintaining the inbox is one extra small write per recipient per message, and that cost is the subject of the group fanout discussion below.
The high-level architecture
Clients hold a WebSocket to a chat server, and because a socket lives on exactly one machine, the tier is stateful, so the system must always be able to answer which server holds a given user's connection right now. That answer lives in a session registry, a small key-value store mapping user ID to server, written when a connection is established, refreshed by heartbeats, and cleared when the connection drops or the entry's time-to-live lapses. A plain load balancer cannot replace it, because load balancers route new inbound requests, while delivery needs to find one specific existing socket among millions.
Between acceptance and delivery sits a message queue, which decouples the sender's server from the recipient's so that persistence and delivery proceed even when the recipient's server is slow or briefly gone. The alternative worth naming is direct server-to-server RPC, where server A calls server B synchronously, and it loses for a concrete reason. A slow server B would stall server A's threads while they wait, retries during a deploy would multiply traffic exactly when capacity is lowest, and a message in flight during B's crash would simply vanish unless A built its own buffering, at which point A has reinvented a worse queue. With a real queue, the message rests durably until B consumes it, and per-recipient partitioning preserves ordering. The message store and the push notification service hang off this same path, as the diagram shows.
Chat servers hold the sockets, the session registry records which server holds whom, and the queue decouples persistence from delivery, while the dashed branch hands the message to the push service whenever no live socket exists for the recipient.
The life of a one-to-one message
The sequence below is where the architecture earns its keep, and the ordering of the steps is deliberate. Persisting the message before the sender sees the sent state means a crash anywhere after that point can delay delivery but never lose data, which is exactly the durability promise the requirements demanded. Reversing the order would shave a few milliseconds off the sender's perceived latency and quietly break the promise every time a server dies in the gap, so the slower ordering wins without much argument.
The sender writes over its socket (1), server A persists the message and assigns its channel sequence number (2), then acknowledges sent back to the sender (3). Server A forwards to server B, located through the session registry (4), server B pushes the frame down the recipient's socket (5), and the recipient acknowledges so that a delivered receipt flows back along the reverse path (6). When the registry shows no socket, the dashed branch (7) hands the message to the push service instead.
The message ID assigned in step 2 carries the ordering of the whole conversation, so it deserves a careful moment. The clean option is a per-channel sequence, a counter scoped to the conversation so that messages get IDs 1, 2, 3 within that channel and sorting by ID is exactly the order everyone should see. Maintaining the counter costs an atomic increment on the channel's partition, which the wide-column store performs cheaply because the partition lives on one node at a time, so the coordination is local rather than cluster-wide. The looser alternative is a time-ordered global ID in the style of Twitter's Snowflake, a 64-bit number packing a millisecond timestamp, a machine identifier, and a small counter, which needs no coordination at all but only orders messages as well as the machines' clocks agree, so two messages sent in the same millisecond through different servers can sort either way. Client timestamps are worse still, since phone clocks drift by minutes and users adjust them. For chat the per-channel sequence wins, because people notice a reply rendered above its question instantly, and the channel partition the counter lives on is already the unit of storage the write was touching anyway.
A latency budget for one message
Adding the milliseconds makes the few-hundred-millisecond requirement concrete. The sender's frame crosses a mobile network to server A in half a round trip on a connection that already exists, typically 30 to 60 milliseconds on LTE, with no TLS handshake to pay because the socket was established at app start. Persisting with a quorum write, meaning the store waits for a majority of replicas to confirm, costs 5 to 10 milliseconds inside the datacenter, and the acknowledgment spends another half round trip, so the sender's tick mark appears roughly 80 to 130 milliseconds after the tap. Delivery continues in parallel, with a registry lookup costing about a millisecond against a memory store, a queue hop a few more, and the push down the recipient's socket another half round trip, which lands the message on the recipient's screen around 150 to 250 milliseconds after the send. That leaves comfortable margin under the budget, and the margin exists to be spent on bad days, when a congested cell tower adds 200 milliseconds to a radio leg and the conversation still feels live.
The same budget explains what users see when a leg fails. If persistence succeeds but the recipient's server is slow, the sender still gets the sent tick on time, the message waits in the queue, and the recipient sees it a few seconds late with nothing lost, so the failure reads as a sluggish moment rather than an error. If the write itself fails, the sender's client never receives the acknowledgment, retries with the same client ID, and either succeeds idempotently or eventually surfaces a visible failed state with a manual retry arrow, which is the correct experience because the one thing the product must never do is show sent for a message that no server holds. The offline branch has the loosest budget of all, since push notification services make no latency promise, and a message delivered by push commonly takes one to several seconds to wake the phone, which users tolerate because the app was closed.
Multi-device sync and delivery states
One user is several devices, and a message must reach the phone in a pocket and the laptop at a desk, whichever connects next. The per-user inbox from the data model is the mechanism. Every inbound message lands in the recipient's inbox with an increasing per-user sequence number, and each device keeps a cursor, which is simply the highest sequence number it has applied. A device that reconnects after a subway ride sends its cursor, the server returns everything after it in one ordered batch, and the device is current again without any per-conversation bookkeeping or risk of gaps, because a contiguous sequence makes missing entries detectable. The same inbox covers the sender's own other devices, since a message sent from the phone is also placed in the sender's inbox, which is how the laptop learns to render the outgoing message in the right place in the thread.
Delivery states ride the same rails as messages because they are messages, just tiny ones, and treating them uniformly avoids building a second delivery system. Sent is the acknowledgment in step 3 of the sequence, issued the moment the store accepts the write. Delivered is a receipt frame the recipient's device emits when the message arrives, traveling back through the recipient's chat server into the sender's inbox, where it updates the tick mark on every one of the sender's devices at once. Read is the same receipt with a different state, emitted only when the conversation is actually on screen, which is why it can trail delivered by hours. The volume is worth respecting, since every message can generate two receipts and therefore roughly triple the original write count, and that multiplication is one more reason the storage tier was chosen for cheap appends. In groups the read state is kept per member but surfaced as an aggregate such as read by 12, because rendering two hundred individual receipt streams helps nobody and the per-member detail remains queryable when someone asks.
Group chat, fanout, and presence
A group message must reach every member, and there are two ways to arrange that. Fanout on write copies the message into every member's inbox at send time, so each device syncs exactly one stream, its own inbox, and reads stay as cheap as one-to-one chat. Fanout on read writes the message once to the group's channel and makes each device pull from every channel it belongs to when it comes online. The arithmetic picks the policy rather than taste. For a 50-member group, fanout on write costs 50 inbox appends of roughly 100 bytes each, about 5 KB per message, which is a trivial price for the simplicity it buys on every subsequent read. For a 100,000-member announcement channel the same message would cost 100,000 writes, and a hundred messages a day in such a channel becomes 10 million daily writes for content most members will never scroll far enough to see, so large channels flip to fanout on read, where the message is written once, devices query the channel when it is opened, and only a lightweight hint that the channel has news is pushed. A threshold of a few thousand members separates the regimes, and the right value is found by measuring where inbox write volume starts to crowd out regular traffic rather than by argument.
Presence, the green dot beside a name, is a soft-state system layered alongside messaging, where soft state means data that is allowed to be slightly stale because it is continuously refreshed. Each connected client sends a heartbeat, a tiny keepalive frame every 30 seconds or so, and the chat server records last-seen timestamps in a memory store, showing a user online if a heartbeat arrived within the last minute. That window means the indicator can lag a real disconnect by up to a minute, and the imprecision is accepted across the industry because the alternative, marking users offline on every dropped packet, produces a storm of flapping transitions on each cellular hiccup that is worse than the staleness. Propagating changes is where the arithmetic bites again, because 12.5 million connected users each notifying a few hundred friends on every transition multiplies into billions of tiny updates per hour, so high-fanout pushes are simply not attempted. Clients instead pull presence lazily for exactly the users currently on screen, and pushed updates are reserved for small friend lists and the open conversation, which covers what people actually look at.
End-to-end encryption deserves its promised acknowledgment even when the interviewer leaves it out of scope. In an end-to-end design the devices hold the keys and encrypt for each other, so the server still routes, stores, and fans out ciphertext exactly as drawn, and the session registry, inbox, and queue all survive unchanged. What changes is everything that read message content, because the server can no longer index bodies for search, scan for abuse centrally, or render link previews, so search moves on-device, moderation leans on reports and metadata, and the message store becomes a blind mailbox. Saying out loud that the architecture survives while the feature set bends is exactly the kind of boundary an interviewer wants drawn.
Scaling, failures, and operations
Each tier scales by its own rule, and naming the rule per tier is more useful than declaring the system horizontally scalable. Chat servers scale by adding boxes, but unlike stateless services they cannot hide behind a plain load balancer for delivery, because a specific user lives on a specific server, and the session registry absorbs that coupling. The registry must itself be a small replicated store, Redis with replicas or etcd, since losing it strands every in-flight delivery until reconnects rebuild it. The message store scales by partitions, and because partition keys are channel and user IDs, load spreads evenly except for the occasional enormous channel, which the fanout-on-read policy already keeps from becoming a hot partition. The queue scales by topic partitions keyed on recipient, which preserves per-user ordering while spreading throughput across brokers.
The interesting failure is a chat server dying with a million sockets on it. Every affected client notices the dead TCP connection within seconds, reconnects through the load balancer to a surviving server, and the session registry is updated as each socket lands. Messages sent to those users during the gap simply rest in their inboxes and arrive on the cursor sync that follows every reconnect, so the user experience is a few seconds of quiet followed by everything appearing in order, with nothing lost and nothing duplicated. The operational hazard is the reconnect itself rather than the crash, because a million clients retrying at once form a thundering herd, the surge that arrives when many waiters wake simultaneously, and it can knock over the next server in line. Clients therefore back off with jitter, meaning each one waits a randomized and growing delay before retrying so the herd spreads into a stream, and servers shed load by accepting connections before processing backlogs. Deploys get the same respect by draining sockets from a box gradually instead of restarting it whole.
Push notifications are the one tier the system does not control, so they are treated as best effort, with delivery considered confirmed only when the device actually syncs and the queue retaining undelivered messages until then. Monitoring leans on a few numbers that describe the product rather than the hardware, which are socket count per server, end-to-end delivery latency sampled by echo messages that the system sends to itself, inbox backlog age, and registry staleness. Each one maps to a user-visible symptom, so an alert on any of them describes a thing a person is experiencing rather than a graph that merely looks unusual.
Follow-up questions
- Why WebSockets instead of long polling everywhere? Long polling re-establishes a request per message and parks one outstanding request per idle client, while a WebSocket amortizes a single TCP connection across the whole session with a few bytes of overhead per frame, which also spares the phone radio. Long polling survives only as the fallback for networks that mishandle the upgrade handshake.
- Where does message order come from? Order comes from a per-channel sequence number assigned at persistence time on the channel's partition. Client timestamps cannot be trusted because phone clocks drift and get adjusted, and server clocks differ across machines, so an assigned sequence is the only ordering that holds up, and it costs one atomic increment on a partition the write was already touching.
- What happens when a chat server dies? Clients detect the dead socket within seconds and reconnect with jittered backoff, the registry is updated as they land, and the cursor-based inbox sync replays anything missed during the gap. Because persistence happened before acknowledgment, the failure delays delivery rather than losing messages, and the user sees a brief pause instead of an error.
- Why not MySQL for messages? The access pattern is append-only writes at tens of thousands per second plus range reads within a partition, which log-structured wide-column stores serve natively. A relational engine can be sharded into doing the same work, but it would spend its effort maintaining B-tree indexes under random writes while none of its joins or transactions are used on this path.
- How do you avoid duplicate messages on retry? The client attaches a client-generated message ID to every send, and the server treats sends as idempotent by checking a recent-ID table before inserting, returning the original result for a retried ID instead of writing a second row. Retries then become invisible to both participants.
- How would end-to-end encryption change the design? Routing, storage, and fanout are unchanged because they move ciphertext just as well as plaintext, while every server-side feature that read content has to move or disappear, so search shifts on-device, previews are generated by clients, and moderation falls back to reports and metadata.
References
- Xu, System Design Interview, Volume 1 (2020), chapter on chat systems.
- Fette and Melnikov, RFC 6455: The WebSocket Protocol (2011).
- Discord Engineering, How Discord Stores Billions of Messages (2017).
- High Scalability, The WhatsApp Architecture Facebook Bought for $19 Billion (2014).
- Kleppmann, Designing Data-Intensive Applications (2017), on partitioning and ordering.