Design a distributed message queue

A message queue sits between services that produce data and services that consume it, holding messages durably until the consumers are ready to read them. Putting a queue in the middle buys decoupling, which means the producer no longer cares whether the consumer is fast, slow, or down for a deploy, because the queue absorbs the difference. An order service can emit an event and move on while the email service, the analytics pipeline, and the fraud checker each read that event at their own pace. The same buffer smooths traffic spikes, so a flash sale that triples write volume for ten minutes becomes a backlog that drains over half an hour rather than a cascade of timeouts. A queue also gives one event many readers, the fan-out that direct service-to-service calls handle poorly, because a producer that calls five downstream services synchronously inherits the availability and latency of all five at once, while a producer that appends to a queue inherits neither.

Interviewers like this question because it is really a storage and replication question wearing a messaging costume. A strong answer has to explain how bytes hit disk, how order is preserved, what happens when a machine dies mid-write, and what the word delivered even means once retries and crashes enter the picture. The difficulty concentrates in three places, namely the on-disk log structure that makes the throughput possible, the replication protocol that makes durability a checkable property rather than a hope, and the delivery semantics that determine whether a consumer can see a message twice or never.

Scope and requirements

The first decision to surface is which kind of queue we are building, because the word covers two different products. A point-to-point queue delivers each message to exactly one of several competing workers, which is what a pool of image resizers wants. A publish-subscribe system delivers each message to every subscribed group, which is what an order event needs when billing, search indexing, and analytics must each receive it independently and none of them should be able to starve the others. The design below is a log-based system in the style of Kafka that supports both patterns through one mechanism, and saying that choice out loud early matters because it shapes every later decision, from how the disk is laid out to what a consumer crash costs.

Functionally, producers append messages to named topics, consumers read from topics in order and can replay old messages, and messages survive broker crashes once acknowledged. Non-functionally, the system should sustain on the order of a gigabyte per second of incoming data, deliver messages with end-to-end latency in the low tens of milliseconds under normal load, retain data for days so consumers can recover from their own outages, and keep accepting writes when any single machine fails. Days of retention earn their cost the first time a consumer team ships a defect on Friday and needs to reprocess everything since Thursday. Ordering deserves the most careful wording of all. Promising order globally across a topic is neither possible at this scale nor usually needed, so the requirement worth writing down is order per key, meaning all events about customer 42 arrive in the order they were produced while events about different customers may interleave freely.

Sizing the problem

Concrete numbers keep the design grounded, so it is worth doing the arithmetic slowly. Assume one million messages per second on average with a 1 KB average message, which multiplies out to 1 GB/s of incoming data, and plan for a peak of twice that, since launches and incidents can double the steady state without warning. With a replication factor of three, every byte is written three times, so the cluster absorbs 3 GB/s of disk writes on average and 6 GB/s at peak. The read side adds its own load on top. If two consumer groups each read the full stream, outbound traffic is 2 GB/s toward consumers, and the followers themselves fetch another 2 GB/s from leaders to stay in sync, about 4 GB/s of network egress before any peak multiplier.

Retention dominates storage, and the chain of multiplications is short. Keeping seven days of data at 1 GB/s means 86,400 seconds per day times 1 GB, about 86 TB per day, around 605 TB for the week, and roughly 1.8 PB once tripled by replication. Brokers with 20 TB of usable disk each put the floor at about 90 machines, and that count conveniently also covers the bandwidth, because 1 GB/s of leader writes spread over 90 brokers is about 11 MB/s per broker, rising to roughly 33 MB/s of disk writes once the replica copies land on the same fleet, comfortably inside what one machine with modern NICs and disks can sustain. Parallelism gets sized last. If each partition should carry no more than 10 MB/s so that a single consumer can keep up with one partition, then 1 GB/s needs at least 100 partitions on the main topic, and choosing 200 leaves headroom for hot keys and for growth, since adding partitions later moves keys around.

The interface

The client API is small, and writing it out makes the later discussion of acknowledgments and offsets concrete. Producers send key-value messages to a topic and choose how much confirmation to wait for before treating the send as successful. Consumers join a named group, poll for batches, and periodically record how far they have read. The smallness is deliberate, because every operation the broker declines to offer, such as deleting an individual message or querying by field, would force it to track per-message state and give up the sequential disk behavior the next section depends on.

producer.send(
    topic="orders",
    key="customer-42",        # same key, same partition, same order
    value=payload_bytes,
    acks="all",               # 0, 1, or all: the durability dial
)

consumer = Consumer(group="billing", topics=["orders"])
while True:
    batch = consumer.poll(max_records=500)
    for message in batch:
        process(message)      # message.offset, message.key, message.value
    consumer.commit()         # record progress for the group

The log on disk

The data model is the heart of the system. A topic is split into partitions, and each partition is an append-only log, which is a file that only ever grows at the end, where every message gets a sequential number called an offset. The broker never updates a message in place and never deletes from the middle, which means the disk only ever sees sequential writes, and sequential disk I/O is fast in a way that random I/O is not. A spinning disk that manages a few hundred random operations per second can stream sequential writes at 100 MB/s or more, and even on SSDs the sequential pattern wins because it batches well and plays nicely with the operating system's page cache, the slice of RAM the kernel uses to hold recently touched file data. Consumers that are keeping up read the tail of the log straight out of that cache, so a healthy cluster's disks spend most of their effort on writes. The design this beats is any queue that updates per-message state on every delivery, because marking individual messages consumed scatters small random writes across the disk, and that difference is what separates a million messages per second on modest hardware from a tenth of that on better hardware.

On disk, a partition is a directory of segment files, each a fixed-size chunk of the log named by the offset of its first message, plus a small index mapping offsets to byte positions so a consumer can start reading anywhere without scanning forward from the beginning of the file.

topic "orders", partition 7
  00000000000000000000.log     closed segment, offsets 0 to 511,999
  00000000000000512000.log     active segment, appends go here
  00000000000000512000.index   sparse map: offset to byte position

record at offset 512,041:
  timestamp | key "customer-42" | value (1,024 bytes) | checksum

Segments are what make retention cheap, because deleting old data means deleting whole files from the front of the directory rather than rewriting anything. Retention is configured by time or size, say seven days or 1 TB per partition, whichever comes first, enforced by a background thread off the hot path. Some topics want a different policy called log compaction, where the broker keeps at least the latest message for each key and discards older versions. Compaction turns the topic into a durable changelog, since replaying it from the start rebuilds the current state of every key, which is exactly what a cache-warming or database-mirroring consumer needs after losing its own state.

The high-level architecture

A cluster of brokers holds the partitions, with each partition replicated on a few brokers. One replica is the leader, which takes all reads and writes for that partition, and the others are followers that copy the leader's log. Funneling everything through the leader is a choice worth defending, because letting clients read from followers would spread load but would also let a consumer observe a replica that has not caught up, seeing messages appear and then vanish on failover, and a queue whose ordering promise sometimes fails is worse than one that costs modestly more. Producers learn the partition-to-leader mapping from cluster metadata and write directly to leaders, with no routing tier in between to add a hop or a fleet to operate. Consumers organize into groups, and the partitions of a topic are divided among each group's members. A coordination service, ZooKeeper historically and an internal Raft quorum in newer designs, stores the metadata and elects a controller, the broker responsible for reassigning leadership when machines fail. Raft here just means a consensus protocol, a way for a handful of machines to agree on a sequence of decisions even while some of them crash.

One topic with partitions P0, P1, and P2, where each broker leads one partition and follows another. The producer routes each message by key hash to the owning leader, both consumer groups independently read every partition, and the dashed arrows carry the background replication traffic.

Producers: partitioning, batching, and acknowledgment

The producer decides which partition each message lands in, normally by hashing the message key, so every event for customer 42 maps to the same partition and therefore stays in order. Messages without keys are sprayed round-robin purely for balance. The client library batches messages per partition, waiting a few milliseconds or until a batch fills before sending, and compresses each batch before it leaves the process. Batching is where much of the throughput comes from, because sending a thousand 1 KB messages as one 400 KB compressed request costs a fraction of a thousand small round trips in network overhead, in broker CPU, and in disk operations, since the broker appends the whole batch with a single write. The linger time is a real trade rather than a free win, in that a producer tuned for throughput waits longer to fill batches and adds a few milliseconds of latency, while a producer tuned for latency ships half-empty batches, and the right setting differs between a telemetry pipeline and a payment flow.

The acknowledgment level is the producer's durability dial, and all three settings deserve explanation because the differences between them are exactly where messages get lost. With acks=0 the producer fires and forgets, so a dropped connection silently loses data, which is acceptable only for metrics-grade traffic where a small gap is invisible. With acks=1 the leader confirms once the message is in its own log, so a leader that crashes after acknowledging but before any follower copied the message loses it permanently when a follower takes over, and the producer never finds out. With acks=all the leader waits until every in-sync replica has the message before confirming, so the write survives the loss of any single broker, at the cost of one extra network round trip inside the cluster, typically a few milliseconds. The dial belongs to the producing team per topic, because the same cluster carries billing events that want acks=all alongside click telemetry that can live with less.

Consumers: groups, offsets, and the ordering contract

A consumer group is a set of cooperating readers sharing one subscription, and the broker assigns each partition of the topic to exactly one member of each group. That single rule yields both messaging patterns, since workers competing for jobs are one group whose members split the partitions, while independent downstream systems are separate groups that each receive everything. Consumers pull batches from brokers rather than having brokers push, because a push design forces the broker to track every consumer's capacity and buffer for the slowest one, while a pull design lets each consumer fetch exactly as fast as it can process, turning a slow consumer into its own problem. Progress tracking pushes state to the edges the same way. Each group periodically commits its offset, the number of the next message it will read in each partition, to a durable internal topic, and the broker never tracks per-message state. After a crash, the group simply resumes from its last committed offset, and replay is nothing more exotic than rewinding a number.

The ordering contract follows directly from the structure, which is that order is guaranteed within a partition and nowhere else. A worked example makes the boundary clear. Suppose customer A's checkout produces events A1 then A2, both hashed to partition 0, and customer B's checkout produces B1 hashed to partition 1, with the real wall-clock order being A1, B1, A2. The consumer assigned partition 0 always sees A1 before A2, which is the promise that matters for correctness, but whether B1 is processed before or after either of them depends on scheduling and lag, and nothing in the system constrains it. Any design review that needs cross-customer ordering has discovered either a modeling problem or a reason to use a single partition and accept its throughput ceiling, around 10 MB/s in our sizing.

When a consumer joins or leaves a group, the partitions must be redistributed, which is called a rebalance. During a classic rebalance every member stops consuming, the assignment is recomputed, and members resume from the committed offsets, so a group of fifty consumers can stall for several seconds whenever one pod restarts. An operator experiences this as a sawtooth in end-to-end latency that lines up suspiciously well with deploys. Incremental and cooperative rebalancing protocols shrink the stop-the-world window by moving only the partitions that actually change hands, but the cost never reaches zero, which argues against designs that constantly churn group membership, such as autoscalers that flap consumer counts every minute.

Replication and what acknowledged means

Each partition has one leader and, with replication factor three, two followers on other brokers. Followers continuously fetch new records from the leader, append them to their own logs, and report how far they have gotten. From those reports the leader maintains the in-sync replica set, or ISR, which is the set of replicas that are fully caught up or within a small lag bound. A message counts as committed once every ISR member has it, and the leader exposes that boundary as the high watermark, the offset up to which consumers are allowed to read. Consumers never see uncommitted messages, which protects a reader from acting on data that a failover could erase, and that gap normally stays only a few milliseconds wide.

In step 1 the producer sends a batch with acks=all, and in step 2 the followers issue their continuous fetch requests, drawn dashed. Step 3 has the leader returning the new records for each follower to append, and in step 4, once every in-sync replica holds the batch, the leader advances the high watermark and acknowledges the producer.

Failure handling falls out of the ISR rule. If a follower falls behind or dies, the leader drops it from the ISR and keeps going, so one slow disk does not stall the topic, though a shrinking ISR deserves an alert, because a partition running on a single in-sync replica is one failure away from the corner case below. If the leader dies, the controller picks a new leader from the remaining ISR members, and because every ISR member has every committed message, nothing acknowledged is lost. The walk-through takes seconds end to end, in which producers writing to the dead leader see connection errors, refresh their metadata, find the new leader, and retry, so a producer configured with retries and acks=all rides through the whole event without losing a message. The dangerous corner is unclean leader election, which arises when every ISR member is gone and the only surviving replica is a stale one. The operator must then choose between availability, electing the stale replica and silently losing the tail of the log, and consistency, keeping the partition offline until a real ISR member returns. For anything transactional the only defensible default is to wait, and waiting should be the shipped default rather than a setting discovered mid-incident.

Delivery semantics, plainly

What a consumer observes depends on where failures land relative to two actions, namely processing a message and committing its offset. Committing before processing yields at-most-once, because a crash between the two skips the message forever, and a user experiences that as an email that never arrives with no error recorded anywhere. Processing before committing yields at-least-once, because a crash between the two replays the message after restart, and the user experiences that as two identical emails unless the consumer tolerates duplicates. Tolerating duplicates usually means making writes idempotent, where applying the same update twice leaves the same state as applying it once, for example by keying payment rows on a payment ID so that a replay overwrites an identical row instead of inserting a second charge.

Exactly-once is really two separate fixes layered on top of at-least-once. On the producer side, a retry after a lost acknowledgment can duplicate a write into the log, so an idempotent producer attaches a producer ID and a per-partition sequence number to every batch, and the broker discards any batch whose sequence number it has already appended, which makes retries safe without any coordination. Across the read-process-write pattern, transactions let a consumer atomically publish its output messages and commit its input offsets together, so a crash either rolls both back or carries both forward, and downstream readers in read-committed mode never observe half of the pair. The result is exactly-once stream processing within the system's boundaries, and the boundary deserves stating plainly, because the moment a consumer writes to an external store, that store has to carry the idempotence itself, usually as a unique key the write cannot duplicate.

A latency budget for one message

Tracing one billing event from producer to consumer shows where the low-tens-of-milliseconds target actually goes. The producer holds the message for up to 5 ms of linger time while a batch fills, spends well under a millisecond serializing and compressing its share, and pays about half a millisecond for the network hop to the leader inside a datacenter. The leader appends to its log in a fraction of a millisecond because the write lands in the page cache rather than waiting for the platter, the followers' next fetch picks the batch up within a few milliseconds, each follower append costs the same small fraction, and the acknowledgment travels back in another half millisecond. A producer using acks=all therefore sees its confirmation in roughly 8 to 12 ms, dominated by linger time and the follower fetch interval rather than by any disk. The consumer side adds its own poll cadence, a few milliseconds when the consumer is keeping up, so the event is processed about 15 to 25 ms after the producing service emitted it, which lands inside the requirement with room to spare.

The tail behaves differently from the median, and the tail is what pages people. A segment roll, where the broker closes one log file and opens the next, adds a brief stall, garbage collection pauses on a JVM-based broker can add tens of milliseconds at the 99.9th percentile, and a follower restarting after a deploy temporarily widens the commit gap because the leader waits for the slowest in-sync member. None of these break the design, but they explain why the latency requirement was stated for normal load rather than as an unconditional bound, and they make produce-to-acknowledge time per partition the one number a dashboard should track, since it degrades first under every failure above.

Scaling, failures, and operations

Each tier scales along its own axis. Topics scale by adding partitions, though adding partitions changes which keys map where, so events for one customer straddle the boundary moment, and over-partitioning modestly at creation is kinder than resharding later. The cluster scales by adding brokers and rebalancing partition replicas onto them, a background data move that is bandwidth-limited rather than disruptive. Consumers scale by adding group members up to the partition count, beyond which extra members sit idle, and that ceiling is exactly why the sizing exercise chose 200 partitions rather than the bare minimum of 100.

The metric that says the system is in trouble is consumer lag, the gap between the newest offset and each group's committed offset, measured per partition. Lag growing without bound means a consumer is too slow, and the queue's job is to make that a graceful condition, with the backlog accumulating on disk inside the retention window rather than back-pressuring producers immediately. Backpressure still exists at the edges, where producers feel it as batches queueing in their client buffers when brokers slow down, and eventually as sends that block or fail once those buffers fill. The right responses, in order, are bigger batches and compression, more partitions and consumers, and finally shedding or sampling low-value traffic. An alert should fire long before lag approaches the retention window, because a consumer that is seven days behind on a seven-day topic is about to lose data it never read.

Broker failures are routine rather than emergencies. Leadership for the dead broker's partitions moves to ISR followers within seconds, producers and consumers refresh metadata and reconnect, and the only client-visible effect is a brief latency blip, so the on-call's actual job is to replace the machine, watch it re-replicate its partitions from the new leaders, and confirm the ISR sets fill back out. The controller and metadata quorum are the small consistent core that must stay healthy, and they are protected by doing almost nothing on the hot path, since brokers serve data without consulting the controller, which only acts when membership changes.

It is worth closing with when not to use the design. A broker-style queue such as RabbitMQ or SQS tracks each message individually, supports per-message acknowledgment, redelivery of exactly the failed message, dead-letter queues for messages that repeatedly fail, and routing rules, and it deletes messages once consumed. For a work queue of heterogeneous jobs where one bad job must not block its neighbors and nobody ever replays history, that model fits better than a log. Saying so costs nothing, because the log buys throughput, per-key ordering, replay, and fan-out, and it charges for them by making per-message bookkeeping the consumer's problem, a price a job queue has no reason to pay.

Follow-up questions

Why partitions instead of one big log per topic? A single log caps throughput at one machine's sequential write speed, around 10 to 100 MB/s in practice, and serializes all consumers behind one reader. Partitions are the unit of parallelism, placement, and replication, so 1 GB/s falls out of 200 partitions at 5 MB/s each.
What exactly is lost with acks=1? The window between the leader appending a message and any follower fetching it. A leader crash inside that window promotes a follower that never saw the message, so the acknowledged write vanishes and the producer has no way to learn that it did. The window is only milliseconds wide, but at a million messages per second a few milliseconds is thousands of messages.
Can the queue guarantee global ordering across a topic? Only with a single partition, which gives up all parallelism and caps the topic at one consumer's read speed. The usable contract is per-key ordering via key hashing, and a consumer that needs cross-key order must either reorder with its own sequence numbers or accept the partition boundary, as almost every real workload does.
How does a new consumer group read from the beginning? Offsets are per group, so a fresh group simply starts at offset zero of each partition and streams forward through retained history. Replay is the log's most underrated property, because reprocessing last week's events after fixing a defect becomes a configuration change rather than a restore from backup.
What does log compaction give you that time retention does not? A topic that never forgets the latest value per key while still shedding old versions, so its size is bounded by the number of live keys rather than by elapsed time. That turns the queue into a durable changelog from which a cache, a search index, or a replica table can be rebuilt at any moment, which time-based retention cannot promise.
When would you pick SQS or RabbitMQ over this design? When the workload is a job queue, meaning it needs per-message acknowledgment and retry, dead-lettering of poison messages, no replay, and modest throughput. Operating ninety brokers to move a few hundred messages per second is engineering theater.

References

Apache Kafka, official documentation, on the log, replication, and the protocol.
Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (2013).
Kleppmann, Designing Data-Intensive Applications (2017), chapters on replication and stream processing.
Xu, System Design Interview, Volume 2 (2022), chapter on distributed message queues.