Design a distributed email service

This question asks for a webmail product at Gmail's scale, meaning a billion accounts that can send, receive, read, search, and label mail, with spam filtered before it ever reaches the inbox. What makes email different from every other messaging design is that the service does not own both ends of the conversation. A chat system invents its own protocol and talks only to itself, while an email service must interoperate with every other mail server on the planet through protocols standardized decades ago, and it inherits their semantics, their retry conventions, and their abuse. Interviewers like the question because it combines a high-volume ingestion pipeline, a storage problem measured in exabytes, a per-user search engine, and an adversarial subsystem in spam, where the opponent adapts to whatever defense is built. The hard parts live in the receive pipeline, which must never lose a message it has accepted, in the storage layout that makes a billion mailboxes affordable, and in the send path, where getting mail delivered turns out to be as much about reputation as about engineering. I will take those in interview order, sizing first, then the two pipelines, then search and spam, and close with how the whole thing fails and recovers.

Scope and requirements

The functional scope covers receiving mail from any server on the internet, delivering it to the right mailbox with spam scored on the way in, letting users read, label, archive, and delete through the product's own web and mobile clients, searching across a mailbox's full history, and sending mail to any address in the world. The product's own clients talk to an HTTP API, which is simpler and richer than the legacy protocols, but the service still speaks the standard ones at its edges, because interoperability is the whole point of email. SMTP, the Simple Mail Transfer Protocol, is how servers transfer mail to each other and cannot be negotiated away, while IMAP and POP, the older protocols that let third-party clients read a mailbox, are supported as legacy access for the minority of users who still want them, kept behind their own adapter tier so the core system never bends its data model around them.

Non-functionally, accepted mail must never be lost, which makes durability the first constraint on the receive path, because once this service tells a sending server that a message was accepted, responsibility has transferred and the sender will delete its copy. Inbox reads should render in a couple hundred milliseconds so the product feels instant, search should answer in well under a second even across a decade of mail, and availability must be high enough that mail is effectively always receivable, since sending servers retry for only a few days before giving up and bouncing, which converts an extended outage directly into lost mail. Mail content is among the most sensitive data a person has, so encryption at rest and tight internal access controls are assumed throughout rather than listed as a feature.

Sizing the problem

Take 1 billion users receiving an average of 50 emails a day, a figure that counts the spam the system must process even when it never shows it to anyone. Multiplying gives 50 billion deliveries a day, and dividing by 86,400 seconds yields about 580,000 messages per second on average, with peaks two to three times higher as weekday mornings sweep across the major time zones. Every one of those messages passes through acceptance, scanning, and delivery, so the pipeline is built for millions of messages per second of headroom, and any component that cannot scale to that rate has no place on the receive path.

Storage is the number that forces the architecture rather than merely informing it. At an average of 50 KB per message with attachments amortized in, 50 billion messages a day is 2.5 PB of new mail daily, which compounds to roughly 900 PB a year and several exabytes over five years, all before replication multiplies it by three. Nobody stores that naively, and two levers bring it back within reach. The first is deduplication, which stores identical content exactly once, and it matters enormously for email because newsletters, notifications, and forwarded attachments are sent to thousands or millions of recipients each, so the corpus is full of repetition. A 5 MB attachment mailed to 1,000 recipients occupies 5 GB stored naively and 5 MB stored content-addressed, a thousandfold saving that lands precisely on the heaviest objects in the system. The second lever is tiering, which exploits the observation that mail older than a few months is rarely opened, so cold messages migrate to cheaper, denser storage with slightly slower retrieval, and the only thing a user ever notices is a moment's pause when opening a thread from ten years ago.

The client API

The product's own clients use an HTTP API that treats the mailbox as a database of threads and labels rather than a folder of files, which is the data model users actually think in, and notifications are pushed over a persistent channel so a new message appears without the client polling for it.

GET /api/v1/threads?label=inbox&limit=50
→ 200 { "threads": [ { "thread_id": "t8812", "subject": "Quarterly report",
                       "snippet": "Attached is the...", "unread": true } ] }

GET /api/v1/messages/m42751
→ 200 { "from": "ana@example.org", "body_ref": "blob/9f31c2...", "labels": ["inbox"] }

POST /api/v1/messages
{ "to": ["ben@elsewhere.com"], "subject": "Re: Quarterly report", "body": "..." }
→ 202 { "message_id": "m42788", "state": "queued" }

POST /api/v1/messages/m42751/labels   { "add": ["archive"], "remove": ["inbox"] }

Send returns 202 rather than 200 because sending is asynchronous by nature, so the message enters an outbound queue, and the actual delivery to the remote server happens, possibly with retries stretching over days, long after the API call has returned. Pretending otherwise with a synchronous response would force the client to wait on a foreign server's mood, and no mail product has ever worked that way. The label operations are small metadata writes against the user's partition, which is what lets archive and read-state changes feel instantaneous and sync across devices within a second or two.

The data model

The storage split that makes the system work keeps metadata apart from bodies, because the two have opposite shapes. Mailbox metadata, meaning threads, labels, read flags, and per-message envelope fields, is small, hot, and queried in user-scoped patterns, so it lives in a distributed database partitioned by user ID with strong consistency inside a partition, since a user who archives a thread on their phone must see it archived on their laptop immediately, and explaining away a resurrected thread as eventual consistency satisfies nobody. Bodies are large, immutable, and mostly cold, so they live in object storage and are content-addressed, which means the raw message in MIME format, the standard envelope-and-parts encoding that lets one email carry text, HTML, and attachments together, is hashed, and the hash serves as both its name and its dedup key. The thousandth copy of a newsletter therefore stores nothing new and simply increments a reference count, and the database keeps only the hash as a pointer.

CREATE TABLE messages (         -- partitioned by user_id
  user_id     BIGINT,
  message_id  BIGINT,
  thread_id   BIGINT,
  from_addr   TEXT,
  subject     TEXT,
  snippet     TEXT,
  labels      TEXT[],           -- inbox, sent, spam, user labels
  body_ref    CHAR(64),         -- content hash into object storage
  received_at TIMESTAMPTZ,
  PRIMARY KEY (user_id, message_id)
);

CREATE TABLE blobs (
  body_ref    CHAR(64) PRIMARY KEY,
  size_bytes  BIGINT,
  refcount    BIGINT            -- dedup bookkeeping
);

The alternative worth naming is storing bodies in the database alongside the metadata, and it loses on every axis that matters here, since multi-megabyte rows wreck the cache behavior of the hot metadata pages, replication traffic balloons with bytes that never change, and the database's transactional machinery is wasted on content that is written once and read rarely. Keeping a 64-character hash in the row buys the join for the price of a pointer.

The high-level architecture: receiving mail

Receiving starts before the first byte arrives, with DNS. A domain's MX records, the DNS entries that name the mail servers willing to accept mail for that domain, point at a fleet of SMTP gateways behind regional load balancing, so the world's sending servers discover where to connect without the service telling any of them anything directly. A gateway accepts the SMTP conversation, applies connection-level defenses such as rate limits and IP reputation checks, and writes the raw message to a durable intake log before acknowledging, because the acknowledgment is a promise that the sender is entitled to act on by deleting its copy. Everything downstream of acceptance is asynchronous, with scanning classifying the message for spam and malware, the delivery service writing metadata into the recipient's partition and the body into object storage, the search index absorbing the new message, and a notification waking any connected client, and the asynchrony is what lets each stage scale and fail independently without holding open SMTP connections hostage.

The gateway durably logs each accepted message before acknowledging the sender, and everything downstream of acceptance runs asynchronously, as delivery writes metadata and the body, updates the index, and wakes connected clients over the dashed notification path.

A latency budget from acceptance to inbox

Walking one message through the pipeline puts numbers on the experience. The SMTP conversation with the sending server takes a few hundred milliseconds across the public internet, and the gateway's replicated log write adds 5 to 10 milliseconds before the acceptance response goes out, which is the only synchronous durability cost in the whole path. Scanning is the budget's biggest line item, since authentication checks are fast but the machine-learned content classifier can spend 50 to 200 milliseconds per message, and the tier is sized so queueing delay stays small at the morning peak. Delivery then writes the mailbox row and the body in parallel, a few tens of milliseconds together, the index update follows within a second or two, and the notification reaches a connected client in a few hundred milliseconds more. Adding it up, a message is typically visible in the recipient's inbox one to three seconds after a foreign server connected, and searchable a breath later, which matches what users observe in practice when they email themselves as a test.

The budget also clarifies what degrades first under stress. When scanning backs up, the intake log absorbs the difference and mail arrives late rather than never, with the consumer lag metric measuring exactly how late. When a mailbox partition is failing over, deliveries to those users park in the log for the seconds the promotion takes, and the user experience is an inbox that is briefly quiet rather than wrong. The one component with no graceful degradation is the intake log itself, which is why it gets the strongest replication in the system, because every other tier's failure story begins with the phrase that the log still has the message.

Sending mail and the reputation game

The outbound path starts at the submission service, which authenticates the user, applies per-account rate limits that keep a stolen password from turning an account into a spam cannon, and signs the message before queueing it for the outbound MTA pool, the mail transfer agents that actually open connections to remote servers. Three authentication standards govern whether the world believes the mail, and they are worth defining individually because interviewers ask. SPF lets a domain publish in DNS which servers are allowed to send mail on its behalf, so a receiver can check whether the connecting IP is authorized. DKIM attaches a cryptographic signature to each message so receivers can verify it was really sent by the domain and was not altered in transit. DMARC lets the domain tell receivers what to do when SPF and DKIM checks fail, and asks them to send back aggregate reports so the domain can see who is forging it. A provider at this scale enforces all three on inbound mail and maintains them meticulously on outbound mail, because its sending reputation, tracked per outbound IP and per domain by every large receiver, decides whether a billion users' legitimate mail lands in inboxes or in spam folders, and reputation lost in a day takes months to rebuild.

The client submits through the authenticated API (1), the submission service rate-limits and DKIM-signs the message into the outbound queue (2), and an outbound MTA picks it up (3) and attempts SMTP delivery to the recipient's MX (4). A temporary 4xx response parks the message in the retry queue (5), where backoff timers schedule another attempt (6), while a permanent 5xx response instead generates a bounce back to the sender.

The retry discipline comes straight from SMTP's response codes, and respecting the distinction is what separates a good citizen from a blocked one. A 4xx response means temporary failure, perhaps a mailbox over quota or a server that greylists first attempts on purpose, and the correct behavior is to retry on a backoff schedule, minutes at first and stretching toward hours, for up to a few days before giving up and informing the sender. A 5xx response means permanent failure, such as no such user or mail refused by policy, and retrying it is worse than useless, because hammering a server with mail it has already rejected reads as abusive and damages the reputation every other queued message depends on, so the message bounces immediately with a delivery failure notice. The outbound pool also shapes traffic per destination, since large receivers throttle by source IP, which means queues are organized per destination domain and drained at each domain's tolerated pace, and a backlog toward one slow receiver never delays mail headed anywhere else. From the sender's chair the whole apparatus is invisible until it is not, and the visible artifact is the bounce message, which is why bounces are written carefully enough that a person can tell a typo in an address from a full mailbox.

Search across a billion mailboxes

Email search is a different problem from web search in a way that makes it tractable, because no query ever spans users, so the index can be partitioned exactly like the mailbox. Each user gets their own inverted index, the structure that maps every term to the list of messages containing it, sharded on the same user ID as the metadata so that a user's mail and index live together and a query never crosses a partition boundary. On delivery the pipeline tokenizes the subject and body, updates the user's index, and makes the change visible to search within seconds, and the throughput follows directly from the sizing, since a delivery rate of 580,000 messages per second is also 580,000 index updates per second, which is why the index write path is built into the pipeline rather than batched overnight. Batching would be cheaper, and it loses anyway, because a user who cannot find the message that arrived ten minutes ago concludes that search is broken regardless of what the architecture diagram says.

The index stores term positions for phrase queries and structured fields for the filters people actually type, covering sender, recipient, attachment presence, and date ranges, because mailbox search is navigational rather than exploratory, and users are usually hunting one specific message they half remember. A median mailbox of a few gigabytes yields an index of a few hundred megabytes, which across a billion users reaches hundreds of petabytes of index, kept affordable the same way bodies are, with the hot recent index on fast media and the index for cold mail paged from or rebuilt out of cheaper storage when an unusual query needs it.

Spam as a layered system

Spam filtering cannot be one classifier, because the adversary probes whatever single defense is deployed and routes around it, so the system is built as layers running from cheap to expensive, each one shrinking the volume the next must handle. Connection-time checks reject the worst traffic before reading a byte of content, using IP reputation lists, rate anomalies, and protocol fingerprints, and that placement matters because a majority of all connection attempts at a large provider are abusive, so rejecting them early saves the entire downstream pipeline. Authentication results from SPF, DKIM, and DMARC then separate forged sender claims from genuine ones, which removes the easiest and most dangerous impersonations. Content rules catch known campaign signatures and forbidden attachment types, and a machine-learned classifier scores everything that survives, trained on features from headers, body text, sender history, and the individual user's own past behavior toward similar mail, which is why the same message can land in one user's inbox and another's spam folder and both placements be correct.

The user feedback loop is what closes the system, since every report-spam and not-spam click is a labeled training example, and at this scale that is hundreds of millions of fresh labels a day, a data advantage no small provider can match and the real reason large providers filter well. Scoring places mail in the inbox or the spam folder rather than silently dropping it except at the very highest confidence, because false positives, meaning real mail marked as spam, are the costliest mistake the system can make, and a user who loses a job offer to the spam folder does not care how clean the inbox looked. Keeping borderline mail visible in a folder the user can check converts a catastrophic error into a recoverable one.

Scaling, failures, and operations

The tiers scale on different axes, and naming each rule beats waving at horizontal scale. Gateways and scanners are stateless and scale with connection volume, the mailbox store scales by adding user partitions, and object storage scales by hash space, which is the easiest kind of growth in the system. The intake log is the component whose failure semantics matter most, because a gateway that crashes after acknowledging a message must not lose it, so acceptance writes to a replicated log, a Kafka-style commit log or equivalent, before the SMTP 250 response goes out, and delivery consumes from that log with at-least-once semantics, meaning every message is processed one or more times and never zero, with deduplication on message ID ensuring a replay never produces two copies in an inbox. The pairing of at-least-once delivery with idempotent application is the standard move here, and it is worth saying in the interview that neither half works without the other.

Multi-region design follows the mailbox rather than fighting it. Each user's partition is homed to one region, with synchronous replication inside the region for durability and asynchronous replication to a second region for disaster recovery, so a regional failure means failing over mailboxes with a bounded few seconds of replication lag rather than reconciling a globally writable store, a trade that accepts a sliver of potential loss in a true disaster in exchange for sane semantics every ordinary day. Inbound mail tolerates regional trouble gracefully because MX records list gateways in multiple regions, any gateway can accept a message and forward it to the home region's pipeline, and the sending server's own retry behavior papers over brief gaps, which is one of the few places where email's ancient conventions actively help the design. The operational dashboard watches acceptance-to-inbox latency, intake log consumer lag, bounce rates per outbound IP, spam false positive reports, and the storage dedup ratio, and each is chosen because it moves before users complain, with the bounce rate in particular acting as the early warning that sending reputation is starting to slip somewhere in the world.

Follow-up questions

Why object storage for bodies instead of the database? Bodies are large, immutable, and mostly cold, which is exactly object storage's design point, and content addressing gives deduplication for free at the same time. The database keeps only what queries actually need, meaning envelope fields, labels, flags, and the body hash, so the hot path stays small and cache-friendly.
What stops a crashed gateway from losing mail? The SMTP acknowledgment is only sent after the message rests in a replicated intake log, so a crash after acceptance replays from the log rather than losing anything, and delivery's dedup on message ID makes the replay invisible to the recipient.
Why are 4xx and 5xx handled so differently on send? A 4xx is the remote server saying not right now, so backing off and retrying for a few days is correct and expected, while a 5xx means never, so the message bounces immediately, because retrying permanent failures wastes capacity and burns the sender reputation that every other message in the queue depends on.
How does search stay fast at a billion users? It stays fast by never being global, since each user's inverted index is sharded alongside their mailbox and a query touches exactly one shard. Index updates ride the delivery pipeline itself, which is how new mail becomes searchable within seconds instead of after a nightly batch.
What is the single highest-leverage cost optimization? Content-addressed deduplication of bodies and attachments wins by a wide margin, because bulk mail dominates volume and is byte-identical across recipients, so storing each unique body once collapses the largest class of data by orders of magnitude while changing nothing the user can see.
Why does the provider's own outbound mail need SPF, DKIM, and DMARC? Other receivers judge this provider by the same rules it applies to them, and reputation is tracked per IP and per domain. Misconfigured authentication at this scale means a billion users' legitimate mail starts landing in spam folders worldwide, and recovering from that takes far longer than preventing it.

References

Klensin, RFC 5321: Simple Mail Transfer Protocol (2008).
Kucherawy and Zwicky, RFC 7489: Domain-based Message Authentication, Reporting, and Conformance (DMARC) (2015).
Crocker, Hansen, and Kucherawy, RFC 6376: DomainKeys Identified Mail (DKIM) Signatures (2011).
Xu, System Design Interview, Volume 2 (2022), chapter on distributed email services.
Kleppmann, Designing Data-Intensive Applications (2017), on logs, partitioning, and replication.