Design ephemeral messaging

Ephemeral messaging is the product Snapchat made familiar, where messages disappear after they are viewed, photos can be opened exactly once, and stories live for 24 hours before vanishing. Most chat questions are about getting data to people and keeping it forever, and this one inverts the second half, because here deletion is the product. Interviewers like the question because it looks like generic chat right up until it is not, since a candidate has to reason about what deleted actually means in a distributed system, where copies live in databases, caches, content delivery networks, and backups, and has to be precise about which of those copies the service can truthfully promise to destroy. The hard parts live in the gap between marking data expired and physically removing it, in making a view-once photo burn on every one of the recipient's devices at the same moment, and in the awkward truce between caching media for speed and erasing it on schedule. I will build the system the way I would narrate it in a thirty-minute interview, agreeing on the promise first, then making the machinery keep it.

Scope and requirements

The functional surface I would agree on covers one-to-one disappearing messages with sender-chosen lifetimes, view-once photos and videos, 24-hour stories broadcast to friends or followers, screenshot notification, and delivery states similar to ordinary chat. Abuse reporting has to exist as well, which immediately complicates the deletion story, because reporting content the service has already destroyed is impossible, so a narrow carve-out for reported content enters the scope from the very first minute rather than as a retrofit. Stating that tension up front is worth doing in the interview, since the whole design is an exercise in keeping two promises that pull against each other.

The non-functional requirements are where this design departs from a chat system. Deletion must really happen within a bounded and stated window, not merely hide content from the interface, since the product's entire promise is that the service does not keep what you sent, and a promise without a number attached is not a requirement an engineer can build to. Delivery latency should match a normal messenger, a few hundred milliseconds when both parties are online, because ephemerality is not an excuse for sluggishness. Durability is required only within a message's lifetime, and that phrasing is deliberate, because losing a message that was going to be destroyed tomorrow is still a failure today, so the store replicates like any other store. The pleasant consequence of the product shows up in cost, because the corpus is small and rolling rather than monotonically growing, the storage bill stays roughly flat as the service ages, and the design should protect that property instead of accidentally hoarding data in side channels such as logs, caches, and analytics pipelines.

Sizing the problem

Assume 30 million daily active users sending 30 messages each per day, which multiplies to 900 million messages per day, or about 10,400 per second on average, and a peak factor of five makes that roughly 50,000 per second in the busy evening hours. Suppose half the messages carry media averaging 1 MB after client-side compression, so the service ingests around 450 million media objects per day, roughly 450 TB of blob traffic daily, and the blob path rather than the text path is where the bytes live.

Now comes the arithmetic that makes ephemerality interesting. A keep-forever service ingesting 450 TB per day grows by about 165 PB a year and never stops growing, which means its storage budget compounds the way debt does. This service, with everything expiring within 24 hours of viewing or posting, holds a steady-state corpus of roughly one to two days of traffic, on the order of 1 PB including replication, and that figure stays put forever no matter how old the service gets. Text rows are negligible beside the media, since 900 million rows at 300 bytes of metadata each is 270 GB per day and the rolling window keeps the whole table under a terabyte. Put plainly, the entire storage tier for a 30-million-user product is about what the keep-forever competitor adds every day or two, which is the business reason ephemerality is a real architecture and not a gimmick, and it is also why every section below treats accidental retention as a defect rather than a curiosity.
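The sizing numbers above can be reproduced in a few lines. This is only a back-of-envelope sketch using the figures already assumed in the text (decimal units, 1 MB = 10^6 bytes); the `rolling_corpus_tb` helper and its `replication` parameter are illustrative names, and the replication factor is left to the reader because the text does not fix one.

```python
# Back-of-envelope sizing with the figures assumed in the text.
DAU = 30_000_000
MSGS_PER_USER = 30
PEAK_FACTOR = 5
MEDIA_FRACTION = 0.5
MEDIA_BYTES = 1_000_000      # 1 MB per media object, decimal units
ROW_BYTES = 300              # metadata per message row
SECONDS_PER_DAY = 86_400

msgs_per_day = DAU * MSGS_PER_USER                      # 900 million
avg_per_sec = msgs_per_day / SECONDS_PER_DAY            # about 10,400
peak_per_sec = avg_per_sec * PEAK_FACTOR                # about 52,000
media_per_day = int(msgs_per_day * MEDIA_FRACTION)      # 450 million objects
blob_tb_per_day = media_per_day * MEDIA_BYTES / 1e12    # 450 TB
keep_forever_pb_per_year = blob_tb_per_day * 365 / 1000 # about 164 PB
metadata_gb_per_day = msgs_per_day * ROW_BYTES / 1e9    # 270 GB


def rolling_corpus_tb(days_retained: float, replication: int = 1) -> float:
    """Steady-state blob footprint for a rolling retention window (hypothetical helper)."""
    return blob_tb_per_day * days_retained * replication
```

Calling `rolling_corpus_tb(2)` gives the two-day raw footprint before replication, which is the number to compare against the keep-forever service's yearly growth.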

The interface

Clients hold a WebSocket, a persistent two-way connection upgraded from an ordinary HTTP request, to a socket tier for real-time delivery, and they use plain HTTP for media. Media never travels through the socket, because mixing megabyte uploads into a connection tuned for tiny frames would stall every message behind a photo. Instead the client asks the API for an upload target and receives a signed URL, which is an object-storage link carrying a cryptographic signature that grants one specific operation on one specific object for a short time, so clients talk to storage directly without ever holding credentials, and the storage tier scales independently of the messaging tier.

POST /api/v1/media/upload-urls
{ "content_type": "image/jpeg", "bytes": 812400 }
→ 200 { "media_key": "m/7f3a9c", "put_url": "https://blobs.example.com/m/7f3a9c?sig=...&expires=300" }

// then over the socket
→ { "type": "send", "to": "u815", "media_key": "m/7f3a9c",
    "view_policy": "once", "ttl_seconds": 86400, "client_msg_id": "k21" }
← { "type": "ack", "client_msg_id": "k21", "msg_id": 99312, "state": "sent" }

POST /api/v1/stories
{ "media_key": "m/2cc41e", "ttl_seconds": 86400 }   → 201

Every send carries an explicit lifetime, and that is a protocol decision rather than a detail. The server treats the time-to-live, the number of seconds a row is allowed to exist, as a first-class column instead of an afterthought, because every downstream system from the database to the cache layer to the sweeper keys its behavior off that one value, and a lifetime that lived only in application logic would be invisible to all of them.
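To make the signed upload target from this section concrete, here is a minimal sketch of how such a URL could be minted and checked with an HMAC. It is not how any particular object store implements presigned URLs; the secret, the payload layout, and the function names are assumptions for illustration, and the sketch uses an absolute expiry timestamp in the query string.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical signing secret; a real service would load it from a secret manager.
SECRET = b"demo-signing-secret"


def sign_url(method: str, media_key: str, ttl_seconds: int,
             now: Optional[float] = None) -> str:
    """Grant one operation (method) on one object (media_key) until an expiry time."""
    current = time.time() if now is None else now
    expires = int(current + ttl_seconds)
    payload = f"{method}\n/{media_key}\n{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "sig": sig})
    return f"https://blobs.example.com/{media_key}?{query}"


def verify_url(method: str, url: str, now: Optional[float] = None) -> bool:
    """Storage-side check: the signature must match and the link must not be expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    payload = f"{method}\n{parsed.path}\n{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Because the method and the object path are both inside the signed payload, a link minted for a PUT on one key cannot be replayed as a GET or pointed at another key, which is what lets clients talk to storage without holding credentials.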

The data model

Messages land in a wide-column or key-value store with native row expiry, and both major candidates offer it, since DynamoDB lets you name a TTL attribute and Cassandra accepts a TTL on any write. Building expiry into the store rather than the application means a forgotten code path cannot accidentally keep rows alive, which matches the principle that the deletion promise should not depend on every engineer remembering it. The schema looks like ordinary chat with two additions, the expiry timestamp and the view state, and those two columns carry most of the product.

CREATE TABLE messages (
  conversation_id BIGINT,
  msg_id          BIGINT,
  sender_id       BIGINT,
  media_key       TEXT,          -- pointer into object storage
  view_policy     TEXT,          -- 'once' or 'timed'
  view_state      TEXT,          -- unviewed / viewed / expired
  expires_at      TIMESTAMP,     -- expiry instant; the store's TTL is set to match
  PRIMARY KEY ((conversation_id), msg_id)
);

It is worth saying plainly how native TTL works, because the mechanism defines what the product can promise. The store does not run around deleting rows the second they expire. Expiry is lazy instead. In Cassandra a read that touches an expired row filters it out as if it were gone, while the bytes are physically reclaimed later during compaction, the background process in log-structured stores that rewrites data files and drops dead entries. DynamoDB is lazier still, because its background deleter removes expired items some time after the deadline and an expired item can still appear in reads until then, so the application has to filter on the expiry attribute itself. Laziness creates a gap between logical deletion, the moment no query can return the row, and physical deletion, the moment the bytes no longer exist on disk, and in a busy Cassandra cluster that gap can stretch from hours to days depending on compaction cadence and on the table's tombstone grace period, gc_grace_seconds, which defaults to ten days. The design has to state which of the two the product promise refers to, and the defensible answer is logical deletion immediately at expiry with physical deletion guaranteed within a stated window, enforced by scheduling compaction aggressively on these tables, by tuning the grace period to fit that window, and by keeping message bodies out of long-retention systems such as backups and diagnostic logs entirely. An interviewer who hears a candidate distinguish those two deletions knows the candidate has thought about what the database actually does rather than what its documentation headline says.
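The difference between logical and physical deletion is easy to see in a toy model. The sketch below is an in-memory stand-in, not Cassandra or DynamoDB; the class and method names are invented for illustration. Reads hide expired rows immediately, while the bytes stay put until a separate compaction pass runs.

```python
class LazyTTLStore:
    """Toy store with lazy expiry: reads filter expired rows, compact() reclaims them."""

    def __init__(self):
        self._rows = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds, now):
        self._rows[key] = (value, now + ttl_seconds)

    def get(self, key, now):
        row = self._rows.get(key)
        if row is None or now >= row[1]:
            return None  # logically deleted: no query can return it
        return row[0]

    def physically_present(self, key):
        # True while the bytes still exist, even if no read can see them.
        return key in self._rows

    def compact(self, now):
        """Background pass that drops expired rows; returns how many were reclaimed."""
        dead = [k for k, (_, expires_at) in self._rows.items() if now >= expires_at]
        for k in dead:
            del self._rows[k]
        return len(dead)
```

The interval between the first `get` that returns `None` and the `compact` call that removes the row is exactly the gap the product promise has to put a number on.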

The high-level architecture

The shape deliberately resembles a chat system, with a socket tier delivering frames and a store holding rows, because inventing a novel delivery architecture would spend the interview on the wrong problem. What distinguishes it is that every storage component carries expiry and a sweeper exists as a first-class service rather than a cron script someone wrote later. Media lives in object storage as encrypted blobs reachable only through signed URLs, the push service covers recipients with no live socket, and nothing anywhere holds content without a clock attached to it.

Every row in the message store carries a TTL, and the sweeper turns logical expiry into physical deletion, while clients upload and download media directly against object storage using short-lived signed URLs minted by the HTTP API.

Making deletion real

Database rows are the easy third of the problem, since TTL plus aggressive compaction handles them. Media blobs are harder because they are large, cached, and replicated for speed, and a content delivery network, the geographically distributed cache tier that serves static content from locations near users, is precisely a machine for making copies the origin does not control. The tension is real rather than theoretical, because caching a popular story's video at the edge makes playback fast, yet every cached copy is a deletion obligation, and CDN purge APIs are best-effort calls with propagation delays the service cannot observe. The design resolves the tension differently per content type rather than with one rule. View-once media skips the CDN entirely and is served straight from object storage, since content fetched exactly once gains nothing from a cache whose whole purpose is repeat hits. Stories, which are watched many times within a day, may use a CDN with cache lifetimes capped well below the 24-hour content lifetime, so edge copies die of natural causes before the deadline arrives and no purge call ever has to win a race.
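The per-content-type caching rule can be written as one small function that picks a Cache-Control header. This is a sketch under assumptions: the one-hour edge cap and the function name are illustrative, and a real deployment would pick its own cap well below the 24-hour story lifetime.

```python
def cache_control(is_view_once: bool, remaining_ttl_seconds: int,
                  edge_cap_seconds: int = 3600) -> str:
    """Pick a Cache-Control header so edge copies expire before the content does.

    edge_cap_seconds is a hypothetical policy knob, not a value from the text.
    """
    if is_view_once or remaining_ttl_seconds <= 0:
        # Fetched once (or already expired): a cache only adds copies to destroy.
        return "private, no-store"
    max_age = min(edge_cap_seconds, remaining_ttl_seconds)
    return f"public, max-age={max_age}"
```

A story with 20 hours left gets the capped lifetime, while one with two minutes left gets only those two minutes, so no edge copy is allowed to outlive the content it mirrors.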

The strongest tool in the design is encryption used as deletion. Each media object is encrypted at upload with its own random data key, the blob in object storage is ciphertext from the first byte, and the data key lives in the message row, which carries the TTL. When the row expires, the key vanishes with it, and every remaining copy of the blob, in storage, in a cache, on a misplaced disk in a decommissioned rack, degrades into unreadable bytes, because deleting one small key is equivalent to deleting every copy of the content at once. The blob itself is still removed by a cleanup job for cost reasons, but the promise no longer depends on that job winning a race against replication. This technique is usually called crypto-shredding, and naming it explicitly in an interview is worth doing, because it converts a distributed deletion problem, which is unwinnable in general, into a local one, which is easy.
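A toy model shows why deleting the key is enough. The cipher below is deliberately a stand-in (XOR against a SHAKE-256 keystream) so the example runs on the standard library alone; it is not a production cipher, and a real system would use an authenticated scheme such as AES-GCM from a vetted library. The class and method names are hypothetical.

```python
import hashlib
import secrets


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    # TOY ONLY: XOR with a SHAKE-256 keystream. Symmetric, so it encrypts and decrypts.
    stream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class MediaVault:
    """Blobs hold ciphertext; per-object keys live with the TTL-bearing message row."""

    def __init__(self):
        self.blobs = {}  # media_key -> ciphertext (storage, caches, stray copies)
        self.keys = {}   # media_key -> data key (stands in for the message row)

    def upload(self, media_key: str, plaintext: bytes) -> None:
        data_key = secrets.token_bytes(32)  # one random key per object
        self.blobs[media_key] = _toy_cipher(data_key, plaintext)
        self.keys[media_key] = data_key

    def read(self, media_key: str):
        data_key = self.keys.get(media_key)
        if data_key is None:
            return None  # key gone: the ciphertext is unreadable
        return _toy_cipher(data_key, self.blobs[media_key])

    def expire_row(self, media_key: str) -> None:
        # Row expiry removes the key; the blob may linger until a cleanup job runs.
        self.keys.pop(media_key, None)
```

After `expire_row`, the ciphertext is still sitting in `blobs`, yet nothing can turn it back into the photo, which is the property that lets the cleanup job run at its own pace.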

Two scope limits keep the promise truthful, and a candidate should volunteer both rather than be cornered into them. The first is that the service's guarantee stops at its own storage, because the recipient's phone displayed the photo, and a phone can screenshot it or a second camera can photograph the screen, so screenshot detection is a notification feature that tells the sender what happened rather than a prevention mechanism, and pretending otherwise would be a promise the laws of physics decline to back. The second is that reported content is the deliberate exception, since moderators cannot review what no longer exists. When a user reports a message, the service copies it into a separate quarantine store with its own access controls and a retention clock scoped to the investigation, and a legal hold can extend that clock. The product copy should state both limits in plain words, and the architecture should make the quarantine path the only door out of the ephemeral world, so that an audit can check one door instead of hunting for many.
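The quarantine path for reported content can be sketched as a tiny store with its own clock. Everything here is illustrative: the class, the field names, and the sweep behaviour are assumptions about one reasonable shape, not a description of any real moderation system.

```python
class QuarantineStore:
    """Toy quarantine: the only place reported content may outlive its TTL."""

    def __init__(self):
        self.cases = {}  # case_id -> record

    def report(self, case_id, message, now, retention_seconds):
        # Copy the content out of the ephemeral store before its clock destroys it.
        self.cases[case_id] = {
            "message": dict(message),
            "expires_at": now + retention_seconds,
            "legal_hold": False,
        }

    def set_legal_hold(self, case_id, held):
        self.cases[case_id]["legal_hold"] = held

    def sweep(self, now):
        """Delete cases whose retention clock ran out and that are not on hold."""
        closed = [cid for cid, case in self.cases.items()
                  if not case["legal_hold"] and now >= case["expires_at"]]
        for cid in closed:
            del self.cases[cid]
        return closed
```

Keeping every entry tied to a case identifier and an expiry is what lets an audit check this one door rather than searching the rest of the system for leftovers.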

View-once and the multi-device wrinkle

View-once is a server-side state machine, not a client promise, and the distinction is what makes it enforceable. The view state lives in the message row, and when a recipient opens the photo, the client reports the view, the server transitions unviewed → viewed atomically, starts a short fuse on the media key, and writes a tombstone, a small marker row recording that this message existed and was consumed, so a late-syncing device learns the message is burned rather than watching it vanish without explanation. Atomicity matters because two devices can genuinely race, as when a user with a phone and a tablet opens the same photo on both within the same second, and the compare-and-set on the view state, an operation that succeeds only if the value still holds what the caller expected, lets exactly one open count as the first view while the other renders the burned placeholder. The multi-device wrinkle reduces to a single rule, which is that viewing on one device must burn the message on all devices, so the burn is a server event fanned out like any other message, and clients holding a cached preview discard it the moment the event arrives.
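The first-view race can be illustrated with a compare-and-set on the view state. In this sketch a `threading.Lock` stands in for the store's conditional write, and the class, method, and event field names are hypothetical; a real service would issue the conditional update against the message row and fan the burn event out through the socket tier.

```python
import threading


class ViewOnceMessage:
    def __init__(self, msg_id):
        self.msg_id = msg_id
        self.view_state = "unviewed"
        self._lock = threading.Lock()  # stand-in for the store's conditional write
        self.burn_events = []          # events that would fan out to all devices

    def compare_and_set(self, expected, new):
        """Succeed only if the state still holds what the caller expected."""
        with self._lock:
            if self.view_state != expected:
                return False
            self.view_state = new
            return True

    def open_from(self, device_id):
        if self.compare_and_set("unviewed", "viewed"):
            # Only the winner reaches this branch, so exactly one burn event is emitted.
            self.burn_events.append(
                {"type": "burn", "msg_id": self.msg_id, "first_viewer": device_id}
            )
            return "render_photo"
        return "render_burned_placeholder"
```

Two devices calling `open_from` at the same instant get different answers, and whichever loses the race renders the placeholder instead of the photo.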

The message is persisted with its TTL and delivered (1), the recipient opens it and the server records the first view atomically (2), the view starts the burn timer and the message transitions to expired, invisible to every query (3). The dashed branch (4) covers messages whose TTL fires without a view, and compaction, key deletion, and the blob cleanup job make the deletion physical (5).

A timeline for one photo

Walking one photo through the clock makes the promises concrete. At second zero the sender taps send, the client uploads the encrypted blob through its signed URL, which takes a second or two for a megabyte on a typical uplink, and the socket frame carrying the metadata lands in the store with its TTL, so the sender sees the sent state roughly two seconds after the tap, dominated entirely by the upload rather than the messaging path. When the recipient opens the app, the frame delivery itself costs the usual few hundred milliseconds, and the photo fetch from object storage adds a few hundred more without a CDN, which is the deliberate price of keeping view-once content out of edge caches and is barely perceptible for a one-shot viewing. At the moment of viewing, the compare-and-set flips the state, the burn event fans out to the user's other devices within the same few hundred milliseconds a message would take, and the media key's fuse is lit.

From there the deadlines belong to the operator rather than the user. Logical deletion is immediate at expiry, so no API, query, or sync can return the message from that moment, and the user-facing promise is already kept in the only sense a user can observe. The key deletion follows within seconds, neutralizing every copy of the ciphertext, and the stated physical-deletion window, say 24 to 48 hours, covers compaction dropping the row's bytes and the cleanup job removing the orphaned blob. An operator watching the dashboard sees each message as a small contract with three deadlines, and the deletion-lag metric described below measures the worst contract currently outstanding, which is what makes the promise auditable rather than rhetorical.

Stories, counters, and the sweep

A story is a broadcast with a fixed 24-hour TTL, and the delivery question is the usual fanout choice that every social design faces. Posting writes the story row once, then fans out a lightweight reference into each follower's story tray, the per-user list the app renders along the top of the screen, and for a typical friend graph of a few hundred connections that costs a few hundred 50-byte writes per post, cheap enough to be ignored. Accounts with millions of followers flip to read-time merging, exactly like oversized channels in a chat system, so the tray render pulls their active stories on demand instead of materializing millions of references nobody may open. Tray references carry the same expiry as the story itself, which means the tray cleans itself, since a client opening the app at hour 25 simply receives no reference, and no synchronous mass-delete across millions of trays ever needs to run, which removes an entire class of background job that could fall behind.
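The fanout choice and the self-cleaning tray can be sketched together. The threshold value and all names below are assumptions for illustration, and the toy filters expired references at read time where a real store would rely on its native TTL to drop them.

```python
FANOUT_LIMIT = 10_000  # hypothetical follower count above which fanout is skipped


class StoryService:
    def __init__(self, followers, fanout_limit=FANOUT_LIMIT):
        self.followers = followers  # author -> set of follower ids
        self.fanout_limit = fanout_limit
        self.stories = {}           # story_id -> (author, expires_at)
        self.trays = {}             # user -> list of (story_id, expires_at)

    def _is_big(self, author):
        return len(self.followers.get(author, ())) > self.fanout_limit

    def post(self, author, story_id, now, ttl_seconds=86_400):
        expires_at = now + ttl_seconds
        self.stories[story_id] = (author, expires_at)
        if not self._is_big(author):
            # Fan out a lightweight reference that carries the story's own expiry.
            for user in self.followers.get(author, ()):
                self.trays.setdefault(user, []).append((story_id, expires_at))

    def tray(self, user, now):
        # Materialized references first, skipping anything past its expiry.
        refs = [sid for sid, exp in self.trays.get(user, []) if now < exp]
        # Read-time merge for oversized accounts the user follows.
        for sid, (author, exp) in self.stories.items():
            if now < exp and self._is_big(author) and user in self.followers[author]:
                refs.append(sid)
        return refs
```

Opening the tray at hour 25 returns nothing for a story posted at hour zero, with no mass delete ever having run across follower trays.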

View counts and viewer lists are the one place ephemerality meets a counter, and the counters are deliberately approximate while in flight, with view events streaming through a queue to increment a per-story aggregate, and the viewer list kept as a set keyed by story carrying the story's TTL, so both disappear with the story and leave no residue. The cleanup sweep serves as the service's background heartbeat, and its job description is short but essential. It walks expiring partitions, deletes orphaned blobs whose key rows are already gone, verifies that compaction is keeping the physical-deletion window inside the published bound, and emits metrics on the lag between logical and physical deletion, because that lag is the product promise expressed as a number an operator can watch, graph, and page on.
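The deletion-lag metric can be stated in a few lines. This sketch assumes the sweeper can see, for each row, when it expired and when (if ever) its bytes were reclaimed; the function names and the 48-hour default bound are illustrative, the latter taken from the upper end of the window mentioned earlier.

```python
def deletion_lag_seconds(rows, now):
    """Worst outstanding gap between logical expiry and physical deletion.

    rows: iterable of (expires_at, physically_deleted_at_or_None) pairs.
    Only rows that have expired but are still on disk contribute.
    """
    worst = 0
    for expires_at, deleted_at in rows:
        if expires_at <= now and deleted_at is None:
            worst = max(worst, now - expires_at)
    return worst


def should_page(lag_seconds, published_bound_seconds=48 * 3600):
    """Alert when the worst outstanding lag exceeds the published window."""
    return lag_seconds > published_bound_seconds
```

Because the metric tracks the single oldest undeleted row rather than an average, one stalled partition is enough to raise it, which suits a failure mode that nothing else in the system will notice.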

Scaling, failures, and operations

The socket tier scales like any chat connection tier, horizontally behind a session registry mapping users to servers, and the message store scales by conversation partitions, so nothing about ephemerality changes those rules and I would not spend interview time re-deriving them. What changes is the failure analysis around deletion, because the failure modes here are silent by nature. If the sweeper falls behind, no user-visible feature breaks, no error rate climbs, and no one files a complaint, which is exactly why it needs explicit monitoring, since the deletion-lag metric has to raise an alarm on its own merits, with nothing else in the system positioned to notice. If compaction stalls on a store node, expired rows persist physically past the window, and the response is operational, prioritizing compaction on the affected ranges, while the crypto-shredding backstop keeps the stranded content unreadable in the meantime, which is the difference between a missed internal deadline and a broken user promise. Caches deserve suspicion everywhere in this system, so application-level caches of message bodies are simply banned, and any cache that does exist must carry a lifetime no longer than the content's remaining TTL, a rule simple enough to enforce in code review.

Backups are the failure mode people forget, and they deserve their own paragraph because they defeat the product silently. A nightly snapshot of the message store quietly extends every message's life by the backup retention period, which contradicts everything the service tells its users. The resolution is to back up the metadata needed for recovery, meaning accounts, the social graph, and configuration, while excluding message content tables entirely, and to accept that a catastrophic store loss loses in-flight messages, a defensible trade for this product because everything in the store was about to be destroyed anyway and the alternative breaks the core promise every single night. Push notifications must not leak content either, so they carry only a sender name, such as "New message from Ana", with the actual payload fetched over the socket once the app opens. The quarantine store for reported content runs the opposite regime from everything around it, being durable, backed up, access-logged, and deliberately small, with every entry tied to a case and a clock, so the exception stays an exception.

Follow-up questions

What does "deleted" actually mean here? A message is logically deleted at expiry, meaning no query can return the row from that moment, and physically deleted within a stated window enforced by compaction scheduling and the sweeper. Crypto-shredding covers the gap by making stray copies unreadable the moment the key row expires, so the user-facing promise holds even while bytes await reclamation.
Why not just run a cron job that deletes expired rows? At hundreds of millions of rows a day, scan-and-delete competes with live traffic for the same disks and always lags behind. Native store TTLs make expiry a property of the row itself, which a store such as Cassandra enforces at read time, and compaction reclaims the space as part of work the store performs anyway, so no extra job sits on the critical path of the promise.
How does view-once survive a user with two devices? The view state is server-side and the first-view transition is a compare-and-set, so when two devices race, exactly one open counts as the view. The burn then fans out as an event that all of the user's devices apply, discarding any cached preview, which keeps every screen consistent within a round trip.
Can the service promise the recipient cannot keep a copy? It cannot, and the design should say so plainly, because screenshots and second cameras sit outside the trust boundary of any server-side system. The promise covers the service's own storage, and screenshot detection exists to notify the sender rather than to prevent the capture.
How do you cache media on a CDN if it must disappear? Edge cache lifetimes are capped below the content's remaining TTL so copies expire on their own before the deadline, view-once media skips the CDN entirely since it is fetched once, and blobs are encrypted so that key expiry neutralizes any copy that somehow outlives a purge call.
What happens when someone reports a message that then expires? Reporting copies the content into a quarantine store before the ephemeral clock can destroy it, with scoped access controls and case-bound retention, and a legal hold extends retention there rather than in the main store, so the investigation proceeds without weakening the promise for everything else.

References

AWS, DynamoDB Time to Live (TTL), official documentation on lazy expiry semantics.
AWS, Sharing objects with presigned URLs, Amazon S3 documentation.
Signal, Disappearing messages for Signal (2016), on timer-based deletion in a messenger.
Apache Cassandra, official documentation, on TTL, tombstones, and compaction.
Kleppmann, Designing Data-Intensive Applications (2017), on log-structured storage and compaction.