A notification system is the platform service every other team calls when they need to reach a user outside the app, whether that is a mobile push for the order that shipped, an SMS carrying a login code, or an email holding the weekly digest. It deserves to be its own system rather than a library each team embeds, because the hard parts repeat across every producer and reward being solved exactly once. External providers rate limit and fail in their own peculiar ways, the registry of device tokens rots constantly, user preferences and quiet hours must be enforced everywhere or they protect nothing, deduplication has to ensure that retries never double-send, and analytics must tell product teams whether anyone actually opened the thing. Centralizing those concerns is the whole argument for the design, and interviewers ask the question because it cleanly showcases asynchronous pipelines, retry discipline, and the habit of treating third parties as unreliable components to be managed rather than trusted. It also probes a quality the flashier questions miss, which is whether a candidate designs for the user's attention as a finite resource rather than a free one, because a platform that optimizes only for throughput eventually teaches its users to turn notifications off, and a muted channel is worth nothing to anybody.
Scope and requirements
Functionally, internal producer services submit notification requests referencing a template and a payload, and the platform delivers through mobile push, SMS, and email, with the channel chosen by the use case and the user's preferences. The platform owns user notification settings, covering per-category opt-outs, quiet hours, and channel choices, along with device token registration for push and delivery tracking through sent, delivered, and opened states. Two very different kinds of traffic share these pipes, since transactional notifications are triggered by a user's own action, like a password reset they are actively waiting for, while broadcast campaigns go to large audiences at once, like a feature announcement nobody asked for, and much of the architecture below exists to keep the second kind from trampling the first.
Non-functionally, delivery should be at-least-once, meaning every accepted notification is eventually delivered or explicitly dead-lettered and never silently dropped, and the duplicate sends that at-least-once permits must be suppressed by idempotency so no user ever receives the same message twice. Transactional latency should be seconds at most, since a login code that arrives in two minutes is a failed product, and the person staring at the code entry screen will have given up, retried, and generated a second code that races the first. Campaigns can take minutes to drain but must never delay transactional traffic. The system must also be polite by construction, which means rate caps and quiet hours sit in the platform rather than in the goodwill of forty producer teams, because goodwill does not survive reorganizations and a cap that depends on every team remembering it is not a cap at all.
Sizing the problem
Assume a mid-size product sending 10 million pushes, 1 million SMS, and 5 million emails per day, which is 16 million notifications in total. Spread over 86,400 seconds that averages about 185 per second, and even transactional traffic peaking at five times the average stays under 1,000 per second, a rate a modest service tier absorbs without strain. What actually sizes the system is the campaign burst rather than the steady state, because when marketing launches a push to 5 million users at 9 a.m. and the product wants the campaign delivered within 10 minutes, the pipeline must sustain 5 million sends over 600 seconds, which is more than 8,300 per second and roughly 45 times the average. Queues are what absorb the difference between the rate at which work is accepted and the rate at which providers will take it, and that mismatch is the reason the architecture is queues all the way through.
Storage stays manageable by comparison. A delivery log row of about 500 bytes per notification accumulates 8 GB per day, so a 90-day retention window holds around 720 GB, comfortably one time-partitioned table. The device token registry carries a few rows per user, say 300 million rows at 200 bytes for a 100-million-user product where people own multiple devices, which comes to 60 GB and is small. The constraint that surprises people is SMS cost rather than anything technical, because at half a cent per message the 1 million daily texts run about 5,000 dollars a day and roughly 1.8 million dollars a year, which is why the design routes to SMS only when a cheaper channel cannot carry the message, and why that routing decision belongs in the platform where the cost is visible rather than in producer teams for whom it is someone else's invoice.
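The sizing arithmetic above can be checked in a few lines. This is only a back-of-envelope sketch that reuses the figures quoted in this section; the variable names are illustrative and nothing here is measured data.

```python
# Back-of-envelope sizing using only the figures quoted in this section.
SECONDS_PER_DAY = 86_400

daily = {"push": 10_000_000, "sms": 1_000_000, "email": 5_000_000}
total_per_day = sum(daily.values())            # 16,000,000 notifications
avg_rate = total_per_day / SECONDS_PER_DAY     # about 185 per second
transactional_peak = 5 * avg_rate              # about 926 per second

# Campaign burst: 5 million sends inside a 10-minute window.
campaign_rate = 5_000_000 / (10 * 60)          # about 8,333 per second
burst_ratio = campaign_rate / avg_rate         # about 45x the daily average

# Storage: 500-byte log rows, 90-day retention, 300M token rows of 200 bytes.
log_gb_per_day = total_per_day * 500 / 1e9     # 8 GB
log_gb_90_days = log_gb_per_day * 90           # 720 GB
token_registry_gb = 300_000_000 * 200 / 1e9    # 60 GB

# SMS cost at half a cent per message.
sms_cost_per_day = daily["sms"] * 0.005        # about 5,000 dollars
sms_cost_per_year = sms_cost_per_day * 365     # about 1,825,000 dollars

if __name__ == "__main__":
    print(f"average rate:  {avg_rate:,.0f}/s")
    print(f"campaign rate: {campaign_rate:,.0f}/s ({burst_ratio:.0f}x average)")
    print(f"log storage:   {log_gb_90_days:,.0f} GB over 90 days")
    print(f"SMS cost:      ${sms_cost_per_year:,.0f} per year")
```

The ratio between the campaign rate and the average is the number that justifies buffering work in queues instead of provisioning every tier for the burst.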
The API
Producers call one endpoint and never talk to providers directly, which is what lets the platform change vendors, add channels, and tighten policy without forty teams shipping code. The idempotency key is supplied by the producer so that a producer-side retry of the same logical event cannot create a second notification, and the template ID keeps content rendering inside the platform, where localization, formatting, and channel-specific length limits live, and where a malformed template can be rejected before anything is queued.
POST /api/v1/notifications
{
  "idempotency_key": "order-58821-shipped",
  "user_id": "u-417",
  "template_id": "order_shipped_v3",
  "payload": { "order_id": "58821", "eta": "Friday" },
  "channels": ["push", "email"], // preference-filtered server side
  "category": "transactional"
}
→ 202 { "notification_id": "n-77b2", "status": "accepted" }

GET /api/v1/notifications/n-77b2
→ 200 { "status": "delivered", "channel": "push",
        "events": ["accepted", "rendered", "sent", "delivered"] }

POST /api/v1/devices // from the mobile app
{ "user_id": "u-417", "platform": "ios", "token": "a91f..." } → 204
The data model
Three tables anchor the platform, namely the device token registry, the user preference store, and a delivery log that doubles as the idempotency record, where a unique constraint on the producer's key makes a duplicate submission fail visibly at insert time instead of silently producing a second send.
CREATE TABLE device_tokens (
    user_id     TEXT        NOT NULL,  -- opaque ID as used by the API, e.g. u-417
    platform    TEXT        NOT NULL,  -- ios | android
    token       TEXT        NOT NULL,
    updated_at  TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (user_id, token)
);
CREATE TABLE preferences (
    user_id      TEXT     NOT NULL,
    category     TEXT     NOT NULL,  -- marketing, social, security ...
    channel      TEXT     NOT NULL,
    enabled      BOOLEAN  NOT NULL DEFAULT true,
    quiet_start  SMALLINT,           -- local-time hour, e.g. 22
    quiet_end    SMALLINT,           -- e.g. 8
    PRIMARY KEY (user_id, category, channel)
);
CREATE TABLE delivery_log (
    notification_id  UUID        PRIMARY KEY,
    idempotency_key  TEXT        NOT NULL UNIQUE,
    user_id          TEXT        NOT NULL,
    channel          TEXT,
    status           TEXT        NOT NULL,  -- accepted ... delivered | dead
    attempts         SMALLINT    NOT NULL DEFAULT 0,
    created_at       TIMESTAMPTZ NOT NULL
);
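The way the unique constraint turns a producer retry into a visible conflict can be sketched with Python's built-in sqlite3 module. This is a minimal illustration rather than the platform's real intake code: the table is a trimmed-down delivery_log with text IDs, and the function names open_log and accept are hypothetical.

```python
import sqlite3
import uuid
from typing import Tuple


def open_log() -> sqlite3.Connection:
    """Create an in-memory, simplified delivery_log for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE delivery_log (
               notification_id TEXT PRIMARY KEY,
               idempotency_key TEXT NOT NULL UNIQUE,
               user_id         TEXT NOT NULL,
               status          TEXT NOT NULL
           )"""
    )
    return conn


def accept(conn: sqlite3.Connection, idempotency_key: str, user_id: str) -> Tuple[str, bool]:
    """Return (notification_id, created); created is False for a duplicate key."""
    new_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO delivery_log VALUES (?, ?, ?, 'accepted')",
                (new_id, idempotency_key, user_id),
            )
        return new_id, True
    except sqlite3.IntegrityError:
        # The key already exists: hand back the original notification instead.
        row = conn.execute(
            "SELECT notification_id FROM delivery_log WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0], False


if __name__ == "__main__":
    log = open_log()
    print(accept(log, "order-58821-shipped", "u-417"))  # created is True
    print(accept(log, "order-58821-shipped", "u-417"))  # same ID, created is False
```

Letting the database enforce uniqueness avoids a check-then-insert race between two concurrent retries, since only one insert can win.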
The high-level architecture
The flow is a pipeline with one validation brain and per-channel muscle. Producers emit events to the notification service, which checks the idempotency key, loads preferences and applies rate caps and quiet hours, renders the template into channel-specific content, resolves the user's device tokens or address, and enqueues a delivery task on the right channel queue. Channel workers then do nothing but talk to providers. Push traffic goes to APNs, the Apple Push Notification service that is the only road onto an iOS device, and to FCM, Firebase Cloud Messaging, which plays the same gatekeeper role for Android, while texts go to SMS gateway companies and email leaves through a sending provider or an in-house mail transfer agent. Provider responses, including invalid-token feedback, flow back asynchronously into the registry and the delivery log, so the pipeline learns something from every send it makes.
Figure: The notification service validates, applies preferences, renders, and enqueues one task per chosen channel, and channel workers deliver through external providers. Provider feedback flows back asynchronously to prune dead tokens and update the delivery log.
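The fan-out step at the end of validation can be sketched as a pure function. Everything named here is hypothetical (DeliveryTask, plan_tasks, the queue naming scheme), and the sketch assumes one task per resolved address so that each device receives its own copy of a push.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class DeliveryTask:
    notification_id: str
    channel: str
    queue: str    # hypothetical naming scheme, e.g. "push.transactional"
    address: str  # device token, phone number, or email address
    body: str


def plan_tasks(
    notification_id: str,
    request: dict,
    enabled_channels: Set[str],
    addresses: Dict[str, List[str]],
    render: Callable[[str, str, dict], str],
) -> List[DeliveryTask]:
    """Filter requested channels by preference, render once per channel, fan out."""
    queue_class = "transactional" if request["category"] == "transactional" else "campaign"
    tasks: List[DeliveryTask] = []
    for channel in request["channels"]:
        if channel not in enabled_channels:
            continue  # the user opted out of this channel for this category
        body = render(request["template_id"], channel, request["payload"])
        for address in addresses.get(channel, []):
            tasks.append(
                DeliveryTask(notification_id, channel, f"{channel}.{queue_class}", address, body)
            )
    return tasks
```

Keeping this step free of network calls means a bad template or an empty channel list surfaces before anything is queued.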
The enqueue arrow in the diagram hides an essential split, because each channel actually has at least two queues, one for transactional traffic and one for campaigns, with workers draining the transactional queue at strict priority, meaning campaign work proceeds only when no transactional task is waiting. A campaign dropping 5 million tasks at 9 a.m. therefore fills the campaign queue and drains at whatever rate provider quotas allow, while a password reset submitted at 9:01 still ships in seconds because it never stands behind the marketing flood. Without that split the platform's worst incident becomes self-inflicted, with the business sending a promotional campaign and thereby locking its own users out of their accounts for half an hour, an outage no postmortem reads kindly. The isolation is structural rather than tuned, which is why it beats any weighting scheme that tries to share one queue, because a weight can be misconfigured under pressure while a separate queue cannot be starved by traffic it never carries.
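Strict priority between the two queues is simple enough to sketch in memory. The class below is an illustration only, with hypothetical names, and stands in for whatever broker the platform actually uses.

```python
from collections import deque
from typing import Deque, Optional


class ChannelQueues:
    """Two queues for one channel, drained at strict priority."""

    def __init__(self) -> None:
        self.transactional: Deque[str] = deque()
        self.campaign: Deque[str] = deque()

    def enqueue(self, task: str, category: str) -> None:
        target = self.transactional if category == "transactional" else self.campaign
        target.append(task)

    def next_task(self) -> Optional[str]:
        # Campaign work proceeds only when no transactional task is waiting.
        if self.transactional:
            return self.transactional.popleft()
        if self.campaign:
            return self.campaign.popleft()
        return None


if __name__ == "__main__":
    queues = ChannelQueues()
    for i in range(3):
        queues.enqueue(f"promo-{i}", "campaign")
    queues.enqueue("password-reset", "transactional")
    print(queues.next_task())  # the reset is served ahead of the earlier promos
```

A worker that always checks the transactional queue first needs no weights to tune, which is the structural property the paragraph above relies on.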
The device token registry
Push delivery is only as good as the token registry, because APNs and FCM address devices by opaque tokens that change without warning rather than by anything stable like a user ID. The app registers its current token at every launch and on every token-refresh callback, the registry upserts by user and token with a freshness timestamp, and a user with three devices simply holds three rows, each of which receives its own copy of every push. Rot is the registry's permanent condition, since users reinstall the app, switch phones, or disable notifications, and the providers only report the resulting dead tokens after the fact, APNs through delivery failure responses and its feedback channel and FCM through unregistered errors. The pruning rule treats a provider's invalid-token verdict as authoritative and deletes the row immediately, then ages out tokens unseen for a long window of 60 or 90 days, because sending to dead tokens wastes quota and erodes the sender reputation scores providers keep, which in turn throttles delivery to the healthy tokens too. A rotten registry produces the quietest and most corrosive symptom in the system, the person who swears they never got the alert their friend got, so the platform measures registry health as a delivered-to-sent ratio per device cohort rather than waiting for complaints to surface it. The registry read sits on the hot path of every push, which earns it a cache keyed by user ID, invalidated on registration and pruning events, so the common case costs a memory lookup instead of a database query.
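The registry rules above (upsert on registration, immediate deletion on a provider verdict, ageing out after a long window) can be sketched with an in-memory dictionary. The class and method names are hypothetical, and the 90-day window is one of the two values the text mentions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

MAX_TOKEN_AGE = timedelta(days=90)  # the text suggests 60 or 90 days


class TokenRegistry:
    """In-memory stand-in for the device_tokens table."""

    def __init__(self) -> None:
        # (user_id, token) -> (platform, updated_at)
        self._rows: Dict[Tuple[str, str], Tuple[str, datetime]] = {}

    def register(self, user_id: str, platform: str, token: str, now: datetime) -> None:
        # Upsert: re-registering the same token only refreshes its timestamp.
        self._rows[(user_id, token)] = (platform, now)

    def tokens_for(self, user_id: str) -> List[str]:
        return sorted(token for (uid, token) in self._rows if uid == user_id)

    def mark_invalid(self, user_id: str, token: str) -> None:
        # A provider's invalid-token verdict is authoritative: delete at once.
        self._rows.pop((user_id, token), None)

    def age_out(self, now: datetime) -> int:
        # Remove tokens not re-registered within the ageing window.
        stale = [key for key, (_, seen) in self._rows.items() if now - seen > MAX_TOKEN_AGE]
        for key in stale:
            del self._rows[key]
        return len(stale)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    registry = TokenRegistry()
    registry.register("u-417", "ios", "a91f", start)
    print(registry.tokens_for("u-417"))
```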
Delivery that survives retries
The reliability stance is at-least-once delivery with idempotency layered on top, because exactly-once across a network boundary to a third party is not achievable in principle. A send that times out leaves the platform unable to know whether the provider received the message, since the timeout destroyed the answer, and the only safe move is to retry, which risks a duplicate. Retries are therefore made safe rather than avoided, with every send keyed by the notification ID so that a duplicate-suppression check in front of each provider call, backed by a short-lived store recording which IDs the provider has already accepted, drops the second attempt of a send that actually succeeded the first time. Backoff between attempts is exponential, meaning the wait doubles each time, for example 1, 2, 4, 8, and 16 seconds, with random jitter added so that thousands of sends failing together do not retry in lockstep and hammer the recovering provider in synchronized waves. Transient provider errors earn this treatment, while permanent verdicts such as an invalid token or a hard email bounce do not, because retrying a permanent failure spends quota to learn nothing new.
Figure: A worker pulls a delivery task (1) and sends it to the provider under the notification's idempotency key (2). When the provider times out or returns a transient error (3), the task is requeued with an exponentially growing delay (4) and redelivered for another attempt (5). After the attempt limit it parks in the dead-letter queue (6), which alerts an operator and preserves the message for inspection or replay.
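The retry loop can be sketched as follows. The exception classes, the deliver function, and the five-attempt limit are assumptions made for illustration, and the sketch sleeps in-process where a real worker would requeue the task with a delay.

```python
import random
import time
from typing import Callable, List


class TransientError(Exception):
    """Timeouts, throttling, 5xx-style failures: worth retrying."""


class PermanentError(Exception):
    """Invalid token, hard bounce: retrying learns nothing new."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay after a 1-based attempt: 1, 2, 4, 8, 16 s plus up to 50% jitter."""
    delay = min(cap, base * 2 ** (attempt - 1))
    return delay + random.uniform(0, delay / 2)


def deliver(
    task: str,
    send: Callable[[str], None],
    dead_letters: List[str],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            send(task)
            return "sent"
        except PermanentError:
            return "failed_permanent"  # no retry for a permanent verdict
        except TransientError:
            if attempt < max_attempts:
                sleep(backoff_delay(attempt))
    # Attempts exhausted: park the task instead of dropping it.
    dead_letters.append(task)
    return "dead"
```

The jitter is what keeps thousands of tasks that failed together from retrying in the same instant.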
The dead-letter queue, which is a holding queue for messages that have exhausted their retries, is what keeps at-least-once from quietly decaying into at-most-once. Nothing is ever dropped on the floor, failures become visible inventory that pages an operator, and once the underlying fault is fixed the queue can be replayed, with the idempotency layer guaranteeing the replay cannot double-send whatever managed to get through before the fault. Provider failover covers the faults retries cannot, so each channel keeps a secondary route where the market offers one, such as a second SMS gateway that receives traffic when the primary's error rate trips a circuit breaker, which is a switch that stops calls to a failing dependency for a cooldown period so workers fail fast instead of stacking up timeouts. Push has no alternative to APNs and FCM, so its failover is buffering and patience, with queue depth growing through the outage and draining afterward, and the drain re-applies quiet hours and a freshness check at send time so that a six-hour-old promotional push dies in the queue rather than waking its recipient at 3 a.m.
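A circuit breaker of the kind described can be sketched in a few lines. This version trips on consecutive failures instead of a measured error rate, which is a simplification, and the names CircuitBreaker and pick_route as well as the thresholds are hypothetical.

```python
from typing import Optional


class CircuitBreaker:
    """Stops calls to a failing dependency for a cooldown period."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # Open: fail fast until the cooldown has elapsed, then allow a trial call.
        return now - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # trip, or restart the cooldown after a failed trial


def pick_route(primary: CircuitBreaker, now: float) -> str:
    """Route SMS to the secondary gateway while the primary's breaker is open."""
    return "primary" if primary.allow(now) else "secondary"
```

For push, where no secondary route exists, the same breaker would simply pause the workers and let the queue absorb the backlog.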
Keeping the platform polite
Rate limiting at the user level is a product safeguard rather than a capacity one, because the platform can technically deliver thirty messages a day to one person, and that person will respond by disabling notifications forever, which destroys the channel for every team at once. The user's tolerance is a shared budget that only the platform can see in full, so it enforces per-user, per-category caps, for example at most two marketing messages a day, with a counter in a fast store checked during validation, and it applies collapsing rules that merge bursts of similar events into a single message such as "Ana and 12 others liked your post" before anything reaches a queue. Quiet hours are enforced in the user's local time zone, holding non-urgent messages until morning, while security-critical categories pass through regardless, and that carve-out is a requirement rather than a loophole, because a fraud alert delayed eight hours out of politeness causes real financial harm.
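The politeness checks can be sketched as one decision function. The category sets, the cap table, and the three outcomes ("send", "hold", "suppress") are assumptions for illustration; in particular, the text does not say what happens to a message that exceeds its cap, so suppressing it is a guess. The quiet-hours test handles a window that wraps past midnight, such as 22 to 8.

```python
from typing import Optional

URGENT_CATEGORIES = {"security"}  # bypass caps and quiet hours
DAILY_CAPS = {"marketing": 2}     # per-user, per-category daily limits


def in_quiet_hours(local_hour: int, quiet_start: Optional[int], quiet_end: Optional[int]) -> bool:
    """True when local_hour falls inside [quiet_start, quiet_end), wrapping midnight."""
    if quiet_start is None or quiet_end is None or quiet_start == quiet_end:
        return False
    if quiet_start < quiet_end:
        return quiet_start <= local_hour < quiet_end
    return local_hour >= quiet_start or local_hour < quiet_end


def decide(
    category: str,
    local_hour: int,
    quiet_start: Optional[int],
    quiet_end: Optional[int],
    sent_today: int,
) -> str:
    if category in URGENT_CATEGORIES:
        return "send"      # a fraud alert must not wait for morning
    cap = DAILY_CAPS.get(category)
    if cap is not None and sent_today >= cap:
        return "suppress"  # the user's daily budget for this category is spent
    if in_quiet_hours(local_hour, quiet_start, quiet_end):
        return "hold"      # deliver after quiet hours end
    return "send"
```

The local hour is expected to be computed in the user's own time zone before this function is called.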
Delivery receipts close the loop that politeness opens. Workers write sent status, providers report delivered where the channel supports it, and clients report opens through a lightweight event endpoint, with all of it landing in the delivery log and flowing onward to an analytics pipeline. Operators read those numbers to run the platform, since a falling push delivery rate is how token rot or a quiet provider problem first announces itself, and product teams read the same numbers to govern themselves, because a category with high sends and near-zero opens is a category that should send less before users make that decision on its behalf. Campaign tooling consumes the stream mid-flight as well, slowing a campaign whose bounce rate is climbing rather than burning sender reputation to finish on schedule.
A latency budget for a login code
The seconds-at-most promise for transactional traffic becomes credible once it is itemized. The producer's call into the notification service spends a couple of milliseconds on authentication and routing, around a millisecond on the idempotency insert into the delivery log, a millisecond or two reading cached preferences and device tokens, and single-digit milliseconds rendering the template, so the request is validated, rendered, and enqueued with its 202 response in well under 50 milliseconds. A healthy transactional queue is also a nearly empty one, which means a worker picks the task up within tens of milliseconds, and the provider call then dominates the server side at perhaps 50 to 200 milliseconds for APNs, FCM, or an SMS gateway. What remains is the leg the platform does not control, since APNs and FCM typically place the push on the device within a second or two and carrier networks usually land an SMS within a few seconds, putting the full experience from code requested to code on screen at roughly two to five seconds when everything works, which matches what users have learned to expect.
When something fails, the budget shows exactly where the time goes. A first provider attempt that times out after two seconds, waits one second with jitter, and succeeds on the retry delivers in about five or six seconds, which the user reads as a sluggish network rather than a broken product. A second consecutive failure pushes past ten seconds, which is the threshold where people start tapping resend, and that is why the resend path mints a fresh idempotency key, so the impatient tap creates a new logical notification rather than colliding with the retry of the old one. Monitoring alarms on transactional p99 measured in seconds rather than on averages, because the priority split means campaign load should never appear in this number at all, and if it ever does, the alarm is really telling the operator the split itself is broken.
Scaling, failures, and operations
Every tier scales by a boring rule, which is the compliment a pipeline design wants. The notification service is stateless and scales behind a load balancer, queues partition by user ID so that per-user ordering is preserved where it matters, and channel workers scale on queue depth, with the campaign worker pool capped at whatever the provider contracts allow, since an SMS gateway contract might permit 1,000 messages per second and the pool must respect that or the gateway will enforce it less politely with rejections and reputation penalties. The delivery log partitions by time and archives old partitions to cold storage. Failure modes are mostly external, with a provider outage becoming queue depth behind an opened circuit breaker, a token database failover pausing push resolution for the seconds a replica needs to promote, and a misconfigured template failing fast at render time during validation, before anything is queued, so the producer receives a synchronous error it can act on instead of a notification that vanishes somewhere downstream.
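One common way to keep a worker pool under a provider's contracted rate is a token bucket, sketched below. The text does not prescribe this mechanism, so treat it as one possible approach; the class name and parameters are hypothetical, and time is passed in explicitly to keep the sketch deterministic.

```python
class TokenBucket:
    """Allows up to `rate_per_second` sends per second, with a bounded burst."""

    def __init__(self, rate_per_second: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate_per_second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the contracted rate: leave the task queued


if __name__ == "__main__":
    # A gateway contract of 1,000 messages per second, as in the example above.
    bucket = TokenBucket(rate_per_second=1_000, burst=1_000)
    sent = sum(bucket.try_acquire(0.0) for _ in range(1_500))
    print(sent)
```

In a multi-worker deployment the bucket state would have to live in a shared store, since each process holding its own bucket would multiply the effective rate.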
The operational dashboard tells the story of the system in four lines, namely end-to-end transactional latency at p99, queue depth per channel and class, delivery success rate per provider, and the arrival rate into the dead-letter queue. The incidents worth rehearsing are the ones the design claims to handle. A giant campaign should starve nothing because the queues are split, a provider going dark should become a tripped breaker followed by failover where a second vendor exists and patient buffering where one does not, and a duplicate storm from a retrying producer should be absorbed entirely by the idempotency key at the front door. When all three play out as designed, users notice nothing, and noticing nothing is the only good outcome a notification platform can have.
Follow-up questions
- Why at-least-once instead of exactly-once delivery? A send that times out is ambiguous, since the provider may have delivered it before the connection died, so the platform must retry or accept silent loss. Exactly-once across that boundary cannot be guaranteed by any protocol, while retries plus idempotency keys give the user-visible effect of exactly-once at a cost the system can actually pay.
- How exactly does the idempotency key prevent double sends? At intake the key is a unique column in the delivery log, so a producer retry of the same logical event conflicts on insert and receives the original notification's status instead of creating a second one. At the provider edge a short-lived record of accepted sends drops re-attempts of work that already succeeded, which covers a task redelivered after the provider accepted it but before the worker recorded the result. A send that truly timed out stays ambiguous to the platform, which is why each send also travels to the provider under the notification's key, so that a provider able to deduplicate on it can absorb the retry.
- What stops a 5-million-user campaign from delaying password resets? Each channel runs separate queues for transactional and campaign traffic, drained at strict priority, so the campaign consumes leftover capacity rather than shared capacity. Because the isolation is structural, no traffic mix or misconfigured weight can erode it, which makes it stronger than any fairness tuning applied to a single queue.
- When do you delete a device token? Immediately when APNs or FCM declares it invalid through feedback or an error response, since the provider's verdict is authoritative, and lazily when the token has not been re-registered within an aging window of 60 to 90 days. Sending to dead tokens wastes quota and erodes the sender reputation scores providers use to throttle, so pruning protects delivery for every healthy device.
- How do you handle an APNs outage? No alternative provider exists for iOS push, so the plan is buffering and patience, with the circuit breaker stopping futile sends while queue depth absorbs the backlog. The drain afterward re-checks quiet hours and message freshness, so stale pushes are dropped rather than delivered at the wrong time, and anything genuinely urgent has already gone out through SMS or email where the category allows a fallback.
- Why do rate caps live in the platform instead of in producer services? The user's tolerance is a shared budget across all producers, and only the platform sees the whole picture, since one team's reasonable volume plus another team's reasonable volume adds up to an unreasonable total. The cap has to bind at the point where the streams meet, and the platform is the only such point.
References
- Apple Developer Documentation, User Notifications and APNs, on tokens, delivery, and feedback.
- Google Firebase, Firebase Cloud Messaging documentation, on Android push and unregistered-token errors.
- Xu, System Design Interview, Volume 1 (2020), chapter on notification systems.
- Kleppmann, Designing Data-Intensive Applications (2017), on at-least-once delivery, idempotence, and message queues.