Design an ad click event aggregation system

Systems design · Data pipelines and observability · Oct 2025

When someone clicks an ad, the advertiser is billed for that click, the campaign's budget is paced against it, and the bidding system learns from it, so three separate consumers depend on the event being counted correctly. The system designed here ingests every click from a large ad network, aggregates counts per ad per minute, and serves those numbers to dashboards, budget pacers, and the billing pipeline. Interviewers love the question because it is a money pipeline, and money changes the engineering. A metrics system that undercounts CPU usage wastes a capacity planner's afternoon, whereas an ad system that overcounts clicks bills customers for traffic that never happened, and those failure modes are measured in refunds, audits, and occasionally lawsuits. Correctness therefore outranks freshness here, which inverts the priorities of most streaming designs and quietly shapes every decision that follows.

The hard parts are time and duplication. Clicks arrive late and out of order from flaky mobile networks, the same click arrives twice because a phone retried it, and the aggregation must nonetheless converge on one defensible number per ad per minute. The design that follows is the standard shape for this problem, built from a durable raw log that serves as the source of truth, a streaming layer that produces fast approximate answers, and a batch layer that recomputes the truth at leisure, and most of the interview is justifying why each of those three pieces earns its complexity.

Scope and requirements

Functionally, the system must return the number of clicks for a given ad in any one-minute window, return the top N ads by clicks over a recent window such as the last hour, and support queries over both recent data for pacing and historical data for invoices and disputes. Filtering fraudulent clicks before they reach billable aggregates also belongs in scope, because no real ad network bills raw clicks, and a design that ignores fraud has designed a different product than the one advertisers buy.

Non-functionally, correctness leads everything else. Billable numbers must be exact, deduplicated, and reproducible, where reproducible means that recomputing any window from raw events yields the same answer an auditor was shown last quarter, since an invoice that cannot be re-derived is an invoice that cannot be defended. Freshness matters at the level of a minute or two for dashboards and pacing rather than seconds, which is a meaningful relaxation the design will spend. The system must absorb large traffic spikes, a viral moment or a holiday auction surge, without losing a single event, and every raw event must be retained long enough to recompute history when, not if, a defect in the aggregation logic ships, because the recovery story for a money pipeline cannot depend on the aggregation code having been perfect.

Sizing the problem

Assume 1 billion clicks per day. A day has 86,400 seconds, so the average rate is about 11,600 clicks per second, and because ad traffic follows audiences through their day and spikes around events, the system should be sized for peaks around 50,000 per second, roughly four times the average. A raw click event carries an event ID, ad ID, user identifier, timestamp, and a little context, around 0.1 KB once encoded, so a day of raw events is about 100 GB, which is a comfortably small volume by log standards. Keeping 90 days of raw events in Kafka with threefold replication costs about 100 GB times 90 times 3, or 27 TB of disk, and for a system whose disputes and audits reach back months, 27 TB is cheap insurance rather than expense. The aggregate side comes out smaller still, because with around 2 million ads active in a day, even the absurd worst case of every ad receiving clicks in every one of 1,440 minutes would produce 2.9 billion rows, and real click activity concentrates heavily on a small fraction of ads, so daily aggregate output lands well under that and a year of it fits in a few TB of columnar storage.

The query API

Two consumers dominate the read side, with dashboards and pacing reading recent windows while billing reads reconciled history, and the API serves both shapes. The write side is an internal ingestion endpoint that ad servers call in batches, and end-user clicks never hit this API directly.

POST /v1/clicks                       ← from ad servers, batched
{ "events": [ { "event_id": "9f1c-...", "ad_id": 88231,
                "user_id": "u-2231", "ts": "2025-10-07T12:00:40Z" }, ... ] }

GET /v1/ads/88231/clicks?from=2025-10-07T12:00Z&to=2025-10-07T13:00Z
    &granularity=minute
→ 200 { "windows": [ { "minute": "12:00", "clicks": 1042,
                       "source": "reconciled" }, ... ] }

GET /v1/ads/top?window=1h&n=100       → 200 { "ads": [ ... ] }

The source field in the response is a deliberate piece of API design rather than decoration, because every aggregate is labeled as streaming or reconciled and downstream systems choose what they can tolerate. Pacing happily consumes fast streaming rows that may still drift slightly, while billing simply refuses to read anything not yet reconciled, and encoding that distinction in the contract keeps the two guarantees from ever being confused in someone else's code.

The data model

The aggregate store is keyed by ad and minute, which matches the access pattern of every query above, and each row carries the identity of the computation run that produced it, so a recomputation can supersede old results without ever updating a row in place ambiguously. Append-and-supersede preserves the audit trail that in-place correction would destroy.

CREATE TABLE clicks_per_minute (
  ad_id        BIGINT      NOT NULL,
  minute       TIMESTAMPTZ NOT NULL,   -- window start, UTC
  clicks       BIGINT      NOT NULL,
  filtered     BIGINT      NOT NULL,   -- clicks held out as suspect
  run_id       BIGINT      NOT NULL,   -- which computation produced this
  reconciled   BOOLEAN     NOT NULL DEFAULT FALSE,
  PRIMARY KEY (ad_id, minute, run_id)
);
-- readers take the highest run_id per (ad_id, minute),
-- billing additionally requires reconciled = TRUE

The high-level architecture

Ad servers log every click to Kafka, a durable, partitioned, append-only log, and that log is the source of truth for the entire system, kept raw, immutable, retained for 90 days, and never filtered or rewritten, because any cleverness applied before the log cannot be undone later, while cleverness applied after it always can. A stream processor consumes the log, deduplicates, applies fraud filtering, aggregates into one-minute windows, and writes rows to the aggregate store within a minute or two of real time. Independently of that, a nightly batch job replays the same raw log and recomputes every window with hours of hindsight, writing corrected rows that supersede the streaming ones. Dashboards read whatever is freshest while billing reads only reconciled rows. This two-path arrangement is often called a lambda architecture, and the alternative worth naming is the kappa style, which keeps only the streaming path and handles corrections by replaying the stream itself. Kappa removes the cost of maintaining two computations, and it loses here because billing wants a clean, scheduled moment when a day's numbers become final, with full hindsight and final rule versions, and a nightly batch over an immutable log provides exactly that moment while a perpetually revising stream never quite does. The duplicated logic is the explicit price paid for putting a fast number on a dashboard without ever billing from it.

Click sourcesapps and ad serversIngest gatewayvalidate, batchKafka raw logsource of truthStream jobwindow, dedupBatch recomputenightly, from raw logAggregate store(ad, minute) rowsQuarantine storesuspect clicksQuery APIdashboards, billing1-min rowsnightly replaycorrectionssuspect clicks

Every click lands in the raw log before anything else happens to it. The stream job publishes fast per-minute rows, the nightly batch replays the same log and writes corrections, and suspect clicks are routed aside rather than deleted.

Event time, windows, and watermarks

Aggregation here means windowed counting, where the day is divided into tumbling windows, fixed, non-overlapping one-minute intervals, and each click increments the count of exactly one window. The subtlety is deciding which window, because every event carries two times. Event time is when the click actually happened on the device, while processing time is when the event reached the pipeline, and the gap between them can stretch from milliseconds to minutes. A click made forty seconds after noon from a phone in a subway tunnel may not arrive until two and a half minutes past the hour, and a pipeline that assigned it by arrival would land it in a window two minutes after the one it belongs to, overcounting one minute and undercounting another, and worse, replaying the same log later would assign the same click differently because processing times do not survive a replay, which destroys reproducibility outright. Billing-grade aggregation must therefore window by event time, and doing so immediately raises the question of when the noon window can be declared finished, given that some of its events may still be in a tunnel somewhere.

A watermark answers that question. It is the pipeline's moving claim that all events with event time earlier than T have, to its belief, arrived, and it is typically computed as the maximum event time seen so far minus an allowed lateness, say two minutes. When the watermark passes one minute past noon, the window for the minute beginning at noon closes and its count is emitted to the store. The setting is a trade with no free option on either side, because a tight watermark of a few seconds emits windows quickly but calls more events late, while a generous one of ten minutes catches nearly every straggler and delays every dashboard and every pacing decision by those same ten minutes, which budget pacers genuinely feel, since a campaign can overspend meaningfully in ten blind minutes. The practical middle ground emits at a two-minute watermark, accepts that a small tail will still arrive afterward, and lets closed windows be corrected, first by the streaming layer through late-update rows and then definitively by the nightly batch, which by that point has seen everything that will ever arrive.

window 12:00 to 12:0112:01 to 12:0212:02 to 12:0312:0012:0112:0212:03processing timewatermark passes 12:01late click, event time 12:00:40123

Clicks arriving on time count into the window holding their event time, then the watermark passes the end of the first window, which closes and emits, and a click stamped inside that first minute but arriving two minutes later is routed back as a correction to the already-closed window.

Counting each click exactly once

Mobile clients retry over bad networks, so the same click arriving more than once is a matter of routine rather than an anomaly, and the defense starts at the source. The client mints a unique event ID at the moment the click happens, not at the moment the request is sent, so every retry of the same click carries the same ID, and a duplicate becomes detectable instead of indistinguishable from a real second click. The stream job keeps a set of recently seen IDs in its keyed state and drops repeats on arrival, and the set stays bounded because the watermark bounds it, since an event older than the allowed lateness will be handled by reconciliation anyway and its ID can be forgotten. A few minutes of IDs at 50,000 events per second comes to tens of millions of small entries, a few gigabytes spread across all the job's tasks, which operator state carries without strain.

Deduplication handles duplicate inputs, but the pipeline can also duplicate its own work by crashing mid-stream, and the guard against that is the checkpoint. The stream processor periodically snapshots all of its operator state, meaning the partial window counts and the dedup set, together with the exact Kafka offsets it had consumed, the positions in the log marking how far it had read, and it stores that snapshot durably. After a crash, a replacement worker restores the snapshot and resumes from those offsets, so no event is skipped and none is counted twice inside the processor, and the recovery is invisible except as a brief spike in consumer lag. One gap remains at the boundary with the store, because a job that crashes after writing a row but before recording the checkpoint will rewrite that row on recovery, and the fix is an idempotent sink, meaning a write that can be applied many times with the same result. The schema above provides exactly that by making the row key include the window and the run, so a recovery rewriting the noon row for ad 88231 replaces an identical value rather than adding to it. Transactional sinks, where the write and the checkpoint commit or abort together, are the heavier alternative, and they earn their cost only when the target store supports transactions cheaply, which columnar aggregate stores often do not. None of this machinery makes events arrive exactly once, which is impossible over real networks, but it makes each event's effect on stored aggregates happen exactly once, and that weaker-sounding property is precisely the one billing needs.

A click's journey and its latency budget

Tracing one click end to end shows where the minute of freshness actually goes. The click happens on a phone forty seconds after noon, the app's SDK batches it with its neighbors for up to a second, and the ad server forwards the batch to the ingest gateway, which validates the schema and produces it to Kafka, waiting for the log to replicate the write before acknowledging, a few milliseconds well spent given what the log protects. Perhaps two seconds after the tap, the event sits durably in the raw log, and everything that matters has already happened, because from this point no failure anywhere downstream can lose the click. The stream job consumes it within a second or so, dedups it, scores it for fraud, and adds it to the open noon window, where it waits. The dominant term in the budget is the watermark, since the window cannot close until two minutes of allowed lateness expire, so the row lands in the aggregate store around three minutes past noon and a dashboard refreshing then shows it, roughly two and a half minutes after the physical tap, with computation accounting for seconds of that and deliberate waiting accounting for the rest.

The failure variant of the journey is worth narrating too. Suppose the stream job crashes two minutes after noon, after the click was consumed but before any checkpoint that includes it. The job restarts from the previous checkpoint taken a minute earlier, re-reads the log from those offsets, and rebuilds the noon window including the click, and the only trace of the incident is that the window emits half a minute later than usual. A pacing engineer watching consumer lag sees a spike and a recovery, dashboards stall for under a minute, and the eventual numbers match what they would have been without the crash, which is the entire point of checkpoint-and-replay over an immutable log.

Fraud filtering and the reconciliation path

Raw clicks are not billable clicks. Some fraction comes from automated traffic, click farms, and accidental double taps, and a filtering stage sits in the stream job ahead of billable aggregation, applying the cheap rules first, which catch impossible click rates from one device, datacenter IP ranges, and clicks with no matching ad impression, followed by a model score for the subtler patterns that rules miss. Suspicious events are never deleted but routed to a quarantine store with the triggering reason attached, counted in a separate filtered column, and kept for dispute resolution, because an advertiser challenging an invoice will ask exactly what was excluded and why, and because the filters themselves need auditing when their thresholds drift over time. The day an overzealous rule starts quarantining a legitimate campaign, the quarantine store is what turns the recovery from an apology into a recomputation.

The batch path is what makes the whole system defensible to an auditor. Every night, a job replays the raw log for the previous day and recomputes every window with complete information, since by then every late event has arrived and the fraud filters can run with final rule versions and the full day's context, and the output is written as a new run with the reconciled flag set. The correction policy is explicit and intentionally asymmetric, with dashboards and pacing reading streaming rows because they need speed and tolerate small drift, while billing reads only reconciled rows, so an invoice is never built on a number that a late click could still move. Streaming and batch totals are compared every night as a matter of routine, and a drift beyond a small threshold pages someone, because persistent disagreement means one of the two computations carries a defect and the comparison is the cheapest detector of that anyone has invented. The same machinery doubles as the recovery procedure after a faulty deploy, where the operator rewinds the consumer offsets to before the damage, reruns the corrected job over the affected range, and publishes the output under a new run ID, which supersedes the wrong rows while preserving the audit trail of what was previously reported and when.

Scaling, failures, and operations

Kafka partitions the click topic by ad ID, which gives each stream-processor task a disjoint slice of ads and makes all window state local to one task, so no aggregation ever crosses a network boundary. The raw byte rate is mild, since 50,000 events per second at 0.1 KB is 5 MB/s, and the partition count is therefore set by the parallelism the stream job wants rather than by throughput, a point worth making in an interview because it shows the sizing numbers are being used rather than recited. The aggregate store partitions the same way and scales reads with replicas. The stream job scales by adding tasks, with its failure story being the checkpoint replay walked through above, and the queue absorbing backlog while a recovering job catches up. One ad going viral creates a hot key, a single partition receiving a disproportionate share of traffic while its siblings idle, and the standard relief is local pre-aggregation, where the hot ad's counts are split across several sub-keys, counted independently in parallel, and summed at emission, which trades a little merge work for an even load across tasks.

The failures worth rehearsing are the quiet ones, because the loud ones page someone by construction. If the stream job lags, dashboards go stale but nothing is lost, and an alert on consumer lag surfaces it within minutes. If the aggregate store goes briefly unavailable, the job pauses against its checkpoint rather than dropping output, and resumes where it stopped. If a fraud rule misfires, the quarantine store plus a reconciliation rerun repairs the damage as a recomputation rather than a negotiation. The one disaster with no good recovery is losing raw events before they reach the log, which is why the ingest gateway acknowledges an ad server only after Kafka has replicated the write across machines, and why the log's replication factor and retention period are treated as billing-critical configuration under change review rather than as tunables someone can relax to save disk.

Follow-up questions

  • Why window by event time instead of arrival time? Arrival-time windows change with network weather and change again on every replay, so the same raw log would yield different invoices on different days, which no auditor would accept. Event-time windows make every recomputation land identically, and that reproducibility is close to the definition of auditable.
  • How do you choose the watermark delay? Measure the real lateness distribution rather than guessing, and if 99.9 percent of clicks arrive within two minutes, a two-minute watermark leaves a 0.1 percent tail for reconciliation to fix. Tighter delays buy freshness and pay in correction volume, and the right balance follows from how much drift pacing can tolerate.
  • What does exactly-once actually mean here? It cannot mean events traverse the network once, which no protocol can promise, so it means each event's effect on stored aggregates is applied once. Deduplication by client-minted event ID, checkpointed state tied to Kafka offsets, and idempotent keyed writes combine to give that end-to-end property.
  • Why keep both a streaming and a batch path? They serve different masters, since pacing needs an answer in about a minute while billing needs an answer that can never move afterward. A single path could serve both only by being as slow as the batch or as approximate as the stream, and the nightly comparison between the two doubles as a free correctness check on each.
  • How does top-100 ads over the last hour work? The OLAP store sums sixty per-minute rows per ad and ranks the results, and at a few million active ads that scan is inexpensive and runs in well under a second. A streaming top-K sketch such as count-min can serve the dashboard variant when sub-second latency is wanted, with the exact ranking always recoverable from the store.
  • A deploy miscounted six hours of clicks. Now what? Rewind the consumer offsets to before the deploy, replay the raw log through the fixed job, and publish under a new run ID so reconciled rows supersede the wrong ones while the old rows remain visible to auditors. The raw log's immutability is what turns this from an incident report into a documented procedure.

References

  1. Akidau et al., The Dataflow Model (VLDB 2015), on event time, windows, and watermarks.
  2. Apache Kafka, Documentation, on partitioning, replication, and retention.
  3. Apache Flink, Timely stream processing concepts, on event time and watermarks in practice.
  4. Kleppmann, Designing Data-Intensive Applications (2017), on stream processing and exactly-once semantics.
  5. Xu, System Design Interview, Volume 2 (2022), chapter on ad click event aggregation.