Design a metrics monitoring and alerting system

A metrics system watches everything else. Every server, container, and service in a fleet emits numbers, CPU load, request counts, queue depths, error rates, and this system collects them all, stores them as time series, draws them on dashboards, and wakes someone up when a number crosses a line. Operations teams live inside it for most of their working day, capacity planning is impossible without the history it keeps, and during an incident it is the difference between a five-minute diagnosis and an hour of guessing in the dark. Interviewers reach for this question because it looks like simple plumbing and then turns out to contain a specialized database, a streaming pipeline, and a small state machine with pager-duty consequences, and because every one of those parts must keep working at precisely the moment the rest of the infrastructure is on fire.

The genuinely hard parts deserve naming up front. The data model has a tendency to explode combinatorially when one engineer adds one careless label, the storage engine must absorb a million writes per second on hardware that a general-purpose database would saturate at a small fraction of that rate, and the alerting path has to be the most reliable component in the company while depending, directly or indirectly, on almost everything else. The walkthrough below takes those three in order, with the supporting machinery built around them.

Scope and requirements

Functionally, the system collects operational metrics from a large fleet, on the order of 100,000 servers, stores them as time series, serves a query language for dashboards and ad hoc investigation, and evaluates alert rules that notify on-call engineers through pagers, chat, and email. Log and trace collection are explicitly out of scope, because logs are text events and traces are request trees, both with their own storage shapes and their own systems, while this design is for numbers over time. Saying that boundary out loud in an interview saves ten minutes of designing the wrong thing.

Non-functionally, ingestion must keep up with the fleet at around one million samples per second without dropping data during traffic spikes, since the spikes are exactly when the data matters most. Queries over recent data should return in well under a second, because a person iterating on a dashboard during an incident will issue dozens of queries and every second of lag compounds into minutes of delayed diagnosis. A year or more of history must be retained at a cost that does not embarrass the system next to the fleet it monitors. Alert evaluation carries the strictest requirement of all, namely that it keep functioning through partial failures of the very infrastructure it observes, because a late or lost page is the worst defect this system can have. Two freedoms balance these demands, in that slightly stale data is acceptable almost everywhere, and samples are append-only facts that are never updated in place, and the storage design below leans on both.

Sizing the problem

The unit of storage is a series, meaning one metric name plus one unique combination of labels. If each of 100,000 servers exposes about 100 series, the fleet carries 10 million active series, and scraping each one every 10 seconds produces 1 million samples per second, which compounds to 86.4 billion samples per day. A sample is a timestamp and a 64-bit float, 16 bytes uncompressed, so a day of raw data is about 1.4 TB before compression and, as the storage section works out, nearer 120 GB after it. None of those numbers is alarming on its own, and the design would be easy if they stayed put.

The number that actually kills these systems is cardinality, the count of distinct label combinations in existence. Labels are key-value pairs attached to a metric name, such as http_requests_total{service="checkout", region="us-east"}, and every new combination mints a new series with its own index entry and its own in-memory state. Consider what happens when one engineer adds a user_id label to a request-latency metric in a service with 2 million active users and 10 instrumented endpoints. That single line of instrumentation can create up to 20 million new series, tripling the fleet-wide total overnight, and at roughly 3 KB of index and head-block memory per active series it demands about 60 GB of additional RAM across the storage tier, which arrives as mysterious memory pressure pages for the storage team rather than as a clear signal pointing at the offending deploy. Series churn does similar damage on a slower fuse, since a label that embeds a container ID creates a fresh series on every redeploy and leaves the old one rotting in the index. Guardrails on label values are therefore a requirement of the design rather than a refinement to add later.

The query API

Dashboards and alert rules deliberately speak one interface, which selects series by name and label matchers, applies functions over time windows, and aggregates across series. Sharing the interface means an alert and the graph an engineer pulls up to investigate it can never disagree about what the data says, which matters at 3 a.m. more than any other property of the API. The HTTP surface stays small and the expression language carries the weight, PromQL in spirit here:

GET /api/v1/query_range
    ?query=sum by (service) (rate(http_requests_total{status="500"}[5m]))
    &start=2025-10-07T10:00:00Z&end=2025-10-07T16:00:00Z&step=60s
→ 200 { "result": [ { "labels": {"service": "checkout"},
                      "values": [[1759831200, 4.2], ...] }, ... ] }

GET /api/v1/series?match=node_cpu_seconds_total{region="us-east"}
→ 200 { "series": [ ...label sets... ] }

POST /api/v1/ingest        ← push path for batch jobs and edge agents
{ "series": [ { "labels": {...}, "samples": [[ts, value], ...] } ] }

The data model in storage

Logically the store consists of two structures, namely an index from label values to series IDs and a set of per-series sample streams. The index is inverted, keeping one posting list per label pair, which is what lets a query for {service="checkout"} resolve directly to its few hundred series instead of scanning ten million label sets to find them, and queries with several matchers intersect the relevant posting lists, which is cheap because the lists are sorted. Without that structure every dashboard panel would be a full index scan, and the system would fall over the first time someone opened a busy dashboard during an incident. The relational sketch makes the shape concrete even though the real engine is purpose-built:

CREATE TABLE series (
  series_id   BIGINT PRIMARY KEY,
  metric_name TEXT NOT NULL,           -- e.g. http_requests_total
  labels      JSONB NOT NULL           -- {"service":"checkout", ...}
);
-- inverted index: label pair → series ids, one row per (pair, series)
CREATE INDEX series_by_label ON series USING gin (labels);

CREATE TABLE samples (
  series_id   BIGINT NOT NULL,
  ts          BIGINT NOT NULL,         -- ms since epoch
  value       DOUBLE PRECISION NOT NULL,
  PRIMARY KEY (series_id, ts)          -- append-mostly, time-ordered
);

The high-level architecture

Collection comes in two flavors and a serious design supports both. Pull collection, the Prometheus model, has collectors discover targets through service discovery, the registry of what is running where, and scrape each target's metrics endpoint on a schedule. Pull wins inside a datacenter because the scraper controls its own load, a dead target announces itself through the failed scrape, and nothing needs configuring on the monitored host, which matters enormously across 100,000 servers owned by dozens of teams. Push collection has agents send metrics through a gateway instead, and it wins for short-lived batch jobs that may start and finish entirely between two scrapes, and for edge devices behind NAT, the address translation that hides devices from inbound connections, which a scraper simply cannot reach. Whichever way samples are collected, they flow into Kafka, a durable distributed queue, before touching storage, and that buffer decouples the two sides of the pipeline, so a storage deploy or a slow shard never pushes back on collectors and an ingest spike parks safely in the queue instead of being dropped at the door. The alternative of writing straight from collectors to storage looks simpler on the whiteboard and couples every storage hiccup to data loss at the edge, which is exactly the wrong component to make fragile. Storage nodes consume from the queue, partitioned so each series consistently lands on one shard, and a query layer fans out across shards while the rule evaluator and dashboards sit on top of it.

Samples flow left to right into sharded time-series storage; the query service fans out across shards for dashboards and for the rule evaluator, whose firing alerts pass through the alert manager before anyone is paged.

The time-series storage engine

A general-purpose B-tree database struggles here for a structural reason rather than a tuning one. One million samples per second arriving for ten million different series means writes land all over the key space, so the tree's pages churn randomly, write amplification climbs as half-empty pages rewrite themselves, and per-row overhead of dozens of bytes swamps samples that are 16 bytes of actual fact. The data has exploitable shape, though, because it is append-mostly, time-ordered within each series, and highly regular from one sample to the next, and the engine that fits is the one Facebook described in the Gorilla paper. Timestamps compress by delta-of-delta encoding, which works because scrapes arrive on a nearly fixed cadence, so instead of recording 1759831200, 1759831210, 1759831220 in full, the engine records the difference between consecutive deltas, a number that is almost always zero and encodes in one or two bits. Values compress by XOR, where each float is combined bitwise with its predecessor, and because adjacent samples of a gauge are usually close in value, the result is mostly zero bits that encode down to a handful of meaningful ones. Gorilla reported an average around 1.37 bytes per 16-byte sample, a twelvefold reduction, and that single measurement is what makes the rest of the design affordable, shrinking 86.4 billion daily samples from 1.4 TB to roughly 120 GB and letting the hottest data live in memory where dashboards need it.

Each shard keeps the most recent two hours of its series in a compressed in-memory head block, writing every incoming sample to a write-ahead log, an append-only file that lets a crashed shard replay its way back to the exact state it lost, and then flushes the head to an immutable, time-partitioned file on disk with its own small index. Queries for recent data, which dominate dashboard traffic by a wide margin, are served entirely from memory, while older windows read a handful of immutable files. Compaction merges small files into larger ones in the background to keep file counts sane, and because files are immutable, deleting expired data means removing whole files rather than rewriting anything, which is the cheapest delete a storage system can hope for.

Retention and downsampling

Nobody needs 10-second resolution from eight months ago, and the budget arithmetic shows why pretending otherwise fails. Keeping raw samples for 5 years would cost about 120 GB times 1,825 days, around 220 TB, almost all of it data nobody will ever query at full detail, so the standard escape is downsampling, the replacement of many fine-grained samples with fewer coarse aggregates over each window. Raw samples are kept for 15 days, about 2 TB, which covers incident response and week-over-week comparison, the two uses that genuinely need detail. A rollup job then reduces each series to per-minute aggregates, and it stores a sum and a count per minute rather than a precomputed average, because sums and counts can be re-aggregated exactly across any longer window while averages of averages quietly lie. The arithmetic works out to 10 million series times 1,440 minutes times 2 values, about 29 billion values per day, roughly 43 GB at the same compression, and 13 months of that is about 17 TB. A second tier keeps per-hour aggregates for 5 years at about 0.7 GB per day, 1.3 TB in total. The three tiers together cost around 20 TB against 220 TB for raw-everything, and the query layer picks the coarsest tier whose resolution still answers the question, so a one-year dashboard reads hourly points instead of dragging billions of raw samples through the network. What is lost deserves an unflinching sentence, because a sub-minute spike older than 15 days is invisible forever, and any percentile that was not precomputed cannot be recovered from sums and counts, so teams that care about long-term tail latency must decide before the data ages, not after.

The raw tier holds immutable two-hour blocks at full resolution; rollup jobs feed the one-minute and one-hour tiers, and total storage lands near 20 TB instead of 220 TB for five years of raw data.

The life of a dashboard query

Walking one query through the system shows where the sub-second budget goes. An engineer loads a panel asking for the per-service rate of 500-status requests over the last six hours, and the query service first parses the expression and extracts the label matchers. Resolving {status="500"} intersected with the metric name against the inverted index costs a few milliseconds per shard and yields perhaps two thousand matching series, and that selection step is the point where guardrails act, because if a careless regex matcher would touch two million series instead, the query is rejected with a clear error rather than allowed to pull a shard's whole index into memory. Each shard then reads the selected series for the time range, decompressing from the in-memory head for the recent window at millions of samples per core per second, computes the rate locally, and ships back partial aggregates rather than raw samples, which keeps the network cost proportional to the number of series rather than the number of samples. The query service merges the partials, performs the final sum by service, and returns in 200 to 400 milliseconds for a typical panel, with the index lookup, the shard decompression, and the merge each taking a visible slice of that budget.

The slower cases reveal the design's edges. A query reaching back into the rollup tiers reads from immutable files instead of memory and pays disk latency, though far less data, so a thirty-day panel often returns nearly as fast as a six-hour one at coarser resolution. A query during shard failover may hit the replica and return complete data a beat slower, and a query that genuinely spans a dead shard's unreplicated window comes back with a partial-data flag, because in monitoring a clearly labeled incomplete answer is worth far more than a refusal, and an engineer mid-incident can act on a graph with a small gap but cannot act on an error page.

Alerting and its state machine

The rule evaluator runs each alert rule on a fixed clock, typically every 15 to 60 seconds, by issuing the rule's query through the same query layer the dashboards use, so an alert and a graph can never disagree about the data. Each rule's result drives a small state machine per label set, in which a rule sits inactive while its condition is false, moves to pending when the condition first turns true, advances to firing only after the condition has held continuously for a configured duration, and resolves when it turns false again. The pending state exists to suppress flapping, because a CPU metric that spikes over its threshold for one scrape and drops back never pages anyone when the rule requires five minutes of sustained breach, since brief noise enters pending and dies there quietly. The cost is detection delay equal to exactly the configured hold, which is why rules for hard failures like a dead database primary use holds of seconds while rules for slow burns like disk filling use holds of an hour, and choosing those durations is on-call quality engineering as much as it is configuration.

Firing alerts then pass through the alert manager, which exists because raw alerts at fleet scale would page one person hundreds of times for a single outage. It deduplicates identical alerts arriving from redundant evaluators, groups related alerts so that fifty instances of one failing service produce one notification listing fifty members rather than fifty pages, honors silences during planned maintenance so a scheduled database failover does not wake three teams, and routes by severity, sending a page for the database primary being down but only a chat message for a disk filling slowly. Escalation policies re-page up a chain when nobody acknowledges within a deadline, which converts an unlucky heavy sleeper from an outage extender into a minor footnote. One question has to close the design, which is who watches the watcher. The standard answer is a dead-man's switch, an alert deliberately configured so its condition is always true, whose continuous firing a small external service expects every minute, and when that heartbeat stops arriving the external service pages directly, so silence from the monitoring system becomes the loudest signal it can send.

Scaling, failures, and operations

Each tier scales by its own rule. Collectors are stateless and partition the target list among themselves using the same service discovery they scrape from, so adding a collector rebalances the scrape load automatically. Kafka scales by partitions, and the sizing is mild here, since a million samples per second in batched, compressed form is on the order of 100 MB/s, comfortably within a small cluster's capacity. Storage shards partition by hash of series so ingest spreads evenly, each shard replicates to a peer so a node loss costs no data and queries fail over, and a shard that dies and returns replays the queue from its last consumed offset, which is the quiet payoff of buffering ingestion, because the queue holds hours of backlog patiently while a shard rebuilds. Queries that span shards fan out and merge, and the expensive ones are bounded by limits on the number of series a single query may touch, for the reason walked through above.

The failure stories worth rehearsing are the partial ones, because total failures are obvious and partial failures are where designs quietly rot. If Kafka itself degrades, collectors buffer briefly in memory and then shed the oldest data first, on the argument that during an incident the newest samples carry almost all of the diagnostic value. If a rollup job falls behind, raw data simply lives a little longer before deletion, so the failure costs disk for a few hours rather than correctness forever. The rule evaluator runs as two replicas evaluating every rule independently, with the alert manager deduplicating their output, so losing one evaluator changes nothing anyone can observe. The alerting path as a whole is kept deliberately boring, with few dependencies, its own small deployment, and pinned versions, because the one component that must work during a bad day is the component that tells you the day is bad, and boring is the highest compliment that component can earn.

Follow-up questions

Pull or push collection? Pull wins inside the datacenter, because the scraper controls its own load, a dead target is detected by the failed scrape itself, and monitored hosts need no configuration. Push wins for short-lived jobs that finish between scrapes and for edge agents behind NAT that a scraper cannot reach, so real fleets run pull as the default with a push gateway standing at the edges.
How do you stop a cardinality explosion before it lands? Enforce per-service series quotas at ingestion so the blast radius is capped by policy, reject label keys like user or request IDs outright, and surface a report of fastest-growing label sets so the offending deploy is identified in minutes. Without the guardrails the first symptom is storage nodes running out of memory, which is the most expensive possible way to learn about a new label.
Why buffer ingestion through Kafka instead of writing straight to storage? The buffer decouples failure domains, so a storage deploy, a slow shard, or an ingest spike becomes queue lag that drains later instead of samples dropped at the edge, and a rebuilt shard replays from its last offset as if nothing happened. Direct writes make every storage hiccup a permanent hole in the data.
Why not store metrics in a relational database? Random-key writes at a million rows per second churn B-tree pages and spend tens of bytes of per-row overhead on facts that are 16 bytes long, so the hardware bill grows absurd before the throughput is reached. A purpose-built engine using delta-of-delta and XOR encoding cuts storage about twelvefold and serves the hot window from memory, which the general-purpose engine cannot approach.
What does "pending for five minutes" actually buy? It buys flap suppression, since threshold noise enters the pending state and expires there without ever paging a human. The price is exactly five minutes of detection delay on real failures, which is why hold durations are tuned per rule, short for hard failures and long for slow burns.
Who alerts when the alerting system is down? A dead-man's switch does, in the form of one rule built to fire continuously forever, paired with a tiny external service that expects the heartbeat and pages directly when it stops arriving. The design turns the system's silence into a signal, which is the only signal a fully dead monitor can still send.

References

Pelkonen et al., Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015), on delta-of-delta and XOR compression.
Prometheus, Overview documentation, on the pull model, data model, and architecture.
Prometheus, Alerting rules documentation, on pending and firing semantics.
Kleppmann, Designing Data-Intensive Applications (2017), on storage engines and stream processing.
Xu, System Design Interview, Volume 2 (2022), chapter on metrics monitoring and alerting.