"Is the service healthy?" is a useless question until you make it numeric. SLIs, SLOs, and error budgets are the vocabulary that turns reliability from a feeling into a measurement you can act on. I put all of this into practice in my Uptime & SLO Monitor.
The chain is simple: an SLI measures behavior, an SLO sets a target for it, and the error budget is what the target leaves you to spend on failure.
SLI, SLO, SLA
- An SLI (indicator) is a measured number: the fraction of successful requests, or the p95 latency.
- An SLO (objective) is the internal target for an SLI: 99.9% of requests succeed over 30 days.
- An SLA is the contractual version with consequences; the SLO is the goal you actually engineer toward, usually set stricter than the SLA.
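To make the SLI/SLO distinction concrete, here is a minimal sketch of a request-based availability SLI checked against an objective. The function names and the choice to report 1.0 when there is no traffic are illustrative assumptions, not something the definitions above prescribe.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded.

    Assumption: a window with no traffic counts as fully available (1.0).
    """
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo


sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(sli)                      # 0.9995
print(meets_slo(sli, 0.999))    # True
print(meets_slo(sli, 0.9999))   # False
```

The same measurement passes a three-nines objective and fails a four-nines one: the SLI is the fact, the SLO is the bar you hold it to.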
The error budget
If the target is 99.9%, then 0.1% of failure is not merely tolerated, it is budgeted: $\text{error budget} = 1 - \text{SLO}$. Translating that to time over a 30-day month makes it concrete:
- 99% (two nines): about 7.2 hours of allowed downtime per month.
- 99.9% (three nines): about 43 minutes per month.
- 99.99% (four nines): about 4.3 minutes per month.
- 99.999% (five nines): about 26 seconds per month.
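The figures above follow directly from multiplying the budget by the window length. A small sketch, assuming a 30-day window (the helper name `allowed_downtime_seconds` is made up for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(slo: float, window_days: int = 30) -> float:
    """Downtime the error budget permits over the window, in seconds."""
    budget = 1 - slo  # e.g. slo 0.999 gives a budget of 0.001
    return budget * window_days * SECONDS_PER_DAY


for slo in (0.99, 0.999, 0.9999, 0.99999):
    print(slo, round(allowed_downtime_seconds(slo), 1))
# 0.99 25920.0     (7.2 hours)
# 0.999 2592.0     (43.2 minutes)
# 0.9999 259.2     (about 4.3 minutes)
# 0.99999 25.9     (about 26 seconds)
```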
The budget is a shared currency: plenty left means ship faster and take risks; nearly spent means freeze features and stabilize. It replaces "should we deploy?" arguments with a number both developers and operators agree on.
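One way to turn that shared currency into a number is to track the fraction of the budget still unspent. The sketch below assumes the counts cover the whole SLO window, and the 10% freeze threshold is purely a hypothetical policy value; each team picks its own.

```python
def budget_remaining(failed: int, total: int, slo: float) -> float:
    """Fraction of the window's error budget still unspent.

    Negative when the budget is overspent. Assumes total > 0 and slo < 1.
    """
    if total <= 0 or slo >= 1:
        raise ValueError("need total > 0 and slo < 1")
    allowed_failures = (1 - slo) * total
    return 1 - failed / allowed_failures


def release_decision(remaining: float, freeze_below: float = 0.10) -> str:
    """Hypothetical policy: freeze features once less than 10% of the budget is left."""
    return "freeze" if remaining < freeze_below else "ship"


remaining = budget_remaining(failed=2_500, total=10_000_000, slo=0.999)
print(round(remaining, 2))           # 0.75
print(release_decision(remaining))   # ship
print(release_decision(budget_remaining(9_500, 10_000_000, 0.999)))  # freeze
```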
Burn rate
Burn rate is how fast you are spending the budget relative to spending it evenly across the window: $\text{burn} = \text{error rate} / \text{error budget}$. A burn rate of 1 exhausts the budget exactly at the end of the window; a burn rate of 14.4 would burn an entire 30-day budget in about two days, which is why that threshold pages immediately.
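The "about two days" figure is just the window divided by the burn rate, and the same arithmetic shows how much budget a sustained burn consumes in a given time. A minimal sketch, assuming a 30-day window (both helper names are illustrative):

```python
def days_to_exhaustion(burn: float, window_days: float = 30.0) -> float:
    """Days until the budget is gone if the burn rate stays constant."""
    if burn <= 0:
        return float("inf")  # nothing is being spent
    return window_days / burn


def budget_spent(burn: float, elapsed_hours: float, window_days: float = 30.0) -> float:
    """Fraction of the window's budget consumed after `elapsed_hours` at this burn rate."""
    return burn * elapsed_hours / (window_days * 24)


print(days_to_exhaustion(1.0))              # 30.0
print(round(days_to_exhaustion(14.4), 2))   # 2.08
print(round(budget_spent(14.4, 1.0), 4))    # 0.02
```

Read the last line the other way round: a burn rate of 14.4 is what it takes to spend 2% of a 30-day budget in a single hour.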
Multi-window, multi-burn-rate alerting
Alerting on a single window is either too twitchy (a short window fires on every blip) or too slow (a long window takes hours to react and keeps firing long after recovery), so the Google SRE approach pairs a long and a short window and fires only when both exceed the same burn-rate threshold. The long window confirms the budget is genuinely burning; the short window confirms it is still burning right now, which lets the alert reset quickly once the incident is over. A high burn rate pages a human; a slow leak only files a ticket.
def burn_rate(error_rate, slo):
    budget = 1 - slo  # slo 0.999 gives a budget of 0.001
    return error_rate / budget if budget > 0 else float('inf')

def multi_window_alert(error_rate_long, error_rate_short, slo, threshold):
    """Fire only when BOTH windows are burning fast."""
    return (burn_rate(error_rate_long, slo) >= threshold and
            burn_rate(error_rate_short, slo) >= threshold)
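The "page versus ticket" split can be sketched as a small table of rules evaluated from fastest to slowest. The window and threshold values below are the starting points I recall from the SRE Workbook's alerting chapter for a 30-day SLO; treat them as assumptions to check against the book and tune for your service. `BurnRule`, `POLICY`, and `evaluate` are hypothetical names, and `burn_rate` is repeated so the block runs on its own.

```python
from typing import Dict, NamedTuple, Optional


class BurnRule(NamedTuple):
    long_window: str
    short_window: str
    threshold: float
    action: str


# Assumed starting values (recalled from the SRE Workbook, 30-day SLO).
POLICY = [
    BurnRule("1h", "5m", 14.4, "page"),    # 2% of budget in 1 hour
    BurnRule("6h", "30m", 6.0, "page"),    # 5% of budget in 6 hours
    BurnRule("3d", "6h", 1.0, "ticket"),   # 10% of budget in 3 days
]


def burn_rate(error_rate: float, slo: float) -> float:
    budget = 1 - slo
    return error_rate / budget if budget > 0 else float("inf")


def evaluate(error_rates: Dict[str, float], slo: float) -> Optional[str]:
    """Return the action of the first rule whose long AND short windows both exceed its threshold."""
    for rule in POLICY:
        long_burn = burn_rate(error_rates[rule.long_window], slo)
        short_burn = burn_rate(error_rates[rule.short_window], slo)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return rule.action
    return None


windows = ("5m", "30m", "1h", "6h", "3d")
outage = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.001}
slow_leak = {w: 0.002 for w in windows}
healthy = {w: 0.0005 for w in windows}

print(evaluate(outage, 0.999))     # page
print(evaluate(slow_leak, 0.999))  # ticket
print(evaluate(healthy, 0.999))    # None
```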
Complexity (time and space)
Both functions are $O(1)$; the engineering cost is in measuring the SLIs accurately over rolling windows, which a monitoring system does with bucketed counters. The alerting policy is a small fixed set of window/threshold pairs evaluated each tick.
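As a rough illustration of the bucketed-counter idea, the sketch below keeps one (total, errors) pair per fixed-width time bucket and drops buckets that fall out of the window. It is a toy, not how any particular monitoring system does it: it assumes timestamps arrive in non-decreasing order, and the class name `RollingErrorRate` is invented. Note that the query here costs time proportional to the number of buckets; keeping running sums would make it constant.

```python
from collections import deque


class RollingErrorRate:
    """Error rate over the last `num_buckets` fixed-width time buckets."""

    def __init__(self, bucket_seconds: int, num_buckets: int) -> None:
        self.bucket_seconds = bucket_seconds
        self.num_buckets = num_buckets
        self.buckets = deque()  # each entry: [bucket_index, total, errors]

    def _evict(self, current_index: int) -> None:
        # Drop buckets older than the window that ends at current_index.
        oldest_allowed = current_index - self.num_buckets + 1
        while self.buckets and self.buckets[0][0] < oldest_allowed:
            self.buckets.popleft()

    def record(self, timestamp: float, is_error: bool) -> None:
        index = int(timestamp // self.bucket_seconds)
        if not self.buckets or self.buckets[-1][0] != index:
            self.buckets.append([index, 0, 0])
        bucket = self.buckets[-1]
        bucket[1] += 1
        bucket[2] += int(is_error)
        self._evict(index)

    def error_rate(self, now: float) -> float:
        self._evict(int(now // self.bucket_seconds))
        total = sum(b[1] for b in self.buckets)
        errors = sum(b[2] for b in self.buckets)
        return errors / total if total else 0.0


window = RollingErrorRate(bucket_seconds=60, num_buckets=5)  # a 5-minute window
for i in range(100):
    window.record(10.0, is_error=(i < 5))   # 5 errors out of 100 in the first minute
for _ in range(100):
    window.record(130.0, is_error=False)    # 100 clean requests two minutes later

print(window.error_rate(now=130.0))  # 0.025
print(window.error_rate(now=310.0))  # 0.0 (the first bucket has aged out)
```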
Worked example
A 2% error rate against a 99.9% SLO is burning twenty times too fast, and the multi-window rule only fires when both windows agree:
print(round(burn_rate(0.02, 0.999), 1))             # 20.0: burning 20x the safe rate
print(multi_window_alert(0.05, 0.05, 0.999, 14.4))  # True: both windows hot -> page
print(multi_window_alert(0.05, 0.0, 0.999, 14.4))   # False: short window clear -> no page
Follow-up questions
- What is an error budget? One minus the SLO: the quantitative amount of failure the target permits over the window, e.g. 43 minutes per month at 99.9%.
- What is burn rate? The observed error rate divided by the budget; a burn of 1 spends the whole budget exactly over the window.
- Why pair a long and short window? The long window confirms real budget burn; the short window makes the alert fire fast and reset fast, avoiding noise.
- Why does 14.4x page immediately? At that rate the entire 30-day budget is gone in about two days, so it warrants paging a human.
- SLO vs SLA? The SLO is the internal engineering target; the SLA is the external contract with consequences, usually set looser than the SLO.
References
- Beyer et al., Site Reliability Engineering, ch. Service Level Objectives (O'Reilly, 2016).
- Beyer et al., The Site Reliability Workbook, ch. Alerting on SLOs (O'Reilly, 2018).