SLOs, SLIs, and error budgets

Software & architecture · reliability · Nov 2024

"Is the service healthy?" is a useless question until you make it numeric. SLIs, SLOs, and error budgets are the vocabulary that turns reliability from a feeling into a measurement you can act on. I put all of this into practice in my Uptime & SLO Monitor.

The chain is simple: an SLI measures behavior, an SLO sets a target for it, and the error budget is what the target leaves you to spend on failure.

SLI, SLO, SLA

  • An SLI (indicator) is a measured number: the fraction of successful requests, or the p95 latency.
  • An SLO (objective) is the internal target for an SLI: 99.9% of requests succeed over 30 days.
  • An SLA is the contractual version with consequences; the SLO is the goal you actually engineer toward, usually set stricter than the SLA.

The error budget

If the target is 99.9%, then 0.1% of failure is not merely tolerated, it is budgeted: $\text{error budget} = 1 - \text{SLO}$. Translating that to time over a 30-day month makes it concrete:

  • 99% (two nines): about 7.2 hours of allowed downtime per month.
  • 99.9% (three nines): about 43 minutes per month.
  • 99.99% (four nines): about 4.3 minutes per month.
  • 99.999% (five nines): about 26 seconds per month.
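The figures in this list come from multiplying the budget by the window length. A short sketch of that arithmetic, assuming a 30-day window (the helper name `allowed_downtime` is hypothetical):

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Error budget (1 - slo) expressed as time within the window."""
    return window * (1 - slo)


for slo in (0.99, 0.999, 0.9999, 0.99999):
    seconds = allowed_downtime(slo).total_seconds()
    # Prints roughly 432.0, 43.2, 4.3 and 0.4 minutes respectively.
    print(f"{slo:.3%} -> {seconds / 60:.1f} min ({round(seconds)} s)")
```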

The budget is a shared currency: plenty left means ship faster and take risks; nearly spent means freeze features and stabilize. It replaces "should we deploy?" arguments with a number both developers and operators agree on.
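One way to turn that currency into a decision is to track how much of the budget is left and compare it with a freeze threshold. The sketch below is an illustration only: `budget_remaining`, `release_policy`, and the 10% `freeze_below` value are hypothetical, and the budget is computed from requests seen so far rather than from a traffic forecast for the full window.

```python
def budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total
    if allowed_failures == 0:
        # No traffic or a 100% SLO: any failure overspends the budget.
        return 1.0 if good == total else float("-inf")
    actual_failures = total - good
    return 1 - actual_failures / allowed_failures


def release_policy(remaining: float, freeze_below: float = 0.1) -> str:
    """Hypothetical rule: freeze features once less than 10% of the budget is left."""
    return "freeze" if remaining < freeze_below else "ship"


healthy = budget_remaining(good=999_400, total=1_000_000, slo=0.999)
strained = budget_remaining(good=999_050, total=1_000_000, slo=0.999)
print(round(healthy, 2), release_policy(healthy))    # 0.4 ship
print(round(strained, 2), release_policy(strained))  # 0.05 freeze
```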

Burn rate

Burn rate is how fast you are spending the budget relative to spending it evenly across the window: $\text{burn} = \text{error rate} / \text{error budget}$. A burn rate of 1 exhausts the budget exactly at the end of the window. A burn rate of 14.4 exhausts an entire 30-day budget in about two days ($30 / 14.4 \approx 2.1$), or 2% of it in a single hour, which is why that threshold pages immediately.
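The relationship between burn rate and time can be checked with a few lines of arithmetic. This is a sketch assuming a constant burn rate over a 30-day window; the helper names `days_to_exhaustion` and `budget_consumed` are hypothetical.

```python
def days_to_exhaustion(burn: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone at a constant burn rate."""
    if burn <= 0:
        return float("inf")  # nothing is being spent
    return window_days / burn


def budget_consumed(burn: float, elapsed_hours: float, window_days: float = 30.0) -> float:
    """Fraction of the window's budget spent after elapsed_hours at a constant burn."""
    return burn * elapsed_hours / (window_days * 24)


print(round(days_to_exhaustion(1.0), 2))     # 30.0  budget lasts exactly the window
print(round(days_to_exhaustion(14.4), 2))    # 2.08  gone in about two days
print(round(budget_consumed(14.4, 1.0), 4))  # 0.02  2% of the budget in one hour
```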

Multi-window, multi-burn-rate alerting

Alerting on burn rate over a single window is either too twitchy (a short window fires on every blip) or too slow (a long window takes ages to notice and to clear), so the Google SRE approach pairs a long and a short window and fires only when both exceed the threshold. The long window confirms the budget is genuinely burning; the short window confirms it is still burning right now, so the alert resets quickly once the incident is over. A high burn rate pages a human; a slow leak only files a ticket.

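def burn_rate(error_rate, slo):
    """Observed error rate as a multiple of the error budget (1 - slo)."""
    budget = 1 - slo                  # slo 0.999 gives a budget of 0.001
    # An SLO of 100% leaves no budget, so the burn is reported as infinite.
    return error_rate / budget if budget > 0 else float('inf')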

def multi_window_alert(error_rate_long, error_rate_short, slo, threshold):
    """Fire only when BOTH windows are burning fast."""
    return (burn_rate(error_rate_long, slo) >= threshold and
            burn_rate(error_rate_short, slo) >= threshold)
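The page-versus-ticket split can be sketched as a small table of tiers checked from most to least urgent. The window and threshold pairs below follow the commonly cited starting values from the SRE Workbook's alerting chapter; treat them as assumptions to tune, not as fixed rules. `Tier`, `TIERS`, and `evaluate` are hypothetical names, and `error_rates` is assumed to be a mapping from window label to the error rate measured over that window.

```python
from typing import Mapping, NamedTuple, Optional


class Tier(NamedTuple):
    long_window: str
    short_window: str
    threshold: float
    action: str


# Assumed starting values; each short window is 1/12 of its long window.
TIERS = [
    Tier("1h", "5m", 14.4, "page"),
    Tier("6h", "30m", 6.0, "page"),
    Tier("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_rate: float, slo: float) -> float:
    budget = 1 - slo
    return error_rate / budget if budget > 0 else float("inf")


def evaluate(error_rates: Mapping[str, float], slo: float) -> Optional[str]:
    """Return the action of the first tier whose long AND short windows both burn too fast."""
    for tier in TIERS:
        if (burn_rate(error_rates[tier.long_window], slo) >= tier.threshold and
                burn_rate(error_rates[tier.short_window], slo) >= tier.threshold):
            return tier.action
    return None


outage = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.002, "3d": 0.0005}
slow_leak = {"5m": 0.0, "30m": 0.0015, "1h": 0.0015, "6h": 0.0015, "3d": 0.0015}
print(evaluate(outage, 0.999))     # page
print(evaluate(slow_leak, 0.999))  # ticket
```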

Complexity (time and space)

Both functions are $O(1)$; the engineering cost is in measuring the SLIs accurately over rolling windows, which a monitoring system does with bucketed counters. The alerting policy is a small fixed set of window/threshold pairs evaluated each tick.
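A minimal sketch of the bucketed-counter idea, assuming events arrive in non-decreasing timestamp order and one bucket per minute. The class `RollingErrorRate` is hypothetical and not taken from any monitoring system; recording is $O(1)$ and reading is linear in the number of buckets.

```python
from collections import deque


class RollingErrorRate:
    """Error rate over the last num_buckets fixed-size time buckets."""

    def __init__(self, num_buckets: int, bucket_seconds: int = 60):
        self.bucket_seconds = bucket_seconds
        # Each entry is [bucket_id, errors, total]; old buckets fall off the left.
        self.buckets = deque(maxlen=num_buckets)

    def record(self, timestamp: float, is_error: bool) -> None:
        bucket_id = int(timestamp // self.bucket_seconds)
        if not self.buckets or self.buckets[-1][0] != bucket_id:
            self.buckets.append([bucket_id, 0, 0])
        self.buckets[-1][1] += int(is_error)
        self.buckets[-1][2] += 1

    def error_rate(self, now: float) -> float:
        # Ignore buckets that have aged out of the window as of `now`.
        oldest = int(now // self.bucket_seconds) - self.buckets.maxlen + 1
        errors = sum(e for b, e, _ in self.buckets if b >= oldest)
        total = sum(t for b, _, t in self.buckets if b >= oldest)
        return errors / total if total else 0.0


window = RollingErrorRate(num_buckets=5)      # a 5-minute window
for t in range(300):                          # one request per second
    window.record(t, is_error=(t % 50 == 0))  # 6 errors in 300 requests
print(window.error_rate(now=299))             # 0.02
```

The resulting rate is what would be passed to `burn_rate` for the matching window.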

Worked example

A 2% error rate against a 99.9% SLO burns the budget twenty times too fast. The multi-window rule, shown here with a 5% error rate (a burn of 50) against the 14.4 threshold, fires only when both windows agree:

print(round(burn_rate(0.02, 0.999), 1))            # 20.0   burning 20x the safe rate
print(multi_window_alert(0.05, 0.05, 0.999, 14.4)) # True   both windows hot -> page
print(multi_window_alert(0.05, 0.0,  0.999, 14.4)) # False  short window clear -> no page

Follow-up questions

  • What is an error budget? One minus the SLO: the amount of failure the target permits over the window, e.g. about 43 minutes of downtime in a 30-day month at 99.9%.
  • What is burn rate? The observed error rate divided by the budget; a burn of 1 spends the whole budget exactly over the window.
  • Why pair a long and short window? The long window confirms real budget burn; the short window makes the alert fire fast and reset fast, avoiding noise.
  • Why does 14.4x page immediately? At that rate the entire 30-day budget is gone in about two days, so it warrants paging a human.
  • SLO vs SLA? The SLO is the internal engineering target; the SLA is the external contract with consequences, usually set looser than the SLO.

References

  1. Beyer et al., Site Reliability Engineering, ch. Service Level Objectives (O'Reilly, 2016).
  2. Beyer et al., The Site Reliability Workbook, ch. Alerting on SLOs (O'Reilly, 2018).