DevOps SLO, SLI, and Error Budgets Explained
Service Level Indicators, Objectives, and error budgets demystified: how to pick the right metric, set a target, and use the budget as a decision tool.
What you'll learn
- Difference between SLI, SLO, and SLA
- How to pick good SLIs
- Setting realistic SLO targets
- Calculating an error budget
- Using burn rate alerts
Prerequisites
- Familiarity with shell and YAML
What and Why
“How reliable should our service be?” sounds philosophical until your CEO is paying a premium for five nines and your team is firefighting every weekend. SLIs, SLOs, and error budgets are the SRE answer to that question. They turn reliability into a number, and the number into a budget the team can spend.
The framework forces tradeoffs to be explicit. Shipping features fast burns the budget. Excessive caution wastes it. Done well, the budget aligns engineering and product on when to ship versus when to harden.
Mental Model
SLI = a measurement (e.g. success ratio of HTTP requests)
SLO = a target (e.g. 99.9% of requests succeed over 30 days)
SLA = a contract (e.g. refund if below 99.5%)
Error budget = 1 - SLO (e.g. 0.1% of requests may fail)
The SLI is the raw measurement. The SLO is the target. The error budget is what is left over. If your SLO is 99.9% over 30 days, you have a budget of about 43 minutes of downtime per month (0.1% of 43,200 minutes). Spend it on risky releases. Stop spending it when it runs out.
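The 43-minute figure follows directly from the definition. A minimal sketch of the arithmetic, assuming a time-based budget where every minute of the window counts equally (the function name is illustrative, not from any library):

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    # Each extra nine cuts the budget by a factor of ten.
    for slo in (0.99, 0.999, 0.9999):
        minutes = error_budget_minutes(slo, 30)
        print(f"SLO {slo:.2%} over 30d allows {minutes:.1f} minutes")
```

For 99.9% over 30 days this gives 43.2 minutes, which is where the "about 43 minutes" comes from.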
Hands-on Example
Start by writing the SLI as a precise query. For an HTTP service:
# Good events / valid events over the window
sum(rate(http_requests_total{status!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))
Encode the SLO in a config file your team owns:
service: checkout-api
slo:
  objective: 0.999   # 99.9% success
  window: 30d
  description: "Successful HTTP responses excluding 5xx"
sli:
  good: 'sum(rate(http_requests_total{status!~"5.."}[5m]))'
  total: 'sum(rate(http_requests_total[5m]))'
Now derive a burn rate alert. Burn rate is how fast you are spending the budget relative to the rate that would use it up exactly at the end of the window. A burn rate of 1 means the budget lasts exactly 30 days. A burn rate of 14 means you will exhaust 30 days of budget in about two days.
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="checkout",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="checkout"}[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout burning error budget fast"
That alert pages when the error rate over the last hour, if sustained, would burn the entire 30-day budget in roughly two days (30 / 14.4 ≈ 2.1). Pair it with a slow-burn alert that catches steady leakage.
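The numbers in the alert can be checked by hand. The sketch below is one way to do that arithmetic, assuming a 99.9% SLO over 30 days as in the config; the function names are illustrative only.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def budget_spent_fraction(rate: float, hours: float, window_days: float = 30.0) -> float:
    """Share of the window's budget consumed by `hours` at this burn rate."""
    return rate * hours / (window_days * 24)


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.0144, slo=0.999)  # 1.44% of requests failing
    print(f"burn rate: {rate:.1f}")
    print(f"days to exhaustion: {days_to_exhaustion(rate):.2f}")
    print(f"budget spent in 1h: {budget_spent_fraction(rate, 1):.1%}")
```

A 1.44% error ratio against a 0.1% budget is a burn rate of 14.4, and one hour at that rate consumes 2% of the 30-day budget, which is why 14.4 is paired with the one-hour window.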
Common Pitfalls
The first pitfall is measuring uptime instead of user experience. A service can be “up” while every login fails. Define SLIs from the user’s perspective: did the request succeed quickly enough?
The second is setting unrealistic SLOs. If every request hard-depends on a service that delivers 99.9%, you cannot promise 99.99% without redundancy, caching, or graceful degradation to mask that dependency's failures. Start with what you currently deliver, then improve.
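A rough way to see the ceiling a dependency imposes is to multiply availabilities along the request path. This sketch assumes failures are independent and that every component is required for every request. Both are simplifications, and the function name is illustrative.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Best-case availability when every component must succeed."""
    return prod(availabilities)


if __name__ == "__main__":
    # A service that is itself 99.99% available, behind one 99.9% dependency.
    combined = serial_availability([0.9999, 0.999])
    print(f"combined availability: {combined:.4%}")
```

Even a nearly perfect service ends up below 99.9% once a 99.9% hard dependency sits on the critical path, so a 99.99% target would be missed from day one.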
The third is treating the SLO as a target to barely meet. The point is not to dance on the line; it is to know when you are safe to take risks.
Production Tips
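Multi-window burn rate alerts catch both spikes and slow leaks. The Google SRE Workbook recommends pairs like (1h, 5m) for fast burn and (6h, 30m) for medium burn. Each pair is evaluated together: the alert fires only when both the long and the short window exceed the threshold, so a brief blip that has already ended does not page someone.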
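The pairing logic is easy to get wrong, so here is a small sketch of it. It assumes a 99.9% SLO. `BurnWindow` and `should_fire` are hypothetical names, the 14.4 threshold matches the alert shown earlier, and the 6x threshold for the 6h window follows the Workbook's example table and should be checked against your own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    name: str
    long_window: str    # label only, e.g. "1h"
    short_window: str   # label only, e.g. "5m"
    threshold: float    # burn-rate multiple that triggers the alert


FAST = BurnWindow("fast", "1h", "5m", 14.4)
MEDIUM = BurnWindow("medium", "6h", "30m", 6.0)


def should_fire(long_ratio: float, short_ratio: float,
                window: BurnWindow, slo: float) -> bool:
    """Fire only if BOTH windows show an error ratio above the limit."""
    limit = window.threshold * (1.0 - slo)
    return long_ratio > limit and short_ratio > limit


if __name__ == "__main__":
    slo = 0.999
    # Ongoing incident: both windows are hot.
    print(should_fire(0.02, 0.02, FAST, slo))
    # Incident already over: the 1h window is still hot, the 5m window is clean.
    print(should_fire(0.02, 0.0, FAST, slo))
```

Requiring the short window as well means the alert clears within minutes of recovery instead of staying red until the long window drains.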
Make the error budget visible. A Grafana panel showing “budget remaining: 64% of 43 minutes” tells the team at a glance whether to ship the risky migration this week.
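The percentage on such a panel can be derived from request counts. The following is a minimal sketch assuming a request-based budget, where the counts would come from your metrics backend. The function name and the sample numbers are made up for illustration.

```python
def budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent; negative once overspent."""
    if total == 0:
        return 1.0  # no traffic yet, nothing spent
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # 1,000,000 requests so far in the window, 360 of them failed.
    remaining = budget_remaining(good=999_640, total=1_000_000, slo=0.999)
    print(f"budget remaining: {remaining:.0%}")
```

With a 99.9% SLO, one million requests allow 1,000 failures. 360 failures leave 64% of the budget.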
Set a policy. When the budget is healthy, the team can ship freely. When it drops below a threshold, freeze risky changes and focus on reliability work. Write this down before you need it.
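Writing the policy down can be as literal as encoding it. The sketch below assumes a single freeze threshold. The 25% value and all names are placeholders for whatever your team agrees on.

```python
from enum import Enum


class Action(Enum):
    SHIP_FREELY = "ship freely"
    FREEZE_RISKY_CHANGES = "freeze risky changes, prioritize reliability work"


def release_policy(remaining: float, freeze_below: float = 0.25) -> Action:
    """Map the unspent budget fraction to the agreed team action."""
    if remaining < freeze_below:
        return Action.FREEZE_RISKY_CHANGES
    return Action.SHIP_FREELY


if __name__ == "__main__":
    for remaining in (0.64, 0.10, -0.05):
        print(f"{remaining:+.0%} remaining -> {release_policy(remaining).value}")
```

Keeping the threshold as a named parameter makes the policy reviewable like any other change.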
Wrap-up
SLIs, SLOs, and error budgets transform reliability from a vague aspiration into a tracked number. Pick one user-facing SLI per critical service, set a target you currently meet, write the budget, and use it to make shipping decisions. The framework is simple; the discipline of honoring it is what separates teams that ship calmly from teams that firefight.