
DevOps SLO, SLI, and Error Budgets Explained

Service Level Indicators, Objectives, and error budgets demystified: how to pick the right metric, set a target, and use the budget as a decision tool.

·4 min read · By Codeloom
Intermediate 9 min read

What you'll learn

  • Difference between SLI, SLO, and SLA
  • How to pick good SLIs
  • Setting realistic SLO targets
  • Calculating an error budget
  • Using burn rate alerts

Prerequisites

  • Familiarity with shell and YAML

What and Why

“How reliable should our service be?” sounds philosophical until your CEO is paying a premium for five nines and your team is firefighting every weekend. SLIs, SLOs, and error budgets are the SRE answer to that question. They turn reliability into a number, and the number into a budget the team can spend.

The framework forces tradeoffs to be explicit. Shipping features fast burns the budget. Excessive caution wastes it. Done well, the budget aligns engineering and product on when to ship versus when to harden.

Mental Model

SLI  =  a measurement   (e.g. success ratio of HTTP requests)
SLO  =  a target         (e.g. 99.9% of requests succeed over 30 days)
SLA  =  a contract       (e.g. refund if below 99.5%)
Error budget = 1 - SLO  (e.g. 0.1% of requests may fail)
From signal to decision

The SLI is the raw measurement. The SLO is the target. The error budget is the gap between that target and 100%. If your SLO is 99.9% over 30 days, 0.1% of requests may fail, which is equivalent to about 43 minutes of total downtime per month (0.1% of 43,200 minutes). Spend it on risky releases. Stop spending it when it runs out.
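The 43-minute figure is plain arithmetic, and it is worth being able to reproduce it for other targets. A minimal sketch, assuming the budget is expressed as minutes of total outage; the function name `error_budget_minutes` is made up for illustration:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total outage that would consume the whole error budget."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60   # 30 days -> 43,200 minutes
    return (1.0 - slo) * window_minutes      # error budget = 1 - SLO


# Compare how quickly the budget shrinks as nines are added.
for target in (0.99, 0.999, 0.9999):
    print(f"SLO {target:.2%}: {error_budget_minutes(target):.1f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why the target deserves a deliberate choice rather than a default.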

Hands-on Example

Start by writing the SLI as a precise query. For an HTTP service:

# Good events / valid events over the window
sum(rate(http_requests_total{status!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))

Encode the SLO in a config file your team owns:

service: checkout-api
slo:
  objective: 0.999          # 99.9% success
  window: 30d
  description: "Successful HTTP responses excluding 5xx"
sli:
  good: 'sum(rate(http_requests_total{status!~"5.."}[5m]))'
  total: 'sum(rate(http_requests_total[5m]))'

Now derive a burn rate alert. Burn rate is how fast you are spending the budget relative to the rate that would use it up exactly at the end of the SLO window. A burn rate of 1 means the budget lasts exactly 30 days. A burn rate of 14 means you will exhaust 30 days of budget in about two days (30 / 14 ≈ 2.1).
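The definition can be sketched in a few lines. This assumes a request-based SLI, so the observed error rate is a ratio of failed to total requests; the helper names and the 1.4% sample value are hypothetical:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo)


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days a full budget lasts if the given burn rate is sustained."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never consumed
    return window_days / rate


observed = 0.014  # hypothetical: 1.4% of requests failing
rate = burn_rate(observed, slo=0.999)
print(f"burn rate {rate:.1f}, budget lasts {days_until_exhausted(rate):.1f} days")
```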

groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="checkout",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="checkout"}[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout burning error budget fast"

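That alert pages when the error rate over the last hour, if sustained, would burn the entire 30-day budget in about two days (30 / 14.4 ≈ 2.1). The threshold is the burn rate, 14.4, multiplied by the error budget, 0.001. Pair it with a slow-burn alert that catches steady leakage.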
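The constant 14.4 is not arbitrary. A sketch of where it can come from, assuming the convention of paging when 2% of a 30-day budget is spent in one hour and 5% in six hours; those percentages are assumptions here, and the function names are illustrative:

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: float = 30.0) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_rate_threshold(burn: float, slo: float) -> float:
    """Error ratio to compare against in the alert expression."""
    return burn * (1.0 - slo)


fast = burn_rate_threshold(0.02, alert_window_hours=1)   # assumed: 2% of budget in 1h
slow = burn_rate_threshold(0.05, alert_window_hours=6)   # assumed: 5% of budget in 6h
print(round(fast, 2), round(slow, 2))
print(round(error_rate_threshold(fast, slo=0.999), 6))
```

The same two functions give the threshold for a slow-burn rule: pick a longer window and a larger budget fraction, then multiply by the error budget.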

Common Pitfalls

The first pitfall is measuring uptime instead of user experience. A service can be “up” while every login fails. Define SLIs from the user’s perspective: did the request succeed quickly enough?

The second is setting unrealistic SLOs. If every request depends on a service that delivers 99.9%, you cannot reach 99.99% without retries, fallbacks, or redundancy around that dependency. Start with what you currently deliver, then improve.
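The dependency ceiling is easy to check numerically. A rough sketch, assuming failures are independent and every request needs every component to succeed; both availability numbers are hypothetical:

```python
import math


def serial_availability(*availabilities: float) -> float:
    """Availability of a request path that needs every component to succeed."""
    return math.prod(availabilities)


own_service = 0.9999   # hypothetical availability of your own code and infra
dependency = 0.999     # hypothetical hard dependency called on every request
combined = serial_availability(own_service, dependency)
print(f"combined availability: {combined:.4%}")
```

Under these assumptions the combined figure sits below the weakest component, so a 99.99% target is out of reach until the dependency is made optional or redundant.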

The third is treating the SLO as a target to barely meet. The point is not to dance on the line; it is to know when you are safe to take risks.

Production Tips

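Multi-window burn rate alerts catch both spikes and slow leaks. The Google SRE Workbook recommends pairing a long window with a short one, such as (1h, 5m) for fast burn and (6h, 30m) for medium burn. The alert fires only when both windows exceed the threshold: the long window keeps a transient blip from paging someone, and the short window lets the alert clear soon after the problem stops.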
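The pairing logic amounts to a single AND condition. A minimal sketch, assuming the error rates for both windows have already been queried from your metrics system; `should_page` and the sample values are illustrative:

```python
def should_page(long_window_error_rate: float, short_window_error_rate: float,
                slo: float, burn_threshold: float) -> bool:
    """Fire only if BOTH windows exceed the burn-rate threshold."""
    limit = burn_threshold * (1.0 - slo)
    return long_window_error_rate > limit and short_window_error_rate > limit


# Sustained incident: the 1h and 5m windows both show 2% errors.
print(should_page(0.02, 0.02, slo=0.999, burn_threshold=14.4))
# Recovered incident: the 1h window is still elevated, the last 5m are clean.
print(should_page(0.02, 0.0, slo=0.999, burn_threshold=14.4))
# Brief blip: the 5m window spikes, but the 1h average stays low.
print(should_page(0.001, 0.05, slo=0.999, burn_threshold=14.4))
```

Only the first case pages. In Prometheus the same condition would be written by joining two ratio expressions with `and`.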

Make the error budget visible. A Grafana panel showing “budget remaining: 64% of 43 minutes” tells the team at a glance whether to ship the risky migration this week.
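The number behind such a panel can be derived from two counters. A sketch under the assumption of a request-based SLI; the function name and the traffic figures are hypothetical:

```python
def budget_remaining(total_requests: int, failed_requests: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (1.0 - slo)
    if allowed_failures == 0:
        raise ValueError("no traffic in the window, so the budget is undefined")
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 10M requests, 3,600 of them failed, SLO of 99.9%.
remaining = budget_remaining(10_000_000, 3_600, slo=0.999)
print(f"budget remaining: {remaining:.0%}")
```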

Set a policy. When the budget is healthy, the team can ship freely. When it drops below a threshold, freeze risky changes and focus on reliability work. Write this down before you need it.
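Writing the policy down can be as literal as a small function the team agrees on. This is only a sketch: the 25% freeze threshold, the separate "halt" state for an exhausted budget, and the name `release_decision` are assumptions for illustration, not a recommended standard:

```python
def release_decision(budget_remaining: float, freeze_below: float = 0.25) -> str:
    """Map remaining error budget (as a fraction) to a release stance."""
    if budget_remaining <= 0.0:
        return "halt"      # budget exhausted: reliability work only
    if budget_remaining < freeze_below:
        return "freeze"    # low budget: hold risky changes
    return "ship"          # healthy budget: normal release cadence


for remaining in (0.64, 0.10, -0.05):
    print(f"{remaining:+.0%} remaining -> {release_decision(remaining)}")
```

The value of encoding it is less the code than the agreement: the threshold is decided while nobody is under pressure to ship.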

Wrap-up

SLIs, SLOs, and error budgets transform reliability from a vague aspiration into a tracked number. Pick one user-facing SLI per critical service, set a target you currently meet, write down the budget policy, and use it to make shipping decisions. The framework is simple; the discipline of honoring it is what separates teams that ship calmly from teams that firefight.