Runbook Best Practices for DevOps

Beginner 8 min read

What you'll learn

✓ What separates a useful runbook from a stale wiki page
✓ The standard sections a runbook should have
✓ How to link runbooks to alerts
✓ Patterns for keeping runbooks fresh
✓ When to automate the runbook away entirely

Prerequisites

• You have been on an on-call rotation, or are about to start one

What and Why

A runbook is a short, action-oriented document that tells a tired engineer what to do when a specific alert fires or a specific system misbehaves. It is not a design document. It is not a user guide. It is the page someone opens at 3 a.m. with eight minutes of patience.

You write runbooks because on-call rotations include engineers who did not build the system. Tribal knowledge does not survive team changes. A good runbook turns “page the original author” into “follow these five steps”, which is the difference between a five-minute incident and a fifty-minute one.

Mental Model

Imagine the reader. They are half-asleep, pulled out of bed by a pager, looking at one alert and one runbook URL. They have not touched this service in months. Every sentence you write either helps them act or wastes their time.

That filter changes what you include. Background and history go to a separate design doc. Theory goes to the README. The runbook contains the alert, what it means, how to confirm it is real, the safe mitigations, and the escalation path. Nothing more.

A good runbook is also runnable. Wherever a step is a command, that command appears verbatim and copy-pasteable, with environment-specific values such as the namespace and deployment name already filled in from a template.
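One way to get commands with the values already filled in is to render them from a template when the runbook is published. The following is a minimal sketch using Python's standard `string.Template`. The `ENVIRONMENTS` table, the staging namespace, and the `render_step` helper are hypothetical names chosen for illustration; only the rollback command itself comes from the sample runbook below.

```python
from string import Template

# Hypothetical per-environment values; in practice these would come from
# the service's own deploy configuration.
ENVIRONMENTS = {
    "prod": {"namespace": "payments", "deployment": "checkout"},
    "staging": {"namespace": "payments-staging", "deployment": "checkout"},
}

ROLLBACK_STEP = Template("kubectl rollout undo deployment/$deployment -n $namespace")


def render_step(step: Template, environment: str) -> str:
    """Return a copy-pasteable command with every placeholder filled in.

    substitute() raises KeyError when a value is missing, so a
    half-rendered command never reaches the published runbook.
    """
    return step.substitute(ENVIRONMENTS[environment])


print(render_step(ROLLBACK_STEP, "prod"))
# kubectl rollout undo deployment/checkout -n payments
```

Using `substitute` rather than `safe_substitute` is deliberate: a command that still contains `$namespace` is worse than a failed build.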

Hands-on Example

A minimal runbook for a “high checkout error rate” alert:

# Runbook: Checkout 5xx Rate High

**Severity:** Sev2
**Owner team:** #payments
**Related dashboard:** https://grafana/d/checkout
**Source code:** github.com/org/checkout

## What this alert means
More than 2% of /checkout requests have returned a 5xx for 5 minutes.

## Confirm the alert
1. Open the dashboard above.
2. Look at the "5xx by version" panel.
   - If one version is the source, it is likely a bad deploy.
   - If every version shows the spike, suspect a dependency.

## Mitigations (try in order)
1. If a recent deploy is implicated:
   `kubectl rollout undo deployment/checkout -n payments`
2. If a dependency is degraded:
   - Check status.payments-provider.com
   - Toggle the circuit breaker:
     `kubectl set env deploy/checkout PROVIDER_BREAKER=open -n payments`
3. If neither helps, escalate.

## Escalation
- Primary: payments on-call (PagerDuty)
- Secondary: platform on-call
- If the Sev2 is still active after 30 minutes: page the incident commander

Alert fires
 |
 v
Page on-call (includes runbook URL)
 |
 v
Open runbook -> Confirm alert is real
 |
 +-- not real -> silence + file ticket to tune alert
 |
 v
Apply mitigation step 1
 |
 +-- resolved -> close incident, write postmortem
 |
 v
Step 2, step 3 ...
 |
 v
Escalate per runbook escalation section

From alert to action with a runbook

Common Pitfalls

Stale runbooks are the most common failure. The runbook says “kubectl get pods in the prod-old namespace” when that namespace was renamed a year ago. The fix is to treat runbooks as code: stored in the same repo as the service, reviewed in pull requests, and updated whenever the underlying behavior changes.
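Treating runbooks as code also means they can be checked in CI like code. The sketch below is one possible pull-request check, assuming runbooks are Markdown files that use the same `## ` section headings as the sample runbook above. The `REQUIRED_SECTIONS` list and the `missing_sections` function are illustrative, not a standard.

```python
import re

# Section headings taken from the sample runbook in this article.
REQUIRED_SECTIONS = [
    "What this alert means",
    "Confirm the alert",
    "Mitigations",
    "Escalation",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that have no matching '## ' heading."""
    headings = re.findall(r"^##[ \t]+(.+)$", runbook_text, flags=re.MULTILINE)
    return [
        section
        for section in REQUIRED_SECTIONS
        # startswith() lets "Mitigations (try in order)" satisfy "Mitigations".
        if not any(heading.strip().startswith(section) for heading in headings)
    ]


draft = """# Runbook: Checkout 5xx Rate High

## What this alert means
More than 2% of /checkout requests have returned a 5xx for 5 minutes.

## Mitigations (try in order)
1. Roll back the deploy.
"""

print(missing_sections(draft))
# ['Confirm the alert', 'Escalation']
```

A CI job could run this over every changed runbook and fail the build when the returned list is not empty.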

Theory creep is the second pitfall. A runbook that starts with three paragraphs of architecture context is one no one finishes reading at 3 a.m. Move context to a linked design doc.

Missing the link from alert to runbook is the third. If the pager message says “high error rate” but contains no runbook URL, the on-call has to guess where to look. Every alert should embed its runbook URL in the notification.

The fourth is the runbook that should have been automation. If the steps are always the same and never need judgment, write a script and remove the human.
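As an illustration, the confirm-then-mitigate logic of the sample checkout runbook can be sketched as a small script. This assumes some other component supplies the 5xx rate per deployed version as a plain dictionary. The function names, the version labels, and the rule "escalate when only one version is deployed" are assumptions made for this example; the two kubectl commands are the ones from the sample runbook.

```python
import shlex
import subprocess
from typing import Optional

# Commands from the sample runbook; "-n payments" mirrors the rollback step.
ROLLBACK = ["kubectl", "rollout", "undo", "deployment/checkout", "-n", "payments"]
OPEN_BREAKER = ["kubectl", "set", "env", "deploy/checkout",
                "PROVIDER_BREAKER=open", "-n", "payments"]


def choose_mitigation(rate_by_version: dict[str, float],
                      threshold: float = 0.02) -> Optional[list[str]]:
    """Pick a mitigation command, or None when a human should decide."""
    failing = [v for v, rate in rate_by_version.items() if rate > threshold]
    if len(rate_by_version) < 2:
        return None  # cannot tell a bad deploy from a bad dependency
    if len(failing) == 1:
        return ROLLBACK  # one version is the source: likely a bad deploy
    if len(failing) == len(rate_by_version):
        return OPEN_BREAKER  # every version affected: suspect a dependency
    return None  # healthy, or a mixed picture that needs judgment


def mitigate(rate_by_version: dict[str, float], dry_run: bool = True) -> str:
    """Run the chosen command (or only report it) and return what was decided."""
    command = choose_mitigation(rate_by_version)
    if command is None:
        return "escalate"
    if not dry_run:
        subprocess.run(command, check=True)
    return shlex.join(command)


print(mitigate({"v41": 0.001, "v42": 0.08}))
# kubectl rollout undo deployment/checkout -n payments
```

The dry-run default keeps the script safe to try during a drill. Cases that return `None` are exactly the ones that still need the human and the runbook.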

Practical Tips

Keep each runbook to a single page. If it grows beyond that, split it by alert or by symptom.

Link every alert rule to its runbook URL. Most alerting systems support an annotation for this.
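A small check can enforce this rule so that no alert ships without a runbook link. The sketch below assumes alert rules have already been loaded into dictionaries with an `alert` name and an `annotations` mapping, and that the link lives under a `runbook_url` key, which is a common Prometheus convention but an assumption here. The alert names and the `runbooks.example.com` URL are made up for the example.

```python
from urllib.parse import urlparse


def rules_missing_runbook(rules: list[dict]) -> list[str]:
    """Return the names of alert rules with no usable runbook URL."""
    missing = []
    for rule in rules:
        url = rule.get("annotations", {}).get("runbook_url", "")
        parsed = urlparse(url)
        # Require an absolute http(s) URL so the pager message is clickable.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            missing.append(rule["alert"])
    return missing


rules = [
    {
        "alert": "CheckoutHigh5xxRate",
        "annotations": {"runbook_url": "https://runbooks.example.com/checkout-5xx"},
    },
    {
        "alert": "CheckoutLatencyHigh",
        "annotations": {"summary": "p99 latency is high"},
    },
]

print(rules_missing_runbook(rules))
# ['CheckoutLatencyHigh']
```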

Add an “owner team” and a last-reviewed date at the top. Stale dates flag candidates for review.
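The stale-date check is easy to automate. The sketch below assumes each runbook carries a header line of the form `**Last reviewed:** YYYY-MM-DD`, in the same style as the `**Severity:**` line in the sample runbook. That exact field name and the 180-day limit are assumptions; pick whatever your team agrees on.

```python
import re
from datetime import date, timedelta

LAST_REVIEWED = re.compile(
    r"^\*\*Last reviewed:\*\*[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", re.MULTILINE
)


def is_stale(runbook_text: str, today: date, max_age_days: int = 180) -> bool:
    """Return True when the review date is missing or older than the limit."""
    match = LAST_REVIEWED.search(runbook_text)
    if match is None:
        return True  # no date at all counts as a candidate for review
    reviewed = date.fromisoformat(match.group(1))
    return today - reviewed > timedelta(days=max_age_days)


header = "**Severity:** Sev2\n**Owner team:** #payments\n**Last reviewed:** 2024-01-15\n"

print(is_stale(header, today=date(2024, 3, 1)))   # False
print(is_stale(header, today=date(2025, 1, 15)))  # True
```

Passing `today` in explicitly keeps the function easy to test and lets a scheduled job report which runbooks will go stale soon.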

Run runbook drills. Pick a quiet afternoon, fire a synthetic alert, and time how long it takes a non-author to resolve it using only the runbook.

Delete runbooks for systems you have retired. Outdated pages are worse than missing ones.

Wrap-up

Runbooks are the operational memory of your services. They turn rare, high-pressure failures into routine, repeatable responses. Keep them short, runnable, owned, and linked from the alerts that trigger them. When the same runbook step runs three times in a row without judgment, replace it with automation. A living set of small, accurate runbooks beats a sprawling wiki every time.