DevOps Incident Response Playbook
A practical playbook for running production incidents: roles, comms, mitigation order, and the postmortem that turns pain into improvement.
What you'll learn
- Severity levels and triggers
- Roles during an incident
- Comms cadence with stakeholders
- Mitigation before diagnosis
- Blameless postmortems
Prerequisites
- Familiar with shell and YAML
What and Why
Incidents are inevitable. The question is not whether your system will fail, but how cheaply you can detect, communicate, mitigate, and learn when it does. An incident response playbook turns the chaotic minutes after an alert into a repeatable process so that engineers can focus on the problem instead of figuring out what to do next.
A well-run incident response saves money, protects user trust, and keeps the team sane. A poorly run one burns engineers out, floods support channels, and produces no learning.
Mental Model
Detect --> Declare --> Triage --> Mitigate --> Resolve --> Postmortem
  |           |          |           |            |            |
monitor    sev level   incident IC restore     all-clear   learnings
alerts     bridge open             service     announce    action items
The flow is linear in theory and messy in practice. The playbook keeps everyone aligned.
Two ideas matter most. First, mitigate before diagnose: if a rollback fixes it, ship the rollback now and investigate later. Second, separate roles: an Incident Commander coordinates while engineers debug.
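The stages above can also be recorded as data, which makes numbers such as time-to-mitigate easy to pull out for the postmortem. The following is a minimal Python sketch, not part of any tool mentioned here: the `Phase` and `IncidentTimeline` names are hypothetical. Because real incidents revisit stages, it keeps only the first time each stage was entered.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Dict


class Phase(Enum):
    """The six stages from the mental model, in their nominal order."""
    DETECT = 1
    DECLARE = 2
    TRIAGE = 3
    MITIGATE = 4
    RESOLVE = 5
    POSTMORTEM = 6


class IncidentTimeline:
    """Hypothetical helper: remembers when each phase was first entered."""

    def __init__(self) -> None:
        self.entered: Dict[Phase, datetime] = {}

    def enter(self, phase: Phase, at: datetime) -> None:
        # Phases may be revisited in a messy incident; keep the first timestamp.
        self.entered.setdefault(phase, at)

    def elapsed(self, start: Phase, end: Phase) -> timedelta:
        """Time between first entering `start` and first entering `end`."""
        return self.entered[end] - self.entered[start]


timeline = IncidentTimeline()
timeline.enter(Phase.DETECT, datetime(2024, 1, 1, 12, 0))
timeline.enter(Phase.MITIGATE, datetime(2024, 1, 1, 12, 20))
timeline.enter(Phase.MITIGATE, datetime(2024, 1, 1, 12, 45))  # revisit is ignored
print(timeline.elapsed(Phase.DETECT, Phase.MITIGATE))  # 0:20:00
```

The Scribe's running doc is a natural source for these timestamps.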
Hands-on Example
Pick severity levels and write triggers. A common scheme:
severities:
  SEV1:
    description: "Total outage or data loss for many users"
    response_time: "5 minutes, page on-call + manager"
    comms: "Status page red, hourly updates"
  SEV2:
    description: "Major feature broken or significant degradation"
    response_time: "15 minutes, page on-call"
    comms: "Status page yellow, every 2 hours"
  SEV3:
    description: "Minor issue, no user-visible impact"
    response_time: "Next business day"
    comms: "Internal channel only"
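One way to make the scheme above actionable is to turn it into deadlines. The Python sketch below copies the response times and status-page cadences from the YAML by hand (it does not parse the file). The `POLICY`, `ack_deadline`, and `next_update_due` names are illustrative assumptions. SEV3 has no paging window or status-page cadence in the scheme, so both functions return `None` for it.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Hand-copied from the severity scheme; None means "no timed obligation".
POLICY: Dict[str, Dict[str, Optional[timedelta]]] = {
    "SEV1": {"respond_within": timedelta(minutes=5), "update_every": timedelta(hours=1)},
    "SEV2": {"respond_within": timedelta(minutes=15), "update_every": timedelta(hours=2)},
    "SEV3": {"respond_within": None, "update_every": None},
}


def ack_deadline(severity: str, paged_at: datetime) -> Optional[datetime]:
    """Latest time the page should be acknowledged, or None if not paged."""
    window = POLICY[severity]["respond_within"]
    return paged_at + window if window is not None else None


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """When the next status-page update is due, or None if none is required."""
    cadence = POLICY[severity]["update_every"]
    return last_update + cadence if cadence is not None else None


paged = datetime(2024, 1, 1, 2, 0)
print(ack_deadline("SEV1", paged))     # 2024-01-01 02:05:00
print(next_update_due("SEV2", paged))  # 2024-01-01 04:00:00
```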
Define roles. For a SEV1:
- Incident Commander (IC): drives the call, tracks time, decides on mitigations. Does not type at a terminal.
- Comms Lead: posts to the status page and to Slack #incidents every 30 minutes.
- Operations Lead: the hands-on engineer running commands.
- Scribe: pastes commands, timestamps, and decisions into a running doc.
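The role split above is easy to check mechanically when an incident is declared. The Python sketch below assumes a roster is a plain mapping from role to person; the role keys, the `roster_problems` function, and the example names are all hypothetical. It flags empty roles and the one combination the list rules out, an IC who is also the person at the terminal.

```python
from typing import Dict, List

REQUIRED_ROLES = ("incident_commander", "comms_lead", "operations_lead", "scribe")


def roster_problems(roster: Dict[str, str]) -> List[str]:
    """Return human-readable problems with a SEV1 roster (empty list = OK)."""
    problems = [f"missing role: {role}" for role in REQUIRED_ROLES if not roster.get(role)]
    ic = roster.get("incident_commander")
    # The IC coordinates and does not type at a terminal.
    if ic and ic == roster.get("operations_lead"):
        problems.append("incident commander must not also be the operations lead")
    return problems


print(roster_problems({
    "incident_commander": "alex",
    "comms_lead": "sam",
    "operations_lead": "alex",
}))
# ['missing role: scribe', 'incident commander must not also be the operations lead']
```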
A first-15-minutes checklist:
1. Page acknowledged, IC declared in #incidents
2. Bridge opened (Zoom / Meet)
3. Severity assigned
4. Status page updated
5. Most recent deploy identified
6. Rollback considered before any code change
7. Customer support told what to say
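The checklist above can be tracked by the Scribe or a chat bot so the IC always sees what is still open. This is a small Python sketch under that assumption; `remaining_steps` and `behind_schedule` are hypothetical helpers, and steps are referred to by their 1-based number in the list.

```python
from datetime import datetime, timedelta
from typing import List, Set

CHECKLIST = [
    "Page acknowledged, IC declared in #incidents",
    "Bridge opened (Zoom / Meet)",
    "Severity assigned",
    "Status page updated",
    "Most recent deploy identified",
    "Rollback considered before any code change",
    "Customer support told what to say",
]


def remaining_steps(done: Set[int]) -> List[str]:
    """Steps not yet completed, in checklist order."""
    return [step for number, step in enumerate(CHECKLIST, start=1) if number not in done]


def behind_schedule(declared_at: datetime, now: datetime, done: Set[int]) -> bool:
    """True if the first 15 minutes have passed and steps are still open."""
    return now - declared_at > timedelta(minutes=15) and bool(remaining_steps(done))


declared = datetime(2024, 1, 1, 2, 0)
done = {1, 2, 3, 4, 5}
print(remaining_steps(done))
print(behind_schedule(declared, datetime(2024, 1, 1, 2, 20), done))  # True
```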
A quick rollback snippet you keep in your runbook:
# Roll back the last deployment of the api service
kubectl rollout undo deployment/api -n production
kubectl rollout status deployment/api -n production
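If the rollback is triggered from tooling rather than typed by hand, a dry-run default helps the Operations Lead see exactly what will run before it does. The Python sketch below wraps the same two `kubectl` commands; the `rollback_commands` and `rollback` function names and the `execute` flag are assumptions for illustration.

```python
import subprocess
from typing import List


def rollback_commands(deployment: str, namespace: str) -> List[List[str]]:
    """Build the undo and status commands for one deployment."""
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "rollout", "undo", target, "-n", namespace],
        ["kubectl", "rollout", "status", target, "-n", namespace],
    ]


def rollback(deployment: str, namespace: str, execute: bool = False) -> None:
    """Print each command; run it only when execute=True."""
    for command in rollback_commands(deployment, namespace):
        print(" ".join(command))
        if execute:
            # check=True stops the sequence if kubectl exits non-zero.
            subprocess.run(command, check=True)


rollback("api", "production")  # dry run: prints the two commands only
```

Printing every command also gives the Scribe something to paste into the running doc.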
Common Pitfalls
The most common pitfall is having everyone debug at once with no coordinator. Three engineers all run kubectl delete pod and now nobody knows what state the cluster is in. Always assign an IC, even for small incidents.
Another classic is silently fixing without comms. Customers refresh, see errors, and tweet. Update the status page within five minutes even if you have nothing to share except, “We are investigating.”
Skipping the postmortem because “we already know what happened” is the most expensive shortcut. The point of a postmortem is not just explanation; it is action items with owners and dates.
Production Tips
Run game days. Pick a quiet afternoon and simulate a real incident: black-hole traffic to a service, fail a database, inject latency. Practice the playbook end-to-end. The first time the team uses the runbook should not be at 2 a.m.
Keep runbooks next to the alerts. Each alert in Alertmanager should link to a one-page runbook with the diagnostic queries and likely mitigations. If an alert has no runbook, it is not ready to page anyone: treat it as an alert that still needs tuning rather than one that should wake someone up.
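A check like this can run in CI so that no paging alert ships without a runbook. The Python sketch below assumes the alert rules have already been parsed into dictionaries shaped like Prometheus alerting rules, and that the runbook link lives in a `runbook_url` annotation. That annotation name is a common convention, not something Alertmanager requires, and the alert names and URL are made up.

```python
from typing import Any, Dict, List


def alerts_missing_runbook(rules: List[Dict[str, Any]]) -> List[str]:
    """Names of alert rules without a non-empty runbook_url annotation."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        if not str(annotations.get("runbook_url", "")).strip():
            missing.append(rule["alert"])
    return missing


rules = [
    {"alert": "HighErrorRate",
     "annotations": {"runbook_url": "https://runbooks.example.com/high-error-rate"}},
    {"alert": "DiskAlmostFull", "annotations": {"summary": "Disk is over 90% full"}},
    {"alert": "QueueBacklog"},
]
print(alerts_missing_runbook(rules))  # ['DiskAlmostFull', 'QueueBacklog']
```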
Make postmortems blameless. The format that works:
Summary: 1 paragraph
Impact: who saw what, for how long
Timeline: bullet list with timestamps
Root causes: contributing factors, not "human error"
What went well / What went poorly
Action items: owner + due date + ticket link
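The last line of the template is the one most often left half-filled. Here is a minimal Python sketch of a check that every action item has an owner, a due date, and a ticket link. The `ActionItem` class, its field names, and the example data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: str = ""
    due: Optional[date] = None
    ticket: str = ""


def missing_fields(item: ActionItem) -> List[str]:
    """Which of owner / due date / ticket link are still empty."""
    missing = []
    if not item.owner.strip():
        missing.append("owner")
    if item.due is None:
        missing.append("due date")
    if not item.ticket.strip():
        missing.append("ticket link")
    return missing


items = [
    ActionItem("Add alert on queue depth", owner="sam", due=date(2024, 2, 1),
               ticket="https://tracker.example.com/OPS-1"),
    ActionItem("Document rollback for api", owner="alex"),
]
for item in items:
    print(item.title, "->", missing_fields(item) or "complete")
```

A postmortem can be considered closed only when every item reports as complete.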
Wrap-up
Incident response is a muscle. The playbook is the training plan. Define severities, name roles, write runbooks, communicate often, mitigate first, and run blameless postmortems. The teams that recover fastest are not the ones with the best engineers; they are the ones with the best habits.