DevOps Incident Response Playbook

A practical playbook for running production incidents: roles, comms, mitigation order, and the postmortem that turns pain into improvement.

·4 min read · By Codeloom
Intermediate 9 min read

What you'll learn

  • Severity levels and triggers
  • Roles during an incident
  • Comms cadence with stakeholders
  • Mitigation before diagnosis
  • Blameless postmortems

Prerequisites

  • Familiar with shell and YAML

What and Why

Incidents are inevitable. The question is not whether your system will fail, but how cheaply you can detect, communicate, mitigate, and learn when it does. An incident response playbook turns the chaotic minutes after an alert into a repeatable process so that engineers can focus on the problem instead of figuring out what to do next.

A well-run incident response saves money, protects user trust, and keeps the team sane. A poorly-run one burns engineers out, blocks support channels, and produces no learning.

Mental Model

Detect  -->  Declare  -->  Triage  -->  Mitigate  -->  Resolve  -->  Postmortem
 |           |             |           |             |             |
monitor   sev level     incident IC   restore     all-clear     learnings
alerts                  bridge open   service     announce      action items
Incident lifecycle

The flow is linear in theory and messy in practice; the playbook is what keeps everyone aligned when stages overlap. Two ideas matter most. First, mitigate before you diagnose: if a rollback fixes it, ship the rollback now and investigate later. Second, separate roles: an Incident Commander coordinates while engineers debug.

Hands-on Example

Pick severity levels and write triggers. A common scheme:

severities:
  SEV1:
    description: "Total outage or data loss for many users"
    response_time: "5 minutes, page on-call + manager"
    comms: "Status page red, hourly updates"
  SEV2:
    description: "Major feature broken or significant degradation"
    response_time: "15 minutes, page on-call"
    comms: "Status page yellow, every 2 hours"
  SEV3:
    description: "Minor issue, no user-visible impact"
    response_time: "Next business day"
    comms: "Internal channel only"
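
The scheme above describes each level but leaves the triggers to you. As a minimal sketch of what a trigger could look like, the function below maps two signals to a severity. The thresholds (50% and 5% error rate) and the function name are assumptions for illustration, not values from any standard; tune them to your own SLOs.

```shell
# Hypothetical trigger thresholds; adjust to your own SLOs.
# Usage: classify_severity <error_rate_percent (integer)> [data_loss: yes|no]
classify_severity() {
  error_rate=$1
  data_loss=${2:-no}
  if [ "$data_loss" = "yes" ] || [ "$error_rate" -ge 50 ]; then
    echo "SEV1"   # total outage or data loss
  elif [ "$error_rate" -ge 5 ]; then
    echo "SEV2"   # significant degradation
  else
    echo "SEV3"   # minor issue
  fi
}
```

Calling `classify_severity 12 no` prints `SEV2`. Writing the rule down as code, even roughly, removes the debate about severity from the first minutes of an incident.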

Define roles. For a SEV1:

  • Incident Commander (IC): drives the call, tracks time, decides on mitigations. Does not type at a terminal.
  • Comms Lead: posts to Slack #incidents every 30 minutes and updates the status page at least hourly.
  • Operations Lead: the hands-on engineer running commands.
  • Scribe: pastes commands, timestamps, and decisions into a running doc.
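
The Scribe's job gets easier with a tiny helper that stamps every entry. This is a sketch only: the `INCIDENT_LOG` path and the `log_event` name are hypothetical, and a shared doc works just as well as a file.

```shell
# Hypothetical location for the running timeline; override as needed.
INCIDENT_LOG=${INCIDENT_LOG:-./incident-timeline.md}

# Usage: log_event <who> <what happened...>
# Appends one bullet with a UTC timestamp to the timeline file.
log_event() {
  who=$1
  shift
  printf '%s %s [%s] %s\n' "-" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$who" "$*" >> "$INCIDENT_LOG"
}
```

For example, `log_event IC "Declared SEV1, bridge open"` appends a timestamped line that can later be pasted straight into the postmortem timeline.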

A first-15-minutes checklist:

1. Page acknowledged, IC declared in #incidents
2. Bridge opened (Zoom / Meet)
3. Severity assigned
4. Status page updated
5. Most recent deploy identified
6. Rollback considered before any code change
7. Customer support told what to say
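
The first three items on the checklist can be folded into one declaration message so nobody has to compose it under pressure. The sketch below only formats text for the IC to paste into #incidents; the function name and the message layout are assumptions, not a Slack or status page API.

```shell
# Usage: declare_incident <severity> <ic> <bridge_url> <summary...>
# Prints a declaration message; posting it is left to the IC.
declare_incident() {
  severity=$1
  ic=$2
  bridge=$3
  shift 3
  printf 'INCIDENT DECLARED: %s\n' "$severity"
  printf 'IC: %s\n' "$ic"
  printf 'Bridge: %s\n' "$bridge"
  printf 'Summary: %s\n' "$*"
  printf 'Declared at: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
```

A call such as `declare_incident SEV1 alice https://meet.example.com/inc-1 API returning 500s` (the URL is a placeholder) produces a five-line message that covers the page acknowledgement, the bridge, and the severity in one post.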

A quick rollback snippet you keep in your runbook:

# Roll back the last deployment of the api service
kubectl rollout undo deployment/api -n production
kubectl rollout status deployment/api -n production
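
If you prefer a reusable version, the same commands can be wrapped in a function that also prints the revision history first and bounds how long it waits. This is a sketch under assumptions: the function name and the 120s default are illustrative, while `rollout history`, `rollout undo`, and `rollout status --timeout` are standard kubectl subcommands and flags.

```shell
# Usage: rollback_deployment <deployment> <namespace> [timeout, e.g. 120s]
rollback_deployment() {
  deployment=$1
  namespace=$2
  timeout=${3:-120s}   # assumed default; pick what fits your rollout speed
  # Show revisions first so the Scribe can record what is being rolled back.
  kubectl rollout history "deployment/$deployment" -n "$namespace" || return 1
  kubectl rollout undo "deployment/$deployment" -n "$namespace" || return 1
  # Returns non-zero if the rollback has not converged within the timeout.
  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"
}
```

Usage would be `rollback_deployment api production`. The explicit timeout matters during an incident: a rollback that hangs should fail loudly so the IC can choose the next mitigation.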

Common Pitfalls

The most common pitfall is having everyone debug at once with no coordinator. Three engineers all run kubectl delete pod and now nobody knows what state the cluster is in. Always assign an IC, even for small incidents.

Another classic is silently fixing without comms. Customers refresh, see errors, and tweet. Update the status page within five minutes even if you have nothing to share except, “We are investigating.”
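
One way to keep the Comms Lead honest is a small check for an overdue update. The sketch below is a plain time comparison; the function name is hypothetical and where the last-update timestamp comes from (a file, a bot, the status page) is left open.

```shell
# Usage: update_overdue <last_update_epoch> <cadence_minutes> [now_epoch]
# Exit status 0 means an update is due; non-zero means the cadence is still met.
update_overdue() {
  last=$1
  cadence_min=$2
  now=${3:-$(date +%s)}
  elapsed=$(( now - last ))
  [ "$elapsed" -ge $(( cadence_min * 60 )) ]
}
```

For the five-minute rule above, something like `update_overdue "$declared_at" 5 && echo "Post a status update now"` would do, where `declared_at` is an epoch timestamp you recorded when the incident was declared.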

Skipping the postmortem because “we already know what happened” is the most expensive shortcut. The point of a postmortem is not just explanation; it is action items with owners and dates.

Production Tips

Run game days. Pick a quiet afternoon and simulate a real incident: black-hole traffic to a service, fail a database, inject latency. Practice the playbook end-to-end. The first time the team uses the runbook should not be at 2 a.m.
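
As one possible game-day drill, the sketch below "fails" a service by scaling its Deployment to zero for a fixed time and then restoring the original replica count. The function name is hypothetical, it assumes no autoscaler is managing the Deployment, and it is meant for a staging namespace until the team is comfortable.

```shell
# Usage: gameday_outage <deployment> <namespace> <seconds>
gameday_outage() {
  deployment=$1
  namespace=$2
  duration=$3
  # Remember the current replica count so it can be restored afterwards.
  replicas=$(kubectl get "deployment/$deployment" -n "$namespace" \
    -o jsonpath='{.spec.replicas}') || return 1
  kubectl scale "deployment/$deployment" -n "$namespace" --replicas=0 || return 1
  sleep "$duration"
  kubectl scale "deployment/$deployment" -n "$namespace" --replicas="$replicas"
}
```

For example, `gameday_outage api staging 300` gives the team five minutes to detect, declare, and mitigate. If the script is interrupted during the sleep, the Deployment stays at zero replicas, so note the original count in the incident doc before starting.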

Keep runbooks next to the alerts. Each alert in Alertmanager should link to a one-page runbook with the diagnostic queries and likely mitigations. If an alert has no runbook, nobody knows what to do when it fires, so treat it as an alert to tune or demote rather than one that should page a human.
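
A quick way to find alerts without a runbook is to scan the rule files. The sketch below assumes Prometheus-style rule files where each rule starts with `- alert: <Name>` and the runbook link lives in a `runbook_url` annotation; that annotation name is a common convention, not something Prometheus enforces, and the line-based parsing is deliberately naive.

```shell
# Usage: alerts_missing_runbook <rules-file>
# Prints the name of every alert that has no runbook_url line before the next alert.
alerts_missing_runbook() {
  awk '
    /^[ \t]*-[ \t]*alert:/ {
      if (name != "" && !has_runbook) print name
      name = $NF
      has_runbook = 0
    }
    /runbook_url:/ { has_runbook = 1 }
    END { if (name != "" && !has_runbook) print name }
  ' "$1"
}
```

Running this in CI against your rules directory, and failing the build when it prints anything, keeps the "every alert has a runbook" rule from eroding over time.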

Make postmortems blameless. The format that works:

Summary: 1 paragraph
Impact: who saw what, for how long
Timeline: bullet list with timestamps
Root causes: contributing factors, not "human error"
What went well / What went poorly
Action items: owner + due date + ticket link
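
To make the format the default, you can generate the skeleton instead of copying it by hand. This is a minimal sketch; the function name is hypothetical and the headings simply mirror the format above.

```shell
# Usage: new_postmortem <title...>
# Prints an empty postmortem skeleton to stdout.
new_postmortem() {
  title=$*
  cat <<EOF
Postmortem: $title
Date: $(date -u +%Y-%m-%d)

Summary:
Impact:
Timeline:
Root causes:
What went well:
What went poorly:
Action items (owner + due date + ticket link):
EOF
}
```

Something like `new_postmortem "API outage" > postmortem.md` creates the file right after the all-clear, while the timeline is still fresh.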

Wrap-up

Incident response is a muscle. The playbook is the training plan. Define severities, name roles, write runbooks, communicate often, mitigate first, and run blameless postmortems. The teams that recover fastest are not the ones with the best engineers; they are the ones with the best habits.