The On-Call Survival Guide for Software Engineers
Practical advice for surviving and learning from on-call rotations: preparation, triage, communication, post-incident habits, and protecting your sleep.
What you'll learn
- How to prepare before your shift
- A simple triage flow for pages
- Communicating during an incident
- Postmortem habits that pay off
- Protecting sleep and sanity
Prerequisites
- Familiar with production systems and basic monitoring
What and Why
On-call means being the human who responds when a system breaks at 3 AM. Done badly, it burns engineers out and pushes them to quit. Done well, it is one of the fastest ways to learn how production really behaves and to grow technical judgement.
The difference is mostly preparation, process, and culture — not raw skill. This guide collects the habits that make on-call survivable, and even valuable.
Mental Model
An incident has three phases: detect, mitigate, resolve. Your job during a page is almost always mitigation — stop the bleeding — not root-cause analysis. Fixing the underlying bug can wait until business hours. Restoring service cannot.
Hold that distinction tightly. Engineers who try to find the real bug at 3 AM tend to spend hours and make things worse. Mitigators roll back, fail over, or shed load, then sleep and dig in fresh.
Hands-on Example
A practical triage flow when a page hits:
Page received
      |
      v
Acknowledge in 5 min
      |
      v
Check dashboards & logs
      |
      v
Is it real? -- no --> silence, file followup
      |
     yes
      |
      v
Can I mitigate fast?
   |              |
  yes             no
   |              |
rollback /      escalate
failover /      to next person
feature flag      |
   |              v
   v            join war room
verify          mitigate together
   |
   v
file incident ticket -> postmortem

The whole flow assumes you can find runbooks. If you cannot, your first incident response improvement is writing them.
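The two decisions in the triage flow above can be sketched as a small function. This is only an illustration: the names `Action` and `triage` are hypothetical, and the sketch assumes the page has already been acknowledged and the dashboards checked.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the three outcomes in the flow above.
    SILENCE_AND_FOLLOW_UP = "silence the alert and file a follow-up"
    MITIGATE_AND_VERIFY = "roll back, fail over, or flip a feature flag, then verify"
    ESCALATE = "escalate to the next person and join the war room"


def triage(is_real: bool, can_mitigate_fast: bool) -> Action:
    """Return the next step for a page that is already acknowledged and checked."""
    if not is_real:
        return Action.SILENCE_AND_FOLLOW_UP
    if can_mitigate_fast:
        return Action.MITIGATE_AND_VERIFY
    return Action.ESCALATE


# A real incident with no quick mitigation available goes to escalation.
print(triage(is_real=True, can_mitigate_fast=False).value)
```

Note that the function has no "find the root cause" branch, which matches the flow: every path ends in silencing, mitigating, or escalating.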
A good page acknowledgement message in chat: “Got it, looking at dashboards.” A good update five minutes later: “Confirmed elevated 5xx on checkout-api since 02:47 UTC. Rolling back deploy abc123, ETA 3 min.” Frequent, factual, short.
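One way to keep updates frequent, factual, and short is to fill in the same template each time. The helper below is a minimal sketch, not a standard tool: `format_update` and its parameters are hypothetical, and the date is a placeholder chosen only to reproduce the 02:47 UTC timestamp from the example message.

```python
from datetime import datetime, timezone


def format_update(symptom: str, service: str, since: datetime,
                  action: str, eta_min: int) -> str:
    """Build a short status update: what is wrong, since when, what is being done.

    Expects `since` to be timezone-aware so the UTC conversion is unambiguous.
    """
    since_utc = since.astimezone(timezone.utc).strftime("%H:%M UTC")
    return (
        f"Confirmed {symptom} on {service} since {since_utc}. "
        f"{action}, ETA {eta_min} min."
    )


# Placeholder date; only the time of day comes from the example above.
started = datetime(2024, 1, 1, 2, 47, tzinfo=timezone.utc)
print(format_update("elevated 5xx", "checkout-api", started,
                    "Rolling back deploy abc123", 3))
```

A fixed shape like this means nobody has to compose prose under stress, and readers always find the symptom, the start time, and the ETA in the same place.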
Common Pitfalls
- Heroics over communication: silently debugging for 40 minutes scares your team and blocks help.
- Skipping the rollback: “I will just patch forward” at 3 AM is famously how 30-minute incidents become 4-hour ones.
- Ignoring noisy alerts: every false page erodes trust in the system. Silence or tune them deliberately, not by reflex.
- No handoff at shift end: incidents in flight need a clear written handoff to the next on-call.
- Treating postmortems as blame: blameless retros are the only kind that produce real fixes.
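For the noisy-alerts pitfall, tuning deliberately usually starts with knowing which alerts produce the most false pages. The sketch below assumes you can export your page history as (alert name, was it real) pairs; the alert names and the history are invented for illustration.

```python
from collections import Counter


def false_page_rates(pages: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each alert name to the fraction of its pages that were not real."""
    total = Counter(name for name, _ in pages)
    false = Counter(name for name, was_real in pages if not was_real)
    return {name: false[name] / total[name] for name in total}


# Hypothetical page history: (alert name, whether the page was a real problem).
history = [
    ("disk-usage-high", False),
    ("disk-usage-high", False),
    ("disk-usage-high", True),
    ("checkout-5xx", True),
]

rates = false_page_rates(history)
noisiest = max(rates, key=rates.get)  # alert with the highest false-page rate
print(noisiest, round(rates[noisiest], 2))
```

Ranking alerts this way turns "this alert feels noisy" into a number you can bring to the team when proposing a threshold change.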
Practical Tips
- Before your shift starts, read the runbooks for the top 5 services you might be paged for.
- Test that your laptop, VPN, 2FA, and paging app actually work before the page, not during.
- Keep a paper notebook next to the bed; pages knock thoughts out of your head fast.
- After every incident you handle, write a 10-line note: what paged, what you did, what you wished existed. Those notes become runbooks.
- Push hard on tuning alerts: every false page is a tax on sleep and morale, and fixing alert quality is real engineering work, not avoidance.
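The post-incident note is easier to write consistently if it has a fixed shape. Here is a minimal sketch of such a template; the `IncidentNote` class and its field names are hypothetical, and the sample values reuse the checkout-api example from earlier, with the "wished existed" entry invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class IncidentNote:
    # The three questions from the tip: what paged, what you did, what was missing.
    what_paged: str
    what_i_did: str
    what_i_wished_existed: str

    def render(self) -> str:
        """Return the note as plain text, ready to paste into a ticket or wiki."""
        return "\n".join([
            f"What paged: {self.what_paged}",
            f"What I did: {self.what_i_did}",
            f"What I wished existed: {self.what_i_wished_existed}",
        ])


note = IncidentNote(
    what_paged="Elevated 5xx on checkout-api",
    what_i_did="Rolled back deploy abc123 and verified error rates recovered",
    what_i_wished_existed="A runbook section on rolling back checkout-api",  # invented example
)
print(note.render())
```

The third field is the one that feeds future runbooks, so it is worth filling in even when the incident felt routine.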
Wrap-up
On-call is a forcing function. It teaches you what your system actually does in production, not what the design doc claimed. The engineers who get the most out of it are not the ones who never get paged — they are the ones who make each page produce a runbook, a fix, or a tuned alert so the next person sleeps better. Treat your rotation as a teaching tool for the team, and it stops being a thing you dread.