The On-Call Survival Guide for Software Engineers

Intermediate · 7 min read

What you'll learn

✓ How to prepare before your shift
✓ A simple triage flow for pages
✓ Communicating during an incident
✓ Postmortem habits that pay off
✓ Protecting sleep and sanity

Prerequisites

• Familiar with production systems and basic monitoring

What and Why

On-call means being the human who responds when a system breaks at 3 AM. Done badly, it burns engineers out and pushes them to quit. Done well, it is one of the fastest ways to learn how production really behaves and to grow technical judgement.

The difference is mostly preparation, process, and culture — not raw skill. This guide collects the habits that make on-call survivable, and even valuable.

Mental Model

An incident has three phases: detect, mitigate, resolve. Your job during a page is almost always mitigation — stop the bleeding — not root-cause analysis. Fixing the underlying bug can wait until business hours. Restoring service cannot.

Hold that distinction tightly. Engineers who try to find the real bug at 3 AM tend to spend hours and make things worse. Mitigators roll back, fail over, or shed load, then sleep and dig in fresh.

Hands-on Example

A practical triage flow when a page hits:

 Page received
    |
    v
Acknowledge in 5 min
    |
    v
Check dashboards & logs
    |
    v
Is it real?  -- no --> silence, file followup
    |
   yes
    |
    v
Can I mitigate fast?
 |              |
yes            no
 |              |
rollback /     escalate
failover /     to next person
feature flag      |
 |                v
 v             join war room
verify         mitigate together
 |
 v
file incident ticket -> postmortem

On-call triage flow
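The decision points in the flow above can be sketched as a small function. This is only an illustration: the names `Action` and `triage` and the two boolean inputs are assumptions made for the example, not part of any real paging tool, and in practice both judgements come from a human looking at dashboards and logs.

```python
from enum import Enum


class Action(Enum):
    """Possible next steps after a page has been acknowledged (hypothetical names)."""
    SILENCE_AND_FILE_FOLLOWUP = "silence the alert and file a follow-up"
    MITIGATE = "roll back, fail over, or flip a feature flag, then verify"
    ESCALATE = "escalate to the next person and join the war room"


def triage(is_real: bool, can_mitigate_fast: bool) -> Action:
    """Return the next step for an acknowledged page, following the flow above."""
    if not is_real:
        # False alarm: quiet it deliberately and track the alert fix.
        return Action.SILENCE_AND_FILE_FOLLOWUP
    if can_mitigate_fast:
        # Real and fixable quickly: stop the bleeding first.
        return Action.MITIGATE
    # Real but not quickly fixable alone: bring in help.
    return Action.ESCALATE


if __name__ == "__main__":
    print(triage(is_real=True, can_mitigate_fast=False).value)
```

Whichever branch is taken on a real incident, the flow still ends with an incident ticket and a postmortem; the sketch only covers the branching in the middle.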

The whole flow assumes you can find runbooks. If you cannot, writing them is your first incident-response improvement.

A good page acknowledgement message in chat: “Got it, looking at dashboards.” A good update five minutes later: “Confirmed elevated 5xx on checkout-api since 02:47 UTC. Rolling back deploy abc123, ETA 3 min.” Frequent, factual, short.
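If it helps to keep updates consistent under stress, the shape of that second message can be captured in a tiny helper. This is a minimal sketch: `format_update` and its parameters are hypothetical, and the calendar date in the usage example is invented purely so the code runs.

```python
from datetime import datetime, timezone


def format_update(symptom: str, service: str, since: datetime,
                  action: str, eta_min: int) -> str:
    """Build a short, factual status update: what is wrong, since when, what happens next."""
    # Normalise to UTC so everyone in the channel reads the same clock.
    since_utc = since.astimezone(timezone.utc).strftime("%H:%M UTC")
    return (f"Confirmed {symptom} on {service} since {since_utc}. "
            f"{action}, ETA {eta_min} min.")


if __name__ == "__main__":
    # The date is an arbitrary placeholder; only the time matches the example above.
    started = datetime(2024, 1, 1, 2, 47, tzinfo=timezone.utc)
    print(format_update("elevated 5xx", "checkout-api", started,
                        "Rolling back deploy abc123", 3))
```

Forcing every update through the same four fields (symptom, service, start time, next action with ETA) is one way to keep messages frequent, factual, and short.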

Common Pitfalls

• Heroics over communication: silently debugging for 40 minutes scares your team and blocks help.
• Skipping the rollback: “I will just patch forward” at 3 AM is famously how 30-minute incidents become 4-hour ones.
• Ignoring noisy alerts: every false page erodes trust in the system. Silence or tune them deliberately, not by reflex.
• No handoff at shift end: incidents in flight need a clear written handoff to the next on-call.
• Treating postmortems as blame: blameless retros are the only kind that produce real fixes.
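To tune noisy alerts deliberately rather than by reflex, it helps to know which ones produce the most false pages. The sketch below assumes you can export page history as `(alert_name, was_actionable)` pairs; that record shape, the function name `false_page_rates`, and the alert names in the example are all hypothetical.

```python
from collections import Counter


def false_page_rates(pages):
    """Rank alerts by the fraction of their pages that were not actionable.

    pages: iterable of (alert_name, was_actionable) tuples.
    Returns a list of (alert_name, rate) sorted from noisiest to quietest.
    """
    total = Counter()
    false = Counter()
    for alert, actionable in pages:
        total[alert] += 1
        if not actionable:
            false[alert] += 1
    rates = {alert: false[alert] / total[alert] for alert in total}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    history = [
        ("disk-full", False),
        ("disk-full", False),
        ("disk-full", True),
        ("checkout-5xx", True),
    ]
    for alert, rate in false_page_rates(history):
        print(f"{alert}: {rate:.0%} false pages")
```

A ranking like this gives the team a concrete starting point: tune or silence the alerts at the top of the list first.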

Practical Tips

• Before your shift starts, read the runbooks for the top 5 services you might be paged for.
• Test that your laptop, VPN, 2FA, and paging app actually work before the page arrives, not during it.
• Keep a paper notebook next to the bed; pages knock thoughts out of your head fast.
• After every incident you handle, write a 10-line note: what paged, what you did, what you wished existed. Those notes become runbooks.
• Push hard on tuning alerts: every false page is a tax on sleep and morale, and fixing alert quality is real engineering work, not avoidance.
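The post-incident note can be as simple as three labelled fields. As a sketch, assuming you keep the notes as plain text, a small template makes it harder to skip a section; the class name `IncidentNote`, its field names, and the example values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class IncidentNote:
    """The three questions from the tip above, kept deliberately short."""
    what_paged: str
    what_i_did: str
    what_i_wished_existed: str

    def render(self) -> str:
        """Return the note as plain text, ready to paste into a ticket or wiki."""
        return "\n".join([
            f"What paged: {self.what_paged}",
            f"What I did: {self.what_i_did}",
            f"What I wished existed: {self.what_i_wished_existed}",
        ])


if __name__ == "__main__":
    note = IncidentNote(
        what_paged="Elevated 5xx on checkout-api",
        what_i_did="Rolled back deploy abc123 and verified error rate recovered",
        what_i_wished_existed="A runbook entry for rolling back checkout-api",
    )
    print(note.render())
```

The last field is the one that feeds future runbooks, so it is worth filling in even when the incident felt routine.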

Wrap-up

On-call is a forcing function. It teaches you what your system actually does in production, not what the design doc claimed. The engineers who get the most out of it are not the ones who never get paged — they are the ones who make each page produce a runbook, a fix, or a tuned alert so the next person sleeps better. Treat your rotation as a teaching tool for the team, and it stops being a thing you dread.