On-Call Survival Guide for Software Engineers
On-call is part of the job for most production engineers. Here is how to survive your first rotation, sleep better, and come out a stronger engineer.
What you'll learn
- How to prepare before your first rotation starts
- A triage flow for the moment the pager fires
- How to write a runbook your future self will thank you for
- How to handle escalation without panic
- How to recover after a rough week
Prerequisites
- Any developer experience
The first time the pager goes off at 3 AM, you forget your own name. The screen is bright. Your brain is foggy. Some dashboard you have never opened is blinking red, and somewhere a customer is having a bad day.
That feeling fades. Not because the alerts get easier, but because you build the muscle. This post is the survival kit I wish I had on my first rotation.
Before the rotation starts
Most people prepare for on-call by reading the runbooks the day before. That is too late. Prepare a full week before.
- Walk the architecture. Find a senior engineer and ask them to whiteboard the system you will be on call for. Where does traffic enter? Where does it die? What are the external dependencies?
- Open every dashboard once. 3 AM, half asleep, is not the time to learn where the latency graph lives.
- Test the pager. Send yourself a test alert. Make sure the sound wakes you up. Set up a backup channel.
- Read the last month of incidents. You will probably get paged for something similar.
- Find the on-call buddy. There is always someone you can call. Know their name, their timezone, and how to reach them.
Five hours of prep saves twenty hours of suffering.
The triage flow
When the pager fires, do not start debugging. Start triaging. The order matters.
- Acknowledge the page within five minutes. Even if you are not fixing anything yet, the ack stops the auto-escalation.
- Assess impact. Is anyone losing money or data right now? Is it a partial outage or a full one? Are customers paging support?
- Stabilize before you investigate. If a deploy went out 10 minutes ago and metrics tanked, roll it back. Investigate after. Bleeding first, autopsy later.
- Communicate. Open an incident channel. Post a one-line status. Update every 15 minutes even if there is nothing new.
- Investigate. Logs, metrics, recent changes. Form a hypothesis, test it, repeat.
- Resolve. Fix forward or roll back. Verify recovery with the same metrics that triggered the page.
- Hand off or close out. If you are exhausted, hand off. Heroics cause second outages.
Most rookies skip stabilize. They jump into debugging while the fire is still spreading. The instinct is to understand. The job is to stop the bleeding.
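The "was there a recent deploy?" check from the stabilize step can be sketched in a few lines. This is a minimal illustration, not a real deploy tool: the Deploy record, the rollback_candidates helper, the 30-minute window, and the sample timestamps are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deploy:
    service: str            # hypothetical service name
    version: str            # hypothetical version label
    deployed_at: datetime


def rollback_candidates(deploys, alert_time, window=timedelta(minutes=30)):
    """Return deploys that landed within `window` before the alert, newest first.

    The 30-minute default is an assumed value; tune it to your deploy cadence.
    """
    suspects = [
        d for d in deploys
        if timedelta(0) <= alert_time - d.deployed_at <= window
    ]
    return sorted(suspects, key=lambda d: d.deployed_at, reverse=True)


# Made-up data: one old deploy, one that went out 10 minutes before the page.
alert = datetime(2024, 1, 10, 2, 24)
history = [
    Deploy("auth", "v41", datetime(2024, 1, 9, 16, 0)),
    Deploy("auth", "v42", datetime(2024, 1, 10, 2, 14)),
]

for d in rollback_candidates(history, alert):
    print(f"rollback candidate: {d.service} {d.version}")
```

If the list is non-empty and the metrics dropped right after the newest entry, that deploy is the first thing to roll back. Why it broke is a question for after the bleeding stops.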
Communicating during an incident
Your first job during an incident is not to fix it. It is to make sure the people who need to know, know. That includes other engineers, the on-call manager, customer support, and sometimes the CEO.
A good incident update is three lines:
- What is happening. “Login service returning 500s for ~30 percent of requests.”
- What we know. “Started after the 02:14 deploy, isolated to the auth pod.”
- Next step. “Rolling back, ETA 5 minutes.”
Send it every 15 minutes. Even “no change, still investigating” is better than silence. Silence makes people panic and start their own parallel investigations.
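If it helps to make the three-line format a habit, it fits in a tiny template. The sketch below is one possible shape, assuming you post updates as plain text. The IncidentUpdate class and the update_due helper are hypothetical names, not part of any incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentUpdate:
    happening: str   # what is happening
    known: str       # what we know
    next_step: str   # what we do next

    def render(self) -> str:
        """Format the update as the three lines described above."""
        return "\n".join([
            f"What is happening: {self.happening}",
            f"What we know: {self.known}",
            f"Next step: {self.next_step}",
        ])


def update_due(last_sent, now, interval=timedelta(minutes=15)):
    """True once `interval` has passed since the last status update."""
    return now - last_sent >= interval


update = IncidentUpdate(
    happening="Login service returning 500s for ~30 percent of requests.",
    known="Started after the 02:14 deploy, isolated to the auth pod.",
    next_step="Rolling back, ETA 5 minutes.",
)
print(update.render())
```

The timer check is there because the 15-minute cadence is the part people forget once they are deep in logs.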
Runbooks that actually help
Every alert should have a runbook. Most runbooks are useless because they are written for the person who already knows the system.
A good runbook answers:
- What does this alert mean, in plain English?
- What is the user impact?
- What are the first three things to check?
- What is the most common fix?
- When do I escalate, and to whom?
Write it for a sleepy engineer who has never seen the system before. That engineer might be future-you in six months, half-awake, with no memory of why you wrote the original code.
After every incident, update the runbook. The lesson should not have to be re-learned by the next on-call.
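The five questions can double as a completeness check for a runbook draft. Here is a rough sketch, assuming runbooks are stored as structured data. The Runbook fields and the missing_sections helper are invented for illustration, and the alert text is a made-up example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    meaning: str = ""                                  # the alert in plain English
    user_impact: str = ""                              # what users experience
    first_checks: list = field(default_factory=list)   # first three things to check
    common_fix: str = ""                               # the most common fix
    escalation: str = ""                               # when to escalate, and to whom


def missing_sections(rb):
    """Return the names of sections that are empty or too thin."""
    gaps = [f.name for f in fields(rb) if not getattr(rb, f.name)]
    # Flag a checklist that exists but has fewer than three entries.
    if 0 < len(rb.first_checks) < 3:
        gaps.append("first_checks")
    return gaps


draft = Runbook(
    meaning="Login requests are failing faster than the alert threshold.",
    first_checks=["Was there a deploy in the last 30 minutes?"],
)
print(missing_sections(draft))
```

An empty result does not mean the runbook is good, only that none of the five questions was skipped.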
When to escalate
New on-calls under-escalate. They burn three hours trying to look competent when 10 minutes with a senior engineer would have solved it.
Escalate when:
- You are stuck for more than 30 minutes with no working hypothesis.
- The blast radius is growing despite your actions.
- You need a decision you are not paid to make. “Should we take the database read-only for an hour?” is not a junior engineer’s call.
- You are exhausted. Tired engineers cause second outages.
Senior engineers expect to be paged. That is what the rotation exists for. Waking someone up at 4 AM is annoying for them. Making the outage worse is annoying for the entire company.
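The escalation triggers are simple enough to write down as a check, which takes the ego out of the decision. This is a sketch under stated assumptions: the IncidentState fields and the escalation_reasons function are hypothetical, and only the 30-minute limit comes from the list above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class IncidentState:
    time_without_hypothesis: timedelta   # how long you have been stuck
    blast_radius_growing: bool           # impact spreading despite your actions
    needs_decision_above_role: bool      # e.g. taking the database read-only
    responder_exhausted: bool            # be honest with yourself here


def escalation_reasons(state, stuck_limit=timedelta(minutes=30)):
    """Return every reason to escalate; an empty list means keep going."""
    reasons = []
    if state.time_without_hypothesis > stuck_limit:
        reasons.append("stuck with no working hypothesis")
    if state.blast_radius_growing:
        reasons.append("blast radius is growing")
    if state.needs_decision_above_role:
        reasons.append("decision is above my role")
    if state.responder_exhausted:
        reasons.append("responder is exhausted")
    return reasons


state = IncidentState(
    time_without_hypothesis=timedelta(minutes=45),
    blast_radius_growing=False,
    needs_decision_above_role=True,
    responder_exhausted=False,
)
for reason in escalation_reasons(state):
    print(f"escalate: {reason}")
```

Any single reason is enough. The function returns all of them so the person you wake up gets the full picture in the first message.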
What to do when you cannot reproduce it
The worst alerts are the ones that disappear when you log in. Pager fires, you wake up, the graph is already recovering, and you have no idea what happened.
Resist the urge to mark it noise immediately. Instead:
- Snapshot the dashboards. The data will roll off.
- Check for correlated events. Deploys, traffic spikes, dependency status.
- Tag the incident as “self-recovered” and file a ticket to investigate during business hours.
A pattern of “self-recovered” pages almost always points at something real. You just need daylight to see it.
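Spotting that pattern is mostly counting. The sketch below assumes you keep a simple page log with an alert name and a resolution tag per entry. The recurring_self_recovered helper, the threshold of three, and the alert names are all invented for the example.

```python
from collections import Counter


def recurring_self_recovered(pages, threshold=3):
    """Return alerts that self-recovered at least `threshold` times."""
    counts = Counter(
        p["alert"] for p in pages if p["resolution"] == "self-recovered"
    )
    return {alert: n for alert, n in counts.items() if n >= threshold}


# Hypothetical page log: one line per page.
pages = [
    {"alert": "checkout-latency", "resolution": "self-recovered"},
    {"alert": "checkout-latency", "resolution": "self-recovered"},
    {"alert": "auth-5xx", "resolution": "rolled back"},
    {"alert": "checkout-latency", "resolution": "self-recovered"},
    {"alert": "disk-usage", "resolution": "self-recovered"},
]

for alert, n in recurring_self_recovered(pages).items():
    print(f"{alert}: self-recovered {n} times, investigate in daylight")
```

Anything this surfaces deserves a proper ticket, not another "noise" label.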
After a rough week
A bad on-call week leaves you tired and a little raw. That is normal. Treat the recovery seriously.
- Take real time off the pager. No “just keeping an eye on Slack.”
- Sleep. Seriously, two good nights of sleep fix a lot.
- Write the post-mortem honestly. No blame. What happened, why, and what we will do differently.
- Talk to your manager if the rotation is unsustainable. Burned-out on-call engineers leave companies.
The post-mortem is where on-call pays back into your career. A good post-mortem is a written artifact of your judgment under pressure. Save them. They become great stories in your next behavioral interview, exactly the kind that work with the STAR structure.
On-call as a learning accelerator
Here is the secret no one tells junior engineers: on-call is the single fastest way to grow. You see the system at its worst. You learn where the bodies are buried. You learn which abstractions leak and which dependencies lie about being healthy.
A year of on-call teaches you more about distributed systems than a year of reading papers. It also gives you stories worth telling. When you eventually interview elsewhere, lean on those incidents in your behavioral round, and back them with broader interview prep. When you update your resume, mention real incident impact with numbers, not vague claims of “production experience.”
Small habits that compound
- Keep a personal incident journal. One line per page. Patterns will emerge.
- Bookmark the five dashboards you actually use. Close the rest.
- Pin the runbook links in the team channel.
- Build a “first 10 minutes” checklist and keep it on your second monitor.
- After every shift, write one thing to fix in the system or the rotation.
None of these are heroic. All of them compound.
Wrap up
On-call is a craft. The first few rotations are scary because you have not built the muscle yet. The fix is preparation, a calm triage flow, honest communication, and a culture of escalation without shame.
Stop the bleeding first. Communicate every 15 minutes. Update the runbook every time. Sleep when you can. The pager will fire again. You will be ready.