SRE Roadmap: How to Become a Site Reliability Engineer
A practical roadmap to becoming a Site Reliability Engineer. Linux, networking, observability, IaC, Kubernetes, incident response, and SLOs explained in order.
What you'll learn
- ✓ What this role actually does day-to-day
- ✓ The exact skills and tools to learn in order
- ✓ A realistic month-by-month plan for the first 6-12 months
- ✓ How to build a portfolio that gets interviews
- ✓ How to land the first job and what to expect
Prerequisites
- Basic comfort with a computer and willingness to commit ~10 hours/week
A Site Reliability Engineer keeps production systems up, fast, and debuggable. Day to day, that means dashboards, runbooks, on-call pages, capacity planning, and removing the toil that makes engineers hate their week. SREs write code, but the code mostly automates infrastructure rather than shipping features.
Follow these steps in order. SRE is the most layered career path in this list because each tool sits on top of the one below it. Each step links to a Codeloom tutorial so you can start now without wandering.
The Step-by-Step Path
Step 1 — Linux Fluency
Production runs on Linux. You need to navigate, debug, and read logs without thinking. This is the foundation everything else rests on.
Step 2 — Networking and Shell Scripting
When a service is down at 3am, you will be reading tcpdump output and writing a shell loop. Networking fundamentals (DNS, TCP, HTTP, TLS) and confident scripting separate juniors from real responders.
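The "shell loop" mentioned above is often just "keep trying this port until it answers." Here is a minimal Python sketch of that idea, assuming a plain TCP connect is a good enough signal of reachability. The function names `tcp_check` and `wait_for_port` are hypothetical helpers invented for this example, not part of any library.

```python
import socket
import time


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name (DNS) and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def wait_for_port(host: str, port: int, attempts: int = 5, delay: float = 1.0) -> bool:
    """Retry tcp_check up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if tcp_check(host, port):
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False
```

A successful connect only proves the DNS and TCP layers work. It says nothing about TLS or HTTP health, which is why the fundamentals above are worth learning layer by layer.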
Step 3 — Docker
Containers are how modern services are packaged. You cannot debug what you do not understand. Build, run, and exec into containers until it is muscle memory.
Step 4 — Kubernetes
Kubernetes is the standard for running services at scale, and the standard for SRE interviews. Learn pods, deployments, services, and namespaces before any operator or service mesh.
Step 5 — CI/CD
Reliability starts at deploy time. A solid pipeline is the cheapest reliability investment a company can make. Build one yourself before you critique someone else’s.
Step 6 — Observability
You cannot fix what you cannot see. Prometheus for metrics, Grafana for dashboards, and a log aggregator round out the basics. Pair these with the classic SRE trio: SLIs (what you measure), SLOs (the target you commit to), and error budgets (how much failure that target allows).
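To make the SLI, SLO, and error budget relationship concrete, here is a small Python sketch. It assumes a request-based availability SLI (good requests divided by total requests) over a single window. The function name `error_budget` and the returned keys are illustrative choices, not a standard API.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize an availability SLO for one window of request counts."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the fraction of requests that succeeded.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        # 1.0 means the budget is untouched; below 0 means it is overspent.
        "budget_remaining": 1.0 - failed_requests / allowed_failures,
        "slo_met": sli >= slo_target,
    }
```

For example, with a 99.9% target, one million requests, and 250 failures, the budget is about 1,000 failed requests and roughly three quarters of it remains. The same arithmetic drives the usual policy decision: when the budget is nearly spent, slow down risky deploys.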
- (Resource hint: Prometheus and Grafana official docs)
Step 7 — Cloud (AWS First)
AWS is the most common cloud you will see on SRE job descriptions. Learn the core compute, storage, networking, and identity primitives before any managed Kubernetes service.
Step 8 — Infrastructure as Code
Click-ops does not scale. Terraform is the default IaC tool for most teams. Once you can describe infra in code, you can review it, version it, and roll it back.
- (Resource hint: Terraform official Get Started guides)
Step 9 — Incident Response
The job is judged by what you do during the worst hour of the quarter. Learn the incident command pattern, blameless postmortems, and how to write a runbook someone else can follow at 2am.
- (Resource hint: Google SRE Workbook, chapters on incident response)
Step 10 — On-Call Mindset
The mindset is the differentiator: calm under pressure, clear writing, and an obsession with reducing toil. Read the Google SRE book even if you skip the exercises.
What to Build (Portfolio Projects)
- A homelab cluster with k3s, Prometheus, Grafana, and a sample app, fully on GitHub with a writeup. Demonstrates the full stack at small scale.
- A Terraform repo that stands up a VPC, a managed database, and an ECS service on AWS. Demonstrates IaC fluency.
- A chaos experiment writeup where you intentionally break a service and document recovery. Demonstrates incident skills.
- A small dashboard project that exposes meaningful SLIs for an app you wrote. Demonstrates observability thinking.
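For the dashboard project, one common way to frame a "meaningful SLI" is the share of requests served faster than a latency threshold. The sketch below assumes you already have per-request latencies in milliseconds. The name `latency_sli` and the 300 ms threshold in the usage note are illustrative assumptions.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that completed at or under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)
```

Calling `latency_sli([120, 80, 310, 95], 300)` counts three of four requests as good. In a real setup you would more likely derive this ratio from histogram metrics in your monitoring system than from raw samples, but the definition is the same.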
Common Mistakes
- Memorizing Kubernetes commands without understanding networking and Linux first.
- Treating SRE as just DevOps with a fancier title. The reliability math matters.
- Skipping postmortems and runbook writing. The writing is half the job.
- Chasing every new CNCF tool instead of mastering Prometheus, Grafana, Terraform, and one cloud.
- Avoiding code. Modern SRE is half engineer, half ops. Python or Go fluency is expected.
- Ignoring cost. Reliability without cost awareness gets you fired in a downturn.
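As one example of the reliability math mentioned in the list above: when a request must pass through several services in series, their availabilities multiply. The sketch below assumes failures are independent, which real systems only approximate, and uses hypothetical helper names.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain where every dependency must be up (independent failures assumed)."""
    return math.prod(availabilities)


def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of downtime a given availability permits over window_days."""
    return (1.0 - availability) * window_days * 24 * 60
```

Three services at 99.9% each give a chain of about 99.7%, which is why each extra hard dependency costs reliability. Over 30 days, 99.9% allows roughly 43 minutes of downtime.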
How to Get the First Job
- Resume: lead with the homelab and the IaC repo. Quantify with uptime numbers and toil reduction.
- Portfolio: a public writeup of one incident you handled, even a simulated one, is gold.
- Networking: the SRE community is small and friendly. Watch SREcon talks on YouTube and follow the speakers.
- Interviews: expect Linux deep dives, system design, and a behavioral round focused on calm under pressure.
- Target the right roles: Platform Engineer, Infra Engineer, and Production Engineer are SRE-adjacent and often easier first jobs.
Wrap up
SRE rewards systems thinkers who can stay calm and write clearly. Work bottom-up through Linux, containers, Kubernetes, cloud, and IaC, then prove it with a homelab. The path is long but the comp and the craft justify it.