CI/CD Rollback Strategies

Intermediate, 8 min read

What you'll learn

✓ Why every deploy strategy needs a rollback plan
✓ The five common rollback patterns
✓ How database changes constrain rollbacks
✓ How feature flags shrink the blast radius
✓ How to choose between speed and safety

Prerequisites

• Some experience deploying services to production

What and Why

A rollback is the act of reverting a production system to a known-good state after a bad deploy. It sounds simple, but the actual mechanism varies widely. Pushing a Git revert, flipping a load balancer, or rolling back a Kubernetes Deployment are very different operations with very different recovery times.

You need a rollback strategy because sooner or later a deploy will fail. The question is not whether, but how fast you can return to safety. Teams that rehearse rollbacks tend to recover in minutes. Teams that improvise can take hours.

Mental Model

A rollback is just another deploy where the target version is older than the current one. Everything that makes a deploy work, like artifact storage, immutable builds, and config separation, also makes rollback work. Everything that makes a deploy fragile, like in-place edits or coupling code with database state, also makes rollback fragile.
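The idea that a rollback is just a deploy with an older target can be sketched in a few lines of Python. This is a minimal illustration, not a real deployment tool: `ReleaseHistory`, `deploy`, and `rollback` are hypothetical names, and the digests are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseHistory:
    """Hypothetical in-memory record of deployed artifacts, oldest first."""
    deployed: list[str] = field(default_factory=list)

    def deploy(self, artifact: str) -> str:
        # A deploy only points the system at an already-built, immutable artifact.
        self.deployed.append(artifact)
        return artifact

    def rollback(self) -> str:
        # A rollback is the same operation with an older target.
        if len(self.deployed) < 2:
            raise RuntimeError("no previous release to roll back to")
        previous = self.deployed[-2]
        return self.deploy(previous)


history = ReleaseHistory()
history.deploy("api@sha256:aaa")   # placeholder digests, not real images
history.deploy("api@sha256:bbb")
current = history.rollback()
print(current)
```

Because `rollback` goes through `deploy`, the rollback itself is recorded in the history, and anything that would break a normal deploy breaks the rollback in the same way.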

There are five common patterns, ordered roughly from slowest to fastest:

1. Redeploy the previous artifact through the pipeline.
2. kubectl rollout undo, or an equivalent in-place reversion.
3. Blue-green flip back to the idle environment.
4. Canary reverse, returning the traffic split to zero.
5. Feature flag off, leaving code deployed but inert.

Each strategy assumes the previous version is still runnable, which is the deepest invariant a rollback depends on.

Hands-on Example

Take a blue-green setup on Kubernetes: two slots, blue and green, each running its own ReplicaSet, behind a single Service whose selector decides which slot receives traffic:

apiVersion: v1
kind: Service
metadata: { name: api }
spec:
  selector: { app: api, slot: blue }   # blue serves traffic before the green release
  ports: [{ port: 80, targetPort: 8080 }]

You deployed green, switched the Service selector to slot: green, and the release turned out to be bad. To roll back, point the selector at blue again:

kubectl patch svc api -p '{"spec":{"selector":{"app":"api","slot":"blue"}}}'

New requests reach blue within seconds, once the Service's endpoints update. The bad green ReplicaSet stays around so you can investigate.
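If you script the flip, building the patch body from data avoids quoting mistakes under pressure. The following is a minimal sketch, assuming the `app` and `slot` labels from the manifest above. `selector_patch` and `flip_command` are hypothetical helpers, and the command is only printed, not executed.

```python
import json
import shlex


def selector_patch(slot: str, app: str = "api") -> str:
    """Return the JSON patch body that points the Service at one slot."""
    if slot not in ("blue", "green"):
        raise ValueError(f"unknown slot: {slot!r}")
    body = {"spec": {"selector": {"app": app, "slot": slot}}}
    return json.dumps(body, separators=(",", ":"))


def flip_command(slot: str, service: str = "api") -> list[str]:
    """Build the kubectl argv for the flip; the caller decides whether to run it."""
    return ["kubectl", "patch", "svc", service, "-p", selector_patch(slot)]


cmd = flip_command("blue")
print(shlex.join(cmd))   # shell-quoted form, safe to paste into a terminal
```

Returning an argument list rather than a shell string means the command can be passed to `subprocess.run` without a shell, which sidesteps quoting issues entirely.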

Slow                                                          Fast
+------------+------------+------------+------------+------------+
| Pipeline   | kubectl    | Blue/green | Canary     | Feature    |
| redeploy   | rollout    | flip       | reverse    | flag       |
| (minutes)  | (~30s)     | (<5s)      | (<5s)      | (<1s)      |
+------------+------------+------------+------------+------------+
  depends on                             depends on
  fresh build                            pre-deployed code
                                         + flag service

Five rollback strategies on a time-to-safe axis

Common Pitfalls

The biggest pitfall is the database. If your release ran a migration that dropped a column or changed a type, rolling the binary back leaves the old code running against a schema it no longer understands. The fix is the expand-contract pattern: ship the additive migration in one release, the code that uses it in a second, and the cleanup migration in a third. The first two steps can each be rolled back without breaking the other. The cleanup is destructive, so ship it only once you no longer need to roll back to code that reads the old schema.
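The expand step can be illustrated with an in-memory SQLite database. This is a sketch under assumptions: the `users` table, the move from `name` to `display_name`, and the `old_code` and `new_code` functions are invented for the example. The point is that after the additive migration, both code versions run against the same schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")

# Release 1 (expand): add the new column and backfill it; nothing is removed.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
conn.execute("UPDATE users SET display_name = name")


def old_code(db: sqlite3.Connection) -> list[str]:
    # The previous release still reads the old column.
    return [row[0] for row in db.execute("SELECT name FROM users ORDER BY id")]


def new_code(db: sqlite3.Connection) -> list[str]:
    # Release 2 reads the new column. Rolling it back lands on old_code,
    # which still works because the old column is untouched.
    return [row[0] for row in db.execute("SELECT display_name FROM users ORDER BY id")]


old_names = old_code(conn)
new_names = new_code(conn)
print(old_names, new_names)

# Release 3 (contract) would drop `name`. It is the one destructive step, so it
# waits until no version you might roll back to still reads that column.
```

In a real system the new code would typically also keep writing the old column until the contract step, so that rows created after release 2 stay readable by the old code.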

Another pitfall is rollback drift. Suppose the “previous” version was built six weeks ago and depends on a ConfigMap you removed yesterday. When you roll back, it crashes on startup. Keep config backward compatible with every version you might still roll back to, and at minimum with the previous release.
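One way to catch drift before an incident is to compare what the rollback target expects with what is live. The sketch below assumes each release declares the config keys it needs. `missing_config` and the key names are hypothetical.

```python
def missing_config(required_keys: set[str], live_config: dict[str, str]) -> set[str]:
    """Return the keys a release needs that the live config no longer provides."""
    return required_keys - set(live_config)


# Hypothetical values: what the previous release reads at startup,
# and what is actually present in the environment today.
previous_release_needs = {"DB_URL", "LEGACY_CACHE_HOST"}
live = {"DB_URL": "postgres://db.internal/app"}

missing = missing_config(previous_release_needs, live)
print(sorted(missing))
```

Running a check like this in the pipeline, against the release you would roll back to, turns a startup crash during an incident into a failed check on an ordinary day.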

A third is unattended rollback. An automated system rolls back the moment the error rate crosses a threshold, then a transient downstream blip triggers another rollback, and the system flaps between versions. Use cooldowns and require a successful health check before re-allowing automatic rollback.
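The cooldown and health-check gate can be sketched as a small guard in front of the rollback trigger. Everything here is an assumption made for illustration: the `RollbackGuard` name, the 5% threshold, and the 10-minute cooldown. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RollbackGuard:
    """Hypothetical gate in front of an automated rollback trigger."""
    error_threshold: float = 0.05      # assumed: 5% error rate
    cooldown_seconds: float = 600.0    # assumed: 10 minutes
    last_rollback_at: Optional[float] = None
    healthy_since_rollback: bool = True

    def record_health_check(self, passed: bool) -> None:
        # A passing check after a rollback re-arms the guard.
        if passed:
            self.healthy_since_rollback = True

    def should_roll_back(self, error_rate: float, now: float) -> bool:
        if error_rate < self.error_threshold:
            return False
        if self.last_rollback_at is not None:
            in_cooldown = now - self.last_rollback_at < self.cooldown_seconds
            if in_cooldown or not self.healthy_since_rollback:
                return False
        # Allow the rollback and disarm until a health check passes.
        self.last_rollback_at = now
        self.healthy_since_rollback = False
        return True


guard = RollbackGuard()
first = guard.should_roll_back(0.20, now=0.0)        # real failure: allowed
blip = guard.should_roll_back(0.20, now=60.0)        # inside cooldown: blocked
unverified = guard.should_roll_back(0.20, now=700.0) # no health check yet: blocked
guard.record_health_check(True)
rearmed = guard.should_roll_back(0.20, now=800.0)    # re-armed: allowed
print(first, blip, unverified, rearmed)
```

A blocked automatic rollback should still page a human; the guard only stops the automation from flapping on its own.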

Practical Tips

Practice rollback in non-production. Treat the rollback path as part of every deploy, not an emergency procedure. If you have never run it, you do not have one.

Keep artifacts immutable and addressable. A rollback should be a config change that points at a known image digest, not a rebuild.

Default to feature flags for risky changes. A flag flip is the fastest possible rollback and does not change the running binary.
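A flag-gated code path might look like the sketch below. `FlagStore` is a hypothetical in-process stand-in for a real flag service, and `new_pricing` and `price_quote` are invented names. The point is that turning the flag off restores the old behavior without redeploying anything.

```python
class FlagStore:
    """Hypothetical in-process stand-in for a feature flag service."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off so new code stays inert until enabled.
        return self._flags.get(name, False)


def price_quote(flags: FlagStore, amount: float) -> float:
    if flags.is_enabled("new_pricing"):      # hypothetical flag name
        return round(amount * 0.9, 2)        # new, risky code path
    return amount                            # known-good path


flags = FlagStore()
flags.set("new_pricing", True)
with_flag = price_quote(flags, 100.0)
flags.set("new_pricing", False)              # the "rollback": same binary, old behavior
after_rollback = price_quote(flags, 100.0)
print(with_flag, after_rollback)
```

Defaulting unknown flags to off is a deliberate choice here: if the flag service loses a value, the system falls back to the known-good path.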

Separate schema and code releases. Migrations land in their own deploy, never bundled with code that depends on them.

Record every rollback. The frequency and cause are leading indicators of release quality.

Wrap-up

Rollback is a deploy in reverse, and the strategy you pick is a trade-off between recovery time and operational complexity. Feature flags and traffic flips are nearly instant but require infrastructure investment. Pipeline redeploys are slow but need nothing beyond the pipeline you already have. The single most important rule is to keep code, config, and schema independently revertible. If you can do that, the rest is plumbing.