The Saga Pattern Explained
How the saga pattern coordinates long-running business transactions across services using local commits and compensating actions instead of distributed two-phase commit.
What you'll learn
- Why distributed transactions are hard
- How sagas replace 2PC in microservices
- Choreography vs orchestration
- How to design compensating actions
Prerequisites
- Basic microservices knowledge
The moment your business logic crosses two services, the comfortable world of database transactions disappears. The saga pattern is one of the most widely used answers to the question of how to keep multi-service operations consistent without resorting to distributed locks.
What and why
A saga is a sequence of local transactions, each in a different service, where every step has a defined compensating action that undoes its effect. If any step fails, the saga runs the compensations of the previously completed steps in reverse order, leaving the system in a consistent state.
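The reverse-order compensation described above can be sketched in a few lines. This is a minimal in-process illustration, not a production implementation: the `Step` class, the `run_saga` function, and the order/payment/shipping steps are hypothetical names chosen for the example, and each "service call" is just a function that appends to a list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # the local transaction
    compensate: Callable[[], None]  # the semantic undo for that transaction


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not commit, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipping() -> None:
    # Simulates the third step failing after the first two have committed.
    raise RuntimeError("no carrier available")


steps = [
    Step("order", lambda: log.append("create order"), lambda: log.append("cancel order")),
    Step("payment", lambda: log.append("charge card"), lambda: log.append("refund card")),
    Step("shipping", fail_shipping, lambda: log.append("cancel shipment")),
]

succeeded = run_saga(steps)
print(succeeded, log)
```

The sketch assumes compensations always succeed. In a real system a compensation can fail too, so it usually needs to be retried until it goes through, which is one more reason to make every step idempotent.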
The reason to bother is that two-phase commit, the classic distributed transaction protocol, scales badly and couples services tightly: participants hold locks while they wait for the coordinator’s decision, and every service has to take part in the same protocol. Sagas trade strict atomicity for availability and autonomy, which is the right trade for most business workflows. You give up the illusion that the whole operation is one atomic act and instead model it as a small state machine you can reason about.
Mental model
Think of a saga as a checklist with an eraser. You walk down the list, ticking off boxes one at a time. If something goes wrong halfway, you walk back up the list, erasing what you ticked. The catch is that erasing is not the same as never having ticked. A charge has already been made, an email has already been sent. Compensation means doing the visible inverse, not pretending the step never happened.
That distinction (semantic rollback rather than true rollback) is what makes saga design interesting. You are not undoing time; you are designing forward-facing actions that bring the world back to a consistent state.
Architecture
There are two common ways to coordinate the steps: choreography, where services react to each other’s events, and orchestration, where a central coordinator drives the flow.
Orchestration:
  orchestrator -> order svc    -> ok
               -> payment svc  -> ok
               -> shipping svc -> fail
               -> compensate payment, compensate order

Choreography:
  order svc    -> emits OrderCreated
  payment svc  -> reacts, emits PaymentCharged
  shipping svc -> reacts, emits ShippingFailed
  payment svc  -> reacts to ShippingFailed, refunds
  order svc    -> reacts to PaymentRefunded, cancels

Orchestration centralizes the workflow in one place. The orchestrator is a service (or a workflow engine like Temporal) that calls each step in order, listens for the result, and decides the next move. The flow is easy to read and debug because the entire saga lives in one file.
Choreography distributes the workflow across services connected by events. Each service listens for events it cares about, performs its local transaction, and emits the next event. There is no central coordinator. The flow is decoupled but harder to trace because it is implicit in the event topology.
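To make the event-driven flow concrete, here is a toy choreography sketch. It assumes a synchronous in-memory `EventBus` (a hypothetical stand-in for a real message broker) and reuses the event names from the diagram; the handler functions and the `orders` dictionary are illustrative only. Note that no single function describes the whole saga.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class EventBus:
    """Toy synchronous bus; a real system would use a message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, str]) -> None:
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


bus = EventBus()
orders: Dict[str, str] = {}  # order id -> status, owned by the order service


def payment_on_order_created(evt: Dict[str, str]) -> None:
    bus.publish("PaymentCharged", evt)  # local charge succeeded


def shipping_on_payment_charged(evt: Dict[str, str]) -> None:
    bus.publish("ShippingFailed", evt)  # simulate a shipping failure


def payment_on_shipping_failed(evt: Dict[str, str]) -> None:
    bus.publish("PaymentRefunded", evt)  # compensation: refund the charge


def order_on_payment_refunded(evt: Dict[str, str]) -> None:
    orders[evt["order_id"]] = "cancelled"  # compensation: cancel the order


bus.subscribe("OrderCreated", payment_on_order_created)
bus.subscribe("PaymentCharged", shipping_on_payment_charged)
bus.subscribe("ShippingFailed", payment_on_shipping_failed)
bus.subscribe("PaymentRefunded", order_on_payment_refunded)

orders["o-1"] = "pending"
bus.publish("OrderCreated", {"order_id": "o-1"})
print(bus.history, orders)
```

The only place the full sequence is visible is `bus.history`, which is the point: in choreography the workflow emerges from the subscriptions rather than being written down anywhere.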
Trade-offs
Orchestration is easier to understand and to change. New steps go into one place. Failure handling is explicit. The downside is that the orchestrator becomes a critical dependency and can drift into being a god service if you let it.
Choreography is more decoupled and resilient to coordinator outages, but the saga as a whole exists only in the heads of the engineers who designed it. Adding a step means touching every service that needs to react. Debugging a stuck saga often means tracing events across logs.
A second trade-off is around compensation. Some actions are not naturally reversible. You cannot un-send an email, but you can send a correction. You cannot un-charge a card, but you can issue a refund. Design compensations as new business actions, not as rollbacks, and accept that the user may see a brief window of inconsistency.
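One way to picture "compensation as a new business action" is an append-only ledger, where a refund is a new entry that offsets the charge rather than a deletion of it. The sketch below is an assumption-laden illustration: `LedgerEntry`, `charge`, `refund`, and `balance` are hypothetical names, and amounts are kept in integer cents.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LedgerEntry:
    order_id: str
    kind: str          # "charge" or "refund"
    amount_cents: int  # positive for charges, negative for refunds


ledger: List[LedgerEntry] = []


def balance(order_id: str) -> int:
    """Net amount currently held for an order."""
    return sum(e.amount_cents for e in ledger if e.order_id == order_id)


def charge(order_id: str, amount_cents: int) -> None:
    ledger.append(LedgerEntry(order_id, "charge", amount_cents))


def refund(order_id: str) -> None:
    """Compensate by adding an offsetting entry; the charge stays on record."""
    outstanding = balance(order_id)
    if outstanding > 0:
        ledger.append(LedgerEntry(order_id, "refund", -outstanding))


charge("o-1", 2500)
refund("o-1")
print(balance("o-1"), [e.kind for e in ledger])
```

Between the two calls the customer really was charged, which is the brief window of inconsistency the text mentions. The history keeps both entries, so the charge remains auditable even after it has been compensated.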
Practical tips
Make every step idempotent. Retries are inevitable in a distributed system, and a saga that double-charges on a retry is worse than no saga at all. Use the saga id combined with a step id as the idempotency key on every external call, so that a retried step is recognized as the same request.
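A minimal sketch of idempotency keys, assuming the receiving service stores the result of each key it has already processed. `PaymentService`, `step_key`, and the key format `saga_id:step` are hypothetical choices for illustration, not any particular provider's API.

```python
from typing import Dict


class PaymentService:
    """Hypothetical payment service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.results: Dict[str, str] = {}  # idempotency key -> charge id
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges_made += 1
        charge_id = f"ch-{self.charges_made}"
        self.results[idempotency_key] = charge_id
        return charge_id


def step_key(saga_id: str, step: str) -> str:
    """Build a key that is stable across retries of the same saga step."""
    return f"{saga_id}:{step}"


payments = PaymentService()
key = step_key("saga-42", "charge-card")

first = payments.charge(key, 2500)
retry = payments.charge(key, 2500)  # e.g. the first response was lost
print(first, retry, payments.charges_made)
```

The key is derived from the saga and step rather than generated per request, so a retry after a timeout or a crash produces the same key and the card is charged once.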
Start with orchestration unless you have a strong reason to choreograph. The clarity is worth the small coupling cost, especially when the flow is more than three or four steps. Tools like Temporal, Camunda, or AWS Step Functions remove most of the implementation burden.
Log the saga state machine transitions explicitly. A saga that fails halfway through is one of the hardest things to debug if you only have per-service logs. A central saga log lets you answer “where did this order get stuck” in seconds.
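As a rough sketch of what an explicit saga log could look like, the example below records every state transition and rejects ones the state machine does not allow. The states, the `ALLOWED` table, and the `SagaLog` class are assumptions made up for this illustration; a real system would persist the log rather than keep it in memory.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Set


class SagaState(Enum):
    STARTED = "started"
    ORDER_CREATED = "order_created"
    PAYMENT_CHARGED = "payment_charged"
    COMPENSATING = "compensating"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Which transitions the (hypothetical) order saga permits.
ALLOWED: Dict[SagaState, Set[SagaState]] = {
    SagaState.STARTED: {SagaState.ORDER_CREATED, SagaState.CANCELLED},
    SagaState.ORDER_CREATED: {SagaState.PAYMENT_CHARGED, SagaState.COMPENSATING},
    SagaState.PAYMENT_CHARGED: {SagaState.COMPLETED, SagaState.COMPENSATING},
    SagaState.COMPENSATING: {SagaState.CANCELLED},
    SagaState.COMPLETED: set(),
    SagaState.CANCELLED: set(),
}


@dataclass(frozen=True)
class Transition:
    saga_id: str
    from_state: SagaState
    to_state: SagaState


class SagaLog:
    def __init__(self) -> None:
        self.transitions: List[Transition] = []
        self.current: Dict[str, SagaState] = {}

    def start(self, saga_id: str) -> None:
        self.current[saga_id] = SagaState.STARTED

    def move(self, saga_id: str, to_state: SagaState) -> None:
        from_state = self.current[saga_id]
        if to_state not in ALLOWED[from_state]:
            raise ValueError(f"{saga_id}: {from_state.value} -> {to_state.value} not allowed")
        self.transitions.append(Transition(saga_id, from_state, to_state))
        self.current[saga_id] = to_state


saga_log = SagaLog()
saga_log.start("saga-42")
saga_log.move("saga-42", SagaState.ORDER_CREATED)
saga_log.move("saga-42", SagaState.PAYMENT_CHARGED)
saga_log.move("saga-42", SagaState.COMPENSATING)  # shipping failed

print(saga_log.current["saga-42"].value)
```

Looking up `saga_log.current[saga_id]` answers the "where is it stuck" question directly, and the list of transitions shows how the saga got there.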
Wrap-up
The saga pattern replaces distributed transactions with a sequence of local ones plus compensations. The shift from “atomic rollback” to “semantic undo” is the conceptual jump that takes the most getting used to. Once you accept that, sagas become a clean and well-supported way to model long-running business workflows across services.