The Saga Pattern Explained

Intermediate · 10 min read

What you'll learn

✓ Why distributed transactions are hard
✓ How sagas replace 2PC in microservices
✓ Choreography vs orchestration
✓ How to design compensating actions

Prerequisites

• Basic microservices knowledge

The moment your business logic crosses two services, the comfortable world of database transactions disappears. The saga pattern is one of the most widely used answers to the question of how to keep multi-service operations consistent without resorting to distributed locks.

What and why

A saga is a sequence of local transactions, each committed by a single service, where every step has a defined compensating action that semantically undoes its effect. If a step fails, the saga runs the compensations of the previously completed steps in reverse order, bringing the system back to a consistent state.
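As a minimal sketch of that rule, assuming each step is just a pair of Python callables, an in-process runner might look like the following. The names `Step` and `run_saga` and the order, payment, and shipping steps are illustrative assumptions, not part of any library, and a real saga would persist its progress rather than keep it in memory.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # the local transaction
    compensation: Callable[[], None]  # the semantic undo for that transaction


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never completed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipping() -> None:
    # Simulated failure of the third step.
    raise RuntimeError("no carrier available")


steps = [
    Step("order", lambda: log.append("create order"), lambda: log.append("cancel order")),
    Step("payment", lambda: log.append("charge card"), lambda: log.append("refund card")),
    Step("shipping", fail_shipping, lambda: log.append("cancel shipment")),
]

ok = run_saga(steps)
```

Here `ok` ends up `False`, and `log` shows the refund running before the order cancellation, which is the reverse of the order in which the steps completed.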

The reason to bother is that two-phase commit, the classic distributed transaction protocol, holds locks on every participant until the coordinator decides, so it scales badly and couples services tightly. Sagas trade strict atomicity and isolation for availability and autonomy, which is the right trade for most business workflows. You give up the illusion that the whole operation is one atomic act and instead model it as a small state machine you can reason about.

Mental model

Think of a saga as a checklist with an eraser. You walk down the list, ticking off boxes one at a time. If something goes wrong halfway, you walk back up the list, erasing what you ticked. The catch is that erasing is not the same as never having ticked. A charge has already been made; an email has already been sent. Compensation means doing the visible inverse, not pretending the step never happened.

That distinction (semantic rollback rather than true rollback) is what makes saga design interesting. You are not undoing time; you are designing forward-facing actions that bring the world back to a consistent state.

Architecture

There are two common ways to coordinate the steps: choreography, where services react to each other’s events, and orchestration, where a central coordinator drives the flow.

Orchestration:
orchestrator -> order svc   -> ok
            -> payment svc -> ok
            -> shipping svc -> fail
            -> compensate payment, compensate order

Choreography:
order svc -> emits OrderCreated
payment svc -> reacts, emits PaymentCharged
shipping svc -> reacts, emits ShippingFailed
payment svc -> reacts to ShippingFailed, refunds
order svc -> reacts to PaymentRefunded, cancels

Saga orchestration vs choreography

Orchestration centralizes the workflow in one place. The orchestrator is a service (or a workflow engine like Temporal) that calls each step in order, listens for the result, and decides the next move. The flow is easy to read and debug because the entire saga is defined in one place.

Choreography distributes the workflow across services connected by events. Each service listens for events it cares about, performs its local transaction, and emits the next event. There is no central coordinator. The flow is decoupled but harder to trace because it is implicit in the event topology.
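To make the event-driven flow concrete, here is a rough sketch that replays the choreography from the diagram on a toy in-memory bus. `EventBus` is a hypothetical stand-in for a real broker, delivery is synchronous for simplicity, and the handlers only simulate each service's local transaction.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical in-memory bus; a real system would use a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.history.append(event_type)
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()
orders: Dict[str, str] = {}  # order_id -> status, owned by the order service


def create_order(order_id: str) -> None:
    orders[order_id] = "pending"
    bus.publish("OrderCreated", {"order_id": order_id})


def on_order_created(event: dict) -> None:
    # Payment service: charge locally, then announce it.
    bus.publish("PaymentCharged", event)


def on_payment_charged(event: dict) -> None:
    # Shipping service: simulate a failure to book a carrier.
    bus.publish("ShippingFailed", event)


def on_shipping_failed(event: dict) -> None:
    # Payment service: compensate by refunding.
    bus.publish("PaymentRefunded", event)


def on_payment_refunded(event: dict) -> None:
    # Order service: compensate by cancelling the order.
    orders[event["order_id"]] = "cancelled"


bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("PaymentCharged", on_payment_charged)
bus.subscribe("ShippingFailed", on_shipping_failed)
bus.subscribe("PaymentRefunded", on_payment_refunded)

create_order("o-1")
```

Notice that no single function describes the whole saga. The sequence only becomes visible in `bus.history`, which is the tracing problem in miniature.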

Trade-offs

Orchestration is easier to understand and to change. New steps go into one place. Failure handling is explicit. The downside is that the orchestrator becomes a critical dependency and can drift into being a god service if you let it.

Choreography is more decoupled and resilient to coordinator outages, but the saga as a whole exists only in the heads of the engineers who designed it. Adding a step means touching every service that needs to react. Debugging a stuck saga often means tracing events across logs.

A second trade-off is around compensation. Some actions are not naturally reversible. You cannot un-send an email, but you can send a correction. You cannot un-charge a card, but you can issue a refund. Design compensations as new business actions, not as rollbacks, and accept that the user may see a brief window of inconsistency.
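One way to picture "compensation as a new business action" is an append-only ledger, sketched below under the assumption that payments are recorded as immutable entries. The `LedgerEntry` type and the `charge`, `refund`, and `balance` helpers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LedgerEntry:
    kind: str          # "charge" or "refund"
    order_id: str
    amount_cents: int  # positive for charges, negative for refunds


ledger: List[LedgerEntry] = []


def balance(order_id: str) -> int:
    """Net amount currently charged for an order."""
    return sum(e.amount_cents for e in ledger if e.order_id == order_id)


def charge(order_id: str, amount_cents: int) -> None:
    ledger.append(LedgerEntry("charge", order_id, amount_cents))


def refund(order_id: str) -> None:
    """Compensate by appending a refund entry; the charge stays in history."""
    net = balance(order_id)
    if net > 0:
        ledger.append(LedgerEntry("refund", order_id, -net))


charge("o-1", 2500)
refund("o-1")
```

The charge is never deleted. The refund is a second entry that brings the balance back to zero, and calling `refund` again adds nothing because there is no net charge left to reverse.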

Practical tips

Make every step idempotent. Retries are inevitable in a distributed system, and a saga that double-charges on a retry is worse than no saga at all. Use the saga id combined with the step id as the idempotency key on every external call.
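A minimal sketch of the idea, assuming the downstream service remembers the result it returned for each key. `PaymentService` is a hypothetical in-memory stand-in, not a real provider's API, and `step_key` is one possible way to derive a key.

```python
from typing import Dict


class PaymentService:
    """Hypothetical stand-in for a payment provider that honors idempotency keys."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}  # idempotency key -> charge id
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        charge_id = f"ch_{self.charges_made}"
        self._results[idempotency_key] = charge_id
        return charge_id


def step_key(saga_id: str, step_name: str) -> str:
    """Derive a key that is stable across retries of the same saga step."""
    return f"{saga_id}:{step_name}"


payments = PaymentService()
key = step_key("saga-42", "charge-card")

first = payments.charge(key, 2500)
retry = payments.charge(key, 2500)  # simulated network retry of the same step
```

Both calls return the same charge id and `payments.charges_made` stays at 1, so the retry is harmless.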

Start with orchestration unless you have a strong reason to choreograph. The clarity is worth the small coupling cost, especially when the flow is more than three or four steps. Tools like Temporal, Camunda, or AWS Step Functions remove most of the implementation burden.

Log the saga state machine transitions explicitly. A saga that fails halfway through is one of the hardest things to debug if you only have per-service logs. A central saga log lets you answer “where did this order get stuck” in seconds.
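As an illustration, a central saga log can be as simple as an append-only list of transitions keyed by saga id. The `SagaLog` class and the status names below are assumptions made for this sketch; in practice the log would live in a database or come from your workflow engine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transition:
    saga_id: str
    step: str
    status: str  # e.g. "started", "succeeded", "failed", "compensated"


class SagaLog:
    """Hypothetical in-memory saga log; a real one would be durable."""

    def __init__(self) -> None:
        self._entries: List[Transition] = []

    def record(self, saga_id: str, step: str, status: str) -> None:
        self._entries.append(Transition(saga_id, step, status))

    def history(self, saga_id: str) -> List[Transition]:
        return [t for t in self._entries if t.saga_id == saga_id]

    def last(self, saga_id: str) -> Optional[Transition]:
        """Most recent transition, i.e. where the saga currently stands."""
        entries = self.history(saga_id)
        return entries[-1] if entries else None


saga_log = SagaLog()
saga_log.record("saga-42", "order", "started")
saga_log.record("saga-42", "order", "succeeded")
saga_log.record("saga-42", "payment", "started")
saga_log.record("saga-7", "order", "started")

stuck = saga_log.last("saga-42")
```

Here `stuck` points at the payment step in the "started" state, which answers the question without opening any per-service log.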

Wrap-up

The saga pattern replaces distributed transactions with a sequence of local ones plus compensations. The shift from “atomic rollback” to “semantic undo” is the conceptual jump that takes the most getting used to. Once you accept that, sagas become a clean and well-supported way to model long-running business workflows across services.