AWS Step Functions Tutorial: Orchestrating Serverless Workflows
Learn how AWS Step Functions coordinates Lambda, ECS, and SDK calls into reliable state machines, with patterns for retries, parallelism, and error handling.
What you'll learn
- What Step Functions is and the problems it solves
- Standard vs Express workflow trade-offs
- How to write a state machine in ASL
- Retry, catch, and parallel patterns
- Production tips for observability and cost
Prerequisites
- Familiarity with the shell
- Basic AWS Lambda knowledge
What and Why
AWS Step Functions is a managed orchestration service that lets you describe a workflow as a state machine in JSON. Each state can invoke a Lambda function, make an SDK call, run an ECS task, or wait for human approval. The service handles retries, branching, parallelism, and long waits without you running a worker process.
Without Step Functions, teams often glue Lambdas together using SQS queues, EventBridge rules, and home-grown idempotency tables. That works, but the workflow is implicit, scattered across several services and console pages, and hard to visualize. Step Functions makes the workflow a first-class object with a built-in execution history.
Mental Model
A state machine is a directed graph of states. Execution starts at the state named in StartAt and follows transitions until it reaches a state marked "End": true (or a terminal Succeed or Fail state), or until an unhandled error fails the execution. Input flows in, each state can transform it, and its output becomes the input of the next state. The Amazon States Language (ASL) is the JSON dialect you write.
There are two workflow types. Standard workflows run for up to a year, execute each step exactly once, and cost roughly twenty-five dollars per million state transitions. Express workflows run for up to five minutes, are at-least-once when started asynchronously (at-most-once when invoked synchronously), and are billed by request count and duration rather than per transition, which usually costs a small fraction at high volume. Use Standard for business processes and Express for high-volume request pipelines.
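To make the graph idea concrete, here is a minimal sketch that builds a two-state definition as a Python dictionary and walks its transitions. The state names (Greet, Done) and the `walk` helper are illustrative assumptions, not part of any AWS API, and the walker only handles a linear chain of Next transitions.

```python
import json

# Hypothetical two-state machine; the names and Pass states are illustrative.
definition = {
    "Comment": "Minimal sketch of a state machine",
    "StartAt": "Greet",
    "States": {
        "Greet": {"Type": "Pass", "Result": {"message": "hello"}, "Next": "Done"},
        "Done": {"Type": "Pass", "End": True},
    },
}


def walk(defn):
    """Follow Next transitions from StartAt until a state with End: true."""
    order = []
    name = defn["StartAt"]
    while True:
        order.append(name)
        state = defn["States"][name]
        if state.get("End"):
            return order
        name = state["Next"]


print(walk(definition))                 # visit order of the states
print(json.dumps(definition, indent=2))  # the ASL document you would deploy
```

Serializing the dictionary with `json.dumps` yields the ASL text, so the same structure can be reviewed in code and deployed as JSON.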
Hands-on Example
Suppose a user uploads a video. You want to transcode it, generate thumbnails in parallel, store metadata, and notify the user. A Step Functions state machine makes this explicit.
        +------------------+
        |  Start: S3 PUT   |
        +---------+--------+
                  |
        +---------v--------+
        | Validate (Lambda)|
        +---------+--------+
                  |
        +---------v--------+
        |     Parallel     |
        +----+--------+----+
             |        |
   +---------v--+  +--v---------+
   | Transcode  |  | Thumbnails |
   +---------+--+  +--+---------+
             |        |
        +----v--------v----+
        |  Save Metadata   |
        +---------+--------+
                  |
        +---------v--------+
        |    Notify SNS    |
        +---------+--------+
                  |
                [End]
A minimal ASL snippet for the parallel step looks like this:
{
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "Transcode",
      "States": {
        "Transcode": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "transcode", "Payload.$": "$" },
          "End": true
        }
      }
    },
    {
      "StartAt": "Thumbnails",
      "States": {
        "Thumbnails": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "thumbnails", "Payload.$": "$" },
          "End": true
        }
      }
    }
  ],
  "Next": "SaveMetadata"
}
The execution UI shows each branch, its duration, its input, and its output. Debugging a stuck workflow becomes a matter of clicking a step instead of grepping logs.
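The Parallel state above can be placed inside a complete definition for the video pipeline. The following is a sketch under several assumptions: the function names `validate` and `save-metadata`, the state name ProcessMedia, the SNS topic ARN, and the helper functions are all hypothetical, and `OutputPath` is used so each Lambda's return value (rather than the invoke response wrapper) flows to the next state.

```python
import json


def lambda_task(function_name, next_state=None):
    """Build a Task state that invokes a Lambda with the full state input."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "OutputPath": "$.Payload",  # pass on only the function's return value
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def branch(state_name, function_name):
    """A single-state branch for use inside a Parallel state."""
    return {"StartAt": state_name, "States": {state_name: lambda_task(function_name)}}


pipeline = {
    "StartAt": "Validate",
    "States": {
        "Validate": lambda_task("validate", next_state="ProcessMedia"),
        "ProcessMedia": {
            "Type": "Parallel",
            "Branches": [
                branch("Transcode", "transcode"),
                branch("Thumbnails", "thumbnails"),
            ],
            "Next": "SaveMetadata",
        },
        "SaveMetadata": lambda_task("save-metadata", next_state="Notify"),
        "Notify": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                # Placeholder topic ARN; replace with a real one.
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:video-ready",
                "Message": "Your video is ready",
            },
            "End": True,
        },
    },
}

print(json.dumps(pipeline, indent=2))
```

A Parallel state outputs an array with one element per branch, so in this sketch SaveMetadata would receive a two-element list.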
Common Pitfalls
The most common pitfall is forgetting that state input and output are size-limited (256 KB). Large payloads must be passed by reference using S3 keys or DynamoDB IDs, not embedded. Workflows that pass entire images or PDFs between steps eventually fail under load.
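A rough way to see the difference between embedding and referencing is to measure the serialized payload. This sketch assumes the 256 KB figure quoted above and uses hypothetical bucket and key names; the real service measures the encoded state data, so treat the check as an approximation.

```python
import json

MAX_STATE_BYTES = 256 * 1024  # limit quoted in this article


def payload_size(payload):
    """Size in bytes of the payload serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_in_state(payload):
    """True if the serialized payload is within the assumed limit."""
    return payload_size(payload) <= MAX_STATE_BYTES


# Embedding file contents: the payload grows with the file.
embedded = {"videoId": "v-123", "frames": "x" * 300_000}

# Passing by reference: the payload stays tiny regardless of file size.
by_reference = {"videoId": "v-123", "bucket": "uploads-bucket", "key": "raw/v-123.mp4"}

print(fits_in_state(embedded))      # False
print(fits_in_state(by_reference))  # True
```

Each task then reads the object from S3 itself, so only the pointer travels through the state machine.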
Another pitfall is stacking retries at more than one layer. If the Lambda side retries three times (in handler code, in an SDK client, or through its invocation settings) and the Step Functions task is also configured with three retries, each layer makes up to four attempts, so a single failure can run the underlying operation up to sixteen times. Pick one place to own retry policy, usually the state machine, and turn retries off at the other layer.
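Owning retry policy in the state machine means putting Retry and Catch on the task state. The sketch below shows one plausible shape; the NotifyFailure state name and the specific interval and backoff values are assumptions chosen for illustration. The helper computes the wait before each retry as the interval multiplied by the backoff rate raised to the retry index.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Wait before each retry: interval * backoff_rate ** n for n = 0..max_attempts-1."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


retry_policy = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,      # number of retries after the first attempt
    "BackoffRate": 2.0,
}

transcode_task = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "transcode", "Payload.$": "$"},
    "Retry": [retry_policy],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",   # keep the original input, add error details
            "Next": "NotifyFailure",   # hypothetical failure-handling state
        }
    ],
    "Next": "SaveMetadata",
}

delays = retry_delays(
    retry_policy["IntervalSeconds"],
    retry_policy["MaxAttempts"],
    retry_policy["BackoffRate"],
)
total_attempts = 1 + retry_policy["MaxAttempts"]

print(delays)          # [2.0, 4.0, 8.0]
print(total_attempts)  # 4
```

Catch only fires after the retries are exhausted, which is why the failure path belongs in the same state as the retry policy.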
Engineers also underestimate state transition costs. A Standard workflow with hundreds of Map iterations can become surprisingly expensive. Express workflows or batching reduce the bill.
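A back-of-the-envelope estimate makes the Map cost visible. This sketch uses the Standard price quoted earlier in this article (about twenty-five dollars per million transitions) and a simplified transition count; the workload numbers are made up, and actual billing also counts retries and other transitions this model ignores.

```python
# USD per state transition, from the per-million figure quoted in this article.
PRICE_PER_TRANSITION = 25.0 / 1_000_000


def transitions_with_map(fixed_states, map_items, states_per_item):
    """Approximate transitions for one execution containing a Map state."""
    return fixed_states + map_items * states_per_item


def standard_cost(executions, transitions_per_execution):
    """Approximate Standard workflow cost in USD."""
    return executions * transitions_per_execution * PRICE_PER_TRANSITION


# Hypothetical workload: 5 fixed states plus a Map over 500 items, 3 states each.
per_execution = transitions_with_map(fixed_states=5, map_items=500, states_per_item=3)
monthly = standard_cost(executions=100_000, transitions_per_execution=per_execution)

print(per_execution)      # 1505
print(round(monthly, 2))  # 3762.5
```

Under these assumptions the Map accounts for almost all of the bill, which is why batching items or moving the inner loop to an Express workflow pays off.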
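Finally, do not store secrets in state input. Anything passed between states is recorded in the execution history, which Standard workflows retain for up to ninety days after the execution ends. Pass a reference to the secret instead and let the task fetch it.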
Production Tips
Version your state machines. Use the publish-version feature so deploys do not break in-flight executions. Combine this with aliases for blue-green rollouts.
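A gradual rollout with an alias amounts to splitting traffic between two published versions. The helper below is a sketch: the field names are intended to mirror the routing configuration accepted when creating or updating a state machine alias, but verify them against the current API reference. The ARNs and the `canary_routing` name are hypothetical.

```python
def canary_routing(stable_version_arn, new_version_arn, new_weight):
    """Build alias routing entries sending new_weight percent to the new version."""
    if not 1 <= new_weight <= 100:
        raise ValueError("new_weight must be between 1 and 100")
    if new_weight == 100:
        # Full cutover: route everything to the new version.
        return [{"stateMachineVersionArn": new_version_arn, "weight": 100}]
    return [
        {"stateMachineVersionArn": new_version_arn, "weight": new_weight},
        {"stateMachineVersionArn": stable_version_arn, "weight": 100 - new_weight},
    ]


# Placeholder version ARNs for illustration only.
STABLE = "arn:aws:states:us-east-1:123456789012:stateMachine:video-pipeline:1"
NEW = "arn:aws:states:us-east-1:123456789012:stateMachine:video-pipeline:2"

routing = canary_routing(STABLE, NEW, new_weight=10)
print(routing)
```

Starting executions through the alias ARN rather than the state machine ARN is what makes the weights take effect; executions already running keep the version they started with.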
Use ResultSelector and ResultPath deliberately. They control what is kept from a task’s output and where it lands in the running state. Without them, state grows on every step and hits the 256 KB ceiling sooner than expected.
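The effect of the two fields can be imitated in a few lines. This is a deliberately simplified simulation, not the service's implementation: it supports only dotted paths such as `$.Payload.id` and a ResultPath of the form `$.name`, and the sample task result is invented.

```python
def get_path(doc, path):
    """Resolve a simple '$.a.b' path; '$' alone returns the whole document."""
    node = doc
    for part in path.split(".")[1:]:
        node = node[part]
    return node


def apply_result_selector(result, selector):
    """Keep only selected fields; keys ending in '.$' are paths into the result."""
    out = {}
    for key, value in selector.items():
        if key.endswith(".$"):
            out[key[:-2]] = get_path(result, value)
        else:
            out[key] = value
    return out


def apply_result_path(state_input, result, result_path):
    """Place the result under a top-level key such as '$.metadata'."""
    merged = dict(state_input)
    merged[result_path.split(".")[1]] = result
    return merged


state_input = {"videoId": "v-123"}
raw_result = {  # invented example of a verbose task result
    "StatusCode": 200,
    "ExecutedVersion": "$LATEST",
    "Payload": {"id": "m-9", "debug": "x" * 1000},
}

selected = apply_result_selector(raw_result, {"metadataId.$": "$.Payload.id"})
output = apply_result_path(state_input, selected, "$.metadata")

print(output)  # {'videoId': 'v-123', 'metadata': {'metadataId': 'm-9'}}
```

Only the one field the next state needs is carried forward, so the kilobyte of debug data never enters the running state.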
Wire executions into CloudWatch metrics and X-Ray. The ExecutionsFailed and ExecutionsTimedOut metrics deserve alarms. For Express workflows, enable logging at ERROR level by default and ALL when debugging.
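The alarm and logging settings can be kept as plain data and handed to whatever deployment tool you use. The sketch below builds dictionaries meant to match the keyword arguments of CloudWatch's PutMetricAlarm and the logging configuration of a state machine; the helper names, thresholds, and ARNs are assumptions, and the field names should be checked against the current API references.

```python
def failure_alarm(state_machine_arn, metric_name):
    """Alarm parameters that fire on at least one event in five minutes."""
    machine_name = state_machine_arn.rsplit(":", 1)[-1]
    return {
        "AlarmName": f"{metric_name}-{machine_name}",
        "Namespace": "AWS/States",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


def express_logging(log_group_arn, debugging=False):
    """ERROR-level logging by default, ALL with execution data when debugging."""
    return {
        "level": "ALL" if debugging else "ERROR",
        "includeExecutionData": debugging,
        "destinations": [{"cloudWatchLogsLogGroup": {"logGroupArn": log_group_arn}}],
    }


# Placeholder ARNs for illustration only.
MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:video-pipeline"
LOG_GROUP_ARN = "arn:aws:logs:us-east-1:123456789012:log-group:/sfn/video-pipeline:*"

alarms = [failure_alarm(MACHINE_ARN, m) for m in ("ExecutionsFailed", "ExecutionsTimedOut")]
logging_config = express_logging(LOG_GROUP_ARN)

print([a["AlarmName"] for a in alarms])
print(logging_config["level"])
```

Keeping the debugging switch in one function makes it easy to raise the log level temporarily and then return to ERROR.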
Prefer the AWS SDK integrations (arn:aws:states:::aws-sdk:...) over wrapping every call in a Lambda. Calling DynamoDB or S3 directly from the state machine removes a layer of code and a layer of cost.
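As an illustration of a direct integration, the following sketch defines a task state that reads an item from DynamoDB without a Lambda in between. The table name, key attribute, and state placement are hypothetical, and the state machine's IAM role would need permission for the call.

```python
import json

# Hypothetical task state: fetch one item from DynamoDB via the SDK integration.
get_metadata = {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:dynamodb:getItem",
    "Parameters": {
        "TableName": "VideoMetadata",                # assumed table name
        "Key": {"videoId": {"S.$": "$.videoId"}},    # key value taken from state input
    },
    "ResultPath": "$.metadata",  # keep the input and add the response beside it
    "End": True,
}

print(json.dumps(get_metadata, indent=2))
```

The parameters follow the shape of the underlying DynamoDB request, so there is no handler code to write, package, or pay for.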
Wrap-up
Step Functions turns implicit, glue-based workflows into explicit state machines with visualization, retries, and history out of the box. Use Standard workflows for long, exactly-once business processes and Express for high-throughput pipelines. Keep payloads small, centralize retry policy, and lean on direct SDK integrations to keep code minimal.
Start small. Convert one fragile chain of Lambda-to-SQS-to-Lambda into a state machine and you will see the value immediately the next time something fails at 3 a.m.