AWS Step Functions
Step Functions is AWS's managed workflow orchestrator. You define a workflow as a state machine (named states, transitions, parallel branches, error handlers, retries) in Amazon States Language, a JSON-based language that some tooling also lets you author as YAML. Step Functions runs it, calls AWS services at each step, tracks progress, and shows a visual execution history.
It replaces the Lambda whose job is "call A, wait, call B, retry on failure, call C if both succeed" — with better observability and far less custom code.
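As a rough illustration of that replacement, the sketch below builds such a workflow as an Amazon States Language definition from a Python dict. The Lambda ARNs, state names, and the 30-second wait are hypothetical placeholders, and `linear_path` is a helper written only for this example.

```python
import json

# Hypothetical Lambda ARNs (assumptions for illustration only).
LAMBDA_A = "arn:aws:lambda:us-east-1:123456789012:function:call-a"
LAMBDA_B = "arn:aws:lambda:us-east-1:123456789012:function:call-b"
LAMBDA_C = "arn:aws:lambda:us-east-1:123456789012:function:call-c"

# "Call A, wait, call B, retry on failure, call C if both succeed".
# A failed Task with no matching Retry or Catch fails the execution,
# so CallC is only reached when CallA and CallB both succeed.
definition = {
    "Comment": "Call A, wait, call B with retries, then call C",
    "StartAt": "CallA",
    "States": {
        "CallA": {"Type": "Task", "Resource": LAMBDA_A, "Next": "WaitBeforeB"},
        "WaitBeforeB": {"Type": "Wait", "Seconds": 30, "Next": "CallB"},
        "CallB": {
            "Type": "Task",
            "Resource": LAMBDA_B,
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 3}],
            "Next": "CallC",
        },
        "CallC": {"Type": "Task", "Resource": LAMBDA_C, "End": True},
    },
}


def linear_path(defn):
    """Follow Next pointers from StartAt (only valid for branch-free workflows)."""
    path = []
    name = defn["StartAt"]
    while name is not None:
        path.append(name)
        name = defn["States"][name].get("Next")
    return path


if __name__ == "__main__":
    print(" -> ".join(linear_path(definition)))
    print(json.dumps(definition, indent=2))
```

The JSON printed at the end is what you would supply as the state machine definition; the sequencing, waiting, and retrying live in that document rather than in function code.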
Standard vs Express and States
Standard workflows are durable and long-running (up to one year), exactly-once, with 90 days of execution history — for order processing and multi-day approvals. Express workflows are short (up to 5 minutes), high-volume, and much cheaper — for high-frequency event processing. Their execution guarantee depends on how you call them: asynchronous Express is at-least-once (make steps idempotent), synchronous Express is at-most-once. You pick Standard vs Express at creation and cannot convert.
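The selection rule above can be sketched as a small helper. This is a simplified illustration that considers only the two constraints named in the text (maximum duration and the need for exactly-once execution); the function name and parameters are hypothetical, and real decisions would also weigh cost and history retention.

```python
# Limits taken from the text above.
EXPRESS_MAX_SECONDS = 5 * 60                 # Express: up to 5 minutes
STANDARD_MAX_SECONDS = 365 * 24 * 60 * 60    # Standard: up to one year


def choose_workflow_type(max_duration_seconds: int, needs_exactly_once: bool) -> str:
    """Return "STANDARD" or "EXPRESS" for a workflow (hypothetical helper)."""
    if max_duration_seconds > STANDARD_MAX_SECONDS:
        raise ValueError("longer than a Standard workflow can run")
    # Exactly-once execution is only offered by Standard workflows.
    if needs_exactly_once or max_duration_seconds > EXPRESS_MAX_SECONDS:
        return "STANDARD"
    return "EXPRESS"


if __name__ == "__main__":
    print(choose_workflow_type(30, needs_exactly_once=False))
    print(choose_workflow_type(3 * 24 * 3600, needs_exactly_once=False))
```

Because the type cannot be changed after creation, a check like this belongs at design time rather than at runtime.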
State types include Task (do something), Choice (branch), Parallel, Map (per-item iteration), Wait, and Pass. Task states integrate directly with many AWS services, so you call SQS, DynamoDB, or ECS without an intermediary Lambda.
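A minimal sketch of a Choice state feeding a Task that calls SQS through a direct service integration, with no Lambda in between. The queue URL, the `$.amount` and `$.orderId` input fields, the threshold, and the state names are assumptions made for illustration; `direct_integration_states` is a helper defined only for this example.

```python
import json

definition = {
    "StartAt": "CheckAmount",
    "States": {
        # Choice: branch on a field of the state input.
        "CheckAmount": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.amount",
                    "NumericGreaterThan": 100,
                    "Next": "QueueForReview",
                }
            ],
            "Default": "Done",
        },
        # Task: the sqs:sendMessage integration is called by Step Functions itself.
        "QueueForReview": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage",
            "Parameters": {
                # Hypothetical queue URL.
                "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/review-queue",
                "MessageBody.$": "$.orderId",
            },
            "Next": "Done",
        },
        # Pass: does no work, just ends the workflow here.
        "Done": {"Type": "Pass", "End": True},
    },
}


def direct_integration_states(defn):
    """Names of Task states whose Resource is a Step Functions service integration."""
    return [
        name
        for name, state in defn["States"].items()
        if state["Type"] == "Task"
        and state["Resource"].startswith("arn:aws:states:::")
    ]


if __name__ == "__main__":
    print(direct_integration_states(definition))
    print(json.dumps(definition, indent=2))
```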
Error Handling and Distributed Map
Every Task can declare Retry rules (retry on these errors with exponential backoff) and Catch rules (jump to a fallback state on failure). Declaring this once in the state machine — instead of in every Lambda — centralizes and surfaces the error logic.
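To make the Retry and Catch rules concrete, here is a sketch of a single Task state that declares both, plus a small helper that works out the wait before each retry. The table name, the `NotifyFailure` and `Done` state names, and the specific numbers are hypothetical; the helper ignores the optional `MaxDelaySeconds` and `JitterStrategy` fields for simplicity.

```python
task_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "orders",  # hypothetical table
        "Item": {"orderId": {"S.$": "$.orderId"}},
    },
    # Retry: on a task failure, try again up to 3 times with exponential backoff.
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    # Catch: once retries are exhausted (or for any other error),
    # record the error in the state data and jump to a fallback state.
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": "NotifyFailure",
        }
    ],
    "Next": "Done",
}


def retry_delays(retrier):
    """Seconds waited before each retry attempt, without jitter or a max delay."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    return [interval * rate ** n for n in range(attempts)]


if __name__ == "__main__":
    print(retry_delays(task_state["Retry"][0]))  # [2.0, 4.0, 8.0]
```

With these values the task is retried after 2, 4, and 8 seconds; if the third retry also fails, the Catch rule routes the execution to the fallback state.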
Distributed Map reads a list of items from S3 and processes each item in its own child workflow execution, with configurable concurrency of up to 10,000 child executions running in parallel. It replaces custom EC2 schedulers or complex SQS-and-Lambda fan-out for large batch jobs.
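A sketch of what a Distributed Map state can look like, again built as a Python dict. The bucket, key, Lambda ARN, and the concurrency value of 1,000 are hypothetical; `check_distributed_map` is a helper written for this example and only checks the two properties discussed above.

```python
MAX_CHILD_CONCURRENCY = 10_000  # upper bound mentioned in the text

distributed_map = {
    "Type": "Map",
    # ItemReader: pull the list of items from a JSON file in S3.
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": {"InputType": "JSON"},
        "Parameters": {
            "Bucket": "example-batch-bucket",  # hypothetical bucket
            "Key": "items.json",               # hypothetical key
        },
    },
    # ItemProcessor: the sub-workflow run once per item, as a child execution.
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ProcessItem",
        "States": {
            "ProcessItem": {
                "Type": "Task",
                # Hypothetical Lambda ARN.
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
                "End": True,
            }
        },
    },
    "MaxConcurrency": 1000,
    "End": True,
}


def check_distributed_map(state):
    """Raise ValueError if the state is not a Distributed Map within the limit."""
    mode = state.get("ItemProcessor", {}).get("ProcessorConfig", {}).get("Mode")
    if state.get("Type") != "Map" or mode != "DISTRIBUTED":
        raise ValueError("not a Distributed Map state")
    if state.get("MaxConcurrency", 0) > MAX_CHILD_CONCURRENCY:
        raise ValueError("MaxConcurrency exceeds 10,000 child executions")
    return True


if __name__ == "__main__":
    print(check_distributed_map(distributed_map))
```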
When to Use What
Step Functions: ordered, branching, retryable multi-step workflows with visible state and error handling.
EventBridge: fire-and-forget event routing, not ordered step-by-step flows.
Plain Lambda + SQS: simple high-frequency work that does not need workflow semantics; often the cheaper option.
Common Mistakes
- Using Standard workflows for high-volume short work where Express is an order of magnitude cheaper.
- Wrapping every AWS-service call in a Lambda instead of using Step Functions' direct service integrations.
- Re-implementing retry and catch logic in each Lambda instead of declaring it once in the state machine.
- Building custom fan-out for large batches instead of using distributed Map with S3 input.
- Reaching for Step Functions for a simple request/response API, where API Gateway plus Lambda is enough.
- Running weeks-long human-approval workflows on Step Functions Standard, where cost and missing human-task features hurt.
Best Practices
- Use Express workflows for high-volume short work and Standard for durable, long-running flows.
- Use direct AWS-service integrations instead of wrapping each call in a Lambda.
- Declare Retry and Catch rules in the state machine, not in each Lambda.
- Use Map — and distributed Map with S3 input — for large parallel batches.
- Version state machines with built-in versions and aliases.
- Monitor execution duration and failures in CloudWatch.
Knowledge Check
What is the difference between Standard and Express workflows?
- Standard is durable, long-running, exactly-once; Express is short, high-volume, and cheaper
- Express runs for up to a year, while Standard caps at 5 minutes
- Standard is cheaper at high volume because it bills per state transition instead of by duration
- Only Express supports Retry and Catch on Task states
What advantage do Step Functions' direct service integrations give?
- Task states can call SQS, DynamoDB, ECS, and others without an intermediary Lambda
- They automatically make every workflow an Express type
- They remove the need for IAM permissions on the role
- They guarantee sub-millisecond latency on every direct call to an integrated AWS service
Where should retry and catch logic live in a Step Functions workflow?
- Declared once in the state machine, centralizing and surfacing the error handling
- Inside each individual Lambda function that the workflow happens to invoke at runtime
- In an EventBridge rule that watches the workflow
- In the SQS dead-letter queue attached downstream
What does distributed Map enable?
- Processing items from S3 in up to 10,000 parallel child executions, replacing custom batch fan-out
- Converting a Standard workflow to an Express workflow on the fly at runtime
- Synchronous request/response handling for a low-latency API front end
- Storing the workflow's full execution state in a DynamoDB table for later durability