
Lambda Durable Functions Are Not Step Functions Replacements
AWS Lambda Durable Functions look like Step Functions killers. They're not. Here's when each one wins, what the checkpoint-and-replay model actually costs, and the architectural patterns I'd use in production.
Every AWS team I've talked to since re:Invent 2025 has the same question: "Should we migrate our Step Functions to Lambda Durable Functions?"
The answer is no. Not yet. And probably not all of them.
What Lambda Durable Functions actually are
Lambda Durable Functions shipped at re:Invent 2025 as a new execution mode for Lambda. The pitch: write stateful, long-running workflows in regular code instead of JSON state machines. Enable "durable execution" on a function, and AWS handles checkpointing, failure recovery, and suspension for up to a year.
The 15-minute Lambda timeout hasn't changed. The physical container still dies at 900 seconds. What AWS built is a checkpoint-and-replay runtime that serializes your function's progress, then spins up a new container to pick up where you left off.
Think of it as a relay race. Each Lambda invocation runs until it hits a checkpoint, serializes state, and hands the baton to the next invocation.
import { task, sleep, waitForEvent, Duration } from '@aws-sdk/lambda-durable';

export async function handler(event: OnboardingEvent) {
  // Step 1: Create account
  const account = await task('createAccount', () =>
    accountService.create(event.email)
  );

  // Step 2: Wait for email verification (up to 7 days)
  const verification = await waitForEvent('email-verified', {
    timeout: Duration.days(7)
  });

  if (!verification) {
    await task('sendReminder', () =>
      notificationService.remind(account.id)
    );
    return { status: 'abandoned' };
  }

  // Step 3: Provision resources
  await task('provisionResources', () =>
    resourceService.provision(account.id)
  );

  return { status: 'complete', accountId: account.id };
}
Each task() call is a checkpoint. If the container dies between steps 2 and 3, the next invocation replays the function from the start, but skips steps 1 and 2 because their results are already persisted. The waitForEvent call suspends execution entirely, costing you nothing while you wait for the user to click a verification link.
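The replay rule is easier to hold in your head with a toy model. The sketch below is not the real runtime — the actual durable store persists serialized results across containers — but it captures the semantics described above: task() returns a cached result when one exists and only executes the work the first time. All names here are illustrative.

```typescript
// Toy model of checkpoint-and-replay semantics (not the real runtime).
// A durable store maps checkpoint names to saved results. Each
// (re)invocation runs the handler from the top; task() short-circuits
// to the cached result for any checkpoint that already completed.
type Store = Map<string, unknown>;

function makeRuntime(store: Store) {
  return function task<T>(name: string, work: () => T): T {
    if (store.has(name)) return store.get(name) as T; // replay: skip work
    const result = work();
    store.set(name, result); // checkpoint before moving on
    return result;
  };
}

// Simulate a container dying after step A: the second invocation
// replays A from the store and only executes step B.
const store: Store = new Map();
const executed: string[] = [];

function handler(crashAfterA: boolean): string {
  const task = makeRuntime(store);
  const a = task('stepA', () => { executed.push('A'); return 'a-result'; });
  if (crashAfterA) throw new Error('container died');
  return task('stepB', () => { executed.push('B'); return a + '+b'; });
}

try { handler(true); } catch { /* first invocation crashed */ }
const result = handler(false); // replay: stepA skipped, stepB runs
// executed is ['A', 'B'] — stepA's work ran exactly once
```

The point of the model: the function body re-executes on every invocation, but side effects wrapped in task() do not.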
This is genuinely powerful. It eliminates the SQS-chaining workarounds we've all built. It removes the state management boilerplate that turns simple workflows into spaghetti.
But it's not a Step Functions replacement.
Where Step Functions still win
Step Functions have been around since 2016. They're battle-tested at serious scale, and they solve problems that Durable Functions don't touch.
Visual debugging and observability
Step Functions give you a visual execution graph. Every state transition is visible in the console. You can see exactly where a workflow failed, what input it received, and what output it produced. Product managers can look at the graph and understand the flow.
Durable Functions give you CloudWatch logs. That's it. Debugging a replayed function means reading through logs and understanding which invocation ran which checkpoint. For complex workflows with branching logic, this gets painful fast.
Parallel execution with fan-out
Step Functions handle parallel branches natively with Map and Parallel states. You can fan out to 10,000 concurrent executions with a single state definition:
{
  "Type": "Map",
  "ItemsPath": "$.orders",
  "MaxConcurrency": 100,
  "Iterator": {
    "StartAt": "ProcessOrder",
    "States": {
      "ProcessOrder": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:...:processOrder",
        "End": true
      }
    }
  }
}
Durable Functions can do parallel work, but you're writing the concurrency control yourself. You're managing Promise.all(), handling partial failures, implementing retry logic per branch. The SDK gives you primitives, not orchestration.
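For a sense of what "writing the concurrency control yourself" means, here's a sketch of the hand-rolled equivalent of a Map state's MaxConcurrency and per-item retry. fanOut and its options are hypothetical names, not part of any SDK; the worker callback stands in for whatever per-item task() call you'd make.

```typescript
// Hand-rolled fan-out: bounded concurrency plus per-item retries —
// the things a Map state gives you declaratively.
type Outcome<R> = { status: 'ok'; value: R } | { status: 'failed'; error: unknown };

async function fanOut<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  opts: { maxConcurrency: number; retries: number },
): Promise<Outcome<R>[]> {
  const results: Outcome<R>[] = new Array(items.length);
  let next = 0; // shared cursor the lanes pull from

  async function runOne(i: number): Promise<void> {
    for (let attempt = 0; ; attempt++) {
      try {
        results[i] = { status: 'ok', value: await worker(items[i]) };
        return;
      } catch (error) {
        // exhaust retries, then record a partial failure instead of throwing
        if (attempt >= opts.retries) {
          results[i] = { status: 'failed', error };
          return;
        }
      }
    }
  }

  // spin up at most maxConcurrency lanes; each pulls the next item index
  const lanes = Array.from(
    { length: Math.min(opts.maxConcurrency, items.length) },
    async () => {
      while (next < items.length) {
        const i = next++;
        await runOne(i);
      }
    },
  );
  await Promise.all(lanes);
  return results;
}
```

Every line of this is code you own, test, and debug — versus one `"MaxConcurrency": 100` field in ASL.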
Service integrations
Step Functions integrate directly with over 200 AWS services. You can invoke a Lambda, wait for an SQS message, start an ECS task, call a SageMaker endpoint, and write to DynamoDB, all without writing any Lambda code. Many workflows are just service-to-service orchestration. No compute needed.
Durable Functions require a Lambda function for everything. If your workflow is "receive event, transform data, write to DynamoDB, notify SNS," Step Functions Express Workflows handle this with no Lambda compute at all; you pay only the workflow charges. Durable Functions make you pay for Lambda execution time.
Where Durable Functions win
That said, there are patterns where Durable Functions are clearly better.
Complex branching logic
If your workflow has deeply nested conditionals, loops with variable exit conditions, or recursive patterns, writing it in Amazon States Language (ASL) is miserable. ASL is a JSON-based DSL that wasn't designed for complexity. I've seen teams maintain 2000-line Step Function definitions that nobody can reason about.
Durable Functions let you write the same logic in TypeScript (or Python, or Java). Your IDE gives you autocomplete. Your tests run locally. You can refactor with confidence because it's just code.
export async function processLoan(application: LoanApplication) {
  const creditScore = await task('checkCredit', () =>
    creditService.check(application.ssn)
  );

  if (creditScore < 580) {
    return await task('autoReject', () =>
      rejectApplication(application.id, 'credit_score')
    );
  }

  if (creditScore >= 750 && application.amount < 50000) {
    return await task('autoApprove', () =>
      approveApplication(application.id)
    );
  }

  // Manual review path with multiple approval stages
  const reviewers = determineReviewChain(creditScore, application.amount);

  for (const reviewer of reviewers) {
    await task(`assign-${reviewer.id}`, () =>
      assignReview(application.id, reviewer)
    );

    const decision = await waitForEvent(`review-${reviewer.id}`, {
      timeout: Duration.hours(48)
    });

    if (!decision || decision.status === 'rejected') {
      return { status: 'rejected', stage: reviewer.role };
    }
  }

  return await task('finalApprove', () =>
    approveApplication(application.id)
  );
}
Try writing that loop with variable-length reviewer chains in ASL. I've tried. It's not fun.
Long-running human-in-the-loop processes
Any workflow where you're waiting days or weeks for human input is a natural fit. The waitForEvent primitive costs nothing while suspended. You're not paying for a Lambda sitting idle. You're not running a polling loop against DynamoDB.
Customer onboarding, document approval chains, multi-party contract signing. These used to require careful state management with DynamoDB, SQS, and a scheduler. Now it's a few lines of code.
AI agent orchestration
This is the use case AWS is betting on. AI workflows that chain multiple LLM calls with human review gates, tool invocations, and conditional branching. The Bedrock AgentCore integration makes this straightforward:
export async function researchAgent(query: string) {
  const plan = await task('plan', () =>
    bedrock.invoke('claude-4', { prompt: `Create research plan for: ${query}` })
  );

  const sources = await task('search', () =>
    searchService.find(plan.queries)
  );

  const analysis = await task('analyze', () =>
    bedrock.invoke('claude-4', {
      prompt: `Analyze sources for: ${query}`,
      context: sources
    })
  );

  // Human review gate
  const approval = await waitForEvent('human-review', {
    timeout: Duration.hours(24)
  });

  if (approval?.approved) {
    return await task('publish', () =>
      publishService.create(analysis)
    );
  }

  return { status: 'needs_revision', feedback: approval?.feedback };
}
The hidden costs of checkpoint-and-replay
Here's what most articles skip: the replay model has real performance and cost implications.
Replay overhead
When a function resumes after a checkpoint, it replays from the beginning. Every task() call before the current one gets executed again, but instead of running the actual work, the runtime reads the cached result from the durable store. This is fast, but not free.
For a function with 20 checkpoints, the 20th step replays 19 cached results before doing its actual work. Each replay means deserializing state, checking the durable store, and returning cached data. In benchmarks I've seen, this adds 50-200ms per checkpoint on replay.
For most workflows, that latency is irrelevant. For latency-sensitive paths, it matters.
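The arithmetic is worth making explicit. Under the worst-case assumption that every checkpoint lands in its own invocation, checkpoint k replays the k-1 cached results before it, so total replays grow quadratically with checkpoint count:

```typescript
// Back-of-the-envelope replay cost under the worst-case assumption that
// each of n checkpoints runs in its own invocation. Checkpoint k replays
// k-1 cached results, so total replays = n * (n - 1) / 2.
function worstCaseReplayMs(checkpoints: number, perReplayMs: number): number {
  const totalReplays = (checkpoints * (checkpoints - 1)) / 2;
  return totalReplays * perReplayMs;
}

// 20 checkpoints at 100ms per cached replay: 190 replays, or 19 seconds
// of pure replay overhead spread across the workflow's lifetime.
```

That quadratic growth is the argument for keeping checkpoint counts modest: batch small steps into one task() where the individual results don't need independent durability.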
State size limits
Every checkpoint serializes its input and output to the durable store. AWS hasn't published hard limits yet, but the SDK documentation warns against large payloads. If your tasks return multi-megabyte results, you'll hit serialization costs and storage limits quickly.
The pattern here is the same one we use with Step Functions: store large data in S3, pass references through the workflow.
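A minimal sketch of that claim-check pattern, with the actual S3 call stubbed out — the threshold, key scheme, and helper names are assumptions, not anything the durable SDK prescribes:

```typescript
// Claim-check sketch: keep checkpoint payloads small by storing anything
// over a threshold in S3 and passing a reference through the workflow.
type PayloadRef<T> = { inline: T } | { s3Key: string };

const INLINE_LIMIT_BYTES = 32 * 1024; // assumed safe checkpoint size

function toPayloadRef<T>(value: T, key: string): PayloadRef<T> {
  const bytes = Buffer.byteLength(JSON.stringify(value), 'utf8');
  if (bytes <= INLINE_LIMIT_BYTES) return { inline: value };
  // In a real workflow, upload before returning the reference, e.g.:
  // await s3.send(new PutObjectCommand({ Bucket, Key: key, Body: JSON.stringify(value) }));
  return { s3Key: key };
}
```

Downstream tasks check which variant they received and fetch from S3 only when handed a key, so the durable store only ever serializes small references.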
Cold start multiplication
Each checkpoint that triggers a new container invocation pays a cold start penalty. If your function has 10 checkpoints spread across 10 separate invocations, you pay 10 cold starts instead of 1. With provisioned concurrency or SnapStart this is manageable, but it's a cost most teams don't account for upfront.
My decision framework
After working with both services over the past two months, here's how I decide:
Use Step Functions when:
- Your workflow is primarily service-to-service orchestration
- You need visual debugging and non-technical stakeholders review workflows
- You need massive parallelism (Map state with thousands of concurrent branches)
- Your team doesn't want to manage workflow code as application code
Use Durable Functions when:
- Your workflow has complex branching, loops, or recursive patterns
- You're building human-in-the-loop processes with long wait times
- Your team prefers code over configuration
- You're orchestrating AI agent workflows
- You want your workflow logic testable with standard unit testing tools
Use neither when:
- Your workflow fits in a single Lambda invocation (just use Lambda)
- You need sub-100ms latency (the checkpoint overhead will bite you)
- Your workflow is a simple event-driven pipeline (EventBridge + Lambda is simpler)
The migration question
If you have working Step Functions, don't migrate them for the sake of it. Step Functions aren't going away. AWS has invested too heavily in them and they serve a different use case.
If you're starting a new workflow project, evaluate both options against the criteria above. The honest truth is that most teams will end up using both: Step Functions for service orchestration and parallel processing, Durable Functions for complex stateful logic.
The real win isn't replacing Step Functions. It's eliminating the homegrown state management code that teams build when Step Functions feel too rigid and raw Lambda feels too manual. That's the gap Durable Functions fill, and they fill it well.
Build for the right abstraction level. Don't chase the shiny new thing.