
Lambda Durable Functions Are Not Step Functions Replacements
AWS Lambda Durable Functions look like Step Functions killers. They're not. Here's when each one wins, what the checkpoint-and-replay model actually costs, and the architectural patterns I'd use in production.
Every AWS team I've talked to since re:Invent 2025 has the same question: "Should we migrate our Step Functions to Lambda Durable Functions?"
The answer is no. Not yet. And probably not all of them.
What Lambda Durable Functions actually are
Lambda Durable Functions shipped at re:Invent 2025 as a new execution mode for Lambda. The pitch: write stateful, long-running workflows in regular code instead of JSON state machines. Enable "durable execution" on a function, and AWS handles checkpointing, failure recovery, and suspension for up to a year.
The 15-minute Lambda timeout hasn't changed. The physical container still dies at 900 seconds. What AWS built is a checkpoint-and-replay runtime that serializes your function's progress, then spins up a new container to pick up where you left off.
Think of it as a relay race. Each Lambda invocation runs until it hits a checkpoint, serializes state, and hands the baton to the next invocation.
import { task, sleep, waitForEvent, Duration } from '@aws-sdk/lambda-durable';

export async function handler(event: OnboardingEvent) {
  // Step 1: Create account
  const account = await task('createAccount', () =>
    accountService.create(event.email)
  );

  // Step 2: Wait for email verification (up to 7 days)
  const verification = await waitForEvent('email-verified', {
    timeout: Duration.days(7)
  });

  if (!verification) {
    await task('sendReminder', () =>
      notificationService.remind(account.id)
    );
    return { status: 'abandoned' };
  }

  // Step 3: Provision resources
  await task('provisionResources', () =>
    resourceService.provision(account.id)
  );

  return { status: 'complete', accountId: account.id };
}
Each task() call is a checkpoint. If the container dies between steps 2 and 3, the next invocation replays the function from the start, but skips steps 1 and 2 because their results are already persisted. The waitForEvent call suspends execution entirely, costing you nothing while you wait for the user to click a verification link.
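The replay rule is easier to hold in your head with a toy model. The sketch below is not the real runtime — the actual durable store persists serialized results across containers — but it captures the semantics described above: task() returns a cached result when one exists and only executes the work the first time. All names here are illustrative.

```typescript
// Toy model of checkpoint-and-replay semantics (not the real runtime).
// A durable store maps checkpoint names to saved results. Each
// (re)invocation runs the handler from the top; task() short-circuits
// to the cached result for any checkpoint that already completed.
type Store = Map<string, unknown>;

function makeRuntime(store: Store) {
  return function task<T>(name: string, work: () => T): T {
    if (store.has(name)) return store.get(name) as T; // replay: skip work
    const result = work();
    store.set(name, result); // checkpoint before moving on
    return result;
  };
}

// Simulate a container dying after step A: the second invocation
// replays A from the store and only executes step B.
const store: Store = new Map();
const executed: string[] = [];

function handler(crashAfterA: boolean): string {
  const task = makeRuntime(store);
  const a = task('stepA', () => { executed.push('A'); return 'a-result'; });
  if (crashAfterA) throw new Error('container died');
  return task('stepB', () => { executed.push('B'); return a + '+b'; });
}

try { handler(true); } catch { /* first invocation crashed */ }
const result = handler(false); // replay: stepA skipped, stepB runs
// executed is ['A', 'B'] — stepA's work ran exactly once
```

The point of the model: the function body re-executes on every invocation, but side effects wrapped in task() do not.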
This is genuinely powerful. It eliminates the SQS-chaining workarounds we've all built. It removes the state management boilerplate that turns simple workflows into spaghetti.
But it's not a Step Functions replacement.
Where Step Functions still win
Step Functions have been around since 2016. They're battle-tested at serious scale, and they solve problems that Durable Functions don't touch.
Visual debugging and observability
Step Functions give you a visual execution graph. Every state transition is visible in the console. You can see exactly where a workflow failed, what input it received, and what output it produced. Product managers can look at the graph and understand the flow.
Durable Functions give you CloudWatch logs. That's it. Debugging a replayed function means reading through logs and understanding which invocation ran which checkpoint. For complex workflows with branching logic, this gets painful fast.
Parallel execution with fan-out
Step Functions handle parallel branches natively with Map and Parallel states. You can fan out to 10,000 concurrent executions with a single state definition:
{
  "Type": "Map",
  "ItemsPath": "$.orders",
  "MaxConcurrency": 100,
  "Iterator": {
    "StartAt": "ProcessOrder",
    "States": {
      "ProcessOrder": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:...:processOrder",
        "End": true
      }
    }
  }
}
Durable Functions can do parallel work, but you're writing the concurrency control yourself. You're managing Promise.all(), handling partial failures, implementing retry logic per branch. The SDK gives you primitives, not orchestration.
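For a sense of what "writing the concurrency control yourself" means, here's a sketch of the hand-rolled equivalent of a Map state's MaxConcurrency and per-item retry. fanOut and its options are hypothetical names, not part of any SDK; the worker callback stands in for whatever per-item task() call you'd make.

```typescript
// Hand-rolled fan-out: bounded concurrency plus per-item retries —
// the things a Map state gives you declaratively.
type Outcome<R> = { status: 'ok'; value: R } | { status: 'failed'; error: unknown };

async function fanOut<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  opts: { maxConcurrency: number; retries: number },
): Promise<Outcome<R>[]> {
  const results: Outcome<R>[] = new Array(items.length);
  let next = 0; // shared cursor the lanes pull from

  async function runOne(i: number): Promise<void> {
    for (let attempt = 0; ; attempt++) {
      try {
        results[i] = { status: 'ok', value: await worker(items[i]) };
        return;
      } catch (error) {
        // exhaust retries, then record a partial failure instead of throwing
        if (attempt >= opts.retries) {
          results[i] = { status: 'failed', error };
          return;
        }
      }
    }
  }

  // spin up at most maxConcurrency lanes; each pulls the next item index
  const lanes = Array.from(
    { length: Math.min(opts.maxConcurrency, items.length) },
    async () => {
      while (next < items.length) {
        const i = next++;
        await runOne(i);
      }
    },
  );
  await Promise.all(lanes);
  return results;
}
```

Every line of this is code you own, test, and debug — versus one `"MaxConcurrency": 100` field in ASL.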
Service integrations
Step Functions integrate directly with over 200 AWS services. You can invoke a Lambda, wait for an SQS message, start an ECS task, call a SageMaker endpoint, and write to DynamoDB, all without writing any Lambda code. Many workflows are just service-to-service orchestration. No compute needed.
Durable Functions require a Lambda function for everything. If your workflow is "receive event, transform data, write to DynamoDB, notify SNS," Step Functions Express Workflows handle this with no Lambda compute at all; you pay only the workflow charges. Durable Functions make you pay for Lambda execution time.
Where Durable Functions win
That said, there are patterns where Durable Functions are clearly better.
Complex branching logic
If your workflow has deeply nested conditionals, loops with variable exit conditions, or recursive patterns, writing it in Amazon States Language (ASL) is miserable. ASL is a JSON-based DSL that wasn't designed for complexity. I've seen teams maintain 2000-line Step Function definitions that nobody can reason about.
Durable Functions let you write the same logic in TypeScript (or Python, or Java). Your IDE gives you autocomplete. Your tests run locally. You can refactor with confidence because it's just code.
export async function processLoan(application: LoanApplication) {
  const creditScore = await task('checkCredit', () =>
    creditService.check(application.ssn)
  );

  if (creditScore < 580) {
    return await task('autoReject', () =>
      rejectApplication(application.id, 'credit_score')
    );
  }

  if (creditScore >= 750 && application.amount < 50000) {
    return await task('autoApprove', () =>
      approveApplication(application.id)
    );
  }

  // Manual review path with multiple approval stages
  const reviewers = determineReviewChain(creditScore, application.amount);

  for (const reviewer of reviewers) {
    await task(`assign-${reviewer.id}`, () =>
      assignReview(application.id, reviewer)
    );

    const decision = await waitForEvent(`review-${reviewer.id}`, {
      timeout: Duration.hours(48)
    });

    if (!decision || decision.status === 'rejected') {
      return { status: 'rejected', stage: reviewer.role };
    }
  }

  return await task('finalApprove', () =>
    approveApplication(application.id)
  );
}
Try writing that loop with variable-length reviewer chains in ASL. I've tried. It's not fun.
Long-running human-in-the-loop processes
Any workflow where you're waiting days or weeks for human input is a natural fit. The waitForEvent primitive costs nothing while suspended. You're not paying for a Lambda sitting idle. You're not running a polling loop against DynamoDB.
Customer onboarding, document approval chains, multi-party contract signing. These used to require careful state management with DynamoDB, SQS, and a scheduler. Now it's a few lines of code.
AI agent orchestration
This is the use case AWS is betting on. AI workflows that chain multiple LLM calls with human review gates, tool invocations, and conditional branching. The Bedrock AgentCore integration makes this straightforward:
export async function researchAgent(query: string) {
  const plan = await task('plan', () =>
    bedrock.invoke('claude-4', { prompt: `Create research plan for: ${query}` })
  );

  const sources = await task('search', () =>
    searchService.find(plan.queries)
  );

  const analysis = await task('analyze', () =>
    bedrock.invoke('claude-4', {
      prompt: `Analyze sources for: ${query}`,
      context: sources
    })
  );

  // Human review gate
  const approval = await waitForEvent('human-review', {
    timeout: Duration.hours(24)
  });

  if (approval?.approved) {
    return await task('publish', () =>
      publishService.create(analysis)
    );
  }

  return { status: 'needs_revision', feedback: approval?.feedback };
}
The hidden costs of checkpoint-and-replay
Here's what most articles skip: the replay model has real performance and cost implications.
Replay overhead
When a function resumes after a checkpoint, it replays from the beginning. Every task() call before the current one gets executed again, but instead of running the actual work, the runtime reads the cached result from the durable store. This is fast, but not free.
For a function with 20 checkpoints, the 20th step replays 19 cached results before doing its actual work. Each replay means deserializing state, checking the durable store, and returning cached data. In benchmarks I've seen, this adds 50-200ms per checkpoint on replay.
For most workflows, that latency is irrelevant. For latency-sensitive paths, it matters.
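The arithmetic is worth making explicit. Under the worst-case assumption that every checkpoint lands in its own invocation, checkpoint k replays the k-1 cached results before it, so total replays grow quadratically with checkpoint count:

```typescript
// Back-of-the-envelope replay cost under the worst-case assumption that
// each of n checkpoints runs in its own invocation. Checkpoint k replays
// k-1 cached results, so total replays = n * (n - 1) / 2.
function worstCaseReplayMs(checkpoints: number, perReplayMs: number): number {
  const totalReplays = (checkpoints * (checkpoints - 1)) / 2;
  return totalReplays * perReplayMs;
}

// 20 checkpoints at 100ms per cached replay: 190 replays, or 19 seconds
// of pure replay overhead spread across the workflow's lifetime.
```

That quadratic growth is the argument for keeping checkpoint counts modest: batch small steps into one task() where the individual results don't need independent durability.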
State size limits
Every checkpoint serializes its input and output to the durable store. AWS hasn't published hard limits yet, but the SDK documentation warns against large payloads. If your tasks return multi-megabyte results, you'll hit serialization costs and storage limits quickly.
The pattern here is the same one we use with Step Functions: store large data in S3, pass references through the workflow.
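A minimal sketch of that claim-check pattern, with the actual S3 call stubbed out — the threshold, key scheme, and helper names are assumptions, not anything the durable SDK prescribes:

```typescript
// Claim-check sketch: keep checkpoint payloads small by storing anything
// over a threshold in S3 and passing a reference through the workflow.
type PayloadRef<T> = { inline: T } | { s3Key: string };

const INLINE_LIMIT_BYTES = 32 * 1024; // assumed safe checkpoint size

function toPayloadRef<T>(value: T, key: string): PayloadRef<T> {
  const bytes = Buffer.byteLength(JSON.stringify(value), 'utf8');
  if (bytes <= INLINE_LIMIT_BYTES) return { inline: value };
  // In a real workflow, upload before returning the reference, e.g.:
  // await s3.send(new PutObjectCommand({ Bucket, Key: key, Body: JSON.stringify(value) }));
  return { s3Key: key };
}
```

Downstream tasks check which variant they received and fetch from S3 only when handed a key, so the durable store only ever serializes small references.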
Cold start multiplication
Each checkpoint that triggers a new container invocation pays a cold start penalty. If your function has 10 checkpoints spread across 10 separate invocations, you pay 10 cold starts instead of 1. With provisioned concurrency or SnapStart this is manageable, but it's a cost most teams don't account for upfront.
My decision framework
After working with both services over the past two months, here's how I decide:
Use Step Functions when:
- Your workflow is primarily service-to-service orchestration
- You need visual debugging and non-technical stakeholders review workflows
- You need massive parallelism (Map state with thousands of concurrent branches)
- Your team doesn't want to manage workflow code as application code
Use Durable Functions when:
- Your workflow has complex branching, loops, or recursive patterns
- You're building human-in-the-loop processes with long wait times
- Your team prefers code over configuration
- You're orchestrating AI agent workflows
- You want your workflow logic testable with standard unit testing tools
Use neither when:
- Your workflow fits in a single Lambda invocation (just use Lambda)
- You need sub-100ms latency (the checkpoint overhead will bite you)
- Your workflow is a simple event-driven pipeline (EventBridge + Lambda is simpler)
The migration question
If you have working Step Functions, don't migrate them for the sake of it. Step Functions aren't going away. AWS has invested too heavily in them and they serve a different use case.
If you're starting a new workflow project, evaluate both options against the criteria above. The honest truth is that most teams will end up using both: Step Functions for service orchestration and parallel processing, Durable Functions for complex stateful logic.
The real win isn't replacing Step Functions. It's eliminating the homegrown state management code that teams build when Step Functions feel too rigid and raw Lambda feels too manual. That's the gap Durable Functions fill, and they fill it well.
Build for the right abstraction level. Don't chase the shiny new thing.