4 Things I Learned from Hands-On Testing Lambda Durable Functions
Introduction
In December 2025, AWS added Durable Functions to Lambda. This feature lets you write multi-step workflows directly in Lambda function code without Step Functions. It uses a checkpoint-and-replay model for automatic recovery, supports execution suspension for up to one year, and incurs no compute charges during wait periods.
To answer "how does it actually differ from Step Functions?" and "what's the real developer experience like?", I tested four patterns using the AWS CLI. Here's what I found. If you just want the results, skip ahead to Test 1.
Prerequisites:
- AWS CLI configured (`lambda:*`, `iam:*`, `logs:*` permissions)
- A region where the Node.js 24.x runtime is available
- Test region: ap-northeast-1 (Tokyo)
Why Durable Functions?
When implementing multi-step processing in Lambda, there were traditionally two options:
- Self-managed within Lambda — Save intermediate state to DynamoDB, implement retry and idempotency yourself. Code gets complex and state management bugs creep in
- External orchestration with Step Functions — Define workflows in ASL (Amazon States Language). Robust, but tightly coupled business logic means dual maintenance of ASL and code
Durable Functions fill this gap. You write normal sequential code, and the SDK handles checkpoint management, retries, and state recovery automatically.
How Durable Functions Work
Durable Functions operate on a checkpoint-and-replay model:
- Each `step()` completion persists its result to durable storage as a checkpoint
- When the function suspends via `wait()` or fails, the execution environment is released
- On resume, the function re-executes from the beginning, but completed steps are skipped — stored results are returned instead
This means you write normal sequential code, and the SDK handles fault tolerance automatically.
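The replay mechanics above can be sketched with a toy in-memory checkpoint store. This is illustrative only: the `step` helper, the `Suspend` signal, and the Map are my stand-ins, not the real SDK.

```javascript
// Illustrative sketch of checkpoint-and-replay (not the real SDK):
// step results are persisted; on replay, completed steps return their
// stored result instead of re-executing their body.
const checkpoints = new Map(); // stands in for durable storage
let bodyRuns = 0;              // counts how often step bodies actually execute

function step(name, fn) {
  if (checkpoints.has(name)) return checkpoints.get(name); // replay: skip
  const result = fn();                                     // first execution
  checkpoints.set(name, result);                           // persist checkpoint
  return result;
}

class Suspend extends Error {} // signals "release the execution environment"

function workflow() {
  const a = step("validate", () => { bodyRuns++; return "validated"; });
  if (!checkpoints.has("wait-done")) {  // simulate wait(): suspend once
    checkpoints.set("wait-done", true);
    throw new Suspend("suspended");
  }
  const b = step("confirm", () => { bodyRuns++; return "confirmed"; });
  return [a, b];
}

let result;
try {
  result = workflow();   // original run: checkpoints "validate", then suspends
} catch (e) {
  if (!(e instanceof Suspend)) throw e;
}
result = workflow();     // resume: replays from the top, skips "validate"
console.log(result, bodyRuns); // [ 'validated', 'confirmed' ] 2
```

Each step body runs exactly once even though `workflow()` is executed twice; the second run replays through the cached checkpoint.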
The SDK supports JavaScript/TypeScript (Node.js 22/24), Python (3.13/3.14), and Java (Preview).
Test Environment
- Region: `ap-northeast-1` (Tokyo)
- Runtime: Node.js 24.x
- Memory: 256 MB
- Durable config: `RetentionPeriodInDays: 1`, `ExecutionTimeout: 120` seconds
- Deployment: AWS CLI (`create-function` + `--durable-config`)
Setup
IAM Role Setup
Create an IAM role for Durable Functions. Beyond the standard Lambda execution role, checkpoint and callback permissions are required.
IAM role setup steps
cat <<'EOF' > trust-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "Service": "lambda.amazonaws.com" },
"Action": "sts:AssumeRole"
}
]
}
EOF
aws iam create-role \
--role-name lambda-durable-test-role \
  --assume-role-policy-document file://trust-policy.json

aws iam attach-role-policy \
  --role-name lambda-durable-test-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Add Durable Functions-specific permissions as an inline policy:
cat <<'EOF' > durable-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"lambda:CheckpointDurableExecution",
"lambda:GetDurableExecutionState",
"lambda:SendDurableExecutionCallbackSuccess",
"lambda:SendDurableExecutionCallbackFailure"
],
"Resource": "arn:aws:lambda:<REGION>:<ACCOUNT_ID>:function:durable-*"
}
]
}
EOF
aws iam put-role-policy \
--role-name lambda-durable-test-role \
--policy-name DurableFunctionPermissions \
  --policy-document file://durable-policy.json

One gotcha here: some documentation references `CheckpointDurableExecutions` (plural), but the actual required action is `CheckpointDurableExecution` (singular). Using the plural form results in a permission error.
Creating and Deploying Functions
Add --durable-config to a standard Lambda function creation — that's it.
aws lambda create-function \
--function-name durable-basic-test \
--runtime nodejs24.x \
--handler index.handler \
--role arn:aws:iam::<ACCOUNT_ID>:role/lambda-durable-test-role \
--zip-file fileb://function.zip \
--timeout 30 --memory-size 256 \
  --durable-config '{"RetentionPeriodInDays":1,"ExecutionTimeout":120}'

Critical: invocation requires a versioned ARN. `$LATEST` won't start a Durable Execution.
aws lambda publish-version --function-name durable-basic-test
aws lambda invoke \
--function-name "arn:aws:lambda:ap-northeast-1:<ACCOUNT_ID>:function:durable-basic-test:1" \
--payload '{"orderId": "ORD-001"}' \
--cli-binary-format raw-in-base64-out \
  response.json

For tests 2–4 below, the same workflow applies: save the code, run `zip function.zip <filename>.mjs`, then `create-function` → `publish-version` → `invoke`. Only the function name and handler change.
Test Results Overview
I tested four patterns:
- Step + Wait — Basic checkpoint and suspension behavior. Replay mechanics during wait
- Failure Recovery and Retry — Automatic retry on step failure. Skipping completed steps
- Callback — Human-in-the-loop pattern with external event wait. Sync vs async invocation
- Parallel / Map — Parallel execution overhead. CPU-bound vs I/O-bound differences
Test 1: Basic Step + Wait
A simple 3-step order processing function with a wait. Save as index.mjs and package with zip function.zip index.mjs.
import { withDurableExecution } from "@aws/durable-execution-sdk-js";
export const handler = withDurableExecution(
async (event, context) => {
const orderId = event.orderId || "ORD-001";
// Step 1: Validate
const validation = await context.step("validate-order", async (stepCtx) => {
stepCtx.logger.info(`Validating order ${orderId}`);
return { orderId, status: "validated", timestamp: Date.now() };
});
// Step 2: Payment
const payment = await context.step("process-payment", async (stepCtx) => {
stepCtx.logger.info(`Processing payment for ${orderId}`);
return { orderId, status: "paid", amount: 4980, timestamp: Date.now() };
});
// Wait 10 seconds (no compute charges)
await context.wait({ seconds: 10 });
// Step 3: Confirm (runs after wait resumes)
const confirmation = await context.step("confirm-order", async (stepCtx) => {
stepCtx.logger.info(`Confirming order ${orderId}`);
return { orderId, status: "confirmed", timestamp: Date.now() };
});
return { orderId, steps: { validation, payment, confirmation } };
}
);

The SDK package `@aws/durable-execution-sdk-js` is pre-installed in the Lambda runtime. Wrapping the handler with `withDurableExecution` replaces the standard Lambda context with a `DurableContext`, giving access to `step()`, `wait()`, and other durable operations.
Results
{
"orderId": "ORD-TEST-002",
"steps": {
"validation": {
"orderId": "ORD-TEST-002", "status": "validated", "timestamp": 1774019546545
},
"payment": {
"orderId": "ORD-TEST-002", "status": "paid", "amount": 4980, "timestamp": 1774019546623
},
"confirmation": {
"orderId": "ORD-TEST-002", "status": "confirmed", "timestamp": 1774019556800
}
},
"totalElapsedMs": 99
}

Findings
The gap between the payment (546623) and confirmation (556800) timestamps is ~10.2 seconds — `wait` works accurately. Meanwhile, `totalElapsedMs` is only 99 ms, because the replay runs after the wait and any `Date.now()` call outside a step is re-evaluated at replay time.
- Non-deterministic operations like `Date.now()` outside steps produce inconsistent values during replay. This aligns with the "Write deterministic code" section in the official documentation
- The response includes a `DurableExecutionArn` for execution tracking
- The SDK is pre-installed in the Lambda runtime, so there is no need to include it in the zip. However, bundling is recommended for production to pin the SDK version
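The determinism rule can be demonstrated with a mock checkpoint store (illustrative, not the real SDK): a value produced inside a step is stable across replays, while the same call outside a step is re-evaluated every run.

```javascript
// Mock of the determinism rule: checkpointed values survive replay,
// non-checkpointed ones drift.
const store = new Map();
function step(name, fn) {
  if (!store.has(name)) store.set(name, fn()); // run once, checkpoint result
  return store.get(name);                      // replays get the stored value
}

function run() {
  const outside = Math.random();                    // re-evaluated on every run
  const inside = step("rand", () => Math.random()); // checkpointed on first run
  return { outside, inside };
}

const first = run();  // original execution
const replay = run(); // replay after a suspend
console.log(replay.inside === first.inside);   // true: stable across replay
console.log(replay.outside === first.outside); // false: drifted on replay
```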
Test 2: Failure Recovery and Retry
Step 2 intentionally fails on the first attempt and succeeds on retry. Save as index-retry.mjs and deploy with --handler index-retry.handler.
import { withDurableExecution } from "@aws/durable-execution-sdk-js";
let callCount = 0;
export const handler = withDurableExecution(
async (event, context) => {
const orderId = event.orderId || "ORD-001";
const step1 = await context.step("step1-validate", async (stepCtx) => {
stepCtx.logger.info(`Step 1: Validating ${orderId}`);
return { status: "validated", timestamp: Date.now() };
});
// Fails first, succeeds on retry
const step2 = await context.step("step2-flaky-payment", async (stepCtx) => {
callCount++;
if (callCount <= 1) {
throw new Error(`Payment service unavailable (attempt ${callCount})`);
}
return { status: "paid", attempt: callCount, timestamp: Date.now() };
});
const step3 = await context.step("step3-confirm", async (stepCtx) => {
return { status: "confirmed", timestamp: Date.now() };
});
return { orderId, step1, step2, step3 };
}
);

step2 intentionally throws on the first call and succeeds on the retry. This works here because the module-level `callCount` persists while the same execution environment is reused across the retry; if the replay had landed in a fresh sandbox, the counter would reset to zero and the step would fail again.
Results
{
"orderId": "ORD-RETRY-001",
"step1": { "status": "validated", "timestamp": 1774019591142 },
"step2": { "status": "paid", "attempt": 2, "timestamp": 1774019593796 },
"step3": { "status": "confirmed", "timestamp": 1774019593882 }
}

Findings
step2.attempt is 2 — automatic retry after the first failure, succeeded on the second attempt. Step 1 was skipped during replay, with its result restored from the checkpoint.
- The ~2.6 second gap between step1 and step2 timestamps shows backoff between retries
- Completed step1 is never re-executed — stored results are returned. This is the core of checkpoint-and-replay
- Module-level variables survive only as long as the same execution environment is reused — they reset if the replay lands in a fresh sandbox, so don't rely on them outside of tests. Best practice is to manage state through step return values
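The skip-completed-steps behavior from this test can be sketched with the same toy model as before (a mock checkpoint store and an external retry loop, not the real SDK):

```javascript
// Mock of failure recovery: a flaky step fails once, the handler is
// replayed, and the already-completed step is skipped via its checkpoint.
const cache = new Map();
const runs = { step1: 0, step2: 0 };

function step(name, fn) {
  if (cache.has(name)) return cache.get(name); // completed: return stored result
  const result = fn();
  cache.set(name, result);
  return result;
}

function handler() {
  const a = step("step1", () => { runs.step1++; return "validated"; });
  const b = step("step2", () => {
    runs.step2++;
    if (runs.step2 === 1) throw new Error("payment service unavailable");
    return "paid";
  });
  return { a, b };
}

let out;
for (let attempt = 1; attempt <= 2 && !out; attempt++) {
  try { out = handler(); } catch (_) { /* runtime replays the handler */ }
}
console.log(out, runs); // step1 ran once, step2 ran twice
```

As in Test 2, the counter only works because both attempts share one process; the real fix for flaky dependencies is the runtime's own retry, not a global.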
Test 3: Callback (External Event Wait)
A human-in-the-loop approval pattern using callbacks. `context.createCallback()` issues a callback ID; awaiting the returned promise suspends execution until an external system resumes it via the `send-durable-execution-callback-success` API. Save as index-callback.mjs.
index-callback.mjs (full code)
import { withDurableExecution } from "@aws/durable-execution-sdk-js";
export const handler = withDurableExecution(
async (event, context) => {
const orderId = event.orderId || "ORD-001";
const doc = await context.step("prepare-document", async (stepCtx) => {
stepCtx.logger.info(`Preparing document for ${orderId}`);
return { orderId, status: "prepared", timestamp: Date.now() };
});
// Create callback wait (60s timeout)
const [approvalPromise, callbackId] = await context.createCallback(
"approval",
{ timeout: { seconds: 60 } }
);
// Notify external system with callback ID
await context.step("notify-approver", async (stepCtx) => {
stepCtx.logger.info(`Callback ID for approval: ${callbackId}`);
return { callbackId, notifiedAt: Date.now() };
});
// Execution suspends here (no compute charges)
const approval = await approvalPromise;
// Resumes after callback
const result = await context.step("process-approval", async (stepCtx) => {
return { orderId, approved: true, approvalData: approval };
});
return { orderId, doc, result };
}
);

The key point is that `createCallback()` returns a `[Promise, callbackId]` tuple. You pass the `callbackId` to an external system, and the function suspends when you await the promise.
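The tuple shape can be mimicked locally. The function bodies below are my stand-ins for what the SDK's `createCallback()` and the `send-durable-execution-callback-success` API do conceptually, not their real implementations:

```javascript
// Mock of the callback pattern: a pending promise plus an ID that an
// external caller later uses to resolve it.
const pending = new Map(); // callbackId -> resolve function
let nextId = 0;

function createCallback() {
  let resolve;
  const promise = new Promise((r) => { resolve = r; });
  const callbackId = `cb-${++nextId}`;
  pending.set(callbackId, resolve);
  return [promise, callbackId];
}

// Conceptually what send-durable-execution-callback-success does:
function sendCallbackSuccess(callbackId, result) {
  pending.get(callbackId)(result); // resolves the awaited promise
  pending.delete(callbackId);
}

const [approvalPromise, callbackId] = createCallback();
// ...callbackId is handed to the approver; the function suspends on await...
sendCallbackSuccess(callbackId, { approved: true, approver: "manager@example.com" });
approvalPromise.then((approval) => console.log(approval.approved)); // true
```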
Execution Steps
1. Invoke asynchronously
Synchronous invoke times out during callback wait, so --invocation-type Event is required.
aws lambda invoke \
--function-name "arn:aws:lambda:ap-northeast-1:<ACCOUNT_ID>:function:durable-callback-test:1" \
--payload '{"orderId": "ORD-CALLBACK-001"}' \
--invocation-type Event \
--cli-binary-format raw-in-base64-out \
  response.json

2. Get the callback ID from CloudWatch Logs
aws logs filter-log-events \
--log-group-name /aws/lambda/durable-callback-test \
  --filter-pattern "Callback ID"

The callback ID in the logs is a long Base64-encoded string containing the Durable Execution ARN and operation ID.
3. Send the callback to resume execution
aws lambda send-durable-execution-callback-success \
--callback-id "<CALLBACK_ID>" \
  --result '{"approved": true, "approver": "manager@example.com"}'

Within seconds of sending the callback, the function resumed and the process-approval step executed. No compute charges accrue during the callback wait.
Findings
- Synchronous invoke times out during the callback wait — `--invocation-type Event` (async) is required. This was the first gotcha I hit in this test
- No compute charges during the callback wait — ideal for human-in-the-loop workflows
- Always set a `timeout` — without one, the callback waits indefinitely
Test 4: Parallel / Map
Concurrent execution with parallel() and map(). Save as index-parallel.mjs.
index-parallel.mjs (full code)
import { withDurableExecution } from "@aws/durable-execution-sdk-js";
export const handler = withDurableExecution(
async (event, context) => {
// parallel(): 3 independent tasks
const parallelResults = await context.parallel([
async (ctx) => ctx.step("check-inventory", async () => {
const start = Date.now();
while (Date.now() - start < 500) {} // 500ms work
return { available: true, timestamp: Date.now() };
}),
async (ctx) => ctx.step("check-fraud", async () => {
const start = Date.now();
while (Date.now() - start < 300) {} // 300ms work
return { passed: true, timestamp: Date.now() };
}),
async (ctx) => ctx.step("check-credit", async () => {
const start = Date.now();
while (Date.now() - start < 400) {} // 400ms work
return { approved: true, timestamp: Date.now() };
}),
]);
// map(): process array items
const items = [
{ id: "ITEM-1", price: 1000 },
{ id: "ITEM-2", price: 2000 },
{ id: "ITEM-3", price: 3000 },
];
const mapResults = await context.map(items, async (ctx, item, index) => {
return ctx.step(`process-item-${index}`, async () => {
const start = Date.now();
while (Date.now() - start < 200) {} // 200ms work
return { ...item, processed: true, timestamp: Date.now() };
});
});
return {
parallel: { results: parallelResults.getResults() },
map: { results: mapResults.getResults() },
};
}
);

The key point is that each function passed to `parallel()` receives its own `ctx` (child context). `map()` applies the same operation to each array element. Results are retrieved as an array with `getResults()`, and errors with `getErrors()`.
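The results/errors split behaves roughly like a `Promise.allSettled` partition. This mock reproduces the shape of the result object, not the SDK's implementation:

```javascript
// Mock of a parallel() result object: successes and failures are
// collected separately, mirroring getResults()/getErrors().
async function parallel(tasks) {
  const settled = await Promise.allSettled(tasks.map((t) => t()));
  return {
    getResults: () =>
      settled.filter((s) => s.status === "fulfilled").map((s) => s.value),
    getErrors: () =>
      settled.filter((s) => s.status === "rejected").map((s) => s.reason),
  };
}

const done = parallel([
  async () => "inventory-ok",
  async () => { throw new Error("fraud check failed"); }, // one task fails
  async () => "credit-ok",
]).then((r) => {
  console.log(r.getResults()); // [ 'inventory-ok', 'credit-ok' ]
  console.log(r.getErrors().length); // 1
  return r;
});
```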
Results
| Pattern | Sequential (theoretical) | Actual | Overhead |
|---|---|---|---|
| parallel (500+300+400ms) | 1200ms | 1983ms | +783ms |
| map (200ms×3) | 600ms | 1001ms | +401ms |
Checkpoint persistence overhead is hundreds of milliseconds per step, making it dominant for lightweight parallel tasks.
Findings
- Timestamps show the 3 parallel tasks completed ~300ms apart sequentially — for CPU-bound work, parallel effectively runs sequentially. The documentation mentions child contexts can "efficiently utilize multiple CPU cores", but busy-wait loops that monopolize the CPU don't benefit. I/O-bound operations (e.g., API calls, database queries) would see real parallelism
- parallel/map is best suited for "running independent I/O-bound tasks concurrently", not "mass-parallelizing lightweight tasks"
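The CPU-bound vs I/O-bound distinction is easy to reproduce in plain Node.js, with no SDK involved: overlapping timer waits finish in roughly the time of the longest one, unlike the busy-wait loops used in the test.

```javascript
// Three simulated I/O waits of 500/300/400 ms overlap on a single
// Node.js thread, so total time tracks the longest wait, not the
// 1200 ms sum. Busy-wait loops would serialize instead.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const done = (async () => {
  const start = Date.now();
  await Promise.all([sleep(500), sleep(300), sleep(400)]); // overlapping waits
  const elapsed = Date.now() - start;
  console.log(`elapsed ~${elapsed} ms`); // close to 500, not 1200
  return elapsed;
})();
```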
Use Cases and Design Patterns
Practical patterns that emerged from testing:
Fault-tolerant payment/order processing — As confirmed in Tests 1 and 2, step checkpointing and automatic retry handle transient external service failures transparently. No need for DynamoDB intermediate state or custom retry logic.
Serverless approval workflows — The Callback pattern from Test 3 enables human approval waits for up to one year with no compute charges. Previously this required SQS + Lambda + DynamoDB for state management; with Durable Functions, a single await line suffices.
Design considerations — As Test 4 revealed, checkpoint persistence overhead is hundreds of milliseconds per step. The right step granularity is roughly "one external service call per step". Avoid wrapping lightweight loop iterations as individual steps. The documentation's "Design effective steps" section also recommends balancing step granularity.
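A back-of-envelope estimate makes the granularity point concrete. The 300 ms checkpoint cost is a rough figure inferred from the Test 4 numbers, and the item count and per-item work are made up for illustration:

```javascript
// Estimated wall time for two step-granularity choices, given a fixed
// per-checkpoint persistence cost.
const CHECKPOINT_MS = 300; // assumed per-step checkpoint overhead
const ITEMS = 50;          // lightweight loop iterations
const WORK_MS = 5;         // cheap per-item work

const perItemSteps = ITEMS * (WORK_MS + CHECKPOINT_MS); // one step per item
const batchedStep = ITEMS * WORK_MS + CHECKPOINT_MS;    // one step for the batch

console.log(perItemSteps); // 15250 ms: checkpoint paid 50 times
console.log(batchedStep);  //   550 ms: checkpoint paid once
```

The batched variant trades fine-grained recovery for a ~28x reduction in estimated overhead, which is why "one external service call per step" is a sensible default.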
Durable Functions vs Step Functions
| Aspect | Durable Functions | Step Functions |
|---|---|---|
| Definition | Code (JS/Python/Java) | ASL (JSON/YAML), CDK, or visual designer |
| Dev experience | IDE + unit tests + LLM agents | Console visual editor |
| AWS integrations | SDK calls from within Lambda | 220+ native service integrations |
| State management | SDK-managed (checkpoints) | Fully managed by service |
| Debugging | CloudWatch Logs + standard debuggers | Visual execution history |
| Best for | Workflows tightly coupled with business logic | Multi-service orchestration |
Choose Durable Functions when:
- Workflow and business logic are tightly coupled within Lambda
- You want to add fault tolerance to existing Lambda functions
- You prefer code-first development in your IDE
Choose Step Functions when:
- Orchestrating across multiple AWS services
- Non-engineers need to understand and validate workflows
- You want zero-maintenance infrastructure
Takeaways
- Checkpoint-and-replay is powerful but demands deterministic code — `Date.now()` and `Math.random()` must go inside steps. Non-deterministic code outside steps breaks during replay.
- Callbacks dramatically simplify human-in-the-loop implementations — wait up to one year with no compute charges. But synchronous invoke times out, so async invocation is required.
- Understand parallel/map overhead before designing — Checkpoint persistence costs hundreds of milliseconds per step. Not ideal for mass-parallelizing lightweight tasks. Best for I/O-bound operations or steps taking several seconds each.
- Not a Step Functions replacement, but a complementary option — Code-first Lambda-contained workflows → Durable Functions. Multi-service orchestration → Step Functions.
Cleanup
Delete the Lambda functions, CloudWatch log groups, and IAM role, in that order. Durable Functions-specific resources (checkpoint data, etc.) are deleted automatically once the `RetentionPeriodInDays` retention period elapses, so no manual cleanup is needed for those.
Resource deletion commands
# Delete Lambda functions
aws lambda delete-function --function-name durable-basic-test
aws lambda delete-function --function-name durable-retry-test
aws lambda delete-function --function-name durable-callback-test
aws lambda delete-function --function-name durable-parallel-test
# Delete CloudWatch log groups
for fn in durable-basic-test durable-retry-test durable-callback-test durable-parallel-test; do
aws logs delete-log-group --log-group-name "/aws/lambda/$fn"
done
# Delete IAM role
aws iam delete-role-policy \
--role-name lambda-durable-test-role \
--policy-name DurableFunctionPermissions
aws iam detach-role-policy \
--role-name lambda-durable-test-role \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam delete-role --role-name lambda-durable-test-role