Step Functions Standard vs Express

May 26, 2027 · 15 min read

Developer Associate · DVA-C02 · part of The Exam Room

The situation

The IoT flow looks roughly like this in Amazon States Language: a Parallel state pulling data from three enrichment Lambdas, a Choice on the resulting envelope, a Task writing to DynamoDB, a Task publishing to an SNS topic, and a terminal state. End-to-end latency per execution is about 1.2 seconds. Each step is idempotent: a replay with the same input produces the same writes. The workflow is invoked directly by an EventBridge rule on a high-throughput telemetry bus. Back-of-envelope: 800 executions per second over a 30-day month is around 2.07 billion executions; at roughly ten state transitions per execution, that’s about 20.7 billion state transitions per month. On the current per-transition pricing meter, that lands near half a million dollars a month in AWS charges for this one state machine.
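The back-of-envelope figures can be reproduced in a few lines. This is a minimal sketch, assuming a 30-day month and the $0.025 per 1,000 transitions Standard rate quoted later in this post; the function and constant names are illustrative, and no free tier is applied.

```python
# Back-of-envelope Standard bill for the IoT flow.
# Assumptions: 30-day month; $0.025 per 1,000 state transitions as quoted in this post.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60
STANDARD_PRICE_PER_TRANSITION = 0.025 / 1_000


def monthly_executions(executions_per_second: float) -> float:
    """Executions in a 30-day month at a steady rate."""
    return executions_per_second * SECONDS_PER_MONTH


def standard_monthly_cost(executions: float, transitions_per_execution: int) -> float:
    """Standard Workflows bill: every state transition is metered."""
    return executions * transitions_per_execution * STANDARD_PRICE_PER_TRANSITION


iot_executions = monthly_executions(800)
iot_standard_bill = standard_monthly_cost(iot_executions, 10)

print(f"{iot_executions:,.0f} executions/month")
print(f"${iot_standard_bill:,.0f}/month on Standard")
```

The unrounded result is about $518,400; the $517,500 used elsewhere in this post comes from rounding to 20.7 billion transitions before multiplying.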

The account-opening flow looks nothing like it. One execution per new customer signup, call it five thousand per month. Each execution opens with a synchronous KYC Task, then a Task waiting on .waitForTaskToken while a human reviewer in a back-office tool decides yes or no. If the reviewer asks for more documents, the workflow loops through a further .waitForTaskToken cycle. Typical elapsed time: four days. Pathological cases: six weeks. When the workflow finishes, compliance needs to open it in the console three months later and see every state, every input, every output, every timestamp; the regulator has asked questions about older files than that.
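The human-approval step described here could be expressed as a single callback Task state. The sketch below builds one as a Python dict and prints it as Amazon States Language JSON. The function name `notify-reviewer`, the `$.application` input field, the `ReviewDecision` next state, and the 60-day timeout are all assumptions for illustration, not details from the real workflow.

```python
import json

# Assumed ceiling for a review: 60 days, comfortably above the six-week worst case.
REVIEW_TIMEOUT_SECONDS = 60 * 24 * 60 * 60

human_approval_state = {
    "Type": "Task",
    # The .waitForTaskToken suffix pauses the execution until a callback arrives.
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "notify-reviewer",  # hypothetical function name
        "Payload": {
            # The context object supplies the token the reviewer tool must send back.
            "taskToken.$": "$$.Task.Token",
            "application.$": "$.application",  # hypothetical input field
        },
    },
    "TimeoutSeconds": REVIEW_TIMEOUT_SECONDS,
    "Next": "ReviewDecision",  # hypothetical Choice state
}

print(json.dumps(human_approval_state, indent=2))
```

The back-office tool would resume the execution by calling SendTaskSuccess or SendTaskFailure with the token it received.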

Both workloads are defined as Standard state machines. The team has heard of Express and wonders whether moving the IoT flow would be sane. It also wants to confirm, out loud, that leaving the account-opening flow on Standard is the correct call rather than an unexamined default.

What actually matters

Before reaching for a migration plan, it’s worth naming what Step Functions actually charges for and what it actually guarantees. The two flavours have different pricing meters: one counts the arrows between states, the other counts execution requests and compute time. The same state machine therefore has two completely different bill shapes. On the per-arrow meter, a ten-state workflow costs ten times as much per execution as a one-state workflow; on the per-request meter, the request charge is the same whatever the state count, and only duration moves the bill.

Ownership of the two workloads probably sits with different teams: an IoT/data platform team owns the telemetry flow, a product or compliance team owns the account-opening flow. The team that owns the IoT flow cares about throughput per second; the team that owns account-opening cares about audit retention and human-approver UX. The two teams have different worries, and the same orchestration service has two flavours that match those worries exactly. That is the cleanest sign that the current “both on Standard” setup is an accident of defaults, not a design.

Blast radius of getting the IoT flow wrong is the monthly bill. Blast radius of getting the account-opening flow wrong is a regulator asking why there’s no audit trail for Mrs Patel’s KYC review from March 2024. The second is career-limiting. That asymmetry is why the workloads don’t get to share a flavour just because they happen to run on the same service.

Cost shape is the IoT flow’s whole story. At 2 billion executions a month, the per-arrow meter produces a six-figure monthly line item; the per-request-plus-duration meter produces a four-figure one. Two orders of magnitude on the same workload is not a rounding difference; it is “pick the correct flavour or pay for it forever”.

Failure modes are the account-opening flow’s whole story. A step that provisions a brokerage account, charges a setup fee, or mails a welcome pack is not idempotent in the business sense: a duplicate run is a duplicate account, a duplicate charge, a duplicate email. One flavour guarantees each state runs exactly once; the other guarantees each state runs at least once (and does run more than once in practice on certain transient failures). The contract matters: if a step can’t safely run twice, the at-least-once flavour is the wrong fit regardless of duration and throughput.

Coupling between the Amazon States Language definition and the flavour is surprisingly low. A state machine defined for one flavour can usually be ported to the other with only a handful of changes, because only a few states and integration patterns are exclusive to the durable flavour. The migration isn’t a rewrite; it’s a flavour flip plus adjustments.

What we’ll filter on

Five filters each candidate has to clear:

  1. Cost shape at scale. At 800 executions per second with ten transitions each, the pricing meter is the dominant bill.
  2. Maximum execution duration. Whichever flavour owns the account-opening flow has to tolerate human-scale waits.
  3. Execution semantics. Idempotent steps can live with at-least-once; non-idempotent steps cannot.
  4. Post-hoc execution history. Audit-facing workloads need durable per-execution history; high-volume idempotent ones don’t.
  5. Throughput at scale. The flavour under the IoT flow has to sustain 800 starts per second without throttling.
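Filters 2 to 4 are mechanical enough to write down as a check. The following is a minimal sketch, not a complete decision procedure: it ignores cost and throughput (filters 1 and 5), and the `Workload` fields and names are invented for illustration.

```python
from dataclasses import dataclass

EXPRESS_MAX_SECONDS = 5 * 60  # Express Workflows cap at five minutes


@dataclass
class Workload:
    name: str
    max_duration_seconds: float
    steps_idempotent: bool
    needs_durable_history: bool


def express_blockers(w: Workload) -> list[str]:
    """Return the reasons a workload cannot move to Express (empty list = eligible)."""
    blockers = []
    if w.max_duration_seconds > EXPRESS_MAX_SECONDS:
        blockers.append("exceeds five-minute cap")
    if not w.steps_idempotent:
        blockers.append("steps are not idempotent")
    if w.needs_durable_history:
        blockers.append("needs durable execution history")
    return blockers


iot = Workload("iot-telemetry", 1.2, True, False)
account_opening = Workload("account-opening", 6 * 7 * 24 * 3600, False, True)

for w in (iot, account_opening):
    reasons = express_blockers(w)
    print(w.name, "-> Express" if not reasons else f"-> Standard ({', '.join(reasons)})")
```

Run against the two workloads in this post, the IoT flow clears all three checks and the account-opening flow fails all three.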

The workflow-orchestration landscape

Step Functions Standard. Durable state machine. Exactly-once execution of each state. Up to one year of runtime per execution. Full execution history retained for 90 days in the console, plus optional CloudWatch Logs and CloudTrail management-event logging. Supports .waitForTaskToken for human-in-the-loop and external callbacks. Billed per state transition at $0.025 per 1,000 transitions. Async invocation only.

Step Functions Express (Synchronous). Short-lived state machine invoked with StartSyncExecution; returns the final output to the caller. At-most-once execution (the asynchronous shape is at-least-once). Maximum duration five minutes. No console execution history; CloudWatch Logs optional but the only way to see what happened. Billed at $1.00 per million requests plus $0.00001667 per GB-second of duration. Commonly fronted by API Gateway for real-time request/response.

Step Functions Express (Asynchronous). Same engine, invoked with StartExecution: the caller gets an execution ARN and the workflow runs in the background. At-least-once execution, same five-minute ceiling, same per-request-plus-GB-second pricing. The correct Express shape when the caller doesn’t need the result inline: event-driven triggers from EventBridge, SQS, or Kinesis, or any fire-and-forget producer.
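The difference between the two Express invocation shapes shows up most clearly at the call site. The sketch below assumes `sfn` is a `boto3.client("stepfunctions")` object, passed in as a parameter so the functions carry no AWS dependency of their own; the wrapper names are illustrative.

```python
import json


def start_fire_and_forget(sfn, state_machine_arn: str, event: dict) -> str:
    """Asynchronous shape: StartExecution returns an execution ARN immediately."""
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(event),
    )
    return response["executionArn"]


def run_and_wait(sfn, state_machine_arn: str, event: dict) -> dict:
    """Synchronous Express shape: StartSyncExecution blocks until the workflow ends."""
    response = sfn.start_sync_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(event),
    )
    if response["status"] != "SUCCEEDED":
        raise RuntimeError(f"workflow ended with status {response['status']}")
    return json.loads(response["output"])
```

An EventBridge-triggered producer only ever needs the first function; an API-style caller that must return the workflow result needs the second.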

SQS + Lambda (DIY orchestration). Build the state machine yourself out of queues and functions. No visual history, no native .waitForTaskToken, no Parallel / Choice primitives; every control-flow concept turns into code and a queue topology. Cheap at tiny volumes; unmanageable as the graph grows.

Amazon SWF. The older orchestration service Step Functions was built to replace. AWS explicitly recommends Step Functions for new work; no reason to pick it for a greenfield workload.

Side by side

Option               | Cost at 800/sec          | Max duration | Exactly-once      | Visual history | Throughput at scale
Standard             | ✗ (transitions dominate) | ✓ (1 year)   | ✓                 | ✓ (90 days)    | ~ (account quotas)
Express Synchronous  | ✓                        | ✗ (5 min)    | ✗ (at-most-once)  | ✗ (logs only)  | ✓
Express Asynchronous | ✓                        | ✗ (5 min)    | ✗ (at-least-once) | ✗ (logs only)  | ✓
SQS + Lambda         |                          |              |                   |                |
SWF                  |                          |              |                   |                |

Matching workloads to flavours

[Figure: two panels. Left: IoT data processing (~800 exec/sec, ~1.2 s each, idempotent; EventBridge start, Parallel enrich × 3, Choice, DynamoDB put, SNS publish) matched to Express Asynchronous at ≈ $4,700 / month, against ≈ $517,500 on Standard. Right: account opening (~5,000 exec/month, days to weeks, audited; signup webhook, KYC check, waitForTaskToken human approval, Choice, provision account, send welcome) matched to Standard at ≈ $1.88 / month, with executions archived to S3 for long-horizon audit.]
Same service, two flavours, opposite requirement profiles. Left: Express Asynchronous sheds a per-transition bill that would be fatal at 800 per second; at-least-once is fine because every step is idempotent. Right: Standard keeps exactly-once, one-year duration, and durable execution history that a human can open months later.

Moving the IoT flow onto Express, in depth

The migration is surprisingly light. The Amazon States Language definition is compatible across flavours for the states the IoT flow uses (Task, Parallel, Choice, Pass, Succeed, Fail), with a handful of exceptions worth calling out.

.waitForTaskToken is Standard-only; the IoT flow doesn’t use it. The .sync (run-a-job) integration pattern is also unavailable on Express; the IoT flow’s integrations (DynamoDB, SNS) are plain request-response calls and work on both. Retries still work, but the durability contract at the top-level execution boundary is at-least-once. In practice this matters when an Express workflow started via StartExecution is re-run internally after certain transient failures: steps that already ran in the first attempt run again in the second. The IoT flow’s idempotency is what makes this safe.
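Checking a definition for Standard-only integration patterns can be automated before any flavour flip. The sketch below walks an ASL definition (as a Python dict) and reports Task states whose Resource ends in a callback or run-a-job suffix. The helper name, the abbreviated `iot_flow` definition, and its state names are illustrative, and Parameters are omitted for brevity.

```python
STANDARD_ONLY_SUFFIXES = (".waitForTaskToken", ".sync", ".sync:2")


def standard_only_states(definition: dict) -> list[str]:
    """Names of Task states whose Resource uses a pattern Express does not support."""
    found = []
    for name, state in definition.get("States", {}).items():
        resource = state.get("Resource", "")
        if state.get("Type") == "Task" and resource.endswith(STANDARD_ONLY_SUFFIXES):
            found.append(name)
        # Recurse into Parallel branches and Map item processors.
        for branch in state.get("Branches", []):
            found.extend(standard_only_states(branch))
        for key in ("ItemProcessor", "Iterator"):
            if key in state:
                found.extend(standard_only_states(state[key]))
    return found


# Abbreviated stand-in for the IoT flow (one enrichment branch shown).
iot_flow = {
    "StartAt": "Enrich",
    "States": {
        "Enrich": {
            "Type": "Parallel",
            "Branches": [{
                "StartAt": "Geo",
                "States": {"Geo": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "End": True,
                }},
            }],
            "Next": "Store",
        },
        "Store": {"Type": "Task", "Resource": "arn:aws:states:::dynamodb:putItem", "Next": "Notify"},
        "Notify": {"Type": "Task", "Resource": "arn:aws:states:::sns:publish", "End": True},
    },
}

print(standard_only_states(iot_flow))
```

An empty list means nothing in the definition depends on a callback or run-a-job pattern; it says nothing about duration or idempotency, which still need checking separately.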

CloudWatch Logs is no longer optional in spirit. Standard gives an execution timeline in the console by default; Express does not. Enable logging on the Express state machine and choose the level deliberately: ALL at the full 800/sec firehose is expensive; ERROR plus sampled ALL for a percentage of executions is the common compromise. The log group is configured per state machine, so rotation and retention can be tuned separately for each.
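The logging settings travel as a `loggingConfiguration` structure on CreateStateMachine and UpdateStateMachine. Here is a minimal sketch of a builder for it. The helper name and the log group ARN are hypothetical, and tying `includeExecutionData` to the ALL level is a design choice made for this example, not a service rule.

```python
VALID_LEVELS = {"ALL", "ERROR", "FATAL", "OFF"}


def express_logging_config(log_group_arn: str, level: str = "ERROR") -> dict:
    """Build a loggingConfiguration dict for an Express state machine."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    return {
        "level": level,
        # Payloads can be large and sensitive; only include them when logging everything.
        "includeExecutionData": level == "ALL",
        # Step Functions expects the log group ARN to end with ':*'.
        "destinations": [{"cloudWatchLogsLogGroup": {"logGroupArn": log_group_arn}}],
    }


# Hypothetical log group for the IoT state machine.
iot_logging = express_logging_config(
    "arn:aws:logs:eu-west-1:123456789012:log-group:/aws/vendedlogs/states/iot-flow:*"
)
print(iot_logging)
```

The resulting dict would be passed as the `loggingConfiguration` argument when creating or updating the state machine.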

IAM and EventBridge integration is identical. The execution role is the same shape. The EventBridge rule targets an Express state machine via the same states:StartExecution permission; EventBridge doesn’t care which flavour sits behind the ARN.
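As a rough illustration, the permission the EventBridge rule’s role needs is a single statement. The sketch below builds that policy document as a Python dict; the helper name and the state machine ARN are hypothetical.

```python
import json


def eventbridge_start_execution_policy(state_machine_arn: str) -> dict:
    """IAM policy document letting an EventBridge rule start one state machine."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "states:StartExecution",
            "Resource": state_machine_arn,
        }],
    }


# Hypothetical ARN; the policy has the same shape for Standard and Express targets.
policy = eventbridge_start_execution_policy(
    "arn:aws:states:eu-west-1:123456789012:stateMachine:iot-flow"
)
print(json.dumps(policy, indent=2))
```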

Why account-opening stays on Standard

The three crosses on the Express side of the table are each disqualifying for this workload. Five-minute cap: an account-opening flow that sits on .waitForTaskToken waiting for a human approver cannot live inside five minutes; the workflow has to still exist when the approver gets to their inbox the next morning. At-least-once semantics: a step that provisions a brokerage account, charges a setup fee, or sends the welcome pack is not idempotent in the business sense, so a duplicate run is a duplicate account, a duplicate charge, a duplicate email. No durable history: compliance opens workflows in the console three months after they finish and wants to see every input, every output, every state transition. Standard keeps that history for 90 days in the console and lets you archive it longer (the common pattern is a terminal LambdaInvoke that exports the execution history to S3 and keeps it in Glacier for the regulator’s retention window). Express doesn’t have the history to export.
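The export step mentioned above might look something like the following sketch. It assumes `sfn` and `s3` are `boto3.client("stepfunctions")` and `boto3.client("s3")` objects passed in by the Lambda handler; the function name, bucket, and key layout are hypothetical, and moving objects to Glacier would be handled by an S3 lifecycle rule that is not shown.

```python
import json


def archive_execution_history(sfn, s3, execution_arn: str, bucket: str) -> str:
    """Copy one Standard execution's event history to S3 as a JSON document."""
    events = []
    paginator = sfn.get_paginator("get_execution_history")
    for page in paginator.paginate(executionArn=execution_arn):
        events.extend(page["events"])

    # Standard execution ARNs end in ...:execution:<state-machine-name>:<execution-name>
    key = "history/" + "/".join(execution_arn.split(":")[-2:]) + ".json"
    s3.put_object(
        Bucket=bucket,
        Key=key,
        # default=str serialises the datetime timestamps boto3 returns.
        Body=json.dumps(events, default=str).encode("utf-8"),
    )
    return key
```

One caveat with running this as the workflow’s own final step: the exported history stops at the moment of the call, so the events for that last state completing are not in the archive. Triggering the export from an execution status-change event instead is one way around that.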

On cost, Standard for this workload is a rounding error. 5,000 executions × about 15 transitions each = 75,000 transitions per month. At $0.025 per thousand, that’s under two dollars, well under the price of an engineer’s coffee. There is no budget pressure to move this workflow anywhere.

A worked example: the bill before and after

Before: both workloads on Standard. IoT at 20.7 billion transitions × $0.025 / 1,000 = roughly $517,500 per month. Account-opening at 75,000 transitions × $0.025 / 1,000 = roughly $1.88 per month. Total Step Functions bill: roughly half a million dollars a month, 99.9996% of which is the IoT flow.

After: IoT on Express Asynchronous, account-opening unchanged on Standard. IoT at 2.07 billion requests × $1.00 / 1,000,000 = roughly $2,073 per month; duration at 64 MB × 1.2 s × 2.07B = about 159 million GB-seconds × $0.00001667 ≈ $2,650 per month; total IoT bill roughly $4,700. Account-opening unchanged at $1.88. Total Step Functions bill: roughly $4,700 per month. A 99% reduction in the workflow line item, with no change to either state machine’s diagram. The trade on the IoT side is at-least-once semantics (already safe, because idempotent) and loss of the default visual execution history (already acceptable, because nobody looks after 24 hours).
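The before-and-after arithmetic for the IoT flow can be checked with a short script. This is a sketch under the post’s own assumptions: a 30-day month, 64 MB treated as 0.064 GB, and the single GB-second rate quoted above with any volume discounts ignored, which makes the Express figure an estimate on the high side if such discounts apply. Names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60
STANDARD_PER_TRANSITION = 0.025 / 1_000
EXPRESS_PER_REQUEST = 1.00 / 1_000_000
EXPRESS_PER_GB_SECOND = 0.00001667  # single rate as quoted; volume tiers ignored


def express_monthly_cost(executions, seconds_each, memory_gb):
    """Return (request charge, duration charge) for Express Workflows."""
    request_charge = executions * EXPRESS_PER_REQUEST
    duration_charge = executions * seconds_each * memory_gb * EXPRESS_PER_GB_SECOND
    return request_charge, duration_charge


executions = 800 * SECONDS_PER_MONTH
before = executions * 10 * STANDARD_PER_TRANSITION  # ten transitions per execution
request_charge, duration_charge = express_monthly_cost(executions, 1.2, 0.064)
after = request_charge + duration_charge
reduction = 1 - after / before

print(f"Standard: ${before:,.0f}/month")
print(f"Express:  ${after:,.0f}/month (requests ${request_charge:,.0f}, duration ${duration_charge:,.0f})")
print(f"Reduction: {reduction:.1%}")
```

Without the intermediate rounding the Express total comes out slightly above the $4,700 quoted here, and the reduction stays just over 99%.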

What’s worth remembering

  1. Standard charges per state transition; Express charges per request plus GB-seconds. At 800/sec with ten transitions each, the transition meter dominates; the per-request meter does not.
  2. Express Workflows cap at five minutes of wall time. Anything that waits on a human, waits on a queue for more than a few minutes, or genuinely runs long stays on Standard.
  3. Standard guarantees exactly-once execution of each state. Asynchronous Express guarantees at-least-once, and Synchronous Express at-most-once. Idempotent workloads are safe on Express; non-idempotent workloads that must run exactly once are not, regardless of how short they run.
  4. Standard keeps visual execution history for 90 days in the console. Express has none: only CloudWatch Logs, and only if they were enabled. Audit-facing workloads need Standard.
  5. Express has two invocation shapes: StartSyncExecution returns the result to the caller; StartExecution is fire-and-forget. Event-driven producers want async; API-Gateway-fronted real-time callers want sync.
  6. ASL is largely portable between flavours, but the .waitForTaskToken and .sync integration patterns are Standard-only. Check the state list before assuming a state machine will move across unchanged.
  7. The correct answer per workload is often “one of each”. A single account can run Standard and Express state machines side by side; nothing forces anyone to commit to one flavour across the estate.
  8. SQS + Lambda is a valid pattern for single-step work; it stops being a sensible orchestration substitute as soon as Parallel, Choice, or retries enter the picture. SWF is a legacy choice; there is no reason to reach for it.
  9. Migrating to Express is a durability and observability decision, not a diagram decision. The state machine looks the same; the guarantees around it do not.
  10. Enable CloudWatch Logs on Express workflows before anyone asks why. ERROR plus sampled ALL at a percentage is the common production shape; ALL at 800/sec is not free.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.