How to Handle Errors in a Step Functions Workflow

July 12, 2027 · 15 min read

Developer · DVA-C02 · part of The Exam Room

The situation

A fulfilment team runs a single Step Functions state machine that handles the post-order workflow. The machine’s happy path is:

  1. ChargePayment: authorises and captures payment via Stripe.
  2. ReserveStock: decrements warehouse stock in the inventory service.
  3. BookShipment: calls the carrier API to book a shipment and receive a tracking number.
  4. EmitReceipt: writes a receipt row to DynamoDB and emits an EventBridge event.

Each step is an SDK-integrated task or a Lambda. Three failure shapes have become common operational issues.

  • Transient carrier errors. The carrier API returns 503 Service Unavailable or throws a rate-limit error a few times an hour. Retrying with back-off almost always succeeds.
  • Post-reserve failure. If BookShipment fails permanently, ReserveStock has already decremented stock and ChargePayment has captured money. Both need to roll back.
  • Unknown unknowns. Occasionally a step fails in a way the team hasn’t seen before (a Lambda OOM, a carrier returning a 200 with a body the code doesn’t parse). These should open a ticket and sit in a waiting state for human review, not silently fail the workflow.

What actually matters

Step Functions’ error-handling model has two primitives that attach to Task, Parallel, and Map states: Retry and Catch. Both work from the same error taxonomy.

The error taxonomy. Step Functions represents errors as named strings. Built-in names include States.ALL (a wildcard that matches any error except the terminal States.DataLimitExceeded and States.Runtime), States.TaskFailed (the task reported failure), States.Timeout, States.Permissions, and States.DataLimitExceeded. Custom names come from a Lambda function raising an error (e.g. CarrierRateLimited raised by the BookShipment Lambda) or from a Fail state that names one in its Error field.

Retry. A list of retriers, each matching a set of error names, with IntervalSeconds, MaxAttempts, BackoffRate, MaxDelaySeconds, and (in recent versions) JitterStrategy. The first retrier whose ErrorEquals matches the error is used; attempts count up; when MaxAttempts is exhausted, the error propagates to the Catch (if any) or fails the execution.

Catch. A list of catchers, each matching a set of error names, each pointing at a Next state. When an error propagates past retries (or when there are no retries), the first matching catcher redirects the workflow to its Next state and puts the error information at a path in the state’s data (typically $.errorInfo).
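The matching rules for Retry and Catch can be sketched in a few lines of Python. This is a simplified model, not Step Functions’ implementation: the Retrier and Catcher classes and the on_error function are hypothetical names, States.ALL is treated as a plain wildcard, and back-off delays are left out.

```python
from dataclasses import dataclass


@dataclass
class Retrier:
    error_equals: list[str]
    max_attempts: int = 3      # retries allowed after the first attempt
    attempts_used: int = 0     # each retrier keeps its own count


@dataclass
class Catcher:
    error_equals: list[str]
    next_state: str


def matches(error_equals: list[str], error: str) -> bool:
    # Simplification: States.ALL matches everything; other names match exactly.
    return "States.ALL" in error_equals or error in error_equals


def on_error(error: str, retriers: list[Retrier], catchers: list[Catcher]) -> str:
    """Decide what happens after one failed attempt of a state.

    Returns "retry", "goto:<state>" or "fail".
    """
    for retrier in retriers:
        if matches(retrier.error_equals, error):
            if retrier.attempts_used < retrier.max_attempts:
                retrier.attempts_used += 1
                return "retry"
            break  # the first matching retrier is exhausted: stop retrying
    for catcher in catchers:
        if matches(catcher.error_equals, error):
            return f"goto:{catcher.next_state}"
    return "fail"


retriers = [Retrier(["CarrierRateLimited", "States.Timeout"], max_attempts=2)]
catchers = [Catcher(["States.ALL"], "Compensate")]

print(on_error("CarrierRateLimited", retriers, catchers))  # retry
print(on_error("CarrierRateLimited", retriers, catchers))  # retry
print(on_error("CarrierRateLimited", retriers, catchers))  # goto:Compensate
print(on_error("ParseError", retriers, catchers))          # goto:Compensate
```

An error with no matching retrier skips straight to the catchers, and an error with no matching catcher fails the execution.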

Together, retries and catches handle the three failure shapes.

  • Transient carrier errors → Retry with back-off on CarrierRateLimited and States.Timeout.
  • Post-reserve failure → Catch on BookShipment’s States.ALL routed to a compensation state that reverses stock and payment.
  • Unknown unknowns → Catch on States.ALL at the workflow level routed to a WaitForHumanReview state.

Side by side

Primitive                  | Handles                       | Attached to            | On match
Retry                      | Specific error names          | Task / Parallel / Map  | Retry up to N times with backoff
Catch                      | Specific error names          | Task / Parallel / Map  | Transition to named Next state with error in data
States.ALL                 | Any error                     | Used in Retry or Catch | Catch-all
TimeoutSeconds             | Task taking too long          | Task                   | Raises States.Timeout
HeartbeatSeconds           | Activity workers going silent | Activity task          | Raises States.Timeout
Parallel failure semantics | Branch fails                  | Parallel               | All other branches cancelled; Parallel’s Catch fires
Map ToleratedFailureCount  | Some items fail               | Map                    | Map continues, reports aggregate result
Fail state                 | Explicit failure              | State                  | Emits a named error and terminates

The interaction between Retry and Catch is what you have to get right.

The three patterns, built up

Pattern 1: Retry with back-off for transient errors.

"BookShipment": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": { "FunctionName": "book-shipment", "Payload.$": "$" },
  "Retry": [
    {
      "ErrorEquals": ["CarrierRateLimited", "States.Timeout"],
      "IntervalSeconds": 2,
      "MaxAttempts": 5,
      "BackoffRate": 2.0,
      "MaxDelaySeconds": 60,
      "JitterStrategy": "FULL"
    },
    {
      "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
      "IntervalSeconds": 1,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Next": "EmitReceipt"
}

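The first retrier handles the carrier-specific errors with exponential backoff: the calculated delays are 2s, 4s, 8s, 16s, and 32s, so the 60s cap only comes into play if MaxAttempts is raised. FULL jitter then picks each actual delay at random between 0 and the calculated value, which keeps many executions from retrying in lockstep. The second retrier handles Lambda’s own transient errors (throttling, service exceptions). Persistent or unknown errors fall past these and become the workflow’s concern.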
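The back-off arithmetic is easy to check by hand or in code. Here is a minimal Python sketch, assuming the delay before retry n is IntervalSeconds × BackoffRate^(n−1), capped at MaxDelaySeconds; retry_delays is a hypothetical helper, and jitter is left out so the numbers are the upper bounds.

```python
from typing import Optional


def retry_delays(interval_seconds: float, backoff_rate: float,
                 max_attempts: int,
                 max_delay_seconds: Optional[float] = None) -> list[float]:
    """Un-jittered delay before each retry, in order."""
    delays = []
    for attempt in range(max_attempts):
        delay = interval_seconds * backoff_rate ** attempt
        if max_delay_seconds is not None:
            delay = min(delay, max_delay_seconds)  # apply the cap
        delays.append(delay)
    return delays


carrier = retry_delays(2, 2.0, 5, 60)
print(carrier)       # [2.0, 4.0, 8.0, 16.0, 32.0]
print(sum(carrier))  # 62.0 seconds of waiting at most, before jitter
```

With these values a BookShipment call that keeps hitting CarrierRateLimited spends about a minute in back-off, on top of the six invocations themselves, before the error moves on.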

Pattern 2: Catch-and-compensate for post-reserve failures.

The workflow wraps BookShipment in a Catch that routes to a compensating path:

"BookShipment": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": { "FunctionName": "book-shipment", "Payload.$": "$" },
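  "Retry": [
    { "ErrorEquals": ["CarrierRateLimited", "States.Timeout"], "IntervalSeconds": 2, "MaxAttempts": 5, "BackoffRate": 2.0, "MaxDelaySeconds": 60, "JitterStrategy": "FULL" },
    { "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"], "IntervalSeconds": 1, "MaxAttempts": 3, "BackoffRate": 2.0 }
  ],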
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.errorInfo",
      "Next": "Compensate"
    }
  ],
  "Next": "EmitReceipt"
}

Compensate is a Parallel state that reverses the earlier effects:

"Compensate": {
  "Type": "Parallel",
  "Branches": [
    { "StartAt": "ReleaseStock",  "States": { "ReleaseStock":  { ... "End": true }}},
    { "StartAt": "RefundPayment", "States": { "RefundPayment": { ... "End": true }}}
  ],
  "Next": "FailExecution"
}

Each branch is idempotent: releasing stock that’s already released is a no-op, and so is refunding a payment that has already been refunded. After Compensate completes, FailExecution (a Fail state) emits a named error, OrderFulfillmentFailed. The workflow terminates unsuccessfully, but the data is consistent.

Pattern 3: Route unknown errors to human review.

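Both ReserveStock and BookShipment could raise errors the team hasn’t seen before. A top-level fallback routes unmatched errors to WaitForHumanReview. Amazon States Language has no machine-level Catch, so in practice the fallback is a States.ALL catcher repeated on each task or attached to a Parallel state that wraps the steps; for an error to reach it, the step-level catchers have to name specific errors rather than States.ALL.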

"WaitForHumanReview": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.../review-queue",
    "MessageBody": {
      "executionArn.$": "$$.Execution.Id",
      "taskToken.$": "$$.Task.Token",
      "errorInfo.$": "$.errorInfo"
    }
  },
  "ResultPath": "$.reviewOutcome",
  "Next": "ChooseAfterReview"
}

This is the callback pattern: the state machine sends a message to an SQS queue and pauses, waiting for a task token to be returned. A human (via a dashboard) or another workflow inspects the error, decides whether to retry or fail, and calls SendTaskSuccess or SendTaskFailure with the token. The state machine then transitions based on the outcome.

The three patterns compose in one state machine.

The error flow, drawn

[Figure: Fulfilment workflow: happy path, retries, compensation, fallback. Happy path: ChargePayment (Stripe capture) → ReserveStock (decrement inventory) → BookShipment (carrier API) → EmitReceipt (DDB + EB) → Succeed. Retry loop on BookShipment: CarrierRateLimited, 5× exponential backoff. Catch → Compensate (Parallel: ReleaseStock, RefundPayment, both idempotent) → FailExecution (OrderFulfillmentFailed). Top-level Catch on States.ALL → WaitForHumanReview (sqs:sendMessage.waitForTaskToken; ticket opened, execution paused) → ChooseAfterReview (retry / compensate / fail). Each mutating state has its own Catch; a workflow-level fallback covers anything the step-level catchers don't match.]
Three error-handling primitives woven into one machine: retries for transient blips, catches for known-bad with compensation, a top-level fallback for unknown-unknowns routed to a human.

Retries in depth

A retry looks tidy in JSON but has five knobs worth thinking about.

  1. IntervalSeconds: initial delay before the first retry. 1 or 2 is fine for most cases; larger when the downstream is known to be slow to recover.
  2. BackoffRate: multiplier applied on each retry. 2.0 (double each time) is common; higher values spread retries further apart at the cost of user-visible latency.
  3. MaxAttempts: total retries after the first attempt. MaxAttempts: 3 means 4 total invocations.
  4. MaxDelaySeconds: cap on the interval so exponential backoff doesn’t run away.
  5. JitterStrategy: FULL randomises each interval between 0 and the calculated value, which keeps thundering herds from synchronising on the same retry boundary.

Attempts are counted per retrier and per visit to the state: if the workflow leaves the state and later returns to it, the counts reset. When a retrier’s MaxAttempts is exhausted, the original error (States.TaskFailed, States.Timeout, or the custom name) falls through to Catch.
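To see why jitter matters, it helps to model many executions that fail at the same instant. The sketch below is an illustration only: first_retry_times is a hypothetical helper, and drawing the delay uniformly between 0 and the calculated interval is an assumed reading of FULL jitter, not a statement about Step Functions’ internals.

```python
import random


def first_retry_times(executions: int, interval_seconds: float,
                      jitter: bool, seed: int = 0) -> list[float]:
    """Delay before the first retry for each of several simultaneous failures."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    times = []
    for _ in range(executions):
        if jitter:
            delay = rng.uniform(0, interval_seconds)  # anywhere in [0, interval]
        else:
            delay = interval_seconds                  # everyone waits the same
        times.append(round(delay, 3))
    return times


no_jitter = first_retry_times(100, 2, jitter=False)
with_jitter = first_retry_times(100, 2, jitter=True)

print(len(set(no_jitter)))                 # 1: all 100 retries land together
print(min(with_jitter), max(with_jitter))  # spread across the 0-2s window
```

Without jitter the downstream service sees the same burst again two seconds later; with it, the same 100 retries arrive spread over the window.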

Compensation patterns

The compensation pattern above (Parallel of reversals) is the SAGA pattern expressed in Step Functions. Three things make it robust.

  1. Idempotent reversals. The reversal for ReserveStock must succeed whether stock was already released or never decremented. A conditional write in the shape of RELEASE stock WHERE order_id = X AND status = 'RESERVED' handles both: like ON CONFLICT DO NOTHING, it changes nothing when no reserved row matches.
  2. No ordering dependency in the reversal. Parallel reversals are cheap when the forward steps were independent. If one reversal depends on another (e.g. reversing a ship-label requires the shipment row to still exist), sequence them instead.
  3. Idempotent forward steps. Compensation is only needed when the forward step committed a side effect. A forward step that’s already idempotent-on-retry makes the compensation path simpler.
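An idempotent reversal can be as small as one conditional UPDATE. The following Python sketch uses an in-memory SQLite table to stand in for the inventory service; the reservations table, its columns, and the release_stock function are all hypothetical, chosen only to show the shape.

```python
import sqlite3


def release_stock(conn: sqlite3.Connection, order_id: str) -> int:
    """Release a reservation if one is still held; return rows changed."""
    cur = conn.execute(
        "UPDATE reservations SET status = 'RELEASED' "
        "WHERE order_id = ? AND status = 'RESERVED'",
        (order_id,),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservations (order_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO reservations VALUES ('order-1', 'RESERVED')")

print(release_stock(conn, "order-1"))  # 1: reservation released
print(release_stock(conn, "order-1"))  # 0: already released, nothing to do
print(release_stock(conn, "order-2"))  # 0: never reserved, nothing to do
```

All three calls succeed, so the ReleaseStock branch can be retried or run after a partial failure without raising an error of its own.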

Alternative compensation models exist (e.g. retriable compensation where BookShipment that failed isn’t compensated but queued for retry later) but they depend on the business’s tolerance for partial state. The team has to decide.

The callback pattern

Three ways a Step Functions state can pause.

  • .sync: the SDK call starts a long-running AWS operation and the state waits for it to complete. Used with ECS tasks, EMR steps, Glue jobs.
  • .waitForTaskToken: the state sends a message including a task token, pauses, and waits for an external caller to call SendTaskSuccess or SendTaskFailure with that token. Used when something outside Step Functions must decide the outcome.
  • Wait states (Type: Wait): pause for a fixed duration or until a timestamp; no external interaction.

.waitForTaskToken is how the human-review pattern works. The queue message contains the token; a dashboard lets a human click “approve” or “reject”, which calls SendTaskSuccess with the outcome. The state machine unblocks and branches based on the result.
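The dashboard side of the callback might look like the sketch below. It assumes sfn_client is a boto3 Step Functions client (boto3.client("stepfunctions")), whose send_task_success and send_task_failure methods accept these keyword arguments, and that the queue message has the shape built by WaitForHumanReview. The function names, the decision values, and the ReviewAbandoned error name are hypothetical.

```python
import json


def resolve_review(sfn_client, message_body: str, decision: str, reviewer: str) -> None:
    """Resume the paused execution and hand the reviewer's decision back."""
    token = json.loads(message_body)["taskToken"]
    # The output JSON lands at the state's ResultPath ($.reviewOutcome),
    # where a Choice state such as ChooseAfterReview can branch on it.
    sfn_client.send_task_success(
        taskToken=token,
        output=json.dumps({"decision": decision, "reviewer": reviewer}),
    )


def abandon_review(sfn_client, message_body: str, reason: str) -> None:
    """Fail the waiting state instead, e.g. when nobody can resolve the ticket."""
    token = json.loads(message_body)["taskToken"]
    sfn_client.send_task_failure(
        taskToken=token,
        error="ReviewAbandoned",  # hypothetical error name
        cause=reason,
    )
```

Passing the client in as an argument keeps the functions easy to exercise with a stub; in the dashboard it would be created once and reused.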

What’s worth remembering

  1. Retry handles transient errors; Catch handles propagated failures. Retry first, Catch second; the error propagates to Catch only after Retry’s MaxAttempts is exhausted (or if there’s no matching retrier).
  2. Error names are the taxonomy. Built-in names (States.Timeout, States.TaskFailed), names from the Lambda service (Lambda.ServiceException, Lambda.TooManyRequestsException), and custom names raised by the function’s own code. Match on names in ErrorEquals.
  3. States.ALL is the catch-all. Use it as the last retrier or last catcher to handle the tail. Put specific matches first.
  4. Backoff plus jitter prevents synchronised retries. With exponential backoff alone, executions that failed together retry together; JitterStrategy: FULL spreads them out.
  5. Catch routes to a state, not to an error handler. The state receives the error in its data path (ResultPath: "$.errorInfo" is idiomatic) and decides what to do.
  6. Compensate with idempotent reversals. SAGA via Parallel of reversal steps; each reversal must succeed on its own regardless of whether its target still exists.
  7. TimeoutSeconds and HeartbeatSeconds are safety nets. The task must finish within the time budget or the state fails with States.Timeout. Especially important for activity tasks where workers can go silent.
  8. .waitForTaskToken lets humans participate. State pauses; external caller returns success or failure via the token; state machine resumes.
  9. Parallel failure is atomic. If any branch raises an error (after retries/catches in that branch), the Parallel’s own Catch fires; other branches are cancelled.
  10. Map ToleratedFailureCount allows partial success. Useful for batch workflows where some items can fail without failing the batch; the Map records which.
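Point 3 is easy to get wrong when a state machine grows, and a small check can flag it before deployment. This Python sketch relies on the Amazon States Language rule that States.ALL must appear alone in ErrorEquals and only in the last retrier or catcher; check_error_handlers is a hypothetical helper, not part of any AWS tooling.

```python
def check_error_handlers(handlers: list[dict]) -> list[str]:
    """Return problems with States.ALL placement in a Retry or Catch list."""
    problems = []
    last = len(handlers) - 1
    for i, handler in enumerate(handlers):
        names = handler.get("ErrorEquals", [])
        if "States.ALL" not in names:
            continue
        if len(names) > 1:
            problems.append(f"entry {i}: States.ALL must appear alone")
        if i != last:
            problems.append(f"entry {i}: States.ALL must be the last entry")
    return problems


good = [
    {"ErrorEquals": ["CarrierRateLimited"], "Next": "Compensate"},
    {"ErrorEquals": ["States.ALL"], "Next": "WaitForHumanReview"},
]
bad = [
    {"ErrorEquals": ["States.ALL", "States.Timeout"], "Next": "WaitForHumanReview"},
    {"ErrorEquals": ["CarrierRateLimited"], "Next": "Compensate"},
]

print(check_error_handlers(good))  # []
print(check_error_handlers(bad))
```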

Retries for transient; catches-to-compensate for known bad; top-level fallback to a human for unknown unknowns. Three primitives, three failure shapes, one state machine that tolerates each. The work isn’t catching everything, it’s being explicit about what’s retried, what’s compensated, and what escalates.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.