The situation
The nightly warehouse build. The DAG is small but real:
1. 01:00 UTC – extract-orders (Glue Spark job, reads the orders partition from yesterday).
2. 01:00 UTC – extract-inventory (Glue Spark job, reads inventory snapshot).
3. After (1): transform-orders (Glue Spark job, produces the orders fact).
4. After (2): transform-inventory (Glue Spark job, produces the inventory dim).
5. After (3): enrich-currency (Lambda, calls an FX API and writes a lookup CSV to S3).
6. After (3) and (4): join-fact-dim (Glue Spark, joins orders with inventory dim).
7. After (5) and (6): apply-fx (Glue Spark, applies FX rates).
8. After (7): data-quality-gate (Glue Data Quality ruleset, halts the pipeline on fail).
9. After (8) passes: manual-approval (pause for human approval via EventBridge Scheduler + Lambda + SNS; blocks until approved or 1-hour timeout).
10. After (9) approved: publish-to-warehouse (Glue Spark, writes curated Parquet and updates Glue catalog).
11. After (10): notify-stakeholders (Lambda, posts to Slack and updates a status page).
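The dependency list above can be written down as plain data and sorted into waves, which makes the available parallelism explicit before any service is chosen. This is a minimal sketch using Python's standard-library graphlib; the dictionary only restates the eleven steps and is not tied to any orchestrator.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it waits for (restating the list above).
DAG = {
    "extract-orders": set(),
    "extract-inventory": set(),
    "transform-orders": {"extract-orders"},
    "transform-inventory": {"extract-inventory"},
    "enrich-currency": {"transform-orders"},
    "join-fact-dim": {"transform-orders", "transform-inventory"},
    "apply-fx": {"enrich-currency", "join-fact-dim"},
    "data-quality-gate": {"apply-fx"},
    "manual-approval": {"data-quality-gate"},
    "publish-to-warehouse": {"manual-approval"},
    "notify-stakeholders": {"publish-to-warehouse"},
}


def execution_waves(dag):
    """Group nodes into waves; every node in a wave can run concurrently."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(DAG), start=1):
        print(number, wave)
```

The first three waves hold two nodes each; everything from apply-fx onward is strictly sequential.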
Eleven nodes. Three parallel branches. One human approval. Two Lambda steps (FX lookup and stakeholder notification). One external HTTP call. One data-quality gate that might short-circuit the run.
Today it’s a chain of shell scripts on EC2 cron, each one polling the previous Glue job’s status with the AWS CLI. It’s held together with sleep and || exit 1. The humans involved have been asking for it to be replaced for a year.
The two AWS-native options are Glue Workflows and Step Functions. They overlap in what they can orchestrate but differ in shape, expressiveness, and operational model.
What actually matters
Before reaching for a service, it’s worth being explicit about what “orchestration” means for this DAG.
Orchestration is five overlapping capabilities:
Dependency management. Job B runs after A completes. Job D runs after B and C both complete. This is the DAG. Every orchestration tool handles it. The question is how much ceremony, how much introspection, and how obvious failure handling is.
Heterogeneous step types. Can the workflow include Glue jobs, Lambda functions, Step Functions (state machines within state machines), SNS publishes, manual approvals, wait-for-event, HTTP calls, ECS tasks? Some orchestrators are one-service (all Glue); some are many-service.
Error handling. When a step fails, what happens? Retry with backoff? Retry a specific number of times then fail the workflow? Branch to a cleanup step? Alert and halt? Different failure modes for different step types: transient AWS API errors retry, data-quality failures don’t.
Observability. Does the console show the workflow as a graph with nodes coloured by state? Can I click a failing node and get to its logs? Can I see the history of runs and compare their duration and outcomes over time?
Parameterisation and reuse. Can the workflow accept parameters (date, environment, feature flag) and feed them to steps? Can parts of the workflow be reused across multiple workflows without duplication?
A sixth, separate concern: where orchestration state lives while the workflow is paused. A one-hour human-approval step is a one-hour pause; the orchestrator has to keep track of “where am I, what am I waiting for, how do I resume”. Some orchestrators do this natively; some don’t.
What we’ll filter on
Distilling into filters we can score each option against:
- Step heterogeneity (can the workflow include non-Glue steps without contortion?)
- Parallelism and branching (conditionals, parallel branches, fan-out over collections?)
- Error handling and retry (per-step retry policies, catch blocks, alternate paths on failure?)
- Long-running waits (human approval, callback tokens, wait-for-event?)
- Observability (graph visualisation, per-execution trace, history comparison?)
The orchestration landscape
1. AWS Step Functions. A general-purpose workflow service. Workflows are state machines defined in ASL (Amazon States Language, JSON-based) or via the Workflow Studio visual editor. States include Task (invoke a service integration), Choice (conditional), Parallel (branches), Map (iterate over an array), Wait (delay), Pass (transform), Succeed, Fail. Integrates with 220+ AWS services (Glue, Lambda, ECS, SNS, SQS, DynamoDB, S3, API Gateway, EMR, Athena, SageMaker, many more); the optimised integrations add a .sync mode (wait for job completion) and .waitForTaskToken (pause for external callback) where the target service supports them. Two flavours: Standard (long-running, up to 1 year, exactly-once, higher per-transition cost) and Express (short-running, up to 5 minutes, at-least-once, much cheaper).
2. AWS Glue Workflows. A Glue-specific orchestration feature. Workflows consist of Triggers (conditions that fire) and Jobs / Crawlers (things triggers start). Dependencies are expressed as “trigger fires when job X completes successfully”. Visual editor in the Glue console. Handles the happy path cleanly for all-Glue pipelines; less capable outside Glue’s world.
3. Apache Airflow / Amazon MWAA. Airflow on AWS as a managed service. Python-coded DAGs, wide operator library, strong community. Right answer when teams are already Airflow-fluent or when the DAGs include non-AWS infrastructure the community has operators for. Operational overhead even in the managed form.
4. EventBridge Scheduler + custom glue. Roll your own: Scheduler fires on a cron, Lambda kicks off the first Glue job, Lambda polls for completion, Lambda kicks off the next. Works. No graph, no observability, no error handling beyond what you code. Rarely the correct answer when Step Functions or Glue Workflows exist.
5. Step Functions from within Step Functions (nested). A single state machine can invoke another via Step Functions service integration. Useful pattern for reusing a state machine across multiple parent workflows; not really a separate option, just a composition capability.
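For a sense of what option 4 involves, here is a sketch of the polling loop that a roll-your-own Lambda (or the existing cron script) has to own. The Glue client is passed in so the function can be exercised without AWS; with boto3 it would be boto3.client("glue"), whose get_job_run call is the only API used. The function name and the 30-second interval are illustrative.

```python
import time

# Glue job run states treated here as final (the run will not progress further).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def wait_for_glue_job(glue_client, job_name, run_id, poll_seconds=30, sleep=time.sleep):
    """Poll a Glue job run until it reaches a terminal state, then return that state."""
    while True:
        response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

Retries, timeouts, and starting the next step all still have to be written around this loop, and a single Lambda invocation cannot poll for longer than Lambda's 15-minute limit.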
Side by side
| Option | Step heterogeneity | Parallelism / branching | Error handling | Long-running waits | Observability |
|---|---|---|---|---|---|
| Step Functions (Standard) | ✓ (220+ integrations) | ✓ (Parallel, Choice, Map) | Per-state Retry/Catch | ✓ (up to 1yr, task tokens) | Graph + per-step trace |
| Step Functions (Express) | ✓ | ✓ | Per-state Retry/Catch (within the 5-min cap) | ✗ (5-min cap) | CloudWatch Logs |
| Glue Workflows | Partial (Glue, Lambda via glue scripts) | ✓ (parallel triggers) | Job-level retry | ✗ (no native pause/wait) | Graph + run history |
| MWAA (Airflow) | ✓ (operators) | ✓ (Python-coded) | Task-level retries + callbacks | ✓ | Airflow UI |
| EventBridge + Lambda glue | User-coded | User-coded | User-coded | User-coded | User-coded |
Reading by workflow shape:
- All-Glue pipeline, no external calls, no human approval: Glue Workflows. Simplest for the all-Glue case; native to Glue’s own console.
- Mixed step types including Lambda, SNS, manual approval, external HTTP: Step Functions Standard. The heterogeneous-orchestration answer.
- Very short workflows with high invocation volume: Step Functions Express. Don’t pay Standard’s per-transition price for sub-5-minute flows.
- Team already on Airflow; DAGs include non-AWS infrastructure: MWAA. Tolerable managed Airflow cost in exchange for Airflow’s ecosystem.
- A DAG like the one above, eleven nodes including a manual approval: Step Functions Standard. Glue Workflows can’t pause for human approval natively; that alone moves the decision.
Glue Workflows in depth
Triggers, jobs, and crawlers as nodes. A Glue Workflow is built from three node types:
- Jobs: Glue ETL jobs (Spark, Python shell, Ray).
- Crawlers: Glue catalog crawlers.
- Triggers: conditions that start jobs or crawlers. Types are Schedule (cron), On-demand (manual), EventBridge (triggered by an event), and Conditional (fires when the named predecessor jobs or crawlers reach a given state, such as succeeded or failed, requiring either all or any of the watched conditions).
A workflow’s graph is “triggers fire jobs; jobs complete and update triggers; triggers fire next jobs”. Dependencies are implicit in trigger configuration.
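As an illustration, the dependency “join-fact-dim runs after transform-orders and transform-inventory both succeed” becomes one Conditional trigger. The sketch below only builds the arguments for the Glue CreateTrigger API (create_trigger in boto3); the workflow and trigger names are hypothetical.

```python
def conditional_trigger(workflow, name, upstream_jobs, downstream_job):
    """Arguments for a trigger that starts downstream_job once every upstream job succeeds."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Logical": "AND",  # "ANY" would fire on the first matching condition
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in upstream_jobs
            ],
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


trigger_args = conditional_trigger(
    workflow="nightly-warehouse",   # hypothetical workflow name
    name="start-join-fact-dim",     # hypothetical trigger name
    upstream_jobs=["transform-orders", "transform-inventory"],
    downstream_job="join-fact-dim",
)
# With boto3 available: boto3.client("glue").create_trigger(**trigger_args)
```

The graph never appears as a single document; it is the sum of trigger definitions like this one.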
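Parameters. Workflows support run properties: key-value pairs shared by the jobs in a workflow run, with defaults set on the workflow and values that can change during a run. A job started by a workflow receives the workflow name and run ID as arguments (read with getResolvedOptions), uses them to fetch the properties through the Glue API (GetWorkflowRunProperties), and can update them for downstream jobs (PutWorkflowRunProperties).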
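A sketch of how a job might read and extend run properties. It assumes the job was started by a workflow and therefore receives WORKFLOW_NAME and WORKFLOW_RUN_ID as job arguments; the properties themselves come from the Glue API calls get_workflow_run_properties and put_workflow_run_properties. The client is injected so the helpers can be tried without AWS, and the property key in the usage comment is hypothetical.

```python
def read_run_properties(glue_client, workflow_name, run_id):
    """Return the run properties of one workflow run as a dict of strings."""
    response = glue_client.get_workflow_run_properties(Name=workflow_name, RunId=run_id)
    return response["RunProperties"]


def add_run_property(glue_client, workflow_name, run_id, key, value):
    """Add or overwrite one run property so downstream jobs can read it."""
    glue_client.put_workflow_run_properties(
        Name=workflow_name,
        RunId=run_id,
        RunProperties={key: str(value)},  # property values are strings
    )


# Inside a Glue job (awsglue exists only in the Glue runtime):
#   args = getResolvedOptions(sys.argv, ["WORKFLOW_NAME", "WORKFLOW_RUN_ID"])
#   glue = boto3.client("glue")
#   props = read_run_properties(glue, args["WORKFLOW_NAME"], args["WORKFLOW_RUN_ID"])
#   add_run_property(glue, args["WORKFLOW_NAME"], args["WORKFLOW_RUN_ID"], "orders_row_count", 1234)
```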
What Glue Workflows don’t do well. No native wait-for-external-event. No built-in manual approval. Non-Glue steps (Lambda, SNS, SQS, ECS) aren’t orchestrated natively. You can invoke them from a Glue job’s script, but the orchestration layer doesn’t see them. Error handling is per-job retry; no catch blocks for “if this step fails, go do cleanup over there”. Conditional branching is shallow: you can fire on success or failure, but complex branching belongs elsewhere.
If the DAG is all-Glue and the happy path is what matters, Glue Workflows is clean and simple. The moment an approval or a non-Glue step shows up, the cracks widen.
Step Functions in depth
States and ASL. A state machine is a JSON document listing states and transitions. Each state has a Type, inputs, outputs, and transitions. A typical Task state:

```json
"TransformOrders": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName": "transform-orders",
    "Arguments": { "--date.$": "$.date" }
  },
  "Retry": [
    { "ErrorEquals": ["States.ALL"], "IntervalSeconds": 30, "MaxAttempts": 2, "BackoffRate": 2 }
  ],
  "Catch": [
    { "ErrorEquals": ["States.ALL"], "Next": "HandleFailure" }
  ],
  "Next": "ParallelBranch"
}
```
.sync on the resource tells Step Functions to wait for the Glue job to complete before moving on. Without it, Step Functions would start the job and immediately proceed. Retry and Catch are per-state: Retry re-runs the state before the failure is treated as final; Catch names the state that execution moves to once retries are exhausted.
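To make the Retry arithmetic concrete, the helper below computes the waits implied by IntervalSeconds, MaxAttempts, and BackoffRate: the first retry waits IntervalSeconds and each later one multiplies the previous wait by BackoffRate. It is a plain calculation rather than a Step Functions API, and it ignores the optional MaxDelaySeconds and jitter settings.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry, in order."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


print(retry_delays(30, 2, 2))  # [30, 60]
```

With the policy shown for TransformOrders, the job runs at most three times (one initial attempt plus two retries) and waits up to 90 seconds in total before Catch sends execution to HandleFailure.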
.waitForTaskToken for manual approval. A Task state can pause until a callback arrives, bounded by its TimeoutSeconds and by the execution’s own time limit. Step Functions generates a task token (available as $$.Task.Token in the context object) that the state passes to the service it invokes; some external actor (a human via a console, a Lambda handling an approval button click, etc.) calls SendTaskSuccess or SendTaskFailure with the token and the state resumes. This is how Step Functions handles human approval. The manual-approval step is a Task with waitForTaskToken, paired with a Lambda that emails a link to the approver; clicking approve invokes SendTaskSuccess, and the state machine proceeds.
Parallel state. Runs multiple branches concurrently; each branch is a sub-state-machine; the outer state doesn’t complete until all branches complete.
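A sketch of how the two extract-and-transform chains from the opening DAG could be expressed as one Parallel state, generated from Python so both branches share a helper. State names such as JoinFactDim are illustrative, Retry/Catch are omitted for brevity, and ResultPath is set to null so each state passes its input (including $.date) through unchanged.

```python
import json


def glue_task(job_name, next_state=None):
    """Task state that starts a Glue job and waits for it to finish (.sync)."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name, "Arguments": {"--date.$": "$.date"}},
        "ResultPath": None,  # serialised as null: discard the result, keep the input
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def branch(steps):
    """Chain (state_name, glue_job_name) pairs into one sequential branch."""
    states = {}
    for index, (state_name, job_name) in enumerate(steps):
        is_last = index == len(steps) - 1
        states[state_name] = glue_task(job_name, None if is_last else steps[index + 1][0])
    return {"StartAt": steps[0][0], "States": states}


extract_and_transform = {
    "Type": "Parallel",
    "Branches": [
        branch([("ExtractOrders", "extract-orders"), ("TransformOrders", "transform-orders")]),
        branch([("ExtractInventory", "extract-inventory"), ("TransformInventory", "transform-inventory")]),
    ],
    "ResultPath": None,
    "Next": "JoinFactDim",
}

print(json.dumps({"ExtractAndTransform": extract_and_transform}, indent=2))
```

One simplification to note: with this shape enrich-currency would start only after both branches finish rather than right after transform-orders, trading a little overlap for a simpler graph.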
Map state. Iterates over a JSON array, executing a sub-state-machine for each item, optionally with concurrency caps. Useful for fan-out over a list of partitions, customers, datasets, etc.
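For illustration only (the nightly DAG has no fan-out step), here is a Map state that would run a Glue job once per entry in a list of partitions, at most three at a time. The job name backfill-partition and the $.partitions input field are hypothetical; ItemProcessor is the current ASL field name, and older definitions use Iterator for the same purpose.

```python
import json

backfill_map = {
    "Type": "Map",
    "ItemsPath": "$.partitions",   # hypothetical array in the state input
    "MaxConcurrency": 3,           # concurrency cap for the fan-out
    "ItemSelector": {"partition.$": "$$.Map.Item.Value"},
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "INLINE"},
        "StartAt": "BackfillPartition",
        "States": {
            "BackfillPartition": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {
                    "JobName": "backfill-partition",  # hypothetical Glue job
                    "Arguments": {"--partition.$": "$.partition"},
                },
                "End": True,
            }
        },
    },
    "End": True,
}

print(json.dumps(backfill_map, indent=2))
```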
Choice state. Branches based on input values. Supports comparisons (equals, less-than, string-matches) and boolean combinators (And, Or, Not).
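Applied to the data-quality gate, a Choice state could route on the score the gate produces. The sketch assumes the gate leaves a numeric $.dq_score in the state input (the field the approval step reads later) and uses a made-up 0.95 threshold; the target state names are illustrative. The small evaluator is a local stand-in that mirrors how this single rule resolves, not a Step Functions API.

```python
dq_choice = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.dq_score", "NumericGreaterThanEquals": 0.95, "Next": "ManualApproval"}
    ],
    "Default": "DataQualityFailed",
}


def next_state(choice, state_input):
    """Resolve a Choice made only of NumericGreaterThanEquals rules on top-level fields."""
    for rule in choice["Choices"]:
        field = rule["Variable"].removeprefix("$.")
        if state_input[field] >= rule["NumericGreaterThanEquals"]:
            return rule["Next"]
    return choice["Default"]


print(next_state(dq_choice, {"dq_score": 0.97}))  # ManualApproval
print(next_state(dq_choice, {"dq_score": 0.80}))  # DataQualityFailed
```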
Standard vs Express. Standard is for long-running workflows (up to 1 year), with exactly-once execution, pricing per state transition, and support for every integration pattern. Express is for short workflows (5-min cap), with at-least-once execution and pricing based on the number of requests and execution duration rather than on transitions (usually far cheaper at high volume); it supports the same state types but not the .sync or .waitForTaskToken patterns. Nightly ETL is Standard; a high-frequency API-triggered workflow that does two steps is Express.
The DAG, expressed two ways
A worked state machine: the relevant ASL
The manual-approval pattern is the interesting bit. The state invokes a Lambda with the task token in its input; the Lambda sends an approval email (or Slack message); the approval link hits an API Gateway backed by another Lambda, which calls SendTaskSuccess(taskToken, output); the state machine resumes:

```json
"ManualApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "Parameters": {
    "FunctionName": "send-approval-request",
    "Payload": {
      "TaskToken.$": "$$.Task.Token",
      "Date.$": "$.date",
      "DQScore.$": "$.dq_score",
      "ApprovalContext": "Nightly ETL ready for warehouse publish"
    }
  },
  "TimeoutSeconds": 3600,
  "HeartbeatSeconds": 1800,
  "Catch": [
    { "ErrorEquals": ["States.Timeout"], "Next": "ApprovalTimedOut" }
  ],
  "Next": "PublishToWarehouse"
}
```
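TimeoutSeconds: 3600 gives the approver one hour. HeartbeatSeconds: 1800 means the state also fails with a timeout error unless something calls SendTaskHeartbeat with the task token at least every 30 minutes; this is useful when the external process is itself at risk of dying, because the heartbeat says “I’m still waiting”. In a human-approval flow nothing sends heartbeats by default, so either a scheduled process sends them while the approval is pending or HeartbeatSeconds should be omitted; otherwise the state times out after 30 minutes rather than 60. Catch routes to a cleanup/notification state if nobody approves in time.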
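The interaction between the two timers is easy to misjudge, so here is a simplified model of it in plain Python (not an AWS API). It assumes the state fails at the overall timeout or as soon as a full heartbeat window passes without a SendTaskHeartbeat call, whichever comes first.

```python
def fails_at(timeout_seconds, heartbeat_seconds, heartbeats=()):
    """Second at which the waiting state times out if no callback ever arrives.

    heartbeats: seconds since the state started at which SendTaskHeartbeat is called.
    """
    last_beat = 0
    for beat in sorted(heartbeats):
        if beat > last_beat + heartbeat_seconds or beat >= timeout_seconds:
            break  # the state had already timed out when this heartbeat arrived
        last_beat = beat
    return min(last_beat + heartbeat_seconds, timeout_seconds)


print(fails_at(3600, 1800))                # 1800: no heartbeats, fails at 30 minutes
print(fails_at(3600, 1800, [1500, 3000]))  # 3600: heartbeats keep it alive to the full hour
```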
The Lambda send-approval-request stores the task token somewhere durable (DynamoDB), emails the approver with a link embedding the token’s ID, and returns without calling SendTaskSuccess. The state machine stays paused. When the approver clicks the link, a separate Lambda reads the stored token and invokes SendTaskSuccess, and the state machine resumes from PublishToWarehouse.
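The two Lambdas described above might look roughly like this. It is a sketch under assumptions: table stands for a DynamoDB Table resource (put_item / get_item), sfn_client for boto3.client("stepfunctions"), and notify for whatever sends the email or Slack message; all three are injected so the logic can be run without AWS. The approval_id scheme, the error name, and the link format are hypothetical.

```python
import json
import uuid


def send_approval_request(event, table, notify):
    """First Lambda: store the task token, notify the approver, return without resuming."""
    approval_id = uuid.uuid4().hex  # opaque id for the link, so the token itself is never exposed
    table.put_item(Item={"approval_id": approval_id, "task_token": event["TaskToken"]})
    notify(
        f"{event['ApprovalContext']} for {event['Date']} "
        f"(DQ score {event['DQScore']}). Approve: /approve/{approval_id}"
    )
    return {"approval_id": approval_id}


def handle_approval_click(approval_id, approved, table, sfn_client):
    """Second Lambda (behind the approval link): look up the token and resume the state machine."""
    item = table.get_item(Key={"approval_id": approval_id})["Item"]
    if approved:
        sfn_client.send_task_success(
            taskToken=item["task_token"],
            output=json.dumps({"approved": True}),  # output must be a JSON string
        )
    else:
        sfn_client.send_task_failure(
            taskToken=item["task_token"],
            error="ApprovalRejected",
            cause="Approver declined the publish",
        )
```

The output passed to SendTaskSuccess becomes the result of ManualApproval, so a real definition would likely set ResultPath to keep $.date available to PublishToWarehouse.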
What’s worth remembering
- Glue Workflows is best for all-Glue happy-path pipelines. Simple, native to Glue, visual in the Glue console. The moment the DAG includes non-Glue steps or approvals, its limitations start to dominate.
- Step Functions Standard is the general-purpose orchestrator. 220+ service integrations, Retry/Catch per state, .sync for job completion, .waitForTaskToken for callbacks, Parallel/Map/Choice for structure. Most AWS-native orchestration should start here unless there’s a specific reason not to.
- Express is Standard’s short-and-cheap sibling. Up to 5 minutes, at-least-once, much cheaper at high volume. For high-frequency workflows that complete fast and don’t need long waits.
- .waitForTaskToken is how human approval works. The state pauses until an external process (Lambda, Step Functions activity worker, your own code) calls SendTaskSuccess or SendTaskFailure to resume it. Timeouts and heartbeats bound the pause.
- Retry and Catch are per-state. Declarative retry policies (error type, max attempts, backoff rate) on each state; Catch blocks route to alternate paths when retries are exhausted. Cleaner than coding retries into every Lambda or job.
- MWAA is managed Airflow. Right answer when the team already speaks Airflow or the DAGs involve non-AWS systems with strong Airflow operators. Higher operational baseline cost than Step Functions.
- Service integrations come in three flavours. Request-response (fire and continue), .sync (wait for completion), .waitForTaskToken (pause for callback). Pick based on what the downstream steps need from the step.
- Observability is where Step Functions shines. The execution graph, per-state inputs and outputs, history of runs, CloudWatch integration. Debugging a failed workflow is clicking a red node, not grepping through Lambda logs.
Glue Workflows is good where it’s good; Step Functions is good where everything else happens. The DAG in the opener has one step (manual approval) that Glue Workflows cannot natively express, and that alone moves the decision. Don’t orchestrate everything in Step Functions for its own sake; do orchestrate anything with heterogeneous steps, real error paths, or long waits there.