The situation
An ML team runs a fraud-classifier retrain every night. The workflow pulls yesterday’s labelled transactions from S3, feature-engineers with a Processing job, trains an XGBoost model, evaluates against a holdout split, conditionally registers the model if AUC exceeds the production threshold of 0.94, and triggers a deploy Lambda if and only if registration happened.
On top of the DAG, three cross-cutting requirements: lineage tracking (given a model package, identify the TrainingThe process of fitting a model’s weights to data by minimising a loss function.
dataset, processing job, and hyperparameters), automatic retries on transient failures (throttled CreateTrainingJob, brief insufficient-capacity), and visibility into the last fourteen nights of executions with per-step timing without bespoke dashboards.
What actually matters
Before picking an orchestrator, worth thinking about what distinguishes an ML workflow from a general workflow, and why those differences change what “good” looks like.
An ML pipeline has a different relationship with its artefacts than, say, a nightly ETL or a fan-out event processor. The output of each step isn’t just a value that gets passed along, it’s a first-class thing that will be interrogated later. Which dataset produced this model? Which evaluation run justified that approval? Which training job’s weights ended up in production? These questions get asked on real mornings, usually when something is failing and the team is trying to rewind time. The orchestrator that makes those answers a query rather than an investigation saves hours every time something goes wrong.
The second observation is that the steps themselves are ML-shaped: a Processing job takes minutes to hours, a Training job does the same, an evaluation writes a JSON file somebody needs to read. These are not millisecond Lambdas. The orchestrator needs to babysit long-running jobs on the team’s behalf, poll to completion, surface failures, retry the flaky ones, without the pipeline author writing those loops themselves.
The third is branching on real numbers. “Register the model if AUC is above 0.94” is a cheap sentence that requires pulling a float out of a JSON file produced by the previous step and comparing it. Every orchestrator can do this with enough code; the question is whether the orchestrator has a first-class way to do it (a typed condition step) or whether the team writes a Lambda to read a file and make the call.
The fourth is retry semantics. Transient failures in AWS ML APIs are a fact of life – Step.THROTTLING on a DescribeTrainingJob call, SageMaker.CAPACITY_ERROR when a specific instance type is momentarily gone. An orchestrator that classifies these by exception type and retries with exponential backoff costs nothing to configure and removes a class of 02:00 pages. One that requires the team to write retry logic around every API call costs real code and tends to grow subtly different retry behaviours in different steps.
The fifth is operational posture. Can this orchestrator be just another AWS resource in the account, or does it bring a persistent cluster the team has to capacity-plan, patch, and back up? A nightly pipeline is not a good reason to stand up a permanent Airflow environment, but it might be a good excuse to join one that already exists.
And the sixth is what happens when the pipeline shape grows. Does it compose with other orchestration (Step Functions for the non-ML bits, EventBridge for the triggers)? Does it have escape hatches (a Lambda step, a callback step) for the things that don’t fit its native vocabulary? Or is it an island that’s easy to reach and hard to extend?
What we’ll filter on
- ML-aware step types. Processing, training, evaluation, conditional registration as first-class concepts.
- Lineage for free, the orchestrator already knows which upstream inputs produced each artefact.
- Conditional branching inside the DAG, the “register if AUC > 0.94” decision happens inside the workflow.
- Typed retries on transient failures, throttling, internal server errors, capacity errors classified by exception type.
- Low operational overhead, no cluster, no scheduler host, no plugin ecosystem.
The orchestration landscape
SageMaker Pipelines. Purpose-built ML pipeline service. A DAG of typed steps – ProcessingStep, TrainingStep, TuningStep, ModelStep, ConditionStep, LambdaStep, TransformStep, FailStep, more. Runs visible in the Studio Pipelines UI. Every step automatically emits lineage entities (Artifacts, Actions, Contexts, Associations) into SageMaker ML Lineage Tracking. Retry policies are declarative on the step, keyed to exception types.
AWS Step Functions (Standard). General-purpose orchestrator. State machine in Amazon States Language; tasks invoke optimised service integrations including SageMaker job APIs (CreateProcessingJob.sync, CreateTrainingJob.sync) that poll to completion. Conditionals are Choice states. Retries per-state. Execution history 90 days. Can run the DAG, but lineage to ML artefacts is not automatic, the team wires SageMaker lineage entities explicitly.
AWS Step Functions (Express). Same API surface, different runtime. Max execution 5 minutes, billed per execution. Doesn’t support .sync or .waitForTaskToken. Wrong tool.
Apache Airflow on MWAA. Managed Airflow. Rich SageMaker operator family. Environments cost ~$0.49-$1.80/hour for the scheduler alone, live permanently. Lineage isn’t automatic against SageMaker’s registry. Overkill for one nightly DAG.
GitHub Actions. Nightly scheduled workflow, each step calling the SageMaker API with boto3. Trivial to start. No step-level retries for SageMaker errors, no lineage, no clean conditional registration.
Side by side
| Option | ML-aware steps | Lineage for free | Conditional branching | Typed retries | Low ops |
|---|---|---|---|---|---|
| SageMaker Pipelines | ✓ | ✓ | ✓ | ✓ | ✓ |
| Step Functions (Standard) | , | ✗ | ✓ | ✓ | ✓ |
| Step Functions (Express) | , | ✗ | ✓ | ✓ | ✗ |
| MWAA (Airflow) | ✓ | ✗ | ✓ | ✓ | ✗ |
| GitHub Actions | ✗ | ✗ | , | ✗ | ✓ |
Matching workflow shapes to orchestrators
SageMaker Pipelines, in depth
A pipeline is defined in Python with the SageMaker SDK, then uploaded (pipeline.upsert()) as a JSON definition SageMaker stores and executes. Each run is an immutable record of definition version, parameter values, and step status.
Pipeline parameters give one definition multiple runtime shapes. ParameterString, ParameterInteger, ParameterFloat, ParameterBoolean declared at definition time, passed at start_execution() time, reachable throughout the DAG by symbolic reference.
Step types used here:
ProcessingStepwraps a Processing job. Outputs’ S3 URIs are symbolically referenceable by downstream steps.TrainingStepwraps a Training job. Exposesstep_train.properties.ModelArtifacts.S3ModelArtifacts.ConditionStepevaluates conditions against earlier outputs and branches intoif_stepsorelse_steps. Conditions compose viaConditionGreaterThanOrEqualTo,ConditionEquals, etc. The left side is often aJsonGetpulling a value out of a property file.ModelStep(SDK 2.90+) is the current way to create or register a model.ModelStepwithmodel.register(...)replaces the olderRegisterModel; withmodel.create(...)replacesCreateModelStep.LambdaStepinvokes a Lambda synchronously. Default timeout 120 s, max 10 min.FailStepstops the execution with a custom error, useful inelsebranches.
Retry policies are declared per step:
"RetryPolicies": [
{
"ExceptionType": ["SageMaker.JOB_INTERNAL_ERROR", "SageMaker.CAPACITY_ERROR"],
"IntervalSeconds": 60,
"BackoffRate": 2.0,
"MaxAttempts": 3
},
{
"ExceptionType": ["Step.SERVICE_FAULT", "Step.THROTTLING"],
"MaxAttempts": 5
}
]
Five exception types are recognised: Step.SERVICE_FAULT and Step.THROTTLING (retried by default), plus SageMaker.JOB_INTERNAL_ERROR, SageMaker.CAPACITY_ERROR, SageMaker.RESOURCE_LIMIT opt-in. MaxAttempts (up to 20) or ExpireAfterMin (up to 14,400), not both.
SageMaker ML Lineage Tracking is the piece that’s hard to replicate elsewhere. Every Pipeline execution emits lineage entities automatically. Artifacts (things), Actions (activities), Contexts (groupings), Associations (the edges). When a Training job consumes a Processing job’s S3 output, an Association is written. When a registered model references a training job, another lands. Given a model package ARN, walking the graph backwards lands at the training job, the processing job, and the input S3 prefix, exactly the “which dataset produced this model?” question the team needs answered.
A worked example: the nightly retrain
Outline in Python:
from sagemaker.workflow.parameters import ParameterString, ParameterFloat
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.lambda_step import LambdaStep
from sagemaker.workflow.pipeline import Pipeline
data_date = ParameterString(name="DataDate")
auc_threshold = ParameterFloat(name="AucThreshold", default_value=0.94)
step_feat = ProcessingStep(name="Featurise", step_args=sklearn_processor.run(...))
step_train = TrainingStep(name="TrainXGB", step_args=xgb_estimator.fit(...))
step_eval = ProcessingStep(name="Evaluate", step_args=eval_processor.run(...),
property_files=[eval_report])
cond = ConditionGreaterThanOrEqualTo(
left=JsonGet(step_name="Evaluate", property_file=eval_report,
json_path="classification_metrics.auc.value"),
right=auc_threshold,
)
step_register = ModelStep(name="RegisterIfGoodEnough",
step_args=model.register(...))
step_deploy = LambdaStep(name="TriggerDeployLambda",
lambda_func=Lambda(function_arn="arn:aws:lambda:...:deploy"),
inputs={"ModelPackageArn": step_register.properties.ModelPackageArn})
step_gate = ConditionStep(name="GateOnAuc",
conditions=[cond],
if_steps=[step_register, step_deploy],
else_steps=[])
pipeline = Pipeline(name="nightly-fraud-retrain",
parameters=[data_date, auc_threshold],
steps=[step_feat, step_train, step_eval, step_gate])
EventBridge starts the pipeline at 02:00 nightly with DataDate set to yesterday. A sample run:
- 02:00.
start_pipeline_execution(...). Execution ID assigned. - 02:00-02:18.
Featuriseruns. Lineage:Artifact(input) → Action(Processing) → Artifact(train), Artifact(holdout). - 02:18-02:46.
TrainXGB. One retry onStep.THROTTLINGwhenDescribeTrainingJobis rate-limited. - 02:46-02:52.
Evaluatewritesevaluation.jsonwith AUC 0.948. - 02:52.
GateOnAucevaluates.JsonGetpulls 0.948;0.948 >= 0.94is true. - 02:52-02:53.
RegisterIfGoodEnough. Package v47 created, approval statusPendingManualApproval. - 02:53-02:54.
TriggerDeployLambda. Lambda invoked with the package ARN. - 02:54. Execution completes. Green row in Studio.
The morning-after question “which dataset produced v47?” walks the lineage graph from v47 back to the source S3 prefix in five edges, one query, zero custom plumbing.
A worse night (AUC 0.921) runs identically through Evaluate, the condition evaluates false, else_steps is empty, execution completes Succeeded with GateOnAuc skipping registration.
What’s worth remembering
- SageMaker Pipelines step catalogue:
ProcessingStep,TrainingStep,TuningStep,TransformStep,ModelStep(supersedingRegisterModel/CreateModelStep),ConditionStep,LambdaStep,CallbackStep,ClarifyCheckStep,QualityCheckStep,EMRStep,AutoMLStep,NotebookJobStep,FailStep. - Pipeline parameters: four types, declared once, passed at
start_execution, reachable by symbolic reference. - Retry policies are declarative and exception-typed. Five types; pick
MaxAttempts(up to 20) orExpireAfterMin(up to 14,400). - SageMaker ML Lineage is automatic with Pipelines. Walking a package’s lineage reaches the dataset in a handful of edges.
- The Studio Pipelines UI is the visibility requirement solved, past executions, per-step timing, I/O S3 URIs, lineage graph.
- Step Functions Standard vs Express differ on more than duration. Express can’t do
.syncor.waitForTaskToken. - SageMaker service integrations from Step Functions exist, it can run the DAG, it just doesn’t get lineage for free.
LambdaStepcaps at 10 minutes; anything longer needsCallbackStepwith async SQS handling.- Pipeline-driven registration is how “lineage for free” is actually earned. Manual
create-model-packageoutside a pipeline loses the links.
Build the retrain as a SageMaker Pipeline with ProcessingStep (featurise), TrainingStep (XGBoost), ProcessingStep (evaluate with property file), ConditionStep gating on JsonGet("auc") >= 0.94, ModelStep(register=True) in the if_steps branch, and LambdaStep for the deploy trigger. Declare retry policies on each SageMaker step covering SageMaker.JOB_INTERNAL_ERROR and SageMaker.CAPACITY_ERROR with MaxAttempts: 3 and BackoffRate: 2.0. Schedule with EventBridge passing DataDate as a parameter.