Orchestrating a Nightly Model Retrain

February 16, 2028 · 16 min read

ML Engineer Associate · MLA-C01 · part of The Exam Room

The situation

An ML team runs a fraud-classifier retrain every night. The workflow pulls yesterday’s labelled transactions from S3, feature-engineers with a Processing job, trains an XGBoost model, evaluates against a holdout split, conditionally registers the model if AUC exceeds the production threshold of 0.94, and triggers a deploy Lambda if and only if registration happened.

On top of the DAG, three cross-cutting requirements: lineage tracking (given a model package, identify the TrainingThe process of fitting a model’s weights to data by minimising a loss function. dataset, processing job, and hyperparameters), automatic retries on transient failures (throttled CreateTrainingJob, brief insufficient-capacity), and visibility into the last fourteen nights of executions with per-step timing without bespoke dashboards.

What actually matters

Before picking an orchestrator, worth thinking about what distinguishes an ML workflow from a general workflow, and why those differences change what “good” looks like.

An ML pipeline has a different relationship with its artefacts than, say, a nightly ETL or a fan-out event processor. The output of each step isn’t just a value that gets passed along, it’s a first-class thing that will be interrogated later. Which dataset produced this model? Which evaluation run justified that approval? Which training job’s weights ended up in production? These questions get asked on real mornings, usually when something is failing and the team is trying to rewind time. The orchestrator that makes those answers a query rather than an investigation saves hours every time something goes wrong.

The second observation is that the steps themselves are ML-shaped: a Processing job takes minutes to hours, a Training job does the same, an evaluation writes a JSON file somebody needs to read. These are not millisecond Lambdas. The orchestrator needs to babysit long-running jobs on the team’s behalf, poll to completion, surface failures, retry the flaky ones, without the pipeline author writing those loops themselves.

The third is branching on real numbers. “Register the model if AUC is above 0.94” is a cheap sentence that requires pulling a float out of a JSON file produced by the previous step and comparing it. Every orchestrator can do this with enough code; the question is whether the orchestrator has a first-class way to do it (a typed condition step) or whether the team writes a Lambda to read a file and make the call.

The fourth is retry semantics. Transient failures in AWS ML APIs are a fact of life – Step.THROTTLING on a DescribeTrainingJob call, SageMaker.CAPACITY_ERROR when a specific instance type is momentarily gone. An orchestrator that classifies these by exception type and retries with exponential backoff costs nothing to configure and removes a class of 02:00 pages. One that requires the team to write retry logic around every API call costs real code and tends to grow subtly different retry behaviours in different steps.

The fifth is operational posture. Can this orchestrator be just another AWS resource in the account, or does it bring a persistent cluster the team has to capacity-plan, patch, and back up? A nightly pipeline is not a good reason to stand up a permanent Airflow environment, but it might be a good excuse to join one that already exists.

And the sixth is what happens when the pipeline shape grows. Does it compose with other orchestration (Step Functions for the non-ML bits, EventBridge for the triggers)? Does it have escape hatches (a Lambda step, a callback step) for the things that don’t fit its native vocabulary? Or is it an island that’s easy to reach and hard to extend?

What we’ll filter on

  1. ML-aware step types. Processing, training, evaluation, conditional registration as first-class concepts.
  2. Lineage for free, the orchestrator already knows which upstream inputs produced each artefact.
  3. Conditional branching inside the DAG, the “register if AUC > 0.94” decision happens inside the workflow.
  4. Typed retries on transient failures, throttling, internal server errors, capacity errors classified by exception type.
  5. Low operational overhead, no cluster, no scheduler host, no plugin ecosystem.

The orchestration landscape

SageMaker Pipelines. Purpose-built ML pipeline service. A DAG of typed steps – ProcessingStep, TrainingStep, TuningStep, ModelStep, ConditionStep, LambdaStep, TransformStep, FailStep, more. Runs visible in the Studio Pipelines UI. Every step automatically emits lineage entities (Artifacts, Actions, Contexts, Associations) into SageMaker ML Lineage Tracking. Retry policies are declarative on the step, keyed to exception types.

AWS Step Functions (Standard). General-purpose orchestrator. State machine in Amazon States Language; tasks invoke optimised service integrations including SageMaker job APIs (CreateProcessingJob.sync, CreateTrainingJob.sync) that poll to completion. Conditionals are Choice states. Retries per-state. Execution history 90 days. Can run the DAG, but lineage to ML artefacts is not automatic, the team wires SageMaker lineage entities explicitly.

AWS Step Functions (Express). Same API surface, different runtime. Max execution 5 minutes, billed per execution. Doesn’t support .sync or .waitForTaskToken. Wrong tool.

Apache Airflow on MWAA. Managed Airflow. Rich SageMaker operator family. Environments cost ~$0.49-$1.80/hour for the scheduler alone, live permanently. Lineage isn’t automatic against SageMaker’s registry. Overkill for one nightly DAG.

GitHub Actions. Nightly scheduled workflow, each step calling the SageMaker API with boto3. Trivial to start. No step-level retries for SageMaker errors, no lineage, no clean conditional registration.

Side by side

Option ML-aware steps Lineage for free Conditional branching Typed retries Low ops
SageMaker Pipelines
Step Functions (Standard) ,
Step Functions (Express) ,
MWAA (Airflow)
GitHub Actions ,

Matching workflow shapes to orchestrators

Pure ML pipeline every step is an ML step Mixed workflow ML plus DB and API calls Per-event fan-out short, high-volume Org-wide Airflow DAGs already everywhere Nightly fraud retrain Process, Train, Evaluate Condition on AUC > 0.94 Register + deploy Lambda Retrain + BI export SageMaker + RDS + third-party human approval step cross-account too Per-request enrichment ~seconds, thousands/sec no long-running tasks cost-sensitive at scale One DAG among 200 platform team owns MWAA Python DAG culture amortised ops cost All steps ML? yes All steps ML? no Duration < 5 min? yes Airflow exists? yes Lineage automatic? yes Wire lineage manually? yes High RPS, no .sync? yes Explicit lineage? yes Typed retries? yes Choice + Wait? yes Per-transition cost? yes Ops cost amortised? yes SageMaker Pipelines typed ML step catalogue lineage auto-wired ConditionStep on JsonGet Studio Pipelines UI retry by exception type Step Functions Standard .sync service integrations Choice / Wait / Parallel waitForTaskToken approvals lineage is manual 90-day execution history Step Functions Express sub-5-minute workflows per-execution pricing no .sync / waitForTaskToken high-volume fan-out wrong shape for retrains MWAA Airflow rich SageMaker operators permanent environment $0.49-$1.80/hr scheduler lineage not automatic one DAG among many
The defining question is whether the pipeline is pure-ML. If yes, Pipelines pays back its specialised vocabulary in lineage-for-free. If not, Step Functions Standard wraps SageMaker service integrations.

SageMaker Pipelines, in depth

A pipeline is defined in Python with the SageMaker SDK, then uploaded (pipeline.upsert()) as a JSON definition SageMaker stores and executes. Each run is an immutable record of definition version, parameter values, and step status.

Pipeline parameters give one definition multiple runtime shapes. ParameterString, ParameterInteger, ParameterFloat, ParameterBoolean declared at definition time, passed at start_execution() time, reachable throughout the DAG by symbolic reference.

Step types used here:

  • ProcessingStep wraps a Processing job. Outputs’ S3 URIs are symbolically referenceable by downstream steps.
  • TrainingStep wraps a Training job. Exposes step_train.properties.ModelArtifacts.S3ModelArtifacts.
  • ConditionStep evaluates conditions against earlier outputs and branches into if_steps or else_steps. Conditions compose via ConditionGreaterThanOrEqualTo, ConditionEquals, etc. The left side is often a JsonGet pulling a value out of a property file.
  • ModelStep (SDK 2.90+) is the current way to create or register a model. ModelStep with model.register(...) replaces the older RegisterModel; with model.create(...) replaces CreateModelStep.
  • LambdaStep invokes a Lambda synchronously. Default timeout 120 s, max 10 min.
  • FailStep stops the execution with a custom error, useful in else branches.

Retry policies are declared per step:

"RetryPolicies": [
  {
    "ExceptionType": ["SageMaker.JOB_INTERNAL_ERROR", "SageMaker.CAPACITY_ERROR"],
    "IntervalSeconds": 60,
    "BackoffRate": 2.0,
    "MaxAttempts": 3
  },
  {
    "ExceptionType": ["Step.SERVICE_FAULT", "Step.THROTTLING"],
    "MaxAttempts": 5
  }
]

Five exception types are recognised: Step.SERVICE_FAULT and Step.THROTTLING (retried by default), plus SageMaker.JOB_INTERNAL_ERROR, SageMaker.CAPACITY_ERROR, SageMaker.RESOURCE_LIMIT opt-in. MaxAttempts (up to 20) or ExpireAfterMin (up to 14,400), not both.

SageMaker ML Lineage Tracking is the piece that’s hard to replicate elsewhere. Every Pipeline execution emits lineage entities automatically. Artifacts (things), Actions (activities), Contexts (groupings), Associations (the edges). When a Training job consumes a Processing job’s S3 output, an Association is written. When a registered model references a training job, another lands. Given a model package ARN, walking the graph backwards lands at the training job, the processing job, and the input S3 prefix, exactly the “which dataset produced this model?” question the team needs answered.

A worked example: the nightly retrain

Outline in Python:

from sagemaker.workflow.parameters import ParameterString, ParameterFloat
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.lambda_step import LambdaStep
from sagemaker.workflow.pipeline import Pipeline

data_date = ParameterString(name="DataDate")
auc_threshold = ParameterFloat(name="AucThreshold", default_value=0.94)

step_feat  = ProcessingStep(name="Featurise", step_args=sklearn_processor.run(...))
step_train = TrainingStep(name="TrainXGB",   step_args=xgb_estimator.fit(...))
step_eval  = ProcessingStep(name="Evaluate", step_args=eval_processor.run(...),
                            property_files=[eval_report])

cond = ConditionGreaterThanOrEqualTo(
    left=JsonGet(step_name="Evaluate", property_file=eval_report,
                 json_path="classification_metrics.auc.value"),
    right=auc_threshold,
)

step_register = ModelStep(name="RegisterIfGoodEnough",
                          step_args=model.register(...))
step_deploy   = LambdaStep(name="TriggerDeployLambda",
                           lambda_func=Lambda(function_arn="arn:aws:lambda:...:deploy"),
                           inputs={"ModelPackageArn": step_register.properties.ModelPackageArn})

step_gate = ConditionStep(name="GateOnAuc",
                          conditions=[cond],
                          if_steps=[step_register, step_deploy],
                          else_steps=[])

pipeline = Pipeline(name="nightly-fraud-retrain",
                    parameters=[data_date, auc_threshold],
                    steps=[step_feat, step_train, step_eval, step_gate])

EventBridge starts the pipeline at 02:00 nightly with DataDate set to yesterday. A sample run:

  • 02:00. start_pipeline_execution(...). Execution ID assigned.
  • 02:00-02:18. Featurise runs. Lineage: Artifact(input) → Action(Processing) → Artifact(train), Artifact(holdout).
  • 02:18-02:46. TrainXGB. One retry on Step.THROTTLING when DescribeTrainingJob is rate-limited.
  • 02:46-02:52. Evaluate writes evaluation.json with AUC 0.948.
  • 02:52. GateOnAuc evaluates. JsonGet pulls 0.948; 0.948 >= 0.94 is true.
  • 02:52-02:53. RegisterIfGoodEnough. Package v47 created, approval status PendingManualApproval.
  • 02:53-02:54. TriggerDeployLambda. Lambda invoked with the package ARN.
  • 02:54. Execution completes. Green row in Studio.

The morning-after question “which dataset produced v47?” walks the lineage graph from v47 back to the source S3 prefix in five edges, one query, zero custom plumbing.

A worse night (AUC 0.921) runs identically through Evaluate, the condition evaluates false, else_steps is empty, execution completes Succeeded with GateOnAuc skipping registration.

What’s worth remembering

  1. SageMaker Pipelines step catalogue: ProcessingStep, TrainingStep, TuningStep, TransformStep, ModelStep (superseding RegisterModel / CreateModelStep), ConditionStep, LambdaStep, CallbackStep, ClarifyCheckStep, QualityCheckStep, EMRStep, AutoMLStep, NotebookJobStep, FailStep.
  2. Pipeline parameters: four types, declared once, passed at start_execution, reachable by symbolic reference.
  3. Retry policies are declarative and exception-typed. Five types; pick MaxAttempts (up to 20) or ExpireAfterMin (up to 14,400).
  4. SageMaker ML Lineage is automatic with Pipelines. Walking a package’s lineage reaches the dataset in a handful of edges.
  5. The Studio Pipelines UI is the visibility requirement solved, past executions, per-step timing, I/O S3 URIs, lineage graph.
  6. Step Functions Standard vs Express differ on more than duration. Express can’t do .sync or .waitForTaskToken.
  7. SageMaker service integrations from Step Functions exist, it can run the DAG, it just doesn’t get lineage for free.
  8. LambdaStep caps at 10 minutes; anything longer needs CallbackStep with async SQS handling.
  9. Pipeline-driven registration is how “lineage for free” is actually earned. Manual create-model-package outside a pipeline loses the links.

Build the retrain as a SageMaker Pipeline with ProcessingStep (featurise), TrainingStep (XGBoost), ProcessingStep (evaluate with property file), ConditionStep gating on JsonGet("auc") >= 0.94, ModelStep(register=True) in the if_steps branch, and LambdaStep for the deploy trigger. Declare retry policies on each SageMaker step covering SageMaker.JOB_INTERNAL_ERROR and SageMaker.CAPACITY_ERROR with MaxAttempts: 3 and BackoffRate: 2.0. Schedule with EventBridge passing DataDate as a parameter.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.