Governed Model Promotion on SageMaker

February 28, 2028 · 21 min read

ML Engineer Associate · MLA-C01 · part of The Exam Room

The situation

A healthcare-adjacent platform team trains a clinical risk-stratification model on patient event data. The retraining job runs weekly off a curated feature store, evaluates the candidate on a fixed holdout set drawn from de-identified historical records, and produces model.tar.gz plus an evaluation.json in S3. Today somebody human-in-the-loops the results by reading the metrics in Slack and then running an S3 copy command into the production prefix. The SageMaker endpoint picks up the new artefact on next reload.

That process satisfies none of the team’s four new non-negotiables:

  1. Signed human approval on every promotion: a named reviewer (a principal in a defined IAM role) records an Approved decision on this specific version before deployment runs.
  2. Documented evaluation metrics on a fixed holdout: AUC, calibration, and subgroup fairness metrics stored against the model version, not lost in a CI log that rolls over.
  3. Data lineage to the training dataset version: given any version that ever reached production, point at the exact S3 paths and dataset versions that trained it.
  4. Traceable rollback to the previously-approved version: when a deployed model misbehaves, redeploy the prior Approved version in one step and leave a record of the swap.

What actually matters

Before picking services, it’s worth thinking about what “governed promotion” actually means for a team that’s been copying files.

The first observation is that every requirement in the new list is about turning an implicit action into an auditable one. “Somebody approved this” needs to be a query, not a Slack screenshot. “This model was trained on this data” needs to be a traversable link, not a filename convention. “We rolled back to Tuesday’s model” needs to be a recorded event, not an undocumented cp. Governance isn’t about adding bureaucracy; it’s about making the actions already happening legible to somebody who wasn’t in the room.

The second observation is that every path from “approval” to “deploy” goes through some mechanical action. The question is whether that action is a shell script the team wrote (which means the team owns the bug surface and the audit trail), or a first-class AWS resource (which means AWS owns both, and the team gets to reason about the happy path). For a regulated-industry team, the fewer shell scripts touching production the better, because every shell script is a piece of unwritten policy the auditors have to read.

The third is immutability. When something ships to production and is later found to be wrong, the team needs to know exactly what shipped. Not “whatever was at that S3 key on that date”: if the key is mutable, that answer is “whatever somebody decided to put there.” The artefact reference, the evaluation metrics, and the training lineage all have to be pinned at registration time and untouchable thereafter. New training runs produce new versions; old versions stay addressable forever. That’s what makes rollback a lookup rather than an archaeology exercise.

The fourth is the human signature. “Alice approved v47 with description ‘Holdout AUC 0.942, calibration drift < 0.01 vs v46’” is a different kind of record than “Alice said OK in #ml-ops.” The difference is that the former ties an IAM principal to a specific artefact version with a timestamp and a description, in a system that CloudTrail already watches. The downstream deploy automation can read that record and be certain of what it says. No automation can read a Slack emoji.

The fifth is cross-account posture. Training accounts and production accounts are separate for a reason: it’s a security boundary, not an org-chart convention. But the boundary has to be crossable by approved artefacts without copying bytes manually, because manual copies are exactly the workflow being replaced. Whatever the registry is, it has to share cleanly across accounts without leaking unapproved versions.

The sixth is ops posture. A registry that solves all of this by running itself on EC2 behind RDS is a registry the team has to patch and back up. For a team of four, taking on that operational responsibility is a real cost. The question isn’t whether a given capability exists (several options have it) but whether the team can use it without taking on a second platform to run.

And the seventh, inevitably: what does “Approved” actually cause to happen? If approval is a database flip and the team writes a polling cron to notice it, the gate is only as good as the cron. If approval emits an event that flows through EventBridge into a Lambda that builds the new EndpointConfig, the gate is a first-class piece of infrastructure. The deploy has to be triggered by the approval, not merely permitted by it.

What we’ll filter on

  1. Approval workflow with explicit state: every candidate sits in a waiting state until a human reviewer flips it. The state is first-class data downstream automation can read, not a tag convention.
  2. Immutable versioning per model package: every registration produces a new, read-only version. Rollback is redeploying an earlier version, not reconstructing it.
  3. Automatic lineage tracking: datasets, processing jobs, training jobs, and artefacts wire into a queryable graph without the team writing the audit trail.
  4. Cross-account sharing: training and production accounts are separate; the approved model has to be reachable from production without copying bytes manually.
  5. Low operational overhead: no self-hosted control plane, no registry server to patch, no database behind a bespoke approval UI.

The governance landscape

Ad-hoc S3 with object tags. The current state, dressed up. Put the model artefact in S3 with tags like status=approved, approver=alice, auc=0.94. Downstream deploy reads the tags. Works in the sense that it runs, but tags are mutable, not versioned, not audit-grade, and carry no native approval-state machine. Lineage is whatever the team writes into additional tags or a sidecar table. Rollback is prefix archaeology.

Git-based artefact tracking (DVC etc.). Data Version Control pins large binaries to Git commits via content-addressed storage. Every model version is a commit; every dataset pointer is a DVC file. Code review on the PR that bumps the model pointer gives the human approval. Cross-account via underlying storage. Lineage is to a commit, not to SageMaker resources a production endpoint can natively consume. Studio UI is disconnected; audit trail spans two systems to correlate by hand.

MLflow self-hosted. Open-source registry, self-hosted on EC2 or ECS behind RDS. Primitives are exactly right: model versions, state machine, approval transitions as API calls, metrics attached. Cross-account is “put MLflow behind a load balancer.” The catch: the team runs the service, the database, backups, upgrades, auth integration. For a regulated-industry team, self-hosting a governance-critical system means taking responsibility for its availability and audit log.

SageMaker Model Registry. AWS’s first-party registry. Organises models into Model Package Groups (one per logical model). Each registration is a Model Package with an immutable integer version. Every package carries a ModelApprovalStatus of PendingManualApproval, Approved, or Rejected, which defaults to PendingManualApproval for pipeline registrations. Packages carry an inference specification, ModelMetrics for evaluation reports, ModelDataQuality for training stats, and custom metadata. Approval changes emit EventBridge events that a downstream Lambda can route on. Lineage wires automatically when registration comes from a SageMaker Pipeline’s ModelStep. Cross-account sharing uses AWS Resource Access Manager. Nothing self-hosted.

Side by side

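Option | Approval workflow | Immutable versioning | Automatic lineage | Cross-account | Low ops
Ad-hoc S3 + tags | No state machine; mutable tags | No; tags are not versioned | No; hand-written tags or a sidecar table | Manual copies | Cheap to set up, not audit-grade
Git-based (DVC) | PR review on the pointer bump | Commits, content-addressed storage | To a commit, not to SageMaker resources | Via underlying storage | Audit trail spans two systems
MLflow self-hosted | State transitions as API calls | Model versions | Metrics and params attached | Behind a load balancer | No; team runs service, database, backups, auth
SageMaker Model Registry | ModelApprovalStatus, three states | Immutable integer versions | Automatic via Pipeline ModelStep | AWS RAM share | Nothing self-hosted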

Matching the shape to the service

[Figure: decision chart mapping four team shapes to a registry. A regulated, AWS-native team needing signed approval, lineage, and cross-account deploys (the clinical risk model) lands on SageMaker Model Registry. A Git-first team where PR review fits lands on DVC / Git-native tracking. A team with an existing MLflow estate that already absorbs the ops cost stays on self-hosted MLflow. An early-stage prototype with no audit pressure yet can stay on ad-hoc S3 + tags and revisit before audit.]
Regulated AWS-native promotion lands on Model Registry. The alternatives serve specific shapes: code-first teams, existing MLflow estates, or early-stage experiments that haven't yet hit audit pressure.

SageMaker Model Registry, in depth

A Model Package Group is the container; a Model Package is a version. Groups live for the lifetime of the model’s existence; packages are created every time a candidate appears.

Creating a group (one-time setup):

aws sagemaker create-model-package-group \
  --model-package-group-name clinical-risk-stratifier \
  --model-package-group-description "Weekly-retrained risk score model"

Registering a version from a pipeline. A ModelStep with model.register(...) produces a new package:

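from sagemaker.workflow.model_step import ModelStep

# `model` is a sagemaker Model built with a PipelineSession, so register()
# returns step arguments instead of calling the API immediately.
# `model_metrics` is the ModelMetrics object described below.
step_register = ModelStep(
    name="RegisterCandidate",
    step_args=model.register(
        content_types=["application/json"],
        response_types=["application/json"],
        inference_instances=["ml.m5.large", "ml.m5.xlarge"],
        transform_instances=["ml.m5.large"],
        model_package_group_name="clinical-risk-stratifier",
        approval_status="PendingManualApproval",  # gate stays closed until a reviewer flips it
        model_metrics=model_metrics,
    ),
)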

model_metrics points at the evaluation.json the Evaluate step wrote to S3. AUC, subgroup accuracy, calibration. That file becomes part of the package’s permanent record.
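
What goes into evaluation.json is up to the Evaluate step. The sketch below shows one plausible shape for the file; the key names, the calibration figure, and the subgroup labels are illustrative assumptions rather than anything the pipeline prescribes (only the 0.942 AUC comes from the example approval used later in this post).

```python
import json


def build_evaluation_report(auc, calibration_error, subgroup_auc):
    """Assemble the metrics document the Evaluate step would write as evaluation.json.

    Key names here are hypothetical; pick a layout and keep it stable across versions
    so that reviewers can compare v47 against v46 field by field.
    """
    return {
        "binary_classification_metrics": {
            "auc": {"value": round(auc, 4)},
            "calibration_error": {"value": round(calibration_error, 4)},
        },
        "subgroup_metrics": {
            # Sorted so the file diffs cleanly between versions.
            group: {"auc": {"value": round(value, 4)}}
            for group, value in sorted(subgroup_auc.items())
        },
    }


report = build_evaluation_report(
    auc=0.942,
    calibration_error=0.008,  # hypothetical value
    subgroup_auc={"age_under_65": 0.947, "age_65_plus": 0.931},  # hypothetical groups
)
evaluation_json = json.dumps(report, indent=2)
```

In the pipeline, the S3 URI of this file would typically be wrapped in a MetricsSource with content type application/json and passed as ModelMetrics(model_statistics=...) from sagemaker.model_metrics, which is what the model_metrics argument above receives.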

Three approval states, one flip. PendingManualApproval is the default for pipeline-registered packages. A reviewer changes the status via the Studio UI or the API:

aws sagemaker update-model-package \
  --model-package-arn arn:aws:sagemaker:eu-west-1:111:model-package/clinical-risk-stratifier/47 \
  --model-approval-status Approved \
  --approval-description "Holdout AUC 0.942, calibration drift < 0.01 vs v46."

Who did it, when, and what they wrote land in the package’s metadata and in CloudTrail. IAM principal + ApprovalDescription + CloudTrail = signed approval.
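
Turning that record into something a deploy gate or an auditor can check might look like the sketch below. It assumes the input is the dictionary returned by DescribeModelPackage and that the reviewer shows up under LastModifiedBy; the sample response is hand-written for illustration, with a made-up role ARN, and LastModifiedBy only identifies the approver if nothing else touched the package afterwards.

```python
from datetime import datetime, timezone


def approval_record(package):
    """Reduce a DescribeModelPackage-style response to the fields an auditor asks for."""
    modified_by = package.get("LastModifiedBy", {})
    reviewer = (
        modified_by.get("IamIdentity", {}).get("Arn")
        or modified_by.get("UserProfileArn")
    )
    return {
        "version": package["ModelPackageVersion"],
        "status": package["ModelApprovalStatus"],
        "description": package.get("ApprovalDescription", ""),
        "reviewer": reviewer,
        "decided_at": package.get("LastModifiedTime"),
    }


def is_signed_approval(record):
    """Approved, attributed to a principal, and carrying a non-empty description."""
    return (
        record["status"] == "Approved"
        and bool(record["reviewer"])
        and bool(record["description"].strip())
    )


# Hand-written stand-in for boto3.client("sagemaker").describe_model_package(...).
sample_response = {
    "ModelPackageVersion": 47,
    "ModelApprovalStatus": "Approved",
    "ApprovalDescription": "Holdout AUC 0.942, calibration drift < 0.01 vs v46.",
    "LastModifiedBy": {
        "IamIdentity": {"Arn": "arn:aws:sts::111:assumed-role/ModelReviewer/alice"}
    },
    "LastModifiedTime": datetime(2028, 2, 28, 9, 30, tzinfo=timezone.utc),
}
record = approval_record(sample_response)
```

A gate built this way refuses an Approved status that arrives with an empty description, which keeps the “substantive description” habit enforceable rather than aspirational.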

Immutable versions. Once registered, the artefact reference and metadata cannot be edited. Only ModelApprovalStatus and ApprovalDescription change. New training runs produce new versions; old versions stay addressable forever.

Lineage for free with Pipelines. Every Pipeline-driven registration creates lineage entities in SageMaker ML Lineage: the S3 dataset artefact, the processing job action, the training job action, the training artefact, the model package, all joined by associations. Walk backwards from a package ARN to reach the training dataset prefix without writing any audit code.
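
The backwards walk itself is a plain graph traversal. A minimal sketch, assuming a caller-supplied function that returns the associations pointing at a given ARN; the short identifiers in the fake graph are placeholders, not real lineage ARNs.

```python
from collections import deque


def upstream_arns(start_arn, list_incoming):
    """Breadth-first walk against the direction of the lineage edges.

    list_incoming(arn) must return the association summaries whose destination
    is `arn`, each a dict with at least a "SourceArn" key.
    """
    seen = {start_arn}
    order = []
    queue = deque([start_arn])
    while queue:
        current = queue.popleft()
        for assoc in list_incoming(current):
            source = assoc["SourceArn"]
            if source not in seen:  # guard against shared ancestors and cycles
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


# Fake graph standing in for SageMaker ML Lineage: destination -> list of sources.
FAKE_EDGES = {
    "package/47": ["artifact/model-47"],
    "artifact/model-47": ["action/train-47"],
    "action/train-47": ["artifact/dataset", "artifact/image"],
}


def fake_list_incoming(arn):
    return [{"SourceArn": source} for source in FAKE_EDGES.get(arn, [])]


lineage = upstream_arns("package/47", fake_list_incoming)
```

Against the real service, list_incoming could be a thin wrapper around the SageMaker ListAssociations call filtered by DestinationArn, returning its AssociationSummaries (pagination is left out of this sketch).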

Cross-account via RAM. Share the Model Package Group with another account:

aws ram create-resource-share \
  --name clinical-risk-stratifier-prod \
  --resource-arns arn:aws:sagemaker:eu-west-1:111:model-package-group/clinical-risk-stratifier \
  --principals 222222222222

The production account can list, describe, and deploy approved versions without copying bytes.

The promotion flow, end to end

[Figure: promotion flow in four stages. Pipeline (weekly retrain: featurise, train, evaluate; evaluation.json on S3; ModelStep registers with approval_status=PendingManualApproval) → Model Registry (v47 Pending candidate with immutable artefact ref; v46 Approved, current prod; v45 Approved, prior rollback target) → Human reviewer (IAM-authenticated; UpdateModelPackage with ApprovalStatus=Approved and an ApprovalDescription; CloudTrail captures principal, timestamp, description, version) → Deploy (EventBridge rule on source aws.sagemaker with Status=Approved; Deploy Lambda builds an EndpointConfig, calls UpdateEndpoint, and tags the endpoint with the package version; prod endpoint serving v47). Rollback: invoke the Deploy Lambda with the v46 ARN. SageMaker ML Lineage (automatic): Artifact(s3://labels/2026-12/) → Action(TrainXGB-47) → Action(registration) → Artifact(v47).]
Pipeline registers, reviewer approves, EventBridge fires, Lambda deploys. Every edge is a first-class AWS resource; nothing in the flow is a tag convention or a shell script. Rollback is the same Lambda pointed at an earlier Approved version.

The Lambda is worth writing down in full. An EventBridge rule matches state-change events:

{
  "source": ["aws.sagemaker"],
  "detail-type": ["SageMaker Model Package State Change"],
  "detail": {
    "ModelPackageGroupName": ["clinical-risk-stratifier"],
    "ModelApprovalStatus": ["Approved"]
  }
}
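
To see why this pattern lets an approval through and holds back everything else, here is a deliberately simplified matcher. It is a sketch of the exact-match subset of EventBridge pattern semantics only (every key in the pattern must be present, leaf lists mean “any of these values”); prefix, numeric, and other content filters are not modelled, and the sample events carry only the fields the pattern inspects.

```python
APPROVED_RULE = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {
        "ModelPackageGroupName": ["clinical-risk-stratifier"],
        "ModelApprovalStatus": ["Approved"],
    },
}


def matches(pattern, event):
    """Return True if every key in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the corresponding part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf list: the event value must equal one of the listed values.
            return False
    return True


def sample_event(status, group="clinical-risk-stratifier"):
    """Trimmed-down stand-in for a model package state-change event."""
    return {
        "source": "aws.sagemaker",
        "detail-type": "SageMaker Model Package State Change",
        "detail": {"ModelPackageGroupName": group, "ModelApprovalStatus": status},
    }


approved_passes = matches(APPROVED_RULE, sample_event("Approved"))
rejected_passes = matches(APPROVED_RULE, sample_event("Rejected"))
```

Registration itself also produces a state-change event with status PendingManualApproval, so without the ModelApprovalStatus filter the target would fire on every new candidate.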

The target Lambda:

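import os
import time

import boto3

sm = boto3.client("sagemaker")
# Role SageMaker assumes to serve the model; the env var name is an assumption.
EXECUTION_ROLE = os.environ["EXECUTION_ROLE_ARN"]


def version(package_arn):
    # Package ARNs end in .../<group-name>/<version>
    return int(package_arn.rsplit("/", 1)[-1])


def handler(event, context):
    detail = event["detail"]
    if detail["ModelApprovalStatus"] != "Approved":
        return  # belt-and-braces; rule already filtered
    package_arn = detail["ModelPackageArn"]
    # Timestamp suffix keeps names unique, so redeploying an earlier version
    # (rollback) does not collide with the resources from its first deploy.
    suffix = f"{version(package_arn)}-{int(time.time())}"
    model_name = f"clinical-risk-{suffix}"
    config_name = f"clinical-risk-cfg-{suffix}"

    sm.create_model(
        ModelName=model_name,
        Containers=[{"ModelPackageName": package_arn}],
        ExecutionRoleArn=EXECUTION_ROLE,
    )
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 2,
            "InstanceType": "ml.m5.xlarge",
        }],
    )
    sm.update_endpoint(
        EndpointName="clinical-risk-prod",
        EndpointConfigName=config_name,
    )
    # Record which package version the endpoint now serves.
    sm.add_tags(
        ResourceArn="arn:aws:sagemaker:...:endpoint/clinical-risk-prod",
        Tags=[{"Key": "model-package-version", "Value": str(version(package_arn))}],
    )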

Three things guard the gate: the rule pattern filters on ModelApprovalStatus: Approved; the Lambda checks the status again before doing anything (belt and braces); and the endpoint gets tagged with the deployed version, so list-tags on the endpoint ARN answers “what’s running now?” in one call. (describe-endpoint returns the EndpointConfig name but not the tags.)
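
Reading the version back is a small lookup over the endpoint’s tag list. A sketch, assuming the tags arrive in the Key/Value list shape that the SageMaker ListTags call returns; the sample list is hand-written and the owner tag is hypothetical.

```python
def deployed_version(tags, key="model-package-version"):
    """Return the package version recorded on the endpoint, or None if untagged."""
    for tag in tags:
        if tag["Key"] == key:
            return int(tag["Value"])
    return None


# Hand-written stand-in for the "Tags" list of a ListTags response.
sample_tags = [
    {"Key": "owner", "Value": "ml-platform"},
    {"Key": "model-package-version", "Value": "47"},
]
running = deployed_version(sample_tags)
```

Returning None for an untagged endpoint, rather than raising, lets a monitoring job flag endpoints that were changed outside the governed path.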

Rollback, traced

Rollback is the same mechanism in reverse. The prior Approved version (v46) never went away: it is still in the registry, still carries its metrics, still has its lineage graph. Redeploy:

aws lambda invoke --function-name governed-deploy \
  --cli-binary-format raw-in-base64-out \
  --payload '{"detail": {"ModelPackageArn": "arn:aws:sagemaker:...:model-package/clinical-risk-stratifier/46", "ModelApprovalStatus": "Approved"}}' \
  response.json

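Same Lambda, same path. CloudTrail records the CreateModel, CreateEndpointConfig, and UpdateEndpoint calls the Lambda makes (and the invocation itself, if Lambda data events are enabled on the trail), and the endpoint’s version tag updates. “The way back” is addressable, not reconstructed.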
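
Choosing which version to roll back to can also be a lookup rather than a memory test. A sketch, assuming the input is the summary list that ListModelPackages returns for the group; the sample entries are hand-written, with ARNs elided the same way as above.

```python
import json


def previous_approved(packages, current_version):
    """Return the ARN of the newest Approved package older than current_version."""
    candidates = [
        p for p in packages
        if p["ModelApprovalStatus"] == "Approved"
        and p["ModelPackageVersion"] < current_version
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p["ModelPackageVersion"])["ModelPackageArn"]


def rollback_payload(package_arn):
    """Build the same event shape the deploy Lambda receives from EventBridge."""
    return json.dumps(
        {"detail": {"ModelPackageArn": package_arn, "ModelApprovalStatus": "Approved"}}
    )


ARN_PREFIX = "arn:aws:sagemaker:...:model-package/clinical-risk-stratifier/"
sample_packages = [
    {"ModelPackageVersion": 47, "ModelApprovalStatus": "Approved", "ModelPackageArn": ARN_PREFIX + "47"},
    {"ModelPackageVersion": 46, "ModelApprovalStatus": "Approved", "ModelPackageArn": ARN_PREFIX + "46"},
    {"ModelPackageVersion": 45, "ModelApprovalStatus": "Approved", "ModelPackageArn": ARN_PREFIX + "45"},
]
target = previous_approved(sample_packages, current_version=47)
payload = rollback_payload(target)
```

If v46 had been set to Rejected after the fact, the same lookup would fall through to v45, so marking a bad version Rejected also removes it from the rollback path.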

What’s worth remembering

  1. Model Package Groups (container) vs Model Packages (immutable registered version). create-model-package-group is one-time; create-model-package (or ModelStep) fires every training run.
  2. Three approval states, default is PendingManualApproval. Pipeline-driven registrations default to it unless explicitly set.
  3. UpdateModelPackage is the approval action. Takes ModelApprovalStatus and ApprovalDescription. Principal, timestamp, description in CloudTrail and package metadata.
  4. Model packages are immutable past approval status. Everything else (artefact URI, image, metrics, inference spec) is frozen at registration time.
  5. Lineage is automatic when registration comes from a Pipeline’s ModelStep. Manual create-model-package outside a pipeline loses the links.
  6. ModelMetrics captures evaluation reports against the version. Statistics, quality, bias, explainability JSON in S3, referenced from the package.
  7. Approval events come through EventBridge. Source aws.sagemaker, detail-type "SageMaker Model Package State Change", detail includes ModelApprovalStatus and ModelPackageArn.
  8. Cross-account sharing uses AWS RAM on the Model Package Group resource. Not bucket policies, not replicated S3.
  9. A deployed endpoint should be tagged with the version it’s serving. model-package-version=47 makes “what’s running?” a single list-tags call away.

Stand up a SageMaker Model Registry group for the model, change the training pipeline’s final step to ModelStep with model.register(...) and approval_status="PendingManualApproval", point model_metrics at the pipeline’s evaluation report, let ML Lineage capture the dataset-to-package graph, have reviewers approve via UpdateModelPackage with a substantive ApprovalDescription, route the "SageMaker Model Package State Change" event through an EventBridge rule filtered on ModelApprovalStatus: Approved, and target a Lambda that builds a new EndpointConfig from the package ARN and swaps the production endpoint. Cross-account separation via a RAM share. Rollback is the same Lambda pointed at a prior approved version.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.