The situation
A consumer-credit team runs a weekly XGBoost retraining pipeline on SageMaker. Each week produces:
- A training dataset assembled from two upstream feature stores and a manual exclusion list.
- A training job, producing a model artifact in S3.
- A model registered in SageMaker Model Registry with a versioned `ModelPackage`.
- A deployment to a real-time endpoint, potentially a blue/green update.
The regulator wants, for a specific loan application declined on 12 March:
- Which model version served that inference?
- Which training dataset produced that model version?
- Which upstream feature store versions were included in that dataset?
- What were the hyperparameters and the evaluation metrics at training time?
- Who approved the model for deployment, and when?
Today, answering all five takes a compliance engineer two days of cross-referencing logs. Half the links live in S3 paths, half in email approvals, and one key link (“which feature store snapshot did the dataset include?”) was never recorded.
SageMaker Lineage Tracking is the service designed for exactly this. The question on the desk is what it does natively, what it doesn’t, and how to use it so a regulatory request is answered in minutes, not days.
What actually matters
The temptation is to bolt on a catalogue (a spreadsheet or a wiki page that maps deployments to datasets to approvers). That breaks the first time somebody forgets to update it. The alternative is a mechanism that builds the graph automatically as a side effect of training and deploying models, so the graph is always current by construction.
What such a mechanism has to offer breaks into four parts:
A typed graph model. Datasets, model artifacts, evaluation reports as nodes; training jobs and processing jobs as events that consumed inputs and produced outputs; model packages and pipeline executions as logical groupings; and typed edges linking the lot.
Automatic capture from native operations. When training, processing, and registration happen via the platform’s SDKs and pipeline primitives, the graph gets populated as a side effect: no extra calls, no extra discipline.
Manual capture for the external facts. The regulator approval, the upstream feature-store snapshot IDs, the code commit: these don’t live in the training platform and have to be associated explicitly. The mechanism has to make that cheap to do.
A query API that walks the graph. Ancestors, descendants, shortest paths between two nodes, filters by type. The output has to be serialisable for a compliance report and small enough that a regulator-response script can run it in under a minute.
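To make the four parts concrete, here is a minimal in-memory sketch of a typed graph with an ancestor walk. It is a toy model of the idea, not SageMaker's implementation: the `LineageGraph` class and the node identifiers are hypothetical, and only the type labels (Artifact, Action, Context) and edge names mirror the vocabulary used in this article.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Toy lineage graph: typed nodes joined by typed, directed edges."""
    node_types: dict = field(default_factory=dict)  # node id -> type label
    parents: dict = field(default_factory=dict)     # node id -> [(parent id, edge type)]

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type
        self.parents.setdefault(node_id, [])

    def add_edge(self, source, destination, edge_type):
        # Edges follow data flow: `source` fed into `destination`.
        self.parents[destination].append((source, edge_type))

    def ancestors(self, start, node_type=None):
        """Breadth-first walk upstream from `start`, optionally filtered by type."""
        seen, queue, found = {start}, deque([start]), []
        while queue:
            current = queue.popleft()
            for parent, _edge_type in self.parents[current]:
                if parent in seen:
                    continue
                seen.add(parent)
                queue.append(parent)
                if node_type is None or self.node_types[parent] == node_type:
                    found.append(parent)
        return found


graph = LineageGraph()
for node_id, node_type in [
    ("fs-accounts", "Artifact"),
    ("dataset", "Artifact"),
    ("training-job", "Action"),
    ("model-package", "Context"),
]:
    graph.add_node(node_id, node_type)

graph.add_edge("fs-accounts", "dataset", "ContributedTo")
graph.add_edge("dataset", "training-job", "ContributedTo")
graph.add_edge("training-job", "model-package", "Produced")

all_ancestors = graph.ancestors("model-package")
artifact_ancestors = graph.ancestors("model-package", node_type="Artifact")
print(all_ancestors)
print(artifact_ancestors)
```

The type filter is applied while walking rather than afterwards, so a filtered query still traverses through nodes of other types to reach the ones it reports.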
Three things matter for getting the graph complete:
What’s captured by default. Training jobs run through the platform’s SDKs capture training inputs, output artifacts, and (if registered) the resulting model package. Processing jobs capture input and output storage paths. Pipeline executions capture the full DAG of steps with correct relationships. Endpoint deployments capture the association between endpoint and model package.
What isn’t captured by default. Upstream data provenance (which row from which feature store? which exclusion list version?) is not auto-discovered; the platform only sees the final input URI. Training that happens outside the platform’s SDKs creates no lineage at all. Approval workflows live in the model registry’s status fields but not always as first-class graph associations.
What the team has to add. Associations from the training dataset to the upstream feature-store versions; association from the model package to the code commit that produced the training code; association from the deployment to the approval record. These are small, but they have to be present before the regulator asks.
What we’ll filter on
- Auto-captured on training: does lineage form as a side effect of normal platform operations?
- Captures data lineage upstream of training: feature store versions, raw data, exclusion lists?
- Captures code and approval: Git commit, approver identity?
- Queryable by node identifier: ancestors, descendants, shortest path between two nodes?
- Integrates with the model registry: model packages as first-class lineage entities?
The lineage landscape
1. SageMaker Lineage Tracking (native). Automatic capture for SageMaker training, processing, pipelines, and endpoints. Manual association API for external entities. Query API for graph traversal. Integrates with Model Registry. Covers training through deployment; upstream-data lineage and code/approval lineage require explicit association calls.
2. SageMaker Pipelines + its lineage. Pipelines add structure on top of Lineage Tracking: each Pipeline execution is a Context, each step is an Action, each input/output is an Artifact. Because the step DAG is explicit, the lineage is more complete than ad-hoc training. Highly recommended for workflows that need auditability.
3. Third-party lineage tools (OpenLineage, MLflow, DVC). Cross-system lineage standards that can be wired into SageMaker via emitters or adapters. Useful if the organisation has lineage needs spanning AWS and non-AWS systems. Adds operational overhead and usually a second graph.
4. AWS Glue Data Catalog + Lake Formation lineage. Covers the data side better than SageMaker does: table-level and column-level lineage for Glue, Athena, EMR. Doesn’t cover model training. Complement to Lineage Tracking if the feature pipelines live in Glue.
5. Hand-built audit table in DynamoDB / a wiki. The status quo. Every step writes a row. Works until somebody forgets to write a row. Doesn’t survive the first regulator request.
Side by side
| Option | Auto on training | Upstream data | Code + approval | Graph query | Model Registry integration |
|---|---|---|---|---|---|
| SageMaker Lineage Tracking | ✓ | via associations | via associations | ✓ | ✓ |
| SageMaker Pipelines + Lineage | ✓ (DAG-level) | partial | via associations | ✓ | ✓ |
| OpenLineage / MLflow | partial | ✓ (if configured) | via events | ✓ (own API) | partial |
| Glue + Lake Formation | ✗ (no model side) | ✓ (data side) | ✗ | ✓ (data only) | ✗ |
| Hand-built audit table | ✗ | manual | manual | brittle | ✗ |
Reading by need:
- Answer “which dataset → which model → which endpoint” at regulator speed: Lineage Tracking + Model Registry, with Pipelines if the workflow is more than one or two steps.
- Answer “which upstream feature source”: Lineage Tracking plus explicit `AddAssociation` calls from the dataset artifact to the Feature Store version identifier, run once per training dataset assembly.
- Answer “who approved this deployment”: Model Registry approval status carries the IAM identity; adding an approval `Artifact` tied to the model package makes it trivially queryable.
The lineage graph shape
The pick in depth
Run training as a SageMaker Pipeline; register the model; explicitly associate the three lineage gaps. Everything else is automatic.
Pipeline. A SageMaker Pipeline defines three steps: `ProcessingStep` for dataset assembly, `TrainingStep` for XGBoost training, and a model-registration step (`RegisterModel` in the Python SDK) for Model Registry. Running the pipeline via `pipeline.start()` creates a pipeline execution, each step becomes an Action linked to its input and output artifacts, and the resulting model package is a Context linked back through the steps.
Upstream feature associations. The dataset-assembly step reads from two Feature Store groups. After the ProcessingStep, a small script calls:
```python
from sagemaker.lineage import association, artifact

# dataset_arn: ARN of the dataset artifact produced by the ProcessingStep
dataset_artifact = artifact.Artifact.load(artifact_arn=dataset_arn)

# Represent this week's snapshot of the accounts feature group as an artifact
fs_accounts = artifact.Artifact.create(
    artifact_name='fs-accounts-2027-W10',
    source_uri='arn:aws:sagemaker:...:feature-group/accounts',
    artifact_type='FeatureStoreVersion',
    properties={'version': '2027-W10'}
)

# Link the snapshot to the dataset it fed into
association.Association.create(
    source_arn=fs_accounts.artifact_arn,
    destination_arn=dataset_artifact.artifact_arn,
    association_type='ContributedTo'
)
```
Repeat for the behaviour feature store and the exclusion list CSV. Three associations per training run; ~20 lines of Python.
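One way to avoid copy-pasting the snippet three times is a small loop over the upstream sources. The sketch below is a hedged illustration: `link_upstream_sources`, `sagemaker_callables`, the artifact names, and the `ExclusionList` type label are all hypothetical, and the elided ARNs are left elided as in the original. The SDK calls are injected so the loop can be dry-run without AWS credentials.

```python
def link_upstream_sources(dataset_arn, sources, create_artifact, create_association):
    """Create one artifact per upstream source and link it to the dataset.

    `create_artifact` must return the new artifact's ARN; both callables are
    injected so the loop works against the SageMaker SDK or test stand-ins.
    """
    linked = []
    for source in sources:
        upstream_arn = create_artifact(
            artifact_name=source["name"],
            source_uri=source["uri"],
            artifact_type=source["type"],
            properties={"version": source["version"]},
        )
        create_association(
            source_arn=upstream_arn,
            destination_arn=dataset_arn,
            association_type="ContributedTo",
        )
        linked.append(upstream_arn)
    return linked


def sagemaker_callables():
    """Adapters over the same SDK calls used in the snippet above (not run here)."""
    from sagemaker.lineage import artifact, association  # imported lazily

    def create_artifact(**kwargs):
        return artifact.Artifact.create(**kwargs).artifact_arn

    def create_association(**kwargs):
        association.Association.create(**kwargs)

    return create_artifact, create_association


# Hypothetical names; URIs follow the article's examples, with "..." left elided.
SOURCES = [
    {"name": "fs-accounts-2027-W10", "type": "FeatureStoreVersion", "version": "2027-W10",
     "uri": "arn:aws:sagemaker:...:feature-group/accounts"},
    {"name": "fs-behaviour-2027-W10", "type": "FeatureStoreVersion", "version": "2027-W10",
     "uri": "arn:aws:sagemaker:...:feature-group/behaviour"},
    {"name": "exclusion-list-v14", "type": "ExclusionList", "version": "v14",
     "uri": "s3://policy/excl-v14.csv"},
]

# Dry run with recording stand-ins instead of real AWS calls.
calls = []


def fake_create_artifact(**kwargs):
    calls.append(("artifact", kwargs["artifact_name"]))
    return "arn:example:artifact/" + kwargs["artifact_name"]


def fake_create_association(**kwargs):
    calls.append(("association", kwargs["source_arn"], kwargs["destination_arn"]))


linked = link_upstream_sources(
    "arn:example:artifact/dataset", SOURCES, fake_create_artifact, fake_create_association
)
print(len(linked), "upstream artifacts linked")
```

In the real pipeline the two stand-ins would be replaced by the pair returned from `sagemaker_callables()`, and `dataset_arn` would come from the ProcessingStep's output artifact.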
Code commit association. The CodePipeline stage that launches training exposes the Git commit SHA as an environment variable. The pipeline’s first step writes that SHA to the pipeline’s parameters, and a lineage script associates a code-commit Artifact with the training Action.
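A possible shape for that lineage script, using the low-level `create_artifact` and `add_association` operations of the boto3 SageMaker client. Treat it as a sketch under assumptions: the environment variable names (`GIT_COMMIT_SHA`, `GIT_REPOSITORY_URL`, `TRAINING_ACTION_ARN`), the `CodeCommit` type label, and the helper names are hypothetical, and the client is passed in so the logic can be exercised without AWS access.

```python
import os


def commit_artifact_request(commit_sha, repository_url):
    """Build keyword arguments for the SageMaker `create_artifact` call."""
    if not commit_sha:
        raise ValueError("commit SHA is empty; refusing to record lineage without it")
    return {
        "ArtifactName": f"code-commit-{commit_sha[:7]}",
        "ArtifactType": "CodeCommit",  # free-form label chosen for this sketch
        "Source": {"SourceUri": f"{repository_url}/commit/{commit_sha}"},
        "Properties": {"commit_sha": commit_sha, "repository": repository_url},
    }


def record_commit(client, commit_sha, repository_url, training_action_arn):
    """Create the commit artifact and link it to the training Action."""
    response = client.create_artifact(**commit_artifact_request(commit_sha, repository_url))
    client.add_association(
        SourceArn=response["ArtifactArn"],
        DestinationArn=training_action_arn,
        AssociationType="ContributedTo",
    )
    return response["ArtifactArn"]


def main():
    """Entry point for the pipeline step; not invoked in this sketch."""
    import boto3  # imported lazily so the helpers above stay testable offline

    record_commit(
        boto3.client("sagemaker"),
        os.environ["GIT_COMMIT_SHA"],        # hypothetical variable name
        os.environ["GIT_REPOSITORY_URL"],    # hypothetical variable name
        os.environ["TRAINING_ACTION_ARN"],   # hypothetical variable name
    )


request = commit_artifact_request("a14f2b9", "https://github.com/acme/credit-ml")
print(request["ArtifactName"])
```

Failing loudly on an empty SHA is deliberate: a training run with no recorded commit is exactly the missing link this section is trying to prevent.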
Approval record association. When someone approves the model package in Model Registry (via the console or the UpdateModelPackage API), an EventBridge rule fires on the “SageMaker Model Package State Change” event with ModelApprovalStatus=Approved. A Lambda creates an Artifact representing the approval (including the approver’s IAM ARN from CloudTrail and the timestamp) and associates it with the Model Package Context.
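The Lambda's core logic might look roughly like the sketch below. Several things here are assumptions rather than confirmed details: that the event's `detail` carries `ModelPackageArn` and `ModelApprovalStatus`, that the EventBridge envelope's `time` is an acceptable approval timestamp, and that the CloudTrail lookup and the mapping from model package ARN to its lineage entity ARN are done by the hypothetical `lookup_approver` and `lookup_lineage_arn` callables. The `ModelApproval` type label is also made up for illustration.

```python
def approval_artifact_request(event, approver_arn):
    """Build `create_artifact` keyword arguments describing one approval."""
    model_package_arn = event["detail"]["ModelPackageArn"]
    # ".../model-package/credit-xgb/73" -> "credit-xgb-73"
    suffix = model_package_arn.split(":model-package/")[-1].replace("/", "-")
    return {
        "ArtifactName": f"approval-{suffix}",
        "ArtifactType": "ModelApproval",  # free-form label chosen for this sketch
        "Source": {"SourceUri": model_package_arn},
        "Properties": {"approver_arn": approver_arn, "approved_at": event["time"]},
    }


def handle_event(event, client, lookup_approver, lookup_lineage_arn):
    """Record an approval artifact for Approved events; ignore everything else."""
    detail = event.get("detail", {})
    if detail.get("ModelApprovalStatus") != "Approved":
        return None
    model_package_arn = detail["ModelPackageArn"]
    approver_arn = lookup_approver(model_package_arn, event["time"])
    request = approval_artifact_request(event, approver_arn)
    artifact_arn = client.create_artifact(**request)["ArtifactArn"]
    client.add_association(
        SourceArn=artifact_arn,
        DestinationArn=lookup_lineage_arn(model_package_arn),
        AssociationType="AssociatedWith",
    )
    return artifact_arn


# Illustrative event, trimmed to the fields this sketch reads.
sample_event = {
    "time": "2027-03-08T09:42:00Z",
    "detail": {
        "ModelPackageArn": "arn:aws:sagemaker:...:model-package/credit-xgb/73",
        "ModelApprovalStatus": "Approved",
    },
}
sample_request = approval_artifact_request(sample_event, "arn:aws:iam::...:user/example-approver")
print(sample_request["ArtifactName"])
```

A real Lambda entry point would wrap `handle_event` with a boto3 SageMaker client and concrete lookup functions; keeping them injected makes the approval logic unit-testable.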
Query by ARN. When the regulator asks about loan 8231:
- Look up the inference in the endpoint’s data-capture S3 prefix. The captured record contains the endpoint’s model variant name.
- Map the variant name to its Model Package ARN via the endpoint’s configuration, which lists the model behind each variant.
- Use `LineageQuery(session).query(start_arns=[model_package_arn], direction=LineageQueryDirectionEnum.ASCENDANTS, include_edges=True)`. The query returns the training dataset artifact, the training job action, the three upstream feature-store artifacts, the code-commit artifact, and the approval artifact.
- Serialise the graph to JSON for the compliance report. A script renders it as a Markdown table: “Inference X was served by model package Y (approved by Z on date D), trained on dataset W assembled from feature-store versions V1 and V2 with exclusion list E, by code commit C.”
A two-day compliance lookup becomes a 30-second script plus a one-minute review.
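The query-and-serialise steps above can be sketched as follows. This is an illustration, not the team's actual script: `query_ancestors`, `to_report_json`, and `to_markdown_table` are hypothetical helper names, the sample vertices and edges are made up, and the SDK call is wrapped in a function that is defined but not invoked so the rendering logic can run offline.

```python
import json


def query_ancestors(model_package_arn):
    """Run the upstream lineage query via the SageMaker Python SDK (needs AWS access)."""
    import sagemaker
    from sagemaker.lineage.query import LineageQuery, LineageQueryDirectionEnum

    result = LineageQuery(sagemaker.Session()).query(
        start_arns=[model_package_arn],
        direction=LineageQueryDirectionEnum.ASCENDANTS,
        include_edges=True,
    )
    vertices = [
        {"arn": v.arn, "entity": v.lineage_entity, "source": v.lineage_source}
        for v in result.vertices
    ]
    edges = [
        {"from": e.source_arn, "to": e.destination_arn, "type": e.association_type}
        for e in result.edges
    ]
    return vertices, edges


def to_report_json(start_arn, vertices, edges):
    """Serialise the subgraph deterministically for the compliance report."""
    return json.dumps(
        {
            "start": start_arn,
            "vertices": sorted(vertices, key=lambda v: v["arn"]),
            "edges": sorted(edges, key=lambda e: (e["from"], e["to"])),
        },
        indent=2,
    )


def to_markdown_table(vertices):
    """Render the vertices as a Markdown table, one row per lineage entity."""
    rows = ["| Entity | Source type | ARN |", "|---|---|---|"]
    for v in sorted(vertices, key=lambda v: v["arn"]):
        rows.append(f"| {v['entity']} | {v['source']} | {v['arn']} |")
    return "\n".join(rows)


# Made-up sample data standing in for a real query result.
sample_vertices = [
    {"arn": "arn:example:artifact/dataset", "entity": "Artifact", "source": "DataSet"},
    {"arn": "arn:example:action/training", "entity": "Action", "source": "TrainingJob"},
]
sample_edges = [
    {"from": "arn:example:artifact/dataset", "to": "arn:example:action/training",
     "type": "ContributedTo"},
]

report = to_report_json("arn:example:model-package/73", sample_vertices, sample_edges)
table = to_markdown_table(sample_vertices)
print(table)
```

Sorting before serialising keeps the report byte-stable across runs, which helps when the same query is re-run later to confirm what was sent to the regulator.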
A worked regulator query
The regulator emails on 15 September about loan application 8231, scored on 12 March.
- Compliance engineer runs `aws s3 cp` on the endpoint’s data-capture prefix for 12 March. Finds the captured JSONL line with `inferenceId=loan-8231`, `modelVariant=credit-prod-v73`.
- `aws sagemaker describe-endpoint --endpoint-name credit-prod --query EndpointConfigName` names the endpoint config, and `aws sagemaker describe-endpoint-config --endpoint-config-name <endpoint-config-name> --query 'ProductionVariants[0].ModelName'` returns the model name for variant v73 (the model name is listed on the endpoint config, not on the endpoint description). It resolves to `ModelPackageArn=arn:aws:sagemaker:...:model-package/credit-xgb/73`.
- `python lineage_trace.py --model-package arn:aws:sagemaker:...:model-package/credit-xgb/73` calls `LineageQuery` for ancestors. Output:

  ```
  ModelPackage v73 (Approved 2027-03-08 by risk-ops@company.com)
  ├── Training job: credit-xgb-2027W10-a3c7 (ended 2027-03-06 03:14 UTC)
  │   ├── Training dataset: s3://credit/train/2027-W10/ (27M rows, 180 features)
  │   │   ├── Feature group: accounts, version 2027-W10
  │   │   ├── Feature group: behaviour, version 2027-W10
  │   │   └── Exclusion list: s3://policy/excl-v14.csv
  │   └── Code commit: github.com/acme/credit-ml @ a14f2b9
  └── Approval record: risk-ops@, 2027-03-08 09:42 UTC
  ```

- Compliance exports the output as PDF, attaches it to the regulator response, and ships within the hour.
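The first step, finding the captured record, is easy to script once the JSONL files are downloaded. The sketch below assumes each data-capture line is a JSON object whose `eventMetadata` block carries the `inferenceId` supplied at invocation time; the helper name `find_capture_record` and the sample lines are hypothetical.

```python
import json


def find_capture_record(jsonl_lines, inference_id):
    """Return (line number, record) for the first line matching `inference_id`, else None."""
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("eventMetadata", {}).get("inferenceId") == inference_id:
            return line_number, record
    return None


# Made-up capture lines, trimmed to the metadata this sketch reads.
sample_lines = [
    json.dumps({"eventMetadata": {"eventId": "e-1", "inferenceId": "loan-8230",
                                  "inferenceTime": "2027-03-12T10:01:00Z"}}),
    "",
    json.dumps({"eventMetadata": {"eventId": "e-2", "inferenceId": "loan-8231",
                                  "inferenceTime": "2027-03-12T10:02:00Z"}}),
]

match = find_capture_record(sample_lines, "loan-8231")
print(match[0], match[1]["eventMetadata"]["inferenceTime"])
```

With real files, `jsonl_lines` can simply be an open file handle, since the function only iterates over lines.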
What made this possible: the graph already existed. Every training run for six months assembled the same edges because the pipeline code added them. No scrambling through notebooks or Slack threads.
What’s worth remembering
- Lineage Tracking is a graph, not a log. Entities (Artifact, Action, Context) connected by typed Associations. The graph is queryable by ARN and traversable in both directions.
- SageMaker operations create lineage automatically. Training jobs, Processing Jobs, Pipelines, Model Registry operations, and endpoint deployments all emit lineage as a side effect of the SDK calls.
- Upstream-data and code associations are manual. Feature store versions, raw data, exclusion lists, Git commits, approval records: all have to be explicitly associated via `Association.create()` or via EventBridge-driven Lambdas. Build this in once per pipeline; it’s stable.
- SageMaker Pipelines make lineage richer. A Pipeline execution is itself a `Context`, and each step’s DAG relationship is captured. Harder to get wrong than ad-hoc `Estimator.fit()` calls.
- `LineageQuery` walks the graph. Ascendants, descendants, or both, filtered by entity type. Ships in the Python SDK; the response carries the matching `Vertex` objects and their edges, which is trivial to serialise.
- Model Registry is a first-class lineage node. `ModelPackage` is a `Context`; approvals change its state; lineage queries starting from an endpoint’s variant can walk up through the model package to the training job and its inputs.
- Data capture is the other half of the audit story. Lineage tells you the model that produced the inference; data capture tells you the exact inputs and output for that specific call. Both are needed for full traceability.
- Do the wiring once, not per run. If the pipeline code creates the associations, every future run has them. A bolted-on spreadsheet forgets one run in three; automation doesn’t.
The regulator’s question has the same answer every time: follow the graph from the inference log back to the approval, the training, the dataset, and the sources of the dataset. Lineage Tracking builds that graph for free as long as training happens inside SageMaker’s SDK, and fills the gaps the moment you spend an afternoon adding three calls to Association.create() to the pipeline code. The investment is small; the payoff is the difference between two days of log-chasing and a 30-second query.