The situation
A gradient-boosted fraud classifier has been in production on a SageMaker real-time endpoint for nine months. Training data came from the prior twelve months of disputed transactions; the model scores every authorisation in the payments pipeline in roughly 40ms. Since launch, the precision/recall curve has been boringly stable: precision around 0.81, recall around 0.73, review queue size predictable enough that the fraud ops team staffed around it.
This week, three things changed, and not in unison.
- The score distribution shifted. The endpoint’s output histogram, which has spent nine months skewed hard towards zero with a thin tail above 0.5, has a visible bulge between 0.3 and 0.5 that wasn’t there last month.
- Recall on labelled outcomes dropped from 0.73 to 0.66 over two weeks. Precision held. The ops team is catching fewer of the frauds that end up disputed.
- The merchant_category_code feature’s distribution changed: two MCCs that represented 2% of traffic now represent 9%, and one MCC that was 6% has fallen to 1%.
The question isn’t whether something drifted (something clearly did) but which something, and which monitor would have caught it first. “The model is drifting” collapses at least four distinct failure modes into one sentence, and the fix for each is different.
What actually matters
Model monitoring is less about catching a single metric going red and more about having a framework for where to look when something goes wrong. Production ML systems have at least four moving parts, and any of them can shift independently.
The first is the input data itself, the feature values the model sees at inference time. If the mix of merchant categories changes, if a new payment method launches, if a data-engineering upstream change alters how a field is encoded, the model is suddenly scoring transactions that look unlike the ones it was trained on. Nothing has changed about the model; the world around it has. Catching this early means the team can investigate upstream, retrain, or at least know the scores are less trustworthy than usual.
The second is the model’s predictions, the score distribution coming out of the endpoint. Even if individual features look unchanged, the model might be outputting a different mix of scores, perhaps because a combination of features has shifted that no single-feature check would catch. A score histogram that used to be bimodal going flat is a signal that something systematic is happening; whether it’s a problem depends on whether the labels that eventually come back agree.
The third is the labels themselves, once they’re available. For fraud, the label is “did this transaction get disputed”, which takes days or weeks to arrive. When it does, we can compare the model’s predictions from two weeks ago against the ground truth and recompute precision, recall, F1, AUC. This is the slowest signal but the one that actually measures whether the model is still useful.
The fourth, often forgotten, is the features the model pays attention to. A model that used to weight transaction_amount heavily might, as the distribution of amounts changes, start leaning on merchant_category_code instead. The predictions might look fine at the distribution level, the labels might not have landed yet, but the reasoning has moved, and when that happens, fairness and explainability stories built around the old feature importances are out of date.
Four sources of drift, four questions worth asking separately. The useful framing is not “is the model drifting” but “which of the four monitors is the one that should fire first given what we know.”
What we’ll filter on
Distilling the exploration into filters that separate the monitor types:
- What’s being compared: inputs, outputs, labels vs outputs, or feature attributions.
- What baseline it needs: training-data statistics, training-time SHAP values, or something else.
- How soon it can fire: immediately, when labels arrive, or on a schedule.
- What signal it produces: distributional shift, metric regression, or attribution shift.
- What you’d do about it: retrain, investigate upstream, recalibrate, or re-examine the fairness story.
The Model Monitor landscape
1. Data Quality Monitor. Watches the input feature distributions against a baseline computed from the training set. SageMaker runs a Processing job (using the built-in Deequ-based container) over the training data to produce statistics.json and constraints.json: per-feature means, standard deviations, min/max, null counts, string-length ranges, and inferred type constraints. At schedule time, the monitor captures the features from the endpoint’s data-capture S3 location, runs the same Processing job over the captured window, and reports violations (a feature’s mean drifted more than the threshold, a feature started producing nulls where it hadn’t before, a string feature’s value set expanded). Answers: “is the world feeding the model what it expects?”
2. Model Quality Monitor. Watches the predictions vs ground truth. Requires the endpoint’s data capture to include both the prediction and, eventually, the actual label for that prediction (matched via an inference ID). The baseline is computed from a labelled validation set: precision, recall, F1, AUC for classifiers; MAE, MSE, RMSE, R² for regressors; a problem-type-appropriate set. On schedule, Model Monitor joins captured predictions with the ground-truth-labels S3 path and recomputes the same metrics over the window. Answers: “is the model still as good as it was?”, but only as fast as labels arrive.
3. Model Bias Monitor. Watches for bias drift in predictions, using the SageMaker Clarify bias metrics (DPPL, DI, CDDPL, and others). The baseline is computed from the training set over sensitive facets (e.g. age bracket, geographic region). On schedule, the monitor joins captured data against ground truth (when available) and recomputes the bias metrics. Answers: “are the fairness properties the model was signed off against still holding?”
4. Feature Attribution Drift Monitor. Watches for shifts in which features the model is relying on, using SHAP values. The baseline is a SHAP-value distribution from the training set (computed via a Clarify Processing job); on schedule, the monitor takes captured inferences, computes SHAP values over them, and measures the normalised discounted cumulative gain (NDCG) between the ranked feature-importance lists. A big NDCG drop means the model’s feature-importance ranking has changed: the model is paying attention to different things. Answers: “is the model reasoning the way we thought it was?”
5. Custom monitor (out-of-scope here). Model Monitor supports bring-your-own-container for bespoke checks, statistical tests that don’t map to the four built-ins. Worth naming for completeness; the built-ins cover the fraud scenario.
Side by side
| Monitor | What it compares | Baseline needed | Fires when | Signal | Typical action |
|---|---|---|---|---|---|
| Data Quality | Inputs vs training stats | statistics.json + constraints.json from training | On schedule, no labels needed | Feature distribution shift, nulls, type changes | Investigate upstream; retrain if persistent |
| Model Quality | Predictions vs labels | Metrics on labelled validation set | Only when labels catch up | Precision/recall/F1/AUC regression | Retrain on recent data |
| Model Bias | Predictions + labels across facets | Clarify bias baseline | Only when labels catch up | DPPL, DI, CDDPL drift | Re-examine fairness story |
| Feature Attribution | SHAP rankings over captured inputs | Clarify SHAP baseline | On schedule, no labels needed | NDCG drop on feature-importance ranking | Investigate whether reasoning still valid |
Reading the table against the fraud scenario:
- The merchant_category_code distribution change is Data Quality: it’s a feature distribution shift, visible without labels, caught by the baseline’s categorical-value-set constraint.
- The score-distribution bulge is an output shift. Model Monitor’s Data Quality job can also be pointed at the output of the endpoint (score is just another column), catching distribution changes in predictions before labels arrive.
- The recall drop is Model Quality, only visible once labels arrive, which is why it lagged the Data Quality signal by a fortnight.
- If the fraud model’s SHAP rankings have also shifted (say, merchant_category_code climbed from rank 5 to rank 2), Feature Attribution catches it and gives the investigation a reasoning-level story, not just a distribution-level one.
Four monitors, four baselines, four schedules, and the order in which they fire is information, not noise.
The four monitors in the pipeline
The picks in depth
Data Quality is the first line of defence. It’s the only monitor that catches problems in the window between “something changed” and “labels arrive.” In the fraud scenario, the merchant_category_code shift would have been flagged by a Data Quality run the morning after it happened: the baseline constraints (completeness: 1.0, distinct_count: 421, most_common_value: "5411") no longer held once two new MCCs appeared in meaningful volume.
The baseline job is a one-off Processing job run at model-build time: sagemaker.model_monitor.DefaultModelMonitor.suggest_baseline(baseline_dataset='s3://train/features.csv', ...). It emits statistics.json (per-column mean, stddev, quantiles) and constraints.json (inferred per-column constraints: type, completeness, value-set cardinality). The monitoring schedule job runs the same container on a rolling window of captured data and emits a constraint_violations.json and a CloudWatch metric per feature – feature_baseline_drift_<feature_name> with a threshold configurable per feature.
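The baseline-then-check idea can be sketched in a few lines of plain Python. This is an illustration of the concept only, not the built-in container's implementation: the function names (build_baseline, check_window), the constraint fields, the sample MCC mix, and the 0.1 threshold are all assumptions made for the example, and the drift score used here is the L-infinity distance between value shares.

```python
from collections import Counter


def value_shares(values):
    """Return {value: share of non-null rows} for a categorical column."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    return {value: count / total for value, count in counts.items()}


def build_baseline(values):
    """Summarise a training column: completeness, allowed values, value shares."""
    non_null = sum(v is not None for v in values)
    shares = value_shares(values)
    return {
        "completeness": non_null / len(values),
        "allowed_values": set(shares),
        "shares": shares,
    }


def check_window(baseline, values, drift_threshold=0.1):
    """Compare a captured window against the baseline and list violations."""
    violations = []
    completeness = sum(v is not None for v in values) / len(values)
    if completeness < baseline["completeness"]:
        violations.append("completeness")
    shares = value_shares(values)
    if set(shares) - baseline["allowed_values"]:
        violations.append("unseen_values")
    # L-infinity distance: the largest absolute change in any value's share.
    all_values = set(shares) | set(baseline["shares"])
    drift = max(
        abs(shares.get(v, 0.0) - baseline["shares"].get(v, 0.0)) for v in all_values
    )
    if drift > drift_threshold:
        violations.append("distribution_drift")
    return drift, violations


# Hypothetical MCC mix: "5411" dominates training; "5999" grows in the window
# and "7995" was never seen in training.
training = ["5411"] * 80 + ["5812"] * 18 + ["5999"] * 2
window = ["5411"] * 70 + ["5812"] * 15 + ["5999"] * 14 + ["7995"] * 1

baseline = build_baseline(training)
drift, violations = check_window(baseline, window)
print(round(drift, 2), violations)
```

The split mirrors the two jobs described above: build_baseline runs once against training data, and check_window runs on each captured window. A share shift among known values and the arrival of an unseen value are reported as separate violations, which is useful when deciding whether to look at traffic mix or at an upstream encoding change.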
Pointing the Data Quality monitor at the output column, not just the inputs, is a cheap trick that buys an extra signal: it catches the score distribution bulge in the fraud case without needing the labelled-validation path that Model Quality requires.
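As a rough picture of what a drift check on the score column measures, the sketch below compares a baseline score sample with a recent one using the two-sample Kolmogorov-Smirnov statistic, plus the share of scores in the 0.3 to 0.5 band. The statistic is one common choice and is used here as an assumption; no claim is made that the built-in job computes exactly this. The score samples are invented for the example.

```python
from bisect import bisect_right


def ks_statistic(baseline, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    base_sorted = sorted(baseline)
    curr_sorted = sorted(current)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap occurs at a data point.
    for x in set(base_sorted) | set(curr_sorted):
        base_cdf = bisect_right(base_sorted, x) / len(base_sorted)
        curr_cdf = bisect_right(curr_sorted, x) / len(curr_sorted)
        gap = max(gap, abs(base_cdf - curr_cdf))
    return gap


def share_between(scores, low, high):
    """Fraction of scores falling in the half-open band [low, high)."""
    return sum(low <= s < high for s in scores) / len(scores)


# Hypothetical samples: the baseline is skewed towards zero with a thin tail;
# the recent window has a bulge at 0.4 that the baseline lacks.
baseline_scores = [0.05] * 90 + [0.2] * 8 + [0.8] * 2
recent_scores = [0.05] * 75 + [0.2] * 8 + [0.4] * 15 + [0.8] * 2

ks = ks_statistic(baseline_scores, recent_scores)
bulge = share_between(recent_scores, 0.3, 0.5)
print(round(ks, 2), round(bulge, 2))
```

Neither number needs a label: both are computed from captured outputs alone, which is why this check can fire weeks before the recall figure moves.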
Model Quality is the authoritative metric. Nothing else answers “is it still working.” The cost is the ground-truth join. The endpoint’s data-capture JSON records an InferenceId for each prediction; a separate S3 path receives labels keyed by the same ID (typically written by a Lambda that subscribes to the dispute-resolution stream). The monitoring job joins the two, recomputes the metrics, and reports them against the baseline.
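The join itself is simple once both sides share a key. The sketch below is a minimal, in-memory version under stated assumptions: predictions and labels are plain dictionaries keyed by a hypothetical inference ID, scores are thresholded at 0.5, and the helper names are invented for the example. The real job reads both sides from S3.

```python
def join_on_inference_id(predictions, labels):
    """Pair each captured prediction with its label; skip predictions with no label yet."""
    return [
        (predicted, labels[inference_id])
        for inference_id, predicted in predictions.items()
        if inference_id in labels
    ]


def precision_recall(pairs):
    """Compute precision and recall from (predicted, actual) boolean pairs."""
    tp = sum(1 for predicted, actual in pairs if predicted and actual)
    fp = sum(1 for predicted, actual in pairs if predicted and not actual)
    fn = sum(1 for predicted, actual in pairs if not predicted and actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical captured scores keyed by inference ID, thresholded at 0.5.
scores = {"a1": 0.91, "a2": 0.12, "a3": 0.64, "a4": 0.08, "a5": 0.33, "a6": 0.72}
predictions = {inference_id: score >= 0.5 for inference_id, score in scores.items()}
# Hypothetical dispute outcomes; "a6" has not resolved yet, so it has no label.
labels = {"a1": True, "a2": False, "a3": False, "a4": False, "a5": True}

pairs = join_on_inference_id(predictions, labels)
precision, recall = precision_recall(pairs)
print(len(pairs), precision, recall)
```

Predictions without a label are dropped rather than counted as negatives. That choice matters for fraud: treating an unresolved transaction as "not fraud" would inflate precision and recall until the disputes land.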
The lag is structural: fraud labels take weeks. The monitor is the slowest signal, but it’s the one that justifies retraining. A 7-point drop in recall over a fortnight is not Data Quality’s answer and not Feature Attribution’s answer; it’s Model Quality’s answer, and the action it recommends is “retrain on a window that includes the last two months.”
Bias Drift and Feature Attribution are the fairness and explainability monitors. Both require Clarify baselines computed at training time. Bias Drift works like Model Quality: it needs labels and watches the bias metrics across sensitive facets. Feature Attribution works like Data Quality: it runs on captured inferences alone and watches whether the SHAP-ranked feature importance list still looks like the training one. An NDCG of 0.95 is “basically unchanged”; 0.7 is “the model is reasoning differently.”
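To show how a ranking comparison turns into a single NDCG number, here is a small sketch. It uses one common formulation, in which the baseline attribution of each feature serves as its relevance and the live ranking decides the order; treat that formula, the feature names other than the two mentioned in this article, and the attribution values as assumptions for illustration rather than the service's exact definition.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: position i (1-based) is discounted by log2(i + 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ranking_ndcg(baseline_attributions, live_ranking):
    """NDCG of the live feature ranking, scored with baseline attributions as relevance."""
    ideal = sorted(baseline_attributions.values(), reverse=True)
    observed = [baseline_attributions[feature] for feature in live_ranking]
    return dcg(observed) / dcg(ideal)


# Hypothetical baseline attributions (mean absolute SHAP per feature).
baseline_attributions = {
    "transaction_amount": 0.40,
    "card_present": 0.25,
    "hour_of_day": 0.15,
    "account_age_days": 0.12,
    "merchant_category_code": 0.08,
}

# Ranking unchanged from training.
same_ranking = sorted(baseline_attributions, key=baseline_attributions.get, reverse=True)
# Hypothetical live ranking in which merchant_category_code has climbed to second place.
shifted_ranking = [
    "transaction_amount",
    "merchant_category_code",
    "card_present",
    "hour_of_day",
    "account_age_days",
]

unchanged = ranking_ndcg(baseline_attributions, same_ranking)
shifted = ranking_ndcg(baseline_attributions, shifted_ranking)
print(unchanged, round(shifted, 3))
```

Under this formulation the size of the drop depends on how concentrated the baseline attributions are, so any alert threshold is best chosen by looking at the scores a given model actually produces.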
Feature Attribution is particularly useful when Data Quality hasn’t fired but something still feels wrong. If the marginal feature distributions all look fine but the combinations have shifted (the model sees the same MCCs and the same amounts, but in new pairings), univariate drift checks miss it, and SHAP-based attribution catches it because a ranking built on interactions responds to the change.
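A toy example makes the marginal-versus-joint point concrete. The rows below are invented: each is a hypothetical (MCC, amount band) pair, and the pairing is deliberately flipped between training and the recent window so that every per-column count stays identical.

```python
from collections import Counter


def marginals(rows):
    """Per-column value counts for a list of (mcc, amount_band) rows."""
    mcc_counts = Counter(mcc for mcc, _ in rows)
    band_counts = Counter(band for _, band in rows)
    return mcc_counts, band_counts


# Hypothetical data: same MCCs and same amount bands, but paired differently.
training_rows = [("5411", "low")] * 50 + [("5999", "high")] * 50
window_rows = [("5411", "high")] * 50 + [("5999", "low")] * 50

marginals_match = marginals(training_rows) == marginals(window_rows)
joint_matches = Counter(training_rows) == Counter(window_rows)
print(marginals_match, joint_matches)
```

A per-feature check sees only the marginals and reports nothing, while the joint distribution the model actually scores has changed completely.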
A worked investigation
Back to the fraud model. The timeline runs:
- T+0: two new MCCs quietly rise in the authorisation stream (a merchant onboarding, a processor expansion).
- T+1 day: Data Quality Monitor run for the previous day flags categorical_values_constraint violations on merchant_category_code. CloudWatch alarm feature_baseline_drift_merchant_category_code > 0.1 transitions to ALARM, paging the MLOps rota.
- T+2 days: the Data Quality job on the output column flags a distribution shift in the score. The team now knows that the feature change is translating into a prediction change, not being absorbed.
- T+3 days: Feature Attribution Monitor run reports NDCG of 0.82 against the training SHAP baseline. merchant_category_code climbed from rank 6 to rank 2; transaction_amount dropped from 2 to 4. The model is leaning on the shifted feature more than it used to.
- T+14 days: enough disputes land to populate the ground-truth prefix. Model Quality Monitor reports recall down from 0.73 to 0.66, precision flat at 0.81. The slowest monitor confirms what the earlier three implied.
- T+16 days: retraining job kicks off, using the last 90 days of data plus a re-balanced sample across MCCs. Feature Attribution baseline is recomputed as part of the same pipeline.
The narrative is not “the model drifted”, it’s “input distribution changed, then prediction distribution changed, then the reasoning shifted, then the metric caught up.” Each monitor played a different role; skipping any of them would have left a gap in the story.
What’s worth remembering
- Four monitors, not one. Data Quality, Model Quality, Bias Drift, Feature Attribution. Each watches a different part of the pipeline against a different baseline.
- Data Quality fires first. It needs no labels, runs hourly or daily on captured inputs, and catches upstream changes before they translate into measurable metric drops.
- Model Quality is authoritative but slow. Needs the ground-truth join via inference IDs; the lag is the time it takes for labels to land. Don’t rely on it as the only monitor; do treat it as the one that justifies retraining.
- Bias Drift needs labels too. Uses Clarify bias metrics (DPPL, DI, CDDPL) across facets; baseline is computed at training time from the training set plus the fairness story the model was signed off against.
- Feature Attribution is the reasoning monitor. Uses SHAP rankings against a Clarify baseline, measured by NDCG on the ranked importance list. Catches combinations-of-features shifts that univariate Data Quality checks miss.
- Data capture is the shared prerequisite. All four monitors read from the endpoint’s captured inputs/outputs S3 prefix; enable data capture on the EndpointConfig and budget S3 storage and egress for it.
- Baselines are Processing jobs. Each monitor has its own baseline job run at training time; changing the model means recomputing all four. The monitoring schedules are separate Processing jobs running the same containers on the capture prefix.
- CloudWatch metrics + S3 violation reports. Each monitor emits both: CloudWatch for alarms and paging, S3 for the auditable explain-it-later artefact. An alarm on feature_baseline_drift_* is the 3am version; the violation report is the post-mortem version.
“The model is drifting” isn’t a diagnosis; it’s a question. Model Monitor’s four flavours answer which part of the pipeline moved, and because the four fire on different schedules and need different baselines, the order in which they alert is most of the story.