The situation
A content-recommendation team runs a production SageMaker endpoint that scores articles for a news homepage. The model is re-trained weekly on click data; the endpoint serves roughly 5,000 requests per second during the morning spike.
Model Monitor is already enabled with the four built-in job types:
- Data Quality: baseline statistics from training data (mean, std-dev, distinct count, missing rate per feature) compared against captured live inference inputs.
- Model Quality: accuracy, precision, recall, AUC calculated from inference outputs joined against ground-truth labels.
- Bias Drift: statistical parity and other Clarify bias metrics tracked over time.
- Feature Attribution Drift: SHAP-based per-feature contribution comparisons against a baseline.
The team caught a regression in last week’s deploy only when editors noticed that the homepage’s second slot had been showing the same three stories for two hours. The feature distributions were normal; the model quality metrics reported no change because click ground-truth was still catching up from 6 hours earlier; the bias metrics were fine. What was broken: the diversity of recommendations in the top-10 had collapsed to almost zero. The team has a domain-specific way to measure that: a held-out validation set of 500 user sessions replayed hourly, checking how many distinct article categories appear in each top-10. Healthy: 6-8 categories. Broken: 1-2. There is no built-in Model Monitor job that measures this.
What they want: Model Monitor tracks the custom metric, reports it to CloudWatch alongside the built-ins, and alarms when the hourly value drops below 4. Built in the same way the native jobs are built, so the on-call doesn’t need a parallel observability stack.
What actually matters
The reflex is “write a custom metric”. The interesting layer is understanding what the existing monitoring layer actually is, because the shape of the custom extension depends on it.
The built-in monitor is not a single service; it’s a set of scheduled processing jobs the platform runs on our behalf, each with a specific container image and a specific contract. The built-in jobs are purpose-built containers for data quality, bias, and model-quality checks. Each container reads inputs from object storage (the captured inference data, the baseline statistics, ground-truth labels, and so on), runs an analysis, writes results back to storage, and publishes metrics to the platform’s metrics service.
The shape is: scheduled processing job + container + storage input + storage output + metric publish. Once that’s visible, the way to extend it becomes obvious: a custom monitor is a container image we build, conforming to the same contract, scheduled with the same API. The container reads its inputs from object storage (captured inference data and anything else we pre-position there), runs whatever analysis we write, and emits metrics in the expected format.
Three things are worth naming before reaching for code:
Data capture. The whole pattern depends on per-request data capture being enabled on the endpoint. When enabled, every inference request and response (or a sampled subset) is written to a configured storage prefix with timestamps. This is the raw material every monitoring job consumes. Without data capture, there’s no data to monitor; the first fix is always making sure it’s on.
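As a rough sketch of what "making sure it's on" involves, the helper below builds the DataCaptureConfig block that an endpoint configuration request would carry. The function name, the sampling percentage, and the choice to capture both request and response as JSON are assumptions for illustration; the team's real settings are not stated here.

```python
def build_data_capture_config(destination_s3_uri, sampling_percentage=100):
    """Return a DataCaptureConfig dict for an endpoint configuration.

    Assumption: both the request and the response are captured, as JSON.
    """
    if not 0 < sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be in (0, 100]")
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": destination_s3_uri,
        # Capture both sides of the call so a monitor can join inputs to outputs.
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
        "CaptureContentTypeHeader": {"JsonContentTypes": ["application/json"]},
    }


# Hypothetical values: the bucket and the 20% sample rate are placeholders.
capture_config = build_data_capture_config("s3://recs-capture/", sampling_percentage=20)
```

The resulting dict would be passed as the DataCaptureConfig argument of boto3's create_endpoint_config call. At several thousand requests per second, a sampled capture is usually the more practical starting point than 100%.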
Baseline vs. metric. The built-in jobs have a two-step model: first, compute a baseline from training data (once, when the model is trained); then, periodically, compare captured live data against that baseline. A custom metric may or may not need a baseline. The diversity alarm doesn’t strictly need one: “categories in top-10” is a raw number checked against a fixed threshold, not a comparison. (The job described later still records a training-time baseline, but only so its report can flag drift.) A custom feature-drift metric might, in which case we’d need to compute and store the baseline too.
Built-in integration. The valuable thing about using Model Monitor rather than a side-channel script is that the custom job’s results land in the same S3 directory structure and the same CloudWatch namespace as the built-in jobs. The on-call sees one dashboard, one alarm-set, one runbook. The engineering investment is in conforming to Model Monitor’s conventions, not in reinventing them.
What we’ll filter on
- Scheduled execution: does the solution run on a schedule inside SageMaker, or does it live outside?
- CloudWatch integration: do the metrics land in the same namespace as built-in Model Monitor?
- Data access: does the job see the full captured inference data (inputs and outputs, not just outputs)?
- Baselining support: can we compute a reference distribution once and compare periodically?
- Alarm integration: can CloudWatch alarms route the signal through the existing on-call?
The monitoring options landscape
1. Built-in Data Quality monitor. Compares live feature distributions to training baselines via Deequ. Flags missing values, type drift, distribution shifts. No custom metric support; you get what the container computes.
2. Built-in Model Quality monitor. Compares model predictions to ground-truth labels (supplied separately). Computes accuracy, precision, recall, F1, AUC, confusion matrix. Requires labelled ground truth in S3, which is why it doesn’t help here; the labels arrive hours late.
3. Built-in Bias Drift monitor. Tracks Clarify bias metrics over time. Valuable but orthogonal to the diversity question.
4. Built-in Feature Attribution Drift monitor. Tracks SHAP values per feature to detect when the model’s decision structure shifts even when inputs look stable. Useful for “the model is still making predictions, but why has changed”; again orthogonal to our specific question.
5. Custom Model Monitor container. We build a Docker image that implements the Model Monitor contract: read /opt/ml/processing/input/endpoint/<endpoint-name>/<variant>/<YYYY>/<MM>/<DD>/<HH>/*.jsonl for the hour’s captured records, perform whatever analysis we want, write a JSON report to /opt/ml/processing/output/, and optionally write CloudWatch metrics in the expected format. Scheduled via CreateMonitoringSchedule with MonitoringType=ModelQuality (or another type) and the custom container image reference.
6. Standalone Lambda on a schedule. A parallel observability path. A Lambda reads from the data-capture S3 prefix every hour, runs whatever analysis, puts CloudWatch metrics, triggers alarms. Conceptually equivalent to a custom Model Monitor job but doesn’t integrate with Model Monitor’s dashboard or S3 layout. Defensible when the custom work is small and doesn’t need the full Processing Job runtime (for example, doesn’t need a GPU, doesn’t need to load a reference model).
7. EventBridge-triggered SageMaker Processing Job. The middle ground. A Processing Job on a schedule, running our container, reading captured data, writing metrics. Similar to custom Model Monitor but outside the Model Monitor schedule abstraction (meaning we miss the Model Monitor dashboard but keep the Processing Job’s scaling and the container reuse).
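For option 5, the first thing the container has to do is parse the captured records. The sketch below decodes one JSON Lines record into its request and response payloads. The record layout (captureData, endpointInput, endpointOutput, eventMetadata) is the commonly documented capture format, but treat it as an assumption and check it against a real capture file; the sample payload fields (session, top10) are hypothetical.

```python
import base64
import json


def parse_capture_line(line):
    """Parse one captured record into (event_id, request, response).

    Assumes both payloads are JSON documents, stored either as plain
    strings (encoding "JSON") or base64-encoded (encoding "BASE64").
    """
    record = json.loads(line)
    capture = record["captureData"]

    def decode(part):
        raw = part["data"]
        if part.get("encoding") == "BASE64":
            raw = base64.b64decode(raw).decode("utf-8")
        return json.loads(raw)

    return (
        record["eventMetadata"]["eventId"],
        decode(capture["endpointInput"]),
        decode(capture["endpointOutput"]),
    )


# Hypothetical record, shaped like one line of a capture file.
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {"encoding": "JSON", "data": json.dumps({"session": "s1"})},
        "endpointOutput": {"encoding": "JSON", "data": json.dumps({"top10": ["a1", "a2"]})},
    },
    "eventMetadata": {"eventId": "evt-1", "inferenceTime": "2027-10-31T08:15:00Z"},
})
event_id, request, response = parse_capture_line(sample_line)
```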
Side by side
| Option | Scheduled inside SageMaker | CloudWatch namespace | Sees captured data | Baselining support | Alarm-ready |
|---|---|---|---|---|---|
| Built-in Data Quality | ✓ | ✓ (native) | ✓ | ✓ | ✓ |
| Built-in Model Quality | ✓ | ✓ (native) | ✓ (+ labels) | ✓ | ✓ |
| Custom Model Monitor container | ✓ | ✓ (native) | ✓ | ✓ (if we build it) | ✓ |
| Lambda on schedule | partial (EventBridge) | ✓ (custom namespace) | ✓ | we build it | ✓ |
| EventBridge + Processing Job | ✓ (EventBridge) | ✓ (custom namespace) | ✓ | we build it | ✓ |
The line between “custom Model Monitor container” and “EventBridge-triggered Processing Job” is thin but real: Model Monitor is a wrapper around scheduled Processing Jobs with opinionated S3 layout, scheduling API, and a dashboard. Using the Model Monitor path keeps everything in one observability surface; using the raw Processing Job path gives more flexibility but scatters the operational surface.
For this team, staying inside Model Monitor is the correct answer; the whole point is for the on-call to have one place to look.
The shape of a custom metric job
The pick in depth
Build a custom Model Monitor container and schedule it alongside the Data Quality monitor.
The container is a small Python image. Its entrypoint reads the conventional Model Monitor environment variables:
- dataset_source: where the captured inference data for the scheduled hour is available (the S3 capture prefix, mounted into the container as a local path).
- output_path: where to write statistics.json and constraint_violations.json.
- baseline_constraints and baseline_statistics: locations of the baseline from training.
- Plus any custom parameters we pass via the MonitoringAppSpecification.ContainerArguments field when creating the schedule.
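A minimal sketch of how an entrypoint might pick these settings up, assuming the variable names listed above and the --min-diversity argument used in the schedule later on. The helper names and the fallback paths are illustrative, not part of any official contract.

```python
import argparse
import os


def read_monitor_env(environ=os.environ):
    """Collect the settings the monitoring container expects to receive."""
    return {
        "dataset_source": environ.get("dataset_source", "/opt/ml/processing/input/endpoint"),
        "output_path": environ.get("output_path", "/opt/ml/processing/output"),
        # Baselines are optional for a custom metric, so None is allowed.
        "baseline_constraints": environ.get("baseline_constraints"),
        "baseline_statistics": environ.get("baseline_statistics"),
    }


def parse_args(argv):
    """Parse the custom parameters passed through ContainerArguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-diversity", type=float, default=4.0)
    return parser.parse_args(argv)


settings = read_monitor_env({"output_path": "/tmp/monitor-out"})
args = parse_args(["--min-diversity", "4"])
```

Taking the environment as a parameter keeps the function testable outside the container, which helps when a scheduled job starts failing and needs to be reproduced locally.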
The script’s job, in about 80 lines of Python:
- Load the reference validation set (500 user sessions) from its S3 location.
- For each session, call the endpoint’s inference URL with the session’s feature vector. Collect the returned top-10 article IDs.
- Look up each article’s category from a reference map.
- Compute distinct-categories-in-top-10 per session. Aggregate across sessions: mean, min, 5th percentile.
- Load the baseline from training time, a JSON file with expected distributions for each aggregate metric.
- Emit statistics.json with the computed values, and constraint_violations.json listing any metric that fell outside the baseline’s acceptable band.
- Put CloudWatch metrics in the namespace aws/sagemaker/Endpoints/data-metrics with dimensions EndpointName, VariantName, MonitoringSchedule.
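The core of those steps, the per-session count and its aggregation, can be sketched in a few lines. This version assumes the top-10 lists have already been collected from the endpoint and uses a nearest-rank percentile, which is one of several common definitions; the function names and the tiny category map are hypothetical.

```python
import math
from statistics import mean


def distinct_categories(top10_article_ids, category_by_article):
    """Count distinct categories among one session's recommended articles.

    An article missing from the reference map raises KeyError rather than
    being silently counted, which would inflate the diversity number.
    """
    return len({category_by_article[a] for a in top10_article_ids})


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate_diversity(per_session_counts):
    """Aggregate per-session counts into the three published metrics."""
    return {
        "top10_category_diversity_mean": mean(per_session_counts),
        "top10_category_diversity_min": min(per_session_counts),
        "top10_category_diversity_p5": percentile_nearest_rank(per_session_counts, 5),
    }


# Hypothetical data: two sessions with shortened recommendation lists.
categories = {"a1": "politics", "a2": "sport", "a3": "sport"}
sessions = [["a1", "a2", "a3"], ["a2", "a3"]]
counts = [distinct_categories(s, categories) for s in sessions]
stats = aggregate_diversity(counts)
```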
The baseline is computed at training time, by running the same replay against the held-out validation set and recording the observed diversity numbers. It is re-computed each time the model is re-trained; the workflow for that is a SageMaker Pipelines step that writes s3://recs-monitor/baselines/<model-version>/diversity_baseline.json.
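How the comparison against that baseline might look is sketched below. It assumes the baseline file reduces to a lower bound per metric, which is a simplification of "acceptable band"; the report layout and the numbers are hypothetical and do not follow the schema of the built-in monitors.

```python
import json
import os


def find_violations(current, lower_bounds):
    """Return one entry per metric that fell below its baseline lower bound."""
    violations = []
    for name, lower in lower_bounds.items():
        value = current.get(name)
        if value is not None and value < lower:
            violations.append({
                "metric_name": name,
                "observed": value,
                "lower_bound": lower,
                "description": f"{name} below baseline",
            })
    return violations


def write_reports(output_dir, stats, violations):
    """Write the two report files the monitoring job is expected to leave behind."""
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "statistics.json"), "w") as f:
        json.dump(stats, f, indent=2)
    with open(os.path.join(output_dir, "constraint_violations.json"), "w") as f:
        json.dump({"violations": violations}, f, indent=2)


# Illustrative numbers only.
current_stats = {"top10_category_diversity_mean": 5.0, "top10_category_diversity_p5": 2}
bounds = {"top10_category_diversity_mean": 4.5, "top10_category_diversity_p5": 3}
found = find_violations(current_stats, bounds)
```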
The scheduling is:
aws sagemaker create-monitoring-schedule \
  --monitoring-schedule-name recommender-diversity-hourly \
  --monitoring-schedule-config '{
    "ScheduleConfig": {"ScheduleExpression": "cron(0 * ? * * *)"},
    "MonitoringJobDefinition": {
      "MonitoringAppSpecification": {
        "ImageUri": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/diversity-monitor:v3",
        "ContainerArguments": ["--min-diversity", "4"]
      },
      "MonitoringInputs": [...],
      "MonitoringOutputConfig": {"MonitoringOutputs": [{"S3Output": {...}}]},
      "MonitoringResources": {"ClusterConfig": {"InstanceCount": 1, "InstanceType": "ml.m6i.large", "VolumeSizeInGB": 20}},
      "RoleArn": "arn:aws:iam::...:role/service-role/AmazonSageMaker-ExecutionRole"
    }
  }'
The CloudWatch alarm on top10_category_diversity (threshold < 4, breached in 2 out of 3 consecutive 1-hour periods) routes through the existing SNS topic to PagerDuty. The runbook is a one-liner: roll back the endpoint to the previous variant, investigate in the morning.
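The alarm itself can be expressed as the keyword arguments for CloudWatch's put_metric_alarm call. The sketch below only builds those arguments; the alarm name, the SNS topic ARN, the dimension values, and the Minimum statistic are assumptions (with one datapoint per hour, the choice of statistic makes little difference).

```python
def build_diversity_alarm(dimensions, sns_topic_arn, threshold=4):
    """Build put_metric_alarm kwargs for a '2 out of 3 hourly periods' alarm."""
    return {
        "AlarmName": "top10-category-diversity-low",  # hypothetical name
        "Namespace": "aws/sagemaker/Endpoints/data-metrics",
        "MetricName": "top10_category_diversity",
        # Dimensions must match exactly what the monitoring job publishes.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Minimum",
        "Period": 3600,              # one-hour periods
        "EvaluationPeriods": 3,      # look at the last three periods...
        "DatapointsToAlarm": 2,      # ...and alarm if two of them breach
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


alarm_kwargs = build_diversity_alarm(
    {"EndpointName": "recommender-prod", "MonitoringSchedule": "recommender-diversity-hourly"},
    "arn:aws:sns:eu-west-1:123456789012:oncall-pagerduty",  # hypothetical topic
)
```

With boto3 this would be applied as cloudwatch.put_metric_alarm(**alarm_kwargs). How missing datapoints are treated (for example when a monitoring run fails) is a separate decision that the TreatMissingData parameter controls.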
A worked monitoring cycle
At 09:00:00 UTC, the schedule fires. SageMaker submits a ProcessingJob.
- Processing job pulls diversity-monitor:v3 from ECR, spins up on ml.m6i.large, downloads the captured-inference records from s3://recs-capture/recommender-prod/AllTraffic/2027/10/31/08/ (the previous hour).
- The entrypoint loads the validation set from s3://recs-monitor/validation/sessions-v9.jsonl.
- For each of the 500 sessions, the script POSTs to the endpoint’s URL (the endpoint’s production variant, not a shadow). 500 calls complete in ~8 seconds given the endpoint’s regular latency.
- Categories counted per session, aggregated. Result: mean 6.2, min 3, p5 4. (Healthy.)
- Baseline from training had mean 6.8, min 4, p5 5. The current p5 of 4 is below the baseline’s p5 of 5, which generates one constraint violation: p5_category_diversity below baseline.
- Script writes statistics.json and constraint_violations.json to s3://recs-monitor/reports/2027/10/31/09/.
- Script calls PutMetricData with top10_category_diversity_mean=6.2, _min=3, _p5=4 under aws/sagemaker/Endpoints/data-metrics, dimensions EndpointName=recommender-prod.
- Total processing job runtime: 92 seconds. Cost: ~$0.003.
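The metric-publishing step of this cycle might be assembled as below. The sketch builds the MetricData list for a PutMetricData call from the values in the walkthrough; the helper name and the Count unit are assumptions.

```python
from datetime import datetime, timezone


def build_metric_data(stats, dimensions, timestamp):
    """Turn a {metric_name: value} dict into a PutMetricData MetricData list."""
    dims = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return [
        {
            "MetricName": name,
            "Dimensions": dims,
            "Timestamp": timestamp,
            "Value": float(value),
            "Unit": "Count",
        }
        for name, value in sorted(stats.items())
    ]


metric_data = build_metric_data(
    {
        "top10_category_diversity_mean": 6.2,
        "top10_category_diversity_min": 3,
        "top10_category_diversity_p5": 4,
    },
    {"EndpointName": "recommender-prod"},
    datetime(2027, 10, 31, 9, 0, tzinfo=timezone.utc),
)
```

With boto3 the list would be sent as cloudwatch.put_metric_data(Namespace="aws/sagemaker/Endpoints/data-metrics", MetricData=metric_data). Stamping the datapoint with the scheduled hour rather than the wall-clock time keeps re-runs of the same window aligned.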
One week later, a model deploy pushes bad recommendations. The 10:00 run sees _min=1, _p5=2. The CloudWatch alarm fires after two consecutive sub-threshold hours (10:00 and 11:00), PagerDuty pages the on-call, the deploy rolls back via SageMaker deployment guardrails, and the homepage recovers before the afternoon editorial review notices. The built-in monitors caught none of it; the custom monitor, hand-built, 80 lines of Python, was the whole defence.
What’s worth remembering
- Model Monitor is scheduled processing jobs, not magic. A built-in container, a schedule, an S3 layout, a CloudWatch namespace. That decomposition is the extension point.
- Data capture is the prerequisite. DataCaptureConfig on the endpoint writes every (or sampled) inference to S3 as JSON Lines. Every monitor, built-in or custom, reads from there.
- The four built-in job types are focused. Data Quality, Model Quality, Bias Drift, Feature Attribution Drift. If the domain signal isn’t one of those, a custom job is the AWS-native answer.
- A custom Model Monitor is a container plus a schedule. Build a Docker image, push to ECR, reference it in CreateMonitoringSchedule with ContainerArguments for parameters. The container’s contract is S3 inputs, S3 outputs, CloudWatch metrics.
- Baselines are optional but useful. If the metric is a comparison to training-time behaviour, compute the baseline during training, persist it in S3, and have the runtime job load it. If the metric is an absolute threshold, the baseline can be simpler or absent.
- CloudWatch metrics come with alarms for free. Once the metric is published to CloudWatch in a consistent namespace with dimensions, the rest of the observability stack (alarms, dashboards, composite alarms) works normally. No separate alerting service.
- Lambdas and standalone Processing Jobs are valid alternatives. For cases where the Model Monitor dashboard integration doesn’t matter, an EventBridge-triggered Lambda or Processing Job works and is simpler. The trade is the loss of integration with the native Model Monitor view.
- Custom monitor code should stay small. The more the container does, the harder it is to debug when the job starts failing silently. Keep the entrypoint to what’s specific to the domain metric; reuse libraries for S3 I/O and CloudWatch publishing.
The built-in monitors catch the drifts they were designed to catch. When the signal that matters is domain-specific (recommender diversity, personalisation coverage, a KPI nobody else’s model has), the answer is a small custom container on the same schedule abstraction, talking to the same CloudWatch namespace, watched by the same on-call. Model Monitor is the frame; custom metrics are how you make the frame hold the pictures you actually need.