Picking the Right Auto-Scaling Signal for GPU Endpoints

March 01, 2028 · 21 min read

The situation

The image classifier is a ResNet-50 variant serving product-catalogue tagging for a retail platform. Each request is one 1024×1024 JPEG in, one list of tag probabilities out. Traffic shape over a typical day: baseline ~40 RPS steady; spike at the top of every hour (driven by scheduled ingestion) ramping to ~220 RPS over ninety seconds and holding; trough 04:00-06:00 at ~5 RPS.

Each inference takes around 35 ms on the GPU. The CPU in front of it decodes the JPEG, normalises the tensor, and marshals the response, roughly 8 ms of real CPU work per request. One ml.g5.xlarge (4 vCPU, 16 GB RAM, 1 × A10G with 24 GB GPU memory) comfortably serves ~50 RPS before the GPU saturates. Above that, SageMaker’s invocation queue starts to grow, requests wait for a GPU slot, and tail latency blows up.

The current auto-scaling policy targets SageMakerVariantCPUUtilization at 70%. Target tracking needs two CloudWatch datapoints breaching the target to trigger a scale-out. At one-minute resolution that’s two minutes minimum, and in practice closer to three by the time the metric is published, evaluated, and a new instance is provisioned. By then the queue already has hundreds of requests in it, and the new instance has to drain that backlog before P99 recovers.

The CPU metric is not wrong; it is just watching the wrong constraint. CPU peaks at ~35% when the GPU is already at 100%, because JPEG decode and tensor shuffling are cheap compared to the forward pass.

What actually matters

Before picking a replacement, it is worth thinking about what auto-scaling is actually for and why the wrong signal fails silently.

The first observation is that auto-scaling is a control loop: observe a metric, compare it to a target, adjust instance count to close the gap. Control loops assume three things about the metric: that it tracks the thing we actually care about, that it responds predictably to instance-count changes, and that it moves early enough that the provisioning latency has time to land before the symptom bites. CPU utilisation on a GPU-bound workload fails the first two badly. The thing we care about is “are requests waiting?” and CPU utilisation can say “no, everything’s fine” while the GPU queue is growing.
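To make the control-loop framing concrete, here is a minimal sketch of a proportional rule: scale the fleet by the ratio of measured metric to target, then clamp. It is a simplified model for reasoning about the loop, not AWS's actual target-tracking algorithm, and the function name is hypothetical. The 70% target and ~35% CPU peak come from this post; the capacity bounds of 2 and 8 are the ones used later on.

```python
import math


def desired_instances(current: int, metric: float, target: float,
                      min_cap: int, max_cap: int) -> int:
    """Simplified proportional control: fleet size scales with metric/target."""
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be positive")
    raw = math.ceil(current * metric / target)
    # Clamp to the scalable target's bounds.
    return max(min_cap, min(max_cap, raw))


# GPU is saturated, but CPU reads ~35% against a 70% target:
# the loop sees no reason to add capacity and stays at the floor.
print(desired_instances(2, 35.0, 70.0, 2, 8))  # 2
```

The loop is doing exactly what it was told. The input is what misleads it.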

The second observation is that the ideal metric for this shape is demand itself, “how many requests per instance per unit time?”, not any specific resource bottleneck. If we can measure the rate at which work is arriving relative to capacity, we don’t have to guess whether the bottleneck today is GPU, tomorrow is CPU when the model changes, or next month is I/O when the instance type changes. Demand per instance moves the moment traffic changes, independent of which physical resource happens to be the limiter.

The third observation is about lead time. CloudWatch metrics publish at one-minute resolution. Target tracking needs two breaching datapoints. SageMaker variant updates take 2-4 minutes to provision a new instance. That’s a hard floor of roughly 4-6 minutes between “demand increases” and “new capacity is serving.” Any spike that ramps faster than that floor overruns a purely reactive policy. The right metric minimises the reactive lag; the right posture accepts that reactive scaling has a floor and covers the rest with scheduled pre-warming.
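The floor is plain addition, sketched below with the figures quoted above (one-minute periods, two breaching datapoints, 2-4 minutes of provisioning). The function name and defaults are illustrative, and real detection time also depends on when in the minute the spike starts.

```python
def reactive_floor_seconds(period_s: int = 60,
                           breaching_datapoints: int = 2,
                           provision_s: tuple = (120, 240)) -> tuple:
    """Best and worst case from 'demand increases' to 'new capacity is serving'."""
    detect_s = period_s * breaching_datapoints
    return detect_s + provision_s[0], detect_s + provision_s[1]


low, high = reactive_floor_seconds()
print(low / 60, high / 60)  # 4.0 6.0

# The hourly spike finishes ramping in 90 seconds, well inside the floor.
RAMP_S = 90
print(low - RAMP_S)  # 150 seconds at full spike load before help can arrive
```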

The fourth observation is about the difference between alarming and scaling. A latency SLO breach should page a human; it shouldn’t necessarily trigger a scale-out. Latency is a step function (flat until saturation, then a cliff), which violates target tracking’s assumption of linear response to capacity changes. Using latency as a scaling trigger gets you bang-bang control and flapping. Using latency as an alarm gets you a human who can look at the dashboard and decide whether the scaling policy is wrong, the instance type is undersized, or the model changed under them.
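A rough way to see the cliff is a deterministic fluid model: hold the load at 220 RPS, vary the instance count, and compare latency after one minute with the per-instance invocation rate. The 35 ms service time and ~50 RPS ceiling are from this post. The model is an assumption that ignores stochastic queueing, batching, and timeouts, so read the shape rather than the numbers.

```python
SERVICE_S = 0.035    # GPU time per request, from the post
CEILING_RPS = 50.0   # per-instance saturation point, from the post


def latency_after(load_rps: float, instances: int, elapsed_s: float = 60.0) -> float:
    """Latency seen by a request arriving after elapsed_s of constant load."""
    capacity = instances * CEILING_RPS
    backlog = max(0.0, load_rps - capacity) * elapsed_s
    return SERVICE_S + backlog / capacity


def invocations_per_instance(load_rps: float, instances: int) -> float:
    """Per-minute demand per instance: changes smoothly with instance count."""
    return load_rps * 60.0 / instances


for n in range(3, 8):
    print(n, round(invocations_per_instance(220, n)), round(latency_after(220, n), 3))
```

Latency sits at the service time for every fleet size that covers the load and jumps to seconds the moment one does not. The per-instance rate keeps moving in proportion, which is the behaviour a tracking loop needs.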

The fifth is operational overhead. GPU utilisation is the accurate signal for a GPU-bound workload, but it’s not a predefined target-tracking metric: it requires a CloudWatch agent or DCGM exporter, custom metric publishing, and a customised target-tracking policy. For a team running one endpoint on one model, that’s more ongoing work than the signal accuracy is worth. For a team running many workloads where per-request cost varies wildly, it’s worth the wiring.

And the sixth is robustness to shape changes. The workload that’s GPU-bound today becomes memory-bound tomorrow if the model doubles in size, or I/O-bound next year if the input shape changes. A scaling signal tied to a specific resource has to be re-tuned every time. A signal tied to demand itself keeps working across those changes.

What we’ll filter on

GPU-aware, or at least queue-aware: the signal has to move when the GPU is starved or when requests queue.
Responsive to queue depth before it turns into latency: queue growth is the leading indicator.
Simple to configure: predefined target-tracking metric, no custom publishing pipelines.
Handles ninety-second bursts: the RPS shape is the workload; anything that only reacts on multi-minute windows with no pre-scale arrives late.
Low operational overhead: no DCGM exporter to patch, no CloudWatch agent config to push.

The signal landscape

CPUUtilization target tracking (current policy via SageMakerVariantCPUUtilization). Tracks CPU across the variant. On a GPU-backed instance, CPU runs well below saturation while the GPU is pinned, so the signal fires late or never. Right for CPU-bound containers; wrong here.

SageMakerVariantInvocationsPerInstance target tracking. SageMaker-emitted CloudWatch metric: total invocations against the variant divided by healthy instance count, published at one-minute resolution to AWS/SageMaker. Effectively requests-per-instance-per-minute. Because each instance has a known safe throughput ceiling (~50 RPS here, ~3,000 invocations per minute), the metric is a direct proxy for how close each instance is to its queue-growth point, regardless of whether CPU, GPU, memory, or I/O is the bottleneck. Target-tracking recognises it natively; no custom publishing.
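Before trusting the metric as a scaling input, it helps to look at what it reports for the endpoint. The sketch below builds a CloudWatch `get_metric_statistics` request for `InvocationsPerInstance` in `AWS/SageMaker`. The helper names are hypothetical, the 15-minute window is an arbitrary choice, and `boto3` is imported lazily so the query builder can be inspected without AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def invocations_query(endpoint: str, variant: str, minutes: int = 15, now=None) -> dict:
    """Keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "InvocationsPerInstance",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint},
            {"Name": "VariantName", "Value": variant},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,            # one-minute resolution
        "Statistics": ["Sum"],   # invocations per instance in each minute
    }


def fetch(endpoint: str, variant: str) -> list:
    """Return (timestamp, invocations-per-instance) pairs, oldest first."""
    import boto3  # lazy import: only needed when actually calling AWS
    cloudwatch = boto3.client("cloudwatch")
    resp = cloudwatch.get_metric_statistics(**invocations_query(endpoint, variant))
    return sorted((p["Timestamp"], p["Sum"]) for p in resp["Datapoints"])
```

A per-minute sum near 3,000 would mean each instance is at the ~50 RPS ceiling.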

GPUUtilization via CloudWatch agent or DCGM exporter. The signal that genuinely tracks the GPU. SageMaker’s default metrics don’t include it; it requires the CloudWatch agent with nvidia_gpu metrics, or the NVIDIA DCGM exporter. Both need sidecar/init config, a custom metric namespace, PutMetricData IAM permissions, and a custom target-tracking policy referencing CustomizedMetricSpecification. Direct, accurate, real, and a meaningful amount of wiring.
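For a sense of what the policy side of that wiring looks like, here is a sketch of a target-tracking configuration that uses `CustomizedMetricSpecification`. The namespace `Custom/SageMakerGPU`, the metric name, the dimensions, and the 70% target are all hypothetical placeholders. They have to match whatever the agent or exporter actually publishes, which this post does not specify.

```python
def gpu_policy_config(endpoint: str, variant: str,
                      target_pct: float = 70.0,
                      namespace: str = "Custom/SageMakerGPU",  # hypothetical namespace
                      metric_name: str = "GPUUtilization") -> dict:
    """Target-tracking configuration that follows a custom GPU metric."""
    return {
        "TargetValue": target_pct,
        "CustomizedMetricSpecification": {
            "MetricName": metric_name,
            "Namespace": namespace,
            # Assumed dimensions: must mirror the ones the publisher attaches.
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint},
                {"Name": "VariantName", "Value": variant},
            ],
            "Statistic": "Average",
            "Unit": "Percent",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    }
```

The policy document is the small part. Publishing the metric reliably from every instance is where the ongoing work goes.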

ModelLatency as an alarm trigger. SageMaker publishes ModelLatency (microseconds) and OverheadLatency. Both back CloudWatch alarms; neither is a predefined target-tracking metric. Two deeper problems: latency is a lagging indicator (by the time it breaches, the queue is already bad), and target tracking requires a metric that responds roughly linearly to instance count, whereas latency is a step function (flat until saturation, then a spike).

Step scaling on a custom InvocationsPerSecond metric. Publish a custom metric (invocations-per-second averaged over 30 seconds) and attach a step-scaling policy with tiered adjustments. More aggressive than target tracking at the upper end, but custom plumbing and a manually-tuned step schedule the team re-tunes whenever traffic shape changes.

Side by side

Option	GPU-aware	Responsive to queue	Simple to configure	Handles burst	Low ops
`CPUUtilization` target tracking	✗	✗	✓	✗	✓
`SageMakerVariantInvocationsPerInstance`	✓	✓	✓	✓	✓
`GPUUtilization` via CW agent / DCGM	✓	partial	✗	✓	✗
`ModelLatency` alarm	partial	✗	✗	✗	✓
Step scaling on custom RPS	✓	✓	✗	✓	✗

Matching the shape to the signal

InvocationsPerInstance is the right proxy when per-request cost is uniform, which this workload is. The other signals have their places: GPU utilisation when the wiring is worth it, CPU for CPU-bound workloads, latency for paging.

Why invocations-per-instance is the right proxy

The metric counts invocations sent to the variant, divided by healthy instance count. Each ml.g5.xlarge has a known ceiling: ~50 RPS before the queue grows, or ~3,000 invocations per minute. Pick a target below that ceiling (say 2,400 invocations per instance per minute = 40 RPS), and target tracking adds instances whenever the measured value rises above the target.

Three things make it a better signal than CPU for this workload:

It moves the moment work shows up. Invocations are counted when they hit the variant, before any container work happens. Whether the bottleneck is CPU, GPU, or disk, the metric reflects demand.
It’s roughly linear in instance count. At constant traffic, doubling the instances halves the value. Target tracking’s control loop assumes this linearity; the metric satisfies it by construction.
It’s already published. No agent, no custom namespace, no PutMetricData permissions. Lives in AWS/SageMaker at one-minute resolution, with the EndpointName and VariantName dimensions the scaling target already references.

Choosing the target value is a capacity-planning exercise: measure the per-instance RPS at which queue depth starts to grow (load-test until the GPU saturates), then pick a target 20-30% below. For this classifier:

Saturation: ~50 RPS per instance.
Safe operating ceiling: ~40 RPS per instance.
Target for auto-scaling: 2,400 invocations/instance/minute (= 40 RPS).

With MinCapacity=2 and MaxCapacity=8, baseline runs on two instances at ~20 RPS each, well below target. At the top-of-hour spike, target tracking wants 220 / 40 = 5.5, rounded up to 6 instances.
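The capacity-planning arithmetic above fits in a few lines. This sketch uses the post's figures (50 RPS saturation, 20% headroom, bounds of 2 and 8). The function names are illustrative, and the saturation figure is assumed to come from your own load test.

```python
import math


def target_per_minute(saturation_rps: float, headroom_pct: int) -> float:
    """Target-tracking value in invocations per instance per minute."""
    return saturation_rps * (100 - headroom_pct) / 100 * 60


def instances_for(load_rps: float, target_rps: float, min_cap: int, max_cap: int) -> int:
    """Instances needed to keep per-instance load at or below the target."""
    return max(min_cap, min(max_cap, math.ceil(load_rps / target_rps)))


target = target_per_minute(50, 20)
print(target)  # 2400.0

for label, rps in [("trough", 5), ("baseline", 40), ("spike", 220)]:
    print(label, instances_for(rps, target / 60, min_cap=2, max_cap=8))
# trough 2, baseline 2, spike 6
```

Note the unit conversion in both directions: the policy takes a per-minute value, while load tests and dashboards usually talk in RPS.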

The scale-out timeline

Target tracking is an improvement, not a miracle. CloudWatch metrics publish at one-minute resolution, two breaching datapoints are required, a SageMaker variant update provisions a new instance in roughly two to four minutes, and scheduled traffic that ramps in ninety seconds can still overrun a reactive policy.

Same spike, three regimes. CPU-tracked scales three minutes late because GPU saturates long before CPU does. InvocationsPerInstance breaches earlier and the queue peaks lower. Layering a scheduled pre-warm on top absorbs the spike almost entirely.

Three things worth spelling out:

The metric-to-alarm lag is around two minutes even when the metric is right. CloudWatch aggregates at one-minute resolution and target tracking needs two breaching datapoints, so even an instant-perfect metric costs roughly 120 seconds before the first scale-out call.
Provisioning lag is the unavoidable tail. A new ml.g5.xlarge takes ~2-4 minutes from API call to InService. Target tracking can’t compress that; pre-scaling can cover it.
Pre-scaling is complementary, not a replacement. Scheduled actions help for known spikes. Unexpected surges still need reactive scaling.
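The three regimes can be compared with a second-by-second fluid model of the spike: load ramps from 40 to 220 RPS over ninety seconds, each instance drains 50 RPS, and unserved requests accumulate. The capacity schedules below are illustrative assumptions rather than measurements, and the model has an unbounded queue with no timeouts, so only the ordering between regimes is meaningful.

```python
CEILING_RPS = 50.0  # per-instance saturation point, from the post


def arrival_rps(t: int) -> float:
    """Hourly spike: 40 RPS ramping linearly to 220 RPS over 90 s, then holding."""
    if t >= 90:
        return 220.0
    return 40.0 + (220.0 - 40.0) * t / 90.0


def instances_at(t: int, schedule: list) -> int:
    """schedule is [(start_second, instance_count), ...] sorted by start time."""
    count = schedule[0][1]
    for start, n in schedule:
        if t >= start:
            count = n
    return count


def peak_backlog(schedule: list, horizon_s: int = 900) -> float:
    """Largest number of queued requests seen during the simulated window."""
    backlog = peak = 0.0
    for t in range(horizon_s):
        served = instances_at(t, schedule) * CEILING_RPS
        backlog = max(0.0, backlog + arrival_rps(t) - served)
        peak = max(peak, backlog)
    return peak


# Assumed timings (seconds after the spike starts) at which six instances are serving.
regimes = {
    "cpu-tracked": [(0, 2), (540, 6)],
    "invocations-tracked": [(0, 2), (360, 6)],
    "invocations + pre-warm": [(0, 4), (300, 6)],
}
for name, schedule in regimes.items():
    print(name, round(peak_backlog(schedule)))
```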

Putting the pieces together

Swap the target-tracking policy:

aws application-autoscaling put-scaling-policy \
  --policy-name invocations-target-2400 \
  --service-namespace sagemaker \
  --resource-id endpoint/product-tagger/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 2400.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'

Scale-out cooldown is short (60 s) because the workload bursts hard and briefly; scale-in cooldown is longer (300 s) to avoid flapping on the spike tail.
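For teams that manage scaling from code rather than the CLI, a rough boto3 equivalent is sketched below. It also registers the scalable target, which Application Auto Scaling needs before a policy can attach to it. The endpoint and variant names mirror the CLI example, the helper names are hypothetical, and `boto3` is imported lazily so the request can be inspected without credentials.

```python
RESOURCE_ID = "endpoint/product-tagger/variant/AllTraffic"
DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def policy_kwargs(target: float = 2400.0, scale_in_s: int = 300, scale_out_s: int = 60) -> dict:
    """Keyword arguments for Application Auto Scaling put_scaling_policy."""
    return {
        "PolicyName": f"invocations-target-{int(target)}",
        "ServiceNamespace": "sagemaker",
        "ResourceId": RESOURCE_ID,
        "ScalableDimension": DIMENSION,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": scale_in_s,
            "ScaleOutCooldown": scale_out_s,
        },
    }


def apply(min_cap: int = 2, max_cap: int = 8) -> None:
    """Register the variant as a scalable target, then attach the policy."""
    import boto3  # lazy import: only needed when actually calling AWS
    client = boto3.client("application-autoscaling")
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=RESOURCE_ID,
        ScalableDimension=DIMENSION,
        MinCapacity=min_cap,
        MaxCapacity=max_cap,
    )
    client.put_scaling_policy(**policy_kwargs())
```

Adding this policy does not remove an existing one on the same target, so the old CPU policy would need deleting separately.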

Add a scheduled action that pre-warms just ahead of the hourly spike:

aws application-autoscaling put-scheduled-action \
  --service-namespace sagemaker \
  --schedule "cron(58 * * * ? *)" \
  --scheduled-action-name prewarm-hourly-spike \
  --resource-id endpoint/product-tagger/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --scalable-target-action 'MinCapacity=4,MaxCapacity=8'

And a second action that relaxes the floor after the spike:

aws application-autoscaling put-scheduled-action \
  --service-namespace sagemaker \
  --schedule "cron(10 * * * ? *)" \
  --scheduled-action-name relax-after-spike \
  --resource-id endpoint/product-tagger/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --scalable-target-action 'MinCapacity=2,MaxCapacity=8'

Scheduled actions nudge MinCapacity; target tracking does the reactive work.
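The pre-warm minute follows from how much lead the new instances need. The sketch below derives the cron minute from an assumed lead time. Minute 58 gives two minutes, which matches the lower end of the 2-4 minute provisioning range quoted earlier, so an endpoint that provisions more slowly may want an earlier minute. The function names are illustrative.

```python
def prewarm_minute(spike_minute: int, lead_minutes: int) -> int:
    """Minute of the hour at which to raise MinCapacity ahead of the spike."""
    return (spike_minute - lead_minutes) % 60


def cron_for(minute: int) -> str:
    """Hourly schedule expression in the six-field cron format used above."""
    return f"cron({minute} * * * ? *)"


print(cron_for(prewarm_minute(0, 2)))  # cron(58 * * * ? *)
print(cron_for(prewarm_minute(0, 4)))  # cron(56 * * * ? *)
```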

What’s worth remembering

SageMakerVariantInvocationsPerInstance is a predefined target-tracking metric. No custom publishing, no agent, no IAM changes. Lives in AWS/SageMaker.
CPU utilisation is a misleading proxy for GPU-bound workloads. CPU in front of the GPU is often under-utilised when the GPU is pinned.
Invocations-per-instance measures demand, not constraint. Moves with traffic regardless of which resource is the bottleneck.
Target value is a capacity-planning choice: measure saturation in a load test, pick 20-30% below, leave headroom for scale-out to complete before the queue grows.
CloudWatch 1-minute resolution plus two breaching datapoints plus 2-4 minute provisioning sets a hard floor on reactive scaling.
Scheduled scaling complements target tracking. put-scheduled-action moving MinCapacity covers known spikes.
GPUUtilization needs a CloudWatch agent or DCGM exporter and a custom metric policy. Worth the wiring when per-request cost is variable.
ModelLatency backs CloudWatch alarms, not target-tracking policies. Lagging indicator, step-function response, violates target tracking’s control-loop assumptions.
Scale-in cooldown should be longer than scale-out cooldown for bursty traffic.
The metric is per-minute, not per-second. A 40 RPS peak is 2,400/minute, not 40.

Keep the ml.g5.xlarge sized for the GPU, replace CPUUtilization with SageMakerVariantInvocationsPerInstance target tracking at ~2,400 invocations/instance/minute, and layer a scheduled action that lifts MinCapacity to 4 at the fifty-eighth minute of every hour and relaxes it back to 2 after the spike clears. CPU utilisation goes back to being useful information on a dashboard, not a scaling input for a GPU-bound endpoint.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.