The situation
A deep-learning team runs daily training of a transformer-based recommender on a single p4d.24xlarge. Last month, an epoch took 2.8 hours and the GPUs sat at ~92% utilisation for most of it. This month, the same epoch takes 5.9 hours, the GPUs sit at ~40% utilisation, and nothing in the training script has changed, or at least, nothing the team is admitting to.
Starting evidence:
- nvidia-smi during the run shows the 8 A100s bouncing between 20% and 55% utilisation, never sustained high.
- Loss curves are identical to previous runs; the model is learning correctly, just slowly.
- CPU utilisation on the instance is high (~85% across most cores) whereas it used to sit around 50%.
- The dataset moved last month from tfrecord files on S3 to parquet files on S3; the data pipeline rebuild was “approved because it was simpler.”
- The training code was refactored in the same week to use a newer augmentation library.
The team suspects the data pipeline or the augmentation but can’t prove it. Rewriting everything on a hunch costs a week of engineering time, and if the real problem is actually a single pathological kernel somewhere, the rewrite won’t fix it. A profiler tells us where the time is actually going.
What actually matters
The first useful reframe is that “training is slow” has at most four suspects on a single instance, and a profiler’s job is to tell you which.
Suspect one: the data pipeline. Images (or tokens, or whatever) have to be read from wherever they live, decoded, augmented, collated into a batch, and moved to GPU memory. All of that is CPU and IO work. If the per-step CPU time exceeds the per-step GPU compute, the GPU waits on the data loader, visible as intermittent GPU utilisation and sustained CPU saturation. This is the most common cause of low GPU utilisation and the thing teams most often under-budget for.
Suspect two: the model itself. Maybe the model has grown (more layers, bigger attention, more parameters) and is genuinely slower to execute per batch. This is visible as high GPU utilisation (which rules it out in the recommender case), or as a shift in the breakdown between forward and backward time.
Suspect three: the communication. On multi-GPU training, gradient all-reduce between ranks takes time, and if the sync is serialised with compute (rather than overlapped) the GPUs wait on each other. Visible as gaps between the backward pass and the next forward pass where GPUs are idle but not blocked on IO.
Suspect four: the framework overhead. Python-side overhead per step (operator dispatch, small tensors created and freed, dynamic shapes) can dominate when the per-op kernel is small. Visible as very small GPU kernels followed by CPU-side gaps.
Each of the four is addressable, but the fix is different for each: pre-compute augmentation, cache decoded data, shard differently, recompile the model, enable overlap. None of these is cheap, and picking the wrong one wastes the week. The profiler’s value is answering “which suspect” rather than guessing.
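Suspect one can be sanity-checked with a stopwatch before any profiler is attached. The sketch below is a minimal, framework-agnostic illustration, not how Debugger measures it: split_step_time, data_wait_fraction, and the injectable clock argument are hypothetical names introduced here. It separates the time the loop spends blocked on the next batch from the time spent in the step itself.

```python
import time
from typing import Callable, Iterable, Tuple


def split_step_time(
    batches: Iterable,
    train_step: Callable[[object], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (seconds blocked waiting for data, seconds inside train_step)."""
    data_wait = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(iterator)   # blocks while the loader prepares a batch
        except StopIteration:
            break
        t1 = clock()
        # On a GPU, train_step should synchronise before returning;
        # otherwise asynchronous kernels make compute look too cheap.
        train_step(batch)
        t2 = clock()
        data_wait += t1 - t0
        compute += t2 - t1
    return data_wait, compute


def data_wait_fraction(data_wait: float, compute: float) -> float:
    """Share of wall-clock step time spent waiting on the data pipeline."""
    total = data_wait + compute
    return data_wait / total if total > 0 else 0.0
```

A fraction well above one half over a few hundred steps points at the data pipeline. A fraction near zero says to look at the other three suspects.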
The second useful reframe is that profiling is sampled, not exhaustive. A training job that runs for six hours doesn’t need six hours of microsecond traces; that’s terabytes of data and nobody will read it. It needs a couple of windows: a few hundred steps captured in enough detail to understand the per-step breakdown, plus running aggregates over the full job. SageMaker Debugger is structured around this pattern: continuous coarse profiling plus on-demand detailed profiling.
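The coarse-plus-windowed pattern can be sketched in a few lines. This is an illustration of the idea only, not Debugger's implementation: WindowedRecorder and its fields are hypothetical names. Every step updates a constant-memory running mean, and only steps inside the window keep a detailed record.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WindowedRecorder:
    """Running mean for every step; full records only inside [start, start + num)."""
    start_step: int
    num_steps: int
    count: int = 0
    mean_step_s: float = 0.0
    detailed: List[Dict] = field(default_factory=list)

    def in_window(self, step: int) -> bool:
        return self.start_step <= step < self.start_step + self.num_steps

    def record(self, step: int, step_s: float, phases: Dict[str, float]) -> None:
        # Coarse aggregate: incremental mean, O(1) memory for the whole job.
        self.count += 1
        self.mean_step_s += (step_s - self.mean_step_s) / self.count
        # Detailed capture: per-phase timings kept only for steps in the window.
        if self.in_window(step):
            self.detailed.append({"step": step, "step_s": step_s, **phases})
```

With start_step=1000 and num_steps=500, a 2,000-step run keeps 500 detailed records and a single running mean for everything else.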
The third reframe is that there are really two kinds of things to watch: system metrics (CPU, GPU, memory, IO, network, the things nvidia-smi and top would show you) and framework metrics (per-operator timing, per-step breakdown, the shape of the forward/backward/optimizer phases). Debugger captures both; the system metrics tell you there’s a problem, the framework metrics tell you where.
What we’ll filter on
Four filters worth naming:
- What’s captured: system-level (CPU/GPU/IO), framework-level (operator traces), or both.
- Capture overhead: how much slower is training while profiling?
- How analysis is done: manual, built-in rules, or Studio visual tools.
- What actions come out: where to optimise, not just how long things took.
The profiling landscape
1. nvidia-smi / htop / iostat (ad hoc). What teams reach for first. Gives a point-in-time snapshot of GPU utilisation, CPU load, memory pressure. Useful as confirmation (yes, GPUs are only 40% busy), but doesn’t tell you why. No per-step structure, no framework context.
2. PyTorch Profiler (self-managed). A first-party, in-process profiler. Wraps training steps in a context manager, captures per-operator timings, produces a trace viewable in Chrome tracing or TensorBoard. Accurate and well-integrated, but the team has to wire it up, decide when to enable it, export results somewhere, and analyse the traces themselves.
3. SageMaker Debugger (profiler). Debugger’s profiling mode runs as a sidecar inside the training container; captures system metrics continuously (GPU/CPU/IO/network, every 500ms) and framework metrics (PyTorch/TF operator traces, data loader timings) during configurable windows. Writes to S3; Studio’s Debugger UI renders timelines, operator breakdowns, data-loader performance, and, the key differentiator, built-in rules that flag common issues: LowGPUUtilization, GPUMemoryIncrease, BatchSize, DataloaderBottleneck, OverallFrameworkMetrics, IOBottleneck, and so on. The rules run on the captured data in a parallel Processing job and write alerts to CloudWatch if they fire.
4. SageMaker Debugger (tensors). Debugger’s other half: capturing model tensors (weights, gradients, activations) and running rules against them – VanishingGradient, ExplodingTensor, UnchangedTensor, etc. Useful for convergence issues, not performance issues; worth naming for completeness. Different wiring, different rules, same SDK.
5. Third-party APM (Datadog, Scalyr, etc.). Works at the infrastructure level; it sees what nvidia-smi sees and no more. Doesn’t understand the ML training inner loop.
Side by side
| Tool | Captures | Overhead | Analysis | Actionable output |
|---|---|---|---|---|
| nvidia-smi / htop | System, point-in-time | 0 | Manual | “Something is slow” |
| PyTorch Profiler | Framework traces, on demand | Medium (during window) | Manual (Chrome tracing) | Per-op time breakdown |
| Debugger Profiler | System + framework | Low (sampled) | Studio UI + built-in rules | Rule alerts: dataloader bottleneck, low GPU, etc. |
| Debugger Tensors | Model tensors | Low-medium | Rules + Studio UI | Gradient / activation issues |
| Infra APM | Infrastructure metrics | 0 | External dashboards | Host-level load |
Reading the table against the 40%-GPU scenario: Debugger Profiler’s built-in DataloaderBottleneck and LowGPUUtilization rules are likely to fire immediately; the system-metrics timeline confirms whether it’s CPU-bound; the framework-metrics window shows the per-step breakdown that points at the specific stage.
[Figure: What Debugger shows you]
The pick in depth
Enable the Debugger profiler at training-job submission. The SageMaker SDK exposes it via the profiler_config argument on the estimator:
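    from sagemaker.debugger import FrameworkProfile, ProfilerConfig, ProfilerRule, rule_configs
    from sagemaker.pytorch import PyTorch

    # System metrics every 500 ms for the whole job; detailed framework
    # profiling only for the 500 steps starting at step 1000.
    profiler_config = ProfilerConfig(
        system_monitor_interval_millis=500,
        framework_profile_params=FrameworkProfile(
            start_step=1000,
            num_steps=500,
        ),
    )

    # Profiler rules are attached with ProfilerRule (Rule is for tensor rules),
    # and their parameters are passed to the rule constructor.
    # In the SDK the data-loader rule class is named Dataloader; it is the
    # rule this article refers to as DataloaderBottleneck.
    rules = [
        ProfilerRule.sagemaker(rule_configs.LowGPUUtilization(threshold_p95=70, window=500)),
        ProfilerRule.sagemaker(rule_configs.Dataloader()),
        ProfilerRule.sagemaker(rule_configs.OverallSystemUsage()),
        ProfilerRule.sagemaker(rule_configs.OverallFrameworkMetrics()),
    ]

    estimator = PyTorch(
        ...,
        profiler_config=profiler_config,
        rules=rules,
    )
    estimator.fit(...)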
Two pieces of configuration:
- ProfilerConfig.system_monitor_interval_millis controls the cadence of system-metric sampling. 500ms is the default and cheap. It captures GPU util, GPU memory, CPU util, IO, network continuously for the full job.
- FrameworkProfile(start_step, num_steps) defines the detailed capture window. 500 steps between step 1000 and 1500 is a typical choice: far enough in to be past warm-up, short enough not to dominate storage. Outside the window, the framework-level profiler is off; overhead drops to near-zero.
Each rule is a small Processing-job-shaped worker that runs in parallel with training, reads the profiler output from S3, and evaluates its condition. LowGPUUtilization checks the GPU utilisation time series against a threshold; DataloaderBottleneck compares data-loading time with compute time; BatchSize checks for underutilised memory; OverallFrameworkMetrics looks for unusual op distributions. When a rule fires, the Processing job exits with a non-zero code, the rule status in describe_training_job becomes IssuesFound, and an optional CloudWatch alarm can page the team.
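Checking which rules fired does not require Studio. The sketch below assumes the describe_training_job response carries rule results under ProfilerRuleEvaluationStatuses and DebugRuleEvaluationStatuses, each entry having RuleConfigurationName and RuleEvaluationStatus fields. The helper name rules_with_issues and the job name in the usage comment are hypothetical.

```python
from typing import Dict, List


def rules_with_issues(description: Dict) -> List[str]:
    """Names of rule configurations whose evaluation status is IssuesFound."""
    fired = []
    for key in ("ProfilerRuleEvaluationStatuses", "DebugRuleEvaluationStatuses"):
        for status in description.get(key, []):
            if status.get("RuleEvaluationStatus") == "IssuesFound":
                fired.append(status.get("RuleConfigurationName", "<unnamed>"))
    return fired


# Usage with boto3 (needs AWS credentials; the job name is a placeholder):
#   import boto3
#   desc = boto3.client("sagemaker").describe_training_job(
#       TrainingJobName="my-training-job")
#   print(rules_with_issues(desc))
```

Keeping the parsing in a pure function means it can be exercised against a saved response without touching AWS.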
Reading the output. Studio’s Debugger tab renders the profiler data as a timeline, with the system metrics on one axis and the per-step breakdown on another. The output for the recommender case looks like the diagram above:
- GPU utilisation oscillating around 40%.
- CPU utilisation pinned at ~85%.
- Per-step breakdown: data loading 60%, forward 12%, backward 20%, optimizer 8%.
- LowGPUUtilization and DataloaderBottleneck rules firing.
The verdict writes itself: the data pipeline is the bottleneck, not the model. Drilling into the data-loading operator breakdown (which Debugger also captures) names the two sub-stages consuming the most time: parquet decoding (22%) and augmentation (28%).
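The per-step percentages also give a rough ceiling on what a fix can buy. The sketch below is a back-of-envelope estimate in the spirit of Amdahl's law. It assumes the percentages are shares of wall-clock step time and that the stages run serially, so it is a best case and not a prediction. The function name projected_epoch_hours is hypothetical.

```python
from typing import Dict


def projected_epoch_hours(epoch_hours: float, removed_fractions: Dict[str, float]) -> float:
    """Epoch time if the listed shares of each step disappeared entirely."""
    removed = sum(removed_fractions.values())
    if not 0.0 <= removed <= 1.0:
        raise ValueError("removed fractions must total between 0 and 1")
    return epoch_hours * (1.0 - removed)


# Best case for the recommender: parquet decoding and augmentation both
# vanish from the critical path of a 5.9h epoch.
best_case = projected_epoch_hours(5.9, {"parquet_decode": 0.22, "augmentation": 0.28})
print(f"best-case epoch time: {best_case:.2f}h")
```

Removing both sub-stages takes away half the step, which puts the best case at about 2.95h, in the neighbourhood of the old 2.8h baseline. That is the signal that these two stages, and not the model, account for the regression.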
The fix. Knowing where the time goes reshapes the remediation plan:
- Decoded data cache. Parquet decoding done per-epoch on 120M rows is expensive; decode once and cache the output as tfrecord or a zero-copy format. One-time cost, amortised over 30 epochs.
- Move augmentation to GPU. The new library runs on CPU; many operations (RandomCrop, ColorJitter, RandomHorizontalFlip) have GPU-resident equivalents in torchvision.transforms.v2 or NVIDIA DALI. Moves the work off the starved CPUs and onto the underutilised GPUs.
- Increase dataloader workers. DataLoader(num_workers=16, prefetch_factor=4) instead of the default. Parallelises the CPU work across more processes; saturates the 96 vCPUs on the p4d.24xlarge.
- Pin memory and non-blocking transfer. pin_memory=True and .to(device, non_blocking=True) let the PCIe transfer overlap with GPU compute.
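The first fix, decode once and cache, can be sketched with the standard library alone. This is a minimal illustration and not the team's pipeline: cached_decode is a hypothetical helper, the decode callable stands in for the real parquet reader, and pickle stands in for tfrecord or a zero-copy format.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable


def cached_decode(source: str, decode: Callable[[str], Any], cache_dir: Path) -> Any:
    """Decode `source` once; later calls load the stored result from cache_dir."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)       # cache hit: no decoding work
    decoded = decode(source)             # cache miss: pay the decode cost once
    tmp_file = cache_file.with_suffix(".tmp")
    with tmp_file.open("wb") as fh:
        pickle.dump(decoded, fh, protocol=pickle.HIGHEST_PROTOCOL)
    tmp_file.replace(cache_file)         # rename so readers never see a partial file
    return decoded
```

In a real job the cache would more likely be built by a one-off preprocessing pass than lazily inside the training loop, since several dataloader workers hitting the same miss would each repeat the decode.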
After applying these four, re-run profiling: GPU utilisation should climb towards 85-90%, data-loading percentage should drop below 20% of the step, the DataloaderBottleneck rule should stop firing. Epoch time should drop back towards the 2.8h baseline or better.
When tensor-level debugging is the answer instead. If the loss isn’t converging, or is producing NaN, the performance profiler isn’t the correct tool. Debugger’s tensor-capture path (DebuggerHookConfig, rules like VanishingGradient and ExplodingTensor) runs alongside training, sampling gradient and activation tensors, and fires when they go bad. Different problem, same product.
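For contrast with the profiler rules, the kind of condition a tensor rule evaluates can be sketched conceptually. This is not the Debugger implementation: gradient_health is a hypothetical function and the 1e-7 threshold is an illustrative default. It flags non-finite values first, then a mean absolute gradient that has collapsed towards zero.

```python
import math
from typing import Iterable


def gradient_health(grads: Iterable[float], vanish_threshold: float = 1e-7) -> str:
    """Classify a flat collection of gradient values as ok, vanishing, or non_finite."""
    values = list(grads)
    if any(not math.isfinite(g) for g in values):
        return "non_finite"      # NaN or inf: the exploding case
    if values and sum(abs(g) for g in values) / len(values) < vanish_threshold:
        return "vanishing"       # mean |gradient| has collapsed
    return "ok"
```

Nothing here touches step timing, which is why a convergence problem and a throughput problem need different captures.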
A worked before/after
The concrete wins from one week of applying profiler-driven fixes:
| Metric | Before | After |
|---|---|---|
| GPU utilisation | 40% | 88% |
| CPU utilisation | 85% | 55% |
| Epoch time | 5.9h | 2.6h |
| Full training run | 30 × 5.9h = 177h | 30 × 2.6h = 78h |

- Instance-hours saved: 99h
- At $32.77/h (p4d.24xl): ~$3,244/run saved
- Daily training budget recovered to overnight fit
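The savings arithmetic is simple enough to re-derive, which is useful when the epoch count or instance price changes. A minimal sketch using the figures above; run_savings is a hypothetical helper name.

```python
from typing import Tuple


def run_savings(
    epochs: int, before_h: float, after_h: float, price_per_hour: float
) -> Tuple[float, float, float, float]:
    """Return (hours before, hours after, hours saved, dollars saved) for one run."""
    before_total = epochs * before_h
    after_total = epochs * after_h
    saved_hours = before_total - after_total
    return before_total, after_total, saved_hours, saved_hours * price_per_hour


before, after, saved_h, saved_usd = run_savings(30, 5.9, 2.6, 32.77)
print(f"{before:.0f}h -> {after:.0f}h, saved {saved_h:.0f}h, about ${saved_usd:,.0f} per run")
```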
The fix that actually worked wasn’t the one the team would have picked from intuition. They had bet on the augmentation library, which was only about half of the data-loading cost (28% of the step); the parquet decode, which the cache addresses, was nearly as large (22%). Without the profiler, it would have been a week of rewriting augmentation without touching the parquet decode, measured as “a bit faster,” and the backlog item “migrate away from parquet” would have sat unresolved.
What’s worth remembering
- Debugger profiles both system and framework levels. System metrics (GPU, CPU, IO, network) run continuously every 500ms. Framework metrics (operator traces, per-step breakdown) run during a configured window.
- Built-in rules are the value. LowGPUUtilization, DataloaderBottleneck, IOBottleneck, BatchSize, OverallFrameworkMetrics, OverallSystemUsage flag common problems without manual analysis.
- Profiling windows are finite. Don’t profile for the entire training job; set start_step and num_steps to a representative slice past warm-up. Continuous system metrics still run the whole job.
- Rules are Processing jobs in parallel. They run alongside training and evaluate conditions on the captured S3 data. IssuesFound shows up in describe_training_job if they fire.
- Studio is where the profiler UI lives. Timelines, operator breakdowns, data-loader performance, heatmaps. Non-Studio consumption is possible (raw S3 output + smdebug Python library) but slower.
- Data loading is the most common bottleneck. GPUs wait on CPUs; the fix is parallel dataloader workers, pinned memory, GPU-side augmentation, and decoded-data caching.
- Tensor debugging is the other half of Debugger. Convergence problems (vanishing gradients, exploding activations, NaN weights) use DebuggerHookConfig and tensor rules, not the profiler. Same product, different configuration.
- Profile before optimising. The team that rewrites the augmentation library because “it feels slow” often fixes the wrong thing. One profiler run before any rewrite tells you which stage of the step owns the time.
“The GPU is only 40% busy” is a symptom; the profiler turns it into a diagnosis with a name. For a training job that the team runs every day, knowing which stage to optimise (data loader, model, communication, or framework overhead) is the difference between a week of productive work and a week of plausible-looking rewrites that don’t move the metric.