Choosing a SageMaker Hyperparameter Tuning Strategy

February 09, 2028 · 15 min read

ML Engineer Associate · MLA-C01 · part of The Exam Room

The situation

A fraud-detection team trains XGBoost classifiers on a labelled dataset of 50 million transactions. Labels are “legitimate” or “fraud”, heavily imbalanced (roughly 0.3% positive class). They want the best AUC on a held-out validation set.

They have identified 20 candidate hyperparameters worth tuning: max_depth, eta, gamma, min_child_weight, subsample, colsample_bytree, colsample_bylevel, lambda, alpha, scale_pos_weight, num_round, and nine more. Ranges are defined. Each training run takes 25 minutes on an ml.m5.4xlarge.

The tuning budget is 200 training jobs, with up to 10 parallel within account limits. XGBoost emits validation-auc after every boosting round, so intermediate results are available without extra plumbing. A previous tuning job exists from a stale snapshot of the data and the team would like to reuse it.

What actually matters

Before touring the strategies, it’s worth thinking about what the budget is actually being spent on and what “spending it well” looks like.

A tuning job is a loop: propose a hyperparameter configuration, train a model with it, observe the objective, decide what to propose next. Every strategy is a different answer to “what should the next proposal be?” Random asks the question and throws dice. Grid asks and reads off a table. Bayesian fits a surrogate model of the objective surface and samples where it expects improvement. Hyperband spreads the budget across many configurations and kills the ones that look bad early. None of these is obviously correct until we know the shape of the problem.
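The loop itself is small enough to sketch. The version below is a minimal illustration, not SageMaker’s implementation: `run_tuning`, `propose_random`, `toy_objective`, and `toy_ranges` are hypothetical names, and the toy objective stands in for a real training job. The point is that swapping the `propose` function is the only thing that changes between strategies.

```python
import random

def propose_random(ranges, history, rng):
    """Random strategy: ignore history and draw uniformly from each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def run_tuning(ranges, train_and_score, propose, budget, seed=0):
    """Generic tuning loop: propose, train, observe, repeat until the budget is spent."""
    rng = random.Random(seed)
    history = []  # (config, objective) pairs observed so far
    for _ in range(budget):
        config = propose(ranges, history, rng)
        objective = train_and_score(config)
        history.append((config, objective))
    # Best pair under a maximisation objective such as AUC.
    return max(history, key=lambda pair: pair[1])

def toy_objective(config):
    """Toy stand-in for a training job: the score peaks at eta=0.1, max_depth=6."""
    return 1.0 - (config["eta"] - 0.1) ** 2 - ((config["max_depth"] - 6) / 10) ** 2

toy_ranges = {"eta": (0.01, 0.5), "max_depth": (2.0, 12.0)}
best_config, best_score = run_tuning(toy_ranges, toy_objective, propose_random, budget=200)
print(best_config, best_score)
```

A Bayesian or Hyperband variant would keep `run_tuning` as is and replace `propose_random` with a function that actually reads `history`.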

The first axis is sample efficiency: how quickly does the strategy converge when each training run is expensive? Bayesian optimisation is the textbook answer here because every completed run sharpens the surrogate, and the next choice is the candidate with the highest expected payoff under that updated surrogate. But that advantage depends on sequential evaluation: if ten candidates are chosen without each other’s results, nine of them can’t benefit from the first one’s answer. Parallelism taxes Bayesian’s edge.
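One rough way to put a number on that tax is to count how many times the strategy gets to act on fresh results. The sketch below assumes jobs run in lock-step batches, which is a simplification (real jobs finish asynchronously), and `sequential_rounds` is a hypothetical helper name.

```python
import math

def sequential_rounds(total_jobs, parallel_jobs):
    """Decision points available to the strategy, assuming lock-step batches."""
    return math.ceil(total_jobs / parallel_jobs)

for parallel in (1, 3, 10):
    print(parallel, sequential_rounds(200, parallel))
# 1 200
# 3 67
# 10 20
```

Under that assumption, the same 200-job budget gives a surrogate 200 chances to learn at one worker but only 20 at ten workers.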

The second axis is waste recovery. A 200-job budget sounds generous until you notice that the bottom quartile of configurations will train to completion just to produce a mediocre AUC. If we could identify those after 10% of a training run and kill them, the recovered budget would buy us more good runs. That’s the Hyperband trade: instead of “which config next?” it asks “which configs deserve more resource?”, killing the rest after a cheap early look.

The third axis is dimensionality. Grid search is exhaustive over a categorical product space; 20 hyperparameters with three values each gives 3.5 billion combinations, which is beyond any realistic budget. Random search is famously resilient to dimensionality because when only a handful of the 20 hyperparameters actually matter, random sampling still covers the important axes densely. Bayesian’s Gaussian Process is fine dimension-wise. Hyperband is dimension-agnostic because it doesn’t try to model the objective surface at all.
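The grid arithmetic is quick to check. This is a small sketch with hypothetical names; the second helper asks how many three-valued hyperparameters a full grid could cover inside the 200-job budget.

```python
def grid_size(values_per_dim, dims):
    """Number of combinations in a full grid."""
    return values_per_dim ** dims

def max_dims_within_budget(values_per_dim, budget):
    """Largest number of dimensions whose full grid still fits in the budget."""
    dims = 0
    while values_per_dim ** (dims + 1) <= budget:
        dims += 1
    return dims

print(grid_size(3, 20))                # 3486784401
print(max_dims_within_budget(3, 200))  # 4
```

So a three-value grid exhausts the budget at four hyperparameters, well short of the 20 in play.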

The fourth axis is what “done” means. If the team has a job ceiling (200), the question is “find me the best possible config within 200 jobs.” If the team has a wall-clock ceiling and essentially unlimited instances, the question is “find me a good config as fast as possible.” These favour different strategies. 200 jobs and 10 parallel workers is the former shape.

And the fifth is warm-start. The prior tuning job on stale data isn’t bit-identical to this one, but it’s not useless either. Some strategies can seed their state with prior results; others start fresh every time. Throwing away a previous job’s evidence is a free budget cut.

What we’ll filter on

  1. Finds a near-best configuration within 200 jobs (the strategy can’t waste most of its budget on uninformative regions).
  2. Recovers wasted runs (ideally by killing obviously bad configurations early).
  3. Handles 20 dimensions (a strategy that collapses in high dimensions is out).
  4. Scales with parallelism without losing efficiency (10 parallel workers should behave like a 10x wall-clock speed-up, not a 10x sample-efficiency penalty).
  5. Accepts prior evidence (the stale-data parent job should be usable as a starting point).

The tuning strategy landscape

Grid Search. Exhaustively enumerates every combination of categorical values; continuous ranges aren’t supported. Entirely reproducible. Disqualified here: the scenario has continuous parameters, and even if everything were categorical, 3^20 (about 3.5 billion) combinations is far beyond both the 200-job budget and the 750-job cap per tuning job.

Random Search. Samples each hyperparameter independently and uniformly, ignoring every previous result. Trivially parallel. Surprisingly competitive in high dimensions because when only a few of 20 hyperparameters matter, random sampling covers the important axes densely. But it’s oblivious: every draw is uninformed by prior results, so 200 samples waste a lot of runs on obviously-bad regions.

Bayesian Optimisation. Treats tuning as regression. Builds a Gaussian Process surrogate that maps hyperparameter vector to expected objective, updates after each completed run, and chooses the next configuration via an acquisition function balancing exploitation (sample near current best) and exploration (sample where surrogate is most uncertain). Sequential information gain is its strength; parallelism is its weakness. Low MaxParallelTrainingJobs (1-3) makes Bayesian far more sample-efficient than high parallelism.
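To make the exploit/explore balance concrete, here is a minimal sketch of one common acquisition function, expected improvement. This is an illustration only: the post does not say which acquisition function SageMaker uses, and `expected_improvement` plus the example numbers are hypothetical.

```python
import math

def expected_improvement(mu, sigma, best_so_far, xi=0.0):
    """Expected improvement for maximisation, given the surrogate's predicted
    mean `mu` and standard deviation `sigma` at a candidate point."""
    gain = mu - best_so_far - xi
    if sigma <= 0.0:
        return max(gain, 0.0)  # no uncertainty: improvement is deterministic
    z = gain / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + sigma * pdf

best_auc = 0.95
# Candidate near the current best: slightly better mean, almost no uncertainty.
exploit = expected_improvement(mu=0.951, sigma=0.001, best_so_far=best_auc)
# Candidate in an unexplored region: worse mean, much wider uncertainty.
explore = expected_improvement(mu=0.940, sigma=0.030, best_so_far=best_auc)
print(exploit, explore)
```

With these made-up numbers the uncertain candidate scores higher, which is the exploration half of the trade-off: a region the surrogate knows little about can be worth more than a marginal gain next to the incumbent.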

Hyperband. A multi-fidelity strategy. Hyperband doesn’t just choose which configurations to try; it also chooses how much resource (boosting rounds, epochs) to spend on each, using successive halving. Start many configurations with a small resource slice, kill the worst fraction, give survivors more resource, repeat. Requires iterative training with intermediate metric emission. XGBoost qualifies. Parallelises cleanly because brackets run concurrently. Kills bad configurations on real intermediate metrics.

Side by side

Strategy | Near-best in 200 | Recovers waste | Handles 20 dims | Parallel-friendly | Warm-startable
Grid | | | | |
Random | | | | |
Bayesian | | (via Auto early stop) | | |
Hyperband | | | | |

Matching problem shapes to strategies

Four problem shapes, each with an example, the questions it answers, and the strategy that results:

  • Big budget, parallel, iterative training that emits an intermediate metric. Example: XGBoost fraud (200 jobs, 10 parallel, 25 min/run, 20 params, validation-auc per round). Intermediate metric? Yes. Parallelism wanted? Yes. Kill bad configs early? Yes. Result: Hyperband (successive halving, kills bad configs early, ~1/4 of Random’s time, full parallelism, MinRes=3, MaxRes=100).
  • Tight budget, sequential, final metric only. Example: expensive tuning (~30 jobs, 1-3 parallel, no intermediate signal, every run must count). Intermediate metric? No. Low parallelism OK? Yes. Surrogate informs next choice? Yes. Result: Bayesian (GP surrogate model, explore vs exploit, best at 1-3 parallel, suited to expensive runs, Auto early stop optional).
  • Small categorical space where reproducibility matters. Example: A/B/C sweep (3 categorical params, reproducibly exhaustive, <100 combinations). All categorical? Yes. Combos < 750? Yes. Enumerate all? Yes. Result: Grid (exhaustive enumeration, categorical only, bit-perfect reproducible, no continuous ranges, job count auto-derived).
  • High dimension, huge budget, lots of parallelism. Example: open exploration (500+ jobs budget, surface not understood, high parallelism). Parallelism wanted? Yes. Happy with oblivious sampling? Yes. Coverage? Yes. Result: Random (uniform sampling, ignores prior results, embarrassingly parallel, high-dim friendly, add Auto to trim waste).

Each shape answers a small handful of questions (intermediate metric? parallel budget? categorical-only?) and the strategy falls out the bottom.

Hyperband, in depth

Hyperband runs successive halving. Start many configurations at a small resource allocation, kill the worst fraction, give survivors more resource, repeat. The classical schedule for min_resource = 1, max_resource = 81, halving factor eta = 3:

Round 1:  81 configs x  1 round  = 81 resource-units
Round 2:  27 configs x  3 rounds = 81 resource-units
Round 3:   9 configs x  9 rounds = 81 resource-units
Round 4:   3 configs x 27 rounds = 81 resource-units
Round 5:   1 config  x 81 rounds = 81 resource-units

Only 1 of the 81 initial configurations is trained to full resource. The other 80 die at progressively deeper allocations, and their recovered budget buys longer looks at the survivors. For XGBoost the resource_type is num_round; the kill decision uses the real validation-auc from the last emitted round.
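The schedule above falls out of a few lines of arithmetic. This is a sketch of plain successive halving under the same numbers, not SageMaker’s internal scheduler; `successive_halving_schedule` is a hypothetical name.

```python
def successive_halving_schedule(min_resource, max_resource, eta, initial_configs):
    """Return (configs, resource_per_config) for each round of successive halving."""
    schedule = []
    configs, resource = initial_configs, min_resource
    while configs >= 1 and resource <= max_resource:
        schedule.append((configs, resource))
        configs //= eta   # keep the best 1/eta of the configurations
        resource *= eta   # give each survivor eta times more resource
    return schedule

for configs, resource in successive_halving_schedule(1, 81, 3, 81):
    print(configs, resource, configs * resource)
# 81 1 81
# 27 3 81
# 9 9 81
# 3 27 81
# 1 81 81
```

Each round spends the same 81 resource-units, which is why the budget stays flat while the survivors get progressively longer looks.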

"HyperParameterTuningJobConfig": {
  "Strategy": "Hyperband",
  "StrategyConfig": {
    "HyperbandStrategyConfig": {
      "MinResource": 3,
      "MaxResource": 100
    }
  },
  "ResourceLimits": {
    "MaxNumberOfTrainingJobs": 200,
    "MaxParallelTrainingJobs": 10
  },
  "HyperParameterTuningJobObjective": {
    "Type": "Maximize",
    "MetricName": "validation:auc"
  },
  "TrainingJobEarlyStoppingType": "Off"
}

MinResource = 3 means every configuration trains at least 3 boosting rounds before it can be killed; too small and noisy early rounds wrongly kill good configs, too large and Hyperband collapses toward Random. TrainingJobEarlyStoppingType: Off is correct: Hyperband has its own kill path, and layering Auto creates two overlapping mechanisms.
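For contrast with Hyperband’s own kill path, a median-based stopping rule of the kind Auto applies to Random or Bayesian jobs can be sketched roughly as follows. This is an assumption-laden illustration of a generic median rule, not SageMaker’s exact logic; `should_stop` and the sample curves are hypothetical.

```python
from statistics import mean, median

def should_stop(current_curve, completed_curves, maximize=True):
    """Median rule: stop when the current job's latest objective is worse than
    the median of earlier jobs' running averages up to the same step."""
    step = len(current_curve)
    if step == 0:
        return False  # nothing observed yet
    running_averages = [
        mean(curve[:step]) for curve in completed_curves if len(curve) >= step
    ]
    if not running_averages:
        return False  # no earlier job reached this step
    threshold = median(running_averages)
    latest = current_curve[-1]
    return latest < threshold if maximize else latest > threshold

# Per-round validation AUC from three earlier jobs (made-up numbers).
earlier = [[0.70, 0.80, 0.85], [0.72, 0.82, 0.88], [0.60, 0.65, 0.70]]
print(should_stop([0.55, 0.60], earlier))  # True
print(should_stop([0.74, 0.83], earlier))  # False
```

Both this rule and Hyperband act on intermediate metrics, which is the overlap the paragraph above warns about.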

Warm start stacks on top. A new tuning job references up to five parent tuning jobs. Two flavours:

  • IdenticalDataAndAlgorithm assumes the surface is identical. Ranges can change; budget can change.
  • TransferLearning assumes the surface is similar, not identical: the correct choice when the training data has changed. Seeds the search toward promising regions without assuming equivalence.

The scenario’s prior job ran on a stale snapshot. That’s TransferLearning.
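As a sketch, the warm-start block of the tuning request could be assembled like this. The field names follow the `WarmStartConfig` structure of the CreateHyperParameterTuningJob API as I understand it; the helper name and the parent job name `fraud-xgb-stale-snapshot` are hypothetical, and nothing here calls AWS.

```python
def build_warm_start_config(parent_job_names, data_changed):
    """Build a WarmStartConfig dict for a tuning request.

    TransferLearning when the training data has changed,
    IdenticalDataAndAlgorithm otherwise.
    """
    if not 1 <= len(parent_job_names) <= 5:
        raise ValueError("warm start takes between 1 and 5 parent tuning jobs")
    return {
        "ParentHyperParameterTuningJobs": [
            {"HyperParameterTuningJobName": name} for name in parent_job_names
        ],
        "WarmStartType": (
            "TransferLearning" if data_changed else "IdenticalDataAndAlgorithm"
        ),
    }

# The stale-snapshot parent: the data has changed, so TransferLearning.
warm_start = build_warm_start_config(["fraud-xgb-stale-snapshot"], data_changed=True)
print(warm_start)
```

The resulting dict would sit alongside `HyperParameterTuningJobConfig` as the `WarmStartConfig` argument of the request.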

Parallelism costs different strategies differently. Random and Grid: no efficiency loss. Bayesian: efficiency drops because each concurrent job is chosen without the others’ results; keep to 1-3 when the budget is tight. Hyperband: parallelises within and across brackets without significant efficiency loss.

A worked example: the 200-job budget

200-job budget, 10 parallel workers, ml.m5.4xlarge at ~$0.922/hour, 25 minutes per full run.

Random baseline. 200 independent samples; every run completes. 200 × 25 = 5,000 instance-minutes = 83 instance-hours, wall-clock ~8.3 hours at 10 parallel. Cost ~$77.

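Hyperband. 200 configurations enter the successive-halving schedule. A typical run kills ~60% after the first round of halving, ~25% after the second, and trains the remaining ~15% progressively longer. Rough totals, using approximate per-job minutes (these do not scale linearly with round count, which is consistent with a fixed per-job overhead on top of the boosting rounds):

  • ~120 configs stopped at 3 rounds × ~3 min each = ~360 min
  • ~50 configs stopped at 10 rounds × ~8 min each = ~400 min
  • ~25 configs stopped at 30 rounds × ~20 min each = ~500 min
  • ~5 configs at the full 100 rounds × 25 min each = ~125 min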

Total ~1,385 instance-minutes = ~23 instance-hours, wall-clock ~2.3 hours at 10 parallel. Cost ~$21. AUC typically matches or slightly beats Random at the same job count, at roughly a quarter of the wall-clock time and cost.

Bayesian at 10 parallel. Every run completes. Same 83 instance-hours, ~$77. Each batch of 10 is chosen without the results of the other jobs in flight, so across 200 jobs the surrogate gets roughly 20 meaningful update points rather than 200. Sample efficiency drops; the result is typically worse than Hyperband on the same budget.

Bayesian at 3 parallel, same 200 jobs. Wall-clock rises to ~28 hours, but sample efficiency is much better.
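The arithmetic behind these comparisons can be reproduced in a few lines. This sketch takes the post’s own figures as inputs (the instance price, the 25-minute full run, and the rough Hyperband stage totals) and assumes perfect packing of jobs onto workers; `summarise` is a hypothetical helper.

```python
PRICE_PER_HOUR = 0.922   # ml.m5.4xlarge, figure quoted in the post
FULL_RUN_MINUTES = 25
TOTAL_JOBS = 200

def summarise(instance_minutes, parallel):
    """Instance-hours, wall-clock hours (perfect packing assumed), and cost."""
    hours = instance_minutes / 60
    return {
        "instance_hours": round(hours, 1),
        "wall_clock_hours": round(hours / parallel, 1),
        "cost_usd": round(hours * PRICE_PER_HOUR),
    }

full_budget_minutes = TOTAL_JOBS * FULL_RUN_MINUTES   # every run completes
hyperband_minutes = 360 + 400 + 500 + 125             # rough stage totals above

random_10 = summarise(full_budget_minutes, parallel=10)
hyperband_10 = summarise(hyperband_minutes, parallel=10)
bayesian_3 = summarise(full_budget_minutes, parallel=3)

print(random_10)     # {'instance_hours': 83.3, 'wall_clock_hours': 8.3, 'cost_usd': 77}
print(hyperband_10)  # {'instance_hours': 23.1, 'wall_clock_hours': 2.3, 'cost_usd': 21}
print(bayesian_3)    # {'instance_hours': 83.3, 'wall_clock_hours': 27.8, 'cost_usd': 77}
```

Hyperband’s 1,385 instance-minutes against 5,000 is a ratio of about 0.28, which is where the “roughly a quarter” figure comes from.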

Layering TransferLearning on Hyperband against the stale-data parent typically reaches a target AUC in 20-40% fewer jobs than the cold version.

What’s worth remembering

  1. Four strategies: Random samples uniformly, Grid enumerates exhaustively (categorical only), Bayesian builds a GP surrogate, Hyperband runs successive halving.
  2. Hyperband requires intermediate metrics; it only works with iterative algorithms that emit per-epoch or per-round objectives.
  3. Parallelism costs Bayesian sample efficiency because each in-flight job is chosen without the others’ results. Keep MaxParallelTrainingJobs at 1-3 for tight Bayesian budgets.
  4. TrainingJobEarlyStoppingType: Auto layers median-based kill on Random or Bayesian. Hyperband has its own better logic; leave Auto off with Hyperband.
  5. Warm start has two flavours: IdenticalDataAndAlgorithm (same surface) and TransferLearning (similar surface). Max 5 parent jobs, and the new job must keep the same objective metric and the same number and types of hyperparameters as its parents.
  6. Resource limits: 750 training jobs per tuning job (10,000 for Random via a support ticket), 10 parallel default (raisable to 100), 30 hyperparameters max, 20 metrics per tuning job.
  7. Grid is categorical-only; MaxNumberOfTrainingJobs is auto-calculated from the combination count.
  8. Rule of thumb on iterative workloads: Hyperband reaches comparable AUC to Random in roughly 1/4 the total training time, with full parallelism.

Run the 200-job budget as a Hyperband tuning job with num_round as the resource type, MinResource = 3 and MaxResource = 100, MaxParallelTrainingJobs = 10, TrainingJobEarlyStoppingType: Off, objective Maximize validation:auc, and a TransferLearning warm start referencing the stale-data parent. Expected result: AUC comparable to a 200-job Random Search, delivered in roughly 2-3 hours of wall-clock at around $21 of compute.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.