How to Right-Size a SageMaker Endpoint With Inference Recommender

May 29, 2028 · 16 min read

The situation

A product team has fine-tuned a 7B-parameter LLM on customer support transcripts. The trained artifact is a ~14 GB model in S3. Production requirements:

Payload: request is a ~2,000-token prompt, response is ~500 tokens.
Latency target: p99 under 2 seconds for the full response.
Traffic: ~80 requests per minute steady state, spiking to ~250/min during daytime.
Budget ask from product: “as cheap as possible that meets the latency target”.

The team’s instinct is to deploy on ml.g5.12xlarge (4 × A10G GPUs, 192 GB RAM, $7/hour on-demand) because “that’s what we trained on”. The platform engineer’s instinct is that a 7B model might fit comfortably on one A10G, which would point at ml.g5.2xlarge (one GPU, 32 GB RAM, $1.50/hour). That is a price gap of more than 4.5×.

Guessing wrong either way is expensive. Guessing too small means missing the latency target and either paging the on-call or burning weeks rebuilding. Guessing too big means paying 4-5× too much per month for the life of the endpoint. What the team wants is a structured way to benchmark the candidate instance types against the actual model and payload, without spending a week building a load-testing framework.

What actually matters

Before reaching for a specific service, it’s worth being explicit about what “sizing an inference endpoint” actually requires and which properties of any chosen mechanism end up doing the work.

The first thing worth thinking about is the gap between predicted and measured performance. Sizing by spec sheet (“this model has 7B parameters, this GPU has 24 GB of memory, so it’ll fit”) captures the static memory footprint but misses everything else: KV-cache growth with context length, serving-framework overhead, kernel launch latency, batching dynamics, and GPU utilisation under realistic load. The only honest way to know whether an instance type meets the SLO is to run the actual model on the actual instance with the actual payload and watch what happens. Anything else is intuition with a calculator.
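To make the KV-cache point concrete, here is a rough back-of-envelope sketch. The architecture numbers (32 layers, hidden size 4096, full multi-head attention, fp16 cache) and the 2 GB framework overhead are assumptions typical of a Llama-style 7B model, not facts about this team's model, and the result is an estimate, not a measurement.

```python
# Back-of-envelope GPU memory estimate for an fp16 7B model.
# ASSUMPTIONS (not from the article): 32 layers, hidden size 4096,
# full multi-head attention, fp16 KV cache, 2 GB framework overhead.

def kv_cache_gb(tokens: int, n_layers: int = 32, hidden_size: int = 4096,
                bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: a key and a value vector per layer per token."""
    bytes_per_token = 2 * n_layers * hidden_size * bytes_per_value
    return bytes_per_token * tokens / 1e9


WEIGHTS_GB = 7e9 * 2 / 1e9               # 7B parameters at 2 bytes each
PER_REQUEST_GB = kv_cache_gb(2000 + 500)  # prompt plus response tokens


def max_concurrent_sequences(gpu_gb: float, overhead_gb: float = 2.0) -> int:
    """How many full-length requests fit next to the weights at once."""
    free_gb = gpu_gb - WEIGHTS_GB - overhead_gb
    return max(int(free_gb // PER_REQUEST_GB), 0)


print(f"weights: {WEIGHTS_GB:.1f} GB, KV cache per request: {PER_REQUEST_GB:.2f} GB")
print(f"concurrent sequences on a 24 GB GPU: {max_concurrent_sequences(24)}")
```

Under these assumptions the weights fit, but only a handful of full-length requests fit alongside them, which is exactly the kind of limit a spec-sheet check hides and a real benchmark exposes.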

The second is payload authenticity. A benchmark is a function of the input distribution it’s fed. A “representative” prompt that’s actually a 100-token sample under-counts the work a 2,000-token production prompt does; a single canned input under-counts the variance a mixed traffic stream produces. Whatever does the measuring is only as good as the payloads it’s measuring against, which means the team owns the job of building a sample set that mirrors production: size distribution, shape variety, and the awkward edge cases that hit the slow path. The mechanism doesn’t generate that data; we do.

The third is the ranking metric. Latency, throughput, cost-per-hour, cost-per-inference, GPU utilisation: each can be the right answer depending on the question. For “cheapest fit that meets the SLO,” cost-per-inference (cost-per-hour divided by sustainable throughput at the latency target) is the metric, because it folds price and capacity into one comparable number. A ranking on cost-per-hour alone over-favours small instances that can’t sustain peak; a ranking on raw throughput over-favours big instances where we’re paying for idle headroom. The mechanism has to produce the right metric, and the team has to know which metric to look at.
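The cost-per-inference definition above fits in a few lines. This is a minimal sketch: the hourly prices are the two quoted earlier, while the throughput figures are placeholders chosen for illustration, not measurements.

```python
def cost_per_inference(cost_per_hour: float, sustainable_rpm: float) -> float:
    """Dollars per request at the highest request rate that still meets the latency target."""
    if sustainable_rpm <= 0:
        raise ValueError("sustainable_rpm must be positive")
    requests_per_hour = sustainable_rpm * 60
    return cost_per_hour / requests_per_hour


# (hourly price in USD, sustainable requests per minute).
# The throughput values are PLACEHOLDERS, not benchmark results.
candidates = {
    "ml.g5.2xlarge": (1.50, 140),
    "ml.g5.12xlarge": (7.00, 400),
}

ranked = sorted(candidates, key=lambda name: cost_per_inference(*candidates[name]))
for name in ranked:
    print(f"{name}: ${cost_per_inference(*candidates[name]):.5f} per inference")
```

The bigger instance is faster in absolute terms, yet with these placeholder numbers it still loses the ranking because its throughput does not grow as fast as its price.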

The fourth is steady-state versus tail behaviour. A single short benchmark captures steady-state latency once the model is warm. It doesn’t capture cold-start, capacity ramp behaviour under bursts, or what happens at the breaking point where latency starts to climb. For a service with a flat traffic profile, steady-state is enough; for a service that spikes 3× during the day or scales to zero overnight, the tail behaviour matters more than the median. Knowing which regime applies, and whether the benchmarking approach measures it, is part of the sizing work.
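The difference between median and tail is easy to see once latencies are summarised by percentile. The sketch below uses a nearest-rank percentile over client-side latency samples; the sample values are invented to show a run whose median looks healthy while its p99 misses a 2,000 ms target.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarise(latencies_ms, slo_ms):
    """Report median and tail latency, and whether the tail meets the SLO."""
    p99 = percentile(latencies_ms, 99)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": p99,
        "meets_slo": p99 < slo_ms,
    }


# Invented data: 98 fast requests and 2 slow ones.
latencies = [900] * 98 + [2600] * 2
summary = summarise(latencies, slo_ms=2000)
print(summary)
```

If the samples come from the CloudWatch ModelLatency metric rather than a load generator, note that CloudWatch reports that metric in microseconds, so convert before comparing against a millisecond SLO.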

The fifth is the rejection signal. Some instance types will OOM on load; some will miss the SLO; some won’t have a compatible container available. Knowing which candidates failed and why is as informative as the ranking of the ones that passed: it tells the team where the floor is, which constraints are real, and what would have to change (quantisation, distillation, a different inference framework) to unlock cheaper hardware. A mechanism that silently drops failures is worse than one that lists them with explanations.

The sixth is the cost of the benchmark itself. Running real deployments costs real money for the instance-hours used during the test. The benchmarking spend has to be small relative to the wrong-decision spend, which, for an endpoint that will run 24/7 for months, it almost always is. An hour of mixed-instance benchmarking costs less than a single day of running on a 4× over-sized instance, so the right posture is to spend liberally on measurement before committing to a production size.

The seventh is the boundary of what sizing can solve. No sizing exercise turns a too-big model into one that fits on a small GPU. Quantisation, distillation, and model-size choice change the sizing equation entirely: a 4-bit 7B model runs on hardware where the fp16 version OOMs. The sizing decision comes after the model engineering decisions; running the benchmark to discover “nothing fits in budget” is the signal that the engineering work isn’t done, not that the benchmark is wrong.
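The effect of precision on the weight footprint is simple arithmetic, sketched below. It counts weights only; real quantised formats carry some extra metadata (scales, zero-points), so treat the figures as lower bounds.

```python
def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, ignoring quantisation metadata."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_footprint_gb(7e9, bits):.1f} GB")
```

Going from fp16 to 4-bit cuts the weights of a 7B model from about 14 GB to about 3.5 GB, which is why the set of viable instance types changes once quantisation is on the table.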

What we’ll filter on

Instance coverage: does it cover the relevant instance types (GPU, CPU, Inferentia) in appropriate sizes?
Actual payload: does it use our payload, not a synthetic or default one?
Constraint support: does it respect max latency, min throughput, and allowed instance types?
Cost ranking: is the output ordered by cost-per-inference or similar?
Confidence level: is it a fast heuristic or a rigorous load test?

The instance-sizing landscape

1. SageMaker Inference Recommender (Instance Recommendation). Fast mode: ~45 minutes, evaluates 5-10 candidate instance types, returns a ranked list with measured latency, throughput, and cost per inference. Good for the first narrowing.

2. SageMaker Inference Recommender (Load Test). Slow mode: several hours, evaluates a chosen shortlist with ramping traffic. Returns capacity curves and more robust numbers. Use after the fast mode narrows to 2-3 candidates.

3. Manual benchmarking via CloudWatch metrics. Deploy to one instance type, send representative traffic from a load generator (Locust, k6, custom script), watch CloudWatch metrics for ModelLatency, InvocationsPerInstance, CPUUtilization, GPUUtilization. Repeat per candidate. Time-consuming and error-prone compared to Inference Recommender, but gives full control, and is the only option for non-SageMaker deployments.

4. Vendor guidance and model-size heuristics. “A 7B model needs ~14 GB for weights in fp16, plus KV cache, plus overhead; 32 GB total is usually enough for modest context.” Useful for the first candidate selection but not a substitute for measurement.

5. SageMaker Savings Plans / Reserved Endpoints. Orthogonal to sizing, but relevant: once the correct instance type is chosen, committing to it via Savings Plans saves 30-50% if the workload is steady. Size first, commit second.

Side by side

Option	Instance coverage	Actual payload	Constraint support	Cost ranking	Confidence
Inference Recommender (Instance)	Broad, AWS-curated	✓ (sample payload required)	✓	✓	Moderate (short test)
Inference Recommender (Load Test)	User-selected shortlist	✓	✓	✓	High (ramped load)
Manual + CloudWatch	Whatever we try	✓	via our scripts	via our scripts	As high as we invest in it
Heuristics alone	n/a	✗	partial	rough	Low

Reading by situation:

First cut on a new model we’ve never deployed. Instance Recommendation mode. Fast, cheap, good for shortlisting 2-3 types.
Final choice before production. Load Test mode on the shortlist. Confirms the chosen type hits the SLO at peak traffic with headroom.
Ongoing capacity planning. Re-run periodically as traffic grows; the correct instance for 80 req/min might be wrong at 800.

For the 7B LLM, the obvious plan is: Instance Recommendation to pick 2-3 candidates, then a Load Test to confirm the winner.

The recommender’s workflow

Inputs go in, ranked measurements come out, a load test confirms the shortlist, and the decision falls out with numbers attached.

The pick in depth

Run an Instance Recommendation job first, then a Load Test on the top two.

The Instance Recommendation job is configured via CreateInferenceRecommendationsJob with JobType=Default:

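import boto3

# Assumes AWS credentials and region are available from the environment.
sagemaker_client = boto3.client('sagemaker')

job_response = sagemaker_client.create_inference_recommendations_job(
    JobName='llm7b-instance-reco-2027W45',
    JobType='Default',  # 'Default' = Instance Recommendation, 'Advanced' = Load Test
    RoleArn='arn:aws:iam::...:role/InferenceRecommenderRole',
    InputConfig={
        'ModelPackageVersionArn': 'arn:aws:sagemaker:...:model-package/llm7b/v1',
        # The candidate instance types are nested under ContainerConfig.
        'ContainerConfig': {
            'SupportedInstanceTypes': [
                'ml.g5.2xlarge', 'ml.g5.4xlarge', 'ml.g5.12xlarge', 'ml.g5.24xlarge',
                'ml.g4dn.xlarge', 'ml.g4dn.2xlarge',
                'ml.inf2.xlarge', 'ml.inf2.8xlarge'
            ]
        },
        'JobDurationInSeconds': 7200,  # upper bound on total job runtime
        'TrafficPattern': {
            'TrafficType': 'PHASES',
            'Phases': [{'InitialNumberOfUsers': 1, 'SpawnRate': 1, 'DurationInSeconds': 120}]
        },
        'EndpointConfigurations': [...]
    },
    # The job stops early once either condition is reached.
    StoppingConditions={
        'MaxInvocations': 1000,
        # The 2-second p99 latency target from the requirements.
        'ModelLatencyThresholds': [{'Percentile': 'P99', 'ValueInMilliseconds': 2000}]
    }
)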

The sample payload lives at the Model Package’s SamplePayloadUrl: ~50 representative requests, each around 2,000 tokens of prompt, packaged as the single gzip-compressed tar archive (.tar.gz) that SageMaker expects at that URL. Variety matters: 50 identical prompts would mostly benchmark the cache, while 50 varied prompts reflect real production behaviour more honestly.
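Packaging the samples is a small scripting job. The sketch below writes one JSON request per file into a .tar.gz archive, the format SageMaker documents for the sample payload. The request schema (`inputs` plus `parameters.max_new_tokens`) and the file naming are assumptions: the real body must match whatever the serving container accepts.

```python
import io
import json
import tarfile


def build_payload_archive(prompts, archive_path, max_new_tokens=500):
    """Write one JSON request per prompt into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for i, prompt in enumerate(prompts):
            # ASSUMED request schema; adapt to the serving container's contract.
            body = json.dumps({
                "inputs": prompt,
                "parameters": {"max_new_tokens": max_new_tokens},
            }).encode("utf-8")
            info = tarfile.TarInfo(name=f"payload_{i:03d}.json")
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    return archive_path
```

A call such as `build_payload_archive(prompts, "payload.tar.gz")`, with `prompts` drawn from real (scrubbed) production traffic, produces the archive to upload to S3 and reference from the model package.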

After about 45 minutes, the job finishes. Results (as illustrated above): the 7B model on g4dn.xlarge runs out of memory on load (16 GB is not enough for a 14 GB artifact plus KV cache). On g5.2xlarge, p99 is 1,340 ms, well under target. On g5.12xlarge, p99 is 780 ms: faster, but at almost 5× the cost. g4dn.2xlarge (T4 GPU) misses the SLO at 3,200 ms. inf2.xlarge requires a Neuron-compiled model, which this artifact isn’t, so it’s skipped.

Shortlist: g5.2xlarge and g5.4xlarge.

The Load Test job runs against the shortlist with the actual traffic profile (ramp from 0 to 500 rpm over 15 minutes). It measures how many requests each instance can handle before p99 crosses the threshold. g5.2xlarge holds to ~140 rpm per instance; g5.4xlarge to ~220 rpm. Peak traffic is 250 rpm, so a minimum of two g5.2xlarge instances (~280 rpm) handles it with modest headroom, and autoscaling to four (~560 rpm) absorbs bursts of roughly 2× the expected peak.
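A Load Test request has the same shape as the earlier call, with JobType set to Advanced and the shortlist passed as endpoint configurations. The following is a sketch under assumptions, not the team's actual configuration: Inference Recommender describes load in concurrent users per phase, not in requests per minute, so the phase values here are placeholders, and the job duration and function names are hypothetical.

```python
def build_load_test_request(job_name, role_arn, model_package_arn, shortlist,
                            p99_threshold_ms=2000, max_invocations_per_minute=500):
    """Build keyword arguments for create_inference_recommendations_job (Load Test)."""
    return {
        "JobName": job_name,
        "JobType": "Advanced",
        "RoleArn": role_arn,
        "InputConfig": {
            "ModelPackageVersionArn": model_package_arn,
            "JobDurationInSeconds": 9000,  # placeholder upper bound (~2.5 hours)
            # One endpoint configuration per shortlisted instance type.
            "EndpointConfigurations": [{"InstanceType": t} for t in shortlist],
            "TrafficPattern": {
                "TrafficType": "PHASES",
                # PLACEHOLDER ramp: start with 1 user, add users over 15 minutes.
                "Phases": [
                    {"InitialNumberOfUsers": 1, "SpawnRate": 1, "DurationInSeconds": 900},
                ],
            },
        },
        "StoppingConditions": {
            "MaxInvocations": max_invocations_per_minute,
            "ModelLatencyThresholds": [
                {"Percentile": "P99", "ValueInMilliseconds": p99_threshold_ms},
            ],
        },
    }


def submit(request):
    """Send the request; boto3 is imported lazily so the builder runs without AWS access."""
    import boto3
    return boto3.client("sagemaker").create_inference_recommendations_job(**request)
```

Keeping the request builder separate from the call makes the configuration easy to review and diff before any instance-hours are spent.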

Final deployment decision: ml.g5.2xlarge, two-instance minimum, autoscale to four on an InvocationsPerInstance target. Monthly cost at the two-instance floor: roughly $2,160 (2 × $1.50 × 720 hours), vs. roughly $10,080 for the same footprint on g5.12xlarge chosen by intuition. Latency target met with margin. The roughly $8K/month difference funds other things.
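The instance floor and the autoscaling policy can be derived and expressed as follows. This is a sketch: the endpoint name, variant name, policy name, and the target of 100 invocations per instance per minute (roughly 70% of the measured ~140 rpm) are assumptions for illustration, and the requests target the Application Auto Scaling API that SageMaker endpoints use for scaling.

```python
import math


def instances_needed(peak_rpm: float, per_instance_rpm: float) -> int:
    """Smallest instance count whose combined capacity covers the peak."""
    return math.ceil(peak_rpm / per_instance_rpm)


def build_autoscaling_requests(endpoint_name, variant_name, min_instances,
                               max_instances, target_invocations_per_instance):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": "invocations-per-instance-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return target, policy


def apply_autoscaling(target, policy):
    """Apply both requests; boto3 is imported lazily so the builders run offline."""
    import boto3
    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


floor = instances_needed(peak_rpm=250, per_instance_rpm=140)
target, policy = build_autoscaling_requests(
    "llm7b-endpoint", "AllTraffic",  # hypothetical endpoint and variant names
    min_instances=floor, max_instances=4, target_invocations_per_instance=100,
)
```

Setting the target below the measured per-instance limit gives the autoscaler time to add capacity before p99 starts to climb; the right margin is worth validating against the load-test curves.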

A worked benchmarking job

Team kicks off the job on a Wednesday morning.

Model is already registered in Model Registry as llm7b/v1 with ModelApprovalStatus=Approved and SamplePayloadUrl=s3://llm7b-assets/samples.tar.gz.
create_inference_recommendations_job API call returns JobArn in ~1 second. Job enters IN_PROGRESS state.
In the background, SageMaker launches endpoints on each candidate instance type in parallel. Each runs for ~8-12 minutes with increasing traffic from 1 to roughly the limit that either stops throughput scaling or crosses the latency threshold.
Team monitors DescribeInferenceRecommendationsJob via polling or the SageMaker Studio console. Progress updates every few minutes.
At ~50 minutes, job enters COMPLETED state. DescribeInferenceRecommendationsJob returns an InferenceRecommendations array: each entry reports the instance type, the environment parameters, and Metrics: {CostPerHour, CostPerInference, MaxInvocations, ModelLatency, CpuUtilization, MemoryUtilization, GpuUtilization}.
Team sorts the results by CostPerInference ascending among entries meeting the latency threshold. g5.2xlarge at $0.00018/inference wins; g5.4xlarge is next at $0.00024.
Team runs a Load Test against the shortlist with JobType=Advanced. This one takes ~2.5 hours. Returns capacity curves.
Team picks g5.2xlarge × 2, writes the CloudFormation / CDK deployment, and ships the endpoint in production by end of week.
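The polling and ranking steps above can be sketched in a few lines. The entry layout used here (instance type under `EndpointConfiguration`, latency in milliseconds under `Metrics`) is an assumption about the response shape, and the sample entries mix figures quoted in this post with invented ones purely to exercise the ranking.

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(sagemaker_client, job_name, poll_seconds=60):
    """Poll DescribeInferenceRecommendationsJob until the job reaches a terminal state."""
    while True:
        response = sagemaker_client.describe_inference_recommendations_job(JobName=job_name)
        if response["Status"] in TERMINAL_STATES:
            return response
        time.sleep(poll_seconds)


def rank_by_cost(recommendations, max_latency_ms):
    """Keep entries under the latency threshold, cheapest per inference first."""
    passing = [r for r in recommendations if r["Metrics"]["ModelLatency"] < max_latency_ms]
    return sorted(passing, key=lambda r: r["Metrics"]["CostPerInference"])


# ILLUSTRATIVE entries; the g5.4xlarge latency and g4dn.2xlarge cost are invented.
sample = [
    {"EndpointConfiguration": {"InstanceType": "ml.g5.4xlarge"},
     "Metrics": {"CostPerInference": 0.00024, "ModelLatency": 1100}},
    {"EndpointConfiguration": {"InstanceType": "ml.g4dn.2xlarge"},
     "Metrics": {"CostPerInference": 0.00015, "ModelLatency": 3200}},
    {"EndpointConfiguration": {"InstanceType": "ml.g5.2xlarge"},
     "Metrics": {"CostPerInference": 0.00018, "ModelLatency": 1340}},
]
winners = [r["EndpointConfiguration"]["InstanceType"] for r in rank_by_cost(sample, 2000)]
print(winners)
```

Filtering before sorting matters: the cheapest entry in the sample is also the one that misses the latency target, so sorting alone would put the wrong instance on top.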

Cost of the benchmarking phase: 8 instance types × ~15 minutes each + 2 instance types × ~2.5 hours each ≈ $20 in instance-hours plus $5 in associated services. Against a 4.5× monthly cost delta, the $25 is rounding error.
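The "rounding error" claim can be checked with the hourly prices quoted earlier. This sketch assumes a 720-hour month and the same two-instance footprint on both instance types; a different hours-per-month convention shifts the totals slightly.

```python
HOURS_PER_MONTH = 720  # ASSUMPTION: 30-day month


def monthly_cost(price_per_hour: float, instances: int, hours: int = HOURS_PER_MONTH) -> float:
    """On-demand cost of running a fixed number of instances all month."""
    return price_per_hour * instances * hours


def payback_hours(benchmark_spend: float, oversized_price: float,
                  right_sized_price: float, instances: int) -> float:
    """Hours of production running after which the benchmark has paid for itself."""
    hourly_saving = (oversized_price - right_sized_price) * instances
    return benchmark_spend / hourly_saving


right_sized = monthly_cost(1.50, instances=2)
oversized = monthly_cost(7.00, instances=2)
payback = payback_hours(25, oversized_price=7.00, right_sized_price=1.50, instances=2)

print(f"right-sized: ${right_sized:,.0f}/month, oversized: ${oversized:,.0f}/month")
print(f"benchmark pays for itself after {payback:.1f} hours of production traffic")
```

On these numbers the $25 of benchmarking is recovered within the first few hours of running the right-sized endpoint.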

What’s worth remembering

Inference Recommender is a managed benchmark. It really deploys the model on real instances with real payloads and measures them. It doesn’t predict; it runs the experiment for us.
Two job types, two levels of confidence. Instance Recommendation (~45 min, 5-10 candidates) for the first cut. Load Test (several hours, 1-3 candidates) for the production decision. Run them in that order.
Payload authenticity drives result quality. Put representative samples, with a variety of sizes, shapes, and contents, behind the SamplePayloadUrl. A single canned payload underestimates production variance.
Constraints narrow the search. SupportedInstanceTypes, ModelLatencyThresholds, MaxInvocations focus the job on what we care about. Without constraints, the default is to benchmark the AWS-suggested candidate set.
Results include cost per inference. Not just latency and throughput – CostPerInference lets the team rank by the economics directly.
Rejected results matter too. Instances that OOM, miss the SLO, or have incompatible container requirements are labelled explicitly. Don’t overlook the explanations; they tell us why the cheap options didn’t qualify.
Revisit as traffic grows. The correct instance at 80 rpm might not be correct at 800 rpm. Re-run Inference Recommender periodically; it’s cheap.
Not a substitute for model engineering. Quantisation, distillation, and model-size optimisation can change the sizing equation entirely (a quantised 4-bit 7B fits on much smaller GPUs). Recommender benchmarks the model as given; use it after engineering decisions, not instead of them.

Sizing inference endpoints is where money leaks out of ML budgets. Defaulting to “whatever we trained on” or “the biggest we can afford” leaves most of the decision to intuition, and intuition about GPU utilisation is usually wrong. Inference Recommender replaces that intuition with a measured answer, on our model, with our payload, against the candidate instance types that actually fit the problem. The test takes an hour; the saving often runs for years.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.