Exam Room · Advanced GenAI

SageMaker JumpStart or Bedrock for the Same Model

July 27, 2026 · 13 min read

The situation

A platform team is picking a serving path for a new internal service. The model has been chosen. Llama 3.3 70B Instruct, based on evaluation results from the research team. Traffic projection: starts at ~5,000 requests per day, growing to ~50,000 per day over six months. Request shape: average 2,000 input tokensTokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. , 400 output tokens. Latency target: p95 under 4 seconds. Budget: flexible but accountable, the team is expected to defend the choice against cheaper options at quarterly review.

Two serving paths are on the table. SageMaker JumpStart offers a one-click deployment of Llama 3.3 70B onto a SageMaker real-time endpoint running on GPU instances, g5, g6, p4d, or p5 family. Bedrock offers the same Llama 3.3 70B in its foundation-model catalog, accessed via Converse with per-token billing and no infrastructure.

Additional context: the team has three other Bedrock-hosted models in production already (Claude for conversational, Titan for embeddings, Nova Micro for classification), so Bedrock ergonomics are familiar. They do not run any SageMaker endpoints today. They do run EKS for other workloads, so ops maturity exists, but not for real-time ML serving specifically.

What actually matters

The decision is about where the line falls between “we run the model” and “AWS runs the model.”

The first decision is pricing model. Bedrock charges per token. SageMaker JumpStart charges by instance-hour, the endpoint runs on a specific GPU instance and bills whether or not traffic is hitting it. The break-even depends on traffic volume: low throughput favours per-token (pay only for what you use); high sustained throughput favours instance-hour (amortise the box).

The second is operational surface. Bedrock: none. Call the API, done. SageMaker endpoint: health checks, autoscaling policies, deployment pipelines, instance-type tuning, monitoring for memory/GPU utilisation, endpoint version management. Not crushing overhead, but real.

The third is latency and capacity control. A dedicated SageMaker endpoint has consistent latency, no shared-tenancy queueing, and we control its scaling policies. Bedrock’s shared infrastructure has variable latency depending on global load, and there is no buying your way out of it for this model: Provisioned ThroughputProvisioned ThroughputReserved Bedrock capacity bought by the hour for a fixed term, paid for whether traffic fills it or not. covers Amazon’s first-party models (Nova, Titan), not the open-weight catalog. If latency predictability becomes a hard requirement, the answer lives on the SageMaker side.

The fourth is customisation. A SageMaker endpoint can host a fine-tuned Llama, a quantised Llama, a Llama with speculative-decoding enabled, or a custom inferenceInferenceRunning a trained model to produce output – as opposed to training it. container running vLLM with specific flags. Bedrock’s hosted Llama is as AWS configured it, no knobs. For a team who wants vanilla Llama 3.3 70B, this doesn’t matter; for a team who wants to run AWQ-quantised weights or an LMI config with paged attention, it does.

The fifth is region availability. Bedrock foundation-model availability varies by region. SageMaker can host a model anywhere SageMaker runs. For a global service needing presence in 10 regions, this can tip the scales.

The sixth is compliance and data isolation. Bedrock’s foundation models run in AWS’s shared infrastructure with strict data-handling policies (no data for training, encrypted in transit). SageMaker endpoints run in the customer’s VPC if configured so, which satisfies some compliance regimes that want data to never leave a known boundary.

And a softer one: the team’s operational appetite. A team that likes running infrastructure and wants the control will pick SageMaker; a team that wants to ship features and let AWS worry about the GPU fleet will pick Bedrock.

What we’ll filter on

Cost at projected throughput, what’s the monthly bill at 5k/day and at 50k/day?
Operational surface, what do we run, tune, and monitor?
Latency profile, p50, p95, p99 at expected load?
Customisation, can we run the model with the flags we want?
Integration with the rest of the stack, same SDK, same IAM, same CloudWatch story?

The serving-path landscape

Bedrock (Llama 3.3 70B, on-demand). Per-token pricing. No infrastructure. Same Converse API as other Bedrock models. IAM, CloudTrail, CloudWatch metrics built in. Regional availability varies but is broad for flagship models. Shared tenancy; latency generally good but variable under peak load.
Bedrock (Llama 3.3 70B, batch inference). Same model at half the on-demand rate, results in hours rather than seconds. Only fits an offline share of the workload; nothing here meets a 4-second p95. It matters mostly as a boundary marker: Provisioned Throughput doesn’t cover Llama, so batch is Bedrock’s only alternative to on-demand for this model.
SageMaker JumpStart deployment. Choose instance type (typically ml.p4d.24xlarge or ml.g5.48xlarge for 70B at FP16, ml.g5.12xlarge for quantised), click deploy, get an endpoint URL. JumpStart packages optimised inference containers (LMI with DeepSpeed, TensorRT-LLM, vLLM) and reasonable defaults. Billed by instance-hour. Endpoint scales via autoscaling policies configured by us.
SageMaker JumpStart with a custom container. JumpStart as the starting point, custom inference container replacing the default. Maximum control over inference runtime (quantisationQuantisationStoring model weights at lower precision (8 bits, 4 bits, sometimes fewer) so the model is smaller and faster to run. , batching strategy, attention algorithm). Maximum ops overhead.
Self-hosted on EKS with vLLM/TGI. GPU nodes in EKS, a vLLM Deployment serving Llama, an internal LoadBalancer. Most flexible; highest ops cost. Suits teams with existing Kubernetes ML-serving maturity.
Bedrock Custom Model Import. For teams with their own fine-tuned or modified weights: import them and get Bedrock’s managed API surface back, billed per Custom Model Unit per minute of active use rather than per token, scaling to zero when idle. Here the weights are vanilla Llama 3.3 70B, which Bedrock already hosts at a per-token rate an import can’t beat, so it doesn’t apply.

Side by side

Option	Cost at 5k/day	Cost at 50k/day	Ops surface	Latency	Customisation
Bedrock on-demand	Low	Scales linearly	None	Variable	None
Bedrock batch	Lowest per token	Scales linearly	None	Hours, not seconds	None
SageMaker JumpStart	High floor (endpoint-hours)	Flat	Moderate	Predictable	Some
JumpStart + custom container	Same floor	Flat	High	Predictable	Total
Self-hosted EKS + vLLM	Variable	Flat	Heavy	Ours to tune	Total

At 5k requests/day × 2,400 tokens average = 12M tokens/day, ~360M tokens/month. Rough comparison using Llama pricing at the time:

Bedrock on-demand (Llama 3.3 70B):
  360M tokens × $0.72/M (input and output)       ≈ $260/month

SageMaker JumpStart (ml.p4d.24xlarge):
  $37.70/hour × 730 hours                        ≈ $27,500/month

SageMaker JumpStart (ml.g5.48xlarge, FP16):
  $19/hour × 730 hours                           ≈ $13,870/month

SageMaker JumpStart (ml.g5.12xlarge, AWQ int4):
  $5.67/hour × 730 hours                         ≈ $4,140/month

At 50k/day = 120M tokens/day ≈ 3.6B/month, Bedrock on-demand climbs to ~$2,600; the endpoint costs stay flat. Even at the target volume, Bedrock undercuts the cheapest endpoint; the break-even against the quantised g5.12xlarge sits around 80k requests/day at this request shape, and the bigger instances don’t catch up until volumes several times that.

Cost vs throughput, plotted

Bedrock wins across most of the charted range; the quantised endpoint only catches up at sustained volumes around 80k requests/day. The crossover moves with instance choice, quantisation, and request shape.

The pick in depth

Start on Bedrock. Revisit if daily volume heads toward 60-70k requests/day.

At the starting volume (5k/day), Bedrock costs ~$260/month and has no operational cost. The cheapest SageMaker option (quantised on g5.12xlarge) would be ~$4,140/month plus the engineering cost of setting up, monitoring, and scaling the endpoint. Bedrock wins both axes by an order of magnitude.

At the target volume (50k/day), the arithmetic barely changes. Bedrock climbs to ~$2,600/month, still below the cheapest endpoint’s flat $4,140 floor. The break-even against the quantised g5.12xlarge sits around 80k requests/day at this request shape; the full-precision instances don’t catch up until volumes several times that. On pure cost, Bedrock holds the lead well past the six-month projection.

The migration plan. Don’t pre-optimise. Ship on Bedrock. Track usage weekly. If daily requests pass a tripwire (call it 60k/day, where Bedrock lands at ~$3,100/month and the quantised endpoint’s $4,140 is visibly within reach), stand up the SageMaker endpoint in a staging environment, evaluate latency and throughput, plan the cutover. Use the months in between to build whatever endpoint tooling the team doesn’t have yet, because by then the migration case may rest on latency control rather than dollars.

Latency considerations. Bedrock latency is “fine but variable.” A well-tuned g5.48xlarge endpoint serving Llama 3.3 70B with vLLM’s continuous batching handles ~40 concurrent requests with p95 ~3 seconds consistently. Bedrock typically does the same at low load but degrades under global peak, and for Llama there is no dedicated-capacity tier to fall back on within Bedrock. If the p95 SLA is strict and the service is latency-sensitive, SageMaker’s predictability might justify earlier migration; it is the only door to dedicated capacity for this model.

Regional presence. If Bedrock doesn’t have Llama in the regions we need, JumpStart wins by default. Check regional availability before the analysis; sometimes the question is decided before the cost math runs.

Team readiness. The team hasn’t run SageMaker endpoints before. The first real-time GPU endpoint is a learning curve: autoscaling policies, instance-type tuning, monitoring GPU utilisation vs CPU memory, handling deployment rollouts. None of this is hard, but all of it is new. Running Bedrock for the first six months while the endpoint skills bake on the side is a sensible ramp.

When to pick SageMaker from day one. Three scenarios flip the default: (1) the model has to run in a region without Bedrock, (2) the model needs customisation (quantisation, fine-tuneFine-tuningContinuing to train an already-trained model on a smaller dataset to adapt its behaviour. , custom inference flags) that Bedrock doesn’t expose, (3) compliance requires the inference to run inside a VPC with no AWS-managed shared tenancy. Any of those, pick SageMaker. None of them, pick Bedrock until the cost forces the issue.

A worked example: the first quarter

Week 1: service launches on Bedrock. Llama 3.3 70B via Converse, same SDK pattern as the other Bedrock services. ~200 requests/day during internal testing; a ~$10/month bill.

Weeks 2-8: traffic grows to ~4,000 requests/day as teams adopt the service. Bedrock spend ~$200/month. CloudWatch metrics tracking p95 latency (holding at 2.8s), invocation count, throttle rate (zero). Monthly review: on track, no migration planned.

Weeks 9-12: traffic hits ~8,000 requests/day; integration with a customer-facing product kicks in. Bedrock spend ~$400/month. Cost-vs-SageMaker comparison shows Bedrock still ahead by roughly $3,700/month versus the quantised endpoint. No migration.

Quarter-end review: spend projection to end of next quarter based on current growth. Team presents the cost curve, the ~60k/day tripwire, the operational investment planned for migration. Finance and product align on “keep on Bedrock, build endpoint skills on the side, migrate only if growth or latency demands it.”

The decision is defensible, reversible, and produced by numbers rather than preference.

What’s worth remembering

Bedrock and JumpStart serve the same model through different operational surfaces. Bedrock trades per-token for zero ops; JumpStart trades endpoint-hours for control.
Break-even moves with quantisation, and it sits higher than intuition says. Bedrock’s per-token rate for open-weights models is low; even the quantised g5.12xl endpoint needs ~80k requests/day at this request shape to break even, and full-precision instances need several times that.
Start on Bedrock at low volume. Pay per token, no ops, and migrate when the bill crosses the line.
Latency predictability is a real SageMaker advantage at scale. A dedicated endpoint avoids shared-tenancy queueing, and for open-weight models it is the only dedicated-capacity option: Bedrock’s Provisioned Throughput covers Amazon’s first-party models (Nova, Titan), not Llama.
Customisation is the SageMaker lock-in. Quantization, custom inference runtime, fine-tuned weights, all require JumpStart (or Custom Model Import back into Bedrock).
Regional availability can decide it. Check Bedrock’s model availability in the target regions before the cost analysis.
Ops readiness matters. A team without real-time ML-serving experience pays a hidden cost on JumpStart that doesn’t show up in the instance-hour calculation.
Keep the migration reversible. The application code should abstract the serving layer so switching between Bedrock and SageMaker is a config change, not a rewrite.

Two doors, same model, different rooms behind them. The correct door depends on how much infrastructure the team wants to live with, at what point in the traffic curve the math tips, and whether anything forces the choice before the math does. Start where the bill is lowest; migrate when it isn’t.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.