The situation
A platform team is picking a serving path for a new internal service. The model has been chosen. Llama 3.3 70B Instruct, based on evaluation results from the research team. Traffic projection: starts at ~5,000 requests per day, growing to ~50,000 per day over six months. Request shape: average 2,000 input TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. , 400 output tokens. Latency target: p95 under 4 seconds. Budget: flexible but accountable, the team is expected to defend the choice against cheaper options at quarterly review.
Two serving paths are on the table. SageMaker JumpStart offers a one-click deployment of Llama 3.3 70B onto a SageMaker real-time endpoint running on GPU instances, g5, g6, p4d, or p5 family. Bedrock offers the same Llama 3.3 70B in its foundation-model catalog, accessed via Converse with per-token billing and no infrastructure.
Additional context: the team has three other Bedrock-hosted models in production already (Claude for conversational, Titan for embeddings, Nova Micro for classification), so Bedrock ergonomics are familiar. They do not run any SageMaker endpoints today. They do run EKS for other workloads, so ops maturity exists, but not for real-time ML serving specifically.
What actually matters
The decision is about where the line falls between “we run the model” and “AWS runs the model.”
The first decision is pricing model. Bedrock charges per token. SageMaker JumpStart charges by instance-hour, the endpoint runs on a specific GPU instance and bills whether or not traffic is hitting it. The break-even depends on traffic volume: low throughput favours per-token (pay only for what you use); high sustained throughput favours instance-hour (amortise the box).
The second is operational surface. Bedrock: none. Call the API, done. SageMaker endpoint: health checks, autoscaling policies, deployment pipelines, instance-type tuning, monitoring for memory/GPU utilisation, endpoint version management. Not crushing overhead, but real.
The third is latency and capacity control. A dedicated SageMaker endpoint has consistent latency, no shared-tenancy queueing, and we control its scaling policies. Bedrock’s shared infrastructure has variable latency depending on global load; Provisioned Throughput addresses this at a cost but is a separate commitment.
The fourth is customisation. A SageMaker endpoint can host a fine-tuned Llama, a quantized Llama, a Llama with speculative-decoding enabled, or a custom InferenceRunning a trained model to produce output – as opposed to training it. container running vLLM with specific flags. Bedrock’s hosted Llama is as AWS configured it, no knobs. For a team who wants vanilla Llama 3.3 70B, this doesn’t matter; for a team who wants to run AWQ-quantized weights or an LMI config with paged attention, it does.
The fifth is region availability. Bedrock foundation-model availability varies by region. SageMaker can host a model anywhere SageMaker runs. For a global service needing presence in 10 regions, this can tip the scales.
The sixth is compliance and data isolation. Bedrock’s foundation models run in AWS’s shared infrastructure with strict data-handling policies (no data for training, encrypted in transit). SageMaker endpoints run in the customer’s VPC if configured so, which satisfies some compliance regimes that want data to never leave a known boundary.
And a softer one: the team’s operational appetite. A team that likes running infrastructure and wants the control will pick SageMaker; a team that wants to ship features and let AWS worry about the GPU fleet will pick Bedrock.
What we’ll filter on
- Cost at projected throughput, what’s the monthly bill at 5k/day and at 50k/day?
- Operational surface, what do we run, tune, and monitor?
- Latency profile, p50, p95, p99 at expected load?
- Customisation, can we run the model with the flags we want?
- Integration with the rest of the stack, same SDK, same IAM, same CloudWatch story?
The serving-path landscape
-
Bedrock (Llama 3.3 70B, on-demand). Per-token pricing. No infrastructure. Same
ConverseAPI as other Bedrock models. IAM, CloudTrail, CloudWatch metrics built in. Regional availability varies but is broad for flagship models. Shared tenancy; latency generally good but variable under peak load. -
Bedrock (Llama 3.3 70B, Provisioned Throughput). Same model, dedicated capacity, predictable latency, monthly commitment. Fits steady high-throughput workloads; wrong for bursty traffic.
-
SageMaker JumpStart deployment. Choose instance type (typically ml.p4d.24xlarge or ml.g5.48xlarge for 70B at FP16, ml.g5.12xlarge for quantized), click deploy, get an endpoint URL. JumpStart packages optimised inference containers (LMI with DeepSpeed, TensorRT-LLM, vLLM) and reasonable defaults. Billed by instance-hour. Endpoint scales via autoscaling policies configured by us.
-
SageMaker JumpStart with a custom container. JumpStart as the starting point, custom inference container replacing the default. Maximum control over inference runtime (QuantisationStoring model weights at lower precision (8 bits, 4 bits, sometimes fewer) so the model is smaller and faster to run. , batching strategy, attention algorithm). Maximum ops overhead.
-
Self-hosted on EKS with vLLM/TGI. GPU nodes in EKS, a vLLM Deployment serving Llama, an internal LoadBalancer. Most flexible; highest ops cost. Right for teams with existing Kubernetes ML-serving maturity.
-
Bedrock Custom Model Import. For teams with their own fine-tuned weights wanting Bedrock’s API surface. Here we have vanilla Llama 3.3 70B, so import doesn’t add value over the hosted version.
Side by side
| Option | Cost at 5k/day | Cost at 50k/day | Ops surface | Latency | Customisation |
|---|---|---|---|---|---|
| Bedrock on-demand | Low | Scales linearly | None | Variable | None |
| Bedrock PT | High floor | Flat regardless | None | Predictable | None |
| SageMaker JumpStart | High floor (endpoint-hours) | Flat | Moderate | Predictable | Some |
| JumpStart + custom container | Same floor | Flat | High | Predictable | Total |
| Self-hosted EKS + vLLM | Variable | Flat | Heavy | Ours to tune | Total |
At 5k requests/day × 2,400 tokens average = 12M tokens/day, ~360M tokens/month. Rough comparison using Llama pricing at the time:
Bedrock on-demand (Llama 3.3 70B):
360M tokens × blended rate ≈ $1,800/month
SageMaker JumpStart (ml.p4d.24xlarge):
$37.70/hour × 730 hours ≈ $27,500/month
SageMaker JumpStart (ml.g5.48xlarge, FP16):
$19/hour × 730 hours ≈ $13,870/month
SageMaker JumpStart (ml.g5.12xlarge, AWQ int4):
$5.67/hour × 730 hours ≈ $4,140/month
At 50k/day = 120M tokens/day ≈ 3.6B/month, Bedrock on-demand climbs to ~$18,000; the endpoint costs stay flat. The break-even shifts; the endpoint becomes competitive at the higher end, especially with quantization.
Cost vs throughput, plotted
The pick in depth
Start on Bedrock. Revisit at 30k requests/day.
At the starting volume (5k/day), Bedrock costs ~$1,800/month and has no operational cost. The cheapest SageMaker option (quantized on g5.12xlarge) would be ~$4,140/month plus the engineering cost of setting up, monitoring, and scaling the endpoint. Bedrock wins both axes at this volume.
At the target volume (50k/day), the arithmetic changes. Bedrock climbs to ~$18,000/month. A quantized g5.48xlarge endpoint holds at ~$13,870 and can serve the traffic with headroom. The endpoint now wins on cost by $4k/month, call it $50k/year, enough to justify the operational work. At that point the migration is worth doing.
The migration plan. Don’t pre-optimise. Ship on Bedrock. Track usage weekly. When daily requests pass a threshold (call it 30k/day, where Bedrock lands at ~$11,000/month and the g5.48xlarge is at ~$13,870, still Bedrock, but visibly closing), stand up the SageMaker endpoint in a staging environment, evaluate latency and throughput, plan the cutover. Use the months between 5k and 30k to build whatever endpoint tooling the team doesn’t have yet.
Latency considerations. Bedrock latency is “fine but variable.” A well-tuned g5.48xlarge endpoint serving Llama 3.3 70B with vLLM’s continuous batching handles ~40 concurrent requests with p95 ~3 seconds consistently. Bedrock typically does the same at low load but degrades under global peak; if the p95 SLA is strict and the service is latency-sensitive, SageMaker’s predictability might justify earlier migration.
Regional presence. If Bedrock doesn’t have Llama in the regions we need, JumpStart wins by default. Check regional availability before the analysis; sometimes the question is decided before the cost math runs.
Team readiness. The team hasn’t run SageMaker endpoints before. The first real-time GPU endpoint is a learning curve: autoscaling policies, instance-type tuning, monitoring GPU utilisation vs CPU memory, handling deployment rollouts. None of this is hard, but all of it is new. Running Bedrock for the first six months while the endpoint skills bake on the side is a sensible ramp.
When to pick SageMaker from day one. Three scenarios flip the default: (1) the model has to run in a region without Bedrock, (2) the model needs customisation (quantization, Fine-tuningContinuing to train an already-trained model on a smaller dataset to adapt its behaviour. , custom inference flags) that Bedrock doesn’t expose, (3) compliance requires the inference to run inside a VPC with no AWS-managed shared tenancy. Any of those, pick SageMaker. None of them, pick Bedrock until the cost forces the issue.
A worked example: the first quarter
Week 1: service launches on Bedrock. Llama 3.3 70B via Converse, same SDK pattern as the other Bedrock services. ~200 requests/day during internal testing; $100/month bill.
Weeks 2-8: traffic grows to ~4,000 requests/day as teams adopt the service. Bedrock spend ~$1,400/month. CloudWatch metrics tracking p95 latency (holding at 2.8s), invocation count, throttle rate (zero). Monthly review: on track, no migration planned.
Weeks 9-12: traffic hits ~8,000 requests/day; integration with a customer-facing product kicks in. Bedrock spend ~$2,800/month. Cost-vs-SageMaker comparison shows Bedrock still ahead by roughly $1,400/month versus quantized endpoint. No migration.
Quarter-end review: spend projection to end of next quarter based on current growth. Team presents the cost curve, the 30k/day tripwire, the operational investment planned for migration. Finance and product align on “keep on Bedrock for now, build endpoint skills in Q3, migrate in Q4 if growth holds.”
The decision is defensible, reversible, and produced by numbers rather than preference.
What’s worth remembering
- Bedrock and JumpStart serve the same model through different operational surfaces. Bedrock trades per-token for zero ops; JumpStart trades endpoint-hours for control.
- Break-even moves with quantization. Quantized Llama 3.3 70B on g5.12xl breaks even with Bedrock at ~12k/day; full-precision on p4d.24xl needs ~76k/day to break even.
- Start on Bedrock at low volume. Pay per token, no ops, and migrate when the bill crosses the line.
- Latency predictability is a real SageMaker advantage at scale. A dedicated endpoint avoids shared-tenancy queueing; PT gets you there on Bedrock at cost.
- Customisation is the SageMaker lock-in. Quantization, custom inference runtime, fine-tuned weights, all require JumpStart (or Custom Model Import back into Bedrock).
- Regional availability can decide it. Check Bedrock’s model availability in the target regions before the cost analysis.
- Ops readiness matters. A team without real-time ML-serving experience pays a hidden cost on JumpStart that doesn’t show up in the instance-hour calculation.
- Keep the migration reversible. The application code should abstract the serving layer so switching between Bedrock and SageMaker is a config change, not a rewrite.
Two doors, same model, different rooms behind them. The correct door depends on how much infrastructure the team wants to live with, at what point in the traffic curve the math tips, and whether anything forces the choice before the math does. Start where the bill is lowest; migrate when it isn’t.