The situation
Clinical research has been fine-tuning Llama 3.1 8B on de-identified medical-notes data for the past quarter. The fine-tune is a LoRAA fine-tuning technique that trains a small low-rank matrix on top of the frozen base model, instead of updating every parameter. adapter merged back into the base weights, trained on 40,000 labelled examples with human-preference signals. The research team’s evaluation shows the fine-tuned model outperforms Claude Sonnet 4.5 on their specific summarisation task by a noticeable margin on their internal rubric, unsurprising, because the TrainingThe process of fitting a model’s weights to data by minimising a loss function. data is the target distribution.
Training happened on SageMaker training jobs. The weights, roughly 16 GB of safetensors, are in an S3 bucket. Now production wants to serve them. The ask: Bedrock’s API surface (the same Converse calls the rest of the stack already uses), the same IAM and VPC posture as the other Bedrock traffic, the same CloudWatch metrics, no SageMaker endpoint for ops to manage, and a predictable bill.
Three question on the table. First, can Bedrock actually serve these weights, or does the base architecture disqualify them? Second, what’s the throughput and cost model, does it match on-demand foundation models or behave differently? Third, what’s the operational surface for deployment, versioning, and retirement?
What actually matters
The core trade with serving custom weights is ergonomics for inflexibility. At one end, a managed-catalog foundation model is ready to go: call the API, pay per TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. , done. At the other end, a self-hosted fine-tune is everything configurable, the instance type, the scaling policy, the InferenceRunning a trained model to produce output – as opposed to training it. code, the container, and everything is the team’s problem. A managed-import path sits in the middle: the platform serves the model, the team brings the weights.
The first thing to ask is what architectures the path supports. Managed-import surfaces only accept weights from a known list of base architectures, in a known format. If the research team has fine-tuned something on that list, the path is open; if they’ve built a novel architecture, it isn’t, and the answer drifts toward self-hosting or a fully managed endpoint.
The second is throughput and cost model. A pay-per-token foundation model and a dedicated-capacity model behave very differently as utilisation changes. Pay-per-token is cheap when traffic is sporadic and expensive when traffic is heavy and constant. Dedicated capacity is cheap per-token at high utilisation and expensive per-token at low utilisation, because the bill ticks regardless of how many calls land on it. Whichever path the workload chooses, the shape of the bill follows from that choice.
The third is cold start and scaling. A hosted model that’s been idle has to be brought back online before the next call returns; that’s measurable seconds of latency. Whether that matters depends on whether the workload is interactive or batch, and whether the scaling unit is a request or a slab of capacity.
The fourth is versioning and deployment. Every new weight set is a new model identity somewhere, a new endpoint, a new model ARN, a new container tag. Rolling from v1 to v2 is at minimum a caller-config change; rollback is the same operation in reverse. Whatever the path, the trick is making that flip cheap and fast.
The fifth is operational surface compared to alternatives. Self-hosting gives full control and full operational responsibility. A managed import path gives the cloud provider the hosting in exchange for less control over inference internals. For a team that wants to ship a fine-tune without standing up GPU ops, that trade is usually worth taking; for a team with existing GPU-ops muscle and unusual requirements, the calculus flips.
The sixth is compliance fit. Medical notes: PHI, HIPAA, audit, the works. Whichever path is chosen, the data-handling story has to carry over, no training on inference data, no inference logging outside the account, private network egress, full audit trail.
What we’ll filter on
- Base-model support, does this path accept the architecture in use?
- Operational surface, what are we running vs what AWS runs?
- Cost shape, per-token, per-hour, per-CMU-minute?
- Latency and cold-start, first-call and steady-state?
- Version and rollback, how fast from weights-in-S3 to traffic flowing?
The custom-model serving landscape
-
Bedrock Custom Model Import. Upload weights to S3; create an imported model in Bedrock; call it with
InvokeModelorConverseusing the model ARN. Bedrock handles the hosting, scaling across CMUs, and the API surface. Supported bases include Llama family, Mistral, Mixtral, and others as the list grows. Per-CMU-minute billing with a minimum. Tight fit with the rest of the Bedrock stack, same IAM, same CloudWatch, same VPC endpoints. -
SageMaker real-time endpoint. Deploy the model behind a SageMaker endpoint on a chosen instance type (ml.g5, ml.g6, ml.p4d/p5 for larger models). Full control over the inference container, TorchServe or Triton or LMI. Scaling via SageMaker’s autoscaling policies. Billed by instance-hours. Requires endpoint ops, health checks, deployment pipelines, scaling policies, version alias management.
-
SageMaker serverless inference. Pay per invocation with automatic scaling to zero. Cold starts can be seconds-to-minutes for large models; concurrency limits apply. Attractive for low-traffic fine-tunes; impractical for the medical-notes workload if it runs continuously.
-
SageMaker JumpStart pre-trained. If the task can be done with a JumpStart model instead of a bespoke fine-tune, it cuts out the training step. Not applicable here, the training data is the point.
-
Self-hosted on EKS/EC2 with vLLM or TGI. The team’s own GPU cluster running vLLM or Text Generation Inference, exposed via an internal endpoint. Maximum control; maximum operational cost. Correct for teams with GPU-ops maturity and workloads big enough to justify dedicated hardware.
-
Bedrock fine-tuning on a foundation model. Bedrock natively supports fine-tuning some foundation models (Claude, Nova, Titan) and serving the result. If the team could have fine-tuned a Bedrock-native model instead of Llama, this path is simpler end-to-end. Orthogonal to importing weights trained elsewhere.
Side by side
| Option | Base support | Ops surface | Cost shape | Latency | Version / rollback |
|---|---|---|---|---|---|
| Bedrock Custom Model Import | Supported list | Minimal | Per-CMU-minute | Warm: normal; cold: seconds | Import → caller config flip |
| SageMaker real-time endpoint | Anything | Heavy | Instance-hours | Warm: low; cold: controllable | Endpoint blue/green |
| SageMaker serverless inference | Anything | Light | Per-invocation | Cold start variable | Endpoint update |
| JumpStart | Catalog-limited | Light | Varies | Varies | JumpStart update |
| Self-hosted EKS + vLLM | Anything | Heaviest | Compute-hours | Ours to tune | Our deployment |
| Bedrock fine-tuning (native) | Bedrock-native only | Minimal | Per-token | Native | Model version flip |
For the medical-notes team, with a Llama 3.1 8B fine-tune in S3, Bedrock Custom Model Import is the clean answer: the architecture is supported, the operational surface is minimal, and the API aligns with the rest of the Bedrock stack. The catch is the CMU pricing model, predictable but not free when idle.
The import and serving flow
The pick in depth
Pre-import checklist. The weights need to be in a supported format (safetensors preferred) and a supported base architecture (Llama 3.1 is on the list). The tokenizer, config.json, and any generation-config files need to travel with the weights. Bedrock reads them to configure serving. The S3 prefix needs to be KMS-encrypted with a key the import-job role can decrypt. Region matters: the import job runs in a specific region, and the model is only available for invocation in that region until re-imported elsewhere.
CreateModelImportJob. A single API call kicks off the import. Parameters: jobName, importedModelName, roleArn (Bedrock’s role in the account, needs S3 read on the weights bucket and KMS decrypt), modelDataSource (S3 URI), and baseModelName (e.g., llama-3.1-8b). The job runs async. For an 8B-parameter model, import takes 20-30 minutes; for 70B models, hours.
What you get back. An ImportedModelArn. That ARN is the modelId passed to InvokeModel or Converse. IAM grants bedrock:InvokeModel on that ARN to whichever principals call it. CloudWatch metrics start accumulating on first invocation.
Pricing shape. Custom model import is billed by custom model units (CMUs) active per minute. A 5-minute minimum billable duration per active period and a No-Commitment model mean that a model invoked once an hour still pays for chunks of CMU time even when not serving. The economics favour steady, high-throughput workloads: 10k inferences an hour across 8 hours a day fills CMUs efficiently; 100 inferences a day across 24 hours fills them poorly. For the medical-notes workload (predictable daily batch of ~30k summaries), the CMU utilisation is high during business hours and drops overnight. Plan around that.
Cold starts. After a period of no traffic, the CMU spins down. The next invocation warms it, measurable seconds of latency. For interactive flows, keep a “warming” heartbeat: a tiny invocation every few minutes to keep at least one CMU warm. For batch flows, cold start doesn’t matter.
Versioning. Every new weight set = new import = new ImportedModelArn. The application uses a config entry (or SSM Parameter, or Prompt Management if we’ve put prompts in there) that names the current model ARN. Rollout is updating that entry; rollback is pointing it back. Old imports can be left registered (they cost nothing unless invoked) or deleted with DeleteImportedModel.
Comparison to SageMaker endpoint. The same 8B model on a SageMaker endpoint would need an ml.g5.12xlarge or similar, running 24/7 at ~$7/hour, ~$5k/month. Bedrock Custom Model Import’s CMU pricing, at similar throughput, lands in a comparable range but with AWS managing the instances, health checks, autoscaling, and deployment pipeline. The saving isn’t per-token; it’s in the ops not done.
A worked example: medical-notes batch, one day of operation
Morning cold start. 09:00, batch job kicks off with 30,000 medical notes to summarise. First request takes 8 seconds (CMU warming from idle overnight); subsequent requests land at 1.2s median for 250-token prompts and 60-token responses. The batch runs over ~45 minutes, with Bedrock auto-scaling to 4 CMUs concurrently at peak.
Afternoon trickle. Interactive use through a research notebook: ~200 requests over 6 hours. CMU stays warm (at least one CMU active throughout), serving at 1.2s median per request.
Overnight idle. 19:00 to 08:00 next morning: no traffic. CMUs spin down. Bill drops to the minimum until next traffic.
Daily totals. ~30,200 invocations; ~18 hours of active CMU time. CMU-minute charges compute the bill. Total engineering time for the day: zero, no endpoints to patch, no scaling policies to tune. The research team’s focus stays on the next fine-tune instead of the infrastructure of the last one.
Version update in the afternoon. Research team finishes a new fine-tune at 14:00. They kick off CreateModelImportJob; it completes at 14:35. Their evaluation suite runs against the new ARN for an hour. At 15:45, staging traffic routes to the new ARN via the config flip; the production flip comes the next morning after overnight validation. Rollback path: flip the config back. Average total time from “new weights” to “production traffic”: half a day, most of which is the eval.
What’s worth remembering
- Custom Model Import is for supported architectures only. Llama, Mistral, Mixtral, Flan, Gemma, Qwen at time of writing. Novel architectures need SageMaker endpoints.
- The cost model is per-CMU-minute, not per-token. Steady high-utilisation workloads map well; sporadic workloads pay for idle CMU time within minimum billable windows.
- Ops surface is near-zero compared to SageMaker endpoints. AWS manages the hosting, scaling, and availability. We manage the weights and the caller config.
- Cold starts exist. Measurable seconds on the first call after idle. Warm with a heartbeat if interactive latency matters; ignore if batch.
- Versioning is via new import + new ARN. Rollback is a config flip. Keep the previous ARN registered until you’re confident in the new one.
- Evaluation happens before and after import. Before, on the weights themselves in a SageMaker training/evaluation job. After, on the imported ARN to confirm no regression from the conversion step.
- Bedrock’s data handling applies. Imported models inherit Bedrock’s no-data-for-training policy, VPC endpoints, IAM, and CloudTrail. HIPAA workloads are supported the same way.
- When Bedrock fine-tuning is an option, prefer it. Fine-tuning a Bedrock-native model (Claude, Nova, Titan) through Bedrock’s own fine-tuning removes the import step entirely. Custom Model Import is the correct answer when the base must come from outside.
Weights trained elsewhere, served through Bedrock’s front door, with the same API, IAM, observability, and audit story as the foundation models sitting next to them. The research team ships their fine-tune; ops doesn’t get a new endpoint to care for; the application doesn’t learn a new SDK. That’s the whole point of the path.