Choosing Bedrock or SageMaker JumpStart for a RAG Chatbot

May 01, 2028 · 16 min read

ML Engineer · MLA-C01 · part of The Exam Room

The situation

A support team handles about 40,000 customer questions a month. Most of them have answers in the product documentation, a corpus of about 12,000 pages in Markdown, updated weekly. Leadership wants a chatbot that draws on the docs, cites the page it used, refuses to guess, and sits behind the existing customer-portal login. Latency target: median response under 4 seconds. Cost target: less than the salary of the two support engineers the bot might let them redeploy.

Two constraints shape the AWS-service choice:

  • The model needs to be good enough for customer-facing output. That’s a frontier-model conversation, not an open-source 7B-parameter conversation. We care about instruction-following, refusal behaviour, and tone.
  • Our documentation is not a base-model training set. Whatever model we pick needs to be steered by our content at InferenceRunning a trained model to produce output – as opposed to training it. time, by RAGA pattern where you retrieve relevant documents at query time and stuff them into the prompt so the model can ground its answer on them. (RAG), rather than rebuilt from scratch.

Later, the team wants to experiment with Fine-tuningContinuing to train an already-trained model on a smaller dataset to adapt its behaviour. a smaller open-weights model on a curated set of historical support transcripts. Same data team, different rhythm. The service that fits the chatbot might be the wrong service for the fine-tune.

What actually matters

The naive framing is “pick a model”. The real framing is “pick what we’re willing to operate”.

A foundation model is a large neural network trained on trillions of TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. at a cost most organisations will never incur. At the access layer, there are two distinct stories for using one on AWS. The first is a managed API: we send a PromptThe input you hand to an LLM – system instructions, user message, examples, retrieved documents, tool descriptions, the lot. , we get a response, AWS runs the GPUs. The weights live on AWS’s servers, never on ours. We pay per token, not per hour. We don’t pick a GPU; we don’t manage a model server; we don’t patch CUDA. The second is a model we deploy ourselves: we pick an instance type, we pick a container, we pay for the GPU whether it’s serving traffic or not, and we get the freedom to do anything the model’s license allows: fine-tune, quantise, run on custom hardware, operate in a disconnected environment.

The difference is not about model capability. It’s about who operates the inference. Managed-API foundation models run on cloud-operated infrastructure and bill per token. Self-deployed models run on endpoints we provision and bill per instance-hour.

That operational split pulls several other decisions with it.

Model catalogue. The managed-API path tends to offer a curated shortlist of frontier-adjacent proprietary models, all accessed via the same invocation API. The self-deploy path has access to a much broader catalogue of open-weights models, but typically no access to the proprietary frontier families. If a specific proprietary family is the requirement, the managed-API path is where it lives.

Customisation depth. Both shapes can do RAG, both can use prompt engineering. For deeper customisation, the story diverges. The managed path supports fine-tuning on eligible base models, producing a custom private model that the platform hosts. The self-deploy path supports full-parameter fine-tuning of open-weights models using ordinary training jobs; we own the output weights, we deploy them to an endpoint we control. Managed customisation is tidier; self-deploy customisation is freer.

Cost shape. Per-token pricing makes small, bursty, or development workloads cheap; a few dollars a month is normal. Heavy sustained traffic flips the comparison: a self-deployed GPU endpoint costs the same per hour whether it’s busy or idle, and at high enough utilisation the per-hour bill beats the per-token one. The managed path also tends to offer a reserved-throughput option for committed high-volume workloads that looks more like the self-deploy bill shape.

Data residency and isolation. Managed inference runs in cloud-operated GPUs; the prompts and responses are subject to the platform’s data protection commitments but we don’t see the process metrics or the host. Self-deployed endpoints run in our account, our VPC, with our security groups; we can put them in a private subnet with no internet egress, which matters for some regulated workloads.

AgentA system that wraps an LLM with tools, memory, and a loop, so it can take multi-step actions toward a goal rather than just answering one prompt. and RAG plumbing. The managed path tends to ship managed RAG (ingestion, chunking, embedding, vector store, retrieval) and managed tool-use orchestration on top of the inference API. The self-deploy path has none of this; the RAG pipeline is ours to build on raw primitives.

GuardrailA filter or rule applied to an LLM’s inputs or outputs to keep it inside safe, legal, or on-brand behaviour. . Managed services typically offer denied topics, PII redaction, word filters, and contextual grounding checks as a feature applied inside the invocation API. For self-hosted models, the equivalent is our code and our responsibility.

What we’ll filter on

Distilling those into filters:

  1. Access to frontier proprietary models (non-open-weights families).
  2. Managed inference: do we operate GPUs or does AWS?
  3. Cost shape: per-token vs per-hour; which fits the expected traffic?
  4. Managed RAG and guardrails: does the platform provide them, or do we build?
  5. Fine-tune freedom: can we train on our own data, and do we own the resulting weights?

The foundation-model landscape on AWS

1. Amazon Bedrock (on-demand). API-style access to a curated catalogue of foundation models. InvokeModel for single-turn, Converse for multi-turn with tool-use, Streaming variants of both. Pricing is per input and output token, model-by-model. No infrastructure to manage. Bedrock Knowledge Bases provides end-to-end RAG (ingestion, chunking, embedding, VectorAn ordered list of numbers – in AI usage, almost always an embedding – and by extension the databases that index them for nearest-neighbour search. , retrieval). Bedrock Agents provides tool-use orchestration. Bedrock Guardrails provides content filtering. Customisation via CreateModelCustomizationJob for fine-tuning or continued pre-training on eligible base models; the result is a custom model Bedrock hosts and bills (via Provisioned Throughput).

2. Amazon Bedrock (Provisioned Throughput). Same models, but with reserved capacity billed per hour of model units. Required for running custom fine-tuned Bedrock models. Right choice when traffic is sustained and predictable, or when we need guaranteed throughput. One-month and six-month terms, roughly analogous to Reserved Instances.

3. SageMaker JumpStart (one-click deploy). A catalogue of pre-trained open-weights models with pre-built deployment scripts. Click “Deploy” in Studio, pick an instance type, and a SageMaker endpoint comes up running the model behind a load balancer. Billed per instance-hour. We own the endpoint, the VPC, the security groups, the autoscaling, the scaling policies. No Bedrock-style RAG; no Bedrock-style guardrails; we get the weights, the rest is ours.

4. SageMaker JumpStart (fine-tune). The same catalogue, but running a SageMaker training job on our dataset first. Full-parameter fine-tuning, parameter-efficient fine-tuning (LoRAA fine-tuning technique that trains a small low-rank matrix on top of the frozen base model, instead of updating every parameter. ), or continued pre-training depending on the model and notebook. Output is a model artifact we can deploy to an endpoint, export, or register in SageMaker Model Registry.

5. Self-hosted on EC2 / EKS. For completeness. Download weights (subject to license), bring up GPU instances, run vLLM / TGI / TensorRT-LLM, operate it all. Maximum freedom, maximum operational weight. Not what a product team asking for a chatbot tomorrow should pick.

Side by side

Option Frontier proprietary models Managed inference Cost shape Managed RAG + guardrails Fine-tune freedom
Bedrock on-demand ✓ (Claude, Nova, etc.) Per-token Limited (eligible models)
Bedrock Provisioned Throughput Per-hour (reserved) Required for custom-model hosting
JumpStart deploy ✗ (we run endpoint) Per-hour (instance) n/a
JumpStart fine-tune Per-hour (training + inference) ✓ (full control)
Self-hosted EC2/EKS Per-hour (raw) ✓ (full control)

Reading the table by job rather than by service:

  • Customer-facing chatbot over our docs, needs frontier quality: Bedrock on-demand, Claude or Nova, Bedrock Knowledge Bases for RAG, Bedrock Guardrails for refusal behaviour.
  • Experimental fine-tune of a 7B model on historical transcripts: JumpStart fine-tune, Llama or Mistral, own endpoint, own evaluation.
  • Custom domain model derived from Claude Haiku: Bedrock fine-tuning, output hosted via Bedrock Provisioned Throughput.
  • A model we need to run air-gapped: SageMaker JumpStart in a private VPC, no Bedrock.

Matching workload to service

Need a foundation model catalogue + inference Need Claude, Nova, or other proprietary model? yes no, open-weights is fine Sustained high-volume, predictable? no yes Bedrock on-demand per-token Knowledge Bases Guardrails Agents Converse API Bedrock Provisioned per-hour reserved 1 or 6 month term custom models hosted here guaranteed throughput Need full fine-tune / VPC isolation / custom hardware? yes only inference JumpStart fine-tune training job + endpoint LoRA or full-parameter own weights Model Registry per-hour billing JumpStart deploy one-click endpoint own VPC, own SG per-hour billing autoscaling we manage our own RAG + guardrails Self-hosted EC2 / EKS fallback for exotic needs max freedom, max weight rarely the correct first move
Two gates separate the four common answers: proprietary model access splits Bedrock from JumpStart; traffic shape and customisation depth split the two subtrees.

The picks in depth

The chatbot → Bedrock on-demand + Knowledge Bases + Guardrails. The support chatbot is a classic RAG workload. Bedrock Knowledge Bases handles the pipeline: point it at an S3 bucket of our docs, pick an embedding model (Titan Embeddings v2 at 1024 dimensions is a reasonable default), pick a vector store (OpenSearch Serverless if we don’t already run one, Aurora Postgres with pgvector if we do). The service handles chunking with configurable overlap, re-indexes on a schedule, and exposes a RetrieveAndGenerate API that does retrieval plus Claude / Nova inference in one call. The response includes the text, the citations (with S3 URIs and page numbers), and the model’s confidence.

Bedrock Guardrails sits on the Converse or RetrieveAndGenerate call. We define denied topics (“competitor products”, “legal advice”), PII redaction, and a contextual grounding check that refuses to answer if the response isn’t supported by the retrieved context. The grounding check is specifically the “don’t make things up” lever; it’s a separate model running behind the scenes and it adds latency, but it’s the difference between a chatbot that cites and a chatbot that confabulates.

Pricing at 40,000 queries/month × average 2,500 tokens (input + output, RAG-inflated) ≈ 100M tokens/month. On Claude Haiku: roughly $60-80/month. On Nova Micro: roughly $15-25/month. Well inside the budget, with room for experimentation.

The transcript fine-tune → JumpStart fine-tune. The historical-transcripts project is different. It’s exploratory, the data team wants to own the training pipeline, the target model (Llama 3.1 8B or Mistral 7B) is not a Bedrock customisation target, and the outcome is a research artifact that might or might not ship to production. JumpStart’s fine-tuning notebooks offer LoRA out of the box, which trains a small adapter on top of the frozen base weights; cheaper than full fine-tuning, usually good enough. Training runs on a ml.g5.12xlarge spot instance; the output goes to S3 and into Model Registry. If the model earns its way into production, a separate endpoint hosts it; if it doesn’t, the whole thing is archived and nobody owes a running GPU bill.

This is the workload where Bedrock doesn’t fit: we don’t want Claude, we want Llama; we don’t want AWS to host the trained weights, we want to inspect them; we don’t want per-token billing, we want per-hour control over a model we own.

A worked query trace

A customer asks the chatbot: “How do I pause my subscription for a month while I’m travelling?”

  1. The portal POSTs the question to an API Gateway endpoint backed by a Lambda function. Session ID in a header.
  2. Lambda calls bedrock-agent-runtime:RetrieveAndGenerate with the knowledge-base ID, the user’s question, the guardrail ID, and the model ARN (claude-3-haiku-20240307).
  3. Bedrock embeds the question via Titan Embeddings, queries OpenSearch Serverless, retrieves the top five chunks (tuned parameter, default three).
  4. The grounding check runs; the retrieved chunks are clearly about pausing subscriptions, so the check passes.
  5. Bedrock invokes Claude Haiku with a system prompt that includes the retrieved chunks and the user’s question. Claude produces a five-sentence answer citing the “Managing Your Subscription” page.
  6. Guardrails scan the response for denied topics and PII. Clean.
  7. Lambda returns {answer, citations[]} to the portal. Median latency: 2.8 seconds. Token cost: roughly $0.0012 for this query.

What’s worth remembering

  1. Bedrock is managed inference; JumpStart is managed model deployment. Bedrock: AWS runs the GPUs, you call an API, pay per token. JumpStart: you pick a model, deploy to your endpoint, pay per instance-hour. Same “foundation model” label, different operational contracts.
  2. Only Bedrock has the proprietary frontier models. Claude, Nova, and the other proprietary families are accessed through Bedrock or not at all. JumpStart’s catalogue is open-weights models.
  3. Bedrock comes with Knowledge Bases, Guardrails, and Agents. Managed RAG, managed content filtering, managed tool-use. If the workload is a chatbot or a document Q&A, the plumbing is built.
  4. JumpStart comes with fine-tuning notebooks. Full-parameter, LoRA, continued pre-training, depending on the model. Output is weights you own, deployed to an endpoint you operate, inside a VPC you control.
  5. Per-token vs per-hour is a traffic shape decision. Bedrock on-demand wins for bursty or low-volume; Bedrock Provisioned Throughput and JumpStart endpoints win for sustained high-volume where a GPU can be kept busy.
  6. Bedrock can fine-tune too, but narrowly. CreateModelCustomizationJob on eligible base models produces a custom Bedrock model, hosted via Provisioned Throughput. Bedrock customisation is tidier but narrower than JumpStart’s; the customised model lives on Bedrock’s infrastructure.
  7. VPC isolation leans toward JumpStart. A JumpStart endpoint can run in a private subnet with no internet egress, behind PrivateLink. Bedrock access is via VPC endpoints to AWS-operated inference; the model and host stay on AWS’s side.
  8. Picking a model is the last decision, not the first. Pick the operating model (managed inference vs managed deployment) based on who holds the GPU. Pick the feature set based on whether the workload wants Knowledge Bases out of the box. Pick the model from what’s available under that constraint.

Bedrock and JumpStart both say “foundation model” on the tin. Bedrock sells access to someone else’s running model; JumpStart sells the licence to run a model yourself. The chatbot over our docs is a Bedrock workload because the managed RAG, managed guardrails, and Claude-grade output are what the product wants. The transcript fine-tune is a JumpStart workload because we want the weights, the training pipeline, and the freedom to iterate without asking AWS’s model team first. Most organisations end up using both, not because one is better, but because they’re answers to different questions.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.