A platform team has five AI-shaped requests landing in a single sprint. Transcribe call-centre audio. Detect anomalies in sensor data. Extract text from scanned forms. Summarise customer emails. Detect faces in CCTV for visitor badging. Someone has already typed “use SageMaker” into three design docs. Someone else insists Bedrock is the answer to all of them. A third voice mutters about purpose-built services like Transcribe and Textract. AWS has at least three answers to almost every AI problem, so picking “the platform that wins” is the wrong frame; what matters is how to tell which layer of the stack each request lands on, and what that choice costs in time, money, and flexibility.
The situation
The platform team at a mid-size enterprise has a backlog of five AI-shaped requests, all due by end of sprint:
- Call-centre transcription. The support team records 40,000 calls a month. They want searchable transcripts with speaker diarisation (“AgentA system that wraps an LLM with tools, memory, and a loop, so it can take multi-step actions toward a goal rather than just answering one prompt. said X, then Customer said Y”) and redaction of credit-card numbers spoken aloud.
- Sensor anomaly detection. The facilities team has 2,000 IoT sensors across 12 sites streaming temperature, humidity, and vibration data. They want alerts when readings stray from normal patterns – patterns that vary by site, sensor type, and time of day.
- Form text extraction. Ops receives 3,000 scanned supplier invoices a week. They want the invoice number, date, line items, and total extracted into a structured row in a database.
- Email summarisation. The sales team wants one-paragraph summaries of customer email threads auto-inserted at the top of their CRM view.
- Visitor-badging face detection. Reception has a camera at the front door; they want a system that detects faces, checks them against an approved-visitor list for that day, and prints a badge (or alerts security).
Five requests, five different ML problem types, one sprint, one platform team of six. The question is which AWS service shape fits each, without the team writing five bespoke ML pipelines.
What actually matters
AWS organises its AI/ML offerings into three broad tiers, and recognising which tier a request belongs to is most of the work.
The first thing worth thinking about is what kind of ML problem this is. “AI” is a wide word. Speech-to-text is a solved problem that almost every cloud provider offers as a managed API. Anomaly detection on time-series data has well-understood algorithms but needs tuning per deployment. Document parsing with structured extraction is a specific service category (intelligent document processing). Summarisation is generative text, which points at Bedrock. Face detection is a computer-vision primitive available as a managed API. The problem shape is the first filter; the platform choice follows from it.
The second is the three tiers. The top tier is AI services: fully managed APIs for well-defined tasks. Transcribe for speech-to-text, Comprehend for NLP (sentiment, entities), Textract for document extraction, Rekognition for image and video analysis, Translate, Polly, Personalize, Fraud Detector. Input goes in, structured output comes out. No ModelA trained set of weights plus the architecture that makes them useful – the thing you load up and run inference against. to train, no infrastructure to provision. This tier is where you want to live if the problem matches a named service and the service’s accuracy meets your bar.
The middle tier is Bedrock: foundation-model APIs for generative and general-purpose tasks. Text generation, summarisation, EmbeddingA fixed-length vector of floats that represents a piece of text (or image, or other thing) in a space where similar meanings sit close together. , RAGA pattern where you retrieve relevant documents at query time and stuff them into the prompt so the model can ground its answer on them. , agents, image generation. Managed API, pay per TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. . The tier for problems where a general-purpose LLMA neural network trained to predict the next token in a sequence, large enough that it generalises to tasks it wasn’t explicitly trained for. is a reasonable substrate (most text understanding, most generation, most RAG).
The bottom tier is SageMaker: the full ML platform. Train your own model, bring your own data, deploy your own endpoint, Fine-tuningContinuing to train an already-trained model on a smaller dataset to adapt its behaviour. open-source models, run notebooks, orchestrate pipelines. The tier for problems where no managed service fits – custom domains, custom metrics, proprietary architectures, regulated contexts where data can’t leave your VPC, or unusual volumes that shift the cost calculus.
The third thing is the default pull toward the bottom tier. Engineers with ML backgrounds reach for SageMaker because it’s the most powerful. Engineers without ML backgrounds reach for Bedrock because it’s the most recent. Neither is the correct first question. The correct first question is: “is there a purpose-built service for this?” If there is, that’s almost always the correct answer – faster to integrate, more accurate on the specific task, cheaper per request, and the team doesn’t become responsible for ongoing model maintenance.
The fourth is the cost shape per tier. AI services are usually priced per transaction: per minute of audio, per page of document, per image analysed. Bedrock is per token (input + output). SageMaker is per instance-hour for TrainingThe process of fitting a model’s weights to data by minimising a loss function. and per instance-hour for hosted endpoints, regardless of traffic. At low-to-medium volume, managed APIs crush the alternatives on cost; at very high volume, SageMaker’s fixed-per-hour pricing can win if your traffic saturates the endpoint. The crossover point is usually higher than teams assume.
The fifth is accuracy and customisation. A managed API gives you the accuracy the vendor tuned for the general case. If your domain is specific enough – medical transcription, legal documents, audio in a noisy factory – accuracy may fall short, and a custom model trained on your data can beat the general API. But this is empirical, not assumed: benchmark the managed API against a test set before concluding you need to build.
The sixth is the operational cost. A managed API is an HTTPS call. Bedrock is an HTTPS call plus a bit of PromptThe input you hand to an LLM – system instructions, user message, examples, retrieved documents, tool descriptions, the lot. . SageMaker requires endpoint capacity planning, auto-scaling configuration, model-version deployment, CloudWatch monitoring, and a team that stays current on SageMaker’s operational surface. These costs compound over the model’s life, not just at launch.
What we’ll filter on
Five filters, applied to each of the three tiers.
- Time to first working version – hours, days, or weeks?
- Customisability – can you fine-tune, retrain, or change the model?
- Cost shape – per-transaction, per-token, or per-instance-hour?
- Infrastructure overhead – do you run an endpoint, or does AWS?
- Breadth of task types – narrow, broad, or arbitrary?
The three-tier landscape
- AI services (top tier). Purpose-built, task-specific managed APIs. Each service does one thing well:
- Transcribe – speech-to-text with diarisation, custom vocabulary, PII redaction, real-time streaming option.
- Comprehend – sentiment, entities, key phrases, language detection, topic modelling; custom classification and entity recognition via Comprehend Custom.
- Textract – OCR with table and form extraction, plus specialised APIs for invoices and identity documents.
- Rekognition – face detection and matching, object and scene detection, moderation, text-in-image, video analysis.
- Translate – neural machine translation, 75+ languages.
- Polly – text-to-speech.
- Personalize – recommendation systems.
- Fraud Detector – custom fraud models on structured data.
- Forecast (retired as standalone; folded into SageMaker Canvas’s time-series mode).
- Kendra – intelligent search over enterprise documents.
- Lookout for Equipment / Metrics / Vision – anomaly detection for industrial equipment, metrics, and visual inspection.
Pricing: per-transaction (per minute, per page, per image). No infrastructure.
-
Bedrock (middle tier). Foundation-model APIs for generative and general-purpose tasks. Catalogue of models (Anthropic Claude, Amazon Nova and Titan, Meta Llama, Mistral, Cohere, AI21). Primary APIs:
InvokeModelfor synchronous generation,RetrieveAndGeneratefor managed RAG,InvokeAgentfor agent workflows. Pricing: per input + output token, with provisioned-throughput option for high-volume or fine-tuned models. No infrastructure. -
SageMaker (bottom tier). Full ML platform. Surfaces include:
- SageMaker Studio – the IDE for ML: notebooks, experiments, pipelines.
- SageMaker Training – run training jobs on managed compute (CPU, GPU, Trainium).
- SageMaker Endpoints – hosted InferenceRunning a trained model to produce output – as opposed to training it. , real-time or asynchronous or batch transform.
- SageMaker Autopilot – AutoML for tabular data.
- SageMaker Canvas – no-code UI on top of Autopilot plus time-series forecasting.
- SageMaker JumpStart – catalogue of open-weight foundation models deployable to your own endpoint.
- SageMaker Ground Truth – data labelling.
- SageMaker Clarify – bias detection and explainability.
- SageMaker Feature Store – managed feature storage.
- SageMaker Pipelines – MLOps orchestration.
- SageMaker Model Registry – versioned model artefacts with approval workflow.
Pricing: per-instance-hour for training jobs and endpoints (ml.m5, ml.g5, ml.p5 etc.), plus storage and data transfer. Full VPC isolation; your models live in your account.
Side by side
| Tier | Time to v1 | Customisable | Cost shape | Infra | Task breadth |
|---|---|---|---|---|---|
| AI services (Transcribe, etc.) | Hours | Limited (custom vocabs, classifiers) | Per-transaction | None | Narrow (one task each) |
| Bedrock | Hours-days | Prompt + RAG + fine-tune | Per-token | None | Broad (any text task; growing image) |
| SageMaker | Days-weeks | Any | Per-instance-hour | Endpoint management | Arbitrary |
Reading the table in reverse is instructive. If your problem matches an AI service, you’re done in hours. If it’s a generative or general-text task, Bedrock is hours to days. Only drop to SageMaker if neither fits – because the task is so domain-specific no managed service addresses it, or because volume makes per-transaction pricing lose to per-hour.
Mapping the five requests
The picks in depth
Call-centre transcription – Amazon Transcribe. The service has a call-analytics mode (StartCallAnalyticsJob) specifically for contact-centre audio: produces transcripts with speaker labels (“AGENT” / “CUSTOMER”), automatic sentiment scoring, issue detection, and content redaction that blanks out card numbers, SSNs, and other PII in the output. Input is an S3 URI of audio; output is JSON to another S3 URI. 40,000 calls a month at, say, 8 minutes average = 320,000 minutes; Transcribe Call Analytics prices per minute, so the monthly bill is straightforward to forecast. Custom vocabularies handle product names, internal jargon, and employee names. No model to train, no endpoint to host.
Sensor anomaly detection – Amazon Lookout for Equipment. Purpose-built for multivariate anomaly detection on industrial sensor data. You upload historical sensor readings, Lookout trains a site-specific model automatically, and you stream readings in for real-time inference. Handles site-to-site variation naturally (each asset or site can have its own model), doesn’t need you to hand-label anomalies (it learns normal patterns from the healthy history). The alternative – building this on SageMaker with a custom model – is a weeks-to-months project; Lookout is days. (Note: the service is being wound down; at the time of writing, Amazon is directing new customers toward SageMaker’s own anomaly-detection capabilities. Check current guidance before new builds.)
Form text extraction – Amazon Textract. Textract has a dedicated Invoice API (AnalyzeExpense) that returns structured fields from invoices: vendor name, invoice number, date, line items, totals, tax, currency. A per-page API call per invoice, output is JSON with each field tagged by type. 3,000 invoices a week at a page each is ~12,000 pages a month; straightforward per-page pricing. For non-invoice forms the generic AnalyzeDocument with FORMS and TABLES feature types does the same for arbitrary structured layouts.
Email summarisation – Bedrock. No purpose-built AI service exists for generic summarisation. Comprehend has a summarisation capability but it’s extractive (pulls existing sentences); for a paragraph summary written in the company’s voice, a foundation model is the correct tool. Bedrock with Claude Sonnet or Nova Lite, one InvokeModel call per email thread, a well-engineered prompt. Volume is low (sales-team-scale, probably low-thousands a day); on-demand per-token pricing. The Bedrock-tier pick that earns its place.
Visitor face detection – Amazon Rekognition. IndexFaces adds today’s approved visitors to a face collection at the start of the day; SearchFacesByImage on each camera frame returns matches above a similarity threshold. Low latency, fully managed. Quotas and pricing scale with API calls; the whole system is a Lambda + S3 + Rekognition pipeline. No custom model training.
Note that four of five are AI services and the fifth is Bedrock. None of them is SageMaker. That’s the correct outcome for this backlog: purpose-built services cover the named tasks, Bedrock covers the generative task, and SageMaker is reserved for the problems none of the above handles.
When would SageMaker win?
Worth naming, because “SageMaker” is still the default someone types. SageMaker is the correct tier when:
-
The task isn’t covered by an AI service. A custom computer-vision model for detecting specific defects in your specific product on your specific assembly line, where Rekognition’s general models don’t have the precision. A custom NLP classifier for a domain-specific taxonomy Comprehend Custom can’t learn. A custom regression on industrial sensor data that goes beyond what Lookout handles.
-
The workload is high-volume enough that per-transaction pricing loses. If you’d be calling Transcribe 10 million times a month and your contract negotiated you a bulk price that’s still more than running a self-hosted ASR model on a few ml.g5 endpoints, then self-hosting wins. This is unusual; the crossover is higher than teams assume.
-
Data residency or VPC isolation. Some AI services have VPC endpoints; some don’t. If your data can’t leave a specific VPC, or can’t be sent to a managed API under any circumstances, SageMaker endpoints inside your VPC are the answer.
-
You need to explain or audit the model’s behaviour in detail. Managed APIs are black boxes. SageMaker lets you inspect, explain (SageMaker Clarify), and version every model; for regulated contexts where “why did the model do this?” needs a substantive answer, that visibility matters.
-
You’re doing research, experimentation, or model development, not consumption. SageMaker Studio is an ML development environment. Managed APIs are ML consumption surfaces. If your team’s job is to build models, Studio is the workspace.
For this platform team’s five requests, none of those conditions applies. They’re consuming ML capabilities, not developing models. AI services and Bedrock are correct; SageMaker would be over-engineering.
A worked sprint
How the team could schedule the five builds across a two-week sprint:
Week 1
Day 1: team walks the five requests against the taxonomy.
Writes up tier assignments + rough cost estimates.
Day 2: parallelise.
Pair A: Transcribe call-analytics pipeline
(S3 + Lambda + StartCallAnalyticsJob + result consumer)
Pair B: Textract Invoices pipeline
(S3 + Lambda + AnalyzeExpense + DB writer)
Pair C: Bedrock email summariser
(Lambda + InvokeModel + prompt tuning + CRM integration)
Day 3-4: each pair gets to a working end-to-end version.
Day 5: review, measure accuracy against a held-out set of
real inputs, tune thresholds / prompts / custom vocabs.
Week 2
Day 1-2: Pair A picks up visitor face detection
(Rekognition + face collection management + badge printing).
Day 1-2: Pair B picks up sensor anomaly detection
(Lookout for Equipment, historical data ingest, inference
stream wiring, alerting).
Day 3-4: Pair C does guardrails, monitoring, IAM scoping
across all five pipelines.
Day 5: end-to-end review with stakeholder sign-off per request.
Six engineers delivering five AI-shaped features in a sprint is only realistic because none of them is a from-scratch ML project. Each is “wire up a managed service correctly.” If any of the five had required SageMaker, that one alone would have consumed the sprint.
The default to hold
When a new AI-shaped request lands, the sequence worth running is:
- Is there a purpose-built AI service for this task? (Transcribe for speech, Textract for forms, Rekognition for images, Comprehend for NLP, Lookout for sensors, etc.) If yes and accuracy meets the bar, use it.
- If no, is it a generative / general-text / RAG task? If yes, Bedrock.
- If neither, does the problem genuinely require custom ML? If yes, SageMaker.
Most requests terminate at step 1. A significant minority at step 2. A small minority at step 3. Reversing the sequence – starting with SageMaker and asking “could this be done simpler?” – is how teams end up running ML pipelines for problems Transcribe would have solved in a day.
What’s worth remembering
- AWS organises AI/ML into three tiers. Purpose-built AI services (Transcribe, Textract, Rekognition, Comprehend, etc.); Bedrock for foundation models; SageMaker for the full ML platform. The tier is the first question; the specific service follows.
- AI services win when the task matches. Transcribe does call-centre audio with diarisation and PII redaction out of the box. Textract does forms. Rekognition does faces. These services are tuned by teams of specialists; competing with them from scratch takes months.
- Bedrock is the foundation-model tier. Generative text, summarisation, RAG, agents, embeddings. Per-token pricing. When the task is “something with text that’s not covered by a specific AI service,” Bedrock is usually correct.
- SageMaker is the platform of last resort, not first choice. The correct tier when no managed service fits, when volume shifts the cost calculus, when data residency demands it, or when you’re developing rather than consuming models.
- Pricing shape varies by tier. Per-transaction (AI services), per-token (Bedrock), per-instance-hour (SageMaker). Low-to-medium volume favours managed per-transaction pricing; very high volume can tip toward SageMaker’s per-hour model.
- Time-to-first-version tracks the tier. AI services are hours. Bedrock is hours-to-days. SageMaker is days-to-weeks. Budget accordingly.
- Custom vocabularies, custom classifiers, and custom entities bridge the accuracy gap on AI services. Transcribe custom vocabularies, Comprehend Custom classification and entity recognition, Textract custom adapters. Often enough to close the accuracy gap without dropping to SageMaker.
- The default sequence is AI service -> Bedrock -> SageMaker. Start at the top, drop only when the tier above can’t meet the requirement. Reversing the sequence is how teams build custom ML pipelines for problems managed services would have solved in a day.
Five AI-shaped requests, one sprint, and no custom models. That’s the shape the AWS AI stack is designed for: most “AI” problems are not ML research projects; they’re integrations of capabilities someone else already built and tuned. Recognising which tier a problem belongs to – and holding the discipline to stay as high as the problem allows – is most of what makes a platform team fast.