The situation
A marketing analytics team owns two deployed models:
- Churn propensity. A LightGBM binary classifier over 180 features of account history. Trains nightly, scored nightly across the entire customer base (~20 million rows). The scores land in Redshift, where the CRM joins them into the morning’s priority queue for account managers. Nobody calls this model “live”; it’s consumed at the start of the business day, not during a session.
- Lead scoring. An XGBoost regressor over 40 features of lead behaviour (page views, form answers, traffic source). Called when a prospect submits the “book a demo” form; the predicted deal-size drives routing to a senior or junior sales rep within 10 seconds of submission.
Today, both models are deployed to SageMaker real-time endpoints. The churn endpoint runs 24/7 on ml.m6i.4xlarge instances, handles a single hour of 20M inference calls starting at 02:00 UTC, and sits 97% idle the rest of the day. Finance has noticed.
The question on the desk: which of these workloads should stay on a real-time endpoint, and which should move to batch transform?
What actually matters
The two deployment shapes are designed for different questions. Putting them side by side:
Real-time endpoint: A persistent service on one or more instances behind a managed load balancer. A synchronous InvokeEndpoint call returns the prediction on the same connection, within the 60-second invocation timeout, with a 6 MB request/response cap. Scales via target-tracking on InvocationsPerInstance. Billed per instance-hour for the instances that are running, 24/7 unless the endpoint is deleted. Built for the question “a user is waiting, how long can they wait?”: sub-second for interactive flows, a few seconds for batch-tolerant clients.
Batch transform: A job, not a service. Submit a CreateTransformJob API call with an S3 prefix of inputs, an S3 prefix for outputs, an instance type, and an instance count. SageMaker provisions the instances, processes every input, writes every output, and tears the fleet down. No endpoint persists between runs. Billed per instance-hour only for the duration of the job. Built for the question “score a lot of things, once, and give me the results”: offline, throughput-oriented, priced for the actual work done.
The split in the question is not latency. It’s who’s waiting. A user clicking “book a demo” is waiting; nightly re-scoring is not. The deployment shape that fits each is a consequence of that.
The churn workload, at 20M rows once a day, is a textbook batch problem. Running it on a real-time endpoint means paying for 23 idle hours to get one hour of work done. Moving it to batch transform means paying for exactly the hour of work, once a day, on instances that shut down afterwards.
The lead-scoring workload is a textbook real-time problem. A batch transform job takes 2-5 minutes to spin up, which blows past the “routed within 10 seconds” requirement for a single inference. A persistent endpoint at minimum capacity (one or two ml.m6i.large instances) handles the ~300 leads/day comfortably and is always warm.
The work is seeing both decisions clearly enough to defend them to finance and to the marketing operations team who consume the outputs.
What we’ll filter on
- Latency requirement: is a user or an upstream pipeline waiting synchronously?
- Throughput per invocation: one record at a time, or many?
- Batch size and frequency: how many predictions per run, and how often?
- Cost efficiency at actual traffic: are we paying for idle capacity, or only the work?
- Integration pattern: HTTP call site or S3-to-S3 pipeline?
The inference-shape landscape
1. Real-time endpoint. Persistent HTTPS endpoint, 1+ instances, autoscaling on InvocationsPerInstance. Sub-100ms p50 on small models, up to ~60s on large. 6 MB payload cap. Billed 24/7 while running. Use when a caller is waiting on the response and the traffic is frequent enough that an always-on endpoint is cheaper than cold-starting per request.
2. Batch transform. Job-scoped compute, reads S3 prefix, writes S3 prefix, tears down. No persistent endpoint. Instances can be dozens or hundreds in parallel. Handles arbitrarily large datasets (the input side is S3-sized, not 6 MB-bound). Billed per job duration. Use when a batch of predictions is needed offline, inputs and outputs naturally live in S3, and nothing is waiting on the result synchronously.
3. Async inference. Between the two. Queue-backed, caller submits an S3-located input, endpoint processes asynchronously, output lands in S3, notification via SNS. Scale-to-zero when idle. Fits per-document ad-hoc requests too big or too slow for real-time but too fragmented for batch. Not the natural shape for either workload here (churn is bulk, lead-scoring is fast-synchronous), but worth mentioning to rule out.
4. Serverless Inference. Variant of real-time with AWS-managed instance lifecycle, billed per millisecond, cold-start penalty on idle periods. Good for sporadic traffic on small CPU models. The 300-leads/day lead-scoring case might fit, but the 10-second routing SLA means cold-start variance becomes a customer-facing problem, so we’d probably stay on a real-time endpoint with low minimum capacity instead.
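The four shapes above can be read as a small decision rule. The sketch below is one possible encoding, not a definitive policy: the `Workload` fields, the `pick_shape` function, and both thresholds (`bulk_rows`, `sparse_traffic_per_day`) are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    caller_waiting: bool       # is a user or upstream pipeline blocked on the response?
    requests_per_day: int      # separate invocations per day
    records_per_request: int   # rows scored per invocation


def pick_shape(w: Workload, bulk_rows: int = 100_000,
               sparse_traffic_per_day: int = 50) -> str:
    """Map a workload to an inference shape. Thresholds are illustrative assumptions."""
    if not w.caller_waiting:
        # Nobody is blocked: bulk runs go to batch, fragmented ones to async.
        if w.records_per_request >= bulk_rows:
            return "batch transform"
        return "async inference"
    if w.requests_per_day < sparse_traffic_per_day:
        # Traffic too sparse to justify an always-on instance.
        return "serverless inference"
    return "real-time endpoint"


churn = Workload("churn", caller_waiting=False,
                 requests_per_day=1, records_per_request=20_000_000)
leads = Workload("lead-scoring", caller_waiting=True,
                 requests_per_day=300, records_per_request=1)

print(pick_shape(churn))  # batch transform
print(pick_shape(leads))  # real-time endpoint
```

The first branch is the "who's waiting" question; volume and traffic density only matter once that is answered.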
Side by side
| Option | Latency | Throughput per run | Cost when idle | Integration | Instance lifecycle |
|---|---|---|---|---|---|
| Real-time endpoint | ms to seconds | 1 record per call | Full hourly rate | HTTPS InvokeEndpoint | Persistent |
| Batch transform | Minutes to hours | Millions per job | Zero | S3 prefix in / out | Per-job |
| Async inference | Seconds to 1 hour | 1 record per call, queued | Zero (scale-to-zero) | S3 input + SNS out | Persistent with scale-to-zero |
| Serverless Inference | 100ms to 60s, incl. cold start | 1 record per call | Zero | HTTPS InvokeEndpoint | AWS-managed |
Reading by workload:
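- Churn nightly 20M rows → batch transform. No caller waiting; the result lives in S3/Redshift; 20M rows is exactly what batch transform is for. Moving from a 24/7 real-time endpoint to a one-hour nightly batch job cuts the billed time per instance by ~23× (one hour a day instead of 24); because the batch fleet is wider than the endpoint fleet, the net saving on the monthly bill is closer to 3×.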
- Lead scoring ~300/day synchronous → real-time endpoint, minimum capacity one or two small instances. Low volume but the caller is waiting and the 10-second SLA makes cold-start variance unacceptable. The bill is small either way; optimising for latency predictability is the correct move.
The picks in depth
Lead scoring stays on a real-time endpoint. The endpoint runs on two ml.m6i.large instances at the minimum, autoscaling to four during business hours. The monthly cost is roughly $146, small next to the revenue impact of routing a high-value lead to a senior rep in seven seconds instead of seventy. No change to the existing architecture; the existing endpoint is the correct shape.
What would change this decision: if the lead volume dropped to ~10/day, the cost of 24/7 readiness becomes indefensible and Serverless Inference (or even Async with a scale-to-zero endpoint) starts looking reasonable. At 300/day, the endpoint is busy often enough to justify its persistence.
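For reference, the "two at minimum, four at peak" autoscaling described above could be expressed as two Application Auto Scaling requests. This is a sketch under assumptions: the endpoint name, variant name, policy name, and the target value of 100 invocations per instance are hypothetical and would need tuning against real traffic. The dictionaries are built but not sent, so the snippet runs without AWS credentials.

```python
ENDPOINT_NAME = "lead-scoring"   # hypothetical endpoint name
VARIANT_NAME = "AllTraffic"      # hypothetical production variant name

resource_id = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"

# Capacity bounds: never fewer than 2 instances, never more than 4.
scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 2,
    "MaxCapacity": 4,
}

# Target tracking on invocations per instance.
scaling_policy = {
    "PolicyName": "lead-scoring-invocations-target",  # hypothetical
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 100.0,  # assumed target, not a measured value
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
    },
}

# With boto3 installed and credentials configured, these would be sent as:
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**scalable_target)
#   aas.put_scaling_policy(**scaling_policy)
print(resource_id)
```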
Churn moves to batch transform, orchestrated by Step Functions.
- Stage 1: a Redshift UNLOAD pushes 20M rows of pre-computed features to s3://churn-in/2027-10-28/part-0000.csv through part-0099.csv (100 files for parallelism).
- Stage 2: CreateTransformJob with InstanceCount=20, InstanceType=ml.m6i.4xlarge, S3DataType=S3Prefix pointing at the input prefix, S3OutputPath at the output prefix. SageMaker distributes the 100 input files across the 20 instances (5 per instance, processed in parallel within each instance). The job runs ~60 minutes.
- Stage 3: a Redshift COPY reads the output prefix back into the account_scores table. The CRM join runs at 03:30 UTC as before.
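Stage 2 might look roughly like the request below. This is a minimal sketch, not the team's actual job definition: `build_transform_request` is a hypothetical helper, and the `ContentType`, `SplitType`, and `AssembleWith` values assume the container accepts line-delimited CSV. The function only builds the request dictionary, so it runs without AWS access.

```python
def build_transform_request(run_date: str, model_name: str,
                            instance_count: int = 20) -> dict:
    """Build CreateTransformJob parameters for one nightly churn run."""
    return {
        "TransformJobName": f"churn-{run_date}",
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",  # every object under the prefix is an input
                    "S3Uri": f"s3://churn-in/{run_date}/",
                }
            },
            "ContentType": "text/csv",  # assumption: CSV rows in, one per line
            "SplitType": "Line",
        },
        "TransformOutput": {
            "S3OutputPath": f"s3://churn-out/{run_date}/",
            "AssembleWith": "Line",
        },
        "TransformResources": {
            "InstanceType": "ml.m6i.4xlarge",
            "InstanceCount": instance_count,
        },
    }


request = build_transform_request("2027-10-28", "churn-v417")
print(request["TransformJobName"])  # churn-2027-10-28

# With boto3 this would be submitted as:
#   boto3.client("sagemaker").create_transform_job(**request)
```

Keeping the run date as the only varying input means the job name, input prefix, and output prefix stay in step with each other from one night to the next.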
The 20-instance fleet exists for exactly the job’s duration. The total monthly bill works out to about $463, compared with roughly $1,400/month for the former 24/7 endpoint running on the same instance size, and the actual work throughput is much higher because 20 parallel workers beat 1-2 autoscaled workers for a job that fits the batch shape.
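The comparison is easier to audit in instance-hours, which do not depend on the price list. The sketch below assumes a 30-day month; `RATE` is a placeholder, not a quoted AWS price, and the dollar figures in the text depend on the rate and the endpoint instance count the team actually pays for.

```python
def instance_hours_per_month(instances: int, hours_per_day: int, days: int = 30) -> int:
    """Billed instance-hours for a fleet that runs hours_per_day every day."""
    return instances * hours_per_day * days


def monthly_cost(instance_hours: float, rate_per_hour: float) -> float:
    """Convert instance-hours to a bill at a given hourly rate."""
    return instance_hours * rate_per_hour


batch_hours = instance_hours_per_month(instances=20, hours_per_day=1)
always_on_hours = instance_hours_per_month(instances=1, hours_per_day=24)

print(batch_hours)      # 600 instance-hours for the nightly 20-instance job
print(always_on_hours)  # 720 instance-hours for each always-on endpoint instance

RATE = 0.77  # placeholder USD per instance-hour; substitute the current price
print(monthly_cost(batch_hours, RATE))
```

Read this way, the nightly fleet costs slightly less than a single always-on instance of the same size, and the gap widens with every additional instance the endpoint keeps running.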
Instance sizing for batch transform. The choice of 20 × ml.m6i.4xlarge is an empirical one: benchmark the single-instance throughput on a 1M-row sample (say, 10,000 rows/minute), divide the total row count by what one instance can process within the target runtime budget, and round up. For 20M rows in 1 hour, single-instance capacity of 600K rows/hour means we need ~34 instances of ml.m6i.4xlarge in single-record mode, or, more cheaply, fewer instances with more throughput per instance by switching the batch strategy from SingleRecord to MultiRecord, which batches rows into each HTTP request the container handles. Setting BatchStrategy=MultiRecord with MaxPayloadInMB=6 and MaxConcurrentTransforms=8 typically gets throughput up by 3-5× over single-record mode, letting the fleet drop to 20 instances comfortably.
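The sizing arithmetic can be written down as a short calculation. This sketch takes the 600K rows/hour single-instance figure and the 3-5× MultiRecord speedup from the paragraph above as given; `instances_needed` is a hypothetical helper, and real throughput should come from a benchmark rather than these numbers.

```python
import math


def instances_needed(total_rows: int, rows_per_hour_per_instance: int,
                     hours_budget: float = 1.0, speedup: float = 1.0) -> int:
    """Instances required to score total_rows within hours_budget, rounded up."""
    capacity = rows_per_hour_per_instance * speedup * hours_budget
    return math.ceil(total_rows / capacity)


single_record = instances_needed(20_000_000, 600_000)
multi_record_3x = instances_needed(20_000_000, 600_000, speedup=3.0)
multi_record_5x = instances_needed(20_000_000, 600_000, speedup=5.0)

print(single_record)    # 34
print(multi_record_3x)  # 12
print(multi_record_5x)  # 7

# Batching settings named in the text, as CreateTransformJob parameters.
tuning = {
    "BatchStrategy": "MultiRecord",
    "MaxPayloadInMB": 6,
    "MaxConcurrentTransforms": 8,
}
```

At the low end of the speedup range the job needs 12 instances, so a 20-instance fleet leaves headroom for slow partitions and provisioning time.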
What doesn’t change. The model artifact, the inference container, the preprocessing logic: all the same. Batch transform uses the same container image and serving code as a real-time endpoint; the difference is in the lifecycle, not the code. Teams can move a model between real-time and batch deployment without retraining, just by changing the CloudFormation stack.
A worked batch job trace
At 02:00 UTC, the EventBridge schedule fires.
- Step Functions starts. State 1 calls Redshift UNLOAD via the Data API. Redshift streams 20M rows to 100 partitioned CSV files in s3://churn-in/2027-10-28/. Completes in ~6 minutes.
- State 2 calls CreateTransformJob with JobName=churn-2027-10-28, input s3://churn-in/2027-10-28/, output s3://churn-out/2027-10-28/, instance count 20, model name churn-v417 (pointing at the latest approved model in Model Registry). SageMaker starts provisioning at 02:06.
- SageMaker takes ~3 minutes to provision 20 ml.m6i.4xlarge instances, pull the container image, and load the model. At 02:09, processing begins.
- Each instance processes ~1M rows in 50-55 minutes. Output partitions land in s3://churn-out/2027-10-28/part-XXXXX.csv.out. Job completes at 03:03.
- SageMaker tears down the fleet. Billing stops at 03:05.
- State 3 calls Redshift COPY from s3://churn-out/2027-10-28/ into account_scores. Completes at 03:09.
- Step Functions writes a success notification to SNS. Downstream CRM integration runs at 03:30 as scheduled.
Total wall-clock: 69 minutes. Total billed compute: 20 instances × ~1 hour = 20 instance-hours.
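The totals can be re-derived from the timestamps in the trace. The sketch below makes two assumptions beyond the text: the COPY starts as soon as the transform job completes at 03:03 (which gives it 6 minutes), and the fleet is billed from the start of provisioning at 02:06 until 03:05.

```python
from datetime import datetime, timedelta

t0 = datetime(2027, 10, 28, 2, 0)  # EventBridge schedule fires (UTC)

# (stage, minutes) read off the trace above.
stages = [
    ("Redshift UNLOAD", 6),
    ("provision fleet", 3),
    ("transform rows", 54),
    ("Redshift COPY", 6),  # assumed to start when the job completes
]

timeline = []
clock = t0
for name, minutes in stages:
    end = clock + timedelta(minutes=minutes)
    timeline.append((name, clock.strftime("%H:%M"), end.strftime("%H:%M")))
    clock = end

wall_clock_min = int((clock - t0).total_seconds() // 60)

# Fleet lifetime: provisioning starts 02:06, billing stops 03:05.
fleet_start = t0 + timedelta(minutes=6)
billing_stop = datetime(2027, 10, 28, 3, 5)
instance_hours = 20 * (billing_stop - fleet_start).total_seconds() / 3600

for row in timeline:
    print(*row)
print(wall_clock_min)            # 69
print(round(instance_hours, 1))  # 19.7
```

The 19.7 instance-hours figure is the "20 instances × ~1 hour" from the trace before rounding.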
What’s worth remembering
- Real-time endpoints serve synchronous callers. Instance-hours billed 24/7, sub-second to minute latencies, 6 MB payload cap. Right for interactive or near-interactive flows where a user or upstream pipeline is waiting.
- Batch transform processes S3 prefixes as jobs. Provisions a fleet, processes everything, tears down. Billed only for job duration. Right for bulk offline scoring where the results land in S3 or a warehouse.
- The decisive question is “who’s waiting?” Not latency, not throughput, but who’s blocking on the response. A nightly 20M-row re-score is nobody blocking in any useful sense; a form submission is one user, waiting, with a timer on them.
- Batch transform parallelises naturally. InstanceCount > 1 splits the input prefix across workers; BatchStrategy=MultiRecord batches rows within each HTTP request. Tuning both is how large batch jobs finish on time.
- The same model artifact works in both shapes. No retraining, no repackaging. Moving a model between real-time and batch is a CloudFormation / CDK / console change, not a model-engineering change.
- Cost per prediction is lower on batch. Because the fleet only runs for the duration of the work, and can use larger, more cost-efficient instances than a real-time endpoint optimised for minimum autoscaling floor, the per-prediction economics are strictly better when the batch shape fits.
- Batch transform has no service endpoint to hit. It’s invoked via API (CreateTransformJob), typically from Step Functions, CodePipeline, Airflow, or EventBridge. There’s no persistent DNS name to call.
- Integration pattern drives the picks more than people expect. If the downstream consumer is a Redshift COPY or a Glue job, batch fits naturally. If the downstream consumer is a synchronous API call site, real-time fits. Making either model fit the wrong integration is possible, not wise.
The churn model and the lead model don’t differ in what they predict or how good they are at predicting. They differ in what the results do next: one enters a database overnight, one decides where a phone call gets routed in ten seconds. That difference decides the deployment. Real-time endpoints and batch transform jobs each do exactly what they advertise; the work is reading the workload clearly enough to tell which advertisement matches.