SageMaker offers four ways to serve a model. The instinctive framing is “pick by latency” – but real workloads push on different axes. Payload size. Processing time. Whether requests come in steady or in spikes. Whether they arrive at all when no one’s watching. The four options self-select against those axes faster than most people expect, and the survivor is often the one nobody reaches for first.
The situation
A document-processing team runs OCR on PDF batches uploaded by enterprise customers. Each batch is a single file, up to 1 GB in size, containing hundreds to thousands of pages. The OCR model takes between 5 and 15 minutes to process a batch end-to-end on a ml.g5.2xlarge instance.
Batches arrive unpredictably: some hours bring twenty in quick succession, some days bring none. Customers expect the processed output within an hour of upload but don’t need it instantly – they’re not waiting at a screen.
The team wants support for large request payloads (up to 1 GB), long processing times (up to 15 minutes), cost that scales with traffic rather than idle GPU capacity, and reliable result delivery – the output PDF is the customer-facing artefact and can’t be lost.
What we might want from this
Before reaching for a hosting mode, it’s worth asking what the workload is actually trading.
The first thing to name is the shape of the request itself. A 1 GB PDF is not going up a synchronous HTTP connection – that much payload has to live somewhere the endpoint can read it at its own pace, not something the caller streams in one TCP breath. So whatever we pick, the request body is going to be a reference to an object in storage, not the bytes themselves. That observation alone narrows the field before we open the pricing page.
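In boto3 terms that's nothing more than a staged upload and a URI – a minimal sketch, with the bucket and key names as placeholders (they happen to match the ones used later in this piece):

import boto3

s3 = boto3.client('s3')

# Stage the batch in object storage; the inference request will carry
# only this URI, never the 1 GB of bytes.
s3.upload_file('batch-2026-05-01-001.pdf', 'input-bucket',
               'jobs/batch-2026-05-01-001.pdf')
input_uri = 's3://input-bucket/jobs/batch-2026-05-01-001.pdf'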
The second thing is the shape of the response. Fifteen minutes is a long time for a client to hold a socket open, and every proxy, firewall, and load balancer between the caller and the endpoint has its own opinion about idle timeouts. Designing around a synchronous wait means accepting a whole category of flakiness that has nothing to do with the model itself. Designing around “call returns fast with a handle, result shows up later” trades that away for an asynchronous plumbing problem we can solve once.
Third, the arrival pattern. Twenty batches in an hour then nothing for a day is the defining shape here. A provisioned endpoint idling through the quiet stretches burns GPU dollars without earning them. The CFO question – what does this cost when nobody is using it? – needs to have “zero, or near enough” as its answer.
Fourth, failure modes. Customers tolerate “your output will be ready in an hour” but they do not tolerate “your file disappeared.” If the model crashes on malformed input, or the runtime OOMs on a very long document, the orchestration layer needs to notice, write a failure somewhere durable, and tell somebody. Lost work is a worse outcome than slow work.
Fifth, observability. We want to know the queue depth, because that’s the leading indicator of whether capacity is keeping up. We want to know per-batch latency, because that tells us whether a model change regressed runtime. And we want to know failure rate, because silent failures are the scary ones. The hosting mode should publish those natively, not force us to build a second dashboard.
Sixth, the planning horizon. Today the workload is bursty office-hours; tomorrow a new enterprise customer might upload continuously. The architecture shouldn’t collapse when that happens – it should just cost more, proportionally to the work done.
The attributes that matter
- Payload size ceiling – the request body or its referent must accommodate up to 1 GB.
- Per-request processing time ceiling – up to 15 minutes per inference.
- Scale-to-zero when traffic stops.
- Asynchronous result delivery – the customer doesn’t need the response on the same connection they made the request on.
- Durable failure handling – malformed inputs and runtime errors land somewhere the application can see and retry.
The SageMaker inference landscape
SageMaker ships four inference modes. Each one optimises a different workload shape.
Real-time inference. A persistent HTTPS endpoint backed by one or more provisioned instances. Synchronous request/response. Maximum request payload around 6 MB for most instance types; maximum per-request processing time 60 seconds. Designed for interactive workloads – recommendations, fraud-scoring, chat-style responses. Can’t go to zero instances.
Serverless inference. Same synchronous HTTPS interface, but SageMaker manages capacity. Scales to zero between requests; cold starts when traffic resumes. Maximum payload ~4 MB depending on memory tier, processing time capped at 60 seconds. For sparse low-volume internal tools.
Asynchronous inference. Endpoint accepts a request that points at an S3 object as the input payload, queues it, processes it, and writes the response to S3. The client doesn’t hold a connection. Maximum payload 1 GB. Maximum processing time 1 hour. Scales to zero when the queue is empty. Optional SNS notifications on completion or failure.
Batch transform. A managed batch job that reads a static dataset from S3, runs inference across all records, writes results to S3, and tears down. No endpoint. Designed for offline scoring. Wrong shape for items arriving continuously: each batch pays launch/teardown overhead and orchestrating a job-per-upload is more code than maintaining an async endpoint.
The attribute table
| Mode | Payload up to 1 GB | Processing up to 15 min | Scale-to-zero | Async delivery | Durable failure path |
|---|---|---|---|---|---|
| Real-time | ✗ (~6 MB) | ✗ (60 s) | ✗ | ✗ | ✗ |
| Serverless | ✗ (~4 MB) | ✗ (60 s) | ✓ | ✗ | ✗ |
| Asynchronous | ✓ | ✓ (1 h cap) | ✓ | ✓ | ✓ (SNS + error S3) |
| Batch transform | ✓ | ✓ | – | ✓ (per job) | ✓ |
Matching workloads to modes
Reading across the table: real-time and serverless fall at the payload and processing-time limits before anything else matters, batch transform fights the continuous arrival pattern with job-per-upload orchestration, and asynchronous inference is the only row that clears every column. The rest of this piece looks at what choosing it actually involves.
Asynchronous inference, in depth
Async endpoints look like real-time endpoints from the deployment side – same Endpoint, EndpointConfig, and Model objects, same InstanceType. The difference is in how they accept work and return results.
Request shape. Instead of POSTing payload bytes to InvokeEndpoint, the client uploads input to S3, then calls InvokeEndpointAsync with the S3 URI of the input. The API returns immediately with an OutputLocation – the S3 key where the result will eventually appear. The client doesn’t hold the connection.
import boto3

# Async invocations go through the SageMaker runtime client.
runtime = boto3.client('sagemaker-runtime')

response = runtime.invoke_endpoint_async(
    EndpointName='ocr-processor',
    InputLocation='s3://input-bucket/jobs/batch-2026-05-01-001.pdf',
    InvocationTimeoutSeconds=900,
    ContentType='application/pdf'
)
# Returns immediately, e.g.:
# {'OutputLocation': 's3://output-bucket/jobs/batch-2026-05-01-001.pdf.out', ...}
InvocationTimeoutSeconds is per-request, capped at 3600 (1 hour). The endpoint config defines MaxConcurrentInvocationsPerInstance – how many simultaneous inferences each instance handles before queueing.
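Here is a sketch of that endpoint config in boto3 – the model name, variant name, bucket prefixes and SNS topic ARNs are placeholders, and the output, failure and notification settings are the ones the next few paragraphs discuss:

import boto3

sm = boto3.client('sagemaker')

sm.create_endpoint_config(
    EndpointConfigName='ocr-processor-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'ocr-model',            # assumed to be registered already
        'InstanceType': 'ml.g5.2xlarge',
        'InitialInstanceCount': 1,
    }],
    AsyncInferenceConfig={
        'ClientConfig': {
            # Simultaneous inferences per instance before requests queue.
            'MaxConcurrentInvocationsPerInstance': 2,
        },
        'OutputConfig': {
            'S3OutputPath': 's3://output-bucket/jobs/',
            'S3FailurePath': 's3://output-bucket/failures/',
            'NotificationConfig': {
                'SuccessTopic': 'arn:aws:sns:eu-west-1:111122223333:ocr-success',
                'ErrorTopic': 'arn:aws:sns:eu-west-1:111122223333:ocr-failure',
            },
        },
    },
)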
Queue. SageMaker maintains an internal queue per endpoint. Depth is exposed as a CloudWatch metric, ApproximateBacklogSizePerInstance, which is what target-tracking autoscaling uses.
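Reading that metric is an ordinary CloudWatch call – a sketch, assuming the endpoint name from the invocation example:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')

# Average backlog per instance over the last 15 minutes -- the same series
# the target-tracking policy consumes.
backlog = cloudwatch.get_metric_statistics(
    Namespace='AWS/SageMaker',
    MetricName='ApproximateBacklogSizePerInstance',
    Dimensions=[{'Name': 'EndpointName', 'Value': 'ocr-processor'}],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=15),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=['Average'],
)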
Autoscaling and scale-to-zero. Async endpoints support Application Auto Scaling, plus one feature real-time doesn’t have: scale to zero instances when the queue is empty for a configured period. When the next request arrives, SageMaker scales up – expect a cold start of 5-10 minutes for a GPU instance pulling a model from S3.
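Wiring that up goes through Application Auto Scaling rather than a SageMaker API. A sketch, with the variant name, capacity ceiling and target backlog picked arbitrarily:

import boto3

aas = boto3.client('application-autoscaling')

# Scalable target: the endpoint variant's instance count, floor of zero.
resource_id = 'endpoint/ocr-processor/variant/AllTraffic'
aas.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=0,    # scale-to-zero when the queue stays empty
    MaxCapacity=2,
)

# Target tracking on the backlog metric: add instances as the per-instance
# backlog grows, remove them (eventually to zero) as it drains.
aas.put_scaling_policy(
    PolicyName='ocr-backlog-target-tracking',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,    # aim for roughly five queued batches per instance
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [{'Name': 'EndpointName', 'Value': 'ocr-processor'}],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 600,
        'ScaleOutCooldown': 60,
    },
)

One hedge worth noting: AWS's guidance pairs this with a step-scaling policy on the HasBacklogWithoutCapacity metric so a zero-instance endpoint wakes promptly when the first request lands; the sketch above shows only the target-tracking half.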
For this workload, scale-to-zero is the single feature that justifies the choice over real-time. Real-time’s minimum of one ml.g5.2xlarge at ~$1.22/hour is ~$880/month of GPU sitting idle when no batches arrive.
Result delivery. Output lands at the configured S3 location. The client can poll or – better – subscribe to an SNS topic fired on success or failure.
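Polling is the less elegant option but sometimes the pragmatic one for scripts and tests. A minimal sketch, assuming the caller has already split the returned OutputLocation into bucket and key:

import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def wait_for_output(bucket, key, poll_seconds=30):
    # Block until the output object exists; SNS is the better pattern in production.
    while True:
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return
        except ClientError as err:
            if err.response['Error']['Code'] != '404':
                raise
            time.sleep(poll_seconds)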
Failure handling. If processing fails, the request lands in a configured error S3 location and an SNS failure notification fires. Retries are application-layer – async endpoints don’t auto-retry transient failures. The application listens to both topics and re-submits failed jobs.
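The resubmission itself is just another InvokeEndpointAsync call. A sketch of the application-layer retry, assuming the failure handler already knows the original input location (from its own job record or from the failure notification); backoff and attempt limits are left out:

import boto3

runtime = boto3.client('sagemaker-runtime')

def resubmit(input_location, inference_id):
    # Re-drive one failed batch; InferenceId lets the job table correlate
    # the retry with the original submission.
    response = runtime.invoke_endpoint_async(
        EndpointName='ocr-processor',
        InputLocation=input_location,
        InferenceId=inference_id,
        InvocationTimeoutSeconds=900,
        ContentType='application/pdf',
    )
    return response['OutputLocation']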
A worked example: one Tuesday
12 batches arrive between 14:00 and 18:00. None overnight. Three more Wednesday morning.
- Tuesday 13:50. No batches for an hour. Endpoint at zero instances. Cost since last activity: zero.
- Tuesday 14:00. Batch 1 arrives. SageMaker scales up one instance (cold start ~6 minutes). Batch 1 begins processing ~14:06, completes ~14:18. SNS notification fires.
- Tuesday 14:05-14:55. Eleven more batches arrive. Queue depth peaks at ~10. Autoscaling scales to two instances. All twelve complete by 16:30. Queue empty.
- Tuesday 17:00. No requests for an hour. Endpoint scales back to zero.
- Wednesday 09:00. First batch of the day. Cold start, then processes.
GPU instance-hours billed only during processing windows plus cold-start preamble. Real-time would cost ~$880/month for idle capacity. Async with ~6 hours a day of actual processing runs ~$220/month – roughly 4x cheaper for the same throughput.
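The arithmetic behind those figures, using the same approximate hourly rate:

rate = 1.22                      # $/hour, one ml.g5.2xlarge (approximate)
always_on = rate * 24 * 30       # real-time floor of one instance: ~$880/month
six_hours_daily = rate * 6 * 30  # async, ~6 processing hours a day: ~$220/month
# ratio: always_on / six_hours_daily is roughly 4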
What’s worth remembering
- Four SageMaker inference modes, each optimised for a different request shape: real-time, serverless, async, batch transform.
- Real-time and serverless cap payloads around 4-6 MB and processing at 60 seconds – hard limits, not tunable.
- Async reads input from S3 and writes output to S3; the HTTPS request carries only URIs, which is how the 1 GB ceiling is possible.
- Async supports scale-to-zero; real-time does not – this is usually the deciding feature for bursty workloads.
- ApproximateBacklogSizePerInstance is the autoscaling target for async endpoints.
- SNS topics for success and failure are how a reliable pipeline is built on top of async; async itself does not auto-retry.
- Cold start is real – scale-from-zero on GPU instances takes minutes. Time-sensitive workloads either keep a minimum above zero or accept the cold-start tax.
- The four-mode landscape isn’t “pick by latency” – it’s “pick by request shape and arrival pattern.” Latency only matters once you’ve narrowed by those.
Deploy an asynchronous inference endpoint on the ml.g5.2xlarge, with autoscaling targeting ApproximateBacklogSizePerInstance and a minimum of zero instances. Configure input and output S3 prefixes plus SNS topics for success and failure. The application uploads to S3, calls InvokeEndpointAsync, and reacts to the SNS notification when the output lands.