The situation
Legal operations at a mid-sized company run every inbound contract through a review model that extracts clauses, flags unusual terms, and produces a structured summary. The model is a fine-tuned LLM hosted on an ml.g5.2xlarge instance; it ingests documents via a custom inference container that handles chunking and clause assembly internally. Per-document work:
- Average document: 30 pages, ~12,000 tokens.
- Average inference time: 45 seconds per document.
- Input payload: 2-25 MB PDF, passed directly to the endpoint.
- Response payload: ~50 KB JSON of extracted clauses.
- Volume: ~300 contracts per business day, bursty (12-20 at once when a deal closes).
Today the team calls a SageMaker real-time endpoint synchronously. This is breaking:
- Browsers and many API gateways enforce a 30-60 second HTTP timeout; 45 seconds falls exactly in the failure zone.
- The maximum payload size for a SageMaker real-time endpoint is 6 MB, so documents over ~18 pages are rejected outright.
- To hit peak concurrency without queueing, the team would need to keep five or six instances warm even though average utilisation is 8%.
The question on the desk is which SageMaker inference pattern fits this work.
What actually matters
SageMaker offers more than one inference shape. The naive framing is “real-time or batch”, and that misses the thing in the middle.
Real-time endpoints are the default: an HTTPS endpoint you call synchronously, the response comes back on the same connection, charged per instance-hour. Built for low-latency, high-frequency requests. Three constraints matter for the contract case: the 60-second invocation timeout, the 6 MB payload limit, and the fact that the instance is paid for 24/7 regardless of whether it’s serving traffic.
Batch transform is the other end: submit a job pointing at an input S3 prefix, SageMaker spins up N instances, processes every object, writes the output to an S3 prefix, and shuts down. Ideal for offline scoring of thousands or millions of records. Wrong shape for individual contracts arriving ad hoc: spinning up a fleet of instances for one document is overhead, and the batch submission model doesn’t fit “user clicked a button and wants a result when it’s ready”.
A third pattern fits between the two. The caller submits a request pointing to an object in storage; the endpoint returns immediately with the location where the result will eventually appear; the instance processes the request in the background and writes the result; the caller learns about completion via a notification, an event, or by polling. Key properties: payloads measured in gigabytes rather than megabytes, inference time measured in tens of minutes rather than seconds, a native queue so concurrent requests wait instead of failing, and scale-to-zero so the endpoint costs nothing when the queue is empty.
The 45-second-per-document, 8-MB-payload, bursty-arrival pattern wants something of that shape. The interesting work is in seeing why none of the alternatives fit as cleanly.
What we’ll filter on
- Invocation timeout: does the inference finish within the service’s timeout budget?
- Maximum payload size: does the request body fit?
- Scale-to-zero: can the endpoint drop to zero instances when idle?
- Queueing behaviour: can concurrent requests wait without failing?
- Call pattern: synchronous HTTP response, S3 notification, or pre-submitted batch job?
The SageMaker inference landscape
1. Real-time endpoint. Synchronous HTTPS. The canonical endpoint: instance(s) run behind a managed load balancer, respond inline. Invocation timeout: 60 seconds (hard, enforced by the service). Payload: 6 MB request, 6 MB response (hard). Minimum instance count: 1 (no scale-to-zero for classic real-time). Scaling: target-tracking on InvocationsPerInstance. Fits small, fast, frequent requests (fraud-scoring, recommendation, embedding-generation workloads).
2. Real-time endpoint (Serverless Inference). A variant where AWS manages instance lifecycle and bills per millisecond of inference time. Cold starts apply. Payload up to 4 MB, timeout 60 s, memory up to 6 GB. Good for sporadic traffic on small CPU models; not usable for the contract workload (GPU-size models not supported, and the timeout is too short).
3. Asynchronous inference endpoint. The caller calls InvokeEndpointAsync with InputLocation=s3://.../input.pdf. The call returns immediately with an OutputLocation=s3://.../output.json. The endpoint queues the request, processes it when an instance is free, writes the output, and (optionally) publishes to SNS or emits an S3 event on completion. Payload up to 1 GB, processing time up to 1 hour, supports scale-to-zero. The instance count scales on a custom CloudWatch metric (ApproximateBacklogSizePerInstance). Priced per instance-hour while instances are running; nothing when scaled to zero.
4. Batch transform. Offline batch scoring. Submit a CreateTransformJob with S3 input prefix and S3 output prefix; SageMaker provisions instances, iterates the inputs, writes outputs, tears down. No endpoint object persists between jobs. Priced per instance-hour of the job duration. Right for nightly re-scoring of a data lake; wrong for single-document request-response.
5. Multi-model endpoint (MME). Orthogonal feature: one endpoint hosts many models, loaded on demand. Can be combined with real-time or async. Not the answer to the timeout/payload question, but worth naming because it’s in the catalogue.
Side by side
| Option | Invocation timeout | Max payload | Scale-to-zero | Queueing | Call pattern |
|---|---|---|---|---|---|
| Real-time | 60 s | 6 MB | ✗ | ✗ (fail fast) | Sync HTTP |
| Serverless Inference | 60 s | 4 MB | ✓ (cold starts) | ✗ (throttles at max concurrency) | Sync HTTP |
| Async inference | 1 hour | 1 GB | ✓ | ✓ (native queue) | S3 location + SNS |
| Batch transform | Job duration | S3-sized | n/a (job-scoped) | n/a | Job submit |
| MME (real-time or async) | inherits | inherits | inherits | inherits | inherits |
Reading by row, the async row is the only one that says yes to all four of: long timeout, big payload, scale-to-zero, queue-on-busy.
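To make that reading mechanical, here is a minimal sketch that encodes the numeric limits from the table and filters on the contract workload. The `Option` class and `shortlist` function are hypothetical names for illustration; the limits are the ones quoted in this article, and only the endpoint-style options are included. Queueing is left out of the check because the numeric limits and scale-to-zero already decide the outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    timeout_s: int        # per-invocation timeout, in seconds
    max_payload_mb: int   # largest request the option accepts, in MB
    scale_to_zero: bool


# Limits as quoted in the table above (not fetched from AWS).
OPTIONS = [
    Option("Real-time", 60, 6, False),
    Option("Serverless Inference", 60, 4, True),
    Option("Async inference", 3600, 1024, True),
]


def shortlist(inference_s, max_payload_mb, need_scale_to_zero):
    """Return the names of options whose limits cover the workload."""
    return [
        o.name
        for o in OPTIONS
        if inference_s <= o.timeout_s
        and max_payload_mb <= o.max_payload_mb
        and (o.scale_to_zero or not need_scale_to_zero)
    ]


# Contract workload: 45 s inference, PDFs up to 25 MB, idle overnight.
print(shortlist(45, 25, True))
```

Note that 45 seconds technically fits inside a 60-second timeout; in this sketch real-time is eliminated by the payload limit and the lack of scale-to-zero, not by the timeout alone.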
The two call patterns side by side
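One way to see the difference is to compare the request each pattern sends. The sketch below builds the keyword arguments for the two boto3 `sagemaker-runtime` calls, `invoke_endpoint` and `invoke_endpoint_async`; the helper names, the endpoint name, and the `application/pdf` content type are assumptions for illustration, and the 6 MB check simply mirrors the limit quoted in this article.

```python
REALTIME_BODY_LIMIT = 6 * 1024 * 1024  # 6 MB, the real-time limit quoted above


def realtime_request(endpoint_name, pdf_bytes):
    """kwargs for invoke_endpoint: the document travels in the HTTP body."""
    if len(pdf_bytes) > REALTIME_BODY_LIMIT:
        raise ValueError("document exceeds the real-time payload limit")
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/pdf",
        "Body": pdf_bytes,
    }


def async_request(endpoint_name, input_s3_uri, inference_id):
    """kwargs for invoke_endpoint_async: only a pointer to S3 is sent."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/pdf",
        "InputLocation": input_s3_uri,
        "InferenceId": inference_id,
    }


# Usage with boto3 (not executed here):
#   runtime = boto3.client("sagemaker-runtime")
#   resp = runtime.invoke_endpoint(**realtime_request("legal-review", data))
#   summary = resp["Body"].read()            # blocks until inference finishes
#   resp = runtime.invoke_endpoint_async(
#       **async_request("legal-review", "s3://bucket/in/doc.pdf", "doc"))
#   result_uri = resp["OutputLocation"]      # returned immediately
```

The synchronous call holds the connection open for the whole inference and returns the result in the response body; the asynchronous call returns at once with the S3 location where the result will appear.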
The pick in depth
Asynchronous inference endpoint, on ml.g5.2xlarge, autoscaling 0-5 instances.
- Input: clients upload the PDF to s3://legalops-contracts/inbound/<uuid>.pdf, then call InvokeEndpointAsync with InputLocation pointing at that S3 URI and InferenceId set to the same UUID.
- Processing: the SageMaker managed queue holds the request until an instance is free. On a g5.2xlarge, the model processes the document in ~45 seconds. The instance reads from S3 directly (no 6 MB HTTP body), runs inference, and writes s3://legalops-contracts/outbound/<uuid>.json.
- Notification: the endpoint is configured with NotificationConfig.SuccessTopic set to an SNS topic. On completion, SageMaker publishes a message with the input and output S3 URIs. A Lambda subscribes to the topic, updates a DynamoDB row (contract_id, status=ready, output_uri=...), and pushes a WebSocket message to the legal-ops UI. The UI polls or listens; when the document is ready, it fetches the JSON and renders the clause view.
- Scaling: a target-tracking autoscaling policy watches ApproximateBacklogSizePerInstance with a target of 3. When 20 documents arrive in a burst, the queue grows, the metric rises, and SageMaker spins up more instances. When the queue empties, the metric falls to 0 and SageMaker scales back to 0 instances. Scale-to-zero is the cost lever: overnight and at weekends, the endpoint costs nothing.
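The configuration described above can be sketched as three parameter dictionaries: one for SageMaker's `create_endpoint_config`, and two for Application Auto Scaling's `register_scalable_target` and `put_scaling_policy`. The helper names, the variant name `AllTraffic`, the policy name, and the model and topic arguments are placeholders, not values from the team's setup; treat this as a starting point to check against the current API reference rather than a finished deployment.

```python
def async_endpoint_config(config_name, model_name, output_s3,
                          success_topic_arn, error_topic_arn):
    """kwargs for sagemaker create_endpoint_config (async variant)."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
        }],
        "AsyncInferenceConfig": {
            "OutputConfig": {
                "S3OutputPath": output_s3,
                "NotificationConfig": {
                    "SuccessTopic": success_topic_arn,
                    "ErrorTopic": error_topic_arn,
                },
            },
        },
    }


def _variant(endpoint_name, variant):
    # Fields shared by every Application Auto Scaling call for this variant.
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }


def scalable_target(endpoint_name, variant="AllTraffic", max_instances=5):
    """kwargs for register_scalable_target: allow 0 to max_instances."""
    return dict(_variant(endpoint_name, variant),
                MinCapacity=0, MaxCapacity=max_instances)


def backlog_policy(endpoint_name, variant="AllTraffic", target=3.0):
    """kwargs for put_scaling_policy: track backlog per instance."""
    return dict(
        _variant(endpoint_name, variant),
        PolicyName="backlog-per-instance",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName",
                                "Value": endpoint_name}],
                "Statistic": "Average",
            },
        },
    )


# Usage with boto3 (not executed here):
#   boto3.client("sagemaker").create_endpoint_config(**async_endpoint_config(...))
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**scalable_target("legal-review"))
#   aas.put_scaling_policy(**backlog_policy("legal-review"))
```

The AWS documentation also describes a second, step-scaling policy on the HasBacklogWithoutCapacity metric for waking an endpoint that has scaled to zero; whether it is needed here is worth confirming before relying on target tracking alone.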
Cold start is the trade-off. When the endpoint is at zero and a request arrives, spinning up a g5.2xlarge with a GPU model loaded takes 3-5 minutes. Legal ops has no response-time SLA below “same business day”, so this is acceptable. If it weren’t, a scheduled scaling action that keeps one instance warm during business hours would bring the latency of a request arriving at an idle endpoint back to the normal 45 seconds.
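If a warm instance during business hours were ever needed, one option is a pair of Application Auto Scaling scheduled actions that raise the minimum capacity to one in the morning and drop it back to zero in the evening. The sketch below builds the arguments for `put_scheduled_action`; the 08:00 and 18:00 times, the time zone, the action names, and the variant name are all assumptions for illustration.

```python
def business_hours_actions(endpoint_name, variant="AllTraffic",
                           timezone="Europe/London", max_instances=5):
    """kwargs for two put_scheduled_action calls (warm at open, idle at close)."""
    base = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "Timezone": timezone,
    }
    warm = dict(
        base,
        ScheduledActionName="warm-at-open",
        Schedule="cron(0 8 ? * MON-FRI *)",   # 08:00 on weekdays (assumed)
        ScalableTargetAction={"MinCapacity": 1, "MaxCapacity": max_instances},
    )
    idle = dict(
        base,
        ScheduledActionName="idle-at-close",
        Schedule="cron(0 18 ? * MON-FRI *)",  # 18:00 on weekdays (assumed)
        ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": max_instances},
    )
    return [warm, idle]


# Usage with boto3 (not executed here):
#   aas = boto3.client("application-autoscaling")
#   for action in business_hours_actions("legal-review"):
#       aas.put_scheduled_action(**action)
```

Outside the scheduled window the backlog-based policy still governs scaling, so overnight and weekend cost stays at zero.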
SNS for delivery, not polling. Async inference has no per-request status API, so the only thing a client can poll is the S3 output location. That works, but it makes the client code ugly and produces a stream of billable requests that mostly return nothing. The SNS notification fan-out lands on a small Lambda that does the UI side; the client UI just listens for the WebSocket push. CloudTrail captures the completion event and the Lambda writes a structured log line for observability; the loop is clean.
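The parsing half of that Lambda might look like the sketch below. It turns an SNS-triggered Lambda event into rows shaped like the DynamoDB record described earlier. The function name is hypothetical, and the message field names (`inferenceId`, `invocationStatus`, `responseParameters.outputLocation`) are my understanding of the async-inference notification format; they should be verified against a real message before use. The DynamoDB write and WebSocket push are omitted.

```python
import json


def completions_from_sns(event):
    """Extract (contract_id, status, output_uri) rows from an SNS Lambda event."""
    rows = []
    for record in event.get("Records", []):
        # SNS delivers the SageMaker notification as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        completed = message.get("invocationStatus") == "Completed"
        rows.append({
            "contract_id": message.get("inferenceId"),
            "status": "ready" if completed else "failed",
            "output_uri": message.get("responseParameters", {}).get("outputLocation"),
        })
    return rows


# Example event with an assumed message shape, for local testing only.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "invocationStatus": "Completed",
                "inferenceId": "ac12",
                "responseParameters": {
                    "outputLocation": "s3://legalops-contracts/outbound/ac12.out",
                },
            })
        }
    }]
}

print(completions_from_sns(sample_event))
```

Keeping the parsing in a pure function makes it easy to unit-test without any AWS credentials; the handler itself would only call this and then write the rows.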
Instance sizing. ml.g5.2xlarge is the smallest instance that runs the model with acceptable latency. The question of whether to run on g5.4xlarge to go from 45 s to 30 s is a cost calculation: the larger instance costs roughly 2× per hour, and each concurrent request blocks one instance. At 300 requests/day × 45 s = 3.75 GPU-hours/day, the endpoint needs only about four hours of actual inference time on a typical day. Spending more per hour to cut latency by a third is a bad trade when the user-facing metric is “same day” rather than “30 seconds”.
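The arithmetic behind that trade can be checked in a few lines. This sketch uses relative prices only, with the smaller instance as 1.0 and the larger as 2.0 (the "roughly 2×" figure above), and assumes the 30-second inference time holds on the larger instance; no real price list is used.

```python
def daily_gpu_hours(docs_per_day, seconds_per_doc):
    """Total inference time per day, in hours."""
    return docs_per_day * seconds_per_doc / 3600


# Relative hourly prices: the 2x ratio is the rough figure from the text.
small_hours = daily_gpu_hours(300, 45)   # 3.75 hours
large_hours = daily_gpu_hours(300, 30)   # 2.5 hours

small_cost = small_hours * 1.0           # 3.75 relative cost units
large_cost = large_hours * 2.0           # 5.0 relative cost units

print(f"small: {small_hours} h, relative cost {small_cost}")
print(f"large: {large_hours} h, relative cost {large_cost}")
```

Under those assumptions the larger instance finishes the day's work in two thirds of the time but costs about a third more in total, which is the bad trade described above.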
A worked request trace
A contract arrives at 10:14:22 local time, during a legal-ops deal crunch with 14 in-flight reviews:
- The UI uploads contract-ac12.pdf (11 MB) to s3://legalops-contracts/inbound/ac12.pdf. Upload completes at 10:14:25.
- The browser calls InvokeEndpointAsync(EndpointName='legal-review', InputLocation='s3://.../ac12.pdf', InferenceId='ac12'). The API returns 202 with OutputLocation='s3://.../outbound/ac12.out' in under 200 ms.
- The UI stores ac12 in the pending-contracts list with a spinner.
- The SageMaker queue has 14 items ahead. Three instances are already running; ApproximateBacklogSizePerInstance sits at 5, above the target of 3. Autoscaling adds a fourth instance over the next two minutes.
- The request is picked up at 10:17:03. Inference takes 43 seconds. The endpoint writes s3://legalops-contracts/outbound/ac12.out at 10:17:46.
- SageMaker publishes to SNS. The notification Lambda updates DynamoDB and pushes a WebSocket message. The UI replaces the spinner with a “Ready” button at 10:17:47. Total end-to-end: 3 minutes 25 seconds.
- After the burst passes, the queue drains. At 11:40, backlog per instance is 0 for 15 minutes. Scaling policy removes instances one by one; by 12:10 the endpoint is at zero. No compute cost until the next request.
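Breaking the trace into phases shows where the time goes. The sketch below uses only the timestamps from the trace; the phase labels are mine.

```python
from datetime import datetime


def at(hms):
    """Parse an HH:MM:SS timestamp from the trace."""
    return datetime.strptime(hms, "%H:%M:%S")


# (phase that ends at this timestamp, timestamp), in order.
stamps = [
    ("arrival", "10:14:22"),
    ("upload", "10:14:25"),
    ("queue wait", "10:17:03"),
    ("inference", "10:17:46"),
    ("notification", "10:17:47"),
]

# Duration of each phase = its end time minus the previous timestamp.
phases = {
    name: int((at(end) - at(start)).total_seconds())
    for (_, start), (name, end) in zip(stamps, stamps[1:])
}
total = sum(phases.values())

print(phases)
print(f"total: {total // 60} min {total % 60} s")
```

Queue wait accounts for 158 of the 205 seconds; inference itself is 43. During a burst the user-visible latency is dominated by the backlog, which is the quantity the autoscaling target controls.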
What’s worth remembering
- Async inference fits long-running, big-payload, bursty inference. 1 GB payload, 1 hour timeout, native queue, scale-to-zero. It’s the pattern built for exactly the “45-second ML inference on a 10 MB document” shape.
- Real-time endpoints have a 60-second timeout and a 6 MB payload limit. These are hard service limits, not tunable. When inference or payload approaches them, real-time is the wrong shape.
- Batch transform is for bulk offline scoring. Submit a job against an S3 prefix, get an S3 prefix of outputs back, instances tear down. Wrong shape for ad-hoc per-document requests where a user is waiting for a specific result.
- Serverless Inference is a sibling to real-time. Same timeout, smaller payload, true scale-to-zero with cold starts. CPU-only instance types, memory up to 6 GB. Wrong shape for GPU models.
- Async endpoints scale on ApproximateBacklogSizePerInstance. Target a value (2-5 is typical), let autoscaling expand and contract. Scale-to-zero is the cost win; cold-start time is the trade.
- Large payloads go through S3, not through the invocation body. Async takes an InputLocation; the endpoint reads the object directly. That’s how it supports 1 GB payloads: the HTTP call itself is small.
- Notifications beat polling. Configure NotificationConfig.SuccessTopic and ErrorTopic; let Lambda fan out to the UI, a downstream Step Function, or an EventBridge rule. Polling the S3 output location works but is noisy.
- Cold-start behaviour is the trade for scale-to-zero. A scheduled warm-keeper Lambda that sends a dummy request at 08:55 each business day caps the worst-case latency at the normal inference time during business hours without paying for overnight capacity.
The contract-review workload was never a real-time problem. Forty-five-second inferences don’t belong in a synchronous HTTP lifecycle; eight-megabyte payloads don’t fit in a 6 MB envelope; and the bursty, 300-per-day shape wants to scale to zero when nothing’s happening. Async inference is the pattern AWS built for exactly that workload. Once the call sites are adjusted to upload first, invoke second, and notify third, the rest of the system stops fighting the infrastructure.