Hosting Seventy Models Behind One SageMaker Endpoint

March 13, 2028 · 16 min read

ML Engineer · MLA-C01 · part of The Exam Room

The situation

A payments platform runs a fraud-detection stack with one model per enrolled merchant: currently 70 merchants, trending towards 150 by end of year. Each model is an XGBoost classifier trained on that merchant’s own transaction history. The models are all the same shape (same feature vector, same framework, same inference container) but have different parameters.

Traffic is uneven:

  • Five merchants generate 80% of calls. These are always-on, latency-sensitive, and tolerant of nothing over 50ms at p99.
  • Fifty merchants generate the middle 19%. Active during business hours, quiet overnight.
  • Fifteen merchants are occasional. A few hundred calls a day, sometimes zero, sometimes a spike when an engineer runs a backfill.

Today each model has its own dedicated ml.m5.large endpoint. 70 endpoints, 70 idle instances overnight, and the bill has become the topic of monthly cost reviews. The team has three options to consider and wants to know which fits where.

What actually matters

The first thing worth untangling is: when multiple models live on the same endpoint, what exactly do they share?

Three things can be shared independently: the container, the instance, and the endpoint URL itself.

Sharing the container means all models run inside the same process, same framework, same inference code, same dependencies. This is cheap: one container image, one set of libraries, one loading code path. But it forces homogeneity: every model must speak the container’s protocol and use its framework.

Sharing the instance means multiple containers run on the same EC2 host. This is heavier: each container has its own process, its own memory, its own dependencies. But it allows diversity: an XGBoost model, a PyTorch model, and a scikit-learn model can coexist on the same hardware, each in its own container.

Sharing the endpoint URL means the client calls one HTTPS address and passes a hint (“which model do you want?”) with each request. The routing is done by SageMaker, not by the client’s DNS. This is what turns “70 deployments” into “one deployment” operationally: one URL to provision, one set of IAM permissions, one alarm topology.

Three dimensions to share along, and the shapes SageMaker offers are different combinations of them.

The second thing worth asking is what happens when a model isn’t currently in memory. If every model always lives hot on every instance, the answer is “nothing, it’s always ready.” But if models can be loaded lazily, the answer becomes “cold-start latency the first time a merchant is seen today, fine after that.” For the 15 occasional merchants, cold starts are acceptable; for the five whales, they are not.

The third is the resource isolation story. If model A goes rogue and eats all the memory on the instance, does it take model B down? For models owned by different teams or serving different merchants with different contractual latency SLAs, this matters; for 70 variants of the same XGBoost training pipeline, all run by the same team, it matters less.

What we’ll filter on

Five filters for picking between the shapes:

  1. How many models can share the endpoint: two, ten, thousands?
  2. Do they share the container or run in separate containers (homogeneous framework or heterogeneous)?
  3. Load-on-demand or always hot: is there a cold start on first invocation of a given model?
  4. Resource isolation between models: can one model starve another?
  5. How the client picks which model to hit: header, variant, or target parameter?

The SageMaker hosting landscape

1. Single-model endpoint (status quo). One model, one (or more) instances, one endpoint URL. The baseline. Always hot, full hardware for one model, simple IAM and monitoring. Fine for one or two high-value models; the 70-model version is the wall the team has hit.

2. Multi-model endpoint (MME). One endpoint, one container, many models that share the container and the instance. Models live as separate artefacts (typically a .tar.gz per model) in S3; SageMaker loads a model into the container’s memory on first invocation for that TargetModel, keeps it in an LRU cache, and evicts under memory pressure. Client picks the model via the TargetModel parameter on InvokeEndpoint. All models must use the same container (same framework, same inference code). Cost-efficient for many models with long-tail traffic: only the hot set sits in memory; cold models pay a load-from-S3 latency on first call (typically sub-second for small tree models, longer for large neural networks). Not ideal for latency-sensitive always-hot models: LRU eviction means a model that hasn’t been called in a while can be pushed out of memory when other models need the room.

3. Multi-container endpoint (MCE). One endpoint, up to 15 different containers on the same instance, each hosting its own model. Client picks the container via the TargetContainerHostname header. All containers share the instance’s CPU, memory, and GPU (if any). Useful when a small number of different models (different frameworks, different dependencies) need to be co-located, e.g. a TensorFlow text model, a PyTorch image model, and an XGBoost tabular model in one place. Direct invocation mode routes to a named container; serial inference pipeline mode chains them (request goes through container A, then B, then C). Not a fit for the 70-merchant case, because 70 is well above the 15-container limit and the merchants all use the same framework.

4. Inference pipeline (serial). A variant of MCE where up to 15 containers run as a chain on the same instance, request flows through each container in order. Typical use: preprocessing container → model container → postprocessing container. Not a many-models pattern; it’s a many-stages pattern.

5. Serverless inference. Per-model endpoint, but backed by SageMaker-managed capacity that scales to zero. No instance bill when idle, cold start on the first invocation after scale-down. Each serverless endpoint still hosts one model; the 70-model bill becomes 70 serverless endpoints rather than 70 provisioned ones. Useful when per-model traffic is genuinely sporadic; less useful here because the whales are always on and serverless doesn’t give them the always-hot guarantee they need.

6. Asynchronous inference. Endpoint takes a request, puts it on an internal queue, returns a pointer to where the result will eventually land in S3. Useful for long-running inferences (minutes), not for the sub-50ms fraud-scoring case.
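The many-stages idea behind the serial inference pipeline (item 4) can be pictured as plain function composition. This is only a local sketch of the data flow, not SageMaker code: the three stage functions, the amount field, and the 0.5 threshold are all hypothetical stand-ins.

```python
from functools import reduce
from typing import Callable, List

Stage = Callable[[dict], dict]


def run_pipeline(stages: List[Stage], request: dict) -> dict:
    """Pass the request through each stage in order, like a serial pipeline."""
    return reduce(lambda payload, stage: stage(payload), stages, request)


def preprocess(payload: dict) -> dict:
    # Hypothetical feature scaling step.
    return {**payload, "amount_scaled": payload["amount"] / 1000.0}


def model(payload: dict) -> dict:
    # Hypothetical stand-in for the real classifier.
    return {**payload, "score": min(1.0, payload["amount_scaled"])}


def postprocess(payload: dict) -> dict:
    # Hypothetical thresholding step.
    return {"fraud": payload["score"] >= 0.5}


decision = run_pipeline([preprocess, model, postprocess], {"amount": 750.0})
```

One request goes in, each stage sees the previous stage’s output, and the caller only sees the last stage’s response.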

Side by side

| Shape | Models per endpoint | Container model | Load-on-demand | Isolation | Client routing |
| --- | --- | --- | --- | --- | --- |
| Single-model | 1 | Dedicated | Hot always | Full | Endpoint URL |
| MME (Multi-model) | Thousands | Shared container | ✓ lazy, LRU-cached | ✗ shared memory | TargetModel param |
| MCE (Multi-container) | Up to 15 | Per-container | Hot always | ✗ shared instance | TargetContainerHostname header |
| Inference pipeline | 2-15 (as stages) | Per-stage container | Hot always | ✗ shared instance | Chained automatically |
| Serverless | 1 | Dedicated | ✓ scale-to-zero | Full (managed) | Endpoint URL |

Reading the table against the 70-merchant scenario:

  • The five whales need always-hot, isolated, sub-50ms latency. The honest answer is five dedicated single-model endpoints, each with at least two instances behind a target-tracking autoscaling policy. Don’t fight it; they pay for themselves.
  • The 50 mid-volume merchants want a shared endpoint to amortise the instance cost but can tolerate the occasional cold start. MME is the fit: one endpoint, one container (all XGBoost), 50 model .tar.gz files in S3, routed by TargetModel.
  • The 15 occasional merchants ride the same MME as the middle tier, or sit on a second MME if their traffic pattern is different enough to justify independent scaling. Cold starts are fine for them; they already feel on-demand.

MCE is not the answer for 70 same-framework models. It’s the answer for a handful of different models on the same box.
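The reading above boils down to a two-way split, which can be sketched as a small function. The Merchant fields and the example merchants are hypothetical; in practice the flags would come from traffic data rather than being set by hand.

```python
from dataclasses import dataclass


@dataclass
class Merchant:
    merchant_id: str
    always_on: bool          # traffic around the clock
    latency_sensitive: bool  # cannot absorb a cold start


def pick_shape(merchant: Merchant) -> str:
    """Dedicated endpoint for always-on, latency-sensitive merchants; MME otherwise."""
    if merchant.always_on and merchant.latency_sensitive:
        return "dedicated"
    return "mme"


# Hypothetical sample: one whale, one mid-volume merchant, one occasional merchant.
merchants = [
    Merchant("m01", always_on=True, latency_sensitive=True),
    Merchant("m37", always_on=False, latency_sensitive=False),
    Merchant("m58", always_on=False, latency_sensitive=False),
]
plan = {m.merchant_id: pick_shape(m) for m in merchants}
```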

Three shapes on one wall

[Figure: three endpoint shapes side by side.
Single-model (one URL, one model): InvokeEndpoint(EndpointName); an ml.m5.large running one XGBoost container with model.tar.gz always hot; simplest, one model per bill line.
Multi-model, MME (one URL, one container, many models): InvokeEndpoint(..., TargetModel="m37.tar.gz"); an ml.m5.xlarge running a single XGBoost container with an in-memory LRU cache (m04, m12, m37 hot; m22, m45, m58 cold); artefacts m01.tar.gz … m70.tar.gz under s3://models/prefix/, loaded on first call, LRU eviction.
Multi-container, MCE (one URL, up to 15 containers): InvokeEndpoint(..., TargetContainerHostname="pt"); a shared ml.m5.2xlarge running a TensorFlow container (tf), a PyTorch container (pt), and an XGBoost container (xgb); diverse frameworks, co-tenanted.]
Same endpoint URL on the outside, three different stories inside. Single-model owns the instance; MME shares one container across many model artefacts and LRU-caches the hot set in memory; MCE puts up to 15 heterogeneous containers on the same instance and lets the client pick one by hostname.
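From the client’s side, the only difference between the three shapes is which routing hint rides along with the call. A minimal sketch, assuming the result is passed to boto3’s sagemaker-runtime invoke_endpoint; the helper name, the fraud-mce endpoint name, and the payload bytes are hypothetical.

```python
from typing import Optional


def build_invoke_kwargs(
    endpoint_name: str,
    body: bytes,
    target_model: Optional[str] = None,
    target_container: Optional[str] = None,
) -> dict:
    """Build keyword arguments for an InvokeEndpoint call.

    target_model selects an artefact on a multi-model endpoint;
    target_container selects a container on a multi-container endpoint.
    Leave both unset for a single-model endpoint.
    """
    if target_model and target_container:
        raise ValueError("pick one routing hint, not both")
    kwargs = {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "Body": body,
    }
    if target_model:
        kwargs["TargetModel"] = target_model
    if target_container:
        kwargs["TargetContainerHostname"] = target_container
    return kwargs


payload = b"0.1,0.2"  # hypothetical feature row
single = build_invoke_kwargs("fraud-m01", payload)
mme = build_invoke_kwargs("fraud-mme", payload, target_model="m37.tar.gz")
mce = build_invoke_kwargs("fraud-mce", payload, target_container="pt")
```

Each dict can then be unpacked into the real call, for example boto3.client("sagemaker-runtime").invoke_endpoint(**mme).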

The pick in depth

Multi-model endpoint for the 65 tail merchants. The MME configuration is deceptively small. The container must be built against the SageMaker MME contract: it implements LOAD, UNLOAD, LIST, and INVOKE hooks so that SageMaker can tell it to load a model that SageMaker has downloaded from the S3 prefix, unload one to reclaim memory, list what’s currently loaded, and invoke a specific one. The pre-built XGBoost and scikit-learn inference images ship with these hooks; a custom container needs to implement them, typically with the Multi Model Server (MMS) framework.

Models go to S3 under a single prefix: s3://fraud-models/v2027-09/m01.tar.gz, m02.tar.gz, … m70.tar.gz. Training a new merchant model means uploading a new object; no endpoint redeployment. Invocation names the file: boto3.client('sagemaker-runtime').invoke_endpoint(EndpointName='fraud-mme', TargetModel='m37.tar.gz', Body=...).

Instance sizing drives the cache hit rate. If the working set (the sum of all models called in the last N minutes) exceeds instance memory, SageMaker evicts the least-recently-used model to make room for the new one. Eviction itself is free, but the next call for the evicted model pays the S3-download-plus-deserialise cost. For small XGBoost trees (~10-50MB), 50 models on an ml.m5.xlarge (16GB RAM) fit comfortably; for a fleet of 2GB neural nets, 50 models is nowhere close and the team needs either bigger instances or an acceptance of frequent cold starts. ModelCacheHit and ModelLoadingWaitTime in CloudWatch tell the story; tune instance type until cache hits dominate.
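The sizing argument is plain arithmetic, and a rough sketch makes the gap between the two cases concrete. The usable_fraction headroom is an assumption for illustration (the container and OS need memory too), not a SageMaker figure.

```python
def working_set_gb(model_count: int, model_size_mb: float) -> float:
    """Memory needed to keep every model in the working set loaded, in GB."""
    return model_count * model_size_mb / 1024


def fits(
    model_count: int,
    model_size_mb: float,
    instance_memory_gb: float,
    usable_fraction: float = 0.7,  # assumed headroom, not an AWS number
) -> bool:
    """True if the whole working set fits in the usable share of instance memory."""
    return working_set_gb(model_count, model_size_mb) <= instance_memory_gb * usable_fraction


small_trees = working_set_gb(50, 50)     # 50 XGBoost models at the 50MB upper bound
large_nets = working_set_gb(50, 2048)    # 50 neural nets at 2GB each

small_fit = fits(50, 50, instance_memory_gb=16)
large_fit = fits(50, 2048, instance_memory_gb=16)
```

Roughly 2.4GB against 16GB in the first case, 100GB against 16GB in the second: the first fleet never evicts, the second evicts constantly.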

Autoscaling is based on InvocationsPerInstance. MME scales horizontally: add instances and the load is distributed across them, with each instance independently caching a subset of the models. With a reasonable cache, the working set often distributes itself cleanly (the LRU on each instance keeps the locally hot models, and load balancing tends to give each instance a stable share of traffic).
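The per-instance cache behaviour can be simulated locally with a toy LRU cache. This is a sketch of the idea only, not SageMaker’s implementation: the class, the 100MB capacity, the 40MB model size, and the call trace are all made up for illustration.

```python
from collections import OrderedDict


class ModelCache:
    """Toy LRU cache keyed by model name, bounded by total size in MB."""

    def __init__(self, capacity_mb: int) -> None:
        self.capacity_mb = capacity_mb
        self.loaded: "OrderedDict[str, int]" = OrderedDict()  # name -> size in MB
        self.hits = 0
        self.cold_loads = 0

    def invoke(self, name: str, size_mb: int) -> str:
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return "hit"
        # Evict least-recently-used models until the new one fits.
        while self.loaded and sum(self.loaded.values()) + size_mb > self.capacity_mb:
            self.loaded.popitem(last=False)
        self.loaded[name] = size_mb
        self.cold_loads += 1
        return "cold"


cache = ModelCache(capacity_mb=100)  # room for two 40MB models, not three
trace = ["m04", "m12", "m04", "m37", "m04", "m12"]
results = [cache.invoke(name, size_mb=40) for name in trace]
```

In this trace m04 is called often enough to stay resident, while m12 and m37 keep pushing each other out: the same pattern that makes a whale dominate a shared cache.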

Dedicated endpoints for the five whales. No clever sharing. Each high-volume merchant gets its own single-model endpoint with autoscaling between two and eight instances based on InvocationsPerInstance. The dollar cost is five dedicated endpoints; the win is guaranteed always-hot, guaranteed isolation, and the operational clarity of “this merchant’s problem lives here, not in the shared MME.”
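A hedged sketch of the two-to-eight autoscaling setup, expressed as the request dictionaries for Application Auto Scaling. The helper name, the AllTraffic variant name (assumed to be the SDK default), and the target of 1000 invocations per instance are assumptions to adjust, not values from the team’s configuration.

```python
def build_autoscaling_requests(
    endpoint_name: str,
    variant_name: str = "AllTraffic",  # assumed default variant name
    min_instances: int = 2,
    max_instances: int = 8,
    target_invocations_per_instance: float = 1000.0,  # hypothetical target
) -> tuple:
    """Return (scalable target, scaling policy) request dicts for one endpoint variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    scalable_target = {
        **common,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    scaling_policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-per-instance",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return scalable_target, scaling_policy


target, policy = build_autoscaling_requests("fraud-m01")
```

With boto3, these would be applied as client = boto3.client("application-autoscaling"), then client.register_scalable_target(**target) followed by client.put_scaling_policy(**policy).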

When MCE would be the answer. Take a different scenario: one merchant that needs an XGBoost model for card-present transactions, a PyTorch vision model for document-verification, and a scikit-learn model for velocity checks. Three different frameworks, one merchant’s pipeline, and only 10-50 QPS combined: give each model a dedicated instance and all three sit under-utilised. MCE puts all three containers on one instance, the client’s pipeline code picks the correct container per stage with a header, and the bill is one instance instead of three. That’s MCE’s actual use case: a handful of different models, not 70 of the same.
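For that three-framework scenario, the model definition might look roughly like the request below, built for the SageMaker CreateModel API with direct invocation mode. Everything concrete here is a placeholder: the model name, role ARN, image URIs, bucket, and the xgb/pt/sk hostnames are hypothetical.

```python
from typing import List, Tuple

MAX_CONTAINERS = 15  # per-endpoint container limit quoted in this post


def build_mce_model_request(
    model_name: str,
    role_arn: str,
    containers: List[Tuple[str, str, str]],  # (hostname, image URI, model data URL)
    mode: str = "Direct",
) -> dict:
    """Build a CreateModel request for a multi-container endpoint."""
    if not 1 <= len(containers) <= MAX_CONTAINERS:
        raise ValueError(f"expected 1-{MAX_CONTAINERS} containers, got {len(containers)}")
    if mode not in ("Direct", "Serial"):
        raise ValueError("mode must be 'Direct' or 'Serial'")
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "Containers": [
            {"ContainerHostname": hostname, "Image": image, "ModelDataUrl": model_data}
            for hostname, image, model_data in containers
        ],
        "InferenceExecutionConfig": {"Mode": mode},
    }


request = build_mce_model_request(
    "merchant-pipeline-mce",
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    [
        ("xgb", "<xgboost-image-uri>", "s3://example-bucket/card-present.tar.gz"),
        ("pt", "<pytorch-image-uri>", "s3://example-bucket/doc-verify.tar.gz"),
        ("sk", "<sklearn-image-uri>", "s3://example-bucket/velocity.tar.gz"),
    ],
)
```

The dict is shaped to be passed as boto3.client("sagemaker").create_model(**request); the guard on container count is also why 70 same-framework models cannot take this route.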

A worked deployment

Here’s the Python SDK shape for the 65-merchant MME plus five dedicated endpoints:

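# MME for the tail
import boto3
import sagemaker
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.xgboost import XGBoostModel

# Assumes an execution role is available (e.g. in a SageMaker notebook);
# otherwise set role to the ARN of an IAM role with SageMaker permissions.
role = sagemaker.get_execution_role()

xgb_image = sagemaker.image_uris.retrieve(
    framework="xgboost", region="eu-west-1", version="1.7-1"
)
mme = MultiDataModel(
    name="fraud-tail-mme",
    model_data_prefix="s3://fraud-models/v2027-09/",
    image_uri=xgb_image,
    role=role,
)
mme.deploy(
    initial_instance_count=2,
    instance_type="ml.m5.xlarge",
    endpoint_name="fraud-tail-mme",
)

# Invocation
# feature_csv is a placeholder: one CSV row of features in training column order.
runtime = boto3.client("sagemaker-runtime")
runtime.invoke_endpoint(
    EndpointName="fraud-tail-mme",
    TargetModel="m37.tar.gz",
    ContentType="text/csv",
    Body=feature_csv,
)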

# Dedicated endpoint per whale
for merchant_id in ["m01", "m04", "m12", "m18", "m29"]:
    model = XGBoostModel(
        model_data=f"s3://fraud-models/v2027-09/{merchant_id}.tar.gz",
        role=role,
        entry_point="inference.py",
        framework_version="1.7-1",
    )
    model.deploy(
        initial_instance_count=2,
        instance_type="ml.m5.large",
        endpoint_name=f"fraud-{merchant_id}",
    )

Day-two operations: adding a new merchant to the tail MME is aws s3 cp m71.tar.gz s3://fraud-models/v2027-09/ and setting up routing in the client; no endpoint redeployment. Promoting a merchant from tail to whale status is spinning up a new dedicated endpoint and flipping the client’s routing; the model object can literally be the same S3 file.
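The client-side routing mentioned above can be as small as a set of promoted merchants. A sketch under the naming used in the deployment code (fraud-tail-mme, fraud-{merchant_id}, {merchant_id}.tar.gz); the route helper itself is hypothetical.

```python
TAIL_ENDPOINT = "fraud-tail-mme"

# Merchants with their own dedicated endpoint; everyone else rides the MME.
dedicated = {"m01", "m04", "m12", "m18", "m29"}


def route(merchant_id: str, dedicated_merchants: set) -> dict:
    """Return the endpoint-selection arguments for one merchant's request."""
    if merchant_id in dedicated_merchants:
        return {"EndpointName": f"fraud-{merchant_id}"}
    return {
        "EndpointName": TAIL_ENDPOINT,
        "TargetModel": f"{merchant_id}.tar.gz",
    }


before = route("m37", dedicated)  # tail merchant, served by the MME
dedicated.add("m37")              # promotion, once fraud-m37 is deployed
after = route("m37", dedicated)   # now served by its own endpoint
```

A new tail merchant such as m71 needs no code change at all: once the artefact is uploaded, route("m71", dedicated) already points at the MME with the right TargetModel.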

What’s worth remembering

  1. Three shapes, three sharing stories. Single-model shares nothing. MME shares a container and an instance across many models. MCE shares an instance across up to 15 heterogeneous containers.
  2. MME is the many-of-the-same pattern. One container, many artefacts in S3, LRU-cached in memory, client picks via TargetModel. Scales to thousands of models per endpoint; cost model is “pay for hot-set memory, pay S3 bytes for cold loads.”
  3. MCE is the handful-of-different pattern. Up to 15 containers on one instance, client picks via TargetContainerHostname. Different frameworks co-tenanted. Not a 70-model pattern.
  4. MCE also does serial pipelines. Containers configured in serial mode run as a chain, in order: preprocessing, model, postprocessing. Same instance, same call, different stages.
  5. Cold starts are the MME trade. First call for an unloaded model pays S3-download-plus-deserialise latency. Fine for long-tail merchants; not fine for whales. ModelCacheHit and ModelLoadingWaitTime are the metrics to watch.
  6. Instance memory sizes the cache. MME tuning is instance-family choice and instance count. If the working set fits, hit rate approaches 100%; if it doesn’t, tune up the instance or accept the cold-start rate.
  7. Dedicated endpoints for the whales. High-traffic, latency-sensitive models that would dominate an MME’s cache anyway are cleaner on their own endpoint. Don’t over-share; the goal is matching traffic pattern to shape, not minimising endpoint count.
  8. Autoscaling scales the instance count, not the model count. MME spreads models across instances via caching; adding instances means each instance caches its locally hot set. Scale on InvocationsPerInstance (or custom CloudWatch metrics), not on model count.

Seventy models, one endpoint is usually the correct shape, but only when all seventy share a framework and the traffic has the long tail that MME caching assumes. Five whales and sixty-five tail merchants want two shapes, not one; pairing each merchant with the endpoint shape that fits its rhythm is the actual work.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.