How to Serve Thousands of Per-Tenant Models on One Endpoint

June 05, 2028 · 14 min read

ML Engineer · MLA-C01 · part of The Exam Room

The situation

A B2B SaaS team has decided to build per-tenant churn models rather than one global model. Training has produced 4,000 XGBoost artifacts, one per customer account, averaging ~45 MB each (~180 GB of model data in total). Each tenant’s churn endpoint serves ~20 inference requests per day, but the distribution is uneven: the 500 largest tenants account for 60% of total traffic, and the 2,000 smallest tenants account for 15%.

Deployment constraints:

  • Latency target: p95 under 300 ms for tenants whose model is already loaded; p95 under 3 seconds for first-call-today tenants (cold path).
  • Cost target: the team budgeted $3,000/month for the entire fleet.
  • Operational simplicity: they don’t want to run 4,000 CloudFormation stacks.

If we deploy each model to its own endpoint at ml.m6i.large minimum capacity (1 instance at ~$90/month), that’s $360,000/month, 120× the budget. Even if we consolidate groups of 100 models per endpoint via shared inference code, we’d still need 40 endpoints at ~$90 each plus orchestration. There’s clearly an opportunity to amortise instances across models; the question is how.
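
The baseline arithmetic above can be checked in a few lines. This is only a sketch using the figures quoted in the scenario; the function and constant names are illustrative, not part of any AWS tooling.

```python
# Figures taken from the scenario above.
MODEL_COUNT = 4_000
ENDPOINT_COST_PER_MONTH = 90   # USD, one ml.m6i.large at minimum capacity
BUDGET_PER_MONTH = 3_000       # USD


def dedicated_cost(endpoints: int, cost_each: float = ENDPOINT_COST_PER_MONTH) -> float:
    """Monthly cost when every endpoint keeps one instance running."""
    return endpoints * cost_each


per_model = dedicated_cost(MODEL_COUNT)        # one endpoint per model
grouped = dedicated_cost(MODEL_COUNT // 100)   # 100 models per endpoint

print(f"per-model endpoints: ${per_model:,.0f}/month "
      f"({per_model / BUDGET_PER_MONTH:.0f}x budget)")
print(f"100 models/endpoint: ${grouped:,.0f}/month before orchestration")
```

Even the grouped variant lands above the $3,000 budget before any orchestration cost is counted.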

What actually matters

Before reaching for a deployment pattern, it’s worth being explicit about what governs serving cost and tail latency when many small models share infrastructure.

The first thing worth thinking about is traffic shape per model. 4,000 tenants × 20 requests/day each is 80,000 requests/day, which is roughly one request per second across the whole fleet. Each individual model is touched on average once every 72 minutes (1,440 minutes ÷ 20 requests); the long tail of small tenants is touched a few times a day at most. That’s a profile where dedicated capacity per model is almost entirely idle; the cost is in keeping infrastructure warm, not in serving requests.

The second is the working set vs the catalogue. At any given moment, only some fraction of the 4,000 models is being actively used. The 500 largest tenants drive 60% of traffic and their models are essentially always hot. The 1,500 mid tenants come and go through the day. The 2,000 smallest tenants are mostly cold. A serving pattern that keeps the hot subset resident in memory and pays a load tax for the cold ones is structurally a much better fit than one that gives every model equal memory weight.
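
To make the traffic shape concrete, here is a rough per-tier breakdown derived from the scenario's numbers. The 25% share for the mid tier is inferred as the remainder of the stated 60% and 15%, and spreading traffic evenly within each tier is a simplifying assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TENANTS = 4_000
REQUESTS_PER_TENANT_PER_DAY = 20

fleet_per_day = TENANTS * REQUESTS_PER_TENANT_PER_DAY
fleet_per_second = fleet_per_day / SECONDS_PER_DAY

# (tenant count, share of total traffic); the mid tier is the remainder.
tiers = {
    "largest 500": (500, 0.60),
    "mid 1,500": (1_500, 0.25),
    "smallest 2,000": (2_000, 0.15),
}

print(f"fleet: {fleet_per_day:,} requests/day, {fleet_per_second:.2f} requests/second")
for name, (count, share) in tiers.items():
    per_tenant = fleet_per_day * share / count   # average requests per tenant per day
    gap_minutes = 24 * 60 / per_tenant           # average time between calls
    print(f"{name}: {per_tenant:.1f} requests/tenant/day, one every {gap_minutes:.0f} min")
```

The gap between calls is what decides whether a model is still resident when its next request arrives.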

The third is cold-start tolerance. The SLA distinguishes between warm-path (300 ms) and first-call-today (3 seconds). That’s a strong hint: the design has explicit budget for paging a model in on first use. Any pattern that takes longer than that on cold path, think tens of seconds, doesn’t fit, regardless of cost.

The fourth is the cost driver. Per-model dedicated capacity scales cost linearly with model count, which is exactly what blew the budget. Anything that scales cost with traffic (instance-hours that serve all tenants together, or per-millisecond billing on demand) decouples the bill from the model count and makes the budget tractable.

The fifth is operational weight per model. 4,000 deployments means 4,000 things that can fail to deploy, drift, or expire credentials. A pattern that lets new models appear by dropping a tarball into S3 has near-zero per-model deployment overhead; a pattern that needs a CloudFormation stack per model has linear ops cost on top of the linear infra cost.

The sixth is isolation vs density. Dedicated capacity per model gives perfect tenant isolation: a misbehaving model can’t evict another tenant’s model, and the blast radius of a bad inference container is one model. Shared capacity trades that isolation for density. For 4,000 churn models that all use the same XGBoost container and serve internal scoring (not customer-facing per-request latency), shared blast radius is acceptable; for tenant-facing real-time scoring with hard SLAs, it might not be.

What we’ll filter on

  1. Supports thousands of models: does the pattern scale to our count?
  2. Cold-start latency acceptable: does first-call latency meet the SLA?
  3. Fits memory: is the working set plus overhead smaller than instance RAM?
  4. Cost scales to traffic, not model count: do we pay per endpoint or per request?
  5. Per-model deployment overhead: how much CloudFormation / CI per new model?

The many-model deployment landscape

1. One endpoint per model. Each model gets its own SageMaker endpoint, minimum one instance. Complete isolation, simplest latency story, completely blows the budget for large model counts.

2. Shared endpoint, one model. One endpoint serving one shared model for everyone. Doesn’t work for per-tenant personalisation; the whole point is different weights per tenant.

3. Multi-Model Endpoint (MME). One endpoint, thousands of models, loaded from S3 on demand. Shared instances, shared container, per-model TargetModel routing. Specifically designed for this shape.

4. Inference Pipeline Endpoint. A chain of containers in one endpoint for multi-step inference (preprocessing → model → postprocessing). Not relevant to the many-models case; it’s about composing steps, not hosting many models.

5. SageMaker Serverless Inference with per-model endpoints. Serverless scales to zero between calls; each tenant’s endpoint costs nothing when idle. Cold-start on serverless is longer than MME (several seconds for CPU models, tens of seconds for large ones) and the 4 MB payload cap applies. Works for small models with infrequent traffic; doesn’t scale well to 4,000 endpoints operationally.

6. Self-hosted model server on EC2 / EKS. Bring your own Kubernetes deployment of TorchServe / Triton / TGI; handle model loading/eviction yourself. Maximum flexibility, maximum operational weight. For 4,000 simple XGBoost models, MME is strictly better value.

Side by side

Option                 | Many models           | Cold-start       | Fits memory   | Cost basis        | Per-model overhead
One endpoint per model | ✓ (expensive)         | None after first | Each instance | Flat per endpoint | High
MME                    | ✓ native              | 100 ms - seconds | Shared cache  | Per instance-hour | Near-zero
Serverless per model   | ✓ operationally heavy | Seconds          | n/a           | Per millisecond   | High
Custom on EKS          |                       | Tunable          | Tunable       | Per instance-hour | Very high

For 4,000 models at ~1 request/second total and $3,000/month budget, MME is the only option in the attribute table that fits all the constraints without compromises.

Sizing the MME

Figure: MME sizing at a glance.

Instance memory layout (ml.m6i.2xlarge, 32 GB):

  • Overhead, ~10 GB. Container + OS: Multi-Model Server, Python base image, inference libs.
  • Hot set, ~12 GB. The 500 top tenants: 60% of daily traffic stays cached, no cold start.
  • Warm set, ~7 GB. About 150 rotating models: LRU eviction as new tenants arrive.
  • Free, ~3 GB.

Working set vs. hit rate (cache capacity in models → steady-state hit rate):

  • 100 loaded → 42%
  • 500 loaded → 80%
  • 700 loaded → 91%
  • 4,000 loaded → 100% (no eviction)

Chosen sizing (MME configuration):

  • ml.m6i.2xlarge × 2 (min), autoscale to 4 on InvocationsPerInstance
  • 4,000 models in S3 prefix, TargetModel routing
  • ~$620/month vs $360,000 per-endpoint baseline; well inside the $3,000 budget

Expected behaviour:

  • Hot 500 → always cached; 60% of traffic at p95 ≈ 120 ms
  • Warm ~150 → mostly cached; 25% of traffic at p95 ≈ 180 ms
  • Cold ~3,350 → load on first call; 15% of traffic at p95 ≈ 2 s
  • Meets both SLAs (<300 ms hot, <3 s cold)

What to watch:

  • ModelCacheHit CloudWatch metric
  • ModelLoadingWaitTime latency
  • LoadedModelCount per instance
  • S3 egress cost (model loads)

If hit rate drops, either:

  • add an instance (more cache)
  • size up (more per-instance cache)
  • warm-load at startup (for key tenants)

Memory is partitioned between container overhead, hot-model cache, and a small free buffer. The hit-rate curve shows diminishing returns past the working set size; the chosen configuration gets high hit rate for the traffic-weighted majority of requests.
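
The diminishing-returns shape of the hit-rate curve can be explored with a toy LRU simulation. This is a sketch under loud assumptions: traffic is drawn from the three tiers in the scenario (60% / 25% / 15%), tenants within a tier are equally likely, and a single cache serves all requests. It is not meant to reproduce the exact percentages in the figure, only the trend.

```python
import random
from collections import OrderedDict


def simulate_hit_rate(capacity: int, requests: int = 80_000, seed: int = 0) -> float:
    """Replay synthetic traffic through one LRU cache and return the hit rate."""
    rng = random.Random(seed)
    # (first tenant id, tenant count, traffic share), taken from the scenario.
    tiers = [(0, 500, 0.60), (500, 1_500, 0.25), (2_000, 2_000, 0.15)]
    weights = [share for _, _, share in tiers]
    cache = OrderedDict()
    hits = 0
    for _ in range(requests):
        start, count, _ = rng.choices(tiers, weights=weights)[0]
        tenant = start + rng.randrange(count)   # uniform within a tier (assumption)
        if tenant in cache:
            hits += 1
            cache.move_to_end(tenant)           # mark as most recently used
        else:
            cache[tenant] = None                # stands in for "load from S3"
            if len(cache) > capacity:
                cache.popitem(last=False)       # evict the least recently used model
    return hits / requests


for capacity in (100, 500, 700, 4_000):
    print(f"{capacity:>5} models cached -> hit rate {simulate_hit_rate(capacity):.0%}")
```

Swapping in a real invocation log for the synthetic draw would give a sizing estimate grounded in actual tenant behaviour rather than tier averages.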

The pick in depth

MME on two ml.m6i.2xlarge instances, autoscale to four, storing 4,000 models in a shared S3 prefix.

The deployment steps:

  1. Upload model artifacts to s3://churn-mme/models/<tenant_id>.tar.gz. Each tarball contains the XGBoost model and any tenant-specific preprocessing artifacts.
  2. Build an MME-compatible inference container. SageMaker’s Multi-Model Server (MMS) is the default; for XGBoost, the built-in XGBoost DLC supports multi-model mode via MODEL_STORE=/opt/ml/models and the appropriate entrypoint.
  3. Create the model and endpoint config:
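    import sagemaker
    from sagemaker import image_uris
    from sagemaker.multidatamodel import MultiDataModel
    from sagemaker.predictor import Predictor

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()  # assumes this runs with a SageMaker execution role
    # The container version is an assumption; pin the version the models were trained with.
    xgboost_image_uri = image_uris.retrieve('xgboost', session.boto_region_name, version='1.7-1')

    model = MultiDataModel(
        name='churn-mme',
        model_data_prefix='s3://churn-mme/models/',  # every tarball under this prefix is servable
        image_uri=xgboost_image_uri,
        role=role,
        predictor_cls=Predictor,  # so that deploy() hands back a Predictor
    )
    predictor = model.deploy(
        initial_instance_count=2,
        instance_type='ml.m6i.2xlarge',
        endpoint_name='churn-mme-prod',
    )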
    
  4. Invoke with TargetModel:
    predictor.predict(
        data={'features': [...]},
        target_model='tenant-00473.tar.gz'
    )
    
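  5. Configure autoscaling on InvocationsPerInstance (the SageMakerVariantInvocationsPerInstance predefined metric); ApproximateBacklogSizePerInstance applies only to asynchronous endpoints. The metric counts invocations per instance per minute, so the starting target of 100 invocations/second/instance is a target value of 6,000.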
  6. CloudWatch alarms on:
    • ModelCacheHit < 0.75 sustained (cache is too small)
    • ModelLoadingWaitTime p99 > 3 seconds (S3 loads are slow or models are large)
    • MemoryUtilization > 85% (nearing eviction threshold)
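
Steps 5 and 6 could be wired up roughly as follows. This is a sketch, not the team's actual configuration: the policy and alarm names are hypothetical, the variant name AllTraffic assumes the SageMaker Python SDK default, and reading "sustained" as three 5-minute periods is an assumption. The request dictionaries are built by plain functions so they can be inspected before anything is sent to AWS.

```python
ENDPOINT = "churn-mme-prod"
VARIANT = "AllTraffic"   # assumption: default variant name used by the SageMaker Python SDK
RESOURCE_ID = f"endpoint/{ENDPOINT}/variant/{VARIANT}"
DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def scalable_target() -> dict:
    """Arguments for Application Auto Scaling register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": RESOURCE_ID,
        "ScalableDimension": DIMENSION,
        "MinCapacity": 2,
        "MaxCapacity": 4,
    }


def scaling_policy(target_per_minute: float) -> dict:
    """Target-tracking policy on invocations per instance per minute."""
    return {
        "PolicyName": "churn-mme-invocations",   # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": RESOURCE_ID,
        "ScalableDimension": DIMENSION,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_per_minute,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }


def cache_hit_alarm() -> dict:
    """Arguments for CloudWatch put_metric_alarm on ModelCacheHit < 0.75."""
    return {
        "AlarmName": "churn-mme-cache-hit-low",   # hypothetical name
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelCacheHit",
        "Dimensions": [
            {"Name": "EndpointName", "Value": ENDPOINT},
            {"Name": "VariantName", "Value": VARIANT},
        ],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,   # "sustained" read as 15 minutes (assumption)
        "Threshold": 0.75,
        "ComparisonOperator": "LessThanThreshold",
    }


def apply() -> None:
    """Send the configuration to AWS; needs credentials and a live endpoint."""
    import boto3  # imported here so the builders above work without boto3 installed

    autoscaling = boto3.client("application-autoscaling")
    autoscaling.register_scalable_target(**scalable_target())
    autoscaling.put_scaling_policy(**scaling_policy(100.0 * 60))
    boto3.client("cloudwatch").put_metric_alarm(**cache_hit_alarm())
```

The ModelLoadingWaitTime and MemoryUtilization alarms would follow the same shape; check each metric's namespace and unit in the CloudWatch console before setting thresholds.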

The warm-start optimisation. For the hot 500 tenants (60% of traffic), a scheduled dummy call keeps them in cache. A Lambda fires at 00:15 UTC each day and invokes the endpoint once for each of the top-500 tenants with a zero-impact probe. Most of those models would be loaded anyway; for the small minority that would not be, first-call-today latency drops from the cold path to the hot path.
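
A minimal sketch of such a warm-up Lambda is below. The event fields tenant_ids and probe_csv are hypothetical (the scheduler would have to supply them), and text/csv is assumed as the content type the container accepts. The runtime client is passed in so the loop can be exercised without AWS access.

```python
ENDPOINT = "churn-mme-prod"


def warm_models(runtime, tenant_ids, probe_body: bytes) -> dict:
    """Invoke the endpoint once per tenant so each model gets paged in."""
    failures = []
    for tenant_id in tenant_ids:
        try:
            runtime.invoke_endpoint(
                EndpointName=ENDPOINT,
                TargetModel=f"{tenant_id}.tar.gz",   # key relative to the S3 model prefix
                ContentType="text/csv",              # assumption about the container input
                Body=probe_body,
            )
        except Exception as exc:
            # Keep going: one broken model should not stop the rest from warming.
            failures.append({"tenant": tenant_id, "error": str(exc)})
    return {"warmed": len(tenant_ids) - len(failures), "failures": failures}


def handler(event, context):
    """Lambda entry point, triggered by a daily schedule."""
    import boto3  # imported lazily so warm_models can be tested offline

    runtime = boto3.client("sagemaker-runtime")
    tenant_ids = event["tenant_ids"]       # hypothetical: list of top-500 tenant ids
    probe = event["probe_csv"].encode()    # hypothetical: one valid feature row
    return warm_models(runtime, tenant_ids, probe)
```

One caveat: a single probe loads the model on whichever instance receives the call, so with two or more instances the other caches may still be cold for that tenant.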

The cold tenants. The 2,000 smallest tenants average about six calls a day each (15% of 80,000 requests spread over 2,000 tenants), typically hours apart. Their model is essentially always loaded fresh from S3. The p95 for those is ~2 seconds, within the 3-second cold-path SLA. If that becomes a problem, the answer is either shrinking the models (smaller XGBoost via max_depth reduction) or pre-caching a per-tenant-lite model that loads faster.

A worked inference trace

Tenant acme-007 lands a churn-scoring request at 14:23:08.

  1. The caller (a Lambda in the CRM integration) calls predictor.predict(data={...}, target_model='tenant-acme-007.tar.gz'). The SageMaker runtime routes the call to one of the two instances, preferring one that already has the model loaded.
  2. Instance checks its local cache. tenant-acme-007.tar.gz is not loaded, acme-007 is in the 2,000 cold tenants and hasn’t been called today.
  3. Instance downloads s3://churn-mme/models/tenant-acme-007.tar.gz (45 MB). S3 download completes in ~600 ms.
  4. Instance loads the XGBoost model into memory (~40 ms). Total model-load wait: 640 ms.
  5. Instance runs inference (8 ms). Returns the prediction.
  6. Caller receives the response. Total round-trip: ~720 ms. Under the 3-second cold-path SLA.

Ten minutes later, another acme-007 call arrives. Cache hit. Inference: 12 ms. Response: ~60 ms round-trip including network. Under the 300 ms hot-path SLA.

Two hours later, with 150 unique tenants called in the interim and memory pressure rising on instance 2, acme-007’s model is evicted via LRU. Next call: back to ~720 ms.
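
The latency figures in the trace add up as follows. The numbers are copied from the steps above; the gap between server-side time and the caller's round-trip is attributed here to network and runtime overhead, which is an interpretation rather than something the trace states.

```python
COLD_SLA_MS = 3_000
HOT_SLA_MS = 300

cold_steps_ms = {"s3_download": 600, "model_load": 40, "inference": 8}
model_load_wait_ms = cold_steps_ms["s3_download"] + cold_steps_ms["model_load"]
cold_server_ms = sum(cold_steps_ms.values())

cold_round_trip_ms = 720   # as seen by the caller on the first call
hot_round_trip_ms = 60     # as seen by the caller on the cached call

print(f"model-load wait: {model_load_wait_ms} ms")
print(f"cold server-side total: {cold_server_ms} ms, "
      f"other overhead: {cold_round_trip_ms - cold_server_ms} ms")
print(f"cold headroom: {COLD_SLA_MS - cold_round_trip_ms} ms, "
      f"hot headroom: {HOT_SLA_MS - hot_round_trip_ms} ms")
```

The S3 download dominates the cold path, which is why shrinking the artifact is the first lever if the cold SLA ever tightens.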

Meanwhile, the 500 hot tenants are all in cache on both instances most of the time; their calls are all hot-path. The bill at the end of the month: $620 for compute + ~$180 for S3 egress (model loads) + ~$30 for CloudWatch = about $830/month. Under budget by $2,170.
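
A quick tally of the bill, using only the line items quoted above (all of them approximate):

```python
BUDGET = 3_000                 # USD per month
PER_ENDPOINT_BASELINE = 360_000

bill = {"compute": 620, "s3_model_loads": 180, "cloudwatch": 30}
total = sum(bill.values())

print(f"total: ${total}/month, headroom: ${BUDGET - total}")
print(f"baseline is about {PER_ENDPOINT_BASELINE / total:.0f}x more expensive")
```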

What’s worth remembering

  1. MMEs host thousands of models on one endpoint. Models live in a shared S3 prefix; the container loads each on demand via TargetModel. One endpoint, one deployment, one bill.
  2. Instance memory is the cache; size it to the working set. Start with the memory budget = total active models × model size × 2 (rough headroom factor). The LRU cache silently evicts; good hit rate depends on the working set fitting.
  3. Hit rate drives latency. Cached models serve at inference speed; uncached models pay S3 download + load time as cold-start overhead. Target hit rate > 85% for the majority of traffic.
  4. Cold-start is per-instance, per-model. Each instance has its own cache, so the same model may load multiple times across an autoscaled fleet. A few larger instances fit better than many small ones, because each small instance can cache only a fraction of the working set.
  5. Autoscaling adds cache space and throughput simultaneously. Two m6i.2xlarge instances hold two separate LRU caches; adding a third adds another 32 GB of instance memory for caching and 50% more RPS capacity.
  6. CloudWatch metrics tell the story. ModelCacheHit, ModelLoadingWaitTime, LoadedModelCount, MemoryUtilization. Alarms on cache-hit and model-loading-wait predict latency regressions before users notice.
  7. Warm-loading key models is a cheap optimisation. A scheduled Lambda that invokes the endpoint for the top-N tenants each day ensures they’re always hot. Negligible cost, meaningful latency win.
  8. MME is orthogonal to inference type. Real-time MME is the common case; async MME exists for long-running per-model inferences. Same mechanism, different endpoint type.
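
The rule of thumb in point 2 can be written as a one-line helper. This is a rough sketch: the active-model count of 650 is taken from the hot and warm sets in the sizing figure, and the function name is illustrative.

```python
def memory_budget_gb(active_models: int, model_size_mb: float, headroom: float = 2.0) -> float:
    """Rule-of-thumb cache budget: active models x model size x headroom factor."""
    return active_models * model_size_mb * headroom / 1_000


active = 500 + 150        # hot set plus rotating warm set
budget = memory_budget_gb(active, model_size_mb=45)
fleet_ram_gb = 2 * 32     # two ml.m6i.2xlarge instances

print(f"budget {budget:.1f} GB vs fleet RAM {fleet_ram_gb} GB")
```

Because each instance keeps its own cache and also carries container overhead, the fleet total is an upper bound on usable cache rather than a pooled figure.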

The endpoint-per-model pattern is the obvious thing to try and the wrong thing to ship when the model count is high and per-model traffic is low. Multi-Model Endpoint is the pattern AWS built for this shape, and it collapses 4,000 deployments into one, 4,000 bills into one, and $360,000 into $830. The work is sizing the instance memory against the actual working set of models and watching hit rate once it’s live. The rest is the same as any SageMaker endpoint: autoscaling, alarms, the usual plumbing.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.