Hosting a 70B LLM with Business-Hours Traffic

February 23, 2028 · 19 min read

The situation

A B2B SaaS platform ships a writing assistant: a fine-tuned Llama 3 70B that rewrites draft text to match each tenant’s house voice. Observed traffic over three months:

Overnight (00:00-08:00): single-digit RPS, often zero for stretches.
Morning ramp (09:00-11:00): climbs to ~15 RPS.
Lunchtime peak (12:00-14:00): 40 RPS bursts, averaging 25 RPS.
Afternoon tail (14:00-18:00): 10-20 RPS.
Evening (after 19:00): back to near-zero.

That’s a 12:1 peak-to-trough ratio. Roughly 85% of the day’s requests fit inside an eight-hour window.

The team needs to serve their own fine-tuned Llama 3 70B weights (Hugging Face Llama 3 70B Instruct base with LoRA adapters merged into a full checkpoint), hit P95 latency under 3 s for typical prompts (~800 input tokens, ~300 output tokens), pay little for idle overnight capacity, and keep operational overhead minimal: there are three MLEs, none of whom want to own a TGI cluster. Deployment is from us-east-1, with a DR plan that can lift to us-west-2.

What actually matters

Before picking a hosting path, it’s worth being explicit about what’s actually on trial.

The first thing is the model itself. This isn’t an off-the-shelf foundation model we want to call; it’s a merged checkpoint the team has trained. Any service that doesn’t accept arbitrary weights is out, no matter how attractive the rest of its shape is.

The second thing is the economic shape of the traffic. A 12:1 peak-to-trough ratio, with 85% of traffic in an 8-hour window, is not a candidate for flat capacity. Running the peak all day pays for roughly two idle instance-hours for every productive one: on the figures costed later in this post, a flat five-instance fleet is 120 instance-hours a day against about 40 that the curve actually needs. The hosting mode needs a story for “we want capacity now” and “we don’t want to pay for it later,” which points squarely at auto-scaling rather than a fixed commitment. The CFO’s specific complaint is about paying for idle capacity, which is the polite way to say “if we pick a flat-capacity option, we will get asked about the bill again.”

The third thing is how much containerisation pain we want to own. A fine-tuned 70B model is a substantial artefact to serve: tensor-parallelism across multiple GPUs, KV cache warm-up, inference-server tuning, health-check timeouts long enough that model loading doesn’t kill the instance before it’s ready. A team with a dedicated serving engineer can own that stack and tune it; a team of three MLEs shipping a product cannot afford the weeks that entails. Any path that delegates those choices to a pre-tested container is buying real time.

The fourth is regional footprint. Serving from us-east-1 is fine, and us-west-2 for DR is fine. But the product roadmap implies EU and APAC expansion eventually, and a service available in only two regions today will become the reason for a re-platforming conversation later. It’s not decisive now (none of the paths fail on us-east-1), but it’s a future cost to weigh.

The fifth is cold-start behaviour. Because traffic is near-zero overnight, whatever we pick needs an honest answer to “what happens when the first 09:00 request arrives?” A service that goes cold after idle and spends minutes loading a 70B model from object storage gives us an embarrassing morning latency spike. Holding one warm instance through the night costs an instance-hour times however many hours are quiet. Scale-to-zero is a tempting third option that works well for workloads with true long idle stretches (weekends, dev tiers) and badly for workloads with constant low-rate drips that keep the endpoint technically-non-idle.

The sixth is observability. A production endpoint needs invocation latency, error rate, and throughput on dashboards out of the box, plus the ability to alarm on them. Different hosting paths plug into different telemetry stories; the team’s existing dashboards are easier to extend than to duplicate.
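To make the alarm side concrete for one of the candidate paths, here is a minimal sketch of a CloudWatch alarm on p95 model latency for a SageMaker-hosted endpoint. It assumes the endpoint and variant names used in this post's worked example; the alarm name, the five-minute evaluation window, and the use of ModelLatency (reported in microseconds, and excluding network overhead) as a stand-in for the 3 s target are all assumptions rather than the team's actual setup.

```python
def p95_latency_alarm_kwargs(endpoint_name, variant_name, threshold_seconds=3.0):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"{endpoint_name}-p95-model-latency",  # hypothetical name
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",  # SageMaker reports this in microseconds
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "ExtendedStatistic": "p95",
        "Period": 60,
        "EvaluationPeriods": 5,  # assumption: alarm after five bad minutes
        "Threshold": threshold_seconds * 1_000_000,  # seconds -> microseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # quiet overnight minutes are fine
    }


def create_latency_alarm(cloudwatch_client, endpoint_name, variant_name):
    """Create the alarm; pass boto3.client("cloudwatch") in real use."""
    kwargs = p95_latency_alarm_kwargs(endpoint_name, variant_name)
    return cloudwatch_client.put_metric_alarm(**kwargs)
```

Keeping the argument builder separate from the API call makes the thresholds easy to review without touching AWS.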

And the seventh is blast radius when we need to iterate. If we retrain the model next month, how do we ship it? Shadow deployment, canary routing, blue/green: which are native to the hosting path, and which are things we’ll code ourselves?

What we’ll filter on

Auto-scaling that can shrink at the trough: 12:1 without proportional capacity response is 12x the floor bill all night.
Not provisioned-only: the shape of the traffic demands responsive capacity, not a flat commitment.
Supports our fine-tuned weights: not a base model, not a provider-hosted fine-tune.
Low operational overhead: no container images to build, no inference-server tuning, no GPU-memory audits.
Sensible cold-start posture: warm enough at 09:00 that the first request doesn’t sit through a weight-load.

The Llama-3-70B hosting landscape

Bedrock Custom Model Import (CMI). Bedrock’s path for bringing your own trained weights. Supports Llama 2/3/3.1/3.2, Mistral, Mixtral, Flan-T5 and a handful more. Upload checkpoint to S3, call CreateModelImportJob, invoke via InvokeModel. Regional availability: us-east-1 and us-west-2 only. Capacity via Custom Model Units (per-minute while active, 5-minute billing minimum, cold-start after idle). Zero container work.

SageMaker JumpStart. SageMaker’s catalogue of foundation models with pre-baked containers. Llama 3 70B Instruct is a JumpStart model. Supports domain adaptation and instruction fine-tuning; the resulting artefact deploys to a real-time endpoint via JumpStart’s LMI container. Real-time endpoints support Application Auto Scaling: target tracking on SageMakerVariantInvocationsPerInstance, CPU, or GPU utilisation, with configurable MinCapacity / MaxCapacity.

SageMaker endpoint from a Hugging Face container (BYO). Pull Hugging Face TGI or LMI DeepSpeed/vLLM, configure for Llama-3-70B, point at the checkpoint in S3, deploy on ml.g5.48xlarge or ml.p4d.24xlarge. Auto-scaling is identical to JumpStart’s. Full control over tensor-parallel degree, quantisation, batching, logging, and full responsibility for all of them.

Bedrock native foundation models. Mentioned for contrast: Bedrock’s catalogue serves base or provider-fine-tuned variants, not arbitrary customer weights.

Side by side

Option	Auto-scales	Not provisioned-only	Serves our fine-tune	Low ops	Cold-start sensible
Bedrock Custom Model Import	✗	partial	✓	✓	✗ (cold after idle)
SageMaker JumpStart	✓	✓	✓	✓	✓ (MinCapacity=1)
SageMaker BYO container	✓	✓	✓	✗	✓ (same as JumpStart)
Bedrock native foundation	✓	✓	✗	✓	✓

CMI serves the weights with no container work, but its capacity model is Custom Model Units rather than request-rate-triggered auto-scaling: there is no TargetTrackingScalingPolicy, and cold-start latency after idle windows is the visible symptom. JumpStart ticks every box when the fine-tune runs through its path: pre-tested LMI container, standard real-time endpoint, full Application Auto Scaling. BYO can serve the workload; what it fails on is operational overhead. Bedrock native fails on serving fine-tuned weights.
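The comparison can also be written down as data, which makes the elimination explicit. This is only a sketch of the table above: the field names are made up for illustration, and CMI's "not provisioned-only" cell is treated as a fail here, which is an assumption (the table leaves that cell ambiguous, and CMI is out on two other filters either way).

```python
FILTERS = [
    "auto_scales",
    "not_provisioned_only",
    "serves_our_fine_tune",
    "low_ops",
    "cold_start_sensible",
]

# One row per hosting option, one boolean per filter (names are illustrative).
OPTIONS = {
    "Bedrock Custom Model Import": dict(
        auto_scales=False,
        not_provisioned_only=False,  # assumption: ambiguous cell treated as fail
        serves_our_fine_tune=True,
        low_ops=True,
        cold_start_sensible=False,
    ),
    "SageMaker JumpStart": dict(
        auto_scales=True, not_provisioned_only=True, serves_our_fine_tune=True,
        low_ops=True, cold_start_sensible=True,
    ),
    "SageMaker BYO container": dict(
        auto_scales=True, not_provisioned_only=True, serves_our_fine_tune=True,
        low_ops=False, cold_start_sensible=True,
    ),
    "Bedrock native foundation": dict(
        auto_scales=True, not_provisioned_only=True, serves_our_fine_tune=False,
        low_ops=True, cold_start_sensible=True,
    ),
}


def failed_filters(option):
    """Return the filters an option does not satisfy."""
    return [name for name in FILTERS if not OPTIONS[option][name]]


def survivors():
    """Return the options that pass every filter."""
    return [option for option in OPTIONS if not failed_filters(option)]


for option in OPTIONS:
    print(option, "->", failed_filters(option) or "passes all filters")
```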

Matching the shape to the service

JumpStart is where the traffic curve and the team's capacity to operate it meet. CMI wins steady low-to-medium workloads where the Bedrock API matters more than bursty-curve efficiency. BYO is for teams with a dedicated serving engineer.

JumpStart, in depth

JumpStart is a model zoo with deployment glue. For any supported model (Llama 3 variants among many), it ships a Model ID (e.g. meta-textgeneration-llama-3-70b-instruct), a pre-tested LMI container image (AWS’s DJL Serving with vLLM and TensorRT-LLM back-ends), default deployment configuration, and an optional fine-tuning path.

Fine-tuning. JumpStartEstimator supports domain adaptation (continued pre-training on unlabelled text) and instruction fine-tuning (SFT on prompt-completion pairs). The style-guide corpus sits in S3 as JSONL; an estimator call runs the fine-tune on an ml.p4d.24xlarge; the artefact lands back in S3.

Deployment. JumpStartModel(model_id=..., model_version=...).deploy(...) creates the Model, EndpointConfig, and Endpoint in one call. Default instance for Llama 3 70B Instruct is ml.g5.48xlarge (8 × A10G, 192 GB GPU memory) or ml.p4d.24xlarge (8 × A100, 320 GB). G5 is meaningfully cheaper and generally sufficient for Instruct workloads.
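A back-of-envelope check on why the G5 is "generally sufficient": the sketch below compares FP16 weight size against the GPU memory figures quoted above. It assumes a round 70 billion parameters at two bytes each and ignores the GB/GiB distinction, so treat the numbers as rough.

```python
PARAMS = 70e9               # assumption: round parameter count for a 70B model
BYTES_PER_PARAM_FP16 = 2    # FP16/BF16 weights take two bytes per parameter

# Total GPU memory per instance, in GB, as quoted in the text.
GPU_MEMORY_GB = {
    "ml.g5.48xlarge": 192,   # 8 x A10G
    "ml.p4d.24xlarge": 320,  # 8 x A100
}


def weights_gb(params=PARAMS, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Approximate size of the model weights in GB."""
    return params * bytes_per_param / 1e9


def headroom_gb(instance_type):
    """GPU memory left after loading weights (KV cache, activations, runtime)."""
    return GPU_MEMORY_GB[instance_type] - weights_gb()


for instance_type in GPU_MEMORY_GB:
    print(instance_type, "weights:", weights_gb(), "GB, headroom:",
          headroom_gb(instance_type), "GB")
```

The weights fit on both instance types; the difference is how much memory remains for the KV cache, which is what bounds concurrent requests and context length.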

Application Auto Scaling. The attribute that resolves the scenario. Real-time endpoints support:

Target tracking on SageMakerVariantInvocationsPerInstance (standard), CPUUtilization, or GPUUtilization via CloudWatch.
Step scaling with manual thresholds per step, for when target tracking is too slow.
Scheduled scaling: cron scale-out and scale-in on known schedules. Useful as a complement: pre-warm for 09:00 before traffic arrives; shrink after 19:00.

A reasonable starting policy: MinCapacity=1, MaxCapacity=5, target 600 InvocationsPerInstance per minute (10 RPS per instance). At trough, one ml.g5.48xlarge serves the background drip; at lunch peak, four or five instances handle the burst.
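To see what that policy implies across the day, here is a steady-state sketch that maps request rate to instance count. It is an approximation only: real target tracking reacts to CloudWatch alarms with some lag and can overshoot, and the RPS figures are the rounded ones from the traffic profile at the top of the post.

```python
import math


def desired_instances(rps, target_per_minute=600, min_capacity=1, max_capacity=5):
    """Steady-state instance count for a given request rate.

    The target is invocations per instance per MINUTE, so RPS is
    converted to a per-minute rate before dividing.
    """
    invocations_per_minute = rps * 60
    wanted = math.ceil(invocations_per_minute / target_per_minute)
    return max(min_capacity, min(max_capacity, wanted))


# Rounded request rates from the observed traffic profile.
CURVE = {
    "overnight": 1,
    "morning ramp": 15,
    "lunch average": 25,
    "lunch burst": 40,
    "afternoon tail": 20,
}

for period, rps in CURVE.items():
    print(f"{period}: {rps} RPS -> {desired_instances(rps)} instance(s)")
```

The steady-state answer at the 40 RPS burst is four instances; MaxCapacity=5 leaves one instance of slack for overshoot.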

MinCapacity=0 (scale-to-zero) is supported; the cold start on the way back up takes a few minutes for a 70B model (container pull, weight load, KV cache warm-up). For this scenario (overnight drips rather than true zero), keep MinCapacity=1.

Cost shape. ml.g5.48xlarge on-demand is roughly $17/hr in us-east-1:

Off-peak 16 hours at 1 instance: $17 × 16 = $272/day floor.
Peak 8 hours averaging ~3 instances: $17 × 3 × 8 = $408/day variable.
Daily total ~$680, or ~$20k/month.

Flat five-instance capacity to hold the peak round-the-clock is $17 × 5 × 24 = $2,040/day, or ~$60k/month. That 3x gap is what auto-scaling buys back.
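The arithmetic above is easy to re-run when the hourly rate or the instance counts change. A small sketch, using the rough $17/hr figure and assuming a 30-day month:

```python
HOURLY_RATE = 17.0   # rough on-demand rate from the text, USD per instance-hour
DAYS_PER_MONTH = 30  # assumption for the monthly figures


def daily_cost(instance_hours, hourly_rate=HOURLY_RATE):
    """Cost in USD for a given number of instance-hours."""
    return instance_hours * hourly_rate


# Auto-scaled: 16 off-peak hours at 1 instance, 8 peak hours averaging 3.
auto_scaled_hours = 1 * 16 + 3 * 8
# Flat: 5 instances held around the clock.
flat_hours = 5 * 24

auto_scaled_daily = daily_cost(auto_scaled_hours)
flat_daily = daily_cost(flat_hours)

print("auto-scaled per day:", auto_scaled_daily)
print("flat per day:", flat_daily)
print("auto-scaled per month:", auto_scaled_daily * DAYS_PER_MONTH)
print("flat per month:", flat_daily * DAYS_PER_MONTH)
print("ratio:", flat_daily / auto_scaled_daily)
```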

Bedrock CMI, briefly

CMI is Bedrock’s answer to “we trained our own weights and want Bedrock’s invocation surface.” Deeply managed, no container surface, same InvokeModel / InvokeModelWithResponseStream API as native calls. Supported architectures: Llama 2/3/3.1/3.2, Mistral, Mixtral, Flan-T5 and a handful more. Regional availability: us-east-1 and us-west-2 only.
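For a sense of how little surface there is, here is a hedged sketch of the import-then-invoke flow with boto3. The clients are passed in so nothing runs against AWS on import; the job name, model name, role ARN, and S3 URI are placeholders, and the request body fields are an assumption, since the body format depends on the imported model.

```python
import json


def start_import(bedrock_client, job_name, model_name, role_arn, s3_uri):
    """Kick off a Custom Model Import job from a checkpoint in S3.

    Pass boto3.client("bedrock") in real use.
    """
    return bedrock_client.create_model_import_job(
        jobName=job_name,
        importedModelName=model_name,
        roleArn=role_arn,
        modelDataSource={"s3DataSource": {"s3Uri": s3_uri}},
    )


def invoke(runtime_client, model_arn, prompt):
    """Call the imported model through the standard InvokeModel API.

    Pass boto3.client("bedrock-runtime") in real use.
    """
    # Assumption: these body fields are illustrative; check the format
    # the imported model actually expects.
    body = json.dumps({"prompt": prompt, "max_gen_len": 300})
    response = runtime_client.invoke_model(
        modelId=model_arn,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())
```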

Billing via Custom Model Units, priced per-minute with a five-minute minimum per active period. After idle, the model goes cold; next invocation incurs a cold-start (tens of seconds to low minutes for 70B).
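The five-minute minimum matters more than it looks for drip traffic. The sketch below models billing as described here: each active period is billed per minute, rounded up, with a floor of five minutes. The way periods are delimited and the example durations are simplifying assumptions, not Bedrock's exact metering.

```python
import math


def billed_minutes(active_minutes, minimum=5):
    """Minutes billed for one active period, with a five-minute floor."""
    if active_minutes <= 0:
        return 0
    return max(minimum, math.ceil(active_minutes))


def total_billed_minutes(active_periods):
    """Sum billed minutes over a list of active-period durations (minutes)."""
    return sum(billed_minutes(minutes) for minutes in active_periods)


def cost(active_periods, model_units, price_per_unit_minute):
    """Total cost; the price argument is a placeholder, not a quoted rate."""
    return total_billed_minutes(active_periods) * model_units * price_per_unit_minute


# Three overnight wake-ups: 30 seconds, 2 minutes, and 12 minutes of activity.
overnight = [0.5, 2, 12]
print("active minutes:", sum(overnight))
print("billed minutes:", total_billed_minutes(overnight))
```

Under this model, 14.5 minutes of real activity bills as 22 minutes, and each of those wake-ups also pays a cold start.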

Where CMI fits: steady low-to-medium throughput (batch document processing, internal tooling, async pipelines), where the Bedrock API matters more than bursty-curve efficiency. Where it fails: business-hours curves where a 12:1 peak-to-trough ratio wants a proportional capacity response. The CMU model doesn’t expose TargetTrackingScalingPolicy; you can’t ride the curve the way a SageMaker endpoint does.

BYO, even more briefly

Full control: pick TGI, vLLM, or TensorRT-LLM; tune tensor-parallel degree (TP=8 on 8 × A10G for Llama 3 70B); choose quantisation (FP16, AWQ, FP8); set batching parameters; write a Dockerfile; push to ECR; configure model-data download; set health-check thresholds.

That’s a quarter of an MLE’s time for the first deployment, plus ongoing maintenance every time the container needs patching, the base model revs, or the inference library ships a breaking change. Three MLEs with a product to ship is not the team for this stack.

BYO is right for quantisation experiments that need weeks of A/B, inference libraries JumpStart’s LMI doesn’t yet support, request shapes that stress default batching heuristics, or teams with a dedicated serving engineer. BYO auto-scales identically to JumpStart: same endpoint primitive, same knobs. What it gives up is the pre-tested container, not the scaling surface.

Sticky auto-scaling details

The scaling target is the production variant, not the endpoint. Resource ID is endpoint/<endpoint-name>/variant/<variant-name>. Getting this wrong means RegisterScalableTarget succeeds silently and no scaling happens.
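One way to guard against a mistyped resource ID is to build it from names the endpoint itself reports, then confirm the registration afterwards. A minimal sketch, with the boto3 clients passed in; the helper names are made up for illustration.

```python
def variant_resource_id(endpoint_name, variant_name):
    """Build the Application Auto Scaling resource ID for a variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def variant_exists(sagemaker_client, endpoint_name, variant_name):
    """Check the variant name against what the endpoint reports.

    Pass boto3.client("sagemaker") in real use. Names are case-sensitive.
    """
    endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    names = [v["VariantName"] for v in endpoint["ProductionVariants"]]
    return variant_name in names


def is_registered(autoscaling_client, resource_id):
    """Confirm the resource ID shows up as a scalable target.

    Pass boto3.client("application-autoscaling") in real use.
    """
    response = autoscaling_client.describe_scalable_targets(
        ServiceNamespace="sagemaker",
        ResourceIds=[resource_id],
    )
    return any(
        target["ResourceId"] == resource_id
        for target in response["ScalableTargets"]
    )
```

Running `variant_exists` before registering and `is_registered` after gives two cheap checks around the step that otherwise fails quietly.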

InvocationsPerInstance is per-minute, not per-second. A target of 20 means twenty per minute. For a 40 RPS peak (2,400/minute), target 20 would give 2400 / 20 = 120 instances, wildly over-scaled. The right target here is 600 (10 RPS per instance).

Cooldowns matter. ScaleOutCooldown and ScaleInCooldown default to 300 seconds. A shorter scale-out cooldown (60 s) responds faster on the way up; keeping the scale-in cooldown longer than that (180 s) holds capacity through short dips.

Scale-out isn’t free. A new ml.g5.48xlarge pulls ~140 GB of weights from S3 and loads them into GPU memory; end-to-end scale-out latency is 4-8 minutes. This is why scheduled scaling before a known ramp is worth layering on: pre-warm at 08:45 so target tracking has head-room to react from.
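The 08:45 figure falls out of simple subtraction: take the ramp start, subtract the worst-case scale-out latency, and leave a margin. A sketch, where the seven-minute margin is an assumption chosen to land on a round time:

```python
from datetime import date, datetime, time, timedelta


def prewarm_time(ramp_start, worst_case_scale_out_min=8, margin_min=7):
    """Latest sensible time to trigger a scheduled scale-out before a ramp."""
    # time objects do not support subtraction, so anchor to an arbitrary date.
    anchor = datetime.combine(date(2000, 1, 3), ramp_start)
    lead = timedelta(minutes=worst_case_scale_out_min + margin_min)
    return (anchor - lead).time()


def weekday_cron(at):
    """Six-field cron expression for Monday to Friday at the given time."""
    return f"cron({at.minute} {at.hour} ? * MON-FRI *)"


start = prewarm_time(time(9, 0))
print(start.strftime("%H:%M"))  # 08:45
print(weekday_cron(start))      # cron(45 8 ? * MON-FRI *)
```

Scheduled-action cron expressions are evaluated in UTC unless a time zone is supplied, so the computed time needs converting if the ramp is defined in local business hours.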

Health checks can cold-kill an endpoint. The LMI container’s /ping path has tight default timeouts. If the first InvokeEndpoint after cold start takes longer than ping allows, the instance is flagged unhealthy and recycled. SAGEMAKER_TS_RESPONSE_TIMEOUT and MODEL_LOAD_TIMEOUT are the fix; JumpStart sets these generously by default.

A worked example: a Wednesday

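import boto3

# Application Auto Scaling client. The scalable target is the production
# variant, not the endpoint itself.
aas = boto3.client("application-autoscaling")
resource_id = "endpoint/writing-assistant-prod/variant/AllTraffic"
dimension = "sagemaker:variant:DesiredInstanceCount"

# Register the variant with a floor of one instance and a ceiling of five.
aas.register_scalable_target(
    ServiceNamespace="sagemaker", ResourceId=resource_id,
    ScalableDimension=dimension,
    MinCapacity=1, MaxCapacity=5,
)

# Target tracking: hold ~600 invocations per instance per minute (10 RPS).
aas.put_scaling_policy(
    PolicyName="target-ipi", ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension=dimension,
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 600.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"},
        "ScaleOutCooldown": 60, "ScaleInCooldown": 180,
    },
)

# Weekday pre-warm at 08:45. Cron schedules are evaluated in UTC unless a
# Timezone argument is passed to put_scheduled_action.
aas.put_scheduled_action(
    ServiceNamespace="sagemaker", ResourceId=resource_id,
    ScalableDimension=dimension,
    ScheduledActionName="morning-pre-warm",
    Schedule="cron(45 8 ? * MON-FRI *)",
    ScalableTargetAction={"MinCapacity": 2, "MaxCapacity": 5},
)

# Weekday wind-down at 19:00, matching the 19:00 step in the walkthrough
# (the action name is illustrative).
aas.put_scheduled_action(
    ServiceNamespace="sagemaker", ResourceId=resource_id,
    ScalableDimension=dimension,
    ScheduledActionName="evening-scale-in",
    Schedule="cron(0 19 ? * MON-FRI *)",
    ScalableTargetAction={"MinCapacity": 1, "MaxCapacity": 3},
)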

A typical Wednesday runs:

03:00. One ml.g5.48xlarge warm at ~1 RPS, 60 invocations/minute, well below the 600 target. No scaling action.
08:45. Scheduled action lifts MinCapacity to 2. The second instance pulls weights, warms up, passes health check by 08:51, head-room waiting for the 09:00 ramp.
09:00-10:30. Traffic ramps to ~15 RPS, which is ~450 invocations/minute per instance across the two warm instances. When bursts push invocations-per-instance past 600/minute sustained, a third instance is scheduled. Three instances by 10:00.
12:15. Lunch peak hits 40 RPS. Target tracking requests a fourth and fifth instance; the fifth arrives around 12:22. Peak absorbed; P95 briefly spikes during scale-out but stays inside 3 s thanks to pre-warmed head-room.
15:00. Peak subsides; scale-in cooldown (180 s) delays action. Back to three instances by 15:10.
19:00. Scheduled action drops MinCapacity to 1, MaxCapacity to 3. Two instances drained over the scale-in window.

Day total ~$680, against $2,040 for a flat five-instance fleet that would deliver no quality or latency improvement.

What’s worth remembering

Bedrock Custom Model Import accepts a specific list of supported architectures (Llama 2/3/3.1/3.2, Mistral, Mixtral, Flan-T5, others), lives in us-east-1 and us-west-2 only, and bills via Custom Model Units closer to provisioned than to auto-scaling. Right for steady workloads; wrong for 12:1 bursty curves.
SageMaker JumpStart includes real-time endpoint deployment for Llama 3 70B Instruct with a pre-tested LMI container, supports domain adaptation and instruction fine-tuning, and integrates with Application Auto Scaling.
Application Auto Scaling target-tracking on SageMakerVariantInvocationsPerInstance is the default scaling mode. The metric is per-minute, a common arithmetic trap.
MinCapacity=0 is supported but comes with multi-minute cold-start for a 70B model. Use only for genuinely long idle periods; keep MinCapacity=1 for overnight drips.
Scheduled scaling complements target tracking by pre-warming known ramps. Scale-out has 4-8 minute latency for a new ml.g5.48xlarge with weights loaded, so reactive alone can lag a sharp ramp.
Instance defaults for Llama 3 70B: ml.g5.48xlarge (8 × A10G, cheaper) or ml.p4d.24xlarge (8 × A100, faster). G5 is usually enough for Instruct workloads.
SageMaker BYO with a Hugging Face container gives full control and identical auto-scaling mechanics; what it gives up versus JumpStart is the pre-tested container, not the scaling surface.
Bedrock native foundation models cannot serve arbitrary custom fine-tuned weights; CMI is the Bedrock path for that case.

Deploy the fine-tuned Llama 3 70B as a SageMaker JumpStart real-time endpoint on ml.g5.48xlarge, with Application Auto Scaling target-tracking on SageMakerVariantInvocationsPerInstance targeting 600/minute, MinCapacity=1, MaxCapacity=5, scale-out cooldown 60 s, scale-in cooldown 180 s, and scheduled scaling to pre-warm two instances at 08:45 weekdays and drop back to one at 19:00.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.