How to Diversify a Spot Fleet to Survive Capacity Events

December 02, 2026 · 15 min read

The situation

A data-engineering team runs a distributed video transcoding pipeline on EC2. The workload:

Batch of 3,000-50,000 video files per job, CPU-bound, each taking 30 seconds to 8 minutes.
Jobs arrive at unpredictable intervals, maybe 3-20 per day. Individual jobs have SLAs of 2-6 hours.
Needs roughly 500 vCPUs active during a job, scaling to zero between jobs.
Each task is stateless: a worker pulls a video URL from SQS, transcodes, uploads to S3, acks the message. If a worker dies mid-task, the message becomes visible again and another worker picks it up. The pipeline has been engineered to tolerate node loss cleanly.

Currently the team uses an Auto Scaling Group of c6i.2xlarge on-demand, scaling from 0 to 64 instances. Monthly bill: ~$18,000. Somebody suggested Spot. The team tried Spot in a single pool once (all c6i.2xlarge in us-east-1a), saved 80% for a week, then lost the entire fleet during a capacity event and had to explain to the business why a pipeline didn’t finish. They’re Spot-skeptical now, but the savings are enticing enough to want a second attempt, with diversification.

The questions: how many instance types should the fleet request? How many AZs? What allocation strategy? How do we size the commitment so a capacity event in one pool doesn’t take the fleet down with it?

What actually matters

Before picking a fleet shape, it’s worth understanding what “a Spot pool” actually means and why diversification matters.

The first thing is that a Spot capacity pool is one (instance-type, AZ) combination. c6i.2xlarge in us-east-1a is one pool. c6i.2xlarge in us-east-1b is a different pool. c6i.4xlarge in us-east-1a is yet another. Pools are independent: AWS reclaims Spot capacity from a pool based on that pool’s supply and demand. A request for 64 instances in one pool is 64 eggs in one basket; the same 64 spread across 6 pools exposes roughly one-sixth of the fleet to any one pool’s capacity event.
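To make the pool arithmetic concrete, here is a minimal sketch that enumerates pools as (instance type, AZ) pairs and computes the share of the fleet each pool holds under a perfectly even spread. The type and AZ lists are illustrative, and real allocation strategies rarely spread this evenly.

```python
from itertools import product

def pools(instance_types, azs):
    """Every (instance type, AZ) pair is its own Spot capacity pool."""
    return list(product(instance_types, azs))

def even_share(pool_count):
    """Fraction of the fleet in each pool if capacity is spread evenly."""
    return 1 / pool_count

single = pools(["c6i.2xlarge"], ["us-east-1a"])
six = pools(
    ["c6i.2xlarge", "c6a.2xlarge", "c7i.2xlarge"],
    ["us-east-1a", "us-east-1b"],
)

print(len(single), even_share(len(single)))      # one pool holds the whole fleet
print(len(six), round(even_share(len(six)), 3))  # six pools, about a sixth each
```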

The second thing is that interruptions are probabilistic. AWS publishes a “frequency of interruption” figure in the Spot Instance Advisor: the historical percentage of instances interrupted over the trailing month, reported in bands per instance type and Region rather than per individual pool. It’s the best available indicator of how stable a type’s pools are. Some (older generations, specialised instance types) can show interruption rates of 15-20%; common ones (current-gen general-purpose families) often sit at 5% or less. A good diversification strategy mixes multiple low-rate pools rather than leaning on one.

The third thing is allocation strategy. Any fleet API that requests Spot capacity across pools has to decide which pools to draw from on each launch. The available strategies span a range: pick the cheapest available (maximum savings, worst diversification, concentrates in one pool); spread capacity evenly across all configured pools (maximum resilience, may cost a touch more); pick the pools with the deepest available capacity (which tends to mean fewer interruptions); or balance price and capacity signals together (the sweet spot for most workloads). The newer balanced strategy is the one AWS recommends for most Spot workloads, and for good reason; the others are situational.
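As a toy illustration of the two ends of that range (this is not AWS's actual allocation algorithm), the sketch below places 64 instances either entirely in the cheapest pool or evenly across all pools, then reports the largest share any single pool ends up holding. The pool names and prices are made up for the example.

```python
def allocate_cheapest(prices, total):
    """Toy 'cheapest first': put everything in the single cheapest pool."""
    cheapest = min(prices, key=prices.get)
    return {pool: (total if pool == cheapest else 0) for pool in prices}

def allocate_even(prices, total):
    """Toy 'spread evenly': split as equally as integer counts allow."""
    base, extra = divmod(total, len(prices))
    # The first `extra` pools (in dict order) take one more instance each.
    return {pool: base + (1 if i < extra else 0) for i, pool in enumerate(prices)}

def max_pool_share(allocation):
    """Largest fraction of the fleet sitting in any one pool."""
    total = sum(allocation.values())
    return max(allocation.values()) / total

# Hypothetical hourly Spot prices per (type, AZ) pool.
prices = {
    ("c6i.2xlarge", "a"): 0.12, ("c6i.2xlarge", "b"): 0.13,
    ("c6a.2xlarge", "a"): 0.11, ("c6a.2xlarge", "b"): 0.14,
    ("c7i.2xlarge", "a"): 0.15, ("c7i.2xlarge", "b"): 0.16,
}

print(max_pool_share(allocate_cheapest(prices, 64)))  # 1.0: everything in one pool
print(max_pool_share(allocate_even(prices, 64)))      # 11 of 64 in the fullest pool
```

The point of the comparison is the exposure, not the price: the cheapest-first fleet loses everything when its one pool is reclaimed, while the even spread loses at most 11 instances.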

The fourth thing is the breadth of acceptable instance types. The fleet needs a list of types it’s willing to accept on launch, and they should have compatible resource shapes for the workload. For a CPU-bound transcoding workload, any 8-vCPU instance in the right family slots in. Mixing generations and vendors (newer vs older, Intel vs AMD vs ARM where the workload tolerates it) multiplies the pool count without changing the per-instance cost-of-work meaningfully.

The fifth thing is the interruption signal. When Spot capacity is reclaimed, AWS delivers an interruption notice via the instance metadata service, two minutes before force-termination; there’s also a rebalance recommendation that can arrive earlier, when AWS sees elevated interruption risk, though it isn’t guaranteed to come ahead of the notice. The worker has to poll the metadata, detect the notice, stop accepting new work, finish what it can, and drain. Stateless workloads with idempotent tasks don’t need to save state; they just need to die cleanly so the queue’s visibility timeout re-delivers any in-flight work.
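For a sense of what the worker actually receives, the interruption notice is a small JSON document with an action and a UTC timestamp. The following is a minimal sketch of parsing it and working out how much time is left; the sample body and the fixed "now" are illustrative values, and fetching the document from the metadata service is left out.

```python
import json
from datetime import datetime, timezone

def parse_instance_action(body):
    """Parse the JSON body served at spot/instance-action into (action, deadline)."""
    doc = json.loads(body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], deadline.replace(tzinfo=timezone.utc)

def seconds_remaining(deadline, now):
    """Seconds left before the action takes effect; never negative."""
    return max(0.0, (deadline - now).total_seconds())

# Sample body in the documented shape; the timestamp is illustrative.
sample = '{"action": "terminate", "time": "2026-12-02T10:13:52Z"}'
action, deadline = parse_instance_action(sample)
now = datetime(2026, 12, 2, 10, 11, 52, tzinfo=timezone.utc)
print(action, seconds_remaining(deadline, now))  # terminate 120.0
```

A worker could use the remaining time to decide whether its current task is worth finishing or should simply be abandoned to the queue.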

The sixth thing is the choice of orchestration API. There are several AWS APIs that can launch and manage a fleet of Spot capacity, some older and Spot-specific, some newer with broader integrations across the AWS ecosystem, and one that lives inside the standard Auto Scaling Group via a mixed-instances policy. The trade-offs differ on flexibility, integration with existing automation (load balancers, target-tracking scaling, lifecycle hooks), and operational fit. For a workload that already lives behind an Auto Scaling Group, the ASG-native option is usually the cleanest; one-shot batch and high-control use cases lean to a dedicated fleet API.

What we’ll filter on

Filters for each fleet shape:

Savings vs on-demand: actual observed discount.
Interruption frequency: how often does a worker get the 2-minute warning?
Resilience to capacity events: does losing one pool take down the fleet?
Scaling speed: how fast can the fleet go from 0 to 500 vCPUs?
Operational complexity: how much configuration is there to get right?
Integration with existing automation: does it fit the ASG-based pipeline?

The fleet-shape landscape

Single pool, lowestPrice. c6i.2xlarge in one AZ, Spot only. Cheapest possible; worst possible diversification. One capacity event kills the fleet.
Two pools, lowestPrice. Two AZs for the same instance type. Slight improvement; an AZ-level event now takes out one pool rather than the whole fleet, but a type-level event in c6i.2xlarge takes both.
Six pools, diversified. Three instance types × two AZs. Real diversification; allocation strategy spreads evenly. Slightly higher average cost because the strategy can’t always pick the cheapest pool.
Twelve pools, priceCapacityOptimized. Four instance types × three AZs. The modern default. The allocation strategy balances price and capacity signals; typical interruption rates under 5%.
Twenty pools, priceCapacityOptimized, with on-demand floor. Five instance types × four AZs, plus an on-demand base (e.g. 20% of capacity on-demand to guarantee a minimum). The “Spot for surge, on-demand for baseline” shape when the workload has a non-interruptible floor. This transcoding workload doesn’t really have a non-interruptible floor (tasks can always retry), but adding one is cheap insurance against “every Spot pool unavailable at once”, a rare but non-zero event.
Compute Savings Plan + Spot on top. Not a fleet shape exactly. Commit to baseline compute at SP-discount rates, use Spot for the burst above. For a workload with zero baseline between jobs, SPs add cost without saving much; for a workload with continuous baseline, SPs + Spot is the deepest discount combination available.

Side by side

Option	Savings	Interruption freq	Pool-event resilience	Scaling speed	Complexity	ASG integration
Single pool	~90%	High (one pool = all eggs)	✗	Fast	Low	✓
Two pools	~85%	Medium	—	Fast	Low	✓
Six pools, diversified	~80%	Low	✓	Fast	Medium	✓
Twelve pools, PCO	~80-85%	Very low	✓✓	Fast	Medium	✓
Twenty pools, PCO + OD floor	~70-75%	Very low; OD floor never interrupts	✓✓✓	Fast	Medium-high	✓
SP + Spot on top	~65%	Depends on Spot mix	✓	Fast	Medium	✓

For this workload: twelve pools with priceCapacityOptimized is the strong default. The 500 vCPU peak spread across twelve pools puts ~40 vCPUs per pool on average, an easily-absorbed disruption if any one pool gets reclaimed. No on-demand floor is needed because every task is retriable.

The 12-pool allocation

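Four instance types × three AZs = twelve pools. priceCapacityOptimized weights allocation toward pools that are both cheap and have available capacity. With capacity spread roughly evenly, losing any single pool takes about 8% of the fleet offline (one-twelfth), somewhat more if the strategy has concentrated capacity in that pool; the remaining pools absorb the replacement launches within minutes.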

The pick in depth

Auto Scaling Group with mixed-instances policy, four instance types across three AZs, priceCapacityOptimized allocation, 100% Spot.

The launch template and ASG config. One launch template with the base AMI, IAM role, security groups, user-data for the worker agent. The ASG references this template plus a mixed-instances policy that overrides the instance type per launch:

MixedInstancesPolicy:
  LaunchTemplate:
    LaunchTemplateSpecification:
      LaunchTemplateId: !Ref WorkerLaunchTemplate
      Version: !GetAtt WorkerLaunchTemplate.LatestVersionNumber   # CloudFormation does not accept $Latest here
    Overrides:
      - InstanceType: c6i.2xlarge
      - InstanceType: c6a.2xlarge
      - InstanceType: c6i.4xlarge
      - InstanceType: c7i.2xlarge
  InstancesDistribution:
    OnDemandBaseCapacity: 0
    OnDemandPercentageAboveBaseCapacity: 0
    SpotAllocationStrategy: price-capacity-optimized
    # SpotInstancePools is omitted: it applies only to lowest-price (valid range 1-20)
VPCZoneIdentifier: !Split [',', !Ref PrivateSubnetIds]   # three AZs

OnDemandBaseCapacity: 0 and OnDemandPercentageAboveBaseCapacity: 0 mean 100% Spot. No on-demand floor for this workload; every task is retriable. SpotAllocationStrategy: price-capacity-optimized delegates the pool choice to AWS’s real-time signals.

Capacity Rebalance. Setting CapacityRebalance: true on the ASG opts into proactive replacement: when AWS issues a rebalance recommendation for an instance at elevated interruption risk, the ASG launches a replacement, ideally before the existing instance gets its 2-minute warning (the recommendation is not guaranteed to arrive first). It’s one of the biggest reliability improvements for Spot workloads in recent years: free to turn on, and it substantially lowers the effective impact of interruptions.
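For teams that create the group from a script rather than a template, the same settings can be expressed as keyword arguments for boto3's create_auto_scaling_group call. The sketch below only builds the request dictionary and does not call AWS; the group name, launch template ID, and subnet IDs are placeholders, and build_asg_request is a hypothetical helper, not part of any library.

```python
def build_asg_request(name, launch_template_id, instance_types, subnet_ids, max_size):
    """Keyword arguments in the shape boto3's create_auto_scaling_group accepts."""
    return {
        "AutoScalingGroupName": name,
        "MinSize": 0,
        "MaxSize": max_size,
        "CapacityRebalance": True,  # replace at-risk Spot instances proactively
        "VPCZoneIdentifier": ",".join(subnet_ids),  # one subnet per AZ
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",  # the Auto Scaling API itself accepts $Latest
                },
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 0,
                "OnDemandPercentageAboveBaseCapacity": 0,  # 100% Spot
                "SpotAllocationStrategy": "price-capacity-optimized",
            },
        },
    }

request = build_asg_request(
    name="transcode-workers",                    # hypothetical group name
    launch_template_id="lt-0123456789abcdef0",   # placeholder ID
    instance_types=["c6i.2xlarge", "c6a.2xlarge", "c6i.4xlarge", "c7i.2xlarge"],
    subnet_ids=["subnet-aaa", "subnet-bbb", "subnet-ccc"],  # placeholders
    max_size=64,
)
# With credentials configured, this would be sent as:
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
print(request["MixedInstancesPolicy"]["InstancesDistribution"]["SpotAllocationStrategy"])
```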

Worker-side interruption handling. Each worker polls the instance metadata service every 5 seconds for the interruption notice:

#!/bin/bash
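IMDS=http://169.254.169.254/latest
while true; do
  # IMDSv2 session token; needed when the instance enforces IMDSv2
  TOKEN=$(curl -sf -m 1 -X PUT "$IMDS/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  # -f: exit non-zero on the 404 that IMDS returns while no interruption is scheduled
  if curl -sf -m 1 -H "X-aws-ec2-metadata-token: $TOKEN" "$IMDS/meta-data/spot/instance-action" >/dev/null 2>&1; then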
    logger -t spot-handler "Interruption notice received; stopping message polling"
    systemctl stop transcode-worker
    # let current task finish or abandon; SQS visibility timeout handles the rest
    sleep 130  # 2-minute warning plus buffer
    shutdown -h now
  fi
  sleep 5
done

On interruption, the worker stops polling SQS (no new work), tries to finish the in-flight task, and shuts down. Any SQS message it was processing but had not yet acked becomes visible again after the queue’s visibility timeout, and another worker picks it up. Stateless and idempotent: that is the prerequisite for Spot.
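The worker side of that contract can be sketched as a loop that checks a stop flag between tasks and acks a message only after the task succeeds. This is a minimal illustration under stated assumptions: the receive, process, and ack callables stand in for the real SQS and transcoding calls, and the Worker class and install_sigterm_handler are hypothetical names, not the team's actual agent.

```python
import signal

class Worker:
    """Pulls messages until asked to stop; only acks work it actually finished."""

    def __init__(self, receive, process, ack):
        self.receive, self.process, self.ack = receive, process, ack
        self.stopping = False

    def request_stop(self, signum=None, frame=None):
        # Signal-handler signature, so it can be registered for SIGTERM directly.
        self.stopping = True

    def run(self):
        done = 0
        while not self.stopping:
            msg = self.receive()
            if msg is None:      # queue empty
                break
            self.process(msg)    # if this dies, the message is never acked
            self.ack(msg)        # ack only after success; un-acked work is redelivered
            done += 1
        return done

def install_sigterm_handler(worker):
    """Wire SIGTERM (what `systemctl stop` sends by default) to a graceful stop."""
    signal.signal(signal.SIGTERM, worker.request_stop)

# In-memory stand-ins for the queue calls, for illustration only.
queue = ["a.mp4", "b.mp4", "c.mp4"]
acked = []
worker = Worker(
    receive=lambda: queue.pop(0) if queue else None,
    process=lambda msg: None,
    ack=acked.append,
)
print(worker.run(), acked)
```

Because the flag is only checked between tasks, a stop request lets the in-flight task finish and be acked, and nothing new is pulled afterwards.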

Scaling policy. Target-tracking scaling on SQS queue depth, with a target of ~50 messages per worker. A 3,000-message job triggers scale-out from 0 to ~60 instances in 5-8 minutes (launch time); larger jobs are capped by the group maximum of 64. Work drains, and the ASG scales to 0 over the next 5 minutes. Cooldown on scale-in is longer than on scale-out (don’t thrash).
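The arithmetic behind a backlog-per-worker target can be sketched as follows. This is a simplified model of the intent, not the Auto Scaling service's actual control loop; desired_instances is a hypothetical helper, the target value is illustrative, and 64 is the group maximum from the existing on-demand setup.

```python
import math

def desired_instances(visible_messages, target_per_instance, max_instances, min_instances=0):
    """Instances needed to hold the backlog at the target, clamped to the group limits."""
    if visible_messages <= 0:
        return min_instances
    wanted = math.ceil(visible_messages / target_per_instance)
    return max(min_instances, min(max_instances, wanted))

print(desired_instances(15_000, target_per_instance=50, max_instances=64))  # 64 (capped)
print(desired_instances(1_000, target_per_instance=50, max_instances=64))   # 20
print(desired_instances(0, target_per_instance=50, max_instances=64))       # 0
```

Any job large enough to matter hits the cap, so in practice the target mostly controls how quickly small jobs scale out and how early the group starts scaling in as the queue drains.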

A worked job

Ravi submits a 15,000-video job at 10:04 UTC. Timestamps below are minutes:seconds past 10:00.

04:02  Job upload complete; 15,000 messages in SQS
04:15  CloudWatch alarm fires: SQS ApproximateNumberOfMessagesVisible > 500
04:16  ASG desired capacity set to 60 (from 0)
04:22  ASG starts launching: 8× c6i.2xlarge in 1a, 6× c6a.2xlarge in 1a, 5× c6a.2xlarge in 1b, ...
06:48  First workers online, polling SQS
09:30  Steady state: 52 instances across 10 pools, 500 vCPUs, all pulling messages
11:47  Rebalance recommendation for i-0abc1234 (c6i.4xlarge in 1c)
11:48  Capacity Rebalance triggers replacement launch
12:02  Replacement online, c6a.2xlarge in 1a
13:52  i-0abc1234 receives 2-minute warning, drains SQS poll, shuts down
38:14  Queue empty; ASG starts scale-in
44:00  ASG at 0 instances; job complete

Total wallclock: 40 minutes. One Spot interruption during the job, handled transparently. The job’s SLA was 2 hours; it ran in a third of that at about a quarter of the on-demand cost. The monthly bill for this pipeline drops from $18,000 to ~$4,700.
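The headline numbers are easy to check. A quick sketch of the arithmetic, using only the figures quoted above:

```python
def savings_fraction(on_demand_cost, actual_cost):
    """Fraction saved relative to the on-demand bill."""
    return 1 - actual_cost / on_demand_cost

monthly_on_demand = 18_000  # from the scenario
monthly_spot = 4_700        # from the worked result
saved = savings_fraction(monthly_on_demand, monthly_spot)
print(f"{saved:.1%} saved, {1 - saved:.1%} of on-demand")  # 73.9% saved, 26.1% of on-demand

wallclock_min, sla_min = 40, 120
print(f"{wallclock_min / sla_min:.0%} of the SLA window used")  # 33% of the SLA window used
```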

What’s worth remembering

A Spot capacity pool is (instance-type, AZ). Pools are the unit of Spot reclamation; diversification means spreading across many pools, not many instances in one pool.
Diversify on two axes: instance type and AZ. Four instance types across three AZs gives twelve pools. With an even spread, losing any one pool takes about 8% of the fleet; the rest keeps running.
priceCapacityOptimized is the modern default allocation strategy. It balances price signals with capacity signals; typical interruption rates under 5% for well-diversified fleets.
Capacity Rebalance proactively replaces at-risk instances. Turn it on. No downside for stateless workloads.
Interruption handling is polling the metadata service. The 2-minute warning is a signal, not a state; a small loop in user-data plus a graceful SIGTERM handler does the job. Idempotent tasks make the rest trivial.
Mixed-instances policy on an ASG is the usual integration. For EC2-backed services behind ASGs, it’s a drop-in. For one-shot batch, EC2 Fleet offers more controls; for ECS, a capacity provider handles the same pattern, and for EKS, managed node groups or Karpenter do.
On-demand base capacity is insurance, not the default. Workloads with a true non-interruptible floor (a minimum of N workers who must stay up) set OnDemandBaseCapacity > 0. Stateless batch pipelines usually don’t need it.
Savings depend on how wide you diversify. A single pool gets 90% off until the pool goes away. Twelve well-chosen pools get 70-80% off consistently. The second number is the one that stays.

Spreading the fleet across pools is the trade: give up a few percentage points of savings to buy real resilience. Twelve pools, priceCapacityOptimized, Capacity Rebalance on, idempotent workers: that’s the combination that lets a stateless batch pipeline ride Spot at a quarter of on-demand cost without explaining another outage to the business.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.