Auto-Scaling for Planned and Surprise Spikes

The situation

A retailer runs the shop on EC2 instances behind an Application Load Balancer, inside an EC2 Auto Scaling group spread across three AZs. Weekday traffic has a clear shape: a morning peak between 08:00 and 10:00 local, a quieter midday, an evening peak from 19:00 to 22:00. Weekends run lighter across the day. That pattern repeats week after week.

Two complications sit on top of that baseline. Black Friday is known ten months in advance, brings 10x the usual peak, and lasts about six hours from midnight local; the same shape repeats the following Monday (Cyber Monday). Unscheduled spikes arrive without warning: a product hits the front page of a news site, a celebrity posts about a jacket, or a morning show mentions the brand, and the ramp from baseline to 3-4x peak happens in single-digit minutes.

The business wants zero capacity-related outages on either kind of spike, costs that track demand (no paying for peak 24/7), and no standing army of on-call engineers scaling manually.

What actually matters

Before naming features, it’s worth thinking about what kinds of work this layer has to do, because the scenario’s “auto-scaling” is three different workload problems glued together, plus three concerns that cut across them.

The first is known, calendar-driven capacity. Black Friday is not a forecast; it’s a date on a wall. The business knows exactly when the floor needs to go up, by roughly how much, and when it comes down. Any mechanism that tries to learn this from metrics is running a science experiment when the calendar already has the answer. What we want is a tool that pins a minimum capacity in place for a window so that nothing else can undercut it. Not a forecast. Not a recommendation. A pin.

The second is cyclical, repeating, learnable shape. The morning and evening peaks repeat every weekday, every weekend has its own quieter shape, and nobody wants to hand-edit a calendar for that. A reactive feature that waits for CPU to climb before adding capacity is always minutes late on a ramp; what we really want is a feature that has seen last Tuesday’s curve and can start provisioning ten minutes before the curve would have started climbing. That’s a forecast that runs continuously, refreshes on new data, and scales out ahead of predicted demand. Scale-in is a separate concern, and deciding when to give capacity back is easier once demand has materialised.

The third is unpredictable reactive capacity. The viral moment is the opposite of the first two: not a calendar date, not a repeating shape, just a ramp that shows up unannounced. The only feature that can respond to it is one that watches the live metric and adds capacity when the metric moves. That’s inherently reactive and therefore always a bit behind, which means the other axis that matters is how quickly new capacity becomes useful. If the application takes five minutes to boot and the spike ramps in two, the reactive layer is chasing a lost cause no matter how fast it decides to scale.

The fourth (crosscutting) is spin-up latency. Cold EC2 launches are typically 60-90 seconds for a minimal AMI and 3-5 minutes for a real application that pulls a container, warms a JVM, fills a connection pool, registers with service discovery, and pre-loads a cache. Against a six-minute viral ramp that’s the difference between catching the peak and watching it overrun the group. Making capacity arrive faster is a separate problem from deciding that more capacity is needed, and it deserves its own layer.

The fifth, softer, is composition over combination. The answer here is almost certainly several mechanisms working together. What matters is how they interact when they disagree: if scheduled says 500 and predictive says 180 and target tracking says 220, which wins? The largest figure does. When several policies each propose a capacity, the group follows the highest, and a scheduled action that raises the group’s minimum sets a floor that no policy can go below. That is what makes scheduled scaling the layer that holds for known events. If the rule were average or minimum, the scheduled layer would be useless.
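The arithmetic in that example can be sketched in a few lines of Python. This is an illustrative model rather than AWS's implementation: the function name resolve_desired_capacity and its parameters are hypothetical, and it assumes the scheduled action expresses its 500 as the group's minimum size while the predictive and dynamic policies each propose a capacity.

```python
def resolve_desired_capacity(policy_proposals, min_size, max_size):
    """Illustrative model: follow the largest proposal, clamped to [min_size, max_size]."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    # With no proposals at all, fall back to the floor.
    largest = max(policy_proposals, default=min_size)
    return max(min_size, min(largest, max_size))


# Event window: the scheduled action has raised the minimum to 500.
during_event = resolve_desired_capacity([180, 220], min_size=500, max_size=1000)

# Ordinary day: the minimum is back at 50, so the larger proposal wins.
ordinary_day = resolve_desired_capacity([180, 220], min_size=50, max_size=1000)
```

The max_size of 1000 is an assumed ceiling; the text does not give one.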

And sixth, cost discipline. The business doesn’t want to pay for Black Friday capacity on a Sunday afternoon in March. That means every layer except the scheduled one has to be willing to scale down when demand drops, or, for predictive scaling specifically, has to be paired with something that will.

What we’ll filter on

Distilling the exploration into filters:

Coverage of a known, large, time-bounded spike. A feature that lifts capacity on a calendar the business controls.
Coverage of the repeating cyclical baseline. A feature that learns the weekday shape and provisions for it before the ramp, not after.
Coverage of unannounced spikes. A feature that reacts to the live metric quickly enough that a three-minute ramp doesn’t drain the group.
Acceptable spin-up latency. Whatever layer delivers capacity must produce usable instances faster than the spike grows.
Safe composition. Multiple policies can run on the same group; they must combine predictably, not fight.

The auto-scaling landscape

EC2 Auto Scaling and Application Auto Scaling ship seven distinct mechanisms between them.

Simple scaling. A single alarm fires, a single adjustment runs, the group enters a cooldown and ignores further alarms. AWS’s own guidance is that simple scaling is legacy. If a spike is still climbing when the cooldown starts, the group won’t react until it expires. Too blunt.
Step scaling. One CloudWatch alarm with multiple step adjustments keyed to breach magnitude. 10% over target adds 1 instance; 20% over adds 3; 40% over adds 6. Instead of a flat cooldown it uses an instance warmup during which new launches don’t count toward the metric yet. Right tool when breach magnitude matters and different severities deserve different responses.
Target tracking scaling. Feed it a target value for a metric (CPU at 50%, ALBRequestCountPerTarget at 1000) and EC2 Auto Scaling provisions and manages the CloudWatch alarms itself. Simplest dynamic policy; AWS’s default recommendation. Purely reactive: capacity arrives after the metric has moved. Asymmetric by design: eager scale-out, cautious scale-in.
Scheduled scaling. A calendar action (recurring or one-off) that changes the group’s desired, minimum, and maximum capacity at a specified date and time. Supports cron expressions with an optional IANA time zone (Europe/London), up to 125 actions per group; execution may be delayed up to ~2 minutes. Perfect for Black Friday; useless against a viral moment.
Predictive scaling. A machine-learned forecast that analyses up to 14 days of CloudWatch history and produces an hourly forecast for the next 48 hours, refreshed every 6 hours. Needs 24 hours of history minimum before it generates its first forecast; works better with two weeks. A new policy defaults to ForecastOnly mode, which produces forecasts without acting on them; switched to ForecastAndScale, it scales out ahead of predicted load. In either mode it never scales in; that is left to a dynamic policy. Learns the weekday morning/evening shape, but cannot anticipate a one-off event.
Warm pools. A pool of pre-initialised EC2 instances held alongside the Auto Scaling group in Stopped, Hibernated, or Running state. Stopped is the default: you pay EBS and any EIPs, not compute. When the group needs capacity, warm-pool instances start much faster than a cold launch because the AMI is on disk and the userdata has run. Doesn’t decide when to add capacity, just makes adding it faster.
Lifecycle hooks. A way to pause an instance in the Pending:Wait or Terminating:Wait state so custom work can run: register with service discovery, pre-warm a cache, drain connections. Default timeout 1 hour, maximum 48 hours. Hooks are the plumbing for warm-pool bootstrap and clean scale-in.
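The step scaling entry in the list above gives three example adjustments (10% over adds 1 instance, 20% over adds 3, 40% over adds 6). A minimal sketch of that lookup follows. It assumes "over target" means percentage points above the alarm threshold, and the names STEPS and step_adjustment are hypothetical; a real policy expresses the same idea as step adjustments with lower and upper bounds.

```python
# (lower bound of the breach, in points over the threshold; instances to add)
STEPS = [(10, 1), (20, 3), (40, 6)]


def step_adjustment(breach, steps=STEPS):
    """Return the instance count for the highest step whose lower bound is met."""
    added = 0
    for lower_bound, instances in sorted(steps):
        if breach >= lower_bound:
            added = instances
    return added


small_breach = step_adjustment(12)   # falls in the 10-20 band
large_breach = step_adjustment(55)   # past the top step
```

The point of the structure is that a severe breach skips straight to the largest adjustment instead of climbing one alarm at a time.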

Side by side

Mechanism	Known spike	Repeating baseline	Unannounced spike	Spin-up latency
Simple scaling	✗	✗	✗	n/a
Step scaling	✗	✗	✓	n/a
Target tracking	✗	✗	✓	n/a
Scheduled scaling	✓	✗	✗	n/a
Predictive scaling	✗	✓	✗	n/a
Warm pools	n/a	n/a	n/a	✓
Lifecycle hooks	n/a	n/a	n/a	✓

Nothing wins alone. Scheduled has no answer to the viral moment. Predictive cannot see Black Friday, because a once-a-year event never appears in its 14-day history, and it cannot see unannounced spikes either. Target tracking is always minutes behind the real curve. Warm pools don’t decide when to add instances. The scenario rewards composition: pick the feature for each attribute and let them stack.

Layering the four mechanisms

Three workload shapes, three different answers, one shared speed-up layer. The group takes the maximum of the three policies' desired capacities, which makes scheduled the floor that holds during known windows.

The survivor combination in depth

The layered answer, from slow-moving to fast-reacting.

Layer 1: scheduled actions for Black Friday, Cyber Monday, and any marketing event on the roadmap. An action at, say, 55 23 24 11 * (23:55 on 24 November) lifts the group’s minimum to 10x the normal floor; a paired action at 0 6 25 11 * drops it back. Because the date of Black Friday moves from year to year, a fixed-date cron like this has to be revised annually; a one-off action with an explicit start time avoids that. Setting MinSize rather than just DesiredCapacity means no other scaling policy can quietly drain the group during the window. The limit is 125 actions per group, and an IANA time zone lets the cron be read as Europe/London time and survive daylight-saving changes.
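As a rough sketch of Layer 1, the pair of actions can be written as two parameter sets using the example cron expressions from the paragraph above. The keys follow the parameter names of boto3's put_scheduled_update_group_action; the group name shop-asg, the action names, and the helper scheduled_floor_actions are hypothetical. The sketch only builds the dictionaries, so it runs without AWS credentials, and it assumes the group's MaxSize is already at least 500.

```python
def scheduled_floor_actions(group_name, event_name, raise_cron, lower_cron,
                            event_min, normal_min, time_zone="Europe/London"):
    """Build the raise/lower pair of scheduled actions that pin MinSize for a window."""
    common = {"AutoScalingGroupName": group_name, "TimeZone": time_zone}
    raise_floor = {
        **common,
        "ScheduledActionName": f"{event_name}-raise-floor",
        "Recurrence": raise_cron,   # cron, evaluated in time_zone
        "MinSize": event_min,
    }
    lower_floor = {
        **common,
        "ScheduledActionName": f"{event_name}-lower-floor",
        "Recurrence": lower_cron,
        "MinSize": normal_min,
    }
    return raise_floor, lower_floor


raise_floor, lower_floor = scheduled_floor_actions(
    group_name="shop-asg",          # hypothetical group name
    event_name="black-friday",
    raise_cron="55 23 24 11 *",
    lower_cron="0 6 25 11 *",
    event_min=500,
    normal_min=50,
)
```

Each dictionary could then be passed as keyword arguments to the client call. Only MinSize is set here, so desired capacity above the floor stays under the control of the other policies.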

Layer 2: predictive scaling for the weekday morning/evening pattern. After 14 days of observation it knows Tuesday at 08:00 needs capacity 3.5x the overnight floor, and at 07:45 it will already have started launching instances to meet that curve. Critically, predictive scaling only scales out: it won’t drain the group when the forecast is low, and it leaves scale-in to Layer 3. Forecasts refresh every six hours so the model tracks drift.
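A predictive policy along these lines might be configured as sketched below. The keys follow boto3's put_scaling_policy parameters; the policy name, the group name, the 50% CPU target, and the helper predictive_policy are assumptions for illustration. The 900-second SchedulingBufferTime corresponds to the 15-minute head start in the 07:45 example.

```python
def predictive_policy(group_name, target_cpu=50.0, buffer_seconds=900):
    """Build parameters for a predictive scaling policy that acts on its forecast."""
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": "weekday-shape-predictive",   # hypothetical name
        "PolicyType": "PredictiveScaling",
        "PredictiveScalingConfiguration": {
            "MetricSpecifications": [{
                "TargetValue": target_cpu,
                "PredefinedMetricPairSpecification": {
                    "PredefinedMetricType": "ASGCPUUtilization",
                },
            }],
            # Without an explicit mode the policy only forecasts and never scales.
            "Mode": "ForecastAndScale",
            # Launch this many seconds ahead of the forecast hour.
            "SchedulingBufferTime": buffer_seconds,
        },
    }


policy = predictive_policy("shop-asg")
```

Starting a new policy in ForecastOnly mode and comparing its forecast against real traffic for a week or two is a reasonable way to build trust before switching it on.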

Layer 3: target tracking scaling as the always-on dynamic policy. ALBRequestCountPerTarget at a safe target (say, 1000 requests/target/minute) keeps the group responsive to whatever Layers 1 and 2 didn’t anticipate. When multiple scaling policies are active on the same group, each computes a desired capacity independently and the group takes the maximum: if scheduled says 500, predictive says 180, and target tracking says 220, the group runs at 500. Target tracking also handles scale-in for the whole stack once load drops, because predictive scaling never scales in and the scheduled actions only move the floor.
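For Layer 3, a target tracking policy on ALBRequestCountPerTarget could look roughly like this. The keys follow boto3's put_scaling_policy parameters; the policy name, the group name, and the resource label (which identifies the load balancer and target group) are placeholders, and target_tracking_policy is a hypothetical helper.

```python
def target_tracking_policy(group_name, resource_label, requests_per_target=1000.0):
    """Build parameters for a target tracking policy keyed to ALB request count."""
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": "alb-requests-target-tracking",   # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                "ResourceLabel": resource_label,
            },
            "TargetValue": requests_per_target,
            # Scale-in stays enabled: this is the layer that gives capacity back.
            "DisableScaleIn": False,
        },
    }


policy = target_tracking_policy(
    "shop-asg",
    # Placeholder label; the real value comes from the ALB and target group ARNs.
    "app/shop-alb/0123456789abcdef/targetgroup/shop-tg/0123456789abcdef",
)
```

Leaving DisableScaleIn false matters in this design, because no other layer removes capacity above the floor.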

Layer 4: warm pools to cut the spin-up tax on every layer above. Instances enter the pool in Stopped state after their launch lifecycle hook has run the full bootstrap (package install, config download, AMI specialisation). When the group needs capacity, the warm-pool instance transitions out in seconds to low minutes rather than taking the cold-launch path. For an application with a meaningful startup cost (JVM warm-up, large container image pulls, connection pools), this is the difference between catching a four-minute viral ramp and watching it overrun the group.

A worked example: the Black Friday trace

A concrete walk through 24 hours around a Black Friday opening, with a quiet-evening baseline of 50 instances and a scheduled minimum of 500.

Tuesday, -72 hours. Scheduled actions are configured. Predictive scaling has two months of history but no concept of Black Friday; its Friday forecast shows a normal Friday. Target tracking is at its usual 50% CPU target. Warm pool sits at 100 instances in Stopped state. AMI cold-boots in ~90 s; a warm-pool instance reaches InService in ~20 s.

Friday 23:55 local. Scheduled action fires. Minimum and desired both jump from 50 to 500. EC2 Auto Scaling pulls instances out of the warm pool as fast as it can: the first 100 come from the pool at ~20 s each; the remaining 350 cold-launch at ~90 s. Total time to 500 in service: about 3 minutes. The scheduled minimum pins the floor, and neither target tracking nor predictive can scale it back.
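The split between warm and cold launches in this step is simple arithmetic, sketched here with a hypothetical helper split_launches. It assumes the group drains the warm pool first and cold-launches only the remainder.

```python
def split_launches(current, target, warm_pool_size):
    """Return (instances taken from the warm pool, instances cold-launched)."""
    needed = max(target - current, 0)
    from_pool = min(needed, warm_pool_size)
    return from_pool, needed - from_pool


# 50 instances in service, new floor of 500, 100 instances waiting in the pool.
from_pool, cold = split_launches(current=50, target=500, warm_pool_size=100)
```

With the figures from the trace, 450 new instances are needed: 100 come from the pool and 350 are cold launches, which is why the pool size sets how much of a scheduled jump gets the fast path.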

Saturday 00:00-06:00. Real traffic arrives on the predicted ramp. CPU on the 500-instance group settles around 55%, slightly above target. Target tracking responds: it provisions up to 540, then 580, then 620 as the ramp steepens. Warm-pool replenishment runs in the background so every new wave is met with a warm head rather than a cold boot.

Saturday 06:00 local. Paired scheduled action drops the minimum back to 50 and leaves desired capacity alone, since setting desired to 50 directly would terminate the surplus at once. Target tracking takes over scale-in; because it biases toward availability, the drain takes 15-30 minutes rather than happening all at once. By 06:30 the group is near baseline, with warm-pool inventory back to its resting size.

Saturday 10:47. A celebrity posts about a jacket. Traffic to that product climbs from baseline to 4x in about six minutes. Scheduled scaling has nothing in the calendar. Predictive’s Saturday 10:00-11:00 forecast says nothing special. Target tracking sees CPU cross 50% at 10:51 and orders capacity. The first batch arrives from the warm pool at 10:51:30, with 100 instances InService in under a minute. A cold-launch batch hits InService around 10:53. By 12:30 interest has moved on; target tracking drains the extras over the next half hour.

Where predictive scaling doesn’t help

Predictive scaling’s value comes from spotting a repeating pattern. It does well with weekday-vs-weekend shapes, morning and evening peaks, lunchtime dips, and fortnightly payday bumps if the shape repeats. It refreshes forecasts every six hours as new data arrives.

It does not do well with one-off events, irregular marketing spikes, or anything whose trigger is unpredictable. The model can see that traffic was high yesterday at 14:00, but if there isn’t a matching 14:00 peak on the same day of the previous week, it won’t raise capacity tomorrow at 14:00 in anticipation.

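And predictive scaling is scale-out only. In ForecastAndScale mode it adds capacity ahead of the forecast, and in ForecastOnly mode (the default for a new policy) it changes nothing at all. It won’t remove capacity when the forecast drops; scale-in is deliberately left to a dynamic policy.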

Warm pools and the spin-up budget

The warm-pool payoff depends on how long a cold launch actually takes. A modern Linux AMI with no bootstrap reaches InService in 60-90 s. A typical application image that pulls a container, warms a JVM, fills a connection pool, registers with service discovery, and pre-loads a cache can add another 3-5 minutes. Against a six-minute viral ramp, the reactive layer arrives after the peak, not during it.

Warm pools move the expensive work to pool-entry time. The launch lifecycle hook runs once when the instance enters the pool; when the instance is pulled into the group it’s already specialised and the transition to InService is dominated by network-connect work (target-group registration, health checks). That cuts typical InService time from minutes to tens of seconds.
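The spin-up budget can be checked with back-of-the-envelope arithmetic. The sketch below uses the six-minute ramp and launch times quoted in this article; the three-minute detection delay is an assumption (the time a dynamic policy takes to notice and act), and capacity_lands_before_peak is a hypothetical helper.

```python
def capacity_lands_before_peak(ramp_s, detect_s, time_to_in_service_s):
    """True if new capacity is in service before the ramp reaches its peak."""
    return detect_s + time_to_in_service_s < ramp_s


RAMP_S = 6 * 60        # six-minute viral ramp
DETECT_S = 3 * 60      # assumed delay before the dynamic policy orders capacity
COLD_S = 90 + 4 * 60   # 90 s boot plus about 4 minutes of application bootstrap
WARM_S = 20            # warm-pool instance reaching InService

cold_in_time = capacity_lands_before_peak(RAMP_S, DETECT_S, COLD_S)
warm_in_time = capacity_lands_before_peak(RAMP_S, DETECT_S, WARM_S)
```

Under these assumptions the cold path needs 510 seconds against a 360-second ramp and misses the peak, while the warm path needs 200 seconds and makes it.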

Three configuration choices matter. State: Stopped is the default (pay EBS and EIPs only); Hibernated preserves RAM (useful for JVM warm-up); Running pays full compute and is discouraged. Pool size: rule of thumb, cover the largest single expected scale-out event; scheduled scaling makes this easier because the peak is known. Reuse: a reuse policy lets instances return to the pool on scale-in rather than terminate, cutting bootstrap cost on subsequent events.
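Those three choices map onto a small set of parameters. The sketch below builds them using the parameter names of boto3's put_warm_pool; the group name, the pool size of 100, and the helper warm_pool_settings are assumptions for illustration, and nothing is sent to AWS.

```python
VALID_POOL_STATES = {"Stopped", "Hibernated", "Running"}


def warm_pool_settings(group_name, pool_min_size, pool_state="Stopped",
                       reuse_on_scale_in=True):
    """Build warm pool parameters covering state, size, and reuse."""
    if pool_state not in VALID_POOL_STATES:
        raise ValueError(f"unknown pool state: {pool_state!r}")
    if pool_min_size < 0:
        raise ValueError("pool_min_size must be non-negative")
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": pool_min_size,            # keep at least this many instances warm
        "PoolState": pool_state,             # Stopped: pay for storage, not compute
        "InstanceReusePolicy": {"ReuseOnScaleIn": reuse_on_scale_in},
    }


settings = warm_pool_settings("shop-asg", pool_min_size=100)
```

Hibernated would be the variant to try when JVM warm-up dominates startup, at the cost of storing each instance's memory on disk.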

What’s worth remembering

Seven mechanisms exist: simple, step, target tracking, scheduled, predictive, warm pools, lifecycle hooks. Each covers a different axis; real production uses several at once.
Scheduled scaling is the correct tool for known events. It sets minimum/maximum/desired on a calendar and pins the group against other policies trying to scale it down.
Predictive scaling is scale-out only, and only acts at all in ForecastAndScale mode (new policies default to ForecastOnly). It needs 24 hours of history, refreshes every 6 hours, forecasts 48 hours ahead from up to 14 days of data, and pairs with a dynamic policy for scale-in.
Target tracking is reactive and intentionally asymmetric: eager scale-out, cautious scale-in. It manages its own CloudWatch alarms and prefers warmup to cooldown.
Step scaling is the tool when breach magnitude matters and different severities deserve different responses. Uses instance warmup rather than flat cooldown.
Warm pools cut spin-up latency but don’t decide when to scale. They’re a performance optimisation for every layer above, not a scaling policy.
Lifecycle hooks pause launches and terminations for custom work. 1-hour default, 48-hour max. They’re what make warm pools useful and clean scale-in possible.
When multiple policies are active, the group takes the maximum desired capacity across all of them. Scheduled minimums can’t be undercut by predictive or dynamic policies, which is why the floor-setting layer is the one that holds.