ECS on Fargate or on EC2

The situation

A platform team runs four services on one ECS cluster in eu-west-1:

api, a 24/7 HTTP service behind an ALB. ~30 tasks, each 2 vCPU / 4 GB. CPU-bound under normal load at ~35% utilisation; autoscales between 20 and 60 tasks on TargetTrackingScaling.
worker, a 24/7 queue consumer pulling from SQS. ~12 tasks, each 1 vCPU / 2 GB. Runs at 60% CPU baseline; scales to 30 during campaign pushes.
batch-nightly, an EventBridge-triggered task that runs 01:00-04:00 UTC. 20 tasks, each 4 vCPU / 16 GB. Runs hot for three hours, then nothing until tomorrow.
ml-training, weekly, runs on Sunday afternoons. 8 tasks, each 8 vCPU / 32 GB, each with a 100 GB scratch volume. GPU not required; CPU-bound.

Today everything runs on a pool of ~15 EC2 instances (c6i.2xlarge, c6i.4xlarge) managed by ECS Capacity Providers with Auto Scaling Groups. Somebody maintains AMIs, somebody handles the OS patching rota, somebody debugged a bin-packing issue last quarter that left 30% of cluster CPU unused. The team wants to know, service by service, whether moving to Fargate would save money, save operational toil, both, or neither.

What actually matters

Before comparing sticker prices, it’s worth naming what each launch type actually gives and takes.

The first thing to be clear about is what ECS does vs what the launch type does. ECS, the control plane, schedules tasks onto compute capacity, manages task definitions, and integrates with load balancers, CloudWatch, and IAM. The launch type, EC2 or Fargate, is how the compute gets provided. With EC2, you bring the instances; ECS places tasks on them using whatever bin-packing strategy you choose. With Fargate, AWS provides a correctly-sized microVM per task, charged per vCPU-hour and memory-GB-hour. ECS doesn’t know or care which it is from a task-definition perspective (mostly; a few fields differ).

The second thing is the per-hour economics. At list price, the serverless option and a perfectly filled slice of the self-managed option are roughly dead even per vCPU-hour. The Fargate-style number is what you always pay; the EC2-style number is what you pay only if bin-packing is perfect. In practice, real clusters run at 60-80% utilisation because of task boundaries, AZ spread, scaling headroom, and drain buffers. A cluster running at three-quarters utilisation pays for four vCPUs to use three, which makes the self-managed slice about a third more expensive per used vCPU-hour than its list price suggests. The serverless premium is only a premium if your cluster packs well.
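The utilisation argument above is simple division, and a minimal sketch makes the size of the effect visible. The rate used here is an illustrative placeholder, not a quoted AWS price, and the function name is hypothetical.

```python
def cost_per_used_vcpu_hour(list_rate: float, utilisation: float) -> float:
    """Effective price of a vCPU-hour that actually runs a task.

    list_rate   -- what one provisioned vCPU-hour costs
    utilisation -- fraction of provisioned vCPU that tasks occupy (0 < u <= 1)
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return list_rate / utilisation


# Illustrative placeholder rate (an assumption, not a published price).
ILLUSTRATIVE_RATE = 0.040

if __name__ == "__main__":
    for u in (1.0, 0.8, 0.75, 0.6):
        effective = cost_per_used_vcpu_hour(ILLUSTRATIVE_RATE, u)
        premium = effective / ILLUSTRATIVE_RATE - 1
        print(f"utilisation {u:.0%}: ${effective:.4f} per used vCPU-hour (+{premium:.0%})")
```

A per-task billing model sits at the 100% row by construction, which is why the comparison has to be made against the self-managed cluster's measured utilisation rather than its list price.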

The third thing is patching and OS lifecycle. Running containers on instances the team owns means an AMI pipeline, a rolling rota of rebuilds as security updates ship, and a cordon-drain-replace dance to keep services up while hosts get replaced. The team has automation for this, but the automation is itself something to own: it was written once and will need updates forever. The serverless option pushes the OS below the line of customer responsibility and reduces the operational surface to “redeploy when the platform version changes.”

The fourth thing is networking. Per-task networking (each task gets its own ENI, its own security group, and its own private IP) is the cleanest design, but it consumes ENIs from the host. On the self-managed side, instance type effectively dictates ENI count, which means instance-type selection ends up being driven by network density rather than CPU shape. The serverless option sidesteps the ENI-density conversation by hiding the host entirely.
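To see how ENI count can become the binding constraint, the following sketch computes how many awsvpc tasks fit on one instance under three limits. It assumes one ENI per task and one ENI reserved for the instance itself, ignores ENI trunking and the memory the OS and ECS agent hold back, and takes the instance's ENI limit as an input you would look up in the EC2 documentation. The function name and the example figures are hypothetical.

```python
def tasks_per_instance(instance_vcpu, instance_mem_gib, instance_enis,
                       task_vcpu, task_mem_gib, reserved_enis=1):
    """Return (task_count, binding_limit) for awsvpc tasks on one host.

    Assumes each task consumes exactly one ENI and that `reserved_enis`
    are kept for the instance's own networking.
    """
    limits = {
        "cpu": int(instance_vcpu // task_vcpu),
        "memory": int(instance_mem_gib // task_mem_gib),
        "eni": max(instance_enis - reserved_enis, 0),
    }
    binding = min(limits, key=limits.get)  # first key wins on ties
    return limits[binding], binding


if __name__ == "__main__":
    # Hypothetical host: 8 vCPU, 16 GiB, 4 ENIs (check the real limit for your type).
    # Task shape taken from the worker service: 1 vCPU / 2 GB.
    count, limit = tasks_per_instance(8, 16, 4, task_vcpu=1, task_mem_gib=2)
    print(f"{count} tasks per instance, limited by {limit}")
```

With those inputs CPU and memory would each allow eight tasks, but the ENI budget allows three, which is the "network density rather than CPU shape" effect in miniature.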

The fifth thing is the workload shapes the serverless option can’t accommodate. No GPU. No privileged mode. No bind-mounting arbitrary host paths. No daemon scheduling (one task per host). Scratch storage capped at a modest per-task limit, not arbitrary EBS volumes. If a workload needs local SSD scratch in hundreds of GB or TB, or a GPU, the serverless option forces a data- or hardware-architecture change.

The sixth thing is savings-commitment compatibility. Compute Savings Plans cover both serverless and self-managed compute under one commit, which matters for the “serverless is expensive” argument: the commitment follows the usage from one launch type to the other without stranding spend. A team committing to a multi-year discount can let that commitment re-aim itself as the launch-type mix changes.
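A rough model of how an hourly commitment absorbs usage helps make the "no stranded spend" point concrete. This is a simplified sketch, not AWS's billing algorithm: it assumes the commitment is consumed at discounted rates, that the highest-discount usage is covered first, and that anything left over is billed on demand. The discount percentages in the example are hypothetical placeholders.

```python
def hourly_bill(commit, usage):
    """Total charge for one hour under a Savings-Plan-style commitment.

    commit -- committed spend per hour (always paid, used or not)
    usage  -- list of (on_demand_cost, discount) line items for the hour
    """
    commit_left = commit
    on_demand_overflow = 0.0
    # Apply the commitment to the most heavily discounted usage first.
    for on_demand_cost, discount in sorted(usage, key=lambda item: item[1], reverse=True):
        discounted_cost = on_demand_cost * (1 - discount)
        covered = min(commit_left, discounted_cost)
        commit_left -= covered
        uncovered_fraction = 1 - covered / discounted_cost if discounted_cost else 0.0
        on_demand_overflow += on_demand_cost * uncovered_fraction
    return commit + on_demand_overflow


if __name__ == "__main__":
    commit = 0.90
    # Hypothetical discounts: 28% on EC2 usage, 20% on Fargate usage.
    before = [(1.20, 0.28)]                # all EC2
    after = [(0.30, 0.28), (1.00, 0.20)]   # mostly Fargate
    print(f"before migration: ${hourly_bill(commit, before):.3f}/hour")
    print(f"after migration:  ${hourly_bill(commit, after):.3f}/hour")
```

The same `commit` value is passed in both cases; only the usage mix changes, which is the property that makes a launch-type migration safe to do mid-term.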

What we’ll filter on

Filters for each service against each launch type:

Cost at realistic utilisation. Fargate list price vs EC2 after accounting for bin-pack waste.
Operational overhead. Patching, AMIs, scaling group tuning, bin-pack debugging.
Workload fit. CPU/RAM shape, scratch storage, GPU, privileged mode, long-running vs bursty.
Networking flexibility. ENI-per-task limits on EC2 vs none on Fargate.
Start-up time. How fast a new task becomes ready.
Savings-commitment compatibility. Can Savings Plans or RIs cover this?

The launch-type landscape

1. Fargate. Serverless containers. Per-task microVM, awsvpc networking mandatory, per-second billing with a one-minute minimum, ~$0.04048 per vCPU-hour and ~$0.004445 per GB-hour in eu-west-1. ARM option (Graviton) at ~20% discount. No OS patching by the customer. Start-up time typically 30-60 seconds for a small image from ECR. No GPU, no privileged mode, no host-path bind mounts, 20-200 GB ephemeral storage per task. Fargate Spot available for interruptible workloads at ~70% off.
2. EC2, Auto Scaling Group + ECS Capacity Provider. The familiar model. Choose instance types, sizes, and an ASG; ECS places tasks via the Capacity Provider, which can scale the ASG up and down based on reservation. Placement strategy (binpack on CPU or memory, spread across AZs/instances, or random) controls where tasks land. You own patching, rightsizing, Spot blending. Start-up time is task placement (seconds) plus sometimes an ASG scale-out (minutes).
3. EC2 Spot. Same as (2) with the ASG configured to use Spot capacity. Up to 90% off for workloads that tolerate 2-minute eviction. With Spot instance draining enabled on the container agent, ECS reacts to the eviction notice by marking the instance as draining: it stops placing new tasks there and existing tasks drain.
4. Fargate Spot. Fargate with interrupt semantics. ~70% off the Fargate on-demand rate. 2-minute termination warning delivered via SIGTERM to the task. Fewer features to worry about than EC2 Spot (no ASG, no diversification strategy), but capacity is sometimes constrained (it’s a best-effort pool).
5. EC2 with Graviton (c7g, m7g). ARM64 EC2 instances at a 20-40% price-performance advantage for compatible workloads. Requires container images built for ARM64 (multi-arch images via docker buildx). Combines with (2) for self-managed ARM clusters.
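The Fargate rates quoted above translate into per-task-hour prices with one multiplication per dimension. The sketch below applies them to the four task shapes from the scenario; the ~70% Spot and ~20% Graviton figures are the approximate discounts mentioned in the text, and the calculation leaves out ephemeral storage beyond the default and data transfer. The function and constant names are hypothetical.

```python
# eu-west-1 list rates as quoted in the text (approximate).
VCPU_HOUR = 0.04048
GB_HOUR = 0.004445


def fargate_task_hour(vcpu: float, memory_gb: float, discount: float = 0.0) -> float:
    """On-demand price of one task-hour, optionally reduced by a fractional discount."""
    return (vcpu * VCPU_HOUR + memory_gb * GB_HOUR) * (1 - discount)


# Task shapes from the scenario: (vCPU, memory in GB).
SHAPES = {
    "api": (2, 4),
    "worker": (1, 2),
    "batch-nightly": (4, 16),
    "ml-training": (8, 32),
}

if __name__ == "__main__":
    for name, (vcpu, mem) in SHAPES.items():
        on_demand = fargate_task_hour(vcpu, mem)
        spot = fargate_task_hour(vcpu, mem, discount=0.70)      # approximate
        graviton = fargate_task_hour(vcpu, mem, discount=0.20)  # approximate
        print(f"{name:14s} on-demand ${on_demand:.4f}  spot ${spot:.4f}  graviton ${graviton:.4f}")
```

Multiplying any of these by task count and hours gives a first-order monthly figure to set against the utilisation-adjusted EC2 cost.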

Side by side

Option	Realistic cost	Ops overhead	Fit range	Networking	Start-up time	Savings
Fargate	Medium	Minimal	24/7 + bursty, no GPU, ≤200 GB ephemeral	awsvpc only, no ENI limits	30-60s	SP Compute
EC2 on-demand	Medium-High (with waste)	High	Anything	awsvpc, bridge, host	Placement fast, scale-out slow	SP, RI
EC2 Spot	Low (interruption tolerance needed)	Highest	Interruption-tolerant batch	Same as EC2	Same	Spot
Fargate Spot	Low (interruption tolerance needed)	Minimal	Interruption-tolerant, no GPU	awsvpc	30-60s	n/a (Spot)
EC2 Graviton	Low-Medium	High + arch management	ARM-compatible workloads	Same as EC2	Same	SP, RI

Reading the table service by service is where the actual decisions live.

Matching each service to a launch type

Four services, three different answers. api and worker go to Fargate on Compute Savings Plans; batch-nightly goes to Fargate Spot; ml-training stays on EC2 because 100 GB per-task scratch storage is outside Fargate's sensible range.

The picks in depth

api and worker → Fargate + Compute Savings Plan. Both are 24/7 services with predictable baseline usage and no special infrastructure needs. A Compute Savings Plan commit at the baseline (for example, $0.90/hour for the combined steady-state Fargate usage) shrinks the effective Fargate rate to ~$0.068 per 2 vCPU / 4 GB task-hour, down from ~$0.099 on demand at the list rates above. That’s marginally cheaper than the EC2 equivalent at 70% cluster utilisation, and the team stops patching instance AMIs, tuning ASGs, and debugging bin-pack failures. Task definition changes are minimal: requiresCompatibilities: [FARGATE], networkMode: awsvpc, remove any host-path volumes, use ephemeralStorage.sizeInGiB if more than the 20 GB default is needed.

Graviton (runtimePlatform.cpuArchitecture: ARM64) is a further 20% off if the container images build cleanly for ARM64. For Node.js, Go, and most Python the answer is “yes, with a multi-arch image”; for workloads pinned to x86-only native dependencies it’s a case-by-case call.

batch-nightly → Fargate Spot. Three hours per day, checkpointed, tolerant of restart. Fargate Spot at ~70% off the on-demand Fargate rate turns a ~$14/night job (20 tasks × 3 hours × ~$0.233 per 4 vCPU / 16 GB task-hour at list rates) into a ~$4/night one, without ASG pre-warming, Spot diversification strategies, or instance-type babysitting. The 2-minute SIGTERM is handled by the batch controller, which already writes checkpoints between stages. Capacity interruption risk is real (Fargate Spot is a best-effort pool), but the workload’s SLA is “done by 06:00,” which leaves two hours of slack after the usual 04:00 finish for retries.

{
  "capacityProviderStrategy": [
    { "capacityProvider": "FARGATE_SPOT", "weight": 4 },
    { "capacityProvider": "FARGATE", "weight": 1 }
  ]
}

80% Spot, 20% on-demand as capacity-not-available insurance. The 20% on-demand share keeps the job moving, at reduced throughput, if Spot capacity is unavailable for a whole night.
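The way weights (and an optional base) divide a task count between capacity providers can be sketched as follows. This is an approximation of the documented base-then-weight idea, not ECS's actual placement code; in particular, the rounding rule for fractional shares (largest remainder) is an assumption, and the function name is hypothetical.

```python
def split_tasks(total, strategy):
    """Divide `total` tasks across providers.

    strategy -- list of (provider_name, weight, base) tuples.
    Base counts are satisfied first; the remainder is split by weight ratio.
    """
    counts = {}
    remaining = total
    for name, _weight, base in strategy:
        take = min(base, remaining)
        counts[name] = take
        remaining -= take

    total_weight = sum(weight for _name, weight, _base in strategy)
    if remaining and total_weight:
        shares = [(name, remaining * weight / total_weight) for name, weight, _base in strategy]
        for name, share in shares:
            counts[name] += int(share)
        leftover = remaining - sum(int(share) for _name, share in shares)
        # Hand out any rounding leftovers to the largest fractional parts (assumed rule).
        by_fraction = sorted(shares, key=lambda item: item[1] - int(item[1]), reverse=True)
        for name, _share in by_fraction[:leftover]:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    nightly = [("FARGATE_SPOT", 4, 0), ("FARGATE", 1, 0)]
    print(split_tasks(20, nightly))
```

For the 20-task nightly job the 4:1 weights work out to 16 tasks on Spot and 4 on demand, matching the 80/20 split described above.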

ml-training → EC2 (with a dedicated ASG for Sundays). The 100 GB of local scratch per task is the deciding factor against Fargate: it fits under the 200 GB ephemeral cap on paper, but the team judges it outside Fargate's sensible range. The plan is a dedicated ASG of c6i.4xlarge or c7g.4xlarge instances with 500 GB gp3 EBS volumes, scaled out at 14:00 Sunday via a scheduled Auto Scaling action and scaled to zero at 22:00 Sunday. Instance Store variants (c6id.4xlarge) give faster NVMe scratch that does not survive stop/restart, which is fine for this workload because the data is derived each run.

The team already has EC2 automation; adding a second Capacity Provider for the ML ASG with managedScaling: DISABLED and a cron-driven desired-count adjustment is ~20 lines of Terraform. EC2 Spot could halve the cost further if training checkpoints between epochs; worth trialling but not the migration-day decision.
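As a rough illustration of the cron-driven desired-count adjustment, the two scheduled actions can be written as request parameters for the EC2 Auto Scaling `PutScheduledUpdateGroupAction` API. This sketch uses Python rather than the Terraform the team would write; the ASG name, action names, instance count, and the choice of UTC for the 14:00 and 22:00 times are all assumptions.

```python
def sunday_window_actions(asg_name: str, desired_capacity: int) -> list[dict]:
    """Build parameters for a scale-out at 14:00 and a scale-to-zero at 22:00 on Sundays."""
    common = {"AutoScalingGroupName": asg_name, "TimeZone": "Etc/UTC"}  # UTC is assumed
    return [
        {
            **common,
            "ScheduledActionName": "ml-training-scale-out",  # hypothetical name
            "Recurrence": "0 14 * * 0",  # cron: minute hour day-of-month month day-of-week (0 = Sunday)
            "MinSize": 0,
            "MaxSize": desired_capacity,
            "DesiredCapacity": desired_capacity,
        },
        {
            **common,
            "ScheduledActionName": "ml-training-scale-in",  # hypothetical name
            "Recurrence": "0 22 * * 0",
            "MinSize": 0,
            "DesiredCapacity": 0,
        },
    ]


if __name__ == "__main__":
    # Hypothetical ASG name and size. To apply with boto3 (not executed here):
    #   client = boto3.client("autoscaling")
    #   for action in actions:
    #       client.put_scheduled_update_group_action(**action)
    actions = sunday_window_actions("ml-training-asg", desired_capacity=4)
    for action in actions:
        print(action["ScheduledActionName"], action["Recurrence"], action["DesiredCapacity"])
```

Keeping the schedule on the ASG rather than on the Capacity Provider is what makes managedScaling: DISABLED sensible here: ECS places tasks on whatever capacity the schedule has already provided.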

A worked migration day

The api service migration runs on a Tuesday morning.

# 1. Register a new task definition with Fargate compatibility
$ aws ecs register-task-definition \
    --family api \
    --cpu 2048 \
    --memory 4096 \
    --network-mode awsvpc \
    --requires-compatibilities FARGATE \
    --execution-role-arn arn:aws:iam::111122223333:role/ecsTaskExecutionRole \
    --task-role-arn arn:aws:iam::111122223333:role/api-task-role \
    --container-definitions file://api-containers.json

# 2. Update the service to use the new task definition and Fargate
$ aws ecs update-service \
    --cluster platform \
    --service api \
    --task-definition api:47 \
    --capacity-provider-strategy capacityProvider=FARGATE,weight=1,base=20 \
    --network-configuration 'awsvpcConfiguration={subnets=[subnet-aaa,subnet-bbb,subnet-ccc],securityGroups=[sg-api-task],assignPublicIp=DISABLED}' \
    --force-new-deployment

Deployment id: ecs-svc/2345678901234567

ECS drains the EC2 tasks and launches Fargate tasks in parallel. The target group registers the new tasks; the old ones deregister once they’re connection-drained. Total migration time: about 12 minutes for 30 tasks, done in the background while the service stays up.

We watch:

CloudWatch target-group metrics: HealthyHostCount holds steady at ~30 throughout.
ALB TargetResponseTime p95 unchanged at ~140ms (the container is the same; only the compute underneath changed).
CloudWatch Container Insights: CPU and memory utilisation per task, equivalent to before.
Cost Explorer: the line item shifts from Amazon EC2 running Linux/UNIX to AWS Fargate vCPU-Hours and Memory-GB-Hours, and the ASG holding the api tasks scales to zero.

The ASG scales to zero over an hour. At the end of the week, the ASG is left at zero capacity rather than deleted (keep it around for the ML service). The AMI pipeline’s weekly patching build no longer needs to target the api and worker instance tier, which trims the internal CI cost too.

What’s worth remembering

Fargate and EC2 are launch types, not competing products. ECS is the scheduler; the launch type is how compute capacity is provisioned. A single cluster can mix both (capacityProviderStrategy per service).
Fargate’s per-hour rate only looks high until you account for bin-pack waste. At realistic EC2 cluster utilisation (60-80%), Fargate’s list price is roughly even; with Compute Savings Plans, usually cheaper.
Fargate hands back the patching rota. No AMIs, no ASG tuning, no bin-pack debugging. The platform-version upgrade is a forceNewDeployment on a quiet afternoon.
Fargate networking is awsvpc-only, and that’s mostly good. Each task gets its own ENI, security group, private IP. No per-instance ENI limits to design around.
Fargate can’t do everything EC2 can. No GPU, no privileged mode, no host-path mounts, ephemeral storage caps at 200 GB per task. Workloads needing any of these stay on EC2.
Fargate Spot is the cleanest batch option going. ~70% off Fargate on-demand, no ASG, no diversification strategy; two-minute SIGTERM for well-behaved workloads.
Compute Savings Plans cover both EC2 and Fargate. A single dollar-commit applies to whichever you run. Switching from EC2 to Fargate mid-term doesn’t strand the commitment.
Graviton is 20% cheaper for compatible workloads. Multi-arch Docker images via buildx; most managed runtimes work transparently, native dependencies are the case-by-case call.

Serverless containers are the correct default for services without infrastructure-specific needs; the cluster you own is still the correct answer when the workload has a shape Fargate can’t accommodate. Four services, three launch types, one ECS control plane stitching them together.