Instance Store or EBS for Scratch Disk

The situation

A streaming analytics pipeline runs on EC2 in eu-west-1 and has three workloads with scratch disk needs:

Kafka-compatible broker fleet (r6i.2xlarge × 9, three per AZ). The log is durable at the cluster level: the hot tier of segments sits on each broker’s local volume and is offloaded to S3 tiered storage every few minutes. If a broker dies, the replicas in the other two AZs carry the load; the dead broker’s local state is thrown away and the broker rejoins with a fresh local cache from its peers.
Stream processor fleet (r6i.xlarge × 40). Stateful stream joins using RocksDB as the local state store. Each node checkpoints RocksDB state to S3 every 10 minutes. On crash, the node resumes from S3 with a minute or two of re-consumption from Kafka to catch up.
Ad-hoc query workers (c6i.4xlarge, variable count). Materialise a subset of the data lake into local columnar files for interactive SQL queries. Queries are tolerant of a cold cache (they’ll re-read from S3), but hot cache makes response times 10× faster.

Today everything uses gp3 EBS volumes. The brokers have 2 TB each; the processors have 500 GB each; the query workers have 1 TB each. The cloud bill shows EBS as a non-trivial line item, and a colleague asked “could any of these use Instance Store?”, because Instance Store is bundled with the instance type at no per-GB charge, and the relevant instance families (i4i, r6id, c6id) offer NVMe Instance Store variants.

What actually matters

Before chasing the price delta, it’s worth being precise about the two storage types and how they fail.

The first thing is where the bytes live. EBS is network-attached block storage: the volume lives in a storage fabric and reaches the instance over AWS’s purpose-built network. When the instance stops, the volume detaches and persists; when the instance starts, it reattaches. Instance Store is physically present on the hardware hosting the instance: NVMe SSDs in the host, directly attached over PCIe. It’s faster by nature (no network hop) and cannot move with the instance because it isn’t addressable from anywhere else.
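From inside a Nitro-based instance both storage types appear as NVMe block devices, and the model string is what tells them apart. The following is a minimal sketch under that assumption; the function names classify_model and list_disks are hypothetical helpers, not part of any AWS tooling.

```shell
# Classify a block device by its NVMe model string.
# Assumption: a Nitro-based instance, where EBS and Instance Store both show up as NVMe.
classify_model() {
  case "$1" in
    *"Amazon Elastic Block Store"*)       echo "ebs" ;;
    *"Amazon EC2 NVMe Instance Storage"*) echo "instance-store" ;;
    *)                                    echo "unknown" ;;
  esac
}

# Print each whole disk with its classification (hypothetical helper; not called here).
list_disks() {
  lsblk -dnpo NAME,MODEL | while read -r name model; do
    printf '%s\t%s\n' "$name" "$(classify_model "$model")"
  done
}
```

Running list_disks on a host prints one line per disk, which is a quick way to confirm which mount points would vanish on a stop.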

The second thing is the durability contract. Network-attached block storage is replicated across multiple devices within an Availability Zone and carries a quoted annualised failure rate small enough that it isn’t usually the design constraint. Host-local storage has no such guarantee: it is a set of drives on a specific host, and if the host loses power, a drive fails, or the instance stops, the data is gone. The contract is explicit: host-local data is lost on stop, hibernate, or termination of the instance, and on underlying disk drive failure. Reboot doesn’t lose it; stop does.

The third thing is when “ephemeral” is actually ephemeral. A workload that writes to local storage and keeps the only copy there is not treating the storage as ephemeral; it’s treating it as durable and hoping. A workload that writes to local storage and treats the local copy as a cache of something authoritative stored elsewhere (object storage, a replicated cluster, a checkpoint) is treating it as ephemeral. The host-local choice only works when the loss of the local bytes is recoverable without a human involved.

The fourth thing is performance. Host-local NVMe can deliver an order of magnitude more IOPS than a gp3 volume, along with higher bandwidth and sub-millisecond latency, because there’s no network in the path; the largest storage-optimised instances go well beyond the io2 Block Express per-volume ceiling too. For IOPS-hungry or bandwidth-hungry workloads, host-local storage can be an order of magnitude cheaper per IOPS, if the durability story fits.

The fifth thing is cost shape. Host-local capacity comes bundled with the instance-type sticker; there’s no separate per-GB-month line item. Choosing an instance variant that includes local NVMe usually costs a small hourly premium over the non-local equivalent and replaces a much larger per-GB-month block-storage bill. The crossover where bundled local storage becomes cheaper than the equivalent block storage sits at a modest capacity per instance. Above that threshold, the bundled option pays for itself; below it, the cheaper non-local instance plus a small block volume wins.
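As a rough sketch of that crossover, the break-even capacity is the hourly premium times the hours in a month, divided by the per-GB-month block-storage price. The inputs used here are the r6id.xlarge premium ($0.017/hour) and the gp3 price ($0.08/GB-month) quoted elsewhere in this piece; the 730-hour month and the function name crossover_gb are assumptions for illustration.

```shell
# Break-even capacity in GB: the amount of block storage whose monthly cost
# equals the monthly premium for the local-NVMe instance variant.
# Args: premium_usd_per_hour  price_usd_per_gb_month  [hours_per_month, default 730]
crossover_gb() {
  awk -v premium="$1" -v price="$2" -v hours="${3:-730}" \
    'BEGIN { printf "%.0f\n", premium * hours / price }'
}

crossover_gb 0.017 0.08   # prints 155
```

With these inputs the break-even lands at about 155 GB per instance; a different premium or region price moves it proportionally.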

The sixth thing is snapshot and restore. Network-attached block storage has incremental snapshots that copy to object storage, survive AZ failure, and form the foundation of backup and DR workflows. Host-local storage has no snapshot. If the ability to take a consistent point-in-time image of the volume matters, only block storage delivers it.

What we’ll filter on

Filters for each storage type against each workload:

Durability contract: what does “the data is still there” mean? Does it survive stop? Host failure?
Performance profile: IOPS, bandwidth, latency.
Cost per TB at this utilisation, including whether bundled capacity is utilised.
Snapshot/restore story: can we back up? Can we clone?
Attachment semantics: can the volume move with the instance across stops?
Recovery cost on loss: if the storage evaporates, what does it take to rebuild?

The storage landscape

EBS gp3. The general-purpose default. Baseline 3,000 IOPS and 125 MiB/s included with the volume; pay to raise either. $0.08/GB-month in eu-west-1. Durable within an AZ, snapshottable, encryptable, resizable online, and able to survive instance replacement (detach and reattach). Latency is typically single-digit milliseconds; a good fit for most workloads.
EBS io2 Block Express. The highest-performance EBS tier. Up to 256,000 IOPS and 4,000 MiB/s per volume. $0.125/GB-month plus $0.065 per provisioned IOPS-month. Durable, snapshottable, encryptable. Sub-millisecond p50 latency. Used for database primaries where consistent low latency matters.
Instance Store NVMe (e.g. i4i, r6id, c6id). Bundled NVMe on the host. Hundreds of thousands of IOPS, multi-GB/s bandwidth, sub-millisecond latency. No per-GB charge (bundled with instance). Lost on stop/terminate/underlying-hardware-failure; survives reboot. Not snapshottable. Cannot move with the instance across stops.
EBS-backed RAID 0 across multiple volumes. Stripe several gp3 or io2 volumes into a single logical volume in the OS. Each volume independently durable; the stripe multiplies IOPS and bandwidth. Same EBS durability story applies per volume, though losing one volume in the stripe takes the stripe offline until rebuilt (by design, it’s a stripe, not a mirror).
EFS or FSx for shared filesystems. Not really an Instance Store peer, but worth naming. Shared across instances, durable, managed. Not a block device; not a drop-in replacement for EBS or Instance Store. Different shape of problem.

Side by side

Option	Durability	IOPS ceiling	Bandwidth ceiling	Cost profile	Snapshottable	Survives stop
EBS `gp3`	AZ-durable	16,000 (3,000 baseline)	1,000 MiB/s	$0.08/GB-mo	✓	✓
EBS `io2 Block Express`	AZ-durable	256,000	4,000 MiB/s	$0.125/GB-mo + IOPS	✓	✓
Instance Store NVMe	None (host-local)	Hundreds of thousands	Multi-GB/s	Bundled	✗	✗
EBS RAID 0 (4× gp3)	AZ-durable per volume	64,000	4,000 MiB/s	4 × $0.08/GB-mo	✓ per volume	✓
EFS/FSx	Regional/AZ	Varies	Varies	Per GB-mo + throughput	✓	n/a

Reading this against the three workloads:

Kafka brokers: cluster-level replication makes the local disk disposable. Instance Store fits.
Stream processors: RocksDB state with 10-minute S3 checkpointing makes the local disk disposable (two-minute replay on restart). Instance Store fits.
Ad-hoc query workers: S3-backed cache, cold cache is slow but correct. Instance Store fits cleanly.

All three workloads already treat their local storage as a cache of something authoritative stored elsewhere, the exact shape Instance Store was designed for.

Instance Store vs EBS decision flow

Three questions sort ephemeral from durable. Instance Store becomes the answer only when the data is not the only copy, does not need to survive a stop, and the performance gain justifies it.
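The three questions can be written down as a small function. This is only a sketch of the decision flow described above; the function name pick_storage and the yes/no argument convention are assumptions made for illustration.

```shell
# Args are "yes" or "no", in this order:
#   only_copy          is the local disk the only copy of the data?
#   must_survive_stop  must the data still be there after an instance stop?
#   perf_justifies     does the workload benefit enough from local NVMe?
pick_storage() {
  local only_copy="$1" must_survive_stop="$2" perf_justifies="$3"
  if [ "$only_copy" = "no" ] && [ "$must_survive_stop" = "no" ] && [ "$perf_justifies" = "yes" ]; then
    echo "instance-store"
  else
    echo "ebs"
  fi
}

pick_storage no no yes    # replicated broker log: instance-store
pick_storage yes no yes   # only copy of the data: ebs
```

Any single "wrong" answer falls through to EBS, which matches treating EBS as the default and Instance Store as the exception.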

The picks in depth

Kafka brokers → i4i.2xlarge Instance Store. The i4i family is built for storage workloads: the i4i.2xlarge carries 1.875 TB of NVMe, rated by AWS at roughly 200,000 random-read and 110,000 write IOPS per instance. Brokers write segment files at several hundred MB/s under load; gp3 would need extra provisioned IOPS and throughput to keep up, whereas Instance Store does it out of the box. The Kafka replication factor of 3 across AZs means a lost broker’s local log is rebuilt from its peers in minutes, not hours. The cost story: i4i.2xlarge at $0.744/hour includes the 1.875 TB NVMe; the EBS alternative is 2 TB of gp3 at $164/month on top of the instance cost, which comes out more expensive and still performs worse.

Stream processors → r6id.xlarge Instance Store. 237 GB NVMe on r6id.xlarge, enough for RocksDB state with room to grow. Checkpoints to S3 every 10 minutes bound recovery to roughly 12 minutes in the worst case (up to 10 minutes of input since the last checkpoint to replay from Kafka, plus about 2 minutes to catch up). The $0.017/hour premium over r6i.xlarge (Instance-Store-less) comes to about $12/month per node and replaces the 500 GB of gp3 ($40/month per node), so it pays for itself at any instance count. At 40 nodes, switching removes ~$1,600/month of gp3 spend and adds ~$500/month of instance premium, a net saving of ~$1,100/month, and it improves checkpoint write performance (RocksDB compaction is I/O-bound).
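The fleet-level arithmetic is worth spelling out, because the gp3 spend removed and the instance premium added land on different lines of the bill. A small sketch, using the figures from this section; the 730-hour month and the function name fleet_net_saving are assumptions for illustration.

```shell
# Net monthly saving for a fleet that swaps gp3 volumes for a local-NVMe variant.
# Args: nodes  ebs_gb_per_node  price_usd_per_gb_month  premium_usd_per_hour  [hours_per_month]
fleet_net_saving() {
  awk -v n="$1" -v gb="$2" -v price="$3" -v premium="$4" -v hours="${5:-730}" \
    'BEGIN {
       ebs   = n * gb * price        # gp3 spend that disappears
       extra = n * premium * hours   # instance premium that appears
       printf "ebs_removed=%.0f premium_added=%.0f net=%.0f\n", ebs, extra, ebs - extra
     }'
}

fleet_net_saving 40 500 0.08 0.017
# prints: ebs_removed=1600 premium_added=496 net=1104
```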

Query workers → c6id.4xlarge Instance Store. 950 GB NVMe on c6id.4xlarge. The cache materialisation is S3-backed; a cold query takes 45 seconds instead of 5, but correctness is unaffected. Instance Store’s higher bandwidth noticeably improves hot-cache query time (scan-bound workloads love local NVMe). The only wrinkle: the cache does not survive Auto Scaling Group replacements or scale-in, so every new worker starts cold; the team accepts the cache-warming cost on scale-up.

A worked broker migration

Kai migrates the broker fleet. The i4i.2xlarge exposes its Instance Store as a single raw NVMe device, while larger sizes in the family expose several, which need LVM or RAID 0 to appear as one volume. The bootstrap script has to cope with whatever the instance presents.

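#!/bin/bash
# Cloud-init user-data for new broker AMI
set -euo pipefail

# List Instance Store NVMe devices by model string (this skips EBS volumes, including root)
mapfile -t DEVICES < <(lsblk -dnpo NAME,MODEL | awk '/Amazon EC2 NVMe Instance Storage/ {print $1}')

if [ "${#DEVICES[@]}" -eq 0 ]; then
  echo "No Instance Store NVMe devices found" >&2
  exit 1
elif [ "${#DEVICES[@]}" -eq 1 ]; then
  # One device: use it directly (mdadm rejects a one-device RAID 0 unless forced)
  TARGET="${DEVICES[0]}"
else
  # Several devices: stripe them into a single RAID 0 array
  mdadm --create /dev/md0 --level=0 --raid-devices="${#DEVICES[@]}" "${DEVICES[@]}"
  TARGET=/dev/md0
fi

# Format and mount (-K skips the discard pass at mkfs time)
mkfs.xfs -K "$TARGET"
mkdir -p /var/lib/kafka
mount "$TARGET" /var/lib/kafka
chown kafka:kafka /var/lib/kafka

# Systemd drop-in: don't try to mount Instance Store devices across reboots
# (they persist but mdadm array needs reassembling; Kafka init handles it)

# Start Kafka; broker will bootstrap from peers if /var/lib/kafka is empty
systemctl start kafka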

Kai uses a blue-green strategy. New i4i.2xlarge brokers join the cluster; old r6i.2xlarge + EBS brokers leave one at a time, reassigning partitions as they go. Kafka’s partition reassignment tool moves log segments over the network; each broker takes ~20 minutes to drain. For a 9-broker cluster that’s three hours start to finish. The observable effects during migration:

HealthyHostCount on the cluster stays at 9 (one new for each old draining).
End-to-end latency unchanged; replication traffic briefly doubles.
Disk-write bandwidth per broker rises from ~400 MiB/s on gp3 (IOPS-capped) to ~1.2 GiB/s on NVMe (compression-capped). The broker CPU goes up to keep up; the team upsizes to i4i.4xlarge for two brokers that were hitting a CPU ceiling.
Monthly cost: EBS line drops by ~$1,500, EC2 line rises by ~$400. Net -$1,100/month.

What “host-failure” actually means for Instance Store

A subtlety worth being explicit about: Instance Store loss isn’t just “when we issue a stop.” AWS retires host hardware periodically, and a retired instance has to move to a new host, which means a new set of local NVMe drives, which means the data is gone. Retirement is typically scheduled 2-4 weeks out but can be short-notice if hardware is failing.

For a cluster with replication (brokers) or checkpointing (processors), retirement is a non-event: the instance terminates, a new one comes up, bootstrap happens, cluster rebalances. For workloads without that safety net, Instance Store retirement is a surprise data-loss event.

Automated responses: an Amazon EventBridge (formerly CloudWatch Events) rule matching AWS Health events of type AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED triggers a Lambda that drains the instance and replaces it before the forced replacement. Most clusters with Instance Store already have this habit because they handle replacement routinely anyway; adding the retirement handler is belt and braces.
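A minimal sketch of the event pattern such a rule might use, assuming AWS Health events are delivered to EventBridge in the account. The function names and the rule name are hypothetical placeholders, and the drain-and-replace Lambda target is not shown.

```shell
# Print an EventBridge event pattern that matches scheduled EC2 retirements.
retirement_event_pattern() {
  cat <<'JSON'
{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"],
  "detail": {
    "service": ["EC2"],
    "eventTypeCode": ["AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED"]
  }
}
JSON
}

# Register the rule. Not called here: it needs the AWS CLI and credentials.
# The rule name is a hypothetical placeholder.
install_retirement_rule() {
  aws events put-rule \
    --name "instance-store-retirement" \
    --event-pattern "$(retirement_event_pattern)"
}
```

Keeping the pattern in a function of its own makes it easy to review or diff before the rule is installed; attaching the Lambda as a target is a separate step.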

What’s worth remembering

Instance Store durability is host-local only. Data survives reboot; data does not survive stop, terminate, or host-level hardware failure/retirement. The only safe way to use it is as a cache of something authoritative stored elsewhere.
EBS gp3 is the right default for anything authoritative. AZ-durable, snapshottable, resizable online, $0.08/GB-month with 3,000 IOPS and 125 MiB/s baseline included. Raise IOPS and bandwidth when needed without changing volume type.
Instance Store shines for high-throughput workloads. Hundreds of thousands of IOPS and multi-GB/s bandwidth per instance, bundled with the hourly cost, no network in the path. Databases with replication, distributed caches, streaming logs.
Instance family matters. Look for the d suffix (r6id, c6id) or storage-optimised families (i4i, is4gen). Non-d variants have no Instance Store.
Instance Store devices need formatting and mounting on boot. They appear as raw NVMe devices; cloud-init, user-data, or configuration management handles creating filesystems, RAID arrays, and mounts.
EBS snapshots are the backup story. Instance Store has no backup. If the data needs to be restored from a point in time, EBS is the only option. “Restore” for Instance Store means “rebuild from the authoritative source.”
Retirement events move instances to new hosts, destroying Instance Store data. Handle AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED with an automated drain-and-replace.
Cost crossover is around 150 GB at the instance premium quoted here. Below that, EBS on a non-d instance is cheaper. Above it, Instance Store variants usually cost less than the equivalent EBS capacity would, and perform better.

The disk that dies with the instance is a feature when the data already lives somewhere else; it’s a liability when the data doesn’t. Pick Instance Store where the workload’s durability is carried by replication or checkpointing; pick EBS everywhere else. Match the storage contract to the data contract and the answer falls out.