Picking an EBS Volume Type for Each Workload

The situation

Three workloads running on EC2, each with a persistent EBS volume that’s been chosen by history rather than by design.

PostgreSQL OLTP, primary on an EC2 instance sized like a db.r6g.2xlarge (this is a self-managed cluster, not RDS or Aurora). 2 TB data volume, currently gp2 at 6,000 IOPS, hitting 100% utilisation during the afternoon. Write latency occasionally spikes into the tens of milliseconds, which the application notices as query stalls. Needs: consistent 15,000 IOPS at sub-millisecond p99, ideally with headroom for growth.
Cassandra ring, 12 nodes, 8 TB of data per node. Workload is sequential writes (SSTables) and occasional compactions that do large sequential reads and writes. Throughput is the pain point: during compactions, the volume saturates at about 500 MiB/s, which makes compactions take hours longer than they should. Needs: high sustained throughput; IOPS doesn’t really matter.
CloudTrail log warehouse, 20 TB of gzipped CloudTrail logs on an EC2 instance that runs occasional Athena queries and monthly security scans. Written-once, read-rarely. Currently on gp3, which works fine but is overkill. Needs: cheap storage, latency isn’t a concern.

Three workloads, three very different requirements. The same volume type on all three is wrong in different ways each time.

What actually matters

Before comparing volume types it’s worth naming the performance dimensions block volumes are characterised by.

IOPS (input/output operations per second). The number of discrete read or write operations the volume can serve per second. For small random I/O (4 KiB database reads, for example), IOPS is the limiting factor. A PostgreSQL OLTP workload that reads a handful of 4 KiB pages per query and handles thousands of queries per second is IOPS-bound.

Throughput (MiB/s). The bandwidth of the volume, how many bytes per second can move. For large sequential I/O (streaming a 64 MiB log segment, reading a TB of data during a compaction), throughput is the limiting factor. A volume that delivers plenty of IOPS but caps at modest bandwidth is useless for a Cassandra compaction; one that delivers few IOPS but high bandwidth is great.

Latency. The time from “submit I/O” to “I/O complete”. For OLTP workloads where each transaction waits for a write, latency directly dictates transaction throughput. Sub-millisecond is table stakes for high-performance databases; tens of milliseconds under load is catastrophic.
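The link between latency and achievable IOPS can be made concrete with Little's law (outstanding I/Os = IOPS × latency). The sketch below is illustrative only: iops_ceiling is a hypothetical helper, and the queue depths are assumed values, not measurements from the workloads above.

```shell
# Little's law for a block device: IOPS = outstanding I/Os / latency.
# Arguments: queue depth (outstanding I/Os), latency in milliseconds.
# iops_ceiling is a hypothetical helper name used for illustration.
iops_ceiling() {
    awk -v qd="$1" -v lat_ms="$2" 'BEGIN { printf "%d\n", qd * 1000 / lat_ms }'
}

iops_ceiling 1 1     # one synchronous writer at 1 ms   -> 1000
iops_ceiling 1 30    # the same writer at 30 ms         -> 33
iops_ceiling 16 1    # sixteen outstanding I/Os at 1 ms -> 16000
```

A single writer that waits for each commit is capped at 1,000 IOPS by 1 ms latency and at roughly 33 IOPS by 30 ms, which is why a latency spike shows up as a query stall even when the volume's IOPS limit has not moved.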

Durability / availability. Every block volume is replicated within its AZ; the annual durability numbers vary by type. For any production workload the numbers are high enough not to be the differentiator, but specific compliance requirements sometimes push towards the highest-durability tier.

Cost shape. Block-volume pricing isn’t a flat dollar-per-gigabyte; each volume type has its own cost structure. Some bill for storage with provisioned IOPS and throughput above a baseline; some bill per-IOPS at a higher rate with no baseline included; the spinning-disk types bill cheaply per-GB but cap throughput and burst differently. The cost shape that fits a workload depends on whether the binding dimension is bursty or sustained.

Multi-attach. Some block volumes can be attached to multiple instances simultaneously, enabling shared-block-storage patterns that need a cluster-aware filesystem. Most are single-attach only.

First question per workload: which dimension is binding? PostgreSQL OLTP is IOPS-bound and latency-sensitive. Cassandra is throughput-bound. CloudTrail is cost-bound because no performance metric matters.

Second: is the binding dimension bursty or sustained? A workload that spikes for minutes and idles for hours can ride a burst bucket; one that sustains for hours needs the headroom provisioned outright. The cost shape that’s cheap for one is expensive for the other.
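To put a number on "riding a burst bucket", here is a minimal sketch using gp2's burst model as the example. The constants (a 5.4 million I/O credit bucket, a 3,000 IOPS burst rate, a baseline of 3 IOPS per GiB with a floor of 100) are the figures I recall from the AWS documentation and should be checked against it before being relied on; gp2_burst_minutes is a hypothetical helper name.

```shell
# How long a full gp2 credit bucket lasts at the full 3,000 IOPS burst rate.
# Assumed constants: 5,400,000 credits, 3 IOPS/GiB baseline (min 100, max 16,000).
gp2_burst_minutes() {
    awk -v gib="$1" 'BEGIN {
        baseline = 3 * gib
        if (baseline < 100) baseline = 100
        if (baseline > 16000) baseline = 16000
        # At or above the burst rate there is nothing to burst to.
        if (baseline >= 3000) { print "no-burst"; exit }
        # The bucket drains at (burst rate - refill rate) credits per second.
        printf "%.0f\n", 5400000 / (3000 - baseline) / 60
    }'
}

gp2_burst_minutes 100    # -> 33
gp2_burst_minutes 500    # -> 60
```

Under these assumptions a spike of a few minutes fits comfortably in the bucket, while an afternoon of sustained load does not. Volumes of 1,000 GiB and above already have a baseline at or above the burst rate, so the function reports no-burst for them.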

Third: block size matters a lot. IOPS limits are quoted at a fixed I/O size; smaller I/Os hit the IOPS limit first, larger I/Os hit the throughput limit first. A workload doing 4 KiB random I/O is IOPS-bound; the same workload doing 1 MiB sequential I/O is throughput-bound. Understanding the workload’s I/O size distribution is the first real analysis step.
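The interaction between I/O size and the two limits is plain arithmetic: throughput demanded = IOPS × I/O size. The sketch below assumes a volume with example limits of 15,000 IOPS and 500 MiB/s (illustrative values) and ignores any merging or splitting of I/Os by the storage layer; binding_limit and crossover_kib are hypothetical helper names.

```shell
# Which limit is reached first for a given I/O size?
# Arguments: IOPS limit, throughput limit (MiB/s), I/O size (KiB).
binding_limit() {
    awk -v iops="$1" -v mibs="$2" -v io_kib="$3" 'BEGIN {
        demand = iops * io_kib / 1024   # MiB/s needed to run at the IOPS limit
        if (demand <= mibs) print "iops"; else print "throughput"
    }'
}

# I/O size (KiB) at which both limits are reached together.
crossover_kib() {
    awk -v iops="$1" -v mibs="$2" 'BEGIN { printf "%.1f\n", mibs * 1024 / iops }'
}

binding_limit 15000 500 4       # 4 KiB random I/O    -> iops
binding_limit 15000 500 1024    # 1 MiB sequential I/O -> throughput
crossover_kib 15000 500         # -> 34.1
```

With these example limits, anything smaller than about 34 KiB per I/O runs out of IOPS first, and anything larger runs out of bandwidth first.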

Fourth: ephemeral vs persistent. Some instance classes come with local NVMe that delivers the lowest possible latency and highest possible throughput because it’s directly attached, but the data is lost on stop, hibernate, or termination. For workloads that can rebuild state (Cassandra ring members, caches, ephemeral compute) that’s fine; for anything that must survive an instance failure, it isn’t.

What we’ll filter on

Max IOPS, what’s the volume’s ceiling?
Max throughput (MiB/s), what’s the bandwidth?
Latency, sub-millisecond, millisecond, or multi-millisecond?
Durability, which annual-durability tier?
Multi-attach, can multiple instances attach the same volume?
Price, per GB, per provisioned IOPS, per provisioned MiB/s?

The EBS volume landscape

gp3 (General Purpose SSD). Baseline: 3,000 IOPS and 125 MiB/s, independent of volume size. Provisioned up to 16,000 IOPS and 1,000 MiB/s for a flat dollar-per-IOPS and dollar-per-MiB/s price above baseline. $0.08/GB-month storage, $0.005/IOPS/month above 3,000 IOPS, $0.04/MiB/s/month above 125 MiB/s. Replaces gp2 for new workloads; gp3 is cheaper and tunable.
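gp2 (legacy). Older general-purpose SSD. Performance scales with volume size (3 IOPS per provisioned GB, up to 16,000). Volumes under 1 TB can burst to 3,000 IOPS from a credit bucket that lasts roughly 30 minutes at full burst on small volumes; larger volumes have no burst above their baseline. Still supported but no reason to start new workloads here; gp3 is strictly better.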
io2 (Provisioned IOPS SSD, latest). Up to 64,000 IOPS per volume (256,000 on io2 Block Express on supported Nitro instance types), up to 1,000 MiB/s throughput (4,000 on Block Express). Sub-millisecond consistent latency. 99.999% annual durability. Single-attach or Multi-attach (up to 16 instances). $0.125/GB-month, $0.065/IOPS/month for the first 32,000 provisioned IOPS. The highest-performance, highest-durability, highest-cost standard EBS option.
io1 (legacy). Predecessor to io2. Similar performance envelope but 99.8–99.9% durability (vs 99.999% for io2). New provisioned-IOPS workloads should start on io2; io1 exists for continuity.
st1 (Throughput Optimized HDD). Spinning disk optimised for large sequential I/O. Baseline 40 MiB/s per TiB, burst to 250 MiB/s per TiB, both capped at 500 MiB/s per volume. Max IOPS 500 per volume, but each I/O can be up to 1 MiB, so aggregate bandwidth is what matters. $0.045/GB-month. Correct for big-data workloads like Hadoop, Cassandra (for throughput), Kafka brokers with large messages.
sc1 (Cold HDD). Colder spinning disk. Baseline 12 MiB/s per TiB, burst to 80 MiB/s per TiB, capped at 250 MiB/s per volume. Like st1, volumes max out at 16 TiB, so larger datasets span multiple volumes. $0.015/GB-month. Correct for infrequently accessed data where cost dominates, like compliance logs and archival warehouses.
Instance Store (not EBS, but adjacent). Local NVMe attached to the host. Up to millions of IOPS and tens of GiB/s on i3en / i4i / i7i classes. Not persistent across stop/terminate. Not the answer for most databases but potentially correct for Cassandra nodes that rebuild from peers.
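Because the HDD types scale throughput with size but cap it per volume, it helps to compute both figures before settling on a size. A minimal sketch, using the per-TiB rates listed above and assuming the per-volume cap (500 MiB/s for st1, 250 MiB/s for sc1) applies to baseline and burst alike; hdd_throughput is a hypothetical helper name.

```shell
# Baseline and burst throughput (MiB/s) for an st1 or sc1 volume of a given size.
# Arguments: volume type (st1|sc1), size in TiB. Prints "baseline burst".
hdd_throughput() {
    awk -v type="$1" -v tib="$2" 'BEGIN {
        if (type == "st1")      { base = 40; burst = 250; cap = 500 }
        else if (type == "sc1") { base = 12; burst = 80;  cap = 250 }
        else { print "unknown type: " type > "/dev/stderr"; exit 1 }
        b = base * tib;  if (b > cap) b = cap   # per-volume cap on baseline
        p = burst * tib; if (p > cap) p = cap   # per-volume cap on burst
        printf "%d %d\n", b, p
    }'
}

hdd_throughput st1 8     # -> 320 500
hdd_throughput st1 2     # -> 80 500
hdd_throughput sc1 16    # -> 192 250
```

The cap is what matters at larger sizes: past 2 TiB an st1 volume already bursts at its 500 MiB/s ceiling, so adding capacity raises only the baseline.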

Side by side

Type	Max IOPS	Max throughput	Latency	Durability	Cost
gp3	16,000	1,000 MiB/s	low-ms	99.8%	$0.08/GB + provisioned
io2 Block Express	256,000	4,000 MiB/s	sub-ms	99.999%	$0.125/GB + $0.065/IOPS
io2	64,000	1,000 MiB/s	sub-ms	99.999%	$0.125/GB + $0.065/IOPS
st1	500	500 MiB/s	multi-ms	99.8%	$0.045/GB
sc1	250	250 MiB/s	multi-ms	99.8%	$0.015/GB
Instance Store	millions	tens of GiB/s	microseconds	ephemeral	bundled

Reading the table by workload:

PostgreSQL OLTP (15,000 IOPS, sub-ms latency, 2 TB): gp3 at 15,000 provisioned IOPS is within reach and cheaper than io2 at the same IOPS. If the workload grows past 16,000 IOPS or needs sub-millisecond at p99.9 consistently, io2 is the step up. Start with gp3, upgrade if needed.
Cassandra (throughput-bound, 500+ MiB/s sustained): st1 at large size. Per-node 8 TB at $0.045/GB = ~$360/month/node × 12 nodes = ~$4,320/month. gp3 would cost more for storage ($0.08/GB) and more again for provisioned throughput, and st1’s large-sequential profile matches compactions. st1 tops out at 500 MiB/s per volume, so it wins on cost for sequential bandwidth, not on peak bandwidth. Instance Store-backed i4i classes are a more radical alternative if node rebuild is acceptable.
CloudTrail warehouse (cost-bound): sc1. 20 TB at $0.015/GB = ~$300/month vs ~$1,600/month on gp3. Performance is poor but nobody cares; the monthly scan runs for most of a day, which is acceptable for a job that runs once a month.

The IOPS-throughput-cost triangle

The triangle has three corners; each workload lives closest to one. Matching the workload to the corner picks the volume type.

The picks in depth

PostgreSQL OLTP → gp3, 15,000 IOPS, 500 MiB/s throughput, 2 TB. Provisioned explicitly:

aws ec2 create-volume \
    --availability-zone eu-west-1a \
    --size 2048 \
    --volume-type gp3 \
    --iops 15000 \
    --throughput 500 \
    --encrypted --kms-key-id alias/ebs

Cost: 2,048 GB × $0.08 + (15,000 - 3,000) × $0.005 + (500 - 125) × $0.04 = $163.84 + $60 + $15 = ~$239/month. Equivalent io2 at the same IOPS: 2,048 GB × $0.125 + 15,000 × $0.065 = $256 + $975 = ~$1,231/month. For a workload that’s at 15k IOPS, gp3 is 5x cheaper.
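The arithmetic above is easy to re-run for other sizes. A minimal sketch using the list prices quoted in this article (they vary by region and over time, so treat them as assumptions); gp3_monthly_cost and io2_monthly_cost are hypothetical helper names, and the io2 function applies only the first-tier IOPS rate, so it is valid up to 32,000 provisioned IOPS.

```shell
# Monthly gp3 cost in USD. Arguments: size (GB), provisioned IOPS, throughput (MiB/s).
gp3_monthly_cost() {
    awk -v gb="$1" -v iops="$2" -v mibs="$3" 'BEGIN {
        extra_iops = (iops > 3000) ? iops - 3000 : 0   # first 3,000 IOPS included
        extra_mibs = (mibs > 125)  ? mibs - 125  : 0   # first 125 MiB/s included
        printf "%.2f\n", gb * 0.08 + extra_iops * 0.005 + extra_mibs * 0.04
    }'
}

# Monthly io2 cost in USD. Arguments: size (GB), provisioned IOPS (<= 32,000).
io2_monthly_cost() {
    awk -v gb="$1" -v iops="$2" 'BEGIN { printf "%.2f\n", gb * 0.125 + iops * 0.065 }'
}

gp3_monthly_cost 2048 15000 500   # -> 238.84
io2_monthly_cost 2048 15000       # -> 1231.00
```

The gap comes almost entirely from the per-IOPS charge: io2 bills every provisioned IOPS, while gp3 bills only those above its 3,000 baseline and at a much lower rate.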

If the workload later crosses 16,000 IOPS or needs sub-millisecond p99.9 (gp3 is low-ms p99, not sub-ms; io2 is sub-ms), migrate to io2. The migration is straightforward: modify-volume with a type change. Data stays in place, and the volume stays attached and usable while the change is applied in the background.
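A sketch of what that migration might look like from the CLI. The helper below only prints the two commands so they can be reviewed before anyone runs them; io2_migration_plan is a hypothetical name, and the volume ID and the 20,000 IOPS target are placeholders, not values from this environment.

```shell
# Print (do not run) the commands for an in-place change to io2.
# Arguments: volume ID, target provisioned IOPS.
io2_migration_plan() {
    vol="$1"; iops="$2"
    # 1. Request the type change; the volume stays attached and in use.
    echo "aws ec2 modify-volume --volume-id $vol --volume-type io2 --iops $iops"
    # 2. Poll the modification state and progress percentage.
    echo "aws ec2 describe-volumes-modifications --volume-ids $vol --query 'VolumesModifications[0].[ModificationState,Progress]' --output text"
}

io2_migration_plan vol-0123456789abcdef0 20000
```

The modification state moves through modifying and optimizing before reaching completed; the new performance settings generally take full effect once the optimizing phase finishes.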

Cassandra ring → st1 at 8 TB per node. Provisioned:

aws ec2 create-volume \
    --availability-zone eu-west-1a \
    --size 8192 \
    --volume-type st1 \
    --encrypted --kms-key-id alias/ebs

Per-node cost: 8,192 × $0.045 = ~$369/month. Throughput at 8 TiB: baseline 40 MiB/s × 8 = 320 MiB/s. The burst formula gives 250 MiB/s × 8 = 2,000 MiB/s, but st1 caps every volume at 500 MiB/s, so the effective burst is 500 MiB/s. During compactions the burst bucket carries the large sequential I/O at up to that cap. If compactions need more than 500 MiB/s, or run long enough to drain the burst bucket, step up to gp3 with 1,000 MiB/s provisioned.

Cassandra’s workload profile (append-only SSTables, sequential compactions) matches st1’s large-sequential-I/O sweet spot. Random-I/O workloads on st1 are terrible; SSDs are essential for anything that does random 4 KiB reads. Cassandra’s write and compaction paths do not.

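CloudTrail warehouse → sc1, 20 TB split across two 10 TiB volumes (a single sc1 volume maxes out at 16 TiB). Provisioned:

# Two 10,240 GiB sc1 volumes; combine them on the instance (for example with LVM)
for i in 1 2; do
    aws ec2 create-volume \
        --availability-zone eu-west-1a \
        --size 10240 \
        --volume-type sc1 \
        --encrypted --kms-key-id alias/ebs
done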

Cost: 20,480 × $0.015 = $307/month. gp3 equivalent at 20 TB with baseline performance: 20,480 × $0.08 = $1,638/month. 5x cheaper for performance nobody needs.

For the monthly security scan, reading at roughly 250 MiB/s means a 20 TB full scan takes ~22 hours. That’s acceptable when the scan runs monthly; it wouldn’t be acceptable for a nightly process. Knowing the workload’s real performance requirement, as opposed to the default “we’d like this to be fast”, is what unlocks the correct type.
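The scan-time estimate is a one-line calculation worth keeping around when the data size or the volume type changes. A minimal sketch, assuming the scan is purely throughput-bound and reads every byte once; scan_hours is a hypothetical helper name.

```shell
# Hours to read a dataset once at a sustained throughput.
# Arguments: dataset size (GiB), sustained throughput (MiB/s).
scan_hours() {
    awk -v gib="$1" -v mibs="$2" 'BEGIN { printf "%.1f\n", gib * 1024 / mibs / 3600 }'
}

scan_hours 20480 250   # -> 23.3
scan_hours 20480 125   # -> 46.6
```

The first figure is slightly above the ~22 hours quoted above because 20,480 GiB is a little more than 20 decimal terabytes. The second shows that halving the sustained throughput pushes the scan to nearly two days, which is the point at which a monthly job starts to feel uncomfortable.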

A performance trace: PostgreSQL p99 write latency

Sam investigated why write latency occasionally spiked to 30 ms on the old gp2 volume at 6,000 IOPS; after migrating to gp3 at 15,000 IOPS, the same query pattern runs consistently under 2 ms.

Before (gp2, 2 TB, 6,000 IOPS size-derived ceiling, no burst above it):
  Steady afternoon load, ~5,800 IOPS sustained:
    write latency p50: 1.8 ms
    write latency p99: 8 ms  (just under the ceiling)
  
  After ~2 hours, as demand reaches the 6,000 IOPS ceiling:
    write latency p50: 3 ms
    write latency p99: 30 ms  (throttled at 3 IOPS/GB = 6,000)
    CloudWatch VolumeQueueLength climbs from 1 to 20+
    PostgreSQL: "checkpoint complete, buffers written: 1,340" takes 4x longer

After (gp3, 15,000 IOPS explicit provision):
  Same afternoon load, ~5,800 IOPS sustained:
    write latency p50: 0.8 ms
    write latency p99: 1.6 ms
    VolumeQueueLength: steady at 1-2

gp3 performance is independent of volume size and has no burst bucket to deplete; the 15,000 provisioned IOPS are always available. For a database that is IOPS-bound at peak, this eliminates the “tail latency climbs after N hours of load” class of issues outright.

Aggregate impact: the user-facing application’s p99 query time dropped from 180 ms to 110 ms during afternoon peak, because the write-tail-latency was the dominant contributor.

What’s worth remembering

gp3 is the new general-purpose default. Predictable baseline (3,000 IOPS, 125 MiB/s) independent of size, with provisioned extras above that. Cheaper and more tunable than gp2; there’s no reason to start new workloads on gp2.
io2 for sub-millisecond and 99.999% durability. When gp3’s ceiling (16k IOPS) or its latency characteristics aren’t enough, io2 is the step up. Block Express pushes the ceiling to 256k IOPS and 4,000 MiB/s on supported instance types.
st1 is throughput-optimised HDD. Great for Cassandra, Hadoop, Kafka with large messages, anything whose workload is large sequential I/O. Terrible for random-access SSD patterns.
sc1 is the cheap-storage answer. More than 5x cheaper than gp3 per GB ($0.015 vs $0.08). Acceptable for infrequently-accessed data where latency doesn’t matter.
Instance Store is ephemeral and fast. Directly attached NVMe; microseconds-scale latency. Loses data on stop/terminate. Right for caches, Cassandra nodes that rebuild from peers, ephemeral compute.
Block size matters. IOPS limits for the SSD types are quoted at 16 KiB I/O (1 MiB for st1 and sc1); smaller I/Os are IOPS-bound, larger are throughput-bound. Knowing the workload’s I/O size distribution is the first analysis step.
modify-volume is non-disruptive. Change volume type, size, IOPS, or throughput without detaching. The transition takes hours for large volumes; the volume remains usable throughout.
Multi-attach is only io1/io2. Shared block storage (needed for cluster-aware filesystems like GFS2) requires io1 or io2 with multi-attach enabled. The application must understand concurrent writes; EBS doesn’t arbitrate.

Three workloads, three very different performance profiles, three different EBS volume types. PostgreSQL gets gp3 for balanced IOPS and latency at a reasonable price; Cassandra gets st1 for cheap bulk throughput; CloudTrail gets sc1 for cheap storage that nobody needs to be fast. The mistake is picking one volume type for all three workloads; the win is matching the type to the dimension the workload actually cares about.