How to Scale Kinesis Reads Across Many Consumers

November 22, 2027 · 15 min read

The situation

A 50-shard KDS stream called app-telemetry ingests around 100 MB/s from application servers. Eight consumers read it, accumulated over six months: a fraud detector (Lambda, must score each event within seconds); low-latency alerting (ECS service pushing to PagerDuty); an OpenSearch indexer (Firehose); a monitoring-metrics consumer (a Go service emitting CloudWatch metrics); a real-time dashboard (WebSocket service pushing to browsers); a batch analytics job (Glue, every 30 minutes, writing Parquet to S3); an archive pipeline (daily copy into Glacier); and a reconciliation consumer (finance, runs nightly).

All eight call GetRecords against the stream in the same pull-based way. The first six were fine. The seventh’s tail was lagging by a couple of minutes, which the team wrote off as cold-start. The eighth tipped the system: every consumer failing intermittently, ProvisionedThroughputExceededException everywhere, fraud detection draining queues in 45 seconds instead of 2.

The part that surprised the team: the ingest rate didn’t change. No producer pushed harder. They only added a reader, and all the other readers broke.

What actually matters

Before reaching for a fix, it helps to name why this surprises so many teams the first time they meet it. Kinesis’s write side is straightforward: each shard has a byte ceiling and a record ceiling, and the producer either fits under them or gets throttled. The read side has two ceilings stacked on top of each other (a byte ceiling and an API-call ceiling), both of which are shared across every consumer attached to a shard, and neither of which gets talked about until enough consumers pile up to hit them. The asymmetry is the trap.

The isolation story is the first thing to weigh. When one shared budget feeds N consumers, the behaviour of any one consumer is everybody’s problem. Today’s eighth consumer wakes up and starts polling aggressively; tomorrow’s seventh consumer starts a backfill and burns through the shared TPS cap; next week’s sixth consumer lags and enters catch-up mode. Each of those is a legitimate local decision that causes a global failure because the consumers are sharing a resource they can’t see.

The cost shape of the fix matters as much as the fix itself. Paying for dedicated read capacity per consumer is the obvious answer to isolation, but paying the same premium rate for the consumer that runs once a day as for the consumer that runs flat-out is wasteful. The price tag on “every consumer at the same quality of service” is easy to calculate and hard to justify once the bill arrives.

The latency targets are not uniform across consumers. Fraud detection wants sub-second; the daily archive tolerates a fourteen-hour delay; the monitoring consumer doesn’t notice whether records arrive in 200 ms or 5 s. Any architecture that delivers the fraud-detection SLA to the archive consumer is overspending; any architecture that delivers the archive’s tolerance to the fraud detector is underspending where it matters.

The blast radius of a future change matters more than it first seems. Whatever the team builds now is what the ninth consumer will be added to next quarter. If adding a ninth consumer breaks the eight, the architecture has failed again; if adding a ninth consumer is additive, the architecture has absorbed the scaling concern.

And finally, the operational simplicity. No broker cluster to stand up, no re-sharding of a production stream, no tearing up the producer path. The fix should be a configuration change on the consumer side, not a new platform to operate.

What we’ll filter on

Reader isolation: the fraud detector’s latency cannot depend on whether the archive job happens to be running.
Low end-to-end latency for consumers that need it: sub-second for fraud, alerting, dashboards, OpenSearch.
Cost shape that tracks the value of the consumer: eight consumers at the same premium price is wasteful.
Operational simplicity: no new broker cluster, no re-sharding, no producer-side changes.

The KDS read-side landscape

Kinesis Data Streams offers two ways a consumer can read records.

Shared throughput consumers (the default). A consumer calls GetShardIterator, then calls GetRecords in a loop: a pull model over plain HTTPS. Each call returns up to 10 MB or 10,000 records, whichever fills first. The shard’s read budget is shared across every consumer pulling from it, and it has two parts: 2 MB/s per shard and 5 GetRecords transactions per second per shard, each an aggregate across every shared consumer. The second limit counts API calls, not bytes, and it bites before most teams realise it does: five consumers polling at 1 TPS each are already at the ceiling. Shared reads carry no extra charge; latency is dominated by the polling interval.

Enhanced Fan-Out (EFO). A consumer registers with RegisterStreamConsumer, then calls SubscribeToShard, which opens a persistent HTTP/2 connection over which Kinesis pushes records as they arrive. Each registered EFO consumer gets its own dedicated 2 MB/s per shard on every shard. There is no 5 TPS cap, because records are pushed, not polled. Each subscription lasts up to five minutes, after which the client calls SubscribeToShard again (the KCL handles this renewal). Typical end-to-end latency is ~70 ms. EFO costs $0.015 per consumer-shard-hour of registration, plus $0.013/GB retrieved. Up to 20 EFO consumers can be registered per stream.
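To make the push model concrete, here is a minimal sketch of reading one shard through SubscribeToShard. It assumes a boto3 Kinesis client is passed in as `kinesis` and that the consumer is already registered; the function name `iter_pushed_records` is made up for illustration, and a production consumer would more likely use the KCL than hand-roll this.

```python
def iter_pushed_records(kinesis, consumer_arn, shard_id):
    """Yield records pushed during one SubscribeToShard call.

    `kinesis` is assumed to be a boto3 Kinesis client (or anything with the
    same subscribe_to_shard method). One call covers a single subscription,
    so the caller re-invokes this to keep reading after it ends.
    """
    response = kinesis.subscribe_to_shard(
        ConsumerARN=consumer_arn,
        ShardId=shard_id,
        StartingPosition={"Type": "LATEST"},  # assumption: only new records matter
    )
    # The response carries an event stream; each SubscribeToShardEvent holds
    # a batch of records that Kinesis pushed without being polled.
    for event in response["EventStream"]:
        payload = event.get("SubscribeToShardEvent")
        if payload is None:
            continue
        for record in payload["Records"]:
            yield record
```

Because the client is injected, the same function can be exercised against a stub in tests before it is pointed at a real stream.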

Two side notes. Throttling is silent until you read the error carefully – GetRecords doesn’t fail the whole operation when it hits the cap; it succeeds with fewer records and reports throttling via exceptions on subsequent calls, which most SDKs surface as retryable. And the team had been at 5 TPS since the fifth consumer was added, tolerating the odd retry.

Side by side

Option	Reader isolation	Low latency	Cost-scales-with-value	Operational simplicity
Shared throughput (GetRecords)	✗	✗	✓	✓
Enhanced Fan-Out (SubscribeToShard)	✓	✓	✗	✓
Mixed: EFO for some, shared for others	✓	✓	✓	✓

Shared throughput fails reader isolation the moment more than one or two consumers exist. EFO satisfies isolation and latency but bills every consumer at the same premium price whether they need 70 ms delivery or 24 hours of latency tolerance. The survivor is a mixed deployment: EFO for consumers whose value justifies the cost, shared for consumers where cost dominates over latency.

The two modes, side by side

Same 50-shard stream, two read modes. Left: one shared 2 MB/s-per-shard pipe that eight consumers fight over, plus a 5 TPS API cap that throttles them before bytes become the constraint. Right: eight dedicated 2 MB/s-per-shard pipes at ~70 ms push latency, priced per consumer-shard-hour plus per-GB retrieval.

The 2 MB/s rule, from first principles

Every shard exposes two throughput budgets: writes at 1 MB/s or 1,000 records/s; reads at 2 MB/s or 2,000 records/s. Both per shard, not per stream. The critical asymmetry lives in the read side: for shared consumers the 2 MB/s is the ceiling across all consumers of that shard, while for EFO consumers the 2 MB/s is per consumer. A second EFO consumer doesn’t slice the first’s pipe; Kinesis duplicates the records onto each consumer’s subscription stream, paid for by the hour.

The 5 TPS trap sits alongside the byte ceiling. GetRecords has a 5-transactions-per-second-per-shard cap shared across all shared consumers. The KCL’s default poll interval is once per second. Five shared consumers at default polling = 5 TPS = at the limit. Six = over, which retries can absorb for a while. By the seventh and eighth consumers the retries no longer fit, which is why the scenario tipped there.
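The call-rate arithmetic is easy to check for any mix of consumers. The sketch below is illustrative only: the function names are made up, and it assumes every consumer polls every shard at a steady interval with no retries.

```python
GETRECORDS_TPS_CAP = 5.0  # per shard, shared by every shared-throughput consumer


def aggregate_tps(poll_intervals_s):
    """GetRecords calls per second hitting one shard, given each consumer's poll interval."""
    return sum(1.0 / interval for interval in poll_intervals_s)


def tps_headroom(poll_intervals_s):
    """Calls per second left under the per-shard cap; negative means throttling."""
    return GETRECORDS_TPS_CAP - aggregate_tps(poll_intervals_s)


print(tps_headroom([1.0] * 5))  # five consumers at the KCL default -> 0.0
print(tps_headroom([1.0] * 8))  # the eight consumers in the scenario -> -3.0
```

Retries make the real picture worse than this, since every throttled call that is retried counts against the same cap.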

Enhanced Fan-Out, in depth

EFO’s transport is a long-lived HTTP/2 connection over which Kinesis pushes records. The lifecycle for one consumer is three steps: register once (RegisterStreamConsumer creates a consumer object; it lives until explicitly deregistered); subscribe per shard (SubscribeToShard; each subscription lasts up to five minutes, and the client re-subscribes on expiry, which the KCL does automatically); and process records (with the KCL in EFO mode, checkpoints commit to the DynamoDB lease table as with shared consumers).

Under the hood, Kinesis runs a dedicated fleet for EFO traffic, separate from the shared-consumer fleet. Each registered consumer’s subscriptions hit that fleet independently, which is where the per-consumer throughput guarantee comes from.

Registration economics, 50 shards, 8 consumers at full rate: registration is $0.015 x 50 x 8 x 730 = $4,380/month; retrieval per consumer is 100 MB/s x 86,400 s x 30 = 259,200 GB/month x $0.013 = $3,370/month; all-EFO total is $4,380 + 8 x $3,370 = ~$31,300/month on reads alone. Compare with shared-throughput reads at $0. The question isn’t “can we afford EFO?” but “which consumers justify the spend?”
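The same arithmetic as a small function, so the consumer count, shard count, and ingest rate can be varied. This is a sketch under the post's own assumptions: the two prices quoted above, a 730-hour month for registration, a 30-day month for retrieval, decimal gigabytes, and every consumer reading the full stream. The names are illustrative.

```python
REGISTRATION_USD_PER_CONSUMER_SHARD_HOUR = 0.015
RETRIEVAL_USD_PER_GB = 0.013
HOURS_PER_MONTH = 730
SECONDS_PER_MONTH = 86_400 * 30


def efo_registration_cost(consumers, shards):
    """Monthly registration charge for always-registered EFO consumers."""
    return (REGISTRATION_USD_PER_CONSUMER_SHARD_HOUR
            * shards * consumers * HOURS_PER_MONTH)


def efo_retrieval_cost(consumers, stream_mb_per_s):
    """Monthly retrieval charge when each consumer reads the whole stream."""
    gb_per_consumer = stream_mb_per_s * SECONDS_PER_MONTH / 1000  # decimal GB
    return consumers * gb_per_consumer * RETRIEVAL_USD_PER_GB


def efo_monthly_cost(consumers, shards, stream_mb_per_s):
    """Registration plus retrieval, in USD per month."""
    return (efo_registration_cost(consumers, shards)
            + efo_retrieval_cost(consumers, stream_mb_per_s))


print(f"${efo_monthly_cost(consumers=8, shards=50, stream_mb_per_s=100):,.0f}")
# $31,337
```

The unrounded total is a few dollars under the figure above because the prose rounds the per-consumer retrieval charge up to $3,370 before multiplying.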

The mixed deployment

Eight consumers, two categories by latency sensitivity.

High-value, low-latency. EFO. Four consumers need sub-second delivery: fraud detection, alerting, OpenSearch indexing, the real-time dashboard. Registration: 4 x 50 x 0.015 x 730 = $2,190/month. Retrieval at full rate: 4 x $3,370 = $13,480/month. EFO subtotal: ~$15,670/month.

Low-urgency, shared throughput. The other four tolerate minutes of lag. Slow their polling to once every 5 seconds each: 4 x 0.2 = 0.8 TPS per shard, well under the 5 TPS cap. They divide the shared 2 MB/s-per-shard budget roughly equally: ~500 KB/s each per shard, or 25 MB/s per consumer across the 50 shards. Enough for their cadences. Shared cost on the read side: $0.
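A quick check of those shared-side numbers, as a sketch. It assumes the shared consumers poll at the same interval and split the byte budget evenly, which real consumers only approximate; `shared_budget` is a made-up helper name.

```python
SHARD_READ_MB_PER_S = 2.0  # shared across all shared-throughput consumers of a shard


def shared_budget(consumers, poll_interval_s, shards):
    """Return (calls/s per shard, KB/s per consumer per shard, MB/s per consumer per stream)."""
    tps_per_shard = consumers / poll_interval_s
    kb_per_consumer_per_shard = SHARD_READ_MB_PER_S * 1000 / consumers
    mb_per_consumer_stream = SHARD_READ_MB_PER_S / consumers * shards
    return tps_per_shard, kb_per_consumer_per_shard, mb_per_consumer_stream


print(shared_budget(consumers=4, poll_interval_s=5.0, shards=50))
# (0.8, 500.0, 25.0)
```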

Total read-side cost: ~$15,700/month for the mixed deployment versus ~$31,300/month for all-EFO. Half the spend, and the consumers that don’t need EFO aren’t paying for something they can’t use.

Registration lifecycle matters. EFO registration bills per consumer-shard-hour as long as the consumer is registered, whether or not it’s actively reading. A registered-but-idle consumer still bills $0.015 per shard-hour. For hourly or daily jobs, register EFO only for the active window or just use shared throughput.
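One way to register only for the active window is to wrap the job in a context manager that registers on entry and deregisters on exit. The sketch below assumes a boto3 Kinesis client is passed in as `kinesis`; `registered_consumer` is a hypothetical helper, and it omits the timeout and error handling a real job would need.

```python
import time
from contextlib import contextmanager


@contextmanager
def registered_consumer(kinesis, stream_arn, consumer_name, poll_s=1.0):
    """Register an EFO consumer for the duration of a job, then deregister it."""
    consumer = kinesis.register_stream_consumer(
        StreamARN=stream_arn, ConsumerName=consumer_name
    )["Consumer"]
    arn = consumer["ConsumerARN"]
    try:
        # Registration is asynchronous: wait until the status becomes ACTIVE.
        # (No timeout here; a real job should give up after a bounded wait.)
        while consumer["ConsumerStatus"] != "ACTIVE":
            time.sleep(poll_s)
            consumer = kinesis.describe_stream_consumer(ConsumerARN=arn)[
                "ConsumerDescription"
            ]
        yield arn
    finally:
        # Registration is billed for as long as the consumer exists,
        # so deregister even if the job fails.
        kinesis.deregister_stream_consumer(ConsumerARN=arn)
```

A nightly job would then run its reads inside `with registered_consumer(...) as consumer_arn:` and pay registration only for the hours it actually runs.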

A worked trace

One 1 KB record lands on shard 17. Track it through the mixed deployment.

Write side. The producer calls PutRecord with PartitionKey: device-9a4c. Kinesis hashes the key, maps it to shard 17, assigns a sequence number, returns. Record replicated across three AZs, clock ticking on the 24-hour default retention.

EFO path, fraud detection. The Lambda event-source mapping holds a SubscribeToShard against shard 17 for consumer fraud-detector. Within ~70 ms of the PutRecord return, Kinesis pushes the record onto the HTTP/2 stream. Lambda invokes, function scores the event, total end-to-end typically under 500 ms.

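Shared path, archive pipeline. Archive runs once at 02:00. It calls GetShardIterator --shard-iterator-type TRIM_HORIZON --shard-id shard-17, then loops GetRecords until MillisBehindLatest reaches 0 (NextShardIterator only comes back null once a shard has been closed, so it is not a usable stop signal on an open shard). The traced record lands in a batch archive fetches at, say, 02:14:23. Archive writes to Glacier. Time from job start to archive: 14 minutes. The record’s total delay also includes however long it waited for the 02:00 run, which is acceptable for a daily job. Read cost: $0.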
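A minimal sketch of the archive's catch-up loop, assuming a boto3 Kinesis client is passed in as `kinesis`. It stops when MillisBehindLatest reaches 0, because an open shard keeps returning a NextShardIterator; a null iterator (closed shard) also ends the loop. `drain_shard` and its parameters are illustrative names, and retry handling for throttling is left out.

```python
import time


def drain_shard(kinesis, stream_name, shard_id, pause_s=5.0, limit=10_000):
    """Yield records from TRIM_HORIZON until the reader has caught up with the shard tip."""
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    while iterator is not None:
        response = kinesis.get_records(ShardIterator=iterator, Limit=limit)
        yield from response["Records"]
        # MillisBehindLatest == 0 means there is nothing newer to read right now.
        if response.get("MillisBehindLatest", 0) == 0:
            break
        iterator = response.get("NextShardIterator")
        # One call every pause_s seconds; at 5 s this reader uses 0.2 TPS per shard.
        time.sleep(pause_s)
```

The pause between calls is what keeps a batch reader polite: without it, a tight catch-up loop can consume the whole 5 TPS budget on its own.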

No conflict between the two paths. The EFO consumer’s subscription operates on the dedicated fleet; the archive’s GetRecords operates on the shared fleet. The same record is delivered via two independent routes.

The concrete fixes: register four EFO consumers (fraud-detector, alerting-service, opensearch-indexer, realtime-dashboard); leave the other four on shared GetRecords with polling slowed to once every 5 seconds; alarm on the ReadProvisionedThroughputExceeded CloudWatch metric per shard.
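The alarm could be defined along these lines. This is a sketch, not a tuned configuration: it assumes shard-level (enhanced) metrics are enabled on the stream so the ShardId dimension exists, and the alarm name, one-minute period, and zero threshold are illustrative choices. The function only builds the arguments for CloudWatch's put_metric_alarm.

```python
def throttle_alarm_params(stream_name, shard_id):
    """Keyword arguments for CloudWatch put_metric_alarm on one shard's read throttling."""
    return {
        "AlarmName": f"{stream_name}-{shard_id}-read-throttled",  # hypothetical naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "ReadProvisionedThroughputExceeded",
        "Dimensions": [
            {"Name": "StreamName", "Value": stream_name},
            {"Name": "ShardId", "Value": shard_id},
        ],
        "Statistic": "Sum",
        "Period": 60,               # seconds
        "EvaluationPeriods": 1,
        "Threshold": 0.0,           # any throttled read in the period fires the alarm
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
```

With boto3 this would be applied as `cloudwatch.put_metric_alarm(**throttle_alarm_params("app-telemetry", shard_id))`, once per shard.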

Other levers briefly

Re-sharding. Doubling shard count doubles aggregate write and read capacity and halves the bytes each shard has to serve to shared consumers, but it leaves the 5 TPS per-shard cap where it was and doesn’t help per-consumer EFO throughput. Use it when write capacity is the real constraint; don’t use it to solve a read-side fan-out problem that EFO answers more cheaply.

On-demand mode. Auto-scales shard count based on traffic. Still has the 2 MB/s-per-shard ceiling on shared reads and still supports EFO the same way. Solves capacity planning, not fan-out.

MSK. Kafka’s fan-out is different: every consumer group sees every record, adding a group doesn’t starve another. No per-shard read ceiling, no 5 TPS cap. The trade-off is brokers to operate, partitions to size, upgrade windows to plan. Legitimate alternative; not a 30-minute fix for a team already on KDS.

What’s worth remembering

Each KDS shard delivers 2 MB/s or 2,000 records/s on reads, with a 5 GetRecords TPS limit per shard. Both are shared across every shared-throughput consumer of that shard.
Adding a shared consumer starves the others. The per-shard read budget and the 5 TPS cap are both aggregate ceilings, not per-consumer budgets.
Enhanced Fan-Out gives each registered consumer its own 2 MB/s per shard, pushed over HTTP/2 via SubscribeToShard at ~70 ms typical latency. No TPS limit. Up to 20 registered EFO consumers per stream.
EFO is not free. $0.015 per consumer-shard-hour registration + $0.013/GB retrieved. Shared-throughput reads are bundled into the shard-hour.
Mixed deployments are normal. Use EFO for consumers whose latency or isolation matters; shared for batch, archive, and daily jobs where cost dominates.
Registration bills while idle. An EFO consumer registered but reading nothing still charges $0.015 per shard-hour.
ProvisionedThroughputExceededException can mean either the byte ceiling or the TPS ceiling. Correlate with ReadProvisionedThroughputExceeded and with poll intervals to diagnose.
Re-sharding solves write-side capacity, not read-side fan-out. Doubling shard count halves the bytes per shard that shared consumers contend for but doesn’t change per-consumer EFO throughput or the 5 TPS per-shard ceiling.

Register four Enhanced Fan-Out consumers for fraud detection, alerting, the OpenSearch indexer, and the real-time dashboard. Keep monitoring, batch analytics, archive, and reconciliation on shared-throughput GetRecords, with polling slowed to once every 5 seconds. The EFO consumers get isolated 2 MB/s per shard at ~70 ms push latency; the shared consumers get ~500 KB/s each per shard with no ceiling contention, sufficient for their cadence. Monthly read-side cost is roughly $15,700, about half an all-EFO answer, and the eighth consumer no longer breaks the other seven.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.