How to Fix DynamoDB Hot Partitions on a Leaderboard

May 12, 2027 · 14 min read

The situation

A studio operates a multiplayer game’s online leaderboard in DynamoDB. The table is Leaderboards, provisioned mode, 30,000 WCU and 90,000 RCU total. The key schema came off the back of a napkin the day the feature shipped: pk = game_id, sk = score#player_id, zero-padded score descending, player_id as tiebreaker. The application writes one item per player score via PutItem, and reads the top 100 via a Query on pk = game_id with ScanIndexForward = false and Limit = 100. Same partition key, sort key pre-ordered, top-100 for free.

Then one title goes viral. Traffic on that one game_id climbs to around 5,000 WCU per second sustained writes (peaks to 7,000) and around 12,000 RCU per second reads from dashboards, in-game panels, and streaming overlays. Other titles barely move the needle.

The application logs ProvisionedThroughputExceededException on a material fraction of PutItem calls. P99 read latency on the top-100 Query has climbed from low single-digit milliseconds into the tens. CloudWatch shows the table’s total consumed WCU sitting comfortably under 30,000. “Raise the provisioning” was the first thing someone tried. It didn’t help.

What actually matters

Before reaching for a remedy, it helps to notice that the symptom is a ceiling nobody on the team has looked up before. The table’s 30,000 WCU is a table-wide aggregate. Every physical partition the table is made of has its own per-partition ceiling: 1,000 WCU per second and 3,000 RCU per second. When one partition key carries every write, that key hashes to a single physical partition, and that partition is the real cap. Raising the table’s total just gives more capacity to partitions that weren’t struggling.

Ownership here is clean, one team, one service, one table, which means the team can move freely as long as the change is compatible with existing items. That’s the major constraint: there are historical scores in the table under the old key, and any solution that can’t read them during the transition is a migration hazard, not a fix.

Blast radius of getting it wrong is the whole leaderboard of the viral title. Blast radius of getting it right is a write-path change and a read-path change in one service. The risk lives in the read path: whatever replaces the single Query must still return “top 100 by score for game G” in a few milliseconds. A solution that turns the top-100 into a Scan is a wrong-shape fix.

Cost shape matters only insofar as the fix shouldn’t triple the bill. Sharding by itself doesn’t change how many WCU are being consumed, 5,000 WCU of real write work is still 5,000 WCU of real write work. What changes is where that consumption lands.

Failure modes are the interesting part. Adaptive capacity, on by default, already isolates the hot key onto its own partition and lends it unused capacity from cooler neighbours. It’s doing the best it can. A sustained workload that exceeds 1,000 WCU on one key will throttle no matter what adaptive capacity does, because 1,000 WCU is the absolute ceiling for any single partition-key value. The fix has to defeat that ceiling structurally, by making the work arrive at more than one key.

Coupling between base table and indexes is worth naming too. GSIs have their own partitions with the same per-partition limits. Adding a GSI on a different attribute helps reads on that attribute; it doesn’t rescue the base table’s write path, which throttles before the GSI ever sees the write.

What we’ll filter on

Four filters for any proposed fix:

Eliminates per-partition throttling under a single-key workload. The real ceiling isn’t 30,000 WCU; it’s the 1,000 WCU per partition that applies regardless of table provisioning.
Preserves ordered-by-score query semantics. “Top 100 by score for game X” has to stay an efficient query, not a full-table Scan.
Doesn’t dramatically inflate cost. Tripling the bill to paper over a hot partition is not a fix.
Implementable without a full data-model rewrite. A change to the write path and the read path, not a new dependency and an offline migration.

The DynamoDB hot-partition landscape

Seven patterns. The filter is: which address the structural cause (all traffic on one partition), which merely mask it, and which solve a different problem.

Raise provisioned capacity. Already tried. Gets allocated across all partitions; the hot one doesn’t see it.
Adaptive capacity. Already running by default; can isolate a hot key to its own partition and borrow unused capacity from cool ones. Hard ceiling: 1,000 WCU / 3,000 RCU for any single key.
On-demand capacity mode. Pay-per-request; no capacity planning; instantly accommodates up to 2× previous peak; scales further on 30+ minutes of gradual growth. Sits on the same underlying partition architecture, a single hot partition key still maxes at 1,000 WCU and 3,000 RCU.
DAX (DynamoDB Accelerator). Managed in-memory cluster in front of DynamoDB. Serves eventually-consistent reads in microseconds; writes are write-through, so they still hit the table at table speed. Helps read-heavy hot keys with cacheable freshness. Does nothing for the write hotspot.
A GSI on a different partition key. A GSI is another table under the hood with its own partitions and the same per-partition limits. A GSI on game_id hotspots the same way. A GSI on something more distributed helps reads on that dimension; the base-table write path throttles before the GSI ever sees the write.
Composite key redesign. Move the hot dimension into the sort key: pk = player_id, sk = game_id#score. Distributes writes beautifully across players; turns “top 100 for game X” into a full-table Scan.
Write sharding. Append a shard suffix to the partition key: pk = game_id#0, pk = game_id#1, … pk = game_id#N-1. Writes pick a shard at write time (random or calculated). Reads of “top 100 for game X” become N parallel queries merged in the application. Preserves ordered-by-score within each shard and globally via merge.

Side by side

Option	Ends throttling	Keeps ordered query	Cost neutral	No rewrite
Raise provisioned capacity	✗	✓	✓	✓
Adaptive capacity alone	✗	✓	✓	✓
On-demand mode	✗	✓	,	✓
DAX (reads only)	✗	✓	,	✓
GSI on different PK	✗	✓	,	,
Composite-key redesign (hot dim to SK)	✓	✗	✓	✗
Write sharding + scatter-gather	✓	✓	✓	✓

Matching the fix to the ceiling

Same workload, same total capacity. Top: one hot key, 1,000 WCU ceiling, 5× over. Bottom: ten shard keys, ten partitions, each carrying a tenth of the load, comfortably inside the ceiling that nothing except sharding can lift.

Write sharding, in depth

Three things to nail down: how shards are chosen on write, how reads reconstruct the leaderboard, and how to pick N.

Shard selection on write. Two flavours. Random sharding generates a random integer in [0, N-1] at write time and sets pk = <game_id>#<r>. Writes fan out uniformly in expectation; the player’s score lives in whichever shard the write picked, which means single-player lookups have to scatter across every shard. Calculated sharding derives the suffix from a stable attribute – pk = <game_id>#<hash(player_id) % N>, so every write for the same (game_id, player_id) lands on the same shard. Personal-best updates stay a direct GetItem. Distribution is only as uniform as the hash; FNV-1a or the low bits of SHA-256 is fine in practice. For a leaderboard, calculated sharding keyed on player_id is usually the better fit.

Read-path scatter-gather. The top-100 query becomes N parallel queries, one per shard, each returning up to 100 items; the application merges N × 100 items down to 100. Partial results from each shard are already sorted by the sort key, so the merge is a heap-merge that takes microseconds. Latency cost is the slowest of the N queries, fire them in parallel. Pagination is messier: carrying N cursors for “next 100” is doable but awkward. Most leaderboards show top 100 plus the caller’s own position (a point lookup on their shard), so deep pagination is rarely the question being asked.

Picking N. The constraint is the per-partition write ceiling: sustained load per shard should sit comfortably below 1,000 WCU with headroom for peaks. For 5,000 WCU/s sustained with peaks to 7,000, N = 5 leaves no headroom; N = 10 lands at 500-700 WCU per shard; N = 20 drops to 250-350 WCU per shard at the cost of a bigger scatter-gather. N is a write-time constant, changing it later means rewriting historical items, so pick it for the peak you expect with slack.

A worked example: peak traffic before and after

Before sharding, on the single key, the arithmetic is brutal. Writes target one physical partition at a 1,000 WCU ceiling against 5,000 WCU of demand; once the 300-second burst cushion is exhausted, roughly 80% of PutItem attempts throttle. The SDK sees ProvisionedThroughputExceededException and retries with exponential backoff and jitter; most succeed eventually, at the cost of multi-second tail latencies and some drops on retries that run out their budget. Reads target the same partition; 12,000 RCU of demand against a 3,000 RCU ceiling means queuing, climbing latencies, and intermittent read-side throttling that adaptive capacity cannot lift because it’s already isolated the key.

After sharding with N = 10, the key schema becomes pk = <game_id>#<hash(player_id) % 10> with sk = score#player_id unchanged. Writes at 5,000 WCU/s spread across ten partition-key values; with a reasonably uniform player_id hash each shard sees roughly 500 WCU/s sustained and peaks of 700, all inside the 1,000 ceiling. Write throttling stops. The top-100 query becomes ten parallel queries, each against a partition doing one-tenth of the original read work; per-shard read load sits well under 3,000 RCU. Top-100 fetch latency gains a few milliseconds for the fan-out and merge, a conscious trade below the product budget.

During the transition, historical items still sit under the old pk = <game_id> while new writes use sharded keys. Adaptive capacity isolates and borrows for the legacy partition for as long as that legacy read traffic lasts; it smooths the cutover without being the permanent answer.

What’s worth remembering

Per-partition limits are 1,000 WCU and 3,000 RCU, regardless of table provisioning or capacity mode. Total capacity is spread across partitions; a single hot key is capped at one partition’s worth.
Burst capacity is 300 seconds of unused throughput, a cushion for short spikes, not a sustained workaround for hot keys.
Adaptive capacity rebalances and can isolate a hot key to its own partition, but the ceiling it delivers is still 1,000 WCU / 3,000 RCU. It smooths uneven workloads; it doesn’t defeat the per-partition cap.
Write sharding – pk = <base_key>#<suffix>, is the structural fix for single-key hot partitions. Random suffix for uniform spread; calculated suffix (hash(attribute) % N) for direct single-item lookup.
Reads against sharded keys are scatter-gather: N parallel queries, merged in the application. The top-K merge is cheap; added latency is the slowest of the N queries, not the sum.
Pick N so peak per-shard load stays under 1,000 WCU/s (or 3,000 RCU/s if reads dominate). N = 10 for low-thousands WCU/s; more if growth is expected.
On-demand mode doesn’t lift per-partition limits. Good for unpredictable aggregate demand and removing capacity management; not a fix for a single hot key.
DAX is a read-side cache with eventually-consistent semantics and microsecond latency. Helpful for read-heavy hot keys with cacheable freshness; does nothing for writes.
GSIs have their own partitions with the same 1,000 WCU / 3,000 RCU limits. A GSI on a different partition key redistributes reads, not the base-table write path.
Composite-key redesign (hot dimension into the sort key) breaks the single-query top-N semantics for leaderboards and forces a Scan. Usually wrong.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.