The situation
Three workloads, one account, three different requirements for an in-memory data store.
- Session store: user sessions keyed by session ID, stored as JSON blobs of roughly 2 KB each. Active sessions at peak: ~400,000. Data must survive primary node failure and ideally a full AZ failure; a lost session is a user logged out mid-checkout. Writes: ~500/s. Reads: ~8,000/s.
- Leaderboard: a gaming feature showing the top 100 players per region per game, updated every time a player finishes a round. Needs to answer “what is player X’s rank?” and “who are the top 100?” atomically. Not catastrophic if the leaderboard is stale for a few seconds after a node failure. Writes: ~2,000/s. Reads: ~15,000/s.
- Read-through cache: fronts a recommendation API. Computed recommendations are expensive to produce (~200 ms from the ML service) and are cached for 10 minutes. Any missing entry is rebuilt on read by the application. Purely a latency optimisation; no durability requirement at all. Writes: ~3,000/s. Reads: ~40,000/s.
Today all three use a single ElastiCache Redis cluster because “cache is cache”. The cluster is a cache.r7g.xlarge with one replica. It works, but the three workloads are contending for memory, and every feature shipped asks the question again: is this the correct engine for this workload?
What actually matters
Before comparing engines it’s worth naming what each workload actually needs from a cache.
The session store wants persistence and replication. A session is small state owned by the cache. If the cache loses it, the user loses their session. Running a single in-memory node and losing all sessions on a reboot is not acceptable. Running a replicated primary with snapshot persistence and automatic failover is.
The leaderboard wants rich data structures and atomic operations. A sorted set (Redis ZSET) is the natural representation: each member has a score, the set stays sorted, and operations like “add this score”, “get rank of this member”, “get top N” are O(log N) in a single atomic command. Doing this in Memcached (which has only flat key-value strings) would require either storing a serialised sorted list per key and rewriting the whole thing on every update (expensive, not atomic across clients) or maintaining the ordering in the application layer (race conditions, bugs). The engine needs to understand the data shape.
The read-through cache wants throughput, simple semantics, and horizontal scale. Every entry is independent; any entry can be rebuilt from source if missing. No cross-key operations. No persistence. The simplest possible cache, scaled across many small nodes to handle roughly 43k ops/s (3,000 writes plus 40,000 reads per second). This is what Memcached was designed for.
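Second: the persistence axis. Open-source Redis offers RDB snapshots (periodic point-in-time dumps) and AOF (append-only file) logging. On ElastiCache the mechanism you actually rely on is snapshot-based “Backup and restore”; AOF is not available on current ElastiCache engine versions, so protection between snapshots comes from replication rather than from a log on disk. Memcached offers nothing: if a node restarts, its cache is empty. For the session store, snapshots matter. For the recommendation cache, they don’t.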
Third: the replication axis. Redis supports primary-replica replication with automatic failover via Multi-AZ. ElastiCache Redis (and Valkey) runs either sharded (cluster mode enabled) or non-sharded (cluster mode disabled), in both cases with read replicas that serve reads and can be promoted to primary. Memcached in ElastiCache has no replication: a node is a node is a node. Data is distributed across nodes by client-side hashing, and a node failure means losing the data on that node (which the application has to recompute from source).
Fourth: the data-structure axis. Redis understands strings, lists, sets, sorted sets, hashes, streams, bitmaps, HyperLogLog, and geospatial indexes. Each individual command on these structures is atomic; MULTI/EXEC transactions and Lua scripting allow compound atomic operations (pipelining only batches round trips and does not make a group of commands atomic). Memcached understands strings. Strings and nothing else. An increment operation exists, but not sorted sets or lists or any of the richer shapes.
Fifth: the cost-per-useful-byte axis. Memcached runs on ElastiCache’s cache.t4g and cache.m7g classes at slightly lower cost than equivalent Redis classes; but the bigger cost story is that Memcached has no replicas (you’re not paying for a replica you don’t need), while Redis nodes often come in pairs.
Sixth: the features in ElastiCache that only exist for Redis/Valkey. Encryption in transit and at rest are offered for both, but MULTI/EXEC transactions, RBAC via ACLs, cross-region replication via Global Datastore, and IAM authentication are Redis/Valkey features. Memcached in ElastiCache is a simpler service with simpler guarantees.
And finally, one mental shift: Redis and Valkey are near-interchangeable in ElastiCache today. Valkey is a fork of Redis 7.2, created after Redis changed its licence in 2024, much as OpenTofu was forked from Terraform. ElastiCache supports both engines, and for new workloads the Valkey variant is typically cheaper on the same node class. The data model, commands, and client libraries are compatible. For the purposes of this decision, “Redis” and “Valkey” are one option; the choice between them is a later consideration about licensing, pricing, and long-term stability.
What we’ll filter on
- Data structures: simple strings, or rich sorted sets/hashes/lists?
- Persistence: does the data survive a node restart?
- Replication: primary-replica with automatic failover, or no replication at all?
- Multi-AZ: can the store survive a full AZ failure?
- Horizontal scale: sharded, replicated, or single-node?
- Cost per GB of cache: base node class plus replica multiplier?
The ElastiCache engine landscape
- ElastiCache Redis / Valkey, cluster mode disabled. Single primary node with up to 5 read replicas. Supports all Redis data structures, atomic operations, Lua scripting, transactions. Replica promotion on primary failure is automatic with Multi-AZ enabled. Maximum cache memory bounded by the node class (up to ~635 GB on cache.r8g.24xlarge). Simplest Redis topology; best when the entire dataset fits on a single primary.
- ElastiCache Redis / Valkey, cluster mode enabled. Sharded across 1 to 500 shards, each shard a primary + up to 5 replicas. Horizontal scale for datasets larger than a single node; automatic failover within each shard. Client library must be cluster-aware (CLUSTER command set). Adds some operational complexity (resharding, slot management) but opens multi-TB cache sizes.
- ElastiCache Memcached. Multi-node cluster where each node holds a disjoint slice of the keyspace (client-side consistent hashing). No replication, no persistence, no data structures beyond strings. Maximum 300 nodes per cluster, up to ~635 GB per node. Multi-threaded per node (Redis is single-threaded per shard), which helps when CPU on a single node would be the bottleneck.
- Redis OSS self-managed on EC2. Not ElastiCache; mentioned for contrast. You own patching, failover, sharding, monitoring. Reach for this only when ElastiCache’s featureset doesn’t fit, which is rarely the case today.
- DAX (DynamoDB Accelerator). A specialised in-memory cache specifically in front of DynamoDB. API-compatible with DynamoDB SDKs; transparently caches reads. Not a general-purpose cache; mentioned because it shows up in “which cache?” conversations and is the correct answer for DynamoDB-heavy workloads.
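The client-side consistent hashing mentioned for Memcached can be illustrated with a toy hash ring. This is a minimal sketch using only the standard library, not the algorithm of any particular client: the class name `HashRing`, the node names, and the choice of MD5 with 100 virtual points per node are all assumptions made for illustration. It shows the property that matters here: when a node disappears, only the keys that lived on that node move.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring. Real Memcached clients ship their own."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Each node owns several points so load spreads more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [point for point in self._ring if point[1] != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring has no nodes")
        # First point at or after the key's hash, wrapping past the end.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]
```

Removing a node from a three-node ring leaves every key that mapped to the two survivors exactly where it was; the keys that mapped to the lost node are redistributed and simply miss until the application rebuilds them.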
Side by side
| Option | Data structures | Persistence | Replication | Multi-AZ | Horizontal scale |
|---|---|---|---|---|---|
| Redis/Valkey, cluster mode disabled | rich | snapshots | primary + 0-5 replicas | ✓ | vertical (node class) |
| Redis/Valkey, cluster mode enabled | rich | snapshots per shard | per-shard replicas | ✓ | horizontal (shards) |
| Memcached | strings only | none | none | ✗ (node-level failure loses that node’s data) | horizontal (nodes) |
| DAX | DynamoDB items | replicated across nodes | yes, for DynamoDB items only | ✓ | cluster-level |
Reading the table by workload:
- Session store: needs persistence, replication, Multi-AZ. Redis (or Valkey) with cluster mode disabled is sufficient; the dataset fits easily on a single primary, and the replica + automatic failover handles the durability story. Enable Multi-AZ + automatic failover + daily snapshots.
- Leaderboard: needs sorted sets and atomic operations. Redis (or Valkey) is mandatory. Cluster mode depends on scale: for 100 players per region per game, non-sharded Redis is fine; for a very large leaderboard across millions of players, cluster mode with per-shard replicas handles it.
- Read-through cache: needs throughput, no persistence, no replication. Memcached is a natural fit. Scale out with more, smaller nodes; if a node is lost, the application rebuilds its entries on miss.
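The way the table is read above can be condensed into a small decision function. This is only a sketch of the reasoning in this section, not an official selection rule; the names `Needs` and `pick_engine` and the `dataset_fits_one_node` flag are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """What a workload requires from its cache (illustrative fields)."""
    rich_data_structures: bool
    persistence: bool
    replication: bool
    dataset_fits_one_node: bool = True


def pick_engine(needs: Needs) -> str:
    # Any requirement Memcached cannot meet forces Redis/Valkey.
    if needs.rich_data_structures or needs.persistence or needs.replication:
        if needs.dataset_fits_one_node:
            return "Redis/Valkey, cluster mode disabled"
        return "Redis/Valkey, cluster mode enabled"
    # Independent, rebuildable entries with no durability needs.
    return "Memcached"


session_store = Needs(rich_data_structures=False, persistence=True, replication=True)
leaderboard = Needs(rich_data_structures=True, persistence=False, replication=False)
reco_cache = Needs(rich_data_structures=False, persistence=False, replication=False)
```

Applied to the three workloads, the session store and the leaderboard both land on Redis/Valkey with cluster mode disabled, and the recommendation cache lands on Memcached.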
Engine to workload matching
The picks in depth
Session store → ElastiCache for Valkey, cluster mode disabled, Multi-AZ with automatic failover. One primary node (cache.r7g.large, ~13 GB) and one replica in a different AZ. 400,000 sessions × 2 KB = 800 MB, which fits comfortably on this smaller node class with headroom for metadata and growth.
aws elasticache create-replication-group \
--replication-group-id sessions \
--replication-group-description "User sessions" \
--engine valkey \
--engine-version 7.2 \
--cache-node-type cache.r7g.large \
--num-cache-clusters 2 \
--automatic-failover-enabled \
--multi-az-enabled \
--at-rest-encryption-enabled \
--transit-encryption-enabled \
--snapshot-retention-limit 7
num-cache-clusters 2 means one primary + one replica; multi-az-enabled places them in different AZs. Automatic failover promotes the replica to primary on heartbeat failure; session data survives. snapshot-retention-limit 7 keeps daily snapshots for a week, which is the “accidentally ran FLUSHALL in prod” insurance.
Session TTL is handled by the application (SET sess:abc123 <json> EX 86400). Redis/Valkey expires keys automatically; no housekeeping cron.
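On the application side, the SET with EX pattern might look like the following. This is a minimal sketch that assumes a redis-py style client object exposing `set(name, value, ex=...)` and `get(name)`; the helper names `save_session` and `load_session` are hypothetical, and the client is passed in rather than constructed so that no endpoint is assumed.

```python
import json

SESSION_TTL_SECONDS = 86400  # matches the EX 86400 in the text


def session_key(session_id: str) -> str:
    return f"sess:{session_id}"


def save_session(client, session_id: str, data: dict) -> None:
    # Value and expiry are written in one command, so a session
    # can never exist without a TTL.
    client.set(session_key(session_id), json.dumps(data), ex=SESSION_TTL_SECONDS)


def load_session(client, session_id: str):
    raw = client.get(session_key(session_id))
    # None means the key expired or never existed: a genuine miss.
    return None if raw is None else json.loads(raw)
```

Keeping the expiry in the same call as the write is the design point: there is no window in which a session exists without a TTL, so nothing accumulates if a later step fails.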
Leaderboard → ElastiCache for Valkey, cluster mode disabled. Same engine, single primary. Dataset is small (100 entries per leaderboard × 50 leaderboards × a few hundred bytes per entry = a handful of megabytes). Updates use ZADD; queries use ZREVRANK and ZREVRANGE:
ZADD leaderboard:region:emea:game:42 150000 player:alice
ZREVRANK leaderboard:region:emea:game:42 player:alice # → 3 (4th place)
ZREVRANGE leaderboard:region:emea:game:42 0 99 WITHSCORES # top 100
A single ZADD from each game-finish event updates the score and maintains the ordering atomically. Reads are O(log N) for rank lookups and O(log N + M) for range queries that return M members. No replica strictly required because the data is rebuildable from the game history, but a Multi-AZ pair keeps it available during maintenance without application-layer fallback.
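The same three commands can be wrapped in application helpers. The sketch below assumes a redis-py style client with `zadd(name, mapping)`, `zrevrank(name, value)` and `zrevrange(name, start, end, withscores=...)`; the function names are hypothetical and the key layout is copied from the commands above.

```python
def leaderboard_key(region: str, game_id: int) -> str:
    return f"leaderboard:region:{region}:game:{game_id}"


def record_score(client, region, game_id, player, score):
    # ZADD inserts or updates the member and keeps the set ordered.
    client.zadd(leaderboard_key(region, game_id), {f"player:{player}": score})


def player_rank(client, region, game_id, player):
    # ZREVRANK is zero-based from the highest score; convert to 1-based.
    rank = client.zrevrank(leaderboard_key(region, game_id), f"player:{player}")
    return None if rank is None else rank + 1


def top_players(client, region, game_id, count=100):
    # ZREVRANGE 0..count-1 returns the highest scores first.
    return client.zrevrange(
        leaderboard_key(region, game_id), 0, count - 1, withscores=True
    )
```

Each helper issues exactly one command, so the atomicity guarantee comes from the engine and the application needs no locking of its own.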
Recommendation read-through cache → ElastiCache for Memcached. Three cache.m7g.large nodes in three different AZs. The application library (e.g. pymemcache or dalli) uses consistent hashing to distribute keys across the three nodes. No replica, no persistence; if a node dies, its third of the keyspace is missing and the application rebuilds entries from the ML service on miss.
aws elasticache create-cache-cluster \
--cache-cluster-id reco-cache \
--engine memcached \
--cache-node-type cache.m7g.large \
--num-cache-nodes 3 \
--preferred-availability-zones eu-west-1a eu-west-1b eu-west-1c \
--transit-encryption-enabled
Throughput per node is typically higher than Redis for simple gets and sets because Memcached is multi-threaded; three nodes comfortably handle the ~43k ops/s this workload generates. Scaling out is a node-count change, with the client library picking up the new node on its next refresh of the node list.
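The read-through pattern this cluster serves is short enough to sketch. The example assumes a pymemcache-style client exposing `get(key)` and `set(key, value, expire=...)`; the function name `get_recommendations`, the `reco:` key prefix, and the `compute` callback standing in for the ML service call are all hypothetical.

```python
import json

RECO_TTL_SECONDS = 600  # the 10-minute lifetime from the text


def get_recommendations(cache, user_id, compute):
    """Return cached recommendations, rebuilding them on a miss."""
    key = f"reco:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # Miss: the entry expired, was evicted, or its node was lost.
    fresh = compute(user_id)
    cache.set(key, json.dumps(fresh), expire=RECO_TTL_SECONDS)
    return fresh
```

Because a miss and a lost node look identical to this function, node failure needs no special handling; the only cost is the extra latency of recomputing the entry.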
Why not one Redis/Valkey cluster for everything
The temptation to consolidate all three workloads on one Redis cluster is real: fewer clusters to manage, one engine to know, simpler IAM.
The problems:
- Memory contention. Sessions are tiny and many; leaderboard entries are small and few; recommendation cache entries are medium-sized and many. On one cluster, an eviction policy has to pick which to drop when memory is tight, and the app-level signal about “which is least important” is not visible to Redis.
- Throughput contention. Redis is single-threaded per shard; 40k ops/s of recommendation reads compete with 8k ops/s of session reads on the same CPU core. Either scale the cluster bigger than any single workload needs, or accept the cross-workload latency coupling.
- Wrong tool for the read-through cache. Redis’s persistence and replication are overhead for a workload that doesn’t need them. Memcached’s multi-threaded nodes are cheaper and faster for this use case at comparable capacity.
Splitting gives each workload the correct engine, the correct sizing, and the correct failure mode. The operational overhead of three small clusters is marginal, and the cost is lower than one oversized cluster trying to absorb all three patterns.
A worked session trace
We’re debugging a session-loss incident: some users report being logged out unexpectedly around 03:14 UTC. The app logs show their session cookie was valid but the server-side session lookup returned a miss.
03:14:07 ElastiCache sessions (replication group)
event: Automatic failover initiated
primary: cache.sessions-001.abc.euw1.cache.amazonaws.com (eu-west-1a)
replica: cache.sessions-002.abc.euw1.cache.amazonaws.com (eu-west-1b)
status: replica promoted to primary
03:14:12 ElastiCache sessions (replication group)
event: Failover completed
DNS: sessions.abc.euw1.cache.amazonaws.com → 002 (in eu-west-1b)
App logs, 03:14:08–03:14:13:
[WARN] redis.timeout MGET sess:abc123 -> context deadline
[ERR] auth.session session lookup failed, logging out user
[INFO] retry succeeded at 03:14:14 (new primary)
Six seconds of write-unavailability during failover, during which in-flight reads time out and the application’s session middleware falls back to “no session found”. For a small number of users whose requests happened to land in those seconds, the effect is a logout.
The fix is not to remove failover (it’s what saved everyone else’s sessions) but to make the application’s Redis client retry with exponential backoff for up to ~15 seconds before concluding the session is missing. Valkey’s failover completes in well under 15 seconds in the majority of cases; a retry absorbs it.
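A retry wrapper along those lines could look like this. It is a minimal stdlib sketch, not the retry mechanism built into any client library: the name `with_retry`, the delay values, and the jitter scheme are assumptions, and the default `retry_on` uses Python's built-in exception types, so a real client's own connection and timeout exceptions would need to be passed in instead.

```python
import random
import time


def with_retry(operation, *, retry_on=(ConnectionError, TimeoutError),
               budget_seconds=15.0, base_delay=0.1, max_delay=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call operation(), retrying transient errors with exponential backoff."""
    deadline = clock() + budget_seconds
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            # Double the delay each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter so that many clients do not retry in lockstep.
            delay = random.uniform(delay / 2, delay)
            if clock() + delay > deadline:
                raise  # budget exhausted: surface the original error
            sleep(delay)
            attempt += 1
```

The important distinction for the session middleware is between an error and a miss: a timeout is retried within the budget, and only a successful lookup that returns nothing should be treated as “no session found”.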
What’s worth remembering
- Data structures are the first filter. Redis/Valkey understand rich types (lists, sets, sorted sets, hashes, streams). Memcached understands strings. Workloads that need atomic operations on non-string data need Redis/Valkey.
- Persistence and replication are the second filter. Memcached has neither; a node restart loses that node’s data. Redis/Valkey on ElastiCache offer snapshots and primary-replica replication with automatic failover when Multi-AZ is enabled.
- Single-threaded per shard is a Redis characteristic. Memcached is multi-threaded per node, so a single Memcached node can saturate more CPU cores than a single Redis node. For pure throughput on simple keys, this can matter.
- Cluster mode enabled is for large datasets. Cluster mode shards the keyspace across multiple primaries, each with its own replicas. Use when a single primary can’t hold the whole dataset; otherwise cluster mode disabled is simpler.
- Valkey is the drop-in replacement for Redis. Same commands, same clients, same data model; typically cheaper on the same class in ElastiCache. For new workloads on ElastiCache the engine choice is “Valkey unless there’s a specific reason for Redis”.
- DAX is cache-shaped but DynamoDB-specific. If the source of truth is DynamoDB, DAX replaces application-side caching with a transparent cache in front of the table that speaks the DynamoDB API. Not a general-purpose cache.
- Don’t consolidate for its own sake. One cluster per workload keeps the engine choice, sizing, and failure mode appropriate; the operational overhead of multiple small ElastiCache clusters is low.
- Multi-AZ replication for stateful workloads; none for ephemeral ones. Session data earns a Multi-AZ replica with automatic failover. The recommendation cache does not; a node failure is a recomputation, not an incident.
Three caches, three engines in the same ElastiCache service. Data model picks Redis or Memcached; durability and replication pick the Redis topology; throughput and failure tolerance pick the scale. The wrong engine on the wrong workload either costs too much, drops data on failure, or forces the application to reinvent the data structures in user code.