How to Design Multi-Region Failover Under a Minute

The situation

We run a payments API in eu-west-1, a handful of Fargate services behind an ALB, backed by an Aurora PostgreSQL cluster and a DynamoDB table for idempotency keys. Revenue per minute of downtime is high enough that the business has set a recovery-time objective of 60 seconds and a recovery-point objective of zero: no lost transactions, ever, even if Dublin goes dark.

The current plan is to run the same stack in eu-west-2 and cut traffic over with Route 53. The question isn’t whether we can do that (we can) but which trade-offs we have to be honest about. Active-active isn’t a checkbox; it’s a set of data-replication choices, each with a different latency cost, a different conflict model, and a different failure mode. Before we commit to a topology we need to walk the options and say what we’re willing to trade.

What actually matters

The core trade in multi-Region architecture is latency in exchange for consistency. In a single Region, a write commits locally in milliseconds and everything downstream agrees. The moment we add a second Region, either the write waits for the remote Region to acknowledge (consistent but slow), or it commits locally and replicates asynchronously (fast but eventually consistent, with a replication lag during which the two Regions disagree).

Payments makes this uncomfortable. “Eventually consistent” means that during a failover, in-flight writes that hadn’t replicated yet are blocked, lost, or double-posted. None of those is acceptable. So the first thing to ask is: which writes genuinely need synchronous cross-Region durability, and which can tolerate a second of lag? Idempotency keys: strict zero RPO; a duplicate charge is a refund and a customer-service ticket. Transaction history: near-zero RPO is acceptable; we can reconcile from the card network’s side of the pipe. Customer profile data: eventual consistency is fine; a stale address for ten seconds isn’t a payment incident.

The second thing is: how do we route traffic? Active-active can mean “every user hits the nearest Region” (latency-based routing) or “all traffic goes to one Region and the other is warm but idle until failover” (weighted routing with an override). The first halves our latency and doubles our blast radius for a bad deploy; the second keeps the failure modes tidy but leaves half the fleet paid for and doing nothing.

The third is: what breaks on failover? The cutover is only as fast as the slowest dependency. DNS-layer health checks can fire in tens of seconds, but if a relational replica takes minutes to promote, the RTO is minutes. If a key-value store accepts writes in both Regions natively, writes stay available through the outage and the application doesn’t care. The data-layer choice is the failover-speed choice.

The fourth is the honest one: can we run both Regions hot enough that we know failover works? A passive Region that only takes traffic during an outage is a passive Region whose configuration drift we discover during the outage. Real active-active means real traffic in both Regions every day, which means real bugs found in both Regions every day, which is exactly what you want.

What we’ll filter on

RPO at the data layer: how much in-flight data can we lose?
RTO for a whole-Region failure: how fast does traffic reach the surviving Region?
Write availability during a Region outage: do writes succeed in the surviving Region, or block until failover?
Conflict model: what happens when both Regions accept a write to the same key?
Steady-state cost multiplier: one Region of cost, two, or somewhere in between?

The replication landscape

DynamoDB Global Tables. Multi-Region, multi-master replication at the table level. Writes accepted in any Region, asynchronously replicated to the others with typical sub-second lag, last-writer-wins conflict resolution on the write timestamp. RPO is the replication lag, usually under a second, occasionally higher under load. RTO is effectively zero for writes: the surviving Region keeps accepting them with no failover dance. Cost is roughly 2× single-Region writes (replicated writes bill in both Regions) plus cross-Region data transfer. Ideal for idempotency keys, session data, and any workload where LWW is a safe resolution.
Aurora Global Database. One writer Region, up to five reader Regions. Replication uses the storage layer, not logical logs, so lag is typically under a second, though it is not bounded by design and can grow under heavy write load. Readers serve low-latency reads locally; writes must round-trip to the writer. Failover is either managed (planned, lossless, takes roughly a minute) or unplanned (emergency promotion of a reader, losing whatever had not yet replicated, typically under a second of writes). Write availability drops to zero during the promotion window. Cost is one writer cluster + N reader clusters, each paying compute and storage.
Aurora cross-Region read replicas (legacy). The pre-Global-Database pattern: a standalone cluster in the DR Region fed by logical replication from the primary. Replication lag measured in seconds to minutes depending on load; promotion takes several minutes; manual. Cheaper than Global Database but slower in every dimension. Superseded for new designs.
S3 Cross-Region Replication (CRR). Asynchronous object replication between buckets. Lag is typically seconds, occasionally minutes for large objects or high throughput. Not a database, but the correct tool for replicating receipts, audit exports, and anything already living in S3. Writes are always available in either Region; conflicts handled by whichever bucket the client wrote to (there is no cross-bucket conflict because there is no shared key space).
Application-layer dual-write. The service writes to both Regions from the edge and succeeds only when both acknowledge. Synchronous, strongly consistent, and slow: every write pays the cross-Region round-trip (~15ms Dublin to London, ~80ms Dublin to Virginia). Failure modes are ugly: partial writes when one Region is slow, retry storms when one is dead, and the operator becomes responsible for conflict resolution. Avoid unless the consistency requirement is absolute and nothing in AWS’s native toolkit fits.
Single-Region primary + async backup. Not active-active at all: one Region takes all traffic, the other is a backup-restore target fed by AWS Backup, cross-Region snapshots, and S3 CRR. RPO measured in minutes to hours; RTO measured in tens of minutes once somebody runs the failover playbook. Cheapest option; fails our 60-second RTO by at least an order of magnitude. Included to anchor the scale.

Side by side

Option	RPO	RTO for Region outage	Write availability during outage	Conflict model	Cost multiplier
DynamoDB Global Tables	Sub-second	~0	Writes continue	Last-writer-wins	~2×
Aurora Global Database	Typically <1s	~60s (unplanned)	Blocks until promotion	Single writer	~1.3–1.8×
Aurora cross-Region RR	Seconds-minutes	Minutes	Blocks until promotion	Single writer	~1.3×
S3 CRR	Seconds	~0	Writes continue	Per-bucket (no shared key)	~2× storage + transfer
App-layer dual-write	~0	~0	Degraded	Application-defined	~2× + complexity
Single-Region + backup	Minutes-hours	Tens of minutes	Blocks	N/A	~1.1×

Reading by data layer rather than by option:

Idempotency keys: strict zero RPO wanted, write-heavy, LWW is safe (the key is unique; the conflict is “both Regions saw the same request”). Global Tables is the best fit, with one caveat: its replication is asynchronous by default, so the delivered RPO is the sub-second replication lag rather than a strict zero.
Transactional state (Aurora): near-zero RPO acceptable, single-writer is fine. Aurora Global Database with managed failover as the planned path, unplanned promotion as the break-glass.
Audit exports / receipts (S3): already in S3, asynchronous is fine. CRR with versioning and a Glacier lifecycle rule on the replicated buckets.
The service itself (Fargate behind ALB): stateless; we just need the ALB reachable and the task definition deployable in both Regions. Route 53 latency-based routing with health checks handles the cutover.

The topology

Three replication shapes, one per data layer. Global Tables multi-master for idempotency, Aurora Global Database single-writer for transactions, S3 CRR for receipts.

The picks in depth

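Route 53 latency-based routing with health checks. Two alias records for api.payments.example.com, one per Region, each with a latency-based routing policy and a health check pointing at /healthz on that Region’s ALB. Users in Dublin resolve to Dublin; users in London resolve to London. A failing health check removes a Region from DNS answers once the failure threshold is reached, and clients holding a cached answer keep using it until the TTL expires. The often-quoted 30-60 seconds is the favorable case; with the 30-second check interval, two-failure threshold, and 60-second TTL used in the trace below, the slowest clients take closer to 105 seconds. Clients see a handful of errors during the cutover, retry, and resolve to the healthy Region. The 60-second RTO budget is almost entirely spent here, and can be overspent.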
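
The DNS-layer arithmetic can be sketched as follows. This is a simplified model that assumes a single health checker and no propagation delay inside Route 53; dns_cutover_window is a hypothetical helper, not part of any AWS SDK.

```python
def dns_cutover_window(check_interval_s, failure_threshold, ttl_s):
    """Return (best_case_s, worst_case_s), measured from the start of the
    outage to the moment a client stops using the failed Region.

    Best case: the outage starts just before a check, so detection takes
    (failure_threshold - 1) intervals, and the client's cached answer
    expires immediately.
    Worst case: the outage starts just after a check, so detection takes
    failure_threshold full intervals, and the client cached the stale
    answer just before it was withdrawn, so it waits out a full TTL.
    """
    best = (failure_threshold - 1) * check_interval_s
    worst = failure_threshold * check_interval_s + ttl_s
    return best, worst


# 30-second interval, two consecutive failures, 60-second TTL.
print(dns_cutover_window(30, 2, 60))  # (30, 120)
# 10-second interval, three consecutive failures, 60-second TTL.
print(dns_cutover_window(10, 3, 60))  # (20, 90)
```

Under this model the TTL dominates the worst case, which is why shortening the health-check interval alone does not bring the cutover reliably under a minute.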

DynamoDB Global Tables for idempotency keys. CreatePaymentAttempt(key) writes to the table in whichever Region the request landed. The replica in the other Region sees the write within a second. A second attempt with the same key, wherever it lands, sees the record once replication has caught up and short-circuits with the prior result. During a Region outage, writes go to the surviving Region with zero blocking. Last-writer-wins is safe here because the key space is per-attempt; we don’t expect the same key written with different values from both Regions in the replication window. The residual risk is a retry that reaches the other Region inside that window: it will not find the record, and this is where the design falls short of a strict zero RPO.
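
To make the replication window concrete, here is a small in-memory model of two replicas with last-writer-wins. It is a sketch under stated assumptions, not the DynamoDB API: the Replica class, its methods, and the integer timestamps are hypothetical, and put_if_absent stands in for a conditional write that is evaluated only against the local Region’s copy.

```python
class Replica:
    """One Region's copy of the idempotency table (in-memory stand-in)."""

    def __init__(self):
        self.items = {}  # key -> (timestamp, value)

    def put_if_absent(self, key, value, ts):
        """Conditional write: succeeds only if this Region has no record."""
        if key in self.items:
            return False
        self.items[key] = (ts, value)
        return True

    def apply_replicated(self, key, value, ts):
        """Last-writer-wins: keep the incoming item only if it is newer."""
        current = self.items.get(key)
        if current is None or ts > current[0]:
            self.items[key] = (ts, value)


def replicate(src, dst):
    """Ship every item from one replica to the other."""
    for key, (ts, value) in src.items.items():
        dst.apply_replicated(key, value, ts)


dublin, london = Replica(), Replica()

# The first attempt lands in Dublin.
first = dublin.put_if_absent("attempt-123", "charged", ts=100)
# A retry reaches London before replication has caught up. London has no
# record yet, so the conditional write succeeds there too.
retry_in_window = london.put_if_absent("attempt-123", "charged", ts=101)

replicate(dublin, london)
replicate(london, dublin)

# After replication both Regions hold the later write, and any further
# retry is rejected in either Region.
late_retry = dublin.put_if_absent("attempt-123", "charged", ts=200)
print(first, retry_in_window, late_retry)  # True True False
```

Both early attempts were accepted, which is the case the downstream charging logic has to tolerate. Once the replicas converge, the key behaves as a normal idempotency guard.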

Aurora Global Database with a warm reader. Writer in eu-west-1, reader cluster in eu-west-2. During steady state, writes go to the writer through a cluster endpoint; reads can go to the local reader for low latency. During a planned cutover (for example, a Region-wide maintenance window), managed failover promotes the reader to writer losslessly in roughly a minute. During an unplanned outage of eu-west-1, an operator issues failover-global-cluster, which promotes the eu-west-2 reader and loses whatever had not yet replicated, typically under a second of writes. The application resumes writes in eu-west-2 as soon as promotion completes.

This is the bit that fails the strict zero-RPO interpretation of the requirement. For anything the business calls “a transaction” we rely on the card network’s side of the conversation as the source of truth: our ledger may briefly be behind, but the card network knows what happened and we can reconcile. The alternative, synchronous cross-Region transactions, costs us ~15ms of latency on every write and introduces failure modes that are worse than a reconciled second of replication lag.

S3 CRR for receipts. Versioning on in both buckets, replication configured in both directions (Dublin to London and London to Dublin), with DeleteMarkerReplication: Enabled so deletes propagate. A one-way rule would leave receipts written in London stranded there. Lifecycle rules on both buckets transition to Glacier Instant Retrieval at 30 days. During an outage the application writes receipts to the surviving bucket; reconciliation after the outage is “both buckets have the union of objects, CRR catches up in either direction.”
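
As a rough illustration, the replication rule for each direction could be built like this and passed to the S3 put_bucket_replication call. The bucket names, role ARN, and rule ID below are placeholders, not values from our environment, and the sketch only constructs the configuration; it does not call AWS.

```python
def replication_config(role_arn, destination_bucket_arn):
    """Build a replication configuration that copies every object,
    including delete markers, to the destination bucket."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "receipts-crr",  # placeholder rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all keys
                "DeleteMarkerReplication": {"Status": "Enabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Placeholder names: one rule per direction, applied to the source bucket.
ROLE = "arn:aws:iam::123456789012:role/receipts-replication"
directions = {
    "receipts-dublin": replication_config(ROLE, "arn:aws:s3:::receipts-london"),
    "receipts-london": replication_config(ROLE, "arn:aws:s3:::receipts-dublin"),
}

# With boto3 installed and credentials configured, each entry would be
# applied roughly like this:
#   s3.put_bucket_replication(Bucket=name, ReplicationConfiguration=cfg)
for name, cfg in directions.items():
    print(name, "->", cfg["Rules"][0]["Destination"]["Bucket"])
```

Both buckets need versioning enabled before the replication configuration is accepted.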

Fargate services in both Regions, deployed together. One CI/CD pipeline that deploys to both Regions on every merge, not a manual DR drill. The task definition, security groups, target groups, and ALB listeners all managed in Terraform, with identical module inputs per Region. The deployment is the disaster-recovery test we run every day; configuration drift doesn’t accumulate because there’s never a “passive” Region.

A worked failover trace

eu-west-1 starts misbehaving at 14:07. Route 53 health checks for the Dublin ALB begin failing at 14:07:15, confirmed at 14:07:45 after two consecutive failures. Dublin drops out of DNS. Client TTLs (60s) expire and resolvers start returning only the London record by 14:08:45. Traffic in London spikes to roughly 2× baseline; autoscaling adds Fargate tasks.

Application code in London keeps writing idempotency keys to the Global Table; writes continue uninterrupted because the replica is multi-master. Aurora writes fail for about 45 seconds while an operator runs aws rds failover-global-cluster --global-cluster-identifier payments-global --target-db-cluster-identifier payments-eu-west-2 --allow-data-loss (the flag requests a failover rather than a switchover, which would need a healthy primary; the target is normally given as the cluster’s ARN). The reader promotes, the application’s writer endpoint (abstracted behind a cluster-level DNS name that Route 53 also manages) resolves to the London writer, and writes resume. Total operator intervention: one CLI command.
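
If the runbook is scripted rather than typed, the same request could look roughly like this with boto3. This is a sketch: promote_secondary is a hypothetical wrapper, the ARN in the comment is a placeholder, and the client is passed in so the function can be exercised without AWS access.

```python
def promote_secondary(rds_client, global_cluster_id, target_cluster_arn):
    """Request an unplanned failover of an Aurora global cluster.

    `rds_client` is expected to be a boto3 RDS client, for example
    boto3.client("rds", region_name="eu-west-2"). AllowDataLoss=True asks
    for a failover rather than a switchover, accepting the loss of any
    writes that had not replicated to the target.
    """
    response = rds_client.failover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=target_cluster_arn,
        AllowDataLoss=True,
    )
    # Return the global cluster's status as reported by the service.
    return response["GlobalCluster"]["Status"]


# Example call (placeholder ARN, requires boto3 and credentials):
#   promote_secondary(
#       boto3.client("rds", region_name="eu-west-2"),
#       "payments-global",
#       "arn:aws:rds:eu-west-2:123456789012:cluster:payments-eu-west-2",
#   )
```

The call returns as soon as the request is accepted; the application still has to wait for promotion to finish before writes succeed.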

S3 writes go to the London bucket. Receipts that had been written to Dublin but hadn’t replicated yet are unavailable until Dublin recovers and, in the worst case, lost; the business has accepted this as a known gap and relies on the card network’s records for reconciliation.

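Total time from “Dublin goes dark” to “writes succeed in London”: about 75 seconds for requests already landing in London. Clients still holding a cached Dublin answer wait until 14:08:45, roughly 105 seconds after the first sign of trouble. Reads served from London never stopped; writes, and any client pinned to Dublin by DNS, come in over the 60-second RTO. The business has seen the numbers and accepted the trade: a faster write-RTO would mean accepting writes in both Regions and weakening the near-zero RPO on the ledger, which they care about more.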

What’s worth remembering

Active-active is a data-layer decision, not a traffic decision. Route 53 can move requests between Regions in under a minute. The question is always “what’s the state layer doing while that happens?”
Pick replication shape per data layer. DynamoDB Global Tables for multi-master key-value; Aurora Global Database for single-writer relational; S3 CRR for objects. One topology for all three data layers forces compromises nobody enjoys.
RPO zero and sub-60s RTO pull in opposite directions. Global Tables gives RPO sub-second and RTO zero, but only with last-writer-wins. Aurora Global Database gives RTO ~60s with RPO typically under 1s, not zero. True RPO zero requires synchronous cross-Region writes, which cost ~15-80ms of latency per write.
Managed failover is for planned cutover, unplanned failover is the emergency. Managed is lossless, takes about a minute, and requires both Regions healthy. Unplanned loses whatever had not replicated, typically under a second of writes, but works when the primary is dead. Know which playbook maps to which incident.
Latency-based routing halves average latency; weighted with override keeps failure modes tidy. Latency-based is correct when both Regions take traffic continuously. Weighted-with-override is correct when a bad deploy in one Region should not become a bad deploy in both simultaneously.
The passive Region you don’t exercise is the Region that doesn’t work. Deploy to both Regions from the same pipeline on every change. Daily deploys catch the configuration drift that otherwise accumulates over months of “we’ll test it soon.”
Route 53 health checks are a tens-of-seconds signal, not instant. Two consecutive failures plus TTL expiry put the DNS-layer cutover at 30-60 seconds in the favorable case and closer to 105 seconds for the slowest clients in the trace above. If that is too slow, the answer is client-side retry-on-different-endpoint, not a faster health check.
Cost scales roughly 2× for active-active, plus cross-Region transfer. Factor the transfer in: a chatty replication stream between two Regions can be the second-largest line on the bill after compute.
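
The client-side retry-on-different-endpoint idea from the list above can be sketched in a few lines. The per-Region hostnames are hypothetical (the design only defines api.payments.example.com), and the sketch assumes every request carries an idempotency key so that repeating it in another Region is safe.

```python
def call_with_failover(endpoints, send):
    """Try each Regional endpoint in order and return the first success.

    `send(endpoint)` performs the request and raises ConnectionError or
    TimeoutError when the Region is unreachable. Any other exception is
    treated as an application error and is not retried elsewhere.
    """
    last_error = None
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc  # remember the failure, try the next Region
    raise RuntimeError("all Regional endpoints failed") from last_error


# Hypothetical per-Region hostnames, nearest Region first.
ENDPOINTS = [
    "https://eu-west-1.api.payments.example.com",
    "https://eu-west-2.api.payments.example.com",
]


def fake_send(endpoint):
    """Stand-in transport: Dublin is down, London answers."""
    if "eu-west-1" in endpoint:
        raise ConnectionError("Dublin unreachable")
    return {"served_by": endpoint, "status": 200}


print(call_with_failover(ENDPOINTS, fake_send)["served_by"])
# https://eu-west-2.api.payments.example.com
```

A client built this way does not wait for DNS to converge: it moves to the second Region on the first failed connection.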

Active-active isn’t one pattern; it’s a set of replication shapes pinned to the data layer that best fits each. Get the shapes correct and Route 53 does the rest.