The situation
We run a mid-sized SaaS platform out of ap-southeast-2: around thirty EC2 and Fargate services, half a dozen RDS databases, a few DynamoDB tables, and a stack of S3 buckets holding customer uploads and generated reports. Revenue is around AUD$6m per quarter; a full-day outage is costly but not fatal. The board has approved a disaster-recovery programme and the regulator wants a written plan with tested recovery objectives.
The team has a budget. It is not “active-active across three Regions” money. We need to match strategy to criticality, not buy the most expensive option and call it a day.
Three services are in scope for stricter-than-default recovery:
- Checkout: the revenue service. An hour down is measurable; a day down is a board meeting.
- Customer portal: account management and billing. A few hours of downtime is awkward but survivable.
- Analytics: internal dashboards and BI reports. A day of downtime is annoying, not urgent.
Each of these wants a different recovery strategy. The trick is deciding which.
What actually matters
The core trade in disaster recovery is cost in exchange for recovery speed. The slowest DR strategy (backup and restore) costs almost nothing day-to-day but takes hours to execute. The fastest (multi-site active-active) is a second production environment that costs roughly the same as the first. Between them, pilot light and warm standby pre-stage progressively more of the DR environment; each step recovers faster and costs more.
The first thing to ask is: what RTO does the business actually need? Not what the business wants, which is always zero. What it can afford, given what a longer RTO costs in lost revenue, customer trust, and regulatory exposure. Checkout: hours, not days. Portal: four to eight hours. Analytics: a working day.
The second is: what RPO is tolerable? How much data loss is recoverable? Checkout: minutes at most (we can reconcile from payment providers). Portal: an hour or two (billing cycles are daily). Analytics: a day; the dashboards rebuild from the warehouse.
The third is: how often will we test this? A DR plan never tested is a DR plan that doesn’t work. Backup and restore can be tested by a scripted weekly restore to a parallel account; multi-site is tested by regular traffic-shifting drills; pilot light tests need a rehearsal that stands up a working environment end-to-end, which is a half-day exercise when it’s honest. Testing cost has to be in the budget.
The fourth is: which components are stateful? Recovering a stateless service is a deployment; recovering a database is data replication. The DR strategy is really a strategy for the stateful layer; the stateless services follow.
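To put a rough number on the first question, here is a minimal sketch of the lost-revenue arithmetic. Only the AUD$6m-per-quarter figure comes from the situation above; the 91-day quarter, the even spread of revenue across every hour, and the idea that all revenue stops during an outage are simplifying assumptions made for illustration.

```python
# Rough downtime-cost arithmetic. Only the quarterly revenue figure comes
# from the text; the rest are illustrative assumptions.
QUARTERLY_REVENUE_AUD = 6_000_000
DAYS_PER_QUARTER = 91   # assumption: 13 weeks
HOURS_PER_DAY = 24      # assumption: revenue is spread evenly, 24x7


def revenue_per_hour(quarterly_revenue: float) -> float:
    """Average revenue earned per hour across the quarter."""
    return quarterly_revenue / (DAYS_PER_QUARTER * HOURS_PER_DAY)


def lost_revenue(outage_hours: float,
                 quarterly_revenue: float = QUARTERLY_REVENUE_AUD) -> float:
    """Revenue not earned during an outage, assuming all revenue stops."""
    return outage_hours * revenue_per_hour(quarterly_revenue)


if __name__ == "__main__":
    for label, hours in [("15 minutes", 0.25), ("4 hours", 4), ("1 day", 24)]:
        print(f"{label:>10}: ~AUD${lost_revenue(hours):,.0f}")
```

Under these assumptions an hour is worth roughly AUD$2,750 in direct revenue. The figure deliberately ignores customer trust and regulatory exposure, which are the harder half of the RTO conversation.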
What we’ll filter on
- RTO: how fast does the service come back?
- RPO: how much data can we lose?
- Steady-state cost: how much does the DR environment cost when idle?
- Complexity to maintain: how much drift accumulates between drills?
- Testing cadence: how often can we honestly exercise the plan?
The recovery landscape
- Backup and restore. Data is backed up to a DR Region (snapshots, S3 CRR, AWS Backup). No compute runs in the DR Region between incidents. During recovery, we run infrastructure-as-code to stand up the whole stack from zero, restore databases from snapshots, and cut DNS over. RTO measured in hours. RPO is the backup interval (AWS Backup on a daily schedule = up to 24 hours; hourly schedule = up to an hour). Cheapest option; longest recovery; requires the most post-failover smoke-testing.
- Pilot light. The core stateful pieces (replica databases, replicated S3 buckets) run in the DR Region continuously, but the compute layer is scaled to zero or a minimal footprint. During recovery, Auto Scaling groups and Fargate services scale out from zero; databases are already warm. RTO measured in tens of minutes. RPO measured in seconds to minutes depending on replication method. Moderate cost: pay for databases and storage in both Regions, near-zero for compute.
- Warm standby. A scaled-down but fully functional copy of production runs in the DR Region. Databases are replicated; a minimal compute fleet takes load (can be used for read-only traffic, dark deploys, or DR-only). During recovery, scale out to production capacity and cut DNS over. RTO measured in minutes. RPO seconds. Moderate-to-high cost: paying for real capacity, even if reduced.
- Multi-site active-active. Both Regions take live traffic, each sized to handle 100% if the other fails. Covered in the active-active post; included here for comparison. RTO measured in tens of seconds (DNS failover). RPO measured in sub-seconds depending on replication. Highest cost: approximately 2x production for the whole stack.
- AWS Backup for the data layer. Not a DR strategy on its own, but the plumbing that makes backup-and-restore workable at scale. Org-wide backup plan, cross-Region and cross-account copy, tag-based resource selection, automated retention. Pairs with any of the above.
- Elastic Disaster Recovery (DRS). AWS’s block-level replication service for EC2 (and on-prem servers, which is where it started). Continuously replicates server disks to a DR Region, with minimal compute cost in the DR Region (a lightweight replication server per source); failover launches instances from the replicated volumes in minutes. RTO minutes, RPO seconds. Covered separately; relevant here because it’s an option for the compute layer of pilot light and warm standby.
Side by side
| Strategy | RTO | RPO | Steady-state cost | Complexity | Testing cadence |
|---|---|---|---|---|---|
| Backup + restore | Hours | Hours-day | ~5-10% of prod | Moderate | Monthly at best |
| Pilot light | 10-60 min | Seconds-min | ~15-25% of prod | Moderate | Quarterly |
| Warm standby | Minutes | Seconds | ~30-50% of prod | High | Monthly |
| Multi-site AA | Seconds | Sub-second | ~100% of prod | Very high | Continuous |
Reading by workload:
- Checkout. RTO of 15 minutes, RPO of minutes. Warm standby: reduced-capacity compute running constantly, Aurora Global Database keeping state within a second. Can handle read traffic and dark deploys in steady state, scales out on failover.
- Customer portal. RTO of a few hours, RPO of an hour. Pilot light: databases replicated, compute scaled to zero. IaC stands up the compute in under an hour.
- Analytics. RTO of a day, RPO of a day. Backup and restore: AWS Backup with daily cross-Region snapshots, IaC stored in Git. Recover during business hours when needed.
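The reading above amounts to "pick the cheapest strategy whose worst case still meets the objectives". A minimal sketch of that rule follows. The cost ceilings are the upper ends of the table's ranges; the numeric RTO and RPO ceilings are assumptions chosen to stand in for the table's qualitative entries (for example, 4 hours for "Hours"), and the `Strategy` type and `pick_strategy` function are hypothetical names.

```python
from dataclasses import dataclass

MIN, HOUR, DAY = 60, 3600, 86400  # seconds


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_rto_s: int    # assumed worst-case recovery time, seconds
    worst_rpo_s: int    # assumed worst-case data loss window, seconds
    max_cost_pct: int   # upper end of the steady-state cost range in the table


# Numeric ceilings are illustrative stand-ins for the table's wording.
STRATEGIES = (
    Strategy("backup and restore", 4 * HOUR, 1 * DAY, 10),
    Strategy("pilot light", 60 * MIN, 5 * MIN, 25),
    Strategy("warm standby", 10 * MIN, 1, 50),
    Strategy("multi-site active-active", 1 * MIN, 1, 100),
)


def pick_strategy(required_rto_s: int, required_rpo_s: int) -> Strategy:
    """Return the cheapest strategy whose worst case meets both objectives."""
    for s in sorted(STRATEGIES, key=lambda s: s.max_cost_pct):
        if s.worst_rto_s <= required_rto_s and s.worst_rpo_s <= required_rpo_s:
            return s
    raise ValueError("no listed strategy meets these objectives")


if __name__ == "__main__":
    workloads = {
        "checkout": (15 * MIN, 5 * MIN),
        "portal": (3 * HOUR, 1 * HOUR),
        "analytics": (1 * DAY, 1 * DAY),
    }
    for name, (rto, rpo) in workloads.items():
        print(f"{name}: {pick_strategy(rto, rpo).name}")
```

With these assumed ceilings the function reproduces the three picks above. Complexity and testing cadence are left out of the rule; in practice they can push a borderline service one step in either direction.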
The spectrum
The picks in depth
Checkout: warm standby. Aurora Global Database with the primary in ap-southeast-2 (Sydney) and a secondary, read-only cluster in ap-southeast-4 (Melbourne). Fargate services in ap-southeast-4 run at 20% of production capacity constantly, enough to serve smoke tests and a trickle of read traffic, not enough to handle live load. DynamoDB Global Tables keep idempotency keys in both Regions; S3 CRR handles receipts. On a Sydney outage, an operator runs aws rds failover-global-cluster (with --allow-data-loss, because an unreachable primary cannot take part in a coordinated switchover) and an ECS service update scales the Fargate capacity to 100%. Route 53 health checks pull Sydney out of DNS within a minute. Target RTO: 5-10 minutes. RPO: up to a second (Aurora replication lag).
The 20% capacity costs real money (roughly 20% of the primary compute bill and 100% of the Aurora reader bill), but the DR Region is exercised daily, which means drift doesn’t accumulate. Monthly traffic-shifting drills (cut 10% of live traffic to Melbourne for an hour, observe, cut back) are cheap and find real problems.
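The two operator actions in the checkout failover can be sketched as AWS CLI invocations assembled in Python. This is an illustration, not the team's actual tooling: the cluster identifiers, service names, and task counts passed in are placeholders, and running the database failover from the DR Region with `--allow-data-loss` is an assumption that fits an unplanned outage where the primary is unreachable.

```python
import subprocess

DR_REGION = "ap-southeast-4"


def checkout_failover_commands(global_cluster_id, target_cluster_arn,
                               ecs_cluster, production_counts):
    """Build the CLI calls: promote the DR Aurora cluster, then scale Fargate.

    production_counts maps ECS service name -> production task count.
    All identifiers are supplied by the caller; none are real.
    """
    commands = [[
        "aws", "rds", "failover-global-cluster",
        "--region", DR_REGION,
        "--global-cluster-identifier", global_cluster_id,
        "--target-db-cluster-identifier", target_cluster_arn,
        "--allow-data-loss",  # unplanned failover: primary cannot coordinate
    ]]
    for service, count in production_counts.items():
        commands.append([
            "aws", "ecs", "update-service",
            "--region", DR_REGION,
            "--cluster", ecs_cluster,
            "--service", service,
            "--desired-count", str(count),
        ])
    return commands


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(checkout_failover_commands(
        "checkout-global",                                               # placeholder
        "arn:aws:rds:ap-southeast-4:111122223333:cluster:checkout-dr",  # placeholder
        "checkout",                                                      # placeholder
        {"checkout-api": 10},                                            # placeholder
    ))
```

Keeping the commands as data makes the dry run the default, which suits a drill: the same list can be reviewed, logged, and then executed with `dry_run=False`.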
Customer portal: pilot light. RDS for PostgreSQL cross-Region read replica in ap-southeast-4. Fargate services in ap-southeast-4 are deployed but scaled to zero: the task definitions, ALB target groups, and security groups all exist in Terraform state, but the desired count is zero and no tasks run. ECR has the latest image, so a scale-up is “set desired count to production” and wait.
On an outage, the runbook is:
- Promote the RDS read replica to standalone (~10 minutes).
- Update the Fargate service desired count to production capacity (5-10 minutes to reach steady state).
- Update Route 53 to point portal.example.com to the Melbourne ALB.
Total RTO: 20-40 minutes. RPO: up to the RDS read-replica lag, typically seconds. Steady-state cost: the RDS replica + a small running ALB + empty Fargate service = roughly 15% of production.
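The three runbook steps can be sketched in Python against boto3-style clients. The worked recovery later in this post drives the same steps through Terraform; this version only illustrates the underlying API calls. The `recover_portal` function, the `DryRunClient` stand-in, and every identifier passed to them are hypothetical; real use would pass `boto3.client("rds")`, `boto3.client("ecs")`, and `boto3.client("route53")` created for the DR Region.

```python
class DryRunClient:
    """Stand-in for a boto3 client: records calls instead of making them."""

    def __init__(self, name, log):
        self._name, self._log = name, log

    def get_waiter(self, waiter_name):
        return DryRunClient(f"{self._name}.waiter[{waiter_name}]", self._log)

    def __getattr__(self, method):
        def call(**kwargs):
            self._log.append((self._name, method, kwargs))
            return {}
        return call


def recover_portal(rds, ecs, route53, *, replica_id, ecs_cluster, service,
                   desired_count, hosted_zone_id, record_name,
                   alb_dns_name, alb_zone_id):
    # 1. Promote the cross-Region read replica to a standalone instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Caveat: the instance can still report "available" for a moment after
    # the promote call, so a production runbook should also confirm that
    # it no longer lists a replication source.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)

    # 2. Scale the Fargate service from zero to production capacity.
    ecs.update_service(cluster=ecs_cluster, service=service,
                       desiredCount=desired_count)
    ecs.get_waiter("services_stable").wait(cluster=ecs_cluster,
                                           services=[service])

    # 3. Repoint the portal's alias record at the DR Region's ALB.
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": alb_zone_id,  # the ALB's own zone ID
                    "DNSName": alb_dns_name,
                    "EvaluateTargetHealth": True,
                },
            },
        }]},
    )


if __name__ == "__main__":
    log = []
    recover_portal(
        DryRunClient("rds", log), DryRunClient("ecs", log),
        DryRunClient("route53", log),
        replica_id="portal-replica", ecs_cluster="portal", service="portal",
        desired_count=8, hosted_zone_id="ZEXAMPLE",
        record_name="portal.example.com.",
        alb_dns_name="portal-dr.example.invalid", alb_zone_id="ZALBEXAMPLE",
    )
    for client, method, _ in log:
        print(f"{client}.{method}")
```

Passing the clients in keeps the ordering testable without touching AWS, which is useful between quarterly drills.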
Testing is quarterly: stand up the Melbourne environment end-to-end against a synthetic load, verify the portal works, tear it back down. This is the drill that catches “oh, the secret rotation hasn’t been pushed to the DR Region’s Secrets Manager.”
Analytics: backup and restore. AWS Backup managed through a delegated administrator account in the org. Backup plan:
- Daily snapshots of the analytics RDS and DynamoDB tables.
- Cross-Region copy to ap-southeast-4 for every snapshot.
- Cross-account copy to a dedicated backup-vault account with a different set of IAM controls (defence against ransomware that compromises the primary account).
- 35-day retention with a 90-day Glacier-tier archive for compliance.
No compute runs in ap-southeast-4 day-to-day. Recovery is “run Terraform with the DR Region variables, restore from snapshots, point DNS.” First recovery takes 2-4 hours; the runbook improves with practice. RPO: up to 24 hours of data, which analytics accepts because the warehouse can reprocess from event logs anyway.
Testing is monthly, not quarterly, because the steps are scripted: a CI job runs the Terraform against an isolated test account, restores the most recent snapshot, runs smoke queries, tears everything down. The job reports pass/fail; an operator doesn’t need to be present unless it fails.
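The shape of such a pass/fail job can be sketched as a small harness: run the steps in order, stop at the first failure, always tear down, and fail the drill if it overruns the RTO target. The `run_drill` function and its result format are hypothetical; the real steps (Terraform apply, snapshot restore, smoke queries) would be supplied as callables.

```python
import time
from typing import Callable, Sequence, Tuple


def run_drill(steps: Sequence[Tuple[str, Callable[[], None]]],
              teardown: Callable[[], None],
              rto_target_s: float,
              clock: Callable[[], float] = time.monotonic) -> dict:
    """Run drill steps in order and report pass/fail with timings."""
    start = clock()
    results = []
    try:
        for name, step in steps:
            t0 = clock()
            try:
                step()
                ok, error = True, None
            except Exception as exc:  # any failure fails the drill
                ok, error = False, repr(exc)
            results.append({"step": name, "ok": ok,
                            "seconds": clock() - t0, "error": error})
            if not ok:
                break  # later steps would only test a broken environment
        elapsed = clock() - start  # recovery time excludes teardown
    finally:
        teardown()  # always remove the test environment
    passed = (len(results) == len(steps)
              and all(r["ok"] for r in results)
              and elapsed <= rto_target_s)
    return {"passed": passed, "elapsed_s": elapsed, "steps": results}


if __name__ == "__main__":
    report = run_drill(
        steps=[("terraform apply", lambda: None),
               ("restore snapshot", lambda: None),
               ("smoke queries", lambda: None)],
        teardown=lambda: None,
        rto_target_s=4 * 3600,
    )
    print("PASS" if report["passed"] else "FAIL")
```

Treating an overrun as a failure, not just a broken step, is the design choice that keeps the stated RTO honest: a restore that works but takes six hours still fails a four-hour target.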
AWS Backup as the chassis. One Backup plan in the delegated admin account covers every tagged resource across the organisation. The plan does three things:
- Selects resources by tag (Backup=yes, plus service-specific tags for different retention).
- Schedules backups (hourly for checkout’s Aurora, daily for portal’s RDS, daily for analytics).
- Copies backups to the DR Region, and to the separate backup account for the most critical resources.
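As a sketch of what those three things look like as data, the dictionaries below follow the request shapes of the AWS Backup `CreateBackupPlan` and `CreateBackupSelection` APIs. Vault names, ARNs, the account ID, the IAM role, and the cron time are placeholders, and only the daily rule is shown. Because a resource selection attaches to a plan as a whole, the hourly checkout schedule would need its own rule and selection arrangement, which is left out here, as is the archive tier.

```python
DR_VAULT_ARN = (  # placeholder vault in the DR Region
    "arn:aws:backup:ap-southeast-4:111122223333:backup-vault:dr-vault")
BACKUP_ACCOUNT_VAULT_ARN = (  # placeholder vault in the backup account
    "arn:aws:backup:ap-southeast-2:444455556666:backup-vault:isolated-vault")
RETENTION_DAYS = 35  # from the text


def copy_action(vault_arn: str, delete_after_days: int) -> dict:
    """One CopyActions entry: copy each recovery point to another vault."""
    return {
        "DestinationBackupVaultArn": vault_arn,
        "Lifecycle": {"DeleteAfterDays": delete_after_days},
    }


backup_plan = {
    "BackupPlanName": "org-daily",  # hypothetical name
    "Rules": [{
        "RuleName": "daily",
        "TargetBackupVaultName": "primary-vault",   # hypothetical vault
        "ScheduleExpression": "cron(0 17 * * ? *)",  # daily; time is arbitrary
        "Lifecycle": {"DeleteAfterDays": RETENTION_DAYS},
        "CopyActions": [
            copy_action(DR_VAULT_ARN, RETENTION_DAYS),
            copy_action(BACKUP_ACCOUNT_VAULT_ARN, RETENTION_DAYS),
        ],
    }],
}

backup_selection = {
    "SelectionName": "tagged-backup-yes",
    "IamRoleArn": "arn:aws:iam::111122223333:role/backup-service-role",  # placeholder
    "ListOfTags": [{
        "ConditionType": "STRINGEQUALS",
        "ConditionKey": "Backup",
        "ConditionValue": "yes",
    }],
}

if __name__ == "__main__":
    rule = backup_plan["Rules"][0]
    print(rule["ScheduleExpression"], len(rule["CopyActions"]), "copy targets")
```

With boto3, these would be passed as `create_backup_plan(BackupPlan=backup_plan)` and `create_backup_selection(BackupPlanId=..., BackupSelection=backup_selection)` on a `backup` client.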
An SCP on the org denies backup:DeleteRecoveryPoint from member accounts; only the backup-admin role in the dedicated account can delete backups, and that role is locked down with a manual break-glass workflow.
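A minimal sketch of such an SCP, built as a Python dictionary and printed as JSON. The account ID is a placeholder, and exempting the backup-admin role through an `aws:PrincipalArn` condition is one assumed way to express "only that role may delete"; the actual policy may instead rely on which OUs the SCP is attached to.

```python
import json

BACKUP_ACCOUNT_ID = "111122223333"  # placeholder for the backup account
BACKUP_ADMIN_ROLE = "backup-admin"  # role name taken from the text


def deny_recovery_point_deletion(backup_account_id: str, role_name: str) -> dict:
    """SCP document: deny deleting recovery points except for one role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyRecoveryPointDeletion",
            "Effect": "Deny",
            "Action": ["backup:DeleteRecoveryPoint"],
            "Resource": "*",
            "Condition": {
                # The deny applies to every principal whose ARN is NOT
                # the break-glass role in the backup account.
                "ArnNotLike": {
                    "aws:PrincipalArn":
                        f"arn:aws:iam::{backup_account_id}:role/{role_name}"
                }
            },
        }],
    }


if __name__ == "__main__":
    policy = deny_recovery_point_deletion(BACKUP_ACCOUNT_ID, BACKUP_ADMIN_ROLE)
    print(json.dumps(policy, indent=2))
```

An SCP only removes permissions; the backup-admin role still needs its own IAM policy granting the delete, which is where the break-glass workflow applies.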
A worked recovery: portal pilot light
10:22: ap-southeast-2 starts returning 500s for the portal. 10:24: incident declared. 10:25: operator runs the Terraform failover runbook against ap-southeast-4.
```
$ terraform -chdir=environments/portal-dr apply -auto-approve
aws_rds_cluster.portal_primary: Promoting replica...
aws_rds_cluster.portal_primary: Still promoting... (2m elapsed)
aws_rds_cluster.portal_primary: Promoted
aws_ecs_service.portal: Updating desired count from 0 to 8
aws_ecs_service.portal: Waiting for steady state...
aws_ecs_service.portal: Steady state reached (6m elapsed)
aws_route53_record.portal_dns: Updating alias to Melbourne ALB
Apply complete.
```
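Total elapsed: 14 minutes for the Terraform run + 60 seconds of DNS TTL = 15 minutes from the start of the runbook. Counting the 3 minutes between the first errors at 10:22 and the runbook starting at 10:25, customers see errors for about 18 minutes; then the portal works. Database lag at the moment of failure was 3 seconds, so RPO was 3 seconds. Within target.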
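The timeline arithmetic is worth writing out, because the number depends on where the clock starts. This sketch uses only the timestamps and durations quoted above; the variable names are illustrative.

```python
from datetime import datetime, timedelta


def at(hhmm: str) -> datetime:
    """Parse a wall-clock time; the date part is irrelevant here."""
    return datetime.strptime(hhmm, "%H:%M")


first_errors = at("10:22")    # portal starts returning 500s
runbook_start = at("10:25")   # operator starts the Terraform run
terraform_run = timedelta(minutes=14)
dns_ttl = timedelta(seconds=60)

recovered = runbook_start + terraform_run + dns_ttl

runbook_rto = recovered - runbook_start     # what the runbook itself took
customer_outage = recovered - first_errors  # what customers experienced

if __name__ == "__main__":
    print("recovered at", recovered.strftime("%H:%M"))
    print("runbook RTO (min):", runbook_rto.total_seconds() / 60)
    print("customer-visible outage (min):", customer_outage.total_seconds() / 60)
```

Measured from the start of the runbook the recovery takes 15 minutes; measured from the first errors it is 18, since detection and declaration consumed 3 minutes. Both sit inside the 20-40 minute target, but the second number is the one customers experience.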
The quarterly drill the month before caught that the Secrets Manager secret for the portal’s database password hadn’t been replicated to ap-southeast-4, so the Fargate tasks crashed on startup. Fixed in the drill; didn’t surface during the real incident. That’s the point of the drill.
What’s worth remembering
- DR strategy is per-service, not per-organisation. A $5m/hour service wants warm standby or active-active; a read-only dashboard wants backup and restore. Mixing strategies is the normal answer.
- RTO and RPO are budgets, not targets. Set them based on what the business can afford to lose, not what the team wants to promise. The regulator asks for numbers the business signed off on.
- Backup and restore only works if tested. An untested runbook in Confluence is not a DR plan. Monthly rehearsals, scripted and pass-fail: that’s a DR plan.
- Pilot light is the sweet spot for most services. Real RTO improvement over backup-and-restore, small cost delta (mostly the replica database), operationally tractable. Most services land here once the numbers are honest.
- Warm standby earns its keep by being exercised daily. If the DR capacity also serves production read traffic, it’s not sitting idle, drift doesn’t accumulate, and DR tests are traffic-shifting drills rather than stand-up-the-environment exercises.
- AWS Backup is the boring-but-critical chassis. Org-wide plans, cross-Region copies, cross-account copies for the ransomware case, tag-based selection. All four strategies lean on it.
- Cross-account backup vaults are the ransomware answer. If the primary account is compromised, the attacker can delete backups in the same account. A separate account with a separate IAM boundary and SCP-enforced delete protection is the insurance policy.
- The hardest part of DR is not the technology. It’s tested runbooks, the behaviour of the team at 03:00, and the confidence that what happened in the drill is what will happen in the real event.
Four shapes, one spectrum. Pick the shape for each service that matches the RTO the business can afford, and test it at whatever cadence the shape deserves.