Recovery Objectives Made Measurable

September 06, 2028 · 14 min read

DevOps Engineer Pro · DOP-C02 · part of The Exam Room

The situation

A platform team maintains ~20 production applications:

  • Tier 1 (4 apps): RTO 1 hour, RPO 15 minutes. Payments, identity, ordering, core API.
  • Tier 2 (10 apps): RTO 4 hours, RPO 1 hour. Most customer-facing services.
  • Tier 3 (6 apps): RTO 24 hours, RPO 24 hours. Internal tools, reporting, batch.

Existing resilience posture:

  • Cross-AZ deployments across three AZs in eu-west-1.
  • Backups via AWS Backup on per-tier schedules (hourly, with continuous backup, for Tier 1 RDS; daily for Tier 2 and 3).
  • One DR test per year per application, manually run.
  • Design docs claim the tier-specific RTO/RPO; actual measurement happens during the yearly DR test and rarely gets documented.

Last year’s incident: a Tier 2 application with a stated 4-hour RTO took 11 hours to recover. The post-mortem found a DNS failover not automated, a secondary-Region IAM role missing a permission, and a runbook that referenced a CloudFormation parameter nobody had updated when the template was reorganised. All small things; all invisible until tested.

The asks:

  • A machine-checked RTO/RPO estimate per application, based on the actual architecture.
  • Gap analysis. Where does the architecture fall short of the stated target?
  • Remediation recommendations prioritised by impact.
  • Continuous assessment. If someone changes a security group or deletes a backup, the score should update.
  • Integration with Audit Manager. Resilience evidence flows into the compliance report.

What actually matters

Before reaching for a tool, it’s worth being clear about what an “RTO of four hours” actually has to mean for it to survive an audit and the next real incident.

The first thing worth thinking about is what counts as a disruption. Recovery time depends entirely on what failed. Losing one Availability Zone, losing a whole Region, losing the application (bad deploy, data corruption), and losing a specific underlying service all have different recovery shapes, and the same architecture can meet target on three of them and miss badly on the fourth. A useful measurement has to be per-disruption-type, not a single number; otherwise the design doc is just averaging the easy cases over the hard ones.
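A small Python sketch makes the averaging problem concrete. The recovery estimates below are invented for illustration (they are not measurements from the estate), and the names `TARGET_RTO_MIN`, `estimated_rto_min` and `misses` are hypothetical.

```python
# Illustrative only: invented RTO estimates (minutes) for one Tier 2 app.
TARGET_RTO_MIN = 240  # Tier 2 target: 4 hours

estimated_rto_min = {
    "AZ": 5,
    "Infrastructure": 20,
    "Application": 90,
    "Region": 660,  # e.g. manual DNS failover plus a restore
}


def misses(estimates, target):
    """Return the disruption types whose estimate exceeds the target."""
    return sorted(k for k, v in estimates.items() if v > target)


# A single averaged number hides the one disruption type that misses.
average = sum(estimated_rto_min.values()) / len(estimated_rto_min)

print(f"average RTO: {average:.0f} min (target {TARGET_RTO_MIN})")
print("misses:", misses(estimated_rto_min, TARGET_RTO_MIN))
```

With these numbers the average sits under the four-hour target while the Region case alone misses it, which is exactly the failure a single headline figure conceals.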

The second is what gets measured and what gets assumed. A measurement that comes from inspecting infrastructure-as-code can see cross-AZ spread, Multi-AZ databases, backup cadence, replication, routing policies. It cannot see whether the application reconnects cleanly to a failed-over database, whether a Lambda has a retry that produces a thundering herd, or whether a runbook step needs forty-five minutes of human time. The honest framing is that an architecture-level model gives a floor on recovery (a number the infrastructure could hit if nothing else got in the way), not a measured truth. The gap between the floor and the real number is where production behaviour lives.

The third is what the architecture-level model knows about each service. AWS publishes documented recovery characteristics for the services that have them (an RDS Multi-AZ failover that typically completes in 60 to 120 seconds, the 15-minute replication SLA of S3 Replication Time Control, DynamoDB Global Tables as multi-Region by construction). A model that uses those documented characteristics is repeatable and defensible, but optimistic: real failovers can be slower under load, hot buffer pools take time to warm, connection storms add latency. The output is “best-case if AWS’s documented numbers hold,” not “what happened last Tuesday.”

The fourth is coupling to infrastructure-as-code. The estate is mostly CloudFormation, with a handful of Resource-Groups-defined applications. Whatever measurement is chosen needs to read those definitions directly, not require a separate inventory; otherwise the score lags reality every time someone changes a stack. The same coupling produces the trigger: a stack update is the natural moment to re-evaluate, because that’s when the architecture changes.

The fifth is continuity of measurement. A one-off score on the day someone runs it isn’t useful three months later. The measurement has to refresh on change (so a regression caused by a stack update surfaces immediately), on schedule (so silent drift, e.g. a Secrets Manager replica being deleted manually, still surfaces), and emit an event when the score drops below policy, so the team finds out before the auditor does.

The sixth is what “remediation” looks like. A score by itself is a complaint. To be actionable, the measurement has to identify which components fall short and which change would close the gap, and rank those changes by how much they’d raise the score. “Add a cross-Region backup copy of bucket X” is useful; “your Region score is low” is not.
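As a rough illustration of ranking by impact, the sketch below sorts a remediation backlog by how many points each change would add. The `Remediation` type, the component names and the score gains are all hypothetical; a real tool would supply its own estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remediation:
    component: str
    change: str
    score_gain: int  # hypothetical: points the score would rise


def prioritise(items):
    """Highest score gain first; ties broken by component name."""
    return sorted(items, key=lambda r: (-r.score_gain, r.component))


backlog = [
    Remediation("orders-table", "enable point-in-time recovery", 6),
    Remediation("bucket-x", "add a cross-Region backup copy", 11),
    Remediation("identity-db", "enable Multi-AZ", 9),
]

for r in prioritise(backlog):
    print(f"+{r.score_gain:>2}  {r.component}: {r.change}")
```

Each row names a component and a concrete change, which is what separates an actionable item from a bare complaint about a low score.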

And finally, how the score becomes evidence. The auditor cares about controls, not consoles. Resilience evidence has to flow into the same compliance pipeline as everything else (controls, monthly reports, attached snapshots) so the resilience number is a control output the auditor can attest to without re-running anything by hand. An architecture-level model gives a baseline; an actual injected-fault test gives a measured number. The two are complementary: the model estimates continuously and cheaply; injected-fault testing validates expensively and rarely.

What we’ll filter on

Each option is checked against five criteria:

  1. Scores against a stated policy (RTO/RPO target).
  2. Models multiple disruption types (AZ, Region, infra, application).
  3. Remediation suggestions prioritised by impact on the score.
  4. Continuous assessment rather than one-off.
  5. Integrates with existing tooling (CloudFormation, Terraform, Audit Manager).

The resilience-measurement landscape

1. Design-doc review and annual DR test. Status quo. Manual, retrospective, rarely produces a measured number.

2. AWS Well-Architected Tool review. The Reliability pillar asks the right questions and produces a reviewer-graded posture. It is not a quantitative RTO/RPO score, and there is no ongoing measurement.

3. AWS Resilience Hub with a resilience policy per tier. Quantitative, policy-keyed, per-application. Assessments run on demand or on CFN stack update.

4. Resilience Hub + FIS experiments. Hub estimates recovery; FIS actually tests it. Closes the loop between modelled and measured.

5. Third-party resilience tools (Gremlin, Chaos Mesh). Focused on chaos engineering and fault injection. Complement rather than replace Hub.

Side by side

Option Scores RTO/RPO Multiple disruption types Prioritised remediation Continuous Integrates
Design-doc review Partial Manual
Well-Architected Tool Manual Partial
Resilience Hub
Hub + FIS ✓ + measured ✓ + validated
Third-party chaos tools Validated only Partial

Resilience Hub plus FIS is the durable pairing. Hub scores the architecture; FIS validates specific scenarios.

Hub reading the architecture

Inputs:

  • CloudFormation stack: app source (or Terraform state).
  • Resource Group: tag-based app grouping for non-CFN workloads.
  • Resilience policy: RTO / RPO per disruption type, tier-specific (T1/T2/T3).
  • Tier assignment: app tag ResilienceTier=T1.
  • Triggers: stack update, schedule, manual.

Resilience Hub:

  • Assessment: per-component RTO/RPO vs policy, across 4 disruption types (AZ, Region, Infra, App).
  • AZ disruption model: ASG cross-AZ, RDS Multi-AZ, EFS targets, DynamoDB (always), ElastiCache, EKS.
  • Region disruption model: cross-region backup, Aurora Global, S3 CRR, DNS failover, pilot-light, warm-standby.
  • Application disruption model: backup coverage, snapshot frequency, versioning, point-in-time restore.

FIS validation loop:

  • FIS experiments: actually inject (stop AZ, terminate tasks), measure real recovery → compare to estimate.

Outputs:

  • Resilience score: 0-100 per app; policy compliance yes/no.
  • Recommendations: prioritised by score impact (enable Multi-AZ, add CRR, etc.).
  • Runbook suggestions: SSM Automation templates for common failover steps.
  • Alarms & events: score-change EventBridge events; notify on regression.
  • Audit Manager evidence: assessments as snapshots attached to controls.

Hub estimates; FIS validates. Inputs come from CloudFormation or Resource Groups; outputs feed Audit Manager and EventBridge.

The pick in depth

One resilience policy per tier.

T1-policy: RTO 1h, RPO 15m per disruption type
T2-policy: RTO 4h, RPO 1h per disruption type
T3-policy: RTO 24h, RPO 24h per disruption type

Policies can differ per disruption type (e.g. T2’s Region RTO might be 8 hours while its AZ RTO is 30 minutes, reflecting the reality that Regional events are rare and the DR budget is limited). Setting every disruption type to the same value simplifies the initial rollout.
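The tier policies can be expressed as request payloads. The sketch below builds keyword arguments in the shape I understand the Resilience Hub CreateResiliencyPolicy request to take: a `policy` map keyed by `AZ`, `Hardware`, `Software` and `Region`, with `rtoInSecs` and `rpoInSecs` in each. The `TIER_LABEL` mapping from T1/T2/T3 to the API's `tier` values is my own assumption, as is the helper name `policy_request`.

```python
HOUR = 3600
MINUTE = 60

# Tier targets from the article: (RTO seconds, RPO seconds).
TIER_TARGETS = {
    "T1": (1 * HOUR, 15 * MINUTE),
    "T2": (4 * HOUR, 1 * HOUR),
    "T3": (24 * HOUR, 24 * HOUR),
}

# Assumed mapping onto the API's tier labels; adjust to taste.
TIER_LABEL = {"T1": "MissionCritical", "T2": "Critical", "T3": "NonCritical"}

# "Software" is the application disruption type, "Hardware" infrastructure.
DISRUPTION_KEYS = ("AZ", "Hardware", "Software", "Region")


def policy_request(tier, overrides=None):
    """Build keyword arguments for a per-tier resilience policy."""
    rto, rpo = TIER_TARGETS[tier]
    policy = {k: {"rtoInSecs": rto, "rpoInSecs": rpo} for k in DISRUPTION_KEYS}
    # Per-disruption overrides are given as (RTO seconds, RPO seconds).
    for key, (o_rto, o_rpo) in (overrides or {}).items():
        policy[key] = {"rtoInSecs": o_rto, "rpoInSecs": o_rpo}
    return {"policyName": f"{tier}-policy", "tier": TIER_LABEL[tier], "policy": policy}


# T2 with a looser Region RTO (8 h) and a tighter AZ RTO (30 min).
t2 = policy_request(
    "T2",
    {"Region": (8 * HOUR, 1 * HOUR), "AZ": (30 * MINUTE, 1 * HOUR)},
)
print(t2["policy"]["Region"], t2["policy"]["AZ"])
```

With boto3 installed and credentials configured, the dictionary would be passed as `boto3.client("resiliencehub").create_resiliency_policy(**t2)`; that call is left out so the sketch runs offline.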

Applications defined from CloudFormation stacks. Each of the 20 apps corresponds to a CloudFormation stack (or a group of stacks). Resilience Hub imports the stack’s resources, identifies the resource type of each (RDS, ASG, EFS, ALB, DynamoDB, etc.), and runs the assessment. For apps that aren’t stack-managed, a Resource Group that selects on the tag AppID=payments achieves the same effect.
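For the non-stack case, a tag-based Resource Group is defined by a tag query. The sketch below builds the arguments for such a group using the `TAG_FILTERS_1_0` query type; the group name `app-payments` and the helper `tag_resource_group` are hypothetical.

```python
import json


def tag_resource_group(app_id):
    """Keyword arguments for a tag-based Resource Group covering one app."""
    query = {
        "ResourceTypeFilters": ["AWS::AllSupported"],
        "TagFilters": [{"Key": "AppID", "Values": [app_id]}],
    }
    return {
        "Name": f"app-{app_id}",
        # The Resource Groups API expects the query as a JSON string.
        "ResourceQuery": {"Type": "TAG_FILTERS_1_0", "Query": json.dumps(query)},
    }


group = tag_resource_group("payments")
print(group["ResourceQuery"]["Query"])
```

With boto3 this would be passed to `boto3.client("resource-groups").create_group(**group)`; every resource carrying `AppID=payments` then falls into the group without a separate inventory.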

Continuous assessment. An EventBridge rule matches CloudFormation Stack Status Change events (status UPDATE_COMPLETE) and invokes Resilience Hub’s StartAppAssessment API. The assessment runs in a few minutes and produces a new score; if the score drops below the policy threshold, a second EventBridge rule on the Resilience Hub AppAssessmentStatusChanged event fires a Slack notification.
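A minimal sketch of the stack-update trigger, assuming a Lambda function sits between the EventBridge rule and Resilience Hub. The event pattern uses the CloudFormation stack status change event; the `APP_ARN_BY_STACK` lookup, the placeholder ARN and the function names are hypothetical, and the actual `start_app_assessment` call is omitted so the code runs without AWS access.

```python
import re
from datetime import datetime, timezone

# Event pattern for the EventBridge rule: successful stack updates only.
STACK_UPDATE_PATTERN = {
    "source": ["aws.cloudformation"],
    "detail-type": ["CloudFormation Stack Status Change"],
    "detail": {"status-details": {"status": ["UPDATE_COMPLETE"]}},
}

# Hypothetical lookup from stack name to Resilience Hub application ARN.
APP_ARN_BY_STACK = {
    "payments-prod": "arn:aws:resiliencehub:eu-west-1:111122223333:app/EXAMPLE",
}


def stack_name(stack_id):
    """Extract the stack name from a CloudFormation stack ARN."""
    return stack_id.split(":stack/", 1)[1].split("/", 1)[0]


def assessment_request(event, now=None):
    """Map a stack-status event to StartAppAssessment arguments, or None."""
    detail = event.get("detail", {})
    if detail.get("status-details", {}).get("status") != "UPDATE_COMPLETE":
        return None
    app_arn = APP_ARN_BY_STACK.get(stack_name(detail["stack-id"]))
    if app_arn is None:
        return None  # stack is not mapped to an application
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    # Keep the name to letters, digits, hyphens and underscores.
    name = re.sub(r"[^A-Za-z0-9_-]", "-", f"stack-update-{stamp}")
    return {"appArn": app_arn, "appVersion": "release", "assessmentName": name}
```

A Lambda handler would pass a non-None result to `boto3.client("resiliencehub").start_app_assessment(**request)`. If the update added or removed resources, the application probably needs its resources re-imported and a new version published first; that step is left out here.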

The initial run on the 20 apps produces a table of scores:

| App               | Tier | Score | AZ | Region | App | Infra |
|-------------------|------|-------|----|----|-----|-------|
| payments          | T1   | 92    | ✓  | ✓  | ✓   | ✓     |
| identity          | T1   | 78    | ✓  | ✗  | ✓   | ✓     |
| ordering          | T1   | 85    | ✓  | ✓  | ✗   | ✓     |
| core-api          | T1   | 88    | ✓  | ✗  | ✓   | ✓     |
| reporting         | T3   | 95    | ✓  | ✓  | ✓   | ✓     |
| ...

Remediations from the assessment tell the story: identity’s Region score is low because the Secrets Manager secrets aren’t replicated to the standby Region; adding a replica with ReplicateSecretToRegions (aws secretsmanager replicate-secret-to-regions --add-replica-regions) recovers the score. ordering’s App score is low because the DynamoDB table has point-in-time recovery disabled; enabling PITR is a one-click remediation.
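Both remediations come down to one API call each. For a secret that already exists, my understanding is that the replica is added with Secrets Manager's ReplicateSecretToRegions, and PITR is switched on with DynamoDB's UpdateContinuousBackups. The sketch below only builds the request arguments; the secret name, table name and standby Region are hypothetical.

```python
def replicate_secret_request(secret_id, standby_region):
    """Arguments for Secrets Manager ReplicateSecretToRegions."""
    return {
        "SecretId": secret_id,
        "AddReplicaRegions": [{"Region": standby_region}],
    }


def enable_pitr_request(table_name):
    """Arguments for DynamoDB UpdateContinuousBackups (turn PITR on)."""
    return {
        "TableName": table_name,
        "PointInTimeRecoverySpecification": {"PointInTimeRecoveryEnabled": True},
    }


# Hypothetical resource names and standby Region.
secret_req = replicate_secret_request("identity/db-credentials", "eu-central-1")
pitr_req = enable_pitr_request("ordering-orders")
print(secret_req)
print(pitr_req)
```

With boto3 these would be passed to `boto3.client("secretsmanager").replicate_secret_to_regions(**secret_req)` and `boto3.client("dynamodb").update_continuous_backups(**pitr_req)`; re-running the assessment afterwards should show whether the score moved.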

Integrating with Audit Manager. Resilience Hub assessments produce evidence that Audit Manager can reference. A custom Audit Manager control “Every Tier 1 application has a resilience score of ≥ 80” has a data source that queries Resilience Hub’s API for the latest assessment. The monthly Audit Manager report includes per-application scores as evidence for the control.
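The control logic itself is simple enough to sketch. The code below evaluates the "score of at least 80" control over a list of latest assessment results, using the scores from the table earlier. The record shape and the `control_evidence` helper are hypothetical; this is not the Audit Manager evidence format, only the decision it would carry.

```python
THRESHOLD = 80


def control_evidence(assessments, tier="T1", threshold=THRESHOLD):
    """Evaluate 'every app in the tier scores >= threshold'."""
    in_scope = [a for a in assessments if a["tier"] == tier]
    failing = sorted(a["app"] for a in in_scope if a["score"] < threshold)
    return {
        "control": f"Every {tier} application has a resilience score of >= {threshold}",
        "compliant": not failing,
        "failing_apps": failing,
        "scores": {a["app"]: a["score"] for a in in_scope},
    }


# Scores taken from the initial-run table.
latest = [
    {"app": "payments", "tier": "T1", "score": 92},
    {"app": "identity", "tier": "T1", "score": 78},
    {"app": "ordering", "tier": "T1", "score": 85},
    {"app": "core-api", "tier": "T1", "score": 88},
    {"app": "reporting", "tier": "T3", "score": 95},
]

evidence = control_evidence(latest)
print(evidence["compliant"], evidence["failing_apps"])
```

On the initial run the control fails because identity sits at 78; once its Region gap is remediated and the score rises above 80, the same evaluation passes with no manual review.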

FIS validation. For each Tier 1 app, an FIS experiment template simulates an AZ stop (aws:ec2:stop-instances with a target selection of “instances tagged AppID=payments in AZ eu-west-1a”). The experiment runs quarterly in staging; measured RTO is compared to Resilience Hub’s estimated RTO. Material divergence (estimated 3 minutes, measured 25 minutes) surfaces the application-level blockers Resilience Hub can’t see.
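The experiment template described above might look like the following, written as a Python dictionary in the shape of an FIS CreateExperimentTemplate request. The target and action names, the role ARN and the ten-minute restart delay are assumptions for illustration, and the sketch uses no stop condition, which a real template would normally replace with a CloudWatch alarm.

```python
def az_stop_template(app_id, az, role_arn, restart_after="PT10M"):
    """Build an FIS template that stops one app's instances in one AZ."""
    return {
        "description": f"Stop {app_id} instances in {az}",
        "roleArn": role_arn,
        "targets": {
            "appInstancesInAz": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"AppID": app_id},
                "filters": [
                    {"path": "Placement.AvailabilityZone", "values": [az]}
                ],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                # Instances are started again after this ISO 8601 duration.
                "parameters": {"startInstancesAfterDuration": restart_after},
                "targets": {"Instances": "appInstancesInAz"},
            }
        },
        # No automatic stop condition in this sketch.
        "stopConditions": [{"source": "none"}],
    }


# Hypothetical role ARN; the role must allow FIS to stop and start instances.
template = az_stop_template(
    "payments",
    "eu-west-1a",
    "arn:aws:iam::111122223333:role/fis-staging-experiments",
)
print(template["actions"]["stopInstances"]["actionId"])
```

Stopping instances in one AZ approximates an AZ loss for the compute layer only; managed services such as RDS need their own failover actions to be exercised.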

What Hub can and can’t score

Can:

  • Backup coverage and RPO contribution from backup schedules.
  • Multi-AZ configuration for most AWS services.
  • Cross-Region replication for backup and eligible databases.
  • Auto-scaling group spread and minimum capacity.
  • Route 53 health-check configuration and failover routing policies.
  • DynamoDB Global Tables and Aurora Global Database.

Can’t:

  • Application code quality (does the app reconnect after failover?).
  • Runbook execution time (if a human has to click a button, Hub doesn’t add the wait time).
  • Third-party dependencies (an external API that’s down at the same time).
  • Data quality after restore (RDS PITR recovers to a timestamp; whether that data is “correct” is application-level).
  • Chaos at scale (a Region-wide event might affect more than Hub models).

The gap between “Hub estimates X” and “FIS measures Y” is where the real work lives. Hub gives the baseline to aim at; FIS reveals whether the application actually hits it.
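What counts as "material" divergence has to be decided by the team. The sketch below uses an arbitrary rule (measured RTO more than twice the estimate and more than five minutes over it); both thresholds and the function name are assumptions, not anything Hub or FIS defines.

```python
def divergence(estimated_min, measured_min, factor=2.0, slack_min=5.0):
    """Flag a measured RTO that materially exceeds the estimate.

    'Material' is an arbitrary rule here: more than `factor` times the
    estimate and more than `slack_min` minutes over it.
    """
    over = measured_min - estimated_min
    material = measured_min > factor * estimated_min and over > slack_min
    return {"over_by_min": over, "material": material}


# Estimated 3 minutes, measured 25: the FIS example from earlier.
print(divergence(3, 25))
# Estimated 3 minutes, measured 4: within tolerance.
print(divergence(3, 4))
```

The absolute slack keeps very small estimates from raising false alarms when a measurement lands a minute or two over.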

What’s worth remembering

  1. Resilience Hub scores applications against a resilience policy. Policy declares RTO/RPO per disruption type; assessments produce a score (0-100) and per-component estimated recovery.
  2. Four disruption types modelled. AZ, Region, Infrastructure, Application. Each gets its own recovery estimate; an application is compliant if all four fall within policy.
  3. Applications defined from CloudFormation, Terraform, or Resource Groups. Stack-managed apps work directly; non-stack apps use tag-based Resource Groups.
  4. Recommendations are prioritised by score impact. “Enable Multi-AZ on this RDS” appears before “add PITR” if the former raises the score more. The platform team picks off the highest-impact items first.
  5. Hub models documented service characteristics, not measured behaviour. An RDS failover scored at 60 seconds is the documented fast path; real latency may differ. Hub provides an optimistic baseline.
  6. FIS is the validation loop. Hub estimates; FIS actually injects the fault and measures. Material divergence highlights application-level issues Hub can’t see.
  7. Continuous assessment via CloudFormation triggers. EventBridge rule on stack update invokes Hub; score regressions surface immediately rather than at next audit.
  8. Audit Manager integrates via API. Hub’s latest assessment becomes evidence for a resilience control in the monthly compliance report; the auditor sees machine-generated scores instead of review notes.

The eleven-hour incident from last year would have been far less likely with Hub running continuously and FIS validating it: the unautomated DNS failover would have shown up as a reduced Region score, and a rehearsed failover is what exposes a missing IAM permission or a stale runbook parameter, so the team could have remediated before the DR moment. The stated RTO/RPO of every application becomes a measurable number instead of a design-doc aspiration; the compliance report carries a score, not a claim.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.