The situation
One AWS Organization with five linked accounts: core-prod, data-prod, dev, sandbox, and shared-services. Across those accounts live:
- RDS MySQL and PostgreSQL, roughly 20 databases, total 12 TB, daily automated snapshots retained for 7 days. No cross-Region copy today.
- Aurora clusters, 6 clusters, 3 TB total, automated backups retained 14 days.
- EBS volumes, roughly 400 volumes attached to 300 EC2 instances, total 40 TB. Some have DLM (Data Lifecycle Manager) policies, some don’t.
- DynamoDB tables, 50 tables, PITR enabled on about half, daily exports to S3 on the data-prod account only.
- EFS filesystems, 4 filesystems, homegrown Lambda function that runs `aws efs create-backup` nightly.
- FSx for Windows, 2 filesystems, automatic backups retained 30 days, cross-Region copy not configured.
- S3 buckets, roughly 400 buckets, versioning on some, replication on about a dozen critical ones.
The audit ask is simple: demonstrate recoverability within 24 hours for any production resource, including the case where eu-west-1 is entirely unavailable. The implicit ask is: one place to look, one policy per resource tier, one answer to “what’s our RPO and RTO?”
What actually matters
Before reaching for any specific service it’s worth naming what “centralised backup” should do.
A backup policy has three parts: what to back up, how often and how long to keep it, and where the copy lives. Today, each of those is decided per-service, per-account, per-team. Centralisation means lifting those decisions into a small number of policies that apply broadly across resource types.
The what is driven by resource tagging. A tag on a resource is what a centralised plan can hang off; the tag survives resource replacement, so adding a new database with the correct tag picks up the backup automatically. Anything that selects by name or by ARN goes stale the moment infrastructure changes.
The how often and how long is a backup plan: a schedule plus a retention plus any lifecycle transitions (warm storage for a while, then cold). Different resource tiers want different plans: production with longer retention and cross-Region copy; dev with shorter retention and no cross-Region; compliance-tagged resources with retention long enough to match regulatory requirements.
The where is the cross-Region piece. The mechanism needs to support copying recovery points to another Region (and ideally another account) as part of the plan, not as a separate pipeline of Lambdas and cron. Anything that asks the team to build the replication step themselves multiplies the surfaces that can quietly stop working.
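The three parts of a policy can be written down as plain data before any service is chosen. The sketch below is illustrative only: `BackupTier`, `TIERS`, and `tier_for` are hypothetical names, and the 7-day dev retention is an assumed value. The one rule borrowed from AWS Backup is that a recovery point moved to cold storage has to stay there for at least 90 days, so the delete-after value must be at least 90 days larger than the move-to-cold value.

```python
from dataclasses import dataclass
from typing import Optional

COLD_STORAGE_MIN_DAYS = 90  # AWS Backup: minimum stay in cold storage


@dataclass(frozen=True)
class BackupTier:
    """One policy: what (tag value), how often / how long, and where."""
    tag_value: str                          # the "what": value of the Backup tag
    schedule: str                           # the "how often": cron expression
    delete_after_days: int                  # the "how long"
    cold_after_days: Optional[int] = None   # optional lifecycle transition
    copy_to_region: Optional[str] = None    # the "where"; None = no cross-Region copy

    def validate(self) -> None:
        """Reject lifecycles that delete before the cold-storage minimum."""
        if self.cold_after_days is None:
            return
        if self.delete_after_days < self.cold_after_days + COLD_STORAGE_MIN_DAYS:
            raise ValueError(
                f"{self.tag_value}: delete_after_days must be at least "
                f"{self.cold_after_days + COLD_STORAGE_MIN_DAYS}"
            )


# Hypothetical tiers; the dev retention is an assumption for illustration.
TIERS = {
    "production": BackupTier("production", "cron(0 3 ? * * *)", 35,
                             copy_to_region="us-east-1"),
    "dev": BackupTier("dev", "cron(0 3 ? * * *)", 7),
}

for _tier in TIERS.values():
    _tier.validate()


def tier_for(tags: dict) -> Optional[BackupTier]:
    """Resolve a resource's tags to a tier; untagged resources get None."""
    return TIERS.get(tags.get("Backup", ""))
```

A resource with no `Backup` tag resolves to `None`, which is the case worth reporting on: it is covered by no policy at all.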
Second: coverage. Whatever mechanism gets picked has to span the bulk of stateful services in use. Anything left uncovered is a separate backup story with its own schedule, retention, and restore path, which is what the team is trying to escape. For any service the chosen mechanism doesn’t cover, the fallback is the service’s native backup plus a deliberate cross-Region copy.
Third: organisation-wide enforcement. “Every account should back up production” turns into a hard rule when it cascades from the Organization or an OU down to member accounts, with new accounts inheriting the backup posture automatically. Without that, the policy is a guideline that holds until someone forgets.
Fourth: immutability and legal hold. Backups that can be deleted by an attacker who compromised an IAM credential aren’t useful as ransomware defence. The mechanism needs a WORM (write once, read many) property on the destination: a mode where even the root account can’t delete recovery points before their retention expires, plus a way to hold specific recovery points indefinitely for legal reasons. That’s the property audit departments care about.
Fifth: cross-Region copy costs. Cross-Region data transfer is metered per-GB; for a multi-TB footprint backed up daily and cross-Region-copied, the bill is dominated by the daily delta moved plus the storage in the destination Region. The plan has to account for both.
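A rough way to reason about those two terms is a steady-state estimate. The sketch below is a simplification under stated assumptions: copies are incremental after an initial full copy, the daily delta is constant, and a month is 30 days. The function name and every rate passed to it are placeholders, not AWS prices; real figures come from the current pricing pages for the Regions involved.

```python
def monthly_copy_cost(
    daily_delta_gb: float,
    retention_days: int,
    transfer_usd_per_gb: float,
    storage_usd_per_gb_month: float,
    full_copy_gb: float = 0.0,
    days_in_month: int = 30,
) -> dict:
    """Estimate the steady-state monthly cost of a cross-Region copy.

    transfer: every day's delta crosses the Region boundary once.
    storage:  the destination holds the initial full copy plus one
              delta for each day still inside the retention window.
    """
    transfer = daily_delta_gb * days_in_month * transfer_usd_per_gb
    stored_gb = full_copy_gb + daily_delta_gb * retention_days
    storage = stored_gb * storage_usd_per_gb_month
    return {
        "transfer_usd": transfer,
        "storage_usd": storage,
        "total_usd": transfer + storage,
    }


# Placeholder inputs: 100 GB/day delta, 35-day retention, 1 TB baseline,
# and made-up rates of 0.02 and 0.05 USD per GB.
estimate = monthly_copy_cost(100, 35, 0.02, 0.05, full_copy_gb=1000)
```

With a long retention window the storage term grows with the window while the transfer term does not, which is why the two need to be budgeted separately.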
Sixth: what about S3? S3 has its own mature durability story: versioning, replication to another Region, and an object-lock property that gives the immutability the rest of the portfolio needs from its backup mechanism. Whether S3 sits inside the unified backup plan or alongside it is a trade-off between one console for everything and the simpler, cheaper path that’s been working for S3 for years.
What we’ll filter on
- Service coverage: does the backup mechanism support this resource type natively?
- Scheduling and retention: how granular is the plan? Are there lifecycle tiers?
- Cross-Region and cross-account copy: is it a single toggle or a pipeline to build?
- Org-wide enforcement: can policy cascade from the Org to member accounts?
- Immutability: can a bad actor delete the backups?
- Restore experience: how painful is the restore, and can it be tested?
The backup landscape
- AWS Backup. Managed, centralised backup service. Plans, vaults, selections, org-wide policies. Supports EBS, EC2, RDS, Aurora, DynamoDB, EFS, FSx (all variants), Storage Gateway, DocumentDB, Neptune, Redshift, S3, SAP HANA, CloudFormation, VMware. Cross-Region and cross-account copy baked in. Vault Lock for immutability. Central place for restore testing.
- Service-native snapshots. Each service has its own snapshot mechanism: RDS automated snapshots, EBS snapshots via `CreateSnapshot`, DynamoDB backups via `CreateBackup`, EFS backups via `create-backup`, FSx backups via `create-backup`. Native mechanisms are fine for single-service use; they proliferate in large environments and make cross-service policy impossible.
- DLM (Data Lifecycle Manager). AWS-native scheduling for EBS snapshots specifically. Schedules creation, tagging, retention, and cross-Region copy for EBS volumes selected by tag. Overlaps heavily with AWS Backup’s EBS support; AWS Backup is the strategic direction, DLM is the older specialist tool.
- S3 Replication (CRR/SRR). Not a backup service per se, but covers the “durable copy in another Region” use case for S3 specifically. Cross-Region Replication copies objects to a destination bucket in a different Region, with optional Replication Time Control (SLA on replication latency) for RTO-critical cases. S3 Object Lock + Versioning + CRR is a complete S3 durability story.
- Third-party backup platforms. Veeam, Druva, N2WS, Rubrik, Clumio. Useful in hybrid environments or when regulatory requirements pre-date AWS Backup’s feature set. Adds a management plane, agent or agentless; usually a layer above the native AWS APIs.
- Homegrown Lambda + cron. Fine for one filesystem at 2am. Breaks down at organisational scale. Proliferates across teams; each implementation has its own bugs; the first time there’s a recovery test, half of them have quietly stopped running.
Side by side
| Option | Coverage | Cross-Region | Org-wide | Immutability | Restore UX |
|---|---|---|---|---|---|
| AWS Backup | broad (most stateful services) | ✓ (in-plan) | ✓ (Backup Policies in Organizations) | ✓ (Vault Lock) | unified |
| Service-native snapshots | per-service | per-service | per-service | per-service | per-service |
| DLM | EBS only | ✓ | via tagging | not built in (no vault) | EBS-specific |
| S3 Replication | S3 only | ✓ | via bucket policies | ✓ (Object Lock) | object-level |
| Third-party | varies | varies | varies | varies | varies |
Reading the table: AWS Backup does most of the heavy lifting for “everything except S3”, and S3 Replication handles S3. Between the two, the portfolio covers the bulk of stateful resources with org-wide, cross-Region, immutable backups and a single restore experience.
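That split can be applied to the inventory from the opening section as a quick sanity check. This is a sketch, not a tool: the route labels and function names are made up for illustration, and `AWS_BACKUP_TYPES` lists only the resource types present in this estate, not everything AWS Backup supports.

```python
from collections import Counter

# Resource types in this estate that the article routes to AWS Backup.
AWS_BACKUP_TYPES = {"RDS", "Aurora", "EBS", "DynamoDB", "EFS", "FSx"}


def route(resource_type: str) -> str:
    """Pick the backup mechanism for a resource type (labels are illustrative)."""
    if resource_type == "S3":
        return "s3-replication"
    if resource_type in AWS_BACKUP_TYPES:
        return "aws-backup"
    # Anything else falls back to native backup plus a deliberate copy.
    return "native-plus-cross-region-copy"


def summarise(inventory: dict) -> Counter:
    """Count resources per mechanism from a {resource_type: count} inventory."""
    totals = Counter()
    for resource_type, count in inventory.items():
        totals[route(resource_type)] += count
    return totals


# Approximate counts from the situation described at the top.
INVENTORY = {"RDS": 20, "Aurora": 6, "EBS": 400, "DynamoDB": 50,
             "EFS": 4, "FSx": 2, "S3": 400}
summary = summarise(INVENTORY)
```

For this inventory nothing lands in the fallback bucket, which is the outcome the coverage criterion was asking for.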
The centralised backup topology
The design, in depth
One Organizations Backup Policy. Attached at the Organization root. The policy selects resources with the Backup tag and routes them to different backup plans based on the value:
{
  "plans": {
    "production-daily": {
      "regions": { "@@append": ["eu-west-1"] },
      "rules": {
        "daily-to-regional-vault": {
          "schedule_expression": { "@@assign": "cron(0 3 ? * * *)" },
          "target_backup_vault_name": { "@@assign": "local-backup-vault" },
          "lifecycle": {
            "delete_after_days": { "@@assign": "35" }
          },
          "copy_actions": {
            "arn:aws:backup:us-east-1:111122223333:backup-vault:central-backup-vault": {
              "target_backup_vault_arn": {
                "@@assign": "arn:aws:backup:us-east-1:111122223333:backup-vault:central-backup-vault"
              },
              "lifecycle": {
                "delete_after_days": { "@@assign": "2555" }
              }
            }
          }
        }
      },
      "selections": {
        "tags": {
          "select-production": {
            "iam_role_arn": { "@@assign": "arn:aws:iam::$account:role/service-role/AWSBackupDefaultServiceRole" },
            "tag_key": { "@@assign": "Backup" },
            "tag_value": { "@@assign": ["production"] }
          }
        }
      }
    }
  }
}
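Every resource in eu-west-1 tagged Backup=production, in every member account, gets a daily 03:00 UTC backup to the local vault (35-day retention) plus a cross-Region, cross-account copy to the central vault in us-east-1 (2,555 days, roughly 7 years). The policy assumes that local-backup-vault and the referenced IAM role already exist in each member account; a backup policy does not create them. A Backup=dev selector with a shorter retention and no cross-Region copy runs alongside.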
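The dev plan is not shown in the article. One way to keep tiers consistent is to generate each plan from the same template, as in the sketch below. It mirrors the field names and inheritance operators of the production policy; `build_plan` and `assign` are hypothetical helpers, and the 7-day dev retention is an assumed value.

```python
import json
from typing import Optional


def assign(value):
    """Wrap a value in the @@assign operator used by backup policies."""
    return {"@@assign": value}


def build_plan(
    tier: str,
    retention_days: int,
    region: str = "eu-west-1",
    copy_vault_arn: Optional[str] = None,
    copy_retention_days: Optional[int] = None,
) -> dict:
    """Build one plan in the same shape as the production plan."""
    rule = {
        "schedule_expression": assign("cron(0 3 ? * * *)"),
        "target_backup_vault_name": assign("local-backup-vault"),
        "lifecycle": {"delete_after_days": assign(str(retention_days))},
    }
    if copy_vault_arn is not None:
        if copy_retention_days is None:
            raise ValueError("copy_retention_days is required with copy_vault_arn")
        rule["copy_actions"] = {
            copy_vault_arn: {
                "target_backup_vault_arn": assign(copy_vault_arn),
                "lifecycle": {"delete_after_days": assign(str(copy_retention_days))},
            }
        }
    return {
        "regions": {"@@append": [region]},
        "rules": {f"daily-{tier}": rule},
        "selections": {"tags": {f"select-{tier}": {
            "iam_role_arn": assign(
                "arn:aws:iam::$account:role/service-role/AWSBackupDefaultServiceRole"),
            "tag_key": assign("Backup"),
            "tag_value": assign([tier]),
        }}},
    }


# Dev tier: short retention (assumed 7 days), no cross-Region copy.
dev_policy = {"plans": {"dev-daily": build_plan("dev", retention_days=7)}}
dev_policy_json = json.dumps(dev_policy, indent=2)
```

The resulting JSON is what would be supplied as the policy content when creating the policy in Organizations; that step is left out here.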
Central backup account with Vault Lock. A dedicated account (central-backup, part of the security OU) hosts the destination vault. Vault Lock is enabled in compliance mode with a minimum retention of 35 days. Once the lock’s cooling-off period has passed, even the account’s root credentials cannot shorten the retention or delete unexpired recovery points. The vault’s KMS key policy allows the backup service to decrypt recovery points for restore and allows a specific delegated-administrator IAM role to initiate restores; it denies everyone else.
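The lock settings are small enough to check in code before they are applied. The sketch below builds the parameters for the `PutBackupVaultLockConfiguration` API (for example boto3's `put_backup_vault_lock_configuration`) without calling AWS. The helper names are hypothetical. Supplying `ChangeableForDays` is what selects compliance mode, and AWS requires it to be at least 3 days; after that window the lock can no longer be changed or removed.

```python
from typing import Optional

MIN_COOLING_OFF_DAYS = 3  # smallest ChangeableForDays AWS Backup accepts


def vault_lock_params(
    vault_name: str,
    min_retention_days: int,
    max_retention_days: Optional[int] = None,
    cooling_off_days: int = MIN_COOLING_OFF_DAYS,
) -> dict:
    """Build compliance-mode Vault Lock parameters (no AWS call is made)."""
    if cooling_off_days < MIN_COOLING_OFF_DAYS:
        raise ValueError("cooling_off_days must be at least 3")
    if min_retention_days < 1:
        raise ValueError("min_retention_days must be at least 1")
    params = {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_retention_days,
        "ChangeableForDays": cooling_off_days,
    }
    if max_retention_days is not None:
        if max_retention_days < min_retention_days:
            raise ValueError("max_retention_days is below min_retention_days")
        params["MaxRetentionDays"] = max_retention_days
    return params


def retention_fits(params: dict, retention_days: int) -> bool:
    """Would a copy with this retention be accepted by the locked vault?"""
    if retention_days < params["MinRetentionDays"]:
        return False
    return retention_days <= params.get("MaxRetentionDays", retention_days)


lock = vault_lock_params("central-backup-vault", min_retention_days=35)
# With boto3 this would be passed as:
#   boto3.client("backup").put_backup_vault_lock_configuration(**lock)
```

Checking `retention_fits(lock, 2555)` before rollout catches the case where a plan's copy retention falls outside the lock's bounds, which would otherwise surface as failed copy jobs.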
Cross-account restore path. The central-backup account holds the recovery points but doesn’t hold the original resources. AWS Backup runs a restore in the account that holds the recovery point, so restoring into a member account (say core-prod) is a two-step path: copy the recovery point from the central vault into a vault in the target account, then start the restore job there with a role that has backup:StartRestoreJob permission and the restore permissions for the resource type. The delegated-administrator process drives both steps.
S3 is a separate story. Cross-Region Replication on all production buckets into a destination bucket in us-east-1; Object Lock in governance mode with a minimum retention; versioning enabled. S3 doesn’t go through the AWS Backup vault in this design because the replication story is mature and simpler. If the audit team insists on unified policy, AWS Backup for S3 slots in and uses the same vault; the trade-off is cost (small markup over raw S3 Replication).
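The two S3 settings reduce to two small documents. The sketch below builds them as Python dicts in the shape accepted by the S3 `PutBucketReplication` and `PutObjectLockConfiguration` APIs, without calling AWS. The role ARN, bucket name, rule ID, and 35-day retention are placeholders; the article does not give a retention value for S3. Versioning has to be enabled on both source and destination buckets for replication to work.

```python
def replication_config(role_arn: str, dest_bucket_arn: str,
                       rule_id: str = "crr-production") -> dict:
    """Replication configuration that copies the whole bucket to one destination."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": rule_id,
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {},  # empty filter: the rule applies to every object
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": dest_bucket_arn},
        }],
    }


def object_lock_config(days: int, mode: str = "GOVERNANCE") -> dict:
    """Default-retention rule for a bucket that has Object Lock enabled."""
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError(f"unknown Object Lock mode: {mode}")
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": days}},
    }


# Placeholder identifiers, for illustration only.
crr = replication_config(
    role_arn="arn:aws:iam::111122223333:role/example-s3-replication-role",
    dest_bucket_arn="arn:aws:s3:::example-prod-replica-us-east-1",
)
lock_rule = object_lock_config(days=35)
```

Leaving delete-marker replication disabled means a delete in the source bucket does not propagate to the replica, which suits a copy kept for recovery.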
Restore testing. A quarterly exercise runs a representative restore from the central vault back into a dev account:
aws backup start-restore-job \
  --recovery-point-arn arn:aws:backup:us-east-1:111122223333:recovery-point:RDS-snapshot-abc123 \
  --metadata file://restore-params.json \
  --iam-role-arn arn:aws:iam::<dev-account-id>:role/BackupRestoreRole \
  --resource-type RDS
The restore creates a new resource in the dev account; a validation script checks that the data is readable. Evidence of the test goes into the compliance report. Without this, the backup plan is untested and the audit question “can you recover?” has no evidence.
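The evidence step can be as simple as turning the finished restore job into a pass/fail record. The sketch below works on a dict shaped like the response of the `DescribeRestoreJob` API (`Status`, `CreationDate`, `CompletionDate`, `CreatedResourceArn`); the function name, the sample values, and the `data_readable` flag, which stands in for whatever the validation script reports, are assumptions.

```python
from datetime import datetime, timedelta, timezone


def restore_evidence(job: dict, data_readable: bool,
                     rto: timedelta = timedelta(hours=24)) -> dict:
    """Summarise one restore test as an evidence record for the audit."""
    started = job.get("CreationDate")
    finished = job.get("CompletionDate")
    duration = finished - started if started and finished else None
    within_rto = (
        job.get("Status") == "COMPLETED"
        and duration is not None
        and duration <= rto
    )
    return {
        "restore_job_id": job.get("RestoreJobId"),
        "status": job.get("Status"),
        "created_resource": job.get("CreatedResourceArn"),
        "duration_hours": (None if duration is None
                           else round(duration.total_seconds() / 3600, 2)),
        "within_rto": within_rto,
        "passed": within_rto and data_readable,
    }


# Made-up sample in the shape of a DescribeRestoreJob response.
sample_job = {
    "RestoreJobId": "example-restore-job",
    "Status": "COMPLETED",
    "CreationDate": datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc),
    "CompletionDate": datetime(2024, 1, 15, 12, 30, tzinfo=timezone.utc),
    "CreatedResourceArn": "arn:aws:rds:eu-west-1:111122223333:db:restore-test",
}
record = restore_evidence(sample_job, data_readable=True)
```

A job that completes but fails the readability check is recorded as not passed, which keeps "the job succeeded" separate from "the data came back".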
The answer to the audit question
“Prove you can recover any production resource, cross-Region, within 24 hours.”
1. Every production resource is tagged Backup=production.
2. The Organizations Backup Policy backs up every matching resource daily
to its account's local vault with 35-day retention.
3. Each backup triggers a cross-Region, cross-account copy to the
central-backup vault in us-east-1 with 7-year retention and
Vault Lock (compliance mode, min 35-day).
4. AWS Backup's restore API starts a restore job in the target account
within minutes; large RDS restores complete in under 4 hours;
EBS/EFS/DynamoDB in under 2 hours.
5. Quarterly restore tests produce evidence that the end-to-end path works.
6. Audit console view: AWS Backup > Jobs shows every backup and every
copy job with timestamps, sources, destinations, and sizes.
One plan, one central vault, one console view, one restore API. What used to be three meetings is now a link to the AWS Backup console and a CSV export of the last 90 days of backup jobs.
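The CSV export can be sketched in a few lines. The function below takes job records shaped like the `BackupJobs` entries returned by the `ListBackupJobs` API and writes the last 90 days to CSV text. Fetching the jobs (and the copy jobs, which come from a separate `ListCopyJobs` call) is left out; the function name and column selection are assumptions.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

FIELDS = ["BackupJobId", "ResourceType", "ResourceArn", "BackupVaultName",
          "State", "CreationDate", "CompletionDate", "BackupSizeInBytes"]


def jobs_to_csv(jobs: list, now: datetime, days: int = 90) -> str:
    """Render backup jobs created in the last `days` days as CSV text."""
    cutoff = now - timedelta(days=days)
    recent = sorted(
        (job for job in jobs if job["CreationDate"] >= cutoff),
        key=lambda job: job["CreationDate"],
    )
    buffer = io.StringIO()
    # extrasaction="ignore" drops response fields we do not report on;
    # fields missing from a job (e.g. no CompletionDate yet) are left blank.
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(recent)
    return buffer.getvalue()
```

Passing `now` explicitly rather than reading the clock inside the function keeps the export reproducible for a given reporting date.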
What’s worth remembering
- AWS Backup is the centralised answer. One policy, one plan, one vault per Region. Covers most stateful AWS services natively; cross-Region and cross-account copy are in-plan.
- Organizations Backup Policies enforce at the account level. Cascade from the Organization root or an OU; new accounts inherit; member accounts can’t disable their backups without touching the policy tree.
- Tags drive selection. Resources with the correct `Backup=<tier>` tag are included automatically. Tag-based selection survives resource replacement, which is what makes the policy durable over time.
- Vault Lock is the ransomware defence. Compliance mode is immutable even for the root account; governance mode allows IAM-authorised changes but is auditable. Production destination vaults want compliance.
- Cross-Region copy is a single property on the backup rule. No Lambda, no DataSync, no custom pipelines. The same is true for cross-account copy.
- S3 has its own durability story. CRR + Versioning + Object Lock is the mature pattern. AWS Backup for S3 exists if unified policy is the goal; S3 Replication is simpler and cheaper for pure durability.
- Restore testing is where backups become real. A quarterly or semi-annual exercise that actually restores into a sandbox turns “we have backups” into “we can recover”. Missing this is the most common audit finding.
- Delegated administration keeps restore workflows operational. A `central-backup` account holds the vaults; delegated-admin roles in member accounts let the central team initiate restores back into the right place without needing root access everywhere.
Backups that survive the Region are a policy problem, not a service problem. AWS Backup with an Organizations Backup Policy, destination vaults in a central account with Vault Lock, and quarterly restore tests gets the portfolio from “backups scattered across six services” to “one console, one answer, one set of evidence for the audit”.