How to Roll Out AWS Backup Across an Organisation

February 01, 2027 · 16 min read

Solutions Architect Pro · SAP-C02 · part of The Exam Room

The situation

The organisation has twenty AWS accounts under AWS Organizations, grouped into four OUs: Production, PreProduction, Sandbox, and Shared. Each account has its own EC2 fleet, its own RDS and DynamoDB, its own S3 buckets. Today, backup is whatever each team chose:

  • Some teams use AWS Backup with a local plan.
  • Some use nightly Lambdas calling CreateSnapshot directly.
  • Some run pg_dump to S3 and cross their fingers.
  • Two accounts have no backup strategy at all.

The auditor’s requirements:

  • Every production resource must have a documented backup schedule, retention, and destination.
  • Backups must be recoverable even if the source account is compromised, which means a cross-account copy to a vault in an account the source cannot write to.
  • Deletions must require a ceremony. A single compromised role should not be able to delete years of backups.
  • The control must be organisation-wide and auditable; “we trust each team to comply” does not pass.

The platform team needs one way to do this, applied top-down, visible from a single dashboard, impossible to opt out of.

What actually matters

The core trade in organisation-wide backup is central control in exchange for team flexibility. A fully central model (one plan applies to every account) is auditable but blunt: it backs up things teams didn’t need and misses things they did. A fully local model (each team configures their own) is flexible but unauditable. The middle path is a central policy every team inherits, combined with local tagging to opt into specific retention tiers.

The first thing to ask is: what gets backed up? Whatever primitive we use has to cover the stateful services in the estate: block storage, relational, key-value, file systems, object storage. Anything not covered by the primitive needs its own snapshot story. The policy has to be explicit about which resource types it applies to.

The second is: how do we select resources? Tag-based selection is the organisation-scale answer. A policy might say “everything tagged Backup=daily-prod goes into the daily plan.” Teams tag their resources; the policy does the rest. Untagged resources don’t get backed up, which sounds risky until you add an SCP that denies creating production resources without the tag.
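A minimal sketch of what tag-based selection amounts to, assuming an inventory of resources with their tags is already to hand. The tag values are the tiers this article uses; the ARNs, account IDs and the plan_for helper are hypothetical, and in a real deployment AWS Backup performs this matching itself.

```python
from typing import Optional

BACKUP_TAG_KEY = "Backup"

# Tag value -> plan name. The tiers come from this article's design.
PLAN_BY_TAG_VALUE = {
    "daily-prod": "daily-prod",
    "monthly-prod": "monthly-prod",
    "compliance-7y": "compliance-7y",
}


def plan_for(tags: dict) -> Optional[str]:
    """Return the plan that selects a resource, or None if nothing selects it."""
    return PLAN_BY_TAG_VALUE.get(tags.get(BACKUP_TAG_KEY, ""))


# Hypothetical inventory; a real one would come from a tagging or inventory API.
resources = [
    {"arn": "arn:aws:rds:eu-west-1:111111111111:db:payments-db",
     "tags": {"Backup": "daily-prod"}},
    {"arn": "arn:aws:dynamodb:eu-west-1:111111111111:table/sessions",
     "tags": {"Team": "payments"}},
]

# Anything no plan selects is exactly the gap the tagging SCP has to close.
unprotected = [r["arn"] for r in resources if plan_for(r["tags"]) is None]
print(unprotected)
```

Matching is exact and case-sensitive in this sketch, so a resource tagged Backup=Daily-Prod would count as unprotected. That is the kind of drift a periodic inventory check is meant to surface.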

The third is: where do backups live? A vault is a container for recovery points; it can be local (same account as the source) or cross-account (in a dedicated backup account). For ransomware protection, the backup must land in an account the source cannot write to, so the cross-account copy is mandatory, not optional.

The fourth is: how are backups protected from deletion? The strongest answer is a retention lock enforced by the storage layer itself, not by IAM. Once locked, no principal (not root, not the team that owns the data) can delete backups or shorten retention during the lock window. This is the “cannot be forgotten” part.

The fifth is: how do we know it worked? Continuous compliance checks against the policies, reporting per-resource “last backup” status, is what makes this auditor-facing rather than internal-faith-based.

What we’ll filter on

  1. Resource type coverage: which services are in scope?
  2. Selection mechanism: tag, ARN, all-resources, opt-in?
  3. Cross-account: can backups land in a separate account?
  4. Deletion protection: what stops deletion by a compromised principal?
  5. Audit visibility: can a regulator see compliance at a glance?

The landscape

  1. AWS Backup with organisation-wide Backup Policy. Through AWS Organizations, a BACKUP_POLICY is attached to the org root, an OU, or specific accounts. Member accounts inherit the policy and AWS Backup in each account enforces it. Policies specify plans, rules, vault destinations, and resource selections. Tag-based selection lets each team opt resources in. Inheritance is hierarchical: OU-level policies apply to all accounts below; account-level policies merge with inherited ones.

  2. AWS Backup with local plans only. Each account has its own AWS Backup plan configured independently. No organisation-wide visibility; no guarantee of coverage; no central policy to update when retention changes. The status quo for many organisations.

  3. Custom snapshot Lambdas. Nightly Lambdas calling CreateSnapshot, CreateDBSnapshot, etc. Works at low scale; falls over at organisation scale (per-account IAM, per-account scheduling, no centralised retention, no cross-account copy without bespoke code).

  4. AWS Backup Vault Lock. A retention lock on a backup vault. Two modes: Governance (can be removed by a principal with backup:DeleteBackupVaultLockConfiguration) and Compliance (cannot be removed by anyone once the cooling-off period expires). The cooling-off period is chosen when the lock is applied, with a minimum of 3 days, and is the only window in which the configuration can be backed out; after it the lock is immutable.

  5. AWS Backup Audit Manager. A compliance framework for backup. Ships with preset controls (“minimum retention”, “cross-region copy”, “cross-account copy”, “resource protected”). Controls evaluate against every resource in the org; non-compliant resources appear in reports. The auditor-facing output.

  6. SCPs on backup-related actions. Separate from backup policies. An SCP can deny backup:DeleteBackupVault, backup:DeleteRecoveryPoint, ec2:DeleteSnapshot etc. from all member accounts except the centralised backup-admin account. Closes the loophole where a local admin could delete backups before the vault lock caught them.

Side by side

Option | Resource coverage | Selection | Cross-account | Deletion protection | Audit visibility
Org Backup Policy | EBS, RDS, DDB, EFS, FSx, S3, etc. | Tag-based | ✓ | Vault Lock (Compliance) | Backup Audit Manager
Local Backup plans | Same | Tag or ARN | ✓ (manual) | Vault Lock per account | Per-account only
Custom Lambdas | Whatever you code | Whatever you code | Code it | Nothing built-in | Build it
Vault Lock | N/A | N/A | N/A | ✓ | Via Audit Manager
Audit Manager | All Backup resources | All selected | Reports on it | N/A | ✓
SCPs on delete actions | All accounts | Action-level | N/A | ✓ (complement) | CloudTrail

The organisation-wide pattern is: Backup Policy for what to back up and where, Vault Lock Compliance on the destination vault for immutability, SCPs to prevent bypass, Audit Manager to prove it. Four controls; each does one job.

The architecture

[Architecture diagram]

Management account (AWS Organizations)
  • Backup Policy: daily-prod plan, monthly-prod plan; attached to Production OU
  • SCP protect-backups: deny backup:Delete*, ec2:DeleteSnapshot except backup-admin role
  • Backup Audit Manager: framework CrossAccountBackup, VaultLock; report to compliance bucket

Production OU, member accounts
  • acct: payments, tag=Backup:daily-prod, local vault; RDS, DDB, S3
  • acct: ledger, tag=Backup:daily-prod, local vault; Aurora, EFS
  • acct: customer-api, tag=Backup:daily-prod, local vault; EBS, DDB
  • acct: reports, tag=Backup:daily-prod, local vault; RDS, S3

Plan execution (per account, per day)
  • select resources by tag -> snapshot -> local vault -> copy action -> central vault
  • CloudTrail records each step

Resource selection (tag-based)
  • Tag key = Backup, value = daily-prod | monthly-prod | compliance-7y
  • SCP denies creating production resources without a Backup tag
  • untagged = no backup, but also cannot be deployed

Central backup account
  • central-backup-vault, Vault Lock: Compliance mode
  • min retention: 30d; max retention: 2555d (7y)
  • KMS key managed by backup account
  • vault policy: source accounts may put, never delete
  • restore role: break-glass only, MFA + ticket required
  • region: eu-west-1; CRR copy to eu-central-1 for critical plans

Policy in the management account cascades to member accounts; each account backs up locally and copies to the central vault; Vault Lock keeps backups immutable; SCPs prevent bypass.

The picks in depth

The Backup Policy, attached to the Production OU. A JSON document with one or more backup plans. Simplified:

{
  "plans": {
    "daily-prod": {
      "regions": { "@@assign": ["eu-west-1"] },
      "rules": {
        "daily": {
          "schedule_expression": { "@@assign": "cron(0 2 ? * * *)" },
          "start_backup_window_minutes": { "@@assign": "60" },
          "complete_backup_window_minutes": { "@@assign": "240" },
          "lifecycle": {
            "delete_after_days": { "@@assign": "35" }
          },
          "target_backup_vault_name": { "@@assign": "local-prod-vault" },
          "copy_actions": {
            "arn:aws:backup:eu-west-1:999999999999:backup-vault:central-backup-vault": {
              "target_backup_vault_arn": {
                "@@assign": "arn:aws:backup:eu-west-1:999999999999:backup-vault:central-backup-vault"
              },
              "lifecycle": {
                "delete_after_days": { "@@assign": "2555" }
              }
            }
          }
        }
      },
      "selections": {
        "tags": {
          "daily-prod-resources": {
            "iam_role_arn": {
              "@@assign": "arn:aws:iam::$account:role/AWSBackupDefaultServiceRole"
            },
            "tag_key": { "@@assign": "Backup" },
            "tag_value": { "@@assign": ["daily-prod"] }
          }
        }
      }
    }
  }
}

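The @@assign operator sets a value, overwriting whatever was inherited from a parent policy. On its own it does not stop a policy attached lower in the hierarchy from assigning something different; to pin a setting, add "@@operators_allowed_for_child_policies": ["@@none"] beside the @@assign. Backup policies are authored centrally rather than by member-account teams, so a team cannot attach an override of its own either way. The IAM role reference uses $account so each member account uses its own role of the same name; a companion StackSet ensures that role exists in every account.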
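How a parent value and a child value resolve can be sketched in a few lines. This is a simplified model, not the Organizations implementation: it handles only @@assign, it assumes the default where child policies may use any operator unless the parent restricts them with @@operators_allowed_for_child_policies, and it silently ignores a disallowed child value rather than modelling how AWS reports the conflict. The effective_value helper and the sample values are hypothetical.

```python
from typing import Any, Optional

CHILD_CONTROL = "@@operators_allowed_for_child_policies"


def effective_value(parent: dict, child: Optional[dict] = None) -> Any:
    """Resolve one setting from a parent node and an optional child node.

    Simplified: only @@assign is modelled, and a child value that the parent
    does not allow is ignored instead of being flagged.
    """
    allowed = parent.get(CHILD_CONTROL, ["@@all"])  # default: children may override
    child_may_assign = "@@all" in allowed or "@@assign" in allowed
    if child is not None and "@@assign" in child and child_may_assign:
        return child["@@assign"]
    return parent["@@assign"]


ou_open = {"@@assign": "35"}                              # value set, not pinned
ou_pinned = {"@@assign": "35", CHILD_CONTROL: ["@@none"]}  # value set and pinned
account_override = {"@@assign": "7"}                      # hypothetical child policy

print(effective_value(ou_open, account_override))    # child value wins
print(effective_value(ou_pinned, account_override))  # parent value holds
```

The design point is that setting a value and forbidding overrides are two separate operators, so retention values that must not drift need both.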

Attaching the policy to the Production OU means all four production accounts inherit it without each team configuring anything. Adding a new production account is “move it into the OU”; the backup plan appears automatically.
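Because every plan in the policy has the same shape, the JSON can be generated rather than hand-edited. The sketch below rebuilds the daily-prod plan from the values shown above (minus the backup-window settings, for brevity) and adds a monthly-prod plan whose schedule and 365-day local retention are assumptions made up for illustration. The build_plan and assign helpers are hypothetical names.

```python
import json

CENTRAL_VAULT_ARN = (
    "arn:aws:backup:eu-west-1:999999999999:backup-vault:central-backup-vault"
)


def assign(value):
    """Wrap a value in the @@assign operator."""
    return {"@@assign": value}


def build_plan(rule_name, schedule, local_days, central_days, tag_value,
               region="eu-west-1", local_vault="local-prod-vault"):
    """Build one backup plan: a rule, a cross-account copy, a tag selection."""
    return {
        "regions": assign([region]),
        "rules": {
            rule_name: {
                "schedule_expression": assign(schedule),
                "lifecycle": {"delete_after_days": assign(str(local_days))},
                "target_backup_vault_name": assign(local_vault),
                "copy_actions": {
                    CENTRAL_VAULT_ARN: {
                        "target_backup_vault_arn": assign(CENTRAL_VAULT_ARN),
                        "lifecycle": {
                            "delete_after_days": assign(str(central_days))
                        },
                    }
                },
            }
        },
        "selections": {
            "tags": {
                f"{tag_value}-resources": {
                    "iam_role_arn": assign(
                        "arn:aws:iam::$account:role/AWSBackupDefaultServiceRole"
                    ),
                    "tag_key": assign("Backup"),
                    "tag_value": assign([tag_value]),
                }
            }
        },
    }


policy = {
    "plans": {
        "daily-prod": build_plan("daily", "cron(0 2 ? * * *)", 35, 2555, "daily-prod"),
        # Assumed values: first of the month at 03:00, kept locally for a year.
        "monthly-prod": build_plan("monthly", "cron(0 3 1 * ? *)", 365, 2555, "monthly-prod"),
    }
}

print(json.dumps(policy, indent=2))
```

Generating the document keeps the vault ARN and retention values in one place, which matters once a second or third tier is added.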

The central vault, Compliance-mode locked. In a dedicated backup account, in the Shared OU. The vault:

  • Uses a KMS key managed by the backup account.
  • Has a vault access policy allowing each source account to copy recovery points in (backup:CopyIntoBackupVault) while granting no delete permissions.
  • Is locked in Compliance mode with MinRetentionDays: 30 and MaxRetentionDays: 2555.
  • Has a 30-day cooling-off window during which the lock can still be reversed. After that, nothing can change it: not the backup-admin, not the root account, not AWS Support.

Compliance mode is the “organisation cannot forget” part. The 30-day cooling-off window, which starts when the lock is applied, is the time to verify everything works and the retention values are right. Once the window closes, the lock can no longer be changed or removed.
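One check worth running during the cooling-off window is that every lifecycle pointed at the central vault falls inside the lock’s bounds, since a backup or copy job whose retention sits outside them is rejected. A minimal sketch using the bounds above; the fits_vault_lock helper and the two out-of-range candidates are hypothetical.

```python
MIN_RETENTION_DAYS = 30    # vault lock minimum from this design
MAX_RETENTION_DAYS = 2555  # vault lock maximum from this design


def fits_vault_lock(delete_after_days: int,
                    min_days: int = MIN_RETENTION_DAYS,
                    max_days: int = MAX_RETENTION_DAYS) -> bool:
    """True if a recovery point with this retention is accepted by the lock."""
    return min_days <= delete_after_days <= max_days


# Retention values someone might try to send to the central vault.
candidates = {
    "central copy from the daily-prod plan": 2555,
    "hypothetical 14-day copy": 14,
    "hypothetical 10-year copy": 3650,
}

for name, days in candidates.items():
    verdict = "accepted" if fits_vault_lock(days) else "rejected"
    print(f"{name}: {days} days -> {verdict}")
```

The 35-day lifecycle in the policy applies to the local vault, which is not locked in this design, so only copy-action lifecycles need this check.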

The SCP, denying deletion everywhere except the backup-admin role. Applied to the org root:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyBackupDeletionExceptAdmin",
      "Effect": "Deny",
      "Action": [
        "backup:DeleteBackupVault",
        "backup:DeleteRecoveryPoint",
        "backup:PutBackupVaultAccessPolicy",
        "backup:DeleteBackupVaultAccessPolicy",
        "backup:DeleteBackupVaultLockConfiguration",
        "backup:StopBackupJob"
      ],
      "Resource": "*",
      "Condition": {
        "ArnNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/BackupAdminRole"
        }
      }
    },
    {
      "Sid": "DenyUntaggedProductionResources",
      "Effect": "Deny",
      "Action": [
        "rds:CreateDBInstance",
        "rds:CreateDBCluster",
        "dynamodb:CreateTable",
        "ec2:RunInstances"
      ],
      "Resource": [
        "arn:aws:rds:*:*:db:*",
        "arn:aws:rds:*:*:cluster:*",
        "arn:aws:dynamodb:*:*:table/*",
        "arn:aws:ec2:*:*:instance/*"
      ],
      "Condition": {
        "Null": { "aws:RequestTag/Backup": "true" },
        "ForAnyValue:StringLike": {
          "aws:PrincipalOrgPaths": ["o-xxx/r-yyy/ou-Production/*"]
        }
      }
    }
  ]
}

First statement: no one except BackupAdminRole can delete backups or change vault policies. Second statement: production resources must be created with a Backup tag. Without the tag, no backup policy matches; with the SCP, the resource can’t be created untagged. The loop is closed.
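The two deny rules can be modelled as a small function to reason about which requests get through. This is a toy evaluation, not the IAM policy engine: it covers only a subset of the protected actions, reduces the OU condition to a boolean, and ignores resource matching. The scp_denies helper, the role names other than BackupAdminRole, and the account IDs are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional

# A subset of the actions the first statement protects.
PROTECTED_ACTIONS = {
    "backup:DeleteBackupVault",
    "backup:DeleteRecoveryPoint",
    "backup:PutBackupVaultAccessPolicy",
    "backup:DeleteBackupVaultAccessPolicy",
    "backup:StopBackupJob",
}
ADMIN_ARN_PATTERN = "arn:aws:iam::*:role/BackupAdminRole"

TAG_REQUIRED_ACTIONS = {
    "rds:CreateDBInstance",
    "rds:CreateDBCluster",
    "dynamodb:CreateTable",
    "ec2:RunInstances",
}


def scp_denies(action: str, principal_arn: str,
               request_tags: Optional[dict] = None,
               in_production_ou: bool = False) -> bool:
    """Return True if either deny statement would block the request."""
    request_tags = request_tags or {}
    # Statement 1: backup-destructive actions from anyone but the admin role.
    if action in PROTECTED_ACTIONS and not fnmatchcase(principal_arn, ADMIN_ARN_PATTERN):
        return True
    # Statement 2: creating a production resource without a Backup tag.
    if action in TAG_REQUIRED_ACTIONS and in_production_ou and "Backup" not in request_tags:
        return True
    return False


dev = "arn:aws:iam::111111111111:role/Developer"
admin = "arn:aws:iam::999999999999:role/BackupAdminRole"

print(scp_denies("backup:DeleteRecoveryPoint", dev))    # blocked
print(scp_denies("backup:DeleteRecoveryPoint", admin))  # not blocked by the SCP
print(scp_denies("rds:CreateDBInstance", dev, {}, in_production_ou=True))
print(scp_denies("rds:CreateDBInstance", dev, {"Backup": "daily-prod"}, in_production_ou=True))
```

An SCP only removes permissions. A request that is not denied here still needs an IAM policy in the account that allows it.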

The snapshot copy path. When the plan runs in payments:

  1. AWS Backup in payments creates a recovery point in local-prod-vault.
  2. The copy action fires: AWS Backup copies the recovery point to central-backup-vault in the backup account.
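  3. The copy lands under the destination’s Compliance-mode lock. Its retention comes from the copy action’s lifecycle and must sit within the vault’s minimum and maximum, or the copy job fails.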
  4. The recovery point in the source account’s vault has a 35-day lifecycle; the copy in the central vault has 2555 days.

The local copy is for fast recovery (in-Region, in-account restores). The central copy is for the ransomware/account-compromise case: even if payments is fully compromised, the central vault holds an immutable copy.
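The two lifecycles translate into concrete expiry dates with simple date arithmetic. The sketch assumes a recovery point created at 02:00 UTC on a hypothetical date and counts retention from that moment; the expiry helper is illustrative only.

```python
from datetime import datetime, timedelta, timezone


def expiry(created: datetime, delete_after_days: int) -> datetime:
    """Earliest moment a recovery point becomes eligible for deletion."""
    return created + timedelta(days=delete_after_days)


# Hypothetical creation time for one nightly recovery point.
created = datetime(2027, 2, 1, 2, 0, tzinfo=timezone.utc)

local_expiry = expiry(created, 35)      # copy in local-prod-vault
central_expiry = expiry(created, 2555)  # copy in central-backup-vault

print("local copy expires:  ", local_expiry.date())
print("central copy expires:", central_expiry.date())
```

Note that 2555 days is 7 × 365, so across a span containing leap years the central copy expires a day or two short of seven calendar years. If the auditor’s requirement is worded in calendar years, that gap is worth confirming before the lock becomes immutable.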

Backup Audit Manager. Deployed in the delegated administrator account via framework resources. The standard frameworks cover:

  • BACKUP_RECOVERY_POINT_ENCRYPTED: every recovery point is KMS-encrypted.
  • BACKUP_RECOVERY_POINT_MINIMUM_RETENTION_CHECK: recovery points meet the policy’s minimum retention.
  • BACKUP_RESOURCES_PROTECTED_BY_BACKUP_PLAN: every tagged resource has an active plan.
  • BACKUP_RESOURCES_PROTECTED_BY_CROSS_ACCOUNT: recovery points copied to a different account.
  • BACKUP_RESOURCES_PROTECTED_BY_CROSS_REGION: recovery points copied to a different Region.

Reports go to an S3 bucket in the audit account; the auditor opens the latest report and sees per-resource compliance.
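A report of this kind reduces to two questions: which controls are failing, and on which resources. The sketch below assumes the report has already been parsed into rows; the field names (resource, control, status), the status strings and the sample ARNs are hypothetical and will not match the real report columns exactly.

```python
from collections import Counter

# Hypothetical parsed rows from the latest compliance report.
report_rows = [
    {"resource": "arn:aws:rds:eu-west-1:111111111111:db:payments-db",
     "control": "BACKUP_RECOVERY_POINT_ENCRYPTED",
     "status": "COMPLIANT"},
    {"resource": "arn:aws:rds:eu-west-1:111111111111:db:payments-db",
     "control": "BACKUP_RECOVERY_POINT_MINIMUM_RETENTION_CHECK",
     "status": "COMPLIANT"},
    {"resource": "arn:aws:dynamodb:eu-west-1:222222222222:table/ledger-entries",
     "control": "BACKUP_RECOVERY_POINT_MINIMUM_RETENTION_CHECK",
     "status": "NON_COMPLIANT"},
]

# Failures per control: the headline number for the auditor.
failures_by_control = Counter(
    row["control"] for row in report_rows if row["status"] != "COMPLIANT"
)

# Distinct failing resources: the work list for the owning teams.
failing_resources = sorted(
    {row["resource"] for row in report_rows if row["status"] != "COMPLIANT"}
)

print(dict(failures_by_control))
print(failing_resources)
```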

A worked restoration

Production payments RDS is corrupted at 14:00 by an accidental UPDATE without a WHERE clause. The backup from 02:00 is in local-prod-vault and has a copy in central-backup-vault.

  1. On-call assumes the break-glass restore role in payments (MFA-gated, ticket-required). The 02:00 recovery point is still in local-prod-vault, so the restore runs from the local copy. AWS Backup restores into the account that holds the recovery point, so using the central copy would first mean copying it back into a vault in the target account; that path is for the case where the local vault is lost.
  2. aws backup start-restore-job --recovery-point-arn ... --metadata ... --iam-role-arn restore-role-in-payments.
  3. The restore creates a new RDS instance in payments with the 02:00 state. About 20 minutes for a mid-sized database.
  4. The payments team cuts over the application to the restored instance.

Total data loss: 12 hours (the 02:00 to 14:00 window). Recovery time: 30 minutes including the approval ceremony. Audit artefact: CloudTrail in payments shows the restore job; the copy in central-backup-vault is untouched because Compliance mode prevents its deletion.
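The figures in this walkthrough follow from a few subtractions, sketched here so the same arithmetic can be rerun with other schedules. The calendar date is hypothetical; the times and durations are the ones quoted above.

```python
from datetime import datetime, timedelta

last_backup = datetime(2027, 2, 1, 2, 0)  # nightly run at 02:00
incident = datetime(2027, 2, 1, 14, 0)    # bad UPDATE at 14:00

# Data written after the last recovery point is what gets lost.
data_loss = incident - last_backup

restore_duration = timedelta(minutes=20)  # quoted restore time
recovery_time = timedelta(minutes=30)     # quoted total, ceremony included
ceremony_budget = recovery_time - restore_duration

# With one backup a day, an incident just before the next run loses nearly
# a full interval.
backup_interval = timedelta(hours=24)

print("data lost:            ", data_loss)
print("left for the ceremony:", ceremony_budget)
print("worst case approaches:", backup_interval)
```

The 12 hours here is luck of timing rather than a guarantee. The schedule in the plan sets the bound, which is why the schedule is the clause to change if the auditor asks for a tighter RPO.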

What’s worth remembering

  1. Organizations backup policies are top-down enforcement. Attach to OUs; new accounts inherit automatically. The management account defines “what must be backed up”; member accounts cannot opt out.
  2. Tag-based selection scales across the organisation. Teams tag resources; the policy picks them up. Pair with an SCP that requires tags on production resources to close the loophole.
  3. Compliance-mode Vault Lock is the immutability answer. No one (not root, not AWS) can delete recovery points or shorten retention during the lock. The cooling-off window, 30 days in this design, is the only escape hatch, and it closes for good when it expires.
  4. Backups must live in an account the source cannot write to. A compromised source account shouldn’t be able to delete its own backups. Cross-account copy to a dedicated backup account is the control.
  5. SCPs complement backup policies. Policies say “back up these resources”; SCPs say “don’t delete backups” and “don’t create resources that evade the policy”. Both are needed.
  6. Backup Audit Manager is the auditor-facing story. Continuous compliance checks against standard frameworks, reports in S3, pass/fail per resource.
  7. @@assign in Organizations policies sets the value; @@operators_allowed_for_child_policies with @@none is what stops a child policy from overriding it. Use the pair for the values that must not drift.
  8. The backup plan is the disaster-recovery contract. It binds RPO (schedule), retention (lifecycle), and destination (vault). Every clause maps to an auditor question.

Twenty accounts, one policy, one central vault, one SCP, one audit dashboard. The organisation cannot forget because the organisation doesn’t decide: the management account does, once, and every account follows.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.