The situation
Every account in the organisation needs a baseline stack:
- GuardDuty enabled, with findings exported to a central Security Hub.
- Config enabled with the organisation’s conformance pack.
- CloudTrail with a trail in each account sending to the central logging bucket.
- IAM roles for the platform team, the security team, and a break-glass role, with permissions boundaries applied.
- VPC Flow Logs for the default VPC (or the absence of a default VPC; there’s a conformance rule for that).
Twenty accounts today, one new account per week from the account-vending-machine. The baseline has evolved three times in the last year (GuardDuty export, Config rule updates, new IAM role). Each evolution meant either a spreadsheet of “which accounts are on which version” or a brave engineer click-deploying twenty times.
The decision has been made: CloudFormation StackSets. The question now is how: which flavour of StackSets, targeting what, with what failure tolerance, and how we keep the set from rotting.
What actually matters
The core trade in organisation-wide CloudFormation deployment is uniformity in exchange for deployment-time cost. A set deployed to every account is uniform by definition, but a single bad change propagates everywhere. The more accounts, the longer the deployment, the more likely something transient fails, the more carefully we need to handle partial failures.
The first thing to ask is: service-managed or self-managed permissions? Self-managed StackSets require a trust relationship that the operator creates by hand: an AWSCloudFormationStackSetAdministrationRole in the admin account and an AWSCloudFormationStackSetExecutionRole in each target account. Service-managed StackSets use AWS Organizations’ trusted-access feature: StackSets in the management (or delegated admin) account can deploy to any account in the org without per-account role setup. Service-managed is the natural choice for any organisation that has Organizations trusted access enabled (it still has to be requested explicitly, because the API's permission model defaults to SELF_MANAGED); self-managed remains for pre-Organizations estates or for deployments outside the organisation.
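The second is: what do we target? Targets can be the whole organisation (the root), specific OUs, or OUs combined with an explicit account list through an account filter (INTERSECTION, DIFFERENCE, or UNION). Targeting an OU means new accounts joining that OU get the stack automatically when auto-deployment is enabled. A DIFFERENCE filter lets us exempt one account without removing it from the OU. Deployment targets are expressed as OUs and account IDs; they cannot be filtered by account tag.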
The third is: how do we handle failures? StackSets has a failure-tolerance setting (FailureToleranceCount or FailureTolerancePercentage): how many account deployments per Region can fail before the operation stops. It also has a concurrency setting (MaxConcurrentCount or MaxConcurrentPercentage): how many accounts deploy in parallel. Low tolerance + low concurrency = safe but slow; high tolerance + high concurrency = fast but noisy.
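To make the dial concrete, here is a small arithmetic sketch of how the two percentages translate into accounts touched at once. It rests on two assumptions taken from my reading of the StackSets operation-preferences documentation: percentages round down (with a floor of one for concurrency), and under the default STRICT_FAILURE_TOLERANCE concurrency mode the actual concurrency is capped at the failure tolerance plus one. The function name `effective_concurrency` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: estimates per-Region concurrency for a StackSets operation.
# Assumption: percentages round down, and strict mode caps concurrency
# at (failure tolerance + 1).

# effective_concurrency ACCOUNTS MAX_CONCURRENT_PCT FAILURE_TOLERANCE_PCT [MODE]
effective_concurrency() {
  local accounts=$1 max_pct=$2 tol_pct=$3
  local mode=${4:-STRICT_FAILURE_TOLERANCE}
  local concurrent tolerance
  concurrent=$(( accounts * max_pct / 100 ))   # integer division rounds down
  tolerance=$(( accounts * tol_pct / 100 ))
  # Rounding down to zero would stall the rollout, so the floor is one account.
  if [ "$concurrent" -lt 1 ]; then concurrent=1; fi
  # Strict mode: never run more than (tolerated failures + 1) at once.
  if [ "$mode" = STRICT_FAILURE_TOLERANCE ] && [ "$concurrent" -gt $(( tolerance + 1 )) ]; then
    concurrent=$(( tolerance + 1 ))
  fi
  echo "$concurrent"
}

# Example: 15 accounts, 25% concurrency, 0% tolerance.
#   effective_concurrency 15 25 0                          # prints 1
#   effective_concurrency 15 25 0 SOFT_FAILURE_TOLERANCE   # prints 3
```

If this reading is right, zero tolerance in strict mode serialises the rollout regardless of the concurrency percentage, which is worth knowing before estimating how long a production tranche will take.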
The fourth is: drift. If someone manually modifies a resource created by a StackSet in a single account, that account’s stack has drifted. StackSets supports drift detection across the set. Detecting it is cheap; fixing it requires a deliberate redeploy.
The fifth is: how do we roll out changes? A StackSet update applies to every stack instance. Staged rollouts (canary to one OU, watch for a day, then the rest) are possible but not automatic; you do it by sequencing the update operations.
What we’ll filter on
- Permission model: self-managed roles vs service-managed via Organizations.
- Target granularity: whole org, OU, OU with an account filter, specific accounts.
- Automatic enrolment: do new accounts get the stack without intervention?
- Concurrency and failure tolerance: how fast and how fault-tolerant is a rollout?
- Drift detection: can we see where a deployed stack no longer matches the template?
The deployment-scale landscape
- CloudFormation StackSets, service-managed. Deployed from the management account or a delegated administrator. Targets OUs or the whole org. New accounts joining a targeted OU trigger automatic deployment. No per-account IAM setup. Supports the AutoDeployment feature with Enabled=true and RetainStacksOnAccountRemoval=false to auto-enrol and auto-remove.
- CloudFormation StackSets, self-managed. Deployed from any account to any account where the execution role exists. Useful for pre-Organizations or mixed estates. More setup, more per-account trust to maintain.
- Control Tower Customizations (CfCT). A layer on top of Control Tower that deploys CloudFormation templates (and Service Catalog products) per account as Control Tower creates them. Tightly integrated with Control Tower’s account vending; opinionated about structure.
- Terraform with a workspace per account. The Terraform-shop answer. Each account is a workspace; a module is applied per workspace via CI. Different operational model; excellent for organisations already deep in Terraform.
- AWS CDK Pipelines with a stage per account. CDK-native. One pipeline, N stages, each stage deploys to one account. Opinionated toward CI/CD; requires code changes to add accounts.
- Manual CloudFormation per account. Included to anchor the scale. A terrible idea past three accounts.
Side by side
| Option | Permission model | Targeting | Auto-enrol new accounts | Concurrency / tolerance | Drift detection |
|---|---|---|---|---|---|
| StackSets service-managed | Organizations trusted access | OU / org / OU + account filter | ✓ | Built-in | ✓ |
| StackSets self-managed | Per-account roles | Specific accounts | ✗ | Built-in | ✓ |
| CfCT | Control Tower integration | CT lifecycle events | ✓ | CT-paced | Via Config |
| Terraform workspaces | Per-workspace creds | Workspace list | Scripted | CI-paced | Terraform plan |
| CDK Pipelines | Per-stage roles | Pipeline stages | Code change | Pipeline-paced | CDK diff |
| Manual | Whatever | Whatever | ✗ | None | None |
For an organisation that’s adopted Organizations and wants auto-enrolment, service-managed StackSets is the natural fit. The comparison table is less useful here than usual because the shape of the problem strongly suggests the answer; the interesting question is how to use StackSets well, not whether.
The StackSet architecture
The picks in depth
Service-managed permissions. Enable Organizations trusted access for CloudFormation StackSets (aws organizations enable-aws-service-access --service-principal member.org.stacksets.cloudformation.amazonaws.com), then delegate administration to the security account (aws organizations register-delegated-administrator ...; delegation is an Organizations call, not a CloudFormation one). Future StackSet operations run from the security account, not the management account. That is good hygiene, because the management account should hold minimal operational tooling.
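A minimal sketch of those two setup calls, wrapped in functions so they can be dry-run first. The function names and the `AWS` override variable are hypothetical conveniences; the account ID in the usage comment is a placeholder. Both commands are meant to run with management-account credentials.

```shell
#!/usr/bin/env bash
# Sketch only. AWS defaults to the real CLI; set AWS="echo aws" to print the
# commands instead of running them.
AWS=${AWS:-aws}
STACKSETS_PRINCIPAL="member.org.stacksets.cloudformation.amazonaws.com"

# Turn on Organizations trusted access for StackSets.
enable_trusted_access() {
  $AWS organizations enable-aws-service-access \
    --service-principal "$STACKSETS_PRINCIPAL"
}

# delegate_stacksets_admin ACCOUNT_ID
# Registers the given member account as delegated administrator for StackSets.
delegate_stacksets_admin() {
  local account_id=$1
  $AWS organizations register-delegated-administrator \
    --account-id "$account_id" \
    --service-principal "$STACKSETS_PRINCIPAL"
}

# Usage (placeholder account ID):
#   enable_trusted_access
#   delegate_stacksets_admin 111122223333
```

As far as I know, AWS's documentation also describes activating trusted access from CloudFormation itself (`aws cloudformation activate-organizations-access`); check which path your organisation prefers before scripting it.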
The baseline template. One CloudFormation template containing every resource the baseline requires. About 300 lines: GuardDuty enablement, Config recorder + delivery channel, CloudTrail with encryption and log-file validation, IAM roles for platform/security/break-glass with permissions boundaries, a ConfigAggregatorAuthorization to trust the central aggregator.
The template is parameterised for things that vary per account: Environment (from the OU), CostCentre (from the account tag), CentralLoggingBucketArn (same value everywhere but referenced via parameter for clarity). No per-account conditionals: if a variation is needed, it’s a second StackSet or a parameter override for specific accounts.
The StackSet definition. Created once in the security account (hence --call-as DELEGATED_ADMIN, which calls from a delegated administrator account must pass):
aws cloudformation create-stack-set \
  --stack-set-name security-baseline \
  --template-body file://baseline.yaml \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
  --capabilities CAPABILITY_NAMED_IAM \
  --call-as DELEGATED_ADMIN \
  --region eu-west-1
SERVICE_MANAGED enables Organizations integration. AutoDeployment means new accounts joining a targeted OU trigger a deployment; accounts leaving an OU have their stack instance deleted (because RetainStacksOnAccountRemoval=false).
Targeting. Three stack-instance create operations, one per OU, each run from the security account with --call-as DELEGATED_ADMIN:
aws cloudformation create-stack-instances \
  --stack-set-name security-baseline \
  --deployment-targets OrganizationalUnitIds=ou-aaaa-11111111 \
  --regions eu-west-1 eu-west-2 \
  --operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=0 \
  --call-as DELEGATED_ADMIN
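Target an OU ID and deploy to multiple Regions in parallel (most baselines have Region-specific resources like CloudTrail), with up to 25% of accounts in parallel and zero failure tolerance, so any account failure stops the operation. One caveat: under the default STRICT_FAILURE_TOLERANCE concurrency mode, actual concurrency is capped at the failure tolerance plus one, so zero tolerance means one account at a time per Region unless ConcurrencyMode=SOFT_FAILURE_TOLERANCE is also set. For production, zero failure tolerance; for sandbox, a higher tolerance is fine.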
Staged rollouts. Updates to the template use update-stack-set. To stage: first run the update with deployment targets limited to the Sandbox OU, wait a day, watch for issues, then run it against PreProduction, wait another day, then Production. This is an operational discipline, not a StackSets feature: the CLI supports it via separate operations, but you have to enforce the pattern yourself (or via a CI pipeline).
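One way a pipeline might enforce that sequence is sketched below: one function that updates a single OU and prints the operation ID, and one that blocks until the operation finishes. The function names, the `AWS` and `POLL_SECONDS` variables, and every OU ID other than ou-aaaa-11111111 are hypothetical; the Regions and preferences mirror the create-stack-instances call above.

```shell
#!/usr/bin/env bash
# Sketch only. Set AWS="echo aws" to print commands instead of running them.
AWS=${AWS:-aws}

# update_ou STACK_SET TEMPLATE OU_ID -> prints the operation ID
update_ou() {
  local stack_set=$1 template=$2 ou_id=$3
  $AWS cloudformation update-stack-set \
    --stack-set-name "$stack_set" \
    --template-body "file://$template" \
    --capabilities CAPABILITY_NAMED_IAM \
    --deployment-targets "OrganizationalUnitIds=$ou_id" \
    --regions eu-west-1 eu-west-2 \
    --operation-preferences FailureTolerancePercentage=0,MaxConcurrentPercentage=25 \
    --call-as DELEGATED_ADMIN \
    --query OperationId --output text
}

# wait_for_operation STACK_SET OPERATION_ID -> returns 0 only on SUCCEEDED
wait_for_operation() {
  local stack_set=$1 op_id=$2 status
  while :; do
    status=$($AWS cloudformation describe-stack-set-operation \
      --stack-set-name "$stack_set" --operation-id "$op_id" \
      --call-as DELEGATED_ADMIN \
      --query StackSetOperation.Status --output text) || return 1
    case $status in
      SUCCEEDED) return 0 ;;
      RUNNING|QUEUED|STOPPING) sleep "${POLL_SECONDS:-30}" ;;
      *) echo "operation $op_id ended as $status" >&2; return 1 ;;
    esac
  done
}

# Usage: one pipeline stage per OU, with the soak time between stages.
#   op=$(update_ou security-baseline baseline.yaml ou-aaaa-11111111) &&
#     wait_for_operation security-baseline "$op"
```

Keeping the soak period outside the script (as a pipeline approval gate, say) means a human decides when the next OU goes, which matches the "operational discipline" framing.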
Drift detection. aws cloudformation detect-stack-set-drift --stack-set-name security-baseline scans every stack instance and compares resource state to the template. It runs asynchronously; overall results are viewable via describe-stack-set-operation, and per-instance drift status via list-stack-instances. Schedule it weekly (for example, from an EventBridge scheduled rule that triggers the call); any DRIFTED instance triggers a notification to the security team’s SNS topic.
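A minimal sketch of the two halves of that check: starting detection, then listing the instances reported as DRIFTED once the operation has finished. The function names and the `AWS` override are hypothetical; the JMESPath filter assumes the `DriftStatus` field on each stack-instance summary.

```shell
#!/usr/bin/env bash
# Sketch only. Set AWS="echo aws" to print commands instead of running them.
AWS=${AWS:-aws}

# start_drift_detection STACK_SET -> prints the operation ID
start_drift_detection() {
  local stack_set=$1
  $AWS cloudformation detect-stack-set-drift \
    --stack-set-name "$stack_set" \
    --call-as DELEGATED_ADMIN \
    --query OperationId --output text
}

# list_drifted_instances STACK_SET -> one "account region" pair per line
list_drifted_instances() {
  local stack_set=$1
  $AWS cloudformation list-stack-instances \
    --stack-set-name "$stack_set" \
    --call-as DELEGATED_ADMIN \
    --query "Summaries[?DriftStatus=='DRIFTED'].[Account,Region]" \
    --output text
}

# Usage: run detection, wait for the operation to complete, then:
#   list_drifted_instances security-baseline
```

A non-empty result from the second function is the natural trigger for the SNS notification.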
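Drifted stacks are usually a sign of a manual change (someone edited a Config rule, disabled GuardDuty temporarily, “fixed” an IAM role). Remediation has to be deliberate: CloudFormation only updates resources whose template definition or parameters changed, so re-running update-stack-instances against an unchanged template generally does not revert an out-of-band edit. Restore the resource to the template’s values (by hand, or by shipping a template change that touches it), then re-run drift detection to confirm the instance is back IN_SYNC.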
Handling failures. StackSets reports operations with a matrix of per-account, per-Region results. Common failures: an account was suspended at the time of deployment (account state filter excludes these, but if not, they fail); an existing conflicting resource (e.g., GuardDuty already enabled manually, which StackSets does not handle unless the resource is imported or removed first); a Region the account never opted into. The failure-tolerance setting stops the cascade; the operation status shows which accounts failed, and a second attempt after fixing the underlying issue will retry.
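To read that matrix from the CLI, something like the following sketch pulls only the failed cells for one operation. The function name and `AWS` override are hypothetical; the operation ID is whatever the failed create or update call returned.

```shell
#!/usr/bin/env bash
# Sketch only. Set AWS="echo aws" to print the command instead of running it.
AWS=${AWS:-aws}

# list_failed_results STACK_SET OPERATION_ID
# Prints account, Region and failure reason for every failed stack instance.
list_failed_results() {
  local stack_set=$1 op_id=$2
  $AWS cloudformation list-stack-set-operation-results \
    --stack-set-name "$stack_set" \
    --operation-id "$op_id" \
    --call-as DELEGATED_ADMIN \
    --query "Summaries[?Status=='FAILED'].[Account,Region,StatusReason]" \
    --output text
}
```

The StatusReason column is usually enough to tell a pre-existing resource conflict from a Region opt-in problem without logging into the member account.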
Exempting an account. The baseline should apply everywhere, but sometimes an account needs an exemption (e.g., a licensed-software account where GuardDuty findings are handled by the vendor). Two patterns:
- Account filter on the StackSet: Accounts deployment target with AccountFilterType=DIFFERENCE, which deploys to the OU minus these specific account IDs.
- Parameter override per account: the template reads a parameter EnableGuardDuty, and a create-stack-instances or update-stack-instances call scoped to the exempted accounts passes --parameter-overrides setting EnableGuardDuty=false. (Overrides are a stack-instance option, not part of operation-preferences.)
The filter is cleaner; the parameter override is more flexible. Both are documented changes, audit-trailed.
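Both exemption patterns can be sketched as CLI wrappers. The function names, the `AWS` override and the placeholder account ID are hypothetical; EnableGuardDuty is the template parameter named above, and the INTERSECTION filter is one assumed way to scope an override to a single account inside an OU under service-managed permissions.

```shell
#!/usr/bin/env bash
# Sketch only. Set AWS="echo aws" to print commands instead of running them.
AWS=${AWS:-aws}

# deploy_ou_except STACK_SET OU_ID EXEMPT_ACCOUNT_ID
# Pattern 1: deploy to every account in the OU except the exempt one.
deploy_ou_except() {
  local stack_set=$1 ou_id=$2 exempt=$3
  $AWS cloudformation create-stack-instances \
    --stack-set-name "$stack_set" \
    --deployment-targets "OrganizationalUnitIds=$ou_id,Accounts=$exempt,AccountFilterType=DIFFERENCE" \
    --regions eu-west-1 eu-west-2 \
    --call-as DELEGATED_ADMIN
}

# override_guardduty_off STACK_SET OU_ID ACCOUNT_ID
# Pattern 2: keep the stack in the account but switch GuardDuty off there.
override_guardduty_off() {
  local stack_set=$1 ou_id=$2 account=$3
  $AWS cloudformation update-stack-instances \
    --stack-set-name "$stack_set" \
    --deployment-targets "OrganizationalUnitIds=$ou_id,Accounts=$account,AccountFilterType=INTERSECTION" \
    --regions eu-west-1 eu-west-2 \
    --parameter-overrides ParameterKey=EnableGuardDuty,ParameterValue=false \
    --call-as DELEGATED_ADMIN
}

# Usage (placeholder account ID):
#   deploy_ou_except security-baseline ou-aaaa-11111111 111122223333
```

Either call should live in the same repository as the template, so the exemption shows up in review rather than only in CloudTrail.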
A worked update: adding VPC Flow Logs
The security team wants VPC Flow Logs on every default VPC.
- PR in the Git repo that holds baseline.yaml adds the Flow Logs resource with a conditional (Condition: HasDefaultVPC).
- CI runs cfn-lint, cfn-nag, and a dry-run validate-template. Pass.
- Reviewer approves; merge.
- CI pipeline calls update-stack-set targeting ou-Sandbox first. Operation runs, 8 accounts updated, no failures.
- Soak time: 24 hours. Operator watches Security Hub and the accounts’ own dashboards. No issues.
- Pipeline runs again targeting ou-PreProduction. Same story.
- Finally targeting ou-Production. Zero failure tolerance and 25% concurrency: accounts deploy in batches of at most a quarter of the tranche (one at a time under the default strict concurrency mode, which caps concurrency at the failure tolerance plus one). If any fails, the operation aborts and the security team decides.
- Two days from PR to fully-rolled-out change; all 20 accounts at v5. New accounts joining any OU get v5 on day one.
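The CI gate in the second step could look roughly like this. It is a sketch under assumptions: `cfn_nag_scan --input-path` is, to my knowledge, how cfn-nag is invoked, and the `CFN_LINT`, `CFN_NAG` and `AWS` override variables and the `preflight` name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Each tool can be overridden through its variable, which also
# makes the gate easy to exercise without the tools installed.
AWS=${AWS:-aws}
CFN_LINT=${CFN_LINT:-cfn-lint}
CFN_NAG=${CFN_NAG:-cfn_nag_scan}

# preflight TEMPLATE -> returns non-zero at the first failing check
preflight() {
  local template=$1
  # Syntax and resource-property linting.
  $CFN_LINT "$template" || return 1
  # Security-pattern scan.
  $CFN_NAG --input-path "$template" || return 1
  # Server-side validation; this checks the template, it deploys nothing.
  $AWS cloudformation validate-template \
    --template-body "file://$template" >/dev/null || return 1
}

# Usage:
#   preflight baseline.yaml && echo "template passed all checks"
```

Only a template that passes all three checks should reach the first update-stack-set call against the Sandbox OU.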
A worked failure
Production account ledger-prod has a pre-existing GuardDuty detector (manually enabled six months ago by a panicked response to a security incident). The StackSet update v3 -> v4 wants to create a GuardDuty detector that conflicts.
ledger-prod is the first account deployed in the Production tranche. CloudFormation fails: GuardDuty detector already exists. Zero failure tolerance, so the operation aborts. The other 14 Production accounts are untouched; they’re still on v3.
Remediation:
- Operator fixes ledger-prod by hand so that the baseline stack can own the detector: either bring the existing detector under CloudFormation through a resource-import change set (create-change-set --change-set-type IMPORT), or record its settings and delete the standalone detector so the next deployment creates it. One account, hand-fixed.
- Resume the StackSet operation: update-stack-instances --stack-set-name security-baseline --deployment-targets ... --regions ..., re-running just Production.
- Now all 15 Production accounts move from v3 to v4: the 14 untouched accounts plus ledger-prod, whose retry now succeeds.
The zero-failure-tolerance saved us from a cascading problem (imagine 14 accounts all had manually-created detectors and all failed in parallel; the first failure stops the damage). The trade is the operational overhead of fixing the one account and resuming.
What’s worth remembering
- Service-managed StackSets are the Organizations-era default. Enable trusted access, delegate to the security account, deploy from there. Self-managed is for the pre-Organizations days.
- Auto-deployment turns the OU into the source of truth. New accounts joining a targeted OU get the stack automatically. Account removals clean up the stack instances. Less “did we remember to deploy to the new account” because the answer is always yes.
- Failure tolerance plus concurrency is the rollout dial. Zero tolerance + low concurrency for production-critical changes; higher tolerance + higher concurrency for routine updates to non-prod.
- Staged rollouts are operational discipline, not a feature. Sandbox first, PreProduction second, Production last. A CI pipeline enforces the order; human review gates each step.
- Drift detection is the rot-prevention tool. Schedule it weekly. Anything drifted becomes a ticket; remediation means restoring the resource to match the template, because redeploying an unchanged template generally does not revert manual edits.
- Region list matters. Many baselines deploy to multiple Regions. StackSets handles Region concurrency (PARALLEL or SEQUENTIAL) separately from account concurrency.
- Parameter overrides let specific accounts differ. Useful for exemptions, less useful as a first instinct; prefer templates that work everywhere over templates with per-account parameters.
- Delegated admin moves operations out of the management account. The management account should own Organizations itself and very little else; a delegated admin for StackSets puts the deployments where the security team operates day-to-day.
One template, one StackSet, one OU target, many accounts. The manual click-through vanishes; the drift vanishes with it; the new-account-enrolment is automatic. The Pro-level part is knowing when to stage, how to handle failures, and how to keep the set from rotting as the organisation grows.