How to Reconcile CloudFormation Stack Drift Without Losing Fixes

August 21, 2028 · 15 min read

DevOps Engineer Pro · DOP-C02 · part of The Exam Room

The situation

A platform team inherits forty CloudFormation stacks in a single account:

  • ~1,200 resources in total across networking, IAM, ECS, RDS, S3, Lambda, Route 53.
  • Stacks were originally authored 3-5 years ago; several owners have rotated out.
  • Drift report requested by the auditor: DetectStackDrift run on each stack reveals 80 drifted resources across 22 stacks.
  • Of the 80: some are incident fixes (an SG rule added during an incident and never re-declared), some are accidents (a console edit to a Lambda config), and some are side effects of runtime operations (an ECS service scaled manually, an Auto Scaling group resized).

The asks:

  • Drift must be reconciled. Every live resource either matches its template or is explicitly not managed.
  • No silent overwrites. If drift represents an intentional fix, it must go back into the template, not get reverted.
  • Template alignment. Each stack’s template must be the source of truth going forward.
  • Prevent future drift where possible; the auditor wants to know what controls exist.
  • Reporting that answers “what drifted, when, and what did we do about it” for every drift event in the past year.

What actually matters

Drift has three shapes worth distinguishing, and they want different remediations.

Out-of-band resource modification. A resource the stack thinks it owns has been changed outside the deployment tool: a security group rule added via the console, a Lambda environment variable edited through the API. The resource is still “in” the stack as far as the tool knows; its configuration has moved. This is the most common shape and the one drift detection names directly.

Resources added outside the stack. A resource sits next to a stack (created by a runtime process, a console click, or another tool) that logically belongs to the stack’s domain but isn’t managed by it. Drift detection doesn’t flag these because they’re not in the stack at all; they need to be brought in to become managed.

Parameter and property drift that affects computed values. A stack parameter was changed (or a resource property the parameter drives) and the downstream resources no longer reflect what the template would produce. The drift isn’t in a single resource; it’s in the relationship between parameter and template. The remedy is to preview the change before applying it, not to revert.

The remediation options pair with those shapes.

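Revert through the stack. Push the drifted resource back to the template’s declaration. CloudFormation diffs a submitted template against the stack’s last-deployed template, not against live state, so an UpdateStack with an unchanged template and parameters fails with “No updates are to be performed” and leaves the drift in place. To revert through the stack, first deploy a template that records the live value, then deploy the intended value again; the second update pushes the resource back. Simple, effective, destructive if the drift was an intentional fix. Useful for genuine accidents.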

Re-declare the drift in the template. Edit the template to include the changes that happened out-of-band, then re-apply. CloudFormation applies the template change to a resource that is already in that state, so the live resource does not change and the resources that used to drift now match. The template catches up to reality. Useful for intentional fixes.

Bring adjacent resources under management. A specific import operation adopts existing, unmanaged resources into a stack without destroying them. Useful for the “created outside” case; lets the team add a console-created resource to a stack rather than rebuilding it.

Preview before applying for parameter-driven drift. A change-set preview shows exactly which resources will change and, critically, which changes require replacement of a stateful resource. Review before executing; protects against destructive cascades from what looks like a minor parameter change.

The second useful frame is what to do after reconciliation to prevent recurrence.

  • Resource-level update restrictions on a stack block specific update actions (replace, delete) on named resources; useful for production databases that should never be replaced inside a routine update.
  • IAM policies that deny console-level edits to stack-managed resources force changes through the deployment tool. Org-wide service control policies can enforce this at scale.
  • Scheduled drift detection with alarms means drift is caught within hours rather than at audit time.
  • A compliance rule for drift lets drift state appear in the same posture dashboard as other compliance signals.

What we’ll filter on

Filtering the remediation approaches:

  1. Preserves intentional out-of-band fixes (the drift is now correct; don’t revert).
  2. Brings adjacent resources under stack control (unmanaged resources become managed).
  3. Non-destructive. Resources aren’t destroyed and recreated as a side effect.
  4. Auditable. The reconciliation is recorded as a stack operation with the change visible.
  5. Prevents recurrence. Future drift is caught or blocked.

The drift-remediation landscape

1. Ignore drift. Status quo. Fails the auditor, fails the “stack is truth” requirement. Rejected.

2. Delete-and-recreate stack. Nuclear option. Destroys every drifted resource (and every matching resource) and rebuilds from the template. Destroys data; causes outages; fails the non-destructive requirement.

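3. UpdateStack to revert drift. CloudFormation compares templates, not live state, so re-submitting the current template unchanged fails with “No updates are to be performed” and reverts nothing. Reverting through the stack takes two updates: one that declares the drifted value, then one that restores the intended value. Simple, but destructive of intentional fixes. Use only when drift is known to be accidental.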

4. Edit template to capture drift + UpdateStack. The intentional-fix path. The team reviews each drifted resource, decides whether the change should persist, and updates the template to match. UpdateStack then applies a template change that live state already satisfies, so the resource itself does not change and the next drift detection reports it in sync. Captures reality; non-destructive.

5. CreateChangeSet --change-set-type IMPORT for unmanaged resources. Imports an existing resource into a stack; the resource is now managed by CloudFormation, can be updated through stack updates, and appears in drift detection. Import is governed by what CloudFormation supports for the resource type (most common types are supported).

6. CreateChangeSet for parameter drift. Parameter change → change set preview → review → execute. The change set shows an action (“Add,” “Modify,” or “Remove”) per resource, plus whether a modification requires replacement; replacements on stateful resources (RDS, EBS) are the red flags to catch before execution.

7. CloudFormation stack policy + IAM + scheduled drift detection. The preventive layer. Stack policies block modifications on designated resources; IAM policies route edits through CloudFormation; scheduled drift detection catches anything that slips through.

Side by side

Option                                     Preserves fixes   Brings in unmanaged   Non-destructive   Auditable   Prevents recurrence
Ignore
Delete-and-recreate
UpdateStack revert                                                                 ✓ (usually)
Edit template + UpdateStack
Change-set IMPORT
CreateChangeSet (param)
Stack policy + IAM + scheduled detection

No single option covers every drift; the answer is a toolkit plus a triage process, with the preventive layer on top.

A drift triage flow

[Figure: drift triage flow in four columns. Input (DetectStackDrift per stack, per-resource drift detail, console-found unmanaged resources, parameter changes) → Triage question (intentional or accidental? does it belong to the stack? are stateful resources affected?) → Remediation (edit template, UpdateStack revert, CreateChangeSet IMPORT, CreateChangeSet preview) → Prevention layer (stack policy, IAM/SCP, scheduled detection, drift alarm, Config rule).]
Drift detection is routine; the triage question determines which of four remediation paths to take, and prevention runs alongside.

The picks in depth

The triage. For each of the 80 drifted resources, the team asks three questions in sequence:

  1. Was this drift intentional? If yes (an incident fix, a deliberate tuning change), capture the change in the template. The resulting stack update leaves the live resource as it is, and the next drift detection reports it in sync.
  2. If no, is it destructive to revert? If the drift is on a stateless resource (Lambda configuration, security group rule), reverting through a stack update is safe. If the drift is on a stateful resource (RDS parameter group), run a change set first to check for replacement.
  3. Is there an adjacent unmanaged resource that should be part of the stack? If yes, CreateChangeSet --change-set-type IMPORT.
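The three questions above can be sketched as a small routing function. This is a minimal illustration, not part of any AWS SDK: `DriftItem`, `STATEFUL_TYPES`, and the returned path names are hypothetical, and the `intentional` flag is assumed to come from a human review rather than from the API.

```python
from dataclasses import dataclass

# Hypothetical list of types the team treats as stateful; extend to taste.
STATEFUL_TYPES = {
    "AWS::RDS::DBInstance",
    "AWS::RDS::DBParameterGroup",
    "AWS::EC2::Volume",
}


@dataclass
class DriftItem:
    logical_id: str
    resource_type: str
    in_stack: bool      # False for an adjacent, unmanaged resource
    intentional: bool   # answered by a person, e.g. from the incident ticket


def triage(item: DriftItem) -> str:
    """Map one reviewed item to one of the four remediation paths."""
    if not item.in_stack:
        return "import"                 # adopt via an IMPORT change set
    if item.intentional:
        return "capture-in-template"    # re-declare the fix
    if item.resource_type in STATEFUL_TYPES:
        return "change-set-first"       # preview before reverting
    return "revert"                     # accidental drift, stateless resource
```

Running every reviewed item through one function like this keeps the eighty decisions consistent and gives the auditor a single place to read the policy.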

Intentional-fix capture. The common case is a security group rule added during an incident. The template has three ingress rules; live state has four. The engineer edits the template to include the fourth rule with a comment citing the incident ticket, commits, and runs UpdateStack. CloudFormation compares the new template with the previous one, sees the added rule, and applies it to a security group that already has it; live state does not change. Drift status clears on the next DetectStackDrift. Because the update re-applies a setting that already exists, it is worth rehearsing on a non-production stack for resource types the team has not captured this way before.
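When capturing a fix, it helps to turn the drift detail into a list of template edits. The sketch below assumes a drift record shaped like the entries boto3 returns from `describe_stack_resource_drifts` (`PropertyDifferences` with `PropertyPath`, `DifferenceType`, `ExpectedValue`, `ActualValue`); the function name and output wording are illustrative only.

```python
def describe_differences(drift: dict) -> list[str]:
    """Return one human-readable line per property that needs re-declaring."""
    lines = []
    for diff in drift.get("PropertyDifferences", []):
        path = diff["PropertyPath"]
        expected = diff.get("ExpectedValue")   # what the template declares
        actual = diff.get("ActualValue")       # what live state holds
        kind = diff["DifferenceType"]          # ADD, REMOVE or NOT_EQUAL
        if kind == "ADD":
            lines.append(f"{path}: declare {actual}")
        elif kind == "REMOVE":
            lines.append(f"{path}: drop {expected}")
        else:
            lines.append(f"{path}: change {expected} -> {actual}")
    return lines
```

The output is a checklist for the template edit, and it can be pasted into the commit message next to the incident ticket reference.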

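Accidental revert. A developer changed a Lambda function’s memory from 512 MB to 1024 MB in the console during debugging and forgot to revert. Re-submitting the unchanged template does not fix this: UpdateStack fails with “No updates are to be performed” because the template itself has not changed. Either set the memory back to 512 MB directly, or run two stack updates (declare 1024 MB, then restore 512 MB) so the revert is recorded as a stack operation. Non-destructive for Lambda and for most other resource types. Verify drift is truly accidental first; a 30-second conversation with the developer prevents reverting a fix.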

Import for adjacent resources. The team finds a Route 53 record created outside the DNS stack. The record isn’t drift on any existing stack resource; it’s a gap. The fix is CreateChangeSet with --change-set-type IMPORT --resources-to-import file://import.json, where the JSON body names the logical resource ID in the template and the physical identifier of the record. Execute the change set; the record is now a stack-managed resource.

[
  {
    "ResourceType": "AWS::Route53::RecordSet",
    "LogicalResourceId": "MetricsAlias",
    "ResourceIdentifier": {
      "HostedZoneId": "Z1234567890ABC",
      "Name": "metrics.internal.",
      "Type": "A"
    }
  }
]

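The template must already include the MetricsAlias resource definition, with a DeletionPolicy attribute and properties that match the live record; the import binds the logical ID to the physical resource. Identifier keys differ by resource type, so confirm them (and that the type supports import at all) with GetTemplateSummary, which lists the expected keys under ResourceIdentifierSummaries. Any subsequent UpdateStack manages the record normally.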
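The same import can be driven from code. A minimal sketch, assuming `cfn` is a boto3 CloudFormation client passed in by the caller; `build_import_request` and `import_resources` are hypothetical helper names, and the change set still needs review and ExecuteChangeSet afterwards.

```python
REQUIRED_KEYS = {"ResourceType", "LogicalResourceId", "ResourceIdentifier"}


def build_import_request(stack_name, change_set_name, template_body, resources):
    """Build keyword arguments for CreateChangeSet with ChangeSetType=IMPORT."""
    for resource in resources:
        missing = REQUIRED_KEYS - resource.keys()
        if missing:
            raise ValueError(f"import entry is missing {sorted(missing)}")
    return {
        "StackName": stack_name,
        "ChangeSetName": change_set_name,
        "ChangeSetType": "IMPORT",
        "TemplateBody": template_body,       # must already define the logical IDs
        "ResourcesToImport": list(resources),
    }


def import_resources(cfn, stack_name, change_set_name, template_body, resources):
    """Create the import change set and return its ID for review."""
    request = build_import_request(
        stack_name, change_set_name, template_body, resources
    )
    return cfn.create_change_set(**request)["Id"]
```

Validating the entries before the API call surfaces a malformed import file locally instead of as a failed change set.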

Change set first for stateful drift. The RDS parameter group drift case. The parameter group is attached to production databases; an update that requires a replacement would mean the databases get a new parameter group attached, which (depending on the parameter) triggers a reboot. CreateChangeSet with the template update returns a preview in which ParameterGroupA appears with Action: Modify and a Replacement value of False, True, or Conditional. A Modify without replacement is safe to execute; a replacement requires a maintenance window conversation first.

The change-set preview is the safety net that prevents “I thought this was a minor tweak” incidents. Treat every update that touches stateful resources as change-set-first; treat stateless updates as straight UpdateStack once the team is comfortable.
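The review step can be partly automated. This sketch assumes a response shaped like boto3's `describe_change_set` output (`Changes` entries carrying a `ResourceChange` with `Action`, `ResourceType`, and a string-valued `Replacement`); `STATEFUL_TYPES` and `risky_changes` are hypothetical names.

```python
# Hypothetical list of types that should never be replaced without review.
STATEFUL_TYPES = {
    "AWS::RDS::DBInstance",
    "AWS::RDS::DBParameterGroup",
    "AWS::EC2::Volume",
}


def risky_changes(change_set: dict, stateful_types=STATEFUL_TYPES) -> list[str]:
    """List stateful resources the change set would replace or remove."""
    risky = []
    for change in change_set.get("Changes", []):
        rc = change.get("ResourceChange", {})
        if rc.get("ResourceType") not in stateful_types:
            continue
        # "Conditional" means CloudFormation cannot rule out a replacement.
        replaced = rc.get("Replacement") in ("True", "Conditional")
        removed = rc.get("Action") == "Remove"
        if replaced or removed:
            risky.append(rc.get("LogicalResourceId"))
    return risky
```

A pipeline could refuse to execute the change set whenever this list is non-empty and route it to a maintenance-window review instead.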

Prevention: stack policies, IAM, scheduled detection

Stack policies are JSON attached to a stack that restrict what subsequent stack operations can modify. A common policy:

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "Update:*",
      "Principal": "*",
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": ["Update:Replace", "Update:Delete"],
      "Principal": "*",
      "Resource": "LogicalResourceId/ProdDatabase"
    }
  ]
}

Even a stack operator can’t replace or delete ProdDatabase via stack update without first removing the deny via SetStackPolicy. Useful for “this database exists forever” declarations.
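With forty stacks, the policy is easier to generate than to hand-write. A minimal sketch that builds the same document for a list of logical IDs and applies it; `cfn` is assumed to be a boto3 CloudFormation client, and both function names are hypothetical.

```python
import json


def protect_resources(logical_ids) -> str:
    """Return a stack policy body denying replace/delete on the given IDs."""
    statements = [
        {"Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*"}
    ]
    for logical_id in logical_ids:
        statements.append({
            "Effect": "Deny",
            "Action": ["Update:Replace", "Update:Delete"],
            "Principal": "*",
            "Resource": f"LogicalResourceId/{logical_id}",
        })
    return json.dumps({"Statement": statements}, indent=2)


def apply_policy(cfn, stack_name, logical_ids):
    """Attach the generated policy to one stack."""
    cfn.set_stack_policy(
        StackName=stack_name,
        StackPolicyBody=protect_resources(logical_ids),
    )
```

Keeping the protected logical IDs in version control next to the templates makes the “exists forever” list reviewable like any other change.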

IAM and SCPs enforce that stack-managed resources can only be changed via CloudFormation. An SCP on the Organization that denies rds:Modify* except when aws:CalledVia is cloudformation.amazonaws.com forces RDS changes through stack updates. Strong but noisy; appropriate for regulated environments where drift is a compliance incident.
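One possible shape for such an SCP, built as a Python dictionary so it can be rendered to JSON. Treat it as a sketch to test in a sandbox OU, not a vetted policy: the function name and Sid are hypothetical, and it assumes CloudFormation acts with the caller's own credentials. aws:CalledVia is not populated when a stack deploys through a service role, so that setup would need a different condition.

```python
import json


def cfn_only_scp(actions=("rds:Modify*",)) -> dict:
    """Deny the given actions unless the request chain includes CloudFormation."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessViaCloudFormation",
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": "*",
                "Condition": {
                    # True for direct calls (key absent) and for chains that
                    # never pass through CloudFormation, so those are denied.
                    "ForAllValues:StringNotEquals": {
                        "aws:CalledVia": ["cloudformation.amazonaws.com"]
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(cfn_only_scp(), indent=2))
```

Starting with a narrow action list and widening it as the noise becomes manageable is usually easier than rolling back a broad deny.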

Scheduled drift detection. An EventBridge rule runs daily at 06:00 UTC, invoking a Lambda that calls DetectStackDrift on every stack and publishes a CloudWatch metric (DriftDetected = 0 or 1) per stack. CloudWatch alarms fire when the metric goes to 1; Slack receives a message with the stack name and a link to the drift detail. The team notices drift within 24 hours rather than at audit time.
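The core of that Lambda might look like the sketch below. It assumes `cfn` and `cloudwatch` are boto3 clients supplied by the handler, that the caller provides the stack names, and that the metric namespace `Custom/Drift` is a placeholder; the function names are hypothetical. Drift detection is asynchronous, so the code polls until the run finishes.

```python
import time


def detect_drift(cfn, stack_name, poll_seconds=5, max_polls=60) -> bool:
    """Run drift detection on one stack and return True if it has drifted."""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)[
        "StackDriftDetectionId"
    ]
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection timed out for {stack_name}")
    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "detection failed"))
    return status["StackDriftStatus"] == "DRIFTED"


def publish(cloudwatch, stack_name, drifted, namespace="Custom/Drift"):
    """Publish DriftDetected = 0 or 1 with the stack name as a dimension."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[{
            "MetricName": "DriftDetected",
            "Dimensions": [{"Name": "StackName", "Value": stack_name}],
            "Value": 1 if drifted else 0,
        }],
    )


def check_stacks(cfn, cloudwatch, stack_names, poll_seconds=5) -> dict:
    """Check every stack, publish one metric each, and return the results."""
    results = {}
    for name in stack_names:
        results[name] = detect_drift(cfn, name, poll_seconds=poll_seconds)
        publish(cloudwatch, name, results[name])
    return results
```

With forty stacks, sequential polling can approach the Lambda timeout; starting all detections first and polling afterwards is a reasonable refinement.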

AWS Config rule cloudformation-stack-drift-detection-check provides similar coverage via Config, with the drift state appearing in Config’s compliance dashboard alongside other compliance rules. Useful if Config is already the compliance data source of record.

What’s worth remembering

  1. Three shapes of drift, different remediation paths. Out-of-band modification, adjacent unmanaged resource, parameter-driven computed drift. Each needs a different fix.
  2. Intentional drift goes into the template. An incident fix that worked is part of the design now; update the template to reflect it. Reverting an intentional fix because it “drifted” is the wrong move.
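  3. Reverting drift through the stack is destructive of intentional fixes. UpdateStack with an unchanged template reverts nothing (it fails with “No updates are to be performed”); a revert needs a template change that touches the drifted property, or a direct correction on the resource. Only safe when drift is confirmed accidental on stateless resources.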
  4. CreateChangeSet with IMPORT brings adjacent resources into a stack. Requires the template to include the resource definition; the import binds the logical ID to the physical resource. Non-destructive.
  5. Change-set-first for stateful resources. The preview reveals replacements before execution. A Modify with Replacement: True on an RDS instance is a very different operation from an in-place Modify.
  6. Stack policies block specific update actions. Deny Update:Delete on production databases means accidental template changes don’t cascade into outages. Complement with IAM and change-management discipline.
  7. Scheduled drift detection closes the loop. EventBridge + Lambda + CloudWatch alarm catches drift within 24 hours. Config rules offer an alternative path via compliance tooling.
  8. Ignoring drift is the expensive option. Each week of un-reconciled drift increases the cost of reconciliation and the risk of an incident blaming “infrastructure as code” for something infrastructure-as-code didn’t manage.

Drift detection is trivial; the API call is free and runs in minutes. Drift remediation is the triage question repeated eighty times: was this intentional, is the revert destructive, does it belong in the stack? Templates become the truth again one deliberate reconciliation at a time; stack policies and scheduled detection keep them there.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.