Detecting Drift With AWS Config

The situation

A platform team runs ~180 AWS accounts in an AWS Organizations structure. Every account has CloudTrail flowing to a central logs bucket. Infrastructure is defined in Terraform and deployed through a pipeline that runs terraform plan + policy checks before apply. The security team knows:

The pipeline is watertight. Policy-as-code rules in conftest block PRs that open CIDR-wide ingress rules, create public S3 buckets, or turn off EBS encryption. If the pipeline is the only path, these things can’t happen.
The pipeline is not the only path. Console access exists for on-call incident response. IAM allows named engineers to log in and make changes in production accounts during incidents. Occasionally, “let me just tweak this one thing” becomes a Terraform drift that never gets reconciled. Occasionally worse.
The Saturday incident. An engineer triaging an on-call page added 0.0.0.0/0:22 to a security group to let themselves in from a cafe Wi-Fi without bothering with the VPN. Didn’t remove it. The Terraform drift-detection job didn’t run until Monday morning. Roughly 54 hours of exposure.

The team wants a mechanism that’s independent of the pipeline. It catches drift whether it came through Terraform or through a console click. It records it, alerts on it, and, for a short list of specific drift types, automatically reverses it. The candidates are AWS Config, CloudTrail Insights, Security Hub, and some combination of the three. The question is which does what, and how.

What actually matters

Before picking a mechanism, it’s worth naming what catching drift independently of the pipeline actually requires.

The first thing is the difference between a state log and an event log. An event log records “who called what API when”; a state log records “what does this resource look like after each change”. Both are useful, but they answer different questions. To detect drift you need the state (what is the current configuration of this resource?), and to attribute drift you need the event (who changed it, and when?). Anything serious needs both.

The second thing is a policy layer that evaluates the state. “This resource should not have ingress from 0.0.0.0/0 on 22” is a statement about resource shape; it needs a place to live, a way to run against every relevant resource, and a verdict (compliant / non-compliant) that’s machine-readable. The mechanism needs a library of common policies out of the box (because most teams want the same dozen rules), plus an extension point for organisation-specific rules.
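As a minimal sketch of what "a statement about resource shape with a machine-readable verdict" can look like, the function below checks a simplified security-group dictionary for world-open SSH. The dictionary layout (`ingress`, `from_port`, `to_port`, `cidrs`) and the function name are hypothetical, chosen for illustration; they are not the format AWS Config records.

```python
from typing import Any

WORLD_CIDRS = {"0.0.0.0/0", "::/0"}


def evaluate_ssh_ingress(security_group: dict[str, Any]) -> str:
    """Return "NON_COMPLIANT" if any ingress rule exposes port 22 to the world."""
    for rule in security_group.get("ingress", []):
        covers_ssh = rule["from_port"] <= 22 <= rule["to_port"]
        open_to_world = bool(WORLD_CIDRS & set(rule["cidrs"]))
        if covers_ssh and open_to_world:
            return "NON_COMPLIANT"
    return "COMPLIANT"


# Hypothetical resource shapes: one drifted, one as the pipeline intended.
drifted = {"id": "sg-example", "ingress": [{"from_port": 22, "to_port": 22, "cidrs": ["0.0.0.0/0"]}]}
expected = {"id": "sg-example", "ingress": [{"from_port": 22, "to_port": 22, "cidrs": ["10.0.0.0/8"]}]}

print(evaluate_ssh_ingress(drifted))   # NON_COMPLIANT
print(evaluate_ssh_ingress(expected))  # COMPLIANT
```

The point of the sketch is the shape of the contract: resource state in, one of a fixed set of verdicts out, with no knowledge of who made the change or how.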

The third thing is the gap between detection and reversal. Detecting the drift is only half the job; closing it requires either a human in the loop or an automated step that takes the resource back to the expected shape. Automatic reversal is what gets the time-to-recovery from “the next time someone looks” down to “before the next coffee.” It only works for drift types that have an unambiguous corrective action (revoke this ingress rule, re-enable encryption, attach this missing tag) and the operational courage to let a robot make production changes.

The fourth thing is the aggregation problem. Detection that lives in one account per finding is fine when you have one account; with hundreds of accounts, “which accounts are non-compliant with which rules?” is a dashboard problem that needs a central pane of glass. The mechanism needs to fan compliance state up to an organisation-wide aggregator, or you’ll catch drift one account at a time and only when someone happens to look.

The fifth thing is cost shape. State-tracking is metered per configuration item recorded and per evaluation run, both small numbers individually that add up across hundreds of accounts with thousands of resources. The bill is real but typically modest; the lever the team controls is which resource types get recorded. Recording everything is the easy default and the most expensive one; curating the list trades cost against the risk of missing something.
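A back-of-the-envelope sketch of that cost shape follows. Only the 180 accounts come from the scenario; the unit prices and per-account volumes are placeholder assumptions for illustration, not quoted AWS prices. Substitute figures from the current AWS Config pricing page and your own recorder metrics.

```python
def monthly_config_cost(
    accounts: int,
    items_per_account: int,
    evaluations_per_account: int,
    price_per_item: float,
    price_per_evaluation: float,
) -> float:
    """Estimate a monthly bill: items and evaluations priced per unit, summed over accounts."""
    per_account = (
        items_per_account * price_per_item
        + evaluations_per_account * price_per_evaluation
    )
    return accounts * per_account


# Placeholder volumes and prices (assumed); 180 accounts is from the scenario.
record_everything = monthly_config_cost(180, 2_000, 5_000, 0.003, 0.001)
curated_types = monthly_config_cost(180, 600, 5_000, 0.003, 0.001)

print(f"record everything: ${record_everything:,.2f}")
print(f"curated types:     ${curated_types:,.2f}")
```

Under these assumed inputs the evaluation charge is unchanged by curation; only the configuration-item term shrinks, which is why the recorded resource-type list is the lever.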

The sixth thing is detection latency. Different mechanisms detect changes at different speeds. A scheduled drift-detection job that runs daily is hours-to-days behind; a state-tracking service that evaluates on change is under a minute; a direct subscription to the event stream is seconds. Faster detection costs more and surfaces more noise; the right tier depends on how high-value the specific drift type is and how often it’s likely to occur.

What we’ll filter on

Filters for each drift-detection approach:

Detects the change: does the mechanism see what happened?
Evaluates against policy: does it compare to “what should be”?
Time from change to detection: seconds, minutes, hours, days?
Can reverse automatically: does it fix without a human?
Scales across accounts: organisation-wide deployment and aggregation?
Per-resource-type configurability: can we skip what we don’t care about?

The detection landscape

Terraform drift detection job. Scheduled terraform plan across all state files; compares desired state to actual. Catches most drift types the team cares about. Runs on a schedule (hourly at best, daily in reality). Doesn’t catch changes to resources outside Terraform state. The Saturday incident fell between runs.
CloudTrail Insights. Anomaly detection on API call patterns. Flags unusual rates or unusual calls from an identity. Good for “this IAM user just made 200 AuthorizeSecurityGroupIngress calls” (attack-shape anomalies); not built for “this one rule opened port 22 to the world” (specific policy violations).
AWS Config with AWS-managed rules. Configuration-change-triggered evaluation against managed rules. AWS maintains the rule logic; you attach it to your recording. restricted-ssh flags any security group with inbound 22 from 0.0.0.0/0 or ::/0. Near-real-time evaluation (typically under a minute from change to rule fire).
AWS Config with custom Lambda rules. Same framework, your code. Useful for organisation-specific policies AWS doesn’t ship a rule for: “no RDS snapshot may be shared with an account outside our org,” “every S3 bucket must have a tag Owner.” Write Lambda, attach rule, profit.
AWS Config with auto-remediation. Any Config rule can trigger a Systems Manager Automation document. Managed remediations like AWSConfigRemediation-RevokeUnusedIAMUserCredentials cover common cases; custom Automation documents handle everything else. Execution is in the account where the finding landed, using a role with the needed permissions.
Security Hub with AWS Foundational Security Best Practices. A curated rule set (hundreds of rules) implemented via Config rules under the hood, with deduplication and scoring. The “turn on the whole package” version of a Config rule library. Integrates with SNS/EventBridge for ticketing.
Event-driven via EventBridge rule + Lambda. Direct subscription to CloudTrail events; a Lambda inspects the event and reacts. Fastest possible detection (seconds), but the “policy” is bespoke Lambda code, not a declarative rule. Useful for very specific, very high-value checks; doesn’t replace the full landscape a Config-rule library gives.
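For the custom Lambda rule option, a minimal sketch of the “every S3 bucket must have a tag Owner” check might look like the following. It assumes a change-triggered rule whose event carries the configuration item inline (oversized-item notifications, which carry only a summary, are not handled), and the function names are illustrative. The decision logic is kept separate from the AWS call so it can be exercised without credentials.

```python
import json
from typing import Any


def evaluate_owner_tag(configuration_item: dict[str, Any]) -> str:
    """Return a Config compliance type for one recorded resource."""
    if configuration_item.get("resourceType") != "AWS::S3::Bucket":
        return "NOT_APPLICABLE"
    if configuration_item.get("configurationItemStatus") == "ResourceDeleted":
        return "NOT_APPLICABLE"
    tags = configuration_item.get("tags") or {}
    # An empty Owner value is treated the same as a missing tag.
    return "COMPLIANT" if tags.get("Owner") else "NON_COMPLIANT"


def lambda_handler(event: dict[str, Any], context: Any) -> None:
    """Entry point: evaluate the changed resource and report the verdict to Config."""
    import boto3  # imported lazily so evaluate_owner_tag can be tested offline

    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]
    boto3.client("config").put_evaluations(
        Evaluations=[
            {
                "ComplianceResourceType": item["resourceType"],
                "ComplianceResourceId": item["resourceId"],
                "ComplianceType": evaluate_owner_tag(item),
                "OrderingTimestamp": item["configurationItemCaptureTime"],
            }
        ],
        ResultToken=event["resultToken"],
    )


print(evaluate_owner_tag({"resourceType": "AWS::S3::Bucket", "tags": {"Owner": "platform"}}))
print(evaluate_owner_tag({"resourceType": "AWS::S3::Bucket", "tags": {}}))
```

Keeping the verdict in a pure function means the policy can be unit-tested in the same pipeline that deploys it; the handler is only plumbing.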

Side by side

Option	Detects change	Policy evaluation	Detection latency	Auto-reverse	Org-wide	Configurable
Terraform drift	— (per resource in state)	✓ (via plan)	Hours to days	✗	Per-workspace	✓
CloudTrail Insights	✓ (anomalies)	—	Minutes	✗	Organisation trail	—
Config + managed rules	✓	✓ (managed)	< 1 min	✓	Conformance packs	✓
Config + custom Lambda	✓	✓ (your code)	< 1 min	✓	Conformance packs	✓
Security Hub FSBP	✓	✓ (curated)	< 1 min	Some	Organisation	Limited
EventBridge + Lambda	✓	✓ (bespoke)	Seconds	✓	Via SNS fan-out	✓

Reading the table for the situation: Config + managed rules + auto-remediation is the backbone. Security Hub sits on top for aggregation. EventBridge + Lambda covers the “I need sub-minute detection on this one specific high-value event” case. CloudTrail Insights complements for anomaly-shaped attacks. Terraform drift stays for reconciliation against desired state.

Rule -> evaluation -> remediation lifecycle

About ninety seconds from a bad change to automatic reversal. CloudTrail captures the event, Config records the new state and evaluates the rule, Systems Manager Automation revokes the offending ingress rule, Config re-evaluates the resource as compliant, and Security Hub closes the finding.

The setup in depth

Enable Config in every account. Via StackSets or Control Tower from the Organizations management account. Each member account records the full resource set; the delivery channel writes configuration history to a central S3 bucket in the log-archive account and publishes change events to an SNS topic.

aws configservice put-configuration-recorder \
    --configuration-recorder '{
      "name": "default",
      "roleARN": "arn:aws:iam::111122223333:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig",
      "recordingGroup": {
        "allSupported": true,
        "includeGlobalResourceTypes": true
      }
    }'

allSupported: true is the easy choice; cost-conscious teams switch to a curated resourceTypes list. For this platform team, recording everything for a month and then pruning based on actual usage is a reasonable starting policy.
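One way to approach the pruning step is to derive the curated list from the resource types that deployed rules actually evaluate. The sketch below builds a recordingGroup of that shape; the rule-to-type mapping is maintained by hand here and is an assumption of the example, as are the function and constant names.

```python
import json

# Resource types each deployed rule evaluates (hand-maintained for this sketch).
RULE_SCOPES = {
    "restricted-ssh": ["AWS::EC2::SecurityGroup"],
    "s3-bucket-public-read-prohibited": ["AWS::S3::Bucket"],
    "encrypted-volumes": ["AWS::EC2::Volume"],
}


def curated_recording_group(rule_scopes: dict[str, list[str]]) -> dict:
    """Build a recordingGroup that records only the types some rule evaluates."""
    resource_types = sorted({t for types in rule_scopes.values() for t in types})
    return {
        "allSupported": False,
        # Global resource types can only be included when allSupported is true.
        "includeGlobalResourceTypes": False,
        "resourceTypes": resource_types,
    }


print(json.dumps(curated_recording_group(RULE_SCOPES), indent=2))
```

The printed JSON can replace the recordingGroup value in the recorder definition. The trade-off is the one named earlier: any resource type left off the list is invisible to Config until someone adds it back.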

Deploy rules via a conformance pack. The Operational-Best-Practices-for-AWS-Well-Architected-Security-Pillar sample pack is AWS’s curated list; teams usually start there and prune. The YAML declares rules and their parameters:

Resources:
  RestrictedSSH:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: restricted-ssh
      Source:
        Owner: AWS
        SourceIdentifier: INCOMING_SSH_DISABLED
  S3BucketPublicReadProhibited:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: s3-bucket-public-read-prohibited
      Source:
        Owner: AWS
        SourceIdentifier: S3_BUCKET_PUBLIC_READ_PROHIBITED
  EBSEncryption:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: encrypted-volumes
      Source:
        Owner: AWS
        SourceIdentifier: ENCRYPTED_VOLUMES

Deploy the pack to an Organizations OU from the organisation management account. Every account in the OU inherits the rules, and compliance results are aggregated into a central aggregator account.

Auto-remediation for the cheap wins. For restricted-ssh, a remediation configuration wires the rule to a Systems Manager Automation document that revokes the offending ingress rule:

aws configservice put-remediation-configurations \
    --remediation-configurations '[{
      "ConfigRuleName": "restricted-ssh",
      "ResourceType": "AWS::EC2::SecurityGroup",
      "TargetType": "SSM_DOCUMENT",
      "TargetId": "AWS-RevokeSecurityGroupIngress",
      "Parameters": {
        "AutomationAssumeRole": {"StaticValue": {"Values": ["arn:aws:iam::111122223333:role/config-remediation-role"]}},
        "GroupId": {"ResourceValue": {"Value": "RESOURCE_ID"}}
      },
      "Automatic": true,
      "MaximumAutomaticAttempts": 3,
      "RetryAttemptSeconds": 60
    }]'

The AWS-RevokeSecurityGroupIngress document is one of ~300 AWS-managed Automation documents; custom documents handle cases where the managed ones don’t fit. Setting Automatic to true fires the remediation as soon as the rule reports non-compliance; MaximumAutomaticAttempts caps retries on transient errors, and RetryAttemptSeconds bounds the time window in which those attempts are made.

Notification path for unremediated findings. Config publishes to SNS. A Lambda subscriber posts non-compliant findings (ones that didn’t get remediated) to a Slack channel and opens a ticket in the on-call system. The signal: a message shows up only when a human needs to see it.
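A minimal sketch of the filtering step in that Lambda subscriber is shown below. It assumes the SNS message body is a Config compliance-change notification, and it approximates “didn’t get remediated” as “the rule has no automatic remediation attached”, using a hand-maintained set; a remediation that was attempted and failed would need a separate check. The Slack and ticketing calls are left out.

```python
import json
from typing import Any

# Hypothetical: rules that already have an automatic remediation attached.
AUTO_REMEDIATED_RULES = {"restricted-ssh"}


def findings_needing_a_human(sns_event: dict[str, Any]) -> list[dict[str, str]]:
    """Extract non-compliant findings for rules with no automatic remediation."""
    findings = []
    for record in sns_event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("messageType") != "ComplianceChangeNotification":
            continue
        result = message.get("newEvaluationResult") or {}
        if result.get("complianceType") != "NON_COMPLIANT":
            continue
        if message.get("configRuleName") in AUTO_REMEDIATED_RULES:
            continue
        findings.append(
            {
                "account": message["awsAccountId"],
                "rule": message["configRuleName"],
                "resource": message["resourceId"],
            }
        )
    return findings


def sample_event(rule: str, compliance: str) -> dict[str, Any]:
    """Build a one-record SNS event with illustrative values for local testing."""
    message = {
        "messageType": "ComplianceChangeNotification",
        "awsAccountId": "111122223333",
        "configRuleName": rule,
        "resourceId": "example-resource",
        "newEvaluationResult": {"complianceType": compliance},
    }
    return {"Records": [{"Sns": {"Message": json.dumps(message)}}]}


print(findings_needing_a_human(sample_event("encrypted-volumes", "NON_COMPLIANT")))
print(findings_needing_a_human(sample_event("restricted-ssh", "NON_COMPLIANT")))  # []
```

Everything that survives the filter is worth a human’s attention; everything else stays out of the channel.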

Security Hub on top. Enable Security Hub in every account, enable the Foundational Security Best Practices standard, point its aggregator at the security account. Security Hub pulls Config findings (and GuardDuty, Inspector, and third-party) into a single dashboard with severity scoring. One pane of glass across 180 accounts.

EventBridge + Lambda for the truly urgent. A small set of “must not happen” events get a direct EventBridge rule subscribing to CloudTrail events, triggering a Lambda that reacts in seconds rather than waiting the ~60-second Config cycle. Examples: PutBucketAcl setting public-read, PutBucketPolicy that grants Principal: *. The Lambda applies the matching corrective call (a PutBucketAcl back to private, or removal of the offending bucket policy) and alerts immediately.
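A minimal sketch of the PutBucketAcl case follows. It assumes the Lambda receives the CloudTrail record under the EventBridge event’s detail key, and that a canned ACL appears in requestParameters under x-amz-acl as either a string or a one-element list; both the exact field layout and the function names are assumptions to verify against real events in your accounts. ACLs set through explicit grants, and the PutBucketPolicy case, are not covered.

```python
from typing import Any, Optional

PUBLIC_CANNED_ACLS = {"public-read", "public-read-write"}


def public_acl_bucket(event: dict[str, Any]) -> Optional[str]:
    """Return the bucket name if the event is a PutBucketAcl applying a public canned ACL."""
    detail = event.get("detail") or {}
    if detail.get("eventSource") != "s3.amazonaws.com":
        return None
    if detail.get("eventName") != "PutBucketAcl":
        return None
    params = detail.get("requestParameters") or {}
    acl = params.get("x-amz-acl")
    # Assumption: the canned ACL may arrive as a string or as a list of strings.
    if isinstance(acl, str):
        values = {acl}
    elif isinstance(acl, list):
        values = set(acl)
    else:
        values = set()
    if values & PUBLIC_CANNED_ACLS:
        return params.get("bucketName")
    return None


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, str]:
    """Revert a public canned ACL to private; do nothing for any other event."""
    bucket = public_acl_bucket(event)
    if bucket is None:
        return {"action": "none"}
    import boto3  # imported lazily so the matching logic can be tested offline

    boto3.client("s3").put_bucket_acl(Bucket=bucket, ACL="private")
    return {"action": "reverted", "bucket": bucket}
```

The corrective call itself produces a PutBucketAcl event, but with the private ACL, so the matcher returns None for it and the function does not loop on its own fix.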

A worked detection

We test the setup deliberately on a Tuesday morning in a test account:

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-test-0abc1234 \
    --protocol tcp --port 22 --cidr 0.0.0.0/0

  # Simulated drift

CloudWatch observers record the subsequent events:

09:14:32  AuthorizeSecurityGroupIngress succeeded (sg-test-0abc1234)
09:14:33  CloudTrail event delivered to security-trail bucket
09:14:52  Config ConfigurationItem recorded (security group with new rule)
09:15:17  Config rule restricted-ssh evaluated: NON_COMPLIANT
09:15:18  Remediation configuration triggered AWS-RevokeSecurityGroupIngress
09:15:34  SSM Automation execution completed successfully
09:15:51  Config re-evaluated sg-test-0abc1234: COMPLIANT
09:15:55  Security Hub finding closed

Sixty-two seconds from drift to completed reversal, and eighty-three until the Security Hub finding closes. The Saturday scenario’s 54 hours of exposure becomes under two minutes. A human is never in the loop for this specific case; the reversal fires whether it’s 09:14 Tuesday or 02:14 Saturday.
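The intervals in the timeline can be checked with a few lines of arithmetic. This sketch reads each timeline entry as minutes and seconds within a single hour (assumed here to be 09:00, on the Tuesday morning of the test); the step labels are shortened from the timeline.

```python
from datetime import datetime

# Selected timeline entries, written out with the assumed 09:00 hour.
TIMELINE = {
    "drift introduced": "09:14:32",
    "rule evaluated NON_COMPLIANT": "09:15:17",
    "SSM Automation completed": "09:15:34",
    "Security Hub finding closed": "09:15:55",
}


def seconds_since_drift(timeline: dict[str, str]) -> dict[str, int]:
    """Return each step's offset in whole seconds from the drift event."""
    parsed = {step: datetime.strptime(ts, "%H:%M:%S") for step, ts in timeline.items()}
    start = parsed["drift introduced"]
    return {step: int((t - start).total_seconds()) for step, t in parsed.items()}


for step, elapsed in seconds_since_drift(TIMELINE).items():
    print(f"{elapsed:>3}s  {step}")
```

On these timestamps the rule fires 45 seconds after the change, the automation finishes at 62 seconds, and the finding closes at 83.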

We check the audit trail for the reversal:

$ aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-test-0abc1234 \
    --start-time 2027-05-04T09:14:00Z \
    --end-time 2027-05-04T09:16:00Z

Events:
  - EventName: AuthorizeSecurityGroupIngress
    Username: user@example.com
    EventTime: 09:14:32
  - EventName: RevokeSecurityGroupIngress
    Username: arn:aws:sts::...:assumed-role/config-remediation-role/...
    EventTime: 09:15:34
    EventSource: ec2.amazonaws.com

The trail shows who made the change, which automation role reversed it, and when. Postmortem-ready.

What’s worth remembering

Config is a state log; CloudTrail is an event log; Security Hub is an aggregator. All three matter; they answer different questions. Config tells you what a resource looks like; CloudTrail tells you who changed it; Security Hub tells you what to worry about today across your fleet.
Rules come in three flavours: managed, custom Lambda, custom policy. The managed library covers most common concerns; custom Lambda rules handle organisation-specific logic; custom policy rules, written in the Guard policy language, cover declarative shape rules.
Configuration-change triggered vs periodic. Most rules fire on change; some (like “all IAM users should have MFA”) fire on a schedule because the resource they check isn’t itself changing.
Auto-remediation pairs a rule with a Systems Manager Automation document. Managed documents cover common fixes; custom documents handle the rest. Setting the remediation to automatic makes it hands-off.
Conformance packs deploy rules organisation-wide. YAML bundles stamped through Organizations to every account in an OU; a central aggregator consolidates compliance reporting.
Record what matters; skip what doesn’t. All-supported is easy and expensive. Curated resourceTypes is cheaper and requires more thought; revisit annually.
EventBridge + Lambda is the urgent lane. For events where Config’s 30-60 second cycle is too slow, subscribe directly to CloudTrail via EventBridge and react in seconds.
Drift outside the pipeline is a real threat. The pipeline is one layer; Config is another. The two complement: pipeline blocks bad changes at merge time, Config catches bad changes at runtime. Both matter.

The resource that drifts is going to drift; the goal is to catch it in minutes rather than days. Config rules plus managed remediations plus Security Hub for aggregation, deployed via conformance packs across the organisation: that’s the backbone that keeps the 54-hour Saturday exposure from happening again.