How to Produce Auditable Patch Compliance Reports for EC2 Fleets

July 26, 2028 · 14 min read

DevOps Engineer Pro · DOP-C02 · part of The Exam Room

The situation

A platform team inherits patch management for roughly 800 EC2 instances across three accounts (dev, staging, prod) and one Region. The current state is uneven:

  • Roughly 700 of the 800 hosts have the SSM Agent installed and an instance profile with AmazonSSMManagedInstanceCore.
  • No custom patch policy in place; every instance falls back to the AWS-supplied default for its OS.
  • One maintenance window covers all of prod; dev and staging get patched “when someone remembers.”
  • Compliance reports today are console screenshots, emailed on the last Friday of the month.

The asks:

  • A monthly compliance report with per-instance, per-baseline status: compliant, non-compliant, missing critical patches over fourteen days old.
  • Per-team and per-environment breakdown, keyed to existing tags (Team, Environment).
  • Different baselines for production (security patches only, approved after a seven-day soak) and non-production (all patches, approved immediately).
  • Maintenance windows per environment, so dev gets patched on Tuesdays and prod on Saturday nights.
  • Evidence trail deep enough to answer “which instances received CVE-2028-1234 and when?” without a screenshot.

What actually matters

Patching splits cleanly into four problems that the design has to address separately.

The first is what counts as “patched”: a policy that classifies patches into approved and rejected. For most Linux distros it’s severity, classification, and an approval delay. For Windows it’s the same, plus Microsoft-specific categories. This policy is what an instance is measured against during a compliance scan. Same instance, different policy, different compliance answer. The decision worth making deliberately is how strict the approval rules are and how long the delay is before a patch counts as “approved” (the soak period, useful insurance against regressions in a patch that ships on Tuesday and breaks things on Wednesday).

The second is when patching happens and against what: a schedule and a set of targets, with the policy attached. The separation is worth understanding: the policy decides what patches are in scope, the schedule decides when they run, and the target selection decides which instances are touched. Any of the three can be wrong independently, and the symptoms are different. Per-environment scheduling (different days for prod and non-prod) and policy-per-environment (longer soak in prod, instant in non-prod) only work if those three pieces are independent.

The third is the compliance data. Scanning produces per-resource verdicts (compliant, non-compliant, not-applicable) keyed by instance and patch. For a monthly report the console view is fine once. For a repeatable, team-scoped report that includes the count of missing patches older than fourteen days, the data has to land somewhere queryable, partitioned by account and Region, with the team and environment tags preserved on every row. SQL over the data beats screenshots over a dashboard.

The fourth is how much of this the platform team runs versus hands to application teams. Policy and scheduling centralised in the platform account keep things uniform; tag-based targeting lets app teams opt in or out of a window by tagging their instances. The report that actually matters, “your team’s compliance this month”, is the artefact that aligns incentives.

What we’ll filter on

Ranking the options against:

  1. Baseline flexibility: separate rulesets for prod vs non-prod, different OS families.
  2. Scheduling control: different maintenance windows per environment.
  3. Compliance data export: queryable outside the console.
  4. Team-scoped reporting: reports broken down by Team and Environment tags.
  5. Evidence trail: per-instance, per-patch history with timestamps.

The patching landscape

1. OS-native updaters (unattended-upgrades, yum-cron, Windows Update). Each OS has its own scheduler that pulls patches on its own cadence. Works; produces no AWS-visible compliance state, no centralised baseline, no report. A CloudWatch Logs agent scraping /var/log/dpkg.log is a poor substitute for ListComplianceItems. Rejected for the reporting requirement.

2. Patch Manager – AWS-DefaultPatchBaseline per OS. The simplest setup: every instance uses the AWS-provided default baseline for its OS, one global maintenance window scans and installs nightly. Easy to stand up; fails the “prod soak” requirement (no per-environment baseline) and doesn’t tie patches to maintenance windows beyond a single global schedule.

3. Patch Manager with custom baselines per environment and patch groups. The canonical approach. Two baselines per OS family (one for prod with severity Critical/Important only and a seven-day approval delay, one for non-prod with all severities and zero delay), a Patch Group tag on each instance, one maintenance window per environment. Compliance data surfaces per instance. Ticks baseline flexibility, scheduling, and evidence; needs additional plumbing for reporting.

4. Patch Manager + Resource Data Sync to S3 + Athena. Sync compliance data and inventory to a central S3 bucket nightly; Athena queries answer the monthly reporting questions by team and environment. Pair with QuickSight or a simple SQL-to-CSV script for the report the compliance team reads. Ticks all five requirements.

5. Third-party tooling (Rapid7, Qualys, Tenable). Fine tools with mature compliance reporting, but they duplicate what Patch Manager already produces. They add licence cost and agents, and they don’t replace Patch Manager for actually installing the patches. Valid if the organisation already owns the licences; not the minimum-viable answer.

Side by side

Option | Baselines per env | Per-env schedule | Data export | Team-scoped report | Evidence trail
OS-native updaters | No | | No | No | Partial (OS logs)
Default baseline, one window | No | No | | |
Custom baselines + patch groups | Yes | Yes | Partial | | Yes
+ Resource Data Sync to S3 + Athena | Yes | Yes | Yes | Yes | Yes
Third-party tooling | | | | |

Patch Manager with custom baselines, patch groups, and Resource Data Sync is the AWS-native answer that clears every requirement without adding a licence bill.

How the pieces fit together

Targets (tagged instances)

  • prod Linux fleet: Patch Group = prod-linux, Environment = prod
  • non-prod Linux fleet: Patch Group = nonprod-linux, Environment = dev / staging
  • prod Windows fleet: Patch Group = prod-windows, Environment = prod
  • non-prod Windows fleet: Patch Group = nonprod-windows, Environment = dev / staging

Policy & schedule

  • prod-linux-baseline: Critical + Important, 7-day soak
  • nonprod-linux-baseline: all severities, 0-day soak
  • prod-windows-baseline: SecurityUpdates + CriticalUpdates, 7-day soak
  • nonprod-windows-baseline: all categories, 0-day soak
  • MW: prod, Sat 02:00 UTC, targets: Env=prod, install step AWS-RunPatchBaseline
  • MW: non-prod, Tue 12:00 UTC, targets: Env∈{dev,staging}, install step AWS-RunPatchBaseline

Data & reporting

  • Resource Data Sync, nightly: PatchCompliance + instance inventory → S3 bucket, partitioned
  • S3 (compliance-reports): AccountId=…/Region=…/ResourceType=ManagedInstance/
  • Athena: SELECT Tag.Team, COUNT(*) FROM patch_compliance
  • Monthly report per Team × Environment, CVE age ≥ 14 days

Tags on instances pick the baseline; maintenance windows pick the schedule; Resource Data Sync lands the compliance data in S3 where Athena turns it into the monthly report.

The pick in depth

Custom baselines. Four baselines cover the fleet: prod-linux, nonprod-linux, prod-windows, nonprod-windows. Each is a JSON document with approval rules keyed on severity and classification, plus an approval delay (ApproveAfterDays in the Patch Manager API). The prod baselines set the delay to seven, which quarantines a patch that shipped today until the non-prod fleet has had a week on it. The non-prod baselines set the delay to zero, so non-prod is the canary.
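A minimal sketch of what two of those baseline documents might look like, written as plain Python dictionaries shaped for boto3's ssm.create_patch_baseline. The helper name build_baseline_params is hypothetical, and the severity and classification values are assumed for Amazon Linux 2; check them against the OS family actually in use.

```python
def build_baseline_params(name, operating_system, severities,
                          classifications, approve_after_days, rejected=()):
    """Return keyword arguments shaped for boto3's ssm.create_patch_baseline."""
    return {
        "Name": name,
        "OperatingSystem": operating_system,
        "ApprovalRules": {
            "PatchRules": [
                {
                    "PatchFilterGroup": {
                        "PatchFilters": [
                            {"Key": "SEVERITY", "Values": list(severities)},
                            {"Key": "CLASSIFICATION", "Values": list(classifications)},
                        ]
                    },
                    # The soak period: days after release before auto-approval.
                    "ApproveAfterDays": approve_after_days,
                }
            ]
        },
        # Explicit veto list; empty until a patch is known to break the fleet.
        "RejectedPatches": list(rejected),
    }


# Filter values below are assumptions for Amazon Linux 2.
prod_linux = build_baseline_params(
    "prod-linux-baseline", "AMAZON_LINUX_2",
    ["Critical", "Important"], ["Security"], approve_after_days=7)

nonprod_linux = build_baseline_params(
    "nonprod-linux-baseline", "AMAZON_LINUX_2",
    ["Critical", "Important", "Medium", "Low"],
    ["Security", "Bugfix", "Enhancement", "Recommended", "Newpackage"],
    approve_after_days=0)
```

With credentials in place, each dictionary could be passed as boto3.client("ssm").create_patch_baseline(**prod_linux). Keeping the builder free of AWS calls makes the policy reviewable in a pull request before anything is created.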

A baseline’s rule for Amazon Linux reads as an approval-rule group: “approve any patch with severity Critical or Important AND classification Security after seven days.” Patches that match become eligible during the next scan; patches that don’t remain “pending approval” indefinitely. A “rejected patches” list is the explicit veto, a specific KB article or package name that is known to break the fleet.
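The approval logic in that rule can be sketched in a few lines. This is an illustration of the reasoning, not Patch Manager's implementation: the Patch and Rule types, the state names, and the package identifiers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Patch:
    patch_id: str
    severity: str
    classification: str
    release_date: date


@dataclass(frozen=True)
class Rule:
    severities: frozenset
    classifications: frozenset
    approve_after_days: int


def approval_state(patch, rule, rejected_ids, today):
    """Classify one patch as 'rejected', 'approved', or 'pending-approval'."""
    # The rejected list is an explicit veto and wins over any rule match.
    if patch.patch_id in rejected_ids:
        return "rejected"
    matches = (patch.severity in rule.severities
               and patch.classification in rule.classifications)
    if not matches:
        # Never matched by a rule, so it stays pending indefinitely.
        return "pending-approval"
    approved_on = patch.release_date + timedelta(days=rule.approve_after_days)
    return "approved" if today >= approved_on else "pending-approval"


prod_rule = Rule(frozenset({"Critical", "Important"}), frozenset({"Security"}), 7)
```

A patch released six days ago is still soaking under prod_rule, while the same patch under a zero-day rule is approved immediately; that gap is what makes non-prod the canary.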

Patch groups. Each instance gets a tag Patch Group = <group-name>. The baseline references that group name via its operating-system-specific patch group association. There is no direct link from instance to baseline; the tag is the indirection that lets new instances inherit the right baseline simply by being tagged at launch. Launch templates that set Patch Group based on the autoscaling group or stack tag make this free.
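The indirection can be pictured as a lookup from tag value to baseline. The sketch below is only a model of that lookup: the mapping mirrors the four patch groups named in this post, and resolve_baseline is a hypothetical helper, not an SSM call. In a real account the association would be created with register_patch_baseline_for_patch_group.

```python
# Patch group name -> baseline name, as laid out in this post.
PATCH_GROUP_TO_BASELINE = {
    "prod-linux": "prod-linux-baseline",
    "nonprod-linux": "nonprod-linux-baseline",
    "prod-windows": "prod-windows-baseline",
    "nonprod-windows": "nonprod-windows-baseline",
}

DEFAULT_BASELINE = "default baseline for the instance OS"


def resolve_baseline(instance_tags):
    """Return the baseline an instance would be measured against."""
    group = instance_tags.get("Patch Group")
    # A missing or unregistered patch group falls back to the OS default.
    return PATCH_GROUP_TO_BASELINE.get(group, DEFAULT_BASELINE)
```

An untagged instance silently lands on the default baseline, which is why setting the tag in the launch template matters more than it first appears.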

Maintenance windows. One window per environment, with two tasks each: a scan task (AWS-RunPatchBaseline with Operation=Scan) that updates compliance state without installing anything, and an install task that actually patches. The prod window runs Saturday 02:00 UTC; the non-prod window runs Tuesday 12:00 UTC. Targets are defined by tag (Environment=prod, or Environment in [dev, staging]), not by instance ID, so adding a new instance to the schedule is a tag away.
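A sketch of the request shapes behind those windows, as dictionaries that could be passed to boto3's create_maintenance_window, register_target_with_maintenance_window, and register_task_with_maintenance_window. The window names, durations, cutoffs, concurrency limits, and the placeholder IDs are assumptions for illustration only.

```python
def window_params(name, schedule, duration_hours=4, cutoff_hours=1):
    """Arguments for ssm.create_maintenance_window (durations are assumed)."""
    return {
        "Name": name,
        "Schedule": schedule,
        "ScheduleTimezone": "UTC",
        "Duration": duration_hours,
        "Cutoff": cutoff_hours,
        "AllowUnassociatedTargets": False,
    }


def target_params(window_id, environments):
    """Arguments for ssm.register_target_with_maintenance_window, by tag."""
    return {
        "WindowId": window_id,
        "ResourceType": "INSTANCE",
        "Targets": [{"Key": "tag:Environment", "Values": list(environments)}],
    }


def patch_task_params(window_id, window_target_id, operation):
    """Arguments for ssm.register_task_with_maintenance_window."""
    return {
        "WindowId": window_id,
        "Targets": [{"Key": "WindowTargetIds", "Values": [window_target_id]}],
        "TaskArn": "AWS-RunPatchBaseline",
        "TaskType": "RUN_COMMAND",
        "MaxConcurrency": "10%",  # assumed rollout limits
        "MaxErrors": "5%",
        "TaskInvocationParameters": {
            "RunCommand": {"Parameters": {"Operation": [operation]}}
        },
    }


# Saturday 02:00 UTC for prod, Tuesday 12:00 UTC for non-prod.
PROD_WINDOW = window_params("prod-patching", "cron(0 2 ? * SAT *)")
NONPROD_WINDOW = window_params("nonprod-patching", "cron(0 12 ? * TUE *)")
```

The window and target IDs are returned by the create and register calls, so a real script would thread those responses through rather than hard-code them.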

A scan runs daily via a separate, shorter maintenance window that has no install task. That way compliance state is never more than 24 hours stale, independent of when the install windows fire.
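The 24-hour freshness claim is easy to check mechanically. A small sketch, assuming the last scan time per instance has already been collected into a dictionary (how it is collected is left out here, and stale_instances is a hypothetical helper):

```python
from datetime import datetime, timedelta, timezone


def stale_instances(last_scan_times, now, max_age=timedelta(hours=24)):
    """Return instance IDs whose last scan is older than max_age."""
    return sorted(
        instance_id
        for instance_id, scanned_at in last_scan_times.items()
        if now - scanned_at > max_age
    )
```

Anything this returns is an instance the daily scan window missed, usually because the agent is offline or the instance lost its targeting tag.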

Resource Data Sync. One sync configuration in each account aggregates inventory and PatchCompliance data to a central S3 bucket (usually in a dedicated audit account). The bucket is organised as s3://compliance-reports/AccountId=<id>/Region=<region>/ResourceType=<type>/. Athena sits over the bucket with partition projection; the monthly query answers SELECT tags['Team'], tags['Environment'], COUNT(*) FROM patch_compliance WHERE status='NON_COMPLIANT' AND classification='Security' AND patch_release_date < current_date - interval '14' day GROUP BY 1, 2. The output lands in a CSV that gets attached to the compliance email.
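A sketch of the sync configuration as a dictionary shaped for boto3's ssm.create_resource_data_sync. The sync name and the bucket Region are placeholders (the post does not name the Region), and the bucket policy that lets Systems Manager write from each account is not shown.

```python
def data_sync_params(sync_name, bucket, bucket_region, prefix=None):
    """Arguments for ssm.create_resource_data_sync targeting a central bucket."""
    destination = {
        "BucketName": bucket,
        "SyncFormat": "JsonSerDe",  # JSON output that Athena can read
        "Region": bucket_region,
    }
    if prefix:
        destination["Prefix"] = prefix
    return {"SyncName": sync_name, "S3Destination": destination}


# "eu-west-1" is an assumed Region for illustration.
SYNC = data_sync_params("patch-compliance-sync", "compliance-reports", "eu-west-1")
```

The same dictionary would be applied once in each of the three accounts, all pointing at the one bucket.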

A worked compliance cycle

Monday 10:00 UTC, instance i-0abc in staging tagged Patch Group=nonprod-linux, Environment=staging, Team=payments. The daily scan maintenance window runs:

  1. Maintenance window resolves targets by tag; i-0abc matches.
  2. Run Command invokes AWS-RunPatchBaseline with Operation=Scan; the baseline (nonprod-linux-baseline) is resolved from the instance’s Patch Group tag rather than passed as a parameter.
  3. On the instance, the SSM Agent downloads the baseline rules, runs yum check-update against them, and reports per-patch state: approved-installed, approved-missing, rejected-installed, not-applicable, etc.
  4. The agent writes PatchComplianceData to Systems Manager; the data appears in the Patch Manager compliance view within minutes.
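How per-patch states roll up into an instance verdict can be sketched as follows. The state labels are the ones used in the steps above; which of them count against compliance is an assumption made here for illustration, and the package names are hypothetical.

```python
# States that count against an instance (assumed for this sketch).
NON_COMPLIANT_STATES = {"approved-missing", "rejected-installed"}


def instance_status(patch_states):
    """Roll per-patch states up into a single instance-level verdict."""
    if any(state in NON_COMPLIANT_STATES for state in patch_states.values()):
        return "NON_COMPLIANT"
    return "COMPLIANT"
```

One missing approved patch is enough to flip the whole instance, which is why the report counts patches as well as instances.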

Tuesday 12:00 UTC, non-prod install window fires:

  1. Window runs AWS-RunPatchBaseline with Operation=Install. Approved-missing patches install, the instance reboots if required, and post-install the scan runs again.
  2. Instance state transitions from NON_COMPLIANT to COMPLIANT for any patch that landed successfully.

Wednesday 02:00 UTC, Resource Data Sync runs:

  1. Sync picks up the new compliance records and writes them as JSON to S3 under AccountId=…/Region=…/ResourceType=ManagedInstance/data_<timestamp>.json.
  2. Athena’s partition projection sees the new partition automatically.
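The partition layout is just key=value path segments, which is what lets Athena prune by account and Region. A small sketch that builds and parses prefixes in the layout described in this post; the account ID and Region in the usage are placeholders, and the layout a real sync writes should be confirmed against the bucket.

```python
def partition_prefix(account_id, region, resource_type="ManagedInstance"):
    """Build the S3 prefix for one account/Region/resource-type partition."""
    return (f"AccountId={account_id}/Region={region}/"
            f"ResourceType={resource_type}/")


def parse_partitions(key):
    """Extract partition values from an S3 object key."""
    partitions = {}
    for segment in key.split("/"):
        name, separator, value = segment.partition("=")
        if separator:  # skip segments without key=value, such as the file name
            partitions[name] = value
    return partitions
```

For example, parse_partitions(partition_prefix("111122223333", "eu-west-1") + "data.json") returns the account, Region, and resource type as a dictionary.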

Last Friday of the month, compliance engineer runs:

SELECT tags['Team'] AS team,
       tags['Environment'] AS environment,
       SUM(CASE WHEN status='NON_COMPLIANT' THEN 1 ELSE 0 END) AS non_compliant,
       COUNT(*) AS total
FROM patch_compliance
WHERE classification IN ('Security','CriticalUpdates','SecurityUpdates')
  AND date_diff('day', patch_release_date, current_date) >= 14
GROUP BY 1, 2
ORDER BY 1, 2;

Result: one row per Team × Environment with the non-compliant count. CSV attached; screenshot no longer required.
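To sanity-check the query's logic without Athena, the same filter and group-by can be mirrored in a few lines of Python over sample rows. The row shape here is an assumption that follows the column names in the query.

```python
from collections import defaultdict
from datetime import date

REPORT_CLASSIFICATIONS = {"Security", "CriticalUpdates", "SecurityUpdates"}


def monthly_report(rows, today, min_age_days=14):
    """Mirror the SQL: filter, then count per (Team, Environment)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [non_compliant, total]
    for row in rows:
        if row["classification"] not in REPORT_CLASSIFICATIONS:
            continue
        if (today - row["patch_release_date"]).days < min_age_days:
            continue
        key = (row["tags"]["Team"], row["tags"]["Environment"])
        counts[key][1] += 1
        if row["status"] == "NON_COMPLIANT":
            counts[key][0] += 1
    return [(team, env, non_compliant, total)
            for (team, env), (non_compliant, total) in sorted(counts.items())]
```

Running it against a handful of hand-built rows is a cheap way to confirm that patches younger than fourteen days and non-security classifications really drop out before the counts are taken.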

What’s worth remembering

  1. Baseline + patch group + maintenance window is the triad. The baseline decides what gets approved; the patch group tag binds an instance to a baseline; the maintenance window decides when and which instances. Any of the three can be wrong independently, and the symptoms differ.
  2. Patch groups are a tag, not a separate object. Patch Group=<name> on the instance; the baseline’s patch group association picks it up. Launch templates should set the tag so new instances inherit the correct baseline automatically.
  3. ApproveAfterDays is the soak period. Prod baselines set it high enough that non-prod is the canary; non-prod baselines usually set it to zero so they are the canary.
  4. Scan windows and install windows are separate. Daily scans keep compliance state fresh; install windows only fire on the cadence the change-management process permits. Don’t couple compliance-visibility frequency to install-risk tolerance.
  5. Resource Data Sync is the export path. Compliance and inventory data lands in S3 as JSON, partitioned by account and Region. Athena (or QuickSight) handles the reporting; the console view is good for spot checks, not for monthly evidence.
  6. Tags carry through to the report. Instance tags (Team, Environment, CostCenter) appear in the Resource Data Sync output. The monthly report’s group-by dimensions are whatever tags the instances carry; tag discipline at launch is what makes the report useful.
  7. Hybrid nodes work the same way. Activations register on-premises servers as managed instances; baselines, patch groups, and maintenance windows apply identically once the machine shows up as mi-<id>.
  8. CVE-specific queries are cheap once the data is in Athena. WHERE patch_id LIKE 'CVE-2028-1234%' over the partition for the past month answers “which hosts received this patch?” in seconds. That is the payoff for spending the morning on Resource Data Sync instead of screenshots.
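The CVE lookup in the last point amounts to a prefix filter over compliance rows. A sketch of the same filter in Python, assuming each row carries instance_id, patch_id, state, and installed_time fields (hypothetical names; the real column names depend on how the Athena table is defined):

```python
def hosts_with_patch(rows, patch_prefix):
    """Return (instance_id, installed_time) for installed patches matching a prefix."""
    hits = [
        (row["instance_id"], row["installed_time"])
        for row in rows
        # startswith mirrors SQL's LIKE 'prefix%'.
        if row["patch_id"].startswith(patch_prefix)
        and row["state"] == "approved-installed"
    ]
    return sorted(hits)
```

The output is the evidence trail the asks called for: which instances received the patch, and when.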

Patch Manager doesn’t make patching happen by itself; it makes it recordable and reviewable. Four baselines divide the fleet into “soak” and “fast,” patch-group tags bind each instance to the right baseline, maintenance windows run the scan and install jobs on the schedule change management approves, and Resource Data Sync turns every compliance scan into a row in an Athena table. The monthly report stops being a screenshot and starts being a query, and that is the difference between patching that the compliance team believes and patching that they don’t.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.