The situation
A platform team inherits patch management for roughly 800 EC2 instances across three accounts (dev, staging, prod) and one Region. The current state is uneven:
- A subset of hosts, roughly 700 of the 800, have the SSM Agent installed and an instance profile with AmazonSSMManagedInstanceCore.
- No custom patch policy is in place; every instance falls back to the AWS-supplied default for its OS.
- One maintenance window covers all of prod; dev and staging get patched “when someone remembers.”
- Compliance reports today are console screenshots, emailed on the last Friday of the month.
The asks:
- A monthly compliance report with per-instance, per-baseline status: compliant, non-compliant, missing critical patches over fourteen days old.
- Per-team and per-environment breakdown, keyed to existing tags (Team, Environment).
- Different baselines for production (security patches only, approved after a seven-day soak) and non-production (all patches, approved immediately).
- Maintenance windows per environment, so dev gets patched on Tuesdays and prod on Saturday nights.
- Evidence trail deep enough to answer “which instances received CVE-2028-1234 and when?” without a screenshot.
What actually matters
Patching splits cleanly into four problems that the design has to address separately.
The first is what counts as “patched”: a policy that classifies patches into approved and rejected. For most Linux distros it’s severity, classification, and an approval delay. For Windows it’s the same, plus Microsoft-specific categories. This policy is what an instance is measured against during a compliance scan. Same instance, different policy, different compliance answer. The decision worth making deliberately is how strict the approval rules are and how long the delay is before a patch counts as “approved” (the soak period, useful insurance against regressions in a patch that ships on Tuesday and breaks things on Wednesday).
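The approval decision described above can be sketched as a small pure function. This is an illustration of the idea only, not Patch Manager's implementation: the class names, the field names, and the exact boundary (a patch counts as approved on the day the delay elapses) are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Patch:
    name: str
    severity: str          # e.g. "Critical", "Important", "Medium"
    classification: str    # e.g. "Security", "Bugfix"
    released: date


@dataclass(frozen=True)
class ApprovalRule:
    severities: frozenset
    classifications: frozenset
    delay_days: int        # the soak period


def is_approved(patch: Patch, rule: ApprovalRule, today: date) -> bool:
    """True once the patch matches the rule and has aged past the soak period."""
    matches = (patch.severity in rule.severities
               and patch.classification in rule.classifications)
    soaked = today >= patch.released + timedelta(days=rule.delay_days)
    return matches and soaked


# Same patch, same day, two policies: the compliance answer differs.
prod = ApprovalRule(frozenset({"Critical", "Important"}), frozenset({"Security"}), 7)
nonprod = ApprovalRule(frozenset({"Critical", "Important"}), frozenset({"Security"}), 0)

patch = Patch("example-kernel-update", "Critical", "Security", date(2024, 3, 5))
today = date(2024, 3, 8)
print(is_approved(patch, prod, today), is_approved(patch, nonprod, today))
```

Three days after release the patch is in scope for the zero-delay policy and still soaking under the seven-day one, which is the "same instance, different policy, different answer" effect.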
The second is when patching happens and against what: a schedule and a set of targets, with the policy attached. The separation is worth understanding: the policy decides what patches are in scope, the schedule decides when they run, and the target selection decides which instances are touched. Any of the three can be wrong independently, and the symptoms are different. Per-environment scheduling (different days for prod and non-prod) and policy-per-environment (longer soak in prod, instant in non-prod) only work if those three pieces are independent.
The third is the compliance data. Scanning produces per-resource verdicts (compliant, non-compliant, not-applicable), keyed by instance and patch. For a monthly report the console view is fine once; for a repeatable, team-scoped report that includes the count of missing patches older than fourteen days, the data has to land somewhere queryable, partitioned by account and Region, with the team and environment tags preserved on every row. SQL over the data beats screenshots over a dashboard.
The fourth is how much of this the platform team runs versus hands to application teams. Policy and scheduling centralised in the platform account keep things uniform; tag-based targeting lets app teams opt in or out of a window by tagging their instances. The report that actually matters, “your team’s compliance this month”, is the artefact that aligns incentives.
What we’ll filter on
Ranking the options against:
- Baseline flexibility: separate rulesets for prod vs non-prod, different OS families.
- Scheduling control: different maintenance windows per environment.
- Compliance data export: queryable outside the console.
- Team-scoped reporting: reports broken down by Team and Environment tags.
- Evidence trail: per-instance, per-patch history with timestamps.
The patching landscape
1. OS-native updaters (unattended-upgrades, yum-cron, Windows Update). Each OS has its own scheduler that pulls patches on its own cadence. Works; produces no AWS-visible compliance state, no centralised baseline, no report. A CloudWatch Logs agent scraping /var/log/dpkg.log is a poor substitute for ListComplianceItems. Rejected for the reporting requirement.
2. Patch Manager – AWS-DefaultPatchBaseline per OS. The simplest setup: every instance uses the AWS-provided default baseline for its OS, one global maintenance window scans and installs nightly. Easy to stand up; fails the “prod soak” requirement (no per-environment baseline) and doesn’t tie patches to maintenance windows beyond a single global schedule.
3. Patch Manager with custom baselines per environment and patch groups. The canonical approach. Two baselines per OS family (one for prod with severity Critical/Important only and a seven-day approval delay, one for non-prod with all severities and zero delay), a Patch Group tag on each instance, and one maintenance window per environment. Compliance data surfaces per instance. Ticks baseline flexibility, scheduling, and evidence; needs additional plumbing for reporting.
4. Patch Manager + Resource Data Sync to S3 + Athena. Sync compliance data and inventory to a central S3 bucket nightly; Athena queries answer the monthly reporting questions by team and environment. Pair with QuickSight or a simple SQL-to-CSV script for the report the compliance team reads. Ticks all five requirements.
5. Third-party tooling (Rapid7, Qualys, Tenable). Fine tools, mature compliance reporting, duplicates what Patch Manager already produces. Adds licence cost and agents; doesn’t replace Patch Manager for actually installing the patches. Valid if the organisation already owns the licences; not the minimum-viable answer.
Side by side
| Option | Baselines per env | Per-env schedule | Data export | Team-scoped report | Evidence trail |
|---|---|---|---|---|---|
| OS-native updaters | ✗ | ✗ | ✗ | ✗ | Partial (OS logs) |
| Default baseline, one window | ✗ | ✗ | — | ✗ | ✓ |
| Custom baselines + patch groups | ✓ | ✓ | — | Partial | ✓ |
| + Resource Data Sync to S3 + Athena | ✓ | ✓ | ✓ | ✓ | ✓ |
| Third-party tooling | ✓ | ✓ | ✓ | ✓ | ✓ |
Patch Manager with custom baselines, patch groups, and Resource Data Sync is the AWS-native answer that clears every requirement without adding a licence bill.
How the pieces fit together
The pick in depth
Custom baselines. Four baselines cover the fleet: prod-linux, nonprod-linux, prod-windows, nonprod-windows. Each is a JSON document with approval rules keyed on severity and classification, plus an ApproveAfterDays value. The prod baselines set the delay to seven, which quarantines a patch that shipped today until the non-prod fleet has had a week on it. The non-prod baselines set the delay to zero, so non-prod is the canary.
A baseline’s rule for Amazon Linux reads as an approval-rule group: “approve any patch with severity Critical or Important AND classification Security after seven days.” Patches that match become eligible during the next scan; patches that don’t match remain “pending approval” indefinitely. A “rejected patches” list is the explicit veto: a specific KB article or package name that is known to break the fleet.
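One way to picture the prod and non-prod Linux baselines is as the request body that boto3's create_patch_baseline would receive. The sketch below only builds the dictionaries and sends nothing. The helper name baseline_request, the AMAZON_LINUX_2 operating system, and the wildcard PRODUCT filter standing in for "all patches" are assumptions for illustration; ApproveAfterDays is the API field that carries the approval delay.

```python
def baseline_request(name, operating_system, approve_after_days, security_only,
                     rejected_patches=()):
    """Build keyword arguments for SSM's CreatePatchBaseline (nothing is sent)."""
    if security_only:
        filters = [
            {"Key": "SEVERITY", "Values": ["Critical", "Important"]},
            {"Key": "CLASSIFICATION", "Values": ["Security"]},
        ]
    else:
        # Assumption: a wildcard product filter is used to mean "every patch".
        filters = [{"Key": "PRODUCT", "Values": ["*"]}]
    rule = {
        "PatchFilterGroup": {"PatchFilters": filters},
        "ApproveAfterDays": approve_after_days,   # the soak period
        "EnableNonSecurity": not security_only,   # Linux-only option
    }
    return {
        "Name": name,
        "OperatingSystem": operating_system,
        "ApprovalRules": {"PatchRules": [rule]},
        "RejectedPatches": list(rejected_patches),  # the explicit veto list
    }


prod_linux = baseline_request("prod-linux", "AMAZON_LINUX_2", 7, security_only=True)
nonprod_linux = baseline_request("nonprod-linux", "AMAZON_LINUX_2", 0, security_only=False)

# With boto3 installed and credentials configured, the call would be roughly:
#   boto3.client("ssm").create_patch_baseline(**prod_linux)
print(prod_linux["ApprovalRules"]["PatchRules"][0]["ApproveAfterDays"])
```

Keeping the two baselines behind one builder makes the only intended differences (filters and delay) visible in the call sites.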
Patch groups. Each instance gets a tag Patch Group = <group-name>. The baseline references that group name via its operating-system-specific patch group association. There is no direct link from instance to baseline; the tag is the indirection that lets new instances inherit the right baseline when they are tagged at launch. Launch templates that set Patch Group based on the Auto Scaling group or stack tag make this free.
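A minimal sketch of the tagging side, assuming group names follow the prod-linux / nonprod-linux pattern used for the baselines. The helper names are hypothetical; the output is shaped like the TagSpecifications fragment of an EC2 launch template, and nothing is sent to AWS.

```python
def patch_group_name(environment: str, os_family: str) -> str:
    """Map an environment and OS family to a patch group, e.g. nonprod-linux."""
    tier = "prod" if environment == "prod" else "nonprod"
    return f"{tier}-{os_family}"


def launch_tag_specifications(team: str, environment: str, os_family: str) -> list:
    """Tags a launch template would stamp on every new instance."""
    tags = {
        "Team": team,
        "Environment": environment,
        "Patch Group": patch_group_name(environment, os_family),
    }
    return [{
        "ResourceType": "instance",
        "Tags": [{"Key": key, "Value": value} for key, value in tags.items()],
    }]


specs = launch_tag_specifications("payments", "staging", "linux")
print(specs[0]["Tags"])

# The baseline side of the association is a separate SSM call, roughly:
#   ssm.register_patch_baseline_for_patch_group(BaselineId=..., PatchGroup="nonprod-linux")
```

Deriving the group name from the environment tag means an instance cannot be tagged as prod while bound to the non-prod baseline.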
Maintenance windows. One window per environment, with two tasks each: a scan task (AWS-RunPatchBaseline with Operation=Scan) that updates compliance state without installing anything, and an install task that actually patches. The prod window runs Saturday 02:00 UTC; the non-prod window runs Tuesday 12:00 UTC. Targets are defined by tag (Environment=prod, or Environment in [dev, staging]), not by instance ID, so adding a new instance to the schedule is a tag away.
A scan runs daily via a separate, shorter maintenance window that has no install task. That way compliance state is never more than 24 hours stale, independent of when the install windows fire.
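The three windows can be written down as data before any API call is made. The sketch below builds request fragments shaped like the arguments to create_maintenance_window, register_target_with_maintenance_window, and register_task_with_maintenance_window. The window names, the Duration and Cutoff values, and the hour of the daily scan are assumptions; the prod and non-prod schedules come from the text.

```python
WINDOWS = {
    "prod-install": {
        "schedule": "cron(0 2 ? * SAT *)",    # Saturday 02:00 UTC
        "environments": ["prod"],
        "operations": ["Scan", "Install"],
    },
    "nonprod-install": {
        "schedule": "cron(0 12 ? * TUE *)",   # Tuesday 12:00 UTC
        "environments": ["dev", "staging"],
        "operations": ["Scan", "Install"],
    },
    "daily-scan": {
        "schedule": "cron(0 10 ? * * *)",     # every day; the hour is illustrative
        "environments": ["dev", "staging", "prod"],
        "operations": ["Scan"],               # no install task in this window
    },
}


def window_request(name):
    """Keyword arguments for CreateMaintenanceWindow (durations are assumptions)."""
    return {"Name": name, "Schedule": WINDOWS[name]["schedule"],
            "Duration": 3, "Cutoff": 1, "AllowUnassociatedTargets": False}


def target_request(name):
    """Targets fragment: instances are selected by tag, never by instance ID."""
    return [{"Key": "tag:Environment", "Values": WINDOWS[name]["environments"]}]


def task_requests(name):
    """One AWS-RunPatchBaseline task per operation, in priority order."""
    return [
        {"TaskArn": "AWS-RunPatchBaseline", "TaskType": "RUN_COMMAND",
         "Priority": priority,
         "TaskInvocationParameters": {
             "RunCommand": {"Parameters": {"Operation": [operation]}}}}
        for priority, operation in enumerate(WINDOWS[name]["operations"], start=1)
    ]


for name in WINDOWS:
    print(name, window_request(name)["Schedule"], target_request(name)[0]["Values"])
```

Because the daily scan window carries only a Scan task, widening or narrowing the install cadence never changes how fresh the compliance data is.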
Resource Data Sync. One sync configuration in each account aggregates inventory and PatchCompliance data to a central S3 bucket (usually in a dedicated audit account). The bucket is organised as s3://compliance-reports/AccountId=<id>/Region=<region>/ResourceType=<type>/. Athena sits over the bucket with partition projection; the monthly query answers SELECT tags['Team'], tags['Environment'], COUNT(*) FROM patch_compliance WHERE status='NON_COMPLIANT' AND classification='Security' AND patch_release_date < current_date - interval '14' day GROUP BY 1, 2. The output lands in a CSV that gets attached to the compliance email.
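As a rough sketch, the per-account sync configuration could be generated from one template so that every account writes to the same bucket. The dictionaries follow the shape of the arguments to SSM's create_resource_data_sync; the sync names and the bucket Region are placeholders, and no call is made here.

```python
ACCOUNTS = ("dev", "staging", "prod")


def data_sync_request(account_alias, bucket="compliance-reports",
                      bucket_region="us-east-1"):
    """Keyword arguments for CreateResourceDataSync in one account.

    bucket_region is a placeholder; use the Region of the central bucket.
    """
    return {
        "SyncName": f"patch-compliance-{account_alias}",   # hypothetical naming
        "S3Destination": {
            "BucketName": bucket,
            "SyncFormat": "JsonSerDe",
            "Region": bucket_region,
        },
    }


sync_requests = {alias: data_sync_request(alias) for alias in ACCOUNTS}

# In each account, with boto3 and credentials available, roughly:
#   boto3.client("ssm").create_resource_data_sync(**sync_requests["prod"])
for alias, request in sync_requests.items():
    print(alias, request["SyncName"], request["S3Destination"]["BucketName"])
```

The central bucket also needs a bucket policy that lets Systems Manager in each source account write to it; that policy is outside this sketch.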
A worked compliance cycle
Monday 10:00 UTC, instance i-0abc in staging tagged Patch Group=nonprod-linux, Environment=staging, Team=payments. The daily scan maintenance window runs:
- Maintenance window resolves targets by tag; i-0abc matches.
- Run Command invokes AWS-RunPatchBaseline with Operation=Scan. The document takes no baseline ID; the nonprod-linux baseline is selected through the instance’s Patch Group tag.
- On the instance, the SSM Agent downloads the baseline rules, runs yum check-update against them, and reports per-patch state: approved-installed, approved-missing, rejected-installed, not-applicable, etc.
- The agent writes PatchComplianceData to Systems Manager; the data appears in the Patch Manager compliance view within minutes.
Tuesday 12:00 UTC, non-prod install window fires:
- Window runs AWS-RunPatchBaseline with Operation=Install. Approved-missing patches install, the instance reboots if required, and post-install the scan runs again.
- Instance state transitions from NON_COMPLIANT to COMPLIANT for any patch that landed successfully.
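The scan-then-install transition can be modelled in a few lines. This is a simplified simulation, not the agent's logic: the state names follow the Patch Manager API style (MISSING, INSTALLED, and so on) instead of the informal labels above, the grouping of states into non-compliant is a simplification, and the package names are invented.

```python
# States that make an instance non-compliant in this simplified model.
NON_COMPLIANT_STATES = {"MISSING", "FAILED", "INSTALLED_REJECTED",
                        "INSTALLED_PENDING_REBOOT"}


def instance_status(patch_states: dict) -> str:
    """Roll per-patch states up into one verdict for the instance."""
    if any(state in NON_COMPLIANT_STATES for state in patch_states.values()):
        return "NON_COMPLIANT"
    return "COMPLIANT"


def apply_install(patch_states: dict, failed=()) -> dict:
    """Simulate Operation=Install: missing patches install unless they fail."""
    after = {}
    for name, state in patch_states.items():
        if state == "MISSING":
            after[name] = "FAILED" if name in failed else "INSTALLED"
        else:
            after[name] = state
    return after


monday_scan = {"openssl": "MISSING", "kernel": "INSTALLED", "cups": "NOT_APPLICABLE"}
tuesday_install = apply_install(monday_scan)
print(instance_status(monday_scan), "->", instance_status(tuesday_install))
```

Passing failed=("openssl",) to apply_install leaves the instance NON_COMPLIANT, which is the case the post-install scan exists to catch.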
Wednesday 02:00 UTC, Resource Data Sync runs:
- Sync picks up the new compliance records and writes them as JSON to S3 under AccountId=…/Region=…/ResourceType=ManagedInstance/data_<timestamp>.json.
- Athena’s partition projection sees the new partition automatically.
Last Friday of the month, compliance engineer runs:
```sql
SELECT tags['Team'] AS team,
       tags['Environment'] AS environment,
       SUM(CASE WHEN status = 'NON_COMPLIANT' THEN 1 ELSE 0 END) AS non_compliant,
       COUNT(*) AS total
FROM patch_compliance
WHERE classification IN ('Security', 'CriticalUpdates', 'SecurityUpdates')
  AND date_diff('day', patch_release_date, current_date) >= 14
GROUP BY 1, 2
ORDER BY 1, 2;
```
Result: one row per Team × Environment with the non-compliant count. CSV attached; screenshot no longer required.
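To check the aggregation logic without an Athena table, the same query shape can be run locally. In this sketch sqlite3 stands in for Athena, the tags map is flattened into team and environment columns, julianday arithmetic replaces date_diff, the report date is fixed instead of using current_date, and every row is invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE patch_compliance (
    instance_id TEXT, team TEXT, environment TEXT,
    status TEXT, classification TEXT, patch_release_date TEXT)""")

rows = [
    ("i-0abc", "payments", "staging", "NON_COMPLIANT", "Security", "2024-01-02"),
    ("i-0abc", "payments", "staging", "COMPLIANT",     "Security", "2024-01-05"),
    ("i-0def", "payments", "prod",    "COMPLIANT",     "Security", "2024-01-02"),
    ("i-0123", "search",   "prod",    "NON_COMPLIANT", "Bugfix",   "2024-01-02"),  # wrong classification
    ("i-0123", "search",   "prod",    "NON_COMPLIANT", "Security", "2024-02-25"),  # under 14 days old
]
conn.executemany("INSERT INTO patch_compliance VALUES (?, ?, ?, ?, ?, ?)", rows)

REPORT_DATE = "2024-03-01"  # fixed so the example is reproducible
report = conn.execute("""
    SELECT team, environment,
           SUM(CASE WHEN status = 'NON_COMPLIANT' THEN 1 ELSE 0 END) AS non_compliant,
           COUNT(*) AS total
    FROM patch_compliance
    WHERE classification IN ('Security', 'CriticalUpdates', 'SecurityUpdates')
      AND julianday(?) - julianday(patch_release_date) >= 14
    GROUP BY team, environment
    ORDER BY team, environment
""", (REPORT_DATE,)).fetchall()

for row in report:
    print(row)
```

The search team produces no row at all here because both of its records are filtered out before grouping; a report that must list every team, including fully compliant ones, needs an outer join against the tag inventory.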
What’s worth remembering
- Baseline + patch group + maintenance window is the triad. The baseline decides what gets approved; the patch group tag binds an instance to a baseline; the maintenance window decides when and which instances. Any of the three can be wrong independently, and the symptoms differ.
- Patch groups are a tag, not a separate object. Patch Group=<name> on the instance; the baseline’s patch group association picks it up. Launch templates should set the tag so new instances inherit the correct baseline automatically.
- ApproveAfterDays is the soak period. Prod baselines set it high enough that non-prod is the canary; non-prod baselines usually set it to zero so they are the canary.
- Scan windows and install windows are separate. Daily scans keep compliance state fresh; install windows only fire on the cadence the change-management process permits. Don’t couple compliance-visibility frequency to install-risk tolerance.
- Resource Data Sync is the export path. Compliance and inventory data lands in S3 as JSON, partitioned by account and Region. Athena (or QuickSight) handles the reporting; the console view is good for spot checks, not for monthly evidence.
- Tags carry through to the report. Instance tags (Team, Environment, CostCenter) appear in the Resource Data Sync output. The monthly report’s group-by dimensions are whatever tags the instances carry; tag discipline at launch is what makes the report useful.
- Hybrid nodes work the same way. Activations register on-premises servers as managed instances; baselines, patch groups, and maintenance windows apply identically once the machine shows up as mi-<id>.
- CVE-specific queries are cheap once the data is in Athena. WHERE patch_id LIKE 'CVE-2028-1234%' over the partition for the past month answers “which hosts received this patch?” in seconds. That is the payoff for spending the morning on Resource Data Sync instead of screenshots.
Patch Manager doesn’t make patching happen by itself; it makes it recordable and reviewable. Four baselines divide the fleet into “soak” and “fast,” patch-group tags bind each instance to the right baseline, maintenance windows run the scan and install jobs on the schedule change management approves, and Resource Data Sync turns every compliance scan into a row in an Athena table. The monthly report stops being a screenshot and starts being a query, and that is the difference between patching that the compliance team believes and patching that they don’t.