How to Patch a Fleet With Systems Manager Patch Manager

August 16, 2027 · 17 min read

The situation

The fleet:

800 EC2 instances across us-east-1, eu-west-1, ap-southeast-1, ap-southeast-2.
150 on-prem VMs in two data centres (London and Frankfurt), reachable over a site-to-site VPN.
Operating systems: Amazon Linux 2023, Ubuntu 22.04 LTS, Windows Server 2022. Roughly equal thirds.
SOC2 obligations: every host’s patch level reported monthly (installed, missing, compliance state against a named baseline); critical CVEs patched within 14 days of vendor publication; no unscheduled reboots during business hours (defined per region).
Three engineers on the platform team. They cannot SSH into 950 hosts to run updates, and they can’t maintain a bespoke per-OS patching pipeline without drowning.

What actually matters

The first question is what does SOC2 actually want to see? Not “we patch regularly”; no auditor accepts that statement. It wants evidence: for every host, on a named baseline, here is its compliance state on a timestamped report. That distinction matters because it eliminates a class of solutions immediately. Cron jobs on each host running dnf upgrade satisfy the patch requirement and fail the evidence requirement; the log on each host is a log, not a fleet-wide report, and the auditor doesn’t want to SSH to 950 machines to assemble one. The mechanism has to produce the artefact the auditor consumes, not just the outcome the hosts need.

The second is what’s the boundary between cloud and on-prem? EC2 instances are already in IAM’s world; on-prem VMs aren’t. Anything that works on EC2 only means a second parallel toolchain for the data-centre side, and a three-engineer team does not survive two parallel toolchains. The mechanism has to reach both, ideally through the same console, the same API, and the same compliance report, with on-prem VMs registering through a hybrid model so they look like managed nodes for patching purposes.

The third is how do three OSes fit into one policy? SOC2’s policy is “critical within 14 days”; that’s OS-agnostic. But the definition of “critical” is vendor-specific, and so is the mechanism for identifying and installing updates. The design pattern is one baseline per OS encoding the same policy, all three registered against the same grouping scheme so the operational loop is the same whether the host is Linux or Windows. One policy, three implementations, one operational rhythm.

The fourth is when can the reboots happen? “No unscheduled reboots during business hours” is the constraint. A global 02:00 UTC lands around midday in Sydney. The answer is per-region maintenance windows, where “02:00” is local to each region rather than UTC. That choice cascades through scheduling: one maintenance-window definition per region per OS-ring, but each one is trivially a cron expression and a target tag.
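The time-zone arithmetic is easy to check. A minimal sketch, assuming fixed standard-time offsets for the business location each region serves; the offsets are illustrative assumptions (AWS does not define a business time zone per region) and daylight saving is ignored:

```python
from datetime import datetime, timedelta, timezone

# Assumed standard-time UTC offsets, in hours, for the business location
# each region serves. Illustrative only; daylight saving is ignored.
REGION_UTC_OFFSET = {
    "us-east-1": -5,       # US Eastern
    "eu-west-1": 0,        # Ireland
    "ap-southeast-1": 8,   # Singapore
    "ap-southeast-2": 10,  # Sydney
}


def local_time_of(utc_hour: int, region: str) -> str:
    """Return the local wall-clock time (HH:MM) for a given hour in UTC."""
    utc = datetime(2024, 1, 9, utc_hour, 0, tzinfo=timezone.utc)
    offset = timezone(timedelta(hours=REGION_UTC_OFFSET[region]))
    return utc.astimezone(offset).strftime("%H:%M")


for region in REGION_UTC_OFFSET:
    print(region, local_time_of(2, region))
```

Under these assumptions a single 02:00 UTC window lands inside the working day in Singapore and Sydney, which is the case for scheduling per region.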

The fifth is how do we avoid patching everything at once? A bad patch that breaks a kernel on Amazon Linux would cascade across 270 hosts in minutes if there’s no staging. The standard answer is rings: a small -early slice patches first (Tuesday), a larger -main slice patches next (Wednesday), production-critical -late patches last (Thursday). A broken Tuesday earns the team a day to catch it and disable the rest of the week. That structure is expressed as group tags, and the ring is orthogonal to the OS, so nine groups total, not a matrix.

The sixth is what else lives in this toolchain? The patching layer sits on top of a generic remote-command layer, which sits on a per-host agent. The same agent that installs patches also runs one-off config changes, pulls inventory, and provides shell access. Investing in the agent everywhere pays back in more than patching; it’s the control plane for a lot of fleet-wide operations. That’s worth knowing when justifying the effort to register 150 on-prem VMs as managed nodes.

Finally: when would a third-party vulnerability-management tool earn its licence? When the estate is genuinely multi-cloud (AWS + Azure + GCP + endpoint laptops), when CVE prioritisation needs richer intelligence than vendor severity (exploit-in-the-wild telemetry, CVSS, asset criticality), or when patch posture is only one of several things reported in a unified compliance pane (CIS benchmarks, STIG, EDR). For 950 mostly-AWS hosts, none of those flips the decision.

What we’ll filter on

Five filters.

Scan and install across 950 mixed hosts, cloud and on-prem. EC2 in four regions plus VMs behind a VPN, Linux and Windows, without two separate toolchains.
Reportable compliance evidence. A file or API call the auditor accepts, per host, per baseline, exportable, time-stamped.
14-day SLA on critical CVEs. The tooling has to express the policy as configuration, not as an engineer’s memory.
Controlled reboot windows. Patching and rebooting happen when the business is asleep.
Low operational overhead. Three engineers, 950 hosts. A solution that scales linearly with host count eats the team.

The patching landscape

SSM Patch Manager. A feature of AWS Systems Manager. SSM Agent on each host, named patch baselines per OS, tag-based grouping, maintenance-window scheduling, continuous compliance reporting. Works on EC2, on-prem servers (via hybrid activations), and edge devices. No extra charge for patching supported OSes on EC2 or on standard-tier on-prem instances.

SSM Run Command with custom scripts. The generic “execute this document on these targets” primitive under Patch Manager. Write your own scan-and-install documents in bash and PowerShell, schedule them, push results into S3 yourself. More flexible. More brittle: you maintain the per-OS update commands, severity parsing, compliance record shape, and reboot logic. Three engineers rewriting yum-security-list output into a compliance report is a bad use of three engineers.

Third-party SaaS tools (Tanium, Qualys, Rapid7, Ivanti). Mature vulnerability-management platforms with agent-based scanning, rich CVE intelligence, cross-cloud and on-prem coverage. Wins when the estate is genuinely heterogeneous, or when the security team needs deeper CVE intelligence than AWS’s severity mapping gives. For a mostly-AWS fleet plus two VPN-reachable data centres, the tool is a licence line-item on top of what AWS gives free.

Per-OS native tools (dnf-automatic, unattended-upgrades, Windows Update). Built in, zero AWS dependency, and each is serviceable at single-host scale. At 950 hosts across three OSes and four regions, the problem isn’t whether updates install, it’s whether you can prove they installed.

Side by side

Option	Cloud + on-prem	Compliance evidence	14-day CVE SLA as config	Reboot windows	Low ops overhead
SSM Patch Manager	✓	✓	✓	✓	✓
SSM Run Command + custom scripts	✓	✓	✓	✓	✗
Third-party SaaS	✓	✓	✓	✓	—
Per-OS native tools	✗	✗	—	—	✗

One survives: SSM Patch Manager, extended to on-prem via hybrid activations.

Matching the fleet to the schedule

One agent, three custom baselines, nine patch groups (OS × ring), per-region maintenance windows with `RebootIfNeeded`, S3 reports as the monthly SOC2 artefact and Security Hub as the live view.

Patch Manager, in depth

The SSM Agent. A small process on every managed node that talks back to the Systems Manager endpoints. Preinstalled on current Amazon Linux 2023 and Windows Server 2022 AMIs, and (as a snap) on the Ubuntu Server 22.04 LTS AMIs that AWS lists as shipping with the agent. Ubuntu images built from other sources may lack it; install via snap (sudo snap install amazon-ssm-agent --classic) or the Debian package. On on-prem VMs it is always a manual install.

Hybrid activations, the on-prem wiring. For the 150 VMs to show up in Systems Manager, each is registered via a hybrid activation: an activation ID and code, generated in the console or API, with an IAM service role attached and an expiry date. Running the agent’s register command on the VM with those values brings it in as a managed node with a mi- prefix. Standard tier is free up to 1,000 hybrid-activated nodes per account per Region; 150 sits comfortably inside. Advanced tier at ~$0.00695 per instance per hour unlocks Session Manager on non-EC2 nodes, which isn’t worth it here. Anywhere the API wants an instance ID, it accepts either i-... or mi-....
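The registration flow can be sketched in a few lines. This is an illustration rather than the team's tooling: the role name, description, and placeholder ID and code are hypothetical, and the first dictionary only mirrors the keyword arguments that boto3's ssm.create_activation accepts (no AWS call is made here).

```python
from datetime import datetime, timedelta, timezone


def activation_request(iam_role: str, registration_limit: int, expires: datetime) -> dict:
    """Keyword arguments shaped like the SSM CreateActivation request."""
    return {
        "Description": "On-prem VMs, London and Frankfurt",  # hypothetical
        "IamRole": iam_role,
        "RegistrationLimit": registration_limit,
        "ExpirationDate": expires,
    }


def register_command(activation_id: str, activation_code: str, region: str) -> str:
    """Shell command to run on the VM once the agent is installed."""
    return (
        f'sudo amazon-ssm-agent -register -id "{activation_id}" '
        f'-code "{activation_code}" -region "{region}"'
    )


def node_kind(node_id: str) -> str:
    """Classify a managed-node ID by its prefix."""
    if node_id.startswith("mi-"):
        return "hybrid"
    if node_id.startswith("i-"):
        return "ec2"
    raise ValueError(f"unrecognised managed-node ID: {node_id}")


request = activation_request(
    iam_role="SSMServiceRole",  # hypothetical role name
    registration_limit=150,
    expires=datetime.now(timezone.utc) + timedelta(days=7),
)
print(request["RegistrationLimit"])
print(register_command("ACTIVATION_ID", "ACTIVATION_CODE", "eu-west-1"))
print(node_kind("mi-0123456789abcdef0"), node_kind("i-0123456789abcdef0"))
```

Which Region the London and Frankfurt VMs register into is a design choice the post does not fix; eu-west-1 above is an assumption.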

Patch baselines. A baseline is the rule-set deciding which patches are approved and how “compliant” is defined. Baselines are per-OS. AWS ships a predefined default for every supported OS. The defaults are useful for getting started and useless for SOC2 evidence, because they report compliance severity as Unspecified.

Predefined baselines in this fleet:

AWS-AmazonLinux2023DefaultPatchBaseline. Security at Critical or Important plus all Bugfix, 7-day auto-approval delay.
AWS-UbuntuDefaultPatchBaseline. Security patches approved immediately; Ubuntu’s published release dates aren’t reliable for the delay mechanism.
AWS-DefaultPatchBaseline (Windows). CriticalUpdates and SecurityUpdates at MSRC severity Critical or Important, 7-day delay.

For SOC2 the team builds custom baselines. Custom baselines let you set a compliance severity per approval rule (CRITICAL, HIGH, MEDIUM, LOW), tighten the auto-approval delay (7-day for Critical, 14-day for Important), and maintain approve-lists and reject-lists by patch ID.
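As a rough illustration, the Amazon Linux 2023 baseline could be expressed as data like this. The dictionary mirrors the request shape of boto3's ssm.create_patch_baseline, but the baseline name and the exact filter values are assumptions for this fleet, not a baseline taken from the post. The last helper is a back-of-envelope check on the 14-day SLA.

```python
def approval_rule(severity: str, compliance_level: str, approve_after_days: int) -> dict:
    """One approval rule: security patches at a given vendor severity."""
    return {
        "PatchFilterGroup": {
            "PatchFilters": [
                {"Key": "CLASSIFICATION", "Values": ["Security"]},
                {"Key": "SEVERITY", "Values": [severity]},
            ]
        },
        "ComplianceLevel": compliance_level,
        "ApproveAfterDays": approve_after_days,
    }


def al2023_baseline_request(rejected_patches=()) -> dict:
    """Keyword arguments shaped like the SSM CreatePatchBaseline request."""
    return {
        "Name": "soc2-al2023",  # hypothetical baseline name
        "OperatingSystem": "AMAZON_LINUX_2023",
        "ApprovalRules": {
            "PatchRules": [
                approval_rule("Critical", "CRITICAL", 7),
                approval_rule("Important", "HIGH", 14),
            ]
        },
        "RejectedPatches": list(rejected_patches),  # the reject-list
    }


def worst_case_days(approve_after_days: int, install_interval_days: int = 7) -> int:
    """Rough upper bound from vendor release to install for one ring.

    Assumes the ring installs once every install_interval_days and that
    the run succeeds; failed runs and repository lag are ignored.
    """
    return approve_after_days + install_interval_days


baseline = al2023_baseline_request()
print(len(baseline["ApprovalRules"]["PatchRules"]))
print(worst_case_days(7))
```

With a 7-day delay and a weekly install per ring, the bound sits right at the 14-day limit, so a shorter delay for Critical buys margin if a run fails.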

Patch groups. A patch group maps a managed node to a specific baseline. Tag-based. Tag key is Patch Group (with a space), or PatchGroup when EC2 instance-metadata tag access is enabled. Tag value is the patch-group name; each baseline is registered with one or more patch-group names.

This fleet slices by OS and rollout ring:

AL2023-early, AL2023-main, AL2023-late
Ubuntu22-early, Ubuntu22-main, Ubuntu22-late
Win2022-early, Win2022-main, Win2022-late

Nine patch groups. Each OS baseline is registered against all three rings for its OS.
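The OS × ring scheme is small enough to generate rather than type. A sketch, with hypothetical baseline IDs; the registration dictionaries mirror the arguments of boto3's ssm.register_patch_baseline_for_patch_group, and nothing here calls AWS:

```python
from itertools import product

OS_PREFIXES = ["AL2023", "Ubuntu22", "Win2022"]
RINGS = ["early", "main", "late"]


def patch_groups() -> list:
    """All OS x ring patch-group names."""
    return [f"{os_name}-{ring}" for os_name, ring in product(OS_PREFIXES, RINGS)]


def patch_group_tag(os_name: str, ring: str, metadata_tags_enabled: bool = False) -> dict:
    """Tag to put on a node; the key loses its space when instance-metadata
    tag access is enabled."""
    key = "PatchGroup" if metadata_tags_enabled else "Patch Group"
    return {"Key": key, "Value": f"{os_name}-{ring}"}


def baseline_registrations(baseline_ids: dict) -> list:
    """One registration per patch group, pointing at that OS's baseline."""
    return [
        {"BaselineId": baseline_ids[os_name], "PatchGroup": f"{os_name}-{ring}"}
        for os_name, ring in product(OS_PREFIXES, RINGS)
    ]


# Hypothetical baseline IDs, standing in for the ones AWS returns on creation.
BASELINE_IDS = {
    "AL2023": "pb-al2023-example",
    "Ubuntu22": "pb-ubuntu22-example",
    "Win2022": "pb-win2022-example",
}

print(patch_groups())
print(patch_group_tag("AL2023", "early"))
print(len(baseline_registrations(BASELINE_IDS)))
```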

Maintenance windows. A named schedule with tasks and targets. Schedules use cron(...) or rate(...) (same syntax as EventBridge). A task is a Run Command document (AWS-RunPatchBaseline for patching), an Automation document, a Lambda, or a Step Function.

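For a four-region fleet: define per-region maintenance windows so “02:00” means 02:00 local time in each region. Schedules are evaluated in UTC unless the window sets a schedule time zone (an IANA name such as Australia/Sydney), so each window carries its own. Inside each region, separate windows per patch-group ring stagger the rollout: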

mw-useast1-al2023-early – cron(0 2 ? * TUE *) in us-east-1. Targets: tag:Patch Group = AL2023-early. Task: AWS-RunPatchBaseline, Operation = Install.
mw-useast1-al2023-main – cron(0 2 ? * WED *). Targets: AL2023-main.
mw-useast1-al2023-late – cron(0 2 ? * THU *). Targets: AL2023-late.

Same skeleton for Ubuntu and Windows, shifted by a day so all three OSes don’t try to reboot on the same night.
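The window definitions are regular enough to generate. A sketch for the Amazon Linux 2023 rings only, leaving out the day shift for Ubuntu and Windows. The dictionaries mirror boto3's ssm.create_maintenance_window and ssm.register_target_with_maintenance_window arguments; the IANA zone chosen per region and the 3-hour duration with a 1-hour cutoff are assumptions.

```python
# Assumed IANA time zone for the business hours each region serves.
REGION_TIMEZONE = {
    "us-east-1": "America/New_York",
    "eu-west-1": "Europe/Dublin",
    "ap-southeast-1": "Asia/Singapore",
    "ap-southeast-2": "Australia/Sydney",
}
RING_DAY = {"early": "TUE", "main": "WED", "late": "THU"}


def window_request(region: str, os_slug: str, ring: str) -> dict:
    """Keyword arguments shaped like the CreateMaintenanceWindow request."""
    return {
        "Name": f"mw-{region.replace('-', '')}-{os_slug}-{ring}",
        "Schedule": f"cron(0 2 ? * {RING_DAY[ring]} *)",
        "ScheduleTimezone": REGION_TIMEZONE[region],  # makes 02:00 local
        "Duration": 3,  # hours the window stays open (assumed)
        "Cutoff": 1,    # stop starting new tasks 1 hour before close (assumed)
        "AllowUnassociatedTargets": False,
    }


def target_request(patch_group: str) -> dict:
    """Target every node whose Patch Group tag matches the ring."""
    return {
        "ResourceType": "INSTANCE",
        "Targets": [{"Key": "tag:Patch Group", "Values": [patch_group]}],
    }


windows = [
    window_request(region, "al2023", ring)
    for region in REGION_TIMEZONE
    for ring in RING_DAY
]
print(len(windows))
print(windows[0]["Name"], windows[0]["Schedule"], windows[0]["ScheduleTimezone"])
print(target_request("AL2023-early"))
```

Each window would be created through the Systems Manager endpoint of its own region, since maintenance windows are regional resources.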

Reboot policy. AWS-RunPatchBaseline takes RebootOption: RebootIfNeeded (default) or NoReboot. RebootIfNeeded reboots inside the maintenance window when a kernel or driver patch requires it, outside business hours by construction. The Operation knob: Scan reports missing patches without installing; Install installs approved patches. Nightly Scan keeps compliance data fresh; weekly Install does the rollout.
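The two knobs map onto the task a maintenance window runs. A sketch of that task definition, shaped like the arguments to boto3's ssm.register_task_with_maintenance_window (the window ID and targets are omitted); the 10% and 5% rate-control defaults are the figures this post uses in its worked cycle, not AWS defaults.

```python
VALID_OPERATIONS = {"Scan", "Install"}
VALID_REBOOT_OPTIONS = {"RebootIfNeeded", "NoReboot"}


def patch_task(operation: str, reboot_option: str = "RebootIfNeeded",
               max_concurrency: str = "10%", max_errors: str = "5%") -> dict:
    """Build the Run Command task that a maintenance window executes."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"Operation must be one of {sorted(VALID_OPERATIONS)}")
    if reboot_option not in VALID_REBOOT_OPTIONS:
        raise ValueError(f"RebootOption must be one of {sorted(VALID_REBOOT_OPTIONS)}")
    return {
        "TaskArn": "AWS-RunPatchBaseline",
        "TaskType": "RUN_COMMAND",
        "MaxConcurrency": max_concurrency,
        "MaxErrors": max_errors,
        "TaskInvocationParameters": {
            "RunCommand": {
                # Run Command document parameters are lists of strings.
                "Parameters": {
                    "Operation": [operation],
                    "RebootOption": [reboot_option],
                }
            }
        },
    }


nightly_scan = patch_task("Scan", reboot_option="NoReboot")
weekly_install = patch_task("Install")
print(weekly_install["TaskInvocationParameters"]["RunCommand"]["Parameters"])
```

Validating the two values locally is a small guard against a typo such as "install" only surfacing when the window fires at 02:00.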

Compliance reporting. Every AWS-RunPatchBaseline run writes a per-host compliance record. Records surface three ways: the Compliance console (real-time COMPLIANT / NON_COMPLIANT view), scheduled CSV reports written to an S3 bucket, and AWS Security Hub findings. S3 is the monthly SOC2 artefact; Security Hub is the continuous view.
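Whichever surface the records come from, the day-to-day question is "how many hosts per group are out of compliance?". A minimal sketch over already-fetched records; the record fields (node_id, patch_group, state) are a hypothetical flattened shape, not the raw API response.

```python
from collections import Counter


def summarise(records: list) -> dict:
    """Count compliance states per patch group."""
    summary = {}
    for record in records:
        counts = summary.setdefault(record["patch_group"], Counter())
        counts[record["state"]] += 1
    return summary


# Hypothetical sample records for illustration.
SAMPLE_RECORDS = [
    {"node_id": "i-0aaa", "patch_group": "AL2023-early", "state": "COMPLIANT"},
    {"node_id": "i-0bbb", "patch_group": "AL2023-early", "state": "NON_COMPLIANT"},
    {"node_id": "mi-0ccc", "patch_group": "Ubuntu22-late", "state": "COMPLIANT"},
]

for group, counts in summarise(SAMPLE_RECORDS).items():
    print(group, dict(counts))
```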

A worked patch cycle

One monthly cycle.

Day 1, 02:00 local in every region. The per-region scan window fires. AWS-RunPatchBaseline with Operation = Scan runs across every managed node. Results land in SSM compliance within a minute or two. Console shows fresh per-host state. No reboots, no installs.

Day 2, 02:00 local, Tuesday. -early install windows fire. Amazon Linux 2023 -early hosts run Install with RebootIfNeeded. Kernel patches reboot the host; userspace patches don’t. If -early caught a broken patch, disable -main and -late windows for the week, add the bad patch ID to the baseline’s reject-list, remediate affected hosts by hand.

Day 3, 02:00 local, Wednesday. -main install windows fire for Amazon Linux 2023, with the Ubuntu and Windows rings following on their shifted nights. Run Command’s rate control caps the blast radius: for example, 10% concurrency with a 5% error threshold.

Day 4, 02:00 local, Thursday. -late fires. Production-critical workloads patched last.

Day 30. Scheduled compliance report runs: CSV per region, one row per host, columns for managed-node ID, patch group, baseline, compliance state, critical missing patches, timestamp. Dropped into s3://acme-soc2-evidence/patching/2026-06/. Every NON_COMPLIANT host has a ticket; every ticket closes within 14 days or becomes a SOC2 finding.

Wall-clock work: review scan results day 1, watch -early day 2, sign off monthly report day 30. Everything else is configuration running on a schedule.
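The day-30 sign-off can lean on a small script that turns the report into a ticket list. A sketch, assuming a simplified CSV whose column names are hypothetical; the real report layout will differ and the mapping would need adjusting.

```python
import csv
import io
from datetime import date, timedelta

# Hypothetical report excerpt; column names are assumptions for illustration.
SAMPLE_REPORT = """node_id,patch_group,baseline,state,critical_missing,reported_on
i-0abc,AL2023-main,soc2-al2023,COMPLIANT,0,2024-06-30
mi-0def,Ubuntu22-late,soc2-ubuntu22,NON_COMPLIANT,2,2024-06-30
"""


def open_tickets(report_csv: str, sla_days: int = 14) -> list:
    """Return one ticket per NON_COMPLIANT host with its close-by date."""
    tickets = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row["state"] != "NON_COMPLIANT":
            continue
        reported = date.fromisoformat(row["reported_on"])
        tickets.append({
            "node_id": row["node_id"],
            "patch_group": row["patch_group"],
            "critical_missing": int(row["critical_missing"]),
            "due": reported + timedelta(days=sla_days),
        })
    return tickets


for ticket in open_tickets(SAMPLE_REPORT):
    print(ticket["node_id"], ticket["due"].isoformat())
```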

When Run Command is still the correct answer

One-off remediation. A CVE landed Friday with no vendor patch; a config-change workaround fixes it. AWS-RunShellScript or AWS-RunPowerShellScript as Run Command, targets by tag, logged in CloudTrail.
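Pre-patch or post-patch hooks. Stop a database cleanly before patching, start it after. Patch Manager supports lifecycle hooks through the AWS-RunPatchBaselineWithHooks document, which runs SSM documents you supply before install, after install, and on exit.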
Compliance for something that isn’t patches. “Every host has this CIS control applied” is a Run Command + Config or Inspector problem.

Patch Manager owns patches; Run Command owns arbitrary “run this on these hosts” work.
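For the one-off case, the request is short. A sketch of the arguments for boto3's ssm.send_command, built as plain data; the shell command is a placeholder, and the tag targeting and rate-control values are assumptions carried over from the patch-group scheme above.

```python
def one_off_command(commands: list, tag_key: str, tag_value: str,
                    windows: bool = False, comment: str = "") -> dict:
    """Keyword arguments shaped like the SSM SendCommand request."""
    return {
        "DocumentName": "AWS-RunPowerShellScript" if windows else "AWS-RunShellScript",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {"commands": list(commands)},
        "Comment": comment,
        "MaxConcurrency": "10%",  # assumed, mirrors the patching rollout
        "MaxErrors": "5%",        # assumed
    }


workaround = one_off_command(
    commands=["echo 'apply the vendor workaround here'"],  # placeholder command
    tag_key="Patch Group",
    tag_value="AL2023-early",
    comment="Interim CVE workaround",
)
print(workaround["DocumentName"], workaround["Targets"])
```

Reusing the ring tags as targets keeps the same early, main, late discipline for workarounds as for patches.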

When the third-party tool earns its licence

Truly heterogeneous estates. AWS + Azure + on-prem hypervisors + developer laptops + network devices.
Vulnerability intelligence deeper than vendor severity. CVSS + CISA KEV + exploit-in-the-wild telemetry + asset criticality.
Unified reporting across configuration and vulnerability posture. CIS, STIG, EDR, patch, FIM in one pane.

SSM Patch Manager fits the AWS-native fleet. Starting with a third-party tool at 950 mostly-AWS hosts is over-spending.

What’s worth remembering

Patch Manager is the AWS-native answer for scan + install + compliance reporting across EC2 and on-prem mixed fleets, free on supported OSes, operated by SSM Agent on each host.
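The SSM Agent is preinstalled on Amazon Linux 2023 and Windows Server 2022 AMIs and, as a snap, on the Ubuntu Server 22.04 LTS AMIs that AWS lists as shipping with it; check custom or third-party images, and always install it manually on on-prem VMs.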
Hybrid activations bring on-prem VMs into Systems Manager with managed-node IDs prefixed mi-. Standard tier free up to 1,000 per account per Region; advanced tier ~$0.00695/instance/hour unlocks Session Manager on non-EC2 nodes.
Predefined baselines report compliance severity as Unspecified. Custom baselines are what SOC2 wants: set CRITICAL/HIGH/MEDIUM compliance severity, tune auto-approval delay, maintain approve-lists and reject-lists by patch ID.
Patch groups are tag-based. Tag key Patch Group or PatchGroup, value is the name, baseline is registered against that name.
Maintenance windows are the scheduler. cron(...) or rate(...), per region so “02:00” is local to the fleet. AWS-RunPatchBaseline with Operation = Scan nightly, Install weekly, RebootOption = RebootIfNeeded to keep reboots inside the window.
Compliance reports export to S3 (scheduled CSVs) and surface in Security Hub. S3 is the monthly SOC2 artefact; Security Hub is the continuous view.
Run Command is the escape hatch for one-off remediation, lifecycle hooks, and anything that isn’t “install vendor patches.”
Third-party tools win when the estate is genuinely multi-cloud, when CVE prioritisation needs richer intelligence than vendor severity, or when patching sits alongside CIS / STIG / EDR in one platform.

The answer: install SSM Agent everywhere it isn’t already (Ubuntu images that lack it via snap, all 150 on-prem VMs via hybrid activations registering as mi- managed nodes); author three custom patch baselines, one per OS, encoding the SOC2 policy as auto-approval delays with CRITICAL / HIGH compliance severities; tag every node with a Patch Group value (AL2023-early / AL2023-main / AL2023-late and the Ubuntu/Windows equivalents); define per-region maintenance windows that fire AWS-RunPatchBaseline with Operation = Install and RebootOption = RebootIfNeeded on the correct ring at 02:00 local, plus a nightly Scan window across everything; wire a scheduled compliance CSV export to S3 as the monthly SOC2 artefact and let Security Hub carry the continuous view. Three engineers, 950 hosts, all patched on time, and every reboot happens when the region’s already asleep.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.