Run Command or Runbook

August 30, 2027 · 17 min read

CloudOps Engineer · SOA-C03 · part of The Exam Room

The situation

A platform team runs about 400 EC2 instances across three Regions for a hundred-odd engineers. The ops backlog has three jobs on it this week.

  • A one-shot security patch went out from the security team: apply a specific yum update to every Amazon Linux host tagged Env=prod, today, and report the exit code of each run. No approvals, no sequencing, just “do this everywhere and tell us what happened.”
  • An AMI rebuild pipeline needs codifying. Take the current golden AMI, launch an instance from it, apply the weekly package updates, run the smoke-test suite, snapshot a new AMI, wait for a named approver to sign off in the console, then copy the AMI to us-east-1 and ap-southeast-2. Six steps, one human gate, one failure mode per step.
  • A nightly log-rotation job needs to run on every host tagged LogRotation=standard, at 02:00 local to each Region, with a report of any failures.

All three are “run something on some instances.” All three could, technically, be crammed into either Systems Manager feature. The question is which shape fits which job.

What actually matters

Before reaching for a specific feature, it’s worth asking what shape of execution each job actually wants, because “run something on some instances” hides at least two different problems, and the wrong shape forces one to mimic the other badly.

The first question to ask about a job is: does the work have a sequence, or is it a single action repeated across a fleet? A patch is a single action repeated across many hosts. An AMI pipeline is a sequence of steps with dependencies between them. A fan-out primitive handles the first cleanly; a flow primitive handles the second. Trying to express a sequence as nested shell-script logic inside one fan-out is how operators end up debugging step ordering through SSH; trying to express “do this on 400 hosts” as a single-step flow misses the parallelism and rate-control story.

The second question: does the work ever leave the instance? If any step needs to call an AWS API directly (creating an image, copying it across Regions, waiting for a resource to reach a state) or invoke compute that isn’t an instance (a Lambda function), that step doesn’t belong in a remote-shell-only primitive. It needs a flow primitive with typed actions for AWS API calls, not just shell-on-host.

The third: does a human ever need to sign off mid-flow? An approval gate isn’t something a fan-out has; it’s intrinsic to the flow shape. If the job has a gate, the whole job becomes a flow, and the instance-side work becomes one step inside it.

The fourth: how much output do we want to keep, and for how long? Fan-out output is per-target and tends to be capped in the API response, with anything longer going to durable storage by configuration. Flow output is per-step and lives in the execution record. Retention and aggregation work differently in each.

The fifth: what triggers this? Some upstream callers (a CloudWatch alarm remediating an issue, a Config rule with an automated fix, an Incident Manager response plan) specifically accept the flow shape as their target, not raw fan-out invocations. If the job needs to be callable from one of those, that pushes it to the flow side even when its work is single-action.

And sixth: what’s the blast radius if it goes wrong? A fan-out has rate control and failure caps but no cross-target rollback. A flow can structure onFailure steps that unwind the earlier work. The more complex the job, the more the unwind plan matters.

What we’ll filter on

  1. Single action vs multi-step sequence: one command on many hosts, or many steps with branches and gates?
  2. Instance-only vs mixed API surface: does the work ever leave the instance and touch the AWS API directly?
  3. Needs a human approval gate: is there a mandatory pause for sign-off mid-run?
  4. Rollback and branching: does the job need onFailure handlers, conditional branches, or wait-for-state logic?
  5. Triggering surface: does an alarm, a Config rule, or Incident Manager need to call this?
  6. Output retention and aggregation: how much per-target output, how centralised, how queryable?
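
The filters above can be sketched as a small decision function. This is only an illustration of the reasoning, not an AWS API: the `Job` fields, the `recurring` flag (taken from the job descriptions rather than from the six filters), and the rule that any single flow signal is enough are all assumptions made for the sketch. Filter 6 (output) is left out because it shapes configuration rather than the choice of feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Answers to the filter questions for one ops job (hypothetical model)."""
    name: str
    multi_step: bool = False            # 1. a sequence rather than a single action
    calls_aws_api: bool = False         # 2. work leaves the instance
    needs_approval: bool = False        # 3. human gate mid-run
    needs_rollback: bool = False        # 4. onFailure / branching / wait-for-state
    runbook_only_trigger: bool = False  # 5. the caller only accepts a runbook
    recurring: bool = False             # assumption: repeats on a changing fleet


def pick_shape(job: Job) -> str:
    """Return the Systems Manager shape the filters point to."""
    flow_signals = (
        job.multi_step,
        job.calls_aws_api,
        job.needs_approval,
        job.needs_rollback,
        job.runbook_only_trigger,
    )
    if any(flow_signals):
        return "Automation runbook"
    if job.recurring:
        return "State Manager association"
    return "Run Command"


jobs = [
    Job("security patch"),
    Job("AMI pipeline", multi_step=True, calls_aws_api=True,
        needs_approval=True, needs_rollback=True),
    Job("nightly log rotation", recurring=True),
]

for job in jobs:
    print(f"{job.name}: {pick_shape(job)}")
```

Treating any one flow signal as decisive mirrors the point made earlier: once a job has a gate or an API call, the whole job becomes a flow and the instance-side work becomes one step inside it.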

The Systems Manager execution landscape

  1. Run Command (ssm:SendCommand). Dispatches one SSM document, almost always AWS-RunShellScript, AWS-RunPowerShellScript, AWS-UpdateSSMAgent, or a custom Command-type document, to a list or tag-expression of managed nodes. Each target runs independently. MaxConcurrency paces how many run at once (absolute like 50 or percentage like 10%); MaxErrors caps how many failures are tolerated before Systems Manager stops sending the command to further targets. Output lands in the InvocationResult, truncated at 2500 characters, or in S3/CloudWatch Logs if configured. Triggered from console, CLI, EventBridge, State Manager, or Maintenance Windows.

  2. Automation runbook (ssm:StartAutomationExecution). Executes a multi-step SSM document of schemaVersion: '0.3'. Steps are typed actions: aws:runCommand wraps Run Command and is the usual way an Automation touches an instance; aws:executeAwsApi calls any AWS API directly; aws:approve pauses for named IAM principals to sign off; aws:branch switches on a previous step’s output; aws:waitForAwsResourceProperty polls until a resource property equals a value; aws:invokeLambdaFunction runs a Lambda; aws:executeScript runs inline Python or PowerShell in the Automation worker (not on an instance). Steps have onFailure (Abort, Continue, step:<name>) and nextStep hooks. Targets can be any resource type, not just EC2.

  3. State Manager association. A scheduled Run Command: “run this document on these targets on this cadence, and I’ll report drift.” Good for the nightly log-rotation job. Still Run Command under the hood; the schedule and desired-state reporting are the State Manager layer on top.

  4. Maintenance Windows. A time-bounded window during which a set of tasks (Run Command, Automation, Lambda, or Step Functions) runs against registered targets, with cutoff behaviour and concurrency. The correct place for “apply this patch between 02:00 and 04:00 local, Tuesday mornings, across the patch group.”

  5. Change Manager. Workflow layer on top of Automation for production changes: template, approval chain, freeze windows, runbook execution. Out of scope here: it’s the governance layer over Automation, not an alternative execution model.

Side by side

| Option | Multi-step sequence | Can call non-SSM APIs | Human approval | Branching / rollback | Triggered by alarm/Config | Output model |
|---|---|---|---|---|---|---|
| Run Command | ✗ | ✗ | ✗ | ✗ | ✓ (EventBridge) | Per-invocation, 2500 chars + S3/CWL |
| Automation runbook | ✓ | ✓ | ✓ | ✓ | ✓ (direct) | Per-step, 30 days in execution record |
| State Manager | ✗ | ✗ | ✗ | ✗ | Scheduled | Compliance + Run Command output |
| Maintenance Window | ✓ (as tasks) | via Automation | via Automation | via Automation | Scheduled | Per-task |
| Change Manager | ✓ (wraps Automation) | | ✓ (chain) | | | Per-change |

Reading by job:

  • Security patch: one action, every host, report exit codes. Run Command with AWS-RunShellScript, targets by Env=prod tag, MaxConcurrency=20%, MaxErrors=5%, output to S3. Ten minutes of work.
  • AMI pipeline: six steps, one approval, multiple AWS APIs, cross-Region copy. Automation runbook. A Run Command step inside the runbook does the on-instance package update; everything else is typed AWS actions (aws:runInstances, aws:createImage, aws:executeAwsApi) and aws:approve.
  • Nightly log rotation: one action repeated on a fleet. State Manager association pointing at a Run Command document, with a cron expression per Region.

The two shapes side by side

[Figure: the two shapes side by side]

Run Command: one document, many targets, parallel
  • AWS-RunShellScript, MaxConcurrency=20%, MaxErrors=5%, target: Env=prod
  • Targets: i-0a1b… prod, i-0c3d… prod, i-0e5f… prod, i-0g7h… prod
  • Per-invocation output: 2500-char cap + S3 bucket + CWL group; status aggregated at command level
  • No sequencing, no approvals, no cross-target rollback

Automation runbook: typed steps, branching, approvals, any API
  1. aws:runCommand: yum update on baking instance
  2. aws:executeAwsApi: ec2:CreateImage -> AMI id
  3. aws:waitForAwsResourceProperty: Image.State == available
  4. aws:approve: named IAM principals sign off
  5. aws:executeAwsApi: ec2:CopyImage x 2 Regions
  6. aws:branch (onFailure): cleanup step if any earlier step failed
  • Steps sequenced, a human gate, multiple APIs, rollback path

Run Command is a fan-out; Automation is a flow. When a job needs both (say, patching a thousand hosts as one step of a larger pipeline), Automation calls Run Command and owns the orchestration.

Run Command in depth

The SendCommand call shape is small and has three important levers beyond “which document.”

aws ssm send-command \
  --document-name AWS-RunShellScript \
  --document-version '$DEFAULT' \
  --targets 'Key=tag:Env,Values=prod' 'Key=tag:OS,Values=amazon-linux-2023' \
  --parameters 'commands=["sudo yum update -y --security --setopt=deltarpm=false"]' \
  --max-concurrency '20%' \
  --max-errors '5%' \
  --output-s3-bucket-name acme-ssm-command-output \
  --output-s3-key-prefix security-patch/2027-05/ \
  --cloud-watch-output-config 'CloudWatchLogGroupName=/aws/ssm/run-command,CloudWatchOutputEnabled=true' \
  --comment 'Security patch SEC-2027-18'

--targets takes a tag expression or explicit instance IDs. Tag-based targeting is the one worth practising: the set of hosts is computed at dispatch time, so new instances that come up later are not retrospectively patched by an already-dispatched command. For the rolling case, use a State Manager association (same document, same tags, schedule). --max-concurrency '20%' says “no more than one in five targets running at a time”; --max-errors '5%' says “stop sending the command to new targets once more than one in twenty has failed” (invocations already in flight are allowed to finish). --output-s3-bucket-name is what makes the output story tolerable; without durable output in S3 or CloudWatch Logs, anything past the first 2500 characters is lost.
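
To make the rate and error levers concrete, here is a minimal Python sketch of the arithmetic for the 400-instance fleet. It does not call AWS. The helper names are hypothetical, the percentage rounding (rounding down) is an assumption the text above does not cover, and the sketch assumes all 400 instances match the tags. The `Targets` structure mirrors the Key/Values pairs passed to `--targets`.

```python
import json


def resolve_limit(spec: str, fleet_size: int) -> int:
    """Turn '20%' or '50' into an absolute count for a fleet.

    Assumption: percentages are rounded down here; the service's exact
    rounding is not specified in the article.
    """
    if spec.endswith("%"):
        return fleet_size * int(spec[:-1]) // 100
    return int(spec)


def tag_targets(tags: dict) -> list:
    """Build a Targets list: one {'Key': 'tag:<name>', 'Values': [...]} per tag."""
    return [{"Key": f"tag:{key}", "Values": [value]} for key, value in tags.items()]


def dispatch_stops(failed: int, max_errors: int) -> bool:
    """Dispatch to new targets stops once failures exceed the threshold."""
    return failed > max_errors


fleet_size = 400  # scenario fleet; assumes every instance matches the tags
request = {
    "DocumentName": "AWS-RunShellScript",
    "Targets": tag_targets({"Env": "prod", "OS": "amazon-linux-2023"}),
    "MaxConcurrency": "20%",
    "MaxErrors": "5%",
}

concurrent = resolve_limit(request["MaxConcurrency"], fleet_size)
tolerated = resolve_limit(request["MaxErrors"], fleet_size)

print(json.dumps(request["Targets"]))
print("running at once:", concurrent)      # 80 of 400
print("tolerated failures:", tolerated)    # 20 of 400
print("stop after 21 failures:", dispatch_stops(21, tolerated))
```

With these numbers the patch moves through the fleet in batches of at most 80, and the twenty-first failure is the one that halts further dispatch.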

IAM on the calling principal needs ssm:SendCommand scoped to the document ARN and to the instance ARNs (with tag conditions if desired). The instance profile still needs AmazonSSMManagedInstanceCore. Run Command rides the same agent channel as Session Manager and Inventory.
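
A caller policy of that shape might look like the following sketch, built as a Python dictionary so it can be printed as JSON. The Region, the account ID (the placeholder 111122223333), and the tag values are illustrative assumptions; adapt the ARNs and the condition to the real fleet before using anything like it.

```python
import json


def send_command_policy(region: str, account_id: str, document: str,
                        tag_key: str, tag_value: str) -> dict:
    """Allow ssm:SendCommand for one AWS-owned document on tagged instances."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Which document may be sent. AWS-owned documents carry no
                # account ID in their ARN.
                "Effect": "Allow",
                "Action": "ssm:SendCommand",
                "Resource": f"arn:aws:ssm:{region}::document/{document}",
            },
            {
                # Which instances may receive it, narrowed by resource tag.
                "Effect": "Allow",
                "Action": "ssm:SendCommand",
                "Resource": f"arn:aws:ec2:{region}:{account_id}:instance/*",
                "Condition": {
                    "StringLike": {f"ssm:resourceTag/{tag_key}": [tag_value]}
                },
            },
        ],
    }


policy = send_command_policy(
    region="eu-west-1",
    account_id="111122223333",  # placeholder account
    document="AWS-RunShellScript",
    tag_key="Env",
    tag_value="prod",
)
print(json.dumps(policy, indent=2))
```

Splitting the document and the instances into two statements keeps the tag condition on the instance resource only, since the document ARN has no such tag to match.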

Run Command’s sweet spot is the first job: a one-shot action on a tagged fleet, with rate and error control and output going somewhere durable.

Automation in depth

An Automation runbook is a schemaVersion: '0.3' document with a mainSteps list. Each step has a name, an action (the step type), a set of inputs, and optional outputs, onFailure, and nextStep.

The AMI pipeline as a runbook excerpt:

schemaVersion: '0.3'
description: Build, approve, and propagate the weekly golden AMI.
assumeRole: '{{ AutomationAssumeRole }}'
parameters:
  BaseAmiId: { type: String }
  AutomationAssumeRole: { type: String }
mainSteps:
  - name: launchBakingInstance
    action: aws:runInstances
    inputs:
      ImageId: '{{ BaseAmiId }}'
      InstanceType: m6i.large
      IamInstanceProfileName: ami-baker
      MaxInstanceCount: 1
      MinInstanceCount: 1
    outputs:
      - { Name: InstanceId, Selector: $.InstanceIds[0], Type: String }

  - name: applyUpdates
    action: aws:runCommand
    inputs:
      DocumentName: AWS-RunShellScript
      InstanceIds: ['{{ launchBakingInstance.InstanceId }}']
      Parameters:
        commands: ['sudo yum update -y && sudo /opt/smoke/run.sh']

  - name: createImage
    action: aws:createImage
    inputs:
      InstanceId: '{{ launchBakingInstance.InstanceId }}'
      ImageName: 'golden-{{ global:DATE_TIME }}'
      NoReboot: false

  - name: approveCopy
    action: aws:approve
    inputs:
      Approvers: ['arn:aws:iam::111122223333:role/AmiApprover']
      Message: 'Approve copy of {{ createImage.ImageId }} to other Regions?'

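  - name: copyToUsEast1
    action: aws:executeAwsApi
    inputs:
      Service: ec2
      Api: CopyImage
      SourceImageId: '{{ createImage.ImageId }}'
      SourceRegion: eu-west-1
      # CopyImage has to be issued in the destination Region. Verify that this
      # step really targets us-east-1; if it cannot, make the call from
      # aws:executeScript with a Region-specific client instead.
      Region: us-east-1
      Name: 'golden-{{ global:DATE_TIME }}'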

  - name: terminateBaker
    action: aws:changeInstanceState
    inputs:
      InstanceIds: ['{{ launchBakingInstance.InstanceId }}']
      DesiredState: terminated
    isEnd: true

Three things earn their keep in that shape. aws:runCommand is one step in a larger flow, not the flow itself. aws:approve is a real human gate: execution pauses, the named IAM principals receive an approval request, and the runbook resumes only when the required number have approved. And assumeRole at the top sets the IAM identity the Automation service will use for every step’s AWS calls; scoping that role to the minimum needed (ec2:RunInstances, ec2:CreateImage, ec2:CopyImage, ec2:TerminateInstances, ssm:SendCommand, and the iam:PassRole grants required) is how the runbook stays least-privilege.

IAM for the caller needs ssm:StartAutomationExecution on the document; the runbook’s own execution then uses AutomationAssumeRole. Two IAM layers, not one, because the person who starts the runbook is rarely the identity that should be creating AMIs.
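
The control flow that onFailure, nextStep, and isEnd give a runbook can be modelled in a few lines of Python. This is a toy interpreter for illustration only, not how the Automation service is implemented: `run_steps` and the `outcomes` mapping are hypothetical, there is no loop protection, and the onFailure handlers on the sample steps are an assumed addition that the excerpt above does not contain.

```python
from typing import Optional


def run_steps(steps: list, outcomes: dict) -> list:
    """Walk mainSteps-style dicts and return the names of the steps that ran.

    Each step has 'name' and may have 'onFailure' ('Abort', 'Continue', or
    'step:<name>'), 'nextStep', and 'isEnd'. outcomes[name] is True when the
    step succeeds (missing names count as success).
    """
    order = [step["name"] for step in steps]
    by_name = {step["name"]: step for step in steps}
    ran = []
    current: Optional[str] = order[0] if order else None

    while current is not None:
        step = by_name[current]
        ran.append(current)
        index = order.index(current)
        following = order[index + 1] if index + 1 < len(order) else None

        if outcomes.get(current, True):
            if step.get("isEnd"):
                break
            current = step.get("nextStep", following)
            continue

        on_failure = step.get("onFailure", "Abort")
        if on_failure == "Abort":
            break
        if on_failure == "Continue":
            current = following
        else:
            # 'step:<name>' jumps to the named handler step.
            current = on_failure.split(":", 1)[1]

    return ran


pipeline = [
    {"name": "launchBakingInstance"},
    {"name": "applyUpdates", "onFailure": "step:terminateBaker"},
    {"name": "createImage", "onFailure": "step:terminateBaker"},
    {"name": "approveCopy"},
    {"name": "copyToUsEast1"},
    {"name": "terminateBaker", "isEnd": True},
]

print(run_steps(pipeline, {}))
print(run_steps(pipeline, {"applyUpdates": False}))
```

In the second run the failed update jumps straight to terminateBaker, so the baking instance is cleaned up without an image ever being created or copied. That unwind path is what a bare Run Command dispatch has no place to express.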

The nightly log rotation

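State Manager is the correct answer here because the job is “this same command, forever, on this changing fleet.” Association schedules are evaluated in UTC, so the hour in the cron expression has to be shifted per Region to land on 02:00 local; the example below fires at 02:00 UTC. One association per Region: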

aws ssm create-association \
  --name AWS-RunShellScript \
  --targets 'Key=tag:LogRotation,Values=standard' \
  --parameters 'commands=["sudo /usr/sbin/logrotate /etc/logrotate.conf"]' \
  --schedule-expression 'cron(0 2 * * ? *)' \
  --compliance-severity MEDIUM \
  --max-concurrency '10%' \
  --max-errors '10%' \
  --output-location 'S3Location={OutputS3BucketName=acme-ssm-assoc,OutputS3KeyPrefix=logrotate/}'

Two things State Manager does that Run Command on its own doesn’t. It evaluates compliance: a target that’s been missed is reported non-compliant, which rolls up into Config and Security Hub. And new instances that match the target tag at the next evaluation are picked up automatically, so the fleet can grow and shrink without touching the schedule.
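
Because the job asks for 02:00 local to each Region, each association needs its own hour. A minimal sketch of that conversion follows, assuming association cron expressions are evaluated in UTC. The Region-to-offset table is hypothetical: it uses fixed standard-time offsets and ignores daylight saving, so a real deployment would need to decide how to handle the clock changes.

```python
def utc_cron_for_local_hour(local_hour: int, utc_offset_hours: int) -> str:
    """Return a daily cron expression that fires at local_hour in UTC terms."""
    utc_hour = (local_hour - utc_offset_hours) % 24
    return f"cron(0 {utc_hour} * * ? *)"


# Assumed standard-time offsets from UTC, ignoring daylight saving.
REGION_UTC_OFFSETS = {
    "eu-west-1": 0,
    "us-east-1": -5,
    "ap-southeast-2": 10,
}

schedules = {
    region: utc_cron_for_local_hour(2, offset)
    for region, offset in REGION_UTC_OFFSETS.items()
}

for region, expression in schedules.items():
    print(f"{region}: {expression}")
```

Each expression would then be passed as the --schedule-expression of that Region's association, with everything else in the command unchanged.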

What’s worth remembering

  1. Run Command is a fan-out, Automation is a flow. One document to many targets vs many typed steps that can call any AWS API, wait for approvals, branch, and unwind on failure.
  2. Automation calls Run Command, not the other way round. When a multi-step job has an instance-side step, aws:runCommand is that step. Run Command cannot call an Automation runbook.
  3. aws:approve is the approval gate. The whole reason to promote a job from Run Command to Automation is often that single step: the mandatory human sign-off mid-flow.
  4. Output discipline matters. Run Command truncates at 2500 characters in the API response; always configure S3 and/or CloudWatch Logs output for anything that might emit more, and for anything that needs to survive the 30-day execution-record window.
  5. State Manager is scheduled Run Command with compliance. The nightly log-rotation shape. The schedule, the drift reporting, and the auto-inclusion of new matching targets are the value on top of raw SendCommand.
  6. Maintenance Windows wrap both. When the operational constraint is “only between 02:00 and 04:00 on patch Tuesday,” the Window owns the timing; Run Command or Automation is the task it launches.
  7. Two IAM layers for Automation. The caller needs ssm:StartAutomationExecution; the runbook needs an assumeRole with permission for every API its steps touch. Scoping the latter tight is how least-privilege survives a ten-step flow.
  8. Tag expressions are the fleet addressing. tag:Env=prod, tag:OS=amazon-linux-2023, tag:LogRotation=standard. The fleet’s shape in the JSON matches the fleet’s shape on the tags, and the fleet’s shape on the tags is the source of truth.

Run Command is the correct shape when the job is a single action and the fleet is the variable. Automation is the correct shape when the job is itself the variable (multiple steps, mixed APIs, human gates, rollback paths), even if the fleet is a single instance. Picking the wrong one means either inventing orchestration in a shell script (Run Command stretched too far) or wrapping a one-shot patch in a six-step runbook nobody reads (Automation stretched too wide). The shape of the job picks the tool, not the other way round.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.