How to Automate Incident-Response Runbooks With Step Functions and SSM

August 28, 2028 · 16 min read

DevOps Engineer Pro · DOP-C02 · part of The Exam Room

The situation

A security operations team responds to ~30 GuardDuty findings a month. The shape is consistent:

  • Finding arrives in Security Hub via GuardDuty integration.
  • Severity 7+ findings page the on-call; severity 4-6 land in a Slack channel for next-business-day review.
  • For severity 7+ EC2 findings, the runbook is:
    1. Identify the instance and its IAM instance profile.
    2. Change the instance’s security group to an “isolation” SG with no ingress/egress.
    3. Snapshot all attached EBS volumes.
    4. Tag the instance IncidentStatus=Quarantined with the GuardDuty finding ID.
    5. Revoke active sessions for the instance profile’s IAM role.
    6. Capture memory and process list via SSM Run Command before isolation fully takes effect.
    7. Create a ticket in the incident tracker with context.
    8. Notify the SOC channel with the finding and actions taken.
    9. Wait for SOC acknowledgement (human approval).
    10. If approved, mark the finding as investigated; if rejected, un-isolate.
    11. Record the outcome against the finding ID.
    12. If the snapshots sit unreviewed for more than 4 hours, page the SOC lead.

Problems:

  • Consistency: step 6 (memory capture) gets skipped because “the instance is already isolated.”
  • Speed: ~20 minutes elapsed time even when performed correctly.
  • Audit: the record of what happened is a mix of Slack messages, CloudTrail events, and memory.
  • Error recovery: if step 3 fails halfway, the engineer has to figure out which snapshots exist and resume from partial state.

What actually matters

An incident-response runbook turned into code should:

  • Execute deterministically. Same input, same sequence of actions, same outcome. No “I skipped step 6.”
  • Branch on findings. GuardDuty emits many finding types; the response is different per type. The automation routes to the correct branch.
  • Pause for human approval. Some actions (un-isolating, destroying snapshots) require a human in the loop. The workflow waits until the human acts.
  • Retry and compensate. A snapshot call that fails once retries; a snapshot that succeeds partially leaves a record so resumption doesn’t double-snapshot.
  • Produce an audit trail. Every step, every input, every output, who approved what, recorded for the incident record.
  • Time out and escalate. A human approval that doesn’t arrive in four hours escalates to the SOC lead, not the original on-call.

A state-machine model maps to this shape directly. A declarative document describes states, transitions, inputs, and outputs; task states invoke service APIs; choice states branch; wait states pause; parallel and map states fan out and iterate. The orchestrator owns retry and timeout; the task code stays small.

The connection from finding to workflow is event-driven: a rule on the finding’s event pattern targets the state machine with the finding as input, and the workflow starts within seconds.

The decisions worth making:

  • Long-running vs high-throughput workflow. Long-running workflows handle hour-or-day waits with full per-execution history. High-throughput workflows are short, at-least-once, and cheap. Incident response wants the long-running shape because approvals take hours.
  • Human approval pattern. A task-token primitive is the mechanism: the task publishes a token, the state machine pauses, an external process returns the token with success or failure when the approval arrives. Approvals typically route through interactive chat messages or a small approval web page.
  • Error handling granularity. Each task can have automatic retry with backoff on specified errors, and catch clauses that branch to compensation states on failure. Building the catches correctly means partial failures don’t leave the system in weird states.
  • Idempotency. Each task must be safe to run twice; the orchestrator retries, and a snapshot that’s already been taken shouldn’t be re-taken. Client tokens, conditional-create patterns, and check-before-act handle this.

What we’ll filter on

Ranking the options:

  1. Deterministic execution. Same input, same actions.
  2. Human-in-the-loop approvals with timeout and escalation.
  3. Auditable trail at step granularity.
  4. Error handling and resumption of partial failures.
  5. Integration with EventBridge and other AWS event sources.

The incident-automation landscape

1. Confluence runbook + human. Status quo. Fails determinism, audit, and error recovery.

2. Shell scripts in a repo. Engineer runs python isolate-instance.py <id> against the finding. Better than console clicks; still serial human judgement, no parallel branches, no approval pattern, no resumption.

3. Lambda chain via EventBridge. A sequence of Lambda functions, each triggering the next via an EventBridge rule or SNS message. Works for simple flows; quickly becomes unmanageable with branches, approvals, and error recovery. “I don’t know which Lambda is stuck” is the common failure mode.

4. AWS Systems Manager Automation runbook. SSM has its own runbook syntax (YAML, similar scope to Step Functions) with built-in actions for common AWS operations. Strong for fleet operations (patch, restart, configure); weaker than Step Functions for complex branching and long-running human approvals.

5. Step Functions Standard workflow. State machine in ASL, Task states invoke Lambda (or direct AWS service integrations), Choice states branch on finding type, Wait and WaitForTaskToken handle approvals, Map iterates over resources (e.g. multiple EBS volumes per instance), Catch handles errors. Full execution history visible per-run.

6. Step Functions + SSM Automation + EventBridge (layered). Step Functions orchestrates the high-level incident-response flow; calls out to SSM Automation runbooks for specific fleet operations (e.g. patching a batch of instances); EventBridge wires everything to GuardDuty. Covers breadth of incident responses without reinventing SSM’s fleet tooling.

Side by side

Option | Deterministic | Human approvals | Audit trail | Error handling | EventBridge integration
Confluence + human | | Manual | Partial | Manual |
Shell scripts | Partial | Manual | Partial | Manual | Partial
Lambda chain | Partial | Custom | Partial | Limited |
SSM Automation | | ✓ (approval step) | | |
Step Functions Standard | | ✓ (waitForTaskToken) | | ✓ (Retry/Catch) |
SF + SSM + EB layered | | | | |

Step Functions Standard is the orchestrator for multi-step incident response with approvals. SSM Automation handles fleet-level sub-operations from within a Step Functions task.

The incident-response state machine

[Diagram: the incident-response state machine. Main flow: EventBridge trigger (GuardDuty finding); Choice on finding type (EC2 / IAM / S3 / …); IsolateSecurityGroup (ModifyInstanceAttribute); Map over volumes (CreateSnapshot × N); CaptureMemory (SSM Run Command); Parallel (tag, notify, ticket); WaitForTaskToken (SOC approval, 4h timeout); Choice on approval (yes / no / timeout); RecordOutcome (Security Hub + DynamoDB); Success. Side branches: RevokeSessions (revoke role credentials); un-isolate on SOC reject (false positive); EscalateTimeout (page SOC lead); IAMBranch (rotate keys, disable user). Error handling via Catch clauses: CatchAllErrors (States.TaskFailed, States.Timeout); CompensationState (tag finding Failed, page on-call); LogExecutionFailure (CloudWatch Logs, S3 archive); Failed end state.]
The runbook becomes a state machine: deterministic sequence, parallel where safe, pause for approval, escalate on timeout, compensate on error.

The pick in depth

The state machine (simplified ASL excerpt).

{
  "Comment": "Simplified excerpt: IAMBranch, UnhandledType, CaptureMemory, ApprovalChoice, EscalateTimeout and CompensationState are omitted",
  "StartAt": "ClassifyFinding",
  "States": {
    "ClassifyFinding": {
      "Type": "Choice",
      "Choices": [
        {"Variable": "$.finding.type", "StringMatches": "*:EC2/*", "Next": "EC2Branch"},
        {"Variable": "$.finding.type", "StringMatches": "*:IAMUser/*", "Next": "IAMBranch"}
      ],
      "Default": "UnhandledType"
    },
    "EC2Branch": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {"FunctionName": "isolate-sg", "Payload.$": "$"},
      "ResultPath": "$.isolation",
      "Retry": [{"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2}],
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "CompensationState"}],
      "Next": "SnapshotVolumes"
    },
    "SnapshotVolumes": {
      "Type": "Map",
      "ItemsPath": "$.finding.resource.instanceDetails.blockDeviceMappings",
      "Parameters": {
        "volumeId.$": "$$.Map.Item.Value.ebs.volumeId",
        "findingId.$": "$.finding.id"
      },
      "Iterator": {
        "StartAt": "CreateSnapshot",
        "States": {
          "CreateSnapshot": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:ec2:createSnapshot",
            "Parameters": {
              "VolumeId.$": "$.volumeId",
              "Description.$": "States.Format('IR-{}', $.findingId)",
              "TagSpecifications": [...]
            },
            "End": true
          }
        }
      },
      "ResultPath": "$.snapshots",
      "Next": "CaptureMemory"
    },
    "SOCApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "post-slack-approval",
        "Payload": {"finding.$": "$", "taskToken.$": "$$.Task.Token"}
      },
      "ResultPath": "$.approval",
      "TimeoutSeconds": 14400,
      "Catch": [
        {"ErrorEquals": ["States.Timeout"], "Next": "EscalateTimeout"},
        {"ErrorEquals": ["States.ALL"], "Next": "CompensationState"}
      ],
      "Next": "ApprovalChoice"
    }
  }
}

Each task invokes a small Lambda (or direct AWS SDK integration); each Task has Retry for transient errors and Catch to route to compensation. The Map state iterates over the EBS volumes attached to the instance, snapshotting in parallel; a 4-volume instance snapshots concurrently rather than serially.
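To make the Retry settings concrete, here is a small worked calculation of the waits they produce. It assumes the documented Step Functions behaviour (the first retry waits IntervalSeconds, each later retry multiplies the wait by BackoffRate, and MaxAttempts counts retries) and ignores MaxDelaySeconds and jitter; the function name is only illustrative.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry, ignoring MaxDelaySeconds and jitter."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# Values from the EC2Branch Retry block in the excerpt.
delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2)
print(delays)       # [2, 4, 8]
print(sum(delays))  # 14
```

So a persistently failing isolate-sg call reaches its Catch clause after roughly 14 seconds of waiting plus the run time of the four attempts.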

The human approval pattern. waitForTaskToken makes the task state pause until an external caller hands the token back. The post-slack-approval Lambda posts an interactive message to the SOC channel with Approve / Reject buttons; when the SOC engineer clicks, Slack invokes an API Gateway endpoint backed by another Lambda that calls SendTaskSuccess with the decision (approved or rejected) as output, or SendTaskFailure if the callback itself goes wrong. The state machine resumes as soon as the token is returned.
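The callback half of this pattern can be sketched as a small function. This is a minimal sketch, not the author's implementation: the `action` payload shape and the `InvalidApprovalPayload` error name are assumptions, and Slack signature verification is assumed to have happened already. `sfn_client` is expected to be a boto3 Step Functions client, whose `send_task_success` and `send_task_failure` calls are used here.

```python
import json


def handle_approval(action, sfn_client):
    """Return a task token to Step Functions based on a parsed button click.

    `action` is a hypothetical, already-verified payload:
    {"task_token": "...", "decision": "approve" | "reject", "user": "..."}
    """
    token = action["task_token"]
    decision = action.get("decision")
    if decision in ("approve", "reject"):
        # Both outcomes are successful callbacks; a later Choice state
        # branches on the "approved" flag in the output.
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({
                "approved": decision == "approve",
                "approver": action.get("user", "unknown"),
            }),
        )
        return "resumed"
    # Anything else is a malformed callback: fail the task so Catch can react.
    sfn_client.send_task_failure(
        taskToken=token,
        error="InvalidApprovalPayload",
        cause=f"Unexpected decision: {decision!r}",
    )
    return "failed"
```

Inside a Lambda this would be called with `boto3.client("stepfunctions")`. Passing the client in keeps the function testable with a stub.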

Timeout is critical. TimeoutSeconds: 14400 gives the SOC four hours. If no button is clicked by then, the task fails with States.Timeout and the Catch routes to EscalateTimeout, which pages the SOC lead. The runbook doesn’t wait forever.

Error handling and compensation. The CompensationState is the “roll back what we can” state. If isolation succeeded but snapshots failed, the compensation might untag and un-isolate (if the error suggests the finding was not real), or it might leave isolation in place and just page the on-call (if the finding is real but snapshotting broke). The correct answer is state-dependent. A Catch whose ResultPath attaches the error to the original input, together with a record of the last successful step, gives the compensation state what it needs to decide.
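The decision inside the compensation state could look something like the sketch below. Everything specific in it is an assumption for illustration: the `completedSteps` list, the action names, and the error names in `FALSE_POSITIVE_ERRORS` are placeholders rather than real AWS error codes. The only real convention relied on is that a Catch clause hands over an object with `Error` and `Cause` fields.

```python
# Hypothetical error names a team might treat as "the finding was not real";
# placeholders only, not real AWS error codes.
FALSE_POSITIVE_ERRORS = {"TargetNotFound", "FindingArchived"}


def plan_compensation(state_input):
    """Pick compensation actions from the caught error and the steps already done.

    Assumes the Catch clause stored the error under "error" (fields "Error" and
    "Cause") and that earlier states appended their names to "completedSteps".
    """
    error_name = state_input.get("error", {}).get("Error", "Unknown")
    completed = state_input.get("completedSteps", [])
    actions = []
    if "IsolateSecurityGroup" in completed and error_name in FALSE_POSITIVE_ERRORS:
        # Isolation happened but the finding looks bogus: undo it.
        actions += ["untag-instance", "restore-security-groups"]
    # In every other case isolation (if any) stays in place.
    actions += ["tag-finding-failed", "page-on-call"]
    return actions
```

A snapshot failure on a real finding therefore yields only the tag-and-page actions, leaving the instance isolated.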

Idempotency per Lambda. isolate-sg checks whether the instance is already in the isolation SG before modifying it; tag-resource is naturally idempotent. Snapshot creation is the awkward one: the EC2 CreateSnapshot API accepts no client token, so a retry-safe create-snapshot step looks for an existing snapshot tagged with the finding ID and volume ID before creating another. With every task safe to retry, the state machine doesn’t have to think about “did the previous call already succeed?”
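A check-before-act version of the isolation step might look like this. It is a sketch under assumptions: the function name and return shape are invented for illustration, and `ec2_client` is expected to be a boto3 EC2 client (only `describe_instances` and `modify_instance_attribute` are used). It also assumes the instance-level `Groups` attribute is sufficient, which may not hold for instances with several network interfaces.

```python
def isolate_instance(ec2_client, instance_id, isolation_sg_id):
    """Move an instance into the isolation SG unless it is already there."""
    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    current = sorted(g["GroupId"] for g in instance.get("SecurityGroups", []))
    if current == [isolation_sg_id]:
        # A retry after an earlier success lands here and changes nothing.
        return {"changed": False, "previousGroups": current}
    ec2_client.modify_instance_attribute(InstanceId=instance_id, Groups=[isolation_sg_id])
    # previousGroups is what an un-isolate step would need to restore;
    # persist it only when changed is True.
    return {"changed": True, "previousGroups": current}
```

Running it twice against the same instance results in one modify call, which is the property the orchestrator's Retry depends on.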

EventBridge wiring

An EventBridge rule in the security account:

{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "severity": [{"numeric": [">=", 7]}],
    "service": {"resourceRole": ["TARGET"]}
  }
}

Target: the Step Functions state machine IncidentResponseEC2, with an input transformer that reshapes the event into the state-machine input. Input transformers matter here because the GuardDuty event shape is verbose; the state machine accepts a simplified input with just the fields the workflow cares about. A separate rule archives the raw EventBridge event to S3 for the audit trail, typically through a Firehose delivery stream.
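To pin down what “simplified input” means, the reshaping can be written as a plain Python function and used in unit tests for the state-machine input. This is not the input-transformer syntax itself, and the chosen field set is an assumption; the source paths (`detail.id`, `detail.type`, `detail.severity`, `detail.resource.instanceDetails.instanceId`) follow the GuardDuty finding event layout.

```python
def simplify_finding_event(event):
    """Reduce a GuardDuty EventBridge event to the fields the workflow reads."""
    detail = event["detail"]
    instance = detail.get("resource", {}).get("instanceDetails", {})
    return {
        "finding": {
            "id": detail["id"],
            "type": detail["type"],
            "severity": detail["severity"],
            "accountId": detail.get("accountId"),
            "region": detail.get("region"),
            # Keep the nesting the state machine's JSONPaths expect.
            "resource": {"instanceDetails": {"instanceId": instance.get("instanceId")}},
        }
    }
```

The Choice state's `$.finding.type` and the snapshot description's `$.finding.id` both resolve against this shape.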

Cross-account is common: GuardDuty runs in an audit/security account, the Step Functions state machine lives in an incident-response account, and the resources being isolated live in workload accounts. EventBridge supports cross-account rule targets; the state machine uses sts:AssumeRole into the workload account for the actual API calls. The assumed roles have permissions narrowed to the specific actions the workflow needs (ModifyInstanceAttribute, CreateSnapshot, SendCommand, etc.), and a condition key scoping the trust to the state machine’s role.
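For the Lambda-based tasks, the cross-account hop might be sketched as follows. The role name, session-name format, and function name are assumptions for illustration; `sts_client` is expected to be a boto3 STS client, and only its `assume_role` call is used.

```python
def workload_client_kwargs(sts_client, account_id, role_name, finding_id):
    """Assume the workload-account role and return credential kwargs for boto3."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    response = sts_client.assume_role(
        RoleArn=role_arn,
        # Session names are capped at 64 characters; tying the name to the
        # finding ID makes the resulting CloudTrail entries easy to correlate.
        RoleSessionName=f"ir-{finding_id}"[:64],
        DurationSeconds=900,  # shortest allowed session; the tasks are brief
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

A task would then build its client with something like `boto3.client("ec2", **workload_client_kwargs(...))`, so every call it makes is limited to what the assumed role allows.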

What’s worth remembering

  1. Step Functions Standard is the orchestrator for long-running, human-approvable workflows. Up to 1 year execution, exactly-once semantics, full per-execution history. Express is for high-throughput, short-lived workflows.
  2. waitForTaskToken is the human-approval primitive. Task publishes a token, state machine pauses, external caller returns the token with success or failure, workflow resumes. Timeout and Catch handle absent approvals.
  3. Retry and Catch give per-task error handling. Automatic retries with backoff on specified errors; Catch routes failures to compensation states. Each Task state declares its own retry and catch policies.
  4. Map states iterate in parallel. The volumes-to-snapshot example is natural; many resources, same operation, parallel execution. MaxConcurrency caps parallelism if the downstream API has rate limits.
  5. Direct AWS SDK integration avoids Lambda shims. arn:aws:states:::aws-sdk:ec2:createSnapshot calls EC2 directly; no Lambda needed for the common case. Lambda is for logic or transformations that don’t map to a single API.
  6. EventBridge is the trigger. A rule matches the GuardDuty event pattern and targets the state machine with a transformed input. Archive the raw events to S3 for audit.
  7. Cross-account via assumed roles. The state machine’s role assumes target-account roles for the actual actions; permissions are scoped per target account.
  8. Idempotency is the Lambda’s job. State machine retries; each task must tolerate being run twice without creating two snapshots or two tickets. Client tokens, conditional creates, and check-before-act patterns handle this.

The twelve-step Confluence runbook becomes a state machine diagram, an ASL document, and a handful of small Lambdas. At 03:00 the EventBridge rule fires, the state machine runs, the SOC gets a Slack button to click, and the finding is fully handled with an auditable record, whether the engineer was awake or not. “Runbooks that run themselves” isn’t a metaphor; it’s the infrastructure model, expressed as state transitions and task tokens.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.