The situation
A platform team runs quarterly chaos-engineering exercises. Current state:
- Game days manually scripted: someone SSHes into instances and runs `kill`, or changes a security group. Blast radius is controlled by luck and attention.
- Tests run in staging only. Nobody’s comfortable running them in prod because nobody trusts the rollback.
- Three resilience claims the team wants to validate:
  - Loss of a single AZ: traffic fails over to the remaining two within 2 minutes.
  - DynamoDB throttling on a critical table: the application degrades gracefully with cached reads for up to 30 seconds.
  - ECS task failure in the payment service: replacement tasks come up within 90 seconds and traffic resumes.
The asks:
- Declarative experiments defined as templates, version-controlled, reviewable.
- Measurable outcomes. The experiment records metrics before, during, after.
- Safety rails that abort experiments before they cause real damage.
- Scoping to specific accounts and resources. A production experiment must not accidentally touch staging or vice versa.
- Integration with Resilience Hub. Measured RTO from the experiment feeds into the resilience posture.
What actually matters
Chaos engineering requires three things at minimum: a way to inject the fault reliably, a way to measure what happened, and a way to stop the test if it goes too far. Ad-hoc scripts provide the first; metrics dashboards provide the second; nothing reliably provides the third, which is why game days tend to be anxiety-inducing.
The first property an answer needs is declarative experiment definitions. Actions (what to inject), targets (which resources), and stop conditions (what aborts the experiment) in one document, versionable and reviewable. A script that someone reads from a wiki at 03:00 is the opposite of declarative.
The second is target scoping by tag, ARN, and proportion. “50% of instances tagged Application=payments in eu-west-1a” is the workhorse pattern: enough to trigger failover, not enough to zero the fleet. Selection by raw resource list works for spot tests; selection by filter scales with the estate. The scoping language has to be expressive enough to slice along the axis the experiment cares about.
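As an illustration of the tag, AZ, and proportion pattern, here is a small sketch that builds a target definition in the shape used by the experiment templates in this piece. The helper names `percent` and `ec2_target` are assumptions made for the example; only the dict layout mirrors the template format.

```python
def percent(n: int) -> str:
    """Build a PERCENT(n) selection mode, rejecting values outside 1..100."""
    if not 1 <= n <= 100:
        raise ValueError(f"percentage must be between 1 and 100, got {n}")
    return f"PERCENT({n})"


def ec2_target(tags: dict, availability_zone: str, selection_mode: str) -> dict:
    """Describe running EC2 instances with the given tags in one AZ."""
    return {
        "resourceType": "aws:ec2:instance",
        "resourceTags": dict(tags),
        "filters": [
            {"path": "Placement.AvailabilityZone", "values": [availability_zone]},
            {"path": "State.Name", "values": ["running"]},
        ],
        "selectionMode": selection_mode,
    }


# "50% of instances tagged Application=payments in eu-west-1a"
half_of_payments = ec2_target({"Application": "payments"}, "eu-west-1a", percent(50))
print(half_of_payments["selectionMode"])
```

Generating targets from a helper rather than hand-editing JSON keeps the running-state filter and the AZ filter from being forgotten in a copy-paste.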
The third is the abort surface. A safety rail tied to live metrics: if a named alarm transitions to ALARM state during the experiment, the experiment aborts and reversible actions roll back. The alarm has to be on a metric that reflects customer pain (5xx rate, latency p99), not on the fault itself; the abort has to happen automatically, because nobody is fast enough at 03:00 to spot the breach and run the stop command before the page goes out.
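To make "a metric that reflects customer pain" concrete, the sketch below builds the arguments for a CloudWatch alarm on the 5xx rate of an Application Load Balancer, computed with metric math from `HTTPCode_Target_5XX_Count` and `RequestCount`. It only constructs a dict; passing it to CloudWatch (for example through boto3's `put_metric_alarm`) is left out. The function name, the load balancer identifier, and the one-minute period are assumptions for illustration.

```python
def five_xx_rate_alarm(name: str, load_balancer: str, threshold_percent: float) -> dict:
    """Arguments for an alarm on target 5xx responses as a percentage of requests."""

    def summed(metric_id: str, metric_name: str) -> dict:
        # One input series per metric; ReturnData is False because only
        # the computed rate should drive the alarm.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": 60,  # assumed granularity
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": name,
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_percent,
        "EvaluationPeriods": 1,
        "Metrics": [
            summed("errors", "HTTPCode_Target_5XX_Count"),
            summed("requests", "RequestCount"),
            {
                "Id": "rate",
                "Expression": "100 * errors / requests",
                "Label": "5xx rate (%)",
                "ReturnData": True,
            },
        ],
    }


# Hypothetical load balancer identifier; the threshold is in percent.
alarm = five_xx_rate_alarm("payments-5xx-rate", "app/payments/0123456789abcdef", 5.0)
print(alarm["Metrics"][-1]["Expression"])
```

Alarming on a rate rather than a raw count keeps the same threshold meaningful across low-traffic and high-traffic windows.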
The fourth is breadth of injectable faults. Stopping instances, failing over databases, killing container tasks, throttling APIs, dropping network traffic, simulating AZ failure: the more of these are in scope, the more of the architecture is testable. Services without a native fault often need a small adapter to translate “drop write capacity briefly” into “apply a temporary IAM policy that produces the same effect.” The shape of an answer has to include both the breadth and the escape hatch.
The fifth is the execution role as the boundary. The principal that runs the experiment needs permission to perform the injected actions; condition keys on tags prevent mis-targeting (an action on instances tagged Application=payments only works against instances actually tagged that way). The IAM role is what stops the experiment from accidentally stopping the wrong fleet because a tag was mistyped.
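A sketch of what that boundary might look like as an IAM policy document, built here as a Python dict so it can be generated per application. The function name `scoped_stop_policy` is hypothetical, and a real experiment role would likely need further statements (for other actions, and for FIS itself); this shows only the tag-conditioned part.

```python
import json


def scoped_stop_policy(application: str, region: str, account_id: str) -> dict:
    """IAM policy allowing stop/start only on instances tagged with the application."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:StopInstances", "ec2:StartInstances"],
                "Resource": f"arn:aws:ec2:{region}:{account_id}:instance/*",
                # The tag condition is what makes a mistyped target filter fail
                # with an access error instead of stopping the wrong fleet.
                "Condition": {
                    "StringEquals": {"aws:ResourceTag/Application": application}
                },
            },
            {
                # Describe calls cannot be restricted per resource.
                "Effect": "Allow",
                "Action": "ec2:DescribeInstances",
                "Resource": "*",
            },
        ],
    }


policy = scoped_stop_policy("payments", "eu-west-1", "111122223333")
print(json.dumps(policy, indent=2))
```

The target filter and the policy condition are deliberately redundant: the filter expresses intent, the policy enforces it.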
What we’ll filter on
Filtering:
- Declarative templates. Versionable, reviewable, repeatable.
- Safety rails that abort on alarm breach.
- Breadth of action types. EC2, ECS, RDS, network, service-level faults.
- Scoping. Tight control over which resources are affected.
- Measurement. Outcomes recorded for post-experiment analysis.
The fault-injection landscape
1. Manual scripts and ad-hoc fault injection. Status quo. Fails safety and repeatability.
2. AWS Fault Injection Service (FIS, formerly Fault Injection Simulator). Managed service with the feature set above. AWS-native, integrates with CloudWatch for stop conditions, IAM for access control.
3. Chaos Monkey / Simian Army style. Open-source, Netflix-origin tool. Injects instance failures on a schedule. Limited compared to FIS; no stop conditions; less controllable.
4. Gremlin. Commercial chaos-engineering platform. Broad action library, mature UI, stop conditions, magnitude control. Licence cost.
5. Chaos Mesh / LitmusChaos. Kubernetes-focused chaos platforms. Strong for EKS-heavy estates; narrow outside Kubernetes.
Side by side
| Option | Declarative | Safety rails | Action breadth | Scoping | Measurement |
|---|---|---|---|---|---|
| Manual scripts | ✗ | Manual | Any | Manual | Manual |
| FIS | ✓ | ✓ (CW alarms) | EC2, ECS, EKS, RDS, network | Tag + ARN + percent | CloudWatch + Experiment summary |
| Chaos Monkey | Partial | ✗ | EC2 only | Limited | Manual |
| Gremlin | ✓ | ✓ | Wide | ✓ | ✓ |
| Chaos Mesh / Litmus | ✓ | ✓ | Kubernetes-focused | ✓ | ✓ |
FIS is the AWS-native answer with declarative templates, CloudWatch-based stop conditions, and integrations with the rest of the AWS observability stack.
How an FIS experiment composes
The picks in depth
AZ-loss experiment template.
```json
{
  "description": "Simulate loss of eu-west-1a for payments application",
  "roleArn": "arn:aws:iam::111122223333:role/FISExperimentRole",
  "actions": {
    "StopPaymentsAZ": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {"startInstancesAfterDuration": "PT10M"},
      "targets": {"Instances": "PaymentsEuWest1a"}
    }
  },
  "targets": {
    "PaymentsEuWest1a": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"Application": "payments", "Environment": "staging"},
      "filters": [
        {"path": "Placement.AvailabilityZone", "values": ["eu-west-1a"]},
        {"path": "State.Name", "values": ["running"]}
      ],
      "selectionMode": "ALL"
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:eu-west-1:111122223333:alarm:payments-5xx-rate"
    }
  ],
  "tags": {"Experiment": "az-loss-payments-staging"}
}
```
Running this experiment in staging: FIS assumes FISExperimentRole, selects every running payments instance in eu-west-1a (let’s say 4 of 12), calls StopInstances. The ALB health checks detect the failures within 30 seconds, remove the targets, and route traffic to the remaining 8 instances in eu-west-1b and eu-west-1c. Auto Scaling notices the reduced capacity and launches replacements in healthy AZs.
Observed recovery metrics:
- 5xx rate during experiment: 0.2% (brief spike at fail-out, absorbed by ALB retries).
- Latency p99 during experiment: 380ms (baseline 210ms, no alarm breach).
- Time for ASG to restore full capacity: 3 minutes 40 seconds.
The experiment completed (no stop condition tripped); the claim “traffic fails over within 2 minutes” is validated at the ALB level; full capacity restoration takes longer because ASG has to provision new instances. The team updates the documented claim to “traffic fails over within 30 seconds; full fleet capacity restored within 4 minutes”: more accurate and still within SLA.
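The arithmetic behind that revision can be written down as a small check. This is a sketch using only the figures reported above (30 seconds to fail over, 3 minutes 40 seconds to restore capacity, 8 of 12 instances surviving); the `Claim` class is a hypothetical helper, not part of FIS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    description: str
    limit_s: int      # what the design document promises
    measured_s: int   # what the experiment observed

    def holds(self) -> bool:
        return self.measured_s <= self.limit_s


restore_s = 3 * 60 + 40  # ASG took 3 minutes 40 seconds

claims = [
    Claim("traffic fails over", limit_s=120, measured_s=30),
    Claim("full capacity restored (revised claim)", limit_s=240, measured_s=restore_s),
]

for claim in claims:
    status = "holds" if claim.holds() else "FAILS"
    print(f"{claim.description}: {claim.measured_s}s of {claim.limit_s}s allowed, {status}")

# While the AZ is out, 8 instances carry the load of 12.
load_factor = 12 / 8
print(f"per-instance load during the outage: {load_factor:.2f}x baseline")
```

The last figure is worth keeping next to the claim: the failover only holds if each surviving instance has headroom for roughly one and a half times its normal load.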
DynamoDB throttling experiment. FIS has no native DynamoDB throttling action, but its aws:ssm:start-automation-execution action can run an SSM Automation document (AWSFIS-Run-DynamoDB-ThrottleTable or similar; the name is a placeholder for whatever document the team authors) that temporarily lowers a provisioned table’s read or write capacity to force throttling. The application’s circuit breaker is expected to engage within 30 seconds. Stop condition: the payments-throttle-cascade alarm, which fires when more than 50% of requests fail outright rather than being served from cache. Actual measurement: the circuit breaker engages in 8 seconds; cached reads serve 96% of requests during the throttle; full recovery within 2 seconds of throttling being lifted.
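A sketch of how the action entry for such an experiment might be assembled. The `aws:ssm:start-automation-execution` action takes the document ARN, a JSON-encoded string of document parameters, and a maximum duration; everything else here is an assumption for illustration: the function name, the document ARN, the table name, and in particular the `TableName` and `DurationSeconds` parameters, which depend entirely on how the Automation document is written.

```python
import json


def throttle_table_action(document_arn: str, table_name: str, max_duration: str) -> dict:
    """FIS action entry that runs an SSM Automation document against one table."""
    document_parameters = {
        # Hypothetical parameter names; the real ones are whatever the
        # team's Automation document declares.
        "TableName": table_name,
        "DurationSeconds": "60",
    }
    return {
        "actionId": "aws:ssm:start-automation-execution",
        "parameters": {
            "documentArn": document_arn,
            # FIS expects the document parameters as a JSON string, not an object.
            "documentParameters": json.dumps(document_parameters),
            "maxDuration": max_duration,
        },
    }


action = throttle_table_action(
    "arn:aws:ssm:eu-west-1:111122223333:document/ThrottlePaymentsTable",  # hypothetical
    "payments-ledger",  # hypothetical table name
    "PT5M",
)
print(json.dumps(action, indent=2))
```

Because the fault is injected by a document the team owns, the rollback (restoring the original capacity) is also the team's responsibility and deserves its own test.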
ECS task-failure experiment. aws:ecs:stop-task with a selectionMode of COUNT(2) (2 of the 8 running tasks in the payments-api service). The ECS service scheduler notices the stopped tasks and starts replacements within 45 seconds. The experiment validates the “90-second” claim with margin to spare. Observed traffic impact: minimal; the ALB drains the dead tasks and routes around them.
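The ECS experiment follows the same action-plus-target shape as the AZ-loss template. The sketch below builds just those two sections; the cluster name `payments-cluster` and the helper name are assumptions, and the target's `cluster` and `service` parameters are written as I understand the `aws:ecs:task` resource type, so they are worth checking against the FIS reference before use.

```python
def ecs_stop_task_fragment(cluster: str, service: str, count: int, running: int) -> dict:
    """Actions and targets sections for stopping `count` tasks of one ECS service."""
    if not 0 < count < running:
        # Stopping every task would test a full outage, not task replacement.
        raise ValueError(f"count must be between 1 and {running - 1}, got {count}")
    return {
        "actions": {
            "StopPaymentTasks": {
                "actionId": "aws:ecs:stop-task",
                "targets": {"Tasks": "PaymentTasks"},
            }
        },
        "targets": {
            "PaymentTasks": {
                "resourceType": "aws:ecs:task",
                "parameters": {"cluster": cluster, "service": service},
                "selectionMode": f"COUNT({count})",
            }
        },
    }


fragment = ecs_stop_task_fragment("payments-cluster", "payments-api", count=2, running=8)
print(fragment["targets"]["PaymentTasks"]["selectionMode"])
```

The guard on `count` encodes the intent of the experiment: enough tasks die to exercise the scheduler, while the service keeps serving.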
Running in prod. After three months of staging experiments, the team progresses to a monthly prod experiment. Key changes:
- Stop conditions tighten: `payments-5xx-rate > 1%` (not 5%) for prod.
- The experiment is scheduled for a low-traffic window (Sunday 04:00 UTC).
- On-call is aware and the experiment’s dashboard is visible in the war-room.
- The first prod experiment is AZ-loss on staging-equivalent scale, then scaled up on subsequent runs.
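One way to keep the prod template honest is to derive it from the reviewed staging template rather than writing it by hand. The following sketch assumes the templates are plain dicts; `to_prod`, the `prod` tag value, and the prod alarm ARN are hypothetical and stand in for whatever naming the team actually uses.

```python
import copy


def to_prod(staging_template: dict, prod_alarm_arn: str) -> dict:
    """Copy a staging template, retarget it at prod, and swap in the tighter alarm."""
    prod = copy.deepcopy(staging_template)  # never mutate the reviewed original
    for target in prod["targets"].values():
        if "Environment" in target.get("resourceTags", {}):
            target["resourceTags"]["Environment"] = "prod"
    prod["stopConditions"] = [
        {"source": "aws:cloudwatch:alarm", "value": prod_alarm_arn}
    ]
    return prod


staging = {
    "targets": {
        "Payments": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Application": "payments", "Environment": "staging"},
            "selectionMode": "ALL",
        }
    },
    "stopConditions": [
        {"source": "aws:cloudwatch:alarm",
         "value": "arn:aws:cloudwatch:eu-west-1:111122223333:alarm:payments-5xx-rate"}
    ],
}

prod = to_prod(
    staging,
    # Hypothetical ARN of the prod alarm with the 1% threshold.
    "arn:aws:cloudwatch:eu-west-1:444455556666:alarm:payments-5xx-rate-prod",
)
print(prod["targets"]["Payments"]["resourceTags"])
```

The diff between the two templates then shows exactly what changed for prod: the environment tag and the alarm, nothing else.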
Guardrails that matter most
Before the first prod experiment, the team establishes:
- IAM permissions on the FIS role are scoped tightly to the resources the template targets. `ec2:StopInstances` with a condition `aws:ResourceTag/Application=payments` prevents the experiment from ever stopping the wrong instance.
- Resource tagging discipline is a prerequisite. The experiment filter depends on tags being correct; a prod instance mis-tagged as `Environment=staging` would get stopped during a “staging experiment.” Tag policies enforce correctness at creation time.
- Stop condition alarms are tested first. Before the experiment, the team confirms the alarm fires when the metric crosses the threshold. An alarm that’s in `INSUFFICIENT_DATA` state provides no safety rail; FIS only aborts on `ALARM`.
- Experiment dry-runs via FIS’s preview mode. The template can be validated without executing; FIS shows which resources it would target. This catches over-broad selections before real harm.
- Scheduled experiments via EventBridge + Lambda. The monthly prod experiment runs on schedule; the chaos culture shifts from “someone decides to break things” to “the system breaks itself on a schedule so the team can learn.”
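The alarm-state guardrail lends itself to an automated pre-flight check, for instance in the function that starts the scheduled experiment. Below is a minimal sketch of the decision logic only; it assumes the caller has already fetched each alarm's current state from CloudWatch, and `preflight` is a hypothetical helper name.

```python
def preflight(alarm_states: dict[str, str]) -> list[str]:
    """Return reasons not to start; an empty list means every safety rail is armed."""
    blockers = []
    for alarm, state in alarm_states.items():
        if state == "OK":
            continue
        if state == "INSUFFICIENT_DATA":
            # The alarm cannot transition to ALARM without data, so it cannot abort anything.
            blockers.append(f"{alarm}: no data, the stop condition would never trip")
        elif state == "ALARM":
            # The system is already unhealthy; injecting a fault now proves nothing.
            blockers.append(f"{alarm}: already in ALARM before the experiment")
        else:
            blockers.append(f"{alarm}: unrecognised state {state!r}")
    return blockers


# Example states as they might be read just before a scheduled run.
states = {"payments-5xx-rate": "OK", "payments-latency-p99": "INSUFFICIENT_DATA"}
problems = preflight(states)
for problem in problems:
    print(problem)
```

A scheduled run that refuses to start is a cheap failure; a run that starts without a working abort is not.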
What’s worth remembering
- FIS templates are declarative and versioned. Actions, targets, stop conditions, parameters all in one JSON document; commit to Git, review like any other change.
- Stop conditions are CloudWatch alarms. The safety rail that aborts experiments that start hurting more than planned. Test the alarms before trusting them.
- Targets support ARN, tag, and filter selection. Tag + AZ + percent is the workhorse pattern: “50% of instances tagged X in AZ Y.” Scope as narrowly as possible.
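- Actions cover EC2, ECS, EKS, RDS, and network. DynamoDB throttling and other faults without native actions are injected via SSM Automation documents. Instance-level network faults (latency, packet loss, blackholed ports) run through the SSM agent on EC2 and manipulate tc and iptables; subnet-level connectivity disruption works through network ACLs instead.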
- The execution role is the boundary. FIS’s IAM role has the permissions the experiment needs; condition keys on tags prevent mis-targeting.
- Preview mode shows targets without executing. Always run a preview before committing; a broader-than-intended target filter is the most common mistake.
- Start in staging; graduate to prod. Months of staging experiments build confidence and tune alarms; prod runs only once the team trusts the rollback.
- FIS integrates with Resilience Hub. Hub’s estimated RTO becomes measurable via FIS experiments. Material divergence is the actionable output.
Resilience Hub tells the team what recovery should be; FIS tells them what it is. Templates make the experiments reviewable; stop conditions make them safe; IAM and tag scoping make them precise. Breaking things on purpose stops being a quarterly anxiety-inducing event; it becomes a monthly confidence-building exercise, with each experiment adding data to the claim the design document used to make without evidence.