The situation
A SaaS platform runs twelve microservices behind an Application Load Balancer. Each service reports three CloudWatch metrics the on-call cares about: CPUUtilization (per ECS service average), TargetResponseTime (p95 latency from the ALB target group), and HTTPCode_Target_5XX_Count (error counter from the same target group).
Today the team has nine static-threshold alarms per service (three metrics times three severities), and PagerDuty routes the criticals to whoever is on rota. The rota is seven engineers. They are tired. The thirty-day noise audit tells the story: CPU alone fires 42 times, 38 of those during the nightly RDS snapshot and Elasticsearch reindex; latency alone fires 17 times, 14 of those from incidents upstream of the team; error rate alone fires 9 times, 7 of those from single-request anomalies. All three firing together, in the same service and the same five-minute window, happened 3 times. All three were real incidents, all three were initially dismissed as yet more cry-wolf, and mean time to acknowledge averaged 23 minutes.
What actually matters
Before reaching for features, it helps to name what “a good page” actually means, because the current system is technically correct and operationally broken.
The signal-to-noise ratio is the first concern. A page should correspond to something the on-call can do something about. The CPU-at-02:00 page is technically correct (CPU really did spike) and operationally worthless (nothing to act on, backup finishes in an hour, nobody will touch anything). Every such page trains the on-call to ignore pages. The fix needs to move the threshold decision out of “engineer’s best guess at a number” and into “what does this metric normally look like for this hour of this day of the week.”
The single-signal brittleness is the next layer. Even with adaptive thresholds, one metric wobbling is not usually a good page. A latency spike by itself could be upstream flap; an error spike by itself could be a single bad client; a CPU spike by itself could be a thousand legitimate things. What is genuinely page-worthy is the coincidence of signals: CPU up and latency up and errors up, in the same service, in the same window, is a real incident with high probability. The page-worthy condition is a logical combination, not a single threshold crossing.
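The coincidence test described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how CloudWatch evaluates anything: the function name, the input shape (a dict of firing timestamps per signal), and the fixed five-minute bucketing are all assumptions made for the example.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def coincident_windows(firings, window=timedelta(minutes=5)):
    """Return the start times of windows in which every signal fired.

    `firings` maps a signal name to the datetimes at which that signal's
    alarm fired (hypothetical audit data). Time is cut into fixed buckets
    of `window` length, so two firings that straddle a bucket boundary
    are NOT counted as coincident; a sliding window would catch those.
    """
    def bucket(ts):
        # Floor the timestamp to the start of its window.
        return EPOCH + ((ts - EPOCH) // window) * window

    per_signal = [{bucket(ts) for ts in times} for times in firings.values()]
    if not per_signal:
        return []
    return sorted(set.intersection(*per_signal))

if __name__ == "__main__":
    night = datetime(2024, 1, 1, 2, 0)
    audit = {
        "cpu": [night, night + timedelta(hours=5, minutes=1)],
        "latency": [night + timedelta(hours=5, minutes=2)],
        "errors": [night + timedelta(hours=5, minutes=4)],
    }
    # Only the window where all three signals fired is reported.
    print(coincident_windows(audit))
```

Run over an export of alarm-history timestamps, a check like this is one way to reproduce the "all three together" count from the noise audit.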
The hysteresis concern is the third layer. A single data point exceeding a threshold is noise; a sustained excursion is signal. The configuration needs to tolerate a one-minute spike while responding to a five-minute sustained condition within tolerable latency. “M of N evaluation periods” is the shape, and it’s the difference between a pager that fires once per backup and a pager that fires once per incident.
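To make the "M of N" shape concrete, here is a minimal Python sketch of the evaluation. It is a simplification: real alarm evaluation also handles missing data and an INSUFFICIENT_DATA state, which this ignores, and the function and parameter names are chosen for the example.

```python
from collections import deque

def m_of_n_states(breaches, datapoints_to_alarm=4, evaluation_periods=5):
    """Return the alarm state after each data point.

    `breaches` is a sequence of truthy/falsy values, one per period,
    saying whether that period's data point crossed the threshold.
    The state is "ALARM" when at least `datapoints_to_alarm` of the
    last `evaluation_periods` points breached, otherwise "OK".
    """
    recent = deque(maxlen=evaluation_periods)  # sliding window of the last N points
    states = []
    for breached in breaches:
        recent.append(bool(breached))
        states.append("ALARM" if sum(recent) >= datapoints_to_alarm else "OK")
    return states

if __name__ == "__main__":
    # A one-minute spike never reaches 4 of 5, so it is absorbed.
    print(m_of_n_states([0, 0, 1, 0, 0, 0]))
    # A sustained excursion trips the alarm once four breaches accumulate.
    print(m_of_n_states([0, 1, 1, 1, 1, 0, 0]))
```

Lowering M relative to N makes the alarm faster and noisier; raising it trades detection latency for quiet.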
The cost curve matters because the platform is twelve services today and could be thirty next year. Anything that costs tens of dollars per service per month at this scale becomes hundreds, then thousands, without adding signal. The solution has to be cheap enough that turning it on everywhere doesn’t require a budget conversation.
And finally, the investigability after the fact. When a composite alarm fires, the on-call needs to see which of the underlying signals contributed, so the investigation can start with the odd-one-out instead of hunting through dashboards. The per-metric alarms still need to be visible even though only the composite pages.
What we’ll filter on
- Adaptive thresholds: alarms learn the metric’s normal shape for that hour of that day of the week.
- Multi-signal AND/OR: the page-worthy condition is a logical combination of other conditions.
- Configurable hysteresis: an “M out of N evaluation periods” shape; noise is absorbed, sustained excursions trigger.
- Cheap: tens of dollars per month, not hundreds, across a dozen services.
- Low operational overhead and per-metric transparency: adaptive bands visible on dashboards, composite rationale inspectable.
The alerting landscape
Static-threshold alarms per metric (the status quo). The classic CloudWatch alarm: one metric, one threshold, a comparison operator, and an evaluation-period count. Cheap (ten cents per alarm per month for standard-resolution), simple, well understood. Pick the threshold low and it fires every night during backups; pick it high and it misses a slow climb that matters.
CloudWatch anomaly-detection alarms. A CloudWatch metric math function, ANOMALY_DETECTION_BAND(metric, stdev), produces a pair of time series, an upper and lower band, derived from a model trained on the last two weeks of the metric’s history. The model captures time-of-day and day-of-week seasonality automatically. A metric alarm using ComparisonOperator: GreaterThanUpperThreshold (or its siblings) fires when the live value crosses the band rather than a fixed number. Storage of the model is free, but the alarm is billed as three metrics (the metric plus its upper and lower bands), which works out to US$0.30 per standard-resolution anomaly-detection alarm per month.
CloudWatch composite alarms. An alarm whose state is a Boolean expression over other alarms’ states: ALARM("cpu-anom") AND ALARM("latency-anom") AND ALARM("err-anom"), or any combination of AND, OR, NOT, and parentheses over up to 100 child alarms. Composite alarms notify through SNS (and from there PagerDuty) and emit state-change events to EventBridge like any other alarm, but unlike metric alarms they cannot trigger EC2 or Auto Scaling actions. They also support an ActionsSuppressor alarm to mute the composite during known maintenance. Priced at US$0.50 per composite alarm per month regardless of how many children it watches.
CloudWatch Contributor Insights. A log-analysis tool that derives top-N breakdowns from CloudWatch Logs or structured metric streams. Brilliant for “what is driving the latency?” investigations once an incident is known. Not an alerting primitive: a rule is a view, not a conditional pager.
Amazon Lookout for Metrics. A managed anomaly-detection service aimed at business metrics. AWS announced the end of new customer access to Lookout for Metrics on 10 October 2024; existing customers keep it until 31 October 2025, after which the service is retired. Not an option for a pager built now, independently of technical fit.
Side by side
| Option | Adaptive | AND/OR | Hysteresis | Cheap | Low overhead |
|---|---|---|---|---|---|
| Static-threshold per-metric | ✗ | ✗ | ✓ | ✓ | ✓ |
| Anomaly-detection alarms | ✓ | ✗ | ✓ | ✓ | ✓ |
| Composite alarms | — | ✓ | ✓ | ✓ | ✓ |
| Contributor Insights | — | ✗ | ✗ | ✓ | ✓ |
| Lookout for Metrics | ✓ | ✓ | ✓ | ✗ | ✗ |
No single option ticks every box. CloudWatch’s design splits the work deliberately. Anomaly-detection alarms solve the adaptive-threshold attribute at the per-metric layer. Composite alarms solve the AND/OR attribute by being a Boolean expression over alarm state. Stack them together, with three anomaly-detection alarms feeding one composite through AND, and all five attributes are ticked at once, for roughly US$1.40 per service per month (three anomaly-detection alarms at US$0.30 each plus one composite at US$0.50).
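A quick way to sanity-check the per-service figure, and to see how it scales from twelve services to thirty, is to write the arithmetic down. The sketch below takes the rates as arguments rather than hard-coding them. The values in the usage lines (US$0.30 per standard-resolution anomaly-detection alarm, since it is billed as three metrics, and US$0.50 per composite) are assumptions based on the public pricing page and should be re-checked for your region.

```python
def monthly_alarm_cost(services, metrics_per_service,
                       per_metric_alarm_usd, per_composite_usd,
                       composites_per_service=1):
    """Monthly alarm bill in USD for the two-layer pager.

    Every service gets one per-metric alarm for each watched metric
    plus `composites_per_service` composite alarms on top.
    """
    metric_alarms = services * metrics_per_service
    composites = services * composites_per_service
    total = metric_alarms * per_metric_alarm_usd + composites * per_composite_usd
    return round(total, 2)  # round away floating-point dust

if __name__ == "__main__":
    ANOMALY_ALARM_USD = 0.30  # assumption: verify against current pricing
    COMPOSITE_USD = 0.50      # assumption: verify against current pricing
    for n in (1, 12, 30):
        cost = monthly_alarm_cost(n, 3, ANOMALY_ALARM_USD, COMPOSITE_USD)
        print(f"{n:>2} services: US${cost:.2f} per month")
```

The cost grows linearly with the number of services, which is the property the "cheap" attribute is really asking for.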
How the composition works
Building the CPU anomaly alarm
Anomaly detection in CloudWatch is a metric math expression. The alarm references two time series: the metric itself (m1) and the band expression (ad1):
```json
{
  "AlarmName": "payments-svc-cpu-anom",
  "Metrics": [
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ECS",
          "MetricName": "CPUUtilization",
          "Dimensions": [
            { "Name": "ServiceName", "Value": "payments" },
            { "Name": "ClusterName", "Value": "prod" }
          ]
        },
        "Period": 60,
        "Stat": "Average"
      },
      "ReturnData": true
    },
    {
      "Id": "ad1",
      "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
      "ReturnData": true
    }
  ],
  "ThresholdMetricId": "ad1",
  "ComparisonOperator": "GreaterThanUpperThreshold",
  "EvaluationPeriods": 5,
  "DatapointsToAlarm": 4,
  "TreatMissingData": "notBreaching"
}
```
The second argument to ANOMALY_DETECTION_BAND is the standard-deviation multiplier: the band spans the predicted value ± multiplier × standard deviation. 2 is a reasonable starting point; tighter (1) is more sensitive and noisier, looser (3) is more tolerant.
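The band arithmetic can be illustrated with a toy model. To be clear, this is not CloudWatch’s algorithm, whose internals are not spelled out here. It is a stand-in that assumes the simplest possible seasonal model (the mean and standard deviation of past values for the same weekday and hour), purely to show what "predicted ± multiplier × standard deviation" means and why a recurring 02:00 hump stops looking anomalous.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

def hour_of_week_band(history, multiplier=2.0):
    """Build a per-(weekday, hour) band from (datetime, value) samples.

    Returns {(weekday, hour): (lower, upper)} where the band is
    mean ± multiplier × population standard deviation of the samples
    seen in that hour-of-week slot. Toy model, for illustration only.
    """
    slots = defaultdict(list)
    for ts, value in history:
        slots[(ts.weekday(), ts.hour)].append(value)

    band = {}
    for slot, values in slots.items():
        centre = mean(values)
        spread = pstdev(values)
        band[slot] = (centre - multiplier * spread, centre + multiplier * spread)
    return band

if __name__ == "__main__":
    # Hypothetical CPU samples: two Mondays at 02:00 (backup) and at 14:00.
    history = [
        (datetime(2024, 1, 1, 2, 0), 80.0), (datetime(2024, 1, 8, 2, 0), 90.0),
        (datetime(2024, 1, 1, 14, 0), 30.0), (datetime(2024, 1, 8, 14, 0), 34.0),
    ]
    for slot, (lower, upper) in sorted(hour_of_week_band(history).items()):
        print(slot, round(lower, 1), round(upper, 1))
```

In this toy, 85% CPU at Monday 02:00 sits inside its band while the same 85% at Monday 14:00 would be far above it, which is the behaviour a static threshold cannot express.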
EvaluationPeriods: 5 with DatapointsToAlarm: 4 is the hysteresis knob: the alarm fires only when four of the last five one-minute data points are above the upper band. A single spike passes unnoticed. An excursion that holds for four minutes out of five trips it. ComparisonOperator: GreaterThanUpperThreshold means “fire when m1 is higher than ad1’s upper band.” The sibling LessThanLowerThreshold catches unexpectedly low values; LessThanLowerOrGreaterThanUpperThreshold catches either direction.
The model retrains continuously on the trailing two weeks. A permanent legitimate change in traffic pattern will be absorbed by the model within two weeks, and the band will stretch to match.
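Two matching alarms get the same shape for TargetResponseTime and HTTPCode_Target_5XX_Count, referencing their respective AWS/ApplicationELB metrics with the target-group dimensions (TargetGroup, normally paired with LoadBalancer). The latency alarm uses Stat p95 to match the p95 the on-call watches; the error alarm uses Sum, since the metric is a count.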
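Since the three alarms differ only in metric, dimensions, and statistic, one option is to generate the payloads from a small helper. The sketch below mirrors the CPU alarm’s JSON. The helper name and the TargetGroup and LoadBalancer dimension values are hypothetical placeholders, and the resulting dicts are intended to match the PutMetricAlarm request shape (for example as keyword arguments to boto3’s put_metric_alarm), which is worth verifying before use.

```python
import json

def anomaly_alarm(name, namespace, metric_name, dimensions, stat,
                  band_width=2, evaluation_periods=5, datapoints_to_alarm=4):
    """Build an anomaly-detection alarm payload shaped like the CPU alarm."""
    return {
        "AlarmName": name,
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": [{"Name": k, "Value": v}
                                       for k, v in dimensions.items()],
                    },
                    "Period": 60,
                    "Stat": stat,
                },
                "ReturnData": True,
            },
            {
                "Id": "ad1",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "ReturnData": True,
            },
        ],
        "ThresholdMetricId": "ad1",
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": datapoints_to_alarm,
        "TreatMissingData": "notBreaching",
    }

# Placeholder dimension values; substitute the real target group and ALB IDs.
TARGET_GROUP_DIMS = {
    "TargetGroup": "targetgroup/payments/0123456789abcdef",
    "LoadBalancer": "app/prod-alb/0123456789abcdef",
}

latency = anomaly_alarm("payments-svc-latency-anom", "AWS/ApplicationELB",
                        "TargetResponseTime", TARGET_GROUP_DIMS, "p95")
errors = anomaly_alarm("payments-svc-errors-anom", "AWS/ApplicationELB",
                       "HTTPCode_Target_5XX_Count", TARGET_GROUP_DIMS, "Sum")

if __name__ == "__main__":
    print(json.dumps(latency, indent=2))
```

Keeping the per-metric knobs (band width, evaluation periods) as parameters leaves room to tune each signal separately later.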
Building the composite alarm
```json
{
  "AlarmName": "payments-svc-page",
  "AlarmRule": "ALARM(\"payments-svc-cpu-anom\") AND ALARM(\"payments-svc-latency-anom\") AND ALARM(\"payments-svc-errors-anom\")",
  "AlarmActions": ["arn:aws:sns:eu-west-1:111122223333:pagerduty-critical"],
  "OKActions": ["arn:aws:sns:eu-west-1:111122223333:pagerduty-critical"],
  "ActionsEnabled": true
}
```
The AlarmRule grammar is Boolean: ALARM(), OK(), and INSUFFICIENT_DATA() functions over alarm names, combined with AND, OR, and NOT, grouped by parentheses. Up to 100 child alarms per rule. Nesting composite alarms inside other composite alarms is supported up to a depth of 10.
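For reasoning about a rule before deploying it, a tiny local evaluator can help. The sketch below is a hand-rolled recursive-descent parser for the subset of the grammar described above. It is not an AWS tool, and it assumes the conventional precedence of NOT over AND over OR. When in doubt, parenthesise the real rule rather than rely on that assumption.

```python
import re

# One token: a state function over a quoted alarm name, an operator, or a parenthesis.
TOKEN = re.compile(
    r'\s*(?:(ALARM|OK|INSUFFICIENT_DATA)\(\s*"([^"]+)"\s*\)|(AND|OR|NOT)|([()]))'
)

def evaluate_rule(rule, states):
    """Evaluate an AlarmRule-style string against {alarm_name: state}."""
    rule = rule.strip()
    tokens, pos = [], 0
    while pos < len(rule):
        m = TOKEN.match(rule, pos)
        if not m:
            raise ValueError(f"unexpected text at position {pos}: {rule[pos:]!r}")
        func, name, op, paren = m.groups()
        # Leaves are resolved to booleans immediately; operators stay as strings.
        tokens.append(states[name] == func if func else (op or paren))
        pos = m.end()

    def parse_or(i):
        value, i = parse_and(i)
        while i < len(tokens) and tokens[i] == "OR":
            rhs, i = parse_and(i + 1)
            value = value or rhs
        return value, i

    def parse_and(i):
        value, i = parse_not(i)
        while i < len(tokens) and tokens[i] == "AND":
            rhs, i = parse_not(i + 1)
            value = value and rhs
        return value, i

    def parse_not(i):
        if i >= len(tokens):
            raise ValueError("unexpected end of rule")
        tok = tokens[i]
        if tok == "NOT":
            value, i = parse_not(i + 1)
            return (not value), i
        if tok == "(":
            value, i = parse_or(i + 1)
            if i >= len(tokens) or tokens[i] != ")":
                raise ValueError("missing closing parenthesis")
            return value, i + 1
        if isinstance(tok, bool):
            return tok, i + 1
        raise ValueError(f"unexpected token {tok!r}")

    result, end = parse_or(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return result

if __name__ == "__main__":
    rule = ('ALARM("payments-svc-cpu-anom") AND ALARM("payments-svc-latency-anom") '
            'AND ALARM("payments-svc-errors-anom")')
    backup_night = {"payments-svc-cpu-anom": "ALARM",
                    "payments-svc-latency-anom": "OK",
                    "payments-svc-errors-anom": "OK"}
    print(evaluate_rule(rule, backup_night))  # CPU alone does not satisfy the AND
```

Feeding it the state combinations from the noise audit is a cheap way to confirm that a rule pages only on the cases it should.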
Three behaviours worth knowing:
Actions fire on state transitions, not continuously. The composite moves from OK to ALARM when its rule evaluates true; the SNS action fires once, not every minute the rule stays true. OKActions fires on the ALARM -> OK transition so the on-call gets an auto-resolve.
Suppressor alarms mute the composite during maintenance. A separate alarm, usually fed from a metric filter on a maintenance-window log event or a manually-flipped CloudWatch alarm, is named via the ActionsSuppressor field. While the suppressor is in ALARM, the composite’s actions are inhibited: the rule still evaluates and the state still transitions, but no page goes out. ActionsSuppressorWaitPeriod is how long (in seconds) the composite, on entering ALARM, waits for the suppressor to enter ALARM before it sends its actions; ActionsSuppressorExtensionPeriod is how long it keeps holding actions back after the suppressor leaves ALARM. This is how the team turns off backup-window paging from a known-maintenance flag.
The cost is flat. US$0.50 per composite alarm per month, regardless of how many alarms the rule references. The children are billed separately: an anomaly-detection alarm counts as three metrics (the metric plus its upper and lower bands), so thirty-six of them (twelve services times three metrics) at US$0.30 each is US$10.80; twelve composite alarms at US$0.50 is US$6.00. Total US$16.80 per month for the whole platform’s primary pager.
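The suppressor behaviour described above attaches to the composite as three extra fields. The sketch below extends the composite payload shown earlier. The suppressor alarm name and the two periods (in seconds) are hypothetical values chosen for illustration, and the helper name is an assumption of this example.

```python
import json

# The composite alarm from the example above, as a Python dict.
BASE_COMPOSITE = {
    "AlarmName": "payments-svc-page",
    "AlarmRule": ('ALARM("payments-svc-cpu-anom") AND '
                  'ALARM("payments-svc-latency-anom") AND '
                  'ALARM("payments-svc-errors-anom")'),
    "AlarmActions": ["arn:aws:sns:eu-west-1:111122223333:pagerduty-critical"],
    "OKActions": ["arn:aws:sns:eu-west-1:111122223333:pagerduty-critical"],
    "ActionsEnabled": True,
}

def with_suppressor(composite, suppressor_alarm, wait_seconds, extension_seconds):
    """Return a copy of the composite payload with suppression configured.

    The original dict is left untouched so the unsuppressed variant
    stays available for comparison or rollback.
    """
    payload = dict(composite)
    payload.update({
        "ActionsSuppressor": suppressor_alarm,
        "ActionsSuppressorWaitPeriod": wait_seconds,
        "ActionsSuppressorExtensionPeriod": extension_seconds,
    })
    return payload

if __name__ == "__main__":
    # "maintenance-window-flag", 120 and 300 are placeholders, not recommendations.
    suppressed = with_suppressor(BASE_COMPOSITE, "maintenance-window-flag", 120, 300)
    print(json.dumps(suppressed, indent=2))
```

A wait period of a minute or two gives the maintenance flag time to flip before the composite pages; the extension period covers the tail of a backup after the flag clears.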
The noise audit, re-run
With the new configuration in place the team re-runs the thirty-day noise audit on the shadow traffic for two weeks before cutting over.
CPU-anomaly alarm fires 6 times; the backup-window fires are gone because the model learned the 02:00 hump inside a week and widened the band there. Latency-anomaly alarm fires 19 times; the raw rate is up slightly because the model is tighter in the daytime than the old static p95 threshold was. Error-anomaly alarm fires 11 times; similar story. Composite alarm fires 3 times: the three real incidents, the same three the old static-threshold triple had caught, buried in 65 noise pages. MTTA falls from 23 minutes to 4.
The on-call is no longer tired.
When each ingredient is the wrong choice
Static thresholds still make sense for metrics where the threshold is a contractual value, not an observation. “Response time must stay under 200 ms per the SLA” is a static alarm; the bound is set by the business, not learnt from data.
Anomaly detection struggles with sparse metrics, anything that spends long periods at zero and then spikes briefly. The model has little history to learn from, the band collapses near zero, and the alarm flaps. For sparse counters, a static minimum with a generous ceiling is simpler.
Composite alarms are the wrong choice when the decision is genuinely single-signal. A disk-full alarm on a single volume is one metric; wrapping it in a composite adds cost and indirection for no logical gain.
What’s worth remembering
- CloudWatch anomaly-detection alarms solve adaptive thresholds: ANOMALY_DETECTION_BAND(metric, stdev) produces an upper/lower band the metric alarm compares against, learning time-of-day and day-of-week seasonality from the last two weeks of history.
- GreaterThanUpperThreshold, LessThanLowerThreshold, and LessThanLowerOrGreaterThanUpperThreshold are the three comparison operators for band-based alarms.
- EvaluationPeriods and DatapointsToAlarm provide hysteresis. 4 of 5 is a common shape for one-minute metrics.
- Composite alarms solve multi-signal AND/OR via an AlarmRule string combining ALARM(), OK(), and INSUFFICIENT_DATA() functions with Boolean operators. Up to 100 children, up to 10 levels of nesting.
- Composite alarms support ActionsSuppressor to mute actions during known maintenance without disabling the underlying evaluation, with ActionsSuppressorWaitPeriod and ActionsSuppressorExtensionPeriod grace windows.
- Cost is predictable. US$0.10 per standard-resolution metric alarm per month, US$0.30 per standard-resolution anomaly-detection alarm (billed as three metrics), US$0.50 per composite alarm per month, anomaly-model storage free.
- Contributor Insights is for top-N diagnosis, not for triggering pages. Lookout for Metrics is aimed at business KPIs and has been closed to new customers since October 2024.
- The pattern is a two-layer pager. Anomaly-detection alarms at the per-metric layer capture “this signal is unusual for this time of day.” A composite alarm on top captures “these signals are unusual together.” Neither layer replaces the other; together they are what pages the on-call.
Per service, configure three CloudWatch anomaly-detection alarms, one on CPUUtilization, one on TargetResponseTime, one on HTTPCode_Target_5XX_Count, each using ANOMALY_DETECTION_BAND(metric, 2) with per-metric EvaluationPeriods / DatapointsToAlarm tuned for the signal’s noise profile. Sit a CloudWatch composite alarm on top with AlarmRule: ALARM(cpu) AND ALARM(lat) AND ALARM(err), and point PagerDuty only at the composite. Keep the per-metric alarms visible on the dashboard for investigation, but let the composite be the sole trigger for a page. Add an ActionsSuppressor tied to a maintenance-flag alarm so scheduled backups and deploys never wake anyone. Twelve services, thirty-six per-metric alarms, twelve composites, under twenty dollars a month, and a rota that starts sleeping again.