How to Triage a 5xx Spike With CloudWatch Logs Insights

August 11, 2027 · 16 min read

CloudOps Engineer · SOA-C03 · part of The Exam Room

The situation

An e-commerce company runs a public-facing Application Load Balancer in us-east-1, fronting several target groups behind path-based routing rules – /api/* to the API service, /checkout/* to the checkout service, /static/* to a small asset cache. At 03:00 the on-call is paged: the alarm on HTTPCode_ELB_5XX_Count has tripped for the third consecutive minute.

The environment already has:

  • ALB access logs forwarded via Kinesis Firehose into a CloudWatch log group (/aws/alb/shop-prod-access), retained for 30 days, one record per request.
  • An on-call runbook whose first bullet is “find the failing path, the target group, and the client pattern, fast.”
  • No OpenSearch cluster, Athena table, or third-party SIEM already wired up. The incident is not the moment to stand one up.

What actually matters

The first question at 03:00 isn’t “which tool?” but “what mental model is the on-call running?” An incident investigation proceeds as a funnel: confirm the symptom, localise it to a path, localise it to a target group, distinguish infrastructure failure from application failure, then establish whether it’s population-wide or client-specific. Each step wants a single query and a single answer, and each answer narrows the search for the next. Whatever tool the on-call uses has to let them type a question, read an answer, and type the next question without leaving the screen. The latency between “I have a hypothesis” and “I can test it” is the actual performance metric of the investigation.

The second is: what’s already there, and what has to be built? At 03:00 the cost of infrastructure is measured in minutes the incident keeps bleeding. A tool that needs a cluster sized, an index created, or a table partition registered isn’t a tool for this moment; it’s a project for a different week. The constraint pushes the decision toward whatever operates directly on the data where it already lives, and the data is already in CloudWatch.

The third is what shape does an ALB access log actually take? One line per request, space-delimited, with some fields quoted because they contain spaces themselves (the HTTP request line, the user agent, the trace ID, domain names). That structure matters because it dictates which parsing tools work. A naive glob *-splitter chokes on the embedded spaces inside "GET /api/orders?status=open HTTP/1.1". A regex with named capture groups handles it cleanly. The choice of parser is the first real decision in writing the query, not an implementation detail.
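To make the parser point concrete, here is a minimal Python sketch that contrasts a naive space split with a named-group regex on one log line. The log line and every value in it are invented for illustration, and Python spells named groups (?P<name>...) where Logs Insights uses (?<name>...); otherwise the pattern mirrors the field order described above.

```python
import re

# Hypothetical ALB access log line, truncated after the target group ARN.
LINE = (
    'https 2027-08-11T03:00:12.345678Z app/shop-prod/0123456789abcdef '
    '203.0.113.7:51234 10.0.1.15:8080 0.001 -1 -1 504 - 412 288 '
    '"GET https://shop.example.com:443/checkout/cart?step=pay HTTP/1.1" '
    '"Mozilla/5.0 (X11; Linux x86_64)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 '
    'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/checkout/abcdef0123456789'
)

# Naive split: the quoted request line is broken into three separate tokens.
naive = LINE.split(" ")

# Named-group regex: twelve unquoted fields, two quoted fields, three unquoted.
ALB_RE = re.compile(
    r'^(?P<type>\S+) (?P<time>\S+) (?P<elb>\S+) (?P<client_port>\S+) '
    r'(?P<target_port>\S+) (?P<req_pt>\S+) (?P<tgt_pt>\S+) (?P<res_pt>\S+) '
    r'(?P<elb_status>\S+) (?P<tgt_status>\S+) (?P<rx>\S+) (?P<tx>\S+) '
    r'"(?P<request>[^"]*)" "(?P<ua>[^"]*)" '
    r'(?P<ssl_cipher>\S+) (?P<ssl_proto>\S+) (?P<tg_arn>\S+)'
)
fields = ALB_RE.match(LINE).groupdict()

print(naive[12:15])        # fragments of the request line
print(fields["request"])   # the request line as one field
```

The split version leaves the method, URL, and protocol in three tokens and shifts every later field by an unpredictable amount, because the user agent contains spaces too. The regex keeps each quoted value whole.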

The fourth is: what questions does the on-call actually need answered, and in what order? The time shape comes first: is this a spike, a step, or a creep? Top URI next: what path is hurting? Target group after that: which backend? Then the infrastructure-vs-application split: is it a 504 from the ALB giving up, a 502 from a bad response, or a 500 from the application itself? Latency correlation: is a slow target the cause? Client correlation: is this one misbehaving client or everybody? Each has a canonical Logs Insights query, and stitching them together is the runbook.

The fifth is: what happens to the bill while I investigate? Logs Insights prices per gigabyte scanned. The lever isn’t how good the query is; it’s how narrow the time range and how specific the log-group selection are. A 15-minute window against one log group costs pennies; a 90-day window across every ALB costs real money. The filters run after the scan, so filtering aggressively doesn’t cut cost; it just cuts the result size. Cost discipline is time-range discipline.
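A back-of-envelope sketch of that arithmetic, in Python. The rate is the approximate figure quoted in this post; the ingestion volume (2 GB per hour per log group) and the fleet size (20 log groups) are invented purely to show the scale difference.

```python
PRICE_PER_GB_SCANNED = 0.005  # USD, approximate rate quoted in this post

def scan_cost_usd(gb_per_hour: float, hours: float, log_groups: int = 1) -> float:
    """Estimate one query's cost: every byte in the window is scanned and billed."""
    return gb_per_hour * hours * log_groups * PRICE_PER_GB_SCANNED

# Hypothetical volume: each ALB log group ingests 2 GB per hour.
incident = scan_cost_usd(gb_per_hour=2.0, hours=0.25)                  # 15 minutes, one group
fleet = scan_cost_usd(gb_per_hour=2.0, hours=90 * 24, log_groups=20)   # 90 days, 20 groups

print(f"15 min, 1 group:    ${incident:.4f}")
print(f"90 days, 20 groups: ${fleet:.2f}")
```

Note what is absent from the function: the filter. Under the pricing model described here, only the window and the number of log groups move the result.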

The sixth is when does this tool stop fitting? An interactive query engine that scans log groups directly is comfortable at gigabyte scale; at hundreds of gigabytes per query the latency kills the workflow, and the correct pivot is a columnar query engine over an export to object storage. For top-N questions run every five minutes, a continuously-maintained top-N stream is cheaper than re-scanning. Knowing where the tool tops out is part of knowing when to reach for it.

Finally: what’s the artefact of the investigation? A saved query, pasted into the post-incident ticket, with the time range and log groups that answered it. That’s reusable knowledge: the next person paged for the next 5xx spike runs the same query against a different window.

What we’ll filter on

Four filters the 03:00 workflow needs.

  1. Ad-hoc and interactive. No job to schedule, no table to create. The question is asked once and answered once; a follow-up question is one more query, not one more pipeline.
  2. Structured field extraction from an unstructured log line. ALB access logs are a space-delimited, partially-quoted format. The tool needs to parse that into named fields the engineer can filter and aggregate on.
  3. Low time-to-result. Seconds from “I typed the query” to “I can read the answer.” Any solution that requires standing something up first is fighting the clock.
  4. Aggregation, percentile, and time-bucketed grouping. “Top URIs by count”, “P99 latency by target group”, “5xx rate in 1-minute buckets”: these are aggregation queries, not raw log tailing.

The query and analytics landscape

Five places you can reasonably send a question like “what’s happening in my ALB logs right now.”

CloudWatch Logs Insights. A purpose-built query language running directly against CloudWatch log groups with no extraction step. You select log groups, type a query, and answers come back in seconds against gigabytes. Supports field parsing via regex or glob patterns, aggregation, percentiles, and time-bucketed stats via bin(). Billed per GB scanned (~$0.005/GB).

CloudWatch Metrics Insights. A SQL-like language for CloudWatch metrics, not logs. Useful for slicing existing metric data across dimensions. Can’t read log lines; can’t parse a URI out of a request line.

Athena over CloudWatch Logs exported to S3. CloudWatch Logs supports an export to S3; Athena can query that bucket as a table partitioned by date. The correct shape for multi-week or multi-month retrospectives, large scans, and SQL joins. Two problems for the 03:00 case: the export is batch-oriented, and the table has to exist first.

OpenSearch Service ingestion. Route logs to an OpenSearch cluster, index them, query with DQL or the OpenSearch query DSL, visualise in Dashboards. Excellent for continuous observability with rich UIs, if the cluster already exists. For the situation as stated, it’s a project, not a query.

Third-party SIEM (Splunk, Sumo Logic, Datadog Logs, etc.). Same shape as OpenSearch: powerful if already in place, not a 03:00 tool otherwise.

Side by side

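Option | Ad-hoc / interactive | Structured field extraction | Low time-to-result | Aggregation + percentiles
CloudWatch Logs Insights | Yes | Yes | Yes | Yes
CloudWatch Metrics Insights | Yes | No (metrics only, no log lines) | Yes | Metrics only
Athena over S3 export | No (table must exist first) | Yes | No (batch export) | Yes
OpenSearch Service | Only if the cluster already exists | Yes | Only if the cluster already exists | Yes
Third-party SIEM | Only if already in place | Yes | Only if already in place | Yes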

Logs Insights is the only one that clears all four at 03:00. Athena is the escalation path if the investigation grows past what Logs Insights scans comfortably; OpenSearch and a SIEM are correct if they already exist in the environment.

Matching questions to queries

Pager: HTTPCode_ELB_5XX_Count, 3 consecutive minutes, us-east-1

  1. Time shape. stats count() as errors by bin(1m), filter elb_status >= 500. Spike? Step? Ramp? 15-minute window.
  2. Top failing URI. parse request, split method / url; stats count() by url | sort desc. Which path is hurting? The top row is usually the answer.
  3. Target group attribution. stats count() by tg_arn, filtered to 5xx. Which backend is failing?
  4. ELB vs target. stats count() by elb_status, tgt_status. (504, -) ALB timeout; (502, -) bad target response; (500, 500) application bug.
  5. P99 latency by TG. pct(tgt_pt, 99) as p99; exclude tgt_pt = -1 (no value). Slow target precedes 5xx.
  6. Client IP correlation. stats count() by client_ip; parse client:port, split IP. One client or many?

Answer: checkout service, us-east-1d, slow DB + 504s. Page to diagnosis typically under three minutes; queries saved to the runbook for the next page.

Six queries in sequence, each narrowing the hypothesis. Time shape first, then top URI, then target group, then the ELB-vs-target split, then P99 latency, then client correlation. Each answer informs the next.

Logs Insights, in depth

Logs Insights has a small, pipeline-style query language. Commands chain left-to-right with |; comments start with #. The commands you reach for most often:

  • fields: projects specific fields. @timestamp, @message, @logStream, and @log are always available.
  • filter: narrows to events matching conditions. Supports =, !=, >, <, like, not like, in, and, or, plus regex.
  • parse: extracts structured fields from an unstructured message, in glob mode ("pattern with * wildcards") or regex mode with named capture groups (/(?<name>pattern)/). Glob is easier to read; regex is what you need when the line has quoted substrings with embedded spaces.
  • stats: aggregates. count(), sum(), avg(), min(), max(), pct(field, percentile), stddev(), grouped with by.
  • sort, limit, bin(): top-N queries and time-series aggregation.
  • display, dedup: final-column selection and duplicate removal.

Higher-level commands: pattern, diff, anomaly. Useful, not bread-and-butter.

The structure of an ALB access log line. The ALB writes one record per request in a documented, space-delimited format with specific quoted substrings. In order: type, time, elb, client:port, target:port, request_processing_time, target_processing_time, response_processing_time, elb_status_code, target_status_code, received_bytes, sent_bytes, quoted "request" (method, URL, version), quoted "user_agent", ssl_cipher, ssl_protocol, target_group_arn, quoted "trace_id", quoted "domain_name", quoted "chosen_cert_arn", matched_rule_priority, request_creation_time, quoted "actions_executed", then several more quoted fields.
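The positional layout can be sketched offline in Python by letting shlex.split honour the double quotes and then zipping the tokens against the field order listed above. This is an illustration of the format, not how Logs Insights parses anything; the sample line is invented, the fields after actions_executed are lumped into a hypothetical "extra" key, and shlex’s shell-style handling of backslashes and stray quotes means real traffic may need a stricter parser.

```python
import shlex

# Field order as listed above; fields beyond actions_executed land in "extra".
FIELD_NAMES = [
    "type", "time", "elb", "client_port", "target_port",
    "request_processing_time", "target_processing_time", "response_processing_time",
    "elb_status_code", "target_status_code", "received_bytes", "sent_bytes",
    "request", "user_agent", "ssl_cipher", "ssl_protocol", "target_group_arn",
    "trace_id", "domain_name", "chosen_cert_arn", "matched_rule_priority",
    "request_creation_time", "actions_executed",
]

def parse_alb_line(line: str) -> dict:
    """Map one access-log line onto named fields, keeping quoted values whole."""
    tokens = shlex.split(line)
    record = dict(zip(FIELD_NAMES, tokens))
    record["extra"] = tokens[len(FIELD_NAMES):]
    return record

# Hypothetical log line for illustration only.
LINE = (
    'https 2027-08-11T03:00:12.345678Z app/shop-prod/0123456789abcdef '
    '203.0.113.7:51234 10.0.1.15:8080 0.001 0.250 0.000 500 500 412 288 '
    '"POST https://shop.example.com:443/checkout/pay HTTP/1.1" '
    '"Mozilla/5.0 (X11; Linux x86_64)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 '
    'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/checkout/abcdef0123456789 '
    '"Root=1-00000000-000000000000000000000000" "shop.example.com" '
    '"arn:aws:acm:us-east-1:123456789012:certificate/00000000-0000-0000-0000-000000000000" '
    '2 2027-08-11T03:00:12.094000Z "forward" "-" "-" "10.0.1.15:8080" "500" "-" "-"'
)

record = parse_alb_line(LINE)
print(record["request"], record["elb_status_code"], record["actions_executed"])
```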

Parse with regex when the line has quoted substrings. A "GET /api/orders?status=open HTTP/1.1" value contains embedded spaces that glob’s naive * would split on.

Multi-log-group queries. The console lets you select up to 50 log groups. For programmatic use, the SOURCE command (CLI and SDK only) targets by log-group prefix, account, or log class.
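For the programmatic route, a hedged sketch of the request shape: the helper below (a hypothetical name, not part of any SDK) builds the keyword arguments for the CloudWatch Logs StartQuery call with an explicit list of log groups and a 15-minute window. It deliberately does not use SOURCE, and it stops short of calling AWS so it runs anywhere.

```python
from datetime import datetime, timedelta, timezone

def build_start_query_request(log_groups, query, window_minutes=15, now=None):
    """Build keyword arguments for the CloudWatch Logs StartQuery API."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=window_minutes)
    return {
        "logGroupNames": list(log_groups),
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": query,
    }

request = build_start_query_request(
    ["/aws/alb/shop-prod-access"],
    "fields @timestamp, @message | limit 20",
    now=datetime(2027, 8, 11, 3, 3, tzinfo=timezone.utc),
)
print(request)

# With boto3 installed and credentials configured, the request would be sent as:
#   client = boto3.client("logs")
#   query_id = client.start_query(**request)["queryId"]
#   results = client.get_query_results(queryId=query_id)  # poll until complete
```

Keeping the window as an explicit parameter makes the cost lever visible in code review: widening it is a one-argument change that someone has to type on purpose.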

Pricing. Logs Insights bills per GB scanned (~$0.005/GB). Two controls affect the bill directly: the time range and the set of log groups. filter runs after the scan, so filtering aggressively doesn’t reduce cost; it reduces result size.

A worked example

The alarm tripped three minutes ago. Time range: last 15 minutes. Log group: /aws/alb/shop-prod-access.

Query 1, time shape.

parse @message /^(?<type>\S+) (?<time>\S+) (?<elb>\S+) (?<client_port>\S+) (?<target_port>\S+) (?<req_pt>\S+) (?<tgt_pt>\S+) (?<res_pt>\S+) (?<elb_status>\S+) (?<tgt_status>\S+) (?<rx>\S+) (?<tx>\S+) "(?<request>[^"]*)" "(?<ua>[^"]*)" (?<ssl_cipher>\S+) (?<ssl_proto>\S+) (?<tg_arn>\S+)/
| filter elb_status >= 500 and elb_status < 600
| stats count() as errors by bin(1m)
| sort @timestamp asc
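For anyone who wants to see what the bin(1m) grouping is doing, here is a rough offline equivalent in Python. It assumes the ALB timestamp format shown earlier (ISO 8601, so the first 16 characters identify the minute) and uses invented sample records; it is a sketch of the idea, not of Logs Insights internals.

```python
from collections import Counter

def errors_per_minute(records):
    """records: iterable of (iso_timestamp, elb_status) pairs."""
    buckets = Counter()
    for ts, status in records:
        if 500 <= int(status) < 600:
            buckets[ts[:16]] += 1  # "YYYY-MM-DDTHH:MM" is the 1-minute bucket key
    return sorted(buckets.items())

# Hypothetical records for illustration only.
SAMPLE = [
    ("2027-08-11T02:58:05.000000Z", "200"),
    ("2027-08-11T02:58:40.000000Z", "504"),
    ("2027-08-11T02:59:10.000000Z", "504"),
    ("2027-08-11T02:59:30.000000Z", "502"),
    ("2027-08-11T02:59:55.000000Z", "504"),
    ("2027-08-11T03:00:01.000000Z", "200"),
]

series = errors_per_minute(SAMPLE)
for minute, errors in series:
    print(minute, errors)
```

Minutes with no 5xx simply do not appear in the output, which is worth remembering when reading the shape: a gap is a quiet minute, not missing data.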

Query 2, top failing URI.

parse @message /^(?<type>\S+) (?<time>\S+) (?<elb>\S+) (?<client_port>\S+) (?<target_port>\S+) (?<req_pt>\S+) (?<tgt_pt>\S+) (?<res_pt>\S+) (?<elb_status>\S+) (?<tgt_status>\S+) (?<rx>\S+) (?<tx>\S+) "(?<request>[^"]*)" (?<rest>.*)/
| parse request /^(?<method>\S+) (?<url>\S+) (?<proto>\S+)$/
| filter elb_status >= 500
| stats count() as errors by url
| sort errors desc
| limit 20

Query 3, target group attribution.

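# Filter on the parsed ELB status; a raw-text match on 5xx can also hit byte counts or rule priorities.
parse @message /^(?:\S+\s+){8}(?<elb_status>\d+) (?:\S+\s+){3}"(?<request>[^"]*)" "(?<ua>[^"]*)" (?<ssl_cipher>\S+) (?<ssl_proto>\S+) (?<tg_arn>\S+)/
| filter elb_status >= 500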
| stats count() as errors by tg_arn
| sort errors desc

Query 4, backend vs ELB errors. (504, -) means the ALB gave up waiting on the target; (502, -) means a bad response from the target; (500, 500) is the target’s own error.

parse @message /^(?:\S+\s+){8}(?<elb_status>\d+) (?<tgt_status>\S+) /
| filter elb_status >= 500
| stats count() as errors by elb_status, tgt_status
| sort errors desc
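The reading of those status pairs can be written down as a small lookup, sketched here in Python. Only the three pairs named above come from this post; the function name and the fallback labels for every other combination are assumptions added so the sketch is total.

```python
def classify_5xx(elb_status: str, tgt_status: str) -> str:
    """Label an (elb_status, tgt_status) pair from Query 4's output."""
    if not elb_status.startswith("5"):
        return "not a 5xx"
    if tgt_status == "-":
        # No target status recorded: the ALB produced the error itself.
        if elb_status == "504":
            return "ALB timed out waiting for the target"
        if elb_status == "502":
            return "bad response from the target"
        return "ALB-generated error, no target status"  # assumed fallback
    if tgt_status.startswith("5"):
        return "application error returned by the target"
    return "unclassified"  # assumed fallback

for pair in [("504", "-"), ("502", "-"), ("500", "500")]:
    print(pair, "->", classify_5xx(*pair))
```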

Query 5, P99 latency by target group.

parse @message /^(?:\S+\s+){6}(?<tgt_pt>[\d.\-]+) (?<res_pt>[\d.\-]+) (?<elb_status>\d+) (?<tgt_status>\S+) /
| parse @message /(?<tg_arn>arn:aws:elasticloadbalancing:[^\s]+)/
| filter tgt_pt != "-1" and tgt_pt != "-"
| stats pct(tgt_pt, 99) as p99, pct(tgt_pt, 50) as p50, count() as requests by tg_arn
| sort p99 desc
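An offline sketch of the same aggregation in Python, mainly to show why the -1 sentinel has to go before any percentile is taken. The percentile here is a simple nearest-rank calculation, which is an assumption and may not match how pct() interpolates; the target group names and latencies are invented.

```python
import math

def pct(values, percentile):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_by_target_group(records):
    """records: iterable of (tg_arn, target_processing_time) string pairs."""
    by_tg = {}
    for tg, tgt_pt in records:
        value = float(tgt_pt)
        if value < 0:  # -1 means no target time was recorded for this request
            continue
        by_tg.setdefault(tg, []).append(value)
    return {
        tg: {"p50": pct(v, 50), "p99": pct(v, 99), "requests": len(v)}
        for tg, v in by_tg.items()
    }

# Hypothetical records; short names stand in for full target group ARNs.
SAMPLE = [
    ("checkout", "0.2"), ("checkout", "0.3"), ("checkout", "4.0"),
    ("checkout", "-1"), ("checkout", "9.5"),
    ("api", "0.01"), ("api", "0.02"), ("api", "0.03"),
]

summary = latency_by_target_group(SAMPLE)
print(summary)
```

Leaving the -1 rows in would drag the median down and hide exactly the requests that timed out, so the exclusion is part of the measurement rather than a tidy-up step.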

Query 6, client IP correlation.

parse @message /^(?:\S+\s+){3}(?<client_port>\S+) (?:\S+\s+){4}(?<elb_status>\d+) /
| parse client_port /^(?<client_ip>[^:]+):\d+$/
| filter elb_status >= 500
| stats count() as errors by client_ip
| sort errors desc
| limit 20

Six queries, each seconds to type after the first, each seconds to run. Page to “we know what the checkout service is doing” is typically under three minutes.

When Logs Insights stops scaling

Logs Insights comfortably scans gigabytes. At tens of GB per query the wait starts to matter; at hundreds of GB or terabytes it’s time to pivot.

Saved queries plus a narrower time range. The cheapest optimisation. Save the query; set the default time range to the smallest window that still answers the question. Every halving of the time range roughly halves the bytes scanned, assuming steady log volume.

Athena over the Logs -> S3 export. CloudWatch Logs supports an export task to S3; Athena queries that bucket as a partitioned table. SQL joins are available for correlating ALB logs with CloudTrail, WAF, or application logs. Athena also bills per byte scanned, but partitioning, compression, and a columnar format mean a large retrospective scans far fewer bytes than the same question would in Logs Insights. Right tool for “all 5xx across the fleet for the last 90 days”; wrong tool for “5xx from three minutes ago.”

CloudWatch Contributor Insights. Use it when the shape is “top-N over a continuous stream”: top failing URIs, top-talker client IPs, top user agents. Contributor Insights maintains a running top-N from a log group, backed by a rule that pattern-matches and keys on chosen fields. Cheaper than re-running a Logs Insights top-N on a schedule, and the result shows up as a metric you can graph.

Logs Insights is the investigative tool; Contributor Insights is the operational one.

What’s worth remembering

  1. CloudWatch Logs Insights is the default tool for ad-hoc interactive querying of CloudWatch log groups: no infrastructure to stand up, answers in seconds, priced per GB scanned.
  2. The Logs Insights command vocabulary (fields, filter, parse in glob and regex modes, stats, sort, limit, bin(), display, dedup) covers the shapes most incident queries need.
  3. Parse with regex, not glob, when the line contains quoted substrings with embedded spaces (ALB request line, user agent). Glob’s * splitter eats the wrong characters.
  4. The ALB access log format is positional, space-delimited, with specific quoted substrings. Fields 1-12 are unquoted; field 13 (request_line) and 14 (user_agent) are quoted; the quoted pattern repeats for trace ID, domain, actions, and so on.
  5. Logs Insights prices on bytes scanned, not rows returned. Narrow the time range and the log-group selection to control cost. filter runs after scan, so filtering harder doesn’t reduce cost.
  6. Metrics Insights, Athena, OpenSearch, and third-party SIEMs are adjacent tools, not substitutes at 03:00. Athena is for retrospectives over huge volumes; OpenSearch for continuous rich dashboards; Metrics Insights for metric-only slicing.
  7. Contributor Insights is the correct pivot from Logs Insights for scheduled top-N analysis: it maintains the result continuously rather than rescanning on each refresh.
  8. SOURCE (CLI and SDK only) enables multi-log-group queries by prefix, account, or log class without enumerating every group, useful past the 50-log-group console limit.

The answer: CloudWatch Logs Insights, run from the console against /aws/alb/shop-prod-access over the last 15 minutes, with a regex parse extracting the positional ALB fields and request split into method, url, proto. Run the six queries (time shape, top URI, target group, ELB-vs-target split, P99 latency, top client IP) and the 5xx has a URI, a target group, a timing pattern, and a client signature before the coffee is cold. Need history beyond the 30-day retention window? Export the logs to S3 before they expire and pivot to Athena. Same top-N question every five minutes from now on? Spin up a Contributor Insights rule and stop paying to rescan.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.