How to Use Detective to Cut Incident MTTR

November 20, 2028 · 13 min read

Security · SCS-C03 · part of The Exam Room

The situation

An organisation running on AWS with GuardDuty enabled for some time, findings routed to PagerDuty for HIGH and CRITICAL, MTTR on security incidents running around six hours. The postmortem from the last incident was frank: the thirty minutes after the finding fired were productive, the four hours after that were “engineer manually joining CloudTrail, Flow Logs, DNS logs, and the GuardDuty findings console in Athena, trying to answer basic questions about what was normal for this entity.”

The security team’s Q3 ask is to cut MTTR in half. The concrete investigation needs are:

  • “What else has this entity touched?” Given a compromised role, enumerate every resource the role’s sessions interacted with in the last 30 days.
  • “Is this unusual?” Has this user logged in from this IP before? Does this role normally make this many API calls per hour? Does this instance normally talk to this port?
  • “Did this spread?” Starting from the initial instance, which other entities saw activity from the same actor during the incident window?
  • “When did this start?” The finding fires when the pattern crosses a threshold, but the actual intrusion usually started earlier. Detective’s timeline view is meant to answer this.

These are the questions Detective is built for.

What actually matters

An investigation tool is doing something different from a detection tool. A detector asks “does anything look bad?” and produces findings. An investigator asks “given this finding, tell me the story around it” and produces a timeline and a neighbourhood.

The fundamental shape an investigator needs is a graph of entities and their interactions over time. Entities are IPs, instances, IAM users, IAM roles, accounts, EKS clusters, S3 buckets, Lambda functions. Interactions are API calls, network flows, DNS queries, assume-role chains, EKS API events. Store all that in a graph database, index by entity, and “what did this role do in the last 30 days” is a graph query. Build baselines over time (“this role normally makes 80 API calls an hour from two source IPs”), and “unusual for this role” becomes something you can measure rather than guess.
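To make the shape concrete, here is a minimal in-memory sketch of an entity graph with timestamped interactions. It is a toy, not how Detective stores anything: the `EntityGraph` class, the entity naming scheme, and the bucket names are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class EntityGraph:
    """Toy graph: entities are nodes, timestamped interactions are edges."""

    def __init__(self):
        # entity -> list of (timestamp, interaction kind, other entity)
        self._edges = defaultdict(list)

    def add_interaction(self, src, dst, kind, when):
        # Index the edge under both endpoints so either can be the entry point.
        self._edges[src].append((when, kind, dst))
        self._edges[dst].append((when, kind, src))

    def touched(self, entity, now, days=30):
        """Entities that interacted with `entity` in the last `days` days."""
        cutoff = now - timedelta(days=days)
        return sorted({other for when, _kind, other in self._edges[entity] if when >= cutoff})


now = datetime(2028, 11, 20, 23, 14, tzinfo=timezone.utc)
graph = EntityGraph()
role = "role/prod-payments-ec2"
graph.add_interaction(role, "s3/payments-ledger", "api_call", now - timedelta(days=2))
graph.add_interaction(role, "ip/203.0.113.77", "network_flow", now - timedelta(minutes=16))
graph.add_interaction(role, "s3/old-archive", "api_call", now - timedelta(days=45))

print(graph.touched(role, now))  # -> ['ip/203.0.113.77', 's3/payments-ledger']
```

Because every edge is indexed by entity, the 30-day question is a lookup plus a time filter; the 45-day-old interaction drops out without any join.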

Building that yourself is expensive. The data sources are CloudTrail management and data events, VPC Flow Logs, Route 53 Resolver DNS logs, GuardDuty findings themselves, EKS audit logs, and any cross-source repository the org has configured. Normalising them, joining them, retaining them for a useful window, indexing them by entity, and exposing baselines is a moderately complex data pipeline, which is why a managed graph layer over those same sources earns its place.

There are two tempting misconceptions to name. First, an investigator is not a replacement for the detector; it consumes the detector’s findings. Enable detection first, investigation second; “investigation instead of detection” is not a configuration that makes sense. Second, an AWS-native investigation graph is not a substitute for SIEM correlation; the SIEM correlates across sources it has been configured to read, at the pace it has been tuned for. An AWS-native graph is narrower (AWS sources only), deeper (entity-centric baselines over a long rolling window), and operated without tuning. For AWS-native investigation, the graph wins. For cross-cloud cross-source correlation, the SIEM wins. They coexist.

What we’ll filter on

  1. Entity coverage: which kinds of thing have profiles?
  2. Data sources: what does Detective see?
  3. Time window: how far back does the graph go?
  4. Baseline availability: can we compare “now” to “normal for this entity”?
  5. Entry point: how does an investigator arrive at the profile page?

The investigation-tool landscape

1. Amazon Detective. Managed behavioural graph built from CloudTrail, VPC Flow Logs, Route 53 Resolver query logs, EKS audit logs, GuardDuty findings, and Security Lake data. Entity profiles for IPs, EC2 instances, IAM users, IAM roles, accounts, EKS clusters, Kubernetes subjects, and findings. 365-day rolling window. Priced on volume of source logs ingested, not per-finding. Organisation-wide via delegated admin; consolidated view across accounts.

2. Athena over raw logs. The DIY graph. Point Athena at CloudTrail in S3, Flow Logs in S3, DNS logs in S3; write join queries to answer entity questions. It works, but it needs a senior engineer to write the joins, it costs money per query, and it provides no baseline data.

3. SIEM correlation (Splunk, Sumo, Datadog, Elastic). The third-party alternative. Normalises AWS logs into the SIEM’s schema, provides dashboards, correlation rules, alerting. Price and tuning effort are both significant. Primary wins: cross-cloud coverage, custom rules, and long retention with hot queries if the budget supports it.

4. Security Lake. Not an investigator but the normalised event repository (OCSF format) that Detective, Athena, and SIEMs can all read. A reasonable substrate when the org wants to separate storage from query.

5. CloudTrail Lake. Managed SQL-queryable CloudTrail archive, up to 10 years. Handles “CloudTrail questions” but doesn’t have Detective’s cross-source graph or entity-behaviour baselines.
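For a sense of what the Athena option involves, here is a sketch of a query builder for the “what has this role called?” question. It assumes a CloudTrail table created from the standard Athena DDL (with `useridentity`, `eventsource`, `eventname`, `eventtime`, and `sourceipaddress` columns); the table name `cloudtrail_logs`, the function name, and the account ID are placeholders. It covers one source only, so correlating with Flow Logs would be a further join on top.

```python
def role_activity_query(role_arn: str, days: int = 30, table: str = "cloudtrail_logs") -> str:
    """Build Athena SQL summarising the API calls made by sessions of one role."""
    if "'" in role_arn:
        # Crude guard against breaking out of the SQL string literal.
        raise ValueError("unexpected quote in role ARN")
    days = int(days)
    return f"""
SELECT eventsource, eventname, sourceipaddress, count(*) AS calls
FROM {table}
WHERE useridentity.sessioncontext.sessionissuer.arn = '{role_arn}'
  AND from_iso8601_timestamp(eventtime) > current_timestamp - interval '{days}' day
GROUP BY eventsource, eventname, sourceipaddress
ORDER BY calls DESC
""".strip()


print(role_activity_query("arn:aws:iam::111122223333:role/prod-payments-ec2"))
```

The query answers “what happened” but says nothing about whether it is normal; the baseline would have to come from a second query over an earlier window.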

Side by side

| Tool | Entity coverage | Sources | Window | Baselines | Cost shape |
| --- | --- | --- | --- | --- | --- |
| Detective | IP, EC2, user, role, account, EKS cluster, finding | CT + Flow + DNS + GD + EKS | 365 days rolling | Built in | Per-GB ingest |
| Athena over raw logs | Any (join-on-ID) | Whatever is in S3 | Retention-bound | None | Per-GB scan + S3 |
| SIEM correlation | Varies | Any sources piped in | SIEM-bound | Custom | Per-GB ingest + licence |
| Security Lake | N/A (storage layer) | OCSF-normalised AWS + partner | Lifecycle-configurable | None | Per-GB ingest + lifecycle |
| CloudTrail Lake | CloudTrail entities | CloudTrail only | 10 years | None | Per-event + per-GB scan |

Reading the table: Detective’s niche is AWS-native, entity-centric, baselined. No other tool in the list does all three. Each of the others is stronger on one dimension (CloudTrail Lake for CloudTrail retention, SIEM for cross-source custom rules, Athena for DIY flexibility), but Detective is the default answer for “something’s wrong with this instance, tell me the story.”

Walking the Detective graph

[Figure: The graph walk, starting from a finding. The GuardDuty finding EC2/MaliciousIPCaller.Custom (severity 5.0, 23:14) is associated with IAM role prod-payments-ec2, fired on EC2 instance i-0abc1234, and is connected to suspicious IP 203.0.113.77. Role: 47 API calls in the last 3 hours; 2 source IPs, 1 new to the role; 3 services (S3, IAM, STS). Instance: Flow Logs at 4.2 MB/s outbound; 2 ASG sibling instances. IP: AnonymousIP feed match; first seen 23:08. Each node is a Detective profile page, with panels for activity over time, baseline comparison, peer neighbourhood, and a recent API histogram.]
Start at the finding, walk the three primary entities (role, instance, IP), drill each one for context.

The picks in depth

Enable Detective under the same delegated admin as GuardDuty. Both services benefit from a single organisational view; using the same delegated admin account means “Investigate in Detective” is one click from any GuardDuty finding in any member account, with the entity graph already loaded. Detective ingests from GuardDuty’s data-source configuration (so the Flow Logs, DNS logs, CloudTrail events that GuardDuty reads are the same ones Detective reads), and enables additional sources (EKS audit logs) if those services are in the environment.
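One way to verify the “same delegated admin” setup is to compare what each service reports. The sketch below takes already-constructed clients (in practice `boto3.client("guardduty")` and `boto3.client("detective")`) and calls `list_organization_admin_accounts` on both. To my understanding both calls need to run from the organisation management account and are per-Region; pagination is ignored here, and the function name is my own.

```python
def shared_delegated_admin(guardduty_client, detective_client):
    """Return an account ID that is delegated admin for both services, else None."""
    gd = guardduty_client.list_organization_admin_accounts()
    dt = detective_client.list_organization_admin_accounts()

    # GuardDuty reports AdminAccounts with an AdminStatus; keep only enabled ones.
    gd_admins = {
        a["AdminAccountId"]
        for a in gd.get("AdminAccounts", [])
        if a.get("AdminStatus") == "ENABLED"
    }
    # Detective reports Administrators, each with an AccountId and GraphArn.
    dt_admins = {a["AccountId"] for a in dt.get("Administrators", [])}

    common = sorted(gd_admins & dt_admins)
    return common[0] if common else None


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   admin = shared_delegated_admin(boto3.client("guardduty"), boto3.client("detective"))
#   if admin is None: the two services are administered from different accounts.
```

Passing the clients in keeps the check easy to exercise with stubs before pointing it at a real organisation.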

Understand the 365-day window. Detective’s graph is a rolling 365-day window. Activity older than that is not available in Detective’s profile pages; queries that need older context fall back to CloudTrail Lake, Athena over CloudTrail in S3, or Security Lake. Practically this means Detective is always the first tool to open in an investigation, not the only one, and “let’s check 18 months ago” is a Lake or S3 query.
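As a runbook aid, the routing rule can be written down as a tiny helper. This is only a sketch of the decision described above; the function name is hypothetical, and whether the fallback tools actually hold the data depends on the retention configured for them.

```python
from datetime import datetime, timedelta, timezone

DETECTIVE_WINDOW = timedelta(days=365)  # rolling window described above


def tools_for(event_time, now):
    """Suggest where to look for activity that happened at `event_time`."""
    age = now - event_time
    if age < timedelta(0):
        raise ValueError("event_time is in the future")
    if age <= DETECTIVE_WINDOW:
        return ["Detective"]
    # Older than the graph: fall back to the archives, retention permitting.
    return ["CloudTrail Lake", "Athena over CloudTrail in S3", "Security Lake"]


now = datetime(2028, 11, 20, tzinfo=timezone.utc)
print(tools_for(now - timedelta(days=3), now))    # -> ['Detective']
print(tools_for(now - timedelta(days=548), now))  # roughly 18 months ago: archive tools
```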

Use entity profile pages, not search. The Detective console is primarily a set of profile pages rather than a search bar. From a finding, click through to the IAM role’s page, which has panels for API-call volume over time, top user-agents, top API calls, unusual API calls, and linked entities (accounts the role was assumed from, instances the role ran on). The instance’s page shows Flow Log destinations, DNS queries, linked roles, and sibling instances. An IP’s page shows every entity that has interacted with that IP in the window. The workflow is “finding → entity → panels → adjacent entity”: walking the graph by clicking, not by querying.

Read the baselines carefully. Detective computes rolling baselines: “this role’s API-call volume vs. its 30-day average,” “this user’s login sources vs. its historical set,” “this instance’s outbound bytes vs. its normal.” These are statistical signals, not alarms. The baseline says “unusual,” the context says “unusual because.” An engineer running a new deployment will also be “unusual”; the tool surfaces the anomaly and the human decides whether it’s the attack or the deploy.
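To illustrate what “statistical signal, not alarm” means, here is a plain z-score comparison against a history of hourly counts. This is a stand-in for illustration only; it is not Detective’s actual baseline method, and the function name, threshold, and sample numbers are assumptions.

```python
from statistics import mean, stdev


def baseline_signal(history, current, threshold=3.0):
    """Compare `current` with a history of observations using a z-score."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma > 0:
        z = (current - mu) / sigma
    else:
        # A perfectly flat history: any deviation at all is off the scale.
        z = 0.0 if current == mu else float("inf")
    return {"baseline": mu, "z": z, "unusual": abs(z) > threshold}


hourly_calls = [78, 82, 80, 79, 81, 80, 83, 77]  # hypothetical history, about 80/hour

print(baseline_signal(hourly_calls, 85))    # within normal variation
print(baseline_signal(hourly_calls, 1900))  # far outside it
```

The output flags the second value as unusual but cannot say why. A deploy that multiplies call volume would score just as high as an attacker enumerating IAM, which is why the human still supplies the “because.”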

A worked investigation walk

23:14. GuardDuty finding fires: UnauthorizedAccess:EC2/MaliciousIPCaller.Custom on i-0abc1234, severity 5, outbound to 203.0.113.77. On-call acks within 90 seconds and clicks “Investigate in Detective” from the finding page.

23:16. Detective opens the profile page for i-0abc1234. Panel 1: “Overall API call volume over time” shows a flat line for 60 days, then a step change at 22:58 UTC (16 minutes before the finding). Panel 2: “VPC Flow Logs volume” shows outbound traffic to new IPs starting at 22:58, peaking at 4.2 MB/s sustained. Panel 3: “Linked IAM entities” shows one role, prod-payments-ec2, assumed continuously since instance launch.

23:19. Click through to the prod-payments-ec2 profile. Panel 1: API calls. Normal baseline is ~80 calls/hour, two API types (s3:GetObject, dynamodb:Query). Last hour shows 1,900 calls across 18 API types including s3:ListBuckets, iam:ListRolePolicies, sts:GetCallerIdentity. Panel 2: Source IPs. Normal baseline shows one IP (the instance’s ENI). Last hour shows two: the instance, plus an IP in eu-central-1. Panel 3: User-agents. Boto3/1.34.0 Python/3.11.5 appears for the first time in the role’s history.

23:22. Back to the instance. Click “ASG siblings” panel. Two sibling instances, i-0def and i-0ghi. Profile on i-0def shows an 8-minute spike of outbound traffic to the same 203.0.113.77 three days ago, never repeated. The spread is latent; the compromise may be older than 22:58.

23:25. Three tickets opened: i-0abc1234 quarantined (security group swap), i-0def quarantined for forensics, IAM role credential-exfiltration response initiated. Containment decision made on evidence, not on speculation.

Eleven minutes from finding to containment decision on three entities, one of which was not in the original finding. The same investigation via Athena joins across CloudTrail and Flow Logs would have taken two to three hours to find the sibling instance; Detective’s graph walk finds it by clicking a panel.
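The “step change at 22:58” reading in the walk is the “when did this start?” question answered by eye. A rough sketch of the same idea in code: walk backwards from the finding and keep extending the onset while activity stays well above baseline. The bucket size, multiplier, and counts below are invented for illustration.

```python
def find_onset(buckets, baseline, factor=3.0):
    """Return the label of the first bucket in the trailing anomalous run.

    `buckets` is a time-ordered list of (label, value) ending at the finding.
    Returns None if the most recent bucket is not above factor * baseline.
    """
    onset = None
    for label, value in reversed(buckets):
        if value > factor * baseline:
            onset = label  # still anomalous; push the onset further back
        else:
            break  # first normal bucket ends the run
    return onset


# Hypothetical API-call counts per 4-minute bucket; 80 calls/hour is ~5.3 per bucket.
api_calls = [
    ("22:50", 5), ("22:54", 6), ("22:58", 120), ("23:02", 130),
    ("23:06", 125), ("23:10", 128), ("23:14", 131),
]
print(find_onset(api_calls, baseline=80 / 15))  # -> 22:58
```

This only finds the start of the current burst on one entity. The eight-minute spike on the sibling three days earlier would not show up here, which is exactly why the walk to adjacent entities matters.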

A worked “is this normal?” check

A different scenario. 14:30 UTC on a Friday. GuardDuty fires Recon:EC2/PortProbeUnprotectedPort severity 2. The on-call wants to know: is this real, or is this a developer running a legitimate port scan from inside the VPC?

Detective profile for the probed instance. Panel: “New network connections”. Every connection shown has a corresponding flow in Flow Logs with an ACCEPT action, to a port the instance normally has open. The “probe” is a developer’s IP running a routine port scan for their application’s deploy verification, a scan they run every Friday. Not a real threat.

The false positive is caught in under a minute by comparing “now” to “normal for a Friday.” Without Detective, the same check is a judgement call resting on the on-call’s experience.
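The “normal for a Friday” comparison can be sketched as a check for the same source and port on the same weekday in earlier weeks. The flow-record shape, the function name, the IP, the port, and the three-week threshold are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def seen_on_prior_weekdays(history, source_ip, port, when, min_weeks=3):
    """True if (source_ip, port) appeared on the same weekday in enough earlier weeks.

    `history` is an iterable of (timestamp, source_ip, port) for accepted flows.
    """
    weeks = {
        tuple(ts.isocalendar())[:2]  # (ISO year, ISO week number)
        for ts, ip, p in history
        if ip == source_ip and p == port and ts < when and ts.weekday() == when.weekday()
    }
    return len(weeks) >= min_weeks


when = datetime(2028, 11, 24, 14, 30, tzinfo=timezone.utc)  # a Friday afternoon
# Hypothetical history: the developer's scan on each of the four previous Fridays.
history = [(when - timedelta(weeks=k), "10.0.3.25", 8443) for k in range(1, 5)]

print(seen_on_prior_weekdays(history, "10.0.3.25", 8443, when))     # -> True
print(seen_on_prior_weekdays(history, "203.0.113.77", 8443, when))  # -> False
```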

What’s worth remembering

  1. Detective is an investigation tool, not a detection tool. It consumes GuardDuty findings, CloudTrail, Flow Logs, DNS logs, and EKS audit logs; it does not itself emit new findings.
  2. Enable Detective from the same delegated admin account as GuardDuty. The “Investigate in Detective” link from any GuardDuty finding is the primary entry point.
  3. Detective’s window is 365 days rolling. Older investigations fall back to CloudTrail Lake, Athena over S3, or Security Lake.
  4. Entity profile pages are the primary interface. Walk from finding → entity → linked entities by clicking panels; the console is graph-shaped, not query-shaped.
  5. Baselines are statistical, not alarms. “Unusual for this role” is a signal that needs human context before it’s a conclusion.
  6. Detective’s sources are AWS-native. Cross-cloud or external-source correlation belongs in a SIEM; AWS-native graph belongs in Detective.
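  7. Pricing is per-GB of source data ingested, not per-finding. Detective pulls CloudTrail and Flow Log data through its own streams, so your own log delivery settings (file format, destination) do not change what it ingests; the bill tracks the underlying API and network activity.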
  8. Detective does not do containment. It tells you what’s going on; remediation is still Security Hub automations, custom Lambdas on EventBridge, or manual action.
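Since containment sits outside Detective, here is a sketch of the security-group swap used in the worked walk, as it might appear in a custom Lambda or a runbook script. It takes an EC2 client (in practice `boto3.client("ec2")`) and uses `describe_instances` and `modify_instance_attribute`; the function name and the quarantine group ID are hypothetical. It assumes a single network interface; instances with several interfaces would need each interface handled separately.

```python
def quarantine_instance(ec2_client, instance_id, quarantine_sg_id):
    """Swap the instance's security groups for one quarantine group.

    Returns the previous group IDs so the change can be recorded or reverted.
    """
    described = ec2_client.describe_instances(InstanceIds=[instance_id])
    instance = described["Reservations"][0]["Instances"][0]
    previous = [g["GroupId"] for g in instance.get("SecurityGroups", [])]

    # Groups replaces the full set of security groups on the instance.
    ec2_client.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])
    return previous


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   old = quarantine_instance(boto3.client("ec2"), "i-0abc1234", "sg-EXAMPLE")
```

Returning the previous groups keeps the action reversible, which matters when the “attack” turns out to be the deploy.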

The graph is already drawn. The work during an incident is walking it fast: finding → primary entities → linked entities → baseline comparisons → containment decision. Thirty minutes, not three hours, is the shape Detective is trying to enable.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.