The situation
A subsidiary has been acquired and we’ve inherited their AWS estate. Twelve accounts, four Regions, a couple of hundred EC2 instances, three RDS clusters, a handful of S3 buckets. Logging is enabled. CloudTrail is on, VPC Flow Logs are on for most VPCs, GuardDuty is enabled in the payer account, but the security team that used to run this is gone with the acquisition, and the runbook is a README that says “check CloudWatch.”
Overnight, GuardDuty fired three findings:
- UnauthorizedAccess:EC2/MaliciousIPCaller.Custom on i-0abc, outbound to a known-bad IP.
- Recon:EC2/PortProbeUnprotectedPort on the same instance.
- UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS on the IAM role attached to i-0abc, seen from a second-Region IP.
The on-call wants to answer five questions, fast:
- Did the packets GuardDuty is alerting on actually leave the VPC?
- What else did that instance talk to in the last fourteen days?
- Have the role’s credentials been used from anywhere else?
- Did any of that activity touch other accounts in the org?
- Is this still happening right now?
Three services are in play. Which answers which?
What actually matters
It’s tempting to treat “detection on AWS” as one capability and to open whichever console is closest. That produces a lot of tab-switching and not much signal. The three services are actually doing three separate jobs, and the jobs have a natural order.
The first job is recording what happened at the network layer. Every ENI in every VPC emits traffic; somebody has to capture metadata about that traffic cheaply, at scale, so that weeks later we can answer “did this IP ever talk to that IP?” The answer has to be storable, queryable, and not so expensive that we stop recording under budget pressure. This is a primary-source-of-truth job: accurate, comprehensive, and narrow in scope (packet metadata, not payloads).
The second job is noticing when the recorded activity looks bad. That requires a detector: somebody (or something) that watches the feeds, knows what malicious patterns look like, pattern-matches continuously, and emits a finding when something crosses a threshold. The detector has to be smart enough to ignore benign anomalies and fast enough to fire before the incident is over. It needs to correlate across more than one data source, because most real attacks show up in several logs at once.
The third job is investigation after the fact. Once a finding exists, somebody has to ask “what else was this actor doing?” That’s a graph-traversal problem: starting from one IP, one instance, one role, expand outward to everything that interacted with it, and sort by time so the narrative reads left-to-right. Building that graph from raw logs means joining Flow Logs, CloudTrail, DNS logs, and GuardDuty findings by hand every time, which is exactly the job an investigation tool should have done once and made reusable.
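To make the graph-traversal framing concrete, here is a minimal sketch of the "expand outward, then sort by time" step. It is not how Detective is implemented; the `Event` shape, the entity naming scheme (`instance:…`, `role:…`, `ip:…`), and the sample timestamps are all assumptions made up for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: str    # ISO-8601 UTC timestamp, so string order equals time order
    source: str  # entity that acted, e.g. "instance:i-0abc" (hypothetical naming)
    target: str  # entity acted upon, e.g. "ip:203.0.113.7"
    kind: str    # which log the edge came from: "flow", "cloudtrail", ...


def expand(events, seed, max_hops=2):
    """Return every event within `max_hops` of `seed`, oldest first."""
    # Index events by both endpoints so the walk can go in either direction.
    by_entity = {}
    for ev in events:
        by_entity.setdefault(ev.source, []).append(ev)
        by_entity.setdefault(ev.target, []).append(ev)

    seen_entities = {seed}
    seen_events = set()
    queue = deque([(seed, 0)])
    while queue:
        entity, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for ev in by_entity.get(entity, []):
            seen_events.add(ev)
            other = ev.target if ev.source == entity else ev.source
            if other not in seen_entities:
                seen_entities.add(other)
                queue.append((other, hops + 1))
    # Sorting by time is what turns the subgraph into a readable narrative.
    return sorted(seen_events, key=lambda ev: ev.time)
```

The expensive part in practice is not this loop but producing the `events` list in the first place, which is the by-hand join across log sources described above.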
Three different tools for three different jobs. The mistake most teams make is either treating them as substitutes (picking one and skipping the others) or treating them as duplicates (turning all three on and not knowing which console to open). They’re complements with a pipeline shape.
What we’ll filter on
- Layer of abstraction: packets, behaviour patterns, or entity relationships?
- Data it produces: raw records, findings, or a traversable graph?
- Latency to signal: how fast does the answer arrive?
- Retention and cost shape: how long and how much?
- Primary investigative question: what question does the service exist to answer?
The detection landscape
1. VPC Flow Logs. The primary-source tape of network metadata. For every flow on every ENI we care about, Flow Logs records srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start and end, action (ACCEPT or REJECT), and log-status. Delivery is to CloudWatch Logs, S3, or Firehose; the S3 sink with Parquet format is the cheap option once volume grows. There is no analysis: these are raw records. Question answered: “did this flow happen?”
2. GuardDuty. A continuous threat-detection service that consumes VPC Flow Logs, DNS query logs, CloudTrail management and S3 data events, EKS audit logs, RDS login events, Lambda network activity, and Malware Protection scans, and emits findings when activity matches known-bad patterns or ML-derived anomalies. Findings carry a type (UnauthorizedAccess:EC2/…), a severity (1.0 to 10.0, banded as Low, Medium, High, and Critical), a resource, an actor, and a window. Single-click enable per account and Region; organisation-wide via a delegated administrator. Question answered: “does anything in these logs look like an attack?”
3. Detective. A managed investigation graph. Continuously ingests Flow Logs, CloudTrail, GuardDuty findings, EKS audit logs, and Security Lake data; builds a behavioural graph linking entities (IPs, instances, users, roles, accounts) to activity over a rolling 365-day window; exposes it as a console of linked profiles and time-series panels. Question answered: “given this finding, what else did the actor touch, when, and how does that compare to their baseline?”
4. Athena over Flow Logs in S3. The DIY version of Detective’s first half. Point Athena at Parquet-formatted Flow Logs in S3 and a skilled engineer can answer most of Detective’s questions with SQL. Cheaper, more flexible, no baselining, no cross-source joins out of the box. Useful as a complement or a fallback; not a replacement for the pattern-matching and graphing that Detective performs automatically.
5. Security Lake. The org-wide normalised log repository (OCSF format in S3). It is not a detection tool itself but the substrate that Detective, Athena, and third-party SIEMs all read from. Centralising logs here is usually the correct move before scaling detection org-wide.
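As a concrete view of the raw records in item 1, here is a small sketch that parses one line of the default (version 2) Flow Logs text format into typed fields. The function name and the choice to return a plain dict are assumptions for illustration; records delivered in a custom format or as Parquet would need a different reader.

```python
# Field order of the default (version 2) flow log record.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
INT_FIELDS = {"srcport", "dstport", "protocol", "packets", "bytes", "start", "end"}


def parse_flow_record(line):
    """Parse one default-format record; return None when it carries no flow."""
    parts = line.split()
    if len(parts) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(parts)}")
    raw = dict(zip(DEFAULT_FIELDS, parts))
    # NODATA and SKIPDATA records use "-" in place of the flow fields.
    if raw["log_status"] != "OK":
        return None
    return {
        name: int(value) if name in INT_FIELDS else value
        for name, value in raw.items()
    }
```

Everything the later stages do (detection, graphing, SQL) is some aggregation over records of this shape.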
Side by side
| Service | Layer | Output | Latency | Retention shape | Primary question |
|---|---|---|---|---|---|
| VPC Flow Logs | Packet metadata | Raw records | ~10 min aggregation | Pay-per-GB S3, indefinite | Did this flow happen? |
| GuardDuty | Behaviour patterns | Findings | Minutes | 90 days in console; indefinite once archived from EventBridge to S3 | Does anything look bad? |
| Detective | Entity relationships | Graph + timelines | ~hours to build profile | 365-day rolling window | What else did this actor do? |
| Athena over Flow Logs | Packet metadata | Query results | Seconds per query | As long as S3 holds the data | Did this flow happen? (DIY) |
| Security Lake | Normalised log store | OCSF objects in S3 | Minutes | Lifecycle-configurable | Where are all my logs? |
Reading the table by investigative stage:
- Before the incident. Flow Logs record everything, continuously, for later. GuardDuty watches continuously for patterns. Detective builds baselines so “unusual for this role” is a meaningful phrase when a finding fires.
- At the moment of the finding. GuardDuty tells us something looks bad, with a type and severity that shape first response.
- During investigation. Detective gives the narrative (what else did this actor touch, when, is it unusual); Flow Logs give the forensic detail (the actual packet counts); Athena fills gaps where Detective’s 365-day window isn’t enough.
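Because the finding type shapes first response, it helps to split it into its parts before routing. The sketch below assumes the documented GuardDuty naming layout of `ThreatPurpose:ResourceTypeAffected/ThreatFamilyName.DetectionMechanism!Artifact`, where the last two parts are optional; the `FindingType` tuple and function name are illustrative.

```python
from typing import NamedTuple, Optional


class FindingType(NamedTuple):
    purpose: str              # e.g. "UnauthorizedAccess"
    resource: str             # e.g. "EC2" or "IAMUser"
    family: str               # e.g. "MaliciousIPCaller"
    mechanism: Optional[str]  # e.g. "Custom"; None when absent
    artifact: Optional[str]   # e.g. "DNS"; None when absent


def parse_finding_type(value):
    """Split a GuardDuty finding type string into its named parts."""
    purpose, colon, rest = value.partition(":")
    resource, slash, tail = rest.partition("/")
    if not (colon and slash and purpose and resource and tail):
        raise ValueError(f"not a finding type: {value!r}")
    tail, _, artifact = tail.partition("!")
    family, _, mechanism = tail.partition(".")
    return FindingType(purpose, resource, family, mechanism or None, artifact or None)
```

For the three overnight findings, this gives two EC2 findings and one IAMUser finding, which already suggests opening the instance profile and the role profile separately.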
The investigation funnel
The picks in depth
VPC Flow Logs. Enable at the VPC level (every ENI in the VPC emits), deliver to S3 in Parquet format partitioned by year/month/day/hour, set a lifecycle rule that transitions to Glacier Deep Archive after 90 days, and retain for the full compliance window. A custom log format is worth the small effort: the default fields are fine for most queries, but adding tcp-flags, pkt-srcaddr, pkt-dstaddr, traffic-path, and flow-direction catches the cases where NAT or Gateway Load Balancer rewrites the addresses. Flow Logs don’t capture payloads and don’t see traffic that never hits an ENI (intra-host loopback, some managed-service internal traffic). For payload capture, Traffic Mirroring is the adjacent tool; it’s far more expensive and usually reserved for targeted investigations, not blanket coverage.
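The configuration described above can be sketched as two plain dictionaries: one shaped like the arguments to the EC2 `CreateFlowLogs` API, one shaped like an S3 lifecycle configuration. The helper names, the rule ID, and the bucket and VPC identifiers used with them are assumptions; the functions only build the request bodies and call nothing.

```python
DEFAULT_FIELDS = (
    "version", "account-id", "interface-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log-status",
)
EXTRA_FIELDS = ("tcp-flags", "pkt-srcaddr", "pkt-dstaddr", "traffic-path", "flow-direction")


def flow_log_request(vpc_id, bucket_arn):
    """Arguments for a VPC-level flow log delivered to S3 as hourly Parquet."""
    log_format = " ".join("${%s}" % name for name in DEFAULT_FIELDS + EXTRA_FIELDS)
    return {
        "ResourceType": "VPC",
        "ResourceIds": [vpc_id],
        "TrafficType": "ALL",
        "LogDestinationType": "s3",
        "LogDestination": bucket_arn,
        "LogFormat": log_format,
        "DestinationOptions": {
            "FileFormat": "parquet",
            "HiveCompatiblePartitions": True,
            "PerHourPartition": True,
        },
    }


def lifecycle_configuration(archive_after_days=90, expire_after_days=None):
    """Lifecycle rule: move to Deep Archive, optionally expire after retention."""
    rule = {
        "ID": "flow-logs-archive",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": ""},   # empty prefix applies the rule bucket-wide
        "Transitions": [{"Days": archive_after_days, "StorageClass": "DEEP_ARCHIVE"}],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


# With boto3 installed and credentials configured, these would be passed as:
#   boto3.client("ec2").create_flow_logs(**flow_log_request(vpc_id, bucket_arn))
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket=bucket_name, LifecycleConfiguration=lifecycle_configuration())
```

Leaving `expire_after_days` unset keeps objects indefinitely; whether and when to expire is a compliance decision, not something this sketch can choose.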
GuardDuty. Enable in every account, every Region, with a single delegated administrator account aggregating findings org-wide. Paid protection plans sit on top of the base detector; three matter most here: Malware Protection for EC2 scans EBS volumes on suspicious findings; Malware Protection for S3 scans uploaded objects; Runtime Monitoring drops an eBPF agent on EKS, ECS-on-Fargate, and EC2 for in-guest signals. Findings stream to EventBridge as they fire, so a Lambda subscriber can auto-remediate low-severity hits (quarantine via a security-group swap, stop the instance, revoke the exposed credential) and page out on critical ones. The 90-day console retention is cosmetic, but EventBridge is a delivery stream rather than a store: an archive bucket of finding events, fed from that stream, is the durable record.
Detective. Enable in the same delegated-admin account as GuardDuty; the graph is organisation-wide from a single place. Ingestion costs are a function of the volume of source logs, not the number of accounts, so consolidating Flow Logs ingest onto a VPC Endpoint set and compressing Parquet upstream pays off quickly. The 365-day rolling window means older activity rolls off; if the investigation needs a date beyond that, Athena over the Flow Logs archive is the fallback, with the caveat that objects already transitioned to Glacier Deep Archive have to be restored before Athena can read them. The useful Detective surface for most investigations is three entity profiles linked together: the IP the finding points at, the resource that talked to it, and the principal that held the credentials.
A worked incident trace
03:14 UTC. GuardDuty fires UnauthorizedAccess:EC2/MaliciousIPCaller.Custom on i-0abc, severity 5. EventBridge rule routes to SNS which pages the on-call. The finding page in the GuardDuty console shows the threatened resource, the connection, the severity, and the ThreatIntel source that matched the destination IP.
03:16. The on-call clicks “Investigate in Detective” on the finding. Detective’s profile for i-0abc opens with a fourteen-day activity panel. Immediately visible: the instance’s outbound volume to the malicious IP started at 02:58, peaked at 4.2 MB/s, still going. Panel two: “accounts this instance has received API calls from” shows the attached IAM role was assumed four times in the last hour from an IP in a second Region. Panel three, the baseline: this instance normally speaks to three internal endpoints and an S3 VPC endpoint. The malicious-IP traffic is wildly unusual for this entity.
03:19. The on-call opens Detective’s profile for the IAM role. The role’s “AWS API calls volume” panel shows a step-change at 02:54 from ~80 calls/hour to ~2000 calls/hour. The API call type histogram is dominated by s3:ListBuckets, s3:GetBucketAcl, and iam:ListRolePolicies: classic discovery. “New user-agents seen” shows Boto3/1.34.0 Python/3.11.5 appearing for the first time ever.
03:22. The on-call needs to know which buckets the role touched and what it took. Detective’s timeline has it; a confirming Athena query over the CloudTrail archive has the full request parameter set. s3:GetObject on three buckets, 47 objects total, all in the payments-archive bucket. Revoke decision is now evidence-based.
03:24. Back to Flow Logs: the on-call wants to know whether the malicious IP has also been reached by any other instance in the VPC. An Athena query over the S3 archive, filtered by dstaddr = <malicious-ip> and the last 30 days, returns two other instances that briefly reached it three days ago. Those instances are in the same ASG; the implication is that the compromise is not scoped to i-0abc.
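A query of the kind used at 03:24 might look like the sketch below, which builds the SQL text rather than running it. The table name `vpc_flow_logs` is hypothetical, and the column names assume the Parquet layout in which hyphens in field names become underscores. It filters on the record's `start` time only; a production query would also constrain the partition columns so Athena scans less data.

```python
import ipaddress


def flows_to_destination_sql(dst_ip, days=30, table="vpc_flow_logs"):
    """Build an Athena query listing every ENI that sent traffic to dst_ip."""
    # Validate inputs before interpolating them into SQL text.
    ip = ipaddress.ip_address(dst_ip)  # raises ValueError if not an IP address
    if not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    if not table.replace("_", "").isalnum():
        raise ValueError("table must be a simple identifier")
    return (
        "SELECT interface_id, srcaddr, action,\n"
        "       count(*) AS flow_records,\n"
        "       sum(bytes) AS total_bytes,\n"
        '       from_unixtime(min("start")) AS first_seen,\n'
        '       from_unixtime(max("end")) AS last_seen\n'
        f"FROM {table}\n"
        f"WHERE dstaddr = '{ip}'\n"
        f"  AND \"start\" >= to_unixtime(current_timestamp - interval '{days}' day)\n"
        "GROUP BY interface_id, srcaddr, action\n"
        "ORDER BY first_seen"
    )
```

Grouping by `interface_id` is what surfaces the sibling instances: each extra ENI in the result is another resource to open in Detective.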
03:31. Containment. Quarantine security group applied to the three instances (egress only to a forensics endpoint), IAM role’s session credentials revoked via an explicit aws iam put-role-policy that denies everything with a condition on aws:TokenIssueTime, KMS keys used by those instances rotated. GuardDuty still firing? Yes. New Backdoor:EC2/C&CActivity.B finding on one of the sibling instances confirms the scope is correct.
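The session revocation in the containment step comes down to one inline policy document. The sketch below builds it; the function name is an assumption, and the role and policy names in the comment are placeholders. Note that it denies only sessions issued before the cutoff, so credentials issued afterwards keep working.

```python
import json
from datetime import datetime, timezone


def revoke_sessions_policy(issued_before):
    """Deny-all policy for any session token issued before `issued_before`."""
    if issued_before.tzinfo is None:
        raise ValueError("issued_before must be timezone-aware")
    cutoff = issued_before.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": cutoff}},
            }
        ],
    }


def policy_document(issued_before):
    """Serialise the policy for use as a --policy-document argument."""
    return json.dumps(revoke_sessions_policy(issued_before), indent=2)


# The resulting JSON would be attached with something like:
#   aws iam put-role-policy --role-name <role> --policy-name <name> \
#       --policy-document file://revoke.json
```

Recording the cutoff timestamp in the incident notes matters, because it is the boundary between sessions that were killed and sessions that were not.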
Seventeen minutes from finding to containment, with three consoles open. Without Detective, the same investigation is a senior engineer joining Flow Logs, CloudTrail, and GuardDuty by hand in Athena for two hours; the answers are the same, the delivery is not.
A worked org rollout
The acquired-subsidiary estate has twelve accounts, four Regions. One-time setup:
- In the Organizations management account, designate a security-audit account as the delegated administrator for both GuardDuty and Detective. Same account so the views align.
- From the delegated admin, enable GuardDuty across all accounts in all Regions; turn on Malware Protection for EC2 (EBS volume scans) and Runtime Monitoring for the production EKS clusters; route findings to the audit account’s EventBridge bus.
- Enable Detective from the same delegated admin, scoped to the same accounts. Detective pulls its feeds from GuardDuty’s data-source configuration, no duplicate plumbing.
- In every VPC, enable Flow Logs to an S3 bucket in the audit account (cross-account delivery), Parquet format, hourly partitions, lifecycle to Glacier Deep Archive at 90 days, retention aligned to the 7-year compliance floor.
- CloudTrail: org trail from the management account, multi-Region, log-file validation enabled, delivering to the audit bucket. Detective and GuardDuty both consume CloudTrail; the org trail means they see every account without per-account config.
- EventBridge rules in the audit account: HIGH and CRITICAL GuardDuty findings page on-call via SNS; MEDIUM write to a ticket queue; LOW archive to S3 for weekly review. A second rule on GuardDuty Finding events writes every finding to an archive S3 bucket via Firehose, guaranteeing the record survives the 90-day console retention.
From that point forward, every new account the org creates inherits the configuration via the delegated-admin pattern; no per-account enable ceremony.
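The severity routing in the EventBridge step can be sketched as follows. The band boundaries assume GuardDuty's documented ranges (Low 1.0 to 3.9, Medium 4.0 to 6.9, High 7.0 to 8.9, Critical 9.0 to 10.0); the route names and function names are placeholders rather than anything AWS defines.

```python
def route_for_severity(severity):
    """Map a GuardDuty severity score to a handling route."""
    if not 1.0 <= severity <= 10.0:
        raise ValueError(f"severity out of range: {severity}")
    if severity >= 7.0:
        return "page"     # HIGH and CRITICAL: wake the on-call
    if severity >= 4.0:
        return "ticket"   # MEDIUM: ticket queue
    return "archive"      # LOW: weekly review


def finding_pattern(min_severity=None):
    """EventBridge event pattern for GuardDuty findings.

    With no threshold the pattern matches every finding, which suits the
    archive rule; with a threshold it matches only findings at or above it.
    """
    pattern = {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
    }
    if min_severity is not None:
        pattern["detail"] = {"severity": [{"numeric": [">=", min_severity]}]}
    return pattern
```

Doing the threshold in the event pattern keeps the paging rule from invoking anything for low-severity findings, while the unfiltered pattern feeds the archive.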
What’s worth remembering
- Three services, three jobs. Flow Logs record, GuardDuty detects, Detective investigates. Treating them as substitutes (picking one and skipping the others) leaves a hole; treating them as duplicates wastes money and attention.
- GuardDuty’s own protection plans are extras. The base detector reads Flow Logs, DNS query logs, and CloudTrail management events. S3 Protection (S3 data events), EKS Protection (audit logs), RDS Protection (login events), Lambda Protection (network activity), Malware Protection (EBS volume scan, S3 object scan), and Runtime Monitoring (eBPF agent) are billable add-ons with their own value.
- GuardDuty findings fire on EventBridge, not SNS. The 90-day console retention is cosmetic; EventBridge is where auto-remediation Lambdas and SIEM forwarders subscribe, and the durable record is the S3 archive fed from that stream.
- Detective’s window is 365 days, rolling. Investigations that need older history fall back to Athena over the Flow Logs archive.
- Detective consumes GuardDuty findings directly. Enabling both from the same delegated admin account means “Investigate in Detective” is one click from any finding, with the entity graph preloaded.
- VPC Flow Logs don’t capture payloads. For in-packet inspection use Traffic Mirroring on selected ENIs. For managed-service internal traffic the agent-layer logs (CloudTrail data events on S3, RDS audit logs) fill the gaps.
- Parquet in S3 is the sane Flow Logs delivery. Partition by hour, lifecycle to Glacier Deep Archive, query with Athena. Plain-text records to CloudWatch Logs get expensive fast at VPC scale.
- Delegated administrator is the org-wide enable lever. One security-audit account, one enable step per service, all member accounts covered automatically.
Flow Logs give us the tape. GuardDuty gives us the alarm. Detective gives us the story. At 03:14, when the finding lands and the chat channel wakes up, knowing which one to open first is half the job.