CloudTrail Lake for Long-Term Audit Queries

November 13, 2028 · 15 min read

The situation

A long-running account has CloudTrail configured the way most accounts do: an organisation trail written by the management account, multi-Region, log-file integrity enabled, delivering gzipped JSON to an S3 bucket in the security-audit account. Object lifecycle: S3 Standard for the first 90 days, Standard-IA until day 180, then Glacier Deep Archive indefinitely. Retention is currently 7.2 years. Three workloads now pull on that bucket:

Compliance investigations. Auditors want point-in-time queries spanning months to years: “all iam:PutRolePolicy on the production payer account between 2023-06-01 and 2023-08-31.” Today, an engineer downloads the relevant S3 prefixes, unzips, and greps, which takes hours and requires an analyst who knows the schema.
SIEM ingest. A third-party SIEM subscribes to the same trail via EventBridge and S3 notifications, ingesting every event for threat-hunting. The SIEM bill is now $11k/month and the trend is linear with API volume, which is not linear with business value.
Incident response. At 02:17 when a GuardDuty finding fires, the on-call needs recent CloudTrail activity for a specific principal, covering anywhere from the last 40 minutes to the last 3 days. Today the answer is Athena over the S3 bucket, which works but takes 40-90 seconds per query and requires per-engineer setup.

Three jobs. Same data. Different needs. Is one CloudTrail setup serving all three the right architecture, or is this a job for CloudTrail Lake?

What actually matters

CloudTrail’s base abstraction is “a log of every API call made in the account.” What changes across the three use cases is the access pattern (how old, how frequent, how selective, how fast), and a storage format that optimises for one access pattern penalises the others.

The audit workload wants very wide and very deep. Seven years of data, arbitrary queries, happy to wait minutes per query in exchange for “just works.” The volume is too big to keep hot, but it’s queried rarely enough that cold storage with on-demand rehydration is fine. What the audit workload can’t tolerate is having to tell an engineer “that year is in Glacier Deep Archive, you’ll need to wait 12 hours for restore.”

The SIEM workload wants every event streamed in near-real-time, in a normalised schema. The cost driver is ingest volume, so anything that reduces the volume sent to the SIEM (without losing signal) pays back. The SIEM itself is often the most expensive part of the security stack; handing it unfiltered CloudTrail is paying for cleanup work the SIEM doesn’t need to do.

The incident-response workload wants fast queries on recent data, infrequent but critical. At 02:17, a 40-second Athena query is acceptable; a 15-minute Glacier restore is not. The queries are selective (by principal, by event name, by resource) over a short time window (last hour, last day, last week). The volume retrieved per query is small, but the time-to-first-byte matters a lot.

The underlying tension: a single storage format that suits one access pattern penalises the others. A raw archive partitioned by date supports rare deep queries but requires per-query setup; a streaming feed to an external system pays per-event for volume that’s mostly noise; a cold tier optimised for retention loses interactive access. No single configuration serves all three, so the question is whether the workloads fan out from one source or whether a separate managed query layer earns its place alongside the archive.

What we’ll filter on

Retention: how far back can we query natively?
Query experience: SQL, API, or grep?
Latency to query: seconds, minutes, or hours?
Cost shape: per-event ingest, per-GB query, per-event recording?
Integrations: does it land where the SIEM, the auditor, and the on-call all need it?

The CloudTrail landscape

1. Management events (free, always on, 90-day event history). Every AWS account gets 90 days of management-event history for free, queryable via aws cloudtrail lookup-events or the Event History console. Not configurable, not exportable, not integrated with SIEMs, no data events, no S3 storage. This is the “I just need to see who changed the security group” interface.

2. Trails, the classic. A CloudTrail trail streams events to an S3 bucket (and optionally CloudWatch Logs and/or EventBridge). Multi-Region and organisation-wide trails cover the full estate with one configuration. Storage format: gzipped JSON, partitioned by AWSLogs/<account>/CloudTrail/<region>/<YYYY>/<MM>/<DD>/. Per-GB S3 storage costs, plus per-event recording charges for data events (management events are free on the first trail per account). Queryable by Athena if you stand up the table and partitions.
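The date-partitioned layout is what makes "download the relevant prefixes" tractable at all. A minimal sketch of expanding an audit window into per-day key prefixes, assuming the path shape quoted above; the Region is an illustrative value, and organisation trails may add an organisation-ID segment after AWSLogs/, so check a real key before relying on this.

```python
from datetime import date, timedelta


def trail_prefixes(account_id: str, region: str, start: date, end: date) -> list:
    """Return one S3 key prefix per day in [start, end], inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    day_count = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(day_count))
    # Layout assumed from the post: AWSLogs/<account>/CloudTrail/<region>/<YYYY>/<MM>/<DD>/
    return [f"AWSLogs/{account_id}/CloudTrail/{region}/{d:%Y/%m/%d}/" for d in days]


# The auditor's window from this post; "us-east-1" is an assumed Region.
prefixes = trail_prefixes("555555555555", "us-east-1", date(2023, 6, 1), date(2023, 8, 31))
print(len(prefixes), prefixes[0])
```

Three months in one Region is already 92 prefixes per account, which is most of the reason the grep workflow takes hours.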

3. CloudTrail Lake. A managed event-data-store (EDS) with retention from 7 days up to 10 years, per-event ingest pricing (~$2.50 per million management events, data-events priced separately), and a built-in SQL query engine priced per GB scanned ($0.005 per GB). Create an EDS, point it at your trails or directly at CloudTrail sources, and query via the console or the aws cloudtrail start-query API. Supports organisation-wide EDSes from the Organizations management account, multi-Region event collection, and cross-source ingestion (EventBridge events, AWS Audit Manager findings, Config configuration items) into a single queryable store.

4. Athena over S3 trails. The DIY Lake. Point Athena at the trail’s S3 bucket, run CREATE EXTERNAL TABLE with the documented CloudTrail schema and partitioning, and query with SQL. It works, and costs ~$5/TB scanned plus S3 storage. It requires maintenance: partition projection to avoid manual partition adds, a Glue table definition, and per-account and per-Region filter hygiene.
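Partition projection is the piece of that maintenance most worth getting right. A hedged sketch of the table properties for a date-projected partition column, following the shape of AWS's documented CloudTrail example; the bucket name, Region, start date, and the partition column name `timestamp` are assumptions for illustration, and the full column list of the CREATE EXTERNAL TABLE statement is deliberately left out.

```python
def projection_properties(bucket: str, account_id: str, region: str, first_day: str) -> dict:
    """Table properties for a date-projected partition column named `timestamp`."""
    return {
        "projection.enabled": "true",
        "projection.timestamp.type": "date",
        "projection.timestamp.format": "yyyy/MM/dd",
        "projection.timestamp.range": f"{first_day},NOW",
        "projection.timestamp.interval": "1",
        "projection.timestamp.interval.unit": "DAYS",
        # ${timestamp} is filled in by Athena at query time, not by Python.
        "storage.location.template": (
            f"s3://{bucket}/AWSLogs/{account_id}/CloudTrail/{region}/${{timestamp}}"
        ),
    }


def render_tblproperties(props: dict) -> str:
    """Render the properties as the TBLPROPERTIES clause of a DDL statement."""
    body = ",\n".join(f"  '{key}'='{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n{body}\n)"


# Hypothetical bucket name and start date.
props = projection_properties("example-audit-trail-bucket", "555555555555", "us-east-1", "2021/01/01")
print(render_tblproperties(props))
```

One table per account and Region is the simplest form of this; covering the whole organisation means either more tables or additional projected columns, which is exactly the filter hygiene mentioned above.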

5. EventBridge + SIEM ingest. The trail emits events to EventBridge; a rule forwards them (directly or via Firehose) to the SIEM. Real-time, billed by the SIEM’s ingest pricing (Splunk, Sumo, Datadog, etc.). The right surface for alert-level correlations; wrong surface for seven-year retention because SIEM storage costs dwarf S3.

6. CloudWatch Logs. The trail can deliver to CloudWatch Logs in addition to S3. CloudWatch Logs is good for short-retention operational queries (Logs Insights, metric filters on CloudTrail events, alarms on specific API calls) but expensive at scale; 30-90 days retention with archive-to-S3 via subscription is the typical shape.

Side by side

Option	Retention	Query	Latency	Cost shape	SIEM fit
Event history (free)	90 days	Console / API	Seconds	$0	No
Trail → S3	Lifecycle-controlled (indefinite)	Athena (DIY)	Tens of seconds after setup; hours if the data must be restored from Glacier	$/GB storage + $/event data events	Via EventBridge
CloudTrail Lake	7 days – 10 years	Built-in SQL	Seconds (minutes for multi-year scans)	$/event ingest + $/GB scanned	Cross-source ingestion
Athena over S3	Matches trail	Athena	Tens of seconds; hours if the data must be restored from Glacier	$/GB scanned + S3	No
EventBridge → SIEM	SIEM-dependent	SIEM UI	Seconds	SIEM ingest	Yes, primary
CloudWatch Logs	1d-∞ (costly)	Logs Insights	Seconds	$/GB ingest + storage	Via subscription

Reading the table by workload:

Compliance, seven-year retention. Lake at 10-year retention, or a trail to S3 with lifecycle. Lake wins on query ergonomics and “one query covers every year.” S3 wins on raw-storage cost if the data is genuinely cold and almost never queried.
SIEM real-time. EventBridge forwarding, not Lake. Lake’s query latency is seconds, not the milliseconds SIEMs want for streaming.
Incident response, last hour to last week. Lake (short-retention EDS) or Athena over S3. Lake if the investment in a managed store is worth it; Athena if S3 is already there.

The three-job pipeline

Three consumers fan out from a single org trail, and each gets the format that matches its access pattern.

The picks in depth

Keep the organisation trail to S3 as the system of record. The raw event is the source of truth, and S3 with Object Lock in compliance mode is the canonical write-once-read-many store. Lifecycle rules move cold data to Glacier Deep Archive after 180 days, and retention runs the full 7 years (or whatever the auditor requires). This branch doesn’t change. Don’t try to use Lake as the only store; if the EDS is deleted, the events go with it, and compliance teams want the raw immutable log on S3 regardless of what sits on top.
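For reference, the tiering on the archive bucket can be written down as a lifecycle configuration. This is a sketch, not the bucket's actual policy: the rule ID and prefix are made up, Standard-IA at day 90 is my reading of the schedule described earlier, and expiry is left off unless a day count is passed in.

```python
from typing import Optional


def archive_lifecycle(prefix: str = "AWSLogs/", expire_after_days: Optional[int] = None) -> dict:
    """Build an S3 lifecycle configuration for the CloudTrail archive bucket."""
    rule = {
        "ID": "cloudtrail-archive-tiering",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": 90, "StorageClass": "STANDARD_IA"},    # assumed from the post's schedule
            {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},  # "after 180 days" per the post
        ],
    }
    if expire_after_days is not None:
        # Only set an expiry once the auditor has confirmed the retention period.
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


config = archive_lifecycle()
print(config["Rules"][0]["Transitions"])
```

The returned dict has the shape boto3's put_bucket_lifecycle_configuration takes as its LifecycleConfiguration argument. Object Lock retention is configured separately and is not covered here.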

Add CloudTrail Lake for query. Create an organisation event-data-store from the management account, 10-year retention, multi-Region, ingesting from CloudTrail management events and the data-event selectors we care about (S3 object-level on the compliance buckets, Lambda invoke on the payments functions, DynamoDB data events on the customer-data tables). Per-event ingest is ~$2.50/million management events, and data events are also priced per event; at the organisation’s volume this lands around $1,800/month for the EDS, against the $11k SIEM bill it partly displaces. Query via aws cloudtrail start-query --query-statement "SELECT …", naming the event data store by its ID in the FROM clause; results come back in seconds for most queries, minutes for seven-year scans. Queries are charged at $0.005/GB scanned; a tight query over a bounded time range costs pennies, a full-table scan costs more but rarely happens.
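The cost shape is easy to sanity-check with back-of-envelope arithmetic. The sketch below uses only the rates quoted in this post as defaults; the data-event rate is not quoted here, so it is a required argument, and the event volumes in the example are invented for illustration rather than taken from the organisation's bill.

```python
def lake_ingest_usd(mgmt_events: int, data_events: int, data_rate_per_million: float,
                    mgmt_rate_per_million: float = 2.50) -> float:
    """Monthly ingest estimate from event counts and per-million rates."""
    mgmt_cost = (mgmt_events / 1_000_000) * mgmt_rate_per_million
    data_cost = (data_events / 1_000_000) * data_rate_per_million
    return mgmt_cost + data_cost


def lake_scan_usd(gb_scanned: float, rate_per_gb: float = 0.005) -> float:
    """Query cost estimate from the volume a query scans."""
    return gb_scanned * rate_per_gb


# Hypothetical month: 400 million management events, data events left out.
ingest = lake_ingest_usd(400_000_000, 0, data_rate_per_million=0.0)
# A bounded query scanning 80 GB versus an unbounded one scanning 10 TB.
tight = lake_scan_usd(80)
full = lake_scan_usd(10_000)
print(round(ingest, 2), round(tight, 2), round(full, 2))
```

The gap between the tight and unbounded scan is the argument for always bounding eventTime in the WHERE clause.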

Filter the SIEM forwarder. The current SIEM is getting every CloudTrail event, most of which are never looked at. An EventBridge rule pattern selecting only high-signal events (AssumeRole events across account boundaries, IAM policy changes, KMS key changes, S3 bucket policy changes, network-security-group changes, ConsoleLogin failures, GetSecretValue on sensitive secrets) cuts the feed by about 80% without losing the events the SIEM actually correlates on. The SIEM bill drops from $11k to ~$4k; the events the SIEM no longer sees are still in Lake and S3 if they’re ever needed.
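A filter like this is an EventBridge event pattern, which is just JSON. The sketch below covers the policy-change part of the list; the specific event names are my illustrative selection, not the post's production rule. Cross-account AssumeRole, ConsoleLogin failures, and per-secret GetSecretValue need extra conditions (and console sign-ins arrive under a different detail-type), so they are left out. The matches helper implements only exact-value matching, a small subset of EventBridge semantics, for checking sample events locally.

```python
import json

# Illustrative selection of high-signal API calls; extend to taste.
HIGH_SIGNAL_PATTERN = {
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventName": [
            "PutRolePolicy", "AttachRolePolicy", "DeleteRolePolicy", "DetachRolePolicy",
            "PutKeyPolicy", "ScheduleKeyDeletion", "DisableKey",
            "PutBucketPolicy", "DeleteBucketPolicy",
            "AuthorizeSecurityGroupIngress", "AuthorizeSecurityGroupEgress",
            "RevokeSecurityGroupIngress", "RevokeSecurityGroupEgress",
        ],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Exact-match subset: a list means 'any of these values', a dict recurses."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def sample_event(event_name: str) -> dict:
    """Minimal stand-in for a CloudTrail event as delivered by EventBridge."""
    return {"detail-type": "AWS API Call via CloudTrail", "detail": {"eventName": event_name}}


pattern_json = json.dumps(HIGH_SIGNAL_PATTERN)  # the string form a rule definition takes
print(matches(HIGH_SIGNAL_PATTERN, sample_event("PutRolePolicy")))
print(matches(HIGH_SIGNAL_PATTERN, sample_event("DescribeInstances")))
```

Running a day of sampled events through a local check like this gives a rough estimate of the volume reduction before the rule is changed in production.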

A worked compliance query

Auditor request: “Every iam:PutRolePolicy call against a role in the production payer account (555555555555) between 2023-06-01 and 2023-08-31, grouped by caller principal and target role, with a count per combination.”

With just Athena over S3:

Restore the relevant month prefixes from Glacier (12-hour wait on first query of cold data).
Write a CREATE TABLE statement or rely on Glue crawler output.
Athena query: ~2 minutes for 3 months of data, ~$0.40 in scan cost.
Total time to answer: ~13 hours first time; ~5 minutes for a repeat query.

With Lake:

Lake already has all 7 years hot.
aws cloudtrail start-query --query-statement "SELECT userIdentity.principalId, element_at(requestParameters, 'roleName') AS target_role, COUNT(*) FROM <eds-uuid> WHERE eventName = 'PutRolePolicy' AND eventTime BETWEEN '2023-06-01 00:00:00' AND '2023-08-31 23:59:59' AND recipientAccountId = '555555555555' GROUP BY 1, 2"
Returns in ~35 seconds for 3 months of data.
Total time: ~1 minute, first time, including writing the query.

The 13-hour-to-1-minute difference is not the normal case; it’s only that dramatic when the target data is in Glacier. But “rehydrate from Glacier” is the case the auditor’s request lands in often enough that Lake pays for itself on query frequency alone.
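Since auditors tend to ask the same question with different accounts and dates, the query is worth templating. A minimal sketch that builds the statement with basic input validation; the function name and the example event data store ID are hypothetical, and the 'YYYY-MM-DD HH:MM:SS' literal format for eventTime is an assumption based on AWS's sample queries.

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed literal format for eventTime comparisons
UUID_RE = r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}"


def put_role_policy_query(eds_id: str, account_id: str, start: datetime, end: datetime) -> str:
    """Build the PutRolePolicy audit query for one account and time window."""
    # Validate anything interpolated into SQL; the statement is not parameterised here.
    if not re.fullmatch(UUID_RE, eds_id):
        raise ValueError("eds_id should be the event data store UUID")
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("account_id should be 12 digits")
    if end < start:
        raise ValueError("end must not be before start")
    return (
        "SELECT userIdentity.principalId, "
        "element_at(requestParameters, 'roleName') AS target_role, "
        "COUNT(*) AS calls "
        f"FROM {eds_id} "
        "WHERE eventName = 'PutRolePolicy' "
        f"AND eventTime BETWEEN '{start.strftime(TS_FORMAT)}' AND '{end.strftime(TS_FORMAT)}' "
        f"AND recipientAccountId = '{account_id}' "
        "GROUP BY 1, 2"
    )


sql = put_role_policy_query(
    "0a1b2c3d-1111-2222-3333-444455556666",  # hypothetical event data store ID
    "555555555555",
    datetime(2023, 6, 1, 0, 0, 0),
    datetime(2023, 8, 31, 23, 59, 59),
)
print(sql)
```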

A worked incident query

02:17 UTC. GuardDuty fires on i-0xyz. The on-call wants to know: in the last 3 hours, what API calls did the IAM role prod-payments-ec2 make?

SELECT
  eventTime,
  eventName,
  sourceIPAddress,
  awsRegion,
  errorCode,
  element_at(requestParameters, 'bucketName') AS bucket,
  element_at(requestParameters, 'key') AS object_key
FROM <eds-uuid>
WHERE userIdentity.sessionContext.sessionIssuer.userName = 'prod-payments-ec2'
  AND eventTime > date_add('hour', -3, current_timestamp)
ORDER BY eventTime DESC

Submitted via start-query; results ready in ~12 seconds; 2,400 events returned; patterns visible immediately: 98% are routine S3 PutObject to the expected bucket; 12 events are GetCallerIdentity from an unusual IP; 3 events are AssumeRole into an account the role shouldn’t touch. That’s the lead.
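For the rota, the submit-poll-fetch cycle is worth wrapping once so nobody has to remember it at 02:17. A sketch under assumptions: client is expected to be a boto3 CloudTrail client (for example boto3.client("cloudtrail")), the function name and timeout defaults are mine, and error handling is kept to the bare minimum.

```python
import time

TERMINAL_STATES = {"FINISHED", "FAILED", "CANCELLED", "TIMED_OUT"}


def run_lake_query(client, sql: str, poll_seconds: float = 2.0, timeout_seconds: float = 300.0) -> list:
    """Submit a Lake query, wait for it to finish, and return rows as dicts."""
    query_id = client.start_query(QueryStatement=sql)["QueryId"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.describe_query(QueryId=query_id)["QueryStatus"]
        if status in TERMINAL_STATES:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"query {query_id} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
    if status != "FINISHED":
        raise RuntimeError(f"query {query_id} ended with status {status}")

    rows, token = [], None
    while True:
        kwargs = {"QueryId": query_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_query_results(**kwargs)
        for row in page.get("QueryResultRows", []):
            # Each row arrives as a list of single-entry {column: value} maps.
            merged = {}
            for cell in row:
                merged.update(cell)
            rows.append(merged)
        token = page.get("NextToken")
        if not token:
            return rows
```

Taking the client as a parameter keeps the helper testable with a stub and leaves credentials and Region selection to whoever calls it.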

Same query against Athena: stand up the partitions for today, run the query, ~40 seconds, ~$0.05. Works. Lake is faster and has less setup; the difference is whether every engineer on the rota can run this query without a senior engineer’s help.

What’s worth remembering

CloudTrail has three destinations: S3 (archive), CloudWatch Logs (stream), and EventBridge (real-time). Every trail writes to S3 and can add the other two in any combination. Lake is a fourth destination, but operationally distinct.
Event history (90 days, management events only, via the console or lookup-events) is always free and always on. Useful for “who just made this change” and nothing else.
CloudTrail Lake is a managed event-data-store with up to 10-year retention and a built-in SQL engine. Priced per-event-ingested and per-GB-scanned, with org-wide collection from the management account.
Keep the S3 archive trail regardless of Lake. S3 with Object Lock is the compliance-grade system-of-record; Lake is the query layer, not the storage of last resort.
Filter the SIEM forwarder. CloudTrail volume is mostly noise for SIEM correlation purposes; an EventBridge rule selecting security-relevant events cuts ingest cost without losing signal.
Athena-over-S3 is the DIY Lake. Works, requires Glue/table maintenance and Glacier restore handling; Lake is the managed version.
Lake’s cost shape is per-event-ingest + per-GB-scanned. Selective queries are cheap; full-table scans are not. A 7-year all-events scan will be a surprise line on the bill if done without a WHERE clause.
Data events are separately enabled and separately priced. The first trail per account gets management events free; data events (S3 object-level, Lambda invoke, DynamoDB item-level) are billed per event on every trail that records them.

The three jobs don’t need one pipeline; they need the right destination for each access pattern. Trails to S3 for archive; EventBridge to the SIEM for real-time alerting; Lake for the compliance SQL and the 02:17 incident queries. The bill drops, the auditor is faster, and the on-call stops running zgrep.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.