The situation
Acme runs 40 services across three accounts. Every service writes structured JSON logs to CloudWatch Logs (one log group per service per Region), retention set between 7 and 30 days depending on compliance class. The logs are useful: engineers tail them with Logs Insights, on-call engineers grep them during incidents, and the security team queries them monthly for audit.
Three new consumers need log data routed as it’s written:
- SIEM integration. Every log line that matches "errorCode": "AccessDenied" or "errorCode": "UnauthorizedOperation" should land in Splunk within two minutes of being written, regardless of which service produced it.
- Product analytics stream. Every log line that matches "event": "order.created" in the order-service log group should land in a Kinesis Data Stream within seconds, feeding a real-time analytics dashboard.
- Incident Slack bot. Every log line containing "level": "ERROR" with a "stacktrace" field present should trigger a Lambda that formats it and posts to a Slack channel.
All three need real-time, filtered, at-source routing. None of them wants “give me the whole log group and I’ll filter on my side”; the volume is too high and the latency budgets too tight.
What actually matters
Before reaching for a mechanism, it’s worth naming the shape of the problem, because “get logs out of CloudWatch” hides several genuinely different questions whose answers want different tooling.
The first is streaming vs extraction. Streaming routes each log event through a filter at ingestion time and delivers it (or a transformed version) to a destination within seconds. Extraction pulls whole log groups into a downstream store and queries them retroactively. The three jobs in the brief are all streaming jobs; an extraction-shaped mechanism is the wrong answer regardless of how elegant the query language is.
The second is where the filter runs. A filter at the source means only matching lines leave the log group, and the bill is paid in proportion to interesting events, not total volume. A filter at the destination means everything is shipped and the consumer drops the noise, which works but pays bandwidth and ingest cost for every line nobody wanted. For high-volume sources with low-volume signals (AccessDenied, ERRORs with stack traces), the difference can easily be an order of magnitude or more on the bill.
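A back-of-the-envelope sketch makes the gap concrete. The daily volume and match rate below are hypothetical placeholders, not Acme's figures, and the function counts gigabytes shipped rather than pricing any particular service.

```python
def shipped_gb_per_day(total_gb_per_day: float, match_fraction: float,
                       filter_at_source: bool) -> float:
    """Gigabytes per day that leave the log group under each filtering strategy.

    Source-side filtering ships only the matching fraction; destination-side
    filtering ships everything and lets the consumer discard the rest.
    """
    if not 0.0 < match_fraction <= 1.0:
        raise ValueError("match_fraction must be in (0, 1]")
    if filter_at_source:
        return total_gb_per_day * match_fraction
    return total_gb_per_day


# Hypothetical workload: 500 GB/day of logs, 0.5% of lines are interesting.
TOTAL_GB_PER_DAY = 500.0
MATCH_FRACTION = 0.005

at_source = shipped_gb_per_day(TOTAL_GB_PER_DAY, MATCH_FRACTION, filter_at_source=True)
at_destination = shipped_gb_per_day(TOTAL_GB_PER_DAY, MATCH_FRACTION, filter_at_source=False)
print(f"filter at source:      {at_source:.1f} GB/day shipped")
print(f"filter at destination: {at_destination:.1f} GB/day shipped")
print(f"ratio: {at_destination / at_source:.0f}x")
```

The ratio is simply 1 / match_fraction, so the rarer the signal, the wider the gap; any per-GB charge on the shipping or ingest side scales with the larger number.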
The third is how many consumers want the same source. One source, one consumer is a single pipe. One source, three consumers is fan-out: either the source supports multiple downstream attachments natively, or one attachment carries everything to a relay where many consumers can pick up. Hitting an attachment cap in the middle of a rollout is how a streaming design gets stuck.
The fourth is cross-account vs same-account. Same-account routing is local IAM. Cross-account routing needs the receiving account to expose a resource policy that grants the sending account permission to subscribe; the mechanism either has a first-class way to express that or the integration becomes a Lambda relay.
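In CloudWatch Logs the first-class mechanism is a Destination: a resource in the receiving account that wraps a Kinesis stream or Firehose stream and carries an access policy naming the sending accounts. A minimal sketch of building that access policy as a Python dict; the account IDs and the destination name are placeholders, and the Destination itself (created with put-destination) also needs a role in the receiving account that CloudWatch Logs assumes to write to the stream, which is not shown.

```python
import json


def destination_access_policy(region: str, receiver_account: str,
                              destination_name: str, sender_accounts: list) -> dict:
    """Build the access policy attached to a CloudWatch Logs Destination.

    The policy lets the listed sender accounts create subscription filters
    that point at the Destination in the receiving account.
    """
    destination_arn = (
        f"arn:aws:logs:{region}:{receiver_account}:destination:{destination_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": list(sender_accounts)},
                "Action": "logs:PutSubscriptionFilter",
                "Resource": destination_arn,
            }
        ],
    }


# Placeholder IDs: 444455556666 owns the source log groups, 111122223333 the stream.
policy = destination_access_policy(
    "eu-west-1", "111122223333", "security-events", ["444455556666"]
)
print(json.dumps(policy, indent=2))
```

The serialized policy is what aws logs put-destination-policy takes as --access-policy; source log groups in the sending account then use the Destination ARN as their subscription filter's destination.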
The fifth is what fits at the consumer end. A Slack bot wants per-event invocation with a small batch and the right format. A SIEM wants buffered, durable delivery with retries and an error bucket. A real-time analytics dashboard wants an ordered, replayable stream. Each consumer wants a different shape of pipe, and the routing mechanism has to be honest about which shapes it serves cleanly.
And sixth, what the team doesn’t want to operate. A pipeline with a buffer, retries, format conversion, and a dead-letter queue is real work to run. If a managed delivery service slots in between the filter and the SIEM, the team buys back that work in exchange for some configuration. Whether that’s the right trade depends on the volume, the destinations, and the team’s appetite for owning yet another resilient pipe.
What we’ll filter on
- Real-time vs batch: seconds to destination, or minutes/hours?
- Pattern filtering at source: does the filter run before data leaves the log group?
- Fan-out to multiple consumers: how many destinations per source?
- Cross-account support: can logs in account A route to account B’s destination?
- Ingestion-format flexibility: JSON, plain text, CloudTrail, VPC Flow Logs?
- Operational cost shape: do we manage the pipe or does AWS?
The log-routing landscape
- Subscription filter -> Lambda. Matching log events are delivered in gzip-compressed, base64-encoded batches to a Lambda function, which CloudWatch Logs invokes asynchronously. The function sees the log group, the log stream, and the matched events. Suits per-event logic with modest throughput, like the incident Slack bot.
- Subscription filter -> Kinesis Data Stream. Matched events land as gzip-compressed records on a stream. Multiple consumers can read the same stream, and enhanced fan-out gives each consumer dedicated read throughput. Suits high-volume streaming, like the product analytics case.
- Subscription filter -> Kinesis Data Firehose (now Amazon Data Firehose). Firehose buffers matched events and delivers them to S3, Redshift, OpenSearch, Splunk, or a generic HTTP endpoint, with optional record transformation, format conversion (JSON -> Parquet or ORC, for S3 delivery), and backup of failed records to S3. Suits SIEM and archival use cases, like the Splunk integration.
- Account-level subscription filter. A single policy that applies to every log group in the account, with an optional exclusion list of log group names. One filter, hundreds of log groups. Avoids the two-filter-per-log-group constraint when the pattern is the same everywhere.
- CloudWatch Logs -> S3 export. CreateExportTask exports a log group’s contents over a time window to S3 as a one-off batch job. Log data can take up to 12 hours to become available for export, and only one export task per account can be active at a time. Archival, not streaming.
- OpenSearch / Firehose -> OpenSearch. CloudWatch Logs can stream into an OpenSearch Service domain. Useful when the destination is an existing OpenSearch cluster the team queries directly.
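The S3 export option is the one batch-shaped entry in the list, and its shape shows in the API: a fixed time window in epoch milliseconds, not a stream. A minimal sketch that exports the previous UTC day, assuming a boto3 CloudWatch Logs client is passed in; the bucket name and the task-naming scheme are hypothetical, and the bucket needs a policy allowing CloudWatch Logs to write to it.

```python
from datetime import datetime, timedelta, timezone


def previous_utc_day_ms(now: datetime) -> tuple:
    """Return (fromTime, to) in epoch milliseconds spanning the previous UTC day."""
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    start = midnight - timedelta(days=1)
    return int(start.timestamp() * 1000), int(midnight.timestamp() * 1000)


def export_previous_day(client, log_group: str, bucket: str, now: datetime) -> str:
    """Start one CreateExportTask for yesterday's events and return its task ID.

    `client` is expected to be a boto3 CloudWatch Logs client.
    """
    from_ms, to_ms = previous_utc_day_ms(now)
    response = client.create_export_task(
        taskName=f"{log_group.strip('/').replace('/', '-')}-{from_ms}",
        logGroupName=log_group,
        fromTime=from_ms,
        to=to_ms,
        destination=bucket,
        destinationPrefix=log_group.strip("/"),
    )
    return response["taskId"]
```

Called as export_previous_day(boto3.client("logs"), "/aws/ecs/payments-service", "acme-log-archive", datetime.now(timezone.utc)), it starts a job that finishes some time later; a scheduler exporting many log groups has to serialize tasks, since only one can be active per account.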
Side by side
| Option | Real-time | Source-side filtering | Fan-out | Cross-account | Format flex | Ops cost |
|---|---|---|---|---|---|---|
| Sub. filter -> Lambda | ✓ | ✓ | via Lambda code | ✓ (via Destinations) | ✓ | Medium (Lambda) |
| Sub. filter -> Kinesis Data Stream | ✓ | ✓ | ✓ (many consumers) | ✓ | ✓ | Medium (stream shards) |
| Sub. filter -> Firehose | ✓ (buffered) | ✓ | ✗ (one destination) | ✓ | ✓ (w/ transform) | Low |
| Account-level subscription filter | ✓ | ✓ | One destination | ✓ | ✓ | Low |
| S3 export | ✗ | ✗ | At the bucket layer | ✓ | Raw logs | Low |
| Direct to OpenSearch | ✓ | ✓ | Via OS indices | ✓ | ✓ | High (cluster) |
Reading by scenario:
- SIEM (Splunk, every AccessDenied or UnauthorizedOperation line). Account-level subscription filter with pattern { ($.errorCode = "AccessDenied") || ($.errorCode = "UnauthorizedOperation") }, destination a Firehose stream using Firehose’s Splunk destination type, pointed at Splunk’s HTTP Event Collector (HEC). Firehose’s buffering, retries, and S3 backup of failed records are what a SIEM integration needs.
- Product analytics (order events to Kinesis). Subscription filter on the order-service log group with pattern { $.event = "order.created" }, destination a Kinesis Data Stream. The analytics Lambda and a Firehose archive consumer both read from the stream.
- Incident bot (stack traces to Slack). Subscription filter on the relevant service log groups with pattern { ($.level = "ERROR") && ($.stacktrace = *) }, destination Lambda. The Lambda formats each event and posts it to Slack.
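Each reader of the analytics stream has to unwrap the subscription envelope before it sees an order event: the records CloudWatch Logs writes to Kinesis are gzip-compressed JSON carrying a batch of log events. A minimal sketch of a Kinesis-triggered analytics Lambda doing that unwrapping; the function names are hypothetical, and the hand-off to the dashboard is left out.

```python
import base64
import gzip
import json


def order_events_from_kinesis(event: dict) -> list:
    """Pull order.created payloads out of a Kinesis-triggered Lambda event.

    Each Kinesis record written by a subscription filter is a gzip-compressed
    JSON envelope carrying a batch of log events from one log stream.
    """
    orders = []
    for record in event.get("Records", []):
        raw = gzip.decompress(base64.b64decode(record["kinesis"]["data"]))
        envelope = json.loads(raw)
        if envelope.get("messageType") != "DATA_MESSAGE":
            continue  # CONTROL_MESSAGE records only check that the stream is reachable
        for log_event in envelope.get("logEvents", []):
            payload = json.loads(log_event["message"])  # the log line itself is JSON
            # Defensive re-check; the subscription filter already matched this.
            if payload.get("event") == "order.created":
                orders.append(payload)
    return orders


def handler(event, context):
    orders = order_events_from_kinesis(event)
    # Forward `orders` to the dashboard feed here (not shown).
    return {"orders": len(orders)}
```

The Firehose archive consumer needs no such code; it simply receives the same compressed records as another reader of the stream.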
[Figure: the three destinations, side by side]
Filter patterns in depth
CloudWatch Logs has three filter-pattern dialects. For unstructured text, the pattern is a space-separated list of terms, with phrases in double quotes: ERROR "database connection failed" matches events that contain both ERROR and that exact phrase. For space-delimited events, a bracketed list names each field by position, as in [ip, user, status, bytes]. For JSON events, the pattern uses a dollar-sign selector language: { $.errorCode = "AccessDenied" } matches events where the top-level errorCode field equals AccessDenied. JSON operators: =, !=, >, <, >=, <=, IS NULL, NOT EXISTS, plus && and ||.
Useful shapes (the // annotations are commentary, not pattern syntax):
```text
{ $.level = "ERROR" }                                // level field exactly ERROR
{ ($.level = "ERROR") && ($.stacktrace = *) }        // ERROR with a stacktrace field present
{ $.event = "order.*" }                              // event starting with "order."
{ ($.errorCode = "AccessDenied") ||                  // multiple conditions
  ($.errorCode = "UnauthorizedOperation") }
{ $.requestParameters.instanceType = "p4d.*" }       // nested field, wildcard
{ $.latency > 5000 }                                 // numeric comparison
```
Wildcards (*) match inside string values. Nested fields use dot notation. Array access via [i] (index) or [?] (any).
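To make the selector semantics concrete, here is a rough local approximation of a subset of the JSON dialect: dotted selectors, string equality with * wildcards, and numeric comparisons, combined with all (&&) or any (||). Clauses are Python tuples rather than parsed pattern text, and arrays, IS NULL, and NOT EXISTS are not modelled, so this is not CloudWatch's implementation and may disagree with it on edge cases; it is a way to reason about which sample lines a pattern ought to catch.

```python
import json
import operator
import re

_MISSING = object()
_NUMERIC_OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
                "<=": operator.le, "=": operator.eq, "!=": operator.ne}


def select(document, path):
    """Follow a dotted selector like 'requestParameters.instanceType' (no arrays)."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def clause(document, path, op, expected):
    """Approximate one '$.path op value' clause; '*' in a string is a wildcard."""
    actual = select(document, path)
    if actual is _MISSING:
        return False
    if isinstance(expected, str):
        if op not in ("=", "!=") or not isinstance(actual, str):
            return False
        regex = ".*".join(re.escape(part) for part in expected.split("*"))
        hit = re.fullmatch(regex, actual) is not None
        return hit if op == "=" else not hit
    if isinstance(actual, bool) or not isinstance(actual, (int, float)):
        return False
    return _NUMERIC_OPS[op](actual, expected)


def matches(message, clauses, combine=all):
    """Evaluate (path, op, value) clauses against one log line; non-JSON never matches."""
    try:
        document = json.loads(message)
    except json.JSONDecodeError:
        return False
    return combine(clause(document, *c) for c in clauses)
```

For example, matches(line, [("level", "=", "ERROR"), ("stacktrace", "=", "*")]) mirrors the ERROR-with-stacktrace shape, and passing combine=any mirrors the || form.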
Worth testing filters before deploying: aws logs test-metric-filter --filter-pattern '{ $.errorCode = "AccessDenied" }' --log-event-messages '{"errorCode":"AccessDenied"}' '{"errorCode":"Allowed"}' returns which events matched. Cheap to iterate.
Creating the three filters
Scenario 1, account-level filter to Firehose -> Splunk:
```shell
aws logs put-account-policy \
--policy-name splunk-security-events \
--policy-type SUBSCRIPTION_FILTER_POLICY \
--policy-document '{
"DestinationArn": "arn:aws:firehose:eu-west-1:111122223333:deliverystream/splunk-security",
"RoleArn": "arn:aws:iam::111122223333:role/CWLToFirehoseRole",
"FilterPattern": "{ ($.errorCode = \"AccessDenied\") || ($.errorCode = \"UnauthorizedOperation\") }",
"Distribution": "Random"
}' \
--scope ALL
```
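The policy applies to every log group in the account. Distribution only matters when the destination is a Kinesis Data Stream: Random spreads records evenly across shards, while ByLogStream (the default) keeps each log stream’s records together so their order is preserved. With a Firehose destination, as here, the field has no effect.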
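Hand-writing the policy document inside a single-quoted shell string means escaping the pattern's inner quotes, which is easy to get wrong. A minimal sketch of the same call through boto3, where json.dumps does the escaping; it omits Distribution because the destination is Firehose. The client is passed in, and the LogGroupName NOT IN [...] selection-criteria syntax is an assumption worth checking against the current PutAccountPolicy reference.

```python
import json

FILTER_PATTERN = (
    '{ ($.errorCode = "AccessDenied") || ($.errorCode = "UnauthorizedOperation") }'
)


def subscription_policy_document(destination_arn, role_arn, pattern):
    """Serialize a SUBSCRIPTION_FILTER_POLICY document; json.dumps escapes the quotes."""
    return json.dumps(
        {"DestinationArn": destination_arn, "RoleArn": role_arn, "FilterPattern": pattern}
    )


def put_splunk_policy(client, excluded_log_groups=()):
    """Create or replace the account-level policy. `client` is a boto3 logs client."""
    kwargs = {
        "policyName": "splunk-security-events",
        "policyType": "SUBSCRIPTION_FILTER_POLICY",
        "scope": "ALL",
        "policyDocument": subscription_policy_document(
            "arn:aws:firehose:eu-west-1:111122223333:deliverystream/splunk-security",
            "arn:aws:iam::111122223333:role/CWLToFirehoseRole",
            FILTER_PATTERN,
        ),
    }
    if excluded_log_groups:
        # Assumed syntax: log groups are carved out of the policy by exclusion.
        kwargs["selectionCriteria"] = (
            f"LogGroupName NOT IN {json.dumps(list(excluded_log_groups))}"
        )
    return client.put_account_policy(**kwargs)
```

Keeping the pattern in one constant also lets the same string feed a pre-deploy TestMetricFilter check.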
Scenario 2, service-specific filter to Kinesis:
```shell
aws logs put-subscription-filter \
--log-group-name /aws/lambda/order-service \
--filter-name order-created-to-kds \
--filter-pattern '{ $.event = "order.created" }' \
--destination-arn arn:aws:kinesis:eu-west-1:111122223333:stream/order-events \
--role-arn arn:aws:iam::111122223333:role/CWLToKinesisRole \
--distribution Random
```
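The CWLToKinesisRole referenced in this command has to exist before the filter is created. A sketch of its two policy documents, modelled on the shape the CloudWatch Logs documentation uses: a trust policy letting logs.amazonaws.com assume the role (scoped with an aws:SourceArn condition) and a permissions policy allowing kinesis:PutRecord on the stream. The account, Region, and stream come from the command above; the output would be fed to aws iam create-role --assume-role-policy-document and aws iam put-role-policy.

```python
import json

ACCOUNT = "111122223333"
REGION = "eu-west-1"
STREAM_ARN = f"arn:aws:kinesis:{REGION}:{ACCOUNT}:stream/order-events"

# Trust policy: lets the CloudWatch Logs service assume CWLToKinesisRole,
# restricted by aws:SourceArn to log groups in this account and Region.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "logs.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringLike": {"aws:SourceArn": f"arn:aws:logs:{REGION}:{ACCOUNT}:*"}},
    }],
}

# Permissions policy: writing records to this one stream is all CloudWatch Logs needs.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "kinesis:PutRecord",
        "Resource": STREAM_ARN,
    }],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy, indent=2))
```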
Scenario 3, service-specific filter to Lambda (no role needed; Lambda uses resource-based permission):
```shell
aws lambda add-permission \
--function-name incident-slack-bot \
--statement-id cwl-invoke \
--action lambda:InvokeFunction \
--principal logs.amazonaws.com \
--source-arn 'arn:aws:logs:eu-west-1:111122223333:log-group:/aws/ecs/payments-service:*'

aws logs put-subscription-filter \
--log-group-name /aws/ecs/payments-service \
--filter-name stacktrace-to-slack \
--filter-pattern '{ ($.level = "ERROR") && ($.stacktrace = *) }' \
--destination-arn arn:aws:lambda:eu-west-1:111122223333:function:incident-slack-bot
```
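Lambda-as-destination is the exception: no RoleArn, but the function needs a resource-based policy allowing the log group’s ARN to invoke it, and that permission has to be in place before the filter is created (hence the order above). Without it, put-subscription-filter is rejected when CloudWatch Logs tries to validate the destination; if the permission is later removed or scoped to the wrong log-group ARN, delivery stops quietly, and the DeliveryErrors metric in the AWS/Logs namespace is where it shows up.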
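The ordering is easiest to keep right when both calls are scripted together, one pair per log group the Slack bot watches. A minimal sketch with boto3-style clients passed in; the statement-ID naming scheme (derived from the log group so repeated calls for different groups do not collide) is an invented convention.

```python
import re


def subscribe_lambda(logs_client, lambda_client, *, region, account, log_group,
                     function_name, filter_name, pattern):
    """Grant CloudWatch Logs invoke permission, then create the subscription filter.

    `logs_client` and `lambda_client` are expected to be boto3 clients for
    CloudWatch Logs and Lambda. Returns the function ARN used as destination.
    """
    log_group_arn = f"arn:aws:logs:{region}:{account}:log-group:{log_group}:*"
    function_arn = f"arn:aws:lambda:{region}:{account}:function:{function_name}"
    safe_group = re.sub(r"[^A-Za-z0-9_-]", "-", log_group.strip("/"))

    # 1. Resource-based policy on the function; no IAM role is involved.
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f"cwl-invoke-{safe_group}",  # must be unique per function
        Action="lambda:InvokeFunction",
        Principal="logs.amazonaws.com",
        SourceArn=log_group_arn,
        SourceAccount=account,
    )
    # 2. With the permission in place, attach the filter (no roleArn for Lambda).
    logs_client.put_subscription_filter(
        logGroupName=log_group,
        filterName=filter_name,
        filterPattern=pattern,
        destinationArn=function_arn,
    )
    return function_arn
```

Re-running it for a log group that is already wired up fails at add_permission, because the statement ID exists; a production version would check for or tolerate that conflict.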
What the Lambda receives
A Lambda subscribed to a log group gets a compressed payload:
```json
{
"awslogs": {
"data": "H4sIA...base64-gzipped-json..."
}
}
```
Decoded, the inner JSON is:
```json
{
"messageType": "DATA_MESSAGE",
"owner": "111122223333",
"logGroup": "/aws/ecs/payments-service",
"logStream": "payments-01",
"subscriptionFilters": ["stacktrace-to-slack"],
"logEvents": [
{
"id": "37954305091211983",
"timestamp": 1717001234567,
"message": "{\"level\":\"ERROR\",\"stacktrace\":\"...\"}"
}
]
}
```
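The handler has to base64-decode, gunzip, and JSON-parse the payload, then parse each message a second time, because the log line itself is a JSON string. CloudWatch Logs invokes the function asynchronously, so a batch is bounded by Lambda’s asynchronous-invocation payload limit, which is smaller than the 6 MB limit for synchronous calls. Failed invocations are retried by Lambda’s own retry machinery and eventually land in the function’s on-failure destination if one is configured. Envelopes whose messageType is CONTROL_MESSAGE are reachability checks from CloudWatch Logs rather than application log events.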
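A minimal sketch of the incident bot's handler, following the decode steps just described. The SLACK_WEBHOOK_URL environment variable and the message layout are hypothetical; the handler only posts when that variable is set, which keeps the decoding and formatting testable on their own.

```python
import base64
import gzip
import json
import os
import urllib.request


def decode_awslogs(event):
    """Base64-decode and gunzip event['awslogs']['data'] into the envelope dict."""
    return json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))


def format_for_slack(envelope, log_event):
    """One short Slack line per ERROR event: where it came from plus the stack head."""
    payload = json.loads(log_event["message"])  # the log line is itself JSON
    stack_lines = str(payload.get("stacktrace", "")).splitlines()
    head = stack_lines[0] if stack_lines else "(empty stacktrace)"
    return f"{envelope['logGroup']} [{envelope['logStream']}]: {head}"


def handler(event, context):
    envelope = decode_awslogs(event)
    if envelope.get("messageType") != "DATA_MESSAGE":
        return {"formatted": 0, "posted": 0}  # control message: nothing to post
    texts = [format_for_slack(envelope, e) for e in envelope.get("logEvents", [])]
    webhook = os.environ.get("SLACK_WEBHOOK_URL")  # hypothetical configuration
    if webhook:
        for text in texts:
            request = urllib.request.Request(
                webhook,
                data=json.dumps({"text": text}).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            # An HTTP error raises here, which hands the batch to Lambda's async retries.
            with urllib.request.urlopen(request, timeout=5):
                pass
    return {"formatted": len(texts), "posted": len(texts) if webhook else 0}
```

Because a retry replays the whole batch, a failure halfway through can produce duplicate Slack messages; deduplicating on each event's id would be the next refinement.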
The two-filter limit and the account-level escape
The two-subscription-filters-per-log-group limit is the constraint that shapes most designs. Options when more than two consumers want the same log data:
- Fan out downstream. One subscription filter to a Kinesis stream; many consumers read from the stream. This is the pattern when consumers need different views of the same data and latency matters.
- Account-level filter. One policy with a single filter pattern and destination, applied to every log group in the account, minus an optional exclusion list. Counts against a separate limit (one account-level policy per type per account, as of writing). Right for “this one pattern, this one destination, everywhere.”
- Different patterns, one log group. If the two filters have genuinely different patterns (say, AccessDenied to the SIEM and ERROR-with-stacktrace to the Slack bot), two filters can coexist on the same log group, each matching and delivering independently.
What’s worth remembering
- Subscription filters are the streaming exit. Pattern-filtered at the log group or account level, delivered to Lambda, Kinesis Data Streams, or Firehose in near-real time.
- Three destination shapes, three use cases. Lambda for per-event logic, Kinesis for many-consumer fan-out, Firehose for buffered delivery to a sink (S3, OpenSearch, Splunk, Redshift).
- Two filters per log group is the hard limit. Past that, fan out via Kinesis or use an account-level subscription filter policy.
- JSON filter pattern language. { $.field = "value" }, with &&, ||, wildcards, nested fields, and numeric comparisons. Test filters with aws logs test-metric-filter before deploying.
- Lambda destinations need a lambda:InvokeFunction resource-based permission, not an IAM role. Kinesis and Firehose destinations need a role that CloudWatch Logs assumes. Easy thing to get wrong on the first deploy.
- Account-level subscription filters are the “one pattern everywhere” answer. PutAccountPolicy with the SUBSCRIPTION_FILTER_POLICY type, with optional exclusion by log group name.
- Cross-account routing via Destinations. A CloudWatch Logs Destination in account B wraps a Kinesis stream (or Firehose stream) and carries an access policy naming account A. Source log groups in A subscribe to the Destination ARN.
- Subscription filters are streaming; S3 export is not. CreateExportTask pulls a time window to S3 as a batch job: correct for archival or analytics, wrong for two-minute SLAs.
The three consumers in the scenario end up with three subscription filters: one account-level filter to Firehose -> Splunk, one log-group filter to Kinesis for the analytics pipeline, one log-group filter to Lambda for the Slack bot. Filter at source, route by destination shape, accept the two-filter-per-log-group limit or work around it with fan-out. CloudWatch Logs goes from “the place we keep logs” to “the routing plane for log-derived signals”: same data, different consumers, correct tool at each exit.