The situation
A platform team operates a multi-account AWS Organization: production, staging, dev, observability, and security. The first three are producers: they emit operational events from Lambdas, pipelines, and managed services, such as deploy started / succeeded from CodePipeline, alarm transitioned from CloudWatch, GuardDuty findings, and IAM Access Analyzer findings.
Two accounts consume. The observability account runs a dashboard Lambda writing to a DynamoDB operational timeline, plus an alerting Lambda that decides whether to page on-call. The security account runs a SOAR pipeline: a Step Functions state machine that ingests findings, enriches them, and either auto-remediates or opens a case.
The routing has to fan out from three producers to two consumers (with room for more producers later), carry versioned schemas consumer teams can codegen against, support replay for the SOAR pipeline if it goes offline, route permanent delivery failures to a dead-letter path, and stay small enough that the platform team doesn’t inherit a new system to keep alive.
What actually matters
Before reaching for a service, it’s worth naming what the middle of this pipe has to do to earn its keep.
The first thing is ownership of the wire format. Events flowing cross-account are a contract: producers promise a shape, consumers rely on it. A routing layer that treats bodies as opaque bytes puts every breaking change one inbox away from a 3am page. A routing layer that versions schemas and makes them discoverable lets teams evolve producers and consumers independently, and spot drift before it causes incidents. The worst outcome is a Slack thread every time a field name changes; the second-worst is discovering a missing field in a downstream job that ran at midnight.
The second is blast radius on a consumer outage. If the SOAR pipeline goes down for an hour because of a bug, the findings it should have processed have to be processable once it comes back. That can mean “producers durably persist everything and consumers poll” (messaging), “producers push, consumers replay from an archive” (event bus with archive), or “producers and consumers share a streaming log” (Kinesis / Kafka). The shape we pick constrains what “recovery” means, and how hard it is to explain on a whiteboard six months from now.
The third is coupling between producers and consumers. Fan-out is the point here: producers shouldn’t know how many consumers there are. A topic with subscribers delivers on that for simple cases, but when the consumer-side filtering is subtle (“security only wants high-severity findings; observability wants every deploy event, filtered by environment”), we want filters to live on the consumer side, where the consumer team owns them. Filtering at the producer is a consumer-side concern leaking into producer code.
The fourth is dead-letter discipline. Permanent delivery failures happen: targets get deleted, permissions get revoked, consumers throw non-retryable errors. “Where does a failed event go?” should have an obvious answer per target, and the answer shouldn’t be “it retries for a while then disappears.” A DLQ per target, scoped to a specific downstream, keeps failures inspectable and the failure mode discoverable.
Fifth is billing shape and throughput assumptions. Operational events are low-throughput (thousands a day, not millions) and spiky (deploys happen in bursts). A service priced per-event at that volume is rounding error; a service that expects shard capacity planning is expensive overkill. The shape of the bill should match the shape of the traffic.
Finally, parallel control plane risk. A four-person platform team can’t inherit a Kafka cluster as a side project, however tempting the primitives. The answer has to be serverless and managed, with few knobs to tune: not because heavier tools are wrong in general, but because the budget for operating the routing layer in this org rounds to zero.
What we’ll filter on
- Many-to-few cross-account fan-out. N producers, two consumers, headroom for more producers later.
- Versioned schemas with discoverability. Consumer teams see the shape, generate client code, and handle version drift declaratively.
- Replay of historical events. The SOAR pipeline can ask for every finding from the last six hours after an outage.
- Per-target dead-letter queue. Permanent delivery failures land somewhere retrievable, per target.
- Low operational overhead. Serverless, no shards, no consumer groups, no bespoke control plane.
The event routing landscape
AWS has several answers to “move events from A to B across accounts.”
SNS topics with cross-account subscriptions. A topic in the producer account with a resource-based policy allowing the consumer account’s Lambda or SQS to subscribe. Fan-out is native; cross-account is a policy edit. But SNS treats message bodies as opaque: there is no schema registry and no built-in archive, so replay requires a home-built capture queue. Per-subscription DLQs exist. Strong on fan-out and DLQ, weak on schemas and replay.
SQS queues with cross-account access. A queue in the consumer account with a policy allowing producers to SendMessage. This is point-to-point, not fan-out: two consumers means two queues, and the producer publishes to both. No schemas, no replay once consumed, native DLQ via redrive. Fails fan-out: SQS is a queue, not a bus.
EventBridge custom buses with cross-account targets. A custom event bus in each consumer account with a resource-based policy granting events:PutEvents to producer accounts. Producers run rules on their default bus targeting the consumer bus’s ARN; consumer buses have their own rules dispatching to Lambda, SQS, Step Functions, and anything else EventBridge targets. Schema Registry versions event shapes; Archive plus Replay re-runs past events; DLQs are per target. Serverless throughout.
Kinesis Data Streams with cross-account consumers. Producers write to a stream in a shared account; consumers read via enhanced fan-out or KCL. Strong ordering, millisecond latency, replay within retention (up to 365 days). But the model is shards, shard iterators, consumer checkpointing, KCL state tables, and the schema story is Glue Schema Registry bolted on, not EventBridge’s. Right for high-throughput ordered workloads; overkill for operational events.
Side by side
| Option | Fan-out | Schemas | Replay | DLQ | Low overhead |
|---|---|---|---|---|---|
| SNS cross-account | ✓ | ✗ | ✗ | ✓ | ✓ |
| SQS cross-account | ✗ | ✗ | ✗ | ✓ | ✓ |
| EventBridge custom buses | ✓ | ✓ | ✓ | ✓ | ✓ |
| Kinesis Data Streams | ✓ | partial | ✓ | ✗ | ✗ |
Matching the event flow to a routing mechanism
EventBridge cross-account in depth
Crossing account boundaries has six parts.
Every account has a default bus. The bus called default receives events from AWS services automatically: CloudWatch alarm transitions, CodePipeline stage events, GuardDuty and Access Analyzer findings. PutEvents from application code goes there too unless the call names another bus.
Custom buses are explicit resources. In each consumer account, create one: platform-events in observability, security-events in security. Each has its own rule set and its own resource-based policy.
The resource-based policy grants events:PutEvents to producer accounts. Without this, any PutEvents from those accounts is rejected. The clean pattern keeps rule management in the consumer account and grants only PutEvents.
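As a rough sketch of what such a policy might look like, the snippet below builds the policy document as a plain Python dict. The account IDs, Region, and bus ARN are hypothetical placeholders, and the helper name consumer_bus_policy is an illustration rather than part of any AWS SDK.

```python
import json


def consumer_bus_policy(bus_arn: str, producer_account_ids: list) -> dict:
    """Build a resource-based policy that lets the listed accounts PutEvents onto one bus."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowProducerAccountsPutEvents",
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{acct}:root" for acct in producer_account_ids]
                },
                # Only PutEvents is granted; rule management stays with the consumer account.
                "Action": "events:PutEvents",
                "Resource": bus_arn,
            }
        ],
    }


# Hypothetical Region and account IDs, for illustration only.
SECURITY_BUS_ARN = "arn:aws:events:eu-west-1:999999999999:event-bus/security-events"
PRODUCER_ACCOUNTS = ["111111111111", "222222222222", "333333333333"]

policy = consumer_bus_policy(SECURITY_BUS_ARN, PRODUCER_ACCOUNTS)
print(json.dumps(policy, indent=2))
```

The resulting JSON string is the kind of document that could be supplied as the Policy argument when attaching a policy to the bus, for example through the EventBridge PutPermission API.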
Producer rules target the consumer bus’s ARN. In each producer account, a rule on the default bus matches the events of interest and has the consumer bus’s ARN as target. For the SOAR path: source: [aws.guardduty, aws.access-analyzer].
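To make the matching behaviour concrete, here is a minimal sketch of the SOAR-path pattern together with a deliberately simplified matcher. The matcher only handles exact-value matching on nested keys, which is a small subset of what EventBridge patterns support; the sample events are hypothetical and heavily trimmed.

```python
# The SOAR-path pattern from the text, written as an EventBridge-style event pattern.
FINDINGS_PATTERN = {"source": ["aws.guardduty", "aws.access-analyzer"]}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must exist in the event, and each
    leaf list matches when the event value equals one of its entries."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical, trimmed-down events for illustration.
guardduty_event = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {"type": "Recon:EC2/Portscan"},
}
deploy_event = {
    "source": "aws.codepipeline",
    "detail-type": "CodePipeline Pipeline Execution State Change",
    "detail": {"state": "SUCCEEDED"},
}

print(matches(FINDINGS_PATTERN, guardduty_event))  # True
print(matches(FINDINGS_PATTERN, deploy_event))     # False
```

A local matcher like this is mainly useful for unit-testing patterns before they are deployed; the real service evaluates the pattern, including operators this sketch does not implement.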
Target-bus targets require an IAM role. For cross-account event bus targets created after March 2, 2023, the producer rule must reference an IAM role in the producer account that trusts the EventBridge service and has permission to PutEvents on the destination bus. The resource-based policy on the consumer bus is still required; both sides must agree.
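A minimal sketch of the two policy documents such a role might carry, plus the target entry that references it, is shown below. The role name, account IDs, Region, and target Id are hypothetical; the helper names are illustrative only.

```python
import json


def producer_role_policies(consumer_bus_arn: str):
    """Return (trust_policy, permissions_policy) for the producer-side role."""
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The EventBridge service assumes this role when the rule fires.
                "Principal": {"Service": "events.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    permissions_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "events:PutEvents",
                "Resource": consumer_bus_arn,  # scoped to the one destination bus
            }
        ],
    }
    return trust_policy, permissions_policy


def bus_target(target_id: str, consumer_bus_arn: str, role_arn: str) -> dict:
    """Target entry for a producer rule: the consumer bus ARN plus the role to assume."""
    return {"Id": target_id, "Arn": consumer_bus_arn, "RoleArn": role_arn}


# Hypothetical identifiers, for illustration only.
BUS_ARN = "arn:aws:events:eu-west-1:999999999999:event-bus/security-events"
ROLE_ARN = "arn:aws:iam::111111111111:role/forward-findings-to-security"

trust, permissions = producer_role_policies(BUS_ARN)
target = bus_target("security-events-bus", BUS_ARN, ROLE_ARN)
print(json.dumps({"trust": trust, "permissions": permissions, "target": target}, indent=2))
```

Scoping the permissions policy to a single bus ARN keeps the role from being reusable as a general-purpose cross-account publisher.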
Consumer-side rules dispatch to the real targets. On platform-events, a rule matching deploy events targets the dashboard Lambda; another targets the alerting Lambda. On security-events, a rule matching findings targets the SOAR state machine.
Cross-account events are billed to the sending account as custom events. The sender pushed, the sender pays.
A worked event flow
A GuardDuty finding of type Recon:EC2/Portscan appears in production.
T+0. GuardDuty emits onto production’s default bus.
T+~100ms. A rule on production’s default bus matches GuardDuty findings and fires. Target is the security-events bus ARN, via an IAM role in production that can PutEvents to that bus.
T+~200ms. The security-events bus accepts the PutEvents because its resource-based policy allows production’s account ID.
T+~250ms. A rule on security-events fires. Target is the SOAR state machine, via a role with states:StartExecution.
T+~500ms. SOAR is running. Two hops, both managed, both authorised by explicit policy.
Failure path. The state machine was accidentally deleted. EventBridge sends failures for missing targets straight to the target’s DLQ without retries; the retry loop only helps when the problem is transient. The DLQ receives the event with EventBridge-added attributes describing the failure. When the state machine is restored, the team redrives the DLQ.
Replay path. The state machine had a six-hour logic bug rejecting every finding with a specific attribute. The fix ships. An archive on security-events has been capturing every finding; the operator triggers a replay with a six-hour window. EventBridge resends matching events onto the same bus at roughly one-minute intervals, routed only to the SOAR rule (rule-filtered replay keeps the backlog off other consumers). The state machine processes the backlog.
EventBridge Archive plus Replay
An archive attaches to exactly one event bus; you cannot switch the source after creation. The archive captures every event matching an optional event pattern, or everything if no pattern. Retention is configurable in days; the default is indefinite.
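As a sketch, the request for such an archive could be assembled as below. The field names follow the EventBridge CreateArchive API as I understand it (ArchiveName, EventSourceArn, EventPattern, RetentionDays); the archive name, ARN, and helper function are hypothetical.

```python
import json


def archive_request(name: str, bus_arn: str, retention_days: int = 0, event_pattern=None) -> dict:
    """Assemble CreateArchive-style parameters for one source bus.

    retention_days=0 is used here to mean "keep indefinitely".
    """
    if retention_days < 0:
        raise ValueError("retention_days must be zero or positive")
    request = {
        "ArchiveName": name,
        "EventSourceArn": bus_arn,  # the single bus this archive is bound to
        "RetentionDays": retention_days,
    }
    if event_pattern is not None:
        # The pattern travels as a JSON string; omit it to capture everything.
        request["EventPattern"] = json.dumps(event_pattern)
    return request


# Hypothetical Region and account ID.
SECURITY_BUS_ARN = "arn:aws:events:eu-west-1:999999999999:event-bus/security-events"

findings_archive = archive_request(
    "security-findings",
    SECURITY_BUS_ARN,
    retention_days=30,
    event_pattern={"source": ["aws.guardduty", "aws.access-analyzer"]},
)
print(json.dumps(findings_archive, indent=2))
```

With boto3, a dict like this could be unpacked into the events client's create_archive call; that step is left out so the sketch runs without AWS credentials.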
Replay reads from the archive and writes to the source bus. The binding constraint: replay targets only the bus the archive was attached to, not a different bus, not a different Region, not a different account. The source bus’s rules are the canonical dispatch surface and replay reuses them.
Replay is time-bounded (a start and an end time) and asynchronous: EventBridge works through the window at roughly one-minute intervals and events are not guaranteed to replay in the original order. Idempotent consumers like SOAR don’t care; order-sensitive consumers should not rely on replay. You can broadcast to every rule on the source bus or pick a subset; rule-filtered replay is what you want when other consumers shouldn’t see the replayed events.
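The six-hour, rule-filtered replay from the worked example might be expressed roughly as follows. The parameter names mirror the EventBridge StartReplay API (ReplayName, EventSourceArn, EventStartTime, EventEndTime, Destination with FilterArns); every ARN, the replay name, and the helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replay_request(name, archive_arn, bus_arn, rule_arns, window_end, window_hours):
    """Assemble StartReplay-style parameters for a time-bounded, rule-filtered replay."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return {
        "ReplayName": name,
        "EventSourceArn": archive_arn,  # the archive to read from
        "EventStartTime": window_end - timedelta(hours=window_hours),
        "EventEndTime": window_end,
        "Destination": {
            "Arn": bus_arn,           # must be the bus the archive was attached to
            "FilterArns": rule_arns,  # only these rules see the replayed events
        },
    }


# Hypothetical identifiers, for illustration only.
ARCHIVE_ARN = "arn:aws:events:eu-west-1:999999999999:archive/security-findings"
BUS_ARN = "arn:aws:events:eu-west-1:999999999999:event-bus/security-events"
SOAR_RULE_ARN = "arn:aws:events:eu-west-1:999999999999:rule/security-events/findings-to-soar"

fix_shipped_at = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
request = replay_request(
    "soar-backfill", ARCHIVE_ARN, BUS_ARN, [SOAR_RULE_ARN], fix_shipped_at, window_hours=6
)
print(request["EventStartTime"].isoformat(), "->", request["EventEndTime"].isoformat())
```

Passing only the SOAR rule in FilterArns is what keeps the backlog away from other consumers on the same bus.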
Limits: ten concurrent replays per account per Region; completed replays retained in the console for 90 days.
Dead-letter queues and retry behaviour
Per-target RetryPolicy on an EventBridge rule target:
- MaximumEventAgeInSeconds: valid range 60 to 86400 (one minute to 24 hours).
- MaximumRetryAttempts: valid range 0 to 185.
The loop stops at whichever limit is hit first. When retries exhaust, or when the failure is non-retryable like “missing permissions” or “target no longer exists”, the event goes to the target’s DLQ if one is configured. The DLQ is an SQS queue; EventBridge adds message attributes identifying the rule-target pair and the reason for failure.
DLQs are per target, not per rule and not per bus. A rule with three targets can have three independent DLQs or any mix. In this design, no target is configured without one.
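One way to enforce both the valid ranges and the "every target has a DLQ" convention is a small builder like the one below. The dict keys follow the target structure accepted by the EventBridge PutTargets API (RetryPolicy, DeadLetterConfig); the ARNs, target Id, and chosen retry values are hypothetical.

```python
def target_with_dlq(target_id, target_arn, role_arn, dlq_arn, max_attempts, max_age_seconds):
    """Build a rule-target entry that always carries a retry policy and a DLQ."""
    if not 0 <= max_attempts <= 185:
        raise ValueError("MaximumRetryAttempts must be between 0 and 185")
    if not 60 <= max_age_seconds <= 86400:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 86400")
    if not dlq_arn:
        raise ValueError("every target needs a dead-letter queue")
    return {
        "Id": target_id,
        "Arn": target_arn,
        "RoleArn": role_arn,
        "RetryPolicy": {
            "MaximumRetryAttempts": max_attempts,
            "MaximumEventAgeInSeconds": max_age_seconds,
        },
        "DeadLetterConfig": {"Arn": dlq_arn},  # an SQS queue scoped to this target
    }


# Hypothetical identifiers, for illustration only.
soar_target = target_with_dlq(
    target_id="soar-state-machine",
    target_arn="arn:aws:states:eu-west-1:999999999999:stateMachine:soar",
    role_arn="arn:aws:iam::999999999999:role/events-start-soar",
    dlq_arn="arn:aws:sqs:eu-west-1:999999999999:soar-target-dlq",
    max_attempts=20,
    max_age_seconds=3600,
)
print(soar_target["RetryPolicy"])
```

Rejecting a missing DLQ at build time turns the convention into something a deployment pipeline can check, rather than something reviewers have to remember.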
Schema Registry
AWS ships three default registries: aws.events (built-in AWS service schemas), discovered-schemas (populated by schema discovery), and a virtual “All schemas” view. You can also create custom registries.
Schema discovery runs on an event bus. Enable it and EventBridge inspects events flowing through, infers OpenAPI schemas, and writes them to discovered-schemas. When an event’s shape changes (a new field, a changed type), discovery creates a new version.
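The idea of "a new version on shape change" can be illustrated with a toy model. This is not EventBridge's discovery algorithm and produces no OpenAPI output; it only infers a type skeleton from each event's detail and bumps a counter when that skeleton differs from the last one seen. The class, the schema naming, and the sample events are all hypothetical.

```python
def infer_shape(value):
    """Reduce a JSON-like value to its type skeleton (field names and type names)."""
    if isinstance(value, dict):
        return {key: infer_shape(val) for key, val in value.items()}
    if isinstance(value, list):
        # Toy simplification: describe a list by its first element only.
        return [infer_shape(value[0])] if value else []
    return type(value).__name__


class ToySchemaRegistry:
    """Keeps a version history of shapes per (source, detail-type) pair."""

    def __init__(self):
        self.versions = {}  # schema name -> list of shapes, oldest first

    def observe(self, event: dict) -> int:
        """Record an event and return the current version number for its schema."""
        name = f"{event['source']}@{event['detail-type']}"
        shape = infer_shape(event["detail"])
        history = self.versions.setdefault(name, [])
        if not history or history[-1] != shape:
            history.append(shape)  # shape changed: new version
        return len(history)


registry = ToySchemaRegistry()
v_first = registry.observe(
    {"source": "platform.deploys", "detail-type": "DeployStarted", "detail": {"service": "api", "attempt": 1}}
)
v_same = registry.observe(
    {"source": "platform.deploys", "detail-type": "DeployStarted", "detail": {"service": "web", "attempt": 2}}
)
v_changed = registry.observe(
    {"source": "platform.deploys", "detail-type": "DeployStarted",
     "detail": {"service": "web", "attempt": 2, "environment": "staging"}}
)
print(v_first, v_same, v_changed)  # 1 1 2
```

The point of the toy is the contract it implies for producers: adding a field is visible to consumers as a new version, not as a silent change.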
Code bindings are available for Go, Java, Python, and TypeScript. A consumer team points their build at a schema version, downloads the binding, and gets typed structs. Custom schemas (OpenAPI 3 or JSONSchema Draft 4) can be uploaded to custom registries; they cannot be exported back out.
When SNS is still correct
SNS remains the correct shape when fan-out is the only requirement and there’s no need for schemas, replay, or filtering richer than a subscription filter policy. It also wins for mobile push, SMS, and email, none of which EventBridge delivers, and for lowest-latency paths. And if the producer already publishes to SNS, bridging to EventBridge buys nothing.
What’s worth remembering
- Default bus vs custom buses. Every account has a default bus that receives AWS-service events automatically. Custom buses are explicit resources with their own rule sets and resource-based policies.
- Cross-account delivery is a two-hop pattern. Producer rule on the default bus targets the consumer bus ARN; consumer rule on that bus dispatches to the real target.
- The consumer bus’s resource-based policy grants events:PutEvents to producer accounts.
- Cross-account event bus targets require an IAM role in the producer account (for targets created after March 2, 2023).
- Sending accounts are billed for cross-account events as custom events.
- Schema Registry auto-discovers from bus traffic into discovered-schemas, versions on shape change, and generates code bindings for Go, Java, Python, and TypeScript.
- Archive plus Replay is source-bus-only. Time-bounded, one-minute intervals, unordered, with optional rule filtering. Ten concurrent replays per account per Region.
- DLQs are per target, not per rule. SQS queue, receives events whose retries exhausted or whose delivery failed non-retryably.
- MaximumEventAgeInSeconds is 60 to 86400; MaximumRetryAttempts is 0 to 185. Whichever limit hits first stops the loop.
- Pipes is point-to-point; buses are many-to-many. Match shape to problem.
- SNS is still correct for pure fan-out without schemas or replay, and for mobile, SMS, and email.