The situation
The table is Accounts, keyed on account_id, with a balance attribute and a few bookkeeping fields. Every balance change, deposit, withdrawal, transfer, is a PutItem or UpdateItem on that table. The service is an internal Go service that the customer-facing API calls into.
The current flow, simplified: update DynamoDB, then (return values ignored or logged-and-dropped) call S3 for the audit object, OpenSearch for the index document, and EventBridge for the fraud event. When all four succeed the four systems agree. When DynamoDB fails, nothing downstream happens and the caller sees an error. When DynamoDB succeeds but (say) OpenSearch times out because a node is being replaced, the balance updates, the index doesn’t, the caller sees success, and the reconciliation job three days later finds the gap.
Variants of the same shape: S3 throttles because an upload burst collides with a lifecycle job; EventBridge rejects because the event schema registry has been tightened and the producer hasn’t been redeployed; OpenSearch refuses with a 429 because a shard is rebalancing. Each is survivable in isolation. The design has no survival mechanism, no retries with durable storage of what still needs sending, no dead-letter capture of anything that still won’t go, no replay for records that dropped silently last Tuesday.
What actually matters
Before reaching for a pipeline, it’s worth noticing what the current design is actually doing. It’s treating four independent systems as if they were one transactional store, and they aren’t. DynamoDB committed the write; the other three calls might succeed, might fail, might silently time out. There’s no coordination protocol and no rollback. The team has built four distributed writes without a distributed-transaction framework, and the bug is the one that always bites: the middle three drops on the floor.
Ownership of the four systems isn’t symmetric. DynamoDB is owned by this service; S3, OpenSearch, and EventBridge are owned by neighbouring teams who each have their own availability windows, maintenance calendars, and schema-versioning habits. A design that pretends all four are equally available is going to find out otherwise every couple of weeks.
Blast radius of the current design spreads invisibly. A failed OpenSearch index write doesn’t show up on the caller’s trace; it shows up three days later when the reconciliation job produces an exception report. The fix needs to turn invisible failures into visible ones with somewhere durable to retry from, and to contain them to one destination at a time.
Cost shape is a modest concern. Adding a stream and a Lambda adds dollars; the dollars matter less than the engineering time lost to chasing reconciliations and writing “OpenSearch is missing Tuesday’s events, can we replay?” runbooks. The more important shape question is where replay comes from when a destination loses its state: for short windows, from the stream itself; for long windows, from whichever destination is the canonical record.
Failure modes are where the design lives or dies. A slow OpenSearch should not back-pressure the audit log. A failing EventBridge schema should not block the search index. Each destination’s problems have to be contained to that destination, and the contained failure has to be retryable without replaying every event.
Coupling between the transaction-recording code and the downstream fan-out is the thing to reduce to zero. The service that records a transaction should, ideally, not know about the three followers. That knowledge belongs to a pipeline that triggers off the commit and does the correct thing on retries, not to the handful of Go files that handle account updates.
One more: ordering per account. Deposit-100-then-withdraw-30 must not arrive at the fraud model as withdraw-30-then-deposit-100, or a legitimate transaction on an initially-empty account gets flagged as negative-balance fraud. Per-account_id ordering has to survive the fan-out.
What we’ll filter on
Five filters for the pipeline:
- At-least-once delivery to every destination. A committed transaction must reach S3, OpenSearch, and EventBridge, even if one or more is unavailable at the moment of the original write. Retries persist somewhere that outlives a single application process.
- Per-destination independence. A slow OpenSearch shouldn’t back-pressure the audit log; a failing EventBridge schema shouldn’t block the search index.
- Ordering per account. Per-
account_idsequence survives the fan-out. - Replay capability. If OpenSearch loses its index, operators can replay the last N days. If the fraud team wants to score last month’s events, the events are available to re-deliver.
- Low application-code coupling. The transaction-recording code writes to DynamoDB and stops thinking about fan-out.
The change-propagation landscape
DynamoDB Streams. A per-table change log enabled with StreamSpecification. Every item-level modification (insert, modify, delete) becomes a stream record. Retained for 24 hours. Records appear exactly once and in the order modifications occurred per partition key. View types: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES. Consumed by Lambda ESMs or a custom Kinesis-compatible consumer. Practical concurrent reader limit: two per shard (AWS recommends one; adding a second risks throttling both).
Kinesis Data Stream for DynamoDB. A different feature with a similar name. Every item-level modification is written to a Kinesis Data Stream you own. Retention up to 365 days. Full Kinesis consumer ecosystem: KCL, Lambda ESMs, Firehose, Kinesis Analytics, and Enhanced Fan-Out consumers (up to 20 per stream, dedicated 2 MB/s per shard each). The trade-off sits on the contract: AWS documentation is explicit that “records might appear in a different order than the item changes occurred” and “the same stream record might also appear more than once”.
Neither stream option is a native source for EventBridge event buses. The supported ways to bridge DynamoDB to EventBridge are a Lambda that consumes DynamoDB Streams and calls PutEvents, or EventBridge Pipes with DynamoDB Streams as the source.
Lambda Event Source Mapping on DynamoDB Streams. The Lambda service polls the stream and invokes the function with batches that preserve per-partition-key order. Configurable: BatchSize (1-10,000), MaximumBatchingWindowInSeconds (up to 300), ParallelizationFactor (1-10, multiple concurrent invocations per shard preserving per-key order), MaximumRetryAttempts, BisectBatchOnFunctionError (split a failing batch and retry halves), FilterCriteria, and DestinationConfig.OnFailure, an SQS queue, SNS topic, or S3 bucket where batches that exhaust retries land for later handling.
Fan-out shapes. Three to name: one Lambda handling all three destinations; multiple Lambdas each reading the stream (constrained by the two-readers-per-shard ceiling); or one Lambda reading the stream and re-publishing to SNS or EventBridge Pipes, letting that layer own the per-destination fan-out.
Side by side
| Option | At-least-once | Per-destination isolation | Ordering per account | Replay | Low coupling |
|---|---|---|---|---|---|
| Synchronous fan-out in the app | ✗ | ✗ | ✓ | ✗ | ✗ |
| DDB Streams + single Lambda fanning out | ✓ | , | ✓ | , | ✓ |
| DDB Streams + three Lambdas, one per destination | ✓ | ✓ | ✓ | , | ✓ |
| Kinesis Data Stream for DynamoDB + many consumers | ✓ | ✓ | ✗ | ✓ | ✓ |
| EventBridge directly from the app | ✗ | ✗ | ✓ | ✗ | ✗ |
| DDB Streams + Lambda + per-destination DLQs | ✓ | ✓ | ✓ | ✓ (24 h + S3) | ✓ |
Matching the pipeline
The Streams pipeline, in depth
Stream configuration. Enable with StreamViewType: NEW_AND_OLD_IMAGES so the consumer has the before and after of each balance change, enough to reconstruct the transaction amount even for derived writes.
Event Source Mapping. BatchSize around 100 with a 1-5 second batching window; ParallelizationFactor of 5-10 so the same shard can be processed by multiple concurrent invocations while two writes to the same account_id still go to the same lane (the ESM uses the partition key to choose a lane); BisectBatchOnFunctionError: true to narrow a poison pill quickly; MaximumRetryAttempts bounded explicitly (10 is a reasonable floor) rather than the -1 default, so a broken record doesn’t block a shard for 24 hours; DestinationConfig.OnFailure pointing at an SQS queue for whole-batch failures.
Handler. For each record the Lambda delivers to the three destinations independently, typically with Go’s errgroup, collecting per-destination errors. Records whose S3 write failed go to the S3 DLQ; OpenSearch failures to the OpenSearch DLQ; EventBridge failures to the EventBridge DLQ. The handler returns success to the ESM as long as each record was either delivered or captured in a DLQ; unrecoverable failures are thrown so the ESM retries or bisects. Per-destination DLQs matter because each destination has a different idea of “broken”, an OpenSearch schema mismatch needs a different runbook than an S3 bucket-policy failure.
Replay. Short replay (within 24 hours) uses the stream itself, a new ESM with StartingPosition: TRIM_HORIZON. Longer replay uses the S3 audit log, which is the canonical “what happened” record by design; DynamoDB remains the canonical “what the balance is now”. Two data stores, two questions.
A worked example: one deposit, four failure paths
A deposit arrives: UpdateItem on account_id = ACC-17 increments the balance from AUD$1,000 to AUD$1,200.
On the happy path DynamoDB commits, a stream record appears with the old image (AUD$1,000) and new image (AUD$1,200) partitioned on ACC-17, the ESM picks it up within the batching window, and the handler writes an audit object to s3://audit-lake/2026/11/23/ACC-17/<ulid>.json, indexes a document to OpenSearch, and publishes a BalanceChanged event to the fraud bus. All three succeed; the ESM advances past the sequence number.
When OpenSearch is rebalancing a shard, S3 succeeds; OpenSearch returns a 429; an in-handler retry also fails. The handler sends this record to the OpenSearch DLQ and continues. S3 is already done, EventBridge succeeds. The ESM advances. A separate redriver consumes the OpenSearch DLQ with longer backoff; the document arrives minutes later, not lost.
When EventBridge rejects because the schema has drifted, PutEvents returns FailedEntryCount: 1 with a validation error. The handler sends the record to the EventBridge DLQ; S3 and OpenSearch are untouched. Ops fixes the registry or the producer serialiser and redrives.
When the Lambda itself throws, a nil-pointer dereference on an unexpected field, the whole batch fails. The ESM retries, fails again, and because BisectBatchOnFunctionError is on, splits the batch and narrows the bad record down quickly. Once MaximumRetryAttempts is exhausted, the sub-batch metadata lands in the ESM’s OnFailure queue and the iterator advances so the shard isn’t blocked. The engineer ships a fix and replays the affected sequence range from the stream (within 24 hours) or from the S3 audit log.
At no point does a committed balance change end up unaudited, unindexed, or unemitted without something durable remembering it still needs to get there.
What’s worth remembering
- Don’t fan out in application code across independent systems. Writing to DynamoDB and then calling three downstream services is four distributed writes with no coordination, any of the follow-ups can drop silently and nothing will retry.
- DynamoDB Streams guarantees per-partition-key ordering and exactly-once appearance per stream record, with 24-hour retention. Enable with
StreamViewTypeset to include the images the consumers need. - Kinesis Data Stream for DynamoDB trades ordering and uniqueness for consumer scale and 365-day retention. Duplicates possible; out-of-order possible; up to 20 Enhanced Fan-Out consumers.
- Lambda ESM on DynamoDB Streams gives durable consumption with built-in retries, parallelisation per shard, batch bisection, and an
OnFailuredestination. UseParallelizationFactorto scale concurrency without breaking per-key order. - Per-destination SQS DLQs isolate failures. An OpenSearch problem shouldn’t corrupt the audit trail; an EventBridge problem shouldn’t lose a search update. One DLQ per destination, one redrive playbook per destination.
- EventBridge is a destination here, not a source. DynamoDB → EventBridge bus goes via Lambda or EventBridge Pipes; the application shouldn’t call
PutEventsdirectly from the transaction path. - S3 is the long-term replay source. DynamoDB Streams retains 24 hours; beyond that, replay from the S3 audit log. The immutable audit entries are the design’s canonical “what happened” record.
- One Lambda consuming the stream is cheaper and simpler than one Lambda per destination. Three independent Lambdas consuming the same DynamoDB Stream pushes against the two-readers-per-shard practical limit.
- Reader-side isolation is what keeps per-destination problems from cascading. Whether the isolation comes from per-destination DLQs inside a single Lambda or from separate Kinesis EFO consumers, the property is the same: a failing destination gets its own retry budget.
- The application’s job shrinks to “write to DynamoDB”. That’s the design’s real win, one place to reason about, one source of truth for everything downstream.