Choosing a Change Data Capture Pipeline on AWS

December 15, 2027 · 19 min read

The situation

Our orders table in Postgres (RDS, 500 GB, 2,000 writes per minute at peak) is the source of truth for the commerce platform. Three consumers want change data from it.

The warehouse needs every change streamed into a curated orders fact table, near-real-time, but a minute of lag is fine. Output target: S3 Parquet, partitioned by day, catalogued in Glue.
The search index needs updates pushed to OpenSearch within seconds so recently-placed orders show up in the ops console. Output target: an OpenSearch index.
The fraud service needs a raw event stream it can subscribe to for scoring live transactions. Output target: something the fraud service’s Kafka consumer can attach to, reading from “now” with at-least-once delivery and ordering preserved per order ID.

One source table, three consumers, three latency profiles, and three different target shapes. The question is which shape of CDC pipeline fits where, and whether we should run one, the other, or both.

What actually matters

Before reaching for a service, it’s worth separating “CDC the capture” from “CDC the distribution”.

Capture is the business of extracting changes from the source database without missing any and without double-counting. For Postgres that means reading the write-ahead log via a logical-replication slot (pgoutput plugin or wal2json); for MySQL, reading the binary log; for SQL Server, reading the transaction log via CDC or change-tracking. The capture mechanism is database-specific; whatever service we pick is ultimately a consumer of that mechanism.

Distribution is what happens next. Some pipelines are point-to-point, the capturer writes directly to the target. Others are fan-out, the capturer publishes to a shared log that multiple consumers independently read. The architectural choice is where to draw that line, and it depends heavily on how many consumers there will ever be.

If there’s one consumer and it’ll always be one consumer, point-to-point is simpler, fewer moving parts, less infrastructure, less to operate. If there are already two consumers and a third is on the roadmap, a shared log up front saves us from rebuilding the pipeline later. The shared log also decouples producer and consumer failure domains: the capturer doesn’t go down because the warehouse writer is slow, and a new consumer can attach without renegotiating anything with the producer.

The other axis is what the consumer looks like. Consumers that want to pull from a topic and process at their own pace (fraud service, anything with a consumer group) want a log-shaped feed. Consumers that want structured rows landing in S3, a database, or a search index want a transformation-aware delivery tool. The question isn’t “which is more powerful”; it’s “which shape fits what we actually need to land”.

The third axis is initial load plus CDC. The warehouse doesn’t just want changes from now; it wants a consistent snapshot of the table plus changes since the snapshot. Both managed shapes can do this; the configuration story differs between them, but neither leaves a gap between snapshot and stream.

What we’ll filter on

Distilling that exploration into filters we can score each option against:

Source-to-target shape, point-to-point (one consumer) or fan-out via a shared log (many consumers)?
Initial snapshot plus CDC, can the service do full load and switch to CDC without dropping changes?
Target surface, what does the service deliver to natively?
Operational model, serverless, replication instances, or Kafka Connect workers?
Transformation in flight, filtering, schema mapping, masking inside the pipeline?

The CDC landscape

AWS Database Migration Service (DMS). Managed source-to-target migration and ongoing replication. Supports a long list of sources (Postgres, MySQL, Oracle, SQL Server, MongoDB, DocumentDB, S3, Aurora variants, and more) and a long list of targets (S3, Kinesis Data Streams, Kafka, Redshift, DynamoDB, OpenSearch, Neptune, DocumentDB, and the RDBMS family). Runs on replication instances (EC2 under the hood, with Multi-AZ for HA) in the classic product, or as DMS Serverless in the newer product where capacity autoscales by workload. Tasks configure source, target, table mappings (which tables, which columns, rename/reshape), and migration type (full load, CDC only, or full-load + CDC). Transformations are limited but real: column rename, table rename, column drop, column add with expression, include/exclude tables by pattern.
AWS MSK Connect. A managed runtime for Kafka Connect worker fleets. You deploy a connector (a plugin packaged as a JAR) to a worker fleet, and Kafka Connect handles the source-to-Kafka or Kafka-to-sink data movement. For CDC from Postgres, the usual plugin is Debezium PostgreSQL Connector, it opens a logical-replication slot, reads the WAL, and publishes change events to a Kafka topic (one topic per table, typically). From Kafka, sink connectors (Confluent S3 sink, OpenSearch sink, JDBC sink) write to whatever downstream needs data. MSK Connect provisions the Connect workers as MCUs (compute units); you pay per MCU-hour.
DMS + Kinesis + Lambda fan-out. A hybrid: DMS targets Kinesis Data Streams instead of the ultimate target, and Lambda consumers (or Firehose, or Flink) read from Kinesis and fan out. Gives multi-consumer fan-out without running Kafka, at the cost of Kinesis’s shard-based scaling model (which is different from Kafka’s partition model in a few operationally-important ways).
Zero-ETL integrations. AWS has been shipping native zero-ETL integrations from RDS/Aurora to Redshift and to OpenSearch (no DMS, no Kafka, no pipeline). Works for exactly those pairs and only those pairs; zero-ETL from Aurora Postgres to Redshift is live, and more pairs are landing. Worth mentioning; not a general solution.
Native logical replication / open-source Debezium self-hosted. Skipping the managed layer. Postgres logical replication to Postgres, or self-hosted Debezium + Kafka. Right answer only when the managed services don’t fit; the operational burden of self-hosted Kafka + Connect is substantial.

Side by side

Option	Shape	Full load + CDC	Native targets	Operational model	In-flight transforms
DMS (instance)	Point-to-point (one target per task)	✓	S3, Kinesis, Kafka, Redshift, DynamoDB, OpenSearch, RDBMS	Replication instances	Rename, drop, add, filter
DMS Serverless	Point-to-point	✓	Same list	Auto-scaling serverless	Same
MSK Connect + Debezium	Fan-out via Kafka	✓ (snapshot mode)	Anything with a Kafka sink connector	Connect worker fleet	SMTs (Single Message Transforms)
DMS → Kinesis → consumers	Fan-out via Kinesis	✓	Anything reading Kinesis	Mixed	DMS + Lambda
Zero-ETL (Aurora → Redshift / OpenSearch)	Point-to-point	✓ (continuous)	Redshift, OpenSearch only	Fully managed	None

Reading by use case:

One consumer, structured target, no desire to run Kafka. DMS or DMS Serverless. Simplest path; fewest pieces to run.
Multiple consumers, different shapes, reading at their own pace. MSK Connect + Debezium. A shared Kafka log is exactly what you want here.
Aurora Postgres → Redshift only, zero-ETL. No pipeline to own.
Fan-out required but team has no Kafka expertise. DMS → Kinesis, with Lambda/Firehose/Flink as consumers. The fan-out shape without Kafka’s operational load.

The two pipelines, side by side

DMS draws a direct line from source to target; a second target means a second task and a second slot's worth of read load on the database. MSK Connect puts Kafka in the middle, so one source connector feeds any number of independent consumers.

DMS in depth

DMS has two flavours: DMS (instance-based) with replication instances you size yourself (Multi-AZ optional), and DMS Serverless where AWS scales DCU (DMS compute units) by workload. Serverless is the default for new work unless a specific constraint (VPC peering quirks, very low idle cost, plugin needs) pushes to instance-based.

Tasks. The unit of work. A task has a source endpoint, a target endpoint, a replication instance or serverless pool, table mappings, and a migration type. Migration types: full-load, cdc, full-load-and-cdc. Full-load + CDC is the usual choice for a fresh target. DMS snapshots the source table, starts CDC from the snapshot LSN, and coordinates the two so no change is dropped.

Table mappings. JSON that lists included schemas and tables plus transformations. Transformations can rename columns (rename), drop columns (remove-column), add computed columns (add-column), convert case, or add a prefix/suffix. Not a full SQL language, complex transformation belongs upstream or in the target, but enough for “drop the PII column before it leaves the source” and “rename cents to amount to match the warehouse schema”.

Replication slot handling. For Postgres, DMS creates and manages a logical replication slot on the source. If a task fails and sits stale, the slot retains WAL on the source, which can balloon the source disk. DMS cleans up on task deletion; if a task is stopped and never resumed, the slot has to be dropped manually. Monitor pg_stat_replication and have an alert for lagging slots.

CDC targets worth knowing. S3 (Parquet or CSV, partitioned by date, with a _dms_change_record marker column for operation type), Kinesis Data Streams (one record per change, great for fan-out via Lambda), Kafka (any MSK or self-managed), Redshift (continuous load via COPY or Zero-ETL), OpenSearch (direct document indexing), DynamoDB (PK-based inserts/updates).

Validation. DMS can run ongoing row-level validation between source and target during migration, sampling rows, comparing, flagging mismatches. Useful for the first thirty days after a new target comes online.

MSK Connect + Debezium in depth

MSK Connect runs Kafka Connect workers as a managed fleet. You bring a custom plugin (the connector JAR, typically from Confluent Hub or a GitHub release), upload it to S3, register it as a plugin, and create a connector configured with your plugin plus environment (MSK cluster, VPC, subnets, security group, IAM auth).

Capacity. Specified in MCUs (MSK Connect Units), 1 MCU = 1 vCPU + 4 GB RAM. Autoscaling policy scales MCU count based on CPU utilisation with min/max bounds. For a Debezium source connector on a medium-throughput source, 2-4 MCUs is a reasonable starting point.

Debezium PostgreSQL source connector. Opens a logical replication slot (named configurably), reads WAL, publishes one topic per table by default (topic name <database.server.name>.<schema>.<table>), key is the primary key, value is a structured change envelope including op (c/u/d/r for create/update/delete/read-snapshot), before, after, source metadata, and ts_ms. Configuration is JSON; key fields:

database.hostname / port / user / password / dbname, the source.
plugin.name – pgoutput (built-in on PG 10+) or wal2json.
slot.name / publication.name, the logical-replication slot and publication Debezium uses.
table.include.list, which tables to capture.
snapshot.mode – initial (snapshot then stream), never (stream only), exported (use an externally-taken snapshot), when_needed.
heartbeat.interval.ms. Debezium sends periodic heartbeats so the replication slot advances even on quiet tables, preventing WAL bloat.
transforms. Single Message Transforms (SMTs): flatten (convert Debezium envelope to flat row), mask field, rename field, route by regex to a different topic.

Sink connectors. One per downstream. Confluent S3 sink writes records as Parquet or Avro, partitioned by field or time; OpenSearch sink indexes documents keyed by record key; JDBC sink writes to RDBMS. Each sink has its own connector task(s) managed by the Connect worker fleet.

Schema management. Debezium can emit Avro-serialised records with schemas registered in AWS Glue Schema Registry (or Confluent Schema Registry). Registered schemas let consumers deserialise without knowing the schema a priori, and evolution rules (backward/forward/full compatibility) prevent breaking consumers with ad-hoc producer changes. This is the feature that makes multi-consumer fan-out actually work at scale.

The picks, sketched

Warehouse only, nothing else on the horizon → DMS Serverless. One task, full-load + CDC, S3 Parquet target, Glue catalog registration. Two fewer layers of infrastructure than MSK Connect; DMS owns the slot, the capture, the write. Finance-friendly billing on the serverless model.

Three consumers, more likely on the way → MSK Connect + Debezium. One replication slot on the source, one source connector in Connect, three sink connectors or three consumer groups. Kafka is the shared log; adding a fourth consumer next quarter costs nothing on the source database. Worth the operational complexity because you already have three consumers; if you had one, DMS would be lighter.

Mix: DMS + MSK Connect. Not common but legitimate. DMS for the warehouse (because it does the S3-Parquet-with-day-partitioning ergonomics well out of the box) and MSK Connect for the live-event fan-out (fraud service + OpenSearch + whatever else). Two replication slots on the source is acceptable; the workloads are genuinely different shapes.

Zero-ETL for the specific pair. If the target is Redshift or OpenSearch and the source is Aurora Postgres, zero-ETL beats both DMS and MSK Connect on operational simplicity. Use it for exactly that pair; don’t contort other pipelines to force-fit it.

A worked task: DMS to S3

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "object-locator": { "schema-name": "public", "table-name": "orders" },
      "rule-action": "include"
    },
    {
      "rule-type": "transformation",
      "rule-id": "2",
      "rule-target": "column",
      "object-locator": {
        "schema-name": "public",
        "table-name": "orders",
        "column-name": "total_cents"
      },
      "rule-action": "rename",
      "value": "total_amount"
    },
    {
      "rule-type": "transformation",
      "rule-id": "3",
      "rule-target": "column",
      "object-locator": {
        "schema-name": "public",
        "table-name": "orders",
        "column-name": "ssn_hash"
      },
      "rule-action": "remove-column"
    }
  ]
}

Task settings select full-load-and-cdc. S3 target endpoint is configured with DataFormat: parquet, ParquetVersion: parquet-2-0, CompressionType: GZIP, BucketFolder: orders/, AddColumnName: true (so the file knows its schema), CdcPath: cdc/ (writes CDC as new files under this prefix), and DatePartitionEnabled: true with DatePartitionDelimiter: SLASH, DatePartitionSequence: YYYYMMDD. Output lands at s3://warehouse/orders/YYYY/MM/DD/file.parquet for the initial load and s3://warehouse/orders/cdc/YYYY/MM/DD/file.parquet for ongoing changes; the Glue crawler picks up both.

A worked connector: Debezium source

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "tasks.max": "1",
  "database.hostname": "orders-db.cluster-xxxx.eu-west-1.rds.amazonaws.com",
  "database.port": "5432",
  "database.user": "debezium",
  "database.password": "${ssm:/debezium/pg/password}",
  "database.dbname": "orders",
  "database.server.name": "orders",
  "plugin.name": "pgoutput",
  "slot.name": "debezium_slot",
  "publication.name": "debezium_pub",
  "publication.autocreate.mode": "filtered",
  "table.include.list": "public.orders",
  "snapshot.mode": "initial",
  "heartbeat.interval.ms": "30000",
  "transforms": "unwrap,mask",
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "transforms.unwrap.drop.tombstones": "false",
  "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.mask.fields": "ssn_hash",
  "transforms.mask.replacement": "",
  "key.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "https://glue.eu-west-1.amazonaws.com",
  "value.converter.schema.registry.region": "eu-west-1"
}

Records land on topic orders.public.orders, keyed by order_id, valued as Avro with schema registered in Glue Schema Registry. Three downstream consumers subscribe, the S3 sink connector reads the topic and writes Parquet, the OpenSearch sink connector indexes into orders-live, the fraud service’s consumer group reads from latest and scores in real-time.

What’s worth remembering

DMS is point-to-point; MSK Connect is fan-out through Kafka. If there’s one consumer, DMS is lighter. If there are many, the shared log pays for itself.
Full-load + CDC is solved by both. DMS’s full-load-and-cdc task type, Debezium’s snapshot.mode: initial. Same outcome, different configuration surface.
Source load scales with number of consumers, unless a shared log absorbs it. DMS opens a replication slot per task. MSK Connect opens one per Debezium source connector, regardless of how many sinks read from the resulting topics.
DMS Serverless is usually the default instance-based choice. Scales with workload, smaller minimum cost, fewer knobs. Instance-based DMS for edge cases (unusual VPC constraints, very low idle, specific plugin needs).
MSK Connect capacity is measured in MCUs. 1 MCU = 1 vCPU + 4 GB. Autoscaling policy with min/max; start at 2-4 MCUs for a source connector on a moderate table.
Glue Schema Registry is the schema broker. Debezium Avro records register schemas; consumers deserialise with compatibility rules. Critical for multi-consumer fan-out, without it, schema drift breaks consumers silently.
Zero-ETL exists for specific pairs. Aurora Postgres → Redshift, Aurora → OpenSearch. Use it when it fits; don’t force other pipelines into its shape.
Replication slot hygiene is a real concern. Stale slots retain WAL and balloon source disk. Monitor pg_stat_replication; alert when a slot’s lag is growing without bound.

DMS and MSK Connect solve overlapping problems with different architectures. The choice isn’t “which is better” but “how many consumers, how shaped, and how much operational surface do we want to own”. For one consumer, DMS Serverless. For three, MSK Connect. For the specific Aurora-to-Redshift pair, zero-ETL. For everything in between, draw the pipeline and see which side of the line it falls on.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.