How to Diagnose a Slow DMS Replication Task

January 03, 2028 · 14 min read

The situation

A fresh migration task: Postgres source (RDS, 800 GB, 120 tables), S3 target in Parquet, full-load + CDC, DMS replication instance dms.r6i.2xlarge (single AZ for non-prod, Multi-AZ in prod). Started on Monday morning; the monitoring dashboard shows full-load progress at 47% after 36 hours with no signs of accelerating.

The obvious things were checked:

The replication instance is at 35% CPU, 45% memory. Not saturated.
The source Postgres is at 40% CPU and I/O wait is moderate. It can give us more.
The target S3 bucket has no request-rate issues; CloudWatch shows PutObject latency flat.
Network between replication instance and source, instance and target: gigabits per second, untroubled.
No errors in the task logs; no retries; no warnings about commit deadlocks or slot lag.

Everything looks fine, and nothing is fast. This is the shape of tuning work: the system hasn’t broken; it’s just wasting compute by not using it.

What actually matters

Before reaching for configuration changes, it’s worth being explicit about where DMS spends its time.

A DMS task has four stages running concurrently for each table it processes.

Source read. The replication instance’s source pool issues queries to the source database (SELECT * FROM <table> for full-load, or reads from the WAL/binlog for CDC) and streams rows back. Rate limited by source query throughput (sort key, indexes, network) and by the number of parallel source-read threads.

Transform. The replication instance applies table mappings, transformations (rename, drop, add, filter), data-type conversions, and any computed columns. CPU-bound on the instance.

Target write. The instance issues writes to the target. For S3 this is buffering Parquet row groups in memory and flushing them as objects; for a database target it’s batched INSERT/UPDATE/DELETE statements. Rate limited by target throughput and by the number of parallel target-write threads.

Task coordination. The task controller manages shard-to-thread assignment, progress tracking, and the hand-off from full-load to CDC. Cheap but sequential where it serialises state.

Each stage has its own throughput ceiling; the task’s actual rate is the minimum of the four. If source read is the bottleneck, adding target parallelism does nothing. If target write is the bottleneck, tuning the source-read batch size does nothing. The diagnostic question is: which stage is pacing the others?
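To make the "minimum of the four" idea concrete, here is a minimal sketch. The stage names and the rows-per-second figures are invented for illustration; DMS does not report ceilings in this form.

```python
# Hypothetical per-stage throughput ceilings in rows/second (illustrative only).
stage_ceilings = {
    "source_read": 42_000,
    "transform": 180_000,
    "target_write": 95_000,
    "coordination": 500_000,
}

def pacing_stage(ceilings):
    """Return (stage, rate) for the slowest stage, which sets the task's rate."""
    stage = min(ceilings, key=ceilings.get)
    return stage, ceilings[stage]

stage, rate = pacing_stage(stage_ceilings)
print(f"task rate ~ {rate} rows/s, paced by {stage}")

# Doubling a stage that is not the bottleneck leaves the task rate unchanged.
boosted = dict(stage_ceilings, target_write=stage_ceilings["target_write"] * 2)
print(pacing_stage(boosted) == (stage, rate))
```

With these numbers the source read paces the task, so doubling target-write capacity changes nothing.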

The second consideration is what “table” means in the parallelism story. The task can be configured for a mix of parallel loading (multiple tables at once) and, for large tables, segment loading (splitting one table across threads by a key range or partition). With 120 tables and a modest default concurrency, a handful load at once and the rest queue. If the big tables are in the front of the queue, the tail is fast; if a 400-GB table is alone late in the queue with no other work to fill the time, the task runs single-table for most of its life.
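The queue-order effect can be sketched with a toy scheduler. It assumes tables are taken from a FIFO queue by a fixed number of slots and that each table has a known load time; both are simplifications, and the table sizes are made up.

```python
import heapq

def makespan(table_hours, slots):
    """Simulate a FIFO queue of tables served by `slots` workers; return total hours."""
    free_at = [0.0] * slots              # time at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for hours in table_hours:
        start = heapq.heappop(free_at)   # next table takes the earliest free slot
        end = start + hours
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

small = [1.0] * 80    # eighty hypothetical 1-hour tables
whale = [20.0]        # one hypothetical 20-hour table

whale_first = makespan(whale + small, slots=8)
whale_last = makespan(small + whale, slots=8)
print(f"whale first: {whale_first} h, whale last: {whale_last} h")
```

In this toy model the whale-last ordering spends its final 20 hours with seven of the eight slots idle, which is the single-table tail described above.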

The third consideration is LOB handling, because large-object columns (TEXT, BLOB, XML, JSONB) have a specific nasty mode in this kind of replicator: default settings often truncate at a small ceiling. If the table has LOBs and the job hasn’t been configured for them, the problem may not be “slow” at all but “running and wrong”: LOB values over the ceiling are being silently truncated.

The fourth is commit-and-apply batching on the target. The replicator batches rows before writing; too-small batches mean too many round-trips, too-large batches mean long flush pauses that look like stalls.

What we’ll filter on

Distilling the tuning surface into knobs we can score:

Task-level parallelism, how many tables load concurrently, and with how many sub-threads per large table?
Source-read batching, how many rows per fetch from the source?
Target-write batching, how many rows per commit or flush to the target?
LOB handling, full-LOB (unlimited but slow) vs limited-LOB (fast but truncated at max size) vs inline-LOB (fast and complete if under a threshold)?
Replication instance sizing, do CPU, memory, and network leave headroom? This is the floor under all of the above.

The tuning landscape

Task settings. The TaskSettings JSON is where the bulk of the knobs live. The headline fields:

FullLoadSettings.MaxFullLoadSubTasks, how many tables load concurrently (default 8, max 49). Raise when tables are small-to-moderate and the instance has headroom; the tail gets shorter.
FullLoadSettings.CommitRate, batch size for target writes during full-load (default 10,000 rows, maximum 50,000). Tune it to the target; for S3 Parquet, larger batches produce fewer, larger Parquet objects, which query engines prefer.
FullLoadSettings.TransactionConsistencyTimeout, how long DMS waits for open transactions on the source before starting full-load. Relevant only for transactional sources.
FullLoadSettings.StopTaskCachedChangesApplied, tells DMS to stop the task once full-load is complete and the cached changes have been applied; useful for one-shot migrations.
TargetMetadata.ParallelLoadThreads, threads per table during full-load (for supported targets). Default varies by target. For S3, raising it multiplies the per-table throughput if the source can sustain the read.
TargetMetadata.ParallelLoadBufferSize, buffer size per parallel thread; too-small means frequent flushes, too-large means memory pressure.
TargetMetadata.BatchApplyEnabled (CDC), batch CDC changes to the target instead of row-by-row. Dramatically faster for high-volume CDC on targets that support it.
LOB settings (under TargetMetadata; the task JSON spells the mode flags FullLobMode and LimitedSizeLobMode):
- FullLob, unlimited LOB size, one row at a time, slow.
- LimitedSizeLob + LobMaxSize, fast, but rows over the limit have LOB columns truncated.
- InlineLob (FullLobMode with InlineLobMaxSize set), rows with LOBs under the limit get fast path; over the limit, slow path is used for that row only. Usually the correct answer.
ErrorBehavior.DataErrorPolicy / TableErrorPolicy, how DMS reacts to data and table errors (log and continue, fail, stop). Too-strict policies stop tasks on minor issues; too-lax policies hide real problems.
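As a rough picture of how these fields sit together, the sketch below builds a partial TaskSettings document as a Python dict and prints it as JSON. The values are illustrative, not recommendations. The LOB fields are placed under TargetMetadata, which is where I understand them to live; check the current DMS task-settings reference before copying any of this.

```python
import json

# Partial TaskSettings document; values are placeholders, not tuning advice.
task_settings = {
    "FullLoadSettings": {
        "MaxFullLoadSubTasks": 8,
        "CommitRate": 10000,
        "TransactionConsistencyTimeout": 600,
        "StopTaskCachedChangesApplied": False,
    },
    "TargetMetadata": {
        "ParallelLoadThreads": 0,
        "ParallelLoadBufferSize": 0,
        "BatchApplyEnabled": False,
        "SupportLobs": True,
        "FullLobMode": False,
        "LimitedSizeLobMode": True,
        "LobMaxSize": 32,          # KB
        "InlineLobMaxSize": 0,     # KB; only meaningful with FullLobMode
    },
    "ErrorBehavior": {
        "DataErrorPolicy": "LOG_ERROR",
        "TableErrorPolicy": "SUSPEND_TABLE",
    },
}

def check_sub_tasks(settings, maximum=49):
    """Guard against a MaxFullLoadSubTasks value outside 1..maximum."""
    value = settings["FullLoadSettings"]["MaxFullLoadSubTasks"]
    return 1 <= value <= maximum

print(check_sub_tasks(task_settings))
print(json.dumps(task_settings, indent=2))
```

Keeping the settings in code rather than hand-edited JSON makes it easier to change one knob per run and diff what moved.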

Table mappings, parallel-load rules. In the table-mapping JSON, a table-settings rule with parallel-load defines how DMS splits a single table across threads.

type: partitions-auto, lets DMS pick partitions for splitting. Simple; works for reasonably-sized tables with known partitioning.
type: partitions-list, you name the partitions to use as split points. Fine control.
type: ranges, split by a column range (e.g. order_id between 1..1M, 1M..2M, 2M..3M). Works for any integer or date column with reasonable distribution; the most flexible.
type: none, no parallel load for this table. Default for most tables.

Parallel-load across ranges is the knob that turns a 400-GB table from a single-threaded drag into an 8-thread load that can finish in roughly an eighth of the time, provided the source can sustain the concurrent reads. Pick the range column carefully: a column with hot ranges (most rows clustered in the most recent order_id or date ranges) doesn’t distribute evenly.
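A ranges rule can be generated rather than typed by hand. The sketch below follows the table-settings rule shape as I understand it from the DMS table-mapping documentation; the schema name, rule id, rule name, and boundary values are hypothetical. Note that N boundaries produce N + 1 segments, so eight segments need seven boundaries.

```python
import json

def ranges_rule(rule_id, schema, table, column, boundaries):
    """Build a table-settings rule that splits `table` on `column`.

    N boundaries give N + 1 segments; the last segment is everything
    above the final boundary.
    """
    return {
        "rule-type": "table-settings",
        "rule-id": str(rule_id),
        "rule-name": f"parallel-load-{table}",
        "object-locator": {"schema-name": schema, "table-name": table},
        "parallel-load": {
            "type": "ranges",
            "columns": [column],
            "boundaries": [[str(b)] for b in boundaries],
        },
    }

# Hypothetical boundaries: 7 cut points -> 8 segments.
bounds = [i * 100_000_000 for i in range(1, 8)]
rule = ranges_rule(10, "public", "orders", "order_id", bounds)
print(json.dumps({"rules": [rule]}, indent=2))
```

A real mapping document also needs the usual selection rules alongside this one; they are omitted here to keep the sketch short.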

The tuning flow

Ask in order: is the bottleneck the instance, then the source, then the table shape, then LOB handling, then parallelism? Each question narrows the knob set; tune one thing at a time so you can see what moved.

The picks, step by step

Step 1: identify the bottleneck before tuning. The worst DMS mistake is reaching for MaxFullLoadSubTasks immediately. Check the replication instance CPU/memory first, then the source, then look at per-table progress in the DMS table-statistics view. If 3 of the 120 tables are 90% of the task’s total rows and they’re running one at a time, parallelism on the wrong axis won’t help; you need parallel-load on those specific tables.
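Finding the dominant tables is a small calculation once per-table row counts are in hand. This sketch assumes the counts have already been pulled into a plain dict (from the table-statistics view or from the source catalog); the table names and counts are hypothetical.

```python
def dominant_tables(row_counts, share=0.90):
    """Return the fewest largest tables that together cover `share` of all rows."""
    total = sum(row_counts.values())
    picked, covered = [], 0
    ranked = sorted(row_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, rows in ranked:
        if covered >= share * total:
            break
        picked.append(name)
        covered += rows
    return picked

# Hypothetical row counts, in millions.
row_counts = {
    "orders": 6000,
    "order_items": 2000,
    "audit_log": 1200,
    "customers": 200,
    "products": 200,
    "sessions": 200,
    "events": 200,
}
print(dominant_tables(row_counts))
```

The tables this returns are the candidates for parallel-load rules; everything else is better served by table-level concurrency.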

Step 2: size the replication instance for headroom. DMS is CPU- and memory-heavy on the instance. A task hitting 80%+ CPU is throttling itself; scale up to create headroom and re-measure. The instance cost is small compared to the engineer time lost to a slow migration. DMS Serverless takes this decision away: it scales DCUs automatically.

Step 3: parallel-load the whales. The tables that dominate total-row counts are where wall-clock savings live. Pick an even-distribution column (not created_at if new data clusters at the top) and use type: ranges to split across 4-8 threads per table. A 400-GB table on 8 parallel threads, each reading its range with its own query to the source, can run in roughly an eighth of the single-threaded time if the source can sustain the read.
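When the key is skewed, equal-width ranges give uneven segments. One option, sketched below, is to take a sample of key values from the source and cut at equal row counts instead. The sample here is synthetic (sparse old keys, dense recent ones), and the function names are my own.

```python
from bisect import bisect_right

def equal_width_boundaries(lo, hi, segments):
    """Cut [lo, hi] into equal-width ranges; returns segments - 1 boundaries."""
    return [lo + (hi - lo) * i / segments for i in range(1, segments)]

def equal_count_boundaries(sample, segments):
    """Cut at sample quantiles so each range holds about the same number of rows."""
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(n * i) // segments - 1] for i in range(1, segments)]

def segment_counts(sample, boundaries):
    """Count sampled keys per segment (a key equal to a boundary goes to the lower one)."""
    ordered = sorted(sample)
    cuts = [bisect_right(ordered, b) for b in boundaries] + [len(ordered)]
    return [hi - lo for lo, hi in zip([0] + cuts[:-1], cuts)]

# Synthetic skewed sample: 10 sparse old keys, 90 dense recent keys.
sample = list(range(0, 1000, 100)) + list(range(1000, 1090))

width_bounds = equal_width_boundaries(min(sample), max(sample), 4)
count_bounds = equal_count_boundaries(sample, 4)
print("equal width:", segment_counts(sample, width_bounds))
print("equal count:", segment_counts(sample, count_bounds))
```

On this sample the equal-width split leaves one segment with almost all the rows, while the quantile split balances them; that one overloaded segment would set the wall-clock time for the whole table.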

Step 4: tune LOB handling. If any table has LOB columns and the task is in FullLob mode, switch to InlineLob with a threshold (say 32 KB) that covers 95%+ of rows. The common rows take the fast path; outliers take the slow path but only for themselves. If you don’t know whether tables have LOBs, check the source schema; a single unnoticed TEXT column can halve throughput.
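Choosing the threshold is a percentile calculation over observed LOB sizes. The sketch assumes you have sampled LOB lengths in bytes from the source (for example with a length query on the column); the sizes below are invented.

```python
import math

def inline_lob_threshold_kb(lob_sizes_bytes, coverage_pct=95):
    """Smallest whole-KB threshold covering `coverage_pct` percent of sampled LOBs."""
    ordered = sorted(lob_sizes_bytes)
    rank = math.ceil(coverage_pct * len(ordered) / 100)   # nearest-rank percentile
    return math.ceil(ordered[rank - 1] / 1024)

def covered_fraction(lob_sizes_bytes, threshold_kb):
    """Share of sampled LOBs that fit under the threshold."""
    limit = threshold_kb * 1024
    return sum(size <= limit for size in lob_sizes_bytes) / len(lob_sizes_bytes)

# Invented sample: mostly 4 KB values, a few at 20 KB, a handful of 2 MB outliers.
sizes = [4096] * 90 + [20480] * 6 + [2 * 1024 * 1024] * 4

threshold = inline_lob_threshold_kb(sizes)
print(threshold, "KB covers", covered_fraction(sizes, threshold))
```

A threshold picked this way keeps the outliers on the slow path without dragging the common rows along with them.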

Step 5: tune batch sizes for the target. CommitRate and target-specific batch settings control how many rows buffer before writing. For S3 Parquet, raising CommitRate toward its 50,000 maximum produces fewer, larger Parquet objects, which suits downstream query engines and means fewer target-write round-trips. For RDBMS targets, batch sizes affect transaction size; too large and you risk long commit stalls on the target.
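The round-trip arithmetic is simple enough to check directly. This sketch assumes one target write per CommitRate rows, which is a simplification of what the replicator actually does, and uses a made-up row count.

```python
import math

def flush_count(rows, commit_rate):
    """Number of target writes if every batch holds up to `commit_rate` rows."""
    return math.ceil(rows / commit_rate)

rows = 1_000_000_000   # hypothetical table size
for rate in (10_000, 50_000):
    print(f"CommitRate {rate}: {flush_count(rows, rate)} flushes")
```

Under that assumption, moving from 10,000 to 50,000 cuts the number of writes by a factor of five; whether wall-clock time follows depends on which stage is pacing the task.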

Step 6: enable BatchApplyEnabled for CDC. Once full-load finishes and CDC begins, BatchApplyEnabled: true (for supported targets) batches change-events instead of applying row-by-row. Dramatic difference on high-volume CDC. Check the DMS documentation for whether your target supports it and what constraints apply.

Step 7: raise task-level parallelism last. MaxFullLoadSubTasks from 8 to 16 or 32 is the knob that people turn first because it’s the most visible. Raise it after the earlier steps have removed the underlying bottlenecks; otherwise you’re just scheduling more idle threads.

A worked tuning: from 16 hours remaining to 4

Starting state: 36h elapsed, 47% complete, r6i.2xlarge at 35% CPU.

Source inspection: three tables dominate (orders, order_items, audit_log) at 60%, 20%, 12% of total rows. The other 117 tables are 8% combined.
orders has a created_at column with even daily distribution; order_id is monotonic and evenly distributed.
No LOB columns.
CDC will follow; BatchApplyEnabled should be on once we get there.

Actions:

Add parallel-load range rule for orders on order_id, 8 ranges: 1-100M, 100M-200M, …, 700M-800M.
Add parallel-load range rule for order_items on order_id, 4 ranges.
Add parallel-load range rule for audit_log on id, 4 ranges.
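Raise MaxFullLoadSubTasks to 16. The three range rules produce 8 + 4 + 4 = 16 segments; as far as I understand it, each segment occupies a sub-task slot, so 16 lets the big three run fully in parallel and the other 117 tables pick up slots as segments finish.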
Raise CommitRate to 50,000 for S3 Parquet.
Set BatchApplyEnabled: true for the upcoming CDC phase.
Stop the task, apply the new settings, and resume it rather than reloading the target (on resume, full-load state is preserved for already-loaded tables).

Result: remaining full-load drops from 16 hours predicted to ~4 hours actual. Post-full-load CDC catches up in 10 minutes instead of 2 hours. Total task wall-clock: ~40 hours (36 already elapsed plus ~4) instead of the ~52 predicted (36 plus 16); most of that total is the before work that can’t be rewound.

What’s worth remembering

Tune the bottleneck, not the biggest number. Check instance, source, per-table distribution, LOB, batch sizes, parallelism, in that order. Tuning the wrong one doesn’t help and hides the real problem.
MaxFullLoadSubTasks is about table count; ParallelLoadThreads is about threads per table. Small-to-moderate tables benefit from raising the first; large tables benefit from the second.
Parallel-load by ranges is the big-table knob. A 400-GB table single-threaded is a slog; split into 8 ranges and it finishes in an eighth of the time, assuming the source can sustain the concurrent read.
LOB mode matters more than people think. FullLob is safe-but-slow; LimitedSizeLob is fast-but-truncated (dangerous); InlineLob is the usual correct answer.
CommitRate and batch-apply settings affect target throughput. Raise for large-batch-friendly targets (S3 Parquet); verify for RDBMS targets where transaction size has its own effects.
DMS Serverless removes instance sizing from the conversation. If you don’t want to think about replication-instance CPU, use DCUs. Classic DMS is still correct when you need specific VPC configurations or plugin-level control.
Monitor per-table statistics in the DMS console. The table-level view shows rows loaded per table in real time. That’s where bottlenecks show: an unmoving row count on a specific table points to what to tune.
Post-full-load matters too. CDC’s BatchApplyEnabled, CDC thread counts, and change-record handling are separate tuning surfaces from full-load. Don’t tune full-load, declare victory, and watch CDC fall over at go-live.
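The "unmoving row count" check from the list above can be automated by comparing two snapshots of per-table statistics. The snapshot shape here (a dict of table name to rows loaded and a done flag) is my own invention for the sketch; adapt it to whatever your monitoring pulls from DMS.

```python
def stalled_tables(before, after):
    """Tables still loading whose row count did not advance between snapshots."""
    return sorted(
        name
        for name, stats in after.items()
        if name in before
        and not stats["done"]
        and stats["rows"] <= before[name]["rows"]
    )

# Hypothetical snapshots taken a few minutes apart.
before = {
    "orders": {"rows": 100_000, "done": False},
    "audit_log": {"rows": 10_000, "done": False},
    "users": {"rows": 50_000, "done": True},
}
after = {
    "orders": {"rows": 100_000, "done": False},     # no progress
    "audit_log": {"rows": 30_000, "done": False},   # advancing
    "users": {"rows": 50_000, "done": True},        # finished, so not stalled
}
print(stalled_tables(before, after))
```

Finished tables are excluded on purpose, since their counts stop moving for a good reason.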

DMS tuning isn’t arcane; it’s systematic. Each stage of the task has its own knobs, and the art is knowing which stage is limiting the rest. The “slow” task in the opener is usually one of four shapes: instance too small, source overloaded, single table single-threaded, or LOBs in full mode. Diagnose before tuning; tune one thing at a time; watch what moved.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.