The situation
The organisation has 80 Windows and Linux VMs in an on-prem data centre: ERP, middleware, legacy file servers, a few custom applications that resist refactoring. Modernisation is a multi-year project. DR cannot wait.
Today the DR plan is “backup tapes, a contract with a recovery-as-a-service provider, an 8-hour RTO we’ve never actually tested.” The board has asked for an RTO under 30 minutes, tested quarterly. Going to a second data-centre would cost seven figures in hardware and two years of implementation. Replicating to AWS is faster and cheaper, if we pick the correct tool.
Options considered:
- AWS Elastic Disaster Recovery (DRS): block-level replication of whole servers to AWS; instances launch from the replicated disks on failover.
- AWS Backup for VMware: image-level backups to an AWS Backup vault.
- AWS Application Migration Service (MGN): the migration-focused sibling of DRS (same agent, different workflow).
- VMware Cloud on AWS: lift the whole vSphere environment into an AWS-hosted vSphere; full DR via vSphere’s own tools.
What actually matters
The core trade in cloud DR for on-prem workloads is replication fidelity in exchange for cost. File-level backups are cheap per GB but slow to restore (rebuild the OS, restore files). Block-level replication keeps a live disk mirror but costs per-GB of storage plus constant replication bandwidth. Image-level backups sit in between.
The first thing to ask is: how quickly must the workload come back? Under 30 minutes means the replica has to be ready-to-boot, which means block-level replication. Hours-long RTOs are fine with image-level backups. Days-long RTOs tolerate file-level.
The second is: how much data loss is acceptable? Continuous block-level replication typically gives RPOs of seconds. Scheduled image-level backups give an RPO equal to the interval. File-level tends to be nightly.
The third is: what do we need to fail back? After the primary data-centre is repaired, the workload usually needs to go home. Some approaches automate failback; some leave it as a manual rebuild. The DR option has to support the round-trip, not just the outbound leg.
The fourth is: what’s the test burden? DR is only real if tested. A pattern that lets us launch the replica into a non-production network without disrupting ongoing replication can be drilled quarterly; a pattern that requires a full restore to a sandbox is a half-day exercise nobody books often enough.
The fifth is: what does the source look like? Each option supports a different set of source types: physical servers, VMware, Hyper-V, other clouds, AWS-to-AWS. Some require the whole estate to be one hypervisor; some accept anything that can run an agent. Source compatibility is often the constraint that narrows the field first.
What we’ll filter on
- RTO: how fast does the replica boot?
- RPO: how much data can we lose?
- Source compatibility: physical, VMware, Hyper-V, cloud?
- Failback supported: can we go home after?
- Steady-state cost: what’s the bill when nothing has failed?
The DR-for-on-prem landscape
- AWS Elastic Disaster Recovery (DRS). Continuous block-level replication from on-prem (or another cloud) to AWS. A lightweight agent on each source machine replicates disk changes to a staging subnet in the DR AWS account. During steady state, no instances run for the replicated servers; the only footprint is the staging area (EBS volumes and snapshots, plus a small pool of shared replication servers). On failover, DRS launches instances from the staged disks in minutes, using a pre-configured launch template per source server.
- AWS Backup for VMware. Image-level backups of VMware VMs to AWS Backup vaults. Scheduled, not continuous. Restore creates a new VM from the backup image; it works, but it is slower than DRS.
- AWS Application Migration Service (MGN). Same agent as DRS, but the workflow is migration-focused: cut over, finalise, decommission the source. Not intended for long-running DR. Worth mentioning because two services sharing one agent causes confusion.
- VMware Cloud on AWS. An AWS-hosted vSphere environment; DR via VMware SRM (Site Recovery Manager). Expensive (it’s AWS bare metal running VMware); useful when the team must keep vSphere for operational reasons.
- Custom replication via Storage Gateway. File Gateway or Volume Gateway for filesystem-level replication. Doesn’t provide a bootable instance on the other end; you’d need to pair it with AMIs to launch from. Niche.
- Snapshot-and-ship. Nightly VM snapshots uploaded to S3, AMIs built from them, instances launched on failover. Hand-rolled. Roughly the pattern DRS automates; reinventing it is a trap.
Side by side
| Option | RTO | RPO | Source types | Failback | Steady-state cost |
|---|---|---|---|---|---|
| DRS | Minutes | Seconds | Physical, VMware, Hyper-V, cloud | ✓ | Replication storage + staging |
| AWS Backup for VMware | Hours | Backup interval (hourly/daily) | VMware only | Manual | Backup storage |
| MGN | Minutes (migration event) | Seconds | Same as DRS | Not a DR pattern | Same as DRS during cutover |
| VMware Cloud | Minutes | Seconds (SRM) | VMware only | ✓ (via SRM) | vSphere cluster on AWS |
| Storage Gateway + AMI | Hours | Hours | Filesystem only | Manual | Gateway + S3 |
| Snapshot-and-ship | Hours | Day | Anything | Manual | S3 + AMIs |
For 80 mixed-OS VMs with a 30-minute RTO target, DRS is the fit. VMware Cloud on AWS would also work but at much higher cost.
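The filtering above can be written down as a few lines of Python. This is only a sketch of the reasoning, not a sizing tool: the numeric RTO bounds (30 minutes for "Minutes", 240 for "Hours") are assumptions chosen to stand in for the table's qualitative entries, MGN is left out because it is not a DR pattern, and the `DrOption` and `shortlist` names are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DrOption:
    name: str
    rto_minutes: int            # assumed upper bound standing in for "Minutes"/"Hours"
    sources: frozenset          # source types the option accepts
    automated_failback: bool    # True where the table shows a tick

ANY_SERVER = frozenset({"physical", "vmware", "hyper-v", "cloud"})

OPTIONS = [
    DrOption("DRS", 30, ANY_SERVER, True),
    DrOption("AWS Backup for VMware", 240, frozenset({"vmware"}), False),
    DrOption("VMware Cloud on AWS", 30, frozenset({"vmware"}), True),
    DrOption("Storage Gateway + AMI", 240, frozenset({"filesystem"}), False),
    DrOption("Snapshot-and-ship", 240, ANY_SERVER, False),
]

def shortlist(options, max_rto_minutes, required_sources, need_failback):
    """Return the names of options that pass every filter."""
    return [
        o.name
        for o in options
        if o.rto_minutes <= max_rto_minutes
        and set(required_sources) <= o.sources
        and (o.automated_failback or not need_failback)
    ]

# An all-VMware estate with a 30-minute RTO and automated failback:
print(shortlist(OPTIONS, 30, {"vmware"}, True))
# Add one physical server to the estate and the field narrows further:
print(shortlist(OPTIONS, 30, {"vmware", "physical"}, True))
```

Under these assumptions the first call leaves DRS and VMware Cloud on AWS, and the second leaves DRS alone, which matches the conclusion above; cost is what separates the two survivors.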
The DRS pipeline
The picks in depth
The replication agent. A small process installed on each source VM (aws-replication-agent on Linux, an equivalent service on Windows). It reads disk blocks directly from the source’s block device, compresses and encrypts them, and streams them to the replication servers in the staging subnet. On first run, it does an initial sync of the full disk (this can take hours to days for a multi-TB server over a slow link). After that, it tracks incremental block changes and streams only the deltas.
The agent runs with minimal impact: ~3-5% of CPU typically, proportional to disk churn. It reads from the device, not through the filesystem, so it captures everything on disk regardless of OS file locks.
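The "hours to days" figure for the initial sync is simple arithmetic, and it is worth running before committing to a go-live date. The sketch below assumes decimal units (1 GB = 10^9 bytes), ignores the agent's compression, and uses a made-up `efficiency` factor of 0.7 to represent the share of the link that replication actually gets; all three are assumptions, not DRS behaviour.

```python
def initial_sync_hours(disk_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours needed to copy disk_gb of data over a link_mbps link.

    efficiency is the assumed usable fraction of the link (0 < efficiency <= 1).
    Compression is ignored, so this is a pessimistic estimate.
    """
    if disk_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("disk_gb must be >= 0, link_mbps > 0, efficiency in (0, 1]")
    total_bits = disk_gb * 8 * 1000**3              # GB -> bits
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return total_bits / usable_bits_per_second / 3600

# A 2 TB server over a 100 Mbps link at 70% usable throughput:
hours = initial_sync_hours(2000, 100)
print(f"{hours:.1f} hours, about {hours / 24:.1f} days")
```

With those inputs the estimate comes out at roughly 63 hours, so a handful of multi-TB servers on a shared 100 Mbps link is a week-scale exercise, not an overnight one.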
The staging subnet. A dedicated subnet in the DR VPC, no public IPs, outbound internet access (for the replication servers to call back to DRS). Replication servers are tiny EC2 instances (t3.small by default) that DRS launches automatically, roughly one per 15-30 sources, depending on throughput. They terminate themselves when idle and relaunch when replication resumes.
The staged EBS volumes are encrypted with a KMS key in the DR account. They hold the latest point-in-time version of each source’s disks, plus DRS-managed snapshots for point-in-time recovery (retained for 7 days by default; the retention period is configurable).
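Point-in-time recovery comes down to one decision: which snapshot is the newest one taken before things went wrong? A minimal sketch of that choice follows. The timestamps are hard-coded and hypothetical; in practice they would come from the DRS console or API, and the function name is ours.

```python
from bisect import bisect_left
from datetime import datetime
from typing import Optional, Sequence

def latest_snapshot_before(snapshots: Sequence[datetime],
                           incident: datetime) -> Optional[datetime]:
    """Return the newest snapshot taken strictly before the incident, or None."""
    ordered = sorted(snapshots)
    idx = bisect_left(ordered, incident)   # index of first snapshot at/after incident
    return ordered[idx - 1] if idx > 0 else None

# Hypothetical snapshot history, ten minutes apart:
history = [
    datetime(2024, 3, 5, 13, 30),
    datetime(2024, 3, 5, 13, 40),
    datetime(2024, 3, 5, 13, 50),
]
# Corruption first noticed at 13:47, so the 13:40 snapshot is the safe choice.
print(latest_snapshot_before(history, datetime(2024, 3, 5, 13, 47)))
```

Returning `None` when every snapshot post-dates the incident is deliberate: it forces the operator to notice that the retention window was too short, instead of silently launching from a corrupted image.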
The launch template. Per source server, DRS stores a launch template: target instance type, target subnet, security groups, IAM instance profile, IP assignment strategy, tags. On failover or drill, DRS creates an EC2 instance from the launch template, attaches the staged EBS volumes as boot + data disks, and boots. First boot takes 5-10 minutes depending on OS.
The launch also applies OS-level adjustments during the DRS boot process: replacing network drivers if moving from VMware to EC2 (the agent stages these changes in advance), and injecting AWS credentials or SSM Agent if configured. The result is an EC2 instance that looks like the source server from the application’s perspective, with the same hostname and disks, and the same private IP when the launch settings preserve it.
Source-to-target mapping for networks. The source VM’s on-prem IP (say 10.250.10.42) doesn’t exist in AWS. The launch template can specify either:
- Keep private IP: the target instance gets a specific private IP in the DR VPC, chosen to match the on-prem IP when the DR VPC CIDR overlaps with on-prem.
- DHCP: the target gets any IP; application-layer DNS updates point to the new IP.
For applications that use IP literals (which is often the case for legacy systems), matching the on-prem IP in DR VPC is the cleaner path. Requires planning the DR VPC CIDR to match on-prem.
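Whether "keep private IP" is even possible for a given server is a mechanical check: the on-prem address has to fall inside the target subnet and must not collide with the addresses AWS reserves in every subnet (the first four and the last one). A sketch using only the standard library follows; the function name and the returned labels are ours, not DRS settings.

```python
import ipaddress

def ip_strategy(on_prem_ip: str, dr_subnet_cidr: str) -> str:
    """Return 'keep-private-ip' if the on-prem address is usable in the DR subnet,
    otherwise 'dhcp' (meaning the application must be re-pointed via DNS)."""
    ip = ipaddress.ip_address(on_prem_ip)
    subnet = ipaddress.ip_network(dr_subnet_cidr)
    if ip not in subnet:
        return "dhcp"
    # AWS reserves the first four addresses and the last address of each subnet.
    offset = int(ip) - int(subnet.network_address)
    if offset < 4 or ip == subnet.broadcast_address:
        return "dhcp"
    return "keep-private-ip"

print(ip_strategy("10.250.10.42", "10.250.10.0/24"))  # DR subnet mirrors on-prem
print(ip_strategy("10.250.10.42", "10.60.0.0/24"))    # DR subnet uses a different range
```

Running this across the whole inventory before building the DR VPC shows quickly how many servers can keep their address and how many will need the DNS route instead.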
Recovery plans. A recovery plan groups sources (e.g., “ERP cluster”) and declares a launch order (database first, app tier second, web tier third) with wait conditions and custom post-launch actions (e.g., run an SSM document to re-register with Active Directory). A launch-in-order plan ensures the app tier doesn’t boot before its database is up.
Recovery plans are what the operator triggers during a real failover. One click starts the whole pipeline; DRS handles the rest.
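Launch ordering is a dependency-graph problem, and it helps to see it that way when designing a plan. The sketch below is not how DRS stores plans; it is a plain-Python illustration that turns "X must be up before Y" into launch waves using the standard library's `graphlib` (Python 3.9+). The server name `erp-web01` and the `launch_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def launch_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group servers into waves; everything in a wave can launch in parallel
    once all earlier waves are up. Raises graphlib.CycleError on circular deps."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # sorted only to keep output stable
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Each key lists the servers it must wait for.
erp_plan = {
    "erp-db01": set(),
    "erp-app01": {"erp-db01"},
    "erp-app02": {"erp-db01"},
    "erp-web01": {"erp-app01", "erp-app02"},
}
for number, wave in enumerate(launch_waves(erp_plan), start=1):
    print(number, wave)
```

Writing the dependencies down this way has a side benefit: a circular dependency (the app tier needs a licence server that needs the app tier) fails loudly at design time instead of during a failover.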
Recovery drills. DRS supports two kinds of launch: recovery (for a real DR event) and drill (into a separate VPC for testing). Drill launches don’t affect replication: sources keep streaming changes to staging, so the drill is non-disruptive. After the drill, operators terminate the drill instances and clean up.
Recommended cadence: monthly drill for a subset of servers, quarterly full-scale drill. Each drill validates the launch templates, the network plumbing in DR VPC, the post-launch SSM automation, and, most importantly, the operators’ confidence.
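For teams that script their drills, the difference between a drill and a real recovery is a single flag on the request. The sketch below only builds the request payload, so it runs without AWS credentials. The field names (`isDrill`, `sourceServers`, `sourceServerID`, `recoverySnapshotID`) reflect our understanding of the boto3 `drs` client's `start_recovery` call and should be checked against the current API reference; the server IDs are placeholders.

```python
from typing import Optional

def build_start_recovery_request(source_server_ids: list[str],
                                 is_drill: bool,
                                 snapshot_ids: Optional[dict[str, str]] = None) -> dict:
    """Build keyword arguments for a DRS launch.

    snapshot_ids optionally maps a server ID to a specific point-in-time
    snapshot; servers without an entry launch from the latest state.
    """
    snapshot_ids = snapshot_ids or {}
    servers = []
    for server_id in source_server_ids:
        entry = {"sourceServerID": server_id}
        if server_id in snapshot_ids:
            entry["recoverySnapshotID"] = snapshot_ids[server_id]
        servers.append(entry)
    return {"isDrill": is_drill, "sourceServers": servers}

# Placeholder IDs for a monthly subset drill:
request = build_start_recovery_request(["s-1111111111example", "s-2222222222example"],
                                       is_drill=True)
print(request)

# Assumed usage, not executed here:
#   import boto3
#   boto3.client("drs").start_recovery(**request)
```

Keeping the payload builder separate from the API call makes it easy to unit-test the monthly subset selection, and it makes accidentally passing `is_drill=False` in a drill script a reviewable one-line diff.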
Failback. After the on-prem data-centre is repaired, the workload needs to go home. DRS supports failback:
- While the workload runs in AWS, the DRS agent (still installed on the recovered servers, which are now running as EC2 instances) reverses the replication direction.
- Changes made in AWS stream back to the on-prem VMs.
- Operator schedules a cutover during a maintenance window, stops the AWS instances, starts the on-prem VMs.
- Replication direction flips again; AWS is the standby once more.
Failback works but isn’t magic: it requires Direct Connect/VPN bandwidth, takes a full resync first (hours to days), and has its own test burden.
Cost model. DRS charges per source server per hour (roughly $0.028/hour, ~$20/month per server). On top of that come EBS storage for the replicated volumes (billed per GB-month at gp3 rates), the small replication-server EC2 instances (auto-scaled), and any cross-Region replication if configured; data transfer in is free. For 80 servers with ~8TB total, the per-server fee alone is about $1,600 a month, so the monthly bill lands around $2,000-2,500: cheap insurance compared to a second data-centre.
On failover, the bill changes shape: target EC2 instances run (normal EC2 rates), staging storage continues (for ongoing replication until failback), cost goes up proportional to the production compute needed.
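The steady-state estimate is easy to reproduce. The sketch below uses the per-server rate quoted above and assumes the rest: $0.08 per GB-month for storage, $0.0208 per hour for each t3.small replication server, 730 hours in a month, and six replication servers (one per 15 sources, rounded up). Snapshot storage and cross-Region transfer are left out. Treat every default as a placeholder to replace with current pricing for your Region.

```python
HOURS_PER_MONTH = 730

def monthly_steady_state_cost(servers: int,
                              replicated_gb: float,
                              replication_servers: int,
                              drs_hourly: float = 0.028,
                              ebs_gb_month: float = 0.08,
                              replication_server_hourly: float = 0.0208) -> dict:
    """Break down the monthly bill while nothing has failed (all rates in USD)."""
    service_fee = servers * drs_hourly * HOURS_PER_MONTH
    storage = replicated_gb * ebs_gb_month
    compute = replication_servers * replication_server_hourly * HOURS_PER_MONTH
    return {
        "service_fee": round(service_fee, 2),
        "storage": round(storage, 2),
        "replication_compute": round(compute, 2),
        "total": round(service_fee + storage + compute, 2),
    }

# 80 servers, ~8 TB replicated, 6 replication servers:
for item, usd in monthly_steady_state_cost(80, 8000, 6).items():
    print(f"{item:>20}: ${usd:,.2f}")
```

With these inputs the total is a little under $2,400 a month, inside the range quoted above. The per-server fee dominates, so the cheapest lever is usually deciding which servers genuinely need a 30-minute RTO.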
A worked failover
Data-centre fire at 13:47. On-prem DNS goes dark; monitoring in AWS alerts at 13:49.
- Operator opens DRS console. Selects the ERP-Critical recovery plan.
- Clicks “initiate recovery”, confirms target Region and launch template overrides.
- DRS begins launching instances in launch-order: erp-db01 first. EBS volumes attach from the staged snapshots (point-in-time = 13:47, matching the pre-failure state). Instance boots.
- Database starts up, plays back journal, accepts connections.
- DRS launches erp-app01, erp-app02 in parallel per the plan. App tier starts.
- Route 53 records are updated via an SSM document (or a custom Lambda) to point erp.company.internal to the new AWS IPs. DNS TTL is 60s.
- 14:12: 25 minutes from fire to ERP accepting transactions. Target RTO was 30 minutes; hit.
Data loss: whatever was in flight at the moment of the fire (seconds). RPO target was minutes; met.
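The DNS step in that timeline is the one most often left as "someone will update the record". If it is done by a custom Lambda, the core of it is a Route 53 `UPSERT` change batch. The sketch below builds that batch without calling AWS; the IP addresses are hypothetical, the helper name is ours, and the hosted zone ID in the usage comment is a placeholder.

```python
import ipaddress

def build_upsert_batch(record_name: str, ips: list[str], ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch that points an A record at the given IPs."""
    if not ips:
        raise ValueError("at least one IP address is required")
    for ip in ips:
        ipaddress.IPv4Address(ip)   # raises ValueError on anything that is not IPv4
    return {
        "Comment": "DR failover: repoint record to recovery instances",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip} for ip in ips],
            },
        }],
    }

batch = build_upsert_batch("erp.company.internal", ["10.250.20.11", "10.250.20.12"])
print(batch["Changes"][0]["ResourceRecordSet"])

# Assumed usage, not executed here (the zone ID is a placeholder):
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZEXAMPLE", ChangeBatch=batch)
```

The 60-second TTL matters as much as the change itself: with a one-hour TTL on the record, clients could keep resolving the dead on-prem address well past the 30-minute RTO.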
A worked drill
Tuesday morning, quarterly drill:
- SRE clicks Initiate Drill on the ERP-Critical plan. Target: drill VPC.
- DRS launches a complete copy of ERP in the drill VPC: database, app servers, file server.
- SRE connects to the drilled instances, runs the smoke-test suite (balances reconcile, reports generate, user login works).
- Drill reveals that the file-server launch template is missing an SSM document that re-maps a drive letter. Bug filed.
- After validation, SRE terminates the drill instances. Staging state is unchanged; replication continues.
Three days later: the bug is fixed in the launch template. Next drill passes. The confidence is real: not “we think it will work” but “we have launched it successfully in the last 30 days.”
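"Launched successfully in the last 30 days" is a claim that can be checked mechanically. The sketch below assumes the team keeps a record of the last passing drill per server (the dictionary here is hypothetical, as is the `erp-fs01` name) and flags anything that has gone stale.

```python
from datetime import date, timedelta

def stale_servers(last_successful_drill: dict[str, date],
                  today: date,
                  max_age_days: int = 30) -> list[str]:
    """Return servers whose last passing drill is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, drilled in last_successful_drill.items()
                  if drilled < cutoff)

drill_log = {
    "erp-db01": date(2024, 6, 25),
    "erp-app01": date(2024, 6, 25),
    "erp-fs01": date(2024, 5, 1),   # failed its last drill, not yet re-run
}
print(stale_servers(drill_log, today=date(2024, 6, 30)))
```

A report like this, run weekly, turns the drill cadence from a calendar intention into something that shows up on a dashboard when it slips.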
What’s worth remembering
- DRS replicates at block level, continuously, through an agent. Not filesystem, not image. That’s why it’s fast to fail over and captures everything on disk.
- Staging is cheap; target runtime is normal EC2. Steady-state cost is replication storage + tiny replication servers + per-source licence. On failover, you pay normal EC2 rates for the actual DR workload.
- Launch templates are per-source. Instance type, subnet, security groups, tags, IP strategy. Recovery plans chain launches in order for complex stacks.
- Drills are launches into a separate VPC. Replication continues. Non-disruptive. The only honest way to say “we tested DR.”
- Failback requires bandwidth and patience. DRS supports it but it’s a full-reverse-sync operation first. Plan the network links for it.
- MGN and DRS share an agent but not a use case. MGN is migration (cut over, decommission source). DRS is ongoing replication. Don’t confuse them.
- Point-in-time recovery handles the corruption case. DRS keeps 7 days of snapshots by default. Useful when the failure isn’t “data centre fire” but “ransomware encrypted the source disks.”
- DR VPC CIDR planning matters. Matching on-prem CIDRs in the DR VPC lets apps keep IPs across failover; mismatched CIDRs mean DNS-layer changes.
Block-level replication, staged in AWS, launched in minutes: the on-prem estate gets a tested DR plan that wasn’t affordable five years ago. The VMs still exist in the data-centre; if the data-centre goes, they come back in AWS. Thirty-minute RTO, quarterly drills, real numbers for the board.