The situation
We run a hybrid estate: a primary datacentre in Slough, a DR datacentre in Manchester, and workloads in eu-west-1 across three VPCs attached to a Transit Gateway. Cross-connectivity uses:
- One Direct Connect connection, 10 Gbps, dedicated, at the Equinix LD5 location in Slough. Terminates into a Direct Connect Gateway which associates with the Transit Gateway. BGP speaks to on-prem over a transit VIF (the VIF type a Transit Gateway association requires). ASN 65001 on-prem, 64512 on the AWS side.
- One Site-to-Site VPN, two tunnels, static routes, attached to the same Transit Gateway. Configured manually eighteen months ago. Nobody has tested failover since the person who built it left.
- Normal steady-state traffic: ~3 Gbps of database replication, file transfers, and internal API calls. Peaks to 7 Gbps during quarter-end. The VPN has a 1.25 Gbps per-tunnel ceiling, so it cannot carry steady-state traffic even if we wanted it to.
The last incident: a fibre cut between LD5 and our Slough cage took the Direct Connect offline at 14:02. The VPN tunnels were up. BGP on the TGW held the Direct Connect routes for ninety seconds before withdrawing them. The VPN’s static routes never became preferred because the TGW routing table had no dynamic path via the VPN. Result: four minutes of black-holed traffic, frantic manual route-table edits, and a postmortem that reads like one.
The CTO’s constraint: failover under one minute, no data-plane cost doubling, no “add another datacentre” answer.
What actually matters
Before chasing products, it’s worth asking what “reliable hybrid connectivity” actually means on AWS.
The first thing is two independent failure domains. One Direct Connect from one location is a single fibre run, a single MMR cross-connect, and a single customer router port. When any of them fails, the whole link fails. A second path must share none of those components: ideally not the same building, not the same provider’s backbone, and not the same on-prem router. The VPN over the public internet is independent of the Direct Connect physical path by construction, which is its underrated virtue.
The second thing is dynamic routing end to end. The old failover failed because the VPN was static and the TGW had no way to learn that routes via the DX had withdrawn. Running BGP on both paths, DX and VPN, lets the TGW and the on-prem router converge on whichever path is up, weighted by AS_PATH prepending, MED, or local preference, without anyone editing a route table. When both are up, BGP picks the DX because its AS_PATH is shorter; when the DX withdraws its routes, BGP picks the VPN automatically. The failover time becomes BGP convergence time (seconds, if tuned) rather than human time.
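The preference logic described here can be sketched in a few lines. This is a toy model, not a BGP implementation: it compares only local preference and AS_PATH length, and the `Route` type and `best_path` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    via: str                    # illustrative label: "dx" or "vpn"
    local_pref: int             # higher wins
    as_path: Tuple[int, ...]    # shorter wins


def best_path(routes: List[Route], prefix: str) -> Optional[Route]:
    """Pick the preferred route for a prefix, or None if none are left."""
    candidates = [r for r in routes if r.prefix == prefix]
    if not candidates:
        return None
    # Highest local preference first, then shortest AS_PATH.
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))


# On-prem ASN 65001; the VPN session prepends it three extra times.
dx = Route("10.0.0.0/8", "dx", 100, (65001,))
vpn = Route("10.0.0.0/8", "vpn", 100, (65001, 65001, 65001, 65001))

steady = best_path([dx, vpn], "10.0.0.0/8")
after_cut = best_path([vpn], "10.0.0.0/8")   # DX route withdrawn

print(steady.via, after_cut.via)
```

Nothing in the route set is edited by hand between the two calls; the withdrawn DX route simply stops being a candidate, which is the whole point of dynamic routing on both paths.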
The third thing is bandwidth honesty about the backup. A 10 Gbps DX backed by two 1.25 Gbps VPN tunnels isn’t a backup; it’s a severely degraded mode. If steady-state traffic is 3 Gbps, the VPN cannot carry it, and “failover” means “most applications break, just differently.” Either the workload has to tolerate degraded bandwidth during failover (which should be an explicit decision, not an accident), or the backup has to be sized to carry real traffic, which means either multiple VPN tunnels in ECMP or a second Direct Connect.
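The arithmetic behind that paragraph is simple enough to write down. The sketch below uses the figures quoted in this article (1.25 Gbps per tunnel, 3 Gbps steady state, 7 Gbps peak) and assumes ideal ECMP spreading; in practice ECMP balances per flow, so a single large flow is still limited to one tunnel. The function names are hypothetical.

```python
import math

TUNNEL_GBPS = 1.25   # per-tunnel ceiling quoted above


def backup_shortfall(demand_gbps: float, tunnels: int,
                     per_tunnel_gbps: float = TUNNEL_GBPS) -> float:
    """Gbps of demand the backup cannot carry (0.0 if it fits)."""
    capacity = tunnels * per_tunnel_gbps
    return max(0.0, demand_gbps - capacity)


def tunnels_needed(demand_gbps: float,
                   per_tunnel_gbps: float = TUNNEL_GBPS) -> int:
    """Minimum number of ECMP tunnels to carry the demand, ideal spreading."""
    return math.ceil(demand_gbps / per_tunnel_gbps)


for label, demand in [("steady state", 3.0), ("quarter-end", 7.0)]:
    print(label,
          "shortfall on 2 tunnels:", backup_shortfall(demand, 2), "Gbps;",
          "tunnels needed:", tunnels_needed(demand))
```

On two tunnels the steady state is short by 0.5 Gbps and the quarter-end peak by 4.5 Gbps, which is why degraded mode has to be an explicit decision.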
The fourth is what AWS will actually underwrite with an SLA. AWS offers a Direct Connect uptime SLA only for specific resilience postures, the ones where redundancy is structural, not bolted on. A single circuit plus a VPN backup is a sensible design but it sits outside the SLA. If the business has a contractual uptime number to hit externally, the topology has to match a shape that AWS commits to numerically.
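To make the SLA percentages concrete, it helps to convert them into a monthly downtime budget. This is a rough sketch assuming a 30-day month and treating all downtime as countable; the actual SLA terms define unavailability more narrowly, so read the numbers as an order of magnitude only.

```python
def monthly_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed per month at a given uptime percentage."""
    minutes_in_month = days * 24 * 60
    return minutes_in_month * (1 - sla_percent / 100)


for sla in (99.9, 99.99):
    budget = monthly_downtime_minutes(sla)
    print(f"{sla}% uptime allows about {budget:.2f} minutes per 30-day month")
```

At 99.9% the budget is roughly 43 minutes a month; at 99.99% it is a little over 4 minutes, so a single outage the size of the last incident would consume nearly all of it.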
The fifth is cost shape. A second 10 Gbps dedicated DX roughly doubles the hourly port fee; data-transfer-out is billed on volume, so it grows only if traffic does. A VPN costs a small fee per connection-hour and carries nothing while the DX is up; when traffic shifts to the VPN, egress is billed at the internet data-transfer-out rate, which is higher than the DX rate. The cheapest honest backup is the VPN; the most resilient backup is a second DX at a second location; the middle answer is some combination.
What we’ll filter on
Distilling that exploration into filters for each option:
- Independent failure domain: does the backup share fibre, building, provider, or router with the primary?
- Dynamic failover: does BGP converge automatically, or does a human edit routes?
- Failover time under one minute: how fast does traffic actually move?
- Bandwidth sufficient during failover: can the backup carry real traffic, not just heartbeats?
- Covered by a DX SLA: does AWS underwrite uptime for this shape?
The connectivity landscape
1. Single DX + Site-to-Site VPN (status quo, fixed). Keep the existing 10 Gbps DX. Replace the static VPN with a BGP-based VPN attachment to the Transit Gateway. Use AS_PATH prepending on the VPN’s BGP session so the DX path is always preferred while both are up. Two independent failure domains (fibre vs public internet), dynamic failover via BGP. Failover time: up to 90 seconds if detection waits on the default BGP hold timer, under 10 seconds with BFD on the DX, so the one-minute target depends on BFD. Backup bandwidth: 2.5 Gbps aggregate across both tunnels in ECMP, which degrades the 3 Gbps steady state and cannot carry the 7 Gbps quarter-end peak. No DX SLA (the SLA requires two DX).
2. Two DX connections, same location. Add a second 10 Gbps DX terminating at the same LD5 location on a different AWS device. Called “development and test” in the Resiliency Toolkit. Protects against a single AWS-side device failure; does not protect against a building-level failure at LD5 or against our Slough cage losing power. Shared fibre run into LD5 remains a single point of failure. Failover is BGP between the two DXs, sub-10 seconds. Bandwidth: 20 Gbps aggregate, plenty. Cost: roughly double the DX line items.
3. Two DX at two locations (High Resiliency). Add a second 10 Gbps DX at a different Direct Connect location. Telehouse North in Docklands is the natural second site from Slough. The two DXs terminate on separate AWS devices in separate buildings. Protects against building-level failure; does not protect against a catastrophic provider failure if both DXs use the same tier-1 backbone. Failover is BGP, sub-10 seconds. AWS offers a 99.9% monthly SLA on this shape. Cost: two 10 Gbps port fees, two cross-connects, plus whatever the second circuit costs.
4. Two DX at two locations + VPN backup. Belt and braces. Two DXs for the high-resiliency SLA, plus a VPN on a third path (public internet, different physical cable out of the datacentre) for the day both DXs fail. Three independent failure domains. BGP picks DX preferentially, VPN as last resort. The VPN rarely carries traffic, but when it does, it keeps some subset of the workload alive. Cost is (3) plus a small VPN line item.
5. Two DX at two locations, separate providers (Maximum Resiliency). Same as (3) but each DX uses a different network provider’s fibre to avoid shared-backbone correlated failures. 99.99% monthly SLA. AWS’s own Maximum Resiliency model also calls for connections on separate devices at each location, so check the required connection count against the current SLA terms before quoting the 99.99% figure. Operationally more work (two providers to manage), cost is similar to (3) if the provider quotes aren’t wildly different.
Side by side
| Option | Independent failure domain | Dynamic failover | Sub-minute failover | Backup bandwidth sufficient | DX SLA |
|---|---|---|---|---|---|
| DX + VPN (BGP, fixed) | ✓ | ✓ | ✓ | ✗ | ✗ |
| 2× DX, same location | ✗ | ✓ | ✓ | ✓ | ✗ |
| 2× DX, two locations | ✓ | ✓ | ✓ | ✓ | 99.9% |
| 2× DX, two locations + VPN | ✓✓ | ✓ | ✓ | ✓ | 99.9% |
| 2× DX, two locations, two providers | ✓✓ | ✓ | ✓ | ✓ | 99.99% |
Reading the table against the situation: the VPN-only fix solves the failover problem (dynamic routing, sub-minute) but not the bandwidth problem (VPN can’t carry steady-state). Two DXs at the same location solve bandwidth but keep the single-building risk. Two DXs at two locations are the genuine SLA-covered answer. Adding a VPN on top is cheap insurance against a correlated DX failure.
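The same reading can be done mechanically. The sketch below encodes the table rows as data and filters them; the field names and the `shortlist` helper are hypothetical, and the values are copied from the table rather than derived independently.

```python
# Each row: independent failure domain, dynamic failover, sub-minute failover,
# sufficient backup bandwidth, and the DX SLA percentage (None if uncovered).
OPTIONS = {
    "DX + VPN (BGP, fixed)":               (True,  True, True, False, None),
    "2× DX, same location":                (False, True, True, True,  None),
    "2× DX, two locations":                (True,  True, True, True,  99.9),
    "2× DX, two locations + VPN":          (True,  True, True, True,  99.9),
    "2× DX, two locations, two providers": (True,  True, True, True,  99.99),
}


def shortlist(options, min_sla=None):
    """Return the options that pass every filter, optionally with an SLA floor."""
    keep = []
    for name, (independent, dynamic, sub_minute, bandwidth, sla) in options.items():
        if not (independent and dynamic and sub_minute and bandwidth):
            continue
        if min_sla is not None and (sla is None or sla < min_sla):
            continue
        keep.append(name)
    return keep


print(shortlist(OPTIONS))            # passes all four technical filters
print(shortlist(OPTIONS, 99.99))     # also needs the stricter SLA
```

Only the three two-location shapes survive the technical filters, and asking for 99.99% leaves just the two-provider option, which matches the reading above.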
[Figure: failover path in steady state and during an outage]
The pick in depth
The pragmatic answer for our scale is (4) Two DX at two locations + VPN backup. The second DX gets us the 99.9% SLA and genuine bandwidth during failover; the VPN covers the rare dual-DX failure scenario where both locations lose connectivity simultaneously.
The DX topology. First DX at LD5 Slough, terminating on AWS device A. Second DX at Telehouse North Docklands, terminating on AWS device B. Each DX has its own transit VIF (the VIF type a Transit Gateway association requires) attached to the same Direct Connect Gateway, which associates with the Transit Gateway. The DXG is what hides the two VIFs behind a single logical attachment; the TGW doesn’t know or care that there are two physical circuits behind it.
BGP configuration. On the on-prem side, each router runs eBGP to its local DX. The active/active default is fine: both DXs advertise the same prefixes with the same AS_PATH length, and the TGW uses ECMP across them. If asymmetric routing is a concern (it usually isn’t for transit VIFs to a TGW), prepend AS_PATH on the secondary site to force preference. BFD is enabled by default on the AWS side of a DX but not always on the on-prem side. Enable it there, set the interval to 300 ms with a multiplier of 3, and failure detection drops from BGP’s 90-second hold time to under a second.
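The detection-time claim is plain multiplication: BFD declares the peer down after `multiplier` consecutive missed intervals. A minimal sketch using the timer values quoted above (the function name is hypothetical):

```python
BGP_HOLD_TIME_S = 90   # default hold time quoted in this article


def bfd_detection_ms(interval_ms: int, multiplier: int) -> int:
    """Worst-case time for BFD to declare a neighbour down."""
    return interval_ms * multiplier


detect_ms = bfd_detection_ms(300, 3)
speedup = (BGP_HOLD_TIME_S * 1000) / detect_ms

print(f"BFD detection: {detect_ms} ms")
print(f"BGP hold timer alone: {BGP_HOLD_TIME_S * 1000} ms ({speedup:.0f}x slower)")
```

This covers detection only. Route withdrawal and reconvergence add time on top, which is why the failover estimates elsewhere in this piece are quoted in seconds rather than milliseconds.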
The VPN attachment. A single Site-to-Site VPN attached to the TGW, two tunnels, BGP-based. Configure the customer gateway to prepend its ASN three times on both tunnels so the VPN is only ever chosen when both DXs have withdrawn their routes. The TGW route table is associated with the DXG attachment (via the DXG association) and the VPN attachment; route propagation is on for each attachment, and longest-prefix match followed by shortest AS_PATH sorts out the rest.
The TGW route tables. One route table per isolation boundary. Production workloads get a route table that propagates from the DXG attachment and the VPN attachment; development workloads get the same. The on-prem CIDR (10.0.0.0/8) is learned via BGP through both; the TGW picks DX because its AS_PATH is shorter, and a TGW in any case ranks routes propagated from a Direct Connect Gateway above VPN-propagated routes for the same prefix.
Monitoring. CloudWatch metrics on each DX connection (ConnectionBpsIngress, ConnectionBpsEgress, ConnectionState) and on each VPN tunnel (TunnelState, TunnelDataIn, TunnelDataOut). Alarms on BGP session state changes and on ConnectionState dropping from 1 (up) to 0 (down). Route Analyzer in TGW Network Manager for end-to-end path verification. Reachability Analyzer for snapshot checks after config changes.
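As one way to express the state alarms, the sketch below builds the keyword arguments that could be passed to boto3’s CloudWatch `put_metric_alarm` call. It only constructs the dictionaries and calls no AWS API. The alarm names are hypothetical, the resource IDs are the illustrative ones used in this article, and the 60-second period is an assumption that should be matched to the granularity at which the metric is actually published.

```python
def state_alarm(name, namespace, metric, dim_name, dim_value, period_s=60):
    """Build alarm arguments that fire when an up/down (1/0) metric reads below 1."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": dim_name, "Value": dim_value}],
        "Statistic": "Minimum",
        "Period": period_s,                  # assumed granularity
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",     # silence counts as failure
    }


alarms = [
    state_alarm("dx-slough-down", "AWS/DX", "ConnectionState",
                "ConnectionId", "dxcon-0a1b2c3d"),
    state_alarm("vpn-backup-down", "AWS/VPN", "TunnelState",
                "VpnId", "vpn-onprem-to-tgw"),
]

for alarm in alarms:
    print(alarm["AlarmName"], alarm["Namespace"], alarm["MetricName"])
```

Treating missing data as breaching is a deliberate choice here: for a link that must not fall, a metric that stops reporting is as worrying as one that reports down.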
A worked failover
Ravi is duty SRE on a Wednesday afternoon. The Slough fibre is cut at 15:17. His pager goes off three times in three seconds:
15:17:04 [CRITICAL] DX connection dxcon-0a1b2c3d state=down (Slough LD5)
15:17:05 [INFO] BGP session dxvif-slough-prod state=idle
15:17:06 [INFO] TGW route table rtb-prod: 10.0.0.0/8 via dxvif-slough withdrawn
Ravi opens a terminal and runs a Reachability Analyzer analysis on the saved path from VPC vpc-payments-prod to 10.50.0.0/16 on-prem:
$ aws ec2 start-network-insights-analysis \
--network-insights-path-id nip-path-prod-to-onprem
Status: succeeded
Path: vpc-payments-prod → tgw-attachment eni → TGW rtb-prod →
dxvif-docklands (active) → DXG → Docklands DX → on-prem
15:17:07 [INFO] TGW rtb-prod: 10.0.0.0/8 via dxvif-docklands preferred
About three seconds from the first alert to a stable path via the Docklands DX. The VPN never needed to activate; the second DX did its job. Ravi opens a ticket with the provider for the Slough circuit, watches CloudWatch confirm that Docklands is carrying the full 3 Gbps without stress, and goes back to his lunch. The postmortem for this one is four lines.
Three days later, a test scenario: both DXs out simultaneously (a scheduled-maintenance overlap, a genuine worst case that nearly happened last year):
$ aws ec2 describe-vpn-connections \
--vpn-connection-ids vpn-onprem-to-tgw \
--query 'VpnConnections[0].VgwTelemetry[*].Status'
[ "UP", "UP" ]
$ aws ec2 search-transit-gateway-routes \
--transit-gateway-route-table-id tgw-rtb-prod \
--filters "Name=route-search.exact-match,Values=10.0.0.0/8" \
--query 'Routes[*].TransitGatewayAttachments[0].ResourceType'
[ "vpn" ]
The VPN carries the workload. Latency rises from 4 ms to 22 ms over the public internet path. Bandwidth caps at 2.5 Gbps, so the 3 Gbps steady state degrades and the 7 Gbps quarter-end peak would break. The runbook flags quarter-end as a “do not schedule DX maintenance” window.
What’s worth remembering
- The failure domain question is geographic and physical. One DX location is one building is one fibre run. Two DXs at the same location share the building; two DXs at different locations are the first genuine resilience step.
- The Direct Connect Resiliency Toolkit names three tiers. Development-and-test (two DX on separate devices at one location), High Resiliency (two DX at two locations, 99.9% SLA), Maximum Resiliency (two DX at two locations, two providers, 99.99% SLA). AWS’s SLA only applies to the named shapes.
- VPN as backup only works with BGP. Static VPN routes on a TGW don’t participate in failover. BGP-based VPN attachments let the TGW learn and prefer automatically; AS_PATH prepending controls preference without requiring per-route tuning.
- BFD cuts failure detection from ninety seconds to under a second. Default BGP hold time is 90 seconds; BFD at 300 ms with a multiplier of 3 declares the peer down after 900 ms. Enable it on both sides or it doesn’t work.
- VPN bandwidth is the honest constraint. A VPN tunnel caps at ~1.25 Gbps; two in ECMP give ~2.5 Gbps. If steady-state traffic exceeds that, “failover” means “degraded,” which should be an explicit decision.
- Direct Connect Gateway hides multiple VIFs behind one TGW attachment. One DXG, two VIFs (one per DX), one TGW association. The TGW sees a single logical path and ECMPs across the underlying VIFs.
- Monitor what you depend on. ConnectionState, BGP session state, TunnelState, and route-table contents all have CloudWatch metrics or events; alarm on them. Route Analyzer and Reachability Analyzer let you verify the picture rather than hope.
- Cost scales with resilience; pick the tier that matches the uptime number. Single DX + VPN for “best effort with graceful degradation.” Two DX at two locations for “99.9% underwritten.” Two providers on top for “99.99% underwritten.” The right tier is whichever matches the SLA the business has committed to externally.
The postmortem action item was “make the failover actually work.” The answer is the topology, not the configuration of the old one: two Direct Connects at two locations, a VPN on a third path, BGP everywhere, BFD for fast detection, and a route table that lets the network converge without anyone editing it. The link that must not fall has to be more than one link.