Designing Direct Connect to Survive a Fibre Cut

January 29, 2029 · 15 min read

The situation

We have one 10 Gbps Direct Connect dedicated connection between our colo in Slough and the AWS Direct Connect location at Equinix LD5. A single cross-connect, a single customer router on our side, a single Direct Connect router on the AWS side, and a single BGP session carrying a handful of prefixes both ways. Most traffic is the nightly data warehouse sync to S3 plus live traffic for an API that backs the mobile app. Traffic is growing, the business has signed a contract with an SLA attached, and an internal review flagged the single-link setup as a risk we haven’t consciously accepted.

The options on the table:

Status quo. One connection, one location, one router, no SLA from AWS.
Second connection, same location. Another 10 Gbps link in LD5, different cross-connect, possibly a second customer router.
Second connection, different AWS Direct Connect location. One 10 Gbps link into LD5 plus another 10 Gbps link into, say, Telehouse North.
Backup via Site-to-Site VPN over the internet. Keep the one DX link, add an IPsec tunnel as a fallback path.
Hosted connections or LAGs. Partner-delivered links at smaller capacities, or bonded groups of dedicated connections that behave as one.

Five shapes, five SLA stories, five price tags. We need to know which shape pays for which failure mode.

What actually matters

Before pricing any of it, it’s worth asking what we’re trading.

The first trade is failure-domain coverage. A second link in the same AWS Direct Connect location protects against a single cross-connect failure, a failed customer-side optic, or a failed AWS router. It does not protect against anything that takes out the building itself: a power event in LD5, a fibre cut on the street LD5 sits on, or a facility-wide outage. A second link in a different AWS location protects against all of those too, at the cost of a second metro fibre run and another monthly recurring charge (MRC) for the cross-connect.
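A rough way to see why a second link in the same building buys less than a second building is to treat availability as series and parallel components. The sketch below is illustrative only: the availability figures are hypothetical placeholders, not AWS or carrier numbers, and the model assumes failures are independent, which real fibre routes often are not.

```python
def series(*availabilities: float) -> float:
    """All components must be up: multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """At least one component must be up: multiply the failure probabilities."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


# Hypothetical figures, for illustration only.
LINK = 0.999       # one cross-connect + optic + router pair
BUILDING = 0.9995  # power, cooling, and street fibre of one DX location

# Two links share one building: the building sits in series with the pair.
same_location = series(BUILDING, parallel(LINK, LINK))

# Two links in two buildings: each building is in series with its own link.
different_locations = parallel(series(BUILDING, LINK), series(BUILDING, LINK))

print(f"two links, one building : {same_location:.6f}")
print(f"two links, two buildings: {different_locations:.6f}")
```

Whatever numbers go in, the same-location result can never exceed the availability of the building itself, which is the point of the first trade.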

The second trade is bandwidth aggregation versus redundancy. A Link Aggregation Group (LAG) bonds up to four connections into one logical interface and presents one BGP session over the bundle; the bandwidth adds up. That’s useful when we need 40 Gbps and the port speed tops out at 10, but it’s not independent resilience: a LAG sits in a single location, and all its members terminate on the same AWS device. Two separate LAGs in two locations is the shape that multiplies bandwidth and resilience, at roughly double the cost.
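The aggregation arithmetic is simple enough to sketch. The helper below is a model, not an AWS API call: the function name is made up for this post, and `min_links` stands in for the minimum-links setting a LAG can carry, below which the whole bundle is treated as down.

```python
def lag_capacity_gbps(port_speed_gbps: float, members_up: int, min_links: int = 1) -> float:
    """Usable bandwidth of one LAG.

    Returns 0 once fewer than `min_links` members are up, because the
    bundle as a whole is then considered non-operational.
    """
    if members_up < min_links:
        return 0.0
    return port_speed_gbps * members_up


# A four-member 10 Gbps LAG, healthy and then with one member lost.
print(lag_capacity_gbps(10, 4))
print(lag_capacity_gbps(10, 3))
# A two-member LAG configured to go down when it loses a member.
print(lag_capacity_gbps(10, 1, min_links=2))
```

The total is an aggregate across flows. Because traffic is typically hashed per flow onto a member link, a single flow is still bounded by one member's port speed.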

The third trade is data-plane versus control-plane resilience. Direct Connect carries traffic over a Virtual Interface (VIF) that rides on the connection; the BGP session is the control plane. A second connection with no second VIF gains nothing at the BGP layer. Two VIFs across two connections, with matching BGP community tags and consistent AS-path prepending, is the configuration that actually shifts traffic when a link dies. Multipath BGP on our side picks it up; AWS honours the community-encoded preferences we send.

The fourth trade is the role of a cheap fallback path. An over-the-internet tunnel costs a few cents an hour, takes minutes to stand up, and runs over whatever internet uplink we already have. It is not fast (aggregate throughput well below a dedicated DX), it is not low-latency, and it is not covered by any Direct Connect SLA. But as a tertiary path that keeps control traffic flowing while DX is recovering, it’s one of the cheapest insurance policies AWS sells. It protects us from the “both fibre runs cut on the same day” scenario that the four-connection design still doesn’t cover.

The fifth, softer one: what SLA AWS will actually sign. Direct Connect SLAs are tiered. A single connection gets no availability SLA worth naming. Two connections at the same location get a 99.9% SLA. Two connections at different Direct Connect locations get 99.99%. Adding VPN failover doesn’t improve the DX SLA itself, but it does raise the overall path availability the business measures. If the contract with the customer quotes 99.99%, the architecture has to match, or the lawyers will notice.
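To make those percentages concrete, it helps to convert them into a downtime budget. The snippet below is a small arithmetic sketch; the 30-day month is an assumption chosen for round numbers, and the tier labels simply restate the figures quoted in this post.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes

# Availability tiers as quoted in this post (None means no SLA).
TIERS = {
    "single connection": None,
    "two connections, same location": 99.9,
    "two connections, different locations": 99.99,
}


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime in a 30-day month at the given availability."""
    return MINUTES_PER_30_DAY_MONTH * (1 - availability_pct / 100)


for shape, pct in TIERS.items():
    if pct is None:
        print(f"{shape}: no SLA, no budget to speak of")
    else:
        print(f"{shape}: {downtime_budget_minutes(pct):.2f} minutes/month")
```

At 99.9% the budget is 43.20 minutes a month; at 99.99% it is 4.32 minutes, which is about the length of a handful of BGP convergence events.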

What we’ll filter on

Distilling the exploration into filters we can score each option against:

Protects against single cross-connect or router failure: can one piece of metal die without taking the path down?
Protects against AWS location failure: can the whole building go dark without taking the path down?
AWS SLA: what uptime will AWS put in writing?
Bandwidth aggregation: does this shape also give us more throughput, or only more resilience?
Incremental cost: what does the monthly bill look like relative to status quo?

The Direct Connect landscape

1. Single dedicated connection (status quo). One 10 Gbps port at one AWS Direct Connect location, one cross-connect, one BGP session. No SLA. A single cross-connect failure, optic failure, or building event takes the path down. Cheapest by far. The correct answer only when the workload tolerates internet fallback for every DX outage.

2. Two dedicated connections, same AWS Direct Connect location. Two 10 Gbps ports in LD5, ideally on different AWS Direct Connect routers (the fabric assigns them by default when ordering two connections), ideally cabled to different customer routers on our side. Each connection carries its own VIFs and its own BGP session; failover is a BGP convergence event. 99.9% SLA. Protects against connection, optic, and AWS-side router failure within the building. Does not protect against the building itself failing. Roughly double the MRC of a single link.

3. Two dedicated connections, different AWS Direct Connect locations. The headline resilient configuration: 10 Gbps in LD5, another 10 Gbps in Telehouse North (or any second DX location in the same metro). Independent buildings, independent power, independent fibre routes if the metro provider has been picked carefully. 99.99% SLA. Protects against every failure mode in option 2 plus location-level events. Adds the cost of a second metro fibre run and a second cross-connect.

4. Link Aggregation Group (LAG). One to four dedicated connections bonded via LACP into a single logical interface with aggregate bandwidth. Same AWS location. A LAG can carry VIFs the same way a single connection does; when a member link fails, traffic rebalances across the survivors transparently. Useful for scaling bandwidth past 10 Gbps without upgrading to 100 Gbps ports, and for smoothing member failures without BGP convergence. Does not provide location-level resilience on its own. Pairs cleanly with option 3: a LAG in each of two locations is the shape large customers actually run.

5. Site-to-Site VPN fallback over internet. One or more IPsec tunnels from the customer gateway to a Virtual Private Gateway or Transit Gateway, advertising the same VPC prefixes over BGP. Preference is set so DX is primary and VPN is backup (AWS prefers DX over VPN by default when both advertise the same prefix). Adds roughly $0.05/hour per VPN connection plus egress data. Not a Direct Connect SLA improvement, but protection against the scenario where every DX path is down and we still need control traffic to flow.
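For a sense of scale on the VPN line item, the hourly rate can be turned into a monthly figure. This is a back-of-envelope sketch: the default rate is the rough $0.05/hour quoted above, the 730-hour month is an averaging assumption, and data transfer charges are deliberately left out.

```python
HOURS_PER_MONTH = 730  # average month: 8,760 hours / 12


def vpn_monthly_cost_usd(connections: int, hourly_rate_usd: float = 0.05) -> float:
    """Approximate monthly standing charge for idle VPN fallback, excluding egress."""
    return connections * hourly_rate_usd * HOURS_PER_MONTH


# One fallback VPN, and a pair for comparison.
print(f"${vpn_monthly_cost_usd(1):.2f} per month")
print(f"${vpn_monthly_cost_usd(2):.2f} per month")
```

Even doubled up, the standing charge is small next to the MRC of a single 10 Gbps port, which is why the VPN sits underneath every multi-link option rather than competing with them.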

Side by side

Option	Cross-connect / router failure	Location failure	AWS SLA	Bandwidth aggregation	Incremental cost
Single DX	✗	✗	None	✗	Baseline
Two DX, same location	✓	✗	99.9%	Partial	~2x
Two DX, different locations	✓	✓	99.99%	Partial	~2.5x
LAG (same location)	Partial	✗	None	✓	~Nx members
VPN fallback	Partial	✓ (internet)	None	✗	~Low

Reading the table by failure mode rather than by option:

Single optic or router dies: options 2 and 3 cover it; option 4 covers a failed member optic or cross-connect but not the device its members share; option 5 covers it only by falling back to the internet at reduced throughput.
Entire building goes dark: only option 3 covers it inside the DX world; option 5 (VPN) is the non-DX fallback.
Fibre cut on the metro: option 3 protects if the two metro fibre routes are genuinely diverse; option 5 is the insurance policy against a same-day double cut.
Need more than 10 Gbps on one path: option 4 is the only DX shape that aggregates.
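Reading the comparison by failure mode is easier when the table is held as data. The sketch below transcribes the rows into a small structure and filters on one column at a time. The class and field names are invented for this example, and the LAG row is marked "partial" for component failure on the reasoning that its members share a single device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    survives_component_failure: str  # "yes", "no", or "partial"
    survives_location_failure: str
    aws_sla: str
    aggregates_bandwidth: str


OPTIONS = [
    Option("Single DX", "no", "no", "None", "no"),
    Option("Two DX, same location", "yes", "no", "99.9%", "partial"),
    Option("Two DX, different locations", "yes", "yes", "99.99%", "partial"),
    # Members of a LAG share one device, hence "partial" here.
    Option("LAG (same location)", "partial", "no", "None", "yes"),
    Option("VPN fallback", "partial", "yes", "None", "no"),
]


def covering(attribute: str, accept=("yes",)) -> list[str]:
    """Names of the options whose value for `attribute` is in `accept`."""
    return [o.name for o in OPTIONS if getattr(o, attribute) in accept]


print(covering("survives_location_failure"))
print(covering("aggregates_bandwidth"))
print(covering("survives_component_failure", accept=("yes", "partial")))
```

Passing `accept=("yes", "partial")` widens a filter to include the hedged rows, which is a quick way to see how much of the table rests on a "partial".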

Two locations, two LAGs, one VPN underneath

Two Direct Connect locations, each with a 2-member LAG, plus a VPN over the internet as a tertiary path. Three paths, three BGP preferences, one Transit Gateway learning them all.

The pick(s) in depth

Two Direct Connect locations, LAGs at each, VPN underneath. This is the shape that earns the 99.99% SLA and still has a floor when both metro fibre paths fail together. It is also the most expensive: twice the cross-connects, twice the metro fibre runs, twice the customer-side routers, plus the VPN hourly charge. For a workload whose outage costs the business more than a month of MRC per hour, the sums work.

The key pieces to get correct:

Direct Connect Gateway sits in front of the VIFs and stitches them to a Transit Gateway (or to VGWs). Without the DXGW, the VIFs would each associate with one target; the DXGW is what lets both VIFs reach the same TGW and advertise the same set of prefixes.
Transit VIFs, not private VIFs, when the target is a Transit Gateway. A private VIF associates directly with a VGW and can’t attach to a TGW through a DXGW; a transit VIF is the TGW-aware shape.
Consistent prefix advertisement from both sides. On-prem advertises the same on-prem summaries on both VIFs; AWS advertises the same VPC prefixes on both. BGP picks the best path using standard attributes; the whole point of the design is that those attributes come out the way we want.
BGP community tags for preference, not AS-path prepending alone. AWS publishes a set of Direct Connect local-preference BGP communities (7224:7100 for low preference, 7224:7200 for medium, 7224:7300 for high) that let us influence AWS’s path selection from our side. Tagging the primary VIF’s advertisements 7224:7300 and the secondary 7224:7100 is cleaner than prepending because it’s explicit and doesn’t pollute AS-path length.
Return-path symmetry matters less than you’d think for TCP, more than you’d think for stateful firewalls on-prem. Make sure our side also prefers the primary; it’s easy to configure AWS’s preferences and forget that without matching local-pref on our side, return traffic goes out the secondary and the stateful firewall drops the packets.
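The preference and symmetry points can be checked with a toy model before touching a router. The sketch below is a simplification, not AWS's actual route selection: it assumes identical prefixes on every path, ranks any DX path above VPN, and then orders DX paths by community tag. The `Path` class, the function names, and the on-prem local-pref values are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional

# AWS documents these tags as low / medium / high preference.
# The integers here exist only to order them in this model.
COMMUNITY_RANK = {"7224:7100": 1, "7224:7200": 2, "7224:7300": 3}


@dataclass(frozen=True)
class Path:
    name: str
    kind: str                        # "dx" or "vpn"
    community: Optional[str] = None  # tag on our advertisements over this path
    onprem_local_pref: int = 0       # hypothetical local-pref set on our routers
    up: bool = True


def aws_choice(paths):
    """Path AWS would use toward us: DX beats VPN, then the higher community wins."""
    live = [p for p in paths if p.up]
    return max(
        live,
        key=lambda p: (p.kind == "dx", COMMUNITY_RANK.get(p.community, 0)),
        default=None,
    )


def onprem_choice(paths):
    """Path our routers would use toward AWS: highest local-pref wins."""
    live = [p for p in paths if p.up]
    return max(live, key=lambda p: p.onprem_local_pref, default=None)


def is_symmetric(paths) -> bool:
    """True when both directions pick the same path."""
    return aws_choice(paths) == onprem_choice(paths)


PATHS = [
    Path("LD5", "dx", "7224:7300", onprem_local_pref=300),
    Path("Telehouse North", "dx", "7224:7100", onprem_local_pref=200),
    Path("VPN", "vpn", None, onprem_local_pref=100),
]

# A misconfiguration: our side prefers Telehouse while AWS still prefers LD5.
SKEWED = [
    replace(p, onprem_local_pref=400) if p.name == "Telehouse North" else p
    for p in PATHS
]

print(is_symmetric(PATHS))
print(is_symmetric(SKEWED))
```

The skewed case is the one that bites stateful firewalls: each side's configuration looks reasonable in isolation, and only comparing the two choices shows the mismatch.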

A worked failover

0930 on a Tuesday: a contractor in Slough hits a duct carrying one of the two metro fibres between the colo and LD5. Both LAG members on that path drop simultaneously; the LAG goes down as one event. BGP on our Router A loses the session with the LD5 Direct Connect router; AWS’s DXGW notices the VIF go down. BGP re-converges:

AWS withdraws the LD5 path for our prefixes and starts sending traffic for us out the Telehouse North VIF.
Our side withdraws AWS prefixes learned on Router A and starts sending to AWS out Router B.
The VPN tunnel stays up but stays unused because AWS prefers any Direct Connect path over a VPN path for the same prefix, so the secondary DX path (tagged 7224:7100) still wins.
End-user impact: roughly 30-60 seconds of BGP convergence, existing TCP sessions on the lost path reset, new connections land on the Telehouse path.

At 1430 a backhoe on the other metro fibre route takes out Telehouse too. Both DX paths are now down. The VPN, which has been sitting idle advertising the same prefixes with a worse BGP preference, is suddenly the only path. BGP re-converges again; throughput drops from ~20 Gbps to ~1.25 Gbps per tunnel but the data plane stays up. The warehouse sync slows; the mobile API keeps serving. At 1715 the LD5 metro is spliced back; the primary DX path returns; BGP re-converges again and everything settles.

Three failover events, one day, no business-visible outage beyond the BGP convergence windows. The VPN paid for itself once; the second DX location paid for itself twice.
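The day can be replayed as a short simulation. This is a minimal sketch under simplifying assumptions: paths are tried in a fixed preference order, convergence is treated as instantaneous, each LAG is taken as 2 x 10 Gbps, and the VPN is counted as a single tunnel at about 1.25 Gbps. The path labels and the `replay` function are made up for this example.

```python
# Most preferred first.
PREFERENCE = ["LD5 LAG", "Telehouse North LAG", "VPN"]

CAPACITY_GBPS = {
    "LD5 LAG": 20.0,
    "Telehouse North LAG": 20.0,
    "VPN": 1.25,  # one tunnel
}

# (time, path, is_up) for each event in the worked failover.
EVENTS = [
    ("09:30", "LD5 LAG", False),
    ("14:30", "Telehouse North LAG", False),
    ("17:15", "LD5 LAG", True),
]


def replay(events):
    """Return (time, active path, capacity in Gbps) after each event."""
    up = {name: True for name in PREFERENCE}
    timeline = []
    for when, path, is_up in events:
        up[path] = is_up
        # The active path is the most preferred one that is still up.
        active = next((p for p in PREFERENCE if up[p]), None)
        timeline.append((when, active, CAPACITY_GBPS.get(active, 0.0)))
    return timeline


for when, active, gbps in replay(EVENTS):
    print(f"{when}  active={active}  capacity={gbps} Gbps")
```

Appending a fourth event that takes the VPN down while both DX paths are out would give an active path of `None`, which is the outage this design is paying to avoid.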

What’s worth remembering

Direct Connect resilience is tiered. One connection, no SLA. Two in the same location, 99.9%. Two in different locations, 99.99%. The SLA is AWS’s commitment; it tells us what shape of architecture AWS will stand behind.
Location diversity is the one that protects against buildings. Everything else inside one AWS Direct Connect location shares that building’s power, cooling, and metro fibre runs. Two connections in LD5 is twice the resilience against optics and routers, once the resilience against LD5 itself.
LAGs aggregate bandwidth, not failure domains. A four-member LAG gives 40 Gbps across one location; it does not give location-level resilience. Pair a LAG with a second LAG elsewhere, or accept that the LAG solves a throughput problem and not a resilience one.
VPN fallback is the cheapest insurance policy. For a few cents an hour, an IPsec tunnel over the internet sits underneath DX advertising the same prefixes with a worse preference. It catches the double-fibre-cut scenario DX alone cannot.
Direct Connect Gateway is how two VIFs reach one TGW. Without a DXGW, a private VIF is pinned to a single VGW, and a transit VIF cannot reach a TGW at all. The DXGW is the abstraction that lets diverse VIFs advertise to the same target.
BGP communities beat AS-path prepending for AWS preference. 7224:7300, 7224:7200, 7224:7100 map to high, medium, and low preference on AWS’s side. Tag advertisements explicitly rather than relying on prepend length.
Make return paths symmetric. On-prem stateful firewalls will drop return traffic that comes back via the secondary if the outbound went via the primary. Local-pref on our side needs to mirror AWS’s preference.
Match tier to risk. 99.99% is a contract; the architecture has to back the contract. If the business hasn’t signed an SLA that demands it, two connections at one location plus a VPN may be the honest floor.

One Direct Connect link is a risk we’ve accepted whether we meant to or not. Two links in one location is a conscious trade: cheaper than full location diversity, limited by the building. Two locations plus VPN is the shape that survives the scenarios that actually take businesses down. The work isn’t picking the fanciest topology, it’s matching the architecture to the SLA the business has signed.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.