How to Hit Four Nines on Direct Connect

January 24, 2029 · 16 min read

Advanced Networking Specialty · ANS-C01 · part of The Exam Room

The situation

An enterprise runs production workloads in AWS us-east-1 with a steady 5 Gbps of traffic flowing to an on-prem data centre in Northern Virginia. The traffic splits into database replication from AWS back to on-prem (for regulatory archiving) and user file sync from on-prem applications into S3. Both are sensitive to sustained throughput loss; neither tolerates multi-hour outages.

Today the connection is a single 10 Gbps dedicated Direct Connect terminating at the Equinix DC2 Direct Connect location in Ashburn. Utilisation averages 50%, peaks around 70% during overnight batch windows. Up eighteen months without incident.

Compliance have asked for a 99.99% connectivity SLA written into the service catalogue, backed by AWS’s own SLA where possible. They want the resilience posture, the AWS configuration that achieves it, and the monthly cost delta against today. A post-mortem that begins “we lost the Direct Connect for six hours because the fibre was cut outside the building” is explicitly not acceptable.

The architect has five candidates on the whiteboard: the status quo (one DX at DC2); DX at DC2 with a Site-to-Site VPN backup over the public internet; two DX connections at Equinix DC2 on different AWS routers; two DX connections split across Equinix DC2 and CoreSite VA1 (a second Direct Connect location in the same metro); and four DX connections arranged as two at DC2 and two at VA1.

What actually matters

Before reaching for a pricing comparison, it’s worth noticing what compliance is actually asking for. “Four nines of connectivity” can mean two different things: it can be a calculated availability on top of whatever topology is in place, or it can be a contractual SLA that AWS signs. The two numbers are related but not the same. AWS publishes a tiered SLA for Direct Connect where the tier available depends on how many connections and how many distinct DX locations are in the deployment. Running a topology that theoretically delivers four nines but doesn’t map to a tier AWS will sign against is not the answer to compliance’s question, because the SLA is the instrument compliance wants to file.
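To make the gap between the two numbers concrete, here is a minimal sketch of a calculated availability. The per-circuit and per-site figures are illustrative assumptions, not AWS numbers, and the model treats failures as independent, which real circuits sharing a conduit are not.

```python
from math import prod

# ASSUMPTIONS: illustrative values only, not published by AWS or any carrier.
CIRCUIT_AVAILABILITY = 0.999   # one DX circuit, end to end
SITE_AVAILABILITY = 0.9995     # one DX location (power, building, meet-me room)


def site_availability(circuits: int) -> float:
    """A site is usable if the site itself is up and at least one circuit is up."""
    all_circuits_down = (1 - CIRCUIT_AVAILABILITY) ** circuits
    return SITE_AVAILABILITY * (1 - all_circuits_down)


def calculated_availability(circuits_per_site: tuple[int, ...]) -> float:
    """Connectivity is lost only when every site is unusable at the same time."""
    return 1 - prod(1 - site_availability(n) for n in circuits_per_site)


for label, shape in [
    ("one DX, one site", (1,)),
    ("two DX, one site", (2,)),
    ("two DX, two sites", (1, 1)),
    ("four DX, two sites", (2, 2)),
]:
    print(f"{label:<20} {calculated_availability(shape):.6%}")
```

Under these assumed inputs, one circuit at each of two sites already calculates above four nines, yet it is not the shape AWS signs 99.99% against. That is the distinction compliance cares about.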

Ownership sits with the network and compliance teams, jointly. The network team cares about failover behaviour, BGP configuration, and the run-of-the-mill maintenance events that AWS schedules on DX routers. The compliance team cares about what AWS will pay credits against. Any answer has to satisfy both: technically operable by the first team, contractually enforceable by the second.

Blast radius of the current single-connection setup is total outage on a fibre cut, a router line-card, or a site-wide event at DC2. The compliance team’s unspoken worry is that last one: the fire-suppression-discharges-across-the-colo scenario, the transformer fire, the contractor cutting the wrong conduit outside the building. Those are rare, but they’ve happened to peer organisations, and the post-mortem is career-defining. Configurations that don’t survive that event don’t answer the question even if they do survive a fibre cut.

Cost shape is the architect’s third worry. Going from one connection to four is a multiplier on port-hours, cross-connects, customer-side router capacity, and Enterprise Support floor requirements. In round numbers, four dedicated 10 Gbps DX ports in Ashburn come to around $6,570/month in port-hours (at the $2.25/hour list rate over 730 hours) before cross-connects and data transfer, compared with the single connection’s ~$1,640/month. Low five figures monthly, all-in, against a written four-nines contract is defensible to a CFO; whether it’s worth it is a business decision, not a network one.

Failure modes are where the five candidates really differ. A fibre cut on one circuit is survivable by anything with more than one circuit. A full DX router maintenance event is survivable by anything with more than one AWS router. A whole-site outage at DC2 is survivable only by configurations with a circuit terminating somewhere that isn’t DC2. A double-site outage is survivable by nothing short of a multi-metro deployment, and AWS’s SLA doesn’t promise to pay out for that either.

Coupling between the technical and contractual answers is the quiet part: AWS documentation on Direct Connect resilience is deliberately precise about what each tier signs against. “Helps prevent complete location failure” is not the same phrase as “resilient to complete location failure”, and the difference between those two phrases is the difference between the 99.9% and 99.99% SLA tiers.

What we’ll filter on

Five filters each candidate must clear:

  1. Survives a single connection failure without losing all connectivity. A fibre cut, a router line-card failure, a patch-panel mishap in the meet-me room.
  2. Survives a whole-site failure: the DX location itself goes dark. Compliance’s actual worry.
  3. Preserves bandwidth under failure at or near the 5 Gbps steady state. A fallback path that caps at 1.25 Gbps per tunnel doesn’t keep the business running.
  4. Carries an AWS-backed SLA matching the target. The tier you’re eligible for is a function of how many connections and how many locations. Theoretical resilience that AWS won’t sign against doesn’t answer compliance’s question.
  5. Has a predictable cost shape. Dedicated 10 Gbps port-hour charges, cross-connect fees, and data transfer out: known quantities, not surprises.

The Direct Connect resilience landscape

AWS publishes three named resilience models in the Direct Connect Resiliency Toolkit: Development and Test, High Resiliency, and Maximum Resiliency, each mapped to a published SLA tier. The DX+VPN backup is a distinct pattern that predates the toolkit and doesn’t map cleanly onto any of them.

1. Single DX at one location. The status quo; AWS calls this Development and Test when deployed deliberately. Zero redundancy at any layer: fibre, patch, AWS device, customer router, and colo power feed are all single points of failure. AWS’s position is explicit: not recommended for production workloads. No meaningful uptime SLA beyond the per-connection credit schedule, which kicks in only below 95%.

2. DX plus Site-to-Site VPN backup. One DX carries traffic normally; a Site-to-Site VPN over the public internet takes over when BGP on the DX virtual interface fails. Failover is automatic. AWS prefers DX-learned routes over VPN-learned routes for the same prefix. Cheapest way to avoid total outage, reasonable for a dev or DR link. But a Site-to-Site VPN tunnel tops out around 1.25 Gbps per tunnel, and even with ECMP across multiple tunnels the realistic ceiling is a few Gbps over the public internet, asymmetric to the 10 Gbps DX in bandwidth and latency both. And the DX SLA tiers reward redundant DX deployments, not DX with internet backup.

3. Two DX at one location. Two 10 Gbps connections at Equinix DC2, ideally on different AWS routers and different customer-side cross-connects. Survives any one connection, line-card, or cable failure. Does not survive a whole-site failure: the fire, power event, or conduit cut takes both connections out together. AWS’s wording is blunt: for production workloads it does not recommend anything other than a multi-site deployment. Every SLA tier above single-connection requires at least two DX locations, so this shape survives connection failure but qualifies for no multi-site SLA tier.

4. Two DX at two locations. One 10 Gbps at Equinix DC2; a second at CoreSite VA1 in the same metro. AWS calls this High Resiliency, and the toolkit describes it as resilient against fibre cuts and device failures, and as a shape that “helps prevent complete location failure”. The deliberately soft wording reflects that with a single connection at each site, a single failure at either site drops that site’s half of the capacity. Eligible for AWS’s Multi-Site Non-Redundant SLA at 99.9% availability. Close to four nines but not at it: a factor of ten short. 99.99% is roughly 52.6 minutes of downtime per year; 99.9% is 8.77 hours.

5. Four DX across two locations. Two connections at DC2, two more at VA1, minimum two per location. AWS calls this Maximum Resiliency: resilient against device failures, connectivity failures, and complete location failures. The only configuration eligible for the Multi-Site Redundant SLA at 99.99% availability. An extra requirement worth naming: this SLA tier requires an Enterprise Support plan plus a completed Well-Architected Review with AWS on the deployment. The 99.99% isn’t automatic; it’s contingent on demonstrable architecture.
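The downtime figures quoted for the two multi-site tiers can be checked with a few lines of arithmetic. This is a sketch that assumes a 365.25-day year; AWS’s SLA document defines its own measurement period, so treat the output as an annualised illustration rather than the contractual calculation.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours, averaging in leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR * 60


for tier, pct in [
    ("Multi-Site Non-Redundant", 99.9),
    ("Multi-Site Redundant", 99.99),
]:
    minutes = downtime_minutes_per_year(pct)
    print(f"{tier}: {pct}% allows {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

The factor of ten between the tiers is the whole difference: one extra nine divides the annual downtime budget by ten.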

Side by side

Configuration | Survives single-connection failure | Survives whole-site failure | Bandwidth under failure | AWS SLA tier | Cost shape
Single DX | No | No | 0 Gbps | None (per-conn. credits) | 1 port-hour
DX + VPN backup | Yes (VPN only) | Yes (VPN only) | ~1-2 Gbps | None (unscored combo) | 1 port-hour + VPN
Two DX, one site | Yes | No | 10 Gbps | None (single-site) | 2 port-hours
Two DX, two sites | Yes | Yes | 10 Gbps | 99.9% (Multi-Site Non-Redundant) | 2 ports + 2 XCs
Four DX, two sites | Yes | Yes | 10-30 Gbps | 99.99% (Multi-Site Redundant) | 4 ports + 4 XCs

Only configuration 5 earns all five ticks for a written four-nines target. Configuration 4 is a near-miss: it survives both failure modes and preserves full bandwidth, but its SLA tier is 99.9%, a factor of ten short. For teams that can accept three nines with real site resilience as a pragmatic midpoint, configuration 4 is the honest stepping-stone.
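The mapping from shape to SLA tier can be written down as a small function. This is a sketch of the rule as this post describes it, not AWS’s contractual logic: the `Topology` type and `sla_tier` function are hypothetical names, and the check covers shape only, leaving out the Enterprise Support and Well-Architected Review preconditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    name: str
    dx_per_location: tuple[int, ...]  # dedicated DX connections at each DX location


def sla_tier(topology: Topology) -> str:
    """Return the multi-site tier the shape is eligible for (shape check only)."""
    sites = [n for n in topology.dx_per_location if n > 0]
    redundant_sites = sum(1 for n in sites if n >= 2)
    if redundant_sites >= 2:
        return "99.99% (Multi-Site Redundant)"
    if len(sites) >= 2:
        return "99.9% (Multi-Site Non-Redundant)"
    return "None (no multi-site tier)"


CANDIDATES = [
    Topology("Single DX", (1,)),
    Topology("DX + VPN backup", (1,)),  # the VPN does not count as a DX connection
    Topology("Two DX, one site", (2,)),
    Topology("Two DX, two sites", (1, 1)),
    Topology("Four DX, two sites", (2, 2)),
]

for candidate in CANDIDATES:
    print(f"{candidate.name:<20} {sla_tier(candidate)}")
```

Counting locations before counting circuits is the design point: a second circuit at the same site changes nothing in the result, while a second site does.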

Matching the topologies

[Figure: four topologies, each drawn from the customer site to the DX location(s). 1. Development / Test (one DX, one location): no meaningful SLA, single point of failure at every layer; a fibre cut, device failure, or site failure leaves 0 Gbps. 2. Single-site redundant (two DX, one location, diverse routers on both sides): survives connection failure but not site failure; no multi-site SLA. 3. High Resiliency (one DX at each of two locations): survives site failure; one DX failing leaves 10 Gbps on the other site, a 50% capacity loss; 99.9% Multi-Site Non-Redundant SLA. 4. Maximum Resiliency (two DX at each of two locations): survives connection and site failure with full bandwidth preserved; 99.99% Multi-Site Redundant SLA.]
Four topologies stacked for comparison. Solid lines carry traffic under normal conditions; red annotations mark the failure mode each topology fails to survive. Only the Maximum Resiliency shape, two connections at each of two locations, earns the 99.99% Multi-Site Redundant SLA tier.

Maximum Resiliency, in depth

Configuration 5 is four 10 Gbps connections arranged as at least two at each of at least two DX locations. AWS specifies minimums, not a fixed count; larger deployments scale further, but these are the floor.

The physical setup. Two circuits at DC2 terminating on different AWS routers with diverse cross-connects and diverse customer-side cabling; two more at VA1 with the same within-site diversity. From the on-prem data centre, at least two physically diverse fibre routes to each DX location: different conduits, different floors if the meet-me rooms are split. A single contractor cutting a single trench cannot take out both circuits at one location.

The BGP behaviour. The customer prefixes are advertised over all four virtual interfaces; longest-prefix match and BGP attribute evaluation apply across them. The team can tag routes with AWS’s local-preference BGP communities (7224:7100, 7224:7200, 7224:7300) to bias traffic: for instance, preferring same-region circuits over cross-region failover paths when a Direct Connect Gateway spans regions. For this single-region scenario, default ECMP across all four is the common answer; failure of any one drops its share onto the remaining three.
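How the load moves when a circuit drops out of the ECMP set can be sketched as below. It assumes a perfectly even split, which real per-flow hashing only approximates, and `per_circuit_load` is a hypothetical helper, not part of any AWS tooling.

```python
DEMAND_GBPS = 5.0  # steady-state demand from the scenario


def per_circuit_load(demand_gbps: float, active_circuits: int) -> float:
    """Idealised ECMP: demand divided evenly across the active circuits."""
    if active_circuits < 1:
        raise ValueError("no active circuits left to carry traffic")
    return demand_gbps / active_circuits


for active in (4, 3, 2, 1):
    load = per_circuit_load(DEMAND_GBPS, active)
    print(f"{active} active circuit(s): {load:.2f} Gbps each")
```

Even with a single circuit left, the per-circuit load stays at half of a 10 Gbps port, which is why no traffic engineering is needed for this scenario.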

What it survives. Single circuit failure: three circuits remain, 30 Gbps of capacity against 5 Gbps of demand. Site failure: two circuits remain at the surviving location, 20 Gbps. Double failure with one circuit lost at each site: 20 Gbps remain, still sufficient. The scenario it does not survive, a simultaneous outage of both DX locations, is one AWS’s own 99.99% SLA doesn’t promise to pay out for either.
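The survival cases can be enumerated rather than argued. The sketch below uses hypothetical circuit labels (two per site) and simply counts the capacity left standing in each scenario.

```python
CIRCUIT_GBPS = 10
DEMAND_GBPS = 5
# Hypothetical labels: two circuits at each of the two DX locations.
CIRCUITS = {"DC2-a": "DC2", "DC2-b": "DC2", "VA1-a": "VA1", "VA1-b": "VA1"}


def site_circuits(site: str) -> set[str]:
    """All circuits terminating at the given DX location."""
    return {name for name, location in CIRCUITS.items() if location == site}


def surviving_gbps(failed: set[str]) -> int:
    """Aggregate capacity of the circuits that are still up."""
    return CIRCUIT_GBPS * sum(1 for name in CIRCUITS if name not in failed)


SCENARIOS = {
    "single circuit failure": {"DC2-a"},
    "DC2 site failure": site_circuits("DC2"),
    "one circuit lost per site": {"DC2-a", "VA1-a"},
    "both sites down": set(CIRCUITS),
}

for label, failed in SCENARIOS.items():
    left = surviving_gbps(failed)
    verdict = "ok" if left >= DEMAND_GBPS else "OUTAGE"
    print(f"{label:<28} {left:>2} Gbps left  {verdict}")
```

Only the dual-site case falls below demand, matching the one scenario the SLA itself excludes.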

What it requires beyond four ports. The Multi-Site Redundant SLA tier has specific contractual preconditions: an Enterprise Support plan, a completed Well-Architected Review on the Direct Connect deployment, and conformance with all SLA document requirements. Four connections in the correct shape are necessary but not sufficient; the review is AWS’s validation step.

What it costs. Four 10 Gbps dedicated port-hours, four cross-connect fees at two separate colos, four sets of customer-side router ports, and the Enterprise Support floor. In round numbers for Ashburn, four 10 Gbps DX port-hours come to around $6,570/month (list $2.25/hour × 730 hours × 4) before cross-connects and data transfer. All-in, the move from one circuit to four is low five figures monthly: the number to defend to the CFO as the cost of written four nines.
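The port-hour arithmetic is worth running explicitly, since it is the one cost line that scales exactly with circuit count. This sketch takes the list rate and the 730-hour month quoted above as given and leaves out cross-connects, data transfer, and support, which vary by colo and contract.

```python
PORT_HOUR_USD = 2.25   # list rate quoted above for a dedicated 10 Gbps port
HOURS_PER_MONTH = 730  # average month length used in the text


def monthly_port_cost(ports: int) -> float:
    """Port-hour charges only: no cross-connect, data transfer, or support."""
    return ports * PORT_HOUR_USD * HOURS_PER_MONTH


today = monthly_port_cost(1)
target = monthly_port_cost(4)
print(f"one port:   ${today:,.2f}/month")
print(f"four ports: ${target:,.2f}/month")
print(f"delta:      ${target - today:,.2f}/month")
```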

A worked example: the DC2 transformer fire

Pick the Maximum Resiliency deployment and play through what happens when Equinix DC2 suffers a transformer fire at 02:14 on a Tuesday and loses both mains and UPS within ninety seconds.

At second 0 the fire alarm trips and the AWS routers at DC2 lose power; both DC2 virtual interfaces stop sending BGP keepalives. Traffic in flight on those circuits either drains or drops depending on queue depth.

Between seconds 2 and 30, the customer-side routers declare the DC2 BGP sessions down and reconverge. That window assumes BFD is deployed: with a 300 ms interval and a multiplier of 3, the loss is detected in about 900 ms. Relying on the BGP hold timer alone (default 180 seconds) would push detection well past it. Routes via the DC2 virtual interfaces are withdrawn from the customer FIB, and traffic re-hashes across the two surviving virtual interfaces at VA1. Aggregate demand is 5 Gbps; aggregate capacity on VA1 is 20 Gbps. No congestion, no capacity drops.
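The detection-time gap is simple arithmetic: BFD declares a neighbour down after `multiplier` consecutive missed packets at the configured interval. The sketch below compares that with the hold-timer figure quoted above; the function name is hypothetical and the values come from the text.

```python
BGP_HOLD_TIMER_S = 180  # default hold timer quoted in the text


def bfd_detection_ms(interval_ms: int, multiplier: int) -> int:
    """Worst-case BFD detection time: missed packets times the interval."""
    return interval_ms * multiplier


bfd_ms = bfd_detection_ms(interval_ms=300, multiplier=3)
slowdown = BGP_HOLD_TIMER_S * 1000 / bfd_ms

print(f"BFD detection:    {bfd_ms} ms")
print(f"Hold timer alone: {BGP_HOLD_TIMER_S} s ({slowdown:.0f}x slower)")
```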

Over minutes 0 to 30, TCP flows in flight on the DC2 circuits experience retransmission; applications see elevated latency variance but no connection resets, because source and destination IPs haven’t changed, only the path through AWS.

Over hours 1 to 24, DC2 stays dark, VA1 carries the entire 5 Gbps on two active circuits. Monitoring shows the DC2 virtual interfaces down and CloudWatch ConnectionState at zero for both. When DC2 power is restored, BGP re-establishes and traffic re-hashes across all four circuits, a small rebalancing transient, applications unaffected.
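The ConnectionState signal mentioned above is the natural thing to alarm on. Below is a sketch that builds the parameters for one CloudWatch alarm per DX connection; the connection IDs and SNS topic ARN are hypothetical placeholders, and the threshold and period are assumptions to tune, not recommendations.

```python
def connection_state_alarm(connection_id: str, sns_topic_arn: str) -> dict:
    """Parameters for an alarm that fires when a DX connection reports down."""
    return {
        "AlarmName": f"dx-{connection_id}-down",
        "Namespace": "AWS/DX",
        "MetricName": "ConnectionState",  # 1 = up, 0 = down
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "Statistic": "Minimum",
        "Period": 60,                     # seconds; assumed, tune as needed
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # assumption: no data is treated as down
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical identifiers, for illustration only.
CONNECTION_IDS = ["dxcon-dc2a0001", "dxcon-dc2b0002", "dxcon-va1a0003", "dxcon-va1b0004"]
TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:dx-alerts"

alarms = [connection_state_alarm(cid, TOPIC_ARN) for cid in CONNECTION_IDS]
for alarm in alarms:
    print(alarm["AlarmName"])
```

Each dictionary can be passed to `put_metric_alarm(**alarm)` on a boto3 CloudWatch client. Keeping the parameter builder separate from the API call lets the alarm definitions be reviewed and tested without AWS credentials.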

On the SLA math: traffic flowed on VA1 throughout, so customer connectivity was never interrupted. The Direct Connect SLA measures service availability, at least one healthy virtual interface pair across the deployment, which was never breached. No SLA credit triggers. An AWS infrastructure event that cost the business no service-level outage: exactly what the 99.99% tier was bought to deliver.

What’s worth remembering

  1. AWS’s three named resilience models map to three SLA tiers. Development and Test (single connection), no meaningful SLA; High Resiliency (one connection per site × multiple sites), 99.9% Multi-Site Non-Redundant SLA; Maximum Resiliency (at least two connections per site × at least two sites), 99.99% Multi-Site Redundant SLA.
  2. The Multi-Site Redundant SLA requires Enterprise Support plus a Well-Architected Review. Four connections in the correct shape are necessary but not sufficient to claim the 99.99% tier. AWS’s validation is part of the SLA instrument.
  3. Two DX connections at one location is not a recognised Multi-Site tier. Survives connection failure but not site failure, qualifies for no multi-site SLA, a common configuration mistake when cost-constrained architects skip the second colo.
  4. DX + VPN backup is a different class of solution from DX + DX. Useful where the secondary can run at a lower bandwidth (DR links, dev, small branches) but asymmetric in bandwidth and latency, and qualifies for no multi-site DX SLA tier.
  5. BGP behaviour on redundant DX is active/active by default. ECMP across virtual interfaces with matching local preference. Failure detection is sub-second with BFD; without it, detection waits on the BGP hold timer (default 180 seconds).
  6. Local preference tagging via AWS BGP communities (7224:7100 low, 7224:7200 medium, 7224:7300 high) is the knob for biasing traffic across DX connections. Used to prefer DX over VPN, or same-region over cross-region via a Direct Connect Gateway.
  7. The Direct Connect SLA measures service availability, at least one healthy virtual interface pair across the deployment. Individual circuit outages within a redundant posture don’t trigger credits; only loss of DX service as a whole does.
  8. 99.99% is 52.6 minutes/year; 99.9% is 8.77 hours/year. The factor-of-ten difference is what separates Multi-Site Redundant from Multi-Site Non-Redundant and why the shape matters, not just the number of circuits.
  9. The whole-site-failure mode is the one compliance is usually worrying about. Any configuration that doesn’t survive it isn’t answering the four-nines question, regardless of how many circuits it has.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.