Choosing Between Cloud WAN and a Transit Gateway Mesh

January 31, 2029 · 13 min read

Advanced Networking · ANS-C01 · part of The Exam Room

The situation

The network today:

  • 30 Regions, each with a Transit Gateway. VPCs attach to the local TGW; the TGWs are peered in a partial mesh (full mesh would be 435 peerings, which nobody wants). The partial mesh carries most inter-region traffic through three “hub” Regions – us-east-1, eu-west-1, ap-southeast-1.
  • Branch offices connect via Site-to-Site VPN to the nearest hub’s TGW or via SD-WAN appliances also attached to the hub TGWs.
  • Segmentation is enforced by TGW route tables. Three broad domains today (prod, nonprod, shared) are implemented as separate TGW route tables per Region, with propagations and static routes configured per attachment. Prefix filtering on peering attachments controls what crosses Regions.
  • Operational reality: adding a Region takes a fortnight. Peerings, route tables, association and propagation logic, prefix filters, VPN re-homing: all of it is scripted, but the script has grown to the point that nobody fully owns it.

The business wants to open three more Regions this quarter. The network team wants a sit-down.
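To put a number on why nobody wants the full mesh, here is a minimal sketch of the peering arithmetic. The function name is only illustrative; the figures are the 30 Regions we have today and the 33 we would have after this quarter.

```python
def full_mesh_peerings(regions: int) -> int:
    """Point-to-point peerings needed to connect every TGW to every other TGW."""
    return regions * (regions - 1) // 2


for n in (30, 33):
    print(n, "Regions ->", full_mesh_peerings(n), "peerings")
# 30 Regions -> 435 peerings
# 33 Regions -> 528 peerings
```

Three extra Regions would add 93 peerings to a full mesh, which is the pressure that keeps pushing the topology towards hubs.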

Options in scope:

  • Status quo, but better scripts: lean into Terraform modules and accept the operational tax.
  • AWS Cloud WAN: the managed global network with segments defined in JSON policy.
  • Third-party SD-WAN overlay: a vendor-managed mesh on top of AWS transport.
  • Flatten segmentation: collapse the route-table complexity by giving up some isolation.

What actually matters

Before picking, it’s worth asking what the TGW mesh is actually costing and what we’d gain from replacing it.

The first cost is configuration distance from intent. The business asks for “production is isolated from non-production globally.” Delivering that intent today means maintaining 30 TGWs × 3 route tables × N attachments, plus peering-attachment prefix filters, plus propagation rules. Auditing whether a given production VPC in ap-northeast-2 can actually reach a non-production VPC in sa-east-1 requires walking five route tables and two peering filters. The intent-to-configuration ratio is terrible.

The second cost is blast radius of per-Region change. A change to segmentation, say, adding a fourth domain for a new regulatory boundary, touches every TGW. Thirty change windows, thirty rollback plans, and a non-trivial chance of inconsistency. There’s no single place where “our global segmentation” lives.
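A rough way to see the first two costs is to count the objects involved. This is a sketch only: the class and method names are hypothetical, and it models just the figures stated above (one TGW per Region, one route table per domain per TGW), ignoring attachments and peering filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshEstate:
    regions: int  # one Transit Gateway per Region
    domains: int  # one TGW route table per domain on each TGW

    def route_tables(self) -> int:
        return self.regions * self.domains

    def tgws_touched_by_new_domain(self) -> int:
        # A new domain needs a new route table on every TGW.
        return self.regions


today = MeshEstate(regions=30, domains=3)
with_fourth_domain = MeshEstate(regions=30, domains=4)

print(today.route_tables())                # 90
print(with_fourth_domain.route_tables())   # 120
print(today.tgws_touched_by_new_domain())  # 30
```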

The third cost is inter-Region attachment modelling. TGW peering is point-to-point. A hub-and-spoke routed through us-east-1 works, but every prefix that crosses Regions is explicitly routed via the hub’s TGW route table, and transit through us-east-1 for eu-west-2 to ap-southeast-1 traffic is a latency and dollar cost that accumulates.

The fourth cost is what we’d give up to flip the abstraction. Replacing the mesh with a single global object means a different way of expressing segmentation: not per-Region route tables, but a policy document that defines isolation across the whole estate. The trade-off is flexibility: any pattern that relies on per-attachment routing tricks (asymmetric peerings, selective prefix leaking between route tables) takes more thought in a policy-defined world, and the team has to learn the new operational model and policy-as-code flow before the operational saving lands.

What we’ll filter on

  1. Global single object: one thing to reason about across all Regions, not N-per-Region.
  2. Policy-defined segmentation: segments described in code, applied atomically.
  3. Automatic inter-Region routing: no manual peering mesh.
  4. Branch integration: VPN and SD-WAN attach naturally.
  5. Migration path from TGW: can we get there without a flag day?

The global-network landscape

1. Transit Gateway mesh (status quo). Per-Region TGWs, partial-mesh peering, route tables per Region per domain. Mature, well-understood, but doesn’t scale cleanly past roughly 10-15 Regions before the operational tax becomes untenable. Everything is explicit; nothing is global.

2. AWS Cloud WAN. A single core network object spanning Regions. Segments are the segmentation primitive; each segment has attachments, sharing rules, and routing behaviour defined in the core network policy. Edge locations are Regions the core network extends into, chosen at core-network creation and updateable via policy. Attachments come in flavours: VPC, Site-to-Site VPN, Transit Gateway (for hybrid migration), SD-WAN via Connect attachments. Policy is a JSON document applied atomically; changes go through a two-step flow (create change set, execute change set) so you can see the delta before committing.

3. Third-party SD-WAN overlay. Vendor appliances in each Region VPC, running a control plane across public or Direct Connect transport. Replaces AWS’s inter-Region connectivity with the vendor’s. Useful when the same vendor already runs the on-prem WAN and we want one pane of glass. Adds appliance licensing, throughput licensing, and a data-path the vendor controls rather than AWS.

4. Flat network, less segmentation. Collapse the three domains into one and eliminate the per-Region route-table duplication. The simplest answer; also the wrong one in any serious organisation, because regulated workloads need real isolation and flat networks are audit nightmares.
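The two-step policy flow mentioned for Cloud WAN (option 2) can be sketched roughly as below. The assumptions: `client` is a boto3 Network Manager client (for example `boto3.client("networkmanager")`), the function names are hypothetical, and the sketch skips waiting for the change set to finish generating, which a real script would need before executing.

```python
import json
from typing import Any


def propose_policy(client: Any, core_network_id: str, policy: dict) -> tuple:
    """Step 1: upload a new policy version and fetch the delta it would cause."""
    resp = client.put_core_network_policy(
        CoreNetworkId=core_network_id,
        PolicyDocument=json.dumps(policy),
    )
    version_id = resp["CoreNetworkPolicy"]["PolicyVersionId"]
    # Change-set generation is asynchronous; a real script would poll here
    # until the change set is ready rather than reading it straight away.
    changes = client.get_core_network_change_set(
        CoreNetworkId=core_network_id,
        PolicyVersionId=version_id,
    )["CoreNetworkChanges"]
    return version_id, changes


def apply_policy(client: Any, core_network_id: str, version_id: int) -> None:
    """Step 2: execute the reviewed change set for that policy version."""
    client.execute_core_network_change_set(
        CoreNetworkId=core_network_id,
        PolicyVersionId=version_id,
    )
```

Keeping the two steps as separate functions leaves room for a human or a CI check to look at `changes` before anything is applied.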

Side by side

Option Global single object Policy-defined segmentation Automatic inter-Region routing Branch integration TGW migration path
TGW mesh Per-Region only ✗ (manual peering) n/a
Cloud WAN ✓ (TGW attachment)
SD-WAN overlay Vendor-defined Vendor-defined ✓ (vendor fabric) Parallel
Flat network ✓ (trivially)

The Cloud WAN policy in a picture

[Figure: Cloud WAN core network, global-network-id: global-network-abc123. Policy: three segments, attach via attachment-policies, share via share-with.]

  • eu-west-1 (CNE, ASN 64512): VPC prod-eu in segment prod; VPC dev-eu in segment nonprod; VPN attachment (branch offices) in segment shared.
  • us-east-1 (CNE, ASN 64513): VPC prod-us in segment prod; VPC dev-us in segment nonprod; VPC shared-svcs-us in segment shared.
  • ap-southeast-1 (CNE, ASN 64514): VPC prod-ap in segment prod; VPC dev-ap in segment nonprod.

core-network-policy excerpt:

  "segments": [
    { "name": "prod", "edge-locations": ["eu-west-1","us-east-1","ap-southeast-1"], "isolate-attachments": false },
    { "name": "nonprod", "edge-locations": ["eu-west-1","us-east-1","ap-southeast-1"], "isolate-attachments": false },
    { "name": "shared", "edge-locations": ["eu-west-1","us-east-1"], "isolate-attachments": false }
  ],
  "segment-actions": [
    { "action": "share", "segment": "shared", "share-with": ["prod","nonprod"], "mode": "attachment-route" }
  ],
  "attachment-policies": [
    { "rule-number": 100, "condition-logic": "or",
      "conditions": [{"type":"tag-value","operator":"equals","key":"env","value":"prod"}],
      "action": { "association-method": "constant", "segment": "prod" } }
  ]

Three edge locations, three segments, one policy document. Shared services shares routes into prod and nonprod; prod and nonprod remain isolated from each other without per-Region route-table copy-paste.

The pick(s) in depth

Cloud WAN for the global backbone, keeping TGWs where they still earn their place. The realistic path is not a big-bang migration; it’s a gradual one.

The core network lives as a JSON policy document. The minimum viable policy for our situation, written out here in YAML form for readability:

version: "2021.12"            # quoted so it stays a string, not a float
core-network-configuration:
  asn-ranges: ["64512-64555"]
  edge-locations:
    - location: eu-west-1
    - location: us-east-1
    - location: ap-southeast-1
segments:
  - name: prod
    edge-locations: [eu-west-1, us-east-1, ap-southeast-1]
    isolate-attachments: false
  - name: nonprod
    edge-locations: [eu-west-1, us-east-1, ap-southeast-1]
    isolate-attachments: false
  - name: shared
    edge-locations: [eu-west-1, us-east-1]
    isolate-attachments: false
segment-actions:
  # shared services are reachable from prod and nonprod
  - action: share
    mode: attachment-route
    segment: shared
    share-with: [prod, nonprod]
attachment-policies:
  # match on the attachment's env tag; tag-value compares key and value
  - rule-number: 100
    condition-logic: or
    conditions:
      - type: tag-value
        operator: equals
        key: env
        value: prod
    action:
      association-method: constant
      segment: prod
  - rule-number: 200
    condition-logic: or
    conditions:
      - type: tag-value
        operator: equals
        key: env
        value: nonprod
    action:
      association-method: constant
      segment: nonprod
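Since the policy Cloud WAN actually accepts is JSON, one way to keep it reviewable is to build it in code and serialise it. The sketch below is a minimal version under a few assumptions: the helper names `segment` and `tag_rule` are hypothetical, and the tag condition is written as `tag-value` with an `equals` operator, which is my understanding of how the policy schema matches a key and a value together.

```python
import json

REGIONS = ["eu-west-1", "us-east-1", "ap-southeast-1"]


def segment(name: str, edge_locations: list) -> dict:
    return {"name": name, "edge-locations": edge_locations, "isolate-attachments": False}


def tag_rule(rule_number: int, env: str) -> dict:
    # Associates any attachment tagged env=<env> with the segment of the same name.
    return {
        "rule-number": rule_number,
        "condition-logic": "or",
        "conditions": [
            {"type": "tag-value", "operator": "equals", "key": "env", "value": env}
        ],
        "action": {"association-method": "constant", "segment": env},
    }


policy = {
    "version": "2021.12",
    "core-network-configuration": {
        "asn-ranges": ["64512-64555"],
        "edge-locations": [{"location": region} for region in REGIONS],
    },
    "segments": [
        segment("prod", REGIONS),
        segment("nonprod", REGIONS),
        segment("shared", ["eu-west-1", "us-east-1"]),
    ],
    "segment-actions": [
        {
            "action": "share",
            "mode": "attachment-route",
            "segment": "shared",
            "share-with": ["prod", "nonprod"],
        }
    ],
    "attachment-policies": [tag_rule(100, "prod"), tag_rule(200, "nonprod")],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Adding a Region in this form is a one-line change to `REGIONS`, which is the whole appeal of the single-object model.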

The four patterns that earn understanding:

Segments define isolation by default. A segment is a distinct routing domain. Attachments in the same segment can reach each other (across Regions, automatically, no peering to configure). Attachments in different segments cannot, unless a segment-actions rule shares routes between them.
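The isolation rule is small enough to model directly. The sketch below is a toy, not Cloud WAN's implementation: `segments_connected` is a hypothetical helper, it treats a share rule as a link between the sharing segment and each segment it lists (an assumption of this sketch), and it ignores `isolate-attachments`. It deliberately does not follow chains of shares.

```python
POLICY = {
    "segment-actions": [
        {"action": "share", "segment": "shared", "share-with": ["prod", "nonprod"]},
    ]
}


def segments_connected(policy: dict, a: str, b: str) -> bool:
    """True if attachments in segment a and segment b can reach each other."""
    if a == b:
        return True  # same routing domain
    for action in policy.get("segment-actions", []):
        if action.get("action") != "share":
            continue
        owner = action["segment"]
        targets = action.get("share-with", [])
        # Only a direct share links two segments; there is no transitive hop.
        if (a == owner and b in targets) or (b == owner and a in targets):
            return True
    return False


print(segments_connected(POLICY, "prod", "prod"))     # True
print(segments_connected(POLICY, "prod", "shared"))   # True
print(segments_connected(POLICY, "prod", "nonprod"))  # False
```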

share-with is not transitive. shared → [prod, nonprod] exchanges routes between shared and each listed segment: prod and nonprod learn the shared-services prefixes, and shared learns theirs, so return traffic has a path. What the rule does not do is leak routes between the listed segments: prod never learns nonprod’s routes through shared, and nonprod never learns prod’s. That is the point: a shared DNS resolver or monitoring VPC should be reachable from everywhere without the everywhere being able to reach each other through it.

Attachment policies auto-associate VPCs to segments by tag. A VPC attachment lands in prod because its CloudFormation template set env=prod on the attachment itself. No manual association step, no per-Region route-table editing. Spin up a new VPC in eu-central-2 next week, tag the attachment, attach it, and it inherits the segment and all the routing, Region-to-Region included (provided eu-central-2 is already an edge location for that segment).
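The tag-to-segment mapping behaves like a first-match rule list. Here is a minimal sketch of that evaluation, assuming rules are tried in ascending rule-number order and the first match wins. The flattened rule shape and the `segment_for` helper are simplifications for illustration, not the policy schema itself.

```python
from typing import Optional

RULES = [
    {"rule-number": 200, "key": "env", "value": "nonprod", "segment": "nonprod"},
    {"rule-number": 100, "key": "env", "value": "prod", "segment": "prod"},
]


def segment_for(tags: dict, rules: list) -> Optional[str]:
    """Return the segment for an attachment's tags, or None if no rule matches."""
    for rule in sorted(rules, key=lambda r: r["rule-number"]):
        if tags.get(rule["key"]) == rule["value"]:
            return rule["segment"]
    return None


print(segment_for({"env": "prod"}, RULES))                       # prod
print(segment_for({"env": "nonprod", "team": "data"}, RULES))    # nonprod
print(segment_for({"team": "data"}, RULES))                      # None
```

An attachment that matches no rule gets no segment in this sketch, which is a useful reminder to decide what should happen to untagged attachments.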

TGW attachments bridge the mesh. A TGW can be attached to the core network as a first-class attachment. During migration, the existing TGW mesh sits in a dedicated segment with share-with rules that match the old topology; new VPCs come up attached directly to Cloud WAN; old VPCs migrate one at a time by detaching from TGW and attaching to Cloud WAN. No flag day.

A worked migration step

Monday morning, we cut over a single non-critical VPC in ap-northeast-2 from the TGW mesh to Cloud WAN. The VPC currently has a TGW attachment with routes to 10.0.0.0/8 via the local TGW, and that TGW is peered to the ap-southeast-1 hub TGW, which in turn peers to us-east-1 and eu-west-1.

The cutover steps:

  1. Tag the attachment. Set env=nonprod in the Cloud WAN attachment configuration; the attachment policy matches tags on the attachment, not on the VPC.
  2. Create the Cloud WAN VPC attachment. Attach the VPC to the core network in ap-northeast-2 (this assumes ap-northeast-2 has already been added as an edge location for the nonprod segment; the three-Region policy above would need that edit first). The attachment lands in segment nonprod by rule 200. Routes start propagating: the nonprod segment learns the VPC’s prefix, and the attachment can reach all other nonprod attachments’ prefixes globally, plus shared services’ prefixes via share-with.
  3. Update VPC route tables. Replace the TGW target in the VPC’s subnet route tables with the core network. Two routes swap over atomically within the VPC.
  4. Remove the TGW attachment. Once the new path is verified, detach from the local TGW. The TGW mesh loses one endpoint.
  5. Validate. aws networkmanager get-network-routes on the segment shows the VPC’s prefix propagating globally. A ping from eu-west-1 nonprod lands directly, and the migrated VPC can reach shared services in us-east-1. A ping from us-east-1 prod should not land, because prod and nonprod stay isolated.

No peering edits, no route-table surgery across multiple TGWs, no change windows in three other Regions. The operational cost of the migration is in this one account, and every subsequent VPC migration is almost the same script.
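As a rough illustration of how the cutover steps above might look as a repeatable script, here is a sketch built on boto3-style clients passed in by the caller. Assumptions: `nm` is a Network Manager client and `ec2` is an EC2 client in the VPC's Region, the function names and all IDs are hypothetical, and waiting for the attachment to become available and verifying the new path are left out.

```python
from typing import Any, Sequence


def attach_and_reroute(nm: Any, ec2: Any, *, core_network_id: str,
                       core_network_arn: str, vpc_arn: str,
                       subnet_arns: Sequence[str],
                       route_table_ids: Sequence[str],
                       cidrs: Sequence[str], env: str) -> str:
    """Steps 1-3: create the tagged attachment, then repoint the VPC routes."""
    resp = nm.create_vpc_attachment(
        CoreNetworkId=core_network_id,
        VpcArn=vpc_arn,
        SubnetArns=list(subnet_arns),
        Tags=[{"Key": "env", "Value": env}],  # the tag the attachment policy matches
    )
    attachment_id = resp["VpcAttachment"]["Attachment"]["AttachmentId"]
    # A real script would wait here until the attachment is available.
    for route_table_id in route_table_ids:
        for cidr in cidrs:
            ec2.replace_route(
                RouteTableId=route_table_id,
                DestinationCidrBlock=cidr,
                CoreNetworkArn=core_network_arn,
            )
    return attachment_id


def remove_tgw_attachment(ec2: Any, tgw_attachment_id: str) -> None:
    """Step 4: run only after the new path has been verified."""
    ec2.delete_transit_gateway_vpc_attachment(
        TransitGatewayAttachmentId=tgw_attachment_id
    )
```

Splitting step 4 into its own function keeps the old TGW path available as a rollback until someone has confirmed the new one works.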

What’s worth remembering

  1. Cloud WAN is one global object; TGW is N regional objects. A core network spans Regions; adding a Region is an edge-location edit in the policy. TGWs are per-Region and peer point-to-point; adding a Region is a mesh-edit exercise.
  2. Segments are the isolation primitive, not route tables. Attachments live in one segment. Segments are isolated by default. Cross-segment connectivity is explicit via segment-actions with share-with.
  3. share-with is not transitive. Sharing shared with prod and nonprod connects each of them to shared; it does not connect prod to nonprod. This is what makes shared-services VPCs clean.
  4. Attachment policies auto-associate via tags. Tag the VPC attachment with env=prod, the policy puts it in the prod segment. No manual association step per VPC.
  5. Edge locations are Regions; CNEs are the managed infrastructure inside them. AWS stands up the Core Network Edge automatically when a Region is added to the policy. You don’t run appliances.
  6. Cloud WAN accepts VPC, VPN, TGW, and Connect attachments. TGW attachment is the migration bridge: keep the existing mesh running as a segment while new VPCs go direct to Cloud WAN.
  7. Policy changes go via change sets. Create, review the delta, execute. The change is atomic across edge locations, no half-applied policies.
  8. Inter-Region data transfer is billed. Cloud WAN isn’t free; inter-edge-location traffic is priced per GB, similar to TGW peering. The saving is operational, not always financial, so model the traffic before committing.

Thirty Regions and a partial-mesh TGW topology is an operational debt that only grows. Cloud WAN turns the global network into one object defined by policy, with segments as the isolation primitive and attachment tags as the assignment rule. The migration is gradual via TGW attachments; the end state is a network that can add a Region in an afternoon instead of a fortnight. The work isn’t throwing away what we’ve built; it’s moving the global topology into one place we can actually reason about.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.