Cloud WAN for Multi-Region Transit Gateway Sprawl

January 22, 2029 · 16 min read

The situation

A platform engineering team operates eight Transit Gateways across five regions (us-east-1, us-west-2, eu-west-1, eu-central-1, ap-southeast-2), with approximately 200 VPC attachments in total. Attachment count is growing around 15% per quarter; compounded, that is roughly 350 by this time next year. Every TGW is provisioned through Terraform; every route table entry is a Terraform resource; every change is a pull request. Three engineers reckon they spend the better part of each release window reviewing and merging routing changes alone.
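The growth figure is easy to sanity-check. A minimal sketch, using only the numbers quoted above (200 attachments, around 15% per quarter); the function name is illustrative:

```python
def project_attachments(current: int, quarterly_growth: float, quarters: int) -> float:
    """Compound the attachment count over the given number of quarters."""
    return current * (1 + quarterly_growth) ** quarters


# Four quarters from now, starting at 200 attachments and growing 15% per quarter.
for quarter in range(5):
    print(f"quarter {quarter}: {project_attachments(200, 0.15, quarter):.0f}")
```

Compounding matters here: four quarters of simple 15% growth would give 320, not 350.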

The team wants a higher level of abstraction than individual route table entries: they declare which segments can talk to which, and AWS materialises the underlying TGW configuration. They want global topology visibility: one view across all five regions for incident response, rather than five console tabs. They want segmentation that survives adding a region; bringing ap-northeast-1 online shouldn’t require rewriting policy in the other five. And they want a migration path that doesn’t require a big-bang cutover of 200 live attachments. Cost is a constraint but not the headline: they’re willing to pay AWS for managed policy, not to double the monthly network bill.

What actually matters

Before reaching for a service page, it’s worth noticing what the problem actually is. It isn’t that TGW route tables don’t work; they do. They’re just imperative. Every association, every propagation, every static route is an individual object, and the team has two hundred of each. The engineering time being burned is the cost of keeping two hundred imperative objects coherent across eight TGWs and five regions, which no amount of cleverer Terraform modules fixes because the modules still have to emit the same two hundred resources.

Ownership here is one platform team at the centre of several application teams. Anything that centralises policy decisions (who can talk to whom) while decentralising attachment decisions (which VPC exists, what tags it carries) is a natural fit. A system where an application team tags a new VPC and the routing just happens is the goal: it moves the three engineers out of the critical path of every new workload.

Blast radius is the quiet worry. A misconfigured segment that accidentally isolates prod from shared across five regions is a worse outage than any single Terraform mistake the team has made on route tables so far, because it’s atomic: one policy apply, five regions of damage. The migration path therefore has to include dry-run capability and per-segment rollout, not a weekend cutover.

Cost shape is worth pricing before committing. A managed declarative layer adds a flat per-region hourly charge on top of the attachment-hour and data-processing fees that already exist; at this scale that delta runs to thousands of dollars a month. The trade is three engineers’ combined half-release-window against that delta. At loaded cost the arithmetic usually favours the managed layer, but the team should do the sum before switching.

Failure modes across approaches look different. Imperative-route-table failure is a subtle misconfiguration found days later when a team can’t reach a service they should. Declarative-policy failure is a policy version applied without dry-run that isolates a segment atomically. Both are real risks; neither is larger than the other, but the shape of the risk changes, and the team’s runbooks have to change with it.

Coupling between the configuration layer and the observability layer is the other point often missed. They sit at different layers (one describes topology, the other configures it), and the scenario is easier to reason about once they’re separated.

What we’ll filter on

Five filters any approach has to meet:

Declarative policy over imperative route tables. The team describes segment intent (“prod talks to prod and shared; non-prod talks to non-prod and shared; neither talks to the other”) and AWS computes the route table entries.
Multi-region, multi-TGW visibility in one view. Incident response at 03:00 cannot involve tabbing through five consoles.
Policy that’s region-independent. Adding a sixth region should be a policy edit, not a policy rewrite.
A migration path that tolerates coexistence. Two hundred attachments cannot be torn down and rebuilt in a maintenance window.
Cost that scales with traffic, not with engineer hours saved. Per-attachment-hour and per-GB fees are the knobs; operational cost is the win.

The route-management landscape

1. Raw TGW route tables via CloudFormation or Terraform. Where the team is now. Every association, propagation, and static route is a separate Terraform resource. Adding a VPC means writing at least three; cross-region reachability adds more on the peer side. The state file grows linearly with attachments, and so does the review burden. Imperative top to bottom.

2. A bespoke orchestration layer. Wrap Terraform or the SDK in a policy engine of your own. Works until the engineer who wrote it leaves. Solves declarativeness at the cost of a permanent custom codebase; no help for global visibility unless you also build a topology visualiser, which nobody ever does well enough to replace the AWS console.

3. AWS Transit Gateway Network Manager (Global Networks). A registry and observability layer over existing TGWs. Creates a global network; registers TGWs across accounts via Organizations integration; provides topology visualisation, event streams, and Reachability Analyzer integration. It does not configure TGW route tables. It does not own policy. It watches and aggregates.

4. AWS Cloud WAN core networks. A declarative policy engine that creates and operates TGW-class core network edges for you. A JSON policy describes regions, segments, attachment rules, and segment-to-segment sharing actions; AWS materialises the route tables, associations, propagations, and cross-region connectivity. Segments are the first-class object. Adding a region is one line in edge-locations.

Side by side

Approach	Declarative	Global view	Region-indep	Migration-friendly	Cost
Raw TGW + Terraform / CFN	✗	✗	✗	-	✓
Bespoke orchestration wrapper	✓	✗	-	-	✓
Network Manager alone	✗	✓	✗	✓	✓
Cloud WAN (+ Network Manager)	✓	✓	✓	✓	✓

Every approach is marked ✓ on cost because the attribute asks for cost that scales with traffic, which all four do; Cloud WAN adds a flat per-CNE charge on top, priced out in the in-depth section below.

Matching the layers

Three regions, three segments, one policy. A VPC's tag decides its segment; the segment decides what it can reach. Cross-region routing inside a segment is the backbone's job; no per-region-pair peering is configured.

Cloud WAN and Network Manager, in depth

Network Manager is the observability layer. A global network is a container; you register AWS resources (transit gateways, TGW attachments, Site-to-Site VPNs, Direct Connect gateways) and optionally on-prem resources modelled as devices, sites, and links. The global network becomes the authoritative catalogue of the wide-area topology across accounts and regions. AWS Organizations integration pulls in TGWs and attachments from member accounts automatically. The console renders the graph; events (attachment created, BGP state change) fan out to EventBridge; Reachability Analyzer launches from the topology. It does not write route tables; it does not own policy; it’s pure monitoring and discovery. Pricing: zero on top of the underlying TGW and VPN fees. Adopt it on day one regardless of the Cloud WAN decision; it solves the visibility problem on its own.

Cloud WAN is the configuration layer. A core network is AWS’s managed, segmented, multi-region fabric. In each region named in the policy, AWS creates a core network edge (CNE), a TGW-class construct that lives under the core network rather than as a standalone TGW. Attachments connect to a CNE. Cross-region connectivity between CNEs is automatic and runs over the AWS backbone; no per-region-pair peering to configure.

A segment is a routing domain. By default, attachments in the same segment can reach each other; attachments in different segments cannot. The segment is defined once in the policy and applies across every region the core network exists in. Terraformed TGW topologies miss this entirely: segmentation is global intent, not regional configuration.
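The segment intent quoted earlier ("prod talks to prod and shared; non-prod talks to non-prod and shared; neither talks to the other") can be written down as a small reachability model. This is only a sketch of the semantics, not of how AWS computes routes: it assumes isolate-attachments is false, and it treats a share as bidirectional between the shared segment and each segment it is shared with, but not transitive between those segments. The names SHARES and can_reach are illustrative.

```python
from itertools import product

SEGMENTS = ["prod", "non-prod", "shared"]

# One entry per share: the key segment is made reachable from the listed segments.
SHARES = {"shared": ["prod", "non-prod"]}


def can_reach(src: str, dst: str, shares: dict = SHARES) -> bool:
    """True if attachments in src can reach attachments in dst under this model."""
    if src == dst:
        return True  # default behaviour: same segment, mutually reachable
    # A share opens the path in both directions between the two named segments only.
    return dst in shares.get(src, []) or src in shares.get(dst, [])


def reachability_matrix(segments=SEGMENTS, shares=SHARES) -> dict:
    """Map every ordered (src, dst) pair to whether it is reachable."""
    return {(s, d): can_reach(s, d, shares) for s, d in product(segments, repeat=2)}


for (src, dst), ok in reachability_matrix().items():
    print(f"{src:>8} -> {dst:<8} {'allowed' if ok else 'isolated'}")
```

The point of the model is that it never mentions a region: the same nine answers hold in every region the core network exists in.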

The policy document is JSON with five top-level keys that matter:

core-network-configuration declares edge-locations (the regions), asn-ranges, inside-cidr-blocks, plus vpn-ecmp-support, dns-support, and security-group-referencing-support. Adding a region is one new entry here; nothing else changes.
segments lists segment definitions: name, optional regional restriction, isolate-attachments (when true, even same-segment attachments can’t reach each other unless explicitly shared), and allow-filter/deny-filter for cross-segment sharing.
segment-actions relaxes default isolation via three main verbs: share (one segment reachable from a set of others, which is how shared becomes visible to both prod and non-prod while those stay isolated), create-route (explicit static route within a segment, e.g. 0.0.0.0/0 to an attachment), and send-via (insert a network function group into the data path between two segments, forcing east-west traffic through an inspection appliance).
attachment-policies assigns attachments to segments based on tags, account, region, or attachment type. A rule might say “any VPC with tag:Segment=prod in an account in the Production OU goes into prod”. New VPCs with the right tags are placed automatically.
network-function-groups are logical containers for inspection or egress appliances, used by send-via and send-to.
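To make the shape concrete, here is a minimal sketch that builds such a policy as a Python dict and serialises it to JSON. Several things here are assumptions rather than anything the scenario specifies: the ASN range is a placeholder, the rule numbers are arbitrary, the helper name build_policy is hypothetical, and the non-prod segment is spelled nonprod because segment names are believed to be alphanumeric only (check the policy reference before relying on that). network-function-groups is left out because this sketch has no inspection appliances.

```python
import json

# Tag value on the VPC attachment -> segment name in the policy.
# Assumption: segment names are alphanumeric, so the hyphen is dropped.
TAG_TO_SEGMENT = {"prod": "prod", "non-prod": "nonprod", "shared": "shared"}


def build_policy(regions):
    """Return a core network policy as a plain dict, ready for json.dumps."""
    return {
        "version": "2021.12",
        "core-network-configuration": {
            "asn-ranges": ["64512-64555"],  # placeholder private ASN range
            "edge-locations": [{"location": region} for region in regions],
        },
        "segments": [{"name": name} for name in sorted(set(TAG_TO_SEGMENT.values()))],
        "segment-actions": [
            {   # shared becomes reachable from prod and nonprod; those two stay isolated
                "action": "share",
                "mode": "attachment-route",
                "segment": "shared",
                "share-with": ["prod", "nonprod"],
            }
        ],
        "attachment-policies": [
            {   # one rule per tag value: VPC attachments tagged Segment=<tag> join <segment>
                "rule-number": 100 * (index + 1),
                "condition-logic": "and",
                "conditions": [
                    {"type": "attachment-type", "operator": "equals", "value": "vpc"},
                    {"type": "tag-value", "operator": "equals", "key": "Segment", "value": tag},
                ],
                "action": {"association-method": "constant", "segment": segment},
            }
            for index, (tag, segment) in enumerate(TAG_TO_SEGMENT.items())
        ],
    }


FIVE_REGIONS = ["us-east-1", "us-west-2", "eu-west-1", "eu-central-1", "ap-southeast-2"]
policy_json = json.dumps(build_policy(FIVE_REGIONS), indent=2)
```

Calling build_policy(FIVE_REGIONS + ["ap-northeast-1"]) changes only the edge-locations list; segments, segment-actions, and attachment-policies come out identical, which is the region-independence the filters asked for.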

The policy is versioned. PutCoreNetworkPolicy submits; the change set is reviewable; ExecuteCoreNetworkChangeSet applies. Reverting is one API call back to a prior version. Each CNE maintains route tables in the same shape as a TGW, one per segment. Cloud WAN handles associations and propagations; the CNE shows up in the TGW console as TGW-like constructs, but nothing there is safe to edit by hand, because the next policy apply overwrites it. A core network can peer with a Transit Gateway in the same region via a peering attachment on the CNE. That is the feature that makes live migration possible, letting a legacy TGW stay in place while attachments move across one at a time.
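The submit, review, execute sequence might be wrapped along these lines. The sketch takes the client as a parameter; in practice it would be a boto3 networkmanager client, whose put_core_network_policy, get_core_network_change_set, and execute_core_network_change_set calls correspond to the API actions named above. The helper names and the approval rule are hypothetical, and the sketch skips the wait for the change set to finish generating, which real code would need.

```python
import json


def stage_policy(client, core_network_id, policy, description):
    """Submit a new policy version and return (version_id, changes) without applying it."""
    response = client.put_core_network_policy(
        CoreNetworkId=core_network_id,
        PolicyDocument=json.dumps(policy),
        Description=description,
    )
    version_id = response["CoreNetworkPolicy"]["PolicyVersionId"]
    # Assumption: the change set is already generated. Real code would poll the
    # policy's ChangeSetState until it is ready before reading the change set.
    changes = client.get_core_network_change_set(
        CoreNetworkId=core_network_id, PolicyVersionId=version_id
    )["CoreNetworkChanges"]
    return version_id, changes


def no_removals(changes):
    """Example review rule: refuse any change set that removes something."""
    return not any(change.get("Action") == "REMOVE" for change in changes)


def execute_if_approved(client, core_network_id, version_id, changes, approve=no_removals):
    """Apply the staged version only if the review function accepts the change set."""
    if not approve(changes):
        return False
    client.execute_core_network_change_set(
        CoreNetworkId=core_network_id, PolicyVersionId=version_id
    )
    return True
```

Keeping stage and execute as separate functions mirrors the API: nothing touches the data path until the second call, so the review step can be a human reading the change list, an automated rule like no_removals, or both.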

Pricing: core network edge hourly charge (~$0.50/hour per CNE per region; five CNEs ≈ $1,825/month for edges alone); attachment hourly charge (~$0.065/hour in US East; for 200 attachments ≈ $9,500/month); data processing ($0.02/GB entering a CNE from VPC, VPN, or Direct Connect). Total for 200 attachments across five regions lands north of $11,000/month before data processing, which is more than the straight-TGW equivalent.
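The sum is worth reproducing so the team can plug in its own numbers. A minimal sketch using the approximate rates quoted above; the 730-hour month is an assumption (it is the figure that makes five CNEs come to $1,825), and data processing is left out because the scenario gives no traffic volume.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def monthly_cost(edges, attachments, edge_rate, attachment_rate):
    """Return (edge_cost, attachment_cost, total) in dollars per month, before data processing."""
    edge_cost = edges * edge_rate * HOURS_PER_MONTH
    attachment_cost = attachments * attachment_rate * HOURS_PER_MONTH
    return edge_cost, attachment_cost, edge_cost + attachment_cost


# Rates as quoted in the text: ~$0.50/hour per CNE, ~$0.065/hour per attachment.
edge, attach, total = monthly_cost(edges=5, attachments=200, edge_rate=0.50, attachment_rate=0.065)
print(f"edges: ${edge:,.0f}  attachments: ${attach:,.0f}  total: ${total:,.0f}")
```

Rerunning it with attachments=350 shows what next year's bill looks like if the 15% quarterly growth holds.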

A worked example: a six-phase migration

Phase 0: register everything in Network Manager. Before anything changes, set up a global network and register all eight TGWs via Organizations integration. Zero-risk: no data path changes, no route table edits. The team gets the topology view they’ve been missing, which makes every subsequent phase easier to reason about. Visibility is satisfied on day one, independent of the Cloud WAN decision.

Phase 1: draft and dry-run the Cloud WAN policy. Write the policy with three segments, tag-based attachment policies, and share actions making shared reachable from both of the others. Include all five regions in edge-locations. Submit as a policy version and inspect the change set without executing it. The change set lists every CNE, route table, and association AWS would create, and costs nothing to read.

Phase 2: stand up the core network alongside the TGWs. Execute the change set. Five CNEs come online, no attachments yet. The TGWs continue to carry 100% of production traffic; the CNEs cost the edge fee but are otherwise idle. Pilot: attach a single non-production VPC via tag-based policy, verify the segment assignment, and verify Reachability Analyzer shows what you expect.

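Phase 3: peer each TGW to its regional CNE. A peering between the TGW and the CNE creates a bidirectional bridge, and a transit gateway route table attachment on top of it maps one TGW route table to one segment, so workloads still on the old TGW can reach segment-assigned VPCs on the new CNE. Unlike TGW-to-TGW peering, which needs hand-written static routes, AWS documents this peering as exchanging routes dynamically over BGP. This is still the step where a mistake causes an outage: mapping a TGW route table to the wrong segment, or leaving a stale static route on the TGW side that overrides the propagated one, drops a region’s traffic.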

Phase 4: migrate attachments segment by segment. Start with non-prod (lower blast radius). For each VPC: remove the TGW attachment, add a CNE attachment with the correct Segment=non-prod tag, verify reachability. The attachment policy assigns the segment automatically. Production migrates last, in small batches, with rollback paths prepared.
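The ordering rule can be turned into a batch plan ahead of time. A sketch under stated assumptions: the text only fixes non-prod first and production last, so placing shared in the middle is a guess, and the batch sizes, the Attachment type, and plan_batches are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attachment:
    vpc_id: str
    segment: str  # value of the Segment tag
    region: str


# Assumption: shared migrates between non-prod and prod; the text does not say.
SEGMENT_ORDER = ["non-prod", "shared", "prod"]
# Assumption: illustrative batch sizes, small for prod as the text suggests.
BATCH_SIZE = {"non-prod": 20, "shared": 20, "prod": 3}


def plan_batches(attachments, order=SEGMENT_ORDER, batch_size=BATCH_SIZE):
    """Return migration batches: segments in the given order, each split into fixed-size chunks."""
    batches = []
    for segment in order:
        # Sort for a stable plan: region first, so a batch tends to stay within one region.
        members = sorted(
            (a for a in attachments if a.segment == segment),
            key=lambda a: (a.region, a.vpc_id),
        )
        size = batch_size[segment]
        for start in range(0, len(members), size):
            batches.append(members[start:start + size])
    return batches
```

Each batch is then one change window: detach from the TGW, attach to the CNE, verify reachability, and stop if verification fails before touching the next batch.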

Phase 5: retire the TGWs. Once the last attachment is on the CNE side, remove the peerings, delete the TGW attachments, delete the TGWs. Terraform state shrinks by several thousand lines.

Skipping any of those phases hurts. Skipping the dry-run means executing a policy that isolates prod from shared by accident: every production VPC loses auth and observability across five regions at once. Cutting VPCs straight from TGW to CNE without the peering bridge turns each migration into an outage window. Skipping the pilot means the first attachment ever on the new fabric is a production one.

What’s worth remembering

The layer separation is the distinction that matters: raw TGW route tables are imperative; Network Manager is monitoring; Cloud WAN is declarative policy. Confusing Network Manager with Cloud WAN, assuming the monitoring service configures anything, is the most common error on this topic.
Cloud WAN policy has five top-level keys that matter: core-network-configuration, segments, segment-actions, attachment-policies, network-function-groups. The three main segment actions are share, create-route, and send-via.
Attachment policies are tag-driven. VPCs are assigned to segments by tags, account, region, or attachment type, not by being listed individually. Adding a VPC is a matter of giving it the right tags at create time.
TGW route evaluation rules (longest prefix match, static beats propagated for the same CIDR, defined priority across attachment types) carry over unchanged into CNE route tables. Cloud WAN writes them for you.
Cross-region connectivity is automatic in Cloud WAN, manual in TGW. Eight TGWs across five regions need per-pair peering with static cross-region routes on both sides; Cloud WAN handles that from a single edge-locations list.
Cloud WAN peers with TGWs, which is what makes a live migration possible without a big-bang cutover; the two fabrics coexist for the transition.
Network Manager is the observability layer for any TGW, whether or not Cloud WAN is involved. Adopt it first, independently of the Cloud WAN decision, because it solves the visibility problem on its own at zero additional cost.
Cloud WAN pricing adds a core network edge hourly fee (~$0.50/hr/region) on top of attachment-hour and data-processing fees. For 200 attachments across five regions that’s a four-figure monthly increase over the equivalent TGW bill. The justification is engineer hours, not infrastructure cost.
Policy versions and change sets are the safety net. PutCoreNetworkPolicy + change-set review + ExecuteCoreNetworkChangeSet beats apply-and-see-what-happens, every time.
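The route evaluation rules listed above are easier to remember once seen in code. A minimal sketch of the first two (longest prefix match, then static over propagated for the same CIDR); the priority order across attachment types is deliberately left out, and the Route type and attachment names are made up for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    cidr: str
    target: str   # attachment the route points at
    origin: str   # "static" or "propagated"


def select_route(routes, destination):
    """Pick the route for a destination IP: longest prefix wins, static breaks ties."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.cidr)]
    if not candidates:
        return None  # no matching route: the packet is dropped
    return max(
        candidates,
        key=lambda r: (ipaddress.ip_network(r.cidr).prefixlen, r.origin == "static"),
    )


routes = [
    Route("0.0.0.0/0", "att-egress", "static"),
    Route("10.0.0.0/8", "att-a", "propagated"),
    Route("10.1.0.0/16", "att-b", "propagated"),
    Route("10.1.0.0/16", "att-c", "static"),
]
chosen = select_route(routes, "10.1.2.3")
print(chosen.target if chosen else "no route")
```

For 10.1.2.3 the two /16 routes beat the /8 and the default route on prefix length, and the static /16 beats the propagated one.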

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.