How to Filter Egress at Organisation Scale

The situation

An enterprise runs 40 accounts, ~200 VPCs, and ~40,000 compute instances and tasks across EC2, ECS, and Fargate. Outbound traffic shape:

~5 PB/month of outbound HTTPS to approved SaaS and public API destinations.
Long tail of long-running flows (TLS-based push mechanisms, server-sent events).
Hundreds of distinct outbound destinations to track.
Roughly 3 PB of that is S3 (customer uploads to the enterprise’s own S3 buckets), which should never touch the egress path if Gateway endpoints are in place.

The requirements:

Every outbound packet inspected against an organisation-wide allow-list.
Domain-level control (SNI inspection): allow api.stripe.com, deny *.shadyservice.io.
Central logging of all egress decisions (allow and deny).
Scale past one TGW’s bandwidth ceiling: a single TGW is rated for up to 50 Gbps per attachment and 100 Gbps aggregate, which a 5 PB/month workload might saturate during peaks.
Survive a single AZ failure of the firewall fleet.
No added 10 ms on every outbound call; the existing p99 budget is tight.
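
As a rough sanity check on the bandwidth requirement, a monthly volume can be converted into an average line rate. The sketch below assumes decimal petabytes and a 30-day month; the peak-to-average ratio of 4 is a hypothetical placeholder, not a measured value.

```python
SECONDS_PER_DAY = 86_400


def average_gbps(petabytes_per_month: float, days: int = 30) -> float:
    """Convert a monthly volume (decimal PB) into an average rate in Gbps."""
    bits = petabytes_per_month * 1e15 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e9


def peak_gbps(petabytes_per_month: float, peak_to_average: float = 4.0) -> float:
    """Estimate a peak rate; the peak-to-average ratio is an assumed value."""
    return average_gbps(petabytes_per_month) * peak_to_average


if __name__ == "__main__":
    print(f"average: {average_gbps(5):.1f} Gbps")
    print(f"assumed peak: {peak_gbps(5):.1f} Gbps")
```

Under these assumptions, 5 PB/month averages roughly 15 Gbps, and a 4x peak would exceed the 50 Gbps per-attachment figure quoted above. The real ratio should come from flow-log data.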

A basic centralised egress (covered in /writing/one-door-out/) handles smaller scale. Past a couple of hundred VPCs and petabytes of monthly traffic, we’re in “at-scale” territory.

What actually matters

The core trade at scale is central policy in exchange for scaling overhead. Centralising is correct. Centralising without thinking about bandwidth, the number of firewall endpoints, the transit-layer budget, and the regional topology is how central egress becomes the source of incidents rather than the mitigation.

The first thing to ask is: how do we scale the firewall layer? A managed firewall scales horizontally via endpoints, with each endpoint handling a share of traffic. Deploy enough endpoints in enough AZs across enough VPCs and the aggregate throughput scales. The operational wrinkle: route-table updates per VPC have to keep up.

The second is: transit-layer capacity. The hub-and-spoke transit primitive has bandwidth ceilings (both per-attachment and aggregate per-hub). For 40,000 instances doing heavy outbound, multiple hubs (per Region, per workload class) may be needed. Each hub pairs with its own set of firewall endpoints.

The third is: how do we avoid a single point of firewall configuration pushing to everything? Staged rule deployment: a pipeline that rolls rule updates to one AZ’s firewall first, watches, then the next AZ. A bad rule takes out 1/3 of egress capacity, not 100%.

The fourth is: what egresses through the firewall and what doesn’t? Traffic to AWS services that have Gateway endpoints (S3 and DynamoDB) bypasses the firewall. Known-good bulk flows (e.g., customer object-store uploads) take dedicated paths. Everything else goes through the firewall.

The fifth is: inspection depth. TLS-header inspection (no decryption) is cheap and fast, but sees only the SNI. Full TLS decryption with a trusted CA installed on every workload is expensive, operationally heavy, sometimes compliance-required, and often not justified.

What we’ll filter on

Scale ceiling: what throughput before adding more infrastructure?
Inspection depth: 5-tuple / domain / full TLS payload?
Blast radius of a bad rule: which workloads fail when a rule is wrong?
Cost per GB: marginal cost per unit of traffic?
Operational complexity: how many moving parts to maintain?

The at-scale egress landscape

Centralised egress VPC + AWS Network Firewall (single hub). Pattern from /writing/one-door-out/. Works up to one TGW’s bandwidth and one firewall’s rule-complexity ceiling. Past that, it becomes the bottleneck.
Multi-hub centralised egress. Multiple hub-and-spoke constellations. Different OUs use different hubs; TGWs are peered or isolated. Scales horizontally; more moving parts; requires careful routing.
GWLB (Gateway Load Balancer) with third-party firewall fleet. Palo Alto, Fortinet, etc. Appliance-based. Full TLS decryption often supported. Vendor licences to manage; deployment on EC2.
Distributed firewalling (per-VPC Network Firewall). Each VPC has its own Network Firewall endpoint. Capacity grows with every VPC added (each VPC is its own firewall), but this defeats central policy; 200 VPCs means 200 firewall configs.
Service-mesh egress gateways. Istio egress gateway pods per cluster. Works for east-west + north-south traffic originating from the cluster. Not applicable to non-mesh workloads (EC2, Lambda).
Outbound-only via PrivateLink where possible. For SaaS providers with PrivateLink endpoints, direct VPC-to-VPC endpoints bypass the firewall entirely. Best-of-both: known-good traffic doesn’t eat firewall capacity.

Side by side

Option	Scale ceiling	Inspection	Blast radius	Cost per GB	Complexity
Single-hub ANFW	~100 Gbps TGW	SNI (no decrypt)	Whole org	ANFW per-GB	Moderate
Multi-hub ANFW	Horizontal	Same	Per hub	Same	High
GWLB + appliances	Per-appliance fleet	Full TLS optional	Per appliance fleet	Vendor + GWLB	High
Per-VPC ANFW	Per VPC	SNI	Per VPC	ANFW per-GB	Very high (config drift)
Mesh egress	Per cluster	L7 rich	Per mesh	Mesh overhead	Mesh-specific
PrivateLink	Per endpoint	Endpoint policy	Per endpoint	Endpoint + per-GB	Low for known dests

For 40,000 instances at 5 PB/month, the pick is multi-hub centralised egress with AWS Network Firewall, plus Gateway endpoints for S3/DynamoDB, plus PrivateLink for specific SaaS providers where feasible.

The multi-hub architecture

One policy source of truth; three independent hubs; each hub has its own TGW, firewall, NAT, and blast radius. Bypass paths for S3/DDB/PrivateLink keep firewall capacity for the traffic that needs inspection.

The picks in depth

Multi-hub segmentation. Three hubs: production, non-production, analytics. Each with its own TGW, firewall fleet, NAT gateways. Workload VPCs attach to the hub matching their OU. The segmentation is deliberate:

Blast radius: a bad rule in non-prod doesn’t break prod.
Policy nuance: non-prod can permit a wider allow-list (developers need to fetch new libraries, experiment with new SaaS). Prod has a tighter allow-list.
Capacity isolation: analytics pipelines can burst; prod runs with steadier traffic. Separate capacity envelopes.
Compliance: analytics may handle data that has stricter egress policies; keeping it isolated simplifies audit.

Firewall endpoints per AZ. Each hub spans at least two AZs, with one firewall endpoint per AZ. Production spans three for capacity. Endpoint throughput is ~10 Gbps, so three endpoints give ~30 Gbps in the firewall layer. A NAT gateway supports up to 100 Gbps, so it is usually not the bottleneck.

Scaling past ~30 Gbps means either upgrading endpoint types (larger endpoints exist for specific use cases) or spreading across more VPCs / hubs. 40,000 instances at, say, 50 KB/sec sustained = 2 GB/sec, or 16 Gbps aggregate, before any Gateway endpoint or PrivateLink bypass; that is under the limit. Peaks of 20 Gbps still fit within three endpoints.
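
The capacity arithmetic is easy to get wrong by a factor of eight (bytes versus bits), so it is worth writing down. This is a minimal sketch using the ~10 Gbps per-endpoint figure from the text and assuming one endpoint per AZ with headroom for one AZ failure; the function names are illustrative.

```python
import math


def aggregate_gbps(instances: int, kb_per_sec_each: float) -> float:
    """Aggregate rate in Gbps for a fleet sending kb_per_sec_each (decimal KB/s) per instance."""
    bytes_per_sec = instances * kb_per_sec_each * 1_000
    return bytes_per_sec * 8 / 1e9


def endpoints_needed(peak_gbps: float, per_endpoint_gbps: float = 10.0,
                     survive_az_failures: int = 1) -> int:
    """Endpoints needed to carry peak_gbps even after losing some AZs (one endpoint per AZ)."""
    return math.ceil(peak_gbps / per_endpoint_gbps) + survive_az_failures


if __name__ == "__main__":
    steady = aggregate_gbps(40_000, 50)
    print(f"steady state: {steady:.0f} Gbps")
    print(f"endpoints for a 20 Gbps peak: {endpoints_needed(20)}")
```

With these inputs the steady state is 16 Gbps, and a 20 Gbps peak needs two endpoints plus one spare, which matches the three-AZ production layout.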

Staged rule rollout. The policy repo is Git; rule changes are PRs; CI validates them with suricata -T or an equivalent check; merged rules deploy in stages:

First AZ of the non-prod hub gets the new rules. Monitor.
All of non-prod hub. Monitor.
First AZ of prod. Monitor.
All of prod. Done.

“Monitor” means watching denied-flow logs, error-rate on workloads, and alerting on anomalies. A bad rule is visible in the denied-flow log spike within seconds; roll back via the same pipeline (Git revert, re-deploy in reverse order).
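
The staging and rollback logic can be sketched as a small driver. This is an illustration only: the stage names, the `deploy` callable, and the `healthy` callable are hypothetical stand-ins for whatever the pipeline actually uses to push rule groups and read denied-flow metrics.

```python
from typing import Callable, List, Sequence

# Stage names mirror the rollout order in the text; the identifiers are illustrative.
STAGES = ("nonprod-first-az", "nonprod-all", "prod-first-az", "prod-all")


def staged_rollout(
    version: str,
    previous: str,
    deploy: Callable[[str, str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[str] = STAGES,
) -> bool:
    """Deploy `version` one stage at a time.

    If a stage fails its health check, redeploy `previous` to every stage
    already touched, in reverse order, and return False. Return True when
    all stages pass.
    """
    touched: List[str] = []
    for stage in stages:
        deploy(stage, version)
        touched.append(stage)
        if not healthy(stage):
            for done in reversed(touched):
                deploy(done, previous)
            return False
    return True
```

Passing `deploy` and `healthy` in as callables keeps the ordering logic testable without touching a real firewall; in practice `healthy` would wait for a soak period before answering.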

Gateway endpoints everywhere. Every workload VPC has Gateway endpoints for S3 and DynamoDB. This is not optional; it’s enforced by a Config rule that flags VPCs without them. The result: ~3 PB/month of the 5 PB bypasses the firewall path entirely, saving firewall capacity and data-processing cost.
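
The compliance check behind such a rule reduces to a set comparison. The sketch below is not the Config rule itself; it assumes endpoint records shaped like entries in the EC2 DescribeVpcEndpoints response (VpcId, VpcEndpointType, ServiceName, State) and leaves fetching them to the caller.

```python
REQUIRED_SERVICES = ("s3", "dynamodb")


def missing_gateway_endpoints(vpc_ids, endpoints, required=REQUIRED_SERVICES):
    """Return {vpc_id: [missing services]} for VPCs lacking a required Gateway endpoint.

    `endpoints` is a list of dicts shaped like DescribeVpcEndpoints entries.
    Interface endpoints and endpoints that are not available are ignored.
    """
    present = {vpc: set() for vpc in vpc_ids}
    for ep in endpoints:
        if ep.get("VpcEndpointType") != "Gateway":
            continue
        if str(ep.get("State", "")).lower() != "available":
            continue
        # com.amazonaws.<region>.s3 -> s3
        service = ep["ServiceName"].rsplit(".", 1)[-1]
        if ep["VpcId"] in present:
            present[ep["VpcId"]].add(service)
    return {
        vpc: [s for s in required if s not in have]
        for vpc, have in present.items()
        if any(s not in have for s in required)
    }
```

An empty result means every listed VPC has both endpoints; anything else is the list to flag.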

PrivateLink for heavy SaaS. Stripe, Datadog, Databricks, and other providers offer PrivateLink endpoint services. Each one gets a per-provider endpoint in the hub VPC, in a shared-services VPC, or in the workload VPC that generates the traffic, bypassing the NAT + firewall + internet path. For providers seeing 100s of GB/day, the PrivateLink path pays back in data-transfer savings quickly.

Policy structure. The Network Firewall rule groups are organised:

Baseline stateless: Drop RFC1918 leaks, drop prefixes belonging to known-bad ASNs from threat intel.
Domain allow-list (stateful): Suricata tls.sni rules for approved domains.
Geo-deny (stateful): GeoIP-based drop for sanctioned countries.
Audit-mode rules: For domains being evaluated, log matches but allow, then promote to deny after review.
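
To make the domain allow-list concrete, here is a sketch of rendering Suricata-compatible rules from a list of domains. It uses the `tls.sni` buffer with `startswith`/`endswith` for exact matches and the `dotprefix` transform for wildcard entries; the `*.` wildcard convention, the sid numbering, and the message text are assumptions for illustration, not the author's actual rule set.

```python
def sni_rule(domain: str, action: str, sid: int) -> str:
    """Render one Suricata rule that matches on the TLS SNI.

    "*.example.com" matches example.com and any subdomain (via dotprefix);
    any other value is treated as an exact hostname match.
    """
    if action not in {"pass", "drop", "alert"}:
        raise ValueError(f"unsupported action: {action}")
    if domain.startswith("*."):
        match = f'tls.sni; dotprefix; content:"{domain[1:]}"; nocase; endswith;'
    else:
        match = f'tls.sni; content:"{domain}"; nocase; startswith; endswith;'
    return (
        f"{action} tls $HOME_NET any -> $EXTERNAL_NET 443 "
        f'({match} flow:to_server, established; '
        f'msg:"{action} {domain}"; sid:{sid}; rev:1;)'
    )


if __name__ == "__main__":
    print(sni_rule("api.stripe.com", "pass", 1000001))
    print(sni_rule("*.shadyservice.io", "drop", 1000002))
```

Audit-mode entries would use the `alert` action in the same template, which logs the match without dropping the flow.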

Policy-as-code: each rule has a file with metadata (owner team, ticket reference, date added, expiry). Rules past their expiry date that have not been renewed are pruned by a quarterly job.
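
The pruning job amounts to a filter over that metadata. A minimal sketch follows, with a hypothetical `RuleMeta` record whose field names mirror the metadata listed above; the on-disk file format is not specified in the text and is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class RuleMeta:
    """Hypothetical per-rule metadata record."""
    sid: int
    owner: str
    ticket: str
    added: date
    expires: date


def expired_rules(rules: Iterable[RuleMeta], today: date) -> List[RuleMeta]:
    """Return rules whose expiry date is strictly before `today`."""
    return [rule for rule in rules if rule.expires < today]
```

Reporting expired rules to their owner team before deleting them gives owners a chance to renew.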

Logging at firehose pace. ANFW emits alert logs (rule hits) and flow logs (5-tuple + bytes). Both go to a Kinesis Data Firehose, which writes to S3 (Parquet, partitioned) and streams to an OpenSearch cluster for dashboards. At scale, this is gigabytes of log per hour; the pipeline needs to handle it.

Metrics and alerting.

Denied-flow rate per firewall endpoint: alert on a sudden spike (bad rule or attack).
Allowed-flow rate: alert on a sudden drop (routing failure).
Firewall endpoint health.
NAT gateway error rate.
Per-Region outbound-bytes cost per day.
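
The first two alerts can be sketched as comparisons against a rolling baseline. The thresholds below (10x for a spike, 20% of baseline for a drop) and the use of a median are assumptions chosen for illustration; a production alert would more likely live in the monitoring system's own query language.

```python
from statistics import median
from typing import Sequence


def is_spike(history: Sequence[float], current: float,
             factor: float = 10.0, floor: float = 1.0) -> bool:
    """True when `current` exceeds `factor` times the median of `history`.

    `floor` keeps a near-zero baseline from turning every blip into an alert.
    """
    baseline = max(median(history), floor) if history else floor
    return current > factor * baseline


def is_drop(history: Sequence[float], current: float,
            fraction: float = 0.2) -> bool:
    """True when `current` falls below `fraction` of the median of `history`."""
    if not history:
        return False
    return current < fraction * median(history)
```

For example, `is_spike([5, 6, 4, 5], 500)` flags a denied-flow rate that jumps from about 5 per interval to 500.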

SCPs. As with single-hub egress: deny workload accounts the ability to add their own IGW, to route 0.0.0.0/0 anywhere other than the TGW, or to create VPC endpoints that “bypass” the hub without the SecOps team’s approval. The hub owns the egress path; SCPs stop workloads opening their own.
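
Part of that guardrail can be sketched as a policy document built in code. The role ARN pattern and the statement Sid are hypothetical, and the default-route restriction is deliberately left out of this sketch; treat it as a starting point to review, not a finished SCP.

```python
import json

# Hypothetical role that is allowed to manage egress plumbing.
EXEMPT_ROLE_ARN = "arn:aws:iam::*:role/secops-network-admin"


def egress_guardrail_scp(exempt_role_arn: str = EXEMPT_ROLE_ARN) -> dict:
    """Build an SCP that denies internet gateways and VPC endpoint creation
    for every principal except the exempt role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOwnEgressPaths",
                "Effect": "Deny",
                "Action": [
                    "ec2:CreateInternetGateway",
                    "ec2:AttachInternetGateway",
                    "ec2:CreateEgressOnlyInternetGateway",
                    "ec2:CreateVpcEndpoint",
                ],
                "Resource": "*",
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": exempt_role_arn}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(egress_guardrail_scp(), indent=2))
```

Generating the document from code lets the same pipeline that reviews firewall rules also review guardrail changes.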

A worked at-scale egress flow

At 14:30 UTC, a Fargate task in payments-prod VPC (in OU Production) calls api.stripe.com.

Task’s outbound SG allows tcp/443 to PrivateLink endpoint SG for Stripe. Traffic goes to the Stripe PrivateLink endpoint in the payments VPC directly.
PrivateLink delivers to Stripe’s endpoint service; Stripe processes and responds.
Total: no NAT, no firewall, no public internet. 99% of Stripe traffic follows this path.

A different Fargate task in the same VPC calls api.customerfeedback.com (less common SaaS, no PrivateLink):

SG allows outbound 443 to 0.0.0.0/0 (last-resort). Packet goes to TGW-prod.
TGW-prod forwards to egress-prod VPC. Route table sends to firewall endpoint.
Network Firewall SNI-inspects. api.customerfeedback.com is in the allow-list. Pass.
Packet to NAT, to IGW, to internet.
Return packet reverses.

The slow path (firewall-inspected) handles the long tail; the fast path (PrivateLink / Gateway endpoints) handles the bulk.
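
The two flows above can be summarised as a small decision model. In the real system this choice is made by DNS, route tables, and security groups rather than by application code; the sketch below only models the outcome, and the host sets are hypothetical examples taken from the walkthrough.

```python
from enum import Enum


class Path(Enum):
    GATEWAY_ENDPOINT = "gateway-endpoint"
    PRIVATELINK = "privatelink"
    FIREWALL = "firewall"
    DENIED = "denied"


# Hypothetical example sets; the real lists live in the policy repo.
PRIVATELINK_HOSTS = {"api.stripe.com"}
ALLOW_LIST = {"api.customerfeedback.com"}
GATEWAY_SERVICES = {"s3", "dynamodb"}


def egress_path(host: str) -> Path:
    """Model which egress path a destination hostname ends up on."""
    host = host.lower().rstrip(".")
    labels = set(host.split("."))
    if host.endswith(".amazonaws.com") and labels & GATEWAY_SERVICES:
        return Path.GATEWAY_ENDPOINT  # fast path: never reaches the hub
    if host in PRIVATELINK_HOSTS:
        return Path.PRIVATELINK       # fast path: no NAT, firewall, or internet
    if host in ALLOW_LIST:
        return Path.FIREWALL          # slow path: SNI-inspected, then NAT
    return Path.DENIED                # reaches the firewall and is dropped
```

A model like this is handy in CI to check that a proposed allow-list change does not put a bulk destination on the slow path.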

A worked incident

At 23:17 UTC, monitoring alerts: denied-flow rate on ANFW-prod-AZ-a spikes 100x. Investigation:

Logs show source IP 10.240.5.42 making thousands of connections to *.ru-shady-domain.tv. Not in allow-list; correctly denied.
Reverse-lookup: 10.240.5.42 is an ECS task in customer-api VPC, account customer-api-prod.
Incident declared. Security isolates the ENI via a break-glass SG change.
Analysis of the container image finds a supply-chain-compromised npm dependency making outbound calls on install.
Mitigation: block the dependency at the registry level; rotate the container image; recover the task.

Without egress filtering at scale, the pattern would be invisible. The denied-flow log made the compromise visible in the first 30 seconds.
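
The first investigation step, finding who is generating the denied flows, can be sketched as a group-by over alert log lines. The JSON field names used here (`event`, `src_ip`, `tls.sni`, `alert.action` with the value `blocked`) are an assumption about the log shape and should be checked against real log samples before relying on this.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def top_denied_sources(log_lines: Iterable[str], n: int = 5) -> List[Tuple[tuple, int]]:
    """Count blocked flows per (source IP, SNI) pair and return the top `n`.

    Lines that are not JSON objects, or that are not blocked alerts, are skipped.
    """
    counts: Counter = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        event = record.get("event") or {}
        if (event.get("alert") or {}).get("action") != "blocked":
            continue
        sni = (event.get("tls") or {}).get("sni")
        counts[(event.get("src_ip"), sni)] += 1
    return counts.most_common(n)
```

In the incident above, the top entry would be 10.240.5.42 paired with the denied domain, which is the lead that the reverse-lookup then follows.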

What’s worth remembering

Multi-hub egress past single-TGW scale. One hub per blast radius domain (prod, non-prod, analytics). Each with its own firewall fleet and NAT.
Gateway endpoints for S3 and DynamoDB are mandatory at scale. Most of your bytes are probably S3; routing them via the firewall wastes capacity and money.
PrivateLink for heavy SaaS providers. One endpoint per provider; high-traffic flows bypass firewall + NAT + internet.
Staged rule rollout is non-negotiable. The firewall is on the critical path for everything. A canary AZ first, then full rollout, is how you avoid org-wide outages.
Policy as code, Git-reviewed. Every rule traceable to a ticket and an owner; expiry dates prune stale rules; changes reviewed by SecOps.
Log everything, alert on anomalies. Denied-flow spikes are how you catch compromises; dashboards on the alert + flow logs give SOC the visibility.
SCPs keep workloads from escaping the hub. Deny workload-account IGWs, deny default routes to non-TGW, deny unapproved PrivateLink endpoints. Central egress only works when workloads can’t route around it.
The firewall is now infrastructure. Not a checkbox. It’s on the path for every outbound call. Operating it (capacity planning, drift detection, rule governance) is a permanent function of the platform team.

Five petabytes a month through a policy that’s one review away from reflecting a real threat. The bad rule doesn’t take out forty thousand instances, because it only takes out one AZ first, and the staging cadence catches it. The compromised container never got past the first denied flow, because everything is inspected, everything is logged, everything is noticed. Outbound at scale, outbound that you trust.