The situation
One VPC, 10.0.0.0/16, three private subnets across three AZs. 80 EC2 instances that need to reach the internet for:
- OS package updates via apt, roughly 500 MB per instance per week.
- Third-party API calls to a payments provider (api.stripe.com) and a metrics endpoint (api.datadoghq.com), roughly 50 GB per day in aggregate.
- Docker image pulls from ECR Public, roughly 200 GB per day at peak rollout.
- Random outbound HTTPS for things nobody has quite inventoried but which break loudly when blocked.
Currently a single NAT Gateway in one AZ handles all of this. The bill for NAT Gateway data processing ($0.045/GB) plus the NAT Gateway hourly charge ($0.045/h) is a few hundred dollars a month and climbing; adding a second and third NAT Gateway so that each AZ has its own would triple the hourly part of that. A newer adjacent IPv6-only VPC has just come online for a new service; the team wants outbound internet for that too, without exposing every instance’s IPv6 address to inbound connections.
The goals:
- Every instance that needs egress gets egress.
- Per-AZ fault tolerance: losing one AZ doesn’t lose all egress.
- The IPv6-only subnets get egress without public IPv6 addresses being reachable from the internet.
- The bill stops growing faster than the workload does.
What actually matters
It’s worth starting with what “NAT” actually means here.
Network Address Translation is rewriting IP addresses on packets as they cross a boundary. For outbound internet traffic from a private subnet, the translation is: source IP in the packet is a private RFC 1918 address (say 10.0.12.87), which isn’t routable on the internet; the NAT device rewrites it to a public, routable address (its own) before forwarding the packet; on the reply, it does the reverse. From outside, the traffic looks like it came from the NAT device. From inside, the instance got an internet reply with no knowledge that translation happened.
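The rewrite-and-reverse step can be sketched in a few lines of Python. This is a toy model, not how NAT Gateway is implemented: the class name, the port allocation starting at 1024, and the documentation-range addresses (203.0.113.10, 198.51.100.7) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SourceNat:
    """Toy source-NAT table: one public IP, one public port per tracked flow."""
    public_ip: str
    next_port: int = 1024  # hypothetical starting port, for illustration only
    # (private_ip, private_port, dst_ip, dst_port) -> public port
    outbound: dict = field(default_factory=dict)
    # (public_port, dst_ip, dst_port) -> (private_ip, private_port)
    inbound: dict = field(default_factory=dict)

    def translate_out(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the source of an outbound packet to the NAT's public address."""
        key = (src_ip, src_port, dst_ip, dst_port)
        if key not in self.outbound:
            self.outbound[key] = self.next_port
            self.inbound[(self.next_port, dst_ip, dst_port)] = (src_ip, src_port)
            self.next_port += 1
        return (self.public_ip, self.outbound[key], dst_ip, dst_port)

    def translate_in(self, src_ip, src_port, dst_port):
        """Map a reply back to the private host; None means no matching flow."""
        return self.inbound.get((dst_port, src_ip, src_port))


nat = SourceNat(public_ip="203.0.113.10")
sent = nat.translate_out("10.0.12.87", 44321, "198.51.100.7", 443)
reply_target = nat.translate_in("198.51.100.7", 443, sent[1])
unsolicited = nat.translate_in("198.51.100.99", 443, sent[1])
```

The destination only ever sees 203.0.113.10; the reply finds its way back to 10.0.12.87 through the table, and a packet from a host that was never contacted matches nothing and is dropped.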
For IPv4, this is essential: private instances can’t have internet-routable addresses, full stop, because the VPC’s CIDR is RFC 1918 and not advertised beyond AWS. For IPv6, translation is not essential, because the IPv6 addresses a VPC hands out are globally routable by design, but egress still needs a mechanism that lets outbound packets out while blocking inbound connections. That’s the egress-only Internet Gateway, which is a policy device rather than a translation device.
So the first question per-subnet is: IPv4 or IPv6? Different IP versions, different egress components. You don’t use a NAT Gateway for IPv6 egress and you don’t use an egress-only Internet Gateway for IPv4.
Second: managed or self-run? AWS offers NAT Gateway as a managed service: AWS runs it, scales it, bills for it. NAT Instance is the DIY alternative: an EC2 instance configured with ip_forward and iptables, sitting in a public subnet, with source/dest check disabled. NAT Instance is cheaper per hour if the volume is small and more expensive in engineering time because you own the patching, scaling, and failover. For anything beyond a dev sandbox, NAT Gateway’s managed story is worth the premium.
Third: cost shape. NAT Gateway bills $0.045/h plus $0.045/GB data-processing. A single NAT Gateway running 24/7 is ~$33/month idle; add 200 GB/day of data processing and the data charge becomes $9/day, or $270/month, which dominates the bill. Multiply by three AZs for HA and the idle baseline becomes ~$100/month before any traffic. NAT Instance bills per instance-hour (potentially on Spot) and doesn’t charge extra for data processed, but you pay for EC2 instance performance and own the ops.
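To make the cost shape concrete, here is a small sketch of the arithmetic using the two rates quoted above and a 30-day month. The function name and the 30-day assumption are mine; regional prices differ.

```python
HOURLY_USD = 0.045        # NAT Gateway hourly rate quoted above
PER_GB_USD = 0.045        # NAT Gateway data-processing rate quoted above
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month


def nat_monthly_cost(gateways: int, gb_per_day: float) -> dict:
    """Split a month of NAT Gateway charges into hourly and data parts."""
    hourly = gateways * HOURS_PER_MONTH * HOURLY_USD
    data = gb_per_day * 30 * PER_GB_USD
    return {
        "hourly": round(hourly, 2),
        "data": round(data, 2),
        "total": round(hourly + data, 2),
    }


one_gateway = nat_monthly_cost(gateways=1, gb_per_day=200)
three_gateways = nat_monthly_cost(gateways=3, gb_per_day=200)

# Daily volume at which one gateway's data charge equals its hourly charge.
crossover_gb_per_day = (24 * HOURLY_USD) / PER_GB_USD
```

With these rates the data charge overtakes the hourly charge at 24 GB/day per gateway, so at 200 GB/day the per-GB line is most of the bill whether one gateway or three are running.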
Fourth: VPC Endpoints as the escape hatch. A large chunk of “outbound internet” traffic from AWS instances actually goes to other AWS services: S3, DynamoDB, SSM, ECR, Kinesis, KMS. A VPC Endpoint (gateway endpoint for S3 and DynamoDB; interface endpoints for most others) keeps that traffic inside the AWS network, off the NAT Gateway, and out of the NAT data-processing charge. A well-placed set of endpoints often halves a NAT bill without changing any application code.
Fifth: per-AZ placement. A NAT Gateway is AZ-scoped. Instances in eu-west-1a that reach the internet via a NAT Gateway in eu-west-1b pay inter-AZ data transfer on top of NAT’s own charges. One NAT Gateway per AZ with each private subnet’s route table pointing at its local NAT Gateway is the correct HA-and-cost shape; one NAT Gateway for the VPC is cheaper until an AZ fails and it isn’t.
Sixth: fault domain. A single NAT Gateway is a single point of failure for every subnet that routes through it, not only for the subnets in its own AZ. If that AZ loses power, egress from the other AZs survives only if their route tables point at NAT Gateways in their own AZs. Route-table-per-AZ is the correct pattern; it’s also what makes the cost conversation real, because now you’re paying for three NAT Gateways instead of one.
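The placement and fault-domain points can be checked mechanically once you know which AZ each subnet and each NAT Gateway lives in. The sketch below works on plain dictionaries; the subnet and gateway names are hypothetical, and in practice the inputs would come from your own inventory of subnets, route tables, and NAT Gateways.

```python
def cross_az_subnets(subnet_az, default_route, nat_az):
    """Subnets whose default route targets a NAT Gateway in a different AZ."""
    return sorted(
        subnet for subnet, nat in default_route.items()
        if nat_az[nat] != subnet_az[subnet]
    )


def subnets_without_egress(failed_az, subnet_az, default_route, nat_az):
    """Subnets in surviving AZs whose NAT Gateway sat in the failed AZ."""
    return sorted(
        subnet for subnet, nat in default_route.items()
        if subnet_az[subnet] != failed_az and nat_az[nat] == failed_az
    )


# Hypothetical inventory for illustration.
subnet_az = {"subnet-a": "eu-west-1a", "subnet-b": "eu-west-1b", "subnet-c": "eu-west-1c"}
nat_az = {"nat-a": "eu-west-1a", "nat-b": "eu-west-1b", "nat-c": "eu-west-1c"}

shared = {subnet: "nat-a" for subnet in subnet_az}  # one gateway for the VPC
per_az = {"subnet-a": "nat-a", "subnet-b": "nat-b", "subnet-c": "nat-c"}

shared_cross_az = cross_az_subnets(subnet_az, shared, nat_az)
shared_outage = subnets_without_egress("eu-west-1a", subnet_az, shared, nat_az)
per_az_cross_az = cross_az_subnets(subnet_az, per_az, nat_az)
per_az_outage = subnets_without_egress("eu-west-1a", subnet_az, per_az, nat_az)
```

With the shared layout, subnet-b and subnet-c both pay inter-AZ transfer and both lose egress when eu-west-1a fails; with the per-AZ layout both lists come back empty.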
What we’ll filter on
- IP version. IPv4 or IPv6 traffic?
- Managed vs self-run. AWS-operated or EC2-based?
- Cost model. Hourly + data-processed, or instance-hour only?
- Throughput and scaling. What’s the ceiling?
- Fault domain. AZ-scoped or VPC-scoped?
- What it does to inbound. Stateful rewrite, or egress-only policy?
The egress landscape
- NAT Gateway. Managed. IPv4 only. Placed in a public subnet in a specific AZ. Scales to 100 Gbps per gateway; 55,000 concurrent connections per destination IP:port. Bills $0.045/h + $0.045/GB data processed (regional pricing varies). Allocated with a public Elastic IP. Source IP of outbound traffic is the NAT Gateway’s EIP. Does stateful NAT: outbound initiates, inbound replies match state, other inbound is dropped.
- NAT Instance. Self-run. IPv4 only. Regular EC2 instance in a public subnet with source/dest check disabled, Linux kernel configured for IP forwarding, iptables masquerade rules. Scales to whatever the instance class supports (capped by network performance of the chosen class). Bills per instance-hour; no per-GB data-processing charge. Pairs well with Auto Scaling and health checks, but “high availability” means “I built failover between two NAT instances” and that’s your problem.
- Egress-only Internet Gateway (egress-only IGW). Managed. IPv6 only. VPC-scoped (not AZ-scoped). Free: no hourly, no data-processing charge. Does not rewrite addresses; instances keep their own public IPv6. Stateful: outbound connections are allowed, inbound replies match state, unsolicited inbound is dropped. The IPv6 equivalent of the outbound-only semantic a NAT Gateway provides for IPv4.
- Internet Gateway (IGW). Managed, free, VPC-scoped. Supports both IPv4 and IPv6. Allows unrestricted inbound and outbound. Instances in subnets routed to an IGW with public IPs are on the internet; instances without public IPs can’t reach the internet through an IGW (that’s what NAT is for). Not an egress solution for private subnets; an IGW enables public subnets.
- VPC Endpoints. Not an egress mechanism to the internet; the opposite. Gateway endpoints (S3, DynamoDB) add a route-table target; interface endpoints (most other services) place an ENI in your subnet and provide PrivateLink-based access. Traffic to those services stays inside the AWS network. Removes traffic from NAT Gateway data-processing charges. Interface endpoints bill $0.01/h plus $0.01/GB processed per endpoint per AZ, cheaper per GB than NAT for the services that support them.
- AWS Global Accelerator and CloudFront (outbound). Not egress mechanisms; mentioned here so they’re not confused with egress. Both are ingress-side.
Side by side
| Option | IP version | Managed | Cost | Throughput | Fault domain |
|---|---|---|---|---|---|
| NAT Gateway | IPv4 | ✓ | $0.045/h + $0.045/GB | up to 100 Gbps | AZ-scoped |
| NAT Instance | IPv4 | ✗ | instance-hour only | instance-class limited | AZ-scoped (your HA story) |
| Egress-only IGW | IPv6 | ✓ | free | implicit with VPC | VPC-scoped |
| Internet Gateway | IPv4 + IPv6 | ✓ | free | implicit with VPC | VPC-scoped |
| VPC Endpoints | IPv4 + IPv6 | ✓ | $0.01/h + $0.01/GB (interface); free (gateway) | per-service | VPC-scoped (interface is AZ-scoped) |
Reading the table:
- IPv4 private subnet, production: NAT Gateway per AZ, route-table-per-AZ pointing at the local gateway. Add S3 and DynamoDB gateway endpoints; add interface endpoints for any other AWS services with large traffic volumes. Direct non-AWS egress through NAT.
- IPv4 private subnet, dev or low-volume: NAT Instance on a t4g.small with Spot if you’re brave, or NAT Gateway if the managed story is worth the $33/month/AZ. Usually it is.
- IPv6-only subnet: egress-only IGW. Free. Done.
- Dual-stack subnet: NAT Gateway for the IPv4 traffic, egress-only IGW for the IPv6 traffic, both in the route table with different destinations (0.0.0.0/0 and ::/0).
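The reading above boils down to a lookup from IP stack to default routes. A minimal sketch follows; the stack labels and the target strings are placeholders I chose, not AWS identifiers.

```python
def default_routes(stack: str) -> dict:
    """Default routes a private subnet needs, keyed by destination CIDR."""
    routes = {}
    if stack in ("ipv4", "dual-stack"):
        routes["0.0.0.0/0"] = "nat-gateway-in-same-az"   # placeholder target
    if stack in ("ipv6", "dual-stack"):
        routes["::/0"] = "egress-only-igw"                # placeholder target
    if not routes:
        raise ValueError(f"unknown stack: {stack!r}")
    return routes


ipv4_routes = default_routes("ipv4")
ipv6_routes = default_routes("ipv6")
dual_routes = default_routes("dual-stack")
```

The dual-stack case returns both entries because the two destinations never overlap: an IPv4 packet can only match 0.0.0.0/0 and an IPv6 packet can only match ::/0.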
The egress path diagram
The picks in depth
Three NAT Gateways, one per AZ, route-table-per-AZ. The public subnets hold the NAT Gateways, each with an Elastic IP. Each private subnet’s route table has 0.0.0.0/0 → natgw-<az> pointing at the NAT Gateway in the same AZ. This is the difference between “NAT Gateway high availability” and “NAT Gateway single point of failure”; one-per-AZ means losing an AZ only takes out that AZ’s workloads and egress, not the whole VPC’s egress.
# Per-AZ route table: create it, capture its ID, add the default route
RTB_A=$(aws ec2 create-route-table --vpc-id vpc-0abc \
  --query RouteTable.RouteTableId --output text)  # rtb-private-a in this walkthrough
aws ec2 create-route --route-table-id "$RTB_A" \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-0a1b2c3d  # natgw-a in eu-west-1a
Repeat for rtb-private-b → natgw-b, rtb-private-c → natgw-c. Each subnet’s association points to its AZ’s route table. Cost: 3 × $33 per month in hourly charges for HA, plus the same per-GB data charge as before (the traffic is split across gateways, not multiplied); in return, egress survives single-AZ failure and no packets cross AZ boundaries on the way out.
S3 and DynamoDB gateway endpoints attached to every route table. Zero cost, immediate NAT-bill reduction. Gateway endpoints add a prefix-list route (pl-xxxxxxxx) that takes precedence over the default route for S3/DynamoDB traffic:
aws ec2 create-vpc-endpoint --vpc-id vpc-0abc \
--service-name com.amazonaws.eu-west-1.s3 \
--route-table-ids rtb-private-a rtb-private-b rtb-private-c
After this, aws s3 cp from a private instance goes straight through the endpoint without touching NAT or the internet. If S3 is a meaningful fraction of outbound traffic, this alone can halve the NAT bill.
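The reason the endpoint route wins is ordinary longest-prefix matching: the prefix-list entries are more specific than 0.0.0.0/0. The sketch below imitates that selection with Python's ipaddress module. The 198.51.100.0/24 block is a documentation range standing in for one CIDR from the S3 prefix list, and the target names are placeholders.

```python
import ipaddress


def pick_route(routes: dict, destination: str) -> str:
    """Return the target of the most specific route containing destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    # Longest prefix wins, as in a VPC route table.
    _, target = max(matches, key=lambda match: match[0].prefixlen)
    return target


routes = {
    "10.0.0.0/16": "local",
    "0.0.0.0/0": "nat-0a1b2c3d",
    "198.51.100.0/24": "vpce-s3",  # stand-in for an S3 prefix-list CIDR
}

s3_target = pick_route(routes, "198.51.100.20")
internet_target = pick_route(routes, "203.0.113.5")
local_target = pick_route(routes, "10.0.12.87")
```

Anything inside the stand-in S3 range goes to the endpoint, everything else still falls through to the NAT Gateway, and nothing in the application has to know the difference.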
Interface endpoints for ECR, SSM, CloudWatch Logs, STS, Secrets Manager. Each bills $0.01/h per AZ plus $0.01/GB processed, versus NAT’s $0.045/GB. The endpoint’s hourly charge is about $7.20/month per AZ and each GB moved off NAT saves $0.035, so the break-even for an interface endpoint is roughly 200 GB/month per AZ (about 7 GB/day) of traffic to that service; above that, the endpoint is cheaper than NAT. Docker pulls from ECR are the big win: 200 GB/day × $0.045 = $9/day on NAT, versus 200 GB/day × $0.01 = $2/day in endpoint data processing plus $0.24/day per endpoint per AZ in hourly charges.
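Working the break-even directly from the quoted rates is a useful sanity check. The sketch assumes a 30-day month and that every GB sent through the endpoint would otherwise have gone through NAT; the function name is mine.

```python
ENDPOINT_HOURLY_USD = 0.01   # interface endpoint, per AZ, quoted above
ENDPOINT_PER_GB_USD = 0.01   # interface endpoint data processing
NAT_PER_GB_USD = 0.045       # NAT Gateway data processing
HOURS_PER_MONTH = 24 * 30    # assumes a 30-day month


def break_even_gb_per_month(azs: int = 1) -> float:
    """Monthly GB at which one endpoint's fixed cost is repaid by NAT savings."""
    fixed = azs * HOURS_PER_MONTH * ENDPOINT_HOURLY_USD
    saving_per_gb = NAT_PER_GB_USD - ENDPOINT_PER_GB_USD
    return fixed / saving_per_gb


one_az = break_even_gb_per_month(azs=1)
three_azs = break_even_gb_per_month(azs=3)
```

Under these assumptions one_az comes out a little above 200 GB/month and three_azs a little above 600 GB/month, so a service that moves only a few GB a month to AWS APIs does not pay for its own endpoint on cost alone.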
Egress-only IGW for the IPv6-only subnet. One egress-only IGW per VPC; the subnet’s route table has ::/0 → eigw-0xxxx. Free. No EIP. Outbound IPv6 connections work; unsolicited inbound connections are dropped at the gateway. If the IPv6-only subnet also needs to reach IPv4-only internet endpoints (most do), a NAT Gateway handles that side: instances with no IPv4 address get there through DNS64 on the subnet plus NAT64 on the NAT Gateway, with a 64:ff9b::/96 route pointing at it. Either way, egress to both address families means two egress components.
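The outbound-only behaviour is connection tracking without any address rewriting. A toy model, with a class name of my own and 2001:db8:: documentation addresses standing in for real ones:

```python
class EgressOnlyFilter:
    """Toy stateful filter: tracks outbound flows, never rewrites addresses."""

    def __init__(self):
        self.flows = set()

    def outbound(self, src, sport, dst, dport) -> bool:
        """Record the flow and let the packet out unchanged."""
        self.flows.add((src, sport, dst, dport))
        return True

    def inbound(self, src, sport, dst, dport) -> bool:
        """Allow a packet only if it mirrors a tracked outbound flow."""
        return (dst, dport, src, sport) in self.flows


gateway = EgressOnlyFilter()
instance = "2001:db8:1::87"
server = "2001:db8:ffff::7"

gateway.outbound(instance, 50000, server, 443)
reply_allowed = gateway.inbound(server, 443, instance, 50000)
probe_allowed = gateway.inbound("2001:db8:bad::1", 40000, instance, 22)
```

The instance's own address appears on the wire in both directions, which is the difference from NAT: the gateway decides which packets pass, not which address they carry.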
No NAT Instance, unless it’s a lab. NAT Instance was the original answer before NAT Gateway existed. Modern use cases are small: dev environments that rarely use egress, or very cost-sensitive hobby setups. Production traffic should not go through an EC2 instance you’re responsible for patching.
A worked cost trace: before and after
Before: one NAT Gateway for the whole VPC, 250 GB/day total egress (50 GB third-party API, 200 GB ECR), no endpoints.
NAT Gateway hourly: 1 × 24 × 30 × $0.045 = $32.40
NAT Gateway data processing: 250 × 30 × $0.045 = $337.50
Total: $369.90/month
After: three NAT Gateways per-AZ, S3/DynamoDB gateway endpoints (free), interface endpoints for the five services listed earlier (ECR among them) in all three AZs.
NAT Gateway hourly: 3 × 24 × 30 × $0.045 = $97.20
NAT Gateway data processing: 50 × 30 × $0.045 = $67.50
(ECR traffic moved to interface endpoints; third-party APIs remain on NAT)
Interface endpoints hourly: 3 × 24 × 30 × $0.01 × 5 svc = $108.00
Interface endpoints data: 200 × 30 × $0.01 = $60.00
Total: $332.70/month
Marginal savings in absolute terms (about $37/month), but a very different shape: the cost is now ~60% fixed (three NAT Gateways, interface endpoints per AZ) and ~40% data-driven, with far better fault tolerance and a much flatter bill when ECR pulls surge. The real win is the failure-domain change: one NAT Gateway going down no longer takes out the whole VPC’s egress.
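The trace is easy to reproduce, which helps when the traffic numbers change. A sketch using the same rates and a 30-day month; the function and parameter names are assumptions for illustration.

```python
HOURS_PER_MONTH = 24 * 30
NAT_HOURLY_USD, NAT_PER_GB_USD = 0.045, 0.045
VPCE_HOURLY_USD, VPCE_PER_GB_USD = 0.01, 0.01


def monthly_bill(nat_gateways, nat_gb_per_day,
                 endpoint_services=0, azs=0, endpoint_gb_per_day=0):
    """Split the monthly egress bill into fixed (hourly) and data parts."""
    fixed = (nat_gateways * HOURS_PER_MONTH * NAT_HOURLY_USD
             + endpoint_services * azs * HOURS_PER_MONTH * VPCE_HOURLY_USD)
    data = (nat_gb_per_day * 30 * NAT_PER_GB_USD
            + endpoint_gb_per_day * 30 * VPCE_PER_GB_USD)
    return {
        "fixed": round(fixed, 2),
        "data": round(data, 2),
        "total": round(fixed + data, 2),
    }


before = monthly_bill(nat_gateways=1, nat_gb_per_day=250)
after = monthly_bill(nat_gateways=3, nat_gb_per_day=50,
                     endpoint_services=5, azs=3, endpoint_gb_per_day=200)
fixed_share_after = after["fixed"] / after["total"]
```

Rerunning it with a doubled ECR volume shows the flattening: the "before" bill grows by $0.045 for every extra GB, the "after" bill by $0.01.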
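If the bill is still the main concern, the next lever is moving ECR Public pulls to an ECR pull-through cache: the cache lives in the same Region as the workload, and in-Region pulls over interface endpoints cost far less than cross-internet pulls through NAT. ECR interface endpoints do not serve ECR Public directly, so the cache is also what lets the 200 GB/day of image pulls in the trace above leave NAT in the first place.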
What’s worth remembering
- NAT Gateway for IPv4 egress from private subnets. Managed, scales automatically, bills hourly + data processed. Place one per AZ; point each AZ’s route table at its local gateway.
- Egress-only IGW for IPv6 egress from private subnets. Managed, free, VPC-scoped. IPv6 addresses are globally routable; the egress-only IGW provides the outbound-only policy, not the translation.
- Internet Gateway is for public subnets. Free, allows both directions, VPC-scoped. Does not provide egress for private subnets; that’s what NAT Gateway or egress-only IGW is for.
- Gateway VPC Endpoints (S3, DynamoDB) are free and take traffic off NAT. Add them to every private route table; there is no downside.
- Interface VPC Endpoints break even against NAT at ~200 GB/month per service per AZ. High-volume AWS service traffic should go through endpoints, not through NAT.
- NAT Instance is a legacy answer. Production uses NAT Gateway; self-run NAT shows up in labs or tight-budget dev environments.
- Cost shape is hourly + data, per AZ. Three NAT Gateways for HA mean 3x the hourly cost but the same per-GB rate on whichever AZ’s gateway actually transports the bytes.
- Dual-stack means two egress components. NAT Gateway for 0.0.0.0/0, egress-only IGW for ::/0. Both in the route table; they don’t conflict.
Three ways out of the VPC, three different jobs. NAT Gateway rewrites IPv4; egress-only IGW enforces IPv6 egress policy; an ordinary IGW is for public subnets and isn’t an egress solution at all. Pick by IP version and by whether the traffic belongs on NAT at all.