The situation
The payments team runs payments-api in a dedicated VPC, 10.50.0.0/16, behind a Network Load Balancer in the payments account. Twelve consumer services across four other accounts need to make HTTPS calls to the payments API. Today, the payments VPC is peered to each consumer VPC individually (twelve VPC peerings, each with its own route-table edits) and consumers connect to the NLB’s private DNS name. The shape has grown in an unplanned way, and it has three problems:
- Peering is not transitive. A consumer VPC peered to payments cannot reach anything else in the payments team’s environment without a separate peering, which adds more peerings.
- The whole payments VPC is reachable from every peered consumer. Security groups limit what they can talk to, but if a consumer’s workload is compromised, nothing in the network stops it scanning 10.50.0.0/16.
- IP-space overlap is becoming a problem. Two acquired business units have VPCs using 10.0.0.0/16, which would conflict with parts of the internal ranges. Peering requires non-overlapping CIDRs.
The team wants to consolidate the pattern. Options on the table:
- Keep peering, add more. A peering per consumer-VPC.
- Transit Gateway. Attach payments and all consumers to a TGW, let routing tables govern access.
- VPC Lattice. A managed service-to-service connectivity layer.
- PrivateLink / VPC endpoint services. Expose payments-api as a named service that consumers call through an endpoint in their own VPC.
- Public API through CloudFront with WAF and IAM auth. Send traffic over the internet with identity and rate limiting at the edge.
What actually matters
Before picking, it’s worth asking what we’re trading.
The first question is what consumers need: a network path or a socket. If a consumer service needs to make outbound HTTPS calls to payments-api and nothing else, what it needs is a socket: payments-api:443 reachable, and nothing else in the payments environment. If it also needs to reach payments’ database for batch exports, or the payments team’s internal tooling, it needs more of a network path, and the correct answer is probably that it should not have that access, because the coupling is too deep.
The second question is isolation default. A service-exposing primitive has a default of “nothing is reachable except the one service the producer published”; a network-connecting primitive has a default of “what the route table says is reachable, is reachable.” Picking between them is largely a choice between the two defaults and how much explicit work each one demands to express the actual access policy.
The third question is overlapping IP handling. A service-exposing primitive that puts the consumer-side termination in the consumer’s own CIDR avoids the address-space coupling entirely: the producer’s CIDR never enters the consumer’s route table. A network-connecting primitive cannot bridge VPCs whose CIDRs overlap.
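As a quick illustration of the overlap question, the sketch below checks a set of VPC CIDRs pairwise using Python's standard `ipaddress` module. The VPC names are hypothetical; the CIDRs are the ones quoted in this post.

```python
from ipaddress import ip_network
from itertools import combinations


def overlapping_pairs(vpcs):
    """Return the (name, name) pairs whose CIDR blocks overlap."""
    pairs = []
    for (name_a, cidr_a), (name_b, cidr_b) in combinations(sorted(vpcs.items()), 2):
        if ip_network(cidr_a).overlaps(ip_network(cidr_b)):
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical VPC names; CIDRs taken from the text.
vpcs = {
    "payments": "10.50.0.0/16",
    "acquired-bu-1": "10.0.0.0/16",
    "acquired-bu-2": "10.0.0.0/16",
}

conflicts = overlapping_pairs(vpcs)
print(conflicts)  # [('acquired-bu-1', 'acquired-bu-2')]
```

Neither acquired VPC overlaps the payments range itself, but a single route for 10.0.0.0/16 can only point at one of them, which is the kind of coupling a consumer-side endpoint avoids.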
The fourth question is multi-consumer ergonomics. A producer-consumer abstraction where the producer explicitly allowlists callers (by AWS principal) makes adding the 13th consumer a one-call change. A network-bridging abstraction requires per-attachment route-table plumbing and doesn’t intrinsically constrain what traffic crosses.
The fifth question is cost and scale. A service-exposing primitive typically charges per consumer endpoint and per byte; a network-connecting primitive typically charges per attachment and per byte. Which one is cheaper depends on consumer count, traffic volume, and how many of the consumers genuinely need network reachability versus a single socket. High-volume integrations can flip either way; the sums are worth modelling against actual traffic.
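A minimal sketch of that modelling, assuming the usual billing shapes: interface endpoints billed per endpoint per Availability Zone per hour plus per GB, and Transit Gateway billed per attachment per hour plus per GB. The rates are arbitrary placeholder units chosen so that a crossover is visible; they are not AWS prices and say nothing about which option is cheaper in practice. The producer's NLB cost is left out.

```python
HOURS_PER_MONTH = 730  # common approximation


def privatelink_monthly(endpoints, azs_per_endpoint, gb, hourly_rate, gb_rate):
    """Per-endpoint, per-AZ hourly charge plus per-GB processing."""
    return endpoints * azs_per_endpoint * HOURS_PER_MONTH * hourly_rate + gb * gb_rate


def tgw_monthly(attachments, gb, hourly_rate, gb_rate):
    """Per-attachment hourly charge plus per-GB processing."""
    return attachments * HOURS_PER_MONTH * hourly_rate + gb * gb_rate


# Placeholder rates in arbitrary units. Substitute the current regional prices.
PL_HOURLY, PL_GB = 1, 2
TGW_HOURLY, TGW_GB = 5, 1

for gb in (1_000, 30_000, 100_000):
    # 12 consumer endpoints across 2 AZs each, versus 12 consumers + payments on a TGW.
    pl = privatelink_monthly(12, 2, gb, PL_HOURLY, PL_GB)
    tgw = tgw_monthly(13, gb, TGW_HOURLY, TGW_GB)
    cheaper = "PrivateLink" if pl < tgw else "Transit Gateway"
    print(f"{gb:>7} GB/month: PrivateLink={pl} TGW={tgw} cheaper={cheaper}")
```

With real prices plugged in, the same two functions show where, if anywhere, the per-GB term overtakes the fixed hourly term for a given consumer count.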
What we’ll filter on
- Exposes one service, not the whole VPC: does the consumer get only the one socket, or the whole CIDR?
- Tolerates overlapping CIDRs: does it work when consumers and producers use the same IP space?
- Scales to many consumers: does adding consumer N+1 scale linearly or blow up the topology?
- Explicit producer control: can the producer revoke a single consumer without collateral?
- Operational model fit for service-to-service: does the abstraction match “I am running a service and some of you may call it”?
The connectivity landscape
1. VPC peering. Pairwise, non-transitive, requires non-overlapping CIDRs. Exposes the full peered VPC (gated by route tables and SGs). Reasonable for two or three connections; hostile at scale. Fails on transitivity and overlap.
2. Transit Gateway. Hub-and-spoke with route tables. Transitive within the hub; tolerates segmentation via route tables; still requires non-overlapping CIDRs between attachments that need to communicate. Exposes the full attached VPC’s routable space subject to SG controls. Good for network-level connectivity; wrong abstraction for “call one service.”
3. VPC Lattice. A managed application-layer connectivity service. Service directory, authentication policies, health checks, one level up from pure PrivateLink. Good when there are many services and the producers and consumers are both numerous. Worth its own post.
4. PrivateLink / VPC endpoint services. The producer puts a Network Load Balancer (or Gateway Load Balancer) in front of the service and creates a VPC endpoint service that points to the NLB. Consumers create Interface VPC endpoints in their own VPCs that terminate on ENIs with consumer-VPC IPs. DNS resolution inside the consumer VPC maps the endpoint name to the endpoint ENIs. Traffic flows consumer-ENI → AWS fabric → producer NLB, one-way only in terms of initiation. Producer allowlists consumers by AWS principal (account, IAM role, or IAM user ARN). Exposes exactly the one service, tolerates overlapping CIDRs, scales linearly.
5. Public API + CloudFront + WAF + IAM auth. Expose payments-api publicly, rely on identity (SigV4 or API Gateway authorisers), WAF for rate limiting, CloudFront for geographic reach. Works when consumers are external or cross-organisation. Usually the wrong answer inside one organisation because it adds the public internet into the data path for traffic that should stay private.
Side by side
| Option | Exposes one service | Tolerates overlap | Scales to many consumers | Producer-controlled | Service-to-service fit |
|---|---|---|---|---|---|
| VPC peering | ✗ | ✗ | ✗ | Partial | ✗ |
| Transit Gateway | ✗ | ✗ | ✓ | Partial | ✗ |
| VPC Lattice | ✓ | ✓ | ✓ | ✓ | ✓ |
| PrivateLink | ✓ | ✓ | ✓ | ✓ | ✓ |
| Public + CloudFront | ✓ | ✓ | ✓ | ✓ (IAM) | Partial |
One service, many consumers, overlapping CIDRs
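The filtering step can be made mechanical. The sketch below transcribes the comparison table and keeps only options that pass every criterion; treating “Partial” as a fail is an assumption made here for strictness, not something the table itself dictates.

```python
CRITERIA = ("one_service", "overlap_ok", "scales", "producer_control", "service_fit")

# Transcribed from the table: True for a tick, False for a cross or "Partial".
OPTIONS = {
    "VPC peering": (False, False, False, False, False),
    "Transit Gateway": (False, False, True, False, False),
    "VPC Lattice": (True, True, True, True, True),
    "PrivateLink": (True, True, True, True, True),
    "Public + CloudFront": (True, True, True, True, False),
}


def passes_all(options):
    """Names of the options that satisfy every criterion."""
    return [name for name, marks in options.items() if all(marks)]


def failures(options, name):
    """Criteria that a given option does not fully satisfy."""
    return [c for c, ok in zip(CRITERIA, options[name]) if not ok]


shortlist = passes_all(OPTIONS)
print(shortlist)  # ['VPC Lattice', 'PrivateLink']
print(failures(OPTIONS, "Transit Gateway"))
```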
The pick(s) in depth
PrivateLink for the payments-api exposure, TGW staying in place for the shared-services VPCs that really need network reachability. The distinction is the abstraction that fits the use case: payments-api is a service, not a subnet; TGW is the correct answer for shared monitoring and logging VPCs where consumers genuinely do need broad network paths.
Setting up the PrivateLink side:
1. Producer creates an NLB in the payments VPC. Internal-facing, target group is the payments-api service. Cross-zone balancing on if the consumer needs to reach any healthy target; off if per-AZ affinity matters.
2. Producer creates the VPC endpoint service. aws ec2 create-vpc-endpoint-service-configuration --network-load-balancer-arns <nlb-arn> --no-acceptance-required creates it; the allowlist is a separate call, aws ec2 modify-vpc-endpoint-service-permissions --service-id vpce-svc-0abc... --add-allowed-principals arn:aws:iam::1111:role/consumer-a ..., because the create call has no option for principals. The service gets a name like com.amazonaws.vpce.eu-west-1.vpce-svc-0abc.... With --no-acceptance-required, consumer-side create-endpoint calls succeed immediately for allowlisted principals; --acceptance-required adds a manual approval step, useful for ad-hoc consumer onboarding.
3. Producer adds private DNS. A domain name like payments.internal.example.com can be associated with the endpoint service. Requires DNS verification (a TXT record on the domain). Once verified, consumers create endpoints with --private-dns-enabled (the consumer VPC needs DNS resolution and DNS hostnames switched on) and AWS creates a Route 53 Private Hosted Zone in the consumer’s VPC mapping the name to the endpoint ENIs. Consumer apps call payments.internal.example.com:443 without learning any VPC-internal details.
4. Each consumer creates an interface VPC endpoint. In their own account, they run aws ec2 create-vpc-endpoint --vpc-endpoint-type Interface --service-name com.amazonaws.vpce.eu-west-1.vpce-svc-0abc... --vpc-id <consumer-vpc> --subnet-ids <multi-az subnets> --security-group-ids <sg-for-endpoint>. An ENI appears in each specified subnet, with an IP from the consumer’s CIDR. The endpoint’s security group controls who inside the consumer VPC can send to it.
5. Off we go. Consumer applications resolve payments.internal.example.com (or the auto-generated endpoint DNS name) inside their VPC, get the endpoint ENI IPs, and connect. Traffic never traverses any shared routing space; there are no routes between consumer and producer VPCs; payments-api sees connections arriving from the NLB’s private IP addresses, not from addresses in the consumer’s VPC.
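The producer and consumer calls in the steps above can also be written as keyword arguments for boto3's EC2 client. The sketch below only builds the request dictionaries, so it runs without credentials; the allowlist is modelled as its own modify_vpc_endpoint_service_permissions request. Every ARN and ID in the example is a hypothetical placeholder.

```python
def endpoint_service_request(nlb_arn, private_dns_name):
    """Producer: kwargs for ec2.create_vpc_endpoint_service_configuration."""
    return {
        "NetworkLoadBalancerArns": [nlb_arn],
        "AcceptanceRequired": False,  # allowlisted principals connect without approval
        "PrivateDnsName": private_dns_name,
    }


def allowlist_request(service_id, principal_arns):
    """Producer: kwargs for ec2.modify_vpc_endpoint_service_permissions."""
    return {"ServiceId": service_id, "AddAllowedPrincipals": list(principal_arns)}


def interface_endpoint_request(service_name, vpc_id, subnet_ids, security_group_ids):
    """Consumer: kwargs for ec2.create_vpc_endpoint."""
    return {
        "VpcEndpointType": "Interface",
        "ServiceName": service_name,
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),  # one ENI per subnet, ideally one per AZ
        "SecurityGroupIds": list(security_group_ids),
        "PrivateDnsEnabled": True,  # only works once the producer's domain is verified
    }


# Hypothetical identifiers, for illustration only.
service = endpoint_service_request(
    "arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/payments-api/EXAMPLE",
    "payments.internal.example.com",
)
allow = allowlist_request("vpce-svc-EXAMPLE", ["arn:aws:iam::111111111111:role/consumer-a"])
endpoint = interface_endpoint_request(
    "com.amazonaws.vpce.eu-west-1.vpce-svc-EXAMPLE",
    "vpc-EXAMPLE",
    ["subnet-EXAMPLE-a", "subnet-EXAMPLE-b"],
    ["sg-EXAMPLE"],
)
print(service, allow, endpoint, sep="\n")
```

With boto3 installed and credentials configured, each dictionary could be passed to the matching client method, for example `boto3.client("ec2").create_vpc_endpoint(**endpoint)`.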
Two gotchas worth naming explicitly:
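Source IP preservation. Behind PrivateLink, payments-api sees the NLB’s private IP as the source of every connection, and the preserve-client-ip target group attribute has no effect on PrivateLink traffic. If payments-api needs the calling IP for per-caller rate-limiting or logging, enable Proxy Protocol v2 on the target group: the header carries the consumer’s source address (a private IP from the consumer’s VPC CIDR, so not unique when consumer ranges overlap) and the ID of the consumer’s VPC endpoint, which is the safer key. The application has to parse that header, and the security group on the target needs to allow the NLB’s addresses rather than client ranges.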
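Where the application needs to identify the caller, one option on an NLB target group is Proxy Protocol v2, which prepends a binary header to each connection. The sketch below parses that header for the TCP-over-IPv4 case and looks for the AWS-specific TLV (type 0xEA, subtype 0x01) that carries the VPC endpoint ID. It is a minimal parser written for illustration, with a hand-built sample header and a hypothetical endpoint ID, not a hardened implementation.

```python
import struct
from ipaddress import IPv4Address

PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"
PP2_TYPE_AWS = 0xEA
PP2_SUBTYPE_AWS_VPCE_ID = 0x01


def parse_proxy_v2(data):
    """Return (source_ip, source_port, vpce_id_or_None, header_length)."""
    if len(data) < 16 or data[:12] != PP2_SIGNATURE:
        raise ValueError("not a Proxy Protocol v2 header")
    ver_cmd, fam_proto, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd != 0x21 or fam_proto != 0x11:
        raise ValueError("this sketch handles only PROXY over TCP/IPv4")
    body = data[16:16 + length]
    if len(body) < length or length < 12:
        raise ValueError("truncated header")
    # IPv4 address block: src addr (4), dst addr (4), src port (2), dst port (2).
    src_ip = str(IPv4Address(body[0:4]))
    (src_port,) = struct.unpack("!H", body[8:10])
    vpce_id = None
    pos = 12
    while pos + 3 <= len(body):  # TLVs: 1-byte type, 2-byte length, value
        tlv_type, tlv_len = struct.unpack("!BH", body[pos:pos + 3])
        value = body[pos + 3:pos + 3 + tlv_len]
        if tlv_type == PP2_TYPE_AWS and value[:1] == bytes([PP2_SUBTYPE_AWS_VPCE_ID]):
            vpce_id = value[1:].decode("ascii")
        pos += 3 + tlv_len
    return src_ip, src_port, vpce_id, 16 + length


# Hand-built sample: consumer 10.0.1.25:40000 -> 10.50.3.7:443, hypothetical endpoint ID.
vpce = b"vpce-EXAMPLE"
tlv = struct.pack("!BH", PP2_TYPE_AWS, len(vpce) + 1) + bytes([PP2_SUBTYPE_AWS_VPCE_ID]) + vpce
addrs = IPv4Address("10.0.1.25").packed + IPv4Address("10.50.3.7").packed + struct.pack("!HH", 40000, 443)
header = PP2_SIGNATURE + struct.pack("!BBH", 0x21, 0x11, len(addrs) + len(tlv)) + addrs + tlv

parsed = parse_proxy_v2(header + b"application bytes follow")
print(parsed)
```

The returned header length tells the server how many bytes to skip before the TLS or HTTP payload begins; in practice most proxies and web servers have a setting that does this parsing for you.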
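Cross-region is not free. By default, an endpoint service accepts endpoints only from its own Region. Native cross-Region PrivateLink exists (the producer lists the supported Regions on the endpoint service and the consumer names the service’s Region when creating the endpoint), but it has to be enabled explicitly and adds inter-Region data transfer charges. The alternatives are to run the service in multiple Regions with separate endpoint services, or to front the cross-region traffic with a DNS-layer routing policy (Route 53 latency or geo) and have consumers point to the nearest Regional endpoint.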
A worked onboarding
A new consumer team, in account 5555, wants to call payments-api. Today’s process:
- Consumer requests access. They send their IAM role ARN (arn:aws:iam::5555:role/orders-api) to the payments team.
- Producer updates allowlist. aws ec2 modify-vpc-endpoint-service-permissions --service-id vpce-svc-0abc... --add-allowed-principals arn:aws:iam::5555:role/orders-api. One API call; no route table edits; no peering; no ticket on anyone else’s queue.
- Consumer creates the endpoint. From their own account: aws ec2 create-vpc-endpoint ... --service-name com.amazonaws.vpce.eu-west-1.vpce-svc-0abc... --private-dns-enabled. An endpoint ENI appears in their subnets; private DNS starts resolving.
- Consumer calls payments-api. Their code uses payments.internal.example.com. First call succeeds; latency is LAN-class (the fabric is AWS-side, not internet); no changes to the consumer’s VPC routing.
Total elapsed time: maybe an hour, most of it in the consumer team’s own deployment pipeline. Compare to the peering alternative: four route-table changes, a new peering connection, a security-group dance, and a half-day of validation.
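To revoke: remove the principal from the allowlist, which stops that consumer creating new endpoints, then reject its existing endpoint connection (aws ec2 reject-vpc-endpoint-connections --service-id vpce-svc-0abc... --vpc-endpoint-ids <consumer-endpoint-id>). The allowlist governs endpoint creation, so an endpoint that was already accepted keeps working until it is rejected. Once rejected, the consumer’s endpoint ENI still exists but points to nothing useful; they clean it up on their side.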
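A sketch of the revoke path as an ordered list of boto3 EC2 calls, assuming that an already-connected endpoint has to be rejected explicitly in addition to the allowlist change. The function only builds the call names and arguments; `connections` is shaped like the `VpcEndpointConnections` list returned by describe_vpc_endpoint_connections, and the IDs are hypothetical.

```python
INACTIVE_STATES = {"rejected", "deleting", "deleted", "failed", "expired"}


def revoke_plan(service_id, principal_arn, account_id, connections):
    """Ordered (api_call, kwargs) pairs that cut off one consumer account."""
    endpoint_ids = [
        c["VpcEndpointId"]
        for c in connections
        if c["VpcEndpointOwner"] == account_id
        and c["VpcEndpointState"].lower() not in INACTIVE_STATES
    ]
    plan = [(
        "modify_vpc_endpoint_service_permissions",
        {"ServiceId": service_id, "RemoveAllowedPrincipals": [principal_arn]},
    )]
    if endpoint_ids:
        plan.append((
            "reject_vpc_endpoint_connections",
            {"ServiceId": service_id, "VpcEndpointIds": endpoint_ids},
        ))
    return plan


# Hypothetical connection list: account 5555 holds one live endpoint.
connections = [
    {"VpcEndpointId": "vpce-EXAMPLE-orders", "VpcEndpointOwner": "5555", "VpcEndpointState": "available"},
    {"VpcEndpointId": "vpce-EXAMPLE-other", "VpcEndpointOwner": "4444", "VpcEndpointState": "available"},
]

plan = revoke_plan("vpce-svc-EXAMPLE", "arn:aws:iam::5555:role/orders-api", "5555", connections)
for call, kwargs in plan:
    print(call, kwargs)
```

Only the endpoint owned by account 5555 appears in the rejection call, which is the “without collateral” property from the filter list: other consumers’ connections are untouched.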
What’s worth remembering
- PrivateLink exposes a service, not a network. An endpoint service is one NLB (or GWLB) target. Consumers can only reach that target; they cannot scan the producer’s VPC, because no route to it exists in their route tables.
- Overlapping CIDRs are not a problem. The endpoint ENI lives in the consumer’s VPC with a consumer-VPC IP. Two consumers on 10.0.0.0/16 can both have endpoints to the same producer service; neither learns the other’s routes.
- Producer allowlist is by AWS principal. Account, role, or user ARN. Adding a consumer is one API call; revoking is the reverse call plus rejecting any endpoint connection that consumer already holds. No route-table surgery.
- Private DNS makes it pretty. Associate a domain with the endpoint service, verify ownership, and consumers resolve it to endpoint ENIs automatically via Route 53 Private Hosted Zone in their VPC.
- Behind PrivateLink, targets see the NLB’s private IP, and preserve-client-ip does not change that. If payments-api needs the caller’s address or endpoint ID, enable Proxy Protocol v2 on the target group; the target security group has to allow the NLB either way.
- TGW is for network paths, PrivateLink is for services. Use TGW when consumers need broad reachability into the other VPC (shared monitoring, shared DNS, logging collectors). Use PrivateLink when consumers need one socket against one service.
- Cross-region needs planning. An endpoint service serves only its own Region unless the producer explicitly enables cross-Region access; otherwise, run per-Region endpoint services and route DNS.
- Cost scales with consumer count and bytes. Per-endpoint hour on the consumer side + per-GB data processing. Model it before committing for high-volume integrations; for moderate traffic, PrivateLink is often cheaper than adding a TGW attachment on the consumer side.
Peering says “your network is my network.” Transit Gateway says “our networks share a hub.” PrivateLink says “here is one socket, the rest of my network is none of your business.” For twelve consumers calling one API, the third framing is the one that matches reality. The work isn’t picking the newest AWS service, it’s noticing that “give them a route” and “give them a service” are different asks that happen to both involve networking.