The situation
An EKS cluster with 40 microservices across eight namespaces. Observability via CloudWatch + X-Ray; deployment via Helm charts in GitOps. The team has pain points:
- Inconsistent retries and timeouts. Every service implements its own, differently. Cascading failures are a regular incident pattern.
- mTLS piecemeal. Some services terminate TLS on their ALB, some use Istio, some don’t encrypt in-cluster at all.
- Traffic shifting is hard. Canary deploys rely on Kubernetes deployment weights, which are too coarse for gradual rollouts.
- Per-service metrics. Each team instruments their own; there’s no uniform “success rate per caller-callee pair” dashboard.
- East-west authorisation. Nothing stops Service A from calling Service B; authorisation is IAM on external endpoints only.
The proposal is to standardise on a mesh. Six candidates, counting the no-mesh baseline:
- AWS App Mesh. AWS-managed control plane, Envoy sidecars, integrates with Cloud Map for service discovery.
- Istio. Open-source, feature-rich, operationally heavy.
- Linkerd. Open-source, lightweight, less featureful.
- Cilium / eBPF mesh. An emerging alternative that avoids sidecars.
- VPC Lattice. AWS’s newer service-connectivity product; it sits between compute types and works cross-cluster.
- No mesh, libraries in every service. The alternative: each service uses a shared library for retries, timeouts, and mTLS.
Update: since this is 2028, AWS App Mesh’s posture has evolved. AWS has been guiding users toward alternatives for some use cases. The choice is not “pick App Mesh automatically”; it’s “match the workload to the right mesh and know why.”
What actually matters
The core trade in service meshes is uniformity in exchange for complexity. A mesh forces every service to speak the same language for retries, mTLS, metrics, and traffic shaping. That uniformity is the value. The cost is a second control plane (the mesh’s) in addition to Kubernetes, plus per-service overhead (sidecar memory, latency) and operational expertise.
The first thing to ask is: which problems are we actually solving? If the pain is observability, a mesh is overkill; an OTel-based approach with shared client libraries is cheaper. If the pain is mTLS, SPIRE plus client-side TLS can be lighter. If the pain is all of the above plus traffic shaping and authorisation, a mesh’s combined toolkit becomes worth the weight.
The second is: sidecar or sidecar-less? The traditional shape injects a proxy as a sidecar container into every pod. Each sidecar adds memory (tens of MB per pod), a hop (~1ms latency), and a failure mode (if the sidecar dies, the service does). Newer shapes use node-level agents to avoid the per-pod cost, with different trade-offs in feature parity.
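To make the per-pod versus per-node cost concrete, here is a rough back-of-the-envelope sketch. Every number in it (pod count, node count, 50 MB per sidecar, 100 MB per node agent, two proxy traversals per call) is an illustrative assumption, not a measurement from this cluster or from any benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterShape:
    pods: int   # total application pods (assumed figure)
    nodes: int  # worker nodes (assumed figure)


def sidecar_memory_mb(shape: ClusterShape, mb_per_sidecar: float) -> float:
    """Total proxy memory when every pod carries its own sidecar."""
    return shape.pods * mb_per_sidecar


def node_agent_memory_mb(shape: ClusterShape, mb_per_agent: float) -> float:
    """Total proxy memory when one agent runs per node."""
    return shape.nodes * mb_per_agent


def added_latency_ms(calls_in_chain: int, ms_per_proxy: float = 1.0) -> float:
    """Extra latency for a chain of service-to-service calls.

    Assumes each call crosses two sidecars (the caller's and the callee's),
    each costing roughly ms_per_proxy.
    """
    return calls_in_chain * 2 * ms_per_proxy


if __name__ == "__main__":
    shape = ClusterShape(pods=400, nodes=20)  # hypothetical cluster
    print("sidecars  :", sidecar_memory_mb(shape, mb_per_sidecar=50), "MB")
    print("node agent:", node_agent_memory_mb(shape, mb_per_agent=100), "MB")
    print("latency   :", added_latency_ms(calls_in_chain=4), "ms over 4 calls")
```

With these assumed inputs the sidecar shape costs 400 × 50 = 20,000 MB of proxy memory against 20 × 100 = 2,000 MB for node agents. The point is the scaling factor (pods versus nodes), not the specific figures.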
The third is: who runs the control plane? A vendor-managed control plane removes operational burden but constrains feature choice; a self-managed control plane is the opposite. Running a mesh control plane is non-trivial at scale (versions, upgrades, compatibility with the data plane), and “we’ll manage it ourselves” is a commitment, not a checkbox.
The fourth is: cross-cluster and cross-VPC. Some workloads need mesh features across clusters or across VPCs. Some mesh shapes are single-cluster by design; others handle multi-cluster federation; others sit above the cluster boundary entirely and route across compute types. Matching the topology to the workload’s reach matters.
The fifth is: the team’s appetite. A rich-feature mesh requires a dedicated platform engineer or two who know it well. A lighter mesh demands less but does less. A managed control plane shifts some burden but not all; the data plane still needs upgrades. Knowing what the team can realistically run is the constraint that should drive the pick.
What we’ll filter on
- Control-plane operation: who runs it, who upgrades it?
- Data-plane overhead: sidecar cost (memory, latency)?
- Feature breadth: retries, timeouts, mTLS, traffic shaping, authz?
- Cross-cluster / cross-VPC: does it scale beyond one cluster?
- Ecosystem maturity: docs, community, incidents survived?
The mesh landscape
- AWS App Mesh. AWS-managed control plane, Envoy-based sidecars. Integrates with Cloud Map, ACM, CloudWatch, X-Ray. Single-cluster-centric historically; declining momentum industry-wide as AWS has shifted focus.
- Istio. Most feature-rich open-source mesh. Envoy sidecars, `istiod` control plane. Steep learning curve; large operator community. Ambient mode (sidecar-less) increasingly popular.
- Linkerd. Open-source; smaller, simpler, faster than Istio. Rust-based `linkerd2-proxy` sidecar is lightweight. Fewer features; many teams find the subset sufficient.
- Cilium Service Mesh. eBPF-based; no sidecars. Plugs into the CNI layer; very low overhead. Newer ecosystem, fewer features today than Istio but rapidly evolving.
- AWS VPC Lattice. Service-to-service connectivity layer; works across VPCs, across accounts, across compute types (EKS, ECS, EC2, Lambda). IAM-auth-based authorisation; target groups per service. Not strictly a mesh but overlaps with mesh use cases, especially cross-cluster.
- No mesh, client libraries. Each service imports a shared library that implements retries, circuit breakers, mTLS, metrics. Works if every service is in the same language and the platform team can maintain the library.
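For a sense of what the shared-library option involves, here is a minimal sketch of one piece of it: retries with exponential backoff, jitter, and an overall deadline. The names `call_with_retries` and `RetryableError` and all default values are hypothetical, chosen for illustration; a real library would also need circuit breaking, metrics, and an equivalent in every language the services use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(
    fn: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    deadline_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call fn, retrying RetryableError with jittered exponential backoff.

    Stops when max_attempts is reached or when the next wait would push
    the total elapsed time past deadline_s; the last error is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts:
                raise
            # Full jitter: wait a random time up to base * 2^(attempt - 1).
            delay = random.uniform(0, base_delay_s * 2 ** (attempt - 1))
            if clock() - start + delay > deadline_s:
                raise
            sleep(delay)
    raise AssertionError("unreachable")
```

The deadline matters as much as the attempt count: unbounded retries stacked across several hops are one way the cascading failures described earlier arise. Injecting `sleep` and `clock` keeps the function testable without real waiting.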
Side by side
| Option | Control plane | Data plane overhead | Features | Cross-cluster | Ecosystem |
|---|---|---|---|---|---|
| App Mesh | AWS-managed | Envoy sidecar | Retries, mTLS, split, metrics | Limited | Declining |
| Istio | Self-managed (istiod) | Envoy sidecar (or ambient) | Full | Multi-cluster | Very large |
| Linkerd | Self-managed | Rust proxy (lightweight) | Subset | Multi-cluster | Medium |
| Cilium Mesh | Self-managed (cilium-agent) | eBPF, no sidecar | Subset | Limited | Growing |
| VPC Lattice | AWS-managed | No sidecar | Routing, IAM auth, TLS | Native | Newer |
| No mesh / libs | None | Per-service code | As implemented | Via libs | Per-library |
Reading by use case:
- All-Kubernetes, need feature breadth, team ready to invest: Istio (ambient mode to avoid sidecar cost).
- All-Kubernetes, want simplicity: Linkerd.
- Mixed compute (EKS + ECS + Lambda), cross-VPC: VPC Lattice at the top, internal cluster mesh under it if needed.
- Already on App Mesh, stable: keep it running but don’t expand into new use cases given AWS’s shifting focus.
- Few services, small pain: libraries, not a mesh.
For this cluster (40 services, mixed languages, cross-VPC requirements emerging), the recommendation is VPC Lattice for cross-VPC/cross-compute traffic and Istio ambient for the EKS-internal mesh features. The rest of the post focuses on that combination.
The layered architecture
The picks in depth
VPC Lattice as the cross-cluster mesh. A Lattice service network in the platform account, shared to workload accounts via RAM. Each service registers as a Lattice service; target groups point at EKS, ECS, Lambda, or EC2 resources. Clients of a service call it through the Lattice DNS name; Lattice routes the request to the right target group with IAM auth applied.
Key benefits:
- Cross-VPC without peering. Two EKS clusters in two VPCs talk via Lattice with no peering or TGW attachment for the mesh itself.
- Cross-compute-type. Lambda, ECS, EKS, EC2 all plug in as target types.
- IAM auth on every call. The caller’s IAM role must have `vpc-lattice-svcs:Invoke` on the target service; the service’s auth policy defines which principals are allowed.
- No sidecars. Lattice is transparent to the workload code (modulo DNS-name changes).
Trade-offs:
- Lattice is a managed hop in the middle. It adds roughly 1-5 ms of latency.
- Feature depth is narrower than Istio: no complex retry policies, less sophisticated traffic splitting.
Istio ambient mode inside each cluster. In each EKS cluster, Istio with ambient mode enabled. Ambient replaces per-pod sidecars with a per-node ztunnel DaemonSet that handles L4 mTLS and observability. For services that need L7 features (retries, header-based routing), an optional waypoint proxy runs per service.
Key benefits:
- No sidecar injection. Pods run unchanged; mesh features apply via the node-level proxy.
- Lower resource cost. One ztunnel per node vs one sidecar per pod.
- Feature breadth. Full Istio feature set (retries, timeouts, circuit breakers, traffic splitting) available, with the waypoint opt-in for L7.
Trade-offs:
- Ambient is newer than sidecar mode; less community experience with production incidents.
- The istiod control plane still needs to be operated and upgraded.
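As an illustration of the ambient opt-in described above, the sketch below builds the two Kubernetes objects involved: a namespace labelled for ambient mode and a waypoint for L7 features. The label keys and the Gateway fields follow the Istio ambient documentation as best I recall it; verify them against the Istio version you deploy. The namespace `payments` and the waypoint name `waypoint` are hypothetical.

```python
import json
from typing import Optional


def ambient_namespace(name: str, waypoint: Optional[str] = None) -> dict:
    """Namespace manifest that enrols its pods in ambient mode (L4 via ztunnel)."""
    labels = {"istio.io/dataplane-mode": "ambient"}
    if waypoint is not None:
        # Route traffic for services in this namespace through the named waypoint.
        labels["istio.io/use-waypoint"] = waypoint
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": labels},
    }


def waypoint_gateway(name: str, namespace: str) -> dict:
    """Gateway API manifest for a waypoint proxy handling L7 for services."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"istio.io/waypoint-for": "service"},
        },
        "spec": {
            "gatewayClassName": "istio-waypoint",
            "listeners": [{"name": "mesh", "port": 15008, "protocol": "HBONE"}],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be applied as-is.
    for manifest in (
        ambient_namespace("payments", waypoint="waypoint"),
        waypoint_gateway("waypoint", "payments"),
    ):
        print(json.dumps(manifest, indent=2))
```

Namespaces that only need mTLS and L4 telemetry can omit the waypoint entirely, which is where the resource saving over sidecars comes from.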
The integration. In-cluster traffic uses Istio; cross-cluster traffic uses Lattice. A service like payments-api in cluster A is:
- Reachable from sibling services in cluster A via Kubernetes DNS + Istio (`payments-api.payments.svc.cluster.local`).
- Reachable from services in cluster B (or ECS, Lambda) via Lattice DNS (`payments-api.payments.service.lattice...`).
The service code handles both: the DNS name and the auth shape differ, but both are just HTTP calls.
Policy unification. For authorisation, Lattice uses IAM; Istio uses its own AuthorizationPolicy CRDs. The policies don’t merge natively, but they can be generated from the same source. A custom operator reads a single ServiceAuthorization CRD and emits both an Istio policy and a Lattice auth policy. The team writes “who can call payments” once; enforcement happens in both meshes.
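The core of such an operator is a pure translation step, sketched below. The shape of `ServiceAuthorization` (its fields, and the `Caller` record pairing a Kubernetes service account with an IAM role) is an assumption made for illustration, since the CRD’s schema is not specified here; the `app` label selector, the policy name, and the account ID in the usage example are likewise hypothetical. The output shapes follow the Istio AuthorizationPolicy API and the IAM-style Lattice auth policy format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Caller:
    namespace: str        # caller's Kubernetes namespace
    service_account: str  # caller's Kubernetes service account
    iam_role_arn: str     # IAM role the caller assumes via IRSA


@dataclass(frozen=True)
class ServiceAuthorization:
    service: str          # callee service name
    namespace: str        # callee namespace
    callers: Tuple[Caller, ...]


def to_istio_policy(spec: ServiceAuthorization) -> dict:
    """In-cluster leg: allow callers by their mTLS (service account) identity."""
    principals = [
        f"cluster.local/ns/{c.namespace}/sa/{c.service_account}"
        for c in spec.callers
    ]
    return {
        "apiVersion": "security.istio.io/v1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"{spec.service}-callers", "namespace": spec.namespace},
        "spec": {
            # Assumes pods carry an `app` label equal to the service name.
            "selector": {"matchLabels": {"app": spec.service}},
            "action": "ALLOW",
            "rules": [{"from": [{"source": {"principals": principals}}]}],
        },
    }


def to_lattice_auth_policy(spec: ServiceAuthorization) -> dict:
    """Cross-cluster leg: allow callers by their IAM role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": [c.iam_role_arn for c in spec.callers]},
                "Action": "vpc-lattice-svcs:Invoke",
                "Resource": "*",
            }
        ],
    }
```

For example, a spec naming one caller (`orders` in namespace `orders`, with a hypothetical role `arn:aws:iam::111122223333:role/orders-role`) yields an Istio rule for `cluster.local/ns/orders/sa/orders` and a Lattice statement for that role ARN. Keeping the translation free of side effects makes it easy to unit-test separately from the operator’s reconcile loop.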
Metrics unification. Istio emits standard metrics to Prometheus; Lattice emits to CloudWatch. A Prometheus scraper in each cluster pulls both (via the CloudWatch exporter for Lattice), producing a unified view in Grafana. The RED metrics (rate, errors, duration) per caller-callee pair span both meshes.
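The RED triplet per caller-callee pair is simple to compute once records from both meshes share one shape. The sketch below assumes a hypothetical `RequestRecord` with caller, callee, status, and duration; in practice these values would come from Prometheus queries rather than raw records, and treating only 5xx responses as errors is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class RequestRecord:
    caller: str
    callee: str
    status: int          # HTTP status code
    duration_ms: float


def red_by_pair(
    records: Iterable[RequestRecord], window_s: float
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Rate, error ratio, and median duration for each (caller, callee) pair."""
    grouped: Dict[Tuple[str, str], List[RequestRecord]] = defaultdict(list)
    for record in records:
        grouped[(record.caller, record.callee)].append(record)

    result: Dict[Tuple[str, str], Dict[str, float]] = {}
    for pair, rows in grouped.items():
        errors = sum(1 for r in rows if r.status >= 500)  # 5xx counted as errors
        result[pair] = {
            "rate_rps": len(rows) / window_s,
            "error_ratio": errors / len(rows),
            "p50_ms": median(r.duration_ms for r in rows),
        }
    return result
```

Keying on the pair rather than on the callee alone is what answers “which caller is seeing failures”, the dashboard the team lacked at the outset.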
mTLS. Istio ambient provides mTLS in-cluster automatically. Lattice provides TLS (not mutual by default, but with IAM auth as the equivalent of identity). A call from a cluster-A pod to a cluster-B pod is therefore authenticated on every leg rather than carried over one end-to-end mTLS session: the Lattice hop uses IAM-signed requests; the in-cluster hops use Istio mTLS. The identity is different (IAM vs Kubernetes service account) but both legs are authenticated.
A worked call trace
An orders pod in cluster B calls payments-api in cluster A.
- Pod’s HTTP client calls `payments-api.payments.service.lattice...`.
- DNS resolves to Lattice’s VIP. The request is SigV4-signed by the pod’s IAM role (via IRSA, IAM Roles for Service Accounts).
- Lattice receives the request and evaluates the auth policy on the `payments-api` service. Policy says “allow principal `orders-role`”. Principal is `orders-role`; pass.
- Lattice forwards to the target group in cluster A, which is a Lattice target pointing at `payments-api`’s Kubernetes service.
- The request enters cluster A via a Lattice-controlled ALB or the equivalent; then through Istio’s ztunnel to the `payments-api` pod.
- In-cluster, Istio records the request metrics, applies mTLS, and enforces retries per VirtualService (at the waypoint proxy, since retries are an L7 feature).
- Response flows back the same path.
Observability: X-Ray trace spans across Istio and Lattice (both forward the X-Amzn-Trace-Id header); Prometheus metrics in both clusters record the RED triplet; Lattice’s access logs record the caller identity.
What’s worth remembering
- Decide which problems a mesh solves before choosing one. Observability alone doesn’t need a mesh. mTLS + observability + traffic shaping + authorisation is when a mesh pays.
- Ambient mode avoids sidecar overhead. For Istio, ambient is the sidecar-less option: a node-level agent for L4, with the same L7 feature set available through opt-in waypoint proxies.
- VPC Lattice is cross-VPC mesh without sidecars. Works across clusters, ECS, Lambda, EC2. Managed by AWS. Best fit for “services running on different compute types need to talk.”
- App Mesh’s momentum has declined. Existing deployments are fine; new designs should evaluate Istio/Lattice/Linkerd.
- In-cluster mesh + cross-cluster mesh is a valid pattern. Istio inside, Lattice between. Unified by emitting policy from one source and aggregating metrics centrally.
- Cost is memory + latency + operator time. Each adds up. Track it; justify it against the problems the mesh solves.
- mTLS is often the first wedge. Easy to explain, easy to measure, easy to justify. Build uniform mTLS, then add retries, then add traffic shaping.
- Service accounts + IRSA are the identity spine. Kubernetes service accounts + AWS IAM via IRSA give every workload a verifiable identity; the mesh enforces policy on that identity. Without the identity spine, mesh authorisation falls back to matching on network addresses, which rot.
One mesh under each cluster, one mesh above all of them. The forty services stop arguing; the cascading failures stop happening; the cross-VPC paths stop requiring bespoke plumbing. The mesh becomes infrastructure the team uses, not infrastructure the team notices.