Choosing a Service Mesh on EKS

March 03, 2027 · 14 min read

The situation

An EKS cluster with 40 microservices across eight namespaces. Observability via CloudWatch + X-Ray; deployment via Helm charts in GitOps. The team has pain points:

Inconsistent retries and timeouts. Every service implements its own, differently. Cascading failures are a regular incident pattern.
mTLS piecemeal. Some services terminate TLS on their ALB, some use Istio, some don’t encrypt in-cluster at all.
Traffic shifting is hard. Canary deploys rely on replica ratios between Kubernetes Deployments, which are too coarse for gradual rollouts.
Per-service metrics. Each team instruments their own; there’s no uniform “success rate per caller-callee pair” dashboard.
East-west authorisation. Nothing stops Service A from calling Service B; authorisation is IAM on external endpoints only.

The proposal is to standardise on a mesh. Six options are on the table:

AWS App Mesh. AWS-managed control plane, Envoy sidecars, integrates with Cloud Map for service discovery.
Istio. Open-source, feature-rich, operationally heavy.
Linkerd. Open-source, lightweight, less featureful.
Cilium / eBPF mesh. An emerging alternative that avoids sidecars.
VPC Lattice. AWS’s newer service-mesh product; sits between compute resources and works cross-cluster.
No mesh, libraries in every service. The alternative: each service uses a shared library for retries, timeouts, and mTLS.

Update: since this is 2027, AWS App Mesh’s posture has evolved. AWS has been guiding users toward alternatives for some use cases. The choice is not “pick App Mesh automatically”; it’s “match the workload to the right mesh and know why.”

What actually matters

The core trade in service meshes is uniformity in exchange for complexity. A mesh forces every service to speak the same language for retries, mTLS, metrics, and traffic shaping. That uniformity is the value. The cost is a second control plane (the mesh’s) in addition to Kubernetes, plus per-service overhead (sidecar memory, latency) and operational expertise.

The first thing to ask is: which problems are we actually solving? If the pain is observability, a mesh is overkill; an OTel-based approach with shared client libraries is cheaper. If the pain is mTLS, SPIRE plus client-side TLS can be lighter. If the pain is all of the above plus traffic shaping and authorisation, a mesh’s combined toolkit becomes worth the weight.

The second is: sidecar or sidecar-less? The traditional shape injects a proxy as a sidecar container into every pod. Each sidecar adds memory (tens of MB per pod), a hop (~1ms latency), and a failure mode (if the sidecar dies, the service does). Newer shapes use node-level agents to avoid the per-pod cost, with different trade-offs in feature parity.
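To make the per-pod versus per-node cost concrete, here is a back-of-the-envelope sketch. The replica count, node count, and per-proxy memory figures are illustrative assumptions, not measurements from this cluster; substitute real numbers before drawing conclusions.

```python
def proxy_memory_mb(instances: int, mb_per_proxy: float) -> float:
    """Total proxy memory for a given number of proxy instances."""
    return instances * mb_per_proxy


# Assumed figures (hypothetical): 40 services x 3 replicas each, 10 nodes,
# 50 MB per sidecar, 100 MB per node-level agent.
pods = 40 * 3
nodes = 10

sidecar_total = proxy_memory_mb(pods, 50)
node_agent_total = proxy_memory_mb(nodes, 100)

print(f"sidecars:    {sidecar_total:.0f} MB across {pods} pods")
print(f"node agents: {node_agent_total:.0f} MB across {nodes} nodes")
```

The point is the scaling behaviour rather than the absolute numbers: sidecar cost grows with pod count, node-agent cost grows with node count.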

The third is: who runs the control plane? A vendor-managed control plane removes operational burden but constrains feature choice; a self-managed control plane is the opposite. Running a mesh control plane is non-trivial at scale (versions, upgrades, compatibility with the data plane), and “we’ll manage it ourselves” is a commitment, not a checkbox.

The fourth is: cross-cluster and cross-VPC. Some workloads need mesh features across clusters or across VPCs. Some mesh shapes are single-cluster by design; others handle multi-cluster federation; others sit above the cluster boundary entirely and route across compute types. Matching the topology to the workload’s reach matters.

The fifth is: the team’s appetite. A rich-feature mesh requires a dedicated platform engineer or two who know it well. A lighter mesh demands less but does less. A managed control plane shifts some burden but not all; the data plane still needs upgrades. Knowing what the team can realistically run is the constraint that should drive the pick.

What we’ll filter on

Control-plane operation: who runs it, who upgrades it?
Data-plane overhead: what is the sidecar cost (memory, latency)?
Feature breadth: retries, timeouts, mTLS, traffic shaping, authz?
Cross-cluster / cross-VPC: does it scale beyond one cluster?
Ecosystem maturity: docs, community, incidents survived?

The mesh landscape

AWS App Mesh. AWS-managed control plane, Envoy-based sidecars. Integrates with Cloud Map, ACM, CloudWatch, X-Ray. Single-cluster-centric historically; declining momentum industry-wide as AWS has shifted focus.
Istio. Most feature-rich open-source mesh. Envoy sidecars, istiod control plane. Steep learning curve; large operator community. Ambient mode (sidecar-less) increasingly popular.
Linkerd. Open-source; smaller, simpler, faster than Istio. Rust-based linkerd2-proxy sidecar is lightweight. Fewer features; many teams find the subset sufficient.
Cilium Service Mesh. eBPF-based; no sidecars. Plugs into the CNI layer; very low overhead. Newer ecosystem, fewer features today than Istio but rapidly evolving.
AWS VPC Lattice. Service-to-service connectivity layer; works across VPCs, across accounts, across compute types (EKS, ECS, EC2, Lambda). IAM-auth-based authorisation; target groups per service. Not strictly a mesh but overlaps with mesh use cases, especially cross-cluster.
No mesh, client libraries. Each service imports a shared library that implements retries, circuit breakers, mTLS, metrics. Works if every service is in the same language and the platform team can maintain the library.

Side by side

Option	Control plane	Data plane overhead	Features	Cross-cluster	Ecosystem
App Mesh	AWS-managed	Envoy sidecar	Retries, mTLS, split, metrics	Limited	Declining
Istio	Self-managed (istiod)	Envoy sidecar (or ambient)	Full	Multi-cluster	Very large
Linkerd	Self-managed	Rust proxy (lightweight)	Subset	Multi-cluster	Medium
Cilium Mesh	Self-managed (cilium-agent)	eBPF, no sidecar	Subset	Limited	Growing
VPC Lattice	AWS-managed	No sidecar	Routing, IAM auth, TLS	✓ (native)	Newer
No mesh / libs	None	Per-service code	As implemented	Via libs	Per-library

Reading by use case:

All-Kubernetes, need feature breadth, team ready to invest: Istio (ambient mode to avoid sidecar cost).
All-Kubernetes, want simplicity: Linkerd.
Mixed compute (EKS + ECS + Lambda), cross-VPC: VPC Lattice at the top, internal cluster mesh under it if needed.
Already on App Mesh, stable: keep it running but don’t expand into new use cases given AWS’s shifting focus.
Few services, small pain: libraries, not a mesh.
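The use-case reading above can be written down as a small decision function. This is only a sketch: the `Workload` fields and the order in which the rules are checked are assumptions made for illustration, and a real decision would weigh the filter criteria rather than short-circuit on the first match.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    all_kubernetes: bool
    cross_vpc_or_mixed_compute: bool
    needs_feature_breadth: bool
    already_on_app_mesh: bool = False
    few_services: bool = False


def recommend(w: Workload) -> str:
    """Map a workload profile to the option suggested in the list above."""
    # Rule precedence is an assumption, not something the list prescribes.
    if w.few_services:
        return "shared libraries, no mesh"
    if w.already_on_app_mesh:
        return "keep App Mesh running, do not expand it"
    if w.cross_vpc_or_mixed_compute:
        return "VPC Lattice on top, in-cluster mesh underneath if needed"
    if w.all_kubernetes and w.needs_feature_breadth:
        return "Istio (ambient mode)"
    if w.all_kubernetes:
        return "Linkerd"
    return "no clear fit, revisit the requirements"


print(recommend(Workload(all_kubernetes=True,
                         cross_vpc_or_mixed_compute=False,
                         needs_feature_breadth=False)))
```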

For this cluster (40 services, mixed languages, cross-VPC requirements emerging), the recommendation is VPC Lattice for cross-VPC/cross-compute traffic, plus Istio ambient for the EKS-internal mesh features. The rest of the post focuses on that combination.

The layered architecture

Istio ambient inside each cluster for in-cluster mTLS, retries, and observability; VPC Lattice above the clusters for cross-VPC, cross-compute connectivity with IAM auth.

The picks in depth

VPC Lattice as the cross-cluster mesh. A Lattice service network in the platform account, shared to workload accounts via RAM. Each service registers as a Lattice service; target groups point at EKS, ECS, Lambda, or EC2 resources. Clients of a service call it through the Lattice DNS name; Lattice routes the request to the right target group with IAM auth applied.

Key benefits:

Cross-VPC without peering. Two EKS clusters in two VPCs talk via Lattice with no peering or TGW attachment for the mesh itself.
Cross-compute-type. Lambda, ECS, EKS, EC2 all plug in as target types.
IAM auth on every call. The caller’s IAM role must have vpc-lattice-svcs:Invoke on the target service; the service’s auth policy defines which principals are allowed.
No sidecars. Lattice is transparent to the workload code (modulo DNS-name changes).
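As an illustration of the IAM-auth point, the following sketch builds an auth policy document that allows a set of caller roles to invoke a service. The account ID and role name are hypothetical placeholders, and `"Resource": "*"` is a simplifying assumption; a production policy would likely be scoped more tightly and may add conditions.

```python
import json


def lattice_auth_policy(allowed_role_arns: list) -> dict:
    """Build an IAM-style auth policy allowing the given roles to invoke."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": allowed_role_arns},
                "Action": "vpc-lattice-svcs:Invoke",
                # Assumption: the policy is attached to one service, so "*"
                # here means "any path on this service".
                "Resource": "*",
            }
        ],
    }


# Hypothetical caller role; replace with the real role ARN.
policy = lattice_auth_policy(["arn:aws:iam::111122223333:role/orders-role"])
print(json.dumps(policy, indent=2))
```

The caller side needs the mirror image: an identity policy on `orders-role` that grants the same action on the target service.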

Trade-offs:

Lattice is a managed hop in the middle. Expect roughly 1-5 ms of added latency per call.
Feature depth is narrower than Istio: no complex retry policies, less sophisticated traffic splitting.

Istio ambient mode inside each cluster. In each EKS cluster, Istio with ambient mode enabled. Ambient replaces per-pod sidecars with a per-node ztunnel DaemonSet that handles L4 mTLS and observability. For services that need L7 features (retries, header-based routing), an optional waypoint proxy runs per service.

Key benefits:

No sidecar injection. Pods run unchanged; mesh features apply via the node-level proxy.
Lower resource cost. One ztunnel per node vs one sidecar per pod.
Feature breadth. Full Istio feature set (retries, timeouts, circuit breakers, traffic splitting) available, with the waypoint opt-in for L7.

Trade-offs:

Ambient is newer than sidecar mode; less community experience with production incidents.
The istiod control plane still needs to be operated and upgraded.
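Enrolling workloads in ambient mode is done with namespace labels rather than sidecar injection. The sketch below generates a Namespace manifest as JSON (which kubectl accepts). The namespace and waypoint names are hypothetical, the waypoint itself still has to be deployed separately, and the label names should be checked against the Istio version in use.

```python
import json
from typing import Optional


def ambient_namespace(name: str, waypoint: Optional[str] = None) -> dict:
    """Build a Namespace manifest that enrolls its pods in ambient mode."""
    # L4 (ztunnel) enrollment: mTLS and basic telemetry, no sidecars.
    labels = {"istio.io/dataplane-mode": "ambient"}
    if waypoint is not None:
        # Opt in to L7 features by routing traffic through a named waypoint.
        labels["istio.io/use-waypoint"] = waypoint
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": labels},
    }


l4_only = ambient_namespace("catalog")
with_l7 = ambient_namespace("payments", waypoint="waypoint")
print(json.dumps(with_l7, indent=2))
```

Keeping the waypoint opt-in per namespace means only the services that need retries or header-based routing pay for an L7 proxy.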

The integration. In-cluster traffic uses Istio; cross-cluster traffic uses Lattice. A service like payments-api in cluster A is:

Reachable from sibling services in cluster A via Kubernetes DNS + Istio (payments-api.payments.svc.cluster.local).
Reachable from services in cluster B (or ECS, Lambda) via Lattice DNS (payments-api.payments.service.lattice...).

The service code handles both: the DNS name and the auth shape differ, but both are just HTTP calls.
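A minimal sketch of how a client might pick between the two names. The cluster names, the `lattice_hosts` mapping, and the example hostname are hypothetical; the real Lattice DNS name comes from the service's Lattice configuration. The URL schemes are also assumptions (plain HTTP in-cluster with the mesh adding mTLS underneath, HTTPS towards Lattice), and request signing for the Lattice path is not shown.

```python
def base_url(service: str, namespace: str,
             caller_cluster: str, callee_cluster: str,
             lattice_hosts: dict) -> str:
    """Return the base URL a caller should use to reach a service."""
    if caller_cluster == callee_cluster:
        # Same cluster: Kubernetes DNS; the in-cluster mesh handles mTLS.
        return f"http://{service}.{namespace}.svc.cluster.local"
    # Different cluster or compute type: go through the Lattice DNS name.
    return f"https://{lattice_hosts[service]}"


# Hypothetical mapping, normally injected via configuration.
lattice_hosts = {"payments-api": "payments-api.lattice.example.invalid"}

print(base_url("payments-api", "payments", "cluster-a", "cluster-a", lattice_hosts))
print(base_url("payments-api", "payments", "cluster-b", "cluster-a", lattice_hosts))
```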

Policy unification. For authorisation, Lattice uses IAM; Istio uses its own AuthorizationPolicy CRDs. The policies don’t merge natively, but they can be generated from the same source. A custom operator reads a single ServiceAuthorization CRD and emits both an Istio policy and a Lattice auth policy. The team writes “who can call payments” once; enforcement happens in both meshes.
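The generation step could be sketched roughly as follows. The shape of `ServiceAuthorization` is hypothetical (the post does not define the CRD's fields), as are the `app` label selector, the `cluster.local` trust domain, and the example account ID. For simplicity every caller is emitted into both policies; a real operator would probably split in-cluster and cross-cluster callers.

```python
from dataclasses import dataclass


@dataclass
class Caller:
    namespace: str
    service_account: str
    iam_role_arn: str


@dataclass
class ServiceAuthorization:  # hypothetical single source of truth
    service: str
    namespace: str
    callers: list


def to_istio_policy(auth: ServiceAuthorization) -> dict:
    """Emit an Istio AuthorizationPolicy allowing the listed callers."""
    principals = [
        f"cluster.local/ns/{c.namespace}/sa/{c.service_account}"
        for c in auth.callers
    ]
    return {
        "apiVersion": "security.istio.io/v1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"{auth.service}-allow", "namespace": auth.namespace},
        "spec": {
            # Assumption: workloads carry an app=<service> label.
            "selector": {"matchLabels": {"app": auth.service}},
            "action": "ALLOW",
            "rules": [{"from": [{"source": {"principals": principals}}]}],
        },
    }


def to_lattice_policy(auth: ServiceAuthorization) -> dict:
    """Emit a Lattice auth policy allowing the same callers by IAM role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": [c.iam_role_arn for c in auth.callers]},
            "Action": "vpc-lattice-svcs:Invoke",
            "Resource": "*",
        }],
    }


auth = ServiceAuthorization(
    service="payments-api",
    namespace="payments",
    callers=[Caller("orders", "orders", "arn:aws:iam::111122223333:role/orders-role")],
)
istio_policy = to_istio_policy(auth)
lattice_policy = to_lattice_policy(auth)
```

Both outputs derive from the same `callers` list, which is what keeps the two enforcement points from drifting apart.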

Metrics unification. Istio emits standard metrics to Prometheus; Lattice emits to CloudWatch. A Prometheus scraper in each cluster pulls both (via the CloudWatch exporter for Lattice), producing a unified view in Grafana. The RED metrics (rate, errors, duration) per caller-callee pair span both meshes.
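Once both sources land in one place, the per-pair RED view is a simple aggregation. The sketch below assumes the samples have already been normalised into `(caller, callee, requests, errors, total_duration_ms)` tuples; that tuple shape and the sample values are made up for illustration, and de-duplicating a request that is observed by both meshes is left out.

```python
from collections import defaultdict


def red_by_pair(samples, window_seconds):
    """Aggregate per-sample counters into RED metrics per caller-callee pair."""
    totals = defaultdict(lambda: [0, 0, 0.0])  # requests, errors, total ms
    for caller, callee, requests, errors, total_ms in samples:
        t = totals[(caller, callee)]
        t[0] += requests
        t[1] += errors
        t[2] += total_ms
    return {
        pair: {
            "rate_per_s": reqs / window_seconds,
            "error_ratio": errs / reqs if reqs else 0.0,
            "avg_duration_ms": ms / reqs if reqs else 0.0,
        }
        for pair, (reqs, errs, ms) in totals.items()
    }


# Hypothetical samples for one 60-second window.
samples = [
    ("cart", "payments-api", 600, 6, 12000.0),   # in-cluster, from Istio
    ("cart", "payments-api", 300, 3, 9000.0),    # same pair, later scrape
    ("orders", "payments-api", 120, 0, 3600.0),  # cross-cluster, from Lattice
]
red = red_by_pair(samples, window_seconds=60)
for pair, metrics in red.items():
    print(pair, metrics)
```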

mTLS. Istio ambient provides mTLS in-cluster automatically. Lattice provides TLS (not mutual by default, but with IAM auth as the equivalent of identity). A call from a cluster-A pod to a cluster-B pod is therefore not one end-to-end mTLS session: the Lattice hop uses IAM-signed requests over TLS, and the in-cluster hops use Istio mTLS. The identity is different (IAM vs Kubernetes service account) but both legs are authenticated.

A worked call trace

An orders pod in cluster B calls payments-api in cluster A.

Pod’s HTTP client calls payments-api.payments.service.lattice....
DNS resolves to Lattice’s VIP. The client signs the request with SigV4 using credentials from the pod’s IAM role (via IRSA, IAM Roles for Service Accounts).
Lattice receives the request and evaluates the auth policy on the payments-api service. Policy says “allow principal orders-role”. Principal is orders-role; pass.
Lattice forwards to the target group in cluster A, which is a Lattice target pointing at payments-api’s Kubernetes service.
The request enters cluster A via a Lattice-controlled ALB or the equivalent; then through Istio’s ztunnel to the payments-api pod.
In-cluster, Istio records the request metrics, applies mTLS, and enforces retries per VirtualService (retries are an L7 feature, so they apply at the waypoint).
Response flows back the same path.

Observability: X-Ray trace spans across Istio and Lattice (both forward the X-Amzn-Trace-Id header); Prometheus metrics in both clusters record the RED triplet; Lattice’s access logs record the caller identity.
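For the trace to stitch together, each service on the path also has to carry the header from its inbound request onto its outbound calls. A minimal, framework-agnostic sketch of that propagation follows; the helper names are made up for illustration and the sample header value follows the documented X-Ray header format.

```python
TRACE_HEADER = "X-Amzn-Trace-Id"


def parse_trace_header(value: str) -> dict:
    """Split 'Root=...;Parent=...;Sampled=...' into a dict of fields."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    return fields


def outbound_headers(inbound: dict) -> dict:
    """Copy the trace header (case-insensitively) onto an outbound call."""
    for name, value in inbound.items():
        if name.lower() == TRACE_HEADER.lower():
            return {TRACE_HEADER: value}
    return {}


inbound = {
    "x-amzn-trace-id": "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1",
}
print(parse_trace_header(inbound["x-amzn-trace-id"]))
print(outbound_headers(inbound))
```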

What’s worth remembering

Decide which problems a mesh solves before choosing one. Observability alone doesn’t need a mesh. mTLS + observability + traffic shaping + authorisation is when a mesh pays.
Ambient mode avoids sidecar overhead. For Istio, ambient is the sidecar-less option: a node-level agent for L4, with opt-in waypoint proxies for the L7 feature set.
VPC Lattice is cross-VPC mesh without sidecars. Works across clusters, ECS, Lambda, EC2. Managed by AWS. Best fit for “services running on different compute types need to talk.”
App Mesh’s momentum has declined. Existing deployments fine; new designs should evaluate Istio/Lattice/Linkerd.
In-cluster mesh + cross-cluster mesh is a valid pattern. Istio inside, Lattice between. Unified by emitting policy from one source and aggregating metrics centrally.
Cost is memory + latency + operator time. Each adds up. Track it; justify it against the problems the mesh solves.
mTLS is often the first wedge. Easy to explain, easy to measure, easy to justify. Build uniform mTLS, then add retries, then add traffic shaping.
Service accounts + IRSA are the identity spine. Kubernetes service accounts + AWS IAM via IRSA give every workload a verifiable identity; the mesh enforces policy on that identity. Without the identity spine, mesh authorisation falls back to network addresses, which rot.

One mesh under each cluster, one mesh above all of them. The forty services stop arguing; the cascading failures stop happening; the cross-VPC paths stop requiring bespoke plumbing. The mesh becomes infrastructure the team uses, not infrastructure the team notices.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.