
Two Years With a Service Mesh: What I'd Tell Past Me

[Image: network topology visualization]

We deployed Istio in late 2023 because we needed mTLS between services and better traffic management for canary deployments. Two years later it's still running, I understand it much better, and my opinion of it is more nuanced than either "service meshes are the future" or "service meshes are operational overhead you don't need."

Here's what I'd tell the version of myself who was reading the Istio getting-started docs and feeling cautiously optimistic.

What it actually solved

mTLS between services, without changing application code. This was the main ask and it delivered. Every service in the mesh gets a certificate managed by Istio's CA. Mutual authentication happens at the sidecar layer. Product engineers don't think about it.
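For readers who haven't seen it, the mesh-wide piece of this is a single small resource. A minimal sketch of what strict mTLS enforcement looks like (this is the generic form from Istio's security API, not our exact manifest):

    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system   # applying it in the root namespace makes it mesh-wide
    spec:
      mtls:
        mode: STRICT            # sidecars reject plaintext traffic from workloads outside the mesh

Namespace- or workload-scoped PeerAuthentication policies can override this, which is also how you carve out exceptions for services that can't join the mesh.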

Traffic splitting for deployments. We do canary deployments by sending 5% of traffic to the new version, watching error rates and latency for 30 minutes, then shifting to 50% and eventually 100%. This is done via VirtualService weight configuration and works reliably. Before Istio we did this with separate Kubernetes services and manual load balancer rules, which was fragile.
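The shape of the config is roughly this: a hypothetical checkout service split 95/5 between a stable and a canary subset (service and subset names are illustrative, not our actual manifests):

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: checkout
    spec:
      hosts:
        - checkout              # the Kubernetes service name
      http:
        - route:
            - destination:
                host: checkout
                subset: stable
              weight: 95
            - destination:
                host: checkout
                subset: canary
              weight: 5
    ---
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: checkout
    spec:
      host: checkout
      subsets:
        - name: stable
          labels:
            version: v1         # pods carrying these labels back each subset
        - name: canary
          labels:
            version: v2

Shifting from 5% to 50% to 100% is just an edit to the weights, which makes the promotion steps easy to automate.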

Distributed tracing. Istio injects trace headers and reports spans to our Jaeger instance. The caveat: services need to propagate the B3 headers themselves. Istio handles the ingress and egress spans; the in-service propagation is still application code. We had two services that didn't propagate headers correctly, creating gaps in traces that took us a while to diagnose.
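The mesh side of tracing is one resource; the header propagation is not. A rough sketch of the mesh side, assuming a Jaeger tracer has been registered as an extension provider named jaeger in the mesh config (provider name and sampling rate are placeholders, not our production values):

    apiVersion: telemetry.istio.io/v1alpha1
    kind: Telemetry
    metadata:
      name: mesh-default
      namespace: istio-system
    spec:
      tracing:
        - providers:
            - name: jaeger             # must match an extensionProvider defined in MeshConfig
          randomSamplingPercentage: 10.0
    # What the mesh cannot do for you: each service has to copy the incoming
    # B3 headers (x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled,
    # x-b3-flags) plus x-request-id onto its outbound calls, or the trace
    # breaks exactly the way ours did.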

What it made harder

Debugging network issues. Before Istio, a connection problem between two services was investigated with curl, tcpdump, and service logs. With Istio, there's a sidecar proxy on each end, an additional layer of TLS, and AuthorizationPolicy objects that may or may not be correctly configured. The surface area for "why can't service A talk to service B" is significantly larger.

The tools are istioctl analyze for configuration problems and istioctl proxy-config for inspecting runtime state. Learning them took time. In the first year, incidents where Istio was the cause typically took 2–3x longer to resolve than equivalent non-Istio network issues.

Upgrades. Istio upgrades are not trivial. The control plane and data plane (sidecars) need to be upgraded in coordination. We've done four minor version upgrades. Three went smoothly with canary control plane rollout. One caused a partial outage when a change in default AuthorizationPolicy behavior in 1.19 broke service-to-service communication for any service that had a policy with implicit deny — which was several of ours.
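The implicit-deny behavior is worth spelling out, because it's easy to write a policy like the following without realizing what it denies. A hypothetical example (names invented for illustration):

    apiVersion: security.istio.io/v1beta1
    kind: AuthorizationPolicy
    metadata:
      name: payments-allow-frontend
      namespace: payments
    spec:
      selector:
        matchLabels:
          app: payments
      action: ALLOW
      rules:
        - from:
            - source:
                principals:
                  - cluster.local/ns/frontend/sa/frontend
    # Once any ALLOW policy selects a workload, every request that doesn't
    # match a rule is rejected. Anything else calling payments (a batch job,
    # a service added later) gets a 403 until it's listed here.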

The resource overhead is real

Each sidecar proxy (Envoy) consumes memory. In our cluster that's roughly 60–80MB per pod at idle, more under load. With 120 pods, that's 7–10GB of memory used exclusively by sidecar proxies. On a cluster with tight memory budgets this matters. We disable sidecar injection for low-traffic internal services that don't need mTLS, which keeps them out of the mesh entirely and recovered about 2GB.
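The exclusion itself is just a label on the pod template; a minimal sketch (workload name and image are placeholders):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: internal-batch-worker
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: internal-batch-worker
      template:
        metadata:
          labels:
            app: internal-batch-worker
            sidecar.istio.io/inject: "false"   # skip sidecar injection for these pods
        spec:
          containers:
            - name: worker
              image: internal-batch-worker:latest

Injection happens at pod creation, so existing pods have to be restarted before the label takes effect.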

The organizational complexity

Istio introduces new resource types that infrastructure engineers understand and product engineers don't. VirtualService, DestinationRule, AuthorizationPolicy, PeerAuthentication — these require mental model investment. When something goes wrong with service communication, product engineers file a ticket to the platform team because they can't investigate it themselves. This is a support burden that didn't exist before.

We've partially addressed this by writing runbooks for the five most common Istio-related issues and giving product engineers read access to istioctl commands with documentation. It's better but not solved.

Would I deploy it again

For a team of our size (6 platform engineers, 30+ product engineers, 60+ services) that has specific requirements around mTLS and traffic management: yes. For a team with fewer than 30 services that doesn't have strict service-to-service authentication requirements: probably not yet. The operational cost doesn't distribute evenly — it concentrates on whoever owns the mesh.

