Module 5.2: Service Mesh
Complexity: [COMPLEX]
Time to Complete: 60 minutes
Prerequisites
Before starting this module, you should have completed:
- Module 5.1: Cilium
- Kubernetes Services and Ingress
- Basic understanding of proxies and TLS
What You’ll Be Able to Do
After completing this module, you will be able to:
- Deploy Istio or Linkerd service mesh with mTLS, traffic management, and observability features
- Configure traffic splitting, circuit breaking, and retry policies for resilient microservice communication
- Implement service mesh authorization policies for zero-trust inter-service security
- Compare Istio, Linkerd, and Cilium service mesh approaches for different operational complexity levels
Why This Module Matters
The security auditor’s report landed on the CTO’s desk like a bomb: “Service-to-service communication is completely unencrypted.”
When the mid-size fintech company underwent their SOC 2 audit, they assumed their Kubernetes cluster was secure. They had firewalls. They had ingress TLS. But inside the cluster? Every service talked to every other service over plaintext HTTP. Credit card data, authentication tokens, personal information—all flowing unencrypted between pods.
The auditor’s words still echoed: “An attacker who gains access to any pod can intercept traffic from every other service. You’ve built a castle with no walls inside.”
The team had 90 days to implement mutual TLS across 47 microservices. Modifying each service to handle certificates would take months. Service mesh became their only option.
This module teaches you when service mesh is the right answer, how to implement Istio effectively, and—critically—when the complexity isn’t worth it. Because the biggest mistake isn’t avoiding service mesh when you need it. It’s adopting it when you don’t.
War Story: The 200ms That Killed Black Friday
Characters:
- Jennifer: Platform Architect (6 years experience)
- Team: 8 engineers running 120 microservices
- Stack: E-commerce platform, $50M daily revenue
The Incident:
The company had adopted Istio six months before Black Friday. Everything worked in staging. Then traffic spiked.
Timeline:
```
November 23rd - Black Friday Prep

09:00 AM: Traffic starts climbing (3x normal)
          Latency: Normal (~50ms p99)

10:30 AM: Traffic at 5x normal
          Latency: 85ms p99
          "A bit slow, but acceptable"

11:00 AM: Traffic at 8x normal
          Latency: 180ms p99
          First customer complaints

11:30 AM: Envoy sidecars start OOMing
          Pods restarting across the cluster
          Latency: 500ms+ p99

11:45 AM: Checkout service cascading failures
          "All checkout pods show OOMKilled"
          Revenue loss: $2,000/minute

12:00 PM: Circuit breakers trip everywhere
          Services can't communicate
          Revenue loss: $8,000/minute

12:15 PM: Jennifer: "Kill the sidecars"
          Team hesitates: "We'll lose mTLS"
          Jennifer: "We're losing $480K/hour"

12:20 PM: Emergency: Disable sidecar injection
          Pods restart without Envoy

12:45 PM: Services recovering
          Latency dropping
          Revenue resuming

1:30 PM:  Full recovery
          Lost revenue: $340,000
          Post-mortem begins immediately

Root Cause Analysis:
───────────────────────────────────────────────────
1. Envoy sidecars had default memory limits (128MB)
2. Under high traffic, Envoy needed 400MB+
3. No load testing with sidecars at Black Friday scale
4. Checkout service made 47 downstream calls per request
5. Each sidecar added ~2ms latency
6. 47 calls × 2ms = 94ms overhead PER REQUEST
7. At 8x traffic, memory exhaustion + latency spiral
```
What They Fixed:
```yaml
# Before: Default sidecar resources
# Envoy would OOM at ~1000 req/s

# After: Right-sized sidecar resources (pod template metadata)
metadata:
  annotations:
    sidecar.istio.io/proxyMemory: "512Mi"
    sidecar.istio.io/proxyMemoryLimit: "1Gi"
    sidecar.istio.io/proxyCPU: "100m"
    sidecar.istio.io/proxyCPULimit: "2000m"
---
# Excluded high-fanout services from mesh
metadata:
  annotations:
    sidecar.istio.io/inject: "false"  # Checkout calls 47 services
```
Lessons Learned:
- Load test with sidecars at 10x expected traffic
- High-fanout services multiply mesh latency
- Default sidecar resources are too small for production
- Have a kill switch to disable mesh in emergencies
- Some services should never be meshed
Financial Impact:
- Direct revenue loss: $340,000
- Emergency consulting: $45,000
- Re-architecture costs: $120,000
- Total: $505,000
Service Mesh Architecture
The Problem Service Mesh Solves
```
WITHOUT SERVICE MESH: CHAOS
─────────────────────────────────────────────────────────────

Every service implements (inconsistently):

  Service A (Java)                Service B (Go)
  ├── Spring Cloud Netflix        ├── Go kit
  ├── Hystrix (circuit)           ├── Custom retries
  ├── Ribbon (LB)                 ├── gRPC balancing
  ├── mTLS (maybe)                ├── Plain HTTP (oops)
  └── OpenTracing                 └── Custom logging

  Service C (Python)              Service D (Node.js)
  ├── No circuit breaker          ├── Different retry lib
  ├── No load balancing           ├── No mTLS
  ├── requests (no retry)         ├── axios with timeout
  └── No tracing                  └── Different tracing format

Result: 4 languages, 4 patterns, 0 consistency
Security audit: "This is a compliance nightmare"
```

```
WITH SERVICE MESH: CONSISTENCY
─────────────────────────────────────────────────────────────

  Service A (Java)              Service B (Go)
  └── Business logic            └── Business logic
          ↓                             ↓
  ┌─────────────────┐           ┌─────────────────┐
  │ Envoy Sidecar   │           │ Envoy Sidecar   │
  │ ├── mTLS        │           │ ├── mTLS        │
  │ ├── Retries     │           │ ├── Retries     │
  │ ├── Circuit     │           │ ├── Circuit     │
  │ ├── LB          │           │ ├── LB          │
  │ └── Tracing     │           │ └── Tracing     │
  └─────────────────┘           └─────────────────┘

Result: Apps only do business logic. Mesh handles networking.
Security audit: "Uniform mTLS everywhere. Approved."
```
Data Plane vs Control Plane
```
SERVICE MESH ARCHITECTURE
─────────────────────────────────────────────────────────────

CONTROL PLANE
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐    │
│  │ Config        │  │ Service       │  │ Certificate   │    │
│  │ Store         │  │ Discovery     │  │ Authority     │    │
│  │               │  │               │  │               │    │
│  │ "What rules   │  │ "Where are    │  │ "Here's your  │    │
│  │  to apply"    │  │  services?"   │  │  certificate" │    │
│  └───────────────┘  └───────────────┘  └───────────────┘    │
│                                                             │
│  THE BRAIN: Makes decisions, doesn't touch traffic          │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           │ xDS Protocol
                           │ (Config distribution)
                           ▼
DATA PLANE
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│  │ Envoy   │◀──▶│ Envoy   │◀──▶│ Envoy   │◀──▶│ Envoy   │   │
│  │ Sidecar │    │ Sidecar │    │ Sidecar │    │ Sidecar │   │
│  └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘   │
│       │              │              │              │        │
│  ┌────▼────┐    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐   │
│  │ App A   │    │ App B   │    │ App C   │    │ App D   │   │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘   │
│                                                             │
│  THE MUSCLE: Actually routes traffic, enforces policies     │
└─────────────────────────────────────────────────────────────┘

Analogy: Control plane = Air traffic control tower
         Data plane    = The actual airplanes
```
Service Mesh Options Compared
The Landscape
| Feature | Istio | Linkerd | Cilium Mesh | Consul Connect |
|---|---|---|---|---|
| Proxy | Envoy | linkerd2-proxy (Rust) | eBPF (no sidecar) | Envoy |
| Memory/pod | 50-100MB | ~10MB | ~0 (kernel) | 50-100MB |
| Latency overhead | 2-3ms | <1ms | <0.5ms | 2-3ms |
| Complexity | High | Medium | Low | Medium |
| Features | Most complete | Core features | Growing | HashiCorp ecosystem |
| Learning curve | Steep | Moderate | Gentle | Moderate |
| Multi-cluster | Excellent | Good | Excellent | Excellent |
| CNCF Status | Graduated | Graduated | Graduated | - |
Decision Tree
```
DO YOU NEED A SERVICE MESH?
─────────────────────────────────────────────────────────────

Start Here: Do you need mTLS between ALL services?
│
├── NO ──▶ Do you need traffic management (canary, retries)?
│          │
│          ├── NO ──▶ ❌ You don't need a service mesh
│          │          Use: NetworkPolicies + Ingress
│          │
│          └── YES ──▶ Consider alternatives first:
│                      • Argo Rollouts (canary/blue-green)
│                      • Ingress retries
│                      • Client-side libraries
│
└── YES ──▶ How complex are your traffic needs?
            │
            ├── BASIC (mTLS, retries, timeouts)
            │   │
            │   ├── Want minimal overhead? ──▶ Linkerd
            │   │   (10MB/pod, <1ms latency)
            │   │
            │   ├── Already using Cilium? ──▶ Cilium Mesh
            │   │   (No sidecars, kernel-level)
            │   │
            │   └── HashiCorp shop? ──▶ Consul Connect
            │
            └── ADVANCED (complex routing, rate limiting,
                external auth, multi-cluster policies)
                │
                └── Istio (or managed: GKE ASM, AWS App Mesh)
```
Istio Deep Dive
Architecture
```
ISTIO ARCHITECTURE
─────────────────────────────────────────────────────────────

                  ┌─────────────────────┐
                  │       istiod        │
                  │                     │
                  │ ┌──────┐ ┌───────┐  │
                  │ │Pilot │ │Citadel│  │
                  │ │      │ │       │  │
                  │ │Config│ │Certs  │  │
                  │ └──────┘ └───────┘  │
                  │                     │
                  │  One binary does    │
                  │  everything now     │
                  └──────────┬──────────┘
                             │
                             │ xDS (config)
            ┌────────────────┼────────────────┐
            │                │                │
            ▼                ▼                ▼
┌───────────────────────────────────────────────────────────┐
│ Pod                                                       │
│ ┌─────────────────┐    ┌───────────────────────────────┐  │
│ │                 │    │ istio-proxy (Envoy)           │  │
│ │  Application    │◀──▶│                               │  │
│ │                 │    │ • Injected by MutatingWebhook │  │
│ │  (your code)    │    │ • Intercepts all traffic      │  │
│ │                 │    │ • Handles mTLS termination    │  │
│ │                 │    │ • Enforces policies           │  │
│ │                 │    │ • Reports telemetry           │  │
│ └─────────────────┘    └───────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘

Traffic flow: App → iptables redirect → Envoy → Network
```
Installation
```bash
# Download istioctl
curl -L https://istio.io/downloadIstio | sh -
cd istio-*
export PATH=$PWD/bin:$PATH

# Choose a profile:
# - demo: Good for learning, includes everything
# - default: Production baseline
# - minimal: Just the essentials

# Install for learning
istioctl install --set profile=demo -y

# Install for production
istioctl install --set profile=default -y

# Enable automatic sidecar injection for a namespace
kubectl label namespace default istio-injection=enabled

# Verify installation
istioctl verify-install

# Check pods
kubectl get pods -n istio-system
```
Core Istio Resources
```yaml
# VirtualService: "How should traffic be routed?"
# Think of it as an enhanced Ingress for internal traffic
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-routing
spec:
  hosts:
    - reviews  # Intercept traffic to "reviews" service
  http:
    # Route Chrome users to v2
    - match:
        - headers:
            user-agent:
              regex: ".*Chrome.*"
      route:
        - destination:
            host: reviews
            subset: v2
    # Everyone else gets v1
    - route:
        - destination:
            host: reviews
            subset: v1
---
# DestinationRule: "What policies apply to a destination?"
# Defines subsets (versions) and traffic policies
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-destination
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
    loadBalancer:
      simple: ROUND_ROBIN
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
    - name: v3
      labels:
        version: v3
```
Traffic Management Patterns
Canary Deployments
```yaml
# Start: 95% to v1, 5% to v2
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app-canary
spec:
  hosts:
    - my-app
  http:
    - route:
        - destination:
            host: my-app
            subset: v1
          weight: 95
        - destination:
            host: my-app
            subset: v2
          weight: 5
---
# After validation: Shift to 50/50
# Then: 0% v1, 100% v2
# Each change is just a kubectl apply
```
Timeouts and Retries
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-resilience
spec:
  hosts:
    - ratings
  http:
    - route:
        - destination:
            host: ratings
      timeout: 10s  # Total timeout
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: gateway-error,connect-failure,refused-stream,5xx
```
Circuit Breaking
```yaml
# "Stop calling a service that's failing"
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-circuit-breaker
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
    outlierDetection:
      # If a pod returns 5 consecutive 5xx errors...
      consecutive5xxErrors: 5
      # ...checked every 30 seconds...
      interval: 30s
      # ...eject it for 30 seconds
      baseEjectionTime: 30s
      # Can eject up to 100% of pods
      maxEjectionPercent: 100
```
Fault Injection (Chaos Testing)
```yaml
# Inject failures to test resilience
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-chaos
spec:
  hosts:
    - ratings
  http:
    - fault:
        # 10% of requests get 5 second delay
        delay:
          percentage:
            value: 10
          fixedDelay: 5s
        # 5% of requests get HTTP 500
        abort:
          percentage:
            value: 5
          httpStatus: 500
      route:
        - destination:
            host: ratings
```
Security: Mutual TLS (mTLS)
How mTLS Works
```
TLS vs MUTUAL TLS
─────────────────────────────────────────────────────────────

REGULAR TLS (HTTPS)
─────────────────────────────────────────────────────────────

Client                                        Server
  │                                              │
  │──── "Show me your certificate" ─────────────▶│
  │                                              │
  │◀─── Server certificate ──────────────────────│
  │     (proves server identity)                 │
  │                                              │
  │◀═══ Encrypted traffic ══════════════════════▶│

Problem: Server doesn't know WHO the client is
         Client could be anyone with network access

─────────────────────────────────────────────────────────────

MUTUAL TLS (mTLS)
─────────────────────────────────────────────────────────────

Client                                        Server
  │                                              │
  │──── "Show me your certificate" ─────────────▶│
  │                                              │
  │◀─── Server certificate ──────────────────────│
  │     (proves server identity)                 │
  │                                              │
  │──── Client certificate ─────────────────────▶│
  │     (proves client identity)                 │
  │                                              │
  │◀─── "Verified: You're frontend-sa" ──────────│
  │                                              │
  │◀═══ Encrypted traffic ══════════════════════▶│

Result: BOTH sides prove identity
        Server knows EXACTLY which service is calling
        This enables identity-based authorization
```
Configuring mTLS
```yaml
# PeerAuthentication: "Who can connect to services in this namespace?"
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT  # Options: STRICT, PERMISSIVE, DISABLE
    # STRICT: Only mTLS allowed
    # PERMISSIVE: Accept both plaintext and mTLS (migration mode)
    # DISABLE: No mTLS
---
# AuthorizationPolicy: "Who can call what?"
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: backend-authz
  namespace: production
spec:
  selector:
    matchLabels:
      app: backend
  action: ALLOW
  rules:
    # Only frontend service account can call backend
    - from:
        - source:
            principals:
              - "cluster.local/ns/production/sa/frontend"
      to:
        - operation:
            methods: ["GET", "POST"]
            paths: ["/api/*"]
  # Deny everything else (implicit)
```
Verify mTLS Status
```bash
# Check the effective mTLS mode for a pod.
# (The older `istioctl authn tls-check` command is not available in
#  current istioctl releases; `istioctl x describe` covers this check.)
istioctl x describe pod <pod-name> -n production

# In the output, look for the PeerAuthentication that applies to the
# workload and its mTLS mode (STRICT is expected in this namespace).

# Check proxy certificates
istioctl proxy-config secret <pod-name> -n production

# Analyze potential issues
istioctl analyze -n production
```
Observability
The Three Pillars in Istio
```bash
# Install observability stack
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/prometheus.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/grafana.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/jaeger.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/kiali.yaml

# Access dashboards
istioctl dashboard kiali    # Service graph + health
istioctl dashboard grafana  # Metrics dashboards
istioctl dashboard jaeger   # Distributed tracing
```
Kiali: The Service Mesh Console
```
KIALI DASHBOARD
─────────────────────────────────────────────────────────────

Graph View - Live Traffic Visualization
─────────────────────────────────────────────────────────────

          ┌──────────┐
          │ frontend │──────────────────────┐
          │ 🔒 ✓     │ 100 req/s            │
          └────┬─────┘                      │
               │                            │
      50 req/s │                            │ 50 req/s
               │                            │
               ▼                            ▼
          ┌──────────┐                ┌──────────┐
          │ backend  │───────────────▶│ database │
          │ 🔒 ✓     │ 30 req/s       │ 🔒 ✓     │
          └────┬─────┘                └──────────┘
               │
      20 req/s │ (12% errors ⚠️)
               │
               ▼
          ┌──────────┐
          │ payments │
          │ 🔒 ⚠️    │ Degraded
          └──────────┘

🔒 = mTLS enabled
✓  = Healthy
⚠️ = Issues detected

Features:
• Real-time traffic flow
• Health status per service
• mTLS verification (lock icon)
• Error rate visualization
• Latency percentiles
• Configuration validation
```
When NOT to Use Service Mesh
The Honest Cost-Benefit Analysis
```
SERVICE MESH: REAL COSTS
─────────────────────────────────────────────────────────────

RESOURCE OVERHEAD
─────────────────────────────────────────────────────────────
• +2 containers per pod (init + sidecar)
• +50-100MB memory per pod (Envoy)
• +1-2ms latency per network hop
• CPU overhead: ~10-15% of app CPU

Example: 500 pods × 100MB = 50GB just for sidecars

OPERATIONAL OVERHEAD
─────────────────────────────────────────────────────────────
• Steep learning curve (over a dozen CRDs in Istio, each with many options)
• Complex debugging ("is it the app or the mesh?")
• Upgrade complexity (control plane + data plane)
• Certificate management
• Team needs mesh expertise

LATENCY MULTIPLICATION
─────────────────────────────────────────────────────────────

Simple request: App A ──▶ App B
Actual path:    App A ──▶ Envoy ──▶ Network ──▶ Envoy ──▶ App B

High-fanout request (checkout calling 50 services):
Without mesh: 50 calls × ~0ms overhead = 0ms
With mesh:    50 calls × ~2ms overhead = 100ms added latency
```
Decision Matrix
| Scenario | Service Mesh? | Better Alternative |
|---|---|---|
| <10 services | ❌ | NetworkPolicies + Ingress |
| Starting with K8s | ❌ | Learn basics first |
| Just need canary deployments | ❌ | Argo Rollouts |
| Cost-sensitive | ⚠️ | Linkerd or Cilium Mesh |
| Latency-critical (<5ms) | ⚠️ | Cilium Mesh (eBPF) |
| Compliance requires mTLS | ✅ | - |
| Multi-language, need uniform observability | ✅ | - |
| Zero-trust security requirement | ✅ | - |
| Complex traffic policies | ✅ | - |
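The decision tree and matrix can be condensed into a small function. This is a minimal sketch rather than a definitive rule: the function name `recommend_mesh` and its parameters are hypothetical, chosen only to mirror the branches of the tree, and the order in which the "basic" options are checked (Cilium, then Consul, then Linkerd as the default) is an assumption.

```python
def recommend_mesh(
    needs_mtls_everywhere: bool,
    needs_traffic_management: bool = False,
    advanced_traffic: bool = False,
    uses_cilium: bool = False,
    hashicorp_shop: bool = False,
) -> str:
    """Walk this module's decision tree and return its suggestion.

    All parameter names are hypothetical; they only label the questions
    asked in the tree.
    """
    if not needs_mtls_everywhere:
        if not needs_traffic_management:
            return "No mesh: NetworkPolicies + Ingress"
        return "Alternatives first: Argo Rollouts, Ingress retries, client-side libraries"

    # mTLS is required everywhere: choose by traffic complexity.
    if advanced_traffic:
        return "Istio"
    if uses_cilium:
        return "Cilium Mesh"
    if hashicorp_shop:
        return "Consul Connect"
    return "Linkerd"  # basic needs, minimal overhead


if __name__ == "__main__":
    # A team that only wants canary deployments should look elsewhere first.
    print(recommend_mesh(needs_mtls_everywhere=False, needs_traffic_management=True))
    # Compliance-driven mTLS with simple traffic needs.
    print(recommend_mesh(needs_mtls_everywhere=True))
```

Encoding the tree this way makes the first question explicit: without a cluster-wide mTLS requirement, the function never recommends a mesh at all.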
Common Mistakes
| Mistake | Impact | Solution |
|---|---|---|
| Enabling mesh cluster-wide immediately | Breaks system components | Start with one app namespace |
| Not excluding kube-system | System pods fail with sidecars | Always exclude system namespaces |
| STRICT mTLS on day one | Legacy services can’t connect | Use PERMISSIVE mode during migration |
| Default sidecar resources | OOM at scale (see War Story) | Set explicit memory/CPU limits |
| Too many VirtualServices | Config explosion, impossible to debug | Use conventions, fewer rules |
| Not monitoring proxy metrics | Proxy issues look like app bugs | Alert on istio_requests_total errors |
| Meshing high-fanout services | Latency multiplication | Exclude or use headless services |
| No emergency kill switch | Stuck when mesh breaks | Document sidecar disable procedure |
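To make the "monitor proxy metrics" row concrete, the sketch below computes a 5xx error ratio from request counts grouped by the `response_code` label of `istio_requests_total`. It assumes the counts have already been fetched for some time window; the function names and the 5% alert threshold are hypothetical, not values from this module.

```python
def error_ratio(request_counts: dict[str, float]) -> float:
    """Fraction of requests that returned a 5xx response.

    request_counts maps a response_code label value (e.g. "200", "503")
    to the number of requests seen with that code in the window.
    """
    total = sum(request_counts.values())
    if total == 0:
        return 0.0  # no traffic, nothing to alert on
    errors = sum(
        count for code, count in request_counts.items() if code.startswith("5")
    )
    return errors / total


def should_alert(request_counts: dict[str, float], threshold: float = 0.05) -> bool:
    """True when the 5xx ratio exceeds the (hypothetical) threshold."""
    return error_ratio(request_counts) > threshold


if __name__ == "__main__":
    sample = {"200": 88, "503": 12}
    print(f"5xx ratio: {error_ratio(sample):.2%}, alert: {should_alert(sample)}")
```

In practice the same ratio would usually be expressed as an alerting rule in the metrics system itself; the point here is only what is being measured.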
1. What's the difference between the data plane and control plane in a service mesh?
Answer:
Data Plane:
- Sidecar proxies (Envoy) injected into each pod
- Intercepts ALL traffic to/from the application
- Enforces policies (mTLS, retries, circuit breaking)
- Collects metrics and traces
- Does the actual work of routing traffic
Control Plane:
- Centralized management component (istiod in Istio)
- Distributes configuration to all proxies
- Issues and rotates certificates
- Doesn’t handle actual traffic
- Makes decisions that proxies execute
Analogy: Control plane is the air traffic control tower (gives instructions). Data plane is the airplanes (actually moves traffic).
2. Why would you use mTLS instead of regular TLS?
Answer:
Regular TLS (one-way):
- Only server proves its identity
- Client remains anonymous
- Prevents eavesdropping on traffic
- Used for HTTPS websites
Mutual TLS (mTLS):
- BOTH client and server prove identity
- Server knows exactly which service is calling
- Enables identity-based authorization
- Required for zero-trust networking
In Kubernetes context: mTLS ensures that when “frontend” calls “backend”, the backend KNOWS it’s actually frontend—not a compromised pod pretending to be frontend.
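Once the caller's identity is verified by mTLS, authorization reduces to matching that identity against a rule. The sketch below models the `backend-authz` policy from the Configuring mTLS section (frontend service account, `GET`/`POST`, paths under `/api/`). It is an illustration of the idea, not Istio's implementation: the `Rule` class and helper names are hypothetical, and a trailing `*` is treated as a simple prefix match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for one ALLOW rule."""
    principals: tuple[str, ...]
    methods: tuple[str, ...]
    paths: tuple[str, ...]  # a trailing "*" is treated as a prefix match


def path_matches(pattern: str, path: str) -> bool:
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return path == pattern


def is_allowed(rule: Rule, principal: str, method: str, path: str) -> bool:
    """Allow only when identity, method, and path all match; deny otherwise."""
    return (
        principal in rule.principals
        and method in rule.methods
        and any(path_matches(p, path) for p in rule.paths)
    )


BACKEND_RULE = Rule(
    principals=("cluster.local/ns/production/sa/frontend",),
    methods=("GET", "POST"),
    paths=("/api/*",),
)

if __name__ == "__main__":
    frontend = "cluster.local/ns/production/sa/frontend"
    other = "cluster.local/ns/production/sa/reporting"  # hypothetical caller
    print(is_allowed(BACKEND_RULE, frontend, "GET", "/api/orders"))
    print(is_allowed(BACKEND_RULE, other, "GET", "/api/orders"))
```

The identity string is only trustworthy because the certificate was verified during the mTLS handshake; with plain TLS the server would have nothing reliable to match against.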
3. A company has 8 microservices. They want canary deployments. Should they use a service mesh?
Answer: Probably not.
For just canary deployments without mTLS requirements, Argo Rollouts is a better choice:
- No sidecar overhead
- No latency addition
- Simpler to operate
- Purpose-built for progressive delivery
Service mesh is overkill if you only need one feature. Only adopt it when you need multiple capabilities (mTLS + traffic management + observability uniformly across all services).
4. What does "outlier detection" do in an Istio DestinationRule?
Answer: Outlier detection implements circuit breaking by ejecting unhealthy endpoints:
```yaml
outlierDetection:
  consecutive5xxErrors: 5   # After 5 consecutive 5xx responses...
  interval: 30s             # ...checked every 30 seconds...
  baseEjectionTime: 30s     # ...eject for 30 seconds
  maxEjectionPercent: 100   # Can eject all pods if all are failing
```
When a pod returns too many errors, Istio stops sending traffic to it temporarily. This prevents a failing instance from affecting all requests and gives it time to recover.
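The ejection logic can be modeled in a few lines. This is a toy sketch under simplifying assumptions, not Envoy's algorithm: it ignores `interval` and `maxEjectionPercent`, uses a fixed ejection time, and the class and method names are hypothetical.

```python
class OutlierDetector:
    """Toy model of consecutive-5xx ejection for a set of endpoints."""

    def __init__(self, consecutive_5xx_errors: int = 5, base_ejection_time: float = 30.0):
        self.threshold = consecutive_5xx_errors
        self.base_ejection_time = base_ejection_time
        self.streak: dict[str, int] = {}            # endpoint -> current 5xx streak
        self.ejected_until: dict[str, float] = {}   # endpoint -> time it may return

    def record(self, endpoint: str, status: int, now: float) -> None:
        """Record one response from an endpoint at time `now` (seconds)."""
        if 500 <= status <= 599:
            self.streak[endpoint] = self.streak.get(endpoint, 0) + 1
            if self.streak[endpoint] >= self.threshold:
                # Threshold reached: take the endpoint out of rotation.
                self.ejected_until[endpoint] = now + self.base_ejection_time
                self.streak[endpoint] = 0
        else:
            self.streak[endpoint] = 0  # any non-5xx response resets the streak

    def is_ejected(self, endpoint: str, now: float) -> bool:
        return now < self.ejected_until.get(endpoint, 0.0)


if __name__ == "__main__":
    detector = OutlierDetector()
    for second in range(5):
        detector.record("reviews-v1-pod", 503, now=float(second))
    print(detector.is_ejected("reviews-v1-pod", now=5.0))   # inside the ejection window
    print(detector.is_ejected("reviews-v1-pod", now=40.0))  # window has passed
```

The key property is that errors must be consecutive: a single success in between resets the count, so an endpoint with occasional failures stays in rotation.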
5. Why might enabling Istio on a high-fanout service cause significant latency increase?
Answer: Latency multiplication effect.
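Each service mesh hop adds ~2ms latency: the request passes through two Envoy proxies, each doing routing, policy checks, and TLS encryption. (The mTLS handshake itself is paid once per connection, not on every request.)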
For a service that calls 50 downstream services per request:
- Without mesh: ~0ms overhead
- With mesh: 50 × 2ms = 100ms additional latency
This is why the War Story checkout service (47 downstream calls) saw latency jump from 50ms to 200ms+. High-fanout services should either be excluded from the mesh or redesigned to batch calls.
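The multiplication is easy to check numerically. The sketch below assumes the ~2ms per-hop figure used in this module and adds a hypothetical `parallelism` parameter to show why batching or issuing calls concurrently reduces the penalty; with `parallelism=1` it models the fully sequential worst case.

```python
import math


def mesh_overhead_ms(downstream_calls: int, per_hop_ms: float = 2.0, parallelism: int = 1) -> float:
    """Added latency from sidecars for one request.

    parallelism=1 models fully sequential downstream calls (the worst
    case described in this module); higher values model calls issued in
    concurrent batches of that size.
    """
    if downstream_calls < 0 or parallelism < 1:
        raise ValueError("downstream_calls must be >= 0 and parallelism >= 1")
    batches = math.ceil(downstream_calls / parallelism)
    return batches * per_hop_ms


if __name__ == "__main__":
    print(mesh_overhead_ms(47))                  # War Story checkout, sequential
    print(mesh_overhead_ms(50))                  # generic high-fanout example
    print(mesh_overhead_ms(50, parallelism=10))  # same calls in batches of 10
```

The sequential case reproduces the 47 × 2ms and 50 × 2ms figures above, and the batched case shows why redesigning the call pattern can matter as much as excluding a service from the mesh.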
6. What's the difference between a VirtualService and a DestinationRule?
Answer:
VirtualService: “HOW to route traffic”
- Traffic routing rules
- Which subset/version gets traffic
- Weights for canary
- Match conditions (headers, paths)
- Timeouts and retries
DestinationRule: “WHAT policies apply at destination”
- Defines subsets (v1, v2, v3)
- Connection pool settings
- Load balancing algorithm
- Circuit breaker configuration
- TLS settings
They work together: VirtualService says “send 10% to v2”, DestinationRule defines what “v2” means and how to connect to it.
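The division of labor can be illustrated with a small simulation: the weights play the role of the VirtualService route, and the subset names are what a DestinationRule would define. This is a sketch under the assumption that each request is routed independently at random in proportion to the weights; the function names are hypothetical and the real proxy's selection logic is more involved.

```python
import random
from collections import Counter


def pick_subset(weights: dict[str, int], rng: random.Random) -> str:
    """Choose a subset with probability proportional to its weight."""
    subsets = list(weights)
    return rng.choices(subsets, weights=[weights[s] for s in subsets], k=1)[0]


def simulate(weights: dict[str, int], requests: int = 10_000, seed: int = 42) -> Counter:
    """Route `requests` simulated requests and count hits per subset."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    return Counter(pick_subset(weights, rng) for _ in range(requests))


if __name__ == "__main__":
    counts = simulate({"v1": 95, "v2": 5})
    total = sum(counts.values())
    for subset, hits in sorted(counts.items()):
        print(f"{subset}: {hits / total:.1%}")
```

Over a few thousand requests the observed split converges on the configured weights, which is why a canary needs enough traffic before its error rate means anything.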
7. How does Istio inject sidecars into pods?
Answer: Istio uses a MutatingAdmissionWebhook:
- Pod creation request goes to API server
- API server calls Istio’s webhook
- Webhook modifies the pod spec to add:
  - istio-init container (sets up iptables)
  - istio-proxy container (Envoy sidecar)
- Modified pod spec is created
This is controlled by the namespace label:
```bash
kubectl label namespace default istio-injection=enabled
```
Or per-pod annotation:
```yaml
annotations:
  sidecar.istio.io/inject: "true"  # or "false" to exclude
```
8. What's the memory overhead of Linkerd vs Istio, and when does this matter?
Answer:
| Mesh | Memory per sidecar |
|---|---|
| Istio (Envoy) | 50-100MB |
| Linkerd | ~10MB |
When it matters:
On a 1000-pod cluster:
- Istio: 50-100GB just for sidecars
- Linkerd: ~10GB for sidecars
For cost-sensitive deployments or clusters with many small pods, Linkerd’s 5-10x lower memory usage can save significant cloud costs. At $0.05/GB/hour, the 40-90GB saved on a 1000-pod cluster is worth roughly $1,460-3,285/month (40-90GB × $0.05 × ~730 hours) when choosing Linkerd over Istio.
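The cluster-level numbers can be reproduced with a few lines of arithmetic. This sketch takes the per-sidecar memory figures and the $0.05/GB/hour price from this answer; the 730 hours per month and the use of decimal gigabytes (1000MB per GB) are assumptions, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def sidecar_memory_gb(pods: int, mb_per_sidecar: float) -> float:
    """Total sidecar memory in decimal GB (1000 MB per GB)."""
    return pods * mb_per_sidecar / 1000


def monthly_cost_usd(memory_gb: float, usd_per_gb_hour: float = 0.05) -> float:
    """Monthly cost of keeping `memory_gb` resident at the given price."""
    return memory_gb * usd_per_gb_hour * HOURS_PER_MONTH


if __name__ == "__main__":
    pods = 1000
    linkerd_gb = sidecar_memory_gb(pods, 10)
    for label, mb in (("Istio (low)", 50), ("Istio (high)", 100)):
        istio_gb = sidecar_memory_gb(pods, mb)
        saved_gb = istio_gb - linkerd_gb
        print(f"{label}: {istio_gb:.0f} GB, saves {saved_gb:.0f} GB "
              f"= ${monthly_cost_usd(saved_gb):,.0f}/month vs Linkerd")
```

Changing `usd_per_gb_hour` to the actual blended memory price of your nodes is the quickest way to see whether the difference matters for your cluster.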
Key Takeaways
- Service mesh moves networking out of apps - Retries, mTLS, circuit breaking become infrastructure
- Data plane does the work, control plane gives orders - Proxies route traffic, istiod configures them
- mTLS proves both sides’ identity - Essential for zero-trust, enables identity-based authz
- Latency multiplies with fanout - 50 downstream calls × 2ms = 100ms overhead
- Size your sidecars for production - Default resources are too small for real traffic
- Start with one namespace - Don’t mesh everything at once
- PERMISSIVE mode first - Migrate to STRICT mTLS gradually
- Not every team needs service mesh - <10 services? Use NetworkPolicies
- Linkerd for simplicity, Istio for features - Choose based on actual needs
- Have an emergency kill switch - Know how to disable sidecars fast
Did You Know?
- The term “service mesh” was coined in 2017 by William Morgan, CEO of Buoyant and creator of Linkerd. The pattern emerged from Netflix’s Eureka and Airbnb’s Synapse in the early 2010s, but didn’t have a name until Morgan’s blog post defined the category.
- Linkerd’s proxy uses only 10MB of memory because it’s written in Rust, not C++ like Envoy. For a 1000-pod cluster, that’s 40-90GB less memory than Istio, potentially thousands of dollars per month in cloud costs.
- Netflix ran without a service mesh for years despite pioneering many of the patterns. It relied on client-side libraries (Hystrix, Ribbon) because it prioritized latency over operational simplicity. At that scale (~1000 microservices), even 1ms overhead per hop adds up.
- Istio’s control plane used to be 4 separate components (Pilot, Citadel, Galley, Mixer). The 2020 Istio 1.5 release consolidated Pilot, Citadel, and Galley into a single binary called “istiod” and deprecated Mixer, dramatically simplifying operations and reducing resource usage.
Hands-On Exercise
Objective
Deploy Istio, enable mTLS, implement a canary deployment, and visualize traffic in Kiali.
Part 1: Installation
```bash
# Install Istio with demo profile
istioctl install --set profile=demo -y

# Enable injection for default namespace
kubectl label namespace default istio-injection=enabled

# Deploy sample Bookinfo application
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/bookinfo/platform/kube/bookinfo.yaml

# Wait for pods (should show 2/2 containers)
kubectl wait --for=condition=ready pod -l app=productpage --timeout=120s
kubectl get pods  # All should show 2/2

# Install observability stack
# (kubectl cannot apply a bare directory URL, so each addon is applied by file)
for addon in prometheus grafana jaeger kiali; do
  kubectl apply -f "https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/${addon}.yaml"
done
```
Part 2: Verify Sidecar Injection
```bash
# Check pods have 2 containers
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'

# Should see: productpage-xxx productpage istio-proxy
```
Part 3: Enable Strict mTLS
```bash
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT
EOF

# Verify mTLS is working: the output should report STRICT as the mTLS mode
# (`istioctl authn tls-check` is not available in current istioctl releases)
istioctl x describe pod "$(kubectl get pod -l app=productpage -o jsonpath='{.items[0].metadata.name}')" -n default
```
Part 4: Traffic Routing (Canary)
```bash
# Define subsets
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
    - name: v3
      labels:
        version: v3
EOF

# Route 100% to v1
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 100
EOF

# Shift to 50/50 canary
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 50
        - destination:
            host: reviews
            subset: v3
          weight: 50
EOF
```
Part 5: Observe in Kiali
```bash
# Generate traffic
for i in $(seq 1 100); do
  kubectl exec "$(kubectl get pod -l app=ratings -o jsonpath='{.items[0].metadata.name}')" -c ratings -- curl -sS productpage:9080/productpage > /dev/null
done

# Open Kiali
istioctl dashboard kiali
# Navigate to Graph → Select "default" namespace
# Observe: Traffic split, mTLS locks, request rates
```
Success Criteria
- All pods show 2/2 containers (sidecar injected)
- mTLS check reports STRICT mode for the productpage pod
- Traffic routes only to v1 initially
- Canary shows ~50/50 split in Kiali
- Lock icons visible (mTLS verified)
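The first criterion can also be checked programmatically. The sketch below takes the parsed output of `kubectl get pods -o json` and lists pods that have no `istio-proxy` container. The function name is hypothetical, and checking `initContainers` as well as `containers` is an assumption made to cover setups where the sidecar runs as an init container.

```python
import json


def pods_missing_sidecar(pod_list: dict, sidecar_name: str = "istio-proxy") -> list[str]:
    """Return names of pods whose spec does not include the sidecar container.

    pod_list is the parsed JSON from `kubectl get pods -o json`.
    """
    missing = []
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        names = [c["name"] for c in spec.get("containers", [])]
        names += [c["name"] for c in spec.get("initContainers", [])]
        if sidecar_name not in names:
            missing.append(pod["metadata"]["name"])
    return missing


# Small inline sample in place of real cluster output (pod names are made up).
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "productpage-v1-abc"},
     "spec": {"containers": [{"name": "productpage"}, {"name": "istio-proxy"}]}},
    {"metadata": {"name": "legacy-job-xyz"},
     "spec": {"containers": [{"name": "job"}]}}
  ]
}
""")

if __name__ == "__main__":
    print(pods_missing_sidecar(SAMPLE))
```

Against a real cluster, the same function can be fed `json.load(sys.stdin)` with the `kubectl` output piped in; an empty result means every pod in the namespace was injected.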
Cleanup
```bash
kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/bookinfo/platform/kube/bookinfo.yaml
istioctl uninstall --purge -y
kubectl delete namespace istio-system
```
Next Module
Continue to Scaling & Reliability Toolkit to learn about Karpenter for node autoscaling, KEDA for event-driven scaling, and Velero for backup and disaster recovery.
Further Reading
- Istio Documentation
- Linkerd Documentation
- Cilium Service Mesh
- Service Mesh Comparison
- CNCF Service Mesh Interface (SMI)
- Talk: “Service Mesh: The Hype, The Tech, The Future” - William Morgan
- Book: “Istio in Action” by Christian Posta