Module 1.4: Istio Observability
Complexity: [MEDIUM]
Time to Complete: 40-50 minutes
Prerequisites
Before starting this module, you should have completed:
- Module 1: Installation & Architecture — istiod, Envoy, sidecar injection
- Module 2: Traffic Management — VirtualService, DestinationRule, Gateway
- Module 3: Security & Troubleshooting — mTLS, AuthorizationPolicy, debugging
- Basic understanding of Prometheus metrics and distributed tracing concepts
What You’ll Be Able to Do
After completing this module, you will be able to:
- Configure Istio’s Telemetry API to control which metrics, access logs, and trace spans are collected per workload
- Construct Prometheus queries using Istio’s standard metrics (request count, duration, size) to build RED-method dashboards
- Integrate Istio with Jaeger/Zipkin for distributed tracing and propagate trace context headers across service boundaries
- Operate Kiali to visualize service topology, detect misconfigurations, and validate traffic flow through the mesh
Why This Module Matters
Observability accounts for 10% of the ICA exam. You’ll be asked to configure telemetry collection, understand Istio’s built-in metrics, and integrate with tools like Prometheus, Grafana, Kiali, and Jaeger. These aren’t just exam topics — they’re the reason most teams adopt Istio in the first place.
Without a service mesh, you’d need every development team to instrument their code with metrics, tracing headers, and structured logging. Istio gives you all three for free through the Envoy sidecar — no code changes required. The catch? You need to know how to configure, customize, and query that telemetry.
The Security Camera Analogy
Imagine a building with hundreds of rooms (services). Without Istio, you’d need to install cameras inside each room (application instrumentation). With Istio, you install cameras at every doorway (Envoy sidecars). You automatically see who enters, who leaves, how long they stay, and whether they were turned away. The Telemetry API controls what each camera records. Prometheus stores the footage. Grafana builds the dashboards. Kiali draws the floor plan. Jaeger lets you follow one person’s path through every room they visited.
What You’ll Learn
By the end of this module, you’ll be able to:
- Configure Istio’s Telemetry API v2 to control what telemetry is collected
- Understand and query Istio standard metrics (`istio_requests_total`, `istio_request_duration_milliseconds`)
- Configure Prometheus scraping for Istio metrics
- Set up distributed tracing with configurable sampling rates
- Enable and customize Envoy access logging
- Use Kiali for service mesh visualization and health monitoring
- Build Grafana dashboards for mesh, service, and workload views
Did You Know?
- Istio generates metrics with zero application code changes: Every HTTP/gRPC/TCP request passing through an Envoy sidecar automatically emits request count, duration, and size metrics. A 200-microservice mesh gets full observability from day one.
- Trace context propagation still requires application help: While Envoy generates spans automatically, your application must forward trace headers (`x-request-id`, `x-b3-traceid`, etc.) between inbound and outbound requests. Without this, traces break into disconnected fragments.
- Kiali can detect misconfigurations that `istioctl analyze` misses: Kiali validates your mesh at runtime — it can spot services with missing sidecars, traffic flowing to unexpected destinations, and mTLS inconsistencies by watching actual traffic patterns.
1. Istio Telemetry API v2
The Telemetry API (introduced in Istio 1.12, stable since 1.18) is the single control point for all observability configuration. It replaces the older EnvoyFilter-based approach and the deprecated Mixer component.
How Telemetry Works

```
┌──────────────────────────────────────────────────────────┐
│              Telemetry Resource (CRD)                    │
│                                                          │
│  ┌──────────┐   ┌──────────┐   ┌──────────────────┐      │
│  │ Metrics  │   │ Tracing  │   │ Access Logging   │      │
│  │          │   │          │   │                  │      │
│  │ Enable/  │   │ Sampling │   │ Envoy format /   │      │
│  │ Disable  │   │ rate     │   │ custom filters   │      │
│  │ Override │   │ provider │   │ provider         │      │
│  └──────────┘   └──────────┘   └──────────────────┘      │
│                                                          │
│  Applied at: mesh-wide → namespace → workload            │
└──────────────────────────────────────────────────────────┘
```

The Telemetry API has three pillars: metrics, tracing, and access logging. Each can be configured independently at three scopes:
| Scope | Where to apply | Example |
|---|---|---|
| Mesh-wide | istio-system namespace, no selector | Default for all workloads |
| Namespace | Target namespace, no selector | Override for a specific namespace |
| Workload | Target namespace + selector | Override for specific pods |
More specific scopes override broader ones. A workload-level Telemetry overrides namespace-level, which overrides mesh-wide.
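This precedence can be sketched as a small resolver that picks the most specific matching resource for a workload. This is a hypothetical Python illustration of the rule, not Istio's actual implementation; the `effective_telemetry` helper is invented:

```python
# Hypothetical sketch of Telemetry scope resolution (not Istio source code).
# Precedence: workload (namespace + selector) > namespace > mesh-wide.

def effective_telemetry(telemetries, workload_ns, workload_labels):
    """Return the most specific Telemetry resource that applies to a workload."""
    def specificity(t):
        if t["namespace"] == workload_ns and t.get("selector"):
            return 2  # workload scope
        if t["namespace"] == workload_ns:
            return 1  # namespace scope
        if t["namespace"] == "istio-system":
            return 0  # mesh-wide scope
        return -1     # does not apply

    def matches(t):
        sel = t.get("selector")
        if sel and not all(workload_labels.get(k) == v for k, v in sel.items()):
            return False
        return specificity(t) >= 0

    candidates = [t for t in telemetries if matches(t)]
    return max(candidates, key=specificity, default=None)

telemetries = [
    {"name": "mesh-default", "namespace": "istio-system", "sampling": 1.0},
    {"name": "payments-telemetry", "namespace": "payments", "sampling": 10.0},
    {"name": "checkout-debug", "namespace": "payments",
     "selector": {"app": "checkout"}, "sampling": 100.0},
]

# A checkout pod in payments gets the workload-level override
print(effective_telemetry(telemetries, "payments", {"app": "checkout"})["name"])  # checkout-debug
# A reviews pod in default falls back to mesh-wide
print(effective_telemetry(telemetries, "default", {"app": "reviews"})["name"])    # mesh-default
```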
Mesh-Wide Telemetry Configuration

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system   # mesh-wide scope
spec:
  # Enable metrics for Prometheus
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
        mode: CLIENT_AND_SERVER
      tagOverrides:
        request_protocol:
          operation: UPSERT
          value: "request.protocol"
  # Enable tracing with 1% sampling
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 1.0
  # Enable access logging
  accessLogging:
  - providers:
    - name: envoy
```

Namespace-Level Override
```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: payments-telemetry
  namespace: payments   # namespace scope
spec:
  # Higher trace sampling for critical namespace
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 10.0
```

Workload-Level Override
```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: checkout-debug
  namespace: payments
spec:
  selector:
    matchLabels:
      app: checkout   # workload scope
  # Full sampling for debugging a specific service
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 100.0
  accessLogging:
  - providers:
    - name: envoy
    filter:
      expression: "response.code >= 400"
```

Disabling Telemetry
You can disable specific telemetry types for workloads that generate too much noise:
```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: disable-healthcheck-logging
  namespace: default
spec:
  selector:
    matchLabels:
      app: high-traffic-proxy
  accessLogging:
  - disabled: true
```

2. Istio Standard Metrics
Envoy sidecars emit a standard set of metrics for every request. These are the metrics you’ll query on the exam and in production.
Key Metrics
Section titled “Key Metrics”| Metric | Type | Description |
|---|---|---|
| `istio_requests_total` | Counter | Total requests processed, labeled by response code, source, destination |
| `istio_request_duration_milliseconds` | Histogram | Request duration distribution |
| `istio_request_bytes` | Histogram | Request body sizes |
| `istio_response_bytes` | Histogram | Response body sizes |
| `istio_tcp_sent_bytes_total` | Counter | Total bytes sent for TCP connections |
| `istio_tcp_received_bytes_total` | Counter | Total bytes received for TCP connections |
| `istio_tcp_connections_opened_total` | Counter | Total TCP connections opened |
| `istio_tcp_connections_closed_total` | Counter | Total TCP connections closed |
Important Labels
Every metric carries labels that let you slice traffic by source, destination, and response:
```
istio_requests_total{
  reporter="source",                  # "source" or "destination"
  source_workload="productpage-v1",
  source_workload_namespace="default",
  destination_workload="reviews-v2",
  destination_workload_namespace="default",
  destination_service="reviews.default.svc.cluster.local",
  request_protocol="http",
  response_code="200",
  response_flags="-",
  connection_security_policy="mutual_tls"
}
```

Prometheus Queries for Istio

```promql
# Request rate per service (last 5 minutes)
rate(istio_requests_total{reporter="destination"}[5m])
```
```promql
# Error rate for a specific service
sum(rate(istio_requests_total{
  reporter="destination",
  destination_service="reviews.default.svc.cluster.local",
  response_code=~"5.*"
}[5m]))
/
sum(rate(istio_requests_total{
  reporter="destination",
  destination_service="reviews.default.svc.cluster.local"
}[5m]))
```
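The error-rate query divides the per-second rate of 5xx responses by the per-second rate of all responses. Working the same arithmetic by hand with made-up counter samples (illustrative numbers, not real data):

```python
# Illustrative error-rate math using made-up counter samples.
# rate() over a counter ≈ (value_now - value_5m_ago) / window_seconds.

WINDOW = 300  # 5 minutes, in seconds

# (counter_5m_ago, counter_now) for istio_requests_total, split by response code
samples = {
    "200": (10_000, 13_000),
    "503": (40, 70),
}

def rate(earlier, now, window=WINDOW):
    return (now - earlier) / window

total_rate = sum(rate(a, b) for a, b in samples.values())
error_rate = rate(*samples["503"])

print(f"total:  {total_rate:.2f} rps")   # total:  10.10 rps
print(f"errors: {error_rate:.2f} rps")   # errors: 0.10 rps
print(f"error ratio: {error_rate / total_rate:.2%}")  # error ratio: 0.99%
```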
```promql
# P99 latency for a service
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{
    reporter="destination",
    destination_service="reviews.default.svc.cluster.local"
  }[5m])) by (le))
```

Prometheus Integration
Istio exposes metrics on each sidecar’s port 15020 at /stats/prometheus. Prometheus discovers these via standard Kubernetes service discovery. If you installed Istio with the demo profile, Prometheus is pre-configured. Otherwise, add scrape configs:
```yaml
# Prometheus scrape config for Istio sidecars
scrape_configs:
- job_name: 'envoy-stats'
  metrics_path: /stats/prometheus
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_container_name]
    action: keep
    regex: istio-proxy
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:15020
    target_label: __address__
```

3. Distributed Tracing
Istio’s Envoy sidecars automatically generate trace spans for every request. But there’s a critical nuance that trips up many engineers.
What Envoy Does Automatically
Each sidecar creates a span (a timed record of a single hop). The span captures:
- Source and destination service
- Request duration
- HTTP method, URL, response code
What Your Application Must Do
Your application must propagate trace headers between incoming and outgoing requests. Without this, Envoy creates isolated spans that can’t be stitched into a full trace.
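In application code, propagation boils down to copying a fixed set of trace headers from each inbound request onto every outbound request. A framework-agnostic sketch (the `extract_trace_headers` helper is invented for illustration; it is not an Istio or OpenTelemetry API):

```python
# Minimal sketch of trace-header propagation. Plug the extracted headers
# into whatever HTTP client your service uses for outbound calls.

TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags", "b3",
    "traceparent", "tracestate",  # W3C Trace Context
]

def extract_trace_headers(inbound_headers: dict) -> dict:
    """Copy only the trace headers Envoy needs to stitch spans together."""
    lowered = {k.lower(): v for k, v in inbound_headers.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}

# In a request handler: forward the extracted headers on every outbound call.
inbound = {"X-B3-TraceId": "463ac35c9f6413ad", "X-B3-SpanId": "a2fb4a1d1a96d312",
           "X-Request-Id": "abc-123", "Accept": "application/json"}
outbound_headers = extract_trace_headers(inbound)
print(outbound_headers)
# {'x-request-id': 'abc-123', 'x-b3-traceid': '463ac35c9f6413ad',
#  'x-b3-spanid': 'a2fb4a1d1a96d312'}
```

Note that unrelated headers like `Accept` are deliberately not forwarded; only the trace context crosses the service boundary.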
Headers to propagate:
```
x-request-id
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags
b3
traceparent   # W3C Trace Context
tracestate    # W3C Trace Context
```

Configuring Trace Sampling
Sampling controls what percentage of requests generate traces. High sampling means more visibility but more overhead and storage.
```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: tracing-config
  namespace: istio-system
spec:
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 1.0   # 1% of requests — good for production
    customTags:
      environment:
        literal:
          value: "production"
      my_tag:
        header:
          name: x-custom-header
          defaultValue: "unknown"
```

| Sampling Rate | Use Case |
|---|---|
| 0.1% | High-traffic production (>10K rps) |
| 1% | Standard production |
| 10% | Staging / pre-production |
| 100% | Debugging a specific issue (temporary!) |
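A back-of-envelope calculation shows why these rates matter at scale. The requests-per-second figure, the hop count, and the `spans_per_day` helper below are illustrative assumptions, not measurements:

```python
# Back-of-envelope trace volume estimate (illustrative assumptions only).

def spans_per_day(rps: float, sampling_pct: float, hops: int = 5) -> float:
    """Each sampled request produces roughly one span per service hop."""
    sampled_rps = rps * (sampling_pct / 100.0)
    return sampled_rps * hops * 86_400  # seconds per day

for pct in (0.1, 1.0, 10.0, 100.0):
    per_day = spans_per_day(rps=10_000, sampling_pct=pct)
    print(f"{pct:>5}% sampling at 10K rps ≈ {per_day:,.0f} spans/day")
```

At 100% sampling, a 10K rps mesh with five hops would emit billions of spans per day, which is why full sampling should stay a temporary debugging measure.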
Zipkin/Jaeger Provider Configuration
Tracing backends are configured in the Istio MeshConfig. With the demo profile, Zipkin is pre-configured:
```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      tracing:
        zipkin:
          address: zipkin.istio-system:9411
    extensionProviders:
    - name: zipkin
      zipkin:
        service: zipkin.istio-system.svc.cluster.local
        port: 9411
    - name: jaeger
      zipkin:   # Jaeger accepts Zipkin format
        service: jaeger-collector.istio-system.svc.cluster.local
        port: 9411
```

Exam Tip: Jaeger’s collector accepts Zipkin-format spans on port 9411. This is why the provider configuration looks the same for both — Istio sends Zipkin-format traces regardless of the backend.
4. Envoy Access Logging
Access logs capture every request flowing through the mesh. They’re invaluable for debugging, but generate significant volume in large meshes.
Enabling Access Logs
The simplest way is through the Telemetry API:
```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: access-log-config
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: envoy
```

Alternatively, enable via MeshConfig during installation:
```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout
    accessLogEncoding: JSON   # TEXT or JSON
```

Filtering Access Logs
In production, logging every request is expensive. Filter to capture only errors or slow requests:
```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: error-only-logging
  namespace: default
spec:
  accessLogging:
  - providers:
    - name: envoy
    filter:
      expression: "response.code >= 400 || connection.mtls == false"
```

Common filter expressions:
| Expression | What it captures |
|---|---|
| `response.code >= 400` | Client and server errors only |
| `response.code >= 500` | Server errors only |
| `response.duration > 1000` | Slow requests (>1s) |
| `connection.mtls == false` | Non-mTLS traffic (security auditing) |
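Envoy evaluates these filters as CEL expressions against request attributes, and a request is logged only when the expression is true. As a rough mental model (a toy Python sketch with made-up sample requests, not real CEL evaluation):

```python
# Toy illustration of how an access-log filter gates log emission.
# Envoy evaluates real CEL expressions; these lambdas just model the
# same decisions for a few invented sample requests.

FILTERS = {
    "response.code >= 400":     lambda r: r["code"] >= 400,
    "response.code >= 500":     lambda r: r["code"] >= 500,
    "response.duration > 1000": lambda r: r["duration_ms"] > 1000,
    "connection.mtls == false": lambda r: not r["mtls"],
}

requests = [
    {"path": "/productpage", "code": 200, "duration_ms": 35,   "mtls": True},
    {"path": "/reviews/0",   "code": 503, "duration_ms": 12,   "mtls": True},
    {"path": "/ratings/1",   "code": 200, "duration_ms": 2400, "mtls": False},
]

expr = "response.code >= 400"
logged = [r["path"] for r in requests if FILTERS[expr](r)]
print(logged)  # ['/reviews/0']
```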
Reading Envoy Access Logs

```shell
# View access logs for a specific pod's sidecar
kubectl logs deploy/productpage-v1 -c istio-proxy --tail=20

# Example log line (TEXT format):
# [2024-01-15T10:30:45.123Z] "GET /reviews/0 HTTP/1.1" 200 - via_upstream
# - "-" 0 295 24 23 "-" "Mozilla/5.0" "abc-123" "reviews:9080"
# outbound|9080||reviews.default.svc.cluster.local 10.244.1.5:39012
# 10.244.2.8:9080 10.244.1.5:33456 - default
```

Key fields in order: timestamp, method/path/protocol, response code, response flags, authority, upstream host, request duration (ms).
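For quick triage you can pull the leading fields out of a TEXT-format line with a regular expression. This sketch only matches the first few fields of the example line above; it is not a full parser for Envoy's default format:

```python
# Rough parse of the example TEXT-format access log line.
# Envoy's default format has many more fields; this sketch only pulls
# the leading ones for quick triage.
import re

line = ('[2024-01-15T10:30:45.123Z] "GET /reviews/0 HTTP/1.1" 200 - '
        'via_upstream - "-" 0 295 24 23 "-" "Mozilla/5.0" "abc-123" '
        '"reviews:9080" outbound|9080||reviews.default.svc.cluster.local')

m = re.match(r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<proto>\S+)" '
             r'(?P<code>\d+) (?P<flags>\S+)', line)
print(m.group("ts"), m.group("method"), m.group("path"), m.group("code"))
# 2024-01-15T10:30:45.123Z GET /reviews/0 200
```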
5. Kiali: Service Mesh Visualization
Kiali is Istio’s official observability console. It provides real-time visualization of your service mesh topology, traffic flows, and configuration health.
What Kiali Shows You
Section titled “What Kiali Shows You”| View | What You See |
|---|---|
| Graph | Live service topology with traffic animation, error rates, latency |
| Applications | Grouped view of workloads forming an application |
| Workloads | Individual deployment health, pod logs, Envoy config |
| Services | Kubernetes services with Istio routing info |
| Istio Config | Validation of VirtualService, DestinationRule, etc. |
Accessing Kiali

```shell
# If installed with demo profile, Kiali is already running
kubectl get svc -n istio-system kiali

# Port-forward to access the dashboard
kubectl port-forward svc/kiali -n istio-system 20001:20001

# Or use istioctl dashboard shortcut
istioctl dashboard kiali
```

Kiali for the ICA Exam
Kiali is most useful for:
- Verifying traffic routing — After applying a VirtualService, watch the graph to confirm traffic splits match your weights
- Spotting misconfigurations — Kiali flags invalid Istio resources with warning icons
- Checking mTLS status — The security badge on each edge shows whether mTLS is active
- Identifying unhealthy services — Red nodes indicate error rates above threshold
Exam Tip: You likely won’t have Kiali on the exam itself, but understanding what it shows and how it integrates is fair game for questions.
6. Grafana Dashboards for Istio
Istio ships with pre-built Grafana dashboards that visualize the Prometheus metrics described in Section 2. There are three main dashboards:
Mesh Dashboard
Shows the global view across your entire mesh:
- Total request volume
- Global success rate
- Aggregate P50/P90/P99 latency
Service Dashboard
Drill into a specific service:
- Inbound request volume and success rate
- Client workloads (who’s calling this service)
- Request duration distribution
- Request/response sizes
Workload Dashboard
Drill into a specific workload (deployment):
- Inbound and outbound request metrics
- TCP connection metrics
- Per-destination breakdowns
Accessing Grafana

```shell
# Port-forward (if installed with demo profile)
kubectl port-forward svc/grafana -n istio-system 3000:3000

# Or use istioctl shortcut
istioctl dashboard grafana
```

Building a Custom Dashboard
If the pre-built dashboards don’t cover your use case, build custom panels using Istio metrics:
```json
{
  "targets": [
    {
      "expr": "sum(rate(istio_requests_total{reporter=\"destination\", destination_service=~\"$service\"}[5m])) by (response_code)",
      "legendFormat": "{{response_code}}"
    }
  ],
  "title": "Request Rate by Response Code",
  "type": "timeseries"
}
```

Common Mistakes
| Mistake | What Happens | Fix |
|---|---|---|
| Setting 100% trace sampling in production | Massive storage costs, performance degradation | Use 0.1-1% for production, increase temporarily for debugging |
| Forgetting to propagate trace headers in app code | Traces show disconnected spans, no end-to-end visibility | Forward all x-b3-* and traceparent headers in your application |
| Enabling TEXT access logs in high-traffic mesh | Log volume overwhelms storage, costs spike | Use JSON encoding with filters, or disable for noisy workloads |
| Querying `reporter="source"` when you want destination metrics | Double-counting or inconsistent numbers | Use `reporter="destination"` for server-side metrics (more reliable) |
| Not installing Prometheus/Grafana with Istio | Metrics are generated but not stored or visualized | Install addons or configure external Prometheus to scrape Istio metrics |
| Applying Telemetry in wrong namespace | Config silently ignored because scope doesn’t match | istio-system for mesh-wide, target namespace for namespace scope |
Question 1
What are the three scopes at which the Istio Telemetry API can be configured?
Show Answer
Mesh-wide (applied in istio-system with no selector), namespace (applied in target namespace with no selector), and workload (applied in target namespace with a selector.matchLabels). More specific scopes override broader ones.
Question 2
Which Istio metric would you query to calculate the error rate of a service?
Show Answer
`istio_requests_total` with a filter on `response_code=~"5.*"` divided by total `istio_requests_total`. For example:
```promql
sum(rate(istio_requests_total{response_code=~"5.*", destination_service="myservice.default.svc.cluster.local"}[5m]))
/
sum(rate(istio_requests_total{destination_service="myservice.default.svc.cluster.local"}[5m]))
```

Question 3
Why do distributed traces sometimes show disconnected spans even when Istio tracing is configured?
Show Answer
The application is not propagating trace headers. Envoy generates spans automatically, but the application must forward headers like `x-b3-traceid`, `x-b3-spanid`, `x-b3-parentspanid`, `x-b3-sampled`, and `traceparent` from incoming requests to outgoing requests. Without propagation, each hop creates an independent trace instead of a connected chain.
Question 4
How would you enable access logging only for requests that return HTTP 500+ errors?
Show Answer
Use the Telemetry API with a CEL filter expression:
```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: error-logging
  namespace: default
spec:
  accessLogging:
  - providers:
    - name: envoy
    filter:
      expression: "response.code >= 500"
```

Question 5
What port does Envoy expose Prometheus metrics on, and what is the metrics path?
Show Answer
Port 15020, path /stats/prometheus. Prometheus scrapes this endpoint on each sidecar to collect Istio standard metrics.
Hands-On Exercise: Configuring Istio Observability
Objective
Configure telemetry for a running Istio mesh: enable metrics, set up tracing with custom sampling, and configure filtered access logging.
```shell
# Ensure you have a kind cluster with Istio demo profile
# (see README.md for full setup instructions)
istioctl install --set profile=demo -y
kubectl label namespace default istio-injection=enabled --overwrite

# Deploy the bookinfo sample app
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.22/samples/bookinfo/platform/kube/bookinfo.yaml

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app=productpage --timeout=120s

# Generate some traffic
for i in $(seq 1 20); do
  kubectl exec deploy/ratings-v1 -- curl -s productpage:9080/productpage > /dev/null
done
```

Task 1: Apply Mesh-Wide Telemetry
Create a Telemetry resource in `istio-system` that enables Prometheus metrics, sets tracing to 5% sampling, and enables access logging:
```shell
kubectl apply -f - <<EOF
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-defaults
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 5.0
  accessLogging:
  - providers:
    - name: envoy
EOF
```

Verify the telemetry resource was created:
```shell
kubectl get telemetry -n istio-system
```

Task 2: Override Tracing for a Specific Namespace
Create a namespace-level override that increases trace sampling to 50% for the default namespace (simulating a debugging scenario):
```shell
kubectl apply -f - <<EOF
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: debug-tracing
  namespace: default
spec:
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 50.0
EOF
```

Task 3: Filter Access Logs for Errors Only
Apply a workload-level Telemetry that only logs errors for the productpage service:
```shell
kubectl apply -f - <<EOF
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: productpage-logging
  namespace: default
spec:
  selector:
    matchLabels:
      app: productpage
  accessLogging:
  - providers:
    - name: envoy
    filter:
      expression: "response.code >= 400"
EOF
```

Task 4: Verify Metrics Are Flowing

```shell
# Check that Envoy is emitting Istio metrics
kubectl exec deploy/productpage-v1 -c istio-proxy -- \
  curl -s localhost:15020/stats/prometheus | grep istio_requests_total | head -5

# Generate more traffic and check again
for i in $(seq 1 10); do
  kubectl exec deploy/ratings-v1 -- curl -s productpage:9080/productpage > /dev/null
done

# Query Prometheus (if available)
# kubectl port-forward svc/prometheus -n istio-system 9090:9090
# Then open http://localhost:9090 and query: istio_requests_total
```

Task 5: View Access Logs

```shell
# Generate a request that should succeed (no log due to filter)
kubectl exec deploy/ratings-v1 -- curl -s productpage:9080/productpage > /dev/null

# Generate a request that should fail (logged due to filter)
kubectl exec deploy/ratings-v1 -- curl -s productpage:9080/nonexistent-page > /dev/null

# Check productpage sidecar logs — only the 404 should appear
kubectl logs deploy/productpage-v1 -c istio-proxy --tail=5
```

Success Criteria
- Mesh-wide Telemetry resource exists in `istio-system` with metrics, tracing (5%), and access logging
- Namespace-level Telemetry in `default` overrides tracing to 50%
- Workload-level Telemetry for `productpage` filters access logs to errors only
- `istio_requests_total` metric is visible from the Envoy stats endpoint
- Access logs for `productpage` only show error responses (4xx/5xx)
Cleanup

```shell
kubectl delete telemetry mesh-defaults -n istio-system
kubectl delete telemetry debug-tracing -n default
kubectl delete telemetry productpage-logging -n default
kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.22/samples/bookinfo/platform/kube/bookinfo.yaml
```

Next Module
You’ve now completed all four ICA modules:
- Module 1: Installation, profiles, sidecar injection, architecture (20%)
- Module 2: Traffic management, fault injection, resilience (35% + 10%)
- Module 3: Security, authorization, troubleshooting (15% + 10%)
- Module 4: Observability — metrics, tracing, logging, dashboards (10%)
For deeper dives into the tools referenced here, see:
- Observability Theory — Metrics, logs, traces fundamentals
- Observability Tools — Prometheus, Grafana, Jaeger in depth
Final Exam Prep Checklist
- Can install Istio with `istioctl` using different profiles
- Can configure automatic and manual sidecar injection
- Can write VirtualService for traffic splitting, header routing, fault injection
- Can write DestinationRule for subsets, circuit breaking, outlier detection
- Can configure Gateway for ingress with TLS
- Can set up ServiceEntry for egress control
- Can configure PeerAuthentication (STRICT/PERMISSIVE)
- Can write AuthorizationPolicy (ALLOW/DENY)
- Can use `istioctl analyze`, `proxy-status`, `proxy-config` for debugging
- Can configure Telemetry API for metrics, tracing, and access logging
- Can query Istio standard metrics with PromQL
- Can configure trace sampling rates and understand header propagation
- Can filter access logs using CEL expressions
- Can use Kiali, Grafana, and Jaeger for mesh observability
Good luck on your ICA exam!