Module 1.6: Pixie - Zero-Instrumentation Observability
Complexity: [MEDIUM]
Time to Complete: 90 minutes
Prerequisites: Module 1.1 (Prometheus), Module 1.2 (OpenTelemetry basics), basic Kubernetes knowledge
Learning Objectives:
- Understand eBPF-based auto-instrumentation
- Deploy Pixie to a Kubernetes cluster
- Use PxL (Pixie Language) to query telemetry data
- Debug production issues without deploying any instrumentation code
What You’ll Be Able to Do
After completing this module, you will be able to:
- Deploy Pixie for instant Kubernetes observability using eBPF-based auto-instrumentation
- Implement PxL scripts for querying HTTP traffic, DNS requests, and database queries without code changes
- Configure Pixie’s data retention and export pipelines for long-term observability storage
- Evaluate Pixie’s zero-instrumentation approach against traditional APM for rapid debugging workflows
Why This Module Matters
Traditional observability requires you to modify your code: add metrics libraries, instrument spans, configure log formats. This takes weeks of engineering effort and often means you can’t debug production issues you didn’t anticipate instrumenting.
Pixie changes this equation entirely.
Using eBPF (extended Berkeley Packet Filter), Pixie captures telemetry data directly from the Linux kernel—no code changes, no sidecars, no redeployments. It sees every HTTP request, every SQL query, every DNS lookup happening in your cluster. You get observability as a platform capability, not an application responsibility.
“It’s like having X-ray vision for your cluster. You see everything happening at the kernel level, without asking applications to cooperate.”
Did You Know?
- Pixie can capture full request/response bodies for HTTP, MySQL, PostgreSQL, and gRPC, without any code changes
- eBPF runs in the Linux kernel, so Pixie sees traffic that never reaches application-level instrumentation
- Pixie stores data locally in the cluster using only 5% of node memory, meaning sensitive data never leaves your infrastructure
- The PxL query language was designed to be learnable in under an hour—it’s Python-like with DataFrames
- Pixie was acquired by New Relic but remains fully open-source under Apache 2.0
The eBPF Revolution
What is eBPF?
eBPF allows you to run sandboxed programs inside the Linux kernel without changing kernel source code or loading kernel modules:
```
┌─────────────────────────────────────────────────────────┐
│ User Space                                              │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐        │
│   │  App A   │     │  App B   │     │  App C   │        │
│   │   (no    │     │   (no    │     │   (no    │        │
│   │ changes) │     │ changes) │     │ changes) │        │
│   └────┬─────┘     └────┬─────┘     └────┬─────┘        │
│        │                │                │              │
├────────┴────────────────┴────────────────┴──────────────┤
│ Kernel Space                                            │
│   ┌──────────────────────────────────────────────────┐  │
│   │                  eBPF Programs                   │  │
│   │   ┌─────────┐    ┌─────────┐    ┌─────────┐      │  │
│   │   │ Network │    │ Syscall │    │Schedule │      │  │
│   │   │ Probes  │    │ Probes  │    │ Probes  │      │  │
│   │   └────┬────┘    └────┬────┘    └────┬────┘      │  │
│   │        └──────────────┼───────────────┘          │  │
│   │                       ▼                          │  │
│   │              ┌────────────────┐                  │  │
│   │              │   eBPF Maps    │ ◄── Data         │  │
│   │              └───────┬────────┘     collection   │  │
│   └──────────────────────┼───────────────────────────┘  │
│                          ▼                              │
│                ┌──────────────────┐                     │
│                │   Pixie Agent    │ ◄── Reads eBPF maps │
│                │   (per node)     │                     │
│                └──────────────────┘                     │
└─────────────────────────────────────────────────────────┘
```
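To make the idea concrete before turning to Pixie, here is a standalone eBPF example using bpftrace (an assumption: bpftrace is installed on the host and you have root). It attaches a probe to the openat syscall tracepoint and prints every file opened on the node, the same mechanism Pixie's probes use, just hand-written:

```bash
# Trace every file open on the host, with the process that did it.
# bpftrace compiles this into an eBPF program, loads it into the
# kernel, and attaches it to the sys_enter_openat tracepoint.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
  printf("%s opened %s\n", comm, str(args->filename));
}'
```

No application was modified or restarted, yet every process on the node is now observable. Pixie applies the same technique to network and protocol tracing, at scale and automatically.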
Why eBPF for Observability?

| Traditional Approach | eBPF Approach |
|---|---|
| Modify application code | No code changes |
| Deploy new versions | Immediate visibility |
| SDK/library overhead | Near-zero overhead (~1% CPU) |
| Application-level only | Kernel-level visibility |
| Miss uninstrumented services | See everything |
| Data leaves cluster | Data stays local |
Pixie Architecture
```
┌──────────────────────────────────────────────────────────────────┐
│                        Kubernetes Cluster                        │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                        pl namespace                        │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │  │
│  │  │   Vizier    │  │    Cloud    │  │        NATS         │ │  │
│  │  │   (Query    │  │  Connector  │  │    (Message bus)    │ │  │
│  │  │   Engine)   │  │  (optional) │  │                     │ │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────────┘ │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐       │
│  │    Node 1     │   │    Node 2     │   │    Node 3     │       │
│  │  ┌─────────┐  │   │  ┌─────────┐  │   │  ┌─────────┐  │       │
│  │  │   PEM   │  │   │  │   PEM   │  │   │  │   PEM   │  │       │
│  │  │ (Pixie  │  │   │  │ (Pixie  │  │   │  │ (Pixie  │  │       │
│  │  │  Edge   │  │   │  │  Edge   │  │   │  │  Edge   │  │       │
│  │  │ Module) │  │   │  │ Module) │  │   │  │ Module) │  │       │
│  │  └─────────┘  │   │  └─────────┘  │   │  └─────────┘  │       │
│  │       ▲       │   │       ▲       │   │       ▲       │       │
│  │       │ eBPF  │   │       │ eBPF  │   │       │ eBPF  │       │
│  │  ┌────┴────┐  │   │  ┌────┴────┐  │   │  ┌────┴────┐  │       │
│  │  │ Kernel  │  │   │  │ Kernel  │  │   │  │ Kernel  │  │       │
│  │  └─────────┘  │   │  └─────────┘  │   │  └─────────┘  │       │
│  └───────────────┘   └───────────────┘   └───────────────┘       │
└──────────────────────────────────────────────────────────────────┘
```

Components
| Component | Role | Description |
|---|---|---|
| PEM | Data collection | DaemonSet running eBPF probes on each node |
| Vizier | Query engine | Executes PxL scripts, aggregates data from PEMs |
| Cloud Connector | Optional UI | Connects to Pixie Cloud for web UI (or self-host) |
| NATS | Message bus | Internal communication between components |
| Metadata Service | K8s context | Enriches data with pod names, services, etc. |
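After installation you can see these components directly in the cluster. A quick sketch using standard kubectl (exact resource names vary slightly between Pixie versions, so treat them as illustrative):

```bash
# The PEM runs as a DaemonSet, one pod per node
kubectl get daemonsets -n pl

# Vizier, the cloud connector, and NATS run as regular pods
kubectl get pods -n pl -o wide
```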
Installing Pixie
Section titled “Installing Pixie”Prerequisites
Pixie requires the following (a quick preflight check is sketched after the list):
- Kubernetes 1.21+
- Linux kernel 4.14+ (5.3+ recommended for full features)
- x86_64 architecture (ARM support experimental)
- Nodes with eBPF support (most modern distributions)
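A minimal preflight sketch using only standard kubectl; it assumes node kernel versions are reported accurately in node status:

```bash
# Kubernetes server version (need 1.21+)
kubectl version

# Kernel version and architecture of every node (need 4.14+, prefer 5.3+)
kubectl get nodes -o custom-columns='NODE:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,ARCH:.status.nodeInfo.architecture'
```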
Option 1: Pixie CLI (Recommended)
```bash
# Install the Pixie CLI
bash -c "$(curl -fsSL https://withpixie.ai/install.sh)"

# Deploy Pixie to your cluster
px deploy

# The CLI will:
# 1. Check cluster compatibility
# 2. Create a Pixie account (or use an existing one)
# 3. Deploy all Pixie components
```
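If the default PEM memory footprint is too heavy for your nodes, the deploy can be tuned down. A hedged sketch: the --pem_memory_limit flag and the 1Gi value are recalled from Pixie's deploy options, so verify against px deploy --help before relying on them:

```bash
# Cap PEM memory usage per node (verify the flag with: px deploy --help)
px deploy --pem_memory_limit=1Gi
```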
Option 2: Helm Installation

```bash
# Add the Pixie Helm repo
helm repo add pixie https://pixie-operator-charts.storage.googleapis.com
helm repo update

# Create the namespace
kubectl create namespace pl

# Install the Pixie operator
helm install pixie pixie/pixie-operator-chart \
  --namespace pl \
  --set deployKey=<your-deploy-key> \
  --set clusterName=my-cluster
```

Option 3: Self-Hosted (Air-Gapped)
For environments without internet access:
```bash
# Deploy the self-hosted Pixie Cloud
git clone https://github.com/pixie-io/pixie.git
cd pixie

# Build and deploy (requires significant resources)
./scripts/deploy_cloud.sh

# Then deploy Vizier pointing to the self-hosted cloud
px deploy --cloud_addr=<your-cloud-addr>
```

Verify Installation
```bash
# Check Pixie status
px get viziers

# Output:
# CLUSTER_NAME    ID        STATUS        LAST_HEARTBEAT
# my-cluster      abc-123   CS_HEALTHY    2s

# Check that all pods are running
kubectl get pods -n pl
```

The PxL Query Language
PxL (Pixie Language) is a Python-like language for querying telemetry data:
```python
# Basic structure of a PxL script
import px

# Get HTTP events
df = px.DataFrame('http_events')

# Filter and transform
df = df[df.resp_status >= 400]             # Only errors
df.latency_ms = df.resp_latency / 1000000  # Convert ns to ms

# Aggregate
df = df.groupby(['service', 'req_path']).agg(
    error_count=('latency_ms', px.count),
    avg_latency=('latency_ms', px.mean),
)

# Display results
px.display(df)
```

Key Data Sources
| DataFrame | What It Contains |
|---|---|
| http_events | All HTTP requests/responses with full headers |
| mysql_events | MySQL queries with response times |
| pgsql_events | PostgreSQL queries with timing |
| dns_events | DNS lookups and resolutions |
| conn_stats | TCP connection statistics |
| process_stats | CPU, memory per process |
| network_stats | Network I/O per pod |
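Every table follows the same query pattern shown above. As one more sketch, here is a query against conn_stats to surface the chattiest pod-to-remote pairs; the bytes_sent/bytes_recv column names are assumptions based on typical Pixie schemas, so confirm the exact fields in your cluster:

```python
import px

# TCP connection statistics from the last 5 minutes
df = px.DataFrame('conn_stats', start_time='-5m')

# Attach Kubernetes context so rows show pod names
df.pod = df.ctx['pod']

# Total bytes exchanged per pod/remote pair
# (bytes_sent / bytes_recv are assumed column names; verify in your cluster)
df = df.groupby(['pod', 'remote_addr']).agg(
    bytes_sent=('bytes_sent', px.sum),
    bytes_recv=('bytes_recv', px.sum),
)

px.display(df, 'Top Talkers')
```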
Practical Use Cases
1. Find Slow Endpoints (No Instrumentation Needed)
```python
import px

# Get all HTTP requests from the last 5 minutes
df = px.DataFrame('http_events', start_time='-5m')

# Calculate latency in milliseconds
df.latency_ms = df.resp_latency / 1000000

# Find the slowest endpoints
df = df.groupby(['service', 'req_path', 'req_method']).agg(
    count=('latency_ms', px.count),
    p50=('latency_ms', px.quantiles(0.5)),
    p99=('latency_ms', px.quantiles(0.99)),
    error_rate=('resp_status', lambda x: px.mean(x >= 400)),
)

# Filter to endpoints with significant traffic
df = df[df.count > 10]

# Sort by p99 latency
df = df.sort('p99', ascending=False)

px.display(df, 'Slowest Endpoints')
```

2. Trace a Specific Request
```python
import px

# Find requests with a specific trace ID (or any header)
df = px.DataFrame('http_events', start_time='-1h')

# Filter by a specific header or path
df = df[df.req_path.contains('/api/users/123')]

# Show full request/response details
df = df[['time_', 'service', 'req_method', 'req_path', 'req_body',
         'resp_status', 'resp_body', 'resp_latency']]

px.display(df)
```

3. Database Query Analysis
```python
import px

# Analyze PostgreSQL queries
df = px.DataFrame('pgsql_events', start_time='-10m')

# Calculate query latency
df.latency_ms = df.resp_latency / 1000000

# Group by query text (req_body holds the statement)
df = df.groupby(['service', 'req_body']).agg(
    count=('latency_ms', px.count),
    avg_latency=('latency_ms', px.mean),
    max_latency=('latency_ms', px.max),
)

# Find queries averaging over 100 ms
df = df[df.avg_latency > 100]
df = df.sort('avg_latency', ascending=False)

px.display(df, 'Slow PostgreSQL Queries')
```

4. Service Dependency Map
```python
import px

# Build a service-to-service communication map
df = px.DataFrame('http_events', start_time='-5m')

# Extract source and destination
df.source = df.ctx['pod']
df.destination = df.remote_addr

# Count connections between services
df = df.groupby(['source', 'destination']).agg(
    request_count=('resp_latency', px.count),
    avg_latency_ms=('resp_latency', lambda x: px.mean(x) / 1000000),
    error_count=('resp_status', lambda x: px.sum(x >= 400)),
)

px.display(df, 'Service Dependencies')
```

5. Network Troubleshooting
```python
import px

# Find network issues: DNS failures, connection errors
df = px.DataFrame('dns_events', start_time='-5m')

# Find DNS lookup failures (non-zero response code = error)
df = df[df.resp_code != 0]

df = df.groupby(['service', 'query_name']).agg(
    failure_count=('resp_code', px.count),
)

px.display(df, 'DNS Failures')
```

Pixie vs Traditional APM
| Aspect | Traditional APM (Datadog, New Relic) | Pixie |
|---|---|---|
| Setup Time | Days to weeks (instrument each service) | Minutes (cluster-wide) |
| Code Changes | Required for each application | None |
| Data Location | Sent to vendor cloud | Stays in cluster |
| Cost Model | Per-host, per-span pricing | Free (open-source) |
| Overhead | SDK overhead varies | ~1% CPU from eBPF |
| Protocol Support | What you instrument | Automatic for common protocols |
| Historical Data | Unlimited (you pay for it) | ~24 hours on-cluster |
When to Choose Pixie
Pixie is ideal when:
- You need immediate observability without code changes
- You have security/compliance requirements keeping data on-premise
- You’re debugging issues in services you don’t own
- You want to explore what’s happening before deciding what to instrument permanently
- You’re cost-conscious about observability spend
Traditional APM is better when:
- You need long-term historical data
- You want managed dashboards and alerting
- Your applications have custom business logic metrics
- You need distributed tracing across cloud boundaries
The Best of Both Worlds
Many teams use Pixie alongside traditional observability:
```
┌────────────────────────────────────────────────────────────┐
│                   Observability Strategy                   │
│                                                            │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐       │
│  │    Pixie     │  │OpenTelemetry │  │ Prometheus  │       │
│  │    (eBPF)    │  │    (SDK)     │  │  (Metrics)  │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬──────┘       │
│         │                 │                 │              │
│         ▼                 ▼                 ▼              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │               Unified Observability                  │  │
│  │   - Pixie for instant debugging & exploration        │  │
│  │   - OTel for business-specific spans                 │  │
│  │   - Prometheus for long-term metrics                 │  │
│  └──────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────┘
```

War Story: The Mystery of the Slow API
A fintech company’s payment service was experiencing intermittent slowdowns. The traditional APM showed normal latencies. The problem appeared and disappeared randomly.
The Investigation:
- APM showed: Payment service processing time = 50ms
- Users reported: 3-5 second delays
- Where was the missing time?
Enter Pixie:
```python
# They ran this query:
import px

df = px.DataFrame('http_events', start_time='-1h')
df = df[df.req_path.contains('/payments')]
df.latency_ms = df.resp_latency / 1000000

# Added network-level visibility
df.client_pod = df.ctx['pod']
df.dns_time = df.dns_resp_latency / 1000000

px.display(df[['time_', 'client_pod', 'latency_ms', 'dns_time', 'remote_addr']])
```

The Discovery: DNS lookups were timing out and falling back to secondary DNS servers. The payment service made a DNS lookup for each request to an internal fraud detection service. CoreDNS was being overwhelmed during traffic spikes.
Request latency breakdown:
- Application processing: 50ms (what APM saw)
- DNS lookup (primary): 3000ms timeout
- DNS lookup (secondary): 50ms
- Total user experience: 3100ms

The Fix: Added DNS caching and increased CoreDNS replicas.
The Lesson: Without kernel-level visibility, they would have blamed the payment service for weeks. Pixie saw what application instrumentation couldn’t.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Deploying on incompatible kernel | eBPF programs fail to load | Check kernel version: uname -r (need 4.14+, prefer 5.3+) |
| Running on GKE Autopilot | Can’t run privileged pods | Use GKE Standard or another provider |
| Expecting unlimited retention | Pixie stores ~24h on-cluster | Export important data to long-term storage |
| Querying encrypted traffic | Can’t see TLS content | Deploy Pixie’s SSL library support or use service mesh |
| High cardinality queries | PEM memory pressure | Aggregate data, limit result sets |
| Ignoring protocol support | Not all protocols are traced | Check supported protocols (HTTP/MySQL/PgSQL/DNS/gRPC) |
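The high-cardinality row deserves a concrete illustration. A sketch of the safe pattern, reusing the PxL conventions from this module: keep the time window short, aggregate before displaying, and cap the row count rather than pulling raw events:

```python
import px

# Risky: raw events over a long window can create PEM memory pressure
# df = px.DataFrame('http_events', start_time='-24h')

# Safer: short window, aggregate early, bounded output
df = px.DataFrame('http_events', start_time='-5m')
df.latency_ms = df.resp_latency / 1000000

df = df.groupby(['service', 'req_path']).agg(
    count=('latency_ms', px.count),
    p99=('latency_ms', px.quantiles(0.99)),
)

# Cap the result set (head() is used the same way elsewhere in this module)
px.display(df.head(100), 'Bounded Result Set')
```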
Advanced: Exporting Data
Export to Prometheus
```python
# In your PxL script, use px.export
import px

df = px.DataFrame('http_events', start_time='-1m')
df = df.groupby(['service', 'req_path']).agg(
    request_count=('resp_latency', px.count),
    latency_p99=('resp_latency', px.quantiles(0.99)),
)

# Export as Prometheus metrics
px.export(df, px.otel.Data(
    endpoint='prometheus-pushgateway:9091',
    resource={'service.name': 'pixie-export'},
))
```

Export to OpenTelemetry
```bash
# Deploy the Pixie OpenTelemetry plugin
helm upgrade pixie pixie/pixie-operator-chart \
  --namespace pl \
  --set otel.enabled=true \
  --set otel.endpoint=otel-collector:4317
```

This sends Pixie traces to any OTLP-compatible backend.
Hands-On Exercise: Debug a Performance Issue with Pixie
Objective: Use Pixie to identify and diagnose a performance bottleneck without any application instrumentation.
```bash
# Create a test namespace
kubectl create namespace pixie-demo

# Deploy a sample microservices application
kubectl apply -n pixie-demo -f https://raw.githubusercontent.com/pixie-io/pixie-demos/main/simple-gotracing/k8s/demo.yaml

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app=demo -n pixie-demo --timeout=120s
```

Task 1: Identify Slow Services
Using the Pixie CLI:
```bash
# Open the Pixie live view
px live

# Run the built-in service stats script
px run px/service_stats
```

Or using a custom PxL script:
```python
import px

df = px.DataFrame('http_events', start_time='-5m')
df.latency_ms = df.resp_latency / 1000000

stats = df.groupby('service').agg(
    request_count=('latency_ms', px.count),
    avg_latency=('latency_ms', px.mean),
    p99_latency=('latency_ms', px.quantiles(0.99)),
    error_rate=('resp_status', lambda x: px.mean(x >= 400) * 100),
)

stats = stats.sort('p99_latency', ascending=False)
px.display(stats, 'Service Performance')
```

Task 2: Find the Slow Endpoint
```python
import px

df = px.DataFrame('http_events', start_time='-5m')
df.latency_ms = df.resp_latency / 1000000

# Find the slowest endpoints
endpoints = df.groupby(['service', 'req_path', 'req_method']).agg(
    count=('latency_ms', px.count),
    avg_latency=('latency_ms', px.mean),
    max_latency=('latency_ms', px.max),
)

endpoints = endpoints[endpoints.count > 5]  # Filter noise
endpoints = endpoints.sort('avg_latency', ascending=False)

px.display(endpoints.head(10), 'Slowest Endpoints')
```

Task 3: Analyze Database Queries
```python
import px

# Check whether there are slow database queries
df = px.DataFrame('pgsql_events', start_time='-5m')
df.latency_ms = df.resp_latency / 1000000

queries = df.groupby(['service', 'req_body']).agg(
    exec_count=('latency_ms', px.count),
    avg_latency=('latency_ms', px.mean),
)

queries = queries.sort('avg_latency', ascending=False)
px.display(queries.head(10), 'Database Query Analysis')
```

Task 4: Create a Service Map
```python
import px

df = px.DataFrame('http_events', start_time='-5m')

# Build the service dependency graph
df.source_service = df.ctx['service']
df.dest_service = df.service

# Convert latency to milliseconds before aggregating
df.latency_ms = df.resp_latency / 1000000

edges = df.groupby(['source_service', 'dest_service']).agg(
    request_count=('latency_ms', px.count),
    p50_latency=('latency_ms', px.quantiles(0.5)),
)

px.display(edges, 'Service Map')
```

Success Criteria
- Identified which service has the highest latency
- Found the specific endpoint causing slowness
- Discovered whether database queries are the bottleneck
- Created a service dependency map showing communication patterns
- Did all of this without deploying any instrumentation code
Cleanup
```bash
kubectl delete namespace pixie-demo
```

Question 1
What technology does Pixie use to capture telemetry without code changes?
Show Answer
eBPF (extended Berkeley Packet Filter)
eBPF allows Pixie to run sandboxed programs inside the Linux kernel that capture network traffic, syscalls, and other events without requiring any application modifications.
Question 2
Where does Pixie store collected telemetry data by default?
Show Answer
Locally in the Kubernetes cluster
Pixie stores data in-cluster using approximately 5% of each node’s memory. This means sensitive data never leaves your infrastructure, making it suitable for environments with strict data residency requirements.
Question 3
What is the Pixie Edge Module (PEM)?
Show Answer
A DaemonSet that runs on each node to collect telemetry
The PEM is deployed as a DaemonSet, running one pod per node. It loads eBPF programs into the kernel and collects telemetry data from all pods on that node.
Question 4
What minimum kernel version does Pixie require?
Show Answer
Linux kernel 4.14+ (5.3+ recommended)
Kernel 4.14 is the minimum for basic functionality, but 5.3+ is recommended for full feature support including newer eBPF capabilities.
Question 5
Which protocols can Pixie automatically trace without instrumentation?
Show Answer
HTTP, MySQL, PostgreSQL, DNS, gRPC, Kafka, Redis
Pixie uses protocol-specific eBPF probes to automatically parse and trace these protocols. Encrypted traffic (TLS) requires additional configuration.
Question 6
What is PxL and what is it used for?
Show Answer
PxL (Pixie Language) is a Python-like query language for analyzing telemetry data
PxL uses DataFrames similar to pandas and allows you to filter, aggregate, and analyze the telemetry data collected by Pixie.
Question 7
How long does Pixie retain data by default?
Show Answer
Approximately 24 hours
Pixie is designed for real-time debugging and exploration, not long-term storage. Data is stored in memory on each node and typically retained for about 24 hours depending on available resources and data volume.
Question 8
What is the typical CPU overhead of running Pixie?
Show Answer
Approximately 1% CPU
eBPF programs run very efficiently in the kernel. The typical overhead is around 1% CPU per node, making Pixie suitable for production workloads.
Key Takeaways
- eBPF enables zero-instrumentation observability - see everything without changing code
- Pixie captures data at the kernel level - sees traffic that application APM misses
- Data stays in your cluster - no privacy concerns with vendor cloud storage
- PxL makes querying intuitive - Python-like syntax with DataFrames
- Sub-minute deployment - instant observability for any cluster
- Protocol-aware tracing - automatic parsing of HTTP, SQL, DNS, gRPC
- Complement, don’t replace - use alongside traditional APM
- Short-term by design - export important data for long-term storage
- Debugging superpower - find issues that instrumentation would miss
- Open-source with enterprise option - Apache 2.0 with optional cloud UI
Further Reading
- Pixie Documentation - Official guides and reference
- PxL Language Reference - Complete PxL syntax
- eBPF Explained - Understanding the underlying technology
- Pixie GitHub - Source code and examples
- PxL Script Library - Community scripts
Next Module
Continue to Module 1.7: Hubble - Network Observability with Cilium to learn about eBPF-based network observability and visualization.