Module 5.1: Cilium - The Kernel-Powered Network Revolution
Toolkit Track | Complexity: [COMPLEX] | Time: 60-75 minutes
The 3 AM Wake-Up Call
Your phone buzzes. Production is down. The ops channel is on fire.
```
[03:12 AM] @oncall ALERT: Payment service timeouts
[03:14 AM] @oncall Network team says "looks fine on their end"
[03:17 AM] @oncall It's DNS
[03:18 AM] @oncall It's always DNS
[03:23 AM] @oncall Wait, it's not DNS. Something is dropping packets.
[03:31 AM] @oncall Running tcpdump on all 47 pods. Send coffee.
[03:52 AM] @oncall Found it. NetworkPolicy was blocking the new service.
[03:54 AM] @oncall We have 200+ NetworkPolicies. Which one? No idea.
[04:23 AM] @oncall Fixed by adding another allow rule. We'll clean up later.
[04:24 AM] @oncall We never clean up later.
```

Sound familiar?
This is what Kubernetes networking feels like without proper tooling. You’re blind. Packets vanish into the void. Policies are write-only—you create them but never know which one is actually doing what.
Cilium changes everything. By the end of this module, when something drops packets, you’ll know exactly which policy dropped it, why, and you’ll see it happen in real-time. No more 4 AM tcpdump sessions.
What You’ll Learn:
- Why traditional networking can’t keep up with Kubernetes
- How eBPF lets you program the Linux kernel (without being a kernel developer)
- Identity-based security that actually makes sense
- Hubble: seeing every packet, every decision, every drop
- Replacing kube-proxy and why you’ll never miss it
Prerequisites:
- Kubernetes networking basics (Services, Pods)
- Security Principles Foundations
- A healthy frustration with iptables (optional but helps)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Deploy Cilium as a CNI plugin with eBPF-based networking and transparent encryption
- Configure Cilium network policies using L3/L4 and L7 identity-aware filtering rules
- Implement Cilium’s service mesh capabilities with sidecar-free mTLS and load balancing
- Monitor network flows and troubleshoot connectivity using Hubble’s observability dashboards
Why This Module Matters
Let me tell you about the moment I fell in love with Cilium.
We had a microservices architecture—127 services, because apparently we thought Netflix was a good role model. One service was mysteriously failing health checks. The app worked fine when tested directly. Network team said the network was fine. App team said the app was fine. Classic standoff.
With traditional tools, we would’ve spent hours with tcpdump and iptables debugging. Instead, I ran one command:
```bash
hubble observe --pod production/payment-service --verdict DROPPED
```

Three seconds later:

```
production/payment-service → production/health-checker DROPPED
Policy: production/legacy-lockdown (ingress)
```

The legacy-lockdown policy. Written 18 months ago by someone who left the company. It blocked traffic from a service that didn't exist when the policy was created.
Five-minute fix. Without Cilium, we’d still be debugging.
💡 Did You Know? When Google designed their next-generation internal networking, they chose eBPF—the same technology powering Cilium. The reason? At Google scale, traditional iptables rules would take minutes to update. With eBPF, updates happen in microseconds. Google Cloud GKE, AWS EKS, and Azure AKS all now offer Cilium as their CNI. It’s not just an alternative anymore—it’s becoming the default.
Part 1: Understanding the Problem (Before We Solve It)
The IPTables Nightmare
Before we talk about Cilium's solution, you need to feel the pain of the old way.
Every Kubernetes cluster runs kube-proxy. Every time you create a Service, kube-proxy adds iptables rules. Let’s see what that actually looks like:
```bash
# On a modest cluster with 500 services:
iptables-save | wc -l
# Output: 12,847 lines

# On a large cluster with 5,000 services:
iptables-save | wc -l
# Output: 147,291 lines
```

One hundred and forty-seven thousand lines of iptables rules.
Now imagine debugging why one specific packet was dropped.
THE IPTABLES DEBUGGING EXPERIENCE═══════════════════════════════════════════════════════════════════
You: "Why was my packet dropped?"
iptables: "Let me check... Chain PREROUTING → Chain KUBE-SERVICES → Chain KUBE-SVC-XYZABC123 → Chain KUBE-SEP-DEF456 → Chain KUBE-POSTROUTING → Actually I lost track. Somewhere in these 147,000 rules."
You: "Which rule specifically?"
iptables: "¯\_(ツ)_/¯"
You: "How do I see what's being blocked?"
iptables: "Add LOG rules everywhere. Parse the logs yourself. Good luck with the performance impact."
You: [opens job listings]

And it gets worse. When you update a Service:
TIME TO UPDATE 147,000 IPTABLES RULES═══════════════════════════════════════════════════════════════════
1. kube-proxy receives the Service update
2. kube-proxy rewrites ALL rules (it can't do incremental updates)
3. Takes ~5-30 seconds on large clusters
4. During the rewrite: connections drop, new connections may fail
5. All nodes do this simultaneously
6. Your monitoring alerts go crazy

This happens every time:
- A pod scales up/down
- A service is created/deleted
- An endpoint changes

At scale: dozens of times per minute.

This isn't a hypothetical. Datadog wrote about hitting this limit. So did Shopify. Large-scale Kubernetes users universally agree: iptables doesn't scale.
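The scaling gap is easy to feel in a toy model. This is ordinary Python, not real iptables or eBPF code, and the rule keys are invented; it only contrasts a linear chain scan with a single hash-map probe:

```python
# Toy model: iptables evaluates rules top-to-bottom (linear scan),
# while an eBPF map lookup is a single hash probe.

def iptables_style_lookup(rules, packet_key):
    """Scan rules in order until one matches: O(n) in rule count."""
    examined = 0
    for key, verdict in rules:
        examined += 1
        if key == packet_key:
            return verdict, examined
    return "DROP", examined

def ebpf_style_lookup(rule_map, packet_key):
    """One hash-map probe: O(1) regardless of rule count."""
    return rule_map.get(packet_key, "DROP"), 1

# 147,000 synthetic rules; the flow we care about matches the last one.
rules = [(f"flow-{i}", "ACCEPT") for i in range(147_000)]
rule_map = dict(rules)

print(iptables_style_lookup(rules, "flow-146999"))  # matched only after scanning every rule
print(ebpf_style_lookup(rule_map, "flow-146999"))   # same verdict after a single probe
```

The asymmetry is the whole story: adding the 147,001st service makes the scan slower but leaves the hash probe untouched.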
The NetworkPolicy Problem
Standard Kubernetes NetworkPolicies have a different problem: they're based on IP addresses.
```yaml
# This NetworkPolicy looks reasonable:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
```

Under the hood, this becomes:
```
"Allow traffic from IP 10.244.1.45 to port 80"
"Allow traffic from IP 10.244.2.23 to port 80"
"Allow traffic from IP 10.244.3.67 to port 80"
```

Now the frontend pod crashes and restarts. New IP: 10.244.1.99.
The CNI has to:
- Detect the IP change
- Update every policy that references frontend
- Push those updates to every node
- Hope nothing breaks during the transition
This happens constantly in Kubernetes. Pods restart, scale, move between nodes. IP addresses are ephemeral by design.
Building security on IP addresses is like building a house on quicksand.
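A tiny sketch of the quicksand in plain Python (illustrative only, not how any CNI is implemented; the IPs come from the example above): a policy keyed on the pod's IP silently breaks on restart, while one keyed on a label-derived identity does not.

```python
def identity_for(labels):
    """Stand-in for a label-derived identity: same labels, same identity."""
    return hash(frozenset(labels.items()))

frontend_labels = {"app": "frontend"}

# IP-based allow list, written against the pod's IP at policy-creation time
ip_allow = {"10.244.1.45"}

# Identity-based allow list, written against the labels
identity_allow = {identity_for(frontend_labels)}

# The frontend pod crashes and restarts with a new IP
old_ip, new_ip = "10.244.1.45", "10.244.1.99"

print(old_ip in ip_allow)                               # True  (worked before the restart)
print(new_ip in ip_allow)                               # False (policy silently broke)
print(identity_for(frontend_labels) in identity_allow)  # True  (labels didn't change)
```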
Part 2: Enter eBPF - Programming the Unprogrammable
What is eBPF?
eBPF stands for "extended Berkeley Packet Filter," but that name is misleading. It has evolved far beyond packet filtering.
Here’s the mental model that helped me understand it:
THE JAVASCRIPT OF THE LINUX KERNEL═══════════════════════════════════════════════════════════════════
Remember when browsers only displayed static HTML? Then JavaScript came along: "What if we could run code IN the browser?" Suddenly browsers could do anything.
eBPF is JavaScript for the Linux kernel.
Before eBPF:
- Want to change how networking works? Modify kernel code, recompile, reboot.
- Want to add tracing? Load a kernel module, pray it doesn't crash.
- Want custom packet processing? Install a userspace proxy, accept the overhead.

With eBPF:
- Write small programs that run INSIDE the kernel
- Load them dynamically, no reboot needed
- The kernel verifies they're safe before running
- They run at kernel speed (no userspace context switches)

Here's a concrete example. Traditional packet processing:
TRADITIONAL PACKET FLOW═══════════════════════════════════════════════════════════════════
```
Packet arrives at network card
  │
  ▼
Kernel receives packet
  │
  ▼
iptables chain 1 (PREROUTING)
  │
  ▼
iptables chain 2 (INPUT/FORWARD)
  │
  ▼
Routing decision
  │
  ▼
iptables chain 3 (OUTPUT)
  │
  ▼
iptables chain 4 (POSTROUTING)
  │
  ▼
Copy packet to userspace          ← EXPENSIVE!
  │
  ▼
Userspace proxy (kube-proxy/envoy/etc.)
  │
  ▼
Copy packet back to kernel        ← EXPENSIVE!
  │
  ▼
Finally reaches destination

Cost: ~50-100 microseconds per packet
      Multiple memory copies
      CPU cache thrashing
```

With eBPF:
eBPF PACKET FLOW═══════════════════════════════════════════════════════════════════
```
Packet arrives at network card
  │
  ▼
eBPF program runs (in kernel)
  - Looks up destination in a hash map: O(1)
  - Applies policy: O(1)
  - Rewrites headers if needed
  - Decides: forward, drop, or redirect
  │
  ▼
Packet reaches destination

Cost: ~5-10 microseconds per packet
      Zero memory copies
      Runs in kernel context
```

10x faster. Zero userspace involvement for most packets.

Why eBPF is Safe (Despite Running in the Kernel)
"Wait," I hear you thinking, "running arbitrary code in the kernel sounds terrifying."
You’re right. That’s why eBPF has a verifier:
THE eBPF VERIFIER: YOUR KERNEL'S BOUNCER═══════════════════════════════════════════════════════════════════
Before ANY eBPF program runs, the verifier checks:
✓ Does it terminate? (No infinite loops allowed)
✓ Does it access only allowed memory? (No kernel crashes)
✓ Does it use only allowed kernel functions?
✓ Does it handle all code paths? (No undefined behavior)
✓ Is its complexity bounded? (Max 1 million instructions)

If ANY check fails: the program is rejected and never runs.

This is why you can load eBPF programs on production systems without fear. The kernel itself guarantees they're safe.

💡 Did You Know? The eBPF verifier is so strict that it sometimes rejects valid programs that the human eye can see are safe. The Cilium team has contributed extensively to the Linux kernel to make the verifier smarter while maintaining safety. Writing eBPF programs that pass the verifier is an art; Cilium handles this complexity so you don't have to.
Part 3: Cilium Architecture - The Big Picture
Now that you understand eBPF, let's see how Cilium uses it:
CILIUM: THE COMPLETE PICTURE═══════════════════════════════════════════════════════════════════
```
                 ┌─────────────────────────────┐
                 │       KUBERNETES API        │
                 │ (Pods, Services, Policies)  │
                 └──────────────┬──────────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
 ┌────────▼────────┐   ┌────────▼────────┐   ┌────────▼────────┐
 │ CILIUM OPERATOR │   │  HUBBLE RELAY   │   │    HUBBLE UI    │
 │ (1 per cluster) │   │  (aggregation)  │   │ (visualization) │
 └─────────────────┘   └─────────────────┘   └─────────────────┘

═════════════════════ PER-NODE COMPONENTS ═══════════════════════

 On every node (identical on node 1, 2, 3, ...):

 ┌───────────────────────────────┐
 │         CILIUM AGENT          │
 │   • Policy Engine             │
 │   • Identity Manager          │
 │   • Hubble Observer           │
 │              │                │
 │       ┌──────▼──────┐         │
 │       │    eBPF     │         │
 │       │  DATAPLANE  │         │
 │       │ • Networking│         │
 │       │ • Policies  │         │
 │       │ • Load Bal. │         │
 │       │ • Encryption│         │
 │       └─────────────┘         │
 │                               │
 │  ┌────────┐      ┌────────┐   │
 │  │ Pod A  │      │ Pod B  │   │
 │  │ id=123 │      │ id=456 │   │
 │  └────────┘      └────────┘   │
 └───────────────────────────────┘
```

The Components Explained (Like You're New Here)
Cilium Agent (DaemonSet) - The worker bee on each node:
- Watches Kubernetes for pod/service/policy changes
- Compiles eBPF programs and loads them into the kernel
- Assigns identities to pods (more on this soon)
- Runs Hubble observer for local visibility
Cilium Operator - The coordinator (1 per cluster):
- Manages IP address allocation (IPAM)
- Handles garbage collection of stale resources
- Manages CRDs and cluster-wide operations
Hubble - The observability layer:
- Hubble (per-node): Captures flows from eBPF in real-time
- Hubble Relay: Aggregates flows from all nodes
- Hubble UI: Beautiful web interface for visualization
Installation: Your First Cilium Cluster
```bash
# Step 1: Install the Cilium CLI
# (The CLI makes installation and management much easier)
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --fail -o cilium-linux-amd64.tar.gz "https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz"
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
rm cilium-linux-amd64.tar.gz

# Step 2: Install Cilium with the good defaults
cilium install \
  --set kubeProxyReplacement=true \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

# Step 3: Wait for it to be ready
cilium status --wait

# Step 4: Verify everything works
cilium connectivity test
```

What `cilium connectivity test` actually does:
This isn’t a simple ping test. It deploys test workloads and verifies:
- Pod-to-pod connectivity (same node and cross-node)
- Pod-to-Service connectivity
- Pod-to-external connectivity
- Network policies are enforced correctly
- DNS resolution works
- Hubble observability captures flows
If this test passes, your networking is solid. If it fails, you’ll know exactly what’s broken.
Part 4: Identity-Based Security - The Game Changer
This is where Cilium fundamentally changes how you think about network security.
The Problem with IPs
Remember this scenario?
```yaml
# You write a policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
```

Behind the scenes, your CNI translates this to IP rules. Frontend pods have IPs 10.244.1.5 and 10.244.2.12, so the rule becomes "allow from 10.244.1.5 and 10.244.2.12."
Now frontend scales from 2 pods to 20 pods. Each new pod needs to be added. Pod crashes and restarts with new IP? Rule needs updating. Rolling deployment? Constant IP churn.
Cilium throws this model away entirely.
How Cilium Identity Works
CILIUM IDENTITY: THE "AHA!" MOMENT

```
Step 1: Pod is created with labels

  Pod: frontend-7b9f8c4d5-x2k9p
  Labels:
    app: frontend
    env: production
    team: checkout

Step 2: Cilium creates a NUMERIC IDENTITY from the labels

  Identity 48291 = {app=frontend, env=production, team=checkout}

  This identity is:
  • Cluster-wide (same on all nodes)
  • Stable (doesn't change when the pod restarts)
  • Shared (all pods with the same labels get the same identity)

Step 3: Every packet carries the identity, NOT the IP

  Network packet:
    Source Identity: 48291
    Dest Identity:   73842
    Payload:         HTTP GET /api/checkout

  The IP is still there for routing, but POLICY uses the identity.

Step 4: Policy enforcement uses the identity

  eBPF policy check:
    "Is identity 48291 allowed to reach identity 73842?"
    Lookup in an eBPF hash map: O(1) ← constant time!
    Answer: ALLOW or DENY

  No IP lookups. No rule scanning. Instant decision.
```

Why this matters:
- Pod restarts: Same labels = same identity. No policy updates needed.
- Scaling: 1 pod or 1000 pods with same labels = same identity. No rule explosion.
- Cross-cluster: Identity follows the workload. Works in multi-cluster setups.
- Debugging: "Who is identity 48291?" → `cilium identity get 48291` → instant answer.
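The allocation idea fits in a few lines of Python. This is a conceptual model only (Cilium's real allocator coordinates identities cluster-wide through its key-value store or CRDs); the label sets are illustrative:

```python
class IdentityAllocator:
    """Every distinct label set gets one numeric identity; all pods
    sharing that label set share the identity."""

    def __init__(self, first_id=256):  # 1-255 are reserved (host, world, health, ...)
        self.next_id = first_id
        self.by_labels = {}

    def identity_for(self, labels):
        key = frozenset(labels.items())
        if key not in self.by_labels:
            self.by_labels[key] = self.next_id
            self.next_id += 1
        return self.by_labels[key]

alloc = IdentityAllocator()
a = alloc.identity_for({"app": "frontend", "env": "production"})
b = alloc.identity_for({"app": "frontend", "env": "production"})  # a replica, or a restart
c = alloc.identity_for({"app": "backend", "env": "production"})

print(a == b)  # True: same labels, same identity, no matter how many pods or restarts
print(a == c)  # False: different labels, different identity

# Enforcement is then a constant-time lookup on (source, destination) identity pairs:
allowed = {(a, c)}         # "frontend may talk to backend"
print((a, c) in allowed)   # True
print((c, a) in allowed)   # False: backend may not initiate connections to frontend
```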
Seeing Identities in Action
```bash
# List all identities in your cluster
cilium identity list

# Output:
# IDENTITY   LABELS
# 1          reserved:host
# 2          reserved:world
# 4          reserved:health
# 48291      k8s:app=frontend,k8s:env=production,k8s:team=checkout
# 73842      k8s:app=backend,k8s:env=production
# 99103      k8s:app=database,k8s:env=production

# Get details on a specific identity
cilium identity get 48291

# See which endpoints have this identity
kubectl exec -n kube-system cilium-xxxxx -- cilium endpoint list | grep 48291
```

💡 Did You Know? Cilium reserves identity numbers 1-255 for special purposes. Identity 1 is always the host (the node itself), identity 2 is "world" (anything external to the cluster), and identity 4 is for health checks. This means you can write policies like "allow health checks" without knowing which IP ranges your health checkers use. It's beautiful.
Part 5: Network Policies - From Basic to “Wow”
Standard Kubernetes NetworkPolicy (Cilium Implements These)
Cilium fully supports standard Kubernetes NetworkPolicies. If you have existing policies, they keep working:
```yaml
# Standard NetworkPolicy - Cilium handles this perfectly
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

CiliumNetworkPolicy - The Enhanced Version
This is where it gets interesting. Cilium extends NetworkPolicies with features Kubernetes doesn't support:
```yaml
# Layer 7 (HTTP) policy - Kubernetes can't do this
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-http-policy
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: api-server
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              # Only allow specific HTTP methods and paths
              - method: "GET"
                path: "/api/v1/products.*"
              - method: "GET"
                path: "/api/v1/users/[0-9]+"
              - method: "POST"
                path: "/api/v1/orders"
                headers:
                  - 'Content-Type: application/json'
```

What this policy says in plain English:
“Frontend pods can connect to the API server on port 8080, but ONLY for:
- GET requests to `/api/v1/products*` (list/view products)
- GET requests to `/api/v1/users/<id>` (view a specific user)
- POST requests to `/api/v1/orders` with JSON content type (create orders)
Any other HTTP request? DENIED at the network layer.”
This is insanely powerful. An attacker who compromises your frontend can’t hit /api/v1/admin or send DELETE requests—the network itself blocks them.
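To see why the policy above is so restrictive, here's a toy matcher in Python. It is not Cilium's enforcement path (Cilium enforces L7 rules through its embedded Envoy proxy), and it assumes full-path regex matching, which may differ in detail from the real semantics; it only shows which requests the three rules admit:

```python
import re

# The method/path rules from the policy above, treated as full-match regexes
RULES = [
    ("GET",  r"/api/v1/products.*"),
    ("GET",  r"/api/v1/users/[0-9]+"),
    ("POST", r"/api/v1/orders"),
]

def verdict(method, path):
    for allowed_method, pattern in RULES:
        if method == allowed_method and re.fullmatch(pattern, path):
            return "FORWARDED"
    return "DROPPED"  # anything not explicitly allowed is denied

print(verdict("GET", "/api/v1/products/42"))  # FORWARDED
print(verdict("GET", "/api/v1/users/1007"))   # FORWARDED
print(verdict("DELETE", "/api/v1/orders"))    # DROPPED: method not in the allow list
print(verdict("GET", "/api/v1/admin"))        # DROPPED: path not in the allow list
```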
DNS-Based Egress Policies
One of my favorite Cilium features. Most security teams want to control which external services pods can reach:
```yaml
# Allow pods to reach only specific external services
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: payment-egress
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: payment-processor
  egress:
    # Allow internal services
    - toEndpoints:
        - matchLabels:
            app: order-service
    # Allow specific external APIs
    - toFQDNs:
        - matchName: "api.stripe.com"
        - matchName: "api.paypal.com"
        - matchPattern: "*.amazonaws.com"  # AWS services
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
    # Allow DNS (required for FQDN resolution)
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s:k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
```

How FQDN policies work under the hood:
FQDN POLICY MAGIC═══════════════════════════════════════════════════════════════════
1. Policy says: "Allow egress to api.stripe.com"
2. Cilium intercepts DNS queries from the pod
3. Pod asks: "What's the IP of api.stripe.com?"
4. DNS responds: "It's 52.84.150.1, 52.84.150.2, 52.84.150.3"
5. Cilium automatically adds these IPs to the allow list (stored in eBPF maps for O(1) lookup)
6. Pod connects to 52.84.150.1:443 → ALLOWED
7. Later, Stripe changes IPs (they do this a lot)
8. Next DNS query returns new IPs
9. Cilium updates the allow list automatically
10. You never have to touch the policy!

No more hardcoding CIDR blocks that break when cloud providers change IPs. No more overly permissive "allow all egress to 0.0.0.0/0" rules.
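The loop above can be sketched in a few lines. This is a conceptual model in plain Python, not Cilium's DNS proxy; the IPs are the example values from the walkthrough:

```python
allowed_fqdns = {"api.stripe.com"}
ip_allowlist = set()  # stands in for the eBPF map of allowed destination IPs

def on_dns_response(name, ips):
    """Invoked for every DNS answer the pod receives (Cilium snoops these)."""
    if name in allowed_fqdns:
        ip_allowlist.update(ips)  # the policy follows the DNS answer automatically

def egress_allowed(ip):
    return ip in ip_allowlist

# First resolution: the answer's IPs become allowed
on_dns_response("api.stripe.com", ["52.84.150.1", "52.84.150.2"])
print(egress_allowed("52.84.150.1"))  # True
print(egress_allowed("203.0.113.9"))  # False: never returned for an allowed name

# Stripe rotates its IPs; the next DNS answer updates the allow list, no policy edit needed
on_dns_response("api.stripe.com", ["52.84.151.7"])
print(egress_allowed("52.84.151.7"))  # True
```

The key design point: the pod can only connect to IPs it actually resolved from an allowed name, so "resolve one thing, connect to another" is blocked by construction.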
Cluster-Wide Policies
For policies that should apply everywhere (like "default deny"):
```yaml
# Default deny ALL traffic cluster-wide
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: default-deny
spec:
  endpointSelector: {}  # Applies to ALL pods
  ingress:
    - fromEndpoints:
        - {}  # Only allow from endpoints with a Cilium identity
  egress:
    - toEndpoints:
        - {}
    # Always allow essential services
    - toEntities:
        - kube-apiserver  # Pods need to reach the API server
        - dns             # Pods need DNS
---
# Explicitly allow health checks (they'd be denied by default-deny)
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-health-checks
spec:
  endpointSelector: {}
  ingress:
    - fromEntities:
        - health  # Cilium's reserved identity for health checks
```

The power of toEntities:
Instead of figuring out which IPs your kube-apiserver uses, which ports health checks come from, or which IPs your DNS servers have, Cilium provides semantic entities:
| Entity | What it means |
|---|---|
| `host` | The node the pod runs on |
| `remote-node` | Other nodes in the cluster |
| `kube-apiserver` | The Kubernetes API server |
| `health` | Health check probes |
| `dns` | DNS servers (kube-dns/CoreDNS) |
| `world` | Everything outside the cluster |
Part 6: Hubble - Seeing the Invisible
If Cilium is the brain, Hubble is the eyes.
The Old Way vs. The Hubble Way
THE OLD WAY:
1. Get alert: "Service unreachable"
2. SSH into the pod: `kubectl exec -it pod -- sh`
3. Run tcpdump: `tcpdump -i eth0 port 8080`
4. Wait for traffic...
5. Stare at hex dumps
6. Realize you need tcpdump on the OTHER pod too
7. SSH into the other pod
8. Run tcpdump there
9. Try to correlate timestamps across pods
10. Give up, ask the network team
11. Network team says "the network is fine"
12. Cry

THE HUBBLE WAY:
1. Get alert: "Service unreachable"
2. Run: `hubble observe --from-pod web --to-pod api --verdict DROPPED`
3. See the exact policy that dropped the traffic
4. Fix the policy
5. Go back to bed

Installing and Accessing Hubble
```bash
# Install the Hubble CLI
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/main/stable.txt)
curl -L --fail -o hubble-linux-amd64.tar.gz "https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
sudo tar xzvfC hubble-linux-amd64.tar.gz /usr/local/bin
rm hubble-linux-amd64.tar.gz

# Port-forward to Hubble Relay (needed to aggregate from all nodes)
cilium hubble port-forward &

# Now you can use hubble observe
hubble observe

# Access the UI (optional but beautiful)
cilium hubble ui
# Opens browser to http://localhost:12000
```

Hubble CLI - Your New Best Friend
```bash
# See ALL traffic in real-time
hubble observe

# Filter by namespace
hubble observe --namespace production

# Filter by a specific pod
hubble observe --pod production/frontend-abc

# See only DROPPED traffic (the gold mine for debugging)
hubble observe --verdict DROPPED

# See traffic between two specific services
hubble observe \
  --from-pod production/frontend \
  --to-pod production/backend

# Filter by protocol
hubble observe --protocol http
hubble observe --protocol dns
hubble observe --protocol tcp

# See HTTP requests with details
hubble observe --protocol http -o json | jq

# See DNS queries
hubble observe --protocol dns --namespace production

# Output format options
hubble observe -o compact  # One line per flow
hubble observe -o dict     # Readable dictionary format
hubble observe -o json     # JSON for scripting
hubble observe -o table    # Table format
```

Understanding Hubble Output
HUBBLE FLOW ANATOMY

```
Dec 9 10:23:45.123 production/frontend-7b9f8c4d5-x2k9p:46532 (ID:48291) -> production/backend-5d8f7b3a2-k9p2m:8080 (ID:73842) http-request FORWARDED (HTTP/1.1 GET /api/users)
```

Breaking this down:

- TIMESTAMP: `Dec 9 10:23:45.123`
- SOURCE: `production/frontend-7b9f8c4d5-x2k9p:46532 (ID:48291)` → namespace, pod name, source port, and the Cilium identity
- DESTINATION: `production/backend-5d8f7b3a2-k9p2m:8080 (ID:73842)` → namespace, pod name, destination port, and the Cilium identity
- FLOW TYPE: `http-request` (the protocol-level event)
- VERDICT: `FORWARDED` = allowed, `DROPPED` = blocked by policy, `ERROR` = something went wrong
- DETAILS: `(HTTP/1.1 GET /api/users)` = HTTP version, method, and path

Real Debugging Scenarios
Scenario 1: "My pod can't reach the database"
```bash
# Step 1: See what's being dropped
hubble observe \
  --from-pod production/myapp \
  --to-pod production/postgres \
  --verdict DROPPED

# Output:
# production/myapp-xxx -> production/postgres-yyy
# policy-verdict:none DROPPED (Policy denied)

# "policy-verdict:none" tells you there's no ALLOW rule.
# You need to add a policy to permit this traffic.
```

Scenario 2: "External API calls are failing"
```bash
# Check egress traffic
hubble observe \
  --from-pod production/myapp \
  --verdict DROPPED \
  --type l3/l4

# Output:
# production/myapp-xxx -> 52.84.150.1:443
# policy-verdict:none DROPPED (Policy denied)

# Your egress policy doesn't allow this IP.
# Check whether you need to add FQDN rules.
```

Scenario 3: "DNS is slow/failing"
```bash
# Watch DNS queries
hubble observe --protocol dns --namespace production

# Output:
# production/myapp -> kube-system/coredns
# dns-request FORWARDED (Query api.stripe.com A)
# kube-system/coredns -> production/myapp
# dns-response FORWARDED (Answer: 52.84.150.1)

# If you see DROPPED DNS queries, check your egress policies.
```

Hubble Metrics for Prometheus
```bash
# Enable metrics during Cilium install
cilium install \
  --set hubble.enabled=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"

# Or upgrade an existing installation
cilium upgrade \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"
```

Key metrics to alert on:
```yaml
# Prometheus alert examples
groups:
  - name: cilium
    rules:
      # Alert on packet drops (excluding expected drops)
      - alert: HighPacketDropRate
        expr: rate(hubble_drop_total{reason!="Policy denied"}[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High packet drop rate on {{ $labels.instance }}"

      # Alert on DNS failures
      - alert: DNSErrors
        expr: rate(hubble_dns_responses_total{rcode!="No Error"}[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS errors detected: {{ $labels.rcode }}"

      # Alert on HTTP 5xx errors
      - alert: HTTP5xxErrors
        expr: rate(hubble_http_responses_total{status=~"5.."}[5m]) > 10
        for: 5m
        labels:
          severity: critical
```

💡 Did You Know? Hubble captures flows using eBPF, which means there's no sampling. Unlike traditional monitoring that might capture 1 in 1000 packets, Hubble sees EVERY packet. If something happened on the network, Hubble saw it. This makes Hubble invaluable for security auditing: you have a complete record of all network communication.
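Because Hubble can emit JSON (`hubble observe -o json`), flows are also easy to post-process in scripts. A sketch: real flow records are much richer and nested differently depending on your Hubble version, so the three flat records below are invented stand-ins for the idea of "count drops by source/destination pair":

```python
import json
from collections import Counter

# Invented, simplified stand-ins for JSON flow records (one per line;
# real records carry many more fields and nested objects).
raw = "\n".join(json.dumps(r) for r in [
    {"verdict": "FORWARDED", "src": "production/frontend", "dst": "production/backend"},
    {"verdict": "DROPPED",   "src": "production/redis",    "dst": "production/postgres"},
    {"verdict": "DROPPED",   "src": "production/redis",    "dst": "production/postgres"},
])

# Count dropped flows per (source, destination) pair: the quickest way
# to spot which policy is biting which workload.
drops = Counter()
for line in raw.splitlines():
    flow = json.loads(line)
    if flow["verdict"] == "DROPPED":
        drops[(flow["src"], flow["dst"])] += 1

for (src, dst), n in drops.most_common():
    print(f"{src} -> {dst}: {n} drops")
```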
Part 7: Replacing Kube-Proxy
Why This Matters
Remember those 147,000 iptables rules? Let's get rid of them.
```bash
# Install Cilium as a kube-proxy replacement
cilium install --set kubeProxyReplacement=true

# Verify it's working
cilium status | grep KubeProxyReplacement
# KubeProxyReplacement: True [eth0 (Direct Routing)]

# See all Services handled by Cilium
kubectl exec -n kube-system ds/cilium -- cilium service list

# Compare the difference:
# BEFORE (kube-proxy):
#   iptables-save | wc -l
#   147,291
# AFTER (Cilium):
#   iptables-save | wc -l
#   127  ← Only basic rules remain
```

Performance Comparison
Real benchmarks from production clusters:
| Metric | kube-proxy (iptables) | Cilium eBPF | Improvement |
|---|---|---|---|
| Service lookup latency | ~2ms (5000 services) | ~100μs | 20x faster |
| Memory usage | Grows with services | Constant | Predictable |
| Rule update time | 5-30 seconds | Milliseconds | 1000x faster |
| Connection drops on update | Yes | No | Zero downtime |
| CPU usage at scale | High | Low | 50-70% reduction |
The DSR Bonus: Direct Server Return
DIRECT SERVER RETURN (DSR)
```
Without DSR (traditional):

  Client → Load Balancer → Backend Pod
  Client ← Load Balancer ← Backend Pod
             ↑ Return traffic goes through the LB too
               (extra hop, extra latency)

With DSR (Cilium):

  Client → Load Balancer → Backend Pod
  Client ←──────────────── Backend Pod
             ↑ Return traffic goes DIRECTLY to the client
               (faster response, less load on the LB)
```

Enable DSR:

```bash
cilium install \
  --set kubeProxyReplacement=true \
  --set loadBalancer.mode=dsr
```

Part 8: Transparent Encryption with WireGuard
Encrypting all pod-to-pod traffic sounds hard. With Cilium, it's one flag.
The Problem
UNENCRYPTED CLUSTER TRAFFIC

```
Pod A ─────────────────────────────────────▶ Pod B
  │                                           │
  │   Network traffic crosses:                │
  │   • Virtual switches                      │
  │   • Physical switches                     │
  │   • Sometimes the public internet         │
  │     (cross-AZ, cross-region)              │
  │                                           │
  └──────── All visible to anyone ────────────┘
            with network access
```

Attackers can:
- Read sensitive data
- Capture credentials
- Mount man-in-the-middle attacks

The Solution
Section titled “The Solution”# Enable WireGuard encryptioncilium install \ --set encryption.enabled=true \ --set encryption.type=wireguard
# Verify encryption statuscilium status | grep Encryption# Encryption: Wireguard [NodeEncryption: Disabled, cilium_wg0 (Pubkey: xxx)]
# Check WireGuard peerskubectl exec -n kube-system ds/cilium -- cilium encrypt statusWhat happens now:
ENCRYPTED CLUSTER TRAFFIC═══════════════════════════════════════════════════════════════════
```
Pod A ═════════════════════════════════════▶ Pod B
  │                                           │
  │   All traffic encrypted with WireGuard    │
  │   (state-of-the-art crypto)               │
  │                                           │
  │   • No app changes needed                 │
  │   • No sidecar containers                 │
  │   • Kernel-level encryption               │
  │   • ~5% overhead (negligible)             │
  │                                           │
  └──────── Attackers see garbage ────────────┘
```

Zero application changes. Your apps don't know encryption is happening. It's transparent at the kernel level.
Part 9: Common Mistakes (Learn From Others’ Pain)
| Mistake | Why It Hurts | How To Avoid |
|---|---|---|
| Skipping connectivity test | You think it’s working, it’s not | Always run cilium connectivity test after install |
| Installing over existing CNI | CNI conflicts break everything | Remove old CNI completely first, or use fresh cluster |
| No default deny | Wide open by default = security hole | Always set cluster-wide default deny |
| Forgetting DNS in egress | Pods can’t resolve external hosts | Always allow egress to kube-dns (port 53) in your policies |
| Overly broad FQDN patterns | *.com defeats the purpose | Use specific FQDNs: api.stripe.com not *.stripe.com |
| Not enabling Hubble | Flying blind | Hubble is free, always enable it |
| Ignoring Hubble metrics | Miss issues until they’re incidents | Alert on hubble_drop_total and hubble_dns_* |
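To act on the last two rows, here is a sketch of a Prometheus alerting rule on Hubble’s drop counter. It assumes Hubble metrics are enabled and scraped by Prometheus; the alert name, threshold, duration, and the `reason` label grouping are illustrative, not official recommendations:

```yaml
# Illustrative Prometheus alerting rule - tune the threshold to your traffic
groups:
- name: hubble
  rules:
  - alert: HubblePacketDrops
    # Fires when dropped flows are sustained for 5 minutes
    expr: sum(rate(hubble_drop_total[5m])) by (reason) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Hubble is observing dropped flows (reason: {{ $labels.reason }})"
```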
War Story: The Policy That Ate Christmas
December 23rd, 2022. Large e-commerce platform. Black Friday went perfectly. Everyone was relaxed.
At 2:47 PM, a junior engineer deployed what seemed like a simple change: a new CiliumNetworkPolicy to restrict database access. The policy worked in staging.
```yaml
# The policy that ruined Christmas
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: database-security
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: postgres
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: backend
        environment: production
```

What they missed: the caching service (Redis) also needed database access. It carried `app: cache`, not `app: backend`.
At 2:48 PM:
- Cache invalidation failed
- Stale product data started serving
- Wrong prices shown to customers
At 2:52 PM:
- Monitoring detected increased error rates
- On-call engineer paged
At 2:54 PM:
- Engineer ran: `hubble observe --to-pod production/postgres --verdict DROPPED`
- Output showed: `production/redis-xxx -> production/postgres DROPPED`
- Root cause identified in 2 minutes
At 2:56 PM:
- Policy updated to include cache service
- Traffic restored
Total incident duration: 8 minutes
Without Hubble? This would’ve been a multi-hour outage. The team would’ve blamed DNS (it’s always DNS), then the load balancer, then the database itself. Eventually, maybe, someone would’ve checked network policies.
Lessons:
- Always test policies against ALL services, not just the obvious ones
- Hubble is not optional—it’s your incident response tool
- `--verdict DROPPED` is the most important filter you’ll ever use
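For completeness, a sketch of the corrected policy from the story: the only change is a second `fromEndpoints` selector for the cache service. Label names follow the narrative above; verify them against your actual pod labels before applying anything like this:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: database-security
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: postgres
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: backend
        environment: production
    # The fix: the Redis cache also needs database access
    - matchLabels:
        app: cache
        environment: production
```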
Question 1
You deploy a default-deny policy and suddenly nothing works. Not even DNS. What’s the minimum policy you need to restore basic functionality?
Show Answer
```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-essential
spec:
  endpointSelector: {}
  egress:
  # Allow CoreDNS queries (there is no "dns" entity; select the kube-dns pods)
  - toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
  # Allow pods to reach the API server
  - toEntities:
    - kube-apiserver
  ingress:
  # Allow health probes
  - fromEntities:
    - health
```

This restores:
- DNS resolution (pods can resolve names)
- API server access (service accounts work)
- Health checks (probes don’t fail)
From here, add specific policies for your workloads.
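For instance, one such workload-specific policy might look like the sketch below. Every name here (`allow-web-ingress`, `app: web`, `app: gateway`, port 8080) is illustrative, not from the module:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-web-ingress   # illustrative name
  namespace: default
spec:
  endpointSelector:
    matchLabels:
      app: web
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: gateway
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
```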
Question 2
A pod is failing to connect to api.stripe.com. How do you debug this with Hubble?
Show Answer
```bash
# Step 1: Check if connection attempts are being dropped
hubble observe \
  --from-pod production/payment-service \
  --verdict DROPPED

# Step 2: Check DNS is resolving
hubble observe \
  --from-pod production/payment-service \
  --protocol dns

# Step 3: Check the specific destination
hubble observe \
  --from-pod production/payment-service \
  --to-fqdn api.stripe.com

# Common issues:
# - DNS queries dropped → allow egress to kube-dns on port 53
# - Connection dropped  → add toFQDNs with matchName: api.stripe.com
# - Policy denied       → check your CiliumNetworkPolicy
```

Question 3
Why does Cilium use identity numbers instead of IP addresses for policy enforcement?
Show Answer
IP-based problems:
- Pods get new IPs when restarting
- Scaling creates new IPs constantly
- Rolling updates = continuous IP churn
- Policies must be updated for every IP change
- Can’t express “frontend talks to backend” semantically
Identity-based advantages:
- Identity is based on labels, not IPs
- Same labels = same identity, regardless of IP
- 1 pod or 1000 pods = same identity if labels match
- Policies are stable (no updates needed when IPs change)
- Human-readable: “identity 48291 = frontend” makes sense
- O(1) lookup in eBPF hash maps
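Conceptually, identity allocation is just a stable mapping from a label set to a number, looked up in constant time. The sketch below is plain Python, not Cilium’s implementation; the identity number 48291 is reused from the example purely for continuity:

```python
# Conceptual sketch of identity-based policy (not Cilium's actual code).
# An identity is allocated per unique label set; pods share it regardless of IP.

identities: dict = {}
next_id = 48291  # arbitrary starting number, matching the example below

def identity_for(labels: dict) -> int:
    """Return a stable numeric identity for a label set, allocating if new."""
    global next_id
    key = frozenset(labels.items())
    if key not in identities:
        identities[key] = next_id
        next_id += 1
    return identities[key]

# Two frontend pods with different IPs (even after restarts) share one identity
a = identity_for({"app": "frontend", "env": "prod"})  # pod at 10.0.1.7
b = identity_for({"app": "frontend", "env": "prod"})  # replacement pod at 10.0.2.9
assert a == b

# A backend pod gets a different identity, so a policy can say
# "frontend may talk to backend" as an O(1) lookup keyed by identity pair
c = identity_for({"app": "backend", "env": "prod"})
allowed = {(a, c)}  # policy table keyed by (source, destination) identity
print((a, c) in allowed)  # True: frontend -> backend is allowed
```

Note that the lookup never consults an IP address: the policy table stays the same size however many pods share each label set.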
Example:
Pod with labels {app: frontend, env: prod} → Identity 48291
This pod can:
- Restart 100 times
- Scale to 50 replicas
- Move across nodes
Identity stays 48291. Policies keep working.

Hands-On Exercise: Build a Secure Microservices Setup
Objective
Deploy a three-tier application with Cilium, implement zero-trust networking, and observe traffic with Hubble.
Scenario
You’re deploying a web application with:
- Frontend: Nginx serving static content
- API: Node.js backend
- Database: PostgreSQL
Security requirements:
- Default deny all traffic
- Frontend can only reach API on port 3000
- API can only reach database on port 5432
- All pods can reach DNS
- No direct frontend-to-database access
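Before writing any YAML, it can help to state these requirements as an explicit allow-matrix and sanity-check expectations against it. This is a purely illustrative Python sketch, not anything Cilium consumes (DNS, which all pods may use, is left out of the matrix):

```python
# The security requirements as an explicit allow-matrix (illustrative only).
# Key: (source app, destination app, destination port)
ALLOW = {
    ("frontend", "api", 3000),
    ("api", "database", 5432),
}

def is_allowed(src: str, dst: str, port: int) -> bool:
    """Default deny: only explicitly listed flows pass."""
    return (src, dst, port) in ALLOW

# Check each stated requirement against the matrix
assert is_allowed("frontend", "api", 3000)
assert is_allowed("api", "database", 5432)
assert not is_allowed("frontend", "database", 5432)  # no direct frontend -> DB
assert not is_allowed("frontend", "api", 80)         # wrong port is denied too
print("matrix matches the stated requirements")
```

The policies in Part 4 are just this matrix translated into CiliumNetworkPolicy resources, plus the DNS allowance.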
Part 1: Set Up the Cluster
```bash
# Create a kind cluster without the default CNI
cat > kind-config.yaml << 'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  kubeProxyMode: none
nodes:
- role: control-plane
- role: worker
- role: worker
EOF

kind create cluster --config kind-config.yaml --name cilium-lab

# Install Cilium
cilium install \
  --set kubeProxyReplacement=true \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

# Wait for Cilium to be ready
cilium status --wait

# Verify installation
cilium connectivity test
```

Part 2: Deploy the Application
```bash
# Create namespace
kubectl create namespace demo

# Deploy database
kubectl -n demo apply -f - << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: database
  labels:
    app: database
    tier: data
spec:
  containers:
  - name: postgres
    image: postgres:15
    env:
    - name: POSTGRES_PASSWORD
      value: "secret"
    ports:
    - containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: database
spec:
  selector:
    app: database
  ports:
  - port: 5432
EOF

# Deploy API
# Note: stock nginx listens on 80, but the Service and the policies below
# use 3000, so rewrite nginx's config to listen on 3000.
kubectl -n demo apply -f - << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: api
  labels:
    app: api
    tier: backend
spec:
  containers:
  - name: api
    image: nginx
    command: ["/bin/sh", "-c"]
    args:
    - "sed -i 's/listen  *80;/listen 3000;/' /etc/nginx/conf.d/default.conf && exec nginx -g 'daemon off;'"
    ports:
    - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
  - port: 3000
EOF

# Deploy frontend
kubectl -n demo apply -f - << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: frontend
  labels:
    app: frontend
    tier: web
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
EOF
```

Part 3: Test Without Policies (Everything Works)
```bash
# Start Hubble port-forward in the background
cilium hubble port-forward &

# Test frontend → api (should work)
kubectl -n demo exec frontend -- curl -s --max-time 5 api:3000
echo "Frontend → API: SUCCESS"

# Test frontend → database (should also work - this is the problem!)
kubectl -n demo exec frontend -- nc -zv database 5432
echo "Frontend → Database: SUCCESS (but shouldn't be allowed!)"

# Test api → database (should work)
kubectl -n demo exec api -- nc -zv database 5432
echo "API → Database: SUCCESS"

# Watch traffic with Hubble
hubble observe --namespace demo
```

Part 4: Implement Zero-Trust Policies
```bash
# Step 1: Default deny everything
kubectl -n demo apply -f - << 'EOF'
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny
spec:
  endpointSelector: {}
  ingress: []
  egress: []
EOF

# Test again - everything should fail now
kubectl -n demo exec frontend -- curl -s --max-time 5 api:3000 || echo "Frontend → API: BLOCKED (expected)"
kubectl -n demo exec api -- nc -zv -w 2 database 5432 || echo "API → Database: BLOCKED (expected)"

# Watch the drops!
hubble observe --namespace demo --verdict DROPPED

# Step 2: Allow DNS (required for name resolution)
# Note: there is no "dns" entity; select the kube-dns pods instead.
kubectl -n demo apply -f - << 'EOF'
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-dns
spec:
  endpointSelector: {}
  egress:
  - toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
EOF

# Step 3: Allow frontend → api
kubectl -n demo apply -f - << 'EOF'
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: frontend-to-api
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "3000"
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: frontend-egress
spec:
  endpointSelector:
    matchLabels:
      app: frontend
  egress:
  - toEndpoints:
    - matchLabels:
        app: api
    toPorts:
    - ports:
      - port: "3000"
EOF

# Step 4: Allow api → database
kubectl -n demo apply -f - << 'EOF'
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-to-database
spec:
  endpointSelector:
    matchLabels:
      app: database
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: api
    toPorts:
    - ports:
      - port: "5432"
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-egress
spec:
  endpointSelector:
    matchLabels:
      app: api
  egress:
  - toEndpoints:
    - matchLabels:
        app: database
    toPorts:
    - ports:
      - port: "5432"
EOF
```

Part 5: Verify Security
```bash
# Frontend → API: Should work
kubectl -n demo exec frontend -- curl -s --max-time 5 api:3000
echo "✓ Frontend → API: ALLOWED"

# API → Database: Should work
kubectl -n demo exec api -- nc -zv -w 2 database 5432
echo "✓ API → Database: ALLOWED"

# Frontend → Database: Should be BLOCKED
kubectl -n demo exec frontend -- nc -zv -w 2 database 5432 || echo "✓ Frontend → Database: BLOCKED (as intended!)"

# Watch the flow in Hubble
hubble observe --namespace demo

# See what's being dropped
hubble observe --namespace demo --verdict DROPPED
```

Success Criteria
- Cilium installed and connectivity test passes
- Default deny policy blocks all traffic
- Hubble shows DROPPED verdict for blocked traffic
- Frontend can reach API on port 3000
- API can reach Database on port 5432
- Frontend CANNOT reach Database directly
- Hubble shows FORWARDED for allowed traffic
Bonus Challenge
Add an L7 policy that only allows HTTP GET requests from frontend to api:
```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: frontend-to-api-l7
  namespace: demo
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "3000"
      rules:
        http:
        - method: "GET"
          path: "/.*"
```

Test that POST requests are blocked:
```bash
# --fail makes curl exit non-zero on the 403 the L7 proxy returns for denied requests
kubectl -n demo exec frontend -- curl -s --fail -X POST api:3000 || echo "POST blocked by L7 policy"
kubectl -n demo exec frontend -- curl -s --fail -X GET api:3000 && echo "GET allowed"
```

Cleanup
```bash
# Delete the lab cluster
kind delete cluster --name cilium-lab
```

Further Reading
- Cilium Documentation - The official docs, well-written
- eBPF.io - Deep dive into eBPF technology
- Cilium Network Policy Editor - Visual policy builder (great for learning)
- Hubble Documentation
- Isovalent Blog - Advanced Cilium use cases from the creators
Next Module
Continue to Module 5.2: Service Mesh to learn about service mesh patterns with Istio, and when sidecar-free approaches make sense.
“The network that explains itself is the network you can actually secure.”