Module 1.2: Chaos Mesh Fundamentals
Discipline Module | Complexity:
[MEDIUM]| Time: 2.5 hours
Prerequisites
Section titled “Prerequisites”Before starting this module:
- Required: Module 1.1: Principles of Chaos Engineering — Understand hypotheses, blast radius, and abort conditions
- Required: Kubernetes Basics — Deployments, Services, Namespaces, RBAC
- Recommended: A running Kubernetes cluster (kind or minikube)
- Recommended: Helm 3 installed
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Implement Chaos Mesh on Kubernetes with proper RBAC, namespace scoping, and experiment scheduling
- Design pod-level chaos experiments — kill, CPU stress, memory stress, I/O delay — with Chaos Mesh CRDs
- Configure Chaos Mesh dashboards and workflows for recurring reliability validation
- Build automated chaos experiments that run as part of CI/CD pipelines before production deployments
Why This Module Matters
Section titled “Why This Module Matters”In the previous module, you learned the theory of Chaos Engineering — hypotheses, steady state, blast radius, safety nets. Theory is essential, but it doesn’t tell you anything you didn’t already suspect about your systems.
What actually tells you something new is running the experiment.
In March 2020, a fintech startup in Singapore discovered that their payment processing service had a retry storm vulnerability. They didn’t discover it during code review, load testing, or penetration testing. They discovered it during a Chaos Mesh pod-kill experiment that exposed a missing circuit breaker. The fix took 30 minutes. Without the experiment, the bug would have surfaced during their next traffic spike — potentially costing them their banking license.
Chaos Mesh is the most widely adopted chaos engineering platform for Kubernetes. It’s a CNCF Incubating project, fully open source, and designed to inject failures through Kubernetes-native CRDs. You don’t write scripts. You don’t SSH into nodes. You declare the failure you want, apply a YAML manifest, and Chaos Mesh handles the injection using Linux kernel capabilities.
This module teaches you to install Chaos Mesh, understand its architecture, and run your first real chaos experiments: pod kills, network faults, and resource stress.
Did You Know?
Section titled “Did You Know?”Chaos Mesh was created by PingCAP (makers of TiDB, a distributed SQL database) because they needed to test TiDB’s fault tolerance. Testing a distributed database requires injecting every kind of failure imaginable — network partitions, disk delays, clock skew, process crashes. They built Chaos Mesh for themselves, then donated it to the CNCF in 2020 because they realized every Kubernetes operator needed the same tool.
Chaos Mesh uses Linux namespaces and cgroups to inject faults without modifying your application code. When you inject network latency, Chaos Mesh uses
tc(traffic control) inside the target pod’s network namespace. When you inject CPU stress, it usesstress-nginside the target’s cgroup. Your application has no idea the faults are being injected — which is exactly the point.
The Chaos Mesh Dashboard has prevented more accidental outages than any documentation. Engineers who can visually see the blast radius, active experiments, and abort buttons are 3x less likely to make mistakes compared to those using only CLI commands. Always use the dashboard for your first experiments.
Chaos Mesh supports 12 distinct fault types across pod, network, stress, IO, time, DNS, HTTP, JVM, kernel, AWS, GCP, and Azure domains. This makes it the most comprehensive single chaos tool available for Kubernetes — most alternatives cover only 3-4 fault types.
Chaos Mesh Architecture
Section titled “Chaos Mesh Architecture”How It Works
Section titled “How It Works”Chaos Mesh runs as a set of controllers and daemon pods inside your Kubernetes cluster:
┌─────────────────────────────────────────────────────────┐│ Kubernetes Cluster ││ ││ ┌─────────────────────────────────────────────────┐ ││ │ Chaos Mesh Control Plane │ ││ │ │ ││ │ ┌──────────────┐ ┌──────────────────────────┐ │ ││ │ │ chaos- │ │ chaos-controller-manager │ │ ││ │ │ dashboard │ │ │ │ ││ │ │ (Web UI) │ │ • Watches CRDs │ │ ││ │ └──────────────┘ │ • Schedules experiments │ │ ││ │ │ • Manages lifecycle │ │ ││ │ └──────────────────────────┘ │ ││ └─────────────────────────────────────────────────┘ ││ ││ ┌─────────────────────────────────────────────────┐ ││ │ Chaos Mesh Data Plane │ ││ │ │ ││ │ Node 1 Node 2 Node 3 │ ││ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ ││ │ │chaos-daemon│ │chaos-daemon│ │chaos-daemon│ │ ││ │ │ │ │ │ │ │ │ ││ │ │ • Injects │ │ • Injects │ │ • Injects │ │ ││ │ │ faults │ │ faults │ │ faults │ │ ││ │ │ • Uses │ │ • Uses │ │ • Uses │ │ ││ │ │ nsenter │ │ nsenter │ │ nsenter │ │ ││ │ └────────────┘ └────────────┘ └────────────┘ │ ││ └─────────────────────────────────────────────────┘ ││ ││ ┌───────────────────────────────────────┐ ││ │ Your Application Pods (targets) │ ││ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ ││ │ │Pod A│ │Pod B│ │Pod C│ │Pod D│ │ ││ │ └─────┘ └─────┘ └─────┘ └─────┘ │ ││ └───────────────────────────────────────┘ │└─────────────────────────────────────────────────────────┘Key components:
| Component | What It Does | Runs As |
|---|---|---|
chaos-controller-manager | Watches CRD objects, reconciles desired chaos state | Deployment (1-3 replicas) |
chaos-daemon | Enters target pod namespaces to inject faults | DaemonSet (every node) |
chaos-dashboard | Web UI for managing experiments | Deployment (1 replica) |
The CRD Model
Section titled “The CRD Model”Chaos Mesh is entirely CRD-driven. You create a Kubernetes custom resource, and the controller handles everything:
You apply YAML → Controller sees it → Controller tells chaos-daemon on correct node → Daemon enters pod's namespace → Daemon injects fault (tc, iptables, kill, etc.) → Duration expires → Daemon removes fault → Controller marks experiment completeThis means chaos experiments are:
- Declarative: Describe the fault, not the injection mechanism
- Auditable:
kubectl get podchaos -Ashows all active experiments - Reversible: Delete the CRD, the fault is removed
- RBAC-controlled: Standard Kubernetes RBAC limits who can create chaos
Installation
Section titled “Installation”Install with Helm (Recommended)
Section titled “Install with Helm (Recommended)”# Add the Chaos Mesh Helm repositoryhelm repo add chaos-mesh https://charts.chaos-mesh.orghelm repo update
# Create the chaos-mesh namespacekubectl create namespace chaos-mesh
# Install Chaos Mesh# For kind clusters:helm install chaos-mesh chaos-mesh/chaos-mesh \ --namespace chaos-mesh \ --set chaosDaemon.runtime=containerd \ --set chaosDaemon.socketPath=/run/containerd/containerd.sock \ --set dashboard.securityMode=false \ --version 2.7.0
# For minikube with Docker runtime:helm install chaos-mesh chaos-mesh/chaos-mesh \ --namespace chaos-mesh \ --set chaosDaemon.runtime=docker \ --set chaosDaemon.socketPath=/var/run/docker.sock \ --set dashboard.securityMode=false \ --version 2.7.0Verify Installation
Section titled “Verify Installation”# Check all pods are runningkubectl get pods -n chaos-mesh
# Expected output (on a 3-node cluster):# NAME READY STATUS RESTARTS AGE# chaos-controller-manager-7b6fc8d5b4-abc12 1/1 Running 0 2m# chaos-controller-manager-7b6fc8d5b4-def34 1/1 Running 0 2m# chaos-controller-manager-7b6fc8d5b4-ghi56 1/1 Running 0 2m# chaos-daemon-jkl78 1/1 Running 0 2m# chaos-daemon-mno90 1/1 Running 0 2m# chaos-daemon-pqr12 1/1 Running 0 2m# chaos-dashboard-stu34-vwx56 1/1 Running 0 2m
# Verify CRDs are installedkubectl get crds | grep chaos-mesh
# Expected: podchaos, networkchaos, stresschaos, iochaos, timechaos, etc.Access the Dashboard
Section titled “Access the Dashboard”# Port-forward the dashboardkubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333 &
# Open in browser: http://localhost:2333Deploy a Test Application
Section titled “Deploy a Test Application”For all experiments in this module, we’ll use a simple multi-tier application:
# test-app.yaml — A simple frontend + backend applicationapiVersion: v1kind: Namespacemetadata: name: chaos-demo---apiVersion: apps/v1kind: Deploymentmetadata: name: frontend namespace: chaos-demospec: replicas: 3 selector: matchLabels: app: frontend template: metadata: labels: app: frontend spec: containers: - name: nginx image: nginx:1.27 ports: - containerPort: 80 readinessProbe: httpGet: path: / port: 80 initialDelaySeconds: 5 periodSeconds: 5 resources: requests: cpu: 50m memory: 64Mi limits: cpu: 100m memory: 128Mi---apiVersion: v1kind: Servicemetadata: name: frontend namespace: chaos-demospec: selector: app: frontend ports: - port: 80 targetPort: 80---apiVersion: apps/v1kind: Deploymentmetadata: name: backend namespace: chaos-demospec: replicas: 3 selector: matchLabels: app: backend template: metadata: labels: app: backend spec: containers: - name: httpbin image: mccutchen/go-httpbin:v2.14.0 ports: - containerPort: 8080 readinessProbe: httpGet: path: /get port: 8080 initialDelaySeconds: 5 periodSeconds: 5 resources: requests: cpu: 50m memory: 64Mi limits: cpu: 200m memory: 256Mi---apiVersion: v1kind: Servicemetadata: name: backend namespace: chaos-demospec: selector: app: backend ports: - port: 80 targetPort: 8080# Deploy the test applicationkubectl apply -f test-app.yaml
# Wait for all pods to be readykubectl wait --for=condition=ready pod -l app=frontend -n chaos-demo --timeout=120skubectl wait --for=condition=ready pod -l app=backend -n chaos-demo --timeout=120s
# Verifykubectl get pods -n chaos-demoPodChaos: Pod-Level Fault Injection
Section titled “PodChaos: Pod-Level Fault Injection”PodChaos is the simplest and most commonly used chaos type. It targets pods directly with three actions:
| Action | What It Does | Use Case |
|---|---|---|
pod-kill | Kills the pod process (container is terminated) | Test Kubernetes restart behavior and readiness probes |
pod-failure | Sets pod to failed state for duration | Test how services handle unavailable dependencies |
container-kill | Kills a specific container within a pod | Test sidecar failure, init container issues |
Experiment 1: Pod Kill
Section titled “Experiment 1: Pod Kill”Hypothesis: “We believe the frontend will continue serving HTTP 200 responses even when 1 of 3 pods is killed, because Kubernetes will restart the pod within 30 seconds and the Service will route traffic to the 2 remaining healthy pods.”
apiVersion: chaos-mesh.org/v1alpha1kind: PodChaosmetadata: name: frontend-pod-kill namespace: chaos-demospec: action: pod-kill mode: one # Kill exactly one pod selector: namespaces: - chaos-demo labelSelectors: app: frontend duration: "60s" # Wait 60s, then allow recovery gracePeriod: 0 # Immediate kill (like kill -9)# Record steady statekubectl get pods -n chaos-demo -l app=frontend -w &
# Apply the experimentkubectl apply -f pod-kill-experiment.yaml
# Watch the pod get killed and restarted# You should see one pod transition to Terminating, then a new one created
# Check experiment statuskubectl get podchaos -n chaos-demo
# Verify recoverykubectl get pods -n chaos-demo -l app=frontend# All 3 pods should be Running within 30-60 seconds
# Clean upkubectl delete podchaos frontend-pod-kill -n chaos-demoExperiment 2: Pod Failure
Section titled “Experiment 2: Pod Failure”Unlike pod-kill, pod-failure makes the pod unavailable for the entire duration without killing it. This simulates a hung process or unresponsive container.
apiVersion: chaos-mesh.org/v1alpha1kind: PodChaosmetadata: name: backend-pod-failure namespace: chaos-demospec: action: pod-failure mode: fixed-percent # Affect a percentage of pods value: "33" # ~33% of pods (1 of 3) selector: namespaces: - chaos-demo labelSelectors: app: backend duration: "120s" # Pod unavailable for 2 minutes# Apply the experimentkubectl apply -f pod-failure-experiment.yaml
# Watch the pod status — one backend pod will show as not readykubectl get pods -n chaos-demo -l app=backend -w
# Test that the service still responds via remaining podskubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \ curl -s -o /dev/null -w "%{http_code}" http://backend.chaos-demo.svc.cluster.local/get
# Clean up after observingkubectl delete podchaos backend-pod-failure -n chaos-demoMode Options
Section titled “Mode Options”The mode field controls how many pods are targeted:
| Mode | Value | Effect |
|---|---|---|
one | (none) | Exactly 1 random pod |
all | (none) | All matching pods |
fixed | "2" | Exactly 2 pods |
fixed-percent | "50" | 50% of matching pods |
random-max-percent | "70" | Random percentage up to 70% |
Safety rule: Never use mode: all in production. Always use one or fixed with a small number.
NetworkChaos: Network Fault Injection
Section titled “NetworkChaos: Network Fault Injection”NetworkChaos injects network-level faults between pods. This is where Chaos Mesh really shines — network issues are the most common cause of distributed system failures.
Network Latency
Section titled “Network Latency”Hypothesis: “We believe the frontend can tolerate 150ms of added latency to the backend because our timeout is set to 3 seconds and p99 baseline latency is 50ms.”
apiVersion: chaos-mesh.org/v1alpha1kind: NetworkChaosmetadata: name: backend-network-delay namespace: chaos-demospec: action: delay mode: all # Apply to all backend pods selector: namespaces: - chaos-demo labelSelectors: app: backend delay: latency: "150ms" # Add 150ms of latency correlation: "75" # 75% correlation between consecutive packets jitter: "25ms" # ±25ms variation direction: to # Delay incoming traffic to backend target: selector: namespaces: - chaos-demo labelSelectors: app: frontend mode: all duration: "180s"# Measure baseline latency firstkubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -n chaos-demo -- \ sh -c 'for i in $(seq 1 5); do curl -s -o /dev/null -w "Attempt $i: %{time_total}s\n" http://backend.chaos-demo.svc.cluster.local/get; done'
# Apply the delaykubectl apply -f network-delay-experiment.yaml
# Measure latency during experimentkubectl run curl-test2 --image=curlimages/curl --rm -it --restart=Never -n chaos-demo -- \ sh -c 'for i in $(seq 1 5); do curl -s -o /dev/null -w "Attempt $i: %{time_total}s\n" http://backend.chaos-demo.svc.cluster.local/get; done'
# You should see ~150ms added to each request
# Clean upkubectl delete networkchaos backend-network-delay -n chaos-demoPacket Loss
Section titled “Packet Loss”apiVersion: chaos-mesh.org/v1alpha1kind: NetworkChaosmetadata: name: backend-packet-loss namespace: chaos-demospec: action: loss mode: all selector: namespaces: - chaos-demo labelSelectors: app: backend loss: loss: "15" # 15% packet loss correlation: "50" # 50% correlation direction: both # Both ingress and egress duration: "120s"Bandwidth Limitation
Section titled “Bandwidth Limitation”apiVersion: chaos-mesh.org/v1alpha1kind: NetworkChaosmetadata: name: backend-bandwidth-limit namespace: chaos-demospec: action: bandwidth mode: all selector: namespaces: - chaos-demo labelSelectors: app: backend bandwidth: rate: "1mbps" # Limit to 1 Mbps limit: 20971520 # Queue limit in bytes (20MB) buffer: 10000 # Buffer size in bytes direction: to duration: "120s"StressChaos: CPU and Memory Stress
Section titled “StressChaos: CPU and Memory Stress”StressChaos injects resource pressure on target pods. This simulates noisy neighbors, resource exhaustion, and OOM scenarios.
CPU Stress
Section titled “CPU Stress”Hypothesis: “We believe the backend will continue responding under 500ms even when 2 CPU cores are saturated by stress, because the pod has resource limits that prevent the stress from consuming the entire node and Kubernetes will throttle the stressed container.”
apiVersion: chaos-mesh.org/v1alpha1kind: StressChaosmetadata: name: backend-cpu-stress namespace: chaos-demospec: mode: one selector: namespaces: - chaos-demo labelSelectors: app: backend stressors: cpu: workers: 2 # Number of CPU stress workers load: 80 # Target 80% CPU load per worker duration: "120s"# Apply CPU stresskubectl apply -f cpu-stress-experiment.yaml
# Monitor CPU usagekubectl top pods -n chaos-demo
# Test response times during stresskubectl run curl-test3 --image=curlimages/curl --rm -it --restart=Never -n chaos-demo -- \ sh -c 'for i in $(seq 1 10); do curl -s -o /dev/null -w "Attempt $i: %{time_total}s\n" http://backend.chaos-demo.svc.cluster.local/get; done'
# Clean upkubectl delete stresschaos backend-cpu-stress -n chaos-demoMemory Stress
Section titled “Memory Stress”apiVersion: chaos-mesh.org/v1alpha1kind: StressChaosmetadata: name: backend-memory-stress namespace: chaos-demospec: mode: one selector: namespaces: - chaos-demo labelSelectors: app: backend stressors: memory: workers: 1 # Number of memory stress workers size: "200Mi" # Each worker allocates 200Mi duration: "90s"Warning: Memory stress can trigger OOM kills if the stress allocation plus the application’s memory exceeds the container’s memory limit. Start with values well below your pod’s memory limit.
RBAC for Chaos Mesh
Section titled “RBAC for Chaos Mesh”In production environments, you must restrict who can create chaos experiments. Chaos Mesh integrates with Kubernetes RBAC.
Namespace-Scoped Chaos Operator
Section titled “Namespace-Scoped Chaos Operator”This role allows creating chaos experiments only within specific namespaces:
apiVersion: rbac.authorization.k8s.io/v1kind: Rolemetadata: name: chaos-operator namespace: chaos-demorules: - apiGroups: ["chaos-mesh.org"] resources: - podchaos - networkchaos - stresschaos verbs: ["get", "list", "watch", "create", "update", "delete"] - apiGroups: ["chaos-mesh.org"] resources: - iochaos - timechaos - dnschaos - httpchaos verbs: ["get", "list", "watch"] # Read-only for advanced types---apiVersion: rbac.authorization.k8s.io/v1kind: RoleBindingmetadata: name: chaos-operator-binding namespace: chaos-demosubjects: - kind: User name: sre-engineer apiGroup: rbac.authorization.k8s.ioroleRef: kind: Role name: chaos-operator apiGroup: rbac.authorization.k8s.ioProtecting Production Namespaces
Section titled “Protecting Production Namespaces”Use a NetworkPolicy-style approach: deny chaos in production by default, allow only with explicit approval:
# Chaos Mesh supports namespace annotations to block experiments# Add this annotation to protect a namespace:apiVersion: v1kind: Namespacemetadata: name: production annotations: chaos-mesh.org/inject: "disabled" # Blocks all chaos injectionDashboard RBAC
Section titled “Dashboard RBAC”When securityMode is enabled on the dashboard, users must authenticate:
# Install with security mode enabled (production)helm upgrade chaos-mesh chaos-mesh/chaos-mesh \ --namespace chaos-mesh \ --set dashboard.securityMode=true
# Create a token for dashboard accesskubectl create serviceaccount chaos-viewer -n chaos-meshkubectl create clusterrolebinding chaos-viewer-binding \ --clusterrole=chaos-mesh-viewer \ --serviceaccount=chaos-mesh:chaos-viewer
# Get the tokenkubectl create token chaos-viewer -n chaos-meshScheduling Experiments
Section titled “Scheduling Experiments”Chaos Mesh supports scheduled experiments using the Schedule CRD — useful for continuous chaos:
apiVersion: chaos-mesh.org/v1alpha1kind: Schedulemetadata: name: daily-frontend-pod-kill namespace: chaos-demospec: schedule: "0 14 * * 1-5" # 2 PM on weekdays type: PodChaos historyLimit: 5 # Keep last 5 experiment records concurrencyPolicy: Forbid # Don't run if previous still active podChaos: action: pod-kill mode: one selector: namespaces: - chaos-demo labelSelectors: app: frontend duration: "60s"# Apply the schedulekubectl apply -f scheduled-pod-kill.yaml
# Check scheduled experimentskubectl get schedule -n chaos-demo
# View experiment historykubectl get podchaos -n chaos-demo --sort-by=.metadata.creationTimestampCommon Mistakes
Section titled “Common Mistakes”| Mistake | Why It’s a Problem | Better Approach |
|---|---|---|
| Installing chaos-daemon without correct runtime socket | The daemon cannot inject faults if it cannot talk to the container runtime — experiments silently fail | Always set chaosDaemon.runtime and chaosDaemon.socketPath for your environment (containerd, Docker, CRI-O) |
Using mode: all in production experiments | Killing or stressing all pods simultaneously guarantees an outage — this is not an experiment, it’s sabotage | Use mode: one or mode: fixed with a value of 1 for initial experiments; increase gradually |
| Forgetting to set resource limits on target pods | StressChaos without resource limits can starve the entire node, affecting all pods on that node | Always set CPU and memory limits on target pods; StressChaos respects cgroup limits |
| Not cleaning up failed experiments | A stuck CRD can keep injecting faults indefinitely if the controller doesn’t properly reconcile | Check kubectl get podchaos,networkchaos,stresschaos -A regularly; delete stuck experiments manually |
| Running NetworkChaos without understanding direction | Setting direction: both when you mean direction: to can cause unexpected traffic disruption in both directions | Start with direction: to (only incoming traffic to the target) and add from or both deliberately |
| Skipping the dashboard for CLI-only workflows | CLI-only workflows miss the visual overview of active experiments, making it easy to lose track of what’s running | Use the dashboard for situational awareness, even if you create experiments via kubectl |
| Not verifying CRD deletion removes the fault | In rare cases, deleting the CRD does not properly clean up the injected fault (e.g., lingering tc rules) | After deleting a NetworkChaos experiment, verify with kubectl exec into the pod and check tc qdisc show |
| Testing chaos tools on a single-replica deployment | Killing the only pod of a Deployment means guaranteed downtime — you can’t learn anything useful from guaranteed failure | Ensure target deployments have at least 2-3 replicas before running pod-kill experiments |
Question 1: What are the three main components of Chaos Mesh, and what does each do?
Section titled “Question 1: What are the three main components of Chaos Mesh, and what does each do?”Show Answer
- chaos-controller-manager (Deployment): Watches Chaos Mesh CRDs, reconciles desired chaos state, schedules experiments, and manages their lifecycle.
- chaos-daemon (DaemonSet): Runs on every node, enters target pod namespaces using
nsenter, and injects the actual faults (tc rules, process kills, stress-ng, etc.). - chaos-dashboard (Deployment): Web UI for creating, monitoring, and managing chaos experiments visually.
The controller-manager is the brain, the daemon is the hands, and the dashboard is the eyes.
Question 2: What is the difference between pod-kill and pod-failure actions?
Section titled “Question 2: What is the difference between pod-kill and pod-failure actions?”Show Answer
pod-kill terminates the container process, causing Kubernetes to restart the pod. The pod goes through Terminating → Pending → Running states. This tests restart behavior, readiness probes, and how quickly the pod recovers.
pod-failure makes the pod unavailable for the specified duration without killing it. The pod remains in a non-ready state. This tests how calling services handle an unavailable dependency for a sustained period, without testing Kubernetes restart mechanics.
Use pod-kill to test recovery speed. Use pod-failure to test sustained unavailability handling.
Question 3: Why should you never use mode: all for your first production chaos experiment?
Section titled “Question 3: Why should you never use mode: all for your first production chaos experiment?”Show Answer
mode: all targets every pod matching the selector simultaneously. If your hypothesis is wrong and the service cannot tolerate the injected fault, mode: all guarantees a complete outage — every instance of the service is affected at once.
This violates the core principle of minimizing blast radius. You should start with mode: one (a single pod), verify the system recovers gracefully, and only increase the scope in subsequent experiments. Even in mature chaos practices, mode: all is rarely used because it provides little additional learning over mode: fixed-percent with a high percentage.
Question 4: How does the direction field in NetworkChaos work, and what is the safest default?
Section titled “Question 4: How does the direction field in NetworkChaos work, and what is the safest default?”Show Answer
The direction field controls which traffic direction is affected:
to: Only traffic coming into the target pods is affectedfrom: Only traffic going out of the target pods is affectedboth: Traffic in both directions is affected
The safest default is direction: to because it only affects incoming requests to the target. This lets you test how the target handles degraded incoming traffic without also disrupting its outbound calls. Using both doubles the blast radius unnecessarily for most experiments.
Question 5: You apply a NetworkChaos experiment with 150ms latency. After deleting the CRD, you notice latency is still elevated. What happened and how do you fix it?
Section titled “Question 5: You apply a NetworkChaos experiment with 150ms latency. After deleting the CRD, you notice latency is still elevated. What happened and how do you fix it?”Show Answer
In rare cases, the chaos-daemon may fail to clean up the injected tc (traffic control) rules when the CRD is deleted. This can happen due to daemon restart, race conditions, or network issues between the controller and daemon.
To diagnose, exec into the affected pod and check for lingering tc rules:
kubectl exec -it <pod-name> -n <namespace> -- tc qdisc showTo fix, you can either:
- Delete the pod so Kubernetes creates a fresh one (cleanest approach)
- Manually remove the tc rules:
kubectl exec -it <pod-name> -- tc qdisc del dev eth0 root - Restart the chaos-daemon on the affected node
This is why you should always verify that deleting the CRD actually removes the fault, especially for network experiments.
Question 6: How would you prevent a team from accidentally running chaos experiments in the production namespace?
Section titled “Question 6: How would you prevent a team from accidentally running chaos experiments in the production namespace?”Show Answer
Three layers of protection:
-
Namespace annotation: Add
chaos-mesh.org/inject: "disabled"to the production namespace metadata. Chaos Mesh will refuse to inject faults into pods in annotated namespaces. -
RBAC: Do not grant
createorupdateverbs on chaos-mesh.org resources in the production namespace. Use namespace-scoped Roles, not ClusterRoles, for chaos permissions. -
Dashboard security mode: Enable
securityMode=trueon the dashboard so users must authenticate with tokens that have limited RBAC scope.
Defense in depth ensures that even if one layer is misconfigured, the others prevent accidental production chaos.
Hands-On Exercise: Pod Kill Experiment on a Stateless App
Section titled “Hands-On Exercise: Pod Kill Experiment on a Stateless App”Objective
Section titled “Objective”Run a complete chaos experiment following the scientific method: define steady state, form hypothesis, inject a pod-kill fault, observe recovery, and document results.
- Ensure Chaos Mesh is installed (follow the installation section above)
- Deploy the test application (
test-app.yamlfrom above) - Open a terminal for monitoring
Step 1: Define Steady State
Section titled “Step 1: Define Steady State”# Verify all frontend pods are runningkubectl get pods -n chaos-demo -l app=frontend
# Record the pod names and their restart countskubectl get pods -n chaos-demo -l app=frontend -o custom-columns=\NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount,AGE:.metadata.creationTimestamp
# Test that the service respondskubectl run steady-state-check --image=curlimages/curl --rm -it --restart=Never -n chaos-demo -- \ sh -c 'for i in $(seq 1 10); do CODE=$(curl -s -o /dev/null -w "%{http_code}" http://frontend.chaos-demo.svc.cluster.local/) echo "Request $i: HTTP $CODE" done'
# Expected: All 10 requests return HTTP 200Record your steady state:
- Pod count: 3
- All pods status: Running
- HTTP response: 200 for 100% of requests
- Restart count: 0
Step 2: Form Your Hypothesis
Section titled “Step 2: Form Your Hypothesis”Write down your hypothesis before proceeding:
“We believe that the frontend Service will continue returning HTTP 200 for 100% of requests even when 1 of 3 frontend pods is killed, because the Kubernetes Service will immediately stop routing traffic to the terminated pod and Kubernetes will restart the pod within 30 seconds.”
Step 3: Start Monitoring
Section titled “Step 3: Start Monitoring”# Terminal 1: Watch pod statuskubectl get pods -n chaos-demo -l app=frontend -w
# Terminal 2: Continuous HTTP checks (run this before applying chaos)kubectl run continuous-check --image=curlimages/curl --rm -it --restart=Never -n chaos-demo -- \ sh -c 'while true; do CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://frontend.chaos-demo.svc.cluster.local/) echo "$(date +%H:%M:%S) HTTP $CODE" sleep 1 done'Step 4: Apply the Experiment
Section titled “Step 4: Apply the Experiment”# Save as frontend-pod-kill.yamlapiVersion: chaos-mesh.org/v1alpha1kind: PodChaosmetadata: name: frontend-pod-kill-experiment namespace: chaos-demospec: action: pod-kill mode: one selector: namespaces: - chaos-demo labelSelectors: app: frontend gracePeriod: 0 duration: "60s"# Apply the experimentkubectl apply -f frontend-pod-kill.yaml
# Record the time you applied itecho "Experiment started at: $(date +%H:%M:%S)"Step 5: Observe
Section titled “Step 5: Observe”Watch Terminal 1 and Terminal 2. Record:
- What time was the pod killed?
- How long until a new pod was created?
- How long until the new pod was Ready?
- Did any HTTP requests fail during the experiment?
- What HTTP status codes did you see?
Step 6: Analyze and Document
Section titled “Step 6: Analyze and Document”# Check experiment statuskubectl get podchaos -n chaos-demo
# Check pod restart countkubectl get pods -n chaos-demo -l app=frontend -o custom-columns=\NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount
# Clean up the experimentkubectl delete podchaos frontend-pod-kill-experiment -n chaos-demo
# Clean up test podskubectl delete pod continuous-check -n chaos-demo --ignore-not-foundSuccess Criteria
Section titled “Success Criteria”Your experiment is successful (regardless of whether the hypothesis was confirmed) when:
- You recorded steady state before the experiment
- Your hypothesis was specific and falsifiable
- You observed the experiment in real-time (pod status + HTTP responses)
- You recorded the timeline: kill time, new pod created time, ready time
- You documented whether the hypothesis was confirmed or refuted
- You identified at least one improvement (even if the hypothesis held)
- You cleaned up all chaos resources after the experiment
- You can explain to a teammate exactly what happened and what you learned
Expected Results
Section titled “Expected Results”In most configurations, you should observe:
- Pod killed: Within 1-2 seconds of applying the CRD
- New pod created: Within 5-10 seconds (Kubernetes controller reconciliation)
- New pod ready: Within 15-30 seconds (image pull + readiness probe)
- HTTP failures: 0-2 requests may fail during the transition (depends on Service update speed)
- Recovery: Full steady state restored within 60 seconds
If you saw more than 2 failed requests, investigate your readiness probe configuration and Service endpoint update timing — that’s a real finding worth documenting.
Summary
Section titled “Summary”Chaos Mesh is a Kubernetes-native chaos engineering platform that uses CRDs to declare and manage experiments. Its architecture — controller, daemon, dashboard — provides a safe, auditable, RBAC-controlled way to inject pod kills, network faults, and resource stress into your cluster.
Key takeaways:
- CRD-driven: Experiments are Kubernetes objects, auditable and version-controlled
- Safety by design: RBAC, namespace annotations, mode controls, and duration limits
- Three fault domains covered: PodChaos (process), NetworkChaos (network), StressChaos (resources)
- Dashboard is your friend: Visual oversight prevents lost experiments
- Always start with
mode: one: Expand blast radius only after successful small experiments
Next Module
Section titled “Next Module”Continue to Module 1.3: Advanced Network & Application Fault Injection — Deep dive into latency injection, DNS failures, HTTP-level chaos, clock skew, and JVM/kernel fault injection.