Module 1.1: Principles of Chaos Engineering & Resilience
Discipline Module | Complexity:
[QUICK]| Time: 1.5 hours
Prerequisites
Section titled “Prerequisites”Before starting this module:
- Required: Kubernetes Basics — Core cluster concepts and workloads
- Required: Release Engineering — Understanding deployment pipelines and rollbacks
- Recommended: Experience operating at least one production system
- Recommended: Familiarity with monitoring concepts (Prometheus, Grafana)
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Design chaos engineering experiments with clear hypotheses, blast radius controls, and abort conditions
- Evaluate system resilience by identifying failure modes that monitoring and testing alone cannot catch
- Implement a chaos engineering program with incremental adoption from simple to complex experiments
- Build organizational buy-in for chaos engineering by communicating risk reduction in business terms
Why This Module Matters
Section titled “Why This Module Matters”On October 22, 2012, an engineer at Netflix pushed a routine configuration change to a critical microservice. Within 45 minutes, the entire streaming platform was down for millions of users worldwide. The postmortem revealed something uncomfortable: nobody had tested what would happen when that exact service failed during peak traffic.
That outage, and dozens like it before, led to a fundamental question: Why do we wait for production to teach us how our systems break?
Chaos Engineering was born from this question. It is the discipline of proactively injecting controlled failures into systems to discover weaknesses before they become real outages. It sounds counterintuitive — deliberately breaking things to make them stronger — but it is the same logic behind fire drills, vaccine science, and military war games. You stress the system under controlled conditions so it survives uncontrolled ones.
Here’s the uncomfortable truth: Your system will fail. The only question is whether you discover how it fails on your terms (during a planned experiment at 2 PM on a Tuesday) or on production’s terms (at 3 AM on New Year’s Eve when your on-call engineer is asleep).
This module teaches you the philosophy, scientific method, and safety practices of Chaos Engineering — everything you need before touching a single chaos tool.
Did You Know?
Section titled “Did You Know?”Netflix’s Chaos Monkey was created in 2011 and randomly terminated EC2 instances in production. Engineers initially hated it. Within six months, they loved it — their services had become so resilient that individual instance failures caused zero customer impact. Netflix later open-sourced it, spawning an entire discipline.
The term “Chaos Engineering” was formalized in a 2015 paper by Netflix engineers titled “Principles of Chaos Engineering”. Before that, the practice was informally called “chaos testing” or “failure injection.” The name change was deliberate — engineering implies rigor, hypotheses, and the scientific method, not random destruction.
Amazon’s GameDay tradition started internally in 2004. Teams would simulate catastrophic failures (entire data center loss, database corruption) with executives watching. The practice was so effective at finding weaknesses that AWS eventually built it into their standard operating procedures. Today, AWS runs hundreds of Game Days per year across their organization.
The largest chaos experiment ever run was by Google during the development of Spanner, their globally distributed database. Engineers simulated entire data center failures, network partitions across continents, and clock skew of several seconds — all simultaneously. The result was a database that maintains consistency across the planet with 99.999% availability.
The Philosophy of Chaos Engineering
Section titled “The Philosophy of Chaos Engineering”It Is Not Random Destruction
Section titled “It Is Not Random Destruction”Let’s kill the biggest misconception immediately: Chaos Engineering is not about randomly breaking things and seeing what happens. That’s vandalism, not engineering.
Chaos Engineering is a disciplined investigation. You form a hypothesis, design a controlled experiment, execute it with safety measures, observe results, and draw conclusions. It follows the scientific method as rigorously as any laboratory experiment.
Think of it this way:
| What It Is NOT | What It IS |
|---|---|
| Randomly killing pods | Hypothesis-driven experimentation |
| Breaking production for fun | Controlled failure injection with safety nets |
| Testing until something crashes | Verifying that systems behave as designed |
| Creating chaos | Discovering hidden chaos that already exists |
| A one-time activity | A continuous practice embedded in culture |
The Core Insight
Section titled “The Core Insight”Every distributed system has emergent behaviors — behaviors that arise from the interaction of components but cannot be predicted by examining any single component in isolation.
Consider a microservices application with 20 services. Each service has been unit tested, integration tested, and load tested. Every individual component works correctly. But when Service A retries failed calls to Service B, and Service B is slow because Service C’s database is under load, those retries amplify the load on Service B, which cascades into Service D, which has a 30-second timeout that blocks threads, which… you get the picture.
Stop and think: Can you recall a time when two healthy systems combined to create an outage in your environment? What was the emergent behavior?
No amount of unit testing or code review would catch this. The failure only emerges from the interaction of components under specific conditions. Chaos Engineering is designed to surface exactly these emergent failures.
Resilience vs. Robustness
Section titled “Resilience vs. Robustness”These terms are often confused, but the distinction matters enormously:
Robustness means the system can handle known failure modes. You’ve tested for them, built handling for them, and verified the handlers work. A robust bridge can handle its rated wind load.
Resilience means the system can handle unknown or unexpected failure modes and recover. A resilient bridge sways in unexpected wind patterns and returns to its original position.
Chaos Engineering builds resilience — the ability to withstand and recover from failures you haven’t explicitly planned for.
flowchart LR A["Fragile<br/>Breaks easily"] --> B["Robust<br/>Handles known failures"] B --> C["Resilient<br/>Handles unknown failures"] C --> D["Antifragile<br/>Gets stronger from stress"]The ultimate goal is antifragility — systems that actually improve when stressed. Netflix achieved this: every time Chaos Monkey killed an instance, engineers improved their services, making the entire platform stronger over time.
The Scientific Method of Chaos
Section titled “The Scientific Method of Chaos”The Chaos Engineering Cycle
Section titled “The Chaos Engineering Cycle”Every chaos experiment follows a strict cycle. Skipping steps is how you turn engineering into vandalism.
flowchart TD A[1. Define Steady State] --> B[2. Form Hypothesis] B --> C[3. Design Experiment] C --> D[4. Define Blast Radius & Abort Conditions] D --> E[5. Run Experiment] E --> F[6. Observe & Measure] F --> G[7. Analyze Results] G --> H[8. Document & Share Findings] H --> ALet’s examine each step in detail.
Step 1: Define Steady State
Section titled “Step 1: Define Steady State”Before you can detect a problem, you need to know what “normal” looks like. Steady state is not “everything is green” — it is a measurable definition of normal system behavior.
Good steady-state definitions:
| Metric | Steady State Value | Measurement Source |
|---|---|---|
| Request latency (p99) | < 200ms | Prometheus histogram |
| Error rate (5xx) | < 0.1% | Istio telemetry |
| Order completion rate | > 98.5% | Business metrics DB |
| Pod restart count | 0 restarts/hour | kube-state-metrics |
| Queue depth | < 500 messages | RabbitMQ exporter |
Notice the last two rows. Steady state should include business metrics, not just infrastructure metrics. A system can have perfect CPU and memory while silently dropping orders.
# Example: Steady state as a monitoring rule# This becomes your "is the experiment safe?" checkgroups: - name: chaos-steady-state rules: - alert: SteadyStateViolation expr: | ( histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.2 or sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.001 or rate(orders_completed_total[5m]) / rate(orders_started_total[5m]) < 0.985 ) for: 1m labels: severity: chaos-abortStep 2: Form a Hypothesis
Section titled “Step 2: Form a Hypothesis”A chaos hypothesis has a specific format:
“We believe that [system behavior] will continue even when [failure condition], because [resilience mechanism].”
Examples:
-
“We believe that order processing will continue within 3 seconds even when the payment service pod is killed, because Kubernetes will restart it within 30 seconds and the retry logic in the order service will handle the brief outage.”
-
“We believe that search results will continue to be served even when 200ms of latency is added between the API gateway and the search service, because circuit breakers will trip after 5 failed requests and return cached results.”
-
“We believe that user sessions will be preserved even when a Redis Sentinel node is killed, because Sentinel will promote a replica within 10 seconds and the session middleware reconnects automatically.”
Bad hypotheses (too vague):
- “The system should handle pod failures” — handle how? What metric?
- “Performance won’t degrade” — by how much? What’s acceptable?
- “Everything will keep working” — define “everything” and “working”
Step 3: Design the Experiment
Section titled “Step 3: Design the Experiment”An experiment specification includes:
# Chaos Experiment Specification (human-readable)experiment: name: "Payment service pod failure during checkout" date: "2026-03-24" owner: "platform-team" approvers: ["lead-sre", "payment-team-lead"]
hypothesis: > Order completion rate stays above 98.5% when the payment service pod is killed, due to Kubernetes auto-restart and order-service retry logic.
steady_state: metrics: - name: order_completion_rate query: "rate(orders_completed[5m]) / rate(orders_started[5m])" expected: ">= 0.985" - name: p99_latency query: "histogram_quantile(0.99, rate(http_duration_bucket[5m]))" expected: "<= 0.5"
injection: type: pod-kill target: "payment-service" namespace: "production" selector: app: payment-service count: 1 # kill 1 of 3 replicas
abort_conditions: - "order_completion_rate < 0.95 for 2 minutes" - "p99_latency > 2s for 1 minute" - "any 500 errors on checkout endpoint"
blast_radius: scope: "payment-service namespace only" max_impact: "1 pod out of 3 replicas" customer_impact: "possible 2-3s delay for ~5% of checkouts"
duration: 10 minutes rollback: | # For pod-kill, Kubernetes self-heals via ReplicaSet — verify pod recreation: kubectl get pods -l app=payment-service -n production -w # If needed, restore original replica count: kubectl scale deployment/payment-service -n production --replicas=3Step 4: Blast Radius and Abort Conditions
Section titled “Step 4: Blast Radius and Abort Conditions”Blast radius is the potential impact zone of your experiment. Always start small.
flowchart TD L1["Level 1: Single pod in staging<br/>(Start here)"] --> L2["Level 2: Single pod in production"] L2 --> L3["Level 3: Multiple pods in production"] L3 --> L4["Level 4: Entire service in production"] L4 --> L5["Level 5: Cross-service failure<br/>(Work up to here)"] L5 --> L6["Level 6: Availability zone failure"] L6 --> L7["Level 7: Region failure<br/>(Only for mature orgs)"]Abort conditions are your emergency brake. They must be:
- Automated: A human watching a dashboard is not fast enough
- Measurable: Based on metrics, not feelings
- Preset: Decided before the experiment, not during
- Tested: Verify the abort mechanism actually works before running chaos
Pause and predict: If you run a chaos experiment and your automated abort condition fires within 10 seconds, was the experiment a failure or a success?
# Example abort configuration (Chaos Mesh style)apiVersion: chaos-mesh.org/v1alpha1kind: PodChaosmetadata: name: payment-pod-killspec: action: pod-kill mode: one selector: namespaces: - production labelSelectors: app: payment-service duration: "10m" # Abort is handled externally via monitoring + Chaos Mesh dashboard pauseStep 5-8: Execute, Observe, Analyze, Share
Section titled “Step 5-8: Execute, Observe, Analyze, Share”Execute: Always run experiments during business hours when the team is available. Never run your first experiment at 3 AM or on a Friday afternoon.
Observe: Watch dashboards in real-time. Record everything — not just metrics, but observations. “The dashboard showed a spike at T+30s” is useful data.
Analyze: Did the system behave as hypothesized? If yes, increase the blast radius next time. If no, you found a weakness — that’s a success! Write a remediation plan.
Share: The worst thing you can do with chaos results is keep them to yourself. Share findings broadly. Write them up. Present them at team meetings. The organizational learning is as valuable as the technical findings.
Game Days vs. Continuous Chaos
Section titled “Game Days vs. Continuous Chaos”There are two modes of practicing Chaos Engineering, and mature organizations use both.
Game Days
Section titled “Game Days”A Game Day is a scheduled, focused chaos event. Think of it as a fire drill for your infrastructure.
Structure of a Game Day:
09:00 - Kickoff: Review today's experiments, assign roles09:30 - Steady State Verification: Confirm all systems nominal10:00 - Experiment 1: Pod failure in service A10:30 - Debrief Experiment 1: Results, surprises, actions11:00 - Experiment 2: Network partition between zones11:30 - Debrief Experiment 212:00 - Lunch Break13:00 - Experiment 3: Database failover simulation13:30 - Debrief Experiment 314:00 - Wrap-up: Overall findings, action items, next Game Day dateRoles during a Game Day:
| Role | Responsibility |
|---|---|
| Game Master | Coordinates experiments, manages schedule |
| Experimenter | Executes the chaos injection |
| Observer | Monitors dashboards, records metrics |
| Communicator | Updates stakeholders, manages abort decisions |
| Scribe | Documents everything in real-time |
When to use Game Days:
- First-time chaos experiments in a new environment
- Testing complex, multi-service failure scenarios
- Training new team members on incident response
- Building organizational confidence in chaos practices
- Before major events (Black Friday, product launches)
Continuous Chaos
Section titled “Continuous Chaos”Continuous chaos runs automated experiments on a schedule — hourly, daily, or triggered by deployments. This is the “Chaos Monkey” model.
When to use Continuous Chaos:
- After Game Days have validated basic resilience
- For well-understood failure modes with proven recovery
- To prevent resilience regression (services that were resilient staying resilient)
- To build confidence that auto-scaling and self-healing work continuously
Maturity Model:
flowchart TD L0["Level 0: No chaos practices"] --> L1["Level 1: Ad-hoc manual testing"] L1 --> L2["Level 2: Structured Game Days"] L2 --> L3["Level 3: Automated chaos in staging"] L3 --> L4["Level 4: Automated chaos in production"] L4 --> L5["Level 5: Chaos as culture"]Most organizations are between Level 0 and Level 1. Getting to Level 2 is a massive improvement. Don’t rush to Level 4 — premature continuous chaos in production causes the outages it’s supposed to prevent.
Building a Safety Culture
Section titled “Building a Safety Culture”The Pre-Requisites Before Any Chaos
Section titled “The Pre-Requisites Before Any Chaos”Before running your first chaos experiment, these must be in place:
-
Observability: You need dashboards and alerts to see the impact. If you can’t observe steady state, you can’t detect deviation from it.
-
Rollback capability: You must be able to undo the experiment instantly. If you kill a pod and can’t restart it, that’s not chaos engineering — that’s an outage.
-
Communication channels: Everyone who might be affected should know an experiment is happening. Use a dedicated Slack channel, not email.
-
Management buy-in: Your first experiment should not be a surprise to leadership. Get explicit approval, especially for production experiments.
-
Incident response process: If the experiment goes wrong, you need to handle it like a real incident. The chaos experiment becomes the incident trigger.
The “Opt-In” Principle
Section titled “The “Opt-In” Principle”Teams should never have chaos imposed on them. The opt-in principle states:
- Teams volunteer their services for chaos experiments
- Teams define their own steady state and abort conditions
- Teams participate in experiments on their services
- Teams own the remediation of discovered weaknesses
Forcing chaos on unwilling teams creates resentment, not resilience. The goal is cultural change, and culture cannot be mandated.
Communicating About Chaos
Section titled “Communicating About Chaos”When proposing chaos engineering to your organization, frame it correctly:
Don’t say: “We want to break things in production.” Do say: “We want to verify our resilience claims before our customers test them for us.”
Don’t say: “We’re going to inject failures everywhere.” Do say: “We’re going to run controlled, scientific experiments to find weaknesses we can fix proactively.”
Don’t say: “What’s the worst that could happen?” Do say: “Here is our blast radius analysis, abort conditions, and rollback plan.”
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It’s a Problem | Better Approach |
|---|---|---|
| Running chaos without observability | You can’t measure impact if you can’t see it — the experiment is useless or dangerous | Set up monitoring and dashboards first; verify you can detect the steady-state deviation |
| Skipping the hypothesis step | Without a hypothesis, you’re just breaking things randomly with no learning objective | Always write “We believe X will happen because Y” before running any experiment |
| Starting in production | Your first chaos experiment should not risk customer impact while you’re still learning the tools | Start in staging or dev; graduate to production only after multiple successful staging runs |
| No abort conditions | Without automated abort, a runaway experiment becomes a real outage | Define metrics-based abort conditions and verify they trigger correctly before running |
| Running experiments on Fridays | If the experiment causes lasting damage, no one wants to fix it over the weekend | Run chaos Tuesday through Thursday, during business hours, with the full team available |
| Blaming individuals for failures found | If chaos reveals a bug, blaming the developer who wrote it kills the program instantly | Treat findings as systemic improvements; celebrate discovery, not assign blame |
| Too large a blast radius too soon | Killing half your production pods on your first experiment guarantees an outage, not learning | Start with one pod in staging; increase blast radius only after successful smaller experiments |
| Not documenting results | Undocumented experiments provide no organizational learning and cannot be reproduced | Write up every experiment: hypothesis, results, actions, and share with the broader team |
Test your understanding of Chaos Engineering principles:
Question 1: Random Vandalism vs. Engineering
Section titled “Question 1: Random Vandalism vs. Engineering”A junior engineer logs into the staging cluster, runs
kubectl delete pods --all, and watches the monitoring dashboards to see what happens. When asked, they say they are practicing Chaos Engineering. Why is this engineer incorrect?
Show Answer
The engineer’s actions represent random destruction, not Chaos Engineering, because they completely ignored the scientific method. Chaos Engineering requires starting with a specific hypothesis about system behavior, defining a measurable steady state, and establishing automated abort conditions before any failure is injected. By blindly deleting all pods without a hypothesis or controlled blast radius, the engineer cannot learn anything specific about the system’s resilience mechanisms. True Chaos Engineering emphasizes rigor and safety, treating failure injection as a highly controlled measurement rather than an ad-hoc disruption.
Question 2: The Right Way to Measure Steady State
Section titled “Question 2: The Right Way to Measure Steady State”You are designing an experiment for a checkout service. Your steady-state definition checks that CPU usage remains below 60% and memory below 512MB. During the experiment, pod metrics stay well within these limits, but the customer support desk receives 500 calls about failed payments. What fundamental mistake was made in the steady-state definition?
Show Answer
The team defined their steady state entirely around infrastructure metrics while completely ignoring the actual business purpose of the service. Infrastructure metrics like CPU and memory can remain perfectly healthy even if the application logic is trapped in a retry loop or failing to process transactions. A valid steady-state definition must include business metrics, such as checkout success rate or orders processed per minute, because these are the ultimate indicators of system health. Without tracking business metrics, you cannot accurately determine if a failure injection is impacting the user experience.
Question 3: Controlling the Blast Radius
Section titled “Question 3: Controlling the Blast Radius”A team’s first-ever chaos experiment involves simulating a region-wide database failover in production to prove their new multi-region architecture works. The CTO halts the experiment before it begins, citing an “unacceptable blast radius.” What principle of blast radius did the team violate?
Show Answer
The team violated the core principle of starting with the smallest possible blast radius to learn safely. By immediately targeting a massive, complex failure in production (a region-wide failover) for their very first experiment, they risked causing a catastrophic outage for all customers if their assumptions were wrong. Chaos experiments should always begin small—such as killing a single pod in a staging environment—and only increase in scope once smaller experiments have proven successful. This incremental approach builds confidence and bounds the potential damage of discovering an unexpected weakness.
Question 4: Chaos Engineering and Availability Goals
Section titled “Question 4: Chaos Engineering and Availability Goals”A product manager demands that the engineering team guarantee 100% availability for a new microservice and refuses to authorize any chaos experiments because they “might cause downtime.” How should you explain the relationship between availability goals and chaos engineering to this manager?
Show Answer
First, you must explain that 100% availability is mathematically and practically impossible in distributed systems, and attempting to achieve it results in extreme risk aversion that halts innovation. Instead, teams should target a realistic Service Level Objective (SLO), such as 99.9%, and use the remaining error budget for engineering improvements. Chaos Engineering deliberately consumes a tiny fraction of this error budget in a controlled manner to discover hidden flaws on the team’s terms. By spending this budget proactively on chaos experiments, the team prevents massive, uncontrolled outages later, ultimately protecting the service’s long-term availability.
Question 5: Designing the First Experiment
Section titled “Question 5: Designing the First Experiment”Your team has robust monitoring, CI/CD pipelines, and explicit management approval to begin chaos engineering. You are tasked with designing the team’s very first experiment. Describe the environment, scope, and specific failure you would choose, and explain why.
Show Answer
The first experiment should take place in a staging environment, target a single pod of a stateless service, and have clearly defined, automated abort conditions. It should test a very simple, well-understood hypothesis, such as verifying that Kubernetes successfully restarts a killed pod within a specified time frame without impacting the service’s overall error rate. This approach is critical because the goal of a first experiment is not just to test the system, but to validate the team’s chaos processes, monitoring tools, and abort mechanisms. Starting small minimizes risk while building the team’s “muscle memory” for conducting rigorous, scientific failure testing.
Question 6: Maturing the Practice
Section titled “Question 6: Maturing the Practice”A mature SRE team wants to ensure their auto-scaling policies continue to work across all 50 microservices as new code is deployed daily. They propose scheduling a 4-hour Game Day once a quarter to manually test this. Why might this approach be suboptimal, and what practice should they adopt instead?
Show Answer
Relying solely on quarterly Game Days is suboptimal because it leaves the system highly vulnerable to regressions introduced by daily deployments over the intervening three months. Once a system’s resilience to a specific failure mode (like auto-scaling under load) has been proven and validated during a Game Day, that failure mode should be transitioned to continuous chaos. Continuous chaos automatically and frequently injects these understood failures—often triggered by CI/CD pipelines—to ensure the system remains resilient as the codebase rapidly evolves. Game Days should primarily be reserved for exploring new, unknown failure modes rather than regression-testing known ones.
Hands-On Exercise: Draft a Chaos Experiment Document
Section titled “Hands-On Exercise: Draft a Chaos Experiment Document”Objective
Section titled “Objective”Create a complete Chaos Experiment Document for a realistic Kubernetes application. This exercise builds the skill of structured thinking about chaos — the most important skill before touching any tool.
Scenario
Section titled “Scenario”You are an SRE for an e-commerce platform running on Kubernetes. The platform has these services:
flowchart TD I[Ingress] --> AG["API Gateway<br/>(3 replicas)"] AG --> P["Product Service<br/>(2-3 replicas)"] AG --> C["Cart Svc<br/>(2-3 replicas)"] AG --> O["Order Svc<br/>(2-3 replicas)"] P --> DB[("Catalog DB")] C --> R[("Redis")] O --> PG["Payment Gateway<br/>(External API)"]Task 1: Define the steady state for this system (at least 5 metrics with sources and thresholds).
Task 2: Write 3 chaos experiment specifications following the format from this module. Each should target a different failure domain:
- Experiment A: Pod failure (pick a service)
- Experiment B: Network issue (latency or packet loss)
- Experiment C: Dependency failure (database or Redis)
Task 3: For each experiment, define:
- A precise hypothesis using the “We believe X even when Y because Z” format
- Blast radius assessment
- At least 3 abort conditions
- Rollback procedure
Task 4: Design a half-day Game Day schedule that runs all 3 experiments with proper debriefs.
Success Criteria
Section titled “Success Criteria”Your Chaos Experiment Document is complete when:
- Steady state uses business metrics (not just infrastructure metrics)
- All 3 hypotheses are specific and falsifiable
- Blast radius starts small (staging, single pod/connection)
- Abort conditions are metric-based and automatable
- Rollback procedures are specific kubectl commands, not vague descriptions
- Game Day schedule includes roles, debriefs, and buffer time
- The document could be handed to another engineer who could execute the experiments without additional context
Example Solution (Experiment A only)
Section titled “Example Solution (Experiment A only)”experiment: name: "Cart Service Pod Failure During Active Shopping" date: "2026-03-28" owner: "platform-sre" environment: staging
hypothesis: > We believe that shopping cart operations (add, view, update) will continue with less than 500ms p99 latency even when 1 of 3 Cart Service pods is killed, because Kubernetes will restart the pod within 30 seconds, the API Gateway has a 3-second retry with exponential backoff, and Redis holds the cart state independently of the Cart Service pods.
steady_state: - metric: cart_operation_success_rate expected: ">= 99.5%" source: prometheus - metric: cart_p99_latency expected: "<= 200ms" source: prometheus - metric: active_shopping_sessions expected: "within 10% of pre-experiment count" source: redis - metric: cart_service_pod_count expected: "3 (before), 2 (during), 3 (after recovery)" source: kube-state-metrics - metric: redis_connected_clients expected: ">= 2 (matching healthy pod count)" source: redis-exporter
injection: type: pod-kill target: cart-service namespace: staging replicas_before: 3 pods_to_kill: 1 method: chaos-mesh-podchaos
abort_conditions: - "cart_operation_success_rate < 95% for 1 minute" - "cart_p99_latency > 2s for 30 seconds" - "active_shopping_sessions drop > 25%"
blast_radius: scope: "staging environment only" services_affected: ["cart-service"] max_user_impact: "none (staging)" max_data_impact: "none (Redis holds state)"
rollback: | # For pod-kill, Kubernetes self-heals via ReplicaSet — verify pod recreation: kubectl get pods -l app=cart-service -n staging -w # If needed, restore original replica count: kubectl scale deployment/cart-service -n staging --replicas=3 kubectl wait --for=condition=available deployment/cart-service -n staging --timeout=60s
duration: "10 minutes"Complete all 4 tasks to build your chaos experiment planning muscle. This document becomes your template for every future experiment.
Summary
Section titled “Summary”Chaos Engineering is a disciplined, scientific practice for discovering systemic weaknesses before they cause real outages. It follows the scientific method: define steady state, form a hypothesis, design a controlled experiment with safety measures, execute, observe, and learn.
Key takeaways:
- It’s engineering, not destruction — hypotheses, controls, measurements
- Start small — staging, single pods, short durations
- Safety first — abort conditions, blast radius limits, team availability
- Culture matters — opt-in, blameless, documented, shared
- Two modes — Game Days for learning, continuous chaos for regression prevention
In the next module, you’ll put these principles into practice with Chaos Mesh, the most popular chaos engineering tool for Kubernetes.
Next Module
Section titled “Next Module”Continue to Module 1.2: Chaos Mesh Fundamentals — Install, configure, and run your first chaos experiments using Chaos Mesh on a real Kubernetes cluster.