Module 1.1: What is Systems Thinking?
Complexity: MEDIUM | Time: 35-45 minutes | Prerequisites: None (entry point to Platform track)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Analyze a system failure by mapping interconnected components rather than investigating services in isolation
- Explain how emergent properties arise from component interactions and why they cannot be predicted from individual parts alone
- Apply systems thinking frameworks (stocks-and-flows, boundaries, leverage points) to real infrastructure architectures
- Diagnose incidents by zooming out to identify upstream and downstream dependencies before restarting individual services
The 3 AM Incident
Tuesday, 2:47 AM. The on-call engineer’s phone explodes with alerts.
“Payment service latency increased.” Standard stuff. They check the dashboards—CPU at 23%, memory looks fine, no error spikes in the logs. Weird.
They do what any reasonable engineer would do at 3 AM: restart the service. Latency drops immediately. Back to sleep.
3:52 AM. Same alert. Latency is back.
Restart again. Works again. Sleep again.
4:23 AM. Again.
Now they’re annoyed. They add more replicas. Check the code for memory leaks. Review recent deployments. Nothing. The service is fine. But the system is broken.
At 6 AM, exhausted and frustrated, they finally do what should have been done hours earlier: zoom out. Instead of staring at the payment service, they look at what’s around it.
That’s when they see it. A batch job—completely unrelated to payments—started running at 2:30 AM. It’s processing end-of-day reports. It’s hammering the shared database with massive analytical queries. Those queries are holding locks that block the payment service’s small, fast transactions.
The payment service was perfect. It was the victim of a system-level problem that couldn’t be seen by looking at individual components.
This is systems thinking. The ability to see the whole, not just the parts. To understand that behavior emerges from relationships, not components. To stop playing whack-a-mole with symptoms and start solving actual problems.
This scenario plays out in engineering teams every day. This module will teach you how to avoid it.
What You’ll Learn
- Why looking at components in isolation is a trap
- How behavior emerges from interactions (not from parts)
- The iceberg model for seeing below the surface
- The vocabulary that will change how you troubleshoot
- How to apply this to your next incident (starting tonight)
Part 1: The Problem with Component Thinking
The Mechanic vs. The Engineer
Here’s an analogy that helped me understand this:
A mechanic fixes cars by testing parts until they find the broken one. Alternator dead? Replace it. Brake pads worn? New ones. This works because cars are complicated but not complex—the same input always produces the same output.
An engineer understands how the car actually works. They know that a weak alternator doesn’t just fail—it causes the battery to drain, which causes the car to run lean, which damages the catalytic converter, which triggers the check engine light. They see the cascade.
Most ops teams are mechanics. They replace parts (restart services, scale pods, rollback deployments) until the alert goes away. Sometimes that’s enough. But for complex systems, you need to be an engineer.
```
THE MECHANIC VS THE ENGINEER
═══════════════════════════════════════════════════════════════════

MECHANIC APPROACH:
─────────────────────────────────────────────────────────────────
Alert fires
   │
   ▼
Check the broken thing
   │
   ▼
Restart it
   │
   ▼
Alert clears → Done! (until next time)

ENGINEER APPROACH:
─────────────────────────────────────────────────────────────────
Alert fires
   │
   ▼
What's the pattern? When did this start?
   │
   ▼
What changed in the system? What's connected?
   │
   ▼
What's the actual cause? (Often not where the alert fired)
   │
   ▼
Fix the cause → Problem actually solved
```

What is a System?
A system is a set of interconnected elements organized to achieve a purpose.
Three key parts:
| Part | What It Is | Example |
|---|---|---|
| Elements | The things you can point at | Pods, services, databases, queues |
| Interconnections | How elements affect each other | Network calls, shared resources, data flows |
| Purpose | Why the system exists | Process payments, serve users, store data |
Here’s the crucial insight: you can understand every element perfectly and still not understand the system.
```
COMPONENT VIEW                      SYSTEMS VIEW
─────────────────                   ─────────────────

┌─────────┐                         ┌─────────┐
│Service A│                         │Service A│◀──────┐
└─────────┘                         └────┬────┘       │
                                         │            │
┌─────────┐                         ┌────▼────┐       │ Feedback
│Service B│                         │Service B│───────┤ loops
└─────────┘                         └────┬────┘       │
                                         │            │
┌─────────┐                         ┌────▼────┐       │
│Database │                         │Database │───────┘
└─────────┘                         └─────────┘

"Three healthy boxes"               "A system with behavior"
(tells you nothing)                 (tells you everything)
```

💡 Did You Know? The word “system” comes from the Greek systema, meaning “organized whole.” The ancient Greeks understood something we keep forgetting: the whole is fundamentally different from the sum of its parts. Aristotle wrote about this 2,400 years ago. We’re still learning the same lesson.
Emergence: Where System Behavior Lives
Emergence is when a system exhibits properties that none of its individual parts possess.
This is the most important concept in this entire module. Read it again.
Your brain is made of neurons. No single neuron is conscious—it’s just an electrochemical switch. But connect 86 billion of them in the right way, and suddenly you’re reading this sentence and thinking about it. Consciousness emerges.
In distributed systems:
- Individual service metrics: Service A: 50ms, Service B: 40ms, Service C: 30ms
- System behavior: p99 latency of 2000ms, random timeouts, cascade failures
Where did the 2000ms come from? Where did the cascades come from? Not from any individual service. They emerged from the interactions—retries that amplify load, connection pools that exhaust, locks that contend.
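The retry-amplification mechanic can be sketched with a toy model (all numbers, and the shape of the `failure_rate` function, are hypothetical): failed calls get retried, retries add load, and added load raises the failure rate, so the offered load can settle far above actual demand.

```python
# Toy model of retry amplification (all numbers hypothetical).
# Each failed call is retried up to MAX_RETRIES times, and the
# backend's failure rate rises once total load exceeds capacity.

MAX_RETRIES = 3
CAPACITY = 1200.0  # requests/sec the backend handles before failing

def failure_rate(offered_rps: float) -> float:
    """Fraction of calls that fail, rising with load past capacity."""
    overload = max(0.0, offered_rps - CAPACITY) / CAPACITY
    return min(0.9, overload)

def offered_load(base_rps: float) -> float:
    """Iterate until demand-plus-retries stabilizes."""
    load = float(base_rps)
    for _ in range(50):
        p = failure_rate(load)
        # Expected attempts per request with up to MAX_RETRIES retries:
        attempts = sum(p ** k for k in range(MAX_RETRIES + 1))
        load = base_rps * attempts
    return load

print(offered_load(800))   # under capacity: no amplification
print(offered_load(1500))  # over capacity: retries multiply the load
```

Under capacity the loop converges to the raw demand; over capacity, retries multiply it several-fold. That multiplied load is emergent: it exists in no single service, only in the interaction between the retry policy and the shared backend.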
```
EMERGENCE: THE 50ms SERVICES THAT CREATE 2-SECOND LATENCY
═══════════════════════════════════════════════════════════════════

Individual service latencies:
───────────────────────────────────────────────────────────────────
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  Service A  │   │  Service B  │   │  Service C  │
│  50ms avg   │   │  40ms avg   │   │  30ms avg   │
│  Looking    │   │  Looking    │   │  Looking    │
│  healthy!   │   │  healthy!   │   │  healthy!   │
└─────────────┘   └─────────────┘   └─────────────┘

What the dashboards show: "Everything is fine ✅"

Actual system behavior:
───────────────────────────────────────────────────────────────────
Request hits A → A calls B → B times out → A retries
→ B finally responds → A calls C → C is slow because
B's retry exhausted the shared connection pool → C times out
→ Request fails → User retries → Adds more load → Everything
gets worse

What users experience: "Your system is broken 🔥"

The 2-second latency doesn't exist in any component.
It EMERGES from their interaction.
```

Thought Exercise (2 minutes)
Think of a behavior in your system that only exists when components interact.
Examples:
- Shopping cart totals (product service + cart service + pricing rules)
- Search ranking (search service + recommendation engine + user history)
- Cascading failures (any service + retry logic + shared resources)
Where does that behavior “live”? Not in any single service’s code.
Why Reductionism Fails
Reductionism is the scientific approach of understanding something by breaking it into parts and studying each part separately.
It works brilliantly for complicated machines. Want to understand a car engine? Take it apart. Study each piece. Reassemble. Done.
It fails catastrophically for complex systems. Here’s why:
| Aspect | Complicated (car engine) | Complex (distributed system) |
|---|---|---|
| Behavior | Predictable from parts | Emergent, surprising |
| Cause & Effect | Linear, traceable | Circular, networked |
| Analysis | Take apart, study pieces | Must observe whole in motion |
| Fixing | Replace broken part | Change relationships |
| Same input | Same output | Different output each time |
This is why “works on my machine” is such a meme. Your laptop isn’t the system. The system includes the network, other services, the database state, the load from other users, and a hundred other interacting factors.
The Optimization Trap
Here’s a counterintuitive truth: optimizing individual components often makes the system worse.
Example: You make Service A 10x faster. Congratulations! Now it hammers the database 10x harder, causing lock contention that slows Services B and C. Global latency increases. Users are angrier than before.
This is called suboptimization—winning locally while losing globally. It’s one of the most common mistakes in distributed systems.
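Back-of-the-envelope queueing math makes the trap concrete (the capacity numbers are invented, and the M/M/1 formula is only an approximation): speeding up one service raises the arrival rate at a shared dependency, and once that dependency saturates, everyone's latency grows.

```python
# Hypothetical sketch of suboptimization via a shared database.
# Mean latency in an M/M/1 queue is 1 / (capacity - arrival_rate):
# it explodes as the arrival rate approaches capacity.

DB_CAPACITY = 500.0  # queries/sec the shared database can serve

def db_latency_ms(arrival_rate: float) -> float:
    """Mean query latency in ms; infinite once the DB is saturated."""
    if arrival_rate >= DB_CAPACITY:
        return float("inf")
    return 1000.0 / (DB_CAPACITY - arrival_rate)

# Before: "slow" Service A issues 100 queries/sec; B and C add 300.
print(db_latency_ms(100 + 300))   # healthy: 10 ms per query

# After: A is optimized 10x and now issues 1000 queries/sec.
print(db_latency_ms(1000 + 300))  # saturated: latency goes to infinity
```

Service A got faster; the system got unusable. The local win became a global loss because the optimization changed a relationship (load on the shared database), not just a component.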
Part 2: The Iceberg Model
Seeing Below the Surface
Most troubleshooting happens at the surface level. An alert fires, we react. Another alert, another reaction. We’re playing an endless game of whack-a-mole.
The iceberg model teaches us to look deeper:
```
THE ICEBERG MODEL
═══════════════════════════════════════════════════

      ≈≈≈≈≈≈ VISIBLE (what we react to) ≈≈≈≈≈≈

╔═══════════════════════════════════════════════════╗
║ EVENTS                                            ║
║ "Payment service is slow right now"               ║
║                                                   ║
║ Response: Restart it, scale it up                 ║
║ Mindset:  REACTIVE                                ║
╠═══════════════════════════════════════════════════╣
║ PATTERNS                                          ║
║ "Payment is slow every Monday 6-9 AM"             ║
║                                                   ║
║ Response: Create runbook, schedule extra pods     ║
║ Mindset:  ADAPTIVE                                ║
╠═══════════════════════════════════════════════════╣
║ STRUCTURES                                        ║
║ "Batch job shares database with payments"         ║
║ "No resource isolation between workloads"         ║
║                                                   ║
║ Response: Separate resources, add limits          ║
║ Mindset:  REDESIGN                                ║
╠═══════════════════════════════════════════════════╣
║ MENTAL MODELS                                     ║
║ "All workloads can safely share infrastructure"   ║
║ "Batch jobs don't affect real-time traffic"       ║
║                                                   ║
║ Response: New policies, architecture reviews      ║
║ Mindset:  TRANSFORM                               ║
╚═══════════════════════════════════════════════════╝

THE DEEPER YOU GO, THE MORE LEVERAGE YOU HAVE
```

Each Level Explained
Event level (what happened?):
- “The payment service is slow right now.”
- Response: Restart, scale up, add more resources
- Limitation: You’ll be doing this forever
Pattern level (what’s been happening?):
- “This happens every Monday morning.”
- Response: Schedule extra capacity, create runbooks
- Limitation: You’re managing the problem, not solving it
Structure level (what’s causing this?):
- “The batch job and payment service share a database with no isolation.”
- Response: Resource quotas, separate databases, connection pooling
- Limitation: Fixes this problem, but similar ones will appear
Mental model level (what beliefs allow this to exist?):
- “We assumed production workloads don’t need resource isolation.”
- Response: New architecture principles, design review processes
- Impact: Prevents entire categories of problems
War Story: The Monday Mystery (Extended Cut)
I want to tell you about the incident that taught me the iceberg model—the hard way.
A fintech company had “random” payment failures every Monday. They weren’t truly random, of course, but nobody had connected the dots. Every Monday morning, the on-call engineer would get paged, restart some services, add some pods, and things would stabilize. They wrote a runbook. They scheduled extra capacity for Mondays. They were world-class at managing the symptom.
For eight months.
One day, a new engineer—fresh out of university, hadn’t learned to accept dysfunction yet—asked an innocent question:
“Why only Mondays?”
The senior engineers looked at each other. Nobody knew. They’d always just… dealt with it.
The new engineer started digging. She pulled metrics from the past year. She correlated payment failures with everything she could find: deployment schedules, traffic patterns, marketing campaigns, infrastructure changes.
And there it was. At exactly 2:30 AM every Sunday night, CPU usage on one database server spiked. By 6 AM Monday, the spike ended—but the damage was done. Connection pools exhausted. Queries backed up. Payment timeouts cascading.
The culprit? A “Weekly Analytics Summary” batch job. Created two years ago. When the company was small. Processing a few thousand transactions. Now processing millions. What used to take 30 minutes now took 4 hours—and it was still growing.
Nobody owned this job anymore. The engineer who wrote it had left. It ran on the same database as real-time payments because, at the time, “it’s just a small report.”
The Fix:
| Level | What They Did |
|---|---|
| Event | (What they’d been doing) Restart services, add pods |
| Pattern | Added Monday runbook, scheduled extra capacity |
| Structure | Moved batch job to replica database, added connection pool limits |
| Mental Model | New policy: “No analytical workloads on transactional databases. Ever.” |
The Monday pages stopped. Not because they got better at responding—because they eliminated the cause.
Total time to fix once they understood the problem: 2 hours. Time they’d spent managing the symptom over 8 months: ~200 engineer-hours.
Lesson: The question “why only Mondays?” was worth hundreds of hours. The right question at the right level changes everything.
Part 3: Systems Thinking Vocabulary
To see systems clearly, you need the right words. These terms will become essential to how you troubleshoot and communicate.
The Essential Terms
| Term | Definition | Example |
|---|---|---|
| System | Interconnected elements with a purpose | Your entire production stack |
| Boundary | What’s in vs. out of the system | Your services vs. AWS infrastructure |
| Stock | Accumulation within the system | Queue depth, connection count, error budget |
| Flow | Rate of change to a stock | Requests/second, pod creation rate |
| Feedback | When a system’s output influences its input | Autoscaler: high CPU → more pods → lower CPU |
| Delay | Time between cause and effect | Metric collection lag, autoscaler reaction time |
Stocks and Flows: The Bathtub Model
The easiest way to understand stocks and flows is to think of a bathtub:
```
STOCKS AND FLOWS: THE BATHTUB
═══════════════════════════════════════════════════

        INFLOW (faucet)
             │
             │ 10 liters/min
             ▼
   ┌───────────────────┐
   │                   │
   │  STOCK: Water     │
   │  (liters in tub)  │
   │                   │
   │  Current: 50L     │
   │                   │
   └────────┬──────────┘
            │
            ▼ 8 liters/min
        OUTFLOW (drain)

RULES:
─────────────────────────────────────────────────────
• If INFLOW > OUTFLOW → Stock rises (tub fills up)
• If INFLOW < OUTFLOW → Stock falls (tub drains)
• If INFLOW = OUTFLOW → Stock stable (water level constant)

Right now: 10 in, 8 out → Tub is filling at 2L/min → Overflow coming!
```

Now apply this to your systems:
```
STOCKS AND FLOWS: REQUEST QUEUE
═══════════════════════════════════════════════════

    INFLOW: Incoming requests
             │
             │ 1000 req/sec
             ▼
   ┌───────────────────┐
   │                   │
   │  STOCK: Queue     │
   │  depth            │
   │                   │
   │  Current: 5000    │
   │  requests         │
   │                   │
   └────────┬──────────┘
            │
            ▼ 800 req/sec
    OUTFLOW: Processed requests
```

You can’t fix latency by looking at inflow or outflow alone. The STOCK determines latency: 5000 queued ÷ 800/sec = 6.25 seconds.
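That arithmetic is just Little’s-law-style bookkeeping, and a few lines make it explicit (numbers taken from the diagram above):

```python
# Queue-as-stock: latency is backlog divided by drain rate.

inflow = 1000.0     # requests/sec arriving
outflow = 800.0     # requests/sec processed
queue_depth = 5000  # current stock (queued requests)

# Time a newly queued request waits behind the existing backlog:
wait_s = queue_depth / outflow
print(wait_s)  # 6.25 seconds, matching the diagram

# And the stock keeps growing while inflow exceeds outflow:
growth_per_sec = inflow - outflow
print(growth_per_sec)  # backlog grows by 200 requests every second
```

Notice that neither flow alone tells you the latency; you need the stock. That is why dashboards showing only rates (req/sec) can look fine while users wait seconds.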
The only ways to reduce latency:
1. Reduce inflow (rate limiting, shedding load)
2. Increase outflow (more capacity, faster processing)
3. Accept that the backlog will drain eventually (if traffic drops)

Delays: The Hidden Cause of Chaos
Delays are everywhere in distributed systems:
| Delay | Typical Duration |
|---|---|
| Metric collection | 10-60 seconds |
| Alerting pipeline | 30-120 seconds |
| Autoscaler reaction | 1-5 minutes |
| DNS propagation | Seconds to hours |
| Human response | Minutes to hours |
| Rolling deployment | Minutes to hours |
Why delays matter: They cause oscillation and overshoot.
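The dynamic is easy to reproduce in a discrete-time sketch (all constants are hypothetical): scaling decisions take `DELAY` ticks to land, so the controller keeps acting on stale capacity and never settles.

```python
# Minimal autoscaler-with-delay simulation (hypothetical constants).
# Pods ordered now only become ready DELAY ticks later, so a naive
# controller reacts to stale state, overshoots, then undershoots.

DELAY = 3        # ticks between a scaling decision and pods being ready
LOAD = 100.0     # steady demand (work units per tick)
PER_POD = 10.0   # capacity each ready pod provides

ready = 5                 # pods currently serving traffic
pipeline = [0] * DELAY    # scaling decisions still in flight

history = []
for tick in range(15):
    ready = max(1, ready + pipeline.pop(0))  # delayed decisions land now
    # Naive controller: jump straight to the pod count needed right now,
    # ignoring the decisions already in the pipeline.
    desired = int(LOAD / PER_POD)            # 10 pods would be perfect
    pipeline.append(desired - ready)         # may be negative: scale down
    history.append(ready)

print(history)  # pod count swings around 10 instead of settling there
```

With zero delay this controller would hit 10 pods in one step and stay there. With a three-tick delay it overshoots, collapses, and overshoots again, which is exactly the oscillation the next diagram dramatizes.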
```
AUTOSCALER OSCILLATION: A TRAGEDY IN THREE ACTS
═══════════════════════════════════════════════════════════════════

ACT 1: The Spike
─────────────────────────────────
Load:  ═══════════════╗  (spikes above target)
Pods:  (still at baseline)
       Autoscaler: "Need more pods!"

ACT 2: The Overshoot
─────────────────────────────────
Load:  (already dropped back down)
Pods:  ══════════════════════╬═══  ← Way too many pods!
       New pods finally ready! But the load already dropped.

ACT 3: The Oscillation
─────────────────────────────────
Load:      ╱╲    ╱╲    ╱╲
          ╱  ╲  ╱  ╲  ╱  ╲
       ──╱────╲╱────╲╱────╲────── Target

Delay → Overshoot → Undershoot → Repeat
```

THE FIX: Account for delays. Scale gradually. Use predictive scaling. The delay isn’t a bug—it’s physics. Design around it.

💡 Did You Know? The famous “thundering herd” problem is a delay-induced catastrophe. A cache expires. All requests hit the database simultaneously. The database slows down. Requests time out and retry. More load. More timeouts. More retries. The delay between cache miss and successful refill creates a feedback loop that amplifies the original problem exponentially.
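One common mitigation for that stampede is to add random jitter to cache TTLs, so entries cached at the same moment do not all expire at the same moment (the TTL and jitter values below are hypothetical):

```python
import random

BASE_TTL = 300.0  # cache lifetime in seconds (hypothetical)
JITTER = 0.10     # spread expiries over ±10% of the base TTL

def jittered_ttl(base: float = BASE_TTL, jitter: float = JITTER) -> float:
    """TTL with random jitter so keys cached together expire apart."""
    return base * (1 + random.uniform(-jitter, jitter))

# 1000 keys cached at the same instant no longer expire at the same
# instant; their misses are spread across roughly a 60-second window.
ttls = [jittered_ttl() for _ in range(1000)]
print(min(ttls), max(ttls))
```

Jitter does not remove the delay between miss and refill; it stops every key from hitting that delay simultaneously, which breaks the reinforcing loop before it starts.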
Part 4: Applying Systems Thinking
The Questions That Change Everything
When troubleshooting or designing systems, systems thinkers ask different questions:
| Normal Question | Systems Thinking Question |
|---|---|
| "Which service is broken?" | "What changed in the system as a whole?" |
| "Who deployed what?" | "What feedback loops are active?" |
| "What’s the error?" | "What’s the pattern over time?" |
| "How do I fix this?" | "What structure enables this problem?" |
| "Is this service healthy?" | "Is the system achieving its purpose?" |
A Systems Thinking Troubleshooting Session
Scenario: Users report intermittent slowness. Dashboards show all services green.
Component approach (what most people do):
- Check each service’s CPU, memory, errors
- Everything looks fine
- Blame the network
- Add more logging
- Wait for it to happen again
- Still confused
Systems approach:
```
Step 1: MAP THE SYSTEM
───────────────────────────────────────────────────────────────────
Draw connections, not just boxes. What calls what? What shares what?

User ───▶ API ──▶ Cache ──┬──▶ Service A ──┐
                          │                │
                          ├──▶ Service B ──┼──▶ DB
                          │                │
                          └──▶ Service C ──┘

↑ Both B and C share the DB connection pool.
  Did we check pool exhaustion?

Step 2: IDENTIFY FEEDBACK LOOPS
───────────────────────────────────────────────────────────────────
• Cache misses → DB load → Slower queries → More timeouts
  → More retries → More DB load → ... (Reinforcing loop!)
• Rate limiter → Rejected requests → Less load → Faster response
  → Rate limiter allows more → ... (Balancing loop ✓)

Step 3: LOOK FOR STOCKS
───────────────────────────────────────────────────────────────────
• Queue depths? Growing.
• Connection pool? 100% utilized!
• Error budget? Almost gone.

Step 4: CHECK THE DELAYS
───────────────────────────────────────────────────────────────────
• How old are these metrics? 60 seconds old.
• Autoscaler cooldown? 5 minutes.
• When did the pattern start? Yesterday at 3 PM.

Step 5: GO DEEPER THAN EVENTS
───────────────────────────────────────────────────────────────────
• Is this a pattern? Yes—happens during peak hours.
• What structure enables it? Shared DB connection pool with no limits.
• What mental model? "Services are independent."
```

Root cause found: Services B and C share a database connection pool. During peak load, Service B takes all connections for a slow analytics query. Service C starves. Timeouts cascade.
Fix: Per-service connection pool limits.
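That fix can be sketched as a per-service quota in front of the shared pool (the service names, quota sizes, and the `run_query` helper are all hypothetical): Service B’s slow analytics queries can now exhaust only its own slots, never Service C’s.

```python
import threading

# Hypothetical sketch: cap each service's share of a shared DB pool
# with its own semaphore, so one service's slow queries cannot
# starve another service's fast ones.

POOL_LIMITS = {"service-b": 20, "service-c": 20}  # per-service caps
_quotas = {name: threading.BoundedSemaphore(n)
           for name, n in POOL_LIMITS.items()}

def run_query(service: str, query):
    """Run a query under the calling service's own connection quota."""
    quota = _quotas[service]
    if not quota.acquire(timeout=1.0):
        # Only this service's quota is gone; others are unaffected.
        raise TimeoutError(f"{service} exhausted its own pool quota")
    try:
        return query()  # borrow a real pooled connection here
    finally:
        quota.release()

# Even if service-b holds all 20 of its slots, service-c still runs:
print(run_query("service-c", lambda: "ok"))
```

The structural point is the isolation boundary, not the semaphore: the same idea appears as per-tenant connection limits in PgBouncer-style poolers or per-client rate limits at a gateway.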
Common Mistakes
| Mistake | Why It Hurts | What To Do Instead |
|---|---|---|
| Treating symptoms | Problem recurs, wastes time | Use iceberg model—go deeper |
| Optimizing components | Can make system worse | Optimize for system-level goals |
| Ignoring delays | Causes oscillation, overshoot | Map delays explicitly |
| Tight system boundaries | Miss external dependencies | Include what affects behavior |
| Looking for THE root cause | Complex systems have multiple causes | Look for contributing factors |
| Assuming independence | Services affect each other | Map connections and shared resources |
Question 1
“A system’s behavior is best understood by studying its individual components in isolation.” True or false?
Show Answer
False. A system’s behavior emerges from the interactions between components, not the components themselves. You can have five healthy services that together create an unhealthy system. Studying parts in isolation misses the relationships, feedback loops, and emergent properties that determine actual behavior.
This is why “all services green” and “users unhappy” can coexist.
Question 2
What is “emergence” and why does it matter for troubleshooting?
Show Answer
Emergence is when a system exhibits properties that none of its individual parts possess.
Why it matters for troubleshooting:
- The 2-second latency doesn’t exist in any single service
- Cascading failures don’t exist in any single service
- The behavior you’re trying to fix might not be in any component
- You have to look at interactions, not just parts
Examples: Traffic jams emerge from cars (no car is a jam). Consciousness emerges from neurons (no neuron thinks). Cascading failures emerge from microservices (no service is a cascade).
Question 3
In the iceberg model, why is addressing mental models more impactful than addressing events?
Show Answer
Mental models are the beliefs and assumptions that shape the structures we build. Structures create patterns. Patterns produce events.
| Level | Leverage | Example Fix |
|---|---|---|
| Events | Low | Restart service when it’s slow |
| Patterns | Medium | Schedule extra capacity for Mondays |
| Structures | High | Add connection pool isolation |
| Mental Models | Highest | "All workloads must declare resource needs" |
Changing mental models prevents entire categories of problems—not just the current incident, but all future incidents that would arise from the same flawed assumptions.
The Monday Mystery team spent 8 months managing events. Once they changed the mental model (“batch jobs don’t affect real-time traffic” → “all workloads must be isolated”), the problem and all similar future problems disappeared.
Question 4
Why do delays in distributed systems cause oscillation?
Show Answer
Delays cause oscillation because by the time a corrective action takes effect, the original condition may have changed.
Example: Autoscaler sees high CPU → Adds pods → Pods take 3 minutes to start → By then, load dropped → Now we have too many pods → Autoscaler removes pods → Pods take 1 minute to terminate → By then, load increased → Now we have too few pods → Repeat.
Without delays, adjustments would be instant and smooth. Delays create the gap where overshoot and undershoot live.
Solutions:
- Account for delays in your scaling algorithms
- Use predictive scaling (ML-based) instead of reactive
- Add dampening (don’t react to every fluctuation)
- Accept some oscillation as normal, design for it
Hands-On Exercise
Section titled “Hands-On Exercise”Part A: Observe Emergence in Kubernetes (15 minutes)
Objective: See how system behavior emerges from component interactions.
```shell
# Create namespace
kubectl create namespace systems-lab
```

```shell
# Deploy interconnected services
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  namespace: systems-lab
spec:
  replicas: 2
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: nginx:alpine
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: backend
  namespace: systems-lab
spec:
  selector:
    app: backend
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: systems-lab
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: curlimages/curl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                curl -s -o /dev/null -w "%{http_code}\n" http://backend/
                sleep 1
              done
EOF
```

```shell
# Watch the system
kubectl get pods -n systems-lab -w
```

Now break something and watch the system respond:
```shell
# In another terminal, kill a backend pod
# (kubectl delete rejects a name combined with -l, so resolve the
#  pod name first, then delete it by name)
kubectl delete pod -n systems-lab \
  "$(kubectl get pod -n systems-lab -l app=backend -o jsonpath='{.items[0].metadata.name}')" \
  --wait=false
```

What to observe:
- Frontend continues working (load balances to surviving backend)
- New backend pod is created automatically
- System self-heals without human intervention
This is emergence. The self-healing behavior doesn’t exist in any individual pod. It emerges from the interactions between Deployment controller, ReplicaSet, Service, and kube-proxy.
Clean up:
```shell
kubectl delete namespace systems-lab
```

Part B: Apply the Iceberg Model (20 minutes)
Pick a recurring issue in your environment (or use this hypothetical):
“The checkout page is slow during sales events.”
Apply the iceberg model:
| Level | Analysis |
|---|---|
| Event | What happens? |
| Pattern | When? How often? What correlates? |
| Structure | What architecture/config enables this? |
| Mental Model | What assumption allowed this structure? |
Example answer:
| Level | Answer |
|---|---|
| Event | Checkout timeouts during Black Friday |
| Pattern | Happens every sale event. Correlates with >10x traffic. |
| Structure | Checkout service calls inventory synchronously. No caching. Single DB. |
| Mental Model | "Real-time inventory is always required" (but is it for checkout display?) |
Success Criteria:
- Observed pod deletion and automatic recovery in Part A
- Can explain what “emergence” you witnessed
- Completed iceberg analysis for Part B
- Identified at least one mental model that enables the problem
Did You Know?
- Systems thinking was born in biology, not engineering. Ludwig von Bertalanffy developed General Systems Theory in the 1930s to understand living organisms as integrated wholes, not collections of parts.
- The Apollo program was one of the first engineering projects to formally apply systems thinking. With 2 million parts and 400,000 people, NASA couldn’t understand the spacecraft by studying components—they had to see the whole.
- W. Edwards Deming, the quality management guru, estimated that 94% of problems are caused by the system, not the individual workers. When something goes wrong, the structure almost always matters more than the person.
- Jeff Bezos attributes Amazon’s success to “working backwards from the customer”—a systems thinking approach. Instead of building components and hoping they create good outcomes, he defines the desired system behavior first.
Further Reading
- “Thinking in Systems: A Primer” by Donella Meadows — The foundational text. Readable, profound, and changed how I see everything.
- “How Complex Systems Fail” by Richard Cook — An 18-point paper that every engineer should read. Takes 10 minutes. Will change your career.
- “The Fifth Discipline” by Peter Senge — Systems thinking applied to organizations. Explains why your team keeps having the same problems.
- “Drift into Failure” by Sidney Dekker — How systems gradually drift toward catastrophe while every individual decision seems reasonable.
Next Module
Module 1.2: Feedback Loops — Understanding reinforcing and balancing feedback, and why your autoscaler sometimes makes things worse.
“To understand is to perceive patterns.” — Isaiah Berlin
“You can’t understand a system by taking it apart. You can only understand it by seeing it in motion.” — Adapted from Russell Ackoff