Module 1.2: Feedback Loops
Цей контент ще не доступний вашою мовою.
Complexity:
[MEDIUM]Time to Complete: 35-40 minutes
Prerequisites: Module 1.1: What is Systems Thinking?
Track: Foundations
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Identify reinforcing and balancing feedback loops in production systems and predict their behavior under load
- Diagnose runaway cascades caused by positive feedback loops such as retry storms, autoscaler thrashing, and connection pool exhaustion
- Design damping mechanisms (circuit breakers, backoff strategies, rate limiters) that interrupt destructive feedback cycles
- Evaluate whether a system’s feedback loops will stabilize or amplify under failure conditions
The Black Friday Meltdown
Section titled “The Black Friday Meltdown”November 29th, 2019. 10:47 AM Eastern Time.
The ShopMart engineering team is watching dashboards with coffee in hand. Black Friday traffic is building—already 3x normal load, heading toward 10x by afternoon. The autoscaler is doing its job, spinning up new pods. Everything looks green.
10:52 AM. Someone notices something odd. Database connection count is climbing faster than traffic. Not by a little—exponentially. Traffic up 20%, connections up 400%.
The senior DBA pulls up query logs. Queries are taking 3x longer than normal. Nothing has changed in the code. The database itself isn’t maxed out—CPU at 40%, memory fine.
10:58 AM. The first timeouts start appearing. Payment service can’t reach the database. It retries. All instances retry simultaneously—thousands of retries per second.
11:04 AM. The circuit breaker finally trips, but it’s too late. The database connection pool is completely exhausted. New pods are spinning up (autoscaler sees high latency), but each new pod tries to grab connections from an empty pool. More pods, more connection attempts, more failures.
11:12 AM. Complete outage. The site displays “temporarily unavailable” to 400,000 shoppers.
The root cause? A single slow query. One poorly indexed query that normally took 50ms started taking 500ms under load. Connections held 10x longer meant connection pool filled 10x faster. Which meant more queuing. Which meant even longer waits. Which meant more timeouts. Which meant retries. Which added more load. Which made queries even slower.
A feedback loop turned a minor performance issue into a complete system collapse in under 25 minutes.
Why This Module Matters
Section titled “Why This Module Matters”Every production outage has a story. But if you look closely at the worst ones—the cascading failures, the death spirals, the “everything went wrong at once” disasters—they share a common element: feedback loops that amplified small problems into catastrophic failures.
Understanding feedback loops is understanding the DNA of system behavior.
A system with well-designed feedback loops is antifragile—it stabilizes under stress, recovers from failures, and gets stronger with use. A system with poorly designed feedback loops is a time bomb—stable in normal conditions, catastrophic when stressed.
This module teaches you to:
- Recognize feedback loops before they bite
- Predict which loops will help and which will harm
- Design systems that use feedback safely
- Break dangerous loops before they cascade
The Thermostat Principle
Your home thermostat is a perfect feedback loop. Temperature drops below setpoint → heater turns on → temperature rises → heater turns off. This is a balancing loop—it opposes change, maintaining stability.
Now imagine if your thermostat were wired backwards: temperature drops → heater turns off. That’s a reinforcing loop—it amplifies change. Your house would freeze.
Most production incidents are just thermostats wired backwards. The system is trying to help, but its corrective actions make things worse. This module teaches you to spot backwards thermostats before they freeze your production systems.
What You’ll Learn
Section titled “What You’ll Learn”- The two fundamental types of feedback loops
- Why delays turn helpful loops into destructive oscillations
- The six most dangerous feedback patterns in distributed systems
- How to design systems that use feedback safely
- Techniques for breaking loops once they start
Part 1: The Two Types of Feedback Loops
Section titled “Part 1: The Two Types of Feedback Loops”All feedback loops fall into two categories. Master this distinction and you’ll understand half of all production incidents.
1.1 Reinforcing Loops: The Amplifiers
Section titled “1.1 Reinforcing Loops: The Amplifiers”Reinforcing loops amplify change. Whatever direction the system is moving, reinforcing loops push it further. They’re called “positive feedback” not because they’re good (they usually aren’t), but because they add to the existing trend.
Think of a microphone placed in front of a speaker. Sound enters the mic → gets amplified → comes out the speaker → enters the mic louder → gets amplified more. Without intervention, you get that ear-piercing screech within seconds.
THE ANATOMY OF A RETRY STORM═══════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────┐ │ THE DEATH SPIRAL │ │ │ ▼ │┌─────────┐ ┌─────────┐ ┌─────────┐ ││ Latency │─────▶│Timeouts │─────▶│ Retries │─────────┘│Increases│ │ Occur │ │ Happen │└─────────┘ └─────────┘ └─────────┘ ▲ │ │ │ Each retry adds load │ ▼ │ ┌─────────────────┐ └────────────────────────│ Server becomes │ │ more overloaded │ └─────────────────┘
Timeline of a real incident:
t=0:00 Latency: 200ms Timeouts: 0/sec Load: 1000 req/st=0:30 Latency: 500ms Timeouts: 50/sec Load: 1050 req/st=1:00 Latency: 2000ms Timeouts: 400/sec Load: 1400 req/st=1:30 Latency: 5000ms Timeouts: 1200/sec Load: 2200 req/st=2:00 Latency: DEAD Timeouts: ALL Load: 4000+ req/s
From "a little slow" to "completely dead" in 2 minutes.The mathematics of reinforcing loops are terrifying. If each loop iteration increases load by just 10%, after 10 iterations you’re at 2.6x original load. After 20 iterations, 6.7x. After 30, 17x. This is exponential growth—and it happens fast.
Common reinforcing loops in production:
| Loop | How It Works | Why It’s Dangerous |
|---|---|---|
| Retry storms | Failure → retry → more load → more failure | Can 10x load in minutes |
| Cache stampede | Cache expires → all hit DB → DB slows → cache stays empty | Synchronized devastation |
| Connection pool exhaustion | Slow queries → connections held → pool fills → more waiting | Everything stops |
| Memory pressure | Swapping → slower processing → more memory pressure | Gradual then sudden death |
| Alert fatigue | Too many alerts → ignored → more incidents → more alerts | Human systems fail too |
1.2 Balancing Loops: The Stabilizers
Section titled “1.2 Balancing Loops: The Stabilizers”Balancing loops oppose change. They push the system back toward a target or equilibrium. They’re called “negative feedback” because they subtract from the current trend.
Your body temperature is maintained by balancing loops. Too hot → you sweat → cooling → temperature drops. Too cold → you shiver → heat generation → temperature rises. The target is 37°C, and your body fights any deviation.
HOW AUTOSCALING SHOULD WORK═══════════════════════════════════════════════════════════════
Target: 70% CPU │ ┌────┴────┐ │ │ ┌───────────────▼─────────▼───────────────┐ │ │ │ Measure ──▶ Compare ──▶ Adjust │ │ │ │ 70% vs Scale │ │ ━━━━━━ target pods │ │ │ └─────────────────────────────────────────┘ │ │ │ │ ▼ ▼ ┌─────────┐ ┌─────────┐ │CPU high │ │Add pods │ │ (85%) │ │ │ └────┬────┘ └────┬────┘ │ │ │ │ └─────────┬────────┘ │ ▼ ┌──────────┐ │ CPU back │ │ to 70% │ └──────────┘
This is a BALANCING loop: it opposes change.High CPU → action → lower CPU.Low CPU → action → higher CPU.System stabilizes around target.Common balancing loops in production:
| Loop | How It Works | What It Protects |
|---|---|---|
| Autoscaling | High load → add capacity → lower load | Performance |
| Rate limiting | Too many requests → reject excess → manageable load | Availability |
| Circuit breakers | Failures rise → stop calling → failures drop | Dependencies |
| Backpressure | Queue full → slow producers → queue drains | Memory |
| Garbage collection | Memory fills → GC runs → memory freed | Stability |
Did You Know?
Your body contains over 200 feedback loops maintaining temperature, blood sugar, blood pressure, pH levels, and more. Engineers who study biology often build more resilient systems.
The steam engine governor (1788) was one of the first mechanical feedback loops. When the engine sped up, weights flew outward and closed the steam valve. James Watt’s governor made the Industrial Revolution possible.
Predator-prey cycles in nature are feedback loops. Rabbits increase → foxes increase (more food) → rabbits decrease (more predation) → foxes decrease (less food) → rabbits increase again. Ecologists mapped these loops in the 1920s; they’re now used to model cloud services.
1.3 Identifying Loop Types: The Polarity Test
Section titled “1.3 Identifying Loop Types: The Polarity Test”Quick technique: Count the inversions. An inversion is when an increase in one thing causes a decrease in another (or vice versa).
THE POLARITY TEST═══════════════════════════════════════════════════════════════
Count the "inversions" (where ↑ causes ↓ or ↓ causes ↑):
REINFORCING (even number of inversions: 0, 2, 4...):┌───────────────────────────────────────────────┐│ A ↑ ─────▶ B ↑ ─────▶ C ↑ ─────▶ A ↑ ││ ││ Load ↑ → Latency ↑ → Retries ↑ → Load ↑ ││ (0 inversions = reinforcing) │└───────────────────────────────────────────────┘
BALANCING (odd number of inversions: 1, 3, 5...):┌───────────────────────────────────────────────┐│ A ↑ ─────▶ B ↑ ─────▶ C ↓ ─────▶ A ↓ ││ ▲ ││ INVERSION ││ ││ CPU ↑ → Pods ↑ → Load per pod ↓ → CPU ↓ ││ (1 inversion = balancing) │└───────────────────────────────────────────────┘
Quick test: Start with "A increases" and follow the loop.If A ends up increasing more → ReinforcingIf A ends up decreasing back → BalancingPractice examples:
-
Rate limiting: Requests ↑ → Rejections ↑ → Accepted requests ↓ → Load ↓
- One inversion (more rejections = fewer accepted) = Balancing
-
Cache miss cascade: Misses ↑ → DB queries ↑ → DB latency ↑ → Cache timeout ↑ → Misses ↑
- Zero inversions = Reinforcing
-
Pod eviction: Memory ↑ → Eviction ↑ → Running pods ↓ → Load per pod ↑ → Memory ↑
- Two inversions = Reinforcing (eviction makes things worse!)
Part 2: Delays—Why Good Loops Go Bad
Section titled “Part 2: Delays—Why Good Loops Go Bad”Here’s the dirty secret of feedback loops: balancing loops can become destructive when delays are too long. A well-intentioned stabilizing mechanism can oscillate wildly, causing more damage than if it didn’t exist at all.
2.1 The Shower Problem
Section titled “2.1 The Shower Problem”Everyone has experienced this. Hotel shower. Unfamiliar controls.
- Water is cold
- Turn up the hot
- Still cold (water still in pipes)
- Turn up more
- Still cold (patience running thin)
- Turn up even more
- SCALDING HOT (all adjustments hit at once)
- Yank it to cold
- Still hot (pipe delay again)
- Crank it colder
- FREEZING
- Repeat until you give up
This is a balancing loop with delay. The longer the delay relative to how fast you adjust, the worse the oscillation. The loop is trying to stabilize temperature, but the delay causes overshoot in both directions.
2.2 Autoscaler Oscillation: A Story in Three Graphs
Section titled “2.2 Autoscaler Oscillation: A Story in Three Graphs”SCENARIO: Autoscaler with 3-minute pod startup delay═══════════════════════════════════════════════════════════════
GRAPH 1: What the autoscaler SEES (metrics with delay)──────────────────────────────────────────────────────CPU %100│ ● ● ● ● (metrics arrive) │ ╱│ ╱│ ╱│ ╱│ 70│───╱─┼─────╱─┼─────╱─┼─────╱─┼────── Target │ ╱ │ ╱ │ ╱ │ ╱ │ 40│ ╱ │ ╱ │ ╱ │ ╱ │ │╱ │ ╱ │ ╱ │ ╱ │ └─────┴──────┴──────┴──────┴─────── Time ↑ ↑ ↑ ↑ "High!" "High!" "High!" "High!" +5 pods +5 pods +5 pods +5 pods
GRAPH 2: What ACTUALLY happened (reality)──────────────────────────────────────────────────────CPU %100│ ╱╲ │ ╱ ╲ 70│──────────╱────╲────────────────── Target │ ╱ ╲ 40│ ╱ ╲ │ ╱ ╲_______________ 20│──────╱────────────────────────── Massive overshoot! └────────────────────────────────── Time ↑ All 20 pods become ready simultaneously
GRAPH 3: The oscillation continues──────────────────────────────────────────────────────CPU %100│ ╱╲ ╱╲ ╱╲ │ ╱ ╲ ╱ ╲ ╱ ╲ 70│──╱────╲──────────╱────╲──────────╱────╲── Target │ ╱ ╲ ╱ ╲ ╱ ╲ 40│╱ ╲ ╱ ╲ ╱ ╲ │ ╲ ╱ ╲ ╱ ╲ 20│ ╲__╱ ╲__╱ ╲_ └────────────────────────────────────────────── Time Scale up Scale down Scale up Scale down (overshoot)(undershoot)(overshoot)...
Without damping, this continues indefinitely.2.3 The Delay Inventory
Section titled “2.3 The Delay Inventory”Every feedback loop contains delays. Knowing your delays is essential for tuning loops correctly.
| Delay Type | Typical Duration | Where It Lurks |
|---|---|---|
| Metric collection | 10-60s | Prometheus scrape interval |
| Metric aggregation | 15-60s | Query evaluation period |
| Alert threshold | 30-300s | ”Fire after 5 minutes of…” |
| Autoscaler cooldown | 30-600s | Prevent thrashing |
| Pod startup | 10-300s | Image pull + init + readiness |
| DNS propagation | 30-86400s | TTL-dependent |
| Human response | 300-3600s | Page → wake → investigate |
| Deployment pipeline | 300-3600s | Build + test + deploy |
| Cache invalidation | Variable | TTL or explicit purge |
The total loop delay is the sum of all delays in the loop. If your HPA evaluates every 15 seconds but pods take 3 minutes to start, your effective loop delay is 3+ minutes.
War Story: The Autoscaler That Destroyed Itself (Extended Version)
A logistics company ran their order tracking system on Kubernetes. They configured an HPA to scale based on a custom metric: messages in the processing queue. Clever idea—scale based on actual work, not just CPU.
The metric came from their message broker, scraped by Prometheus every 30 seconds, aggregated over 1 minute for smoothing. Total delay: about 2 minutes from queue state to metric availability.
The HPA evaluated every 15 seconds. So every 15 seconds, it would look at 2-minute-old data and make scaling decisions.
One Thursday, a burst of orders came in. Queue depth jumped. The 2-minute-old metric showed “queue growing.” HPA added 5 pods. Fifteen seconds later, still seeing “queue growing” (stale metric). Added 5 more. This continued for 2 minutes until fresh metrics arrived.
By then, they had 47 pods for a queue that 8 pods could handle. But it got worse.
Each pod needed 3 database connections. 47 pods × 3 connections = 141 connections. Their connection pool max was 100. Pods started failing health checks. The node autoscaler, seeing failing pods, added more nodes to “help.” Each new node started more pods. More pods meant more connection attempts against a saturated pool.
The database, overwhelmed with connection attempts, started timing out actual queries. The queue processing slowed. The queue grew. The HPA saw the growing queue (in now-accurate metrics) and tried to add more pods.
At the peak, they had 400+ pods failing to start across 12 nodes, their database completely unresponsive, and zero orders being processed.
The fix took three changes:
- Increased HPA evaluation interval to exceed metric delay (3 minutes)
- Added stabilization windows (5 minutes before scale-down)
- Set connection pool limits as pod resource constraints
Total outage: 47 minutes. Revenue impact: $230,000. The fix: 3 lines of YAML.
Part 3: The Six Deadly Loops
Section titled “Part 3: The Six Deadly Loops”These patterns cause the majority of cascading failures in distributed systems. Learn to recognize them instantly.
3.1 The Retry Storm
Section titled “3.1 The Retry Storm”Pattern: Failure triggers retries, retries add load, load causes more failures.
THE RETRY STORM ANATOMY═══════════════════════════════════════════════════════════════
Normal state:┌────────┐ 1000 req/s ┌────────┐│ Client │─────────────▶│ Server │ Everything fine└────────┘ └────────┘
Trouble begins (server slow):┌────────┐ 1000 req/s ┌────────┐│ Client │─────────────▶│ Server │ Latency: 500ms└────────┘ + └────────┘ (normal: 50ms) 200 retries
Getting worse:┌────────┐ 1000 req/s ┌────────┐│ Client │─────────────▶│ Server │ Latency: 2000ms└────────┘ + └────────┘ Timeouts starting 800 retries + 200 retry-of-retries
Death spiral:┌────────┐ 1000 req/s ┌────────────────┐│ Client │─────────────▶│ Server │└────────┘ + │ OVERLOADED │ 3000 retries│ 100% failure │ (3x load) └────────────────┘
Total traffic: 4000+ req/s from 1000 original requests.Server never recovers without external intervention.Breaking the loop:
- Exponential backoff: Each retry waits longer
- Jitter: Randomize backoff to prevent synchronization
- Retry budgets: Limit total retries per time window
- Circuit breakers: Stop retrying after threshold
3.2 The Thundering Herd
Section titled “3.2 The Thundering Herd”Pattern: Synchronized events cause coordinated resource access.
THE THUNDERING HERD═══════════════════════════════════════════════════════════════
Setup: 10,000 users, cache TTL = 3600s (1 hour)Cache populated at 10:00 AM
Normal operation (10:00 AM - 10:59 AM):┌─────────┐ ┌──────────┐│ Users │──▶ Cache HIT (99.9%) ──▶│ Happy ││ 10,000 │ │ users │└─────────┘ └──────────┘ DB load: ~10 req/s
11:00:00 AM - Cache expires (simultaneously for everyone):┌─────────┐ ┌──────────┐│ Users │──▶ Cache MISS ─────────▶│ Database ││ 10,000 │ (all at once!) │ │└─────────┘ └──────────┘ DB load: 10,000 req/s ↓ OVERWHELMED
11:00:01 AM - Database too slow, queries timeout:┌─────────┐ ┌──────────┐│ Users │──▶ Cache MISS ─────────▶│ Database ││ 10,000 │ (still empty!) │ DYING │└─────────┘ + └──────────┘ Retries from first requests DB can't recover because cache can't be repopulatedBreaking the loop:
- Jittered TTLs:
TTL = 3600 + random(-300, 300)seconds - Single-writer pattern: One request populates cache, others wait
- Cache warming: Refresh before expiration
- Request coalescing: Deduplicate identical in-flight requests
3.3 The Connection Pool Death Spiral
Section titled “3.3 The Connection Pool Death Spiral”Pattern: Slow operations hold connections longer, exhausting the pool.
THE CONNECTION POOL DEATH SPIRAL═══════════════════════════════════════════════════════════════
Pool size: 100 connectionsNormal query time: 10msQueries per second: 500
Normal state:────────────────────────────────────────────────────────────Connections in use: ~5 (500 req/s × 10ms = 5 concurrent)Pool utilization: 5%Wait time: 0ms────────────────────────────────────────────────────────────
Something makes queries slow (50ms instead of 10ms):────────────────────────────────────────────────────────────Connections in use: ~25 (500 req/s × 50ms = 25 concurrent)Pool utilization: 25%Wait time: 0ms ← Still fine, but watch this...────────────────────────────────────────────────────────────
Requests queue because of slow queries (now 500ms):────────────────────────────────────────────────────────────Connections in use: 100 (MAXED)Pool utilization: 100%Wait time: 500ms ← Total request time now 1000msNew requests timing out while waiting for connectionsTimeouts trigger retries...────────────────────────────────────────────────────────────
Death spiral engaged:────────────────────────────────────────────────────────────Connections in use: 100 (MAXED)Waiting for connection: 500 requestsWait time: 10,000ms+Database sees 100 concurrent slow queriesDatabase gets slowerQueries take longerConnections held longerMore requests waitingMore retriesMORE LOAD ON ALREADY DYING DATABASE────────────────────────────────────────────────────────────Breaking the loop:
- Connection timeouts: Release connections if held too long
- Query timeouts: Kill queries that exceed threshold
- Pool sizing: Size based on downstream capacity, not demand
- Bulkheading: Separate pools for different operation types
3.4 The Eviction Cascade
Section titled “3.4 The Eviction Cascade”Pattern: Resource pressure causes evictions, evictions redistribute load, causing more pressure.
THE EVICTION CASCADE═══════════════════════════════════════════════════════════════
Setup: 10 pods handling traffic evenly
Normal state:─────────────────────────────────────────────────────│ Pod 1 │ Pod 2 │ Pod 3 │ ... │ Pod 10 ││ 10% │ 10% │ 10% │ ... │ 10% │ (load each)─────────────────────────────────────────────────────Memory per pod: 400MB (limit: 512MB)
Memory leak develops in Pod 3:─────────────────────────────────────────────────────│ Pod 1 │ Pod 2 │ Pod 3 │ ... │ Pod 10 ││ 10% │ 10% │ 10% │ ... │ 10% ││ 400MB │ 400MB │ 512MB │ ... │ 400MB │ ↓ OOMKilled─────────────────────────────────────────────────────
Pod 3 evicted, load redistributed:─────────────────────────────────────────────────────│ Pod 1 │ Pod 2 │ │ ... │ Pod 10 ││ 11% │ 11% │ GONE │ ... │ 11% ││ 440MB │ 440MB │ │ ... │ 440MB │ ← Memory rising!─────────────────────────────────────────────────────
More pods approach limit:─────────────────────────────────────────────────────│ Pod 1 │ Pod 2 │ │ ... │ Pod 10 ││ 11% │ 11% │ GONE │ ... │ 11% ││ 490MB │ 510MB │ │ ... │ 500MB │ ↓ OOMKilled (Pod 2)─────────────────────────────────────────────────────
Cascade accelerates (now 8 pods for same traffic):─────────────────────────────────────────────────────8 pods... 7 pods... 6 pods... 5 pods...Each eviction increases load on survivorsEach survivor moves closer to memory limitEvictions accelerate until cluster collapse─────────────────────────────────────────────────────Breaking the loop:
- PodDisruptionBudgets: Limit concurrent evictions
- Proper resource requests: Ensure headroom
- Horizontal scaling: Add pods before eviction, not after
- Memory leak detection: Alert before OOM
3.5 The Alert Storm
Section titled “3.5 The Alert Storm”Pattern: Incidents generate alerts, alerts overwhelm responders, overwhelmed responders miss alerts, more incidents.
THE ALERT STORM═══════════════════════════════════════════════════════════════
Normal:┌────────────────┐ ┌────────────────┐│ 5 alerts/day │────▶│ Engineer reads │──▶ Issues fixed│ All meaningful │ │ all of them │└────────────────┘ └────────────────┘
After "improving monitoring":┌────────────────┐ ┌────────────────┐│ 50 alerts/day │────▶│ Engineer skims │──▶ Some fixed│ Mostly noise │ │ ignores most │└────────────────┘ └────────────────┘
After the first missed incident:┌────────────────┐ ┌────────────────────────┐│ 100 alerts/day │────▶│ "Add more monitoring!" ││ Even more noise│ │ Now 200 alerts/day │└────────────────┘ └────────────────────────┘
Alert fatigue sets in:┌────────────────┐ ┌────────────────┐ ┌─────────────┐│ 200 alerts/day │────▶│ Engineer mutes │────▶│ Incidents ││ All ignored │ │ notifications │ │ go unnoticed│└────────────────┘ └────────────────┘ └─────────────┘ │ ┌─────────────────────────────────────────┘ │ ▼┌─────────────────────┐│ "We need EVEN MORE ││ monitoring!" (loop) │└─────────────────────┘Breaking the loop:
- Alert on symptoms, not causes (user impact, not internal metrics)
- Every alert must be actionable
- Regular alert review: Delete alerts nobody acts on
- On-call feedback: Track alert-to-incident ratio
3.6 The Capacity Planning Spiral
Section titled “3.6 The Capacity Planning Spiral”Pattern: Under-provisioning causes incidents, incidents cause over-provisioning, budgets cut, under-provisioning.
THE CAPACITY PLANNING SPIRAL═══════════════════════════════════════════════════════════════
Year 1: Launch┌─────────────────────────────────────┐│ "We'll scale as needed" ││ Provisioned: Just enough ││ Buffer: ~5% │└─────────────────────────────────────┘
Year 1, Month 6: Black Friday┌─────────────────────────────────────┐│ Traffic: 10x normal ││ Result: Outage ││ Blame: "We need more capacity!" │└─────────────────────────────────────┘
Year 2: Post-mortem┌─────────────────────────────────────┐│ Provisioned: 20x normal (just in case) ││ Actual usage: 2x normal ││ Utilization: 10% ││ CFO: "Why are we paying for this?" │└─────────────────────────────────────┘
Year 2, Month 6: Budget cuts┌─────────────────────────────────────┐│ "Reduce cloud spend by 50%" ││ Provisioned: Back to 10x ││ Buffer: ~5% again │└─────────────────────────────────────┘
Year 2, Month 11: Pre-holiday traffic┌─────────────────────────────────────┐│ Traffic: 12x normal ││ Result: Outage (again) ││ Loop repeats │└─────────────────────────────────────┘Breaking the loop:
- Load testing: Know your actual limits
- Autoscaling: Match capacity to demand
- Cost attribution: Show cost vs. revenue at traffic levels
- Chaos engineering: Prove you can handle expected peaks
Part 4: Designing with Feedback in Mind
Section titled “Part 4: Designing with Feedback in Mind”4.1 Principles for Safe Feedback Loops
Section titled “4.1 Principles for Safe Feedback Loops”Principle 1: Match loop speed to delay
If your system changes faster than your feedback loop can respond, you’ll oscillate. The loop evaluation period should exceed the total delay.
# Kubernetes HPA with proper stabilizationapiVersion: autoscaling/v2kind: HorizontalPodAutoscalerspec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: my-app minReplicas: 3 maxReplicas: 100 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 70 behavior: scaleDown: stabilizationWindowSeconds: 300 # Wait 5 min before scaling down policies: - type: Percent value: 10 # Remove max 10% of pods periodSeconds: 60 # Every minute scaleUp: stabilizationWindowSeconds: 60 # Wait 1 min before scaling up more policies: - type: Percent value: 100 # Can double pods periodSeconds: 60 - type: Pods value: 4 # Or add max 4 pods periodSeconds: 60 selectPolicy: Max # Use whichever allows more scalingPrinciple 2: Add damping to prevent oscillation
Damping slows down responses, trading speed for stability. Like shock absorbers on a car.
# Bad: React to every measurementdef scale_pods(current_cpu): if current_cpu > 70: add_pods(5) elif current_cpu < 50: remove_pods(5)
# Good: Damped response with smoothingclass DampedScaler: def __init__(self): self.measurements = [] self.last_scale_time = 0 self.cooldown = 300 # 5 minutes
def scale_pods(self, current_cpu): self.measurements.append(current_cpu)
# Only consider last 5 minutes of data self.measurements = self.measurements[-20:]
# Require cooldown period if time.time() - self.last_scale_time < self.cooldown: return # Too soon to act
avg_cpu = sum(self.measurements) / len(self.measurements)
# Require sustained deviation if all(m > 70 for m in self.measurements[-5:]): add_pods(2) # Smaller increments self.last_scale_time = time.time() elif all(m < 50 for m in self.measurements[-5:]): remove_pods(1) # Even smaller for scale-down self.last_scale_time = time.time()Principle 3: Break reinforcing loops with circuit breakers
Don’t let failure amplify failure. Insert breaks in the loop.
class CircuitBreaker: def __init__(self, failure_threshold=5, recovery_timeout=30): self.failures = 0 self.threshold = failure_threshold self.timeout = recovery_timeout self.state = "CLOSED" # CLOSED, OPEN, HALF_OPEN self.opened_at = None
def call(self, func, *args, **kwargs): if self.state == "OPEN": if time.time() - self.opened_at > self.timeout: self.state = "HALF_OPEN" else: raise CircuitOpenError("Circuit breaker is open")
try: result = func(*args, **kwargs) if self.state == "HALF_OPEN": self.state = "CLOSED" self.failures = 0 return result except Exception as e: self.failures += 1 if self.failures >= self.threshold: self.state = "OPEN" self.opened_at = time.time() raise
# Usage: Breaks the retry storm loop@circuit_breaker(failure_threshold=5, recovery_timeout=30)def call_payment_service(order): return http_client.post("/payments", order)Principle 4: Add jitter to prevent synchronization
# Bad: All caches expire at exactly the same timecache.set(key, value, ttl=3600) # All users hit this at 3600s
# Good: Randomize to spread loadimport randomjittered_ttl = 3600 + random.randint(-600, 600) # 50-70 minutescache.set(key, value, ttl=jittered_ttl)
# For retries: exponential backoff with jitterdef retry_with_backoff(func, max_retries=5): for attempt in range(max_retries): try: return func() except RetryableError: if attempt < max_retries - 1: base_delay = (2 ** attempt) # 1, 2, 4, 8, 16 seconds jitter = random.uniform(0, base_delay * 0.5) time.sleep(base_delay + jitter) raise MaxRetriesExceeded()4.2 The Feedback Loop Checklist
Section titled “4.2 The Feedback Loop Checklist”Before deploying any system with feedback mechanisms, answer these questions:
FEEDBACK LOOP ANALYSIS CHECKLIST═══════════════════════════════════════════════════════════════
For each feedback loop in your system:
□ IDENTIFICATION - What are the elements in the loop? - Is it reinforcing or balancing? - What behavior does it create?
□ DELAYS - What is the total delay around the loop? - Is the loop evaluation faster than the delay? - What happens during the delay period?
□ STABILITY - Is there damping/smoothing? - Are there stabilization windows? - Can the loop oscillate? At what frequency?
□ FAILURE MODES - What happens if feedback is delayed/lost? - What happens under extreme load? - Is there a circuit breaker to stop runaway loops?
□ SYNCHRONIZATION - Can multiple instances synchronize? - Is there jitter on timers/TTLs? - What triggers correlated behavior?
□ OBSERVABILITY - Can you see the loop in action? - What metrics show loop behavior? - How would you detect a runaway loop?Did You Know?
Section titled “Did You Know?”-
Audio engineers use feedback loops intentionally. Electric guitar sustain comes from controlled feedback between the pickups and amplifier. Jimi Hendrix was a master of feedback control.
-
The 2010 Flash Crash (stock market dropped 9% in 5 minutes) was caused by algorithmic trading feedback loops. One algorithm sold, prices dropped, other algorithms sold in response, prices dropped more. Circuit breakers now halt trading when prices move too fast.
-
Climate feedback loops worry scientists. Ice melts → less reflection → more heat absorption → more melting. Some loops are self-reinforcing with century-long delays—we might have already triggered them without knowing.
-
The Federal Reserve manages feedback loops constantly. Raise interest rates → less borrowing → less spending → lower inflation. But with 12-18 month delays, it’s like steering a supertanker through a narrow channel.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It’s Dangerous | Solution |
|---|---|---|
| Retries without backoff | Creates reinforcing loop that amplifies failures | Exponential backoff with jitter |
| Tight autoscaler settings | Oscillation wastes resources and can crash systems | Stabilization windows, gradual changes |
| Identical cache TTLs | Thundering herd on expiration | Jitter all TTLs by ±10-20% |
| No circuit breakers | Failures cascade until total outage | Add breakers at every service boundary |
| Ignoring metric delay | Autoscaler reacts to stale data, overshoots | Evaluation interval > metric delay |
| Alert on every metric | Alert fatigue, real issues missed | Alert on user-facing symptoms only |
| Scaling on connection count | Each new pod adds connections, triggers more scaling | Scale on latency or queue depth instead |
-
What’s the difference between a reinforcing and a balancing feedback loop?
Answer
Reinforcing loops amplify change in the current direction. If the system is moving up, reinforcing loops push it further up. If moving down, further down. Examples: retry storms (failures cause retries cause more failures), viral growth (users invite users who invite more users).
Balancing loops oppose change, pushing the system toward a target equilibrium. High CPU → add pods → lower CPU. High temperature → cooling → lower temperature.
Quick test: If A increases, trace through the loop. Does A end up increasing more (reinforcing) or decreasing back toward a target (balancing)?
-
Why do delays turn balancing loops into oscillating disasters?
Answer
Delays cause oscillation because corrective actions are based on stale information. The sequence:
- System detects problem (using delayed metrics)
- System applies correction
- During delay, system applies more correction (still seeing old data)
- All corrections take effect simultaneously
- System overshoots target in opposite direction
- Process repeats in reverse
The longer the delay relative to correction speed, the worse the oscillation. Solution: evaluation interval should exceed total delay, and changes should be gradual (damped).
-
Explain how a thundering herd creates a reinforcing loop.
Answer
The thundering herd reinforcing loop:
- Cache expires simultaneously for many users
- All requests miss cache, hit database
- Database overwhelmed, queries slow or fail
- Failed queries can’t populate cache
- Next requests still miss cache
- Even more database load
- Database slower, even fewer successful cache populations
- Loop continues until intervention
The reinforcing element: cache failure → database overload → cache stays empty → more database overload. Each iteration makes things worse.
Breaking the loop: jittered TTLs, single-writer pattern, request coalescing, cache warming before expiration.
-
What is jitter and why is it crucial in distributed systems?
Answer
Jitter is deliberate randomness added to timing. Instead of exact intervals, you use a range.
Why it’s crucial:
- Prevents synchronization: Without jitter, all caches expire at once, all retries happen together, all health checks fire simultaneously
- Converts spikes to spreads: Instead of 10,000 requests at second 0, you get 10,000 spread over 10 minutes
- Breaks thundering herds: If TTL is 3600 ± 600 seconds, expirations are spread over 20 minutes instead of all at once
- Reduces retry collisions: Random backoff means retries don’t all hit at 1s, 2s, 4s
Common applications:
- Cache TTLs: ±10-20% jitter
- Retry backoff: Add random 0-50% of base delay
- Periodic jobs: Random delay on startup
- Health checks: Random offset from interval
Hands-On Exercise
Section titled “Hands-On Exercise”Task: Analyze feedback loops in a production architecture.
Scenario: You’re reviewing this API service architecture:
┌─────────────────────────────────────────────────────────────┐│ PRODUCTION SYSTEM ││ ││ Users ──▶ CDN ──▶ Load Balancer ──▶ API Pods (HPA) ││ │ ││ ▼ ││ ┌─────────── Redis Cache ────────────┐ ││ │ TTL: 1 hour │ ││ │ (no jitter) │ ││ └────────────────────────────────────┘ ││ │ ││ ▼ ││ PostgreSQL ││ (connection pool: 50) ││ ││ Retry Policy: 3 retries, 1s fixed delay ││ HPA: Scale on CPU, check every 15s, cooldown 30s ││ Rate Limit: 1000 req/s per user ││ Circuit Breaker: None ││ │└─────────────────────────────────────────────────────────────┘Your Analysis:
-
Identify all feedback loops (20 minutes)
List every loop you can find. For each one:
- Elements in the loop
- Type (reinforcing/balancing)
- Total delay in the loop
- What triggers it
- What goes wrong
-
Find dangerous combinations (10 minutes)
Which loops could trigger together? Which balancing loops might fight each other?
-
Propose fixes (15 minutes)
For each dangerous loop, propose a concrete fix with configuration examples.
Expected Findings:
Click to see answer
Reinforcing Loops (Dangerous):
-
Retry Storm
- Path: Timeout → Retry → More load → Slower responses → More timeouts
- Delay: 1 second (fixed, no jitter)
- Trigger: Any slowdown
- Problem: 3 retries with no jitter means 3x load when slow
-
Cache Stampede
- Path: Cache expires → All miss → DB overload → Timeouts → Cache not populated → More misses
- Delay: Exactly 1 hour (synchronized)
- Trigger: Any heavily-cached key expiring
- Problem: No jitter means synchronized expiration
-
Connection Pool Exhaustion
- Path: Slow query → Connections held → Pool fills → Requests wait → Timeouts → Retries → More connections needed
- Delay: Query timeout (likely long)
- Trigger: Any database slowdown
- Problem: Only 50 connections, no circuit breaker
-
Rate Limit Retry Amplification
- Path: Rate limit hit → 429 → Client retries → Rate limit hit → More 429s
- Delay: 1 second (fixed retry)
- Trigger: Traffic near 1000 req/s
- Problem: Retries count against rate limit
Balancing Loops:
-
HPA CPU Scaling
- Path: High CPU → Add pods → Lower CPU per pod
- Delay: 15s evaluation + pod startup (60s+)
- Problem: Cooldown too short (30s), may oscillate
-
Rate Limiting
- Path: Too many requests → Reject some → Manageable load
- Delay: Immediate
- Problem: Works for server, but combined with retries is reinforcing for client
Dangerous Combinations:
-
Cache stampede + Retry storm + Connection exhaustion:
- Cache expires → DB hit → DB slow → Connections held → Pool full → Timeouts → Retries → 3x load on full pool → Complete failure
-
HPA oscillation + Retry storm:
- CPU high → Scale up → New pods retry → More load → CPU still high → Scale up more → Overshoot
Fixes:
# 1. Add jitter to cache TTLcache.set(key, value, ttl=3600 + random(-600, 600))
# 2. Exponential backoff with jitter for retriesretry_policy: max_retries: 3 backoff: type: exponential base: 1s max: 30s jitter: 0.5
# 3. Add circuit breaker on databasecircuit_breaker: failure_threshold: 5 timeout: 30s half_open_requests: 3
# 4. Fix HPA timingspec: behavior: scaleDown: stabilizationWindowSeconds: 300 scaleUp: stabilizationWindowSeconds: 60
# 5. Connection pool with timeoutpool: max_connections: 50 connection_timeout: 5s idle_timeout: 60sSuccess Criteria:
- Identified at least 4 reinforcing loops
- Identified at least 2 balancing loops
- Documented delays for each loop
- Found at least 2 dangerous combinations
- Proposed fixes with specific configurations
- Explained why each fix breaks the loop
Further Reading
Section titled “Further Reading”-
“Thinking in Systems” by Donella Meadows - Chapter 2 covers feedback loops with beautiful clarity. Essential reading.
-
“Release It!” by Michael Nygard - Chapters on stability patterns (circuit breakers, timeouts, bulkheads) are all about managing feedback loops in production.
-
“How Complex Systems Fail” by Richard Cook - 18 short points on why feedback loops in complex systems create surprising failures. Takes 10 minutes to read, provides lifetime of insight.
-
“Control Theory for Engineers” - Any introductory text. Understanding PID controllers will transform how you tune autoscalers.
-
“Feedback Control for Computer Systems” by Philipp K. Janert - Applies control theory directly to software systems. Dense but invaluable.
Next Module
Section titled “Next Module”Module 1.3: Mental Models for Operations - Build practical mental models for understanding production systems: leverage points, stock-and-flow diagrams, and the frameworks that experienced operators use instinctively.