Перейти до вмісту

Module 1.2: Feedback Loops

Цей контент ще не доступний вашою мовою.

Complexity: [MEDIUM]

Time to Complete: 35-40 minutes

Prerequisites: Module 1.1: What is Systems Thinking?

Track: Foundations

After completing this module, you will be able to:

  1. Identify reinforcing and balancing feedback loops in production systems and predict their behavior under load
  2. Diagnose runaway cascades caused by positive feedback loops such as retry storms, autoscaler thrashing, and connection pool exhaustion
  3. Design damping mechanisms (circuit breakers, backoff strategies, rate limiters) that interrupt destructive feedback cycles
  4. Evaluate whether a system’s feedback loops will stabilize or amplify under failure conditions

November 29th, 2019. 10:47 AM Eastern Time.

The ShopMart engineering team is watching dashboards with coffee in hand. Black Friday traffic is building—already 3x normal load, heading toward 10x by afternoon. The autoscaler is doing its job, spinning up new pods. Everything looks green.

10:52 AM. Someone notices something odd. Database connection count is climbing faster than traffic. Not by a little—exponentially. Traffic up 20%, connections up 400%.

The senior DBA pulls up query logs. Queries are taking 3x longer than normal. Nothing has changed in the code. The database itself isn’t maxed out—CPU at 40%, memory fine.

10:58 AM. The first timeouts start appearing. Payment service can’t reach the database. It retries. All instances retry simultaneously—thousands of retries per second.

11:04 AM. The circuit breaker finally trips, but it’s too late. The database connection pool is completely exhausted. New pods are spinning up (autoscaler sees high latency), but each new pod tries to grab connections from an empty pool. More pods, more connection attempts, more failures.

11:12 AM. Complete outage. The site displays “temporarily unavailable” to 400,000 shoppers.

The root cause? A single slow query. One poorly indexed query that normally took 50ms started taking 500ms under load. Connections held 10x longer meant connection pool filled 10x faster. Which meant more queuing. Which meant even longer waits. Which meant more timeouts. Which meant retries. Which added more load. Which made queries even slower.

A feedback loop turned a minor performance issue into a complete system collapse in under 25 minutes.


Every production outage has a story. But if you look closely at the worst ones—the cascading failures, the death spirals, the “everything went wrong at once” disasters—they share a common element: feedback loops that amplified small problems into catastrophic failures.

Understanding feedback loops is understanding the DNA of system behavior.

A system with well-designed feedback loops is antifragile—it stabilizes under stress, recovers from failures, and gets stronger with use. A system with poorly designed feedback loops is a time bomb—stable in normal conditions, catastrophic when stressed.

This module teaches you to:

  • Recognize feedback loops before they bite
  • Predict which loops will help and which will harm
  • Design systems that use feedback safely
  • Break dangerous loops before they cascade

The Thermostat Principle

Your home thermostat is a perfect feedback loop. Temperature drops below setpoint → heater turns on → temperature rises → heater turns off. This is a balancing loop—it opposes change, maintaining stability.

Now imagine if your thermostat were wired backwards: temperature drops → heater turns off. That’s a reinforcing loop—it amplifies change. Your house would freeze.

Most production incidents are just thermostats wired backwards. The system is trying to help, but its corrective actions make things worse. This module teaches you to spot backwards thermostats before they freeze your production systems.


  • The two fundamental types of feedback loops
  • Why delays turn helpful loops into destructive oscillations
  • The six most dangerous feedback patterns in distributed systems
  • How to design systems that use feedback safely
  • Techniques for breaking loops once they start

All feedback loops fall into two categories. Master this distinction and you’ll understand half of all production incidents.

Reinforcing loops amplify change. Whatever direction the system is moving, reinforcing loops push it further. They’re called “positive feedback” not because they’re good (they usually aren’t), but because they add to the existing trend.

Think of a microphone placed in front of a speaker. Sound enters the mic → gets amplified → comes out the speaker → enters the mic louder → gets amplified more. Without intervention, you get that ear-piercing screech within seconds.

THE ANATOMY OF A RETRY STORM
═══════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────┐
│ THE DEATH SPIRAL │
│ │
▼ │
┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ Latency │─────▶│Timeouts │─────▶│ Retries │─────────┘
│Increases│ │ Occur │ │ Happen │
└─────────┘ └─────────┘ └─────────┘
▲ │
│ │ Each retry adds load
│ ▼
│ ┌─────────────────┐
└────────────────────────│ Server becomes │
│ more overloaded │
└─────────────────┘
Timeline of a real incident:
t=0:00 Latency: 200ms Timeouts: 0/sec Load: 1000 req/s
t=0:30 Latency: 500ms Timeouts: 50/sec Load: 1050 req/s
t=1:00 Latency: 2000ms Timeouts: 400/sec Load: 1400 req/s
t=1:30 Latency: 5000ms Timeouts: 1200/sec Load: 2200 req/s
t=2:00 Latency: DEAD Timeouts: ALL Load: 4000+ req/s
From "a little slow" to "completely dead" in 2 minutes.

The mathematics of reinforcing loops are terrifying. If each loop iteration increases load by just 10%, after 10 iterations you’re at 2.6x original load. After 20 iterations, 6.7x. After 30, 17x. This is exponential growth—and it happens fast.

Common reinforcing loops in production:

LoopHow It WorksWhy It’s Dangerous
Retry stormsFailure → retry → more load → more failureCan 10x load in minutes
Cache stampedeCache expires → all hit DB → DB slows → cache stays emptySynchronized devastation
Connection pool exhaustionSlow queries → connections held → pool fills → more waitingEverything stops
Memory pressureSwapping → slower processing → more memory pressureGradual then sudden death
Alert fatigueToo many alerts → ignored → more incidents → more alertsHuman systems fail too

Balancing loops oppose change. They push the system back toward a target or equilibrium. They’re called “negative feedback” because they subtract from the current trend.

Your body temperature is maintained by balancing loops. Too hot → you sweat → cooling → temperature drops. Too cold → you shiver → heat generation → temperature rises. The target is 37°C, and your body fights any deviation.

HOW AUTOSCALING SHOULD WORK
═══════════════════════════════════════════════════════════════
Target: 70% CPU
┌────┴────┐
│ │
┌───────────────▼─────────▼───────────────┐
│ │
│ Measure ──▶ Compare ──▶ Adjust │
│ │
│ 70% vs Scale │
│ ━━━━━━ target pods │
│ │
└─────────────────────────────────────────┘
│ │
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│CPU high │ │Add pods │
│ (85%) │ │ │
└────┬────┘ └────┬────┘
│ │
│ │
└─────────┬────────┘
┌──────────┐
│ CPU back │
│ to 70% │
└──────────┘
This is a BALANCING loop: it opposes change.
High CPU → action → lower CPU.
Low CPU → action → higher CPU.
System stabilizes around target.

Common balancing loops in production:

LoopHow It WorksWhat It Protects
AutoscalingHigh load → add capacity → lower loadPerformance
Rate limitingToo many requests → reject excess → manageable loadAvailability
Circuit breakersFailures rise → stop calling → failures dropDependencies
BackpressureQueue full → slow producers → queue drainsMemory
Garbage collectionMemory fills → GC runs → memory freedStability

Did You Know?

  • Your body contains over 200 feedback loops maintaining temperature, blood sugar, blood pressure, pH levels, and more. Engineers who study biology often build more resilient systems.

  • The steam engine governor (1788) was one of the first mechanical feedback loops. When the engine sped up, weights flew outward and closed the steam valve. James Watt’s governor made the Industrial Revolution possible.

  • Predator-prey cycles in nature are feedback loops. Rabbits increase → foxes increase (more food) → rabbits decrease (more predation) → foxes decrease (less food) → rabbits increase again. Ecologists mapped these loops in the 1920s; they’re now used to model cloud services.

1.3 Identifying Loop Types: The Polarity Test

Section titled “1.3 Identifying Loop Types: The Polarity Test”

Quick technique: Count the inversions. An inversion is when an increase in one thing causes a decrease in another (or vice versa).

THE POLARITY TEST
═══════════════════════════════════════════════════════════════
Count the "inversions" (where ↑ causes ↓ or ↓ causes ↑):
REINFORCING (even number of inversions: 0, 2, 4...):
┌───────────────────────────────────────────────┐
│ A ↑ ─────▶ B ↑ ─────▶ C ↑ ─────▶ A ↑ │
│ │
│ Load ↑ → Latency ↑ → Retries ↑ → Load ↑ │
│ (0 inversions = reinforcing) │
└───────────────────────────────────────────────┘
BALANCING (odd number of inversions: 1, 3, 5...):
┌───────────────────────────────────────────────┐
│ A ↑ ─────▶ B ↑ ─────▶ C ↓ ─────▶ A ↓ │
│ ▲ │
│ INVERSION │
│ │
│ CPU ↑ → Pods ↑ → Load per pod ↓ → CPU ↓ │
│ (1 inversion = balancing) │
└───────────────────────────────────────────────┘
Quick test: Start with "A increases" and follow the loop.
If A ends up increasing more → Reinforcing
If A ends up decreasing back → Balancing

Practice examples:

  1. Rate limiting: Requests ↑ → Rejections ↑ → Accepted requests ↓ → Load ↓

    • One inversion (more rejections = fewer accepted) = Balancing
  2. Cache miss cascade: Misses ↑ → DB queries ↑ → DB latency ↑ → Cache timeout ↑ → Misses ↑

    • Zero inversions = Reinforcing
  3. Pod eviction: Memory ↑ → Eviction ↑ → Running pods ↓ → Load per pod ↑ → Memory ↑

    • Two inversions = Reinforcing (eviction makes things worse!)

Here’s the dirty secret of feedback loops: balancing loops can become destructive when delays are too long. A well-intentioned stabilizing mechanism can oscillate wildly, causing more damage than if it didn’t exist at all.

Everyone has experienced this. Hotel shower. Unfamiliar controls.

  1. Water is cold
  2. Turn up the hot
  3. Still cold (water still in pipes)
  4. Turn up more
  5. Still cold (patience running thin)
  6. Turn up even more
  7. SCALDING HOT (all adjustments hit at once)
  8. Yank it to cold
  9. Still hot (pipe delay again)
  10. Crank it colder
  11. FREEZING
  12. Repeat until you give up

This is a balancing loop with delay. The longer the delay relative to how fast you adjust, the worse the oscillation. The loop is trying to stabilize temperature, but the delay causes overshoot in both directions.

2.2 Autoscaler Oscillation: A Story in Three Graphs

Section titled “2.2 Autoscaler Oscillation: A Story in Three Graphs”
SCENARIO: Autoscaler with 3-minute pod startup delay
═══════════════════════════════════════════════════════════════
GRAPH 1: What the autoscaler SEES (metrics with delay)
──────────────────────────────────────────────────────
CPU %
100│ ● ● ● ● (metrics arrive)
│ ╱│ ╱│ ╱│ ╱│
70│───╱─┼─────╱─┼─────╱─┼─────╱─┼────── Target
│ ╱ │ ╱ │ ╱ │ ╱ │
40│ ╱ │ ╱ │ ╱ │ ╱ │
│╱ │ ╱ │ ╱ │ ╱ │
└─────┴──────┴──────┴──────┴─────── Time
↑ ↑ ↑ ↑
"High!" "High!" "High!" "High!"
+5 pods +5 pods +5 pods +5 pods
GRAPH 2: What ACTUALLY happened (reality)
──────────────────────────────────────────────────────
CPU %
100│ ╱╲
│ ╱ ╲
70│──────────╱────╲────────────────── Target
│ ╱ ╲
40│ ╱ ╲
│ ╱ ╲_______________
20│──────╱────────────────────────── Massive overshoot!
└────────────────────────────────── Time
All 20 pods
become ready
simultaneously
GRAPH 3: The oscillation continues
──────────────────────────────────────────────────────
CPU %
100│ ╱╲ ╱╲ ╱╲
│ ╱ ╲ ╱ ╲ ╱ ╲
70│──╱────╲──────────╱────╲──────────╱────╲── Target
│ ╱ ╲ ╱ ╲ ╱ ╲
40│╱ ╲ ╱ ╲ ╱ ╲
│ ╲ ╱ ╲ ╱ ╲
20│ ╲__╱ ╲__╱ ╲_
└────────────────────────────────────────────── Time
Scale up Scale down Scale up Scale down
(overshoot)(undershoot)(overshoot)...
Without damping, this continues indefinitely.

Every feedback loop contains delays. Knowing your delays is essential for tuning loops correctly.

Delay TypeTypical DurationWhere It Lurks
Metric collection10-60sPrometheus scrape interval
Metric aggregation15-60sQuery evaluation period
Alert threshold30-300s”Fire after 5 minutes of…”
Autoscaler cooldown30-600sPrevent thrashing
Pod startup10-300sImage pull + init + readiness
DNS propagation30-86400sTTL-dependent
Human response300-3600sPage → wake → investigate
Deployment pipeline300-3600sBuild + test + deploy
Cache invalidationVariableTTL or explicit purge

The total loop delay is the sum of all delays in the loop. If your HPA evaluates every 15 seconds but pods take 3 minutes to start, your effective loop delay is 3+ minutes.

War Story: The Autoscaler That Destroyed Itself (Extended Version)

A logistics company ran their order tracking system on Kubernetes. They configured an HPA to scale based on a custom metric: messages in the processing queue. Clever idea—scale based on actual work, not just CPU.

The metric came from their message broker, scraped by Prometheus every 30 seconds, aggregated over 1 minute for smoothing. Total delay: about 2 minutes from queue state to metric availability.

The HPA evaluated every 15 seconds. So every 15 seconds, it would look at 2-minute-old data and make scaling decisions.

One Thursday, a burst of orders came in. Queue depth jumped. The 2-minute-old metric showed “queue growing.” HPA added 5 pods. Fifteen seconds later, still seeing “queue growing” (stale metric). Added 5 more. This continued for 2 minutes until fresh metrics arrived.

By then, they had 47 pods for a queue that 8 pods could handle. But it got worse.

Each pod needed 3 database connections. 47 pods × 3 connections = 141 connections. Their connection pool max was 100. Pods started failing health checks. The node autoscaler, seeing failing pods, added more nodes to “help.” Each new node started more pods. More pods meant more connection attempts against a saturated pool.

The database, overwhelmed with connection attempts, started timing out actual queries. The queue processing slowed. The queue grew. The HPA saw the growing queue (in now-accurate metrics) and tried to add more pods.

At the peak, they had 400+ pods failing to start across 12 nodes, their database completely unresponsive, and zero orders being processed.

The fix took three changes:

  1. Increased HPA evaluation interval to exceed metric delay (3 minutes)
  2. Added stabilization windows (5 minutes before scale-down)
  3. Set connection pool limits as pod resource constraints

Total outage: 47 minutes. Revenue impact: $230,000. The fix: 3 lines of YAML.


These patterns cause the majority of cascading failures in distributed systems. Learn to recognize them instantly.

Pattern: Failure triggers retries, retries add load, load causes more failures.

THE RETRY STORM ANATOMY
═══════════════════════════════════════════════════════════════
Normal state:
┌────────┐ 1000 req/s ┌────────┐
│ Client │─────────────▶│ Server │ Everything fine
└────────┘ └────────┘
Trouble begins (server slow):
┌────────┐ 1000 req/s ┌────────┐
│ Client │─────────────▶│ Server │ Latency: 500ms
└────────┘ + └────────┘ (normal: 50ms)
200 retries
Getting worse:
┌────────┐ 1000 req/s ┌────────┐
│ Client │─────────────▶│ Server │ Latency: 2000ms
└────────┘ + └────────┘ Timeouts starting
800 retries
+
200 retry-of-retries
Death spiral:
┌────────┐ 1000 req/s ┌────────────────┐
│ Client │─────────────▶│ Server │
└────────┘ + │ OVERLOADED │
3000 retries│ 100% failure │
(3x load) └────────────────┘
Total traffic: 4000+ req/s from 1000 original requests.
Server never recovers without external intervention.

Breaking the loop:

  • Exponential backoff: Each retry waits longer
  • Jitter: Randomize backoff to prevent synchronization
  • Retry budgets: Limit total retries per time window
  • Circuit breakers: Stop retrying after threshold

Pattern: Synchronized events cause coordinated resource access.

THE THUNDERING HERD
═══════════════════════════════════════════════════════════════
Setup: 10,000 users, cache TTL = 3600s (1 hour)
Cache populated at 10:00 AM
Normal operation (10:00 AM - 10:59 AM):
┌─────────┐ ┌──────────┐
│ Users │──▶ Cache HIT (99.9%) ──▶│ Happy │
│ 10,000 │ │ users │
└─────────┘ └──────────┘
DB load: ~10 req/s
11:00:00 AM - Cache expires (simultaneously for everyone):
┌─────────┐ ┌──────────┐
│ Users │──▶ Cache MISS ─────────▶│ Database │
│ 10,000 │ (all at once!) │ │
└─────────┘ └──────────┘
DB load: 10,000 req/s
OVERWHELMED
11:00:01 AM - Database too slow, queries timeout:
┌─────────┐ ┌──────────┐
│ Users │──▶ Cache MISS ─────────▶│ Database │
│ 10,000 │ (still empty!) │ DYING │
└─────────┘ + └──────────┘
Retries from
first requests DB can't recover
because cache can't
be repopulated

Breaking the loop:

  • Jittered TTLs: TTL = 3600 + random(-300, 300) seconds
  • Single-writer pattern: One request populates cache, others wait
  • Cache warming: Refresh before expiration
  • Request coalescing: Deduplicate identical in-flight requests

Pattern: Slow operations hold connections longer, exhausting the pool.

THE CONNECTION POOL DEATH SPIRAL
═══════════════════════════════════════════════════════════════
Pool size: 100 connections
Normal query time: 10ms
Queries per second: 500
Normal state:
────────────────────────────────────────────────────────────
Connections in use: ~5 (500 req/s × 10ms = 5 concurrent)
Pool utilization: 5%
Wait time: 0ms
────────────────────────────────────────────────────────────
Something makes queries slow (50ms instead of 10ms):
────────────────────────────────────────────────────────────
Connections in use: ~25 (500 req/s × 50ms = 25 concurrent)
Pool utilization: 25%
Wait time: 0ms ← Still fine, but watch this...
────────────────────────────────────────────────────────────
Requests queue because of slow queries (now 500ms):
────────────────────────────────────────────────────────────
Connections in use: 100 (MAXED)
Pool utilization: 100%
Wait time: 500ms ← Total request time now 1000ms
New requests timing out while waiting for connections
Timeouts trigger retries...
────────────────────────────────────────────────────────────
Death spiral engaged:
────────────────────────────────────────────────────────────
Connections in use: 100 (MAXED)
Waiting for connection: 500 requests
Wait time: 10,000ms+
Database sees 100 concurrent slow queries
Database gets slower
Queries take longer
Connections held longer
More requests waiting
More retries
MORE LOAD ON ALREADY DYING DATABASE
────────────────────────────────────────────────────────────

Breaking the loop:

  • Connection timeouts: Release connections if held too long
  • Query timeouts: Kill queries that exceed threshold
  • Pool sizing: Size based on downstream capacity, not demand
  • Bulkheading: Separate pools for different operation types

Pattern: Resource pressure causes evictions, evictions redistribute load, causing more pressure.

THE EVICTION CASCADE
═══════════════════════════════════════════════════════════════
Setup: 10 pods handling traffic evenly
Normal state:
─────────────────────────────────────────────────────
│ Pod 1 │ Pod 2 │ Pod 3 │ ... │ Pod 10 │
│ 10% │ 10% │ 10% │ ... │ 10% │ (load each)
─────────────────────────────────────────────────────
Memory per pod: 400MB (limit: 512MB)
Memory leak develops in Pod 3:
─────────────────────────────────────────────────────
│ Pod 1 │ Pod 2 │ Pod 3 │ ... │ Pod 10 │
│ 10% │ 10% │ 10% │ ... │ 10% │
│ 400MB │ 400MB │ 512MB │ ... │ 400MB │
OOMKilled
─────────────────────────────────────────────────────
Pod 3 evicted, load redistributed:
─────────────────────────────────────────────────────
│ Pod 1 │ Pod 2 │ │ ... │ Pod 10 │
│ 11% │ 11% │ GONE │ ... │ 11% │
│ 440MB │ 440MB │ │ ... │ 440MB │ ← Memory rising!
─────────────────────────────────────────────────────
More pods approach limit:
─────────────────────────────────────────────────────
│ Pod 1 │ Pod 2 │ │ ... │ Pod 10 │
│ 11% │ 11% │ GONE │ ... │ 11% │
│ 490MB │ 510MB │ │ ... │ 500MB │
OOMKilled (Pod 2)
─────────────────────────────────────────────────────
Cascade accelerates (now 8 pods for same traffic):
─────────────────────────────────────────────────────
8 pods... 7 pods... 6 pods... 5 pods...
Each eviction increases load on survivors
Each survivor moves closer to memory limit
Evictions accelerate until cluster collapse
─────────────────────────────────────────────────────

Breaking the loop:

  • PodDisruptionBudgets: Limit concurrent evictions
  • Proper resource requests: Ensure headroom
  • Horizontal scaling: Add pods before eviction, not after
  • Memory leak detection: Alert before OOM

Pattern: Incidents generate alerts, alerts overwhelm responders, overwhelmed responders miss alerts, more incidents.

THE ALERT STORM
═══════════════════════════════════════════════════════════════
Normal:
┌────────────────┐ ┌────────────────┐
│ 5 alerts/day │────▶│ Engineer reads │──▶ Issues fixed
│ All meaningful │ │ all of them │
└────────────────┘ └────────────────┘
After "improving monitoring":
┌────────────────┐ ┌────────────────┐
│ 50 alerts/day │────▶│ Engineer skims │──▶ Some fixed
│ Mostly noise │ │ ignores most │
└────────────────┘ └────────────────┘
After the first missed incident:
┌────────────────┐ ┌────────────────────────┐
│ 100 alerts/day │────▶│ "Add more monitoring!" │
│ Even more noise│ │ Now 200 alerts/day │
└────────────────┘ └────────────────────────┘
Alert fatigue sets in:
┌────────────────┐ ┌────────────────┐ ┌─────────────┐
│ 200 alerts/day │────▶│ Engineer mutes │────▶│ Incidents │
│ All ignored │ │ notifications │ │ go unnoticed│
└────────────────┘ └────────────────┘ └─────────────┘
┌─────────────────────────────────────────┘
┌─────────────────────┐
│ "We need EVEN MORE │
│ monitoring!" (loop) │
└─────────────────────┘

Breaking the loop:

  • Alert on symptoms, not causes (user impact, not internal metrics)
  • Every alert must be actionable
  • Regular alert review: Delete alerts nobody acts on
  • On-call feedback: Track alert-to-incident ratio

Pattern: Under-provisioning causes incidents, incidents cause over-provisioning, budgets cut, under-provisioning.

THE CAPACITY PLANNING SPIRAL
═══════════════════════════════════════════════════════════════
Year 1: Launch
┌─────────────────────────────────────┐
│ "We'll scale as needed" │
│ Provisioned: Just enough │
│ Buffer: ~5% │
└─────────────────────────────────────┘
Year 1, Month 6: Black Friday
┌─────────────────────────────────────┐
│ Traffic: 10x normal │
│ Result: Outage │
│ Blame: "We need more capacity!" │
└─────────────────────────────────────┘
Year 2: Post-mortem
┌─────────────────────────────────────┐
│ Provisioned: 20x normal (just in case) │
│ Actual usage: 2x normal │
│ Utilization: 10% │
│ CFO: "Why are we paying for this?" │
└─────────────────────────────────────┘
Year 2, Month 6: Budget cuts
┌─────────────────────────────────────┐
│ "Reduce cloud spend by 50%" │
│ Provisioned: Back to 10x │
│ Buffer: ~5% again │
└─────────────────────────────────────┘
Year 2, Month 11: Pre-holiday traffic
┌─────────────────────────────────────┐
│ Traffic: 12x normal │
│ Result: Outage (again) │
│ Loop repeats │
└─────────────────────────────────────┘

Breaking the loop:

  • Load testing: Know your actual limits
  • Autoscaling: Match capacity to demand
  • Cost attribution: Show cost vs. revenue at traffic levels
  • Chaos engineering: Prove you can handle expected peaks

Principle 1: Match loop speed to delay

If your system changes faster than your feedback loop can respond, you’ll oscillate. The loop evaluation period should exceed the total delay.

# Kubernetes HPA with proper stabilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 3
maxReplicas: 100
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 10 # Remove max 10% of pods
periodSeconds: 60 # Every minute
scaleUp:
stabilizationWindowSeconds: 60 # Wait 1 min before scaling up more
policies:
- type: Percent
value: 100 # Can double pods
periodSeconds: 60
- type: Pods
value: 4 # Or add max 4 pods
periodSeconds: 60
selectPolicy: Max # Use whichever allows more scaling

Principle 2: Add damping to prevent oscillation

Damping slows down responses, trading speed for stability. Like shock absorbers on a car.

# Bad: React to every measurement
def scale_pods(current_cpu):
if current_cpu > 70:
add_pods(5)
elif current_cpu < 50:
remove_pods(5)
# Good: Damped response with smoothing
class DampedScaler:
def __init__(self):
self.measurements = []
self.last_scale_time = 0
self.cooldown = 300 # 5 minutes
def scale_pods(self, current_cpu):
self.measurements.append(current_cpu)
# Only consider last 5 minutes of data
self.measurements = self.measurements[-20:]
# Require cooldown period
if time.time() - self.last_scale_time < self.cooldown:
return # Too soon to act
avg_cpu = sum(self.measurements) / len(self.measurements)
# Require sustained deviation
if all(m > 70 for m in self.measurements[-5:]):
add_pods(2) # Smaller increments
self.last_scale_time = time.time()
elif all(m < 50 for m in self.measurements[-5:]):
remove_pods(1) # Even smaller for scale-down
self.last_scale_time = time.time()

Principle 3: Break reinforcing loops with circuit breakers

Don’t let failure amplify failure. Insert breaks in the loop.

class CircuitBreaker:
def __init__(self, failure_threshold=5, recovery_timeout=30):
self.failures = 0
self.threshold = failure_threshold
self.timeout = recovery_timeout
self.state = "CLOSED" # CLOSED, OPEN, HALF_OPEN
self.opened_at = None
def call(self, func, *args, **kwargs):
if self.state == "OPEN":
if time.time() - self.opened_at > self.timeout:
self.state = "HALF_OPEN"
else:
raise CircuitOpenError("Circuit breaker is open")
try:
result = func(*args, **kwargs)
if self.state == "HALF_OPEN":
self.state = "CLOSED"
self.failures = 0
return result
except Exception as e:
self.failures += 1
if self.failures >= self.threshold:
self.state = "OPEN"
self.opened_at = time.time()
raise
# Usage: Breaks the retry storm loop
@circuit_breaker(failure_threshold=5, recovery_timeout=30)
def call_payment_service(order):
return http_client.post("/payments", order)

Principle 4: Add jitter to prevent synchronization

# Bad: All caches expire at exactly the same time
cache.set(key, value, ttl=3600) # All users hit this at 3600s
# Good: Randomize to spread load
import random
jittered_ttl = 3600 + random.randint(-600, 600) # 50-70 minutes
cache.set(key, value, ttl=jittered_ttl)
# For retries: exponential backoff with jitter
def retry_with_backoff(func, max_retries=5):
for attempt in range(max_retries):
try:
return func()
except RetryableError:
if attempt < max_retries - 1:
base_delay = (2 ** attempt) # 1, 2, 4, 8, 16 seconds
jitter = random.uniform(0, base_delay * 0.5)
time.sleep(base_delay + jitter)
raise MaxRetriesExceeded()

Before deploying any system with feedback mechanisms, answer these questions:

FEEDBACK LOOP ANALYSIS CHECKLIST
═══════════════════════════════════════════════════════════════
For each feedback loop in your system:
□ IDENTIFICATION
- What are the elements in the loop?
- Is it reinforcing or balancing?
- What behavior does it create?
□ DELAYS
- What is the total delay around the loop?
- Is the loop evaluation faster than the delay?
- What happens during the delay period?
□ STABILITY
- Is there damping/smoothing?
- Are there stabilization windows?
- Can the loop oscillate? At what frequency?
□ FAILURE MODES
- What happens if feedback is delayed/lost?
- What happens under extreme load?
- Is there a circuit breaker to stop runaway loops?
□ SYNCHRONIZATION
- Can multiple instances synchronize?
- Is there jitter on timers/TTLs?
- What triggers correlated behavior?
□ OBSERVABILITY
- Can you see the loop in action?
- What metrics show loop behavior?
- How would you detect a runaway loop?

  • Audio engineers use feedback loops intentionally. Electric guitar sustain comes from controlled feedback between the pickups and amplifier. Jimi Hendrix was a master of feedback control.

  • The 2010 Flash Crash (stock market dropped 9% in 5 minutes) was caused by algorithmic trading feedback loops. One algorithm sold, prices dropped, other algorithms sold in response, prices dropped more. Circuit breakers now halt trading when prices move too fast.

  • Climate feedback loops worry scientists. Ice melts → less reflection → more heat absorption → more melting. Some loops are self-reinforcing with century-long delays—we might have already triggered them without knowing.

  • The Federal Reserve manages feedback loops constantly. Raise interest rates → less borrowing → less spending → lower inflation. But with 12-18 month delays, it’s like steering a supertanker through a narrow channel.


MistakeWhy It’s DangerousSolution
Retries without backoffCreates reinforcing loop that amplifies failuresExponential backoff with jitter
Tight autoscaler settingsOscillation wastes resources and can crash systemsStabilization windows, gradual changes
Identical cache TTLsThundering herd on expirationJitter all TTLs by ±10-20%
No circuit breakersFailures cascade until total outageAdd breakers at every service boundary
Ignoring metric delayAutoscaler reacts to stale data, overshootsEvaluation interval > metric delay
Alert on every metricAlert fatigue, real issues missedAlert on user-facing symptoms only
Scaling on connection countEach new pod adds connections, triggers more scalingScale on latency or queue depth instead

  1. What’s the difference between a reinforcing and a balancing feedback loop?

    Answer

    Reinforcing loops amplify change in the current direction. If the system is moving up, reinforcing loops push it further up. If moving down, further down. Examples: retry storms (failures cause retries cause more failures), viral growth (users invite users who invite more users).

    Balancing loops oppose change, pushing the system toward a target equilibrium. High CPU → add pods → lower CPU. High temperature → cooling → lower temperature.

    Quick test: If A increases, trace through the loop. Does A end up increasing more (reinforcing) or decreasing back toward a target (balancing)?

  2. Why do delays turn balancing loops into oscillating disasters?

    Answer

    Delays cause oscillation because corrective actions are based on stale information. The sequence:

    1. System detects problem (using delayed metrics)
    2. System applies correction
    3. During delay, system applies more correction (still seeing old data)
    4. All corrections take effect simultaneously
    5. System overshoots target in opposite direction
    6. Process repeats in reverse

    The longer the delay relative to correction speed, the worse the oscillation. Solution: evaluation interval should exceed total delay, and changes should be gradual (damped).

  3. Explain how a thundering herd creates a reinforcing loop.

    Answer

    The thundering herd reinforcing loop:

    1. Cache expires simultaneously for many users
    2. All requests miss cache, hit database
    3. Database overwhelmed, queries slow or fail
    4. Failed queries can’t populate cache
    5. Next requests still miss cache
    6. Even more database load
    7. Database slower, even fewer successful cache populations
    8. Loop continues until intervention

    The reinforcing element: cache failure → database overload → cache stays empty → more database overload. Each iteration makes things worse.

    Breaking the loop: jittered TTLs, single-writer pattern, request coalescing, cache warming before expiration.

  4. What is jitter and why is it crucial in distributed systems?

    Answer

    Jitter is deliberate randomness added to timing. Instead of exact intervals, you use a range.

    Why it’s crucial:

    • Prevents synchronization: Without jitter, all caches expire at once, all retries happen together, all health checks fire simultaneously
    • Converts spikes to spreads: Instead of 10,000 requests at second 0, you get 10,000 spread over 10 minutes
    • Breaks thundering herds: If TTL is 3600 ± 600 seconds, expirations are spread over 20 minutes instead of all at once
    • Reduces retry collisions: Random backoff means retries don’t all hit at 1s, 2s, 4s

    Common applications:

    • Cache TTLs: ±10-20% jitter
    • Retry backoff: Add random 0-50% of base delay
    • Periodic jobs: Random delay on startup
    • Health checks: Random offset from interval

Task: Analyze feedback loops in a production architecture.

Scenario: You’re reviewing this API service architecture:

┌─────────────────────────────────────────────────────────────┐
│ PRODUCTION SYSTEM │
│ │
│ Users ──▶ CDN ──▶ Load Balancer ──▶ API Pods (HPA) │
│ │ │
│ ▼ │
│ ┌─────────── Redis Cache ────────────┐ │
│ │ TTL: 1 hour │ │
│ │ (no jitter) │ │
│ └────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ PostgreSQL │
│ (connection pool: 50) │
│ │
│ Retry Policy: 3 retries, 1s fixed delay │
│ HPA: Scale on CPU, check every 15s, cooldown 30s │
│ Rate Limit: 1000 req/s per user │
│ Circuit Breaker: None │
│ │
└─────────────────────────────────────────────────────────────┘

Your Analysis:

  1. Identify all feedback loops (20 minutes)

    List every loop you can find. For each one:

    • Elements in the loop
    • Type (reinforcing/balancing)
    • Total delay in the loop
    • What triggers it
    • What goes wrong
  2. Find dangerous combinations (10 minutes)

    Which loops could trigger together? Which balancing loops might fight each other?

  3. Propose fixes (15 minutes)

    For each dangerous loop, propose a concrete fix with configuration examples.

Expected Findings:

Click to see answer

Reinforcing Loops (Dangerous):

  1. Retry Storm

    • Path: Timeout → Retry → More load → Slower responses → More timeouts
    • Delay: 1 second (fixed, no jitter)
    • Trigger: Any slowdown
    • Problem: 3 retries with no jitter means 3x load when slow
  2. Cache Stampede

    • Path: Cache expires → All miss → DB overload → Timeouts → Cache not populated → More misses
    • Delay: Exactly 1 hour (synchronized)
    • Trigger: Any heavily-cached key expiring
    • Problem: No jitter means synchronized expiration
  3. Connection Pool Exhaustion

    • Path: Slow query → Connections held → Pool fills → Requests wait → Timeouts → Retries → More connections needed
    • Delay: Query timeout (likely long)
    • Trigger: Any database slowdown
    • Problem: Only 50 connections, no circuit breaker
  4. Rate Limit Retry Amplification

    • Path: Rate limit hit → 429 → Client retries → Rate limit hit → More 429s
    • Delay: 1 second (fixed retry)
    • Trigger: Traffic near 1000 req/s
    • Problem: Retries count against rate limit

Balancing Loops:

  1. HPA CPU Scaling

    • Path: High CPU → Add pods → Lower CPU per pod
    • Delay: 15s evaluation + pod startup (60s+)
    • Problem: Cooldown too short (30s), may oscillate
  2. Rate Limiting

    • Path: Too many requests → Reject some → Manageable load
    • Delay: Immediate
    • Problem: Works for server, but combined with retries is reinforcing for client

Dangerous Combinations:

  1. Cache stampede + Retry storm + Connection exhaustion:

    • Cache expires → DB hit → DB slow → Connections held → Pool full → Timeouts → Retries → 3x load on full pool → Complete failure
  2. HPA oscillation + Retry storm:

    • CPU high → Scale up → New pods retry → More load → CPU still high → Scale up more → Overshoot

Fixes:

# 1. Add jitter to cache TTL
cache.set(key, value, ttl=3600 + random(-600, 600))
# 2. Exponential backoff with jitter for retries
retry_policy:
max_retries: 3
backoff:
type: exponential
base: 1s
max: 30s
jitter: 0.5
# 3. Add circuit breaker on database
circuit_breaker:
failure_threshold: 5
timeout: 30s
half_open_requests: 3
# 4. Fix HPA timing
spec:
behavior:
scaleDown:
stabilizationWindowSeconds: 300
scaleUp:
stabilizationWindowSeconds: 60
# 5. Connection pool with timeout
pool:
max_connections: 50
connection_timeout: 5s
idle_timeout: 60s

Success Criteria:

  • Identified at least 4 reinforcing loops
  • Identified at least 2 balancing loops
  • Documented delays for each loop
  • Found at least 2 dangerous combinations
  • Proposed fixes with specific configurations
  • Explained why each fix breaks the loop

  • “Thinking in Systems” by Donella Meadows - Chapter 2 covers feedback loops with beautiful clarity. Essential reading.

  • “Release It!” by Michael Nygard - Chapters on stability patterns (circuit breakers, timeouts, bulkheads) are all about managing feedback loops in production.

  • “How Complex Systems Fail” by Richard Cook - 18 short points on why feedback loops in complex systems create surprising failures. Takes 10 minutes to read, provides lifetime of insight.

  • “Control Theory for Engineers” - Any introductory text. Understanding PID controllers will transform how you tune autoscalers.

  • “Feedback Control for Computer Systems” by Philipp K. Janert - Applies control theory directly to software systems. Dense but invaluable.


Module 1.3: Mental Models for Operations - Build practical mental models for understanding production systems: leverage points, stock-and-flow diagrams, and the frameworks that experienced operators use instinctively.