Module 1.2: Feedback Loops

Цей контент ще не доступний вашою мовою.

Complexity: [MEDIUM]

Time to Complete: 35-40 minutes

Prerequisites: Module 1.1: What is Systems Thinking?

Track: Foundations

What You’ll Be Able to Do

After completing this module, you will be able to:

Identify reinforcing and balancing feedback loops in production systems and predict their behavior under load
Diagnose runaway cascades caused by positive feedback loops such as retry storms, autoscaler thrashing, and connection pool exhaustion
Design damping mechanisms (circuit breakers, backoff strategies, rate limiters) that interrupt destructive feedback cycles
Evaluate whether a system’s feedback loops will stabilize or amplify under failure conditions

The Black Friday Meltdown

November 29th, 2019. 10:47 AM Eastern Time.

The ShopMart engineering team is watching dashboards with coffee in hand. Black Friday traffic is building—already 3x normal load, heading toward 10x by afternoon. The autoscaler is doing its job, spinning up new pods. Everything looks green.

10:52 AM. Someone notices something odd. Database connection count is climbing faster than traffic. Not by a little—exponentially. Traffic up 20%, connections up 400%.

The senior DBA pulls up query logs. Queries are taking 3x longer than normal. Nothing has changed in the code. The database itself isn’t maxed out—CPU at 40%, memory fine.

10:58 AM. The first timeouts start appearing. Payment service can’t reach the database. It retries. All instances retry simultaneously—thousands of retries per second.

11:04 AM. The circuit breaker finally trips, but it’s too late. The database connection pool is completely exhausted. New pods are spinning up (autoscaler sees high latency), but each new pod tries to grab connections from an empty pool. More pods, more connection attempts, more failures.

11:12 AM. Complete outage. The site displays “temporarily unavailable” to 400,000 shoppers.

The root cause? A single slow query. One poorly indexed query that normally took 50ms started taking 500ms under load. Connections held 10x longer meant connection pool filled 10x faster. Which meant more queuing. Which meant even longer waits. Which meant more timeouts. Which meant retries. Which added more load. Which made queries even slower.

A feedback loop turned a minor performance issue into a complete system collapse in under 25 minutes.

Why This Module Matters

Every production outage has a story. But if you look closely at the worst ones—the cascading failures, the death spirals, the “everything went wrong at once” disasters—they share a common element: feedback loops that amplified small problems into catastrophic failures.

Understanding feedback loops is understanding the DNA of system behavior.

A system with well-designed feedback loops is antifragile—it stabilizes under stress, recovers from failures, and gets stronger with use. A system with poorly designed feedback loops is a time bomb—stable in normal conditions, catastrophic when stressed.

This module teaches you to:

Recognize feedback loops before they bite
Predict which loops will help and which will harm
Design systems that use feedback safely
Break dangerous loops before they cascade

The Thermostat Principle

Your home thermostat is a perfect feedback loop. Temperature drops below setpoint → heater turns on → temperature rises → heater turns off. This is a balancing loop—it opposes change, maintaining stability.

Now imagine if your thermostat were wired backwards: temperature drops → heater turns off. That’s a reinforcing loop—it amplifies change. Your house would freeze.

Most production incidents are just thermostats wired backwards. The system is trying to help, but its corrective actions make things worse. This module teaches you to spot backwards thermostats before they freeze your production systems.

What You’ll Learn

The two fundamental types of feedback loops
Why delays turn helpful loops into destructive oscillations
The six most dangerous feedback patterns in distributed systems
How to design systems that use feedback safely
Techniques for breaking loops once they start

Part 1: The Two Types of Feedback Loops

All feedback loops fall into two categories. Master this distinction and you’ll understand half of all production incidents.

1.1 Reinforcing Loops: The Amplifiers

Reinforcing loops amplify change. Whatever direction the system is moving, reinforcing loops push it further. They’re called “positive feedback” not because they’re good (they usually aren’t), but because they add to the existing trend.

Think of a microphone placed in front of a speaker. Sound enters the mic → gets amplified → comes out the speaker → enters the mic louder → gets amplified more. Without intervention, you get that ear-piercing screech within seconds.

THE ANATOMY OF A RETRY STORM
═══════════════════════════════════════════════════════════════

     ┌─────────────────────────────────────────────────┐
     │              THE DEATH SPIRAL                   │
     │                                                 │
     ▼                                                 │
┌─────────┐      ┌─────────┐      ┌─────────┐         │
│ Latency │─────▶│Timeouts │─────▶│ Retries │─────────┘
│Increases│      │ Occur   │      │ Happen  │
└─────────┘      └─────────┘      └─────────┘
     ▲                                 │
     │                                 │ Each retry adds load
     │                                 ▼
     │                        ┌─────────────────┐
     └────────────────────────│ Server becomes  │
                              │ more overloaded │
                              └─────────────────┘

Timeline of a real incident:

t=0:00   Latency: 200ms   Timeouts: 0/sec      Load: 1000 req/s
t=0:30   Latency: 500ms   Timeouts: 50/sec     Load: 1050 req/s
t=1:00   Latency: 2000ms  Timeouts: 400/sec    Load: 1400 req/s
t=1:30   Latency: 5000ms  Timeouts: 1200/sec   Load: 2200 req/s
t=2:00   Latency: DEAD    Timeouts: ALL        Load: 4000+ req/s

From "a little slow" to "completely dead" in 2 minutes.

The mathematics of reinforcing loops are terrifying. If each loop iteration increases load by just 10%, after 10 iterations you’re at 2.6x original load. After 20 iterations, 6.7x. After 30, 17x. This is exponential growth—and it happens fast.

Common reinforcing loops in production:

Loop	How It Works	Why It’s Dangerous
Retry storms	Failure → retry → more load → more failure	Can 10x load in minutes
Cache stampede	Cache expires → all hit DB → DB slows → cache stays empty	Synchronized devastation
Connection pool exhaustion	Slow queries → connections held → pool fills → more waiting	Everything stops
Memory pressure	Swapping → slower processing → more memory pressure	Gradual then sudden death
Alert fatigue	Too many alerts → ignored → more incidents → more alerts	Human systems fail too

1.2 Balancing Loops: The Stabilizers

Balancing loops oppose change. They push the system back toward a target or equilibrium. They’re called “negative feedback” because they subtract from the current trend.

Your body temperature is maintained by balancing loops. Too hot → you sweat → cooling → temperature drops. Too cold → you shiver → heat generation → temperature rises. The target is 37°C, and your body fights any deviation.

HOW AUTOSCALING SHOULD WORK
═══════════════════════════════════════════════════════════════

                     Target: 70% CPU
                          │
                     ┌────┴────┐
                     │         │
     ┌───────────────▼─────────▼───────────────┐
     │                                         │
     │   Measure ──▶ Compare ──▶ Adjust        │
     │                                         │
     │      70%        vs        Scale         │
     │     ━━━━━━    target      pods          │
     │                                         │
     └─────────────────────────────────────────┘
               │                  │
               │                  │
               ▼                  ▼
         ┌─────────┐        ┌─────────┐
         │CPU high │        │Add pods │
         │ (85%)   │        │         │
         └────┬────┘        └────┬────┘
              │                  │
              │                  │
              └─────────┬────────┘
                        │
                        ▼
                  ┌──────────┐
                  │ CPU back │
                  │ to 70%   │
                  └──────────┘

This is a BALANCING loop: it opposes change.
High CPU → action → lower CPU.
Low CPU → action → higher CPU.
System stabilizes around target.

Common balancing loops in production:

Loop	How It Works	What It Protects
Autoscaling	High load → add capacity → lower load	Performance
Rate limiting	Too many requests → reject excess → manageable load	Availability
Circuit breakers	Failures rise → stop calling → failures drop	Dependencies
Backpressure	Queue full → slow producers → queue drains	Memory
Garbage collection	Memory fills → GC runs → memory freed	Stability

Did You Know?

Your body contains over 200 feedback loops maintaining temperature, blood sugar, blood pressure, pH levels, and more. Engineers who study biology often build more resilient systems.

The steam engine governor (1788) was one of the first mechanical feedback loops. When the engine sped up, weights flew outward and closed the steam valve. James Watt’s governor made the Industrial Revolution possible.

Predator-prey cycles in nature are feedback loops. Rabbits increase → foxes increase (more food) → rabbits decrease (more predation) → foxes decrease (less food) → rabbits increase again. Ecologists mapped these loops in the 1920s; they’re now used to model cloud services.

1.3 Identifying Loop Types: The Polarity Test

Quick technique: Count the inversions. An inversion is when an increase in one thing causes a decrease in another (or vice versa).

THE POLARITY TEST
═══════════════════════════════════════════════════════════════

Count the "inversions" (where ↑ causes ↓ or ↓ causes ↑):

REINFORCING (even number of inversions: 0, 2, 4...):
┌───────────────────────────────────────────────┐
│  A ↑ ─────▶ B ↑ ─────▶ C ↑ ─────▶ A ↑        │
│                                               │
│  Load ↑ → Latency ↑ → Retries ↑ → Load ↑     │
│  (0 inversions = reinforcing)                 │
└───────────────────────────────────────────────┘

BALANCING (odd number of inversions: 1, 3, 5...):
┌───────────────────────────────────────────────┐
│  A ↑ ─────▶ B ↑ ─────▶ C ↓ ─────▶ A ↓        │
│                    ▲                          │
│                 INVERSION                     │
│                                               │
│  CPU ↑ → Pods ↑ → Load per pod ↓ → CPU ↓     │
│  (1 inversion = balancing)                    │
└───────────────────────────────────────────────┘

Quick test: Start with "A increases" and follow the loop.
If A ends up increasing more → Reinforcing
If A ends up decreasing back → Balancing

Practice examples:

Rate limiting: Requests ↑ → Rejections ↑ → Accepted requests ↓ → Load ↓
- One inversion (more rejections = fewer accepted) = Balancing
Cache miss cascade: Misses ↑ → DB queries ↑ → DB latency ↑ → Cache timeout ↑ → Misses ↑
- Zero inversions = Reinforcing
Pod eviction: Memory ↑ → Eviction ↑ → Running pods ↓ → Load per pod ↑ → Memory ↑
- Two inversions = Reinforcing (eviction makes things worse!)

Part 2: Delays—Why Good Loops Go Bad

Here’s the dirty secret of feedback loops: balancing loops can become destructive when delays are too long. A well-intentioned stabilizing mechanism can oscillate wildly, causing more damage than if it didn’t exist at all.

2.1 The Shower Problem

Everyone has experienced this. Hotel shower. Unfamiliar controls.

Water is cold
Turn up the hot
Still cold (water still in pipes)
Turn up more
Still cold (patience running thin)
Turn up even more
SCALDING HOT (all adjustments hit at once)
Yank it to cold
Still hot (pipe delay again)
Crank it colder
FREEZING
Repeat until you give up

This is a balancing loop with delay. The longer the delay relative to how fast you adjust, the worse the oscillation. The loop is trying to stabilize temperature, but the delay causes overshoot in both directions.

2.2 Autoscaler Oscillation: A Story in Three Graphs

SCENARIO: Autoscaler with 3-minute pod startup delay
═══════════════════════════════════════════════════════════════

GRAPH 1: What the autoscaler SEES (metrics with delay)
──────────────────────────────────────────────────────
CPU %
100│     ●       ●       ●       ●       (metrics arrive)
   │    ╱│      ╱│      ╱│      ╱│
 70│───╱─┼─────╱─┼─────╱─┼─────╱─┼────── Target
   │  ╱  │    ╱  │    ╱  │    ╱  │
 40│ ╱   │   ╱   │   ╱   │   ╱   │
   │╱    │  ╱    │  ╱    │  ╱    │
   └─────┴──────┴──────┴──────┴─────── Time
         ↑      ↑      ↑      ↑
      "High!" "High!" "High!" "High!"
      +5 pods +5 pods +5 pods +5 pods


GRAPH 2: What ACTUALLY happened (reality)
──────────────────────────────────────────────────────
CPU %
100│            ╱╲
   │           ╱  ╲
 70│──────────╱────╲────────────────── Target
   │         ╱      ╲
 40│        ╱        ╲
   │       ╱          ╲_______________
 20│──────╱────────────────────────── Massive overshoot!
   └────────────────────────────────── Time
           ↑
      All 20 pods
      become ready
      simultaneously


GRAPH 3: The oscillation continues
──────────────────────────────────────────────────────
CPU %
100│    ╱╲              ╱╲              ╱╲
   │   ╱  ╲            ╱  ╲            ╱  ╲
 70│──╱────╲──────────╱────╲──────────╱────╲── Target
   │ ╱      ╲        ╱      ╲        ╱      ╲
 40│╱        ╲      ╱        ╲      ╱        ╲
   │          ╲    ╱          ╲    ╱          ╲
 20│           ╲__╱            ╲__╱            ╲_
   └────────────────────────────────────────────── Time
         Scale up  Scale down  Scale up  Scale down
         (overshoot)(undershoot)(overshoot)...

Without damping, this continues indefinitely.

2.3 The Delay Inventory

Every feedback loop contains delays. Knowing your delays is essential for tuning loops correctly.

Delay Type	Typical Duration	Where It Lurks
Metric collection	10-60s	Prometheus scrape interval
Metric aggregation	15-60s	Query evaluation period
Alert threshold	30-300s	”Fire after 5 minutes of…”
Autoscaler cooldown	30-600s	Prevent thrashing
Pod startup	10-300s	Image pull + init + readiness
DNS propagation	30-86400s	TTL-dependent
Human response	300-3600s	Page → wake → investigate
Deployment pipeline	300-3600s	Build + test + deploy
Cache invalidation	Variable	TTL or explicit purge

The total loop delay is the sum of all delays in the loop. If your HPA evaluates every 15 seconds but pods take 3 minutes to start, your effective loop delay is 3+ minutes.

War Story: The Autoscaler That Destroyed Itself (Extended Version)

A logistics company ran their order tracking system on Kubernetes. They configured an HPA to scale based on a custom metric: messages in the processing queue. Clever idea—scale based on actual work, not just CPU.

The metric came from their message broker, scraped by Prometheus every 30 seconds, aggregated over 1 minute for smoothing. Total delay: about 2 minutes from queue state to metric availability.

The HPA evaluated every 15 seconds. So every 15 seconds, it would look at 2-minute-old data and make scaling decisions.

One Thursday, a burst of orders came in. Queue depth jumped. The 2-minute-old metric showed “queue growing.” HPA added 5 pods. Fifteen seconds later, still seeing “queue growing” (stale metric). Added 5 more. This continued for 2 minutes until fresh metrics arrived.

By then, they had 47 pods for a queue that 8 pods could handle. But it got worse.

Each pod needed 3 database connections. 47 pods × 3 connections = 141 connections. Their connection pool max was 100. Pods started failing health checks. The node autoscaler, seeing failing pods, added more nodes to “help.” Each new node started more pods. More pods meant more connection attempts against a saturated pool.

The database, overwhelmed with connection attempts, started timing out actual queries. The queue processing slowed. The queue grew. The HPA saw the growing queue (in now-accurate metrics) and tried to add more pods.

At the peak, they had 400+ pods failing to start across 12 nodes, their database completely unresponsive, and zero orders being processed.

The fix took three changes:

Increased HPA evaluation interval to exceed metric delay (3 minutes)

Added stabilization windows (5 minutes before scale-down)

Set connection pool limits as pod resource constraints

Total outage: 47 minutes. Revenue impact: $230,000. The fix: 3 lines of YAML.

Part 3: The Six Deadly Loops

These patterns cause the majority of cascading failures in distributed systems. Learn to recognize them instantly.

3.1 The Retry Storm

Pattern: Failure triggers retries, retries add load, load causes more failures.

THE RETRY STORM ANATOMY
═══════════════════════════════════════════════════════════════

Normal state:
┌────────┐  1000 req/s  ┌────────┐
│ Client │─────────────▶│ Server │  Everything fine
└────────┘              └────────┘

Trouble begins (server slow):
┌────────┐  1000 req/s  ┌────────┐
│ Client │─────────────▶│ Server │  Latency: 500ms
└────────┘     +        └────────┘  (normal: 50ms)
              200 retries

Getting worse:
┌────────┐  1000 req/s  ┌────────┐
│ Client │─────────────▶│ Server │  Latency: 2000ms
└────────┘     +        └────────┘  Timeouts starting
              800 retries
                 +
              200 retry-of-retries

Death spiral:
┌────────┐  1000 req/s  ┌────────────────┐
│ Client │─────────────▶│    Server      │
└────────┘     +        │  OVERLOADED    │
            3000 retries│  100% failure  │
              (3x load) └────────────────┘

Total traffic: 4000+ req/s from 1000 original requests.
Server never recovers without external intervention.

Breaking the loop:

Exponential backoff: Each retry waits longer
Jitter: Randomize backoff to prevent synchronization
Retry budgets: Limit total retries per time window
Circuit breakers: Stop retrying after threshold

3.2 The Thundering Herd

Pattern: Synchronized events cause coordinated resource access.

THE THUNDERING HERD
═══════════════════════════════════════════════════════════════

Setup: 10,000 users, cache TTL = 3600s (1 hour)
Cache populated at 10:00 AM

Normal operation (10:00 AM - 10:59 AM):
┌─────────┐                         ┌──────────┐
│  Users  │──▶ Cache HIT (99.9%) ──▶│  Happy   │
│ 10,000  │                         │  users   │
└─────────┘                         └──────────┘
                                    DB load: ~10 req/s

11:00:00 AM - Cache expires (simultaneously for everyone):
┌─────────┐                         ┌──────────┐
│  Users  │──▶ Cache MISS ─────────▶│ Database │
│ 10,000  │    (all at once!)       │          │
└─────────┘                         └──────────┘
                                    DB load: 10,000 req/s
                                              ↓
                                         OVERWHELMED

11:00:01 AM - Database too slow, queries timeout:
┌─────────┐                         ┌──────────┐
│  Users  │──▶ Cache MISS ─────────▶│ Database │
│ 10,000  │    (still empty!)       │  DYING   │
└─────────┘       +                 └──────────┘
            Retries from
            first requests         DB can't recover
                                   because cache can't
                                   be repopulated

Breaking the loop:

Jittered TTLs: TTL = 3600 + random(-300, 300) seconds
Single-writer pattern: One request populates cache, others wait
Cache warming: Refresh before expiration
Request coalescing: Deduplicate identical in-flight requests

3.3 The Connection Pool Death Spiral

Pattern: Slow operations hold connections longer, exhausting the pool.

THE CONNECTION POOL DEATH SPIRAL
═══════════════════════════════════════════════════════════════

Pool size: 100 connections
Normal query time: 10ms
Queries per second: 500

Normal state:
────────────────────────────────────────────────────────────
Connections in use: ~5 (500 req/s × 10ms = 5 concurrent)
Pool utilization: 5%
Wait time: 0ms
────────────────────────────────────────────────────────────

Something makes queries slow (50ms instead of 10ms):
────────────────────────────────────────────────────────────
Connections in use: ~25 (500 req/s × 50ms = 25 concurrent)
Pool utilization: 25%
Wait time: 0ms    ← Still fine, but watch this...
────────────────────────────────────────────────────────────

Requests queue because of slow queries (now 500ms):
────────────────────────────────────────────────────────────
Connections in use: 100 (MAXED)
Pool utilization: 100%
Wait time: 500ms  ← Total request time now 1000ms
New requests timing out while waiting for connections
Timeouts trigger retries...
────────────────────────────────────────────────────────────

Death spiral engaged:
────────────────────────────────────────────────────────────
Connections in use: 100 (MAXED)
Waiting for connection: 500 requests
Wait time: 10,000ms+
Database sees 100 concurrent slow queries
Database gets slower
Queries take longer
Connections held longer
More requests waiting
More retries
MORE LOAD ON ALREADY DYING DATABASE
────────────────────────────────────────────────────────────

Breaking the loop:

Connection timeouts: Release connections if held too long
Query timeouts: Kill queries that exceed threshold
Pool sizing: Size based on downstream capacity, not demand
Bulkheading: Separate pools for different operation types

3.4 The Eviction Cascade

Pattern: Resource pressure causes evictions, evictions redistribute load, causing more pressure.

THE EVICTION CASCADE
═══════════════════════════════════════════════════════════════

Setup: 10 pods handling traffic evenly

Normal state:
─────────────────────────────────────────────────────
│ Pod 1 │ Pod 2 │ Pod 3 │ ... │ Pod 10 │
│  10%  │  10%  │  10%  │ ... │   10%  │  (load each)
─────────────────────────────────────────────────────
Memory per pod: 400MB (limit: 512MB)

Memory leak develops in Pod 3:
─────────────────────────────────────────────────────
│ Pod 1 │ Pod 2 │ Pod 3 │ ... │ Pod 10 │
│  10%  │  10%  │  10%  │ ... │   10%  │
│ 400MB │ 400MB │ 512MB │ ... │  400MB │
                    ↓
              OOMKilled
─────────────────────────────────────────────────────

Pod 3 evicted, load redistributed:
─────────────────────────────────────────────────────
│ Pod 1 │ Pod 2 │       │ ... │ Pod 10 │
│  11%  │  11%  │ GONE  │ ... │   11%  │
│ 440MB │ 440MB │       │ ... │  440MB │  ← Memory rising!
─────────────────────────────────────────────────────

More pods approach limit:
─────────────────────────────────────────────────────
│ Pod 1 │ Pod 2 │       │ ... │ Pod 10 │
│  11%  │  11%  │ GONE  │ ... │   11%  │
│ 490MB │ 510MB │       │ ... │  500MB │
            ↓
       OOMKilled (Pod 2)
─────────────────────────────────────────────────────

Cascade accelerates (now 8 pods for same traffic):
─────────────────────────────────────────────────────
8 pods... 7 pods... 6 pods... 5 pods...
Each eviction increases load on survivors
Each survivor moves closer to memory limit
Evictions accelerate until cluster collapse
─────────────────────────────────────────────────────

Breaking the loop:

PodDisruptionBudgets: Limit concurrent evictions
Proper resource requests: Ensure headroom
Horizontal scaling: Add pods before eviction, not after
Memory leak detection: Alert before OOM

3.5 The Alert Storm

Pattern: Incidents generate alerts, alerts overwhelm responders, overwhelmed responders miss alerts, more incidents.

THE ALERT STORM
═══════════════════════════════════════════════════════════════

Normal:
┌────────────────┐     ┌────────────────┐
│ 5 alerts/day   │────▶│ Engineer reads │──▶ Issues fixed
│ All meaningful │     │ all of them    │
└────────────────┘     └────────────────┘

After "improving monitoring":
┌────────────────┐     ┌────────────────┐
│ 50 alerts/day  │────▶│ Engineer skims │──▶ Some fixed
│ Mostly noise   │     │ ignores most   │
└────────────────┘     └────────────────┘

After the first missed incident:
┌────────────────┐     ┌────────────────────────┐
│ 100 alerts/day │────▶│ "Add more monitoring!" │
│ Even more noise│     │ Now 200 alerts/day     │
└────────────────┘     └────────────────────────┘

Alert fatigue sets in:
┌────────────────┐     ┌────────────────┐     ┌─────────────┐
│ 200 alerts/day │────▶│ Engineer mutes │────▶│ Incidents   │
│ All ignored    │     │ notifications  │     │ go unnoticed│
└────────────────┘     └────────────────┘     └─────────────┘
                                                     │
           ┌─────────────────────────────────────────┘
           │
           ▼
┌─────────────────────┐
│ "We need EVEN MORE  │
│ monitoring!" (loop) │
└─────────────────────┘

Breaking the loop:

Alert on symptoms, not causes (user impact, not internal metrics)
Every alert must be actionable
Regular alert review: Delete alerts nobody acts on
On-call feedback: Track alert-to-incident ratio

3.6 The Capacity Planning Spiral

Pattern: Under-provisioning causes incidents, incidents cause over-provisioning, budgets cut, under-provisioning.

THE CAPACITY PLANNING SPIRAL
═══════════════════════════════════════════════════════════════

Year 1: Launch
┌─────────────────────────────────────┐
│ "We'll scale as needed"             │
│ Provisioned: Just enough            │
│ Buffer: ~5%                         │
└─────────────────────────────────────┘

Year 1, Month 6: Black Friday
┌─────────────────────────────────────┐
│ Traffic: 10x normal                 │
│ Result: Outage                      │
│ Blame: "We need more capacity!"     │
└─────────────────────────────────────┘

Year 2: Post-mortem
┌─────────────────────────────────────┐
│ Provisioned: 20x normal (just in case) │
│ Actual usage: 2x normal             │
│ Utilization: 10%                    │
│ CFO: "Why are we paying for this?"  │
└─────────────────────────────────────┘

Year 2, Month 6: Budget cuts
┌─────────────────────────────────────┐
│ "Reduce cloud spend by 50%"         │
│ Provisioned: Back to 10x            │
│ Buffer: ~5% again                   │
└─────────────────────────────────────┘

Year 2, Month 11: Pre-holiday traffic
┌─────────────────────────────────────┐
│ Traffic: 12x normal                 │
│ Result: Outage (again)              │
│ Loop repeats                        │
└─────────────────────────────────────┘

Breaking the loop:

Load testing: Know your actual limits
Autoscaling: Match capacity to demand
Cost attribution: Show cost vs. revenue at traffic levels
Chaos engineering: Prove you can handle expected peaks

Part 4: Designing with Feedback in Mind

4.1 Principles for Safe Feedback Loops

Principle 1: Match loop speed to delay

If your system changes faster than your feedback loop can respond, you’ll oscillate. The loop evaluation period should exceed the total delay.

# Kubernetes HPA with proper stabilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 10                       # Remove max 10% of pods
        periodSeconds: 60               # Every minute
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling up more
      policies:
      - type: Percent
        value: 100                      # Can double pods
        periodSeconds: 60
      - type: Pods
        value: 4                        # Or add max 4 pods
        periodSeconds: 60
      selectPolicy: Max                 # Use whichever allows more scaling

Principle 2: Add damping to prevent oscillation

Damping slows down responses, trading speed for stability. Like shock absorbers on a car.

# Bad: React to every measurement
def scale_pods(current_cpu):
    if current_cpu > 70:
        add_pods(5)
    elif current_cpu < 50:
        remove_pods(5)

# Good: Damped response with smoothing
class DampedScaler:
    def __init__(self):
        self.measurements = []
        self.last_scale_time = 0
        self.cooldown = 300  # 5 minutes

    def scale_pods(self, current_cpu):
        self.measurements.append(current_cpu)

        # Only consider last 5 minutes of data
        self.measurements = self.measurements[-20:]

        # Require cooldown period
        if time.time() - self.last_scale_time < self.cooldown:
            return  # Too soon to act

        avg_cpu = sum(self.measurements) / len(self.measurements)

        # Require sustained deviation
        if all(m > 70 for m in self.measurements[-5:]):
            add_pods(2)  # Smaller increments
            self.last_scale_time = time.time()
        elif all(m < 50 for m in self.measurements[-5:]):
            remove_pods(1)  # Even smaller for scale-down
            self.last_scale_time = time.time()

Principle 3: Break reinforcing loops with circuit breakers

Don’t let failure amplify failure. Insert breaks in the loop.

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = recovery_timeout
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.opened_at > self.timeout:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = func(*args, **kwargs)
            if self.state == "HALF_OPEN":
                self.state = "CLOSED"
                self.failures = 0
            return result
        except Exception as e:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "OPEN"
                self.opened_at = time.time()
            raise

# Usage: Breaks the retry storm loop
@circuit_breaker(failure_threshold=5, recovery_timeout=30)
def call_payment_service(order):
    return http_client.post("/payments", order)

Principle 4: Add jitter to prevent synchronization

# Bad: All caches expire at exactly the same time
cache.set(key, value, ttl=3600)  # All users hit this at 3600s

# Good: Randomize to spread load
import random
jittered_ttl = 3600 + random.randint(-600, 600)  # 50-70 minutes
cache.set(key, value, ttl=jittered_ttl)

# For retries: exponential backoff with jitter
def retry_with_backoff(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RetryableError:
            if attempt < max_retries - 1:
                base_delay = (2 ** attempt)  # 1, 2, 4, 8, 16 seconds
                jitter = random.uniform(0, base_delay * 0.5)
                time.sleep(base_delay + jitter)
    raise MaxRetriesExceeded()

4.2 The Feedback Loop Checklist

Before deploying any system with feedback mechanisms, answer these questions:

FEEDBACK LOOP ANALYSIS CHECKLIST
═══════════════════════════════════════════════════════════════

For each feedback loop in your system:

□ IDENTIFICATION
  - What are the elements in the loop?
  - Is it reinforcing or balancing?
  - What behavior does it create?

□ DELAYS
  - What is the total delay around the loop?
  - Is the loop evaluation faster than the delay?
  - What happens during the delay period?

□ STABILITY
  - Is there damping/smoothing?
  - Are there stabilization windows?
  - Can the loop oscillate? At what frequency?

□ FAILURE MODES
  - What happens if feedback is delayed/lost?
  - What happens under extreme load?
  - Is there a circuit breaker to stop runaway loops?

□ SYNCHRONIZATION
  - Can multiple instances synchronize?
  - Is there jitter on timers/TTLs?
  - What triggers correlated behavior?

□ OBSERVABILITY
  - Can you see the loop in action?
  - What metrics show loop behavior?
  - How would you detect a runaway loop?

Did You Know?

Audio engineers use feedback loops intentionally. Electric guitar sustain comes from controlled feedback between the pickups and amplifier. Jimi Hendrix was a master of feedback control.
The 2010 Flash Crash (stock market dropped 9% in 5 minutes) was caused by algorithmic trading feedback loops. One algorithm sold, prices dropped, other algorithms sold in response, prices dropped more. Circuit breakers now halt trading when prices move too fast.
Climate feedback loops worry scientists. Ice melts → less reflection → more heat absorption → more melting. Some loops are self-reinforcing with century-long delays—we might have already triggered them without knowing.
The Federal Reserve manages feedback loops constantly. Raise interest rates → less borrowing → less spending → lower inflation. But with 12-18 month delays, it’s like steering a supertanker through a narrow channel.

Common Mistakes

Mistake	Why It’s Dangerous	Solution
Retries without backoff	Creates reinforcing loop that amplifies failures	Exponential backoff with jitter
Tight autoscaler settings	Oscillation wastes resources and can crash systems	Stabilization windows, gradual changes
Identical cache TTLs	Thundering herd on expiration	Jitter all TTLs by ±10-20%
No circuit breakers	Failures cascade until total outage	Add breakers at every service boundary
Ignoring metric delay	Autoscaler reacts to stale data, overshoots	Evaluation interval > metric delay
Alert on every metric	Alert fatigue, real issues missed	Alert on user-facing symptoms only
Scaling on connection count	Each new pod adds connections, triggers more scaling	Scale on latency or queue depth instead

Quiz

What’s the difference between a reinforcing and a balancing feedback loop?

Answer

Reinforcing loops amplify change in the current direction. If the system is moving up, reinforcing loops push it further up. If moving down, further down. Examples: retry storms (failures cause retries cause more failures), viral growth (users invite users who invite more users).

Balancing loops oppose change, pushing the system toward a target equilibrium. High CPU → add pods → lower CPU. High temperature → cooling → lower temperature.

Quick test: If A increases, trace through the loop. Does A end up increasing more (reinforcing) or decreasing back toward a target (balancing)?
Why do delays turn balancing loops into oscillating disasters?
Answer

Delays cause oscillation because corrective actions are based on stale information. The sequence:
1. System detects problem (using delayed metrics)
2. System applies correction
3. During delay, system applies more correction (still seeing old data)
4. All corrections take effect simultaneously
5. System overshoots target in opposite direction
6. Process repeats in reverse
The longer the delay relative to correction speed, the worse the oscillation. Solution: evaluation interval should exceed total delay, and changes should be gradual (damped).
Explain how a thundering herd creates a reinforcing loop.
Answer

The thundering herd reinforcing loop:
1. Cache expires simultaneously for many users
2. All requests miss cache, hit database
3. Database overwhelmed, queries slow or fail
4. Failed queries can’t populate cache
5. Next requests still miss cache
6. Even more database load
7. Database slower, even fewer successful cache populations
8. Loop continues until intervention
The reinforcing element: cache failure → database overload → cache stays empty → more database overload. Each iteration makes things worse.

Breaking the loop: jittered TTLs, single-writer pattern, request coalescing, cache warming before expiration.
What is jitter and why is it crucial in distributed systems?
Answer

Jitter is deliberate randomness added to timing. Instead of exact intervals, you use a range.

Why it’s crucial:
- Prevents synchronization: Without jitter, all caches expire at once, all retries happen together, all health checks fire simultaneously
- Converts spikes to spreads: Instead of 10,000 requests at second 0, you get 10,000 spread over 10 minutes
- Breaks thundering herds: If TTL is 3600 ± 600 seconds, expirations are spread over 20 minutes instead of all at once
- Reduces retry collisions: Random backoff means retries don’t all hit at 1s, 2s, 4s
Common applications:
- Cache TTLs: ±10-20% jitter
- Retry backoff: Add random 0-50% of base delay
- Periodic jobs: Random delay on startup
- Health checks: Random offset from interval

Hands-On Exercise

Task: Analyze feedback loops in a production architecture.

Scenario: You’re reviewing this API service architecture:

┌─────────────────────────────────────────────────────────────┐
│                    PRODUCTION SYSTEM                        │
│                                                             │
│   Users ──▶ CDN ──▶ Load Balancer ──▶ API Pods (HPA)       │
│                                            │                │
│                                            ▼                │
│                    ┌─────────── Redis Cache ────────────┐   │
│                    │              TTL: 1 hour           │   │
│                    │              (no jitter)           │   │
│                    └────────────────────────────────────┘   │
│                                            │                │
│                                            ▼                │
│                                     PostgreSQL              │
│                              (connection pool: 50)          │
│                                                             │
│   Retry Policy: 3 retries, 1s fixed delay                  │
│   HPA: Scale on CPU, check every 15s, cooldown 30s         │
│   Rate Limit: 1000 req/s per user                          │
│   Circuit Breaker: None                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Your Analysis:

Identify all feedback loops (20 minutes)

List every loop you can find. For each one:
- Elements in the loop
- Type (reinforcing/balancing)
- Total delay in the loop
- What triggers it
- What goes wrong
Find dangerous combinations (10 minutes)

Which loops could trigger together? Which balancing loops might fight each other?
Propose fixes (15 minutes)

For each dangerous loop, propose a concrete fix with configuration examples.

Expected Findings:

Click to see answer

Reinforcing Loops (Dangerous):

Retry Storm
- Path: Timeout → Retry → More load → Slower responses → More timeouts
- Delay: 1 second (fixed, no jitter)
- Trigger: Any slowdown
- Problem: 3 retries with no jitter means 3x load when slow
Cache Stampede
- Path: Cache expires → All miss → DB overload → Timeouts → Cache not populated → More misses
- Delay: Exactly 1 hour (synchronized)
- Trigger: Any heavily-cached key expiring
- Problem: No jitter means synchronized expiration
Connection Pool Exhaustion
- Path: Slow query → Connections held → Pool fills → Requests wait → Timeouts → Retries → More connections needed
- Delay: Query timeout (likely long)
- Trigger: Any database slowdown
- Problem: Only 50 connections, no circuit breaker
Rate Limit Retry Amplification
- Path: Rate limit hit → 429 → Client retries → Rate limit hit → More 429s
- Delay: 1 second (fixed retry)
- Trigger: Traffic near 1000 req/s
- Problem: Retries count against rate limit

Balancing Loops:

HPA CPU Scaling
- Path: High CPU → Add pods → Lower CPU per pod
- Delay: 15s evaluation + pod startup (60s+)
- Problem: Cooldown too short (30s), may oscillate
Rate Limiting
- Path: Too many requests → Reject some → Manageable load
- Delay: Immediate
- Problem: Works for server, but combined with retries is reinforcing for client

Dangerous Combinations:

Cache stampede + Retry storm + Connection exhaustion:
- Cache expires → DB hit → DB slow → Connections held → Pool full → Timeouts → Retries → 3x load on full pool → Complete failure
HPA oscillation + Retry storm:
- CPU high → Scale up → New pods retry → More load → CPU still high → Scale up more → Overshoot

Fixes:

# 1. Add jitter to cache TTL
cache.set(key, value, ttl=3600 + random(-600, 600))

# 2. Exponential backoff with jitter for retries
retry_policy:
  max_retries: 3
  backoff:
    type: exponential
    base: 1s
    max: 30s
    jitter: 0.5

# 3. Add circuit breaker on database
circuit_breaker:
  failure_threshold: 5
  timeout: 30s
  half_open_requests: 3

# 4. Fix HPA timing
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 60

# 5. Connection pool with timeout
pool:
  max_connections: 50
  connection_timeout: 5s
  idle_timeout: 60s

Success Criteria:

Identified at least 4 reinforcing loops
Identified at least 2 balancing loops
Documented delays for each loop
Found at least 2 dangerous combinations
Proposed fixes with specific configurations
Explained why each fix breaks the loop

Next Module

Module 1.3: Mental Models for Operations - Build practical mental models for understanding production systems: leverage points, stock-and-flow diagrams, and the frameworks that experienced operators use instinctively.

Module 1.2: Feedback Loops

What You’ll Be Able to Do

The Black Friday Meltdown

Why This Module Matters

What You’ll Learn

Part 1: The Two Types of Feedback Loops

1.1 Reinforcing Loops: The Amplifiers

1.2 Balancing Loops: The Stabilizers

1.3 Identifying Loop Types: The Polarity Test

Part 2: Delays—Why Good Loops Go Bad

2.1 The Shower Problem

2.2 Autoscaler Oscillation: A Story in Three Graphs

2.3 The Delay Inventory

Part 3: The Six Deadly Loops

3.1 The Retry Storm

3.2 The Thundering Herd

3.3 The Connection Pool Death Spiral

3.4 The Eviction Cascade

3.5 The Alert Storm

3.6 The Capacity Planning Spiral

Part 4: Designing with Feedback in Mind

4.1 Principles for Safe Feedback Loops

4.2 The Feedback Loop Checklist

Did You Know?

Common Mistakes

Quiz

Hands-On Exercise

Further Reading

Next Module