Перейти до вмісту

Module 5.1: What Makes Systems Distributed

Цей контент ще не доступний вашою мовою.

Complexity: [MEDIUM]

Time to Complete: 25-30 minutes

Prerequisites: Systems Thinking Track (recommended)

Track: Foundations

After completing this module, you will be able to:

  1. Explain the eight fallacies of distributed computing and identify where each manifests in real cloud-native architectures
  2. Analyze a system’s architecture to determine which components introduce distributed-system challenges (partial failure, network partitions, clock skew)
  3. Design service boundaries that account for the inherent unreliability of network communication between components
  4. Evaluate when a distributed architecture is justified versus when a simpler single-process design would be more reliable and maintainable

February 28, 2017. Amazon Web Services experiences what would become one of the most costly outages in cloud computing history.

A simple typo in a command to remove a small number of S3 servers accidentally removes a larger set of servers than intended. The servers being removed supported the S3 index and placement systems—the metadata layer that tracks where every object is stored.

For the next 4 hours, S3 in US-East-1 was essentially offline. But the real impact was just beginning. Hundreds of other AWS services depended on S3. Websites couldn’t load images. Applications couldn’t access configuration. CI/CD pipelines failed. The AWS status dashboard—itself hosted on S3—couldn’t display that S3 was down.

The cascade revealed what most engineers knew but rarely confronted: even the simplest web application is a distributed system, with hidden dependencies, partial failures, and emergent behaviors that no single person fully understands.

S&P 500 companies collectively lost an estimated $150 million during the outage. And it all started because distributed systems don’t behave like the single-machine programs most engineers learn to write.

This module teaches why distributed systems are fundamentally different—and why understanding their constraints is essential for building anything that scales.


Every modern system is distributed. The moment you have a web server talking to a database, you have a distributed system. The moment you deploy to the cloud, you’re distributed across availability zones. The moment you scale beyond one machine, distribution becomes your reality.

But distributed systems don’t behave like single machines. Things that were easy become hard. Things that were guaranteed become probabilistic. Assumptions that held for decades suddenly break. The sooner you understand why, the sooner you can design systems that actually work.

This module introduces the fundamental challenges of distributed systems—the laws of physics and logic that constrain what’s possible, and the patterns that emerge when you can’t escape them.

The Space Station Analogy

Imagine Mission Control in Houston talking to astronauts on the International Space Station. Messages take time to travel. Signals can be lost. The astronauts can’t wait for permission before every action—they need autonomy. And Mission Control can’t know the exact state of the station at any instant. Every distributed system faces the same challenges: latency, unreliability, and uncertainty about remote state.


  • What makes a system “distributed” (and why it matters)
  • The fundamental challenges: latency, partial failure, no global clock
  • The CAP theorem and what it really means
  • Why distributed systems require different thinking
  • How Kubernetes is a distributed system

DISTRIBUTED SYSTEM DEFINITION
═══════════════════════════════════════════════════════════════
A distributed system is one where:
- Components run on multiple networked computers
- Components coordinate by passing messages
- The system appears as a single coherent system to users
Leslie Lamport's definition:
"A distributed system is one in which the failure of a computer
you didn't even know existed can render your own computer unusable."
KEY PROPERTIES
─────────────────────────────────────────────────────────────
1. CONCURRENCY: Multiple things happen simultaneously
2. NO GLOBAL CLOCK: No single source of "now"
3. INDEPENDENT FAILURE: Parts can fail without others knowing
4. MESSAGE PASSING: Communication via network, not shared memory
REASONS TO DISTRIBUTE
═══════════════════════════════════════════════════════════════
SCALABILITY
─────────────────────────────────────────────────────────────
One machine has limits. Need more capacity? Add more machines.
Single server: 10,000 requests/sec max
10 servers: 100,000 requests/sec
100 servers: 1,000,000 requests/sec
AVAILABILITY
─────────────────────────────────────────────────────────────
One machine will fail. Multiple machines can cover for each other.
Single server: Server dies → System dies
Multiple servers: Server dies → Others handle load
LATENCY
─────────────────────────────────────────────────────────────
Physics limits speed. Put data closer to users.
US user → US server: 20ms
US user → Europe server: 100ms
US user → Asia server: 200ms
ORGANIZATIONAL
─────────────────────────────────────────────────────────────
Different teams, different services, different deployment cycles.
Monolith: One team, one deployment, everyone waits
Microservices: Independent teams, independent deploys
DISTRIBUTION SPECTRUM
═══════════════════════════════════════════════════════════════
NOT DISTRIBUTED FULLY DISTRIBUTED
│ │
▼ ▼
┌──────────┐ ┌──────────────┐ ┌────────────┐ ┌──────────────┐
│ Single │ │ Web + DB │ │ Microsvcs │ │ Global │
│ Process │ │ (2 servers) │ │ (dozens) │ │ Multi-region │
└──────────┘ └──────────────┘ └────────────┘ └──────────────┘
│ │ │ │
│ │ │ │
No network Simple Complex Maximum
No latency failure modes coordination complexity
Easy debug Easy reasoning Hard to debug Very hard
Most systems live in the middle—distributed enough to need
the mental models, but not so distributed that nothing works.

Try This (2 minutes)

Think about a system you work with. Where does it fall on the spectrum?

ComponentRuns WhereCommunicates Via

Is it more distributed than you initially thought?


LATENCY: THE SPEED OF LIGHT PROBLEM
═══════════════════════════════════════════════════════════════
In a single machine:
Function call: ~1 nanosecond
Memory access: ~100 nanoseconds
Across a network:
Same datacenter: ~500 microseconds (500,000 ns)
Cross-country: ~30 milliseconds (30,000,000 ns)
Cross-ocean: ~100 milliseconds (100,000,000 ns)
IMPACT
─────────────────────────────────────────────────────────────
┌────────────────────────────────────────────────────────────┐
│ │
│ Local call: 1 ns │
│ Network call: 1,000,000 ns (1 ms) │
│ │
│ That's 1,000,000x slower! │
│ │
│ Design that works locally can be disastrous distributed. │
│ │
│ Example: Loop with 100 database calls │
│ Local: 100 × 1ns = 100ns │
│ Network: 100 × 1ms = 100ms (user notices!) │
│ │
└────────────────────────────────────────────────────────────┘
LATENCY NUMBERS EVERY PROGRAMMER SHOULD KNOW
─────────────────────────────────────────────────────────────
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Main memory reference: 100 ns
SSD random read: 16,000 ns (16 μs)
Network round trip (same DC): 500,000 ns (500 μs)
Disk seek: 2,000,000 ns (2 ms)
Network round trip (US→EU): 100,000,000 ns (100 ms)
PARTIAL FAILURE: THE UNRELIABILITY PROBLEM
═══════════════════════════════════════════════════════════════
In a single machine:
Either the whole thing works, or the whole thing crashes.
Failure is total and obvious.
In a distributed system:
Part can fail while other parts continue.
Failure can be partial, intermittent, or undetectable.
FAILURE MODES
─────────────────────────────────────────────────────────────
┌────────────────────────────────────────────────────────────┐
│ │
│ CRASH FAILURE │
│ Node stops responding entirely. │
│ At least you know it's dead. │
│ │
│ OMISSION FAILURE │
│ Messages lost. Did the server get my request? │
│ Did it respond but I didn't receive? │
│ │
│ TIMING FAILURE │
│ Response came, but too slow. Did it timeout? │
│ Did it complete? Is it still running? │
│ │
│ BYZANTINE FAILURE │
│ Component lies or behaves maliciously. │
│ Hardest to handle. Usually assume this won't happen. │
│ │
└────────────────────────────────────────────────────────────┘
THE FUNDAMENTAL UNCERTAINTY
─────────────────────────────────────────────────────────────
You send a request. No response. What happened?
1. Request was lost in transit?
2. Server crashed before processing?
3. Server crashed after processing?
4. Response was lost in transit?
5. Server is just slow?
You cannot distinguish these cases from the client side.
This is fundamental. No protocol can solve it.
NO GLOBAL CLOCK: THE ORDERING PROBLEM
═══════════════════════════════════════════════════════════════
In a single machine:
Events have a clear order.
Thread A did X, then Thread B did Y.
System clock is the arbiter.
In a distributed system:
No shared clock. Each machine has its own.
Clocks drift. Even GPS-synced clocks differ by milliseconds.
"Before" and "after" become fuzzy concepts.
THE ORDERING PROBLEM
─────────────────────────────────────────────────────────────
Server A Server B
│ │
10:00:00.001 │ ─── Request ───────▶│ 10:00:00.003
│ │
10:00:00.005 │◀─── Response ───── │ 10:00:00.004
│ │
Which happened first? The request or the response?
Server A says: Request at 10:00:00.001, response at 10:00:00.005
Server B says: Request at 10:00:00.003, response at 10:00:00.004
Server B's clock is 2ms ahead. The timestamps are misleading.
CONSEQUENCES
─────────────────────────────────────────────────────────────
- Can't use timestamps to order events across machines
- "Last write wins" requires agreeing on what "last" means
- Debugging is hard: logs from different servers don't align
- Need logical clocks (Lamport clocks, vector clocks) for ordering

War Story: The $12 Million Clock Skew Incident

March 2018. A high-frequency trading firm discovers they’ve been losing money for months—not from bad trades, but from bad timestamps.

The firm operated trading servers in both New Jersey and Chicago. The servers used NTP to synchronize clocks, but NTP only guarantees accuracy within tens of milliseconds. The clocks had drifted 47ms apart.

When a trade executed in Chicago triggered a hedge trade in New Jersey, the timestamps sometimes showed the hedge happening before the original trade. The reconciliation system, which used timestamps to order events, processed them wrong. Sometimes hedges were cancelled as “duplicate trades.” Sometimes positions were calculated incorrectly.

Timeline of discovery:

  • Day 1: Risk analyst notices P&L discrepancies in overnight reports
  • Week 2: Engineering traces problem to event ordering logic
  • Week 3: Clock drift identified as root cause
  • Week 4: Logical clocks implemented (Lamport timestamps)
  • Month 2: Full audit reveals $12 million in losses over 6 months

The fix: The team replaced wall-clock ordering with Lamport timestamps. Every event now carries a logical clock value that increments with each operation. When server A sends a message to server B, it includes its clock value. Server B sets its clock to max(its_clock, received_clock) + 1. This guarantees that caused events always have higher timestamps than their causes—regardless of what the physical clocks say.

The lesson: In distributed systems, “when” something happened is less important than “what caused what.” Physical time is unreliable. Causal ordering is essential.


THE CAP THEOREM
═══════════════════════════════════════════════════════════════
In a distributed system, you can have at most TWO of:
C - CONSISTENCY
Every read receives the most recent write.
All nodes see the same data at the same time.
A - AVAILABILITY
Every request receives a response.
System is always operational.
P - PARTITION TOLERANCE
System continues operating despite network partitions.
Messages between nodes can be lost or delayed.
THE CATCH
─────────────────────────────────────────────────────────────
Network partitions WILL happen. You don't get to choose.
So really, during a partition you choose between:
CP: Consistent but unavailable during partition
AP: Available but possibly inconsistent during partition
┌─────────────────────────────────────────────────────────────┐
│ │
│ Normal operation: You can have all three! │
│ │
│ During partition: Choose C or A │
│ │
│ CP: Reject requests to maintain consistency │
│ AP: Accept requests, reconcile later │
│ │
└─────────────────────────────────────────────────────────────┘
CAP TRADE-OFFS IN REAL SYSTEMS
═══════════════════════════════════════════════════════════════
CP SYSTEMS (Consistent, Partition-tolerant)
─────────────────────────────────────────────────────────────
During partition: Refuse some requests to maintain consistency
Examples:
- Traditional RDBMS with synchronous replication
- ZooKeeper
- etcd (Kubernetes' brain)
- MongoDB with majority write concern
Use when: Correctness matters more than availability
- Financial transactions
- Inventory systems
- Coordination services
AP SYSTEMS (Available, Partition-tolerant)
─────────────────────────────────────────────────────────────
During partition: Accept requests, data may diverge
Examples:
- Cassandra
- DynamoDB (eventually consistent reads)
- DNS
- Caches
Use when: Availability matters more than immediate consistency
- Shopping carts
- Social media feeds
- Metrics/logging
REALITY CHECK
─────────────────────────────────────────────────────────────
Most systems need BOTH consistency AND availability.
The choice is: which to sacrifice DURING a partition?
Partitions are rare but not rare enough to ignore.
Design for partition. Choose your trade-off deliberately.
PACELC: A MORE COMPLETE MODEL
═══════════════════════════════════════════════════════════════
CAP only talks about partitions. But what about normal operation?
PACELC: If Partition, then Availability vs Consistency,
Else, Latency vs Consistency
┌─────────────────────────────────────────────────────────────┐
│ │
│ During Partition (P): │
│ Choose A (availability) or C (consistency) │
│ │
│ Else (normal operation): │
│ Choose L (latency) or C (consistency) │
│ │
│ PA/EL: Available during partition, low latency normally │
│ (Cassandra, DynamoDB) │
│ │
│ PC/EC: Consistent during partition, consistent normally │
│ (Traditional RDBMS, etcd) │
│ │
│ PA/EC: Available during partition, consistent normally │
│ (Some configs of MongoDB) │
│ │
└─────────────────────────────────────────────────────────────┘
This model acknowledges that even in normal operation,
you trade latency for consistency (synchronous replication)
or consistency for latency (asynchronous replication).

Try This (3 minutes)

For each scenario, which trade-off makes sense?

ScenarioDuring PartitionWhy
Bank account balanceCPCan’t show wrong balance
Social media likes countAP
Kubernetes pod state
Shopping cart
DNS

THE TWO GENERALS PROBLEM
═══════════════════════════════════════════════════════════════
Setup:
Two generals must coordinate an attack.
They can only communicate via messengers.
Messengers may be captured (lost messages).
Both must attack simultaneously or not at all.
The Problem:
─────────────────────────────────────────────────────────────
General A General B
│ │
│ "Attack at dawn" │
│ ─────────messenger──────────────▶ │
│ │
│ "Got it" │
│ ◀────────messenger─────────────── │
│ │
│ But did B receive my ack? │
│ "I got your ack" │
│ ─────────messenger──────────────▶ │
│ │
│ But did A receive... │
│ (infinite regress) │
IMPOSSIBILITY
─────────────────────────────────────────────────────────────
There is NO protocol that guarantees agreement if messages
can be lost. This is proven mathematically impossible.
IMPLICATIONS FOR DISTRIBUTED SYSTEMS
─────────────────────────────────────────────────────────────
You cannot guarantee that two nodes agree on anything
if the network is unreliable. The best you can do:
- Increase probability (retries, acknowledgments)
- Accept uncertainty (eventual consistency)
- Use consensus protocols (Paxos, Raft) when possible
BYZANTINE GENERALS PROBLEM
═══════════════════════════════════════════════════════════════
Same as Two Generals, but worse:
- Some generals might be traitors
- Traitors send conflicting messages to different generals
- Loyal generals must still reach agreement
THE CHALLENGE
─────────────────────────────────────────────────────────────
General A: "Attack!"
General B: "Attack!" (traitor, tells C "Retreat")
General C: "Attack!"
C receives: "Attack" from A, "Retreat" from B
A receives: "Attack" from B (B lies), "Attack" from C
How can A and C agree despite B's lies?
SOLUTION REQUIREMENTS
─────────────────────────────────────────────────────────────
To tolerate f Byzantine (lying) nodes, you need 3f + 1 total.
1 traitor → need 4 generals
2 traitors → need 7 generals
This is expensive! Most systems assume no Byzantine failures.
WHERE IT MATTERS
─────────────────────────────────────────────────────────────
- Blockchain (trustless networks)
- Safety-critical systems (aviation, medical)
- Systems where nodes might be compromised
Most internal systems assume nodes are honest but buggy.
They handle crash failures, not Byzantine failures.
IDEMPOTENCY: SAFE TO RETRY
═══════════════════════════════════════════════════════════════
An operation is IDEMPOTENT if doing it multiple times
has the same effect as doing it once.
WHY IT MATTERS
─────────────────────────────────────────────────────────────
Request sent. Timeout. Did it succeed?
If NOT idempotent: Dangerous to retry
"Transfer $100" → Retry → $200 transferred!
If idempotent: Safe to retry
"Set balance to $100" → Retry → Still $100
IDEMPOTENT OPERATIONS
─────────────────────────────────────────────────────────────
✓ SET x = 5 (result is always 5)
✓ DELETE user 123 (can't delete twice)
✓ PUT /users/123 (creates or replaces)
✗ INCREMENT x (adds more each time)
✗ POST /users (creates new each time)
✗ TRANSFER $100 (transfers more each time)
MAKING OPERATIONS IDEMPOTENT
─────────────────────────────────────────────────────────────
Add a unique request ID. Server tracks completed requests.
Client: "Transfer $100, request_id=abc123"
Server: Checks if abc123 already processed
If yes: Return cached result
If no: Process, save result, return
Client timeout → Retry with same abc123 → Safe!

Part 5: Kubernetes as a Distributed System

Section titled “Part 5: Kubernetes as a Distributed System”
KUBERNETES: A DISTRIBUTED SYSTEM
═══════════════════════════════════════════════════════════════
CONTROL PLANE
─────────────────────────────────────────────────────────────
┌─────────────────────────────────────────────────────────────┐
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ API Server│ │ API Server│ │ API Server│ (HA) │
│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │
│ │ │ │ │
│ └──────────────┼──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ etcd │ (distributed consensus) │
│ │ ┌───┬───┬───┐ │ │
│ │ │ 1 │ 2 │ 3 │ │ (Raft protocol) │
│ │ └───┴───┴───┘ │ │
│ └─────────────────┘ │
│ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ Controller Mgr │ │ Scheduler │ │
│ │ (leader elect)│ │ (leader elect)│ │
│ └────────────────┘ └────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
DATA PLANE (Nodes)
─────────────────────────────────────────────────────────────
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ ┌──────┐ │ │ ┌──────┐ │ │ ┌──────┐ │
│ │kubelet│ │ │ │kubelet│ │ │ │kubelet│ │
│ └──────┘ │ │ └──────┘ │ │ └──────┘ │
│ ┌──────┐ │ │ ┌──────┐ │ │ ┌──────┐ │
│ │kube- │ │ │ │kube- │ │ │ │kube- │ │
│ │proxy │ │ │ │proxy │ │ │ │proxy │ │
│ └──────┘ │ │ └──────┘ │ │ └──────┘ │
│ [pods] │ │ [pods] │ │ [pods] │
└────────────┘ └────────────┘ └────────────┘
KUBERNETES DISTRIBUTED PATTERNS
═══════════════════════════════════════════════════════════════
CONSENSUS (etcd)
─────────────────────────────────────────────────────────────
etcd uses Raft consensus for strong consistency.
All cluster state lives in etcd.
Writes require majority agreement (quorum).
3 etcd nodes: Can lose 1
5 etcd nodes: Can lose 2
LEADER ELECTION
─────────────────────────────────────────────────────────────
Controller Manager and Scheduler use leader election.
Only one active at a time (avoid duplicate work).
If leader fails, another takes over.
# Check current leader
kubectl get leases -n kube-system
EVENTUAL CONSISTENCY (Controllers)
─────────────────────────────────────────────────────────────
Controllers reconcile desired vs actual state.
They're eventually consistent—changes propagate over time.
You: "I want 3 replicas"
Controller: Sees 2, creates 1
Repeat until actual = desired
WATCH PATTERN
─────────────────────────────────────────────────────────────
Components watch etcd for changes, not poll.
Reduces load, provides near-real-time updates.
kubelet watches for pod assignments
Controller watches for resource changes
KUBERNETES FAILURE SCENARIOS
═══════════════════════════════════════════════════════════════
NODE FAILURE
─────────────────────────────────────────────────────────────
kubelet stops sending heartbeats.
Node marked "NotReady" after timeout (default 40s).
Pods rescheduled after tolerationSeconds (default 300s).
Timeline:
0s: Node dies
40s: Node marked NotReady
300s: Pods evicted, rescheduled elsewhere
ETCD QUORUM LOSS
─────────────────────────────────────────────────────────────
If majority of etcd nodes fail, cluster becomes read-only.
Existing pods keep running (kubelet cached state).
No new pods, no updates until quorum restored.
3-node etcd: Lose 2 → Quorum lost
5-node etcd: Lose 3 → Quorum lost
NETWORK PARTITION
─────────────────────────────────────────────────────────────
Nodes can't reach control plane.
Existing pods keep running (autonomous kubelet).
No new scheduling until connectivity restored.
Split-brain scenario:
Partition A: Some nodes + some etcd
Partition B: Other nodes + other etcd
If neither has etcd quorum: Both read-only
Kubernetes chooses consistency over availability (CP)
API SERVER FAILURE
─────────────────────────────────────────────────────────────
With multiple API servers, load balancer routes around failure.
kubectl commands may fail temporarily.
Pods keep running (kubelet operates independently).

  • The speed of light limits distributed systems. A message from New York to London takes at least 28ms—the time light needs to travel through fiber. You can’t engineer around physics.

  • Google’s Spanner uses atomic clocks and GPS to synchronize time across datacenters to within a few milliseconds. It’s one of the few systems that can offer both strong consistency and global distribution.

  • The term “Byzantine fault” comes from the Byzantine Generals Problem, named for the Byzantine Empire’s reputation for complex political intrigue. In distributed systems, a Byzantine node is one that might behave arbitrarily—including lying.

  • Leslie Lamport’s “Time, Clocks, and the Ordering of Events” paper (1978) introduced logical clocks and is one of the most cited papers in computer science. Lamport later won the Turing Award partly for this work. The key insight: you don’t need real time to order events—you just need to track causality (“happened before” relationships).


MistakeProblemSolution
Ignoring network latencyDesign works locally, fails distributedMeasure latency, batch requests
Assuming networks are reliableNo retry logic, no timeoutsImplement timeouts, retries, circuit breakers
Using wall-clock time for orderingClock skew causes wrong orderUse logical clocks or version vectors
Not designing for partial failureOne failure takes down everythingBulkheads, graceful degradation
Synchronous everythingLatency compounds, cascade failuresAsync where possible, queues
Ignoring CAP trade-offsSurprised by behavior during partitionChoose trade-offs explicitly

  1. What are the three fundamental challenges of distributed systems?

    Answer

    The three fundamental challenges are:

    1. Latency: Network communication is millions of times slower than local function calls. Speed of light imposes physical limits.

    2. Partial failure: Components can fail independently. You can’t always tell if a remote call succeeded, failed, or is just slow.

    3. No global clock: Each machine has its own clock that drifts. You can’t rely on timestamps to order events across machines.

    These challenges are fundamental—they can be mitigated but not eliminated. Any distributed system design must account for all three.

  2. Explain the CAP theorem in your own words.

    Answer

    The CAP theorem states that during a network partition, a distributed system must choose between consistency and availability:

    • Consistency: All nodes see the same data
    • Availability: All requests get responses
    • Partition tolerance: System handles network splits

    Since network partitions are inevitable, the real choice is:

    • CP: During partition, refuse some requests to maintain consistency
    • AP: During partition, serve requests but data might diverge

    In normal operation (no partition), you can have both. The choice matters when things go wrong.

    Example: A banking system chooses CP—better to reject a transaction than show wrong balance. A social media site chooses AP—better to show slightly stale like count than fail entirely.

  3. Why can’t you use timestamps to order events in a distributed system?

    Answer

    You can’t rely on timestamps because:

    1. Clock drift: Each machine’s clock runs slightly fast or slow. Over time, clocks diverge.

    2. No perfect sync: Even with NTP, clocks differ by milliseconds. Events happening close together may get wrong order.

    3. Network delay: A message sent “earlier” might arrive “later” than one sent after it.

    Example: Server A (clock running fast) timestamps event at 10:00:00.050. Server B (accurate clock) timestamps event at 10:00:00.045. B’s event actually happened after A’s, but has earlier timestamp.

    Solutions:

    • Logical clocks (Lamport clocks): Track causality, not wall time
    • Vector clocks: Track causality per node
    • Hybrid logical clocks: Combine physical and logical time
  4. What makes Kubernetes a distributed system, and how does it handle distribution?

    Answer

    Kubernetes is distributed because:

    • Multiple components (API server, etcd, controllers, kubelets)
    • Running on multiple machines
    • Coordinating via network messages
    • No single point of failure (when properly configured)

    How it handles distribution:

    1. etcd for consensus: Uses Raft protocol for strong consistency of cluster state. Requires quorum for writes.

    2. Leader election: Controller Manager and Scheduler elect leaders to avoid duplicate work.

    3. Eventual consistency: Controllers continuously reconcile desired vs actual state. Changes propagate over time.

    4. Watch pattern: Components watch etcd for changes instead of polling, reducing load.

    5. Autonomous kubelets: Nodes keep running pods even if disconnected from control plane.

    Kubernetes chooses CP during partitions—better to refuse changes than have inconsistent state.

  5. A service makes 50 sequential database calls per request. Each call takes 2ms over the network. What’s the minimum latency for a user request? How could you improve it?

    Answer

    Calculation:

    • 50 calls × 2ms = 100ms minimum latency
    • This doesn’t include processing time, just network round trips

    Improvement strategies:

    1. Batch calls: Instead of 50 individual calls, combine into fewer queries

      • 5 batched calls × 2ms = 10ms (10x improvement)
    2. Parallelize: If calls are independent, execute simultaneously

      • 50 parallel calls = 2ms (50x improvement)
    3. Cache: Store frequently accessed data locally

      • Cache hit: 0.1ms vs 2ms network call
    4. Denormalize: Store data together to reduce joins/calls

      • Single call with all data: 2ms total

    The lesson: Network latency compounds. Design minimizes round trips.

  6. You send a request to a service and don’t receive a response after 5 seconds. List all possible states the request could be in. Why is this uncertainty fundamental?

    Answer

    Possible states:

    1. Request lost in transit - Never reached server
    2. Server received but crashed before processing - No work done
    3. Server processing but slow - Will eventually complete
    4. Server processed successfully, response lost - Work completed!
    5. Server processed, crashed after responding - Work completed!
    6. Response delayed in network - Will eventually arrive

    Why fundamental:

    This uncertainty cannot be eliminated by any protocol. From the client’s perspective, states 2, 3, 4, 5, and 6 look identical: silence.

    This is the Two Generals Problem—you cannot achieve certainty about remote state over an unreliable channel.

    Practical implications:

    • Must design operations to be idempotent (safe to retry)
    • Use unique request IDs to detect duplicates
    • Implement timeouts and retries with exponential backoff
    • Accept that “at-most-once” or “at-least-once” delivery—never “exactly-once”
  7. An etcd cluster has 5 nodes. How many can fail while maintaining quorum? If you lose quorum, what happens to the Kubernetes cluster?

    Answer

    Quorum calculation:

    • Quorum requires majority: floor(n/2) + 1
    • For 5 nodes: floor(5/2) + 1 = 3 nodes required
    • Can lose 2 nodes and still have quorum (3 remaining)

    If quorum is lost (3+ nodes fail):

    1. etcd becomes read-only - No writes accepted
    2. No new pods can be scheduled - Scheduler can’t write
    3. No deployments can update - Controllers can’t write
    4. Existing pods keep running - kubelets have cached state
    5. kubectl get works - Reads still succeed
    6. kubectl apply fails - Writes rejected

    Recovery:

    • Restore failed nodes to regain quorum
    • Or restore from etcd backup to new cluster
    • Never reduce below 3 nodes (can’t lose any)

    Best practice: Use 5 nodes for production (tolerates 2 failures), 3 for smaller clusters (tolerates 1 failure).

  8. A social media platform shows like counts that update every 30 seconds. During a network partition between US and EU datacenters, a post gets 1000 likes in US and 500 likes in EU. When the partition heals, what should the count be? What CAP trade-off did this system make?

    Answer

    The count should be 1500 (1000 + 500)

    This requires a CRDT (Conflict-free Replicated Data Type)—specifically a G-Counter (grow-only counter) that merges by summing per-replica increments.

    CAP trade-off:

    • System chose AP (Availability over Consistency)
    • During partition: Both regions accepted writes (available)
    • Result: Temporary inconsistency (US saw 1000, EU saw 500)
    • After partition: Merge brings eventual consistency

    Why this is appropriate for likes:

    • Temporary wrong count is acceptable
    • Users don’t expect real-time accuracy
    • Unavailability (can’t like) would be worse than inconsistency

    Contrast with banking:

    • Bank would choose CP
    • During partition: Reject transactions
    • Why: Incorrect balance is unacceptable
    • Unavailability is better than double-spending

    The pattern: Choose AP for data where “close enough” is acceptable, CP for data where correctness is critical.


Task: Explore distributed system behavior in Kubernetes.

Part 1: Observe Latency (10 minutes)

Run a pod and measure API server latency:

Terminal window
# Create a test pod
kubectl run latency-test --image=busybox --command -- sleep 3600
# Time various operations
time kubectl get pod latency-test
time kubectl get pod latency-test -o yaml
time kubectl describe pod latency-test
# Compare to local operations
time cat /etc/hostname

Record your findings:

OperationTimeNotes
kubectl get pod
kubectl describe
Local file read

Part 2: Observe Eventual Consistency (10 minutes)

Watch a deployment scale:

Terminal window
# Create deployment
kubectl create deployment scale-test --image=nginx --replicas=1
# In terminal 1: Watch pods
kubectl get pods -w
# In terminal 2: Scale up rapidly
kubectl scale deployment scale-test --replicas=10
# Observe:
# - How long until all pods are Running?
# - Do pods appear in order?
# - What intermediate states do you see?

Part 3: Simulate Partial Failure (10 minutes)

If using kind or minikube with multiple nodes:

Terminal window
# List nodes
kubectl get nodes
# Cordon a node (no new pods)
kubectl cordon <node-name>
# Watch what happens to pods
kubectl get pods -o wide -w
# Uncordon
kubectl uncordon <node-name>

Success Criteria:

  • Measured API server latency vs local operations
  • Observed eventual consistency during scaling
  • Understood intermediate states during changes
  • (Optional) Observed behavior during node failure

  • “Designing Data-Intensive Applications” - Martin Kleppmann. The definitive guide to distributed systems concepts. Chapter 8 covers these fundamentals.

  • “Distributed Systems for Fun and Profit” - Mikito Takada. Free online book covering distributed systems fundamentals.

  • “Time, Clocks, and the Ordering of Events in a Distributed System” - Leslie Lamport. The foundational paper on logical clocks.


Before moving on, ensure you understand:

  • The three fundamental challenges: Latency (network is slow), partial failure (parts fail independently), no global clock (can’t trust timestamps)
  • CAP theorem reality: During a network partition, you choose consistency OR availability—not both. Most systems need both, so you choose which to sacrifice during failures
  • PACELC extension: Even without partitions, you trade latency for consistency. Synchronous replication = consistent but slow. Async = fast but potentially stale
  • The Two Generals Problem: You cannot guarantee agreement over unreliable networks. This is mathematically proven impossible
  • Idempotency is essential: Design operations to be safe to retry. Use unique request IDs to detect duplicates
  • Kubernetes is CP: It chooses consistency over availability during partitions. etcd requires quorum; without it, the cluster becomes read-only
  • Logical clocks > wall clocks: Don’t use timestamps to order events. Use Lamport clocks or vector clocks to track causality
  • Distribution is a spectrum: A web server + database is distributed. Understanding where your system falls helps choose appropriate patterns

Module 5.2: Consensus and Coordination - How distributed systems agree on anything, and why it’s so hard.