Module 5.1: What Makes Systems Distributed
Цей контент ще не доступний вашою мовою.
Complexity:
[MEDIUM]Time to Complete: 25-30 minutes
Prerequisites: Systems Thinking Track (recommended)
Track: Foundations
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Explain the eight fallacies of distributed computing and identify where each manifests in real cloud-native architectures
- Analyze a system’s architecture to determine which components introduce distributed-system challenges (partial failure, network partitions, clock skew)
- Design service boundaries that account for the inherent unreliability of network communication between components
- Evaluate when a distributed architecture is justified versus when a simpler single-process design would be more reliable and maintainable
February 28, 2017. Amazon Web Services experiences what would become one of the most costly outages in cloud computing history.
A simple typo in a command to remove a small number of S3 servers accidentally removes a larger set of servers than intended. The servers being removed supported the S3 index and placement systems—the metadata layer that tracks where every object is stored.
For the next 4 hours, S3 in US-East-1 was essentially offline. But the real impact was just beginning. Hundreds of other AWS services depended on S3. Websites couldn’t load images. Applications couldn’t access configuration. CI/CD pipelines failed. The AWS status dashboard—itself hosted on S3—couldn’t display that S3 was down.
The cascade revealed what most engineers knew but rarely confronted: even the simplest web application is a distributed system, with hidden dependencies, partial failures, and emergent behaviors that no single person fully understands.
S&P 500 companies collectively lost an estimated $150 million during the outage. And it all started because distributed systems don’t behave like the single-machine programs most engineers learn to write.
This module teaches why distributed systems are fundamentally different—and why understanding their constraints is essential for building anything that scales.
Why This Module Matters
Section titled “Why This Module Matters”Every modern system is distributed. The moment you have a web server talking to a database, you have a distributed system. The moment you deploy to the cloud, you’re distributed across availability zones. The moment you scale beyond one machine, distribution becomes your reality.
But distributed systems don’t behave like single machines. Things that were easy become hard. Things that were guaranteed become probabilistic. Assumptions that held for decades suddenly break. The sooner you understand why, the sooner you can design systems that actually work.
This module introduces the fundamental challenges of distributed systems—the laws of physics and logic that constrain what’s possible, and the patterns that emerge when you can’t escape them.
The Space Station Analogy
Imagine Mission Control in Houston talking to astronauts on the International Space Station. Messages take time to travel. Signals can be lost. The astronauts can’t wait for permission before every action—they need autonomy. And Mission Control can’t know the exact state of the station at any instant. Every distributed system faces the same challenges: latency, unreliability, and uncertainty about remote state.
What You’ll Learn
Section titled “What You’ll Learn”- What makes a system “distributed” (and why it matters)
- The fundamental challenges: latency, partial failure, no global clock
- The CAP theorem and what it really means
- Why distributed systems require different thinking
- How Kubernetes is a distributed system
Part 1: Defining Distributed Systems
Section titled “Part 1: Defining Distributed Systems”1.1 What is a Distributed System?
Section titled “1.1 What is a Distributed System?”DISTRIBUTED SYSTEM DEFINITION═══════════════════════════════════════════════════════════════
A distributed system is one where:- Components run on multiple networked computers- Components coordinate by passing messages- The system appears as a single coherent system to users
Leslie Lamport's definition:"A distributed system is one in which the failure of a computeryou didn't even know existed can render your own computer unusable."
KEY PROPERTIES─────────────────────────────────────────────────────────────1. CONCURRENCY: Multiple things happen simultaneously2. NO GLOBAL CLOCK: No single source of "now"3. INDEPENDENT FAILURE: Parts can fail without others knowing4. MESSAGE PASSING: Communication via network, not shared memory1.2 Why Distribute?
Section titled “1.2 Why Distribute?”REASONS TO DISTRIBUTE═══════════════════════════════════════════════════════════════
SCALABILITY─────────────────────────────────────────────────────────────One machine has limits. Need more capacity? Add more machines.
Single server: 10,000 requests/sec max 10 servers: 100,000 requests/sec 100 servers: 1,000,000 requests/sec
AVAILABILITY─────────────────────────────────────────────────────────────One machine will fail. Multiple machines can cover for each other.
Single server: Server dies → System dies Multiple servers: Server dies → Others handle load
LATENCY─────────────────────────────────────────────────────────────Physics limits speed. Put data closer to users.
US user → US server: 20ms US user → Europe server: 100ms US user → Asia server: 200ms
ORGANIZATIONAL─────────────────────────────────────────────────────────────Different teams, different services, different deployment cycles.
Monolith: One team, one deployment, everyone waits Microservices: Independent teams, independent deploys1.3 The Distribution Spectrum
Section titled “1.3 The Distribution Spectrum”DISTRIBUTION SPECTRUM═══════════════════════════════════════════════════════════════
NOT DISTRIBUTED FULLY DISTRIBUTED │ │ ▼ ▼┌──────────┐ ┌──────────────┐ ┌────────────┐ ┌──────────────┐│ Single │ │ Web + DB │ │ Microsvcs │ │ Global ││ Process │ │ (2 servers) │ │ (dozens) │ │ Multi-region │└──────────┘ └──────────────┘ └────────────┘ └──────────────┘ │ │ │ │ │ │ │ │ No network Simple Complex Maximum No latency failure modes coordination complexity Easy debug Easy reasoning Hard to debug Very hard
Most systems live in the middle—distributed enough to needthe mental models, but not so distributed that nothing works.Try This (2 minutes)
Think about a system you work with. Where does it fall on the spectrum?
Component Runs Where Communicates Via Is it more distributed than you initially thought?
Part 2: The Fundamental Challenges
Section titled “Part 2: The Fundamental Challenges”2.1 Challenge #1: Latency
Section titled “2.1 Challenge #1: Latency”LATENCY: THE SPEED OF LIGHT PROBLEM═══════════════════════════════════════════════════════════════
In a single machine: Function call: ~1 nanosecond Memory access: ~100 nanoseconds
Across a network: Same datacenter: ~500 microseconds (500,000 ns) Cross-country: ~30 milliseconds (30,000,000 ns) Cross-ocean: ~100 milliseconds (100,000,000 ns)
IMPACT─────────────────────────────────────────────────────────────┌────────────────────────────────────────────────────────────┐│ ││ Local call: 1 ns ││ Network call: 1,000,000 ns (1 ms) ││ ││ That's 1,000,000x slower! ││ ││ Design that works locally can be disastrous distributed. ││ ││ Example: Loop with 100 database calls ││ Local: 100 × 1ns = 100ns ││ Network: 100 × 1ms = 100ms (user notices!) ││ │└────────────────────────────────────────────────────────────┘
LATENCY NUMBERS EVERY PROGRAMMER SHOULD KNOW─────────────────────────────────────────────────────────────L1 cache reference: 0.5 nsL2 cache reference: 7 nsMain memory reference: 100 nsSSD random read: 16,000 ns (16 μs)Network round trip (same DC): 500,000 ns (500 μs)Disk seek: 2,000,000 ns (2 ms)Network round trip (US→EU): 100,000,000 ns (100 ms)2.2 Challenge #2: Partial Failure
Section titled “2.2 Challenge #2: Partial Failure”PARTIAL FAILURE: THE UNRELIABILITY PROBLEM═══════════════════════════════════════════════════════════════
In a single machine: Either the whole thing works, or the whole thing crashes. Failure is total and obvious.
In a distributed system: Part can fail while other parts continue. Failure can be partial, intermittent, or undetectable.
FAILURE MODES─────────────────────────────────────────────────────────────┌────────────────────────────────────────────────────────────┐│ ││ CRASH FAILURE ││ Node stops responding entirely. ││ At least you know it's dead. ││ ││ OMISSION FAILURE ││ Messages lost. Did the server get my request? ││ Did it respond but I didn't receive? ││ ││ TIMING FAILURE ││ Response came, but too slow. Did it timeout? ││ Did it complete? Is it still running? ││ ││ BYZANTINE FAILURE ││ Component lies or behaves maliciously. ││ Hardest to handle. Usually assume this won't happen. ││ │└────────────────────────────────────────────────────────────┘
THE FUNDAMENTAL UNCERTAINTY─────────────────────────────────────────────────────────────You send a request. No response. What happened?
1. Request was lost in transit? 2. Server crashed before processing? 3. Server crashed after processing? 4. Response was lost in transit? 5. Server is just slow?
You cannot distinguish these cases from the client side.This is fundamental. No protocol can solve it.2.3 Challenge #3: No Global Clock
Section titled “2.3 Challenge #3: No Global Clock”NO GLOBAL CLOCK: THE ORDERING PROBLEM═══════════════════════════════════════════════════════════════
In a single machine: Events have a clear order. Thread A did X, then Thread B did Y. System clock is the arbiter.
In a distributed system: No shared clock. Each machine has its own. Clocks drift. Even GPS-synced clocks differ by milliseconds. "Before" and "after" become fuzzy concepts.
THE ORDERING PROBLEM───────────────────────────────────────────────────────────── Server A Server B │ │ 10:00:00.001 │ ─── Request ───────▶│ 10:00:00.003 │ │ 10:00:00.005 │◀─── Response ───── │ 10:00:00.004 │ │
Which happened first? The request or the response?
Server A says: Request at 10:00:00.001, response at 10:00:00.005Server B says: Request at 10:00:00.003, response at 10:00:00.004
Server B's clock is 2ms ahead. The timestamps are misleading.
CONSEQUENCES─────────────────────────────────────────────────────────────- Can't use timestamps to order events across machines- "Last write wins" requires agreeing on what "last" means- Debugging is hard: logs from different servers don't align- Need logical clocks (Lamport clocks, vector clocks) for orderingWar Story: The $12 Million Clock Skew Incident
March 2018. A high-frequency trading firm discovers they’ve been losing money for months—not from bad trades, but from bad timestamps.
The firm operated trading servers in both New Jersey and Chicago. The servers used NTP to synchronize clocks, but NTP only guarantees accuracy within tens of milliseconds. The clocks had drifted 47ms apart.
When a trade executed in Chicago triggered a hedge trade in New Jersey, the timestamps sometimes showed the hedge happening before the original trade. The reconciliation system, which used timestamps to order events, processed them wrong. Sometimes hedges were cancelled as “duplicate trades.” Sometimes positions were calculated incorrectly.
Timeline of discovery:
- Day 1: Risk analyst notices P&L discrepancies in overnight reports
- Week 2: Engineering traces problem to event ordering logic
- Week 3: Clock drift identified as root cause
- Week 4: Logical clocks implemented (Lamport timestamps)
- Month 2: Full audit reveals $12 million in losses over 6 months
The fix: The team replaced wall-clock ordering with Lamport timestamps. Every event now carries a logical clock value that increments with each operation. When server A sends a message to server B, it includes its clock value. Server B sets its clock to max(its_clock, received_clock) + 1. This guarantees that caused events always have higher timestamps than their causes—regardless of what the physical clocks say.
The lesson: In distributed systems, “when” something happened is less important than “what caused what.” Physical time is unreliable. Causal ordering is essential.
Part 3: The CAP Theorem
Section titled “Part 3: The CAP Theorem”3.1 Understanding CAP
Section titled “3.1 Understanding CAP”THE CAP THEOREM═══════════════════════════════════════════════════════════════
In a distributed system, you can have at most TWO of:
C - CONSISTENCY Every read receives the most recent write. All nodes see the same data at the same time.
A - AVAILABILITY Every request receives a response. System is always operational.
P - PARTITION TOLERANCE System continues operating despite network partitions. Messages between nodes can be lost or delayed.
THE CATCH─────────────────────────────────────────────────────────────Network partitions WILL happen. You don't get to choose.
So really, during a partition you choose between: CP: Consistent but unavailable during partition AP: Available but possibly inconsistent during partition
┌─────────────────────────────────────────────────────────────┐│ ││ Normal operation: You can have all three! ││ ││ During partition: Choose C or A ││ ││ CP: Reject requests to maintain consistency ││ AP: Accept requests, reconcile later ││ │└─────────────────────────────────────────────────────────────┘3.2 CAP in Practice
Section titled “3.2 CAP in Practice”CAP TRADE-OFFS IN REAL SYSTEMS═══════════════════════════════════════════════════════════════
CP SYSTEMS (Consistent, Partition-tolerant)─────────────────────────────────────────────────────────────During partition: Refuse some requests to maintain consistency
Examples: - Traditional RDBMS with synchronous replication - ZooKeeper - etcd (Kubernetes' brain) - MongoDB with majority write concern
Use when: Correctness matters more than availability - Financial transactions - Inventory systems - Coordination services
AP SYSTEMS (Available, Partition-tolerant)─────────────────────────────────────────────────────────────During partition: Accept requests, data may diverge
Examples: - Cassandra - DynamoDB (eventually consistent reads) - DNS - Caches
Use when: Availability matters more than immediate consistency - Shopping carts - Social media feeds - Metrics/logging
REALITY CHECK─────────────────────────────────────────────────────────────Most systems need BOTH consistency AND availability.The choice is: which to sacrifice DURING a partition?
Partitions are rare but not rare enough to ignore.Design for partition. Choose your trade-off deliberately.3.3 Beyond CAP: PACELC
Section titled “3.3 Beyond CAP: PACELC”PACELC: A MORE COMPLETE MODEL═══════════════════════════════════════════════════════════════
CAP only talks about partitions. But what about normal operation?
PACELC: If Partition, then Availability vs Consistency, Else, Latency vs Consistency
┌─────────────────────────────────────────────────────────────┐│ ││ During Partition (P): ││ Choose A (availability) or C (consistency) ││ ││ Else (normal operation): ││ Choose L (latency) or C (consistency) ││ ││ PA/EL: Available during partition, low latency normally ││ (Cassandra, DynamoDB) ││ ││ PC/EC: Consistent during partition, consistent normally ││ (Traditional RDBMS, etcd) ││ ││ PA/EC: Available during partition, consistent normally ││ (Some configs of MongoDB) ││ │└─────────────────────────────────────────────────────────────┘
This model acknowledges that even in normal operation,you trade latency for consistency (synchronous replication)or consistency for latency (asynchronous replication).Try This (3 minutes)
For each scenario, which trade-off makes sense?
Scenario During Partition Why Bank account balance CP Can’t show wrong balance Social media likes count AP Kubernetes pod state Shopping cart DNS
Part 4: Distributed System Patterns
Section titled “Part 4: Distributed System Patterns”4.1 The Two Generals Problem
Section titled “4.1 The Two Generals Problem”THE TWO GENERALS PROBLEM═══════════════════════════════════════════════════════════════
Setup: Two generals must coordinate an attack. They can only communicate via messengers. Messengers may be captured (lost messages). Both must attack simultaneously or not at all.
The Problem:───────────────────────────────────────────────────────────── General A General B │ │ │ "Attack at dawn" │ │ ─────────messenger──────────────▶ │ │ │ │ "Got it" │ │ ◀────────messenger─────────────── │ │ │ │ But did B receive my ack? │ │ "I got your ack" │ │ ─────────messenger──────────────▶ │ │ │ │ But did A receive... │ │ (infinite regress) │
IMPOSSIBILITY─────────────────────────────────────────────────────────────There is NO protocol that guarantees agreement if messagescan be lost. This is proven mathematically impossible.
IMPLICATIONS FOR DISTRIBUTED SYSTEMS─────────────────────────────────────────────────────────────You cannot guarantee that two nodes agree on anythingif the network is unreliable. The best you can do: - Increase probability (retries, acknowledgments) - Accept uncertainty (eventual consistency) - Use consensus protocols (Paxos, Raft) when possible4.2 The Byzantine Generals Problem
Section titled “4.2 The Byzantine Generals Problem”BYZANTINE GENERALS PROBLEM═══════════════════════════════════════════════════════════════
Same as Two Generals, but worse: - Some generals might be traitors - Traitors send conflicting messages to different generals - Loyal generals must still reach agreement
THE CHALLENGE───────────────────────────────────────────────────────────── General A: "Attack!" General B: "Attack!" (traitor, tells C "Retreat") General C: "Attack!"
C receives: "Attack" from A, "Retreat" from B A receives: "Attack" from B (B lies), "Attack" from C
How can A and C agree despite B's lies?
SOLUTION REQUIREMENTS─────────────────────────────────────────────────────────────To tolerate f Byzantine (lying) nodes, you need 3f + 1 total.
1 traitor → need 4 generals 2 traitors → need 7 generals
This is expensive! Most systems assume no Byzantine failures.
WHERE IT MATTERS─────────────────────────────────────────────────────────────- Blockchain (trustless networks)- Safety-critical systems (aviation, medical)- Systems where nodes might be compromised
Most internal systems assume nodes are honest but buggy.They handle crash failures, not Byzantine failures.4.3 Idempotency
Section titled “4.3 Idempotency”IDEMPOTENCY: SAFE TO RETRY═══════════════════════════════════════════════════════════════
An operation is IDEMPOTENT if doing it multiple timeshas the same effect as doing it once.
WHY IT MATTERS─────────────────────────────────────────────────────────────Request sent. Timeout. Did it succeed?
If NOT idempotent: Dangerous to retry "Transfer $100" → Retry → $200 transferred!
If idempotent: Safe to retry "Set balance to $100" → Retry → Still $100
IDEMPOTENT OPERATIONS─────────────────────────────────────────────────────────────✓ SET x = 5 (result is always 5)✓ DELETE user 123 (can't delete twice)✓ PUT /users/123 (creates or replaces)✗ INCREMENT x (adds more each time)✗ POST /users (creates new each time)✗ TRANSFER $100 (transfers more each time)
MAKING OPERATIONS IDEMPOTENT─────────────────────────────────────────────────────────────Add a unique request ID. Server tracks completed requests.
Client: "Transfer $100, request_id=abc123" Server: Checks if abc123 already processed If yes: Return cached result If no: Process, save result, return
Client timeout → Retry with same abc123 → Safe!Part 5: Kubernetes as a Distributed System
Section titled “Part 5: Kubernetes as a Distributed System”5.1 Kubernetes Architecture
Section titled “5.1 Kubernetes Architecture”KUBERNETES: A DISTRIBUTED SYSTEM═══════════════════════════════════════════════════════════════
CONTROL PLANE─────────────────────────────────────────────────────────────┌─────────────────────────────────────────────────────────────┐│ ││ ┌───────────┐ ┌───────────┐ ┌───────────┐ ││ │ API Server│ │ API Server│ │ API Server│ (HA) ││ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ ││ │ │ │ ││ └──────────────┼──────────────┘ ││ │ ││ ▼ ││ ┌─────────────────┐ ││ │ etcd │ (distributed consensus) ││ │ ┌───┬───┬───┐ │ ││ │ │ 1 │ 2 │ 3 │ │ (Raft protocol) ││ │ └───┴───┴───┘ │ ││ └─────────────────┘ ││ ││ ┌────────────────┐ ┌────────────────┐ ││ │ Controller Mgr │ │ Scheduler │ ││ │ (leader elect)│ │ (leader elect)│ ││ └────────────────┘ └────────────────┘ ││ │└─────────────────────────────────────────────────────────────┘
DATA PLANE (Nodes)─────────────────────────────────────────────────────────────┌────────────┐ ┌────────────┐ ┌────────────┐│ Node 1 │ │ Node 2 │ │ Node 3 ││ ┌──────┐ │ │ ┌──────┐ │ │ ┌──────┐ ││ │kubelet│ │ │ │kubelet│ │ │ │kubelet│ ││ └──────┘ │ │ └──────┘ │ │ └──────┘ ││ ┌──────┐ │ │ ┌──────┐ │ │ ┌──────┐ ││ │kube- │ │ │ │kube- │ │ │ │kube- │ ││ │proxy │ │ │ │proxy │ │ │ │proxy │ ││ └──────┘ │ │ └──────┘ │ │ └──────┘ ││ [pods] │ │ [pods] │ │ [pods] │└────────────┘ └────────────┘ └────────────┘5.2 How Kubernetes Handles Distribution
Section titled “5.2 How Kubernetes Handles Distribution”KUBERNETES DISTRIBUTED PATTERNS═══════════════════════════════════════════════════════════════
CONSENSUS (etcd)─────────────────────────────────────────────────────────────etcd uses Raft consensus for strong consistency.All cluster state lives in etcd.Writes require majority agreement (quorum).
3 etcd nodes: Can lose 1 5 etcd nodes: Can lose 2
LEADER ELECTION─────────────────────────────────────────────────────────────Controller Manager and Scheduler use leader election.Only one active at a time (avoid duplicate work).If leader fails, another takes over.
# Check current leader kubectl get leases -n kube-system
EVENTUAL CONSISTENCY (Controllers)─────────────────────────────────────────────────────────────Controllers reconcile desired vs actual state.They're eventually consistent—changes propagate over time.
You: "I want 3 replicas" Controller: Sees 2, creates 1 Repeat until actual = desired
WATCH PATTERN─────────────────────────────────────────────────────────────Components watch etcd for changes, not poll.Reduces load, provides near-real-time updates.
kubelet watches for pod assignments Controller watches for resource changes5.3 Kubernetes Failure Modes
Section titled “5.3 Kubernetes Failure Modes”KUBERNETES FAILURE SCENARIOS═══════════════════════════════════════════════════════════════
NODE FAILURE─────────────────────────────────────────────────────────────kubelet stops sending heartbeats.Node marked "NotReady" after timeout (default 40s).Pods rescheduled after tolerationSeconds (default 300s).
Timeline: 0s: Node dies 40s: Node marked NotReady 300s: Pods evicted, rescheduled elsewhere
ETCD QUORUM LOSS─────────────────────────────────────────────────────────────If majority of etcd nodes fail, cluster becomes read-only.Existing pods keep running (kubelet cached state).No new pods, no updates until quorum restored.
3-node etcd: Lose 2 → Quorum lost 5-node etcd: Lose 3 → Quorum lost
NETWORK PARTITION─────────────────────────────────────────────────────────────Nodes can't reach control plane.Existing pods keep running (autonomous kubelet).No new scheduling until connectivity restored.
Split-brain scenario: Partition A: Some nodes + some etcd Partition B: Other nodes + other etcd
If neither has etcd quorum: Both read-only Kubernetes chooses consistency over availability (CP)
API SERVER FAILURE─────────────────────────────────────────────────────────────With multiple API servers, load balancer routes around failure.kubectl commands may fail temporarily.Pods keep running (kubelet operates independently).Did You Know?
Section titled “Did You Know?”-
The speed of light limits distributed systems. A message from New York to London takes at least 28ms—the time light needs to travel through fiber. You can’t engineer around physics.
-
Google’s Spanner uses atomic clocks and GPS to synchronize time across datacenters to within a few milliseconds. It’s one of the few systems that can offer both strong consistency and global distribution.
-
The term “Byzantine fault” comes from the Byzantine Generals Problem, named for the Byzantine Empire’s reputation for complex political intrigue. In distributed systems, a Byzantine node is one that might behave arbitrarily—including lying.
-
Leslie Lamport’s “Time, Clocks, and the Ordering of Events” paper (1978) introduced logical clocks and is one of the most cited papers in computer science. Lamport later won the Turing Award partly for this work. The key insight: you don’t need real time to order events—you just need to track causality (“happened before” relationships).
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| Ignoring network latency | Design works locally, fails distributed | Measure latency, batch requests |
| Assuming networks are reliable | No retry logic, no timeouts | Implement timeouts, retries, circuit breakers |
| Using wall-clock time for ordering | Clock skew causes wrong order | Use logical clocks or version vectors |
| Not designing for partial failure | One failure takes down everything | Bulkheads, graceful degradation |
| Synchronous everything | Latency compounds, cascade failures | Async where possible, queues |
| Ignoring CAP trade-offs | Surprised by behavior during partition | Choose trade-offs explicitly |
-
What are the three fundamental challenges of distributed systems?
Answer
The three fundamental challenges are:
-
Latency: Network communication is millions of times slower than local function calls. Speed of light imposes physical limits.
-
Partial failure: Components can fail independently. You can’t always tell if a remote call succeeded, failed, or is just slow.
-
No global clock: Each machine has its own clock that drifts. You can’t rely on timestamps to order events across machines.
These challenges are fundamental—they can be mitigated but not eliminated. Any distributed system design must account for all three.
-
-
Explain the CAP theorem in your own words.
Answer
The CAP theorem states that during a network partition, a distributed system must choose between consistency and availability:
- Consistency: All nodes see the same data
- Availability: All requests get responses
- Partition tolerance: System handles network splits
Since network partitions are inevitable, the real choice is:
- CP: During partition, refuse some requests to maintain consistency
- AP: During partition, serve requests but data might diverge
In normal operation (no partition), you can have both. The choice matters when things go wrong.
Example: A banking system chooses CP—better to reject a transaction than show wrong balance. A social media site chooses AP—better to show slightly stale like count than fail entirely.
-
Why can’t you use timestamps to order events in a distributed system?
Answer
You can’t rely on timestamps because:
-
Clock drift: Each machine’s clock runs slightly fast or slow. Over time, clocks diverge.
-
No perfect sync: Even with NTP, clocks differ by milliseconds. Events happening close together may get wrong order.
-
Network delay: A message sent “earlier” might arrive “later” than one sent after it.
Example: Server A (clock running fast) timestamps event at 10:00:00.050. Server B (accurate clock) timestamps event at 10:00:00.045. B’s event actually happened after A’s, but has earlier timestamp.
Solutions:
- Logical clocks (Lamport clocks): Track causality, not wall time
- Vector clocks: Track causality per node
- Hybrid logical clocks: Combine physical and logical time
-
-
What makes Kubernetes a distributed system, and how does it handle distribution?
Answer
Kubernetes is distributed because:
- Multiple components (API server, etcd, controllers, kubelets)
- Running on multiple machines
- Coordinating via network messages
- No single point of failure (when properly configured)
How it handles distribution:
-
etcd for consensus: Uses Raft protocol for strong consistency of cluster state. Requires quorum for writes.
-
Leader election: Controller Manager and Scheduler elect leaders to avoid duplicate work.
-
Eventual consistency: Controllers continuously reconcile desired vs actual state. Changes propagate over time.
-
Watch pattern: Components watch etcd for changes instead of polling, reducing load.
-
Autonomous kubelets: Nodes keep running pods even if disconnected from control plane.
Kubernetes chooses CP during partitions—better to refuse changes than have inconsistent state.
-
A service makes 50 sequential database calls per request. Each call takes 2ms over the network. What’s the minimum latency for a user request? How could you improve it?
Answer
Calculation:
- 50 calls × 2ms = 100ms minimum latency
- This doesn’t include processing time, just network round trips
Improvement strategies:
-
Batch calls: Instead of 50 individual calls, combine into fewer queries
- 5 batched calls × 2ms = 10ms (10x improvement)
-
Parallelize: If calls are independent, execute simultaneously
- 50 parallel calls = 2ms (50x improvement)
-
Cache: Store frequently accessed data locally
- Cache hit: 0.1ms vs 2ms network call
-
Denormalize: Store data together to reduce joins/calls
- Single call with all data: 2ms total
The lesson: Network latency compounds. Design minimizes round trips.
-
You send a request to a service and don’t receive a response after 5 seconds. List all possible states the request could be in. Why is this uncertainty fundamental?
Answer
Possible states:
- Request lost in transit - Never reached server
- Server received but crashed before processing - No work done
- Server processing but slow - Will eventually complete
- Server processed successfully, response lost - Work completed!
- Server processed, crashed after responding - Work completed!
- Response delayed in network - Will eventually arrive
Why fundamental:
This uncertainty cannot be eliminated by any protocol. From the client’s perspective, states 2, 3, 4, 5, and 6 look identical: silence.
This is the Two Generals Problem—you cannot achieve certainty about remote state over an unreliable channel.
Practical implications:
- Must design operations to be idempotent (safe to retry)
- Use unique request IDs to detect duplicates
- Implement timeouts and retries with exponential backoff
- Accept that “at-most-once” or “at-least-once” delivery—never “exactly-once”
-
An etcd cluster has 5 nodes. How many can fail while maintaining quorum? If you lose quorum, what happens to the Kubernetes cluster?
Answer
Quorum calculation:
- Quorum requires majority: floor(n/2) + 1
- For 5 nodes: floor(5/2) + 1 = 3 nodes required
- Can lose 2 nodes and still have quorum (3 remaining)
If quorum is lost (3+ nodes fail):
- etcd becomes read-only - No writes accepted
- No new pods can be scheduled - Scheduler can’t write
- No deployments can update - Controllers can’t write
- Existing pods keep running - kubelets have cached state
- kubectl get works - Reads still succeed
- kubectl apply fails - Writes rejected
Recovery:
- Restore failed nodes to regain quorum
- Or restore from etcd backup to new cluster
- Never reduce below 3 nodes (can’t lose any)
Best practice: Use 5 nodes for production (tolerates 2 failures), 3 for smaller clusters (tolerates 1 failure).
-
A social media platform shows like counts that update every 30 seconds. During a network partition between US and EU datacenters, a post gets 1000 likes in US and 500 likes in EU. When the partition heals, what should the count be? What CAP trade-off did this system make?
Answer
The count should be 1500 (1000 + 500)
This requires a CRDT (Conflict-free Replicated Data Type)—specifically a G-Counter (grow-only counter) that merges by summing per-replica increments.
CAP trade-off:
- System chose AP (Availability over Consistency)
- During partition: Both regions accepted writes (available)
- Result: Temporary inconsistency (US saw 1000, EU saw 500)
- After partition: Merge brings eventual consistency
Why this is appropriate for likes:
- Temporary wrong count is acceptable
- Users don’t expect real-time accuracy
- Unavailability (can’t like) would be worse than inconsistency
Contrast with banking:
- Bank would choose CP
- During partition: Reject transactions
- Why: Incorrect balance is unacceptable
- Unavailability is better than double-spending
The pattern: Choose AP for data where “close enough” is acceptable, CP for data where correctness is critical.
Hands-On Exercise
Section titled “Hands-On Exercise”Task: Explore distributed system behavior in Kubernetes.
Part 1: Observe Latency (10 minutes)
Run a pod and measure API server latency:
# Create a test podkubectl run latency-test --image=busybox --command -- sleep 3600
# Time various operationstime kubectl get pod latency-testtime kubectl get pod latency-test -o yamltime kubectl describe pod latency-test
# Compare to local operationstime cat /etc/hostnameRecord your findings:
| Operation | Time | Notes |
|---|---|---|
| kubectl get pod | ||
| kubectl describe | ||
| Local file read |
Part 2: Observe Eventual Consistency (10 minutes)
Watch a deployment scale:
# Create deploymentkubectl create deployment scale-test --image=nginx --replicas=1
# In terminal 1: Watch podskubectl get pods -w
# In terminal 2: Scale up rapidlykubectl scale deployment scale-test --replicas=10
# Observe:# - How long until all pods are Running?# - Do pods appear in order?# - What intermediate states do you see?Part 3: Simulate Partial Failure (10 minutes)
If using kind or minikube with multiple nodes:
# List nodeskubectl get nodes
# Cordon a node (no new pods)kubectl cordon <node-name>
# Watch what happens to podskubectl get pods -o wide -w
# Uncordonkubectl uncordon <node-name>Success Criteria:
- Measured API server latency vs local operations
- Observed eventual consistency during scaling
- Understood intermediate states during changes
- (Optional) Observed behavior during node failure
Further Reading
Section titled “Further Reading”-
“Designing Data-Intensive Applications” - Martin Kleppmann. The definitive guide to distributed systems concepts. Chapter 8 covers these fundamentals.
-
“Distributed Systems for Fun and Profit” - Mikito Takada. Free online book covering distributed systems fundamentals.
-
“Time, Clocks, and the Ordering of Events in a Distributed System” - Leslie Lamport. The foundational paper on logical clocks.
Key Takeaways
Section titled “Key Takeaways”Before moving on, ensure you understand:
- The three fundamental challenges: Latency (network is slow), partial failure (parts fail independently), no global clock (can’t trust timestamps)
- CAP theorem reality: During a network partition, you choose consistency OR availability—not both. Most systems need both, so you choose which to sacrifice during failures
- PACELC extension: Even without partitions, you trade latency for consistency. Synchronous replication = consistent but slow. Async = fast but potentially stale
- The Two Generals Problem: You cannot guarantee agreement over unreliable networks. This is mathematically proven impossible
- Idempotency is essential: Design operations to be safe to retry. Use unique request IDs to detect duplicates
- Kubernetes is CP: It chooses consistency over availability during partitions. etcd requires quorum; without it, the cluster becomes read-only
- Logical clocks > wall clocks: Don’t use timestamps to order events. Use Lamport clocks or vector clocks to track causality
- Distribution is a spectrum: A web server + database is distributed. Understanding where your system falls helps choose appropriate patterns
Next Module
Section titled “Next Module”Module 5.2: Consensus and Coordination - How distributed systems agree on anything, and why it’s so hard.