Module 5.2: Consensus and Coordination
Цей контент ще не доступний вашою мовою.
Complexity:
[COMPLEX]Time to Complete: 35-40 minutes
Prerequisites: Module 5.1: What Makes Systems Distributed
Track: Foundations
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Explain how Raft and Paxos achieve consensus and why the FLP impossibility theorem constrains all consensus protocols
- Evaluate consensus-based systems (etcd, ZooKeeper, Consul) by analyzing their quorum requirements, failure tolerance, and split-brain prevention
- Design coordination patterns (leader election, distributed locks, barrier synchronization) appropriate for different consistency requirements
- Diagnose consensus failures by analyzing quorum loss scenarios, network partitions, and leader election storms in production systems
November 24, 2014. A routine database migration at a major financial services company triggers one of the most expensive consensus failures in banking history.
The company operated a distributed trading system with five database nodes using Paxos-based replication. During the migration, a network misconfiguration caused three nodes to become unreachable from each other—but each could still reach some of the remaining two nodes. The Paxos implementation had a subtle bug: under this specific partition pattern, two different nodes each believed they had achieved quorum.
For 47 minutes, the trading system had two leaders accepting conflicting writes. One leader processed $127 million in buy orders. The other processed $89 million in sell orders for the same securities. When the partition healed and the nodes attempted to reconcile, the conflict was irreconcilable—the audit log showed trades that couldn’t have both happened.
The total cost: $23 million in immediate losses from voided trades, $31 million in regulatory fines for order book integrity violations, and $180 million in customer lawsuit settlements. The root cause wasn’t the network—it was a consensus algorithm implementation that hadn’t been tested against byzantine partition scenarios.
This module teaches consensus: how distributed systems reach agreement, why it’s deceptively hard, and why getting it wrong can cost more than most companies earn in a year.
Why This Module Matters
Section titled “Why This Module Matters”How do you get multiple computers to agree on something? It sounds simple—until network partitions split them, messages get lost, and nodes crash mid-decision. Yet agreement is essential: which node is the leader? Is this transaction committed? What’s the current configuration?
Consensus is the foundation of reliable distributed systems. Without it, you can’t have consistent replicated databases, reliable leader election, or fault-tolerant coordination. Understanding consensus helps you choose the right tools and understand their limitations.
This module explores how distributed systems reach agreement—the algorithms, the trade-offs, and where you’ll encounter consensus in practice.
The Committee Analogy
Imagine a committee that must vote on decisions, but members are in different cities and can only communicate by mail. Letters get lost. Some members don’t respond. The committee must still make decisions. How do they ensure everyone agrees on what was decided? This is the consensus problem—made harder because there’s no chairperson everyone trusts.
What You’ll Learn
Section titled “What You’ll Learn”- What consensus means and why it’s hard
- Key consensus algorithms (Paxos, Raft)
- How leader election works
- Distributed locks and coordination
- How etcd and ZooKeeper implement consensus
- When you need consensus (and when you don’t)
Part 1: The Consensus Problem
Section titled “Part 1: The Consensus Problem”1.1 What is Consensus?
Section titled “1.1 What is Consensus?”CONSENSUS DEFINITION═══════════════════════════════════════════════════════════════
Consensus: Getting multiple nodes to agree on a single value.
REQUIREMENTS─────────────────────────────────────────────────────────────1. AGREEMENT: All non-faulty nodes decide on the same value2. VALIDITY: The decided value was proposed by some node3. TERMINATION: All non-faulty nodes eventually decide
SOUNDS SIMPLE, BUT...─────────────────────────────────────────────────────────────┌────────────────────────────────────────────────────────────┐│ ││ Node A proposes "X" Node B proposes "Y" ││ │ │ ││ └─────────┬───────────────┘ ││ │ ││ ▼ ││ ┌──────────────┐ ││ │ Network │ ││ │ (unreliable)│ ││ └──────────────┘ ││ │ ││ ┌─────────┴───────────┐ ││ │ │ ││ ▼ ▼ ││ Node A Node B ││ decides ? decides ? ││ ││ What if A doesn't hear from B? ││ What if B crashes mid-decision? ││ What if the network partitions? ││ │└────────────────────────────────────────────────────────────┘1.2 Why Consensus is Hard
Section titled “1.2 Why Consensus is Hard”THE FLP IMPOSSIBILITY RESULT═══════════════════════════════════════════════════════════════
Fischer, Lynch, and Paterson proved (1985):
In an asynchronous system where even ONE node might crash, there is NO algorithm that guarantees consensus.
ASYNCHRONOUS = No timing assumptions - Messages can take arbitrarily long - You can't tell if a node crashed or is just slow
THE PROBLEM─────────────────────────────────────────────────────────────You're waiting for Node B's vote. No response.
Option 1: Wait forever Problem: If B crashed, you never decide (no termination)
Option 2: Proceed without B Problem: B might be alive and decide differently (no agreement)
Option 3: Use timeouts Problem: You might timeout a live node, or wait for a dead one
There's no perfect solution. Every algorithm makes trade-offs.
PRACTICAL IMPLICATIONS─────────────────────────────────────────────────────────────FLP says: Can't guarantee consensus in all cases.Reality: Consensus is highly probable with good algorithms.
Algorithms like Paxos and Raft work in practice because:- True asynchrony is rare (most messages arrive quickly)- Random backoff prevents live-lock- Timing assumptions usually hold1.3 Consensus Use Cases
Section titled “1.3 Consensus Use Cases”WHERE YOU NEED CONSENSUS═══════════════════════════════════════════════════════════════
LEADER ELECTION─────────────────────────────────────────────────────────────"Who is the leader?"
Only one node should be leader at a time.All nodes must agree on who it is.If leader fails, elect a new one.
Examples: - Kubernetes controller-manager - Database primary - Message queue broker
DISTRIBUTED LOCKS─────────────────────────────────────────────────────────────"Who holds the lock?"
Only one client can hold a lock at a time.All nodes must agree on current holder.If holder crashes, lock must be released.
Examples: - Preventing duplicate processing - Coordinating batch jobs - Resource allocation
REPLICATED STATE MACHINES─────────────────────────────────────────────────────────────"What is the current state?"
All replicas apply the same operations in the same order.Must agree on operation ordering.
Examples: - etcd (Kubernetes configuration) - Replicated databases - Configuration management
ATOMIC COMMIT─────────────────────────────────────────────────────────────"Should this transaction commit?"
All participants must agree: commit or abort.Can't have some commit and some abort.
Examples: - Distributed transactions - Two-phase commit - Saga coordinationTry This (2 minutes)
Think of systems you use. Where is consensus happening?
System Consensus For What if it Fails? Kubernetes Leader election, etcd
Part 2: Consensus Algorithms
Section titled “Part 2: Consensus Algorithms”2.1 Paxos: The Original
Section titled “2.1 Paxos: The Original”PAXOS (Leslie Lamport, 1989)═══════════════════════════════════════════════════════════════
The first proven consensus algorithm.Famous for being difficult to understand.Basis for many production systems.
ROLES─────────────────────────────────────────────────────────────PROPOSERS: Suggest values (can be multiple)ACCEPTORS: Vote on values (majority must agree)LEARNERS: Learn the decided value
BASIC PAXOS (Single Value)─────────────────────────────────────────────────────────────Phase 1: PREPARE Proposer → Acceptors: "Prepare proposal number N" Acceptors → Proposer: "Promise to not accept < N" (plus any already-accepted value)
Phase 2: ACCEPT If majority promised: Proposer → Acceptors: "Accept value V with number N" Acceptors → Learners: "Accepted V"
If majority accept: Consensus reached!
┌────────────────────────────────────────────────────────────┐│ ││ Proposer Acceptors Learner ││ │ │ │ │ │ ││ │──Prepare(1)─────▶│ │ │ │ ││ │ │ │ │ │ ││ │◀──Promise(1)─────│ │ │ │ ││ │◀──Promise(1)────────│ │ │ ││ │◀──Promise(1)───────────│ │ ││ │ │ │ │ │ ││ │──Accept(1,"X")──▶│ │ │ │ ││ │ │──────Accepted(1,"X")───▶│ ││ │ │───Accepted(1,"X")───▶│ ││ │ │─Accepted(1,"X")──▶│ ││ │ │ │ │ │ ││ │ │ │ Consensus: "X" ││ │└────────────────────────────────────────────────────────────┘
WHY IT'S COMPLEX─────────────────────────────────────────────────────────────- Multiple proposers can conflict- Must handle old proposals- Single-decree Paxos decides ONE value- Multi-Paxos for sequences (even more complex)2.2 Raft: Understandable Consensus
Section titled “2.2 Raft: Understandable Consensus”RAFT (Diego Ongaro, 2014)═══════════════════════════════════════════════════════════════
Designed for understandability.Equivalent to Paxos but easier to implement.Used by etcd, Consul, CockroachDB.
KEY INSIGHT─────────────────────────────────────────────────────────────Instead of symmetric nodes, use a leader.Leader orders all decisions.Consensus becomes: "elect leader" + "follow leader"
THREE SUB-PROBLEMS─────────────────────────────────────────────────────────────1. LEADER ELECTION: Choose a leader2. LOG REPLICATION: Leader replicates log to followers3. SAFETY: Ensure consistency despite failures
NODE STATES─────────────────────────────────────────────────────────────┌────────────────────────────────────────────────────────────┐│ ││ ┌──────────┐ ││ timeout │ │ receives votes ││ ┌─────────│ Candidate│──────────┐ ││ │ │ │ │ ││ │ └────▲─────┘ │ ││ │ │ │ ││ │ timeout │ │ ││ ▼ │ ▼ ││ ┌──────────┐ │ ┌──────────┐ ││ │ Follower │────────┘ │ Leader │ ││ │ │◀─────────────────│ │ ││ └──────────┘ discovers └──────────┘ ││ higher term ││ ││ Start: All nodes are Followers ││ Timeout: Follower becomes Candidate, requests votes ││ Majority: Candidate becomes Leader ││ Failure: Leader times out, new election ││ │└────────────────────────────────────────────────────────────┘2.3 Raft Deep Dive
Section titled “2.3 Raft Deep Dive”RAFT: LEADER ELECTION═══════════════════════════════════════════════════════════════
TERMS─────────────────────────────────────────────────────────────Time divided into terms (epochs).Each term has at most one leader.Term number increases monotonically.
Term 1: Node A is leader Term 2: Node A fails, Node B elected Term 3: Node B fails, Node C elected
ELECTION PROCESS─────────────────────────────────────────────────────────────1. Follower doesn't hear from leader (timeout)2. Increments term, becomes candidate3. Votes for itself, requests votes from others4. Others vote if: - Haven't voted this term - Candidate's log is at least as up-to-date5. Majority votes → becomes leader6. No majority → timeout, new election with random delay
SPLIT VOTE PREVENTION─────────────────────────────────────────────────────────────Random election timeouts (150-300ms).Unlikely two candidates timeout simultaneously.If they do, random backoff ensures one wins next round.
RAFT: LOG REPLICATION═══════════════════════════════════════════════════════════════
Leader receives client requests.Appends to local log.Replicates to followers.Once majority acknowledge, entry is "committed."Leader notifies followers of commit.
Client Leader Followers │ │ │ │ │ │──Write "X"────────▶│ │ │ │ │ │──Append "X"───────▶│ │ │ │ │ │ │ │ │ │◀──ACK──────────────│ │ │ │ │◀──ACK────────────────│ │ │ │ (majority) │ │ │ │ │ │ │ │ │ COMMITTED! │ │ │ │ │ │ │ │ │◀──Success──────────│──Commit notify────▶│ │ │ │ │ │ │ │
LOG CONSISTENCY─────────────────────────────────────────────────────────────Leader's log is authoritative.Followers must match leader.If mismatch, leader sends earlier entries until sync.War Story: The $4.7 Million etcd Split-Brain
June 2019. A fintech startup’s Kubernetes cluster loses $4.7 million in a single weekend due to an etcd misconfiguration.
The company ran a payment processing platform on Kubernetes. Their 3-node etcd cluster sat in a single availability zone—violating every high-availability best practice. When the network switch serving Node A failed, the cluster split: Node A was isolated, while Nodes B and C remained connected.
The timeline of disaster:
- Friday 6:47 PM: Network switch fails, partitioning Node A
- Friday 6:47 PM: Nodes B and C detect missing heartbeat, start election
- Friday 6:48 PM: Node B wins election with term 42 (majority: B+C = 2/3)
- Friday 6:48 PM - Sunday 2:00 AM: System operates normally on B+C
- Friday 6:47 PM - Sunday 2:00 AM: Node A continues receiving writes from misconfigured clients
The critical failure: Some microservices had been configured with Node A’s IP directly, bypassing the load balancer. These services kept writing to Node A, which accepted writes despite having no quorum—the etcd version had a bug where stale leaders accepted reads but not writes, except through a deprecated API the microservices used.
When the network healed Sunday morning:
- Node A rejoined with term 41 (stale)
- Node A’s 33 hours of writes were rejected—term 42 > term 41
- 147,000 payment records existed only on Node A
- Node A’s data was overwritten by B+C’s authoritative log
The cost:
- $3.1 million in customer refunds for lost payment confirmations
- $1.2 million in emergency engineering (weekend rates, consultants)
- $400,000 in regulatory penalties for payment processing failures
The fix: The company moved to 5-node etcd across 3 availability zones, enforced all traffic through a load balancer with health checks, and implemented etcd endpoint monitoring that alerts on quorum loss within 30 seconds.
Part 3: Leader Election
Section titled “Part 3: Leader Election”3.1 Why Leaders?
Section titled “3.1 Why Leaders?”WHY USE LEADERS?═══════════════════════════════════════════════════════════════
LEADERLESS (all nodes equal)───────────────────────────────────────────────────────────── Every request needs coordination Complex conflict resolution Higher latency (wait for quorum) No single point of failure
LEADER-BASED───────────────────────────────────────────────────────────── Leader orders all operations Simple decision making Lower latency (leader decides alone) Must handle leader failure
COMPARISON─────────────────────────────────────────────────────────────┌──────────────────┬────────────────┬─────────────────┐│ │ Leaderless │ Leader-based │├──────────────────┼────────────────┼─────────────────┤│ Writes │ Any node │ Leader only ││ Coordination │ Every write │ Leader election ││ Latency │ Higher │ Lower ││ Availability │ Higher │ Lower (election)││ Complexity │ Complex reads │ Complex failover││ Examples │ Cassandra │ etcd, ZooKeeper │└──────────────────┴────────────────┴─────────────────┘3.2 Leader Election Mechanisms
Section titled “3.2 Leader Election Mechanisms”LEADER ELECTION APPROACHES═══════════════════════════════════════════════════════════════
BULLY ALGORITHM─────────────────────────────────────────────────────────────Highest ID wins. Simple but not partition-tolerant.
Node 1 (ID=1): "I want to be leader" Node 2 (ID=2): "I have higher ID, step aside" Node 3 (ID=3): "I have highest ID, I'm leader"
CONSENSUS-BASED (Raft/Paxos)─────────────────────────────────────────────────────────────Nodes vote. Majority wins. Partition-tolerant.
- Requires quorum for election - Leader has "lease" (term) - New election on leader failure
LEASE-BASED─────────────────────────────────────────────────────────────Leader holds time-limited lease. Must renew.
Leader acquires lease (e.g., 15 seconds) Leader renews every 5 seconds If leader crashes, lease expires Others can acquire after expiry
# Kubernetes leader election uses leases kubectl get leases -n kube-system
EXTERNAL COORDINATION─────────────────────────────────────────────────────────────Use external system (etcd, ZooKeeper) for coordination.
Component → etcd: "I'm leader" (with lease) etcd: "OK, you're leader until lease expires" Other components: Watch etcd for current leader3.3 Kubernetes Leader Election
Section titled “3.3 Kubernetes Leader Election”KUBERNETES LEADER ELECTION═══════════════════════════════════════════════════════════════
HOW IT WORKS─────────────────────────────────────────────────────────────Uses Lease objects in etcd.Leader creates/renews lease.Others watch lease, take over if expired.
EXAMPLE: CONTROLLER-MANAGER─────────────────────────────────────────────────────────────# View current leaderkubectl get lease kube-controller-manager -n kube-system -o yaml
apiVersion: coordination.k8s.io/v1kind: Leasemetadata: name: kube-controller-manager namespace: kube-systemspec: holderIdentity: master-1_abc123 # Current leader leaseDurationSeconds: 15 # Lease validity renewTime: "2024-01-15T10:30:00Z" # Last renewal
IMPLEMENTATION FOR YOUR APPS─────────────────────────────────────────────────────────────# Using client-go leader election
import ( "k8s.io/client-go/tools/leaderelection")
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{ Lock: &resourcelock.LeaseLock{ LeaseMeta: metav1.ObjectMeta{ Name: "my-app-leader", Namespace: "default", }, }, LeaseDuration: 15 * time.Second, RenewDeadline: 10 * time.Second, RetryPeriod: 2 * time.Second, Callbacks: leaderelection.LeaderCallbacks{ OnStartedLeading: func(ctx context.Context) { // I'm the leader, do leader work }, OnStoppedLeading: func() { // I'm no longer leader }, },})Part 4: Distributed Locks and Coordination
Section titled “Part 4: Distributed Locks and Coordination”4.1 Distributed Locks
Section titled “4.1 Distributed Locks”DISTRIBUTED LOCKS═══════════════════════════════════════════════════════════════
PURPOSE─────────────────────────────────────────────────────────────Ensure only one process does something at a time.Coordinate access to shared resources.
LOCAL LOCK (single machine)───────────────────────────────────────────────────────────── mutex.Lock() // Critical section mutex.Unlock()
Simple. Process crashes → OS releases lock.
DISTRIBUTED LOCK (multiple machines)───────────────────────────────────────────────────────────── // Acquire lock from coordination service lock.Acquire("resource-x") // Critical section lock.Release("resource-x")
Complex. Process crashes → Who releases lock?
THE PROBLEM WITH DISTRIBUTED LOCKS─────────────────────────────────────────────────────────────┌────────────────────────────────────────────────────────────┐│ ││ Client A acquires lock ││ │ ││ ├──── Long GC pause ────┐ ││ │ │ ││ │ Lock expires │ ││ │ │ ││ │ Client B acquires │ ││ │ lock │ ││ │ │ ││ │◀── GC pause ends ─────┘ ││ │ ││ │ Client A thinks it still has lock! ││ │ Both A and B in critical section! ││ │└────────────────────────────────────────────────────────────┘
SOLUTION: FENCING TOKENS─────────────────────────────────────────────────────────────Lock server issues incrementing token with each acquisition.Resource checks token, rejects stale tokens.
Client A gets lock with token 33 Client A pauses Lock expires, Client B gets lock with token 34 Client A wakes, tries to write with token 33 Resource rejects: 33 < 34 (stale)4.2 Coordination Patterns
Section titled “4.2 Coordination Patterns”COORDINATION PATTERNS═══════════════════════════════════════════════════════════════
DISTRIBUTED QUEUE─────────────────────────────────────────────────────────────Multiple workers, one task at a time.
/tasks/task-001 → Worker A claims /tasks/task-002 → Worker B claims /tasks/task-003 → Worker C claims
Workers watch for new tasks, claim by creating ephemeral node.
BARRIER (RENDEZVOUS)─────────────────────────────────────────────────────────────Wait until N nodes are ready, then proceed.
Worker 1: Create /barrier/worker-1 Worker 2: Create /barrier/worker-2 Worker 3: Create /barrier/worker-3
All watch /barrier. When count = N, all proceed.
Use case: Coordinated restart, batch processing start.
SERVICE DISCOVERY─────────────────────────────────────────────────────────────Services register, clients find them.
Service A: Create /services/api/instance-1 (ephemeral) Service A: Create /services/api/instance-2 (ephemeral)
Client: List /services/api → [instance-1, instance-2]
If service crashes, ephemeral node deleted automatically.
CONFIGURATION DISTRIBUTION─────────────────────────────────────────────────────────────Central config, all nodes watch.
Admin: Write /config/feature-flags = {"new-ui": true} All nodes: Watch /config/feature-flags Change detected → All nodes update simultaneously4.3 etcd and ZooKeeper
Section titled “4.3 etcd and ZooKeeper”etcd vs ZooKeeper═══════════════════════════════════════════════════════════════
SIMILARITIES─────────────────────────────────────────────────────────────Both provide:- Distributed key-value store- Strong consistency (linearizable)- Watch mechanism (change notifications)- TTL/leases (automatic expiration)- Used for coordination, not data storage
DIFFERENCES─────────────────────────────────────────────────────────────┌─────────────────┬────────────────────┬────────────────────┐│ │ etcd │ ZooKeeper │├─────────────────┼────────────────────┼────────────────────┤│ Protocol │ gRPC │ Custom binary ││ Consensus │ Raft │ Zab (Paxos-like) ││ Data model │ Flat key-value │ Hierarchical (tree)││ API │ Simple KV │ ZNodes (like files)││ Watches │ Efficient (stream) │ One-time triggers ││ Typical use │ Kubernetes │ Kafka, Hadoop ││ Language │ Go │ Java │└─────────────────┴────────────────────┴────────────────────┘
etcd EXAMPLE─────────────────────────────────────────────────────────────# Set a keyetcdctl put /myapp/config '{"version": 2}'
# Get a keyetcdctl get /myapp/config
# Watch for changesetcdctl watch /myapp/config
# Set with TTL (lease)etcdctl lease grant 60etcdctl put /myapp/leader "node-1" --lease=<lease-id>
ZOOKEEPER EXAMPLE─────────────────────────────────────────────────────────────# Create a znodecreate /myapp/config '{"version": 2}'
# Get a znodeget /myapp/config
# Watch (one-time)get /myapp/config -w
# Ephemeral node (deleted when session ends)create -e /myapp/leader "node-1"Part 5: When to Use Consensus
Section titled “Part 5: When to Use Consensus”5.1 Consensus is Expensive
Section titled “5.1 Consensus is Expensive”THE COST OF CONSENSUS═══════════════════════════════════════════════════════════════
LATENCY─────────────────────────────────────────────────────────────Every write requires: 1. Client → Leader 2. Leader → Followers (parallel) 3. Followers → Leader (acknowledgments) 4. Leader → Client (commit confirmation)
Minimum: 2 round trips With geographic distribution: 100s of milliseconds
THROUGHPUT─────────────────────────────────────────────────────────────All writes go through leader.Leader is bottleneck.Can't horizontally scale writes.
Single leader: ~10,000-50,000 writes/second typical Compare to Redis: ~100,000+ writes/second (no consensus)
AVAILABILITY─────────────────────────────────────────────────────────────Requires quorum (majority).3 nodes: 1 can fail5 nodes: 2 can fail7 nodes: 3 can fail
More nodes = better fault tolerance More nodes = slower consensus (more coordination)
COMPLEXITY─────────────────────────────────────────────────────────────Consensus algorithms are hard to implement correctly.Subtle bugs can cause data loss.Use battle-tested implementations (etcd, ZooKeeper).5.2 When You Need Consensus
Section titled “5.2 When You Need Consensus”YOU NEED CONSENSUS WHEN═══════════════════════════════════════════════════════════════
✓ LEADER ELECTION Only one leader at a time, all must agree. Alternative: Live with multiple (might cause duplicates)
✓ DISTRIBUTED LOCKS (if correctness matters) Only one holder, must be certain. Alternative: Optimistic locking with conflicts
✓ CONFIGURATION CHANGES All nodes must see same config. Alternative: Eventual propagation (brief inconsistency)
✓ TRANSACTION COMMIT All participants agree: commit or abort. Alternative: Sagas (compensating transactions)
✓ TOTAL ORDERING All nodes process operations in same order. Alternative: Partial ordering or eventual consistency
YOU PROBABLY DON'T NEED CONSENSUS WHEN═══════════════════════════════════════════════════════════════
✗ CACHING Stale data is acceptable. Use TTLs instead.
✗ METRICS/LOGGING Approximate counts are fine. Eventual consistency is enough.
✗ USER PREFERENCES Minor inconsistency is tolerable. Conflict-free data types (CRDTs) work well.
✗ SHOPPING CART Merge conflicts on checkout. Eventual consistency with conflict resolution.5.3 Alternatives to Consensus
Section titled “5.3 Alternatives to Consensus”ALTERNATIVES TO CONSENSUS═══════════════════════════════════════════════════════════════
EVENTUAL CONSISTENCY─────────────────────────────────────────────────────────────Changes propagate asynchronously.All nodes converge to same state... eventually.
Pro: Higher availability, lower latency Con: Temporary inconsistency
CONFLICT-FREE REPLICATED DATA TYPES (CRDTs)─────────────────────────────────────────────────────────────Data structures that merge automatically.No coordination needed.
Examples: - G-Counter: Only grows (add, never subtract) - LWW-Register: Last-write-wins by timestamp - OR-Set: Observed-remove set
Pro: No coordination, always available Con: Limited operations, eventual consistency
OPTIMISTIC CONCURRENCY─────────────────────────────────────────────────────────────Assume no conflicts. Detect and retry if wrong.
Read record with version V Make changes Write "if version still V" If version changed, retry
Pro: No locks, high concurrency Con: Retries under contention
SINGLE LEADER (NO CONSENSUS)─────────────────────────────────────────────────────────────One designated leader (not elected).Simple but single point of failure.
Pro: Simple, no consensus needed Con: Manual failover, downtime during failureDid You Know?
Section titled “Did You Know?”-
Paxos was rejected twice by academic journals because reviewers found it too hard to understand. Lamport eventually published it as a “part-time parliament” allegory to make it more accessible.
-
Raft’s name comes from “Reliable, Replicated, Redundant, And Fault-Tolerant.” It was specifically designed to be understandable—the paper includes a user study showing people learn Raft faster than Paxos.
-
Google’s Chubby (Paxos-based lock service) was so critical that when it went down for 15 minutes, more Google services failed than when a major datacenter lost power. Dependencies on coordination services can be dangerous.
-
etcd started at CoreOS in 2013 as a simple key-value store for CoreOS’s distributed init system. When Kubernetes adopted it as its brain, etcd became one of the most critical pieces of infrastructure in cloud-native computing—now running in production at virtually every major tech company.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| Using consensus for everything | Slow, complex, bottleneck | Consensus only when needed |
| Wrong quorum size | 2 of 4 nodes isn’t majority | Use odd numbers (3, 5, 7) |
| Ignoring leader election time | Brief unavailability on failover | Design for election pauses |
| Distributed locks without fencing | Stale clients corrupt data | Use fencing tokens |
| Rolling your own consensus | Subtle bugs, data loss | Use battle-tested implementations |
| Consensus across datacenters | High latency, frequent elections | Prefer regional consensus |
-
What is consensus and why is it hard?
Answer
Consensus is getting multiple nodes to agree on a single value with three properties:
- Agreement: All non-faulty nodes decide the same value
- Validity: The value was proposed by some node
- Termination: All non-faulty nodes eventually decide
It’s hard because:
- FLP impossibility: In asynchronous systems with even one possible failure, consensus can’t be guaranteed
- Network unreliability: Messages can be lost, delayed, or reordered
- Partial failure: Can’t distinguish slow node from dead node
- No global clock: Can’t use time to order events
Practical algorithms (Paxos, Raft) work because true asynchrony is rare and random backoff prevents live-lock.
-
How does Raft achieve consensus?
Answer
Raft achieves consensus through:
-
Leader election: One node becomes leader via majority vote
- Nodes start as followers
- Timeout triggers candidate state
- Candidate requests votes
- Majority votes → becomes leader
-
Log replication: Leader orders all decisions
- Client sends request to leader
- Leader appends to log
- Leader replicates to followers
- Majority acknowledgment → committed
- Leader notifies client and followers
-
Terms: Time divided into terms, each with at most one leader
- Higher term wins conflicts
- Prevents split-brain
-
Safety: Log consistency rules ensure followers match leader
-
-
What’s the problem with distributed locks and how do fencing tokens solve it?
Answer
The problem: A client can acquire a lock, pause (GC, network), and resume after the lock expired. Another client acquires the lock, but the first client thinks it still holds it. Both are in the critical section.
Fencing tokens solve this:
- Lock server issues monotonically increasing token with each acquisition
- Token 33 → Client A, Token 34 → Client B (after A’s lock expired)
- Resource (database, file) checks token before accepting writes
- Client A wakes up with token 33, tries to write
- Resource rejects: 33 < 34 (last seen token)
- Only Client B (token 34) can write
The resource acts as the final gatekeeper, rejecting stale clients.
-
When should you NOT use consensus?
Answer
Avoid consensus when:
-
Stale data is acceptable: Caching, metrics, logging. Use eventual consistency.
-
Availability matters more: Shopping carts, social feeds. Use AP systems.
-
High write throughput needed: Consensus limits writes to leader’s capacity.
-
Geographic distribution: Cross-datacenter consensus has high latency.
-
Operations are commutative: Use CRDTs (counters, sets) that merge without coordination.
-
Conflicts are rare: Use optimistic concurrency with retry.
Consensus is expensive (latency, throughput, complexity). Use it for leader election, strong locks, and transaction commits—where correctness is critical.
-
-
A Raft cluster has 5 nodes with election timeout of 150-300ms. Node A (the leader) crashes. What’s the minimum and maximum time until a new leader is elected? What factors affect this?
Answer
Minimum time: ~150ms
- One follower times out at 150ms (earliest timeout)
- Becomes candidate, requests votes
- Receives 2 votes (majority of remaining 4)
- Becomes leader
- Best case: ~150ms + network RTT
Maximum time: Several seconds (worst case)
- All followers timeout near 300ms (late)
- Split vote (two candidates, each gets 2 votes)
- Both wait random backoff (150-300ms)
- Another split vote possible
- Multiple rounds until one wins
Factors affecting election time:
- Election timeout range: Wider range reduces split votes but increases minimum time
- Network latency: Higher latency = slower vote collection
- Number of nodes: More nodes = more coordination
- Network partitions: Can prevent majority formation
- Random backoff: Designed to break ties
Typical production: 1-3 seconds for new leader after failure detection.
-
You’re designing a system with 7 etcd nodes across 3 datacenters (3 in DC1, 2 in DC2, 2 in DC3). DC1 loses network connectivity. Can the cluster still function? What if DC2 also fails?
Answer
Quorum calculation for 7 nodes:
- Quorum = floor(7/2) + 1 = 4 nodes
Scenario 1: DC1 loses connectivity (3 nodes lost)
- Remaining: DC2 (2) + DC3 (2) = 4 nodes
- 4 ≥ quorum of 4: YES, cluster functions
- New leader elected from DC2 or DC3
- DC1 nodes become followers with stale data
Scenario 2: DC1 and DC2 fail (5 nodes lost)
- Remaining: DC3 only = 2 nodes
- 2 < quorum of 4: NO, cluster is read-only
- Cannot elect leader
- Cannot accept writes
- Existing pods keep running (kubelet cached state)
Design lesson: For 3 DC setup, distribute nodes as 3-2-2 not 5-1-1. This ensures any single DC failure leaves quorum intact. For true multi-DC resilience, use 5+ DCs or accept that losing 2 DCs breaks consensus.
-
A distributed lock has 15-second TTL with 5-second renewal. Client A acquires the lock, then experiences a 20-second GC pause. Explain the timeline. How would fencing tokens prevent data corruption?
Answer
Timeline:
T=0: Client A acquires lock (token 100), TTL=15sT=5: Client A should renew (but GC pause starts)T=10: Client A should renew (still in GC pause)T=15: Lock TTL expires, lock releasedT=16: Client B acquires lock (token 101), TTL=15sT=20: Client A wakes from GC pauseT=20: Client A thinks it still holds lock!T=20: Both clients believe they hold the lockWithout fencing tokens:
- Client A writes to shared resource
- Client B writes to shared resource
- Data corruption from interleaved writes
With fencing tokens:
T=20: Client A attempts write with token 100T=20: Resource checks: last_seen_token = 101T=20: 100 < 101 → REJECTED (stale token)T=20: Client B writes with token 101 → ACCEPTEDKey insight: The lock service can’t prevent Client A from trying to use an expired lock. But the resource (database, file system) can reject stale requests if it tracks fencing tokens. The token makes the lock’s expiration visible to the protected resource.
-
Your team is debating: use ZooKeeper vs etcd for a new coordination service. What questions should drive this decision? When would you choose each?
Answer
Key decision questions:
-
What’s your existing stack?
- Kubernetes/Go shop → etcd (native integration)
- Hadoop/Kafka/Java shop → ZooKeeper (proven integration)
-
What’s your data model?
- Flat key-value → etcd (simpler API)
- Hierarchical (paths, children) → ZooKeeper (tree structure)
-
What’s your watch pattern?
- Continuous streaming watches → etcd (efficient gRPC streams)
- One-time triggers → ZooKeeper (must re-register)
-
What’s your ops capability?
- etcd: Simpler, single binary, smaller footprint
- ZooKeeper: More complex, requires JVM tuning
Choose etcd when:
- Building on Kubernetes (already running etcd)
- Need efficient watch streams
- Prefer simpler operations
- Go ecosystem
Choose ZooKeeper when:
- Running Kafka, Hadoop, HBase (proven integration)
- Need hierarchical data model
- Team has ZooKeeper expertise
- Java ecosystem
Neither when:
- High write throughput needed (both limited by leader)
- Cross-datacenter with low latency requirements
- Simple caching (use Redis instead)
-
Hands-On Exercise
Section titled “Hands-On Exercise”Task: Explore consensus and coordination in Kubernetes.
Part 1: Observe etcd Consensus (10 minutes)
# If you have etcd access (e.g., kind cluster)# List etcd memberskubectl exec -it -n kube-system etcd-<node> -- etcdctl member list
# Check etcd healthkubectl exec -it -n kube-system etcd-<node> -- etcdctl endpoint health
# Watch etcd statisticskubectl exec -it -n kube-system etcd-<node> -- etcdctl endpoint status --write-out=tableRecord cluster status:
| Node | Is Leader | Raft Term | Raft Index |
|---|---|---|---|
Part 2: Observe Leader Election (10 minutes)
# View current leaderskubectl get leases -n kube-system
# Watch leader lease for controller-managerkubectl get lease kube-controller-manager -n kube-system -o yaml
# Observe holder identity and renew time# Note: renewTime updates regularly while leader is healthyPart 3: Implement Simple Coordination (15 minutes)
Create a ConfigMap that multiple pods watch:
apiVersion: v1kind: ConfigMapmetadata: name: shared-config namespace: defaultdata: feature-flag: "false"---apiVersion: v1kind: Podmetadata: name: watcher-1spec: containers: - name: watcher image: busybox command: ['sh', '-c', 'while true; do cat /config/feature-flag; sleep 5; done'] volumeMounts: - name: config mountPath: /config volumes: - name: config configMap: name: shared-config# Create the resourceskubectl apply -f coordination-config.yaml
# Watch the outputkubectl logs -f watcher-1
# In another terminal, update the configkubectl patch configmap shared-config -p '{"data":{"feature-flag":"true"}}'
# Observe: Does the watcher see the change?# (Note: ConfigMap volume updates have propagation delay)Success Criteria:
- Observed etcd cluster state and leader
- Understood lease-based leader election
- Saw configuration propagation delay
- Understand why coordination services are needed
Further Reading
Section titled “Further Reading”-
“In Search of an Understandable Consensus Algorithm” - Diego Ongaro. The Raft paper, specifically designed to be readable.
-
“Designing Data-Intensive Applications” - Martin Kleppmann. Chapters 8-9 cover consensus, distributed transactions, and coordination.
-
“The Raft Visualization” - raft.github.io. Interactive visualization of Raft consensus.
Key Takeaways
Section titled “Key Takeaways”Before moving on, ensure you understand:
- Consensus fundamentals: Getting nodes to agree on a single value with agreement (same value), validity (proposed value), and termination (eventual decision)
- FLP impossibility: In asynchronous systems, consensus can’t be guaranteed with even one failure. Practical algorithms work because true asynchrony is rare
- Raft mechanics: Leader election via majority vote, log replication to followers, terms to prevent split-brain, random timeouts to break ties
- Quorum math: Majority = floor(n/2) + 1. For 5 nodes, need 3. For 7 nodes, need 4. Use odd numbers to avoid ties
- Leader election trade-offs: Leader-based is simpler and faster but requires failover. Leaderless is more available but more complex
- Distributed lock dangers: GC pauses, network delays can cause client to think it holds expired lock. Fencing tokens are essential
- Consensus cost: Every write requires 2 round trips through leader. Limited to ~10-50K writes/sec. Don’t use for high-throughput data
- When to avoid consensus: Caching, metrics, logs, shopping carts—anywhere eventual consistency is acceptable. Save consensus for leader election, strong locks, transaction commits
Next Module
Section titled “Next Module”Module 5.3: Eventual Consistency - When you don’t need strong consistency, and how to make eventual consistency work.