Distributed Systems

Foundation Track | 3 Modules | ~1.5 hours total

The fundamentals of building systems that run across multiple machines. Understanding why distributed systems are hard, and the patterns that make them work.


Every modern system is distributed. The moment you have a web server and a database, you’re distributed. The moment you deploy to multiple availability zones, you face distributed systems challenges.

Distributed systems don’t behave like single machines. Things that were easy become hard:

  • Latency: A network call takes milliseconds where a local call takes nanoseconds, up to millions of times slower
  • Partial failure: Components fail independently, often invisibly
  • No global clock: You can’t reliably order events across machines
  • Uncertainty: You can’t always tell if a remote call succeeded

Understanding these challenges helps you design systems that work despite them.
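The last bullet, uncertainty, is the least intuitive, so here is a minimal sketch (hypothetical names, not from any module) of why a timed-out call leaves the outcome unknown: the server can apply a write and then lose the acknowledgement, which looks identical to the write never happening.

```python
# Hypothetical sketch: a "remote" call whose reply can be lost AFTER the
# server has already applied the write. On timeout, the client cannot tell
# whether the operation happened -- the core uncertainty described above.

class Server:
    def __init__(self):
        self.applied = []          # writes the server has actually applied

    def handle(self, op, reply_lost):
        self.applied.append(op)    # server-side effect happens first...
        if reply_lost:
            raise TimeoutError     # ...but the acknowledgement never arrives
        return "ok"

def client_call(server, op, reply_lost=False):
    try:
        return server.handle(op, reply_lost)
    except TimeoutError:
        return "unknown"           # succeeded? failed? merely slow? can't say

server = Server()
print(client_call(server, "debit $10"))                   # "ok"
print(client_call(server, "debit $10", reply_lost=True))  # "unknown"
print(server.applied)   # both debits applied: a blind retry would double-charge
```

This is why distributed APIs lean on idempotency keys or retries that are safe to repeat: "unknown" is a normal outcome, not an edge case.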


#     Module                           Time        Description
5.1   What Makes Systems Distributed   25-30 min   Fundamental challenges, CAP theorem, Kubernetes as distributed system
5.2   Consensus and Coordination       35-40 min   Paxos, Raft, leader election, distributed locks, etcd
5.3   Eventual Consistency             30-35 min   Consistency models, replication, conflict resolution, CRDTs

               START HERE
┌─────────────────────────────────────┐
│ Module 5.1                          │
│ What Makes Systems Distributed      │
│  └── The fundamental challenges     │
│  └── CAP theorem                    │
│  └── Kubernetes as example          │
│  └── Why it's hard                  │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ Module 5.2                          │
│ Consensus and Coordination          │
│  └── Paxos and Raft                 │
│  └── Leader election                │
│  └── Distributed locks              │
│  └── etcd and ZooKeeper             │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ Module 5.3                          │
│ Eventual Consistency                │
│  └── Consistency spectrum           │
│  └── Replication strategies         │
│  └── Conflict resolution            │
│  └── CRDTs                          │
└──────────────────┬──────────────────┘
                   │
                   ▼
         FOUNDATIONS COMPLETE
                   │
    ┌──────────────┼──────────────┐
    │              │              │
    ▼              ▼              ▼
   SRE         Platform        GitOps
Discipline    Engineering    Discipline

Concept                Module   What It Means
Latency                5.1      Network calls are slow (physics)
Partial Failure        5.1      Parts fail while others continue
CAP Theorem            5.1      Choose consistency or availability during partition
Consensus              5.2      Getting nodes to agree on a value
Raft                   5.2      Understandable consensus algorithm
Leader Election        5.2      Choosing one coordinator among many
Distributed Lock       5.2      Mutual exclusion across machines
Eventual Consistency   5.3      Convergence without immediate agreement
Version Vectors        5.3      Tracking causality without clocks
CRDTs                  5.3      Conflict-free data structures
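To make the last row concrete, here is an illustrative sketch (not taken from the modules) of the simplest CRDT, a grow-only counter (G-Counter): each replica increments only its own slot, and merge takes the element-wise maximum, so replicas converge no matter how or when they sync.

```python
# Illustrative sketch of a G-Counter, the simplest CRDT. Each replica
# increments only its own slot; merge takes the element-wise max, so all
# replicas converge regardless of sync order -- no coordination needed.

class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1           # only touch our own slot

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is exactly what makes the merge conflict-free.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # two concurrent writes at replica A
b.increment()                  # one concurrent write at replica B
a.merge(b); b.merge(a)         # sync in either order
print(a.value(), b.value())    # both converge to 3
```

Richer CRDTs (sets, registers, sequences) build on the same idea: a merge function that commutes, so concurrent writes never need a coordinator to resolve.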


After completing Distributed Systems, you’re ready for:

Track                             Why
SRE Discipline                    Apply distributed systems thinking to reliability
Platform Engineering Discipline   Build platforms on distributed foundations
GitOps Discipline                 Eventual consistency in practice
Observability Toolkit             Monitor distributed systems

Books referenced throughout this track:

  • “Designing Data-Intensive Applications” — Martin Kleppmann (the definitive guide)
  • “Distributed Systems for Fun and Profit” — Mikito Takada (free online)
  • “Database Internals” — Alex Petrov

Papers:

  • “Time, Clocks, and the Ordering of Events” — Leslie Lamport
  • “In Search of an Understandable Consensus Algorithm” — Diego Ongaro (Raft)
  • “Dynamo: Amazon’s Highly Available Key-value Store” — DeCandia et al.

Question to Ask                    Why It Matters
“What if this call fails?”         Design for partial failure
“What if it’s just slow?”          You can’t distinguish slow from dead
“Do we need consensus here?”       Consensus is expensive; use it sparingly
“What consistency do we need?”     Match consistency to requirements
“How do we handle conflicts?”      Concurrent writes will happen
“What’s the failure domain?”       Understand the blast radius
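The conflict question deserves one more sketch. Version vectors (covered in Module 5.3) let you detect that two writes were concurrent, and therefore genuinely in conflict, without synchronized clocks. The function names below are my own, a minimal sketch of the comparison rule:

```python
# Hypothetical sketch: version vectors detect concurrent writes without
# synchronized clocks. Each replica bumps its own entry on write; two
# versions conflict when neither vector dominates the other.

def dominates(v1, v2):
    """True if v1 has seen every event v2 has seen."""
    keys = set(v1) | set(v2)
    return all(v1.get(k, 0) >= v2.get(k, 0) for k in keys)

def compare(v1, v2):
    if dominates(v1, v2) and dominates(v2, v1):
        return "equal"
    if dominates(v1, v2):
        return "v1 newer"
    if dominates(v2, v1):
        return "v2 newer"
    return "concurrent"            # a real conflict: needs resolution

ancestor = {"A": 1}
child    = {"A": 2}                # replica A wrote again after ancestor
sibling  = {"A": 1, "B": 1}        # replica B wrote concurrently from ancestor
print(compare(child, ancestor))    # "v1 newer"  -- safe to overwrite
print(compare(child, sibling))     # "concurrent" -- resolve via merge, LWW, or a CRDT
```

Dynamo-style stores use exactly this distinction: a dominated version is silently superseded, while concurrent versions are kept as siblings for the application (or a CRDT merge) to reconcile.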

This is the final track in the Foundations series. You’ve now covered:

  1. Systems Thinking: See systems as interconnected wholes
  2. Reliability Engineering: Design for failure, measure with SLOs
  3. Observability Theory: Understand through metrics, logs, traces
  4. Security Principles: Defense in depth, least privilege
  5. Distributed Systems: Consensus, consistency, coordination

These foundations prepare you for the practical Disciplines and Toolkits tracks.


“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” — Leslie Lamport