
Site Reliability Engineering (SRE)

Discipline Track | 7 Modules | ~4 hours total

The practice of applying software engineering to operations. SRE provides concrete methods for measuring, maintaining, and improving the reliability of production systems.


Traditional operations faces a fundamental tension: development wants to ship fast, while operations wants stability. These goals seem opposed.

SRE resolves this tension through:

  • Measurable reliability — SLOs replace vague “make it reliable” with concrete targets
  • Error budgets — Calculated risk-taking, not reckless shipping or excessive caution
  • Engineering mindset — Automate toil, don’t just do it
  • Learning culture — Blameless postmortems turn failures into improvements

SRE isn’t just operations with a fancy name. It’s a fundamentally different approach to running production systems.


| # | Module | Time | Description |
|-----|--------|------|-------------|
| 1.1 | What is SRE? | 30-35 min | Origins, mindset, team structures, SRE vs DevOps |
| 1.2 | Service Level Objectives | 35-40 min | SLI, SLO, SLA hierarchy, choosing and measuring |
| 1.3 | Error Budgets | 30-35 min | Budget calculation, policies, balancing velocity |
| 1.4 | Toil and Automation | 30-35 min | Identifying toil, 50% rule, automation strategies |
| 1.5 | Incident Management | 35-40 min | Response roles, severity, on-call, runbooks |
| 1.6 | Postmortems and Learning | 30-35 min | Blameless culture, postmortem structure, action items |
| 1.7 | Capacity Planning | 35-40 min | Forecasting, provisioning, load testing, cost |

               START HERE
                   │
                   ▼
┌─────────────────────────────────────┐
│  Module 1.1                         │
│  What is SRE?                       │
│  └── SRE origins and mindset        │
│  └── SRE vs DevOps vs Platform      │
│  └── Team structures                │
│  └── The 50% rule                   │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Module 1.2                         │
│  Service Level Objectives           │
│  └── SLI, SLO, SLA hierarchy        │
│  └── Choosing good SLIs             │
│  └── Setting realistic SLOs         │
│  └── SLO-based alerting             │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Module 1.3                         │
│  Error Budgets                      │
│  └── Budget calculation             │
│  └── Error budget policies          │
│  └── Balancing reliability/velocity │
│  └── When to spend vs save          │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Module 1.4                         │
│  Toil and Automation                │
│  └── Defining and measuring toil    │
│  └── The 50% rule in practice       │
│  └── Automation hierarchy           │
│  └── ROI-based prioritization       │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Module 1.5                         │
│  Incident Management                │
│  └── Response roles (IC, Comms)     │
│  └── Severity levels                │
│  └── On-call best practices         │
│  └── Runbooks and playbooks         │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Module 1.6                         │
│  Postmortems and Learning           │
│  └── Blameless culture              │
│  └── The "second story"             │
│  └── Effective action items         │
│  └── Sharing learnings              │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Module 1.7                         │
│  Capacity Planning                  │
│  └── Demand forecasting             │
│  └── Provisioning strategies        │
│  └── Load testing                   │
│  └── Cost optimization              │
└──────────────────┬──────────────────┘
                   │
                   ▼
             SRE COMPLETE
                   │
    ┌──────────────┼──────────────┐
    │              │              │
    ▼              ▼              ▼
Platform        GitOps      Observability
Engineering   Discipline       Toolkit

| Concept | Module | What It Means |
|---------|--------|---------------|
| SLI | 1.2 | Service Level Indicator: what you measure |
| SLO | 1.2 | Service Level Objective: your target |
| SLA | 1.2 | Service Level Agreement: external promise |
| Error Budget | 1.3 | Allowed unreliability (1 - SLO) |
| Toil | 1.4 | Repetitive, automatable work |
| 50% Rule | 1.1, 1.4 | Cap toil at 50% to ensure engineering time |
| Incident Commander | 1.5 | Person coordinating incident response |
| Blameless Postmortem | 1.6 | Learning from failure without blame |
| Burn Rate | 1.2, 1.3 | How fast error budget is being consumed |
| Headroom | 1.7 | Buffer capacity for traffic spikes |
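
The error budget and burn rate definitions above reduce to simple arithmetic. The following is a minimal sketch rather than anything the modules prescribe: the function names, the 30-day window, and the sample numbers are illustrative assumptions. It treats the SLO as a fraction (0.999 for "three nines") and defines burn rate as the observed error ratio divided by the error budget, so a burn rate of 1 spends the budget exactly over the window.

```python
def error_budget(slo: float) -> float:
    """Allowed unreliability as a fraction: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the budget permits over the window.

    The 30-day default is an assumption; use whatever window your SLO defines.
    """
    return error_budget(slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is being consumed relative to plan.

    1.0 means the budget lasts exactly one window; 2.0 means it is
    gone in half a window.
    """
    return observed_error_ratio / error_budget(slo)


if __name__ == "__main__":
    # Hypothetical service with a 99.9% availability SLO.
    print(round(allowed_downtime_minutes(0.999), 1))  # 43.2
    # 0.2% of requests are currently failing.
    print(round(burn_rate(0.002, 0.999), 2))  # 2.0
```

With these sample numbers, a 99.9% SLO leaves roughly 43 minutes of downtime per 30 days, and a 0.2% error ratio consumes that budget twice as fast as the SLO allows.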


After completing the SRE Discipline, you’re ready for:

| Track | Why |
|-------|-----|
| Platform Engineering Discipline | Build platforms using SRE principles |
| GitOps Discipline | Declarative infrastructure operations |
| IaC Discipline | Infrastructure as Code for reliable provisioning |
| Observability Toolkit | Implement Prometheus, Grafana, OpenTelemetry |
| IaC Tools Toolkit | Terraform, OpenTofu, Pulumi for automation |
| CKA Certification | Apply SRE to Kubernetes administration |

Books (referenced throughout):

  • “Site Reliability Engineering” — Google (free online, the original SRE book)
  • “The Site Reliability Workbook” — Google (practical companion)
  • “Implementing Service Level Objectives” — Alex Hidalgo

Articles:

  • “What is SRE?” — Google Cloud
  • “SRE vs DevOps” — Atlassian

Tools mentioned:

  • Prometheus/Grafana: SLO monitoring
  • PagerDuty/OpsGenie: Incident management
  • k6/Locust: Load testing
  • Kubernetes HPA: Auto-scaling

| Question to Ask | Why It Matters |
|-----------------|----------------|
| “What’s our SLO?” | Can’t improve what you don’t measure |
| “How much error budget do we have?” | Know when to ship vs when to stabilize |
| “Is this toil?” | Automate if yes, protect engineering time |
| “What would prevent this incident?” | Systems fix, not human vigilance |
| “Who’s the IC?” | Clear roles in chaos |
| “Can we handle 2x traffic?” | Plan before you need it |
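
The last question, whether the system can handle 2x traffic, can be turned into a quick headroom check. This is a rough sketch under stated assumptions: capacity and peak load are both expressed in requests per second, the function names are hypothetical, and the numbers are made up. Real capacity planning also has to account for non-linear scaling and for dependencies that saturate first.

```python
def headroom_fraction(capacity_rps: float, peak_rps: float) -> float:
    """Share of total capacity still unused at current peak load."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    return (capacity_rps - peak_rps) / capacity_rps


def can_handle(capacity_rps: float, peak_rps: float, multiplier: float = 2.0) -> bool:
    """True if peak load scaled by `multiplier` still fits within capacity."""
    return peak_rps * multiplier <= capacity_rps


if __name__ == "__main__":
    # Hypothetical service: 1000 rps of tested capacity, 400 rps at peak.
    print(headroom_fraction(1000, 400))  # 0.6
    print(can_handle(1000, 400))         # True: 800 rps fits in 1000 rps
    print(can_handle(1000, 600))         # False: 1200 rps exceeds 1000 rps
```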

After these 7 modules, you can:

  • Define and measure reliability with SLOs
  • Balance reliability and velocity with error budgets
  • Identify and eliminate toil systematically
  • Respond to incidents with clear roles and processes
  • Learn from failures through blameless postmortems
  • Plan for future capacity and growth

You’re now equipped to practice SRE, not just read about it.


“Hope is not a strategy.” — Traditional SRE saying