Site Reliability Engineering (SRE)
Discipline Track | 7 Modules | ~4 hours total
The practice of applying software engineering to operations. SRE provides concrete methods for measuring, maintaining, and improving the reliability of production systems.
Why SRE?
Section titled “Why SRE?”Traditional operations faces a fundamental tension: development wants to ship fast, operations wants stability. These goals seem opposed.
SRE resolves this tension through:
- Measurable reliability — SLOs replace vague “make it reliable” with concrete targets
- Error budgets — Calculated risk-taking, not reckless shipping or excessive caution
- Engineering mindset — Automate toil, don’t just do it
- Learning culture — Blameless postmortems turn failures into improvements
SRE isn’t just operations with a fancy name. It’s a fundamentally different approach to running production systems.
Modules
Section titled “Modules”| # | Module | Time | Description |
|---|---|---|---|
| 1.1 | What is SRE? | 30-35 min | Origins, mindset, team structures, SRE vs DevOps |
| 1.2 | Service Level Objectives | 35-40 min | SLI, SLO, SLA hierarchy, choosing and measuring |
| 1.3 | Error Budgets | 30-35 min | Budget calculation, policies, balancing velocity |
| 1.4 | Toil and Automation | 30-35 min | Identifying toil, 50% rule, automation strategies |
| 1.5 | Incident Management | 35-40 min | Response roles, severity, on-call, runbooks |
| 1.6 | Postmortems and Learning | 30-35 min | Blameless culture, postmortem structure, action items |
| 1.7 | Capacity Planning | 35-40 min | Forecasting, provisioning, load testing, cost |
Learning Path
Section titled “Learning Path”START HERE │ ▼┌─────────────────────────────────────┐│ Module 1.1 ││ What is SRE? ││ └── SRE origins and mindset ││ └── SRE vs DevOps vs Platform ││ └── Team structures ││ └── The 50% rule │└──────────────────┬──────────────────┘ │ ▼┌─────────────────────────────────────┐│ Module 1.2 ││ Service Level Objectives ││ └── SLI, SLO, SLA hierarchy ││ └── Choosing good SLIs ││ └── Setting realistic SLOs ││ └── SLO-based alerting │└──────────────────┬──────────────────┘ │ ▼┌─────────────────────────────────────┐│ Module 1.3 ││ Error Budgets ││ └── Budget calculation ││ └── Error budget policies ││ └── Balancing reliability/velocity ││ └── When to spend vs save │└──────────────────┬──────────────────┘ │ ▼┌─────────────────────────────────────┐│ Module 1.4 ││ Toil and Automation ││ └── Defining and measuring toil ││ └── The 50% rule in practice ││ └── Automation hierarchy ││ └── ROI-based prioritization │└──────────────────┬──────────────────┘ │ ▼┌─────────────────────────────────────┐│ Module 1.5 ││ Incident Management ││ └── Response roles (IC, Comms) ││ └── Severity levels ││ └── On-call best practices ││ └── Runbooks and playbooks │└──────────────────┬──────────────────┘ │ ▼┌─────────────────────────────────────┐│ Module 1.6 ││ Postmortems and Learning ││ └── Blameless culture ││ └── The "second story" ││ └── Effective action items ││ └── Sharing learnings │└──────────────────┬──────────────────┘ │ ▼┌─────────────────────────────────────┐│ Module 1.7 ││ Capacity Planning ││ └── Demand forecasting ││ └── Provisioning strategies ││ └── Load testing ││ └── Cost optimization │└──────────────────┬──────────────────┘ │ ▼ SRE COMPLETE │ ┌──────────────┼──────────────┐ │ │ │ ▼ ▼ ▼Platform GitOps ObservabilityEngineering Discipline ToolkitKey Concepts You’ll Learn
Section titled “Key Concepts You’ll Learn”| Concept | Module | What It Means |
|---|---|---|
| SLI | 1.2 | Service Level Indicator — what you measure |
| SLO | 1.2 | Service Level Objective — your target |
| SLA | 1.2 | Service Level Agreement — external promise |
| Error Budget | 1.3 | Allowed unreliability (1 - SLO) |
| Toil | 1.4 | Repetitive, automatable work |
| 50% Rule | 1.1, 1.4 | Cap toil at 50% to ensure engineering time |
| Incident Commander | 1.5 | Person coordinating incident response |
| Blameless Postmortem | 1.6 | Learning from failure without blame |
| Burn Rate | 1.2, 1.3 | How fast error budget is being consumed |
| Headroom | 1.7 | Buffer capacity for traffic spikes |
Prerequisites
Section titled “Prerequisites”- Required: Reliability Engineering Track — Failure modes and resilience
- Required: Systems Thinking Track — Understanding system dynamics
- Recommended: Observability Theory Track — Metrics and monitoring
- Helpful: Experience operating production systems
- Helpful: Some Kubernetes experience
Where This Leads
Section titled “Where This Leads”After completing the SRE Discipline, you’re ready for:
| Track | Why |
|---|---|
| Platform Engineering Discipline | Build platforms using SRE principles |
| GitOps Discipline | Declarative infrastructure operations |
| IaC Discipline | Infrastructure as Code for reliable provisioning |
| Observability Toolkit | Implement Prometheus, Grafana, OpenTelemetry |
| IaC Tools Toolkit | Terraform, OpenTofu, Pulumi for automation |
| CKA Certification | Apply SRE to Kubernetes administration |
Key Resources
Section titled “Key Resources”Books (referenced throughout):
- “Site Reliability Engineering” — Google (free online, the original SRE book)
- “The Site Reliability Workbook” — Google (practical companion)
- “Implementing Service Level Objectives” — Alex Hidalgo
Articles:
- “What is SRE?” — Google Cloud
- “SRE vs DevOps” — Atlassian
Tools mentioned:
- Prometheus/Grafana: SLO monitoring
- PagerDuty/OpsGenie: Incident management
- k6/Locust: Load testing
- Kubernetes HPA: Auto-scaling
The SRE Mindset
Section titled “The SRE Mindset”| Question to Ask | Why It Matters |
|---|---|
| ”What’s our SLO?” | Can’t improve what you don’t measure |
| ”How much error budget do we have?” | Know when to ship vs when to stabilize |
| ”Is this toil?” | Automate if yes, protect engineering time |
| ”What would prevent this incident?” | Systems fix, not human vigilance |
| ”Who’s the IC?” | Clear roles in chaos |
| ”Can we handle 2x traffic?” | Plan before you need it |
Discipline Complete
Section titled “Discipline Complete”After these 7 modules, you can:
- Define and measure reliability with SLOs
- Balance reliability and velocity with error budgets
- Identify and eliminate toil systematically
- Respond to incidents with clear roles and processes
- Learn from failures through blameless postmortems
- Plan for future capacity and growth
You’re now equipped to practice SRE, not just read about it.
“Hope is not a strategy.” — Traditional SRE saying