Module 1.1: What is SRE?
Discipline Module | Complexity: [MEDIUM] | Time: 30-35 min
Prerequisites
Before starting this module:
- Required: Reliability Engineering Track — Understanding failure and resilience
- Required: Systems Thinking Track — See systems as wholes
- Recommended: Some experience operating production systems
- Recommended: Understanding of software development lifecycle
What You’ll Be Able to Do
After completing this module, you will be able to:
- Evaluate whether SRE practices are appropriate for your organization’s reliability needs
- Design an SRE team structure with clear roles, responsibilities, and engagement models
- Implement the core SRE principles — SLOs, error budgets, toil reduction — in a real service
- Analyze the gap between traditional ops and SRE to build a credible adoption roadmap
Why This Module Matters
You’ve learned reliability principles. Now you need a framework to apply them.
Site Reliability Engineering (SRE) is that framework. Invented at Google, adopted worldwide, SRE transforms reliability from “someone else’s problem” to “everyone’s measurable responsibility.”
Stop and think: How is reliability currently handled in your organization? Is it a shared goal, or does it fall squarely on the shoulders of a single team when things break?
Here’s the thing: Without SRE, reliability is:
- A vague goal (“make it more reliable”)
- Someone else’s job (“ops will handle it”)
- In conflict with features (“we don’t have time for reliability”)
With SRE, reliability is:
- A measurable objective (99.9% availability)
- A shared responsibility (developers on-call)
- Balanced with velocity (error budgets)
This module shows you how SRE makes this transformation happen.
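To make "measurable objective" concrete, here is a minimal sketch that compares measured availability against a 99.9% target. The function names and the request counts are hypothetical and chosen only for illustration; Module 1.2 covers how to pick real indicators.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (1.0 when there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_objective(good_requests: int, total_requests: int,
                    target: float = 0.999) -> bool:
    """True when measured availability is at or above the target."""
    return availability(good_requests, total_requests) >= target


# Hypothetical month of traffic: 1,000,000 requests in total.
print(meets_objective(999_300, 1_000_000))  # 700 failures   -> True
print(meets_objective(998_500, 1_000_000))  # 1,500 failures -> False
```

The point of the sketch is that "is the service reliable enough?" becomes a yes/no question with a number behind it, rather than a matter of opinion.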
The Origin Story
2003: Google’s Problem
Google was growing fast. Very fast.
Traditional operations couldn’t keep up:
- Manual processes: Too slow for Google’s scale
- Siloed teams: Developers threw code over the wall
- Perverse incentives: Ops wanted stability, Dev wanted features
Ben Treynor (now Treynor Sloss) was tasked with fixing this. His solution: treat operations as a software engineering problem.
“SRE is what happens when you ask a software engineer to design an operations team.” — Ben Treynor Sloss
The Key Insight
Instead of hiring traditional sysadmins, Google hired software engineers and gave them operations responsibilities.
These engineers did what engineers do:
- Automated repetitive tasks
- Built tools to manage systems at scale
- Applied engineering rigor to reliability
- Measured everything
The result was Site Reliability Engineering — a discipline that bridges development and operations through software engineering.
The SRE Mindset
Engineering, Not Administration
Traditional ops focuses on doing work: patching servers, deploying code, responding to alerts.
SRE focuses on eliminating work: automating deployments, reducing toil, preventing incidents.
| Traditional Ops | SRE Approach |
|---|---|
| Manual deployments | Automated CI/CD |
| Reactive firefighting | Proactive prevention |
| Tribal knowledge | Documented runbooks |
| “Keep it running” | “Make it self-healing” |
| Blame individuals | Blameless: fix the system |
The 50% Rule
SRE teams have a hard rule: no more than 50% of time on operational work.
The rest goes to engineering projects that:
- Reduce future operational burden
- Improve system reliability
- Automate repetitive tasks
- Build better tools
Pause and predict: What happens if an SRE team ignores the 50% rule and spends 90% of their time fighting fires?
If operational work exceeds 50%, something is wrong. Either:
- The system is too unreliable (needs reliability work)
- Too much toil exists (needs automation)
- The team is understaffed (needs hiring)
This rule ensures SRE remains an engineering discipline, not just operations with a fancy name.
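One way to check the rule is to total a team's logged hours and compute the operational share. The sketch below assumes a simple, hypothetical time log keyed by activity name; the activity names and hours are made up for illustration.

```python
def operational_share(hours_by_activity: dict[str, float],
                      operational: set[str]) -> float:
    """Share of logged hours spent on activities classed as operational."""
    total = sum(hours_by_activity.values())
    if total == 0:
        return 0.0
    ops_hours = sum(hours for name, hours in hours_by_activity.items()
                    if name in operational)
    return ops_hours / total


# Hypothetical week for one SRE (40 hours logged).
week = {
    "tickets": 12,
    "on_call": 10,
    "manual_deploys": 8,
    "automation_project": 10,
}
OPERATIONAL = {"tickets", "on_call", "manual_deploys"}

share = operational_share(week, OPERATIONAL)
print(f"Operational share: {share:.0%}")   # Operational share: 75%
print("Over the 50% cap:", share > 0.5)    # Over the 50% cap: True
```

Which activities count as operational is a judgment call each team has to make; the arithmetic only becomes useful once that classification is agreed.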
Embrace Risk (Carefully)
Here’s a counterintuitive SRE principle: chasing 100% reliability is usually the wrong goal.
Why? Because:
- Users may not notice: If other parts of the end-to-end path are less reliable than your service, pushing for additional 9s may yield little visible improvement
- It’s expensive: Each additional 9 costs exponentially more
- It’s slow: Extreme reliability requires extreme caution, which means slow releases
SRE embraces calculated risk through error budgets (more in Module 1.3).
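To see what each additional 9 buys, the sketch below converts an availability target into the downtime it permits. It assumes a 30-day window, which is a choice made here for illustration. It shows what each target allows, not what it costs to achieve.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(target: float,
                             window_minutes: float = MINUTES_PER_30_DAYS) -> float:
    """Downtime permitted in the window by an availability target such as 0.999."""
    return (1.0 - target) * window_minutes


for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} -> {minutes:.2f} minutes per 30 days")
```

Each added 9 cuts the allowance by a factor of ten: roughly 432 minutes at 99%, 43 minutes at 99.9%, 4 minutes at 99.99%, and under half a minute at 99.999%.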
Try This: Calculate Your Users’ Reliability Experience
Think about your service’s path to users:
    flowchart TD
        S["Your Service (99.9%)"] --> LB["Load Balancer (99.95%)"]
        LB --> I["Internet (99.9%)"]
        I --> ISP["User's ISP (99%)"]
        ISP --> W["User's WiFi (99.5%)"]
        W --> B["User's Browser (99.9%)"]

Combined: 0.999 × 0.9995 × 0.999 × 0.99 × 0.995 × 0.999 ≈ 98.2%
Pause and predict: If you spend engineering effort to increase your service reliability from 99.9% to 99.999%, how much will your users’ perceived reliability improve?
Your 99.9% doesn’t matter much when users only see ~98.2%: raising your service to 99.999% would add only about 0.1 percentage points to what they experience.
Lesson: Match your reliability to what users can actually perceive.
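The calculation can be reproduced in a few lines. This sketch assumes the components fail independently and sit in series, so their availabilities multiply; the dictionary keys are just labels for the figures in the diagram.

```python
import math

# Availability of each hop on the path, taken from the diagram above.
PATH = {
    "service": 0.999,
    "load_balancer": 0.9995,
    "internet": 0.999,
    "isp": 0.99,
    "wifi": 0.995,
    "browser": 0.999,
}


def end_to_end(availabilities) -> float:
    """Availability of a serial chain, assuming independent failures."""
    return math.prod(availabilities)


baseline = end_to_end(PATH.values())
improved = end_to_end({**PATH, "service": 0.99999}.values())

print(f"Service at 99.9%:   {baseline:.2%}")            # 98.16%
print(f"Service at 99.999%: {improved:.2%}")            # 98.26%
print(f"Gain: {(improved - baseline) * 100:.2f} percentage points")
```

Two extra 9s on the service move the end-to-end figure by about a tenth of a percentage point, because the user's ISP and WiFi dominate the product.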
SRE vs DevOps vs Platform Engineering
These terms get confused. Let’s clarify:
DevOps: A Culture
DevOps is a cultural movement that breaks down silos between development and operations.
Core values:
- Collaboration: Dev and Ops work together
- Automation: Reduce manual work
- Measurement: Track what matters
- Sharing: Knowledge flows freely
- Feedback: Fast loops everywhere
DevOps is what you believe.
SRE: A Practice
SRE is a specific implementation of DevOps principles.
It provides:
- Concrete practices: SLOs, error budgets, toil reduction
- Defined roles: SRE teams with specific responsibilities
- Measurable goals: Reliability targets, not vague aspirations
- Engineering focus: Treat operations as a software problem
SRE is how you work.
“Class SRE implements interface DevOps.” — Seth Vargo and Liz Fong-Jones
Platform Engineering: An Approach
Platform Engineering focuses on building internal developer platforms that make reliability, deployment, and operations self-service.
It provides:
- Golden paths: Paved roads for common tasks
- Self-service: Developers help themselves
- Abstraction: Hide infrastructure complexity
- Standardization: Consistent patterns across teams
Platform Engineering is what you build.
Stop and think: Is your organization trying to “buy DevOps” by simply renaming your traditional operations team, or are you actually changing your practices to align with SRE?
The Overlap
    flowchart TD
        subgraph DevOps["DevOps Culture"]
            direction LR
            SRE["SRE (Practice)<br/><br/>• SLOs<br/>• Error budgets<br/>• On-call<br/>• Postmortems"]
            PE["Platform Engineering (Approach)<br/><br/>• IDPs<br/>• Golden paths<br/>• Self-service<br/>• Backstage"]
        end

You can do SRE without Platform Engineering. You can do Platform Engineering without dedicated SREs. But many organizations do both, and they complement each other well.
SRE Team Structures
There’s no single way to structure SRE. Here are common models:
Model 1: Centralized SRE Team
    flowchart TD
        Central["Central SRE Team<br/>(Owns reliability for all services)"]
        TeamA["Team A (Dev)"]
        TeamB["Team B (Dev)"]
        TeamC["Team C (Dev)"]
        Central --> TeamA
        Central --> TeamB
        Central --> TeamC

Pros:
- Consistent practices across organization
- Efficient use of specialized skills
- Clear ownership of reliability
Cons:
- Bottleneck for all teams
- SREs disconnected from product context
- “Throw it to SRE” mentality
Best for: Smaller organizations, early SRE adoption
Model 2: Embedded SREs
    flowchart TD
        subgraph TA["Team A"]
            DevA["Developers<br/>+ 1-2 SREs"]
        end
        subgraph TB["Team B"]
            DevB["Developers<br/>+ 1-2 SREs"]
        end
        subgraph TC["Team C"]
            DevC["Developers<br/>+ 1-2 SREs"]
        end

Pros:
- SREs understand product context
- Faster response to team-specific issues
- Strong collaboration with developers
Cons:
- Inconsistent practices across teams
- SREs can get “captured” by product work
- Less efficient (specialized skills spread thin)
Best for: Larger organizations, mature teams
Model 3: Hybrid
    flowchart TD
        Central["Central SRE Platform<br/>(Tooling, standards, knowledge)"]
        TeamA["Team A<br/>+ SRE"]
        TeamB["Team B<br/>+ SRE"]
        TeamC["Team C<br/>+ SRE"]
        Central --> TeamA
        Central --> TeamB
        Central --> TeamC

Pros:
- Consistent tooling and standards
- SREs have product context
- Best practices shared centrally
Cons:
- More complex to manage
- Potential for duplication
- Requires mature organization
Best for: Large organizations, scaled SRE
Model 4: You Build It, You Run It (No Dedicated SREs)
    flowchart TD
        TeamA["Team A<br/>Develops AND Operates their services"]
        TeamB["Team B<br/>Develops AND Operates their services"]
        TeamC["Team C<br/>Develops AND Operates their services"]
        Enabling["Enabling Team<br/>(Tooling, docs)"]
        TeamA --> Enabling
        TeamB --> Enabling
        TeamC --> Enabling

Pros:
- Developers fully own their services
- No “throw it over the wall”
- Deep product understanding
Cons:
- Not everyone wants operations work
- Inconsistent reliability practices
- Specialized skills may be lacking
Best for: Small organizations, high-autonomy cultures
Did You Know?
- Google aims for its SREs to spend about half their time on software engineering work: automation projects that reduce future operational burden, not just keeping things running.
- Error budgets became a core Google SRE concept for balancing reliability work against feature delivery.
- Some organizations avoid a separate SRE function and instead use a “you build it, you run it” or full-cycle developer model, supported by strong internal tooling.
- When Google published the first SRE book in 2016, it helped push SRE ideas into the broader DevOps conversation. Operations became much more widely discussed in explicit engineering terms.
War Story: The Team That Couldn’t Scale
A fast-growing company can end up in a situation like this:
- A small operations group was handling production for a much larger engineering organization
- On-call interruptions were frequent and exhausting
- Deployments were backed up long enough to slow delivery noticeably
- Burnout was rampant
They hired more ops people. It helped briefly, then got worse again.
The problem: They were scaling linearly while the system grew exponentially.
The fix: They adopted SRE practices:
- Established SLOs: Defined what “good enough” meant
- Created error budgets: Developers owned reliability
- Mandated toil measurement: Found where time went
- Set the 50% rule: Forced automation investment
Within 6 months:
- Deployment queue: Same day
- On-call incidents: Meaningfully reduced
- Team size: Held steady
- Coverage: Able to support a larger engineering organization
The shift wasn’t hiring more people. It was working differently.
Lesson: You can’t hire your way out of operational problems. You have to engineer your way out.
SRE Principles Summary
| Principle | Meaning |
|---|---|
| Reliability is a feature | It’s not separate from product development |
| Embrace risk | Chasing 100% is usually the wrong goal; match reliability to user needs |
| Service level objectives | Measure reliability concretely |
| Error budgets | Balance reliability with velocity |
| Toil reduction | Automate repetitive work away |
| Blameless postmortems | Learn from failure without blame |
| Engineering mindset | Treat operations as a software problem |
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| “We hired an SRE” | SRE is a practice, not a person | Adopt SRE practices org-wide |
| 100% reliability target | Expensive and slow | Set realistic SLOs based on user needs |
| SRE team does all ops | Creates bottleneck and resentment | Shared responsibility model |
| Ignoring the 50% rule | SREs become glorified ops | Protect engineering time ruthlessly |
| SRE without SLOs | No way to measure success | Define SLOs before anything else |
| Copy Google exactly | Their context isn’t yours | Adapt principles to your situation |
Quiz: Check Your Understanding
Question 1
Your company is migrating from a traditional operations model to an SRE model. The VP of Engineering suggests that the new SRE team should handle all deployments and manual server patching to free up developers. Based on SRE principles, why is this approach problematic?
Show Answer
This approach is problematic because it treats the SRE team as a traditional operations group focused on manual work rather than engineering solutions. In SRE, the primary goal is to treat operations as a software engineering problem. SREs should focus on automating deployments and reducing toil, governed by the 50% rule where no more than half their time is spent on operational tasks. By taking on all manual patching and deployments, the team would quickly exceed this limit and fail to build the scalable, automated systems that SRE is designed to create.
Question 2
Your product manager insists that the new critical payment service must be built to achieve 100% reliability, arguing that “any downtime is unacceptable for payments.” As an SRE, how would you address this requirement?
Show Answer
Targeting 100% reliability is considered an anti-pattern in SRE because it is both practically impossible and misaligned with actual user experience. Even if your service does not fail, users can still experience errors due to unstable internet connections, ISP outages, or device issues, making some of the extra effort invisible to them. Furthermore, achieving each additional “nine” of reliability costs exponentially more in engineering effort and severely throttles feature velocity. Instead, the SRE approach is to set a realistic Service Level Objective (SLO), such as 99.99%, and use the remaining error budget to safely deploy updates and balance reliability with innovation.
Question 3
An executive at your company states, “We don’t need to adopt DevOps because we’re already building a Platform Engineering team, and we’ll hire SREs to run it.” How would you clarify the relationship between these three concepts to correct this misunderstanding?
Show Answer
It is crucial to clarify that DevOps, SRE, and Platform Engineering are complementary concepts, not mutually exclusive alternatives. DevOps is the foundational cultural movement that emphasizes collaboration, shared responsibility, and breaking down silos between development and operations. SRE is a specific, prescriptive implementation of those DevOps principles, providing concrete practices like Service Level Objectives (SLOs) and error budgets. Meanwhile, Platform Engineering focuses on building internal developer platforms to enable self-service and standardize workflows. You still need the DevOps culture and SRE practices to ensure the platform is reliable and aligns with your business goals.
Question 4
Six months into your SRE team’s existence, you audit their time and find they are spending 75% of their week resolving tickets, responding to alerts, and manually scaling infrastructure. What SRE principle is being violated, and what should be the immediate course of action?
Show Answer
This situation directly violates the 50% rule, which mandates that SREs must spend at least half of their time on project-based engineering work rather than reactive operations. When operational work (toil) exceeds 50%, it indicates that the system is too unreliable, that too much work is still manual, or that the team is understaffed. The immediate course of action should be to push back on operational duties, potentially routing excess tickets back to the development teams. The SRE team must then redirect their focus toward automating the manual scaling and addressing the root causes of the alerts to permanently reduce the operational burden.
Hands-On Exercise: SRE Self-Assessment
Assess your organization’s SRE maturity:
Instructions
Rate each area from 1 (not at all) to 5 (fully implemented):
Reliability Measurement
[ ] We have SLOs for our critical services
[ ] We measure availability and latency
[ ] We have dashboards showing reliability metrics
[ ] We review reliability metrics regularly
Score: ___/20
Operational Practices
[ ] We have documented runbooks for common issues
[ ] We do blameless postmortems after incidents
[ ] We track and measure toil
[ ] We have a formal on-call rotation
Score: ___/20
Engineering Investment
[ ] Developers participate in on-call
[ ] We regularly invest in automation
[ ] We have error budgets that gate releases
[ ] We protect time for reliability engineering
Score: ___/20
Culture
[ ] Reliability is a product feature, not an ops problem
[ ] We learn from failures rather than blaming individuals
[ ] Developers and operations collaborate closely
[ ] Leadership supports reliability investment
Score: ___/20
Interpreting Your Score
| Score | Maturity Level | Focus Area |
|---|---|---|
| 61-80 | Advanced SRE | Continuous improvement |
| 41-60 | Developing | Fill in gaps systematically |
| 21-40 | Beginning | Start with SLOs and postmortems |
| 16-20 | Pre-SRE | Build awareness and buy-in |
Success Criteria
You’ve completed this exercise when you:
- Honestly assessed all 16 areas
- Identified your lowest-scoring category
- Listed 3 specific improvements for that category
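If you want to tally the assessment programmatically, a minimal sketch follows. The example ratings are invented. Assigning a boundary total (20, 40, or 60) to the lower band is an assumption made here, not something the scoring table specifies.

```python
def maturity_level(total: int) -> str:
    """Map a total score (16-80) to a maturity level.

    Assumption: boundary totals (20, 40, 60) fall in the lower band.
    """
    if total > 60:
        return "Advanced SRE"
    if total > 40:
        return "Developing"
    if total > 20:
        return "Beginning"
    return "Pre-SRE"


def summarize(ratings: dict[str, list[int]]) -> tuple[int, str, str]:
    """Return (total score, maturity level, lowest-scoring category)."""
    for category, values in ratings.items():
        if len(values) != 4 or any(v < 1 or v > 5 for v in values):
            raise ValueError(f"{category}: expected four ratings from 1 to 5")
    totals = {category: sum(values) for category, values in ratings.items()}
    total = sum(totals.values())
    weakest = min(totals, key=totals.get)
    return total, maturity_level(total), weakest


# Hypothetical ratings, four per category in checklist order.
example = {
    "Reliability Measurement": [3, 4, 4, 2],
    "Operational Practices": [2, 3, 1, 4],
    "Engineering Investment": [3, 3, 1, 2],
    "Culture": [3, 3, 4, 3],
}
print(summarize(example))  # (45, 'Developing', 'Engineering Investment')
```

The lowest-scoring category in the output is the one to use for the three improvements in the success criteria.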
Key Takeaways
- SRE is software engineering applied to operations — not just ops with a new name
- The 50% rule protects engineering time and forces automation
- Chasing 100% reliability is usually the wrong goal — match reliability to user needs
- SRE implements DevOps — concrete practices for cultural values
- Team structure matters — choose based on your organization’s maturity
Further Reading
Books:
- “Site Reliability Engineering” — Google (the original SRE book, free online)
- “The Site Reliability Workbook” — Google (practical companion)
- “Seeking SRE” — David Blank-Edelman (SRE across different contexts)
Articles:
- “What is SRE?” — Google Cloud (cloud.google.com/blog/products/devops-sre)
- “SRE vs DevOps” — Atlassian (atlassian.com/incident-management)
Talks:
- “Keys to SRE” — Ben Treynor Sloss (YouTube)
- “SRE: An Incomplete Guide to Cultural Nuance” — Liz Fong-Jones (YouTube)
Summary
Site Reliability Engineering is a discipline that brings software engineering practices to operations. Born at Google, it provides:
- Measurable reliability through SLOs
- Balanced velocity through error budgets
- Reduced toil through automation
- Learning culture through blameless postmortems
SRE isn’t about perfection — it’s about being reliably good enough while still shipping features.
Next Module
Continue to Module 1.2: Service Level Objectives (SLOs) to learn how to define and measure reliability targets.
“Hope is not a strategy.” — Traditional SRE saying
Sources
- cloud.google.com: sre in the 2022 state of devops report — This Google Cloud post directly says Google’s SRE practice has been embraced and extended by a global community.
- cloud.google.com: identifying and tracking toil using sre principles — The Google Cloud post explicitly states that within Google SRE the target is to keep toil below 50% of each SRE’s time.
- cloud.google.com: choose slos — The Google Cloud Architecture Center page directly states that a goal of 100% reliability is often not the most effective strategy.
- Google Cloud: Site Reliability Engineering — A current Google overview of SRE concepts, tools, and entry points for deeper study.
- How SRE Teams Are Organized, and How to Get Started — Useful follow-up for the team-structure section because it walks through common SRE organizational models and their tradeoffs.
- The Systems Engineering Side of Site Reliability Engineering — A foundational article on the systems-engineering mindset behind SRE and how it differs from purely software-focused roles.