Module 1.1: What is SRE?

Discipline Module | Complexity: [MEDIUM] | Time: 30-35 min

Before starting this module:

  • Required: Reliability Engineering Track — Understanding failure and resilience
  • Required: Systems Thinking Track — See systems as wholes
  • Recommended: Some experience operating production systems
  • Recommended: Understanding of software development lifecycle

After completing this module, you will be able to:

  • Evaluate whether SRE practices are appropriate for your organization’s reliability needs
  • Design an SRE team structure with clear roles, responsibilities, and engagement models
  • Implement the core SRE principles — SLOs, error budgets, toil reduction — in a real service
  • Analyze the gap between traditional ops and SRE to build a credible adoption roadmap

You’ve learned reliability principles. Now you need a framework to apply them.

Site Reliability Engineering (SRE) is that framework. Invented at Google, adopted worldwide, SRE transforms reliability from “someone else’s problem” to “everyone’s measurable responsibility.”

Stop and think: How is reliability currently handled in your organization? Is it a shared goal, or does it fall squarely on the shoulders of a single team when things break?

Without SRE, reliability is:

  • A vague goal (“make it more reliable”)
  • Someone else’s job (“ops will handle it”)
  • In conflict with features (“we don’t have time for reliability”)

With SRE, reliability is:

  • A measurable objective (99.9% availability)
  • A shared responsibility (developers on-call)
  • Balanced with velocity (error budgets)

This module shows you how SRE makes this transformation happen.


Google was growing fast. Very fast.

Traditional operations couldn’t keep up:

  • Manual processes: Too slow for Google’s scale
  • Siloed teams: Developers threw code over the wall
  • Perverse incentives: Ops wanted stability, Dev wanted features

Ben Treynor (now Treynor Sloss) was tasked with fixing this. His solution: treat operations as a software engineering problem.

“SRE is what happens when you ask a software engineer to design an operations team.” — Ben Treynor Sloss

Instead of hiring traditional sysadmins, Google hired software engineers and gave them operations responsibilities.

These engineers did what engineers do:

  • Automated repetitive tasks
  • Built tools to manage systems at scale
  • Applied engineering rigor to reliability
  • Measured everything

The result was Site Reliability Engineering — a discipline that bridges development and operations through software engineering.


Traditional ops focuses on doing work: patching servers, deploying code, responding to alerts.

SRE focuses on eliminating work: automating deployments, reducing toil, preventing incidents.

| Traditional Ops | SRE Approach |
| --- | --- |
| Manual deployments | Automated CI/CD |
| Reactive firefighting | Proactive prevention |
| Tribal knowledge | Documented runbooks |
| “Keep it running” | “Make it self-healing” |
| Blame individuals | Blame systems |

SRE teams have a hard rule: no more than 50% of time on operational work.

The rest goes to engineering projects that:

  • Reduce future operational burden
  • Improve system reliability
  • Automate repetitive tasks
  • Build better tools

Pause and predict: What happens if an SRE team ignores the 50% rule and spends 90% of their time fighting fires?

If operational work exceeds 50%, something is wrong. Either:

  • The system is too unreliable (needs reliability work)
  • Too much toil exists (needs automation)
  • The team is understaffed (needs hiring)

This rule ensures SRE remains an engineering discipline, not just operations with a fancy name.
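
As a rough illustration of how a team might check itself against the 50% rule, the sketch below totals a week of logged hours and flags when operational work exceeds the cap. The `WeekLog` categories and the sample hours are hypothetical; real teams would pull this from whatever time or ticket tracking they already use.

```python
from dataclasses import dataclass

OPS_CAP = 0.50  # the 50% ceiling on operational work described above


@dataclass
class WeekLog:
    """Hours one team logged in a week (hypothetical categories)."""
    tickets: float
    alerts: float
    manual_changes: float
    engineering: float

    @property
    def ops_hours(self) -> float:
        # Everything that is not project engineering counts as operational work here.
        return self.tickets + self.alerts + self.manual_changes

    @property
    def ops_fraction(self) -> float:
        total = self.ops_hours + self.engineering
        return self.ops_hours / total if total else 0.0


def over_ops_cap(log: WeekLog, cap: float = OPS_CAP) -> bool:
    """Return True when operational work exceeds the cap."""
    return log.ops_fraction > cap


# Sample data (made up): 30 operational hours out of 40 total.
week = WeekLog(tickets=12, alerts=10, manual_changes=8, engineering=10)
print(f"ops share: {week.ops_fraction:.0%}, over cap: {over_ops_cap(week)}")
```

A single bad week is not necessarily a problem; the signal worth acting on is a sustained trend above the cap.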

Here’s a counterintuitive SRE principle: chasing 100% reliability is usually the wrong goal.

Why? Because:

  1. Users may not notice: If other parts of the end-to-end path are less reliable than your service, pushing for additional 9s may yield little visible improvement
  2. It’s expensive: Each additional 9 costs exponentially more
  3. It’s slow: Extreme reliability requires extreme caution, which means slow releases

SRE embraces calculated risk through error budgets (more in Module 1.3).
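
To make the trade-off concrete, here is a minimal sketch that converts an availability target into the downtime it permits. The 30-day window and the function name are illustrative assumptions, not something this module prescribes. The code shows only how quickly the allowance shrinks, not what each extra nine costs to achieve.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed measurement window: 43,200 minutes


def allowed_downtime_minutes(slo: float, window_minutes: float = MINUTES_PER_30_DAYS) -> float:
    """Downtime permitted by an availability target over the window."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_minutes


# Each additional nine shrinks the allowance by a factor of ten.
for slo in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo:.3%} target -> {minutes:8.2f} minutes of downtime per 30 days")
```

At 99.9% the allowance is about 43 minutes per 30 days. At 99.999% it is under half a minute, which leaves almost no room for a risky release.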


Try This: Calculate Your Users’ Reliability Experience


Think about your service’s path to users:

flowchart TD
S["Your Service (99.9%)"] --> LB["Load Balancer (99.95%)"]
LB --> I["Internet (99.9%)"]
I --> ISP["User's ISP (99%)"]
ISP --> W["User's WiFi (99.5%)"]
W --> B["User's Browser (99.9%)"]
Combined: 0.999 × 0.9995 × 0.999 × 0.99 × 0.995 × 0.999
= ~98.2%

Pause and predict: If you spend engineering effort to increase your service reliability from 99.9% to 99.999%, how much will your users’ perceived reliability improve?

Your 99.9% doesn’t matter much when users only see ~98.2%. Raising your service to 99.999% would lift that end-to-end figure by only about 0.1 percentage points.

Lesson: Match your reliability to what users can actually perceive.
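
The calculation above can be reproduced with a few lines of Python. This is a sketch under a simplifying assumption: every hop sits in series and fails independently, so the availabilities simply multiply. The dictionary keys are illustrative labels for the hops in the diagram.

```python
from math import prod

# Availability of each hop between the service and the user (figures from the diagram).
PATH = {
    "service": 0.999,
    "load_balancer": 0.9995,
    "internet": 0.999,
    "isp": 0.99,
    "wifi": 0.995,
    "browser": 0.999,
}


def end_to_end(path: dict[str, float]) -> float:
    """Availability of a serial chain, assuming independent failures."""
    return prod(path.values())


baseline = end_to_end(PATH)
# Same path, but with the service improved from 99.9% to 99.999%.
improved = end_to_end({**PATH, "service": 0.99999})

print(f"baseline end-to-end: {baseline:.2%}")
print(f"with 99.999% service: {improved:.2%}")
print(f"gain: {(improved - baseline) * 100:.2f} percentage points")
```

Swapping in your own path and figures shows quickly which hop dominates what users experience; here it is the ISP and WiFi, not the service.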


These terms get confused. Let’s clarify:

DevOps is a cultural movement that breaks down silos between development and operations.

Core values:

  • Collaboration: Dev and Ops work together
  • Automation: Reduce manual work
  • Measurement: Track what matters
  • Sharing: Knowledge flows freely
  • Feedback: Fast loops everywhere

DevOps is what you believe.

SRE is a specific implementation of DevOps principles.

It provides:

  • Concrete practices: SLOs, error budgets, toil reduction
  • Defined roles: SRE teams with specific responsibilities
  • Measurable goals: Reliability targets, not vague aspirations
  • Engineering focus: Treat operations as a software problem

SRE is how you work.

“Class SRE implements interface DevOps.” — Seth Vargo and Liz Fong-Jones

Platform Engineering focuses on building internal developer platforms that make reliability, deployment, and operations self-service.

It provides:

  • Golden paths: Paved roads for common tasks
  • Self-service: Developers help themselves
  • Abstraction: Hide infrastructure complexity
  • Standardization: Consistent patterns across teams

Platform Engineering is what you build.

Stop and think: Is your organization trying to “buy DevOps” by simply renaming your traditional operations team, or are you actually changing your practices to align with SRE?

flowchart TD
subgraph DevOps["DevOps Culture"]
direction LR
SRE["SRE (Practice)<br/><br/>• SLOs<br/>• Error budgets<br/>• On-call<br/>• Postmortems"]
PE["Platform Engineering (Approach)<br/><br/>• IDPs<br/>• Golden paths<br/>• Self-service<br/>• Backstage"]
end

You can do SRE without Platform Engineering. You can do Platform Engineering without dedicated SREs. But many organizations do both, and they complement each other well.


There’s no single way to structure SRE. Here are common models:

Model 1: Centralized SRE Team

flowchart TD
Central["Central SRE Team<br/>(Owns reliability for all services)"]
TeamA["Team A (Dev)"]
TeamB["Team B (Dev)"]
TeamC["Team C (Dev)"]
Central --> TeamA
Central --> TeamB
Central --> TeamC

Pros:

  • Consistent practices across organization
  • Efficient use of specialized skills
  • Clear ownership of reliability

Cons:

  • Bottleneck for all teams
  • SREs disconnected from product context
  • “Throw it to SRE” mentality

Best for: Smaller organizations, early SRE adoption

Model 2: Embedded SREs

flowchart TD
subgraph TA["Team A"]
DevA["Developers<br/>+ 1-2 SREs"]
end
subgraph TB["Team B"]
DevB["Developers<br/>+ 1-2 SREs"]
end
subgraph TC["Team C"]
DevC["Developers<br/>+ 1-2 SREs"]
end

Pros:

  • SREs understand product context
  • Faster response to team-specific issues
  • Strong collaboration with developers

Cons:

  • Inconsistent practices across teams
  • SREs can get “captured” by product work
  • Less efficient (specialized skills spread thin)

Best for: Larger organizations, mature teams

Model 3: Hybrid (Central Platform + Embedded SREs)

flowchart TD
Central["Central SRE Platform<br/>(Tooling, standards, knowledge)"]
TeamA["Team A<br/>+ SRE"]
TeamB["Team B<br/>+ SRE"]
TeamC["Team C<br/>+ SRE"]
Central --> TeamA
Central --> TeamB
Central --> TeamC

Pros:

  • Consistent tooling and standards
  • SREs have product context
  • Best practices shared centrally

Cons:

  • More complex to manage
  • Potential for duplication
  • Requires mature organization

Best for: Large organizations, scaled SRE

Model 4: You Build It, You Run It (No Dedicated SREs)

flowchart TD
TeamA["Team A<br/>Develops AND Operates their services"]
TeamB["Team B<br/>Develops AND Operates their services"]
TeamC["Team C<br/>Develops AND Operates their services"]
Enabling["Enabling Team<br/>(Tooling, docs)"]
TeamA --> Enabling
TeamB --> Enabling
TeamC --> Enabling

Pros:

  • Developers fully own their services
  • No “throw it over the wall”
  • Deep product understanding

Cons:

  • Not everyone wants operations work
  • Inconsistent reliability practices
  • Specialized skills may be lacking

Best for: Small organizations, high-autonomy cultures


  1. Google’s SRE teams aim to spend at least 50% of their time on software engineering work: automation projects that reduce future operational burden, not just keeping things running.

  2. Error budgets became a core Google SRE concept for balancing reliability work against feature delivery.

  3. Some organizations avoid a separate SRE function and instead use a “you build it, you run it” or full-cycle developer model, supported by strong internal tooling.

  4. When Google published the first SRE book in 2016, it helped push SRE ideas into the broader DevOps conversation. Operations became much more widely discussed in explicit engineering terms.


Picture a fast-growing company in a situation like this:

  • A small operations group handled production for a much larger engineering organization
  • On-call interruptions were frequent and exhausting
  • Deployments were backed up long enough to slow delivery noticeably
  • Burnout was rampant

They hired more ops people. It helped briefly, then got worse again.

The problem: They were scaling linearly while the system grew exponentially.

The fix: They adopted SRE practices:

  1. Established SLOs: Defined what “good enough” meant
  2. Created error budgets: Developers owned reliability
  3. Mandated toil measurement: Found where time went
  4. Set the 50% rule: Forced automation investment

Within 6 months:

  • Deployment queue: Same day
  • On-call incidents: Meaningfully reduced
  • Team size: Held steady
  • Coverage: Able to support a larger engineering organization

The shift wasn’t hiring more people. It was working differently.

Lesson: You can’t hire your way out of operational problems. You have to engineer your way out.


| Principle | Meaning |
| --- | --- |
| Reliability is a feature | It’s not separate from product development |
| Embrace risk | Chasing 100% is usually the wrong goal; match reliability to user needs |
| Service level objectives | Measure reliability concretely |
| Error budgets | Balance reliability with velocity |
| Toil reduction | Automate repetitive work away |
| Blameless postmortems | Learn from failure without blame |
| Engineering mindset | Treat operations as a software problem |

| Mistake | Problem | Solution |
| --- | --- | --- |
| “We hired an SRE” | SRE is a practice, not a person | Adopt SRE practices org-wide |
| 100% reliability target | Expensive and slow | Set realistic SLOs based on user needs |
| SRE team does all ops | Creates bottleneck and resentment | Shared responsibility model |
| Ignoring the 50% rule | SREs become glorified ops | Protect engineering time ruthlessly |
| SRE without SLOs | No way to measure success | Define SLOs before anything else |
| Copy Google exactly | Their context isn’t yours | Adapt principles to your situation |

Your company is migrating from a traditional operations model to an SRE model. The VP of Engineering suggests that the new SRE team should handle all deployments and manual server patching to free up developers. Based on SRE principles, why is this approach problematic?

Answer:

This approach is problematic because it treats the SRE team as a traditional operations group focused on manual work rather than engineering solutions. In SRE, the primary goal is to treat operations as a software engineering problem. SREs should focus on automating deployments and reducing toil, governed by the 50% rule where no more than half their time is spent on operational tasks. By taking on all manual patching and deployments, the team would quickly exceed this limit and fail to build the scalable, automated systems that SRE is designed to create.

Your product manager insists that the new critical payment service must be built to achieve 100% reliability, arguing that “any downtime is unacceptable for payments.” As an SRE, how would you address this requirement?

Answer:

Targeting 100% reliability is considered an anti-pattern in SRE because it is both practically impossible and misaligned with actual user experience. Even if your service does not fail, users can still experience errors due to unstable internet connections, ISP outages, or device issues, making some of the extra effort invisible to them. Furthermore, achieving each additional “nine” of reliability costs exponentially more in engineering effort and severely throttles feature velocity. Instead, the SRE approach is to set a realistic Service Level Objective (SLO), such as 99.99%, and use the remaining error budget to safely deploy updates and balance reliability with innovation.

An executive at your company states, “We don’t need to adopt DevOps because we’re already building a Platform Engineering team, and we’ll hire SREs to run it.” How would you clarify the relationship between these three concepts to correct this misunderstanding?

Answer:

It is crucial to clarify that DevOps, SRE, and Platform Engineering are complementary concepts, not mutually exclusive alternatives. DevOps is the foundational cultural movement that emphasizes collaboration, shared responsibility, and breaking down silos between development and operations. SRE is a specific, prescriptive implementation of those DevOps principles, providing concrete practices like Service Level Objectives (SLOs) and error budgets. Meanwhile, Platform Engineering focuses on building internal developer platforms to enable self-service and standardize workflows. You still need the DevOps culture and SRE practices to ensure the platform is reliable and aligns with your business goals.

Six months into your SRE team’s existence, you audit their time and find they are spending 75% of their week resolving tickets, responding to alerts, and manually scaling infrastructure. What SRE principle is being violated, and what should be the immediate course of action?

Answer:

This situation directly violates the 50% rule, which requires SREs to spend at least half of their time on project-based engineering work rather than reactive operations. When operational work exceeds 50%, it indicates that the system is too unreliable, that too much toil remains unautomated, or that the team is understaffed. The immediate course of action is to push back on operational duties, potentially routing excess tickets back to the development teams. The SRE team should then redirect its focus toward automating the manual scaling and addressing the root causes of the alerts to permanently reduce the operational burden.


Assess your organization’s SRE maturity:

Rate each area from 1 (not at all) to 5 (fully implemented):

Reliability Measurement

[ ] We have SLOs for our critical services
[ ] We measure availability and latency
[ ] We have dashboards showing reliability metrics
[ ] We review reliability metrics regularly
Score: ___/20

Operational Practices

[ ] We have documented runbooks for common issues
[ ] We do blameless postmortems after incidents
[ ] We track and measure toil
[ ] We have a formal on-call rotation
Score: ___/20

Engineering Investment

[ ] Developers participate in on-call
[ ] We regularly invest in automation
[ ] We have error budgets that gate releases
[ ] We protect time for reliability engineering
Score: ___/20

Culture

[ ] Reliability is a product feature, not an ops problem
[ ] We learn from failures rather than blaming individuals
[ ] Developers and operations collaborate closely
[ ] Leadership supports reliability investment
Score: ___/20

| Score | Maturity Level | Focus Area |
| --- | --- | --- |
| 60-80 | Advanced SRE | Continuous improvement |
| 40-60 | Developing | Fill in gaps systematically |
| 20-40 | Beginning | Start with SLOs and postmortems |
| 0-20 | Pre-SRE | Build awareness and buy-in |
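
If you would rather tally the assessment in code, the sketch below sums the four categories, maps the total to a maturity level, and reports the weakest category. The sample ratings are made up. Because the score bands above share their boundary values, this sketch assumes a boundary score (20, 40, or 60) belongs to the higher band.

```python
# Sample ratings (hypothetical): four statements per category, each rated 1 to 5.
RATINGS = {
    "Reliability Measurement": [3, 4, 3, 2],
    "Operational Practices": [4, 2, 1, 4],
    "Engineering Investment": [2, 3, 1, 2],
    "Culture": [3, 3, 4, 3],
}


def category_score(ratings: list[int]) -> int:
    """Sum one category's four ratings, validating the 1-5 scale."""
    if len(ratings) != 4 or any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("each category needs four ratings between 1 and 5")
    return sum(ratings)


def maturity_level(total: int) -> str:
    """Map a total score (out of 80) to the levels in the table above.

    Assumption: a score on a shared boundary counts toward the higher band.
    """
    if total >= 60:
        return "Advanced SRE"
    if total >= 40:
        return "Developing"
    if total >= 20:
        return "Beginning"
    return "Pre-SRE"


scores = {name: category_score(r) for name, r in RATINGS.items()}
total = sum(scores.values())
weakest = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score}/20")
print(f"Total: {total}/80 -> {maturity_level(total)}")
print(f"Lowest-scoring category: {weakest}")
```

The weakest category is the one to target when you list your three improvements.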

You’ve completed this exercise when you:

  • Honestly assessed all 16 areas
  • Identified your lowest-scoring category
  • Listed 3 specific improvements for that category

  1. SRE is software engineering applied to operations — not just ops with a new name
  2. The 50% rule protects engineering time and forces automation
  3. Chasing 100% reliability is usually the wrong goal — match reliability to user needs
  4. SRE implements DevOps — concrete practices for cultural values
  5. Team structure matters — choose based on your organization’s maturity

Books:

  • “Site Reliability Engineering” — Google (the original SRE book, free online)
  • “The Site Reliability Workbook” — Google (practical companion)
  • “Seeking SRE” — David Blank-Edelman (SRE across different contexts)

Articles:

  • “What is SRE?” — Google Cloud (cloud.google.com/blog/products/devops-sre)
  • “SRE vs DevOps” — Atlassian (atlassian.com/incident-management)

Talks:

  • “Keys to SRE” — Ben Treynor Sloss (YouTube)
  • “SRE: An Incomplete Guide to Cultural Nuance” — Liz Fong-Jones (YouTube)

Site Reliability Engineering is a discipline that brings software engineering practices to operations. Born at Google, it provides:

  • Measurable reliability through SLOs
  • Balanced velocity through error budgets
  • Reduced toil through automation
  • Learning culture through blameless postmortems

SRE isn’t about perfection — it’s about being reliably good enough while still shipping features.


Continue to Module 1.2: Service Level Objectives (SLOs) to learn how to define and measure reliability targets.


“Hope is not a strategy.” — Traditional SRE saying