Module 2.4: Measuring and Improving Reliability

Complexity: [MEDIUM]

Time to Complete: 40-45 minutes

Prerequisites: Module 2.3: Redundancy and Fault Tolerance

Track: Foundations

After completing this module, you will be able to:

  1. Implement reliability measurement frameworks using MTTR, MTBF, and availability percentages tied to user-facing impact
  2. Analyze incident data to identify the highest-leverage reliability improvements for a given service
  3. Design a continuous reliability improvement process that balances feature velocity with system stability
  4. Evaluate whether a reliability investment (chaos engineering, redundancy, automation) is justified by its risk-reduction return

The Meeting That Changed How Google Thinks About Reliability

2003. Google’s Mountain View campus. The weekly “availability meeting.”

Engineering VP Ben Treynor sits at the head of a conference table. Around him: frustrated engineers, exhausted operators, and a whiteboard covered in incident timelines.

“Gmail is at 99.5% availability,” someone reports.

Treynor frowns. “Is that good?”

Silence. Nobody knows how to answer.

“Users are complaining,” someone offers.

“But they always complain. How do we know if we should drop everything to fix this, or ship the new features that will make users happy?”

More silence.

A junior engineer speaks up: “What if… we decided in advance how reliable Gmail needs to be? Like, actually picked a number?”

The room considers this. It sounds almost naive—just pick a number?

“Let’s say 99.9%,” someone suggests. “That gives users 43 minutes of downtime a month. Is that acceptable?”

Marketing is consulted. Product weighs in. Support data is reviewed.

The team agrees: 99.9% is acceptable for Gmail. If users can email most of the time and the service recovers quickly when it doesn’t, that’s good enough. Higher would be nice but isn’t necessary.

Then comes the insight that changes everything.

“If 99.9% is acceptable, and we’re at 99.5%, we need to stop shipping features and fix reliability. But if we hit 99.95%… we’re over-engineering. We should ship faster.”

This is the birth of the error budget.

Google had invented a way to resolve the eternal conflict between “move fast” and “be reliable.” The SLO became a ceiling, not just a floor. When budget is healthy: ship. When budget is depleted: stabilize.

Within a few years, every team at Google would have SLOs. The framework would spread across the industry. Today, SLIs and SLOs are standard practice at companies from startups to enterprises.

The revolution wasn’t technical. It was conceptual: reliability became something you could budget for, spend, and invest—just like money.


“We need to improve reliability” is a vague goal. Improve what? By how much? How will you know if you’ve succeeded?

This module teaches you to measure reliability objectively using SLIs (Service Level Indicators), set meaningful targets with SLOs (Service Level Objectives), and create a continuous improvement process. Without measurement, reliability is just hope. With measurement, it’s engineering.

THE TRANSFORMATION: FROM ARGUMENTS TO DATA
═══════════════════════════════════════════════════════════════════════════════
BEFORE SLOs (Politics)
────────────────────────────────────────────────────────────────
Monday standup at a typical tech company:
Product: "When is the new checkout feature shipping?"
Engineering: "We can't ship until we fix these reliability issues."
Product: "What issues? The site seems fine."
Engineering: "Trust us, there are problems."
Product: "But customers want this feature!"
Engineering: "And they also want the site to not crash!"
Manager: "Can we compromise and do both?"
Everyone: *sighs*
Result: Whoever argues loudest wins. Same conversation next week.
AFTER SLOs (Data)
────────────────────────────────────────────────────────────────
Monday standup at a team with SLOs:
Product: "When is the new checkout feature shipping?"
Engineering: "Let me check the error budget... We're at 99.92% against a
99.9% SLO. We have 14 minutes of budget left this month."
Product: "That's tight. What's using our budget?"
Engineering: "The database migration last week cost us 18 minutes."
Product: "If we ship the checkout feature, what's the risk?"
Engineering: "Estimated 5-10 minutes if something goes wrong."
Product: "So we'd likely burn the rest of the budget."
Engineering: "Correct. We could wait for the new month, or ship with
a quick rollback ready."
Product: "Let's wait. We need that budget for the holiday sale."
Result: Data-driven decision. No argument. Aligned priorities.
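
The "ship or wait" call in this dialogue is simple enough to encode. A minimal Python sketch, assuming the team tracks budget in minutes (the can_ship name and the reserve_min knob for protecting budget ahead of an event like a holiday sale are illustrative, not a standard API):

def can_ship(budget_remaining_min: float, estimated_risk_min: float,
             reserve_min: float = 0.0) -> bool:
    """Ship only if the worst-case burn still leaves the protected reserve."""
    return budget_remaining_min - estimated_risk_min >= reserve_min

# Numbers from the standup above: 14 minutes left, a release that could
# cost up to 10 minutes, and a holiday sale the team wants budget for.
print(can_ship(14, estimated_risk_min=10))                  # True: ship
print(can_ship(14, estimated_risk_min=10, reserve_min=10))  # False: wait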

The Fitness Analogy

“I want to get fit” is a vague goal. “I want to run a 5K in under 25 minutes by March” is specific and measurable. You can track progress (current time), know when you’ve succeeded (under 25 minutes), and adjust training if needed. SLOs do the same for system reliability—they turn “be reliable” into “this specific thing, measured this way, at this target.”


  • What SLIs, SLOs, and SLAs are and how they differ
  • How to choose good SLIs for your services
  • Setting realistic SLOs
  • Using error budgets for decision-making
  • Continuous reliability improvement practices

| Term | What It Is | Who Cares | Example |
|------|------------|-----------|---------|
| SLI (Service Level Indicator) | Measurement of service behavior | Engineers | 99.2% of requests succeed |
| SLO (Service Level Objective) | Target for an SLI | Engineering + Product | 99.9% of requests should succeed |
| SLA (Service Level Agreement) | Contract with consequences | Business + Customers | 99.5% uptime or credit issued |
SLI → SLO → SLA RELATIONSHIP
═══════════════════════════════════════════════════════════════
SLI (what you measure):
"Request success rate is currently 99.2%"
└── A metric, a fact
SLO (what you target):
"Request success rate should be ≥99.9%"
└── An internal goal, aspirational
SLA (what you promise):
"Request success rate will be ≥99.5% or customer gets credit"
└── An external contract, legally binding
Relationship:
SLO should be stricter than SLA (give yourself buffer)
SLA ≤ SLO
Example:
- SLA to customers: 99.5%
- Internal SLO: 99.9%
- Buffer: 0.4% to catch problems before breach
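
To make the buffer concrete, here is a minimal Python sketch (the function name is illustrative) that converts availability targets into allowed downtime per 30-day month:

def allowed_downtime_minutes(target: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of downtime an availability target permits over the period."""
    return period_minutes * (1.0 - target)

sla, slo = 0.995, 0.999                  # external promise vs internal goal
assert sla <= slo, "the SLA must never be stricter than the SLO"

print(f"SLA allows {allowed_downtime_minutes(sla):.1f} min/month")  # 216.0
print(f"SLO allows {allowed_downtime_minutes(slo):.1f} min/month")  # 43.2
# The ~173-minute gap is the buffer: time to detect and fix problems
# after missing the SLO but before breaching the SLA.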

Without SLOs:

  • “Is the service reliable enough?” → “I think so?”
  • “Should we ship this feature or fix reliability?” → Arguments
  • “How urgent is this incident?” → Depends on who’s loudest

With SLOs:

  • “Is the service reliable enough?” → “Yes, we’re at 99.95% against a 99.9% target”
  • “Should we ship or fix reliability?” → “We have 3 hours of error budget left—fix first”
  • “How urgent is this incident?” → “It’s burning 10x normal error budget—high priority”
SLO-BASED DECISION MAKING
═══════════════════════════════════════════════════════════════
Scenario: Team wants to ship new feature that adds risk
WITHOUT SLO:
Product: "Ship it!"
Engineering: "It might break things!"
Product: "But customers want it!"
Engineering: "But reliability!"
→ Argument, politics, loudest voice wins
WITH SLO:
Current reliability: 99.95%
SLO target: 99.9%
Error budget: 0.1% (43.2 minutes this month)
Budget used: 21.6 minutes (the 0.05% gap between 100% and 99.95%)
Budget remaining: 21.6 minutes
Decision: We have error budget. Ship it, but monitor closely.
If we were at 99.85%, decision would be: Fix reliability first.
→ Data-driven decision, no argument needed

Did You Know?

Google’s SRE team famously uses error budgets to manage the tension between development velocity and reliability. When error budget is healthy, teams ship fast. When budget is depleted, feature freezes happen automatically—no negotiation needed. This has been adopted industry-wide as a best practice.


Google’s SRE book recommends monitoring these four signals:

| Signal | What It Measures | Example SLI |
|--------|------------------|-------------|
| Latency | How long requests take | p99 latency < 200ms |
| Traffic | How much demand | Requests per second |
| Errors | Rate of failures | Error rate < 0.1% |
| Saturation | How "full" the system is | CPU < 80% |
THE FOUR GOLDEN SIGNALS
═══════════════════════════════════════════════════════════════
            ┌─────────────────────────────────┐
Traffic ──▶ │          YOUR SERVICE           │ ──▶ Response
            │                                 │
            │  Latency:    How fast?          │
            │  Errors:     How often fails?   │
            │  Saturation: How full?          │
            │                                 │
            └─────────────────────────────────┘
These four signals capture most user-visible problems:
- High latency → Users wait → Bad experience
- High errors → Features broken → Bad experience
- High traffic → Might overload other components → Leading indicator
- High saturation → About to have problems → Early warning
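
A sketch of how the four signals might feed a basic health check; the measurements and thresholds below are hypothetical, mirroring the example SLIs in the table above:

# Hypothetical current measurements for a service.
signals = {
    "latency_p99_ms": 240.0,   # how long requests take
    "traffic_rps": 1800.0,     # how much demand (context, not an alert)
    "error_rate": 0.0004,      # fraction of failed requests
    "saturation_cpu": 0.65,    # how "full" the system is (0.0-1.0)
}

# Example alert conditions from the table above.
checks = {
    "latency_p99_ms": lambda v: v < 200.0,
    "error_rate":     lambda v: v < 0.001,
    "saturation_cpu": lambda v: v < 0.80,
}

for name, ok in checks.items():
    print(f"{name}: {signals[name]} -> {'OK' if ok(signals[name]) else 'BREACH'}")
# Here only latency breaches: users are waiting too long.
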
| Category | Measures | Good For |
|----------|----------|----------|
| Availability | Is it up? | Basic health |
| Latency | Is it fast? | User experience |
| Throughput | Can it handle load? | Capacity |
| Correctness | Is the output right? | Data quality |
| Freshness | Is data current? | Real-time systems |
| Durability | Is data safe? | Storage systems |

A good SLI is:

| Characteristic | Why It Matters | Example |
|----------------|----------------|---------|
| Measurable | You can actually collect the data | Request latency from logs |
| User-centric | Reflects user experience | Measured at the edge, not internally |
| Actionable | You can do something about it | Not external dependencies |
| Proportional | Worse SLI = worse experience | p99 latency, not mean |
GOOD vs BAD SLIs
═══════════════════════════════════════════════════════════════
BAD: "Server CPU utilization"
- Not user-centric (users don't care about CPU)
- Not proportional (80% CPU might be fine)
GOOD: "Request latency p99"
- User-centric (directly affects experience)
- Proportional (higher = worse)
BAD: "Database is up"
- Binary (up/down)
- Doesn't capture degradation
GOOD: "Percentage of queries completing in <100ms"
- Continuous (captures degradation)
- User-centric
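
The "GOOD" SLIs above fall straight out of raw request data. A minimal Python sketch over synthetic traffic (exponentially distributed latencies and a made-up 0.1% failure rate, purely for illustration):

import random

# Synthetic request log: (latency_ms, succeeded) per request.
requests = [(random.expovariate(1 / 80), random.random() > 0.001)
            for _ in range(100_000)]

# SLI 1: availability, the fraction of successful requests.
availability = sum(ok for _, ok in requests) / len(requests)

# SLI 2: the fraction of requests completing in under 100 ms.
# Continuous, so it captures gradual degradation (unlike a binary "is it up?").
fast_fraction = sum(lat < 100.0 for lat, _ in requests) / len(requests)

print(f"Availability SLI:     {availability:.4%}")
print(f"Requests under 100ms: {fast_fraction:.4%}")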

Try This (3 minutes)

For a service you work with, define one SLI for each category:

| Category | Your SLI |
|--------------|----------|
| Availability | |
| Latency | |
| Correctness | |

1. Start with user expectations, not technical capabilities

WRONG: "Our system can do 99.99%, so that's our SLO"
RIGHT: "Users expect checkout to work. What reliability do they need?"

2. Not everything needs the same SLO

DIFFERENTIATED SLOs
═══════════════════════════════════════════════════════════════
| Service | SLO | Rationale |
|--------------------|--------|--------------------------------|
| Payment processing | 99.99% | Money is involved, high stakes |
| Product search | 99.9% | Important but degraded is okay |
| Recommendations | 99.0% | Nice to have, can hide if down |
| Internal reporting | 95.0% | Async, users can wait |

3. SLO should be achievable but challenging

Too easy: 99% (you'll never improve)
Too hard: 99.999% (you'll always fail, SLO becomes meaningless)
Just right: 99.9% (achievable with effort, gives error budget)
SLO SETTING PROCESS
═══════════════════════════════════════════════════════════════
Step 1: Measure current state
└── "We're currently at 99.5% availability"
Step 2: Understand user needs
└── "Users complain when we're below 99%"
Step 3: Consider business context
└── "We're competing with Company X at 99.9%"
Step 4: Set initial SLO
└── "Target 99.9%, with 99.5% as minimum"
Step 5: Implement and measure
└── Track SLI against SLO
Step 6: Review and adjust
└── "We're consistently at 99.95%, raise target?"
└── "We're always missing, lower target?"
An example SLO document for a payment service:

# Service: Payment API
# Version: 1.2
# Last reviewed: 2024-01-15
## SLIs
| SLI | Definition | Measurement |
|-----|------------|-------------|
| Availability | Successful responses / Total requests | HTTP 2xx/3xx vs 5xx |
| Latency | Request duration at p99 | Measured at load balancer |
| Correctness | Valid payment responses | Reconciliation check |
## SLOs
| SLI | SLO Target | Error Budget (monthly) |
|-----|------------|----------------------|
| Availability | ≥99.95% | 21.6 minutes |
| Latency | 99.95% of requests ≤500ms | 0.05% of requests |
| Correctness | ≥99.99% | 0.01% of responses |
## Error Budget Policy
- Budget >50%: Normal development velocity
- Budget 25-50%: Increased monitoring, cautious releases
- Budget <25%: Feature freeze, reliability focus
- Budget depleted: All hands on reliability
## Review Schedule
- Weekly: Error budget check
- Monthly: SLO review meeting
- Quarterly: SLO target review

Gotcha: The SLO Ceiling Problem

If you consistently exceed your SLO by a large margin, you might be over-investing in reliability. Being at 99.99% when your SLO is 99.9% means you could move faster. Consider either: raising the SLO (if users benefit) or deliberately spending error budget on velocity (if they don’t).


ERROR BUDGET CALCULATION
═══════════════════════════════════════════════════════════════
SLO: 99.9% availability
Error budget: 100% - 99.9% = 0.1%
Monthly error budget:
- Minutes in month: 30 days × 24 hours × 60 min = 43,200 minutes
- Error budget: 43,200 × 0.001 = 43.2 minutes
Weekly error budget:
- Minutes in week: 7 × 24 × 60 = 10,080 minutes
- Error budget: 10,080 × 0.001 = 10.08 minutes
Budget burn rate:
- Sustainable pace: 43.2 min ÷ 30 days ≈ 1.4 minutes per day
- Incident example: 10 minutes burned in 1 hour ≈ a week of budget,
  roughly a 167x burn rate versus an even spend
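
These calculations are easy to script. A Python sketch, using the common convention that a burn rate of 1.0 means spending the budget at exactly the pace that exhausts it when the period ends:

def error_budget_minutes(slo: float, period_minutes: int) -> float:
    """Total downtime the SLO allows over the period."""
    return period_minutes * (1.0 - slo)

def burn_rate(downtime_min: float, elapsed_min: float,
              slo: float, period_minutes: int) -> float:
    """Budget consumption speed relative to an even spend over the period."""
    budget = error_budget_minutes(slo, period_minutes)
    return (downtime_min / budget) / (elapsed_min / period_minutes)

MONTH = 30 * 24 * 60
print(f"{error_budget_minutes(0.999, MONTH):.1f} min/month")      # 43.2
print(f"{error_budget_minutes(0.999, 7 * 24 * 60):.2f} min/week") # 10.08

# The incident above: 10 minutes of downtime within a single hour.
print(f"{burn_rate(10, 60, 0.999, MONTH):.0f}x burn rate")        # 167x
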
ERROR BUDGET DASHBOARD
═══════════════════════════════════════════════════════════════
MONTHLY ERROR BUDGET: 43.2 minutes (SLO: 99.9%)
Week 1: [████████████████████░░░░░░░░░░░░░░░░░░] Used: 8 min
Week 2: [█████████████████████████░░░░░░░░░░░░░] Used: 15 min
Week 3: [██████████████████████████████░░░░░░░░] Used: 25 min (incident)
Week 4: [████████████████████████████████░░░░░░] Used: 32 min
─────────────────────────────────────────
Total used: 32 minutes | Remaining: 11.2 minutes
Status: ⚠️ 26% remaining - Cautious releases
Last 30 days trend: remaining budget fell steadily from 100% on Day 1,
crossed the 50% warning line mid-month, and sits near 26% at Day 30.
| Budget Level | Policy | Actions |
|--------------|--------|---------|
| >75% | Green - Full velocity | Ship features, experiment |
| 50-75% | Yellow - Caution | Normal releases, increased monitoring |
| 25-50% | Orange - Slow down | Only critical releases, postmortems for all incidents |
| <25% | Red - Stop | Feature freeze, all hands on reliability |
| Depleted | Emergency | War room until budget recovers |
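
The point of writing the policy down is that the response becomes automatic rather than negotiated. A Python sketch encoding the table above (thresholds copied from the table; the function itself is illustrative):

def policy_level(budget_remaining: float) -> tuple[str, str]:
    """Map remaining error budget (as a fraction, 0.0-1.0) to a policy."""
    if budget_remaining <= 0.0:
        return "Emergency", "War room until budget recovers"
    if budget_remaining < 0.25:
        return "Red - Stop", "Feature freeze, all hands on reliability"
    if budget_remaining < 0.50:
        return "Orange - Slow down", "Only critical releases, postmortems for all incidents"
    if budget_remaining < 0.75:
        return "Yellow - Caution", "Normal releases, increased monitoring"
    return "Green - Full velocity", "Ship features, experiment"

# The dashboard above: 11.2 of 43.2 minutes remaining (~26%).
print(policy_level(11.2 / 43.2))   # ('Orange - Slow down', ...)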

Try This (3 minutes)

Your service has a 99.9% SLO. This month:

  • Incident 1: 15 minutes of downtime
  • Incident 2: 8 minutes of degraded performance (counts as 50%)
  • Incident 3: 5 minutes of downtime

Calculate:

  1. Total budget (43.2 minutes for 99.9%)
  2. Budget consumed: _____ minutes
  3. Budget remaining: _____ minutes
  4. What policy level are you at?

RELIABILITY IMPROVEMENT CYCLE
═══════════════════════════════════════════════════════════════
┌──────────────────────────────────────────────────┐
│                                                  │
▼                                                  │
MEASURE ──▶ ANALYZE ──▶ PRIORITIZE ──▶ IMPROVE ────┘

MEASURE    ← SLIs, error budget tracking
ANALYZE    ← Why are we missing SLO?
PRIORITIZE ← What will have the most impact?
IMPROVE    ← Implement fixes

Every significant incident should have a blameless postmortem:

POSTMORTEM TEMPLATE
═══════════════════════════════════════════════════════════════
## Incident: Payment API Outage 2024-01-15
### Summary
- Duration: 23 minutes
- Impact: 12,000 failed transactions
- Error budget consumed: 23 minutes (53% of monthly)
### Timeline
- 14:32 - Deploy of version 2.3.1
- 14:35 - Error rate spikes to 15%
- 14:38 - Alert fires, on-call paged
- 14:45 - Rollback initiated
- 14:55 - Service recovered
### Contributing Factors
1. Database migration had incompatible schema change
2. Canary deployment disabled for "quick fix"
3. Integration tests didn't cover this code path
### Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Re-enable canary deployments | Alice | 2024-01-16 |
| Add integration test for schema | Bob | 2024-01-20 |
| Review migration process | Team | 2024-01-22 |
### Lessons Learned
- "Quick fixes" are rarely quick
- Canary deployments exist for a reason

Regular reliability reviews keep teams focused:

Weekly: Error budget check

  • How much budget consumed?
  • Any incidents to review?
  • Upcoming risky changes?

Monthly: SLO review

  • Are we meeting SLOs?
  • What’s trending?
  • What’s the biggest reliability risk?

Quarterly: Strategy review

  • Are SLOs still appropriate?
  • What systemic improvements are needed?
  • Resource allocation for reliability work
RELIABILITY INVESTMENT FRAMEWORK
═══════════════════════════════════════════════════════════════
When error budget is healthy:
├── Invest in observability improvements
├── Build automation to reduce MTTR
├── Conduct chaos engineering experiments
└── Pay down reliability tech debt
When error budget is depleted:
├── Stop feature work
├── Focus on top incident causes
├── Increase monitoring coverage
└── Implement quick-win reliability fixes
Investment allocation (example):
┌─────────────────────────────────────────────────────┐
│ Engineering Time Allocation │
│ │
│ Features █████████████████████████ 60% │
│ Reliability ████████████ 25% │
│ Tech Debt █████ 15% │
│ │
│ If SLO missed 2 consecutive months: │
│ │
│ Features ████████████ 30% │
│ Reliability █████████████████████████ 50% │
│ Tech Debt ████████ 20% │
└─────────────────────────────────────────────────────┘

War Story: The Team That Stopped Blaming

A platform team had a toxic incident culture. After every outage: “Who deployed last? Whose code was it? Who’s responsible?” Engineers hid information. Post-incident meetings were interrogations. The same problems kept happening.

A new engineering director introduced blameless postmortems. The first one felt awkward—people kept trying to assign blame. She redirected: “We’re not asking who. We’re asking why the system allowed this to happen.”

Six months later: postmortem participation doubled. Engineers volunteered information. Action items actually got completed because people owned them willingly, not defensively. Incident recurrence dropped 60%.

The insight: When people fear blame, they hide information. When they feel safe, they share what went wrong. Reliability improves when learning replaces blame.

WAR STORY: FROM BLAME CULTURE TO LEARNING CULTURE - THE TRANSFORMATION
═══════════════════════════════════════════════════════════════════════════════
THE INCIDENT: March 15th - Production Database Outage
BEFORE (Blame Culture)
────────────────────────────────────────────────────────────────
The "Post-Incident Review" (actually: public interrogation)
Meeting Room A, 15 attendees, tense silence
Manager: "The database was down for 47 minutes. Who did this?"
*Scans the room*
Junior Engineer: *Sweating* "I... I ran the migration..."
Manager: "Did you test it first?"
Junior Engineer: "Yes, in staging, but—"
Senior Engineer: *Interrupting* "The staging database is completely
different. Everyone knows that."
Junior Engineer: *Wants to disappear*
Manager: "Why wasn't there a review of this migration?"
Team Lead: "There was. I approved it. But I didn't see—"
Manager: "You didn't see what could go wrong? That's your job."
Meeting continues for 90 minutes. No one learns anything.
Everyone learns to hide their mistakes.
Result 6 months later:
- Same migration issues occur 3 more times
- Engineers deploy only on Fridays so incidents happen on weekends
- Junior engineers stop asking for help
- Senior engineers stop doing code reviews
- Incident reports are vague: "unknown cause"
- MTTR increases because people are afraid to admit they know what's wrong
AFTER (Learning Culture)
────────────────────────────────────────────────────────────────
New Director's first post-incident review
Same incident, same room, different approach
Director: "Let's understand what happened. Timeline first. Facts only."
14:32 - Migration started
14:34 - Lock escalation began
14:35 - Application timeouts
14:38 - Alert fired
14:42 - On-call paged
14:55 - Decision to rollback
15:19 - Service restored
Director: "47 minutes total. Now, what CONDITIONS allowed this to happen?"
Engineer 1: "The migration worked in staging but staging has 1% of
production data."
Director: *Writing on whiteboard* "CONDITION: Staging doesn't represent
production data volume. What else?"
Engineer 2: "There's no way to test migrations against prod-like data."
Director: "CONDITION: No production-representative test environment."
Engineer 3: "The locks weren't visible. We didn't know they were building up."
Director: "CONDITION: Lock monitoring gap. More?"
Junior Engineer: *Tentatively* "I... I actually asked about this in
the PR, but it got approved anyway."
Director: "Wait, you asked? Show me."
*Pulls up PR*
PR Comment from Junior Engineer: "Will this lock the users table?
That seems risky for a 10M row table."
Reviewer Response: "Should be fine, staging worked."
Director: "This is GOLD. The system failed to escalate a valid concern.
CONDITION: Review process didn't require load testing for
schema changes. The PERSON did the right thing. The SYSTEM
let them down."
Junior Engineer: *Visible relief*
Director: "The question isn't 'who made a mistake.' The question is
'why was making this mistake so easy?' Our action items
should make this mistake IMPOSSIBLE to repeat."
ACTION ITEMS (with owners who volunteered)
────────────────────────────────────────────────────────────────
| # | Action | Owner | Due |
|---|--------|-------|-----|
| 1 | Create prod-shadow DB for migration testing | Sarah | Apr 1 |
| 2 | Add lock monitoring to alerting | James | Mar 25 |
| 3 | Schema change review checklist | Team | Mar 22 |
| 4 | Document "concern escalation" for PRs | Director | Mar 18 |
| 5 | Migration runbook with rollback steps | Junior Eng | Mar 20 |
Note: Junior engineer was GIVEN an action item, not punished.
This built confidence and ownership.
RESULTS 6 MONTHS LATER
────────────────────────────────────────────────────────────────
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Incidents from same root cause | 12 | 2 | -83% |
| Action item completion rate | 34% | 89% | +162% |
| Postmortem participation | 5 avg | 12 avg | +140% |
| Engineer satisfaction (survey) | 3.2/5 | 4.4/5 | +38% |
| Mean time to acknowledge | 12 min | 4 min | -67% |
| Voluntary incident reports | 0 | 23 | ∞ |
THE TRANSFORMATION FORMULA
────────────────────────────────────────────────────────────────
1. Replace "Who?" with "What conditions?"
2. Assume everyone tried their best
3. Ask "What would have prevented this?"
4. Make action items about SYSTEMS, not PEOPLE
5. Celebrate catching problems over hiding them
6. Follow up on action items publicly
7. Share postmortems widely (learning becomes culture)

  • Google publishes SLOs for many of their services. GCP has public SLOs that trigger automatic credits if breached. This transparency builds trust and sets industry standards.

  • The “rule of 10” in SLO setting: It takes roughly 10x effort to add each nine of reliability. Going from 99% to 99.9% is hard. Going from 99.9% to 99.99% is 10x harder.

  • SLOs predate software. The concept comes from manufacturing quality control. Walter Shewhart developed statistical quality control at Bell Labs in the 1920s—the same principles apply today.

  • Amazon’s “two pizza” rule for team size also applies to SLOs: if a service needs more SLOs than can fit on two pizza boxes, it’s probably too complex. Most teams settle on 3-5 SLIs per service—enough to capture user experience, few enough to focus on.


| Mistake | Problem | Solution |
|---------|---------|----------|
| Too many SLIs | Can't focus, alert fatigue | 3-5 SLIs per service max |
| SLO = current performance | No room to improve or buffer | Set slightly below current |
| Measuring internally, not at edge | Misses user experience | Measure where users connect |
| Ignoring error budget | SLO is just a number | Make budget decisions automatic |
| No postmortems | Same incidents repeat | Blameless postmortem culture |
| Yearly SLO review | SLOs become stale | Quarterly minimum |

  1. What’s the difference between an SLI and an SLO?

    Answer

    SLI (Service Level Indicator) is a measurement of service behavior—a metric, a fact. “Our availability is 99.85%” is an SLI.

    SLO (Service Level Objective) is a target for that measurement—a goal. “Availability should be ≥99.9%” is an SLO.

    SLIs tell you where you are. SLOs tell you where you want to be. The gap between them is what you work to close.

  2. Why should SLOs be stricter than SLAs?

    Answer

    SLAs have consequences—often financial penalties or contract breaches. If your SLO equals your SLA, you have no buffer. The moment you miss your SLO, you’ve also breached your SLA.

    By setting a stricter SLO (e.g., SLO: 99.9%, SLA: 99.5%), you get:

    1. Warning time: When you miss SLO, you know to focus on reliability before SLA breach
    2. Buffer: Normal variation won’t breach SLA
    3. Improvement incentive: Teams aim higher than the minimum

    The gap between SLO and SLA is your safety margin.

  3. How do error budgets help resolve the tension between velocity and reliability?

    Answer

    Without error budgets, “ship features” and “be reliable” are subjective goals that conflict—whoever argues loudest wins.

    With error budgets:

    • Reliability is quantified: “We have X minutes of budget”
    • Velocity is enabled: “Budget remaining? Ship fast”
    • Reliability is protected: “Budget depleted? Focus on reliability”

    The decision is data-driven, not political. Product teams know features will ship when budget is healthy. Engineering knows reliability will be prioritized when budget is depleted. Both sides get what they need.

  4. What makes a good SLI?

    Answer

    A good SLI is:

    1. Measurable: You can actually collect the data reliably
    2. User-centric: Reflects what users experience, not internal metrics
    3. Proportional: Worse SLI = worse user experience (linear relationship)
    4. Actionable: Your team can influence it
    5. Captured at the edge: Measured where users connect, not deep in the stack

    Example: “Request latency p99 measured at the load balancer” is better than “API server response time mean” because it’s user-centric (what they actually experience), proportional (slower = worse), and measured at the edge.

  5. Your service has a 99.9% availability SLO. This month you had 25 minutes of downtime. Calculate: (a) Total error budget, (b) Budget consumed, (c) Budget remaining as percentage, (d) What policy level should you be at?

    Answer

    (a) Total error budget for 99.9% monthly SLO:

    • Minutes in month: 30 × 24 × 60 = 43,200 minutes
    • Error budget: 43,200 × (1 - 0.999) = 43,200 × 0.001 = 43.2 minutes

    (b) Budget consumed:

    • 25 minutes (the downtime)

    (c) Budget remaining as percentage:

    • Remaining: 43.2 - 25 = 18.2 minutes
    • Percentage: 18.2 / 43.2 = 42.1% remaining

    (d) Policy level:

    • 42.1% falls in the Orange (25-50%) range
    • Policy: Slow down releases, postmortem for all incidents, focus on reliability

    This team should pause non-critical deployments and focus on preventing further budget burn.

  6. Why is “CPU utilization” usually a bad SLI but “request latency p99” is usually a good SLI?

    Answer

    CPU utilization is bad because:

    • Not user-centric: Users don’t experience CPU directly
    • Not proportional: 80% CPU might mean great performance or terrible performance
    • Leading indicator at best: High CPU might cause problems, but might not
    • Not actionable in terms of user impact: “Lower CPU” doesn’t tell you what to fix

    Request latency p99 is good because:

    • User-centric: Users directly experience wait time
    • Proportional: Higher latency = worse user experience, always
    • Actionable: “Latency is high” points directly at the problem
    • Measurable at the edge: Captures full user experience including network

    The key insight: SLIs should measure what users care about, not what systems do internally. Users care about “is it fast?” and “does it work?”, not “is the server busy?”

  7. What is a blameless postmortem, and why does it improve reliability more than a blame-focused investigation?

    Answer

    Blameless postmortem: An incident review that focuses on what conditions allowed the failure, not who made a mistake. It assumes everyone acted rationally given what they knew at the time.

    Why it improves reliability:

    1. Information flows freely: When people aren’t afraid of punishment, they share what actually happened. Hidden information stays hidden in blame cultures.

    2. Root causes emerge: Asking “why did the system allow this?” reveals systemic issues. Asking “who did this?” stops at individual error.

    3. Action items work: People volunteer to own fixes when they’re not being punished. Forced action items get minimal effort.

    4. Similar incidents decrease: Systemic fixes prevent recurrence. Punishing individuals doesn’t change the system that enabled the error.

    5. Reporting increases: Engineers report near-misses when they’re safe to share. Blame cultures only learn from disasters.

    The paradox: Removing blame increases accountability because people own problems instead of hiding from them.

  8. A team consistently exceeds their SLO—they have 99.95% availability against a 99.9% target, month after month. Should they celebrate, or is this a problem?

    Answer

    This is potentially a problem called “over-achievement.”

    Consider:

    • Error budget: 43.2 minutes monthly
    • Actual downtime: ~21.6 minutes (50% of budget)
    • Remaining budget: ~50% unused, every month

    Why this might be bad:

    1. Over-investment in reliability: Engineering time going to reliability that could go to features
    2. Too conservative: Team might be afraid to take risks, slowing down innovation
    3. Wrong SLO: The target might be too easy, giving false confidence

    What to do:

    1. Consider raising the SLO if users would benefit from higher reliability
    2. Deliberately spend error budget on faster deployments, more experimentation
    3. Redirect reliability investment to other areas that need it
    4. Review if SLO is appropriate for the service’s actual requirements

    The insight: SLOs are targets, not just floors. Consistently beating them by large margins suggests misallocation of engineering effort.


Task: Define SLIs and SLOs for a service and create an error budget dashboard.

Part A: Define SLIs (10 minutes)

Choose a service you work with (or use the example “User API” service).

Define SLIs using this template:

| SLI Name | Definition | Measurement Method | Good Threshold |
|----------|------------|--------------------|----------------|
| Availability | % of successful responses | (2xx + 3xx) / total at LB | ≥99.9% |
| Latency | p99 request duration | Histogram at LB | ≤200ms |

Part B: Set SLOs (10 minutes)

For each SLI, set an SLO:

| SLI | SLO Target | Error Budget (monthly) | Rationale |
|-----|------------|------------------------|-----------|
| Availability | 99.9% | 43.2 minutes | Users expect high availability |
| Latency p99 | 200ms | 0.1% of requests | UX degrades above 200ms |

Part C: Calculate Current Status (10 minutes)

Using real data from your service (or the sample data below):

Sample data for this month:

  • Total requests: 5,000,000
  • Failed requests (5xx): 3,500
  • Requests over 200ms: 6,000
  • Downtime: 15 minutes

Calculate:

  1. Current availability: ____%
  2. Current latency compliance: ____%
  3. Error budget consumed (availability): ____ minutes
  4. Error budget remaining: ____ minutes
  5. Current policy level: ____

Part D: Create Improvement Plan (10 minutes)

Based on your calculations:

  1. Which SLI needs the most attention?
  2. What would you investigate first?
  3. What’s one action that would improve it?
| Priority | Issue | Proposed Action | Expected Impact |
|----------|-------|-----------------|-----------------|
| 1 | | | |
| 2 | | | |

Success Criteria:

  • At least 3 SLIs defined
  • SLOs set with rationale
  • Current status calculated correctly
  • Improvement plan with prioritized actions

Sample Answers:

Check your calculations

Using the sample data:

  1. Availability: (5,000,000 - 3,500) / 5,000,000 = 99.93%
  2. Latency compliance: (5,000,000 - 6,000) / 5,000,000 = 99.88%
  3. Error budget consumed: 15 minutes
  4. Error budget remaining: 43.2 - 15 = 28.2 minutes (65% remaining)
  5. Policy level: Yellow (50-75% remaining) - Normal operations but increased monitoring

Analysis:

  • Latency (99.88%) is below SLO (99.9%)—needs attention
  • Availability (99.93%) is meeting SLO (99.9%)—healthy
  • Focus on latency improvements
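
To double-check your own numbers, a short Python sketch reproducing the sample-data calculations:

total_requests = 5_000_000
failed_5xx = 3_500
slow_requests = 6_000            # over 200ms
downtime_min = 15.0

budget_min = 30 * 24 * 60 * (1 - 0.999)                     # 43.2 minutes
availability = (total_requests - failed_5xx) / total_requests
latency_compliance = (total_requests - slow_requests) / total_requests
remaining = budget_min - downtime_min

print(f"Availability:       {availability:.2%}")            # 99.93%
print(f"Latency compliance: {latency_compliance:.2%}")      # 99.88%
print(f"Budget remaining:   {remaining:.1f} min "
      f"({remaining / budget_min:.0%})")                    # 28.2 min (65%)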

MEASURING AND IMPROVING RELIABILITY - WHAT TO REMEMBER
═══════════════════════════════════════════════════════════════════════════════
THE CORE FRAMEWORK
────────────────────────────────────────────────────────────────
SLI → SLO → SLA (measurement → internal target → external contract)
SLI: "We measured 99.85% availability" ← The fact
SLO: "We target 99.9% availability" ← The internal goal
SLA: "We promise 99.5% availability" ← The contract
SLA ≤ SLO, with the current SLI kept above the SLA at all times
(the gap between promise and target is your buffer)
THE ERROR BUDGET FORMULA
────────────────────────────────────────────────────────────────
Error Budget = 100% - SLO
For 99.9% SLO:
Budget = 100% - 99.9% = 0.1%
Monthly = 30 × 24 × 60 × 0.001 = 43.2 minutes
Budget > 75%: Ship fast
Budget 50-75%: Normal pace
Budget 25-50%: Slow down
Budget < 25%: Feature freeze
Budget = 0%: All hands on reliability
THE FOUR GOLDEN SIGNALS
────────────────────────────────────────────────────────────────
1. LATENCY - How long requests take
2. TRAFFIC - How much demand
3. ERRORS - Rate of failures
4. SATURATION - How "full" the system is
These four capture most user-visible problems.
GOOD SLI CHARACTERISTICS
────────────────────────────────────────────────────────────────
[ ] User-centric (what users experience, not system internals)
[ ] Measurable (you can actually collect the data)
[ ] Proportional (worse SLI = worse experience)
[ ] Actionable (your team can influence it)
[ ] Edge-measured (where users connect)
✗ "CPU is at 80%" → Not user-centric
✓ "p99 latency is 200ms" → User-centric
THE RELIABILITY IMPROVEMENT CYCLE
────────────────────────────────────────────────────────────────
MEASURE → ANALYZE → PRIORITIZE → IMPROVE → REPEAT

MEASURE    ── Track SLIs and error budget
ANALYZE    ── Why are we missing SLO?
PRIORITIZE ── What has the most impact?
IMPROVE    ── Fix the issue
BLAMELESS POSTMORTEMS
────────────────────────────────────────────────────────────────
NOT: "Who made this mistake?"
YES: "What conditions allowed this to happen?"
The question isn't WHO failed.
The question is WHY the system made failure easy.
Key behaviors:
1. Facts first, judgment later
2. Assume good intent
3. Fix systems, not people
4. Follow up on action items
5. Share learnings widely
ERROR BUDGET POLICY EXAMPLE
────────────────────────────────────────────────────────────────
| Budget | Color | Policy |
|--------|--------|---------------------------|
| >75% | Green | Ship fast, experiment |
| 50-75% | Yellow | Normal pace, monitor |
| 25-50% | Orange | Only critical releases |
| <25% | Red | Feature freeze |
| 0% | Black | War room until recovery |
THE KEY INSIGHT
────────────────────────────────────────────────────────────────
Error budgets resolve the eternal conflict:
BEFORE: "Move fast!" vs "Be reliable!" → Politics wins
AFTER: Budget healthy? → Ship features
Budget depleted? → Fix reliability
Data-driven decisions. No arguments needed.
COMMON MISTAKES TO AVOID
────────────────────────────────────────────────────────────────
🚩 Too many SLIs (>5 per service)
🚩 SLO = current performance (no improvement room)
🚩 Measuring internally instead of at edge
🚩 Ignoring error budget in decisions
🚩 Blaming individuals in postmortems
🚩 Never reviewing/adjusting SLOs
🚩 Consistently over-achieving (might be over-investing)
THE BOTTOM LINE
────────────────────────────────────────────────────────────────
"Hope is not a strategy."
Reliability without measurement is wishful thinking.
With SLIs, SLOs, and error budgets, it's engineering.

  • “Site Reliability Engineering” - Google. Chapters 4 (SLOs), 5 (Error Budgets), and 15 (Postmortems). The foundational text that introduced these concepts to the industry.

  • “The Art of SLOs” - Workshop materials from Google. Practical guidance on implementing SLOs, with templates and examples.

  • “Implementing Service Level Objectives” - Alex Hidalgo. The comprehensive book on SLO implementation, from theory to practice.

  • “The Site Reliability Workbook” - Google. Chapter on “Implementing SLOs” has practical worksheets for defining SLIs and SLOs.

  • Datadog Blog: “SLOs in Practice” - Real-world examples of SLO implementation at various companies.

  • Honeycomb.io Blog - Excellent posts on observability-driven SLOs and why traditional metrics often fail.


Congratulations! You’ve completed the Reliability Engineering foundation. You now understand:

  • What reliability means and how to measure it
  • How systems fail and how to design for failure
  • Redundancy patterns for fault tolerance
  • SLIs, SLOs, and error budgets for continuous improvement

Where to go from here:

| Your Interest | Next Track |
|---------------|------------|
| Understanding what's happening | Observability Theory |
| Operating reliable systems | SRE Discipline |
| Building secure systems | Security Principles |
| Distributed system challenges | Distributed Systems |

| Module | Key Takeaway |
|--------|--------------|
| 2.1 | Reliability is measurable; each nine is 10x harder |
| 2.2 | Predict failure modes with FMEA; design degradation paths |
| 2.3 | Redundancy enables survival; but test your failover |
| 2.4 | SLIs measure, SLOs target, error budgets enable decisions |

“Hope is not a strategy. Measure reliability, set targets, and engineer toward them.”