Module 2.1: What is Reliability?

Цей контент ще не доступний вашою мовою.

Complexity: [MEDIUM]

Time to Complete: 35-40 minutes

Prerequisites: Systems Thinking Track (recommended)

Track: Foundations

What You’ll Be Able to Do

After completing this module, you will be able to:

Explain the difference between availability, durability, and reliability and why each requires distinct engineering strategies
Evaluate a system’s reliability posture by identifying single points of failure and hidden dependencies
Design reliability requirements for a service by mapping user expectations to concrete engineering constraints
Compare reactive vs. proactive reliability approaches and justify when to invest in each

The Incident That Changed Everything

December 24th, 2012. Netflix headquarters.

The on-call engineer stares at her dashboard in disbelief. Christmas Eve, and AWS’s US-East region has just entered what would become one of its most infamous outages. Netflix’s entire streaming service—the thing that millions of families planned to use during their holiday gatherings—is going dark.

But something unexpected happens.

While other companies scramble, while their engineers panic and their executives field angry customer calls, Netflix’s systems are… recovering. Automatically. Within minutes, traffic is rerouting to other regions. The streaming continues. Most customers never notice.

The secret? Years of reliability engineering.

Netflix hadn’t hoped for the best. They’d engineered for the worst. They’d asked: “What happens when AWS fails?” And they’d built systems that could answer: “We keep streaming.”

THE TWO APPROACHES TO FAILURE
═══════════════════════════════════════════════════════════════════════════════

COMPANY A: "Hope It Doesn't Break"
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│    "Our cloud provider    →   AWS outage    →   COMPLETE OUTAGE            │
│     has 99.99% uptime"        happens           "Why didn't anyone          │
│                                                  tell us this could          │
│                                                  happen?!"                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

COMPANY B: Netflix's Approach - "Engineer for Failure"
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│    "What if AWS     →   Built multi-region   →   AWS outage    →   Auto-   │
│     fails?"             fallback systems          happens          recovery │
│                                                                             │
│    "What if a       →   Chaos Monkey tests   →   Failure          Known,   │
│     server dies?"       this constantly          occurs    →      handled  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

This is the difference between hoping and engineering. Between luck and reliability.

Why This Module Matters

Your users don’t care about your architecture. They don’t care about your tech stack. They don’t care that you’re using Kubernetes or that your microservices are beautifully decoupled. They care about exactly one thing:

Does it work when I need it?

That question seems simple. But answering it systematically—measuring it, designing for it, trading off against other goals—that’s the discipline of reliability engineering.

Consider what “99% reliable” actually means:

THE BRUTAL MATH OF UNRELIABILITY
═══════════════════════════════════════════════════════════════════════════════

"99% reliable" sounds great until you do the math:

Per Year:      99% uptime = 3.65 DAYS of downtime
Per Month:                = 7.3 HOURS of downtime
Per Week:                 = 1.7 HOURS of downtime
Per Day:                  = 14.4 MINUTES of downtime

For a payment processor handling $1M/hour:
─────────────────────────────────────────────────────────────────────────────
7.3 hours/month × $1,000,000/hour = $7.3 MILLION in lost revenue PER MONTH

For an emergency services dispatcher:
─────────────────────────────────────────────────────────────────────────────
14 minutes per day × 365 = 85 HOURS per year of "911 not working"

For a trading platform:
─────────────────────────────────────────────────────────────────────────────
Each minute of downtime during market hours can cost millions in trades

This module teaches you to think about reliability systematically—not as “we hope it doesn’t break” but as an engineering discipline with clear metrics, trade-offs, and design principles.

The Bridge Analogy

Civil engineers don’t say “we hope this bridge doesn’t collapse.” They calculate loads, specify materials, add safety factors, and design for specific failure scenarios. They know exactly what wind speed will cause problems, what weight the bridge can bear, and what happens if a cable snaps.

Software reliability engineering applies the same rigor to systems: understand the failure modes, design for them, measure the results. The question isn’t “will it fail?” but “how will it fail, and what happens when it does?”

What You’ll Learn

How to define and measure reliability precisely
The relationship between availability, reliability, and durability
MTBF, MTTR, and other reliability metrics that actually matter
Why “five nines” is exponentially harder than “three nines”
The reliability vs. velocity trade-off and how to navigate it
How to calculate and use error budgets

Part 1: Defining Reliability

1.1 What Does “Reliable” Mean?

Here’s a conversation that happens in every engineering organization:

THE VAGUENESS PROBLEM
═══════════════════════════════════════════════════════════════════════════════

Product Manager: "Is the payment system reliable?"

Engineer 1: "Yes, we haven't had any outages this month."

Engineer 2: "Well, we had 500 failed transactions yesterday."

Engineer 1: "But that's only 0.01% of transactions!"

Engineer 2: "That's 500 customers who couldn't buy things."

Product Manager: "So... is it reliable or not?"

Everyone: "..."

The problem? “Reliable” means different things to different people. We need precision.

Reliability is the probability that a system performs its intended function for a specified period under stated conditions.

Three components must be defined:

THE THREE PILLARS OF A RELIABILITY DEFINITION
═══════════════════════════════════════════════════════════════════════════════

                    ┌─────────────────────────────────────┐
                    │        RELIABILITY STATEMENT        │
                    └─────────────────────────────────────┘
                                     │
           ┌─────────────────────────┼─────────────────────────┐
           │                         │                         │
           ▼                         ▼                         ▼
    ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
    │  INTENDED   │          │  SPECIFIED  │          │   STATED    │
    │  FUNCTION   │          │   PERIOD    │          │ CONDITIONS  │
    └─────────────┘          └─────────────┘          └─────────────┘
    What should it           For how long?            Under what
    do exactly?                                       circumstances?

    Examples:                Examples:                Examples:
    • Process payment        • 24/7 continuous        • Normal load
    • Return result          • During business        • ≤1000 TPS
    • Complete in <2s          hours only             • Valid requests
    • Return accurate        • For 30 days            • Network available
      data                     minimum

Vague vs. Precise Reliability Statements:

Vague	Precise
”The system is reliable"	"The system successfully processes 99.9% of valid requests within 2 seconds"
"High availability"	"Available 99.95% of the time, measured monthly"
"Data is safe"	"99.999999999% of objects stored survive over 10 years"
"Fast enough"	"95th percentile latency under 200ms”

COMPLETE RELIABILITY DEFINITION EXAMPLE
═══════════════════════════════════════════════════════════════════════════════

❌ Vague: "The payment system is reliable"

✅ Precise: "The payment system successfully processes 99.9% of
           transactions within 2 seconds, under normal load (up to
           1000 TPS), 24/7"

Breaking it down:
┌─────────────────────────────────────────────────────────────────────────────┐
│ INTENDED FUNCTION                                                           │
│ ─────────────────                                                           │
│ • Process transactions                                                      │
│ • Successfully (not just accept, but complete)                              │
│ • Within 2 seconds (latency requirement)                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│ SPECIFIED PERIOD                                                            │
│ ─────────────────                                                           │
│ • 24/7 (continuous operation)                                               │
│ • No planned maintenance windows excluded                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│ STATED CONDITIONS                                                           │
│ ─────────────────                                                           │
│ • Normal load: ≤1000 transactions per second                                │
│ • Assumes valid transactions (garbage in = garbage out is ok)               │
│ • Assumes network connectivity to payment processor                          │
└─────────────────────────────────────────────────────────────────────────────┘

1.2 Reliability vs. Availability vs. Durability

These three terms get confused constantly. Even experienced engineers mix them up. Let’s fix that forever.

Concept	Question It Answers	Measures	Example
Reliability	”When I use it, does it work?”	Success rate	”99.9% of requests succeed”
Availability	”Can I use it right now?”	Uptime	”System is up 99.99% of time”
Durability	”Will my data still exist tomorrow?”	Data retention	”99.999999999% of data preserved”

THE CRITICAL DISTINCTION
═══════════════════════════════════════════════════════════════════════════════

                            ┌─────────────────┐
                            │   USER REQUEST  │
                            └────────┬────────┘
                                     │
                                     ▼
                        ┌────────────────────────┐
                        │  Can I reach the       │
                        │  system at all?        │
                        └───────────┬────────────┘
                                    │
                    ┌───────────────┴───────────────┐
                    │                               │
                    ▼                               ▼
              ┌─────────┐                     ┌─────────┐
              │   NO    │                     │   YES   │
              └────┬────┘                     └────┬────┘
                   │                               │
                   ▼                               ▼
           AVAILABILITY                  ┌────────────────────┐
              FAILURE                    │  Does it work      │
           (system down)                 │  correctly?        │
                                         └──────────┬─────────┘
                                                    │
                                    ┌───────────────┴───────────┐
                                    │                           │
                                    ▼                           ▼
                              ┌─────────┐                 ┌─────────┐
                              │   NO    │                 │   YES   │
                              └────┬────┘                 └────┬────┘
                                   │                           │
                                   ▼                           ▼
                           RELIABILITY                     SUCCESS!
                              FAILURE                    (this is what
                           (errors, bugs)                users want)

The Four Combinations:

RELIABILITY vs AVAILABILITY MATRIX
═══════════════════════════════════════════════════════════════════════════════

                              RELIABILITY
                        Low                High
                   ┌────────────────┬────────────────┐
                   │                │                │
             High  │  UNRELIABLE    │    IDEAL       │
                   │  UPTIME        │                │
    AVAILABILITY   │  "It's always  │  "Works great  │
                   │   up but half  │   all the      │
                   │   my requests  │   time"        │
                   │   fail"        │                │
                   ├────────────────┼────────────────┤
                   │                │                │
              Low  │   WORST        │  FLAKY         │
                   │                │  UPTIME        │
                   │  "Can't reach  │  "When it's    │
                   │   it, and when │   up, it's     │
                   │   I can, it    │   perfect,     │
                   │   fails"       │   but..."      │
                   │                │                │
                   └────────────────┴────────────────┘

REAL-WORLD EXAMPLES:
────────────────────────────────────────────────────────────────────────────────
High Availability + Low Reliability:
  API always responds, but 5% of responses are errors

Low Availability + High Reliability:
  Mainframe with weekly maintenance windows—when up, zero errors

High Both: Netflix in 2023—always available, almost always works

Low Both: That internal tool nobody maintains—down half the time,
         buggy the other half

Durability: The Third Dimension

Durability is different—it’s about data persistence over time, not requests.

DURABILITY EXPLAINED
═══════════════════════════════════════════════════════════════════════════════

Scenario: You upload a photo to cloud storage

AVAILABILITY question: "Can I access my photo right now?"
  • If cloud storage is down, you can't access it
  • But the photo still exists!

DURABILITY question: "Will my photo exist in 10 years?"
  • Even if storage is up, corruption could destroy it
  • Durability = probability of data survival

THE MATH OF 11 NINES (Amazon S3's durability guarantee):
─────────────────────────────────────────────────────────────────────────────
99.999999999% durability

If you store 10,000,000 objects:
• Expected losses per year: 10M × 0.000000001 = 0.01 objects
• Expected losses per century: ~1 object

That's ~10,000 years between losing a single object.

How they achieve it:
• Multiple copies across multiple data centers
• Automatic integrity checking
• Self-healing when corruption detected
• Geographic distribution

Did You Know?

Amazon S3’s famous “11 nines” durability (99.999999999%) means if you store 10 million objects, you’d statistically expect to lose one every 10,000 years. That’s not availability—S3 can be temporarily unavailable while still being durable. Your data is safe; you just can’t access it right now.

This distinction matters enormously. When S3 has an “outage,” your files aren’t being deleted—they’re just temporarily inaccessible. Durability and availability are independent properties.

1.3 The User’s Perspective

From your user’s perspective, reliability is simple: Did it work?

Users don’t care about our careful distinctions between availability, reliability, and durability. They care about outcomes.

USER EXPERIENCE OF RELIABILITY: THE CHECKOUT JOURNEY
═══════════════════════════════════════════════════════════════════════════════

User tries to buy something:

SCENARIO A: Availability Failure
┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   User clicks "Checkout"                                                     │
│           │                                                                  │
│           ▼                                                                  │
│   ┌───────────────────────┐                                                  │
│   │  "This site can't be  │                                                  │
│   │  reached"             │    User thinks: "Their site is broken"          │
│   └───────────────────────┘                                                  │
│                                                                              │
│   Result: FAILED ❌                                                          │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

SCENARIO B: Reliability Failure
┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   User clicks "Checkout" → Site loads → User enters payment info            │
│           │                                                                  │
│           ▼                                                                  │
│   ┌───────────────────────┐                                                  │
│   │  "Error processing    │                                                  │
│   │  your payment"        │    User thinks: "Their site is broken"          │
│   └───────────────────────┘                                                  │
│                                                                              │
│   Result: FAILED ❌                                                          │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

SCENARIO C: Durability Failure
┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   User clicks "Checkout" → Site loads → Payment works → Success page!       │
│           │                                                                  │
│           ▼                                                                  │
│   ... 2 days later ...                                                       │
│   "Where's my order? No confirmation email. No order in history."            │
│                                                                              │
│   User thinks: "Their site is broken"                                        │
│                                                                              │
│   Result: FAILED ❌                                                          │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

TO THE USER, ALL THREE FAILURES ARE IDENTICAL: "IT DIDN'T WORK"

This is why we need to measure all three dimensions—because any failure mode leads to the same user outcome.

Try This (2 minutes)

Think about an app you use daily. Recall a time it failed you. Was it:

An availability failure (couldn’t connect)?

A reliability failure (connected but got an error)?

A durability failure (your data was lost)?

Understanding the failure type helps identify the fix.

Bonus: Think about how you, as a user, responded. Did you retry? Give up? Switch to a competitor? That’s the business cost of unreliability.

Part 2: Measuring Reliability

2.1 The Nines

When engineers talk about reliability, they talk about “nines.” You’ll hear phrases like “we need five nines” or “we’re only at three nines.” What does this mean?

The “nines” are a shorthand for the number of 9s in the reliability percentage:

Nines	Percentage	Error Rate	Downtime/Year	Downtime/Month	Downtime/Day
One nine	90%	10%	36.5 days	3 days	2.4 hours
Two nines	99%	1%	3.65 days	7.3 hours	14 minutes
Three nines	99.9%	0.1%	8.76 hours	43.8 minutes	1.4 minutes
Four nines	99.99%	0.01%	52.6 minutes	4.4 minutes	8.6 seconds
Five nines	99.999%	0.001%	5.26 minutes	26.3 seconds	0.86 seconds
Six nines	99.9999%	0.0001%	31.5 seconds	2.6 seconds	86 ms

VISUALIZING THE NINES
═══════════════════════════════════════════════════════════════════════════════

If you handle 1,000,000 requests per day:

90% (1 nine)     ████████████████████████████████████░░░░░░░░
                 100,000 failures per day
                 "Basically broken"

99% (2 nines)    ████████████████████████████████████████░░░░
                 10,000 failures per day
                 "Unreliable, users complaining constantly"

99.9% (3 nines)  █████████████████████████████████████████░░░
                 1,000 failures per day
                 "Okay for internal tools, not great for customers"

99.99% (4 nines) ██████████████████████████████████████████░░
                 100 failures per day
                 "Good for most consumer services"

99.999% (5 nines)███████████████████████████████████████████░
                 10 failures per day
                 "Enterprise-grade, mission-critical"

99.9999%(6 nines)████████████████████████████████████████████
                 1 failure per day
                 "Financial trading, emergency services"

The Exponential Cost of Each Nine:

Each additional nine is approximately 10x harder and more expensive to achieve. Here’s why:

THE EXPONENTIAL COST OF NINES
═══════════════════════════════════════════════════════════════════════════════

NINES    REQUIREMENTS                                  APPROXIMATE COST FACTOR
─────────────────────────────────────────────────────────────────────────────────

99%      • Basic monitoring                                      $
         • Some automation
         • Single data center

99.9%    • Good monitoring                                      $$
         • Automated failover
         • Redundant components
         • On-call rotation

99.99%   • Sophisticated monitoring                            $$$$
         • Multi-zone deployment
         • Automatic scaling
         • Fast incident response
         • Chaos engineering
         • 24/7 operations

99.999%  • Multi-region deployment                          $$$$$$$$
         • Global load balancing
         • Zero-downtime deployments
         • Full automation
         • Dedicated reliability teams
         • Extensive testing infrastructure

WHY IT'S EXPONENTIAL:
─────────────────────────────────────────────────────────────────────────────────
Going from 99% to 99.9%:  Remove 90% of remaining failures
Going from 99.9% to 99.99%: Remove 90% of remaining failures AGAIN
Going from 99.99% to 99.999%: And again...

The easy failures are fixed first. Each level requires fixing
progressively harder, rarer, weirder problems.

At 99.99%, you're fixing bugs that only happen during full moons
(or at least, that's what it feels like).

2.2 Key Reliability Metrics

Beyond “nines,” there are four critical metrics every reliability engineer must understand:

THE FOUR FUNDAMENTAL RELIABILITY METRICS
═══════════════════════════════════════════════════════════════════════════════

        MTBF                          MTTR
   (Mean Time Between              (Mean Time To
      Failures)                      Recovery)
        │                              │
        │  How long between            │  How long to fix
        │  failures?                   │  each failure?
        │                              │
        ▼                              ▼
    ┌───────┐    ┌───────┐    ┌───────┐    ┌───────┐    ┌───────┐
    │ WORKS │    │ DOWN  │    │ WORKS │    │ DOWN  │    │ WORKS │
    └───────┘    └───────┘    └───────┘    └───────┘    └───────┘
    │◄──────────▶│◄───▶│◄─────────────▶│◄───▶│
         MTBF      MTTR      MTBF        MTTR

                      MTTF                        MTTD
                 (Mean Time To                (Mean Time To
                   Failure)                    Detect)
                 For non-repairable           Time from failure
                 components                   occurring to
                 (how long until              detecting it
                 it dies?)

MTBF - Mean Time Between Failures

MTBF measures how frequently failures occur. Higher MTBF = fewer failures = better.

MTBF CALCULATION
═══════════════════════════════════════════════════════════════════════════════

Formula: MTBF = Total Operating Time / Number of Failures

Example Scenario:
─────────────────────────────────────────────────────────────────────────────
Timeline over 1000 hours:

Hour 0        Hour 200      Hour 450      Hour 800      Hour 1000
  │              │             │             │             │
  ▼              ▼             ▼             ▼             ▼
  START        FAIL #1      FAIL #2      FAIL #3        END
              (15 min)     (30 min)     (45 min)

Calculation:
• Total operating time: 1000 hours
• Number of failures: 3
• MTBF = 1000 / 3 = 333.3 hours

Interpretation: On average, expect a failure every 333 hours (~14 days)

MTTR - Mean Time To Recovery

MTTR measures how quickly you recover from failures. Lower MTTR = faster recovery = better.

MTTR BREAKDOWN
═══════════════════════════════════════════════════════════════════════════════

MTTR is actually composed of multiple phases:

   ┌─────────────────────────────────────────────────────────────────────────┐
   │                            TOTAL MTTR                                   │
   └─────────────────────────────────────────────────────────────────────────┘
                                    │
         ┌──────────────────────────┼──────────────────────────┐
         │                          │                          │
         ▼                          ▼                          ▼
   ┌──────────┐             ┌──────────────┐            ┌─────────────┐
   │  MTTD    │             │  MTTI        │            │  MTT-Fix    │
   │  (Time   │             │  (Time to    │            │  (Time to   │
   │  to      │             │  Investigate)│            │  actually   │
   │  Detect) │             │              │            │  fix)       │
   └──────────┘             └──────────────┘            └─────────────┘

Example:
─────────────────────────────────────────────────────────────────────────────
Failure occurs at 2:00 AM
• 2:00 AM - 2:15 AM: No one knows (MTTD = 15 min)
• 2:15 AM - 2:45 AM: Investigating "what's wrong" (MTTI = 30 min)
• 2:45 AM - 3:15 AM: Fixing the issue (MTT-Fix = 30 min)
• Total MTTR = 75 minutes

To reduce MTTR, attack each phase:
• Reduce MTTD: Better monitoring, alerting
• Reduce MTTI: Better dashboards, runbooks, observability
• Reduce MTT-Fix: Automation, rollback capabilities, simpler systems

Calculating Availability from MTBF and MTTR:

THE AVAILABILITY FORMULA
═══════════════════════════════════════════════════════════════════════════════

              MTBF
Availability = ────────────
              MTBF + MTTR


Example:
─────────────────────────────────────────────────────────────────────────────
• MTBF = 250 hours (failure every ~10 days)
• MTTR = 2 hours (takes 2 hours to fix each failure)

Availability = 250 / (250 + 2) = 250/252 = 99.2%

What this means:
• System is working: 99.2% of the time
• System is down: 0.8% of the time
• Per month downtime: 0.8% × 720 hours = 5.76 hours

The MTTR Revelation

Here’s a secret that many teams miss: reducing MTTR is often easier and more effective than reducing MTBF.

Consider two approaches to improving from 99% to 99.9%:

Approach Starting Point Target Improvement Needed
Increase MTBF MTBF=100h, MTTR=1h MTBF=1000h 10x fewer failures
Decrease MTTR MTBF=100h, MTTR=1h MTTR=6min 10x faster recovery

Preventing all failures is hard—you’re fighting complexity, unknown unknowns, and Murphy’s Law. But recovering faster? You can invest in:

Better monitoring (detect failures immediately)

Automated rollbacks (fix some failures in seconds)

Runbooks and training (human recovery is faster)

Simpler architectures (easier to debug)

The best reliability engineers optimize both—but they often get more ROI from MTTR.

Approach	Starting Point	Target	Improvement Needed
Increase MTBF	MTBF=100h, MTTR=1h	MTBF=1000h	10x fewer failures
Decrease MTTR	MTBF=100h, MTTR=1h	MTTR=6min	10x faster recovery

2.3 Error Budgets

Here’s a concept that changed how the industry thinks about reliability: the error budget.

An error budget is the acceptable amount of unreliability—the difference between 100% and your reliability target.

ERROR BUDGET: THE CONCEPT
═══════════════════════════════════════════════════════════════════════════════

If your target is 99.9% reliability...

                    100% - 99.9% = 0.1% error budget

That 0.1% is not a failure—it's PERMISSION TO FAIL within limits.

In a month (30 days = 43,200 minutes):
─────────────────────────────────────────────────────────────────────────────
Error budget = 43,200 × 0.001 = 43.2 minutes

You can "spend" 43.2 minutes on:
• Downtime from incidents
• Failed requests
• Deployments that break things temporarily
• Experiments that don't work out

BUDGET VISUALIZATION (Example: Month in Progress)
─────────────────────────────────────────────────────────────────────────────
Week 1:  [████████░░]  Incident: 5 min     Budget: 38.2 min remaining
Week 2:  [███████░░░]  Incident: 12 min    Budget: 26.2 min remaining
Week 3:  [██████░░░░]  Deployment: 3 min   Budget: 23.2 min remaining
Week 4:  [█████░░░░░]  Healthy!            Budget: 23.2 min remaining

Status: ✅ HEALTHY - Room for one more risky deployment this month

Why Error Budgets Are Revolutionary:

Before error budgets, reliability conversations went like this:

Developers: “Let’s ship fast!”
Ops: “No, we need stability!”
Result: Endless conflict, arbitrary decisions

Error budgets change the conversation:

ERROR BUDGETS CREATE ALIGNMENT
═══════════════════════════════════════════════════════════════════════════════

OLD CONVERSATION (Conflict)
─────────────────────────────────────────────────────────────────────────────
Developer: "Let's deploy the new feature!"
Ops: "Too risky, we just had an incident."
Developer: "When CAN we ship then?"
Ops: "When I feel comfortable."
Developer: "That's not a real answer!"
→ RESULT: Arguments, politics, frustration

NEW CONVERSATION (Data-Driven)
─────────────────────────────────────────────────────────────────────────────
Developer: "Let's deploy the new feature!"
Ops: "What's our error budget status?"
Developer: "We have 28 minutes remaining this month."
Ops: "Last similar deployment caused 5 minutes of issues."
Developer: "So we can afford this. Let's go."
→ RESULT: Shared decision based on data

Error Budget Policy:

Budget Status	What It Means	Team Response
>50% remaining	Plenty of room	Ship features, take calculated risks, experiment
25-50% remaining	Getting tight	Continue carefully, reduce risk per deployment
<25% remaining	Warning zone	Slow down, extra testing, avoid risky changes
0% or negative	Reliability crisis	Freeze features, all hands on reliability

ERROR BUDGET POLICY IN ACTION
═══════════════════════════════════════════════════════════════════════════════

SCENARIO: Team wants to deploy major new feature

Budget Status Check:
┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   This Month's Error Budget: 43.2 minutes                                    │
│   Used so far: 35.8 minutes                                                  │
│   Remaining: 7.4 minutes (17%)                                               │
│                                                                              │
│   [███████████████████████████████████░░░░░░░░░░] 83% used                   │
│                                                                              │
│   Status: ⚠️  WARNING ZONE                                                    │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Decision: Hold the risky feature until next month. Instead:
• Deploy smaller, safer changes
• Focus on reliability improvements
• Bank error budget for next month

This isn't "ops blocking devs"—it's the TEAM deciding based on shared data.

Try This (3 minutes)

Your service has 99.5% reliability target. Calculate your monthly error budget:

What percentage is your error budget? (100% - 99.5% = ?)

How many minutes per month? (43,200 × budget = ?)

If you’ve had 3 incidents of 30 minutes each, how much budget remains?

Based on the policy above, what should your team do?

Part 3: The Reliability Trade-offs

3.1 Reliability vs. Velocity

Here’s the uncomfortable truth every engineering organization faces: you can’t have maximum reliability AND maximum velocity. There’s a fundamental tension.

THE RELIABILITY-VELOCITY SPECTRUM
═══════════════════════════════════════════════════════════════════════════════

            MAXIMUM RELIABILITY              MAXIMUM VELOCITY
                    │                              │
                    │         YOUR CHOICE          │
                    │     ┌─────────────────┐      │
                    │ ◀───┤ WHERE ARE YOU? ├───▶  │
                    │     └─────────────────┘      │
                    ▼                              ▼
        ┌───────────────────────┐    ┌───────────────────────┐
        │ • More testing        │    │ • Less testing        │
        │ • Longer review cycles│    │ • Quick reviews       │
        │ • Canary deployments  │    │ • Deploy when ready   │
        │ • Conservative changes│    │ • Big bang changes    │
        │ • Expensive infra     │    │ • Cheap infra         │
        │ • 24/7 on-call        │    │ • "We'll fix in prod" │
        └───────────────────────┘    └───────────────────────┘

EXAMPLES ON THE SPECTRUM:
─────────────────────────────────────────────────────────────────────────────
◀─────────────────────────────────────────────────────────────────────────▶
Medical    Banks    Airlines    E-commerce    Internal    Startup    Hackathon
Device                                        Tools       MVP        Project
│          │        │           │             │           │          │
"Zero      "Must    "Regulated  "Revenue     "Annoying   "Speed     "Ship
tolerance" be       but needs   depends      but not     is         something
           trust-   innovation" on uptime"   critical"   survival"  by 5pm"
           worthy"

Why you can’t fully optimize both:

THE PHYSICS OF THE TRADE-OFF
═══════════════════════════════════════════════════════════════════════════════

1. EVERY DEPLOYMENT IS A RISK
   ─────────────────────────────────────────────────────────────────────────
   More deploys = More chances for bugs to slip through
   Each deploy could be "the one" that breaks production

   Fast velocity: Deploy 10 times/day = 10 opportunities for failure
   High reliability: Deploy 1 time/week = Fewer opportunities for failure

2. EVERY FEATURE ADDS COMPLEXITY
   ─────────────────────────────────────────────────────────────────────────
   More features = More code paths = More failure modes

   Features: 100 → Surface area for bugs: 100 potential failure points
   Features: 1000 → Surface area for bugs: 1000 potential failure points

3. EVERY TEST ADDS TIME
   ─────────────────────────────────────────────────────────────────────────
   More testing = More confidence but slower delivery

   Quick path: 5 min test suite = Fast, but might miss edge cases
   Thorough path: 2 hour test suite = Slow, but catches more bugs

4. REDUNDANCY COSTS MONEY
   ─────────────────────────────────────────────────────────────────────────
   More reliability = More servers, regions, engineers

   Single region: $10k/month, single point of failure
   Multi-region: $50k/month, survives region failure

3.2 Context Determines Trade-offs

The right trade-off depends entirely on what you’re building:

System Type	Reliability Target	Velocity Target	Why This Balance
Pacemaker firmware	99.9999%+	Releases per year	Lives at stake - One bug could kill
Banking core	99.99%	Releases per quarter	Money and trust - Errors lose customers
Flight booking	99.95%	Releases per week	Revenue impact - Downtime = lost sales
E-commerce	99.9%	Daily releases	Balance - Fast features, acceptable risk
Internal tool	99%	Multiple per day	Low stakes - If it breaks, devs complain
Prototype	”Runs sometimes”	Constantly	Learning - Reliability is future problem

3.3 The 100% Reliability Myth

Let’s talk about something engineers need to accept: 100% reliability is impossible.

Not “very expensive.” Not “only for Google.” Physically impossible.

WHY 100% RELIABILITY CANNOT EXIST
═══════════════════════════════════════════════════════════════════════════════

THE PHYSICS PROBLEM
─────────────────────────────────────────────────────────────────────────────
• Hardware fails randomly (transistors wear out, solder cracks)
• Cosmic rays flip bits in memory (really! ~1 error per GB per month)
• Power grids have outages (transformers explode, trees fall)
• Datacenter cooling fails (air conditioners break)

THE DEPENDENCY PROBLEM
─────────────────────────────────────────────────────────────────────────────
• DNS can fail (Dyn DDoS attack 2016 took down half the internet)
• Cloud providers have outages (AWS has had 6+ major outages)
• Certificate authorities get compromised (DigiNotar 2011)
• BGP gets hijacked (routing your traffic to wrong places)

THE HUMAN PROBLEM
─────────────────────────────────────────────────────────────────────────────
• Engineers make mistakes (fat-finger deploy to production)
• On-call gets sick (at 3 AM, during a snowstorm)
• Documentation becomes outdated (runbook says "click button" but UI changed)
• Knowledge leaves (the one person who understood that system quit)

THE ECONOMICS PROBLEM
─────────────────────────────────────────────────────────────────────────────
             COST
               ▲
               │                                              *
               │                                           *
               │                                        *
               │                                    *
               │                              *
               │                       *
               │               *
               │      *
               │ *
               └───────────────────────────────────────────────▶ RELIABILITY
                   90%    99%   99.9%  99.99%  99.999%   100%

Cost grows EXPONENTIALLY. At some point, another nine costs more than
it's worth. That breakeven point is different for every system.

Did You Know?

Google’s Chubby lock service intentionally introduces planned outages. Why? To ensure that dependent services don’t accidentally build assumptions about 100% availability.

If Chubby were “too reliable,” services would build implicit dependencies on it always being there. Then when Chubby eventually had an unplanned outage, those services would fail catastrophically—they never tested for Chubby being down.

By being deliberately unreliable (within SLA), Chubby forces dependent services to handle its failures gracefully. Controlled unreliability builds resilience.

This is brilliant: your dependencies can be TOO reliable if it causes you to not handle their failure.

War Story: The 99.99% Promise

A SaaS company promised enterprise customers “99.99% availability” in contracts. Marketing loved the number. Sales closed deals with it. Nobody did the math.
THE MATH NOBODY DID
═════════════════════════════════════════════════════════════════════════════

Promised: 99.99% availability
Monthly budget: 43,200 minutes × 0.0001 = 4.32 minutes

Reality:
• Average incident detection time: 8 minutes
• Average investigation time: 15 minutes
• Average fix time: 22 minutes
• Total average MTTR: 45 minutes

ONE incident per month = SLA violation by 10x

Q1 Results:
─────────────────────────────────────────────────────────────────────────────
Month 1: 2 incidents, 87 minutes downtime (20x over budget)
Month 2: 1 incident, 52 minutes downtime (12x over budget)
Month 3: 3 incidents, 142 minutes downtime (33x over budget)

SLA credits owed: $2.4 million
Engineering budget for reliability: $800,000

The SLA credits alone cost 3x what it would have taken to fix the problems!
The fix wasn’t better reliability—it was honest expectations. They renegotiated to 99.9% (43 minutes/month), invested in faster incident response, and actually hit their targets.

Customers were happier with a realistic promise kept than an ambitious promise broken.

Lesson: Your SLA should match your operational capability, not your marketing aspirations.

Part 4: Reliability as a Practice

4.1 Reliability is Not a Feature

Here’s a common mistake: treating reliability as something you add later. “First we’ll build the features, then we’ll make it reliable.”

This is like saying “First we’ll build the house, then we’ll add the foundation.”

TWO APPROACHES TO RELIABILITY
═══════════════════════════════════════════════════════════════════════════════

APPROACH A: BOLT-ON RELIABILITY (Fragile)
─────────────────────────────────────────────────────────────────────────────

    Phase 1: "Ship features fast!"
    ┌─────────────────────────────────────────────────────────────────────┐
    │                        APPLICATION                                   │
    │  • No error handling ("happy path only")                            │
    │  • Direct database calls                                             │
    │  • Synchronous everything                                            │
    │  • No timeouts                                                       │
    │  • No monitoring                                                     │
    └─────────────────────────────────────────────────────────────────────┘

    Phase 2: "Oh no, it's breaking in production!"
    ┌─────────────────────────────────────────────────────────────────────┐
    │                        APPLICATION                                   │
    │  (original code, unchanged)                                         │
    └────────────────────────────────────┬────────────────────────────────┘
                                         │
                ┌────────────────────────┴─────────────────────────┐
                │                                                   │
                ▼                                                   ▼
    ┌───────────────────────┐                         ┌───────────────────────┐
    │    BOLT-ON RETRY      │                         │   BOLT-ON MONITORING  │
    │    LOGIC              │                         │                       │
    │    (doesn't fit well) │                         │   (can't see inside)  │
    └───────────────────────┘                         └───────────────────────┘

    Result: Fragile, expensive to maintain, still breaks in new ways


APPROACH B: BUILT-IN RELIABILITY (Robust)
─────────────────────────────────────────────────────────────────────────────

    ┌─────────────────────────────────────────────────────────────────────┐
    │                        APPLICATION                                   │
    │                                                                      │
    │  EVERY external call has:               Graceful degradation:       │
    │  ┌──────────────────┐                   ┌──────────────────┐        │
    │  │ • Timeout        │                   │ If payment fails: │        │
    │  │ • Retry w/backoff│                   │ → Queue for later │        │
    │  │ • Circuit breaker│                   │ → Show "pending"  │        │
    │  │ • Fallback       │                   │ → Notify user     │        │
    │  └──────────────────┘                   └──────────────────┘        │
    │                                                                      │
    │  Bulkheads between components:          Health checks:              │
    │  ┌──────────────────┐                   ┌──────────────────┐        │
    │  │ Search │ Checkout│                   │ • Readiness probe│        │
    │  │ can    │ can     │                   │ • Liveness probe │        │
    │  │ fail   │ fail    │                   │ • Dependency     │        │
    │  │ alone  │ alone   │                   │   checks         │        │
    │  └────────┴─────────┘                   └──────────────────┘        │
    │                                                                      │
    └─────────────────────────────────────────────────────────────────────┘

    Result: Handles failures gracefully, observable, maintainable

4.2 The Reliability Mindset

Reliability engineers think differently. They ask questions that others don’t:

THE FIVE QUESTIONS OF RELIABILITY
═══════════════════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│  1. WHAT CAN FAIL?                                                          │
│     ────────────────                                                        │
│     Answer: EVERYTHING. Make a list:                                        │
│     • Database: connection lost, slow queries, disk full                    │
│     • Network: timeout, partition, DNS failure                              │
│     • Third-party: API down, rate limited, changed response                 │
│     • Infrastructure: server crash, region outage, deployment               │
│     • Human: bad config, wrong button, missing step                         │
│                                                                             │
│  2. HOW WILL WE KNOW IT FAILED?                                             │
│     ────────────────────────────                                            │
│     • What metrics indicate failure?                                        │
│     • What logs capture the evidence?                                       │
│     • What alerts wake someone up?                                          │
│     • How do we distinguish "slow" from "broken"?                           │
│                                                                             │
│  3. HOW WILL WE RECOVER?                                                    │
│     ─────────────────────                                                   │
│     • Automated: failover, restart, rollback?                               │
│     • Manual: runbook, escalation path?                                     │
│     • How long will recovery take?                                          │
│     • What's the backup plan if Plan A fails?                               │
│                                                                             │
│  4. HOW DO WE PREVENT RECURRENCE?                                           │
│     ────────────────────────────────                                        │
│     • Post-incident review (blameless)                                      │
│     • Action items with owners and dates                                    │
│     • Tests that would catch this failure                                   │
│     • Architecture changes to eliminate the failure mode                    │
│                                                                             │
│  5. WHAT'S THE BLAST RADIUS?                                                │
│     ────────────────────────                                                │
│     • If this component fails, what else breaks?                            │
│     • Are there bulkheads to contain the damage?                            │
│     • Can we degrade gracefully instead of failing hard?                    │
│     • Who is affected: all users, some users, internal only?                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

4.3 Reliability Anti-patterns

These are the patterns that look reasonable but lead to unreliable systems:

Anti-pattern	Why It Seems Reasonable	Why It Fails	Better Approach
”It won’t fail"	"We’ve never seen it fail”	Past stability doesn’t guarantee future stability	Assume everything fails; design for it
”We’ll fix it in prod"	"We can move faster this way”	Each prod fix is an incident; users are your testers	Test failure modes before production
”More redundancy = more reliable"	"Two is better than one”	Redundancy adds complexity; can cause split-brain	Understand failure modes FIRST, then add appropriate redundancy
”Users will retry"	"Retries are simple”	Pushes reliability burden onto users; causes retry storms	Handle retries internally with backoff
”We tested it"	"Tests passed!”	Tests only check known scenarios; prod has unknowns	Test + monitor + chaos engineering
”The cloud handles it"	"AWS never fails”	Cloud providers fail; you’re responsible for your SLA	Design for cloud provider failures

Try This (3 minutes)

Pick one of your services. Ask yourself the five reliability questions:

What are the top 3 things that could fail?

How would you know if each one happened?

What would you do about each one?

If you can’t answer these quickly, that’s a reliability gap—and now you know where to focus.

Did You Know?

The term “reliability engineering” emerged from the U.S. military in the 1950s. Early missiles had a 60% failure rate—more than half of them failed! The military realized they needed systematic approaches to reliability, not just better components. The discipline they created eventually made it to software.
MTBF was originally measured in flight hours for aircraft. A plane with 10,000 hour MTBF meant you’d expect one failure per 10,000 hours of flight—not calendar time. This is why “flight hours” is still a common reliability metric in aviation.
The first software reliability model was created by John Musa at Bell Labs in 1975. He applied hardware reliability mathematics to software, founding the field of software reliability engineering. His insight: software bugs follow statistical patterns just like hardware failures.
Netflix pioneered the “Chaos Monkey” approach in 2010, randomly killing production instances to ensure their systems could handle failures. Why the name? Because it’s like having a monkey loose in your data center, randomly unplugging things. This evolved into chaos engineering—deliberately injecting failures to build confidence. If you haven’t tested a failure, you don’t know if you can survive it.
The Space Shuttle had five redundant computers running different software written by different teams. Why? A single bug could kill astronauts. The fifth computer ran entirely different software to protect against systematic bugs. This is the ultimate “defense in depth.”
Amazon’s internal SLA is stricter than external SLAs. They aim for internal services to be ~10x more reliable than what they promise customers. Why? Because customers experience the product of all internal reliabilities, not the average.

Common Mistakes

Mistake	What It Looks Like	Why It’s Wrong	How to Fix It
Measuring availability, not reliability	”We had 100% uptime!” but error rates are 5%	System is up but returning errors; users are still failing	Track success rate of user-facing operations, not just uptime
Ignoring partial failures	”It works” when search is broken but checkout works	Users experience service as degraded; you don’t know	Define and measure degraded states; create SLIs for each component
Setting unrealistic targets	”We need five nines” without doing the math	Team is demoralized; SLA credits drain budget	Start with user impact, work backward to a realistic target
Not tracking error budget	No visibility into how much reliability “spend” remains	Can’t make informed velocity vs. reliability trade-offs	Implement error budget tracking with clear policy
Optimizing one component	Made the database 10x faster but users still see errors	End-to-end reliability is what matters, not component	Measure reliability from user’s perspective (synthetic monitoring)
Treating MTTR as fixed	”Incidents take 45 minutes to fix, that’s just how it is”	Recovery time is controllable; you’re leaving improvements on the table	Invest in detection, runbooks, automation; measure MTTD separately
Confusing SLOs and SLAs	Internal target = customer promise	No buffer for unexpected issues; breaches become SLA violations	Set internal SLO tighter than external SLA
Not testing failure modes	”We have redundancy” but never tested failover	First test of redundancy is during a real incident	Regular game days; chaos engineering

Quiz

Test your understanding of reliability concepts:

1. A system has 99.9% availability and 95% reliability. What does this mean for users?

Answer

This means:

99.9% availability: The system is reachable 99.9% of the time (down only 8.76 hours/year)
95% reliability: When reachable, only 95% of requests succeed (5% fail)

Combined effect: 99.9% × 95% = 94.9% success rate

So even though the system seems to have “three nines availability,” users experience roughly two nines of actual reliability. About 5 out of every 100 requests fail, even though the system is “up.”

This is why measuring only availability is misleading!

2. Your company promises 99.99% availability in contracts. Calculate your monthly error budget in minutes. Then explain why promising this might be risky.

Answer

Calculation:

Monthly minutes: 30 × 24 × 60 = 43,200 minutes
Error budget: 43,200 × 0.0001 = 4.32 minutes

Why this is risky:

4.32 minutes = less than 5 minutes of total downtime allowed per month
If average MTTR is 30 minutes, ONE incident blows the entire budget
Even detecting an incident often takes longer than 4 minutes
You’d need sub-minute detection AND sub-minute recovery
Multi-region, automated failover, extensive automation required
Engineering cost to achieve this is massive

Most teams promising 99.99% without doing this math end up paying SLA credits.

3. System A has MTBF=500 hours, MTTR=30 minutes. System B has MTBF=100 hours, MTTR=5 minutes. Which is more available?

Answer

System A:

MTBF = 500 hours
MTTR = 0.5 hours (30 minutes)
Availability = 500 / (500 + 0.5) = 500/500.5 = 99.90%

System B:

MTBF = 100 hours
MTTR = 0.083 hours (5 minutes)
Availability = 100 / (100 + 0.083) = 100/100.083 = 99.92%

System B is more available despite failing 5x more often!

This demonstrates the power of reducing MTTR. System B fails frequently but recovers so quickly that users experience better overall availability.

This is why investing in fast recovery (automated rollback, quick detection, good runbooks) often beats trying to prevent all failures.

4. What’s wrong with this reliability target: “The API will be 99.9% reliable”?

Answer

This statement is incomplete. A proper reliability definition needs three components:

Intended function - What does “work” mean?
- Respond at all? Respond correctly? Respond within X milliseconds?
- Does a slow response count as a failure?
- What about partial responses?
Specified period - Over what time window?
- Per request? Per day? Per month?
- Are maintenance windows excluded?
Stated conditions - Under what circumstances?
- What’s “normal” load?
- Are there excluded scenarios (DDoS, natural disasters)?
- What about dependent services being down?

Better version: “The API will successfully respond to 99.9% of valid requests within 200ms, measured monthly, under loads up to 1000 RPS, excluding scheduled maintenance windows.”

5. Your team has used 38 of your 43.2-minute monthly error budget, and it’s week 3. Product wants to deploy a major new feature. What should happen?

Answer

Analysis:

Budget remaining: 43.2 - 38 = 5.2 minutes (12% remaining)
Time remaining in month: ~1 week
This is deep in the warning zone (<25% remaining)

What should happen:

Do not deploy the risky feature this month
Check: How much risk does this deployment carry? Similar past deploys caused how much impact?
If past similar deploys typically cause 10+ minutes of issues, this would blow the budget
Alternative actions:
- Deploy small, safe changes only
- Focus on reliability improvements
- Save the major feature for next month when budget resets
- If absolutely critical, get executive sign-off acknowledging SLA risk

This isn’t ops blocking devs—it’s the team making a data-driven decision based on shared reliability goals.

Hands-On Exercise

Scenario: You’re joining a team as a reliability engineer. Your first task is to assess the current reliability posture and make recommendations.

Part 1: Calculate Reliability Metrics (15 minutes)

Data from last month’s monitoring:

Metric	Value
Total requests	10,000,000
Failed requests (5xx errors)	12,000
Slow requests (>2s latency)	85,000
Number of incidents	4
Total incident duration	3 hours 20 minutes (200 minutes)
Operating hours	720 (full month)

Calculate each metric:

1. SUCCESS RATE (RELIABILITY)
═══════════════════════════════════════════════════════════════════════════════
Formula: Success Rate = (Total - Failed) / Total × 100

Your calculation:
───────────────────────────────────────────────────────────────────────────────




Result: _______%

2. AVAILABILITY
═══════════════════════════════════════════════════════════════════════════════
Formula: Availability = (Operating Time - Downtime) / Operating Time × 100

Your calculation:
───────────────────────────────────────────────────────────────────────────────




Result: _______%

3. MTBF (Mean Time Between Failures)
═══════════════════════════════════════════════════════════════════════════════
Formula: MTBF = Operating Time / Number of Incidents

Your calculation:
───────────────────────────────────────────────────────────────────────────────




Result: _______ hours

4. MTTR (Mean Time To Recovery)
═══════════════════════════════════════════════════════════════════════════════
Formula: MTTR = Total Downtime / Number of Incidents

Your calculation:
───────────────────────────────────────────────────────────────────────────────




Result: _______ minutes

5. ERROR BUDGET STATUS
═══════════════════════════════════════════════════════════════════════════════
Target: 99.9% reliability
Error budget: 0.1% = 43.2 minutes per month

Your calculation:
───────────────────────────────────────────────────────────────────────────────
Budget: 43.2 minutes
Used: _______ minutes
Remaining: _______ minutes
Status: [ ] Healthy  [ ] Warning  [ ] Over budget

Part 2: Write a Reliability Assessment (10 minutes)

Based on your calculations, write a brief assessment:

RELIABILITY ASSESSMENT TEMPLATE
═══════════════════════════════════════════════════════════════════════════════

CURRENT STATE:
──────────────────────────────────────────────────────────────────────────────
• Reliability (success rate): _______% (target: 99.9%)
• Availability: _______% (target: 99.9%)
• Error budget: _______ status

KEY FINDINGS:
──────────────────────────────────────────────────────────────────────────────
1. Meeting target? [ ] Yes  [ ] No
2. Biggest gap: _________________________________
3. MTBF vs MTTR priority: _________________________________

RECOMMENDATION:
──────────────────────────────────────────────────────────────────────────────
Top priority improvement: _________________________________

Why: _________________________________

Part 3: Bonus Challenge

Notice the data includes 85,000 “slow requests” (>2s latency) that aren’t counted as failures. Should they be?

Consider:

If users expect <2s response, is a 5s response a “success”?
How would counting slow requests as failures change your reliability number?
What should the SLI definition be?

Success Criteria:

All 5 metrics calculated correctly
Analysis identifies whether target is met
MTBF vs MTTR trade-off reasoned through
One specific improvement identified with justification
(Bonus) Considered impact of slow requests on reliability definition

Check Your Work - Sample Answers

Calculations:

Success Rate = (10,000,000 - 12,000) / 10,000,000 = 99.88%
Availability = (720 - 3.33) / 720 = 99.54%
MTBF = 720 / 4 = 180 hours (~7.5 days between failures)
MTTR = 200 minutes / 4 = 50 minutes per incident
Error Budget:
- Budget: 43.2 minutes
- Used: 200 minutes
- Over budget by 156.8 minutes (463% over!)

Assessment:

CURRENT STATE:

Reliability: 99.88% (close but not meeting 99.9% target)
Availability: 99.54% (significantly below 99.9% target)
Error budget: Severely over budget

KEY FINDINGS:

Not meeting targets on either dimension
Biggest gap: Availability - 3+ hours of downtime vs 43 minutes allowed
Priority: MTTR first - 50 minutes per incident is too long. Reducing to 15 minutes would make a huge difference.

RECOMMENDATION:

Invest in faster incident response
Implement automated detection (reduce MTTD from unknown to <5 minutes)
Create runbooks for common issues
Add automated rollback capability
Goal: MTTR from 50 minutes → 15 minutes

Bonus - Slow Requests: If slow requests count as failures:

Total failures = 12,000 + 85,000 = 97,000
New success rate = (10M - 97K) / 10M = 99.03%

This changes the picture dramatically! The SLI definition matters.

Better SLI definition: “99.9% of requests complete successfully (2xx) within 2 seconds”

Key Takeaways

Before moving on, make sure you understand these core concepts:

RELIABILITY FUNDAMENTALS CHECKLIST
═══════════════════════════════════════════════════════════════════════════════

□ Reliability = probability of correct operation
  (not the same as availability)

□ Availability = proportion of time system is operational
  (can be up but still failing requests)

□ Durability = probability data survives
  (your data is safe even when you can't access it)

□ Nines are exponentially harder
  (99.99% is ~10x harder than 99.9%)

□ MTBF tells you how often you fail
  (longer is better)

□ MTTR tells you how fast you recover
  (shorter is often easier to improve than MTBF)

□ Error budgets make trade-offs explicit
  (shared language between dev and ops)

□ 100% reliability is impossible
  (and probably undesirable)

Next Module

Module 2.2: Failure Modes and Effects - Now that you understand what reliability means, learn how systems actually fail. Understanding failure modes is the first step to designing for reliability.

Module 2.1: What is Reliability?

What You’ll Be Able to Do

The Incident That Changed Everything

Why This Module Matters

What You’ll Learn

Part 1: Defining Reliability

1.1 What Does “Reliable” Mean?

1.2 Reliability vs. Availability vs. Durability

1.3 The User’s Perspective

Part 2: Measuring Reliability

2.1 The Nines

2.2 Key Reliability Metrics

2.3 Error Budgets

Part 3: The Reliability Trade-offs

3.1 Reliability vs. Velocity

3.2 Context Determines Trade-offs

3.3 The 100% Reliability Myth

Part 4: Reliability as a Practice

4.1 Reliability is Not a Feature

4.2 The Reliability Mindset

4.3 Reliability Anti-patterns

Did You Know?

Common Mistakes

Quiz

Hands-On Exercise

Part 1: Calculate Reliability Metrics (15 minutes)

Part 2: Write a Reliability Assessment (10 minutes)

Part 3: Bonus Challenge

Further Reading

Key Takeaways

Next Module