Перейти до вмісту

Module 2.1: What is Reliability?

Цей контент ще не доступний вашою мовою.

Complexity: [MEDIUM]

Time to Complete: 35-40 minutes

Prerequisites: Systems Thinking Track (recommended)

Track: Foundations

After completing this module, you will be able to:

  1. Explain the difference between availability, durability, and reliability and why each requires distinct engineering strategies
  2. Evaluate a system’s reliability posture by identifying single points of failure and hidden dependencies
  3. Design reliability requirements for a service by mapping user expectations to concrete engineering constraints
  4. Compare reactive vs. proactive reliability approaches and justify when to invest in each

December 24th, 2012. Netflix headquarters.

The on-call engineer stares at her dashboard in disbelief. Christmas Eve, and AWS’s US-East region has just entered what would become one of its most infamous outages. Netflix’s entire streaming service—the thing that millions of families planned to use during their holiday gatherings—is going dark.

But something unexpected happens.

While other companies scramble, while their engineers panic and their executives field angry customer calls, Netflix’s systems are… recovering. Automatically. Within minutes, traffic is rerouting to other regions. The streaming continues. Most customers never notice.

The secret? Years of reliability engineering.

Netflix hadn’t hoped for the best. They’d engineered for the worst. They’d asked: “What happens when AWS fails?” And they’d built systems that could answer: “We keep streaming.”

THE TWO APPROACHES TO FAILURE
═══════════════════════════════════════════════════════════════════════════════
COMPANY A: "Hope It Doesn't Break"
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ "Our cloud provider → AWS outage → COMPLETE OUTAGE │
│ has 99.99% uptime" happens "Why didn't anyone │
│ tell us this could │
│ happen?!" │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
COMPANY B: Netflix's Approach - "Engineer for Failure"
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ "What if AWS → Built multi-region → AWS outage → Auto- │
│ fails?" fallback systems happens recovery │
│ │
│ "What if a → Chaos Monkey tests → Failure Known, │
│ server dies?" this constantly occurs → handled │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

This is the difference between hoping and engineering. Between luck and reliability.


Your users don’t care about your architecture. They don’t care about your tech stack. They don’t care that you’re using Kubernetes or that your microservices are beautifully decoupled. They care about exactly one thing:

Does it work when I need it?

That question seems simple. But answering it systematically—measuring it, designing for it, trading off against other goals—that’s the discipline of reliability engineering.

Consider what “99% reliable” actually means:

THE BRUTAL MATH OF UNRELIABILITY
═══════════════════════════════════════════════════════════════════════════════
"99% reliable" sounds great until you do the math:
Per Year: 99% uptime = 3.65 DAYS of downtime
Per Month: = 7.3 HOURS of downtime
Per Week: = 1.7 HOURS of downtime
Per Day: = 14.4 MINUTES of downtime
For a payment processor handling $1M/hour:
─────────────────────────────────────────────────────────────────────────────
7.3 hours/month × $1,000,000/hour = $7.3 MILLION in lost revenue PER MONTH
For an emergency services dispatcher:
─────────────────────────────────────────────────────────────────────────────
14 minutes per day × 365 = 85 HOURS per year of "911 not working"
For a trading platform:
─────────────────────────────────────────────────────────────────────────────
Each minute of downtime during market hours can cost millions in trades

This module teaches you to think about reliability systematically—not as “we hope it doesn’t break” but as an engineering discipline with clear metrics, trade-offs, and design principles.

The Bridge Analogy

Civil engineers don’t say “we hope this bridge doesn’t collapse.” They calculate loads, specify materials, add safety factors, and design for specific failure scenarios. They know exactly what wind speed will cause problems, what weight the bridge can bear, and what happens if a cable snaps.

Software reliability engineering applies the same rigor to systems: understand the failure modes, design for them, measure the results. The question isn’t “will it fail?” but “how will it fail, and what happens when it does?”


  • How to define and measure reliability precisely
  • The relationship between availability, reliability, and durability
  • MTBF, MTTR, and other reliability metrics that actually matter
  • Why “five nines” is exponentially harder than “three nines”
  • The reliability vs. velocity trade-off and how to navigate it
  • How to calculate and use error budgets

Here’s a conversation that happens in every engineering organization:

THE VAGUENESS PROBLEM
═══════════════════════════════════════════════════════════════════════════════
Product Manager: "Is the payment system reliable?"
Engineer 1: "Yes, we haven't had any outages this month."
Engineer 2: "Well, we had 500 failed transactions yesterday."
Engineer 1: "But that's only 0.01% of transactions!"
Engineer 2: "That's 500 customers who couldn't buy things."
Product Manager: "So... is it reliable or not?"
Everyone: "..."

The problem? “Reliable” means different things to different people. We need precision.

Reliability is the probability that a system performs its intended function for a specified period under stated conditions.

Three components must be defined:

THE THREE PILLARS OF A RELIABILITY DEFINITION
═══════════════════════════════════════════════════════════════════════════════
┌─────────────────────────────────────┐
│ RELIABILITY STATEMENT │
└─────────────────────────────────────┘
┌─────────────────────────┼─────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ INTENDED │ │ SPECIFIED │ │ STATED │
│ FUNCTION │ │ PERIOD │ │ CONDITIONS │
└─────────────┘ └─────────────┘ └─────────────┘
What should it For how long? Under what
do exactly? circumstances?
Examples: Examples: Examples:
• Process payment • 24/7 continuous • Normal load
• Return result • During business • ≤1000 TPS
• Complete in <2s hours only • Valid requests
• Return accurate • For 30 days • Network available
data minimum

Vague vs. Precise Reliability Statements:

VaguePrecise
”The system is reliable""The system successfully processes 99.9% of valid requests within 2 seconds"
"High availability""Available 99.95% of the time, measured monthly"
"Data is safe""99.999999999% of objects stored survive over 10 years"
"Fast enough""95th percentile latency under 200ms”
COMPLETE RELIABILITY DEFINITION EXAMPLE
═══════════════════════════════════════════════════════════════════════════════
❌ Vague: "The payment system is reliable"
✅ Precise: "The payment system successfully processes 99.9% of
transactions within 2 seconds, under normal load (up to
1000 TPS), 24/7"
Breaking it down:
┌─────────────────────────────────────────────────────────────────────────────┐
│ INTENDED FUNCTION │
│ ───────────────── │
│ • Process transactions │
│ • Successfully (not just accept, but complete) │
│ • Within 2 seconds (latency requirement) │
├─────────────────────────────────────────────────────────────────────────────┤
│ SPECIFIED PERIOD │
│ ───────────────── │
│ • 24/7 (continuous operation) │
│ • No planned maintenance windows excluded │
├─────────────────────────────────────────────────────────────────────────────┤
│ STATED CONDITIONS │
│ ───────────────── │
│ • Normal load: ≤1000 transactions per second │
│ • Assumes valid transactions (garbage in = garbage out is ok) │
│ • Assumes network connectivity to payment processor │
└─────────────────────────────────────────────────────────────────────────────┘

1.2 Reliability vs. Availability vs. Durability

Section titled “1.2 Reliability vs. Availability vs. Durability”

These three terms get confused constantly. Even experienced engineers mix them up. Let’s fix that forever.

ConceptQuestion It AnswersMeasuresExample
Reliability”When I use it, does it work?”Success rate”99.9% of requests succeed”
Availability”Can I use it right now?”Uptime”System is up 99.99% of time”
Durability”Will my data still exist tomorrow?”Data retention”99.999999999% of data preserved”
THE CRITICAL DISTINCTION
═══════════════════════════════════════════════════════════════════════════════
┌─────────────────┐
│ USER REQUEST │
└────────┬────────┘
┌────────────────────────┐
│ Can I reach the │
│ system at all? │
└───────────┬────────────┘
┌───────────────┴───────────────┐
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ NO │ │ YES │
└────┬────┘ └────┬────┘
│ │
▼ ▼
AVAILABILITY ┌────────────────────┐
FAILURE │ Does it work │
(system down) │ correctly? │
└──────────┬─────────┘
┌───────────────┴───────────┐
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ NO │ │ YES │
└────┬────┘ └────┬────┘
│ │
▼ ▼
RELIABILITY SUCCESS!
FAILURE (this is what
(errors, bugs) users want)

The Four Combinations:

RELIABILITY vs AVAILABILITY MATRIX
═══════════════════════════════════════════════════════════════════════════════
RELIABILITY
Low High
┌────────────────┬────────────────┐
│ │ │
High │ UNRELIABLE │ IDEAL │
│ UPTIME │ │
AVAILABILITY │ "It's always │ "Works great │
│ up but half │ all the │
│ my requests │ time" │
│ fail" │ │
├────────────────┼────────────────┤
│ │ │
Low │ WORST │ FLAKY │
│ │ UPTIME │
│ "Can't reach │ "When it's │
│ it, and when │ up, it's │
│ I can, it │ perfect, │
│ fails" │ but..." │
│ │ │
└────────────────┴────────────────┘
REAL-WORLD EXAMPLES:
────────────────────────────────────────────────────────────────────────────────
High Availability + Low Reliability:
API always responds, but 5% of responses are errors
Low Availability + High Reliability:
Mainframe with weekly maintenance windows—when up, zero errors
High Both: Netflix in 2023—always available, almost always works
Low Both: That internal tool nobody maintains—down half the time,
buggy the other half

Durability: The Third Dimension

Durability is different—it’s about data persistence over time, not requests.

DURABILITY EXPLAINED
═══════════════════════════════════════════════════════════════════════════════
Scenario: You upload a photo to cloud storage
AVAILABILITY question: "Can I access my photo right now?"
• If cloud storage is down, you can't access it
• But the photo still exists!
DURABILITY question: "Will my photo exist in 10 years?"
• Even if storage is up, corruption could destroy it
• Durability = probability of data survival
THE MATH OF 11 NINES (Amazon S3's durability guarantee):
─────────────────────────────────────────────────────────────────────────────
99.999999999% durability
If you store 10,000,000 objects:
• Expected losses per year: 10M × 0.000000001 = 0.01 objects
• Expected losses per century: ~1 object
That's ~10,000 years between losing a single object.
How they achieve it:
• Multiple copies across multiple data centers
• Automatic integrity checking
• Self-healing when corruption detected
• Geographic distribution

Did You Know?

Amazon S3’s famous “11 nines” durability (99.999999999%) means if you store 10 million objects, you’d statistically expect to lose one every 10,000 years. That’s not availability—S3 can be temporarily unavailable while still being durable. Your data is safe; you just can’t access it right now.

This distinction matters enormously. When S3 has an “outage,” your files aren’t being deleted—they’re just temporarily inaccessible. Durability and availability are independent properties.

From your user’s perspective, reliability is simple: Did it work?

Users don’t care about our careful distinctions between availability, reliability, and durability. They care about outcomes.

USER EXPERIENCE OF RELIABILITY: THE CHECKOUT JOURNEY
═══════════════════════════════════════════════════════════════════════════════
User tries to buy something:
SCENARIO A: Availability Failure
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ User clicks "Checkout" │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ "This site can't be │ │
│ │ reached" │ User thinks: "Their site is broken" │
│ └───────────────────────┘ │
│ │
│ Result: FAILED ❌ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
SCENARIO B: Reliability Failure
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ User clicks "Checkout" → Site loads → User enters payment info │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ "Error processing │ │
│ │ your payment" │ User thinks: "Their site is broken" │
│ └───────────────────────┘ │
│ │
│ Result: FAILED ❌ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
SCENARIO C: Durability Failure
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ User clicks "Checkout" → Site loads → Payment works → Success page! │
│ │ │
│ ▼ │
│ ... 2 days later ... │
│ "Where's my order? No confirmation email. No order in history." │
│ │
│ User thinks: "Their site is broken" │
│ │
│ Result: FAILED ❌ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
TO THE USER, ALL THREE FAILURES ARE IDENTICAL: "IT DIDN'T WORK"

This is why we need to measure all three dimensions—because any failure mode leads to the same user outcome.

Try This (2 minutes)

Think about an app you use daily. Recall a time it failed you. Was it:

  • An availability failure (couldn’t connect)?
  • A reliability failure (connected but got an error)?
  • A durability failure (your data was lost)?

Understanding the failure type helps identify the fix.

Bonus: Think about how you, as a user, responded. Did you retry? Give up? Switch to a competitor? That’s the business cost of unreliability.


When engineers talk about reliability, they talk about “nines.” You’ll hear phrases like “we need five nines” or “we’re only at three nines.” What does this mean?

The “nines” are a shorthand for the number of 9s in the reliability percentage:

NinesPercentageError RateDowntime/YearDowntime/MonthDowntime/Day
One nine90%10%36.5 days3 days2.4 hours
Two nines99%1%3.65 days7.3 hours14 minutes
Three nines99.9%0.1%8.76 hours43.8 minutes1.4 minutes
Four nines99.99%0.01%52.6 minutes4.4 minutes8.6 seconds
Five nines99.999%0.001%5.26 minutes26.3 seconds0.86 seconds
Six nines99.9999%0.0001%31.5 seconds2.6 seconds86 ms
VISUALIZING THE NINES
═══════════════════════════════════════════════════════════════════════════════
If you handle 1,000,000 requests per day:
90% (1 nine) ████████████████████████████████████░░░░░░░░
100,000 failures per day
"Basically broken"
99% (2 nines) ████████████████████████████████████████░░░░
10,000 failures per day
"Unreliable, users complaining constantly"
99.9% (3 nines) █████████████████████████████████████████░░░
1,000 failures per day
"Okay for internal tools, not great for customers"
99.99% (4 nines) ██████████████████████████████████████████░░
100 failures per day
"Good for most consumer services"
99.999% (5 nines)███████████████████████████████████████████░
10 failures per day
"Enterprise-grade, mission-critical"
99.9999%(6 nines)████████████████████████████████████████████
1 failure per day
"Financial trading, emergency services"

The Exponential Cost of Each Nine:

Each additional nine is approximately 10x harder and more expensive to achieve. Here’s why:

THE EXPONENTIAL COST OF NINES
═══════════════════════════════════════════════════════════════════════════════
NINES REQUIREMENTS APPROXIMATE COST FACTOR
─────────────────────────────────────────────────────────────────────────────────
99% • Basic monitoring $
• Some automation
• Single data center
99.9% • Good monitoring $$
• Automated failover
• Redundant components
• On-call rotation
99.99% • Sophisticated monitoring $$$$
• Multi-zone deployment
• Automatic scaling
• Fast incident response
• Chaos engineering
• 24/7 operations
99.999% • Multi-region deployment $$$$$$$$
• Global load balancing
• Zero-downtime deployments
• Full automation
• Dedicated reliability teams
• Extensive testing infrastructure
WHY IT'S EXPONENTIAL:
─────────────────────────────────────────────────────────────────────────────────
Going from 99% to 99.9%: Remove 90% of remaining failures
Going from 99.9% to 99.99%: Remove 90% of remaining failures AGAIN
Going from 99.99% to 99.999%: And again...
The easy failures are fixed first. Each level requires fixing
progressively harder, rarer, weirder problems.
At 99.99%, you're fixing bugs that only happen during full moons
(or at least, that's what it feels like).

Beyond “nines,” there are four critical metrics every reliability engineer must understand:

THE FOUR FUNDAMENTAL RELIABILITY METRICS
═══════════════════════════════════════════════════════════════════════════════
MTBF MTTR
(Mean Time Between (Mean Time To
Failures) Recovery)
│ │
│ How long between │ How long to fix
│ failures? │ each failure?
│ │
▼ ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ WORKS │ │ DOWN │ │ WORKS │ │ DOWN │ │ WORKS │
└───────┘ └───────┘ └───────┘ └───────┘ └───────┘
│◄──────────▶│◄───▶│◄─────────────▶│◄───▶│
MTBF MTTR MTBF MTTR
MTTF MTTD
(Mean Time To (Mean Time To
Failure) Detect)
For non-repairable Time from failure
components occurring to
(how long until detecting it
it dies?)

MTBF - Mean Time Between Failures

MTBF measures how frequently failures occur. Higher MTBF = fewer failures = better.

MTBF CALCULATION
═══════════════════════════════════════════════════════════════════════════════
Formula: MTBF = Total Operating Time / Number of Failures
Example Scenario:
─────────────────────────────────────────────────────────────────────────────
Timeline over 1000 hours:
Hour 0 Hour 200 Hour 450 Hour 800 Hour 1000
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
START FAIL #1 FAIL #2 FAIL #3 END
(15 min) (30 min) (45 min)
Calculation:
• Total operating time: 1000 hours
• Number of failures: 3
• MTBF = 1000 / 3 = 333.3 hours
Interpretation: On average, expect a failure every 333 hours (~14 days)

MTTR - Mean Time To Recovery

MTTR measures how quickly you recover from failures. Lower MTTR = faster recovery = better.

MTTR BREAKDOWN
═══════════════════════════════════════════════════════════════════════════════
MTTR is actually composed of multiple phases:
┌─────────────────────────────────────────────────────────────────────────┐
│ TOTAL MTTR │
└─────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────┼──────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌─────────────┐
│ MTTD │ │ MTTI │ │ MTT-Fix │
│ (Time │ │ (Time to │ │ (Time to │
│ to │ │ Investigate)│ │ actually │
│ Detect) │ │ │ │ fix) │
└──────────┘ └──────────────┘ └─────────────┘
Example:
─────────────────────────────────────────────────────────────────────────────
Failure occurs at 2:00 AM
• 2:00 AM - 2:15 AM: No one knows (MTTD = 15 min)
• 2:15 AM - 2:45 AM: Investigating "what's wrong" (MTTI = 30 min)
• 2:45 AM - 3:15 AM: Fixing the issue (MTT-Fix = 30 min)
• Total MTTR = 75 minutes
To reduce MTTR, attack each phase:
• Reduce MTTD: Better monitoring, alerting
• Reduce MTTI: Better dashboards, runbooks, observability
• Reduce MTT-Fix: Automation, rollback capabilities, simpler systems

Calculating Availability from MTBF and MTTR:

THE AVAILABILITY FORMULA
═══════════════════════════════════════════════════════════════════════════════
MTBF
Availability = ────────────
MTBF + MTTR
Example:
─────────────────────────────────────────────────────────────────────────────
• MTBF = 250 hours (failure every ~10 days)
• MTTR = 2 hours (takes 2 hours to fix each failure)
Availability = 250 / (250 + 2) = 250/252 = 99.2%
What this means:
• System is working: 99.2% of the time
• System is down: 0.8% of the time
• Per month downtime: 0.8% × 720 hours = 5.76 hours

The MTTR Revelation

Here’s a secret that many teams miss: reducing MTTR is often easier and more effective than reducing MTBF.

Consider two approaches to improving from 99% to 99.9%:

ApproachStarting PointTargetImprovement Needed
Increase MTBFMTBF=100h, MTTR=1hMTBF=1000h10x fewer failures
Decrease MTTRMTBF=100h, MTTR=1hMTTR=6min10x faster recovery

Preventing all failures is hard—you’re fighting complexity, unknown unknowns, and Murphy’s Law. But recovering faster? You can invest in:

  • Better monitoring (detect failures immediately)
  • Automated rollbacks (fix some failures in seconds)
  • Runbooks and training (human recovery is faster)
  • Simpler architectures (easier to debug)

The best reliability engineers optimize both—but they often get more ROI from MTTR.

Here’s a concept that changed how the industry thinks about reliability: the error budget.

An error budget is the acceptable amount of unreliability—the difference between 100% and your reliability target.

ERROR BUDGET: THE CONCEPT
═══════════════════════════════════════════════════════════════════════════════
If your target is 99.9% reliability...
100% - 99.9% = 0.1% error budget
That 0.1% is not a failure—it's PERMISSION TO FAIL within limits.
In a month (30 days = 43,200 minutes):
─────────────────────────────────────────────────────────────────────────────
Error budget = 43,200 × 0.001 = 43.2 minutes
You can "spend" 43.2 minutes on:
• Downtime from incidents
• Failed requests
• Deployments that break things temporarily
• Experiments that don't work out
BUDGET VISUALIZATION (Example: Month in Progress)
─────────────────────────────────────────────────────────────────────────────
Week 1: [████████░░] Incident: 5 min Budget: 38.2 min remaining
Week 2: [███████░░░] Incident: 12 min Budget: 26.2 min remaining
Week 3: [██████░░░░] Deployment: 3 min Budget: 23.2 min remaining
Week 4: [█████░░░░░] Healthy! Budget: 23.2 min remaining
Status: ✅ HEALTHY - Room for one more risky deployment this month

Why Error Budgets Are Revolutionary:

Before error budgets, reliability conversations went like this:

  • Developers: “Let’s ship fast!”
  • Ops: “No, we need stability!”
  • Result: Endless conflict, arbitrary decisions

Error budgets change the conversation:

ERROR BUDGETS CREATE ALIGNMENT
═══════════════════════════════════════════════════════════════════════════════
OLD CONVERSATION (Conflict)
─────────────────────────────────────────────────────────────────────────────
Developer: "Let's deploy the new feature!"
Ops: "Too risky, we just had an incident."
Developer: "When CAN we ship then?"
Ops: "When I feel comfortable."
Developer: "That's not a real answer!"
→ RESULT: Arguments, politics, frustration
NEW CONVERSATION (Data-Driven)
─────────────────────────────────────────────────────────────────────────────
Developer: "Let's deploy the new feature!"
Ops: "What's our error budget status?"
Developer: "We have 28 minutes remaining this month."
Ops: "Last similar deployment caused 5 minutes of issues."
Developer: "So we can afford this. Let's go."
→ RESULT: Shared decision based on data

Error Budget Policy:

Budget StatusWhat It MeansTeam Response
>50% remainingPlenty of roomShip features, take calculated risks, experiment
25-50% remainingGetting tightContinue carefully, reduce risk per deployment
<25% remainingWarning zoneSlow down, extra testing, avoid risky changes
0% or negativeReliability crisisFreeze features, all hands on reliability
ERROR BUDGET POLICY IN ACTION
═══════════════════════════════════════════════════════════════════════════════
SCENARIO: Team wants to deploy major new feature
Budget Status Check:
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ This Month's Error Budget: 43.2 minutes │
│ Used so far: 35.8 minutes │
│ Remaining: 7.4 minutes (17%) │
│ │
│ [███████████████████████████████████░░░░░░░░░░] 83% used │
│ │
│ Status: ⚠️ WARNING ZONE │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Decision: Hold the risky feature until next month. Instead:
• Deploy smaller, safer changes
• Focus on reliability improvements
• Bank error budget for next month
This isn't "ops blocking devs"—it's the TEAM deciding based on shared data.

Try This (3 minutes)

Your service has 99.5% reliability target. Calculate your monthly error budget:

  1. What percentage is your error budget? (100% - 99.5% = ?)
  2. How many minutes per month? (43,200 × budget = ?)
  3. If you’ve had 3 incidents of 30 minutes each, how much budget remains?
  4. Based on the policy above, what should your team do?

Here’s the uncomfortable truth every engineering organization faces: you can’t have maximum reliability AND maximum velocity. There’s a fundamental tension.

THE RELIABILITY-VELOCITY SPECTRUM
═══════════════════════════════════════════════════════════════════════════════
MAXIMUM RELIABILITY MAXIMUM VELOCITY
│ │
│ YOUR CHOICE │
│ ┌─────────────────┐ │
│ ◀───┤ WHERE ARE YOU? ├───▶ │
│ └─────────────────┘ │
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ • More testing │ │ • Less testing │
│ • Longer review cycles│ │ • Quick reviews │
│ • Canary deployments │ │ • Deploy when ready │
│ • Conservative changes│ │ • Big bang changes │
│ • Expensive infra │ │ • Cheap infra │
│ • 24/7 on-call │ │ • "We'll fix in prod" │
└───────────────────────┘ └───────────────────────┘
EXAMPLES ON THE SPECTRUM:
─────────────────────────────────────────────────────────────────────────────
◀─────────────────────────────────────────────────────────────────────────▶
Medical Banks Airlines E-commerce Internal Startup Hackathon
Device Tools MVP Project
│ │ │ │ │ │ │
"Zero "Must "Regulated "Revenue "Annoying "Speed "Ship
tolerance" be but needs depends but not is something
trust- innovation" on uptime" critical" survival" by 5pm"
worthy"

Why you can’t fully optimize both:

THE PHYSICS OF THE TRADE-OFF
═══════════════════════════════════════════════════════════════════════════════
1. EVERY DEPLOYMENT IS A RISK
─────────────────────────────────────────────────────────────────────────
More deploys = More chances for bugs to slip through
Each deploy could be "the one" that breaks production
Fast velocity: Deploy 10 times/day = 10 opportunities for failure
High reliability: Deploy 1 time/week = Fewer opportunities for failure
2. EVERY FEATURE ADDS COMPLEXITY
─────────────────────────────────────────────────────────────────────────
More features = More code paths = More failure modes
Features: 100 → Surface area for bugs: 100 potential failure points
Features: 1000 → Surface area for bugs: 1000 potential failure points
3. EVERY TEST ADDS TIME
─────────────────────────────────────────────────────────────────────────
More testing = More confidence but slower delivery
Quick path: 5 min test suite = Fast, but might miss edge cases
Thorough path: 2 hour test suite = Slow, but catches more bugs
4. REDUNDANCY COSTS MONEY
─────────────────────────────────────────────────────────────────────────
More reliability = More servers, regions, engineers
Single region: $10k/month, single point of failure
Multi-region: $50k/month, survives region failure

The right trade-off depends entirely on what you’re building:

System TypeReliability TargetVelocity TargetWhy This Balance
Pacemaker firmware99.9999%+Releases per yearLives at stake - One bug could kill
Banking core99.99%Releases per quarterMoney and trust - Errors lose customers
Flight booking99.95%Releases per weekRevenue impact - Downtime = lost sales
E-commerce99.9%Daily releasesBalance - Fast features, acceptable risk
Internal tool99%Multiple per dayLow stakes - If it breaks, devs complain
Prototype”Runs sometimes”ConstantlyLearning - Reliability is future problem

Let’s talk about something engineers need to accept: 100% reliability is impossible.

Not “very expensive.” Not “only for Google.” Physically impossible.

WHY 100% RELIABILITY CANNOT EXIST
═══════════════════════════════════════════════════════════════════════════════
THE PHYSICS PROBLEM
─────────────────────────────────────────────────────────────────────────────
• Hardware fails randomly (transistors wear out, solder cracks)
• Cosmic rays flip bits in memory (really! ~1 error per GB per month)
• Power grids have outages (transformers explode, trees fall)
• Datacenter cooling fails (air conditioners break)
THE DEPENDENCY PROBLEM
─────────────────────────────────────────────────────────────────────────────
• DNS can fail (Dyn DDoS attack 2016 took down half the internet)
• Cloud providers have outages (AWS has had 6+ major outages)
• Certificate authorities get compromised (DigiNotar 2011)
• BGP gets hijacked (routing your traffic to wrong places)
THE HUMAN PROBLEM
─────────────────────────────────────────────────────────────────────────────
• Engineers make mistakes (fat-finger deploy to production)
• On-call gets sick (at 3 AM, during a snowstorm)
• Documentation becomes outdated (runbook says "click button" but UI changed)
• Knowledge leaves (the one person who understood that system quit)
THE ECONOMICS PROBLEM
─────────────────────────────────────────────────────────────────────────────
COST
│ *
│ *
│ *
│ *
│ *
│ *
│ *
│ *
│ *
└───────────────────────────────────────────────▶ RELIABILITY
90% 99% 99.9% 99.99% 99.999% 100%
Cost grows EXPONENTIALLY. At some point, another nine costs more than
it's worth. That breakeven point is different for every system.

Did You Know?

Google’s Chubby lock service intentionally introduces planned outages. Why? To ensure that dependent services don’t accidentally build assumptions about 100% availability.

If Chubby were “too reliable,” services would build implicit dependencies on it always being there. Then when Chubby eventually had an unplanned outage, those services would fail catastrophically—they never tested for Chubby being down.

By being deliberately unreliable (within SLA), Chubby forces dependent services to handle its failures gracefully. Controlled unreliability builds resilience.

This is brilliant: your dependencies can be TOO reliable if it causes you to not handle their failure.

War Story: The 99.99% Promise

A SaaS company promised enterprise customers “99.99% availability” in contracts. Marketing loved the number. Sales closed deals with it. Nobody did the math.

THE MATH NOBODY DID
═════════════════════════════════════════════════════════════════════════════
Promised: 99.99% availability
Monthly budget: 43,200 minutes × 0.0001 = 4.32 minutes
Reality:
• Average incident detection time: 8 minutes
• Average investigation time: 15 minutes
• Average fix time: 22 minutes
• Total average MTTR: 45 minutes
ONE incident per month = SLA violation by 10x
Q1 Results:
─────────────────────────────────────────────────────────────────────────────
Month 1: 2 incidents, 87 minutes downtime (20x over budget)
Month 2: 1 incident, 52 minutes downtime (12x over budget)
Month 3: 3 incidents, 142 minutes downtime (33x over budget)
SLA credits owed: $2.4 million
Engineering budget for reliability: $800,000
The SLA credits alone cost 3x what it would have taken to fix the problems!

The fix wasn’t better reliability—it was honest expectations. They renegotiated to 99.9% (43 minutes/month), invested in faster incident response, and actually hit their targets.

Customers were happier with a realistic promise kept than an ambitious promise broken.

Lesson: Your SLA should match your operational capability, not your marketing aspirations.


Here’s a common mistake: treating reliability as something you add later. “First we’ll build the features, then we’ll make it reliable.”

This is like saying “First we’ll build the house, then we’ll add the foundation.”

TWO APPROACHES TO RELIABILITY
═══════════════════════════════════════════════════════════════════════════════
APPROACH A: BOLT-ON RELIABILITY (Fragile)
─────────────────────────────────────────────────────────────────────────────
Phase 1: "Ship features fast!"
┌─────────────────────────────────────────────────────────────────────┐
│ APPLICATION │
│ • No error handling ("happy path only") │
│ • Direct database calls │
│ • Synchronous everything │
│ • No timeouts │
│ • No monitoring │
└─────────────────────────────────────────────────────────────────────┘
Phase 2: "Oh no, it's breaking in production!"
┌─────────────────────────────────────────────────────────────────────┐
│ APPLICATION │
│ (original code, unchanged) │
└────────────────────────────────────┬────────────────────────────────┘
┌────────────────────────┴─────────────────────────┐
│ │
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ BOLT-ON RETRY │ │ BOLT-ON MONITORING │
│ LOGIC │ │ │
│ (doesn't fit well) │ │ (can't see inside) │
└───────────────────────┘ └───────────────────────┘
Result: Fragile, expensive to maintain, still breaks in new ways
APPROACH B: BUILT-IN RELIABILITY (Robust)
─────────────────────────────────────────────────────────────────────────────
┌─────────────────────────────────────────────────────────────────────┐
│ APPLICATION │
│ │
│ EVERY external call has: Graceful degradation: │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ • Timeout │ │ If payment fails: │ │
│ │ • Retry w/backoff│ │ → Queue for later │ │
│ │ • Circuit breaker│ │ → Show "pending" │ │
│ │ • Fallback │ │ → Notify user │ │
│ └──────────────────┘ └──────────────────┘ │
│ │
│ Bulkheads between components: Health checks: │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Search │ Checkout│ │ • Readiness probe│ │
│ │ can │ can │ │ • Liveness probe │ │
│ │ fail │ fail │ │ • Dependency │ │
│ │ alone │ alone │ │ checks │ │
│ └────────┴─────────┘ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Result: Handles failures gracefully, observable, maintainable

Reliability engineers think differently. They ask questions that others don’t:

THE FIVE QUESTIONS OF RELIABILITY
═══════════════════════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. WHAT CAN FAIL? │
│ ──────────────── │
│ Answer: EVERYTHING. Make a list: │
│ • Database: connection lost, slow queries, disk full │
│ • Network: timeout, partition, DNS failure │
│ • Third-party: API down, rate limited, changed response │
│ • Infrastructure: server crash, region outage, deployment │
│ • Human: bad config, wrong button, missing step │
│ │
│ 2. HOW WILL WE KNOW IT FAILED? │
│ ──────────────────────────── │
│ • What metrics indicate failure? │
│ • What logs capture the evidence? │
│ • What alerts wake someone up? │
│ • How do we distinguish "slow" from "broken"? │
│ │
│ 3. HOW WILL WE RECOVER? │
│ ───────────────────── │
│ • Automated: failover, restart, rollback? │
│ • Manual: runbook, escalation path? │
│ • How long will recovery take? │
│ • What's the backup plan if Plan A fails? │
│ │
│ 4. HOW DO WE PREVENT RECURRENCE? │
│ ──────────────────────────────── │
│ • Post-incident review (blameless) │
│ • Action items with owners and dates │
│ • Tests that would catch this failure │
│ • Architecture changes to eliminate the failure mode │
│ │
│ 5. WHAT'S THE BLAST RADIUS? │
│ ──────────────────────── │
│ • If this component fails, what else breaks? │
│ • Are there bulkheads to contain the damage? │
│ • Can we degrade gracefully instead of failing hard? │
│ • Who is affected: all users, some users, internal only? │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

These are the patterns that look reasonable but lead to unreliable systems:

Anti-patternWhy It Seems ReasonableWhy It FailsBetter Approach
”It won’t fail""We’ve never seen it fail”Past stability doesn’t guarantee future stabilityAssume everything fails; design for it
”We’ll fix it in prod""We can move faster this way”Each prod fix is an incident; users are your testersTest failure modes before production
”More redundancy = more reliable""Two is better than one”Redundancy adds complexity; can cause split-brainUnderstand failure modes FIRST, then add appropriate redundancy
”Users will retry""Retries are simple”Pushes reliability burden onto users; causes retry stormsHandle retries internally with backoff
”We tested it""Tests passed!”Tests only check known scenarios; prod has unknownsTest + monitor + chaos engineering
”The cloud handles it""AWS never fails”Cloud providers fail; you’re responsible for your SLADesign for cloud provider failures

Try This (3 minutes)

Pick one of your services. Ask yourself the five reliability questions:

  1. What are the top 3 things that could fail?
  2. How would you know if each one happened?
  3. What would you do about each one?

If you can’t answer these quickly, that’s a reliability gap—and now you know where to focus.


  • The term “reliability engineering” emerged from the U.S. military in the 1950s. Early missiles had a 60% failure rate—more than half of them failed! The military realized they needed systematic approaches to reliability, not just better components. The discipline they created eventually made it to software.

  • MTBF was originally measured in flight hours for aircraft. A plane with 10,000 hour MTBF meant you’d expect one failure per 10,000 hours of flight—not calendar time. This is why “flight hours” is still a common reliability metric in aviation.

  • The first software reliability model was created by John Musa at Bell Labs in 1975. He applied hardware reliability mathematics to software, founding the field of software reliability engineering. His insight: software bugs follow statistical patterns just like hardware failures.

  • Netflix pioneered the “Chaos Monkey” approach in 2010, randomly killing production instances to ensure their systems could handle failures. Why the name? Because it’s like having a monkey loose in your data center, randomly unplugging things. This evolved into chaos engineering—deliberately injecting failures to build confidence. If you haven’t tested a failure, you don’t know if you can survive it.

  • The Space Shuttle had five redundant computers running different software written by different teams. Why? A single bug could kill astronauts. The fifth computer ran entirely different software to protect against systematic bugs. This is the ultimate “defense in depth.”

  • Amazon’s internal SLA is stricter than external SLAs. They aim for internal services to be ~10x more reliable than what they promise customers. Why? Because customers experience the product of all internal reliabilities, not the average.


MistakeWhat It Looks LikeWhy It’s WrongHow to Fix It
Measuring availability, not reliability”We had 100% uptime!” but error rates are 5%System is up but returning errors; users are still failingTrack success rate of user-facing operations, not just uptime
Ignoring partial failures”It works” when search is broken but checkout worksUsers experience service as degraded; you don’t knowDefine and measure degraded states; create SLIs for each component
Setting unrealistic targets”We need five nines” without doing the mathTeam is demoralized; SLA credits drain budgetStart with user impact, work backward to a realistic target
Not tracking error budgetNo visibility into how much reliability “spend” remainsCan’t make informed velocity vs. reliability trade-offsImplement error budget tracking with clear policy
Optimizing one componentMade the database 10x faster but users still see errorsEnd-to-end reliability is what matters, not componentMeasure reliability from user’s perspective (synthetic monitoring)
Treating MTTR as fixed”Incidents take 45 minutes to fix, that’s just how it is”Recovery time is controllable; you’re leaving improvements on the tableInvest in detection, runbooks, automation; measure MTTD separately
Confusing SLOs and SLAsInternal target = customer promiseNo buffer for unexpected issues; breaches become SLA violationsSet internal SLO tighter than external SLA
Not testing failure modes”We have redundancy” but never tested failoverFirst test of redundancy is during a real incidentRegular game days; chaos engineering

Test your understanding of reliability concepts:

1. A system has 99.9% availability and 95% reliability. What does this mean for users?

Answer

This means:

  • 99.9% availability: The system is reachable 99.9% of the time (down only 8.76 hours/year)
  • 95% reliability: When reachable, only 95% of requests succeed (5% fail)

Combined effect: 99.9% × 95% = 94.9% success rate

So even though the system seems to have “three nines availability,” users experience roughly two nines of actual reliability. About 5 out of every 100 requests fail, even though the system is “up.”

This is why measuring only availability is misleading!

2. Your company promises 99.99% availability in contracts. Calculate your monthly error budget in minutes. Then explain why promising this might be risky.

Answer

Calculation:

  • Monthly minutes: 30 × 24 × 60 = 43,200 minutes
  • Error budget: 43,200 × 0.0001 = 4.32 minutes

Why this is risky:

  • 4.32 minutes = less than 5 minutes of total downtime allowed per month
  • If average MTTR is 30 minutes, ONE incident blows the entire budget
  • Even detecting an incident often takes longer than 4 minutes
  • You’d need sub-minute detection AND sub-minute recovery
  • Multi-region, automated failover, extensive automation required
  • Engineering cost to achieve this is massive

Most teams promising 99.99% without doing this math end up paying SLA credits.

3. System A has MTBF=500 hours, MTTR=30 minutes. System B has MTBF=100 hours, MTTR=5 minutes. Which is more available?

Answer

System A:

  • MTBF = 500 hours
  • MTTR = 0.5 hours (30 minutes)
  • Availability = 500 / (500 + 0.5) = 500/500.5 = 99.90%

System B:

  • MTBF = 100 hours
  • MTTR = 0.083 hours (5 minutes)
  • Availability = 100 / (100 + 0.083) = 100/100.083 = 99.92%

System B is more available despite failing 5x more often!

This demonstrates the power of reducing MTTR. System B fails frequently but recovers so quickly that users experience better overall availability.

This is why investing in fast recovery (automated rollback, quick detection, good runbooks) often beats trying to prevent all failures.

4. What’s wrong with this reliability target: “The API will be 99.9% reliable”?

Answer

This statement is incomplete. A proper reliability definition needs three components:

  1. Intended function - What does “work” mean?

    • Respond at all? Respond correctly? Respond within X milliseconds?
    • Does a slow response count as a failure?
    • What about partial responses?
  2. Specified period - Over what time window?

    • Per request? Per day? Per month?
    • Are maintenance windows excluded?
  3. Stated conditions - Under what circumstances?

    • What’s “normal” load?
    • Are there excluded scenarios (DDoS, natural disasters)?
    • What about dependent services being down?

Better version: “The API will successfully respond to 99.9% of valid requests within 200ms, measured monthly, under loads up to 1000 RPS, excluding scheduled maintenance windows.”

5. Your team has used 38 of your 43.2-minute monthly error budget, and it’s week 3. Product wants to deploy a major new feature. What should happen?

Answer

Analysis:

  • Budget remaining: 43.2 - 38 = 5.2 minutes (12% remaining)
  • Time remaining in month: ~1 week
  • This is deep in the warning zone (<25% remaining)

What should happen:

  1. Do not deploy the risky feature this month
  2. Check: How much risk does this deployment carry? Similar past deploys caused how much impact?
  3. If past similar deploys typically cause 10+ minutes of issues, this would blow the budget
  4. Alternative actions:
    • Deploy small, safe changes only
    • Focus on reliability improvements
    • Save the major feature for next month when budget resets
    • If absolutely critical, get executive sign-off acknowledging SLA risk

This isn’t ops blocking devs—it’s the team making a data-driven decision based on shared reliability goals.


Scenario: You’re joining a team as a reliability engineer. Your first task is to assess the current reliability posture and make recommendations.

Part 1: Calculate Reliability Metrics (15 minutes)

Section titled “Part 1: Calculate Reliability Metrics (15 minutes)”

Data from last month’s monitoring:

MetricValue
Total requests10,000,000
Failed requests (5xx errors)12,000
Slow requests (>2s latency)85,000
Number of incidents4
Total incident duration3 hours 20 minutes (200 minutes)
Operating hours720 (full month)

Calculate each metric:

1. SUCCESS RATE (RELIABILITY)
═══════════════════════════════════════════════════════════════════════════════
Formula: Success Rate = (Total - Failed) / Total × 100
Your calculation:
───────────────────────────────────────────────────────────────────────────────
Result: _______%
2. AVAILABILITY
═══════════════════════════════════════════════════════════════════════════════
Formula: Availability = (Operating Time - Downtime) / Operating Time × 100
Your calculation:
───────────────────────────────────────────────────────────────────────────────
Result: _______%
3. MTBF (Mean Time Between Failures)
═══════════════════════════════════════════════════════════════════════════════
Formula: MTBF = Operating Time / Number of Incidents
Your calculation:
───────────────────────────────────────────────────────────────────────────────
Result: _______ hours
4. MTTR (Mean Time To Recovery)
═══════════════════════════════════════════════════════════════════════════════
Formula: MTTR = Total Downtime / Number of Incidents
Your calculation:
───────────────────────────────────────────────────────────────────────────────
Result: _______ minutes
5. ERROR BUDGET STATUS
═══════════════════════════════════════════════════════════════════════════════
Target: 99.9% reliability
Error budget: 0.1% = 43.2 minutes per month
Your calculation:
───────────────────────────────────────────────────────────────────────────────
Budget: 43.2 minutes
Used: _______ minutes
Remaining: _______ minutes
Status: [ ] Healthy [ ] Warning [ ] Over budget

Part 2: Write a Reliability Assessment (10 minutes)

Section titled “Part 2: Write a Reliability Assessment (10 minutes)”

Based on your calculations, write a brief assessment:

RELIABILITY ASSESSMENT TEMPLATE
═══════════════════════════════════════════════════════════════════════════════
CURRENT STATE:
──────────────────────────────────────────────────────────────────────────────
• Reliability (success rate): _______% (target: 99.9%)
• Availability: _______% (target: 99.9%)
• Error budget: _______ status
KEY FINDINGS:
──────────────────────────────────────────────────────────────────────────────
1. Meeting target? [ ] Yes [ ] No
2. Biggest gap: _________________________________
3. MTBF vs MTTR priority: _________________________________
RECOMMENDATION:
──────────────────────────────────────────────────────────────────────────────
Top priority improvement: _________________________________
Why: _________________________________

Notice the data includes 85,000 “slow requests” (>2s latency) that aren’t counted as failures. Should they be?

Consider:

  • If users expect <2s response, is a 5s response a “success”?
  • How would counting slow requests as failures change your reliability number?
  • What should the SLI definition be?

Success Criteria:

  • All 5 metrics calculated correctly
  • Analysis identifies whether target is met
  • MTBF vs MTTR trade-off reasoned through
  • One specific improvement identified with justification
  • (Bonus) Considered impact of slow requests on reliability definition

Check Your Work - Sample Answers

Calculations:

  1. Success Rate = (10,000,000 - 12,000) / 10,000,000 = 99.88%

  2. Availability = (720 - 3.33) / 720 = 99.54%

  3. MTBF = 720 / 4 = 180 hours (~7.5 days between failures)

  4. MTTR = 200 minutes / 4 = 50 minutes per incident

  5. Error Budget:

    • Budget: 43.2 minutes
    • Used: 200 minutes
    • Over budget by 156.8 minutes (463% over!)

Assessment:

CURRENT STATE:

  • Reliability: 99.88% (close but not meeting 99.9% target)
  • Availability: 99.54% (significantly below 99.9% target)
  • Error budget: Severely over budget

KEY FINDINGS:

  1. Not meeting targets on either dimension
  2. Biggest gap: Availability - 3+ hours of downtime vs 43 minutes allowed
  3. Priority: MTTR first - 50 minutes per incident is too long. Reducing to 15 minutes would make a huge difference.

RECOMMENDATION:

  • Invest in faster incident response
  • Implement automated detection (reduce MTTD from unknown to <5 minutes)
  • Create runbooks for common issues
  • Add automated rollback capability
  • Goal: MTTR from 50 minutes → 15 minutes

Bonus - Slow Requests: If slow requests count as failures:

  • Total failures = 12,000 + 85,000 = 97,000
  • New success rate = (10M - 97K) / 10M = 99.03%

This changes the picture dramatically! The SLI definition matters.

Better SLI definition: “99.9% of requests complete successfully (2xx) within 2 seconds”


Books:

  • “Site Reliability Engineering” - Google (free online). Chapters 1-4 cover reliability fundamentals from the team that coined “SRE.” The definitive text on modern reliability engineering.

  • “Release It! Design and Deploy Production-Ready Software” - Michael Nygard (2nd edition). Practical patterns for building reliable systems. Every pattern has war stories from real production failures.

  • “The Checklist Manifesto” - Atul Gawande. How checklists improve reliability in complex domains (aviation, surgery)—surprisingly applicable to incident response.

Papers:

  • “How Complex Systems Fail” - Richard Cook. A 5-page paper that every reliability engineer should read. Describes why failures happen and why hindsight is misleading.

  • “On Designing and Deploying Internet-Scale Services” - James Hamilton. Classic paper on building reliable services at scale. Written in 2007 but still relevant.

Talks:

  • “Mastering Chaos: A Netflix Guide to Microservices” - Josh Evans (YouTube). How Netflix builds reliability into their microservices architecture.

  • “Building Reliability In” - John Allspaw. Thoughtful exploration of what reliability actually means in practice.


Before moving on, make sure you understand these core concepts:

RELIABILITY FUNDAMENTALS CHECKLIST
═══════════════════════════════════════════════════════════════════════════════
□ Reliability = probability of correct operation
(not the same as availability)
□ Availability = proportion of time system is operational
(can be up but still failing requests)
□ Durability = probability data survives
(your data is safe even when you can't access it)
□ Nines are exponentially harder
(99.99% is ~10x harder than 99.9%)
□ MTBF tells you how often you fail
(longer is better)
□ MTTR tells you how fast you recover
(shorter is often easier to improve than MTBF)
□ Error budgets make trade-offs explicit
(shared language between dev and ops)
□ 100% reliability is impossible
(and probably undesirable)

Module 2.2: Failure Modes and Effects - Now that you understand what reliability means, learn how systems actually fail. Understanding failure modes is the first step to designing for reliability.