A major cloud outage during a peak-traffic holiday period illustrates why reliability engineering matters most when demand and user expectations are both high.
Teams that invest in failover, graceful degradation, and recovery automation, and that deliberately test cloud-dependency failure, are far better positioned to limit user impact and keep serving users when a provider has problems.
The secret? Years of reliability engineering.
flowchart TD
subgraph Company_A ["COMPANY A: 'Hope It Doesn't Break'"]
A1["Our cloud provider has 99.99% uptime"] --> A2["AWS outage happens"]
A2 --> A3["COMPLETE OUTAGE<br/>'Why didn't anyone tell us this could happen?!'"]
end
subgraph Company_B ["COMPANY B: Netflix's Approach - 'Engineer for Failure'"]
B1["What if AWS fails?"] --> B2["Built multi-region fallback systems"]
B2 --> B3["AWS outage happens"]
B3 --> B4["Auto-recovery"]
B5["What if a server dies?"] --> B6["Chaos Monkey tests this constantly"]
B6 --> B7["Failure occurs"]
B7 --> B8["Known, handled"]
end
The chaos-engineering reference in the diagram above is covered in detail in [the chaos engineering canonical module](../../disciplines/reliability-security/chaos-engineering/module-1.1-chaos-principles/). This is the difference between hoping and engineering. Between luck and reliability.
Your users don’t care about your architecture. They don’t care about your tech stack. They don’t care that you’re using Kubernetes or that your microservices are beautifully decoupled. They care about exactly one thing:
Does it work when I need it?
That question seems simple. But answering it systematically—measuring it, designing for it, trading off against other goals—that’s the discipline of reliability engineering.
For a trading platform, each minute of downtime during market hours can cost millions in trades.
This module teaches you to think about reliability systematically—not as “we hope it doesn’t break” but as an engineering discipline with clear metrics, trade-offs, and design principles.
The Bridge Analogy
Civil engineers don’t say “we hope this bridge doesn’t collapse.” They calculate loads, specify materials, add safety factors, and design for specific failure scenarios. They know exactly what wind speed will cause problems, what weight the bridge can bear, and what happens if a cable snaps.
Software reliability engineering applies the same rigor to systems: understand the failure modes, design for them, measure the results. The question isn’t “will it fail?” but “how will it fail, and what happens when it does?”
At 11 nines of durability, storing 10 million objects means losing a single object about once every 10,000 years on average.
How they achieve it:
• Multiple copies across multiple data centers
• Automatic integrity checking
• Self-healing when corruption is detected
• Geographic distribution
Did You Know?
Amazon S3’s published 11-nines durability target implies an extremely low expected rate of object loss. That’s not availability—S3 can be temporarily unavailable while still being durable. Your data is safe; you just can’t access it right now.
This distinction matters enormously. When S3 has an “outage,” your files aren’t being deleted—they’re just temporarily inaccessible. Durability and availability are independent properties.
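The arithmetic behind an 11-nines durability figure can be sketched in a few lines. This is a rough illustration, not a description of how any provider computes its target: it assumes the figure is an annual per-object loss probability of 1e-11 and that losses are independent. The function names are made up for the example.

```python
# Assumption: 11 nines of durability is read as an annual per-object
# loss probability of 1e-11, with losses treated as independent events.
ANNUAL_LOSS_PROBABILITY = 1e-11


def expected_annual_losses(object_count: int,
                           loss_probability: float = ANNUAL_LOSS_PROBABILITY) -> float:
    """Expected number of objects lost per year for a given object count."""
    return object_count * loss_probability


def years_per_lost_object(object_count: int,
                          loss_probability: float = ANNUAL_LOSS_PROBABILITY) -> float:
    """Average number of years between single-object losses."""
    return 1.0 / expected_annual_losses(object_count, loss_probability)


if __name__ == "__main__":
    for count in (10_000, 10_000_000, 10_000_000_000):
        years = years_per_lost_object(count)
        print(f"{count:>14,} objects: about {years:,.0f} years per lost object")
```

Note that nothing in this calculation says anything about whether the objects can be read at a given moment, which is the availability question.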
C1["User clicks 'Checkout' → Site loads → Payment works → Success page!"] --> C2["... 2 days later ...<br/>'Where's my order? No confirmation email. No order in history.'"]
C2 --> C3["Result: FAILED ❌<br/>User thinks: 'Their site is broken'"]
end
This is why we need to measure all three dimensions—because any failure mode leads to the same user outcome.
Try This (2 minutes)
Think about an app you use daily. Recall a time it failed you. Was it:
An availability failure (couldn’t connect)?
A reliability failure (connected but got an error)?
A durability failure (your data was lost)?
Understanding the failure type helps identify the fix.
Bonus: Think about how you, as a user, responded. Did you retry? Give up? Switch to a competitor? That’s the business cost of unreliability.
When engineers talk about reliability, they talk about “nines.” You’ll hear phrases like “we need five nines” or “we’re only at three nines.” What does this mean?
The “nines” are a shorthand for the number of 9s in the reliability percentage:
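The downtime that each level of nines allows can be worked out directly. The sketch below assumes a 365-day year and a 30-day month (43,200 minutes); the helper names are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600; ignores leap years
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200; assumes a 30-day month


def availability_for_nines(nines: int) -> float:
    """Availability fraction for a count of nines, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def allowed_downtime_minutes(nines: int, period_minutes: int) -> float:
    """Minutes of downtime permitted in the period at this level of nines."""
    return period_minutes * 10.0 ** (-nines)


if __name__ == "__main__":
    for nines in range(1, 6):
        pct = availability_for_nines(nines) * 100
        per_year = allowed_downtime_minutes(nines, MINUTES_PER_YEAR)
        per_month = allowed_downtime_minutes(nines, MINUTES_PER_MONTH)
        print(f"{nines} nines ({pct:.3f}%): "
              f"{per_year:,.1f} min/year, {per_month:,.2f} min/month")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the step from three to four nines changes how a team has to operate.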
Here’s the uncomfortable truth every engineering organization faces: you can’t have maximum reliability AND maximum velocity. There’s a fundamental tension.
flowchart LR
R["MAXIMUM RELIABILITY<br/>• More testing<br/>• Longer review cycles<br/>• Canary deployments<br/>• Conservative changes<br/>• Expensive infra<br/>• 24/7 on-call"] <--> V["MAXIMUM VELOCITY<br/>• Less testing<br/>• Quick reviews<br/>• Deploy when ready<br/>• Big bang changes<br/>• Cheap infra<br/>• 'We'll fix in prod'"]
Examples on the Spectrum:
Medical Device: “Zero tolerance” (Maximum Reliability)
Banks: “Must be trustworthy”
Airlines: “Regulated but needs innovation”
E-commerce: “Revenue depends on uptime”
Internal Tools: “Annoying but not critical”
Startup MVP: “Speed is survival”
Hackathon Project: “Ship something by 5pm” (Maximum Velocity)
Cost grows EXPONENTIALLY. At some point, another nine costs more than it’s worth. That breakeven point is different for every system.
Did You Know?
Google’s Chubby lock service intentionally introduces planned outages. Why? To ensure that dependent services don’t accidentally build assumptions about 100% availability.
If Chubby were “too reliable,” services would build implicit dependencies on it always being there. Then, when Chubby eventually had an unplanned outage, those services would fail catastrophically because they had never been tested against Chubby being down.
By being deliberately unreliable (within SLA), Chubby forces dependent services to handle its failures gracefully. Controlled unreliability builds resilience.
This is brilliant: your dependencies can be TOO reliable if their reliability leads you to skip handling their failure.
War Story: The 99.99% Promise
Teams sometimes promise availability numbers that their current incident response capability cannot realistically support.
In hardware-heavy domains, MTBF is often tracked in usage-based units such as operating hours. This reflects usage exposure rather than calendar time.
One of the earliest software reliability models was developed by John Musa at Bell Labs in 1975. He applied hardware reliability mathematics to software and helped found the field of software reliability engineering. His insight: software failures follow statistical patterns just like hardware failures.
Netflix popularized the “Chaos Monkey” approach — see the chaos engineering canonical module for the full story. The principle that matters here: if you haven’t tested a failure, you don’t know if you can survive it.
The Space Shuttle flew with five general-purpose computers. Four ran the same primary flight software, and the fifth ran a backup flight system written independently by a different team. Why? A single bug shared by all four primary computers could kill astronauts, so the fifth protected against systematic software errors. This is the ultimate “defense in depth.”
Teams often set internal reliability objectives stricter than customer-facing SLAs. This creates buffer for multi-service dependency chains and unexpected failures.
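The buffer that dependency chains require can be estimated with a simple product. This sketch assumes every service sits in the critical path and fails independently, which real systems rarely satisfy exactly; the function names are hypothetical.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a request that needs every service in the chain."""
    return math.prod(availabilities)


def required_per_service(target: float, service_count: int) -> float:
    """Per-service availability needed for the whole chain to hit the target."""
    return target ** (1.0 / service_count)


if __name__ == "__main__":
    chain = [0.999] * 5  # five services, each at three nines
    print(f"Chain availability: {serial_availability(chain):.4%}")
    print(f"Per-service target for a 99.9% chain: "
          f"{required_per_service(0.999, 5):.4%}")
```

Five services at three nines each deliver less than three nines end to end, so each one needs an internal objective tighter than the figure promised to customers.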
1. Scenario: You are reviewing the metrics for a newly launched photo-sharing service. The dashboard proudly displays 99.9% availability for the month, but the customer support queue is flooded with complaints about failed uploads. You dig deeper and discover the reliability success rate is only 95%. What exactly are the users experiencing?
Answer
Users are experiencing a service that is almost always reachable, but frequently fails to process their requests. Because the availability is 99.9%, the servers are online and accepting connections almost all the time (down only ~43 minutes a month). However, the 95% reliability means that when users attempt an action, like uploading a photo, 5 out of every 100 attempts result in an error or failure. The combined effect (99.9% × 95%) means the actual user success rate is only 94.9%. This scenario highlights why measuring only availability is dangerous; it creates a false sense of security while users suffer through persistent partial failures and bugs.
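The combined figure in this answer can be reproduced with a short calculation. It assumes availability and per-request success are independent, so the two rates multiply; the function names are illustrative.

```python
def user_success_rate(availability: float, request_success_rate: float) -> float:
    """Chance a user's attempt succeeds: the service must be reachable AND the request must work."""
    return availability * request_success_rate


def expected_failures(attempts: int, availability: float,
                      request_success_rate: float) -> float:
    """Expected number of failed attempts out of the given total."""
    return attempts * (1.0 - user_success_rate(availability, request_success_rate))


if __name__ == "__main__":
    rate = user_success_rate(0.999, 0.95)
    print(f"End-to-end success rate: {rate:.3%}")
    print(f"Expected failures per 10,000 uploads: "
          f"{expected_failures(10_000, 0.999, 0.95):.1f}")
```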
2. Scenario: The sales team at your SaaS company just closed a massive enterprise contract by promising a 99.99% availability SLA. The engineering team currently has an average Time to Detect (MTTD) of 5 minutes and an average Time to Fix (MTT-Fix) of 15 minutes. Calculate your monthly error budget in minutes and explain why this new contract puts the company in severe danger.
Answer
The monthly error budget for a 99.99% SLA is approximately 4.32 minutes (43,200 minutes per month × 0.0001). This contract puts the company in extreme danger because a single average incident takes 20 minutes to resolve (5 minutes to detect + 15 minutes to fix). One typical incident would therefore consume more than four and a half months’ worth of error budget (20 ÷ 4.32 ≈ 4.6), likely triggering financial SLA penalties. Achieving 99.99% requires fully automated detection and recovery mechanisms that resolve issues in sub-minute timeframes, which the current team clearly lacks. Promising this level of availability without the operational maturity to support it is a guaranteed recipe for losing money.
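The same numbers can be checked with a small script. It assumes a 30-day month (43,200 minutes) and treats incident duration as detection time plus fix time; the names are illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200; assumes a 30-day month


def monthly_error_budget_minutes(sla: float,
                                 minutes_in_month: int = MINUTES_PER_MONTH) -> float:
    """Minutes of downtime the SLA tolerates in one month."""
    return minutes_in_month * (1.0 - sla)


def months_of_budget_consumed(incident_minutes: float, sla: float) -> float:
    """How many months of error budget one incident of this length uses up."""
    return incident_minutes / monthly_error_budget_minutes(sla)


if __name__ == "__main__":
    detect_minutes, fix_minutes = 5, 15
    incident = detect_minutes + fix_minutes
    budget = monthly_error_budget_minutes(0.9999)
    print(f"Monthly budget at 99.99%: {budget:.2f} minutes")
    print(f"One {incident}-minute incident uses "
          f"{months_of_budget_consumed(incident, 0.9999):.1f} months of budget")
```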
3. Scenario: You must choose between two database architectures. Architecture Alpha crashes rarely (MTBF = 500 hours) but requires manual intervention to restore, taking 30 minutes (MTTR). Architecture Beta crashes much more frequently due to aggressive preemptive node cycling (MTBF = 100 hours), but it has an automated failover that restores service in exactly 5 minutes (MTTR). Which architecture provides higher overall availability to the end user?
Answer
Architecture Beta provides higher overall availability (99.92%) compared to Architecture Alpha (99.90%). To find this, we calculate availability using the formula MTBF / (MTBF + MTTR). For Alpha, 500 / (500 + 0.5) equals 99.90%. For Beta, 100 / (100 + 0.083) equals 99.92%. This counterintuitive result demonstrates the immense power of optimizing for fast recovery over failure prevention. Even though Beta fails five times as often, its fully automated, rapid recovery means users ultimately experience less total downtime over the long run.
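The comparison can be reproduced from the formula. This sketch uses the steady-state form MTBF / (MTBF + MTTR) with both values in hours; the function name and the printed labels are illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    alpha = availability(mtbf_hours=500, mttr_hours=30 / 60)  # 30-minute manual restore
    beta = availability(mtbf_hours=100, mttr_hours=5 / 60)    # 5-minute automated failover
    print(f"Alpha: {alpha:.4%}")
    print(f"Beta:  {beta:.4%}")
    print("Higher availability:", "Beta" if beta > alpha else "Alpha")
```

Cutting MTTR by a factor of six more than offsets a fivefold increase in failure frequency, which is why Beta comes out ahead.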
4. Scenario: A product manager hands you a specification for a new microservice. The document states: “Requirement: The payment processing API must be 99.9% reliable.” As a reliability engineer, you push back and ask them to rewrite it. Why is this requirement unusable, and what is it missing?
Answer
This requirement is unusable because it is too vague to be measured or engineered against. A proper reliability statement must precisely define three components: the intended function, the specified period, and the stated conditions. In this case, it fails to define what constitutes a “successful” API response (e.g., must it return within 500ms? Is a 500 error a failure?). It also fails to specify the time window for measurement (e.g., over a rolling 30-day window) and the conditions under which the guarantee holds (e.g., under normal load up to 1000 TPS, excluding planned maintenance). Without these specifics, engineers and stakeholders will constantly disagree on whether the system is actually “reliable” when an edge-case incident occurs.
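One way to make such a requirement concrete is to write the three components down as data. The sketch below is a hypothetical structure, not a standard format. The field names, the example function description, and the rule that only 5xx responses or slow responses count as failures are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilityRequirement:
    function: str                    # the intended function being measured
    target_success_ratio: float      # e.g. 0.999
    window_days: int                 # the specified period (rolling window)
    max_latency_ms: int              # slower responses count as failures
    max_load_tps: int                # stated condition: load ceiling
    excludes_planned_maintenance: bool

    def is_success(self, status_code: int, latency_ms: float) -> bool:
        # Assumption: 5xx responses and responses over the latency limit are failures.
        return status_code < 500 and latency_ms <= self.max_latency_ms

    def is_met(self, good_requests: int, total_requests: int) -> bool:
        # With no traffic in the window there is nothing to violate.
        if total_requests == 0:
            return True
        return good_requests / total_requests >= self.target_success_ratio


payment_api = ReliabilityRequirement(
    function="authorize a card payment",  # hypothetical example
    target_success_ratio=0.999,
    window_days=30,
    max_latency_ms=500,
    max_load_tps=1000,
    excludes_planned_maintenance=True,
)
```

With the definition written this way, a disagreement about an edge case becomes a question about one specific field rather than about what “reliable” means.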
5. Scenario: It is the third week of the month, and your team’s service has suffered a few rocky deployments, consuming 38 minutes of your 43.2-minute monthly error budget. The product team is pressuring you to deploy a massive, highly anticipated feature update before the weekend. According to error budget policies, how should you handle this situation?
Answer
You should halt the deployment of the major feature until the next month when the error budget resets. With only 5.2 minutes of budget remaining, you are deep in the warning zone, meaning any slight hiccup during this risky deployment will push the service over its SLA limit. The error budget exists specifically to remove emotion from these decisions; it is an agreed-upon contract that dictates feature freezes when reliability is threatened. Instead of deploying the risky feature, the team should spend the rest of the month deploying small, low-risk patches or focusing entirely on reliability and technical debt improvements. If the business decides the feature must go out regardless, it requires explicit executive sign-off acknowledging the accepted breach of the reliability targets.
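An error budget policy like this can be expressed as a simple rule. The sketch below is illustrative only: the 25% freeze threshold is a hypothetical value, and real policies are agreed between engineering and product rather than hard-coded.

```python
# Hypothetical policy: hold risky changes once less than 25% of the budget remains.
FREEZE_BELOW_FRACTION = 0.25


def remaining_budget_minutes(budget_minutes: float, consumed_minutes: float) -> float:
    """Error budget left in the current period, never below zero."""
    return max(budget_minutes - consumed_minutes, 0.0)


def release_decision(budget_minutes: float, consumed_minutes: float,
                     risky_change: bool) -> str:
    """Return 'hold' for risky changes when the remaining budget is thin, else 'proceed'."""
    remaining = remaining_budget_minutes(budget_minutes, consumed_minutes)
    remaining_fraction = remaining / budget_minutes
    if risky_change and remaining_fraction < FREEZE_BELOW_FRACTION:
        return "hold"
    return "proceed"


if __name__ == "__main__":
    print(f"Remaining: {remaining_budget_minutes(43.2, 38.0):.1f} minutes")
    print("Major feature:", release_decision(43.2, 38.0, risky_change=True))
    print("Small patch:  ", release_decision(43.2, 38.0, risky_change=False))
```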
“Site Reliability Engineering” - Google (free online). Chapters 1-4 cover reliability fundamentals from the team that coined “SRE.” The definitive text on modern reliability engineering.
“Release It! Design and Deploy Production-Ready Software” - Michael Nygard (2nd edition). Practical patterns for building reliable systems. Every pattern has war stories from real production failures.
“The Checklist Manifesto” - Atul Gawande. How checklists improve reliability in complex domains (aviation, surgery)—surprisingly applicable to incident response.
Papers:
“How Complex Systems Fail” - Richard Cook. A 5-page paper that every reliability engineer should read. Describes why failures happen and why hindsight is misleading.
“On Designing and Deploying Internet-Scale Services” - James Hamilton. Classic paper on building reliable services at scale. Written in 2007 but still relevant.
Talks:
“Mastering Chaos: A Netflix Guide to Microservices” - Josh Evans (YouTube). How Netflix builds reliability into their microservices architecture.
“Building Reliability In” - John Allspaw. Thoughtful exploration of what reliability actually means in practice.
Module 2.2: Failure Modes and Effects - Now that you understand what reliability means, learn how systems actually fail. Understanding failure modes is the first step to designing for reliability.