The incident begins with a single overheating server in a Virginia data center. Temperature sensors trigger automatic failover—exactly as designed. The affected workloads shift to other servers. So far, everything is working perfectly.
Pause and predict: If you were the engineer on call during this AWS incident, what would be the first metric you’d look at, and would it have helped you diagnose the root cause?
But here’s where it gets interesting.
The failover causes a spike in network traffic. The spike triggers rate limiters on internal services—a safety mechanism. But those rate limiters are a bit too aggressive. They start throttling legitimate traffic. Services that depend on those throttled services start timing out. Those timeouts trigger retries. The retries create more traffic. More rate limiting. More timeouts. More retries.
Within minutes, a single overheating server has cascaded into a multi-hour outage affecting AWS S3, EC2, and Lambda in the US-East-1 region. Thousands of companies are down. Reddit. Slack. Twitch. iRobot’s Roomba vacuums won’t start. Dog doors won’t open. Smart toilets won’t flush.
flowchart TD
subgraph Initial["INITIAL TRIGGER (11:42 AM)"]
direction TB
A["Single server overheats"] --> B["Automatic failover (WORKS CORRECTLY)"]
E -->|"Service timeouts"| F["Timeouts trigger retries"]
F -->|"MORE traffic"| G["More rate limiting"]
G -->|"More timeouts"| F
end
subgraph Collapse["TOTAL COLLAPSE (11:50 AM onwards)"]
direction TB
H["S3: DEGRADED"]
I["EC2: DEGRADED"]
J["Lambda: DOWN"]
H --> K["Websites won't load"]
I --> L["Applications can't start"]
J --> M["Serverless functions fail"]
end
C --> D
G -.-> H
G -.-> I
G -.-> J
Every individual component worked exactly as designed. The failover worked. The rate limiters worked. The retry logic worked. But together, they created a catastrophe.
This is why understanding failure modes matters. The question isn’t “will it fail?” but “HOW will it fail—and what happens next?”
Understanding failure modes lets you design systems that fail gracefully instead of catastrophically. The difference between a minor incident and a company-ending outage is often not whether failures happen, but how they propagate.
Consider two architectures:
| Architecture | When Database Slows Down | Result |
|---|---|---|
| Tightly coupled | All services wait → timeouts cascade → retries multiply → total outage | |
Same trigger. 400x difference in business impact. The only difference is how failure modes were designed.
The Car Analogy
Modern cars have multiple failure modes designed in. Run out of fuel? The engine stops but the steering and brakes still work. Battery dies? The car stops but the doors still open. Brake line leaks? There’s a second brake circuit.
Engineers didn’t just hope these systems wouldn’t fail—they specifically designed what happens when they do. Your software needs the same intentional design.
Not all failures are equal. A crash is different from corruption. A timeout is different from an error. Understanding the type of failure guides your response.
flowchart TD
subgraph Visibility["BY VISIBILITY: Can You Tell It Failed?"]
direction TB
subgraph Obvious["OBVIOUS"]
direction TB
O1["• Process crash<br>• 500 error response<br>• Connection refused<br>• Timeout<br>• Error in logs"]
Stop and think: How often does your current team formally brainstorm what could go wrong before deploying a new feature? Is it documented, or just discussed casually?
Assemble the team - Include people who know the system deeply
Define the scope - What system or feature are you analyzing?
Create a system map - Draw components and dependencies
Brainstorm failure modes - For each component, ask “how could this fail?”
Trace effects - Follow each failure through the system
Score and prioritize - Use RPN to focus effort
Plan mitigations - Design specific responses
Review regularly - FMEA is not one-time; systems change
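The scoring step above can be sketched in a few lines. This is a minimal illustration, assuming the common FMEA convention that RPN = Severity × Occurrence × Detection, that each factor is scored from 1 to 10, and that a higher Detection score means the failure is harder to catch. If your team's scale runs the other way, invert it before multiplying. The `FailureMode` class, the `prioritize` helper, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always caught) .. 10 (effectively undetectable); assumed scale

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1..10 range.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be 1..10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: the product of the three scores."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative scores only; real scores come out of the team's FMEA session.
modes = [
    FailureMode("Process crash", severity=6, occurrence=4, detection=1),
    FailureMode("Silent data corruption", severity=9, occurrence=2, detection=9),
    FailureMode("Slow dependency", severity=5, occurrence=6, detection=4),
]
for mode in prioritize(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

With these example scores, the silent corruption case ranks first even though it is the least frequent, because it is both severe and hard to detect.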
Did You Know?
FMEA was developed by the U.S. military in the 1940s and was later adopted by NASA for the Apollo program. NASA required contractors to perform FMEA on all mission-critical systems. The technique helped identify and mitigate thousands of potential failures before they could endanger astronauts.
The Apollo 13 Survival Story: When an oxygen tank exploded on Apollo 13, the crew survived because FMEA had identified and mitigated thousands of failure scenarios. The procedures they used—venting to reduce pressure, routing power through specific pathways, using the lunar module as a “lifeboat”—were all documented because engineers had asked “what if?” for every component.
Stop and think: Think about your favorite streaming app. What happens when your internet connection drops to 1 bar? Does it show an error, or does it lower the video quality to 480p? That is graceful degradation in action.
Pause and predict: If the primary database in your system crashed right now, which services would survive? Would your users still be able to perform read-only tasks?
Feature flags - Disable problematic features quickly
Geographic isolation - Failures in one region don’t affect others
Service isolation - Each service fails independently
Data isolation - Separate databases for critical vs. non-critical data
War Story: The Shared Database That Took Down Everything
A fintech startup ran all their microservices against a shared PostgreSQL database. “It’s simpler,” they said. “We can do transactions across services.”
Then, on a random Tuesday, a developer added a new analytics query to the reporting service. The query was correct, but it ran a full table scan on a 50-million-row table. Without an index. Under normal load.
The fix: separate databases for separate services, with clear ownership. Reporting now has its own read replica. A reporting bug can’t take down checkout. Each service’s database failure is isolated to that service.
Stop and think: Have you ever written a simple while(retries < 3) loop in your code without adding a delay? You might have accidentally built a thundering herd trigger.
The term “Byzantine failure” comes from the “Byzantine Generals Problem,” a thought experiment in which generals must agree on a plan even though some of them may be traitors who send conflicting messages. A Byzantine failure is when a system doesn’t just fail but provides wrong or inconsistent information, which makes it the hardest type to handle.
NASA’s Mars Climate Orbiter was lost in 1999 due to a failure mode that wasn’t analyzed: unit mismatch. One team used metric units, another used imperial. The $125 million spacecraft crashed because nobody did FMEA on the interface between teams.
Circuit breakers are named after electrical circuit breakers—devices that “break” (open) when current exceeds safe levels, preventing damage. The software pattern does the same: stops calling a failing service to prevent cascading damage.
The “Swiss Cheese” model from James Reason explains why complex systems fail: each defense layer has holes (like Swiss cheese slices), and failures occur when holes align. This is why defense in depth—multiple imperfect layers—is more effective than one “perfect” layer.
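The circuit breaker pattern mentioned above can be sketched as a small state machine. This is a simplified, single-threaded illustration rather than a production implementation: the class name, the thresholds, and the injectable `clock` parameter are assumptions made for the example, and a real service would more likely use an established resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, so let this trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat `CircuitOpenError` as a signal to serve a fallback.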
War Story: The Timeout That Was Too Long
A fintech startup set all their service timeouts to 30 seconds—a “safe” default. “Better to wait than fail,” they reasoned.
One day, a database query started taking 25 seconds instead of the usual 200ms. A missing WHERE clause caused a sequential scan. Not a bug—it still returned correct data. Just slowly.
timeline
title The 30-Second Timeout Death Spiral
Normal state : Request → Database (200ms) → Response : Connection pool 10/100
Second 1 : Query becomes slow (25s) : New requests arrive, start waiting : Connections 50/100
Second 10 : All connections waiting on slow queries : Connections 100/100 (exhausted)
Second 15 : New requests can't get connections : Queue starts filling : Memory climbing
Second 20 : Queue full : OOM pressure building : GC thrashing
Second 25 : OOM killer strikes : Pod dies : Kubernetes restarts it
Second 30 : Pod comes back : Hits slow database : Pool fills immediately : Death spiral resumes
The fix was embarrassingly simple: reduce timeouts to 2 seconds. A 25-second query now fails fast at 2 seconds, the circuit breaker opens, and the system degrades gracefully (returning cached data or an error) instead of collapsing.
Lesson: The most dangerous failures aren’t outages; they’re systems that are “almost working.” A timeout so long that it never triggers offers no more protection than having no timeout at all, and it adds a false sense of safety.
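The fail-fast behavior from this story can be sketched with `asyncio.wait_for` from the Python standard library. The `slow_query` coroutine, the `CACHE` dictionary, and the timeout values are hypothetical stand-ins; the point is only that the caller gives up after a short, bounded wait and degrades instead of holding a connection for the full query time.

```python
import asyncio

# Hypothetical fallback data served when the database is too slow.
CACHE = {"dashboard": "cached dashboard (may be stale)"}


async def slow_query(key: str, delay: float) -> str:
    """Stand-in for a database call; `delay` simulates query latency."""
    await asyncio.sleep(delay)
    return f"fresh {key}"


async def fetch_with_timeout(key: str, delay: float, timeout: float = 2.0) -> str:
    try:
        # wait_for cancels the awaited call once `timeout` seconds have passed.
        return await asyncio.wait_for(slow_query(key, delay), timeout=timeout)
    except asyncio.TimeoutError:
        # Fail fast: free the caller and degrade to cached data or an error.
        return CACHE.get(key, "error: temporarily unavailable")


# Fast query: returns fresh data well inside the timeout.
print(asyncio.run(fetch_with_timeout("dashboard", delay=0.01)))
# Slow query: the short timeout fires and the cached value is returned.
print(asyncio.run(fetch_with_timeout("dashboard", delay=0.5, timeout=0.05)))
```

Cancelling the client-side wait does not necessarily stop work already running on the database, so a server-side statement timeout is usually a sensible companion to this pattern.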
You are investigating two different alerts. The first is a database timeout that occurred at 2:14 AM and hasn’t repeated since. The second is an image processing service that throws a “file corrupted” error on roughly 2% of user uploads, but works perfectly if the user immediately uploads the same file again. How would you categorize these two failures, and how should your response differ?
Answer
The first alert is a transient failure, while the second is an intermittent failure.
Transient failures occur once and then resolve on their own, like a momentary network hiccup or a brief routing delay. Because they don’t persist, the system can usually recover simply by retrying the operation with exponential backoff.
Intermittent failures, on the other hand, occur unpredictably—sometimes the system works, sometimes it doesn’t, with no clear pattern. These are much harder to debug because you can’t reliably reproduce them. In this scenario, retrying might occasionally succeed (masking the problem), but the correct response is to add detailed logging to capture the exact state when the error occurs, as it often points to an underlying issue like a race condition or resource exhaustion on a specific node.
During an FMEA session for a new financial reporting microservice, your team evaluates a failure mode where floating-point rounding errors could alter daily revenue totals. The team gives this a Severity of 9, but a Detection score of 2. Why is this specific combination of scores considered a “critical” risk that requires immediate architectural changes?
Answer
A high severity combined with low detection means the failure has a major impact AND you won’t know it’s happening until significant damage has already been done.
In this financial reporting scenario, the failure is “silent.” The system continues to operate and return 200 OK responses, but it is generating corrupted data. Because the detection score is so low, no automated alarms are firing, meaning the business will continue making decisions based on incorrect revenue numbers until someone notices during an audit months later.
By the time you notice, the corruption has spread and fixing it requires massive manual reconciliation. These silent failure modes often yield a very high Risk Priority Number (RPN) and should be prioritized for immediate mitigation—such as adding robust reconciliation checks, strict data validation, or double-entry verification—even if you cannot eliminate the underlying likelihood of the error entirely.
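A small sketch shows both the silent error and one possible detection mechanism. It assumes amounts are stored as decimal strings and uses Python's standard `decimal` module; the `reconcile` function and the sample `entries` are hypothetical, meant only to illustrate an independent re-sum that turns a silent failure into a detectable one.

```python
from decimal import Decimal

entries = ["0.10", "0.20"]  # hypothetical ledger amounts, stored as strings

# Accumulating with binary floating point introduces a tiny rounding error.
float_total = 0.0
for amount in entries:
    float_total += float(amount)

# Accumulating with Decimal keeps the exact decimal values.
decimal_total = sum((Decimal(a) for a in entries), Decimal("0"))


def reconcile(reported, ledger_amounts) -> bool:
    """Return True when a reported total matches an exact re-sum of the ledger."""
    expected = sum((Decimal(a) for a in ledger_amounts), Decimal("0"))
    return Decimal(str(reported)) == expected


print(float_total == 0.3)                  # False: the float total has drifted
print(reconcile(float_total, entries))     # False: the check catches the drift
print(reconcile(decimal_total, entries))   # True
```

Running a check like this on every report, and alerting when it fails, is one way to improve detection even when the underlying arithmetic bug has not yet been found.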
Your e-commerce platform’s “Recommended Products” service suddenly experiences a 30-second delay due to a bad machine learning model deployment. Within minutes, users can’t even log in or view their shopping carts, and the entire site goes down. How could implementing the bulkhead pattern have prevented this total outage?
Answer
Implementing the bulkhead pattern would have isolated the resources used by the “Recommended Products” service from the rest of the application, dramatically reducing the blast radius of the failure.
Without bulkheads, all services likely share the same global connection or thread pool. When the recommendations service slowed down, incoming requests piled up and consumed all available threads waiting for a response, starving critical services like login and cart management.
By using bulkheads, you divide those resources into separate, isolated pools—just like watertight compartments in a ship. The recommendation service would have exhausted its dedicated thread pool and failed, but the login and cart services would still have access to their own dedicated resources. This ensures that a non-critical feature failure doesn’t sink the entire platform, turning a total outage into a graceful degradation scenario.
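One lightweight way to sketch a bulkhead is a separate concurrency limit per dependency. The example below uses `threading.BoundedSemaphore` from the standard library; the `Bulkhead` class, the pool sizes, and the choice to reject immediately rather than queue are assumptions made for illustration.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject immediately instead of queueing callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical limits: each service gets its own isolated compartment.
recommendations = Bulkhead("recommendations", max_concurrent=2)
login = Bulkhead("login", max_concurrent=10)

# Simulate two recommendation calls that are stuck waiting on the slow model.
with recommendations.slot(), recommendations.slot():
    try:
        with recommendations.slot():
            pass
    except BulkheadFullError as exc:
        print(exc)  # recommendations bulkhead is full
    with login.slot():
        print("login still served")
```

Because the recommendations compartment fills up on its own, the extra recommendation request is rejected while login keeps its full capacity.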
An internal authentication service usually handles 500 requests per second. During a minor network blip, the service briefly slows down, causing some client requests to time out. Suddenly, traffic spikes to 2,500 requests per second, CPU usage hits 100%, and the service crashes completely. What failure pattern just occurred, and what specific client-side changes are needed to prevent it?
Answer
The service just experienced a retry storm, which occurs when a system slowdown triggers a destructive positive feedback loop.
When the service initially slowed down and requests timed out, the clients aggressively retried their failed requests. This multiplied the load on a service that was already struggling (here, from 500 to 2,500 requests per second, a fivefold increase), which slowed it down even further, leading to more timeouts, more retries, and an overwhelming load that eventually killed the service. What started as a minor, recoverable slowdown was amplified into a total outage by the clients’ “helpful” retry behavior.
To prevent this, clients must implement exponential backoff (waiting progressively longer between retries) and jitter (adding randomness to the retry delay so clients don’t all retry at the exact same millisecond). Additionally, implementing a circuit breaker or strict retry budgets will ensure clients stop hammering a service that is clearly unresponsive.
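The client-side changes described above can be sketched as follows. This is a minimal illustration of exponential backoff with "full jitter" and a fixed retry limit; the function names, default values, and injectable `sleep` and `rng` parameters are assumptions for the example. A real client would normally retry only errors known to be retryable rather than every exception.

```python
import random
import time


def backoff_delays(max_retries=3, base=0.1, cap=5.0, rng=random.random):
    """Yield jittered delays: a random fraction of min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(func, max_retries=3, base=0.1, cap=5.0, sleep=time.sleep):
    """Call func once, then retry up to max_retries times with jittered backoff."""
    delays = backoff_delays(max_retries, base, cap)
    while True:
        try:
            return func()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retry limit reached: surface the error instead of hammering
            sleep(delay)  # wait a randomized, growing interval before retrying
```

The randomness spreads retries from many clients over time, and the hard limit bounds how much extra traffic any one client can add to a struggling service.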
Books:
“Release It! Second Edition” - Michael Nygard. Essential reading on stability patterns including circuit breakers, bulkheads, and timeouts. Every chapter is a war story.
“Failure Mode and Effects Analysis” - D.H. Stamatis. Comprehensive guide to FMEA technique from the manufacturing world.
Papers:
“How Complex Systems Fail” - Richard Cook. A 5-page paper on why FMEA alone isn’t enough—complex systems fail in unexpected ways. Required reading.
“Metastable Failures in Distributed Systems” - Nathan Bronson et al. (Facebook/Meta). Deep dive into failure patterns that can sustain themselves even after the trigger is removed.
Talks:
“Breaking Things on Purpose” - Kolton Andrus (Gremlin). How to build confidence through deliberate failure injection.