
Module 1.6: Postmortems and Learning

Discipline Module | Complexity: [MEDIUM] | Time: 30-35 min

Before starting this module:


After completing this module, you will be able to:

  • Lead blameless postmortem meetings that surface systemic causes rather than individual fault
  • Design action items that address root causes and prevent incident recurrence
  • Build a postmortem culture where learning from failure becomes a competitive advantage
  • Analyze postmortem trends across incidents to identify organizational reliability patterns

An incident happened. You fixed it. Everyone’s exhausted.

What happens next determines whether you’re truly learning or just fighting the same fires forever.

Without postmortems: Incidents repeat. Knowledge stays in people’s heads. Teams blame each other. Nothing improves.

With postmortems: Every incident makes the system stronger. Lessons are documented. Teams learn together. The same incident is much less likely to happen twice.

This module teaches you how to conduct blameless postmortems that turn failures into fuel for improvement.


A postmortem is a structured analysis of an incident after it’s resolved.

Purpose:

  • Understand what happened
  • Identify contributing factors
  • Prevent recurrence
  • Share learnings broadly

Not purpose:

  • Assign blame
  • Punish individuals
  • Document for legal defense
  • Check a compliance box

| Trigger | Postmortem? |
|---------|-------------|
| SEV-1 incident | Almost always |
| SEV-2 incident | Usually |
| SEV-3 incident | If caused by an interesting failure mode |
| Near miss | Often (we got lucky) |
| Recurring issue | Yes (why does this keep happening?) |
| Customer escalation | Usually |
| Data loss | Almost always |

The threshold should be low enough to capture learning, high enough to not overwhelm the team.
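
One way to keep the threshold consistent is to encode the trigger table as a small policy function. The sketch below is only an illustration: the `Incident` fields and the three labels ("required", "recommended", "optional") are assumptions that compress the table's "almost always", "usually", and "often" into coarse buckets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    severity: Optional[int] = None          # 1, 2, or 3 for SEV-1..SEV-3; None for a near miss
    near_miss: bool = False
    recurring: bool = False
    customer_escalation: bool = False
    data_loss: bool = False
    interesting_failure_mode: bool = False


def postmortem_recommendation(incident: Incident) -> str:
    """Map the trigger table onto 'required', 'recommended', or 'optional'."""
    # "Almost always" and "Yes" rows
    if incident.severity == 1 or incident.data_loss or incident.recurring:
        return "required"
    # "Usually" and "Often" rows
    if incident.severity == 2 or incident.customer_escalation or incident.near_miss:
        return "recommended"
    # SEV-3 only when the failure mode is interesting
    if incident.severity == 3 and incident.interesting_failure_mode:
        return "recommended"
    return "optional"
```

A team would tune these buckets to its own incident volume; the point is that the decision is written down rather than renegotiated after every incident.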


The foundation of effective postmortems is blamelessness.

With blame:

  • People hide information to protect themselves
  • Root causes stay hidden
  • Changes don’t prevent recurrence
  • Fear dominates, innovation dies

Without blame:

  • People share freely, knowing they’re safe
  • True root causes emerge
  • Changes actually prevent recurrence
  • Psychological safety enables improvement

Blameless doesn’t mean:

  • No accountability
  • No consequences for negligence
  • Ignoring patterns of behavior
  • Pretending humans don’t make mistakes

Blameless means:

  • Focus on systems, not individuals
  • Assume good intentions
  • Ask “why did the system allow this?” not “who screwed up?”
  • Build better systems, not blame better individuals

Every incident has two stories:

First story (blame narrative):

"Alice pushed a bad config that took down production.
She should have been more careful."

Second story (systems narrative):

"A config change caused a production outage.
Why did our system allow a bad config to reach production?
Why didn't we have validation?
Why didn't we have automatic rollback?
Why didn't the review process catch it?
What about the system made the error likely?"

Stop and think: If an engineer deletes the wrong database table because the production and staging database terminal prompts look identical, what is the first story vs the second story? How does the second story lead to a structural fix?

The second story finds root causes. The first story finds scapegoats.


graph TD
D0["Day 0: Incident occurs, resolved"] --> D1["Day 1-2: Initial draft written by incident responders"]
D1 --> D3["Day 3-4: Draft reviewed by participants"]
D3 --> D5["Day 5: Postmortem meeting held"]
D5 --> D7["Day 5-7: Final version published"]
D7 --> Ongoing["Ongoing: Action items tracked to completion"]
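
The timeline above can be turned into concrete deadlines once the incident date is known. This is a minimal sketch that assumes calendar days (not business days) and uses the last day of each range as the deadline; the function and constant names are made up for illustration.

```python
from datetime import date, timedelta

# Day offsets taken from the timeline; ranges use their last day as the deadline.
MILESTONES = [
    ("Initial draft written", 2),
    ("Draft reviewed by participants", 4),
    ("Postmortem meeting held", 5),
    ("Final version published", 7),
]


def postmortem_schedule(incident_day: date) -> list[tuple[str, date]]:
    """Return (milestone, deadline) pairs counted in calendar days from Day 0."""
    return [
        (name, incident_day + timedelta(days=offset))
        for name, offset in MILESTONES
    ]


schedule = postmortem_schedule(date(2024, 2, 1))
```

For an incident resolved on 2024-02-01, the meeting falls on 2024-02-06 and the final version is due by 2024-02-08.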

What to collect:

  • Timeline of events
  • Alert history
  • Metrics and dashboards
  • Chat logs from incident channel
  • Customer reports
  • Deployment history

Tip: Do this while memories are fresh. Waiting a week loses detail.
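
Once the raw material is collected, the first job is merging it into a single ordered timeline. The sketch below assumes each source has already been exported as (ISO timestamp, description) pairs; the source names, dates, and the `build_timeline` helper are hypothetical.

```python
from datetime import datetime


def build_timeline(sources: dict[str, list[tuple[str, str]]]) -> list[str]:
    """Merge (ISO timestamp, description) pairs from each source into one ordered list."""
    events = [
        (datetime.fromisoformat(ts), name, text)
        for name, entries in sources.items()
        for ts, text in entries
    ]
    # Sort on the timestamp only, so events with equal times keep their source order.
    events.sort(key=lambda event: event[0])
    return [f"{ts:%H:%M} [{name}] {text}" for ts, name, text in events]


# Illustrative data, deliberately out of order.
sources = {
    "chat": [("2024-02-01T14:10:00", "On-call acknowledges, begins investigation")],
    "alerts": [("2024-02-01T14:05:00", "Error rate increases, alerts fire")],
    "deploys": [("2024-02-01T14:00:00", "Deploy of version X.Y.Z begins")],
}
timeline = build_timeline(sources)
```

Tagging each line with its source makes it easier to spot gaps, such as a stretch where the chat log is busy but no alert fired.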

The Incident Commander or Tech Lead writes the initial draft.

Key sections:

  • Summary
  • Timeline
  • Impact
  • Contributing factors
  • Action items

(Full template below)

Who attends:

  • Incident responders
  • On-call engineers
  • Service owners
  • Stakeholders (optional for major incidents)

Pause and predict: Why does the timeline schedule the postmortem meeting for Day 5 instead of holding it immediately after the incident?

Meeting structure (60-90 minutes):

1. Read the timeline together (10 min)
   - Walk through events
   - Fill in gaps
   - Correct mistakes
2. Identify contributing factors (20 min)
   - What conditions led to this?
   - Apply "5 Whys"
   - Avoid single root cause
3. Discuss action items (20 min)
   - What would prevent recurrence?
   - Who owns each action?
   - When will they be done?
4. Share learnings (10 min)
   - What surprised us?
   - What did we learn?
   - Who else should know?

Publish and share:

  • Post final postmortem to shared location
  • Announce in relevant channels
  • Present at team/company meetings for SEV-1
  • Cross-link to related incidents

Action items are useless if not completed.

action_items:
  - item: "Add config validation in CI pipeline"
    owner: "@alice"
    due: "2024-02-15"
    status: "In Progress"
    priority: "P1"
  - item: "Add automatic rollback for config changes"
    owner: "@bob"
    due: "2024-02-28"
    status: "Not Started"
    priority: "P1"
  - item: "Document config change procedure"
    owner: "@carol"
    due: "2024-02-10"
    status: "Complete"
    priority: "P2"
Review action items weekly until all complete.
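
The weekly review is easier to sustain when overdue items surface automatically. Below is a minimal sketch that assumes the action items have been loaded into Python dictionaries with the same fields as the YAML above; the `overdue_items` helper is hypothetical and not part of any particular tracking tool.

```python
from datetime import date

# Same fields as the YAML example above.
action_items = [
    {"item": "Add config validation in CI pipeline", "owner": "@alice",
     "due": "2024-02-15", "status": "In Progress", "priority": "P1"},
    {"item": "Add automatic rollback for config changes", "owner": "@bob",
     "due": "2024-02-28", "status": "Not Started", "priority": "P1"},
    {"item": "Document config change procedure", "owner": "@carol",
     "due": "2024-02-10", "status": "Complete", "priority": "P2"},
]


def overdue_items(items: list[dict], today: date) -> list[dict]:
    """Return items that are not complete and whose due date is before `today`."""
    return [
        item for item in items
        if item["status"] != "Complete"
        and date.fromisoformat(item["due"]) < today
    ]


# On 2024-02-20 only the CI validation item is overdue.
late = overdue_items(action_items, date(2024, 2, 20))
```

Passing `today` in explicitly keeps the check deterministic, which also makes it straightforward to run from a scheduled job or a test.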


Practice finding root causes with the 5 Whys technique:

Incident: Production database went down
Why #1: Why did the database go down?
→ It ran out of disk space
Why #2: Why did it run out of disk space?
→ Log files grew unexpectedly large
Why #3: Why did log files grow unexpectedly?
→ A new feature logged at debug level in production
Why #4: Why was debug logging enabled in production?
→ The developer forgot to change the log level before deploying
Why #5: Why did the deployment process allow incorrect log levels?
→ There's no validation of configuration in the deployment pipeline
Root cause: Missing configuration validation
Action: Add log level validation to CI/CD pipeline

Stop and think: What happens if you stop asking “Why?” after step 3 (log files grew unexpectedly)? What kind of action item would you end up with compared to the actual systemic root cause?

Notice: We didn’t stop at “the developer made a mistake.” We asked why the system allowed that mistake.
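
The action from this example, validating log levels in the pipeline, could look roughly like the check below. It is a sketch under stated assumptions: the config is a plain dictionary with a `log_level` key, the set of levels allowed in production is a policy choice, and `validate_log_level` is a hypothetical helper rather than a feature of any CI system.

```python
# Assumption: production may log at INFO or above, never DEBUG.
ALLOWED_PRODUCTION_LEVELS = {"INFO", "WARNING", "ERROR", "CRITICAL"}


def validate_log_level(config: dict, environment: str) -> list[str]:
    """Return a list of validation errors; an empty list means the config passes."""
    errors = []
    raw = config.get("log_level")
    level = str(raw).upper() if raw else ""
    if not level:
        errors.append("log_level is not set")
    elif environment == "production" and level not in ALLOWED_PRODUCTION_LEVELS:
        errors.append(f"log_level {level} is not allowed in production")
    return errors


# A debug-level config is rejected for production but accepted for staging.
production_errors = validate_log_level({"log_level": "debug"}, "production")
staging_errors = validate_log_level({"log_level": "debug"}, "staging")
```

A pipeline step would call this for the target environment and fail the build when the returned list is not empty, so the mistake in Why #4 is caught by the system rather than by memory.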


# Postmortem: [TITLE]
**Date**: YYYY-MM-DD
**Authors**: [Names]
**Status**: Draft | Reviewed | Final
**Severity**: SEV-X
## Summary
[2-3 sentences describing what happened and the impact]
## Impact
- **Duration**: [Start time] to [End time] ([X] minutes/hours)
- **Users affected**: [Number or percentage]
- **Revenue impact**: [If applicable]
- **Error budget consumed**: [X]%
- **Data loss**: [Yes/No, details if yes]
## Timeline (All times in UTC)
| Time | Event |
|------|-------|
| 14:00 | Deploy of version X.Y.Z begins |
| 14:05 | Error rate increases, alerts fire |
| 14:10 | On-call acknowledges, begins investigation |
| 14:20 | Root cause identified |
| 14:25 | Rollback initiated |
| 14:30 | Service recovered |
| 14:35 | Incident declared resolved |
## Detection
How was the incident detected?
- [ ] Monitoring alert
- [ ] Customer report
- [ ] Internal user report
- [ ] Other: ____________
Time to detection (TTD): [X] minutes
Could detection be faster? [Yes/No, how]
## Contributing Factors
### Factor 1: [Name]
[Description of contributing factor]
### Factor 2: [Name]
[Description of contributing factor]
### Factor 3: [Name]
[Description of contributing factor]
## What Went Well
- [Thing that went well]
- [Another thing that went well]
## What Went Poorly
- [Thing that went poorly]
- [Another thing that went poorly]
## Where We Got Lucky
- [Way we got lucky that masked the severity]
## Action Items
| Action | Owner | Due Date | Priority | Status |
|--------|-------|----------|----------|--------|
| [Action 1] | [@owner1] | YYYY-MM-DD | P1 | Not Started |
| [Action 2] | [@owner2] | YYYY-MM-DD | P1 | Not Started |
| [Action 3] | [@owner3] | YYYY-MM-DD | P2 | Not Started |
## Lessons Learned
[Key insights from this incident]
## Related Incidents
- [Link to related postmortem]
- [Link to related postmortem]
## Supporting Information
- [Link to incident channel]
- [Link to dashboards]
- [Link to relevant docs]
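
The Impact fields in the template are simple arithmetic once the timeline is filled in. The sketch below uses the sample timeline's times and assumes a 99.9% availability SLO over a 30-day window; both numbers are assumptions for illustration. It also treats the whole incident as a full outage, which overstates budget use for a partial degradation.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def error_budget_consumed(outage_minutes: float, slo: float = 0.999,
                          window_days: int = 30) -> float:
    """Percent of the window's error budget used by a full outage of this length."""
    budget_minutes = (1 - slo) * window_days * 24 * 60   # 43.2 minutes at 99.9% / 30 days
    return 100 * outage_minutes / budget_minutes


duration = minutes_between("14:05", "14:30")       # error rate increases -> service recovered
time_to_ack = minutes_between("14:05", "14:10")    # alerts fire -> on-call acknowledges
budget_used = round(error_budget_consumed(duration), 1)
```

With these assumptions the 25-minute incident consumes about 57.9% of the monthly error budget, which is the kind of number that makes the Impact section concrete.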

  1. NASA’s approach to postmortems inspired Google’s SRE culture. NASA’s “lessons learned” database has entries going back to the 1960s, including failures that led to tragedies.

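  2. The aviation industry’s blameless reporting culture has helped make flying remarkably safe. Pilots can report mistakes confidentially (for example, through NASA’s Aviation Safety Reporting System in the United States), which leads to systemic improvements. SRE borrowed this concept.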

  3. John Allspaw (former Etsy CTO) popularized “blameless postmortems” in tech. His 2012 article “Blameless PostMortems and a Just Culture” is considered a foundational SRE resource.

  4. Some companies publish their postmortems publicly (GitLab, Cloudflare, Google). GitLab’s handbook even includes their postmortem templates. This radical transparency builds customer trust and contributes to industry-wide learning.


War Story: The Postmortem That Changed Culture

A company I worked with had a blame culture. After incidents:

  • Managers asked “who did this?”
  • Engineers got written up
  • People hid mistakes
  • Same incidents kept recurring

Then a major outage happened:

The Incident:

  • 4-hour outage during peak traffic
  • Cause: An engineer deployed a change that wasn’t tested
  • Impact: ~$500K revenue loss

The Old Way (What Leadership Wanted):

  • Find who deployed the change
  • Punish them
  • “Make an example”

What Actually Happened:

The new VP of Engineering stopped the blame train:

"Before we talk about who, let's talk about what.
What systems allowed untested code to reach production?
What process failed to catch this?
What tooling gaps existed?
The engineer who deployed this is not the problem.
The engineer who deploys the NEXT untested change is also not the problem.
The problem is that our system makes it possible."

The Postmortem Found:

  • No automated testing in CI
  • No staging environment
  • No deployment rollback process
  • No config validation
  • Review process was optional

The Actions:

  • Built proper CI/CD pipeline
  • Created staging environment
  • Implemented automatic rollback
  • Added config validation
  • Made review mandatory

Six Months Later:

  • Zero similar incidents
  • Deployment frequency increased 4x
  • Engineers reported near-misses proactively
  • Team satisfaction improved dramatically

The Lesson: Blame finds scapegoats. Blamelessness finds solutions.


Bad: "The outage was caused by Alice's careless mistake."
Good: "A configuration error reached production because
      our validation process didn't catch it."

Bad: "Root cause: Bad deployment"
Good: "Contributing factors:
      1. Missing automated tests
      2. No staging environment
      3. Time pressure to ship
      4. Unclear rollback procedure"

Bad: "Be more careful"
Good: "Add automated config validation to CI pipeline"

Bad: "Improve monitoring"
Good: "Add alert for config drift > 5%"

Bad: "Document better"
Good: "Write runbook for config deployment with rollback steps"

Bad: Action items created, never tracked, never completed
Good: Weekly review of open action items until all complete

Bad: Postmortem written to defend team, hide problems
Good: Postmortem written honestly, focuses on improvement
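
Some of these anti-patterns can be caught mechanically before a postmortem is published. The following is a deliberately crude sketch: the phrase list is drawn from the "Bad" examples above, the owner check assumes handles start with `@`, and `lint_action_item` is a hypothetical helper. It will produce false positives and is meant as a prompt for discussion, not a gate.

```python
# Phrases taken from the "Bad" examples; a real list would be team-specific.
VAGUE_PHRASES = ["be more careful", "improve", "better", "more thoroughly"]


def lint_action_item(item: dict) -> list[str]:
    """Return human-readable problems found in one action item."""
    problems = []
    text = item.get("item", "").lower()
    for phrase in VAGUE_PHRASES:
        if phrase in text:
            problems.append(f"vague wording: '{phrase}'")
    # Assumption: individual owners are written as @handle.
    if not item.get("owner", "").startswith("@"):
        problems.append("no individual owner")
    if not item.get("due"):
        problems.append("no due date")
    return problems


bad = lint_action_item({"item": "Be more careful", "owner": "team"})
good = lint_action_item({
    "item": "Add automated config validation to CI pipeline",
    "owner": "@alice",
    "due": "2024-02-15",
})
```

Here `bad` reports the vague wording, the missing individual owner, and the missing due date, while `good` comes back empty.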

People need to feel safe sharing:

  • Leadership models blamelessness
  • Mistakes are treated as learning opportunities
  • Postmortems are not used for performance reviews
  • Contributors are thanked, not criticized

Postmortems take time but save more:

  • 2-4 hours to write well
  • 1-2 hours for meeting
  • Time saved: Avoiding repeat incidents

Action items must be completed:

  • Assign owners and due dates
  • Review weekly
  • Escalate if stuck
  • Celebrate completions

Learnings should spread:

  • Post to shared wiki
  • Present at team meetings
  • Cross-team sharing sessions
  • Digest for leadership

| Mistake | Problem | Solution |
|---------|---------|----------|
| Blame individuals | People hide, problems repeat | Focus on systems |
| One root cause | Oversimplifies, misses factors | Multiple contributing factors |
| Skip when busy | Learn less, same incidents repeat | Make postmortems mandatory |
| Vague actions | Nothing actually changes | Specific, measurable actions |
| No follow-up | Actions never complete | Weekly tracking |
| Only share internally | Organization doesn’t learn | Broad publication |

Scenario: During a major outage, a senior engineer accidentally runs a destructive command on the production database instead of the staging database. In the postmortem meeting, the VP of Engineering says, “We need to ensure everyone is being more careful and holding themselves accountable. We cannot have careless mistakes like this.” Why does this statement violate the principles of a blameless postmortem, and what is the likely outcome of this approach?

Answer:

This statement violates blamelessness because it focuses entirely on the individual’s “carelessness” rather than the system conditions that allowed the mistake to happen. In a blameless culture, the focus should be on why the system permitted a destructive command to execute against production without friction (e.g., lack of environment isolation, identical terminal prompts, missing confirmation steps). The likely outcome of the VP’s approach is a chilling effect on psychological safety; engineers will begin to hide their mistakes, withhold critical information during incidents, and avoid taking necessary risks. True accountability in SRE means taking responsibility for improving the system to prevent recurrence, not punishing human error.

Scenario: After an incident where a malformed configuration file caused a service crash, the team creates the following action item: “Review configuration files more thoroughly before merging them to the main branch.” The item is assigned to the entire team. Why is this action item destined to fail, and how should it be rewritten?

Answer:

This action item will fail because it relies entirely on human vigilance, which is notoriously unreliable, and it lacks both a specific owner and a measurable completion state. “Review more thoroughly” is a vague directive that does not change the underlying system or prevent the failure mode from recurring when a reviewer is tired, distracted, or under time pressure. A proper action item must structurally improve the system and have clear accountability. It should be rewritten as something concrete and automated, such as: “Implement a CI pipeline step using a linter to validate the syntax of all configuration files before allowing a merge,” assigned to a specific engineer with a defined due date.

Scenario: A payment gateway fails during Black Friday. The postmortem concludes: “Root Cause: The third-party payment API timed out.” The team closes the investigation and blames the vendor. Why is stopping at this “single root cause” an anti-pattern, and what systemic vulnerabilities does this conclusion ignore?

Answer:

Stopping at a single root cause is an anti-pattern because complex system failures are almost never the result of a single isolated event; they are the culmination of multiple contributing factors. By concluding that the third-party API timeout was the sole cause, the team abdicates responsibility and misses critical opportunities to improve their own system’s resilience. This conclusion ignores essential systemic questions: Why didn’t our application gracefully degrade when the API timed out? Why did the timeout cause the entire gateway to fail instead of queuing the requests? Why didn’t our alerting system notify us of elevated latency before a full failure occurred? An effective postmortem must explore the full chain of events and system behaviors that turned an external timeout into a customer-facing outage.

Scenario: A junior engineer notices that a background job responsible for sending non-critical weekly newsletter emails failed silently for two days. They quickly fix the bug, restart the job, and the emails are sent. No customers complained, and no revenue was lost. The engineer asks if they should write a postmortem. Should they, and what is the reasoning?

Answer:

Yes, they should likely write a brief postmortem or incident report, even though the immediate impact was low. While a full 90-minute meeting might not be necessary, documenting the incident is crucial because the silent failure mode is a significant systemic risk. The fact that the job failed silently for two days indicates a critical gap in monitoring and alerting; if a more important background job fails in the same way, the consequences could be disastrous. Writing a postmortem ensures the team addresses the lack of observability and prevents similar silent failures across other systems. The threshold for capturing learning should be low enough to catch these near-misses and systemic gaps before they manifest as critical outages.


Practice writing a postmortem for a mock incident.

Scenario: Payment service outage
Timeline:
- 15:00: Developer deploys new version
- 15:05: Error rate jumps to 50%
- 15:15: On-call receives page
- 15:20: On-call begins investigation
- 15:35: On-call identifies bad database query in new code
- 15:40: On-call rolls back deployment
- 15:45: Error rate returns to normal
- 15:50: Incident declared resolved
Context:
- The developer was new to the team
- The code passed code review
- There were no automated tests
- The staging environment was broken
- The deployment happened on a Friday afternoon

Write a complete postmortem using the template.

Focus on:

  1. Summary: 2-3 sentences
  2. Impact: Duration, users affected, error budget impact
  3. Contributing Factors: List at least 4 (not just “bad code”)
  4. Action Items: At least 5 specific, actionable items

Think about:

  • Why wasn’t this caught in testing?
  • Why was staging broken?
  • Why was review not sufficient?
  • Why was the developer unprepared?
  • Why deploy on Friday afternoon?

Success criteria:

  • Summary is clear and concise
  • Impact is quantified
  • At least 4 contributing factors identified
  • No blame on individuals
  • At least 5 specific action items
  • All action items have owners and due dates
  • “What went well” section included

  1. Blamelessness enables learning — blame hides truth, safety reveals it
  2. Multiple factors, not single root cause — incidents are complex
  3. Action items must be specific and tracked — vague items accomplish nothing
  4. Share broadly — learnings benefit the whole organization
  5. Process requires discipline — postmortems must happen, not optional

Books:

  • “Site Reliability Engineering” — Chapter 15: Postmortem Culture
  • “The Field Guide to Understanding Human Error” — Sidney Dekker

Articles:

  • “Blameless PostMortems” — John Allspaw
  • “How to Write a Postmortem” — PagerDuty

Talks:

  • “Blameless Post-Mortems” — John Allspaw (Velocity 2012)
  • “Learning from Failure” — J. Paul Reed

Postmortems are how organizations learn from failure.

Effective postmortems:

  • Create psychological safety through blamelessness
  • Focus on systems, not individuals
  • Identify multiple contributing factors
  • Produce specific, actionable improvements
  • Track actions to completion
  • Share learnings broadly

Every incident should make your system stronger. Postmortems are how that happens.


Continue to Module 1.7: Capacity Planning to learn how to ensure your systems can handle future demand.


“We are not the sum of our accidents. We are the sum of what we learn from them.” — Unknown