Перейти до вмісту

Module 1.1: Incident Command & Crisis Management

Цей контент ще не доступний вашою мовою.

Complexity[COMPLEX]
Time to Complete2.5 hours
PrerequisitesNone

After completing this module, you will be able to:

  1. Implement the Incident Commander role by establishing communication channels, assigning workstreams, and controlling the pace of response
  2. Design an incident response framework with clear escalation paths, severity definitions, and role assignments before a crisis occurs
  3. Evaluate incident response effectiveness by identifying coordination failures, communication gaps, and decision bottlenecks in post-incident reviews
  4. Apply crisis management techniques (stabilize-before-fix, controlled communication, parallel workstreams) to reduce mean time to recovery

It’s 2:47 AM on a Friday night. Sarah, a senior SRE at a mid-size fintech company, wakes up to her phone buzzing with a cascade of PagerDuty alerts. The primary database cluster serving 12 million users across three continents has stopped accepting writes. The read replicas are 47 seconds behind and climbing. Customer-facing APIs are returning 500 errors at a rate of 14,000 per minute. The on-call Slack channel is already filling up with panicked messages from engineers who were also paged. Three people are SSH’d into production servers running diagnostic queries. Nobody is coordinating. Nobody knows what anyone else is doing. A junior engineer, trying to help, restarts the primary database node — which wipes the in-memory query cache and doubles the read latency on the replicas. The outage, which could have been contained in 20 minutes, drags on for 3 hours and 41 minutes. The company loses $2.3 million in failed transactions and spends the next two weeks apologizing to enterprise clients.

This is not a rare story. This is what happens when talented engineers face a crisis without a system for managing it.

The technology that failed Sarah’s company that night was not the database. It was the absence of incident command. No one declared themselves in charge. No one controlled communications. No one stopped well-meaning engineers from making the situation worse. The technical skills in that room were more than sufficient to solve the problem. What was missing was structure.

This module teaches you that structure. By the end, you will know how to walk into chaos, establish order, and lead a team through the worst night of their operational lives. You will know when to escalate, how to communicate, and — critically — when to stop trying to fix and start trying to stabilize. These are not theoretical skills. They are the difference between a 20-minute incident and a 4-hour catastrophe.


The Incident Command System: From Wildfires to Web Services

Section titled “The Incident Command System: From Wildfires to Web Services”

The Incident Command System (ICS) was not invented by software engineers. It was created in the 1970s by California firefighters after a series of devastating wildfires where multiple agencies showed up, nobody knew who was in charge, and the fires burned out of control. Sound familiar?

ICS solved a fundamental coordination problem: how do you get dozens of people from different teams, with different expertise, working together under extreme time pressure?

The answer was deceptively simple: clear roles, clear communication channels, and a single point of authority.

The software industry adopted ICS because our incidents have the same characteristics as wildfire response:

  • Time pressure — every minute of downtime costs money and trust
  • Multiple responders — database engineers, application developers, network specialists, all converging
  • Incomplete information — you rarely know the root cause when the incident starts
  • High stakes — wrong actions can make things worse
  • Communication overhead — the more people involved, the harder it is to coordinate

ICS Principles Adapted for Software Engineering

Section titled “ICS Principles Adapted for Software Engineering”
ICS PrincipleFirefighting ContextSoftware Incident Context
Unity of CommandEvery firefighter reports to one supervisorEvery responder reports to the Incident Commander
Modular OrganizationExpand/contract teams as neededPull in SMEs only when needed
Management by Objectives”Contain the fire to this ridge""Restore write capability to the primary DB”
Span of Control3-7 direct reports per leaderIC manages 3-5 leads, not 20 engineers
Common TerminologyStandard terms across agenciesSeverity levels, standard status updates
Integrated CommunicationsSingle radio frequency for commandSingle Slack channel, single bridge call

The key insight from ICS is this: in a crisis, process is not bureaucracy. Process is survival. Without it, you get a room full of smart people all pulling in different directions.


There are four critical roles in a software incident. In a small team, one person might wear multiple hats. In a large organization, each role is a separate person. But every role must be filled, even if implicitly.

┌─────────────────────┐
│ INCIDENT COMMANDER │
│ (Owns the incident)│
└─────────┬───────────┘
┌──────────────────┼──────────────────┐
│ │ │
┌────────▼─────────┐ ┌─────▼──────────┐ ┌────▼───────────────┐
│ COMMUNICATIONS │ │ OPERATIONS │ │ SUBJECT MATTER │
│ LEAD │ │ LEAD │ │ EXPERTS (SMEs) │
│ │ │ │ │ │
│ - Status updates │ │ - Hands on │ │ - Database │
│ - Stakeholders │ │ keyboard │ │ - Networking │
│ - Status page │ │ - Executing │ │ - Application │
│ - Customer comms │ │ fixes │ │ - Security │
└──────────────────┘ │ - Monitoring │ │ - Infrastructure │
└────────────────┘ └────────────────────┘

The Incident Commander is the single point of authority during an incident. They do not fix the problem. They manage the people fixing the problem.

Responsibilities:

  • Declare the incident and its severity level
  • Assign roles to responders
  • Make decisions when the team disagrees or is uncertain
  • Maintain the incident timeline
  • Decide when to escalate and when to de-escalate
  • Call the incident resolved and trigger the postmortem process

What the IC says:

  • “I’m declaring myself Incident Commander for this Sev-1. Acknowledge.”
  • “Alex, you’re Operations Lead. Focus on the database cluster. Report back in 10 minutes.”
  • “We’re not going to debate the root cause right now. Our objective is restoring writes. Root cause comes later.”
  • “It’s been 30 minutes. I’m escalating to VP Engineering for business impact assessment.”
  • “The fix is deployed and we’ve had 15 minutes of clean metrics. I’m declaring this incident resolved.”

What the IC does NOT do:

  • SSH into production servers
  • Write code or run queries
  • Get drawn into technical rabbit holes
  • Let three people argue about the right fix without making a decision
  • Disappear for 20 minutes to investigate something

The hardest part of being IC is not touching the keyboard. If the IC starts debugging, nobody is coordinating. This is the number one failure mode in incident management.

Rule of thumb: The moment you start typing commands into a terminal, you’ve stopped being the Incident Commander.

The Communications Lead is the voice of the incident to the outside world. They handle everything that is not directly related to fixing the problem.

Responsibilities:

  • Post and update the status page
  • Send internal updates to leadership and stakeholders
  • Draft customer-facing communications
  • Manage the incident Slack channel (keep it focused, move side conversations out)
  • Track and relay questions from stakeholders to the IC
  • Maintain the written timeline of events

Why this role matters: Without a Comms Lead, one of two things happens. Either the IC gets pulled away from coordination to answer “what’s happening?” from the CEO every five minutes, or nobody communicates at all and leadership starts making their own assumptions (which are always worse than reality).

The Operations Lead is the person (or people) with hands on the keyboard, actually diagnosing and fixing the problem.

Responsibilities:

  • Run diagnostic commands and investigate the root cause
  • Propose and execute mitigation and resolution actions
  • Report status to the IC at regular intervals
  • Flag when they need additional expertise (SMEs)
  • Document what they’ve tried and what the results were

Critical rule: The Operations Lead proposes actions. The IC approves them. This prevents well-meaning engineers from making unilateral changes to production during a crisis.

Ops Lead: "I want to failover to the secondary database cluster."
IC: "What's the risk?"
Ops Lead: "We'll lose about 30 seconds of writes during the switchover.
Some in-flight transactions will need to be retried by clients."
IC: "That's acceptable. Proceed. Comms Lead, update the status page:
'We are performing a database failover. Brief interruption expected.'"

SMEs are the specialists who get pulled in when the incident touches their domain. They do not join the incident by default. They are called in by the IC when needed.

Examples:

  • The DBA when it’s a database issue
  • The network engineer when it’s a connectivity issue
  • The security engineer when there’s a potential breach
  • The application developer who wrote the service that’s failing

Key principle: SMEs advise, they don’t command. They provide expertise to the Operations Lead and IC, who make the final decisions about what actions to take.


Communication is where incidents are won or lost. Too little communication and nobody knows what’s happening. Too much communication and the signal drowns in noise. The goal is structured, predictable, and focused communication.

A war room is a dedicated space (physical or virtual) where incident responders coordinate. In the age of remote work, this is almost always a video bridge call.

War room rules:

  1. One conversation at a time. If the IC is talking, everyone else is listening.
  2. State your name before speaking. “This is Alex. The replica lag is now at 2 minutes.”
  3. Keep the line clear. Side conversations happen in a separate thread or DM.
  4. Mute when not speaking. Background noise kills focus.
  5. No spectators. If you’re not in an assigned role, you’re not on the bridge.

For any Sev-1 or Sev-2 incident, create a dedicated Slack channel immediately:

#incident-2026-03-24-db-outage

Channel naming convention: #incident-YYYY-MM-DD-short-description

Pin the following message at the top of the channel:

INCIDENT: Primary database cluster write failure
SEVERITY: Sev-1
IC: @sarah
COMMS LEAD: @jordan
OPS LEAD: @alex
STATUS: Investigating
BRIDGE CALL: https://meet.company.com/incident-room
STATUS PAGE: https://status.company.com
Last update: 02:55 UTC — Replica lag increasing. Investigating primary node disk I/O.

Channel discipline:

  • The incident channel is for coordinated updates only
  • Debugging discussions go in a thread
  • “Is this related to my service?” questions go in a thread
  • Spectator commentary goes nowhere — keep it out entirely
  • The Comms Lead is the channel moderator and will redirect off-topic messages

During an active incident, the Comms Lead posts a structured update every 15 minutes, whether or not there is news. This is non-negotiable. Silence during an incident is terrifying for stakeholders and leadership.

UPDATE — 03:15 UTC (30 min into incident)
Status: Mitigating
What we know: Primary DB disk I/O saturated at 100%. Cause under investigation.
What we're doing: Failing over to secondary cluster. ETA 5 minutes.
Customer impact: Write operations failing. Read operations degraded (~3s latency).
Next update: 03:30 UTC

Even if the update is “no change, still investigating,” post it. The update itself communicates that people are actively working the problem.


Your status page is the single source of truth for customers during an incident. It should be updated within 5 minutes of an incident being declared.

Status page levels:

StatusMeaningWhen to Use
OperationalEverything working normallyDefault state
Degraded PerformanceSlower than normal but functionalLatency increases, partial errors
Partial OutageSome features unavailableOne service or region down
Major OutageCore functionality unavailablePrimary service down

Do not write customer communications from scratch during a crisis. Use templates and fill in the specifics.

Initial acknowledgment (post within 5 minutes):

We are aware of an issue affecting [service/feature]. Our team is actively
investigating. We will provide updates every [15/30] minutes.
Current impact: [brief description of what customers are experiencing]

Update during investigation:

We continue to investigate the issue affecting [service/feature].
We have identified [what you know] and are working on [what you're doing].
Current impact: [updated description]
Next update: [time]

Mitigation in progress:

We have identified the cause of the issue affecting [service/feature]
and are implementing a fix. Some customers may see improvement shortly.
Current impact: [updated description]
Expected resolution: [time estimate, if known, or "we will update when available"]

Resolved:

The issue affecting [service/feature] has been resolved. All systems are
operating normally. The incident lasted approximately [duration].
We will publish a detailed postmortem within [timeframe, e.g., 48 hours].
We apologize for the disruption and appreciate your patience.

Golden rules of external communication:

  1. Never lie. If you don’t know the cause, say “we’re investigating.” Never fabricate an explanation.
  2. Never blame a vendor publicly. Even if AWS caused the outage, your customers chose you, not AWS.
  3. Never give an ETA you can’t meet. “We’re working on it” is better than “fixed in 30 minutes” when you have no idea.
  4. Acknowledge the impact. Customers do not care about your internal details. They care that their transactions failed.
  5. Show empathy. “We understand this impacts your business” goes a long way.

Severity levels are not subjective opinions. They are predefined criteria that determine how the organization responds. Define them before you need them, not during a crisis.

┌───────────────────────────────────────────────────────────────────────────┐
│ SEVERITY LEVEL MATRIX │
├───────┬──────────────────┬──────────────────┬──────────────────────────┤
│ Level │ Customer Impact │ Revenue Impact │ Response │
├───────┼──────────────────┼──────────────────┼──────────────────────────┤
│ SEV-1 │ All/most users │ Revenue stopped │ All-hands, 24/7 │
│ │ cannot use core │ or major loss │ IC + full team │
│ │ functionality │ per minute │ Exec notification │
│ │ │ │ 15-min update cadence │
├───────┼──────────────────┼──────────────────┼──────────────────────────┤
│ SEV-2 │ Large subset of │ Significant │ Dedicated responders │
│ │ users affected │ revenue impact │ IC assigned │
│ │ or core feature │ │ 30-min update cadence │
│ │ degraded │ │ Business hours + on-call │
├───────┼──────────────────┼──────────────────┼──────────────────────────┤
│ SEV-3 │ Small subset of │ Minor or no │ On-call investigates │
│ │ users affected │ direct revenue │ during business hours │
│ │ Non-critical │ impact │ 2-hour update cadence │
│ │ feature broken │ │ │
├───────┼──────────────────┼──────────────────┼──────────────────────────┤
│ SEV-4 │ Internal only or │ None │ Ticket created │
│ │ cosmetic issue │ │ Fixed in normal sprint │
│ │ │ │ No active incident mgmt │
└───────┴──────────────────┴──────────────────┴──────────────────────────┘
ScenarioSeverityWhy
Payment processing is down for all customersSev-1Core revenue function, all users affected
Login is failing for users in EU regionSev-2Core feature, subset of users
Search results are returning slowly (5s vs 500ms)Sev-2Degraded core feature, all users affected
A reporting dashboard shows stale data (2 hours old)Sev-3Non-core feature, data is available but delayed
The internal admin panel has a broken CSS layoutSev-4Internal only, cosmetic
Email notifications are delayed by 10 minutesSev-3Non-critical feature, mild user impact
The mobile app crashes on launch for all iOS usersSev-1Core functionality, major user segment
A staging environment is downSev-4No customer impact

Severity is not set in stone. Re-evaluate it as you learn more:

  • Upgrade when: impact is broader than initially thought, the fix is taking longer than expected, or a workaround stops working
  • Downgrade when: a workaround reduces customer impact, the affected population is smaller than estimated, or the issue is intermittent rather than total

The IC makes severity changes and the Comms Lead immediately communicates them.


When you get paged at 3 AM, you need to determine severity in the first 5 minutes. Here is a triage framework:

┌─────────────────────────────────────────────────────────────────┐
│ 5-MINUTE TRIAGE FLOW │
│ │
│ 1. Are customers affected RIGHT NOW? │
│ ├─ YES ──► Check error rates, status codes, latency │
│ └─ NO ───► Likely Sev-3 or Sev-4. Investigate in morning. │
│ │
│ 2. What percentage of customers are affected? │
│ ├─ >50% ──► Likely Sev-1 │
│ ├─ 10-50% ─► Likely Sev-2 │
│ └─ <10% ──► Likely Sev-3 │
│ │
│ 3. Is a core revenue function impacted? │
│ ├─ YES ──► Bump severity by one level │
│ └─ NO ───► Keep current assessment │
│ │
│ 4. Is the issue getting worse? │
│ ├─ YES ──► Bump severity by one level │
│ └─ NO ───► Keep current assessment │
│ │
│ 5. Declare severity and begin incident process. │
│ Do NOT spend more than 5 minutes triaging. │
└─────────────────────────────────────────────────────────────────┘

Critical mindset: It is always better to over-classify and downgrade than to under-classify and scramble to catch up. Declaring a Sev-1 that turns out to be a Sev-3 wastes a few hours of people’s time. Treating a Sev-1 as a Sev-3 can cost millions.


Mitigating vs. Resolving: Stop the Bleeding First

Section titled “Mitigating vs. Resolving: Stop the Bleeding First”

This is the single most important concept in incident management, and the one that engineers get wrong most often.

Mitigation means reducing or eliminating the customer impact, even if the underlying problem is not fixed.

Resolution means fixing the root cause so the problem does not recur.

┌────────────────────────────────────────────────────────────────────┐
│ │
│ TIMELINE OF AN INCIDENT │
│ │
│ Detection ──► Triage ──► MITIGATION ──► Stabilization ──► Root │
│ │ Cause Analysis │
│ │ │
│ Customer impact ENDS here │
│ (not at root cause fix) │
│ │
└────────────────────────────────────────────────────────────────────┘
ProblemMitigation (do this first)Resolution (do this later)
Database overloadedFailover to replica, add read cacheOptimize slow queries, upgrade hardware
Bad deploy causing 500sRollback to previous versionFix the bug, improve test coverage
Memory leak in serviceRestart the serviceFind and fix the leak
DNS pointing to wrong IPUpdate DNS recordFix the deployment pipeline that set wrong IP
Third-party API downEnable fallback/cached responsesAdd circuit breaker, diversify providers
Disk full on primary serverDelete old logs, expand volumeAdd disk usage alerts, log rotation

Engineers are problem solvers. When they see a bug, their instinct is to find and fix it. This is a virtue in normal operations and a liability during an incident. Here is what happens:

BAD (Resolution-first thinking):
03:00 - Alert fires. Database writes failing.
03:05 - Engineer starts reading database logs.
03:20 - Engineer finds a suspicious query pattern.
03:45 - Engineer is deep in query execution plans.
04:15 - Engineer finds the root cause: a missing index after a migration.
04:30 - Engineer adds the index. Writes recover.
Customer impact: 90 minutes.
GOOD (Mitigation-first thinking):
03:00 - Alert fires. Database writes failing.
03:05 - IC declares Sev-1. Triage: primary DB disk I/O saturated.
03:10 - Ops Lead: "I can failover to secondary in 2 minutes."
03:12 - IC: "Do it."
03:14 - Failover complete. Writes recovering.
03:20 - Customer impact resolved. Team begins root cause analysis.
04:00 - Root cause found: missing index. Fix scheduled for morning.
Customer impact: 14 minutes.

The same engineers. The same problem. The same root cause. But one team had a 14-minute outage and the other had a 90-minute outage. The difference is the question they asked first:

  • Bad question: “Why is this happening?”
  • Good question: “How do I make it stop affecting customers?”

Timeline: Well-Managed vs. Poorly Managed Incident

Section titled “Timeline: Well-Managed vs. Poorly Managed Incident”
TIME EVENT WHO
──────── ────────────────────────────────────────────── ──────────
02:47 Alerts fire: DB write errors > 1000/min Monitoring
02:48 On-call engineer acks the page Sarah
02:50 Sarah checks dashboards: primary DB disk I/O Sarah
at 100%, replica lag climbing
02:52 Sarah declares Sev-1 in #incidents Sarah (IC)
Creates #incident-2026-03-24-db-outage
Pages Ops Lead (Alex) and Comms Lead (Jordan)
02:55 Jordan posts initial status page update Jordan
02:56 Alex joins bridge call, begins diagnostics Alex
03:00 Jordan sends first Slack update to leadership Jordan
03:05 Alex reports: "Primary disk controller Alex
degraded. I recommend failover to secondary."
03:06 Sarah approves failover. "Do it." Sarah
03:08 Failover initiated Alex
03:10 Jordan updates status page: failover in Jordan
progress
03:12 Writes recovering on secondary cluster Alex
03:15 Error rates back to baseline. Sarah asks Sarah
team to monitor for 15 minutes.
03:17 Jordan updates status page: "Mitigated. Jordan
Monitoring."
03:30 15 minutes clean. Sarah declares incident Sarah
resolved.
03:32 Jordan posts resolution notice on status page Jordan
03:35 Sarah schedules postmortem for Monday Sarah
Total customer impact: ~20 minutes
TIME EVENT WHO
──────── ────────────────────────────────────────────── ──────────
02:47 Alerts fire: DB write errors > 1000/min Monitoring
02:48 On-call engineer acks the page Mike
02:50 Mike SSHs into primary DB. Starts reading logs. Mike
03:00 Two more engineers get paged by escalation Dev1, Dev2
policy. They also SSH into production.
03:05 Dev1 restarts a background job that looks Dev1
suspicious. No improvement.
03:10 Nobody has updated the status page.
Customer support tickets piling up.
03:15 VP of Engineering sees tweets about the VP Eng
outage. Starts asking in #general "what's
happening?"
03:20 Mike, Dev1, Dev2 are each investigating Mike,
different theories. Nobody is coordinating. Dev1, Dev2
03:30 VP Eng joins the bridge call and starts VP Eng
asking questions every 2 minutes. Nobody
can focus.
03:45 Dev2, trying to help, restarts the primary Dev2
DB node. This wipes the query cache.
03:47 Read latency doubles. More alerts fire. Monitoring
04:00 A senior DBA is finally paged. She assesses DBA
the situation in 5 minutes: disk controller
failing. Recommends failover.
04:10 Nobody is sure who should approve the (nobody)
failover. 10 minutes of discussion.
04:20 Failover initiated. DBA
04:25 Writes recovering. DBA
04:40 Situation stabilizing.
05:00 Error rates back to baseline.
Total customer impact: ~2 hours 13 minutes

The difference between these two incidents is not technical skill. It is incident management.


A SaaS company serving enterprise clients experienced a login failure on a Monday morning at 9:02 AM Eastern — the highest-traffic hour of their week. The authentication service was rejecting all login attempts with a cryptic “internal server error.”

The on-call engineer diagnosed the problem within 8 minutes: an expired TLS certificate on the internal service mesh. A certificate that was supposed to auto-renew had silently failed three days prior, and the monitoring alert for certificate expiry had been snoozed during an unrelated maintenance window and never re-enabled.

Here is where incident management failed. The engineer who found the problem knew how to fix it: manually renew the certificate and restart the affected services. But instead of requesting a new certificate from the internal CA (a 2-minute operation), he decided to investigate why auto-renewal failed. His reasoning was sound — he wanted to make sure the new certificate would not also fail. But while he investigated, 180,000 users could not log in.

No Incident Commander had been declared. No one was tracking that the investigation had shifted from mitigation to root cause analysis. No status page had been updated. Customer success managers were fielding calls with no information to share.

The certificate was finally renewed at 9:47 AM — 45 minutes after the initial failure. During that 45 minutes, the company’s enterprise clients, who had contractual SLAs guaranteeing 99.95% uptime, experienced a complete authentication outage during peak business hours.

The financial impact:

  • SLA credits owed to enterprise clients: $1.2 million
  • Lost revenue from churned trial customers: $340,000 (estimated)
  • Emergency customer success outreach: $85,000 in overtime and professional services
  • Deal pipeline impact: Two enterprise prospects paused their evaluations, representing $3.1 million in ARR

Total estimated impact: $4.7 million from a 45-minute outage caused by an expired certificate that could have been renewed in 2 minutes.

The postmortem identified three failures: the missing certificate expiry alert, the lack of auto-renewal monitoring, and — most critically — the absence of incident command that would have separated mitigation from investigation.


  1. Google’s Incident Management system is built on ICS. Google’s internal incident response training is directly adapted from the FEMA Incident Command System. Every Site Reliability Engineer at Google is trained in ICS principles before they are allowed to manage production incidents. This is documented in the Google SRE book, Chapter 14.

  2. The average cost of a data center outage increased to $740,357 per incident in 2024. According to the Uptime Institute, this figure includes direct costs (lost revenue, SLA credits) and indirect costs (reputation damage, customer churn). For financial services companies, the figure is nearly double. The first 15 minutes of mitigation are worth more than the next 15 hours of investigation.

  3. PagerDuty’s internal data shows that incidents with a dedicated Incident Commander are resolved 40% faster than incidents without one. The primary factor is not the IC’s technical skill — it is the reduction of coordination overhead. When everyone knows who’s in charge, they spend less time discussing what to do and more time doing it.

  4. Blameless culture directly impacts incident response speed. Research by DevOps Research and Assessment (DORA) shows that organizations where engineers feel safe admitting mistakes have 2.5x faster mean time to recovery. When people are afraid of punishment, they hide problems, delay escalation, and spend time covering tracks instead of fixing systems. Incident command only works in a culture where declaring “I caused this” is met with “thank you for telling us” rather than blame.


MistakeWhat HappensWhat To Do Instead
No Incident Commander declaredMultiple people issue conflicting instructions. Nobody tracks the overall picture. Actions are duplicated or contradictory.First senior responder declares themselves IC within 2 minutes of incident detection. Transfer IC role later if needed.
IC starts debuggingCoordination stops. Nobody is managing communication, tracking actions, or making decisions. The incident is now unmanaged.IC stays off the keyboard. If you’re the only person available, do triage first, then get someone else to take IC before you start debugging.
Skipping mitigation for root causeCustomers suffer while engineers investigate. A 10-minute outage becomes a 90-minute outage because the team is chasing “why” instead of “how to stop it.”Always ask: “Can we reduce customer impact right now?” Rollback, failover, feature flag, circuit breaker — anything to stop the bleeding. Root cause analysis happens after.
Too many people in the war roomNoise overwhelms signal. Every engineer who joins asks “what’s happening?” and the team has to re-explain. Spectators offer untested theories.Strict role-based access. Only IC, Comms Lead, Ops Lead, and requested SMEs on the bridge. Everyone else follows the Slack channel updates.
No regular status updatesLeadership panics and starts making phone calls. Customer support has nothing to tell customers. Social media fills the void with speculation. Executives show up in the war room asking questions.Comms Lead posts structured updates every 15 minutes for Sev-1, every 30 minutes for Sev-2, whether or not there is news.
Freelance production changesAn engineer, trying to help, makes an unauthorized change that makes the situation worse. Classic example: restarting a database node and wiping the cache during a performance incident.All production changes during an incident go through the IC. “I want to try X” is a request, not an action. The IC approves or rejects.
Not documenting actions during the incidentThe postmortem timeline is reconstructed from memory, Slack messages, and partial logs. Key decisions and their rationale are lost.Comms Lead or a designated scribe logs every significant action and decision in the incident channel with timestamps.
Declaring “resolved” too earlyThe fix is deployed but not verified. The incident recurs 20 minutes later and the team has to reconvene. Customer trust takes a double hit.Require a minimum stabilization period (15-30 minutes of clean metrics) before declaring resolved. Monitor closely for the next 2-4 hours.

Test your understanding. Try answering before revealing the answer.

Your monitoring alerts fire at 3 AM showing a 40% error rate on your API gateway. You are the only engineer paged. What is the correct first action?

Show Answer

Triage the severity within 5 minutes. Check dashboards to determine: Are customers affected? What percentage? Is a core function impacted? Based on this, declare the severity level. If it’s Sev-1 or Sev-2, page additional responders and declare yourself IC. Do not start debugging until you have assessed the situation and established the incident structure. A 40% error rate on the API gateway is almost certainly Sev-1 — it means nearly half your traffic is failing.

You are the Incident Commander during a Sev-1. Your best database engineer says: “I know exactly what the problem is. Give me 20 minutes and I’ll have the root cause fixed.” Meanwhile, a simpler mitigation (failover to a replica) could restore service in 3 minutes but would lose about 10 seconds of recent writes. What do you do?

Show Answer

Order the failover. Mitigation always comes before resolution. A 3-minute path to restoring service with a minor data loss (10 seconds of writes) is almost always preferable to a 20-minute path that might not work as expected. The “20 minutes” estimate is optimistic — root cause fixes during incidents frequently take 2-3x longer than estimated. After the failover restores service, the database engineer can then investigate and fix the root cause without time pressure.

The 10-second write loss is a business decision — confirm with the IC (you) and, if appropriate, a business stakeholder. But in most cases, 10 seconds of lost writes is dramatically better than 20+ minutes of total write failure.

During a Sev-2 incident, a VP joins the bridge call and starts asking detailed technical questions every few minutes. This is disrupting the Operations Lead’s focus. What should happen?

Show Answer

The IC should redirect the VP to the Comms Lead. The IC says something like: “VP, I understand you need updates. Jordan is our Communications Lead and will provide you with updates every 30 minutes. I need to keep this bridge focused on the technical response.” If the VP insists on staying, the IC should ask them to mute and listen without interrupting. The IC owns the incident and has the authority to manage the war room, regardless of organizational hierarchy. This is a core ICS principle: during an incident, the IC outranks everyone on operational decisions.

An engineer on your team notices elevated error rates on a service they own but believes it’s a transient issue that will resolve itself. They decide not to declare an incident. Two hours later, the errors have increased and customers are complaining. What went wrong?

Show Answer

The engineer failed to follow the over-classify principle. It is always better to declare an incident and downgrade than to wait and escalate later. By the time the engineer realized the issue was real, two hours of customer impact had already occurred. The correct action was to declare a Sev-3 at minimum when the errors were first observed, set a watch on the metrics, and upgrade severity if the errors persisted or increased. Declaring an incident is cheap. Missing an incident is expensive.

Your team has resolved a Sev-1 incident. Error rates are back to zero and the fix has been deployed. It’s 4:30 AM. Should you declare the incident resolved and go back to sleep?

Show Answer

Not yet. You need a stabilization period. Monitor clean metrics for at least 15-30 minutes before declaring resolved. Many incidents have a “bounce” where the fix appears to work but fails under load or when certain conditions recur. Additionally, before standing down: update the status page to “Monitoring,” send a near-final update to stakeholders, and ensure someone is watching dashboards for the next 2-4 hours (either you or a handoff to another timezone). Only after the stabilization period should you declare resolved, post the final status page update, and schedule the postmortem.

You are the IC for a Sev-1. Thirty minutes in, the Operations Lead tells you they are stuck and don’t know what’s causing the issue. They’ve tried three things and none have worked. What do you do?

Show Answer

Escalate and expand. Specifically: (1) Ask the Ops Lead who the subject matter experts are for the affected systems and page them immediately. Fresh eyes often see what exhausted eyes miss. (2) Ask: “Have we tried mitigating even if we don’t know the cause?” — failover, rollback, feature flags, and traffic shifting don’t require understanding root cause. (3) Consider whether the scope is broader than initially thought and whether you should upgrade severity. (4) If 30 minutes in you are still stuck on a Sev-1, escalate to senior leadership so they are aware and can authorize exceptional measures (like waking up someone on PTO who has critical knowledge). Do not let “we’ll figure it out” drag on while customers suffer.


Hands-On Exercise: Tabletop Exercise — Managing a Sev-1 Database Outage

Section titled “Hands-On Exercise: Tabletop Exercise — Managing a Sev-1 Database Outage”

A tabletop exercise is a simulation where you walk through an incident scenario, making decisions at each stage. You do not need a real system for this. You need a quiet room, this scenario, and ideally 2-4 other people to play the roles.

If you are practicing solo, play the IC role and write down what you would say and do at each decision point.

You are the Incident Commander. It is Tuesday at 10:15 AM. Your largest customer, representing 18% of annual revenue, is in the middle of their quarterly financial close — the most critical period of their fiscal cycle.

10:15 AM — PagerDuty alerts fire:

  • CRITICAL: PostgreSQL primary — replication lag > 60s
  • CRITICAL: API error rate > 5% — /api/v2/transactions endpoint
  • WARNING: Connection pool utilization > 90% on app-server-{1,2,3}

Your team available:

  • You (IC)
  • Morgan — Senior backend engineer, knows the application layer well
  • Priya — DBA, PostgreSQL expert, but currently in a meeting
  • Taylor — SRE, strong on infrastructure and monitoring
  • Casey — Customer success manager for the affected enterprise client

Resources available:

  • Grafana dashboards
  • PostgreSQL read replicas (2 replicas, normally healthy)
  • Feature flag system (LaunchDarkly)
  • Recent deploy log (last deploy was 9:45 AM — 30 minutes ago)

Decision Point 1 (10:15 AM): Initial Response

Section titled “Decision Point 1 (10:15 AM): Initial Response”

Based on the alerts, answer these questions:

  1. What severity do you declare and why?
  2. Who do you page and what roles do you assign?
  3. What is the first message you post in the incident channel?
Recommended Actions

Severity: Sev-1. The transaction endpoint is a core revenue function. Error rates above 5% on that endpoint during a customer’s financial close is a direct revenue and relationship threat. The combination of replication lag, API errors, and connection pool exhaustion suggests a systemic issue, not a transient blip.

Pages and roles:

  • Pull Priya out of her meeting immediately. She is the PostgreSQL expert and this is a database incident. Assign her as Operations Lead.
  • Morgan as secondary Ops Lead / SME for the application layer.
  • Taylor as your technical eyes — have them pull up Grafana and start reading the situation.
  • Casey should be notified but NOT in the war room. Send Casey to standby: “We have a Sev-1 impacting the transactions API. I will update you every 15 minutes. Prepare to communicate with [enterprise client] if needed.”

Incident channel message:

INCIDENT DECLARED — Sev-1
Issue: PostgreSQL replication lag + API errors on /api/v2/transactions
IC: [You]
Ops Lead: @priya
Comms Lead: @taylor (until dedicated Comms resource available)
Bridge: [link]
Impact: Transaction processing errors. Largest customer in financial close.
Status: Investigating

Decision Point 2 (10:22 AM): First Findings

Section titled “Decision Point 2 (10:22 AM): First Findings”

Taylor reports from Grafana:

  • The 9:45 AM deploy introduced a new database migration that added an index on the transactions table
  • The CREATE INDEX command is still running — it’s been holding a lock for 37 minutes
  • The lock is blocking writes to the transactions table
  • Read replicas are healthy but falling behind because WAL replay is blocked

Priya confirms: “The migration is running CREATE INDEX without CONCURRENTLY. It’s holding an ACCESS EXCLUSIVE lock on the transactions table. No writes can proceed until it finishes or is cancelled.”

  1. What is your mitigation strategy?
  2. What questions do you ask before approving it?
  3. What do you tell Casey to communicate to the enterprise client?
Recommended Actions

Mitigation strategy: Cancel the index creation immediately.

The index is blocking all writes. Cancelling it will release the lock and restore write capability. The index can be recreated later using CREATE INDEX CONCURRENTLY, which does not hold an exclusive lock.

Questions before approving:

  • “Priya, if we cancel the CREATE INDEX, will it cleanly release the lock? Any risk of table corruption?” (Answer: No, cancelling a CREATE INDEX is safe. The partial index is simply dropped.)
  • “Will the application function correctly without this new index?” (Answer: Yes, it ran fine without it before the migration.)
  • “How do we cancel it? pg_cancel_backend() on the migration’s PID?”

Once Priya confirms it is safe, approve immediately: “Cancel it. Now.”

Communication to enterprise client (via Casey): “We are experiencing a brief disruption to transaction processing. Our team has identified the cause and is implementing a fix. We expect resolution within the next 10 minutes. Your data is safe and no transactions have been lost — pending transactions will process once the system recovers.”

Do NOT say “a bad deploy caused it” to the customer. Keep it factual and focused on impact and recovery.


Decision Point 3 (10:26 AM): Mitigation Executed

Section titled “Decision Point 3 (10:26 AM): Mitigation Executed”

Priya cancels the index creation. Within 90 seconds:

  • Write lock released
  • Transaction writes resuming
  • Error rates dropping
  • Replication lag decreasing

But now Morgan reports: “There’s a backlog of about 4,000 queued transactions that were blocked. They’re all hitting the database simultaneously. Connection pool is at 98%.”

  1. Do you have a new problem? What’s the risk?
  2. What action do you take?
Recommended Actions

Yes, you have a thundering herd problem. 4,000 queued transactions all retrying at once could overwhelm the database again, causing a secondary outage.

Actions:

  1. Ask Priya: “Can the database handle 4,000 concurrent writes or do we need to throttle?” If the connection pool is at 98%, the answer is likely “it’s going to be tight.”
  2. If available, enable request rate limiting or queue-based processing on the API layer. Morgan may be able to do this through a feature flag or application configuration.
  3. If neither option is available, consider briefly scaling up the connection pool or adding app server instances (if autoscaling is available).
  4. Monitor closely for the next 10 minutes. If connection pools hit 100% and connections start being rejected, you may need to temporarily reject new requests to let the backlog drain.

The key is to not panic. The thundering herd is a known pattern after an outage. It is temporary. But you need to actively manage it rather than assuming it will sort itself out.


The backlog has drained. Error rates are at 0%. Replication lag is back to normal. Connection pool utilization is at 45%.

  1. Is the incident over? What’s your checklist before declaring resolved?
  2. What follow-up actions do you identify?
Recommended Actions

Not yet resolved. Begin stabilization monitoring.

Checklist before declaring resolved:

  • Error rates at 0% for at least 15 minutes
  • Replication lag stable at normal levels for 15 minutes
  • Connection pool utilization stable and normal
  • No queued or failed transactions remaining
  • Enterprise client confirmed their operations are functioning

At 10:55 AM, if all metrics are clean, declare resolved.

Follow-up actions to log for the postmortem:

  1. The migration framework should enforce CREATE INDEX CONCURRENTLY for all production migrations. This is a process/tooling fix.
  2. Add a pre-deploy check that flags migrations with exclusive locks.
  3. Review the deploy process: who approved a migration that takes an exclusive lock on the busiest table in the system during business hours?
  4. Add monitoring for long-running database locks (alert if any lock is held for > 5 minutes).
  5. Update the runbook for connection pool exhaustion / thundering herd scenarios.

Final communications:

  • Status page: “Resolved. Transaction processing has been fully restored.”
  • Casey to enterprise client: “The issue has been fully resolved. All transactions have been processed. We will share a detailed postmortem within 48 hours.”
  • Internal: Schedule postmortem for Wednesday morning while the details are fresh.

After completing the tabletop, reflect on these questions:

  1. Did you prioritize mitigation over investigation? The correct first instinct when you heard “CREATE INDEX is holding a lock” should have been “cancel it” — not “why wasn’t it run with CONCURRENTLY?”

  2. Did you manage the thundering herd? Anticipating second-order effects (like backlog surges after restoring service) is a mark of experienced incident management.

  3. Did you manage communication throughout? The enterprise client in financial close needed to hear from you proactively, not discover the outage through failed transactions.

  4. Did you resist declaring resolved too early? The 15-minute stabilization period catches many “bouncing” incidents.

If you did this exercise with a group, discuss: What would you do differently? Where did you disagree? Disagreements are the most valuable part — they reveal where your team’s incident playbook has gaps.


  1. Incident command is not optional. Without a single point of authority, you get a room full of smart people making things worse.
  2. Mitigation before resolution. Stop the customer impact first. Investigate after.
  3. Communicate relentlessly. Silence is worse than bad news. Update stakeholders every 15 minutes even if the update is “no change.”
  4. Over-classify, don’t under-classify. Declaring a Sev-1 that turns out to be a Sev-3 is a minor inconvenience. Treating a Sev-1 as a Sev-3 is a catastrophe.
  5. The IC does not touch the keyboard. The moment you start debugging, you’ve stopped managing the incident.
  6. Practice before you need it. Tabletop exercises build the muscle memory that kicks in at 3 AM when your brain is running on adrenaline and bad coffee.

Module 1.2: Blameless Postmortems — Learn how to turn incidents into organizational learning without blame, fear, or finger-pointing. The postmortem is where incidents stop being crises and start being investments in reliability.