Module 1.1: Incident Command & Crisis Management

Цей контент ще не доступний вашою мовою.


Complexity	`[COMPLEX]`
Time to Complete	2.5 hours
Prerequisites	None

What You’ll Be Able to Do

After completing this module, you will be able to:

Implement the Incident Commander role by establishing communication channels, assigning workstreams, and controlling the pace of response
Design an incident response framework with clear escalation paths, severity definitions, and role assignments before a crisis occurs
Evaluate incident response effectiveness by identifying coordination failures, communication gaps, and decision bottlenecks in post-incident reviews
Apply crisis management techniques (stabilize-before-fix, controlled communication, parallel workstreams) to reduce mean time to recovery

Why This Module Matters

It’s 2:47 AM on a Friday night. Sarah, a senior SRE at a mid-size fintech company, wakes up to her phone buzzing with a cascade of PagerDuty alerts. The primary database cluster serving 12 million users across three continents has stopped accepting writes. The read replicas are 47 seconds behind and climbing. Customer-facing APIs are returning 500 errors at a rate of 14,000 per minute. The on-call Slack channel is already filling up with panicked messages from engineers who were also paged. Three people are SSH’d into production servers running diagnostic queries. Nobody is coordinating. Nobody knows what anyone else is doing. A junior engineer, trying to help, restarts the primary database node — which wipes the in-memory query cache and doubles the read latency on the replicas. The outage, which could have been contained in 20 minutes, drags on for 3 hours and 41 minutes. The company loses $2.3 million in failed transactions and spends the next two weeks apologizing to enterprise clients.

This is not a rare story. This is what happens when talented engineers face a crisis without a system for managing it.

The technology that failed Sarah’s company that night was not the database. It was the absence of incident command. No one declared themselves in charge. No one controlled communications. No one stopped well-meaning engineers from making the situation worse. The technical skills in that room were more than sufficient to solve the problem. What was missing was structure.

This module teaches you that structure. By the end, you will know how to walk into chaos, establish order, and lead a team through the worst night of their operational lives. You will know when to escalate, how to communicate, and — critically — when to stop trying to fix and start trying to stabilize. These are not theoretical skills. They are the difference between a 20-minute incident and a 4-hour catastrophe.

The Incident Command System: From Wildfires to Web Services

The Incident Command System (ICS) was not invented by software engineers. It was created in the 1970s by California firefighters after a series of devastating wildfires where multiple agencies showed up, nobody knew who was in charge, and the fires burned out of control. Sound familiar?

ICS solved a fundamental coordination problem: how do you get dozens of people from different teams, with different expertise, working together under extreme time pressure?

The answer was deceptively simple: clear roles, clear communication channels, and a single point of authority.

The software industry adopted ICS because our incidents have the same characteristics as wildfire response:

Time pressure — every minute of downtime costs money and trust
Multiple responders — database engineers, application developers, network specialists, all converging
Incomplete information — you rarely know the root cause when the incident starts
High stakes — wrong actions can make things worse
Communication overhead — the more people involved, the harder it is to coordinate

ICS Principles Adapted for Software Engineering

ICS Principle	Firefighting Context	Software Incident Context
Unity of Command	Every firefighter reports to one supervisor	Every responder reports to the Incident Commander
Modular Organization	Expand/contract teams as needed	Pull in SMEs only when needed
Management by Objectives	”Contain the fire to this ridge"	"Restore write capability to the primary DB”
Span of Control	3-7 direct reports per leader	IC manages 3-5 leads, not 20 engineers
Common Terminology	Standard terms across agencies	Severity levels, standard status updates
Integrated Communications	Single radio frequency for command	Single Slack channel, single bridge call

The key insight from ICS is this: in a crisis, process is not bureaucracy. Process is survival. Without it, you get a room full of smart people all pulling in different directions.

Incident Roles: Who Does What

There are four critical roles in a software incident. In a small team, one person might wear multiple hats. In a large organization, each role is a separate person. But every role must be filled, even if implicitly.

                    ┌─────────────────────┐
                    │  INCIDENT COMMANDER │
                    │  (Owns the incident)│
                    └─────────┬───────────┘
                              │
           ┌──────────────────┼──────────────────┐
           │                  │                  │
  ┌────────▼─────────┐ ┌─────▼──────────┐ ┌────▼───────────────┐
  │  COMMUNICATIONS  │ │  OPERATIONS    │ │  SUBJECT MATTER    │
  │  LEAD            │ │  LEAD          │ │  EXPERTS (SMEs)    │
  │                  │ │                │ │                    │
  │ - Status updates │ │ - Hands on     │ │ - Database         │
  │ - Stakeholders   │ │   keyboard     │ │ - Networking       │
  │ - Status page    │ │ - Executing    │ │ - Application      │
  │ - Customer comms │ │   fixes        │ │ - Security         │
  └──────────────────┘ │ - Monitoring   │ │ - Infrastructure   │
                       └────────────────┘ └────────────────────┘

Role 1: Incident Commander (IC)

The Incident Commander is the single point of authority during an incident. They do not fix the problem. They manage the people fixing the problem.

Responsibilities:

Declare the incident and its severity level
Assign roles to responders
Make decisions when the team disagrees or is uncertain
Maintain the incident timeline
Decide when to escalate and when to de-escalate
Call the incident resolved and trigger the postmortem process

What the IC says:

“I’m declaring myself Incident Commander for this Sev-1. Acknowledge.”
“Alex, you’re Operations Lead. Focus on the database cluster. Report back in 10 minutes.”
“We’re not going to debate the root cause right now. Our objective is restoring writes. Root cause comes later.”
“It’s been 30 minutes. I’m escalating to VP Engineering for business impact assessment.”
“The fix is deployed and we’ve had 15 minutes of clean metrics. I’m declaring this incident resolved.”

What the IC does NOT do:

SSH into production servers
Write code or run queries
Get drawn into technical rabbit holes
Let three people argue about the right fix without making a decision
Disappear for 20 minutes to investigate something

The hardest part of being IC is not touching the keyboard. If the IC starts debugging, nobody is coordinating. This is the number one failure mode in incident management.

Rule of thumb: The moment you start typing commands into a terminal, you’ve stopped being the Incident Commander.

Role 2: Communications Lead

The Communications Lead is the voice of the incident to the outside world. They handle everything that is not directly related to fixing the problem.

Responsibilities:

Post and update the status page
Send internal updates to leadership and stakeholders
Draft customer-facing communications
Manage the incident Slack channel (keep it focused, move side conversations out)
Track and relay questions from stakeholders to the IC
Maintain the written timeline of events

Why this role matters: Without a Comms Lead, one of two things happens. Either the IC gets pulled away from coordination to answer “what’s happening?” from the CEO every five minutes, or nobody communicates at all and leadership starts making their own assumptions (which are always worse than reality).

Role 3: Operations Lead

The Operations Lead is the person (or people) with hands on the keyboard, actually diagnosing and fixing the problem.

Responsibilities:

Run diagnostic commands and investigate the root cause
Propose and execute mitigation and resolution actions
Report status to the IC at regular intervals
Flag when they need additional expertise (SMEs)
Document what they’ve tried and what the results were

Critical rule: The Operations Lead proposes actions. The IC approves them. This prevents well-meaning engineers from making unilateral changes to production during a crisis.

  Ops Lead: "I want to failover to the secondary database cluster."
  IC:       "What's the risk?"
  Ops Lead: "We'll lose about 30 seconds of writes during the switchover.
             Some in-flight transactions will need to be retried by clients."
  IC:       "That's acceptable. Proceed. Comms Lead, update the status page:
             'We are performing a database failover. Brief interruption expected.'"

Role 4: Subject Matter Experts (SMEs)

SMEs are the specialists who get pulled in when the incident touches their domain. They do not join the incident by default. They are called in by the IC when needed.

Examples:

The DBA when it’s a database issue
The network engineer when it’s a connectivity issue
The security engineer when there’s a potential breach
The application developer who wrote the service that’s failing

Key principle: SMEs advise, they don’t command. They provide expertise to the Operations Lead and IC, who make the final decisions about what actions to take.

Establishing Communication Channels

Communication is where incidents are won or lost. Too little communication and nobody knows what’s happening. Too much communication and the signal drowns in noise. The goal is structured, predictable, and focused communication.

The War Room

A war room is a dedicated space (physical or virtual) where incident responders coordinate. In the age of remote work, this is almost always a video bridge call.

War room rules:

One conversation at a time. If the IC is talking, everyone else is listening.
State your name before speaking. “This is Alex. The replica lag is now at 2 minutes.”
Keep the line clear. Side conversations happen in a separate thread or DM.
Mute when not speaking. Background noise kills focus.
No spectators. If you’re not in an assigned role, you’re not on the bridge.

Slack Channel Structure

For any Sev-1 or Sev-2 incident, create a dedicated Slack channel immediately:

#incident-2026-03-24-db-outage

Channel naming convention: #incident-YYYY-MM-DD-short-description

Pin the following message at the top of the channel:

INCIDENT: Primary database cluster write failure
SEVERITY: Sev-1
IC: @sarah
COMMS LEAD: @jordan
OPS LEAD: @alex
STATUS: Investigating
BRIDGE CALL: https://meet.company.com/incident-room
STATUS PAGE: https://status.company.com

Last update: 02:55 UTC — Replica lag increasing. Investigating primary node disk I/O.

Channel discipline:

The incident channel is for coordinated updates only
Debugging discussions go in a thread
“Is this related to my service?” questions go in a thread
Spectator commentary goes nowhere — keep it out entirely
The Comms Lead is the channel moderator and will redirect off-topic messages

The 15-Minute Update Cadence

During an active incident, the Comms Lead posts a structured update every 15 minutes, whether or not there is news. This is non-negotiable. Silence during an incident is terrifying for stakeholders and leadership.

UPDATE — 03:15 UTC (30 min into incident)
Status: Mitigating
What we know: Primary DB disk I/O saturated at 100%. Cause under investigation.
What we're doing: Failing over to secondary cluster. ETA 5 minutes.
Customer impact: Write operations failing. Read operations degraded (~3s latency).
Next update: 03:30 UTC

Even if the update is “no change, still investigating,” post it. The update itself communicates that people are actively working the problem.

Status Pages and External Communication

Status Page Updates

Your status page is the single source of truth for customers during an incident. It should be updated within 5 minutes of an incident being declared.

Status page levels:

Status	Meaning	When to Use
Operational	Everything working normally	Default state
Degraded Performance	Slower than normal but functional	Latency increases, partial errors
Partial Outage	Some features unavailable	One service or region down
Major Outage	Core functionality unavailable	Primary service down

External Communication Templates

Do not write customer communications from scratch during a crisis. Use templates and fill in the specifics.

Initial acknowledgment (post within 5 minutes):

We are aware of an issue affecting [service/feature]. Our team is actively
investigating. We will provide updates every [15/30] minutes.

Current impact: [brief description of what customers are experiencing]

Update during investigation:

We continue to investigate the issue affecting [service/feature].
We have identified [what you know] and are working on [what you're doing].

Current impact: [updated description]
Next update: [time]

Mitigation in progress:

We have identified the cause of the issue affecting [service/feature]
and are implementing a fix. Some customers may see improvement shortly.

Current impact: [updated description]
Expected resolution: [time estimate, if known, or "we will update when available"]

Resolved:

The issue affecting [service/feature] has been resolved. All systems are
operating normally. The incident lasted approximately [duration].

We will publish a detailed postmortem within [timeframe, e.g., 48 hours].
We apologize for the disruption and appreciate your patience.

Golden rules of external communication:

Never lie. If you don’t know the cause, say “we’re investigating.” Never fabricate an explanation.
Never blame a vendor publicly. Even if AWS caused the outage, your customers chose you, not AWS.
Never give an ETA you can’t meet. “We’re working on it” is better than “fixed in 30 minutes” when you have no idea.
Acknowledge the impact. Customers do not care about your internal details. They care that their transactions failed.
Show empathy. “We understand this impacts your business” goes a long way.

Severity Levels: Sev-1 Through Sev-4

Severity levels are not subjective opinions. They are predefined criteria that determine how the organization responds. Define them before you need them, not during a crisis.

Severity Definitions

┌───────────────────────────────────────────────────────────────────────────┐
│                        SEVERITY LEVEL MATRIX                            │
├───────┬──────────────────┬──────────────────┬──────────────────────────┤
│ Level │ Customer Impact   │ Revenue Impact    │ Response                 │
├───────┼──────────────────┼──────────────────┼──────────────────────────┤
│ SEV-1 │ All/most users    │ Revenue stopped   │ All-hands, 24/7          │
│       │ cannot use core   │ or major loss     │ IC + full team           │
│       │ functionality     │ per minute        │ Exec notification        │
│       │                  │                  │ 15-min update cadence    │
├───────┼──────────────────┼──────────────────┼──────────────────────────┤
│ SEV-2 │ Large subset of   │ Significant       │ Dedicated responders     │
│       │ users affected    │ revenue impact    │ IC assigned              │
│       │ or core feature   │                  │ 30-min update cadence    │
│       │ degraded          │                  │ Business hours + on-call │
├───────┼──────────────────┼──────────────────┼──────────────────────────┤
│ SEV-3 │ Small subset of   │ Minor or no       │ On-call investigates     │
│       │ users affected    │ direct revenue    │ during business hours    │
│       │ Non-critical      │ impact            │ 2-hour update cadence    │
│       │ feature broken    │                  │                          │
├───────┼──────────────────┼──────────────────┼──────────────────────────┤
│ SEV-4 │ Internal only or  │ None              │ Ticket created           │
│       │ cosmetic issue    │                  │ Fixed in normal sprint   │
│       │                  │                  │ No active incident mgmt  │
└───────┴──────────────────┴──────────────────┴──────────────────────────┘

Real-World Severity Examples

Scenario	Severity	Why
Payment processing is down for all customers	Sev-1	Core revenue function, all users affected
Login is failing for users in EU region	Sev-2	Core feature, subset of users
Search results are returning slowly (5s vs 500ms)	Sev-2	Degraded core feature, all users affected
A reporting dashboard shows stale data (2 hours old)	Sev-3	Non-core feature, data is available but delayed
The internal admin panel has a broken CSS layout	Sev-4	Internal only, cosmetic
Email notifications are delayed by 10 minutes	Sev-3	Non-critical feature, mild user impact
The mobile app crashes on launch for all iOS users	Sev-1	Core functionality, major user segment
A staging environment is down	Sev-4	No customer impact

When to Upgrade or Downgrade Severity

Severity is not set in stone. Re-evaluate it as you learn more:

Upgrade when: impact is broader than initially thought, the fix is taking longer than expected, or a workaround stops working
Downgrade when: a workaround reduces customer impact, the affected population is smaller than estimated, or the issue is intermittent rather than total

The IC makes severity changes and the Comms Lead immediately communicates them.

Triaging: Assessing Severity Quickly

When you get paged at 3 AM, you need to determine severity in the first 5 minutes. Here is a triage framework:

The 5-Minute Triage Checklist

┌─────────────────────────────────────────────────────────────────┐
│                    5-MINUTE TRIAGE FLOW                         │
│                                                                 │
│  1. Are customers affected RIGHT NOW?                           │
│     ├─ YES ──► Check error rates, status codes, latency         │
│     └─ NO ───► Likely Sev-3 or Sev-4. Investigate in morning.  │
│                                                                 │
│  2. What percentage of customers are affected?                  │
│     ├─ >50% ──► Likely Sev-1                                   │
│     ├─ 10-50% ─► Likely Sev-2                                  │
│     └─ <10% ──► Likely Sev-3                                   │
│                                                                 │
│  3. Is a core revenue function impacted?                        │
│     ├─ YES ──► Bump severity by one level                      │
│     └─ NO ───► Keep current assessment                         │
│                                                                 │
│  4. Is the issue getting worse?                                 │
│     ├─ YES ──► Bump severity by one level                      │
│     └─ NO ───► Keep current assessment                         │
│                                                                 │
│  5. Declare severity and begin incident process.                │
│     Do NOT spend more than 5 minutes triaging.                 │
└─────────────────────────────────────────────────────────────────┘

Critical mindset: It is always better to over-classify and downgrade than to under-classify and scramble to catch up. Declaring a Sev-1 that turns out to be a Sev-3 wastes a few hours of people’s time. Treating a Sev-1 as a Sev-3 can cost millions.

Mitigating vs. Resolving: Stop the Bleeding First

This is the single most important concept in incident management, and the one that engineers get wrong most often.

Mitigation means reducing or eliminating the customer impact, even if the underlying problem is not fixed.

Resolution means fixing the root cause so the problem does not recur.

┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  TIMELINE OF AN INCIDENT                                           │
│                                                                    │
│  Detection ──► Triage ──► MITIGATION ──► Stabilization ──► Root    │
│                               │            Cause Analysis          │
│                               │                                    │
│                    Customer impact ENDS here                       │
│                    (not at root cause fix)                          │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Mitigation Examples

Problem	Mitigation (do this first)	Resolution (do this later)
Database overloaded	Failover to replica, add read cache	Optimize slow queries, upgrade hardware
Bad deploy causing 500s	Rollback to previous version	Fix the bug, improve test coverage
Memory leak in service	Restart the service	Find and fix the leak
DNS pointing to wrong IP	Update DNS record	Fix the deployment pipeline that set wrong IP
Third-party API down	Enable fallback/cached responses	Add circuit breaker, diversify providers
Disk full on primary server	Delete old logs, expand volume	Add disk usage alerts, log rotation

Why Engineers Resist Mitigation

Engineers are problem solvers. When they see a bug, their instinct is to find and fix it. This is a virtue in normal operations and a liability during an incident. Here is what happens:

BAD (Resolution-first thinking):
  03:00 - Alert fires. Database writes failing.
  03:05 - Engineer starts reading database logs.
  03:20 - Engineer finds a suspicious query pattern.
  03:45 - Engineer is deep in query execution plans.
  04:15 - Engineer finds the root cause: a missing index after a migration.
  04:30 - Engineer adds the index. Writes recover.

  Customer impact: 90 minutes.

GOOD (Mitigation-first thinking):
  03:00 - Alert fires. Database writes failing.
  03:05 - IC declares Sev-1. Triage: primary DB disk I/O saturated.
  03:10 - Ops Lead: "I can failover to secondary in 2 minutes."
  03:12 - IC: "Do it."
  03:14 - Failover complete. Writes recovering.
  03:20 - Customer impact resolved. Team begins root cause analysis.
  04:00 - Root cause found: missing index. Fix scheduled for morning.

  Customer impact: 14 minutes.

The same engineers. The same problem. The same root cause. But one team had a 14-minute outage and the other had a 90-minute outage. The difference is the question they asked first:

Bad question: “Why is this happening?”
Good question: “How do I make it stop affecting customers?”

Timeline: Well-Managed vs. Poorly Managed Incident

A Well-Managed Sev-1

TIME     EVENT                                          WHO
──────── ────────────────────────────────────────────── ──────────
02:47    Alerts fire: DB write errors > 1000/min        Monitoring
02:48    On-call engineer acks the page                 Sarah
02:50    Sarah checks dashboards: primary DB disk I/O   Sarah
         at 100%, replica lag climbing
02:52    Sarah declares Sev-1 in #incidents              Sarah (IC)
         Creates #incident-2026-03-24-db-outage
         Pages Ops Lead (Alex) and Comms Lead (Jordan)
02:55    Jordan posts initial status page update         Jordan
02:56    Alex joins bridge call, begins diagnostics      Alex
03:00    Jordan sends first Slack update to leadership   Jordan
03:05    Alex reports: "Primary disk controller          Alex
         degraded. I recommend failover to secondary."
03:06    Sarah approves failover. "Do it."               Sarah
03:08    Failover initiated                              Alex
03:10    Jordan updates status page: failover in         Jordan
         progress
03:12    Writes recovering on secondary cluster          Alex
03:15    Error rates back to baseline. Sarah asks        Sarah
         team to monitor for 15 minutes.
03:17    Jordan updates status page: "Mitigated.         Jordan
         Monitoring."
03:30    15 minutes clean. Sarah declares incident       Sarah
         resolved.
03:32    Jordan posts resolution notice on status page   Jordan
03:35    Sarah schedules postmortem for Monday           Sarah
         Total customer impact: ~20 minutes

A Poorly Managed Sev-1

TIME     EVENT                                          WHO
──────── ────────────────────────────────────────────── ──────────
02:47    Alerts fire: DB write errors > 1000/min        Monitoring
02:48    On-call engineer acks the page                 Mike
02:50    Mike SSHs into primary DB. Starts reading logs. Mike
03:00    Two more engineers get paged by escalation      Dev1, Dev2
         policy. They also SSH into production.
03:05    Dev1 restarts a background job that looks       Dev1
         suspicious. No improvement.
03:10    Nobody has updated the status page.
         Customer support tickets piling up.
03:15    VP of Engineering sees tweets about the         VP Eng
         outage. Starts asking in #general "what's
         happening?"
03:20    Mike, Dev1, Dev2 are each investigating         Mike,
         different theories. Nobody is coordinating.     Dev1, Dev2
03:30    VP Eng joins the bridge call and starts         VP Eng
         asking questions every 2 minutes. Nobody
         can focus.
03:45    Dev2, trying to help, restarts the primary     Dev2
         DB node. This wipes the query cache.
03:47    Read latency doubles. More alerts fire.         Monitoring
04:00    A senior DBA is finally paged. She assesses     DBA
         the situation in 5 minutes: disk controller
         failing. Recommends failover.
04:10    Nobody is sure who should approve the           (nobody)
         failover. 10 minutes of discussion.
04:20    Failover initiated.                             DBA
04:25    Writes recovering.                              DBA
04:40    Situation stabilizing.
05:00    Error rates back to baseline.
         Total customer impact: ~2 hours 13 minutes

The difference between these two incidents is not technical skill. It is incident management.

A SaaS company serving enterprise clients experienced a login failure on a Monday morning at 9:02 AM Eastern — the highest-traffic hour of their week. The authentication service was rejecting all login attempts with a cryptic “internal server error.”

The on-call engineer diagnosed the problem within 8 minutes: an expired TLS certificate on the internal service mesh. A certificate that was supposed to auto-renew had silently failed three days prior, and the monitoring alert for certificate expiry had been snoozed during an unrelated maintenance window and never re-enabled.

Here is where incident management failed. The engineer who found the problem knew how to fix it: manually renew the certificate and restart the affected services. But instead of requesting a new certificate from the internal CA (a 2-minute operation), he decided to investigate why auto-renewal failed. His reasoning was sound — he wanted to make sure the new certificate would not also fail. But while he investigated, 180,000 users could not log in.

No Incident Commander had been declared. No one was tracking that the investigation had shifted from mitigation to root cause analysis. No status page had been updated. Customer success managers were fielding calls with no information to share.

The certificate was finally renewed at 9:47 AM — 45 minutes after the initial failure. During that 45 minutes, the company’s enterprise clients, who had contractual SLAs guaranteeing 99.95% uptime, experienced a complete authentication outage during peak business hours.

The financial impact:

SLA credits owed to enterprise clients: $1.2 million
Lost revenue from churned trial customers: $340,000 (estimated)
Emergency customer success outreach: $85,000 in overtime and professional services
Deal pipeline impact: Two enterprise prospects paused their evaluations, representing $3.1 million in ARR

Total estimated impact: $4.7 million from a 45-minute outage caused by an expired certificate that could have been renewed in 2 minutes.

The postmortem identified three failures: the missing certificate expiry alert, the lack of auto-renewal monitoring, and — most critically — the absence of incident command that would have separated mitigation from investigation.

Did You Know?

Google’s Incident Management system is built on ICS. Google’s internal incident response training is directly adapted from the FEMA Incident Command System. Every Site Reliability Engineer at Google is trained in ICS principles before they are allowed to manage production incidents. This is documented in the Google SRE book, Chapter 14.
The average cost of a data center outage increased to $740,357 per incident in 2024. According to the Uptime Institute, this figure includes direct costs (lost revenue, SLA credits) and indirect costs (reputation damage, customer churn). For financial services companies, the figure is nearly double. The first 15 minutes of mitigation are worth more than the next 15 hours of investigation.
PagerDuty’s internal data shows that incidents with a dedicated Incident Commander are resolved 40% faster than incidents without one. The primary factor is not the IC’s technical skill — it is the reduction of coordination overhead. When everyone knows who’s in charge, they spend less time discussing what to do and more time doing it.
Blameless culture directly impacts incident response speed. Research by DevOps Research and Assessment (DORA) shows that organizations where engineers feel safe admitting mistakes have 2.5x faster mean time to recovery. When people are afraid of punishment, they hide problems, delay escalation, and spend time covering tracks instead of fixing systems. Incident command only works in a culture where declaring “I caused this” is met with “thank you for telling us” rather than blame.

Common Mistakes

Mistake	What Happens	What To Do Instead
No Incident Commander declared	Multiple people issue conflicting instructions. Nobody tracks the overall picture. Actions are duplicated or contradictory.	First senior responder declares themselves IC within 2 minutes of incident detection. Transfer IC role later if needed.
IC starts debugging	Coordination stops. Nobody is managing communication, tracking actions, or making decisions. The incident is now unmanaged.	IC stays off the keyboard. If you’re the only person available, do triage first, then get someone else to take IC before you start debugging.
Skipping mitigation for root cause	Customers suffer while engineers investigate. A 10-minute outage becomes a 90-minute outage because the team is chasing “why” instead of “how to stop it.”	Always ask: “Can we reduce customer impact right now?” Rollback, failover, feature flag, circuit breaker — anything to stop the bleeding. Root cause analysis happens after.
Too many people in the war room	Noise overwhelms signal. Every engineer who joins asks “what’s happening?” and the team has to re-explain. Spectators offer untested theories.	Strict role-based access. Only IC, Comms Lead, Ops Lead, and requested SMEs on the bridge. Everyone else follows the Slack channel updates.
No regular status updates	Leadership panics and starts making phone calls. Customer support has nothing to tell customers. Social media fills the void with speculation. Executives show up in the war room asking questions.	Comms Lead posts structured updates every 15 minutes for Sev-1, every 30 minutes for Sev-2, whether or not there is news.
Freelance production changes	An engineer, trying to help, makes an unauthorized change that makes the situation worse. Classic example: restarting a database node and wiping the cache during a performance incident.	All production changes during an incident go through the IC. “I want to try X” is a request, not an action. The IC approves or rejects.
Not documenting actions during the incident	The postmortem timeline is reconstructed from memory, Slack messages, and partial logs. Key decisions and their rationale are lost.	Comms Lead or a designated scribe logs every significant action and decision in the incident channel with timestamps.
Declaring “resolved” too early	The fix is deployed but not verified. The incident recurs 20 minutes later and the team has to reconvene. Customer trust takes a double hit.	Require a minimum stabilization period (15-30 minutes of clean metrics) before declaring resolved. Monitor closely for the next 2-4 hours.

Quiz

Test your understanding. Try answering before revealing the answer.

Question 1

Your monitoring alerts fire at 3 AM showing a 40% error rate on your API gateway. You are the only engineer paged. What is the correct first action?

Show Answer

Triage the severity within 5 minutes. Check dashboards to determine: Are customers affected? What percentage? Is a core function impacted? Based on this, declare the severity level. If it’s Sev-1 or Sev-2, page additional responders and declare yourself IC. Do not start debugging until you have assessed the situation and established the incident structure. A 40% error rate on the API gateway is almost certainly Sev-1 — it means nearly half your traffic is failing.

Question 2

You are the Incident Commander during a Sev-1. Your best database engineer says: “I know exactly what the problem is. Give me 20 minutes and I’ll have the root cause fixed.” Meanwhile, a simpler mitigation (failover to a replica) could restore service in 3 minutes but would lose about 10 seconds of recent writes. What do you do?

Show Answer

Order the failover. Mitigation always comes before resolution. A 3-minute path to restoring service with a minor data loss (10 seconds of writes) is almost always preferable to a 20-minute path that might not work as expected. The “20 minutes” estimate is optimistic — root cause fixes during incidents frequently take 2-3x longer than estimated. After the failover restores service, the database engineer can then investigate and fix the root cause without time pressure.

The 10-second write loss is a business decision — confirm with the IC (you) and, if appropriate, a business stakeholder. But in most cases, 10 seconds of lost writes is dramatically better than 20+ minutes of total write failure.

Question 3

During a Sev-2 incident, a VP joins the bridge call and starts asking detailed technical questions every few minutes. This is disrupting the Operations Lead’s focus. What should happen?

Show Answer

The IC should redirect the VP to the Comms Lead. The IC says something like: “VP, I understand you need updates. Jordan is our Communications Lead and will provide you with updates every 30 minutes. I need to keep this bridge focused on the technical response.” If the VP insists on staying, the IC should ask them to mute and listen without interrupting. The IC owns the incident and has the authority to manage the war room, regardless of organizational hierarchy. This is a core ICS principle: during an incident, the IC outranks everyone on operational decisions.

Question 4

An engineer on your team notices elevated error rates on a service they own but believes it’s a transient issue that will resolve itself. They decide not to declare an incident. Two hours later, the errors have increased and customers are complaining. What went wrong?

Show Answer

The engineer failed to follow the over-classify principle. It is always better to declare an incident and downgrade than to wait and escalate later. By the time the engineer realized the issue was real, two hours of customer impact had already occurred. The correct action was to declare a Sev-3 at minimum when the errors were first observed, set a watch on the metrics, and upgrade severity if the errors persisted or increased. Declaring an incident is cheap. Missing an incident is expensive.

Question 5

Your team has resolved a Sev-1 incident. Error rates are back to zero and the fix has been deployed. It’s 4:30 AM. Should you declare the incident resolved and go back to sleep?

Show Answer

Not yet. You need a stabilization period. Monitor clean metrics for at least 15-30 minutes before declaring resolved. Many incidents have a “bounce” where the fix appears to work but fails under load or when certain conditions recur. Additionally, before standing down: update the status page to “Monitoring,” send a near-final update to stakeholders, and ensure someone is watching dashboards for the next 2-4 hours (either you or a handoff to another timezone). Only after the stabilization period should you declare resolved, post the final status page update, and schedule the postmortem.

Question 6

You are the IC for a Sev-1. Thirty minutes in, the Operations Lead tells you they are stuck and don’t know what’s causing the issue. They’ve tried three things and none have worked. What do you do?

Show Answer

Escalate and expand. Specifically: (1) Ask the Ops Lead who the subject matter experts are for the affected systems and page them immediately. Fresh eyes often see what exhausted eyes miss. (2) Ask: “Have we tried mitigating even if we don’t know the cause?” — failover, rollback, feature flags, and traffic shifting don’t require understanding root cause. (3) Consider whether the scope is broader than initially thought and whether you should upgrade severity. (4) If 30 minutes in you are still stuck on a Sev-1, escalate to senior leadership so they are aware and can authorize exceptional measures (like waking up someone on PTO who has critical knowledge). Do not let “we’ll figure it out” drag on while customers suffer.

Hands-On Exercise: Tabletop Exercise — Managing a Sev-1 Database Outage

A tabletop exercise is a simulation where you walk through an incident scenario, making decisions at each stage. You do not need a real system for this. You need a quiet room, this scenario, and ideally 2-4 other people to play the roles.

If you are practicing solo, play the IC role and write down what you would say and do at each decision point.

The Scenario

You are the Incident Commander. It is Tuesday at 10:15 AM. Your largest customer, representing 18% of annual revenue, is in the middle of their quarterly financial close — the most critical period of their fiscal cycle.

10:15 AM — PagerDuty alerts fire:

CRITICAL: PostgreSQL primary — replication lag > 60s
CRITICAL: API error rate > 5% — /api/v2/transactions endpoint
WARNING: Connection pool utilization > 90% on app-server-{1,2,3}

Your team available:

You (IC)
Morgan — Senior backend engineer, knows the application layer well
Priya — DBA, PostgreSQL expert, but currently in a meeting
Taylor — SRE, strong on infrastructure and monitoring
Casey — Customer success manager for the affected enterprise client

Resources available:

Grafana dashboards
PostgreSQL read replicas (2 replicas, normally healthy)
Feature flag system (LaunchDarkly)
Recent deploy log (last deploy was 9:45 AM — 30 minutes ago)

Decision Point 1 (10:15 AM): Initial Response

Based on the alerts, answer these questions:

What severity do you declare and why?
Who do you page and what roles do you assign?
What is the first message you post in the incident channel?

Recommended Actions

Severity: Sev-1. The transaction endpoint is a core revenue function. Error rates above 5% on that endpoint during a customer’s financial close is a direct revenue and relationship threat. The combination of replication lag, API errors, and connection pool exhaustion suggests a systemic issue, not a transient blip.

Pages and roles:

Pull Priya out of her meeting immediately. She is the PostgreSQL expert and this is a database incident. Assign her as Operations Lead.
Morgan as secondary Ops Lead / SME for the application layer.
Taylor as your technical eyes — have them pull up Grafana and start reading the situation.
Casey should be notified but NOT in the war room. Send Casey to standby: “We have a Sev-1 impacting the transactions API. I will update you every 15 minutes. Prepare to communicate with [enterprise client] if needed.”

Incident channel message:

INCIDENT DECLARED — Sev-1
Issue: PostgreSQL replication lag + API errors on /api/v2/transactions
IC: [You]
Ops Lead: @priya
Comms Lead: @taylor (until dedicated Comms resource available)
Bridge: [link]
Impact: Transaction processing errors. Largest customer in financial close.
Status: Investigating

Decision Point 2 (10:22 AM): First Findings

Taylor reports from Grafana:

The 9:45 AM deploy introduced a new database migration that added an index on the transactions table
The CREATE INDEX command is still running — it’s been holding a lock for 37 minutes
The lock is blocking writes to the transactions table
Read replicas are healthy but falling behind because WAL replay is blocked

Priya confirms: “The migration is running CREATE INDEX without CONCURRENTLY. It’s holding an ACCESS EXCLUSIVE lock on the transactions table. No writes can proceed until it finishes or is cancelled.”

What is your mitigation strategy?
What questions do you ask before approving it?
What do you tell Casey to communicate to the enterprise client?

Recommended Actions

Mitigation strategy: Cancel the index creation immediately.

The index is blocking all writes. Cancelling it will release the lock and restore write capability. The index can be recreated later using CREATE INDEX CONCURRENTLY, which does not hold an exclusive lock.

Questions before approving:

“Priya, if we cancel the CREATE INDEX, will it cleanly release the lock? Any risk of table corruption?” (Answer: No, cancelling a CREATE INDEX is safe. The partial index is simply dropped.)
“Will the application function correctly without this new index?” (Answer: Yes, it ran fine without it before the migration.)
“How do we cancel it? pg_cancel_backend() on the migration’s PID?”

Once Priya confirms it is safe, approve immediately: “Cancel it. Now.”

Communication to enterprise client (via Casey): “We are experiencing a brief disruption to transaction processing. Our team has identified the cause and is implementing a fix. We expect resolution within the next 10 minutes. Your data is safe and no transactions have been lost — pending transactions will process once the system recovers.”

Do NOT say “a bad deploy caused it” to the customer. Keep it factual and focused on impact and recovery.

Decision Point 3 (10:26 AM): Mitigation Executed

Priya cancels the index creation. Within 90 seconds:

Write lock released
Transaction writes resuming
Error rates dropping
Replication lag decreasing

But now Morgan reports: “There’s a backlog of about 4,000 queued transactions that were blocked. They’re all hitting the database simultaneously. Connection pool is at 98%.”

Do you have a new problem? What’s the risk?
What action do you take?

Recommended Actions

Yes, you have a thundering herd problem. 4,000 queued transactions all retrying at once could overwhelm the database again, causing a secondary outage.

Actions:

Ask Priya: “Can the database handle 4,000 concurrent writes or do we need to throttle?” If the connection pool is at 98%, the answer is likely “it’s going to be tight.”
If available, enable request rate limiting or queue-based processing on the API layer. Morgan may be able to do this through a feature flag or application configuration.
If neither option is available, consider briefly scaling up the connection pool or adding app server instances (if autoscaling is available).
Monitor closely for the next 10 minutes. If connection pools hit 100% and connections start being rejected, you may need to temporarily reject new requests to let the backlog drain.

The key is to not panic. The thundering herd is a known pattern after an outage. It is temporary. But you need to actively manage it rather than assuming it will sort itself out.

Decision Point 4 (10:40 AM): Stabilizing

The backlog has drained. Error rates are at 0%. Replication lag is back to normal. Connection pool utilization is at 45%.

Is the incident over? What’s your checklist before declaring resolved?
What follow-up actions do you identify?

Recommended Actions

Not yet resolved. Begin stabilization monitoring.

Checklist before declaring resolved:

Error rates at 0% for at least 15 minutes
Replication lag stable at normal levels for 15 minutes
Connection pool utilization stable and normal
No queued or failed transactions remaining
Enterprise client confirmed their operations are functioning

At 10:55 AM, if all metrics are clean, declare resolved.

Follow-up actions to log for the postmortem:

The migration framework should enforce CREATE INDEX CONCURRENTLY for all production migrations. This is a process/tooling fix.
Add a pre-deploy check that flags migrations with exclusive locks.
Review the deploy process: who approved a migration that takes an exclusive lock on the busiest table in the system during business hours?
Add monitoring for long-running database locks (alert if any lock is held for > 5 minutes).
Update the runbook for connection pool exhaustion / thundering herd scenarios.

Final communications:

Status page: “Resolved. Transaction processing has been fully restored.”
Casey to enterprise client: “The issue has been fully resolved. All transactions have been processed. We will share a detailed postmortem within 48 hours.”
Internal: Schedule postmortem for Wednesday morning while the details are fresh.

Exercise Debrief

After completing the tabletop, reflect on these questions:

Did you prioritize mitigation over investigation? The correct first instinct when you heard “CREATE INDEX is holding a lock” should have been “cancel it” — not “why wasn’t it run with CONCURRENTLY?”
Did you manage the thundering herd? Anticipating second-order effects (like backlog surges after restoring service) is a mark of experienced incident management.
Did you manage communication throughout? The enterprise client in financial close needed to hear from you proactively, not discover the outage through failed transactions.
Did you resist declaring resolved too early? The 15-minute stabilization period catches many “bouncing” incidents.

If you did this exercise with a group, discuss: What would you do differently? Where did you disagree? Disagreements are the most valuable part — they reveal where your team’s incident playbook has gaps.

Key Takeaways

Incident command is not optional. Without a single point of authority, you get a room full of smart people making things worse.
Mitigation before resolution. Stop the customer impact first. Investigate after.
Communicate relentlessly. Silence is worse than bad news. Update stakeholders every 15 minutes even if the update is “no change.”
Over-classify, don’t under-classify. Declaring a Sev-1 that turns out to be a Sev-3 is a minor inconvenience. Treating a Sev-1 as a Sev-3 is a catastrophe.
The IC does not touch the keyboard. The moment you start debugging, you’ve stopped managing the incident.
Practice before you need it. Tabletop exercises build the muscle memory that kicks in at 3 AM when your brain is running on adrenaline and bad coffee.

Next Module

Module 1.2: Blameless Postmortems — Learn how to turn incidents into organizational learning without blame, fear, or finger-pointing. The postmortem is where incidents stop being crises and start being investments in reliability.

Module 1.1: Incident Command & Crisis Management

What You’ll Be Able to Do

Why This Module Matters

The Incident Command System: From Wildfires to Web Services

ICS Principles Adapted for Software Engineering

Incident Roles: Who Does What

Role 1: Incident Commander (IC)

Role 2: Communications Lead

Role 3: Operations Lead

Role 4: Subject Matter Experts (SMEs)

Establishing Communication Channels

The War Room

Slack Channel Structure

The 15-Minute Update Cadence

Status Pages and External Communication

Status Page Updates

External Communication Templates

Severity Levels: Sev-1 Through Sev-4

Severity Definitions

Real-World Severity Examples

When to Upgrade or Downgrade Severity

Triaging: Assessing Severity Quickly

The 5-Minute Triage Checklist

Mitigating vs. Resolving: Stop the Bleeding First

Mitigation Examples

Why Engineers Resist Mitigation

Timeline: Well-Managed vs. Poorly Managed Incident

A Well-Managed Sev-1

A Poorly Managed Sev-1

War Story: The $4.7 Million Login Outage

Did You Know?

Common Mistakes

Quiz

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Hands-On Exercise: Tabletop Exercise — Managing a Sev-1 Database Outage

The Scenario

Decision Point 1 (10:15 AM): Initial Response

Decision Point 2 (10:22 AM): First Findings

Decision Point 3 (10:26 AM): Mitigation Executed

Decision Point 4 (10:40 AM): Stabilizing

Exercise Debrief

Key Takeaways

Next Module