Module 1.5: Incident Management
Цей контент ще не доступний вашою мовою.
Discipline Module | Complexity:
[MEDIUM]| Time: 35-40 min
Prerequisites
Section titled “Prerequisites”Before starting this module:
- Required: Module 1.1: What is SRE? — Understanding SRE fundamentals
- Required: Module 1.2: SLOs — Understanding service levels
- Recommended: Observability Theory Track — Monitoring and debugging
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Design an incident response framework with clear roles, severities, and escalation paths
- Lead an incident as Incident Commander — coordinating communication, diagnosis, and resolution
- Implement incident communication templates that keep stakeholders informed without slowing response
- Build runbooks that reduce mean time to resolution for recurring incident categories
Why This Module Matters
Section titled “Why This Module Matters”It’s 3 AM. Your phone buzzes. The dashboard is red. Customers are complaining on Twitter.
What do you do?
Without incident management: Panic. Random debugging. Blame. Chaos. Hours of downtime.
With incident management: Clear roles. Systematic response. Coordinated communication. Resolution.
Incidents will happen. The question is whether you’re prepared or not.
This module teaches you how to respond to incidents effectively — minimizing impact, enabling fast recovery, and learning from every failure.
What Is an Incident?
Section titled “What Is an Incident?”An incident is an unplanned event that:
- Disrupts or degrades service
- Requires immediate response
- Affects users or business operations
Not every alert is an incident. Not every bug is an incident.
Incident vs. Problem vs. Alert
Section titled “Incident vs. Problem vs. Alert”| Term | Definition | Example |
|---|---|---|
| Alert | Notification that something might be wrong | ”CPU usage above 80%“ |
| Incident | Active, user-impacting issue requiring response | ”Payment processing failing” |
| Problem | Root cause of one or more incidents | ”Memory leak in payment service” |
An alert might trigger investigation. An incident triggers response. A problem is what you fix to prevent future incidents.
Severity Levels
Section titled “Severity Levels”Not all incidents are equal. Severity levels help you respond appropriately.
Standard Severity Framework
Section titled “Standard Severity Framework”SEV-1 (Critical) ├── User impact: Total outage, major feature broken ├── Scope: All users affected ├── Response: All hands, leadership notified └── Example: Complete site down
SEV-2 (High) ├── User impact: Significant degradation ├── Scope: Many users affected ├── Response: On-call team plus escalations └── Example: Checkout failing for 50% of users
SEV-3 (Medium) ├── User impact: Minor degradation ├── Scope: Some users affected ├── Response: On-call team handles └── Example: Search results slow
SEV-4 (Low) ├── User impact: Minimal ├── Scope: Few users affected ├── Response: Handle during business hours └── Example: Admin dashboard unavailableSeverity by SLO Impact
Section titled “Severity by SLO Impact”| Severity | Error Budget Impact | Response |
|---|---|---|
| SEV-1 | Consuming >100x normal rate | Everyone engaged immediately |
| SEV-2 | Consuming 10-100x normal rate | On-call + escalations |
| SEV-3 | Consuming 2-10x normal rate | On-call investigates |
| SEV-4 | Normal to 2x normal rate | Track, fix when possible |
Don’t Over-Classify
Section titled “Don’t Over-Classify”A common mistake is making everything SEV-1. This causes:
- Alert fatigue
- Desensitization to real emergencies
- Resource exhaustion
Be honest about severity. Most incidents are SEV-3 or SEV-4.
Incident Response Roles
Section titled “Incident Response Roles”Clear roles prevent chaos. Everyone knows their job.
The Core Roles
Section titled “The Core Roles”┌─────────────────────────────────────────────────────────┐│ INCIDENT RESPONSE TEAM │├─────────────────────────────────────────────────────────┤│ ││ ┌─────────────────────────────────────────────┐ ││ │ INCIDENT COMMANDER (IC) │ ││ │ • Owns overall incident response │ ││ │ • Makes decisions, coordinates work │ ││ │ • Declares incident resolved │ ││ └────────────────────┬────────────────────────┘ ││ │ ││ ┌───────────────┼───────────────┐ ││ ▼ ▼ ▼ ││ ┌─────────┐ ┌─────────┐ ┌─────────┐ ││ │ COMMS │ │ TECH │ │ SUBJECT │ ││ │ LEAD │ │ LEAD │ │ MATTER │ ││ │ │ │ │ │ EXPERTS │ ││ └─────────┘ └─────────┘ └─────────┘ ││ │└─────────────────────────────────────────────────────────┘Role Definitions
Section titled “Role Definitions”Incident Commander (IC)
- Overall responsibility for the incident
- Coordinates all response activities
- Makes decisions when needed
- Communicates status to stakeholders
- Declares incident open/closed
Communications Lead (Comms)
- External communication (status page, social media)
- Internal communication (leadership, other teams)
- Keeps customers informed
- Handles media if necessary
Tech Lead
- Drives technical investigation
- Coordinates debugging efforts
- Recommends and implements fixes
- Documents technical timeline
Subject Matter Experts (SMEs)
- Called in based on incident type
- Provide deep expertise in specific areas
- Execute technical tasks under Tech Lead direction
Role Rotation
Section titled “Role Rotation”The IC doesn’t have to be the most senior person. In fact, rotating IC duties:
- Builds incident response skills across team
- Prevents burnout
- Reduces single points of knowledge
Try This: Role-Play an Incident
Section titled “Try This: Role-Play an Incident”With your team, practice roles with a mock incident:
Scenario: "Payment processing is failing for 30% of transactions"
Assign roles: - IC: ______________ - Comms: ______________ - Tech Lead: ______________ - SME (Payments): ______________ - SME (Database): ______________
Practice: 1. IC opens incident, assigns roles 2. Tech Lead begins investigation 3. Comms drafts status page update 4. After 5 min, IC requests status update 5. Practice handoff if IC needs to leaveThe Incident Lifecycle
Section titled “The Incident Lifecycle”Every incident follows a lifecycle:
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ DETECT │ ──▶ │ TRIAGE │ ──▶ │ RESPOND │ ──▶ │ RESOLVE │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │ │ │ ▼ ▼ ▼ ▼ Monitoring Severity Fix/mitigate Verify alerts assignment Communication fixed User reports Role Coordination Close assignment incident
│ ▼ ┌─────────────┐ │ LEARN │ │ (postmortem)│ └─────────────┘Phase 1: Detection
Section titled “Phase 1: Detection”How you learn something is wrong:
- Automated monitoring alerts
- Customer reports
- Internal user reports
- Social media
- Partner notifications
Goal: Minimize time to detection (TTD).
Phase 2: Triage
Section titled “Phase 2: Triage”Quick assessment:
- What’s the impact?
- How many users affected?
- What’s the severity?
- Who needs to be involved?
Goal: Correct severity classification within 5 minutes.
Phase 3: Response
Section titled “Phase 3: Response”Active work to resolve:
- Investigation and debugging
- Implementing fixes
- Coordinating across teams
- Communicating status
Goal: Effective coordination, not chaos.
Phase 4: Resolution
Section titled “Phase 4: Resolution”Confirming the incident is over:
- Fix deployed and verified
- Monitoring confirms recovery
- Users no longer impacted
- Incident declared resolved
Goal: Confident closure, not premature declaration.
Phase 5: Learning
Section titled “Phase 5: Learning”Post-incident improvement:
- Postmortem conducted (next module)
- Action items identified
- Process improvements made
Goal: Never have the same incident twice.
On-Call Best Practices
Section titled “On-Call Best Practices”Being on-call is central to incident management.
On-Call Structure
Section titled “On-Call Structure”Primary On-Call └── First responder └── Available 24/7 during rotation └── Handles or escalates all alerts
Secondary On-Call └── Backup to primary └── Available if primary overwhelmed └── Steps in during handoffs
Escalation Path └── Manager → Director → VP └── For severity or business decisions └── Not for technical debuggingSustainable On-Call
Section titled “Sustainable On-Call”Do:
- Maximum 1 week on-call stretches
- Minimum 2 people in rotation
- Compensate for on-call (time off, pay)
- Clear escalation paths
- Runbooks for common issues
- Post-on-call feedback sessions
Don’t:
- 24/7 on-call for one person
- Alerts for things that aren’t actionable
- Expect on-call to also do project work
- Punish people for escalating
Alert Quality
Section titled “Alert Quality”Good alerts are:
- Actionable: Responder can do something
- Relevant: Actually indicates a problem
- Urgent: Requires immediate attention
- Clear: Includes context to start debugging
Bad alerts are:
- “CPU is high” (what should I do?)
- Fires constantly (alert fatigue)
- Requires reading 5 dashboards to understand
On-Call Metrics
Section titled “On-Call Metrics”| Metric | Good Target | Why It Matters |
|---|---|---|
| Pages per on-call week | < 5 | More = burnout |
| False positive rate | < 20% | Higher = fatigue |
| Time to acknowledge | < 5 min | Faster = faster response |
| Incidents requiring escalation | < 20% | Higher = skill gaps |
| On-call satisfaction | > 3/5 | Lower = retention risk |
Did You Know?
Section titled “Did You Know?”-
Google’s incident management system was inspired by fire department protocols. The Incident Commander role comes directly from emergency services’ Incident Command System (ICS).
-
The best incident responders often do less, not more. They focus on coordination and decision-making rather than trying to personally fix everything.
-
“All hands” incidents often have worse outcomes than properly staffed responses. Too many people creates confusion and duplicated effort.
-
PagerDuty’s annual “State of On-Call” report consistently shows that engineers at high-performing organizations get paged less frequently but handle more complex issues—because they’ve automated away the simple stuff and have better tooling for the hard stuff.
War Story: The Incident That Went Right
Section titled “War Story: The Incident That Went Right”A company I worked with had a major outage:
The Incident:
- 2:30 AM: Total site down
- Cause: Database corruption from failed migration
What Made It Go Well:
Clear roles activated immediately:
IC: Senior SRE (woken by page)Comms: Product manager (called in)Tech Lead: Database engineer (paged)SME: Migration author (paged)Structured communication:
2:35 AM: IC opens incident channel2:40 AM: Comms posts to status page: "Investigating issues"2:50 AM: Tech Lead: "Identified - database corruption"3:00 AM: Comms updates status page with ETA3:30 AM: Database restored from backup3:45 AM: Site back online4:00 AM: IC declares incident resolvedWhat the IC did well:
- Didn’t try to debug personally
- Focused on coordination
- Asked for status every 15 minutes
- Made decision to restore from backup (vs. repair)
- Handled escalation to leadership
- Kept Comms informed for status updates
The result:
- 75 minutes total downtime
- Clear communication throughout
- No blame, just focus
- Excellent postmortem material
Compare to previous incidents:
- Similar issue 6 months earlier: 4 hours of chaos
- No roles, everyone debugging same thing
- Customers in dark, leadership angry
Lesson: Process turns chaos into coordination.
Incident Communication
Section titled “Incident Communication”Internal Communication
Section titled “Internal Communication”Incident Channel:
#incident-2024-01-15-payment-outage
[IC] INCIDENT OPEN - SEV-1 - Payment processing failing[IC] Roles: IC=@alice, Tech=@bob, Comms=@carol[Tech] Investigating. Initial data shows DB connection failures.[Comms] Status page updated. ETA 30 min.[Tech] Root cause identified: Connection pool exhausted[Tech] Implementing fix: Increasing pool size[IC] 15-min check: Fix deploying, ETA 10 more minutes[Tech] Fix deployed. Monitoring.[IC] Metrics recovering. Watching for 10 minutes.[IC] INCIDENT RESOLVED - Total duration: 45 minutesKey Practices:
- One channel per incident
- All decisions documented
- Regular IC status updates
- Clear open/close announcements
External Communication
Section titled “External Communication”Status Page Updates:
[INVESTIGATING] 2:45 PMWe are investigating reports of payment processing issues.Some customers may experience failures when completing checkout.Next update in 15 minutes.
[IDENTIFIED] 3:00 PMWe have identified the cause and are implementing a fix.Affected: Payment processingETA: 15 minutesNext update in 15 minutes.
[MONITORING] 3:20 PMA fix has been deployed. We are monitoring for recovery.Some transactions may have failed during this period.Affected transactions will be automatically retried.
[RESOLVED] 3:35 PMPayment processing has fully recovered.The issue was caused by a configuration error.No customer data was affected.We apologize for the inconvenience.Principles:
- Update regularly (every 15-30 min)
- Be honest about what you know
- Give ETAs when you can
- Acknowledge customer impact
- Avoid blame or technical jargon
Communication Protocols for Major Incidents
Section titled “Communication Protocols for Major Incidents”During a major incident, unstructured communication causes almost as much damage as the outage itself. These protocols keep everyone informed without overwhelming responders.
Communication Cadence by Severity:
| Severity | Internal Update Cadence | Status Page Cadence | Leadership Notification |
|---|---|---|---|
| SEV-1 | Every 15 minutes | Every 15 minutes | Immediately, then every 30 min |
| SEV-2 | Every 30 minutes | Every 30 minutes | Within 30 min, then hourly |
| SEV-3 | Every 60 minutes | As status changes | Daily summary if prolonged |
| SEV-4 | As status changes | Not required | Not required |
Even if nothing has changed, post an update. Silence breeds anxiety and speculation.
Stakeholder Notification Tiers:
Tier 1: Engineering (immediate) └── On-call team, incident responders, relevant SMEs └── Notified via: PagerDuty, incident Slack channel
Tier 2: Engineering Management (within 15 min for SEV-1) └── Engineering managers, directors of affected services └── Notified via: Slack, email
Tier 3: Executives (within 30 min for SEV-1) └── VP Engineering, CTO, CEO (for customer-facing SEV-1) └── Notified via: SMS, phone call, email └── They need: impact scope, ETA, whether customers are affected
Tier 4: Customers (within 30-60 min for SEV-1) └── Via status page, in-app banner, email for affected accounts └── They need: what's broken, workarounds, when it will be fixedCommunication Templates:
Use these as starting points. The Comms Lead fills in the brackets and posts.
INITIAL NOTIFICATION (internal):"[SEVERITY] incident declared. [SERVICE] is [IMPACT DESCRIPTION].Approximately [NUMBER/PERCENTAGE] of users affected.IC: @[NAME] | Tech Lead: @[NAME] | Comms: @[NAME]Incident channel: #incident-[DATE]-[SHORT-NAME]Next update in [15/30] minutes."
STATUS UPDATE (internal, use at each cadence interval):"Update [NUMBER] — [TIME]Current status: [investigating / identified / fix in progress / monitoring]What we know: [1-2 sentences]What we're doing: [current action]ETA to resolution: [estimate or 'unknown']Next update in [15/30] minutes."
STATUS PAGE UPDATE (external, customer-facing):"We are aware of an issue affecting [SERVICE/FEATURE].Impact: [PLAIN-LANGUAGE DESCRIPTION of what users experience].Our team is actively working on a resolution.ETA: [estimate or 'We will provide an update by TIME'].We apologize for the inconvenience."
RESOLUTION NOTIFICATION (external):"The issue affecting [SERVICE] has been resolved as of [TIME].Duration: [START] to [END].[Brief root cause in plain language, no internal jargon].[Any actions customers need to take, e.g., retry failed transactions].We apologize for the disruption and are taking steps to prevent recurrence."War Room / Bridge Call Protocols:
For SEV-1 incidents, a bridge call (video or voice) keeps responders aligned.
| Rule | Why |
|---|---|
| IC opens and runs the bridge | Single point of coordination |
| Mute when not speaking | Reduce noise so updates are heard |
| Tech Lead gives status every 15 min | Keeps IC informed without IC asking repeatedly |
| Non-responders stay off the bridge | Too many people creates chaos (use the Slack channel instead) |
| All decisions spoken aloud and typed in channel | Creates written record, avoids “I thought we said…” |
| Handoff protocol when IC rotates | Outgoing IC summarizes state, incoming IC confirms |
Post-Incident Customer Communication:
After resolution, customers deserve a follow-up — especially for SEV-1 and SEV-2 incidents.
- Within 24 hours: Post a brief summary on the status page explaining what happened and what you are doing to prevent recurrence.
- Within 3-5 business days: For major incidents, send a detailed post-incident report to affected customers. Include root cause, timeline, impact scope, and concrete remediation steps.
- Tone: Honest, accountable, specific. Avoid vague language like “an issue occurred.” Instead: “A misconfigured database failover caused payment processing to fail for 45 minutes.”
- Don’t over-promise: Say “we are implementing X and Y to reduce the likelihood of recurrence” rather than “this will never happen again.”
Customers remember how you communicated during an outage more than the outage itself. Transparent, timely communication builds trust even when things break.
Runbooks and Playbooks
Section titled “Runbooks and Playbooks”Runbooks reduce time to resolution by documenting common procedures.
Runbook Example
Section titled “Runbook Example”# Runbook: Payment Service High Error Rate
## TriggerAlert: payment-service-error-rate-highThreshold: Error rate > 5% for 5 minutes
## ImpactUsers unable to complete purchases.Business impact: ~$X per minute of outage.
## Quick Diagnosis1. Check payment-service dashboard: [link]2. Check dependent services: - Database: [link] - Payment gateway: [link] - Auth service: [link]
## Common Causes & Fixes
### Database Connection Exhaustion**Symptoms**: Connection timeout errors in logs**Fix**:```bashkubectl rollout restart deployment/payment-service -n productionIf doesn’t resolve: Check database load, may need failover
Payment Gateway Outage
Section titled “Payment Gateway Outage”Symptoms: Gateway timeout errors in logs Verify: Check gateway status page: [link] Fix: Enable fallback payment processor:
kubectl set env deployment/payment-service GATEWAY=fallbackAuth Service Degradation
Section titled “Auth Service Degradation”Symptoms: Auth timeout errors in logs Verify: Check auth service dashboard: [link] Fix: Auth service team owns resolution. Escalate to #auth-team
Escalation
Section titled “Escalation”- If not resolved in 15 min: Page secondary on-call
- If not resolved in 30 min: Page engineering manager
- For business decisions: Contact VP on-call
Post-Resolution
Section titled “Post-Resolution”- Verify metrics returned to normal
- Create incident ticket in Jira
- Schedule postmortem if SEV-2 or higher
### Playbook vs. Runbook
| Runbook | Playbook ||---------|----------|| Specific procedure | General strategy || Step-by-step | Principles and patterns || For specific alert/issue | For types of incidents || "How to fix X" | "How to approach category" |
Both are valuable. Runbooks for common issues, playbooks for novel situations.
---
## Common Mistakes
| Mistake | Problem | Solution ||---------|---------|----------|| Everyone debugging | Duplicated effort, chaos | Assign roles, coordinate || No communication | Customers angry, leadership blind | Comms role, regular updates || Premature resolution | Problem returns, erodes trust | Verify with metrics || Over-escalating | Leadership fatigue, cry wolf | Calibrate severity accurately || Under-escalating | Major incident unaddressed | Clear escalation criteria || No postmortem | Same incident recurs | Always do postmortems for SEV-1/2 |
---
## Quiz: Check Your Understanding
### Question 1You receive an alert that API error rate is at 3% (normally 0.1%). How do you triage this?
<details><summary>Show Answer</summary>
**Triage Steps:**
1. **Check impact scope**: How many users affected?2. **Check trend**: Is it getting worse?3. **Check error types**: 4xx vs 5xx matters4. **Check dependencies**: Are upstream/downstream services healthy?
**Severity assessment**:- If 3% and stable, many users affected → SEV-2 or SEV-3- If 3% and climbing rapidly → SEV-2, may escalate to SEV-1- If mostly 4xx (client errors) → Maybe SEV-3 or SEV-4- If mostly 5xx (server errors) → More serious
**Decision**: Classify severity, assign roles if SEV-2+, start investigation.
</details>
### Question 2What's the IC's main job during an incident?
<details><summary>Show Answer</summary>
The IC's main jobs are:
1. **Coordinate**: Ensure roles are assigned and working together2. **Decide**: Make calls when team is stuck or needs direction3. **Communicate**: Keep stakeholders informed4. **Manage scope**: Keep team focused on resolution
What the IC should NOT do:- Personal debugging (that's Tech Lead)- External communication (that's Comms)- Get lost in technical details
The IC is the conductor, not the musician.
</details>
### Question 3An incident is taking longer than expected. When should you escalate?
<details><summary>Show Answer</summary>
Escalate when:
1. **Time thresholds exceeded**: As defined in runbook (e.g., 30 min for SEV-1)2. **Expertise needed**: Current team lacks skills3. **Business decisions required**: Need authority to approve risky fix4. **Scope expansion**: Incident affecting more than originally thought5. **Customer impact increasing**: Despite mitigation efforts
How to escalate well:- Summarize situation clearly- State what you need from escalation- Don't wait too long (bias toward early escalation)
</details>
### Question 4How often should you update the status page during an incident?
<details><summary>Show Answer</summary>
**Recommended cadence:**
- **During active incident**: Every 15-30 minutes- **When status changes**: Immediately (investigating → identified → fix deployed)- **At resolution**: Clear resolution message
**Key principles**:- Regular updates even if no change ("still investigating")- Honest about what you know and don't know- Give ETAs when you can (but don't overpromise)- Acknowledge customer impact
Silence is worse than "we're still working on it."
</details>
---
## Hands-On Exercise: Build an Incident Response Plan
Create an incident response plan for your team.
### Part 1: Severity Definition
Define your severity levels:
```yamlseverity_levels: sev1: name: "Critical" criteria: - # What makes an incident SEV-1? response: - # Who is paged? - # Response time target? examples: - # Real examples from your systems
sev2: name: "High" criteria: - response: - examples: -
sev3: name: "Medium" criteria: - response: - examples: -Part 2: Role Assignments
Section titled “Part 2: Role Assignments”Define who fills each role:
roles: incident_commander: primary: # Who is first IC? rotation: # Who rotates? responsibilities: - -
communications_lead: primary: backup: responsibilities: -
tech_lead: primary: responsibilities: -Part 3: Communication Templates
Section titled “Part 3: Communication Templates”Create templates for status updates:
## Status Page Template: Investigating
We are investigating reports of [ISSUE].
**Impact**: [WHO IS AFFECTED]**Current status**: Investigating**Next update**: [TIME]
---
## Status Page Template: Identified
We have identified the cause of [ISSUE] and are working on a fix.
**Impact**: [WHO IS AFFECTED]**Root cause**: [HIGH-LEVEL DESCRIPTION]**ETA**: [ESTIMATED TIME]**Next update**: [TIME]
---
## Status Page Template: Resolved
[ISSUE] has been resolved.
**Duration**: [START TIME] to [END TIME]**Impact**: [SUMMARY]**Root cause**: [BRIEF EXPLANATION]
We apologize for any inconvenience.Part 4: Runbook Outline
Section titled “Part 4: Runbook Outline”Create a runbook template:
# Runbook: [Alert Name]
## Overview- Alert threshold:- Impact when triggered:
## Quick Diagnosis1.2.3.
## Common Causes### Cause 1: [Name]- Symptoms:- Fix:
### Cause 2: [Name]- Symptoms:- Fix:
## Escalation- If not resolved in X min:- For business decisions:Success Criteria
Section titled “Success Criteria”- Defined all severity levels with examples
- Assigned roles with backups
- Created communication templates
- Started at least one runbook
Key Takeaways
Section titled “Key Takeaways”- Incidents need structure: Roles, severity, process prevent chaos
- The IC coordinates, doesn’t debug: Focus on the whole response
- Communication is critical: Internal and external, regular updates
- Runbooks save time: Document common issues and fixes
- Sustainable on-call: Reasonable load, compensation, clear escalation
Further Reading
Section titled “Further Reading”Books:
- “Site Reliability Engineering” — Chapter 14: Managing Incidents
- “Incident Management for Operations” — Rob Schnepp
Articles:
- “Incident Response at Heroku” — Heroku engineering blog
- “On Being On-Call” — Alice Goldfuss
Tools:
- PagerDuty: Incident management platform
- OpsGenie: Alerting and on-call management
- Incident.io: Incident response in Slack
Summary
Section titled “Summary”Incident management is what happens when things go wrong — and things will go wrong.
Effective incident management:
- Clear roles (IC, Comms, Tech Lead, SMEs)
- Defined severity levels
- Structured response process
- Regular communication
- Documented runbooks
Without it, incidents are chaos. With it, incidents are manageable.
Next Module
Section titled “Next Module”Continue to Module 1.6: Postmortems and Learning to learn how to learn from incidents and prevent recurrence.
“Every great outage tells a story. Make sure you’re listening.” — Unknown