Module 3.4: From Data to Insight

Complexity: [MEDIUM]

Time to Complete: 30-35 minutes

Prerequisites: Module 3.3: Instrumentation Principles

Track: Foundations

What You’ll Be Able to Do

After completing this module, you will be able to:

Build alert rules that detect meaningful user-impact signals rather than generating noise from infrastructure metrics
Analyze observability data to move from symptom detection to root-cause identification within a structured debugging workflow
Design dashboards that answer operational questions at a glance and guide engineers toward the right next investigation step
Evaluate whether an alerting strategy minimizes both false positives and time-to-detection for real incidents

The Team That Had Perfect Data But No Answers

December 2018. A Major E-commerce Platform. Black Friday Weekend.

The observability team had done everything right. World-class Prometheus deployment with 2.3 million time series. Elasticsearch cluster ingesting 4TB of logs daily. Jaeger storing 500 million spans. Beautiful Grafana dashboards covering every service. They’d spent 18 months building this observability stack.

2:14 AM, Saturday. The worst time for e-commerce—Black Friday shoppers still active, Cyber Monday promotions launching in hours.

Alert: “Checkout error rate exceeded 2%.”

The on-call engineer opens Grafana. Error rate: 2.3%. She can see that something is wrong. But the dashboard has 47 panels. Which one matters right now? She clicks through each one, looking for anomalies. Five minutes pass.

She opens the logs. Searches for “checkout” and “error.” 2.4 million results in the last hour. She adds “level:error”—still 847,000 results. She can’t narrow down further because the logs don’t have consistent field names. Different services use different conventions.

She switches to tracing. Finds a failing trace. The error is “connection timeout” to the inventory service. Great. But why? The inventory service dashboard shows… nothing wrong. Latency normal. Error rate normal.

47 minutes. That’s how long it took to find the root cause. The inventory service was fine. The problem was network congestion between the checkout service and inventory service—a firewall rule change made two weeks earlier had introduced a 50ms timeout where there should have been 500ms. Under Black Friday load, retries stacked up and cascaded.

The fix: One line of configuration. 30 seconds to apply.

The cost: 47 minutes × 2.3% error rate × Black Friday traffic = $2.8 million in abandoned carts. Plus customer support costs. Plus reputation damage.

THE DATA-TO-INSIGHT GAP
═══════════════════════════════════════════════════════════════════════════════

WHAT THEY HAD                          WHAT THEY LACKED
─────────────────────────────────────  ─────────────────────────────────────
☑ 2.3 million metrics                  ✗ Dashboard hierarchy (what matters now?)
☑ 4TB daily logs                       ✗ Consistent field names (can't query)
☑ 500 million spans                    ✗ SLO-based alerting (why 2%? is that bad?)
☑ Beautiful dashboards                 ✗ Runbooks (what to do when this happens)
☑ All three pillars                    ✗ Investigation workflow (where to start)
☑ Perfect instrumentation              ✗ Mental model of failure modes

DATA ≠ INSIGHT

They had all the data to find the problem in 5 minutes.
They lacked the insight to navigate that data effectively.
Cost of the gap: $2.8 million.

After the incident, they rebuilt their approach:

Created dashboard hierarchy (summary → signals → details)
Defined consistent field naming across all services
Implemented SLO-based alerts with error budgets
Wrote runbooks for common failure modes
Trained the team on systematic investigation

The next similar incident took 8 minutes to resolve.

Why This Module Matters

You’ve instrumented your services. Logs are flowing, metrics are being scraped, traces are being collected. Now what?

Having data isn’t the same as having understanding. The real value of observability comes from turning data into insight—answering questions, detecting problems, and building mental models of system behavior.

This module teaches you how to use observability data effectively: asking the right questions, building useful alerts, debugging systematically, and developing intuition about your systems.

The Medical Analogy

A doctor doesn’t just collect test results—they interpret them. Blood pressure, heart rate, and lab results are data. A diagnosis is insight. The same data can mean different things in different contexts. Observability is similar: metrics, logs, and traces are raw data. Understanding why your system behaves the way it does is insight.

What You’ll Learn

How to ask good questions of your observability data
Building effective alerts (signal vs. noise)
Debugging systematically with observability
Dashboards that tell stories
Building mental models of system behavior

Part 1: Asking Questions

1.1 The Question Hierarchy

Not all questions are equally useful. Start broad, narrow down:

THE QUESTION HIERARCHY
═══════════════════════════════════════════════════════════════

Level 1: DETECTION
─────────────────────────────────────────────────────────────
"Is something wrong?"
    → Alerts, SLO dashboards, error rates
    → Answers: Yes/No

Level 2: SCOPE
─────────────────────────────────────────────────────────────
"How bad is it? Who's affected?"
    → Break down by dimension: region, user type, endpoint
    → Answers: X% of users, Y region, Z feature

Level 3: LOCALIZATION
─────────────────────────────────────────────────────────────
"Where is the problem?"
    → Service dependency analysis, trace flamegraphs
    → Answers: Service X, database Y, function Z

Level 4: ROOT CAUSE
─────────────────────────────────────────────────────────────
"Why is this happening?"
    → Logs, code, recent changes
    → Answers: Bug in version 2.3.1, config change, external dependency

1.2 Good Questions vs. Bad Questions

QUESTION QUALITY
═══════════════════════════════════════════════════════════════

BAD QUESTIONS (vague, hard to answer)
─────────────────────────────────────────────────────────────
"Why is the site slow?"
    → What is "slow"? For whom? Since when?

"Is everything okay?"
    → Define "okay"

"What happened yesterday?"
    → What specifically?

GOOD QUESTIONS (specific, answerable)
─────────────────────────────────────────────────────────────
"What's the p99 latency for /api/checkout in the last hour?"
    → Specific metric, specific endpoint, specific timeframe

"Which users experienced errors during the deploy at 3pm?"
    → Specific event, specific population

"What changed between 2pm (working) and 3pm (broken)?"
    → Specific comparison, specific times

1.3 The Exploratory Investigation Pattern

EXPLORATORY INVESTIGATION
═══════════════════════════════════════════════════════════════

START: "Something is wrong" (symptom)

1. QUANTIFY
   "How wrong? What's the error rate/latency/failure count?"
   → Establishes baseline for comparison

2. SEGMENT
   "Is it all requests or some? Which ones?"
   → Group by: endpoint, user_type, region, version, time
   → Find: Where is the problem concentrated?

3. CORRELATE
   "What else happened at the same time?"
   → Deploys, config changes, dependency issues, traffic spike
   → Find: Potential causes

4. EXEMPLIFY
   "Show me a specific failing request"
   → Trace → Logs → Detailed error
   → Find: Concrete evidence

5. HYPOTHESIZE
   "I think the problem is X because of Y"
   → Test: If X is the problem, what else should be true?

6. VERIFY
   "Let me confirm by checking Z"
   → Additional queries to confirm/refute hypothesis

7. RESOLVE
   "The problem is X, fixed by Y"
   → Document for future reference

Try This (3 minutes)

Think of a recent incident. Walk through the investigation pattern:

How did you quantify the problem?

How did you segment to narrow down?

What correlations did you look for?

Did you find a specific example to examine?

Part 2: Effective Alerting

2.1 Alert Philosophy

ALERTING PRINCIPLES
═══════════════════════════════════════════════════════════════

ALERT ONLY WHEN ACTION IS NEEDED
─────────────────────────────────────────────────────────────
If the alert fires and you do nothing → Remove the alert
If the alert fires and you always do the same thing → Automate it
If the alert fires and you investigate → Good alert

ALERT ON SYMPTOMS, NOT CAUSES
─────────────────────────────────────────────────────────────
Bad:  "CPU > 80%"           (cause - might be fine)
Good: "Error rate > 1%"     (symptom - users affected)
Good: "p99 latency > 500ms" (symptom - users affected)

ALERT ON USER IMPACT
─────────────────────────────────────────────────────────────
Would a user notice? Alert.
Would only you notice? Maybe don't alert.

Error budget burn rate > threshold → Users will be affected → Alert

2.2 SLO-Based Alerting

SLO-BASED ALERTS
═══════════════════════════════════════════════════════════════

Instead of arbitrary thresholds, alert based on error budget:

SLO: 99.9% availability (error budget: 43 minutes/month)

Alert conditions:
─────────────────────────────────────────────────────────────
SEVERE (page immediately):
    Error budget burn rate that would exhaust budget in 1 hour
    = 720x normal burn rate
    "At this rate, we're out of budget in 1 hour"

HIGH (page during business hours):
    Error budget burn rate that would exhaust budget in 6 hours
    = 120x normal burn rate
    "At this rate, we're out of budget in 6 hours"

MEDIUM (ticket):
    Error budget burn rate that would exhaust budget in 3 days
    = 10x normal burn rate
    "At this rate, we're out of budget this week"

LOW (dashboard):
    Any error budget consumption
    "We're using budget, but within acceptable range"

2.3 Reducing Alert Fatigue

Problem	Cause	Solution
Too many alerts	Low thresholds	Raise thresholds, use SLO-based
Alerts for non-issues	Alerting on causes	Alert on symptoms instead
Flapping alerts	Sensitive thresholds	Add duration requirement (5 min)
Duplicate alerts	Multiple rules for same issue	Deduplicate, use alert hierarchy
Unactionable alerts	No clear response	Add runbook, or delete alert

ALERT HYGIENE CHECKLIST
═══════════════════════════════════════════════════════════════

For each alert, answer:
[ ] Does this require human action?
[ ] Is the action clear? (link to runbook)
[ ] Does this fire rarely enough to be noticed?
[ ] Is the severity appropriate?
[ ] Has this alert been useful in the last 30 days?

If you can't answer "yes" to all → Fix or remove the alert.

Did You Know?

Google’s SRE teams aim for a 50% “signal ratio”—meaning 50% of pages should be real issues requiring intervention. Below 50%, alerts are too noisy and get ignored. Above 80%, you’re probably missing issues because you’re not alerting enough.

Part 3: Debugging with Observability

3.1 The Debugging Workflow

DEBUGGING WORKFLOW
═══════════════════════════════════════════════════════════════

            ┌─────────────────────────────────────────────┐
METRICS ───▶│ Alert fires: Error rate spike              │
            │ Dashboard: 5% errors on /api/checkout      │
            └──────────────────┬──────────────────────────┘
                               │
            ┌──────────────────▼──────────────────────────┐
METRICS ───▶│ Segment: Which errors? Where?              │
            │ 90% of errors from US-West region          │
            │ Started 14:32, correlates with deploy      │
            └──────────────────┬──────────────────────────┘
                               │
            ┌──────────────────▼──────────────────────────┐
TRACES ────▶│ Exemplar: Get example failing trace        │
            │ Trace abc-123: Timeout on payment-service  │
            │ 3 retry attempts, all timed out            │
            └──────────────────┬──────────────────────────┘
                               │
            ┌──────────────────▼──────────────────────────┐
LOGS ──────▶│ Detail: What's in the logs?                │
            │ "Connection refused: payment-db-west-02"   │
            │ Database server is unreachable             │
            └──────────────────┬──────────────────────────┘
                               │
            ┌──────────────────▼──────────────────────────┐
ROOT CAUSE ▶│ payment-db-west-02 in maintenance         │
            │ Failover didn't trigger correctly          │
            └─────────────────────────────────────────────┘

3.2 Common Debugging Patterns

“It’s slow”

DEBUGGING LATENCY
═══════════════════════════════════════════════════════════════

1. Confirm: What's the actual latency? p50? p99?
2. Scope: All requests or some? Which endpoints?
3. Trace: Get example slow traces
4. Flamegraph: Where is time spent?
   └── Network call to service X: 800ms (80% of time)
5. Drill down: Why is service X slow?
6. Repeat until root cause found

“It’s broken”

DEBUGGING ERRORS
═══════════════════════════════════════════════════════════════

1. Confirm: What's the error rate? Error types?
2. Scope: Which users? Which endpoints? Which versions?
3. Compare: What's different about failing requests?
4. Exemplify: Get specific error logs
   └── Stack trace, error message, context
5. Correlate: What changed? Deploy? Config? Dependency?
6. Hypothesize and verify

“It’s weird”

DEBUGGING STRANGE BEHAVIOR
═══════════════════════════════════════════════════════════════

1. Define: What exactly is "weird"? Quantify it.
2. Baseline: What's "normal"? Compare to historical data.
3. Isolate: Find the smallest reproducing case.
4. Trace: Follow request through system.
5. Diff: What's different from normal requests?
6. Often: Race condition, caching issue, state dependency

3.3 Comparison: Then vs. Now

COMPARISON QUERIES
═══════════════════════════════════════════════════════════════

"It was working yesterday, now it's broken"

Query pattern:
    Compare: metric_X at time_good vs time_bad

    Example:
    GOOD: 2024-01-15 10:00-12:00, error_rate = 0.1%
    BAD:  2024-01-16 10:00-12:00, error_rate = 5.0%

    Segment by deploy_version:
    GOOD: version 2.3.0, error_rate = 0.1%
    BAD:  version 2.3.1, error_rate = 4.9%

    Conclusion: Version 2.3.1 introduced the problem.

Similar pattern for:
    - Before/after deploy
    - This week vs. last week
    - Failing user vs. working user

War Story: The $4.2 Million Lesson in Systematic Investigation

2020. A Major Subscription Service. Sunday Evening, Peak Usage.

9:17 PM. Alert fires: “Video streaming error rate exceeded 1%.” The on-call engineer opens the dashboard. Error rate: 1.3%. He’s seen this before. “Probably just a CDN hiccup,” he thinks. He restarts the streaming service pods. Error rate drops to 0.8%. Problem solved?

No. Error rate climbs back to 1.5% within 5 minutes. Then 2.1%. Then 3.4%.

The Old Approach (what he did):

9:17 PM: Alert fires. Restarts pods.

9:23 PM: Error rate climbing. Restarts more pods.

9:31 PM: No improvement. Escalates to senior engineer.

9:45 PM: Senior tries rolling back recent deploy. No effect.

10:12 PM: Team realizes problem isn’t in their code.

10:34 PM: Discover third-party authentication provider is rate-limiting them.

10:41 PM: Implement token caching workaround.

Time to resolution: 84 minutes. Most users couldn’t watch anything during peak Sunday evening.

The New Approach (what they trained to do after):

Step Action Time
Quantify ”Error rate 1.3%, normally 0.1%. Specific error: ‘auth_token_failed‘“ 2 min
Segment ”100% of errors are auth failures. Affects all regions equally.” 3 min
Correlate ”Started 9:15 PM. No deploys. No config changes. Check external deps.” 4 min
Exemplify ”Trace shows auth service calling identity provider, getting 429 Too Many Requests” 3 min
Hypothesize ”Identity provider is rate-limiting us” 1 min
Verify ”Check identity provider dashboard—confirmed, hitting rate limit” 2 min
Resolve ”Enable token caching (was disabled in recent cleanup)“ 5 min

Time to resolution with systematic approach: 20 minutes.

The Financial Impact:

84 minutes at peak × 3.4% average error rate × 12 million active users = ~408,000 users affected

Estimated lost subscription renewals: $890,000

Compensation credits issued: $340,000

Customer support costs: $125,000

Engineering escalation (5 engineers × 2 hours): $2,500

Total: $1.36 million for this single incident

With systematic investigation, the impact would have been:

20 minutes × 2.1% average error rate × 12 million users = ~50,000 users affected

Estimated cost: ~$160,000

Savings: $1.2 million from faster investigation

The team ran the numbers. They’d had 47 similar incidents in the past year. If systematic investigation saved even 20 minutes on average, that was $4.2 million annually.

They built runbooks. They trained on the exploratory pattern. They created investigation checklists. The pattern works. Use it.

Step	Action	Time
Quantify	”Error rate 1.3%, normally 0.1%. Specific error: ‘auth_token_failed‘“	2 min
Segment	”100% of errors are auth failures. Affects all regions equally.”	3 min
Correlate	”Started 9:15 PM. No deploys. No config changes. Check external deps.”	4 min
Exemplify	”Trace shows auth service calling identity provider, getting 429 Too Many Requests”	3 min
Hypothesize	”Identity provider is rate-limiting us”	1 min
Verify	”Check identity provider dashboard—confirmed, hitting rate limit”	2 min
Resolve	”Enable token caching (was disabled in recent cleanup)“	5 min

Part 4: Dashboards That Tell Stories

4.1 Dashboard Anti-Patterns

DASHBOARD ANTI-PATTERNS
═══════════════════════════════════════════════════════════════

THE WALL OF METRICS
─────────────────────────────────────────────────────────────
┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐
│    ││    ││    ││    ││    ││    ││    ││    │
└────┘└────┘└────┘└────┘└────┘└────┘└────┘└────┘
┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐
│    ││    ││    ││    ││    ││    ││    ││    │
└────┘└────┘└────┘└────┘└────┘└────┘└────┘└────┘

Problem: No hierarchy, everything equal importance
Result: Eyes glaze over, important signals missed

THE VANITY DASHBOARD
─────────────────────────────────────────────────────────────
┌────────────────────────────────────────────────────────────┐
│  REQUESTS TODAY: 1,234,567,890                             │
│  UPTIME: 99.999%                                           │
│  SERVERS: 1,234                                            │
└────────────────────────────────────────────────────────────┘

Problem: Big numbers that never change actionably
Result: Looks impressive, helps no one

4.2 Good Dashboard Structure

DASHBOARD STRUCTURE
═══════════════════════════════════════════════════════════════

LEVEL 1: EXECUTIVE SUMMARY (top)
┌─────────────────────────────────────────────────────────────┐
│  SLO Status: 99.95% (target: 99.9%) ✅                     │
│  Error Budget: 65% remaining                                │
│  Active Incidents: 0                                        │
└─────────────────────────────────────────────────────────────┘
"At a glance: Are we okay?"

LEVEL 2: THE GOLDEN SIGNALS (middle)
┌─────────────────────────────────────────────────────────────┐
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │   Latency    │  │    Errors    │  │   Traffic    │     │
│  │   p50: 45ms  │  │    0.1%      │  │   1,234/s    │     │
│  │   p99: 200ms │  │              │  │              │     │
│  └──────────────┘  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────────────────────────┘
"What's the current behavior?"

LEVEL 3: DRILL-DOWN (expandable)
┌─────────────────────────────────────────────────────────────┐
│  By Endpoint:        By Region:        By Version:          │
│  /api/checkout 50ms  US-West 45ms     v2.3.1 48ms          │
│  /api/search   30ms  EU-Central 52ms  v2.3.0 45ms          │
│  /api/user     25ms  AP-South 60ms                          │
└─────────────────────────────────────────────────────────────┘
"Where should I look if something's wrong?"

4.3 Dashboard Design Principles

Principle	Why	Example
Hierarchy	Important things first	SLO status at top, details below
Context	Show what normal looks like	Current value + historical range
Action	Link to next step	Click metric → See traces
Consistency	Same layout across services	Everyone knows where to look
Simplicity	Only what’s needed	Remove metrics nobody looks at

Try This (3 minutes)

Look at one of your dashboards:

Can you tell system health in 5 seconds?

Is there a clear hierarchy?

Do you know what to click if something’s wrong?

When did you last remove a panel?

Part 5: Building Mental Models

5.1 What is a Mental Model?

A mental model is your understanding of how the system actually works—not the documentation, but your intuition built from experience.

MENTAL MODEL COMPONENTS
═══════════════════════════════════════════════════════════════

ARCHITECTURE
    "Request comes in at load balancer, goes to API server,
     which calls auth service, then the appropriate backend."

DEPENDENCIES
    "If Redis is down, sessions break. If Postgres is down,
     nothing works. If payment API is slow, checkout is slow."

BEHAVIOR UNDER STRESS
    "Under high load, the API starts queueing requests.
     Past 1000 QPS, we see cascading timeouts."

FAILURE MODES
    "When disk fills, logs stop but service keeps running.
     When memory fills, OOM killer takes random processes."

RECENT CHANGES
    "We deployed v2.3.1 yesterday, added new caching layer
     last week, migrated database replica last month."

5.2 Building Mental Models

HOW TO BUILD MENTAL MODELS
═══════════════════════════════════════════════════════════════

1. WATCH NORMAL BEHAVIOR
   └── What do metrics look like during a normal day?
   └── What's the typical request flow?
   └── What are normal log patterns?

2. WATCH ABNORMAL BEHAVIOR
   └── What happens during incidents?
   └── What breaks first under load?
   └── What error messages appear?

3. TRACE REQUESTS END-TO-END
   └── Pick a request type, follow it through
   └── Understand every service it touches
   └── Know the critical path

4. LEARN FROM POSTMORTEMS
   └── What failed? Why didn't we catch it?
   └── What was the actual impact?
   └── What did we learn about the system?

5. EXPERIMENT (carefully)
   └── What happens if we scale down?
   └── What happens if we add latency?
   └── Chaos engineering builds models

Mental models in one person’s head don’t scale. Share them:

Method	When	Example
Runbooks	Common scenarios	”If X, do Y because Z”
Architecture diagrams	System overview	Request flow, dependencies
Postmortems	After incidents	What happened and why
Pairing	Onboarding	Debug together, share intuition
Documentation	Reference	”The system works like this…”

Did You Know?

NASA’s Mission Control uses a layered dashboard system called “Flight Director Console.” The top level shows overall mission health. Lower levels show subsystem details. This hierarchy-based design is over 60 years old but still the gold standard.
The term “runbook” comes from IBM mainframe operations in the 1960s. Operators had physical books of procedures to “run” for common scenarios. Digital runbooks serve the same purpose.
Expert intuition is real. Studies show experienced operators can often sense something is wrong before any alert fires—they’ve internalized subtle patterns. This is the goal of building mental models.
PagerDuty analyzed millions of incidents and found that the median time to acknowledgement is 3 minutes, but the median time to resolution is 30 minutes. The investigation phase—understanding what’s wrong—takes 10x longer than noticing something is wrong. Good observability reduces investigation time, not detection time.

Common Mistakes

Mistake	Problem	Solution
Alerting on everything	Alert fatigue, ignored pages	Alert on symptoms, not causes
Dashboard sprawl	Can’t find the right dashboard	Hierarchy, templates, linking
No runbooks	Every incident is new	Document common scenarios
Siloed investigation	Can’t correlate across signals	Shared trace IDs, linked tools
Over-reliance on dashboards	Miss issues dashboards don’t show	Combine with exploration
Not sharing knowledge	One expert, team bottleneck	Postmortems, pairing, docs

Quiz

Why should you alert on symptoms rather than causes?
Answer

Alerting on symptoms (error rate, latency) tells you when users are affected. You know action is needed.

Alerting on causes (CPU, memory, connections) often fires when nothing is wrong. High CPU might be fine if latency is good. Low disk might never cause issues.

Examples:
- Bad: “CPU > 80%” (might be fine, might not be)
- Good: “Error rate > 1%” (users definitely affected)
- Bad: “DB connections > 90%” (might handle it fine)
- Good: “p99 latency > 500ms” (users definitely affected)
When you alert on symptoms, every alert means user impact. When you alert on causes, you’re often paged for non-issues.
What’s the benefit of SLO-based alerting over threshold-based alerting?
Answer

Threshold-based: “Alert if error rate > X” - arbitrary number, doesn’t consider time or budget.

SLO-based: “Alert if error budget burn rate threatens monthly target” - tied to actual user impact over time.

Benefits of SLO-based:
1. Context-aware: A 1% error rate might be fine for 5 minutes but not for 5 hours
2. Budget-preserving: Alerts fire when you’ll actually miss your SLO
3. Less noise: Brief spikes don’t page if budget is healthy
4. Proportional severity: Burn rate determines urgency (1 hour vs 6 hours to exhaustion)
SLO-based alerts say “this matters” rather than “this is different from arbitrary threshold.”
What is the exploratory investigation pattern and why is it effective?
Answer

The exploratory investigation pattern is a systematic approach:
1. Quantify - How bad is it?
2. Segment - Which subset is affected?
3. Correlate - What else happened?
4. Exemplify - Show me a specific case
5. Hypothesize - I think X is the cause
6. Verify - Test the hypothesis
7. Resolve - Fix and document
It’s effective because:
- Systematic: Doesn’t skip steps or jump to conclusions
- Evidence-based: Each step produces data to inform the next
- Narrowing: Each step reduces the scope of the problem
- Verifiable: Hypotheses are tested, not assumed
Without a pattern, debugging becomes random guessing. With a pattern, you efficiently converge on the answer.
Why are mental models important for operations?
Answer

Mental models are your internalized understanding of how the system works. They’re important because:
1. Speed: You know what to check without reading documentation
2. Intuition: You can sense when something is “off” even without alerts
3. Prediction: You can anticipate what will happen (e.g., “if we lose Redis, sessions break”)
4. Context: You understand why things are designed the way they are
5. Efficiency: You skip irrelevant paths and focus on likely causes
Without mental models, every incident is like operating a new system—slow, methodical, starting from scratch. With mental models, you leverage experience to solve problems faster.

The challenge: Mental models are in people’s heads. Sharing them through runbooks, postmortems, and pairing is essential for team resilience.
An SRE team receives 150 alerts per week. 90% require no action after investigation. What is the “signal ratio” and what should they do?
Answer

Signal ratio: 15 actionable alerts ÷ 150 total alerts = 10%

This is critically low. Google recommends targeting 50% signal ratio.

Problems with 10% signal ratio:
- Alert fatigue: Engineers start ignoring alerts
- Slow response: Real issues get lost in noise
- Burnout: Constant interruption with no value
- Missed incidents: When everything alerts, nothing alerts
What they should do:
1. Audit each alert: For alerts that required no action, why did they fire?
2. Delete or fix: Remove alerts that never need action; fix thresholds on noisy alerts
3. Move to symptoms: Replace cause-based alerts (CPU > 80%) with symptom-based (error rate > 1%)
4. Add duration: “Error rate > 1% for 5 minutes” reduces flapping
5. Use SLO-based alerting: Alert on error budget burn, not arbitrary thresholds
Target: After cleanup, aim for 50-80% signal ratio. Every alert should feel important.

Your team has a dashboard with 47 panels. During an incident, the on-call engineer spends 10 minutes finding the relevant panel. What’s wrong and how would you fix it?

Answer

Problem: No hierarchy. All 47 panels treated equally.

The fix—implement dashboard hierarchy:

TOP (5 seconds to answer "are we okay?")
┌────────────────────────────────────────────┐
│ SLO Status | Error Budget | Active Alerts  │
└────────────────────────────────────────────┘

MIDDLE (30 seconds to answer "what's the behavior?")
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Latency  │ │ Errors   │ │ Traffic  │
└──────────┘ └──────────┘ └──────────┘

BOTTOM (drill-down when investigating)
┌────────────────────────────────────────────┐
│ By Endpoint | By Region | By Version | ... │
└────────────────────────────────────────────┘

Additional improvements:

Use collapsible sections for detail panels
Color-code based on status (green/yellow/red)
Add “jump to” links based on alert type
Create separate dashboards for different use cases (overview vs. deep-dive)

Result: 10 minutes → 30 seconds to find relevant information

An engineer’s hypothesis during debugging is “the database is slow.” They restart the database and the problem goes away. Did they find the root cause? What should they do differently?
Answer

No, they did not find the root cause. They found a mitigation that happens to work.

Problems with “restart and hope”:
1. Will recur: The actual cause is still there
2. No learning: Team doesn’t understand the system better
3. Masks issues: Might be masking multiple problems
4. Wastes time: Next incident will require same investigation
What they should do differently:
1. Before restarting, capture evidence:
  - Database metrics at time of issue
  - Slow query logs
  - Connection pool state
  - Recent changes (deploys, config, traffic patterns)
2. After restarting, investigate why:
  - Why was the database slow?
  - What queries were problematic?
  - What changed to cause this?
  - Will it happen again?
3. Document in postmortem:
  - Actual root cause (even if discovered later)
  - Why restart helped (and why it’s not a real fix)
  - Actions to prevent recurrence
The rule: Mitigation first (get users happy), then investigation (understand why). Mitigation without investigation is incomplete.
Calculate the cost of a 30-minute investigation delay for a service with: 100,000 requests/minute, 5% error rate during incident, $0.50 average revenue per successful request. How does this justify investment in observability tooling?
Answer

Calculation:
- Requests during 30 minutes: 100,000/min × 30 min = 3,000,000 requests
- Failed requests: 3,000,000 × 5% = 150,000 failures
- Lost revenue: 150,000 × $0.50 = $75,000 lost in 30 minutes
Justification math:

If better tooling reduces investigation time from 30 minutes to 10 minutes:
- Savings per incident: $75,000 - $25,000 = $50,000
- If you have 2 incidents/month: $100,000/month saved
- Annual savings: $1.2 million
Typical observability investment:
- Enterprise observability platform: $100,000-500,000/year
- Engineering time to implement: $200,000 (one-time)
- Total year-one cost: ~$500,000
ROI: $1.2M savings - $500K investment = $700,000 net benefit year one

Beyond revenue:
- Customer trust (hard to quantify, real impact)
- Engineering morale (fewer 3 AM scrambles)
- Regulatory compliance (faster incident response)
Observability tooling almost always pays for itself in reduced incident impact.

Key Takeaways

DATA-TO-INSIGHT ESSENTIALS CHECKLIST
═══════════════════════════════════════════════════════════════════════════════

ASKING GOOD QUESTIONS
☑ Start broad (is something wrong?) then narrow
☑ Be specific: endpoint, timeframe, metric
☑ Avoid vague questions ("why is it slow?")
☑ Follow the question hierarchy: Detection → Scope → Localization → Root Cause

THE EXPLORATORY INVESTIGATION PATTERN
☑ 1. Quantify - how bad is it?
☑ 2. Segment - who/what is affected?
☑ 3. Correlate - what else happened?
☑ 4. Exemplify - show me a specific case
☑ 5. Hypothesize - I think X is the cause
☑ 6. Verify - test the hypothesis
☑ 7. Resolve - fix and document

EFFECTIVE ALERTING
☑ Alert on symptoms (user impact), not causes (CPU/memory)
☑ Use SLO-based alerting with error budgets
☑ Target 50%+ signal ratio (actionable alerts)
☑ Every alert must have: clear action + runbook link
☑ Delete alerts that never require action

DASHBOARD DESIGN
☑ Hierarchy: Summary → Golden Signals → Drill-down
☑ Answer "are we okay?" in 5 seconds (top row)
☑ Consistency across services (same layout)
☑ Link to next step (metric → traces → logs)

BUILDING MENTAL MODELS
☑ Watch normal behavior to recognize abnormal
☑ Learn from postmortems (failure modes)
☑ Trace requests end-to-end (understand flow)
☑ Share knowledge: runbooks, docs, pairing
☑ Experiment carefully (chaos engineering)

Hands-On Exercise

Task: Design an observability workflow for a common scenario.

Scenario: You receive an alert: “p99 latency for /api/search exceeded 500ms”

Part 1: Investigation Plan (10 minutes)

Write out your investigation steps using the exploratory pattern:

Step	Question	How You’d Answer (tool/query)
Quantify	What’s the actual p99? How long has it been elevated?
Segment	All search requests or some? Which dimensions?
Correlate	What changed? Deploys? Traffic spike?
Exemplify	Get a specific slow trace
Hypothesize	What do you think is the cause?
Verify	How would you confirm?

Part 2: Alert Design (10 minutes)

Design an SLO-based alert for this scenario:

Element	Your Design
SLO target	p99 latency for /api/search ≤ ___ms
Measurement window	___ minutes
Severe alert	Burn rate = ___ (exhausts budget in ___ hours)
High alert	Burn rate = ___ (exhausts budget in ___ hours)
Runbook link	Steps to investigate (from Part 1)

Part 3: Dashboard Design (10 minutes)

Sketch a simple dashboard for search health:

Top row (Executive Summary):
┌────────────────────────────────────────────────────────────┐
│                                                            │
│  [What should be here?]                                    │
│                                                            │
└────────────────────────────────────────────────────────────┘

Middle row (Golden Signals):
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│                  │ │                  │ │                  │
│  [Panel 1]       │ │  [Panel 2]       │ │  [Panel 3]       │
│                  │ │                  │ │                  │
└──────────────────┘ └──────────────────┘ └──────────────────┘

Bottom row (Drill-down):
┌────────────────────────────────────────────────────────────┐
│                                                            │
│  [What breakdowns would be useful?]                        │
│                                                            │
└────────────────────────────────────────────────────────────┘

Success Criteria:

Investigation plan covers all steps of exploratory pattern
Alert design includes SLO target and multiple severity levels
Dashboard has clear hierarchy (summary → signals → details)
Each dashboard panel has a clear purpose

Observability Theory: What’s Next?

Congratulations! You’ve completed the Observability Theory foundation. You now understand:

What observability is and how it differs from monitoring
The three pillars: logs, metrics, traces
How to instrument systems effectively
How to turn data into insight

Where to go from here:

Your Interest	Next Track
Implementing observability	Observability Toolkit
Operating with observability	SRE Discipline
Security observability	Security Principles
Foundational concepts	Distributed Systems

Track Summary

Module	Key Takeaway
3.1	Observability lets you ask questions you didn’t predict; monitoring answers predefined questions
3.2	Logs (detail), metrics (aggregation), traces (flow)—connected by correlation IDs
3.3	Instrument boundaries and business operations; keep cardinality bounded; sample traces
3.4	Alert on symptoms not causes; debug systematically; build and share mental models

“The goal isn’t to have all the data. The goal is to understand the system.”