Skip to content

Module 3.4: From Data to Insight

Complexity: [MEDIUM]

Time to Complete: 30-35 minutes

Prerequisites: Module 3.3: Instrumentation Principles

Track: Foundations

After completing this module, you will be able to:

  1. Build alert rules that detect meaningful user-impact signals rather than generating noise from infrastructure metrics
  2. Analyze observability data to move from symptom detection to root-cause identification within a structured debugging workflow
  3. Design dashboards that answer operational questions at a glance and guide engineers toward the right next investigation step
  4. Evaluate whether an alerting strategy minimizes both false positives and time-to-detection for real incidents

The Team That Had Perfect Data But No Answers

Section titled “The Team That Had Perfect Data But No Answers”

December 2018. A Major E-commerce Platform. Black Friday Weekend.

The observability team had done everything right. World-class Prometheus deployment with 2.3 million time series. Elasticsearch cluster ingesting 4TB of logs daily. Jaeger storing 500 million spans. Beautiful Grafana dashboards covering every service. They’d spent 18 months building this observability stack.

2:14 AM, Saturday. The worst time for e-commerce—Black Friday shoppers still active, Cyber Monday promotions launching in hours.

Alert: “Checkout error rate exceeded 2%.”

The on-call engineer opens Grafana. Error rate: 2.3%. She can see that something is wrong. But the dashboard has 47 panels. Which one matters right now? She clicks through each one, looking for anomalies. Five minutes pass.

She opens the logs. Searches for “checkout” and “error.” 2.4 million results in the last hour. She adds “level:error”—still 847,000 results. She can’t narrow down further because the logs don’t have consistent field names. Different services use different conventions.

She switches to tracing. Finds a failing trace. The error is “connection timeout” to the inventory service. Great. But why? The inventory service dashboard shows… nothing wrong. Latency normal. Error rate normal.

47 minutes. That’s how long it took to find the root cause. The inventory service was fine. The problem was network congestion between the checkout service and inventory service—a firewall rule change made two weeks earlier had introduced a 50ms timeout where there should have been 500ms. Under Black Friday load, retries stacked up and cascaded.

The fix: One line of configuration. 30 seconds to apply.

The cost: 47 minutes × 2.3% error rate × Black Friday traffic = $2.8 million in abandoned carts. Plus customer support costs. Plus reputation damage.

THE DATA-TO-INSIGHT GAP
═══════════════════════════════════════════════════════════════════════════════
WHAT THEY HAD WHAT THEY LACKED
───────────────────────────────────── ─────────────────────────────────────
☑ 2.3 million metrics ✗ Dashboard hierarchy (what matters now?)
☑ 4TB daily logs ✗ Consistent field names (can't query)
☑ 500 million spans ✗ SLO-based alerting (why 2%? is that bad?)
☑ Beautiful dashboards ✗ Runbooks (what to do when this happens)
☑ All three pillars ✗ Investigation workflow (where to start)
☑ Perfect instrumentation ✗ Mental model of failure modes
DATA ≠ INSIGHT
They had all the data to find the problem in 5 minutes.
They lacked the insight to navigate that data effectively.
Cost of the gap: $2.8 million.

After the incident, they rebuilt their approach:

  • Created dashboard hierarchy (summary → signals → details)
  • Defined consistent field naming across all services
  • Implemented SLO-based alerts with error budgets
  • Wrote runbooks for common failure modes
  • Trained the team on systematic investigation

The next similar incident took 8 minutes to resolve.


You’ve instrumented your services. Logs are flowing, metrics are being scraped, traces are being collected. Now what?

Having data isn’t the same as having understanding. The real value of observability comes from turning data into insight—answering questions, detecting problems, and building mental models of system behavior.

This module teaches you how to use observability data effectively: asking the right questions, building useful alerts, debugging systematically, and developing intuition about your systems.

The Medical Analogy

A doctor doesn’t just collect test results—they interpret them. Blood pressure, heart rate, and lab results are data. A diagnosis is insight. The same data can mean different things in different contexts. Observability is similar: metrics, logs, and traces are raw data. Understanding why your system behaves the way it does is insight.


  • How to ask good questions of your observability data
  • Building effective alerts (signal vs. noise)
  • Debugging systematically with observability
  • Dashboards that tell stories
  • Building mental models of system behavior

Not all questions are equally useful. Start broad, narrow down:

THE QUESTION HIERARCHY
═══════════════════════════════════════════════════════════════
Level 1: DETECTION
─────────────────────────────────────────────────────────────
"Is something wrong?"
→ Alerts, SLO dashboards, error rates
→ Answers: Yes/No
Level 2: SCOPE
─────────────────────────────────────────────────────────────
"How bad is it? Who's affected?"
→ Break down by dimension: region, user type, endpoint
→ Answers: X% of users, Y region, Z feature
Level 3: LOCALIZATION
─────────────────────────────────────────────────────────────
"Where is the problem?"
→ Service dependency analysis, trace flamegraphs
→ Answers: Service X, database Y, function Z
Level 4: ROOT CAUSE
─────────────────────────────────────────────────────────────
"Why is this happening?"
→ Logs, code, recent changes
→ Answers: Bug in version 2.3.1, config change, external dependency
QUESTION QUALITY
═══════════════════════════════════════════════════════════════
BAD QUESTIONS (vague, hard to answer)
─────────────────────────────────────────────────────────────
"Why is the site slow?"
→ What is "slow"? For whom? Since when?
"Is everything okay?"
→ Define "okay"
"What happened yesterday?"
→ What specifically?
GOOD QUESTIONS (specific, answerable)
─────────────────────────────────────────────────────────────
"What's the p99 latency for /api/checkout in the last hour?"
→ Specific metric, specific endpoint, specific timeframe
"Which users experienced errors during the deploy at 3pm?"
→ Specific event, specific population
"What changed between 2pm (working) and 3pm (broken)?"
→ Specific comparison, specific times
EXPLORATORY INVESTIGATION
═══════════════════════════════════════════════════════════════
START: "Something is wrong" (symptom)
1. QUANTIFY
"How wrong? What's the error rate/latency/failure count?"
→ Establishes baseline for comparison
2. SEGMENT
"Is it all requests or some? Which ones?"
→ Group by: endpoint, user_type, region, version, time
→ Find: Where is the problem concentrated?
3. CORRELATE
"What else happened at the same time?"
→ Deploys, config changes, dependency issues, traffic spike
→ Find: Potential causes
4. EXEMPLIFY
"Show me a specific failing request"
→ Trace → Logs → Detailed error
→ Find: Concrete evidence
5. HYPOTHESIZE
"I think the problem is X because of Y"
→ Test: If X is the problem, what else should be true?
6. VERIFY
"Let me confirm by checking Z"
→ Additional queries to confirm/refute hypothesis
7. RESOLVE
"The problem is X, fixed by Y"
→ Document for future reference

Try This (3 minutes)

Think of a recent incident. Walk through the investigation pattern:

  • How did you quantify the problem?
  • How did you segment to narrow down?
  • What correlations did you look for?
  • Did you find a specific example to examine?

ALERTING PRINCIPLES
═══════════════════════════════════════════════════════════════
ALERT ONLY WHEN ACTION IS NEEDED
─────────────────────────────────────────────────────────────
If the alert fires and you do nothing → Remove the alert
If the alert fires and you always do the same thing → Automate it
If the alert fires and you investigate → Good alert
ALERT ON SYMPTOMS, NOT CAUSES
─────────────────────────────────────────────────────────────
Bad: "CPU > 80%" (cause - might be fine)
Good: "Error rate > 1%" (symptom - users affected)
Good: "p99 latency > 500ms" (symptom - users affected)
ALERT ON USER IMPACT
─────────────────────────────────────────────────────────────
Would a user notice? Alert.
Would only you notice? Maybe don't alert.
Error budget burn rate > threshold → Users will be affected → Alert
SLO-BASED ALERTS
═══════════════════════════════════════════════════════════════
Instead of arbitrary thresholds, alert based on error budget:
SLO: 99.9% availability (error budget: 43 minutes/month)
Alert conditions:
─────────────────────────────────────────────────────────────
SEVERE (page immediately):
Error budget burn rate that would exhaust budget in 1 hour
= 720x normal burn rate
"At this rate, we're out of budget in 1 hour"
HIGH (page during business hours):
Error budget burn rate that would exhaust budget in 6 hours
= 120x normal burn rate
"At this rate, we're out of budget in 6 hours"
MEDIUM (ticket):
Error budget burn rate that would exhaust budget in 3 days
= 10x normal burn rate
"At this rate, we're out of budget this week"
LOW (dashboard):
Any error budget consumption
"We're using budget, but within acceptable range"
ProblemCauseSolution
Too many alertsLow thresholdsRaise thresholds, use SLO-based
Alerts for non-issuesAlerting on causesAlert on symptoms instead
Flapping alertsSensitive thresholdsAdd duration requirement (5 min)
Duplicate alertsMultiple rules for same issueDeduplicate, use alert hierarchy
Unactionable alertsNo clear responseAdd runbook, or delete alert
ALERT HYGIENE CHECKLIST
═══════════════════════════════════════════════════════════════
For each alert, answer:
[ ] Does this require human action?
[ ] Is the action clear? (link to runbook)
[ ] Does this fire rarely enough to be noticed?
[ ] Is the severity appropriate?
[ ] Has this alert been useful in the last 30 days?
If you can't answer "yes" to all → Fix or remove the alert.

Did You Know?

Google’s SRE teams aim for a 50% “signal ratio”—meaning 50% of pages should be real issues requiring intervention. Below 50%, alerts are too noisy and get ignored. Above 80%, you’re probably missing issues because you’re not alerting enough.


DEBUGGING WORKFLOW
═══════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────┐
METRICS ───▶│ Alert fires: Error rate spike │
│ Dashboard: 5% errors on /api/checkout │
└──────────────────┬──────────────────────────┘
┌──────────────────▼──────────────────────────┐
METRICS ───▶│ Segment: Which errors? Where? │
│ 90% of errors from US-West region │
│ Started 14:32, correlates with deploy │
└──────────────────┬──────────────────────────┘
┌──────────────────▼──────────────────────────┐
TRACES ────▶│ Exemplar: Get example failing trace │
│ Trace abc-123: Timeout on payment-service │
│ 3 retry attempts, all timed out │
└──────────────────┬──────────────────────────┘
┌──────────────────▼──────────────────────────┐
LOGS ──────▶│ Detail: What's in the logs? │
│ "Connection refused: payment-db-west-02" │
│ Database server is unreachable │
└──────────────────┬──────────────────────────┘
┌──────────────────▼──────────────────────────┐
ROOT CAUSE ▶│ payment-db-west-02 in maintenance │
│ Failover didn't trigger correctly │
└─────────────────────────────────────────────┘

“It’s slow”

DEBUGGING LATENCY
═══════════════════════════════════════════════════════════════
1. Confirm: What's the actual latency? p50? p99?
2. Scope: All requests or some? Which endpoints?
3. Trace: Get example slow traces
4. Flamegraph: Where is time spent?
└── Network call to service X: 800ms (80% of time)
5. Drill down: Why is service X slow?
6. Repeat until root cause found

“It’s broken”

DEBUGGING ERRORS
═══════════════════════════════════════════════════════════════
1. Confirm: What's the error rate? Error types?
2. Scope: Which users? Which endpoints? Which versions?
3. Compare: What's different about failing requests?
4. Exemplify: Get specific error logs
└── Stack trace, error message, context
5. Correlate: What changed? Deploy? Config? Dependency?
6. Hypothesize and verify

“It’s weird”

DEBUGGING STRANGE BEHAVIOR
═══════════════════════════════════════════════════════════════
1. Define: What exactly is "weird"? Quantify it.
2. Baseline: What's "normal"? Compare to historical data.
3. Isolate: Find the smallest reproducing case.
4. Trace: Follow request through system.
5. Diff: What's different from normal requests?
6. Often: Race condition, caching issue, state dependency
COMPARISON QUERIES
═══════════════════════════════════════════════════════════════
"It was working yesterday, now it's broken"
Query pattern:
Compare: metric_X at time_good vs time_bad
Example:
GOOD: 2024-01-15 10:00-12:00, error_rate = 0.1%
BAD: 2024-01-16 10:00-12:00, error_rate = 5.0%
Segment by deploy_version:
GOOD: version 2.3.0, error_rate = 0.1%
BAD: version 2.3.1, error_rate = 4.9%
Conclusion: Version 2.3.1 introduced the problem.
Similar pattern for:
- Before/after deploy
- This week vs. last week
- Failing user vs. working user

War Story: The $4.2 Million Lesson in Systematic Investigation

2020. A Major Subscription Service. Sunday Evening, Peak Usage.

9:17 PM. Alert fires: “Video streaming error rate exceeded 1%.” The on-call engineer opens the dashboard. Error rate: 1.3%. He’s seen this before. “Probably just a CDN hiccup,” he thinks. He restarts the streaming service pods. Error rate drops to 0.8%. Problem solved?

No. Error rate climbs back to 1.5% within 5 minutes. Then 2.1%. Then 3.4%.

The Old Approach (what he did):

  • 9:17 PM: Alert fires. Restarts pods.
  • 9:23 PM: Error rate climbing. Restarts more pods.
  • 9:31 PM: No improvement. Escalates to senior engineer.
  • 9:45 PM: Senior tries rolling back recent deploy. No effect.
  • 10:12 PM: Team realizes problem isn’t in their code.
  • 10:34 PM: Discover third-party authentication provider is rate-limiting them.
  • 10:41 PM: Implement token caching workaround.

Time to resolution: 84 minutes. Most users couldn’t watch anything during peak Sunday evening.

The New Approach (what they trained to do after):

StepActionTime
Quantify”Error rate 1.3%, normally 0.1%. Specific error: ‘auth_token_failed‘“2 min
Segment”100% of errors are auth failures. Affects all regions equally.”3 min
Correlate”Started 9:15 PM. No deploys. No config changes. Check external deps.”4 min
Exemplify”Trace shows auth service calling identity provider, getting 429 Too Many Requests”3 min
Hypothesize”Identity provider is rate-limiting us”1 min
Verify”Check identity provider dashboard—confirmed, hitting rate limit”2 min
Resolve”Enable token caching (was disabled in recent cleanup)“5 min

Time to resolution with systematic approach: 20 minutes.

The Financial Impact:

  • 84 minutes at peak × 3.4% average error rate × 12 million active users = ~408,000 users affected
  • Estimated lost subscription renewals: $890,000
  • Compensation credits issued: $340,000
  • Customer support costs: $125,000
  • Engineering escalation (5 engineers × 2 hours): $2,500
  • Total: $1.36 million for this single incident

With systematic investigation, the impact would have been:

  • 20 minutes × 2.1% average error rate × 12 million users = ~50,000 users affected
  • Estimated cost: ~$160,000
  • Savings: $1.2 million from faster investigation

The team ran the numbers. They’d had 47 similar incidents in the past year. If systematic investigation saved even 20 minutes on average, that was $4.2 million annually.

They built runbooks. They trained on the exploratory pattern. They created investigation checklists. The pattern works. Use it.


DASHBOARD ANTI-PATTERNS
═══════════════════════════════════════════════════════════════
THE WALL OF METRICS
─────────────────────────────────────────────────────────────
┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐
│ ││ ││ ││ ││ ││ ││ ││ │
└────┘└────┘└────┘└────┘└────┘└────┘└────┘└────┘
┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐
│ ││ ││ ││ ││ ││ ││ ││ │
└────┘└────┘└────┘└────┘└────┘└────┘└────┘└────┘
Problem: No hierarchy, everything equal importance
Result: Eyes glaze over, important signals missed
THE VANITY DASHBOARD
─────────────────────────────────────────────────────────────
┌────────────────────────────────────────────────────────────┐
│ REQUESTS TODAY: 1,234,567,890 │
│ UPTIME: 99.999% │
│ SERVERS: 1,234 │
└────────────────────────────────────────────────────────────┘
Problem: Big numbers that never change actionably
Result: Looks impressive, helps no one
DASHBOARD STRUCTURE
═══════════════════════════════════════════════════════════════
LEVEL 1: EXECUTIVE SUMMARY (top)
┌─────────────────────────────────────────────────────────────┐
│ SLO Status: 99.95% (target: 99.9%) ✅ │
│ Error Budget: 65% remaining │
│ Active Incidents: 0 │
└─────────────────────────────────────────────────────────────┘
"At a glance: Are we okay?"
LEVEL 2: THE GOLDEN SIGNALS (middle)
┌─────────────────────────────────────────────────────────────┐
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Latency │ │ Errors │ │ Traffic │ │
│ │ p50: 45ms │ │ 0.1% │ │ 1,234/s │ │
│ │ p99: 200ms │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
"What's the current behavior?"
LEVEL 3: DRILL-DOWN (expandable)
┌─────────────────────────────────────────────────────────────┐
│ By Endpoint: By Region: By Version: │
│ /api/checkout 50ms US-West 45ms v2.3.1 48ms │
│ /api/search 30ms EU-Central 52ms v2.3.0 45ms │
│ /api/user 25ms AP-South 60ms │
└─────────────────────────────────────────────────────────────┘
"Where should I look if something's wrong?"
PrincipleWhyExample
HierarchyImportant things firstSLO status at top, details below
ContextShow what normal looks likeCurrent value + historical range
ActionLink to next stepClick metric → See traces
ConsistencySame layout across servicesEveryone knows where to look
SimplicityOnly what’s neededRemove metrics nobody looks at

Try This (3 minutes)

Look at one of your dashboards:

  • Can you tell system health in 5 seconds?
  • Is there a clear hierarchy?
  • Do you know what to click if something’s wrong?
  • When did you last remove a panel?

A mental model is your understanding of how the system actually works—not the documentation, but your intuition built from experience.

MENTAL MODEL COMPONENTS
═══════════════════════════════════════════════════════════════
ARCHITECTURE
"Request comes in at load balancer, goes to API server,
which calls auth service, then the appropriate backend."
DEPENDENCIES
"If Redis is down, sessions break. If Postgres is down,
nothing works. If payment API is slow, checkout is slow."
BEHAVIOR UNDER STRESS
"Under high load, the API starts queueing requests.
Past 1000 QPS, we see cascading timeouts."
FAILURE MODES
"When disk fills, logs stop but service keeps running.
When memory fills, OOM killer takes random processes."
RECENT CHANGES
"We deployed v2.3.1 yesterday, added new caching layer
last week, migrated database replica last month."
HOW TO BUILD MENTAL MODELS
═══════════════════════════════════════════════════════════════
1. WATCH NORMAL BEHAVIOR
└── What do metrics look like during a normal day?
└── What's the typical request flow?
└── What are normal log patterns?
2. WATCH ABNORMAL BEHAVIOR
└── What happens during incidents?
└── What breaks first under load?
└── What error messages appear?
3. TRACE REQUESTS END-TO-END
└── Pick a request type, follow it through
└── Understand every service it touches
└── Know the critical path
4. LEARN FROM POSTMORTEMS
└── What failed? Why didn't we catch it?
└── What was the actual impact?
└── What did we learn about the system?
5. EXPERIMENT (carefully)
└── What happens if we scale down?
└── What happens if we add latency?
└── Chaos engineering builds models

Mental models in one person’s head don’t scale. Share them:

MethodWhenExample
RunbooksCommon scenarios”If X, do Y because Z”
Architecture diagramsSystem overviewRequest flow, dependencies
PostmortemsAfter incidentsWhat happened and why
PairingOnboardingDebug together, share intuition
DocumentationReference”The system works like this…”

  • NASA’s Mission Control uses a layered dashboard system called “Flight Director Console.” The top level shows overall mission health. Lower levels show subsystem details. This hierarchy-based design is over 60 years old but still the gold standard.

  • The term “runbook” comes from IBM mainframe operations in the 1960s. Operators had physical books of procedures to “run” for common scenarios. Digital runbooks serve the same purpose.

  • Expert intuition is real. Studies show experienced operators can often sense something is wrong before any alert fires—they’ve internalized subtle patterns. This is the goal of building mental models.

  • PagerDuty analyzed millions of incidents and found that the median time to acknowledgement is 3 minutes, but the median time to resolution is 30 minutes. The investigation phase—understanding what’s wrong—takes 10x longer than noticing something is wrong. Good observability reduces investigation time, not detection time.


MistakeProblemSolution
Alerting on everythingAlert fatigue, ignored pagesAlert on symptoms, not causes
Dashboard sprawlCan’t find the right dashboardHierarchy, templates, linking
No runbooksEvery incident is newDocument common scenarios
Siloed investigationCan’t correlate across signalsShared trace IDs, linked tools
Over-reliance on dashboardsMiss issues dashboards don’t showCombine with exploration
Not sharing knowledgeOne expert, team bottleneckPostmortems, pairing, docs

  1. Why should you alert on symptoms rather than causes?

    Answer

    Alerting on symptoms (error rate, latency) tells you when users are affected. You know action is needed.

    Alerting on causes (CPU, memory, connections) often fires when nothing is wrong. High CPU might be fine if latency is good. Low disk might never cause issues.

    Examples:

    • Bad: “CPU > 80%” (might be fine, might not be)
    • Good: “Error rate > 1%” (users definitely affected)
    • Bad: “DB connections > 90%” (might handle it fine)
    • Good: “p99 latency > 500ms” (users definitely affected)

    When you alert on symptoms, every alert means user impact. When you alert on causes, you’re often paged for non-issues.

  2. What’s the benefit of SLO-based alerting over threshold-based alerting?

    Answer

    Threshold-based: “Alert if error rate > X” - arbitrary number, doesn’t consider time or budget.

    SLO-based: “Alert if error budget burn rate threatens monthly target” - tied to actual user impact over time.

    Benefits of SLO-based:

    1. Context-aware: A 1% error rate might be fine for 5 minutes but not for 5 hours
    2. Budget-preserving: Alerts fire when you’ll actually miss your SLO
    3. Less noise: Brief spikes don’t page if budget is healthy
    4. Proportional severity: Burn rate determines urgency (1 hour vs 6 hours to exhaustion)

    SLO-based alerts say “this matters” rather than “this is different from arbitrary threshold.”

  3. What is the exploratory investigation pattern and why is it effective?

    Answer

    The exploratory investigation pattern is a systematic approach:

    1. Quantify - How bad is it?
    2. Segment - Which subset is affected?
    3. Correlate - What else happened?
    4. Exemplify - Show me a specific case
    5. Hypothesize - I think X is the cause
    6. Verify - Test the hypothesis
    7. Resolve - Fix and document

    It’s effective because:

    • Systematic: Doesn’t skip steps or jump to conclusions
    • Evidence-based: Each step produces data to inform the next
    • Narrowing: Each step reduces the scope of the problem
    • Verifiable: Hypotheses are tested, not assumed

    Without a pattern, debugging becomes random guessing. With a pattern, you efficiently converge on the answer.

  4. Why are mental models important for operations?

    Answer

    Mental models are your internalized understanding of how the system works. They’re important because:

    1. Speed: You know what to check without reading documentation
    2. Intuition: You can sense when something is “off” even without alerts
    3. Prediction: You can anticipate what will happen (e.g., “if we lose Redis, sessions break”)
    4. Context: You understand why things are designed the way they are
    5. Efficiency: You skip irrelevant paths and focus on likely causes

    Without mental models, every incident is like operating a new system—slow, methodical, starting from scratch. With mental models, you leverage experience to solve problems faster.

    The challenge: Mental models are in people’s heads. Sharing them through runbooks, postmortems, and pairing is essential for team resilience.

  5. An SRE team receives 150 alerts per week. 90% require no action after investigation. What is the “signal ratio” and what should they do?

    Answer

    Signal ratio: 15 actionable alerts ÷ 150 total alerts = 10%

    This is critically low. Google recommends targeting 50% signal ratio.

    Problems with 10% signal ratio:

    • Alert fatigue: Engineers start ignoring alerts
    • Slow response: Real issues get lost in noise
    • Burnout: Constant interruption with no value
    • Missed incidents: When everything alerts, nothing alerts

    What they should do:

    1. Audit each alert: For alerts that required no action, why did they fire?
    2. Delete or fix: Remove alerts that never need action; fix thresholds on noisy alerts
    3. Move to symptoms: Replace cause-based alerts (CPU > 80%) with symptom-based (error rate > 1%)
    4. Add duration: “Error rate > 1% for 5 minutes” reduces flapping
    5. Use SLO-based alerting: Alert on error budget burn, not arbitrary thresholds

    Target: After cleanup, aim for 50-80% signal ratio. Every alert should feel important.

  6. Your team has a dashboard with 47 panels. During an incident, the on-call engineer spends 10 minutes finding the relevant panel. What’s wrong and how would you fix it?

    Answer

    Problem: No hierarchy. All 47 panels treated equally.

    The fix—implement dashboard hierarchy:

    TOP (5 seconds to answer "are we okay?")
    ┌────────────────────────────────────────────┐
    │ SLO Status | Error Budget | Active Alerts │
    └────────────────────────────────────────────┘
    MIDDLE (30 seconds to answer "what's the behavior?")
    ┌──────────┐ ┌──────────┐ ┌──────────┐
    │ Latency │ │ Errors │ │ Traffic │
    └──────────┘ └──────────┘ └──────────┘
    BOTTOM (drill-down when investigating)
    ┌────────────────────────────────────────────┐
    │ By Endpoint | By Region | By Version | ... │
    └────────────────────────────────────────────┘

    Additional improvements:

    • Use collapsible sections for detail panels
    • Color-code based on status (green/yellow/red)
    • Add “jump to” links based on alert type
    • Create separate dashboards for different use cases (overview vs. deep-dive)

    Result: 10 minutes → 30 seconds to find relevant information

  7. An engineer’s hypothesis during debugging is “the database is slow.” They restart the database and the problem goes away. Did they find the root cause? What should they do differently?

    Answer

    No, they did not find the root cause. They found a mitigation that happens to work.

    Problems with “restart and hope”:

    1. Will recur: The actual cause is still there
    2. No learning: Team doesn’t understand the system better
    3. Masks issues: Might be masking multiple problems
    4. Wastes time: Next incident will require same investigation

    What they should do differently:

    1. Before restarting, capture evidence:

      • Database metrics at time of issue
      • Slow query logs
      • Connection pool state
      • Recent changes (deploys, config, traffic patterns)
    2. After restarting, investigate why:

      • Why was the database slow?
      • What queries were problematic?
      • What changed to cause this?
      • Will it happen again?
    3. Document in postmortem:

      • Actual root cause (even if discovered later)
      • Why restart helped (and why it’s not a real fix)
      • Actions to prevent recurrence

    The rule: Mitigation first (get users happy), then investigation (understand why). Mitigation without investigation is incomplete.

  8. Calculate the cost of a 30-minute investigation delay for a service with: 100,000 requests/minute, 5% error rate during incident, $0.50 average revenue per successful request. How does this justify investment in observability tooling?

    Answer

    Calculation:

    • Requests during 30 minutes: 100,000/min × 30 min = 3,000,000 requests
    • Failed requests: 3,000,000 × 5% = 150,000 failures
    • Lost revenue: 150,000 × $0.50 = $75,000 lost in 30 minutes

    Justification math:

    If better tooling reduces investigation time from 30 minutes to 10 minutes:

    • Savings per incident: $75,000 - $25,000 = $50,000
    • If you have 2 incidents/month: $100,000/month saved
    • Annual savings: $1.2 million

    Typical observability investment:

    • Enterprise observability platform: $100,000-500,000/year
    • Engineering time to implement: $200,000 (one-time)
    • Total year-one cost: ~$500,000

    ROI: $1.2M savings - $500K investment = $700,000 net benefit year one

    Beyond revenue:

    • Customer trust (hard to quantify, real impact)
    • Engineering morale (fewer 3 AM scrambles)
    • Regulatory compliance (faster incident response)

    Observability tooling almost always pays for itself in reduced incident impact.


DATA-TO-INSIGHT ESSENTIALS CHECKLIST
═══════════════════════════════════════════════════════════════════════════════
ASKING GOOD QUESTIONS
☑ Start broad (is something wrong?) then narrow
☑ Be specific: endpoint, timeframe, metric
☑ Avoid vague questions ("why is it slow?")
☑ Follow the question hierarchy: Detection → Scope → Localization → Root Cause
THE EXPLORATORY INVESTIGATION PATTERN
☑ 1. Quantify - how bad is it?
☑ 2. Segment - who/what is affected?
☑ 3. Correlate - what else happened?
☑ 4. Exemplify - show me a specific case
☑ 5. Hypothesize - I think X is the cause
☑ 6. Verify - test the hypothesis
☑ 7. Resolve - fix and document
EFFECTIVE ALERTING
☑ Alert on symptoms (user impact), not causes (CPU/memory)
☑ Use SLO-based alerting with error budgets
☑ Target 50%+ signal ratio (actionable alerts)
☑ Every alert must have: clear action + runbook link
☑ Delete alerts that never require action
DASHBOARD DESIGN
☑ Hierarchy: Summary → Golden Signals → Drill-down
☑ Answer "are we okay?" in 5 seconds (top row)
☑ Consistency across services (same layout)
☑ Link to next step (metric → traces → logs)
BUILDING MENTAL MODELS
☑ Watch normal behavior to recognize abnormal
☑ Learn from postmortems (failure modes)
☑ Trace requests end-to-end (understand flow)
☑ Share knowledge: runbooks, docs, pairing
☑ Experiment carefully (chaos engineering)

Task: Design an observability workflow for a common scenario.

Scenario: You receive an alert: “p99 latency for /api/search exceeded 500ms”

Part 1: Investigation Plan (10 minutes)

Write out your investigation steps using the exploratory pattern:

StepQuestionHow You’d Answer (tool/query)
QuantifyWhat’s the actual p99? How long has it been elevated?
SegmentAll search requests or some? Which dimensions?
CorrelateWhat changed? Deploys? Traffic spike?
ExemplifyGet a specific slow trace
HypothesizeWhat do you think is the cause?
VerifyHow would you confirm?

Part 2: Alert Design (10 minutes)

Design an SLO-based alert for this scenario:

ElementYour Design
SLO targetp99 latency for /api/search ≤ ___ms
Measurement window___ minutes
Severe alertBurn rate = ___ (exhausts budget in ___ hours)
High alertBurn rate = ___ (exhausts budget in ___ hours)
Runbook linkSteps to investigate (from Part 1)

Part 3: Dashboard Design (10 minutes)

Sketch a simple dashboard for search health:

Top row (Executive Summary):
┌────────────────────────────────────────────────────────────┐
│ │
│ [What should be here?] │
│ │
└────────────────────────────────────────────────────────────┘
Middle row (Golden Signals):
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ │ │ │ │ │
│ [Panel 1] │ │ [Panel 2] │ │ [Panel 3] │
│ │ │ │ │ │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Bottom row (Drill-down):
┌────────────────────────────────────────────────────────────┐
│ │
│ [What breakdowns would be useful?] │
│ │
└────────────────────────────────────────────────────────────┘

Success Criteria:

  • Investigation plan covers all steps of exploratory pattern
  • Alert design includes SLO target and multiple severity levels
  • Dashboard has clear hierarchy (summary → signals → details)
  • Each dashboard panel has a clear purpose

  • “Practical Monitoring” - Mike Julian. Excellent coverage of alerting philosophy and dashboard design.

  • “The Art of Monitoring” - James Turnbull. Comprehensive guide to building monitoring systems.

  • “Thinking in Systems” - Donella Meadows. Foundational book on building mental models of complex systems.


Congratulations! You’ve completed the Observability Theory foundation. You now understand:

  • What observability is and how it differs from monitoring
  • The three pillars: logs, metrics, traces
  • How to instrument systems effectively
  • How to turn data into insight

Where to go from here:

Your InterestNext Track
Implementing observabilityObservability Toolkit
Operating with observabilitySRE Discipline
Security observabilitySecurity Principles
Foundational conceptsDistributed Systems

ModuleKey Takeaway
3.1Observability lets you ask questions you didn’t predict; monitoring answers predefined questions
3.2Logs (detail), metrics (aggregation), traces (flow)—connected by correlation IDs
3.3Instrument boundaries and business operations; keep cardinality bounded; sample traces
3.4Alert on symptoms not causes; debug systematically; build and share mental models

“The goal isn’t to have all the data. The goal is to understand the system.”