Module 3.1: What is Observability?
Complexity:
[MEDIUM]Time to Complete: 30-35 minutes
Prerequisites: Reliability Engineering Track (recommended)
Track: Foundations
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Explain why observability differs fundamentally from monitoring and when each approach is appropriate
- Evaluate a system’s observability maturity by assessing whether engineers can answer novel questions about system behavior without deploying new code
- Design an observability strategy that enables debugging unknown-unknowns rather than only alerting on predefined failure conditions
- Compare observability tooling approaches (metrics-first, logs-first, traces-first) and justify the right starting point for a given architecture
The Dashboard That Showed Green While the Company Lost Millions
Section titled “The Dashboard That Showed Green While the Company Lost Millions”March 2017. Amazon Web Services. 9:37 AM Pacific Time.
The senior engineer’s dashboard shows nothing wrong. CPU utilization: normal. Memory: normal. Error rate: 0.02%. Network: stable. All the lines are green. Every metric within threshold.
But the phone won’t stop ringing.
“The S3 console won’t load.” “Our static assets are 404ing.” “Entire us-east-1 seems broken.”
The engineer stares at the dashboard. It’s lying to him. Everything says “fine” while half the internet is on fire.
Here’s what happened: An engineer ran an automation script to remove a small number of S3 servers. A typo caused far more servers to be removed than intended. The billing subsystem—dependent on those servers—started failing. S3’s index subsystem couldn’t query billing. S3 couldn’t serve any objects.
Thousands of websites went dark. Major platforms like Slack, Quora, and Trello became unavailable. The outage lasted 4 hours.
Cost: Estimated $150-160 million in losses across affected businesses.
The dashboard problem: All the metrics were designed to answer “Is this specific thing okay?” None could answer “Why are customers screaming while our graphs show green?”
This is the difference between monitoring and observability.
Monitoring asks: “Is X within threshold?” Observability asks: “Why is the system behaving this way?”
The S3 team had world-class monitoring. Every server reported health. Every metric was collected. But they couldn’t see that the relationship between systems was broken. They could see the trees were green; they couldn’t see the forest was on fire.
This incident changed how AWS thinks about observability. They invested heavily in distributed tracing, request correlation, and the ability to ask questions they hadn’t anticipated needing to ask.
Why This Module Matters
Section titled “Why This Module Matters”It’s 3 AM. The on-call engineer’s phone buzzes: “High latency detected.” The dashboard opens. Everything looks… fine. CPU is normal. Memory is normal. Error rate is low. But users are complaining. Something is wrong, and nobody can see what.
This is the gap between monitoring and observability. Monitoring tells you when predefined things go wrong. Observability lets you understand why your system is behaving the way it is—even when you didn’t predict the failure mode in advance.
THE MONITORING TRAP═══════════════════════════════════════════════════════════════════════════════
3:00 AM - Dashboard check
┌─────────────────────────────────────────────────────────────────────────────┐│ PRODUCTION DASHBOARD ││ ││ CPU: ████████░░ 78% ✓ Memory: ██████░░░░ 60% ✓ ││ Errors: 0.12% ✓ Latency: 145ms ✓ ││ Requests/s: 12,456 ✓ Database: Connected ✓ ││ ││ ✅ ALL SYSTEMS OPERATIONAL ││ │└─────────────────────────────────────────────────────────────────────────────┘
3:05 AM - Slack channel
Support: "Users reporting checkout failures" Support: "12 tickets in the last 5 minutes" Support: "All from premium users?"
Engineer: "Dashboard shows everything green..." Engineer: "Let me check logs..." Engineer: "3.2 million log lines in the last hour" Engineer: "Can't search by user ID" Engineer: "Can't correlate across services" Engineer: "I have no idea what's happening"
The dashboard answered every question it was designed to answer.It couldn't answer the question that mattered.In complex distributed systems, you can’t anticipate every failure. You need systems that let you ask new questions without deploying new code.
The Car Dashboard Analogy
A car dashboard is monitoring: it shows predefined metrics (speed, fuel, temperature). But when something weird happens—a strange noise, intermittent vibration—the dashboard doesn’t help. A mechanic with diagnostic tools has observability: they can probe the system, trace connections, and discover what’s wrong without knowing in advance what to look for.
What You’ll Learn
Section titled “What You’ll Learn”- The difference between monitoring and observability
- Where observability came from (control theory)
- The observability equation: can you understand internal state from external outputs?
- Why traditional monitoring fails for distributed systems
- The mindset shift required for observability
Part 1: Monitoring vs. Observability
Section titled “Part 1: Monitoring vs. Observability”1.1 What is Monitoring?
Section titled “1.1 What is Monitoring?”Monitoring is collecting predefined metrics and alerting when they cross thresholds.
TRADITIONAL MONITORING═══════════════════════════════════════════════════════════════
You define in advance:- What to measure (CPU, memory, error rate)- What's normal (CPU < 80%)- When to alert (CPU > 80% for 5 minutes)
┌─────────────────────────────────────────────────────────────┐│ Dashboard ││ ││ CPU: ████████░░ 78% Memory: ██████░░░░ 60% ││ ││ Errors: 0.1% Requests: 1,234/s ││ ││ All systems normal ✅ ││ │└─────────────────────────────────────────────────────────────┘
Monitoring answers: "Are the things I decided to watch okay?"Monitoring works when:
- You know what can go wrong
- Failures match known patterns
- Systems are relatively simple
1.2 What is Observability?
Section titled “1.2 What is Observability?”Observability is the ability to understand a system’s internal state by examining its outputs—without knowing in advance what you’re looking for.
OBSERVABILITY═══════════════════════════════════════════════════════════════
You emit rich telemetry (logs, metrics, traces) that lets youask arbitrary questions after the fact.
Question: "Why are 5% of requests slow?" │ ▼Drill down by: user, endpoint, region, time... │ ▼Find: Requests from EU to US-East database are slow │ ▼Further: Only for users with >1000 items │ ▼Root cause: N+1 query pattern, worse with more items
Observability answers: "Why is the system behaving this way?"1.3 The Key Difference
Section titled “1.3 The Key Difference”| Aspect | Monitoring | Observability |
|---|---|---|
| Questions | Predefined | Ad-hoc, exploratory |
| Approach | ”Is X okay?" | "Why is this happening?” |
| Failure modes | Known in advance | Discovered during investigation |
| Data | Aggregated metrics | High-cardinality, detailed |
| Investigation | Dashboard → Runbook | Explore → Hypothesize → Verify |
MONITORING vs OBSERVABILITY WORKFLOW═══════════════════════════════════════════════════════════════
MONITORING WORKFLOW (known unknowns)────────────────────────────────────────────────────────────Alert: "Error rate > 1%" ↓Check runbook: "If error rate high, check database" ↓Dashboard: Database looks fine ↓??? (stuck)
OBSERVABILITY WORKFLOW (unknown unknowns)────────────────────────────────────────────────────────────Notice: "Latency increased for some requests" ↓Explore: Which requests? Filter by endpoint, user, region ↓Find: /api/v2/search requests from mobile clients ↓Drill: What's different about these? Trace shows slow cache ↓Correlate: Cache server in that region had memory pressure ↓Discovered failure mode you never anticipatedDid You Know?
The term “observability” comes from control theory, coined by Rudolf Kálmán in 1960. In control theory, a system is “observable” if you can determine its complete internal state from its outputs. Software adopted this concept because modern distributed systems are too complex to monitor every internal state directly.
Part 2: Why Monitoring Isn’t Enough
Section titled “Part 2: Why Monitoring Isn’t Enough”2.1 The Cardinality Problem
Section titled “2.1 The Cardinality Problem”Traditional monitoring aggregates data to reduce storage. But aggregation hides details.
THE CARDINALITY PROBLEM═══════════════════════════════════════════════════════════════
You have 1 million requests. Monitoring shows:- Average latency: 100ms ✅- p99 latency: 500ms ✅- Error rate: 0.5% ✅
Everything looks fine! But 5,000 users had terrible experience.
What monitoring CAN'T tell you:- Which users?- Which endpoints?- What did those requests have in common?- Why were they different?
High-cardinality dimensions you need:- user_id (millions of values)- request_id (billions of values)- trace_id (billions of values)- endpoint + parameters- geographic region- device type- feature flags enabled
Traditional metrics can't handle this. You need observability.2.2 The Unknown Unknowns
Section titled “2.2 The Unknown Unknowns”You can only monitor what you anticipate. But complex systems fail in unexpected ways.
KNOWN vs UNKNOWN FAILURES═══════════════════════════════════════════════════════════════
KNOWN KNOWNS (Monitoring handles well) ├── CPU exhaustion ├── Memory exhaustion ├── Disk full └── Process crash
KNOWN UNKNOWNS (Monitoring struggles) ├── "Slow sometimes" ├── "Works for most users" └── "Fails under specific conditions"
UNKNOWN UNKNOWNS (Monitoring fails) ├── Novel failure combinations ├── Emergent behavior ├── Race conditions └── "That's never happened before"
Observability lets you investigate unknown unknowns becauseyou can ask questions you didn't think to ask in advance.2.3 Distributed System Complexity
Section titled “2.3 Distributed System Complexity”Monitoring was designed for monoliths. Distributed systems need more:
MONOLITH vs DISTRIBUTED═══════════════════════════════════════════════════════════════
MONOLITH┌─────────────────────────────────────────────────────────────┐│ ││ Request ──▶ [Application] ──▶ [Database] ──▶ Response ││ ││ - One process to monitor ││ - Logs in one place ││ - Stack trace shows full path ││ │└─────────────────────────────────────────────────────────────┘
DISTRIBUTED SYSTEM┌─────────────────────────────────────────────────────────────┐│ ││ Request ──▶ [Gateway] ──▶ [Service A] ──▶ [Service B] ││ │ │ │ ││ ▼ ▼ ▼ ││ [Cache] [Queue] [Database] ││ │ ││ ▼ ││ [Service C] ││ ││ - Multiple processes, multiple machines ││ - Logs scattered everywhere ││ - No single stack trace ││ - Need to correlate across services ││ │└─────────────────────────────────────────────────────────────┘
Distributed systems need distributed observability.Try This (2 minutes)
Think about a recent debugging session. Did you:
- Follow a runbook? (Monitoring mindset)
- Explore and ask questions? (Observability mindset)
- Get stuck because you couldn’t see enough? (Gap between the two)
Part 3: The Observability Equation
Section titled “Part 3: The Observability Equation”3.1 Control Theory Origins
Section titled “3.1 Control Theory Origins”In control theory, observability is a mathematical property: can you determine the internal state of a system from its external outputs?
OBSERVABILITY IN CONTROL THEORY═══════════════════════════════════════════════════════════════
┌─────────────────────┐Input ───▶│ SYSTEM │───▶ Output │ │ │ Internal State ? │ └─────────────────────┘
Question: Given the outputs, can we know the internal state?
OBSERVABLE: Yes, outputs tell us everything we needNOT OBSERVABLE: Internal state is hidden from outputs
Example - Observable: A car's speedometer output tells you internal velocity state
Example - Not Observable: A black box that outputs same value regardless of input3.2 Software Observability
Section titled “3.2 Software Observability”Applied to software, observability means: can you understand why your system is behaving the way it is, just by examining telemetry?
SOFTWARE OBSERVABILITY═══════════════════════════════════════════════════════════════
Highly Observable System:┌─────────────────────────────────────────────────────────────┐│ Structured logs with context (user, request, trace) ││ Metrics with high-cardinality dimensions ││ Distributed traces showing request flow ││ Events capturing state changes │└─────────────────────────────────────────────────────────────┘ │ ▼Can answer: "Why did user X's request fail at 3:42 PM?"Can answer: "Why are requests from region Y slow?"Can answer: "What changed that caused this behavior?"
Poorly Observable System:┌─────────────────────────────────────────────────────────────┐│ Unstructured logs: "Error occurred" ││ Aggregate metrics: "Average latency: 100ms" ││ No tracing ││ No correlation between data sources │└─────────────────────────────────────────────────────────────┘ │ ▼Can answer: "Is the average latency okay?" (Yes)Cannot answer: "Why are some requests slow?" (???)3.3 Properties of Observable Systems
Section titled “3.3 Properties of Observable Systems”| Property | What It Means | Example |
|---|---|---|
| High cardinality | Many unique dimension values | user_id, not just “users” |
| High dimensionality | Many dimensions to slice by | user, endpoint, region, version, feature_flag |
| Correlation | Can connect data across sources | Trace ID links logs, metrics, traces |
| Context preservation | Details not aggregated away | Full request details, not just averages |
| Queryability | Can ask arbitrary questions | ”Show me requests where X AND Y AND Z” |
Try This (2 minutes)
Pick a recent production issue. Could you have answered these questions with your current tools?
Question Could You Answer It? ”Show me all requests from user X in the last hour” Yes / No / Partially ”What do the slowest 1% of requests have in common?” Yes / No / Partially ”Which version of the code handled this request?” Yes / No / Partially ”What other requests were happening at the same time?” Yes / No / Partially Each “No” reveals an observability gap.
Part 4: The Observability Mindset
Section titled “Part 4: The Observability Mindset”4.1 From “Know What’s Wrong” to “Understand Behavior”
Section titled “4.1 From “Know What’s Wrong” to “Understand Behavior””MINDSET SHIFT═══════════════════════════════════════════════════════════════
MONITORING MINDSET─────────────────────────────────────────────────────────────"I will define what 'wrong' means in advance" ↓Create alerts for known bad states ↓When alert fires, follow runbook ↓Problem: What if failure doesn't match any alert?
OBSERVABILITY MINDSET─────────────────────────────────────────────────────────────"I will emit rich telemetry about system behavior" ↓When something seems off, explore the data ↓Ask questions, form hypotheses, verify ↓Discover failure modes I didn't anticipate4.2 Exploration Over Dashboards
Section titled “4.2 Exploration Over Dashboards”DASHBOARD vs EXPLORATION═══════════════════════════════════════════════════════════════
DASHBOARD (monitoring)┌─────────────────────────────────────────────────────────────┐│ Fixed views of predefined metrics ││ Good for: known important signals ││ Bad for: investigating new problems ││ ││ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ││ │ CPU │ │ Memory │ │ Errors │ │ QPS │ ││ │ 78% │ │ 60% │ │ 0.1% │ │ 1234 │ ││ └────────┘ └────────┘ └────────┘ └────────┘ ││ ││ If these don't show the problem, you're stuck. │└─────────────────────────────────────────────────────────────┘
EXPLORATION (observability)┌─────────────────────────────────────────────────────────────┐│ Query interface for ad-hoc investigation ││ Good for: discovering unknown issues ││ ││ > show requests where latency > 500ms ││ → 5,234 requests (2.1%) ││ ││ > group by endpoint ││ → /api/search: 4,891 (94%) ││ ││ > filter endpoint=/api/search, group by user_tier ││ → premium: 12, free: 4,879 ││ ││ > Hypothesis: Free tier hitting rate limits? ││ │└─────────────────────────────────────────────────────────────┘4.3 Questions Observability Enables
Section titled “4.3 Questions Observability Enables”With good observability, you can ask:
- “Why is this specific request slow?” - Not averages, this one request
- “What do failing requests have in common?” - Pattern discovery
- “What changed?” - Correlation with deploys, config, external factors
- “Is this new?” - Historical comparison
- “Who is affected?” - Impact scoping
- “What else is affected?” - Blast radius discovery
War Story: The 5% Mystery That Cost Millions
2019. A Major E-commerce Platform. Black Friday Weekend.
The site reliability team was confident. Dashboards showed average latency at 180ms—well within SLO. Error rate sat at 0.3%—excellent. But customer support tickets kept flooding in: “Checkout won’t complete.” “Payment page hangs forever.” “Your site is unusable.”
The team dismissed it as user perception. The numbers looked great. Leadership started questioning if support was exaggerating.
Then a product manager showed up with data: abandoned cart rate had spiked 340%. Customers were leaving without buying. Revenue was hemorrhaging.
Day 1: Engineers added high-cardinality observability. Within 2 hours, they discovered 5.2% of checkout requests took over 8 seconds—but only for users matching a specific pattern.
Day 2: They drilled down. Affected users had: (1) accounts older than 2 years, (2) Safari browser, (3) connecting from US East Coast.
Day 3: Root cause found. A feature flag enabled for “loyal customers” (accounts >2 years) triggered a new recommendation engine. That engine made a synchronous call to a third-party API. Safari’s stricter timeouts exposed latency that Chrome masked. The API server was in US West, adding 40ms RTT for East Coast users.
Financial Impact:
- Lost revenue during Black Friday: $2.3M
- Customer churn from frustrated loyal customers: estimated $8M annually
- Fix took 20 minutes once they found it (disable feature flag)
The Lesson: Their monitoring was technically excellent. Average latency? Perfect. p99? Good. Error rate? Great. But averages hid the pain of their most valuable customers. High-cardinality observability revealed what aggregate metrics couldn’t see.
Part 5: Building Toward Observability
Section titled “Part 5: Building Toward Observability”5.1 The Three Pillars (Preview)
Section titled “5.1 The Three Pillars (Preview)”Observability is built on three complementary data types:
THE THREE PILLARS OF OBSERVABILITY═══════════════════════════════════════════════════════════════
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐│ LOGS │ │ METRICS │ │ TRACES ││ │ │ │ │ ││ What happened │ │ How much/many │ │ Request flow ││ (events) │ │ (aggregates) │ │ (journey) ││ │ │ │ │ ││ High detail │ │ Low cost │ │ Shows path ││ High cost │ │ Low detail │ │ Shows timing │└─────────────────┘ └─────────────────┘ └─────────────────┘ │ │ │ └───────────────────┼────────────────────┘ │ Correlation via trace_id / request_idWe’ll explore each pillar in detail in Module 3.2.
5.2 Starting the Journey
Section titled “5.2 Starting the Journey”You don’t need to build everything at once. Start with:
- Structured logging - Add context to logs (user_id, request_id)
- Request IDs - Generate unique IDs that flow through all services
- Meaningful metrics - Beyond CPU/memory, measure what matters to users
- Correlation - Link logs, metrics, and traces with shared IDs
Try This (3 minutes)
Evaluate your current system:
Question Yes/No Can you trace a single request through all services? Can you filter logs by user_id? Can you see which users are affected by an issue? Can you ask questions you didn’t pre-define? Each “No” is an observability gap.
Did You Know?
Section titled “Did You Know?”-
Honeycomb (observability company) was founded on the principle that high-cardinality data is essential. Traditional monitoring tools couldn’t handle millions of unique values, so they built new systems optimized for it.
-
Google’s Dapper paper (2010) introduced distributed tracing to the industry. It showed how Google traces requests across thousands of services to understand behavior. This paper inspired Zipkin, Jaeger, and eventually OpenTelemetry.
-
The term “pillars” (logs, metrics, traces) has been criticized by observability practitioners. Charity Majors argues they’re not pillars but rather different views of the same events. The “pillar” framing can lead teams to treat them as separate silos instead of integrated data.
-
Twitter famously had a “Fail Whale” page during outages in its early days. The engineering team couldn’t debug distributed issues because they lacked observability—they had monitoring but couldn’t answer “why.” This drove major investments in distributed tracing that later influenced the industry.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| ”We have dashboards, we’re observable” | Dashboards are monitoring, not observability | Add queryable, high-cardinality data |
| Logging without structure | Can’t query, can’t correlate | Structured JSON logs with context |
| No request/trace IDs | Can’t follow requests across services | Generate IDs at edge, propagate everywhere |
| Aggregating too early | Lose detail needed for debugging | Store raw events, aggregate at query time |
| Treating pillars as silos | Can’t correlate logs, metrics, traces | Use common identifiers (trace_id) |
| Only instrumenting your code | Miss database, cache, external calls | Instrument at boundaries too |
-
What’s the key difference between monitoring and observability?
Answer
Monitoring answers predefined questions: “Is CPU above threshold? Is error rate above threshold?” You decide in advance what to watch and alert on.
Observability lets you ask arbitrary questions after the fact: “Why are these specific requests slow? What do failing requests have in common?” You emit rich telemetry and explore it when issues arise.
Monitoring tells you that something is wrong (when it matches your definitions). Observability helps you understand why something is happening (even when you didn’t predict it).
-
Why can’t traditional metrics handle high-cardinality data well?
Answer
Traditional metrics systems (like Prometheus) store time-series data where each unique combination of labels creates a new series. With high-cardinality dimensions like user_id (millions of values) or request_id (billions), the number of time series explodes:
- 10 endpoints × 3 regions = 30 series (fine)
- 10 endpoints × 3 regions × 1M users = 30 million series (expensive)
Storage, memory, and query costs grow linearly with cardinality. Traditional systems aggregate to keep cardinality low, but this loses the detail needed for debugging specific issues.
Observability tools handle this by storing events (not pre-aggregated series) and computing aggregations at query time.
-
What does observability mean in control theory, and how does that apply to software?
Answer
In control theory, a system is observable if you can determine its complete internal state just by examining its outputs. You don’t need to open the black box—the outputs tell you everything.
Applied to software: An observable system emits enough telemetry (logs, metrics, traces) that you can understand why it’s behaving the way it is, without adding new instrumentation or deploying new code.
If you need to add a printf, redeploy, and reproduce an issue to debug it, your system isn’t fully observable. If you can diagnose novel issues just from existing telemetry, it is.
-
Why do distributed systems need observability more than monoliths?
Answer
Distributed systems have:
- No single stack trace: A request touches multiple services. No single log shows the full picture.
- Scattered data: Logs are on different machines. Correlating them is hard without shared IDs.
- Emergent failures: Problems arise from interactions between services, not single components.
- More unknowns: More moving parts = more ways to fail unexpectedly.
Monoliths can often be debugged with a stack trace and local logs. Distributed systems need distributed tracing, correlated logs, and the ability to query across services—the core of observability.
-
A company has 1 million daily requests. Average latency is 150ms, p99 is 400ms. They consider this “fine.” But 0.5% of requests take >5 seconds. How many users experience extreme latency daily? Why might monitoring miss this?
Answer
Calculation: 1,000,000 × 0.5% = 5,000 users daily experience >5 second latency.
Why monitoring misses it:
- Average (150ms) is dominated by the 99.5% of fast requests
- p99 (400ms) only captures the 99th percentile—the 0.5% beyond that is invisible
- Even p99.9 might not show the full picture
- Traditional monitoring aggregates away the tail; you need high-cardinality data to find what those 5,000 requests have in common
This is exactly the scenario from the war story—aggregate metrics look fine while thousands of users suffer.
-
Your system has 50 endpoints, 10 regions, and 2 million users. If you stored traditional metrics with user_id as a label, how many time series would you create? Why is this problematic?
Answer
Calculation: 50 endpoints × 10 regions × 2,000,000 users = 1 billion time series
Why this is problematic:
- Each time series consumes memory and storage
- Prometheus recommends staying under 10 million series for manageability
- 1 billion series would require ~100TB+ of memory for efficient querying
- Query performance degrades dramatically at this scale
- Cost would be astronomical
Solution: Observability tools store events (not pre-aggregated series) and compute aggregations at query time, handling high cardinality efficiently.
-
An engineer says “We have Prometheus, Grafana, and ELK—we’re fully observable.” What’s wrong with this statement?
Answer
Having tools doesn’t equal observability. The statement confuses capabilities with properties.
Questions to ask:
- Can you trace a single request through all services?
- Can you query by user_id, request_id, or other high-cardinality dimensions?
- Are logs, metrics, and traces correlated (same trace_id)?
- Can you ask arbitrary questions you didn’t pre-define?
Prometheus + Grafana + ELK can be part of an observable system, but:
- Prometheus struggles with high cardinality
- ELK logs without structure/correlation are just searchable text
- Grafana dashboards are monitoring, not exploration
Observability is about what you can discover, not what tools you have.
-
The AWS S3 2017 outage lasted 4 hours and affected thousands of websites. If AWS had better observability, what specific questions would they have needed to answer faster?
Answer
Key questions observability should have answered:
- “What changed?” - Which automation ran? What did it modify?
- “What are the dependencies?” - What systems depend on the removed servers?
- “What’s the blast radius?” - Which customers/services are affected?
- “What’s the cascade?” - Billing → S3 index → S3 objects: what’s the dependency chain?
- “Why do our health checks pass?” - Individual servers report healthy, but system-level behavior is broken
The core observability gap: They could answer “Is server X healthy?” but not “Why are customers experiencing failures when all servers report healthy?” They lacked correlation between individual health and emergent system behavior.
Key Takeaways
Section titled “Key Takeaways”OBSERVABILITY ESSENTIALS CHECKLIST═══════════════════════════════════════════════════════════════════════════════
UNDERSTANDING THE DIFFERENCE☑ Monitoring answers predefined questions ("Is CPU > 80%?")☑ Observability enables unknown questions ("Why are THESE requests slow?")☑ Dashboards showing green doesn't mean users are happy
THE CARDINALITY IMPERATIVE☑ Traditional metrics aggregate away the details you need☑ High cardinality (user_id, request_id) is essential for debugging☑ 5% of users having problems is 50,000 users at 1M requests/day
DISTRIBUTED SYSTEM REALITY☑ No single stack trace shows the full picture☑ Logs scattered across machines need correlation (trace_id)☑ Failures emerge from interactions, not individual components
THE OBSERVABILITY MINDSET☑ Emit rich telemetry, explore when problems arise☑ Form hypotheses, verify with data☑ Discover failure modes you didn't anticipate
STARTING THE JOURNEY☑ Structured logging with context (user_id, request_id)☑ Propagate trace IDs through all services☑ Enable ad-hoc queries, not just predefined dashboardsHands-On Exercise
Section titled “Hands-On Exercise”Task: Evaluate the observability of a system you work with.
Part 1: Current State Assessment (10 minutes)
Fill out this observability scorecard:
| Capability | Score (0-3) | Notes |
|---|---|---|
| Structured logging | 0=none, 1=some, 2=most, 3=all | |
| Request IDs propagated | 0=none, 1=some services, 2=most, 3=all | |
| Distributed tracing | 0=none, 1=basic, 2=detailed, 3=comprehensive | |
| High-cardinality queries | 0=can’t, 1=limited, 2=some, 3=any dimension | |
| Cross-service correlation | 0=manual, 1=partial, 2=mostly automated, 3=seamless | |
| Ad-hoc investigation | 0=impossible, 1=painful, 2=possible, 3=easy |
Total: ___/18
Interpretation:
- 0-6: Monitoring only, major observability gaps
- 7-12: Partial observability, can debug some issues
- 13-18: Good observability, can investigate most issues
Part 2: Gap Analysis (10 minutes)
For your lowest-scoring capabilities:
| Capability | Current State | What’s Missing | First Step to Improve |
|---|---|---|---|
Part 3: Investigation Scenario (10 minutes)
Imagine this scenario:
“Users report that the checkout page is slow, but only sometimes. Your dashboard shows normal latency.”
Write down the steps you’d take to investigate:
- What would you look at first?
- What questions would you try to answer?
- What data would you need that you might not have?
- Where would you get stuck with your current tools?
Success Criteria:
- Scorecard completed with honest assessment
- At least 2 gaps identified with improvement steps
- Investigation scenario shows understanding of observability vs monitoring approach
- Identified at least one data gap in current system
Further Reading
Section titled “Further Reading”-
“Observability Engineering” - Charity Majors, Liz Fong-Jones, George Miranda. The definitive book on observability concepts and practices.
-
“Distributed Systems Observability” - Cindy Sridharan. Free ebook covering observability in distributed systems.
-
“Dapper, a Large-Scale Distributed Systems Tracing Infrastructure” - Google paper that introduced distributed tracing.
Next Module
Section titled “Next Module”Module 3.2: The Three Pillars - Deep dive into logs, metrics, and traces—what each provides and how they work together.