Module 3.1: What is Observability?

Complexity: [MEDIUM]

Time to Complete: 30-35 minutes

Prerequisites: Reliability Engineering Track (recommended)

Track: Foundations

What You’ll Be Able to Do

After completing this module, you will be able to:

Explain why observability differs fundamentally from monitoring and when each approach is appropriate
Evaluate a system’s observability maturity by assessing whether engineers can answer novel questions about system behavior without deploying new code
Design an observability strategy that enables debugging unknown-unknowns rather than only alerting on predefined failure conditions
Compare observability tooling approaches (metrics-first, logs-first, traces-first) and justify the right starting point for a given architecture

The Dashboard That Showed Green While the Company Lost Millions

March 2017. Amazon Web Services. 9:37 AM Pacific Time.

The senior engineer’s dashboard shows nothing wrong. CPU utilization: normal. Memory: normal. Error rate: 0.02%. Network: stable. All the lines are green. Every metric within threshold.

But the phone won’t stop ringing.

“The S3 console won’t load.” “Our static assets are 404ing.” “Entire us-east-1 seems broken.”

The engineer stares at the dashboard. It’s lying to him. Everything says “fine” while half the internet is on fire.

Here’s what happened: An engineer ran an automation script to remove a small number of S3 servers. A typo caused far more servers to be removed than intended. The billing subsystem—dependent on those servers—started failing. S3’s index subsystem couldn’t query billing. S3 couldn’t serve any objects.

Thousands of websites went dark. Major platforms like Slack, Quora, and Trello became unavailable. The outage lasted 4 hours.

Cost: Estimated $150-160 million in losses across affected businesses.

The dashboard problem: All the metrics were designed to answer “Is this specific thing okay?” None could answer “Why are customers screaming while our graphs show green?”

This is the difference between monitoring and observability.

Monitoring asks: “Is X within threshold?” Observability asks: “Why is the system behaving this way?”

The S3 team had world-class monitoring. Every server reported health. Every metric was collected. But they couldn’t see that the relationship between systems was broken. They could see the trees were green; they couldn’t see the forest was on fire.

This incident changed how AWS thinks about observability. They invested heavily in distributed tracing, request correlation, and the ability to ask questions they hadn’t anticipated needing to ask.

Why This Module Matters

It’s 3 AM. The on-call engineer’s phone buzzes: “High latency detected.” The dashboard opens. Everything looks… fine. CPU is normal. Memory is normal. Error rate is low. But users are complaining. Something is wrong, and nobody can see what.

This is the gap between monitoring and observability. Monitoring tells you when predefined things go wrong. Observability lets you understand why your system is behaving the way it is—even when you didn’t predict the failure mode in advance.

THE MONITORING TRAP
═══════════════════════════════════════════════════════════════════════════════

3:00 AM - Dashboard check

┌─────────────────────────────────────────────────────────────────────────────┐
│                     PRODUCTION DASHBOARD                                    │
│                                                                             │
│   CPU: ████████░░ 78%  ✓     Memory: ██████░░░░ 60%  ✓                     │
│   Errors: 0.12%  ✓            Latency: 145ms  ✓                            │
│   Requests/s: 12,456  ✓       Database: Connected  ✓                       │
│                                                                             │
│   ✅ ALL SYSTEMS OPERATIONAL                                                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

3:05 AM - Slack channel

    Support: "Users reporting checkout failures"
    Support: "12 tickets in the last 5 minutes"
    Support: "All from premium users?"

    Engineer: "Dashboard shows everything green..."
    Engineer: "Let me check logs..."
    Engineer: "3.2 million log lines in the last hour"
    Engineer: "Can't search by user ID"
    Engineer: "Can't correlate across services"
    Engineer: "I have no idea what's happening"

The dashboard answered every question it was designed to answer.
It couldn't answer the question that mattered.

In complex distributed systems, you can’t anticipate every failure. You need systems that let you ask new questions without deploying new code.

The Car Dashboard Analogy

A car dashboard is monitoring: it shows predefined metrics (speed, fuel, temperature). But when something weird happens—a strange noise, intermittent vibration—the dashboard doesn’t help. A mechanic with diagnostic tools has observability: they can probe the system, trace connections, and discover what’s wrong without knowing in advance what to look for.

What You’ll Learn

The difference between monitoring and observability
Where observability came from (control theory)
The observability equation: can you understand internal state from external outputs?
Why traditional monitoring fails for distributed systems
The mindset shift required for observability

Part 1: Monitoring vs. Observability

1.1 What is Monitoring?

Monitoring is collecting predefined metrics and alerting when they cross thresholds.

TRADITIONAL MONITORING
═══════════════════════════════════════════════════════════════

You define in advance:
- What to measure (CPU, memory, error rate)
- What's normal (CPU < 80%)
- When to alert (CPU > 80% for 5 minutes)

┌─────────────────────────────────────────────────────────────┐
│                     Dashboard                               │
│                                                             │
│  CPU: ████████░░  78%     Memory: ██████░░░░  60%          │
│                                                             │
│  Errors: 0.1%             Requests: 1,234/s                │
│                                                             │
│  All systems normal ✅                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Monitoring answers: "Are the things I decided to watch okay?"

Monitoring works when:

You know what can go wrong
Failures match known patterns
Systems are relatively simple

1.2 What is Observability?

Observability is the ability to understand a system’s internal state by examining its outputs—without knowing in advance what you’re looking for.

OBSERVABILITY
═══════════════════════════════════════════════════════════════

You emit rich telemetry (logs, metrics, traces) that lets you
ask arbitrary questions after the fact.

Question: "Why are 5% of requests slow?"
    │
    ▼
Drill down by: user, endpoint, region, time...
    │
    ▼
Find: Requests from EU to US-East database are slow
    │
    ▼
Further: Only for users with >1000 items
    │
    ▼
Root cause: N+1 query pattern, worse with more items

Observability answers: "Why is the system behaving this way?"

1.3 The Key Difference

Aspect	Monitoring	Observability
Questions	Predefined	Ad-hoc, exploratory
Approach	”Is X okay?"	"Why is this happening?”
Failure modes	Known in advance	Discovered during investigation
Data	Aggregated metrics	High-cardinality, detailed
Investigation	Dashboard → Runbook	Explore → Hypothesize → Verify

MONITORING vs OBSERVABILITY WORKFLOW
═══════════════════════════════════════════════════════════════

MONITORING WORKFLOW (known unknowns)
────────────────────────────────────────────────────────────
Alert: "Error rate > 1%"
    ↓
Check runbook: "If error rate high, check database"
    ↓
Dashboard: Database looks fine
    ↓
??? (stuck)

OBSERVABILITY WORKFLOW (unknown unknowns)
────────────────────────────────────────────────────────────
Notice: "Latency increased for some requests"
    ↓
Explore: Which requests? Filter by endpoint, user, region
    ↓
Find: /api/v2/search requests from mobile clients
    ↓
Drill: What's different about these? Trace shows slow cache
    ↓
Correlate: Cache server in that region had memory pressure
    ↓
Discovered failure mode you never anticipated

Did You Know?

The term “observability” comes from control theory, coined by Rudolf Kálmán in 1960. In control theory, a system is “observable” if you can determine its complete internal state from its outputs. Software adopted this concept because modern distributed systems are too complex to monitor every internal state directly.

Part 2: Why Monitoring Isn’t Enough

2.1 The Cardinality Problem

Traditional monitoring aggregates data to reduce storage. But aggregation hides details.

THE CARDINALITY PROBLEM
═══════════════════════════════════════════════════════════════

You have 1 million requests. Monitoring shows:
- Average latency: 100ms ✅
- p99 latency: 500ms ✅
- Error rate: 0.5% ✅

Everything looks fine! But 5,000 users had terrible experience.

What monitoring CAN'T tell you:
- Which users?
- Which endpoints?
- What did those requests have in common?
- Why were they different?

High-cardinality dimensions you need:
- user_id (millions of values)
- request_id (billions of values)
- trace_id (billions of values)
- endpoint + parameters
- geographic region
- device type
- feature flags enabled

Traditional metrics can't handle this. You need observability.

2.2 The Unknown Unknowns

You can only monitor what you anticipate. But complex systems fail in unexpected ways.

KNOWN vs UNKNOWN FAILURES
═══════════════════════════════════════════════════════════════

KNOWN KNOWNS (Monitoring handles well)
    ├── CPU exhaustion
    ├── Memory exhaustion
    ├── Disk full
    └── Process crash

KNOWN UNKNOWNS (Monitoring struggles)
    ├── "Slow sometimes"
    ├── "Works for most users"
    └── "Fails under specific conditions"

UNKNOWN UNKNOWNS (Monitoring fails)
    ├── Novel failure combinations
    ├── Emergent behavior
    ├── Race conditions
    └── "That's never happened before"

Observability lets you investigate unknown unknowns because
you can ask questions you didn't think to ask in advance.

2.3 Distributed System Complexity

Monitoring was designed for monoliths. Distributed systems need more:

MONOLITH vs DISTRIBUTED
═══════════════════════════════════════════════════════════════

MONOLITH
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│    Request ──▶ [Application] ──▶ [Database] ──▶ Response   │
│                                                             │
│    - One process to monitor                                │
│    - Logs in one place                                      │
│    - Stack trace shows full path                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

DISTRIBUTED SYSTEM
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│    Request ──▶ [Gateway] ──▶ [Service A] ──▶ [Service B]   │
│                    │              │              │          │
│                    ▼              ▼              ▼          │
│               [Cache]        [Queue]      [Database]       │
│                                   │                         │
│                                   ▼                         │
│                            [Service C]                      │
│                                                             │
│    - Multiple processes, multiple machines                 │
│    - Logs scattered everywhere                              │
│    - No single stack trace                                  │
│    - Need to correlate across services                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Distributed systems need distributed observability.

Try This (2 minutes)

Think about a recent debugging session. Did you:

Follow a runbook? (Monitoring mindset)

Explore and ask questions? (Observability mindset)

Get stuck because you couldn’t see enough? (Gap between the two)

Part 3: The Observability Equation

3.1 Control Theory Origins

In control theory, observability is a mathematical property: can you determine the internal state of a system from its external outputs?

OBSERVABILITY IN CONTROL THEORY
═══════════════════════════════════════════════════════════════

          ┌─────────────────────┐
Input ───▶│      SYSTEM        │───▶ Output
          │                     │
          │  Internal State ?   │
          └─────────────────────┘

Question: Given the outputs, can we know the internal state?

OBSERVABLE: Yes, outputs tell us everything we need
NOT OBSERVABLE: Internal state is hidden from outputs

Example - Observable:
    A car's speedometer output tells you internal velocity state

Example - Not Observable:
    A black box that outputs same value regardless of input

3.2 Software Observability

Applied to software, observability means: can you understand why your system is behaving the way it is, just by examining telemetry?

SOFTWARE OBSERVABILITY
═══════════════════════════════════════════════════════════════

Highly Observable System:
┌─────────────────────────────────────────────────────────────┐
│  Structured logs with context (user, request, trace)       │
│  Metrics with high-cardinality dimensions                  │
│  Distributed traces showing request flow                    │
│  Events capturing state changes                            │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
Can answer: "Why did user X's request fail at 3:42 PM?"
Can answer: "Why are requests from region Y slow?"
Can answer: "What changed that caused this behavior?"

Poorly Observable System:
┌─────────────────────────────────────────────────────────────┐
│  Unstructured logs: "Error occurred"                       │
│  Aggregate metrics: "Average latency: 100ms"               │
│  No tracing                                                │
│  No correlation between data sources                       │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
Can answer: "Is the average latency okay?" (Yes)
Cannot answer: "Why are some requests slow?" (???)

3.3 Properties of Observable Systems

Property	What It Means	Example
High cardinality	Many unique dimension values	user_id, not just “users”
High dimensionality	Many dimensions to slice by	user, endpoint, region, version, feature_flag
Correlation	Can connect data across sources	Trace ID links logs, metrics, traces
Context preservation	Details not aggregated away	Full request details, not just averages
Queryability	Can ask arbitrary questions	”Show me requests where X AND Y AND Z”

Try This (2 minutes)

Pick a recent production issue. Could you have answered these questions with your current tools?

Question Could You Answer It?
”Show me all requests from user X in the last hour” Yes / No / Partially
”What do the slowest 1% of requests have in common?” Yes / No / Partially
”Which version of the code handled this request?” Yes / No / Partially
”What other requests were happening at the same time?” Yes / No / Partially

Each “No” reveals an observability gap.

Question	Could You Answer It?
”Show me all requests from user X in the last hour”	Yes / No / Partially
”What do the slowest 1% of requests have in common?”	Yes / No / Partially
”Which version of the code handled this request?”	Yes / No / Partially
”What other requests were happening at the same time?”	Yes / No / Partially

Part 4: The Observability Mindset

4.1 From “Know What’s Wrong” to “Understand Behavior”

MINDSET SHIFT
═══════════════════════════════════════════════════════════════

MONITORING MINDSET
─────────────────────────────────────────────────────────────
"I will define what 'wrong' means in advance"
    ↓
Create alerts for known bad states
    ↓
When alert fires, follow runbook
    ↓
Problem: What if failure doesn't match any alert?

OBSERVABILITY MINDSET
─────────────────────────────────────────────────────────────
"I will emit rich telemetry about system behavior"
    ↓
When something seems off, explore the data
    ↓
Ask questions, form hypotheses, verify
    ↓
Discover failure modes I didn't anticipate

4.2 Exploration Over Dashboards

DASHBOARD vs EXPLORATION
═══════════════════════════════════════════════════════════════

DASHBOARD (monitoring)
┌─────────────────────────────────────────────────────────────┐
│  Fixed views of predefined metrics                          │
│  Good for: known important signals                         │
│  Bad for: investigating new problems                       │
│                                                             │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐           │
│  │  CPU   │  │ Memory │  │ Errors │  │  QPS   │           │
│  │  78%   │  │  60%   │  │  0.1%  │  │  1234  │           │
│  └────────┘  └────────┘  └────────┘  └────────┘           │
│                                                             │
│  If these don't show the problem, you're stuck.            │
└─────────────────────────────────────────────────────────────┘

EXPLORATION (observability)
┌─────────────────────────────────────────────────────────────┐
│  Query interface for ad-hoc investigation                  │
│  Good for: discovering unknown issues                      │
│                                                             │
│  > show requests where latency > 500ms                     │
│    → 5,234 requests (2.1%)                                 │
│                                                             │
│  > group by endpoint                                        │
│    → /api/search: 4,891 (94%)                              │
│                                                             │
│  > filter endpoint=/api/search, group by user_tier         │
│    → premium: 12, free: 4,879                              │
│                                                             │
│  > Hypothesis: Free tier hitting rate limits?              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

4.3 Questions Observability Enables

With good observability, you can ask:

“Why is this specific request slow?” - Not averages, this one request
“What do failing requests have in common?” - Pattern discovery
“What changed?” - Correlation with deploys, config, external factors
“Is this new?” - Historical comparison
“Who is affected?” - Impact scoping
“What else is affected?” - Blast radius discovery

War Story: The 5% Mystery That Cost Millions

2019. A Major E-commerce Platform. Black Friday Weekend.

The site reliability team was confident. Dashboards showed average latency at 180ms—well within SLO. Error rate sat at 0.3%—excellent. But customer support tickets kept flooding in: “Checkout won’t complete.” “Payment page hangs forever.” “Your site is unusable.”

The team dismissed it as user perception. The numbers looked great. Leadership started questioning if support was exaggerating.

Then a product manager showed up with data: abandoned cart rate had spiked 340%. Customers were leaving without buying. Revenue was hemorrhaging.

Day 1: Engineers added high-cardinality observability. Within 2 hours, they discovered 5.2% of checkout requests took over 8 seconds—but only for users matching a specific pattern.

Day 2: They drilled down. Affected users had: (1) accounts older than 2 years, (2) Safari browser, (3) connecting from US East Coast.

Day 3: Root cause found. A feature flag enabled for “loyal customers” (accounts >2 years) triggered a new recommendation engine. That engine made a synchronous call to a third-party API. Safari’s stricter timeouts exposed latency that Chrome masked. The API server was in US West, adding 40ms RTT for East Coast users.

Financial Impact:

Lost revenue during Black Friday: $2.3M

Customer churn from frustrated loyal customers: estimated $8M annually

Fix took 20 minutes once they found it (disable feature flag)

The Lesson: Their monitoring was technically excellent. Average latency? Perfect. p99? Good. Error rate? Great. But averages hid the pain of their most valuable customers. High-cardinality observability revealed what aggregate metrics couldn’t see.

Part 5: Building Toward Observability

5.1 The Three Pillars (Preview)

Observability is built on three complementary data types:

THE THREE PILLARS OF OBSERVABILITY
═══════════════════════════════════════════════════════════════

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│     LOGS        │  │    METRICS      │  │    TRACES       │
│                 │  │                 │  │                 │
│  What happened  │  │  How much/many  │  │  Request flow   │
│  (events)       │  │  (aggregates)   │  │  (journey)      │
│                 │  │                 │  │                 │
│  High detail    │  │  Low cost       │  │  Shows path     │
│  High cost      │  │  Low detail     │  │  Shows timing   │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         │                   │                    │
         └───────────────────┼────────────────────┘
                             │
                    Correlation via
                     trace_id / request_id

We’ll explore each pillar in detail in Module 3.2.

5.2 Starting the Journey

You don’t need to build everything at once. Start with:

Structured logging - Add context to logs (user_id, request_id)
Request IDs - Generate unique IDs that flow through all services
Meaningful metrics - Beyond CPU/memory, measure what matters to users
Correlation - Link logs, metrics, and traces with shared IDs

Try This (3 minutes)

Evaluate your current system:

Question Yes/No
Can you trace a single request through all services?
Can you filter logs by user_id?
Can you see which users are affected by an issue?
Can you ask questions you didn’t pre-define?

Each “No” is an observability gap.

Question	Yes/No
Can you trace a single request through all services?
Can you filter logs by user_id?
Can you see which users are affected by an issue?
Can you ask questions you didn’t pre-define?

Did You Know?

Honeycomb (observability company) was founded on the principle that high-cardinality data is essential. Traditional monitoring tools couldn’t handle millions of unique values, so they built new systems optimized for it.
Google’s Dapper paper (2010) introduced distributed tracing to the industry. It showed how Google traces requests across thousands of services to understand behavior. This paper inspired Zipkin, Jaeger, and eventually OpenTelemetry.
The term “pillars” (logs, metrics, traces) has been criticized by observability practitioners. Charity Majors argues they’re not pillars but rather different views of the same events. The “pillar” framing can lead teams to treat them as separate silos instead of integrated data.
Twitter famously had a “Fail Whale” page during outages in its early days. The engineering team couldn’t debug distributed issues because they lacked observability—they had monitoring but couldn’t answer “why.” This drove major investments in distributed tracing that later influenced the industry.

Common Mistakes

Mistake	Problem	Solution
”We have dashboards, we’re observable”	Dashboards are monitoring, not observability	Add queryable, high-cardinality data
Logging without structure	Can’t query, can’t correlate	Structured JSON logs with context
No request/trace IDs	Can’t follow requests across services	Generate IDs at edge, propagate everywhere
Aggregating too early	Lose detail needed for debugging	Store raw events, aggregate at query time
Treating pillars as silos	Can’t correlate logs, metrics, traces	Use common identifiers (trace_id)
Only instrumenting your code	Miss database, cache, external calls	Instrument at boundaries too

Quiz

What’s the key difference between monitoring and observability?

Answer

Monitoring answers predefined questions: “Is CPU above threshold? Is error rate above threshold?” You decide in advance what to watch and alert on.

Observability lets you ask arbitrary questions after the fact: “Why are these specific requests slow? What do failing requests have in common?” You emit rich telemetry and explore it when issues arise.

Monitoring tells you that something is wrong (when it matches your definitions). Observability helps you understand why something is happening (even when you didn’t predict it).
Why can’t traditional metrics handle high-cardinality data well?
Answer

Traditional metrics systems (like Prometheus) store time-series data where each unique combination of labels creates a new series. With high-cardinality dimensions like user_id (millions of values) or request_id (billions), the number of time series explodes:
- 10 endpoints × 3 regions = 30 series (fine)
- 10 endpoints × 3 regions × 1M users = 30 million series (expensive)
Storage, memory, and query costs grow linearly with cardinality. Traditional systems aggregate to keep cardinality low, but this loses the detail needed for debugging specific issues.

Observability tools handle this by storing events (not pre-aggregated series) and computing aggregations at query time.
What does observability mean in control theory, and how does that apply to software?

Answer

In control theory, a system is observable if you can determine its complete internal state just by examining its outputs. You don’t need to open the black box—the outputs tell you everything.

Applied to software: An observable system emits enough telemetry (logs, metrics, traces) that you can understand why it’s behaving the way it is, without adding new instrumentation or deploying new code.

If you need to add a printf, redeploy, and reproduce an issue to debug it, your system isn’t fully observable. If you can diagnose novel issues just from existing telemetry, it is.
Why do distributed systems need observability more than monoliths?
Answer

Distributed systems have:
1. No single stack trace: A request touches multiple services. No single log shows the full picture.
2. Scattered data: Logs are on different machines. Correlating them is hard without shared IDs.
3. Emergent failures: Problems arise from interactions between services, not single components.
4. More unknowns: More moving parts = more ways to fail unexpectedly.
Monoliths can often be debugged with a stack trace and local logs. Distributed systems need distributed tracing, correlated logs, and the ability to query across services—the core of observability.
A company has 1 million daily requests. Average latency is 150ms, p99 is 400ms. They consider this “fine.” But 0.5% of requests take >5 seconds. How many users experience extreme latency daily? Why might monitoring miss this?
Answer

Calculation: 1,000,000 × 0.5% = 5,000 users daily experience >5 second latency.

Why monitoring misses it:
- Average (150ms) is dominated by the 99.5% of fast requests
- p99 (400ms) only captures the 99th percentile—the 0.5% beyond that is invisible
- Even p99.9 might not show the full picture
- Traditional monitoring aggregates away the tail; you need high-cardinality data to find what those 5,000 requests have in common
This is exactly the scenario from the war story—aggregate metrics look fine while thousands of users suffer.
Your system has 50 endpoints, 10 regions, and 2 million users. If you stored traditional metrics with user_id as a label, how many time series would you create? Why is this problematic?
Answer

Calculation: 50 endpoints × 10 regions × 2,000,000 users = 1 billion time series

Why this is problematic:
- Each time series consumes memory and storage
- Prometheus recommends staying under 10 million series for manageability
- 1 billion series would require ~100TB+ of memory for efficient querying
- Query performance degrades dramatically at this scale
- Cost would be astronomical
Solution: Observability tools store events (not pre-aggregated series) and compute aggregations at query time, handling high cardinality efficiently.
An engineer says “We have Prometheus, Grafana, and ELK—we’re fully observable.” What’s wrong with this statement?
Answer

Having tools doesn’t equal observability. The statement confuses capabilities with properties.

Questions to ask:
- Can you trace a single request through all services?
- Can you query by user_id, request_id, or other high-cardinality dimensions?
- Are logs, metrics, and traces correlated (same trace_id)?
- Can you ask arbitrary questions you didn’t pre-define?
Prometheus + Grafana + ELK can be part of an observable system, but:
- Prometheus struggles with high cardinality
- ELK logs without structure/correlation are just searchable text
- Grafana dashboards are monitoring, not exploration
Observability is about what you can discover, not what tools you have.
The AWS S3 2017 outage lasted 4 hours and affected thousands of websites. If AWS had better observability, what specific questions would they have needed to answer faster?
Answer

Key questions observability should have answered:
1. “What changed?” - Which automation ran? What did it modify?
2. “What are the dependencies?” - What systems depend on the removed servers?
3. “What’s the blast radius?” - Which customers/services are affected?
4. “What’s the cascade?” - Billing → S3 index → S3 objects: what’s the dependency chain?
5. “Why do our health checks pass?” - Individual servers report healthy, but system-level behavior is broken
The core observability gap: They could answer “Is server X healthy?” but not “Why are customers experiencing failures when all servers report healthy?” They lacked correlation between individual health and emergent system behavior.

Key Takeaways

OBSERVABILITY ESSENTIALS CHECKLIST
═══════════════════════════════════════════════════════════════════════════════

UNDERSTANDING THE DIFFERENCE
☑ Monitoring answers predefined questions ("Is CPU > 80%?")
☑ Observability enables unknown questions ("Why are THESE requests slow?")
☑ Dashboards showing green doesn't mean users are happy

THE CARDINALITY IMPERATIVE
☑ Traditional metrics aggregate away the details you need
☑ High cardinality (user_id, request_id) is essential for debugging
☑ 5% of users having problems is 50,000 users at 1M requests/day

DISTRIBUTED SYSTEM REALITY
☑ No single stack trace shows the full picture
☑ Logs scattered across machines need correlation (trace_id)
☑ Failures emerge from interactions, not individual components

THE OBSERVABILITY MINDSET
☑ Emit rich telemetry, explore when problems arise
☑ Form hypotheses, verify with data
☑ Discover failure modes you didn't anticipate

STARTING THE JOURNEY
☑ Structured logging with context (user_id, request_id)
☑ Propagate trace IDs through all services
☑ Enable ad-hoc queries, not just predefined dashboards

Hands-On Exercise

Task: Evaluate the observability of a system you work with.

Part 1: Current State Assessment (10 minutes)

Fill out this observability scorecard:

Capability	Score (0-3)	Notes
Structured logging		0=none, 1=some, 2=most, 3=all
Request IDs propagated		0=none, 1=some services, 2=most, 3=all
Distributed tracing		0=none, 1=basic, 2=detailed, 3=comprehensive
High-cardinality queries		0=can’t, 1=limited, 2=some, 3=any dimension
Cross-service correlation		0=manual, 1=partial, 2=mostly automated, 3=seamless
Ad-hoc investigation		0=impossible, 1=painful, 2=possible, 3=easy

Total: ___/18

Interpretation:

0-6: Monitoring only, major observability gaps
7-12: Partial observability, can debug some issues
13-18: Good observability, can investigate most issues

Part 2: Gap Analysis (10 minutes)

For your lowest-scoring capabilities:

Capability	Current State	What’s Missing	First Step to Improve

Part 3: Investigation Scenario (10 minutes)

Imagine this scenario:

“Users report that the checkout page is slow, but only sometimes. Your dashboard shows normal latency.”

Write down the steps you’d take to investigate:

What would you look at first?
What questions would you try to answer?
What data would you need that you might not have?
Where would you get stuck with your current tools?

Success Criteria:

Scorecard completed with honest assessment
At least 2 gaps identified with improvement steps
Investigation scenario shows understanding of observability vs monitoring approach
Identified at least one data gap in current system

Next Module

Module 3.2: The Three Pillars - Deep dive into logs, metrics, and traces—what each provides and how they work together.