Module 3.2: The Three Pillars of Observability

Complexity: [MEDIUM]

Time to Complete: 35-40 minutes

Prerequisites: Module 3.1: What is Observability?

Track: Foundations

After completing this module, you will be able to:

  1. Analyze when to use metrics, logs, or traces for a given debugging scenario and explain the strengths and blind spots of each
  2. Design a correlated observability pipeline where trace IDs connect metrics anomalies to log entries to distributed traces
  3. Evaluate whether existing observability coverage has gaps that leave specific failure modes invisible
  4. Implement a strategy for combining the three pillars so they reinforce each other rather than creating three isolated data silos

The Engineer Who Had Everything and Nothing

December 31, 2015. A Major Rideshare Company. 7:43 PM on New Year’s Eve.

The platform is about to handle 440% of normal traffic. The engineering team has prepared for months. They have comprehensive Prometheus metrics on every service. They have Elasticsearch with 2TB of logs. They have distributed tracing through Jaeger. All three pillars, best-in-class tooling.

At 11:52 PM, surge pricing stops working.

The on-call engineer opens Grafana. Latency spike on the pricing service—p99 jumped from 150ms to 8 seconds. Great, the metrics show that something is wrong.

She switches to Jaeger to find slow traces. But the trace UI only lets her search by trace ID or service name. She doesn’t have a trace ID. She searches for “pricing-service” and gets 2 million results in the last hour. No way to filter by duration. No way to find the slow ones.

She pivots to Elasticsearch. Searches for “pricing” and “error.” 847,000 results. None have trace IDs—the team never added them to logs. She can see errors happened but can’t connect them to specific requests or traces.

90 minutes. That’s how long it took to find the root cause: a database connection pool exhaustion that only manifested under New Year’s Eve load. The fix was a single config change. But finding it required manually correlating timestamps across three disconnected tools.

Cost: $12.4 million in lost surge pricing revenue. 340,000 customer complaints. A PR crisis that took weeks to contain.

The lesson: They had all three pillars. What they didn’t have was correlation. Without trace IDs in logs, without the ability to drill from metrics to traces, without exemplars connecting aggregates to specifics—the pillars were just three separate silos. Three blind investigators who couldn’t share notes.

THE THREE PILLARS PARADOX
═══════════════════════════════════════════════════════════════════════════════
WHAT THE TEAM HAD                        WHAT THEY COULD DO
─────────────────────────────────────    ─────────────────────────────────────
☑ Prometheus metrics (2M series)         ✗ Find which specific requests failed
☑ Elasticsearch logs (2TB/day)           ✗ Connect a log to its trace
☑ Jaeger traces (100M spans/day)         ✗ Find traces matching a metric spike
                                         ✗ Query logs by trace_id
                                         ✗ Drill down from aggregate to specific

Having the pillars ≠ Having observability

WHAT THEY ADDED AFTER THE INCIDENT
─────────────────────────────────────────────────────────────────────────────
✓ trace_id in every log line
✓ Exemplars linking p99 metrics to sample traces
✓ Duration-based trace search
✓ Unified query UI that links all three

Same tools. 10-minute resolution time for similar incidents.

You’ve learned what observability is. Now, how do you actually achieve it?

The industry has converged on three complementary data types—logs, metrics, and traces—often called the “three pillars.” Each has strengths and weaknesses. Used together with proper correlation, they give you the visibility to understand any system behavior.

This module teaches you what each pillar provides, when to use which, and critically—how to connect them so you can move seamlessly between different views of the same problem.

The Crime Scene Analogy

Investigating an incident is like investigating a crime. Logs are witness statements—detailed accounts of what happened. Metrics are statistics—how many, how often, trends over time. Traces are the timeline—reconstructing the sequence of events. A good investigator uses all three: statistics to find patterns, witnesses for details, timelines to understand causation.


  • What each pillar provides and its limitations
  • When to use logs vs. metrics vs. traces
  • How the three pillars work together
  • The importance of correlation (trace IDs, request IDs)
  • Modern alternatives and the “single pane of glass”

Logs are timestamped records of discrete events. They capture what happened, when, and context about the event.

LOGS: DISCRETE EVENTS
═══════════════════════════════════════════════════════════════
2024-01-15T10:32:15.123Z level=info msg="Request received"
    method=POST path=/api/checkout user_id=12345 request_id=abc-123

2024-01-15T10:32:15.456Z level=info msg="Payment processed"
    amount=99.99 currency=USD request_id=abc-123 duration_ms=333

2024-01-15T10:32:15.789Z level=error msg="Inventory check failed"
    item_id=SKU-789 error="connection timeout" request_id=abc-123

Each log entry is a snapshot of a moment in time.

UNSTRUCTURED LOG (hard to query)
═══════════════════════════════════════════════════════════════
[2024-01-15 10:32:15] ERROR: Payment failed for user 12345,
    order #789, amount $99.99 - connection timeout

- Human readable
- Hard to parse programmatically
- Can't easily filter by user_id or order
- Regex required for extraction

STRUCTURED LOG (easy to query)
═══════════════════════════════════════════════════════════════
{
  "timestamp": "2024-01-15T10:32:15.789Z",
  "level": "error",
  "message": "Payment failed",
  "user_id": 12345,
  "order_id": 789,
  "amount": 99.99,
  "currency": "USD",
  "error": "connection timeout",
  "request_id": "abc-123",
  "service": "payment-api",
  "version": "2.3.1"
}

- Machine parseable
- Easy to filter: WHERE user_id = 12345
- Easy to aggregate: COUNT BY error
- Context preserved as queryable fields
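
One way to produce structured logs like the example above is a small JSON formatter on top of Python's standard `logging` module. This is a minimal sketch, not a recommended library: the `JsonFormatter` class, the `fields` convention for extra data, and the logger name are assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (illustrative helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-api",  # assumed service name, taken from the example above
        }
        # Assumption: callers pass structured data via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("payment-api-example")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.error(
    "Payment failed",
    extra={"fields": {"user_id": 12345, "order_id": 789,
                      "error": "connection timeout", "request_id": "abc-123"}},
)

# Each line parses back into a dictionary, so filtering is a key lookup, not a regex.
entry = json.loads(buffer.getvalue())
print(entry["user_id"], entry["error"])
```

Because every field is a top-level key, a log backend can index `user_id` or `request_id` directly, which is what makes the `WHERE user_id = 12345` style of query possible.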
| Strengths | Weaknesses |
|---|---|
| Rich detail and context | High storage cost |
| Flexible schema | Can be noisy |
| Good for debugging specifics | Hard to see patterns |
| Audit trail | Performance overhead if excessive |
| Natural for developers | Query can be slow at scale |

When to use logs:

  • Debugging specific issues: “What happened to request abc-123?”
  • Audit trails: “Who did what when?”
  • Error details: Stack traces, error messages, context
  • State changes: “User X upgraded to premium”
  • Unusual events: Things that don’t happen often enough for metrics

Try This (2 minutes)

Look at a recent log line from your system. Does it have:

  • Timestamp with timezone
  • Log level
  • Request/trace ID
  • User identifier
  • Structured format (JSON)

Each missing item reduces your ability to query and correlate.


Metrics are numeric measurements collected over time. They’re optimized for aggregation and trending.

METRICS: NUMERIC TIME SERIES
═══════════════════════════════════════════════════════════════
http_requests_total{method="POST", path="/api/checkout", status="200"} 45623
http_requests_total{method="POST", path="/api/checkout", status="500"} 127
http_request_duration_seconds{quantile="0.99"} 0.456
http_request_duration_seconds{quantile="0.50"} 0.089
db_connections_active 47
db_connections_max 100
Each metric is a name + labels + numeric value over time.

| Type | What It Measures | Example |
|---|---|---|
| Counter | Cumulative total (only goes up) | Total requests, total errors |
| Gauge | Current value (goes up and down) | Active connections, queue depth |
| Histogram | Distribution of values | Request latency distribution |
| Summary | Similar to histogram, pre-calculated quantiles | p50, p99 latencies |

METRIC TYPES VISUALIZED
═══════════════════════════════════════════════════════════════
COUNTER (monotonically increasing)

────────●────●────●────●──────────────────────────────────────▶
Resets only on restart

GAUGE (fluctuates)

      ●               ●
    ●   ●           ●   ●
  ●       ●       ●       ●
            ●   ●
────────────────────────────────────────────────────────────▶
Current value at each point

HISTOGRAM (distribution)

Count
  │          ████
  │        ████████
  │      ████████████
  │    ████████████████
  └──────────────────────▶ Latency (ms)
       0   100  200  300  400
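
The difference between the metric types is easiest to see in code. The classes below are a deliberately tiny sketch written for this module; they are not the API of any real metrics library, and the bucket bounds are assumed values.

```python
import bisect


class Counter:
    """Cumulative total: the value can only increase."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Current value: can be set to anything at any time."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets defined by upper bounds (in seconds)."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # final slot is the +Inf bucket

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, i.e. the "le" bucket.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def cumulative(self) -> list[int]:
        """Cumulative counts, the form in which histogram buckets are usually exposed."""
        running, out = 0, []
        for count in self.counts:
            running += count
            out.append(running)
        return out


requests_total = Counter()
in_flight = Gauge()
latency = Histogram(bounds=[0.1, 0.5, 1.0])  # assumed bucket bounds

for seconds in [0.05, 0.2, 0.45, 0.8, 3.0]:
    in_flight.set(in_flight.value + 1)   # request starts
    requests_total.inc()
    latency.observe(seconds)
    in_flight.set(in_flight.value - 1)   # request ends

print(requests_total.value, in_flight.value, latency.cumulative())
```

Note what is gone after the loop: the histogram knows that one request took longer than a second, but not which one. That loss of individual detail is the price of cheap aggregation.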

| Strengths | Weaknesses |
|---|---|
| Low storage cost | Loses individual event detail |
| Fast queries | Limited cardinality |
| Good for trends and alerting | Can’t debug specific requests |
| Efficient aggregation | Pre-aggregated, can’t re-aggregate differently |
| Compact representation | Choosing what to measure is hard |

When to use metrics:

  • Alerting: “Error rate above threshold”
  • Dashboards: Real-time system health
  • Capacity planning: Trends over time
  • SLI measurement: Availability, latency percentiles
  • Business KPIs: Requests per second, revenue per minute

Gotcha: The Cardinality Trap

Metrics with high-cardinality labels (user_id, request_id) explode storage costs. A metric with labels for 1 million users creates 1 million time series. Use logs for high-cardinality data, metrics for bounded dimensions (endpoint, region, status_code).
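
The arithmetic behind the trap: the worst-case number of time series for one metric is the product of the distinct values of each label. A quick sketch, with label counts that are assumptions chosen purely for illustration:

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: every combination of label values."""
    return prod(label_cardinalities.values())


# Assumed cardinalities for a single metric such as http_requests_total.
bounded = {"endpoint": 20, "region": 5, "status_code": 6}
with_user_id = {**bounded, "user_id": 1_000_000}

print(series_count(bounded))       # 600
print(series_count(with_user_id))  # 600000000
```

Adding one unbounded label does not add a million series, it multiplies the existing count by a million. In practice not every combination occurs, so this is an upper bound, but the growth is still multiplicative.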


Traces capture the journey of a request through a distributed system. A trace is a tree of spans, each representing work done by a service.

DISTRIBUTED TRACE
═══════════════════════════════════════════════════════════════
Trace ID: abc-123-def-456
┌─────────────────────────────────────────────────────────┐
│ API Gateway (50ms)                                      │
│ └── Auth Service (10ms)                                 │
│ └── Order Service (35ms)                                │
│     └── Inventory Service (15ms)                        │
│     └── Payment Service (18ms)                          │
│         └── Database Query (12ms)                       │
│         └── External Payment API (5ms)                  │
└─────────────────────────────────────────────────────────┘

Timeline:
0ms    10ms   20ms   30ms   40ms   50ms
|------|------|------|------|------|
[====== API Gateway ====================]
[Auth]
       [======= Order Service =========]
       [Inventory]
                  [=== Payment ====]
                  [DB]
                            [API]

| Component | What It Is | Example |
|---|---|---|
| Trace | Full request journey | trace_id: abc-123 |
| Span | Single unit of work | “Database query”, “HTTP call” |
| Parent Span | The span that triggered this one | Order Service is parent of Payment Service |
| Tags/Attributes | Metadata on spans | http.status=200, db.query=“SELECT…” |
| Events/Logs | Timestamped annotations within span | “Cache miss”, “Retry attempt 2” |

For traces to work across services, trace context must be propagated:

TRACE CONTEXT PROPAGATION
═══════════════════════════════════════════════════════════════
(IDs are shortened for readability. In a real W3C traceparent
header the trace ID is 32 hex characters and the span ID is 16.)

Request from client:
    Headers: (none)

API Gateway generates trace:
    trace_id: abc-123
    span_id:  span-001

API Gateway → Order Service:
    Headers:
        traceparent: 00-abc123-span001-01

Order Service creates child span:
    trace_id:  abc-123 (same)
    span_id:   span-002
    parent_id: span-001

Order Service → Payment Service:
    Headers:
        traceparent: 00-abc123-span002-01

All spans share trace_id, linked by parent relationships.
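
The propagation steps above can be sketched with the W3C `traceparent` layout: `version-traceid-parentid-flags`, where the trace ID is 32 lowercase hex characters, the parent (span) ID is 16, and the low bit of the flags marks the trace as sampled. The helper names below (`parse_traceparent`, `start_span`, `format_traceparent`) are hypothetical, and the sketch handles version `00` only; in production you would normally let an instrumentation SDK do this.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 only: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the caller's context, or None if the header is missing or malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def start_span(parent: Optional[TraceContext]) -> TraceContext:
    """Reuse the parent's trace ID if there is one; always mint a new span ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True  # assumption: sample new traces
    return TraceContext(trace_id, secrets.token_hex(8), sampled)


def format_traceparent(ctx: TraceContext) -> str:
    """Header value to send on outgoing calls made from this span."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


# API Gateway: no incoming header, so it starts a new trace.
gateway = start_span(None)
outgoing = format_traceparent(gateway)

# Order Service: reads the header, then creates a child span in the same trace.
incoming = parse_traceparent(outgoing)
order = start_span(incoming)

print(order.trace_id == gateway.trace_id, incoming.span_id == gateway.span_id)
```

If a service in the middle drops the header, `parse_traceparent` returns `None` there and `start_span` silently begins a new trace, which is exactly how traces "break" at a service boundary.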

| Strengths | Weaknesses |
|---|---|
| Shows request flow across services | Storage cost (many spans per request) |
| Identifies slow components | Sampling often required |
| Reveals dependencies | Instrumentation overhead |
| Debug specific requests | Requires propagation (can break) |
| Shows parallelism and waiting | Complex to implement well |

When to use traces:

  • “Where did the time go?”: Identifying slow spans
  • Dependency mapping: What calls what?
  • Debugging specific requests: “What happened to request X?”
  • Finding bottlenecks: Which service is the problem?
  • Understanding system architecture: Visualizing flow
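
To make the “Where did the time go?” question concrete, here is a rough sketch that computes each span's self time (its duration minus its children's) for the example trace shown earlier. The `Span` class is a hypothetical stand-in for real trace data, and the calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def self_time(span: Span) -> float:
    """Time spent in the span itself, assuming its children run sequentially."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


# The example trace from the diagram in this section.
trace = Span("API Gateway", 50, [
    Span("Auth Service", 10),
    Span("Order Service", 35, [
        Span("Inventory Service", 15),
        Span("Payment Service", 18, [
            Span("Database Query", 12),
            Span("External Payment API", 5),
        ]),
    ]),
])

ranked = sorted(walk(trace), key=self_time, reverse=True)
for span in ranked:
    print(f"{span.name:<22} total={span.duration_ms:>4}ms self={self_time(span):>4}ms")
```

Ranking by self time rather than total duration avoids blaming a parent for time that was really spent in its children: Order Service takes 35ms in total but only 2ms of that is its own work.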

Did You Know?

Google processes over 10 billion requests per day. Even with aggressive sampling (0.01%), that’s 1 million traces daily. Dapper, their tracing system, was designed around sampling from the start. Most tracing systems sample to control costs—you don’t need every trace, just representative ones.
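
The sampling arithmetic, plus one common way to make a sampling decision, can be sketched as follows. Hashing the trace ID so that every service reaches the same keep-or-drop decision is a general technique, not a description of how Dapper works; the function names and the `is_error` parameter are assumptions for illustration.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic decision: map the trace ID to [0, 1) and compare to the rate.

    Because the result depends only on the trace ID, every service that sees
    the same trace makes the same decision without coordinating.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < rate


def keep_trace(trace_id: str, rate: float, is_error: bool = False) -> bool:
    """Always keep errors; sample everything else.

    Knowing that a request failed requires deciding after it finishes,
    which is what tail-based sampling does.
    """
    return is_error or head_sample(trace_id, rate)


requests_per_day = 10_000_000_000
rate = 0.0001  # 0.01%
daily_traces = round(requests_per_day * rate)
print(daily_traces)  # 1000000
```

The trade-off is visible in `keep_trace`: the cheap hash-based decision can be made when the request starts, while the "keep all errors" rule only works if spans are buffered until the outcome is known.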


Each pillar alone has blind spots:

PILLAR BLIND SPOTS
═══════════════════════════════════════════════════════════════
LOGS ALONE:
✅ "Error occurred in payment service"
❌ "Was this the slow request? What called payment service?"
METRICS ALONE:
✅ "Error rate increased at 3pm"
❌ "Which specific requests failed? What was the error?"
TRACES ALONE:
✅ "Request took 500ms, 400ms in database"
❌ "Is this normal? How many requests are affected?"
CONNECTED:
✅ Metric alert fires (error rate up)
✅ Drill into traces (which requests are errors)
✅ Look at logs (what's the error message)
✅ Full picture: "Database connection pool exhausted,
affecting 5% of checkout requests"

The key to connecting pillars: shared identifiers.

CORRELATION WITH TRACE ID
═══════════════════════════════════════════════════════════════
                   trace_id: abc-123
     ┌─────────────────────┼─────────────────────┐
     │                     │                     │
     ▼                     ▼                     ▼
 ┌────────┐          ┌──────────┐           ┌────────┐
 │  LOGS  │          │  TRACES  │           │ METRICS│
 │        │          │          │           │        │
 │trace_id│          │ trace_id │           │trace_id│
 │=abc-123│          │ =abc-123 │           │exemplar│
 │        │          │          │           │=abc-123│
 └────────┘          └──────────┘           └────────┘

Query: "Show me everything for trace abc-123"
  → Logs from all services for this request
  → Trace showing timing and flow
  → Metrics at the time of this request
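
Getting the trace ID into every log line does not have to mean passing it to every log call. A possible approach in Python, sketched here with the standard library only, is to hold the current trace ID in a `contextvars.ContextVar` and stamp it onto each record with a `logging.Filter`. The names `current_trace_id`, `TraceIdFilter`, and `handle_request` are hypothetical.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled ("-" when none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("level=%(levelname)s trace_id=%(trace_id)s msg=%(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout-example")
logger.setLevel(logging.INFO)
logger.handlers.clear()
logger.propagate = False
logger.addHandler(handler)


def handle_request(trace_id: str) -> None:
    """Set the trace ID once at the edge; every log call below inherits it."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("Request received")
        logger.error("Inventory check failed")
    finally:
        current_trace_id.reset(token)


handle_request("abc-123")
handle_request("def-456")

# "Show me everything for trace abc-123" becomes a simple filter.
lines_for_trace = [
    line for line in stream.getvalue().splitlines() if "trace_id=abc-123" in line
]
print("\n".join(lines_for_trace))
```

The application code inside `handle_request` never mentions the trace ID, which is the point: correlation that depends on developers remembering to add a field tends to have gaps.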

4.3 Exemplars: Connecting Metrics to Traces

Exemplars link aggregated metrics back to specific traces:

EXEMPLARS
═══════════════════════════════════════════════════════════════
Metric: http_request_duration_seconds (p99 = 450ms)
Without exemplar:
"p99 latency is high, but which requests?"
With exemplar:
"p99 latency is high, here's a trace showing one: abc-123"
Exemplar in OpenMetrics exposition format (attached to a histogram bucket):
http_request_duration_seconds_bucket{path="/checkout",le="0.5"} 1027 # {trace_id="abc-123"} 0.48
↓ Click trace_id
See full trace of a slow request that contributed to the p99.
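
Conceptually, an exemplar is just a trace ID remembered next to a bucket count. The sketch below shows that idea with a hypothetical `ExemplarHistogram` class; it keeps the most recent exemplar per bucket and is not how any particular client library implements it. Bucket bounds and trace IDs are made-up illustration values.

```python
import bisect
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exemplar:
    trace_id: str
    value: float


class ExemplarHistogram:
    """Histogram that remembers one sample trace ID per bucket."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # final slot is +Inf
        self.exemplars: list[Optional[Exemplar]] = [None] * (len(self.bounds) + 1)

    def observe(self, value: float, trace_id: Optional[str] = None) -> None:
        bucket = bisect.bisect_left(self.bounds, value)
        self.counts[bucket] += 1
        if trace_id is not None:
            # Keep only the most recent exemplar for this bucket.
            self.exemplars[bucket] = Exemplar(trace_id, value)

    def slowest_exemplar(self) -> Optional[Exemplar]:
        """Exemplar from the highest non-empty bucket: a trace worth opening first."""
        for exemplar in reversed(self.exemplars):
            if exemplar is not None:
                return exemplar
        return None


latency = ExemplarHistogram(bounds=[0.1, 0.25, 0.5, 1.0])  # assumed bounds, in seconds
latency.observe(0.09, trace_id="t-001")
latency.observe(0.12, trace_id="t-002")
latency.observe(0.48, trace_id="abc-123")
latency.observe(0.08, trace_id="t-003")

slow = latency.slowest_exemplar()
print(slow.trace_id, slow.value)
```

The storage cost is small and bounded: one trace ID per bucket, regardless of traffic. That is why exemplars can link metrics to traces without turning the trace ID into a high-cardinality label.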
INVESTIGATION WORKFLOW
═══════════════════════════════════════════════════════════════
1. ALERT (Metrics)
"Error rate > 1% for /api/checkout"
2. SCOPE (Metrics)
Break down by: region, user_tier, version
"Errors concentrated in US-West, v2.3.1"
3. SAMPLE (Traces via Exemplar)
Find example error traces
"Trace abc-123 shows timeout to payment service"
4. DETAIL (Logs)
Filter logs by trace_id
"Connection refused: payment-db-03.us-west"
5. ROOT CAUSE
"payment-db-03 was down for maintenance"
"Traffic should have failed over but didn't"

Try This (3 minutes)

Map your current investigation workflow:

| Step | Your Tool/Method | What’s Missing? |
|---|---|---|
| Alert | | |
| Scope | | |
| Sample | | |
| Detail | | |

Where do you get stuck? That’s your correlation gap.

War Story: The $6.7 Million Investigation Gap

2018. A Major Trading Platform. Monday Morning, Market Open.

At 9:31 AM Eastern, order execution latency jumped from 3ms to 400ms. For a high-frequency trading platform, this was catastrophic. Every millisecond of delay meant lost arbitrage opportunities. Customers were losing money—and switching to competitors in real-time.

The platform had excellent tooling: Prometheus metrics with 50,000 time series, Elasticsearch ingesting 500GB of logs daily, and Zipkin handling 10 million spans per hour. On paper, world-class observability.

  • 9:31 AM: Alert fires. P99 latency above threshold.
  • 9:34 AM: Engineer opens Grafana. Confirms latency spike. Can see which services are slow but not why.
  • 9:41 AM: Switches to Zipkin. Searches for “order-execution” service. Gets 847,000 traces. No way to filter to just the slow ones. No latency-based search.
  • 9:52 AM: Opens Elasticsearch. Searches for “order” and “slow.” 2.3 million results. Logs have timestamps but no trace IDs. Can’t correlate to traces.
  • 10:14 AM: Resorts to manually comparing timestamps across tools. Tedious, error-prone.
  • 10:47 AM: Finally identifies a pattern: slow requests all touched a specific Redis cluster.
  • 10:52 AM: Root cause found: a Redis master failover during a maintenance window wasn’t announced. Connections were timing out during reconnection.
  • 10:54 AM: Fix applied: connection pool retry settings raised.

Time to resolution: 83 minutes. Fix took 2 minutes. Finding the problem took 81.

Financial Impact:

  • Lost trading volume during outage: $847,000
  • Customer churn (3 major clients left): $5.8 million annual revenue
  • Regulatory fine for execution delay: $120,000
  • Total: $6.77 million

The Postmortem:

| What They Had | What They Couldn’t Do |
|---|---|
| 50,000 Prometheus metrics | Find slow traces from metric spikes |
| 500GB/day logs in Elasticsearch | Search logs by trace_id |
| 10M spans/hour in Zipkin | Filter traces by duration |
| Three world-class tools | Navigate between them |

What They Built After:

  • Added trace_id to every log line (2 hours of work)
  • Added exemplars to latency histograms (4 hours)
  • Built unified search UI linking all three (2 weeks)
  • Next similar incident: 9 minutes to resolution

The math: $6.77M cost ÷ 2 weeks engineering time = worth it.


The “three pillars” framing has critics:

PILLAR PROBLEMS
═══════════════════════════════════════════════════════════════
SILOED THINKING
"We have a logging system, a metrics system, a tracing system"
→ Three separate UIs, three separate queries, manual correlation
THE REAL NEED
"We have events that describe system behavior"
→ Events can be viewed as logs, aggregated into metrics,
connected into traces
→ Same underlying data, different lenses

Modern observability thinking centers on events:

EVENTS-FIRST MODEL
═══════════════════════════════════════════════════════════════
Every interesting thing is an EVENT:

{
  "timestamp": "2024-01-15T10:32:15.789Z",
  "trace_id": "abc-123",
  "span_id": "span-001",
  "service": "payment-api",
  "operation": "process_payment",
  "duration_ms": 333,
  "user_id": "12345",
  "amount": 99.99,
  "status": "success",
  "db_queries": 3,
  "cache_hit": false
}

From this event, you can:
  → View as LOG: Full details of this operation
  → Compute METRICS: avg(duration_ms), count by status
  → Build TRACE: Connect via trace_id and span_id

One data model, multiple views.
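
A short sketch of “one data model, multiple views”: the same list of events is filtered to get a log view, aggregated to get metrics, and grouped by parent to get a trace. The sample events are made up, and the `parent_id` field is an assumption added here so that the trace view has something to link on.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up events in the shape shown above, plus an assumed parent_id field.
events = [
    {"trace_id": "abc-123", "span_id": "span-001", "parent_id": None,
     "service": "api-gateway", "duration_ms": 350, "status": "success"},
    {"trace_id": "abc-123", "span_id": "span-002", "parent_id": "span-001",
     "service": "payment-api", "duration_ms": 333, "status": "success"},
    {"trace_id": "def-456", "span_id": "span-003", "parent_id": None,
     "service": "payment-api", "duration_ms": 900, "status": "error"},
]

# LOG view: the full detail of every event for one request.
log_view = [event for event in events if event["trace_id"] == "abc-123"]

# METRIC view: aggregates computed from the same events.
count_by_status = Counter(event["status"] for event in events)
avg_payment_ms = mean(
    event["duration_ms"] for event in events if event["service"] == "payment-api"
)

# TRACE view: parent -> children links within each trace.
children = defaultdict(list)
for event in events:
    children[(event["trace_id"], event["parent_id"])].append(event["span_id"])

print(len(log_view), dict(count_by_status), avg_payment_ms)
print(children[("abc-123", "span-001")])
```

Because the metrics are computed from raw events at query time, they can be re-aggregated along any field later, which pre-aggregated metrics cannot do. The cost is storing every event.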

OpenTelemetry (OTel) is becoming the standard for observability instrumentation:

OPENTELEMETRY
═══════════════════════════════════════════════════════════════
              Application Code
           ┌──────────────────────┐
           │ OpenTelemetry SDK    │
           │                      │
           │ - Traces             │
           │ - Metrics            │
           │ - Logs               │
           └──────────┬───────────┘
           ┌──────────▼───────────┐
           │ OTel Collector       │
           │                      │
           │ Process, batch,      │
           │ export to backends   │
           └──────────┬───────────┘
      ┌───────────────┼───────────────┐
      │               │               │
      ▼               ▼               ▼
 ┌─────────┐    ┌──────────┐     ┌─────────┐
 │ Jaeger  │    │Prometheus│     │  Loki   │
 │(Traces) │    │(Metrics) │     │ (Logs)  │
 └─────────┘    └──────────┘     └─────────┘

Benefits:
- Vendor-neutral instrumentation
- Consistent correlation (trace_id everywhere)
- One library, multiple signals
- Easy to switch backends

  • Netflix streams over 1 billion hours of video per week and instruments every request for tracing. Stored traces are sampled aggressively, but 100% of traces for errors are kept: you always want the full picture when something goes wrong.

  • The W3C Trace Context standard defines how trace IDs propagate in HTTP headers. Before this standard, every tracing system had its own headers (Zipkin used B3, AWS used X-Amzn-Trace-Id). Now traceparent and tracestate are standard.

  • Logs are the oldest pillar—Unix systems have had syslog since the 1980s. Metrics became common in the 2000s (RRDtool, Graphite). Traces emerged in the 2010s with Dapper and microservices. The history reflects increasing system complexity.

  • Uber built their own tracing system called Jaeger (German for “hunter”) in 2015, then open-sourced it. It was designed to handle their scale: billions of spans per day across thousands of microservices. Jaeger became a CNCF project and is now one of the most popular tracing backends.


| Mistake | Problem | Solution |
|---|---|---|
| No trace propagation | Traces break at service boundaries | Use OTel SDK, verify headers |
| High-cardinality metrics | Storage explosion, slow queries | Use logs for high-cardinality, metrics for bounded |
| Unstructured logs | Can’t query or correlate | JSON with consistent fields |
| No request ID in logs | Can’t find logs for a trace | Add trace_id to every log line |
| Separate tools, no correlation | Manual jumping between UIs | Use exemplars, linked IDs |
| Logging everything | Noise, cost, performance | Log meaningful events with context |

  1. When would you use logs instead of metrics?

    Answer

    Use logs when you need:

    • Rich context: Error messages, stack traces, request details
    • High cardinality: Data with many unique values (user_id, request_id)
    • Specific debugging: “What happened to this exact request?”
    • Audit trails: Who did what when
    • Infrequent events: Things that don’t happen often enough to aggregate

    Use metrics when you need:

    • Trends and aggregations: “How many requests per second?”
    • Alerting on thresholds: “Error rate > 1%”
    • Low-cardinality breakdowns: By endpoint, by region
    • Efficient storage: Numeric data compresses well
    • Fast dashboards: Pre-aggregated for quick display
  2. What is trace context propagation and why is it essential?

    Answer

    Trace context propagation is passing trace identifiers (trace_id, span_id, flags) from one service to the next as a request flows through a distributed system.

    It’s essential because:

    1. Without it, traces break: Each service would start a new trace
    2. Enables correlation: All services’ spans share the same trace_id
    3. Shows causation: Parent-child relationships reveal what called what
    4. Allows debugging: You can find all work done for a single request

    Propagation happens via HTTP headers (traceparent) or message metadata. If any service doesn’t propagate, the trace is broken at that point.

  3. What are exemplars and why do they matter?

    Answer

    Exemplars are links from aggregated metrics back to specific traces that contributed to those metrics.

    Why they matter:

    1. Bridge metrics → traces: “p99 latency is high” becomes “here’s a specific slow request”
    2. Enable drill-down: Click a metric spike to see example traces
    3. Reduce MTTR: Go from aggregate problem to specific example quickly
    4. Connect pillars: Metrics alone don’t show details; exemplars provide the path to details

    Without exemplars, you see “latency is high” but must manually hunt for examples. With exemplars, you click through immediately.

  4. Why do critics argue the “three pillars” framing is problematic?

    Answer

    Critics (like Charity Majors) argue:

    1. Creates silos: Teams build separate logging, metrics, and tracing systems that don’t talk to each other
    2. Misses the point: Observability is about understanding behavior, not about having three data types
    3. Events are fundamental: Logs, metrics, and traces are really different views of underlying events
    4. Correlation is key: The pillars only help if they’re connected; treating them as separate defeats the purpose

    The better framing: “We collect rich events about system behavior. We can view them as logs (individual events), metrics (aggregates over time), or traces (connected journeys). They’re the same data, different lenses.”

  5. A company generates 10,000 requests/second. They want to trace all requests but storage costs are prohibitive. If they sample at 1%, how many traces per day will they store? What requests should NOT be sampled?

    Answer

    Calculation:

    • 10,000 requests/second × 60 seconds × 60 minutes × 24 hours = 864,000,000 requests/day
    • At 1% sampling: 864,000,000 × 0.01 = 8,640,000 traces/day

    Requests that should NOT be sampled (keep 100%):

    1. Errors: Always trace failed requests—you need full details to debug
    2. Slow requests: Any request above p99 latency threshold
    3. Key business transactions: Payments, order completions, sign-ups
    4. Known problematic endpoints: Routes that historically have issues
    5. Debug-flagged requests: When a user reports an issue, they can add a header to force tracing

    Head-based vs tail-based sampling:

    • Head-based: Decide at request start (easy, but might miss interesting requests)
    • Tail-based: Decide at request end (can keep all errors/slow requests, harder to implement)
  6. Your team has Prometheus metrics showing p99 latency increased. You want to find an example slow request to investigate. Without exemplars, list the steps you’d need to take. How do exemplars simplify this?

    Answer

    Without exemplars (manual correlation):

    1. Note the timestamp of the p99 spike from Prometheus
    2. Switch to your tracing tool (Jaeger, Zipkin)
    3. Search for traces from that service around that timestamp
    4. Manually filter for traces with duration > p99 value
    5. If tracing tool doesn’t support duration filtering, export traces and filter externally
    6. Hope you find a representative example

    Time: 10-30 minutes. Error-prone. Might not find the right trace.

    With exemplars:

    1. View p99 latency metric in Grafana
    2. Click the exemplar marker on the graph
    3. Jump directly to a trace that contributed to that p99 value
    4. Investigate

    Time: 30 seconds. Guaranteed to find a relevant trace.

    Exemplars are the bridge between “something is slow” and “here’s a specific slow thing to investigate.”

  7. An engineer says “We use structured JSON logging, so we have observability.” What’s missing from this statement? What else would they need?

    Answer

    Structured logging is necessary but not sufficient for observability.

    What structured logging provides:

    • Queryable fields (can filter by user_id, error_code)
    • Consistent format (easy to parse)
    • Context preservation

    What’s still missing:

    1. Trace IDs: Can they connect logs to traces? Without trace_id in logs, they’re still isolated.
    2. Request correlation: Can they find all logs for a single request across services?
    3. Metrics: Logs alone don’t show trends, aggregates, or enable alerting
    4. Traces: Logs don’t show request flow, timing, or service dependencies
    5. High-cardinality queries: Can they ask “show me all logs where user_id=X AND error_code=Y”?
    6. Cross-signal navigation: Can they click from a log to its trace? From a metric spike to sample logs?

    Structured logs are the foundation. Full observability requires all three pillars connected via shared IDs.

  8. Your checkout service makes 5 downstream calls (inventory, pricing, payment, shipping, notification). A trace shows total latency of 850ms. The spans show: inventory (45ms), pricing (180ms), payment (320ms), shipping (90ms), notification (40ms). The sum is 675ms, but total is 850ms. What explains the 175ms gap?

    Answer

    The 175ms gap (850ms - 675ms = 175ms) represents time spent in the parent checkout service itself, not in downstream calls.

    This time could be:

    1. Service overhead: Request parsing, response assembly
    2. Sequential processing: Time between finishing one call and starting another
    3. Business logic: Validation, transformation, calculations
    4. Network latency: If the child spans are recorded on the server side, time on the wire (sending the request and waiting for the response to arrive) is not included in their duration
    5. Queue time: Time waiting in thread pool before processing

    To investigate:

    • Add more granular spans within the checkout service
    • Look for “gaps” in the waterfall view between child spans
    • Add spans for “validation,” “marshal_request,” “await_response”
    • Check if calls are sequential when they could be parallel (inventory + pricing could run in parallel)

    Key insight: Trace spans only show what you instrument. Missing time = missing instrumentation.


THREE PILLARS ESSENTIALS CHECKLIST
═══════════════════════════════════════════════════════════════════════════════
UNDERSTANDING EACH PILLAR
☑ Logs = detailed events (what happened, high cardinality OK)
☑ Metrics = numeric aggregates (how many, low cardinality required)
☑ Traces = request journey (where time went, shows dependencies)
WHEN TO USE WHAT
☑ Debugging specific requests → Logs + Traces
☑ Alerting on thresholds → Metrics
☑ Finding patterns across users → Logs with structured fields
☑ Identifying slow components → Traces
☑ Capacity planning → Metrics (trends over time)
THE CORRELATION IMPERATIVE
☑ trace_id in EVERY log line (non-negotiable)
☑ Exemplars connecting metrics to sample traces
☑ Same timestamp format across all pillars
☑ Unified UI or linked navigation between tools
CARDINALITY RULES
☑ Metrics: bounded dimensions only (endpoint, region, status_code)
☑ Logs: high cardinality welcome (user_id, request_id, session_id)
☑ Traces: sampling for volume control, 100% for errors
THE EVENTS PERSPECTIVE
☑ Pillars are views, not silos
☑ Same underlying data, different lenses
☑ OpenTelemetry unifies instrumentation
☑ Goal: answer questions you didn't anticipate

Task: Design an observability strategy using all three pillars.

Scenario: You’re building a checkout service that:

  • Receives orders from users
  • Validates inventory
  • Processes payments
  • Sends confirmation emails

Part 1: Design Structured Logs (10 minutes)

For each key event, define the log structure:

| Event | Log Level | Key Fields |
|---|---|---|
| Order received | INFO | timestamp, trace_id, user_id, order_id, items[], total |
| Inventory checked | INFO | timestamp, trace_id, order_id, items_available, items_unavailable |
| Payment attempt | INFO | timestamp, trace_id, order_id, amount, payment_method |
| Payment failed | ERROR | timestamp, trace_id, order_id, error_code, error_message |
| Email sent | INFO | timestamp, trace_id, order_id, email_type, recipient |

Add 2-3 more events relevant to your scenario:

| Event | Log Level | Key Fields |
|---|---|---|
| | | |
| | | |
| | | |

Part 2: Design Metrics (10 minutes)

Define metrics for monitoring and alerting:

| Metric Name | Type | Labels | Purpose |
|---|---|---|---|
| checkout_requests_total | Counter | status, payment_method | Track volume and success rate |
| checkout_duration_seconds | Histogram | step (validate, pay, email) | Track latency by phase |
| inventory_availability | Gauge | item_category | Monitor stock levels |
| payment_failures_total | Counter | error_code, provider | Track payment issues |

Add 2-3 more metrics:

| Metric Name | Type | Labels | Purpose |
|---|---|---|---|
| | | | |
| | | | |
| | | | |

Part 3: Design Traces (10 minutes)

Define the span structure for a checkout request:

Trace: checkout-{order_id}
├── Span: receive_order
│   └── Tags: user_id, item_count, total_amount
├── Span: validate_inventory
│   ├── Tags: items_checked, items_available
│   └── Child: db_query (inventory lookup)
├── Span: process_payment
│   ├── Tags: amount, method, provider
│   └── Child: external_api_call (payment gateway)
└── Span: send_confirmation
    ├── Tags: email_type, recipient
    └── Child: smtp_send

Add timing expectations:

  • Total checkout: ____ms expected
  • Which span is likely the bottleneck? ____

Part 4: Correlation Plan (5 minutes)

How will you connect the three pillars?

| Correlation Need | Solution |
|---|---|
| Find logs for a trace | Include trace_id in every log |
| Find traces for a metric spike | Use exemplars with trace_id |
| Find metrics for a time window | Query by timestamp range |

Success Criteria:

  • At least 5 meaningful log events defined with fields
  • At least 4 metrics with appropriate types and labels
  • Trace structure with at least 4 spans
  • Correlation strategy defined (trace_id in logs, exemplars)

  • “Distributed Systems Observability” - Cindy Sridharan. Excellent coverage of all three pillars.

  • OpenTelemetry Documentation - https://opentelemetry.io/docs/. The emerging standard for instrumentation.

  • “Three Pillars with Zero Answers” - Charity Majors (blog post). The critique of pillar-centric thinking.


Module 3.3: Instrumentation Principles - How to add observability to your code: what to instrument, where, and how.