
Module 1.5: Distributed Tracing

Toolkit Track | Complexity: [COMPLEX] | Time: 45-50 min


The VP of Engineering stared at the Slack channel in disbelief. “Customer checkout is failing. Intermittently. For the past 3 hours.” The e-commerce platform processed $47 million daily during the holiday season, with each minute of checkout failures costing roughly $32,000 in abandoned carts. Metrics showed elevated error rates, but which of the 67 microservices was failing? Logs flooded in from everywhere—millions of lines—but without correlation, they were useless. The war room had 15 engineers from 8 teams, each defending their service. “It’s not us,” became the mantra. Three hours became six. Lost revenue: $5.8 million. The root cause, discovered only after implementing distributed tracing: a single misconfigured timeout in a third-party payment validation service, buried five hops deep in the request flow. One service. Five layers down. Invisible without tracing.


After completing this module, you will be able to:

  • Deploy distributed tracing backends (Tempo, Jaeger) and configure trace collection from applications
  • Implement trace context propagation across service boundaries for end-to-end request visibility
  • Configure sampling strategies to balance trace coverage with storage costs in production
  • Integrate traces with metrics and logs for correlated troubleshooting using exemplars and trace-to-log links

In a monolith, debugging is straightforward: stack traces tell you what happened. In microservices, a single user request might touch 20 services across 5 teams. When something fails, you need to see the entire journey.

Distributed tracing solves this. It connects the dots across services, showing exactly where latency hides and where errors originate. Without tracing, debugging distributed systems is guesswork.

  • Google’s Dapper paper (2010) started it all—it described how Google traces every request across their massive infrastructure, inspiring Jaeger, Zipkin, and eventually OpenTelemetry
  • A single trace can have thousands of spans—complex e-commerce transactions might generate 500+ spans across dozens of services
  • Traces are sampled, not exhaustive—storing every trace would be prohibitively expensive; most systems sample 1-10% of traffic
  • The W3C Trace Context standard ensures interoperability—headers like traceparent work across languages, frameworks, and vendors
TRACE  trace_id: abc123                          Time ─────────────────────▶

api-gateway           span_id: s1   parent: none   duration: 500ms
├─ user-service       span_id: s2   parent: s1     duration: 150ms
│  └─ postgres        span_id: s3   parent: s2     duration:  50ms
└─ order-service      span_id: s4   parent: s1     duration: 300ms
   ├─ redis           span_id: s5   parent: s4     duration:  20ms
   └─ payment-api     span_id: s6   parent: s4     duration: 200ms
Term                  Definition
Trace                 The complete journey of a request through the system
Span                  A single unit of work (e.g., HTTP call, DB query)
Trace ID              Unique identifier for the entire trace
Span ID               Unique identifier for a specific span
Parent Span ID        Links a child span to its parent
Baggage               Key-value pairs propagated across all spans
Context Propagation   Passing trace context between services
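
The terminology becomes concrete once you create spans yourself. Here is a minimal sketch using the OpenTelemetry Python SDK (opentelemetry-api and opentelemetry-sdk assumed installed); the span names mirror the trace above, though in a real system the child spans would be created in different services rather than nested in one process.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("api-gateway")

with tracer.start_as_current_span("api-gateway") as parent:      # s1: no parent
    with tracer.start_as_current_span("user-service"):           # s2: parent is s1
        with tracer.start_as_current_span("postgres"):           # s3: parent is s2
            pass
    # All three spans share one trace_id; each gets its own span_id.
    print(format(parent.get_span_context().trace_id, "032x"))
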
# Standard headers for trace propagation
traceparent: 00-abc123def456-789xyz-01
             │  │            │      │
             │  │            │      └── Flags (sampled)
             │  │            └── Parent span ID
             │  └── Trace ID
             └── Version
tracestate: vendor1=value1,vendor2=value2
            └── Vendor-specific data
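
OpenTelemetry's default propagator reads and writes exactly these headers (real trace IDs are 32 hex characters and parent span IDs 16; the values above are shortened). A minimal sketch of the propagation API:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

# Client side: add traceparent/tracestate to outgoing request headers
# (called while a span is active, otherwise there is nothing to inject).
headers = {}
inject(headers)   # e.g. {'traceparent': '00-<32-hex trace id>-<16-hex span id>-01'}

# Server side: continue the same trace from the incoming headers.
ctx = extract(headers)
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle_request", context=ctx):
    pass
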
TRACING BACKENDS

JAEGER                                   TEMPO
  • CNCF graduated                         • Grafana project
  • Battle-tested                          • Object storage
  • Full-text search                       • Cost-effective
  • Cassandra/ES                           • Log correlation
  • Self-contained UI                      • Grafana native

ZIPKIN                                   AWS X-RAY
  • Original OSS                           • AWS native
  • Simple setup                           • Service maps
  • Limited scale                          • AWS integration
  • MySQL/Cassandra                        • Vendor lock-in

RECOMMENDATION:
  • Grafana stack → Tempo (seamless integration)
  • Need search   → Jaeger (tag-based queries)
  • AWS-native    → X-Ray (no extra infrastructure)
JAEGER ARCHITECTURE

Applications (App1, App2, App3)
        │  OTLP / Jaeger protocol
        ▼
JAEGER COLLECTOR
  • Validates spans
  • Processes and indexes
  • Writes to storage
        │
        ▼
Storage: Elasticsearch (production) / Cassandra (scale) / Badger (dev)
        │
        ▼
JAEGER QUERY
  • Serves UI
  • REST/gRPC API
  • Search by tags, service, operation
# jaeger-allinone.yaml (development)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: tracing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 16686  # UI
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 14268  # Jaeger HTTP
            - containerPort: 6831   # Jaeger UDP (compact)
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          resources:
            limits:
              memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: tracing
spec:
  ports:
    - name: ui
      port: 16686
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
  selector:
    app: jaeger
# jaeger-production.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
  namespace: tracing
spec:
  replicas: 3
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
        - name: collector
          image: jaegertracing/jaeger-collector:1.50
          ports:
            - containerPort: 4317
            - containerPort: 14268
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
            - name: ES_INDEX_PREFIX
              value: jaeger
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-query
  namespace: tracing
spec:
  replicas: 2
  selector:
    matchLabels:
      app: jaeger-query
  template:
    metadata:
      labels:
        app: jaeger-query
    spec:
      containers:
        - name: query
          image: jaegertracing/jaeger-query:1.50
          ports:
            - containerPort: 16686
          env:
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
# Operator-based deployment
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production
  namespace: tracing
spec:
  strategy: production   # vs allInOne
  collector:
    replicas: 3
    resources:
      limits:
        cpu: 1
        memory: 1Gi
  query:
    replicas: 2
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      resources:
        limits:
          memory: 4Gi
      redundancyPolicy: SingleRedundancy
  ingress:
    enabled: true
TEMPO'S KEY INSIGHT: Traces are append-only, and lookup by trace ID is enough.

Traditional (Jaeger) — store and index everything:
  • Elasticsearch cluster
  • Index spans by tags
  • Search by service, operation, tag
  • Cost: $$$

  Finding traces:
  1. Search the Jaeger UI
  2. Find the trace by tags

  Best for: tag-based search, unknown trace IDs, debugging without metrics

Tempo — store only, index nothing:
  • Object storage (S3/GCS)
  • Traces retrievable by ID only
  • Search: trace ID
  • Cost: $

  Finding traces:
  1. Metrics show the problem
  2. Exemplars link to a trace ID
  3. Logs contain the trace ID
  4. Look up the trace directly

  Best for: Grafana stack users, cost-conscious teams, metrics-to-traces workflows
TEMPO ARCHITECTURE

Spans via OTLP
        │
        ▼
DISTRIBUTOR
  • Receives spans
  • Hashes trace ID
  • Routes to ingester
        │
        ▼
INGESTER
  • Batches spans by trace
  • Holds in memory (WAL)
  • Flushes to object storage
        │
        ▼
OBJECT STORAGE
  • Blocks (compressed trace data)
  • Bloom filters (trace ID lookup)
        ▲
        │
QUERIER
  • Receives trace ID queries
  • Checks bloom filters
  • Fetches matching blocks
        ▲
        │
QUERY FRONTEND
  • Caching
  • Query splitting
  • TraceQL processing
# tempo-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
  namespace: tracing
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200
    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
        jaeger:
          protocols:
            thrift_http:
              endpoint: 0.0.0.0:14268
    ingester:
      trace_idle_period: 10s
      max_block_bytes: 1_000_000
      max_block_duration: 5m
    compactor:
      compaction:
        block_retention: 48h
    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com
          region: us-east-1
        wal:
          path: /var/tempo/wal
        local:
          path: /var/tempo/blocks
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
    query_frontend:
      search:
        duration_slo: 5s
        throughput_bytes_slo: 1073741824
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
  namespace: tracing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:2.3.0
          args:
            - -config.file=/etc/tempo/tempo.yaml
          ports:
            - containerPort: 3200  # HTTP
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
            - name: storage
              mountPath: /var/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
        - name: storage
          emptyDir: {}
# Find traces by service name
{ resource.service.name = "api-gateway" }
# Find slow spans
{ span.http.status_code >= 500 && duration > 1s }
# Find traces with errors
{ status = error }
# Find specific operation
{ name = "HTTP GET /users" }
# Combine conditions
{
  resource.service.name = "payment-service" &&
  span.http.method = "POST" &&
  duration > 500ms
}
# Aggregate: Find slowest operations
{ resource.service.name = "api" } | avg(duration) by (name)
# Pipeline: Filter then aggregate
{ duration > 100ms } | count() by (resource.service.name)
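
TraceQL queries can also be run programmatically against Tempo's search API. A hedged sketch, assuming a Tempo 2.x instance reachable at http://tempo:3200 (the exact response field names may vary by version):

import requests

resp = requests.get(
    "http://tempo:3200/api/search",
    params={"q": '{ resource.service.name = "api" && duration > 500ms }', "limit": 20},
    timeout=10,
)
for t in resp.json().get("traces", []):
    # Each hit carries the trace ID you can open in Grafana or fetch by ID.
    print(t["traceID"], t.get("durationMs"))
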
THE SAMPLING MATH:
1000 requests/second × 50 spans/request × 1KB/span = 50 MB/second
= 4.3 TB/day
= 130 TB/month
At 10% sampling:
100 requests/second × 50 spans/request × 1KB/span = 5 MB/second
= 432 GB/day
= 13 TB/month
RULE OF THUMB: Sample enough to catch errors, not so much you go broke
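
The same arithmetic as a quick script (1 KB treated as 1,000 bytes, matching the figures above):

# Sanity-check the sampling math with the example rates from this section.
req_per_s, spans_per_req, bytes_per_span, sample_rate = 1000, 50, 1000, 0.10

raw_bps = req_per_s * spans_per_req * bytes_per_span      # bytes/second, unsampled
for label, bps in [("unsampled", raw_bps), ("10% sampled", raw_bps * sample_rate)]:
    per_day_tb = bps * 86_400 / 1e12
    print(f"{label}: {bps/1e6:.0f} MB/s, {per_day_tb:.2f} TB/day, {per_day_tb*30:.0f} TB/month")
# unsampled: 50 MB/s, 4.32 TB/day, 130 TB/month
# 10% sampled: 5 MB/s, 0.43 TB/day, 13 TB/month
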
SAMPLING STRATEGIES

HEAD-BASED SAMPLING — decision at trace start
  App ──▶ Random: 10% sampled ──▶ Propagate decision (all or nothing)
  Pros: simple, consistent
  Cons: might miss errors (if not sampled)

TAIL-BASED SAMPLING — decision after the trace completes
  App ──▶ Collector (buffers the trace) ──▶ Keep if:
                                              • an error occurred
                                              • latency > 1s
                                              • important user
  Pros: never miss errors, smart decisions
  Cons: complex, memory-intensive
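
Head-based sampling is usually configured in the application SDK, while the tail-based policies below live in the OpenTelemetry Collector. A minimal head-sampling sketch with the OpenTelemetry Python SDK:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Respect the caller's sampling decision; otherwise sample ~10% of new traces.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
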
# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s     # Wait for all spans of a trace
    num_traces: 100000     # Traces held in memory
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      # Always keep specific endpoints
      - name: important-endpoints
        type: string_attribute
        string_attribute:
          key: http.url
          values: ["/checkout", "/payment"]
      # Sample everything else at 10%
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [jaeger]
SIGNAL CORRELATION — workflow: something is broken, find out why

1. METRICS show the problem
   http_request_duration_seconds{...} = 5.2s   ◀── SLOW!
   The histogram carries an "exemplar" → trace_id: abc123
        │
        ▼
2. TRACE shows the journey
   trace_id: abc123
   ├─ api-gateway (50ms)
   │  └─ user-service (100ms)
   │     └─ postgres (3000ms)   ◀── HERE'S THE PROBLEM!
   └─ order-service (200ms)
        │
        ▼
3. LOGS show the details
   {trace_id="abc123"} | json

   10:30:01 user-service: Query started
   10:30:04 postgres: Lock wait timeout exceeded
   10:30:04 user-service: Query failed, retrying
# Enable exemplars in your app
import time

from prometheus_client import Histogram
from opentelemetry import trace

histogram = Histogram('http_request_duration_seconds', 'Request duration')

def handle_request():
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, '032x')
    start = time.time()
    process_request()  # handle the request
    duration = time.time() - start
    # Attach the trace ID as an exemplar (exposed via the OpenMetrics format)
    histogram.observe(duration, exemplar={'trace_id': trace_id})
# Prometheus scrapes exemplars automatically from targets exposing the
# OpenMetrics format; exemplar storage itself is a feature flag
# (--enable-feature=exemplar-storage, Prometheus 2.26+)
scrape_configs:
  - job_name: 'my-app'
    scrape_interval: 15s
# Grafana data sources configured for correlation
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name']
        mappedTags: [{ key: 'service.name', value: 'app' }]
        mapTagNamesEnabled: true
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '${__value.raw}'
          datasourceUid: tempo
Mistake                       Why It's Bad                              Better Approach
Storing 100% of traces        Costs explode, storage overload           Sample at 1-10%, keep all errors
Missing context propagation   Traces disconnect at service boundaries   Use OTel auto-instrumentation, verify headers
Too many spans                Cardinality issues, hard to read          Span meaningful operations, not every function
Not correlating signals       Miss the full picture                     Add trace_id to logs, use exemplars
Ignoring sampling bias        Missing rare errors                       Use tail-based sampling for errors
No service name in spans      Can't filter by service                   Always set the service.name resource attribute
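
The last row above is a one-line fix. A minimal sketch with the OpenTelemetry Python SDK (service name chosen for illustration):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Set service.name once as a resource attribute so every span emitted by
# this process can be filtered by service.
resource = Resource.create({"service.name": "payment-service"})
trace.set_tracer_provider(TracerProvider(resource=resource))
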

War Story: The $4.2 Million Black Friday Ghost

THE $4.2 MILLION BLACK FRIDAY GHOST
─────────────────────────────────────────────────────────────────
Company: Major online retailer
Architecture: 127 microservices across 3 cloud regions
Black Friday target: $89 million in 24 hours
The nightmare: Checkout failures, no visibility, finger-pointing

10:00 AM - Black Friday

The traffic surge began as expected. Everything looked green on dashboards. Then customers started complaining: “Payment accepted, but no order confirmation.” Not errors—just silence. The orders vanished into the void.

11:30 AM - The War Room Assembles

Fifteen engineers from payment, inventory, fulfillment, and notification teams. Each team’s metrics looked healthy. Each team’s logs showed successful operations. “It’s not us,” echoed around the room.

METRICS ANALYSIS (All services "green"):
─────────────────────────────────────────────────────────────────
payment-service: error_rate: 0.02% ✓ (within SLA)
inventory-service: error_rate: 0.01% ✓ (within SLA)
fulfillment-service: error_rate: 0.03% ✓ (within SLA)
notification-service: error_rate: 0.00% ✓ (within SLA)
But checkout completion rate: DOWN 23%

2:00 PM - Desperation Sets In

Revenue loss was mounting. Engineers manually correlated logs by timestamp—needle in a haystack across 127 services. Someone suggested “Let’s just restart everything.” They did. Problem persisted.

4:30 PM - The Breakthrough

A junior engineer had been implementing distributed tracing as a “20% project.” It wasn’t fully deployed, but it covered the payment flow. She raised the sampling rate to 100% and captured a failing request.

TRACE: f8d2e4a1-7b3c-4e5f-9a1b-2c3d4e5f6a7b
─────────────────────────────────────────────────────────────────
api-gateway (12ms)
└─ checkout-orchestrator (2,847ms) ← SUSPICIOUSLY LONG
├─ payment-service (156ms) ✓
├─ inventory-reserve (89ms) ✓
└─ fulfillment-queue (2,589ms) ← THE BOTTLENECK
└─ kafka-producer (TIMEOUT) ✗
Root cause: Kafka broker rebalancing during traffic spike
- Producer timeout: 3000ms
- Actual wait: 2589ms (retrying internally, not reporting errors)
- Result: fire-and-forget message lost, no error logged

The Root Cause

The fulfillment service used Kafka with acks=1 and fire-and-forget publishing. During the traffic spike, Kafka brokers started rebalancing. Messages were accepted by the producer but never delivered. No errors were logged because the producer’s timeouts were configured to drop messages silently.

# The problematic Kafka config (production)
producer:
  acks: 1                    # ← Only leader ack, not replicas
  retries: 0                 # ← No retry on failure
  linger.ms: 0               # ← Send immediately, no batching
  request.timeout.ms: 3000   # ← 3s timeout, then silent drop
  # No error callback configured

The Fix (Applied at 5:15 PM)

# Fixed Kafka config
producer:
  acks: all                  # ← Wait for all replicas
  retries: 3                 # ← Retry on transient failures
  enable.idempotence: true   # ← Prevent duplicates
  delivery.timeout.ms: 120000
  # Error callback: log and alert on failed delivery
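
The config change alone still fails quietly if nobody watches delivery results. A hedged sketch of the error callback mentioned above, assuming the kafka-python client and a hypothetical fulfillment topic:

import logging
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092", acks="all", retries=3)

def on_send_error(exc):
    # Surface the failure instead of dropping the message silently.
    logging.error("Kafka delivery failed: %s", exc)

producer.send("fulfillment", b"order-123").add_errback(on_send_error)
producer.flush()
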

The Financial Impact

BLACK FRIDAY DAMAGE ASSESSMENT
─────────────────────────────────────────────────────────────────
Duration of incident: 7.25 hours (10:00 AM → 5:15 PM)
Peak revenue rate: $62,000/minute
Lost orders: ~6,800 checkouts
Average order value: $617
Direct lost revenue: $4,195,600
Additional costs:
- Emergency escalation: $47,000 (contractor callouts)
- Customer service surge: $23,000 (extended hours)
- Reputation damage: Immeasurable (social media storm)
Total quantifiable impact: ~$4.3 million in one day

Why Tracing Saved Them

WITHOUT TRACING:                      WITH TRACING:
─────────────────────────────────────────────────────────────────
• 15 engineers, 7 hours               • 1 engineer, 45 minutes
• Each service looked healthy         • Saw the exact failure point
• Finger-pointing war                 • Objective evidence
• "Restart everything"                • Targeted fix
• Would have kept failing             • Identified the silent failure mode

The Monday After

The team mandated distributed tracing across all 127 services. Within 6 weeks, full coverage. The junior engineer’s “20% project” became the company’s standard. Her promotion followed.

Key Lessons

  1. Silent failures are the deadliest: Services that fail without logging are invisible to everything except traces
  2. Green dashboards can lie: Individual service metrics don’t show cross-service failures
  3. Timeouts must be instrumented: Any timeout should create a span with explicit failure status
  4. Fire-and-forget is gambling: Async operations need delivery confirmation and tracing
  5. Traces show what logs and metrics cannot: The request’s journey through time and services

Why is tail-based sampling more expensive than head-based sampling?

Answer:

Tail-based sampling requires:

  1. Memory: Must buffer all spans until the trace is complete (could be seconds or minutes)
  2. Compute: Analyzes every span to decide if the trace is interesting (errors, high latency)
  3. Network: Must receive ALL spans before deciding, then discard most

Head-based sampling decides at trace start:

  • Uses minimal memory (just a random number)
  • No analysis needed
  • Discarded spans never sent

A trace with 100 spans: tail-based must process all 100 before discarding 90%. Head-based discards 90% immediately, never sending the spans at all.

How would you find all traces where a payment-service call took longer than 500ms?

Answer:

In Jaeger:

  1. Service: payment-service
  2. Tags: http.status_code=* (any)
  3. Min Duration: 500ms
  4. Search

In Tempo with TraceQL:

{
  resource.service.name = "payment-service" &&
  duration > 500ms
}

Via metrics → exemplars:

  1. Query Prometheus: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="payment"}[5m]))
  2. Click data point with high latency
  3. View exemplar → links to trace ID
  4. Open in Tempo/Jaeger

A trace shows gaps—services A→B→C are traced, but C→D appears as a separate trace. What’s likely wrong?

Answer:

Context propagation is broken between C and D. Common causes:

  1. Missing instrumentation: Service C might be making HTTP calls without the OTel HTTP client instrumentation. The outgoing call doesn’t include traceparent header.

  2. Header stripping: A proxy, API gateway, or service mesh between C and D might be removing the trace headers.

  3. Async communication: If C→D uses a message queue, you need specific instrumentation to propagate context through messages. Default HTTP instrumentation won’t help.

  4. Different tracing systems: C might use Jaeger client, D might use Zipkin—they need compatible propagation format (W3C Trace Context works across both).

Debug by checking: Do the HTTP requests from C include traceparent header? Does D receive and extract it?
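
One quick way to check is to log the header at service D's edge. A hedged debugging sketch, using Flask purely for illustration:

from flask import Flask, request

app = Flask(__name__)

@app.before_request
def log_trace_headers():
    # If this logs MISSING, the header was never sent or was stripped in transit.
    tp = request.headers.get("traceparent")
    app.logger.info("traceparent=%s", tp if tp else "MISSING")
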

You see a trace where user-service took 2 seconds, but all its child spans (db queries, cache calls) total only 200ms. Where did 1.8 seconds go?

Answer:

The 1.8 seconds is “dark time”—work happening outside of instrumented operations. Common causes:

  1. CPU-bound code: JSON parsing, business logic, serialization—typically not instrumented as spans

  2. Uninstrumented I/O: File system operations, non-HTTP network calls, DNS lookups

  3. Garbage collection: Long GC pauses appear as gaps in traces

  4. Thread pool waiting: Time waiting for a thread to become available

  5. Missing child spans: Some operations might not be instrumented (internal service calls, cache client)

To find it:

  • Add spans around suspicious code blocks (see the sketch after this list)
  • Profile the application (CPU, memory)
  • Check for GC pauses in JVM/runtime logs
  • Verify all I/O operations are instrumented
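
For the first item, wrapping the suspicious block in a manual span makes the dark time show up as an explicit child span. A minimal sketch (parse and render are hypothetical stand-ins for uninstrumented work):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_request(payload):
    with tracer.start_as_current_span("parse_payload"):
        data = parse(payload)        # previously uninstrumented CPU-bound work
    with tracer.start_as_current_span("build_response"):
        return render(data)
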

Your system processes 5,000 requests/second with an average of 40 spans per trace at 800 bytes per span. You’re using 5% head sampling. Calculate daily storage requirements and explain why tail sampling might still be needed.

Answer:

Storage calculation:

5,000 req/s × 40 spans × 800 bytes = 160 MB/second raw
With 5% sampling: 8 MB/second = 691 GB/day
Monthly storage: ~21 TB
At $0.023/GB (S3): ~$480/month

Why tail sampling is still needed:

Head sampling randomly keeps 5% of traces. But consider:

  • Error rate: 0.1% of requests fail
  • At 5% sampling, you capture: 5,000 × 0.001 × 0.05 = 0.25 errors/second
  • Some error traces will be discarded!

Tail sampling ensures:

  1. All errors captured: Sample 100% of traces with status=error
  2. All slow requests captured: Keep traces where duration > SLA
  3. Important users: Keep traces for premium customers
  4. Then sample the rest: 5% of normal successful traces

Configuration pattern:

policies:
  # Always keep errors
  - name: keep-errors
    type: status_code
    status_code:
      status_codes: [ERROR]
  # Always keep slow traces
  - name: keep-slow
    type: latency
    latency:
      threshold_ms: 2000
  # Sample the rest
  - name: sample-rest
    type: probabilistic
    probabilistic:
      sampling_percentage: 5

Your traces are breaking at a Kafka message boundary. Service A publishes, Service B consumes, but they appear as separate traces. How do you fix this?

Answer:

Kafka (and other message queues) don’t automatically propagate trace context like HTTP does. You must:

1. Inject context when producing:

from opentelemetry import trace
from opentelemetry.propagate import inject

def produce_message(topic, message):
    headers = {}
    # Inject current trace context into headers
    inject(headers)
    producer.send(topic,
                  value=message,
                  headers=[(k, v.encode()) for k, v in headers.items()])

2. Extract context when consuming:

from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)

def consume_message(message):
    # Convert Kafka headers to a dict
    headers = {k: v.decode() for k, v in message.headers}
    # Extract and use as the parent context
    ctx = extract(headers)
    with tracer.start_as_current_span("process_message", context=ctx):
        # Now this span is linked to the producer's trace
        process(message)

3. Use OTel instrumentation libraries:

from opentelemetry.instrumentation.kafka import KafkaInstrumentor
KafkaInstrumentor().instrument() # Auto-instruments produce/consume

The key insight: HTTP auto-propagates via headers. Message queues need explicit instrumentation for each message system (Kafka, RabbitMQ, SQS, etc.).

You’re comparing Jaeger and Tempo for your organization. Given this scenario, which would you choose and why?

Scenario: 80 microservices, Grafana already deployed, storing 500GB traces/day, need to search by custom business attributes (customer_id, order_id), budget-conscious.

Answer:

Recommendation: Jaeger for this scenario, despite the budget focus.

Analysis:

Factor               Jaeger                       Tempo
Grafana integration  Good (via data source)       Native (built-in)
Custom tag search    ✓ Full support               ✗ Requires exemplars/logs
Storage cost         Higher (requires indexing)   Lower (object storage)
500 GB/day           ~$3,000-5,000/month (ES)     ~$350/month (S3)

Why Jaeger wins here:

The requirement “search by customer_id, order_id” is critical. Tempo is optimized for trace-ID lookup rather than tag search. To find traces by customer_id:

With Tempo:

  1. Customer calls support: “Order 12345 failed”
  2. You search Loki for order_id=12345
  3. Find log line with trace_id
  4. Look up trace_id in Tempo

With Jaeger:

  1. Customer calls support: “Order 12345 failed”
  2. Search Jaeger: order_id=12345
  3. Get traces directly

If budget is paramount:

Consider Tempo + richer logging:

  • Log every transaction with trace_id
  • Accept the two-hop lookup workflow
  • Save $2,500-4,000/month

Hybrid approach:

  • Tempo for storage (cheap)
  • Sample 100% of errors/slow requests into Jaeger (searchable)
  • This gives searchability for interesting traces, cheap storage for the rest

Given this TraceQL query, explain what it finds and write an equivalent Jaeger search:

{ resource.service.name = "checkout" && span.http.status_code >= 500 } | avg(duration) by (span.http.route) | > 1s
Answer:

What this query finds:

  1. resource.service.name = "checkout" - Traces from the checkout service
  2. span.http.status_code >= 500 - Only spans with server errors (5xx)
  3. avg(duration) by (span.http.route) - Calculate average duration grouped by HTTP endpoint
  4. > 1s - Filter to routes where average error duration exceeds 1 second

In plain English: “Find which API endpoints in checkout service have slow 5xx errors (averaging over 1 second), so we can prioritize fixing the slowest failure modes.”

Equivalent Jaeger search:

Jaeger doesn’t support aggregations in queries. You would:

  1. Search in Jaeger UI:

    • Service: checkout
    • Tags: http.status_code >= 500
    • Min Duration: 1s
  2. Export and analyze externally:

    # Fetch traces via API
    curl "http://jaeger:16686/api/traces?service=checkout&tags=http.status_code:500" \
    | jq '.data[].spans[] | select(.duration > 1000000) | {route: .tags["http.route"], duration: .duration}'
  3. Use Jaeger metrics (if enabled):

    histogram_quantile(0.5,
    rate(jaeger_trace_duration_bucket{service="checkout", status_code=~"5.."}[5m])
    ) > 1

Key insight: TraceQL is more powerful for ad-hoc analysis. Jaeger excels at finding individual traces but requires external tools for aggregation.

Scenario: End-to-End Tracing Investigation


Set up a traced application and practice navigating from metrics to traces to logs.

# Create kind cluster
kind create cluster --name tracing-lab

# Install the full observability stack
kubectl create namespace tracing

# Deploy Tempo
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
  namespace: tracing
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200
    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
    ingester:
      trace_idle_period: 10s
      max_block_duration: 5m
    storage:
      trace:
        backend: local
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal
    query_frontend:
      search:
        duration_slo: 5s
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
  namespace: tracing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:2.3.0
          args: ["-config.file=/etc/tempo/tempo.yaml"]
          ports:
            - containerPort: 3200
            - containerPort: 4317
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
            - name: storage
              mountPath: /var/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
        - name: storage
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: tempo
  namespace: tracing
spec:
  ports:
    - name: http
      port: 3200
    - name: otlp-grpc
      port: 4317
  selector:
    app: tempo
EOF

# Wait for Tempo
kubectl -n tracing wait --for=condition=ready pod -l app=tempo --timeout=120s
# traced-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traced-demo
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: traced-demo
  template:
    metadata:
      labels:
        app: traced-demo
    spec:
      containers:
        - name: app
          image: jaegertracing/example-hotrod:1.50
          ports:
            - containerPort: 8080
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://tempo.tracing.svc.cluster.local:4317"
            - name: OTEL_SERVICE_NAME
              value: "hotrod"
kubectl apply -f traced-app.yaml
# Port forward to access the demo app (the manifest defines a Deployment, not a Service)
kubectl port-forward deploy/traced-demo 8080:8080 &
# Port forward Tempo
kubectl -n tracing port-forward svc/tempo 3200:3200 &
  1. Generate traces

    • Open http://localhost:8080
    • Click different buttons to request rides
    • Each click generates a multi-service trace
  2. Query Tempo

    # List trace IDs
    curl "http://localhost:3200/api/search?limit=10"
    # Get a specific trace
    curl "http://localhost:3200/api/traces/<trace-id>"
  3. Use TraceQL

    # Find slow driver lookups
    { name = "SQL SELECT" && duration > 100ms }
    # Find errors
    { status = error }
  4. Practice correlation

    • Note a trace ID from a slow request
    • Search logs for that trace ID (e.g. kubectl logs deploy/traced-demo | grep <trace-id>, if the app logs include trace IDs)
    • Understand the full request journey
Success criteria:

  • Can generate traces from the demo app
  • Can query traces by service name
  • Can find slow operations within a trace
  • Understand parent-child span relationships
  • Can explain where latency accumulates
kind delete cluster --name tracing-lab

Before moving on, ensure you can:

  • Explain trace anatomy: trace ID, span ID, parent span ID, and how they connect
  • Describe W3C Trace Context headers (traceparent, tracestate) and their format
  • Compare Jaeger vs Tempo trade-offs: searchability vs cost, indexed vs ID-only
  • Calculate trace storage costs: requests/sec × spans × size × sampling rate
  • Configure head vs tail sampling and explain when each is appropriate
  • Propagate trace context through HTTP (automatic) and message queues (manual)
  • Use TraceQL to find slow spans, errors, and aggregate by attributes
  • Correlate the three signals: metrics → exemplars → traces → logs
  • Identify “dark time” in traces where uninstrumented code hides latency
  • Deploy Jaeger or Tempo in Kubernetes and instrument applications with OTel

Congratulations! You’ve completed the Observability Toolkit. You now understand:

  • Prometheus for metrics
  • OpenTelemetry for instrumentation
  • Grafana for visualization
  • Loki for logs
  • Jaeger/Tempo for traces



“A trace is a story. Each span is a chapter. The trace ID is how you find the book. Learn to read the story, and you’ll solve mysteries others can’t even see.”