
Observability Toolkit

Toolkit Track | 10 Modules | ~6.5 hours total

The Observability Toolkit covers the essential tools for monitoring, logging, and tracing cloud-native applications. These are the instruments you’ll use daily to understand what’s happening in your systems.

This toolkit builds on the theoretical foundations from Observability Theory and shows you how to implement those concepts with production-grade tools.

Before starting this toolkit:

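| #    | Module               | Complexity | Time      |
|------|----------------------|------------|-----------|
| 1.1  | Prometheus           | [COMPLEX]  | 45-50 min |
| 1.2  | OpenTelemetry        | [COMPLEX]  | 45-50 min |
| 1.3  | Grafana              | [COMPLEX]  | 40-45 min |
| 1.4  | Loki                 | [COMPLEX]  | 40-45 min |
| 1.5  | Distributed Tracing  | [COMPLEX]  | 45-50 min |
| 1.6  | Pixie                | [MEDIUM]   | 90 min    |
| 1.7  | Hubble               | [MEDIUM]   | 90 min    |
| 1.8  | Coroot               | [MEDIUM]   | 90 min    |
| 1.9  | Continuous Profiling | [MEDIUM]   | 40 min    |
| 1.10 | SLO Tooling          | [MEDIUM]   | 40 min    |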

After completing this toolkit, you will be able to:

  1. Deploy Prometheus — Scraping, PromQL, alerting, service discovery
  2. Instrument with OpenTelemetry — SDK, Collector, auto-instrumentation
  3. Build Grafana dashboards — Variables, Four Golden Signals, Explore
  4. Aggregate logs with Loki — LogQL, Promtail, multi-tenancy
  5. Trace distributed requests — Jaeger, Tempo, TraceQL, sampling
  6. Use Pixie for zero-instrumentation observability — eBPF, PxL queries, instant debugging
  7. Deploy Hubble for network observability — Cilium integration, network policy debugging
  8. Deploy Coroot for auto-instrumented observability — Zero-code tracing, SLOs, profiling

┌─────────────────────────────────────────────────────────────────┐
│                       OBSERVABILITY STACK                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  YOUR APPLICATION                                               │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Instrumented Code                                       │   │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐                │   │
│  │  │ Metrics  │  │   Logs   │  │  Traces  │                │   │
│  │  │ /metrics │  │  stdout  │  │  spans   │                │   │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘                │   │
│  └───────┼─────────────┼─────────────┼──────────────────────┘   │
│          │             │             │                          │
│          │             │             │ OTLP                     │
│          ▼             ▼             ▼                          │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                 OPENTELEMETRY COLLECTOR                  │   │
│  │        (receives, processes, exports all signals)        │   │
│  └───────┬─────────────┬─────────────┬──────────────────────┘   │
│          │             │             │                          │
│          ▼             ▼             ▼                          │
│   ┌────────────┐ ┌────────────┐ ┌────────────┐                  │
│   │ PROMETHEUS │ │    LOKI    │ │   TEMPO    │                  │
│   │ (metrics)  │ │   (logs)   │ │  (traces)  │                  │
│   └─────┬──────┘ └─────┬──────┘ └─────┬──────┘                  │
│         │              │              │                         │
│         └──────────────┼──────────────┘                         │
│                        ▼                                        │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                         GRAFANA                          │   │
│  │     (unified visualization, dashboards, exploration)     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
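
The fan-out in the middle of the diagram can be sketched as a toy router: every signal enters one collector function and is forwarded to the backend the diagram pairs it with. This is only an illustration of the data flow, not the OpenTelemetry Collector's actual API or configuration; `BACKENDS`, `route_signal`, and the sample payloads are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Backend for each signal type, as paired in the stack diagram.
# (Hypothetical mapping for illustration; a real Collector is configured
# with receivers, processors, and exporters rather than a Python dict.)
BACKENDS = {"metrics": "prometheus", "logs": "loki", "traces": "tempo"}


def route_signal(signal_type: str, payload: dict, sinks: dict) -> str:
    """Forward one telemetry item to the backend for its signal type."""
    if signal_type not in BACKENDS:
        raise ValueError(f"unknown signal type: {signal_type}")
    backend = BACKENDS[signal_type]
    sinks[backend].append(payload)
    return backend


# In-memory stand-ins for the three storage backends.
sinks = defaultdict(list)
route_signal("metrics", {"name": "http_requests_total", "value": 42}, sinks)
route_signal("logs", {"line": "payment failed", "trace_id": "abc123"}, sinks)
route_signal("traces", {"span": "POST /checkout", "trace_id": "abc123"}, sinks)
```

Note that the log line and the span carry the same `trace_id`. A shared identifier like this is what lets Grafana jump from one signal to another further down the stack.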

| Need               | Tool          | Why                                   |
|--------------------|---------------|---------------------------------------|
| Metrics storage    | Prometheus    | Industry standard, PromQL, native K8s |
| Metrics at scale   | Thanos/Mimir  | Multi-cluster, long-term, HA          |
| Instrumentation    | OpenTelemetry | Vendor-neutral, complete SDK          |
| Log aggregation    | Loki          | Cost-effective, Grafana native        |
| Log search (heavy) | Elasticsearch | Full-text, complex queries            |
| Tracing (search)   | Jaeger        | Tag-based search, standalone UI       |
| Tracing (cheap)    | Tempo         | Object storage, trace ID only         |
| Dashboards         | Grafana       | Unifies all signals, plugins          |

Module 1.1: Prometheus
     │  Foundation: metrics collection
     ▼
Module 1.2: OpenTelemetry
     │  Modern instrumentation standard
     ▼
Module 1.3: Grafana
     │  Visualization for all signals
     ▼
Module 1.4: Loki
     │  Log aggregation "Prometheus-style"
     ▼
Module 1.5: Distributed Tracing
     │  Jaeger, Tempo, correlation
     ▼
[Toolkit Complete] → GitOps & Deployments Toolkit
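
| Pillar  | Question Answered                   | Tool         |
|---------|-------------------------------------|--------------|
| Metrics | What is happening? (quantitative)   | Prometheus   |
| Logs    | Why did it happen? (context)        | Loki         |
| Traces  | Where did it happen? (request flow) | Jaeger/Tempo |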
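
To make the three questions concrete, here is a minimal sketch that builds one query per pillar: PromQL for metrics, LogQL for logs, and TraceQL for traces (TraceQL applies to Tempo, not Jaeger). The metric name `http_requests_total` and the labels `service` and `status` are assumptions for illustration; your own instrumentation may use different names. The helper functions are hypothetical and only format strings; they do not call any backend.

```python
def error_ratio_promql(service: str, window: str = "5m") -> str:
    """Metrics: what fraction of requests are failing? (PromQL)"""
    # Assumes a counter named http_requests_total with service/status labels.
    errors = (
        f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[{window}]))'
    )
    total = f'sum(rate(http_requests_total{{service="{service}"}}[{window}]))'
    return f"{errors} / {total}"


def trace_logs_logql(service: str, trace_id: str) -> str:
    """Logs: what was logged for one request? (LogQL line filter)"""
    # Assumes log streams carry a `service` label and log lines contain the trace ID.
    return f'{{service="{service}"}} |= "{trace_id}"'


def slow_traces_traceql(service: str, min_duration: str = "1s") -> str:
    """Traces: which requests through this service were slow? (TraceQL)"""
    return f'{{ resource.service.name = "{service}" && duration > {min_duration} }}'


promql = error_ratio_promql("checkout")
logql = trace_logs_logql("checkout", "abc123")
traceql = slow_traces_traceql("checkout")
```

Each string can be pasted into the matching Grafana Explore data source, provided the assumed metric and label names exist in your setup.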

INVESTIGATING AN INCIDENT:

1. METRICS → "Is something wrong?"
   • Dashboard shows latency spike
   • Error rate increasing
   • CPU/memory abnormal

2. TRACES → "Where is the problem?"
   • Find slow trace via exemplar
   • See which service is slow
   • Identify bottleneck span

3. LOGS → "What exactly happened?"
   • Filter by trace_id
   • See error messages
   • Get full context
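
The three steps can be walked through on in-memory data. This is a minimal sketch under stated assumptions: the telemetry records, field names (`self_ms`, `level`, `msg`), the 5% error threshold, and all helper functions are hypothetical, and the trace ID `abc123` stands in for whatever an exemplar on the latency graph would point to. In a real stack, each step would be a query against Prometheus, Tempo or Jaeger, and Loki.

```python
# Hypothetical telemetry for a single incident; every name and value is made up.
error_rates = {"checkout": 0.12, "cart": 0.01, "search": 0.0}

# self_ms = time spent in the span itself, excluding child spans.
spans = [
    {"trace_id": "abc123", "service": "checkout", "name": "POST /checkout", "self_ms": 40},
    {"trace_id": "abc123", "service": "payments", "name": "charge_card", "self_ms": 2400},
    {"trace_id": "abc123", "service": "inventory", "name": "reserve_items", "self_ms": 35},
    {"trace_id": "def456", "service": "search", "name": "GET /search", "self_ms": 20},
]

logs = [
    {"trace_id": "abc123", "service": "payments", "level": "error", "msg": "card gateway timeout"},
    {"trace_id": "abc123", "service": "checkout", "level": "warn", "msg": "retrying payment"},
    {"trace_id": "def456", "service": "search", "level": "info", "msg": "query ok"},
]


def unhealthy_services(rates, threshold=0.05):
    """Step 1 (metrics): services whose error rate exceeds the threshold."""
    return sorted(name for name, rate in rates.items() if rate > threshold)


def bottleneck_span(all_spans, trace_id):
    """Step 2 (traces): the span with the most self time in one trace."""
    in_trace = [s for s in all_spans if s["trace_id"] == trace_id]
    return max(in_trace, key=lambda s: s["self_ms"])


def logs_for_trace(all_logs, trace_id, level=None):
    """Step 3 (logs): log entries for one trace, optionally filtered by level."""
    return [
        entry
        for entry in all_logs
        if entry["trace_id"] == trace_id and (level is None or entry["level"] == level)
    ]


suspects = unhealthy_services(error_rates)
slowest = bottleneck_span(spans, "abc123")
errors = logs_for_trace(logs, "abc123", level="error")
```

The point of the sketch is the hand-off: metrics narrow the search to a service, the trace narrows it to a span, and the shared `trace_id` pulls up exactly the log lines for that request.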

“You can’t improve what you can’t measure. You can’t debug what you can’t see. Observability gives you both.”