Observability Toolkit
Toolkit Track | 10 Modules | ~9.5 hours total
Overview
The Observability Toolkit covers the essential tools for monitoring, logging, and tracing cloud-native applications. These are the instruments you’ll use daily to understand what’s happening in your systems.
This toolkit builds on the theoretical foundations from Observability Theory and shows you how to implement those concepts with production-grade tools.
Prerequisites
Before starting this toolkit:
- Observability Theory Track — Conceptual foundation
- SRE Discipline — Where observability fits (recommended)
- Basic Kubernetes knowledge
- Command-line familiarity
Modules
| # | Module | Complexity | Time |
|---|---|---|---|
| 1.1 | Prometheus | [COMPLEX] | 45-50 min |
| 1.2 | OpenTelemetry | [COMPLEX] | 45-50 min |
| 1.3 | Grafana | [COMPLEX] | 40-45 min |
| 1.4 | Loki | [COMPLEX] | 40-45 min |
| 1.5 | Distributed Tracing | [COMPLEX] | 45-50 min |
| 1.6 | Pixie | [MEDIUM] | 90 min |
| 1.7 | Hubble | [MEDIUM] | 90 min |
| 1.8 | Coroot | [MEDIUM] | 90 min |
| 1.9 | Continuous Profiling | [MEDIUM] | 40 min |
| 1.10 | SLO Tooling | [MEDIUM] | 40 min |
Learning Outcomes
After completing this toolkit, you will be able to:
- Deploy Prometheus — Scraping, PromQL, alerting, service discovery
- Instrument with OpenTelemetry — SDK, Collector, auto-instrumentation
- Build Grafana dashboards — Variables, Four Golden Signals, Explore
- Aggregate logs with Loki — LogQL, Promtail, multi-tenancy
- Trace distributed requests — Jaeger, Tempo, TraceQL, sampling
- Use Pixie for zero-instrumentation observability — eBPF, PxL queries, instant debugging
- Deploy Hubble for network observability — Cilium integration, network policy debugging
- Deploy Coroot for auto-instrumented observability — Zero-code tracing, SLOs, profiling
The Observability Stack
┌─────────────────────────────────────────────────────────────────┐
│                       OBSERVABILITY STACK                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  YOUR APPLICATION                                               │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    Instrumented Code                     │   │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐                │   │
│  │  │ Metrics  │  │   Logs   │  │  Traces  │                │   │
│  │  │ /metrics │  │  stdout  │  │  spans   │                │   │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘                │   │
│  └───────┼─────────────┼─────────────┼──────────────────────┘   │
│          │             │             │                          │
│          │             │             │ OTLP                     │
│          ▼             ▼             ▼                          │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                 OPENTELEMETRY COLLECTOR                  │   │
│  │        (receives, processes, exports all signals)        │   │
│  └───────┬─────────────┬─────────────┬──────────────────────┘   │
│          │             │             │                          │
│          ▼             ▼             ▼                          │
│   ┌────────────┐ ┌────────────┐ ┌────────────┐                  │
│   │ PROMETHEUS │ │    LOKI    │ │   TEMPO    │                  │
│   │ (metrics)  │ │   (logs)   │ │  (traces)  │                  │
│   └─────┬──────┘ └─────┬──────┘ └─────┬──────┘                  │
│         │              │              │                         │
│         └──────────────┼──────────────┘                         │
│                        ▼                                        │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                         GRAFANA                          │   │
│  │     (unified visualization, dashboards, exploration)     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Tool Selection Guide
| Need | Tool | Why |
|---|---|---|
| Metrics storage | Prometheus | Industry standard, PromQL, native K8s |
| Metrics at scale | Thanos/Mimir | Multi-cluster, long-term, HA |
| Instrumentation | OpenTelemetry | Vendor-neutral, complete SDK |
| Log aggregation | Loki | Cost-effective, Grafana native |
| Log search (heavy) | Elasticsearch | Full-text, complex queries |
| Tracing (search) | Jaeger | Tag-based search, standalone UI |
| Tracing (cheap) | Tempo | Object storage, trace ID only |
| Dashboards | Grafana | Unifies all signals, plugins |
Study Path
Module 1.1: Prometheus
    │
    │  Foundation: metrics collection
    ▼
Module 1.2: OpenTelemetry
    │
    │  Modern instrumentation standard
    ▼
Module 1.3: Grafana
    │
    │  Visualization for all signals
    ▼
Module 1.4: Loki
    │
    │  Log aggregation "Prometheus-style"
    ▼
Module 1.5: Distributed Tracing
    │
    │  Jaeger, Tempo, correlation
    ▼
[Toolkit Complete] → GitOps & Deployments Toolkit
Key Concepts
The Three Pillars
| Pillar | Question Answered | Tool |
|---|---|---|
| Metrics | What is happening? (quantitative) | Prometheus |
| Logs | Why did it happen? (context) | Loki |
| Traces | Where did it happen? (request flow) | Jaeger/Tempo |
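As a rough illustration of how the three signals relate, the sketch below uses only the Python standard library to emit a metric sample, a log line, and a span for a single request, all sharing one trace ID. Every name in it (`handle_request`, `logs_for_trace`, `REQUEST_COUNT`, `SPANS`, `LOG_STREAM`, and the JSON field names) is hypothetical and chosen for illustration. A real service would use the Prometheus and OpenTelemetry client libraries covered in the modules rather than hand-rolled structures like these.

```python
import io
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory stand-ins for the three signal backends.
REQUEST_COUNT = Counter()   # metrics: request totals keyed by (path, status)
SPANS = []                  # traces: one record per unit of work
LOG_STREAM = io.StringIO()  # logs: stands in for stdout


def handle_request(path, fail=False):
    """Handle one fake request and emit a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex  # shared ID that ties the signals together
    start = time.perf_counter()
    status = 500 if fail else 200
    duration_ms = (time.perf_counter() - start) * 1000

    # Metric: an aggregate count with no per-request detail.
    REQUEST_COUNT[(path, status)] += 1

    # Log: context for this one request, tagged with the trace ID.
    LOG_STREAM.write(json.dumps({
        "level": "error" if fail else "info",
        "msg": "request handled",
        "path": path,
        "status": status,
        "trace_id": trace_id,
    }) + "\n")

    # Span: records how long this piece of work took.
    SPANS.append({
        "trace_id": trace_id,
        "name": f"GET {path}",
        "duration_ms": duration_ms,
    })
    return trace_id


def logs_for_trace(trace_id):
    """Return the log records whose trace_id matches the given ID."""
    records = (json.loads(line) for line in LOG_STREAM.getvalue().splitlines())
    return [r for r in records if r["trace_id"] == trace_id]


handle_request("/checkout")
failed_trace = handle_request("/checkout", fail=True)
```

Calling `logs_for_trace(failed_trace)` returns the single error record for the failed request. The metric only says that a 500 occurred, the span identifies which request and operation it was, and the log carries the detail, which is why a shared `trace_id` matters when moving between signals.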
When to Use What
INVESTIGATING AN INCIDENT:
1. METRICS → "Is something wrong?"
   • Dashboard shows latency spike
   • Error rate increasing
   • CPU/memory abnormal
2. TRACES → "Where is the problem?"
   • Find slow trace via exemplar
   • See which service is slow
   • Identify bottleneck span
3. LOGS → "What exactly happened?"
   • Filter by trace_id
   • See error messages
   • Get full context
Related Tracks
- Before: Observability Theory — Why these tools exist
- Related: SRE Discipline — How to apply observability
- Related: IaC Tools Toolkit — Deploy observability stack with Terraform
- After: GitOps & Deployments Toolkit — Deploy observable apps
“You can’t improve what you can’t measure. You can’t debug what you can’t see. Observability gives you both.”