Module 1.3: Grafana
Toolkit Track | Complexity: [MEDIUM] | Time: 40-45 min
Prerequisites
Before starting this module:
- Module 1.1: Prometheus
- Module 1.2: OpenTelemetry
- Basic PromQL knowledge
- Understanding of metrics and time series
At 11:47 PM on Black Friday, the director of engineering at a major retail company pulled up the main Grafana dashboard on the war room TV. Three hundred engineers stared at the screen. Revenue was down 40% from projections and nobody knew why.
The dashboard showed 200 panels. Graphs for every conceivable metric—CPU, memory, disk, network, HTTP status codes, queue depths, cache hit rates. Everything looked… fine. Green across the board. Yet customers were abandoning carts at record rates.
“Which of these panels matters?” asked the CEO, who had flown in from vacation.
Silence. The dashboards had been built over three years by different teams. Some showed production data, others showed staging. Some used 5-minute rates, others used instantaneous values. Nobody knew what “green” actually meant because thresholds had never been set.
The team spent 47 minutes just figuring out which dashboard showed the real checkout service. By the time they found the problem—a third-party payment provider timing out—they had lost $3.2 million in sales.
The next quarter, they rebuilt their dashboards from scratch: one overview, four golden signals per service, consistent thresholds, linked drill-downs. The following Black Friday, they identified and routed around a similar payment issue in 4 minutes.
What You’ll Be Able to Do
After completing this module, you will be able to:
- Deploy Grafana with provisioned dashboards, data sources, and alerting rules as code
- Implement Grafana dashboards with variables, annotations, and cross-datasource correlations
- Configure Grafana alerting with notification policies, silences, and escalation routing
- Integrate Grafana with Prometheus, Loki, and Tempo for unified metrics, logs, and traces visualization
Why This Module Matters
Data without visualization is just numbers. Grafana transforms raw metrics into actionable insights. It’s the window into your systems—the first place you look during an incident, the source of truth for SLOs, the tool that makes observability accessible to everyone.
Grafana has evolved from a dashboarding tool into a complete observability platform. Understanding it deeply—beyond basic charts—unlocks powerful capabilities for correlation, alerting, and investigation.
Did You Know?
- Grafana is used by millions of users at companies like Bloomberg, PayPal, and CERN—from startups to enterprises
- Grafana was created in 2014 by Torkel Ödegaard, originally as a fork of Kibana 3 aimed at graphing Graphite metrics—which is where the “graph” in the name comes from
- Grafana Labs’ LGTM stack (Loki, Grafana, Tempo, Mimir) provides a complete open-source alternative to commercial observability platforms
- Dashboard variables can reduce a 100-dashboard sprawl to 10 reusable dashboards—most teams don’t use them enough
Grafana Architecture
GRAFANA ECOSYSTEM
─────────────────────────────────────────────────────────────────
GRAFANA (Visualization)
├── Dashboards
├── Explore
├── Alerting
├── Users & Auth
└── Data source plugins: Prometheus │ Loki │ Tempo │ Elasticsearch │ ...
        │
        ├──▶ Prometheus / Mimir  (metrics)
        ├──▶ Loki                (logs)
        └──▶ Tempo               (traces)

Grafana Stack (LGTM)
| Component | Purpose |
|---|---|
| Loki | Log aggregation (like Prometheus for logs) |
| Grafana | Visualization and exploration |
| Tempo | Distributed tracing |
| Mimir | Scalable Prometheus metrics storage |
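Grafana reaches each of these backends through data source configuration, which can itself be provisioned as code. Below is a minimal sketch of a data source provisioning file; the file path, service names, and ports are illustrative assumptions, not defaults you can rely on.

# provisioning/datasources/lgtm.yaml — hypothetical path and URLs
apiVersion: 1
datasources:
  - name: Prometheus          # or Mimir, which exposes the Prometheus API
    type: prometheus
    access: proxy
    url: http://prometheus-server:80
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200

Keeping this file in version control means every environment gets identical data sources—the same idea the “Dashboard as Code” section later in this module applies to dashboards.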
War Story: The Dashboard That Cried Wolf
A fintech startup prided itself on comprehensive monitoring. They had 847 alert rules across 312 dashboards. “We monitor everything,” the VP of Engineering boasted to investors.
Monday 3:00 AM: PagerDuty fires. “High CPU on auth-service.” The on-call engineer checks—87% CPU but everything working. Acknowledges. Goes back to sleep.
Monday 3:47 AM: Another alert. “Memory pressure on auth-service.” Same story. Acknowledges.
Monday 4:12 AM: “Disk I/O high on auth-service.” The engineer starts wondering why auth-service is so needy tonight. Acknowledges.
Monday 4:38 AM: “Latency spike on auth-service.” By now, the engineer is exhausted and frustrated. Just acknowledges without checking.
Monday 5:15 AM: Customer complaints start rolling in. Logins are failing. The actual problem? A credential rotation job had stalled, causing auth retries to spike. The first three alerts were symptoms. The fourth one—latency—was the real signal.
But by then, alert fatigue had set in. The engineer ignored the one alert that mattered.
Dashboard/Alert Audit Results
─────────────────────────────────────────────────────────────────
Total dashboards: 312
Dashboards viewed monthly: 47
Total alert rules: 847
Alerts/week average: 156
True positives: 12 (7.7%)
MTTA (Mean Time to Ack): 23 minutes (should be <5)
Incidents missed due to fatigue: 3 in 6 months
─────────────────────────────────────────────────────────────────
Cost of alert fatigue: $1.2M in incident impact

The Fix:
- Deleted 734 alert rules (kept only SLO-based alerts)
- Consolidated to 28 dashboards (overview → service → component)
- Added thresholds with meaningful colors (green/yellow/red = actual SLO risk)
- Created runbooks linked from each alert
- Implemented alert routing: Page for customer-facing, ticket for internal
After 3 months:
- Alerts per week: 156 → 12
- True positive rate: 7.7% → 89%
- MTTA: 23 min → 3 min
- Incidents missed: 0
The Lesson: More dashboards ≠ better visibility. The goal isn’t monitoring everything—it’s surfacing what matters.
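The “kept only SLO-based alerts” item from the fix deserves a concrete illustration. Instead of paging on CPU or memory, an SLO-based rule pages when the error budget is burning too fast. Here is a hedged sketch in Prometheus rule syntax, assuming a 99.9% availability SLO and a hypothetical checkout service; the windows and the 14.4× factor follow the common multi-window burn-rate pattern and should be tuned to your own SLO.

# Sketch only: fast-burn alert for a 99.9% SLO (error budget = 0.001).
# The "checkout" service label is an assumption for illustration.
groups:
  - name: slo-alerts
    rules:
      - alert: CheckoutErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{service="checkout",status=~"5.."}[1h]))
              / sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="checkout"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout is burning its 99.9% error budget ~14x too fast"

One rule like this can replace a whole family of CPU/memory/disk alerts, because it only fires when users are actually affected.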
Dashboard Design Principles
The Four Golden Signals
FOUR GOLDEN SIGNALS
─────────────────────────────────────────────────────────────────
Every dashboard should answer these questions:
1. LATENCY - How long does it take?
   └── Histogram quantiles (p50, p95, p99)
2. TRAFFIC - How much load are we handling?
   └── Requests per second
3. ERRORS - What's failing?
   └── Error rate as percentage
4. SATURATION - How "full" is the system?
   └── CPU, memory, queue depth
DASHBOARD LAYOUT
─────────────────────────────────────────────────────────────────
Header (variables):  SERVICE: $service   ENVIRONMENT: $env   TIME: $__interval
Stat row:            Latency (p99) │ Traffic (RPS) │ Errors (%)
Graph rows:          REQUEST RATE
                     LATENCY (p99 / p95 / p50)
                     ERROR RATE

Dashboard Hierarchy
DASHBOARD ORGANIZATION
─────────────────────────────────────────────────────────────────
LEVEL 1: Overview Dashboard
├── All services at a glance
├── Traffic heatmap
├── Error hotspots
└── Click to drill down
LEVEL 2: Service Dashboard
├── Golden signals for one service
├── Dependencies (upstream/downstream)
├── Resource utilization
└── Click to drill down
LEVEL 3: Component Dashboard
├── Individual pods/instances
├── Detailed metrics
├── Debug information
└── Linked to logs/traces
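This hierarchy can be mirrored in Grafana’s folder structure and provisioned as code. A sketch using the same file-provider format shown in the “Dashboard as Code” section later in this module; the folder names and paths are illustrative assumptions.

# Sketch: one provisioning provider per dashboard level (names/paths are examples)
apiVersion: 1
providers:
  - name: overview
    folder: 'Overview'
    type: file
    options:
      path: /var/lib/grafana/dashboards/overview
  - name: services
    folder: 'Services'
    type: file
    options:
      path: /var/lib/grafana/dashboards/services
  - name: components
    folder: 'Components'
    type: file
    options:
      path: /var/lib/grafana/dashboards/components

One folder per level makes the drill-down path obvious to anyone opening Grafana for the first time.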
Dashboard Variables
Variable Types
GRAFANA VARIABLES
─────────────────────────────────────────────────────────────────
QUERY VARIABLE (dynamic from data source)
  Name: service
  Query: label_values(http_requests_total, service)
  Result: Dropdown with all services

CUSTOM VARIABLE (static list)
  Name: percentile
  Values: 50,90,95,99
  Result: Dropdown with percentile options

INTERVAL VARIABLE (time-based)
  Name: interval
  Values: 1m,5m,15m,1h
  Auto: Based on time range

DATASOURCE VARIABLE
  Name: datasource
  Type: prometheus
  Result: Switch between Prometheus instances

TEXT VARIABLE (user input)
  Name: filter
  Type: text
  Result: Free-text filter input

Using Variables
# In queries, use $variable or ${variable}
rate(http_requests_total{service="$service"}[$interval])

# Multi-value with regex
rate(http_requests_total{service=~"$service"}[$interval])

# With custom variable for percentile
histogram_quantile(0.$percentile, sum by (le)(rate(http_request_duration_seconds_bucket{service="$service"}[$interval])))

Variable Configuration
{
  "templating": {
    "list": [
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(http_requests_total, service)",
        "multi": true,
        "includeAll": true,
        "allValue": ".*",
        "refresh": 2
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_min": "10s",
        "options": [
          {"value": "1m", "text": "1 minute"},
          {"value": "5m", "text": "5 minutes"},
          {"value": "15m", "text": "15 minutes"}
        ]
      }
    ]
  }
}

Panel Types
Choosing the Right Panel
| Panel Type | Use Case | Example |
|---|---|---|
| Time series | Metrics over time | Request rate, latency trends |
| Stat | Single current value | Current error rate, RPS |
| Gauge | Value vs. thresholds | CPU usage, SLO budget |
| Bar gauge | Compare values | Top 5 endpoints by latency |
| Table | Detailed data | Pod status, error details |
| Heatmap | Distribution over time | Latency distribution |
| Logs | Log entries | Loki integration |
| Traces | Distributed traces | Tempo/Jaeger integration |
Time Series Best Practices
TIME SERIES CONFIGURATION
─────────────────────────────────────────────────────────────────
LEGEND
├── Format: {{service}} - {{method}}
├── Placement: Bottom or Right
└── Hide if too many series (use tooltip)

AXES
├── Left Y: Main metric (e.g., RPS)
├── Right Y: Secondary (e.g., error %)
└── Label units explicitly

THRESHOLDS
├── Warning: Yellow line
├── Critical: Red line
└── Fill below threshold for visibility

SERIES OVERRIDES
├── Error series: Red
├── Success series: Green
└── Specific series styling

Stat Panel with Thresholds
{
  "type": "stat",
  "title": "Error Rate",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
      "legendFormat": ""
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 1},
          {"color": "red", "value": 5}
        ]
      },
      "mappings": [],
      "color": {"mode": "thresholds"}
    }
  },
  "options": {
    "colorMode": "background",
    "graphMode": "none",
    "textMode": "value"
  }
}

Grafana Alerting
Alert Rules
# Grafana alerting (unified alerting)
apiVersion: 1
groups:
  - name: service-alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          # Query A: Error requests
          - refId: A
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m]))
          # Query B: Total requests
          - refId: B
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total[5m]))
          # Expression C: Error rate
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              expression: $A / $B * 100
              conditions:
                - evaluator:
                    type: gt
                    params: [5]  # > 5%
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate is {{ $values.C }}%"

Contact Points
# Contact points configuration
apiVersion: 1
contactPoints:
  - name: slack-alerts
    receivers:
      - uid: slack-1
        type: slack
        settings:
          url: https://hooks.slack.com/services/...
          recipient: "#alerts"
          title: "{{ .CommonLabels.alertname }}"
          text: "{{ .CommonAnnotations.summary }}"
  - name: pagerduty
    receivers:
      - uid: pd-1
        type: pagerduty
        settings:
          integrationKey: "<key>"
          severity: "{{ .CommonLabels.severity }}"
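Contact points only define where alerts can go; notification policies decide which alerts go where. A hedged sketch of the policy provisioning format, routing critical alerts to PagerDuty and everything else to Slack—the label values and grouping keys are assumptions you would adapt:

# Sketch: notification policy tree referencing the contact points above
apiVersion: 1
policies:
  - orgId: 1
    receiver: slack-alerts          # default route: everything lands in Slack
    group_by: ['alertname', 'service']
    routes:
      - receiver: pagerduty         # escalation: only critical, customer-facing alerts page
        object_matchers:
          - ['severity', '=', 'critical']

This routing is where the war story’s “page for customer-facing, ticket for internal” policy is actually implemented.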
Explore Mode
Ad-hoc Investigation
EXPLORE MODE
─────────────────────────────────────────────────────────────────
Explore view (sketch):
  Data source picker:  [Prometheus ▼] [Loki ▼] [Tempo ▼]
  Query:               rate(http_requests_total{service="api"}[5m])
  Actions:             [Run Query] [Add Query] [Split]
  Result:              time series graph of the query
  [Split]   Split panes for comparison
  [Logs]    Link to Loki for this time range
  [Traces]  Link to Tempo for this trace ID
Use Cases:
• Incident investigation
• Metric exploration
• Query building before dashboard
• Correlating metrics, logs, traces

Correlating Signals
SIGNAL CORRELATION
─────────────────────────────────────────────────────────────────
1. Start with metric anomaly
   rate(http_requests_total{status="500"}[5m]) > 0
2. Click "Explore" on spike timestamp
3. Split view → Add Loki
   {service="api"} |= "error" | json
4. Find error with trace_id
5. Split view → Add Tempo
   Paste trace_id → See full trace
6. Identify root cause in upstream service

Dashboard as Code
Provisioning
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: 'Production'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards

Grafonnet (Jsonnet)
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.row;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'Service Dashboard',
  tags=['production', 'service'],
  time_from='now-1h',
)
.addTemplate(
  grafana.template.datasource(
    'datasource',
    'prometheus',
    'Prometheus',
  )
)
.addTemplate(
  grafana.template.query(
    'service',
    'label_values(http_requests_total, service)',
    datasource='$datasource',
  )
)
.addRow(
  row.new(
    title='Golden Signals',
  )
  .addPanel(
    graphPanel.new(
      'Request Rate',
      datasource='$datasource',
    )
    .addTarget(
      prometheus.target(
        'sum(rate(http_requests_total{service="$service"}[5m]))',
        legendFormat='RPS',
      )
    )
  )
  .addPanel(
    graphPanel.new(
      'Error Rate',
      datasource='$datasource',
    )
    .addTarget(
      prometheus.target(
        'sum(rate(http_requests_total{service="$service",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="$service"}[5m]))',
        legendFormat='Error %',
      )
    )
  )
)

Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Too many panels | Slow, overwhelming | Focus on golden signals |
| No variables | Duplicate dashboards | Use template variables |
| Hardcoded time ranges | Stale data | Use relative ranges |
| Missing units | Unclear data | Always set units |
| No thresholds | Can’t spot issues | Add visual thresholds |
| Direct queries everywhere | Slow dashboards | Use recording rules |
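The last row of the table—“use recording rules”—is worth a concrete example. Pre-computing expensive expressions in Prometheus keeps dashboard panels cheap to render. A sketch, with rule names chosen purely for illustration:

# Sketch: Prometheus recording rules that pre-compute golden-signal queries
# so dashboard panels read a single series instead of re-aggregating raw data.
groups:
  - name: service-golden-signals
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
      - record: service:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

Panels then query service:http_errors:ratio5m{service="$service"} instead of recomputing the full aggregation on every refresh.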
Test your understanding:
1. Why are dashboard variables important?
Answer: Variables provide:
- Reusability: One dashboard for many services
- Exploration: Easy switching between environments
- Maintenance: Fewer dashboards to maintain
- Consistency: Same layout, different data
Without variables, teams create duplicate dashboards for each service, leading to maintenance nightmares.
2. What are the Four Golden Signals and why use them?
Answer: The Four Golden Signals are:
- Latency: How long requests take
- Traffic: Request volume
- Errors: Failure rate
- Saturation: Resource utilization
They provide a complete picture of service health. If all four are healthy, the service is likely healthy. They’re recommended by Google’s SRE book as the minimum monitoring.
3. When should you use Explore vs. Dashboards?
Answer:
- Dashboards: Known questions, ongoing monitoring, team visibility
- Explore: Ad-hoc investigation, incident response, query building
Explore is for investigation, dashboards are for monitoring. Build dashboards from queries developed in Explore.
4. How does Grafana help with signal correlation?
Answer: Grafana enables correlation through:
- Linked data sources: Jump from metric to logs to traces
- Common labels: TraceID, service name across signals
- Split view: Compare signals side-by-side
- Time synchronization: Same time range across panels
This enables rapid root cause analysis: metric spike → related logs → distributed trace.
5. A dashboard has 50 panels and takes 30 seconds to load. What's likely wrong and how would you fix it?
Answer: Common causes and fixes:
1. Too many queries executing:
- Each panel fires separate query
- Fix: Use recording rules to pre-compute, reduce panels
2. Long time ranges with high resolution:
- Querying months of data at 1-second granularity
- Fix: Use the $__interval variable, implement step adjustment
3. High cardinality queries:
- topk(100, ...) or unbounded label matches
- Fix: Add filters, reduce cardinality, use recording rules
4. No caching:
- Same queries repeatedly hit Prometheus
- Fix: Enable query caching, set appropriate cache TTL
5. Slow data source:
- Prometheus itself is overwhelmed
- Fix: Add capacity, shard Prometheus, use Thanos/Mimir
Diagnostic approach:
- Open browser DevTools → Network tab
- Identify slowest queries
- Run those queries directly in Prometheus to isolate
6. You need one dashboard to work for 100 microservices. How would you design the variable structure?
Answer: Multi-level variable cascade:
Variable: namespace
  Query: label_values(up, namespace)

Variable: service
  Query: label_values(up{namespace="$namespace"}, service)
  Depends on: namespace

Variable: instance
  Query: label_values(up{namespace="$namespace", service="$service"}, instance)
  Depends on: namespace, service

Variable: interval
  Type: interval
  Auto: true
  Auto_min: 10s

Key techniques:
- Cascading dependencies: Each variable filters the next
- Include All option: .* regex for multi-select
- Refresh on load: Keep data current
- Default values: Pre-select production namespace
In panels: Use {namespace="$namespace", service=~"$service"} with regex match for multi-select support.
7. What's the difference between Stat, Gauge, and Bar Gauge panels? When would you use each?
Answer:
| Panel | Best For | Example |
|---|---|---|
| Stat | Current value, large display | Error rate: 0.5%, uptime: 99.95% |
| Gauge | Value vs. thresholds, radial | CPU: 75% of 100%, memory: 8/16 GB |
| Bar Gauge | Comparing multiple values | Top 5 endpoints by latency |
Decision tree:
- Single value, big number display? → Stat
- Single value, progress toward limit? → Gauge
- Multiple values, comparing? → Bar Gauge
- Time series needed? → Time series (not these)
Threshold configuration: All three support color thresholds (green/yellow/red) which should map to SLO states, not arbitrary percentages.
8. How would you set up Grafana to correlate a metric spike to logs and then traces?
Answer: Configure data source links:
1. Prometheus → Loki (metrics to logs):
datasources:
  - name: Prometheus
    jsonData:
      derivedFields:
        - datasourceUid: loki
          matcherRegex: service="([^"]+)"
          name: ServiceLogs
          url: '/explore?left={"queries":[{"expr":"{service=\"${__value.raw}\"}"}]}'

2. Prometheus → Tempo (via exemplars):

jsonData:
  exemplarTraceIdDestinations:
    - name: trace_id
      datasourceUid: tempo

3. Loki → Tempo (logs to traces):

datasources:
  - name: Loki
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          datasourceUid: tempo
          url: '${__value.raw}'

Workflow:
- See latency spike in dashboard (Prometheus)
- Click exemplar → opens trace in Tempo
- See slow span, click service → opens Loki with filtered logs
- Find error message with stack trace
Hands-On Exercise: Build a Service Dashboard
Create a complete service dashboard:
# Deploy Grafana using Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set adminPassword=admin \
  --set persistence.enabled=true \
  --set persistence.size=10Gi
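The --set flags above are fine for a quick start; for anything you intend to keep, a values file is easier to review and version. A sketch—the option names follow the grafana/grafana chart’s common layout, so verify them against the chart version you actually install:

# values.yaml — sketch; confirm option names against your chart version
adminPassword: admin
persistence:
  enabled: true
  size: 10Gi
sidecar:
  dashboards:
    enabled: true      # load dashboards from labelled ConfigMaps
  datasources:
    enabled: true      # load data source definitions the same way

Then install with: helm install grafana grafana/grafana --namespace monitoring -f values.yaml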
Step 1: Access Grafana

# Get admin password
kubectl get secret -n monitoring grafana -o jsonpath="{.data.admin-password}" | base64 -d
# Port forward
kubectl port-forward -n monitoring svc/grafana 3000:80
# Login at http://localhost:3000 (admin / <password>)

Step 2: Add Prometheus Data Source
- Go to Configuration → Data Sources
- Add Prometheus
- URL: http://prometheus-server:80
- Save & Test
Step 3: Create Dashboard with Variables
{
  "title": "Service Dashboard",
  "templating": {
    "list": [
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(up, job)",
        "multi": false,
        "includeAll": false
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_min": "10s",
        "options": [
          {"value": "1m", "text": "1m"},
          {"value": "5m", "text": "5m"},
          {"value": "15m", "text": "15m"}
        ]
      }
    ]
  }
}

Step 4: Add Golden Signal Panels
Request Rate (Time Series):
sum(rate(http_requests_total{job="$service"}[$interval]))Error Rate (Stat Panel):
sum(rate(http_requests_total{job="$service",status=~"5.."}[$interval]))/sum(rate(http_requests_total{job="$service"}[$interval]))* 100Latency P99 (Time Series):
histogram_quantile(0.99, sum by (le)(rate(http_request_duration_seconds_bucket{job="$service"}[$interval])))

CPU Usage (Gauge):
avg(rate(process_cpu_seconds_total{job="$service"}[$interval])) * 100

Step 5: Configure Thresholds
For Error Rate Stat:
- Green: < 1%
- Yellow: 1-5%
- Red: > 5%
For CPU Gauge:
- Green: < 70%
- Yellow: 70-85%
- Red: > 85%
Success Criteria
You’ve completed this exercise when you can:
- Create dashboard with variables
- Switch services using dropdown
- See all four golden signals
- Thresholds change panel colors
- Export dashboard as JSON
Key Takeaways
Before moving on, ensure you understand:
- Four Golden Signals: Every service dashboard should show Latency, Traffic, Errors, and Saturation
- Dashboard hierarchy: Overview (all services) → Service (one service) → Component (instances/pods)
- Variables are essential: Query, interval, and custom variables enable reusable dashboards
- Panel selection: Stat (single value), Gauge (vs. threshold), Bar Gauge (comparison), Time Series (trends)
- Thresholds must be meaningful: Map to SLO states (green = healthy, yellow = warning, red = breaching)
- Explore for investigation: Ad-hoc queries, split view, incident response; dashboards for monitoring
- Signal correlation: Configure exemplars (Prometheus → Tempo), derived fields (Loki → Tempo)
- Dashboard as code: Use Grafonnet/JSON provisioning for version control and consistency
- Alert fatigue is deadly: Fewer, high-signal alerts beat many noisy ones
- Units and legends: Always label axes, set units, format legends for clarity
Further Reading
- Grafana Documentation — Official docs
- Grafonnet — Dashboard as code
- Dashboard Best Practices — Official guide
- LGTM Stack — Complete observability
Summary
Grafana transforms metrics into insights. The Four Golden Signals provide a framework for service health. Variables create reusable, maintainable dashboards. Explore mode enables rapid investigation during incidents. Dashboard as code brings version control to visualization. Together with Prometheus, Loki, and Tempo, Grafana forms a complete observability platform.
Next Module
Continue to Module 1.4: Logging with Loki to learn about log aggregation and analysis.