Observability at Scale
Observability at Scale
Section titled “Observability at Scale”Learning Outcomes
Section titled “Learning Outcomes”- Configure Prometheus sharding and federation to handle scraping workloads that exceed single-node vertical scaling limits.
- Implement highly available, long-term metric storage using Grafana Mimir and S3-compatible object storage on bare metal.
- Design OpenTelemetry Collector pipelines for unified routing, batching, and sampling of metrics, logs, and traces.
- Diagnose and resolve high-cardinality explosions using relabeling rules, recording rules, and ingestion limits.
- Deploy and operate the Grafana LGTM (Loki, Grafana, Tempo, Mimir) stack with continuous profiling (Pyroscope) and exemplars.
- Mitigate alerting fatigue using advanced Alertmanager routing, grouping, and inhibition rules.
The Limits of Single-Node Prometheus
Section titled “The Limits of Single-Node Prometheus”Prometheus is designed primarily for reliability, favoring a shared-nothing architecture where a single binary scrapes targets, stores data locally in its Time Series Database (TSDB), and evaluates alerting rules. On bare-metal clusters exceeding 500 nodes or emitting millions of active series, a single Prometheus instance hits vertical scaling limits.
The primary bottlenecks are:
- Memory: Prometheus keeps the head block (the last 2-3 hours of data) in memory. High active series count dictates memory requirements. OOM kills are common during high churn events (e.g., node rotation, massive deployment rollouts).
- CPU: Evaluating complex PromQL queries over long time ranges or many series consumes significant CPU.
- Storage I/O: Compacting the TSDB and appending to the Write-Ahead Log (WAL) requires fast NVMe drives. Spinning disks will cause WAL backpressure, leading to dropped scrapes.
Hierarchical Federation vs. Sharding
Section titled “Hierarchical Federation vs. Sharding”When a single Prometheus cannot scrape all targets, the workload must be distributed.
Hierarchical Federation involves deploying multiple “leaf” Prometheus servers (e.g., one per namespace or failure domain) that scrape targets. A “global” Prometheus server then scrapes the leaf servers using the /federate endpoint, pulling only aggregated data (via match[] parameters) rather than raw metrics.
Hash-based Sharding distributes the scrape load across multiple identical Prometheus replicas. Each replica discovers all targets but only scrapes a subset based on a hash of the target’s address.
scrape_configs: - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: # Keep only targets whose hash matches the shard index - source_labels: [__address__] modulus: 4 # Total number of shards target_label: __tmp_hash action: hashmod - source_labels: [__tmp_hash] regex: 0 # This specific shard's index (0, 1, 2, or 3) action: keepPrometheus Agent Mode
Section titled “Prometheus Agent Mode”Introduced in v2.32, Agent mode (--enable-feature=agent) runs Prometheus without a local TSDB, alerting, or querying capabilities. It only scrapes targets, appends to the WAL, and remote-writes to a centralized storage backend (like Mimir or Thanos). This drastically reduces the memory footprint on edge nodes.
Long-Term Storage: Thanos vs. Mimir
Section titled “Long-Term Storage: Thanos vs. Mimir”To achieve global querying, high availability, and long-term retention beyond the capacity of local NVMe drives, metrics must be shipped to a distributed system backed by object storage (e.g., MinIO or Ceph RGW on bare metal).
Architecture Comparison
Section titled “Architecture Comparison”Both Thanos and Mimir aim to solve the same problem but take different architectural approaches.
graph TD subgraph Thanos Architecture P1[Prometheus] --> S1[Thanos Sidecar] S1 -->|Uploads Blocks| O1[(Object Storage)] Q1[Thanos Querier] --> S1 Q1 --> SG1[Store Gateway] SG1 --> O1 end
subgraph Mimir Architecture P2[Prometheus Agent] -->|Remote Write| D1[Mimir Distributor] D1 --> I1[Mimir Ingester] I1 -->|Uploads Blocks| O2[(Object Storage)] Q2[Mimir Querier] --> I1 Q2 --> SG2[Mimir Store Gateway] SG2 --> O2 end| Feature | Thanos | Grafana Mimir |
|---|---|---|
| Data Ingestion | Pull (Querier reads from Sidecar) + Block upload | Push (Remote Write to Distributor) |
| Prometheus Role | Full Prometheus (with local TSDB) required | Agent mode sufficient (stateless scrape) |
| Multitenancy | Requires external proxies (e.g., Thanos Receive) | Native (tenant ID headers) |
| Operational Complexity | Lower (Sidecars piggyback on existing Prometheus) | Higher (Microservices architecture, requires Memcached/Redis) |
| Query Performance | Good, but query time depends on Sidecar availability | Excellent (extensive caching, query sharding, shuffle-sharding) |
Bare Metal Recommendation: If you already have a stable Prometheus setup and just need long-term retention, Thanos is less invasive. If you are building a centralized metrics platform for multiple internal teams (tenants) and need to isolate their workloads, Mimir is the superior choice.
OpenTelemetry Collector Pipelines
Section titled “OpenTelemetry Collector Pipelines”The OpenTelemetry (OTel) Collector standardizes telemetry ingestion, processing, and exporting. Instead of running Fluentbit for logs, Promtail for Loki, and Jaeger Agent for traces, the OTel Collector handles all three signals.
Deployment Patterns
Section titled “Deployment Patterns”- Agent (DaemonSet): Runs on every node. Collects node metrics, container logs, and receives traces from local pods via localhost. Adds infrastructure metadata (node name, pod IP).
- Gateway (Deployment): Runs as a standalone cluster. Receives data from Agents. Handles heavy processing (tail-based sampling, cardinality reduction, API key enrichment) before exporting to backends.
Pipeline Configuration
Section titled “Pipeline Configuration”The Collector configuration consists of Receivers, Processors, Exporters, and the Service pipeline that connects them.
receivers: otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318
processors: # Crucial for preventing OOMs in the Collector memory_limiter: check_interval: 1s limit_mib: 4000 spike_limit_mib: 800
# Adds Kubernetes metadata to telemetry k8sattributes: auth_type: "serviceAccount" passthrough: false extract: metadata: - k8s.pod.name - k8s.namespace.name
batch: send_batch_size: 10000 timeout: 10s
exporters: prometheusremotewrite: endpoint: "http://mimir-nginx.mimir.svc.cluster.local/api/v1/push" otlphttp/loki: endpoint: "http://loki-gateway.loki.svc.cluster.local/otlp" otlp/tempo: endpoint: "tempo-distributor.tempo.svc.cluster.local:4317" tls: insecure: true
service: pipelines: metrics: receivers: [otlp] processors: [memory_limiter, k8sattributes, batch] exporters: [prometheusremotewrite] traces: receivers: [otlp] processors: [memory_limiter, k8sattributes, batch] exporters: [otlp/tempo]The Grafana LGTM Stack
Section titled “The Grafana LGTM Stack”The LGTM stack (Loki, Grafana, Tempo, Mimir) shares a similar underlying microservices architecture: Distributors, Ingesters, Queriers, and Compactor components backed by object storage.
Loki (Logs)
Section titled “Loki (Logs)”Loki does not index the content of logs. It only indexes labels. This makes it extremely cost-effective to operate but requires strict label discipline.
- Gotcha: Do not extract high-cardinality data (like
user_idortrace_id) into Loki labels. Keep them in the log line and parse them at query time using LogQL (e.g.,| json | user_id="123"). Extracting them into labels will create a new log stream for every user, overwhelming the Ingesters.
Tempo (Traces)
Section titled “Tempo (Traces)”Tempo stores trace data in Parquet format in object storage. It relies heavily on TraceQL for querying.
- Exemplars: Exemplars bridge the gap between metrics and traces. They attach a specific
trace_idto a metric histogram bucket. In Grafana, this manifests as a diamond on a latency graph. Clicking the diamond jumps directly to the Tempo trace that experienced that exact latency, eliminating the need to manually search logs for the timeframe.
Pyroscope (Continuous Profiling)
Section titled “Pyroscope (Continuous Profiling)”Pyroscope collects continuous CPU and memory profiles from applications. It allows you to diff profiles across time (e.g., “Why is CPU usage higher today than yesterday?”).
- Deployment requires language-specific agents (e.g., eBPF for C/C++/Rust/Go, or native SDKs for Java/Python).
Cardinality Management
Section titled “Cardinality Management”Cardinality is the total number of unique time series in your TSDB.
Active Series = Metric Name * Unique Label Combinations
A single poorly designed metric can bring down an observability stack. For example:
http_requests_total{path="/api/users/123", status="200"}
If the user ID is in the path, every user creates a new series. This is a cardinality explosion.
Mitigation Strategies
Section titled “Mitigation Strategies”-
Relabeling (Drop at Scrape Time): If a developer exposes a bad metric, drop it before ingestion.
metric_relabel_configs:- source_labels: [__name__]regex: 'http_requests_total_bad_metric'action: drop -
Limits (Fencing): Configure Mimir or Prometheus to reject scrapes that have too many series or labels.
# Prometheus scrape configsample_limit: 10000label_limit: 50label_name_length_limit: 200label_value_length_limit: 200 -
Recording Rules (Aggregation): Pre-compute expensive queries and store the result as a new, low-cardinality metric. Drop the raw high-cardinality metric after a short retention period.
Alerting Fatigue Mitigation
Section titled “Alerting Fatigue Mitigation”Alert fatigue destroys on-call mental health. Alertmanager provides three mechanisms to control alert volume.
- Routing: Send critical alerts to PagerDuty, warnings to Slack.
- Grouping: Group alerts by
clusterandnamespace. If a node dies, you get one notification (“15 pods down on node-X”) instead of 15 separate emails. - Inhibition: Suppress downstream alerts if an upstream alert is firing.
inhibit_rules: # If the whole node is down, don't alert me about individual pods failing on that node. - source_matchers: - alertname = "NodeDown" target_matchers: - alertname = "TargetDown" equal: ['node']Hands-on Lab
Section titled “Hands-on Lab”In this lab, we will deploy a scalable, multi-tenant capable metric backend using Grafana Mimir, backed by a local MinIO instance, and configure a Prometheus Agent to remote-write to it.
Prerequisites
Section titled “Prerequisites”- A running Kubernetes cluster (v1.32+).
helminstalled.kubectlconfigured.
Step 1: Deploy MinIO (Object Storage Backend)
Section titled “Step 1: Deploy MinIO (Object Storage Backend)”Mimir requires object storage for blocks, rules, and alerts. We will deploy a simple MinIO instance.
helm repo add bitnami https://charts.bitnami.com/bitnamihelm repo update
# Create a values file for MinIOcat <<EOF > minio-values.yamlauth: rootUser: admin rootPassword: supersecretpassworddefaultBuckets: "mimir-blocks,mimir-ruler,mimir-alertmanager"persistence: size: 10GiEOF
helm install minio bitnami/minio -f minio-values.yaml -n observability --create-namespaceVerification:
kubectl get pods -n observability -l app.kubernetes.io/name=minio# Expected: Pod is READY 1/1Step 2: Deploy Grafana Mimir
Section titled “Step 2: Deploy Grafana Mimir”We will deploy Mimir in “monolithic” mode for the lab, but configure it to use the S3 backend. In production, you would deploy microservices.
helm repo add grafana https://grafana.github.io/helm-chartshelm repo update
cat <<EOF > mimir-values.yamlmimir: structuredConfig: multitenancy_enabled: true blocks_storage: backend: s3 s3: endpoint: minio.observability.svc.cluster.local:9000 access_key_id: admin secret_access_key: supersecretpassword insecure: true bucket_name: mimir-blocks alertmanager_storage: backend: s3 s3: endpoint: minio.observability.svc.cluster.local:9000 access_key_id: admin secret_access_key: supersecretpassword insecure: true bucket_name: mimir-alertmanager ruler_storage: backend: s3 s3: endpoint: minio.observability.svc.cluster.local:9000 access_key_id: admin secret_access_key: supersecretpassword insecure: true bucket_name: mimir-rulerEOF
helm install mimir grafana/mimir-distributed -f mimir-values.yaml -n observabilityVerification: Wait for the components to start.
kubectl get pods -n observability | grep mimir# Expected: distributor, ingester, querier, store-gateway pods should be Running.Step 3: Deploy Prometheus in Agent Mode
Section titled “Step 3: Deploy Prometheus in Agent Mode”Configure Prometheus to scrape its own metrics and remote write to Mimir, specifying a tenant ID header.
cat <<EOF > prom-agent.yamlapiVersion: v1kind: ConfigMapmetadata: name: prometheus-agent-config namespace: observabilitydata: prometheus.yml: |- global: scrape_interval: 15s scrape_configs: - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] remote_write: - url: http://mimir-nginx.observability.svc.cluster.local/api/v1/push headers: X-Scope-OrgID: "tenant-a"---apiVersion: apps/v1kind: Deploymentmetadata: name: prometheus-agent namespace: observabilityspec: replicas: 1 selector: matchLabels: app: prometheus-agent template: metadata: labels: app: prometheus-agent spec: containers: - name: prometheus image: prom/prometheus:v2.51.0 args: - "--config.file=/etc/prometheus/prometheus.yml" - "--enable-feature=agent" ports: - containerPort: 9090 volumeMounts: - name: config mountPath: /etc/prometheus volumes: - name: config configMap: name: prometheus-agent-configEOF
kubectl apply -f prom-agent.yamlVerification: Check the Prometheus agent logs to ensure it is successfully remote-writing.
kubectl logs -n observability deploy/prometheus-agent# Expected: No "remote_write" error logs. Should see "WAL started" and scraping active.Troubleshooting:
If the remote write fails, ensure the mimir-nginx service name matches the deployed Mimir gateway service, and that MinIO credentials are correct in the Mimir configuration. If Mimir ingesters crash, check if the buckets were successfully created in MinIO.
Practitioner Gotchas
Section titled “Practitioner Gotchas”- The WAL Replay OOMKill: When a Prometheus Pod restarts, it must replay the Write-Ahead Log (WAL) into memory before it can begin scraping or serving queries. If the WAL is massive (due to high cardinality or slow disk I/O preventing compaction), the replay will spike memory usage beyond the container’s limits, resulting in continuous
OOMKilledcrash loops. Fix: Delete thewaldirectory (losing recent uncompacted data) to restore service, then implement metric limits. - Loki “Maximum active streams” Errors: Loki groups logs into “streams” based on their exact label combinations. If you map a highly dynamic Kubernetes label (like a
pod-template-hashor a unique request ID) to a Loki label, you will exceed the tenant stream limits and Loki will reject ingestion with a 429 error. Fix: Ensure Promtail/OTel only promotes static infrastructure metadata (namespace, app name, cluster) to labels. Keep dynamic data in the JSON payload. - Thanos Compactor Halting: The Thanos Compactor is a singleton. It downloads blocks from object storage, deduplicates them, and uploads larger blocks. If multiple compactors run simultaneously, or if a block becomes corrupted, the Compactor will halt to prevent data loss, throwing “overlapping blocks” errors. Fix: Ensure only one Compactor runs per object storage bucket. Manually delete the
compactor-meta.jsonor the specific overlapping block in object storage if corruption occurs. - Agent Mode Dropped Scrapes: Prometheus Agent mode has no local TSDB to buffer data for long periods. If the network to the remote-write endpoint (e.g., Mimir) goes down, the Agent will buffer data in its WAL up to the configured capacity. Once full, it will silently drop new scrapes. Fix: Monitor
prometheus_remote_storage_dropped_samples_total. Size the WAL disk appropriately for expected network partition durations.
1. A cluster experiences a network partition lasting 4 hours. The central Thanos Querier cannot reach the Thanos Sidecars running in that cluster. What happens to the metrics generated during this time once the network is restored? A. The metrics are permanently lost because Thanos Sidecars only serve real-time data. B. The central Querier will fetch the metrics from the Sidecars, provided the local Prometheus instances had enough disk space to retain the data. C. The Thanos Compactor will automatically synthesize the missing data using linear regression. D. The metrics must be manually imported using the Prometheus remote-write API. Correct Answer: B (Thanos Sidecars serve data directly from the local Prometheus TSDB. As long as the data hasn’t aged out of the local retention window, it is recoverable).
2. You notice your Loki Ingesters are constantly crashing with OutOfMemory errors, and ingestion performance is degrading. The application teams recently started logging a transaction_id with every request. Which of the following is the most likely cause?
A. The transaction_id is being parsed from the JSON body during queries, exhausting query memory.
B. The Loki retention period is set too high, filling up the local disk.
C. The transaction_id was mapped to a Loki label, causing a massive cardinality explosion of log streams.
D. The OpenTelemetry Collector is batching too many logs at once.
Correct Answer: C (Loki creates a separate stream in memory for every unique combination of labels. High cardinality labels like IDs will exhaust memory).
3. You are designing an OpenTelemetry Collector pipeline for a bare-metal Kubernetes cluster. You want to drop all logs that do NOT contain the word “error” before they are sent over the network to a central logging cluster. Where should this filtering logic reside? A. In a tail-based sampling Processor on the central Gateway Collectors. B. In a filter Processor on the Agent (DaemonSet) Collectors running on each node. C. In the Receiver configuration of the central Gateway Collectors. D. In the Promtail Sidecar attached to each application pod. Correct Answer: B (Filtering at the edge/Agent level prevents unnecessary network transmission and reduces load on the central Gateway).
4. A single Prometheus server is consuming 60GB of RAM and frequently OOMKills. You analyze the TSDB and find 15 million active series. You want to scale the scraping workload without losing any data or setting up a complex microservices backend like Mimir yet. What is the most appropriate architectural change? A. Switch Prometheus to Agent Mode. B. Implement Hierarchical Federation to pull all metrics to a larger machine. C. Implement Hash-based Sharding across multiple Prometheus replicas. D. Decrease the scrape interval from 15s to 60s. Correct Answer: C (Hash-based sharding divides the 15 million series across multiple smaller Prometheus instances, reducing the memory footprint of each without requiring a new backend).
5. You are receiving 50 individual emails every time a physical rack loses power because 50 database pods go down simultaneously. What is the correct Alertmanager configuration to fix this fatigue?
A. Create an inhibit_rule that suppresses Pod alerts if a Rack alert is firing.
B. Set the repeat_interval to 24 hours for all alerts.
C. Create a route that groups alerts by rack_id.
D. Both A and C.
Correct Answer: D (Grouping consolidates the 50 pod alerts into one notification. Inhibition suppresses the pod alerts entirely if the rack itself is the known root cause).