Module 7.4: Observability Without Cloud Services

Complexity: [COMPLEX] | Time: 60 minutes

Prerequisites: Module 7.3: Node Failure & Auto-Remediation, Module 4.1: Storage Architecture


In November 2023, an e-commerce company migrated from AWS EKS to on-premises Kubernetes. Their cloud setup had been straightforward: CloudWatch for logs, CloudWatch Metrics for monitoring, X-Ray for tracing, and PagerDuty for alerting. One AWS bill covered everything.

When they moved to bare metal, they initially considered using Datadog. However, at roughly $23/host/month for the infrastructure plan, the cost for their 400-node fleet would exceed $110,000/year. Furthermore, SaaS monitoring required opening internet egress for telemetry, violating their strict data sovereignty and air-gapped compliance requirements.

Forced to build a self-hosted stack, the infrastructure team assumed they could replicate their cloud observability in a weekend. They deployed a single Prometheus instance, pointed Grafana at it, and called it done.

Three months later, Prometheus crashed. It had been ingesting 800,000 samples per second across 400 nodes, and its local storage had grown to 1.2TB. The 15-day retention consumed all available disk space on the monitoring node. When Prometheus restarted, it took 45 minutes to replay the WAL (Write-Ahead Log), during which there was zero monitoring visibility. The team later discovered that Prometheus had been silently dropping samples for a week due to memory pressure, so their dashboards had gaps nobody noticed. Meanwhile, container logs were being written to local disk and rotated away after 24 hours — they had no centralized logging at all.

The rebuild took six weeks: Prometheus with Thanos for long-term storage and high availability, Loki for centralized logging, a proper Alertmanager cluster with on-call rotation, and IPMI exporters for hardware-level metrics. Total cost: $15,000 in additional hardware and 240 engineering hours. The lesson: observability on bare metal is not a weekend project. It is a production system that requires the same care as the workloads it monitors.


After completing this module, you will be able to:

  1. Deploy a production-grade observability stack (Prometheus/Thanos, Loki, Alertmanager) sized for bare-metal cluster scale
  2. Configure IPMI and hardware-level exporters to monitor physical infrastructure alongside Kubernetes metrics
  3. Design a high-availability monitoring architecture that survives the failures it is meant to detect
  4. Implement on-call alerting pipelines with proper severity classification, escalation policies, and runbook integration

  • Self-hosted Prometheus architecture with Thanos for long-term storage
  • Grafana deployment at scale (dashboards, provisioning, multi-tenancy)
  • Loki for centralized logging (replacing CloudWatch/Stackdriver)
  • Alertmanager configuration with on-call rotation
  • IPMI exporter for hardware-level monitoring
  • Capacity planning for the monitoring stack itself

A single Prometheus instance cannot scale to large bare metal clusters. Thanos extends Prometheus with global querying, long-term storage, and high availability.

+---------------------------------------------------------------+
|               PROMETHEUS + THANOS ARCHITECTURE                |
|                                                               |
|  Per-cluster Prometheus instances (scraping):                 |
|                                                               |
|  ┌──────────┐    ┌──────────┐    ┌──────────┐                 |
|  │Prometheus│    │Prometheus│    │Prometheus│                 |
|  │  (HA-a)  │    │  (HA-b)  │    │ (infra)  │                 |
|  │ workers  │    │ workers  │    │ctrl+stor │                 |
|  │   1-50   │    │   1-50   │    │          │                 |
|  └────┬─────┘    └────┬─────┘    └────┬─────┘                 |
|       │ sidecar       │ sidecar       │ sidecar               |
|  ┌────▼─────┐    ┌────▼─────┐    ┌────▼─────┐                 |
|  │  Thanos  │    │  Thanos  │    │  Thanos  │                 |
|  │ Sidecar  │    │ Sidecar  │    │ Sidecar  │                 |
|  └────┬─────┘    └────┬─────┘    └────┬─────┘                 |
|       │               │               │                       |
|       ▼               ▼               ▼                       |
|  ┌──────────────────────────────────────────┐                 |
|  │          Thanos Query (global)           │                 |
|  │     Deduplicates HA pairs, fans out      │                 |
|  └─────────────┬────────────────────────────┘                 |
|                │                                              |
|       ┌────────┴────────┐                                     |
|       ▼                 ▼                                     |
|  ┌─────────┐     ┌──────────────┐                             |
|  │ Grafana │     │ Thanos Store │──> Object Storage (MinIO)   |
|  │         │     │   Gateway    │    (long-term, cheap)       |
|  └─────────┘     └──────────────┘                             |
|                                                               |
|  ┌──────────────┐                                             |
|  │Thanos Compact│  Downsamples old data:                      |
|  │              │  raw -> 5m -> 1h                            |
|  └──────────────┘  Saves 90%+ storage                         |
+---------------------------------------------------------------+

Pause and predict: A single Prometheus instance with 15-day retention crashed under 800K samples/second. What architectural change would prevent this from happening again, and how would you handle the need for 1-year retention?

This configuration is optimized for bare-metal clusters: a 30-second scrape interval (15s is overkill for most infrastructure metrics), HA labeling for Thanos deduplication, and scrape targets that include IPMI and SMART exporters unique to physical infrastructure.

# prometheus.yaml: optimized for bare metal
global:
  scrape_interval: 30s       # 15s is overkill for most bare metal
  evaluation_interval: 30s
  external_labels:
    cluster: production
    replica: ha-a            # for Thanos deduplication

# Scrape configs for bare metal targets
scrape_configs:
  # Kubernetes service discovery
  - job_name: kubelet
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Node exporter (system metrics)
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: node-exporter
        action: keep

  # IPMI exporter (hardware metrics)
  # BMCs speak IPMI, not HTTP, so use the multi-target exporter pattern
  # where Prometheus scrapes the ipmi-exporter and passes the BMC address
  # as a URL parameter (similar to blackbox-exporter)
  - job_name: ipmi
    static_configs:
      - targets:
          - bmc-worker-01
          - bmc-worker-02
          # ... all BMC addresses (no port; these are IPMI targets)
    metrics_path: /ipmi
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: ipmi-exporter:9290   # the actual exporter service

  # SMART disk metrics (via standalone smartctl_exporter)
  # Note: if using Node Exporter's textfile collector for SMART data,
  # remove this job; the metrics are already scraped by the node-exporter job above
  - job_name: smartmon
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: (.+):(.+)
        target_label: __address__
        replacement: $1:9633
    metrics_path: /metrics
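
The three relabel rules in the ipmi job are easier to follow when traced for a single target. The sketch below is a simplified model, not Prometheus's relabeling engine: it ignores regex matching and every action other than the default replace. The function names are hypothetical, and the order of the query parameters in the resulting URL is illustrative only.

```python
# Trace the ipmi job's relabel rules for one BMC target.
# Simplified model: real relabeling also handles regexes, separators,
# and actions other than the default "replace".

EXPORTER_ADDRESS = "ipmi-exporter:9290"  # exporter service named in the config


def relabel_ipmi_target(bmc_address: str) -> dict:
    labels = {"__address__": bmc_address}
    # Rule 1: copy the BMC address into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the BMC name as the instance label on every metric.
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the actual HTTP scrape to the exporter service.
    labels["__address__"] = EXPORTER_ADDRESS
    return labels


def scrape_url(labels: dict, metrics_path: str = "/ipmi", module: str = "default") -> str:
    return (
        f"http://{labels['__address__']}{metrics_path}"
        f"?module={module}&target={labels['__param_target']}"
    )


labels = relabel_ipmi_target("bmc-worker-01")
print(labels["instance"])  # bmc-worker-01
print(scrape_url(labels))  # http://ipmi-exporter:9290/ipmi?module=default&target=bmc-worker-01
```

The point of the pattern is that the metric keeps the BMC's name as its instance label even though the HTTP request goes to the exporter.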

Deploy each Prometheus as a StatefulSet with a Thanos sidecar container. Key Prometheus flags for Thanos compatibility:

  • --storage.tsdb.retention.time=48h (short local retention, Thanos handles long-term)
  • --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h (required for Thanos block upload)
  • --web.enable-lifecycle (allows Thanos sidecar to trigger reloads)

The sidecar uploads completed TSDB blocks to object storage (MinIO) and serves real-time data via the Thanos StoreAPI. Configure the object store connection in a Secret with S3-compatible endpoint, bucket name, and credentials.


To operate Grafana reliably for multiple teams on bare metal, avoid manual dashboard creation. Instead, manage Grafana as code.

Use Grafana’s provisioning feature to load dashboards and datasources automatically from ConfigMaps. For multi-tenancy, configure Grafana Organizations or Teams, mapping corporate OIDC/LDAP groups to specific Grafana roles. To run Grafana as a highly available pair (as recommended in the sizing guidelines), configure it to use a shared external database like PostgreSQL rather than the default local SQLite, ensuring user sessions and dashboard states survive pod restarts.


Loki replaces CloudWatch Logs and Stackdriver Logging. Unlike Elasticsearch, Loki indexes only metadata (labels), not the full log text, making it dramatically cheaper to operate.

+---------------------------------------------------------------+
|                   LOKI LOGGING ARCHITECTURE                   |
|                                                               |
|  ┌──────────┐    ┌──────────┐    ┌──────────┐                 |
|  │ Promtail │    │ Promtail │    │ Promtail │  (DaemonSet)    |
|  │ worker-01│    │ worker-02│    │ worker-03│  Tails container|
|  │          │    │          │    │          │  logs from      |
|  └────┬─────┘    └────┬─────┘    └────┬─────┘  /var/log/pods/ |
|       │               │               │                       |
|       └───────────────┼───────────────┘                       |
|                       ▼                                       |
|                ┌──────────────┐                               |
|                │     Loki     │  Stores log streams           |
|                │   (3 pods,   │  indexed by labels only       |
|                │   HA mode)   │  (namespace, pod, container)  |
|                └──────┬───────┘                               |
|                       │                                       |
|                       ▼                                       |
|                ┌──────────────┐                               |
|                │ Object Store │  Chunks stored in MinIO       |
|                │   (MinIO)    │  or local filesystem          |
|                └──────────────┘                               |
|                                                               |
|                ┌──────────────┐                               |
|                │   Grafana    │  Query logs alongside metrics |
|                │              │  using LogQL                  |
|                └──────────────┘                               |
+---------------------------------------------------------------+

Stop and think: Loki indexes only labels (namespace, pod, container), not the full log text. This makes it 10-100x cheaper than Elasticsearch. What is the trade-off? When would this design choice make troubleshooting harder?

This configuration uses S3-compatible storage (MinIO) for log chunks and the TSDB index format for efficient queries. The retention_period controls how long logs are kept, and per_stream_rate_limit prevents a single noisy application from overwhelming the ingestion pipeline.

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    s3:
      endpoint: minio.storage.svc.cluster.local:9000
      bucketnames: loki-chunks
      access_key_id: ${MINIO_ACCESS_KEY}        # expanded only with -config.expand-env=true
      secret_access_key: ${MINIO_SECRET_KEY}
      insecure: true
      s3forcepathstyle: true

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 30d         # keep logs for 30 days (enforced by the compactor
                                # only when it runs with retention_enabled: true)
  max_query_lookback: 30d
  ingestion_rate_mb: 10         # per-tenant ingestion rate
  per_stream_rate_limit: 3MB

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache

Promtail runs as a DaemonSet, mounting /var/log and /var/log/pods as read-only host volumes. It tails container logs and ships them to Loki with labels extracted from the Kubernetes metadata (namespace, pod, container name). To improve query performance, you can configure Promtail pipeline_stages to extract additional high-value labels like level or component.

If queries for older logs become extremely slow (e.g., 30+ seconds), check these common bottlenecks:

  1. Missing Chunk Cache: Loki reads chunks from MinIO for every historical query. Deploying a Memcached cluster (e.g., 3 pods) for chunk caching is a quick win that typically reduces query times by 5-10x.
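  2. Too Few Label Indexes: If logs only have namespace and pod labels, Loki must scan massive chunks. Add a few low-cardinality labels (such as level) in Promtail; avoid high-cardinality values like request or user IDs, which multiply the number of streams and slow Loki down.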
  3. Object Storage Latency: If MinIO disks are shared with other workloads, I/O contention will stall Loki. Ensure MinIO has dedicated disks.
  4. Large Chunk Size: The default chunk_target_size (1.5MB) may be too large for your ingestion rate; reducing it can speed up queries.
  5. Legacy Indexing: Ensure you are using the tsdb index format, not the legacy BoltDB shipper.
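
Why a label like level is safe to add while a request or user ID is not comes down to multiplication: in the worst case, each new label splits every existing stream once per distinct value. The sketch below shows that upper bound; the stream and value counts are hypothetical, and real clusters usually land below the bound because not every combination occurs.

```python
def streams_after_adding(existing_streams: int, distinct_values: int) -> int:
    """Worst case: every existing stream splits once per new label value."""
    return existing_streams * distinct_values


existing = 1_000  # hypothetical: streams from namespace/pod/container labels

with_level = streams_after_adding(existing, 4)          # info, warn, error, debug
with_user_id = streams_after_adding(existing, 10_000)   # one value per user

print(with_level)    # 4000
print(with_user_id)  # 10000000
```

A four-value label keeps the stream count manageable; an unbounded one produces millions of tiny streams, which is the opposite of what a slow-query fix needs.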

On-premises environments often cannot use cloud-based alerting services due to network isolation or compliance requirements. Alertmanager ships with direct integrations, such as email (SMTP) and generic webhooks, that work without any cloud service.

Alertmanager routes alerts based on labels. Configure multiple receivers with escalation:

  • Hardware critical alerts (severity=critical, category=hardware): webhook to on-call system + email, repeat every 15 minutes
  • Application critical alerts: webhook + email, repeat every 30 minutes
  • Warnings: email only, repeat every 24 hours

Group alerts by alertname, cluster, and namespace to reduce noise. Use group_wait: 30s to batch alerts that fire simultaneously (e.g., multiple nodes in the same rack losing power). Ensure every alert rule includes a runbook_url annotation linking directly to the mitigation steps, so on-call engineers have immediate access to remediation procedures.
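
The routing policy above can be modelled as an ordered list of label matchers. The sketch below illustrates the policy only; it is not Alertmanager's implementation or its configuration format. The Route type, the fallback route, and the sample alert are assumptions made for the example.

```python
from typing import NamedTuple


class Route(NamedTuple):
    match: dict           # labels an alert must carry to take this route
    receivers: tuple      # notification channels
    repeat_interval: str  # how often to re-notify while the alert fires


# Evaluated top to bottom; the first matching route wins, so the most
# specific matcher (critical + hardware) has to come first.
ROUTES = [
    Route({"severity": "critical", "category": "hardware"}, ("webhook", "email"), "15m"),
    Route({"severity": "critical"}, ("webhook", "email"), "30m"),
    Route({"severity": "warning"}, ("email",), "24h"),
]
# Assumption: alerts without a known severity fall back to email only.
DEFAULT_ROUTE = Route({}, ("email",), "24h")

GROUP_BY = ("alertname", "cluster", "namespace")


def pick_route(labels: dict) -> Route:
    for route in ROUTES:
        if all(labels.get(key) == value for key, value in route.match.items()):
            return route
    return DEFAULT_ROUTE


def group_key(labels: dict) -> tuple:
    """Alerts sharing this key are batched into one notification."""
    return tuple(labels.get(name, "") for name in GROUP_BY)


psu = {"alertname": "PSUFailure", "severity": "critical", "category": "hardware",
       "cluster": "production", "instance": "bmc-worker-01"}
print(pick_route(psu).repeat_interval)  # 15m
print(group_key(psu))                   # ('PSUFailure', 'production', '')
```

Because instance is not part of the grouping key, ten PSU alerts from one rack collapse into a single notification instead of ten pages.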

When the Kubernetes cluster resides in a datacenter VLAN isolated from the corporate network, alerts sent to an internal SMTP server (e.g., on port 587) may be silently dropped by firewalls. To diagnose this, check the Alertmanager logs (kubectl logs -n monitoring alertmanager-0) for connection timeouts, or use a busybox pod with nc -zv to test SMTP reachability. If you cannot open the firewall, the most robust fix is to deploy a local SMTP relay (like Postfix) in the monitoring namespace that is explicitly allowed to forward mail to the corporate server, or to switch entirely to webhook-based notifications.

+---------------------------------------------------------------+
|                   SELF-HOSTED ON-CALL STACK                   |
|                                                               |
|  Alertmanager                                                 |
|    │                                                          |
|    ▼ webhook                                                  |
|  Grafana OnCall (open-source, self-hosted)                    |
|    │                                                          |
|    ├──> Slack notification                                    |
|    ├──> Email notification                                    |
|    ├──> Phone call (via Twilio integration)                   |
|    └──> SMS (via Twilio integration)                          |
|                                                               |
|  On-call schedules:                                           |
|  - Primary: rotates weekly                                    |
|  - Secondary: backup, rotates opposite week                   |
|  - Escalation: if no ack in 10 min, page secondary            |
|  - If no ack in 20 min, page engineering manager              |
|                                                               |
+---------------------------------------------------------------+
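
The escalation schedule in the diagram reduces to two time thresholds. A minimal sketch, assuming the primary is paged the moment the alert fires and that thresholds are inclusive; the function name is hypothetical and this is not how Grafana OnCall is configured.

```python
def paged_so_far(minutes_without_ack: float) -> list:
    """Who has been paged after this many minutes without an acknowledgement."""
    paged = ["primary"]  # paged as soon as the alert fires
    if minutes_without_ack >= 10:
        paged.append("secondary")
    if minutes_without_ack >= 20:
        paged.append("engineering manager")
    return paged


for minute in (0, 10, 20):
    print(minute, paged_so_far(minute))
# 0 ['primary']
# 10 ['primary', 'secondary']
# 20 ['primary', 'secondary', 'engineering manager']
```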

The IPMI exporter exposes BMC sensor data as Prometheus metrics, giving you visibility into temperatures, fan speeds, voltages, and PSU status.

Deploy the prometheuscommunity/ipmi-exporter as a Deployment in the monitoring namespace. Configure it with BMC credentials stored in a Kubernetes Secret, using the LAN_2_0 driver with bmc, ipmi, and dcmi collectors. Prometheus scrapes each BMC address through the exporter’s /ipmi endpoint.

+---------------------------------------------------------------+
|                     CRITICAL IPMI METRICS                     |
|                                                               |
|  Metric                        Alert Threshold                |
|  ──────────────────────────────────────────────────           |
|  ipmi_temperature_celsius      > 85 (CPU)                     |
|  {name="CPU Temp"}             > 45 (ambient)                 |
|                                                               |
|  ipmi_fan_speed_rpm            < 1000 (fan failure)           |
|  {name="Fan 1"}                                               |
|                                                               |
|  ipmi_voltage_volts            +/- 10% of nominal             |
|  {name="12V"}                  (11.4V warning)                |
|                                                               |
|  ipmi_power_watts              > 90% of PSU rated             |
|  {name="System Power"}         capacity                       |
|                                                               |
|  ipmi_sensor_state             != 0 (any critical state)      |
|  {name="PSU Status"}                                          |
|                                                               |
+---------------------------------------------------------------+
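
To make the thresholds concrete, the sketch below applies them to one set of sensor readings. In production these checks would live in Prometheus alert rules; this is only an illustration of the comparisons. The dictionary keys and sample readings are hypothetical, and the voltage check uses the +/- 10% band from the table.

```python
NOMINAL_12V = 12.0


def hardware_alerts(reading: dict) -> list:
    """Return the threshold violations for one server's sensor readings."""
    alerts = []
    if reading["cpu_temp_c"] > 85:
        alerts.append("cpu temperature")
    if reading["ambient_temp_c"] > 45:
        alerts.append("ambient temperature")
    if reading["fan_rpm"] < 1000:
        alerts.append("fan failure")
    if abs(reading["rail_12v"] - NOMINAL_12V) > 0.10 * NOMINAL_12V:
        alerts.append("12V rail out of range")
    if reading["power_w"] > 0.90 * reading["psu_rated_w"]:
        alerts.append("power near PSU capacity")
    if reading["psu_state"] != 0:
        alerts.append("PSU sensor state")
    return alerts


healthy = {"cpu_temp_c": 62, "ambient_temp_c": 24, "fan_rpm": 5400,
           "rail_12v": 12.1, "power_w": 410, "psu_rated_w": 750, "psu_state": 0}
failing = {"cpu_temp_c": 91, "ambient_temp_c": 30, "fan_rpm": 0,
           "rail_12v": 10.5, "power_w": 700, "psu_rated_w": 750, "psu_state": 2}

print(hardware_alerts(healthy))  # []
print(hardware_alerts(failing))
```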

The monitoring stack itself needs resources. Undersizing it leads to the monitoring system failing when you need it most.

Prometheus TSDB is highly optimized, compressing raw samples down to an average of 2 bytes per sample. You can calculate your storage needs using this formula: samples_per_second * 2 bytes * 86,400 seconds. For example, a cluster ingesting 500,000 samples per second generates ~1 MB/s, or ~84 GB per day.

  • Local Storage (48 hours): Requires ~168 GB of fast NVMe storage per Prometheus replica.
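  • Long-Term Storage (1 year): Keeping 1 year of data locally would require ~30 TB of disk, which is expensive and slows down queries. Instead, Thanos offloads this to MinIO object storage. Thanos Compactor downsamples historical data (raw -> 5m -> 1h), and per-resolution retention (for example --retention.resolution-raw on the compactor) expires raw blocks after roughly 30 days. Downsampling by itself adds blocks rather than removing them; the saving comes from dropping the raw data. With that policy, 1 year of metrics for this cluster will consume approximately 3 TB of object storage.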
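
The sizing formula is simple enough to script for your own ingestion rate. A minimal sketch, assuming the 2 bytes per sample average quoted above and decimal units (1 GB = 10^9 bytes); the function name is hypothetical. The exact product for 500,000 samples per second is 86.4 GB per day, slightly above the rounded ~84 GB used in this module, and both should be read as planning estimates.

```python
SECONDS_PER_DAY = 86_400


def tsdb_gb_per_day(samples_per_second: float, bytes_per_sample: float = 2.0) -> float:
    """Estimated Prometheus TSDB growth per day, in decimal gigabytes."""
    return samples_per_second * bytes_per_sample * SECONDS_PER_DAY / 1e9


per_day = tsdb_gb_per_day(500_000)
print(f"per day:     {per_day:.1f} GB")               # 86.4 GB
print(f"48h local:   {2 * per_day:.1f} GB")           # 172.8 GB
print(f"1 year raw:  {365 * per_day / 1000:.1f} TB")  # 31.5 TB
```

Halving the bytes-per-sample assumption halves every figure, so measure your own compression ratio before buying disks.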

+---------------------------------------------------------------+
|            MONITORING STACK SIZING (per 100 nodes)            |
|                                                               |
|  Component         CPU      Memory  Disk     Notes            |
|  ──────────────────────────────────────────────────────────── |
|  Prometheus (x2)   4 CPU    16 GB   200 GB   HA pair, 48h ret |
|  Thanos Query      2 CPU    4 GB    -        Stateless        |
|  Thanos Store GW   2 CPU    8 GB    50 GB    Cache for S3     |
|  Thanos Compact    2 CPU    4 GB    100 GB   Downsampling     |
|  Loki (x3)         2 CPU    8 GB    50 GB    HA mode          |
|  Grafana (x2)      1 CPU    2 GB    -        HA pair          |
|  Alertmanager (x3) 0.5 CPU  1 GB    -        HA cluster       |
|  MinIO (x4)        2 CPU    8 GB    1000 GB  Object store     |
|  ──────────────────────────────────────────────────────────── |
|  Total             ~30 CPU  ~90 GB  ~1.6 TB                   |
|                                                               |
|  Rule of thumb: dedicate 3-5% of cluster resources            |
|  to observability                                             |
+---------------------------------------------------------------+

Pause and predict: Your monitoring stack itself needs 30 CPU and 90 GB RAM for a 100-node cluster. If monitoring runs on the same nodes as workloads and a major incident causes resource contention, what happens to your ability to diagnose the incident?

High cardinality (too many unique time series) is the primary cause of Prometheus OOM crashes. Each unique combination of metric name and label values creates a separate time series. On bare metal, common offenders are metrics with per-disk, per-NIC, or per-container labels.

# Find high-cardinality metrics (top 20)
# The endpoint returns 10 entries per statistic unless ?limit= is set;
# jq -r prints plain text instead of quoted JSON strings
curl -s 'http://prometheus:9090/api/v1/status/tsdb?limit=20' | jq -r '
  .data.seriesCountByMetricName |
  sort_by(-.value) |
  .[0:20] |
  .[] | "\(.name): \(.value) series"'

# Common offenders on bare metal:
#   container_* metrics with high pod churn
#   node_* metrics with many disk/interface labels
#   Custom metrics with unbounded label values (IP addresses, user IDs)
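
The same ranking can be done in a script, which makes it easy to add each metric's share of the total head series. The sketch below works on a hypothetical payload shaped like the /api/v1/status/tsdb response (the seriesCountByMetricName and headStats.numSeries fields); the metric names and counts are invented for illustration.

```python
def top_metrics(status: dict, limit: int = 20) -> list:
    """Return (metric name, series count, share of head series), largest first."""
    total = status["data"]["headStats"]["numSeries"]
    rows = sorted(
        status["data"]["seriesCountByMetricName"],
        key=lambda row: row["value"],
        reverse=True,
    )
    return [(row["name"], row["value"], row["value"] / total) for row in rows[:limit]]


# Hypothetical payload; in practice this is the parsed JSON from the API call.
sample = {
    "status": "success",
    "data": {
        "headStats": {"numSeries": 1_000_000},
        "seriesCountByMetricName": [
            {"name": "node_disk_io_time_seconds_total", "value": 48_000},
            {"name": "container_memory_usage_bytes", "value": 250_000},
            {"name": "http_requests_by_client_ip_total", "value": 400_000},
        ],
    },
}

for name, count, share in top_metrics(sample, limit=2):
    print(f"{name}: {count} series ({share:.0%} of head)")
# http_requests_by_client_ip_total: 400000 series (40% of head)
# container_memory_usage_bytes: 250000 series (25% of head)
```

A single metric holding a large share of the head is usually the first candidate for dropping a label via relabeling.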

  • Prometheus was created at SoundCloud in 2012 and donated to the CNCF in 2016. It was the second project to graduate after Kubernetes itself. Its pull-based scraping model was inspired by Google’s Borgmon, which monitored Borg (the predecessor to Kubernetes) internally at Google.

  • Thanos was named after the Marvel villain because it brings balance to the Prometheus universe — specifically, it balances the trade-off between local retention (fast queries) and long-term storage (cheap, durable). The project started at Improbable (a gaming technology company) in 2017.

  • Loki processes logs 10-100x cheaper than Elasticsearch for the same volume because it indexes only labels (namespace, pod, container), not the full log text. The trade-off is that full-text search requires scanning chunks — which is slower for ad-hoc queries but perfectly fast for targeted queries like “show me logs from pod X in namespace Y.”

  • IPMI (Intelligent Platform Management Interface) was first released in 1998 by Intel, HP, NEC, and Dell. Despite being nearly 30 years old, it remains the standard for out-of-band server management. The protocol runs on a dedicated BMC (Baseboard Management Controller) chip with its own network interface, CPU, and memory — essentially a computer within your computer that runs even when the main system is powered off.


| Mistake | Problem | Solution |
|---|---|---|
| Single Prometheus instance | SPOF: crash = no monitoring | Deploy HA pair with Thanos deduplication |
| 15-day local retention | Fills disk, crashes Prometheus | Use 48h local + Thanos for long-term |
| No log aggregation | Logs lost on container restart or node failure | Deploy Loki + Promtail DaemonSet |
| Alertmanager singleton | Missed alerts if Alertmanager crashes | Deploy 3-node Alertmanager cluster |
| Monitoring on same nodes as workloads | Resource contention during incidents | Dedicated monitoring nodes or guaranteed resources |
| No IPMI monitoring | Hardware failures are invisible until node goes down | Deploy IPMI exporter for temperature, PSU, fan metrics |
| Unbounded label cardinality | Prometheus OOM from millions of series | Drop high-cardinality labels via relabeling |
| No monitoring-of-monitoring | Monitoring stack fails silently | External black-box probe (ping from outside cluster) |

Your Prometheus instance is ingesting 500,000 samples per second with 30-second scrape intervals across 200 bare metal nodes. You need 1 year of metrics retention. How do you architect this?

Answer

Architecture: Prometheus HA pair + Thanos with MinIO object storage.

Sizing calculation:

  • 500,000 samples/sec * 2 bytes/sample (TSDB compressed) = ~1 MB/s
  • Per day: 1 MB/s * 86,400 = ~84 GB/day (The 2 bytes/sample figure already accounts for Prometheus TSDB compression; raw uncompressed samples are 16 bytes each)
  • 48-hour local retention: ~168 GB per Prometheus instance
  • 1 year in Thanos with downsampling:
    • Raw (first 30 days): 84 GB * 30 = ~2.5 TB
    • 5-minute downsampled (30-365 days): ~400 GB
    • 1-hour downsampled (optional, for very old data): ~60 GB
    • Total object storage: ~3 TB
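
The tier arithmetic can be checked in a few lines. The inputs below are the figures from this answer (84 GB/day raw, ~400 GB for the 5-minute tier, ~60 GB for the 1-hour tier); the downsampled sizes are the module's estimates, not values derived here.

```python
RAW_GB_PER_DAY = 84  # daily raw volume used in this answer

tiers_gb = {
    "raw, first 30 days": RAW_GB_PER_DAY * 30,
    "5m downsampled, days 30-365": 400,
    "1h downsampled (optional)": 60,
}

total_gb = sum(tiers_gb.values())
raw_year_gb = RAW_GB_PER_DAY * 365
saving = 1 - total_gb / raw_year_gb

print(f"object storage:      {total_gb / 1000:.2f} TB")    # 2.98 TB
print(f"raw kept for a year: {raw_year_gb / 1000:.1f} TB")  # 30.7 TB
print(f"saving:              {saving:.0%}")                 # 90%
```

Almost all of the 3 TB is the 30 days of raw data, so the raw retention window is the setting that moves the total most.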

Architecture:

  1. 2x Prometheus (HA pair, same config, external_labels differ only by replica)
  2. 2x Thanos Sidecar (one per Prometheus, uploads blocks to MinIO)
  3. 1x Thanos Query (deduplicates HA pair, queries both local and store)
  4. 1x Thanos Store Gateway (serves queries from MinIO)
  5. 1x Thanos Compactor (downsamples: raw -> 5m -> 1h)
  6. MinIO cluster (4 nodes, erasure coding, 4 TB usable)

Why not just increase Prometheus retention?

  • 1 year at 84 GB/day = ~30 TB local disk — expensive NVMe
  • Prometheus queries slow down with large TSDB
  • No HA: disk failure = data loss
  • MinIO with erasure coding is cheaper and fault-tolerant

Your Loki cluster is receiving logs from 200 nodes, but queries for logs older than 2 days are extremely slow (30+ seconds). What is likely wrong and how do you fix it?

Answer

Likely causes and fixes:

  1. No chunk caching: Loki reads chunks from object storage (MinIO) for every query. Without a cache, this means network I/O for every request.

    # Add chunk cache in loki-config.yaml
    # (connection settings belong under memcached_client)
    chunk_store_config:
      chunk_cache_config:
        memcached_client:
          host: memcached.monitoring.svc
          service: memcached
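  2. Too few label indexes: If most logs have the same label set (e.g., only namespace and pod), Loki must scan large chunks to filter. Add only low-cardinality labels; high-cardinality values multiply the number of streams.

    # Add more labels in Promtail
    # (an earlier json, regex, or logfmt stage must extract these fields first)
    pipeline_stages:
      - labels:
          level:       # promote extracted log level (info, error, warn) to a label
          component:   # promote extracted component name to a label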
  3. Large chunk size: Default chunk target size might be too large for your ingestion rate.

    ingester:
      chunk_target_size: 1572864   # 1.5 MB (default)
      # Consider reducing for faster queries at the cost of more chunks
  4. Object storage latency: MinIO might be slow due to disk I/O contention.

    • Check MinIO disk I/O: iostat -x 1
    • Ensure MinIO has dedicated disks (not shared with Ceph or other workloads)
  5. Missing TSDB index: Ensure you are using the TSDB index (not BoltDB) for better query performance.

    schema_config:
      configs:
        - store: tsdb   # not boltdb-shipper

Quick win: Deploy a Memcached cluster (3 pods, 4 GB each) for chunk caching. This alone typically reduces query times by 5-10x for historical data.

Your Alertmanager sends alerts via email, but the SMTP server is on the corporate network and your Kubernetes cluster is in a separate datacenter VLAN. Alerts are being silently dropped. How do you diagnose and fix this?

Answer

Diagnosis:

  1. Check Alertmanager logs:

    kubectl logs -n monitoring alertmanager-0 | grep -i "error\|fail\|smtp"
    # Look for: "connection refused", "timeout", "TLS handshake"
  2. Test SMTP connectivity from within the cluster:

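    # Single quotes keep $? from being expanded by your local shell;
    # --rm -it attaches to the pod so the result is printed, then cleans up
    kubectl run smtp-test --rm -it --image=busybox --restart=Never -- sh -c \
      'nc -zv smtp.internal 587; echo "exit code: $?"'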
  3. Check network policies or firewall rules blocking port 587 from the monitoring namespace.

Fix options:

  1. Open firewall: Allow port 587 from monitoring VLAN to corporate VLAN.

  2. Deploy a local SMTP relay inside the cluster:

    # Deploy Postfix as an SMTP relay in the monitoring namespace
    # Configure it to relay through the corporate SMTP server
    # Alertmanager sends to local relay (in-cluster, no firewall issue)
    # Relay forwards to corporate SMTP (firewall rule needed only for relay pod)
  3. Use webhook instead of email: Deploy a webhook receiver that posts to Slack, Teams, or a custom notification service.

  4. Alertmanager -> Grafana OnCall -> Twilio: If email is unreliable, use a notification chain that does not depend on corporate SMTP.

Prevention: Always test alerting during initial setup. Send a test alert and verify it arrives:

# Manually fire a test alert
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Test alert from setup verification"}}]'

You are building the monitoring stack for a new 150-node bare metal cluster. The infrastructure team asks: “Can we just use the existing Datadog agents?” What are the trade-offs of Datadog vs self-hosted on bare metal?

Answer

Trade-offs:

| Factor | Datadog | Self-hosted (Prometheus/Thanos/Loki) |
|---|---|---|
| Cost | $23/host/month * 150 = $41,400/year (infrastructure plan) | $15,000 hardware + ~$5,000/year ops |
| Setup time | Days | Weeks |
| Maintenance | Zero (SaaS) | Ongoing (upgrades, capacity, failures) |
| Data residency | Data leaves your network | Data stays on-premises |
| Network dependency | Requires internet egress | Works air-gapped |
| Retention | 15 months (paid tier) | Unlimited (limited by storage) |
| Customization | Limited to Datadog features | Full control |
| Hardware metrics | Requires custom integration | IPMI exporter built for this |
| Compliance | May not meet data sovereignty requirements | Full control over data location |
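
The cost row is easier to compare over several years. A minimal sketch using only the figures from the table; the function names are hypothetical, and the self-hosted number leaves out engineering time, which the opening story of this module shows is not negligible.

```python
def datadog_cost(hosts: int, years: int, per_host_month: int = 23) -> int:
    """Subscription cost in dollars at the table's per-host price."""
    return per_host_month * hosts * 12 * years


def self_hosted_cost(years: int, hardware: int = 15_000, ops_per_year: int = 5_000) -> int:
    """One-off hardware plus yearly operations; excludes engineering hours."""
    return hardware + ops_per_year * years


for years in (1, 3):
    print(years, datadog_cost(150, years), self_hosted_cost(years))
# 1 41400 20000
# 3 124200 30000
```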

Recommendation depends on context:

  • Use Datadog if: Budget allows, no data sovereignty requirements, small team without monitoring expertise, internet egress is available.

  • Use self-hosted if: Data must stay on-premises (healthcare, finance, government), air-gapped environment, cost-sensitive at scale (150+ nodes), team has Prometheus expertise, need deep hardware-level monitoring (IPMI, SMART).

  • Hybrid option: Datadog for application APM/tracing, self-hosted Prometheus for infrastructure and hardware metrics. This gives you the best APM with full hardware visibility.

Most on-premises Kubernetes deployments choose self-hosted because the primary reason for running on-premises (data control, compliance, cost) also applies to monitoring data.


Hands-On Exercise: Deploy a Monitoring Stack

Task: Deploy Prometheus, Grafana, and Alertmanager on a kind cluster.

# Create a kind cluster
kind create cluster --name monitoring-lab

# Install kube-prometheus-stack via Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --wait \
  --timeout 10m \
  --set grafana.adminPassword=admin \
  --set prometheus.prometheusSpec.retention=24h

  1. Verify all components are running:

    kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s
    kubectl get pods -n monitoring
  2. Access Grafana: Open a second terminal window to run the port-forward without blocking your prompt:

    kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
    # Open http://localhost:3000 (admin/admin)
  3. Explore the pre-built dashboards for node metrics, pod metrics, and Kubernetes components.

  4. Create a test alert:

    kubectl apply -f - <<'EOF'
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: test-alert
      namespace: monitoring
      labels:
        release: monitoring   # Required for Prometheus Operator to discover this rule
    spec:
      groups:
        - name: test
          rules:
            - alert: HighCPU
              expr: node_cpu_seconds_total > 0
              for: 1m
              labels:
                severity: warning
              annotations:
                summary: "Test alert: CPU is being used"
                runbook_url: "https://internal-wiki.example.com/runbooks/high-cpu"
    EOF
  5. Verify the alert fires in Alertmanager: Open a third terminal window for this port-forward to avoid interrupting Grafana:

    kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-alertmanager 9093:9093
    # Open http://localhost:9093

Success criteria:

  • Prometheus is scraping all cluster targets
  • Grafana shows node and pod metrics on dashboards
  • Alertmanager is receiving and displaying alerts
  • Understand the difference between Prometheus local storage and Thanos long-term storage
  • Can explain why IPMI exporter is essential for bare metal but irrelevant in the cloud

Cleanup:

kind delete cluster --name monitoring-lab

Continue to Module 7.5: Capacity Expansion & Hardware Refresh to learn how to add new racks, handle mixed CPU generations, and plan hardware refresh cycles.