Module 1.2: Instrumentation & Alerting

PCA Track | Complexity: [COMPLEX] | Time: 45-55 min

Prerequisites

Before starting this module:

Prometheus Module — architecture, metric types, basic alerting
PromQL Deep Dive — query fundamentals
Observability 3.3: Instrumentation Principles
Basic Go, Python, or Java knowledge (for client library examples)

What You’ll Be Able to Do

After completing this module, you will be able to:

Instrument an application with Prometheus client libraries, choosing the correct metric type (counter, gauge, histogram, summary) for each signal
Apply naming conventions, label design, and cardinality controls that keep metrics usable and storage costs predictable
Configure alerting rules with proper for durations, severity labels, and runbook annotations that minimize false positives and pager fatigue
Design a metrics-based SLO alerting pipeline using multi-window burn-rate alerts for error budgets

The database team at a large ride-sharing company added a custom Prometheus metric to track query latency. They were proud of it — db_query_duration_milliseconds. It worked perfectly in development.

Three weeks later, the infrastructure team tried to create an SLO dashboard combining API latency (measured in seconds using http_request_duration_seconds) with database latency. The query looked simple:

histogram_quantile(0.99,
  sum by (le)(rate(http_request_duration_seconds_bucket[5m]))
)
+
histogram_quantile(0.99,
  sum by (le)(rate(db_query_duration_milliseconds_bucket[5m]))
)

The P99 total latency showed 3,000.2 seconds. It took 45 minutes of confusion before someone realized: one metric was in seconds, the other in milliseconds. The query was adding 0.2 seconds of API latency to 3,000 milliseconds of DB latency. The result was mathematically correct but semantically nonsensical.

The fix required a migration: rename the metric, update all dashboards, modify alerting rules, and coordinate a rolling deployment across 400 pods. Two engineer-weeks of work, all because of a naming convention violation.

The Prometheus naming convention exists for exactly this reason: always use base units — seconds, bytes, meters. Never milliseconds, kilobytes, or centimeters. This war story gets told to every new hire now.

Why This Module Matters

Instrumentation and alerting are 34% of the PCA exam combined (16% Instrumentation + 18% Alerting). But beyond the exam, these are the skills that determine whether your monitoring actually works. Bad instrumentation creates metrics nobody can use. Bad alerting creates noise that trains your team to ignore pages.

This module covers the full lifecycle: choosing the right metric type, naming it correctly, exposing it from your application, collecting it with exporters, and routing alerts when things go wrong.

Did You Know?

Prometheus client libraries exist for 15+ languages — Go, Python, Java, Ruby, Rust, .NET, Erlang, Haskell, and more. The Go library is the reference implementation; others follow the same patterns.
node_exporter exposes over 1,000 metrics on a typical Linux system — CPU, memory, disk, network, filesystem, and more. It’s the single most deployed Prometheus exporter.
Alertmanager was designed around the concept of “routing trees” inspired by email routing. The tree structure lets you route different alert severities to different teams using a single configuration file.
The _total suffix on counters was originally optional but became mandatory in the OpenMetrics standard. Modern Prometheus (2.x) adds it automatically if missing, but you should always include it in your instrumentation.

The Four Metric Types

Counter

A counter is a cumulative metric that only goes up (and resets to zero on restart).

COUNTER: Monotonically increasing value
──────────────────────────────────────────────────────────────

Value over time:
  0 → 1 → 5 → 12 → 30 → 0 → 3 → 15 → 28
                          ↑
                     restart/reset

USE WHEN:
  ✓ Counting events (requests, errors, bytes sent)
  ✓ Counting completions (jobs finished, items processed)
  ✓ Anything that only goes up during normal operation

DON'T USE WHEN:
  ✗ Value can decrease (temperature, queue size)
  ✗ Value represents current state (active connections)

ALWAYS QUERY WITH rate() or increase():
  rate(http_requests_total[5m])      ← per-second rate
  increase(http_requests_total[1h])  ← total in last hour

Gauge

A gauge is a metric that can go up and down.

GAUGE: Current value that can increase or decrease
──────────────────────────────────────────────────────────────

Value over time:
  42 → 38 → 55 → 71 → 63 → 48 → 52

USE WHEN:
  ✓ Current state (temperature, queue depth, active connections)
  ✓ Snapshots (memory usage, disk space, goroutine count)
  ✓ Values that go up AND down

DON'T USE WHEN:
  ✗ Counting events (use Counter)
  ✗ Measuring distributions (use Histogram)

QUERY DIRECTLY (no rate needed):
  node_memory_MemAvailable_bytes     ← current available memory
  kube_deployment_spec_replicas      ← desired replica count

Histogram

A histogram samples observations and counts them in configurable buckets.

HISTOGRAM: Distribution of values in buckets
──────────────────────────────────────────────────────────────

Generates 3 types of series:
  metric_bucket{le="0.1"}   = 24054    ← cumulative count ≤ 0.1s
  metric_bucket{le="0.5"}   = 129389   ← cumulative count ≤ 0.5s
  metric_bucket{le="+Inf"}  = 144927   ← total count (all observations)
  metric_sum                 = 53423.4  ← sum of all observed values
  metric_count               = 144927   ← total number of observations

USE WHEN:
  ✓ Request latency (the primary use case)
  ✓ Response sizes
  ✓ Any distribution where you need percentiles
  ✓ SLO calculations (bucket at your SLO target)

ADVANTAGES:
  ✓ Aggregatable across instances (can sum buckets)
  ✓ Can calculate any percentile after the fact
  ✓ Can compute average (sum / count)

TRADE-OFFS:
  ✗ Fixed bucket boundaries chosen at instrumentation time
  ✗ Each bucket is a separate time series (cardinality cost)
  ✗ Percentile accuracy depends on bucket granularity

Summary

A summary calculates quantiles on the client side.

SUMMARY: Client-computed quantiles
──────────────────────────────────────────────────────────────

Generates series like:
  metric{quantile="0.5"}   = 0.042    ← median
  metric{quantile="0.9"}   = 0.087    ← P90
  metric{quantile="0.99"}  = 0.235    ← P99
  metric_sum                = 53423.4  ← sum of all observed values
  metric_count              = 144927   ← total number of observations

USE WHEN:
  ✓ You need exact quantiles from a single instance
  ✓ You can't choose histogram bucket boundaries upfront
  ✓ Streaming quantile algorithms are acceptable

DON'T USE WHEN (most of the time):
  ✗ You need to aggregate across instances
     (cannot add quantiles meaningfully!)
  ✗ You need flexible percentile calculation at query time
  ✗ You need SLO calculations

PREFER HISTOGRAMS. Summaries exist for legacy reasons.

Decision Framework: Which Type?

CHOOSING A METRIC TYPE
──────────────────────────────────────────────────────────────

Does the value only go up?
├── YES → Is it counting events/completions?
│         ├── YES → COUNTER (with _total suffix)
│         └── NO  → Probably still a COUNTER
└── NO  → Can the value go up AND down?
          ├── YES → Is it a current state/snapshot?
          │         ├── YES → GAUGE
          │         └── NO  → GAUGE (probably)
          └── Do you need distribution/percentiles?
                    ├── YES → HISTOGRAM (almost always)
                    │         └── Summary only if you truly
                    │             can't define buckets upfront
                    └── NO  → GAUGE

Client Library Instrumentation

Go (Reference Implementation)

package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter: total HTTP requests
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "myapp_http_requests_total",
            Help: "Total number of HTTP requests.",
        },
        []string{"method", "status", "path"},
    )

    // Histogram: request latency
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "myapp_http_request_duration_seconds",
            Help:    "HTTP request latency in seconds.",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
        },
        []string{"method", "path"},
    )

    // Gauge: active connections
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "myapp_active_connections",
            Help: "Number of currently active connections.",
        },
    )
)

func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    activeConnections.Inc()
    defer activeConnections.Dec()

    // ... handle request ...
    w.WriteHeader(http.StatusOK)

    duration := time.Since(start).Seconds()
    httpRequestsTotal.WithLabelValues(r.Method, "200", r.URL.Path).Inc()
    httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
}

func main() {
    http.HandleFunc("/", handler)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Python

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Counter: total HTTP requests
REQUEST_COUNT = Counter(
    'myapp_http_requests_total',
    'Total number of HTTP requests.',
    ['method', 'status', 'path']
)

# Histogram: request latency
REQUEST_LATENCY = Histogram(
    'myapp_http_request_duration_seconds',
    'HTTP request latency in seconds.',
    ['method', 'path'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
)

# Gauge: active connections
ACTIVE_CONNECTIONS = Gauge(
    'myapp_active_connections',
    'Number of currently active connections.'
)

def handle_request(method, path):
    ACTIVE_CONNECTIONS.inc()
    start = time.time()

    # ... handle request ...
    status = "200"

    duration = time.time() - start
    REQUEST_COUNT.labels(method=method, status=status, path=path).inc()
    REQUEST_LATENCY.labels(method=method, path=path).observe(duration)
    ACTIVE_CONNECTIONS.dec()

# Start metrics server on port 8000
start_http_server(8000)

# For Flask/FastAPI, use middleware instead:
# from prometheus_flask_instrumentator import Instrumentator
# Instrumentator().instrument(app).expose(app)

Java (Micrometer / simpleclient)

import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;

public class MyApp {
    // Counter: total HTTP requests
    static final Counter requestsTotal = Counter.build()
        .name("myapp_http_requests_total")
        .help("Total number of HTTP requests.")
        .labelNames("method", "status", "path")
        .register();

    // Histogram: request latency
    static final Histogram requestDuration = Histogram.build()
        .name("myapp_http_request_duration_seconds")
        .help("HTTP request latency in seconds.")
        .labelNames("method", "path")
        .buckets(.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5)
        .register();

    // Gauge: active connections
    static final Gauge activeConnections = Gauge.build()
        .name("myapp_active_connections")
        .help("Number of currently active connections.")
        .register();

    public void handleRequest(String method, String path) {
        activeConnections.inc();
        Histogram.Timer timer = requestDuration
            .labels(method, path)
            .startTimer();

        try {
            // ... handle request ...
            requestsTotal.labels(method, "200", path).inc();
        } finally {
            timer.observeDuration();
            activeConnections.dec();
        }
    }

    public static void main(String[] args) throws Exception {
        // Expose metrics on port 8000
        HTTPServer server = new HTTPServer(8000);
    }
}

Metric Naming Conventions

The Rules

PROMETHEUS NAMING CONVENTION
──────────────────────────────────────────────────────────────

Format: <namespace>_<name>_<unit>_<suffix>

namespace  = application or library name (myapp, http, node)
name       = what is being measured (requests, duration, size)
unit       = base unit (seconds, bytes, meters — NEVER milli/kilo)
suffix     = metric type indicator (_total for counters, _info for info)

GOOD:
  myapp_http_requests_total              ← counter, counts requests
  myapp_http_request_duration_seconds    ← histogram, duration in seconds
  myapp_http_response_size_bytes         ← histogram, size in bytes
  node_memory_MemAvailable_bytes         ← gauge, memory in bytes
  process_cpu_seconds_total              ← counter, CPU time in seconds

BAD:
  myapp_requests                         ← missing unit, missing _total
  http_request_duration_milliseconds     ← use seconds, not milliseconds
  db_query_time_ms                       ← abbreviation, non-base unit
  MyApp_HTTP_Requests                    ← camelCase/PascalCase, use snake_case
  request_latency                        ← vague, missing namespace and unit

Unit Rules

Measurement	Base Unit	Suffix	Example
Time/Duration	seconds	`_seconds`	`http_request_duration_seconds`
Data size	bytes	`_bytes`	`http_response_size_bytes`
Temperature	celsius	`_celsius`	`room_temperature_celsius`
Voltage	volts	`_volts`	`power_supply_volts`
Energy	joules	`_joules`	`cpu_energy_joules`
Weight	grams	`_grams`	`package_weight_grams`
Ratios	ratio	`_ratio`	`cache_hit_ratio`
Percentages	ratio (0-1)	`_ratio`	Use 0-1, not 0-100

Suffix Rules

Type	Suffix	Example
Counter	`_total`	`http_requests_total`
Counter (created timestamp)	`_created`	`http_requests_created`
Histogram	`_bucket`, `_sum`, `_count`	`http_request_duration_seconds_bucket`
Summary	`_sum`, `_count`	`rpc_duration_seconds_sum`
Info metric	`_info`	`build_info{version="1.2.3"}`
Gauge	(no suffix)	`node_memory_MemAvailable_bytes`

Label Best Practices

LABEL DO'S AND DON'TS
──────────────────────────────────────────────────────────────

DO:
  ✓ Use labels for dimensions you'll filter/aggregate by
  ✓ Keep cardinality bounded (status codes: ~5 values)
  ✓ Use consistent names: "method" not "http_method" in one
    place and "request_method" in another

DON'T:
  ✗ user_id (millions of values = millions of series)
  ✗ request_id (unbounded, every request creates a series)
  ✗ email (PII + unbounded cardinality)
  ✗ url with path parameters (/users/12345 = unique per user)
  ✗ error_message (free-form text = unbounded)
  ✗ timestamp as label (infinite cardinality)

RULE OF THUMB:
  If a label can have more than ~100 unique values,
  it probably shouldn't be a label.
  Each unique label combination = one time series in memory.

Exporters

node_exporter (Hardware & OS Metrics)

The most widely deployed exporter — runs on every node to expose hardware and OS metrics.

# Install via binary
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
./node_exporter

# Or via Kubernetes DaemonSet (kube-prometheus-stack includes it)
helm install monitoring prometheus-community/kube-prometheus-stack

Key metrics from node_exporter:

# CPU utilization
1 - avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory utilization
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk space usage
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})

# Network throughput
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

blackbox_exporter (Probing)

Probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP. Useful for monitoring things you don’t control.

# blackbox-exporter config
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      follow_redirects: true

  http_post_2xx:
    prober: http
    http:
      method: POST

  tcp_connect:
    prober: tcp
    timeout: 5s

  dns_lookup:
    prober: dns
    dns:
      query_name: "kubernetes.default.svc.cluster.local"
      query_type: "A"

  icmp_ping:
    prober: icmp
    timeout: 5s

Prometheus scrape config for blackbox_exporter:

scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com
        - https://api.myservice.com/health
    relabel_configs:
      # Pass the target URL as a parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Store original target as a label
      - source_labels: [__param_target]
        target_label: instance
      # Point to the blackbox exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115

Key blackbox metrics:

# Is the endpoint up?
probe_success{job="blackbox-http"}

# SSL certificate expiry (days)
(probe_ssl_earliest_cert_expiry - time()) / 86400

# HTTP response time
probe_http_duration_seconds

# DNS lookup time
probe_dns_lookup_time_seconds

Other Common Exporters

Exporter	Purpose	Key Metrics
mysqld_exporter	MySQL databases	Queries/sec, connections, replication lag
postgres_exporter	PostgreSQL databases	Active connections, transaction rate, table sizes
redis_exporter	Redis	Commands/sec, memory usage, connected clients
kafka_exporter	Apache Kafka	Consumer lag, topic offsets, partition count
nginx_exporter	Nginx	Active connections, requests/sec, response codes
kube-state-metrics	Kubernetes objects	Pod status, deployment replicas, node conditions
cadvisor	Containers	CPU, memory, network per container

Alertmanager Deep Dive

Alert Lifecycle

ALERT STATES
──────────────────────────────────────────────────────────────

  INACTIVE  ──→  PENDING  ──→  FIRING  ──→  RESOLVED
     ↑              │             │              │
     │              │             │              │
     │  expr false  │  for: 5m   │  expr false  │
     └──────────────┘  elapsed   │  for > 0s    │
                                 │              │
                                 └──────────────┘

INACTIVE: Alert expression evaluates to false. No action.

PENDING:  Alert expression evaluates to true.
          Waiting for "for" duration to elapse.
          Won't fire yet — prevents noise from brief spikes.

FIRING:   Alert has been true for at least "for" duration.
          Sent to Alertmanager for routing and notification.

RESOLVED: Alert was firing, now expression is false.
          Alertmanager sends "resolved" notification.

Alerting Rules

groups:
  - name: application-alerts
    rules:
      # HIGH SEVERITY: Service completely down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "{{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been unreachable for >1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # HIGH SEVERITY: Error rate spike
      - alert: HighErrorRate
        expr: |
          sum by (service)(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service)(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}."

      # MEDIUM SEVERITY: Slow responses
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le, service)(rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value | humanizeDuration }}."

      # LOW SEVERITY: Certificate expiring
      - alert: SSLCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert for {{ $labels.instance }} expires in {{ $value | humanize }} days"

      # CAPACITY: Disk filling up
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
          < 0.15
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 15% on {{ $labels.instance }}"

      # SLO-BASED: Error budget burn rate
      - alert: ErrorBudgetBurnRate
        expr: |
          job:http_error_ratio:rate5m > (14.4 * 0.001)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast for {{ $labels.job }}"
          description: "At current rate, error budget will be exhausted in <1 hour."

Alertmanager Configuration

# alertmanager.yml — complete production example
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: '<secret>'
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# TEMPLATES: customize notification format
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# ROUTING TREE: determines where alerts go
route:
  # Default receiver for unmatched alerts
  receiver: 'slack-default'

  # Group alerts by these labels (reduces noise)
  group_by: ['alertname', 'service']

  # Wait before sending first notification for a group
  group_wait: 30s

  # Wait before sending updates to an existing group
  group_interval: 5m

  # Wait before re-sending a firing alert
  repeat_interval: 4h

  # Child routes (evaluated top-to-bottom, first match wins)
  routes:
    # Critical alerts → PagerDuty (wake someone up)
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
      routes:
        # Database team owns DB alerts
        - match:
            team: database
          receiver: 'pagerduty-database'

    # Warning alerts → Slack channel
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

    # Info alerts → email digest
    - match:
        severity: info
      receiver: 'email-digest'
      group_wait: 10m
      repeat_interval: 24h

    # Regex matching: any alert from staging
    - match_re:
        environment: staging|dev
      receiver: 'slack-staging'
      repeat_interval: 12h

# RECEIVERS: notification targets
receivers:
  - name: 'slack-default'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.alertname }}* ({{ .Labels.severity }})
          {{ .Annotations.description }}
          {{ end }}

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
        severity: critical
        description: '{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.summary }}'

  - name: 'pagerduty-database'
    pagerduty_configs:
      - routing_key: '<database-team-key>'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        send_resolved: true

  - name: 'slack-staging'
    slack_configs:
      - channel: '#alerts-staging'
        send_resolved: false

  - name: 'email-digest'
    email_configs:
      - to: 'team@example.com'
        send_resolved: false

# INHIBITION RULES: suppress dependent alerts
inhibit_rules:
  # If a critical alert fires, suppress warnings for the same service
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'service']

  # If a node is down, suppress all pod alerts on that node
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: Pod.*
    equal: ['node']

  # If cluster is unreachable, suppress everything
  - source_match:
      alertname: ClusterUnreachable
    target_match_re:
      alertname: .+
    equal: ['cluster']

Routing Tree Visual

ALERTMANAGER ROUTING TREE
──────────────────────────────────────────────────────────────

Incoming Alert: {alertname="HighErrorRate", severity="critical", team="api"}

route (root):                          receiver: slack-default
├── match: severity=critical           receiver: pagerduty-critical  ← MATCH!
│   └── match: team=database           receiver: pagerduty-database  (no match)
├── match: severity=warning            receiver: slack-warnings
├── match: severity=info               receiver: email-digest
└── match_re: env=staging|dev          receiver: slack-staging

Result: Alert goes to pagerduty-critical (first matching child route)

NOTE: By default, routing stops at first match.
      Add "continue: true" on a route to keep matching subsequent routes.

Inhibition Rules Explained

INHIBITION: Suppressing dependent alerts
──────────────────────────────────────────────────────────────

Scenario: Node goes down → all pods on that node fail

WITHOUT inhibition:
  Alert: NodeDown (node-1)              ← root cause
  Alert: PodCrashLooping (pod-a)        ← symptom
  Alert: PodCrashLooping (pod-b)        ← symptom
  Alert: PodCrashLooping (pod-c)        ← symptom
  Alert: HighErrorRate (service-x)      ← symptom
  = 5 pages for one problem!

WITH inhibition:
  inhibit_rules:
    - source_match: {alertname: NodeDown}
      target_match_re: {alertname: "Pod.*|HighErrorRate"}
      equal: [node]

  Alert: NodeDown (node-1)              ← only this fires
  (all dependent alerts suppressed)
  = 1 page for one problem!

Silences

Silences temporarily mute alerts during planned maintenance:

# Create a silence via amtool CLI
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="jane@example.com" \
  --comment="Planned database maintenance window" \
  --duration=2h \
  service="database" severity="warning"

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire (remove) a silence
amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>

Silence via Alertmanager UI: Navigate to http://alertmanager:9093/#/silences and click “New Silence”. Fill in label matchers, duration, author, and comment.

Best practice: Always add a comment explaining why and a reasonable duration. Silences that last forever hide real problems.

Recording Rules for Alerting

Recording rules pre-compute expensive expressions so alerting rules evaluate faster.

groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-compute error ratio per service
      - record: service:http_error_ratio:rate5m
        expr: |
          sum by (service)(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service)(rate(http_requests_total[5m]))

      # Pre-compute P99 latency per service
      - record: service:http_latency_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, service)(rate(http_request_duration_seconds_bucket[5m]))
          )

      # Pre-compute CPU utilization per node
      - record: node:cpu_utilization:ratio_rate5m
        expr: |
          1 - avg by (node)(rate(node_cpu_seconds_total{mode="idle"}[5m]))

  - name: alerting_rules
    rules:
      # NOW alerting rules can use the pre-computed values
      - alert: HighErrorRate
        expr: service:http_error_ratio:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}"

      - alert: HighLatency
        expr: service:http_latency_p99:rate5m > 2
        for: 10m
        labels:
          severity: warning

      - alert: HighCPU
        expr: node:cpu_utilization:ratio_rate5m > 0.9
        for: 15m
        labels:
          severity: warning

Common Mistakes

Mistake	Problem	Solution
Using milliseconds for duration	Unit mismatch with other metrics	Always use base units: `_seconds`, not `_milliseconds`
Counter without `_total` suffix	Violates OpenMetrics standard	Always append `_total` to counter names
High-cardinality labels (user_id)	Memory explosion, slow queries	Remove unbounded labels; aggregate at application level
Missing `Help` text on metrics	Hard to understand; fails lint checks	Always add descriptive Help strings
Alerting without `for` duration	Flapping alerts from transient spikes	Use `for: 5m` minimum for most alerts
No inhibition rules	Alert storms during major incidents	Suppress symptoms when root-cause alert fires
Silencing without comments	Nobody knows why alerts were muted	Always add author, comment, and expiry
Summary instead of Histogram	Cannot aggregate across instances	Use Histogram unless you have a specific reason not to
Missing runbook_url in annotations	On-call engineer has no guidance	Always link to a runbook explaining diagnosis/fix
Too many receivers	Alert fatigue, nobody reads channels	Consolidate: critical → page, warning → Slack, info → email

Quiz

1. What are the four Prometheus metric types? Give one real-world example for each.

Answer:

Counter: Monotonically increasing value. Resets on restart.
- Example: http_requests_total — total HTTP requests served
Gauge: Value that can go up and down.
- Example: node_memory_MemAvailable_bytes — currently available memory
Histogram: Observations bucketed by value, with cumulative counts.
- Example: http_request_duration_seconds — request latency distribution
Summary: Client-computed streaming quantiles.
- Example: go_gc_duration_seconds — garbage collection pause duration with pre-computed percentiles

Key distinction: Histograms are aggregatable across instances (sum buckets), Summaries are not (cannot add quantiles). Prefer Histograms in almost all cases.

2. Why does Prometheus use base units (seconds, bytes) instead of human-friendly units (milliseconds, megabytes)?

Answer:

Base units prevent unit mismatch errors when combining metrics. If one team uses _milliseconds and another uses _seconds, joining or adding these metrics produces nonsensical results.

Specific reasons:

Consistency: All duration metrics are in seconds, so rate(a_seconds[5m]) + rate(b_seconds[5m]) always works
PromQL functions: histogram_quantile() returns values in the metric’s unit — if metrics are in seconds, the result is in seconds
Grafana handles display: Grafana can convert seconds to “2.5ms” or “1.3h” for human display. Store in base units, format at display time.
OpenMetrics standard: Requires base units for interoperability across tools

The rule: store in base units, display in human units.

3. Explain the Alertmanager routing tree. How does Alertmanager decide which receiver gets an alert?

Answer:

The routing tree is a hierarchy of routes, each with label matchers and a receiver:

Every alert enters at the root route (the top-level route:)
Child routes are evaluated top-to-bottom — first matching child wins
Matching uses match (exact) or match_re (regex) on alert labels
If no child matches, the alert goes to the root route’s receiver
continue: true on a route means “keep checking subsequent siblings even after matching”
Child routes can have their own children — nesting creates a tree

route: receiver=default
├── match: severity=critical → receiver=pagerduty
│   └── match: team=db → receiver=pagerduty-db
├── match: severity=warning → receiver=slack
└── (unmatched) → receiver=default

group_by, group_wait, group_interval, and repeat_interval control batching:

group_by: Labels to group alerts by (reduces notification count)
group_wait: How long to buffer before sending the first notification
group_interval: Minimum time between updates to a group
repeat_interval: How often to re-send a firing alert

4. What is the difference between inhibition and silencing in Alertmanager?

Answer:

Inhibition (automatic, rule-based):

Suppresses target alerts when a source alert is firing
Configured in inhibit_rules in alertmanager.yml
Happens automatically — no human action needed
Example: NodeDown inhibits all PodCrashLooping alerts on that node
Purpose: Prevent alert storms from cascading failures

Silencing (manual, time-based):

Temporarily mutes alerts matching specific label matchers
Created via UI or amtool CLI
Requires human action — someone decides to silence
Has a defined expiry time
Example: Silence all alerts for service="database" during planned maintenance
Purpose: Suppress known noise during maintenance windows

Key difference: Inhibition is automatic and ongoing (fires whenever the source alert fires). Silencing is manual and temporary (created for a specific time window).

5. You're instrumenting a new microservice. It has an HTTP API and a background job queue. What metrics would you add, with what types and names?

Answer:

HTTP API metrics:

myservice_http_requests_total{method, status, path}        — Counter
myservice_http_request_duration_seconds{method, path}      — Histogram
myservice_http_request_size_bytes{method, path}            — Histogram
myservice_http_response_size_bytes{method, path}           — Histogram
myservice_http_active_requests{method}                     — Gauge

Background job metrics:

myservice_jobs_processed_total{queue, status}              — Counter
myservice_job_duration_seconds{queue}                      — Histogram
myservice_jobs_queued{queue}                               — Gauge (current queue depth)
myservice_job_last_success_timestamp_seconds{queue}        — Gauge

Runtime metrics (auto-exposed by most client libraries):

process_cpu_seconds_total                                  — Counter
process_resident_memory_bytes                              — Gauge
go_goroutines (if Go)                                      — Gauge

Design decisions:

path label should use route patterns (/users/{id}), not actual paths (/users/12345) to avoid cardinality explosion
Histogram buckets for HTTP: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
Histogram buckets for jobs: [.1, .5, 1, 5, 10, 30, 60, 300] (jobs are typically slower)

6. What is the purpose of the `for` field in an alerting rule? What happens if you omit it?

Answer:

The for field specifies how long the alert expression must be continuously true before the alert transitions from pending to firing.

- alert: HighErrorRate
  expr: error_rate > 0.05
  for: 5m        # Must be true for 5 minutes before firing

Without for (or for: 0s):

Alert fires immediately when expression is true
Next evaluation cycle where expression is false → alert resolves
Causes alert flapping: brief spikes trigger and resolve alerts rapidly
On-call engineers get paged for transient conditions that self-resolve

With for: 5m:

Brief spikes (< 5 min) are ignored
Only sustained problems trigger notifications
Reduces false positives significantly

Guidelines:

for: 1m — Critical infrastructure alerts (ServiceDown)
for: 5m — Error rate and latency alerts
for: 15m — Capacity and resource alerts
for: 1h — Slow-burn problems (certificate expiry, disk growth trends)

Hands-On Exercise: Instrument, Export, Alert

Build a complete monitoring pipeline: instrument an app, deploy an exporter, configure alerts.

Setup

# Ensure you have a cluster with Prometheus
# (Use the setup from Module 1's hands-on, or:)
kind create cluster --name pca-lab
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

Step 1: Deploy a Python App with Custom Metrics

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-code
  namespace: monitoring
data:
  app.py: |
    from prometheus_client import Counter, Histogram, Gauge, start_http_server
    import random, time, threading

    REQUESTS = Counter('myapp_http_requests_total', 'Total HTTP requests', ['method', 'status'])
    LATENCY = Histogram('myapp_http_request_duration_seconds', 'Request latency',
                        buckets=[.01, .025, .05, .1, .25, .5, 1, 2.5, 5])
    QUEUE_SIZE = Gauge('myapp_queue_size', 'Current items in queue')
    JOBS = Counter('myapp_jobs_processed_total', 'Jobs processed', ['status'])

    def simulate_traffic():
        while True:
            method = random.choice(['GET', 'GET', 'GET', 'POST', 'PUT'])
            latency = random.expovariate(10)  # ~100ms average
            status = '200' if random.random() > 0.03 else '500'
            REQUESTS.labels(method=method, status=status).inc()
            LATENCY.observe(latency)
            time.sleep(0.1)

    def simulate_queue():
        while True:
            QUEUE_SIZE.set(random.randint(0, 50))
            if random.random() > 0.1:
                JOBS.labels(status='success').inc()
            else:
                JOBS.labels(status='failure').inc()
            time.sleep(1)

    if __name__ == '__main__':
        start_http_server(8000)
        threading.Thread(target=simulate_traffic, daemon=True).start()
        threading.Thread(target=simulate_queue, daemon=True).start()
        print("Metrics server running on :8000")
        while True:
            time.sleep(1)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: instrumented-app
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: instrumented-app
  template:
    metadata:
      labels:
        app: instrumented-app
    spec:
      containers:
        - name: app
          image: python:3.11-slim
          command: ["sh", "-c", "pip install prometheus_client && python /app/app.py"]
          ports:
            - containerPort: 8000
              name: metrics
          volumeMounts:
            - name: code
              mountPath: /app
      volumes:
        - name: code
          configMap:
            name: app-code
---
apiVersion: v1
kind: Service
metadata:
  name: instrumented-app
  namespace: monitoring
  labels:
    app: instrumented-app
spec:
  selector:
    app: instrumented-app
  ports:
    - port: 8000
      targetPort: 8000
      name: metrics

kubectl apply -f instrumented-app.yaml

Step 2: Create a ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: instrumented-app
  namespace: monitoring
  labels:
    release: monitoring  # Must match Prometheus selector
spec:
  selector:
    matchLabels:
      app: instrumented-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

kubectl apply -f servicemonitor.yaml

Step 3: Verify Scraping

# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090

Open http://localhost:9090/targets and verify instrumented-app appears as a target. Then query:

# Verify metrics are flowing
myapp_http_requests_total

# Request rate
rate(myapp_http_requests_total[5m])

# Error rate
sum(rate(myapp_http_requests_total{status="500"}[5m]))
/ sum(rate(myapp_http_requests_total[5m]))

# P99 latency
histogram_quantile(0.99, sum by (le)(rate(myapp_http_request_duration_seconds_bucket[5m])))

# Queue depth
myapp_queue_size

Step 4: Create Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: instrumented-app-alerts
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: instrumented-app
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            sum(rate(myapp_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(myapp_http_requests_total[5m]))
            > 0.05
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on instrumented-app"
            description: "Error rate is {{ $value | humanizePercentage }}"

        - alert: MyAppHighLatency
          expr: |
            histogram_quantile(0.99,
              sum by (le)(rate(myapp_http_request_duration_seconds_bucket[5m]))
            ) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency on instrumented-app"

        - record: myapp:http_error_ratio:rate5m
          expr: |
            sum(rate(myapp_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(myapp_http_requests_total[5m]))

kubectl apply -f alerting-rules.yaml

Step 5: Verify Alerts

Open http://localhost:9090/alerts and confirm the alert rules are loaded. If the simulated error rate exceeds 5%, you should see MyAppHighErrorRate transition from Inactive to Pending to Firing.

Success Criteria

You’ve completed this exercise when you can:

See custom metrics (myapp_*) in Prometheus
Query request rate, error rate, and latency percentiles
ServiceMonitor is correctly scraping the application
Alert rules are loaded and evaluating in Prometheus
Recording rule myapp:http_error_ratio:rate5m produces results
Understand the Counter/Gauge/Histogram types from the instrumented code

Key Takeaways

Before moving on, ensure you understand:

Summary

Instrumentation is where monitoring begins — without good metrics from your applications and infrastructure, no amount of query power helps. This module covered the complete instrumentation-to-alerting pipeline: choosing the right metric type, naming it according to conventions, exposing it with client libraries, collecting it with exporters, and routing alerts through Alertmanager when thresholds are breached. Combined with the PromQL Deep Dive, you now have comprehensive coverage of the PCA exam’s four gap domains.

PCA README — Full exam domain mapping and study strategy
PromQL Deep Dive — Complete PromQL for Domain 3
Prometheus Fundamentals — Architecture and basics
Grafana — Dashboarding for Domain 5

Module 1.2: Instrumentation & Alerting

Prerequisites

What You’ll Be Able to Do

Why This Module Matters

Did You Know?

The Four Metric Types

Counter

Gauge

Histogram

Summary

Decision Framework: Which Type?

Client Library Instrumentation

Go (Reference Implementation)

Python

Java (Micrometer / simpleclient)

Metric Naming Conventions

The Rules

Unit Rules

Suffix Rules

Label Best Practices

Exporters

node_exporter (Hardware & OS Metrics)

blackbox_exporter (Probing)

Other Common Exporters

Alertmanager Deep Dive

Alert Lifecycle

Alerting Rules

Alertmanager Configuration

Routing Tree Visual

Inhibition Rules Explained

Silences

Recording Rules for Alerting

Common Mistakes

Quiz

Hands-On Exercise: Instrument, Export, Alert

Setup

Step 1: Deploy a Python App with Custom Metrics

Step 2: Create a ServiceMonitor

Step 3: Verify Scraping

Step 4: Create Alerting Rules

Step 5: Verify Alerts

Success Criteria

Key Takeaways

Further Reading

Summary

Related Modules