Перейти до вмісту

Module 1.2: Instrumentation & Alerting

Цей контент ще не доступний вашою мовою.

PCA Track | Complexity: [COMPLEX] | Time: 45-55 min

Before starting this module:


After completing this module, you will be able to:

  1. Instrument an application with Prometheus client libraries, choosing the correct metric type (counter, gauge, histogram, summary) for each signal
  2. Apply naming conventions, label design, and cardinality controls that keep metrics usable and storage costs predictable
  3. Configure alerting rules with proper for durations, severity labels, and runbook annotations that minimize false positives and pager fatigue
  4. Design a metrics-based SLO alerting pipeline using multi-window burn-rate alerts for error budgets

The database team at a large ride-sharing company added a custom Prometheus metric to track query latency. They were proud of it — db_query_duration_milliseconds. It worked perfectly in development.

Three weeks later, the infrastructure team tried to create an SLO dashboard combining API latency (measured in seconds using http_request_duration_seconds) with database latency. The query looked simple:

histogram_quantile(0.99,
sum by (le)(rate(http_request_duration_seconds_bucket[5m]))
)
+
histogram_quantile(0.99,
sum by (le)(rate(db_query_duration_milliseconds_bucket[5m]))
)

The P99 total latency showed 3,000.2 seconds. It took 45 minutes of confusion before someone realized: one metric was in seconds, the other in milliseconds. The query was adding 0.2 seconds of API latency to 3,000 milliseconds of DB latency. The result was mathematically correct but semantically nonsensical.

The fix required a migration: rename the metric, update all dashboards, modify alerting rules, and coordinate a rolling deployment across 400 pods. Two engineer-weeks of work, all because of a naming convention violation.

The Prometheus naming convention exists for exactly this reason: always use base units — seconds, bytes, meters. Never milliseconds, kilobytes, or centimeters. This war story gets told to every new hire now.


Instrumentation and alerting are 34% of the PCA exam combined (16% Instrumentation + 18% Alerting). But beyond the exam, these are the skills that determine whether your monitoring actually works. Bad instrumentation creates metrics nobody can use. Bad alerting creates noise that trains your team to ignore pages.

This module covers the full lifecycle: choosing the right metric type, naming it correctly, exposing it from your application, collecting it with exporters, and routing alerts when things go wrong.

  • Prometheus client libraries exist for 15+ languages — Go, Python, Java, Ruby, Rust, .NET, Erlang, Haskell, and more. The Go library is the reference implementation; others follow the same patterns.
  • node_exporter exposes over 1,000 metrics on a typical Linux system — CPU, memory, disk, network, filesystem, and more. It’s the single most deployed Prometheus exporter.
  • Alertmanager was designed around the concept of “routing trees” inspired by email routing. The tree structure lets you route different alert severities to different teams using a single configuration file.
  • The _total suffix on counters was originally optional but became mandatory in the OpenMetrics standard. Modern Prometheus (2.x) adds it automatically if missing, but you should always include it in your instrumentation.

A counter is a cumulative metric that only goes up (and resets to zero on restart).

COUNTER: Monotonically increasing value
──────────────────────────────────────────────────────────────
Value over time:
0 → 1 → 5 → 12 → 30 → 0 → 3 → 15 → 28
restart/reset
USE WHEN:
✓ Counting events (requests, errors, bytes sent)
✓ Counting completions (jobs finished, items processed)
✓ Anything that only goes up during normal operation
DON'T USE WHEN:
✗ Value can decrease (temperature, queue size)
✗ Value represents current state (active connections)
ALWAYS QUERY WITH rate() or increase():
rate(http_requests_total[5m]) ← per-second rate
increase(http_requests_total[1h]) ← total in last hour

A gauge is a metric that can go up and down.

GAUGE: Current value that can increase or decrease
──────────────────────────────────────────────────────────────
Value over time:
42 → 38 → 55 → 71 → 63 → 48 → 52
USE WHEN:
✓ Current state (temperature, queue depth, active connections)
✓ Snapshots (memory usage, disk space, goroutine count)
✓ Values that go up AND down
DON'T USE WHEN:
✗ Counting events (use Counter)
✗ Measuring distributions (use Histogram)
QUERY DIRECTLY (no rate needed):
node_memory_MemAvailable_bytes ← current available memory
kube_deployment_spec_replicas ← desired replica count

A histogram samples observations and counts them in configurable buckets.

HISTOGRAM: Distribution of values in buckets
──────────────────────────────────────────────────────────────
Generates 3 types of series:
metric_bucket{le="0.1"} = 24054 ← cumulative count ≤ 0.1s
metric_bucket{le="0.5"} = 129389 ← cumulative count ≤ 0.5s
metric_bucket{le="+Inf"} = 144927 ← total count (all observations)
metric_sum = 53423.4 ← sum of all observed values
metric_count = 144927 ← total number of observations
USE WHEN:
✓ Request latency (the primary use case)
✓ Response sizes
✓ Any distribution where you need percentiles
✓ SLO calculations (bucket at your SLO target)
ADVANTAGES:
✓ Aggregatable across instances (can sum buckets)
✓ Can calculate any percentile after the fact
✓ Can compute average (sum / count)
TRADE-OFFS:
✗ Fixed bucket boundaries chosen at instrumentation time
✗ Each bucket is a separate time series (cardinality cost)
✗ Percentile accuracy depends on bucket granularity

A summary calculates quantiles on the client side.

SUMMARY: Client-computed quantiles
──────────────────────────────────────────────────────────────
Generates series like:
metric{quantile="0.5"} = 0.042 ← median
metric{quantile="0.9"} = 0.087 ← P90
metric{quantile="0.99"} = 0.235 ← P99
metric_sum = 53423.4 ← sum of all observed values
metric_count = 144927 ← total number of observations
USE WHEN:
✓ You need exact quantiles from a single instance
✓ You can't choose histogram bucket boundaries upfront
✓ Streaming quantile algorithms are acceptable
DON'T USE WHEN (most of the time):
✗ You need to aggregate across instances
(cannot add quantiles meaningfully!)
✗ You need flexible percentile calculation at query time
✗ You need SLO calculations
PREFER HISTOGRAMS. Summaries exist for legacy reasons.
CHOOSING A METRIC TYPE
──────────────────────────────────────────────────────────────
Does the value only go up?
├── YES → Is it counting events/completions?
│ ├── YES → COUNTER (with _total suffix)
│ └── NO → Probably still a COUNTER
└── NO → Can the value go up AND down?
├── YES → Is it a current state/snapshot?
│ ├── YES → GAUGE
│ └── NO → GAUGE (probably)
└── Do you need distribution/percentiles?
├── YES → HISTOGRAM (almost always)
│ └── Summary only if you truly
│ can't define buckets upfront
└── NO → GAUGE

package main
import (
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
// Counter: total HTTP requests
httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "myapp_http_requests_total",
Help: "Total number of HTTP requests.",
},
[]string{"method", "status", "path"},
)
// Histogram: request latency
httpRequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "myapp_http_request_duration_seconds",
Help: "HTTP request latency in seconds.",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
},
[]string{"method", "path"},
)
// Gauge: active connections
activeConnections = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "myapp_active_connections",
Help: "Number of currently active connections.",
},
)
)
func handler(w http.ResponseWriter, r *http.Request) {
start := time.Now()
activeConnections.Inc()
defer activeConnections.Dec()
// ... handle request ...
w.WriteHeader(http.StatusOK)
duration := time.Since(start).Seconds()
httpRequestsTotal.WithLabelValues(r.Method, "200", r.URL.Path).Inc()
httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
}
func main() {
http.HandleFunc("/", handler)
http.Handle("/metrics", promhttp.Handler())
http.ListenAndServe(":8080", nil)
}
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
# Counter: total HTTP requests
REQUEST_COUNT = Counter(
'myapp_http_requests_total',
'Total number of HTTP requests.',
['method', 'status', 'path']
)
# Histogram: request latency
REQUEST_LATENCY = Histogram(
'myapp_http_request_duration_seconds',
'HTTP request latency in seconds.',
['method', 'path'],
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
)
# Gauge: active connections
ACTIVE_CONNECTIONS = Gauge(
'myapp_active_connections',
'Number of currently active connections.'
)
def handle_request(method, path):
ACTIVE_CONNECTIONS.inc()
start = time.time()
# ... handle request ...
status = "200"
duration = time.time() - start
REQUEST_COUNT.labels(method=method, status=status, path=path).inc()
REQUEST_LATENCY.labels(method=method, path=path).observe(duration)
ACTIVE_CONNECTIONS.dec()
# Start metrics server on port 8000
start_http_server(8000)
# For Flask/FastAPI, use middleware instead:
# from prometheus_flask_instrumentator import Instrumentator
# Instrumentator().instrument(app).expose(app)
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;
public class MyApp {
// Counter: total HTTP requests
static final Counter requestsTotal = Counter.build()
.name("myapp_http_requests_total")
.help("Total number of HTTP requests.")
.labelNames("method", "status", "path")
.register();
// Histogram: request latency
static final Histogram requestDuration = Histogram.build()
.name("myapp_http_request_duration_seconds")
.help("HTTP request latency in seconds.")
.labelNames("method", "path")
.buckets(.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5)
.register();
// Gauge: active connections
static final Gauge activeConnections = Gauge.build()
.name("myapp_active_connections")
.help("Number of currently active connections.")
.register();
public void handleRequest(String method, String path) {
activeConnections.inc();
Histogram.Timer timer = requestDuration
.labels(method, path)
.startTimer();
try {
// ... handle request ...
requestsTotal.labels(method, "200", path).inc();
} finally {
timer.observeDuration();
activeConnections.dec();
}
}
public static void main(String[] args) throws Exception {
// Expose metrics on port 8000
HTTPServer server = new HTTPServer(8000);
}
}

PROMETHEUS NAMING CONVENTION
──────────────────────────────────────────────────────────────
Format: <namespace>_<name>_<unit>_<suffix>
namespace = application or library name (myapp, http, node)
name = what is being measured (requests, duration, size)
unit = base unit (seconds, bytes, meters — NEVER milli/kilo)
suffix = metric type indicator (_total for counters, _info for info)
GOOD:
myapp_http_requests_total ← counter, counts requests
myapp_http_request_duration_seconds ← histogram, duration in seconds
myapp_http_response_size_bytes ← histogram, size in bytes
node_memory_MemAvailable_bytes ← gauge, memory in bytes
process_cpu_seconds_total ← counter, CPU time in seconds
BAD:
myapp_requests ← missing unit, missing _total
http_request_duration_milliseconds ← use seconds, not milliseconds
db_query_time_ms ← abbreviation, non-base unit
MyApp_HTTP_Requests ← camelCase/PascalCase, use snake_case
request_latency ← vague, missing namespace and unit
MeasurementBase UnitSuffixExample
Time/Durationseconds_secondshttp_request_duration_seconds
Data sizebytes_byteshttp_response_size_bytes
Temperaturecelsius_celsiusroom_temperature_celsius
Voltagevolts_voltspower_supply_volts
Energyjoules_joulescpu_energy_joules
Weightgrams_gramspackage_weight_grams
Ratiosratio_ratiocache_hit_ratio
Percentagesratio (0-1)_ratioUse 0-1, not 0-100
TypeSuffixExample
Counter_totalhttp_requests_total
Counter (created timestamp)_createdhttp_requests_created
Histogram_bucket, _sum, _counthttp_request_duration_seconds_bucket
Summary_sum, _countrpc_duration_seconds_sum
Info metric_infobuild_info{version="1.2.3"}
Gauge(no suffix)node_memory_MemAvailable_bytes
LABEL DO'S AND DON'TS
──────────────────────────────────────────────────────────────
DO:
✓ Use labels for dimensions you'll filter/aggregate by
✓ Keep cardinality bounded (status codes: ~5 values)
✓ Use consistent names: "method" not "http_method" in one
place and "request_method" in another
DON'T:
✗ user_id (millions of values = millions of series)
✗ request_id (unbounded, every request creates a series)
✗ email (PII + unbounded cardinality)
✗ url with path parameters (/users/12345 = unique per user)
✗ error_message (free-form text = unbounded)
✗ timestamp as label (infinite cardinality)
RULE OF THUMB:
If a label can have more than ~100 unique values,
it probably shouldn't be a label.
Each unique label combination = one time series in memory.

The most widely deployed exporter — runs on every node to expose hardware and OS metrics.

Terminal window
# Install via binary
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
./node_exporter
# Or via Kubernetes DaemonSet (kube-prometheus-stack includes it)
helm install monitoring prometheus-community/kube-prometheus-stack

Key metrics from node_exporter:

# CPU utilization
1 - avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Memory utilization
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Disk space usage
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
# Network throughput
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

Probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP. Useful for monitoring things you don’t control.

# blackbox-exporter config
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200]
follow_redirects: true
http_post_2xx:
prober: http
http:
method: POST
tcp_connect:
prober: tcp
timeout: 5s
dns_lookup:
prober: dns
dns:
query_name: "kubernetes.default.svc.cluster.local"
query_type: "A"
icmp_ping:
prober: icmp
timeout: 5s

Prometheus scrape config for blackbox_exporter:

scrape_configs:
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://example.com
- https://api.myservice.com/health
relabel_configs:
# Pass the target URL as a parameter
- source_labels: [__address__]
target_label: __param_target
# Store original target as a label
- source_labels: [__param_target]
target_label: instance
# Point to the blackbox exporter
- target_label: __address__
replacement: blackbox-exporter:9115

Key blackbox metrics:

# Is the endpoint up?
probe_success{job="blackbox-http"}
# SSL certificate expiry (days)
(probe_ssl_earliest_cert_expiry - time()) / 86400
# HTTP response time
probe_http_duration_seconds
# DNS lookup time
probe_dns_lookup_time_seconds
ExporterPurposeKey Metrics
mysqld_exporterMySQL databasesQueries/sec, connections, replication lag
postgres_exporterPostgreSQL databasesActive connections, transaction rate, table sizes
redis_exporterRedisCommands/sec, memory usage, connected clients
kafka_exporterApache KafkaConsumer lag, topic offsets, partition count
nginx_exporterNginxActive connections, requests/sec, response codes
kube-state-metricsKubernetes objectsPod status, deployment replicas, node conditions
cadvisorContainersCPU, memory, network per container

ALERT STATES
──────────────────────────────────────────────────────────────
INACTIVE ──→ PENDING ──→ FIRING ──→ RESOLVED
↑ │ │ │
│ │ │ │
│ expr false │ for: 5m │ expr false │
└──────────────┘ elapsed │ for > 0s │
│ │
└──────────────┘
INACTIVE: Alert expression evaluates to false. No action.
PENDING: Alert expression evaluates to true.
Waiting for "for" duration to elapse.
Won't fire yet — prevents noise from brief spikes.
FIRING: Alert has been true for at least "for" duration.
Sent to Alertmanager for routing and notification.
RESOLVED: Alert was firing, now expression is false.
Alertmanager sends "resolved" notification.
groups:
- name: application-alerts
rules:
# HIGH SEVERITY: Service completely down
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "{{ $labels.job }} is down"
description: "{{ $labels.instance }} has been unreachable for >1 minute."
runbook_url: "https://wiki.example.com/runbooks/service-down"
# HIGH SEVERITY: Error rate spike
- alert: HighErrorRate
expr: |
sum by (service)(rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service)(rate(http_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }}."
# MEDIUM SEVERITY: Slow responses
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum by (le, service)(rate(http_request_duration_seconds_bucket[5m]))
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "High P99 latency on {{ $labels.service }}"
description: "P99 latency is {{ $value | humanizeDuration }}."
# LOW SEVERITY: Certificate expiring
- alert: SSLCertExpiringSoon
expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
for: 1h
labels:
severity: warning
annotations:
summary: "SSL cert for {{ $labels.instance }} expires in {{ $value | humanize }} days"
# CAPACITY: Disk filling up
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
< 0.15
for: 15m
labels:
severity: warning
annotations:
summary: "Disk space below 15% on {{ $labels.instance }}"
# SLO-BASED: Error budget burn rate
- alert: ErrorBudgetBurnRate
expr: |
job:http_error_ratio:rate5m > (14.4 * 0.001)
for: 5m
labels:
severity: critical
annotations:
summary: "Error budget burning too fast for {{ $labels.job }}"
description: "At current rate, error budget will be exhausted in <1 hour."
# alertmanager.yml — complete production example
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: '<secret>'
slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxx'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# TEMPLATES: customize notification format
templates:
- '/etc/alertmanager/templates/*.tmpl'
# ROUTING TREE: determines where alerts go
route:
# Default receiver for unmatched alerts
receiver: 'slack-default'
# Group alerts by these labels (reduces noise)
group_by: ['alertname', 'service']
# Wait before sending first notification for a group
group_wait: 30s
# Wait before sending updates to an existing group
group_interval: 5m
# Wait before re-sending a firing alert
repeat_interval: 4h
# Child routes (evaluated top-to-bottom, first match wins)
routes:
# Critical alerts → PagerDuty (wake someone up)
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 1h
routes:
# Database team owns DB alerts
- match:
team: database
receiver: 'pagerduty-database'
# Warning alerts → Slack channel
- match:
severity: warning
receiver: 'slack-warnings'
repeat_interval: 4h
# Info alerts → email digest
- match:
severity: info
receiver: 'email-digest'
group_wait: 10m
repeat_interval: 24h
# Regex matching: any alert from staging
- match_re:
environment: staging|dev
receiver: 'slack-staging'
repeat_interval: 12h
# RECEIVERS: notification targets
receivers:
- name: 'slack-default'
slack_configs:
- channel: '#alerts'
send_resolved: true
title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
text: >-
{{ range .Alerts }}
*{{ .Labels.alertname }}* ({{ .Labels.severity }})
{{ .Annotations.description }}
{{ end }}
- name: 'pagerduty-critical'
pagerduty_configs:
- routing_key: '<pagerduty-integration-key>'
severity: critical
description: '{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.summary }}'
- name: 'pagerduty-database'
pagerduty_configs:
- routing_key: '<database-team-key>'
severity: critical
- name: 'slack-warnings'
slack_configs:
- channel: '#alerts-warnings'
send_resolved: true
- name: 'slack-staging'
slack_configs:
- channel: '#alerts-staging'
send_resolved: false
- name: 'email-digest'
email_configs:
- to: 'team@example.com'
send_resolved: false
# INHIBITION RULES: suppress dependent alerts
inhibit_rules:
# If a critical alert fires, suppress warnings for the same service
- source_match:
severity: critical
target_match:
severity: warning
equal: ['alertname', 'service']
# If a node is down, suppress all pod alerts on that node
- source_match:
alertname: NodeDown
target_match_re:
alertname: Pod.*
equal: ['node']
# If cluster is unreachable, suppress everything
- source_match:
alertname: ClusterUnreachable
target_match_re:
alertname: .+
equal: ['cluster']
ALERTMANAGER ROUTING TREE
──────────────────────────────────────────────────────────────
Incoming Alert: {alertname="HighErrorRate", severity="critical", team="api"}
route (root): receiver: slack-default
├── match: severity=critical receiver: pagerduty-critical ← MATCH!
│ └── match: team=database receiver: pagerduty-database (no match)
├── match: severity=warning receiver: slack-warnings
├── match: severity=info receiver: email-digest
└── match_re: env=staging|dev receiver: slack-staging
Result: Alert goes to pagerduty-critical (first matching child route)
NOTE: By default, routing stops at first match.
Add "continue: true" on a route to keep matching subsequent routes.
INHIBITION: Suppressing dependent alerts
──────────────────────────────────────────────────────────────
Scenario: Node goes down → all pods on that node fail
WITHOUT inhibition:
Alert: NodeDown (node-1) ← root cause
Alert: PodCrashLooping (pod-a) ← symptom
Alert: PodCrashLooping (pod-b) ← symptom
Alert: PodCrashLooping (pod-c) ← symptom
Alert: HighErrorRate (service-x) ← symptom
= 5 pages for one problem!
WITH inhibition:
inhibit_rules:
- source_match: {alertname: NodeDown}
target_match_re: {alertname: "Pod.*|HighErrorRate"}
equal: [node]
Alert: NodeDown (node-1) ← only this fires
(all dependent alerts suppressed)
= 1 page for one problem!

Silences temporarily mute alerts during planned maintenance:

Terminal window
# Create a silence via amtool CLI
amtool silence add \
--alertmanager.url=http://localhost:9093 \
--author="jane@example.com" \
--comment="Planned database maintenance window" \
--duration=2h \
service="database" severity="warning"
# List active silences
amtool silence query --alertmanager.url=http://localhost:9093
# Expire (remove) a silence
amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>

Silence via Alertmanager UI: Navigate to http://alertmanager:9093/#/silences and click “New Silence”. Fill in label matchers, duration, author, and comment.

Best practice: Always add a comment explaining why and a reasonable duration. Silences that last forever hide real problems.


Recording rules pre-compute expensive expressions so alerting rules evaluate faster.

groups:
- name: recording_rules
interval: 30s
rules:
# Pre-compute error ratio per service
- record: service:http_error_ratio:rate5m
expr: |
sum by (service)(rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service)(rate(http_requests_total[5m]))
# Pre-compute P99 latency per service
- record: service:http_latency_p99:rate5m
expr: |
histogram_quantile(0.99,
sum by (le, service)(rate(http_request_duration_seconds_bucket[5m]))
)
# Pre-compute CPU utilization per node
- record: node:cpu_utilization:ratio_rate5m
expr: |
1 - avg by (node)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
- name: alerting_rules
rules:
# NOW alerting rules can use the pre-computed values
- alert: HighErrorRate
expr: service:http_error_ratio:rate5m > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}"
- alert: HighLatency
expr: service:http_latency_p99:rate5m > 2
for: 10m
labels:
severity: warning
- alert: HighCPU
expr: node:cpu_utilization:ratio_rate5m > 0.9
for: 15m
labels:
severity: warning

MistakeProblemSolution
Using milliseconds for durationUnit mismatch with other metricsAlways use base units: _seconds, not _milliseconds
Counter without _total suffixViolates OpenMetrics standardAlways append _total to counter names
High-cardinality labels (user_id)Memory explosion, slow queriesRemove unbounded labels; aggregate at application level
Missing Help text on metricsHard to understand; fails lint checksAlways add descriptive Help strings
Alerting without for durationFlapping alerts from transient spikesUse for: 5m minimum for most alerts
No inhibition rulesAlert storms during major incidentsSuppress symptoms when root-cause alert fires
Silencing without commentsNobody knows why alerts were mutedAlways add author, comment, and expiry
Summary instead of HistogramCannot aggregate across instancesUse Histogram unless you have a specific reason not to
Missing runbook_url in annotationsOn-call engineer has no guidanceAlways link to a runbook explaining diagnosis/fix
Too many receiversAlert fatigue, nobody reads channelsConsolidate: critical → page, warning → Slack, info → email

1. What are the four Prometheus metric types? Give one real-world example for each.

Answer:

  1. Counter: Monotonically increasing value. Resets on restart.

    • Example: http_requests_total — total HTTP requests served
  2. Gauge: Value that can go up and down.

    • Example: node_memory_MemAvailable_bytes — currently available memory
  3. Histogram: Observations bucketed by value, with cumulative counts.

    • Example: http_request_duration_seconds — request latency distribution
  4. Summary: Client-computed streaming quantiles.

    • Example: go_gc_duration_seconds — garbage collection pause duration with pre-computed percentiles

Key distinction: Histograms are aggregatable across instances (sum buckets), Summaries are not (cannot add quantiles). Prefer Histograms in almost all cases.

2. Why does Prometheus use base units (seconds, bytes) instead of human-friendly units (milliseconds, megabytes)?

Answer:

Base units prevent unit mismatch errors when combining metrics. If one team uses _milliseconds and another uses _seconds, joining or adding these metrics produces nonsensical results.

Specific reasons:

  • Consistency: All duration metrics are in seconds, so rate(a_seconds[5m]) + rate(b_seconds[5m]) always works
  • PromQL functions: histogram_quantile() returns values in the metric’s unit — if metrics are in seconds, the result is in seconds
  • Grafana handles display: Grafana can convert seconds to “2.5ms” or “1.3h” for human display. Store in base units, format at display time.
  • OpenMetrics standard: Requires base units for interoperability across tools

The rule: store in base units, display in human units.

3. Explain the Alertmanager routing tree. How does Alertmanager decide which receiver gets an alert?

Answer:

The routing tree is a hierarchy of routes, each with label matchers and a receiver:

  1. Every alert enters at the root route (the top-level route:)
  2. Child routes are evaluated top-to-bottom — first matching child wins
  3. Matching uses match (exact) or match_re (regex) on alert labels
  4. If no child matches, the alert goes to the root route’s receiver
  5. continue: true on a route means “keep checking subsequent siblings even after matching”
  6. Child routes can have their own children — nesting creates a tree
route: receiver=default
├── match: severity=critical → receiver=pagerduty
│ └── match: team=db → receiver=pagerduty-db
├── match: severity=warning → receiver=slack
└── (unmatched) → receiver=default

group_by, group_wait, group_interval, and repeat_interval control batching:

  • group_by: Labels to group alerts by (reduces notification count)
  • group_wait: How long to buffer before sending the first notification
  • group_interval: Minimum time between updates to a group
  • repeat_interval: How often to re-send a firing alert
4. What is the difference between inhibition and silencing in Alertmanager?

Answer:

Inhibition (automatic, rule-based):

  • Suppresses target alerts when a source alert is firing
  • Configured in inhibit_rules in alertmanager.yml
  • Happens automatically — no human action needed
  • Example: NodeDown inhibits all PodCrashLooping alerts on that node
  • Purpose: Prevent alert storms from cascading failures

Silencing (manual, time-based):

  • Temporarily mutes alerts matching specific label matchers
  • Created via UI or amtool CLI
  • Requires human action — someone decides to silence
  • Has a defined expiry time
  • Example: Silence all alerts for service="database" during planned maintenance
  • Purpose: Suppress known noise during maintenance windows

Key difference: Inhibition is automatic and ongoing (fires whenever the source alert fires). Silencing is manual and temporary (created for a specific time window).

5. You're instrumenting a new microservice. It has an HTTP API and a background job queue. What metrics would you add, with what types and names?

Answer:

HTTP API metrics:

myservice_http_requests_total{method, status, path} — Counter
myservice_http_request_duration_seconds{method, path} — Histogram
myservice_http_request_size_bytes{method, path} — Histogram
myservice_http_response_size_bytes{method, path} — Histogram
myservice_http_active_requests{method} — Gauge

Background job metrics:

myservice_jobs_processed_total{queue, status} — Counter
myservice_job_duration_seconds{queue} — Histogram
myservice_jobs_queued{queue} — Gauge (current queue depth)
myservice_job_last_success_timestamp_seconds{queue} — Gauge

Runtime metrics (auto-exposed by most client libraries):

process_cpu_seconds_total — Counter
process_resident_memory_bytes — Gauge
go_goroutines (if Go) — Gauge

Design decisions:

  • path label should use route patterns (/users/{id}), not actual paths (/users/12345) to avoid cardinality explosion
  • Histogram buckets for HTTP: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
  • Histogram buckets for jobs: [.1, .5, 1, 5, 10, 30, 60, 300] (jobs are typically slower)
6. What is the purpose of the `for` field in an alerting rule? What happens if you omit it?

Answer:

The for field specifies how long the alert expression must be continuously true before the alert transitions from pending to firing.

- alert: HighErrorRate
expr: error_rate > 0.05
for: 5m # Must be true for 5 minutes before firing

Without for (or for: 0s):

  • Alert fires immediately when expression is true
  • Next evaluation cycle where expression is false → alert resolves
  • Causes alert flapping: brief spikes trigger and resolve alerts rapidly
  • On-call engineers get paged for transient conditions that self-resolve

With for: 5m:

  • Brief spikes (< 5 min) are ignored
  • Only sustained problems trigger notifications
  • Reduces false positives significantly

Guidelines:

  • for: 1m — Critical infrastructure alerts (ServiceDown)
  • for: 5m — Error rate and latency alerts
  • for: 15m — Capacity and resource alerts
  • for: 1h — Slow-burn problems (certificate expiry, disk growth trends)

Hands-On Exercise: Instrument, Export, Alert

Section titled “Hands-On Exercise: Instrument, Export, Alert”

Build a complete monitoring pipeline: instrument an app, deploy an exporter, configure alerts.

Terminal window
# Ensure you have a cluster with Prometheus
# (Use the setup from Module 1's hands-on, or:)
kind create cluster --name pca-lab
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace

Step 1: Deploy a Python App with Custom Metrics

Section titled “Step 1: Deploy a Python App with Custom Metrics”
instrumented-app.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: app-code
namespace: monitoring
data:
app.py: |
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import random, time, threading
REQUESTS = Counter('myapp_http_requests_total', 'Total HTTP requests', ['method', 'status'])
LATENCY = Histogram('myapp_http_request_duration_seconds', 'Request latency',
buckets=[.01, .025, .05, .1, .25, .5, 1, 2.5, 5])
QUEUE_SIZE = Gauge('myapp_queue_size', 'Current items in queue')
JOBS = Counter('myapp_jobs_processed_total', 'Jobs processed', ['status'])
def simulate_traffic():
while True:
method = random.choice(['GET', 'GET', 'GET', 'POST', 'PUT'])
latency = random.expovariate(10) # ~100ms average
status = '200' if random.random() > 0.03 else '500'
REQUESTS.labels(method=method, status=status).inc()
LATENCY.observe(latency)
time.sleep(0.1)
def simulate_queue():
while True:
QUEUE_SIZE.set(random.randint(0, 50))
if random.random() > 0.1:
JOBS.labels(status='success').inc()
else:
JOBS.labels(status='failure').inc()
time.sleep(1)
if __name__ == '__main__':
start_http_server(8000)
threading.Thread(target=simulate_traffic, daemon=True).start()
threading.Thread(target=simulate_queue, daemon=True).start()
print("Metrics server running on :8000")
while True:
time.sleep(1)
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: instrumented-app
namespace: monitoring
spec:
replicas: 2
selector:
matchLabels:
app: instrumented-app
template:
metadata:
labels:
app: instrumented-app
spec:
containers:
- name: app
image: python:3.11-slim
command: ["sh", "-c", "pip install prometheus_client && python /app/app.py"]
ports:
- containerPort: 8000
name: metrics
volumeMounts:
- name: code
mountPath: /app
volumes:
- name: code
configMap:
name: app-code
---
apiVersion: v1
kind: Service
metadata:
name: instrumented-app
namespace: monitoring
labels:
app: instrumented-app
spec:
selector:
app: instrumented-app
ports:
- port: 8000
targetPort: 8000
name: metrics
Terminal window
kubectl apply -f instrumented-app.yaml
servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: instrumented-app
namespace: monitoring
labels:
release: monitoring # Must match Prometheus selector
spec:
selector:
matchLabels:
app: instrumented-app
endpoints:
- port: metrics
interval: 15s
path: /metrics
Terminal window
kubectl apply -f servicemonitor.yaml
Terminal window
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090

Open http://localhost:9090/targets and verify instrumented-app appears as a target. Then query:

# Verify metrics are flowing
myapp_http_requests_total
# Request rate
rate(myapp_http_requests_total[5m])
# Error rate
sum(rate(myapp_http_requests_total{status="500"}[5m]))
/ sum(rate(myapp_http_requests_total[5m]))
# P99 latency
histogram_quantile(0.99, sum by (le)(rate(myapp_http_request_duration_seconds_bucket[5m])))
# Queue depth
myapp_queue_size
alerting-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: instrumented-app-alerts
namespace: monitoring
labels:
release: monitoring
spec:
groups:
- name: instrumented-app
rules:
- alert: MyAppHighErrorRate
expr: |
sum(rate(myapp_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(myapp_http_requests_total[5m]))
> 0.05
for: 2m
labels:
severity: warning
annotations:
summary: "High error rate on instrumented-app"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: MyAppHighLatency
expr: |
histogram_quantile(0.99,
sum by (le)(rate(myapp_http_request_duration_seconds_bucket[5m]))
) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency on instrumented-app"
- record: myapp:http_error_ratio:rate5m
expr: |
sum(rate(myapp_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(myapp_http_requests_total[5m]))
Terminal window
kubectl apply -f alerting-rules.yaml

Open http://localhost:9090/alerts and confirm the alert rules are loaded. If the simulated error rate exceeds 5%, you should see MyAppHighErrorRate transition from Inactive to Pending to Firing.

You’ve completed this exercise when you can:

  • See custom metrics (myapp_*) in Prometheus
  • Query request rate, error rate, and latency percentiles
  • ServiceMonitor is correctly scraping the application
  • Alert rules are loaded and evaluating in Prometheus
  • Recording rule myapp:http_error_ratio:rate5m produces results
  • Understand the Counter/Gauge/Histogram types from the instrumented code

Before moving on, ensure you understand:

  • Four metric types: Counter (monotonic, use rate()), Gauge (up/down, use directly), Histogram (buckets, use histogram_quantile()), Summary (client-side quantiles, rarely preferred)
  • Naming conventions: <namespace>_<name>_<unit>_total for counters, base units only (seconds, bytes), snake_case
  • Label best practices: Bounded cardinality only, no user_id/request_id, consistent naming across services
  • Client libraries: Go/Python/Java all follow the same patterns — Counter.Inc(), Gauge.Set(), Histogram.Observe()
  • Exporters: node_exporter (OS metrics), blackbox_exporter (probing), kube-state-metrics (K8s objects)
  • Alert lifecycle: inactive -> pending -> firing -> resolved. The for field prevents flapping.
  • Alertmanager routing: Tree structure, first-match wins, group_by reduces noise, repeat_interval prevents spam
  • Inhibition rules: Automatically suppress dependent alerts when root-cause alert fires
  • Silences: Manual, temporary muting during maintenance. Always add comments and expiry.
  • Recording rules: Pre-compute expensive queries for alerting and dashboards. Naming convention: level:metric:operations


Instrumentation is where monitoring begins — without good metrics from your applications and infrastructure, no amount of query power helps. This module covered the complete instrumentation-to-alerting pipeline: choosing the right metric type, naming it according to conventions, exposing it with client libraries, collecting it with exporters, and routing alerts through Alertmanager when thresholds are breached. Combined with the PromQL Deep Dive, you now have comprehensive coverage of the PCA exam’s four gap domains.