Module 1.2: Instrumentation & Alerting
PCA Track | Complexity:
[COMPLEX]| Time: 45-55 min
Prerequisites
Section titled “Prerequisites”Before starting this module:
- Prometheus Module — architecture, metric types, basic alerting
- PromQL Deep Dive — query fundamentals
- Observability 3.3: Instrumentation Principles
- Basic Go, Python, or Java knowledge (for client library examples)
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Instrument an application with Prometheus client libraries, choosing the correct metric type (counter, gauge, histogram, summary) for each signal
- Apply naming conventions, label design, and cardinality controls that keep metrics usable and storage costs predictable
- Configure alerting rules with proper
fordurations, severity labels, and runbook annotations that minimize false positives and pager fatigue - Design a metrics-based SLO alerting pipeline using multi-window burn-rate alerts for error budgets
The database team at a large ride-sharing company added a custom Prometheus metric to track query latency. They were proud of it — db_query_duration_milliseconds. It worked perfectly in development.
Three weeks later, the infrastructure team tried to create an SLO dashboard combining API latency (measured in seconds using http_request_duration_seconds) with database latency. The query looked simple:
histogram_quantile(0.99, sum by (le)(rate(http_request_duration_seconds_bucket[5m])))+histogram_quantile(0.99, sum by (le)(rate(db_query_duration_milliseconds_bucket[5m])))The P99 total latency showed 3,000.2 seconds. It took 45 minutes of confusion before someone realized: one metric was in seconds, the other in milliseconds. The query was adding 0.2 seconds of API latency to 3,000 milliseconds of DB latency. The result was mathematically correct but semantically nonsensical.
The fix required a migration: rename the metric, update all dashboards, modify alerting rules, and coordinate a rolling deployment across 400 pods. Two engineer-weeks of work, all because of a naming convention violation.
The Prometheus naming convention exists for exactly this reason: always use base units — seconds, bytes, meters. Never milliseconds, kilobytes, or centimeters. This war story gets told to every new hire now.
Why This Module Matters
Section titled “Why This Module Matters”Instrumentation and alerting are 34% of the PCA exam combined (16% Instrumentation + 18% Alerting). But beyond the exam, these are the skills that determine whether your monitoring actually works. Bad instrumentation creates metrics nobody can use. Bad alerting creates noise that trains your team to ignore pages.
This module covers the full lifecycle: choosing the right metric type, naming it correctly, exposing it from your application, collecting it with exporters, and routing alerts when things go wrong.
Did You Know?
Section titled “Did You Know?”- Prometheus client libraries exist for 15+ languages — Go, Python, Java, Ruby, Rust, .NET, Erlang, Haskell, and more. The Go library is the reference implementation; others follow the same patterns.
- node_exporter exposes over 1,000 metrics on a typical Linux system — CPU, memory, disk, network, filesystem, and more. It’s the single most deployed Prometheus exporter.
- Alertmanager was designed around the concept of “routing trees” inspired by email routing. The tree structure lets you route different alert severities to different teams using a single configuration file.
- The
_totalsuffix on counters was originally optional but became mandatory in the OpenMetrics standard. Modern Prometheus (2.x) adds it automatically if missing, but you should always include it in your instrumentation.
The Four Metric Types
Section titled “The Four Metric Types”Counter
Section titled “Counter”A counter is a cumulative metric that only goes up (and resets to zero on restart).
COUNTER: Monotonically increasing value──────────────────────────────────────────────────────────────
Value over time: 0 → 1 → 5 → 12 → 30 → 0 → 3 → 15 → 28 ↑ restart/reset
USE WHEN: ✓ Counting events (requests, errors, bytes sent) ✓ Counting completions (jobs finished, items processed) ✓ Anything that only goes up during normal operation
DON'T USE WHEN: ✗ Value can decrease (temperature, queue size) ✗ Value represents current state (active connections)
ALWAYS QUERY WITH rate() or increase(): rate(http_requests_total[5m]) ← per-second rate increase(http_requests_total[1h]) ← total in last hourA gauge is a metric that can go up and down.
GAUGE: Current value that can increase or decrease──────────────────────────────────────────────────────────────
Value over time: 42 → 38 → 55 → 71 → 63 → 48 → 52
USE WHEN: ✓ Current state (temperature, queue depth, active connections) ✓ Snapshots (memory usage, disk space, goroutine count) ✓ Values that go up AND down
DON'T USE WHEN: ✗ Counting events (use Counter) ✗ Measuring distributions (use Histogram)
QUERY DIRECTLY (no rate needed): node_memory_MemAvailable_bytes ← current available memory kube_deployment_spec_replicas ← desired replica countHistogram
Section titled “Histogram”A histogram samples observations and counts them in configurable buckets.
HISTOGRAM: Distribution of values in buckets──────────────────────────────────────────────────────────────
Generates 3 types of series: metric_bucket{le="0.1"} = 24054 ← cumulative count ≤ 0.1s metric_bucket{le="0.5"} = 129389 ← cumulative count ≤ 0.5s metric_bucket{le="+Inf"} = 144927 ← total count (all observations) metric_sum = 53423.4 ← sum of all observed values metric_count = 144927 ← total number of observations
USE WHEN: ✓ Request latency (the primary use case) ✓ Response sizes ✓ Any distribution where you need percentiles ✓ SLO calculations (bucket at your SLO target)
ADVANTAGES: ✓ Aggregatable across instances (can sum buckets) ✓ Can calculate any percentile after the fact ✓ Can compute average (sum / count)
TRADE-OFFS: ✗ Fixed bucket boundaries chosen at instrumentation time ✗ Each bucket is a separate time series (cardinality cost) ✗ Percentile accuracy depends on bucket granularitySummary
Section titled “Summary”A summary calculates quantiles on the client side.
SUMMARY: Client-computed quantiles──────────────────────────────────────────────────────────────
Generates series like: metric{quantile="0.5"} = 0.042 ← median metric{quantile="0.9"} = 0.087 ← P90 metric{quantile="0.99"} = 0.235 ← P99 metric_sum = 53423.4 ← sum of all observed values metric_count = 144927 ← total number of observations
USE WHEN: ✓ You need exact quantiles from a single instance ✓ You can't choose histogram bucket boundaries upfront ✓ Streaming quantile algorithms are acceptable
DON'T USE WHEN (most of the time): ✗ You need to aggregate across instances (cannot add quantiles meaningfully!) ✗ You need flexible percentile calculation at query time ✗ You need SLO calculations
PREFER HISTOGRAMS. Summaries exist for legacy reasons.Decision Framework: Which Type?
Section titled “Decision Framework: Which Type?”CHOOSING A METRIC TYPE──────────────────────────────────────────────────────────────
Does the value only go up?├── YES → Is it counting events/completions?│ ├── YES → COUNTER (with _total suffix)│ └── NO → Probably still a COUNTER└── NO → Can the value go up AND down? ├── YES → Is it a current state/snapshot? │ ├── YES → GAUGE │ └── NO → GAUGE (probably) └── Do you need distribution/percentiles? ├── YES → HISTOGRAM (almost always) │ └── Summary only if you truly │ can't define buckets upfront └── NO → GAUGEClient Library Instrumentation
Section titled “Client Library Instrumentation”Go (Reference Implementation)
Section titled “Go (Reference Implementation)”package main
import ( "net/http" "time"
"github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/promauto" "github.com/prometheus/client_golang/prometheus/promhttp")
var ( // Counter: total HTTP requests httpRequestsTotal = promauto.NewCounterVec( prometheus.CounterOpts{ Name: "myapp_http_requests_total", Help: "Total number of HTTP requests.", }, []string{"method", "status", "path"}, )
// Histogram: request latency httpRequestDuration = promauto.NewHistogramVec( prometheus.HistogramOpts{ Name: "myapp_http_request_duration_seconds", Help: "HTTP request latency in seconds.", Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5}, }, []string{"method", "path"}, )
// Gauge: active connections activeConnections = promauto.NewGauge( prometheus.GaugeOpts{ Name: "myapp_active_connections", Help: "Number of currently active connections.", }, ))
func handler(w http.ResponseWriter, r *http.Request) { start := time.Now() activeConnections.Inc() defer activeConnections.Dec()
// ... handle request ... w.WriteHeader(http.StatusOK)
duration := time.Since(start).Seconds() httpRequestsTotal.WithLabelValues(r.Method, "200", r.URL.Path).Inc() httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)}
func main() { http.HandleFunc("/", handler) http.Handle("/metrics", promhttp.Handler()) http.ListenAndServe(":8080", nil)}Python
Section titled “Python”from prometheus_client import Counter, Histogram, Gauge, start_http_serverimport time
# Counter: total HTTP requestsREQUEST_COUNT = Counter( 'myapp_http_requests_total', 'Total number of HTTP requests.', ['method', 'status', 'path'])
# Histogram: request latencyREQUEST_LATENCY = Histogram( 'myapp_http_request_duration_seconds', 'HTTP request latency in seconds.', ['method', 'path'], buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5])
# Gauge: active connectionsACTIVE_CONNECTIONS = Gauge( 'myapp_active_connections', 'Number of currently active connections.')
def handle_request(method, path): ACTIVE_CONNECTIONS.inc() start = time.time()
# ... handle request ... status = "200"
duration = time.time() - start REQUEST_COUNT.labels(method=method, status=status, path=path).inc() REQUEST_LATENCY.labels(method=method, path=path).observe(duration) ACTIVE_CONNECTIONS.dec()
# Start metrics server on port 8000start_http_server(8000)
# For Flask/FastAPI, use middleware instead:# from prometheus_flask_instrumentator import Instrumentator# Instrumentator().instrument(app).expose(app)Java (Micrometer / simpleclient)
Section titled “Java (Micrometer / simpleclient)”import io.prometheus.client.Counter;import io.prometheus.client.Histogram;import io.prometheus.client.Gauge;import io.prometheus.client.exporter.HTTPServer;
public class MyApp { // Counter: total HTTP requests static final Counter requestsTotal = Counter.build() .name("myapp_http_requests_total") .help("Total number of HTTP requests.") .labelNames("method", "status", "path") .register();
// Histogram: request latency static final Histogram requestDuration = Histogram.build() .name("myapp_http_request_duration_seconds") .help("HTTP request latency in seconds.") .labelNames("method", "path") .buckets(.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5) .register();
// Gauge: active connections static final Gauge activeConnections = Gauge.build() .name("myapp_active_connections") .help("Number of currently active connections.") .register();
public void handleRequest(String method, String path) { activeConnections.inc(); Histogram.Timer timer = requestDuration .labels(method, path) .startTimer();
try { // ... handle request ... requestsTotal.labels(method, "200", path).inc(); } finally { timer.observeDuration(); activeConnections.dec(); } }
public static void main(String[] args) throws Exception { // Expose metrics on port 8000 HTTPServer server = new HTTPServer(8000); }}Metric Naming Conventions
Section titled “Metric Naming Conventions”The Rules
Section titled “The Rules”PROMETHEUS NAMING CONVENTION──────────────────────────────────────────────────────────────
Format: <namespace>_<name>_<unit>_<suffix>
namespace = application or library name (myapp, http, node)name = what is being measured (requests, duration, size)unit = base unit (seconds, bytes, meters — NEVER milli/kilo)suffix = metric type indicator (_total for counters, _info for info)
GOOD: myapp_http_requests_total ← counter, counts requests myapp_http_request_duration_seconds ← histogram, duration in seconds myapp_http_response_size_bytes ← histogram, size in bytes node_memory_MemAvailable_bytes ← gauge, memory in bytes process_cpu_seconds_total ← counter, CPU time in seconds
BAD: myapp_requests ← missing unit, missing _total http_request_duration_milliseconds ← use seconds, not milliseconds db_query_time_ms ← abbreviation, non-base unit MyApp_HTTP_Requests ← camelCase/PascalCase, use snake_case request_latency ← vague, missing namespace and unitUnit Rules
Section titled “Unit Rules”| Measurement | Base Unit | Suffix | Example |
|---|---|---|---|
| Time/Duration | seconds | _seconds | http_request_duration_seconds |
| Data size | bytes | _bytes | http_response_size_bytes |
| Temperature | celsius | _celsius | room_temperature_celsius |
| Voltage | volts | _volts | power_supply_volts |
| Energy | joules | _joules | cpu_energy_joules |
| Weight | grams | _grams | package_weight_grams |
| Ratios | ratio | _ratio | cache_hit_ratio |
| Percentages | ratio (0-1) | _ratio | Use 0-1, not 0-100 |
Suffix Rules
Section titled “Suffix Rules”| Type | Suffix | Example |
|---|---|---|
| Counter | _total | http_requests_total |
| Counter (created timestamp) | _created | http_requests_created |
| Histogram | _bucket, _sum, _count | http_request_duration_seconds_bucket |
| Summary | _sum, _count | rpc_duration_seconds_sum |
| Info metric | _info | build_info{version="1.2.3"} |
| Gauge | (no suffix) | node_memory_MemAvailable_bytes |
Label Best Practices
Section titled “Label Best Practices”LABEL DO'S AND DON'TS──────────────────────────────────────────────────────────────
DO: ✓ Use labels for dimensions you'll filter/aggregate by ✓ Keep cardinality bounded (status codes: ~5 values) ✓ Use consistent names: "method" not "http_method" in one place and "request_method" in another
DON'T: ✗ user_id (millions of values = millions of series) ✗ request_id (unbounded, every request creates a series) ✗ email (PII + unbounded cardinality) ✗ url with path parameters (/users/12345 = unique per user) ✗ error_message (free-form text = unbounded) ✗ timestamp as label (infinite cardinality)
RULE OF THUMB: If a label can have more than ~100 unique values, it probably shouldn't be a label. Each unique label combination = one time series in memory.Exporters
Section titled “Exporters”node_exporter (Hardware & OS Metrics)
Section titled “node_exporter (Hardware & OS Metrics)”The most widely deployed exporter — runs on every node to expose hardware and OS metrics.
# Install via binarywget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gztar xvfz node_exporter-*.tar.gz./node_exporter
# Or via Kubernetes DaemonSet (kube-prometheus-stack includes it)helm install monitoring prometheus-community/kube-prometheus-stackKey metrics from node_exporter:
# CPU utilization1 - avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Memory utilization1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Disk space usage1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
# Network throughputrate(node_network_receive_bytes_total{device="eth0"}[5m])rate(node_network_transmit_bytes_total{device="eth0"}[5m])
# Disk I/Orate(node_disk_read_bytes_total[5m])rate(node_disk_written_bytes_total[5m])blackbox_exporter (Probing)
Section titled “blackbox_exporter (Probing)”Probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP. Useful for monitoring things you don’t control.
# blackbox-exporter configmodules: http_2xx: prober: http timeout: 5s http: valid_http_versions: ["HTTP/1.1", "HTTP/2.0"] valid_status_codes: [200] follow_redirects: true
http_post_2xx: prober: http http: method: POST
tcp_connect: prober: tcp timeout: 5s
dns_lookup: prober: dns dns: query_name: "kubernetes.default.svc.cluster.local" query_type: "A"
icmp_ping: prober: icmp timeout: 5sPrometheus scrape config for blackbox_exporter:
scrape_configs: - job_name: 'blackbox-http' metrics_path: /probe params: module: [http_2xx] static_configs: - targets: - https://example.com - https://api.myservice.com/health relabel_configs: # Pass the target URL as a parameter - source_labels: [__address__] target_label: __param_target # Store original target as a label - source_labels: [__param_target] target_label: instance # Point to the blackbox exporter - target_label: __address__ replacement: blackbox-exporter:9115Key blackbox metrics:
# Is the endpoint up?probe_success{job="blackbox-http"}
# SSL certificate expiry (days)(probe_ssl_earliest_cert_expiry - time()) / 86400
# HTTP response timeprobe_http_duration_seconds
# DNS lookup timeprobe_dns_lookup_time_secondsOther Common Exporters
Section titled “Other Common Exporters”| Exporter | Purpose | Key Metrics |
|---|---|---|
| mysqld_exporter | MySQL databases | Queries/sec, connections, replication lag |
| postgres_exporter | PostgreSQL databases | Active connections, transaction rate, table sizes |
| redis_exporter | Redis | Commands/sec, memory usage, connected clients |
| kafka_exporter | Apache Kafka | Consumer lag, topic offsets, partition count |
| nginx_exporter | Nginx | Active connections, requests/sec, response codes |
| kube-state-metrics | Kubernetes objects | Pod status, deployment replicas, node conditions |
| cadvisor | Containers | CPU, memory, network per container |
Alertmanager Deep Dive
Section titled “Alertmanager Deep Dive”Alert Lifecycle
Section titled “Alert Lifecycle”ALERT STATES──────────────────────────────────────────────────────────────
INACTIVE ──→ PENDING ──→ FIRING ──→ RESOLVED ↑ │ │ │ │ │ │ │ │ expr false │ for: 5m │ expr false │ └──────────────┘ elapsed │ for > 0s │ │ │ └──────────────┘
INACTIVE: Alert expression evaluates to false. No action.
PENDING: Alert expression evaluates to true. Waiting for "for" duration to elapse. Won't fire yet — prevents noise from brief spikes.
FIRING: Alert has been true for at least "for" duration. Sent to Alertmanager for routing and notification.
RESOLVED: Alert was firing, now expression is false. Alertmanager sends "resolved" notification.Alerting Rules
Section titled “Alerting Rules”groups: - name: application-alerts rules: # HIGH SEVERITY: Service completely down - alert: ServiceDown expr: up == 0 for: 1m labels: severity: critical team: platform annotations: summary: "{{ $labels.job }} is down" description: "{{ $labels.instance }} has been unreachable for >1 minute." runbook_url: "https://wiki.example.com/runbooks/service-down"
# HIGH SEVERITY: Error rate spike - alert: HighErrorRate expr: | sum by (service)(rate(http_requests_total{status=~"5.."}[5m])) / sum by (service)(rate(http_requests_total[5m])) > 0.05 for: 5m labels: severity: critical annotations: summary: "High error rate on {{ $labels.service }}" description: "Error rate is {{ $value | humanizePercentage }}."
# MEDIUM SEVERITY: Slow responses - alert: HighLatency expr: | histogram_quantile(0.99, sum by (le, service)(rate(http_request_duration_seconds_bucket[5m])) ) > 2 for: 10m labels: severity: warning annotations: summary: "High P99 latency on {{ $labels.service }}" description: "P99 latency is {{ $value | humanizeDuration }}."
# LOW SEVERITY: Certificate expiring - alert: SSLCertExpiringSoon expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30 for: 1h labels: severity: warning annotations: summary: "SSL cert for {{ $labels.instance }} expires in {{ $value | humanize }} days"
# CAPACITY: Disk filling up - alert: DiskSpaceLow expr: | (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15 for: 15m labels: severity: warning annotations: summary: "Disk space below 15% on {{ $labels.instance }}"
# SLO-BASED: Error budget burn rate - alert: ErrorBudgetBurnRate expr: | job:http_error_ratio:rate5m > (14.4 * 0.001) for: 5m labels: severity: critical annotations: summary: "Error budget burning too fast for {{ $labels.job }}" description: "At current rate, error budget will be exhausted in <1 hour."Alertmanager Configuration
Section titled “Alertmanager Configuration”# alertmanager.yml — complete production exampleglobal: resolve_timeout: 5m smtp_smarthost: 'smtp.example.com:587' smtp_from: 'alertmanager@example.com' smtp_auth_username: 'alertmanager' smtp_auth_password: '<secret>' slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxx' pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# TEMPLATES: customize notification formattemplates: - '/etc/alertmanager/templates/*.tmpl'
# ROUTING TREE: determines where alerts goroute: # Default receiver for unmatched alerts receiver: 'slack-default'
# Group alerts by these labels (reduces noise) group_by: ['alertname', 'service']
# Wait before sending first notification for a group group_wait: 30s
# Wait before sending updates to an existing group group_interval: 5m
# Wait before re-sending a firing alert repeat_interval: 4h
# Child routes (evaluated top-to-bottom, first match wins) routes: # Critical alerts → PagerDuty (wake someone up) - match: severity: critical receiver: 'pagerduty-critical' repeat_interval: 1h routes: # Database team owns DB alerts - match: team: database receiver: 'pagerduty-database'
# Warning alerts → Slack channel - match: severity: warning receiver: 'slack-warnings' repeat_interval: 4h
# Info alerts → email digest - match: severity: info receiver: 'email-digest' group_wait: 10m repeat_interval: 24h
# Regex matching: any alert from staging - match_re: environment: staging|dev receiver: 'slack-staging' repeat_interval: 12h
# RECEIVERS: notification targetsreceivers: - name: 'slack-default' slack_configs: - channel: '#alerts' send_resolved: true title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}' text: >- {{ range .Alerts }} *{{ .Labels.alertname }}* ({{ .Labels.severity }}) {{ .Annotations.description }} {{ end }}
- name: 'pagerduty-critical' pagerduty_configs: - routing_key: '<pagerduty-integration-key>' severity: critical description: '{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.summary }}'
- name: 'pagerduty-database' pagerduty_configs: - routing_key: '<database-team-key>' severity: critical
- name: 'slack-warnings' slack_configs: - channel: '#alerts-warnings' send_resolved: true
- name: 'slack-staging' slack_configs: - channel: '#alerts-staging' send_resolved: false
- name: 'email-digest' email_configs: - to: 'team@example.com' send_resolved: false
# INHIBITION RULES: suppress dependent alertsinhibit_rules: # If a critical alert fires, suppress warnings for the same service - source_match: severity: critical target_match: severity: warning equal: ['alertname', 'service']
# If a node is down, suppress all pod alerts on that node - source_match: alertname: NodeDown target_match_re: alertname: Pod.* equal: ['node']
# If cluster is unreachable, suppress everything - source_match: alertname: ClusterUnreachable target_match_re: alertname: .+ equal: ['cluster']Routing Tree Visual
Section titled “Routing Tree Visual”ALERTMANAGER ROUTING TREE──────────────────────────────────────────────────────────────
Incoming Alert: {alertname="HighErrorRate", severity="critical", team="api"}
route (root): receiver: slack-default├── match: severity=critical receiver: pagerduty-critical ← MATCH!│ └── match: team=database receiver: pagerduty-database (no match)├── match: severity=warning receiver: slack-warnings├── match: severity=info receiver: email-digest└── match_re: env=staging|dev receiver: slack-staging
Result: Alert goes to pagerduty-critical (first matching child route)
NOTE: By default, routing stops at first match. Add "continue: true" on a route to keep matching subsequent routes.Inhibition Rules Explained
Section titled “Inhibition Rules Explained”INHIBITION: Suppressing dependent alerts──────────────────────────────────────────────────────────────
Scenario: Node goes down → all pods on that node fail
WITHOUT inhibition: Alert: NodeDown (node-1) ← root cause Alert: PodCrashLooping (pod-a) ← symptom Alert: PodCrashLooping (pod-b) ← symptom Alert: PodCrashLooping (pod-c) ← symptom Alert: HighErrorRate (service-x) ← symptom = 5 pages for one problem!
WITH inhibition: inhibit_rules: - source_match: {alertname: NodeDown} target_match_re: {alertname: "Pod.*|HighErrorRate"} equal: [node]
Alert: NodeDown (node-1) ← only this fires (all dependent alerts suppressed) = 1 page for one problem!Silences
Section titled “Silences”Silences temporarily mute alerts during planned maintenance:
# Create a silence via amtool CLIamtool silence add \ --alertmanager.url=http://localhost:9093 \ --author="jane@example.com" \ --comment="Planned database maintenance window" \ --duration=2h \ service="database" severity="warning"
# List active silencesamtool silence query --alertmanager.url=http://localhost:9093
# Expire (remove) a silenceamtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>Silence via Alertmanager UI: Navigate to http://alertmanager:9093/#/silences and click “New Silence”. Fill in label matchers, duration, author, and comment.
Best practice: Always add a comment explaining why and a reasonable duration. Silences that last forever hide real problems.
Recording Rules for Alerting
Section titled “Recording Rules for Alerting”Recording rules pre-compute expensive expressions so alerting rules evaluate faster.
groups: - name: recording_rules interval: 30s rules: # Pre-compute error ratio per service - record: service:http_error_ratio:rate5m expr: | sum by (service)(rate(http_requests_total{status=~"5.."}[5m])) / sum by (service)(rate(http_requests_total[5m]))
# Pre-compute P99 latency per service - record: service:http_latency_p99:rate5m expr: | histogram_quantile(0.99, sum by (le, service)(rate(http_request_duration_seconds_bucket[5m])) )
# Pre-compute CPU utilization per node - record: node:cpu_utilization:ratio_rate5m expr: | 1 - avg by (node)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
- name: alerting_rules rules: # NOW alerting rules can use the pre-computed values - alert: HighErrorRate expr: service:http_error_ratio:rate5m > 0.05 for: 5m labels: severity: critical annotations: summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}"
- alert: HighLatency expr: service:http_latency_p99:rate5m > 2 for: 10m labels: severity: warning
- alert: HighCPU expr: node:cpu_utilization:ratio_rate5m > 0.9 for: 15m labels: severity: warningCommon Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| Using milliseconds for duration | Unit mismatch with other metrics | Always use base units: _seconds, not _milliseconds |
Counter without _total suffix | Violates OpenMetrics standard | Always append _total to counter names |
| High-cardinality labels (user_id) | Memory explosion, slow queries | Remove unbounded labels; aggregate at application level |
Missing Help text on metrics | Hard to understand; fails lint checks | Always add descriptive Help strings |
Alerting without for duration | Flapping alerts from transient spikes | Use for: 5m minimum for most alerts |
| No inhibition rules | Alert storms during major incidents | Suppress symptoms when root-cause alert fires |
| Silencing without comments | Nobody knows why alerts were muted | Always add author, comment, and expiry |
| Summary instead of Histogram | Cannot aggregate across instances | Use Histogram unless you have a specific reason not to |
| Missing runbook_url in annotations | On-call engineer has no guidance | Always link to a runbook explaining diagnosis/fix |
| Too many receivers | Alert fatigue, nobody reads channels | Consolidate: critical → page, warning → Slack, info → email |
1. What are the four Prometheus metric types? Give one real-world example for each.
Answer:
-
Counter: Monotonically increasing value. Resets on restart.
- Example:
http_requests_total— total HTTP requests served
- Example:
-
Gauge: Value that can go up and down.
- Example:
node_memory_MemAvailable_bytes— currently available memory
- Example:
-
Histogram: Observations bucketed by value, with cumulative counts.
- Example:
http_request_duration_seconds— request latency distribution
- Example:
-
Summary: Client-computed streaming quantiles.
- Example:
go_gc_duration_seconds— garbage collection pause duration with pre-computed percentiles
- Example:
Key distinction: Histograms are aggregatable across instances (sum buckets), Summaries are not (cannot add quantiles). Prefer Histograms in almost all cases.
2. Why does Prometheus use base units (seconds, bytes) instead of human-friendly units (milliseconds, megabytes)?
Answer:
Base units prevent unit mismatch errors when combining metrics. If one team uses _milliseconds and another uses _seconds, joining or adding these metrics produces nonsensical results.
Specific reasons:
- Consistency: All duration metrics are in seconds, so
rate(a_seconds[5m]) + rate(b_seconds[5m])always works - PromQL functions:
histogram_quantile()returns values in the metric’s unit — if metrics are in seconds, the result is in seconds - Grafana handles display: Grafana can convert seconds to “2.5ms” or “1.3h” for human display. Store in base units, format at display time.
- OpenMetrics standard: Requires base units for interoperability across tools
The rule: store in base units, display in human units.
3. Explain the Alertmanager routing tree. How does Alertmanager decide which receiver gets an alert?
Answer:
The routing tree is a hierarchy of routes, each with label matchers and a receiver:
- Every alert enters at the root route (the top-level
route:) - Child routes are evaluated top-to-bottom — first matching child wins
- Matching uses
match(exact) ormatch_re(regex) on alert labels - If no child matches, the alert goes to the root route’s receiver
continue: trueon a route means “keep checking subsequent siblings even after matching”- Child routes can have their own children — nesting creates a tree
route: receiver=default├── match: severity=critical → receiver=pagerduty│ └── match: team=db → receiver=pagerduty-db├── match: severity=warning → receiver=slack└── (unmatched) → receiver=defaultgroup_by, group_wait, group_interval, and repeat_interval control batching:
group_by: Labels to group alerts by (reduces notification count)group_wait: How long to buffer before sending the first notificationgroup_interval: Minimum time between updates to a grouprepeat_interval: How often to re-send a firing alert
4. What is the difference between inhibition and silencing in Alertmanager?
Answer:
Inhibition (automatic, rule-based):
- Suppresses target alerts when a source alert is firing
- Configured in
inhibit_rulesin alertmanager.yml - Happens automatically — no human action needed
- Example: NodeDown inhibits all PodCrashLooping alerts on that node
- Purpose: Prevent alert storms from cascading failures
Silencing (manual, time-based):
- Temporarily mutes alerts matching specific label matchers
- Created via UI or
amtoolCLI - Requires human action — someone decides to silence
- Has a defined expiry time
- Example: Silence all alerts for
service="database"during planned maintenance - Purpose: Suppress known noise during maintenance windows
Key difference: Inhibition is automatic and ongoing (fires whenever the source alert fires). Silencing is manual and temporary (created for a specific time window).
5. You're instrumenting a new microservice. It has an HTTP API and a background job queue. What metrics would you add, with what types and names?
Answer:
HTTP API metrics:
myservice_http_requests_total{method, status, path} — Countermyservice_http_request_duration_seconds{method, path} — Histogrammyservice_http_request_size_bytes{method, path} — Histogrammyservice_http_response_size_bytes{method, path} — Histogrammyservice_http_active_requests{method} — GaugeBackground job metrics:
myservice_jobs_processed_total{queue, status} — Countermyservice_job_duration_seconds{queue} — Histogrammyservice_jobs_queued{queue} — Gauge (current queue depth)myservice_job_last_success_timestamp_seconds{queue} — GaugeRuntime metrics (auto-exposed by most client libraries):
process_cpu_seconds_total — Counterprocess_resident_memory_bytes — Gaugego_goroutines (if Go) — GaugeDesign decisions:
pathlabel should use route patterns (/users/{id}), not actual paths (/users/12345) to avoid cardinality explosion- Histogram buckets for HTTP:
[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5] - Histogram buckets for jobs:
[.1, .5, 1, 5, 10, 30, 60, 300](jobs are typically slower)
6. What is the purpose of the `for` field in an alerting rule? What happens if you omit it?
Answer:
The for field specifies how long the alert expression must be continuously true before the alert transitions from pending to firing.
- alert: HighErrorRate expr: error_rate > 0.05 for: 5m # Must be true for 5 minutes before firingWithout for (or for: 0s):
- Alert fires immediately when expression is true
- Next evaluation cycle where expression is false → alert resolves
- Causes alert flapping: brief spikes trigger and resolve alerts rapidly
- On-call engineers get paged for transient conditions that self-resolve
With for: 5m:
- Brief spikes (< 5 min) are ignored
- Only sustained problems trigger notifications
- Reduces false positives significantly
Guidelines:
for: 1m— Critical infrastructure alerts (ServiceDown)for: 5m— Error rate and latency alertsfor: 15m— Capacity and resource alertsfor: 1h— Slow-burn problems (certificate expiry, disk growth trends)
Hands-On Exercise: Instrument, Export, Alert
Section titled “Hands-On Exercise: Instrument, Export, Alert”Build a complete monitoring pipeline: instrument an app, deploy an exporter, configure alerts.
# Ensure you have a cluster with Prometheus# (Use the setup from Module 1's hands-on, or:)kind create cluster --name pca-labhelm repo add prometheus-community https://prometheus-community.github.io/helm-chartshelm repo updatehelm install monitoring prometheus-community/kube-prometheus-stack \ --namespace monitoring --create-namespaceStep 1: Deploy a Python App with Custom Metrics
Section titled “Step 1: Deploy a Python App with Custom Metrics”apiVersion: v1kind: ConfigMapmetadata: name: app-code namespace: monitoringdata: app.py: | from prometheus_client import Counter, Histogram, Gauge, start_http_server import random, time, threading
REQUESTS = Counter('myapp_http_requests_total', 'Total HTTP requests', ['method', 'status']) LATENCY = Histogram('myapp_http_request_duration_seconds', 'Request latency', buckets=[.01, .025, .05, .1, .25, .5, 1, 2.5, 5]) QUEUE_SIZE = Gauge('myapp_queue_size', 'Current items in queue') JOBS = Counter('myapp_jobs_processed_total', 'Jobs processed', ['status'])
def simulate_traffic(): while True: method = random.choice(['GET', 'GET', 'GET', 'POST', 'PUT']) latency = random.expovariate(10) # ~100ms average status = '200' if random.random() > 0.03 else '500' REQUESTS.labels(method=method, status=status).inc() LATENCY.observe(latency) time.sleep(0.1)
def simulate_queue(): while True: QUEUE_SIZE.set(random.randint(0, 50)) if random.random() > 0.1: JOBS.labels(status='success').inc() else: JOBS.labels(status='failure').inc() time.sleep(1)
if __name__ == '__main__': start_http_server(8000) threading.Thread(target=simulate_traffic, daemon=True).start() threading.Thread(target=simulate_queue, daemon=True).start() print("Metrics server running on :8000") while True: time.sleep(1)---apiVersion: apps/v1kind: Deploymentmetadata: name: instrumented-app namespace: monitoringspec: replicas: 2 selector: matchLabels: app: instrumented-app template: metadata: labels: app: instrumented-app spec: containers: - name: app image: python:3.11-slim command: ["sh", "-c", "pip install prometheus_client && python /app/app.py"] ports: - containerPort: 8000 name: metrics volumeMounts: - name: code mountPath: /app volumes: - name: code configMap: name: app-code---apiVersion: v1kind: Servicemetadata: name: instrumented-app namespace: monitoring labels: app: instrumented-appspec: selector: app: instrumented-app ports: - port: 8000 targetPort: 8000 name: metricskubectl apply -f instrumented-app.yamlStep 2: Create a ServiceMonitor
Section titled “Step 2: Create a ServiceMonitor”apiVersion: monitoring.coreos.com/v1kind: ServiceMonitormetadata: name: instrumented-app namespace: monitoring labels: release: monitoring # Must match Prometheus selectorspec: selector: matchLabels: app: instrumented-app endpoints: - port: metrics interval: 15s path: /metricskubectl apply -f servicemonitor.yamlStep 3: Verify Scraping
Section titled “Step 3: Verify Scraping”# Port-forward to Prometheuskubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090Open http://localhost:9090/targets and verify instrumented-app appears as a target. Then query:
# Verify metrics are flowingmyapp_http_requests_total
# Request raterate(myapp_http_requests_total[5m])
# Error ratesum(rate(myapp_http_requests_total{status="500"}[5m]))/ sum(rate(myapp_http_requests_total[5m]))
# P99 latencyhistogram_quantile(0.99, sum by (le)(rate(myapp_http_request_duration_seconds_bucket[5m])))
# Queue depthmyapp_queue_sizeStep 4: Create Alerting Rules
Section titled “Step 4: Create Alerting Rules”apiVersion: monitoring.coreos.com/v1kind: PrometheusRulemetadata: name: instrumented-app-alerts namespace: monitoring labels: release: monitoringspec: groups: - name: instrumented-app rules: - alert: MyAppHighErrorRate expr: | sum(rate(myapp_http_requests_total{status=~"5.."}[5m])) / sum(rate(myapp_http_requests_total[5m])) > 0.05 for: 2m labels: severity: warning annotations: summary: "High error rate on instrumented-app" description: "Error rate is {{ $value | humanizePercentage }}"
- alert: MyAppHighLatency expr: | histogram_quantile(0.99, sum by (le)(rate(myapp_http_request_duration_seconds_bucket[5m])) ) > 1 for: 5m labels: severity: warning annotations: summary: "High P99 latency on instrumented-app"
- record: myapp:http_error_ratio:rate5m expr: | sum(rate(myapp_http_requests_total{status=~"5.."}[5m])) / sum(rate(myapp_http_requests_total[5m]))kubectl apply -f alerting-rules.yamlStep 5: Verify Alerts
Section titled “Step 5: Verify Alerts”Open http://localhost:9090/alerts and confirm the alert rules are loaded. If the simulated error rate exceeds 5%, you should see MyAppHighErrorRate transition from Inactive to Pending to Firing.
Success Criteria
Section titled “Success Criteria”You’ve completed this exercise when you can:
- See custom metrics (
myapp_*) in Prometheus - Query request rate, error rate, and latency percentiles
- ServiceMonitor is correctly scraping the application
- Alert rules are loaded and evaluating in Prometheus
- Recording rule
myapp:http_error_ratio:rate5mproduces results - Understand the Counter/Gauge/Histogram types from the instrumented code
Key Takeaways
Section titled “Key Takeaways”Before moving on, ensure you understand:
- Four metric types: Counter (monotonic, use rate()), Gauge (up/down, use directly), Histogram (buckets, use histogram_quantile()), Summary (client-side quantiles, rarely preferred)
- Naming conventions:
<namespace>_<name>_<unit>_totalfor counters, base units only (seconds, bytes), snake_case - Label best practices: Bounded cardinality only, no user_id/request_id, consistent naming across services
- Client libraries: Go/Python/Java all follow the same patterns — Counter.Inc(), Gauge.Set(), Histogram.Observe()
- Exporters: node_exporter (OS metrics), blackbox_exporter (probing), kube-state-metrics (K8s objects)
- Alert lifecycle: inactive -> pending -> firing -> resolved. The
forfield prevents flapping. - Alertmanager routing: Tree structure, first-match wins, group_by reduces noise, repeat_interval prevents spam
- Inhibition rules: Automatically suppress dependent alerts when root-cause alert fires
- Silences: Manual, temporary muting during maintenance. Always add comments and expiry.
- Recording rules: Pre-compute expensive queries for alerting and dashboards. Naming convention:
level:metric:operations
Further Reading
Section titled “Further Reading”- Prometheus Client Libraries — All official client libraries
- Writing Exporters — Guide to building custom exporters
- Alertmanager Configuration — Full Alertmanager config reference
- Instrumentation Best Practices — Official instrumentation guide
- Metric and Label Naming — Naming conventions reference
Summary
Section titled “Summary”Instrumentation is where monitoring begins — without good metrics from your applications and infrastructure, no amount of query power helps. This module covered the complete instrumentation-to-alerting pipeline: choosing the right metric type, naming it according to conventions, exposing it with client libraries, collecting it with exporters, and routing alerts through Alertmanager when thresholds are breached. Combined with the PromQL Deep Dive, you now have comprehensive coverage of the PCA exam’s four gap domains.
Related Modules
Section titled “Related Modules”- PCA README — Full exam domain mapping and study strategy
- PromQL Deep Dive — Complete PromQL for Domain 3
- Prometheus Fundamentals — Architecture and basics
- Grafana — Dashboarding for Domain 5