Module 1.5: Release Engineering Metrics & Observability
Цей контент ще не доступний вашою мовою.
Discipline Module | Complexity:
[MEDIUM]| Time: 2 hours
Prerequisites
Section titled “Prerequisites”Before starting this module:
- Required: Prometheus and Grafana basics — PromQL queries, dashboard creation, alert configuration
- Required: Module 1.1: Release Strategies — Understanding of deployment strategies and progressive delivery
- Recommended: SRE Module: SLOs and Error Budgets — SLI/SLO/SLA concepts
- Recommended: Familiarity with CI/CD pipelines and deployment tooling
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Implement DORA metrics collection to measure deployment frequency, lead time, MTTR, and change failure rate
- Design release quality dashboards that give engineering leadership actionable visibility into delivery health
- Build automated release scorecards that track reliability and velocity trends over time
- Analyze release metric patterns to identify bottlenecks in your software delivery pipeline
Why This Module Matters
Section titled “Why This Module Matters”A VP of Engineering walks into a meeting and asks three questions:
- “How often do we deploy to production?”
- “When a deployment causes a problem, how quickly do we recover?”
- “What percentage of our deployments cause failures?”
The room goes silent. Nobody knows. The CI/CD system tracks builds. The monitoring system tracks uptime. The incident tracker tracks outages. But nobody has connected these systems to answer the fundamental question: are we getting better or worse at releasing software?
This is the blind spot that kills engineering organizations. Without release metrics, you cannot tell if your investment in canary deployments, feature flags, and ring rollouts is actually paying off. You cannot tell if your deployment pipeline is a competitive advantage or a liability. You cannot tell if the team that deploys 50 times a day is bold or reckless.
The DORA research program — the largest scientific study of software delivery performance ever conducted — identified four metrics that separate elite engineering organizations from the rest. These are not vanity metrics. They are predictive indicators of organizational performance, and they all center on one thing: how you release software.
This module teaches you to measure what matters, build dashboards that tell the truth, and use release metrics to drive continuous improvement — not continuous blame.
The DORA Four Key Metrics
Section titled “The DORA Four Key Metrics”What DORA Discovered
Section titled “What DORA Discovered”The DevOps Research and Assessment (DORA) program, founded by Dr. Nicole Forsgren, Jez Humble, and Gene Kim, studied thousands of engineering organizations over a decade. Their finding: four metrics predict both software delivery performance AND organizational performance (profitability, market share, customer satisfaction).
These metrics are:
┌──────────────────────────────────────────────────────────────────┐│ DORA Four Key Metrics ││ ││ THROUGHPUT (Speed) STABILITY (Quality) ││ ───────────────── ────────────────── ││ 1. Deployment Frequency 3. Change Failure Rate ││ How often do we deploy? How often do deploys fail? ││ ││ 2. Lead Time for Changes 4. Mean Time to Recovery ││ How fast from commit How fast do we recover ││ to production? from failures? ││ ││ KEY INSIGHT: Elite teams are BOTH faster AND more stable. ││ Speed and stability are NOT trade-offs — they reinforce ││ each other. │└──────────────────────────────────────────────────────────────────┘Metric 1: Deployment Frequency
Section titled “Metric 1: Deployment Frequency”Definition: How often does your organization deploy code to production?
| Performance Level | Deployment Frequency |
|---|---|
| Elite | On-demand (multiple deploys per day) |
| High | Between once per day and once per week |
| Medium | Between once per week and once per month |
| Low | Between once per month and once every 6 months |
Why it matters: Frequent deployments mean smaller changesets. Smaller changesets are easier to debug, faster to roll back, and less risky. A team deploying 10 times per day ships 10 small changes. A team deploying monthly ships one massive change with hundreds of commits.
How to measure it:
# Count deployments per day (using observed generation changes)sum( changes(kube_deployment_status_observed_generation{namespace="production"}[24h])) by (deployment)Or, if you emit a custom metric on each deployment:
# Custom deployment countersum(increase(deployment_completed_total{environment="production"}[24h]))What to watch for:
- Deployment frequency decreasing over time (sign of growing fear or complexity)
- Uneven distribution across teams (some teams deploy daily, others monthly)
- Deploy frequency dropping after an incident (overcorrection)
Metric 2: Lead Time for Changes
Section titled “Metric 2: Lead Time for Changes”Definition: The time from a code commit to that code running successfully in production.
| Performance Level | Lead Time |
|---|---|
| Elite | Less than 1 hour |
| High | Between 1 day and 1 week |
| Medium | Between 1 week and 1 month |
| Low | Between 1 month and 6 months |
Components of lead time:
Commit → Code Review → CI Build → Deploy to Staging → Deploy to Prod │ │ │ │ │ └──────────┴────────────┴──────────────┴───────────────────┘ Total Lead Time
Coding time Review wait Build time Stage/test Deploy (developer) (bottleneck) (automated) (gates) (pipeline)How to measure it:
# If you emit timestamps for each stage:histogram_quantile(0.50, sum(rate( deployment_lead_time_seconds_bucket{environment="production"}[7d] )) by (le))In practice, most teams compute this by comparing the deployment timestamp with the oldest commit SHA in the deployment that was not in the previous deployment.
Common bottlenecks:
- Code review wait time (the #1 bottleneck in most organizations)
- Manual approval gates
- Slow CI pipelines
- Infrequent deploy windows
- Long-running integration test suites
Metric 3: Change Failure Rate (CFR)
Section titled “Metric 3: Change Failure Rate (CFR)”Definition: The percentage of deployments that cause a failure in production (requiring rollback, hotfix, or incident).
| Performance Level | Change Failure Rate |
|---|---|
| Elite | 0-5% |
| High | 5-10% |
| Medium | 10-15% |
| Low | 16-30% |
How to measure it:
# Failed deployments / Total deployments (use increase() for counters over a window)sum(increase(deployment_completed_total{result="failure", environment="production"}[30d])) /sum(increase(deployment_completed_total{environment="production"}[30d]))What counts as a “failure”?:
- Deployment rolled back (automatically or manually)
- Hotfix deployed within 24 hours of the original deployment
- Incident created and attributed to the deployment
- Canary analysis failed and aborted the rollout
What does NOT count:
- Bugs found in staging (caught before production)
- Planned rollbacks (testing the rollback mechanism)
- Configuration changes that did not affect users
Key insight: CFR and deployment frequency are NOT inversely correlated for elite teams. Elite teams deploy more often AND have lower failure rates. This is because frequent deployments mean smaller changes, which are inherently less risky.
Metric 4: Mean Time to Recovery (MTTR)
Section titled “Metric 4: Mean Time to Recovery (MTTR)”Definition: How long it takes to recover from a production failure (service impairment or outage).
| Performance Level | MTTR |
|---|---|
| Elite | Less than 1 hour |
| High | Less than 1 day |
| Medium | Less than 1 week |
| Low | More than 1 week |
MTTR components:
Failure occurs → Detection → Diagnosis → Fix → Recovery │ │ │ │ │ └──────────────┴───────────┴────────┴────────┘ Total MTTR
Time to Time to Time to Time to Detect Diagnose Fix Deploy Fix (TTD) (TTDiag) (TTF) (TTDeploy)How to reduce each component:
| Component | Reduction Strategy |
|---|---|
| Time to Detect | Better monitoring, tighter alert thresholds, canary analysis |
| Time to Diagnose | Deployment markers in dashboards, correlation of deploys with metrics |
| Time to Fix | Instant rollback, feature flag kill switches |
| Time to Deploy Fix | Fast CI/CD pipeline, automated deployment |
How to measure it:
# Average time from incident creation to resolutionavg( incident_resolution_timestamp - incident_creation_timestamp) by (severity)Release Health: Beyond DORA
Section titled “Release Health: Beyond DORA”Deployment-Specific Metrics
Section titled “Deployment-Specific Metrics”DORA metrics tell you about your overall delivery performance. You also need metrics about individual deployments:
1. Deployment Duration
How long does a deployment take from start to all pods healthy?
# Time from deployment start to last pod readyhistogram_quantile(0.95, sum(rate( deployment_duration_seconds_bucket{environment="production"}[7d] )) by (le, deployment))Watch for deployments getting slower over time — this often indicates growing image sizes, slow health checks, or resource contention.
2. Rollback Frequency
How often do deployments get rolled back?
sum(rate(deployment_rollback_total{environment="production"}[30d]))/sum(rate(deployment_completed_total{environment="production"}[30d]))A healthy rollback rate is 1-5%. Below 1% might mean you are not rolling back when you should be. Above 10% means your pre-production testing is inadequate.
3. Canary Analysis Pass Rate
How often do canary deployments pass their analysis gates?
sum(rate(argo_rollouts_analysis_run_total{phase="Successful"}[30d]))/sum(rate(argo_rollouts_analysis_run_total[30d]))4. Time to Full Rollout
How long from first canary pod to 100% traffic?
histogram_quantile(0.50, sum(rate( rollout_completion_time_seconds_bucket[30d] )) by (le))SLIs and SLOs During Releases
Section titled “SLIs and SLOs During Releases”Releases are the most dangerous time for SLO compliance. Here is how to protect your SLOs:
Pre-release SLO check: Before deploying, check remaining error budget:
# Remaining error budget (30-day window)1 - ( sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) / (1 - 0.999) # SLO target: 99.9%If error budget is below 20%, consider delaying the deployment or deploying with extra caution (smaller canary percentage, longer bake time).
During-release SLO monitoring:
# Real-time error budget burn ratesum(rate(http_requests_total{status=~"5.."}[5m]))/sum(rate(http_requests_total[5m]))If the 5-minute error rate exceeds 10x the SLO threshold during a canary, trigger automatic rollback.
Post-release SLO impact:
# Error budget consumed by last deployment# (error rate during deploy window - baseline error rate) * deploy durationTrack this per deployment to identify which releases consume the most error budget.
The Error Budget Gate
Section titled “The Error Budget Gate”Some organizations gate deployments on error budget:
Error Budget Remaining Deployment Policy───────────────────── ─────────────────────> 50% Deploy freely (normal process)20-50% Deploy with caution (smaller canary, longer bake)5-20% Deploy only critical fixes< 5% Deployment freeze (reliability work only)This is the error budget concept from SRE applied to release engineering: reliability failures consume the budget that would otherwise allow new deployments.
Deployment Markers and Correlation
Section titled “Deployment Markers and Correlation”The Most Important Dashboard Feature
Section titled “The Most Important Dashboard Feature”The single most powerful thing you can add to any monitoring dashboard is a deployment marker — a vertical line showing when a deployment happened:
Error Rate 5% ┤ │ 2% ┤ ▲ │ ╱ ╲ 1% ┤──────────────────╱ ╲────────────────── │ ╱ ╲ 0% ┤────────────────╱ ╲──────────────── └──┬───────────┬─┬────────┬─────────────── 00:00 │ 06:00 12:00 │ Deploy v2.1.0 ← DEPLOYMENT MARKERWithout the deployment marker, you see a spike and wonder: “What happened at 4 AM?” With the marker, the correlation is instant: “The v2.1.0 deployment caused the spike.”
Implementing Deployment Markers
Section titled “Implementing Deployment Markers”In Grafana (Annotations):
# Push a deployment annotation to Grafanacurl -X POST http://grafana:3000/api/annotations \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $GRAFANA_TOKEN" \ -d '{ "dashboardUID": "release-health", "time": '$(date +%s000)', "tags": ["deployment", "webapp", "v2.1.0"], "text": "Deployed webapp v2.1.0 by deploy-bot" }'In Prometheus (recording rule):
# Record deployment events as a metricgroups: - name: deployments rules: - record: deployment_event expr: | kube_deployment_status_observed_generation != ignoring(generation) kube_deployment_metadata_generationIn CI/CD pipeline:
# GitLab CI exampledeploy: script: - kubectl apply -f manifests/ - | curl -X POST "$GRAFANA_URL/api/annotations" \ -H "Authorization: Bearer $GRAFANA_TOKEN" \ -H "Content-Type: application/json" \ -d "{ \"time\": $(date +%s000), \"tags\": [\"deployment\", \"$CI_PROJECT_NAME\", \"$CI_COMMIT_SHORT_SHA\"], \"text\": \"Deployed $CI_PROJECT_NAME ($CI_COMMIT_SHORT_SHA) by $GITLAB_USER_LOGIN\" }"Correlation Dashboard Design
Section titled “Correlation Dashboard Design”A release health dashboard should answer these questions at a glance:
┌──────────────────────────────────────────────────────────────┐│ RELEASE HEALTH DASHBOARD ││ ││ ┌────────────────────────────────────────────────────────┐ ││ │ Error Rate (with deployment markers) │ ││ │ [graph with vertical lines at each deployment] │ ││ └────────────────────────────────────────────────────────┘ ││ ││ ┌────────────────────────────────────────────────────────┐ ││ │ P99 Latency (with deployment markers) │ ││ │ [graph with vertical lines at each deployment] │ ││ └────────────────────────────────────────────────────────┘ ││ ││ ┌─────────────────┐ ┌─────────────────┐ ││ │ Deploy Freq │ │ Change Failure │ ││ │ 12/day (7d avg) │ │ Rate: 3.2% │ ││ └─────────────────┘ └─────────────────┘ ││ ││ ┌─────────────────┐ ┌─────────────────┐ ││ │ Lead Time │ │ MTTR │ ││ │ 2.3 hours (p50) │ │ 18 min (p50) │ ││ └─────────────────┘ └─────────────────┘ ││ ││ ┌────────────────────────────────────────────────────────┐ ││ │ Recent Deployments │ ││ │ ✓ 14:22 webapp v2.1.0 (canary: passed) │ ││ │ ✗ 11:05 api-gateway v3.0 (canary: FAILED, rolled back│ ││ │ ✓ 09:30 auth-svc v1.8.2 (direct: success) │ ││ │ ✓ 08:15 webapp v2.0.9 (canary: passed) │ ││ └────────────────────────────────────────────────────────┘ ││ ││ ┌────────────────────────────────────────────────────────┐ ││ │ Error Budget Remaining: 72% (30-day window) │ ││ │ ████████████████████████████░░░░░░░░░░ │ ││ └────────────────────────────────────────────────────────┘ │└──────────────────────────────────────────────────────────────┘Alert Fatigue During Releases
Section titled “Alert Fatigue During Releases”The Problem
Section titled “The Problem”Deployments are inherently noisy. During a rolling update, old pods terminate and new pods start. Connection counts spike, latency briefly increases, error rates may blip. If your alerting is not deployment-aware, every deployment triggers false alerts:
14:00 - Deploy starts14:01 - ALERT: Connection pool usage > 80% ← False alarm14:02 - ALERT: P99 latency > 200ms ← False alarm (cold caches)14:03 - ALERT: Error rate > 0.5% ← False alarm (old pods draining)14:05 - Deploy complete, metrics normalize14:06 - Three people investigated three false alertsOver time, the team starts ignoring deployment alerts. Then a real deployment problem occurs, and nobody notices because they have been trained that deployment alerts are noise.
Solutions
Section titled “Solutions”1. Deployment-Aware Alert Suppression
Temporarily raise alert thresholds during deployments:
# Prometheus alerting rule with deployment awarenessgroups: - name: deployment-aware-alerts rules: - alert: HighErrorRate expr: | ( # Normal threshold: 0.5% sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.005 ) # Only alert if no deployment is in progress unless on() ( kube_deployment_status_updated_replicas != kube_deployment_spec_replicas ) for: 5m labels: severity: warning2. Separate “Deployment Alert” Channel
Route deployment-time alerts to a dedicated channel that the deployer monitors:
# Alertmanager routingroute: routes: - match: context: deployment receiver: deployment-channel group_wait: 30s - match: severity: critical receiver: pagerduty3. Post-Deployment Burn Rate Alerts
Instead of alerting on absolute thresholds during deployment, alert on the rate of change:
# Alert if error rate increases by more than 5x during deployment- alert: DeploymentCausedErrorSpike expr: | ( sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) ) / ( sum(rate(http_requests_total{status=~"5.."}[1h] offset 1h)) / sum(rate(http_requests_total[1h] offset 1h)) ) > 5 for: 3m labels: severity: critical context: deploymentThis alerts when the error rate is 5x the baseline from an hour ago — catching real problems while ignoring normal deployment noise.
Deployment Freezes
Section titled “Deployment Freezes”When to Freeze Deployments
Section titled “When to Freeze Deployments”A deployment freeze is a period during which non-critical deployments are prohibited. Common triggers:
| Trigger | Rationale |
|---|---|
| Error budget exhausted | No remaining budget for risky changes |
| Major business event | Black Friday, tax season, election day |
| Active incident | Do not compound problems with more changes |
| Holiday periods | Reduced staffing means slower incident response |
| End of quarter | Revenue-critical period for B2B SaaS |
Implementing Deployment Freezes
Section titled “Implementing Deployment Freezes”1. CI/CD Pipeline Gate:
# Check freeze status before deployingdeploy: rules: - if: $DEPLOYMENT_FREEZE == "true" when: manual allow_failure: false script: - | FREEZE=$(curl -s http://deploy-config/api/freeze-status) if [ "$FREEZE" = "active" ]; then echo "DEPLOYMENT FREEZE ACTIVE. Reason: $(curl -s http://deploy-config/api/freeze-reason)" echo "Contact release-engineering team for exceptions." exit 1 fi - kubectl apply -f manifests/2. Kubernetes Admission Webhook:
# Webhook that rejects deployments during freeze periodsdef validate_deployment(request): if is_freeze_active(): # Allow critical security patches if has_annotation(request, "freeze-exception: security"): return allow(request) return deny(request, "Deployment freeze active. Contact release-eng for exceptions.") return allow(request)3. DORA-Driven Freeze Policy:
Error Budget Status Deployment Policy────────────────── ──────────────────Budget > 50% Normal — deploy freelyBudget 20-50% Cautious — canary required, longer bakeBudget 5-20% Restricted — critical fixes onlyBudget < 5% Frozen — reliability work onlyBudget = 0% Hard freeze — emergency patches onlyThe Freeze Paradox
Section titled “The Freeze Paradox”Counterintuitively, deployment freezes often make things worse. Here is why:
During freeze: Changes accumulate ──► Large batch of changes after freeze Large batch ──► Higher risk per deployment Higher risk ──► Higher failure rate Higher failure rate ──► More incidents More incidents ──► More freezes → VICIOUS CYCLEThe DORA research shows that high-performing teams deploy more often, not less. Freezes should be rare, short, and combined with a plan to reduce the conditions that caused them.
Better alternative: Instead of freezing deployments, increase deployment safety requirements during high-risk periods:
- Require canary deployments (no direct rollouts)
- Increase bake time (24h instead of 2h)
- Require manual promotion at each stage
- Add extra metric checks to analysis templates
Building a DORA Metrics Program
Section titled “Building a DORA Metrics Program”Step 1: Start Measuring
Section titled “Step 1: Start Measuring”You cannot improve what you do not measure. Begin by instrumenting your deployment pipeline:
# deployment-events ConfigMap (emitted by CI/CD)apiVersion: v1kind: ConfigMapmetadata: name: deployment-event labels: deployment-event: "true"data: service: "webapp" version: "v2.1.0" environment: "production" deployer: "ci-bot" commit_sha: "abc123" commit_timestamp: "2026-03-24T10:00:00Z" deploy_timestamp: "2026-03-24T10:15:00Z" result: "success" # or "failure", "rollback"Step 2: Calculate Baselines
Section titled “Step 2: Calculate Baselines”Measure your current state for at least 30 days before setting targets:
Current Baselines (30-day): Deployment Frequency: 2.3 per day Lead Time (p50): 18 hours Change Failure Rate: 8.5% MTTR (p50): 45 minutesStep 3: Set Improvement Targets
Section titled “Step 3: Set Improvement Targets”Use the DORA performance levels as guides, but set realistic incremental targets:
Current 3-Month Target 6-Month TargetDeploy Frequency: 2.3/day 5/day 10/dayLead Time: 18 hours 8 hours 2 hoursCFR: 8.5% 5% 3%MTTR: 45 min 30 min 15 minStep 4: Review Regularly
Section titled “Step 4: Review Regularly”Monthly DORA reviews should:
- Track trends (improving or degrading?)
- Identify the biggest bottleneck (usually lead time)
- Celebrate improvements (deploy frequency up 20%!)
- Investigate regressions (why did CFR increase this month?)
Step 5: Avoid Goodhart’s Law
Section titled “Step 5: Avoid Goodhart’s Law”“When a measure becomes a target, it ceases to be a good measure.”
If you incentivize deployment frequency, teams will deploy empty changes. If you penalize change failure rate, teams will stop deploying. Use DORA metrics as diagnostic tools, not performance targets for individuals.
Did You Know?
Section titled “Did You Know?”-
The DORA research proved that speed and stability are NOT trade-offs. Before DORA, the conventional wisdom was that deploying faster meant more failures. The data showed the opposite: elite teams deploy 973x more frequently than low performers AND have 3x lower change failure rates. Speed enables stability because smaller changes are inherently less risky.
-
Etsy was one of the first companies to put deployment markers on all dashboards, and they found that over 80% of production incidents correlated with a recent deployment. This single visualization — a vertical line on a graph — reduced their mean time to diagnose by 60% because engineers stopped asking “what changed?” and started asking “what did this specific deployment change?”
-
Deployment freezes are inversely correlated with reliability in the DORA data. Organizations with more frequent deployment freezes have WORSE reliability metrics, not better. The freeze accumulates changes, making the post-freeze deployment larger and riskier. Elite performers deploy through holidays with automated canary analysis instead of freezing.
-
Google’s “golden signals” — latency, traffic, errors, and saturation — were designed specifically for deployment monitoring. The original Google SRE book presents these four signals as the minimum viable monitoring for any service, and they are particularly important during deployments when all four can shift simultaneously. Every release health dashboard should track all four golden signals.
War Story: The Dashboard That Stopped Incidents
Section titled “War Story: The Dashboard That Stopped Incidents”A fintech company averaged 3-4 deployment-related incidents per month. Their MTTR was 2 hours. They had good monitoring but a critical gap: deployments were invisible on their dashboards.
The post-incident analysis for every deployment-related incident followed the same pattern:
1. Alert fires at 14:22 — "Error rate > 1%"2. On-call opens Grafana — sees spike starting at 14:153. On-call checks recent changes — "Did anyone deploy anything?"4. On-call asks in Slack — no response for 10 minutes5. Someone responds — "I deployed api-gateway 20 minutes ago"6. On-call rolls back — error rate normalizes7. Total MTTR: 45 minutes (30 minutes spent finding the cause)The fix was embarrassingly simple: add deployment annotations to every Grafana dashboard.
After the change:
1. Alert fires at 14:22 — "Error rate > 1%"2. On-call opens Grafana — sees spike starting at 14:153. Deployment marker visible: "api-gateway v3.2.0 deployed at 14:14"4. On-call rolls back api-gateway5. Total MTTR: 8 minutes (zero time spent finding the cause)Over the next quarter:
- MTTR dropped from 2 hours to 15 minutes average
- Deployment-related incidents dropped from 3-4/month to 1/month (faster rollbacks prevented cascading failures)
- Engineer satisfaction with on-call improved significantly
The total implementation effort: one day of work to add annotations to the CI/CD pipeline.
Lesson: The most impactful observability improvement is often the simplest: connect your deployment system to your monitoring system.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| Measuring DORA metrics but not acting on them | Metrics without action are theater | Monthly review meetings with specific improvement actions |
| Using DORA metrics to evaluate individuals | Creates gaming and fear; people stop deploying | Use metrics as team/system diagnostics, never for individual performance |
| No deployment markers on dashboards | Every incident starts with “did anyone deploy something?” | Annotate all dashboards with deployment events from CI/CD |
| Alert thresholds not deployment-aware | Every deployment triggers false alerts, causing alert fatigue | Use deployment-aware alerting with suppression or elevated thresholds |
| Deployment freezes as primary risk strategy | Accumulates changes, making post-freeze deploys riskier | Increase deployment safety requirements instead of freezing |
| Measuring only deployment frequency | Frequent deploys mean nothing if they keep failing | Track all four DORA metrics together — they are interconnected |
| Not tracking error budget consumption per deploy | Cannot identify which releases are burning budget | Measure error budget delta for each deployment window |
| Ignoring lead time bottlenecks | Slow pipelines discourage frequent deployment | Profile your pipeline: where does time actually go? |
Quiz: Check Your Understanding
Section titled “Quiz: Check Your Understanding”Question 1
Section titled “Question 1”What are the four DORA metrics and what do they measure?
Show Answer
The four DORA metrics are:
-
Deployment Frequency — How often the organization deploys to production. Measures throughput.
-
Lead Time for Changes — Time from code commit to code running in production. Measures throughput.
-
Change Failure Rate (CFR) — Percentage of deployments that cause a production failure (rollback, hotfix, or incident). Measures stability.
-
Mean Time to Recovery (MTTR) — Time from a production failure to recovery. Measures stability.
The critical insight: elite teams are high on BOTH throughput AND stability. Speed and quality are not trade-offs — they reinforce each other because frequent small changes are inherently less risky than infrequent large changes.
Question 2
Section titled “Question 2”Why are deployment markers on dashboards considered one of the highest-impact observability improvements?
Show Answer
Deployment markers instantly answer the question “what changed?” — which is the first question in every incident investigation. Without markers:
- Engineers must manually check CI/CD systems, ask in Slack, and cross-reference timestamps
- This investigation takes 10-30 minutes per incident
- During that time, the issue is unresolved and users are affected
With markers:
- The correlation between a deployment and a metric change is visually obvious
- Time to diagnose drops from minutes to seconds
- Rollback decisions happen faster
Studies (including Etsy’s public data) show that 80%+ of production incidents correlate with recent deployments. Making deployments visible on monitoring dashboards is the single highest-ROI observability investment most teams can make.
Question 3
Section titled “Question 3”Why do deployment freezes often make reliability worse?
Show Answer
Deployment freezes create a vicious cycle:
- Changes accumulate during the freeze
- Post-freeze deployment is a large batch of changes
- Large batches are harder to test, debug, and roll back
- Higher risk leads to higher failure rates
- Higher failure rates lead to more freezes
The DORA research confirms this: organizations with more frequent freezes have worse reliability metrics. The freeze does not reduce risk — it concentrates and defers it.
Better alternatives:
- Increase deployment safety requirements (canary required, longer bake time)
- Require manual promotion at each stage
- Add extra metric checks during high-risk periods
- Gate deployments on error budget (automatic, not calendar-based)
Question 4
Section titled “Question 4”How should error budgets interact with deployment decisions?
Show Answer
Error budgets should gate deployment risk, not deployment itself:
- Budget > 50%: Deploy normally with standard process
- Budget 20-50%: Deploy with increased caution (smaller canary, longer bake time, extra metrics)
- Budget 5-20%: Only critical fixes and security patches
- Budget < 5%: Deployment freeze — reliability improvement work only
The key principle: error budget consumed by a bad deployment is budget that could have been spent on faster feature delivery. Teams that blow their error budget with bad releases lose their deployment velocity.
Additionally, track error budget consumption per deployment to identify which releases are the most costly and invest in improving those areas.
Question 5
Section titled “Question 5”What is Goodhart’s Law and how does it apply to DORA metrics?
Show Answer
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
Applied to DORA metrics:
- If deployment frequency becomes a target, teams deploy empty or trivial changes to inflate the number
- If change failure rate becomes a target, teams stop deploying risky (but necessary) changes
- If lead time becomes a target, teams skip important reviews and testing
- If MTTR becomes a target, incidents get “resolved” prematurely
DORA metrics should be used as diagnostic tools — they tell you where to investigate and improve. They should never be tied to individual performance reviews, bonuses, or promotions. Use them to identify systemic issues (slow pipeline, inadequate testing) rather than to judge people.
Question 6
Section titled “Question 6”How would you design deployment-aware alerting to avoid false positives during rollouts?
Show Answer
Three complementary approaches:
-
Suppression during deployment: Use Prometheus
unlessclauses to suppress alerts when a deployment is in progress (detected by replica count mismatch). Alerts only fire after the deployment completes and metrics stabilize. -
Rate-of-change alerts instead of absolute thresholds: Instead of “error rate > 0.5%”, alert on “error rate is 5x the baseline from 1 hour ago.” This ignores expected deployment noise while catching genuine spikes.
-
Separate deployment alert channel: Route deployment-time alerts to a dedicated channel monitored by the deployer, keeping the main on-call channel free from deployment noise. After deployment completes, normal alerting resumes.
The goal is to prevent alert fatigue — if every deployment generates false alerts, the team will ignore real alerts during deployments, which is the most dangerous time.
Hands-On Exercise: Build a Release Health Dashboard in Grafana
Section titled “Hands-On Exercise: Build a Release Health Dashboard in Grafana”Objective
Section titled “Objective”Create a Grafana dashboard that correlates deployments with error rates, demonstrating how deployment markers make incident diagnosis instant.
# Create clusterkind create cluster --name metrics-lab
# Install Prometheus and Grafanakubectl create namespace monitoring
# Deploy Prometheuscat <<'PROMEOF' | kubectl apply -f -apiVersion: apps/v1kind: Deploymentmetadata: name: prometheus namespace: monitoringspec: replicas: 1 selector: matchLabels: app: prometheus template: metadata: labels: app: prometheus spec: containers: - name: prometheus image: prom/prometheus:v2.53.0 ports: - containerPort: 9090 args: - "--config.file=/etc/prometheus/prometheus.yml" volumeMounts: - name: config mountPath: /etc/prometheus volumes: - name: config configMap: name: prometheus-config---apiVersion: v1kind: Servicemetadata: name: prometheus namespace: monitoringspec: selector: app: prometheus ports: - port: 9090---apiVersion: v1kind: ConfigMapmetadata: name: prometheus-config namespace: monitoringdata: prometheus.yml: | global: scrape_interval: 10s scrape_configs: - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port] action: replace target_label: __address__ regex: (.+) replacement: ${1}:9090PROMEOF
# Deploy Grafanacat <<'GRAFEOF' | kubectl apply -f -apiVersion: apps/v1kind: Deploymentmetadata: name: grafana namespace: monitoringspec: replicas: 1 selector: matchLabels: app: grafana template: metadata: labels: app: grafana spec: containers: - name: grafana image: grafana/grafana:11.1.0 ports: - containerPort: 3000 env: - name: GF_SECURITY_ADMIN_PASSWORD value: admin - name: GF_AUTH_ANONYMOUS_ENABLED value: "true"---apiVersion: v1kind: Servicemetadata: name: grafana namespace: monitoringspec: selector: app: grafana ports: - port: 3000GRAFEOF
kubectl -n monitoring rollout status deployment prometheus --timeout=90skubectl -n monitoring rollout status deployment grafana --timeout=90sStep 1: Port-Forward Grafana and Prometheus
Section titled “Step 1: Port-Forward Grafana and Prometheus”kubectl -n monitoring port-forward svc/grafana 3000:3000 &kubectl -n monitoring port-forward svc/prometheus 9090:9090 &sleep 3Step 2: Configure Prometheus Data Source in Grafana
Section titled “Step 2: Configure Prometheus Data Source in Grafana”# Add Prometheus as data sourcecurl -s -X POST http://admin:admin@localhost:3000/api/datasources \ -H "Content-Type: application/json" \ -d '{ "name": "Prometheus", "type": "prometheus", "url": "http://prometheus.monitoring:9090", "access": "proxy", "isDefault": true }' | jq .Step 3: Create the Release Health Dashboard
Section titled “Step 3: Create the Release Health Dashboard”# Create a dashboard with deployment annotationscurl -s -X POST http://admin:admin@localhost:3000/api/dashboards/db \ -H "Content-Type: application/json" \ -d '{ "dashboard": { "title": "Release Health Dashboard", "tags": ["release-engineering"], "timezone": "browser", "annotations": { "list": [ { "name": "Deployments", "datasource": "-- Grafana --", "enable": true, "iconColor": "rgba(255, 96, 96, 1)", "tags": ["deployment"] } ] }, "panels": [ { "title": "Pod Restarts (Simulated Error Rate)", "type": "timeseries", "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0}, "datasource": "Prometheus", "targets": [ { "expr": "sum(rate(kube_pod_container_status_restarts_total[5m])) by (namespace)", "legendFormat": "{{namespace}}" } ] }, { "title": "Running Pods by Namespace", "type": "timeseries", "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}, "datasource": "Prometheus", "targets": [ { "expr": "count(kube_pod_status_phase{phase=\"Running\"}) by (namespace)", "legendFormat": "{{namespace}}" } ] }, { "title": "DORA Metrics (Placeholder)", "type": "stat", "gridPos": {"h": 4, "w": 6, "x": 0, "y": 16}, "options": { "textMode": "value_and_name" }, "targets": [ { "expr": "vector(8)", "legendFormat": "Deploy Freq (per day)" } ] }, { "title": "Change Failure Rate", "type": "stat", "gridPos": {"h": 4, "w": 6, "x": 6, "y": 16}, "targets": [ { "expr": "vector(0.032)", "legendFormat": "CFR" } ], "fieldConfig": { "defaults": { "unit": "percentunit" } } } ] }, "overwrite": true }' | jq '.uid'Step 4: Simulate a Deployment and Add an Annotation
Section titled “Step 4: Simulate a Deployment and Add an Annotation”# Deploy a sample appkubectl create deployment webapp --image=hashicorp/http-echo:0.2.3 -- -text="v1" -listen=:8080kubectl rollout status deployment webapp
# Add deployment annotation to Grafanacurl -s -X POST http://admin:admin@localhost:3000/api/annotations \ -H "Content-Type: application/json" \ -d '{ "time": '$(date +%s000)', "tags": ["deployment"], "text": "Deployed webapp v1.0.0" }' | jq .
echo "First deployment marker added."sleep 10
# Simulate a new deploymentkubectl set image deployment/webapp http-echo=hashicorp/http-echo:0.2.3
# Add another annotationcurl -s -X POST http://admin:admin@localhost:3000/api/annotations \ -H "Content-Type: application/json" \ -d '{ "time": '$(date +%s000)', "tags": ["deployment"], "text": "Deployed webapp v2.0.0 - NEW VERSION" }' | jq .
echo "Second deployment marker added."Step 5: View the Dashboard
Section titled “Step 5: View the Dashboard”Open your browser and navigate to:
http://localhost:3000/dashboardsLog in with admin / admin. Find the “Release Health Dashboard.”
You should see:
- Time series panels showing Kubernetes metrics
- Red vertical lines (deployment annotations) marking when each deployment happened
- Stat panels showing placeholder DORA metrics
Step 6: Add More Annotations to Simulate a Deployment History
Section titled “Step 6: Add More Annotations to Simulate a Deployment History”# Simulate a history of deploymentsfor i in $(seq 1 5); do OFFSET=$((i * 600)) # Every 10 minutes back TIMESTAMP=$(( $(date +%s) - OFFSET )) curl -s -X POST http://admin:admin@localhost:3000/api/annotations \ -H "Content-Type: application/json" \ -d "{ \"time\": ${TIMESTAMP}000, \"tags\": [\"deployment\"], \"text\": \"Deployed webapp v1.${i}.0\" }" | jq .statusdone
echo "Deployment history annotations added. Refresh your dashboard."Clean Up
Section titled “Clean Up”kill %1 %2 2>/dev/nullkind delete cluster --name metrics-labSuccess Criteria
Section titled “Success Criteria”You have completed this exercise when you can confirm:
- Grafana is running with Prometheus as a data source
- The Release Health Dashboard shows time-series panels with Kubernetes metrics
- Deployment annotations (red vertical lines) are visible on the dashboard
- You can see the correlation between deployment events and metric changes
- You understand how CI/CD pipelines would push these annotations automatically
- You can explain why deployment markers reduce Mean Time to Diagnose
Key Takeaways
Section titled “Key Takeaways”- DORA metrics predict organizational performance — deployment frequency, lead time, change failure rate, and MTTR are the four metrics that matter most
- Speed and stability reinforce each other — elite teams deploy more often AND fail less, because small changes are inherently less risky
- Deployment markers are the highest-ROI observability investment — connecting your deployment system to your monitoring system eliminates the “what changed?” investigation
- Error budgets should gate deployment risk, not deployment itself — low budget means more caution, not a freeze
- Deployment freezes often backfire — they accumulate changes, making the post-freeze deployment riskier
- Deployment-aware alerting prevents alert fatigue — suppress or adjust alerts during deployments to avoid crying wolf
- Use DORA metrics as diagnostics, not performance targets — Goodhart’s Law will destroy any metric used to judge individuals
Further Reading
Section titled “Further Reading”Books:
- “Accelerate” — Nicole Forsgren, Jez Humble, Gene Kim (the definitive DORA research)
- “The DevOps Handbook” — Gene Kim, Jez Humble, Patrick Debois (deployment pipeline optimization)
- “Observability Engineering” — Charity Majors, Liz Fong-Jones, George Miranda
Reports:
- DORA State of DevOps Report — dora.dev (annual, free)
- Puppet State of DevOps Report — puppet.com/resources
Articles:
- “Are You an Elite DevOps Performer?” — DORA (dora.dev/research)
- “The Four Keys” — Google Cloud Blog (implementing DORA metrics)
- “Deployment Markers: The Most Important Dashboard Feature” — Charity Majors
Tools:
- Four Keys — github.com/dora-team/fourkeys (open-source DORA metrics dashboard)
- Sleuth — sleuth.io (DORA metrics tracking platform)
- LinearB — linearb.io (engineering metrics)
Summary
Section titled “Summary”Release engineering metrics transform deployment from an act of faith into a measurable, improvable process. The DORA four key metrics — deployment frequency, lead time, change failure rate, and MTTR — provide a scientific framework for measuring release performance. Combined with deployment markers on dashboards, error budget gating, and deployment-aware alerting, these practices give you visibility into whether your release engineering investments are paying off. The ultimate goal is not just measuring — it is creating a culture where releasing software is boring, predictable, and safe.
Next Module
Section titled “Next Module”You have completed the Release Engineering discipline fundamentals. Continue to the Release Engineering Overview for a summary of all modules and suggested next steps, including how to integrate release engineering with SRE, GitOps, and Platform Engineering practices.
“What gets measured gets managed — but only if you measure the right things for the right reasons.” — Adapted from Peter Drucker