Skip to content

Module 1.5: Release Engineering Metrics & Observability

Discipline Module | Complexity: [MEDIUM] | Time: 2 hours

Before starting this module:

  • Required: Prometheus and Grafana basics — PromQL queries, dashboard creation, alert configuration
  • Required: Module 1.1: Release Strategies — Understanding of deployment strategies and progressive delivery
  • Recommended: SRE Module: SLOs and Error Budgets — SLI/SLO/SLA concepts
  • Recommended: Familiarity with CI/CD pipelines and deployment tooling

After completing this module, you will be able to:

  • Implement DORA metrics collection to measure deployment frequency, lead time, MTTR, and change failure rate
  • Design release quality dashboards that give engineering leadership actionable visibility into delivery health
  • Build automated release scorecards that track reliability and velocity trends over time
  • Analyze release metric patterns to identify bottlenecks in your software delivery pipeline

A VP of Engineering walks into a meeting and asks three questions:

  1. “How often do we deploy to production?”
  2. “When a deployment causes a problem, how quickly do we recover?”
  3. “What percentage of our deployments cause failures?”

The room goes silent. Nobody knows. The CI/CD system tracks builds. The monitoring system tracks uptime. The incident tracker tracks outages. But nobody has connected these systems to answer the fundamental question: are we getting better or worse at releasing software?

This is the blind spot that kills engineering organizations. Without release metrics, you cannot tell if your investment in canary deployments, feature flags, and ring rollouts is actually paying off. You cannot tell if your deployment pipeline is a competitive advantage or a liability. You cannot tell if the team that deploys 50 times a day is bold or reckless.

The DORA research program — the largest scientific study of software delivery performance ever conducted — identified four metrics that separate elite engineering organizations from the rest. These are not vanity metrics. They are predictive indicators of organizational performance, and they all center on one thing: how you release software.

This module teaches you to measure what matters, build dashboards that tell the truth, and use release metrics to drive continuous improvement — not continuous blame.


The DevOps Research and Assessment (DORA) program, founded by Dr. Nicole Forsgren, Jez Humble, and Gene Kim, studied thousands of engineering organizations over a decade. Their finding: four metrics predict both software delivery performance AND organizational performance (profitability, market share, customer satisfaction).

These metrics are:

┌──────────────────────────────────────────────────────────────────┐
│ DORA Four Key Metrics │
│ │
│ THROUGHPUT (Speed) STABILITY (Quality) │
│ ───────────────── ────────────────── │
│ 1. Deployment Frequency 3. Change Failure Rate │
│ How often do we deploy? How often do deploys fail? │
│ │
│ 2. Lead Time for Changes 4. Mean Time to Recovery │
│ How fast from commit How fast do we recover │
│ to production? from failures? │
│ │
│ KEY INSIGHT: Elite teams are BOTH faster AND more stable. │
│ Speed and stability are NOT trade-offs — they reinforce │
│ each other. │
└──────────────────────────────────────────────────────────────────┘

Definition: How often does your organization deploy code to production?

Performance LevelDeployment Frequency
EliteOn-demand (multiple deploys per day)
HighBetween once per day and once per week
MediumBetween once per week and once per month
LowBetween once per month and once every 6 months

Why it matters: Frequent deployments mean smaller changesets. Smaller changesets are easier to debug, faster to roll back, and less risky. A team deploying 10 times per day ships 10 small changes. A team deploying monthly ships one massive change with hundreds of commits.

How to measure it:

# Count deployments per day (using observed generation changes)
sum(
changes(kube_deployment_status_observed_generation{namespace="production"}[24h])
) by (deployment)

Or, if you emit a custom metric on each deployment:

# Custom deployment counter
sum(increase(deployment_completed_total{environment="production"}[24h]))

What to watch for:

  • Deployment frequency decreasing over time (sign of growing fear or complexity)
  • Uneven distribution across teams (some teams deploy daily, others monthly)
  • Deploy frequency dropping after an incident (overcorrection)

Definition: The time from a code commit to that code running successfully in production.

Performance LevelLead Time
EliteLess than 1 hour
HighBetween 1 day and 1 week
MediumBetween 1 week and 1 month
LowBetween 1 month and 6 months

Components of lead time:

Commit → Code Review → CI Build → Deploy to Staging → Deploy to Prod
│ │ │ │ │
└──────────┴────────────┴──────────────┴───────────────────┘
Total Lead Time
Coding time Review wait Build time Stage/test Deploy
(developer) (bottleneck) (automated) (gates) (pipeline)

How to measure it:

# If you emit timestamps for each stage:
histogram_quantile(0.50,
sum(rate(
deployment_lead_time_seconds_bucket{environment="production"}[7d]
)) by (le)
)

In practice, most teams compute this by comparing the deployment timestamp with the oldest commit SHA in the deployment that was not in the previous deployment.

Common bottlenecks:

  • Code review wait time (the #1 bottleneck in most organizations)
  • Manual approval gates
  • Slow CI pipelines
  • Infrequent deploy windows
  • Long-running integration test suites

Definition: The percentage of deployments that cause a failure in production (requiring rollback, hotfix, or incident).

Performance LevelChange Failure Rate
Elite0-5%
High5-10%
Medium10-15%
Low16-30%

How to measure it:

# Failed deployments / Total deployments (use increase() for counters over a window)
sum(increase(deployment_completed_total{result="failure", environment="production"}[30d])) /
sum(increase(deployment_completed_total{environment="production"}[30d]))

What counts as a “failure”?:

  • Deployment rolled back (automatically or manually)
  • Hotfix deployed within 24 hours of the original deployment
  • Incident created and attributed to the deployment
  • Canary analysis failed and aborted the rollout

What does NOT count:

  • Bugs found in staging (caught before production)
  • Planned rollbacks (testing the rollback mechanism)
  • Configuration changes that did not affect users

Key insight: CFR and deployment frequency are NOT inversely correlated for elite teams. Elite teams deploy more often AND have lower failure rates. This is because frequent deployments mean smaller changes, which are inherently less risky.

Definition: How long it takes to recover from a production failure (service impairment or outage).

Performance LevelMTTR
EliteLess than 1 hour
HighLess than 1 day
MediumLess than 1 week
LowMore than 1 week

MTTR components:

Failure occurs → Detection → Diagnosis → Fix → Recovery
│ │ │ │ │
└──────────────┴───────────┴────────┴────────┘
Total MTTR
Time to Time to Time to Time to
Detect Diagnose Fix Deploy Fix
(TTD) (TTDiag) (TTF) (TTDeploy)

How to reduce each component:

ComponentReduction Strategy
Time to DetectBetter monitoring, tighter alert thresholds, canary analysis
Time to DiagnoseDeployment markers in dashboards, correlation of deploys with metrics
Time to FixInstant rollback, feature flag kill switches
Time to Deploy FixFast CI/CD pipeline, automated deployment

How to measure it:

# Average time from incident creation to resolution
avg(
incident_resolution_timestamp - incident_creation_timestamp
) by (severity)

DORA metrics tell you about your overall delivery performance. You also need metrics about individual deployments:

1. Deployment Duration

How long does a deployment take from start to all pods healthy?

# Time from deployment start to last pod ready
histogram_quantile(0.95,
sum(rate(
deployment_duration_seconds_bucket{environment="production"}[7d]
)) by (le, deployment)
)

Watch for deployments getting slower over time — this often indicates growing image sizes, slow health checks, or resource contention.

2. Rollback Frequency

How often do deployments get rolled back?

sum(rate(deployment_rollback_total{environment="production"}[30d]))
/
sum(rate(deployment_completed_total{environment="production"}[30d]))

A healthy rollback rate is 1-5%. Below 1% might mean you are not rolling back when you should be. Above 10% means your pre-production testing is inadequate.

3. Canary Analysis Pass Rate

How often do canary deployments pass their analysis gates?

sum(rate(argo_rollouts_analysis_run_total{phase="Successful"}[30d]))
/
sum(rate(argo_rollouts_analysis_run_total[30d]))

4. Time to Full Rollout

How long from first canary pod to 100% traffic?

histogram_quantile(0.50,
sum(rate(
rollout_completion_time_seconds_bucket[30d]
)) by (le)
)

Releases are the most dangerous time for SLO compliance. Here is how to protect your SLOs:

Pre-release SLO check: Before deploying, check remaining error budget:

# Remaining error budget (30-day window)
1 - (
sum(rate(http_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
) / (1 - 0.999) # SLO target: 99.9%

If error budget is below 20%, consider delaying the deployment or deploying with extra caution (smaller canary percentage, longer bake time).

During-release SLO monitoring:

# Real-time error budget burn rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

If the 5-minute error rate exceeds 10x the SLO threshold during a canary, trigger automatic rollback.

Post-release SLO impact:

# Error budget consumed by last deployment
# (error rate during deploy window - baseline error rate) * deploy duration

Track this per deployment to identify which releases consume the most error budget.

Some organizations gate deployments on error budget:

Error Budget Remaining Deployment Policy
───────────────────── ─────────────────────
> 50% Deploy freely (normal process)
20-50% Deploy with caution (smaller canary, longer bake)
5-20% Deploy only critical fixes
< 5% Deployment freeze (reliability work only)

This is the error budget concept from SRE applied to release engineering: reliability failures consume the budget that would otherwise allow new deployments.


The single most powerful thing you can add to any monitoring dashboard is a deployment marker — a vertical line showing when a deployment happened:

Error Rate
5% ┤
2% ┤ ▲
│ ╱ ╲
1% ┤──────────────────╱ ╲──────────────────
│ ╱ ╲
0% ┤────────────────╱ ╲────────────────
└──┬───────────┬─┬────────┬───────────────
00:00 │ 06:00 12:00
Deploy v2.1.0 ← DEPLOYMENT MARKER

Without the deployment marker, you see a spike and wonder: “What happened at 4 AM?” With the marker, the correlation is instant: “The v2.1.0 deployment caused the spike.”

In Grafana (Annotations):

Terminal window
# Push a deployment annotation to Grafana
curl -X POST http://grafana:3000/api/annotations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GRAFANA_TOKEN" \
-d '{
"dashboardUID": "release-health",
"time": '$(date +%s000)',
"tags": ["deployment", "webapp", "v2.1.0"],
"text": "Deployed webapp v2.1.0 by deploy-bot"
}'

In Prometheus (recording rule):

# Record deployment events as a metric
groups:
- name: deployments
rules:
- record: deployment_event
expr: |
kube_deployment_status_observed_generation
!= ignoring(generation)
kube_deployment_metadata_generation

In CI/CD pipeline:

# GitLab CI example
deploy:
script:
- kubectl apply -f manifests/
- |
curl -X POST "$GRAFANA_URL/api/annotations" \
-H "Authorization: Bearer $GRAFANA_TOKEN" \
-H "Content-Type: application/json" \
-d "{
\"time\": $(date +%s000),
\"tags\": [\"deployment\", \"$CI_PROJECT_NAME\", \"$CI_COMMIT_SHORT_SHA\"],
\"text\": \"Deployed $CI_PROJECT_NAME ($CI_COMMIT_SHORT_SHA) by $GITLAB_USER_LOGIN\"
}"

A release health dashboard should answer these questions at a glance:

┌──────────────────────────────────────────────────────────────┐
│ RELEASE HEALTH DASHBOARD │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Error Rate (with deployment markers) │ │
│ │ [graph with vertical lines at each deployment] │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ P99 Latency (with deployment markers) │ │
│ │ [graph with vertical lines at each deployment] │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Deploy Freq │ │ Change Failure │ │
│ │ 12/day (7d avg) │ │ Rate: 3.2% │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Lead Time │ │ MTTR │ │
│ │ 2.3 hours (p50) │ │ 18 min (p50) │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Recent Deployments │ │
│ │ ✓ 14:22 webapp v2.1.0 (canary: passed) │ │
│ │ ✗ 11:05 api-gateway v3.0 (canary: FAILED, rolled back│ │
│ │ ✓ 09:30 auth-svc v1.8.2 (direct: success) │ │
│ │ ✓ 08:15 webapp v2.0.9 (canary: passed) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Error Budget Remaining: 72% (30-day window) │ │
│ │ ████████████████████████████░░░░░░░░░░ │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

Deployments are inherently noisy. During a rolling update, old pods terminate and new pods start. Connection counts spike, latency briefly increases, error rates may blip. If your alerting is not deployment-aware, every deployment triggers false alerts:

14:00 - Deploy starts
14:01 - ALERT: Connection pool usage > 80% ← False alarm
14:02 - ALERT: P99 latency > 200ms ← False alarm (cold caches)
14:03 - ALERT: Error rate > 0.5% ← False alarm (old pods draining)
14:05 - Deploy complete, metrics normalize
14:06 - Three people investigated three false alerts

Over time, the team starts ignoring deployment alerts. Then a real deployment problem occurs, and nobody notices because they have been trained that deployment alerts are noise.

1. Deployment-Aware Alert Suppression

Temporarily raise alert thresholds during deployments:

# Prometheus alerting rule with deployment awareness
groups:
- name: deployment-aware-alerts
rules:
- alert: HighErrorRate
expr: |
(
# Normal threshold: 0.5%
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
> 0.005
)
# Only alert if no deployment is in progress
unless on()
(
kube_deployment_status_updated_replicas
!= kube_deployment_spec_replicas
)
for: 5m
labels:
severity: warning

2. Separate “Deployment Alert” Channel

Route deployment-time alerts to a dedicated channel that the deployer monitors:

# Alertmanager routing
route:
routes:
- match:
context: deployment
receiver: deployment-channel
group_wait: 30s
- match:
severity: critical
receiver: pagerduty

3. Post-Deployment Burn Rate Alerts

Instead of alerting on absolute thresholds during deployment, alert on the rate of change:

# Alert if error rate increases by more than 5x during deployment
- alert: DeploymentCausedErrorSpike
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
)
/
(
sum(rate(http_requests_total{status=~"5.."}[1h] offset 1h))
/ sum(rate(http_requests_total[1h] offset 1h))
)
> 5
for: 3m
labels:
severity: critical
context: deployment

This alerts when the error rate is 5x the baseline from an hour ago — catching real problems while ignoring normal deployment noise.


A deployment freeze is a period during which non-critical deployments are prohibited. Common triggers:

TriggerRationale
Error budget exhaustedNo remaining budget for risky changes
Major business eventBlack Friday, tax season, election day
Active incidentDo not compound problems with more changes
Holiday periodsReduced staffing means slower incident response
End of quarterRevenue-critical period for B2B SaaS

1. CI/CD Pipeline Gate:

# Check freeze status before deploying
deploy:
rules:
- if: $DEPLOYMENT_FREEZE == "true"
when: manual
allow_failure: false
script:
- |
FREEZE=$(curl -s http://deploy-config/api/freeze-status)
if [ "$FREEZE" = "active" ]; then
echo "DEPLOYMENT FREEZE ACTIVE. Reason: $(curl -s http://deploy-config/api/freeze-reason)"
echo "Contact release-engineering team for exceptions."
exit 1
fi
- kubectl apply -f manifests/

2. Kubernetes Admission Webhook:

# Webhook that rejects deployments during freeze periods
def validate_deployment(request):
if is_freeze_active():
# Allow critical security patches
if has_annotation(request, "freeze-exception: security"):
return allow(request)
return deny(request, "Deployment freeze active. Contact release-eng for exceptions.")
return allow(request)

3. DORA-Driven Freeze Policy:

Error Budget Status Deployment Policy
────────────────── ──────────────────
Budget > 50% Normal — deploy freely
Budget 20-50% Cautious — canary required, longer bake
Budget 5-20% Restricted — critical fixes only
Budget < 5% Frozen — reliability work only
Budget = 0% Hard freeze — emergency patches only

Counterintuitively, deployment freezes often make things worse. Here is why:

During freeze:
Changes accumulate ──► Large batch of changes after freeze
Large batch ──► Higher risk per deployment
Higher risk ──► Higher failure rate
Higher failure rate ──► More incidents
More incidents ──► More freezes
→ VICIOUS CYCLE

The DORA research shows that high-performing teams deploy more often, not less. Freezes should be rare, short, and combined with a plan to reduce the conditions that caused them.

Better alternative: Instead of freezing deployments, increase deployment safety requirements during high-risk periods:

  • Require canary deployments (no direct rollouts)
  • Increase bake time (24h instead of 2h)
  • Require manual promotion at each stage
  • Add extra metric checks to analysis templates

You cannot improve what you do not measure. Begin by instrumenting your deployment pipeline:

# deployment-events ConfigMap (emitted by CI/CD)
apiVersion: v1
kind: ConfigMap
metadata:
name: deployment-event
labels:
deployment-event: "true"
data:
service: "webapp"
version: "v2.1.0"
environment: "production"
deployer: "ci-bot"
commit_sha: "abc123"
commit_timestamp: "2026-03-24T10:00:00Z"
deploy_timestamp: "2026-03-24T10:15:00Z"
result: "success" # or "failure", "rollback"

Measure your current state for at least 30 days before setting targets:

Current Baselines (30-day):
Deployment Frequency: 2.3 per day
Lead Time (p50): 18 hours
Change Failure Rate: 8.5%
MTTR (p50): 45 minutes

Use the DORA performance levels as guides, but set realistic incremental targets:

Current 3-Month Target 6-Month Target
Deploy Frequency: 2.3/day 5/day 10/day
Lead Time: 18 hours 8 hours 2 hours
CFR: 8.5% 5% 3%
MTTR: 45 min 30 min 15 min

Monthly DORA reviews should:

  • Track trends (improving or degrading?)
  • Identify the biggest bottleneck (usually lead time)
  • Celebrate improvements (deploy frequency up 20%!)
  • Investigate regressions (why did CFR increase this month?)

“When a measure becomes a target, it ceases to be a good measure.”

If you incentivize deployment frequency, teams will deploy empty changes. If you penalize change failure rate, teams will stop deploying. Use DORA metrics as diagnostic tools, not performance targets for individuals.


  1. The DORA research proved that speed and stability are NOT trade-offs. Before DORA, the conventional wisdom was that deploying faster meant more failures. The data showed the opposite: elite teams deploy 973x more frequently than low performers AND have 3x lower change failure rates. Speed enables stability because smaller changes are inherently less risky.

  2. Etsy was one of the first companies to put deployment markers on all dashboards, and they found that over 80% of production incidents correlated with a recent deployment. This single visualization — a vertical line on a graph — reduced their mean time to diagnose by 60% because engineers stopped asking “what changed?” and started asking “what did this specific deployment change?”

  3. Deployment freezes are inversely correlated with reliability in the DORA data. Organizations with more frequent deployment freezes have WORSE reliability metrics, not better. The freeze accumulates changes, making the post-freeze deployment larger and riskier. Elite performers deploy through holidays with automated canary analysis instead of freezing.

  4. Google’s “golden signals” — latency, traffic, errors, and saturation — were designed specifically for deployment monitoring. The original Google SRE book presents these four signals as the minimum viable monitoring for any service, and they are particularly important during deployments when all four can shift simultaneously. Every release health dashboard should track all four golden signals.


War Story: The Dashboard That Stopped Incidents

Section titled “War Story: The Dashboard That Stopped Incidents”

A fintech company averaged 3-4 deployment-related incidents per month. Their MTTR was 2 hours. They had good monitoring but a critical gap: deployments were invisible on their dashboards.

The post-incident analysis for every deployment-related incident followed the same pattern:

1. Alert fires at 14:22 — "Error rate > 1%"
2. On-call opens Grafana — sees spike starting at 14:15
3. On-call checks recent changes — "Did anyone deploy anything?"
4. On-call asks in Slack — no response for 10 minutes
5. Someone responds — "I deployed api-gateway 20 minutes ago"
6. On-call rolls back — error rate normalizes
7. Total MTTR: 45 minutes (30 minutes spent finding the cause)

The fix was embarrassingly simple: add deployment annotations to every Grafana dashboard.

After the change:

1. Alert fires at 14:22 — "Error rate > 1%"
2. On-call opens Grafana — sees spike starting at 14:15
3. Deployment marker visible: "api-gateway v3.2.0 deployed at 14:14"
4. On-call rolls back api-gateway
5. Total MTTR: 8 minutes (zero time spent finding the cause)

Over the next quarter:

  • MTTR dropped from 2 hours to 15 minutes average
  • Deployment-related incidents dropped from 3-4/month to 1/month (faster rollbacks prevented cascading failures)
  • Engineer satisfaction with on-call improved significantly

The total implementation effort: one day of work to add annotations to the CI/CD pipeline.

Lesson: The most impactful observability improvement is often the simplest: connect your deployment system to your monitoring system.


MistakeProblemSolution
Measuring DORA metrics but not acting on themMetrics without action are theaterMonthly review meetings with specific improvement actions
Using DORA metrics to evaluate individualsCreates gaming and fear; people stop deployingUse metrics as team/system diagnostics, never for individual performance
No deployment markers on dashboardsEvery incident starts with “did anyone deploy something?”Annotate all dashboards with deployment events from CI/CD
Alert thresholds not deployment-awareEvery deployment triggers false alerts, causing alert fatigueUse deployment-aware alerting with suppression or elevated thresholds
Deployment freezes as primary risk strategyAccumulates changes, making post-freeze deploys riskierIncrease deployment safety requirements instead of freezing
Measuring only deployment frequencyFrequent deploys mean nothing if they keep failingTrack all four DORA metrics together — they are interconnected
Not tracking error budget consumption per deployCannot identify which releases are burning budgetMeasure error budget delta for each deployment window
Ignoring lead time bottlenecksSlow pipelines discourage frequent deploymentProfile your pipeline: where does time actually go?

What are the four DORA metrics and what do they measure?

Show Answer

The four DORA metrics are:

  1. Deployment Frequency — How often the organization deploys to production. Measures throughput.

  2. Lead Time for Changes — Time from code commit to code running in production. Measures throughput.

  3. Change Failure Rate (CFR) — Percentage of deployments that cause a production failure (rollback, hotfix, or incident). Measures stability.

  4. Mean Time to Recovery (MTTR) — Time from a production failure to recovery. Measures stability.

The critical insight: elite teams are high on BOTH throughput AND stability. Speed and quality are not trade-offs — they reinforce each other because frequent small changes are inherently less risky than infrequent large changes.

Why are deployment markers on dashboards considered one of the highest-impact observability improvements?

Show Answer

Deployment markers instantly answer the question “what changed?” — which is the first question in every incident investigation. Without markers:

  • Engineers must manually check CI/CD systems, ask in Slack, and cross-reference timestamps
  • This investigation takes 10-30 minutes per incident
  • During that time, the issue is unresolved and users are affected

With markers:

  • The correlation between a deployment and a metric change is visually obvious
  • Time to diagnose drops from minutes to seconds
  • Rollback decisions happen faster

Studies (including Etsy’s public data) show that 80%+ of production incidents correlate with recent deployments. Making deployments visible on monitoring dashboards is the single highest-ROI observability investment most teams can make.

Why do deployment freezes often make reliability worse?

Show Answer

Deployment freezes create a vicious cycle:

  1. Changes accumulate during the freeze
  2. Post-freeze deployment is a large batch of changes
  3. Large batches are harder to test, debug, and roll back
  4. Higher risk leads to higher failure rates
  5. Higher failure rates lead to more freezes

The DORA research confirms this: organizations with more frequent freezes have worse reliability metrics. The freeze does not reduce risk — it concentrates and defers it.

Better alternatives:

  • Increase deployment safety requirements (canary required, longer bake time)
  • Require manual promotion at each stage
  • Add extra metric checks during high-risk periods
  • Gate deployments on error budget (automatic, not calendar-based)

How should error budgets interact with deployment decisions?

Show Answer

Error budgets should gate deployment risk, not deployment itself:

  • Budget > 50%: Deploy normally with standard process
  • Budget 20-50%: Deploy with increased caution (smaller canary, longer bake time, extra metrics)
  • Budget 5-20%: Only critical fixes and security patches
  • Budget < 5%: Deployment freeze — reliability improvement work only

The key principle: error budget consumed by a bad deployment is budget that could have been spent on faster feature delivery. Teams that blow their error budget with bad releases lose their deployment velocity.

Additionally, track error budget consumption per deployment to identify which releases are the most costly and invest in improving those areas.

What is Goodhart’s Law and how does it apply to DORA metrics?

Show Answer

Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

Applied to DORA metrics:

  • If deployment frequency becomes a target, teams deploy empty or trivial changes to inflate the number
  • If change failure rate becomes a target, teams stop deploying risky (but necessary) changes
  • If lead time becomes a target, teams skip important reviews and testing
  • If MTTR becomes a target, incidents get “resolved” prematurely

DORA metrics should be used as diagnostic tools — they tell you where to investigate and improve. They should never be tied to individual performance reviews, bonuses, or promotions. Use them to identify systemic issues (slow pipeline, inadequate testing) rather than to judge people.

How would you design deployment-aware alerting to avoid false positives during rollouts?

Show Answer

Three complementary approaches:

  1. Suppression during deployment: Use Prometheus unless clauses to suppress alerts when a deployment is in progress (detected by replica count mismatch). Alerts only fire after the deployment completes and metrics stabilize.

  2. Rate-of-change alerts instead of absolute thresholds: Instead of “error rate > 0.5%”, alert on “error rate is 5x the baseline from 1 hour ago.” This ignores expected deployment noise while catching genuine spikes.

  3. Separate deployment alert channel: Route deployment-time alerts to a dedicated channel monitored by the deployer, keeping the main on-call channel free from deployment noise. After deployment completes, normal alerting resumes.

The goal is to prevent alert fatigue — if every deployment generates false alerts, the team will ignore real alerts during deployments, which is the most dangerous time.


Hands-On Exercise: Build a Release Health Dashboard in Grafana

Section titled “Hands-On Exercise: Build a Release Health Dashboard in Grafana”

Create a Grafana dashboard that correlates deployments with error rates, demonstrating how deployment markers make incident diagnosis instant.

Terminal window
# Create cluster
kind create cluster --name metrics-lab
# Install Prometheus and Grafana
kubectl create namespace monitoring
# Deploy Prometheus
cat <<'PROMEOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- name: prometheus
image: prom/prometheus:v2.53.0
ports:
- containerPort: 9090
args:
- "--config.file=/etc/prometheus/prometheus.yml"
volumeMounts:
- name: config
mountPath: /etc/prometheus
volumes:
- name: config
configMap:
name: prometheus-config
---
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
spec:
selector:
app: prometheus
ports:
- port: 9090
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 10s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: (.+)
replacement: ${1}:9090
PROMEOF
# Deploy Grafana
cat <<'GRAFEOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:11.1.0
ports:
- containerPort: 3000
env:
- name: GF_SECURITY_ADMIN_PASSWORD
value: admin
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "true"
---
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: monitoring
spec:
selector:
app: grafana
ports:
- port: 3000
GRAFEOF
kubectl -n monitoring rollout status deployment prometheus --timeout=90s
kubectl -n monitoring rollout status deployment grafana --timeout=90s

Step 1: Port-Forward Grafana and Prometheus

Section titled “Step 1: Port-Forward Grafana and Prometheus”
Terminal window
kubectl -n monitoring port-forward svc/grafana 3000:3000 &
kubectl -n monitoring port-forward svc/prometheus 9090:9090 &
sleep 3

Step 2: Configure Prometheus Data Source in Grafana

Section titled “Step 2: Configure Prometheus Data Source in Grafana”
Terminal window
# Add Prometheus as data source
curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
-H "Content-Type: application/json" \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://prometheus.monitoring:9090",
"access": "proxy",
"isDefault": true
}' | jq .

Step 3: Create the Release Health Dashboard

Section titled “Step 3: Create the Release Health Dashboard”
Terminal window
# Create a dashboard with deployment annotations
curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d '{
"dashboard": {
"title": "Release Health Dashboard",
"tags": ["release-engineering"],
"timezone": "browser",
"annotations": {
"list": [
{
"name": "Deployments",
"datasource": "-- Grafana --",
"enable": true,
"iconColor": "rgba(255, 96, 96, 1)",
"tags": ["deployment"]
}
]
},
"panels": [
{
"title": "Pod Restarts (Simulated Error Rate)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(kube_pod_container_status_restarts_total[5m])) by (namespace)",
"legendFormat": "{{namespace}}"
}
]
},
{
"title": "Running Pods by Namespace",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"datasource": "Prometheus",
"targets": [
{
"expr": "count(kube_pod_status_phase{phase=\"Running\"}) by (namespace)",
"legendFormat": "{{namespace}}"
}
]
},
{
"title": "DORA Metrics (Placeholder)",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 16},
"options": {
"textMode": "value_and_name"
},
"targets": [
{
"expr": "vector(8)",
"legendFormat": "Deploy Freq (per day)"
}
]
},
{
"title": "Change Failure Rate",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 16},
"targets": [
{
"expr": "vector(0.032)",
"legendFormat": "CFR"
}
],
"fieldConfig": {
"defaults": {
"unit": "percentunit"
}
}
}
]
},
"overwrite": true
}' | jq '.uid'

Step 4: Simulate a Deployment and Add an Annotation

Section titled “Step 4: Simulate a Deployment and Add an Annotation”
Terminal window
# Deploy a sample app
kubectl create deployment webapp --image=hashicorp/http-echo:0.2.3 -- -text="v1" -listen=:8080
kubectl rollout status deployment webapp
# Add deployment annotation to Grafana
curl -s -X POST http://admin:admin@localhost:3000/api/annotations \
-H "Content-Type: application/json" \
-d '{
"time": '$(date +%s000)',
"tags": ["deployment"],
"text": "Deployed webapp v1.0.0"
}' | jq .
echo "First deployment marker added."
sleep 10
# Simulate a new deployment
kubectl set image deployment/webapp http-echo=hashicorp/http-echo:0.2.3
# Add another annotation
curl -s -X POST http://admin:admin@localhost:3000/api/annotations \
-H "Content-Type: application/json" \
-d '{
"time": '$(date +%s000)',
"tags": ["deployment"],
"text": "Deployed webapp v2.0.0 - NEW VERSION"
}' | jq .
echo "Second deployment marker added."

Open your browser and navigate to:

http://localhost:3000/dashboards

Log in with admin / admin. Find the “Release Health Dashboard.”

You should see:

  • Time series panels showing Kubernetes metrics
  • Red vertical lines (deployment annotations) marking when each deployment happened
  • Stat panels showing placeholder DORA metrics

Step 6: Add More Annotations to Simulate a Deployment History

Section titled “Step 6: Add More Annotations to Simulate a Deployment History”
Terminal window
# Simulate a history of deployments
for i in $(seq 1 5); do
OFFSET=$((i * 600)) # Every 10 minutes back
TIMESTAMP=$(( $(date +%s) - OFFSET ))
curl -s -X POST http://admin:admin@localhost:3000/api/annotations \
-H "Content-Type: application/json" \
-d "{
\"time\": ${TIMESTAMP}000,
\"tags\": [\"deployment\"],
\"text\": \"Deployed webapp v1.${i}.0\"
}" | jq .status
done
echo "Deployment history annotations added. Refresh your dashboard."
Terminal window
kill %1 %2 2>/dev/null
kind delete cluster --name metrics-lab

You have completed this exercise when you can confirm:

  • Grafana is running with Prometheus as a data source
  • The Release Health Dashboard shows time-series panels with Kubernetes metrics
  • Deployment annotations (red vertical lines) are visible on the dashboard
  • You can see the correlation between deployment events and metric changes
  • You understand how CI/CD pipelines would push these annotations automatically
  • You can explain why deployment markers reduce Mean Time to Diagnose

  1. DORA metrics predict organizational performance — deployment frequency, lead time, change failure rate, and MTTR are the four metrics that matter most
  2. Speed and stability reinforce each other — elite teams deploy more often AND fail less, because small changes are inherently less risky
  3. Deployment markers are the highest-ROI observability investment — connecting your deployment system to your monitoring system eliminates the “what changed?” investigation
  4. Error budgets should gate deployment risk, not deployment itself — low budget means more caution, not a freeze
  5. Deployment freezes often backfire — they accumulate changes, making the post-freeze deployment riskier
  6. Deployment-aware alerting prevents alert fatigue — suppress or adjust alerts during deployments to avoid crying wolf
  7. Use DORA metrics as diagnostics, not performance targets — Goodhart’s Law will destroy any metric used to judge individuals

Books:

  • “Accelerate” — Nicole Forsgren, Jez Humble, Gene Kim (the definitive DORA research)
  • “The DevOps Handbook” — Gene Kim, Jez Humble, Patrick Debois (deployment pipeline optimization)
  • “Observability Engineering” — Charity Majors, Liz Fong-Jones, George Miranda

Reports:

  • DORA State of DevOps Report — dora.dev (annual, free)
  • Puppet State of DevOps Report — puppet.com/resources

Articles:

  • “Are You an Elite DevOps Performer?” — DORA (dora.dev/research)
  • “The Four Keys” — Google Cloud Blog (implementing DORA metrics)
  • “Deployment Markers: The Most Important Dashboard Feature” — Charity Majors

Tools:

  • Four Keys — github.com/dora-team/fourkeys (open-source DORA metrics dashboard)
  • Sleuth — sleuth.io (DORA metrics tracking platform)
  • LinearB — linearb.io (engineering metrics)

Release engineering metrics transform deployment from an act of faith into a measurable, improvable process. The DORA four key metrics — deployment frequency, lead time, change failure rate, and MTTR — provide a scientific framework for measuring release performance. Combined with deployment markers on dashboards, error budget gating, and deployment-aware alerting, these practices give you visibility into whether your release engineering investments are paying off. The ultimate goal is not just measuring — it is creating a culture where releasing software is boring, predictable, and safe.


You have completed the Release Engineering discipline fundamentals. Continue to the Release Engineering Overview for a summary of all modules and suggested next steps, including how to integrate release engineering with SRE, GitOps, and Platform Engineering practices.


“What gets measured gets managed — but only if you measure the right things for the right reasons.” — Adapted from Peter Drucker