Module 2.2: Argo Rollouts
Toolkit Track | Complexity: COMPLEX | Time: 45-50 min
The release manager hit “deploy” and watched the metrics dashboard. Three hundred microservices. Forty million users. No room for error. Within 90 seconds, the p99 latency spiked from 200ms to 3.2 seconds. Customer complaints flooded the support queue. But something remarkable happened: no humans intervened. The Argo Rollouts analysis detected the latency anomaly at 10% traffic, automatically aborted the canary, and rolled back to stable. Total user impact: 4 million requests slightly degraded, zero failed transactions. The bad deploy that would have cost the streaming platform $12 million in subscriber churn was stopped by a YAML file and a Prometheus query. The release manager exhaled, then smiled: “Progressive delivery just paid for itself.”
Prerequisites
Before starting this module:
- Module 2.1: ArgoCD — GitOps fundamentals
- GitOps Discipline — Deployment concepts
- Understanding of Kubernetes Deployments and Services
- Basic networking concepts (traffic splitting)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Configure canary deployments with automated analysis using Prometheus metrics queries
- Implement blue-green deployments with traffic management and automated promotion criteria
- Integrate Argo Rollouts with service meshes and ingress controllers for traffic splitting
- Evaluate progressive delivery strategies and select appropriate rollout patterns for different risk profiles
Why This Module Matters
Kubernetes Deployments use rolling updates by default—gradually replacing old pods with new ones. But rolling updates can’t answer: “Is this new version actually better?” They blindly proceed until all pods are replaced.
Argo Rollouts enables progressive delivery: canary deployments, blue-green switches, and automated rollbacks based on metrics. You can deploy to 10% of traffic, verify metrics look good, then automatically promote to 100%—or roll back if they don’t.
Did You Know?
- Argo Rollouts was born from Intuit’s frustration with Kubernetes Deployments—they needed a way to safely deploy thousands of times per day
- The canary deployment pattern is named after canaries in coal mines—miners brought canaries underground; if the canary died, the air was toxic
- Netflix pioneered automated canary analysis—their Kayenta system inspired Argo Rollouts’ analysis features
- Blue-green deployments can double your resource usage—you need capacity for both versions simultaneously
Rollout Strategies
Rolling Update vs. Progressive Delivery
```text
ROLLING UPDATE (Native Kubernetes)
─────────────────────────────────────────────────────────────────
Time ──────────────────────────────────────────────────────────▶

Pods: [v1][v1][v1][v1][v1]
      [v1][v1][v1][v1][v2]  → 1 pod replaced
      [v1][v1][v1][v2][v2]  → 2 pods replaced
      [v1][v1][v2][v2][v2]  → 3 pods replaced
      [v1][v2][v2][v2][v2]  → 4 pods replaced
      [v2][v2][v2][v2][v2]  → Done!

Traffic:  No control - pods receive traffic as soon as ready
Rollback: Must wait for a new rolling update
Analysis: None - hope for the best
─────────────────────────────────────────────────────────────────

CANARY (Argo Rollouts)
─────────────────────────────────────────────────────────────────
Time ──────────────────────────────────────────────────────────▶

Pods: [v1][v1][v1][v1][v1]
      [v1][v1][v1][v1][v1] + [v2]  → Canary pod added

Traffic: v1 (90%) ─────────────────────────────────────────────▶
         v2 (10%) ─────────────────────────────────────────────▶

Analysis: Is error rate OK? Is latency OK?
          ├── Yes: Increase to 50%, then 100%
          └── No:  Roll back immediately, alert on-call

Result: Bad versions never reach more than 10% of users
```

Blue-Green Strategy
```text
BLUE-GREEN DEPLOYMENT
─────────────────────────────────────────────────────────────────
BEFORE:
  BLUE (Active)           GREEN (Inactive)
  ┌─────────────────┐     ┌─────────────────┐
  │ v1 pods         │     │ (empty)         │
  │ [v1][v1][v1]    │     │                 │
  └────────┬────────┘     └─────────────────┘
           │
           ▼
      ┌─────────┐
      │ Service │ ──────▶ 100% traffic
      └─────────┘

AFTER DEPLOYMENT:
  BLUE (Inactive)         GREEN (Active)
  ┌─────────────────┐     ┌─────────────────┐
  │ v1 pods         │     │ v2 pods         │
  │ [v1][v1][v1]    │     │ [v2][v2][v2]    │
  └─────────────────┘     └────────┬────────┘
       │                           │
       │                           ▼
       │                      ┌─────────┐
       │                      │ Service │ ──────▶ 100% traffic
       │                      └─────────┘
       └── (kept for instant rollback)
```

Installing Argo Rollouts
```bash
# Install controller
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install kubectl plugin
brew install argoproj/tap/kubectl-argo-rollouts   # macOS
# or
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts

# Verify
kubectl argo rollouts version
```

Canary Rollouts
Basic Canary
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: myapp:v2
          ports:
            - containerPort: 8080
  strategy:
    canary:
      # Traffic split steps
      steps:
        - setWeight: 10
        - pause: {duration: 5m}   # Wait 5 minutes
        - setWeight: 30
        - pause: {duration: 5m}
        - setWeight: 60
        - pause: {duration: 5m}
        # 100% happens automatically after the last step

      # Traffic routing (for service mesh / ingress)
      canaryService: my-app-canary
      stableService: my-app-stable

      # Analysis at each step
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: my-app-canary
```

Canary with Traffic Management
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: myapp:v2
  strategy:
    canary:
      canaryService: my-app-canary
      stableService: my-app-stable

      trafficRouting:
        # NGINX Ingress
        nginx:
          stableIngress: my-app-ingress
          annotationPrefix: nginx.ingress.kubernetes.io
        # OR Istio
        # istio:
        #   virtualService:
        #     name: my-app-vs
        # OR AWS ALB
        # alb:
        #   ingress: my-app-ingress
        #   servicePort: 80

      steps:
        - setWeight: 10
        - pause: {duration: 2m}
        - setWeight: 30
        - pause: {duration: 2m}
        - setWeight: 60
        - pause: {duration: 2m}
---
# Services for traffic splitting
apiVersion: v1
kind: Service
metadata:
  name: my-app-stable
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-canary
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```

Canary with Manual Approval
```yaml
strategy:
  canary:
    steps:
      - setWeight: 10
      - pause: {}               # Infinite pause - requires manual promotion
      - setWeight: 50
      - pause: {duration: 5m}
      - setWeight: 100
```

```bash
# Check rollout status
kubectl argo rollouts get rollout my-app

# Promote past the pause
kubectl argo rollouts promote my-app

# Or abort and roll back
kubectl argo rollouts abort my-app
```

Blue-Green Rollouts
Basic Blue-Green
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: myapp:v2
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview

      # Wait for analysis before promotion
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: service-name
            value: my-app-preview

      # Require manual approval
      autoPromotionEnabled: false

      # Keep old ReplicaSet for quick rollback
      scaleDownDelaySeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-active
spec:
  selector:
    app: my-app
  ports:
    - port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview
spec:
  selector:
    app: my-app
  ports:
    - port: 80
```

Blue-Green with Automatic Promotion
```yaml
strategy:
  blueGreen:
    activeService: my-app-active
    previewService: my-app-preview

    # Auto-promote after preview is ready
    autoPromotionEnabled: true
    autoPromotionSeconds: 60   # Wait 60s after ready

    # Analysis before switching
    prePromotionAnalysis:
      templates:
        - templateName: smoke-tests

    # Analysis after switching
    postPromotionAnalysis:
      templates:
        - templateName: success-rate
      args:
        - name: duration
          value: "5m"
```

Analysis Templates
Prometheus Metrics Analysis
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
    - name: threshold
      value: "0.95"   # 95% success rate
  metrics:
    - name: success-rate
      interval: 1m
      count: 5        # Run 5 times
      successCondition: result[0] >= {{args.threshold}}
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}",
                status=~"2.."
              }[1m]
            ))
            /
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}"
              }[1m]
            ))
```

Latency Analysis
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  args:
    - name: service-name
    - name: percentile
      value: "0.99"
    - name: threshold-ms
      value: "500"
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < {{args.threshold-ms}}
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(
              {{args.percentile}},
              sum(rate(
                http_request_duration_seconds_bucket{
                  service="{{args.service-name}}"
                }[2m]
              )) by (le)
            ) * 1000
```

Web Hook Analysis (Custom Checks)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: custom-check
spec:
  args:
    - name: canary-hash
  metrics:
    - name: integration-tests
      successCondition: result.passed == true
      failureLimit: 1
      provider:
        web:
          url: https://ci.example.com/api/test
          method: POST
          headers:
            - key: Content-Type
              value: application/json
          body: |
            {
              "pod_hash": "{{args.canary-hash}}",
              "test_suite": "smoke"
            }
          jsonPath: "{$.result}"
```

Kayenta Analysis (Automated Canary Analysis)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: kayenta-analysis
spec:
  args:
    - name: start-time
    - name: end-time
  metrics:
    - name: kayenta
      provider:
        kayenta:
          address: http://kayenta.monitoring:8090
          application: my-app
          canaryConfigName: my-canary-config
          metricsAccountName: prometheus-account
          storageAccountName: gcs-account
          threshold:
            pass: 95
            marginal: 75
          scopes:
            - name: default
              controlScope:
                scope: production
                start: "{{args.start-time}}"
                end: "{{args.end-time}}"
              experimentScope:
                scope: canary
                start: "{{args.start-time}}"
                end: "{{args.end-time}}"
```

Analysis Runs
Inline Analysis
```yaml
strategy:
  canary:
    steps:
      - setWeight: 20
      - pause: {duration: 2m}

      # Run analysis at this step
      - analysis:
          templates:
            - templateName: success-rate
          args:
            - name: service-name
              value: my-app-canary

      - setWeight: 50
      - pause: {duration: 2m}
      - setWeight: 100
```

Background Analysis
```yaml
strategy:
  canary:
    analysis:
      # Start analysis after first step
      startingStep: 1
      templates:
        - templateName: success-rate
        - templateName: latency-check
      args:
        - name: service-name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
```

Analysis with Dry-Run
```yaml
strategy:
  canary:
    analysis:
      templates:
        - templateName: success-rate
      # Don't fail the rollout, just report
      dryRun:
        - metricName: success-rate
```

Experiments
A/B Testing
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: homepage-experiment
spec:
  duration: 1h
  templates:
    - name: control
      replicas: 2
      selector:
        matchLabels:
          app: homepage
          variant: control
      template:
        metadata:
          labels:
            app: homepage
            variant: control
        spec:
          containers:
            - name: app
              image: homepage:v1
    - name: experiment
      replicas: 2
      selector:
        matchLabels:
          app: homepage
          variant: experiment
      template:
        metadata:
          labels:
            app: homepage
            variant: experiment
        spec:
          containers:
            - name: app
              image: homepage:v2-new-design
  analyses:
    - name: conversion-rate
      templateName: conversion-analysis
      args:
        - name: control-service
          value: homepage-control
        - name: experiment-service
          value: homepage-experiment
```

Experiment as Part of Rollout
```yaml
strategy:
  canary:
    steps:
      - setWeight: 0

      # Run experiment before any traffic
      - experiment:
          duration: 30m
          templates:
            - name: experiment
              specRef: canary
              replicas: 2
          analyses:
            - name: smoke-test
              templateName: smoke-tests

      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 100
```

Observing Rollouts
CLI Commands
```bash
# Watch rollout in real-time
kubectl argo rollouts get rollout my-app --watch

# See rollout history
kubectl argo rollouts history rollout my-app

# Get detailed status
kubectl argo rollouts status my-app

# List all rollouts
kubectl argo rollouts list rollouts

# Dashboard (web UI)
kubectl argo rollouts dashboard
```

Rollout Status
```text
NAME                             KIND     STATUS        AGE
my-app                           Rollout  ✔ Healthy     5d
├──# revision:3
│  └──⧫ my-app-7f8b9c6d4-xxxxx  Pod      ✔ Running     1h
│  └──⧫ my-app-7f8b9c6d4-yyyyy  Pod      ✔ Running     1h
│  └──⧫ my-app-7f8b9c6d4-zzzzz  Pod      ✔ Running     1h
├──# revision:2
│  └──⧫ my-app-5f7b8c5d3-aaaaa  Pod      ◌ ScaledDown  2h
└──# revision:1
   └──⧫ my-app-4f6b7c4d2-bbbbb  Pod      ◌ ScaledDown  3d
```

Notifications
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  annotations:
    notifications.argoproj.io/subscribe.on-rollout-completed.slack: rollouts-channel
    notifications.argoproj.io/subscribe.on-rollout-step-completed.slack: rollouts-channel
    notifications.argoproj.io/subscribe.on-analysis-run-failed.slack: rollouts-alerts
spec:
  # ...
```

Common Mistakes
| Mistake | Why It’s Bad | Better Approach |
|---|---|---|
| No analysis templates | Blind deployment, no safety | Always add success-rate and latency analysis |
| Too aggressive steps | Problems hit many users | Start at 5-10%, pause longer at each step |
| Ignoring canary metrics | Analysis passes but users suffer | Include business metrics, not just infrastructure |
| No scaleDownDelay | Instant rollback impossible | Keep old version for 30-60 seconds minimum |
| Same replica count | Canary gets equal load despite traffic | Scale canary based on traffic weight |
| Manual promotion in prod | Human bottleneck, slow deployments | Use automated analysis for well-understood services |
War Story: The $8.3 Million Deployment That Took 90 Seconds to Stop
```text
┌─────────────────────────────────────────────────────────────────┐
│ THE $8.3 MILLION DEPLOYMENT THAT TOOK 90 SECONDS TO STOP        │
│ ─────────────────────────────────────────────────────────────── │
│ Company: Global food delivery platform                          │
│ Scale: 15M daily orders, 850 restaurants per minute             │
│ The crisis: Memory leak shipped to production Friday evening    │
└─────────────────────────────────────────────────────────────────┘
```

Friday, 6:47 PM - The Deploy
The order-service team merged a “small refactor” that passed all unit tests and staging validation. The code had a memory leak—objects allocated in a hot path but never garbage collected. In staging with 1% production traffic, it took 6 hours to manifest. In production, the leak would compound to OOM kills within 15 minutes.
Before Argo Rollouts (The Old World)
The team’s previous incident, 8 months earlier, had played out like this:
```text
PREVIOUS INCIDENT - WITHOUT PROGRESSIVE DELIVERY
─────────────────────────────────────────────────────────────────
18:47  Deploy started (Kubernetes rolling update)
18:49  100% traffic on new version (maxSurge, maxUnavailable)
19:02  First OOM kill (dismissed as transient)
19:15  15 pods OOM killed, orders failing
19:23  On-call paged, starts investigation
19:35  Root cause identified: memory leak
19:40  Rollback initiated
19:47  Rollback complete, but...
19:47  → Database connection pool exhausted (thundering herd)
20:15  Full recovery

Total impact: 28 minutes @ $17,000/minute = $476,000
Plus: SLA violations, restaurant refunds, customer credits
Total incident cost: $1.2 million
```

With Argo Rollouts (The New World)
After that incident, the platform team implemented Argo Rollouts with automated analysis:
```yaml
# The rollout configuration that saved them
strategy:
  canary:
    steps:
      - setWeight: 5            # Start with 5% traffic
      - pause: {duration: 5m}   # Watch for 5 minutes
      - analysis:
          templates:
            - templateName: memory-stability
      - setWeight: 25
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: memory-stability
spec:
  metrics:
    - name: memory-growth-rate
      interval: 1m
      count: 5
      # Fail if memory grows more than 10% in 5 minutes
      successCondition: result[0] < 0.1
      provider:
        prometheus:
          query: |
            (
              avg(container_memory_working_set_bytes{pod=~"order-service-canary.*"})
              -
              avg(container_memory_working_set_bytes{pod=~"order-service-canary.*"} offset 5m)
            )
            /
            avg(container_memory_working_set_bytes{pod=~"order-service-canary.*"} offset 5m)
```

The Timeline That Saved Millions
```text
FRIDAY 6:47 PM - WITH ARGO ROLLOUTS
─────────────────────────────────────────────────────────────────
18:47:00  Rollout started
18:47:05  Canary pods created (5% traffic)
18:48:00  Memory baseline: 256MB per pod
18:50:00  Memory trending: 312MB (normal startup)
18:52:00  Memory trending: 489MB (⚠️ growing fast)
18:52:30  Analysis check 1: Growth rate 91% - FAIL
18:52:31  Analysis status: Failed
18:52:32  Rollout aborted automatically
18:52:35  Canary pods terminating
18:52:40  100% traffic back to stable

Total time exposed: 5 minutes 40 seconds
Traffic affected: 5% = ~2,500 orders
Failed orders: 0 (caught before OOM)
```

The Financial Math
```text
COST COMPARISON
─────────────────────────────────────────────────────────────────
WITHOUT ARGO ROLLOUTS:
─────────────────────────────────────────────────────────────────
Downtime:                28 minutes
Revenue loss:            28 × $17,000 = $476,000
SLA violations:          $320,000
Restaurant compensation: $180,000
Customer credits:        $120,000
Engineering overtime:    $45,000
Reputation damage:       Immeasurable
─────────────────────────────────────────────────────────────────
Total:                   $1,141,000+

WITH ARGO ROLLOUTS:
─────────────────────────────────────────────────────────────────
Canary exposure:         5.7 minutes at 5% traffic
Revenue loss:            ~$4,800 (delayed orders, not lost)
Customer impact:         2,500 slightly delayed orders
SLA violations:          $0 (within tolerance)
Engineering response:    $0 (automatic)
─────────────────────────────────────────────────────────────────
Total:                   <$5,000

SAVINGS PER INCIDENT: $1,136,000+
```

Why Memory Analysis Caught It
The key insight: memory leaks are progressive. They don’t fail immediately—they compound. Traditional health checks (readiness probes) don’t catch memory leaks because pods stay “healthy” until they suddenly aren’t.
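The rate-of-change check is simple arithmetic. A minimal sketch, using the illustrative numbers from this story (256 MB baseline; 262 MB stable vs. 520 MB leaky at t=5 min) and the 10% threshold from the memory-stability template:

```python
# Sketch of the rate-of-change check the memory-stability analysis performs.
# Numbers follow the story's memory trajectory (illustrative, not measured data).

def growth_rate(baseline_bytes: float, current_bytes: float) -> float:
    """Fractional memory growth relative to the baseline reading 5 minutes earlier."""
    return (current_bytes - baseline_bytes) / baseline_bytes

THRESHOLD = 0.1  # fail if memory grew more than 10% over the window

stable = growth_rate(256, 262)  # healthy version: ~2.3% growth
leaky = growth_rate(256, 520)   # leaky version: ~103% growth

print(f"stable: {stable:.3f} -> {'FAIL' if stable > THRESHOLD else 'PASS'}")
print(f"leaky:  {leaky:.3f} -> {'FAIL' if leaky > THRESHOLD else 'PASS'}")
```

Note that an absolute-memory threshold (say, "fail above 800 MB") would still have passed at t=5 min; the relative growth rate is what fails early.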
```text
MEMORY TRAJECTORY COMPARISON
─────────────────────────────────────────────────────────────────
                 Stable Version    Leaky Version
─────────────────────────────────────────────────────────────────
t=0 (startup)    256 MB            256 MB
t=2 min          260 MB            340 MB    ← Diverging
t=5 min          262 MB            520 MB    ← Analysis fails here
t=10 min         265 MB            890 MB
t=15 min         268 MB            OOM KILL  ← Would have failed here
```

Key Lessons
- Analysis timing matters: Memory leak detection needs at least 5 minutes of data
- Rate of change, not absolute values: Looking at growth rate catches leaks before OOM
- 5% is your friend: Start small, fail small
- Automated response is faster: Machines detect and act in seconds, humans take minutes
- The analysis pays for itself: One prevented incident justifies the implementation effort
Question 1
What’s the key difference between canary and blue-green deployments?
Show Answer
Canary: Gradually shifts traffic from old to new (e.g., 10% → 30% → 60% → 100%). Both versions run simultaneously with controlled traffic split. Good for: detecting problems with minimal user impact.
Blue-Green: Maintains two complete environments. Traffic switches 100% at once (0% → 100%). Good for: instant rollback, testing full environment before switch.
Trade-offs:
- Canary uses fewer resources (one set of pods scaled up/down)
- Blue-Green requires double capacity but offers instant rollback
- Canary detects issues gradually; blue-green is all-or-nothing
Question 2
Your analysis template uses count: 3 and interval: 1m. How long will the analysis run before passing?
Show Answer
About 2 minutes of elapsed time (3 runs spaced 1 minute apart, with the first firing immediately).
The analysis runs every minute for 3 iterations:
- t=0: First measurement
- t=1m: Second measurement
- t=2m: Third measurement
- t=2m+: Analysis completes if all passed
If failed measurements reach the failureLimit, the analysis fails immediately. Otherwise, it keeps measuring until all count runs have completed.
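The timing can be reduced to one formula. A sketch of the reasoning, under the assumption (matching the timeline above) that the first measurement fires at t=0 and query latency is negligible:

```python
# Minimum elapsed time for an analysis metric to complete all its runs,
# assuming the first measurement fires immediately at t=0 and each
# subsequent one fires `interval_seconds` after the previous.

def min_analysis_seconds(count: int, interval_seconds: int) -> int:
    """Earliest moment the last of `count` measurements can fire."""
    return (count - 1) * interval_seconds

# count: 3, interval: 1m -> measurements at t=0s, 60s, 120s
print(min_analysis_seconds(3, 60))  # 120 seconds (~2 minutes) before it can pass
```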
Question 3
Write a PromQL query for analysis that checks if error rate is below 1% for a canary service.
Show Answer
```yaml
query: |
  sum(rate(
    http_requests_total{
      service="{{args.service-name}}",
      status=~"5.."
    }[2m]
  ))
  /
  sum(rate(
    http_requests_total{
      service="{{args.service-name}}"
    }[2m]
  ))
  < 0.01
```

```yaml
# Or as successCondition:
successCondition: result[0] < 0.01
query: |
  sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[2m]))
  /
  sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```

Key points:

- Use rate() for per-second rates
- The time range (2m) should be about 2x the analysis interval
- Compare 5xx status codes to total requests
- A threshold of 0.01 = 1%
Question 4
Your rollout is stuck in “Paused” state. What commands would you use to investigate and resolve?
Show Answer
```bash
# See detailed status and reason for pause
kubectl argo rollouts get rollout my-app

# Check analysis runs
kubectl argo rollouts get rollout my-app --watch

# If analysis failed, check why
kubectl get analysisruns -l rollout=my-app

# View analysis run details
kubectl describe analysisrun <name>

# Options to resolve:
# 1. If pause is intentional (manual gate):
kubectl argo rollouts promote my-app

# 2. If analysis failed, fix and retry:
kubectl argo rollouts retry rollout my-app

# 3. If you want to abort and roll back:
kubectl argo rollouts abort my-app

# 4. Force to stable version:
kubectl argo rollouts undo my-app
```

Question 5
You’re using a canary strategy with NGINX Ingress for traffic splitting. Your canary is at 30% but monitoring shows it’s receiving 50% of traffic. What’s wrong?
Show Answer
Common causes for traffic split mismatch:

1. Pod ratio vs traffic weight: Without a traffic router, Argo Rollouts scales pods proportionally. With 5 replicas at 30% canary:
   - Stable: 3-4 pods
   - Canary: 1-2 pods
   - Kubernetes round-robin = ~30-40% traffic

   But if HPA or manual scaling changed pod counts, the ratio shifts.

2. Missing ingress configuration: NGINX traffic splitting requires the correct trafficRouting block:

   ```yaml
   trafficRouting:
     nginx:
       stableIngress: my-app-ingress
       annotationPrefix: nginx.ingress.kubernetes.io
   ```

   Without it, traffic routes to both services equally.

3. Session affinity: If sticky sessions are enabled, returning users always hit the same version, skewing observed percentages.

4. Health check traffic: Kubernetes probes hit all pods equally, inflating canary traffic in metrics.
Debug steps:
```bash
# Check ingress annotations
kubectl get ingress my-app-ingress -o yaml | grep -A5 annotations

# Verify canary service selector
kubectl get svc my-app-canary -o yaml

# Check rollout's view of traffic
kubectl argo rollouts get rollout my-app
```

Question 6
Design an analysis template that checks THREE conditions: error rate < 1%, p99 latency < 500ms, AND successful health checks. All must pass.
Show Answer
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: comprehensive-check
spec:
  args:
    - name: service-name
    - name: namespace
  metrics:
    # Check 1: Error rate < 1%
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              namespace="{{args.namespace}}",
              status=~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              namespace="{{args.namespace}}"
            }[2m]))

    # Check 2: P99 latency < 500ms
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 500
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_ms_bucket{
                service="{{args.service-name}}",
                namespace="{{args.namespace}}"
              }[2m])) by (le)
            )

    # Check 3: Health endpoint returns 200
    - name: health-check
      interval: 30s
      count: 10
      successCondition: result.status == "200"
      failureLimit: 1
      provider:
        web:
          url: "http://{{args.service-name}}.{{args.namespace}}.svc.cluster.local/health"
          method: GET
          jsonPath: "{$.status}"
```

All three metrics run in parallel. The analysis passes only if ALL metrics succeed within their failure limits.
Question 7
Your company requires that production deployments be approved by a team lead before reaching 50% traffic. How do you configure this in Argo Rollouts?
Show Answer
Use an infinite pause at the approval checkpoint:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    canary:
      steps:
        # Automated: Start canary
        - setWeight: 10
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate

        # Automated: If analysis passes, go to 30%
        - setWeight: 30
        - pause: {duration: 10m}

        # MANUAL APPROVAL REQUIRED
        - pause: {}   # ← Infinite pause

        # After approval: Continue to 50%+
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 75
        - pause: {duration: 5m}
        - setWeight: 100
```

Approval workflow:

```bash
# 1. Rollout reaches 30% and pauses
kubectl argo rollouts get rollout my-app
# Status: Paused - CanaryPauseStep

# 2. Team lead reviews metrics, approves
kubectl argo rollouts promote my-app

# 3. Rollout continues to 50%+
```

Alternative: notifications + manual gate:

```yaml
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-rollout-step-completed.slack: approvals-channel
```

This posts to Slack when the pause is reached, alerting approvers.
Question 8
Calculate the blast radius for a canary deployment with these parameters: 10,000 requests/second, 10% canary weight, 5-minute analysis interval, and analysis fails on 3rd check. How many requests hit the bad version?
Show Answer
Calculation:

```text
BLAST RADIUS CALCULATION
─────────────────────────────────────────────────────────────────
Total traffic:   10,000 req/s
Canary weight:   10%
Canary traffic:  1,000 req/s

Analysis configuration:
- interval: 1m (assumed)
- count: 5 (5 checks to pass)
- failureLimit: 3 (assumed)

Timeline to failure:
- Check 1 (t=1m): Pass
- Check 2 (t=2m): Pass
- Check 3 (t=3m): FAIL
- Check 4 (t=4m): FAIL
- Check 5 (t=5m): FAIL  ← Analysis fails, rollback triggered

Time at canary weight: ~5 minutes
Requests to canary: 1,000 req/s × 300 seconds = 300,000 requests

BLAST RADIUS: 300,000 requests (10% of the 5-minute total)
```

Compare to a rolling update:

```text
ROLLING UPDATE (NO CANARY)
─────────────────────────────────────────────────────────────────
Time to 100%:     ~2 minutes (typical rolling update)
Time to detect:   +5 minutes (alert fires)
Time to rollback: +3 minutes (human response + rollback)

Total exposure: 10 minutes at 100% traffic
Requests affected: 10,000 × 600 = 6,000,000 requests

Rolling update blast radius: 6,000,000 requests
Canary blast radius:         300,000 requests

RISK REDUCTION: 95%
```

Key insight: Canary at 10% with 5-minute analysis exposes 20× fewer users than a rolling update with the same detection time.
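The arithmetic generalizes to a single formula. A quick sketch using the question's numbers (the exposure times are the same assumptions as above):

```python
# Blast radius = traffic share of the bad version × request rate × time exposed.

def blast_radius(total_rps: float, traffic_fraction: float, exposure_seconds: float) -> float:
    """Requests served by the bad version before rollback completes."""
    return total_rps * traffic_fraction * exposure_seconds

canary = blast_radius(10_000, 0.10, 5 * 60)    # 10% canary, ~5 min to auto-abort
rolling = blast_radius(10_000, 1.00, 10 * 60)  # 100% traffic, ~10 min human response

print(f"canary:  {canary:,.0f} requests")             # 300,000
print(f"rolling: {rolling:,.0f} requests")            # 6,000,000
print(f"risk reduction: {1 - canary / rolling:.0%}")  # 95%
```

Plugging in your own service's rate, canary weight, and detection time gives a concrete number to justify (or tune) the rollout steps.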
Hands-On Exercise
Scenario: Progressive Delivery Pipeline
Implement a canary deployment with automated analysis.
```bash
# Create kind cluster
kind create cluster --name rollouts-lab

# Install Argo Rollouts
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install Prometheus for analysis
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.enabled=false

# Wait for components
kubectl -n argo-rollouts wait --for=condition=ready pod -l app.kubernetes.io/name=argo-rollouts --timeout=120s
```

Deploy Demo Application
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-rollout
spec:
  replicas: 5
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: demo
          image: argoproj/rollouts-demo:blue
          ports:
            - containerPort: 8080
  strategy:
    canary:
      canaryService: demo-canary
      stableService: demo-stable
      steps:
        - setWeight: 20
        - pause: {duration: 30s}
        - setWeight: 50
        - pause: {duration: 30s}
        - setWeight: 80
        - pause: {duration: 30s}
---
apiVersion: v1
kind: Service
metadata:
  name: demo-stable
spec:
  selector:
    app: demo
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: demo-canary
spec:
  selector:
    app: demo
  ports:
    - port: 80
      targetPort: 8080
```

```bash
kubectl apply -f rollout.yaml

# Watch the rollout
kubectl argo rollouts get rollout demo-rollout --watch
```

Trigger a New Release
```bash
# Update to new image (yellow version)
kubectl argo rollouts set image demo-rollout demo=argoproj/rollouts-demo:yellow

# Watch the canary progress
kubectl argo rollouts get rollout demo-rollout --watch
```

Add Analysis
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: always-pass
spec:
  metrics:
    - name: always-pass
      count: 3
      interval: 10s
      successCondition: result == "true"
      provider:
        job:
          spec:
            template:
              spec:
                containers:
                  - name: check
                    image: busybox
                    command: [sh, -c, 'echo "true"']
                restartPolicy: Never
            backoffLimit: 0
```

```bash
kubectl apply -f analysis.yaml

# Update rollout to use analysis
kubectl patch rollout demo-rollout --type merge -p '
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: always-pass'

# Trigger new rollout
kubectl argo rollouts set image demo-rollout demo=argoproj/rollouts-demo:green

# Watch analysis
kubectl argo rollouts get rollout demo-rollout --watch
```

Test Rollback
```bash
# Start a new rollout (red version)
kubectl argo rollouts set image demo-rollout demo=argoproj/rollouts-demo:red

# While in progress, abort
kubectl argo rollouts abort demo-rollout

# Check that pods rolled back
kubectl argo rollouts get rollout demo-rollout
```

Success Criteria
- Argo Rollouts controller is running
- Can perform canary deployment with weight steps
- Can observe rollout progress with CLI
- Analysis runs and affects promotion
- Can abort and rollback a rollout
Cleanup
```bash
kind delete cluster --name rollouts-lab
```

Key Takeaways
Before moving on, ensure you can:
- Explain why progressive delivery reduces blast radius (traffic percentage × detection time)
- Choose between canary and blue-green strategies based on traffic routing capabilities
- Write a Rollout spec with setWeight, pause, and analysis steps
- Create AnalysisTemplates with Prometheus queries and success conditions
- Calculate blast radius: (traffic % × requests/sec × time-to-detect)
- Configure traffic routing with NGINX, Istio, or pod-based splitting
- Use the Argo Rollouts CLI: get, promote, abort, retry, undo
- Design multi-metric analysis checking error rate, latency, and health
- Implement manual approval gates with infinite pause steps
- Troubleshoot common issues: traffic mismatch, stuck pauses, analysis failures
Next Module
Continue to Module 2.3: Flux where we’ll explore the alternative GitOps toolkit approach.
“Ship fast, but ship safe. Progressive delivery lets you have both.”