Module 1.1: Release Strategies & Progressive Delivery Fundamentals
Discipline Module | Complexity: [MEDIUM] | Time: 2 hours
Prerequisites
Before starting this module:
- Required: CI/CD Fundamentals — Understanding build pipelines, artifact promotion, and deployment automation
- Required: Kubernetes Deployments — Working knowledge of Deployments, Services, and label selectors
- Recommended: Basic understanding of load balancers and HTTP routing
- Recommended: Familiarity with monitoring/observability concepts
What You’ll Be Able to Do
After completing this module, you will be able to:
- Evaluate deployment strategies — rolling, blue-green, canary, A/B — against your risk tolerance and infrastructure
- Design a release strategy matrix that matches deployment patterns to service criticality levels
- Implement progressive delivery workflows that gradually shift traffic with automated rollback triggers
- Analyze release failure modes to build deployment pipelines that detect problems before full rollout
Why This Module Matters
It is 2:47 AM. Your phone is screaming. A deployment that went out six hours ago has been silently corrupting customer invoices. Every single user is affected. The rollback takes 40 minutes because nobody tested it. By the time you are back in bed, 12,000 invoices need manual correction, and the CEO wants a meeting at 8 AM.
This scenario plays out every week at companies around the world. Not because the code was terrible, but because the release strategy was terrible — or, more accurately, because there was no strategy at all.
Release engineering is the discipline of getting code from “it works on my machine” to “it works for every user” safely, repeatedly, and reversibly. The difference between teams that deploy with confidence and teams that deploy with dread comes down to one thing: how they manage the blast radius of change.
In this module, you will learn the fundamental release strategies — Blue/Green, Canary, Shadow, and more — and understand when each one is the right tool. You will also learn the art of progressive delivery: the idea that a release is not a binary event but a graduated exposure of change to increasingly larger audiences.
By the end, you will never again think of deployment as flipping a switch. You will think of it as turning a dial.
Stop and think: What is the most stressful deployment you have ever been a part of? What made it so stressful? Was it the size of the change, the lack of testing, or the inability to easily undo it?
The Problem with Big-Bang Releases
Why “Deploy Everything at Once” Fails
Traditional releases work like this:

    Developer commits → Build passes → Deploy to all servers → Hope for the best

This is the big-bang release. Everything goes live at once. If it works, great. If it does not, every user is affected simultaneously.
The math is brutal:
| Release Pattern | Blast Radius | Rollback Time | Risk |
|---|---|---|---|
| Big-bang deploy | 100% of users | Minutes to hours | Extreme |
| Blue/Green | 100% at cutover, 0% after switching back | Seconds | Low |
| Canary | 1-5% at first, widened gradually | Seconds | Very Low |
| Shadow/Dark | 0% (mirrored only) | N/A | Near Zero |
The goal of release engineering is to move from the top of that table to the bottom.
Blast Radius: The Core Concept
Blast radius is the percentage of users, requests, or systems affected if a release goes wrong.
Think of it like testing fireworks. You would not set off an untested firework in a crowded stadium. You would test it in an empty field first, then a small gathering, then a larger event, and only then at the stadium.
Progressive delivery applies the same logic to software:

    graph LR
      subgraph Time [Blast Radius Over Time]
        direction LR
        S[Shadow<br/>0%] --> I[Internal<br/>5%]
        I --> B[Beta users<br/>25%]
        B --> R1[Region 1<br/>50%]
        R1 --> R2[Region 2<br/>75%]
        R2 --> GA[Full GA<br/>100%]
      end

Every step validates the release before expanding the blast radius.
Release Strategy Deep Dive
Pause and predict: If you have two complete, isolated copies of your production environment running side-by-side, how quickly do you think you can undo a bad release?
Blue/Green Deployments
Blue/Green is the simplest progressive strategy. You maintain two identical production environments:
- Blue: The current live environment
- Green: The new version, fully deployed but receiving no traffic
When you are confident Green is healthy, you switch all traffic from Blue to Green. If something goes wrong, you switch back.

    graph TD
      LB[Load Balancer] -->|Traffic| Blue
      LB -.->|No Traffic| Green

      subgraph Blue [Blue v1 - LIVE ✓]
        B_P1[Pod]
        B_P2[Pod]
        B_P3[Pod]
      end

      subgraph Green [Green v2 - STANDBY]
        G_P1[Pod]
        G_P2[Pod]
        G_P3[Pod]
      end

After cutover:

    graph TD
      LB[Load Balancer] -.->|No Traffic| Blue
      LB -->|Traffic| Green

      subgraph Blue [Blue v1 - STANDBY]
        B_P1[Pod]
        B_P2[Pod]
        B_P3[Pod]
      end

      subgraph Green [Green v2 - LIVE ✓]
        G_P1[Pod]
        G_P2[Pod]
        G_P3[Pod]
      end

In Kubernetes, Blue/Green is achieved with label selectors on Services:

    # Service pointing to Blue (v1)
    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
    spec:
      selector:
        app: my-app
        version: blue  # ← Change this to "green" for cutover
      ports:
        - port: 80
          targetPort: 8080

Advantages:
- Instant rollback (just switch the selector back)
- Full environment validation before cutover
- Zero downtime if done correctly
Disadvantages:
- Requires double the infrastructure (temporarily)
- All-or-nothing traffic switch (no gradual rollout)
- Database migrations are tricky (both versions must work with the schema)
When to use:
- Critical services where instant rollback is essential
- When you need full integration testing before going live
- Services with simple state management
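The selector switch can be modelled in a few lines of Python. This is only a toy sketch of the idea, not how Kubernetes implements Services: the `Service` class and its `endpoints` method are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Toy stand-in for a Kubernetes Service (hypothetical, for illustration)."""
    selector: dict
    pods: list = field(default_factory=list)  # each pod is a dict of labels

    def endpoints(self):
        # A pod receives traffic only if every selector key/value matches its labels.
        return [p for p in self.pods
                if all(p.get(k) == v for k, v in self.selector.items())]


pods = ([{"app": "my-app", "version": "blue", "name": f"blue-{i}"} for i in range(3)]
        + [{"app": "my-app", "version": "green", "name": f"green-{i}"} for i in range(3)])

svc = Service(selector={"app": "my-app", "version": "blue"}, pods=pods)
before = [p["name"] for p in svc.endpoints()]

svc.selector["version"] = "green"   # cutover: one field changes
after = [p["name"] for p in svc.endpoints()]

svc.selector["version"] = "blue"    # rollback: the blue pods never went away
rolled_back = [p["name"] for p in svc.endpoints()]

print(before, after, rolled_back)
```

Nothing is started or stopped during cutover or rollback; only the routing rule changes, which is why both directions are equally fast.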
Canary Deployments
Named after the canaries coal miners used to detect toxic gases, a canary deployment sends a small percentage of traffic to the new version first.

    graph TD
      LB[Load Balancer] -->|95% Traffic| Stable
      LB -->|5% Traffic| Canary

      subgraph Stable [Stable v1]
        S_P1[Pod]
        S_P2[Pod]
        S_P3[Pod]
        S_P4[Pod]
        S_P5[Pod]
        S_P6[Pod]
      end

      subgraph Canary [Canary v2]
        C_P1[Pod]
      end

If the canary is healthy (low error rates, acceptable latency), traffic gradually increases:

    5% → 10% → 25% → 50% → 100%

If the canary shows problems, it gets killed, and only 5% of users were ever affected.
Advantages:
- Minimal blast radius
- Real production traffic validation
- Automated promotion/rollback possible
- Gradual confidence building
Disadvantages:
- More complex to set up
- Requires traffic splitting capability
- Metrics analysis needed to determine health
- Session affinity can complicate things
When to use:
- High-traffic services where even brief outages are costly
- When you want metrics-driven deployment decisions
- Services where user behavior validation matters
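The promote-or-abort decision at each traffic step can be sketched as a small loop. This is a minimal illustration under stated assumptions: `error_rate_at` is a hypothetical hook standing in for a real metrics query, and the 1% error threshold is an arbitrary example value, not a recommendation.

```python
from typing import Callable, List, Tuple

STEPS = [5, 10, 25, 50, 100]  # traffic percentages from the sequence above


def run_canary(error_rate_at: Callable[[int], float],
               max_error_rate: float = 0.01,
               steps: List[int] = STEPS) -> Tuple[str, int]:
    """Walk the canary through each traffic step.

    error_rate_at(percent) is a hypothetical hook returning the canary's
    observed error rate (0.0-1.0) while it serves `percent` of traffic.
    Returns ("promoted", 100) or ("rolled_back", percent_exposed_at_abort).
    """
    for percent in steps:
        observed = error_rate_at(percent)
        if observed > max_error_rate:
            # Abort: traffic returns to stable; exposure never exceeded `percent`.
            return "rolled_back", percent
    return "promoted", steps[-1]


healthy = run_canary(lambda pct: 0.002)
# A defect that only appears under load, once the canary takes 25% of traffic.
load_bug = run_canary(lambda pct: 0.002 if pct < 25 else 0.08)
print(healthy, load_bug)
```

The second call shows why the steps matter: a problem that only appears under load is caught at 25% exposure instead of 100%.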
Shadow (Dark Launch) Deployments
Shadow deployments mirror production traffic to the new version without serving responses to users. The new version processes real requests, but its responses are discarded.

    graph TD
      Req[Request] --> LB[Load Balancer]
      LB -->|Serves| Prod
      LB -.->|Mirrors| Shadow
      Prod --> Resp[Response from v1 only]
      Shadow -.->|Processes but discards output| Discard((Discard))

      subgraph Prod [Production v1]
        P_P1[Serves users]
      end

      subgraph Shadow [Shadow v2]
        S_P1[Processes requests]
      end

Advantages:
- Zero user impact — users never see v2 responses
- Tests with real production load patterns
- Validates performance under actual traffic
- Great for data pipeline or ML model changes
Disadvantages:
- Cannot test user-facing changes (UI differences)
- Write operations are dangerous (double-writes)
- Requires infrastructure to mirror and discard
- Does not validate client-side behavior
When to use:
- Backend services with no side effects on reads
- Performance validation before launch
- ML model comparisons (A/B testing the model, not the UX)
- Database query optimization validation
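The mirror-and-discard flow can be sketched in Python. This is a simplified, synchronous illustration: real mirroring is normally done asynchronously by a proxy or service mesh, and `handle_with_shadow`, `v1`, and `v2` are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List


def handle_with_shadow(request: Dict,
                       primary: Callable[[Dict], Dict],
                       shadow: Callable[[Dict], Dict],
                       mismatches: List[Dict]) -> Dict:
    """Serve from primary; run shadow on a copy and record differences only."""
    response = primary(request)
    try:
        shadow_response = shadow(dict(request))  # copy so shadow cannot mutate the original
        if shadow_response != response:
            mismatches.append({"request": request,
                               "primary": response,
                               "shadow": shadow_response})
    except Exception as exc:  # a shadow failure must never reach the user
        mismatches.append({"request": request, "shadow_error": repr(exc)})
    return response  # the shadow's output is discarded


def v1(req):
    return {"total": req["qty"] * 10}


def v2(req):
    if req["qty"] == 0:
        raise ValueError("empty cart")  # a bug that only v2 has
    return {"total": req["qty"] * 10}


log: List[Dict] = []
responses = [handle_with_shadow({"qty": q}, v1, v2, log) for q in (2, 0, 5)]
print(responses, len(log))
```

Users only ever receive `v1` responses, while the mismatch log captures the one request on which `v2` misbehaved.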
Rolling Updates
Kubernetes’ default strategy. Pods are replaced one at a time (or in batches):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 6
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1        # Allow 1 extra pod during update
          maxUnavailable: 0  # Always keep all pods available

Update sequence:

    Step 1: [v1] [v1] [v1] [v1] [v1] [v1] [v2]  ← 1 new pod added
    Step 2: [v1] [v1] [v1] [v1] [v1] [v2] [v2]  ← 1 old pod removed, 1 new added
    Step 3: [v1] [v1] [v1] [v1] [v2] [v2] [v2]
    Step 4: [v1] [v1] [v1] [v2] [v2] [v2] [v2]
    Step 5: [v1] [v1] [v2] [v2] [v2] [v2] [v2]
    Step 6: [v1] [v2] [v2] [v2] [v2] [v2] [v2]
    Step 7: [v2] [v2] [v2] [v2] [v2] [v2]       ← Complete

Advantages:
- Built into Kubernetes (zero extra tooling)
- Simple to configure
- Gradual replacement
Disadvantages:
- No fine-grained traffic control (traffic split depends on pod count)
- Mixed versions serve traffic simultaneously
- Rollback is another rolling update (not instant)
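The update sequence can be reproduced with a short simulation. This is a simplified model, not the Deployment controller's actual algorithm: it assumes every new pod becomes ready immediately, and `rolling_update` is a hypothetical helper written for this example.

```python
def rolling_update(replicas: int, max_surge: int, max_unavailable: int):
    """Return (old_pods, new_pods) after each step of a simplified rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    states = []
    while new < replicas:
        # Surge: start as many new pods as the ceiling (replicas + maxSurge) allows.
        room = replicas + max_surge - (old + new)
        new += min(room, replicas - new)
        states.append((old, new))
        # Scale down old pods without dropping below the availability floor.
        floor = replicas - max_unavailable
        old -= min(old, (old + new) - floor)
    states.append((old, new))
    return states


states = rolling_update(replicas=6, max_surge=1, max_unavailable=0)
for step, (old, new) in enumerate(states, start=1):
    share = new / (old + new)
    print(f"Step {step}: {old} x v1, {new} x v2 -> about {share:.0%} of traffic on v2")
```

With these settings the first step already sends roughly one request in seven to the new version, which illustrates the point that the traffic split follows pod count rather than a percentage you choose.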
Pause and predict: If a deployment is the physical act of putting code onto a server, what is a release? Are they the same thing?
Feature Flags vs Release Toggles
Decoupling Deployment from Release
Here is a critical insight: deployment and release are not the same thing.
- Deployment: Putting code on servers
- Release: Making a feature available to users
Feature flags let you deploy code without releasing it:

    # Feature is deployed but not released
    if feature_flags.is_enabled("new-checkout-flow", user=current_user):
        return new_checkout_flow(cart)
    else:
        return old_checkout_flow(cart)

This separation is powerful because it means:
- You can deploy any time (even Friday afternoon)
- You can release to specific users first
- You can kill a feature without redeploying
- You can A/B test without infrastructure changes
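Releasing to specific users first usually relies on deterministic bucketing, so that a given user gets the same answer on every request. The following is a minimal sketch of one common approach (hashing the flag name and user ID); `bucket` and `is_enabled` are hypothetical helpers, not the API of any particular flag service.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percentage: int,
               allow_list=frozenset()) -> bool:
    if user_id in allow_list:  # e.g. internal testers always get the feature
        return True
    return bucket(flag, user_id) < rollout_percentage


users = [f"user-{i}" for i in range(10_000)]
at_10 = {u for u in users if is_enabled("new-checkout-flow", u, 10)}
at_25 = {u for u in users if is_enabled("new-checkout-flow", u, 25)}

# Raising the percentage only adds users; nobody who had the feature loses it.
print(len(at_10), len(at_25), at_10 <= at_25)
```

Including the flag name in the hash means different flags select different user subsets, so the same users are not always the first to receive every new feature.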
Types of Feature Flags
| Type | Lifespan | Purpose | Example |
|---|---|---|---|
| Release toggle | Days/weeks | Gate incomplete features | New checkout flow |
| Experiment toggle | Weeks/months | A/B testing | Button color test |
| Ops toggle | Permanent | Circuit breakers | Disable recommendations |
| Permission toggle | Permanent | Entitlements | Premium features |
The Kill Switch Pattern
Every critical feature should have a kill switch, an ops toggle that immediately disables it:

    # Feature flag configuration
    new_payment_processor:
      enabled: true
      kill_switch: true          # Can be disabled instantly
      rollout_percentage: 25     # Only 25% of users see it
      excluded_regions:
        - ap-southeast-1         # Not yet tested in APAC

If the new payment processor starts failing, one API call disables it:

    curl -X PUT https://flags.internal/api/flags/new_payment_processor \
      -d '{"enabled": false}'

No redeployment. No rollback. Instant recovery.
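On the application side, the kill switch only works if the flag is checked before anything else and a fallback path still exists. The sketch below assumes the flag configuration has been loaded into a dict with the same fields as the YAML above; `use_new_processor` and `charge` are hypothetical functions, and `user_bucket` stands for any stable 0-99 value per user.

```python
FLAG = {
    "enabled": True,
    "rollout_percentage": 25,
    "excluded_regions": ["ap-southeast-1"],
}


def use_new_processor(flag: dict, region: str, user_bucket: int) -> bool:
    """Evaluate the flag; user_bucket is a stable 0..99 value per user."""
    if not flag["enabled"]:  # kill switch wins over every other rule
        return False
    if region in flag["excluded_regions"]:
        return False
    return user_bucket < flag["rollout_percentage"]


def charge(amount_cents: int, flag: dict, region: str, user_bucket: int) -> str:
    if use_new_processor(flag, region, user_bucket):
        return f"new-processor charged {amount_cents}"
    return f"legacy-processor charged {amount_cents}"  # fallback must stay deployable


before = charge(500, FLAG, "us-east-1", user_bucket=7)
excluded = charge(500, FLAG, "ap-southeast-1", user_bucket=7)
FLAG["enabled"] = False  # the effect of disabling the flag
after = charge(500, FLAG, "us-east-1", user_bucket=7)
print(before, "|", excluded, "|", after)
```

The legacy path is the real cost of a kill switch: it has to stay working for as long as the switch exists.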
Stop and think: If you have v1 and v2 of your application serving traffic simultaneously during a rolling update, what happens if v2 renames a critical database column?
Database Migrations During Zero-Downtime Releases
The Hardest Problem in Release Engineering
Code is stateless and replaceable. Databases are not. This makes database schema changes the most dangerous part of any release.
The fundamental problem: during a rolling deployment, both old and new versions of your application run simultaneously. Both must work with the same database schema.
The Expand-Contract Pattern
Never make breaking schema changes in a single release. Use expand-contract (also called parallel change):
Phase 1, Expand (Release N): Add new columns/tables, keep old ones working.

    -- Add new column, allow NULL (backward compatible)
    ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT NULL;

Old code ignores the new column. New code starts writing to it.
Phase 2, Migrate (Release N+1): Backfill data and start reading from the new column.

    -- Backfill existing rows
    UPDATE users SET email_verified = false WHERE email_verified IS NULL;

Phase 3, Contract (Release N+2): Remove old columns/code once fully migrated.

    -- Now safe to add NOT NULL constraint
    ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;

    -- Later: remove old column if it was replaced
    -- ALTER TABLE users DROP COLUMN old_email_status;

Migration Anti-Patterns
| Anti-Pattern | Why It Breaks | Safe Alternative |
|---|---|---|
| `DROP COLUMN` in same release | Old pods still reading it | Expand-contract over 2+ releases |
| `RENAME COLUMN` | Old code uses old name | Add new column, backfill, drop old |
| `ALTER TYPE` (e.g., int→bigint) | Can lock table for hours | Create new column, dual-write, swap |
| Non-reversible migration | Cannot roll back release | Always write reversible migrations |
| Big table migration in one tx | Locks table, kills production | Batch in small chunks |
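The "batch in small chunks" alternative can be sketched with Python's built-in `sqlite3` module. This is an illustration only: SQLite stands in for the production database, the table is generated test data, the batch size of 100 is arbitrary, and the `UPDATE ... WHERE id IN (SELECT ... LIMIT ?)` form may need adapting for other database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, "
    "email_verified BOOLEAN DEFAULT NULL)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])
conn.commit()


def backfill(conn: sqlite3.Connection, batch_size: int = 100) -> int:
    """Set email_verified = 0 where NULL, one short transaction per batch."""
    batches = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET email_verified = 0 "
            "WHERE id IN (SELECT id FROM users "
            "WHERE email_verified IS NULL LIMIT ?)",
            (batch_size,))
        conn.commit()  # end the transaction so locks are released between batches
        if cur.rowcount == 0:
            return batches
        batches += 1


batches = backfill(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email_verified IS NULL").fetchone()[0]
print(batches, remaining)
```

Because each batch commits on its own, the backfill can be paused or resumed at any point, and application queries only ever wait behind a small update.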
Stop and think: Is it better to have a system that rarely breaks but takes hours to fix, or a system that breaks occasionally but can be fixed in seconds?
MTTR vs MTBF: Two Philosophies of Reliability
The Old Way: Maximize MTBF
MTBF (Mean Time Between Failures) asks: “How do we prevent failures?”
Organizations optimizing for MTBF:
- Have heavy change approval processes
- Deploy infrequently (monthly or quarterly)
- Test exhaustively before release
- Avoid changes to “stable” systems
The problem: failures still happen. And when they do, recovery is slow because the team has no practice at it.
The Modern Way: Minimize MTTR
MTTR (Mean Time to Recovery) asks: “How fast can we recover from failure?”
Organizations optimizing for MTTR:
- Deploy frequently (many times per day)
- Invest in rollback mechanisms
- Practice incident response
- Accept that failures will happen

    MTBF-Focused:
    ──────────────────────────────X──────(long recovery)──────────────────
      Long time between failures          Slow, unpracticed recovery

    MTTR-Focused:
    ────X──(quick fix)──────X──(quick fix)──────X──(quick fix)──────
      More frequent failures        But fast, practiced recoveries

The key insight from the DORA research (which we will explore in Module 1.5): high-performing teams optimize for MTTR, not MTBF. They deploy more often, fail more gracefully, and recover faster.
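A rough calculation shows why recovery time can dominate. The sketch below uses the standard steady-state availability formula, MTBF / (MTBF + MTTR), treating MTBF as the uptime between failures. The failure frequencies and recovery times are assumed figures chosen for illustration, not measurements.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumed figures for illustration only.
mtbf_team = availability(mtbf_hours=90 * 24, mttr_hours=48)     # fails quarterly, 2-day recovery
mttr_team = availability(mtbf_hours=5 * 24, mttr_hours=4 / 60)  # fails every 5 days, 4-minute recovery

HOURS_PER_YEAR = 365 * 24
for name, a in (("MTBF-focused", mtbf_team), ("MTTR-focused", mttr_team)):
    downtime = (1 - a) * HOURS_PER_YEAR
    print(f"{name}: {a:.4%} available, ~{downtime:.1f} h down per year")
```

Under these assumptions the team that fails 18 times more often still accumulates about 5 hours of downtime per year, against roughly 190 hours for the team that fails rarely but recovers slowly.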
How Release Strategies Enable Low MTTR
| Strategy | MTTR Contribution |
|---|---|
| Blue/Green | Instant rollback by switching traffic |
| Canary | Auto-rollback before most users are affected |
| Feature flags | Disable features without redeployment |
| Shadow | Catch problems before any user is affected |
Choosing the Right Strategy
Decision Matrix
| Factor | Blue/Green | Canary | Shadow | Rolling |
|---|---|---|---|---|
| Complexity | Low | Medium | High | Very Low |
| Blast radius | 0→100% | Gradual | 0% | Gradual |
| Rollback speed | Instant | Fast | N/A | Slow |
| Infra cost | 2x (temporary) | +10-20% | 2x | +0-10% |
| Traffic control | Binary | Fine-grained | Mirrored | None |
| DB-safe | Needs expand-contract | Needs expand-contract | Read-only safe | Needs expand-contract |
| Best for | Critical services | High-traffic APIs | Backend/ML | Simple services |
Combining Strategies
Real-world release engineering combines multiple strategies:
1. Shadow deploy → Validate performance with real traffic
2. Canary to 1% → Watch metrics for 30 minutes
3. Canary to 10% → Watch metrics for 1 hour
4. Blue/Green the rest → Full cutover with instant rollback
5. Feature flag → Gradually enable new UI for users

This is progressive delivery in practice: layering strategies to minimize risk at every stage.
Did You Know?
- Facebook deploys code to production thousands of times per day, but features reach users over days or weeks through a sophisticated feature flag system called Gatekeeper. The average feature is tested on Facebook employees first, then 1% of users, then gradually ramped up. A deployment is never a release.
- The term “canary deployment” comes from the practice of taking canaries into coal mines. The birds were more sensitive to carbon monoxide than humans, so if the canary stopped singing, miners knew to evacuate. In software, your canary pods are the first to “stop singing” (show errors), warning you before the toxic change reaches everyone.
- LinkedIn pioneered the concept of “Dark Launches” in 2009 when they needed to test a completely rewritten backend. They mirrored 100% of production read traffic to the new system for weeks, comparing responses without users ever knowing. By the time they switched over, they had weeks of production validation, and the cutover was uneventful.
- Blue/Green deployments were first described by Daniel North and Jez Humble in 2005, and the technique predates Kubernetes by nearly a decade. The original implementation used DNS switching between two physical server clusters. In Kubernetes, the same concept is achieved with a single label selector change, so a switch that once took hours can now complete almost instantly.
War Story: The Migration That Ate Production
A team was migrating from a monolith to microservices. They had been developing the new services for months and were ready to switch.
Their plan: big-bang cutover on a Saturday night. Move all traffic from the monolith to the new services at once.
What happened:
- 7:00 PM — Cut traffic to new services. Response times look good.
- 7:15 PM — Connection pool exhaustion. The new services open 10x more DB connections than the monolith.
- 7:30 PM — Roll back. But the rollback script has a bug. Monolith cannot start because a database migration already ran.
- 8:00 PM — Emergency manual fix of the migration. Monolith is back but running on the new schema with patched queries.
- 11:00 PM — All data integrity issues identified and corrected.
- 3:00 AM — Post-incident meeting. Everyone is exhausted and demoralized.
What they should have done:
- Shadow deploy the new services for two weeks to find the connection pool issue
- Canary at 1% to validate real user traffic patterns
- Expand-contract the database to support both services simultaneously
- Feature flags to gradually shift functionality, not traffic
They eventually did exactly this over the next three months, and the second attempt was anticlimactic — which is the goal.
Lesson: The measure of a great release is how boring it is.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| No rollback plan | “We’ll figure it out if something goes wrong” | Test rollback before every release; automate it |
| Canary without metrics | Deploying to 5% but not measuring anything | Define success metrics before deployment starts |
| Blue/Green with DB migrations | Both environments need the same schema | Use expand-contract pattern over multiple releases |
| Feature flags as permanent code | Codebase fills with dead branches | Set expiry dates; track flag lifecycle; clean up quarterly |
| Testing only in staging | Staging never matches production | Use shadow deployments for production-traffic validation |
| Deploying on Fridays | Nobody monitors over the weekend | Deploy early in the week, or invest in proper on-call and automation |
| Skipping smoke tests after cutover | Assuming traffic switch means success | Automated post-deployment health checks are non-negotiable |
| Ignoring session state | Users mid-transaction get errors during switch | Implement graceful connection draining and sticky sessions |
Quiz: Check Your Understanding
Question 1
Your team merged the new checkout flow on Friday morning. The code is running on all production servers, but the marketing campaign for the new checkout doesn’t launch until Monday. When users visit the site over the weekend, they still see the old checkout. Which concept describes what has happened to the new code, and why is this separation important?
Show Answer
The new checkout code has been **deployed** but not yet **released**.
A deployment is the physical act of putting code onto servers and making it ready to run, while a release is the business decision to make that feature available to users. By decoupling these two events (often using feature flags), your team can safely push code to production during normal working hours without prematurely exposing unfinished or unannounced features. This separation is the foundation of progressive delivery, allowing you to control the blast radius and turn releasing into a simple configuration change rather than a stressful infrastructure event.
Question 2
You are tasked with rolling out a major rewrite of the database query optimization engine. The new engine changes how data is fetched but should return exactly the same results to the end user. Your engineering manager suggests routing 5% of user traffic to the new engine to test it. Why might a different deployment strategy be safer and more effective for this specific scenario?
Show Answer
A **Shadow deployment** would be significantly safer and more effective than a Canary deployment for this rewrite.
Because the new engine is a backend-only change with no intended user-facing differences, a Canary deployment unnecessarily exposes that 5% of users to potential errors or latency spikes if the new queries perform poorly. Instead, a Shadow deployment allows you to mirror 100% of real production traffic to the new engine while discarding its responses, ensuring zero user impact. This lets you validate the new engine’s performance, resource utilization, and correctness under actual production load without ever putting user experience at risk.
Question 3
Your application is running across 50 pods. You initiate a rolling update to deploy a new version that renames the `account_status` database column to `status`. Midway through the deployment, you notice a massive spike in errors and the application goes down. What caused this failure, and how could you have prevented it?
Show Answer
The failure occurred because during a rolling update, both the old and new versions of your application run simultaneously and must share the exact same database schema.
When the new version renamed the column, the old pods (which were still running and serving traffic) immediately crashed because they were trying to query a column name that no longer existed. To prevent this, you must use the expand-contract pattern across multiple releases. First, you expand by adding the new column while keeping the old one intact. Then, you migrate the data and update the code to write to both. Finally, in a subsequent release once all code depends on the new column, you contract by removing the old column.
Question 4
Company A deploys once a quarter, requiring three weeks of code freezes and approvals to prevent bugs. Company B deploys twenty times a day with automated rollbacks. When a severe bug inevitably hits production, Company B resolves it in four minutes, while Company A takes two days to diagnose and issue a hotfix. Which reliability philosophy is Company B following, and why does it lead to better outcomes?
Show Answer
Company B is optimizing for **Mean Time To Recovery (MTTR)**, while Company A is stuck trying to maximize Mean Time Between Failures (MTBF).
Optimizing for MTBF creates a false sense of security; failures will always happen eventually, but heavy processes make the team unpracticed and slow at fixing them. By focusing on MTTR, Company B accepts that failures are normal and instead invests in fast, automated recovery mechanisms and smaller, more frequent changes. Because the batch size of each deployment is tiny, diagnosing the root cause is trivial, and practiced rollback mechanisms ensure that the blast radius of any defect is minimal and quickly contained.
Question 5
Your team just launched a new integration with a third-party payment gateway. During peak holiday traffic, the third-party gateway starts timing out, causing your entire checkout flow to hang. A developer frantically starts creating a revert pull request, estimating it will take 20 minutes to build and deploy. How could the architecture have been designed differently to resolve this outage instantly?
Show Answer
The architecture should have implemented a **kill switch** (an operational feature flag) for the new payment gateway integration.
A kill switch allows operations or development teams to instantly disable a problematic feature or integration with a single API call or configuration flip, without requiring a redeployment or code rollback. In this scenario, flipping the kill switch would have immediately bypassed the failing third-party gateway and reverted the system to the legacy processor. You should use this pattern for any feature that touches critical user flows, relies on external dependencies, or carries a high cost of failure if it degrades in production.
Question 6
Your infrastructure team proudly announces they have implemented Blue/Green deployments in Kubernetes. However, during the first failed release, the rollback takes 30 minutes. When you investigate, you find they are updating the Helm chart to point back to the previous image tag and running the CI/CD pipeline from scratch to redeploy. What fundamental principle of Blue/Green deployments have they misunderstood, and how should it work?
Show Answer
The team has misunderstood that a true Blue/Green rollback is a **traffic routing switch**, not a redeployment event.
By tearing down the old version and relying on the CI/CD pipeline to rebuild and redeploy the previous image, they have entirely lost the speed and safety benefits of the strategy. In a proper Blue/Green architecture, the “Blue” (previous) environment must remain running and healthy on standby even after the “Green” (new) environment is actively taking traffic. A rollback should simply consist of updating the Kubernetes Service selector or Load Balancer rule to instantly route traffic back to the running Blue pods, reducing recovery time from minutes to milliseconds.
Hands-On Exercise: Manual Blue/Green Deployment with Kubernetes
Objective
Deploy a web application using the Blue/Green strategy in Kubernetes, perform a zero-downtime cutover, and practice instant rollback.
Create a local Kubernetes cluster:

    kind create cluster --name release-lab

    # The steps below use k as shorthand for kubectl
    alias k=kubectl

Step 1: Deploy the Blue Version
blue-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: webapp-blue
      labels:
        app: webapp
        version: blue
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: webapp
          version: blue
      template:
        metadata:
          labels:
            app: webapp
            version: blue
        spec:
          containers:
            - name: webapp
              image: hashicorp/http-echo:0.2.3
              args:
                - "-text=Hello from BLUE (v1)"
                - "-listen=:8080"
              ports:
                - containerPort: 8080
              readinessProbe:
                httpGet:
                  path: /
                  port: 8080
                initialDelaySeconds: 2
                periodSeconds: 3

webapp-service.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: webapp
    spec:
      type: NodePort
      selector:
        app: webapp
        version: blue  # ← Currently pointing to Blue
      ports:
        - port: 80
          targetPort: 8080
          nodePort: 30080

Apply both:

    k apply -f blue-deployment.yaml
    k apply -f webapp-service.yaml

Verify Blue is live:

    k get pods -l version=blue
    # Should show 3 pods Running

    # Test the service using an in-cluster curl
    kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
      curl -s webapp.default.svc:80
    # Output: Hello from BLUE (v1)

Step 2: Deploy the Green Version (No Traffic Yet)
green-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: webapp-green
      labels:
        app: webapp
        version: green
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: webapp
          version: green
      template:
        metadata:
          labels:
            app: webapp
            version: green
        spec:
          containers:
            - name: webapp
              image: hashicorp/http-echo:0.2.3
              args:
                - "-text=Hello from GREEN (v2)"
                - "-listen=:8080"
              ports:
                - containerPort: 8080
              readinessProbe:
                httpGet:
                  path: /
                  port: 8080
                initialDelaySeconds: 2
                periodSeconds: 3

Apply it and check the pods:

    k apply -f green-deployment.yaml
    k get pods -l version=green
    # Should show 3 pods Running, but no traffic is going to them yet

Verify Green works by directly port-forwarding to one of its pods:

    # Find a green pod name
    GREEN_POD=$(k get pods -l version=green -o jsonpath='{.items[0].metadata.name}')
    k port-forward $GREEN_POD 8081:8080 &
    sleep 2   # give the port-forward a moment to start listening
    curl http://localhost:8081
    # Output: Hello from GREEN (v2)
    kill %1   # stop the background port-forward

Step 3: Perform the Cutover
Switch the Service selector from Blue to Green:

    k patch service webapp -p '{"spec":{"selector":{"version":"green"}}}'

Verify the switch:

    kubectl run curl-test2 --rm -it --restart=Never --image=curlimages/curl -- \
      curl -s webapp.default.svc:80
    # Output: Hello from GREEN (v2)

That is it. One command. Instant cutover.
Step 4: Practice Instant Rollback
Simulate a problem with Green and roll back:

    # Rollback to Blue
    k patch service webapp -p '{"spec":{"selector":{"version":"blue"}}}'

    kubectl run curl-test3 --rm -it --restart=Never --image=curlimages/curl -- \
      curl -s webapp.default.svc:80
    # Output: Hello from BLUE (v1)

Step 5: Verify with a Traffic Loop
Run a continuous traffic loop to prove zero downtime during switching. Use an in-cluster pod to test through the Service (port-forward does not dynamically follow Service selector changes):

    # In one terminal: continuous requests via in-cluster pod
    kubectl run traffic-loop --rm -it --restart=Never --image=curlimages/curl -- \
      sh -c 'while true; do
        RESPONSE=$(curl -s webapp.default.svc:80)
        echo "$(date +%H:%M:%S) - $RESPONSE"
        sleep 0.5
      done'

In another terminal, perform the cutover:

    k patch service webapp -p '{"spec":{"selector":{"version":"green"}}}'

Watch the first terminal. You should see the response change from Blue to Green with zero failed requests.
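If you save the loop's output to a file, a few lines of Python can confirm the result instead of relying on eyeballing the terminal. This is an optional sketch: `summarize` is a hypothetical helper, it assumes the `HH:MM:SS - response` line format produced by the loop above, and the `sample` lines are made-up data rather than captured output.

```python
import re

# Timestamp, " -", then whatever curl returned (possibly nothing).
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) -\s*(.*)$")


def summarize(lines):
    """Return (failed_requests, switch_times) from traffic-loop output."""
    failed, switches, previous = 0, [], None
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not loop output
        timestamp, body = match.groups()
        if not body.startswith("Hello from"):
            failed += 1  # empty or unexpected body: the request did not succeed
            continue
        if previous is not None and body != previous:
            switches.append(timestamp)  # first response from the other version
        previous = body
    return failed, switches


sample = [
    "12:00:00 - Hello from BLUE (v1)",
    "12:00:01 - Hello from BLUE (v1)",
    "12:00:01 - Hello from GREEN (v2)",
    "12:00:02 - Hello from GREEN (v2)",
]
print(summarize(sample))
```

A successful exercise shows zero failures and exactly one switch time per `k patch` you ran.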
Step 6: Clean Up

    kind delete cluster --name release-lab

Success Criteria
You have completed this exercise when you can confirm:
- Both Blue and Green deployments ran simultaneously with 3 pods each
- The Service selector controlled which version received traffic
- Cutover from Blue to Green happened with a single `kubectl patch` command
- Rollback from Green to Blue was equally instant
- The traffic loop showed zero failed requests during the switch
- You understand that the Blue pods remained running as a rollback target
Key Takeaways
- Deployment is not release: feature flags and progressive delivery decouple putting code on servers from exposing it to users
- Blast radius is everything: every release strategy is ultimately about controlling how many users are affected if something goes wrong
- Blue/Green gives instant rollback: keep the old version running and switch traffic with a selector change
- Canary gives gradual confidence: start with 1% and increase only when metrics confirm health
- Shadow validates without risk: mirror production traffic to test performance before any user is exposed
- Database migrations need expand-contract: never make breaking schema changes in a single release
- Optimize for MTTR, not MTBF: fast recovery beats rare failure every time
Further Reading
Books:
- “Continuous Delivery” — Jez Humble and David Farley (the foundational text on release engineering)
- “Accelerate” — Nicole Forsgren, Jez Humble, Gene Kim (DORA research on deployment performance)
- “Release It!” — Michael Nygard (patterns for production-ready software)
Articles:
- “Progressive Delivery” — James Governor, RedMonk (redmonk.com)
- “BlueGreenDeployment” — Martin Fowler (martinfowler.com)
- “Feature Toggles” — Pete Hodgson, Martin Fowler (martinfowler.com/articles/feature-toggles.html)
Talks:
- “Progressive Delivery” — James Governor (YouTube)
- “Testing in Production” — Charity Majors (YouTube)
Summary
Release engineering is the discipline of getting changes to users safely and reversibly. The fundamental strategies (Blue/Green, Canary, Shadow, and Rolling) each control blast radius differently. Progressive delivery layers these strategies together, treating a release as a gradual dial-turn rather than a binary switch. Combined with feature flags and proper database migration patterns, these techniques let you deploy with confidence instead of dread.
The measure of great release engineering is how boring your deployments become.
Next Module
Continue to Module 1.2: Advanced Canary Deployments with Argo Rollouts to learn how to automate canary deployments with metrics-driven promotion and rollback.