
Module 1.1: Release Strategies & Progressive Delivery Fundamentals

Discipline Module | Complexity: [MEDIUM] | Time: 2 hours

Before starting this module:

  • Required: CI/CD Fundamentals — Understanding build pipelines, artifact promotion, and deployment automation
  • Required: Kubernetes Deployments — Working knowledge of Deployments, Services, and label selectors
  • Recommended: Basic understanding of load balancers and HTTP routing
  • Recommended: Familiarity with monitoring/observability concepts

After completing this module, you will be able to:

  • Evaluate deployment strategies — rolling, blue-green, canary, A/B — against your risk tolerance and infrastructure
  • Design a release strategy matrix that matches deployment patterns to service criticality levels
  • Implement progressive delivery workflows that gradually shift traffic with automated rollback triggers
  • Analyze release failure modes to build deployment pipelines that detect problems before full rollout

It is 2:47 AM. Your phone is screaming. A deployment that went out six hours ago has been silently corrupting customer invoices. Every single user is affected. The rollback takes 40 minutes because nobody tested it. By the time you are back in bed, 12,000 invoices need manual correction, and the CEO wants a meeting at 8 AM.

This scenario plays out every week at companies around the world. Not because the code was terrible, but because the release strategy was terrible — or, more accurately, because there was no strategy at all.

Release engineering is the discipline of getting code from “it works on my machine” to “it works for every user” safely, repeatedly, and reversibly. The difference between teams that deploy with confidence and teams that deploy with dread comes down to one thing: how they manage the blast radius of change.

In this module, you will learn the fundamental release strategies — Blue/Green, Canary, Shadow, and more — and understand when each one is the right tool. You will also learn the art of progressive delivery: the idea that a release is not a binary event but a graduated exposure of change to increasingly larger audiences.

By the end, you will never again think of deployment as flipping a switch. You will think of it as turning a dial.


Stop and think: What is the most stressful deployment you have ever been a part of? What made it so stressful? Was it the size of the change, the lack of testing, or the inability to easily undo it?

Traditional releases work like this:

Developer commits → Build passes → Deploy to all servers → Hope for the best

This is the big-bang release. Everything goes live at once. If it works, great. If it does not, every user is affected simultaneously.

The math is brutal:

| Release Pattern | Blast Radius | Rollback Time | Risk |
|---|---|---|---|
| Big-bang deploy | 100% of users | Minutes to hours | Extreme |
| Blue/Green | 100% → 0% instantly | Seconds | Low |
| Canary | 1-5% → gradually | Seconds | Very Low |
| Shadow/Dark | 0% (mirrored only) | N/A | Near Zero |

The goal of release engineering is to move from the top of that table to the bottom.

Blast radius is the percentage of users, requests, or systems affected if a release goes wrong.

Think of it like testing fireworks. You would not set off an untested firework in a crowded stadium. You would test it in an empty field first, then a small gathering, then a larger event, and only then at the stadium.

Progressive delivery applies the same logic to software:

graph LR
subgraph Time [Blast Radius Over Time]
direction LR
S[Shadow<br/>0%] --> I[Internal<br/>5%]
I --> B[Beta users<br/>25%]
B --> R1[Region 1<br/>50%]
R1 --> R2[Region 2<br/>75%]
R2 --> GA[Full GA<br/>100%]
end

Every step validates the release before expanding the blast radius.
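The staged exposure above can be turned into rough numbers. The following is a minimal sketch: the stage names and percentages come from the diagram, while the function names and the 200,000-user base are assumptions made purely for illustration.

```python
# Stages and percentages mirror the diagram; everything else is hypothetical.
STAGES = [
    ("Shadow", 0),
    ("Internal", 5),
    ("Beta users", 25),
    ("Region 1", 50),
    ("Region 2", 75),
    ("Full GA", 100),
]


def exposed_users(total_users: int, percent: int) -> int:
    """Users who would see a defect if the release failed at this stage."""
    return total_users * percent // 100


def blast_radius_plan(total_users: int) -> list[tuple[str, int]]:
    """Pair each stage with the number of users exposed at that stage."""
    return [(name, exposed_users(total_users, pct)) for name, pct in STAGES]


if __name__ == "__main__":
    for name, users in blast_radius_plan(200_000):
        print(f"{name:<12} {users:>7} users exposed")
```

Read top to bottom, the output shows what each validation step buys: a defect caught at the Internal stage touches a twentieth of the audience that a big-bang release would.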


Pause and predict: If you have two complete, isolated copies of your production environment running side-by-side, how quickly do you think you can undo a bad release?

Blue/Green is the simplest progressive strategy. You maintain two identical production environments:

  • Blue: The current live environment
  • Green: The new version, fully deployed but receiving no traffic

When you are confident Green is healthy, you switch all traffic from Blue to Green. If something goes wrong, you switch back.

graph TD
LB[Load Balancer] -->|Traffic| Blue
LB -.->|No Traffic| Green
subgraph Blue [Blue v1 - LIVE ✓]
B_P1[Pod]
B_P2[Pod]
B_P3[Pod]
end
subgraph Green [Green v2 - STANDBY]
G_P1[Pod]
G_P2[Pod]
G_P3[Pod]
end

After cutover:

graph TD
LB[Load Balancer] -.->|No Traffic| Blue
LB -->|Traffic| Green
subgraph Blue [Blue v1 - STANDBY]
B_P1[Pod]
B_P2[Pod]
B_P3[Pod]
end
subgraph Green [Green v2 - LIVE ✓]
G_P1[Pod]
G_P2[Pod]
G_P3[Pod]
end

In Kubernetes, Blue/Green is achieved with label selectors on Services:

# Service pointing to Blue (v1)
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue # ← Change this to "green" for cutover
  ports:
    - port: 80
      targetPort: 8080

Advantages:

  • Instant rollback (just switch the selector back)
  • Full environment validation before cutover
  • Zero downtime if done correctly

Disadvantages:

  • Requires double the infrastructure (temporarily)
  • All-or-nothing traffic switch (no gradual rollout)
  • Database migrations are tricky (both versions must work with the schema)

When to use:

  • Critical services where instant rollback is essential
  • When you need full integration testing before going live
  • Services with simple state management

Named after the canaries coal miners used to detect toxic gases, a canary deployment sends a small percentage of traffic to the new version first.

graph TD
LB[Load Balancer] -->|95% Traffic| Stable
LB -->|5% Traffic| Canary
subgraph Stable [Stable v1]
S_P1[Pod]
S_P2[Pod]
S_P3[Pod]
S_P4[Pod]
S_P5[Pod]
S_P6[Pod]
end
subgraph Canary [Canary v2]
C_P1[Pod]
end

If the canary is healthy (low error rates, acceptable latency), traffic gradually increases:

5% → 10% → 25% → 50% → 100%

If the canary shows problems, it gets killed, and only the users routed to it (5% at the first step) were ever affected.

Advantages:

  • Minimal blast radius
  • Real production traffic validation
  • Automated promotion/rollback possible
  • Gradual confidence building

Disadvantages:

  • More complex to set up
  • Requires traffic splitting capability
  • Metrics analysis needed to determine health
  • Session affinity can complicate things

When to use:

  • High-traffic services where even brief outages are costly
  • When you want metrics-driven deployment decisions
  • Services where user behavior validation matters
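The promotion logic described above can be sketched as a small loop. This is an illustrative model, not the behavior of any particular tool: `run_canary`, the `error_rate_at` callback (a stand-in for a real metrics query), and the 1% error threshold are all assumptions. Latency checks are left out to keep the sketch short.

```python
from typing import Callable

STEPS = [5, 10, 25, 50, 100]  # traffic percentages from the text


def run_canary(
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,  # assumed threshold: 1% of requests
) -> tuple[str, int]:
    """Walk the canary through each step and abort on the first unhealthy one.

    error_rate_at(percent) returns the canary's observed error rate while it
    receives `percent` of traffic. The result is ("promoted", 100) or
    ("rolled_back", <percent at which the check failed>).
    """
    for percent in STEPS:
        if error_rate_at(percent) > max_error_rate:
            return ("rolled_back", percent)
    return ("promoted", 100)


if __name__ == "__main__":
    # Healthy release: error rate stays low at every step.
    print(run_canary(lambda percent: 0.001))
    # Defect that only shows under load: fails once traffic reaches 25%.
    print(run_canary(lambda percent: 0.05 if percent >= 25 else 0.0))
```

The second case is the reason for the gradual ramp: some defects only appear under load, so passing the 5% step does not justify jumping straight to 100%.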

Shadow deployments mirror production traffic to the new version without serving responses to users. The new version processes real requests, but its responses are discarded.

graph TD
Req[Request] --> LB[Load Balancer]
LB -->|Serves| Prod
LB -.->|Mirrors| Shadow
Prod --> Resp[Response from v1 only]
Shadow -.->|Processes but discards output| Discard((Discard))
subgraph Prod [Production v1]
P_P1[Serves users]
end
subgraph Shadow [Shadow v2]
S_P1[Processes requests]
end

Advantages:

  • Zero user impact — users never see v2 responses
  • Tests with real production load patterns
  • Validates performance under actual traffic
  • Great for data pipeline or ML model changes

Disadvantages:

  • Cannot test user-facing changes (UI differences)
  • Write operations are dangerous (double-writes)
  • Requires infrastructure to mirror and discard
  • Does not validate client-side behavior

When to use:

  • Backend services with no side effects on reads
  • Performance validation before launch
  • ML model comparisons (A/B testing the model, not the UX)
  • Database query optimization validation
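The mirror-and-discard idea can be shown in a few lines. This is a simplified sketch under stated assumptions: real mirroring usually happens asynchronously in a proxy or service mesh, whereas this version calls the shadow inline, and `serve_with_shadow` and the `mismatches` list are hypothetical names.

```python
from typing import Callable

Handler = Callable[[str], str]


def serve_with_shadow(
    request: str,
    primary: Handler,
    shadow: Handler,
    mismatches: list[str],
) -> str:
    """Answer from the primary; run the shadow on the same request and discard it."""
    response = primary(request)
    try:
        if shadow(request) != response:
            mismatches.append(request)  # record the difference for later review
    except Exception:
        # A failing shadow must never affect the user-facing response.
        mismatches.append(request)
    return response


if __name__ == "__main__":
    seen: list[str] = []
    print(serve_with_shadow("hello", str.upper, str.upper, seen))  # same output
    print(serve_with_shadow("world", str.upper, str.title, seen))  # differs
    print("requests needing review:", seen)
```

The caller only ever receives the primary's response. The shadow's output exists solely to populate the comparison record, which is why write operations need special care: the shadow still executes them even though its response is thrown away.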

Rolling updates are the default strategy for Kubernetes Deployments. Pods are replaced one at a time (or in batches):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # Allow 1 extra pod during update
      maxUnavailable: 0 # Always keep all pods available

Update sequence:

Step 1: [v1] [v1] [v1] [v1] [v1] [v1] [v2] ← 1 new pod added
Step 2: [v1] [v1] [v1] [v1] [v1] [v2] [v2] ← 1 old pod removed, 1 new added
Step 3: [v1] [v1] [v1] [v1] [v2] [v2] [v2]
Step 4: [v1] [v1] [v1] [v2] [v2] [v2] [v2]
Step 5: [v1] [v1] [v2] [v2] [v2] [v2] [v2]
Step 6: [v1] [v2] [v2] [v2] [v2] [v2] [v2]
Step 7: [v2] [v2] [v2] [v2] [v2] [v2] ← Complete

Advantages:

  • Built into Kubernetes (zero extra tooling)
  • Simple to configure
  • Gradual replacement

Disadvantages:

  • No fine-grained traffic control (traffic split depends on pod count)
  • Mixed versions serve traffic simultaneously
  • Rollback is another rolling update (not instant)
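The update sequence shown earlier can be reproduced with a short simulation. This is a simplified model, not the actual Deployment controller: it assumes `maxUnavailable: 0` and that every new pod becomes ready before the next step. The function name is hypothetical.

```python
def rolling_update(replicas: int, max_surge: int = 1) -> list[tuple[int, int]]:
    """Return (old_pods, new_pods) after each step of a rolling update.

    Assumes maxUnavailable=0 and that new pods become ready immediately.
    """
    if replicas < 1 or max_surge < 1:
        raise ValueError("replicas and max_surge must be at least 1")
    old, new = replicas, min(max_surge, replicas)
    steps = [(old, new)]  # step 1: surge pods are added
    while old > 0:
        # Remove only the old pods that are surplus to `replicas`.
        old -= min(old, old + new - replicas)
        # Refill the surge budget without exceeding `replicas` new pods.
        new += min(replicas - new, replicas + max_surge - (old + new))
        steps.append((old, new))
    return steps


if __name__ == "__main__":
    for number, (old, new) in enumerate(rolling_update(6), start=1):
        share = 100 * new // (old + new)
        print(f"Step {number}: {old} x v1, {new} x v2 (~{share}% of pods on v2)")
```

The pod share printed at each step is the only traffic split a plain rolling update offers, which is the limitation listed under the disadvantages: exposure moves in increments of one pod, not in percentages you choose.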

Pause and predict: If a deployment is the physical act of putting code onto a server, what is a release? Are they the same thing?

Here is a critical insight: deployment and release are not the same thing.

  • Deployment: Putting code on servers
  • Release: Making a feature available to users

Feature flags let you deploy code without releasing it:

# Feature is deployed but not released
if feature_flags.is_enabled("new-checkout-flow", user=current_user):
    return new_checkout_flow(cart)
else:
    return old_checkout_flow(cart)

This separation is powerful because it means:

  • You can deploy any time (even Friday afternoon)
  • You can release to specific users first
  • You can kill a feature without redeploying
  • You can A/B test without infrastructure changes

| Type | Lifespan | Purpose | Example |
|---|---|---|---|
| Release toggle | Days/weeks | Gate incomplete features | New checkout flow |
| Experiment toggle | Weeks/months | A/B testing | Button color test |
| Ops toggle | Permanent | Circuit breakers | Disable recommendations |
| Permission toggle | Permanent | Entitlements | Premium features |
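Because each toggle type has a different expected lifespan, it can help to check flag age automatically. The sketch below is one possible approach. The `Flag` class, the `stale_flags` function, and the exact age limits (4 weeks for release toggles, 12 weeks for experiments) are assumptions, since the table only gives rough lifespans.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed maximum ages per toggle type; None means the toggle is permanent.
MAX_AGE = {
    "release": timedelta(weeks=4),
    "experiment": timedelta(weeks=12),
    "ops": None,
    "permission": None,
}


@dataclass
class Flag:
    name: str
    kind: str  # one of the keys in MAX_AGE
    created: date


def stale_flags(flags: list[Flag], today: date) -> list[str]:
    """Return the names of short-lived flags that have outlived their limit."""
    stale = []
    for flag in flags:
        limit = MAX_AGE[flag.kind]
        if limit is not None and today - flag.created > limit:
            stale.append(flag.name)
    return stale


if __name__ == "__main__":
    flags = [
        Flag("new-checkout-flow", "release", date(2024, 1, 1)),
        Flag("button-color-test", "experiment", date(2024, 3, 1)),
        Flag("disable-recommendations", "ops", date(2020, 1, 1)),
    ]
    print(stale_flags(flags, today=date(2024, 3, 15)))
```

Running a check like this in CI is one way to keep release toggles from quietly becoming permanent code paths.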

Every critical feature should have a kill switch — an ops toggle that immediately disables it:

# Feature flag configuration
new_payment_processor:
  enabled: true
  kill_switch: true # Can be disabled instantly
  rollout_percentage: 25 # Only 25% of users see it
  excluded_regions:
    - ap-southeast-1 # Not yet tested in APAC

If the new payment processor starts failing, one API call disables it:

curl -X PUT https://flags.internal/api/flags/new_payment_processor \
  -H 'Content-Type: application/json' \
  -d '{"enabled": false}'

No redeployment. No rollback. Instant recovery.
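How a flag service might evaluate the configuration above is sketched below. This is an assumption about one reasonable evaluation order, not the behavior of any specific flag product: the `bucket` and `is_enabled` functions are hypothetical, and hashing the user ID with SHA-256 is just one common way to get a stable percentage rollout.

```python
import hashlib

# Mirrors the YAML configuration above.
FLAG = {
    "enabled": True,
    "rollout_percentage": 25,
    "excluded_regions": ["ap-southeast-1"],
}


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in the range 0-99 for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, flag: dict, user_id: str, region: str) -> bool:
    if not flag["enabled"]:  # kill switch wins over everything else
        return False
    if region in flag["excluded_regions"]:
        return False
    return bucket(flag_name, user_id) < flag["rollout_percentage"]


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    on = sum(is_enabled("new_payment_processor", FLAG, u, "us-east-1") for u in users)
    print(f"{on} of {len(users)} users see the new processor")
    killed = dict(FLAG, enabled=False)
    print(any(is_enabled("new_payment_processor", killed, u, "us-east-1") for u in users))
```

Checking `enabled` first is the design choice that matters: whatever the rollout percentage or region rules say, flipping that one value turns the feature off for everyone. Hashing instead of random sampling keeps each user on the same side of the flag across requests.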


Stop and think: If you have v1 and v2 of your application serving traffic simultaneously during a rolling update, what happens if v2 renames a critical database column?

Database Migrations During Zero-Downtime Releases


The Hardest Problem in Release Engineering


Code is stateless and replaceable. Databases are not. This makes database schema changes the most dangerous part of any release.

The fundamental problem: during a rolling deployment, both old and new versions of your application run simultaneously. Both must work with the same database schema.

Never make breaking schema changes in a single release. Use expand-contract (also called parallel change):

Phase 1 — Expand (Release N): Add new columns/tables, keep old ones working.

-- Add new column, allow NULL (backward compatible)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT NULL;

Old code ignores the new column. New code starts writing to it.

Phase 2 — Migrate (Release N+1): Backfill data and start reading from new column.

-- Backfill existing rows
UPDATE users SET email_verified = false WHERE email_verified IS NULL;

Phase 3 — Contract (Release N+2): Remove old columns/code once fully migrated.

-- Now safe to add NOT NULL constraint
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
-- Later: remove old column if it was replaced
-- ALTER TABLE users DROP COLUMN old_email_status;

| Anti-Pattern | Why It Breaks | Safe Alternative |
|---|---|---|
| DROP COLUMN in same release | Old pods still reading it | Expand-contract over 2+ releases |
| RENAME COLUMN | Old code uses old name | Add new column, backfill, drop old |
| ALTER TYPE (e.g., int→bigint) | Can lock table for hours | Create new column, dual-write, swap |
| Non-reversible migration | Cannot roll back release | Always write reversible migrations |
| Big table migration in one tx | Locks table, kills production | Batch in small chunks |
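The expand and backfill phases, including the "batch in small chunks" advice, can be tried locally. The sketch below uses an in-memory SQLite database purely so that it is self-contained. That is an assumption for illustration: the SQL above targets a server database, and SQLite cannot run the `ALTER COLUMN ... SET NOT NULL` contract step, so that phase is omitted. The function names and the batch size are hypothetical.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Phase 1: add a nullable column; queries from old code are unaffected."""
    conn.execute("ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT NULL")


def backfill(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Phase 2: fill NULLs in small batches, committing after each batch."""
    batches = 0
    while True:
        cursor = conn.execute(
            "UPDATE users SET email_verified = 0 WHERE id IN ("
            "  SELECT id FROM users WHERE email_verified IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transactions keep locks brief
        if cursor.rowcount == 0:
            return batches
        batches += 1


def demo() -> tuple[int, int, int]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO users (email) VALUES (?)",
        [(f"user{i}@example.com",) for i in range(5)],
    )
    expand(conn)
    # The old code path still works: it never mentions the new column.
    old_code_rows = len(conn.execute("SELECT id, email FROM users").fetchall())
    batches = backfill(conn)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM users WHERE email_verified IS NULL"
    ).fetchone()[0]
    conn.close()
    return old_code_rows, batches, remaining


if __name__ == "__main__":
    print(demo())  # (rows visible to old code, batches run, NULLs remaining)
```

Each batch commits on its own, so no single transaction holds locks on the whole table. On a production database the batch size would be far larger than 2, and a pause between batches is often added to limit load.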

Stop and think: Is it better to have a system that rarely breaks but takes hours to fix, or a system that breaks occasionally but can be fixed in seconds?

MTTR vs MTBF: Two Philosophies of Reliability


MTBF (Mean Time Between Failures) asks: “How do we prevent failures?”

Organizations optimizing for MTBF:

  • Have heavy change approval processes
  • Deploy infrequently (monthly or quarterly)
  • Test exhaustively before release
  • Avoid changes to “stable” systems

The problem: failures still happen. And when they do, recovery is slow because the team has no practice at it.

MTTR (Mean Time to Recovery) asks: “How fast can we recover from failure?”

Organizations optimizing for MTTR:

  • Deploy frequently (many times per day)
  • Invest in rollback mechanisms
  • Practice incident response
  • Accept that failures will happen

MTBF-Focused:
──────────────────────────────X──────(long recovery)──────────────────
Long time between failures     Slow, unpracticed recovery

MTTR-Focused:
────X──(quick fix)──────X──(quick fix)──────X──(quick fix)──────
More frequent failures         But fast, practiced recoveries

The key insight from the DORA research (which we will explore in Module 1.5): high-performing teams optimize for MTTR, not MTBF. They deploy more often, fail more gracefully, and recover faster.
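A little arithmetic shows why recovery time can dominate. The sketch below uses the standard steady-state availability formula, MTBF / (MTBF + MTTR). The failure counts and recovery times are invented for illustration and are not taken from the DORA research.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(failures_per_year: int, mttr_minutes: int) -> int:
    """Total downtime is the number of failures times the time to recover."""
    return failures_per_year * mttr_minutes


if __name__ == "__main__":
    # Hypothetical MTBF-focused team: 4 failures a year, 8 hours to recover.
    print("MTBF-focused:", yearly_downtime_minutes(4, 480), "minutes down per year")
    # Hypothetical MTTR-focused team: 50 failures a year, 6 minutes to recover.
    print("MTTR-focused:", yearly_downtime_minutes(50, 6), "minutes down per year")
```

Under these made-up numbers, the team that fails more than twelve times as often still accumulates far less downtime, because each failure is short.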

| Strategy | MTTR Contribution |
|---|---|
| Blue/Green | Instant rollback by switching traffic |
| Canary | Auto-rollback before most users are affected |
| Feature flags | Disable features without redeployment |
| Shadow | Catch problems before any user is affected |

| Factor | Blue/Green | Canary | Shadow | Rolling |
|---|---|---|---|---|
| Complexity | Low | Medium | High | Very Low |
| Blast radius | 0→100% | Gradual | 0% | Gradual |
| Rollback speed | Instant | Fast | N/A | Slow |
| Infra cost | 2x (temporary) | +10-20% | 2x | +0-10% |
| Traffic control | Binary | Fine-grained | Mirrored | None |
| DB-safe | Needs expand-contract | Needs expand-contract | Read-only safe | Needs expand-contract |
| Best for | Critical services | High-traffic APIs | Backend/ML | Simple services |

Real-world release engineering combines multiple strategies:

1. Shadow deploy → Validate performance with real traffic
2. Canary to 1% → Watch metrics for 30 minutes
3. Canary to 10% → Watch metrics for 1 hour
4. Blue/Green the rest → Full cutover with instant rollback
5. Feature flag → Gradually enable new UI for users

This is progressive delivery in practice: layering strategies to minimize risk at every stage.


  1. Facebook deploys code to production thousands of times per day, but features reach users over days or weeks through a sophisticated feature flag system called Gatekeeper. The average feature is tested on Facebook employees first, then 1% of users, then gradually ramped up. A deployment is never a release.

  2. The term “canary deployment” comes from the practice of taking canaries into coal mines. The birds were more sensitive to carbon monoxide than humans, so if the canary stopped singing, miners knew to evacuate. In software, your canary pods are the first to “stop singing” (show errors), warning you before the toxic change reaches everyone.

  3. LinkedIn pioneered the concept of “Dark Launches” in 2009 when they needed to test a completely rewritten backend. They mirrored 100% of production read traffic to the new system for weeks, comparing responses without users ever knowing. By the time they switched over, they had weeks of production validation, and the cutover was uneventful.

  4. Blue/Green deployments were first described by Daniel North and Jez Humble in 2005, and the technique predates Kubernetes by nearly a decade. The original implementation used DNS switching between two physical server clusters. In Kubernetes, the same concept is achieved with a single label selector change. What used to take hours now takes seconds.


War Story: The Migration That Ate Production


A team was migrating from a monolith to microservices. They had been developing the new services for months and were ready to switch.

Their plan: big-bang cutover on a Saturday night. Move all traffic from the monolith to the new services at once.

What happened:

  1. 7:00 PM — Cut traffic to new services. Response times look good.
  2. 7:15 PM — Connection pool exhaustion. The new services open 10x more DB connections than the monolith.
  3. 7:30 PM — Roll back. But the rollback script has a bug. Monolith cannot start because a database migration already ran.
  4. 8:00 PM — Emergency manual fix of the migration. Monolith is back but running on the new schema with patched queries.
  5. 11:00 PM — All data integrity issues identified and corrected.
  6. 3:00 AM — Post-incident meeting. Everyone is exhausted and demoralized.

What they should have done:

  1. Shadow deploy the new services for two weeks to find the connection pool issue
  2. Canary at 1% to validate real user traffic patterns
  3. Expand-contract the database to support both services simultaneously
  4. Feature flags to gradually shift functionality, not traffic

They eventually did exactly this over the next three months, and the second attempt was anticlimactic — which is the goal.

Lesson: The measure of a great release is how boring it is.


| Mistake | Problem | Solution |
|---|---|---|
| No rollback plan | “We’ll figure it out if something goes wrong” | Test rollback before every release; automate it |
| Canary without metrics | Deploying to 5% but not measuring anything | Define success metrics before deployment starts |
| Blue/Green with DB migrations | Both environments need the same schema | Use expand-contract pattern over multiple releases |
| Feature flags as permanent code | Codebase fills with dead branches | Set expiry dates; track flag lifecycle; clean up quarterly |
| Testing only in staging | Staging never matches production | Use shadow deployments for production-traffic validation |
| Deploying on Fridays | Nobody monitors over the weekend | Deploy early in the week, or invest in proper on-call and automation |
| Skipping smoke tests after cutover | Assuming traffic switch means success | Automated post-deployment health checks are non-negotiable |
| Ignoring session state | Users mid-transaction get errors during switch | Implement graceful connection draining and sticky sessions |

Your team merged the new checkout flow on Friday morning. The code is running on all production servers, but the marketing campaign for the new checkout doesn’t launch until Monday. When users visit the site over the weekend, they still see the old checkout. Which concept describes what has happened to the new code, and why is this separation important?

Answer: The new checkout code has been **deployed** but not yet **released**.

A deployment is the physical act of putting code onto servers and making it ready to run, while a release is the business decision to make that feature available to users. By decoupling these two events (often using feature flags), your team can safely push code to production during normal working hours without prematurely exposing unfinished or unannounced features. This separation is the foundation of progressive delivery, allowing you to control the blast radius and turn releasing into a simple configuration change rather than a stressful infrastructure event.

You are tasked with rolling out a major rewrite of the database query optimization engine. The new engine changes how data is fetched but should return exactly the same results to the end user. Your engineering manager suggests routing 5% of user traffic to the new engine to test it. Why might a different deployment strategy be safer and more effective for this specific scenario?

Answer: A **Shadow deployment** would be significantly safer and more effective than a Canary deployment for this rewrite.

Because the new engine is a backend-only change with no intended user-facing differences, a Canary deployment unnecessarily exposes that 5% of users to potential errors or latency spikes if the new queries perform poorly. Instead, a Shadow deployment allows you to mirror 100% of real production traffic to the new engine while discarding its responses, ensuring zero user impact. This lets you validate the new engine’s performance, resource utilization, and correctness under actual production load without ever putting user experience at risk.

Your application is running across 50 pods. You initiate a rolling update to deploy a new version that renames the account_status database column to status. Midway through the deployment, you notice a massive spike in errors and the application goes down. What caused this failure, and how could you have prevented it?

Answer: The failure occurred because during a rolling update, both the old and new versions of your application run simultaneously and must share the exact same database schema.

When the new version renamed the column, the old pods (which were still running and serving traffic) immediately crashed because they were trying to query a column name that no longer existed. To prevent this, you must use the expand-contract pattern across multiple releases. First, you expand by adding the new column while keeping the old one intact. Then, you migrate the data and update the code to write to both. Finally, in a subsequent release once all code depends on the new column, you contract by removing the old column.

Company A deploys once a quarter, requiring three weeks of code freezes and approvals to prevent bugs. Company B deploys twenty times a day with automated rollbacks. When a severe bug inevitably hits production, Company B resolves it in four minutes, while Company A takes two days to diagnose and issue a hotfix. Which reliability philosophy is Company B following, and why does it lead to better outcomes?

Answer: Company B is optimizing for **Mean Time To Recovery (MTTR)**, while Company A is stuck trying to maximize Mean Time Between Failures (MTBF).

Optimizing for MTBF creates a false sense of security; failures will always happen eventually, but heavy processes make the team unpracticed and slow at fixing them. By focusing on MTTR, Company B accepts that failures are normal and instead invests in fast, automated recovery mechanisms and smaller, more frequent changes. Because the batch size of each deployment is tiny, diagnosing the root cause is trivial, and practiced rollback mechanisms ensure that the blast radius of any defect is minimal and quickly contained.

Your team just launched a new integration with a third-party payment gateway. During peak holiday traffic, the third-party gateway starts timing out, causing your entire checkout flow to hang. A developer frantically starts creating a revert pull request, estimating it will take 20 minutes to build and deploy. How could the architecture have been designed differently to resolve this outage instantly?

Answer: The architecture should have implemented a **kill switch** (an operational feature flag) for the new payment gateway integration.

A kill switch allows operations or development teams to instantly disable a problematic feature or integration with a single API call or configuration flip, without requiring a redeployment or code rollback. In this scenario, flipping the kill switch would have immediately bypassed the failing third-party gateway and reverted the system to the legacy processor. You should use this pattern for any feature that touches critical user flows, relies on external dependencies, or carries a high cost of failure if it degrades in production.

Your infrastructure team proudly announces they have implemented Blue/Green deployments in Kubernetes. However, during the first failed release, the rollback takes 30 minutes. When you investigate, you find they are updating the Helm chart to point back to the previous image tag and running the CI/CD pipeline from scratch to redeploy. What fundamental principle of Blue/Green deployments have they misunderstood, and how should it work?

Answer: The team has misunderstood that a true Blue/Green rollback is a **traffic routing switch**, not a redeployment event.

By tearing down the old version and relying on the CI/CD pipeline to rebuild and redeploy the previous image, they have entirely lost the speed and safety benefits of the strategy. In a proper Blue/Green architecture, the “Blue” (previous) environment must remain running and healthy on standby even after the “Green” (new) environment is actively taking traffic. A rollback should simply consist of updating the Kubernetes Service selector or Load Balancer rule to route traffic back to the running Blue pods, reducing recovery time from minutes to seconds.


Hands-On Exercise: Manual Blue/Green Deployment with Kubernetes


Deploy a web application using the Blue/Green strategy in Kubernetes, perform a zero-downtime cutover, and practice instant rollback.

Create a local Kubernetes cluster:

kind create cluster --name release-lab

blue-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-blue
  labels:
    app: webapp
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
      version: blue
  template:
    metadata:
      labels:
        app: webapp
        version: blue
    spec:
      containers:
        - name: webapp
          image: hashicorp/http-echo:0.2.3
          args:
            - "-text=Hello from BLUE (v1)"
            - "-listen=:8080"
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 2
            periodSeconds: 3

webapp-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  type: NodePort
  selector:
    app: webapp
    version: blue # ← Currently pointing to Blue
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080

Apply both:

# The commands in this exercise use k as shorthand for kubectl
alias k=kubectl
k apply -f blue-deployment.yaml
k apply -f webapp-service.yaml

Verify Blue is live:

k get pods -l version=blue
# Should show 3 pods Running
# Test the service using an in-cluster curl
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
curl -s webapp.default.svc:80
# Output: Hello from BLUE (v1)

Step 2: Deploy the Green Version (No Traffic Yet)

green-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-green
  labels:
    app: webapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
      version: green
  template:
    metadata:
      labels:
        app: webapp
        version: green
    spec:
      containers:
        - name: webapp
          image: hashicorp/http-echo:0.2.3
          args:
            - "-text=Hello from GREEN (v2)"
            - "-listen=:8080"
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 2
            periodSeconds: 3

k apply -f green-deployment.yaml
k get pods -l version=green
# Should show 3 pods Running, but no traffic is going to them yet

Verify Green works by directly port-forwarding to one of its pods:

# Find a green pod name
GREEN_POD=$(k get pods -l version=green -o jsonpath='{.items[0].metadata.name}')
k port-forward "$GREEN_POD" 8081:8080 &
sleep 2 # Give port-forward a moment to start listening
curl http://localhost:8081
# Output: Hello from GREEN (v2)
kill $! # Stop the background port-forward

Switch the Service selector from Blue to Green:

k patch service webapp -p '{"spec":{"selector":{"version":"green"}}}'

Verify the switch:

kubectl run curl-test2 --rm -it --restart=Never --image=curlimages/curl -- \
curl -s webapp.default.svc:80
# Output: Hello from GREEN (v2)

That is it. One command. Instant cutover.

Simulate a problem with Green and roll back:

# Rollback to Blue
k patch service webapp -p '{"spec":{"selector":{"version":"blue"}}}'
kubectl run curl-test3 --rm -it --restart=Never --image=curlimages/curl -- \
curl -s webapp.default.svc:80
# Output: Hello from BLUE (v1)
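Cutover and rollback are the same operation with a different label value, which makes them easy to script. The helper below is a hypothetical sketch that builds the same JSON patch used in the commands above and shells out to `kubectl`. The function names and the `dry_run` parameter are assumptions, and it presumes `kubectl` is on the PATH and pointed at the lab cluster.

```python
import json
import subprocess


def selector_patch(version: str) -> str:
    """Build the JSON patch that points the Service at one version label."""
    return json.dumps(
        {"spec": {"selector": {"version": version}}},
        separators=(",", ":"),
    )


def switch_traffic(version: str, service: str = "webapp", dry_run: bool = False) -> list[str]:
    """Run `kubectl patch` for the Service, or only return the command if dry_run."""
    cmd = ["kubectl", "patch", "service", service, "-p", selector_patch(version)]
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Print the commands without touching a cluster.
    print(" ".join(switch_traffic("green", dry_run=True)))  # cutover
    print(" ".join(switch_traffic("blue", dry_run=True)))   # rollback
```

Because rollback goes through exactly the same code path as cutover, every release exercises the rollback mechanism, which addresses the untested-rollback problem from the opening scenario.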

Run a continuous traffic loop to prove zero downtime during switching. Use an in-cluster pod to test through the Service (port-forward does not dynamically follow Service selector changes):

# In one terminal: continuous requests via in-cluster pod
kubectl run traffic-loop --rm -it --restart=Never --image=curlimages/curl -- \
  sh -c 'while true; do
    RESPONSE=$(curl -s webapp.default.svc:80)
    echo "$(date +%H:%M:%S) - $RESPONSE"
    sleep 0.5
  done'

In another terminal, perform the cutover:

k patch service webapp -p '{"spec":{"selector":{"version":"green"}}}'

Watch the first terminal. You should see the response change from Blue to Green with zero failed requests.
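If you save the loop output to a file, you can check the "zero failed requests" claim instead of eyeballing it. The sketch below assumes the `HH:MM:SS - response` line format printed by the loop and treats an empty response as a failed request, since `curl -s` prints nothing when the request fails. The `summarize` function is a hypothetical helper.

```python
def summarize(lines: list[str]) -> dict[str, int]:
    """Count failed requests and version switches in the traffic-loop output."""
    failed = 0
    switches = 0
    previous = None
    for line in lines:
        # Everything after the first " - " is the response body.
        _, _, response = line.partition(" - ")
        response = response.strip()
        if not response:
            failed += 1  # curl printed nothing: the request failed
            continue
        if previous is not None and response != previous:
            switches += 1
        previous = response
    return {"requests": len(lines), "failed": failed, "switches": switches}


if __name__ == "__main__":
    sample = [
        "12:00:00 - Hello from BLUE (v1)",
        "12:00:01 - Hello from BLUE (v1)",
        "12:00:02 - Hello from GREEN (v2)",
        "12:00:03 - Hello from GREEN (v2)",
    ]
    print(summarize(sample))
```

A clean cutover shows exactly one switch and zero failures. More than one switch would mean responses alternated between versions during the change.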

kind delete cluster --name release-lab

You have completed this exercise when you can confirm:

  • Both Blue and Green deployments ran simultaneously with 3 pods each
  • The Service selector controlled which version received traffic
  • Cutover from Blue to Green happened with a single kubectl patch command
  • Rollback from Green to Blue was equally instant
  • The traffic loop showed zero failed requests during the switch
  • You understand that the Blue pods remained running as a rollback target

  1. Deployment is not release — Feature flags and progressive delivery decouple putting code on servers from exposing it to users
  2. Blast radius is everything — Every release strategy is ultimately about controlling how many users are affected if something goes wrong
  3. Blue/Green gives instant rollback — Keep the old version running and switch traffic with a selector change
  4. Canary gives gradual confidence — Start with 1% and increase only when metrics confirm health
  5. Shadow validates without risk — Mirror production traffic to test performance before any user is exposed
  6. Database migrations need expand-contract — Never make breaking schema changes in a single release
  7. Optimize for MTTR, not MTBF — Fast recovery beats rare failure every time

Books:

  • “Continuous Delivery” — Jez Humble and David Farley (the foundational text on release engineering)
  • “Accelerate” — Nicole Forsgren, Jez Humble, Gene Kim (DORA research on deployment performance)
  • “Release It!” — Michael Nygard (patterns for production-ready software)

Articles:

  • “Progressive Delivery” — James Governor, RedMonk (redmonk.com)
  • “BlueGreenDeployment” — Martin Fowler (martinfowler.com)
  • “Feature Toggles” — Pete Hodgson, Martin Fowler (martinfowler.com/articles/feature-toggles.html)

Talks:

  • “Progressive Delivery” — James Governor (YouTube)
  • “Testing in Production” — Charity Majors (YouTube)

Release engineering is the discipline of getting changes to users safely and reversibly. The fundamental strategies — Blue/Green, Canary, Shadow, and Rolling — each control blast radius differently. Progressive delivery layers these strategies together, treating a release as a gradual dial-turn rather than a binary switch. Combined with feature flags and proper database migration patterns, these techniques let you deploy with confidence instead of dread.

The measure of great release engineering is how boring your deployments become.


Continue to Module 1.2: Advanced Canary Deployments with Argo Rollouts to learn how to automate canary deployments with metrics-driven promotion and rollback.