
Introduction
Deploying updates to production systems is risky: a single bad release can cause downtime, break functionality for users, or cost you revenue. That’s why modern teams rely on progressive deployment strategies like Blue/Green and Canary. Both approaches aim to release new versions safely, but they differ in how traffic is shifted and validated. In this guide, we’ll explain how each strategy works, walk through implementation examples with Kubernetes and Argo Rollouts, and help you choose the right approach for your applications.
The Problem with Traditional Deployments
Traditional “all-at-once” deployments replace the running version completely. This approach causes:
- Downtime during the update process
- No rollback path without redeploying
- All-or-nothing risk where every user is affected by bugs
- Pressure on QA to catch everything before production
Blue/Green and Canary deployments solve these problems through controlled traffic shifting and gradual rollouts.
What Is a Blue/Green Deployment?
In a Blue/Green deployment, you maintain two identical environments: Blue runs the current production version, while Green runs the new release. Once testing completes, you switch all traffic from Blue to Green instantly.
How It Works
- Blue environment serves 100% of production traffic
- Deploy new version to Green environment
- Run smoke tests and validation against Green
- Switch the load balancer to route traffic to Green
- Keep Blue ready for instant rollback
- Decommission Blue after confidence period
Blue/Green with Kubernetes
```yaml
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
  labels:
    app: myapp
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: myapp
          image: myapp:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

```yaml
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: myapp
          image: myapp:2.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

```yaml
# service.yaml - switch by changing the selector
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue # Change to 'green' to switch
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
```
Automated Blue/Green Switch Script
```bash
#!/bin/bash
# blue-green-switch.sh
set -e

CURRENT_VERSION=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')
NEW_VERSION=$1

echo "Current version: $CURRENT_VERSION"
echo "Switching to: $NEW_VERSION"

# Verify new deployment is ready
kubectl rollout status deployment/myapp-$NEW_VERSION --timeout=300s

# Run smoke tests against new version
NEW_POD=$(kubectl get pods -l app=myapp,version=$NEW_VERSION -o jsonpath='{.items[0].metadata.name}')
kubectl exec $NEW_POD -- curl -sf http://localhost:8080/health

# Switch traffic
kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$NEW_VERSION\"}}}"
echo "Traffic switched to $NEW_VERSION"

# Verify switch
kubectl get svc myapp -o jsonpath='{.spec.selector.version}'
echo ""
echo "Rollback command: kubectl patch svc myapp -p '{\"spec\":{\"selector\":{\"version\":\"$CURRENT_VERSION\"}}}'"
```
What Is a Canary Deployment?
A Canary deployment releases a new version to a small percentage of users first. If no issues appear, traffic gradually shifts until the new version handles 100%. The name comes from canaries in coal mines—early warning systems for danger.
How It Works
- Deploy new version alongside current version
- Route 5% of traffic to the canary
- Monitor error rates, latency, and business metrics
- If healthy, increase to 25%, then 50%, then 100%
- If problems detected, roll back immediately
Canary with Argo Rollouts
```yaml
# canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:2.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
  strategy:
    canary:
      # Traffic management
      canaryService: myapp-canary
      stableService: myapp-stable
      trafficRouting:
        nginx:
          stableIngress: myapp-ingress
      # Gradual rollout steps
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 75
        - pause: { duration: 10m }
        # Final 100% happens automatically
      # Analysis for automated decisions
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: myapp-canary
```
```yaml
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(
              http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]
            )) /
            sum(rate(
              http_requests_total{service="{{args.service-name}}"}[5m]
            ))
    - name: latency-p99
      interval: 1m
      count: 5
      successCondition: result[0] <= 500
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99, sum(rate(
              http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m]
            )) by (le)) * 1000
```
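Once the Rollout and AnalysisTemplate are applied, you can observe and control the rollout from the command line. The commands below assume the Argo Rollouts kubectl plugin is installed and the Rollout is named myapp as above:

```bash
# Watch the canary progress through its steps and analysis runs
kubectl argo rollouts get rollout myapp --watch

# Abort the rollout and return all traffic to the stable version
kubectl argo rollouts abort myapp
```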
Canary with Istio Service Mesh
```yaml
# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: myapp
            subset: canary
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 95
        - destination:
            host: myapp
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
```
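Without a controller like Argo Rollouts, shifting more traffic to the canary means editing the weights in the VirtualService yourself. As a rough sketch against the manifest above (where the weighted route is the second entry under spec.http), a JSON patch can bump the split:

```bash
# Move the stable/canary split from 95/5 to 75/25 on the VirtualService above
kubectl patch virtualservice myapp --type=json -p='[
  {"op": "replace", "path": "/spec/http/1/route/0/weight", "value": 75},
  {"op": "replace", "path": "/spec/http/1/route/1/weight", "value": 25}
]'
```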
Monitoring Canary Health
```promql
# Prometheus queries for canary analysis

# Error rate comparison
sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{version="canary"}[5m]))
>
sum(rate(http_requests_total{version="stable",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{version="stable"}[5m])) * 1.5

# Latency comparison (p95)
histogram_quantile(0.95, sum(rate(
  http_request_duration_seconds_bucket{version="canary"}[5m]
)) by (le))
>
histogram_quantile(0.95, sum(rate(
  http_request_duration_seconds_bucket{version="stable"}[5m]
)) by (le)) * 1.2
```
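To sanity-check these numbers by hand during a rollout, you can query the Prometheus HTTP API directly. This is a minimal sketch that assumes Prometheus is reachable at http://prometheus:9090 (as in the analysis template above) and that jq is installed:

```bash
# Print the canary's current 5xx error rate as a single number
QUERY='sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{version="canary"}[5m]))'

curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[0].value[1]'
```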
Blue/Green vs Canary: Detailed Comparison
| Feature | Blue/Green | Canary |
|---|---|---|
| Traffic Switch | 100% instant | Gradual (5% → 25% → 100%) |
| Rollback Speed | Instant | Instant (reduce weight to 0) |
| Infrastructure Cost | 2x during deployment | Minimal extra (~10%) |
| User Impact on Failure | All users (briefly) | Only canary users |
| Real User Testing | None before switch | Yes, progressive |
| Complexity | Low | Medium-High |
| Best For | Critical systems | Frequent releases |
Combining Both Strategies
Many organizations combine both approaches for maximum safety:
```yaml
# hybrid-rollout.yaml - Canary within Blue/Green
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    blueGreen:
      activeService: myapp-active
      previewService: myapp-preview
      autoPromotionEnabled: false
      # Run canary analysis before promotion
      prePromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: myapp-preview
      # Post-promotion verification
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: myapp-active
```
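Because autoPromotionEnabled is false, the switch from the preview Service to the active Service still requires an explicit promotion once the pre-promotion analysis passes; with the Argo Rollouts kubectl plugin that is a single command:

```bash
# Promote the preview version to active after analysis succeeds
kubectl argo rollouts promote myapp
```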
Common Mistakes to Avoid
Watch out for these pitfalls:
1. Not Testing Database Migrations
```
# Wrong - Blue/Green with incompatible schema
# v1: SELECT * FROM users WHERE active = 1
# v2: SELECT * FROM users WHERE status = 'active'

# Correct - Use backwards-compatible migrations
# Step 1: Add new column, keep old one
# Step 2: Write to both columns
# Step 3: Migrate reads to new column
# Step 4: Remove old column after full rollout
```
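As a concrete sketch of the backwards-compatible path, the four steps might look like the following; the users table, its columns, and the DB_URL connection string are hypothetical, and psql is assumed to be available:

```bash
# Step 1: expand - add the new column while v1 is still serving traffic
psql "$DB_URL" -c "ALTER TABLE users ADD COLUMN status TEXT;"

# Step 2: backfill, and have the application write to both columns
psql "$DB_URL" -c "UPDATE users SET status = CASE WHEN active = 1 THEN 'active' ELSE 'inactive' END;"

# Step 3: deploy the version that reads from the new column (the rollout itself)

# Step 4: contract - drop the old column only after the confidence period
psql "$DB_URL" -c "ALTER TABLE users DROP COLUMN active;"
```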
2. Ignoring Session Affinity
```yaml
# Wrong - User gets different versions mid-session
apiVersion: v1
kind: Service
spec:
  sessionAffinity: None
```

```yaml
# Correct - Maintain session consistency
apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
```
3. Insufficient Monitoring During Canary
```yaml
# Wrong - Only checking if pods are running
readinessProbe:
  httpGet:
    path: /health
```

```yaml
# Correct - Comprehensive analysis (metric names are illustrative)
analysis:
  metrics:
    - name: error-rate
    - name: latency-p99
    - name: cpu-usage
    - name: memory-usage
    - name: business-conversion-rate
```
4. No Rollback Plan
```yaml
# Always have automated rollback triggers - a failed analysis aborts the rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
---
# The failure threshold lives on the metric in the referenced AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      failureLimit: 1 # Rollback after 1 failed analysis
```
Decision Guide: Which Strategy to Choose
Choose Blue/Green when:
- You need instant rollback capability
- Application has stateful components that are hard to run in parallel
- Compliance requires full environment validation before production
- Infrastructure cost isn't a primary concern
Choose Canary when:
- You release frequently (multiple times per day)
- You want real user validation before full rollout
- You have strong observability (metrics, logs, traces)
- Infrastructure cost optimization is important
Choose Both when:
- You have critical systems requiring maximum safety
- You want canary validation before Blue/Green switch
- You have the tooling and expertise for complex pipelines
Final Thoughts
Blue/Green and Canary deployments are essential strategies for modern software delivery. Blue/Green provides instant traffic switching and easy rollback, ideal for critical systems. Canary enables gradual rollouts with real user validation, perfect for continuous delivery. Many mature organizations combine both: canary testing in a preview environment before Blue/Green promotion. Start with whichever matches your team's capabilities, then evolve as your observability and automation improve.
To implement these strategies in your CI/CD pipeline, read Continuous Deployment with GitLab CI/CD and GitOps Workflows with Argo CD. For advanced rollout strategies, see the Argo Rollouts Documentation and the Istio Traffic Management Guide.