Blue/Green vs Canary Deployments: When and How to Use Each

Introduction

Deploying updates to production systems can be risky. A single bad release can cause downtime, break functionality for users, or cost the business revenue. That’s why modern teams rely on progressive deployment strategies like Blue/Green and Canary. Both approaches aim to release new versions safely, but they differ in how traffic is shifted and validated. In this guide, we’ll explain how each strategy works, walk through real implementation examples with Kubernetes and Argo Rollouts, and help you choose the right approach for your applications.

The Problem with Traditional Deployments

Traditional “all-at-once” deployments replace the running version completely. This approach causes:

  • Downtime during the update process
  • No rollback path without redeploying
  • All-or-nothing risk where every user is affected by bugs
  • Pressure on QA to catch everything before production

Blue/Green and Canary deployments solve these problems through controlled traffic shifting and gradual rollouts.

What Is a Blue/Green Deployment?

In a Blue/Green deployment, you maintain two identical environments: Blue runs the current production version, while Green runs the new release. Once testing against Green completes, you switch all traffic from Blue to Green at once.

How It Works

  1. Blue environment serves 100% of production traffic
  2. Deploy new version to Green environment
  3. Run smoke tests and validation against Green
  4. Switch the load balancer to route traffic to Green
  5. Keep Blue ready for instant rollback
  6. Decommission Blue after confidence period

Blue/Green with Kubernetes

# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
  labels:
    app: myapp
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myapp:1.0.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myapp:2.0.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
# service.yaml - switch by changing selector
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

Automated Blue/Green Switch Script

#!/bin/bash
# blue-green-switch.sh

set -e

CURRENT_VERSION=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')
NEW_VERSION=$1

# Require the target version (blue or green) as the first argument
if [ -z "$NEW_VERSION" ]; then
  echo "Usage: $0 <blue|green>"
  exit 1
fi

echo "Current version: $CURRENT_VERSION"
echo "Switching to: $NEW_VERSION"

# Verify new deployment is ready
kubectl rollout status deployment/myapp-$NEW_VERSION --timeout=300s

# Run smoke tests against new version
NEW_POD=$(kubectl get pods -l app=myapp,version=$NEW_VERSION -o jsonpath='{.items[0].metadata.name}')
kubectl exec $NEW_POD -- curl -sf http://localhost:8080/health

# Switch traffic
kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$NEW_VERSION\"}}}"

echo "Traffic switched to $NEW_VERSION"

# Verify switch
kubectl get svc myapp -o jsonpath='{.spec.selector.version}'

echo ""
echo "Rollback command: kubectl patch svc myapp -p '{\"spec\":{\"selector\":{\"version\":\"$CURRENT_VERSION\"}}}'"

What Is a Canary Deployment?

A Canary deployment releases a new version to a small percentage of users first. If no issues appear, traffic gradually shifts until the new version handles 100%. The name comes from canaries in coal mines—early warning systems for danger.

How It Works

  1. Deploy new version alongside current version
  2. Route 5% of traffic to the canary
  3. Monitor error rates, latency, and business metrics
  4. If healthy, increase to 25%, then 50%, then 100%
  5. If problems detected, roll back immediately

Canary with Argo Rollouts

# canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:2.0.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
  strategy:
    canary:
      # Traffic management
      canaryService: myapp-canary
      stableService: myapp-stable
      trafficRouting:
        nginx:
          stableIngress: myapp-ingress
      
      # Gradual rollout steps
      steps:
      - setWeight: 5
      - pause: { duration: 5m }
      - setWeight: 25
      - pause: { duration: 10m }
      - setWeight: 50
      - pause: { duration: 15m }
      - setWeight: 75
      - pause: { duration: 10m }
      # Final 100% happens automatically
      
      # Analysis for automated decisions
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 1
        args:
        - name: service-name
          value: myapp-canary
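
The Rollout above references two plain Services that you create yourself; a minimal sketch is shown below (the names match the canaryService and stableService fields, and the nginx Ingress myapp-ingress is assumed to already exist). Argo Rollouts manages the Services' selectors so each one targets only the canary or stable pods.

# canary-services.yaml - minimal stable/canary Services for the Rollout above
apiVersion: v1
kind: Service
metadata:
  name: myapp-stable
spec:
  selector:
    app: myapp   # Argo Rollouts adds a pod-template-hash selector at runtime
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-canary
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 8080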
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(
            http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]
          )) / 
          sum(rate(
            http_requests_total{service="{{args.service-name}}"}[5m]
          ))
  
  - name: latency-p99
    interval: 1m
    count: 5
    successCondition: result[0] <= 500
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99, sum(rate(
            http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m]
          )) by (le)) * 1000
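
With both manifests applied, the rollout can be watched and controlled from the command line. The commands below assume the Argo Rollouts kubectl plugin is installed:

kubectl apply -f canary-rollout.yaml -f analysis-template.yaml
# Follow the canary through its weight steps and analysis runs
kubectl argo rollouts get rollout myapp --watch
# Skip the remaining pauses and promote to 100% immediately
kubectl argo rollouts promote myapp
# Abort the canary and shift all traffic back to the stable version
kubectl argo rollouts abort myapp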

Canary with Istio Service Mesh

# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: myapp
        subset: canary
  - route:
    - destination:
        host: myapp
        subset: stable
      weight: 95
    - destination:
        host: myapp
        subset: canary
      weight: 5

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2
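
The header match above lets testers hit the canary directly before any weights change, while normal traffic follows the 95/5 split. A quick check might look like this (the in-cluster hostname is an assumption for this sketch):

# Force a request onto the canary subset via the routing header
curl -H "x-canary: true" http://myapp/
# To progress the rollout, edit the weights in the VirtualService (e.g. 75/25)
# and re-apply it
kubectl apply -f virtual-service.yaml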

Monitoring Canary Health

# Prometheus queries for canary analysis

# Error rate comparison
sum(rate(http_requests_total{version="canary",status=~"5.."}[5m])) /
sum(rate(http_requests_total{version="canary"}[5m]))
>
sum(rate(http_requests_total{version="stable",status=~"5.."}[5m])) /
sum(rate(http_requests_total{version="stable"}[5m])) * 1.5

# Latency comparison (p95)
histogram_quantile(0.95, sum(rate(
  http_request_duration_seconds_bucket{version="canary"}[5m]
)) by (le))
>
histogram_quantile(0.95, sum(rate(
  http_request_duration_seconds_bucket{version="stable"}[5m]
)) by (le)) * 1.2
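
Because a PromQL comparison only returns series when the condition holds, these expressions produce data exactly when the canary is meaningfully worse than stable, which makes them usable as alerts. A hedged sketch using the Prometheus Operator's PrometheusRule resource (names and labels assumed to match the queries above):

# canary-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: canary-health
spec:
  groups:
  - name: canary
    rules:
    - alert: CanaryErrorRateHigh
      expr: |
        sum(rate(http_requests_total{version="canary",status=~"5.."}[5m])) /
        sum(rate(http_requests_total{version="canary"}[5m]))
        >
        sum(rate(http_requests_total{version="stable",status=~"5.."}[5m])) /
        sum(rate(http_requests_total{version="stable"}[5m])) * 1.5
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Canary error rate is more than 1.5x the stable error rate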

Blue/Green vs Canary: Detailed Comparison

Feature                | Blue/Green            | Canary
-----------------------|-----------------------|----------------------------
Traffic Switch         | 100% instant          | Gradual (5% → 25% → 100%)
Rollback Speed         | Instant               | Instant (reduce weight to 0)
Infrastructure Cost    | 2x during deployment  | Minimal extra (~10%)
User Impact on Failure | All users (briefly)   | Only canary users
Real User Testing      | None before switch    | Yes, progressive
Complexity             | Low                   | Medium-High
Best For               | Critical systems      | Frequent releases

Combining Both Strategies

Many organizations combine both approaches for maximum safety:

# hybrid-rollout.yaml - Canary within Blue/Green
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    blueGreen:
      activeService: myapp-active
      previewService: myapp-preview
      autoPromotionEnabled: false
      
      # Run canary analysis before promotion
      prePromotionAnalysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: myapp-preview
      
      # Post-promotion verification
      postPromotionAnalysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: myapp-active
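
Because autoPromotionEnabled is false, the Green (preview) stack only takes production traffic after an explicit promotion. A minimal flow with the Argo Rollouts kubectl plugin:

# Inspect the preview (Green) stack while it serves no production traffic
kubectl argo rollouts get rollout myapp
# Promote once prePromotionAnalysis passes and any manual checks are done
kubectl argo rollouts promote myapp
# If postPromotionAnalysis fails, Argo Rollouts aborts and switches the active
# Service back to the previous version automatically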

Common Mistakes to Avoid

Watch out for these common pitfalls:

1. Not Testing Database Migrations

# Wrong - Blue/Green with incompatible schema
# v1: SELECT * FROM users WHERE active = 1
# v2: SELECT * FROM users WHERE status = 'active'

# Correct - Use backwards-compatible migrations
# Step 1: Add new column, keep old one
# Step 2: Write to both columns
# Step 3: Migrate reads to new column
# Step 4: Remove old column after full rollout
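
A sketch of what those steps look like in SQL, assuming a users table where a numeric active flag is being replaced by a status column (column names taken from the queries above):

-- Step 1: expand - add the new column alongside the old one
ALTER TABLE users ADD COLUMN status VARCHAR(16);
-- Step 2: backfill, then have the application write to both columns
UPDATE users SET status = CASE WHEN active = 1 THEN 'active' ELSE 'inactive' END;
-- Step 3: switch reads to the new column in the next release
-- Step 4: contract - drop the old column only after no deployed version reads it
ALTER TABLE users DROP COLUMN active;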

2. Ignoring Session Affinity

# Wrong - User gets different versions mid-session
apiVersion: v1
kind: Service
spec:
  sessionAffinity: None

# Correct - Maintain session consistency
apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600

3. Insufficient Monitoring During Canary

# Wrong - Only checking if pods are running
readinessProbe:
  httpGet:
    path: /health
    port: 8080

# Correct - Comprehensive analysis (each metric still needs a provider
# query and success condition, as in the AnalysisTemplate shown earlier)
analysis:
  metrics:
  - name: error-rate
  - name: latency-p99
  - name: cpu-usage
  - name: memory-usage
  - name: business-conversion-rate

4. No Rollback Plan

# Always have automated rollback triggers
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      analysis:
        templates:
        - templateName: success-rate
# Set failureLimit on the metrics inside the AnalysisTemplate (e.g. failureLimit: 1)
# so a failed analysis aborts the rollout and traffic shifts back automatically

Decision Guide: Which Strategy to Choose

Choose Blue/Green when:

  • You need instant rollback capability
  • Application has stateful components that are hard to run in parallel
  • Compliance requires full environment validation before production
  • Infrastructure cost isn't a primary concern

Choose Canary when:

  • You release frequently (multiple times per day)
  • You want real user validation before full rollout
  • You have strong observability (metrics, logs, traces)
  • Infrastructure cost optimization is important

Choose Both when:

  • You have critical systems requiring maximum safety
  • You want canary validation before Blue/Green switch
  • You have the tooling and expertise for complex pipelines

Final Thoughts

Blue/Green and Canary deployments are essential strategies for modern software delivery. Blue/Green provides instant traffic switching and easy rollback, ideal for critical systems. Canary enables gradual rollouts with real user validation, perfect for continuous delivery. Many mature organizations combine both: canary testing in a preview environment before Blue/Green promotion. Start with whichever matches your team's capabilities, then evolve as your observability and automation improve.

To implement these strategies in your CI/CD pipeline, read Continuous Deployment with GitLab CI/CD and GitOps Workflows with Argo CD. For advanced rollout strategies, see the Argo Rollouts Documentation and the Istio Traffic Management Guide.
