Every deployment to production should be boring. If your team still has a "deploy day" ritual complete with a war room, a shared Zoom call, and someone hovering over the rollback button, you have a process problem, not a code problem. The anxiety around releases is almost always a symptom of missing automation, insufficient testing gates, and manual approval bottlenecks that have calcified into habit.
We've implemented CI/CD pipelines for over forty engineering teams at this point, spanning early-stage startups pushing code ten times a day to regulated fintech platforms that need audit trails on every artifact. The architecture varies in the details, but the underlying structure is remarkably consistent. Here's the exact playbook we use.
The Pipeline Architecture
Before diving into individual strategies, it helps to see the full picture. Every pipeline we build follows this flow, with each stage acting as an automated gate that must pass before code moves forward:
Developer opens PR
         |
         v
+------------------+     +------------------+     +------------------+
| Lint & Type      | --> | Unit Tests &     | --> | Build & Push     |
| Checks           |     | Integration      |     | Container Image  |
+------------------+     +------------------+     +------------------+
                                                            |
                                                            v
                                                +----------------------+
                                                |  Deploy to Staging   |
                                                +----------------------+
                                                            |
                                                            v
                                                +----------------------+
                                                |  Smoke Tests &       |
                                                |  E2E Validation      |
                                                +----------------------+
                                                            |
                                                            v
                                                +----------------------+
                                                |  Production Deploy   |
                                                | (Canary/Blue-Green)  |
                                                +----------------------+
                                                            |
                                                            v
                                                +----------------------+
                                                |  Monitor Metrics &   |
                                                |  Auto-Rollback Gate  |
                                                +----------------------+
The key principle: every stage is automated, and every gate is binary. Code either passes or it doesn't. No "let's just push it and watch the logs" decision-making. The pipeline is the authority, not the on-call engineer's intuition at 4 PM on a Friday.
A note on merge queues: If your team has more than five engineers merging to main, use GitHub's merge queue or a similar tool. Without it, you'll hit the "it worked on my branch" problem where two individually passing PRs create a broken build when merged together. The merge queue serializes integration and catches this automatically.
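If you adopt a merge queue, the required checks also need to run against the queue's candidate merge commits, not just the PR branch. Here's a minimal sketch of the trigger wiring in GitHub Actions; the workflow name and job contents are placeholders for your own checks:

```yaml
name: CI

on:
  pull_request:          # checks on the PR branch itself
  merge_group:           # checks on the merge queue's candidate merge commit
    types: [checks_requested]

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
```

With both triggers in place, the same job name satisfies the branch protection rule whether the check ran on the PR or on the queued merge.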
Rolling vs Blue-Green vs Canary
There are three mainstream deployment strategies for zero-downtime releases. Each has real trade-offs, and picking the wrong one for your situation creates unnecessary complexity.
Rolling deployments replace instances one at a time. The old version and new version run simultaneously during the rollout. This is the simplest strategy and the default in Kubernetes. The downside is that your application needs to handle running two versions at once, and rollbacks are slow since you have to roll forward through every instance again.
Blue-green deployments maintain two identical environments. "Blue" runs your current production version, "green" runs the new release. You switch traffic all at once by updating the load balancer target. Rollback is instant since you just flip back to blue. The catch is that you need double the infrastructure during deployment, and database migrations get tricky since both environments share the same data layer.
Canary deployments route a small percentage of traffic to the new version and gradually increase it. This gives you real production validation with minimal blast radius. If metrics degrade, you pull back the canary. The trade-off is added complexity in your routing layer and the requirement for good observability to actually detect problems at low traffic percentages.
Our recommendation: start with canary for most teams. Once the routing and observability pieces are in place, it gives you the best ratio of safety to operational overhead. Blue-green is worth it when you need guaranteed instant rollback and can afford the infrastructure cost. Rolling is fine for non-critical internal services.
Here's how we configure canary deployments using Argo Rollouts in Kubernetes. This replaces the standard Deployment resource and gives you declarative, GitOps-friendly canary management:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  replicas: 6
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: api-server
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - setWeight: 30
        - pause: {duration: 5m}
        - setWeight: 60
        - pause: {duration: 10m}
      canaryService: api-server-canary
      stableService: api-server-stable
      trafficRouting:
        nginx:
          stableIngress: api-server-ingress
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: api-server-canary
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/api-server:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
The steps block is where the magic happens. Traffic shifts from 10% to 30% to 60%, pausing at each stage to let the analysis template evaluate metrics. If the analysis fails at any step, the rollout automatically aborts and rolls back. No human intervention required.
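The Rollout references a success-rate analysis template by name but doesn't define it. Here's a hedged sketch of what that AnalysisTemplate can look like, assuming a Prometheus instance reachable at the address shown and an http_requests_total metric labeled per service; both are assumptions you'd adapt to your cluster:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      failureLimit: 1          # a single failed measurement aborts the rollout
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```

The successCondition evaluates the query result on every interval; one failure (per failureLimit) is enough to trigger the automatic abort described above.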
Automated Rollback Triggers
A canary deployment without automated rollback is theater. You need to define, in advance, the exact conditions that trigger a rollback. Relying on someone watching a dashboard is not a strategy.
We monitor three primary signals during every canary promotion:
- Error rate: HTTP 5xx responses as a percentage of total requests. Threshold is typically 1% above baseline, measured over a 2-minute sliding window.
- Latency P99: The 99th percentile response time. If it increases more than 25% compared to the stable version, something is wrong. This catches performance regressions that averages miss.
- Pod restart count: Any unexpected container restart during canary is an immediate rollback signal. This catches OOM kills, crash loops, and liveness probe failures.
Here's a Prometheus alerting rule that captures the error-rate signal. This feeds into the Argo Rollouts AnalysisTemplate to drive automatic rollback decisions:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: canary-rollback-triggers
  namespace: monitoring
spec:
  groups:
    - name: canary.rollback
      interval: 30s
      rules:
        - alert: CanaryHighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..",canary="true"}[2m]))
              /
              sum(rate(http_requests_total{canary="true"}[2m]))
            ) > 0.01
          for: 1m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "Canary error rate exceeds 1%"
            runbook: "Auto-rollback will trigger via Argo analysis"
        - alert: CanaryHighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{canary="true"}[2m])) by (le)
            )
            >
            1.25 * histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{canary="false"}[2m])) by (le)
            )
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Canary P99 latency 25% above stable"
        - alert: CanaryPodRestarts
          expr: |
            increase(kube_pod_container_status_restarts_total{
              pod=~".*canary.*"
            }[5m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Canary pod restarted unexpectedly"
The critical detail here is the `for` duration on each rule. You want it long enough to avoid flapping on transient spikes but short enough to catch real regressions before they affect too many users. One minute at 10% traffic with a 1% error threshold means roughly one in a thousand requests are impacted before the system reacts. That's an acceptable blast radius for most applications.
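The blast-radius figure is worth sanity-checking rather than taking on faith. A back-of-envelope version of the arithmetic, using the example thresholds above (these are illustrative numbers, not measurements):

```typescript
// Fraction of all requests that can fail before the rollback fires:
// only canary traffic is exposed, and only up to the error threshold.
const canaryWeight = 0.10;    // first canary step routes 10% of traffic
const errorThreshold = 0.01;  // rollback triggers once errors exceed 1%

const blastRadius = canaryWeight * errorThreshold;
// ≈ 0.001, i.e. about one affected request per thousand
```

At the later 30% and 60% steps the exposure grows proportionally, which is exactly why the analysis template keeps evaluating at every step rather than only at the first one.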
The GitHub Actions Workflow
Theory is nice, but teams need a working pipeline they can adapt. Here's the GitHub Actions workflow we use as a starting point. It covers the full lifecycle: test, build, push to ECR, deploy to staging, run smoke tests, and promote to production.
name: Deploy Pipeline

on:
  push:
    branches: [main]

permissions:
  id-token: write   # required for OIDC federation to AWS
  contents: read

env:
  AWS_REGION: us-east-1
  ECR_REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  IMAGE_NAME: api-server

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run lint
      - run: npm test -- --coverage
      - uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage/

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions
          aws-region: ${{ env.AWS_REGION }}
      - uses: aws-actions/amazon-ecr-login@v2
      - id: meta
        run: |
          TAG="${{ env.ECR_REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"
          echo "tags=$TAG" >> "$GITHUB_OUTPUT"
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      # Assumes the runner already has cluster access configured
      # (e.g. via aws eks update-kubeconfig in a prior step)
      - run: |
          helm upgrade --install api-server ./charts/api-server \
            --namespace staging \
            --set image.tag=${{ github.sha }} \
            --wait --timeout 5m

  smoke-tests:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          npm ci
          STAGING_URL=https://staging.example.com \
            npm run test:smoke
      - run: |
          curl -sf https://staging.example.com/health \
            || (echo "Health check failed" && exit 1)

  deploy-production:
    needs: smoke-tests
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: |
          helm upgrade --install api-server ./charts/api-server \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --set strategy=canary \
            --wait --timeout 15m
      - run: |
          echo "Deployed ${{ github.sha }} to production"
          echo "Monitor: https://grafana.example.com/d/canary"
A few things worth highlighting in this workflow:
- The `environment` key on the staging and production jobs enables GitHub's environment protection rules. You can require manual approval for production deploys if needed, without changing the workflow file.
- OIDC authentication via `role-to-assume` means no long-lived AWS credentials in your repository secrets. This is a security baseline, not a nice-to-have. Note that it requires the `id-token: write` permission on the workflow or job.
- Docker layer caching via GitHub Actions cache (`type=gha`) typically cuts build times by 40-60%. Without it, every build re-downloads and re-installs every dependency.
- The smoke test job is the final gate before production. Keep these tests focused on critical user paths: can users log in, can they perform the primary action, does the API respond? Save exhaustive testing for the unit and integration stages.
Database Migrations Without Downtime
This is the piece that catches teams off guard. You can have perfect container orchestration and still cause downtime with a careless database migration. The problem is simple: during a rolling or canary deployment, the old version and new version of your application are running simultaneously against the same database. If the new version's migration removes a column that the old version still reads, you have an outage.
The solution is the expand-and-contract pattern, and it requires discipline:
- Deploy 1 (expand): Add the new column or table. Update the code to write to both the old and new locations. The old code still reads from the old location and keeps working.
- Deploy 2 (migrate): Backfill existing data to the new location. Update reads to use the new location. The old column is now unused but still present.
- Deploy 3 (contract): Remove the old column and any dual-write logic. This is safe because no running code references it.
Yes, this means a single schema change takes three deploys. That's the cost of zero downtime, and it's worth it. Each deploy is individually safe to roll back because neither adding an unused column nor removing an unreferenced one can break running code.
Feature flags make this easier. Wrap the read path in a feature flag so you can switch between old and new columns without a deploy. This lets you decouple the migration from the release and test the new data path with a subset of traffic before committing to it. Tools like LaunchDarkly, Unleash, or even a simple config map work for this.
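Here's a minimal sketch of the expand phase with a flag-gated read path. The store, column names, and flag are all illustrative stand-ins for your data layer, not a real API:

```typescript
// Hypothetical user row during a full_name -> display_name column migration.
type UserRow = { id: string; full_name?: string; display_name?: string };

// Stand-in for the real data access layer.
class UserStore {
  private rows = new Map<string, UserRow>();

  upsert(id: string, fields: Partial<UserRow>): void {
    this.rows.set(id, { ...(this.rows.get(id) ?? { id }), ...fields });
  }

  get(id: string): UserRow | undefined {
    return this.rows.get(id);
  }
}

// Deploy 1 (expand): write to BOTH the old and the new column, so code
// reading either location sees consistent data during the rollout.
function saveName(store: UserStore, id: string, name: string): void {
  store.upsert(id, { full_name: name, display_name: name });
}

// The read path is flag-gated: flipping useNewColumn moves reads to the
// new column without a deploy, and flipping it back is the rollback.
function readName(store: UserStore, id: string, useNewColumn: boolean): string | undefined {
  const row = store.get(id);
  return useNewColumn ? row?.display_name : row?.full_name;
}
```

Only after the flag has been fully on for a comfortable observation window does the contract deploy remove `full_name` and the dual-write line.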
One more rule that saves teams from outage post-mortems: never deploy a destructive migration on a Friday. In fact, never deploy a destructive migration in the same week you stopped writing to the old schema. Give yourself a full cycle of observation before you drop anything.
Zero-downtime deployment isn't aspirational. For any team deploying more than once a week, it's table stakes. The upfront investment in getting this pipeline right pays for itself the first time you push a release at 2 PM on a Tuesday and nobody notices because nothing broke, nothing slowed down, and nobody was paged.
The infrastructure described here takes a few weeks to implement properly, less if you have existing Kubernetes and GitHub Actions experience. If you'd rather have a team that's done this forty times before handle it, that's what we do.