A practical guide to rolling, blue‑green, canary, and feature‑flag deployments, with focus on consistency, scalability, and the trade‑offs each pattern introduces.
Zero‑Downtime Deployment Strategies
Originally published on AI Study Room. For the full version with runnable examples, visit the original post.
The problem: updates that interrupt users
When a service moves from a hobby project to a production‑grade platform, the window in which a new version is being rolled out can no longer be treated as “acceptable downtime”. Users expect a continuous experience, and any interruption can translate into lost revenue, broken sessions, or a spike in support tickets. Achieving zero‑downtime therefore becomes a non‑functional requirement that must be baked into the deployment pipeline.
Core constraints that shape any solution
| Constraint | Why it matters |
|---|---|
| Scalability | The strategy must work whether you run 3 pods or 3 000 instances. |
| Consistency model | During the rollout both old and new code may be serving traffic; data schemas must stay compatible. |
| Infrastructure cost | Doubling the environment (as in blue‑green) may be prohibitive for small teams. |
| Operational risk | The ability to roll back instantly can be the difference between a minor glitch and a full outage. |
The following sections walk through the four most common patterns, explain how they satisfy (or violate) the constraints above, and outline the engineering trade‑offs you will face.
1. Rolling deployments
How it works
A rolling deployment updates instances one at a time:
1. The orchestrator (Kubernetes, Nomad, ECS, etc.) creates a new replica with the target image.
2. Health checks run; once the pod reports ready, traffic is shifted to it.
3. An old replica is terminated.
4. Steps 1‑3 repeat until every replica runs the new version.
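Because each new replica replaces an old one, the fleet never grows by more than the configured surge (a single replica in the sequence above), so very little spare capacity is required. This makes the pattern attractive for cost‑sensitive environments.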
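The replace-one-at-a-time loop can be sketched as a small simulation. This is only an illustrative model, not how any orchestrator is implemented: the `rolling_update` function and the `passes_health_check` callback are hypothetical names, and a replica is represented by nothing more than its version string.

```python
from typing import Callable, List


def rolling_update(
    fleet: List[str],
    target: str,
    passes_health_check: Callable[[str], bool],
) -> List[List[str]]:
    """Simulate a rolling update and return every intermediate fleet state."""
    current = list(fleet)
    history = [list(current)]
    for i in range(len(current)):
        if current[i] == target:
            continue  # this slot already runs the target version
        if not passes_health_check(target):
            # The new replica never became ready: stop here and leave the
            # remaining old replicas serving traffic.
            break
        # The new replica is ready, so it takes over slot i and the old
        # replica in that slot is terminated.
        current[i] = target
        history.append(list(current))
    return history


if __name__ == "__main__":
    for state in rolling_update(["v1", "v1", "v1"], "v2", lambda version: True):
        print(state)
```

Every intermediate state except the first and last contains both versions, which is the mixed-version window the rest of this section is concerned with.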
Scalability implications
- Horizontal scaling works out‑of‑the‑box – you can add more replicas before the rollout starts and the orchestrator will update them in parallel, limited only by the `maxSurge` setting.
- The rollout time grows linearly with the number of instances if you keep the surge at 0. Increasing `maxSurge` reduces the window but temporarily spikes CPU/memory usage.
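A back-of-the-envelope model makes the time-versus-capacity trade-off concrete. It assumes, as a simplification, that each round replaces `max_surge + max_unavailable` replicas and that every round takes about the same time; real orchestrators overlap rounds, so treat the numbers as rough estimates. The function names are hypothetical.

```python
import math


def rollout_rounds(replicas: int, max_surge: int, max_unavailable: int = 1) -> int:
    """Estimate how many replace rounds a rollout needs (simplified model)."""
    batch = max_surge + max_unavailable
    if batch <= 0:
        # With nothing allowed to surge or go unavailable, no replica can
        # ever be replaced.
        raise ValueError("max_surge and max_unavailable cannot both be 0")
    return math.ceil(replicas / batch)


def peak_replicas(replicas: int, max_surge: int) -> int:
    """Largest number of replicas running at once during the rollout."""
    return replicas + max_surge


if __name__ == "__main__":
    for surge in (0, 4, 24):
        print(surge, rollout_rounds(100, surge), peak_replicas(100, surge))
```

In this model, 100 replicas with a surge of 0 need 100 rounds, while a surge of 24 needs 4 rounds at the price of 124 replicas running at the peak.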
Consistency considerations
During the rollout both versions coexist. Any API change that is not backward compatible will break requests routed to the older pods. The safe approach is the expand‑contract migration:
- Add new columns/tables while keeping the old ones.
- Deploy code that can read both schemas.
- After all pods run the new code, clean up the old schema.
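On the read side, "code that can read both schemas" can be as small as a function that prefers the new field and falls back to the old ones. The sketch below assumes a hypothetical migration from separate `first_name` / `last_name` columns to a single `full_name` column; none of these names come from a real schema.

```python
from typing import Any, Dict


def display_name(row: Dict[str, Any]) -> str:
    """Return a user's name from a row written under either schema."""
    # New schema (hypothetical): a single `full_name` column.
    if row.get("full_name"):
        return row["full_name"]
    # Old schema (hypothetical): separate `first_name` / `last_name` columns.
    parts = [row.get("first_name"), row.get("last_name")]
    return " ".join(part for part in parts if part)


if __name__ == "__main__":
    print(display_name({"first_name": "Ada", "last_name": "Lovelace"}))
    print(display_name({"full_name": "Grace Hopper"}))
```

Because the reader tolerates rows written by either version, it can be deployed before any pod starts writing the new column.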
Trade‑offs
- Pros – No extra hardware, simple to configure, works for any stateless service.
- Cons – Mixed‑version traffic, requires strict backward compatibility, can be slow for large fleets.
2. Blue‑Green deployments
How it works
Two complete environments exist side‑by‑side:
- Blue – the current production stack.
- Green – a fresh copy where the new version is deployed.

Once the green stack passes smoke tests, the load balancer swaps all traffic from blue to green in a single atomic operation. If something goes wrong, the switch is reversed instantly.
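The cut-over and its reversal can be modeled as a router that holds a single pointer to the live pool. This is a toy sketch, not a real load-balancer API: the `BlueGreenRouter` class, its method names, and the `smoke_test` callback are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


class BlueGreenRouter:
    """Toy router that sends all traffic to exactly one target pool."""

    def __init__(self, pools: Dict[str, List[str]], live: str = "blue") -> None:
        self._pools = dict(pools)
        self._live = live
        self._previous: Optional[str] = None

    def route(self) -> List[str]:
        """Return the backends that currently receive traffic."""
        return self._pools[self._live]

    def cut_over(self, target: str, smoke_test: Callable[[List[str]], bool]) -> bool:
        """Switch to `target` only if its pool passes the smoke test."""
        if target not in self._pools:
            raise KeyError(target)
        if not smoke_test(self._pools[target]):
            return False  # keep serving from the current pool
        self._previous = self._live
        self._live = target  # the switch itself is one reference update
        return True

    def roll_back(self) -> None:
        """Point traffic back at the pool that was live before the cut-over."""
        if self._previous is None:
            raise RuntimeError("no previous environment to roll back to")
        self._live, self._previous = self._previous, self._live


if __name__ == "__main__":
    router = BlueGreenRouter({"blue": ["b1", "b2"], "green": ["g1", "g2"]})
    router.cut_over("green", smoke_test=lambda pool: len(pool) > 0)
    print(router.route())
    router.roll_back()
    print(router.route())
```

The cost of the switch does not depend on pool size, which is why the pattern behaves the same for two backends or two thousand.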
Scalability implications
- You must provision double the resources for the duration of the cut‑over. In cloud environments this translates to a 100 % cost increase for the deployment window.
- The approach scales well because the switch is independent of the number of instances – the load balancer simply points to a different target pool.
Consistency considerations
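Since only one environment serves traffic at any moment, users never see a mix of old and new API behavior. The green environment must be able to handle the full production load before the switch, and any state shared by both stacks (typically the database) must stay compatible with both versions for as long as rolling back to blue remains an option.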
Trade‑offs
- Pros – No mixed‑version traffic, instant rollback, clear separation of concerns (test in production‑like environment).
- Cons – Double infrastructure cost, need for a routing layer that can perform atomic switches (e.g., AWS ALB, NGINX, Envoy), and the challenge of keeping stateful resources (databases, caches) synchronized.
3. Canary deployments
How it works
A small fraction of traffic (often 1‑5 %) is routed to the new version. Metrics are observed; if they stay within thresholds, the traffic share is gradually increased until the canary becomes the full production version.
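The promotion logic amounts to a loop over traffic stages with a metric gate at each one. The sketch below is a simplified model under stated assumptions: `error_rate_at` stands in for a query against your monitoring system, and the stage list and 1 % threshold are illustrative values, not recommendations.

```python
from typing import Callable, Sequence, Tuple


def run_canary(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> Tuple[int, bool]:
    """Walk through traffic stages; return (final canary weight, promoted?)."""
    weight = 0
    for stage in stages:
        weight = stage  # shift this percentage of traffic to the canary
        if error_rate_at(stage) > max_error_rate:
            # Threshold breached: send all traffic back to the stable version.
            return 0, False
    return weight, weight == 100


if __name__ == "__main__":
    stages = [1, 5, 25, 50, 100]
    print(run_canary(stages, lambda weight: 0.001))
    print(run_canary(stages, lambda weight: 0.05 if weight >= 25 else 0.001))
```

A production controller would also wait for a soak period at each stage and usually watch more than one metric (latency, saturation, business KPIs).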
Enabling fine‑grained traffic routing
Service meshes such as Istio or Linkerd expose APIs to split traffic by HTTP header, cookie, or random percentage. This removes the need for custom load‑balancer logic.
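Conceptually, the per-request decision a mesh makes looks like the function below. It is a plain-Python illustration rather than mesh configuration: the `x-canary` header and `canary` cookie are hypothetical names, and in a real mesh the same rules would be declared in its routing resources instead of application code.

```python
import random
from typing import Mapping


def choose_backend(
    headers: Mapping[str, str],
    cookies: Mapping[str, str],
    canary_weight: float,
    rng=random,
) -> str:
    """Pick "canary" or "stable" for a single request."""
    # Explicit opt-in via header (hypothetical name), e.g. for internal testers.
    if headers.get("x-canary") == "true":
        return "canary"
    # Opt-in via cookie (hypothetical name).
    if cookies.get("canary") == "1":
        return "canary"
    # Otherwise send roughly `canary_weight` percent of requests to the canary.
    return "canary" if rng.random() * 100 < canary_weight else "stable"


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [choose_backend({}, {}, 5, rng) for _ in range(10_000)]
    print(sample.count("canary") / len(sample))
```

Because the decision is made per request, the achievable split does not depend on how many pods back each version.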
Scalability implications
- The mesh operates at the request level, so the number of pods does not affect routing precision.
- Monitoring overhead grows with the number of canary stages, but modern observability stacks (Prometheus + Grafana, Datadog, etc.) can handle high cardinality metrics.
Consistency considerations
Because only a subset sees the new version, schema incompatibility is less risky – the old version continues to serve the majority of requests. However, you still need the expand‑contract migration pattern until the canary reaches 100 %.
Trade‑offs
- Pros – Minimal user impact, early detection of regressions, works well for high‑traffic services.
- Cons – Requires sophisticated routing and observability, longer overall rollout time, can be complex to automate.
4. Feature‑flag driven releases
How it works
Code for a new capability is merged into the main branch and deployed behind a flag that defaults to off. The flag can be toggled per user, region, or percentage, effectively turning the deployment into a canary at the feature level.
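A minimal flag evaluator might look like the following. It is a sketch that assumes flags live in a plain dictionary with hypothetical keys (`enabled`, `users`, `regions`, `percentage`); managed flag platforms expose their own SDKs and data models, which this does not try to imitate.

```python
import hashlib
from typing import Any, Dict, Optional


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(
    flags: Dict[str, Dict[str, Any]],
    flag: str,
    user_id: str,
    region: Optional[str] = None,
) -> bool:
    """Evaluate a flag for one user; unknown or disabled flags are off."""
    config = flags.get(flag)
    if config is None or not config.get("enabled", False):
        return False  # default to off
    if user_id in config.get("users", ()):
        return True  # per-user targeting
    if region is not None and region in config.get("regions", ()):
        return True  # per-region targeting
    # Percentage rollout: the same user always lands in the same bucket.
    return bucket(flag, user_id) < config.get("percentage", 0)


if __name__ == "__main__":
    flags = {"new-checkout": {"enabled": True, "regions": ["eu"], "percentage": 10}}
    print(is_enabled(flags, "new-checkout", "user-42", region="eu"))
    print(is_enabled(flags, "new-checkout", "user-42"))
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests while giving different flags independent cohorts.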
Tooling
Managed platforms such as LaunchDarkly and Flagsmith provide SDKs for most languages, a UI for flag management, and analytics on flag usage.
Scalability implications
Feature flags are just key‑value lookups; they add negligible latency when stored in a fast cache (Redis, in‑process memory). The real scaling concern is the operational overhead of managing many flags across services.
Consistency considerations
Flags decouple deployment from release. The code path that reads the flag must be tolerant of both the old and new behavior, which again pushes the need for backward‑compatible logic.
Trade‑offs
- Pros – Instant rollback (flip the flag), granular rollouts, can test new code in production without touching the routing layer.
- Cons – Flag‑related technical debt, potential for “flag explosion”, and the need for rigorous testing of flag combinations.
5. Database migrations – the hidden blocker
Zero‑downtime deployments often fail at the persistence layer. The guiding rule is dual‑read/write compatibility:
- Expand – Add new columns/tables, keep old ones untouched.
- Migrate – Deploy code that writes to both old and new structures.
- Contract – After all instances run the new code, drop the legacy schema.
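The three phases can be walked through end to end with an in-memory SQLite database. This is only a sketch: the `users` table and its columns are hypothetical, a backfill step is added so that rows written before the expand phase are not lost, and the contract step rebuilds the table instead of dropping columns so that it runs on older SQLite versions as well.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Expand: add the new column; code that ignores it keeps working.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def create_user(conn: sqlite3.Connection, first: str, last: str) -> None:
    # Migrate: new code writes both the old and the new structures.
    conn.execute(
        "INSERT INTO users (first_name, last_name, full_name) VALUES (?, ?, ?)",
        (first, last, f"{first} {last}"),
    )


def backfill(conn: sqlite3.Connection) -> None:
    # Fill the new column for rows written before dual writes started.
    conn.execute(
        "UPDATE users SET full_name = first_name || ' ' || last_name "
        "WHERE full_name IS NULL"
    )


def contract(conn: sqlite3.Connection) -> None:
    # Contract: once no running code reads the old columns, rebuild the
    # table without them.
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


def demo() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    conn.execute("INSERT INTO users (first_name, last_name) VALUES ('Ada', 'Lovelace')")
    expand(conn)
    create_user(conn, "Grace", "Hopper")
    backfill(conn)
    contract(conn)
    return conn


if __name__ == "__main__":
    print(demo().execute("SELECT id, full_name FROM users ORDER BY id").fetchall())
```

In a real rollout each phase ships as its own deployment, so the contract step only runs after every instance has stopped touching the legacy columns.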
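Kubernetes readiness probes should be aware of migration state. A pod should report not ready until its migration step finishes, preventing the orchestrator from routing traffic to a partially migrated instance. Keep the liveness probe independent of migration state: a failing liveness probe restarts the container, which would interrupt a long‑running migration rather than wait for it.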
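One way to expose migration state to a readiness probe is an HTTP endpoint that returns 503 until the migration finishes. The sketch below uses only the standard library; the `/readyz` path and the `migration_done` event are assumptions, and a real service would normally add this route to its existing web framework.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def readiness_status(migration_done: threading.Event) -> int:
    """HTTP status for the readiness endpoint: 200 when ready, else 503."""
    return 200 if migration_done.is_set() else 503


def make_handler(migration_done: threading.Event):
    """Build a request handler bound to the given migration flag."""

    class ReadinessHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/readyz":  # hypothetical probe path
                self.send_error(404)
                return
            self.send_response(readiness_status(migration_done))
            self.end_headers()

    return ReadinessHandler


if __name__ == "__main__":
    done = threading.Event()
    server = HTTPServer(("127.0.0.1", 8080), make_handler(done))
    # The migration would run in the background and call done.set() when it
    # finishes; until then the probe sees 503 and the pod gets no traffic.
    threading.Timer(5.0, done.set).start()
    server.serve_forever()
```

The probe definition in the pod spec would then point at this path; the endpoint only reports state and never triggers the migration itself.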
6. Session & connection handling
- Stateless sessions – Store JWTs or signed cookies on the client; no server‑side state to lose.
- Shared session store – Use Redis or a relational database so that a pod can disappear without invalidating a user’s session.
- WebSockets – Clients must implement reconnection logic because a pod termination will break the TCP connection. A load balancer with sticky sessions (e.g., NGINX with `ip_hash`) keeps a client pinned to one backend, and `proxy_next_upstream` lets NGINX retry a failed request against another backend, but neither preserves an open WebSocket when its pod terminates, so the application should be prepared for a full reconnect.
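Client-side reconnection is usually implemented as a retry loop with exponential backoff and jitter, so that thousands of clients do not reconnect at the same instant after a pod is terminated. The sketch below is transport-agnostic: `connect` is a hypothetical callable that opens the connection and raises `ConnectionError` on failure, and the base delay and cap are illustrative values.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(
    attempts: int, base: float = 0.5, cap: float = 30.0, rng=random
) -> Iterator[float]:
    """Yield one randomized delay per attempt (exponential backoff, full jitter)."""
    for attempt in range(attempts):
        yield rng.uniform(0, min(cap, base * 2 ** attempt))


def reconnect(
    connect: Callable[[], T],
    attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng=random,
) -> T:
    """Call `connect` until it succeeds or the attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt, delay in enumerate(backoff_delays(attempts, rng=rng)):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delay)
    raise AssertionError("unreachable")
```

Injecting `sleep` and `rng` keeps the loop easy to unit-test; in production the defaults apply and `connect` would wrap whatever WebSocket client the application uses.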
Choosing the right strategy
| Situation | Recommended pattern |
|---|---|
| Small, stateless service with low risk | Rolling deployment with health‑check gating |
| Mission‑critical service that cannot tolerate mixed versions | Blue‑green with automated smoke tests |
| High‑traffic API where regressions are costly | Canary + service mesh + feature flags |
| Frequent feature toggles, A/B testing | Feature‑flag driven releases |
In practice many teams blend these approaches: routine bug fixes use rolling updates, while major releases start as a canary and finish with a blue‑green cut‑over.
Final thoughts
Zero‑downtime deployment is not a single technology but a collection of patterns that must be aligned with your consistency model, scaling requirements, and risk appetite. The engineering effort spent on making migrations backward compatible, wiring health probes, and automating rollbacks pays off in reduced incident volume and faster delivery cycles.
For a hands‑on walkthrough, see the Kubernetes rollout guide and the Istio traffic‑splitting tutorial. Both include YAML snippets that you can drop into a cluster and adapt to your own service.
