Netflix's Confidence Chimp: How One Team Cut Fleet-Wide Migrations from 86 Days to 26

DevOps Reporter

Netflix's Change Automation team built an event-driven platform that runs code changes across 4,000+ services with composable, Lego-like steps and a custom confidence metric. Senior engineer Casey Bleifer breaks down the architecture, the painful first exercise where 44% of targets needed a human, and the unglamorous fixes that drove that number down.

Every platform engineer knows the long tail. You ship a new library version, watch adoption climb to 75%, then 85%, then stall. A year later the old version is still running somewhere, you're maintaining a dozen active releases, and a deprecation you scheduled for last quarter is still open. Casey Bleifer, a senior engineer on Netflix's Change Automation team, opened her QCon San Francisco talk with exactly this story, inspired by an internal library she found running 73 distinct versions.

The Log4j scramble is the version of this problem nobody forgets. When a critical CVE drops, you cannot afford a migration that takes months. You need every affected service patched in days. Netflix's answer was to build a fleet-wide automation platform and pair it with a deliberately small, repeatable exercise they nicknamed Confidence Chimp. The results so far: a campaign that once took 86 days now finishes in 26, and the share of services needing a human to step in fell from 44% to 21%, even as the platform expanded from 120 Java services to over 2,000 targets spanning Java and Python.

What's new

The team set two goals that sound almost unreasonable: automate any code change across the fleet in a week or less, and ship fixes for critical vulnerabilities in two days or less. Three requirements shaped the design.

Minimal effort for everyone. Platform teams driving a migration should only have to configure it, not hunt down which services are affected or chase owners on Slack. Service owners, on the other side, should be able to do nothing unless their input is genuinely required.

Respect fleet diversity. A real fleet is not uniform. Repos span many languages, some well tested and some not. Security requirements differ by the data a service touches. Some services live in monorepos, some have quiet periods, some belong to business units with their own rules. Automation has to honor those constraints while still reaching everyone.

Safety first. The automation should fix things, not break them. That means validating changes before rollout, phasing rollouts by criticality so a problem in low-risk apps can pause the campaign before it reaches high-blast-radius services, baking in compliance checks, and giving anyone a big red stop button at any point.
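
The phasing idea can be sketched in a few lines. This is an illustration only, not Netflix's implementation: the tier names, the failure-rate threshold, and the `apply_change` and `stop_requested` callbacks are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical criticality tiers, ordered from lowest to highest blast radius.
TIER_ORDER = ["low", "medium", "high"]


def run_phased(
    targets_by_tier: Dict[str, List[str]],
    apply_change: Callable[[str], bool],
    max_failure_rate: float = 0.1,  # assumed threshold, not from the talk
    stop_requested: Callable[[], bool] = lambda: False,
) -> Dict[str, str]:
    """Roll out tier by tier and pause if a tier fails too often.

    Returns a status per target: "ok", "failed", or "not_started".
    """
    status = {t: "not_started" for tier in targets_by_tier.values() for t in tier}
    for tier in TIER_ORDER:
        targets = targets_by_tier.get(tier, [])
        failures = 0
        for target in targets:
            if stop_requested():  # the "big red stop button"
                return status
            ok = apply_change(target)
            status[target] = "ok" if ok else "failed"
            failures += 0 if ok else 1
        if targets and failures / len(targets) > max_failure_rate:
            return status  # pause before reaching higher-criticality tiers
    return status
```

With a change that fails on one of two low-criticality targets, the function returns before any high-criticality target is touched, which is the property the phasing is meant to guarantee.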

Why it matters

The burden of unfinished migrations is not abstract. Old software keeps running, teams maintain versions they want to retire, and the engineers who should be building features instead spend their weeks applying the same dependency bump across services they barely remember owning. Bleifer framed it as a productivity tax paid twice: once by the platform team trying to drive the change, and once by every service owner interrupted to apply it.

The interesting move here is not just the orchestration engine. Plenty of teams have built migration tooling. What Netflix added is an explicit, measured notion of confidence as a gate. The platform does not assume it can auto-merge into your repo. It asks, in effect, whether the change is trustworthy and whether your service is ready to receive it, and only then proceeds without a human.

How it works

A few terms anchor the architecture. A platform team that wants to run a migration creates a campaign. The pieces of software undergoing the migration are targets. Each target moves along a path, which is just a set of automated steps. Bleifer also drew a sharp line between a rollout (the orchestration that progresses a target through its steps) and a deployment (the actual delivery of the change to infrastructure). The platform does not run deployments itself; it integrates with Spinnaker and consumes its events to monitor deployments and to respect quiet periods.
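
To make the vocabulary concrete, here is a minimal data-model sketch. The class and field names, the step names, and the set of step states are assumptions chosen for illustration; the talk defines the terms but not a schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class StepState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Step:
    name: str
    state: StepState = StepState.PENDING


@dataclass
class Target:
    """One piece of software undergoing the migration."""
    service: str
    path: List[Step]  # the ordered steps this target moves along

    def current_step(self) -> Optional[Step]:
        """First step that has not succeeded yet, or None when the rollout is done."""
        for step in self.path:
            if step.state is not StepState.SUCCEEDED:
                return step
        return None


@dataclass
class Campaign:
    """A migration created by a platform team."""
    name: str
    targets: List[Target] = field(default_factory=list)


def new_path() -> List[Step]:
    # Hypothetical three-step path; real paths are assembled per campaign.
    return [Step("code_transform"), Step("draft_pull_request"), Step("merge")]


campaign = Campaign("example-campaign", [Target("service-a", new_path())])
campaign.targets[0].path[0].state = StepState.SUCCEEDED
```

In this model a rollout is simply the act of advancing `current_step()` until it returns `None`; the deployment itself stays outside the model, matching the distinction Bleifer drew.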

Composable steps

The heart of a path is a set of composable steps, each with its own state. Working across Netflix teams, the group found that migrations share common stages but demand real customization. Some need manual input, some need steps nobody had anticipated. So the steps snap together like Lego bricks. A few have prerequisites and tend to pair up, but the model lets any team assemble the path it needs.
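
One way to picture steps that snap together while still honoring prerequisites is a small validator. The step catalogue below is hypothetical; only the idea that some steps require others comes from the talk.

```python
from typing import Dict, List, Set

# Hypothetical catalogue: step name -> steps that must appear earlier in the path.
PREREQUISITES: Dict[str, Set[str]] = {
    "code_transform": set(),
    "draft_pull_request": {"code_transform"},
    "validation": {"draft_pull_request"},
    "manual_approval": set(),
    "merge": {"draft_pull_request"},
    "monitor_deployment": {"merge"},
}


def validate_path(path: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the path is well formed."""
    seen: Set[str] = set()
    problems: List[str] = []
    for step in path:
        if step not in PREREQUISITES:
            problems.append(f"unknown step: {step}")
            continue
        for missing in sorted(PREREQUISITES[step] - seen):
            problems.append(f"{step} requires {missing} earlier in the path")
        seen.add(step)
    return problems
```

A team can insert `manual_approval` anywhere, since it has no prerequisites here, but a path that starts with `merge` is rejected with a readable reason.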

Event-driven orchestration

Underneath, the platform is a loop between a state machine and an event consumer, deliberately decoupled because events can originate anywhere, inside Netflix's systems or in any of the tools it integrates with. When an event arrives, the listener processes it and hands it to the state machine. The state machine reads the event together with the step type, say create a pull request, and launches the right step handler, which is a child workflow running that one piece of automation.
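
The loop can be sketched as a consumer that normalizes raw events and a state machine that dispatches on step type. In the real platform each handler is a child workflow; plain functions stand in for them here, and every name below is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    target: str
    step_type: str
    kind: str  # e.g. "step_ready"; hypothetical event kinds


def create_pull_request(event: Event) -> str:
    return f"opened draft PR for {event.target}"


def run_validation(event: Event) -> str:
    return f"started validation for {event.target}"


# Stand-ins for child workflows, keyed by step type.
HANDLERS: Dict[str, Callable[[Event], str]] = {
    "create_pull_request": create_pull_request,
    "validation": run_validation,
}


class StateMachine:
    def __init__(self, handlers: Dict[str, Callable[[Event], str]]) -> None:
        self.handlers = handlers
        self.log: List[str] = []

    def on_event(self, event: Event) -> str:
        """Pick the handler for the event's step type and run it."""
        handler = self.handlers.get(event.step_type)
        if handler is None:
            raise KeyError(f"no handler for step type {event.step_type!r}")
        result = handler(event)
        self.log.append(result)
        return result


def consume(raw_events: Iterable[dict], machine: StateMachine) -> None:
    """Event consumer: turn raw payloads into Events and hand them over."""
    for raw in raw_events:
        machine.on_event(Event(**raw))
```

Keeping `consume` separate from `StateMachine` mirrors the decoupling described above: the consumer does not care where an event originated, and the state machine does not care how it was delivered.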

The state machine carries a lot of weight. It decides what step comes next so targets keep moving along the path, tracks and updates step state, launches workflows, and handles edge cases: pausing targets during a quiet period or an incident, resuming them, and distinguishing terminal failures from retryable ones to pick the best path forward.
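
The edge-case handling reduces to a decision function. The sketch below is an assumption-heavy illustration: the action names, the two exception classes, and the retry budget are invented for the example, and escalating to a human on terminal failure is one plausible reading of "the best path forward".

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    WAIT = "wait"                # paused for a quiet period or incident
    ADVANCE = "advance"          # move to the next step on the path
    RETRY = "retry"              # run the same step again
    NEEDS_HUMAN = "needs_human"  # stop and ask for manual intervention


class RetryableError(Exception):
    """Transient failure, such as a timeout talking to an integrated tool."""


class TerminalError(Exception):
    """Failure that retrying will not fix."""


MAX_ATTEMPTS = 3  # hypothetical retry budget


def decide(outcome: Optional[Exception], attempts: int, paused: bool) -> Action:
    """Choose what happens next for one target after a step has run."""
    if paused:
        return Action.WAIT
    if outcome is None:
        return Action.ADVANCE
    if isinstance(outcome, RetryableError) and attempts < MAX_ATTEMPTS:
        return Action.RETRY
    return Action.NEEDS_HUMAN
```

Note that a pause wins over everything else: a target in a quiet period waits even if its last step succeeded, and resumes with the same outcome once the pause lifts.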

Confidently Automating Changes Across a Diverse Fleet - InfoQ

A common path, step by step

For one-time deprecations the customization matters most, but recurring work like dependency updates tends to follow the same shape, so the platform ships provided paths. The most common is a code change path with validation. Here is how it runs.

  1. Code transform. The platform launches a container that performs the transform. Teams can supply a custom script, point at one of the platform's pre-configured codemods for common jobs like a dependency bump or a delivery-config update, or even hand over a GenAI-prompted container.
  2. Draft pull request. The change opens as a draft PR rather than something activated and merged immediately. All PR checks must pass first.
  3. Validation. While still in draft, the PR moves to validation. Netflix partnered with its resilience team so this step can launch canaries for the change; the canary has to pass or the rollout stops and the platform team is invited to investigate. Validation is extensible, so a team can plug in custom tests here too. Bleifer named this as an area they want to expand.
  4. Compliance checks before merge. Before auto-merging, the platform verifies a set of conditions. It respects repository permissions (if they exist, they probably exist for a reason). It confirms the PR is in a mergeable state and passes its checks. And it reads the confidence rating.
  5. Activate and merge, followed by a final verification that the build looks good on main.
  6. Monitor deployment. As the change rolls out to its clusters, the platform watches Spinnaker events and honors quiet periods. When that completes, the rollout is successful.
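
The pre-merge checks in step 4 can be sketched as a function that collects reasons not to merge. The field names and the "high"/"low" labels are assumptions; the four conditions themselves are the ones listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PullRequest:
    repo_allows_automation: bool  # stand-in for repository permissions
    mergeable: bool
    checks_passed: bool
    confidence: str  # "high" or "low"; hypothetical labels


def merge_blockers(pr: PullRequest) -> List[str]:
    """Return every reason the PR should not be auto-merged (empty means go)."""
    blockers: List[str] = []
    if not pr.repo_allows_automation:
        blockers.append("repository permissions do not allow automated merge")
    if not pr.mergeable:
        blockers.append("pull request is not in a mergeable state")
    if not pr.checks_passed:
        blockers.append("pull request checks have not passed")
    if pr.confidence != "high":
        blockers.append("confidence rating is not high; keep a human in the loop")
    return blockers
```

Returning all blockers rather than the first one makes it easier to tell an owner exactly what stands between a stuck PR and an automatic merge.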

The confidence metric

Building the platform did not make teams come running. New automation that touches your production code is not something you trust on day one, especially across software with wildly different maturity. Bleifer's reframing is that confidence runs in two directions.

One direction is the platform's confidence in itself: will it do the change correctly, at scale, and safely? The other is recipient confidence: can a given service actually receive automation? Does it have tests that would catch a regression? Does it deploy often enough for the change to land inside the target window? And, beyond the technology, do the owners actually want this happening at scale?

Netflix folded these into a single confidence metric that the platform reads at the merge gate. High confidence means the PR can merge automatically. Low confidence means the team and platform have agreed a human still belongs in the loop. The metric is currently mostly qualitative, with the team rating services on a strongly-disagree-to-strongly-agree scale, but Bleifer said they are moving toward a calculation driven by software metadata so it updates as a service changes and can detect when something drops out of the high-confidence band.
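
The talk does not say how individual ratings are combined, so the following is purely a hypothetical sketch: it assumes a five-point scale with a neutral midpoint, averages the answers, and applies an invented threshold to decide the band.

```python
from typing import Iterable

# Assumed five-point scale; the talk names only the two endpoints.
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def confidence_band(answers: Iterable[str], threshold: float = 4.0) -> str:
    """Average the ratings and map them to "high" or "low" (assumed rule)."""
    scores = [LIKERT[a] for a in answers]
    if not scores:
        return "low"  # no ratings yet: keep a human in the loop
    return "high" if sum(scores) / len(scores) >= threshold else "low"


def dropped_out(previous_band: str, current_band: str) -> bool:
    """True when a service has fallen out of the high-confidence band."""
    return previous_band == "high" and current_band != "high"
```

A metadata-driven version would replace the survey answers with measured inputs such as test coverage or deployment frequency, while `dropped_out` would stay the same.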

Confidence Chimp and the first painful run

Rather than boil the ocean across 4,100-plus services, the team started with 3% of the fleet: the JVM ecosystem, services only. They designed a deliberately trivial change, a log line edit, and ran it through the full code change path. The logic is clean. If a no-op change breaks, the cause is either a gap in the platform or a service that simply is not automatable, and either way there is a theme worth fixing. The new member of Netflix's Simian Army got a name: Confidence Chimp.

The first run took 120 Java services and let the automation loose. It took 86 days to finish, and 44% of targets needed a manual intervention. The breakdown pointed mostly outward, at things beyond the platform's own code: partner-team integrations whose contracts and uptime needed hardening, and service owners themselves. Three themes stood out.

Contact data. Notifications were firing into the void because services had no contact metadata. When the team needed to reach an owner manually, they often found the wrong Slack channel and ended up on a goose chase. This, alongside some security incidents, drove a company-wide push to populate accurate on-call, Slack, and email contacts.

Stuck pull requests. PRs that could not auto-merge sat open for weeks, blocked by compliance checks, flaky builds, or merge rules. The team fine-tuned compliance checks to merge as much as safely possible, and added detection for PRs moving from an unhappy to a happy state so workflows could resume automatically instead of waiting for someone to notice. They also started detecting manual merges, rerunning merge checks, and commenting on stale PRs.
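
Detecting the unhappy-to-happy transition amounts to comparing two snapshots of the same PR. The sketch below assumes a simple dictionary shape, a definition of "happy" as mergeable with passing checks, and a 14-day staleness cutoff; none of these specifics come from the talk.

```python
from typing import Dict

STALE_AFTER_DAYS = 14  # hypothetical cutoff


def is_happy(pr: dict) -> bool:
    """Assumed definition: mergeable and all checks passing."""
    return pr["checks_passed"] and pr["mergeable"]


def plan_actions(previous: Dict[str, dict], current: Dict[str, dict]) -> Dict[str, str]:
    """Compare snapshots keyed by PR id and decide what to do for each PR."""
    actions: Dict[str, str] = {}
    for pr_id, pr in current.items():
        before = previous.get(pr_id)
        if pr.get("merged_by_human"):
            actions[pr_id] = "record_manual_merge"
        elif before is not None and not is_happy(before) and is_happy(pr):
            actions[pr_id] = "resume_workflow"  # unhappy -> happy
        elif not is_happy(pr) and pr["days_open"] >= STALE_AFTER_DAYS:
            actions[pr_id] = "comment_stale"
    return actions
```

The key point is the middle branch: the workflow resumes because the state changed, not because someone happened to notice the PR.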

Slow deployments. Plenty of services deployed far less often than the seven-day target, frequently because of manual gates. Working with the delivery team, they identified gates on services that were already well tested, or gates that statistically were not catching real production issues, and removed them, cutting manual interventions from deployment approvals by 77%. Gates that genuinely guard safety stayed.

Where it stands now

The gains came from running the exercise, fixing the obstacles, and running it again, an all-hands effort across the platform team, partner teams, and willing service owners. The numbers as of the talk:

  • Targets per campaign: from 120 to over 2,000.
  • Time to complete: from 86 days to 26.
  • Manual intervention rate: from 44% to 21%.
  • Fleet coverage: from 3% (Java only) to 50% in the most recent exercise (Java plus high-confidence, well-tested Python). Platform capability already reaches 66% of services; the exercise lags because the external work of earning team confidence takes longer than the engineering.
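
For a sense of scale, the relative changes behind those figures work out as follows. The fleet size of 4,100 is the article's approximate "4,100-plus" number, so the coverage share is itself approximate.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrank, relative to its starting value."""
    return (before - after) / before


days_cut = relative_reduction(86, 26)          # campaign duration
intervention_cut = relative_reduction(44, 21)  # manual intervention rate
first_run_share = 120 / 4100                   # first exercise vs. approximate fleet size

print(f"Time to complete fell by {days_cut:.1%}")                      # 69.8%
print(f"Manual intervention rate fell by {intervention_cut:.1%}")      # 52.3%
print(f"First run covered about {first_run_share:.1%} of the fleet")   # 2.9%
```

The intervention rate dropped by 23 percentage points, which is roughly half in relative terms, while the number of targets per campaign grew more than sixteenfold.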

Alongside the exercises, the platform has run real migrations: JDK upgrades, delivery-config modernization, software deprecations and introductions, a port migration, and a GenAI-prompted migration that moved services off an old library onto a new one in one pass.

Next on the roadmap: completing service coverage to unlock the JavaScript and Python ecosystems; expanding beyond services into libraries, jobs, and repositories (including validating libraries before release so teams maintain fewer versions); adding more provided paths, such as Python dependency updates configurable in a few lines; and generalizing the GenAI-prompted transform so any migration can supply a prompt and declare the resources its container needs.

What practitioners can take from this

A few points from the Q&A are worth carrying back to your own platform work. Netflix deliberately did not route campaigns through GitHub Actions, because the team wanted to keep its own safety checks rather than inherit a CI runner's model. It stayed framework-agnostic on code transforms, letting teams bring OpenRewrite or whatever they already use rather than mandating one tool. Auto-merge required real negotiation with central security, and for SOX- and PCI-relevant changes a human review is non-negotiable by design. When the platform launches a canary or merges on a team's behalf, it acts under its own permissions, granted through that security review.

Bleifer's closing lessons are the kind that sound obvious until you have ignored them. Confidence is a two-way street: the technology is necessary but not sufficient, because owners have to trust the platform and the platform has to earn it. Don't boil the ocean; break the problem down and let each solved piece make the next one easier. And the orchestration runs on partnerships, not just code. Across security, resilience, delivery, and hundreds of service owners, the human integrations turned out to be as load-bearing as the state machine.
