At QCon AI, ServiceTitan Principal AI Engineer David Stein detailed an agent-driven pattern that compressed a reporting-metrics migration from quarters into weeks. The architecture hinges less on smart models and more on rigid, programmatic validation loops that let ordinary coding agents grind through hundreds of standardized tasks without going off the rails.
Legacy migrations are the projects every engineering org defers. You know the new architecture is better, you know the old code is slowing you down, and you know that getting from one to the other means hundreds of Jira tickets and several quarters of work that might still strand you halfway up the hill. In a QCon AI presentation, ServiceTitan Principal AI Engineer David Stein described how his team rebuilt that calculus, moving a few hundred reporting metrics off legacy code and onto a modern metric store in roughly two weeks. The interesting part is not that AI wrote the code. It is the surrounding system that made unreliable agents produce reliable output.

The service update: an agent assembly line, not a magic prompt
The naive approach is the one everyone tries first. Point Cursor or Claude Code at a few hundred thousand lines of legacy code and ask it to migrate the whole thing onto a new architecture. Stein has tried this. It does not work, and it fails for the same reason a strong human engineer can't do it in one pass: the task is too large to hold in context. The agent makes early progress, then hallucinates. It invents new metrics instead of porting existing ones, completes five of them, writes "here's how I'd do the rest," and stops. The five it finished are wrong anyway.
ServiceTitan's answer is a pattern Stein calls the assembly line, built on three moves: decompose, standardize, automate.
Decompose the migration into the smallest unit that a single agent can finish and that you can independently verify. For ServiceTitan, that unit was one metric. Move one pebble, not the mountain. Go too granular and you create overhead; the right grain is whatever an agent can reliably complete end to end.
Standardize the context every task needs. Because the tasks rhyme, the same ingredients recur: access to real data in a staging Snowflake cluster, the location of the legacy code, the target pattern in the new platform, and a crisp definition of done. ServiceTitan encoded this in two plain text files, migration_goals.txt and migration_tasks.txt. The goals file defines what "complete" means for a single task and for the overall migration, names the CLI tools the agent should use, and explains how the agent checks its own work. The tasks file is a phased checklist the agent marks off as it goes, with phase boundaries that give human engineers natural inspection points.
Automate the step that used to require a person: understanding the legacy code well enough to reimplement it. This is what 2025-era coding models unlock. The first two steps were always possible; what is new is the models' ability to comprehend old, sparsely documented code.
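To make the phased checklist concrete, here is a minimal sketch of how a tasks file might be parsed to find the next human inspection point. The article does not show the real layout of migration_tasks.txt, so the "Phase N:" headers, the "[x]" / "[ ]" markers, and the names `Phase`, `parse_tasks`, and `next_checkpoint` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Phase:
    name: str
    # Each task is (checked, description).
    tasks: list[tuple[bool, str]] = field(default_factory=list)

    @property
    def done(self) -> bool:
        # A phase with no tasks is treated as not done, so it still gets reviewed.
        return bool(self.tasks) and all(checked for checked, _ in self.tasks)


def parse_tasks(text: str) -> list[Phase]:
    """Parse a checklist in the hypothetical format described above."""
    phases: list[Phase] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.lower().startswith("phase "):
            phases.append(Phase(name=line))
        elif line.startswith("[x] ") or line.startswith("[ ] "):
            if not phases:
                raise ValueError("task found before any phase header")
            phases[-1].tasks.append((line.startswith("[x]"), line[4:]))
    return phases


def next_checkpoint(phases: list[Phase]) -> Phase | None:
    """Return the first phase that is not fully checked off, or None."""
    for phase in phases:
        if not phase.done:
            return phase
    return None


# Hypothetical file contents, not ServiceTitan's actual checklist.
EXAMPLE = """\
Phase 1: Understand the legacy metric
[x] Locate the legacy implementation
[x] Query staging data for sample rows
Phase 2: Implement in the new platform
[x] Write the metric definition
[ ] Run the validator until it passes
"""
```

Calling `next_checkpoint(parse_tasks(EXAMPLE))` returns the second phase, which is where an engineer would look before letting the agent continue.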
Notably, the team skipped MCP (Model Context Protocol) entirely. They reasoned from how a human actually acquires context, which is through a CLI. Engineers run SnowSQL to see what the staging data looks like, so the team handed the agents the same SnowSQL access with staging-scoped credentials. You cannot describe a metric's formula in the abstract and expect correct code from an agent, just as a human can't write it without seeing the real table shapes and data distributions.
How the self-healing loop works
The engine that makes this safe is the validator, and it is where most of the engineering effort actually went.
ServiceTitan built a simulator that regenerates the same reports the legacy backend produced, but sourced from the new metrics platform, in their case dbt MetricFlow and the Semantic Layer. For any migrated metric, the simulator produces a directly comparable output: matching formatting, the same data, scannable across the full distribution, with a clean pass or fail at the end.
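The core of such a validator is a comparison that yields an unambiguous pass or fail. The sketch below is not ServiceTitan's simulator. It assumes both reports have already been reduced to a mapping from a dimension key to a numeric value, and the function name `compare_reports` and the tolerance are illustrative choices.

```python
import math


def compare_reports(legacy: dict, candidate: dict, rel_tol: float = 1e-9) -> list[str]:
    """Return a list of mismatches. An empty list means the metric passes."""
    mismatches = []
    for key in sorted(set(legacy) | set(candidate)):
        if key not in candidate:
            mismatches.append(f"{key}: missing from candidate")
        elif key not in legacy:
            mismatches.append(f"{key}: unexpected in candidate")
        elif not math.isclose(legacy[key], candidate[key], rel_tol=rel_tol, abs_tol=1e-9):
            mismatches.append(f"{key}: legacy={legacy[key]} candidate={candidate[key]}")
    return mismatches


# Hypothetical report rows keyed by (month, region).
legacy_rows = {("2024-01", "north"): 120.0, ("2024-01", "south"): 80.5}
candidate_rows = {("2024-01", "north"): 120.0, ("2024-01", "south"): 79.0}

for problem in compare_reports(legacy_rows, candidate_rows):
    print(problem)
```

Returning the full list of mismatches, rather than a bare boolean, gives the agent something specific to inspect on the next attempt.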

The loop runs like this. The agent acquires context through the provided CLI tools. It writes the code. It runs the validator. If the output does not match the legacy reference, the agent inspects the mismatch, tries again, and repeats until it passes. Stein's analogy is the old game Lemmings: the individual agents are not clever, but if you place the right roadblocks, they all reach the destination.
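The loop itself can be sketched in a few lines. This is an illustration under stated assumptions, not the team's harness: `attempt` stands in for whatever invokes the coding agent, `validate` stands in for the validator, and the `max_attempts` cap is an addition made here so that a stuck task ends in a visible "needs_human" state instead of looping forever.

```python
from typing import Callable


def run_task(
    attempt: Callable[[list[str]], str],
    validate: Callable[[str], list[str]],
    max_attempts: int = 5,
) -> dict:
    """Retry until the validator reports no mismatches or attempts run out."""
    feedback: list[str] = []
    for n in range(1, max_attempts + 1):
        # The agent sees the previous mismatches (empty on the first try).
        candidate = attempt(feedback)
        feedback = validate(candidate)
        if not feedback:
            return {"status": "passed", "attempts": n, "result": candidate}
    # No progress is a recoverable outcome: hand the task to a person.
    return {"status": "needs_human", "attempts": max_attempts, "feedback": feedback}
```

The important property is that the only way to reach "passed" is through the validator. The agent's own opinion of its work never enters the return value.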
This reframes the question of model intelligence. The agent does not need to match a senior engineer. In a well-built self-healing loop, the worst case is not a confidently wrong result shipped to production. The worst case is that the agent fails to make progress on a hard task, which is a visible, recoverable state. The danger is the inverse: an agent that believes it succeeded when it failed. That only happens when validation is weak, and a weak validator will happily burn a week of compute producing plausible slop. ServiceTitan revised its validator multiple times across the few hundred metrics.
Where it earned its keep
The reporting platform was a strong fit because the metrics existed at scale, each carried hidden dependencies on production database schemas, and several had been written years ago by people who had since left. A prior team attempt to move this code into a data lakehouse had stalled on exactly that accumulated complexity.
The assembly line cleared roughly 85 percent of the metrics through automation. The remaining 15 percent or so were complex enough that a human had to step in. With hundreds of items in the queue, offloading 85 percent of the toil reshapes the project's economics entirely.
The second-order benefit is architectural agility. The old migration timeline front-loaded investigation and proofs of concept, committed to a platform, then back-loaded all value realization to the very end, which is why these projects are so hard to schedule and so risky to abandon midway. The assembly line inverts the curve. You invest heavily up front in decomposition and validation, then realize most of the value in a compressed window. Stein's team hit a case where, partway through, they realized a different target architecture would have been better. Rerunning the entire migration cost them a re-prompt rather than a demoralizing team-wide restart. The agents, as he put it, don't mind doing it again.
Trade-offs a cloud architect should weigh
The pattern is not free, and the costs land in specific places.
Validation is the bottleneck, and it is non-trivial. Many legacy systems were never built to be observable, the same way much legacy code was never built to be unit tested. You need enough visibility into the reference system to compare outputs at the right points, including handling timing and data-density issues. What used to be a nice-to-have, clean observability into the old system, becomes the load-bearing requirement.
Context gaps surface as stuck agents. The common failure was an agent unable to find test data that exercised a metric's edge cases. The fix was decidedly human: curating golden lists of record IDs and time ranges where good example data is known to exist, plus example queries covering each metric's nuances. This is tribal knowledge, the stuff you would lean over the cubicle wall to ask a teammate for, written down into the context files. Capturing it well was what pushed success rates high.
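One way to picture a golden list is as structured data rendered into the text an agent reads. The sketch below is hypothetical throughout: the `GoldenExample` fields, the `render_context` helper, the metric name, and the record IDs are invented for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GoldenExample:
    metric: str
    record_ids: tuple[int, ...]
    start: date
    end: date
    note: str  # the edge case this data is known to exercise


def render_context(examples: list[GoldenExample], metric: str) -> str:
    """Render the golden examples for one metric as plain text for a context file."""
    lines = [f"Known-good test data for {metric}:"]
    for ex in examples:
        if ex.metric != metric:
            continue
        ids = ", ".join(str(i) for i in ex.record_ids)
        lines.append(
            f"- records {ids} between {ex.start.isoformat()} "
            f"and {ex.end.isoformat()}: {ex.note}"
        )
    if len(lines) == 1:
        # Surface the context gap up front rather than as a stuck agent later.
        raise LookupError(f"no golden examples recorded for {metric}")
    return "\n".join(lines)


# Invented sample entry.
GOLDEN = [
    GoldenExample("net_revenue", (1001, 1002), date(2024, 1, 1), date(2024, 1, 31),
                  "includes refunded invoices"),
]
```

Failing loudly when a metric has no golden data turns the stuck-agent symptom into an explicit to-do for whoever holds the tribal knowledge.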
Security sits unresolved and demands deliberate scoping. When an agent runs shell commands and CLI tools, it can attempt things you never intended, including calls to cloud APIs or destructive operations. ServiceTitan's mitigations are pragmatic rather than airtight: scope agents to staging data, withhold any credential that could touch production, and watch the agents work. Stein was candid that the isolation guarantees in current agent tooling are still maturing, and that highly sensitive applications would need a much higher bar for which tools agents can hold and how runs are sandboxed. Anyone applying this pattern in a regulated or production-adjacent context should treat credential scoping and sandboxing as primary design constraints, not afterthoughts.
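As a rough illustration of deliberate scoping, a pre-execution check on agent-issued commands might look like the sketch below. It is an assumption-laden example, not ServiceTitan's mitigation: the allowed programs, the blocked markers, and `check_command` are hypothetical, and string checks like these are easy to bypass, so they complement credential scoping and real sandboxing without replacing either.

```python
import shlex

# Hypothetical policy values chosen for illustration.
ALLOWED_PROGRAMS = {"snowsql", "dbt", "git"}
BLOCKED_MARKERS = ("prod",)  # crude and deliberately conservative
SHELL_METACHARACTERS = set(";|&$><`")


def check_command(command: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a command string an agent wants to run."""
    if SHELL_METACHARACTERS & set(command):
        return False, "shell metacharacters are not allowed"
    try:
        tokens = shlex.split(command)
    except ValueError as exc:
        return False, f"unparseable command: {exc}"
    if not tokens:
        return False, "empty command"
    if tokens[0] not in ALLOWED_PROGRAMS:
        return False, f"program not on the allow-list: {tokens[0]}"
    lowered = command.lower()
    for marker in BLOCKED_MARKERS:
        if marker in lowered:
            return False, f"command mentions blocked marker: {marker}"
    return True, "ok"


print(check_command('snowsql -c staging -q "select count(*) from invoices"'))
print(check_command("rm -rf /"))
```

The substring check on "prod" will also reject harmless names such as a products table. For a guard like this, a false rejection that sends the task to a human is the cheaper mistake.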
The pattern also does not collapse into a single button yet. Asked about a minimal path, Stein expects that tools like Cursor and Claude Code will eventually infer the validator and simulation harness from the legacy code directly. For now, you still have to break the work into agent-sized tasks, build a genuinely good validator, and kick off the loop. Tools such as Promptfoo for evaluating and red-teaming LLM behavior fit naturally into this discipline, but the validation logic specific to your domain remains yours to write.
The broader pattern
Strip away the AI framing and the assembly line is recognizable: decompose a large problem, standardize the inputs, and demand programmatic proof that each unit is correct. These are the same instincts a staff engineer applies when handing work to a team of juniors. What changed is that the comprehension step, reading and faithfully reimplementing unfamiliar legacy code, can now be parallelized across agents instead of serialized across engineer-weeks.
Stein spent a decade at LinkedIn watching active-active geo-replication efforts and monolith decompositions consume dozens of engineers across multiple quarters. The question he leaves architects with is worth sitting with: what migration have you deferred indefinitely because it costs too many tickets and too many quarters, and would the math look different if the comprehension work became a validated, repeatable agent loop? The discipline that makes it work, rigid validation and well-curated context, is exactly the discipline good teams claim to value anyway. The agents just make the absence of it impossible to ignore.
