Anthropic researcher Nicholas Carlini orchestrated sixteen Claude Opus 4.6 AI agents to build a Rust-based C compiler from scratch, demonstrating autonomous software development at scale.
In an effort to probe the limits of autonomous software development, Anthropic researcher Nicholas Carlini used sixteen Claude Opus 4.6 AI agents to build a Rust-based C compiler from scratch. Working in parallel on a shared repository, the agents coordinated their changes and ultimately produced a compiler capable of building the Linux 6.9 kernel across x86, ARM, and RISC-V, as well as many other open-source projects. The agents ran roughly 2,000 sessions without human intervention, incurring about $20,000 in API costs. According to Carlini, this "dramatically expands the scope of what's achievable with LLM agents".

While Carlini describes the compiler as an "interesting artifact" in its own right, he stresses that the deeper lessons concern "designing harnesses for long-running autonomous agent teams", that is, keeping agents on track without human oversight while letting them make progress in parallel:
If you ask for a solution to a long and complex problem, the model may solve part of it, but eventually it will stop and wait for continued input—a question, a status update, or a request for clarification.
Carlini's approach consisted of "sticking Claude in a simple loop", so that the agent keeps working on a given task until it's perfect, then immediately moves on to the next. He paired this setup with multiple Claude instances running in parallel, each inside its own Docker container but accessing a shared Git repo. This increased efficiency, allowing Claude to tackle multiple tasks at once, and encouraged agent specialization, with some agents handling documentation, others the quality of the generated code, and so on.
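Carlini has not published the harness itself, so the following Python sketch is only one plausible reading of the loop he describes; the `claude-agent` command, the `TODO.md` task list, and the test-based completion check are all assumptions introduced for illustration:

```python
import subprocess

# Minimal sketch of the "simple loop" harness, assuming a hypothetical
# `claude-agent` CLI that runs one autonomous session inside the agent's
# Docker container. None of these commands come from Carlini's write-up.

def run_claude_session(task: str, repo: str) -> None:
    """Run one agent session working on `task` in the shared repo."""
    subprocess.run(["claude-agent", "--task", task], cwd=repo, check=False)

def task_is_complete(repo: str) -> bool:
    """Treat a passing test suite as the completion signal (an assumption;
    the compiler is written in Rust, hence `cargo test`)."""
    return subprocess.run(["cargo", "test"], cwd=repo).returncode == 0

def pick_next_task(repo: str) -> str:
    """Placeholder: in the real setup each agent decides the 'next most
    obvious' problem itself; here we just read a hypothetical TODO file."""
    with open(f"{repo}/TODO.md") as f:
        return f.readline().strip()

def work_loop(repo: str) -> None:
    """Keep working on a task until it is done, then move to the next one."""
    while True:
        task = pick_next_task(repo)
        while not task_is_complete(repo):
            run_claude_session(task, repo)
```

Sixteen such loops, one per container, would then run concurrently against the same Git remote.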
To synchronize agents, Carlini relied on a simple lock-based scheme: Claude takes a "lock" on a task by writing a text file to current_tasks/ [...]. If two agents try to claim the same task, git's synchronization forces the second agent to pick a different one. Once a task is complete, the agent merges other agents' changes locally, then pushes its branch, and removes the lock. Carlini says "Claude is smart enough to figure out" merge conflicts on its own.
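Again, the exact mechanics are not published; a plausible sketch of the scheme, with invented lock-file names, could look like this, where Git's refusal to accept a non-fast-forward push is what forces the losing claimant to back off:

```python
import pathlib
import subprocess

def git(repo: str, *args: str) -> subprocess.CompletedProcess:
    return subprocess.run(["git", *args], cwd=repo)

def claim_task(repo: str, task: str) -> bool:
    """Return True if this agent won the race to claim `task`."""
    lock = pathlib.Path(repo, "current_tasks", f"{task}.lock")
    lock.parent.mkdir(exist_ok=True)
    lock.write_text("claimed\n")
    git(repo, "add", str(lock))
    git(repo, "commit", "-m", f"claim task: {task}")
    # If another agent pushed a claim first, this push is rejected and the
    # agent should pull, drop its claim, and pick a different task.
    return git(repo, "push").returncode == 0

def release_task(repo: str, task: str) -> None:
    """Merge other agents' work, remove the lock, and publish the result."""
    git(repo, "pull", "--no-rebase")  # merge others' changes locally
    git(repo, "rm", str(pathlib.Path(repo, "current_tasks", f"{task}.lock")))
    git(repo, "commit", "-m", f"complete task: {task}")
    git(repo, "push")
```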
Most notably, in this setup, Carlini does not use an orchestration agent, preferring to "leave it up to each Claude agent to decide how to act":
In most cases, Claude picks up the "next most obvious" problem. When stuck on a bug, Claude will often maintain a running doc of failed approaches and remaining tasks.
Carlini enforced a number of key practices to ensure success, including maintaining high-quality tests and continuous integration while preventing Claude from spending too much time on testing; assigning distinct agents to separate projects when they were likely to hit the same bug; and specializing agents as mentioned above.
A major problem, particularly evident with the Linux kernel, was that multiple agents would encounter the same bug simultaneously and generate distinct fixes that overwrote each other's work. To address this, Carlini employed GCC as a compiler oracle: each agent used GCC to compile a random subset of the kernel tree while Claude's compiler handled the remainder, so that each agent debugged only the files its own compiler was responsible for, and different agents hit different bugs.
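Conceptually, the oracle trick amounts to giving each agent its own deterministic random partition of the source tree. A toy illustration follows; the agent identifier, the `claude-cc` compiler name, and the split ratio are all invented for the example:

```python
import hashlib

AGENT_ID = "agent-07"  # hypothetical per-agent identifier
CLAUDE_SHARE = 0.25    # fraction of files routed through the new compiler
                       # (the actual ratio is not given in the write-up)

def choose_compiler(source_file: str) -> str:
    """Deterministically route a kernel source file to 'claude-cc' or 'gcc'.

    Hashing the agent ID together with the path gives every agent its own
    stable random subset, so concurrent agents debug different files and
    are less likely to collide on the same bug."""
    digest = hashlib.sha256(f"{AGENT_ID}:{source_file}".encode()).digest()
    return "claude-cc" if digest[0] / 256 < CLAUDE_SHARE else "gcc"

# Example: decide who compiles a couple of kernel files for this agent
for path in ["kernel/sched/core.c", "mm/memory.c"]:
    print(path, "->", choose_compiler(path))
```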
After two weeks and approximately $20k in API costs, the effort produced a 100k-line compiler that passes 99% of GCC's torture test suite, compiles FFmpeg, Redis, PostgreSQL, and QEMU, and builds a working Doom.
Carlini's effort ignited a wide online debate, with reactions ranging from positive to skeptical, and sparked further discussion of its practical and philosophical implications. X user @chatgpt21 noted that while this was no small feat, it still required a human engineer to "constantly redesign tests, build CI pipelines when agents broke each other's work, and create workarounds when all 16 agents got stuck on the same bug". For their part, @hryz3 emphasized that those agents were trained "on the same code they were asked to reproduce". More sarcastically, @TomFrankly wrote:
They spent $20k in tokens to spit out code that's in the training data they scraped?
Former Microsoft executive Steve Sinofsky further qualified the claim that Claude did in two weeks the work that took human engineers 37 years, pointing out that:
It didn't take GCC 37 years to be built. In 1987 it fully worked for the language as it existed at the time. Over 37 years it evolved with the language, platforms, libraries, optimization and debugging technology, etc.
@WebReflection brought up another dimension of the debate, asking:
How much OSS contribution was done in the making? 'cause [there will be] no experts' code to look at as reference in the future if [not] giving anything back to the sources that made any of this possible.
@RituWithAI summed up the implications this might have on software development roles:
We are entering an era where the primary skill for a 10x developer isn't their ability to solve a complex bug, but their ability to design the automated testing rigs and feedback loops that allow sixteen parallel instances of a model to solve it for them.
As a final note, Carlini himself hinted at the risks posed by the ability to generate code so easily, and at the need for "new strategies to navigate safely" in this new world.
Beyond the headline numbers, the experiment suggests that AI agents can collaborate effectively on complex engineering tasks without constant human intervention. The key innovation was not simply running multiple agents, but a framework that let them work independently while staying coordinated through simple mechanisms such as file-based locks and ordinary Git operations.
The approach could plausibly carry over to other software engineering tasks where parallel work and coordination matter, though the experiment also exposed its limits: a human was still needed to design the testing frameworks, agents struggled to coordinate when they hit the same bug, and the computational costs were substantial.
Still, building a C compiler able to handle the Linux kernel is a non-trivial challenge that has historically taken teams of experienced developers years. That sixteen AI agents could get there in two weeks, even with the caveats above, points to a model of software development in which engineers focus less on writing every line of code and more on designing the systems that let AI agents collaborate.
