Justin Lebar recounts how large language models turned a modest fuzzing project into a $10,000, multi‑agent bug‑finding operation that uncovered dozens of LLVM miscompiles, including a critical atomic‑store regression. The post explores the technical workflow, the economics of token‑driven AI, and the broader implications for compiler security and software engineering.
Finding Miscompiles for Fun, Not Profit – A Deep Dive
By Justin Lebar, May 28 2026
Paid post – SemiAnalysis
When I first set out to fuzz LLVM a few years ago, the goal was simple: generate random programs, compile them, and compare the results. After a handful of bugs in the instcombine pass, the returns dwindled and I set the project aside. Fast forward to May 2026, armed with an LLM‑augmented workflow and a willingness to spend a few thousand dollars on token usage, and the same idea exploded into a torrent of miscompiles across NVIDIA’s ptxas and AMD’s GPU back‑ends.

The Core Argument
The central claim of Lebar’s narrative is that AI‑driven agents can now automate large‑scale compiler bug discovery at a speed and breadth that far exceeds traditional fuzzing, provided the organization can afford the token costs. The evidence comes in three parts:
- Fuzzer acceleration via LLMs – By prompting ChatGPT 5.5 (and later Claude) to iteratively rewrite the fuzzer after each discovered bug, Lebar eliminated the typical “stuck” problem where a fuzzer repeatedly hits the same defect. The model also performed test‑case minimisation, often spending an hour per case to shrink the program to its essential miscompile‑triggering core.
- Massive bug yield – Within three days, the ptxas fuzzer produced 40 miscompiles, climbing to roughly 80 a week later. Similar rates were observed for the AMDGPU backend. By contrast, a fleet of 50 Claude sub‑agents uncovered bugs at a rate of one every four minutes, and for the x86 backend the rate approached two per minute.
- Economic accounting – The fuzzer’s token consumption was on the order of a few hundred dollars, while the sub‑agent approach cost over $10,000 in a few hours. Despite the higher expense, the agents identified a class of bugs (e.g., an atomic‑store regression) that fuzzing would struggle to expose.
Together, these points illustrate a shift: the bottleneck is no longer algorithmic cleverness but the budget for AI compute.
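To make the budget comparison concrete, here is a rough back-of-the-envelope sketch in Python. The post only gives approximate figures, so several inputs are assumptions chosen for illustration: "a few hundred dollars" is taken as $300, "a few hours" as 3 hours, and the $10,000 agent spend is treated as a lower bound. The helper names are hypothetical.

```python
def cost_per_bug(total_cost_usd: float, bugs_found: int) -> float:
    """Average spend per reported bug."""
    if bugs_found <= 0:
        raise ValueError("bugs_found must be positive")
    return total_cost_usd / bugs_found


def bugs_at_rate(hours: float, minutes_per_bug: float) -> int:
    """Number of bugs reported in `hours` at one bug every `minutes_per_bug`."""
    return int(hours * 60 / minutes_per_bug)


# Figures from the post, with assumed values where the post is vague.
FUZZER_COST_USD = 300.0        # ASSUMPTION: "a few hundred dollars"
FUZZER_BUGS = 80               # roughly 80 ptxas miscompiles
AGENT_COST_USD = 10_000.0      # "over $10,000", so a lower bound
AGENT_HOURS = 3.0              # ASSUMPTION: "a few hours"
AGENT_MINUTES_PER_BUG = 4.0    # one bug every four minutes

agent_bugs = bugs_at_rate(AGENT_HOURS, AGENT_MINUTES_PER_BUG)
fuzzer_cpb = cost_per_bug(FUZZER_COST_USD, FUZZER_BUGS)
agent_cpb = cost_per_bug(AGENT_COST_USD, agent_bugs)

print(f"fuzzer: ${fuzzer_cpb:.2f} per bug")
print(f"agents: ${agent_cpb:.2f} per bug over {agent_bugs} bugs")
```

Under these assumed inputs the agents cost far more per report than the fuzzer, which is consistent with the post's framing that the extra spend pays off only if the agents reach bugs the fuzzer would miss.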
How the AI‑Enhanced Fuzzer Works
- Prompt‑driven code generation – Lebar supplied a high‑level description of a fuzzing harness to ChatGPT. The model emitted a complete Python script that:
  - Generates random PTX instruction sequences.
  - Invokes ptxas on the generated code.
  - Executes the resulting binary and compares output against a reference interpreter.
- Iterative bug avoidance – After each discovered bug, a follow‑up prompt asked the model to mutate the harness so that the same trigger would be filtered out. This is loosely analogous to corpus pruning in a fuzzer such as AFL++, but is performed automatically by the LLM.
- Minimisation as a service – The model was instructed to repeatedly delete random instructions while preserving the miscompile, effectively performing a delta‑debug reduction without human supervision.
- Parallel sub‑agents – For the inspection‑based approach, Lebar gave Claude a high‑level goal: “Find any code path in LLVM that could produce an incorrect transformation.” Claude spun up 50 independent agents, each scanning different source files, generating hypotheses, and reporting back with a concise diff and a reproduction script.
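The generate, compile, and compare loop described above, together with the "filter out known triggers" step, can be sketched in a few lines of Python. This is not the harness from the post: it is a toy model in which the "programs" are short sequences over a made-up four-instruction set, and `buggy_backend_run` is a hypothetical stand-in for compiled code with a deliberately planted bug. A real harness would shell out to ptxas and run the result on a GPU, which is omitted here.

```python
import random

OPS = ("add", "sub", "mul", "shl")
MASK = 0xFFFFFFFF  # model 32-bit registers


def generate_program(rng, length=6):
    """Random straight-line program: a list of (opcode, immediate) pairs."""
    return [(rng.choice(OPS), rng.randrange(0, 40)) for _ in range(length)]


def reference_run(program, x):
    """Reference interpreter: shifts of 32 or more produce 0."""
    acc = x & MASK
    for op, k in program:
        if op == "add":
            acc = (acc + k) & MASK
        elif op == "sub":
            acc = (acc - k) & MASK
        elif op == "mul":
            acc = (acc * k) & MASK
        else:  # shl
            acc = (acc << k) & MASK if k < 32 else 0
    return acc


def buggy_backend_run(program, x):
    """Stand-in for the compiler under test. Planted bug: shift amounts wrap mod 32."""
    acc = x & MASK
    for op, k in program:
        if op == "add":
            acc = (acc + k) & MASK
        elif op == "sub":
            acc = (acc - k) & MASK
        elif op == "mul":
            acc = (acc * k) & MASK
        else:  # shl, miscompiled for k >= 32
            acc = (acc << (k % 32)) & MASK
    return acc


def oversized_shift(program):
    """Trigger predicate for the already-diagnosed bug."""
    return any(op == "shl" and k >= 32 for op, k in program)


def fuzz(iterations, seed=0, known_triggers=()):
    """Differential fuzzing loop that skips programs matching known triggers."""
    rng = random.Random(seed)
    findings = []
    for _ in range(iterations):
        program = generate_program(rng)
        if any(trigger(program) for trigger in known_triggers):
            continue  # avoid rediscovering a bug that is already understood
        x = rng.randrange(1, 1 << 16)
        if reference_run(program, x) != buggy_backend_run(program, x):
            findings.append((program, x))
    return findings


first_pass = fuzz(500)
second_pass = fuzz(500, known_triggers=[oversized_shift])
print(len(first_pass), "mismatches before filtering,", len(second_pass), "after")
```

In this toy the trigger predicate is written by hand. In the workflow described in the post, that step is what the LLM is asked to do after each bug, so the fuzzer's budget goes toward new defects rather than repeats of the same one.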
The workflow is captured in the open‑source repository FuzzX, which contains the generated fuzzers, the list of miscompiles, and a few example agent reports.
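The minimisation step mentioned earlier (deleting instructions while the miscompile still reproduces) can be illustrated with a small greedy reducer. This is a simplified sketch, not the procedure used in the post: it deletes one instruction at a time in a deterministic sweep rather than at random, and it is simpler than full delta debugging. The `still_fails` predicate is a hypothetical placeholder for "compile, run, and compare against the reference".

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def reduce_test_case(
    program: Sequence[T],
    still_fails: Callable[[List[T]], bool],
) -> List[T]:
    """Greedily drop instructions as long as the failure keeps reproducing.

    The result is 1-minimal: removing any single remaining instruction
    makes the failure disappear.
    """
    current = list(program)
    if not still_fails(current):
        raise ValueError("input does not trigger the failure")
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate  # instruction i was irrelevant, drop it
                changed = True
            else:
                i += 1  # instruction i is needed, keep it and move on
    return current


# Toy example: the "miscompile" needs both a shl and a mul to be present.
program = ["mov", "add", "shl", "sub", "mul", "ret"]
reduced = reduce_test_case(program, lambda p: "shl" in p and "mul" in p)
print(reduced)
```

Each call to `still_fails` corresponds to a full compile-and-run cycle in a real setup, which is one plausible reason a single reduction could take on the order of an hour, as the post reports.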
Implications for Compiler Engineering
1. Shift from Manual to Token‑Driven QA
The economics suggest a new tier of testing: organizations that can allocate tens of thousands of dollars to token usage will be able to outsource large portions of regression testing to LLMs. Smaller teams may still rely on classic fuzzers, but they will increasingly view AI agents as a premium service for “deep‑danger” bugs such as atomic‑ordering violations.
2. Open‑Source Advantage
Lebar notes that fixing bugs in LLVM’s AMDGPU backend was trivial because the code is public. Closed‑source compilers like ptxas still benefit from the fuzzer, but the inspection‑based approach is limited to binary analysis. This reinforces the strategic value of open‑source toolchains for rapid remediation.
3. Bug Severity Spectrum
Fuzz‑found bugs are demonstrable miscompiles; they can be reproduced with a concrete input and a test harness. Agent‑found bugs often rely on the model’s judgment about whether a transformation is semantically incorrect. While the average severity is lower, the outlier atomic‑store regression demonstrates that a single high‑impact bug can justify the $10,000 expense.
4. Future of Compiler Verification
If LLMs can read millions of lines of source code and propose concrete counter‑examples, the traditional “prove‑by‑testing” mindset may give way to a hybrid model: formal verification for core invariants + AI‑driven exploration for edge cases.
Counter‑Perspectives
- Token cost volatility – Prices for Claude Max or ChatGPT Pro can fluctuate, and token quotas may be throttled during peak demand. Relying on a budget‑driven approach could introduce unpredictability into release cycles.
- False positives – Agent‑generated reports sometimes flag benign transformations as bugs. Without rigorous triage, teams could waste time chasing phantom issues.
- Security concerns – Feeding proprietary compiler internals to a third‑party LLM raises confidentiality questions. Companies may need to host private model instances or adopt on‑prem LLMs to mitigate data leakage.
- Human expertise erosion – As Lebar observes, his AI delivered more value than his own effort for this project. Over‑reliance on automated agents could diminish the skill set of compiler engineers, making maintenance harder when the AI is unavailable.
Concluding Reflections
The experiment Lebar describes is less a novelty and more a harbinger of a new testing economics: the barrier to exhaustive compiler validation is shifting from algorithmic ingenuity to financial willingness to purchase AI tokens. For organizations that can afford the spend, the payoff is clear—a flood of miscompiles, including rare but catastrophic bugs, can be uncovered in hours rather than months.
For the broader community, the lesson is twofold. First, open‑source compilers will continue to reap the benefits of rapid, community‑driven fixes. Second, the industry must grapple with the ethical and practical ramifications of a future where AI agents can out‑perform human engineers at a price tag. As token costs decline and models become more efficient, the gap between “expensive” and “affordable” will shrink, potentially democratizing this capability—but until then, the divide will shape who can guarantee the correctness of the software that underpins modern computing.
*The full list of discovered bugs, along with reproducible test cases, is available in the FuzzX GitHub repository.*
