Large language models promise revolutionary assistance in scientific research yet remain fundamentally untrustworthy. As physicist Tobias Osborne notes: "They hallucinate references to papers that don’t exist. If you push back on a correct argument, they’ll ‘discover’ an error that isn’t there." Benchmark tests reveal alarming failure rates—GPT-5 produces incorrect proofs for false statements 29% of the time in the BrokenMath evaluation.
This unreliability becomes critical in mathematical research where, as Osborne experienced, LLMs can generate "a baroque facade stapled onto a pile of steaming trash"—seemingly valid proofs that collapse under expert scrutiny. Yet the success of agentic coding tools suggests a solution: rigorous testing frameworks.
> **"Why can’t we do this for mathematics or physics? Compilers are way more rigorous than the average journal referee."**
> — Tobias Osborne
The core problem lies in mathematics' "programming language": natural language. While proof assistants like Lean offer formal verification, their steep learning curve deters researchers. Osborne's breakthrough came when Claude suggested using **Lamport's structured proofs**—a methodically organized proof format that plays to transformers' strength in breadth-first reasoning.
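For readers who haven't seen the format, here is a minimal sketch of a Lamport-style structured proof (a toy example of ours, not taken from Alethfeld's prompts): every claim becomes a numbered, nested step with its own justification, so a verifier can attack steps one at a time rather than untangling flowing prose.

```text
THEOREM  sqrt(2) is irrational
<1>1. SUFFICES ASSUME sqrt(2) = p/q with p, q coprime integers, q ≠ 0
             PROVE  FALSE
<1>2. p^2 = 2*q^2, so p is even; write p = 2r.
      PROOF: Square the assumption in <1>1 and clear denominators.
<1>3. q is even.
      PROOF: By <1>2, 4*r^2 = 2*q^2, so q^2 = 2*r^2.
<1>4. QED
      PROOF: <1>2 and <1>3 contradict the coprimality assumed in <1>1.
```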
### Alethfeld: Adversarial Verification Architecture
Alethfeld implements a multi-agent system entirely through prompts (a minimal version of the loop is sketched after this list):
- **Prover**: Constructs arguments using structured proofs
- **Verifier**: Identifies logical gaps
- **Reference Checker**: Validates citations
- **Formaliser**: Translates to Lean for computational verification
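As a concrete illustration of the architecture, the adversarial Prover/Verifier loop might look like the following. This is our own sketch with hypothetical prompt and function names; Alethfeld itself ships as prompts, not orchestration code.

```python
# Illustrative sketch of an adversarial prover/verifier loop.
# Prompt texts and function names are hypothetical.

PROVER_PROMPT = "Write a Lamport-style structured proof of: {goal}"
VERIFIER_PROMPT = (
    "Check each numbered step of this proof for logical gaps. "
    "Reply ACCEPT, or list the flawed step numbers:\n{proof}"
)

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion API; swap in a real client here."""
    raise NotImplementedError

def verify_adversarially(goal: str, max_rounds: int = 5) -> str | None:
    proof = call_llm(PROVER_PROMPT.format(goal=goal))
    for _ in range(max_rounds):
        verdict = call_llm(VERIFIER_PROMPT.format(proof=proof))
        if verdict.strip().startswith("ACCEPT"):
            return proof  # semantically vetted; still needs Lean checking
        # Feed the objections back to the Prover and retry.
        proof = call_llm(
            f"Revise this proof to fix the objections.\n"
            f"Proof:\n{proof}\nObjections:\n{verdict}"
        )
    return None  # no proof survived adversarial review
```

In this picture, the Reference Checker and Formaliser would run as further passes over an accepted proof before anything reaches Lean.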
The adversarial dynamic forces rigor: when the Prover submitted an incorrect solution to an HMMT Olympiad problem, Lean's type system caught an unstated sign assumption on a variable that the semantic verifier had missed, demonstrating the necessity of dual-layer validation.
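To see the kind of gap a type-checker catches, consider a toy Lean 4 / Mathlib example of ours (unrelated to the actual HMMT problem): over ℝ, x ≤ y does not imply x·x ≤ y·y, and Lean refuses the proof until the sign hypothesis is stated explicitly.

```lean
import Mathlib

-- Toy example (not the HMMT problem): monotonicity of squaring needs a
-- sign hypothesis that informal arguments often leave implicit.
example (x y : ℝ) (hx : 0 ≤ x) (hxy : x ≤ y) : x * x ≤ y * y :=
  mul_self_le_mul_self hx hxy

-- Without `hx` the statement is false (take x = -2, y = 1), so any
-- attempted proof fails to typecheck rather than slipping through.
```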
### Beyond Olympiads: Research-Grade Results
Alethfeld's most significant achievement isn't solving curated competition problems but verifying novel research lemmas. Osborne confirmed its effectiveness on quantum Boolean functions, producing a previously unpublished result verified in Lean. The system currently handles:
- Undergraduate to early-PhD level proofs
- Domains with strong Mathlib coverage (analysis, linear algebra)
- Pre-submission verification
### The Open-Science Imperative
Alethfeld deliberately contrasts with proprietary systems like DeepMind's AlphaProof:
1. MIT-licensed on GitHub (github.com/tobiasosborne/alethfeld)
2. No platform dependencies
3. Uses existing LLM APIs (Claude Opus recommended; see the minimal call sketched after this list)
4. 100% prompt-engineered architecture
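To make point 3 concrete, here is a minimal sketch of what "no platform dependencies" means in practice, using the off-the-shelf anthropic Python SDK. The model string and prompt are placeholders, not taken from Alethfeld; check current model names before running.

```python
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from env

client = anthropic.Anthropic()

# Any chat-completion endpoint works; a Claude Opus model is recommended.
response = client.messages.create(
    model="claude-opus-4-1",  # placeholder; substitute a current model name
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Verify the following structured proof step by step: ...",
    }],
)
print(response.content[0].text)
```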
"Scientists shouldn’t need permission, a waitlist, or a subscription to verify their own proofs," Osborne argues. While context limits constrain very long proofs and niche domains, Alethfeld proves sophisticated verification requires no specialized infrastructure—just disciplined prompt design and open methodology. Researchers are invited to test and improve the system, advancing trustworthy LLM-assisted mathematics.