Benchmarking the Next Generation: GPT‑5.1, Gemini 3 Pro, and Claude Opus 4.5 in Full‑Stack MVP Development
The Experiment
In a recent deep dive, the author set out to test whether the high SWE‑bench scores of the latest large language models translate into real‑world, production‑grade code. The challenge: build Speakit, a minimum viable product that extracts text from URLs or PDFs, plays it back with TTS controls, and presents a clean reading interface—all on a local machine using the models’ APIs. (A minimal sketch of this pipeline appears at the end of this section.)
The three contenders were:
- GPT‑5.1 Codex Max – the newest OpenAI model, marketed as an engineering‑grade assistant.
- Gemini 3 Pro – Google’s flagship, boasting a 74.2% SWE‑bench score.
- Claude Opus 4.5 – Anthropic’s latest offering, scoring 74.4% on the same benchmark.
The prompt was delivered via Cursor, ensuring the models’ architectural decisions were exposed rather than wrapped in a hosted UI.
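To make the build target concrete, the sketch below shows the two core pieces every model had to deliver: pulling readable text from a URL and reading it aloud through the browser’s SpeechSynthesis API. This is an illustrative TypeScript sketch, not code from any of the generated repositories; the helper names extractTextFromUrl and speak are invented for this article.

```typescript
// Illustrative sketch of Speakit's core pipeline (not taken from the benchmarked repos).
// Assumes a modern browser environment with fetch, DOMParser, and the Web Speech API.

// Fetch a page and strip markup to get a rough plain-text version of its content.
// A production build would use a proper readability/extraction library instead.
async function extractTextFromUrl(url: string): Promise<string> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Failed to fetch ${url}: ${res.status}`);
  const html = await res.text();
  const doc = new DOMParser().parseFromString(html, "text/html");
  // Drop obvious non-content elements before reading textContent.
  doc.querySelectorAll("script, style, nav, footer").forEach((el) => el.remove());
  return (doc.body.textContent ?? "").replace(/\s+/g, " ").trim();
}

// Read the extracted text aloud with a basic playback-speed control.
function speak(text: string, rate = 1): SpeechSynthesisUtterance {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = rate;           // playback speed
  window.speechSynthesis.cancel(); // stop anything already playing
  window.speechSynthesis.speak(utterance);
  return utterance;
}

// Usage:
// const text = await extractTextFromUrl("https://example.com/article");
// speak(text, 1.25);
```

Fetching arbitrary pages directly from the browser runs into CORS restrictions, which is one reason a real implementation would route extraction through a server (for example Next.js API routes or an Express backend) rather than the client, and PDF input would need a dedicated parser on top of this.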
Speed and Iteration
The first metric that mattered was how quickly each model could generate a runnable app and iterate on missing features. Below are the raw times recorded:
| Metric | Gemini 3 Pro | GPT‑5.1 Codex Max | Claude Opus 4.5 |
|---|---|---|---|
| Initial generation | 11 m 30 s | 8 m 20 s | 16 m 50 s |
| Verification pass | 4 m | 1 m | 5 m 10 s |
| Total iteration | 15 m 30 s | 9 m 20 s | 22 m 00 s |
“The iteration speed of GPT‑5.1 was the fastest, but the build process required manual fixes that slowed the overall cycle.” – Author
Gemini’s total time was competitive, but GPT‑5.1’s leaner initial generation gave it an edge in early prototyping.
Code Quality & Architecture
Measured with Lizard for cyclomatic complexity and maintainability, the three codebases turned out markedly different.
| Metric | Gemini 3 Pro | GPT‑5.1 Codex Max | Claude Opus 4.5 |
|---|---|---|---|
| Tech stack | Next.js, React, Tailwind | Vite, React, Express | Next.js, React, Tailwind |
| Lines of code | 1,038 | 961 | 1,714 |
| Avg CCN | 2.1 | 3.2 | 3.1 |
| Accessibility | 95% | 69% | 98% |
| SEO | 100% | 82% | 100% |
Gemini’s compact codebase and low average complexity indicate a disciplined, production‑ready approach. GPT‑5.1’s unconventional Vite + Express stack introduced accessibility gaps, while Opus’s bloated codebase suffered from over‑engineering.
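For readers who want to reproduce the complexity figures, the following is a minimal sketch of driving Lizard from a Node script; the source directory is an assumption, the exact layout of Lizard’s report can vary between versions, and the tool can just as easily be run straight from the shell.

```typescript
// Minimal sketch: run Lizard (pip install lizard) over a generated project and
// print its report. Lizard handles JavaScript and TypeScript, among other languages,
// and its summary includes the per-function NLOC and average CCN figures quoted above.
import { execSync } from "node:child_process";

function complexityReport(srcDir: string): string {
  return execSync(`lizard ${srcDir}`, { encoding: "utf8" });
}

console.log(complexityReport("./src"));
```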
Feature Completeness
The core Speakit requirements were split into three categories: Core (input extraction), Playback (TTS), and UI/UX. The results were stark.
| Feature | Gemini 3 Pro | GPT‑5.1 Codex Max | Claude Opus 4.5 |
|---|---|---|---|
| Core (URL/PDF) | ⚠️ Mostly (PDF failed) | ✅ Complete | ⚠️ Mixed |
| Playback (TTS) | ✅ Complete | ✅ Complete | 🔴 Major failures |
| UI/UX | ⚠️ Functional but raw | ⚠️ Limited | ✅ Polished |
| Auth & Data | 🔴 Failed | 🟡 Partial | ⚠️ Partial |
“Gemini delivered the cleanest code and the highest feature completeness, but still struggled with authentication.” – Author
Opus produced a polished landing page but missed critical TTS controls, underscoring its design‑oriented bias. GPT‑5.1 succeeded in PDF extraction but suffered from state‑management bugs that prevented playback from stopping on refresh.
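The refresh bug described above usually comes down to never cancelling the speech queue when the page is torn down. A minimal sketch of the kind of cleanup hook that guards against it, assuming a React component tree and the Web Speech API (and not taken from the GPT‑5.1 repository), might look like this:

```typescript
// Hypothetical cleanup hook: cancel any queued or active utterances when the
// player component unmounts or the page is about to reload.
import { useEffect } from "react";

export function useSpeechCleanup(): void {
  useEffect(() => {
    const stop = () => window.speechSynthesis.cancel();
    window.addEventListener("beforeunload", stop);
    return () => {
      stop();
      window.removeEventListener("beforeunload", stop);
    };
  }, []);
}
```

Calling the hook once in the player component ties playback to the component’s lifecycle instead of letting it leak across reloads.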
The Verdict
Benchmark scores on SWE‑bench do not guarantee a shipping‑ready product. The practical takeaways are:
- Gemini 3 Pro – Best for engineers who need a fast, maintainable MVP. Its low complexity and solid feature set make it the most reliable starting point.
- Claude Opus 4.5 – Ideal for teams that prioritize UI polish and can afford to refactor logic. Its design talent shines, but developers must trim boilerplate.
- GPT‑5.1 Codex Max – Offers flexibility to deviate from the Next.js norm, but requires extra hand‑holding for accessibility and build stability.
“Success hinges on understanding each model’s personality—Gemini for engineering, Opus for design, GPT for experimentation—and being ready to act as the senior engineer who reviews their pull requests.” – Author
Implications for the Field
As LLMs evolve, the line between code generation and code engineering will blur. However, this study confirms that human oversight remains essential. Teams should treat AI assistants as powerful collaborators rather than autonomous coders.
The repositories generated during the test are available for review.
For a deeper dive into the methodology and raw data, refer to the original blog post: https://www.hansreinl.de/blog/ai-coding-benchmark-gpt-5-1-gemini-3-opus-4-5