The Experiment

In a recent deep dive, the author set out to test whether the high SWE‑bench scores of the latest large language models translate into real‑world, production‑grade code. The challenge: build Speakit, a minimum viable product that extracts text from URLs or PDFs, plays it back with TTS controls, and presents a clean reading interface, all on a local machine using the models’ APIs.
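To make that core requirement concrete: extracting readable text from a URL can be as simple as fetching the page and stripping markup. The sketch below is purely illustrative and is not taken from any of the generated apps; it assumes Node 18+ (for the global fetch) and uses naive tag stripping rather than a real readability or PDF pipeline, with extractTextFromUrl as a hypothetical helper name.

```typescript
// Minimal sketch of "extract text from a URL". Illustrative only;
// not code from any of the generated Speakit repositories.
async function extractTextFromUrl(url: string): Promise<string> {
  const res = await fetch(url); // Node 18+ global fetch
  if (!res.ok) throw new Error(`Fetch failed: ${res.status}`);
  const html = await res.text();

  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop scripts
    .replace(/<style[\s\S]*?<\/style>/gi, "")   // drop styles
    .replace(/<[^>]+>/g, " ")                   // strip remaining tags
    .replace(/\s+/g, " ")                       // collapse whitespace
    .trim();
}

// Example usage:
// extractTextFromUrl("https://example.com").then(console.log);
```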

The three contenders were:

  • GPT‑5.1 Codex Max – the newest OpenAI model, marketed as an engineering‑grade assistant.
  • Gemini 3 Pro – Google’s flagship, boasting a 74.2 % SWE‑bench score.
  • Claude Opus 4.5 – Anthropic’s latest offering, edging slightly ahead at 74.4 % on the same benchmark.

The prompt was delivered via Cursor, so each model’s architectural decisions were exposed rather than hidden behind a hosted UI.

Speed and Iteration

The first metric that mattered was how quickly each model could generate a runnable app and iterate on missing features. Below are the raw times recorded:

| Metric | Gemini 3 Pro | GPT‑5.1 Codex Max | Claude Opus 4.5 |
| --- | --- | --- | --- |
| Initial generation | 11 m 30 s | 8 m 20 s | 16 m 50 s |
| Verify generation | 4 m 00 s | 1 m 00 s | 5 m 10 s |
| Total iteration | 15 m 30 s | 9 m 20 s | 22 m 00 s |

“The iteration speed of GPT‑5.1 was the fastest, but the build process required manual fixes that slowed the overall cycle.” – Author

Gemini’s total time was competitive, but GPT‑5.1’s leaner initial generation gave it an edge in early prototyping.

Code Quality & Architecture

Using Lizard to measure cyclomatic complexity as a rough proxy for maintainability, the author found that the models produced markedly different codebases.

| Metric | Gemini 3 Pro | GPT‑5.1 Codex Max | Claude Opus 4.5 |
| --- | --- | --- | --- |
| Tech stack | Next.js, React, Tailwind | Vite, React, Express | Next.js, React, Tailwind |
| Lines of code | 1,038 | 961 | 1,714 |
| Avg CCN | 2.1 | 3.2 | 3.1 |
| Accessibility | 95 % | 69 % | 98 % |
| SEO | 100 % | 82 % | 100 % |

Gemini’s lean stack and low complexity score indicate a disciplined, production‑ready approach. GPT‑5.1’s unconventional Vite + Express stack introduced accessibility gaps, while Opus’s bloated codebase suffered from over‑engineering.
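For context on the Avg CCN row: cyclomatic complexity counts independent paths through a function, starting at 1 and adding one per decision point, so Gemini's 2.1 average means a typical function contains roughly one branch. The author's numbers come from Lizard, which parses the source properly; the sketch below is only a back-of-the-envelope approximation based on keyword counting (it will miscount keywords inside strings and comments), with estimateCCN as a hypothetical helper.

```typescript
// Naive cyclomatic-complexity estimate: 1 + number of decision points.
// Illustrative only; Lizard parses the code instead of using regexes.
function estimateCCN(source: string): number {
  const decisionPatterns = [
    /\bif\b/g, /\bfor\b/g, /\bwhile\b/g, /\bcase\b/g,
    /\bcatch\b/g, /&&/g, /\|\|/g, /\?/g,
  ];
  return decisionPatterns.reduce(
    (ccn, pattern) => ccn + (source.match(pattern)?.length ?? 0),
    1, // a function with no branches has CCN 1
  );
}

// Example: two decision points ("if" and "&&") -> CCN 3
console.log(estimateCCN("if (a && b) { doWork(); }"));
```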

Feature Completeness

The core Speakit requirements were grouped into four areas: Core (input extraction), Playback (TTS), UI/UX, and Auth & Data. The results were stark.

| Feature | Gemini 3 Pro | GPT‑5.1 Codex Max | Claude Opus 4.5 |
| --- | --- | --- | --- |
| Core (URL/PDF) | ✅ Mostly (PDF failed) | ✅ Complete | ⚠️ Mixed |
| Playback (TTS) | ✅ Complete | ✅ Complete | 🔴 Major failures |
| UI/UX | ⚠️ Functional but raw | ⚠️ Limited | ✅ Polished |
| Auth & Data | 🔴 Failed | 🟡 Partial | ⚠️ Partial |

“Gemini delivered the cleanest code and the highest feature completeness, but still struggled with authentication.” – Author

Opus produced a polished landing page but missed critical TTS controls, underscoring its design‑oriented bias. GPT‑5.1 succeeded in PDF extraction but suffered from state‑management bugs that prevented playback from stopping on refresh.
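That stop-on-refresh failure is a familiar pitfall if the app relies on the browser's Web Speech API, where queued utterances must be explicitly cancelled; the post does not name the TTS backend, so treat that as an assumption. A minimal React-style cleanup sketch (useSpeechCleanup is a hypothetical hook, not code from the generated repos) might look like this:

```typescript
import { useEffect } from "react";

// Defensive cleanup for browser TTS: cancel any queued speech when the
// component unmounts or the page is refreshed/closed. Assumes the app
// uses the Web Speech API (window.speechSynthesis).
export function useSpeechCleanup(): void {
  useEffect(() => {
    const stop = () => window.speechSynthesis.cancel();
    window.addEventListener("beforeunload", stop);
    return () => {
      stop(); // also stop on unmount
      window.removeEventListener("beforeunload", stop);
    };
  }, []);
}
```

Calling such a hook from the player component ensures a refresh or route change never leaves an utterance running.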

The Verdict

Benchmark scores on SWE‑bench do not guarantee a shipping‑ready product. The practical takeaways are:

  • Gemini 3 Pro – Best for engineers who need a fast, maintainable MVP. Its low complexity and solid feature set make it the most reliable starting point.
  • Claude Opus 4.5 – Ideal for teams that prioritize UI polish and can afford to refactor logic. Its design talent shines, but developers must trim boilerplate.
  • GPT‑5.1 Codex Max – Offers flexibility to deviate from the Next.js norm, but requires extra hand‑holding for accessibility and build stability.

“Success hinges on understanding each model’s personality—Gemini for engineering, Opus for design, GPT for experimentation—and being ready to act as the senior engineer who reviews their pull requests.” – Author

Implications for the Field

As LLMs evolve, the line between code generation and code engineering will blur. However, this study confirms that human oversight remains essential. Teams should treat AI assistants as powerful collaborators rather than autonomous coders.

The repositories generated during the test are available for review via the original blog post.

For a deeper dive into the methodology and raw data, refer to the original blog post: https://www.hansreinl.de/blog/ai-coding-benchmark-gpt-5-1-gemini-3-opus-4-5