A practical benchmark compares six AI coding assistants on how well they turn reference images of the Pantheon into parametric OpenSCAD code, revealing trade‑offs between speed, geometric judgment, and the value of visual feedback loops.

OpenSCAD LLM Benchmark: Building the Pantheon

Why the Pantheon matters as a test case

The benchmark is not a sanity‑check for basic OpenSCAD syntax – every modern coding LLM can emit a correct cube() or cylinder(). The Pantheon sits in a sweet spot where OpenSCAD shines: it requires Boolean operations, radial symmetry, repeated columns, and clean extrusions, yet it is complex enough that a naïve “stack a few primitives” approach fails to capture the recognizable silhouette. A weak result still looks vaguely domed, but a strong one must respect the relationship between the drum, the portico, the dome rings, and the front façade.
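As a rough illustration of the operations named above, the following Python sketch emits OpenSCAD source for a crude massing model: a hollow drum (Boolean difference), a dome shell with an oculus, and a row of portico columns produced by a loop. The function name `pantheon_scad` and every dimension are hypothetical placeholders, not values taken from any benchmark run.

```python
def pantheon_scad(drum_r=20.0, drum_h=20.0, wall=3.0, oculus_r=4.0,
                  n_columns=8, column_r=0.8, column_h=12.0,
                  portico_w=28.0, portico_gap=6.0):
    """Return OpenSCAD source for a crude Pantheon massing model.

    All dimensions are illustrative assumptions in arbitrary units.
    """
    if n_columns < 2:
        raise ValueError("need at least two columns to define a spacing")
    inner_r = drum_r - wall
    spacing = portico_w / (n_columns - 1)   # distance between column centres
    x0 = -portico_w / 2                     # x position of the first column
    y = -(drum_r + portico_gap)             # row sits in front of the drum
    return f"""$fn = 96;

// Drum: outer cylinder minus a slightly taller inner cylinder.
difference() {{
    cylinder(h = {drum_h:.2f}, r = {drum_r:.2f});
    translate([0, 0, -1]) cylinder(h = {drum_h + 2:.2f}, r = {inner_r:.2f});
}}

// Dome: spherical shell with the lower half and the oculus removed.
translate([0, 0, {drum_h:.2f}]) difference() {{
    sphere(r = {drum_r:.2f});
    sphere(r = {inner_r:.2f});
    translate([0, 0, {-drum_r:.2f}])
        cube([{2 * drum_r + 2:.2f}, {2 * drum_r + 2:.2f}, {2 * drum_r:.2f}], center = true);
    cylinder(h = {drum_r + 1:.2f}, r = {oculus_r:.2f});
}}

// Portico: one row of evenly spaced columns in front of the drum.
for (i = [0 : {n_columns - 1}])
    translate([{x0:.2f} + i * {spacing:.2f}, {y:.2f}, 0])
        cylinder(h = {column_h:.2f}, r = {column_r:.2f});
"""
```

Writing the returned string to a `.scad` file gives the kind of "stack a few primitives" baseline the paragraph describes: it reads as domed, but it says nothing yet about the proportions between the drum, the portico and the dome rings.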

How the test was run

Prompt – “see two reference images and build a .scad file implementing the Pantheon. Use the OpenSCAD CLI to render previews, iterate, and stop when you are happy.”
Reference images – a front elevation and an aerial view combined into a single PNG (generated with ffmpeg).
Evaluation criteria –
- Quality (geometric fidelity, detail density, correct proportions) – scored 0‑5.
- Speed (total wall‑clock time from start to final STL) – scored 0‑5.
Toolchain – All agents ran on the same macOS machine with OpenSCAD on the PATH, rendering PNG previews for each iteration.
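A preview loop of this kind can be sketched as a small wrapper around the OpenSCAD command line. The helper names `preview_command` and `render_preview`, the file names and the image size are assumptions for illustration; the post does not say which flags the agents actually used.

```python
import subprocess


def preview_command(scad_file, png_file, width=1024, height=768, overrides=None):
    """Build (but do not run) an OpenSCAD CLI call that renders a PNG preview.

    `overrides` maps variable names to numeric values passed with -D.
    """
    cmd = ["openscad", "-o", str(png_file),
           f"--imgsize={width},{height}", "--autocenter", "--viewall"]
    for name, value in sorted((overrides or {}).items()):
        cmd += ["-D", f"{name}={value}"]
    cmd.append(str(scad_file))
    return cmd


def render_preview(scad_file, png_file, **kwargs):
    """Run the command; raises CalledProcessError if OpenSCAD reports failure."""
    subprocess.run(preview_command(scad_file, png_file, **kwargs), check=True)
```

For example, `render_preview("pantheon.scad", "iter_01.png", overrides={"drum_r": 21})` would render one iteration, assuming `openscad` is on the PATH as it was in the benchmark.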

The table below lists the six runs; the sections after it expand on the most illustrative observations.

Evidence from the six runs

Client / Model	Time Score	Quality Score	Notes
Cursor Composer 2.5	5/5 (fastest)	1/5	Produced a rotunda, dome and columns in under a minute, but missed material nuance and proportion.
Codex Desktop 5.5	4/5	4/5	Dense geometry, includes an inscription on the entablature; final STL diverged from preview, penalising the quality score.
Claude Code 2.1 / Opus	3/5	3/5	Clearer portico and stepped base than Cursor, but uniform coloring made details compete for attention.
Claude Code 2.1 / Sonnet	2/5	3.4/5	Clean silhouette, well‑balanced proportions, the most coherent autonomous model before Antigravity.
Google Antigravity 2.0 / Gemini 3.5 Flash	1/5	4.5/5	Used real Pantheon dimensions, added the coffered interior ceiling, and exported a cut‑away view; the slowest but highest‑quality autonomous run.
ModelRift / Gemini Flash 3.0 (human‑in‑the‑loop)	1/5	3.8/5	Visual annotation loop let a human correct column spacing and roof detail, yielding a more coherent model than the pure autonomous attempts.

What the numbers tell us

Speed does not predict quality. Cursor finished in under a minute but produced a placeholder; Sonnet took several minutes and delivered a clean, well‑proportioned silhouette; and the slowest autonomous run, Antigravity, scored highest on quality.
Geometric judgment is the bottleneck. All agents could launch OpenSCAD and render PNGs without friction. The variance appears when the model must decide where a column belongs or how thick a drum wall should be.
Preview ≠ final mesh. Codex’s PNG previews looked impressive, yet the exported STL introduced surface artifacts that lowered its final score. A separate mesh‑validation step is essential for any production pipeline.
Human visual feedback matters. The ModelRift run shows that pointing at a problem on a render (instead of describing it in text) speeds up correction and improves the final geometry.
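For the "Preview ≠ final mesh" point, one common validation step is an edge-count check: in a closed, watertight mesh every edge is shared by exactly two triangles. The sketch below is a minimal version of that idea, not the check used in the benchmark. It assumes the mesh has already been loaded as triangles of vertex indices with duplicate vertices merged (STL files store raw coordinates per triangle, so a welding step would be needed first), and the function name `edge_problems` is hypothetical.

```python
from collections import Counter


def edge_problems(triangles):
    """Return undirected edges that are not shared by exactly two triangles.

    `triangles` is a list of (a, b, c) vertex-index tuples. An empty result
    is necessary for a watertight mesh but not sufficient: it does not
    detect self-intersections or flipped normals.
    """
    counts = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return {edge: n for edge, n in counts.items() if n != 2}
```

A closed tetrahedron passes with an empty result, while removing one face reports the three boundary edges, which is the kind of hole a PNG preview can easily hide.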

Counter‑perspectives

“OpenSCAD is too restrictive for real‑world architecture.”

Critics argue that constructive solid geometry cannot capture the organic curves of many historic buildings. The Pantheon deliberately avoids those pitfalls, but the benchmark still exposes OpenSCAD’s limits: the interior coffers are approximated with square cut‑outs, and the marble texture is reduced to flat colors. For projects that need free‑form sculpting, a mesh‑based tool like Blender remains preferable.

“Autonomous generation is already good enough.”

The Antigravity run demonstrates that a top‑tier LLM can locate real‑world measurements, plan a parametric model, and render a cut‑away view without human input. Yet the run took twelve minutes and still required a post‑process check for export fidelity. In a production environment where iteration cycles are measured in seconds, the extra latency and the risk of hidden geometry errors keep fully autonomous pipelines from replacing the annotation loop.

“Cost outweighs benefit for high‑end models.”

Gemini 3.5 Flash costs roughly three times more per token than the Gemini 1.5 Flash baseline. The quality gain (3.8 → 4.5) is modest compared with the price jump, especially when a cheaper model combined with visual feedback can reach a comparable score. Teams must weigh token cost against the marginal improvement in detail.
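The trade-off can be put in numbers with a small calculation. This sketch takes the two quality scores and the roughly threefold price ratio at face value from the paragraph above; pairing them directly is an assumption, since the cheaper run also relied on human input, and the helper name `relative_gain` is hypothetical.

```python
def relative_gain(baseline, improved):
    """Fractional improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


quality_gain = relative_gain(3.8, 4.5)   # scores from the results table
price_ratio = 3.0                        # "roughly three times", as stated in the text
gain_per_price_multiple = quality_gain / price_ratio
```

Under these assumptions the quality score improves by about 18% for about three times the per-token price.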

Takeaways for the developer community

OpenSCAD remains a viable target language for LLM‑generated CAD. Its text‑first nature aligns with how language models reason about structure, and the deterministic CLI makes automated preview loops trivial.
Tool access is no longer the limiting factor; geometric reasoning is. Future model improvements should focus on spatial judgment—understanding proportions, symmetry, and architectural hierarchy—rather than merely invoking the correct primitives.
Human‑in‑the‑loop annotation is a pragmatic bridge. Visual notes on a rendered PNG let engineers correct column spacing or dome height in seconds, a workflow that outperforms pure text prompts for fine‑grained adjustments.
Benchmarking must separate preview quality from export quality. A model that produces beautiful PNGs but broken STL meshes is not ready for 3D‑printing pipelines; downstream mesh validation should be part of any benchmark.
Cost‑aware model selection matters. When a cheaper model can achieve a 3.8/5 score with visual feedback, the extra expense of a premium Flash model may not be justified for most internal tooling.

What’s next?

ModelRift plans to extend the benchmark suite with two additional structures: the Parthenon (emphasising colonnades) and a modernist pavilion (testing planar extrusion and parametric façade panels). By diversifying the geometry types, we hope to surface more nuanced failure modes and to measure whether newer LLM releases close the gap between autonomous speed and human‑augmented quality.

For a deeper dive into ModelRift’s annotation workflow, see the post “Building a better OpenSCAD customizer”.

