Dynamic Languages Faster and Cheaper in 13-Language Claude Code Benchmark
#AI

Backend Reporter

A comprehensive benchmark of Claude Code's performance across 13 programming languages reveals that implementations in dynamic languages like Ruby, Python, and JavaScript are significantly faster and cheaper to generate, while statically typed languages incur 1.4-2.6x higher costs and longer generation times.

The efficiency gaps are large enough that they could reshape how teams approach language choice in AI-assisted development workflows.

The Benchmark Setup

Ruby committer Yusuke Endoh designed an experiment to measure how efficiently Claude Code (Opus 4.6) could generate working implementations across different programming languages. The task was deliberately simple yet comprehensive: implement a simplified version of Git with core functionality including init, add, commit, log, status, diff, checkout, and reset operations.
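
To give a sense of the task's scale, here is a minimal sketch of what the `init` and `add` commands might look like in Ruby, the benchmark's fastest language. All names here (`MiniGit`, `.minigit`, `blob_id`) are illustrative, not taken from the actual generated implementations, which live in the benchmark repository.

```ruby
require "fileutils"

# A toy Git-like "init" and "add", sketching the shape of the benchmark task.
module MiniGit
  DIR = ".minigit"

  # init: create the object store and an empty staging index.
  def self.init
    FileUtils.mkdir_p(File.join(DIR, "objects"))
    File.write(File.join(DIR, "index"), "")
  end

  # add: store the file's contents as a blob named by its content hash,
  # then record the hash/path pair in the index. Returns the blob id.
  def self.add(path)
    data = File.read(path)
    id = blob_id(data)
    File.write(File.join(DIR, "objects", id), data)
    File.open(File.join(DIR, "index"), "a") { |f| f.puts "#{id} #{path}" }
    id
  end

  # Placeholder content hash (the benchmark used its own custom algorithm,
  # which the article does not specify): a byte checksum plus length, in hex.
  def self.blob_id(data)
    format("%04x%08x", data.sum & 0xFFFF, data.bytesize)
  end
end
```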

The experiment was split into two phases. Phase v1 implemented the basic commands from an empty directory, while phase v2 extended the project with additional functionality. Each configuration was run 20 times per phase, and with the type-checked variants (such as Python with mypy and Ruby with Steep) counted alongside the 13 base languages, the two phases produced 600 runs in total.

To ensure fair comparison, Endoh used a custom hash algorithm rather than SHA-256, eliminating differences in library dependencies across languages. This isolation allowed the benchmark to focus purely on language-level differences rather than ecosystem advantages.
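
The article does not name the algorithm Endoh used, but a hash like FNV-1a illustrates the property that matters: a content hash implementable in a few lines with no library dependency in any of the 13 languages.

```ruby
# Illustrative only -- a dependency-free 64-bit FNV-1a hash, standing in
# for the benchmark's unspecified custom algorithm.
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME  = 0x100000001b3
MASK64       = 0xFFFFFFFFFFFFFFFF

def fnv1a_64(str)
  str.each_byte.inject(FNV64_OFFSET) { |h, b| ((h ^ b) * FNV64_PRIME) & MASK64 }
end

puts format("%016x", fnv1a_64("hello world"))
```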

The Results: Dynamic Languages Dominate

Dynamic languages emerged as clear winners in both speed and cost metrics. Ruby averaged $0.36 per run at 73.1 seconds, Python came in at $0.38 per run and 74.6 seconds, and JavaScript at $0.39 per run and 81.1 seconds. All three had low variance and passed all tests across all 40 runs.

From fourth place onward, costs rose sharply and variance increased dramatically. Go averaged $0.50 at 101.6 seconds but with a standard deviation of 37 seconds. Rust averaged $0.54 with the widest spread of all, a standard deviation of 54.8 seconds, and was one of only two languages with test failures.
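
The spread figures quoted here are ordinary sample statistics over the per-language runs. A short sketch of how such numbers are computed (the run times below are made up for illustration):

```ruby
# Sample mean and standard deviation over per-run wall-clock times,
# the kind of statistic behind the benchmark's spread figures.
def mean(xs)
  xs.sum(0.0) / xs.size
end

def stddev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum(0.0) { |x| (x - m)**2 } / (xs.size - 1))
end

times = [70.2, 75.9, 68.4, 180.0, 91.3] # hypothetical seconds per run
puts format("mean=%.1fs sd=%.1fs", mean(times), stddev(times))
```

A single outlier run, like the 180-second one above, inflates the standard deviation sharply, which is why high-variance languages like Go and Rust look less predictable even when their averages are close.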

C was the most expensive mainstream language at $0.74, weighed down by generating 517 lines of code compared to Ruby's 219 lines.

Type Systems Come at a Cost

The type system findings may be the most practically useful result for teams evaluating AI coding workflows. Adding mypy strict checking to Python made it 1.6 to 1.7 times slower. Adding Steep type checking to Ruby imposed an even larger penalty, making it 2.0 to 3.2 times slower than plain Ruby.
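
Part of the Steep overhead is structural: Ruby's type signatures live in separate RBS files that the model must author alongside the implementation. A hypothetical signature file for a small Git-like module (all names illustrative) looks like this:

```rbs
# Hypothetical sig/mini_git.rbs -- Steep checks the Ruby code against
# signatures like these, which the model must also generate.
module MiniGit
  DIR: String

  def self.init: () -> void
  def self.add: (String path) -> String
end
```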

TypeScript was notably more expensive than JavaScript, averaging $0.62 versus $0.39, despite producing similar line counts. The author notes that the overhead is not just from generating type annotations but likely from higher thinking-token usage as the model reasons about type constraints.

Limitations and Context

Endoh is transparent about the limitations. He is a Ruby committer and flags that bias. The generated programs are roughly 200 lines of code, firmly at prototyping scale, and he acknowledges that static typing may prove advantageous in larger codebases.

The experiment was supported by Anthropic's Claude for Open Source Program, which provided six months of free Claude Max access. The benchmark only measures generation cost and speed, not code quality, maintainability, or runtime performance.

Community Response and Debate

Discussion on Lobsters challenged whether prototyping-scale conclusions can be drawn from 200-line outputs, with one commenter noting that very few useful prototypes are that small. Others pointed out that the benchmark does not account for ecosystem advantages, where languages with strong package ecosystems would require less generated code for real-world tasks.

A commenter on the DEV Community post raised a qualitative concern: that a 2x speedup is potentially offset if the generated code is harder to modify later, and that Rust and Haskell test failures should not simply be categorized as bugs, since stricter type systems are designed to catch errors early rather than letting them reach production.

Endoh addresses several of these points directly. On scale, he agrees that a larger benchmark would be valuable but notes the difficulty of designing one that is fair across 15 languages. On the 2x speed difference, he argues that in iterative AI-assisted development, the gap between waiting 30 seconds and 60 seconds matters for developer flow, though he concedes the difference becomes irrelevant if future models reduce generation times to sub-second levels.

On ecosystem effects, he deliberately excluded library dependencies to isolate language-level differences, using a custom hash function for exactly this reason.

The Technical Details

Out of 600 total runs, only 3 produced failures: two in Rust and one in Haskell. In one Rust failure log, the agent claimed the tests were wrong, which the author identified as a hallucination since all other Rust trials succeeded.

The full dataset, including per-run results, execution logs, and all generated source code, is available in the benchmark repository for anyone who wants to examine the raw data or reproduce the results.

What This Means for Development Teams

The benchmark suggests that for AI-assisted prototyping and development, dynamic languages offer significant efficiency advantages. The 1.4-2.6x gap in generation cost and time between Ruby, Python, and JavaScript on one side and statically typed languages like Rust or Go on the other could meaningfully impact developer productivity in iterative workflows.

However, teams must weigh these efficiency gains against the potential benefits of static typing for larger codebases, where type safety becomes increasingly valuable. The benchmark's focus on small-scale prototyping means it doesn't capture the full picture of language trade-offs in production environments.

The results also highlight how AI coding assistants may have different strengths and weaknesses with different language paradigms, suggesting that tool selection and language choice should be considered together rather than in isolation.

The experiment raises interesting questions about the future of AI-assisted development. As models become faster and more capable, will the current efficiency differences matter less? Or will the fundamental differences in how AI models reason about different language paradigms persist even as raw generation speeds improve?

For now, teams using Claude Code for rapid prototyping might find significant advantages in choosing dynamic languages, while those prioritizing long-term maintainability and type safety may still prefer the benefits that static typing provides, even at the cost of slower generation times.
