When Anthropic released its compact Claude Haiku 4.5 model, the tech community rushed to benchmark its capabilities. But one researcher took an unconventional approach: testing how effectively AI models navigate complex text adventures. The results expose fascinating cost-performance tradeoffs in today's leading language models.

The Text Adventure Arena

Unlike standard benchmarks, text adventures require models to demonstrate:
- Multi-step reasoning to solve puzzles
- Context tracking across evolving narratives
- Creative problem-solving within constrained environments

The researcher's benchmark measured achievement progress across seven games, including Lost Pig and 9:05, with each model's results compared against Gemini 2.5 Flash's:

| Model              | Perf. vs Flash | Cost ($/M tokens) |
|--------------------|----------------|-------------------|
| Claude Sonnet 4.5  | +12%           | 15.0              |
| GPT-5              | +10%           | 10.0              |
| Gemini 2.5 Flash   | 0% (baseline)  | ~2.5              |
| Claude Haiku 4.5   | -1%            | 5.0               |
| Gemini 2.5 Pro     | -6%            | 10.0              |
| GLM 4.6            | -10%           | 1.8               |

(Gemini 2.5 Flash's cost is not stated outright; it is inferred from the researcher's "twice as much" and "six times" comparisons below.)

Key Findings

  1. Haiku's Price-Performance Paradox: While matching Gemini 2.5 Flash's achievement rate, Haiku costs twice as much per million tokens. As the researcher notes, "I would not use it for this," given the economics.

  2. Surprise Underperformers: More expensive models such as Gemini 2.5 Pro and Grok 4 (the latter not shown in the table above) delivered worse results than the cheaper Gemini 2.5 Flash, possibly because "overly systematic explorations" hindered their progress.

  3. The Sonnet Premium: Claude Sonnet 4.5 leads on performance but at six times Gemini Flash's cost, raising questions about value for token-intensive applications. The sketch after this list works through the cost multiples.
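
To make these economics concrete, here is a small Python sketch that recomputes each model's cost multiple over Flash from the table's figures. The ~$2.5/M Flash baseline is not quoted in the source and is inferred from the "twice as much" and "six times" comparisons above:

```python
# Cost multiples relative to Gemini 2.5 Flash, using the table's figures.
# FLASH_COST is an inferred baseline, not a number quoted in the source.
FLASH_COST = 2.5  # $/M tokens

models = {
    # name: (performance vs Flash in %, cost in $/M tokens)
    "Claude Sonnet 4.5": (+12, 15.0),
    "GPT-5":             (+10, 10.0),
    "Claude Haiku 4.5":  (-1,   5.0),
    "Gemini 2.5 Pro":    (-6,  10.0),
    "GLM 4.6":           (-10,  1.8),
}

for name, (perf, cost) in models.items():
    print(f"{name:18s} perf {perf:+3d}%  cost {cost / FLASH_COST:.1f}x Flash")
```

Running it reproduces the multiples cited above: Haiku lands at 2.0x Flash and Sonnet at 6.0x, while GLM 4.6 comes in below Flash's price at 0.7x.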

A Methodology Rethink

The stark cost differences sparked a radical proposal: replace 'turn budgets' with 'cash budgets'. Rather than giving each model an equal number of turns, the researcher suggests giving every model the same dollar budget, so the number of tokens it may spend depends on its price:

"What we really should do is give models a cash budget... cutting them off when they reach a predefined limit that depends on their cost. This would give Sonnet 4.5 a sixth of the time Gemini 2.5 Flash has to earn achievements."

This approach could reshape LLM benchmarking by building real-world cost constraints into the evaluation itself.

The Games That Test Best

Analysis revealed significant variation in game difficulty:
- 9:05 provided the most consistent challenge
- So Far introduced excessive noise due to achievement volatility
- Plundered Hearts and For a Change served as effective 'skill filters' where most models failed completely

The Bottom Line for Developers

While text adventures remain a niche benchmark, they reveal critical insights about LLM economics. As models converge on similar capabilities, cost-per-performance becomes the decisive factor for real-world applications. The proposed 'cash budget' methodology might soon reshape how we evaluate these systems.

Source: Haiku 4.5 Playing Text Adventures