# Vending-Bench 2: New Benchmark Exposes AI's Limits in Long-Horizon Business Simulation

In an era where AI agents autonomously code for hours and edge toward managing real businesses, long-term coherence is paramount. Andon Labs' **Vending-Bench 2**, released via their evals platform, sets a new standard by tasking models with running a simulated vending machine empire over 365 days, starting with $500 and aiming to maximize the end-of-year bank balance. Drawing from real-world deployments, like AI vending machines hawking $500 tungsten cubes, this benchmark injects gritty realism: adversarial suppliers, supply disruptions, customer refunds, and a novel multi-agent Arena for cutthroat competition.
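
The mechanics are simple to state even though the harness itself isn't reproduced here. Below is a minimal, hypothetical skeleton of the scoring contract, assuming a day-step loop; all names are illustrative, not Andon Labs' code:

```python
from dataclasses import dataclass

@dataclass
class VendingRun:
    """Hypothetical sketch of a Vending-Bench-style run (illustrative only)."""
    balance: float = 500.0   # starting capital, per the benchmark
    day: int = 0
    DAYS: int = 365          # one simulated year
    DAILY_FEE: float = 2.0   # flat daily machine fee noted later in the article

    def step(self, revenue: float, expenses: float) -> None:
        """Advance one simulated day, applying cash flows and the fixed fee."""
        self.balance += revenue - expenses - self.DAILY_FEE
        self.day += 1

    def score(self) -> float:
        """The single metric: end-of-year bank balance, uncapped."""
        return self.balance
```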

## Leaderboard: Progress, But No Saturation

Averaged across five runs, frontier models show improvement over prior evals, yet none cracks robust profitability. The current standings:

| Rank | Model | Final Balance |
|------|-------|---------------|
| 1 | Gemini 3 Pro | $5,478.16 |
| 2 | Claude Sonnet 4.5 | $3,838.74 |
| 3 | Grok 4 | $1,999.46 |
| 4 | GPT-5.1 | $1,473.43 |
| 5 | Gemini 2.5 Pro | $573.64 |

Gemini 3 Pro dominates through unwavering tool consistency (no mid-run degradation) and shrewd supplier hunting. Rather than gambling on drawn-out negotiations, it locks in optimal prices upfront, pushing soda cans from $1.50 down to $0.50-$0.60 wholesale.

> **Gemini 3 Pro in action:** After a supplier quotes $1.50/can, it fires back: "These prices are quite high... I'm looking for true wholesale pricing closer to $0.50 - $0.60 per can... What is the absolute best price you can offer?" (Source: Andon Labs Vending-Bench 2 analysis)
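
That pushback pattern is easy to caricature in code. The sketch below is an assumed three-branch policy; the target band comes from the quote above, but the walk-away ceiling and the decision rule itself are invented for illustration:

```python
def counter_offer(quoted_price: float,
                  target_low: float = 0.50,
                  target_high: float = 0.60,
                  walk_away: float = 2.00) -> str:
    """Illustrative pushback policy in the spirit of the exchange above.

    The walk-away ceiling and branch structure are assumptions, not the
    model's actual decision rule.
    """
    if quoted_price <= target_high:
        return f"Accept at ${quoted_price:.2f} per can."
    if quoted_price > walk_away:
        return "Decline and keep hunting for a true wholesaler."
    return (f"These prices are quite high. I'm looking for true wholesale "
            f"pricing closer to ${target_low:.2f}-${target_high:.2f} per can. "
            f"What is the absolute best price you can offer?")

print(counter_offer(1.50))  # a $1.50 quote triggers the wholesale-band counter
```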


Conversely, GPT-5.1 falters from excessive trust, greenlighting $2.40 sodas and $6 energy drinks, even prepaying sketchy orders that vanish when suppliers fold.


## Arena Mode: Competition Cranks Up the Heat

Vending-Bench Arena pits agents against each other at a single location, fueling price wars, trades, or uneasy alliances, while each agent is still scored individually. This multi-agent twist amplifies strategic depth, and GPT-5.1 in particular crumbles under rivalry.
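
To show why shared-location competition bites, here is a toy model of one Arena day, assuming a price-only split of a common customer pool. It is purely illustrative; the real Arena's dynamics (trades, alliances, supply shocks) are richer than this:

```python
def arena_day(prices: dict[str, float], customers: int = 100,
              unit_cost: float = 0.60) -> dict[str, float]:
    """Score one shared-location day: cheaper machines pull more traffic.

    Invented demand split for illustration, not Andon Labs' simulation.
    """
    pull = {name: 1.0 / price for name, price in prices.items()}
    total_pull = sum(pull.values())
    profits = {}
    for name, price in prices.items():
        units = round(customers * pull[name] / total_pull)
        profits[name] = round(units * (price - unit_cost), 2)
    return profits

# A price war: undercutting wins foot traffic but squeezes margin.
print(arena_day({"agent_a": 2.00, "agent_b": 1.25}))
# -> {'agent_a': 53.2, 'agent_b': 40.3}
```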

## Evolutions from Vending-Bench 1

Refined from the original, version 2 incorporates deployment learnings:

- Adversarial dynamics: Suppliers bait-and-switch or price-gouge; honest ones still haggle hard.
- Resilience tests: Deliveries lag and partners go bankrupt, demanding backup supply chains.
- Customer chaos: Unsolicited refund demands.
- Enhanced tools: Note-taking and reminders for long-horizon planning.
- Streamlined scoring: Pure profit focus, with $2 daily machine fees and token costs ($100/M output tokens); a back-of-the-envelope sketch of this cost model follows below.
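
A back-of-the-envelope view of that cost model. Caveats: the article doesn't break out output tokens from the 60-100M per-run total quoted below, so the output volume here is a guess, and whether costs are deducted continuously or at scoring time isn't specified; only the two rates come from the source:

```python
def apply_run_costs(gross_balance: float, days: int = 365,
                    output_tokens_millions: float = 2.0) -> float:
    """Deduct the two stated cost drains from a run's gross balance.

    Per the article: $2/day machine fee, $100 per million *output* tokens.
    The 2M output-token default is an assumption.
    """
    machine_fees = 2.0 * days                      # $2/day -> $730 over a year
    token_costs = 100.0 * output_tokens_millions   # $100 per 1M output tokens
    return gross_balance - machine_fees - token_costs

# Example: a run grossing $7,000 nets 7,000 - 730 - 200 = $6,070.
print(apply_run_costs(7_000.0))
```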

Agents wield email for supplier and customer communications, internet search, and inventory checks, generating 3,000-6,000 messages and 60-100M tokens per run. Context is trimmed at ~69K tokens, mimicking real deployment constraints. The system prompt thrusts models into full agency: "Do whatever it takes to maximize your bank account balance."
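
The ~69K-token trim is the detail most relevant to agent builders. The article doesn't say how the cut is made, so the oldest-first eviction below is an assumed policy, and the chars-per-token estimate is deliberately crude:

```python
def trim_context(messages: list[dict], budget_tokens: int = 69_000) -> list[dict]:
    """Evict the oldest non-system turns once the context exceeds the budget.

    Assumed policy: the source states only that context trims at ~69K
    tokens, not how; keeping the system prompt and dropping the oldest
    turns is one common approach.
    """
    def approx_tokens(msg: dict) -> int:
        return len(msg["content"]) // 4   # rough 4-chars-per-token heuristic

    system, history = messages[:1], messages[1:]
    while history and sum(map(approx_tokens, system + history)) > budget_tokens:
        history.pop(0)                    # drop the oldest turn first
    return system + history
```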

## Behavioral Insights: Strengths and Blind Spots

- Supplier savvy: Models cluster suppliers into honest/adversarial buckets, with Gemini favoring reliable ones for outsized returns (a toy version of this bookkeeping follows below).
- Negotiation variance: Gemini persists where others fold.
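
That bucketing behavior amounts to simple bookkeeping over delivery history. A toy tracker, assuming delivery reliability is the signal; the thresholds are invented:

```python
from collections import defaultdict

class SupplierBook:
    """Toy honest/adversarial bucketing from delivery history (illustrative)."""

    def __init__(self) -> None:
        self.history = defaultdict(lambda: {"orders": 0, "delivered": 0})

    def record(self, supplier: str, delivered: bool) -> None:
        self.history[supplier]["orders"] += 1
        self.history[supplier]["delivered"] += int(delivered)

    def bucket(self, supplier: str) -> str:
        rec = self.history[supplier]
        if rec["orders"] < 3:
            return "unproven"                 # not enough history to judge
        rate = rec["delivered"] / rec["orders"]
        return "honest" if rate >= 0.8 else "adversarial"
```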

Yet analytical lapses persist: no model reverse-engineers the underlying demand equations or pivots to high-margin exotics like family-size Doritos.
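
To make that lapse concrete: with a handful of (price, units sold) observations, a constant-elasticity demand curve can be fit in a few lines, and the profit-maximizing price falls out analytically. The data points below are invented for illustration:

```python
import numpy as np

def fit_demand(prices: np.ndarray, units: np.ndarray) -> tuple[float, float]:
    """Fit units = a * price**b by least squares in log-log space.

    The slope b is the price elasticity; per the article, none of the
    benchmarked models attempts this kind of reverse-engineering.
    """
    b, log_a = np.polyfit(np.log(prices), np.log(units), 1)
    return float(np.exp(log_a)), float(b)

# Invented sales log: four price points and the units each moved.
a, b = fit_demand(np.array([1.50, 2.00, 2.50, 3.00]),
                  np.array([80, 55, 42, 33]))

# For constant elasticity b < -1 and unit cost c, profit is maximized
# at p* = c * b / (b + 1), the standard monopoly-pricing result.
cost = 0.60
p_star = cost * b / (b + 1)
print(f"elasticity = {b:.2f}, profit-maximizing price = ${p_star:.2f}")
```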

## The Headroom Ahead

Unlike pass/fail benchmarks, Vending-Bench's dollar metric has no cap. Andon Labs estimates a "good" human could hit ~$63K/year (roughly $170 of net profit per day) via prime item selection, 50% discounts, and sales optimization. A superintelligent AI might scale without bound by jailbreaking suppliers into handing over high-value stock for free. Today's ~$5K ceiling reveals untapped potential in planning, risk modeling, and execution, which is crucial as agents eye real economic roles. For developers, it's a call to forge AIs that don't just code, but thrive in the wilds of commerce.

Source: Andon Labs Vending-Bench 2