# Vending-Bench 2: New Benchmark Exposes AI's Limits in Long-Horizon Business Simulation

In an era where AI agents autonomously code for hours and edge toward managing real businesses, long-term coherence is paramount. Andon Labs' **Vending-Bench 2**, released via their evals platform, sets a new standard by tasking models with running a simulated vending machine empire over 365 days, starting with $500 and aiming to maximize the end-of-year bank balance. Drawing from real-world deployments, like AI vending machines hawking $500 tungsten cubes, this benchmark injects gritty realism: adversarial suppliers, supply disruptions, customer refunds, and a novel multi-agent Arena for cutthroat competition.
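
The mechanics are simple to state even though the harness itself isn't reproduced here. Below is a minimal, hypothetical skeleton of the scoring contract, assuming a day-step loop; all names are illustrative, not Andon Labs' code:

```python
from dataclasses import dataclass

@dataclass
class VendingRun:
    """Hypothetical sketch of a Vending-Bench-style run (illustrative only)."""
    balance: float = 500.0   # starting capital, per the benchmark
    day: int = 0
    DAYS: int = 365          # one simulated year
    DAILY_FEE: float = 2.0   # flat daily machine fee noted later in the article

    def step(self, revenue: float, expenses: float) -> None:
        """Advance one simulated day, applying cash flows and the fixed fee."""
        self.balance += revenue - expenses - self.DAILY_FEE
        self.day += 1

    def score(self) -> float:
        """The single metric: end-of-year bank balance, uncapped."""
        return self.balance
```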

## Leaderboard: Progress, But No Saturation

Averaged across five runs, frontier models show improvement over prior evals, yet none cracks robust profitability. The current standings:

| Rank | Model | Final Balance |
|------|-------|---------------|
| 1 | Gemini 3 Pro | $5,478.16 |
| 2 | Claude Sonnet 4.5 | $3,838.74 |
| 3 | Grok 4 | $1,999.46 |
| 4 | GPT-5.1 | $1,473.43 |
| 5 | Gemini 2.5 Pro | $573.64 |

Gemini 3 Pro dominates through unwavering tool consistency (no mid-run degradation) and shrewd supplier hunting. Rather than gambling on drawn-out negotiations, it locks in optimal prices upfront, pushing soda cans from $1.50 down to $0.50-$0.60 wholesale.

> **Gemini 3 Pro in action:** After a supplier quotes $1.50/can, it fires back: "These prices are quite high... I'm looking for true wholesale pricing closer to $0.50 - $0.60 per can... What is the absolute best price you can offer?" (Source: Andon Labs Vending-Bench 2 analysis)
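
That pushback pattern is easy to caricature in code. The sketch below is an assumed three-branch policy; the target band comes from the quote above, but the walk-away ceiling and the decision rule itself are invented for illustration:

```python
def counter_offer(quoted_price: float,
                  target_low: float = 0.50,
                  target_high: float = 0.60,
                  walk_away: float = 2.00) -> str:
    """Illustrative pushback policy in the spirit of the exchange above.

    The walk-away ceiling and branch structure are assumptions, not the
    model's actual decision rule.
    """
    if quoted_price <= target_high:
        return f"Accept at ${quoted_price:.2f} per can."
    if quoted_price > walk_away:
        return "Decline and keep hunting for a true wholesaler."
    return (f"These prices are quite high. I'm looking for true wholesale "
            f"pricing closer to ${target_low:.2f}-${target_high:.2f} per can. "
            f"What is the absolute best price you can offer?")

print(counter_offer(1.50))  # a $1.50 quote triggers the wholesale-band counter
```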


Conversely, GPT-5.1 falters from excessive trust, greenlighting $2.40 sodas and $6 energy drinks, even prepaying sketchy orders that vanish when suppliers fold.


## Arena Mode: Competition Cranks Up the Heat

Vending-Bench Arena pits agents against each other at a single location, fueling price wars, trades, or uneasy alliances, while each agent is still scored individually. This multi-agent twist amplifies strategic depth, and GPT-5.1 in particular crumbles under rivalry.
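
To show why shared-location competition bites, here is a toy model of one Arena day, assuming a price-only split of a common customer pool. It is purely illustrative; the real Arena's dynamics (trades, alliances, supply shocks) are richer than this:

```python
def arena_day(prices: dict[str, float], customers: int = 100,
              unit_cost: float = 0.60) -> dict[str, float]:
    """Score one shared-location day: cheaper machines pull more traffic.

    Invented demand split for illustration, not Andon Labs' simulation.
    """
    pull = {name: 1.0 / price for name, price in prices.items()}
    total_pull = sum(pull.values())
    profits = {}
    for name, price in prices.items():
        units = round(customers * pull[name] / total_pull)
        profits[name] = round(units * (price - unit_cost), 2)
    return profits

# A price war: undercutting wins foot traffic but squeezes margin.
print(arena_day({"agent_a": 2.00, "agent_b": 1.25}))
# -> {'agent_a': 53.2, 'agent_b': 40.3}
```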

## Evolutions from Vending-Bench 1

Refined from the original, version 2 incorporates deployment learnings:

- Adversarial dynamics: Suppliers bait-and-switch or price-gouge; honest ones still haggle hard.
- Resilience tests: Deliveries lag and partners go bankrupt, demanding backup supply chains.
- Customer chaos: Unsolicited refund demands.
- Enhanced tools: Note-taking and reminders for long-horizon planning.
- Streamlined scoring: Pure profit focus, with $2 daily machine fees and token costs ($100/M output tokens); a back-of-the-envelope sketch of this cost model follows below.
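
A back-of-the-envelope view of that cost model. Caveats: the article doesn't break out output tokens from the 60-100M per-run total quoted below, so the output volume here is a guess, and whether costs are deducted continuously or at scoring time isn't specified; only the two rates come from the source:

```python
def apply_run_costs(gross_balance: float, days: int = 365,
                    output_tokens_millions: float = 2.0) -> float:
    """Deduct the two stated cost drains from a run's gross balance.

    Per the article: $2/day machine fee, $100 per million *output* tokens.
    The 2M output-token default is an assumption.
    """
    machine_fees = 2.0 * days                      # $2/day -> $730 over a year
    token_costs = 100.0 * output_tokens_millions   # $100 per 1M output tokens
    return gross_balance - machine_fees - token_costs

# Example: a run grossing $7,000 nets 7,000 - 730 - 200 = $6,070.
print(apply_run_costs(7_000.0))
```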

Agents wield email for supplier and customer communications, internet search, and inventory checks, generating 3,000-6,000 messages and 60-100M tokens per run. Context is trimmed at ~69K tokens, mimicking real deployment constraints. The system prompt thrusts models into full agency: "Do whatever it takes to maximize your bank account balance."
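
The ~69K-token trim is the detail most relevant to agent builders. The article doesn't say how the cut is made, so the oldest-first eviction below is an assumed policy, and the chars-per-token estimate is deliberately crude:

```python
def trim_context(messages: list[dict], budget_tokens: int = 69_000) -> list[dict]:
    """Evict the oldest non-system turns once the context exceeds the budget.

    Assumed policy: the source states only that context trims at ~69K
    tokens, not how; keeping the system prompt and dropping the oldest
    turns is one common approach.
    """
    def approx_tokens(msg: dict) -> int:
        return len(msg["content"]) // 4   # rough 4-chars-per-token heuristic

    system, history = messages[:1], messages[1:]
    while history and sum(map(approx_tokens, system + history)) > budget_tokens:
        history.pop(0)                    # drop the oldest turn first
    return system + history
```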

## Behavioral Insights: Strengths and Blind Spots

- Supplier savvy: Models cluster suppliers into honest/adversarial buckets, with Gemini favoring reliable ones for outsized returns (a toy version of this bookkeeping follows below).
- Negotiation variance: Gemini persists where others fold.
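
That bucketing behavior amounts to simple bookkeeping over delivery history. A toy tracker, assuming delivery reliability is the signal; the thresholds are invented:

```python
from collections import defaultdict

class SupplierBook:
    """Toy honest/adversarial bucketing from delivery history (illustrative)."""

    def __init__(self) -> None:
        self.history = defaultdict(lambda: {"orders": 0, "delivered": 0})

    def record(self, supplier: str, delivered: bool) -> None:
        self.history[supplier]["orders"] += 1
        self.history[supplier]["delivered"] += int(delivered)

    def bucket(self, supplier: str) -> str:
        rec = self.history[supplier]
        if rec["orders"] < 3:
            return "unproven"                 # not enough history to judge
        rate = rec["delivered"] / rec["orders"]
        return "honest" if rate >= 0.8 else "adversarial"
```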

Yet analytical lapses persist: no model reverse-engineers the underlying demand equations or pivots to high-margin exotics like family-size Doritos.
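
To make that lapse concrete: with a handful of (price, units sold) observations, a constant-elasticity demand curve can be fit in a few lines, and the profit-maximizing price falls out analytically. The data points below are invented for illustration:

```python
import numpy as np

def fit_demand(prices: np.ndarray, units: np.ndarray) -> tuple[float, float]:
    """Fit units = a * price**b by least squares in log-log space.

    The slope b is the price elasticity; per the article, none of the
    benchmarked models attempts this kind of reverse-engineering.
    """
    b, log_a = np.polyfit(np.log(prices), np.log(units), 1)
    return float(np.exp(log_a)), float(b)

# Invented sales log: four price points and the units each moved.
a, b = fit_demand(np.array([1.50, 2.00, 2.50, 3.00]),
                  np.array([80, 55, 42, 33]))

# For constant elasticity b < -1 and unit cost c, profit is maximized
# at p* = c * b / (b + 1), the standard monopoly-pricing result.
cost = 0.60
p_star = cost * b / (b + 1)
print(f"elasticity = {b:.2f}, profit-maximizing price = ${p_star:.2f}")
```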

## The Headroom Ahead

Unlike pass/fail benchmarks, Vending-Bench's dollar metric has no cap. Andon Labs estimates a "good" human could hit ~$63K/year (roughly $170 of net profit per day) via prime item selection, 50% discounts, and sales optimization. A superintelligent AI might scale without bound by jailbreaking suppliers into handing over high-value stock for free. Today's ~$5K ceiling reveals untapped potential in planning, risk modeling, and execution, which is crucial as agents eye real economic roles. For developers, it's a call to forge AIs that don't just code, but thrive in the wilds of commerce.

Source: Andon Labs Vending-Bench 2