Andon Labs launches Vending-Bench 2, a year-long simulation in which AI models manage a vending machine business amid adversarial suppliers, delivery delays, and competition. Gemini 3 Pro leads with a final balance of $5,478, but all frontier models lag far behind the estimated human baseline, exposing gaps in coherence, negotiation, and strategy. The multi-agent Arena variant intensifies these challenges and shows how much work remains before agents can run a business autonomously.
Vending-Bench 2: New Benchmark Exposes AI's Limits in Long-Horizon Business Simulation

In an era where AI agents autonomously code for hours and edge toward managing real businesses, long-term coherence is paramount. Andon Labs' Vending-Bench 2, released via their evals platform, sets a new standard by tasking models with running a simulated vending machine empire over 365 days—starting with $500 and aiming to maximize end-of-year bank balance. Drawing from real-world deployments like AI vending machines hawking $500 tungsten cubes, this benchmark injects gritty realism: adversarial suppliers, supply disruptions, customer refunds, and a novel multi-agent Arena for cutthroat competition.
Leaderboard: Progress, But No Saturation
Averaged across five runs, frontier models improve on earlier evals, yet none comes close to saturating the benchmark. Every agent starts with $500. The current standings:
| Rank | Model | Final Balance |
|---|---|---|
| 1 | Gemini 3 Pro | $5,478.16 |
| 2 | Claude Sonnet 4.5 | $3,838.74 |
| 3 | Grok 4 | $1,999.46 |
| 4 | GPT-5.1 | $1,473.43 |
| 5 | Gemini 2.5 Pro | $573.64 |
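
To put the table in perspective, the short sketch below subtracts the $500 starting balance from each final balance and compares the result with the roughly $63K that Andon Labs estimates a good human could reach (a figure cited later in this article). The constant and function names are illustrative, not part of the benchmark.

```python
# Figures copied from the leaderboard above; names are illustrative only.
START_BALANCE = 500.00       # every agent starts with $500
HUMAN_ESTIMATE = 63_000.00   # approximate human figure quoted by Andon Labs

final_balances = {
    "Gemini 3 Pro": 5478.16,
    "Claude Sonnet 4.5": 3838.74,
    "Grok 4": 1999.46,
    "GPT-5.1": 1473.43,
    "Gemini 2.5 Pro": 573.64,
}


def summarize(balances):
    """Return (model, net gain in USD, percent of the human estimate) per model."""
    rows = []
    for model, balance in balances.items():
        net_gain = round(balance - START_BALANCE, 2)
        pct_of_human = round(balance / HUMAN_ESTIMATE * 100, 1)
        rows.append((model, net_gain, pct_of_human))
    return rows


for model, net_gain, pct in summarize(final_balances):
    print(f"{model:<18} net ${net_gain:>9,.2f}   {pct:>4.1f}% of human estimate")
```

On these numbers, the leading model ends the year with under a tenth of the estimated human result.
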
Gemini 3 Pro leads through consistent tool use, with no mid-run degradation, and through persistent supplier hunting. It secures favorable prices upfront instead of gambling on later negotiation, for example countering a $1.50-per-can soda quote with a target of $0.50-$0.60 wholesale.
Gemini 3 Pro in action: After a supplier quotes $1.50/can, it fires back: "These prices are quite high... I'm looking for true wholesale pricing closer to $0.50 - $0.60 per can... What is the absolute best price you can offer?" — Source: Andon Labs Vending-Bench 2 analysis
Conversely, GPT-5.1 falters from excessive trust: it agrees to pay $2.40 per soda and $6 per energy drink, and even prepays for questionable orders that are never delivered because the supplier goes out of business.

Arena Mode: Competition Cranks Up the Heat
Vending-Bench Arena pits agents against each other at one location, fueling price wars, trades, or uneasy alliances—all while scoring individually. This multi-agent twist amplifies strategic depth, where GPT-5.1 particularly crumbles under rivalry.
Evolutions from Vending-Bench 1
Refined from the original, version 2 incorporates deployment learnings:
- Adversarial dynamics: Suppliers bait-and-switch or gouge; honest ones still haggle hard.
- Resilience tests: Deliveries lag and suppliers go bankrupt, so agents need backup supply chains.
- Customer chaos: Unsolicited refund demands.
- Enhanced tools: Note-taking, reminders for planning.
- Streamlined scoring: Pure profit focus, with $2 daily machine fees and token costs ($100/M output).
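
The cost side of the scoring can be worked out with simple arithmetic. The sketch below assumes the $2 daily fee is charged on each of the 365 days and that token cost scales linearly at $100 per million output tokens; the output-token count in the example is a hypothetical figure chosen for illustration, not a number from the benchmark.

```python
# Cost parameters as stated in the article; the function name is illustrative.
DAILY_FEE_USD = 2.00
OUTPUT_TOKEN_PRICE_PER_MILLION_USD = 100.00
DAYS_IN_RUN = 365


def operating_costs(output_tokens, days=DAYS_IN_RUN):
    """Return machine fees, token cost, and their total in USD."""
    machine_fees = DAILY_FEE_USD * days
    token_cost = output_tokens / 1_000_000 * OUTPUT_TOKEN_PRICE_PER_MILLION_USD
    return {
        "machine_fees": machine_fees,
        "token_cost": token_cost,
        "total": machine_fees + token_cost,
    }


# Hypothetical example: an agent that emits 2 million output tokens in a run.
costs = operating_costs(output_tokens=2_000_000)
print(costs)
```

Under these assumptions the machine fees alone come to $730 over the year, more than the $500 starting balance, so an agent that sells nothing ends the run in the red.
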
Agents use email to communicate with suppliers and customers, along with internet search and inventory checks, generating 3,000-6,000 messages and 60-100M tokens per run. Context is trimmed at roughly 69K tokens, mimicking real deployment constraints, so the agent cannot keep its full history in view. The system prompt gives models full agency: "Do whatever it takes to maximize your bank account balance."
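
The article does not say how the context is trimmed. One common approach, sketched below as an assumption, is a sliding window that keeps the most recent messages that fit within the token budget and drops older ones. The whitespace-based `count_tokens` helper is a crude stand-in for a real tokenizer, and all names here are hypothetical.

```python
from collections import deque


def count_tokens(text):
    """Crude token estimate (whitespace split); a real tokenizer would differ."""
    return len(text.split())


def trim_context(messages, budget=69_000):
    """Keep the newest messages whose combined token estimate fits the budget."""
    kept = deque()
    used = 0
    # Walk from newest to oldest and stop at the first message that overflows.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.appendleft(message)  # preserve chronological order
        used += cost
    return list(kept)


# Tiny demonstration with a budget of 5 "tokens".
history = ["a b c", "d e", "f g h i", "j"]
trimmed = trim_context(history, budget=5)
print(trimmed)
```

With a window like this, anything the agent does not write down is eventually lost, which is one reason note-taking and reminder tools matter over a 365-day run.
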
Behavioral Insights: Strengths and Blind Spots
- Supplier savvy: Models cluster suppliers into honest/adversarial buckets, with Gemini favoring reliable ones for outsized returns.
- Negotiation variance: Gemini persists where others fold.
Yet, analytical lapses persist—no model reverse-engineers demand equations or pivots to high-margin exotics like family-size Doritos.
The Headroom Ahead
Unlike pass/fail benchmarks, Vending-Bench's dollar metric has no cap. Andon Labs estimates a "good" human could hit ~$63K/year via prime items, 50% discounts, and sales optimization. A superintelligent AI might scale far beyond that, for example by jailbreaking suppliers into handing over high-value stock for free. Today's best result of roughly $5.5K reveals untapped potential in planning, risk modeling, and execution, all of which matter as agents move toward economic roles. For developers, it's a call to build AIs that don't just code, but thrive in the wilds of commerce. Source: Andon Labs Vending-Bench 2
