Vending-Bench 2: New Benchmark Exposes AI's Limits in Long-Horizon Business Simulation

Arena Mode: Competition Cranks Up the Heat
Vending-Bench Arena pits agents against each other at a single location, which can lead to price wars, trades, or uneasy alliances, while each agent is still scored individually. The multi-agent setup adds strategic depth, and GPT-5.1 in particular struggles under the competition.
Evolutions from Vending-Bench 1
Refined from the original, version 2 incorporates deployment learnings:
- Adversarial dynamics: Suppliers bait-and-switch or gouge; honest ones still haggle hard.
- Resilience tests: Deliveries lag and partners go bankrupt, demanding backup supply chains.
- Customer chaos: Unsolicited refund demands.
- Enhanced tools: Note-taking, reminders for planning.
- Streamlined scoring: Pure profit focus, with $2 daily machine fees and token costs ($100/M output).
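As a rough illustration of how those overheads eat into the balance, here is a minimal Python sketch. Only the $2 daily fee and the $100 per million output tokens come from the figures above; the function names, the 365-day run length, and the sample inputs are assumptions made for the example, not details of the benchmark's actual scoring code.

```python
DAILY_FEE_USD = 2.0             # from the article: $2 daily machine fee
OUTPUT_TOKEN_USD_PER_M = 100.0  # from the article: $100 per million output tokens


def overhead_usd(days: int, output_tokens: int) -> float:
    """Fixed costs charged over a run: machine fees plus output-token costs."""
    fees = DAILY_FEE_USD * days
    token_cost = OUTPUT_TOKEN_USD_PER_M * output_tokens / 1_000_000
    return fees + token_cost


def net_profit_usd(gross_margin_usd: float, days: int, output_tokens: int) -> float:
    """Hypothetical helper: sales margin minus the overheads above."""
    return gross_margin_usd - overhead_usd(days, output_tokens)


if __name__ == "__main__":
    # Assumed inputs: a 365-day run, 2M output tokens, $5,000 gross margin.
    print(overhead_usd(365, 2_000_000))
    print(net_profit_usd(5_000.0, 365, 2_000_000))
```

Under these assumed inputs, the daily fee alone costs $730 over a year, so an agent that merely breaks even on sales still ends the run in the red.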
Agents use email to communicate with suppliers and customers, alongside internet search and inventory checks, generating 3,000-6,000 messages and 60-100M tokens per run. Context is trimmed at ~69K tokens, mimicking real-world constraints. The system prompt gives models full agency: "Do whatever it takes to maximize your bank account balance."
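The article does not say how the context is trimmed. One common approach, shown here purely as an assumption, is a sliding window that keeps the most recent messages that fit within the budget and drops the oldest. The function name `trim_context` and the whitespace-based token count are hypothetical stand-ins; a real harness would use the model's own tokenizer.

```python
from typing import Callable, List, Optional


def trim_context(
    messages: List[str],
    max_tokens: int = 69_000,  # budget taken from the ~69K figure in the article
    count_tokens: Optional[Callable[[str], int]] = None,
) -> List[str]:
    """Keep the newest messages whose combined token count fits the budget."""
    # Assumption: approximate tokens by whitespace-separated words.
    count = count_tokens or (lambda text: len(text.split()))
    kept: List[str] = []
    total = 0
    # Walk from newest to oldest, stopping at the first message that overflows.
    for message in reversed(messages):
        cost = count(message)
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = ["a b c", "d e", "f g h i"]
    print(trim_context(history, max_tokens=6))
```

A window like this explains why the note-taking and reminder tools matter: anything older than the window is lost unless the agent wrote it down.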
Behavioral Insights: Strengths and Blind Spots
- Supplier savvy: Models cluster suppliers into honest/adversarial buckets, with Gemini favoring reliable ones for outsized returns.
- Negotiation variance: Gemini persists where others fold.
Yet analytical lapses persist: no model reverse-engineers the demand equations or pivots to high-margin exotics such as family-size Doritos.
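To make "reverse-engineering demand" concrete, the sketch below fits a straight-line demand curve to observed price and sales pairs, then solves for the price that maximizes profit. The linear form, the function names, and the sample data are all assumptions for illustration; the simulation's actual demand equations are not described in the article.

```python
from typing import Sequence, Tuple


def fit_linear_demand(prices: Sequence[float], units: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of units = intercept + slope * price."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_q = sum(units) / n
    sxx = sum((p - mean_p) ** 2 for p in prices)
    sxy = sum((p - mean_p) * (q - mean_q) for p, q in zip(prices, units))
    slope = sxy / sxx
    intercept = mean_q - slope * mean_p
    return intercept, slope


def profit_maximizing_price(intercept: float, slope: float, unit_cost: float) -> float:
    """Price maximizing (p - cost) * (intercept + slope * p), assuming slope < 0."""
    # Setting the derivative to zero gives p = (cost - intercept / slope) / 2.
    return (unit_cost - intercept / slope) / 2


if __name__ == "__main__":
    # Hypothetical observations: daily units sold at three test prices.
    a, b = fit_linear_demand([1.0, 2.0, 3.0], [16.0, 12.0, 8.0])
    print(a, b)
    print(profit_maximizing_price(a, b, unit_cost=1.0))
```

An agent that ran a few deliberate price experiments and fit even a crude model like this would have a principled basis for pricing, which is the kind of analysis the models reportedly skip.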
The Headroom Ahead
Unlike pass/fail benchmarks, Vending-Bench's dollar metric has no cap. Andon Labs estimates a "good" human could hit ~$63K/year via prime items, 50% discounts, and sales optimization. Superintelligent AIs might scale without limit by jailbreaking suppliers into handing over high-value stock for free. Today's $5K ceiling reveals untapped potential in planning, risk modeling, and execution, all crucial as agents eye economic roles. For developers, it's a call to build AIs that don't just code, but thrive in the wilds of commerce.
Source: Andon Labs Vending-Bench 2