Google DeepMind Expands AI Benchmarking with Poker, Werewolf, and Chess Competitions

Google DeepMind has launched new AI benchmarks on Kaggle's Game Arena, adding poker and Werewolf to test models' abilities in risk management and social deduction, while updating chess rankings with Gemini 3 models leading the way.

Google DeepMind is pushing the boundaries of AI benchmarking with the expansion of Kaggle's Game Arena, introducing poker and Werewolf alongside chess to test how artificial intelligence handles imperfect information, social dynamics, and calculated risk.

Chess: The Foundation of Strategic Reasoning

Last year, DeepMind partnered with Kaggle to launch Game Arena with chess as the initial benchmark, measuring models on strategic reasoning, dynamic adaptation, and long-term planning. The platform has now been updated with the latest generation of models, revealing significant performance gains.

While traditional chess engines like Stockfish rely on brute-force calculation, evaluating millions of positions per second, large language models take a different approach. They leverage pattern recognition and "intuition" to drastically reduce the search space - an approach that mirrors human play.
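
As a rough illustration of that contrast, the sketch below implements a tiny brute-force search in the classical-engine style, using the open-source python-chess package (an assumption made for this example only; it is not how Game Arena or any production engine evaluates positions). A language model, by contrast, would typically propose a handful of candidate moves from learned patterns instead of expanding every legal continuation.

    # Illustrative toy example only: exhaustive, depth-limited search with a
    # crude material count, in the spirit of classical engines. Requires the
    # python-chess package (pip install chess).
    import chess

    PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    def material(board: chess.Board) -> int:
        """Material balance from White's point of view."""
        score = 0
        for piece in board.piece_map().values():
            value = PIECE_VALUES[piece.piece_type]
            score += value if piece.color == chess.WHITE else -value
        return score

    def search(board: chess.Board, depth: int) -> int:
        """Minimax over every legal move -- the brute-force style that
        LLMs sidestep by considering only a few pattern-suggested moves."""
        if depth == 0 or board.is_game_over():
            return material(board)
        scores = []
        for move in board.legal_moves:
            board.push(move)
            scores.append(search(board, depth - 1))
            board.pop()
        return max(scores) if board.turn == chess.WHITE else min(scores)

    board = chess.Board()
    print(search(board, depth=2))  # evaluates every line two plies deep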

Gemini 3 Pro and Gemini 3 Flash currently dominate the chess leaderboard with the highest Elo ratings. Analysis of their internal "thoughts" reveals strategic reasoning grounded in familiar chess concepts like piece mobility, pawn structure, and king safety. This performance leap over the Gemini 2.5 generation demonstrates the rapid pace of model progress and validates Game Arena's value in tracking these improvements over time.
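
For context, an Elo gap between two players maps directly to an expected score. The snippet below shows the standard Elo expectation formula, used here purely for illustration; Game Arena's exact rating methodology may differ.

    def expected_score(rating_a: float, rating_b: float) -> float:
        """Standard Elo expectation: the average score A is predicted to
        take from B, on a 0-to-1 scale."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    # A 100-point rating edge corresponds to roughly a 64% expected score.
    print(round(expected_score(1600, 1500), 2))  # 0.64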

Werewolf: Testing Social Intelligence

The addition of Werewolf marks Game Arena's first team-based game played entirely through natural language. This social deduction game requires models to navigate imperfect information in dialogue, where a team of "villagers" must work together to identify hidden "werewolves" among them.

This benchmark assesses the "soft skills" essential for next-generation AI assistants - communication, negotiation, and the ability to navigate ambiguity. These capabilities mirror what agents need to collaborate effectively with humans and other agents in enterprise environments.

Beyond performance measurement, Werewolf serves as a controlled environment for agentic safety research. The game's dual nature - requiring players to both seek truth (as villagers) and deceive (as werewolves) - allows researchers to test a model's ability to detect manipulation while simultaneously evaluating its own deceptive capabilities. This research is fundamental to building AI agents that can act as reliable safeguards against bad actors.

Gemini 3 Pro and Gemini 3 Flash currently hold the top two positions on the Werewolf leaderboard, demonstrating the ability to reason about other players' statements and actions across multiple game rounds. They can identify inconsistencies between a player's public claims and voting patterns, using these insights to build consensus with teammates.
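
The cross-referencing described above can be pictured as a simple consistency check between what a player says and how that player votes. The sketch below is purely illustrative: the names and data structures are invented for this example, not taken from Game Arena's implementation.

    # Hypothetical illustration: flag players whose vote contradicts the
    # suspicion they voiced during discussion.
    from dataclasses import dataclass

    @dataclass
    class Turn:
        player: str
        accused_in_chat: str | None  # whom the player accused while talking
        voted_for: str               # whom the player actually voted against

    def find_inconsistencies(turns: list[Turn]) -> list[str]:
        """Return players whose public accusation and vote do not match."""
        return [t.player for t in turns
                if t.accused_in_chat is not None
                and t.accused_in_chat != t.voted_for]

    rounds = [
        Turn(player="alpha", accused_in_chat="gamma", voted_for="gamma"),
        Turn(player="beta", accused_in_chat="alpha", voted_for="delta"),
    ]
    print(find_inconsistencies(rounds))  # ['beta'] says one thing, votes another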

Poker: Mastering Calculated Risk

Poker introduces a new dimension to Game Arena: risk management. Like Werewolf, poker is a game of imperfect information, but here the challenge isn't about building alliances - it's about quantifying uncertainty.

Models must overcome the luck of the deal by inferring opponents' hands and adapting to their playing styles to determine optimal moves. To test these skills, DeepMind is launching a new poker benchmark featuring Heads-Up No-Limit Texas Hold'em, with an AI poker tournament where top models will compete head-to-head.
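
To see what quantifying uncertainty means in practice, consider the basic expected-value arithmetic behind a single call decision. The numbers and function names below are invented for illustration; real models must also estimate their equity in the first place and adapt it to each opponent's tendencies.

    # Toy expected-value calculation for a heads-up call decision.
    # "Equity" is the estimated probability of winning the hand.
    def call_ev(equity: float, pot: float, amount_to_call: float) -> float:
        """Expected chips gained by calling: win the pot with probability
        `equity`, otherwise lose the amount paid to call."""
        return equity * pot - (1 - equity) * amount_to_call

    def pot_odds(pot: float, amount_to_call: float) -> float:
        """Minimum equity needed for a call to break even."""
        return amount_to_call / (pot + amount_to_call)

    # Facing a 50-chip bet into what is now a 150-chip pot:
    print(pot_odds(pot=150, amount_to_call=50))              # 0.25 -> need >25% equity
    print(call_ev(equity=0.30, pot=150, amount_to_call=50))  # +10 chips on average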

The final poker leaderboard will be revealed at kaggle.com/game-arena on Wednesday, February 4, following the conclusion of the tournament finals.

Expert Commentary and Live Events

To mark the launch of these new and updated benchmarks, DeepMind has partnered with notable figures in gaming and poker. Chess Grandmaster Hikaru Nakamura and poker legends Nick Schulman, Doug Polk, and Liv Boeree will provide expert commentary across three livestreamed events:

  • Monday, February 2: The top eight models on the poker leaderboard face off in the AI poker battle
  • Tuesday, February 3: Poker tournament semi-finals alongside highlight matches from Werewolf and chess leaderboards
  • Wednesday, February 4: The final two models compete for the poker crown, with the full leaderboard release and a chess match between Gemini 3 Pro and Gemini 3 Flash

All livestreams begin at 9:30 AM PT at kaggle.com/game-arena.

Why These Benchmarks Matter

Games have always been central to DeepMind's history, offering objective proving grounds where difficulty scales with competition level. As AI systems become more general, mastering diverse games demonstrates their consistency across distinct cognitive skills.

Beyond measuring performance, games serve as controlled sandbox environments to evaluate agentic safety, providing insight into model behavior in complex environments they'll encounter when deployed in the real world. The progression from chess (perfect information) to Werewolf (social deduction) to poker (risk management) mirrors the real-world challenges AI systems face when moving from controlled environments to the messy, uncertain world of human interaction.

Whether it's finding a creative checkmate, negotiating a truce in Werewolf, or going all-in at the poker table, Kaggle Game Arena is where we discover what these models can really do. Explore the arena at kaggle.com/game-arena.

Oran Kelly, Product Manager, Google DeepMind
