New research demonstrates that applying Monte Carlo Tree Search to language models with PPO distillation significantly improves reasoning performance on combinatorial tasks, outperforming standard RL methods like GRPO.
In the world of artificial intelligence, game-playing neural networks like AlphaZero have achieved superhuman performance by combining a raw policy network with test-time search and distilling the stronger search strategy back into the network. Yet, until now, similar techniques haven't seen widespread adoption in language modeling.
A new research implementation explores whether tree search distillation can actually improve language model reasoning, and how it compares to standard reinforcement learning methods. The results, while preliminary, show promising improvements in a combinatorial reasoning task.
The Countdown Challenge
The researcher chose the game of Countdown as their testing environment: a combinatorial arithmetic problem in which players must combine a set of integers using the standard operations (+, -, /, *) to reach a target number. This choice was deliberate, the hypothesis being that combinatorial problems benefit more from the parallel, adaptive reasoning that tree search enables than sequential reasoning tasks do.
The experiments used Qwen-2.5-1.5B-Instruct as the base model, trained on 20,000 Countdown samples and evaluated on 820 test samples. Each sample consists of four input integers between 1 and 13.
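To make the task concrete, here is a brute-force solvability check for a Countdown instance. This is an illustrative sketch, not part of the research code, and it assumes the variant where each input number is used exactly once; exact rational arithmetic avoids floating-point division errors.

```python
import itertools
from fractions import Fraction

def countdown_solvable(nums, target):
    """Return True if +, -, *, / over the given numbers can reach the target,
    using each number exactly once."""
    def search(vals):
        if len(vals) == 1:
            return vals[0] == target
        # Pick an ordered pair, combine it, and recurse on the reduced list.
        for a, b in itertools.permutations(range(len(vals)), 2):
            rest = [vals[i] for i in range(len(vals)) if i not in (a, b)]
            x, y = vals[a], vals[b]
            candidates = [x + y, x - y, x * y]
            if y != 0:
                candidates.append(x / y)
            if any(search(rest + [c]) for c in candidates):
                return True
        return False
    return search([Fraction(n) for n in nums])
```

With four numbers between 1 and 13, the search space is small enough that this exhaustive recursion runs in well under a second per instance.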
MCTS for Language Models
Traditional MCTS algorithms face challenges when applied to language models. Unlike board games, where each move substantially affects the outcome, language model outputs contain many tokens that act as filler or syntactic glue, so branching from the top-k logits doesn't always produce meaningful search diversity.
The implementation adopted a Tree-of-Thoughts approach, where each node-state represents a sequence of contiguous tokens:
- Root nodes correspond to input prompts
- Intermediate nodes represent reasoning steps: ...
- Terminal nodes contain answers: ...
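A minimal sketch of this node layout might look as follows; the class and field names are my own, not identifiers from the published code.

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtNode:
    """A search-tree node whose state is a contiguous span of tokens:
    the prompt at the root, a reasoning step in the middle, an answer at a leaf."""
    tokens: list                                # token ids for this span
    prior: float = 0.0                          # policy prior assigned at expansion
    visit_count: int = 0
    value_sum: float = 0.0                      # accumulated backed-up values
    children: list = field(default_factory=list)

    def q_value(self) -> float:
        """Mean backed-up value; zero for unvisited nodes."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0
```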
Key innovations include:
- Parallel MCTS with 16 agents sharing the same search tree
- Virtual losses to encourage search diversity
- Using sequence-level summed logprobs with softmax for action priors
- A value head implemented as an MLP with tanh activation
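Two of these ideas, sequence-level priors and virtual losses, can be sketched together. The constants and dictionary fields below are illustrative assumptions, not values from the research code: priors come from a softmax over each candidate step's summed token logprobs, and a virtual loss is applied on selection so the 16 parallel agents spread out instead of piling onto the same branch.

```python
import math

def step_priors(candidate_logprobs):
    """Sequence-level priors: softmax over each candidate step's summed logprobs."""
    totals = [sum(lps) for lps in candidate_logprobs]
    m = max(totals)                      # subtract max for numerical stability
    exps = [math.exp(t - m) for t in totals]
    z = sum(exps)
    return [e / z for e in exps]

def select_child(children, c_puct=1.0, virtual_loss=1.0):
    """PUCT selection with a virtual loss: the chosen branch temporarily looks
    worse to the other parallel agents until the real value is backed up."""
    total_n = sum(ch["n"] for ch in children) + 1
    def score(ch):
        q = ch["w"] / ch["n"] if ch["n"] else 0.0
        u = c_puct * ch["prior"] * math.sqrt(total_n) / (1 + ch["n"])
        return q + u
    best = max(children, key=score)
    best["n"] += 1              # provisional visit
    best["w"] -= virtual_loss   # provisional loss, corrected at backup time
    return best
```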
The Distillation Process
In board games, training distills the search policy into the raw policy by minimizing the KL divergence between their move distributions. The language model implementation required a different approach because of the granularity mismatch between reasoning steps and tokens.
After MCTS iterations complete for a sample, a greedy selection process chooses trajectories based on maximum visit count. These trajectories are submitted to a shared buffer for PPO training. The training objective minimizes a total loss combining PPO, value, and KL divergence components.
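The combined objective might look like the following sketch. The coefficients, the clipped-surrogate form, and the exact KL formulation are assumptions on my part; the article only states that the total loss combines PPO, value, and KL divergence components.

```python
import numpy as np

def total_loss(logp_new, logp_old, advantages, values, returns, kl_to_ref,
               clip_eps=0.2, vf_coef=0.5, kl_coef=0.05):
    """Clipped PPO surrogate + value regression + KL penalty to a reference model."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -np.minimum(ratio * advantages, clipped * advantages).mean()
    value_loss = np.mean((values - returns) ** 2)   # trains the value head
    return policy_loss + vf_coef * value_loss + kl_coef * np.mean(kl_to_ref)
```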
Results and Implications
The distilled model achieved an asymptotic mean@16 evaluation score of 11.3% on the Countdown task, significantly outperforming both CISPO (8.4%) and best-of-N sampling (7.7%). Compared to the pre-RL instruct model (3.1%), this represents an 8.2 percentage point improvement.
"The part that excites me here is the additional reasoning knobs we can tune, like the number of parallel workers per tree, or the number of MCTS iterations," the researcher noted. "Initial experiments showed increasing both these values led to significant performance gains."
The implementation also revealed that "best-of-N" distillation underperformed both CISPO and MCTS, despite achieving higher training rewards. The researcher theorizes that because best-of-N only rewards the best sample, the model has no incentive to reason robustly on every attempt, so it never develops strategies that improve single-shot performance.
Technical Implementation
The experiments were conducted on an 8xH100 node from Andromeda, with six GPUs designated as generators and two as trainers. A Rust worker sampled questions from the dataset and submitted inference requests via gRPC, with selected trajectories written to a Redis stream. Weights were synced between generators and trainers every 8 gradient steps using Redis pub/sub.
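The actual worker is written in Rust, but the data flow can be sketched in Python with the redis-py client. The stream and channel names below are hypothetical; only the use of a Redis stream for trajectories and pub/sub for weight syncing comes from the article.

```python
import json

STREAM = "trajectories"      # hypothetical stream name
WEIGHTS_CHANNEL = "weights"  # hypothetical pub/sub channel name

def encode_trajectory(prompt, steps, reward):
    """Flatten one selected trajectory into a field map suitable for XADD."""
    return {"payload": json.dumps({"prompt": prompt, "steps": steps,
                                   "reward": reward})}

def submit_trajectory(r, prompt, steps, reward):
    """Append a trajectory to the shared stream the trainer processes consume."""
    r.xadd(STREAM, encode_trajectory(prompt, steps, reward))

def publish_weights(r, version, blob):
    """Broadcast a new checkpoint (e.g. every 8 gradient steps) to generators."""
    r.publish(WEIGHTS_CHANNEL, json.dumps({"version": version, "blob": blob}))
```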
The complete implementation is open source.
Future Directions
While promising, the researcher acknowledges these are small-scale experiments on a 1.5B model and larger models may behave differently. "It's possible this is a 'small model phenomenon', and the method doesn't scale as well as GRPO for larger models," they wrote.
The approach offers additional levers for scaling compute and raising reward ceilings that aren't available in standard RL methods. "Whereas it's not obvious to me that throwing 100x more compute at GRPO would have turned the plateau into a hockey stick," the researcher noted.
This research represents an important step toward bringing the benefits of search-based reasoning to language models, potentially opening new avenues for improving reasoning capabilities in AI systems. As compute budgets grow and models scale, techniques like tree search distillation could play an increasingly important role in developing more capable reasoning systems.
