Autoregressive models, pretrained on next-token prediction and fine-tuned with reinforcement learning (RL), power today's most advanced AI systems. Yet their token-by-token action generation creates a critical bottleneck: exploring complex environments becomes inefficient, especially when rewards are sparse. New research reveals how these models can transcend this limitation by discovering temporal abstractions within their own internal representations—a paradigm shift enabling hierarchical reasoning.

The Token-by-Token Trap

Traditional RL fine-tuning forces autoregressive models to sample actions sequentially, akin to typing one character at a time. This granularity impedes exploration in tasks requiring multi-step planning, as random token variations rarely stumble upon coherent, reward-yielding sequences. The inefficiency worsens with sparse rewards, where feedback arrives only after lengthy chains of correct actions.
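
A back-of-the-envelope illustration (ours, not the paper's) makes the scale concrete: if each step samples uniformly from V candidate tokens, the chance of blindly hitting one specific rewarded sequence of length n is V^-n, so the expected number of attempts grows exponentially with the planning horizon.

```python
# Back-of-the-envelope illustration (not from the paper): probability that
# uniform token-level exploration hits one specific rewarded sequence.
def p_hit(vocab_size: int, seq_len: int) -> float:
    """Chance of sampling one target action sequence purely at random."""
    return vocab_size ** -seq_len

for n in (5, 10, 20):
    p = p_hit(vocab_size=32, seq_len=n)  # 32 candidate actions per step
    print(f"length {n:2d}: p = {p:.1e}, expected attempts ~ {1 / p:.1e}")
```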

Latent Actions and Internal Controllers

The breakthrough, detailed in a recent arXiv paper, introduces a higher-order sequence model that manipulates the residual-stream activations of a base autoregressive model. This controller learns to do three things (sketched in code after the list):

  1. Compress action sequences into reusable "chunks" of behavior
  2. Predict termination conditions for these sequences
  3. Reinforce successful patterns directly within activation space
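
The article leaves the architecture open, so the following PyTorch sketch is only our minimal reading of the idea: the module names, the GRU encoder, the single-linear heads, and all dimensions are illustrative assumptions, not the authors' implementation. It shows the three roles above as separate heads over a chunk code derived from residual-stream activations.

```python
# Minimal sketch, under our own assumptions about shapes and heads.
import torch
import torch.nn as nn

class LatentController(nn.Module):
    """Higher-order model acting on a frozen base model's residual stream."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        # (1) Compress a window of residual-stream activations into a latent
        #     "chunk" code summarizing a temporally extended behavior.
        self.encode = nn.GRU(d_model, d_latent, batch_first=True)
        # Map the chunk code to an additive steering vector applied to the
        # residual stream while the chunk is active.
        self.steer = nn.Linear(d_latent, d_model)
        # (2) Predict when the current chunk should terminate.
        self.terminate = nn.Linear(d_latent, 1)
        # (3) A value head used to reinforce chunks that led to reward.
        self.value = nn.Linear(d_latent, 1)

    def forward(self, resid: torch.Tensor):
        # resid: (batch, time, d_model) residual-stream activations.
        _, h = self.encode(resid)       # h: (num_layers, batch, d_latent)
        z = h[-1]                       # chunk code for each sequence
        return {
            "steering": self.steer(z),  # added to the residual stream
            "stop_prob": torch.sigmoid(self.terminate(z)).squeeze(-1),
            "value": self.value(z).squeeze(-1),
        }

controller = LatentController(d_model=768, d_latent=64)
acts = torch.randn(2, 16, 768)          # dummy activations for illustration
out = controller(acts)
print(out["steering"].shape, out["stop_prob"].shape, out["value"].shape)
```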

"Composing controllers over time enables agents to explore novel tasks efficiently," note authors Kobayashi et al. "Each controller executes temporally extended actions that unfold over scales impractical for token-wise exploration."

Hierarchical Mastery in Practice

Experiments in grid-world and MuJoCo environments showed controllers spontaneously developing human-interpretable skills. For example:

  • Navigation sequences compressing pathfinding maneuvers
  • Object manipulation routines terminating upon task completion
  • Adaptive behaviors responding to environmental feedback loops

Crucially, this "internal RL" allowed agents to solve sparse-reward problems where standard RL fine-tuning failed. The controllers formed a hierarchy: the base model handled low-level token generation, while high-level strategy emerged from manipulating its activations.
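
Concretely, that hierarchy can be pictured as a two-level rollout loop. The sketch below is our own illustration with hypothetical stand-in functions (propose_chunk, base_model_step, should_terminate); a real system would call the learned controller and the frozen base model in their place.

```python
# Hedged sketch of the two-level rollout: the controller commits to a latent
# chunk, and the base model generates tokens under that chunk until the
# controller's termination signal fires.
import random

def rollout(env_steps: int = 30, max_chunk_len: int = 8) -> list[str]:
    trace = []
    t = 0
    while t < env_steps:
        chunk = propose_chunk()                 # high level: pick a latent chunk
        for _ in range(max_chunk_len):
            token = base_model_step(chunk)      # low level: token-wise generation
            trace.append(token)
            t += 1
            if should_terminate(chunk) or t >= env_steps:
                break                           # hand control back to the controller

    return trace

# Hypothetical stand-ins so the sketch runs on its own.
def propose_chunk() -> str:
    return random.choice(["navigate", "grasp", "release"])

def base_model_step(chunk: str) -> str:
    return f"{chunk}-token"

def should_terminate(chunk: str) -> bool:
    return random.random() < 0.25               # learned stop head in a real system

print(rollout())
```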

Why This Matters

Beyond solving specific RL challenges, this work reimagines how foundation models reason:

  • Efficiency: Reduces exploration steps by orders of magnitude
  • Generalization: Learned controllers transfer across related tasks
  • Interpretability: Latent action chunks map to intuitive behaviors

As foundation models tackle increasingly complex real-world problems—from robotics to scientific discovery—internal RL provides a path toward scalable hierarchical planning. Instead of treating autoregressive architectures as rigid token generators, we might soon see them as architects of their own reusable cognitive modules.

The era of AI systems building internal abstractions isn't just coming; it's already emerging from within the models themselves.