Researchers propose Recursive Language Models (RLMs) that enable language models to recursively interact with unbounded context through REPL environments, achieving breakthrough performance on long-context benchmarks while avoiding degradation issues.
Recursive Language Models: A New Paradigm for Unbounded Context Processing
The Context Rot Problem
Language models face a well-known but poorly characterized phenomenon called "context rot" - as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases. This isn't just about hitting context window limits; it's about degradation in performance even when the context fits within the model's window.
Researchers have observed this in everyday use cases: your Claude Code history gets bloated, or you chat with ChatGPT for a long time, and the model seems to get "dumber" as the conversation progresses. The natural intuition is that splitting the context across multiple model calls and combining the results in a separate call might avoid this degradation.
Introducing Recursive Language Models
Recursive Language Models (RLMs) are a general inference strategy where language models can decompose and recursively interact with their input context as a variable. The key insight is treating the prompt as a Python variable that can be processed programmatically in arbitrary REPL flows, allowing the LLM to figure out what to peek at from the long context at test time.
How RLMs Work
An RLM acts as a thin wrapper around a language model that can spawn recursive LM calls for intermediate computation. From the user's perspective, it's the same as a model call: rlm.completion(messages) is a direct replacement for gpt5.completion(messages).
Under the hood, the RLM provides only the query to the root language model (depth=0), which interacts with a Python REPL environment that stores the potentially huge context. The root model can call recursive LMs inside this REPL environment as if they were functions in code, allowing it to naturally peek at, partition, grep through, and launch recursive sub-queries over the context.

When the root model is confident it has an answer, it can either output the answer directly as FINAL(answer), or build up an answer in a variable in its REPL environment and return the string stored in that variable as FINAL_VAR(final_ans_var).
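To make the control flow concrete, below is a minimal sketch of such a loop, assuming a generic `llm(prompt) -> str` callable that stands in for both the root model and its recursive sub-calls; the helper names, prompt wording, and FINAL/FINAL_VAR parsing are illustrative assumptions rather than the authors' implementation.

```python
import contextlib
import io
import re


def run_python(code: str, env: dict) -> str:
    """Execute model-written code in the shared REPL namespace, capturing stdout."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)
    except Exception as exc:  # surface errors back to the root model
        buf.write(f"Error: {exc!r}")
    return buf.getvalue()[:2000]  # truncate so the root model's transcript stays small


def rlm_completion(query: str, context: str, llm, max_steps: int = 20) -> str:
    """Answer `query` over an arbitrarily long `context` without passing it whole to any single LM call."""
    # The huge context lives only inside the REPL; the root model (depth=0) sees the
    # query plus instructions, and reaches the context only through the code it writes.
    env = {"context": context, "llm": llm}
    transcript = [
        f"Query: {query}",
        "A Python REPL holds the full input in the variable `context`.",
        "Write Python to inspect it; call llm(prompt) for recursive sub-queries.",
        "Answer with FINAL(answer) or FINAL_VAR(variable_name) when done.",
    ]
    for _ in range(max_steps):
        step = llm("\n".join(transcript))  # root model proposes the next REPL action
        if match := re.search(r"FINAL_VAR\((\w+)\)", step):
            return str(env[match.group(1)])  # answer assembled in a REPL variable
        if match := re.search(r"FINAL\((.*)\)", step, re.DOTALL):
            return match.group(1)  # answer given directly
        transcript += [step, "REPL output:\n" + run_python(step, env)]
    return "No answer produced within the step budget."
```

In this sketch, rlm_completion(query, context, llm) plays the role of rlm.completion(messages) above: the caller's interface stays the same, but the long context only ever appears inside the REPL namespace.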
Breakthrough Results on Long-Context Benchmarks
OOLONG Benchmark: Double the Performance
The OOLONG benchmark evaluates long-context reasoning tasks over fine-grained information. The trec_coarse split consists of distributional queries about giant lists of question entries, where models must semantically map and associate thousands of pieces of information in a single query.
Researchers compared several approaches:
- GPT-5 given the whole context and query
- GPT-5-mini given the whole context and query
- RLM(GPT-5-mini) where GPT-5-mini can recursively call itself inside REPL
- RLM(GPT-5) without sub-calls (ablation)
- ReAct with GPT-5 + BM25 retrieval
Results: RLM(GPT-5-mini) outperformed GPT-5 by over 34 points (~114% increase) while maintaining roughly the same total model API cost per query! When ablating recursion, performance degraded by ~10%, indicating the importance of recursive sub-queries for semantic mapping tasks.
Even with doubled context size (~263k tokens), RLM(GPT-5-mini) maintained a 49% performance advantage over GPT-5 while being cheaper per query on average.
BrowseComp-Plus: Perfect Performance at 1000 Documents
For evaluating ridiculously large contexts, researchers used BrowseComp-Plus, which provides ~100K documents where answers are scattered across the corpus. They tested performance with 10, 50, 100, and 1000 documents in context.
Results: RLM(GPT-5) was the only approach achieving and maintaining perfect performance at the 1000 document scale, with the non-recursive ablation achieving 90%. Base GPT-5 approaches showed clear performance dropoff as document count increased, while RLM maintained reasonable cost scaling.
What Makes RLMs Powerful
Key Design Choices
- Context as Variable: Treating the prompt as a Python variable that can be processed programmatically in arbitrary REPL flows
- Recursive Calls: Allowing the REPL environment to make calls back to the LLM (or a smaller LLM), building on the decomposition and versatility enabled by the first choice
Strategies That Emerge
RLMs naturally develop several strategies for context interaction:
- Peeking: The root model grabs the first few thousand characters to observe structure
- Grepping: Using keyword or regex patterns to narrow down lines of interest
- Partition + Map: Chunking context and running recursive LM calls for semantic mapping
- Summarization: Summarizing subsets of context for outer model decisions
- Long-input, Long-output: Programmatically processing long sequences (e.g., git diff histories)
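A single trajectory often combines several of these moves. The snippet below is a hypothetical example of code a root model might emit inside the REPL for an OOLONG-style counting query; `context` and `llm` are assumed to be the variable and recursive-call function exposed by the environment, and everything else is illustrative.

```python
import re

# Peeking: grab the first few thousand characters to learn the structure of the input.
print(context[:3000])

# Grepping: use a regex to narrow down to the lines that matter for the query.
question_lines = [line for line in context.splitlines()
                  if re.match(r"(what|who|when|where|why|how)\b", line, re.IGNORECASE)]
print(len(question_lines), question_lines[:5])

# Partition + map: chunk the context and let recursive LM calls do the semantic work per piece.
chunk_size = 50_000
chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
per_chunk = [llm("Count the question entries in each category for this chunk:\n" + chunk)
             for chunk in chunks]

# Summarization: reduce the per-chunk results into something the root model can act on.
final_answer = llm("Merge these per-chunk counts into one overall distribution:\n"
                   + "\n".join(per_chunk))
# ...later returned via FINAL_VAR(final_answer)
```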
Relationship to Test-Time Scaling
RLMs offer another axis for scaling test-time compute. The trajectory in which a language model chooses to interact with and recurse over its context is entirely learnable, and can be trained with reinforcement learning in the same way that reasoning is currently trained for frontier models.
Importantly, RLMs don't require training models that can handle huge context lengths because no single language model call should require handling a huge context.
Limitations and Future Work
Current limitations include:
- No optimization for speed - recursive calls are blocking and don't use prefix caching
- No strong guarantees about controlling total API cost or runtime
- Only tested with recursive depth of 1 (root model can only call LMs, not other RLMs)
Future work includes enabling larger recursive depth, optimizing inference engines for RLMs, and training models specifically to work in this recursive framework.
Why RLMs Matter
RLMs represent a fundamentally different bet than modern agents. Agents are designed based on human/expert intuition on how to break down problems for LMs. RLMs are designed based on the principle that LMs should decide how to break down problems to be digestible for themselves.
The performance of RLMs correlates directly with improvements to base model capabilities - if tomorrow's best frontier LM can handle 10M tokens, an RLM can reasonably handle 100M tokens (maybe at half the cost too).
As the researchers conclude: "I personally have no idea what will work in the end, but I'm excited to see where this idea goes!"
For more information, see the full paper and official codebase. A minimal implementation is also available for building upon: RLM Minimal.
Citation:

@article{zhang2025rlm,
  title  = "Recursive Language Models",
  author = "Zhang, Alex and Khattab, Omar",
  year   = "2025",
  month  = "October",
  url    = "https://alexzhang13.github.io/blog/2025/rlm/"
}
