Stop Round-Tripping Your Codebase: How to Cut LLM Token Usage by 80% Using Recursive Document Analysis
#LLMs

Tech Essays Reporter
3 min read

A new tool called Matryoshka demonstrates how treating documents as external state rather than context can dramatically reduce token consumption while enabling interactive analysis of large codebases, achieving 82% savings in real-world testing.

When analyzing codebases with large language models, a fundamental inefficiency emerges: the same document contents are sent to the API again and again across the turns of a conversation. For a medium-sized project with 7,770 lines of code, a traditional approach that reloads the entire codebase into context for each query can consume nearly 100,000 tokens per pass. Over a typical 10-turn conversation, that balloons to nearly a million tokens, most of them redundant retransmissions of the same code.

This redundancy isn't just costly; it actively degrades model performance. Research on recursive language models has identified "context rot," where model capability declines as input length increases, particularly for information-dense tasks requiring synthesis across dispersed fragments. The degradation occurs long before hitting theoretical context limits, suggesting that models struggle to maintain connections between disparate pieces of information when everything is dumped into the prompt.

Matryoshka, an open-source tool available on GitHub, addresses this by fundamentally reimagining how LLMs interact with documents. Instead of treating documents as context to be parsed, it treats them as external environments that the model can query and navigate. This approach draws from two key research insights: the Recursive Language Models paper's proposal for symbolic operations against document state, and Barliman's demonstration of program synthesis from examples rather than specifications.

The core innovation is pointer-based state management. When an LLM executes a query like (grep "def ") against a 2,244-line file, Matryoshka doesn't return all 150 matches to the context. Instead, it stores them server-side and binds them to a variable called RESULTS, returning only the metadata: "Found 150 results." Subsequent operations like (count RESULTS) or (filter RESULTS ...) operate on this server-side state, with only the distilled results entering the conversation context. The model never sees the raw 150 function definitions; it only receives aggregated answers.
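
To make the pattern concrete, here is a minimal Python sketch of pointer-based state management, assuming a simple in-memory store. The BindingStore class and its method names are illustrative, not Matryoshka's actual implementation.

```python
# Minimal sketch of pointer-based state management (illustrative; not
# Matryoshka's actual code). Query results live server-side under a named
# binding, and the model only ever receives a handle plus summary metadata.
import re


class BindingStore:
    """Holds query results server-side, keyed by a binding name."""

    def __init__(self):
        self._bindings = {}

    def grep(self, name, text, pattern):
        # Store the matching lines under the binding instead of returning them.
        matches = [line for line in text.splitlines() if re.search(pattern, line)]
        self._bindings[name] = matches
        return f"Found {len(matches)} results (bound to {name})"

    def count(self, name):
        # Aggregate over server-side state; only the number crosses the wire.
        return len(self._bindings[name])

    def filter(self, name, new_name, predicate):
        # Derive a new binding; the filtered items themselves stay server-side.
        kept = [item for item in self._bindings[name] if predicate(item)]
        self._bindings[new_name] = kept
        return f"Bound {len(kept)} result(s) to {new_name}"


# The model's context would only ever contain the short strings printed here,
# never the matched lines themselves.
source = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
store = BindingStore()
print(store.grep("RESULTS", source, r"def "))                  # Found 2 results (bound to RESULTS)
print(store.count("RESULTS"))                                  # 2
print(store.filter("RESULTS", "ADDS", lambda l: "add" in l))   # Bound 1 result(s) to ADDS
```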

This is enabled by a declarative query language called Nucleus, which uses S-expressions to specify intent rather than procedural steps. The language includes commands for search (grep, fuzzy_search), aggregation (count, sum), and transformation (map, filter). When custom parsing is needed, Matryoshka can synthesize functions from examples using a miniKanren-inspired relational programming system, eliminating the need for manual regex construction.
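
As an illustration of how such declarative forms might be dispatched, the toy evaluator below parses flat S-expressions like (grep "def ") and (count RESULTS) and routes them to handlers. The surface syntax is an assumption based on the forms quoted above; the real Nucleus grammar, including map, filter, and synthesized functions, is richer and may differ.

```python
# Toy evaluator for Nucleus-style S-expression queries. The syntax here is
# inferred from the forms quoted in the article; the real grammar is richer
# and may differ.
import re


def tokenize(expr):
    """Split a flat S-expression such as '(count RESULTS)' into tokens."""
    return re.findall(r'"[^"]*"|[^\s()]+', expr)


def evaluate(expr, document, bindings):
    op, *args = tokenize(expr)
    if op == "grep":
        pattern = args[0].strip('"')
        matches = [line for line in document.splitlines() if pattern in line]
        bindings["RESULTS"] = matches          # bind server-side, return metadata only
        return f"Found {len(matches)} results"
    if op == "count":
        return len(bindings[args[0]])
    raise ValueError(f"unknown operation: {op}")


document = "def add(a, b): ...\ndef sub(a, b): ...\nVERSION = 6\n"
bindings = {}
print(evaluate('(grep "def ")', document, bindings))    # Found 2 results
print(evaluate('(count RESULTS)', document, bindings))  # 2
```

Keeping the query declarative is what makes the pointer-based bindings possible: the server decides how to execute each form and returns only the distilled answer.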

In practice, Matryoshka integrates with LLM agents through the Model Context Protocol (MCP). When Claude Code launches, it discovers Matryoshka as an MCP server and receives a tool manifest describing available operations. The agent learns incrementally through the lattice_help tool, which provides on-demand command references rather than requiring upfront training.
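
The snippet below sketches what the server side of that handshake could look like using the official MCP Python SDK's FastMCP helper. The lattice_help name comes from the article, but the body and the command reference shown here are placeholders rather than Matryoshka's source.

```python
# Sketch of exposing a Matryoshka-like tool over MCP with the official Python
# SDK's FastMCP helper. The lattice_help tool name is from the article; the
# command reference below is a placeholder, not Matryoshka's actual manifest.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("matryoshka")

COMMAND_REFERENCE = {
    "grep":   '(grep "pattern") - search the document, bind matches server-side',
    "count":  "(count BINDING) - return the size of a server-side binding",
    "filter": "(filter BINDING expr) - derive a new binding from an existing one",
}


@mcp.tool()
def lattice_help(command: str = "") -> str:
    """Return the reference for one command, or list all commands."""
    if command:
        return COMMAND_REFERENCE.get(command, f"unknown command: {command}")
    return "\n".join(COMMAND_REFERENCE.values())


if __name__ == "__main__":
    mcp.run()  # Claude Code discovers the server and its tool manifest at launch
```

On the client side, such a server would typically be registered with Claude Code's MCP configuration (for example via claude mcp add), after which its tools appear in the agent's manifest.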

A real-world analysis of the anki-connect codebase demonstrates the impact. The hybrid workflow—using direct reads for small files (<300 lines) and Matryoshka for large ones—achieves 82% token savings while maintaining 100% coverage. The tool processed 6,904 lines using only 6,500 tokens, compared to 95,000 tokens for the traditional approach.
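
The 300-line cutoff lends itself to a simple routing rule. The sketch below shows one way a client could implement it; the threshold comes from the article, while the function and label names are made up for illustration.

```python
# Sketch of the hybrid routing rule: small files are read directly into
# context, large files stay server-side and are queried through Matryoshka.
# The 300-line threshold is from the article; everything else is illustrative.
from pathlib import Path

SMALL_FILE_THRESHOLD = 300  # lines


def route(path: Path) -> str:
    """Decide whether a file is read directly or queried via Matryoshka."""
    with path.open(encoding="utf-8", errors="ignore") as handle:
        line_count = sum(1 for _ in handle)
    if line_count < SMALL_FILE_THRESHOLD:
        return "direct_read"        # e.g. util.py at 107 lines
    return "matryoshka_query"       # e.g. plugin/__init__.py at 2,244 lines


if __name__ == "__main__":
    for path in sorted(Path(".").rglob("*.py")):
        print(f"{path}: {route(path)}")
```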

The architecture reveals why this works. Small files like util.py (107 lines) contain critical configuration defaults and API decorator implementations that fit comfortably in context. Large files like plugin/__init__.py (2,244 lines) and README.md (4,660 lines) benefit from Matryoshka's ability to query specific patterns without loading everything. The system maintains a persistent analytical state across sessions, with bindings like RESULTS acting as pointers to server-side data rather than holding data directly in the model's context.

This approach fundamentally changes the LLM's role from passive reader to active interrogator. Rather than attempting to hold an entire codebase in working memory, the model navigates through structured sections and retrieves only what's needed for each specific query. The server executes substantive computational tasks, returning distilled results that preserve the model's capacity for novel reasoning.

For developers working with large codebases, this represents a practical path to making LLM-assisted analysis economically viable. The 82% reduction in token usage translates directly to cost savings and faster response times, while the interactive exploration capability enables more thorough analysis than traditional round-tripping approaches.
