A deep dive into the hidden cost dynamics of LLM agent conversations, revealing how cache reads create quadratic growth that can dominate expenses at surprisingly low token counts.
When building coding agents that interact with large language models, most developers focus on the obvious costs: input tokens, output tokens, and perhaps cache writes. But there's a hidden cost monster lurking in the shadows that few anticipate—cache reads grow quadratically with conversation length, and they can dominate your expenses far sooner than you might expect.
The fundamental problem stems from how LLM agents operate. In a typical loop, the agent sends the entire conversation history to the model, receives tool calls, executes them, and repeats. Each iteration must read the entire conversation from cache, creating a compounding cost structure that escalates rapidly.
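A minimal sketch of that loop, with `call_llm` and `run_tool` as hypothetical stand-ins for whatever SDK and tool layer you use:

```python
# Minimal sketch of the agent loop, not any particular SDK.
# `call_llm` and `run_tool` are hypothetical callables supplied by the caller.

def run_agent(task, call_llm, run_tool, max_steps=50):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Every iteration resends (and the provider re-reads) the full history.
        response = call_llm(history)
        history.append({"role": "assistant", "content": response.text})
        if not response.tool_calls:
            break  # no more tool calls: the agent considers the task done
        for call in response.tool_calls:
            # Tool output is appended to the history, so the context only grows.
            history.append({"role": "tool", "content": run_tool(call)})
    return history
```

Every pass through that loop re-reads everything accumulated so far, which is where the trouble starts.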
The Mathematics of Misery
Let's break down the cost components. LLM providers charge for four things: input tokens, cache writes, output tokens, and cache reads. The cache read cost is particularly insidious because it scales with both the number of tokens in context and the number of calls: with n tokens and m calls, your cache read cost is proportional to n × m. And since n itself grows with every call, the total grows roughly quadratically with the length of the conversation.
Using Anthropic's pricing as an example ($5 per million tokens for input, $0.50 for cache reads), the math quickly becomes alarming. With default settings of 150 input tokens and 100 output tokens per call, and a final context length of 100,000 tokens across 401 calls, the total cost reaches $17.93; cache reads alone account for $15.98, or 89.2% of the total.
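To see where that quadratic term comes from, here's a rough model in Python. It is a sketch, not a reproduction of the figures above: it assumes each call appends a fixed number of input and output tokens, assumes the entire existing context is served from cache, and ignores cache writes and output pricing, but it shows why the cache-read term takes over.

```python
def cache_read_tokens_total(calls, input_per_call=150, output_per_call=100):
    """Total cached tokens re-read over a whole conversation (simplified model)."""
    context = 0
    total_reads = 0
    for _ in range(calls):
        total_reads += context                    # this call re-reads everything so far
        context += input_per_call + output_per_call
    return total_reads

reads = cache_read_tokens_total(401)              # ~20 million tokens re-read
read_cost = reads * 0.50 / 1_000_000              # at $0.50 per million cached tokens
input_cost = 401 * 150 * 5.00 / 1_000_000         # at $5 per million fresh input tokens
# Cache reads come to roughly $10 in this simplified model versus about $0.30 of
# fresh input, i.e. they dominate even before cache writes and output are added.
```

The growth is quadratic because the total re-read is the sum 0 + Δ + 2Δ + … over the conversation, which is on the order of Δ·m²/2 for m calls of Δ new tokens each.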
When Half Your Budget Vanishes
The critical question becomes: at what point do cache reads consume half your budget? The answer, based on empirical data from hundreds of coding conversations, is surprisingly early. In one representative conversation, cache reads reached 50% of total costs at just 27,500 tokens. By the end of that conversation at 100,000 tokens, cache reads consumed 87% of the total $12.93 cost.
This isn't an isolated incident. Analysis of 250 randomly sampled conversations shows consistent patterns. The distribution of input tokens has a median around 285 and output tokens a median around 100, but the key variable is the number of LLM calls, which varies dramatically between conversations.
The Dead Reckoning Problem
There's a fundamental tension here that mirrors navigation challenges. If you let an agent work for long stretches without feedback (fewer tool calls and back-and-forth), you reduce cache read costs but risk the agent going off course. The feedback loop that enables agents to find correct solutions is also what drives the quadratic growth in cost.
This creates what might be called the "dead reckoning" problem in agent design. Fewer LLM calls mean cheaper conversations, but also mean the agent's internal compass might be leading it astray. More calls mean better accuracy but rapidly escalating costs.
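To put rough numbers on that trade-off (a sketch, assuming the context grows evenly from zero to its final size): for the same final context, the total cache-read volume scales roughly linearly with the number of calls.

```python
def cache_reads_for_conversation(final_context, calls):
    """Cached tokens re-read across a conversation that grows evenly to `final_context`."""
    step = final_context / calls
    return sum(step * i for i in range(calls))    # ≈ final_context * calls / 2

cache_reads_for_conversation(100_000, 400)        # ~19.95M tokens re-read
cache_reads_for_conversation(100_000, 100)        # ~4.95M tokens re-read
```

Four times fewer check-ins over the same 100,000-token conversation means roughly four times less cache-read spend, at the price of longer stretches of dead reckoning.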
Practical Implications for Agent Design
Several design decisions become critical when you understand this cost structure:
Tool Output Thresholds: Some coding agents refuse to return large tool outputs after a certain threshold, forcing multiple smaller reads instead of one large one. This is counterproductive—if the agent is going to read the whole file anyway, it should do it in one call rather than five.
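A hypothetical illustration of why chunked reads hurt, ignoring the extra input and output tokens of the additional round trips:

```python
def cache_reads_to_ingest_file(context, file_tokens, chunks):
    """Cached tokens re-read while pulling a file into context across `chunks` calls."""
    chunk = file_tokens // chunks
    return sum(context + i * chunk for i in range(chunks))

cache_reads_to_ingest_file(50_000, 10_000, 1)     # 50,000 tokens re-read
cache_reads_to_ingest_file(50_000, 10_000, 5)     # 270,000 tokens re-read
```

Reading the same 10,000-token file in five pieces instead of one more than quintuples the cache reads spent getting it into a 50,000-token context.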
Subagent Architecture: Using subagents and tools that themselves call out to LLMs can move iteration outside the main context window, reducing the quadratic growth in the primary conversation.
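A hedged sketch of that pattern, reusing the hypothetical `call_llm` and `run_tool` from earlier: the subagent iterates against its own throwaway history and hands back only a summary.

```python
def run_subagent(subtask, call_llm, run_tool, max_steps=20):
    # The subagent's history is private; its quadratic cache-read growth
    # happens here and is discarded when the subtask finishes.
    history = [{"role": "user", "content": subtask}]
    for _ in range(max_steps):
        response = call_llm(history)
        history.append({"role": "assistant", "content": response.text})
        if not response.tool_calls:
            break
        for call in response.tool_calls:
            history.append({"role": "tool", "content": run_tool(call)})
    # Only this final answer is appended to the main conversation,
    # so the primary context grows by one message instead of dozens.
    return history[-1]["content"]
```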
Conversation Restarting: Starting new conversations feels wasteful because you lose context, but the tokens spent re-establishing context are often cheaper than continuing the conversation. This mirrors how developers naturally start fresh when beginning new tasks rather than continuing old ones.
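As a back-of-the-envelope comparison (a sketch using the example pricing above and made-up context and recap sizes), continuing a long conversation for another fifty steps can cost several times more than restarting with a short recap:

```python
def marginal_cost(existing_context, extra_calls, tokens_per_call=250,
                  input_price=5.00, cache_read_price=0.50):
    """Rough $ cost of `extra_calls` more steps on top of `existing_context` tokens."""
    reads = sum(existing_context + i * tokens_per_call for i in range(extra_calls))
    fresh = extra_calls * tokens_per_call
    return (reads * cache_read_price + fresh * input_price) / 1_000_000

continue_cost = marginal_cost(80_000, 50)                            # ~$2.22
restart_cost = 5_000 * 5.00 / 1_000_000 + marginal_cost(5_000, 50)   # ~$0.37 with a 5k-token recap
```

Under these assumed numbers the restart comes out roughly six times cheaper, which is the quantitative version of the instinct to start fresh for a new task.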
The Bigger Picture
This cost structure raises fundamental questions about the relationship between cost management, context management, and agent orchestration. Are these really separate problems, or different facets of the same challenge?
Research directions like Recursive Language Models might offer solutions by changing how context is managed and reused. But for now, developers building coding agents need to be acutely aware that the "cheap" cache reads they're relying on to make long conversations feasible are actually a ticking time bomb.
The lesson is clear: in LLM agent design, what seems like an optimization for efficiency can become a budget-destroying liability. The quadratic curve of cache reads means that conversations that seem manageable at 10,000 or 20,000 tokens can become prohibitively expensive at 50,000 or 100,000 tokens. Understanding this dynamic isn't just about saving money—it's about building sustainable, scalable agent systems that won't bankrupt their users as conversations grow longer.
As we continue developing tools like Shelley and the exe.dev platform, these cost dynamics are central to our thinking. The challenge isn't just technical—it's economic. And in the world of LLM agents, the economics are governed by some surprisingly unfriendly mathematics.