Users report Pro Max 5x plans exhausting quota in 1.5 hours with moderate usage, traced to cache_read tokens counting at full rate against rate limits
A Pro Max 5x subscriber has reported exhausting their daily quota in just 1.5 hours of moderate usage, despite the plan's promise of handling heavy development workloads. The issue appears to stem from how cache_read tokens are counted against rate limits, effectively negating the cost benefits of prompt caching.
The Problem
The user, on a Pro Max 5x plan with the claude-opus-4-6 model, observed that after a quota reset their allowance was depleted within 1.5 hours of mostly Q&A and light development work. This was particularly puzzling because the previous 5-hour window of heavy development (multi-file implementation, graphify pipeline, multi-agent spawns) had, as expected, consumed that window's quota in full.
Investigation Methodology
Using data extracted from the ~/.claude/projects/*/*.jsonl session files, the user tracked token consumption across two distinct windows:
Window 1 (15:00-20:00, 5 hours, heavy development):
- 2,715 API calls
- 1,044M cache read tokens
- 16.8M cache creation tokens
- 8.9k input tokens
- 1.15M output tokens
- Peak context: 966,078 tokens
Window 2 (20:00-21:30, 1.5 hours, moderate usage):
- 222 API calls (main session)
- 23.2M cache read tokens
- 1.4M cache creation tokens
- 304 input tokens
- 91k output tokens
- Peak context: 182,302 tokens
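Per-window tallies like these can be reproduced from the session logs. A minimal sketch, assuming each JSONL line is an API event carrying a `message.usage` object with the standard Anthropic usage fields (`input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`); the exact field names and log layout should be verified against your own files:

```python
import glob
import json
import os
from collections import Counter

def tally_usage(pattern):
    """Sum token counts across all JSONL session files matching pattern.

    Each line is assumed to be one JSON event; lines without a usage
    object (user turns, tool results, partial writes) are skipped.
    """
    totals = Counter()
    for path in glob.glob(os.path.expanduser(pattern)):
        with open(path) as f:
            for line in f:
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate partially written lines
                if not isinstance(event, dict):
                    continue
                msg = event.get("message")
                usage = msg.get("usage") if isinstance(msg, dict) else None
                if not usage:
                    continue
                totals["calls"] += 1
                for key in ("input_tokens", "output_tokens",
                            "cache_read_input_tokens",
                            "cache_creation_input_tokens"):
                    totals[key] += usage.get(key, 0)
    return totals

# Example: totals = tally_usage("~/.claude/projects/*/*.jsonl")
```

Filtering events by timestamp before summing would split the totals into the 5-hour reset windows used above.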
The Root Cause
The analysis reveals that if cache_read tokens counted at their expected reduced rate (1/10 of the input rate), Window 2 would have consumed only 8.7M effective tokens per hour - well within reasonable limits for moderate usage. If they instead count at full rate (the suspected actual behavior), consumption jumps to 70.5M tokens per hour, which would explain the rapid quota depletion.
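The arithmetic can be checked directly. A minimal sketch using the Window 2 main-session figures from above (the per-hour figures quoted in the issue are larger, presumably because they aggregate background sessions as well):

```python
def effective_tokens(usage, cache_read_weight):
    """Weighted token total: cache_read counted at the given weight,
    everything else at full rate."""
    return (usage["cache_read"] * cache_read_weight
            + usage["cache_creation"]
            + usage["input"]
            + usage["output"])

# Window 2, main session only (figures from the breakdown above)
window2 = {"cache_read": 23_200_000, "cache_creation": 1_400_000,
           "input": 304, "output": 91_000}
hours = 1.5

reduced = effective_tokens(window2, 0.1) / hours  # cache_read at 1/10
full = effective_tokens(window2, 1.0) / hours     # cache_read at full rate

print(f"reduced rate: {reduced / 1e6:.1f}M tokens/hour")  # 2.5M
print(f"full rate:    {full / 1e6:.1f}M tokens/hour")     # 16.5M
```

Even for this one session, full-rate accounting multiplies the effective burn rate by roughly 6.5x, which is the core of the user's hypothesis.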
Compounding Factors
Several issues compound the problem:
Background Sessions: Multiple Claude Code sessions running in different terminals continue making API calls (compacts, retros, hook processing) even when not actively used, consuming from the same quota pool.
Auto-Compact Spikes: Each auto-compact event triggers an expensive API call with the full pre-compact context (~966k tokens) as cache_creation, creating automatic quota spikes without user action.
Context Window Amplification: The 1M context window, while marketed as a feature, becomes counterproductive when cache_read tokens count at full rate, as each API call near the compact threshold sends ~960k tokens.
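The amplification is easy to quantify. A quick sketch (the ~960k context figure is from the report above; the one-call-per-minute pace is a hypothetical illustration, not measured data):

```python
near_threshold_call = 960_000  # tokens sent per call near the compact limit
calls_per_hour = 60            # hypothetical pace: one call per minute

# Under full-rate accounting every such call costs its entire context:
burn = near_threshold_call * calls_per_hour
print(f"{burn / 1e6:.1f}M effective tokens/hour")        # 57.6M

# Under 1/10-rate accounting the same traffic costs a tenth of that:
print(f"{burn * 0.1 / 1e6:.1f}M effective tokens/hour")  # 5.8M
```

At that pace, full-rate accounting alone would burn tens of millions of effective tokens per hour before any output is generated.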
Community Response
The issue has sparked significant community discussion, with one commenter noting it's "not a duplicate" and that the hypothesis about cache_read tokens counting at full rate for quota is distinct from general cache miss issues.
Another community member developed a tool called "cozempic" that claims to keep sessions under control with auto-pruning at 4 thresholds, allowing sessions to run 3-4x longer and cost 2-3x less.
Anthropic's Investigation
Anthropic engineer cnighswonger has been actively investigating the issue, collecting per-call telemetry with the claude-code-cache-fix interceptor. Their analysis of ~1,500 logged calls across six distinct 5-hour reset windows tested three counting hypotheses, scoring each by the coefficient of variation (CV) of per-window totals - a lower CV means that hypothesis makes the windows more consistent with one another:
- cache_read = 0.0x (does NOT count for quota): CV 34.4%
- cache_read = 0.1x (counts at published billing rate): CV 101.6%
- cache_read = 1.0x (counts at full input rate): CV 123.7%
The cache_read = 0.0x model produced the most consistent results across windows, suggesting that within the resolution of their data, cache_read does not meaningfully count toward the 5-hour quota - contradicting the original hypothesis but supporting the idea that caching does save quota.
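The comparison described above amounts to: for each candidate weight, compute the effective token total per reset window and see which weight makes the windows most consistent. A self-contained sketch with made-up per-window numbers - the real telemetry is not public:

```python
import statistics

# Hypothetical per-window telemetry: (cache_read, all_other_tokens)
# per 5-hour window. Real data would come from interceptor logs.
windows = [
    (1_044_000_000, 18_000_000),
    (23_000_000, 1_500_000),
    (410_000_000, 9_000_000),
    (120_000_000, 4_000_000),
    (660_000_000, 12_000_000),
    (55_000_000, 2_500_000),
]

def cv(values):
    """Coefficient of variation: stdev / mean, as a percentage."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

for weight in (0.0, 0.1, 1.0):
    totals = [read * weight + other for read, other in windows]
    print(f"cache_read x{weight}: CV {cv(totals):.1f}%")
```

The weight whose totals show the lowest CV is the best fit for how quota is actually counted, which is how the 0.0x model was singled out.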
Suggested Improvements
Based on the findings, several improvements have been suggested:
- Clarify cache_read quota accounting: Document whether cache_read tokens count at full or reduced rate against rate limits
- Rate limit by effective tokens: Count cache_read at 1/10 rate for rate limiting, matching the cost reduction
- Session idle detection: Don't count idle session overhead against quota, or warn users about open sessions
- Quota visibility: Show real-time token consumption breakdown in Claude Code (cache_read vs cache_create vs input vs output)
- Context-aware quota estimates: Before operations, estimate quota cost based on current context size
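The last suggestion could be prototyped client-side. A hypothetical sketch, with the disputed cache_read weight exposed as a parameter since its real value is exactly what the issue is about:

```python
def estimate_quota_cost(context_tokens, expected_calls,
                        expected_output_per_call=2_000,
                        cache_read_weight=1.0):
    """Rough pre-operation quota estimate.

    Pessimistically assumes every call resends the full current context
    as cache_read; cache_read_weight encodes the (currently
    undocumented) rate at which cache reads count against quota.
    """
    per_call = context_tokens * cache_read_weight + expected_output_per_call
    return per_call * expected_calls

# A 50-call operation at 960k context, under the two disputed weights:
print(estimate_quota_cost(960_000, 50, cache_read_weight=1.0))  # 48100000
print(estimate_quota_cost(960_000, 50, cache_read_weight=0.1))  # 4900000.0
```

An order-of-magnitude estimate like this, surfaced before a long operation starts, would let users decide whether to compact first.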
The issue remains open as the community continues to investigate and validate these findings across different accounts and usage patterns.