New research reveals fundamental limitations in how large language models process long contexts, demonstrating that popular inference-time scaling techniques break down beyond certain context-length thresholds. The researchers propose test-time training with targeted gradient updates as a more effective alternative, reporting double-digit performance gains.
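The core idea of test-time training can be illustrated with a toy sketch: before answering a query, the model takes a few gradient steps on an objective derived from the context itself, rather than relying on frozen weights alone. Everything below (the linear "model", the learning rate, the way context pairs are formed) is an illustrative assumption, not the paper's actual method.

```python
# Toy illustration of test-time training (TTT), NOT the paper's method:
# adapt a frozen base model with a few gradient steps on (input, target)
# pairs derived from the context, then answer the query with the
# adapted weights.

def predict(w, x):
    # Toy linear "model": dot product of weights and features.
    return sum(wi * xi for wi, xi in zip(w, x))

def ttt_update(w, context, lr=0.1, steps=5):
    # Targeted gradient updates at inference time: plain SGD on a
    # squared-error loss over context-derived training pairs.
    w = list(w)
    for _ in range(steps):
        for x, y in context:
            err = predict(w, x) - y  # d/dw of 0.5*(pred - y)^2 is err * x
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

base_w = [0.0, 0.0]                                   # frozen base weights
context = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]     # pairs mined from context
adapted_w = ttt_update(base_w, context)

# After adaptation, predictions move toward the context targets.
print(predict(base_w, [1.0, 0.0]), predict(adapted_w, [1.0, 0.0]))
```

The key contrast with inference-time scaling (more samples, longer chains of thought) is that TTT changes the weights themselves for the duration of the query, letting the model internalize information from the long context instead of re-reading it.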