Google's TurboQuant slashes AI memory demands by 6x while boosting GPU performance
#Machine Learning

Chips Reporter
3 min read

Google's TurboQuant compresses LLM KV caches to 3 bits without accuracy loss, delivering up to 8x performance gains on Nvidia H100 GPUs.

Google Research has unveiled TurboQuant, a breakthrough compression algorithm that dramatically reduces the memory footprint of large language model (LLM) inference while simultaneously boosting performance on Nvidia's H100 GPUs. The technology addresses one of AI's most pressing bottlenecks: the growing memory demands of key-value (KV) caches as context windows expand.

The memory bottleneck in modern AI

As LLMs process longer conversations and documents, they must maintain increasingly large KV caches that store previously computed attention data. These caches prevent redundant calculations at each token generation step, but their size grows linearly with context length. Traditional vector quantization methods can compress these caches, but they introduce quantization constants that add overhead—seemingly minor until you consider that context windows now routinely exceed 100,000 tokens.
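A back-of-the-envelope calculation shows the scale of the problem. The model dimensions below are hypothetical (roughly a 7B-class model with grouped-query attention), chosen only to illustrate the linear growth:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """Size of a KV cache: one key and one value vector per layer,
    per KV head, per token -- linear in context length."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * context_len * bits_per_value // 8

# Hypothetical config: 32 layers, 8 KV heads, head dimension 128.
fp16 = kv_cache_bytes(32, 8, 128, 100_000, 16)
print(f"fp16 cache at 100k tokens: {fp16 / 1e9:.1f} GB")      # ~13.1 GB
print(f"after a 6x reduction:      {fp16 / 6 / 1e9:.1f} GB")  # ~2.2 GB
```

At a 100,000-token context, even this modest configuration needs over 13 GB of GPU memory for the cache alone, which is why a 6x reduction matters.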

How TurboQuant works

TurboQuant employs a two-stage compression process that eliminates the quantization overhead entirely. The first stage uses PolarQuant, which transforms data vectors from Cartesian to polar coordinates. This conversion separates each vector into a radius (magnitude) and angles (direction). Since angular distributions are predictable and concentrated, PolarQuant avoids the expensive per-block normalization step that conventional quantizers require, achieving high-quality compression without storing quantization constants.
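A minimal sketch of the polar-coordinate idea is below. This is not Google's implementation: pairing up adjacent coordinates and using a uniform 3-bit angle grid are both simplifying assumptions, made to show why angles need no stored per-block scale or offset:

```python
import numpy as np

def polar_quantize(x, angle_bits=3):
    # Pair up coordinates and convert each (a, b) pair to polar form.
    pairs = x.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)             # magnitudes, kept full precision
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angles in [-pi, pi]
    # Angles live in a fixed, known range, so a uniform grid needs
    # no per-block normalization constants.
    levels = 2 ** angle_bits
    codes = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(np.uint8)
    return r, codes

def polar_dequantize(r, codes, angle_bits=3):
    levels = 2 ** angle_bits
    theta = codes / (levels - 1) * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pairs.reshape(-1)
```

Because the angle range is known in advance, the quantization grid is universal: nothing dataset-specific has to be stored alongside the compressed cache.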

The second stage applies Quantized Johnson-Lindenstrauss (QJL) error correction. This algorithm projects residual quantization errors into a lower-dimensional space and reduces each value to a single sign bit. This clever approach eliminates systematic bias in attention score calculations at negligible computational cost.
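The sign-bit trick can be illustrated with a small estimator. This is a sketch of the general quantized Johnson-Lindenstrauss idea rather than the production kernel; the Gaussian projection matrix `S` and its dimensions are assumptions for the demo:

```python
import numpy as np

def qjl_encode(k, S):
    # Keep only one sign bit per random projection, plus the key's norm.
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_score(q, signs, k_norm, S):
    # For Gaussian rows s of S, E[sign(<s, k>) * <s, q>] equals
    # sqrt(2/pi) * <q, k> / ||k||, so rescaling yields an unbiased
    # estimate of the inner product <q, k>.
    m = S.shape[0]
    return k_norm * np.sqrt(np.pi / 2) / m * ((S @ q) @ signs)

rng = np.random.default_rng(0)
d, m = 64, 8192
S = rng.standard_normal((m, d))
q, k = rng.standard_normal(d), rng.standard_normal(d)
signs, k_norm = qjl_encode(k, S)
print(qjl_score(q, signs, k_norm, S), q @ k)  # estimate tracks the exact value
```

The key property is unbiasedness: averaged over the random projections, the sign-bit estimate has no systematic drift, which is what keeps attention scores from accumulating bias even at 1 bit per projected value.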

Performance that defies expectations

Benchmarks on Nvidia H100 GPUs reveal TurboQuant's transformative potential. The 4-bit implementation delivers up to eight times performance improvement in computing attention logits compared to unquantized 32-bit keys, while reducing KV cache memory by at least six times. Perhaps most impressively, these gains come with no accuracy loss—TurboQuant achieves perfect downstream scores on needle-in-a-haystack retrieval tasks while compressing memory by 6x.

Real-world validation

The algorithm was tested across multiple long-context benchmarks including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval using open-source models like Gemma and Mistral. TurboQuant matched or outperformed the KIVI baseline across all tasks, demonstrating its versatility in question answering, code generation, and summarization scenarios.

Vector search performance proved equally compelling. Against Product Quantization and RaBitQ on the GloVe dataset, TurboQuant achieved the highest 1@k recall ratios despite those baselines using larger codebooks and dataset-specific tuning.

Production-ready deployment

TurboQuant's design philosophy prioritizes practical deployment. The algorithm requires no training or fine-tuning, making it immediately applicable to existing models. Its negligible runtime overhead means it can be integrated into production inference systems without performance penalties. This combination of zero accuracy loss, massive memory reduction, and performance gains positions TurboQuant as a compelling solution for data centers facing AI's growing computational demands.

The research, co-authored by Google Research scientist Amir Zandieh and VP Vahab Mirrokni, will be presented at ICLR 2026 next month. As AI models continue to grow in size and complexity, technologies like TurboQuant that can dramatically reduce resource requirements while improving performance may prove essential for scaling AI infrastructure sustainably.
