TQ4_1S Weight Compression: Breakthrough in Model Quantization for llama.cpp


Startups Reporter

TheTom has implemented TQ4_1S (4-bit, 5.0 BPW) and TQ3_1S (3-bit, 4.0 BPW) weight quantization using WHT rotation + Lloyd-Max centroids in llama-cpp-turboquant. The Metal-only implementation achieves 27-42% model size reductions with minimal perplexity increases (+0.4-5.8%) and competitive speed. A CUDA port is in progress, and community testing shows promising results across multiple GPU architectures and model families.


The pull request #45 in TheTom/llama-cpp-turboquant introduces a significant advancement in model quantization techniques with TQ4_1S weight compression. This implementation uses WHT (Walsh-Hadamard Transform) rotation combined with Lloyd-Max centroids to achieve substantial model size reductions while maintaining performance and quality.

Technical Implementation

At its core, TQ4_1S represents a sophisticated approach to post-training quantization that requires no retraining, calibration data, or model modification. The technique uses:

  • WHT rotation: an orthogonal Walsh-Hadamard transform that spreads large outlier weights across each block, flattening the value distribution so it quantizes with less error
  • Lloyd-Max centroids: a scalar quantizer whose levels are iteratively fit to the observed weight distribution to minimize mean squared error
  • Fused Metal kernel: Zero threadgroup memory implementation with cooperative SIMD rotation via simd_shuffle_xor
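The two building blocks are easy to sketch in plain Python. The fast Walsh-Hadamard transform and the 1-D Lloyd-Max iteration below are generic textbook versions, not the actual llama-cpp-turboquant code; the block size, codebook size (16 levels for 4-bit), and normalization are illustrative assumptions.

```python
import random

def fwht(v):
    """In-place fast Walsh-Hadamard transform (length must be a power of two)."""
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    return v

def lloyd_max(samples, k=16, iters=30):
    """1-D Lloyd-Max: alternate nearest-level assignment and cell-mean
    update; converges to a locally MSE-optimal k-level codebook."""
    lo, hi = min(samples), max(samples)
    levels = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for s in samples:
            cells[min(range(k), key=lambda c: (s - levels[c]) ** 2)].append(s)
        levels = [sum(c) / len(c) if c else levels[i] for i, c in enumerate(cells)]
    return levels

# Rotate a 32-weight block, then fit a 16-level (4-bit) codebook to it.
random.seed(0)
block = [random.gauss(0, 1) for _ in range(32)]
rotated = fwht(block[:])
codebook = lloyd_max(rotated, k=16)
```

Because the rotation is orthogonal (up to a scale factor of n), no information is lost before quantization; the codebook then only has to cover the flattened post-rotation distribution.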

The V2.1 implementation introduces several optimizations:

  • NR0=8 configuration for better data reuse
  • Zero threadgroup memory design
  • Efficient SIMD operations
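The zero-threadgroup-memory design is possible because a Walsh-Hadamard butterfly only ever pairs lane i with lane i XOR mask, which is exactly the register exchange Metal's simd_shuffle_xor provides. A Python model of that data flow (lane count and values are illustrative):

```python
def simd_wht(lanes):
    # Each stage pairs lane i with lane i ^ mask -- the exchange that
    # simd_shuffle_xor performs in registers, with no threadgroup memory.
    n, mask = len(lanes), 1
    while mask < n:
        partner = [lanes[i ^ mask] for i in range(n)]  # the "shuffle"
        lanes = [lanes[i] + partner[i] if i & mask == 0
                 else partner[i] - lanes[i] for i in range(n)]
        mask <<= 1
    return lanes
```

Each of the log2(n) stages needs one register exchange per lane, so a 32-wide SIMD group rotates a 32-element block in five shuffle steps without touching shared memory.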

Performance Benchmarks

The quantization technique shows impressive results across multiple models:

Model Size Reductions

  • Qwen2.5-1.5B: 27% reduction
  • Qwen3.5-27B: 28% reduction
  • Qwen3.5-35B MoE: 37% reduction
  • Qwen2.5-72B: 38% reduction
  • Phi-4 14B: 36% reduction
  • Llama 3.1 70B: 29-42% reduction (depending on configuration)
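The headline reductions follow almost directly from the bits-per-weight figures. Assuming a q8_0 baseline at roughly 8.5 BPW (the baseline used in the speed benchmarks; the per-model baseline here is an assumption), a 5.0 BPW format gives about a 41% ceiling, with real files landing lower because embeddings and other tensors stay at higher precision:

```python
def approx_size_gb(params_b, bpw):
    # Ideal file size if every weight were stored at bpw bits;
    # real GGUFs keep some tensors (e.g. embeddings) at higher precision.
    return params_b * 1e9 * bpw / 8 / 1e9

baseline = approx_size_gb(72, 8.5)  # q8_0 stores ~8.5 bits per weight
tq4      = approx_size_gb(72, 5.0)  # TQ4_1S stores 5.0 bits per weight
print(f"ideal reduction: {1 - tq4 / baseline:.0%}")
```

The gap between this ideal figure and the measured per-model numbers is mostly the fraction of parameters left at higher precision.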

Perplexity Impact

Quality holds up well, with small perplexity increases:

  • Qwen2.5-1.5B: +1.9%
  • Qwen3.5-27B: +1.3%
  • Qwen2.5-72B: +3.9%
  • Phi-4 14B: +1.0%
  • Llama 3.1 70B: +5.8% (Premium) or +16% (Hybrid)

Performance Characteristics

  • Metal implementation achieves 85-99% of q8_0 baseline performance
  • CUDA port (in progress) shows 39% of q8_0 baseline with potential for improvement
  • No performance regression on existing quantization methods


Hardware and Platform Support

Current Implementation

  • Metal-only runtime (Apple Silicon)
  • Quantization works on any platform
  • Runtime dequant kernels are Metal-specific
  • Compressed GGUFs will not run correctly on CUDA/HIP until backends are ported

CUDA Port Progress

Signalnine has implemented a CUDA port with the following features:

  • CUDA dequant for TQ4_1S/TQ3_1S
  • Fused mul_mat_vec kernel with pre-rotated activations
  • mmvq exclusion for fused dispatch path
  • llama-quantize registration for TQ4_1S/TQ3_1S types
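Pre-rotating activations works because the Hadamard matrix H satisfies HᵀH = n·I, so wᵀx = (Hw)ᵀ(Hx)/n: the kernel dots dequantized (already-rotated) weight rows directly against a once-rotated activation vector and never undoes the rotation. A toy Python reference of that identity (the codebook and layout are illustrative, not the real kernel's):

```python
def fwht(v):
    # Fast Walsh-Hadamard transform; length must be a power of two.
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    return v

def fused_matvec(index_rows, codebook, x):
    # index_rows: per-row 4-bit codebook indices of WHT-rotated weight rows.
    # Rotate x once, then accumulate in the rotated domain; the 1/n factor
    # undoes the unnormalized transform applied to both operands.
    n = len(x)
    xr = fwht(list(x))
    return [sum(codebook[q] * xr[j] for j, q in enumerate(row)) / n
            for row in index_rows]
```

With an exact codebook this reproduces the plain dot product, which is the invariant behind the matching perplexity of the cuBLAS and fused paths.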

Initial CUDA benchmarks show:

  • 20 t/s with cuBLAS path
  • 69 t/s with fused kernel (vs 177 t/s for q8_0)
  • 39% of q8_0 performance
  • Identical perplexity from the cuBLAS and fused implementations

Community Testing and Validation

The implementation has undergone extensive community testing across multiple platforms:

Apple Silicon Results

  • M5 Max shows no regressions in existing functionality
  • Mac Mini M2 Pro confirms performance improvements
  • Qwen3.5-27B achieves 92% decode speed at 28% smaller size

@shivam2014

NVIDIA GPU Results

  • RTX 4090 shows significant KV cache savings with TurboQuant
  • RTX PRO 6000 demonstrates improved performance with weight compression
  • Multi-GPU configurations tested across various architectures

@christopheraleman1015

Model-Specific Findings

  1. Qwen Models: Show consistent 27-38% size reductions with minimal quality impact
  2. Llama Models: Benefit from hybrid configurations (TQ4 attention + Q4_K/Q5_K FFN)
  3. MoE Models: Special handling required for expert layers
  4. Long Context: TurboQuant KV extends max context from 48K to 100K for 8B models

Technical Insights and Optimizations

Performance Optimization Opportunities

TheTom identified several optimization opportunities for the CUDA implementation:

  1. NR0 multi-row CTA with shared activation reuse: Potential to improve from 69 to 110-140 t/s
  2. Hot loop load deduplication: Reduce redundant memory accesses
  3. restrict qualifiers + vectorized loads: Enable better compiler optimization
  4. Small-batch kernels: Optimize for common serving scenarios
  5. CTA shape sweep per architecture: Tailor for different GPU architectures

Llama-Specific Considerations

Llama-family models show 6-8x higher per-layer error amplification with WHT-rotated FFN tensors. This has led to two recommended configurations:

  • Hybrid (TQ4 attn + Q4_K FFN)
  • Premium (TQ4 attn + Q5_K/Q6_K FFN)

Both configurations beat Q4_K_M in quality and speed at similar model sizes.
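In practice such a recipe amounts to a per-tensor type map. The sketch below is hypothetical: the name patterns follow GGUF tensor-naming conventions, and pick_quant is not a real llama-quantize API.

```python
def pick_quant(tensor_name, profile="hybrid"):
    # Hybrid:  TQ4_1S attention + Q4_K FFN
    # Premium: TQ4_1S attention + Q5_K FFN
    ffn_type = "Q4_K" if profile == "hybrid" else "Q5_K"
    if any(p in tensor_name for p in ("attn_q", "attn_k", "attn_v", "attn_output")):
        return "TQ4_1S"
    if "ffn_" in tensor_name:
        return ffn_type            # WHT rotation amplifies Llama FFN error
    return "Q8_0"                  # embeddings/output: assumed higher precision
```

The point of the split is that attention projections tolerate the rotated 4-bit codebook well, while Llama FFN tensors are routed to K-quants that avoid the amplified rotation error.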

@turquoisebaydev

Practical Implications

VRAM Efficiency

The combination of weight compression and KV cache compression provides significant VRAM savings:

  • TQ4_1S + turbo3 KV reduces total memory footprint by ~30%
  • Enables 70B+ models on consumer hardware
  • Extends maximum context length substantially
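A back-of-the-envelope budget shows where the savings come from. The shapes below (an 8B-class model with grouped-query attention) and the bit widths are illustrative assumptions, not measured numbers, so they will not exactly reproduce the ~30% figure above:

```python
def vram_gb(params_b, weight_bpw, n_layers, n_kv_heads, head_dim, ctx, kv_bits):
    # Weights at weight_bpw bits/weight plus K+V cache at kv_bits bits/element.
    weights = params_b * 1e9 * weight_bpw / 8
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bits / 8
    return (weights + kv) / 1e9

# 8B-class model: 32 layers, 8 KV heads of dim 128, 48K context.
full = vram_gb(8, 8.5, 32, 8, 128, 48_000, 16)  # q8_0 weights, fp16 KV
slim = vram_gb(8, 5.0, 32, 8, 128, 48_000, 4)   # TQ4_1S weights, 4-bit KV
print(f"{full:.1f} GB -> {slim:.1f} GB")
```

Note that the KV term scales linearly with context length, which is why compressing both weights and cache is what unlocks the longer-context configurations.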

Use Cases

  1. Long-context applications: The KV cache compression enables handling of much longer contexts
  2. Resource-constrained environments: Smaller model sizes enable deployment on more hardware
  3. High-throughput serving: The efficiency improvements benefit production deployments

Future Development

The implementation is actively being refined with:

  • Ongoing CUDA optimization
  • HIP/ROCm compatibility in development
  • Additional model family testing
  • Integration with broader llama.cpp ecosystem

The community validation has been extensive, with 12+ GPUs, 11+ models, and 10+ independent testers reporting zero regressions. This thorough validation has led to the merge of the feature into the main branch to reduce testing friction.

@JonCooperWorks

Conclusion

TQ4_1S weight compression represents a significant advancement in model quantization techniques, offering substantial size reductions with minimal quality impact. The combination of WHT rotation and Lloyd-Max centroids provides a mathematically sound approach to compression that has been thoroughly validated across multiple hardware platforms and model families.

As the CUDA port continues to be optimized and additional platforms are supported, this technique has the potential to become a standard approach for deploying large language models on resource-constrained hardware. The synergy between weight compression and KV cache compression opens new possibilities for long-context applications and high-throughput serving scenarios.

The open-source nature of the implementation and the extensive community testing ensure that this advancement will benefit the broader AI/ML ecosystem, enabling more efficient deployment of increasingly large language models.

For more technical details, see the original research paper and getting started guide.
