TQ4_1S Weight Compression: Breakthrough in Model Quantization for llama.cpp


Startups Reporter

TheTom has implemented TQ4_1S (4-bit, 5.0 BPW) and TQ3_1S (3-bit, 4.0 BPW) weight quantization using WHT rotation + Lloyd-Max centroids in llama-cpp-turboquant. The Metal-only implementation achieves 27-42% model size reductions with minimal perplexity increases (+0.4-5.8%) and competitive speed. A CUDA port is in progress, and community testing shows promising results across multiple GPU architectures and model families.


The pull request #45 in TheTom/llama-cpp-turboquant introduces a significant advancement in model quantization techniques with TQ4_1S weight compression. This implementation uses WHT (Walsh-Hadamard Transform) rotation combined with Lloyd-Max centroids to achieve substantial model size reductions while maintaining performance and quality.

Technical Implementation

At its core, TQ4_1S represents a sophisticated approach to post-training quantization that requires no retraining, calibration data, or model modification. The technique uses:

  • WHT rotation: an orthogonal Walsh-Hadamard transform that spreads large outlier weights across each block, flattening the value distribution so it quantizes with less error
  • Lloyd-Max centroids: a scalar quantizer whose levels are iteratively fit to the observed weight distribution to minimize mean squared error
  • Fused Metal kernel: Zero threadgroup memory implementation with cooperative SIMD rotation via simd_shuffle_xor
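The two building blocks are easy to sketch in plain Python. The fast Walsh-Hadamard transform and the 1-D Lloyd-Max iteration below are generic textbook versions, not the actual llama-cpp-turboquant code; the block size, codebook size (16 levels for 4-bit), and normalization are illustrative assumptions.

```python
import random

def fwht(v):
    """In-place fast Walsh-Hadamard transform (length must be a power of two)."""
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    return v

def lloyd_max(samples, k=16, iters=30):
    """1-D Lloyd-Max: alternate nearest-level assignment and cell-mean
    update; converges to a locally MSE-optimal k-level codebook."""
    lo, hi = min(samples), max(samples)
    levels = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for s in samples:
            cells[min(range(k), key=lambda c: (s - levels[c]) ** 2)].append(s)
        levels = [sum(c) / len(c) if c else levels[i] for i, c in enumerate(cells)]
    return levels

# Rotate a 32-weight block, then fit a 16-level (4-bit) codebook to it.
random.seed(0)
block = [random.gauss(0, 1) for _ in range(32)]
rotated = fwht(block[:])
codebook = lloyd_max(rotated, k=16)
```

Because the rotation is orthogonal (up to a scale factor of n), no information is lost before quantization; the codebook then only has to cover the flattened post-rotation distribution.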

The V2.1 implementation introduces several optimizations:

  • NR0=8 configuration for better data reuse
  • Zero threadgroup memory design
  • Efficient SIMD operations
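The zero-threadgroup-memory design is possible because a Walsh-Hadamard butterfly only ever pairs lane i with lane i XOR mask, which is exactly the register exchange Metal's simd_shuffle_xor provides. A Python model of that data flow (lane count and values are illustrative):

```python
def simd_wht(lanes):
    # Each stage pairs lane i with lane i ^ mask -- the exchange that
    # simd_shuffle_xor performs in registers, with no threadgroup memory.
    n, mask = len(lanes), 1
    while mask < n:
        partner = [lanes[i ^ mask] for i in range(n)]  # the "shuffle"
        lanes = [lanes[i] + partner[i] if i & mask == 0
                 else partner[i] - lanes[i] for i in range(n)]
        mask <<= 1
    return lanes
```

Each of the log2(n) stages needs one register exchange per lane, so a 32-wide SIMD group rotates a 32-element block in five shuffle steps without touching shared memory.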

Performance Benchmarks

The quantization technique shows impressive results across multiple models:

Model Size Reductions

  • Qwen2.5-1.5B: 27% reduction
  • Qwen3.5-27B: 28% reduction
  • Qwen3.5-35B MoE: 37% reduction
  • Qwen2.5-72B: 38% reduction
  • Phi-4 14B: 36% reduction
  • Llama 3.1 70B: 29-42% reduction (depending on configuration)
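The headline reductions follow almost directly from the bits-per-weight figures. Assuming a q8_0 baseline at roughly 8.5 BPW (the baseline used in the speed benchmarks; the per-model baseline here is an assumption), a 5.0 BPW format gives about a 41% ceiling, with real files landing lower because embeddings and other tensors stay at higher precision:

```python
def approx_size_gb(params_b, bpw):
    # Ideal file size if every weight were stored at bpw bits;
    # real GGUFs keep some tensors (e.g. embeddings) at higher precision.
    return params_b * 1e9 * bpw / 8 / 1e9

baseline = approx_size_gb(72, 8.5)  # q8_0 stores ~8.5 bits per weight
tq4      = approx_size_gb(72, 5.0)  # TQ4_1S stores 5.0 bits per weight
print(f"ideal reduction: {1 - tq4 / baseline:.0%}")
```

The gap between this ideal figure and the measured per-model numbers is mostly the fraction of parameters left at higher precision.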

Perplexity Impact

Quality holds up well, with small perplexity increases:

  • Qwen2.5-1.5B: +1.9%
  • Qwen3.5-27B: +1.3%
  • Qwen2.5-72B: +3.9%
  • Phi-4 14B: +1.0%
  • Llama 3.1 70B: +5.8% (Premium) or +16% (Hybrid)

Performance Characteristics

  • Metal implementation achieves 85-99% of q8_0 baseline performance
  • CUDA port (in progress) shows 39% of q8_0 baseline with potential for improvement
  • No performance regression on existing quantization methods


Hardware and Platform Support

Current Implementation

  • Metal-only runtime (Apple Silicon)
  • Quantization works on any platform
  • Runtime dequant kernels are Metal-specific
  • Compressed GGUFs will not run correctly on CUDA/HIP until backends are ported

CUDA Port Progress

Signalnine has implemented a CUDA port with the following features:

  • CUDA dequant for TQ4_1S/TQ3_1S
  • Fused mul_mat_vec kernel with pre-rotated activations
  • mmvq exclusion for fused dispatch path
  • llama-quantize registration for TQ4_1S/TQ3_1S types
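Pre-rotating activations works because the Hadamard matrix H satisfies HᵀH = n·I, so wᵀx = (Hw)ᵀ(Hx)/n: the kernel dots dequantized (already-rotated) weight rows directly against a once-rotated activation vector and never undoes the rotation. A toy Python reference of that identity (the codebook and layout are illustrative, not the real kernel's):

```python
def fwht(v):
    # Fast Walsh-Hadamard transform; length must be a power of two.
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    return v

def fused_matvec(index_rows, codebook, x):
    # index_rows: per-row 4-bit codebook indices of WHT-rotated weight rows.
    # Rotate x once, then accumulate in the rotated domain; the 1/n factor
    # undoes the unnormalized transform applied to both operands.
    n = len(x)
    xr = fwht(list(x))
    return [sum(codebook[q] * xr[j] for j, q in enumerate(row)) / n
            for row in index_rows]
```

With an exact codebook this reproduces the plain dot product, which is the invariant behind the matching perplexity of the cuBLAS and fused paths.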

Initial CUDA benchmarks show:

  • 20 t/s with cuBLAS path
  • 69 t/s with fused kernel (vs 177 t/s for q8_0)
  • 39% of q8_0 performance
  • Identical perplexity from the cuBLAS and fused implementations

Community Testing and Validation

The implementation has undergone extensive community testing across multiple platforms:

Apple Silicon Results

  • M5 Max shows no regressions in existing functionality
  • Mac Mini M2 Pro confirms performance improvements
  • Qwen3.5-27B achieves 92% decode speed at 28% smaller size

@shivam2014

NVIDIA GPU Results

  • RTX 4090 shows significant KV cache savings with TurboQuant
  • RTX PRO 6000 demonstrates improved performance with weight compression
  • Multi-GPU configurations tested across various architectures

@christopheraleman1015

Model-Specific Findings

  1. Qwen Models: Show consistent 27-38% size reductions with minimal quality impact
  2. Llama Models: Benefit from hybrid configurations (TQ4 attention + Q4_K/Q5_K FFN)
  3. MoE Models: Special handling required for expert layers
  4. Long Context: TurboQuant KV extends max context from 48K to 100K for 8B models

Technical Insights and Optimizations

Performance Optimization Opportunities

TheTom identified several optimization opportunities for the CUDA implementation:

  1. NR0 multi-row CTA with shared activation reuse: Potential to improve from 69 to 110-140 t/s
  2. Hot loop load deduplication: Reduce redundant memory accesses
  3. restrict qualifiers + vectorized loads: Enable better compiler optimization
  4. Small-batch kernels: Optimize for common serving scenarios
  5. CTA shape sweep per architecture: Tailor for different GPU architectures

Llama-Specific Considerations

Llama-family models show 6-8x higher per-layer error amplification with WHT-rotated FFN tensors. This has led to two recommended configurations:

  • Hybrid (TQ4 attn + Q4_K FFN)
  • Premium (TQ4 attn + Q5_K/Q6_K FFN)

Both configurations beat Q4_K_M in quality and speed at similar model sizes.
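In practice such a recipe amounts to a per-tensor type map. The sketch below is hypothetical: the name patterns follow GGUF tensor-naming conventions, and pick_quant is not a real llama-quantize API.

```python
def pick_quant(tensor_name, profile="hybrid"):
    # Hybrid:  TQ4_1S attention + Q4_K FFN
    # Premium: TQ4_1S attention + Q5_K FFN
    ffn_type = "Q4_K" if profile == "hybrid" else "Q5_K"
    if any(p in tensor_name for p in ("attn_q", "attn_k", "attn_v", "attn_output")):
        return "TQ4_1S"
    if "ffn_" in tensor_name:
        return ffn_type            # WHT rotation amplifies Llama FFN error
    return "Q8_0"                  # embeddings/output: assumed higher precision
```

The point of the split is that attention projections tolerate the rotated 4-bit codebook well, while Llama FFN tensors are routed to K-quants that avoid the amplified rotation error.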

@turquoisebaydev

Practical Implications

VRAM Efficiency

The combination of weight compression and KV cache compression provides significant VRAM savings:

  • TQ4_1S + turbo3 KV reduces total memory footprint by ~30%
  • Enables 70B+ models on consumer hardware
  • Extends maximum context length substantially
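A back-of-the-envelope budget shows where the savings come from. The shapes below (an 8B-class model with grouped-query attention) and the bit widths are illustrative assumptions, not measured numbers, so they will not exactly reproduce the ~30% figure above:

```python
def vram_gb(params_b, weight_bpw, n_layers, n_kv_heads, head_dim, ctx, kv_bits):
    # Weights at weight_bpw bits/weight plus K+V cache at kv_bits bits/element.
    weights = params_b * 1e9 * weight_bpw / 8
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bits / 8
    return (weights + kv) / 1e9

# 8B-class model: 32 layers, 8 KV heads of dim 128, 48K context.
full = vram_gb(8, 8.5, 32, 8, 128, 48_000, 16)  # q8_0 weights, fp16 KV
slim = vram_gb(8, 5.0, 32, 8, 128, 48_000, 4)   # TQ4_1S weights, 4-bit KV
print(f"{full:.1f} GB -> {slim:.1f} GB")
```

Note that the KV term scales linearly with context length, which is why compressing both weights and cache is what unlocks the longer-context configurations.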

Use Cases

  1. Long-context applications: The KV cache compression enables handling of much longer contexts
  2. Resource-constrained environments: Smaller model sizes enable deployment on more hardware
  3. High-throughput serving: The efficiency improvements benefit production deployments

Future Development

The implementation is actively being refined with:

  • Ongoing CUDA optimization
  • HIP/ROCm compatibility in development
  • Additional model family testing
  • Integration with broader llama.cpp ecosystem

The community validation has been extensive, with 12+ GPUs, 11+ models, and 10+ independent testers reporting zero regressions. This thorough validation has led to the merge of the feature into the main branch to reduce testing friction.

@JonCooperWorks

Conclusion

TQ4_1S weight compression represents a significant advancement in model quantization techniques, offering substantial size reductions with minimal quality impact. The combination of WHT rotation and Lloyd-Max centroids provides a mathematically sound approach to compression that has been thoroughly validated across multiple hardware platforms and model families.

As the CUDA port continues to be optimized and additional platforms are supported, this technique has the potential to become a standard approach for deploying large language models on resource-constrained hardware. The synergy between weight compression and KV cache compression opens new possibilities for long-context applications and high-throughput serving scenarios.

The open-source nature of the implementation and the extensive community testing ensure that this advancement will benefit the broader AI/ML ecosystem, enabling more efficient deployment of increasingly large language models.

For more technical details, see the original research paper and getting started guide.
