The CDNA 3 Breakthrough

A landmark pull request in llama.cpp has fundamentally transformed AMD GPU performance for large language models. By enabling Matrix Core operations (MFMA instructions) within the MMQ (quantized matrix multiplication) kernels, AMD's CDNA 3 architecture (powering MI300X GPUs) now achieves unprecedented throughput, reaching 9.5K tokens/second on DeepSeekV3 models. This represents a 2-3x speedup over previous implementations for key quantization types such as Q4_0 and Q8_0.

Technical Triumphs

The 21-commit contribution delivers three critical advancements:

  1. MFMA Integration: Enables CDNA 3's matrix cores via V_MFMA_I32_16X16X16I8 instructions, replacing inefficient emulation paths (see the sketch after this list)
  2. Stream-K Parallelism: Implements stream-K workgroup scheduling for irregular workloads, improving hardware utilization
  3. Hardware Agnosticism: Eliminates NVIDIA-specific WARP_SIZE constants, using __AMDGCN_WAVEFRONT_SIZE instead

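To make points 1 and 3 concrete, here is a minimal HIP sketch of a single MFMA accumulation step. It is illustrative only: the function names, guards, and fallback are hypothetical rather than the PR's actual code, and it shows the CDNA 1/2 form of the builtin for brevity (CDNA 3 exposes analogous MFMA instructions).

```cpp
// Minimal sketch of one int8 MFMA accumulation step (hypothetical, not llama.cpp code).
#include <hip/hip_runtime.h>

typedef int int32x4_t __attribute__((ext_vector_type(4)));

// Each lane of the wavefront contributes 4 packed int8 values of A and of B,
// and holds 4 int32 elements of the 16x16 accumulator tile C.
__device__ int32x4_t mfma_i8_step(int a_packed, int b_packed, int32x4_t acc) {
#if defined(__gfx908__) || defined(__gfx90a__)
    // One matrix instruction per wavefront: C (i32) += A (16x16 i8) * B (16x16 i8).
    return __builtin_amdgcn_mfma_i32_16x16x16i8(a_packed, b_packed, acc, 0, 0, 0);
#else
    // Rough stand-in for a non-MFMA fallback path: per-lane byte dot product.
    for (int k = 0; k < 4; ++k) {
        const int8_t a = (int8_t) ((a_packed >> (8 * k)) & 0xFF);
        const int8_t b = (int8_t) ((b_packed >> (8 * k)) & 0xFF);
        acc[0] += (int) a * (int) b;  // simplified: real kernels map lanes differently
    }
    return acc;
#endif
}

// The wavefront width comes from the compiler instead of a hard-coded 32-wide
// NVIDIA warp: __AMDGCN_WAVEFRONT_SIZE is 64 on CDNA GPUs.
#if defined(__AMDGCN_WAVEFRONT_SIZE)
__device__ int lane_id() { return threadIdx.x % __AMDGCN_WAVEFRONT_SIZE; }
#else
__device__ int lane_id() { return threadIdx.x % 32; }
#endif
```
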
"We redesigned all quants to use the same tile size, MFMA instructions, and warp counts—this unified approach ensures consistent performance gains across quantization types," noted AMD engineer @deepsek in the PR discussion.

Performance Unleashed

Benchmarks reveal staggering improvements on 512-token batches:

| Quantization | Previous (TFLOPS) | New (TFLOPS) | Improvement |
|--------------|-------------------|--------------|-------------|
| q4_0         | 82.51             | 223.52       | 171%        |
| q8_0         | 91.70             | 222.75       | 143%        |
| iq2_xxs      | 76.06             | 195.34       | 157%        |
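
These percentages line up with the 2-3x figure quoted above: q4_0, for example, goes from 82.51 to 223.52 TFLOPS, i.e. 223.52 / 82.51 ≈ 2.7x, which is the 171% improvement shown.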

Critically, stream-K scheduling delivers roughly 2x better hardware utilization on partial (ragged) tile work, while register-pressure optimizations keep the kernels free of spills. The changes passed 6,534 backend operation tests across all major quantization types.
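
How stream-K achieves that is easier to see with a small sketch. The following is a generic illustration of stream-K partitioning with hypothetical names, not llama.cpp's implementation: rather than giving each workgroup whole output tiles, the flattened tile-by-K iteration space is split evenly, so ragged tails no longer leave compute units idle, and tiles whose K range is split across workgroups are finished in a fix-up reduction.

```cpp
#include <algorithm>

// Generic stream-K partitioning sketch (hypothetical, not llama.cpp's code).
// The product is viewed as num_tiles output tiles, each needing k_iters_per_tile
// inner (K) iterations; the flattened iteration space is split evenly across
// workgroups so irregular shapes keep every compute unit busy.
struct stream_k_range {
    int begin;  // first global iteration owned by this workgroup
    int end;    // one past the last iteration owned by this workgroup
};

inline stream_k_range stream_k_partition(int workgroup_id, int num_workgroups,
                                         int num_tiles, int k_iters_per_tile) {
    const int total = num_tiles * k_iters_per_tile;
    const int base  = total / num_workgroups;   // everyone gets at least this many
    const int extra = total % num_workgroups;   // the first `extra` workgroups get one more
    const int begin = workgroup_id * base + std::min(workgroup_id, extra);
    const int end   = begin + base + (workgroup_id < extra ? 1 : 0);
    return { begin, end };
}

// Each workgroup walks [begin, end); when a tile's K iterations span two or more
// workgroups, each produces a partial accumulator and a fix-up pass reduces them.
```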

The Road Ahead

This collaboration between AMD's engineering team and llama.cpp maintainers (@JohannesGaessler, @ggerganov, @IMbackK) sets the stage for deeper hardware integration:
- hipBLASLt optimizations for BF16 operations
- FlashAttention improvements
- Expanded CDNA 2/3 support

As consumer RDNA 3+ GPUs adopt similar matrix math capabilities, these optimizations will democratize high-performance LLM inference beyond datacenter hardware. The removal of NVIDIA-specific constants also paves the way for future multi-architecture support.

Source: PR #14624 (merged July 26, 2025)