The CDNA 3 Breakthrough

A landmark pull request in llama.cpp has fundamentally transformed AMD GPU performance for large language models. By enabling Matrix Core operations (MFMA instructions) within the MMQ (quantized matrix multiplication) kernels, AMD's CDNA 3 architecture (powering MI300X GPUs) now achieves unprecedented throughput, reaching 9.5K tokens/second on DeepSeekV3 models. This represents a 2-3x speedup over previous implementations for key quantization types such as Q4_0 and Q8_0.

Technical Triumphs

The 21-commit contribution delivers three critical advancements:

  1. MFMA Integration: Enables CDNA 3's matrix cores via V_MFMA_I32_16X16X16I8 instructions, replacing inefficient emulation paths (see the sketch after this list)
  2. Stream-K Parallelism: Implements stream-K workgroup scheduling for irregular workloads, improving hardware utilization
  3. Hardware Agnosticism: Eliminates NVIDIA-specific WARP_SIZE constants, using __AMDGCN_WAVEFRONT_SIZE instead

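To make points 1 and 3 concrete, here is a minimal HIP sketch of a single MFMA accumulation step. It is illustrative only: the function names, guards, and fallback are hypothetical rather than the PR's actual code, and it shows the CDNA 1/2 form of the builtin for brevity (CDNA 3 exposes analogous MFMA instructions).

```cpp
// Minimal sketch of one int8 MFMA accumulation step (hypothetical, not llama.cpp code).
#include <hip/hip_runtime.h>

typedef int int32x4_t __attribute__((ext_vector_type(4)));

// Each lane of the wavefront contributes 4 packed int8 values of A and of B,
// and holds 4 int32 elements of the 16x16 accumulator tile C.
__device__ int32x4_t mfma_i8_step(int a_packed, int b_packed, int32x4_t acc) {
#if defined(__gfx908__) || defined(__gfx90a__)
    // One matrix instruction per wavefront: C (i32) += A (16x16 i8) * B (16x16 i8).
    return __builtin_amdgcn_mfma_i32_16x16x16i8(a_packed, b_packed, acc, 0, 0, 0);
#else
    // Rough stand-in for a non-MFMA fallback path: per-lane byte dot product.
    for (int k = 0; k < 4; ++k) {
        const int8_t a = (int8_t) ((a_packed >> (8 * k)) & 0xFF);
        const int8_t b = (int8_t) ((b_packed >> (8 * k)) & 0xFF);
        acc[0] += (int) a * (int) b;  // simplified: real kernels map lanes differently
    }
    return acc;
#endif
}

// The wavefront width comes from the compiler instead of a hard-coded 32-wide
// NVIDIA warp: __AMDGCN_WAVEFRONT_SIZE is 64 on CDNA GPUs.
#if defined(__AMDGCN_WAVEFRONT_SIZE)
__device__ int lane_id() { return threadIdx.x % __AMDGCN_WAVEFRONT_SIZE; }
#else
__device__ int lane_id() { return threadIdx.x % 32; }
#endif
```
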
"We redesigned all quants to use the same tile size, MFMA instructions, and warp counts—this unified approach ensures consistent performance gains across quantization types," noted AMD engineer @deepsek in the PR discussion.

Performance Unleashed

Benchmarks reveal staggering improvements on 512-token batches:

| Quantization | Previous (TFLOPS) | New (TFLOPS) | Improvement |
|--------------|-------------------|--------------|-------------|
| q4_0         | 82.51             | 223.52       | 171%        |
| q8_0         | 91.70             | 222.75       | 143%        |
| iq2_xxs      | 76.06             | 195.34       | 157%        |
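
These percentages line up with the 2-3x figure quoted above: q4_0, for example, goes from 82.51 to 223.52 TFLOPS, i.e. 223.52 / 82.51 ≈ 2.7x, which is the 171% improvement shown.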

Critically, stream-K scheduling delivers roughly 2x better hardware utilization on partial (ragged) tile work, while register-pressure optimizations keep the kernels free of spills. The changes passed 6,534 backend operation tests across all major quantization types.
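
How stream-K achieves that is easier to see with a small sketch. The following is a generic illustration of stream-K partitioning with hypothetical names, not llama.cpp's implementation: rather than giving each workgroup whole output tiles, the flattened tile-by-K iteration space is split evenly, so ragged tails no longer leave compute units idle, and tiles whose K range is split across workgroups are finished in a fix-up reduction.

```cpp
#include <algorithm>

// Generic stream-K partitioning sketch (hypothetical, not llama.cpp's code).
// The product is viewed as num_tiles output tiles, each needing k_iters_per_tile
// inner (K) iterations; the flattened iteration space is split evenly across
// workgroups so irregular shapes keep every compute unit busy.
struct stream_k_range {
    int begin;  // first global iteration owned by this workgroup
    int end;    // one past the last iteration owned by this workgroup
};

inline stream_k_range stream_k_partition(int workgroup_id, int num_workgroups,
                                         int num_tiles, int k_iters_per_tile) {
    const int total = num_tiles * k_iters_per_tile;
    const int base  = total / num_workgroups;   // everyone gets at least this many
    const int extra = total % num_workgroups;   // the first `extra` workgroups get one more
    const int begin = workgroup_id * base + std::min(workgroup_id, extra);
    const int end   = begin + base + (workgroup_id < extra ? 1 : 0);
    return { begin, end };
}

// Each workgroup walks [begin, end); when a tile's K iterations span two or more
// workgroups, each produces a partial accumulator and a fix-up pass reduces them.
```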

The Road Ahead

This collaboration between AMD's engineering team and llama.cpp maintainers (@JohannesGaessler, @ggerganov, @IMbackK) sets the stage for deeper hardware integration:
- hipBLASLt optimizations for BF16 operations
- FlashAttention improvements
- Expanded CDNA 2/3 support

As consumer RDNA 3+ GPUs adopt similar matrix math capabilities, these optimizations will democratize high-performance LLM inference beyond datacenter hardware. The removal of NVIDIA-specific constants also paves the way for future multi-architecture support.

Source: PR #14624 (merged July 26, 2025)