AMD Matrix Cores Supercharge llama.cpp: MFMA and Stream-K Unlock 9.5K Tokens/sec on MI300X
A major pull request in llama.cpp enables AMD's Matrix Cores, driven through MFMA (matrix fused multiply-add) instructions, together with stream-K scheduling on CDNA 3 GPUs, speeding up quantized inference. The update removes NVIDIA-specific hardware assumptions from the code and delivers up to 9.5K tokens/sec on MI300X hardware, rivalling high-end NVIDIA performance. The work is a collaboration between AMD engineers and the llama.cpp community, and it is a significant step for open-source AI on non-NVIDIA hardware.
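
For readers unfamiliar with stream-K, the general idea can be sketched in a few lines of Python. This is an illustration of the scheduling technique only, not the code from the pull request: the function name `stream_k_partition` and its parameters are hypothetical, and a real GPU kernel would also need a fix-up step that sums the partial results of any tile shared between workers. Instead of handing out whole output tiles, stream-K divides the total number of inner-loop (K-dimension) iterations evenly across a fixed number of workers, so a worker's share may start or end in the middle of a tile.

```python
def stream_k_partition(num_tiles, iters_per_tile, num_workers):
    """Split num_tiles * iters_per_tile loop iterations evenly across workers.

    Hypothetical helper for illustration. Returns one list per worker; each
    entry is (tile_index, k_begin, k_end), a half-open range of K-iterations
    that the worker performs inside that output tile.
    """
    total_iters = num_tiles * iters_per_tile
    base, extra = divmod(total_iters, num_workers)

    schedule = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional iteration each,
        # so no two workers differ by more than one iteration of work.
        count = base + (1 if worker < extra else 0)
        end = start + count

        segments = []
        it = start
        while it < end:
            tile = it // iters_per_tile
            k_begin = it % iters_per_tile
            # Stop at the tile boundary or at the end of this worker's share.
            k_end = min(iters_per_tile, k_begin + (end - it))
            segments.append((tile, k_begin, k_end))
            it += k_end - k_begin

        schedule.append(segments)
        start = end
    return schedule


if __name__ == "__main__":
    # 3 output tiles, 4 K-iterations each, 2 workers.
    for worker, segments in enumerate(stream_k_partition(3, 4, 2)):
        print(worker, segments)
```

With 3 tiles of 4 iterations each and 2 workers, a tile-per-worker split would leave one worker with twice the work of the other. In this sketch both workers get 6 iterations, and the middle tile is split between them, which is why the partial sums for that tile would need to be combined afterwards.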