A major pull request in llama.cpp has enabled AMD's Matrix Cores (via MFMA instructions) and stream-K scheduling for CDNA 3 GPUs, dramatically accelerating quantized inference. The update removes NVIDIA-specific hardware assumptions and delivers up to 9.5K tokens/sec on MI300X hardware, rivalling high-end NVIDIA performance. This collaboration between AMD engineers and the llama.cpp community marks a leap forward for open-source AI on alternative hardware.