AMD Achieves Milestone in Large-Scale MoE Pretraining with ZAYA1 on MI300X and Pollara

The quest for hardware diversity in AI training has a new champion: AMD's MI300X GPUs paired with the Pollara interconnect. Researchers led by Quentin Anthony have completed the first large-scale mixture-of-experts (MoE) pretraining run on this all-AMD platform, yielding ZAYA1, an MoE model with 760M active parameters (8.3B total) that matches leading base models such as Qwen3-4B and Gemma3-12B while surpassing Llama-3-8B and OLMoE on reasoning, mathematics, and coding benchmarks.

Detailed in arXiv paper 2511.17127, submitted November 21, 2025, the work is more than a success story: it equips systems architects and ML engineers with comprehensive benchmarks and design principles tailored to AMD's ecosystem, potentially accelerating its adoption amid NVIDIA's market dominance.


Unpacking the AMD Training Stack

At the systems level, the study provides first-of-its-kind microbenchmarks for Pollara's core collectives—all-reduce, reduce-scatter, all-gather, and broadcast—spanning diverse message sizes and GPU scales. These insights are critical for optimizing distributed training, where communication bottlenecks often dictate throughput.
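
To make the collective measurements concrete, here is a minimal sketch of the kind of microbenchmark the paper describes, written with PyTorch's torch.distributed API; on ROCm builds of PyTorch the "nccl" backend maps to RCCL, which is how collectives run over fabrics such as Pollara. The script layout, helper names, and message sizes are illustrative assumptions for this article, not the authors' actual harness.

```python
# Minimal all-reduce microbenchmark sketch (illustrative; not the paper's harness).
# Launch with: torchrun --nproc_per_node=8 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def bench_all_reduce(num_elems: int, iters: int = 20, warmup: int = 5) -> float:
    """Time dist.all_reduce on a bf16 tensor of `num_elems` elements; return seconds per iteration."""
    x = torch.randn(num_elems, dtype=torch.bfloat16, device="cuda")
    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


def main() -> None:
    # "nccl" resolves to RCCL on ROCm builds of PyTorch.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Sweep a few message sizes (bf16 elements) as a stand-in for the paper's
    # sweep across message sizes and GPU counts.
    for num_elems in (1 << 20, 1 << 24, 1 << 28, 1 << 30):
        secs = bench_all_reduce(num_elems)
        if dist.get_rank() == 0:
            gb = num_elems * 2 / 1e9  # bf16 = 2 bytes per element
            print(f"all_reduce {gb:7.3f} GB: {secs * 1e3:8.2f} ms "
                  f"({gb / secs:6.1f} GB/s algorithmic)")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same loop, swapped to dist.reduce_scatter_tensor, dist.all_gather_into_tensor, or dist.broadcast, covers the remaining collectives the paper profiles.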

MI300X-specific benchmarks on kernel sizing and memory bandwidth further guide model design. The authors propose "MI300X-aware transformer sizing rules" for attention and MLP blocks, alongside MoE width strategies that optimize both pretraining speed and inference latency. Their training recipe covers fault tolerance, checkpoint reshaping, and other utilities often overlooked in research papers; a sketch of what such a sizing heuristic can look like follows the takeaways below.

Key MI300X Design Takeaways:
- Balance attention and MLP block sizes to maximize memory bandwidth utilization.
- Tune MoE widths to Pollara's interconnect topology.
- Use fault-tolerant checkpointing to keep multi-day runs stable.
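
As an illustration of what a hardware-aware sizing rule can look like in practice, the hypothetical helper below pads attention and MLP dimensions to GEMM-friendly multiples. The alignment constant of 256, the default head dimension, and the function names are assumptions made for this sketch; the actual MI300X rules and constants are the ones reported in the paper.

```python
# Illustrative hardware-aware sizing helper (hypothetical constants; see the
# paper for the MI300X-specific rules it actually derives).
from dataclasses import dataclass


def round_up(value: int, multiple: int) -> int:
    """Round `value` up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


@dataclass
class BlockSizes:
    hidden_size: int
    head_dim: int
    num_heads: int
    mlp_intermediate: int


def size_transformer_block(target_hidden: int,
                           head_dim: int = 128,
                           mlp_ratio: float = 4.0,
                           gemm_align: int = 256) -> BlockSizes:
    """Pick attention/MLP dimensions that stay aligned to GEMM-friendly tiles.

    `gemm_align` is an assumed alignment; real guidance should come from
    profiling the target GPU's matmul kernels and memory bandwidth.
    """
    hidden = round_up(target_hidden, gemm_align)
    num_heads = hidden // head_dim
    mlp = round_up(int(hidden * mlp_ratio), gemm_align)
    return BlockSizes(hidden, head_dim, num_heads, mlp)


if __name__ == "__main__":
    print(size_transformer_block(target_hidden=2000))
    # BlockSizes(hidden_size=2048, head_dim=128, num_heads=16, mlp_intermediate=8192)
```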

"We distill practical guidance for both systems and model design... Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining."

—Abstract from arXiv:2511.17127

ZAYA1: Benchmark-Beating Performance on AMD Alone

ZAYA1-base stands as a proof of concept for the viability of AMD's platform. Despite its modest active parameter count, it delivers competitive results across standard evaluations, signaling potential for scaled-up successors. The team hints at upcoming enhancements, positioning ZAYA1 as the starting point for a family of AMD-native foundation models.


Reshaping AI Infrastructure Choices

This paper arrives at a pivotal moment. With demand for AI compute escalating, NVIDIA shortages have put alternatives in the spotlight. Pollara now has empirical data backing it as a challenger to InfiniBand and NVLink, while the MI300X optimizations lower the entry barrier for cloud providers and enterprises building custom clusters.

For developers, the real value lies in reproducibility: detailed recipes and benchmarks enable rapid iteration. As MoE architectures gain traction for their efficiency, AMD's validated stack could diversify the training landscape, spurring innovation in cost-effective, high-performance AI systems that don't hinge on a single vendor.