Azure completed MLPerf Training v6.0's Llama 3.1 405B benchmark in just over seven minutes using 8,192 NVIDIA GB200 GPUs, demonstrating the largest reported cluster of its kind and 99.8% weak scaling efficiency across 128 racks.
Azure posted the fastest MLPerf Training v6.0 result to date for Llama 3.1 405B, training the model in just over seven minutes according to MLCommons.
At extreme scale, communication overhead and system stability dominate training performance. Azure's full-stack approach paid off here: the result depends on hardware, networking, and software working together.
The Cluster
Azure assembled the largest reported GB200 NVL72 cluster in MLPerf Training history: 2,048 compute tray nodes spread across 128 racks, totaling 8,192 NVIDIA GB200 GPUs.
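The three totals above are mutually consistent, which a quick arithmetic check makes visible. This is only a sketch: the per-rack and per-node counts are derived by dividing the quoted totals and are not stated in the source.

```python
# Totals quoted in the article.
RACKS = 128
COMPUTE_TRAY_NODES = 2_048
TOTAL_GPUS = 8_192

# Derived values (not stated in the source; plain division of the totals).
nodes_per_rack = COMPUTE_TRAY_NODES // RACKS
gpus_per_node = TOTAL_GPUS // COMPUTE_TRAY_NODES
gpus_per_rack = TOTAL_GPUS // RACKS

# Sanity check: the derived counts multiply back to the quoted GPU total.
assert RACKS * nodes_per_rack * gpus_per_node == TOTAL_GPUS

print(nodes_per_rack, gpus_per_node, gpus_per_rack)  # 16 4 64
```

Read this way, the submission used 16 compute trays of 4 GPUs each in every rack, or 64 GPUs per rack.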
The company's Fairwater AI supercomputing infrastructure powered this result. Fairwater provides high-performance GPU scale-up domains with resilient, scale-out networking, built for frontier-scale distributed AI training where communication overhead and synchronization latency become major bottlenecks.
What Made It Work
Three architectural elements converged to make this result possible.
First, fifth-generation NVIDIA NVLink delivered 1,800 GB/s per GPU for intra-rack communication, keeping latency-sensitive operations on the fast scale-up domain.
Second, Azure's 100 GB/s MRC fabric, accelerated by NVIDIA DOCA and connected with ConnectX-8 SuperNICs and Spectrum-X Ethernet switches, handled inter-rack communication across the cluster.
Third, topology-aware workload mapping aligned parallelism strategies with the underlying network structure. Llama 3.1 405B distributes work across four parallelism dimensions: tensor, context, pipeline, and data parallelism. Each generates different communication patterns, and some sit directly on the critical training path.
Tensor-parallel and context-parallel communication must complete before computation can continue. Pipeline-parallel communication is also on the critical path, but across stages rather than within a layer. All three run over NVLink.
Data-parallel communication, by contrast, overlaps with backward compute because gradient reduction runs alongside the backward pass. This traffic goes over the MRC scale-out network.
The result: cross-rack MRC communication adds roughly 20 milliseconds of exposed time on the critical path of a 1.27-second training step. That amounts to about 1.6% of step time.
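The mapping described above, and the share of step time it leaves exposed, can be written down as a small table plus one division. This is an illustrative sketch, not Azure's configuration: the `Placement` type and its field names are hypothetical, and only the dimension-to-fabric assignments and the two timing figures come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    dimension: str               # parallelism dimension named in the article
    fabric: str                  # "NVLink" (scale-up) or "MRC" (scale-out)
    overlaps_with_compute: bool  # True if traffic is mostly hidden behind compute


# Assignments as described in the text; the structure itself is illustrative.
PLACEMENTS = [
    Placement("tensor", "NVLink", overlaps_with_compute=False),
    Placement("context", "NVLink", overlaps_with_compute=False),
    Placement("pipeline", "NVLink", overlaps_with_compute=False),
    Placement("data", "MRC", overlaps_with_compute=True),
]

# Every dimension that blocks compute stays on the scale-up fabric.
blocking_fabrics = {p.fabric for p in PLACEMENTS if not p.overlaps_with_compute}
assert blocking_fabrics == {"NVLink"}

# Exposed cross-rack time as a share of one training step (figures from the text).
EXPOSED_MRC_SECONDS = 0.020
STEP_SECONDS = 1.27
exposed_share = EXPOSED_MRC_SECONDS / STEP_SECONDS

print(sorted(blocking_fabrics), f"{exposed_share:.1%}")  # ['NVLink'] 1.6%
```

The design point is that only the traffic able to hide behind the backward pass is sent across racks, so the slower fabric contributes little to the critical path.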
Stable Step Time at Scale
The team tested weak scaling by maintaining constant workload per GPU while expanding cluster capacity. Step time stayed nearly identical as the cluster grew.
At 112 GB200 NVL72 racks (7,168 GPUs), step time measured 1.2734 seconds. At 128 racks (8,192 GPUs), it measured 1.2712 seconds. The difference: about 2 milliseconds.
That corresponds to 99.8% weak scaling efficiency when adding 1,024 GPUs. Run-to-run step-time variation stayed within ±0.04% to ±0.05%, confirming predictable execution at extreme scale.
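The step-time comparison is easy to reproduce from the two measurements. The article does not say how the 99.8% figure was computed, so the formula below is an assumption: taking the ratio of the smaller step time to the larger one is one reading that reproduces it.

```python
# Step times quoted in the article, in seconds.
STEP_112_RACKS = 1.2734  # 7,168 GPUs
STEP_128_RACKS = 1.2712  # 8,192 GPUs

# Absolute gap between the two configurations, in milliseconds.
delta_ms = abs(STEP_112_RACKS - STEP_128_RACKS) * 1000

# Assumed reading of the efficiency figure: smaller step time over larger.
agreement = min(STEP_112_RACKS, STEP_128_RACKS) / max(STEP_112_RACKS, STEP_128_RACKS)

# GPUs added between the two runs.
gpus_added = 8_192 - 7_168

print(f"{delta_ms:.1f} ms, {agreement:.1%}, +{gpus_added} GPUs")
```

Note that the 128-rack run was the slightly faster of the two, so the gap is better read as measurement-level agreement between the runs than as a slowdown from adding racks.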
This efficiency directly translates to maximized hardware utilization and faster overall time-to-train. The scale-out network becomes nearly invisible to training performance.
Why This Matters
As model sizes continue growing into hundreds of billions of parameters, consistent performance requires more than raw compute capacity. It demands a highly efficient communication infrastructure.
The MLPerf Training benchmark tests real-world training workloads under controlled conditions. Results like this show that large-scale LLM training can achieve near-linear scaling when the communication stack is designed correctly from the start.
Azure's result demonstrates that frontier AI training at 8,000+ GPU scale is not just possible but reproducible with stable, predictable performance.
This work was made possible through collaboration across multiple Azure and NVIDIA teams.
