
Nvidia's FP64 Emulation Gamble: Trading Precision for Performance

Hardware Reporter

Nvidia leverages tensor cores and Ozaki-scheme emulation to boost FP64 matrix throughput 4.4× over its previous generation, but AMD warns the approach breaks IEEE compliance and may falter on real-world scientific workloads.


Double-precision floating point (FP64) remains indispensable for scientific accuracy in aerospace, climate modeling, and nuclear simulations. Unlike AI workloads, which tolerate lower precision, FP64 spans roughly 18 quintillion (2^64) distinct bit patterns, essential where rounding errors can cause catastrophic failures.

Nvidia's new Rubin GPUs deliver only 33 teraFLOPS of native FP64 performance, slightly less than 2022's H100. Activating CUDA-based emulation, however, lifts matrix-operation throughput to 200 teraFLOPS. The approach decomposes each FP64 calculation into clusters of INT8 operations processed by the GPU's tensor cores, the specialized units originally built for AI workloads.

The technique, formalized as the Ozaki scheme, exploits the tensor cores' lopsided efficiency: Rubin pushes 35 petaFLOPS at FP4 yet only 33 teraFLOPS at native FP64. By repurposing those AI resources, Nvidia achieves 4.4× higher FP64 throughput than its outgoing Blackwell architecture.
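
To make the decomposition concrete, here is a minimal NumPy sketch of the slicing idea behind the Ozaki scheme. It is not Nvidia's implementation: the function names are illustrative, it uses one global power-of-two scale per matrix where production schemes scale per row or block, and plain int64 matrix products stand in for the INT8 tensor-core multiplies with INT32 accumulation.

```python
import numpy as np

def split_int8_slices(M, num_slices=8, slice_bits=7):
    """Decompose an FP64 matrix into small signed-integer 'digit' slices,
    so that M ~= scale * sum_k slices[k] * 2**(-(k + 1) * slice_bits)."""
    max_abs = np.max(np.abs(M))
    exp = np.floor(np.log2(max_abs)) if max_abs > 0 else 0.0
    scale = 2.0 ** (exp + 2)                   # global scale; |M / scale| < 1/2
    remainder = M / scale
    slices = []
    for _ in range(num_slices):
        # Peel off the next slice_bits of mantissa as a signed digit in [-64, 64].
        digit = np.round(remainder * 2.0 ** slice_bits)
        slices.append(digit.astype(np.int64))  # digit values fit in INT8
        remainder = remainder * 2.0 ** slice_bits - digit
    return scale, slices

def ozaki_matmul(A, B, num_slices=8, slice_bits=7):
    """Emulate an FP64 GEMM as a sum of exact integer slice products."""
    sa, As = split_int8_slices(A, num_slices, slice_bits)
    sb, Bs = split_int8_slices(B, num_slices, slice_bits)
    C = np.zeros((A.shape[0], B.shape[1]))
    for i, Ai in enumerate(As):
        for j, Bj in enumerate(Bs):
            if i + j >= num_slices:            # drop terms below target precision
                continue
            # On Rubin this product would run on INT8 tensor cores with INT32
            # accumulation; an exact int64 matmul stands in for it here.
            P = (Ai @ Bj).astype(np.float64)
            C += P * 2.0 ** (-(i + j + 2) * slice_bits)
    return sa * sb * C
```

With eight 7-bit slices each operand carries 56 mantissa bits, more than the 53 FP64 itself stores, so on well-scaled inputs the sketch should match a native A @ B to near machine precision. The appeal is that the 36 small integer products map onto tensor cores, which process INT8 far faster than CUDA cores process FP64.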

Accuracy vs. Speed Debate

AMD Fellow Nicholas Malaya voices skepticism: "It's quite good in benchmarks, not obvious for physical simulations." Key concerns include:

  • IEEE non-compliance: Ignores signed zeros, NaN (not-a-number), and infinity handling (demonstrated in the sketch after this list)
  • Error propagation: Minor calculation deviations amplify in complex systems
  • Memory bloat: Requires 2× more capacity for matrix storage
  • Workload limitations: Only effective for dense matrix math (DGEMM), not vector-heavy tasks
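
The first concern is easy to demonstrate with the slicing sketch above: once values become integer digits, no encoding remains for IEEE special cases. A small illustration, reusing the hypothetical ozaki_matmul from earlier:

```python
import numpy as np

# IEEE 754 requires that multiplying -0.0 by a positive number keep the sign.
print(np.signbit(-0.0 * 1.0))   # True: native FP64 preserves the signed zero

# The emulated path loses it: every integer digit of -0.0 is simply 0, so the
# reconstructed product comes back as +0.0.
neg_zero = np.array([[-0.0]])
one = np.array([[1.0]])
print(np.signbit(ozaki_matmul(neg_zero, one)[0, 0]))   # False: sign is gone

# Infinities and NaNs are worse: they poison the scaling step itself, so an
# emulator must detect and special-case them before slicing ever begins.
```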

Malaya notes that 60-70% of HPC workloads, computational fluid dynamics among them, rely on vector operations, where emulation offers no benefit; Rubin falls back to its slower native FP64 path on CUDA cores for those cases.
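
A rough back-of-envelope, using only the throughput figures quoted in this article plus the 36-product slicing from the sketch above, shows why the trade pays off only when tensor cores can absorb the extra work:

```python
# Illustrative arithmetic only; the teraFLOPS figures come from this article.
fp64_native_tflops = 33     # Rubin's native FP64 rate
fp64_emulated_tflops = 200  # Rubin's emulated FP64 matrix rate
slice_products = 36         # integer GEMMs per emulated FP64 GEMM (8 slices)

# Dense GEMM is compute-bound, so even after doing 36 cheap INT8 products the
# tensor cores still come out ahead of the native FP64 units:
print(f"{fp64_emulated_tflops / fp64_native_tflops:.1f}x")   # ~6.1x on paper

# Vector kernels (dot products, axpy) are memory-bandwidth-bound. Tensor cores
# offer no fast path there, and slicing would multiply memory traffic by the
# slice count instead, which is why Rubin falls back to native CUDA cores.
```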

Historical Context, Modern Execution

FP64 emulation dates to the 1950s, when machines computed floating point in software before hardware FPUs existed. Nvidia revived the technique after 2024 research showed tensor cores could exceed native FP64 speeds. Senior Director Dan Ernst defends the approach: "We have the hardware—why not use it? Accuracy matches dedicated cores."

Nvidia addresses the gaps with error-correction algorithms and argues most HPC applications don't require strict IEEE adherence. The memory overhead, Ernst concedes, is real but remains manageable even for multi-gigabyte matrices.

The Road Ahead

With Rubin-powered supercomputers launching soon, real-world testing looms. AMD explores similar emulation on MI355X GPUs but demands IEEE compliance for validation. Malaya advocates industry-wide benchmarking: "Build a basket of apps to see where this works."

Ultimately, FP64 emulation excels at specific matrix workloads but faces fundamental limits. As Ernst admits: "More FLOPS doesn't always mean useful FLOPS." For now, AMD bets on hardware, enhancing FP64 in next-gen MI430X chiplets while Nvidia pushes software innovation.
