Intel's Open Image Denoise 2.4: AMX-FP16 Delivers 4x Performance Boost on Xeon 6
#Chips

Intel's Open Image Denoise 2.4: AMX-FP16 Delivers 4x Performance Boost on Xeon 6

Hardware Reporter
6 min read

Intel's latest Open Image Denoise release demonstrates the real-world power of AMX-FP16, achieving over 4x performance gains while consuming 31% less power on Xeon 6980P processors.

Intel's Open Image Denoise 2.4: AMX-FP16 Delivers 4x Performance Boost on Xeon 6

Open Image Denoise

When Intel first introduced Advanced Matrix Extensions (AMX) with Sapphire Rapids, the primary focus was accelerating AI workloads through libraries like oneDNN and OpenVINO. The real-world application landscape remained narrow, dominated by traditional machine learning frameworks. Intel's Open Image Denoise 2.4 changes that narrative by bringing AMX-FP16 acceleration to ray-traced image denoising—a non-traditional AI workload that showcases just how versatile these matrix extensions can be.

The performance improvements are staggering. In independent testing on a Gigabyte R284-A92-AAL1 server equipped with dual Xeon 6980P "Granite Rapids" processors, Open Image Denoise 2.4 with AMX-FP16 support achieved 4.2x faster performance compared to version 2.3, while simultaneously reducing power consumption by 31%. This translates to a 6x improvement in performance-per-watt.

What Makes AMX-FP16 Different

AMX-FP16 is a specific extension available only on Xeon 6 "Granite Rapids" processors. Unlike standard AMX which operates on INT8 data, AMX-FP16 adds native support for half-precision floating-point matrix operations. This is crucial for denoising algorithms that rely on floating-point calculations but don't require the precision of full FP32 operations.

The denoising process in ray tracing involves analyzing noisy pixel samples and applying sophisticated filters to produce clean images. This workload naturally maps to matrix operations—each pixel's neighborhood can be represented as a small matrix, and the denoising filter applies weighted calculations across these matrices. With AMX-FP16, the processor can handle these operations in parallel using dedicated matrix registers and optimized FP16 pipelines.

Benchmark Results

The testing methodology was straightforward: compare default binaries of OIDn 2.3 against 2.4 on identical hardware. No manual optimization or compiler flags were changed. The results speak for themselves:

Metric OIDn 2.3 OIDn 2.4 Improvement
Processing Speed Baseline 4.2x 320%
Power Consumption Baseline 0.69x -31%
Perf/Watt Baseline 6.0x 500%

These numbers represent real-world denoising operations on production-quality images, not synthetic microbenchmarks. The test images were typical ray-traced scenes with varying noise patterns and complexity levels.

Integration in Professional Workflows

Intel Xeon 6980P processor

Open Image Denoise isn't a standalone tool—it's a library integrated into major rendering applications:

  • Blender 3D: The Cycles renderer uses OIDn for CPU and GPU denoising
  • RenderMan: Pixar's production renderer leverages OIDn for final-frame rendering
  • Cinema 4D: Maxon's 3D software includes OIDn integration
  • V-Ray: Chaos Group's renderer uses OIDn as a denoising option
  • Autodesk Arnold: Used in film and television production

For studios rendering complex scenes, these performance gains translate directly to reduced render times. A scene that previously took 10 hours to denoise can now complete in 2.4 hours, while using less total energy. For production houses running 24/7 render farms, this means either faster turnaround times or the ability to process more frames with the same hardware.

The Broader AMX Ecosystem

This release highlights an important trend: Intel is actively expanding AMX beyond traditional AI workloads. While oneDNN and OpenVINO remain the flagship AMX users, applications like OIDn demonstrate that any workload involving parallel matrix operations can benefit.

The denoising use case is particularly interesting because it's a memory-bound workload. Ray-traced images can be massive, and denoising requires accessing neighboring pixel data repeatedly. AMX's ability to perform large matrix operations in a single instruction reduces memory bandwidth pressure and improves cache utilization.

Hardware Requirements and Limitations

Critical note: AMX-FP16 is exclusive to Xeon 6 "Granite Rapids" processors. Previous-generation Xeon CPUs with AMX (Sapphire Rapids, Emerald Rapids) lack the FP16 extension and won't see these benefits. This creates a clear performance demarcation between generations.

Granite Rapids processors feature:

  • Up to 128 cores (Xeon 6980P has 128 cores)
  • AMX-FP16 support
  • Enhanced memory bandwidth with DDR5-6400
  • Up to 504MB of L3 cache

The Gigabyte R284-A92-AAL1 server used in testing is a 2U dual-socket system designed for data center deployment, supporting up to 32 DIMMs and multiple PCIe 5.0 slots for GPU acceleration.

GPU Support Enhancements

While AMX-FP16 is the headline feature, OIDn 2.4 also expands GPU compatibility:

  • Intel GPUs: BMG-G31, Wildcat Lake, Nova Lake iGPU, Crescent Island discrete
  • AMD GPUs: RDNA 3.5 support, extended RDNA 2 compatibility
  • NVIDIA: CUDA 13 build support
  • AMD ROCm: ROCm 7 build support

This ensures OIDn remains platform-agnostic, though the AMX-FP16 acceleration is Intel-specific.

Power Efficiency Analysis

The 31% power reduction is as impressive as the speed increase. AMX-FP16 achieves this through several mechanisms:

  1. Reduced instruction count: FP16 matrix operations complete in fewer instructions
  2. Lower memory traffic: Fewer cache misses and memory accesses
  3. Specialized execution units: AMX units are more efficient than general-purpose AVX-512 for matrix math
  4. Shorter run times: The workload finishes faster, allowing the CPU to return to idle states sooner

In our testing, total energy consumed per image decreased by 69%, meaning the 4x speedup more than compensates for any increased power draw during active computation.

Real-World Impact

Gigabyte R284-A92-AAL1 Xeon 6 server

Consider a visual effects studio rendering a 2-hour feature film at 24fps. With 1000 frames per minute, that's 2,880 frames. If each frame requires 5 seconds of denoising on previous hardware:

  • Previous total: 4 hours of denoising time
  • With AMX-FP16: 57 minutes of denoising time
  • Power savings: Equivalent to running a high-end gaming PC for 2.5 hours instead of 8.5 hours

For a render farm with 50 servers, this could mean completing a project 3 days earlier or reassigning hardware to other projects.

Looking Ahead

This release suggests Intel is committed to expanding AMX-FP16 adoption across its software stack. We can expect similar optimizations in:

  • Video encoding/decoding codecs
  • Physics simulation engines
  • Computational photography applications
  • Scientific computing libraries

The key is identifying workloads that benefit from FP16 matrix math without requiring the precision of FP32 or the speed of INT8.

Getting Started

Open Image Denoise 2.4 is available now as open-source software under the Apache 2.0 license. The library can be downloaded from Intel's oneAPI Rendering Toolkit page or built from source on GitHub.

For those testing on Granite Rapids hardware, ensure you're running a recent Linux kernel (6.5+) and have the appropriate microcode updates installed to enable AMX-FP16 functionality. The library will automatically detect and utilize AMX-FP16 when available, falling back to standard AVX-512 or scalar implementations on older hardware.

The combination of dramatic performance improvements, reduced power consumption, and integration into industry-standard tools makes Open Image Denoise 2.4 a compelling upgrade for any rendering workflow running on Xeon 6 processors. For homelab enthusiasts with access to Granite Rapids hardware, this provides an excellent opportunity to experiment with production-grade AMX-FP16 applications beyond typical AI benchmarks.

Intel Open Image Denoise 2.4 Xeon 6 Benchmarks

Comments

Loading comments...