Intel engineers have developed a Linux kernel patch that improves NVMe storage performance by up to 15% through CPU cluster-aware IRQ affinity handling, particularly benefiting high-core-count systems.
CPU Clustering Meets NVMe Storage: Solving the IRQ Bottleneck
Intel Xeon processors with NVMe SSDs – the test platform for the new kernel patch
Linux systems running on modern high-core-count processors often face storage bottlenecks that defy conventional tuning. Intel engineers have addressed one such limitation with a kernel patch that optimizes interrupt request (IRQ) distribution for NVMe storage devices. The core innovation lies in making the kernel's CPU grouping logic cluster-aware within each NUMA domain.
Here's the technical breakdown: When NVMe storage generates interrupts, the Linux kernel assigns them to CPU cores via IRQ affinity settings. On systems where the number of NVMe IRQs is less than the total CPU count (common in dense storage configurations), multiple cores must share IRQs. Performance suffers significantly when the IRQ-affined cores reside in different CPU clusters than the cores processing the storage I/O requests – a topology mismatch that increases memory latency.
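You can see both sides of this mismatch on a running system. The sketch below is a quick diagnostic only, assuming a controller named nvme0 whose queue interrupts appear as nvme0qN in /proc/interrupts; actual IRQ numbers and names will differ per machine.

    # How many NVMe IRQ vectors exist versus how many CPUs they must cover?
    grep -c nvme /proc/interrupts
    nproc

    # Pick one queue vector (e.g. nvme0q1) and see which CPUs it is allowed to fire on
    irq=$(grep nvme0q1 /proc/interrupts | awk '{print $1}' | tr -d ':')
    cat "/proc/irq/$irq/smp_affinity_list"

If the first number is smaller than the second, CPUs are sharing vectors, which is exactly the situation the patch addresses.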
Benchmark Results: Quantifying the Gain
On an Intel Xeon E-series server platform, the patch demonstrated a consistent 15% improvement in random read performance using the industry-standard FIO benchmark with the libaio engine. This was measured with 4K random reads across multiple NVMe drives – a common homelab and datacenter workload (a representative fio invocation is sketched after the list below). While Intel hasn't published full test matrices, the improvement specifically targets:
- Systems with CPU cluster architectures (like Intel's Core/Atom hybrid designs)
- NUMA-enabled multi-socket configurations
- Storage-heavy workloads where IRQ overhead dominates latency
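For context, the referenced workload can be approximated with a single fio invocation like the one below. The device path, queue depth, and job count are illustrative guesses; Intel's exact job parameters have not been published.

    # 4K random reads through libaio against one NVMe namespace (reads only, non-destructive)
    fio --name=randread-4k --filename=/dev/nvme0n1 --ioengine=libaio \
        --rw=randread --bs=4k --iodepth=32 --numjobs=8 --direct=1 \
        --runtime=60 --time_based --group_reporting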
Implementation and Availability
The 271-line patch modifies lib/group_cpus.c to prioritize CPU cluster locality when assigning IRQ affinity masks. As Intel engineer Wangyang Guo explains: "Grouping CPUs by cluster within each NUMA domain ensures better locality between CPUs and their assigned NVMe IRQs, eliminating unnecessary cross-cluster communication penalties."
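The patched grouping logic itself is kernel C, but the topology data it works from is visible in sysfs. As a rough userspace illustration only (not the kernel algorithm), the bash sketch below bins CPUs by their NUMA node and cluster_id, which is the granularity at which cluster-aware spreading assigns IRQ affinity masks. It assumes bash 4+ and a kernel recent enough (roughly 5.16 or newer) to expose topology/cluster_id.

    # Bin online CPUs by (NUMA node, cluster) to see the groups cluster-aware spreading uses
    declare -A groups
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        node=$(basename "$cpu"/node*)                 # NUMA node symlink, e.g. node0
        cluster=$(cat "$cpu/topology/cluster_id")     # exposed on ~5.16+ kernels
        groups["$node,cluster$cluster"]+="${cpu##*/} "
    done
    for key in "${!groups[@]}"; do
        echo "$key: ${groups[$key]}"
    done | sort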
Currently residing in Andrew Morton's mm-everything Git branch, the patch targets inclusion in an upcoming kernel release, expected around Linux 6.20 (or 7.0, depending on how the version numbering proceeds). Homelab builders should note these implementation details:
- Kernel Requirement: Linux 6.20+ once merged
- Tuning Benefit: Maximized on systems with more than 16 cores and multiple NVMe controllers (see the quick check after this list)
- Power Implications: Reduced cross-cluster communication may lower overall energy consumption
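A quick way to check whether a box sits in that sweet spot, assuming standard util-linux tools and one /sys/class/nvme entry per controller:

    # Core, socket, and NUMA node counts
    lscpu | grep -E 'Socket\(s\)|^CPU\(s\)|NUMA node\(s\)'

    # One entry per NVMe controller; systems with several controllers benefit most
    ls /sys/class/nvme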
Build Recommendations for Storage-Optimized Systems
- CPU Selection: When building NVMe-heavy systems, prioritize processors with unified L3 cache clusters (e.g., Intel's monolithic die designs) over chiplet architectures where cache coherency penalties are higher
- IRQ Monitoring: Use irqbalance --debug to visualize IRQ distribution across cores before and after applying the patch
- NUMA Alignment: Physically install NVMe drives in slots connected to the same CPU socket handling primary storage processes
- Kernel Testing: Once available, validate with fio runs comparing NUMA memory placement (fio's numa_mem_policy set to local versus prefer:<node>); example commands for these checks follow this list
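The sketch below strings those checks together. It is a minimal, hedged example: nvme0, /dev/nvme0n1, and NUMA node 0 are placeholders for your own hardware; stop the irqbalance service before running it in the foreground; and the fio options shown are the standard libnuma-backed numa_mem_policy settings, not anything introduced by the patch.

    # Which NUMA node is this NVMe controller attached to? (-1 means no locality info exposed)
    cat /sys/class/nvme/nvme0/device/numa_node

    # Watch IRQ placement decisions in the foreground (stop the irqbalance service first)
    irqbalance --debug

    # Re-run the earlier 4K random-read job twice, varying only the NUMA memory policy
    fio --name=numa-local  --filename=/dev/nvme0n1 --ioengine=libaio --rw=randread --bs=4k \
        --iodepth=32 --direct=1 --runtime=60 --time_based --numa_mem_policy=local
    fio --name=numa-prefer --filename=/dev/nvme0n1 --ioengine=libaio --rw=randread --bs=4k \
        --iodepth=32 --direct=1 --runtime=60 --time_based --numa_mem_policy=prefer:0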
This optimization exemplifies how low-level kernel improvements continue to unlock hardware potential. As core counts increase, such topology-aware scheduling becomes increasingly critical for storage performance – especially relevant for homelabs running database servers, VM clusters, or high-throughput storage arrays.