WQ_AFFN_CACHE_SHARD Merged For Linux 7.1: Significant Win For CPUs With Many Cores Per LLC


Hardware Reporter

Linux 7.1 introduces WQ_AFFN_CACHE_SHARD, a workqueue affinity optimization that dramatically improves performance on high-core-count CPUs by reducing L3 cache contention.

The Linux 7.1 kernel has merged a significant optimization that will benefit modern high-end processors with many CPU cores sharing the same last-level cache (LLC). The new WQ_AFFN_CACHE_SHARD affinity scope addresses a critical bottleneck in the default workqueue behavior that has been hurting performance on systems with numerous cores per L3 cache.


The Problem With Default Workqueue Behavior

The issue stems from how Linux handles workqueues on systems with many cores sharing a single L3 cache. The default unbound workqueue with WQ_AFFN_CACHE creates just one pool for the entire system, which can lead to severe contention and degraded I/O performance. This problem becomes particularly pronounced on today's high-core-count processors from Intel, AMD, and Arm.

Even on relatively modest hardware, the impact is measurable. Oracle engineer Chuck Lever discovered that on a 12-core system with a single shared L3 cache, running NFS-over-RDMA with 12 FIO jobs resulted in approximately 39% of CPU cycles being spent in a spinlock slow path due to the default workqueue behavior.

The Solution: WQ_AFFN_CACHE_SHARD

Meta engineer Breno Leitao developed the patch series introducing WQ_AFFN_CACHE_SHARD as an intermediate affinity level. The new scope subdivides each LLC into groups of at most wq_cache_shard_size CPUs; the default shard size is eight, configurable at boot time.

This approach effectively reduces contention by creating multiple workqueue pools within a single L3 cache, rather than having all cores compete for the same pool. The result is a more efficient distribution of work across cores sharing the same cache.
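The grouping logic can be illustrated with a short sketch. This is illustrative Python, not the kernel's actual C implementation; only the shard size of eight and the 72-core single-LLC Grace configuration come from the article, everything else is made up for the example:

```python
def shard_pools(cpus_per_llc: int, shard_size: int = 8) -> list[list[int]]:
    """Split the CPUs of one LLC into shards of at most `shard_size` CPUs.

    Under WQ_AFFN_CACHE_SHARD, each shard would get its own unbound-workqueue
    worker pool, instead of one pool spanning the whole LLC.
    """
    return [list(range(start, min(start + shard_size, cpus_per_llc)))
            for start in range(0, cpus_per_llc, shard_size)]

# A 72-core single-LLC part (e.g. NVIDIA Grace) goes from one shared pool
# under WQ_AFFN_CACHE to nine pools of eight CPUs each.
pools = shard_pools(72)
print(len(pools))    # 9
print(pools[0])      # [0, 1, 2, 3, 4, 5, 6, 7]
```

With the default cache scope, all 72 cores would contend for a single pool's locks; sharding caps that contention at eight CPUs per pool.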

Performance Gains Across Multiple Platforms

Benchmark results demonstrate the substantial improvements this change brings:

  • NVIDIA Grace CPU (72 cores, single LLC): The cache_shard scope delivers approximately 5x the throughput and 6.5x lower p50 latency compared to the previous cache scope
  • Xeon D server (16 cores): Up to 5.9% improvement in FIO random reads from NVMe storage
  • Intel Xeon processors: Notable throughput gains observed

These improvements are particularly significant for data center workloads and high-performance computing scenarios where I/O performance is critical.

Implementation Details

With Linux 7.1, WQ_AFFN_CACHE_SHARD becomes the default affinity scope for unbound workqueues. The wq_cache_shard_size parameter lets system administrators tune the number of CPUs per shard to match their hardware topology and workload patterns.
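In practice, tuning would happen on the kernel command line. The parameter name wq_cache_shard_size comes from the article, but the `workqueue.` module prefix and the runtime sysfs override below are assumptions based on how existing workqueue options and the established affinity_scope sysfs attribute work on recent kernels, so treat this fragment as a sketch rather than documented syntax:

```
# Kernel command line: shrink shards from the default 8 to 4 CPUs per pool
workqueue.wq_cache_shard_size=4

# Hypothetical runtime override for one WQ_SYSFS workqueue
# (the "cache_shard" scope name here is an assumption):
echo cache_shard > /sys/devices/virtual/workqueue/writeback/affinity_scope
```

Smaller shards mean more pools and less lock contention per pool, at the cost of less flexibility in where work can run, so the right value depends on the workload.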

This optimization represents a thoughtful evolution in how Linux handles work distribution on modern multi-core processors, acknowledging that the traditional approach of a single workqueue per cache is no longer optimal for today's hardware.

Why This Matters For Many-Core CPUs

The workqueue changes submitted for Linux 7.1 mark one of the most significant performance optimizations in this kernel release, particularly for systems with high core counts and shared L3 caches. As processors continue to pack more cores into each cache domain, such refinements become increasingly important for maintaining optimal system performance.

For system administrators and developers working with high-core-count servers, this change should provide immediate performance benefits without requiring any application-level modifications. The kernel now handles work distribution more intelligently by default, reducing contention and improving overall system responsiveness.
