Linux's sched_ext Will Prioritize Idle SMT Siblings For Better Performance

Linux's sched_ext scheduler will now prioritize idle SMT siblings before other CPUs, potentially boosting CPU-bound workload performance by 2-3%.

A change to sched_ext, the Linux kernel's extensible scheduler class that allows scheduler implementations to be written as BPF programs, will begin prioritizing idle SMT siblings for better performance. The change is queued in the sched_ext development tree ahead of the upcoming Linux 7.1 kernel cycle and offers slightly better performance than the current behavior of just picking a CPU within the same last-level cache (LLC).

If there is an idle SMT sibling, sched_ext will now prefer it before checking for CPUs within the same LLC, followed by the same NUMA node and then any other idle CPU on the system. Andrea Righi of NVIDIA clocked the benefit of prioritizing idle SMT siblings at 2-3% for CPU-bound workloads. He explained in the queued patch making the change: "In the default built-in idle CPU selection policy, when @prev_cpu is busy and no fully idle core is available, try to place the task on its SMT sibling if that sibling is idle, before searching any other idle CPU in the same LLC. Migration to the sibling is cheap and keeps the task on the same core, preserving L1 cache and reducing wakeup latency. On large SMT systems this appears to consistently boost throughput by roughly 2-3% on CPU-bound workloads (running a number of tasks equal to the number of SMT cores)."
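To make the new search order concrete, here is a minimal sketch of it as a toy model in Python. This is not the kernel's code: the `Cpu` class, the `pick_idle_cpu` function, and the flat topology fields are all hypothetical names invented for illustration. It also assumes that a task stays on its previous CPU when that CPU is idle, and it collapses the search for a fully idle core into a single system-wide step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    """Hypothetical flat description of one logical CPU (illustration only)."""
    cpu_id: int
    core: int  # physical core ID; SMT siblings share this value
    llc: int   # last-level cache domain ID
    node: int  # NUMA node ID


def pick_idle_cpu(prev, cpus, idle):
    """Return the ID of an idle CPU for a task that last ran on `prev`.

    `idle` is the set of idle CPU IDs. Returns None if nothing is idle.
    """
    # Assumption: the previous CPU is kept if it is still idle.
    if prev.cpu_id in idle:
        return prev.cpu_id

    def core_fully_idle(cpu):
        # A core is fully idle when every hardware thread on it is idle.
        return all(s.cpu_id in idle for s in cpus if s.core == cpu.core)

    # Search steps in priority order; the second step is the new behavior.
    steps = [
        core_fully_idle,                    # a fully idle core (simplified)
        lambda c: c.core == prev.core,      # idle SMT sibling of prev
        lambda c: c.llc == prev.llc,        # any idle CPU in the same LLC
        lambda c: c.node == prev.node,      # any idle CPU on the same NUMA node
        lambda c: True,                     # any idle CPU on the system
    ]
    for matches in steps:
        for cpu in cpus:
            if cpu.cpu_id in idle and matches(cpu):
                return cpu.cpu_id
    return None


# Toy machine: 8 logical CPUs, 2 threads per core, 4 CPUs per LLC, 1 node.
cpus = [Cpu(i, core=i // 2, llc=i // 4, node=0) for i in range(8)]

# CPU 3 is busy, no core is fully idle, and CPUs 0 and 2 are idle.
# CPU 2 is the SMT sibling of CPU 3, so it wins over CPU 0 in the same LLC.
print(pick_idle_cpu(cpus[3], cpus, idle={0, 2}))  # prints 2
```

Without the sibling step, the same call would fall through to the LLC step and return CPU 0, moving the task to a different core.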

With the patch in sched_ext.git's "for-next" Git branch, the change should land during next month's Linux 7.1 merge window.

This optimization targets systems with Simultaneous Multi-Threading (SMT) enabled, which is common on modern x86 processors from Intel and AMD. It relies on the fact that SMT siblings are hardware threads of the same physical core, so they share that core's L1 cache and execution units.
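For readers who want to see which logical CPUs are SMT siblings on their own machine, Linux exposes this through sysfs in each CPU's `topology/thread_siblings_list` file. The following is a minimal sketch for reading it; the helper names `parse_cpu_list` and `smt_siblings` are made up for this example, and the parser assumes the usual comma-and-range list format such as `0,64` or `0-1`.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a CPU list string such as "0,64" or "0-1" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_siblings(cpu):
    """Return the other hardware threads sharing a core with `cpu`.

    Reads sysfs, so this only works on a Linux system.
    """
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return [c for c in parse_cpu_list(path.read_text()) if c != cpu]
```

On a system with SMT disabled or unsupported, `smt_siblings(0)` would be expected to return an empty list, since CPU 0 is then the only thread on its core.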

By prioritizing idle SMT siblings, the scheduler keeps better cache locality and, per the patch description, reduces wakeup latency. When a task wakes up, its previous CPU is busy, and no fully idle core is available, the scheduler will now first check whether the sibling thread on the same core is idle before looking elsewhere. This approach minimizes the cost of migration while still finding available execution resources.

The 2-3% performance gain might seem modest, but for CPU-bound workloads running at scale, this can translate to meaningful improvements in throughput and efficiency. This is particularly relevant for data centers and cloud environments where CPU utilization is often pushed to its limits.

This change also highlights the ongoing evolution of Linux's scheduler infrastructure. The sched_ext framework allows for custom scheduler implementations via BPF programs, enabling more specialized scheduling policies without modifying the core kernel scheduler. This flexibility is crucial as workloads become more diverse and specialized scheduling needs emerge.

The change arrives just ahead of the Linux 7.1 merge window. Being queued in the sched_ext "for-next" branch means it is slated for submission to the mainline kernel during that window, though it has not been merged there yet.

For system administrators and developers running CPU-intensive workloads on Linux systems with SMT enabled, this change should provide a free performance boost once Linux 7.1 is released. The improvement requires no configuration changes and applies to sched_ext schedulers that rely on the default built-in idle CPU selection policy.
