Split locks on x86-64 processors create severe performance bottlenecks by forcing expensive bus lock operations when atomic operations cross cache line boundaries. This investigation reveals how different CPU architectures handle these operations and the surprising impact on real-world applications.
Split locks represent one of those subtle architectural corner cases that can devastate performance in ways most developers never anticipate. When an atomic operation spans two cache lines, modern x86-64 processors fall back to a "bus lock" mechanism that can bring system performance to its knees. But what exactly happens under the hood, and how bad is this problem really?
The Nature of Split Locks
Atomic operations are fundamental to multithreaded programming, letting a thread perform a read-modify-write sequence that other threads observe as indivisible. Whether it's a test-and-set for acquiring locks or an atomic increment for shared counters, these operations rely on cache coherency protocols to maintain consistency across cores.
However, Intel and AMD architectures lack the ability to lock two cache lines simultaneously. When an atomic operation targets memory spanning cache line boundaries, the processor must resort to a bus lock - a much more expensive operation that historically meant preventing all other CPUs from accessing memory.
Testing Methodology and Hardware
The investigation employed a core-to-core latency test using _InterlockedCompareExchange64, which compiles to lock cmpxchg on x86-64. By carefully positioning the target memory to straddle a cache line boundary, the test could measure the performance impact of split locks across different CPU architectures.
Hardware tested included:
- Intel Core Ultra 9 285K (Arrow Lake)
- AMD Ryzen 9 9900X (Zen 5)
- Intel Core i7-1265U (Alder Lake)
- AMD Ryzen 9 3950X (Zen 2)
- Intel Core i5-6600K (Skylake)
- AMD FX-8350 (Piledriver)
- Intel Celeron J4125 (Goldmont Plus)
Architecture-Specific Performance Patterns
Intel Arrow Lake (Core Ultra 9 285K)
Arrow Lake exhibits particularly poor split lock performance, with latencies reaching 7 microseconds - an eternity in CPU terms. Interestingly, split locks only penalize accesses that miss the L2 cache; L1 and L2 hits are unaffected. This suggests the bus lock mechanism operates at the L2 cache level, where multiple cores share access.
Performance degradation is severe for workloads generating cache miss traffic. Geekbench 6's photo filter workload, which creates substantial cache miss patterns, suffers heavily under split lock contention.
AMD Zen 5 (Ryzen 9 9900X)
Zen 5 shows better split lock latency than Arrow Lake at around 500 nanoseconds, but this still represents a significant performance penalty. The architecture suffers a devastating 10x regression in L2 and L3 performance when split locks are active.
Both Geekbench 6 workloads tested show heavy performance regressions, with even L1D misses becoming extremely costly under split lock contention.
Intel Alder Lake (Core i7-1265U)
Surprisingly, Alder Lake's split lock performance is even worse than Arrow Lake's, with P-Cores suffering particularly poor latency. However, Alder Lake demonstrates excellent isolation - other applications barely notice the performance impact.
This architectural choice prioritizes consistency over raw performance, insulating unrelated workloads from split lock contention effects.
AMD Piledriver (FX-8350)
Remarkably, AMD's older Piledriver architecture delivers the best split lock performance among all tested hardware. With latencies only 2-3x higher than intra-cacheline locks, Piledriver avoids the microsecond-scale penalties seen on newer architectures.
Even more impressively, split locks don't affect cache hits at all - including the shared L3 cache. This suggests Piledriver may handle split locks entirely within its cache coherency protocol without falling back to a traditional bus lock mechanism.
The Bus Lock Mystery
The term "bus lock" is increasingly anachronistic on modern processors. Contemporary CPUs use sophisticated non-blocking, distributed interconnect topologies rather than shared buses. Yet the terminology persists in documentation.
Intel's implementation likely operates at the IDI (in-die interconnect) level, affecting communication between cores and the uncore. AMD's behavior is more puzzling - Zen 2 and Zen 5 show L2 cache impacts, suggesting split locks may propagate to the Infinity Fabric layer.
Piledriver's immunity to split lock effects hints at a fundamentally different approach, possibly leveraging its cache coherency protocol to allow unrelated accesses to proceed when they hit cache.
Linux Mitigation Strategies
Linux implements split lock mitigation by trapping split locks and introducing millisecond-level delays. This approach aims to make split locks "annoying" while providing better quality of service to other applications.
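The behavior is controlled by the split_lock_detect kernel boot parameter. The values below are taken from the kernel's admin-guide documentation (comments paraphrase that documentation):

```
split_lock_detect=off          # no detection; raw hardware behavior
split_lock_detect=warn         # default: log the offending process and delay it
split_lock_detect=fatal        # send SIGBUS to the offending process
split_lock_detect=ratelimit:N  # limit the system to N bus locks per second
```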
The default mitigation strategy makes sense for multi-user or server environments where consistency is paramount. However, for consumer systems, this represents an overreaction to a problem that rarely manifests in real-world usage.
Games have reportedly used split locks for years without creating issues, even on affected architectures. The performance penalty from Linux's mitigation can be more severe than the original problem.
Real-World Implications
Split locks don't block other cores from executing code - they're not equivalent to Python's global interpreter lock. On modern CPUs, they only introduce penalties when memory accesses miss certain cache levels, and the severity varies dramatically by architecture.
For programmers, the lesson is clear: avoid split locks when possible. They are slow in themselves, and they impose far heavier penalties on other applications than intra-cacheline locks do.
For hardware designers, there's clearly room for optimization. Piledriver demonstrates that better split lock handling is possible, even on older architectures.
The Bigger Picture
This investigation reveals how architectural decisions made years ago continue to impact modern computing. The persistence of "bus lock" terminology obscures the reality that modern processors handle these operations in fundamentally different ways.
As CPUs become increasingly complex with hybrid architectures, cache hierarchies, and sophisticated interconnects, corner cases like split locks become both more important and harder to understand. The variation in behavior across architectures - from Piledriver's elegant handling to Arrow Lake's severe penalties - demonstrates that there's no one-size-fits-all solution.
Moving forward, both hardware and software developers need to take a measured, data-driven approach to these issues. Introducing new performance problems while solving old ones represents a step backward, especially in consumer computing where ease of use is paramount.
The split lock saga serves as a reminder that in computer architecture, the devil truly is in the details - and sometimes those details can have performance impacts measured in orders of magnitude.