Restartable Sequences: The Hidden Powerhouse Revolutionizing High-Performance Computing

As multi-core processors become commonplace, restartable sequences (rseq) emerge as a Linux kernel feature that could fundamentally change how we approach thread-safe programming, with reported speedups of up to 43x in certain benchmarks.

In the ever-evolving landscape of system programming, a quietly powerful technology has been flying under the radar since its introduction in Linux 4.18 back in 2018. Restartable sequences, or rseq, represent a paradigm shift in how we think about thread safety, potentially rendering traditional locks and atomics obsolete for high-performance scenarios.

The Emergence of rseq

Restartable sequences allow developers to create thread-safe data structures without traditional synchronization mechanisms, scaling efficiently to microprocessors with dozens or even hundreds of cores. Currently, this functionality remains largely confined to Linux systems and requires handwritten assembly code, limiting its adoption to specialized applications like tcmalloc, jemalloc, glibc, and the Cosmopolitan C library.

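The core concept elegantly addresses the fundamental challenge of thread synchronization: what happens when the kernel preempts a thread during a critical section? With rseq, developers mark a short sequence of assembly instructions as restartable. The kernel does not prevent preemption. Instead, if the thread is preempted, migrated to another CPU, or interrupted by a signal while inside the sequence, the kernel redirects it to an abort handler, which typically restarts the operation from the beginning. A sequence that reaches its final commit instruction without interruption takes effect as if it were atomic with respect to other threads on the same CPU.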

Performance Benchmarks That Turn Heads

The most compelling aspect of rseq is the dramatic performance improvements it delivers in multi-core environments. Justine Tunney, a prominent system programmer, has demonstrated impressive results:

On a Raspberry Pi 5 (4 cores): rseq made a malloc() implementation 3x faster compared to thread-local dlmalloc mspace
On a System76 Thelio Astra with Ampere's 128-core CPU: 34x faster than mutex-based sharding
On an AMD Threadripper Pro 7995WX with 96 cores: 43x faster than traditional approaches

These numbers aren't just incremental improvements. On the high-core-count machines they exceed an order of magnitude, which could transform how we approach high-performance computing. As Tunney notes, "Using restartable sequences turned my 3GHz CPU into a 33GHz CPU. Using mutexes turned my 3GHz CPU into a 219MHz CPU."

Technical Deep Dive: How rseq Works

At its core, rseq operates through a clever mechanism involving Thread Local Storage (TLS) and kernel cooperation. When a thread is created, the C library allocates 32 bytes of TLS memory and registers it with the kernel through the rseq system call. The kernel then updates that memory with the current CPU number whenever the thread is rescheduled.
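To make the 32-byte area concrete, here is a minimal sketch of its layout in C. The struct and helper names (rseq_area_sketch, sketch_current_cpu) are hypothetical and exist only for illustration. The field layout is modeled on the original Linux 4.18 definition, and real programs should use the definitions in <linux/rseq.h> rather than this copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the per-thread rseq area (assumption: this follows
   the original Linux 4.18 layout; newer kernels append optional fields). */
struct rseq_area_sketch {
    uint32_t cpu_id_start; /* CPU number, intended to always be a usable index */
    uint32_t cpu_id;       /* CPU number, or a sentinel while unregistered */
    uint64_t rseq_cs;      /* address of the active sequence descriptor, or 0 */
    uint32_t flags;
} __attribute__((aligned(32)));

/* The fields occupy 20 bytes; the alignment pads the area to 32 bytes. */
_Static_assert(sizeof(struct rseq_area_sketch) == 32, "area must be 32 bytes");
_Static_assert(offsetof(struct rseq_area_sketch, rseq_cs) == 8, "rseq_cs at 8");

/* Hypothetical helper: read the CPU number the kernel last stored.
   The pointer is volatile because the kernel may rewrite the field
   whenever the thread is rescheduled. */
static inline uint32_t
sketch_current_cpu(const volatile struct rseq_area_sketch *area)
{
    assert(area != NULL);
    return area->cpu_id_start;
}
```

Reading the CPU number this way is a plain memory load with no system call, which is what makes per-CPU indexing cheap enough for a hot path.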

The magic happens through a two-way communication system:

The kernel updates TLS with the current CPU ID
The thread can specify a sequence of instructions via the rseq_cs field in TLS

When the kernel preempts a thread or delivers a signal to it, it checks, before returning to user space, whether the saved program counter lies within the registered instruction sequence. If so, it forces the thread to jump to an abort handler, which can restart the operation or handle it appropriately.
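The program-counter check can be modeled in a few lines of C. This is a user-space sketch of the decision, not kernel code: rseq_cs_sketch mirrors the descriptor that the rseq_cs field points to, and sketch_resume_ip is a hypothetical function that returns the address at which the thread would resume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the critical-section descriptor
   (modeled on struct rseq_cs in <linux/rseq.h>). */
struct rseq_cs_sketch {
    uint32_t version;            /* expected to be 0 */
    uint32_t flags;
    uint64_t start_ip;           /* address of the first instruction */
    uint64_t post_commit_offset; /* length of the sequence in bytes */
    uint64_t abort_ip;           /* address of the abort handler */
} __attribute__((aligned(32)));

/* Hypothetical model: given the interrupted program counter and the
   descriptor the thread registered (NULL if none), return where the
   thread resumes. */
static uint64_t
sketch_resume_ip(uint64_t pc, const struct rseq_cs_sketch *cs)
{
    assert(cs == NULL || cs->version == 0);
    if (cs == NULL)
        return pc; /* no sequence active: resume normally */
    /* Unsigned subtraction wraps when pc < start_ip, so this single
       comparison tests start_ip <= pc < start_ip + post_commit_offset. */
    if (pc - cs->start_ip < cs->post_commit_offset)
        return cs->abort_ip; /* interrupted mid-sequence: abort */
    return pc; /* outside the sequence, or already past the commit */
}
```

Note that the instruction at start_ip + post_commit_offset is already outside the range: once the commit instruction has executed, there is nothing left to roll back.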

This approach effectively creates "tiny database transactions" at the assembly level, allowing developers to build lock-free data structures that maintain consistency without traditional synchronization overhead.

Practical Applications and Examples

Tunney's article provides concrete implementations, including a high-performance hit counter and a lock-free linked list. These examples demonstrate how rseq can be applied to real-world problems:

Fast Hit Counter

Comparing several approaches to incrementing a counter across multiple threads (six measurements in total, including two mutex implementations) reveals rseq's dominance:

Mutex-based (glibc): 30,739k ops/sec
Mutex-based (Cosmopolitan): 82,009k ops/sec
Atomic operations: 440k ops/sec
Sharding: 1,652,324k ops/sec
rseq: 174,545,455k ops/sec
CPU affinity: 161,000,000k ops/sec
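For readers unfamiliar with the sharding idea behind the last three rows, here is a minimal sketch of a per-CPU sharded counter in portable C11. It is not the benchmarked code. All names (sketch_shard, sketch_hit, sketch_total) and the constants are assumptions, and it uses a relaxed atomic add where an rseq version would use a plain increment inside a restartable sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_SHARDS 256    /* assumed upper bound on CPU count */
#define SKETCH_CACHE_LINE 64 /* assumed cache-line size in bytes */

/* One counter per shard, each on its own cache line so that threads on
   different CPUs do not contend for the same line (false sharing). */
struct sketch_shard {
    _Alignas(SKETCH_CACHE_LINE) atomic_uint_fast64_t hits;
};

_Static_assert(sizeof(struct sketch_shard) == SKETCH_CACHE_LINE,
               "each shard should fill exactly one cache line");

static struct sketch_shard sketch_shards[SKETCH_SHARDS];

/* Record one hit. `cpu` is the caller's current CPU number, for example
   as read from the rseq area. The atomic add keeps the count correct even
   if the thread migrated after reading `cpu`; rseq would instead abort and
   retry a non-atomic increment. */
static void sketch_hit(unsigned cpu)
{
    assert(SKETCH_SHARDS > 0);
    atomic_fetch_add_explicit(&sketch_shards[cpu % SKETCH_SHARDS].hits, 1,
                              memory_order_relaxed);
}

/* Sum all shards. The result is a snapshot, not a linearizable total,
   if other threads are still counting. */
static uint64_t sketch_total(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < SKETCH_SHARDS; i++)
        sum += atomic_load_explicit(&sketch_shards[i].hits,
                                    memory_order_relaxed);
    return sum;
}
```

The design choice to illustrate is that writes stay local to a shard while reads pay the cost of summing every shard, which suits counters that are incremented far more often than they are read.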

Lock-Free Linked List

Tunney's article provides a complete implementation of push/pop operations for a sharded linked list, showing how rseq enables thread-safe operations without locks or atomics. The implementation carefully handles CPU affinity changes and ensures memory alignment to prevent false sharing.
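The overall shape of such a sharded list can be sketched in portable C11. This is not the rseq implementation described above: it guards each shard with a tiny spinlock as a stand-in, which is the kind of fallback strategy a portable library would need on systems without rseq. All names here (sketch_node, sketch_push, sketch_pop) and the constants are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_LIST_SHARDS 256 /* assumed upper bound on CPU count */
#define SKETCH_LINE 64         /* assumed cache-line size in bytes */

struct sketch_node {
    struct sketch_node *next;
    int value;
};

/* One list head per shard, aligned so shards never share a cache line. */
struct sketch_list_shard {
    _Alignas(SKETCH_LINE) atomic_bool busy; /* spinlock stand-in for rseq */
    struct sketch_node *head;
};

static struct sketch_list_shard sketch_lists[SKETCH_LIST_SHARDS];

static void shard_lock(struct sketch_list_shard *s)
{
    while (atomic_exchange_explicit(&s->busy, true, memory_order_acquire)) {
        /* spin: contention is rare when each CPU mostly uses its own shard */
    }
}

static void shard_unlock(struct sketch_list_shard *s)
{
    atomic_store_explicit(&s->busy, false, memory_order_release);
}

/* Push onto the shard for `cpu` (the caller's current CPU number). */
static void sketch_push(unsigned cpu, struct sketch_node *n)
{
    assert(n != NULL);
    struct sketch_list_shard *s = &sketch_lists[cpu % SKETCH_LIST_SHARDS];
    shard_lock(s);
    n->next = s->head;
    s->head = n;
    shard_unlock(s);
}

/* Pop the most recently pushed node from the shard, or NULL if empty. */
static struct sketch_node *sketch_pop(unsigned cpu)
{
    struct sketch_list_shard *s = &sketch_lists[cpu % SKETCH_LIST_SHARDS];
    shard_lock(s);
    struct sketch_node *n = s->head;
    if (n != NULL)
        s->head = n->next;
    shard_unlock(s);
    return n;
}
```

Because the lock protects the shard rather than the CPU, this version stays correct if a thread migrates between reading its CPU number and touching the list. It simply operates on a shard that is no longer local. An rseq version gets the same safety from the kernel aborting the sequence on migration, without paying for the atomic exchange.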

Adoption Challenges and Counter-Perspectives

Despite its impressive potential, rseq faces several adoption barriers:

Platform Limitations: Currently restricted to Linux systems, with no equivalent support in Windows, macOS, or other operating systems.
Implementation Complexity: Requires handwritten assembly code, significantly raising the barrier to entry for most developers.
Portability Concerns: As Tunney notes, "if you're building a library or something that's open source, you're going to need to support other strategies too."
Specialized Hardware Requirements: The most dramatic performance improvements are visible on high-core-count systems that aren't yet commonplace in typical development environments.

Some developers argue that traditional approaches remain more practical for most applications. "For most developers, that's a take it or leave it kind of improvement," Tunney acknowledges, suggesting that the benefits may not justify the complexity for many use cases.

The Future of rseq

Looking ahead, several trends suggest rseq could become more mainstream:

Hardware Evolution: With 128-core and even 192-core processors becoming more affordable, the performance benefits of rseq will become increasingly relevant.
Language Integration: Tunney predicts that "all system programming languages will be redesigned to be able to express restartable sequences, similar to how C11 introduced compiler APIs for atomics."
Library Adoption: As more high-performance libraries adopt rseq, pressure will grow for broader OS support.
Tooling Improvements: Better development tools and abstractions could lower the barrier to entry for implementing rseq.

Community Sentiment and Expert Opinions

The system programming community has shown cautious interest in rseq. While the performance numbers are undeniable, many developers remain skeptical about the practicality of widespread adoption. Some view rseq as a specialized solution for extreme performance scenarios rather than a general-purpose replacement for traditional synchronization.

Others argue that the complexity of rseq implementation outweighs its benefits for most applications. "I'm afraid LLMs are not yet smart enough to help you build restartable sequences," Tunney notes, highlighting the significant expertise required to implement this technology effectively.

Conclusion

Restartable sequences represent a fascinating development in system programming, offering a glimpse into how we might approach thread safety in an era of massively parallel hardware. While current adoption remains limited to specialized applications and high-performance computing, the dramatic performance improvements suggest that rseq will play an increasingly important role as multi-core processors become ubiquitous.

The technology's future will likely depend on broader operating system support, better language integration, and the development of higher-level abstractions that make it accessible to more developers. Until then, rseq will remain a powerful tool in the arsenal of system programmers pushing the boundaries of performance on modern hardware.

For those interested in exploring rseq further, the Linux kernel documentation provides technical specifications, while projects like Cosmopolitan Libc offer practical implementations. As the hardware landscape continues to evolve, technologies like rseq may well become essential components of the system programmer's toolkit.