Tail-Call Optimization Delivers Major CPython Speed Boosts on Windows and macOS

New benchmarks show that tail-call interpreter threading in CPython yields geometric-mean speedups of 15% on Windows with MSVC and 5% on macOS ARM. The optimization sidesteps compiler limitations in very large interpreter loops by letting the compiler inline critical functions again.

In a reversal of earlier performance claims, CPython developers have validated substantial speed improvements from tail-call interpreter threading, particularly on Windows. After earlier benchmark results were retracted because of a compiler bug, new data shows the tail-calling interpreter outperforming traditional dispatch methods by a geometric mean of 15% on Windows/MSVC and 5% on macOS ARM, with individual benchmarks gaining more.

The Interpreter Dispatch Evolution

CPython historically used two dispatch techniques:

Switch-case: Jumps between instructions via switch statements
Computed gotos: Uses GCC/Clang's labels-as-values for direct jumps
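
To make the contrast concrete, here is a minimal sketch of both techniques on a toy accumulator machine. The three opcodes, the function names, and the sample program are hypothetical and far simpler than CPython's real interpreter loop; the computed-goto variant assumes a compiler with the GCC/Clang labels-as-values extension and falls back to the switch version elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-instruction bytecode, not CPython's opcode set. */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Sample program: ((0 + 1) * 2) + 1 */
static const unsigned char sample_prog[] = { OP_INC, OP_DOUBLE, OP_INC, OP_HALT };

/* Switch-case dispatch: every instruction returns to the top of the loop,
   so all dispatches share one indirect branch. */
static int run_switch(const unsigned char *code) {
    assert(code != NULL);
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        default:        return -1;   /* unknown opcode */
        }
    }
}

/* Computed-goto dispatch: each handler jumps straight to the next handler
   through a table of label addresses (&&label, goto *ptr). */
static int run_goto(const unsigned char *code) {
    assert(code != NULL);
#if defined(__GNUC__)
    /* No bounds check on the opcode: this is a sketch only. */
    static void *const targets[] = { &&do_inc, &&do_double, &&do_halt };
    int acc = 0;
#define DISPATCH() goto *targets[*code++]
    DISPATCH();
do_inc:    acc += 1; DISPATCH();
do_double: acc *= 2; DISPATCH();
do_halt:   return acc;
#undef DISPATCH
#else
    return run_switch(code);         /* extension unavailable */
#endif
}
```

Both run_switch(sample_prog) and run_goto(sample_prog) return 3. The difference lies only in how control moves from one instruction to the next.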

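Tail-call threading, where each bytecode handler is its own function and ends by tail-calling the next handler, was long considered impractical in C. Compilers treated tail-call optimization (TCO) as optional, so a handler chain that was not optimized would grow the stack with every instruction. Modern compiler advancements changed this: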

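// Tail-call pattern (simplified; OP_NAME and next_op are placeholders)
void OP_NAME(PyThreadState *tstate) {
    // Handler logic for this opcode
    // musttail forces a jump instead of a call, so next_op must have
    // the same signature as this handler
    __attribute__((musttail)) return next_op(tstate);
}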

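Clang's __attribute__((musttail)) (popularized by Josh Haberman's protobuf work) and experimental MSVC support now guarantee TCO: the compiler must turn the call into a jump or reject the code. A chain of handlers therefore reuses one stack frame, which removes the stack overflow risk.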
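
The pattern can be sketched as a complete toy interpreter. Everything here is an illustrative assumption rather than CPython's actual code: the opcodes, the handler names, the (pc, acc) handler signature, and the MUSTTAIL and DISPATCH macros. MUSTTAIL expands to the attribute only when the compiler reports support for it; elsewhere the tail call is merely best-effort.

```c
#include <assert.h>
#include <stddef.h>

/* Guaranteed tail call where supported, otherwise an ordinary return. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

/* Hypothetical three-instruction bytecode, not CPython's opcode set. */
enum { OP_INC, OP_DOUBLE, OP_HALT, OP_COUNT };

/* Every handler shares one signature, which musttail requires. */
typedef int (*handler_fn)(const unsigned char *pc, int acc);

static int op_inc(const unsigned char *pc, int acc);
static int op_double(const unsigned char *pc, int acc);
static int op_halt(const unsigned char *pc, int acc);

/* Opcode-indexed handler table. No bounds check: this is a sketch only. */
static const handler_fn handlers[OP_COUNT] = { op_inc, op_double, op_halt };

/* Fetch the next opcode and tail-call its handler. */
#define DISPATCH(pc, acc) MUSTTAIL return handlers[*(pc)]((pc) + 1, (acc))

static int op_inc(const unsigned char *pc, int acc)    { DISPATCH(pc, acc + 1); }
static int op_double(const unsigned char *pc, int acc) { DISPATCH(pc, acc * 2); }
static int op_halt(const unsigned char *pc, int acc)   { (void)pc; return acc; }

/* Entry point: an ordinary call into the first handler. */
static int run_tail(const unsigned char *code) {
    assert(code != NULL);
    return handlers[*code](code + 1, 0);
}
```

For the program { OP_INC, OP_DOUBLE, OP_INC, OP_HALT }, run_tail returns 3. Each handler is a small standalone function, which is the property the rest of the article relies on.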

Performance Breakthrough

Updated pyperformance benchmarks reveal:

5% geomean speedup on macOS ARM (Xcode Clang)
15% geomean speedup on Windows x86-64 (Visual Studio 2026 MSVC)
Individual Windows benchmarks improved by up to 78%
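
For readers unfamiliar with the term, a geomean speedup over n benchmarks is conventionally the geometric mean of the per-benchmark time ratios. The symbols below are introduced here for illustration and do not come from the source:

```latex
\[
\text{speedup}_{\text{geomean}}
  = \left( \prod_{i=1}^{n} \frac{t_i^{\text{baseline}}}{t_i^{\text{tail-call}}} \right)^{1/n}
\]
```

Under this convention a value of 1.15 corresponds to the 15% figure, and a single benchmark with a large gain shifts the result less than it would shift an arithmetic mean.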

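"The tail-calling interpreter resets compiler heuristics to sane levels," explains the author of the source post. The traditional interpreter loop is a single function of roughly 12,000 lines, large enough that compilers stop applying optimizations such as inlining inside it. Tail-calling splits this monolith into one small function per bytecode handler, which lets the compiler inline critical helpers again: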

; Switch-case (no inlining)
call PyStackRef_CLOSE_SPECIALIZED

; Tail-call (inlined instructions)
add rcx, rax
jo  overflow_handler
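
A C-level sketch shows the kind of helper this affects. The function names are hypothetical and this is not CPython source; it assumes the GCC/Clang __builtin_add_overflow builtin. When a small helper like checked_add is inlined into its caller, x86-64 compilers typically reduce it to an add followed by a jump on overflow instead of a function call. Whether inlining actually happens depends on the optimization level and the compiler's heuristics.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: stores a + b in *out and returns true on overflow. */
static inline bool checked_add(long a, long b, long *out) {
    return __builtin_add_overflow(a, b, out);
}

/* Hypothetical handler body that calls the helper.
   Returns 0 on success and -1 when the addition overflows. */
static int add_handler(long a, long b, long *result) {
    assert(result != NULL);
    if (checked_add(a, b, result)) {
        return -1;   /* a real interpreter would take a slow path here */
    }
    return 0;
}
```

Calling add_handler(2, 3, &r) stores 5 in r and returns 0, while add_handler(LONG_MAX, 1, &r) returns -1.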

Platform Implementation

macOS: Already shipping in uv's Python 3.14 builds
Windows: Requires Visual Studio 2026 (MSVC 18)

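Build command (on CPython versions where the tail-calling interpreter is opt-in, PCbuild\build.bat also accepts --tail-call-interp):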

PCbuild\build.bat -p x64 -c Release --pgo

Implications

This optimization demonstrates how compiler advancements unlock hidden performance in mature systems. The gains are especially significant for:

Large pure-Python libraries (14% speedup observed)
Long-running scripts (up to 40% faster)

The collaboration between CPython contributors and the MSVC team highlights how targeted compiler improvements can yield disproportionate returns in interpreter design.

Source: No Longer Sorry: Tail Calling Improves CPython Performance on Windows