Tail-Call Optimization Delivers Major CPython Speed Boosts on Windows and macOS
CPython developers have validated substantial speed improvements from the tail-calling interpreter, particularly on Windows. Earlier benchmark results were retracted because a compiler bug had skewed them. New data shows tail calling beating traditional dispatch with a geometric-mean speedup of 15% on Windows x86-64 with MSVC and 5% on macOS ARM systems.
The Interpreter Dispatch Evolution
CPython historically used two dispatch techniques:
1. Switch-case: Jumps between instructions via switch statements
2. Computed gotos: Uses GCC/Clang's labels-as-values for direct jumps
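The two techniques can be sketched on a toy bytecode. This is a minimal illustration, not CPython's actual loop: the three opcodes, the function names run_switch and run_goto, and the demo program are all hypothetical, and the computed-goto version assumes a GCC or Clang compiler and a program that ends with OP_HALT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-instruction bytecode, for illustration only. */
enum { OP_INC = 0, OP_DOUBLE = 1, OP_HALT = 2 };

/* Hypothetical demo program: ((0 + 1 + 1) * 2) + 1 */
static const uint8_t demo_prog[] = { OP_INC, OP_INC, OP_DOUBLE, OP_INC, OP_HALT };

/* Switch-case dispatch: every instruction returns to one shared switch. */
static long run_switch(const uint8_t *code, size_t len) {
    long acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        switch (code[pc]) {
        case OP_INC:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        default:        assert(!"unknown opcode"); return -1;
        }
    }
    return acc;
}

#if defined(__GNUC__) || defined(__clang__)
/* Computed gotos: each handler jumps straight to the next handler's label
   through a table of label addresses (labels-as-values extension).
   Assumes every opcode is valid and the program ends with OP_HALT. */
static long run_goto(const uint8_t *code) {
    static void *const targets[] = { &&do_inc, &&do_double, &&do_halt };
    long acc = 0;
    const uint8_t *pc = code;
#define DISPATCH() goto *targets[*pc++]
    DISPATCH();
do_inc:    acc += 1; DISPATCH();
do_double: acc *= 2; DISPATCH();
do_halt:   return acc;
#undef DISPATCH
}
#endif
```

Both functions compute the same result. The difference is that the switch version funnels every instruction through one shared dispatch point, while the computed-goto version gives each handler its own indirect jump.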
Tail-call threading—where each bytecode handler tail-calls the next—was long considered impractical in C due to unreliable tail-call optimization (TCO). Modern compiler advancements changed this:
// Tail-call pattern
void OP_NAME(PyThreadState* tstate) {
    // Handler logic
    __attribute__((musttail)) return next_op(tstate);
}
Clang's __attribute__((musttail)) (popularized by Josh Haberman's protobuf work) and experimental MSVC support now guarantee TCO, eliminating stack overflow risks.
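A self-contained sketch of the same idea on a toy bytecode follows. It is not CPython's implementation: the opcodes, the handler names, the NEXT macro, and run_tail are hypothetical. The MUSTTAIL macro expands to __attribute__((musttail)) only when the compiler reports support for it; otherwise it expands to nothing, so the calls remain correct but are no longer guaranteed to reuse the stack frame.

```c
#include <stdint.h>

/* Use the guaranteed tail call where the compiler supports it. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL /* fallback: ordinary call, no TCO guarantee */
#endif

/* Hypothetical three-instruction bytecode, for illustration only. */
enum { T_INC = 0, T_DOUBLE = 1, T_HALT = 2 };

/* Hypothetical demo program: ((0 + 1 + 1) * 2) + 1 */
static const uint8_t tail_prog[] = { T_INC, T_INC, T_DOUBLE, T_INC, T_HALT };

/* Every handler shares one signature, which musttail requires. */
typedef long handler_fn(const uint8_t *pc, long acc);
static handler_fn op_inc, op_double, op_halt;

static handler_fn *const handlers[] = { op_inc, op_double, op_halt };

/* Advance to the next instruction and tail-call its handler. */
#define NEXT(pc, acc) MUSTTAIL return handlers[(pc)[1]]((pc) + 1, (acc))

static long op_inc(const uint8_t *pc, long acc)    { NEXT(pc, acc + 1); }
static long op_double(const uint8_t *pc, long acc) { NEXT(pc, acc * 2); }
static long op_halt(const uint8_t *pc, long acc)   { (void)pc; return acc; }

/* Entry point: call the first handler; the handlers chain from there.
   Assumes valid opcodes and a program that ends with T_HALT. */
static long run_tail(const uint8_t *code) {
    return handlers[code[0]](code, 0);
}
```

Because each handler is its own small function, the interpreter state travels in function arguments rather than in locals of one giant loop.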
Performance Breakthrough
Updated pyperformance benchmarks reveal:
- 5% geomean speedup on macOS ARM (Xcode Clang)
- 15% geomean speedup on Windows x86-64 (MSVC from Visual Studio 2026)
- Specific Windows benchmarks showed up to 78% improvements
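For readers unfamiliar with the metric, a geomean speedup is the geometric mean of the per-benchmark speed ratios. The formula below is the standard definition, and the two ratios in the worked example are hypothetical values chosen for easy arithmetic, not figures from the benchmark run.

```latex
\[
  \operatorname{geomean}(r_1,\dots,r_n) = \Bigl(\prod_{i=1}^{n} r_i\Bigr)^{1/n},
  \qquad
  \text{e.g. } \sqrt{1.00 \times 1.44} = 1.20
\]
```

In that example one benchmark is unchanged and one is 44% faster, giving a 20% geomean speedup. This is why a single 78% outlier is compatible with a much smaller overall figure.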
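"The tail-calling interpreter resets compiler heuristics to sane levels," the post's author explains. The traditional interpreter loop is a single function of roughly 12K lines, large enough that the compiler stops applying optimizations such as inlining inside it. Tail calling splits this monolith into small per-instruction functions, which lets the compiler inline critical helpers again: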
; Switch-case (no inlining)
call PyStackRef_CLOSE_SPECIALIZED
; Tail-call (inlined instructions)
add rcx, rax
jo overflow_handler
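The add-then-jump-on-overflow shape in the inlined version is what a compiler typically emits for an overflow-checked addition once the helper has been inlined into its caller. The sketch below shows that kind of code in C. It is an assumption-laden illustration rather than CPython source: checked_add and add_or_fallback are hypothetical names, and __builtin_add_overflow is a GCC/Clang builtin that MSVC does not provide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: adds two 64-bit integers and reports overflow.
   Returns true if the mathematical sum does not fit in int64_t. */
static inline bool checked_add(int64_t a, int64_t b, int64_t *out) {
    return __builtin_add_overflow(a, b, out); /* GCC/Clang builtin */
}

/* Hypothetical handler body: small enough that the compiler can inline
   checked_add, leaving an add followed by a branch on overflow. */
static int64_t add_or_fallback(int64_t a, int64_t b, int64_t fallback) {
    int64_t sum;
    if (checked_add(a, b, &sum)) {
        return fallback; /* overflow path */
    }
    return sum;
}
```

When the same helper sits inside a function that the compiler considers too large, it may remain an out-of-line call, which is the pattern the switch-case listing shows.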
Platform Implementation
- macOS: Already shipping in uv's Python 3.14 builds
- Windows: Requires Visual Studio 2026 (Visual Studio version 18)
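Build command (a PGO release build; in CPython 3.14, build.bat also accepts --tail-call-interp to select the tail-calling interpreter):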
PCbuild\build.bat -p x64 -c Release --pgo
Implications
This optimization demonstrates how compiler advancements unlock hidden performance in mature systems. The gains are especially significant for:
- Large pure-Python libraries (14% speedup observed)
- Long-running scripts (up to 40% faster)
The collaboration between CPython contributors and the MSVC team highlights how targeted compiler improvements can yield disproportionate returns in interpreter design.
Source: No Longer Sorry: Tail Calling Improves CPython Performance on Windows