CPython JIT Accelerator: Inside the Plan for 5-10% Speed Boosts in Python 3.15 and 3.16
At the recent Python Core Dev Sprint hosted by ARM in Cambridge, core developers Savannah Ostrowski, Mark Shannon, Ken Jin, Diego Russo, and Brandt Bucher laid out an aggressive optimization roadmap for CPython's Just-In-Time (JIT) compiler. Their goal: achieve a 5% geometric mean speedup in Python 3.15 and 10% in 3.16 as measured by pyperformance benchmarks. These figures represent significant engineering challenges given Python's dynamic nature and the JIT's relative infancy.
Rewiring the JIT Frontend
The current JIT uses "trace projection," which predicts execution paths from historical data in the interpreter's inline caches. This approach is being replaced by trace recording—the technique used by PyPy and TorchDynamo—in which compilation is driven by live runtime data rather than prediction. A frontend rewrite inspired by Brandt Bucher already shows promising results:
Preliminary results from the new trace-recording frontend:
- 1.5% geometric mean speedup on pyperformance
- 100% faster on the Richards benchmark
- 15% slower on the worst-case benchmark
- Support for generators, custom dunders, and object initialization
The new system uses "dual dispatch," maintaining separate interpreter and tracing dispatch tables while leveraging computed gotos for transitions. This foundational shift enables more accurate optimizations but requires meticulous tuning to avoid regressions.
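The dual-dispatch idea can be sketched in Python (illustrative names and opcodes, not CPython's actual C implementation): the interpreter and the trace recorder share opcode handlers but dispatch through separate tables, so switching tables turns recording on or off without per-opcode checks.

```python
# Toy "dual dispatch" sketch: two dispatch tables over the same
# handlers; the tracing table additionally records each executed op.

def make_vm():
    trace = []  # ops recorded by the tracing dispatch table

    def push(state, arg):
        state.append(arg)

    def add(state, arg):
        rhs, lhs = state.pop(), state.pop()
        state.append(lhs + rhs)

    plain_table = {"PUSH": push, "ADD": add}
    # Wrap each handler so it records its opcode name before running.
    tracing_table = {
        name: (lambda h, n: lambda state, arg: (trace.append(n), h(state, arg)))(h, name)
        for name, h in plain_table.items()
    }

    def run(code, tracing=False):
        table = tracing_table if tracing else plain_table  # pick a table
        state = []
        for op, arg in code:
            table[op](state, arg)
        return state[-1]

    return run, trace
```

In CPython the transition between the two tables is done with computed gotos at the C level; the dictionary lookup here stands in for that mechanism.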
Assembly-Level Micro-Optimizations
Copy-and-Patch compilation—CPython's lightweight JIT technique—is getting low-level enhancements:
- Branch Inversion: Brandt Bucher's PR #139757 reverses conditional jumps to prioritize fall-through in hot paths, yielding ~1% speedup. Example:

  ; Before
  jne _JIT_JUMP_TARGET  ; Cold path
  jmp _JIT_CONTINUE     ; Hot path

  ; After
  je _JIT_CONTINUE      ; Hot path (fall-through)
  jmp _JIT_JUMP_TARGET  ; Cold path

- AArch64 Tuning: Mark Shannon and Diego Russo are optimizing code generation for the ARM architecture (#140683).
- Hot-Cold Splitting: Separating frequently and rarely executed code paths to improve cache utilization (currently in planning).
Register Allocation Breakthrough
Mark Shannon is implementing Anton Ertl's 1995 register allocation technique for stack machines, which caches stack values in registers via state transitions. Early results show 0.5% geometric mean speedup and 16% on nbody benchmark. The real prize? Unlocking deeper optimizations by mitigating Python's reference counting overhead:
"CPython tracks object liveness via reference counting and garbage collection. Any operation that decrements a reference count could invoke arbitrary Python code (via __del__), forcing register spills." — Ken Jin
Matt Page's free-threading work provides a solution: lifetime analysis via "borrowed references" (_BORROW opcodes) allows skipping redundant reference counting. This optimization alone shows 6% speedup in nbody microbenchmarks and enables more aggressive register allocation.
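Ertl-style stack caching can be illustrated with a toy stack machine (invented opcodes, not CPython's real register allocator): the top of stack lives in a local "register" with a cache-state flag, so many pushes and pops never touch the in-memory stack.

```python
# Toy stack machine with a single-slot top-of-stack cache.
# `memory_ops` counts real reads/writes to the in-memory stack.

def execute(code, consts):
    stack = []       # in-memory stack (spill area)
    tos = None       # register-cached top of stack
    cached = False   # cache state: is `tos` holding a live value?
    memory_ops = 0

    for op, arg in code:
        if op == "LOAD_CONST":
            if cached:                       # spill old TOS before caching
                stack.append(tos)
                memory_ops += 1
            tos, cached = consts[arg], True
        elif op == "ADD":
            rhs = tos                        # already in the register: free
            lhs = stack.pop()                # one real stack read
            memory_ops += 1
            tos = lhs + rhs                  # result stays in the register
        elif op == "RETURN":
            return tos, memory_ops
```

Computing 2 + 3 this way touches the memory stack only twice (one spill, one reload), where a machine without the cache would perform six stack operations for the same program.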
Free-Threading and Future Challenges
With Python's Global Interpreter Lock (GIL) removal progressing, the JIT must adapt:
- Thread Safety: Initial plans borrow from Ruby's ZJIT, using a "watcher" to invalidate JIT code when new threads spawn. Single-threaded code runs optimized, while multi-threaded workloads gracefully degrade to interpreter mode.
- Constant Propagation: Plans to embed trace-level constants (à la PyPy) for deeper optimizations.
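A trace-level constant-folding pass of the kind described above might look like this sketch (the three-tuple IR and opcode names are invented for illustration, not CPython's actual uop format):

```python
# Fold operations whose inputs are all compile-time constants into a
# single constant load, dropping the runtime computation from the trace.

def propagate_constants(trace):
    consts = {}  # register name -> known constant value
    out = []
    for uop, dst, srcs in trace:
        if uop == "LOAD_CONST":
            consts[dst] = srcs[0]            # remember the constant
            out.append((uop, dst, srcs))
        elif uop == "ADD" and all(s in consts for s in srcs):
            val = consts[srcs[0]] + consts[srcs[1]]
            consts[dst] = val
            out.append(("LOAD_CONST", dst, (val,)))  # fold the ADD away
        else:
            consts.pop(dst, None)            # unknown result: invalidate
            out.append((uop, dst, srcs))
    return out
```

Applied to a trace that loads 2 and 3 and adds them, the pass replaces the ADD with a load of the precomputed 5, leaving only opaque operations (such as calls) for runtime.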
The Road to 10%
These incremental optimizations—each contributing 0.5%-1.5%—compound toward the 10% target. The team emphasizes conservative estimates, as geometric means hide both significant wins and unavoidable regressions. As the trace-recording frontend stabilizes and register allocation matures, core developers anticipate a wave of contributor opportunities in 2025.
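As a back-of-the-envelope check of the compounding claim: independent speedups multiply rather than add, so a handful of wins in the 0.5%-1.5% range lands near the 10% goal (the individual gains below are illustrative, not measured numbers).

```python
# Ten hypothetical optimizations, each worth 0.5%-1.5%, compound to
# roughly 10.5% because the factors multiply.
from math import prod

gains = [0.015, 0.01, 0.005, 0.01, 0.015, 0.01, 0.005, 0.01, 0.01, 0.01]
combined = prod(1 + g for g in gains) - 1   # just over 0.10
```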
Source: Faster JIT Plan for 3.15 and 3.16 by CPython core developer Ken Jin