CPython JIT Accelerator: Inside the Plan for 5-10% Speed Boosts in Python 3.15 and 3.16
At the recent Python Core Dev Sprint hosted by ARM in Cambridge, core developers Savannah Ostrowski, Mark Shannon, Ken Jin, Diego Russo, and Brandt Bucher laid out an aggressive optimization roadmap for CPython's Just-In-Time (JIT) compiler. Their goal: achieve a 5% geometric mean speedup in Python 3.15 and 10% in 3.16 as measured by pyperformance benchmarks. These figures represent significant engineering challenges given Python's dynamic nature and the JIT's relative infancy.
Rewiring the JIT Frontend
The current JIT uses "trace projection," which predicts execution paths from historical data in the interpreter's inline caches. This approach is being replaced by trace recording—the technique used by PyPy and TorchDynamo—in which compilation is driven by live runtime data rather than prediction. A frontend rewrite inspired by Brandt Bucher already shows promising results:
Preliminary results from the new trace-recording frontend:
- 1.5% geometric mean speedup on pyperformance
- 100% faster on the Richards benchmark
- 15% slower on the worst-case benchmark
- Support for generators, custom dunders, and object initialization
The new system uses "dual dispatch," maintaining separate interpreter and tracing dispatch tables while leveraging computed gotos for transitions. This foundational shift enables more accurate optimizations but requires meticulous tuning to avoid regressions.
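The dual-dispatch idea can be sketched in Python (illustrative names and opcodes, not CPython's actual C implementation): the interpreter and the trace recorder share opcode handlers but dispatch through separate tables, so switching tables turns recording on or off without per-opcode checks.

```python
# Toy "dual dispatch" sketch: two dispatch tables over the same
# handlers; the tracing table additionally records each executed op.

def make_vm():
    trace = []  # ops recorded by the tracing dispatch table

    def push(state, arg):
        state.append(arg)

    def add(state, arg):
        rhs, lhs = state.pop(), state.pop()
        state.append(lhs + rhs)

    plain_table = {"PUSH": push, "ADD": add}
    # Wrap each handler so it records its opcode name before running.
    tracing_table = {
        name: (lambda h, n: lambda state, arg: (trace.append(n), h(state, arg)))(h, name)
        for name, h in plain_table.items()
    }

    def run(code, tracing=False):
        table = tracing_table if tracing else plain_table  # pick a table
        state = []
        for op, arg in code:
            table[op](state, arg)
        return state[-1]

    return run, trace
```

In CPython the transition between the two tables is done with computed gotos at the C level; the dictionary lookup here stands in for that mechanism.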
Assembly-Level Micro-Optimizations
Copy-and-Patch compilation—CPython's lightweight JIT technique—is getting low-level enhancements:
- Branch Inversion: Brandt Bucher's PR #139757 reverses conditional jumps to prioritize fall-through in hot paths, yielding ~1% speedup. Example:

  ; Before
  jne _JIT_JUMP_TARGET  ; Cold path
  jmp _JIT_CONTINUE     ; Hot path

  ; After
  je _JIT_CONTINUE      ; Hot path (fall-through)
  jmp _JIT_JUMP_TARGET  ; Cold path

- AArch64 Tuning: Mark Shannon and Diego Russo are optimizing code generation for the ARM architecture (#140683).
- Hot-Cold Splitting: Separating frequently and rarely executed code paths to improve cache utilization (currently in planning).
Register Allocation Breakthrough
Mark Shannon is implementing Anton Ertl's 1995 register allocation technique for stack machines, which caches stack values in registers via state transitions. Early results show 0.5% geometric mean speedup and 16% on nbody benchmark. The real prize? Unlocking deeper optimizations by mitigating Python's reference counting overhead:
"CPython tracks object liveness via reference counting and garbage collection. Any operation that decrements a reference count could invoke arbitrary Python code (via __del__), forcing register spills." — Ken Jin
Matt Page's free-threading work provides a solution: lifetime analysis via "borrowed references" (_BORROW opcodes) allows skipping redundant reference counting. This optimization alone shows 6% speedup in nbody microbenchmarks and enables more aggressive register allocation.
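Ertl-style stack caching can be illustrated with a toy stack machine (invented opcodes, not CPython's real register allocator): the top of stack lives in a local "register" with a cache-state flag, so many pushes and pops never touch the in-memory stack.

```python
# Toy stack machine with a single-slot top-of-stack cache.
# `memory_ops` counts real reads/writes to the in-memory stack.

def execute(code, consts):
    stack = []       # in-memory stack (spill area)
    tos = None       # register-cached top of stack
    cached = False   # cache state: is `tos` holding a live value?
    memory_ops = 0

    for op, arg in code:
        if op == "LOAD_CONST":
            if cached:                       # spill old TOS before caching
                stack.append(tos)
                memory_ops += 1
            tos, cached = consts[arg], True
        elif op == "ADD":
            rhs = tos                        # already in the register: free
            lhs = stack.pop()                # one real stack read
            memory_ops += 1
            tos = lhs + rhs                  # result stays in the register
        elif op == "RETURN":
            return tos, memory_ops
```

Computing 2 + 3 this way touches the memory stack only twice (one spill, one reload), where a machine without the cache would perform six stack operations for the same program.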
Free-Threading and Future Challenges
With Python's Global Interpreter Lock (GIL) removal progressing, the JIT must adapt:
- Thread Safety: Initial plans borrow from Ruby's ZJIT, using a "watcher" to invalidate JIT code when new threads spawn. Single-threaded code runs optimized, while multi-threaded workloads gracefully degrade to interpreter mode.
- Constant Propagation: Plans to embed trace-level constants (à la PyPy) for deeper optimizations.
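A trace-level constant-folding pass of the kind described above might look like this sketch (the three-tuple IR and opcode names are invented for illustration, not CPython's actual uop format):

```python
# Fold operations whose inputs are all compile-time constants into a
# single constant load, dropping the runtime computation from the trace.

def propagate_constants(trace):
    consts = {}  # register name -> known constant value
    out = []
    for uop, dst, srcs in trace:
        if uop == "LOAD_CONST":
            consts[dst] = srcs[0]            # remember the constant
            out.append((uop, dst, srcs))
        elif uop == "ADD" and all(s in consts for s in srcs):
            val = consts[srcs[0]] + consts[srcs[1]]
            consts[dst] = val
            out.append(("LOAD_CONST", dst, (val,)))  # fold the ADD away
        else:
            consts.pop(dst, None)            # unknown result: invalidate
            out.append((uop, dst, srcs))
    return out
```

Applied to a trace that loads 2 and 3 and adds them, the pass replaces the ADD with a load of the precomputed 5, leaving only opaque operations (such as calls) for runtime.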
The Road to 10%
These incremental optimizations—each contributing 0.5%-1.5%—compound toward the 10% target. The team emphasizes conservative estimates, as geometric means hide both significant wins and unavoidable regressions. As the trace-recording frontend stabilizes and register allocation matures, core developers anticipate a wave of contributor opportunities in 2025.
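As a back-of-the-envelope check of the compounding claim: independent speedups multiply rather than add, so a handful of wins in the 0.5%-1.5% range lands near the 10% goal (the individual gains below are illustrative, not measured numbers).

```python
# Ten hypothetical optimizations, each worth 0.5%-1.5%, compound to
# roughly 10.5% because the factors multiply.
from math import prod

gains = [0.015, 0.01, 0.005, 0.01, 0.015, 0.01, 0.005, 0.01, 0.01, 0.01]
combined = prod(1 + g for g in gains) - 1   # just over 0.10
```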
Source: Faster JIT Plan for 3.15 and 3.16 by CPython core developer Ken Jin