Tracing JITs in the Wild: Lessons from PyPy for CPython's New Compiler
At the recent CPython Core Developer Sprint in Cambridge, veteran PyPy contributor Antonio Cuni delivered a sobering masterclass on the real-world challenges facing tracing just-in-time (JIT) compilers. Drawing from seven years of optimizing financial systems at a high-frequency trading firm, Cuni revealed why PyPy's performance characteristics often defy expectations—and why CPython's new JIT may face similar hurdles.
When Trace Blockers Sabotage Speed
Tracing JITs work by recording and optimizing hot code paths, but they hit invisible walls when they encounter operations they cannot analyze, such as calls into C extensions. Cuni demonstrated this with a pi-calculation benchmark:
def get_pi(tol=1e-6):
    # Illustrative reconstruction using the Leibniz series; the talk's exact
    # arithmetic may differ. The point is the untraceable call in the hot loop.
    pi, term, n = 0.0, 1.0, 0
    while abs(term) > tol:
        hic_sunt_leones()  # JIT cannot trace through this!
        term = (-1.0) ** n / (2 * n + 1)
        pi += 4.0 * term
        n += 1
    return pi
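For the snippet to run at all, hic_sunt_leones needs a definition. A minimal stand-in, assuming a POSIX system, is any call that drops straight into C code; how badly a given call actually blocks tracing depends on the runtime, and the canonical offenders are C-extension calls rather than this toy:

import ctypes

# Hypothetical stand-in for the trace blocker: an opaque C-level call.
# ctypes.CDLL(None) loads the symbols of the running process (POSIX only).
_libc = ctypes.CDLL(None)

def hic_sunt_leones():
    _libc.rand()  # work the JIT cannot look inside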
Results were striking: Without the blocker, PyPy ran 42x faster than CPython. With one untraceable call? Just 1.8x faster—nearly all gains evaporated. "This will be worse for CPython," Cuni warned, "because virtually all real code uses C extensions, and PyPy's JIT sees more internals by design."
The Exponential Curse of Data-Driven Code
When control flow depends on runtime data patterns, tracing JITs can suffer a combinatorial explosion of traces. Cuni showed a function that normalizes its parameters with a chain of None checks:
def fn(v=None, a=None, b=None):  # ... plus more optional parameters ...
    if v is None: v = 0
    if a is None: a = 1.25
    # ... 7 more conditionals ...
PyPy without JIT: 2.3x slower than CPython. With JIT? 13x slower due to 527 compiled bridge traces for different None combinations. "Branchless code helps but butchers readability," Cuni noted. "This is a fundamental tracing JIT limitation—merging traces might help, but risks recreating the trace-blocker problem."
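The arithmetic behind the blow-up is straightforward: each None check lets the tracer specialize in two ways, so the worst case grows as two to the power of the number of checks. A back-of-the-envelope figure, not the talk's exact accounting:

# Worst-case count of distinct None/not-None patterns a tracer can specialize on.
checks = 9          # the two checks shown above plus the "7 more conditionals"
print(2 ** checks)  # 512, in the same ballpark as the 527 bridges reported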
Generators: The Silent Performance Killers
Async-heavy modern Python faces hidden costs here too. In a Pythagorean-triple counter, generator-based iteration ran 29% slower than plain loops on CPython. PyPy optimized a class-based iterator to near-loop speed but could not fully optimize generators:
def range_product(a, b):
    for i in range(*a):  # JIT struggles to inline
        for j in range(*b):
            yield i, j
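For context, a hedged sketch of the kind of comparison described; the bounds and the perfect-square test are illustrative, not the talk's exact benchmark:

import math

def count_triples_gen(n):
    # Generator-based: every pair comes from resuming range_product's frame.
    return sum(1 for i, j in range_product((1, n), (1, n))
               if math.isqrt(i * i + j * j) ** 2 == i * i + j * j)

def count_triples_loop(n):
    # Plain nested loops: the same arithmetic, with no generator frame to resume.
    count = 0
    for i in range(1, n):
        for j in range(1, n):
            if math.isqrt(i * i + j * j) ** 2 == i * i + j * j:
                count += 1
    return count

Both versions do the same arithmetic; the generator version pays for suspending and resuming a frame on every pair it yields.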
"Generators force frame creation—the JIT can't see through them," Cuni explained. "In complex async systems, we've seen far worse slowdowns."
The Silver Lining: Allocation Removal
Not all lessons were dire. Tracing JITs excel at removing temporary objects. When calculating triangle centroids from binary data, PyPy optimized a clean OOP abstraction:
t = Triangle(buf, 0)
tot_x += t.a.x + t.b.x + t.c.x # JIT eliminates all intermediates
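For context, a minimal sketch of what such an abstraction might look like, assuming each triangle is stored as three (x, y) pairs of 8-byte floats; the layout and the Point helper are illustrative, not the talk's exact code:

import struct

class Point:
    def __init__(self, buf, offset):
        # Each Point is short-lived: a tracing JIT can keep x and y in
        # registers and skip allocating the object entirely.
        self.x, self.y = struct.unpack_from("dd", buf, offset)

class Triangle:
    POINT_SIZE = 16  # two 8-byte doubles

    def __init__(self, buf, offset):
        self.a = Point(buf, offset)
        self.b = Point(buf, offset + self.POINT_SIZE)
        self.c = Point(buf, offset + 2 * self.POINT_SIZE)

The low-level alternative would call struct.unpack_from directly inside the summing loop; the talk's point is that, with allocation removal, the cleaner version can come out ahead.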
Result: The abstracted version ran 2.5x faster than the low-level byte-unpacking approach. "Virtual allocation removal is PyPy's crown jewel," said Cuni. "CPython's JIT should prioritize this."
The Roadmap Ahead
Cuni closed with urgent recommendations: Better tooling to diagnose JIT behavior, warmup optimizations to avoid short-program penalties, and trace merging to combat combinatorial explosions. As CPython's JIT matures, he stressed, these lessons from PyPy's trenches will prove invaluable: "Performance in a JIT world isn't intuitive—your clean abstraction might become faster than your clever hack."
Source: Tracing JITs in the real world by Antonio Cuni