A developer's deep dive shows how moving from Python to C++ static autograd with tracing delivers 11x speedups on Tenstorrent's Wormhole hardware, and explores the surprising limits of parallelism at GPT-2 scale. The benchmarks expose a fundamental divide between dispatch-bound and compute-bound workloads.