When Compiler Optimizations Backfire: How Jump Tables Slowed UTF-8 Parsing by 4x
Compiler optimizations promise faster code, but sometimes they deliver the opposite. Developer Nemanja Trifunovic uncovered this paradox while benchmarking UTF-8 sequence length calculations—a fundamental operation in text processing. His discovery exposes how a common compiler optimization can unexpectedly cripple performance, forcing us to rethink assumptions about modern CPUs.
The Optimization That Slowed Things Down
Trifunovic compared two approaches for determining UTF-8 sequence lengths. The "modern" method used std::countl_one to count leading ones in the lead byte, which Clang 18.1.3 (for ARM64) optimized using a jump table:
#include <bit>

int utf8_sequence_length(unsigned char lead_byte) {
    // std::countl_one counts the leading 1 bits of the lead byte, which
    // encode the length of the UTF-8 sequence it introduces.
    switch (std::countl_one(lead_byte)) {
    case 0: return 1;  // 0xxxxxxx: ASCII, single byte
    case 2: return 2;  // 110xxxxx: two-byte sequence
    case 3: return 3;  // 1110xxxx: three-byte sequence
    case 4: return 4;  // 11110xxx: four-byte sequence
    default: return 0; // 10xxxxxx continuation byte or other invalid lead
    }
}
The compiler generated a lookup table (.Lswitch.table) to avoid branching. Yet benchmarks showed disappointing throughput: just 438-462 MB/s processing ASCII text. Meanwhile, a "naïve" branch-based approach from Trifunovic's earlier work processed over 2000 MB/s, more than four times faster.
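The article doesn't reproduce the naïve version, but a branch-based check in that spirit looks roughly like the following; this is a sketch, not Trifunovic's exact code (the name utf8_sequence_length_naive is illustrative):

int utf8_sequence_length_naive(unsigned char lead_byte) {
    if (lead_byte < 0x80)         return 1; // 0xxxxxxx: ASCII
    if ((lead_byte >> 5) == 0x06) return 2; // 110xxxxx: two bytes
    if ((lead_byte >> 4) == 0x0E) return 3; // 1110xxxx: three bytes
    if ((lead_byte >> 3) == 0x1E) return 4; // 11110xxx: four bytes
    return 0;                               // continuation or invalid byte
}

On ASCII-heavy input the first test is nearly always true, so the CPU's branch predictor effectively removes its cost.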
Why Jump Tables Stumbled
CPU profiling revealed high backend stall cycles, suggesting memory bottlenecks. The lookup table—though small—forced frequent L1 cache accesses. For predominantly ASCII input (where lead bytes start with 0), branch predictors in modern CPUs excel at speculating correctly, making branches cheaper than memory lookups.
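To make the numbers concrete, a throughput measurement can be as simple as timing the classifier over a large buffer. A minimal sketch, compiled with -std=c++20; the 64 MB pure-ASCII buffer and the timing method are assumptions, not Trifunovic's actual harness:

#include <bit>
#include <chrono>
#include <cstdio>
#include <string>

// The std::countl_one variant from above; swap in the branch-based
// version to compare the two.
static int utf8_sequence_length(unsigned char lead_byte) {
    switch (std::countl_one(lead_byte)) {
    case 0: return 1;
    case 2: return 2;
    case 3: return 3;
    case 4: return 4;
    default: return 0;
    }
}

int main() {
    std::string data(64 * 1024 * 1024, 'a'); // 64 MB of pure ASCII
    auto start = std::chrono::steady_clock::now();
    long long total = 0;
    for (unsigned char c : data)
        total += utf8_sequence_length(c);    // keeps the loop from being eliminated
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("%.0f MB/s (checksum %lld)\n",
                data.size() / (1024.0 * 1024.0) / secs, total);
}

A loop this simple can be autovectorized or otherwise outsmarted by the compiler, so real measurements need more care; the point here is only the shape of the experiment.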
Trifunovic tested this by disabling jump tables in Clang (-fno-jump-tables). The resulting branch-heavy assembly:
cmp w0, #2          // w0 holds the leading-ones count
b.gt .LBB0_4        // count > 2: three- and four-byte cases
cbz w0, .LBB0_6     // count == 0: ASCII, length 1
cmp w0, #2
b.eq .LBB0_5        // count == 2: two-byte case
Performance immediately matched the faster naïve approach. Notably, GCC didn't use jump tables and showed no such penalty.
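The codegen difference is easy to reproduce. Assuming the function lives in a file named utf8_len.cpp (a hypothetical name), the two listings can be generated with:

clang++ -std=c++20 -O2 -S utf8_len.cpp -o with_tables.s
clang++ -std=c++20 -O2 -fno-jump-tables -S utf8_len.cpp -o no_tables.s

Searching the first listing for .Lswitch.table and the second for conditional branches such as cbz and b.eq shows which lowering Clang chose (add --target=aarch64-unknown-linux-gnu when cross-compiling for ARM64).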
A Wider Pattern
This isn't an isolated case. Julian Squires' 2017 investigation "Are Jump Tables Always Fastest?" found similar behavior on x86-64. Jump tables excel when case values are dense and hard to predict (see the sketch after this list), but falter when:
1. Inputs are predictable (e.g., ASCII-dominated text)
2. Tables force cache misses
3. Branch predictors can learn patterns
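For a concrete picture of "dense," consider a switch whose case values form a contiguous range; this is the classic candidate for table lowering, though actual codegen depends on compiler version and target (the function name dense_dispatch and the values are illustrative):

int dense_dispatch(int op) {
    // Contiguous case values 0..5: the compiler can replace the branches
    // with a single bounds check plus a table lookup.
    switch (op) {
    case 0: return 3;
    case 1: return 1;
    case 2: return 4;
    case 3: return 1;
    case 4: return 5;
    case 5: return 9;
    default: return -1;
    }
}

When op is effectively always the same value, as with the leading-ones count on ASCII text, the bounds check plus load can cost more than a well-predicted branch.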
Implications for Performance-Critical Code
- Benchmark Rigor Matters: "Optimized" compiler output isn't universally faster. Measure with real-world data distributions (a sketch of one way to vary the mix follows this list).
- CPU Evolution Changes Rules: Branch predictors have advanced significantly—what was slow decades ago may now excel.
- Compiler Flags Are Levers: Tools like -fno-jump-tables can rescue performance when automatic optimizations misfire.
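On the first point, varying the byte mix rather than testing only pure ASCII is straightforward. A minimal sketch, reusing the harness idea above; the helper name make_mixed_input and the split it takes are illustrative, not from the article. Note the result is not valid UTF-8 (continuation bytes are omitted); it only varies the lead-byte distribution the classifier sees:

#include <cstddef>
#include <random>
#include <string>

// Build a buffer where roughly `ascii_fraction` of bytes are ASCII and the
// rest are two-byte lead bytes, to exercise the classifier's branch pattern.
std::string make_mixed_input(std::size_t n, double ascii_fraction) {
    std::string buf(n, '\0');
    std::mt19937 rng(42); // fixed seed for reproducible runs
    std::bernoulli_distribution is_ascii(ascii_fraction);
    for (char &c : buf)
        c = is_ascii(rng) ? 'a' : static_cast<char>(0xC3); // 0xC3 = 110xxxxx
    return buf;
}

Feeding, say, make_mixed_input(64 * 1024 * 1024, 0.9) to the timing loop above shows how throughput responds as the input becomes less predictable.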
As Trifunovic’s benchmarks reveal, sometimes the "clever" approach is outsmarted by simplicity. In an era of sophisticated silicon, performance tuning remains a dialogue between developer intentions, compiler choices, and the unseen logic of microarchitectures.
Source: Nemanja Trifunovic's When Compiler Optimizations Hurt Performance