Unveiling Compiler Wizardry: How GCC Replaces Division by 10 with Magic Multiplies and Shifts

Converting binary integers to their decimal ASCII strings is a routine task in C programming, yet one that can expose the raw efficiency of modern compilers. In financial trading systems from a decade ago, where FIX protocol demanded rapid ASCII encoding of prices and quantities, such conversions were performance bottlenecks. Matt Godbolt, renowned for Compiler Explorer, uses this as a teaching hook in his undergraduate presentations and now dissects it in day 7 of his Advent of Compiler Optimisations 2025 series, showcasing how GCC transforms a seemingly simple loop into assembly that dodges expensive divisions entirely.

The Fundamental Binary-to-Decimal Routine

The standard algorithm extracts digits from the least significant: repeatedly take the number modulo 10 to get the rightmost digit, add 48 for its ASCII '0' offset, then divide by 10 to shift to the next digit. Digits emerge in reverse, requiring a post-loop reversal, and a do-while loop ensures zero produces a single '0'. Godbolt's example captures this essence:

char *to_ascii(unsigned n, char *buf) {
    char *p = buf;
    do {
        *p++ = (n % 10) + '0';
        n /= 10;
    } while (n);
    *p = '\0';
    return buf;
}

Here, the % and /= operations scream inefficiency—division and modulo are among the slowest instructions on x86-64, often latency-bound at 20-100 cycles. But with optimizations like -O2, GCC rewrites this into a lean machine.

The Magic Constant: Division via Fixed-Point Arithmetic

At the heart of the optimization is a technique to approximate division by 10 without a div instruction. GCC multiplies the input by the 'magic constant' 0xCCCCCCCD, then performs a signed right shift by 35 bits. Why this odd pairing?

The constant, when interpreted in fixed-point terms, is 0xCCCCCCCD / 2^35 ≈ 0.1 (precisely 0.10000000000582077). Multiplying an unsigned integer by this value and shifting right effectively computes floor division by 10, with the approximation tuned to yield exact integer quotients for all 64-bit values. Godbolt annotates the key assembly:

; Core division approximation
imul   rax, rdi, 0xCCCCCCCD  ; Magic multiply (signed, but low bits match unsigned)
sar    rax, 35              ; Shift right 35: equivalent to / 2^35, yielding n / 10

; Remainder via reconstruction
lea    rcx, [rax + rax*4]   ; 5 * (n/10)
lea    rax, [rcx*2]         ; 10 * (n/10)
sub    rdi, rax             ; Original n - 10*(n/10) = n % 10

The imul choice, despite unsigned inputs, leverages identical low-64-bit results and better instruction flexibility. For the modulo, GCC reconstructs 10 * quotient using lea for cheap multiples (building on prior optimizations like 3x or 5x via address calculations), then subtracts from the original—voilà, remainder without extra cost.

This isn't universal; for signed division by 3, GCC adds rounding adjustments, and by 15 bloats the code further. Yet for unsigned /10, it's flawless and faster than hardware division.

Layered Optimizations for Loop Efficiency

GCC doesn't stop there. The generated code includes:

  • Proactive Increments: Buffer pointer advances (p++) are hoisted or fused to minimize loop overhead.
  • Lookahead Termination: An early check skips the loop if n <= 9, peeking ahead to avoid redundant iterations for single-digit remainders.

These tweaks, combined with division elimination, turn a plodding routine into a cycle-efficient powerhouse, all while preserving correctness for edge cases like n=0.

Why This Matters in Modern Development

Godbolt's analysis reminds developers that compilers like GCC are not mere translators but sophisticated optimizers, often outpacing manual tweaks. In performance domains—from embedded IoT devices to high-throughput servers—such transformations can shave critical latencies. While Godbolt hints at advanced alternatives (bit hacks for log10, template-dispatched multi-digit lookups via 100-entry tables), the baseline shows how trusting -O2 unlocks hidden speed.

This peek into assembly demystifies why C remains king for systems code: its semantics align perfectly with compiler smarts. As Godbolt's series continues through 25 days of optimizations, it equips coders to write better source that amplifies these gains. For those inspired, Compiler Explorer lets you tinker with this code yourself—witness the magic unfold, and consider the video companion for a guided tour.

In an age of high-level abstractions, these low-level revelations ground us in the hardware reality that still powers our abstractions.