NVIDIA Engineer's Configure-Caching Patch Cuts GCC Bootstrap Wall Time by 15%

Kyrylo Tkachov's proposal caches Autoconf configure results across GCC's three bootstrap stages, slashing configure time by 43% and overall wall time by roughly 15% on a large AArch64 box almost certainly powered by NVIDIA's Vera CPU.

If you have ever sat and watched a full GCC bootstrap crawl across your terminal, you know the build is not always pinning your cores. NVIDIA compiler engineer Kyrylo Tkachov measured exactly that pain and posted a patch to the gcc-patches mailing list that attacks one of the least glamorous parts of building a compiler: the Autoconf configure step.

The headline numbers are the kind of thing this homelab cares about. Configure time drops by about 43%, and the overall bootstrap wall time falls by roughly 15%, with the generated configuration unchanged.

What a GCC bootstrap actually spends time on

A native GCC bootstrap is not a single compile. It is a three-stage process. Stage 1 builds GCC with your existing system compiler. Stage 2 rebuilds GCC using the stage 1 compiler. Stage 3 rebuilds again using the stage 2 compiler, and then the build system compares the stage 2 and stage 3 outputs to confirm they are byte-for-byte identical. That comparison is your proof the compiler is self-consistent.
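The stage 2 versus stage 3 check can be pictured as a file-by-file comparison of two build trees. The sketch below is only an illustration of that idea: the function name and the directory layout are hypothetical, and GCC's real comparison is driven by its own makefiles with its own rules about which files are compared.

```python
import filecmp
from pathlib import Path


def compare_stages(stage2_dir, stage3_dir, pattern="*.o"):
    """Return relative paths of stage 2 files that differ from, or are
    missing in, the stage 3 tree. An empty list means the trees match."""
    stage2 = Path(stage2_dir)
    stage3 = Path(stage3_dir)
    mismatches = []
    for obj in sorted(stage2.rglob(pattern)):
        rel = obj.relative_to(stage2)
        other = stage3 / rel
        # shallow=False forces a byte-for-byte content comparison
        # instead of trusting file size and modification time.
        if not other.is_file() or not filecmp.cmp(obj, other, shallow=False):
            mismatches.append(rel.as_posix())
    return mismatches
```

An empty result corresponds to a passing comparison; any entry in the list names a file where the stage 2 compiler and the stage 3 compiler disagreed.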

The catch is that each of those stages runs Autoconf's configure from scratch. GCC has an enormous codebase with a sprawling set of configure checks, and those scripts execute serially. They probe for header files, type sizes, library behavior, linker capabilities, and hundreds of other host and target properties. On a native build, none of those answers change between stages. You are running the same expensive feature detection three times and getting three identical results.

Tkachov's profiling on what he described only as a "large multi-core AArch64 machine," almost certainly NVIDIA's Vera CPU, found that around 30% of the total bootstrap wall time went to running configure scripts. Worse for anyone who bought a high-core-count chip, the machine sat at under 15% utilization for nearly half the build. That is the classic Amdahl's law trap. You throw 72 or 144 cores at a build, and a serial configure phase happily ignores all of them.
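For readers who want the Amdahl's law point in numbers, here is a minimal sketch. The 5% serial fraction is a made-up input chosen for illustration, not a figure from Tkachov's measurements, which report shares of wall time on one specific machine rather than a single-core serial fraction.

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: speedup over one core when only the non-serial
    part of the work scales with the number of cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Hypothetical workload: 5% of the single-core build time is serial.
SERIAL_FRACTION = 0.05

for cores in (8, 72, 144):
    print(f"{cores:>3} cores: {amdahl_speedup(SERIAL_FRACTION, cores):.1f}x")

# The ceiling as cores grow without bound is 1 / serial_fraction.
print(f"ceiling: {1.0 / SERIAL_FRACTION:.0f}x")
```

Under that assumption the speedup can never exceed 20x, no matter how many cores the machine has, which is why doubling from 72 to 144 cores buys so little.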

The fix: cache configure results across stages

The patch caches the configure results so they carry across all three stages instead of being recomputed each time. For a native bootstrap, where you are not cross-compiling, the configure answers are expected to stay the same from stage to stage, so reusing them is safe. Autoconf already supports this kind of reuse through cache files (config.cache, enabled with -C or --cache-file) and site defaults, and the patch leans on that idea to stop the redundant probing.
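The general idea of caching probe results can be sketched in a few lines. This is a toy model, not the patch's implementation: the class, the function, and the JSON file format are all hypothetical, and Autoconf's real cache is a shell file of ac_cv_* variable assignments rather than JSON.

```python
import json
import os
import struct
import tempfile
from pathlib import Path


class ProbeCache:
    """Toy stand-in for a configure cache: probe results keyed by name."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier results if a previous stage already saved them.
        self.results = json.loads(self.path.read_text()) if self.path.exists() else {}
        self.probes_run = 0

    def check(self, name, probe):
        # Run the expensive probe only when no cached answer exists.
        if name not in self.results:
            self.results[name] = probe()
            self.probes_run += 1
        return self.results[name]

    def save(self):
        self.path.write_text(json.dumps(self.results, indent=2, sort_keys=True))


def configure_stage(cache_path, probes):
    """Answer every probe, reusing cached results, then persist the cache."""
    cache = ProbeCache(cache_path)
    answers = {name: cache.check(name, probe) for name, probe in probes.items()}
    cache.save()
    return answers, cache.probes_run


# Two illustrative probes; real configure scripts run far more of them.
probes = {
    "sizeof_long": lambda: struct.calcsize("l"),
    "have_fork": lambda: hasattr(os, "fork"),
}
cache_file = Path(tempfile.mkdtemp()) / "probe-cache.json"
for stage in (1, 2, 3):
    answers, probes_run = configure_stage(cache_file, probes)
    print(f"stage {stage}: ran {probes_run} probes, answers={answers}")
```

In this model stage 1 runs both probes, while stages 2 and 3 run none and still end up with the same answers, which is the property the patch relies on for native builds.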

Tkachov was careful about correctness, which matters when you are touching the part of the toolchain everything else is built on:

"This roughly halves the time spent in configure (about a 43% reduction) and cuts the overall bootstrap wall time by about 15%, with no change in the generated configuration: the produced config headers are identical to a non-cached build and the stage 2 / stage 3 comparison still succeeds."

The work was bootstrapped and tested on aarch64-none-linux-gnu and x86_64-linux, and cross, Canadian cross, and --disable-bootstrap builds should be unaffected. That last point is the important guardrail. In a cross build, configure runs on one machine while probing properties of another, so answers cannot be assumed to carry over from one step to the next, and the caching deliberately stays out of the way there.

Why this matters if you build toolchains

For a homelab that compiles its own toolchains, this is a real wall-clock saving on every rebuild. If a bootstrap takes 30 minutes, a 15% cut hands you back about four and a half minutes per run. Multiply that across CI pipelines, distribution build farms, and anyone bisecting a compiler regression where you bootstrap repeatedly, and the aggregate time saved is substantial.
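The arithmetic behind these figures is easy to check. The sketch below uses the rounded numbers quoted in this article (configure at around 30% of wall time, a 43% cut in configure time, a 15% overall cut); the function names are ours, and the 30-minute bootstrap is the hypothetical from the paragraph above.

```python
def minutes_saved(bootstrap_minutes, overall_reduction):
    """Wall-clock minutes recovered per bootstrap run."""
    return bootstrap_minutes * overall_reduction


def overall_reduction(configure_share, configure_reduction):
    """Estimated cut in total wall time when only configure gets faster."""
    return configure_share * configure_reduction


print(f"saved per run: {minutes_saved(30, 0.15):.1f} minutes")
print(f"estimated overall cut: {overall_reduction(0.30, 0.43):.1%}")
```

The estimate comes out at roughly 13%, in the same neighborhood as the reported 15%. The inputs are rounded ("around 30%", "about 43%"), so an exact match should not be expected.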

The pattern here is broader than GCC. Serial configuration phases are a recurring bottleneck across large C and C++ projects that still lean on Autoconf. The compute is embarrassingly idle while a shell script walks through feature tests one at a time. As core counts keep climbing on server parts, the relative cost of these serial sections only grows. A 64-core or 128-core machine makes the parallel compile phase fly, which makes the unparallelized configure phase a larger fraction of the total.
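A small model shows how the serial share grows with core count. All the inputs are invented for illustration: a 300-second serial configure phase and 36,000 core-seconds of perfectly parallel compile work, neither of which comes from the patch discussion.

```python
def serial_share(serial_seconds, parallel_core_seconds, cores):
    """Fraction of wall time spent in the serial phase, assuming the
    remaining work divides evenly across all cores."""
    parallel_seconds = parallel_core_seconds / cores
    return serial_seconds / (serial_seconds + parallel_seconds)


# Hypothetical build: 300 s of serial configure, 36,000 core-seconds of compile.
for cores in (8, 64, 128):
    share = serial_share(300, 36_000, cores)
    print(f"{cores:>3} cores: configure is {share:.0%} of wall time")
```

With these made-up inputs, configure is about 6% of wall time on 8 cores, about 35% on 64, and about 52% on 128, even though the configure phase itself never got any slower.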

The pushback: hack versus cleanup

Not everyone on the list is sold on the approach. One responder called the configure-result caching more of a "hack" and argued the better long-term path is to clean up the configure scripts themselves. The suggestions included dropping configure checks now considered useless, potentially removing support for the GNU gold linker, and pruning other remnants that have accumulated over decades.

The cleanup argument has merit on one axis the caching patch does not touch: cross-compiling. Caching only helps native bootstraps because cross builds genuinely need different answers. Trimming dead configure checks would speed up every build configuration, native and cross alike, because you simply stop running probes nobody needs anymore.

These two approaches are not mutually exclusive. Caching delivers a measurable win now with verified-identical output, while the deeper script cleanup is a larger, slower effort, the kind of work that takes many patches across many releases, but one that pays off everywhere.

The patch is out for testing now, so the numbers above are early figures rather than merged-and-final. For anyone who measures their builds, it is worth pulling the series and timing a make bootstrap on your own hardware. Configure overhead scales with your specific check set and storage latency, so your reduction may land above or below the 15% headline depending on how I/O-bound your box is during those serial probes.
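If you want to time a build yourself, a small wrapper is enough. This is a sketch under stated assumptions: the helper names are ours, and any build directory you pass in is expected to be an already-configured GCC object directory on your machine.

```python
import subprocess
import time


def time_command(argv, cwd=None):
    """Run a command to completion and return its wall time in seconds.
    Raises CalledProcessError if the command exits with a failure."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        cwd=cwd,
        check=True,
        stdout=subprocess.DEVNULL,  # build logs are discarded here
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def percent_reduction(baseline_seconds, patched_seconds):
    """Wall-time reduction of the patched run relative to the baseline."""
    return 100.0 * (baseline_seconds - patched_seconds) / baseline_seconds
```

A comparison would call time_command(["make", "bootstrap"], cwd=objdir) once in an unpatched tree and once in a patched one, each starting from a clean object directory, and feed the two results to percent_reduction. A single run each is a rough measurement, so repeat it if the difference looks close to the noise.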