Surprisingly Slow: Unveiling Hidden Performance Bottlenecks in Modern Software

In the age of multi-gigabyte-per-second NVMe drives and 10 Gbps networks, one might expect software to hum along at peak efficiency. Yet, developers frequently encounter inexplicable slowness that defies hardware capabilities. Drawing from years of optimization experience across projects like Firefox, Mercurial, and rustup, this exploration highlights common yet surprising performance traps. These issues, often rooted in legacy practices or environmental quirks, carry profound implications for build times, application responsiveness, and system scalability. By addressing them, engineers can unlock substantial gains without overhauling architectures.

The Configuration Phase: Setup That Outpaces the Build

Build systems rely on an initial configuration step to detect environments—probing compilers, versions, and capabilities via tools like Autoconf's configure scripts or CMake. This adaptation to diverse machines is crucial but serially executed, frequently dominating total build time.

On small projects, configuration can exceed 10 seconds while the actual compilation finishes in a fraction of that time, especially on high-core-count machines like a 16-core Ryzen 9 5950X. For behemoths like LLVM/Clang on a 96-vCPU AWS instance, configuration once surpassed compiling and linking combined. Because the configure step runs serially, it blocks the parallel build behind it, turning a necessary prelude into a major hurdle.

Parallelizing configuration or embedding it in the build DAG enables incremental execution. Reproducible environments, as in Bazel, eliminate probing altogether, likely contributing significantly to its speed advantages. These approaches reveal how much latent efficiency hides in traditional tooling, urging a shift toward integrated, deterministic builds.
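As an illustration, independent environment probes can run concurrently rather than one after another. The sketch below is a minimal Python approximation; the probe commands are hypothetical stand-ins, not real Autoconf checks:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical environment probes, loosely analogous to Autoconf checks.
# Each probe runs an external tool and records what it reports.
PROBES = {
    "python_version": [sys.executable, "--version"],
    "machine": [sys.executable, "-c",
                "import platform; print(platform.machine())"],
}

def run_probe(argv):
    """Run one probe subprocess and capture its output."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return (result.stdout or result.stderr).strip()

def configure_parallel(probes):
    """Execute independent probes concurrently instead of serially."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_probe, argv)
                   for name, argv in probes.items()}
        return {name: fut.result() for name, fut in futures.items()}

if __name__ == "__main__":
    for name, value in configure_parallel(PROBES).items():
        print(f"{name} = {value}")
```

Real probes have dependencies between them, which is why modeling them as nodes in the build DAG, rather than a flat thread pool, is the more general fix.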

OS-Specific Process Creation Overhead

Process spawning efficiency varies starkly between platforms, influencing architecture choices. Windows incurs 10-30 ms per new process, versus sub-millisecond fork() + exec() on Linux. Threads start quickly everywhere (tens of microseconds), making them preferable for fine-grained concurrency.

UNIX-derived tools exacerbate this: configure scripts, often shell-based, spawn thousands of subprocesses (grep, sed, pipes), adding seconds of overhead on Windows. A script that spawns 1,000 processes at 10 ms apiece spends 10 seconds on process creation alone.
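The gap is easy to measure. This Python sketch times no-op subprocess launches against no-op thread launches; absolute numbers vary widely by machine and OS, but the ordering does not:

```python
import subprocess
import sys
import threading
import time

def avg_process_spawn(n=10):
    """Average seconds to spawn a subprocess that does nothing."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / n

def avg_thread_spawn(n=100):
    """Average seconds to start and join a no-op thread."""
    start = time.perf_counter()
    for _ in range(n):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / n

if __name__ == "__main__":
    # Expect threads to win by orders of magnitude, most dramatically
    # on Windows where process creation is slowest.
    print(f"process: {avg_process_spawn() * 1000:.2f} ms")
    print(f"thread:  {avg_thread_spawn() * 1000:.3f} ms")
```

Note that the subprocess number here also includes Python interpreter startup, which inflates it beyond raw OS process-creation cost; the later section on interpreter launches examines that component separately.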

Profiling Mercurial's Windows checkout exposed this as the slowdown culprit, not filesystems. Developers targeting Windows should pivot to threading or persistent processes, reducing spawn frequency. This OS disparity underscores the need for platform-aware design in cross-compatible software.

Antivirus Scans Throttling File Closes

Windows file I/O looks performant, with opens and writes completing in microseconds to milliseconds, until CloseHandle() drags at 1-10 ms. Windows Defender's filesystem filter drivers trigger synchronous scans when handles close, even for asynchronous I/O, inflating latency.

The latency persists across workloads, contrary to what the documentation implies. Mercurial and rustup mitigate it with thread pools: issuing CloseHandle() calls from background threads yields roughly 3x speedups for file-intensive tasks. Rustup's tar extraction on Windows now outpaces the equivalent operation on Linux.
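A minimal sketch of the thread-pool technique in Python, using a hypothetical BackgroundCloser helper (the real implementations in Mercurial and rustup differ in detail):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

class BackgroundCloser:
    """Close file handles on worker threads so the writing thread
    never blocks on slow close() latency (e.g. antivirus scans)."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures = []

    def close(self, fh):
        # Hand the handle to a worker; the caller moves on immediately.
        self._futures.append(self._pool.submit(fh.close))

    def wait(self):
        # Block until every queued close has completed.
        for fut in self._futures:
            fut.result()
        self._futures.clear()
        self._pool.shutdown()

# Usage: write many small files, deferring every close.
closer = BackgroundCloser()
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(100):
    path = os.path.join(tmpdir, f"file{i}.bin")
    fh = open(path, "wb")
    fh.write(b"payload")
    closer.close(fh)  # returns immediately; close happens in background
    paths.append(path)
closer.wait()  # ensure all handles are closed before relying on the files
```

On Linux the background close buys little, since close() is already cheap; the win comes on Windows, where each close may stall on a Defender scan.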

Installers, VCS, and archivers handling hundreds of files must adopt this. It illustrates how security layers, essential for protection, demand creative bypasses to preserve performance, particularly in I/O-bound applications.

Terminal I/O: When Output Becomes a Bottleneck

Verbose builds or CLIs falter when terminals choke on output. Stdout/stderr writes block if emulators lag, especially with styling (colors, cursors). Firefox builds once lost minutes to Windows Command Prompt and macOS Terminal.app limitations; quiet modes reclaimed them.

npm's progress spinner became a well-known example: updating the terminal too frequently measurably slowed installs. Modern emulators handle plain text well but still struggle with heavy styling.

Mitigations include buffering output, writing from a background thread, throttling updates (at most 10 Hz for spinners), and benchmarking against the null device. Measuring terminal I/O latency quantifies the impact, ensuring UX enhancements don't erode core speed in CI and devops pipelines.
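A throttled progress printer might look like the following Python sketch; ThrottledProgress is a hypothetical helper, capped at 10 updates per second by default:

```python
import sys
import time

class ThrottledProgress:
    """Cap terminal updates at a fixed rate (default 10 Hz) so
    rendering never outpaces the terminal emulator."""

    def __init__(self, hz=10.0, stream=sys.stderr):
        self._interval = 1.0 / hz
        self._last = 0.0
        self._stream = stream
        self.writes = 0  # actual terminal writes issued

    def update(self, message, force=False):
        # Drop updates arriving faster than the configured rate;
        # force=True guarantees the final state is always shown.
        now = time.monotonic()
        if force or now - self._last >= self._interval:
            self._stream.write("\r" + message)
            self._stream.flush()
            self._last = now
            self.writes += 1

# Usage: 1,000 rapid-fire updates collapse into a handful of writes.
progress = ThrottledProgress()
for i in range(1000):
    progress.update(f"processed {i} items")
progress.update("done", force=True)
```

The same rate-limiting idea applies to log streaming in CI: the consumer's refresh rate, not the producer's event rate, should drive terminal writes.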

Dynamic CPU Behaviors and Power Throttling

Processors fluidly adjust via ACPI C/P-States, influenced by load, thermals, and power. Laptops throttle aggressively—battery mode or a jammed fan can halve speeds, as seen in a MacBook Pro's unexplained Firefox build slowdowns.

Desktops vary too: Firefox linking slowed 2-4x under conservative C-state settings that pinned cores at 1-1.5 GHz until sustained load arrived. Aggressive power profiles keep cores ready but increase power draw, adding to a data-center carbon footprint already comparable in scale to the airline industry's.

Cloud instances allow tuning power states without cost hikes, but the environmental trade-offs remain. For CPU-heavy tasks, desktops or monitored servers outperform variable laptops. Benchmarks must report power states; otherwise, results mislead, complicating reproducible dev environments.

Interpreter Launches: Milliseconds That Multiply

Builds invoking Python or Node.js thousands of times pay 1-30 ms of startup per process for interpreter initialization and module imports before any real code runs. Mercurial's test harness spent 10-18% of its CPU time on interpreter startup and 30-38% on command dispatch overhead.

Windows amplifies the cost via slow process spawning. The JVM adds its own startup and JIT warm-up overhead before reaching peak speed; ahead-of-time-compiled languages shine here. Favor fewer, longer-lived processes or low-overhead runtimes; this is critical for scalable CI and test harnesses.
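Batching work into one interpreter launch instead of many is often the simplest fix. A Python sketch, using trivial print statements as stand-ins for real tasks:

```python
import subprocess
import sys
import time

TASKS = 20

def run_as_separate_processes():
    """Launch one interpreter per task: pays startup cost TASKS times."""
    start = time.perf_counter()
    for i in range(TASKS):
        subprocess.run([sys.executable, "-c", f"print({i} * {i})"],
                       check=True, capture_output=True)
    return time.perf_counter() - start

def run_in_one_process():
    """Batch all tasks into a single interpreter launch."""
    code = "\n".join(f"print({i} * {i})" for i in range(TASKS))
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code],
                   check=True, capture_output=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"{TASKS} processes: {run_as_separate_processes():.2f}s")
    print(f"1 process:     {run_in_one_process():.2f}s")
```

The same reasoning favors persistent worker processes (test runners, language servers, build daemons) over per-invocation interpreter launches.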

Storage Speed Untapped: Fsync's Global Flush

NVMe drives deliver over 3 GB/s reads and ~500k IOPS at ~10 μs latency, approaching DRAM-class throughput. Yet OS APIs and ingrained practices squander it.

On Linux/ext4, fsync() can flush all dirty pages via the journal, not just the target file: an fsync() after a 1-byte write can stall for seconds if another process just wrote a gigabyte. GitHub Enterprise saw MySQL time out while waiting behind Git packfile writes. Isolating databases onto separate volumes helps, and ext4's fast-commit feature enables finer granularity.

For ephemeral data (Kubernetes pods, CI runs), disable fsync() via application configs, eatmydata, or kernel-level workarounds. Firefox's automation cut roughly 50 GB of I/O by skipping fsync() during test runs. When durability isn't paramount, relax guarantees in exchange for I/O velocity.
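A sketch of making durability opt-in per file, assuming Python and a hypothetical write_file helper:

```python
import os
import tempfile

def write_file(path, data, durable):
    """Write data to path; when durable, force it to stable storage.

    Skipping the fsync() trades durability for throughput, which is
    acceptable for ephemeral data such as CI scratch files that are
    worthless after a crash anyway."""
    with open(path, "wb") as f:
        f.write(data)
        if durable:
            f.flush()
            os.fsync(f.fileno())

tmpdir = tempfile.mkdtemp()
for i in range(50):
    # Ephemeral scratch files: no fsync, let the page cache absorb them.
    write_file(os.path.join(tmpdir, f"scratch{i}"), b"x" * 4096,
               durable=False)
# A file whose loss would actually matter: pay for the flush.
write_file(os.path.join(tmpdir, "important"), b"keep me", durable=True)
```

Note that on ext4 even the durable write can stall behind unrelated dirty data, which is exactly the global-flush behavior described above; per-file granularity requires filesystem support such as fast commits.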

Compression: Outdated Trade-offs in a Fast World

Compression historically traded CPU time for reduced I/O; with NVMe drives and gigabit networks, the CPU now bottlenecks first. Zlib/DEFLATE, designed under 1990s constraints, often cannot compress fast enough to keep up with a 1 Gbps link, netting a slowdown.

Zstandard beats it on both axes: faster and with better ratios. Git, Docker, and tar.gz workflows could all accelerate; petabyte-scale data lakes could reclaim terabytes. Measure your line speed: if uncompressed I/O isn't saturating the link, compression is hurting you.
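The "measure line speeds" advice reduces to simple arithmetic: compressing wins only when the effective send rate beats raw transfer. A Python sketch with a hypothetical should_compress helper and illustrative throughput numbers (Zstandard itself is a third-party package, so the measurement below uses stdlib zlib):

```python
import zlib

def should_compress(compress_mb_s, ratio, link_mb_s):
    """Decide whether compressing before sending is a net win.

    ratio = compressed_size / original_size (smaller is better).
    The effective rate when compressing is bounded by both how fast
    the CPU can compress and how fast the link drains the smaller
    compressed stream."""
    effective_mb_s = min(compress_mb_s, link_mb_s / ratio)
    return effective_mb_s > link_mb_s

# Illustrative numbers: a zstd-like codec (400 MB/s, 2:1) easily wins
# on a 1 Gbps (~125 MB/s) link...
assert should_compress(compress_mb_s=400, ratio=0.5, link_mb_s=125)
# ...while a zlib-like codec (60 MB/s) loses on a 10 Gbps link.
assert not should_compress(compress_mb_s=60, ratio=0.5, link_mb_s=1250)

# Measuring your codec's actual ratio on representative data:
data = b"timestamp=1617721200 level=INFO msg=request served\n" * 100_000
ratio = len(zlib.compress(data, 6)) / len(data)
```

Plugging measured compression throughput and ratio into a rule like this, per codec and per link, replaces the stale 1990s intuition that compression is always worth it.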

Zstandard's tunable levels (compress fast initially, recompress at higher levels later) suit everything from real-time streams to archival storage. Suboptimal compression settings likely drag down Big Data stacks like Hadoop, where awareness of the trade-off lags.

ISA Conservatism: 20 Years of Instructions Ignored

Linux distributions typically build binaries for the original 2003 x86-64 baseline, skipping SSE4, AVX, and AVX2. Compilers default to the same conservative target, forgoing newer vectorization opportunities.

Diff tools illustrate the cost: line splitting and hashing in git diff run scalar byte-at-a-time loops that can dominate runtime over the Myers diff algorithm itself. Assembly-backed, vectorized memchr() and memcmp() routines deliver substantial gains.
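The effect is visible even from Python: compare a byte-at-a-time scan against bytes.find(), which CPython backs with an optimized memchr():

```python
import time

def naive_find(data, needle):
    """Byte-at-a-time scan: roughly what scalar C code compiled for
    the 2003 x86-64 baseline amounts to."""
    for i, b in enumerate(data):
        if b == needle:
            return i
    return -1

# A multi-megabyte buffer with the needle at the very end, mimicking
# line splitting that must scan far to find the next newline.
data = b"x" * 5_000_000 + b"\n"

start = time.perf_counter()
slow_index = naive_find(data, ord("\n"))
naive_time = time.perf_counter() - start

start = time.perf_counter()
fast_index = data.find(b"\n")  # delegates to an optimized memchr()
builtin_time = time.perf_counter() - start
```

Both scans return the same index, but the vectorized search typically finishes orders of magnitude sooner, which is the same gap a recompile to a newer ISA baseline aims to close in native code.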

GCC and Clang's x86-64-v3 target (roughly Haswell and newer) unlocks these instructions; RHEL 9 raised its baseline to x86-64-v2. Recompiling server fleets could yield real efficiency gains, though AVX-heavy code can trigger frequency downclocking. Cloud CPUs often expose even newer extensions like AVX-512, justifying custom-built distributions.

Toward Measured, Adaptive Optimization

From configs to ISAs, these slowdowns stem from unexamined defaults and shifting hardware dynamics. They ripple through dev workflows, inflating CI costs and frustrating users.

Profiling across platforms reveals them; fixes demand balancing complexity against gains. As infrastructure evolves, proactive auditing—questioning fsync, compression, power—ensures software scales with silicon's promise, fostering resilient, efficient ecosystems.

This article is adapted from Gregory Szorc's 2021 blog post "Surprisingly Slow," informed by his work on Firefox, Mercurial, and related tools. Original source: https://gregoryszorc.com/blog/2021/04/06/surprisingly-slow/