An analysis of how Michael Abrash's hand-crafted assembly optimizations nearly doubled Quake's frame rate, examining the most critical functions and techniques used.
Fabien Sanglard's exploration of the Quake source code shows that Michael Abrash's assembly optimizations account for nearly half of the game's speed, validating John Carmack's claim that the game would lose nearly half its speed without these hand-crafted routines.
The investigation began by establishing a baseline: the original winquake.exe on a Pentium MMX 233MHz reached 42.3fps in timedemo demo1. A rebuild with the assembly optimizations enabled ran at an essentially identical 42.2fps. With the assembly code disabled by setting id386 to 0, however, the framerate plummeted to 22.7fps, about 54% of the original speed.
This dramatic performance difference stemmed from 63 assembly functions spread across 21 files. After filtering out functions for DOS-specific operations, experimental features, and duplicates, 32 core optimization functions remained, categorized into math, sound, render, and draw operations.
To quantify the impact of each function, Sanglard modified the engine to enable one assembly function at a time. The results showed that D_DrawSpans8 contributed 12.6fps, the R_DrawSurfaceBlock8_mip* functions added 4.2fps, and the D_Polyset* functions contributed 2.2fps. The remaining functions combined for only 0.8fps. Taken together, these gains total 19.8fps, which closely matches the 19.5fps gap between the 22.7fps C-only build and the 42.2fps assembly build.
Several key optimization techniques emerged from examining the most critical functions:
TransformVector demonstrated sophisticated Pentium FPU pipeline optimization. While the VC6 compiler generated code that performed operations sequentially (causing stalls while waiting for fmul results), Abrash's version interleaved the three dot products. By using the fxch instruction, which is effectively free on the Pentium when paired with another FPU instruction, to swap FPU stack registers, he kept three partial sums in flight at once, hiding fmul latency completely. The stores were also moved to the end to avoid pipeline stalls.
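The scheduling idea can be sketched in C, although C source order only suggests a schedule and a modern compiler is free to reorder it. This is a minimal illustration rather than Quake's code: the function names and the explicit right, up, and forward basis parameters are assumptions made for the example.

```c
/* Sequential form: each output is a complete dot product, so every
   addition has to wait for the multiply issued just before it. */
void transform_sequential(const float in[3], const float right[3],
                          const float up[3], const float forward[3],
                          float out[3])
{
    out[0] = in[0] * right[0]   + in[1] * right[1]   + in[2] * right[2];
    out[1] = in[0] * up[0]      + in[1] * up[1]      + in[2] * up[2];
    out[2] = in[0] * forward[0] + in[1] * forward[1] + in[2] * forward[2];
}

/* Interleaved form: three independent partial sums advance together,
   so a multiply for one sum can be in flight while another sum's
   result is being added. */
void transform_interleaved(const float in[3], const float right[3],
                           const float up[3], const float forward[3],
                           float out[3])
{
    float x = in[0] * right[0];
    float y = in[0] * up[0];
    float z = in[0] * forward[0];

    x += in[1] * right[1];
    y += in[1] * up[1];
    z += in[1] * forward[1];

    x += in[2] * right[2];
    y += in[2] * up[2];
    z += in[2] * forward[2];

    /* Stores are grouped at the end, after all arithmetic is done. */
    out[0] = x;
    out[1] = y;
    out[2] = z;
}
```

Both functions perform the same additions in the same order per component, so they should produce the same results; only the order in which independent operations are issued differs.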
The R_DrawSurfaceBlock8_mipX functions used self-modifying code to embed colormap base addresses directly into the instruction stream, freeing a register and eliminating additional ADD operations. The inner loop was fully unrolled to eliminate the loop counter and avoid the branch misprediction on the final iteration.
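Self-modifying code has no portable C equivalent, but the unrolling half of the technique can be shown. The sketch below assumes a lighting lookup of the form colormap[(light & 0xFF00) + texel]; that indexing scheme, the four-pixel row width, and all function names are assumptions chosen for illustration, not a transcription of the engine's routines.

```c
#include <stdint.h>

/* Hypothetical row width, chosen only to keep the example short. */
enum { ROW_WIDTH = 4 };

/* Loop form: one counter update and one conditional branch per pixel. */
void light_row_loop(const uint8_t *colormap, const uint8_t *texels,
                    uint8_t *dest, uint32_t light, uint32_t light_step)
{
    for (int i = 0; i < ROW_WIDTH; i++) {
        /* The high byte of light selects a colormap row,
           the texel selects the column within that row. */
        dest[i] = colormap[(light & 0xFF00u) + texels[i]];
        light += light_step;
    }
}

/* Unrolled form: same work, with no counter and no loop branch. */
void light_row_unrolled(const uint8_t *colormap, const uint8_t *texels,
                        uint8_t *dest, uint32_t light, uint32_t light_step)
{
    dest[0] = colormap[(light & 0xFF00u) + texels[0]];
    light += light_step;
    dest[1] = colormap[(light & 0xFF00u) + texels[1]];
    light += light_step;
    dest[2] = colormap[(light & 0xFF00u) + texels[2]];
    light += light_step;
    dest[3] = colormap[(light & 0xFF00u) + texels[3]];
}
```

In the assembly version described above, the colormap base address would additionally be patched into each load instruction as an immediate, which is the part this sketch cannot express.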
D_DrawSpans8 represented the most sophisticated optimization, achieving what Abrash called "overlap" between FPU and integer pipelines. The function issued an FDIV for the next 8-pixel span at the beginning of the current span, allowing the integer pipelines to draw pixels while the FPU performed the 30+ cycle division. A jump table eliminated branch mispredictions for spans with fewer than 8 pixels. The clamp operation was optimized using unsigned comparisons to test both upper and lower bounds simultaneously.
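Two of these tricks translate directly to C. The sketch below shows a range clamp that uses a single unsigned comparison, and a partial-span routine written as a fall-through switch, which compilers commonly lower to a jump table. The function names are hypothetical, the clamp assumes a lower bound of 0 and a non-negative upper bound, and the plain byte copy stands in for the real texture-mapped pixel writes.

```c
#include <stdint.h>

/* Clamp v to [0, max], assuming max >= 0. Reinterpreted as unsigned,
   a negative v becomes a very large value, so a single comparison
   detects both v < 0 and v > max. The slower fix-up path only runs
   when the value is actually out of range. */
int32_t clamp_coord(int32_t v, int32_t max)
{
    if ((uint32_t)v > (uint32_t)max)
        v = (v < 0) ? 0 : max;
    return v;
}

/* Write the last `count` pixels (1 to 8) of a span. Entering the
   switch at the right case replaces a counted loop with one indirect
   jump followed by straight-line code. Other counts write nothing. */
void draw_span_tail(uint8_t *dest, const uint8_t *src, int count)
{
    switch (count) {
    case 8: dest[7] = src[7]; /* fall through */
    case 7: dest[6] = src[6]; /* fall through */
    case 6: dest[5] = src[5]; /* fall through */
    case 5: dest[4] = src[4]; /* fall through */
    case 4: dest[3] = src[3]; /* fall through */
    case 3: dest[2] = src[2]; /* fall through */
    case 2: dest[1] = src[1]; /* fall through */
    case 1: dest[0] = src[0];
        break;
    default:
        break;
    }
}
```

The FDIV overlap itself is a scheduling property of the Pentium and is not captured here; in C it would only appear as starting the next span's division before the current span's pixel writes.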
These optimizations reflected a deep understanding of Pentium architecture, particularly the Pentium FPU pipeline's ability to hide latency through instruction overlap. The TransformVector function alone demonstrated how careful instruction ordering could eliminate pipeline stalls entirely, while D_DrawSpans8 showed how to keep both integer and floating-point pipelines busy simultaneously.
The investigation confirms that Abrash's assembly optimizations were indeed responsible for approximately half of Quake's performance, with the most significant gains coming from low-level drawing routines that were executed millions of times per second. This level of optimization required intimate knowledge of processor architecture and careful consideration of pipeline behavior - techniques that remain relevant for performance-critical code today.