Intermediate Floating-Point Precision

How your compiler silently decides the precision of floating-point calculations, and why the same code can produce different results depending on architecture, compiler version, and runtime flags.

The assumption that a float variable undergoes single-precision arithmetic is, in many contexts, wrong. A function declared to multiply two floats may in fact perform the operation at double precision, or even extended precision, depending on a constellation of factors that the developer rarely controls directly. This is not a bug in the compiler. It is a design decision with deep roots in the history of x86 computing, and it has consequences for both correctness and performance that are easy to underestimate.

The x87 floating-point unit, which served as the default floating-point hardware on 32-bit Windows for years, operates with eight registers that are each 80 bits wide. The common explanation is that this means all calculations happen at 80-bit precision, but the reality is more nuanced. The x87 has a configurable precision setting that can be set to 24-bit (single), 53-bit (double), or 64-bit (extended). The Visual C++ runtime initializes the FPU to double precision during thread startup, meaning that every floating-point operation is rounded to 53 bits of mantissa before being stored back in the register. The exponent remains at full width, so the range is not constrained, but the precision of intermediate results is locked to double unless the developer explicitly changes it.
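The effect of the precision setting can be imitated in portable code. The sketch below is an illustration only: `round_to_significand_bits` is a hypothetical helper, not part of any runtime, and it does not touch the FPU. It rounds a double to a chosen number of significand bits while leaving the exponent range alone, which mirrors the "narrow mantissa, full-width exponent" behaviour described above. Because it works on a double, it can model the 24-bit and 53-bit settings but not the 64-bit one.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: round v to `bits` significand bits using the current
// rounding mode (round-to-nearest-even by default). The exponent range of
// double is left untouched, just as the x87 setting leaves its exponent alone.
double round_to_significand_bits(double v, int bits) {
    assert(bits >= 1 && bits <= 53);
    if (v == 0.0 || !std::isfinite(v)) {
        return v;  // zeros, infinities and NaNs pass through unchanged
    }
    int exponent = 0;
    // v == fraction * 2^exponent, with |fraction| in [0.5, 1)
    double fraction = std::frexp(v, &exponent);
    // Scale so that exactly `bits` significand bits sit left of the binary point.
    double scaled = std::ldexp(fraction, bits);
    // Round away the remaining fractional bits, then undo the scaling.
    return std::ldexp(std::nearbyint(scaled), exponent - bits);
}
```

For values inside the normal range of float, asking for 24 bits should give the same value as a round trip through a float cast, and asking for 53 bits should return the input unchanged.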

This default can be overridden. A call to _controlfp_s can raise the precision to full 64-bit or lower it to 24-bit. And here is where things get interesting: Direct3D 9, for reasons of its own, sets the x87 FPU to 24-bit precision unless the application requests D3DCREATE_FPU_PRESERVE. This means that the same floating-point code, running on the same machine, can produce different results on different threads within the same process. The precision is a per-thread runtime state. It is invisible in the source code, invisible in the generated assembly, and invisible to anyone who is not actively looking for it.
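A minimal sketch of changing that per-thread state follows. The wrapper name `set_x87_precision` is hypothetical. The call it wraps, `_controlfp_s` with the `_PC_24`, `_PC_53`, and `_PC_64` values and the `_MCW_PC` mask, is the Visual C++ runtime API named above. The sketch assumes that precision control is only meaningful on 32-bit x86 builds with Visual C++, so on every other target the wrapper does nothing and reports failure.

```cpp
#include <float.h>  // on Visual C++: _controlfp_s, _PC_24, _PC_53, _PC_64, _MCW_PC

// Hypothetical wrapper: ask the x87 unit to round results to 24, 53, or 64
// significand bits on the calling thread. Returns true only if the request
// was applied.
bool set_x87_precision(int bits) {
#if defined(_MSC_VER) && defined(_M_IX86)
    unsigned int requested = 0;
    switch (bits) {
        case 24: requested = _PC_24; break;
        case 53: requested = _PC_53; break;
        case 64: requested = _PC_64; break;
        default: return false;  // not one of the three x87 settings
    }
    unsigned int current = 0;
    // _controlfp_s returns zero on success.
    return _controlfp_s(&current, requested, _MCW_PC) == 0;
#else
    (void)bits;
    return false;  // assumption: no x87 precision control through this API here
#endif
}
```

Because the setting belongs to the thread, code that changes it would normally record the previous value and restore it before returning to callers that expect the default.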

The SSE instruction set family changed this landscape, but not entirely in the direction one might expect. SSE instructions encode their precision directly in the instruction mnemonic: mulss for single-precision multiply, mulsd for double-precision. There is no global precision flag. This should make the behavior deterministic and predictable. But the Visual C++ compiler, when generating SSE code for 32-bit builds, chose to widen float calculations to double precision anyway. A simple float multiply compiles to a sequence that converts both inputs from float to double, performs the multiply at double precision, and then converts the result back to float. The reason was historical consistency: 32-bit x87 code had been running at double precision for years, and the team wanted SSE results to match.

The performance cost of this widening is not trivial. The conversion instructions are not free, and they lengthen the dependency chain significantly. In benchmarked cases, the float multiply sequence took 35% to 78% longer than the equivalent double multiply, an irony that is hard to miss. The extra instructions are also provably unnecessary for single-operation calculations. IEEE 754 guarantees that basic operations produce correctly rounded results, so widening a single multiply to double precision before rounding back to float yields an identical result. The compiler should optimize these away under the as-if rule, but prior to Visual Studio 2012, it often did not.
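The claim that widening a single multiply changes nothing can be checked empirically. The sketch below spells out the widened sequence with explicit casts and compares it against a plain float multiply over a grid of sample inputs. The function names are hypothetical, and the sample grid is arbitrary. It says nothing about speed, only about results.

```cpp
// Source-precision multiply: a single float operation.
float multiply_source_precision(float a, float b) {
    return a * b;
}

// The widened sequence: float -> double, multiply at double, double -> float.
float multiply_widened(float a, float b) {
    return static_cast<float>(static_cast<double>(a) * static_cast<double>(b));
}

// Count how many sample pairs give different results for the two versions.
int count_multiply_mismatches(int steps) {
    int mismatches = 0;
    for (int i = 1; i <= steps; ++i) {
        for (int j = 1; j <= steps; ++j) {
            float a = 1.0f / static_cast<float>(i);
            float b = 0.1f * static_cast<float>(j);
            if (multiply_source_precision(a, b) != multiply_widened(a, b)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

The mismatch count should be zero. The product of two 24-bit significands fits exactly in a 53-bit double, so the widened version rounds the exact product to float once, which is what a correctly rounded float multiply does as well.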

The 64-bit ABI brought a cleaner story. The x64 calling convention passes parameters in registers rather than on the stack, eliminating several setup instructions. More importantly, the VC++ team adopted source-precision intermediates for x64 builds. A float multiply now compiles to exactly three instructions: load, multiply, store. No conversions, no widening. The instruction count drops from eight or twelve to three, and the code runs at the precision the source types imply. This is the behavior that most developers would expect, and it is the behavior that modern compilers have converged on.

Visual Studio 2012 extended this source-precision policy to 32-bit SSE builds as well, marking a significant shift. The old behavior of widening float intermediates to double was abandoned in favor of respecting the declared types. For developers who need higher precision, the escape hatch is explicit: cast to double at key points, or store temporaries in double variables. This gives the developer control without imposing a hidden cost on code that does not need it.
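As a sketch of that escape hatch, the two hypothetical functions below sum the same float array. One keeps the running total in a float, the other stores the temporary in a double and rounds to float once at the end. The names and the choice of summation as the example are illustrative assumptions, not taken from any particular codebase.

```cpp
#include <cstddef>

// Accumulate at source precision: every add is rounded to float.
float sum_in_float(const float* values, std::size_t count) {
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

// Accumulate in an explicit double temporary: each float is widened before
// the add, and the result is rounded to float only once, on return.
float sum_in_double(const float* values, std::size_t count) {
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        total += static_cast<double>(values[i]);
    }
    return static_cast<float>(total);
}
```

On long arrays the double accumulator should track the exact sum at least as closely as the float one, and the higher precision is now visible in the source rather than implied by the build configuration.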

The C99 standard formalized some of this with FLT_EVAL_METHOD, a macro that indicates what precision the compiler uses for intermediate floating-point evaluations. A value of 0 means the types dictate precision. A value of 1 means double (unless the type is already higher). A value of 2 means long double. GCC uses 2 for 32-bit code because the x87 registers are naturally 80-bit, and 0 for 64-bit code because SSE supports source precision directly. VC++ effectively used 1 for 32-bit builds and moved toward 0 with VS 2012. The standard does not mandate a particular value, leaving it to implementation, which means that floating-point portability across compilers requires awareness of this macro and what it implies.
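A program can inspect the macro directly. The following sketch assumes a C++11 or later toolchain, where `FLT_EVAL_METHOD` is available from `<cfloat>` and `std::float_t` from `<cmath>`. The function names are hypothetical. Values other than 0, 1, and 2 are lumped together as "indeterminate or implementation-defined".

```cpp
#include <cfloat>
#include <cmath>

// Describe the intermediate precision the compiler reports for this build.
const char* describe_flt_eval_method() {
#ifdef FLT_EVAL_METHOD
    switch (FLT_EVAL_METHOD) {
        case 0:  return "operations are evaluated in the precision of their types";
        case 1:  return "float and double operations are evaluated as double";
        case 2:  return "all operations are evaluated as long double";
        default: return "indeterminate or implementation-defined";
    }
#else
    return "FLT_EVAL_METHOD is not defined by this toolchain";
#endif
}

// True when the build reports source-precision evaluation.
bool float_is_evaluated_at_source_precision() {
#ifdef FLT_EVAL_METHOD
    return FLT_EVAL_METHOD == 0;
#else
    return false;
#endif
}
```

`std::float_t` names the type the implementation uses to evaluate float expressions, so when the macro is 0 it should have the same size as float.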

The practical consequences extend beyond performance. Consider a simple equality check: the result of a float division is stored in a float variable, and the same division is later recomputed and compared against the stored value. If the recomputed expression is evaluated at float precision, it matches the stored value and the equality test succeeds. If it is evaluated at double precision and compared before being rounded back to float, the intermediate result is more accurate than the stored one, and the equality test can fail. Both results are correct under the IEEE standard. The standard explicitly declines to specify intermediate precision for expressions, deferring to language rules and compiler implementations. This means that code that is correct under one compiler or one build configuration may fail under another, not because of a bug, but because of a legitimate difference in intermediate precision.
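The two outcomes can be reproduced on any compiler by making the intermediate precision explicit. In the sketch below, the function names are hypothetical, the casts stand in for what a widening compiler would do implicitly, and `volatile` is used only to force the stored values to be rounded to float.

```cpp
// Both sides of the comparison are rounded to float before comparing.
bool equal_when_both_sides_are_float(float x, float y) {
    volatile float stored = x / y;
    volatile float recomputed = x / y;
    return stored == recomputed;
}

// The stored value is a float, but the recomputed quotient stays at double
// precision for the comparison, as it would with widened intermediates.
bool equal_when_recomputed_at_double(float x, float y) {
    volatile float stored = x / y;
    double recomputed = static_cast<double>(x) / static_cast<double>(y);
    return static_cast<double>(stored) == recomputed;
}
```

With x = 1.0f and y = 3.0f the first comparison succeeds and the second fails, because one third is not exactly representable and the float and double approximations differ. With y = 2.0f both succeed, since 0.5 is exact at either precision.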

This is not merely an academic concern. Game developers, scientific computing researchers, and others working with numerically sensitive algorithms have encountered these discrepancies. The classic reference is David Goldberg's 1991 paper, "What Every Computer Scientist Should Know About Floating-Point Arithmetic," which warns that computing every expression in the highest available precision is not always the right strategy. Higher precision can preserve information that would otherwise be lost to rounding, but it can also mask errors that would otherwise be caught, and it can produce results that are inconsistent with the declared types. Microsoft's own guidance, through Eric Fleegal, leans toward using the highest practical precision, but this is a recommendation, not a requirement, and it comes with caveats.

The broader lesson is that floating-point arithmetic is not a fixed specification but a negotiation between hardware capabilities, compiler policies, language standards, and developer intent. The same source code can be evaluated at different precisions on different platforms, and all of those evaluations can be IEEE-compliant. For developers who need deterministic results, the path forward involves understanding the tools: compile with /fp:strict or equivalent, use source-precision intermediates by default, cast explicitly when higher precision is needed, and test across the target platforms. The days of invisible double-precision widening are largely over on modern toolchains, but the legacy of that behavior persists in codebases that were tuned for it, and in the expectations of developers who never knew it was happening.

The x87 FPU and its 80-bit registers are a historical artifact, but the questions it raised about intermediate precision remain relevant. Every floating-point architecture makes choices about what precision to use for temporary values, and those choices have consequences that ripple through numerical code in ways that are subtle and sometimes surprising. The best defense is awareness, explicit control, and a healthy skepticism toward the assumption that a float operation is always a float operation.

Diagram of 80-bit x87 registers

The diagram above shows the layout of the 80-bit x87 registers, with the sign bit in blue, the exponent in pink, and the mantissa in green. When the precision setting is 24-bit, only the light green portion of the mantissa is used. At 53-bit precision, the light and medium green bits are active. At full 64-bit precision, the entire mantissa participates. The exponent is always fully used, which means that even at double precision, the range of values that can be represented in an x87 register exceeds what a true double can hold. This partial widening is the source of much confusion and many subtle numerical differences.


For developers navigating this terrain, the practical advice is straightforward. Use /arch:SSE2 or higher to avoid the x87 FPU entirely on 32-bit builds. Prefer /fp:strict over /fp:fast unless you have verified that the relaxed semantics of fast math do not affect your results. Add the suffix f to floating-point constants that should remain single-precision, because an unsuffixed constant such as 0.1 has type double and implicitly widens the whole expression. Enable compiler warning C4244 to surface unintended double-to-float conversions. And when you encounter a floating-point result that differs from what you expected, check the intermediate precision first. The answer is almost always there.
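The effect of the f suffix is visible in the type system. The sketch below uses two hypothetical functions and a pair of compile-time checks. It relies only on the standard rule that an unsuffixed floating-point constant has type double.

```cpp
#include <type_traits>

// The constant 0.1 is a double, so x is widened, the multiply happens at
// double precision, and the result is implicitly narrowed on return. This
// narrowing is the kind of conversion a warning such as C4244 is meant to flag.
float scale_with_double_constant(float x) {
    return x * 0.1;
}

// The constant 0.1f is a float, so the whole expression has type float.
float scale_with_float_constant(float x) {
    return x * 0.1f;
}

// The expression types, checked at compile time.
static_assert(std::is_same<decltype(1.0f * 0.1), double>::value,
              "float * double constant is a double expression");
static_assert(std::is_same<decltype(1.0f * 0.1f), float>::value,
              "float * float constant is a float expression");
```

The two constants are also different values, since 0.1f is the float nearest to one tenth and 0.1 is the double nearest to it, so the two functions may return different results for some inputs.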
