Compiler Optimization: Why Floating-Point Resists Vectorization

Compilers can vectorize integer summation loops without hesitation, turning a scalar loop into AVX2 code that processes eight elements at a time. The same transformation applied to floating-point summation is almost never performed by default. The reason is not a missing optimization; it is a deliberate correctness constraint rooted in how floating-point arithmetic works.

Integer Sum: Vectorization Succeeds

Given a function that sums a span of integers:

int sum(std::span<const int> data);

A compiler targeting a processor with AVX2 support will typically generate code using vpaddd, which adds eight 32-bit integers in parallel. After the main loop, the eight partial sums stored in a vector register are collapsed into a scalar with a horizontal reduction sequence.

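This works because integer addition is associative. The identity (a + b) + c = a + (b + c) holds exactly for mathematical integers, and it still holds for fixed-width hardware integers, whose additions wrap modulo 2^32. (Signed overflow is undefined behavior in C and C++, so the compiler does not have to preserve any particular result in that case.) The compiler can therefore reorder and regroup additions freely without changing the result.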

Float Sum: Vectorization Fails

The same structure with float:

float sum(std::span<const float> data);

Produces output like this:

.L2:
    vaddss xmm0, xmm0, DWORD PTR [rdi]
    vaddss xmm0, xmm0, DWORD PTR [rdi+4]
    vaddss xmm0, xmm0, DWORD PTR [rdi+8]
    vaddss xmm0, xmm0, DWORD PTR [rdi+12]
    vaddss xmm0, xmm0, DWORD PTR [rdi+16]
    vaddss xmm0, xmm0, DWORD PTR [rdi+20]
    vaddss xmm0, xmm0, DWORD PTR [rdi+24]
    vaddss xmm0, xmm0, DWORD PTR [rdi+28]
    add rdi, 32
    cmp rax, rdi
    jne .L2

vaddss is a scalar single-precision add. The loop is unrolled eight times but does not use vaddps (the packed, vectorized add). Every vaddss reads and writes xmm0, so each addition depends on the result of the previous one and the additions execute strictly in sequence.

Why the Compiler Cannot Vectorize

Floating-point numbers are stored in a fixed number of bits. Every operation rounds its result to the nearest representable value. Because of this rounding, floating-point addition is not associative. In general:

(a + b) + c != a + (b + c)

The difference is usually small, but it is not guaranteed to be zero, and for certain inputs, such as values of very different magnitudes, it can be significant. The classic example is adding a large number to many small numbers: if the large number absorbs all significant digits, the small numbers contribute nothing when added one at a time, but they would contribute when summed among themselves first.

Vectorizing a summation loop changes the order of additions. Instead of summing elements 0 through N-1 left-to-right, a vectorized loop maintains several independent partial sums in parallel and then combines them. This is mathematically equivalent for integers. For floating-point, it can produce a different result.

Neither the C nor the C++ standard mandates IEEE 754 outright, but mainstream compilers such as GCC and Clang follow it by default on common targets and treat the order of operations written in the source as part of the program's meaning. They will not reassociate floating-point operations unless the programmer explicitly grants that permission. Vectorizing the sum would require reassociation, so vectorization is blocked.

Enabling Vectorization

Fast-math flags (global)

-O3 -ffast-math

-ffast-math (also turned on by -Ofast, which adds it to -O3) enables a collection of relaxed floating-point assumptions: operations may be reordered, signed zeros may be ignored, and infinities and NaNs may be treated as unreachable. With it, and with AVX2 enabled for the target (for example via -mavx2), the compiler generates vectorized code for the float sum.

The problem is scope. -ffast-math applies to the entire translation unit. Any code in that file that relies on strict IEEE 754 behavior (checking for NaN, handling infinities correctly, relying on exact rounding for numerical algorithms) may silently produce wrong results.

Per-function attribute (GCC)

#include <span>

// Relaxed floating-point semantics for this one function only (GCC).
__attribute__((optimize("fast-math")))
float sum(std::span<const float> data) {
    float total = 0;
    for (float x : data) total += x;
    return total;
}

This applies relaxed floating-point semantics only to this function. The rest of the program is unaffected. It produces vectorized output while limiting the blast radius of the precision trade-off. The downside is that it is a GCC-specific extension with no standard C++ equivalent, and GCC's own documentation recommends the optimize attribute for debugging rather than production code.

std::reduce (standards-based, limited in practice)

#include <numeric>
#include <span>

float sum(std::span<const float> data) {
    return std::reduce(data.begin(), data.end(), 0.0f);
}

std::reduce is specified so that its operands may be grouped and combined in any order, unlike std::accumulate, which is strictly left to right. The intent of the standard is that std::reduce permits vectorization. In practice, current versions of GCC and Clang do not generate vectorized code for floating-point std::reduce without additional flags. This may improve in future compiler versions.

Choosing the Right Trade-Off

Whether to enable fast-math depends on what your code does with the results.

For rendering, game physics, machine learning inference, and similar applications where approximate results are acceptable and performance matters, -ffast-math or the per-function attribute is a reasonable choice. The numerical differences from reordered additions are typically negligible.

For numerical solvers, financial calculations, or any code that relies on specific rounding behavior to ensure convergence or correctness, changing the floating-point model can produce subtly or catastrophically wrong results. The default conservative behavior exists for good reason.

The most targeted approach is to apply the relaxed semantics only to the specific functions that need vectorization, verify with benchmarks that the performance gain is real, and verify with test cases that the numerical differences are within acceptable bounds.