Compiler Optimization: SIMD and Auto-Vectorization
Modern CPUs contain wide SIMD execution units capable of processing 8, 16, or even 32 integers in a single instruction. Auto-vectorization is the compiler’s ability to recognize loops that operate on independent data elements and rewrite them to use those units —without any changes to source code. When it works well, the performance difference is dramatic.
A Concrete Example
Consider a function that takes two arrays and replaces each element of x with whichever of x[i] or y[i] is larger:
void max_in_place(std::span<int> x, std::span<const int> y) {
for (size_t i = 0; i < x.size(); ++i) {
if (y[i] > x[i]) x[i] = y[i];
}
}
At -O2, GCC and Clang generate a straightforward scalar loop:
.L3:
mov edx, DWORD PTR [rsi + rax*4] ; load y[i]
cmp edx, DWORD PTR [rdi + rax*4] ; compare with x[i]
jle .L2 ; skip if y[i] <= x[i]
mov DWORD PTR [rdi + rax*4], edx ; store y[i] into x[i]
.L2:
add rax, 1
cmp rax, 65536
jne .L3
One integer per iteration. There is a branch inside the loop, a load, a conditional store. Perfectly correct, but it leaves the SIMD units entirely idle.
Enabling Auto-Vectorization
Two things are required:
- A higher optimization level:
-O3, or at minimum-O2 -ftree-loop-vectorize - A target that has SIMD instructions:
-march=skylake,-march=native, or similar
With both flags, the compiler targets AVX2 and can operate on 256-bit ymm registers, processing 8 int values per iteration:
.L4:
vmovdqu ymm1, YMMWORD PTR [rsi + rax]
vpcmpgtd ymm0, ymm1, YMMWORD PTR [rdi + rax]
vptest ymm0, ymm0
je .L3
vpmaskmovd YMMWORD PTR [rdi + rax], ymm0, ymm1
.L3:
add rax, 32
cmp rax, 262144
jne .L4
Eight integers per iteration instead of one. The compiler loads 8 elements from y, compares all of them against the corresponding 8 elements of x in a single vpcmpgtd, and then conditionally writes back using a masked store. This is roughly an 8x throughput improvement in the best case.
Getting Cleaner Code: Eliminate the Branch
The masked-store variant above is still a bit awkward. The comparison result is a predicate mask, the vptest checks whether any update is needed, and vpmaskmovd does a masked write. The branch is there because the original source contains an if.
If you restructure the source to express the operation unconditionally —either with a ternary or with std::max —the compiler recognizes it as a vector max operation directly:
x[i] = std::max(x[i], y[i]);
The resulting assembly is cleaner:
.L3:
vmovdqu ymm0, YMMWORD PTR [rdi + rax]
vpmaxsd ymm0, ymm0, YMMWORD PTR [rsi + rax]
vmovdqu YMMWORD PTR [rdi + rax], ymm0
add rax, 32
cmp rax, 262144
jne .L3
Load 8 elements, compute their element-wise maximum with a single vpmaxsd, store back. No comparison, no branch, no mask. This is the cleanest possible vectorized output for this problem.
The Aliasing Problem
You may notice that compilers often generate two copies of a vectorized loop: the fast SIMD version and a scalar fallback. The scalar fallback exists because the compiler cannot always prove that two pointers do not overlap in memory.
If x and y happen to point to overlapping regions, processing 8 elements at a time with x[i] = max(x[i], y[i]) would produce wrong results, because reading y[i+3] after writing x[i] could see a modified value. The compiler must be conservative, so it inserts a runtime check:
lea rax, [rdi - 4]
sub rax, rsi
cmp rax, 24
jbe .L_slow_path
If the check determines the arrays might overlap, execution falls through to the scalar path. If they are clearly separate, it takes the fast SIMD path.
For cases where you know the pointers cannot alias, the GCC/Clang extension __restrict communicates that guarantee:
void max_in_place(int* __restrict x, const int* __restrict y, size_t n);
With __restrict, the compiler omits the alias check and generates only the SIMD path. The trade-off is that you must ensure this contract is actually met at every call site; violating it is undefined behavior.
What Prevents Auto-Vectorization
Several things can block the compiler from vectorizing a loop:
- Data dependencies between iterations: if iteration
i+1reads a value written by iterationi, the loop cannot be parallelized. - Function calls inside the loop: unless the function can be inlined, its side effects are unknown.
- Pointer aliasing: discussed above —the compiler must assume pointers might overlap unless told otherwise.
- Non-unit stride access: reading every 4th element is much harder to vectorize than reading contiguous elements.
If you suspect a loop is not being vectorized, most compilers support diagnostic flags (-fopt-info-vec, -Rpass=loop-vectorize) that explain exactly why vectorization was blocked.
Writing Code That Vectorizes
The main lessons from looking at how auto-vectorization works in practice:
Prefer unconditional expressions. std::max(a, b) vectorizes better than if (b > a) a = b. The compiler can select vector max instructions; it cannot always eliminate masked stores from conditional writes.
Keep data contiguous. Array of Structures layouts often prevent vectorization because the fields the loop cares about are scattered in memory. Structure of Arrays layouts, where all values of one field are contiguous, vectorize naturally.
Do not manually decompose arithmetic. If you want x * 3, write x * 3, not (x << 1) + x. The compiler recognizes multiplication patterns and can choose between SIMD multiply, shift-and-add, or whatever is fastest for the target.
Consider data structure layout early. It is far easier to design for vectorization from the start than to restructure data after the fact.
Summary
Auto-vectorization is one of the highest-leverage optimizations available, because it can multiply throughput by the SIMD width without any change to source code. But it is also one of the most fragile: aliasing concerns, loop-carried dependencies, and source code patterns that look like conditional writes can all prevent the compiler from applying it. Understanding what the compiler needs —unconditional operations, independent iterations, non-aliased pointers —lets you write code that enables vectorization naturally.