yuqi-zheng

Compiler Optimization: Loop Unrolling


Every loop pays a tax: the counter increment, the bounds check, the branch back to the top, and the pipeline bubbles that branch instructions can introduce. When each iteration does a lot of work, this tax is negligible by comparison. For a tight loop that does only an instruction or two per element, such as one processing a small, fixed number of elements, it can represent a substantial fraction of total runtime.

Loop unrolling addresses this by duplicating the loop body multiple times and reducing the number of iterations proportionally. The result is fewer control instructions, more independent work visible to the CPU at any moment, and opportunities to use more efficient load instructions.
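To make the transformation concrete, here is a minimal sketch of the same summation written in rolled form and unrolled by hand. It only illustrates what the compiler does internally and is not a recommended coding style. The function names and the unroll factor of 4 are assumptions chosen for the example.

```cpp
#include <cstddef>

// Rolled form: every element costs one add plus the increment,
// the comparison, and the branch back to the top.
int sum_rolled(const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Unrolled by a factor of 4 (hypothetical factor, for illustration only).
// The loop control now runs once per four elements.
int sum_unrolled_by_4(const int* data, std::size_t n) {
    int total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    // Remainder loop: handles the last n % 4 elements.
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Both functions return the same result for any input. The second simply executes about a quarter as many compare-and-branch sequences, at the cost of a larger body and a remainder loop.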

This article is part of the Advent of Compiler Optimisations 2025 series by Matt Godbolt.

The Key Variable: Does the Compiler Know the Iteration Count?

The aggressiveness of loop unrolling depends almost entirely on whether the compiler can determine the number of iterations at compile time.

Dynamic Size: No Unrolling

When the loop count is not known until runtime, the compiler generates a standard loop. Consider a sum over a dynamically-sized span:

int sum(const std::span<int>& data);

The compiler has no choice but to emit a loop that checks the remaining count on every iteration:

.LBB0_2:
    ldr r3, [r2], #4   ; load next element, advance pointer
    subs r1, r1, #4    ; decrement remaining byte count
    add r0, r3, r0     ; accumulate
    bne .LBB0_2        ; loop if not done

Three instructions per element plus a branch. The loop is correct and reasonable, but there is no opportunity to eliminate the control overhead without more information.

Fixed Size: Full Unrolling

When the compiler knows the iteration count at compile time, it can do something far more aggressive:

int sum(const std::span<int, 8>& data);

The template parameter 8 makes the count a compile-time constant. The compiler fully unrolls the loop, emitting the eight loads and their additions as straight-line code with no loop counter, no bounds check, and no branch:

ldmib r1, {r2, r3}     ; load two elements at once
add r0, r2, r3
ldr r2, [r1, #8]
add r0, r0, r2
; ... four more load+add pairs ...

Notice the ldmib instruction, which loads multiple registers in a single instruction. The compiler can use it here because it knows exactly how many elements to process and can plan the whole sequence in advance. The result is a flat sequence of loads and additions with no loop structure at all.

Unrolling Thresholds

The compiler does not always unroll, and it does not always unroll fully. The heuristics vary by compiler and target architecture, but a rough picture for GCC and Clang on ARM looks like this:

Iteration count | Compiler behavior
Up to 16 | Full unroll; code size increase is acceptable
Around 32 | Partial unroll with some register spilling; trying to maximize parallelism but running out of registers
50 or more | Falls back to a standard loop; code size growth outweighs benefit

For very large iteration counts, the compiler could in principle use a blocked unrolling strategy: process 16 elements per “super-iteration”, then loop over those super-iterations. Current versions of GCC and Clang generally do not apply this pattern on their own, relying instead on auto-vectorization to handle large arrays efficiently.

How to Give the Compiler What It Needs

The most direct way to enable aggressive unrolling is to encode the iteration count in the type system.

Use std::array<T, N> instead of std::vector<T>. The size is a template parameter and is always visible to the compiler.

Use std::span<T, N> instead of std::span<T>. The fixed-extent version carries the count at compile time; the dynamic-extent version does not.

Avoid unnecessary heap allocation. If you know a collection will always have exactly N elements, a stack-allocated array of size N gives the compiler full information and eliminates any indirection.

Consider Profile-Guided Optimization (PGO). When the iteration count is genuinely dynamic, PGO can provide the compiler with observed runtime distributions, allowing it to make informed decisions about partial unrolling for the common case.

What Not to Do

Do not manually unroll loops. Writing:

sum += data[0];
sum += data[1];
sum += data[2];
// ...

is harder to read, harder to maintain, and does not necessarily produce better code than what the compiler generates from a proper loop with a known bound. The compiler has a detailed model of the target architecture (its register count, instruction throughput, and latency characteristics) and applies it more reliably than hand-written unrolling can.

If you want unrolling, communicate the loop bound through the type system and let the compiler do the work.

Takeaways

Loop unrolling is not a manual optimization. It is a consequence of giving the compiler accurate information about what your code does. When the iteration count is fixed and visible, the compiler eliminates loop overhead entirely and may use specialized instructions that are not available in a general loop. When the count is dynamic, the compiler generates correct, reasonable code but cannot go further without additional hints.

The most actionable practice is to use fixed-size types (std::array, fixed-extent std::span) whenever the size is genuinely known at compile time. The performance improvement on small, hot loops can be significant.