Compiler Optimization: Loop Unswitching

Loop unswitching is an optimization that targets a specific and common pattern: a conditional branch inside a loop whose condition does not change across iterations. Instead of re-evaluating the invariant condition on every pass through the loop, the compiler hoists the test out of the loop and generates two specialized copies of the loop, one for each branch outcome.

This article is part of the Advent of Compiler Optimisations 2025 series by Matt Godbolt.

The Problem: Invariant Conditions Inside Loops

Consider a function that sums the elements of a vector, with an option to sum their squares instead:

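#include <vector>

// Sums the elements of data, or their squares when squared is true.
int sum(const std::vector<int>& data, bool squared) {
    int total = 0;
    for (int x : data)
        total += squared ? x * x : x;  // squared never changes inside the loop
    return total;
}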

The value of squared is fixed for the entire duration of the loop. Checking it on every iteration is wasteful, but it is the natural way to write the function, and we should not have to restructure our code around this concern.

What the Compiler Does at `-O2`

At -O2, the compiler opts for a compact representation. It keeps a single loop and uses a conditional select or a multiply-by-one trick to avoid branching:

mla r0, r2, r3, r0   ; total += x * m, where r3 holds m = (squared ? x : 1), selected earlier in the loop

The multiplication always executes. When squared is false, x is multiplied by 1, which leaves the value unchanged but still costs a multiply instruction. The loop is small and branch-free, but it performs unnecessary work on every iteration when squaring is not requested.
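In source terms, the single-loop strategy described above amounts to roughly the following. This is a sketch of the idea rather than compiler output; the function name sum_select and the local variable multiplier are made up for illustration.

```cpp
#include <vector>

// Hypothetical source-level picture of the branch-free single loop:
// pick a multiplier for each element, then always multiply.
int sum_select(const std::vector<int>& data, bool squared) {
    int total = 0;
    for (int x : data) {
        const int multiplier = squared ? x : 1;  // can compile to a conditional select
        total += x * multiplier;                 // the multiply runs on every iteration
    }
    return total;
}
```

Both settings of squared run through the same instructions; only the selected multiplier differs.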

The -O2 tradeoff is intentional: code size is kept small, and the loop is still reasonably fast.

What the Compiler Does at `-O3`

At -O3, the compiler enables loop unswitching (in GCC, this is the -funswitch-loops flag, which -O3 turns on). It transforms the single loop into:

A test of squared before the loop.
Two entirely separate loop copies: one that computes total += x, and one that computes total += x * x.

The resulting assembly looks roughly like this:

cbz w1, .L_non_squared   ; jump to non-squaring loop if squared == false

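.L_squaring_loop:
    ; tight loop: total += x * x
    ; no test of squared inside, only the loop's own back-edge branch
    ; on exit, returns without falling through into the loop below

.L_non_squared:
    ; tight loop: total += x
    ; no test of squared inside, only the loop's own back-edge branch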

Each loop is now clean: squared is never tested inside it, no multiplication runs when none is needed, and no work is redundant. The optimizer can also apply further transformations, vectorization in particular, more aggressively on these simplified loops, because each contains only a single, uniform operation.
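Written back as C++, the unswitched form corresponds roughly to the function below. It is a hand-written illustration of what the compiler produces, not a recommended way to write the code; the name sum_unswitched is hypothetical.

```cpp
#include <vector>

// Hypothetical hand-written equivalent of the unswitched loop:
// the invariant test runs once, and each branch gets its own loop.
int sum_unswitched(const std::vector<int>& data, bool squared) {
    int total = 0;
    if (squared) {
        for (int x : data)
            total += x * x;  // squaring loop: no test of squared inside
    } else {
        for (int x : data)
            total += x;      // plain loop: no multiply at all
    }
    return total;
}
```

The loop body appears twice, which is exactly where the extra code size comes from.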

The Code Size Tradeoff

The reason this optimization does not fire at -O2 is straightforward: it duplicates the loop, roughly doubling its code size. For a large loop body, that can be a significant increase in the binary. Instruction cache pressure matters, and bloating every loop that happens to contain an invariant conditional would hurt programs that have many such loops and limited cache.

The -O2 / -O3 boundary reflects a deliberate policy choice:

Level	Priority	Loop unswitching
`-O2`	Balance speed and size	No
`-O3`	Maximize speed	Yes

For performance-critical code paths, -O3 is often the right choice. For general builds where binary size and cache behavior matter, -O2 is more conservative and usually sufficient.

Benefits Beyond Branch Elimination

Once a loop has been unswitched, secondary optimizations become more effective:

Vectorization. Auto-vectorization works best on loops with uniform, predictable operations. A loop that always adds x * x is far easier to vectorize than one that conditionally squares.

Instruction-level parallelism. A loop with no conditional branches gives the CPU’s out-of-order execution engine a clearer view of the data flow.

Register allocation. Fewer live values and a simpler control flow graph make the register allocator’s job easier.
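To make the vectorization point concrete, here is a scalar sketch of the shape a vectorizer can give the squaring loop once it is uniform: several independent partial sums advanced in lockstep, followed by a remainder loop. The lane count of 4 and the name sum_squares_lanes are assumptions for illustration; the real width depends on the target and the compiler.

```cpp
#include <cstddef>
#include <vector>

// Scalar model of a 4-lane vectorized sum of squares (lane count is assumed).
int sum_squares_lanes(const std::vector<int>& data) {
    int lane[4] = {0, 0, 0, 0};
    const std::size_t n = data.size();
    const std::size_t blocked = n - n % 4;  // largest multiple of 4 not above n

    // Main loop: every lane performs the same operation on its own element.
    for (std::size_t i = 0; i < blocked; i += 4)
        for (std::size_t k = 0; k < 4; ++k)
            lane[k] += data[i + k] * data[i + k];

    // Combine the partial sums, then handle the leftover elements.
    int total = lane[0] + lane[1] + lane[2] + lane[3];
    for (std::size_t i = blocked; i < n; ++i)
        total += data[i] * data[i];
    return total;
}
```

A loop that still had to choose between x and x * x on each element would need a per-lane select before the accumulation, which is the extra work unswitching removes.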

Practical Advice

Write code that expresses intent clearly. If a loop has a flag that controls behavior uniformly across all iterations, pass it as a parameter and let the compiler decide whether to unswitch. Do not manually split the loop into two separate call sites to force the optimization: that produces harder-to-maintain code and removes flexibility for the optimizer.

Use Compiler Explorer to verify whether unswitching has fired for your specific code and compiler version. The behavior can depend on loop size, inlining decisions, and other contextual factors that are not always obvious from reading the source.

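If you are working on a hot path where the performance difference between -O2 and -O3 is measurable, profile first, then consider raising the optimization level only for the affected code rather than rebuilding the entire project at -O3. With GCC, that can mean __attribute__((optimize("O3"))) on a function or #pragma GCC optimize; both are GCC-specific, and GCC's own documentation cautions that the optimize attribute is intended for debugging rather than production code, so compiling the hot translation unit separately at -O3 is the more portable route.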
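A minimal sketch of the per-function approach, assuming GCC. The macro name HOT_O3 and the function name sum_hot are hypothetical; on compilers that do not support the attribute, the macro expands to nothing and the function is built at the project's normal level.

```cpp
#include <vector>

// HOT_O3 is a made-up convenience macro. The optimize attribute is a GCC
// extension, so every other compiler gets an empty definition.
#if defined(__GNUC__) && !defined(__clang__)
#define HOT_O3 __attribute__((optimize("O3")))
#else
#define HOT_O3
#endif

// Same function as in the article, with -O3 requested for this function only.
HOT_O3 int sum_hot(const std::vector<int>& data, bool squared) {
    int total = 0;
    for (int x : data)
        total += squared ? x * x : x;
    return total;
}
```

Whether unswitching actually fires for sum_hot is still worth confirming in the generated assembly, since the attribute only changes the options the function is compiled with.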