yuqi-zheng

Compiler Optimization: Induction Variables and Loop-Carried Dependencies


One of the most counterintuitive results in compiler optimization is that replacing an “expensive” operation with a cheaper one can make a loop slower. This happens when the replacement introduces a dependency between iterations that the original code did not have. The compiler understands this. Knowing when it applies is useful for anyone reasoning about loop performance.

What Is an Induction Variable?

An induction variable is a loop variable that changes by a constant amount on each iteration. The loop counter itself is the canonical example, but the term applies to any derived quantity with this property.

Consider a loop that fills an array with multiples of a constant:

for (int i = 0; i < N; ++i)
    out[i] = i * table;

Each iteration computes i * table from scratch. A compiler pass called induction variable substitution can transform this into:

int acc = 0;
for (int i = 0; i < N; ++i) {
    out[i] = acc;
    acc += table;
}

The multiplication is gone, replaced by an addition. This is strength reduction: replacing a strong operation (multiply) with a weaker one (add). For a fixed-latency multiplier, this is usually beneficial.

The Quadratic Case

Generating the sequence of squares —out[i] = i * i —invites the same treatment. The mathematical identity (n+1)^2 = n^2 + 2n + 1 means consecutive squares differ by consecutive odd numbers. You can compute the whole sequence with only additions:

int sq = 0, step = 1;
for (int i = 0; i < N; ++i) {
    out[i] = sq;
    sq += step;
    step += 2;
}

No multiplications at all. This seems strictly better. But compilers typically do not generate this code. Instead, they generate something equivalent to:

for (int i = 0; i < N; ++i)
    out[i] = i * i;

using an actual imul instruction each iteration.

Why the Compiler Prefers imul

The issue is dependency chains.

In the induction variable version, each iteration depends on the results of the previous iteration. sq on iteration k depends on sq and step from iteration k-1. This is a loop-carried dependency: the next iteration cannot begin its arithmetic until the previous one finishes.

In the multiplication version, each i * i depends only on i, which is known at the start of the iteration independently of what happened in the previous one. Modern out-of-order processors can execute multiple iterations simultaneously when they find no data dependencies between them.

Here is the throughput comparison on a Haswell-class processor, analyzed with llvm-mca:

VersionCycles per iterationBottleneck
Induction variable (additions)~1.5Dependency chain on sq and step
Direct multiply (imul)~1.5Multiply unit throughput

The numbers are similar, but the underlying reason matters. The multiply version hits its limit because the multiply unit can only issue so many operations per cycle. There is no dependency between iterations: a superscalar processor can have several multiplications in flight simultaneously across different iterations. The induction variable version hits its limit because of the chain: step depends on the previous step, and sq depends on the previous sq and the current step.

In practice, the multiply version tends to be more robust. It has fewer instructions, no second state variable to maintain, and the same limit on throughput. In more complex loops where other operations compete for execution resources, the induction variable version can fall further behind because the dependency chain serializes the work.

The General Principle

Every optimization that replaces one thing with another needs to account for the full cost model, which includes:

  • Instruction latency (how long a single operation takes)
  • Instruction throughput (how many per cycle can be issued)
  • Dependency structure (which operations must wait for others)

Strength reduction is beneficial when the replaced operation is expensive and the replacement has no new dependencies. When the replacement creates a dependency chain that did not exist before, the benefit may be negative.

Modern compilers model this explicitly. The induction variable optimization pass knows whether a given substitution introduces a loop-carried dependency and whether the target CPU’s out-of-order execution can hide that dependency. For the quadratic case on x86, the analysis says: the multiplication version is at least as fast and has less structural complexity. The substitution is not made.

Implications for Manual Optimization

This is a case where intuition about “cheaper operations” fails. The imul instruction has higher latency than add —three cycles versus one —but the loop runs just as fast when using it, because multiple instances of it can overlap in the execution pipeline.

Before manually transforming a loop to use additions instead of multiplications, it is worth asking whether the transformation creates a dependency that was not there before. If it does, the correct tool is to benchmark with realistic data on the target hardware, not to assume the reduction in operation cost translates to a reduction in runtime.

The compiler’s choice to keep the multiplication in the quadratic example is not conservatism or a missing optimization. It is the correct decision given an accurate model of how the hardware executes the loop.