yuqi-zheng

Compiler Optimization: Multiplying by Constants


Integer multiplication is one of the first places people try to “help” the compiler: replace x * 8 with x << 3, or rewrite x * 3 as (x << 1) + x. This kind of manual bit manipulation is a holdover from an era when multiply instructions were genuinely expensive. On modern hardware, it is usually unnecessary and sometimes counterproductive. Looking at what compilers actually emit for constant multiplication shows why.

Small Constants: The lea Instruction

For multipliers like 2, 3, 4, and 5, a modern x86 compiler almost never emits a mul instruction. Instead it uses lea —Load Effective Address —which was designed for pointer arithmetic but can compute base + index * scale + displacement in a single instruction with scales of 1, 2, 4, or 8.

; x * 2
lea eax, [rdi + rdi]

; x * 3
lea eax, [rdi + rdi*2]

; x * 4
lea eax, [rdi*4]

; x * 5
lea eax, [rdi + rdi*4]

lea writes to an arbitrary destination register, does not modify flags, and has single-cycle latency on most modern cores. It is the ideal tool for small constant multiplication.

Powers of Two: Shift with a Caveat

For powers of two, the compiler uses left shift (shl). However, x86’s shift instruction requires the source and destination to be the same register, so when the output register differs from the input, a mov is needed first:

; x * 16
mov eax, edi
shl eax, 4

This is slightly less elegant than lea eax, [rdi*16] would be, but lea only supports scale factors of 1, 2, 4, and 8. For larger powers of two, shl is the only option.

Composite Constants: Chained lea

For a constant like 25, the compiler decomposes the multiplication into a sequence of smaller ones:

; x * 25 = (x * 5) * 5
lea eax, [rdi + rdi*4]   ; eax = x * 5
lea eax, [rax + rax*4]   ; eax = (x*5) * 5 = x * 25

Two lea instructions, no mul, no explicit addition. The compiler is solving a small optimization problem: find the shortest sequence of lea and shl operations that produces the target multiplier.

Large Constants: Just Use imul

When the constant is large enough that no short sequence of shifts and additions suffices, the compiler switches to imul with a three-operand form:

; x * 522
imul eax, edi, 522

Modern CPUs have heavily pipelined integer multiply units. imul typically takes 3 to 4 cycles of latency but has a throughput of one per cycle. Decomposing 522 into shifts and additions —(x << 9) + (x << 3) + (x << 1), since 512 + 8 + 2 = 522 —would require three shifts and two additions, which is more instructions with more data dependencies. A single imul is genuinely faster.

What Happens When You “Help” the Compiler

This is where it gets interesting. If you write:

return (x << 9) + (x << 3) + (x << 1);

to express x * 522 manually, the compiler recognizes the pattern and converts it back to imul eax, edi, 522. The compiler looks through your bit manipulation, identifies the underlying multiplication, and then makes its own decision about the best implementation —which is the imul it would have generated anyway.

Conversely, if you target an old architecture like Pentium with -march=pentium, the compiler uses shifts and additions for this multiplication because the integer multiplier on that hardware was slow enough that decomposing the operation was the right call.

The compiler’s choice is not arbitrary. It knows the target CPU’s instruction latencies and throughputs, and it makes decisions accordingly. Manually rewriting arithmetic expressions in terms of bit operations does not provide any information the compiler lacks —it just makes the source less readable and occasionally interferes with recognition of higher-level patterns.

The General Principle

Constant multiplication is a microcosm of a broader principle in performance-oriented programming. Compilers carry detailed performance models of target microarchitectures. They know when lea is faster than imul, when imul is faster than a sequence of shifts, and when a shift-and-add sequence is preferable to either. This knowledge is embedded in the optimizer and updated with each compiler release as CPUs evolve.

Writing x * 522 is unambiguous about intent. The compiler knows what you want and picks the best way to produce it. Writing (x << 9) + (x << 3) + (x << 1) is harder to read, obscures the intent, and at best produces identical output. At worst, it prevents the compiler from seeing opportunities you would not have anticipated.

Write multiplication as multiplication. Let the compiler decide how to implement it.