yuqi-zheng

Compiler Optimization: The Cost of Division


Of the basic arithmetic operations —addition, subtraction, multiplication, division —integer division stands apart in its cost. On modern x86 hardware, a single integer division can take 18 to nearly 100 clock cycles depending on operand sizes and the specific microarchitecture. Addition takes one cycle. Multiplication takes about three. The gap is not small.

Understanding why division is slow, and how compilers work around it, leads to better intuitions about writing efficient numeric code.

This article is part of the Advent of Compiler Optimisations 2025 series by Matt Godbolt.

Why Division Is Expensive

Modern CPUs are deeply pipelined, meaning they execute multiple instructions simultaneously. Most operations —loads, additions, multiplications —have been heavily optimized and can issue multiple times per cycle, using several parallel execution units.

Division is different for two structural reasons:

Algorithm complexity. Division cannot be easily pipelined the way multiplication can. The hardware implementation of integer division is inherently sequential: each step depends on the result of the previous one. The divisor and dividend are both needed in full before useful work can begin, and the quotient bits are computed one at a time.

Single execution unit. Processors typically have only one integer division unit per core, compared to multiple adders and multipliers. This means that while the CPU is busy with one division, any subsequent division must wait in a queue. Multiple independent divisions cannot execute in parallel.

On Intel Alder Lake, a 32-bit unsigned division takes roughly 18 cycles in the best case. For 64-bit division, the range extends further. During those cycles, the CPU is executing other work where it can, but the division itself is a bottleneck.

Division by a Power of Two: Almost Free, But Careful

The standard developer optimization for division is to replace division by a power of two with a right shift. Dividing by 512 (which is 2^9) is the same as shifting right by 9 bits:

x / 512? x >> 9

This works correctly for unsigned integers. A right shift on an unsigned value is a logical shift, and it computes integer division rounded toward negative infinity —which, for non-negative values, is the same as truncation toward zero.

For signed integers, however, the behavior diverges.

The Rounding Trap

The C and C++ standards specify that integer division truncates toward zero. For positive dividends, truncating toward zero and toward negative infinity are the same. For negative dividends, they differ:

-1 / 512   ==  0      // C: truncation toward zero
-1 >> 9    == -1      // arithmetic right shift: floor division

If you write x >> 9 where x is a signed integer and intend it to mean x / 512, you will get the wrong answer whenever x is negative. The value -1 shifted right by 9 is still -1, but -1 / 512 should be 0 under C’s truncation semantics.

How the Compiler Handles Signed Division

When the compiler sees x / 512 where x is a signed integer, it generates code that produces the correct C result:

test edi, edi          ; is x negative?
lea eax, [rdi + 511]   ; compute x + 511 (i.e., x + divisor - 1)
cmovns eax, edi        ; if x >= 0, use x directly; if negative, use x + 511
sar eax, 9             ; arithmetic right shift by 9

The lea computes x + 511 as a bias value. If x is negative, adding 511 before shifting ensures that the shift rounds toward zero rather than toward negative infinity. The cmovns selects the biased or unbiased value based on the sign of x. The final arithmetic right shift then produces the correct result.

This sequence is three instructions plus a conditional move, all of which execute in a few cycles total. It is substantially faster than the div instruction, but it is more expensive than a single sar.

Getting the Single Shift: Use Unsigned

If you know that a value is non-negative, tell the compiler by using an unsigned type:

unsigned divide_by_512(unsigned x) {
    return x / 512;  // compiles to: shr eax, 9
}

With an unsigned type, the compiler knows there are no negative values to handle. It emits a single shift instruction. The correction logic disappears because it is not needed.

This is not merely a micro-optimization. Using unsigned also communicates a semantic constraint to readers of the code: this value represents a quantity that cannot be negative. The type system is carrying meaningful information.

The Broader Pattern

This situation generalizes. Whenever you use a signed integer for a value that you know is always non-negative, you are asking the compiler to generate correctness logic that will never actually be needed. The compiler has no choice —it cannot assume the value is non-negative unless you tell it, either through the type or through a contract annotation.

Options for informing the compiler:

  • Use unsigned or size_t for quantities that are intrinsically non-negative.
  • Use __builtin_assume(x >= 0) (GCC/Clang) to assert non-negativity without changing the type. The compiler can then elide the correction logic.
  • Use std::make_unsigned_t<T> in templates where the signedness depends on a template parameter.

Do Not Manually Replace Division with Shifts

The conclusion to draw from all of this is not “always write shifts instead of divisions.” It is the opposite: write divisions when you mean division, and let the compiler generate the most efficient implementation given what it knows about the types.

Manually writing x >> 9 when you mean x / 512:

  • Is wrong for signed negative inputs.
  • Is harder for a reader to understand (what is 9? why a shift?).
  • Prevents the compiler from applying further transformations that depend on understanding the arithmetic meaning.

Write x / 512. Use unsigned types where appropriate. Trust that the compiler will emit a shift when it can —and add the correction logic when it cannot.