Compiler Optimization: Recognizing the popcount Pattern
Population count —counting the number of set bits in an integer —is a fundamental operation used in data compression, cryptography, chess engines, sparse matrix representations, and error detection. It is simple enough to implement by hand, yet modern compilers can recognize the hand-written implementation and replace it entirely with a single, specialized CPU instruction.
This article is part of the Advent of Compiler Optimisations 2025 series by Matt Godbolt.
The Operation
The population count of a value is the number of bits that are 1. For example:
0b11010100 = 212 decimal 鈫?4 set bits
This is also called the Hamming weight of the value.
A Compact C Implementation
One efficient way to count set bits without a lookup table uses the fact that value & (value - 1) clears the lowest set bit:
int popcount(uint64_t value) {
int count = 0;
while (value) {
value &= value - 1; // clear the lowest set bit
count++;
}
return count;
}
The loop runs exactly as many times as there are set bits. For sparse values (few bits set), it is faster than a naive bit-by-bit scan. For dense values it is slower, but for general use it is a clean and correct implementation.
What the Compiler Does
On a Baseline Target
When compiled for an older or generic target (such as -march=nocona), the compiler generates a loop that directly implements the algorithm:
.L3:
lea rax, [rdi - 1]
add edx, 1
and rdi, rax
jne .L3
This is what you would expect: the loop body computes value & (value - 1), increments the counter, and repeats until value is zero.
On a Modern Target
When compiled for an architecture that includes the POPCNT instruction (available on Intel CPUs since Nehalem/Westmere, circa 2010), the entire function collapses to a single instruction:
popcnt rax, rdi
The compiler recognizes the loop pattern, proves that it is computing a population count, and replaces the entire function with the hardware primitive. One cycle instead of O(n) iterations.
This is pattern matching at the compiler level: the optimizer does not just inline and simplify —it recognizes an algorithmic pattern and substitutes a hardware operation that is semantically equivalent but far faster.
Enabling the Optimization
The critical requirement is specifying a target architecture that includes POPCNT. In GCC and Clang, use:
-march=westmere # or any sufficiently modern x86 target
-mpopcnt # explicitly enable just the POPCNT instruction
-march=native # use all instructions supported by the build machine
Without one of these flags, the compiler conservatively generates the loop, since it cannot assume the hardware has popcnt.
The Standard Library Alternative
C++20 provides a portable, intent-revealing alternative:
#include <bit>
int count = std::popcount(x);
This expresses the operation directly and guarantees the best available implementation on every target. On x86 with POPCNT support, it compiles to the same single instruction. On ARM it uses the cnt instruction (which operates on SIMD registers). On RISC-V it uses the cpop instruction when available.
Using std::popcount is preferable to a hand-rolled loop for all new code: it is clearer, portable, and the compiler cannot fail to optimize it.
A Microarchitecture Quirk: False Dependencies
On certain Intel CPUs (notably Sandy Bridge and Ivy Bridge), the popcnt instruction has a hardware bug: the output register appears to depend on its own previous value, even though it is written unconditionally. This is called a false dependency, and it can cause unnecessary pipeline stalls.
Compilers that target these architectures insert an xor instruction before popcnt to break the dependency:
xor eax, eax ; break false dependency on eax
popcnt eax, edi
This is a transparent, automatic fix. The developer does not need to be aware of this microarchitecture quirk —the compiler handles it when targeting the affected CPUs.
Takeaways
Compiler pattern recognition is not limited to algebraic simplifications and loop transformations. It extends to recognizing well-known algorithms and replacing them with hardware primitives. The popcount example illustrates this clearly: a correct, idiomatic implementation of a known operation gets replaced with a single instruction when the target hardware supports it.
For new code, prefer std::popcount. For existing code that uses a bit-clearing loop, specify the appropriate -march flag and verify in Compiler Explorer that the optimization fires. For maximum portability without sacrificing performance, let the standard library handle it.