Compiler Optimization: How Compilers Compare Fixed-Length Strings
When you write sv == "ABCDEFG" in C++, you might expect the compiler to call memcmp. For short strings compared against a compile-time constant, it does not. Instead, modern compilers inline and specialize the entire comparison into a short sequence of integer loads and bitwise operations: no function call, no branches beyond a single length check, and sometimes overlapping reads that cover non-power-of-two lengths with power-of-two-sized loads.
The Setup
Consider a family of functions, each comparing a std::string_view against a string literal of a specific length:
#include <string_view>
using namespace std::string_view_literals; // enables the "..."sv suffix

bool t7(std::string_view sv) {
    return sv == "ABCDEFG"sv;
}
The compiler knows the target string at compile time: its length, its content, and its byte representation in memory. This knowledge is enough to generate code that is specialized for exactly this comparison.
Step One: Length Check
Every generated function begins by testing the input length:
cmp rdi, 7
jne .L_return_false
If the length does not match, the function returns false immediately without touching the string data. This single branch eliminates the need for any memory access in the failure case.
Power-of-Two Lengths: Direct Integer Comparison
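For strings whose length is exactly 1, 2, 4, or 8 bytes, the approach is straightforward: load the entire string as a single integer of the corresponding width and compare it against the expected value. For widths up to 4 bytes the expected value fits in the instruction as an immediate. x86-64 has no compare with a 64-bit immediate, so the 8-byte case first materializes the constant in a register. For example:
- Length 1:
    cmp byte ptr [rsi], 'A'
- Length 4:
    cmp dword ptr [rsi], 0x44434241
- Length 8:
    movabs rax, <8-byte constant>
    cmp qword ptr [rsi], rax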
The immediate 0x44434241 is the four bytes of "ABCD" interpreted as a little-endian 32-bit integer: 'A' is 0x41 and sits in the lowest byte, 'B' is 0x42, and so on. The comparison is a single instruction. On CPUs that micro-fuse the load with the compare, it may execute as a single fused micro-operation.
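The same idea can be written by hand in C++, which may help make the byte-to-integer mapping concrete. This is only a sketch of the technique, not the compiler's actual internal representation; the helper names load_u32 and equals_abcd are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// 'A' is 0x41 ... 'D' is 0x44, so "ABCD" read as a little-endian 32-bit
// integer is 0x44434241, which is 1145258561 in decimal.
static_assert(0x44434241u == 1145258561u, "hex and decimal forms agree");

// Hypothetical helper: read 4 bytes starting at p as one 32-bit integer.
// memcpy is the portable way to express an unaligned load; optimizing
// compilers typically lower it to a single mov.
inline std::uint32_t load_u32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Hand-written equivalent of sv == "ABCD": a length check followed by one
// 32-bit compare. The expected value is built with the same load, so the
// sketch does not depend on the machine's byte order.
inline bool equals_abcd(std::string_view sv) {
    if (sv.size() != 4) return false;
    return load_u32(sv.data()) == load_u32("ABCD");
}
```

On a little-endian target, load_u32("ABCD") evaluates to 0x44434241, the same constant that appears as the immediate in the length-4 instruction.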
Non-Power-of-Two Lengths: Overlapping Reads
Length 7 cannot be handled by a single 4-byte or 8-byte load without reading past the end of the string (which may be invalid) or leaving bytes unchecked. The solution the compiler uses is overlapping reads: one 4-byte load from the start and one 4-byte load ending at the last byte, with the middle 1 byte covered by both.
For "ABCDEFG" (7 bytes), the compiler generates:
mov eax, 1145258561 ; the bytes of "ABCD" as a 32-bit integer
xor eax, dword ptr [rsi] ; XOR with first 4 bytes of input
mov ecx, 1195787588 ; the bytes of "DEFG" as a 32-bit integer
xor ecx, dword ptr [rsi+3] ; XOR with bytes 3..6 of input
or ecx, eax ; combine: zero only if both XORs are zero
sete al ; set result to 1 if zero flag is set
The load at [rsi+3] reads bytes 3, 4, 5, and 6, so byte 3 ('D') is read twice, once in each load. That is intentional and harmless: a byte either matches its expected value or it does not, and reading it twice does not change that.
Why XOR Instead of CMP?
The XOR approach is elegant for combining multiple comparisons without branches. a XOR b is zero if and only if a == b. By XORing the loaded 4 bytes against the expected 4 bytes, you get a value that is zero when all four bytes match. Combine two such XOR results with OR, and the final result is zero only when both 4-byte groups matched. A single sete instruction reads the zero flag to produce the boolean return value.
The entire 7-byte comparison executes as two immediate moves, two XORs with memory operands (each of which performs one of the two loads), one OR, and one sete, with no conditional branches after the initial length check.
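For readers who find C++ easier to follow than assembly, the sequence above corresponds roughly to the sketch below. It is a hand-written illustration of the overlapping-read and XOR/OR technique, assuming a hypothetical helper read_u32; it is not output from any compiler.

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical helper: unaligned 4-byte read expressed through memcpy.
inline std::uint32_t read_u32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Hand-written equivalent of sv == "ABCDEFG".
inline bool equals_abcdefg(std::string_view sv) {
    if (sv.size() != 7) return false;  // the only branch

    static const char expected[] = "ABCDEFG";
    const char* p = sv.data();

    // Bytes 0..3 against "ABCD": zero only if all four bytes match.
    std::uint32_t first = read_u32(p) ^ read_u32(expected);
    // Bytes 3..6 against "DEFG": byte 3 is deliberately covered twice.
    std::uint32_t last = read_u32(p + 3) ^ read_u32(expected + 3);

    // OR combines the two results: zero only if both groups matched.
    return (first | last) == 0;
}
```

Both reads stay inside the 7 bytes the length check has already guaranteed, which is why the second read starts at offset 3 rather than 4.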
Lengths 3, 5, 6, 9
The same overlapping read strategy applies to other non-power-of-two lengths:
- Length 3: two 2-byte loads, the second offset by 1, with 1 byte of overlap
- Length 5: two 4-byte loads, the second offset by 1, with 3 bytes of overlap
- Length 6: two 4-byte loads, the second offset by 2, with 2 bytes of overlap
- Length 9: two 8-byte loads, the second offset by 1, with 7 bytes of overlap, or an 8-byte load plus a 1-byte check
The common thread is that unaligned 32-bit and 64-bit loads on x86 are cheap: modern x86 processors handle misalignment in hardware, usually with no penalty unless the access crosses a cache-line boundary, so covering an odd-length string with overlapping power-of-two-sized loads is an entirely valid strategy.
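The offsets and overlaps in the list above follow from simple arithmetic: the first load starts at offset 0, the second ends at the last byte, and whatever both cover is the overlap. A small sketch, with the type OverlapPlan and the function plan_two_loads invented here purely for illustration:

```cpp
#include <cstddef>

// Hypothetical description of a two-load plan for an n-byte comparison.
struct OverlapPlan {
    std::size_t width;          // bytes read by each load
    std::size_t second_offset;  // where the second load starts
    std::size_t overlap;        // bytes covered by both loads
};

// Assumes width <= n <= 2 * width, so two loads of `width` bytes are
// enough to cover all n bytes without reading past the end.
constexpr OverlapPlan plan_two_loads(std::size_t n, std::size_t width) {
    return OverlapPlan{width, n - width, 2 * width - n};
}

// The 7-byte case from the text: the second load starts at offset 3 and
// shares one byte with the first.
static_assert(plan_two_loads(7, 4).second_offset == 3, "offset for length 7");
static_assert(plan_two_loads(7, 4).overlap == 1, "overlap for length 7");
```

Whether a given compiler picks the two-load plan or a wide load plus a narrow one for a particular length is its own decision; the sketch only reproduces the arithmetic.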
Why This Matters
This optimization is invisible to the programmer. You write sv == "ABCDEFG" and the compiler produces the bit-manipulation sequence above. The abstraction is not leaking; the compiler is simply choosing a better implementation than a naive call to memcmp would be.
The performance difference is meaningful for hot paths. A function call to memcmp involves a call instruction, a return, and, inside memcmp itself, setup and dispatch work driven by a length that is known only at run time. For a string of a few bytes, that overhead can outweigh the comparison itself. The inlined version uses no call, no setup, and often fewer total memory accesses.
From the programmer’s perspective, the lesson is to write the clear, readable comparison and let the compiler handle the rest. Do not manually unroll string comparisons into byte-by-byte loops, and do not call memcmp by hand when the compiler would generate something better. Use std::string_view::operator== or std::string::operator== and trust the optimizer.
Compiler Explorer makes it easy to observe this directly: compile a function returning sv == "ABCDEFG"sv with Clang at -O2 for x86-64 and examine the output. Then try changing the string length and watch the code pattern shift between the aligned and overlapping cases.