yuqi-zheng

Split Locks: The Hidden Cost of Cache-Line-Crossing Atomic Operations


x86-64 supports unaligned memory access, including atomic operations on data that spans two cache lines. These are called split locks. The x86 architecture guarantees correctness by locking the entire memory bus for the duration of the operation, which costs around 1000 CPU cycles and stalls every other core on the machine.


What Makes a Split Lock

A split lock occurs when an atomic operation targets memory that crosses a 64-byte cache line boundary. The CPU cannot lock just the two affected cache lines independently (the cache coherence protocol handles single-line atomics without a bus lock), so it falls back to asserting the LOCK# bus signal, which prevents any other memory transaction system-wide until the operation completes.
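The boundary condition above can be written as a small predicate. This is a minimal sketch, assuming the 64-byte line size used throughout this post; `crosses_cache_line` and `kCacheLine` are hypothetical names for illustration, not part of any library.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size (64 bytes on current x86-64 parts).
constexpr std::size_t kCacheLine = 64;

// Returns true if the byte range [addr, addr + size) touches two cache lines.
// Hypothetical helper; assumes size >= 1.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return addr / kCacheLine != (addr + size - 1) / kCacheLine;
}

static_assert(!crosses_cache_line(60, 4), "bytes 60..63 stay in one line");
static_assert(crosses_cache_line(61, 4), "bytes 61..64 span two lines");
static_assert(!crosses_cache_line(64, 4), "bytes 64..67 stay in one line");
```

At run time the same check can be applied to a pointer, for example `crosses_cache_line(reinterpret_cast<std::uintptr_t>(p), sizeof(*p))`.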

From Intel’s documentation:

A locked memory access to a split-line address (an address that crosses a cache line boundary) causes the processor to lock the bus for the duration of the locked access. […] Bus locking has serious performance implications for all agents on the bus.

The penalty is approximately 1000 CPU cycles. Unlike a normal atomic, which only serializes access to one cache line, a split lock stops the world.


Measuring the Difference

Aligned atomic increment (normal case):

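// Requires <atomic>, <chrono>, <cstddef>, and <cstdint>.
constexpr size_t niter = 1000000;  // iteration count, matching the perf output below
alignas(64) std::atomic<uint32_t> v{0};  // cache-line aligned, so it cannot straddle a boundary

auto start = std::chrono::steady_clock::now();
for (size_t i = 0; i < niter; ++i)
    v.fetch_add(1, std::memory_order_relaxed);
auto stop = std::chrono::steady_clock::now();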

Typical cost: ~6 ns per operation.

Split-lock atomic increment (crosses cache line boundary):

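// The buffer itself must be 64-byte aligned for the offset arithmetic to hold.
alignas(64) char buf[128] = {};
// Place the atomic at offset 61, three bytes before the second cache line.
// The 4-byte value covers bytes [61, 64) and [64, 65), so it spans two cache lines.
// Deliberately misaligned: formally undefined behavior in C++, used only to provoke a split lock.
auto* v = reinterpret_cast<std::atomic<uint32_t>*>(buf + 61);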

auto start = std::chrono::steady_clock::now();
for (size_t i = 0; i < niter; ++i)
    v->fetch_add(1, std::memory_order_relaxed);
auto stop = std::chrono::steady_clock::now();

Typical cost: ~1194 ns per operation, roughly 200× slower.

That is more than a local performance regression. Because the bus lock stalls every core, a single misaligned atomic in a hot path can saturate the memory bus and create visible latency spikes in unrelated threads on other cores.


Detecting Split Locks with perf

Intel CPUs expose a performance counter for split locks:

$ perf stat -e sq_misc.split_lock ./program

 Performance counter stats for './program':

         1,000,000      sq_misc.split_lock:u

       1.203341403 seconds time elapsed

One million split locks, one per loop iteration, exactly as expected.

Note: newer Intel microarchitectures (Ice Lake and later) can be configured to raise an #AC (alignment check) exception on split locks rather than silently serializing. Linux 5.8+ has kernel support for this (split_lock_detect=fatal), which delivers SIGBUS to the offending process and so turns silent performance bugs into immediate crashes during development.


Avoiding Split Locks

Use natural alignment. std::atomic<T> has at least alignof(T) alignment by default on most platforms, which is sufficient. A 4-byte atomic at a 4-byte-aligned address can never cross a 64-byte boundary: 64 is a multiple of 4, so the last aligned slot in a cache line covers offsets 60 to 63 and ends exactly at the boundary.
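The alignment argument can be checked exhaustively at compile time. The following is a sketch rather than a library facility: `aligned_never_straddles` is a hypothetical name, the 64-byte line size is an assumption, and the loop in a constexpr function needs C++14 or later.

```cpp
#include <cstddef>

// Returns true if no object of `size` bytes placed at a multiple of `size`
// can straddle a `line`-byte boundary. Walks every such offset in one line.
// Hypothetical helper; assumes size >= 1.
constexpr bool aligned_never_straddles(std::size_t size, std::size_t line = 64) {
    for (std::size_t off = 0; off < line; off += size) {
        if (off + size > line) return false;  // this slot would spill into the next line
    }
    return true;
}

static_assert(aligned_never_straddles(4), "4-byte slots end at offset 63");
static_assert(aligned_never_straddles(8), "8-byte slots end at offset 63");
static_assert(aligned_never_straddles(16), "16-byte slots end at offset 63");
```

The property holds for every power-of-two size up to the line size, which is why natural alignment is enough for the standard atomic integer types.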

Add alignas when in doubt:

struct alignas(64) Counter {
    std::atomic<uint32_t> value{0};
};

Watch struct layouts. An atomic member that lands across a cache-line boundary, typically in a packed struct or a hand-computed layout where natural alignment is no longer guaranteed, is the most common source of accidental split locks. Use offsetof or a static assertion to verify placement.
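A layout check of this kind might look like the sketch below. `Header`, `fits_in_one_line`, and `kCacheLine` are hypothetical names for illustration, and the 64-byte line size is an assumption. The struct is standard-layout, so offsetof is well defined for it.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed line size

// True if a member at `offset` with `size` bytes stays inside one cache line,
// given that the enclosing struct starts on a cache-line boundary.
constexpr bool fits_in_one_line(std::size_t offset, std::size_t size) {
    return offset / kCacheLine == (offset + size - 1) / kCacheLine;
}

// Hypothetical shared-memory header; the atomic sits at the very end of the line.
struct alignas(64) Header {
    char tag[60];
    std::atomic<std::uint32_t> counter;  // expected at offset 60, bytes 60..63
};

static_assert(offsetof(Header, counter) % alignof(std::atomic<std::uint32_t>) == 0,
              "counter must be naturally aligned");
static_assert(fits_in_one_line(offsetof(Header, counter),
                               sizeof(std::atomic<std::uint32_t>)),
              "counter must not straddle a cache line");
```

If someone later grows `tag` or marks the struct as packed, the build fails instead of the program silently taking split locks.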

Audit with perf before shipping any code that puts atomics in manually managed memory layouts (arenas, shared memory segments, DMA buffers).
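For manually managed memory, one option is to round the raw address up to the atomic's alignment before constructing it. This is a minimal sketch under stated assumptions: `align_up` and `place_counter` are hypothetical helpers, and the caller is assumed to own enough bytes at `raw` to absorb the rounding.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Rounds `addr` up to the next multiple of `alignment` (must be a power of two).
constexpr std::uintptr_t align_up(std::uintptr_t addr, std::size_t alignment) {
    return (addr + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
}

// Hypothetical helper: constructs an atomic counter at the first naturally
// aligned address at or after `raw`. The caller must guarantee at least
// sizeof(Counter) + alignof(Counter) - 1 usable bytes starting at `raw`.
inline std::atomic<std::uint32_t>* place_counter(void* raw) {
    using Counter = std::atomic<std::uint32_t>;
    const std::uintptr_t addr =
        align_up(reinterpret_cast<std::uintptr_t>(raw), alignof(Counter));
    assert(addr % alignof(Counter) == 0);  // naturally aligned, so no split lock
    return ::new (reinterpret_cast<void*>(addr)) Counter(0);
}
```

Calling `place_counter(buf + 61)` on a 64-byte-aligned buffer would construct the counter at offset 64 instead of straddling the boundary. The standard library's std::align offers similar rounding with bounds checking.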