yuqi-zheng

Implementing a Correct C++ Spinlock: TAS, TTAS, and PAUSE


The most common spinlock implementation on the internet is wrong in a subtle but measurable way. It spins on a read-modify-write operation, which forces the cache line into exclusive state on every iteration. Under contention, this generates cache coherency traffic proportional to the number of waiting threads —the opposite of what you want.


The Wrong Way: Test-and-Set (TAS)

struct tas_lock {
    std::atomic<bool> lock_{false};

    void lock() {
        while (lock_.exchange(true, std::memory_order_acquire))
            ;
    }

    void unlock() {
        lock_.store(false, std::memory_order_release);
    }
};

exchange is a read-modify-write operation. On x86, it compiles to XCHG, which requires the cache line in exclusive (modified) state. When N threads spin with exchange, each iteration causes N-1 cache line transfers —a quadratic amount of coherency traffic relative to the number of waiters.


The Right Way: Test-and-Test-and-Set (TTAS)

The fix is to spin on a plain load (which only requires shared state) and attempt the RMW only when the load suggests the lock may be available:

struct ttas_lock {
    std::atomic<bool> lock_{false};

    void lock() {
        for (;;) {
            // optimistic fast path: try to grab the lock immediately
            if (!lock_.exchange(true, std::memory_order_acquire))
                return;
            // slow path: spin on a read-only load until the lock looks free
            while (lock_.load(std::memory_order_relaxed))
                ;
        }
    }

    void unlock() {
        lock_.store(false, std::memory_order_release);
    }
};

Multiple threads reading the same cache line simultaneously is cheap —they all hold it in shared state. Only when the owning thread calls unlock() do the waiters race to acquire exclusive state, and exactly one wins. This reduces coherency traffic from O(N) per cycle to O(N) per lock acquisition.


Reducing Load-Store Unit Pressure: PAUSE

SMT (Hyper-Threading) shares execution resources between logical cores on the same physical core. A tight load loop starves the sibling logical core.

The PAUSE instruction (x86) / YIELD (ARM) signals to the CPU “I am spinning” —it introduces a short pipeline delay that reduces load on shared execution units and, on some microarchitectures, improves detection of memory-order violations in the speculative pipeline:

void lock() {
    for (;;) {
        if (!lock_.exchange(true, std::memory_order_acquire))
            return;
        while (lock_.load(std::memory_order_relaxed)) {
            __builtin_ia32_pause();  // x86
            // asm volatile("yield" ::: "memory");  // ARM
        }
    }
}

Benchmark: 4 cores / 8 threads, 100M lock-unlock pairs

ThreadsTTAS + PAUSE (ns/op)TAS (ns/op)
177
22843
4112205
6241509
8442854

At 8 threads, TAS is nearly 2脳 slower. The perf counters confirm why:

MetricTTAS + PAUSETAS
L1 dcache load misses27.2%47.7%
Atomic lock instructions161M552M

The TAS implementation issues 3.4脳 more atomic instructions and misses the cache 1.7脳 more often.


Complete Implementation

struct spinlock {
    std::atomic<bool> lock_{false};

    void lock() noexcept {
        for (;;) {
            if (!lock_.exchange(true, std::memory_order_acquire))
                return;
            while (lock_.load(std::memory_order_relaxed))
                __builtin_ia32_pause();
        }
    }

    bool try_lock() noexcept {
        // relaxed load first: avoids an unnecessary RMW when the lock is held
        return !lock_.load(std::memory_order_relaxed)
            && !lock_.exchange(true, std::memory_order_acquire);
    }

    void unlock() noexcept {
        lock_.store(false, std::memory_order_release);
    }
};

The try_lock implementation matters for while (!m.try_lock()) spin loops —without the initial relaxed check, each try_lock call issues an XCHG, turning it back into a TAS loop.


When Not to Use a Spinlock

Spinlocks are appropriate for critical sections that complete in tens of nanoseconds and where blocking is unacceptable (kernel code, interrupt handlers, real-time systems). For anything held longer, std::mutex is the right choice: the OS can deschedule waiting threads instead of burning CPU cycles.

Never use volatile as a spinlock primitive. It provides neither atomicity nor memory ordering.