yuqi-zheng

Memory Models, Caches, and Pipelines: The Hardware Behind Concurrency


Most concurrency bugs aren’t logic errors. They happen because the hardware underneath is trying to be fast — and fast means reordering, buffering, and caching, all of which silently break the sequential model that programmers expect. This article walks through the hardware mechanisms that cause memory reordering, the memory models that define what reordering is allowed, and the cache and pipeline architecture that makes all of it necessary.


Why Memory Reordering Happens

Programs are written as if each instruction executes in order. The CPU, however, is free to reorder memory operations as long as the result of a single-threaded execution is unchanged. For concurrent programs, this creates a gap: what one thread “sees” may not match the order in which another thread “wrote.”

Three hardware mechanisms cause this gap:

  1. Store buffers — hide write latency by buffering stores before they reach cache
  2. Invalidate queues — hide cache coherence latency by deferring invalidation processing
  3. Out-of-order execution — maximizes pipeline utilization by executing independent instructions early

Each one is an optimization that makes single-threaded code faster but breaks multi-threaded assumptions. Let’s examine them one by one.


Store Buffers: Hiding Write Latency

When a CPU executes a store, the target cache line must be in Exclusive or Modified state. If it isn’t — because another core holds it — the CPU must wait for a cache coherence round-trip to acquire ownership. On a multi-socket system, that round-trip can take hundreds of nanoseconds.

To avoid stalling, the CPU writes the store data into a store buffer and continues executing. The store is committed to the cache asynchronously. From the writing CPU’s perspective, the store is complete. From every other CPU’s perspective, it hasn’t happened yet.

This creates StoreStore reordering (two stores appear in a different order to other cores) and StoreLoad reordering (a later load completes before an earlier store is visible):

// Thread A (CPU 0)          // Thread B (CPU 1)
a = 1;                       while (b == 0)
b = 1;                           /* spin */;
                             assert(a == 1);  // may fail!

Thread A writes a, then b. Both stores enter the store buffer. If b’s cache line is locally owned (a hit) but a’s is not (a miss), b reaches cache first. Thread B sees b == 1 but a is still 0 — the assert fires.


Invalidate Queues: Hiding Coherence Latency

Under the MESI protocol, when CPU 0 writes to a cache line that CPU 1 holds in Shared state, CPU 0 must send an Invalidate message. CPU 1 must acknowledge it — but the acknowledgment takes time. If CPU 0 waits for every ACK before proceeding, it stalls.

The solution: CPU 1 places incoming invalidate messages into an invalidate queue and sends the ACK immediately. The invalidation is applied later, asynchronously.

This creates LoadLoad reordering (a later load sees a value that should have been invalidated) and LoadStore reordering:

// Thread A (CPU 0)          // Thread B (CPU 1)
a = 1;                       while (a == 0)
                                 /* spin */;
                             b = 1;

// Thread C (CPU 2)
while (b == 0)
    /* spin */;
assert(a == 1);  // may fail!

Thread B loads a, sees 1, then writes b = 1. But the invalidate for a is still sitting in CPU 2’s invalidate queue. Thread C loads b, sees 1, then loads a — and gets the stale value 0.


Out-of-Order and Speculative Execution

Modern CPUs are superscalar and out-of-order. Independent instructions can execute as soon as their operands are ready, regardless of program order. A load whose address is already in a register can execute before an earlier store whose address is still being computed.

Speculative execution extends this further: the CPU predicts the outcome of a branch and executes the predicted path speculatively. If the prediction is correct, the results are committed. If not, they are discarded.

From a memory-model perspective, speculative execution does not introduce new reordering categories beyond what the store buffer and invalidate queue already cause — the CPU must behave as if it executed instructions in program order. But speculatively executed loads can cause transient cache state changes that are visible through side channels (the basis of Spectre-style attacks).


Hardware Memory Models

The combination of store buffers, invalidate queues, and out-of-order execution creates a range of possible memory reorderings. A hardware memory model specifies which reorderings are allowed on a given architecture.

The four possible reorderings of loads and stores are:

ReorderingMeaning
StoreStoreTwo stores appear out of order to other cores
StoreLoadA load completes before an earlier store is visible
LoadLoadTwo loads return values out of order
LoadStoreA store becomes visible before an earlier load completes

x86/x86-64: Total Store Order (TSO)

x86 enforces a strong model. Only StoreLoad reordering is allowed:

ReorderingAllowed on x86?
StoreStoreNo
StoreLoadYes
LoadLoadNo
LoadStoreNo

The sole source of reordering on x86 is the store buffer. This is why most “obviously correct” concurrent code works on x86 but fails on ARM.

Exception: Non-temporal stores (e.g., _mm_stream_si128) and unaligned accesses can break TSO guarantees and require explicit fencing.

ARM/PowerPC: Weak Ordering

ARM and PowerPC allow all four reorderings:

ReorderingAllowed on ARM?
StoreStoreYes
StoreLoadYes
LoadLoadYes
LoadStoreYes

The hardware has maximum freedom to optimize memory accesses. Software must use explicit memory barriers or atomic operations with appropriate memory ordering to constrain reordering.


The C++ Memory Model: Mapping Software to Hardware

Different hardware models make portable concurrent code difficult. C11/C++11 solves this with a language memory model that provides a uniform abstraction. Programmers specify the required ordering strength, and the compiler emits the appropriate barriers for the target architecture.

Memory OrderCore GuaranteeTypical Use
relaxedAtomicity only, no ordering constraintsIndependent counters
consumeData-dependency ordering only (rarely used; often promoted to acquire)Pointer dereference after load
acquireNo subsequent memory access may be reordered before this loadLock acquisition
releaseNo prior memory access may be reordered after this storeLock release
acq_relBoth acquire and release semantics (for read-modify-write)Atomic fetch-and-add
seq_cstGlobal total order of all seq_cst operationsDistributed state synchronization

Mapping to Hardware Barriers

Memory Orderx86 BarrierARM BarrierCost
relaxedNoneNoneZero overhead
consumeNonedmb ishldLow (data dependency only)
acquireNone (TSO already prevents it)dmb ishMedium
releaseNone (TSO already prevents it)dmb ishMedium
acq_relmfencedmb ishHigh
seq_cstmfence + lock prefixdmb syHighest

On x86, acquire and release are free — the hardware already provides the necessary guarantees. On ARM, every ordering stronger than relaxed requires a dmb barrier.

Choosing the Right Memory Order

  1. Is the variable independent? If yes (e.g., a standalone counter), use relaxed.
  2. Do you need to synchronize non-atomic data? If yes, use acquire/release pairs — the releasing thread’s prior writes become visible to the acquiring thread.
  3. Do you need global consensus? Only then use seq_cst. This is rare — most synchronization patterns only need acquire/release.

Avoid defaulting to seq_cst. C++ atomics default to seq_cst for historical reasons, but 90% of use cases don’t need it, and on ARM the difference between seq_cst and acq_rel is significant.


Memory Barriers vs Compiler Barriers

Hardware Memory Barriers

MFENCE — full barrier. Serializes all memory operations (loads and stores):

mov [data], eax    ; store
mfence             ; no subsequent memory operation crosses this point
mov ebx, [flag]    ; load — guaranteed to see all stores before mfence

SFENCE — store barrier. Serializes stores only. Used with non-temporal stores and write-combining memory:

movntps [buffer], xmm0  ; non-temporal store
sfence                  ; all prior stores are visible before subsequent stores
mov [flag], 1           ; flag store — only ordered after the non-temporal store

LFENCE — load barrier. Serializes loads. Used primarily to constrain speculative execution (e.g., Spectre mitigations):

mov eax, [secret]  ; load
lfence             ; no subsequent instruction executes until this load completes
mov [result], eax  ; dependent store

Compiler Barriers

A compiler barrier prevents the compiler from reordering memory operations across the barrier, but does not prevent the CPU from doing so.

// C11/C++11
atomic_signal_fence(memory_order_seq_cst);

// GCC/Clang
__asm__ __volatile__("" ::: "memory");

// MSVC
_ReadWriteBarrier();

The Critical Distinction

A compiler barrier only prevents compile-time reordering. It emits no hardware instructions. A hardware memory barrier prevents both compile-time and runtime reordering. Using a compiler barrier when a hardware barrier is needed is a subtle but devastating bug — it works on weakly-ordered compilers but fails on hardware with aggressive out-of-order execution.


Cache Coherence and False Sharing

MESI in Brief

The MESI protocol maintains cache coherence across cores. Each cache line is in one of four states:

StateMeaning
ModifiedThis core owns the line; it has been written; memory is stale
ExclusiveThis core owns the line; it has not been written; memory is current
SharedMultiple cores hold the line read-only; memory is current
InvalidThis core does not have a valid copy

A write requires Exclusive or Modified ownership. If the line is Shared, the core must broadcast an Invalidate message and wait for ACKs before proceeding.

False Sharing

Two unrelated variables that happen to share a cache line will fight for ownership even though no logical data dependency exists:

struct Counters {
    std::atomic<int> a;  // thread 1 increments
    std::atomic<int> b;  // thread 2 increments
};

Both atomics are 4 bytes. They likely share the same 64-byte cache line. Every increment by thread 1 invalidates thread 2’s cache line and vice versa — a ping-pong of coherence traffic.

Fix: align each variable to a cache line boundary:

struct alignas(64) PaddedCounter {
    std::atomic<int> value;
};

struct Counters {
    PaddedCounter a;  // thread 1
    PaddedCounter b;  // thread 2
};

Or use std::hardware_destructive_interference_size (C++17) instead of hard-coding 64.

True Sharing

When multiple threads genuinely read and write the same variable, the coherence traffic is inherent — not false sharing. Solutions:

  • std::atomic for correctness (when performance is not critical)
  • thread_local copies with periodic aggregation for throughput

Cache Partitioning with Intel RDT/CAT

On a multi-core system, the L3 cache is shared. Background threads — logging, telemetry, compaction — can evict the hot cache lines of your latency-critical path, causing unpredictable misses.

Intel’s Cache Allocation Technology (CAT), part of Resource Director Technology (RDT), lets you partition the L3 cache into Classes of Service (COS). Each COS is defined by a bitmask specifying which cache ways it can use. You can bind critical cores to a dedicated COS, guaranteeing that background threads cannot pollute their cache space.

Setup with pqos

# Install the tools
sudo apt install -y intel-cmt-cat

# Verify CAT support
pqos -d

# View current configuration
pqos -s

# COS 1: allocate the upper 8 ways of a 12-way L3, plus 80% memory bandwidth
sudo pqos -e 'llc:1=0xff0;mba:1=80'

# COS 2: allocate the lower 4 ways, 50% memory bandwidth
sudo pqos -e 'llc:2=0x0f;mba:2=50'

# Bind cores 2 and 3 to COS 1 (low-latency trading thread)
sudo pqos -a 'llc:1=2,3;mba:1=2,3'

# Monitor LLC occupancy and memory bandwidth for cores 2,3
sudo pqos -i 1 -m 'llc,mbl,mbr:2,3'

CAT is not a general-purpose tool — it is specifically for isolating latency-critical workloads on shared hardware. For broader system-level isolation (CPU pinning, IRQ routing, NUMA binding), see the low-latency tuning guide.


CPU Pipeline Stalls and ILP

Modern CPUs divide instruction execution into pipeline stages — typically Fetch (IF), Decode (ID), Execute (EX), Memory (MEM), Writeback (WB). Under ideal conditions, one instruction completes per clock cycle (the single-instruction-per-cycle model).

Stalls break this ideal. Two types of instruction-level dependencies cause them:

  1. Data dependency — a later instruction needs the result of an earlier one:
a = b + c;   // writes a
d = a + e;   // reads a — must wait for the previous instruction to write back

The d = a + e instruction cannot enter the Execute stage until a = b + c completes Writeback. This is a pipeline bubble.

  1. Control dependency — a branch instruction’s target is unknown until the branch executes:
if (x > 0)        // branch — target unknown during Fetch
    y = a + b;     // may or may not execute

Until the branch resolves, the pipeline cannot fetch the correct next instruction. Branch prediction mitigates this, but a misprediction flushes the pipeline — a penalty of 15–20 cycles on modern CPUs.

Three Principles for Pipeline-Friendly Code

  1. Break long dependency chains. Instead of a serial chain of dependent operations, restructure to allow independent computations to overlap. For example, computing price * quantity for N orders in parallel rather than accumulating a running total.

  2. Keep hot data in registers. Every memory access is a potential cache miss. Values that are used repeatedly should live in registers — the compiler does this when it can, but pointer aliasing, function calls, and volatile accesses force it to spill.

  3. Mix instruction types. Modern CPUs have multiple execution units — integer ALUs, FPUs, load/store units, vector units. A sequence of same-type instructions saturates one unit while others sit idle. Interleaving integer, floating-point, and memory operations lets multiple units work in parallel, maximizing instruction-level parallelism (ILP).

For complex dependency graphs across a system, topological sorting can identify the critical path and expose opportunities for parallelism.


Conclusion

Memory ordering, cache coherence, and pipeline efficiency are not independent topics — they are three views of the same hardware reality. Store buffers exist because writing to cache is slow. Invalidate queues exist because cache coherence is slow. Out-of-order execution exists because pipelines must stay full. Memory models define the rules of the game; cache coherence implements them; pipelines depend on all of it.

The practical takeaway: start with acquire/release for synchronization, relaxed for independent counters, and reach for seq_cst only when you can prove you need it. Measure with perf. And if you’re on x86, count yourself lucky — the hardware does most of the heavy lifting for you.