yuqi-zheng

Optimizing a Ring Buffer for Throughput


A single-producer single-consumer (SPSC) ring buffer is the canonical wait-free queue. Despite being conceptually simple, the naive implementation suffers from a cache coherency bottleneck that limits throughput to roughly 5.5 million operations per second on modern hardware. This article shows how one structural change raises that to 112 million ops/s —a 20x improvement that beats both boost::lockfree::spsc_queue and folly::ProducerConsumerQueue.

The complete optimized implementation is available as SPSCQueue.h.


The data structure

The queue stores elements in a contiguous array. Two atomic indices mark the read and write positions. The critical layout requirement is that the indices must be on separate cache lines to prevent false sharing:

struct ringbuffer {
  std::vector<int> data_;
  alignas(64) std::atomic<size_t> readIdx_{0};
  alignas(64) std::atomic<size_t> writeIdx_{0};

  ringbuffer(size_t capacity) : data_(capacity, 0) {}
};

alignas(64) matches the cache line width on x86-64 and ARM. The more portable alternative is std::hardware_destructive_interference_size, available since C++17.

One slot is wasted to distinguish full from empty: the queue is full when (writeIdx + 1) % capacity == readIdx.

This layout is essentially identical to the io_uring ring used by the Linux kernel:

struct io_uring {
  u32 head ____cacheline_aligned_in_smp;
  u32 tail ____cacheline_aligned_in_smp;
};

Push and pop

bool push(int val) {
  auto const writeIdx = writeIdx_.load(std::memory_order_relaxed);
  auto nextWriteIdx = writeIdx + 1;
  if (nextWriteIdx == data_.size()) nextWriteIdx = 0;
  if (nextWriteIdx == readIdx_.load(std::memory_order_acquire))
    return false;
  data_[writeIdx] = val;
  writeIdx_.store(nextWriteIdx, std::memory_order_release);
  return true;
}

bool pop(int &val) {
  auto const readIdx = readIdx_.load(std::memory_order_relaxed);
  if (readIdx == writeIdx_.load(std::memory_order_acquire))
    return false;
  val = data_[readIdx];
  auto nextReadIdx = readIdx + 1;
  if (nextReadIdx == data_.size()) nextReadIdx = 0;
  readIdx_.store(nextReadIdx, std::memory_order_release);
  return true;
}

The element type is int (4 bytes) to isolate the queue’s own overhead from copy cost. For large elements, optimizing the queue itself yields diminishing returns.


Baseline performance

Benchmark: transfer 100 million integers between two threads through a queue of capacity 100,000. Compiled with -O3 -march=native -std=c++20 on an AMD Ryzen 9 3900X, with each thread pinned to a different CCX.

5,513,850 ops/s

Performance counter stats for './a.out 1 11':
  349,857,603 cache-misses:u   # 91.228% of all cache refs
  383,497,078 cache-references:u
    6,673,242 l2_request_g1.change_to_x:u

18.137 seconds time elapsed

100 million transfers, 350 million cache misses —roughly 3.5 misses per round trip. This matches the throughput of boost::lockfree::spsc_queue and folly::ProducerConsumerQueue, both of which have the same structural problem.


Why so many cache misses?

The MESI protocol allows multiple cores to hold a cache line in Shared state simultaneously, but requires Exclusive ownership before a write can proceed. Every transition from Shared to Exclusive is a cache miss.

Consider what happens on each push:

  1. The writer loads writeIdx (its own variable) —likely a hit.
  2. The writer loads readIdx to check for full —this line is Exclusive on the reader side, so the writer must request ownership, causing a miss.
  3. The writer stores the new writeIdx —the reader immediately needs to see this, so the reader requests Exclusive ownership next time, causing another miss.

Three cache misses per operation, each one requiring a round-trip across the interconnect (hundreds of nanoseconds on a multi-CCX Ryzen).


The fix: cache the remote index locally

Each thread reads the other’s index only to determine whether the queue is empty or full. In the common case —the queue is neither empty nor full —that remote read is unnecessary. Cache the last known value of the remote index and only refresh it when the cached value indicates the queue is at a boundary:

struct ringbuffer2 {
  std::vector<int> data_{};
  alignas(64) std::atomic<size_t> readIdx_{0};
  alignas(64) size_t writeIdxCached_{0};   // reader's cached copy of writeIdx
  alignas(64) std::atomic<size_t> writeIdx_{0};
  alignas(64) size_t readIdxCached_{0};    // writer's cached copy of readIdx

  ringbuffer2(size_t capacity) : data_(capacity, 0) {}
};

Each cached variable lives on its own cache line, owned exclusively by the thread that writes it.

bool push(int val) {
  auto writeIdx = writeIdx_.load(std::memory_order_relaxed);
  auto nextWriteIdx = (writeIdx + 1) % data_.size();

  if (nextWriteIdx == readIdxCached_) {
    readIdxCached_ = readIdx_.load(std::memory_order_acquire);
    if (nextWriteIdx == readIdxCached_) return false;
  }

  data_[writeIdx] = val;
  writeIdx_.store(nextWriteIdx, std::memory_order_release);
  return true;
}

bool pop(int &val) {
  auto readIdx = readIdx_.load(std::memory_order_relaxed);

  if (readIdx == writeIdxCached_) {
    writeIdxCached_ = writeIdx_.load(std::memory_order_acquire);
    if (readIdx == writeIdxCached_) return false;
  }

  val = data_[readIdx];
  readIdx_.store((readIdx + 1) % data_.size(), std::memory_order_release);
  return true;
}

The cross-core acquire/release pair now happens only when the producer catches up to the consumer or vice versa. In steady state —where the producer stays ahead of the consumer by a few slots —the hot loop touches only cache lines that are locally owned.


Optimized performance

112,287,037 ops/s

Performance counter stats for './a.out 1 11':
  15,010,392 cache-misses:u   # 32.55% of all cache refs
  46,110,663 cache-references:u
   6,781,273 l2_request_g1.change_to_x:u

0.892 seconds time elapsed
MetricBeforeAfterImprovement
Throughput5.5M ops/s112M ops/s20x
Cache misses350M15M95% reduction
Elapsed time18.1 s0.9 s20x

Further optimizations

Batching. Reading or writing multiple elements per synchronization point amortizes the cost of the acquire/release pair further. This is the approach taken by Linux’s io_uring submission and completion rings.

Huge pages. For large queues, TLB pressure on the data array becomes significant. Backing data_ with 2 MiB huge pages via a custom allocator or madvise(MADV_HUGEPAGE) reduces TLB misses substantially.

Power-of-two capacity. Forcing capacity to a power of two replaces the modulo with a bitwise AND, eliminating a division on every index advance.