yuqi-zheng

Virtual Memory and Latency: A Practical Guide for Low-Latency Systems


Virtual memory is invisible until it isn’t. For most applications, page faults and TLB misses are a background tax paid during warm-up. For latency-sensitive systems (high-frequency trading, real-time audio, control systems), they appear as unpredictable spikes that violate tail-latency requirements. This guide covers each source and how to eliminate it.


The Short List

  • Minimize page faults by pre-faulting, locking memory, and disabling swap.
  • Reduce TLB misses by tightening working-set size and using huge pages.
  • Avoid TLB shootdowns by not modifying page tables after startup.
  • Avoid writable file-backed mappings that trigger page cache writeback stalls.
  • Disable Transparent Huge Pages (THP): the background compaction daemon introduces latency spikes.
  • Disable KSM: kernel same-page merging locks page tables.
  • Disable automatic NUMA balancing: page migration triggers TLB shootdowns.

Page Faults

The kernel defers physical memory allocation until a page is accessed (demand paging). Two categories matter for latency:

Major faults: the required page is not in RAM and must be read from disk. Cost is comparable to an uncached read syscall: hundreds of microseconds to milliseconds, depending on the storage device.

Minor faults: the page is in RAM (or the page cache) but the page table entry has not been established yet. The kernel updates the PTE and returns. This still requires entering the kernel and may contend on page table locks in multithreaded programs.

Anonymous memory (from malloc, mmap(MAP_ANONYMOUS)) incurs a minor fault on first write: the kernel allocates a zero-filled page and wires it up.
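The first-write behaviour can be observed from inside a process. The following is a minimal sketch, assuming Linux and a kernel that reports ru_minflt through getrusage(); the function names minor_faults and faults_on_first_touch are hypothetical. It maps anonymous memory, writes one byte per page, and reports how many minor faults those writes caused.

```cpp
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>
#include <cstddef>

// Minor faults taken by this process so far, as reported by getrusage().
static long minor_faults() {
    rusage usage{};
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_minflt;
}

// Maps `pages` anonymous pages, writes one byte to each, and returns how
// many minor faults the writes caused. Returns -1 if mmap fails.
long faults_on_first_touch(std::size_t pages) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t size = pages * page;
    void* raw = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return -1;

    volatile char* mem = static_cast<volatile char*>(raw);
    const long before = minor_faults();
    for (std::size_t i = 0; i < pages; ++i) mem[i * page] = 1;  // first write to each page
    const long after = minor_faults();

    munmap(raw, size);
    return after - before;
}
```

With standard 4 KiB pages the result is usually close to the number of pages touched; transparent huge pages or a sandboxed kernel can make it smaller.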

Eliminating Page Faults

Pre-fault all memory before the hot path:

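// MAP_POPULATE asks the kernel to fault in every page during mmap itself,
// so this mapping needs no separate touch loop. Needs <sys/mman.h> and <cstdio>.
char* mem = (char*)mmap(nullptr, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
if (mem == MAP_FAILED) perror("mmap");  // always check: mmap reports failure as MAP_FAILED, not nullptr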

Or use mlock() to both lock pages in RAM and trigger pre-faulting:

// Prevents swap-out and pre-faults the range; fails if RLIMIT_MEMLOCK is too low.
if (mlock(ptr, size) != 0) perror("mlock");
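For memory that was not mapped with MAP_POPULATE (heap allocations, for example), the same effect can be had by touching each page once at startup. The sketch below assumes Linux; prefault and lock_all_memory are hypothetical helper names, not part of any library.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Writes one byte per page so every page in [ptr, ptr + size) is backed
// by physical memory before the hot path runs. Existing contents are kept.
void prefault(void* ptr, std::size_t size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    volatile char* p = static_cast<volatile char*>(ptr);
    for (std::size_t off = 0; off < size; off += page) p[off] = p[off];
}

// Locks all current and future mappings of the process into RAM.
// Returns true on success; fails when RLIMIT_MEMLOCK is too low and the
// process lacks CAP_IPC_LOCK.
bool lock_all_memory() {
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}
```

Calling lock_all_memory() early in main() also covers stacks and allocations made later, at the cost of making every future mapping resident immediately.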

Disable swap system-wide to prevent any anonymous page from being evicted:

swapoff -a

Disable automatic NUMA balancing, which migrates pages and causes extra faults:

echo 0 > /proc/sys/kernel/numa_balancing

Monitoring

perf stat -e faults,minor-faults,major-faults ./program

TLB Misses

The TLB caches virtual-to-physical address translations. It has limited capacity: on recent x86 cores, typically around 64 L1 entries (4 KiB pages) and 1536 L2 entries (mixed sizes). A miss requires a hardware page table walk, which can cost tens of nanoseconds.
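To see why capacity matters, it helps to compute the TLB reach: the amount of memory that can be addressed without a miss. The arithmetic below uses the entry counts quoted in this section; the last line assumes, for illustration only, that every L2 entry could hold a 2 MiB translation. The name tlb_reach is hypothetical.

```cpp
#include <cstdint>

// Address range covered by `entries` TLB entries of `page_bytes` each.
constexpr std::uint64_t tlb_reach(std::uint64_t entries, std::uint64_t page_bytes) {
    return entries * page_bytes;
}

constexpr std::uint64_t KiB = 1024, MiB = 1024 * KiB;

static_assert(tlb_reach(64, 4 * KiB) == 256 * KiB);     // L1 with 4 KiB pages
static_assert(tlb_reach(1536, 4 * KiB) == 6 * MiB);     // L2 with 4 KiB pages
static_assert(tlb_reach(1536, 2 * MiB) == 3072 * MiB);  // L2 if all entries mapped 2 MiB pages
```

A hot working set larger than a few megabytes therefore cannot stay within TLB reach on 4 KiB pages.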

Reducing TLB Pressure

Shrink the working set. Every hot structure that fits in fewer pages leaves more TLB capacity for other data. Prefer dense, compact layouts over pointer-chasing.

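Use huge pages. A single 2 MiB TLB entry covers the same address space as 512 standard 4 KiB entries. For large arrays and buffers, this can cut the miss rate substantially. MAP_HUGETLB draws from a pool of huge pages that must be reserved in advance (for example through /proc/sys/vm/nr_hugepages); the mapping fails with ENOMEM if the pool is too small.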

void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
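Because MAP_HUGETLB fails when no huge pages are reserved, startup code usually needs a fallback. The sketch below assumes a 2 MiB huge page size and Linux; Mapping, map_buffer, and round_up_huge are hypothetical names. It tries explicit huge pages first and falls back to pre-populated standard pages.

```cpp
#include <sys/mman.h>
#include <cstddef>

constexpr std::size_t kHugePage = 2u * 1024 * 1024;  // assumed 2 MiB huge page size

// Rounds `size` up to a whole number of huge pages, since a MAP_HUGETLB
// length should be a multiple of the huge page size.
constexpr std::size_t round_up_huge(std::size_t size) {
    return (size + kHugePage - 1) / kHugePage * kHugePage;
}

struct Mapping {
    void* addr;        // nullptr if both attempts failed
    std::size_t size;  // mapped length in bytes
    bool huge;         // true if backed by explicit huge pages
};

// Tries an explicit huge page mapping first, then falls back to
// pre-populated 4 KiB pages so the caller still gets usable memory.
Mapping map_buffer(std::size_t size) {
    const std::size_t len = round_up_huge(size);
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE, -1, 0);
    if (p != MAP_FAILED) return {p, len, true};

    p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p != MAP_FAILED) return {p, len, false};
    return {nullptr, 0, false};
}
```

Logging the huge field at startup makes it obvious when a deployment is silently running on 4 KiB pages.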

Monitoring

perf stat -e dTLB-loads,dTLB-load-misses ./program

As a rough rule of thumb, if dTLB-load-misses / dTLB-loads exceeds about 5%, huge pages are likely to help.


TLB Shootdowns

CPUs do not keep TLBs coherent with each other. When the kernel modifies a page table (permission change, unmapping), it must send an inter-processor interrupt (IPI) to every other core that may hold translations for that address space so that they invalidate their stale TLB entries. This is a TLB shootdown.

The receiving core must interrupt whatever it was doing, enter the kernel, flush the relevant TLB entries, and return. For real-time workloads this is an interruption the application can neither schedule nor bound.

What Triggers Shootdowns

  • munmap, mprotect (explicit)
  • free() in glibc, which may call munmap or madvise(MADV_DONTNEED), or shrink the heap, internally
  • Transparent Huge Pages (background compaction by khugepaged)
  • Kernel Same-page Merging (KSM)
  • Automatic NUMA balancing (page migration)
  • Memory compaction (kcompactd)

Eliminating Shootdowns

Map all needed memory at startup and never unmap it. This requires either a custom allocator that does not return memory to the OS, or a long-lived memory pool.
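A long-lived pool can be as simple as a bump allocator over one mapping made at startup. The sketch below is illustrative only: StartupArena is a hypothetical class, it is not thread-safe, and it assumes Linux. Its point is that allocate() and reset() never touch page tables.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// A fixed arena mapped once at startup. allocate() only advances an
// offset, so the hot path never calls mmap, munmap, or mprotect.
class StartupArena {
public:
    explicit StartupArena(std::size_t capacity) : capacity_(capacity) {
        void* p = mmap(nullptr, capacity_, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        base_ = (p == MAP_FAILED) ? nullptr : static_cast<char*>(p);
    }
    StartupArena(const StartupArena&) = delete;
    StartupArena& operator=(const StartupArena&) = delete;
    // Deliberately no munmap in a destructor: the mapping lives until exit.

    // Returns `size` bytes aligned to `align` (a power of two), or nullptr
    // if the arena is exhausted or was never mapped.
    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        if (base_ == nullptr) return nullptr;
        const std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > capacity_ || size > capacity_ - start) return nullptr;
        used_ = start + size;
        return base_ + start;
    }

    // Makes the whole arena reusable without touching page tables.
    void reset() { used_ = 0; }

private:
    char* base_ = nullptr;
    std::size_t capacity_;
    std::size_t used_ = 0;
};
```

Objects with per-request lifetime fit this pattern well: allocate during the request, then call reset() once the request completes.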

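For allocators, mimalloc can be configured through environment variables to avoid returning memory (option names differ between mimalloc releases, so check the documentation for the version in use):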

MIMALLOC_RESERVE_HUGE_OS_PAGES=4
MIMALLOC_PAGE_RESET=0

Monitoring

grep -E 'TLB|CPU' /proc/interrupts
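The same counter can be sampled from inside the application, for example before and after a critical section. This sketch assumes the x86 Linux layout of /proc/interrupts, where a line beginning with "TLB:" carries one counter per CPU; sum_tlb_line and total_tlb_shootdowns are hypothetical names.

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

// Sums the per-CPU counters on a "TLB:" line of /proc/interrupts.
// Returns -1 if the line is not a TLB line.
std::int64_t sum_tlb_line(const std::string& line) {
    std::istringstream in(line);
    std::string token;
    if (!(in >> token) || token != "TLB:") return -1;
    std::int64_t total = 0;
    while (in >> token) {
        // Stop at the trailing description ("TLB shootdowns").
        if (token.find_first_not_of("0123456789") != std::string::npos) break;
        total += std::stoll(token);
    }
    return total;
}

// Total TLB shootdown interrupts across all CPUs, or -1 if the file has
// no TLB line (for example on non-x86 systems) or cannot be read.
std::int64_t total_tlb_shootdowns(const char* path = "/proc/interrupts") {
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        const std::int64_t n = sum_tlb_line(line);
        if (n >= 0) return n;
    }
    return -1;
}
```

The counter is system-wide, so a rising value only points at the application if nothing else on the machine is modifying page tables. Reading the file is itself a syscall and belongs outside the hot path.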

Page Cache Writeback Stalls

Writable file-backed mappings (mmap of a regular file with PROT_WRITE) introduce a subtle hazard: when the kernel flushes dirty pages to disk, it temporarily marks those pages read-only. If the application writes to such a page during this window, the kernel must handle a page fault and wait for disk I/O before the write can proceed.

On a Ryzen 3900X with an NVMe SSD, this produces latency spikes of up to 777 microseconds: invisible in average benchmarks, catastrophic for tail latency.

The fix: do not use writable file-backed mappings in hot paths. Use MAP_ANONYMOUS or map files on tmpfs / hugetlbfs (which have no disk backing and no writeback).
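One way to apply this is to stage hot-path writes in anonymous memory and copy them to the file from a non-critical context. The sketch below is an assumption about how that could look, not a method described in this guide; StagedLog is a hypothetical class and is written for single-threaded use (sharing it between threads would need synchronization).

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

// Hot-path writes land in anonymous memory, which has no writeback.
// A non-critical context later copies the bytes to a file with pwrite().
class StagedLog {
public:
    explicit StagedLog(std::size_t capacity) : capacity_(capacity) {
        void* p = mmap(nullptr, capacity_, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        buf_ = (p == MAP_FAILED) ? nullptr : static_cast<char*>(p);
    }

    // Hot path: a bounds check and a memcpy, no syscalls.
    bool append(const void* data, std::size_t len) {
        if (buf_ == nullptr || len > capacity_ - used_) return false;
        std::memcpy(buf_ + used_, data, len);
        used_ += len;
        return true;
    }

    // Cold path: writes the bytes not yet flushed to `fd` at the same
    // offsets they occupy in the buffer. Returns false on I/O error.
    bool flush(int fd) {
        while (flushed_ < used_) {
            const ssize_t n = pwrite(fd, buf_ + flushed_, used_ - flushed_,
                                     static_cast<off_t>(flushed_));
            if (n <= 0) return false;
            flushed_ += static_cast<std::size_t>(n);
        }
        return true;
    }

private:
    char* buf_ = nullptr;
    std::size_t capacity_;
    std::size_t used_ = 0;
    std::size_t flushed_ = 0;
};
```

The trade-off is durability: data that has been appended but not yet flushed is lost if the process crashes.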

To audit existing mappings:

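grep ' .w.s ' /proc/<pid>/maps

Here <pid> is the ID of the process to audit; /proc/self/maps would show the mappings of the grep process itself. Any entry with w (writable) and s (shared) whose path column names a regular file on a disk-backed filesystem is a potential writeback hazard. Shared mappings of tmpfs or hugetlbfs files also show s but have no disk writeback.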
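A process can also run this audit on itself at startup and refuse to enter the hot path if it finds a hazard. The sketch below assumes the standard Linux /proc/<pid>/maps line format (address range, then a four-character permission field); is_writable_shared and writable_shared_mappings are hypothetical names.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Returns true if a maps line describes a writable, shared mapping,
// i.e. its permission field has 'w' in position 2 and 's' in position 4.
bool is_writable_shared(const std::string& line) {
    std::istringstream in(line);
    std::string range, perms;
    if (!(in >> range >> perms) || perms.size() != 4) return false;
    return perms[1] == 'w' && perms[3] == 's';
}

// Collects the writable shared mappings of the calling process.
std::vector<std::string> writable_shared_mappings(const char* path = "/proc/self/maps") {
    std::vector<std::string> hits;
    std::ifstream maps(path);
    std::string line;
    while (std::getline(maps, line)) {
        if (is_writable_shared(line)) hits.push_back(line);
    }
    return hits;
}
```

Each returned line still has to be checked by path: only mappings of regular files on disk-backed filesystems are subject to writeback.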


Transparent Huge Pages (THP)

THP automatically promotes ranges of standard pages to 2 MiB huge pages. While this reduces TLB misses, the background daemon khugepaged scans memory and performs promotions, which requires page table modifications and TLB shootdowns. Memory compaction (kcompactd) runs when contiguous physical memory is unavailable.

These background operations appear as random latency spikes with no obvious user-code trigger.

Disable THP:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

Then use explicit MAP_HUGETLB allocations where huge pages are genuinely beneficial.
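When changing the system-wide setting is not possible, Linux (3.15 and later) also offers a per-process switch through prctl. The sketch below uses the real PR_SET_THP_DISABLE and PR_GET_THP_DISABLE operations; the wrapper names are hypothetical. Note that this only affects the calling process: khugepaged and compaction can still run on behalf of other processes.

```cpp
#include <sys/prctl.h>

// Asks the kernel not to use transparent huge pages for this process.
// Returns true on success. The flag is inherited by children created
// with fork() and preserved across execve().
bool disable_thp_for_process() {
    return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) == 0;
}

// Returns 1 if THP is disabled for this process, 0 if not, -1 on error.
int thp_disabled() {
    return prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
}
```

Calling disable_thp_for_process() before any large allocation keeps the setting consistent for the lifetime of the process.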


NUMA

On multi-socket systems, memory latency depends on which NUMA node owns the physical page. Linux automatic NUMA balancing migrates pages toward the node of the accessing core, but migration involves page table modification and TLB shootdowns.

Explicitly pin memory to the correct NUMA node at allocation time:

numactl --membind=0 --cpunodebind=0 ./program

Or use set_mempolicy / mbind in code. Disable automatic balancing:

echo 0 > /proc/sys/kernel/numa_balancing
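For per-buffer placement in code, mbind can be called on a range before its pages are first touched. The sketch below invokes the system call directly so that it does not depend on libnuma; bind_to_node and kMpolBind are hypothetical names, and the constant 2 is assumed to match MPOL_BIND in <linux/mempolicy.h>. In real code, prefer the libnuma header <numaif.h> where it is available.

```cpp
#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>
#include <cstddef>

// Value of MPOL_BIND from <linux/mempolicy.h>, repeated here so the
// sketch does not need libnuma or kernel headers (an assumption).
constexpr unsigned long kMpolBind = 2;

// Restricts future page allocations in [addr, addr + len) to NUMA node
// `node`. `addr` must be page-aligned. Returns 0 on success, or the errno
// value on failure (for example ENOSYS on kernels built without NUMA).
int bind_to_node(void* addr, std::size_t len, unsigned node) {
    constexpr unsigned kBits = sizeof(unsigned long) * 8;
    if (node >= kBits) return EINVAL;
    unsigned long mask = 1UL << node;
    // Following libnuma's convention, maxnode is the bit count plus one.
    const long rc = syscall(SYS_mbind, addr, len, kMpolBind, &mask,
                            static_cast<unsigned long>(kBits + 1), 0UL);
    return rc == 0 ? 0 : errno;
}
```

Pages already faulted in are not moved by this call, so it belongs between mmap and the pre-fault step.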
