Using Huge Pages on Linux

Applications that perform random memory accesses over a large working set are often bottlenecked not by cache misses alone but also by TLB misses. The TLB (Translation Lookaside Buffer) is the CPU’s cache for virtual-to-physical address mappings. With the default 4 KiB page size, a TLB with 2048 entries covers only 8 MiB of working set. Switching to 2 MiB pages extends that coverage to 4 GiB, a 512x improvement in TLB reach.
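
The reach figures above are just entries multiplied by page size. As a small sketch (the helper name tlb_reach is made up for illustration, and the 2048-entry count is the one quoted above), the arithmetic can be checked at compile time:

```cpp
#include <cstdint>

// Bytes of address space a TLB can map without missing:
// number of entries times the page size each entry covers.
constexpr std::uint64_t tlb_reach(std::uint64_t entries, std::uint64_t page_size) {
  return entries * page_size;
}

constexpr std::uint64_t KiB = 1024, MiB = 1024 * KiB, GiB = 1024 * MiB;

// 2048 entries is the L2 TLB size assumed in the text.
static_assert(tlb_reach(2048, 4 * KiB) == 8 * MiB, "4 KiB pages: 8 MiB reach");
static_assert(tlb_reach(2048, 2 * MiB) == 4 * GiB, "2 MiB pages: 4 GiB reach");
static_assert((2 * MiB) / (4 * KiB) == 512, "each entry covers 512x more memory");
```

The 512x factor is simply the ratio of the two page sizes, so it holds for any TLB size.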

Determining whether you need huge pages

The following benchmark fills a large hash table. Because the keys are hashed with CRC32, each insertion lands in an effectively random slot, a workload that puts heavy pressure on the TLB.

#include <absl/container/flat_hash_map.h>
#include <nmmintrin.h>  // _mm_crc32_u64 (SSE4.2)

#include <cstddef>      // size_t

int main() {
  struct hash {
    size_t operator()(size_t h) const noexcept {
      return _mm_crc32_u64(0, h);
    }
  };

  size_t iters = 10000000;
  absl::flat_hash_map<size_t, size_t, hash> ht;
  ht.reserve(iters);
  for (size_t i = 0; i < iters; ++i) ht.try_emplace(i, i);
}

Compiled with -O3 -mavx -DNDEBUG and profiled on an AMD Ryzen 3900X:

Performance counter stats for './a.out':

          70,080      faults:u
      20,802,877      dTLB-loads:u
      19,436,707      dTLB-load-misses:u   # 93.43% of all dTLB cache hits
      32,872,323      cache-misses:u       # 52.279% of all cache refs
      62,878,289      cache-references:u

     0.708913859 seconds time elapsed

93% of data TLB accesses miss. The working set exceeds the 8 MiB covered by the L2 TLB at 4 KiB pages. Almost every LLC miss is accompanied by a TLB miss, which requires a page table walk through cache or main memory.

Rule of thumb: if dTLB-load-misses / dTLB-loads is above roughly 5-10% and your working set is larger than a few megabytes, huge pages are likely to help.
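
As a rough sketch, the rule of thumb can be written as a small predicate over the two perf counters. The function names, the 5% default threshold (the lower end of the range above), and the 8 MiB default reach (the 4 KiB figure quoted earlier) are illustrative assumptions, not fixed constants:

```cpp
#include <cstdint>

// Fraction of data TLB loads that missed, computed from two perf counters.
inline double dtlb_miss_ratio(std::uint64_t misses, std::uint64_t loads) {
  return loads == 0 ? 0.0
                    : static_cast<double>(misses) / static_cast<double>(loads);
}

// Hypothetical helper encoding the rule of thumb. Both defaults are
// assumptions: tune them for your CPU and workload.
inline bool likely_benefits_from_huge_pages(
    std::uint64_t misses, std::uint64_t loads, std::uint64_t working_set_bytes,
    double threshold = 0.05, std::uint64_t tlb_reach_bytes = 8ull << 20) {
  return dtlb_miss_ratio(misses, loads) > threshold &&
         working_set_bytes > tlb_reach_bytes;
}
```

Fed the counters from the run above (19,436,707 misses out of 20,802,877 loads), dtlb_miss_ratio gives about 0.934, far above either end of the threshold range.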

Approach 1: Transparent Huge Pages with madvise

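Linux’s Transparent Huge Pages (THP) feature lets the kernel promote 4 KiB pages to 2 MiB pages automatically. In madvise mode (the default on many distributions; the active mode is shown in brackets in /sys/kernel/mm/transparent_hugepage/enabled), promotion only happens for regions explicitly tagged with madvise(MADV_HUGEPAGE):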

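#include <stdlib.h>    // posix_memalign
#include <sys/mman.h>  // madvise, MADV_HUGEPAGE

#include <cstddef>     // std::size_t
#include <cstdlib>     // std::free
#include <limits>      // std::numeric_limits
#include <new>         // std::bad_alloc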

template <typename T>
struct thp_allocator {
  static constexpr std::size_t huge_page_size = 1 << 21; // 2 MiB
  using value_type = T;

  thp_allocator() = default;
  template <class U>
  constexpr thp_allocator(const thp_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
      throw std::bad_alloc();
    void* p = nullptr;
    if (posix_memalign(&p, huge_page_size, n * sizeof(T)) != 0)
      throw std::bad_alloc();
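    // Best-effort hint: if madvise fails, the memory is still usable and
    // simply stays backed by 4 KiB pages, so the return value is ignored.
    madvise(p, n * sizeof(T), MADV_HUGEPAGE);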
    return static_cast<T*>(p);
  }

  void deallocate(T* p, std::size_t) { std::free(p); }
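};

// The allocator is stateless, so any two instances compare equal.
template <class T, class U>
bool operator==(const thp_allocator<T>&, const thp_allocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const thp_allocator<T>&, const thp_allocator<U>&) noexcept { return false; }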

The posix_memalign call aligns the allocation to the 2 MiB huge page boundary, which improves the kernel’s ability to back it with a huge page. The subsequent madvise tags the region as a candidate for promotion.
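
Whether the madvise hint has any effect depends on the system-wide THP mode, which the kernel exposes in /sys/kernel/mm/transparent_hugepage/enabled as a line such as "always [madvise] never", with the active mode in brackets. A minimal sketch for reading it follows; the function names are made up for illustration:

```cpp
#include <fstream>
#include <string>

// Extract the active mode, which the kernel marks with square brackets,
// e.g. "always [madvise] never" -> "madvise". Returns "" if none is found.
inline std::string active_thp_mode(const std::string& line) {
  const auto open = line.find('[');
  if (open == std::string::npos) return "";
  const auto close = line.find(']', open + 1);
  if (close == std::string::npos) return "";
  return line.substr(open + 1, close - open - 1);
}

// Read the system-wide setting. Returns "" if the file is missing,
// for example on a kernel built without THP support.
inline std::string read_thp_mode() {
  std::ifstream f("/sys/kernel/mm/transparent_hugepage/enabled");
  std::string line;
  std::getline(f, line);
  return active_thp_mode(line);
}
```

If read_thp_mode() returns "never", the allocator above still works but the hint will not lead to promotion.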

Limitations:

THP promotion is not guaranteed. The kernel requires a contiguous 2 MiB aligned physical range, which may not be available.
This trick is only useful for allocations of at least one huge page (2 MiB). Smaller allocations will not benefit.
THP can introduce latency spikes from background compaction. For latency-sensitive applications, use explicit huge pages instead (see below).
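
Because promotion is not guaranteed, it is worth checking whether it actually happened. One way, sketched here under the assumption that /proc/self/smaps is readable, is to sum the AnonHugePages fields, which report in kB how much of each mapping is currently backed by transparent huge pages. The helper names are illustrative:

```cpp
#include <cstdint>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Sum the "AnonHugePages:" fields (reported in kB) across all mappings in
// smaps-formatted text. Lines that are not "key number ..." are skipped.
inline std::uint64_t anon_huge_kib(std::istream& smaps) {
  std::uint64_t total = 0;
  std::string line;
  while (std::getline(smaps, line)) {
    std::istringstream fields(line);
    std::string key;
    std::uint64_t kib = 0;
    if (fields >> key >> kib && key == "AnonHugePages:") total += kib;
  }
  return total;
}

// Convenience wrapper for the current process.
inline std::uint64_t anon_huge_kib_self() {
  std::ifstream f("/proc/self/smaps");
  return anon_huge_kib(f);
}
```

Calling anon_huge_kib_self() after the container has been filled (so the pages have been touched) gives a quick indication: a value near zero suggests the kernel did not promote the region.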

Approach 2: Explicit huge pages with MAP_HUGETLB

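mmap with MAP_HUGETLB allocates directly from the kernel’s huge page pool, so a successful mapping is always backed by huge pages. If the pool has too few free pages, mmap fails instead of falling back to 4 KiB pages: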

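#include <sys/mman.h>  // mmap, munmap, MAP_HUGETLB

#include <cstddef>     // std::size_t
#include <limits>      // std::numeric_limits
#include <new>         // std::bad_alloc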

template <typename T>
struct huge_page_allocator {
  static constexpr std::size_t huge_page_size = 1 << 21; // 2 MiB
  using value_type = T;

  huge_page_allocator() = default;
  template <class U>
  constexpr huge_page_allocator(const huge_page_allocator<U>&) noexcept {}

  // Round a byte count up to a whole number of huge pages (n must be > 0).
  static std::size_t round_up(std::size_t n) {
    return (((n - 1) / huge_page_size) + 1) * huge_page_size;
  }

  T* allocate(std::size_t n) {
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
      throw std::bad_alloc();
    void* p = mmap(
        nullptr, round_up(n * sizeof(T)),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
        -1, 0);
    // mmap reports failure as MAP_FAILED, not nullptr; check before casting.
    if (p == MAP_FAILED) throw std::bad_alloc();
    return static_cast<T*>(p);
  }

  void deallocate(T* p, std::size_t n) {
    munmap(p, round_up(n * sizeof(T)));
  }
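};

// The allocator is stateless, so any two instances compare equal.
template <class T, class U>
bool operator==(const huge_page_allocator<T>&, const huge_page_allocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const huge_page_allocator<T>&, const huge_page_allocator<U>&) noexcept { return false; }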

Pre-allocating the huge page pool

Huge pages for MAP_HUGETLB come from a reserved pool. Reserve them before your application starts:

# at boot via kernel command line
hugepages=64

# at runtime (requires root)
echo 64 > /proc/sys/vm/nr_hugepages
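
With the default 2 MiB huge page size, 64 pages reserve 128 MiB. The sketch below shows one way to size the pool for a given allocation and to read the pool counters (HugePages_Total, HugePages_Free) that the kernel reports in /proc/meminfo. The function names are hypothetical:

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Number of huge pages needed to back `bytes` of memory, rounded up.
// The default page size of 2 MiB is an assumption matching the text.
constexpr std::uint64_t huge_pages_needed(std::uint64_t bytes,
                                          std::uint64_t huge_page_size = 1ull << 21) {
  return (bytes + huge_page_size - 1) / huge_page_size;
}

// Read a numeric field such as "HugePages_Free:" from /proc/meminfo-formatted
// text. Returns 0 if the field is absent.
inline std::uint64_t meminfo_field(std::istream& meminfo, const std::string& name) {
  std::string line;
  while (std::getline(meminfo, line)) {
    std::istringstream fields(line);
    std::string key;
    std::uint64_t value = 0;
    if (fields >> key >> value && key == name) return value;
  }
  return 0;
}
```

Opening /proc/meminfo with std::ifstream and comparing meminfo_field(f, "HugePages_Free:") against huge_pages_needed(bytes) before allocating gives an early warning that the pool is too small.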

Use the allocator with standard containers:

std::vector<int, huge_page_allocator<int>> v;
absl::flat_hash_map<K, V, Hash, Eq, huge_page_allocator<std::pair<const K,V>>> ht;

This approach is best when you have a small number of large, long-lived allocations (hash tables, ring buffers, packet data pools).
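
Since a MAP_HUGETLB mapping fails outright when the pool is exhausted, some programs prefer to degrade gracefully. The following is a sketch of one possible fallback, not part of the allocator above: it retries with an ordinary anonymous mapping and tags it with MADV_HUGEPAGE. The function names are made up for illustration, and the fallback mapping is not 2 MiB aligned, so THP can only promote the aligned 2 MiB ranges inside it.

```cpp
#include <sys/mman.h>  // mmap, munmap, madvise (Linux-specific flags)

#include <cstddef>

constexpr std::size_t kHugePageSize = std::size_t{1} << 21;  // 2 MiB

constexpr std::size_t round_up_to_huge_page(std::size_t n) {
  return (n + kHugePageSize - 1) / kHugePageSize * kHugePageSize;
}

// Try the explicit huge page pool first, then fall back to regular pages
// with a THP hint. `bytes` must be > 0. Returns nullptr on failure.
inline void* map_huge_or_fallback(std::size_t bytes) {
  const std::size_t len = round_up_to_huge_page(bytes);
  void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p != MAP_FAILED) return p;

  // Pool empty or huge pages unsupported: use an ordinary mapping.
  p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return nullptr;
  madvise(p, len, MADV_HUGEPAGE);  // best-effort hint; failure is harmless
  return p;
}

// Both paths map the same rounded length, so one unmap routine suffices.
inline void unmap_huge(void* p, std::size_t bytes) {
  munmap(p, round_up_to_huge_page(bytes));
}
```

The trade-off is that the caller no longer knows which kind of backing it received, which matters for latency-sensitive code that chose explicit huge pages to avoid THP compaction.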

Approach 3: mimalloc

mimalloc (Microsoft’s high-performance allocator) adds huge page support to all allocations transparently via LD_PRELOAD. No code changes are required.

2 MiB huge pages

MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_LARGE_OS_PAGES=1 \
  LD_PRELOAD=./libmimalloc.so \
  perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' \
  ./a.out

        658      faults:u
  8,717,125      dTLB-loads:u
      6,320      dTLB-load-misses:u   # 0.07%
 23,104,208      cache-misses:u
 36,081,035      cache-references:u

0.543847504 seconds time elapsed

The TLB miss rate drops from 93% to 0.07%. With 2 MiB pages, the 2048-entry L2 TLB covers 4 GiB, more than enough for this workload.

1 GiB huge pages

Requires the kernel boot parameters hugepagesz=1G hugepages=4.

MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_RESERVE_HUGE_OS_PAGES=4 \
  LD_PRELOAD=libmimalloc.so ./a.out

         532      faults
     639,907      dTLB-loads
       7,869      dTLB-load-misses   # 1.23%
  25,401,262      cache-misses
  70,739,506      cache-references

 0.598358478 seconds time elapsed

Counterintuitively, 1 GiB pages perform slightly worse here (0.598 s versus 0.544 s). The working set already fits within 2 MiB huge page coverage, so the larger page size offers no additional TLB benefit, and it likely introduces minor overhead from internal fragmentation.

Summary

Approach	Guarantee	Code change	Latency risk
THP (`MADV_HUGEPAGE`)	None	Allocator	Background compaction spikes
`MAP_HUGETLB`	Yes	Allocator + pool setup	None
mimalloc	Best effort (with `LARGE_OS_PAGES`, when huge pages are available)	None (LD_PRELOAD)	None

When TLB misses are the bottleneck (verify with perf stat), 2 MiB pages are almost always the right choice. 1 GiB pages help only when the working set is too large for 2 MiB page coverage, typically several gigabytes.