yuqi-zheng

Using Huge Pages on Linux


Applications that perform random memory accesses over a large working set are often bottlenecked not by cache misses alone, but by TLB misses. The TLB (Translation Lookaside Buffer) is the CPU’s cache for virtual-to-physical address mappings. With the default 4 KiB page size, a TLB with 2048 entries covers only 8 MiB of working set. Switching to 2 MiB pages extends that coverage to 4 GiB, a 512x improvement in TLB reach.
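The reach figures above are simply entries multiplied by page size. A minimal sketch of that arithmetic, where the helper name tlb_reach is only illustrative and the 2048-entry count is the one quoted in the text:

```cpp
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;
constexpr std::uint64_t GiB = 1024 * MiB;

// Bytes of address space a TLB can map without missing:
// number of entries times the page size.
constexpr std::uint64_t tlb_reach(std::uint64_t entries, std::uint64_t page_size) {
  return entries * page_size;
}

// The figures quoted in the text, checked at compile time.
static_assert(tlb_reach(2048, 4 * KiB) == 8 * MiB, "4 KiB pages cover 8 MiB");
static_assert(tlb_reach(2048, 2 * MiB) == 4 * GiB, "2 MiB pages cover 4 GiB");
static_assert((2 * MiB) / (4 * KiB) == 512, "reach grows by 512x");
```

Real CPUs often have separate entry counts per page size, so treat the result as an estimate rather than an exact limit.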


Determining whether you need huge pages

The following benchmark fills a large hash table. Because the keys are hashed with CRC32, successive inserts land in effectively random slots, an access pattern that puts heavy pressure on the TLB.

#include <absl/container/flat_hash_map.h>
#include <nmmintrin.h>

int main() {
  struct hash {
    size_t operator()(size_t h) const noexcept {
      return _mm_crc32_u64(0, h);
    }
  };

  size_t iters = 10000000;
  absl::flat_hash_map<size_t, size_t, hash> ht;
  ht.reserve(iters);
  for (size_t i = 0; i < iters; ++i) ht.try_emplace(i, i);
}

Compiled with -O3 -mavx -DNDEBUG and profiled on an AMD Ryzen 3900X:

Performance counter stats for './a.out':

          70,080      faults:u
      20,802,877      dTLB-loads:u
      19,436,707      dTLB-load-misses:u   # 93.43% of all dTLB cache hits
      32,872,323      cache-misses:u       # 52.279% of all cache refs
      62,878,289      cache-references:u

     0.708913859 seconds time elapsed

93% of data TLB accesses miss. The working set exceeds the 8 MiB covered by the L2 TLB at 4 KiB pages. Almost every LLC miss is accompanied by a TLB miss, which requires a page table walk through cache or main memory.

Rule of thumb: if dTLB-load-misses / dTLB-loads is above roughly 5-10% and your working set is larger than a few megabytes, huge pages are likely to help.
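The rule of thumb can be written down as a small check on the two perf counters. This is only a sketch: the function names are made up for illustration, and the 5% threshold and 8 MiB floor are assumed readings of "roughly 5-10%" and "a few megabytes".

```cpp
#include <cassert>
#include <cstdint>

// Fraction of dTLB loads that missed, taken from two perf counters.
double dtlb_miss_ratio(std::uint64_t misses, std::uint64_t loads) {
  assert(loads > 0 && "dTLB-loads must be nonzero");
  return static_cast<double>(misses) / static_cast<double>(loads);
}

// Heuristic from the text. Both constants are illustrative assumptions:
// 5% is the low end of the quoted range, and 8 MiB is the 4 KiB TLB reach
// discussed earlier.
bool huge_pages_worth_trying(std::uint64_t misses, std::uint64_t loads,
                             std::uint64_t working_set_bytes) {
  constexpr double kMissThreshold = 0.05;
  constexpr std::uint64_t kMinWorkingSet = 8ull << 20;
  return dtlb_miss_ratio(misses, loads) > kMissThreshold &&
         working_set_bytes > kMinWorkingSet;
}
```

Fed the counters from the run above (19,436,707 misses out of 20,802,877 loads) and any working set above 8 MiB, the check returns true.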


Approach 1: Transparent Huge Pages with madvise

Linux’s Transparent Huge Pages (THP) feature lets the kernel promote 4 KiB pages to 2 MiB pages automatically. In madvise mode (the default on many distributions; the active mode is shown in /sys/kernel/mm/transparent_hugepage/enabled), promotion only happens for regions explicitly tagged:

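#include <stdlib.h>    // posix_memalign
#include <sys/mman.h>  // madvise, MADV_HUGEPAGE
#include <cstddef>     // std::size_t
#include <cstdlib>     // std::free
#include <limits>      // std::numeric_limits
#include <new>         // std::bad_alloc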

template <typename T>
struct thp_allocator {
  static constexpr std::size_t huge_page_size = 1 << 21; // 2 MiB
  using value_type = T;

  thp_allocator() = default;
  template <class U>
  constexpr thp_allocator(const thp_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
      throw std::bad_alloc();
    void* p = nullptr;
    if (posix_memalign(&p, huge_page_size, n * sizeof(T)) != 0)
      throw std::bad_alloc();
    madvise(p, n * sizeof(T), MADV_HUGEPAGE);
    return static_cast<T*>(p);
  }

  void deallocate(T* p, std::size_t) { std::free(p); }
};

The posix_memalign call aligns the allocation to the 2 MiB huge page boundary, which improves the kernel’s ability to back it with a huge page. The subsequent madvise tags the region as a candidate for promotion.

Limitations:

  • THP promotion is not guaranteed. The kernel requires a contiguous 2 MiB aligned physical range, which may not be available.
  • This trick is only useful for allocations of at least one huge page (2 MiB). Smaller allocations will not benefit.
  • THP can introduce latency spikes from background compaction. For latency-sensitive applications, use explicit huge pages instead (see below).
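MADV_HUGEPAGE has no effect when THP is set to never, so it can be worth checking the mode at startup. The kernel reports it in /sys/kernel/mm/transparent_hugepage/enabled as a line such as "always [madvise] never", with the active value in brackets. A minimal sketch of reading it, using illustrative function names:

```cpp
#include <fstream>
#include <string>

// Extracts the bracketed (active) value from a line such as
// "always [madvise] never". Returns "" if no bracketed value is found.
std::string active_thp_mode(const std::string& line) {
  const auto open = line.find('[');
  if (open == std::string::npos) return "";
  const auto close = line.find(']', open);
  if (close == std::string::npos) return "";
  return line.substr(open + 1, close - open - 1);
}

// Reads the system-wide THP setting. Returns "" if the file cannot be read,
// for example on a kernel built without THP support.
std::string read_thp_mode(
    const char* path = "/sys/kernel/mm/transparent_hugepage/enabled") {
  std::ifstream in(path);
  std::string line;
  if (!std::getline(in, line)) return "";
  return active_thp_mode(line);
}
```

If read_thp_mode() returns "never" or an empty string, the thp_allocator above still works but its allocations stay on 4 KiB pages.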

Approach 2: Explicit huge pages with MAP_HUGETLB

mmap with MAP_HUGETLB allocates directly from the kernel’s huge page pool. The mapping is either backed by huge pages or the call fails, so there is no silent fallback to 4 KiB pages:

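#include <sys/mman.h>  // mmap, munmap, MAP_HUGETLB
#include <cstddef>     // std::size_t
#include <limits>      // std::numeric_limits
#include <new>         // std::bad_alloc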

template <typename T>
struct huge_page_allocator {
  static constexpr std::size_t huge_page_size = 1 << 21; // 2 MiB
  using value_type = T;

  huge_page_allocator() = default;
  template <class U>
  constexpr huge_page_allocator(const huge_page_allocator<U>&) noexcept {}

  static size_t round_up(size_t n) {
    return (((n - 1) / huge_page_size) + 1) * huge_page_size;
  }

  T* allocate(std::size_t n) {
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
      throw std::bad_alloc();
    auto p = static_cast<T*>(mmap(
        nullptr, round_up(n * sizeof(T)),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
        -1, 0));
    if (p == MAP_FAILED) throw std::bad_alloc();
    return p;
  }

  void deallocate(T* p, std::size_t n) {
    munmap(p, round_up(n * sizeof(T)));
  }
};

Pre-allocating the huge page pool

Huge pages for MAP_HUGETLB come from a reserved pool. Reserve them before your application starts:

# at boot via kernel command line
hugepages=64

# at runtime (as root)
echo 64 > /proc/sys/vm/nr_hugepages
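Whether the reservation succeeded can be checked in /proc/meminfo, which reports HugePages_Total and HugePages_Free. The kernel may reserve fewer pages than requested if memory is fragmented. A rough sketch of reading those counters from C++, with illustrative helper names:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Returns the numeric value of `key` (for example "HugePages_Free") from
// text in /proc/meminfo format, or -1 if the key is not present.
long meminfo_value(const std::string& text, const std::string& key) {
  std::istringstream in(text);
  const std::string prefix = key + ":";
  std::string line;
  while (std::getline(in, line)) {
    if (line.compare(0, prefix.size(), prefix) == 0) {
      // std::stol skips the padding spaces and stops at any unit suffix.
      return std::stol(line.substr(prefix.size()));
    }
  }
  return -1;
}

// Number of huge pages currently free in the pool, or -1 if unavailable.
long free_huge_pages() {
  std::ifstream file("/proc/meminfo");
  std::stringstream contents;
  contents << file.rdbuf();
  return meminfo_value(contents.str(), "HugePages_Free");
}
```

Calling free_huge_pages() before constructing large containers gives an early, readable error instead of a std::bad_alloc from deep inside the allocator.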

Use the allocator with standard containers:

std::vector<int, huge_page_allocator<int>> v;
absl::flat_hash_map<K, V, Hash, Eq, huge_page_allocator<std::pair<const K,V>>> ht;

This approach is best when you have a small number of large, long-lived allocations (hash tables, ring buffers, packet data pools).


Approach 3: mimalloc

mimalloc (Microsoft’s high-performance allocator) adds huge page support to all allocations transparently via LD_PRELOAD. No code changes are required.

2 MiB huge pages

MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_LARGE_OS_PAGES=1 \
  LD_PRELOAD=./libmimalloc.so \
  perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' \
  ./a.out
        658      faults:u
  8,717,125      dTLB-loads:u
      6,320      dTLB-load-misses:u   # 0.07%
 23,104,208      cache-misses:u
 36,081,035      cache-references:u

0.543847504 seconds time elapsed

The TLB miss rate drops from 93% to 0.07%. With 2 MiB pages, the 2048-entry L2 TLB covers 4 GiB, more than enough for this workload.

1 GiB huge pages

Requires the kernel boot parameters hugepagesz=1G hugepages=4.

MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_RESERVE_HUGE_OS_PAGES=4 \
  LD_PRELOAD=libmimalloc.so ./a.out
         532      faults
     639,907      dTLB-loads
       7,869      dTLB-load-misses   # 1.23%
  25,401,262      cache-misses
  70,739,506      cache-references

 0.598358478 seconds time elapsed

Counterintuitively, 1 GiB pages perform slightly worse here (0.598 s versus 0.544 s). The working set already fits within 2 MiB huge page coverage, so the larger page size offers no additional TLB benefit, and it likely adds minor overhead from internal fragmentation.


Summary

Approach             Guarantee                  Code change             Latency risk
THP (MADV_HUGEPAGE)  None                       Allocator               Background compaction spikes
MAP_HUGETLB          Yes                        Allocator + pool setup  None
mimalloc             Yes (with LARGE_OS_PAGES)  None (LD_PRELOAD)       None

When TLB misses are the bottleneck (verify with perf stat), 2 MiB pages are almost always the right choice. 1 GiB pages help only when the working set is too large for 2 MiB page coverage, typically several gigabytes.
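As a rough illustration of that decision, the sketch below picks the smallest page size whose TLB reach covers a given working set. The function name is hypothetical, and using one entry count for every page size is a simplifying assumption, since real TLBs often hold fewer entries for larger pages.

```cpp
#include <cstdint>

// Page sizes discussed in this article: 4 KiB, 2 MiB, 1 GiB.
constexpr std::uint64_t kPageSizes[] = {4ull << 10, 2ull << 20, 1ull << 30};

// Returns the smallest page size whose reach (entries * page size) covers
// the working set, or the largest page size if none does.
// Assumption: the same entry count applies to every page size.
constexpr std::uint64_t smallest_covering_page(std::uint64_t working_set_bytes,
                                               std::uint64_t tlb_entries) {
  for (std::uint64_t page : kPageSizes) {
    if (tlb_entries * page >= working_set_bytes) return page;
  }
  return kPageSizes[2];
}

// With 2048 entries: a few MiB fits 4 KiB pages, hundreds of MiB need 2 MiB
// pages, and only working sets beyond 4 GiB call for 1 GiB pages.
static_assert(smallest_covering_page(4ull << 20, 2048) == (4ull << 10), "4 KiB");
static_assert(smallest_covering_page(256ull << 20, 2048) == (2ull << 20), "2 MiB");
static_assert(smallest_covering_page(16ull << 30, 2048) == (1ull << 30), "1 GiB");
```

This only models TLB reach; measuring with perf stat remains the way to confirm which page size actually wins.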