Using Huge Pages on Linux
Applications that perform random memory accesses over a large working set are often bottlenecked not by cache misses alone, but also by TLB misses. The TLB (Translation Lookaside Buffer) is the CPU’s cache for virtual-to-physical address mappings. With the default 4 KiB page size, a TLB with 2048 entries covers only 8 MiB of working set. Switching to 2 MiB pages extends that coverage to 4 GiB, a 512x improvement in TLB reach.
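The reach figures are simply the entry count multiplied by the page size. A minimal sketch of that arithmetic follows; the helper name `tlb_reach_bytes` and the unit constants are illustrative, not part of any existing API, and the 2048-entry count is the figure quoted above.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;
constexpr std::uint64_t GiB = 1024 * MiB;

// TLB reach: how much memory can be addressed without a TLB miss.
// `tlb_reach_bytes` is an illustrative name, not an existing API.
inline std::uint64_t tlb_reach_bytes(std::uint64_t entries,
                                     std::uint64_t page_size) {
    // Guard against overflow of the product.
    assert(page_size == 0 || entries <= UINT64_MAX / page_size);
    return entries * page_size;
}
```

With 2048 entries, `tlb_reach_bytes(2048, 4 * KiB)` is 8 MiB and `tlb_reach_bytes(2048, 2 * MiB)` is 4 GiB; the ratio of the two is 512.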
Determining whether you need huge pages
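The following benchmark fills a large hash table with ten million entries. The CRC32-based hash scatters the sequential keys across the table, so the memory accesses are effectively random and put heavy pressure on the TLB.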
#include <cstddef>
#include <absl/container/flat_hash_map.h>
#include <nmmintrin.h>  // _mm_crc32_u64 (SSE4.2)

int main() {
    // CRC32 as a cheap hash that scatters sequential keys across the table.
    struct hash {
        size_t operator()(size_t h) const noexcept {
            return _mm_crc32_u64(0, h);
        }
    };
    size_t iters = 10000000;
    absl::flat_hash_map<size_t, size_t, hash> ht;
    ht.reserve(iters);
    for (size_t i = 0; i < iters; ++i) ht.try_emplace(i, i);
}
Compiled with -O3 -mavx -DNDEBUG and profiled on an AMD Ryzen 3900X:
Performance counter stats for './a.out':
70,080 faults:u
20,802,877 dTLB-loads:u
19,436,707 dTLB-load-misses:u # 93.43% of all dTLB cache hits
32,872,323 cache-misses:u # 52.279% of all cache refs
62,878,289 cache-references:u
0.708913859 seconds time elapsed
93% of data TLB accesses miss. The working set exceeds the 8 MiB covered by the L2 TLB at 4 KiB pages. Almost every LLC miss is accompanied by a TLB miss, which requires a page table walk through cache or main memory.
Rule of thumb: if dTLB-load-misses / dTLB-loads is above roughly 5-10% and your working set is larger than a few megabytes, huge pages are likely to help.
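The rule of thumb can be written down as a small check over the two perf counters. This is only a sketch: the function names are made up for illustration, and the default cut-offs (5% miss ratio, 8 MiB working set) are assumptions picked from the ranges mentioned above, not measured thresholds.

```cpp
#include <cassert>
#include <cstdint>

// Fraction of dTLB loads that missed, using the counters from perf stat.
inline double dtlb_miss_ratio(std::uint64_t misses, std::uint64_t loads) {
    assert(loads > 0);
    return static_cast<double>(misses) / static_cast<double>(loads);
}

// Hypothetical helper: applies the rule of thumb. Both default cut-offs are
// assumptions and should be tuned for the machine at hand.
inline bool huge_pages_worth_trying(std::uint64_t misses, std::uint64_t loads,
                                    std::uint64_t working_set_bytes,
                                    double ratio_threshold = 0.05,
                                    std::uint64_t min_working_set = 8ull << 20) {
    return working_set_bytes > min_working_set &&
           dtlb_miss_ratio(misses, loads) > ratio_threshold;
}
```

Feeding in the counters from the run above (19,436,707 misses out of 20,802,877 loads) gives a ratio of about 0.934, far above the threshold.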
Approach 1: Transparent Huge Pages with madvise
Linux’s Transparent Huge Pages (THP) feature lets the kernel promote 4 KiB pages to 2 MiB pages automatically. In madvise mode (the default on many distributions; check /sys/kernel/mm/transparent_hugepage/enabled), promotion only happens for regions explicitly tagged:
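#include <cstddef>
#include <cstdlib>     // std::free
#include <limits>
#include <new>         // std::bad_alloc
#include <stdlib.h>    // posix_memalign
#include <sys/mman.h>  // madvise

template <typename T>
struct thp_allocator {
    static constexpr std::size_t huge_page_size = 1 << 21;  // 2 MiB
    using value_type = T;

    thp_allocator() = default;
    template <class U>
    constexpr thp_allocator(const thp_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Guard against overflow in n * sizeof(T).
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
            throw std::bad_alloc();
        void* p = nullptr;
        // Align to 2 MiB so the region can be backed by whole huge pages.
        if (posix_memalign(&p, huge_page_size, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        // Advisory only: if this fails, the memory simply stays on 4 KiB pages.
        madvise(p, n * sizeof(T), MADV_HUGEPAGE);
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// Stateless allocator: all instances are interchangeable.
template <class T, class U>
bool operator==(const thp_allocator<T>&, const thp_allocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const thp_allocator<T>&, const thp_allocator<U>&) noexcept { return false; }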
The posix_memalign call aligns the allocation to the 2 MiB huge page boundary, which improves the kernel’s ability to back it with a huge page. The subsequent madvise tags the region as a candidate for promotion.
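Whether the madvise hint has any effect depends on the system-wide THP mode, which the kernel exposes in /sys/kernel/mm/transparent_hugepage/enabled as a line such as `always [madvise] never`, with the active mode in brackets. A minimal sketch for reading it at startup is shown below; the function names are illustrative.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Extract the active mode from a line such as "always [madvise] never".
// Returns an empty string if no bracketed entry is found.
inline std::string parse_thp_mode(const std::string& line) {
    const auto open = line.find('[');
    if (open == std::string::npos) return {};
    const auto close = line.find(']', open + 1);
    if (close == std::string::npos) return {};
    assert(close > open);
    return line.substr(open + 1, close - open - 1);
}

// Read the system-wide THP setting. Returns an empty string if the file
// cannot be read (for example, on a kernel built without THP support).
inline std::string current_thp_mode() {
    std::ifstream f("/sys/kernel/mm/transparent_hugepage/enabled");
    std::string line;
    if (!f || !std::getline(f, line)) return {};
    return parse_thp_mode(line);
}
```

If `current_thp_mode()` returns `never`, the allocator above still works but its memory stays on 4 KiB pages.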
Limitations:
- THP promotion is not guaranteed. The kernel requires a contiguous 2 MiB aligned physical range, which may not be available.
- This trick is only useful for allocations of at least one huge page (2 MiB). Smaller allocations will not benefit.
- THP can introduce latency spikes from background compaction. For latency-sensitive applications, use explicit huge pages instead (see below).
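Because promotion is not guaranteed, it helps to check whether it actually happened. One way, sketched here, is to sum the `AnonHugePages:` lines in /proc/self/smaps after the data structure has been populated. The helper names are illustrative, and the parser assumes the usual `Key:   <number> kB` layout of that file.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Sum the values of every "<key> <number> kB" line in smaps-style text.
inline std::uint64_t sum_kb_field(std::istream& in, const std::string& key) {
    assert(!key.empty() && key.back() == ':');
    std::uint64_t total = 0;
    std::string line;
    while (std::getline(in, line)) {
        // Skip lines that do not start with the key.
        if (line.compare(0, key.size(), key) != 0) continue;
        std::istringstream fields(line.substr(key.size()));
        std::uint64_t kb = 0;
        if (fields >> kb) total += kb;
    }
    return total;
}

// KiB of this process's anonymous memory currently backed by THP.
// Returns 0 if the file cannot be opened.
inline std::uint64_t anon_huge_kb_self() {
    std::ifstream smaps("/proc/self/smaps");
    return sum_kb_field(smaps, "AnonHugePages:");
}
```

A result of 0 after touching a large thp_allocator-backed container suggests the kernel did not promote anything, either because THP is disabled or because no contiguous 2 MiB ranges were available.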
Approach 2: Explicit huge pages with MAP_HUGETLB
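mmap with MAP_HUGETLB allocates directly from the kernel’s huge page pool. If the call succeeds, the mapping is guaranteed to be backed by huge pages; if the pool cannot satisfy the request, mmap fails instead of falling back to 4 KiB pages: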
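#include <cstddef>
#include <limits>
#include <new>         // std::bad_alloc
#include <sys/mman.h>  // mmap, munmap

template <typename T>
struct huge_page_allocator {
    static constexpr std::size_t huge_page_size = 1 << 21;  // 2 MiB
    using value_type = T;

    huge_page_allocator() = default;
    template <class U>
    constexpr huge_page_allocator(const huge_page_allocator<U>&) noexcept {}

    // Round a non-zero byte count up to a multiple of the huge page size.
    static std::size_t round_up(std::size_t n) {
        return (((n - 1) / huge_page_size) + 1) * huge_page_size;
    }

    T* allocate(std::size_t n) {
        // Guard against overflow in n * sizeof(T).
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
            throw std::bad_alloc();
        void* p = mmap(nullptr, round_up(n * sizeof(T)),
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                       -1, 0);
        // Check for failure before casting; mmap fails if the pool is exhausted.
        if (p == MAP_FAILED) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t n) noexcept {
        munmap(p, round_up(n * sizeof(T)));
    }
};

// Stateless allocator: all instances are interchangeable.
template <class T, class U>
bool operator==(const huge_page_allocator<T>&, const huge_page_allocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const huge_page_allocator<T>&, const huge_page_allocator<U>&) noexcept { return false; }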
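The allocator above throws when the huge page pool is empty, since a MAP_HUGETLB mapping is never silently backed by 4 KiB pages. If a softer failure mode is preferable, one option is to fall back to an ordinary anonymous mapping and request THP for it. The sketch below illustrates that idea; `mapping`, `map_huge_or_fallback`, and `unmap` are hypothetical names, and the 2 MiB page size is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

constexpr std::size_t kHugePageSize = std::size_t{1} << 21;  // 2 MiB, assumed

struct mapping {
    void* ptr;           // nullptr on failure
    std::size_t length;  // bytes actually mapped
    bool explicit_huge;  // true if the MAP_HUGETLB attempt succeeded
};

// Round a non-zero byte count up to a multiple of the huge page size.
inline std::size_t round_up_to_huge(std::size_t n) {
    assert(n > 0 && n <= static_cast<std::size_t>(-1) - (kHugePageSize - 1));
    return (n + kHugePageSize - 1) & ~(kHugePageSize - 1);
}

inline mapping map_huge_or_fallback(std::size_t bytes) {
    const std::size_t len = round_up_to_huge(bytes);
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    // First attempt: explicit huge pages from the reserved pool.
    void* p = mmap(nullptr, len, prot, flags | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return mapping{p, len, true};

    // Pool empty or unavailable: use ordinary pages and ask for THP.
    // This mapping is not necessarily 2 MiB aligned, so only the aligned
    // 2 MiB sub-ranges inside it are candidates for promotion.
    p = mmap(nullptr, len, prot, flags, -1, 0);
    if (p == MAP_FAILED) return mapping{nullptr, 0, false};
    madvise(p, len, MADV_HUGEPAGE);  // best effort; result deliberately ignored
    return mapping{p, len, false};
}

inline void unmap(const mapping& m) {
    if (m.ptr != nullptr) munmap(m.ptr, m.length);
}
```

The `explicit_huge` flag lets the caller log which path was taken, which is useful when diagnosing why a deployment is slower than expected.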
Pre-allocating the huge page pool
Huge pages for MAP_HUGETLB come from a reserved pool. Reserve them before your application starts:
# at boot via kernel command line
hugepages=64
# at runtime (requires root)
echo 64 > /proc/sys/vm/nr_hugepages
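To size the pool, divide the number of bytes you plan to map by the huge page size and round up; the kernel reports the current pool state in /proc/meminfo under `HugePages_Total:` and `HugePages_Free:`. The following sketch combines the two. The helper names are illustrative, and it assumes the default huge page size is 2 MiB (the `Hugepagesize:` line in /proc/meminfo shows the actual value).

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Number of huge pages needed to hold `bytes` (2 MiB page size assumed).
inline std::size_t huge_pages_needed(std::size_t bytes,
                                     std::size_t page_size = std::size_t{1} << 21) {
    assert(page_size > 0);
    return bytes / page_size + (bytes % page_size != 0 ? 1 : 0);
}

// Read a numeric field such as "HugePages_Free:" from meminfo-style text.
// Returns -1 if the field is absent or malformed.
inline long read_meminfo_field(std::istream& in, const std::string& key) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, key.size(), key) != 0) continue;
        std::istringstream fields(line.substr(key.size()));
        long value = -1;
        if (fields >> value) return value;
    }
    return -1;
}

// True if the pool currently has enough free huge pages for `bytes`.
inline bool pool_can_hold(std::size_t bytes) {
    std::ifstream meminfo("/proc/meminfo");
    const long free_pages = read_meminfo_field(meminfo, "HugePages_Free:");
    return free_pages >= 0 &&
           static_cast<std::size_t>(free_pages) >= huge_pages_needed(bytes);
}
```

For example, a 100 MiB table needs `huge_pages_needed(100u << 20)` = 50 pages, so the pool of 64 reserved above would be sufficient if nothing else is using it.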
Use the allocator with standard containers:
std::vector<int, huge_page_allocator<int>> v;
absl::flat_hash_map<K, V, Hash, Eq, huge_page_allocator<std::pair<const K,V>>> ht;
This approach is best when you have a small number of large, long-lived allocations (hash tables, ring buffers, packet data pools).
Approach 3: mimalloc
mimalloc (Microsoft’s high-performance allocator) adds huge page support to all allocations transparently via LD_PRELOAD. No code changes are required.
2 MiB huge pages
MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_LARGE_OS_PAGES=1 \
LD_PRELOAD=./libmimalloc.so \
perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' \
./a.out
658 faults:u
8,717,125 dTLB-loads:u
6,320 dTLB-load-misses:u # 0.07%
23,104,208 cache-misses:u
36,081,035 cache-references:u
0.543847504 seconds time elapsed
The TLB miss rate drops from 93% to 0.07%, and elapsed time falls from 0.71 s to 0.54 s. With 2 MiB pages, the 2048-entry L2 TLB covers 4 GiB, more than enough for this workload.
1 GiB huge pages
Requires the kernel boot parameters hugepagesz=1G hugepages=4.
MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_RESERVE_HUGE_OS_PAGES=4 \
LD_PRELOAD=libmimalloc.so ./a.out
532 faults
639,907 dTLB-loads
7,869 dTLB-load-misses # 1.23%
25,401,262 cache-misses
70,739,506 cache-references
0.598358478 seconds time elapsed
Counterintuitively, 1 GiB pages perform slightly worse here (0.60 s versus 0.54 s). The working set already fits within 2 MiB huge page coverage, so the larger page size offers no additional TLB benefit; the small slowdown is plausibly overhead from internal fragmentation.
Summary
| Approach | Guarantee | Code change | Latency risk |
|---|---|---|---|
| THP (MADV_HUGEPAGE) | None | Allocator | Background compaction spikes |
| MAP_HUGETLB | Yes | Allocator + pool setup | None |
| mimalloc | Yes (with LARGE_OS_PAGES) | None (LD_PRELOAD) | None |
When TLB misses are the bottleneck (verify with perf stat), 2 MiB pages are almost always the right choice. 1 GiB pages help only when the working set is too large for 2 MiB page coverage, typically several gigabytes.