Cache-Friendly Maps: flat_map and unordered_flat_map vs the STL
Modern CPUs are hundreds of times faster than main memory. A single cache miss stalls the core for hundreds of cycles while it waits for data. This asymmetry makes cache-friendliness —not asymptotic complexity— the dominant factor in data structure performance for small working sets.
std::map and std::unordered_map are node-based containers: every element lives in a separately allocated node. boost::container::flat_map and boost::unordered_flat_map replace those nodes with contiguous storage, trading insertion speed for cache locality. This article benchmarks all four containers under cold-cache conditions and explains why the flat variants win in read-heavy, small-dataset scenarios common in trading systems.
The four containers
std::map — red-black tree
Each element is a heap-allocated node containing a pair<const Key, Value>, plus three child/parent pointers and a color bit. Finding an element requires O(log N) pointer traversals, each potentially causing a cache miss because nodes are scattered across the heap.
[50] ← Node: {key, value, left, right, parent, color}
/ \
[30] [70] ← Each node: separate heap allocation
/ \ / \
[10][40][60][90] ← Traversal = pointer chasing = cache misses
Node size: sizeof(pair) + 3 × sizeof(ptr) + 1 → typically 48–80 bytes per node on 64-bit, of which only the key and value are useful payload.
boost::container::flat_map — sorted vector
Elements are stored as contiguous pair<Key, Value> in a sorted array. Lookup is binary search (O(log N)), but each probe touches a cache-line-aligned element in a contiguous array —the CPU prefetcher can predict the access pattern.
[10][30][40][50][60][70][90] ← Contiguous array of pairs
↑ binary search probes ← Sequential in memory → prefetchable
No per-element overhead. Insertion requires shifting elements (O(N) in the worst case), which is a memmove for POD types.
std::unordered_map — chained hash table
Each bucket is a linked list of heap-allocated nodes. On a hash collision, a new node is allocated and linked. Lookup is O(1) on average, but following the chain means pointer chasing through scattered nodes.
bucket[0]: → [K1,V1] → [K5,V5] → nullptr
bucket[1]: → nullptr
bucket[2]: → [K3,V3] → nullptr
bucket[3]: → [K7,V7] → [K2,V2] → [K9,V9] → nullptr
Each node is a separate allocation. The bucket array itself is contiguous, but the elements are not.
boost::unordered_flat_map — open-addressing hash table
Elements are stored directly in the bucket array. On collision, quadratic probing searches for the next empty slot. No chaining, no per-element pointers, no separate allocations.
bucket[0]: [K1,V1]
bucket[1]: [empty]
bucket[2]: [K3,V3]
bucket[3]: [K7,V7] ← probe sequence: 3,4,7,12,21...
bucket[4]: [K2,V2] ← (quadratic probing offsets)
bucket[5]: [K9,V9]
bucket[6]: [K5,V5]
bucket[7]: [empty]
The entire hash table is one contiguous block of memory. Traversal is a linear scan with a conditional skip for empty slots.
Memory layout comparison
The key insight is pointer density. Every pointer in a data structure is a potential cache miss, because following it means jumping to an arbitrary address.
| Container | Per-element overhead | Pointer chasing | Contiguity |
|---|---|---|---|
std::map | 3 ptrs + color bit | Yes (tree traversal) | None |
std::unordered_map | 1 ptr + bucket indirection | Yes (chain traversal) | Bucket array only |
flat_map | None | No (index arithmetic) | Full |
unordered_flat_map | None | No (probe arithmetic) | Full |
For flat_map with N=100 elements of pair<uint64_t, PodValue<64>> (72 bytes each), the entire data structure fits in 100 × 72 = 7,200 bytes — well within L1 cache (32 KiB). A binary search touches at most log2(100) ≈ 7 cache lines.
For std::map with the same data, each node is ~128 bytes (72 payload + 48 overhead + alignment), scattered across the heap. A tree traversal of depth 7 can touch 7 different cache lines in 7 different 4 KiB pages, with no spatial locality.
Benchmark methodology
Reproducing realistic performance requires simulating cache-cold conditions, not the warm-cache numbers that most naive benchmarks produce. The benchmark project at G:\flat_map implements five controls:
1. Exclusive CPU core
On Linux, use cset to isolate a core:
cset set -c 2-63 -s system
cset set -c 0-1 -s benchmark
cset proc -m -k -f root -t system
echo 0 > /sys/devices/system/cpu/cpu1/online
cset proc -s benchmark -e ./flat_map_bench
This eliminates scheduler-induced jitter from other processes.
2. rdtscp cycle-accurate timing
The rdtscp instruction reads the CPU’s time-stamp counter with implicit serialization. The benchmark uses it to measure individual operations, not averaged batches:
inline uint64_t rdtsc_start() {
unsigned int aux;
return __rdtscp(&aux);
}
inline uint64_t rdtsc_end() {
unsigned int aux;
_mm_lfence(); // serialize before reading
return __rdtscp(&aux);
}
The lfence before the end measurement ensures all prior instructions have retired.
3. Cache eviction before each measurement
A CacheEvictor allocates a buffer of 2× L3 cache size and randomly walks through it, touching every cache line. The random (non-sequential) access pattern prevents the CPU’s cache replacement policy from keeping old data:
class CacheEvictor {
// 2× L3 size buffer
void evict() {
const size_t total_lines = buf_size_ / 64;
for (size_t i = 0; i < 10 * total_lines; ++i) {
size_t idx = dist(rng_) * 64;
sink += *reinterpret_cast<uint64_t*>(buf_ + idx);
}
}
};
This simulates the “cold cache” scenario where the container has not been recently accessed —common in server applications that manage thousands of small map instances.
4. Fixed-seed random data
Data is generated with std::mt19937_64 using a fixed seed (42). This ensures:
- Identical key distributions across runs
- Fair comparison:
std::map’s tree shape andstd::unordered_map’s collision chain lengths are deterministic
5. Manual timing with Google Benchmark
for (auto _ : state) {
evictor.evict();
uint64_t t0 = rdtsc_start();
// ... operation under test ...
uint64_t t1 = rdtsc_end();
state.SetIterationTime(static_cast<double>(t1 - t0));
}
UseManualTime() tells Google Benchmark to use the rdtscp-measured cycles instead of wall-clock time.
Benchmark parameters
The benchmark varies four parameters:
| Parameter | Values | Rationale |
|---|---|---|
| Key type | uint64_t, std::string (4–16 chars) | Integer and short-string keys are the most common |
| Value size | 8, 32, 64, 128 bytes | From small structs to medium-sized objects |
| Element count | 10, 50, 100, 500, 1000 | Small dataset focus |
| Value type | POD, non-POD | POD allows memcpy/memmove; non-POD requires element-wise copy |
All results are in CPU cycles per operation, measured under cold-cache conditions.
Results
Insert performance
Integer key, PodValue<64>:
| N | map | unordered_map | flat_map | unordered_flat_map |
|---|---|---|---|---|
| 10 | ~1,400 | ~1,100 | ~800 | ~600 |
| 50 | ~9,000 | ~7,000 | ~4,500 | ~2,800 |
| 100 | ~20,000 | ~15,000 | ~11,000 | ~5,500 |
| 500 | ~130,000 | ~90,000 | ~280,000 | ~28,000 |
| 1000 | ~300,000 | ~190,000 | ~900,000 | ~55,000 |
For small datasets (N ≤ 100, value ≤ 64 bytes), flat_map insert beats both map and unordered_map. The reason: flat_map’s sorted-array insertion is memmove of a small contiguous block, which is extremely fast for POD types. The node-based containers must heap-allocate a node for each insertion.
Above the crossover point (N ≈ 100–500), flat_map’s O(N) shift dominates and performance degrades rapidly.
String key, PodValue<64>:
String keys shift the crossover point lower. String comparison is slower than integer comparison, and pair<string, ValType> is no longer POD — flat_map must move elements one-by-one instead of memmove. The advantage over map and unordered_map shrinks or disappears.
Key takeaway: flat_map insertion is competitive only with integer keys, POD values, and small datasets. unordered_flat_map insertion dominates everywhere.
Find performance
Integer key:
| N | map | unordered_map | flat_map | unordered_flat_map |
|---|---|---|---|---|
| 10 | ~400 | ~250 | ~150 | ~120 |
| 50 | ~700 | ~280 | ~250 | ~130 |
| 100 | ~900 | ~300 | ~350 | ~140 |
| 500 | ~1,300 | ~350 | ~800 | ~160 |
| 1000 | ~1,600 | ~400 | ~1,000 | ~170 |
flat_map find always beats map by 40–100%. The binary search touches contiguous cache lines that the prefetcher can predict, while map’s tree traversal is a pointer-chasing chain of cache misses.
For small datasets (N ≤ 100, value ≤ 64 bytes), flat_map even beats unordered_map. The hash computation plus chain traversal for unordered_map costs more than 7 binary-search probes in a contiguous array that fits in L1.
unordered_flat_map find is the fastest everywhere: a single hash computation plus a short probe sequence in a contiguous array.
String key:
String keys narrow flat_map’s advantage over unordered_map because string comparison is slower than integer comparison. The binary search performs up to log2(N) string comparisons, each potentially comparing multiple characters. But flat_map still beats map consistently.
Traverse performance
| Container | Cycles per element (N=100, int key, PodValue<64>) |
|---|---|
std::map | ~120 |
std::unordered_map | ~80 |
flat_map | ~4 |
unordered_flat_map | ~5 |
This is the most dramatic result. flat_map and unordered_flat_map are 20–30× faster at traversal than the node-based containers. The reason is pure sequential access: the prefetcher can stay far ahead of the traversal loop, keeping the pipeline fed with data. Node-based containers require a pointer dereference per element, which defeats prefetching.
This result holds regardless of dataset size, value size, or key type. Sequential access to contiguous memory is always faster than pointer chasing.
POD vs non-POD
With fixed N=100, Size=128, integer key:
| Container | POD insert | non-POD insert | Slowdown |
|---|---|---|---|
std::map | baseline | ~10% slower | Node alloc dominates |
std::unordered_map | baseline | ~15% slower | Node alloc dominates |
flat_map | baseline | ~60–80% slower | memmove → element-wise move |
unordered_flat_map | baseline | ~40–50% slower | Rehash copy → element-wise move |
POD vs non-POD primarily affects insertion performance of the flat containers. When the value type is POD, flat_map’s insertion shift and unordered_flat_map’s rehash can use memcpy/memmove —a single SIMD-optimized operation for the entire block. When the value type is non-POD (custom copy/move constructors), each element must be moved individually.
Find and traverse performance are unaffected by POD vs non-POD: those operations don’t copy or move elements.
Why cold cache matters
A common mistake in micro-benchmarks is measuring warm-cache performance: run the operation thousands of times in a tight loop, then average. This gives misleadingly low numbers because after the first few iterations, the entire data structure is in L1 cache.
In real applications —especially trading systems— the situation is different:
-
Many instances. A server managing connections for thousands of users might have thousands of small map instances. When a request arrives for user #8472, that user’s map hasn’t been accessed in milliseconds —it’s been evicted from cache by the other 8,471 maps.
-
Context switches. Even on a dedicated core, the OS may schedule kernel work (timer interrupts, softirq, RCU callbacks) that evicts the user’s cache lines.
-
Interleaved data paths. A trading system might look up an instrument in one map, then immediately look up the counterparty in another. The second lookup evicts the first map from cache.
The cold-cache scenario is the realistic one for small, frequently-created maps. The warm-cache numbers are optimistic upper bounds.
The memcpy advantage
flat_map’s secret weapon for small datasets is memmove/memcpy. When inserting into a sorted array of POD types, the shift of trailing elements compiles to a single rep movsb or SIMD copy:
// flat_map insertion (simplified)
auto pos = std::lower_bound(data.begin(), data.end(), key);
data.insert(pos, {key, value}); // calls memmove for POD types
For N=100 elements of PodValue<64> (72 bytes each), shifting 50 elements is memmove of 3,600 bytes —a single operation that takes ~50 ns on modern x86. Compare this with std::map’s insertion: allocate a node (potentially calling malloc), link it into the tree, rebalance (2–3 pointer writes and a rotation). The malloc alone can take hundreds of nanoseconds if it hits a lock.
This advantage disappears when:
- The value type is non-POD (element-wise move instead of
memmove) - The dataset is large enough that the shift touches many cache lines beyond L1/L2
- String keys make the
lower_boundcomparison expensive
Practical selection guide
Is your dataset small (N < 100)?
├── Yes
│ ├── Is it read-heavy?
│ │ ├── Yes → Do you need ordering?
│ │ │ ├── Yes → flat_map (2–3× faster find than map)
│ │ │ └── No → unordered_flat_map (fastest everything)
│ │ └── No (write-heavy)
│ │ └── unordered_flat_map (O(1) amortized insert)
│ └── Is the value type POD?
│ ├── Yes → flat_map is still competitive for insert
│ └── No → flat_map insert degrades; prefer unordered_flat_map
└── No (N ≥ 100)
├── Need ordering?
│ ├── Yes → std::map (flat_map insert is O(N))
│ └── No → unordered_flat_map
└── Need traversal?
└── flat_map or unordered_flat_map (20× faster traversal)
Constraints
Both flat variants require elements to be copyable and moveable —they store elements directly, not through pointers. This means you cannot store std::unique_ptr or other move-only types directly (wrap them in a std::shared_ptr or use a pointer as the value type).
unordered_flat_map does not allow setting max_load_factor —it uses an internal policy to decide when to rehash. This is usually fine, but it means you cannot control the memory/performance trade-off as precisely as with std::unordered_map.
C++23: std::flat_map
C++23 introduces std::flat_map and std::flat_multimap into the standard library, based on the Boost design. The interface is nearly identical to std::map, making migration straightforward. As of early 2026, major standard library implementations (libc++, libstdc++, MSVC STL) have partial or complete support.
References
- flat_map 性能调研 (Chinese, with detailed benchmarks): https://zhuanlan.zhihu.com/p/661418250
- Boost.Container flat_map: https://www.boost.org/doc/libs/1_83_0/doc/html/container/non_standard_containers.html
- Boost.Unordered open-addressing containers: https://www.boost.org/doc/libs/1_83_0/libs/unordered/doc/html/unordered.html
- C++23
std::flat_mapproposal (P0429R9): https://wg21.link/P0429 - Intel 64 and IA-32 Optimization Reference Manual: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
- Tessil hash map benchmarks: https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html