Low-Latency Tuning Guide for AMD64/Linux
This guide covers how to tune AMD64/x86-64 hardware and Linux for real-time or low-latency workloads. Target applications include line-rate packet capture, deep packet inspection, kernel-bypass networking, and precise benchmarking of CPU-bound code.
“Latency” here means the time between an event occurring and its processing completing, for example from a NIC receiving a packet to the application finishing its handling of that packet.
The two levers are:
- Maximize single-core performance: raise CPU frequency, disable power-saving features.
- Minimize jitter: reduce interruptions from timers, interrupts, and competing threads.
You can measure jitter reduction with the hiccups tool. After isolating core 3, the output looks like:
$ hiccups | column -t -R 1,2,3,4,5,6
cpu  threshold_ns  hiccups  pct99_ns  pct999_ns    max_ns
  0           168    17110     83697    6590444  17010845
  1           168     9929    169555    5787333   9517076
  2           168    20728     73359    6008866  16008460
  3           168    28336      1354       4870     17869
Core 3’s max jitter drops to about 18 microseconds (17,869 ns), while the other cores still peak at roughly 9.5 to 17 milliseconds.
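The idea behind this kind of measurement can be sketched in a few lines. The snippet below is not the hiccups tool and makes no claim about its algorithm. It assumes a simple approach in which a thread spins on a monotonic clock and records every gap between consecutive readings that exceeds a threshold. The function name measure_gaps and its parameters are hypothetical, and the interpreter adds its own overhead, so the absolute numbers are only indicative.

```python
import time


def measure_gaps(duration_s=0.05, threshold_ns=10_000):
    """Spin on a monotonic clock and collect gaps longer than threshold_ns.

    A long gap between two consecutive clock reads means this thread
    was not making progress for that long (preempted, interrupted, ...).
    """
    gaps = []
    deadline = time.perf_counter_ns() + int(duration_s * 1e9)
    prev = time.perf_counter_ns()
    while prev < deadline:
        now = time.perf_counter_ns()
        delta = now - prev
        if delta > threshold_ns:
            gaps.append(delta)
        prev = now
    return gaps


if __name__ == "__main__":
    gaps = measure_gaps()
    print("gaps over threshold:", len(gaps))
    print("max gap (ns):", max(gaps, default=0))
```

Running it once pinned to an isolated core and once on a housekeeping core should show the same qualitative difference as the table above.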
Hardware
Power profile
Set the UEFI/BIOS power profile to Maximum Performance. This typically disables deep C-states and other power-saving features, so a core can respond to load without first waking from a low-power state.
Disable SMT / Hyper-Threading
SMT improves throughput for IPC-limited workloads by sharing execution units between logical cores. For latency-sensitive code, the sharing introduces resource contention and jitter.
Disabling SMT also doubles the effective L1/L2 cache available to each thread (in a 2-way SMT system), because the physical core's caches are no longer shared between two siblings.
Preferred order:
- Disable in UEFI/BIOS.
- Kernel parameter at boot:
nosmt
- Runtime:
echo off > /sys/devices/system/cpu/smt/control
- Hot-unplug sibling threads after consulting topology:
lscpu --extended
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
Verify:
cat /sys/devices/system/cpu/smt/active # expect 0
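For the hot-unplug option, the sibling lists have to be turned into a set of logical CPUs to take offline. The following is a minimal sketch of that step, assuming the policy "keep the lowest-numbered thread of each physical core"; the helper names parse_cpu_list and siblings_to_offline are hypothetical.

```python
import glob


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0,4" or "0-1,8-9" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings_to_offline(sibling_lists):
    """Return every CPU except the lowest-numbered one in each sibling group."""
    offline = set()
    for text in sibling_lists:
        group = parse_cpu_list(text)
        offline.update(group[1:])
    return sorted(offline)


if __name__ == "__main__":
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    print("hot-unplug candidates:", siblings_to_offline(lists))
```

Each candidate can then be taken offline by writing 0 to /sys/devices/system/cpu/cpuN/online. The script only prints the list, so the decision stays with the operator.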
Turbo Boost
Intel Turbo Boost and AMD Turbo Core allow sustained operation above the base frequency when thermal and power headroom exist. With good cooling and some cores disabled, the remaining cores can run at maximum boost frequency continuously.
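cat /sys/devices/system/cpu/intel_pstate/no_turbo # 0 = boost enabled (intel_pstate driver)
cat /sys/devices/system/cpu/cpufreq/boost # 1 = boost enabled (acpi-cpufreq driver)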
cpupower frequency-info
Do not run turbostat in production: it introduces scheduling jitter.
Kernel
CPU frequency governor
Force all cores to maximum frequency:
find /sys/devices/system/cpu -name scaling_governor \
-exec sh -c 'echo performance > {}' ';'
Or via tuned:
tuned-adm profile latency-performance
Core isolation
The kernel scheduler load-balances across all cores by default. Add isolcpus to the kernel command line to exclude cores from the general scheduler pool:
isolcpus=1-7
Your application threads must explicitly pin to these cores via taskset or sched_setaffinity.
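As a rough illustration of explicit pinning from inside the application, the sketch below uses Python's os.sched_setaffinity, which wraps the sched_setaffinity system call and is available on Linux only. The helper name pin_to_cpu is hypothetical, and the target core is chosen arbitrarily for the demo.

```python
import os


def pin_to_cpu(cpu):
    """Pin the calling thread to one CPU and return the resulting affinity set."""
    # A pid of 0 means "the calling thread" for sched_setaffinity on Linux.
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Demo assumption: use the highest CPU this process may currently run on.
    # With isolcpus=1-7 one would pass the number of an isolated core instead.
    target = max(os.sched_getaffinity(0))
    print("now restricted to:", pin_to_cpu(target))
```

Pinning should happen before the thread enters its hot loop, so that the migration itself does not show up as jitter.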
Kernel threads still spawn on isolated cores. Migrate them to core 0:
pgrep -P 2 | xargs -I{} taskset -p -c 0 {}
Or use tuna:
tuna --cpus=1-7 --isolate
tuna -P # verify thread affinities
Migrate kernel workqueues to core 0 (cpumask 1 = bit 0 = core 0):
find /sys/devices/virtual/workqueue -name cpumask \
-exec sh -c 'echo 1 > {}' ';'
Verify with:
perf stat -e 'sched:sched_switch' -a -A --timeout 10000
Isolated cores should show near-zero context switches.
Timer tick suppression
The scheduler timer fires periodically on every core to preempt threads. On an isolated core running a single thread, suppress it:
# kernel command line
nohz_full=1-7 rcu_nocbs=1-7
The tick is suppressed only while a single runnable thread is on the core. Check /proc/sched_debug (on newer kernels, /sys/kernel/debug/sched/debug) to verify.
Extend the VM stats update interval to reduce related wakeups:
sysctl vm.stat_interval=120
Verify:
perf stat -e 'irq_vectors:local_timer_entry' -a -A --timeout 30000
Isolated cores should show roughly 1-2 timer interrupts per second. Full elimination is not currently possible.
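Where perf is not available, the residual tick rate can also be estimated from the LOC (local timer interrupts) row of /proc/interrupts. The sketch below takes two snapshots and prints the per-CPU rate; the helper names read_row and rates and the 5-second interval are hypothetical choices.

```python
import time


def read_row(text, label="LOC"):
    """Return {cpu_name: count} for one labelled row of /proc/interrupts."""
    lines = text.splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        if name.strip() == label:
            counts = rest.split()[: len(cpus)]
            return dict(zip(cpus, map(int, counts)))
    raise KeyError(label)


def rates(before, after, seconds):
    """Interrupts per second on each CPU between two snapshots."""
    return {cpu: (after[cpu] - before[cpu]) / seconds for cpu in before}


if __name__ == "__main__":
    interval = 5.0
    with open("/proc/interrupts") as f:
        first = read_row(f.read())
    time.sleep(interval)
    with open("/proc/interrupts") as f:
        second = read_row(f.read())
    for cpu, rate in rates(first, second, interval).items():
        print(f"{cpu}: {rate:.1f} timer interrupts/s")
```

Housekeeping cores should report a rate close to the kernel's configured tick frequency, and isolated cores a rate in the low single digits.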
IRQ affinity
Move interrupt handling off isolated cores. Let irqbalance do it automatically (it respects isolcpus):
irqbalance --foreground --oneshot
Or ban specific cores manually (core 3 = bitmask 0x8):
IRQBALANCE_BANNED_CPUS=8 irqbalance --foreground --oneshot
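The bitmask is easy to get wrong by hand once more than one core is involved. A small sketch of the conversion, with bit N standing for core N; the function name cpu_mask is hypothetical, and the output assumes a machine small enough that the mask fits in a single hexadecimal word.

```python
def cpu_mask(cpus):
    """Build the hexadecimal bitmask in which bit N stands for CPU N."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


if __name__ == "__main__":
    print(cpu_mask([3]))          # core 3 only
    print(cpu_mask(range(1, 8)))  # cores 1-7
```

The same conversion applies to the workqueue cpumask files, where a value of 1 selects core 0.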
Check current affinities:
find /proc/irq/ -name smp_affinity_list -print -exec cat '{}' ';'
watch cat /proc/interrupts
Network stack
For low-latency networking, avoid the Linux kernel network stack entirely. Kernel-bypass options:
- DPDK
- OpenOnload (Solarflare)
- Mellanox VMA
- Exablaze
If you must use the kernel stack, consult the Red Hat performance tuning guide and Cloudflare’s article on low-latency 10 Gbps networking.
Memory
Disable swap
Accessing anonymous memory that has been paged to disk causes a major page fault, which stalls for the duration of a disk read. Disable swap entirely:
swapoff -a
Then lock all current and future allocations:
mlockall(MCL_CURRENT | MCL_FUTURE);
Disable Transparent Huge Pages
THP automatically promotes 4 KiB pages to 2 MiB pages, reducing TLB pressure. However, the promotion process and background compaction by khugepaged and kcompactd introduce latency spikes by modifying page tables and triggering TLB shootdowns.
Disable at boot:
transparent_hugepage=never
Or at runtime:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
Use explicit huge pages instead, via mmap with MAP_HUGETLB or a hugetlbfs-backed allocator.
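To confirm that the THP setting took effect, read the sysfs file back: it lists all modes and marks the active one with square brackets. A minimal sketch of that check follows; the helper name active_mode is hypothetical.

```python
import re

THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_mode(text):
    """Return the bracketed (active) choice from 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


if __name__ == "__main__":
    with open(THP_ENABLED) as f:
        print("THP mode:", active_mode(f.read()))
```

A startup check of this kind lets the application refuse to run, or at least log a warning, when the host is not configured as expected.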
Disable automatic NUMA balancing
Linux’s automatic NUMA balancing migrates pages toward the NUMA node that accesses them most, using page faults as signals. This causes TLB flushes and unpredictable page fault latency.
echo 0 > /proc/sys/kernel/numa_balancing
Also ensure numad is not running.
Disable KSM
Kernel Samepage Merging deduplicates identical pages, but requires locking page tables and flushing TLBs during the merge.
echo 0 > /sys/kernel/mm/ksm/run
Spectre/Meltdown mitigations
Software mitigations for CPU vulnerabilities (Spectre, Meltdown, MDS) carry a performance cost that varies by workload. On an isolated, trusted system, disable them:
mitigations=off # kernel command line
Cache partitioning
If the CPU supports Intel Cache Allocation Technology (CAT), partition the LLC to give the latency-critical application the majority of cache ways. Use intel-cmt-cat to configure.
Application
Prevent page faults
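Call mlockall(MCL_CURRENT | MCL_FUTURE) early in startup. MCL_CURRENT faults in and locks every page that is already mapped; MCL_FUTURE does the same for each later mapping when it is created, so subsequent accesses to that memory do not page-fault.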
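mlockall can fail, most commonly because the process exceeds its RLIMIT_MEMLOCK limit and lacks CAP_IPC_LOCK, so the return value is worth checking. The sketch below calls the libc function through ctypes and reports the limit alongside the result. The constants are the Linux x86-64 values from <sys/mman.h> and are an assumption on any other platform; the helper name lock_all_memory is hypothetical.

```python
import ctypes
import errno
import os
import resource

# Values from <sys/mman.h> on Linux x86-64; other platforms may differ.
MCL_CURRENT = 1
MCL_FUTURE = 2


def lock_all_memory():
    """Call mlockall(MCL_CURRENT | MCL_FUTURE); return 0 or the errno on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0:
        return 0
    return ctypes.get_errno()


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("RLIMIT_MEMLOCK soft/hard:", soft, hard)
    err = lock_all_memory()
    if err:
        print("mlockall failed:", errno.errorcode.get(err, err), os.strerror(err))
    else:
        print("all current and future pages are locked")
```

In a C or C++ application the same check is a test of the return value and errno immediately after the call.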
Use huge pages
The default 4 KiB page size means a TLB with 2048 entries covers only 8 MiB of working set. For applications with larger working sets, 2 MiB or 1 GiB huge pages reduce TLB misses significantly.
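The arithmetic behind that figure is simply entries times page size. The sketch below repeats it for the three page sizes, under the simplifying assumption that the same 2048 entries are available for every page size, which real CPUs do not generally guarantee; the function name tlb_reach is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries, page_size):
    """Bytes of address space a fully populated TLB can map."""
    return entries * page_size


if __name__ == "__main__":
    entries = 2048  # figure used in the text above
    for name, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
        reach = tlb_reach(entries, size)
        print(f"{name} pages: {reach / MIB:.0f} MiB reach")
```

With 2 MiB pages the same number of entries covers 4 GiB instead of 8 MiB, which is why the switch pays off for large working sets.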
Monitor TLB behavior:
perf stat -e 'dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses' \
-a --timeout 10000
Avoid TLB shootdowns
TLB shootdowns are triggered by page table modifications: the kernel sends an IPI to every core that may hold a stale translation so that it invalidates the affected TLB entries. Sources include:
- munmap, mprotect
- glibc malloc calling madvise(MADV_FREE) or munmap on freed memory
- THP promotion and compaction
- KSM merges
- NUMA page migration
- Page cache writeback
To avoid them: map all memory at startup, do not free it afterward, disable THP, KSM, and automatic NUMA balancing, and avoid writable file-backed mappings.
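The "map once, never free" advice usually translates into a pool that hands out and recycles fixed-size buffers. The following is a minimal sketch of that pattern, assuming a single anonymous mapping created at startup; the class name BufferPool and its methods are hypothetical, and a production pool would live in the application's own language and allocator.

```python
import mmap


class BufferPool:
    """Fixed-size buffers carved out of one mapping created at startup."""

    def __init__(self, count, size):
        # One anonymous mapping that lives for the lifetime of the process.
        self._region = mmap.mmap(-1, count * size)
        view = memoryview(self._region)
        self._free = [view[i * size:(i + 1) * size] for i in range(count)]

    def acquire(self):
        """Hand out a buffer; raises IndexError when the pool is exhausted."""
        return self._free.pop()

    def release(self, buf):
        """Put a buffer back on the free list instead of unmapping it."""
        self._free.append(buf)


if __name__ == "__main__":
    pool = BufferPool(count=4, size=4096)
    buf = pool.acquire()
    buf[:5] = b"hello"
    pool.release(buf)
```

Because release only pushes the buffer back onto a free list, no munmap or madvise call is issued on the hot path.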
Monitor:
grep -E 'TLB|CPU' /proc/interrupts
perf stat -e 'tlb:tlb_flush' -a -A --timeout 10000
Scheduling policy
On an isolated core running a single busy-polling thread, prefer SCHED_OTHER over SCHED_FIFO or SCHED_RR. Real-time priority on a thread that never yields can starve kernel tasks on that core (such as vmstat), potentially locking the system.
Linux throttles real-time tasks to 95% of CPU time by default: /proc/sys/kernel/sched_rt_runtime_us is 950000 out of a /proc/sys/kernel/sched_rt_period_us of 1000000. If real-time scheduling is required, adjust this limit accordingly; writing -1 to sched_rt_runtime_us disables the throttle.
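The 95% figure is the ratio of two sysctl values, sched_rt_runtime_us and sched_rt_period_us, whose defaults are 950000 and 1000000 microseconds. A small sketch that reads both and reports the effective budget; the function name rt_budget_fraction is hypothetical.

```python
def rt_budget_fraction(runtime_us, period_us):
    """Fraction of each period real-time tasks may run; -1 means unthrottled."""
    if runtime_us == -1:
        return 1.0
    return runtime_us / period_us


if __name__ == "__main__":
    with open("/proc/sys/kernel/sched_rt_runtime_us") as f:
        runtime = int(f.read())
    with open("/proc/sys/kernel/sched_rt_period_us") as f:
        period = int(f.read())
    share = rt_budget_fraction(runtime, period)
    print(f"real-time tasks may use {share:.0%} of each period")
```

A busy-polling real-time thread that hits this limit is descheduled for the remainder of the period, which appears as a periodic stall of up to 50 ms with the default values.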