Low-Latency Tuning Guide for AMD64/Linux

This guide covers how to tune AMD64/x86-64 hardware and Linux for real-time or low-latency workloads. Target applications include line-rate packet capture, deep packet inspection, kernel-bypass networking, and precise benchmarking of CPU-bound code.

“Latency” here means the time between an event occurring and its processing completing, for example from a NIC receiving a packet to the application finishing its handling of that packet.

The two levers are:

Maximize single-core performance: raise CPU frequency and disable power-saving features.
Minimize jitter: reduce interruptions from timer ticks, device interrupts, and competing threads.

You can measure jitter reduction with the hiccups tool. After isolating core 3, the output looks like:

$ hiccups | column -t -R 1,2,3,4,5,6
cpu  threshold_ns  hiccups  pct99_ns  pct999_ns  max_ns
0    168           17110    83697     6590444    17010845
1    168           9929     169555    5787333    9517076
2    168           20728    73359     6008866    16008460
3    168           28336    1354      4870       17869

Core 3’s maximum jitter drops to about 18 microseconds, while the other cores still see worst cases of roughly 10 to 17 milliseconds.
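To check the result automatically rather than by eye, the table can be filtered with awk. This is a minimal sketch, assuming the column layout shown above (header on the first line, max_ns in the sixth column). The function name quiet_cores and the 100000 ns default limit are illustrative and not part of hiccups.

```sh
# quiet_cores [LIMIT_NS]: read hiccups output on stdin and print the CPUs
# whose max_ns (column 6) is below LIMIT_NS. The default limit is an
# arbitrary placeholder; pick one that matches your latency budget.
quiet_cores() {
    awk -v limit="${1:-100000}" 'NR > 1 && $6 + 0 < limit + 0 { print $1 }'
}
```

For example, hiccups | quiet_cores 50000 would list only the cores whose worst case stayed under 50 microseconds.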

Hardware

Power profile

Set the UEFI/BIOS power profile to Maximum Performance. This typically disables deep C-states, so cores do not pay a wake-up penalty when load arrives. The exact option name varies by vendor.

Disable SMT / Hyper-Threading

SMT improves throughput for IPC-limited workloads by sharing execution units between logical cores. For latency-sensitive code, the sharing introduces resource contention and jitter.

Disabling SMT also doubles the effective L1/L2 cache available to each thread (in a 2-way SMT system), because the physical core's caches are no longer shared between two logical cores.

Preferred order:

Disable in UEFI/BIOS.
Kernel parameter at boot: nosmt
Runtime: echo off > /sys/devices/system/cpu/smt/control
Hot-unplug sibling threads after consulting topology:

lscpu --extended
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list

Verify:

cat /sys/devices/system/cpu/smt/active   # expect 0
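For the hot-unplug route, the sibling lists have to be turned into a set of CPUs to take offline. The sketch below keeps the first CPU of each sibling group and prints the rest. It assumes the list formats "0,4" and "0-1" that thread_siblings_list commonly uses; the function name secondary_siblings is made up for illustration.

```sh
# secondary_siblings: read thread_siblings_list lines (e.g. "0,4" or "0-1")
# on stdin and print every CPU except the first one of each sibling group.
secondary_siblings() {
    sort -u | awk -F, '{
        n = 0
        for (i = 1; i <= NF; i++) {
            lo = hi = $i
            if (split($i, r, "-") == 2) { lo = r[1]; hi = r[2] }
            for (c = lo + 0; c <= hi + 0; c++)
                if (n++ > 0) print c
        }
    }' | sort -n -u
}

# Possible usage (not executed here):
#   cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | secondary_siblings
#   # then, for each printed N:
#   #   echo 0 > /sys/devices/system/cpu/cpuN/online
```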

Turbo Boost

Intel Turbo Boost and AMD Turbo Core let cores run above the base frequency while thermal and power headroom exist. With good cooling and unused cores disabled, the remaining cores can often hold the maximum boost frequency continuously.

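cat /sys/devices/system/cpu/intel_pstate/no_turbo   # intel_pstate driver: 0 = boost enabled
cat /sys/devices/system/cpu/cpufreq/boost           # acpi-cpufreq driver: 1 = boost enabled
cpupower frequency-info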

Do not run turbostat in production; it introduces scheduling jitter.

Kernel

CPU frequency governor

Force all cores to maximum frequency:

find /sys/devices/system/cpu -name scaling_governor \
  -exec sh -c 'echo performance > {}' ';'

Or via tuned:

tuned-adm profile latency-performance
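Either way, it is worth confirming that the setting stuck, since some services reset governors after boot. The following is a small sketch that lists every scaling_governor file not set to performance. The function name non_performance_cpus is hypothetical, and the optional directory argument exists only so the check can be tried against a test tree.

```sh
# non_performance_cpus [ROOT]: print each scaling_governor file under ROOT
# whose value is not "performance". ROOT defaults to the sysfs CPU tree.
non_performance_cpus() {
    find "${1:-/sys/devices/system/cpu}" -name scaling_governor 2>/dev/null |
    while read -r f; do
        [ "$(cat "$f")" = "performance" ] || echo "$f"
    done
}
```

Empty output means every policy already uses the performance governor.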

Core isolation

The kernel scheduler load-balances across all cores by default. Add isolcpus to the kernel command line to exclude cores from the general scheduler pool:

isolcpus=1-7

Your application threads must explicitly pin to these cores via taskset or sched_setaffinity.
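Before pinning, it can help to confirm that the target core really is in the isolated set, which the kernel exposes as a cpulist in /sys/devices/system/cpu/isolated. The sketch below checks membership in such a list. The function name cpu_in_list and the ./app binary in the usage comment are placeholders.

```sh
# cpu_in_list CPU LIST: succeed if CPU appears in a kernel-style cpulist
# such as "1-7" or "1,3,5-7"; fail otherwise (including for an empty list).
cpu_in_list() {
    echo "$2" | tr ',' '\n' | awk -v cpu="$1" -F- '
        NF == 1 && $1 + 0 == cpu + 0 { found = 1 }
        NF == 2 && cpu + 0 >= $1 + 0 && cpu + 0 <= $2 + 0 { found = 1 }
        END { exit !found }'
}

# Possible usage (not executed here; ./app is a placeholder):
#   isolated=$(cat /sys/devices/system/cpu/isolated)
#   cpu_in_list 3 "$isolated" && taskset -c 3 ./app
```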

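Kernel threads can still run on isolated cores. Move the ones that are not bound to a specific CPU to core 0 (per-CPU kernel threads cannot be moved, and taskset reports an error for them):

pgrep -P 2 | xargs -I{} taskset -p -c 0 {}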

Or use tuna:

tuna --cpus=1-7 --isolate
tuna -P   # verify thread affinities

Move kernel workqueues to core 0 (a cpumask of 1 selects CPU 0 only):

find /sys/devices/virtual/workqueue -name cpumask \
  -exec sh -c 'echo 1 > {}' ';'

Verify with:

perf stat -e 'sched:sched_switch' -a -A --timeout 10000

Isolated cores should show near-zero context switches.

Timer tick suppression

The scheduler timer fires periodically on every core to preempt threads. On an isolated core running a single thread, suppress it:

# kernel command line
nohz_full=1-7 rcu_nocbs=1-7

The tick is suppressed only while exactly one runnable thread is on the core. Check /proc/sched_debug (/sys/kernel/debug/sched/debug on newer kernels) to verify.

Extend the VM stats update interval to reduce related wakeups:

sysctl vm.stat_interval=120

Verify:

perf stat -e 'irq_vectors:local_timer_entry' -a -A --timeout 30000

Isolated cores should show roughly 1-2 timer interrupts per second. Full elimination is not currently possible.

IRQ affinity

Move interrupt handling off isolated cores. Let irqbalance do it automatically (it respects isolcpus):

irqbalance --foreground --oneshot

Or ban specific cores manually (core 3 = bitmask 0x8):

IRQBALANCE_BANNED_CPUS=8 irqbalance --foreground --oneshot
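The mask is simply bit N set for each banned CPU N, so core 3 gives 1 << 3 = 0x8 and cores 1-7 give 0xfe. Here is a sketch of that conversion in plain shell arithmetic. The function name cpus_to_mask is illustrative, and the approach is limited to CPUs 0-62 by shell integer width.

```sh
# cpus_to_mask LIST: print the hex bitmask (without 0x prefix) for a
# cpulist such as "3", "1-7" or "0,3". Works for CPU numbers 0-62 only.
cpus_to_mask() (
    mask=0
    for part in $(echo "$1" | tr ',' ' '); do
        lo=${part%-*}          # "1-7" -> 1, "3" -> 3
        hi=${part#*-}          # "1-7" -> 7, "3" -> 3
        c=$lo
        while [ "$c" -le "$hi" ]; do
            mask=$(( mask | (1 << c) ))
            c=$(( c + 1 ))
        done
    done
    printf '%x\n' "$mask"
)

# Possible usage (not executed here):
#   IRQBALANCE_BANNED_CPUS=$(cpus_to_mask 1-7) irqbalance --foreground --oneshot
```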

Check current affinities:

find /proc/irq/ -name smp_affinity_list -print -exec cat '{}' ';'
watch cat /proc/interrupts

Network stack

For low-latency networking, avoid the Linux kernel network stack entirely. Kernel-bypass options:

DPDK
OpenOnload (Solarflare)
Mellanox VMA
Exablaze

If you must use the kernel stack, consult the Red Hat performance tuning guide and Cloudflare’s article on low-latency 10 Gbps networking.

Memory

Disable swap

Accessing anonymous memory that has been paged to disk causes a major page fault, which stalls for the duration of a disk read. Disable swap entirely:

swapoff -a

Then, from within the application, lock all current and future mappings into RAM:

mlockall(MCL_CURRENT | MCL_FUTURE);

Disable Transparent Huge Pages

THP automatically promotes 4 KiB pages to 2 MiB pages, reducing TLB pressure. However, the promotion process and background compaction by khugepaged and kcompactd introduce latency spikes by modifying page tables and triggering TLB shootdowns.

Disable at boot:

transparent_hugepage=never

Or at runtime:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

Use explicit huge pages instead (mmap with MAP_HUGETLB, or hugetlbfs). A THP-aware allocator does not help here, because it depends on THP being enabled.
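The sysfs file reports all available modes with the active one in brackets, for example "always madvise [never]". Below is a small sketch for extracting the active mode. The function name thp_mode is made up, and the optional file argument is there only so the parser can be tried on a sample file.

```sh
# thp_mode [FILE]: print the active THP mode, i.e. the bracketed word in
# a line such as "always madvise [never]".
thp_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' \
        "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}
```

After either method above, the expected output is never.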

Disable automatic NUMA balancing

Linux’s automatic NUMA balancing migrates pages toward the NUMA node that accesses them most, using page faults as signals. This causes TLB flushes and unpredictable page fault latency.

echo 0 > /proc/sys/kernel/numa_balancing

Also ensure numad is not running.

Disable KSM

Kernel Samepage Merging deduplicates identical pages, but requires locking page tables and flushing TLBs during the merge.

echo 0 > /sys/kernel/mm/ksm/run

Spectre/Meltdown mitigations

Software mitigations for CPU vulnerabilities (Spectre, Meltdown, MDS) carry a performance cost that varies by workload. On an isolated, trusted system, disable them:

mitigations=off   # kernel command line
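The kernel reports the resulting state per vulnerability under /sys/devices/system/cpu/vulnerabilities, one file each. The sketch below prints them as "name: status" lines. The function name mitigation_status is illustrative, and the directory argument is only for trying it against a test tree.

```sh
# mitigation_status [DIR]: print "name: status" for each file in the
# kernel's vulnerabilities directory (or in DIR, when given).
mitigation_status() {
    for f in "${1:-/sys/devices/system/cpu/vulnerabilities}"/*; do
        if [ -f "$f" ]; then
            printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
        fi
    done
}
```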

Cache partitioning

If the CPU supports Intel Cache Allocation Technology (CAT), partition the LLC to give the latency-critical application the majority of cache ways. Use intel-cmt-cat to configure.

Application

Prevent page faults

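Call mlockall(MCL_CURRENT | MCL_FUTURE) early in startup. MCL_CURRENT faults in and locks every page that is already mapped; MCL_FUTURE makes later mappings resident and locked as soon as they are created, so first-touch page faults stay off the hot path.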
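To confirm from outside the process that locking took effect, the VmLck field of /proc/<pid>/status reports the amount of locked memory in kB. This is a minimal sketch for reading it; the function name locked_kib and the myapp process name in the usage comment are placeholders.

```sh
# locked_kib STATUS_FILE: print the VmLck value (kB of locked memory)
# from a /proc/<pid>/status file.
locked_kib() {
    awk '$1 == "VmLck:" { print $2 }' "$1"
}

# Possible usage (not executed here; myapp is a placeholder):
#   locked_kib "/proc/$(pidof myapp)/status"
```

A value of 0 after startup suggests the mlockall call failed, often because of the RLIMIT_MEMLOCK limit (ulimit -l).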

Use huge pages

The default 4 KiB page size means a TLB with 2048 entries covers only 8 MiB of working set. For applications with larger working sets, 2 MiB or 1 GiB huge pages reduce TLB misses significantly.
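The arithmetic behind that figure is entries multiplied by page size. The sketch below reproduces it for other page sizes. The function name tlb_reach_mib is illustrative, and the 2048-entry count is just the example used above; many CPUs have fewer TLB entries for huge pages than for 4 KiB pages, so treat the huge-page results as upper bounds.

```sh
# tlb_reach_mib ENTRIES PAGE_KIB: memory covered by a TLB with ENTRIES
# entries of PAGE_KIB KiB each, in MiB.
tlb_reach_mib() {
    echo $(( $1 * $2 / 1024 ))
}

tlb_reach_mib 2048 4       # 4 KiB pages
tlb_reach_mib 2048 2048    # 2 MiB pages
```

With 4 KiB pages this gives 8 MiB; with 2 MiB pages the same number of entries would cover 4096 MiB.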

Monitor TLB behavior:

perf stat -e 'dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses' \
  -a --timeout 10000

Avoid TLB shootdowns

TLB shootdowns are triggered by page table modifications: the kernel sends an IPI to the other cores that may hold the affected translations so that they invalidate their stale TLB entries. Sources include:

munmap, mprotect
glibc malloc calling madvise(MADV_FREE) or munmap on freed memory
THP promotion and compaction
KSM merges
NUMA page migration
Page cache writeback

To avoid them: map all memory at startup, do not free it afterward, disable THP, KSM, and automatic NUMA balancing, and avoid writable file-backed mappings.

Monitor:

grep -E 'TLB|CPU' /proc/interrupts
perf stat -e 'tlb:tlb_flush' -a -A --timeout 10000
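To track shootdowns per core over time, the TLB line of /proc/interrupts can be split into one counter per CPU. This is a sketch assuming the online CPUs are numbered consecutively from 0, so that column position maps directly to CPU number. The function name tlb_shootdowns is made up for illustration.

```sh
# tlb_shootdowns [FILE]: print "cpuN count" for each per-CPU counter on
# the TLB line of /proc/interrupts (or of FILE, when given).
tlb_shootdowns() {
    awk '$1 == "TLB:" {
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
            printf "cpu%d %s\n", i - 2, $i
    }' "${1:-/proc/interrupts}"
}
```

Running it twice and comparing the counts for an isolated core shows whether that core is still receiving shootdown IPIs.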

Scheduling policy

Prefer SCHED_OTHER on an isolated core running a single thread with busy-wait polling, over SCHED_FIFO or SCHED_RR. A real-time thread that never yields can starve per-CPU kernel work (such as the vmstat update worker), potentially hanging the system.

Linux throttles real-time tasks to 95% of CPU time by default (sched_rt_runtime_us = 950000 out of sched_rt_period_us = 1000000, both under /proc/sys/kernel/). If real-time scheduling is required, raise sched_rt_runtime_us, or write -1 to disable throttling entirely.
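The percentage follows from the runtime and period values, 950000 and 1000000 microseconds by default. The sketch below does that calculation; the function name rt_budget_pct is illustrative.

```sh
# rt_budget_pct RUNTIME_US PERIOD_US: percentage of each period available
# to real-time tasks. A negative runtime (-1) means throttling is off.
rt_budget_pct() {
    if [ "$1" -lt 0 ]; then
        echo 100
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

# Possible usage (not executed here):
#   rt_budget_pct "$(cat /proc/sys/kernel/sched_rt_runtime_us)" \
#                 "$(cat /proc/sys/kernel/sched_rt_period_us)"
```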