Low-Latency Trading Machine: A Practical Linux Kernel Tuning Guide
Deploying a low-latency trading machine is not just about writing fast code. The operating system introduces jitter through power management, interrupt handling, scheduler load balancing, and memory management. This article distills years of production deployment experience into a systematic tuning guide — every parameter explained, every service justified, every setting verified.
The Big Picture
The goal of low-latency tuning is twofold:
- Maximize per-core performance — keep CPU frequency locked at maximum, eliminate power-saving transitions
- Minimize jitter — prevent unexpected latency spikes from interrupts, timers, page faults, and background kernel activity
These two goals shape every tuning decision we make. Let’s walk through them systematically.
You can measure jitter reduction with the hiccups tool. After isolating core 3, the output shows a dramatic difference:
$ hiccups | column -t -R 1,2,3,4,5,6
cpu threshold_ns hiccups pct99_ns pct999_ns max_ns
0 168 17110 83697 6590444 17010845
1 168 9929 169555 5787333 9517076
2 168 20728 73359 6008866 16008460
3 168 28336 1354 4870 17869
Core 3’s max jitter drops to about 18 microseconds, while the other cores still see worst-case stalls of roughly 10 to 17 milliseconds.
CPU Power Management
Disable C-States
Modern CPUs implement multiple C-states (idle power states). Deeper C-states save power but introduce latency when the core wakes up — the transition from C6 back to C0 can take tens of microseconds. In a trading system where every microsecond counts, this is unacceptable.
GRUB parameters:
intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
- intel_idle.max_cstate=0: disables the Intel idle driver, preventing the CPU from entering any C-state beyond C0
- processor.max_cstate=0: restricts the generic ACPI idle driver as a fallback
- idle=poll: replaces the idle loop with a busy-wait polling loop; the CPU never truly sleeps
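Trade-off: Power consumption increases significantly, because every core spins at full clock instead of dropping into a near-zero-power idle state. Expect idle draw to rise by tens of watts per socket; the exact figure depends on the CPU, so measure it on your hardware. This is acceptable for trading machines where latency is revenue.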
From the field: On Solarflare machines, you can also use onload_tool disable_cstates to enforce this at runtime without rebooting.
Disable PCIe ASPM
PCIe Active State Power Management (ASPM) puts PCIe links into lower power states (L0s, L1) when idle. Waking a link from L1 adds microseconds of latency — exactly the kind of jitter that kills trading performance.
pcie_aspm=off
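Note: pcie_aspm=off tells the kernel not to touch ASPM at all, so whatever the firmware configured stays in place. Disable ASPM in the BIOS/UEFI as well. Some older docs use pcie_aspm.policy=performance instead, which keeps kernel ASPM management active but selects the policy that disables the low-power link states. Either way, verify with lspci -vv that each link reports ASPM Disabled.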
Disable HPET
The High Precision Event Timer (HPET) is a legacy timer with higher access latency than the TSC (Time Stamp Counter). On modern Intel CPUs, the TSC runs at base frequency and is invariant, making it both faster to read and more precise.
hpet=disable
The kernel falls back to TSC-based timers, which are significantly faster to access (20-30ns vs 1000+ns for HPET reads).
Enable Turbo Boost (BIOS)
At the BIOS/UEFI level:
- Set energy profile to Maximum Performance
- Ensure Turbo Boost is enabled — it allows cores to clock above base frequency
- Disable Hyper-Threading (SMT) — more on this below
Verify after boot:
cat /sys/devices/system/cpu/intel_pstate/no_turbo # 0 = turbo enabled
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # should show "performance"
Do not run turbostat in production — it introduces scheduling jitter. Use it only during setup verification and disable it afterward.
CPU Isolation
This is the most impactful category of tuning for trading systems. The idea is simple: reserve specific cores exclusively for your trading application, and prevent the kernel from scheduling anything else on them.
CPU Frequency Governor
Force all cores to maximum frequency directly via sysfs:
find /sys/devices/system/cpu -name scaling_governor \
-exec sh -c 'echo performance > {}' ';'
Or via tuned (covered in detail later):
tuned-adm profile latency-performance
Isolate Cores from the Scheduler
isolcpus=1-17
isolcpus removes the specified CPUs from the kernel’s load balancer. No user-space process will be scheduled on these cores unless explicitly pinned via taskset or sched_setaffinity(). This is the foundation of CPU isolation.
Suppress Timer Interrupts
nohz_full=1-17 nohz=on
nohz_full suppresses periodic scheduler tick interrupts on the specified cores. Without this, each isolated core receives a timer interrupt every 1-4ms (depending on CONFIG_HZ), causing unnecessary wakeups and cache pollution.
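Prerequisite: nohz_full requires a kernel built with CONFIG_NO_HZ_FULL=y and at least one housekeeping CPU (here CPU 0) left out of the list. It does not strictly depend on isolcpus, but the two are normally set on the same cores, because the tick can only be stopped on a core that the scheduler is not load-balancing other tasks onto.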
nohz=on enables the tickless kernel mode globally.
Note: The tick is only suppressed while exactly one runnable thread is on the core. Check /proc/sched_debug (moved to /sys/kernel/debug/sched/debug in kernel 5.13 and later) to verify tick suppression on isolated cores.
Offload RCU Callbacks
rcu_nocbs=1-17
RCU (Read-Copy-Update) is a synchronization mechanism used extensively in the kernel. When a grace period ends, RCU callbacks run on the CPU that initiated them. On an isolated core, these callbacks introduce jitter.
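rcu_nocbs offloads RCU callback processing from the listed CPUs to dedicated rcuo kernel threads, which can then be kept on a housekeeping CPU (typically CPU 0), keeping isolated cores clean. It requires a kernel built with CONFIG_RCU_NOCB_CPU=y.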
Disable Hyper-Threading
nosmt
Hyper-Threading shares L1/L2 cache and execution units between two logical cores. For a trading thread that needs deterministic cache access, this is a liability — the sibling thread can evict your cache lines at any time.
Disabling HT gives each running thread the full L1/L2 cache and execution resources of its physical core and eliminates sibling contention. The cost is 50% fewer logical cores, but for latency-critical threads, this is always the right trade-off.
You can also disable HT at runtime:
echo off > /sys/devices/system/cpu/smt/control
Pin Kernel Threads and Workqueues
Even with isolcpus, some kernel threads and workqueues may still land on isolated cores. Move them away:
# Move kernel threads to CPU 0
pgrep -P 2 | xargs -I{} taskset -pc 0 {}
# Move workqueues to CPU 0
find /sys/devices/virtual/workqueue -name cpumask -exec sh -c 'echo 1 > {}' ';'
Or use tuna for a cleaner interface:
tuna --cpus=1-17 --isolate
Memory Management
Transparent Huge Pages: Disable
transparent_hugepage=never
THP causes multi-microsecond latency spikes through page compaction, TLB shootdowns, and the khugepaged kernel thread. For a detailed explanation of why THP, KSM, NUMA balancing, and TLB shootdowns are harmful to low-latency systems, see Virtual Memory and Latency.
You can also disable THP at runtime without rebooting:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
Explicit Hugepages
In the 2024 deployment, we configure:
default_hugepagesz=2M hugepagesz=1G hugepages=8 hugepagesz=2M hugepages=2048
This allocates:
- 8 × 1GB huge pages = 8GB for large structures (LOB snapshots, feature accumulators)
- 2048 × 2MB huge pages = 4GB for ring buffers, smaller data structures
At runtime, you can also adjust:
sysctl -w vm.nr_hugepages=20
Why explicit hugepages? A 4KB page requires a TLB entry per page. A 2MB page covers 512× the memory with one TLB entry. For a trading application that walks through a ring buffer or order book, this means dramatically fewer TLB misses — and TLB misses on modern x86 cost 100-200 cycles each (~40-80ns at 2.5GHz).
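To make the TLB arithmetic concrete, the sketch below counts how many pages (and therefore TLB entries) a buffer needs at each page size. The helper name pages_needed and the 64 MiB ring buffer are illustrative assumptions, not values taken from the deployment described here.

```shell
#!/usr/bin/env bash
# pages_needed BYTES PAGE_BYTES
# Print how many pages of PAGE_BYTES are needed to map BYTES (rounded up).
# Each mapped page needs its own TLB entry. Hypothetical helper.
pages_needed() {
    local bytes=$1 page=$2
    echo $(( (bytes + page - 1) / page ))
}

ring_bytes=$(( 64 * 1024 * 1024 ))   # assumed 64 MiB ring buffer
echo "4 KiB pages: $(pages_needed "$ring_bytes" 4096)"
echo "2 MiB pages: $(pages_needed "$ring_bytes" $(( 2 * 1024 * 1024 )))"
```

For this buffer the count drops from 16384 entries to 32, which is the 512× ratio described above.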
Disable NUMA Balancing
echo 0 > /proc/sys/kernel/numa_balancing
Automatic NUMA balancing migrates pages to the NUMA node of the accessing CPU, causing TLB shootdowns and page faults. Pin memory explicitly with numa_alloc_onnode() or mbind() instead.
Disable Kernel Samepage Merging (KSM)
echo 0 > /sys/kernel/mm/ksm/run
KSM introduces page table locking and TLB shootdowns — it has no place on a trading machine.
Lock Application Memory
From the application side, call mlockall() to prevent page faults:
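#include <stdio.h>
#include <sys/mman.h>
/* Call once at startup, before entering the hot path. */
if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) perror("mlockall");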
This locks all current and future pages in RAM, preventing the kernel from swapping them out. Combined with hugepages, this eliminates two major sources of jitter: page faults and swap I/O.
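One way to confirm from outside the process that locking took effect is to read the VmLck field of /proc/<pid>/status, which reports locked memory in kB. The following is a minimal sketch; the helper name locked_kb is hypothetical, and it takes the status file path as an argument so it can be pointed at any process.

```shell
#!/usr/bin/env bash
# locked_kb STATUS_FILE
# Print the VmLck value (locked memory, in kB) found in a
# /proc/<pid>/status file. Hypothetical helper for illustration.
locked_kb() {
    awk '/^VmLck:/ { print $2 }' "$1"
}
```

Running locked_kb /proc/<pid>/status against the trading process should print a non-zero value once mlockall() has succeeded. If it prints 0, check the return value of mlockall() and the memlock limit (ulimit -l), which applies to processes that lack CAP_IPC_LOCK.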
Also disable swap entirely:
swapoff -a
And reduce VM statistics update frequency:
sysctl vm.stat_interval=120
Avoid TLB Shootdowns
Map all memory at startup and never free it. Disable THP, KSM, and automatic NUMA balancing. Avoid writable file-backed mappings in hot paths. See Virtual Memory and Latency for the full list of shootdown triggers and elimination strategies.
Monitor:
grep -E 'TLB|CPU' /proc/interrupts
perf stat -e 'tlb:tlb_flush' -a -A --timeout 10000
Disable CPU Vulnerability Mitigations (Optional)
mitigations=off
Spectre, Meltdown, and related vulnerabilities are mitigated through techniques like retpoline, KPTI (kernel page table isolation), and IBRS. These mitigations add overhead to every syscall and indirect branch — typically 5-15% performance loss.
For trading machines on isolated networks behind firewalls, this is an acceptable risk. Evaluate your threat model carefully.
Intel Cache Allocation Technology (CAT)
If the CPU supports Intel CAT, partition the Last-Level Cache (LLC) to give the latency-critical application the majority of cache ways. This prevents other processes from polluting the cache your trading thread depends on.
Use pqos (from the intel-cmt-cat package) to configure:
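# Non-overlapping masks, assuming a 12-way LLC; -I selects the OS (resctrl) interface, which PID association needs
pqos -I -e "llc:1=0xff0;llc:2=0x00f" # Class 1 gets 8 ways, Class 2 gets 4 ways
pqos -I -a "pid:1=$(pgrep -d, trading_app);pid:2=$(pgrep -d, monitoring)"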
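CAT capacity masks are bitmaps of cache ways, so two classes are only isolated from each other if their masks share no bits. For example, 0xff and 0xf overlap in the low four ways, while 0xff0 and 0x00f do not. The sketch below checks this before a mask is applied; ways_in_mask and masks_overlap are hypothetical helper names, not part of pqos.

```shell
#!/usr/bin/env bash
# ways_in_mask MASK: count the cache ways (set bits) in a capacity bitmask.
# MASK may be written in hex (0xff0) or decimal.
ways_in_mask() {
    local m=$(( $1 )) n=0
    while (( m )); do
        n=$(( n + (m & 1) ))
        m=$(( m >> 1 ))
    done
    echo "$n"
}

# masks_overlap A B: exit 0 if the two masks share at least one way.
masks_overlap() {
    (( ($1 & $2) != 0 ))
}
```

A guard such as `masks_overlap 0xff0 0x00f && echo "classes share cache ways"` can be placed in front of the pqos call in a provisioning script.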
Interrupt Handling
Disable IRQ Balancer and Pin Interrupts
irqbalance distributes hardware interrupts across all CPUs for “fairness.” On a trading machine, this means your isolated cores get interrupted by NIC, disk, and timer IRQs.
Option 1 — Disable irqbalance entirely and pin manually (our production approach):
systemctl stop irqbalance
systemctl disable irqbalance
Then pin all interrupts to the housekeeping CPU:
for f in /proc/irq/*/smp_affinity; do
echo 1 > "$f" 2>/dev/null # some IRQs cannot be moved; ignore those errors
done
This writes 1 (CPU 0 only) to every IRQ’s affinity mask, keeping all hardware interrupts off isolated cores.
Option 2 — Use irqbalance in one-shot mode (it respects isolcpus):
irqbalance --foreground --oneshot
Or ban specific cores manually (core 3 = bitmask 0x8):
IRQBALANCE_BANNED_CPUS=8 irqbalance --foreground --oneshot
For finer control, pin specific NIC interrupts to dedicated cores:
# Pin NIC queue 0 to CPU 0
echo 1 > /proc/irq/<irq_number>/smp_affinity
Check /proc/interrupts to find which IRQ numbers correspond to which devices.
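The values written to smp_affinity and IRQBALANCE_BANNED_CPUS are hexadecimal bitmasks with one bit per CPU, which is why CPU 0 is 1 and core 3 is 8. For anything beyond a single core the mask is easy to get wrong by hand, so a small converter helps. This is a sketch; cpulist_to_mask is a hypothetical helper, and it assumes fewer than 64 CPUs so the mask fits in one shell integer.

```shell
#!/usr/bin/env bash
# cpulist_to_mask LIST
# Convert a CPU list such as "0", "3" or "1-17,20" into the hexadecimal
# bitmask format used by /proc/irq/*/smp_affinity.
# Hypothetical helper; assumes CPU numbers below 64.
cpulist_to_mask() {
    local list=$1 mask=0 part lo hi cpu
    local IFS=','
    for part in $list; do
        lo=${part%-*}   # text before the dash (or the whole item)
        hi=${part#*-}   # text after the dash (or the whole item)
        for ((cpu = lo; cpu <= hi; cpu++)); do
            mask=$(( mask | (1 << cpu) ))
        done
    done
    printf '%x\n' "$mask"
}
```

For example, `cpulist_to_mask 3` prints 8, and `cpulist_to_mask 1-17` prints 3fffe, the mask covering every isolated core in this guide's layout.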
The irqaffinity GRUB Parameter
irqaffinity=0
This sets the default CPU affinity for all interrupts to CPU 0 at boot time, before any userspace configuration. It’s a safety net that ensures no interrupt lands on isolated cores even during early boot.
Network Stack Tuning
Socket Buffer Sizes
For stock (equity) machines receiving market data:
sysctl -w net.core.rmem_max=1073741824 # 1GB max receive buffer
sysctl -w net.core.rmem_default=4194304 # 4MB default
For futures machines with higher message rates:
sysctl -w net.core.rmem_default=16777216 # 16MB
sysctl -w net.core.rmem_max=16777216 # 16MB
sysctl -w net.core.wmem_default=16777216 # 16MB
sysctl -w net.core.wmem_max=16777216 # 16MB
Why so large? If the application experiences a brief pause (e.g., a GC-like event or a burst of processing), the kernel buffer needs to absorb incoming market data without dropping packets. A 16MB buffer at a typical CTP feed rate (~200MB/day sustained, ~50MB/hour peak) provides several minutes of headroom.
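The sizing argument reduces to dividing the buffer size by the arrival rate. The sketch below does that in integer shell arithmetic. buffer_headroom_ms is a hypothetical helper, and the 100 MB/s burst rate is an assumed figure chosen for illustration, not a measured CTP number.

```shell
#!/usr/bin/env bash
# buffer_headroom_ms BUF_BYTES RATE_BYTES_PER_SEC
# Print how many milliseconds of traffic at the given rate fit in the buffer.
# Hypothetical helper; ignores per-packet kernel bookkeeping overhead.
buffer_headroom_ms() {
    local buf=$1 rate=$2
    echo $(( buf * 1000 / rate ))
}

buf=$(( 16 * 1024 * 1024 ))   # the 16 MiB buffer configured above
echo "burst:   $(buffer_headroom_ms "$buf" 100000000) ms"   # assumed 100 MB/s burst
echo "average: $(buffer_headroom_ms "$buf" 13888) ms"       # 50 MB/hour is about 13888 B/s
```

At the hourly average the buffer lasts about 20 minutes, but at the assumed burst rate it fills in under 200 ms. It is the burst rate, not the average, that determines how long a stall the application can afford.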
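Make these persistent by adding them to /etc/rc.d/rc.local (on RHEL/CentOS, /etc/rc.local is a symlink to it):
# /etc/rc.d/rc.local
sysctl -w net.core.rmem_max=1073741824
sysctl -w net.core.rmem_default=4194304
Then mark the file executable so that it runs at boot:
chmod +x /etc/rc.d/rc.local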
Ethtool Interrupt Coalescing
ethtool -C eth<N> rx-usecs 0 adaptive-rx off
Interrupt coalescing delays RX interrupts to batch notifications. rx-usecs sets the delay in microseconds — setting it to 0 delivers every packet immediately.
adaptive-rx off prevents the driver from dynamically adjusting the coalescing parameters, which would otherwise introduce unpredictable latency.
Verify:
ethtool -c eth<N>
Note: On Solarflare NICs with Onload (kernel bypass), this is less relevant since packets never traverse the kernel stack. But for kernel-path NICs (like Intel X710), this setting is critical.
Kernel Bypass Alternatives
For the lowest possible network latency, consider bypassing the Linux kernel network stack entirely:
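- DPDK: open-source, vendor-neutral kernel bypass framework
- OpenOnload / EFVI (Solarflare): transparent acceleration for socket-based applications, with EFVI as the lower-level direct NIC interface
- Mellanox VMA: the Messaging Accelerator library, which transparently accelerates socket-based applications on Mellanox NICs
- Exablaze ExaNIC: ultra-low-latency NICs with their own bypass API (SolarCapture, by contrast, is Solarflare's packet capture software)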
If you must use the kernel stack, the ethtool and buffer tuning above becomes critical.
Service Disabling
Every background service is a potential source of jitter. Here’s what we disable and why:
| Service | Purpose | Why Disable |
|---|---|---|
| abrt-ccpp | Core dump handler for bug reporting | Core dumps cause disk I/O and CPU spikes |
| abrt-oops | Kernel oops reporter | Unnecessary disk activity |
| abrt-vmcore | Kernel crash dump reporter | Same as above |
| abrtd | ABRT daemon | Orchestrates the above; disable the whole subsystem |
| firewalld | Firewall rules | Trading machines sit behind dedicated firewalls; local iptables is sufficient |
| ipmi / ipmievd | IPMI hardware monitoring | Can trigger SMIs (System Management Interrupts), which are non-maskable and add unpredictable latency |
| kdump | Crash dump tool | Reserves significant memory; crash handling is for debugging, not production |
| postfix | Email MTA | Completely unnecessary; background disk I/O |
| rhel-domainname | NIS domain name | Legacy, unnecessary |
| cpuspeed | CPU frequency scaling | We want maximum frequency at all times |
systemctl disable abrt-ccpp.service abrt-oops.service abrt-vmcore.service \
abrtd.service firewalld.service ipmi.service ipmievd.service \
kdump.service postfix.service rhel-domainname.service
Additional Hardening
# Disable SELinux
vi /etc/selinux/config
SELINUX=disabled
# Disable SSH DNS lookups (prevents login delays)
vi /etc/ssh/sshd_config
UseDNS no
# Disable NMI watchdog
nmi_watchdog=0 # in GRUB
# Disable audit
audit=0 # in GRUB
# Disable soft lockup detector
nosoftlockup # in GRUB
NMI watchdog (nmi_watchdog=0): The watchdog uses performance counters to detect hung CPUs. It generates NMIs at regular intervals, which can preempt your trading thread. Disable it.
Audit (audit=0): The Linux audit framework logs security events. Every syscall that hits an audit rule adds overhead. On a trading machine with no compliance audit requirement, disable it entirely.
nosoftlockup: The soft lockup detector flags CPUs that haven’t scheduled for 20+ seconds. On isolated cores running busy-wait loops it can raise false reports, and its per-CPU timer is itself a source of interruptions. Disable it.
nohalt: Older tuning guides also list nohalt. The kernel documents it as an IA-64-only parameter, so it has no effect on x86_64. On x86, idle=poll is what keeps a core from executing HLT in the idle loop.
tuned-adm: Quick Profiling
tuned-adm profile network-latency
The network-latency profile applies a batch of sensible defaults:
- Sets CPU governor to performance
- Disables transparent hugepages
- Increases net.core.somaxconn and net.ipv4.tcp_max_syn_backlog
- Disables NUMA balancing
- Sets kernel.sched_min_granularity_ns for lower scheduling latency
What it doesn’t do: It does NOT set isolcpus, nohz_full, or rcu_nocbs. These require a reboot with new GRUB parameters, so you still need the manual kernel tuning above.
You can verify the active profile:
tuned-adm active
Solarflare Onload (Kernel Bypass)
For the lowest possible network latency, Solarflare NICs with OpenOnload provide kernel bypass — packets are delivered directly from the NIC to user-space via the EFVI interface, skipping the entire Linux network stack.
Installation (Online)
# Install dependencies
yum install -y gcc gcc-c++ python-devel libpcap-devel automake \
libtool rpm-build kernel-devel
# Method 1: RPM build
rpmbuild -ta openonload-*.tgz
rpm -ivh *.rpm
# Method 2: Direct install
tar xzf openonload-<version>.tgz
cd openonload-<version>/scripts
./onload_install
Important: kernel-devel version must match your running kernel exactly. Check with uname -r and rpm -qa | grep kernel-devel.
Installation (Offline)
On air-gapped trading machines, pre-download the tarball and use Method 2. You’ll also need to set up a local yum repository from a CentOS ISO for the build dependencies.
Post-Install Configuration
# Load the Onload module
onload_tool reload
# Verify
lsmod | grep sfc
lsmod | grep onload
# Set firmware to ultra-low-latency mode
sfboot firmware-variant=ultra-low-latency
# Flash firmware
sfupdate --write
# Reboot for firmware changes
reboot
sfboot firmware-variant=ultra-low-latency configures the NIC firmware to prioritize latency over throughput. This disables some offload features (like TX checksum offload and LRO) in favor of minimal packet processing delay.
Disable C-States via Onload
onload_tool disable_cstates
This is a runtime alternative to the GRUB C-state parameters. Useful when you can’t reboot but need to disable C-states immediately.
Intel X710 (i40e) Driver
For non-Solarflare NICs, you may need to compile the driver from source:
tar -xf x710-i40e-2.12.6.tar.gz
cd i40e-2.12.6/src
make
make install
This ensures you have the latest driver with latency fixes, rather than the in-tree driver that ships with the kernel.
Scheduling Policy
On an isolated core running a single thread with busy-wait polling, prefer SCHED_OTHER over SCHED_FIFO or SCHED_RR. Real-time priority on a thread that never yields can starve kernel tasks (such as vmstat), potentially locking the system.
Linux throttles real-time tasks to 95% of CPU time by default (/proc/sys/kernel/sched_rt_runtime_us). If real-time scheduling is required, adjust this limit accordingly — but for most trading workloads on isolated cores, SCHED_OTHER with busy-wait polling is sufficient and safer.
The Complete GRUB Configuration
Combining everything above, here’s the production GRUB configuration for an 18-core machine with cores 1-17 isolated:
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet \
isolcpus=1-17 \
nohz_full=1-17 \
nohz=on \
rcu_nocbs=1-17 \
idle=poll \
intel_idle.max_cstate=0 \
processor.max_cstate=0 \
mce=ignore_ce \
nmi_watchdog=0 \
transparent_hugepage=never \
pcie_aspm=off \
hpet=disable \
nosoftlockup \
audit=0 \
irqaffinity=0 \
nosmt \
skew_tick=1 \
default_hugepagesz=2M \
hugepagesz=1G hugepages=8 \
hugepagesz=2M hugepages=2048"
After editing /etc/default/grub:
grub2-mkconfig -o /boot/grub2/grub.cfg
# For UEFI systems:
# grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
reboot
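Note: skew_tick=1 offsets the timer tick on each CPU so that the CPUs do not all wake simultaneously, which reduces lock contention on kernel data structures. mce=ignore_ce disables the polling timer and CMCI interrupts that the kernel uses to collect corrected machine-check errors.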
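After the reboot, it is worth confirming that every parameter actually reached the kernel, since a mistyped entry in /etc/default/grub is easy to overlook. The following is a minimal sketch that compares a command-line string against an expected list; missing_params is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# missing_params CMDLINE PARAM...
# Print each PARAM that does not appear as a whole word in CMDLINE.
# Returns non-zero if anything is missing. Hypothetical helper.
missing_params() {
    local cmdline=$1 p rc=0
    shift
    for p in "$@"; do
        case " $cmdline " in
            *" $p "*) ;;              # present as a whole word
            *) echo "$p"; rc=1 ;;     # absent: report it
        esac
    done
    return "$rc"
}
```

On the machine itself, `missing_params "$(cat /proc/cmdline)" isolcpus=1-17 nohz_full=1-17 rcu_nocbs=1-17` prints nothing and returns 0 when all three parameters are active.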
Verification
After applying all tuning, verify each setting:
CPU and Power
# Check CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Expected: performance
# Check Turbo Boost
cat /sys/devices/system/cpu/intel_pstate/no_turbo
# Expected: 0
# Check HT/SMT status
cat /sys/devices/system/cpu/smt/active
# Expected: 0
# Check C-states (should show only C0)
cat /sys/module/intel_idle/parameters/max_cstate
# Expected: 0
Isolation
# Check isolated CPUs
cat /sys/devices/system/cpu/isolated
# Expected: 1-17
# Check nohz_full
cat /sys/devices/system/cpu/nohz_full
# Expected: 1-17
# Check RCU nocbs (sysfs has no file for this; look in the boot log)
dmesg | grep -i 'offload rcu callbacks'
# Expected: Offload RCU callbacks from CPUs: 1-17.
Memory
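# Check hugepages (HugePages_Total covers only the default 2M size)
grep Huge /proc/meminfo
# Expected: HugePages_Total matches your configuration
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# Expected: 8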
# Check THP
cat /sys/kernel/mm/transparent_hugepage/enabled
# Expected: always madvise [never]
# Check NUMA balancing
cat /proc/sys/kernel/numa_balancing
# Expected: 0
Jitter Measurement
For the ultimate verification, measure actual system jitter:
# Measure context switches per CPU (10 seconds)
perf stat -e 'sched:sched_switch' -a -A --timeout 10000
# Measure timer interrupts per CPU (30 seconds)
perf stat -e 'irq_vectors:local_timer_entry' -a -A --timeout 30000
# Monitor TLB misses
perf stat -e 'dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses' -a --timeout 10000
# Monitor TLB shootdowns
perf stat -e 'tlb:tlb_flush' -a -A --timeout 10000
For continuous jitter monitoring, we use the sysjitter tool:
cd sysjitter-ace/
./longtest.sh
This runs a busy-wait loop on each core and measures the maximum deviation from expected timing — any spike indicates kernel interference. On a properly tuned machine, maximum jitter should be under 5μs.
Quick Checklist
Here’s a summary of every step in order:
- BIOS/UEFI: Maximum performance, Turbo Boost on, HT off
- GRUB: Apply all kernel parameters (see Complete Configuration above)
- tuned-adm: profile network-latency
- Services: Disable abrt, firewalld, ipmi, kdump, postfix, irqbalance
- SELinux: Disabled
- SSH: UseDNS no
- Swap: swapoff -a
- IRQ: Pin all to CPU 0, disable irqbalance
- Network buffers: Set rmem_max/wmem_max per workload
- Ethtool: rx-usecs 0 adaptive-rx off on kernel-path NICs
- Solarflare: Install Onload, sfboot firmware-variant=ultra-low-latency
- Hugepages: Allocate via GRUB or sysctl
- Verify: Run all verification commands and sysjitter