yuqi-zheng

Low-Latency Trading Machine: A Practical Linux Kernel Tuning Guide


Deploying a low-latency trading machine is not just about writing fast code. The operating system introduces jitter through power management, interrupt handling, scheduler load balancing, and memory management. This article distills years of production deployment experience into a systematic tuning guide — every parameter explained, every service justified, every setting verified.


The Big Picture

The goal of low-latency tuning is twofold:

  1. Maximize per-core performance — keep CPU frequency locked at maximum, eliminate power-saving transitions
  2. Minimize jitter — prevent unexpected latency spikes from interrupts, timers, page faults, and background kernel activity

These two goals shape every tuning decision we make. Let’s walk through them systematically.

You can measure jitter reduction with the hiccups tool. After isolating core 3, the output shows a dramatic difference:

$ hiccups | column -t -R 1,2,3,4,5,6
cpu  threshold_ns  hiccups  pct99_ns  pct999_ns  max_ns
0    168           17110    83697     6590444    17010845
1    168           9929     169555    5787333    9517076
2    168           20728    73359     6008866    16008460
3    168           28336    1354      4870       17869

Core 3’s max jitter drops to about 18 microseconds, while the other cores still see worst-case spikes of roughly 9.5 to 17 milliseconds.
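To compare cores at a glance, the nanosecond columns can be turned into microseconds and checked against a latency budget. The sketch below assumes the six-column layout shown above; the function name flag_jittery_cores and the 50 µs budget are illustrative choices, not part of hiccups.

```shell
# flag_jittery_cores BUDGET_US: read hiccups output on stdin and print, for each
# CPU, its max jitter in microseconds and whether it fits the budget.
# Assumes max_ns is the sixth column, as in the sample output.
flag_jittery_cores() {
    awk -v budget="$1" '
        NR == 1 { next }                      # skip the header row
        NF >= 6 {
            max_us = $6 / 1000                # ns -> us
            verdict = (max_us <= budget) ? "ok" : "over"
            printf "cpu%s max=%.1fus %s\n", $1, max_us, verdict
        }'
}

# Demo with two of the sample rows from this article.
flag_jittery_cores 50 <<'EOF'
cpu  threshold_ns  hiccups  pct99_ns  pct999_ns  max_ns
0    168           17110    83697     6590444    17010845
3    168           28336    1354      4870       17869
EOF
```

In practice you would pipe the tool straight in, for example hiccups | flag_jittery_cores 50.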


CPU Power Management

Disable C-States

Modern CPUs implement multiple C-states (idle power states). Deeper C-states save power but introduce latency when the core wakes up — the transition from C6 back to C0 can take tens of microseconds. In a trading system where every microsecond counts, this is unacceptable.

GRUB parameters:

intel_idle.max_cstate=0  processor.max_cstate=0  idle=poll
  • intel_idle.max_cstate=0 — disables the Intel idle driver, preventing the CPU from entering any C-state beyond C0
  • processor.max_cstate=0 — disables the generic ACPI idle driver as a fallback
  • idle=poll — replaces the idle loop with a busy-wait polling loop; the CPU never truly sleeps
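After rebooting with these parameters, the cpuidle entries in sysfs give a quick view of which idle states the kernel still exposes and whether they are being entered. This is a minimal sketch; list_cstates is a hypothetical helper name, and the second argument exists only so the function can be pointed at a different sysfs root.

```shell
# list_cstates [CPU] [SYSFS_ROOT]: print every cpuidle state exposed for CPU,
# with its name and usage counter (number of times the state was entered).
list_cstates() {
    local cpu root dir state
    cpu="${1:-0}"
    root="${2:-/sys/devices/system/cpu}"
    dir="$root/cpu$cpu/cpuidle"
    if [ ! -d "$dir" ]; then
        echo "no cpuidle directory for cpu$cpu (no idle driver active)"
        return 0
    fi
    for state in "$dir"/state*; do
        [ -r "$state/name" ] || continue
        printf '%s name=%s usage=%s\n' "${state##*/}" \
            "$(cat "$state/name")" "$(cat "$state/usage" 2>/dev/null)"
    done
}
```

Run list_cstates 3 twice a few seconds apart. With idle=poll in effect, the usage counters of any deeper states that are still listed should not increase between the two runs.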

Trade-off: Power consumption increases significantly. A core that would draw ~5W at C6 now draws ~80W+ at C0. This is acceptable for trading machines where latency is revenue.

From the field: On Solarflare machines, you can also use onload_tool disable_cstates to enforce this at runtime without rebooting.

Disable PCIe ASPM

PCIe Active State Power Management (ASPM) puts PCIe links into lower power states (L0s, L1) when idle. Waking a link from L1 adds microseconds of latency — exactly the kind of jitter that kills trading performance.

pcie_aspm=off

Note: some older docs use pcie_aspm.policy=performance, which keeps links in L0 but allows some power negotiation. pcie_aspm=off is more aggressive and preferred for trading.

Disable HPET

The High Precision Event Timer (HPET) is a legacy timer with higher access latency than the TSC (Time Stamp Counter). On modern Intel CPUs, the TSC runs at base frequency and is invariant, making it both faster to read and more precise.

hpet=disabled

The kernel falls back to TSC-based timers, which are significantly faster to access (20-30ns vs 1000+ns for HPET reads).

Enable Turbo Boost (BIOS)

At the BIOS/UEFI level:

  • Set energy profile to Maximum Performance
  • Ensure Turbo Boost is enabled — it allows cores to clock above base frequency
  • Disable Hyper-Threading (SMT) — more on this below

Verify after boot:

cat /sys/devices/system/cpu/intel_pstate/no_turbo  # 0 = turbo enabled
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor  # should show "performance"

Do not run turbostat in production — it introduces scheduling jitter. Use it only during setup verification and disable it afterward.


CPU Isolation

This is the most impactful category of tuning for trading systems. The idea is simple: reserve specific cores exclusively for your trading application, and prevent the kernel from scheduling anything else on them.

CPU Frequency Governor

Force all cores to maximum frequency directly via sysfs:

find /sys/devices/system/cpu -name scaling_governor \
  -exec sh -c 'echo performance > {}' ';'

Or via tuned (covered in detail later):

tuned-adm profile latency-performance

Isolate Cores from the Scheduler

isolcpus=1-17

isolcpus removes the specified CPUs from the kernel’s load balancer. No user-space process will be scheduled on these cores unless explicitly pinned via taskset or sched_setaffinity(). This is the foundation of CPU isolation.

Suppress Timer Interrupts

nohz_full=1-17  nohz=on

nohz_full suppresses periodic scheduler tick interrupts on the specified cores. Without this, each isolated core receives a timer interrupt every 1-4ms (depending on CONFIG_HZ), causing unnecessary wakeups and cache pollution.

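Prerequisite: nohz_full needs a kernel built with CONFIG_NO_HZ_FULL=y and at least one housekeeping CPU left out of the list (CPU 0 here). It does not strictly depend on isolcpus, but the two belong together: the tick can only stop on a core where the scheduler is not placing other tasks.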

nohz=on enables the tickless kernel mode globally.

Note: The tick is only suppressed while exactly one runnable thread is on the core. Check /proc/sched_debug (on newer kernels, /sys/kernel/debug/sched/debug) to verify tick suppression on isolated cores.

Offload RCU Callbacks

rcu_nocbs=1-17

RCU (Read-Copy-Update) is a synchronization mechanism used extensively in the kernel. By default, once a grace period ends, the pending RCU callbacks run on the CPU that queued them. On an isolated core, these callbacks introduce jitter.

rcu_nocbs moves callback processing for the listed CPUs into dedicated kernel threads (rcuo*), which can then run on a housekeeping CPU (typically CPU 0), keeping isolated cores clean.

Disable Hyper-Threading

noht

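Hyper-Threading shares L1/L2 cache and execution units between two logical cores. For a trading thread that needs deterministic cache access, this is a liability: the sibling thread can evict your cache lines at any time. Note that noht may not be recognized by your kernel; kernels that expose /sys/devices/system/cpu/smt/control use nosmt as the boot parameter, so check which one your kernel documents.

Disabling HT lets each remaining thread use the full L1/L2 cache of its physical core instead of sharing it, and eliminates contention for execution units. The cost is half the logical cores, but for latency-critical threads this is almost always the right trade-off.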

You can also disable HT at runtime:

echo off > /sys/devices/system/cpu/smt/control
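To confirm that no online CPU still has an active sibling, the topology files in sysfs can be scanned. The helper below is a sketch; smt_sibling_groups is a made-up name, and the optional argument only lets you point it at another sysfs root.

```shell
# smt_sibling_groups [SYSFS_ROOT]: print each distinct thread_siblings_list that
# names more than one CPU (for example "0,18" or "0-1"). No output means every
# online CPU has its physical core to itself.
smt_sibling_groups() {
    local root f
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/topology/thread_siblings_list; do
        [ -r "$f" ] || continue
        cat "$f"
    done | sort -u | grep '[,-]' || true
}
```

An empty result from smt_sibling_groups is the expected state on a machine with HT disabled.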

Pin Kernel Threads and Workqueues

Even with isolcpus, some kernel threads and workqueues may still land on isolated cores. Move them away:

# Move kernel threads to CPU 0
pgrep -P 2 | xargs -I{} taskset -pc 0 {}

# Move workqueues to CPU 0
find /sys/devices/virtual/workqueue -name cpumask -exec sh -c 'echo 1 > {}' ';'

Or use tuna for a cleaner interface:

tuna --cpus=1-17 --isolate

Memory Management

Transparent Huge Pages: Disable

transparent_hugepage=never

THP causes multi-microsecond latency spikes through page compaction, TLB shootdowns, and the khugepaged kernel thread. For a detailed explanation of why THP, KSM, NUMA balancing, and TLB shootdowns are harmful to low-latency systems, see Virtual Memory and Latency.

You can also disable THP at runtime without rebooting:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

Explicit Hugepages

In the 2024 deployment, we configure:

default_hugepagesz=2M  hugepagesz=1G  hugepages=8  hugepagesz=2M  hugepages=2048

This allocates:

  • 8 × 1GB huge pages = 8GB for large structures (LOB snapshots, feature accumulators)
  • 2048 × 2MB huge pages = 4GB for ring buffers, smaller data structures

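At runtime, you can also resize the pool of default-size (2MB) pages. The value is an absolute page count, not an increment, so it replaces the boot-time setting: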

sysctl -w vm.nr_hugepages=20

Why explicit hugepages? A 4KB page requires a TLB entry per page. A 2MB page covers 512× the memory with one TLB entry. For a trading application that walks through a ring buffer or order book, this means dramatically fewer TLB misses — and TLB misses on modern x86 cost 100-200 cycles each (~40-80ns at 2.5GHz).
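The arithmetic behind these numbers can be worked through in a few lines of shell. The 1536-entry TLB size is an assumption chosen for illustration (the real figure depends on the CPU model); the other inputs come from the text above.

```shell
# tlb_reach_kb ENTRIES PAGE_KB: memory covered by ENTRIES TLB entries, in KB.
tlb_reach_kb() { echo $(( $1 * $2 )); }

entries=1536   # assumed second-level TLB size; varies by CPU model
echo "4KB pages: $(( $(tlb_reach_kb "$entries" 4) / 1024 )) MB of TLB reach"      # 6 MB
echo "2MB pages: $(( $(tlb_reach_kb "$entries" 2048) / 1024 )) MB of TLB reach"   # 3072 MB

# Miss cost in ns = cycles / GHz; 2.5 GHz is written as 25/10 to stay in integers.
echo "miss cost: $(( 100 * 10 / 25 ))-$(( 200 * 10 / 25 )) ns"                    # 40-80 ns

# Pool sizes from the boot parameters above.
echo "1GB pool: $(( 8 * 1024 )) MB, 2MB pool: $(( 2048 * 2 )) MB"                 # 8192 MB, 4096 MB
```

Under that assumption, 4KB pages cover only 6 MB before misses start, while 2MB pages cover 3 GB, which is why a ring buffer or order book of a few hundred megabytes benefits so much.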

Disable NUMA Balancing

echo 0 > /proc/sys/kernel/numa_balancing

Automatic NUMA balancing migrates pages to the NUMA node of the accessing CPU, causing TLB shootdowns and page faults. Pin memory explicitly with numa_alloc_onnode() or mbind() instead.

Disable Kernel Samepage Merging (KSM)

echo 0 > /sys/kernel/mm/ksm/run

KSM introduces page table locking and TLB shootdowns — it has no place on a trading machine.

Lock Application Memory

From the application side, call mlockall() to prevent page faults:

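#include <stdio.h>
#include <sys/mman.h>
/* Call once at startup, inside main(); needs CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK. */
if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) perror("mlockall");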

This locks all current and future pages in RAM, preventing the kernel from swapping them out. Combined with hugepages, this eliminates two major sources of jitter: page faults and swap I/O.
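Whether the lock actually took effect can be checked from outside the process: the VmLck field in /proc/<pid>/status reports how much memory is locked. A small sketch follows; locked_kb is a hypothetical helper name.

```shell
# locked_kb STATUS_FILE: print the VmLck value (locked memory in KB) from a
# /proc/<pid>/status file; fails if the field is missing.
locked_kb() {
    awk '/^VmLck:/ { print $2; found = 1 } END { if (!found) exit 1 }' "$1"
}

# Demo on the current process, which normally has nothing locked.
if [ -r /proc/self/status ]; then
    locked_kb /proc/self/status
fi
```

For the application itself, something like locked_kb /proc/$(pgrep -o trading_app)/status should report a non-zero value once mlockall() has succeeded; a value of 0 suggests the call failed or was never reached.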

Also disable swap entirely:

swapoff -a

And reduce VM statistics update frequency:

sysctl vm.stat_interval=120

Avoid TLB Shootdowns

Map all memory at startup and never free it. Disable THP, KSM, and automatic NUMA balancing. Avoid writable file-backed mappings in hot paths. See Virtual Memory and Latency for the full list of shootdown triggers and elimination strategies.

Monitor:

egrep 'TLB|CPU' /proc/interrupts
perf stat -e 'tlb:tlb_flush' -a -A --timeout 10000

Disable CPU Vulnerability Mitigations (Optional)

mitigations=off

Spectre, Meltdown, and related vulnerabilities are mitigated through techniques like retpoline, KPTI (kernel page table isolation), and IBRS. These mitigations add overhead to every syscall and indirect branch — typically 5-15% performance loss.

For trading machines on isolated networks behind firewalls, this is an acceptable risk. Evaluate your threat model carefully.

Intel Cache Allocation Technology (CAT)

If the CPU supports Intel CAT, partition the Last-Level Cache (LLC) to give the latency-critical application the majority of cache ways. This prevents other processes from polluting the cache your trading thread depends on.

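Use pqos (from the intel-cmt-cat package) to configure. Capacity masks may overlap, and in the example below class 2’s mask (0xf) is a subset of class 1’s (0xff), so the two classes share those four ways; choose disjoint masks, sized to your CPU’s way count, if you want strict partitioning: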

pqos -e "llc:1=0xff;llc:2=0xf"   # Class 1 gets 8 ways, Class 2 gets 4 ways
pqos -a "pid:1=$(pgrep trading_app);pid:2=$(pgrep monitoring)"
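Before applying capacity masks, it helps to check how many ways each mask grants and whether two classes share any of them. The helpers below are an illustrative sketch (the names cat_mask_ways and cat_masks_overlap are made up here); they only do bit arithmetic and do not talk to pqos.

```shell
# cat_mask_ways MASK: count the cache ways (set bits) in a capacity bitmask.
cat_mask_ways() {
    local m n
    m=$(( $1 ))
    n=0
    while [ "$m" -ne 0 ]; do
        n=$(( n + (m & 1) ))
        m=$(( m >> 1 ))
    done
    echo "$n"
}

# cat_masks_overlap A B: succeed if the two masks share at least one way.
cat_masks_overlap() { [ $(( $1 & $2 )) -ne 0 ]; }

cat_mask_ways 0xff                                  # prints 8
cat_mask_ways 0xf                                   # prints 4
cat_masks_overlap 0xff 0xf  && echo "overlap"       # prints overlap
cat_masks_overlap 0xff0 0xf || echo "disjoint"      # prints disjoint
```

Two classes whose masks overlap compete for the shared ways, so a disjoint pair is the safer choice when the goal is to keep a monitoring process out of the trading thread's cache.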

Interrupt Handling

Disable IRQ Balancer and Pin Interrupts

irqbalance distributes hardware interrupts across all CPUs for “fairness.” On a trading machine, this means your isolated cores get interrupted by NIC, disk, and timer IRQs.

Option 1 — Disable irqbalance entirely and pin manually (our production approach):

systemctl stop irqbalance
systemctl disable irqbalance

Then pin all interrupts to the housekeeping CPU:

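# Iterate over the IRQs that actually exist instead of guessing a range
for f in /proc/irq/*/smp_affinity; do
    echo 1 > "$f" 2>/dev/null   # some IRQs reject affinity changes; skip them
done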

This writes 1 (CPU 0 only) to every IRQ’s affinity mask, keeping all hardware interrupts off isolated cores.
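Some IRQs refuse affinity changes, so it is worth auditing the result afterwards. The following sketch lists every IRQ whose mask still includes an isolated core; irqs_touching is a hypothetical helper, and it assumes at most 64 CPUs so the mask fits in shell integer arithmetic.

```shell
# irqs_touching MASK_HEX [PROC_IRQ_DIR]: list IRQs whose smp_affinity overlaps
# MASK_HEX, printed as "<irq> <mask>". Assumes masks of at most 64 bits.
irqs_touching() {
    local want dir f raw
    want=$(( 0x${1#0x} ))
    dir="${2:-/proc/irq}"
    for f in "$dir"/*/smp_affinity; do
        [ -r "$f" ] || continue
        raw=$(tr -d ',\n' < "$f")     # wide masks are stored comma-separated
        [ -n "$raw" ] || continue
        if [ $(( 0x$raw & want )) -ne 0 ]; then
            f=${f%/smp_affinity}
            echo "${f##*/} $raw"
        fi
    done
}
```

With cores 1-17 isolated, irqs_touching 3fffe should print nothing; any IRQ it does list can be looked up in /proc/interrupts to find the device behind it.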

Option 2 — Use irqbalance in one-shot mode (it respects isolcpus):

irqbalance --foreground --oneshot

Or ban specific cores manually (core 3 = bitmask 0x8):

IRQBALANCE_BANNED_CPUS=8 irqbalance --foreground --oneshot
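Both smp_affinity and IRQBALANCE_BANNED_CPUS take hexadecimal CPU bitmasks, which are easy to get wrong by hand. A small converter from the CPU-list notation used by isolcpus might look like this; cpulist_to_mask is an illustrative name, and the sketch only handles CPU numbers below 63.

```shell
# cpulist_to_mask LIST: convert a CPU list such as "0,2-3" into a hex bitmask.
# The body runs in a subshell so the IFS change does not leak to the caller.
cpulist_to_mask() (
    IFS=','
    mask=0
    for part in $1; do
        lo=${part%-*}                 # "1-17" -> 1, "3" -> 3
        hi=${part#*-}                 # "1-17" -> 17, "3" -> 3
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
)

cpulist_to_mask 3       # prints 8
cpulist_to_mask 1-17    # prints 3fffe
cpulist_to_mask 0       # prints 1
```

The first result matches the core 3 example above, and the second is the mask covering all of the isolated cores 1-17.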

For finer control, pin specific NIC interrupts to dedicated cores:

# Pin NIC queue 0 to CPU 0
echo 1 > /proc/irq/<irq_number>/smp_affinity

Check /proc/interrupts to find which IRQ numbers correspond to which devices.

The irqaffinity GRUB Parameter

irqaffinity=0

This sets the default CPU affinity for all interrupts to CPU 0 at boot time, before any userspace configuration. It’s a safety net that ensures no interrupt lands on isolated cores even during early boot.


Network Stack Tuning

Socket Buffer Sizes

For stock (equity) machines receiving market data:

sysctl -w net.core.rmem_max=1073741824       # 1GB max receive buffer
sysctl -w net.core.rmem_default=4194304       # 4MB default

For futures machines with higher message rates:

sysctl -w net.core.rmem_default=16777216      # 16MB
sysctl -w net.core.rmem_max=16777216           # 16MB
sysctl -w net.core.wmem_default=16777216       # 16MB
sysctl -w net.core.wmem_max=16777216           # 16MB

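Why so large? If the application experiences a brief pause (e.g., a GC-like event or a burst of processing), the kernel buffer needs to absorb incoming market data without dropping packets. At a typical CTP feed rate (~200MB/day sustained, ~50MB/hour peak), a 16MB buffer holds about 19 minutes of data at the peak hourly average (16MB ÷ 50MB/hour ≈ 0.32 hours). Bursts within that hour arrive much faster than the average, so real headroom is shorter, but still ample for a brief stall.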

Make these persistent in /etc/rc.local:

# /etc/rc.local
sysctl -w net.core.rmem_max=1073741824
sysctl -w net.core.rmem_default=4194304

Then make the script executable (run once), otherwise it will not be executed at boot:

chmod +x /etc/rc.d/rc.local

Ethtool Interrupt Coalescing

ethtool -C eth<N> rx-usecs 0 adaptive-rx off

Interrupt coalescing delays RX interrupts to batch notifications. rx-usecs sets the delay in microseconds — setting it to 0 delivers every packet immediately.

adaptive-rx off prevents the driver from dynamically adjusting the coalescing parameters, which would otherwise introduce unpredictable latency.

Verify:

ethtool -c eth<N>
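The verification step can be scripted so it fails loudly when a NIC drifts from the intended settings. This is a sketch that assumes the "Adaptive RX: ... TX: ..." and "rx-usecs: N" line format printed by common ethtool versions; coalescing_ok is a hypothetical helper name, and the sample text in the demo is illustrative rather than captured from a real NIC.

```shell
# coalescing_ok: read `ethtool -c` output on stdin; succeed only if adaptive RX
# is off and rx-usecs is 0.
coalescing_ok() {
    awk '
        /^Adaptive RX:/ { adaptive = $3 }
        /^rx-usecs:/    { usecs = $2 }
        END { exit !(adaptive == "off" && usecs == "0") }'
}

# Demo on illustrative sample output.
if coalescing_ok <<'EOF'
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
rx-usecs: 0
rx-frames: 0
EOF
then
    echo "coalescing disabled"
fi
```

On a live system the check becomes ethtool -c eth<N> | coalescing_ok || echo "eth<N> still coalescing".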

Note: On Solarflare NICs with Onload (kernel bypass), this is less relevant since packets never traverse the kernel stack. But for kernel-path NICs (like Intel X710), this setting is critical.

Kernel Bypass Alternatives

For the lowest possible network latency, consider bypassing the Linux kernel network stack entirely:

  • DPDK — open-source, vendor-neutral kernel bypass framework
  • OpenOnload / EFVI (Solarflare) — transparent acceleration for socket-based applications
  • Mellanox VMA (Messaging Accelerator) — socket-API kernel bypass library for Mellanox NICs
  • Exablaze / SolarCapture — ultra-low-latency capture cards

If you must use the kernel stack, the ethtool and buffer tuning above becomes critical.


Service Disabling

Every background service is a potential source of jitter. Here’s what we disable and why:

Service          Purpose                              Why Disable
abrt-ccpp        Core dump handler for bug reporting  Core dumps cause disk I/O and CPU spikes
abrt-oops        Kernel oops reporter                 Unnecessary disk activity
abrt-vmcore      Kernel crash dump reporter           Same as above
abrtd            ABRT daemon                          Orchestrates the above; disable the whole subsystem
firewalld        Firewall rules                       Trading machines sit behind dedicated firewalls; local iptables is sufficient
ipmi / ipmievd   IPMI hardware monitoring             Can trigger SMIs (System Management Interrupts): non-maskable, unpredictable latency
kdump            Crash dump tool                      Reserves significant memory; crash handling is for debugging, not production
postfix          Email MTA                            Completely unnecessary; background disk I/O
rhel-domainname  NIS domain name                      Legacy, unnecessary
cpuspeed         CPU frequency scaling                We want maximum frequency at all times

systemctl disable abrt-ccpp.service abrt-oops.service abrt-vmcore.service \
  abrtd.service firewalld.service ipmi.service ipmievd.service \
  kdump.service postfix.service rhel-domainname.service

Additional Hardening

# Disable SELinux: set this in /etc/selinux/config (takes effect after reboot)
SELINUX=disabled

# Disable SSH DNS lookups (prevents login delays): set this in /etc/ssh/sshd_config
UseDNS no

# Disable NMI watchdog
nmi_watchdog=0   # in GRUB

# Disable audit
audit=0          # in GRUB

# Disable soft lockup detector
nosoftlockup     # in GRUB

# Keep the idle loop from halting the CPU
nohalt           # in GRUB

NMI watchdog (nmi_watchdog=0): The watchdog uses performance counters to detect hung CPUs. It generates NMIs at regular intervals, which can preempt your trading thread. Disable it.

Audit (audit=0): The Linux audit framework logs security events. Every syscall that hits an audit rule adds overhead. On a trading machine with no compliance audit requirement, disable it entirely.

nosoftlockup: The soft lockup detector flags CPUs whose watchdog thread has not run for 20+ seconds. A busy-wait thread that never yields an isolated core can trigger false alarms, and the detector’s own periodic checks add wakeups. Disable it.

nohalt: Tells the kernel not to halt the CPU in the idle loop. The kernel documentation lists this parameter for IA-64 only, so on x86 it is most likely ignored and idle=poll is what actually keeps the CPU out of low-power states.


tuned-adm: Quick Profiling

tuned-adm profile network-latency

The network-latency profile applies a batch of sensible defaults:

  • Sets CPU governor to performance
  • Disables transparent hugepages
  • Enables socket busy polling (net.core.busy_read, net.core.busy_poll) and TCP Fast Open (net.ipv4.tcp_fastopen)
  • Disables NUMA balancing
  • Sets kernel.sched_min_granularity_ns for lower scheduling latency

What it doesn’t do: It does NOT set isolcpus, nohz_full, or rcu_nocbs. These require a reboot with new GRUB parameters, so you still need the manual kernel tuning above.

You can verify the active profile:

tuned-adm active

Solarflare Onload (Kernel Bypass)

For the lowest possible network latency, Solarflare NICs with OpenOnload provide kernel bypass — packets are delivered directly from the NIC to user-space via the EFVI interface, skipping the entire Linux network stack.

Installation (Online)

# Install dependencies
yum install -y gcc gcc-c++ python-devel libpcap-devel automake \
  libtool rpm-build kernel-devel

# Method 1: RPM build
rpmbuild -ta openonload-*.tgz
rpm -ivh ~/rpmbuild/RPMS/x86_64/openonload-*.rpm   # rpmbuild writes the packages here by default

# Method 2: Direct install
tar xzf openonload-<version>.tgz
cd openonload-<version>/scripts
./onload_install

Important: kernel-devel version must match your running kernel exactly. Check with uname -r and rpm -qa | grep kernel-devel.

Installation (Offline)

On air-gapped trading machines, pre-download the tarball and use Method 2. You’ll also need to set up a local yum repository from a CentOS ISO for the build dependencies.

Post-Install Configuration

# Load the Onload module
onload_tool reload

# Verify
lsmod | grep sfc
lsmod | grep onload

# Set firmware to ultra-low-latency mode
sfboot firmware-variant=ultra-low-latency

# Flash firmware
sfupdate --write

# Reboot for firmware changes
reboot

sfboot firmware-variant=ultra-low-latency configures the NIC firmware to prioritize latency over throughput. This disables some offload features (like TX checksum offload and LRO) in favor of minimal packet processing delay.

Disable C-States via Onload

onload_tool disable_cstates

This is a runtime alternative to the GRUB C-state parameters. Useful when you can’t reboot but need to disable C-states immediately.

Intel X710 (i40e) Driver

For non-Solarflare NICs, you may need to compile the driver from source:

tar -xf x710-i40e-2.12.6.tar.gz
cd i40e-2.12.6/src
make
make install

This ensures you have the latest driver with latency fixes, rather than the in-tree driver that ships with the kernel.


Scheduling Policy

On an isolated core running a single thread with busy-wait polling, prefer SCHED_OTHER over SCHED_FIFO or SCHED_RR. A real-time thread that never yields can starve per-CPU kernel work on that core (such as the vmstat update worker), potentially locking up the system.

Linux throttles real-time tasks to 95% of CPU time by default (/proc/sys/kernel/sched_rt_runtime_us). If real-time scheduling is required, adjust this limit accordingly — but for most trading workloads on isolated cores, SCHED_OTHER with busy-wait polling is sufficient and safer.
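The 95% figure follows from two sysctls: sched_rt_runtime_us (950000 by default) divided by sched_rt_period_us (1000000 by default). A short sketch of that calculation, with rt_budget_pct as a made-up helper name:

```shell
# rt_budget_pct RUNTIME_US PERIOD_US: share of each period that real-time tasks
# may consume, in percent. A runtime of -1 means throttling is disabled.
rt_budget_pct() {
    local runtime period
    runtime="$1"
    period="$2"
    if [ "$runtime" -lt 0 ]; then
        echo "unlimited"
    else
        echo $(( runtime * 100 / period ))
    fi
}

rt_budget_pct 950000 1000000   # prints 95
rt_budget_pct -1 1000000       # prints unlimited
```

On a live machine, feed it the current values: rt_budget_pct "$(cat /proc/sys/kernel/sched_rt_runtime_us)" "$(cat /proc/sys/kernel/sched_rt_period_us)".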


The Complete GRUB Configuration

Combining everything above, here’s the production GRUB configuration for an 18-core machine with cores 1-17 isolated:

GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet \
  isolcpus=1-17 \
  nohz_full=1-17 \
  nohz=on \
  rcu_nocbs=1-17 \
  idle=poll \
  intel_idle.max_cstate=0 \
  processor.max_cstate=0 \
  mce=ignore_ce \
  nmi_watchdog=0 \
  transparent_hugepage=never \
  pcie_aspm=off \
  hpet=disabled \
  nosoftlockup \
  audit=0 \
  irqaffinity=0 \
  noht \
  nohalt \
  skew_tick=1 \
  default_hugepagesz=2M \
  hugepagesz=1G hugepages=8 \
  hugepagesz=2M hugepages=2048"

After editing /etc/default/grub:

grub2-mkconfig -o /boot/grub2/grub.cfg
# For UEFI systems:
# grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
reboot

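Note: skew_tick=1 offsets the timer tick on each CPU to avoid all CPUs waking simultaneously, which reduces lock contention on kernel data structures. mce=ignore_ce disables the polling timer and interrupts for corrected machine-check errors, so correctable errors no longer interrupt the cores (and are no longer logged).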
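After the reboot, it is easy to miss a parameter that was mistyped or silently dropped when the GRUB config was regenerated. A minimal sketch for comparing the running command line against the intended list follows; missing_params is a hypothetical helper name.

```shell
# missing_params CMDLINE PARAM...: print each PARAM that does not appear as a
# whole space-separated word in CMDLINE.
missing_params() {
    local cmdline p
    cmdline=" $1 "
    shift
    for p in "$@"; do
        case "$cmdline" in
            *" $p "*) ;;              # present as a whole word
            *) echo "$p" ;;
        esac
    done
}

# Demo: nohz_full is absent from this illustrative command line.
missing_params "quiet isolcpus=1-17 nohz=on" isolcpus=1-17 nohz_full=1-17 nohz=on
```

On the tuned machine, run missing_params "$(cat /proc/cmdline)" isolcpus=1-17 nohz_full=1-17 rcu_nocbs=1-17 idle=poll and so on for the rest of the list; no output means every parameter reached the kernel.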


Verification

After applying all tuning, verify each setting:

CPU and Power

# Check CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Expected: performance

# Check Turbo Boost
cat /sys/devices/system/cpu/intel_pstate/no_turbo
# Expected: 0

# Check HT/SMT status
cat /sys/devices/system/cpu/smt/active
# Expected: 0

# Check C-states (should show only C0)
cat /sys/module/intel_idle/parameters/max_cstate
# Expected: 0

Isolation

# Check isolated CPUs
cat /sys/devices/system/cpu/isolated
# Expected: 1-17

# Check nohz_full
cat /sys/devices/system/cpu/nohz_full
# Expected: 1-17

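# Check RCU nocbs (there is no sysfs file for it; use the boot log and the rcuo kthreads)
dmesg | grep -i 'offload rcu callbacks'
# Expected: a line naming CPUs 1-17; pgrep -l rcuo should also list rcuo* threads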

Memory

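# Check hugepages (/proc/meminfo reports only the default 2MB pool)
grep Huge /proc/meminfo
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# Expected: HugePages_Total and the 1GB count match your configuration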

# Check THP
cat /sys/kernel/mm/transparent_hugepage/enabled
# Expected: always madvise [never]

# Check NUMA balancing
cat /proc/sys/kernel/numa_balancing
# Expected: 0

Jitter Measurement

For the ultimate verification, measure actual system jitter:

# Measure context switches per CPU (10 seconds)
perf stat -e 'sched:sched_switch' -a -A --timeout 10000

# Measure timer interrupts per CPU (30 seconds)
perf stat -e 'irq_vectors:local_timer_entry' -a -A --timeout 30000

# Monitor TLB misses
perf stat -e 'dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses' -a --timeout 10000

# Monitor TLB shootdowns
perf stat -e 'tlb:tlb_flush' -a -A --timeout 10000

For continuous jitter monitoring, we use the sysjitter tool:

cd sysjitter-ace/
./longtest.sh

This runs a busy-wait loop on each core and measures the maximum deviation from expected timing — any spike indicates kernel interference. On a properly tuned machine, maximum jitter should be under 5μs.


Quick Checklist

Here’s a summary of every step in order:

  1. BIOS/UEFI: Maximum performance, Turbo Boost on, HT off
  2. GRUB: Apply all kernel parameters (see Complete Configuration above)
  3. tuned-adm: profile network-latency
  4. Services: Disable abrt, firewalld, ipmi, kdump, postfix, irqbalance
  5. SELinux: Disabled
  6. SSH: UseDNS no
  7. Swap: swapoff -a
  8. IRQ: Pin all to CPU 0, disable irqbalance
  9. Network buffers: Set rmem_max / wmem_max per workload
  10. Ethtool: rx-usecs 0 adaptive-rx off on kernel-path NICs
  11. Solarflare: Install Onload, sfboot firmware-variant=ultra-low-latency
  12. Hugepages: Allocate via GRUB or sysctl
  13. Verify: Run all verification commands and sysjitter

References