DPDK Poll Mode Driver: A Deep Dive from Architecture to Implementation
The Data Plane Development Kit (DPDK) enables user-space applications to process packets at line rate — 10 Gbps, 40 Gbps, even 100 Gbps — by bypassing the Linux kernel entirely. At the heart of this architecture sits the Poll Mode Driver (PMD): a driver that replaces interrupt-driven I/O with a tight polling loop, achieving nanosecond-scale latency and millions of packets per second per core.
This article is based on a thorough reading of the DPDK 26.03 source tree, with specific file paths and line numbers cited throughout. We’ll trace the complete path from rte_eth_rx_burst() down to the hardware descriptor ring, using the VirtIO PMD as our reference implementation — the simplest PMD that still demonstrates every major optimization technique.
Why Polling?
Traditional network I/O works like this:
NIC receives packet → hardware interrupt → interrupt handler → softirq
→ kernel scheduler → application wakes up → process packet
Each step adds latency. Context switches cost microseconds. Interrupt coalescing adds more. And the kernel’s per-packet overhead (skb allocation, netfilter hooks, socket buffering) makes multi-million-PPS rates per core very hard to sustain.
PMD inverts the model:
Application polls in a tight loop → reads hardware descriptors directly
→ zero interrupts, zero kernel involvement → process packet
| Property | Interrupt Mode | Poll Mode (PMD) |
|---|---|---|
| Latency | Microseconds (interrupt + scheduling) | Sub-microsecond (descriptor ring polled directly from user space) |
| CPU Utilization | Event-driven, idle when no traffic | 100% dedicated core |
| Context Switches | Frequent (kernel ↔ user) | None (pure user-space) |
| Batch Processing | Unfriendly (per-packet interrupts) | Natural (burst of 32/64/128) |
| Data Copies | Multiple (kernel → user) | Zero-copy (DMA to user-space) |
The trade-off is clear: you sacrifice a CPU core to gain predictable, low-latency packet processing. In trading systems where every microsecond counts, that is usually a trade worth making.
Burst-Oriented API
PMD processes packets in bursts, not one at a time. The core API:
#define BURST_SIZE 32
// port, queue, pkts_to_send and nb_pkts are supplied by the application
struct rte_mbuf *pkts[BURST_SIZE];
uint16_t nb_rx, nb_tx, i;
while (1) {
    // Try to receive up to BURST_SIZE packets
    nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
    for (i = 0; i < nb_rx; i++) {
        process_packet(pkts[i]);
    }
    // Try to send up to nb_pkts packets; nb_tx reports how many were accepted
    nb_tx = rte_eth_tx_burst(port, queue, pkts_to_send, nb_pkts);
}
Batch processing amortizes function call overhead, improves cache locality, and enables SIMD vectorization. The typical burst size is 32 or 64 — large enough to amortize overhead, small enough to fit in L1 cache.
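The amortization argument can be made concrete with a toy cost model. This is a sketch in plain C, not a measurement: the function name and the "cycle" figures used with it are hypothetical and only illustrate how a fixed per-call cost shrinks per packet as the burst grows.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cost model: every burst call pays call_overhead once,
 * and every packet pays per_packet_cost. Units are arbitrary "cycles". */
static uint32_t cycles_per_packet(uint32_t call_overhead,
                                  uint32_t per_packet_cost,
                                  uint32_t burst_size)
{
    assert(burst_size > 0);
    /* The fixed overhead is shared by all packets in the burst. */
    return per_packet_cost + call_overhead / burst_size;
}
```

With an assumed 64-cycle call overhead and 10 cycles of per-packet work, a burst of 1 costs 74 cycles per packet, while a burst of 32 costs 12.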
Architecture: Three Layers
┌───────────────────────────────────────────────────────┐
│ Application Layer                                     │
│ rte_eth_rx_burst() / rte_eth_tx_burst()               │
└───────────────────────────┬───────────────────────────┘
                            │
┌───────────────────────────▼───────────────────────────┐
│ ethdev Library (lib/ethdev)                           │
│ Unified API abstraction + function pointer dispatch   │
└───────────────────────────┬───────────────────────────┘
                            │
┌───────────────────────────▼───────────────────────────┐
│ PMD Layer (drivers/net/*)                             │
│ Hardware-specific driver implementations              │
│ (virtio, ixgbe, i40e, ice, bnxt ...)                  │
└───────────────────────────┬───────────────────────────┘
                            │
┌───────────────────────────▼───────────────────────────┐
│ Hardware (NIC + PCIe)                                 │
└───────────────────────────────────────────────────────┘
The key insight: rte_eth_rx_burst() doesn’t contain any RX logic. It is an inline wrapper that looks up the port’s entry in a flat per-port table (rte_eth_fp_ops) and calls through the rx_pkt_burst function pointer stored there. The PMD publishes that pointer in rte_eth_dev.rx_pkt_burst, and ethdev copies it into the table when the device starts. The actual implementation is selected at device startup based on hardware capabilities: standard, inorder, packed ring, or vectorized (SIMD) path.
This is the fast path: no locks and no feature checks, just one indirect call (debug builds add argument validation). The slow path (configuration, link status, statistics) goes through eth_dev_ops, a separate function table that’s never touched on the hot path.
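The dispatch pattern can be sketched in plain C without DPDK. Everything prefixed toy_ is hypothetical: the table, the stand-in driver, and the use of void pointers in place of rte_mbuf are illustration only, not the ethdev implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PORTS 4

/* Same shape as eth_rx_burst_t, with void* packets instead of rte_mbuf. */
typedef uint16_t (*toy_rx_burst_t)(void *rxq, void **pkts, uint16_t nb_pkts);

struct toy_fp_ops {
    toy_rx_burst_t rx_pkt_burst; /* chosen once by the driver */
    void *rxq;                   /* queue 0 only, for brevity */
};

static struct toy_fp_ops toy_ports[TOY_MAX_PORTS];

/* Stand-in driver: hands out up to *remaining dummy packets. */
static uint16_t toy_driver_rx(void *rxq, void **pkts, uint16_t nb_pkts)
{
    uint16_t *remaining = rxq;
    uint16_t n = *remaining < nb_pkts ? *remaining : nb_pkts;

    for (uint16_t i = 0; i < n; i++)
        pkts[i] = remaining; /* any non-NULL pointer will do */
    *remaining = (uint16_t)(*remaining - n);
    return n;
}

/* Generic entry point: no RX logic, just one indirect call. */
static uint16_t toy_rx_burst(uint16_t port, void **pkts, uint16_t nb_pkts)
{
    assert(port < TOY_MAX_PORTS && toy_ports[port].rx_pkt_burst != NULL);
    return toy_ports[port].rx_pkt_burst(toy_ports[port].rxq, pkts, nb_pkts);
}
```

A caller registers toy_driver_rx in toy_ports[0] once and then only ever calls toy_rx_burst(); swapping in a different driver function changes behavior without touching the call site.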
Core Data Structures
eth_rx_burst_t / eth_tx_burst_t
File: lib/ethdev/rte_ethdev_core.h (lines 28–37)
typedef uint16_t (*eth_rx_burst_t)(void *rxq,
struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);
typedef uint16_t (*eth_tx_burst_t)(void *txq,
struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);
Simple and fast: a queue pointer, an array of mbuf pointers, and a count. Returns the actual number of packets processed.
rte_eth_dev — The Device Structure
File: lib/ethdev/ethdev_driver.h (lines 72–117)
struct __rte_cache_aligned rte_eth_dev {
// Fast-path function pointers — placed at the start for cache friendliness
eth_rx_burst_t rx_pkt_burst;
eth_tx_burst_t tx_pkt_burst;
eth_tx_prep_t tx_pkt_prepare;
eth_rx_queue_count_t rx_queue_count;
eth_rx_descriptor_status_t rx_descriptor_status;
// ... more fast-path pointers
// Device data (shared across processes)
struct rte_eth_dev_data *data;
// PMD private data
void *process_private;
const struct eth_dev_ops *dev_ops; // Slow-path function table
// Device handles
struct rte_device *device;
struct rte_intr_handle *intr_handle;
// ... callbacks, state
};
Design decisions worth noting:
- Fast-path pointers at offset 0: the first cache line contains rx_pkt_burst and tx_pkt_burst, which are accessed on every packet. No pointer chasing needed.
- __rte_cache_aligned: the entire structure starts on a cache line boundary, preventing false sharing with adjacent data.
- Separation of dev_ops: the slow-path table (configure, start, stop, stats) lives in a different cache line. Hot-path code never touches it.
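These layout claims can be checked at compile time. The following is a sketch on a toy structure, assuming 64-byte cache lines; struct toy_dev and its fields are hypothetical and much smaller than rte_eth_dev, and the explicit alignas on the cold field is a choice made here to force the separation.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_LINE 64 /* assumption: 64-byte cache lines */

typedef uint16_t (*toy_burst_t)(void *q, void **pkts, uint16_t n);

struct toy_dev {
    /* Hot: read on every burst. alignas starts the struct on a line boundary. */
    alignas(TOY_CACHE_LINE) toy_burst_t rx_pkt_burst;
    toy_burst_t tx_pkt_burst;
    /* Cold: configuration-time state, pushed onto the next cache line. */
    alignas(TOY_CACHE_LINE) const void *dev_ops;
    void *data;
};

/* Compile-time checks of the layout. */
static_assert(offsetof(struct toy_dev, rx_pkt_burst) == 0, "hot pointer first");
static_assert(offsetof(struct toy_dev, tx_pkt_burst) < TOY_CACHE_LINE, "same line");
static_assert(offsetof(struct toy_dev, dev_ops) >= TOY_CACHE_LINE, "cold apart");
static_assert(alignof(struct toy_dev) == TOY_CACHE_LINE, "line-aligned struct");

/* Which cache line of the struct (0, 1, ...) a byte offset falls in. */
static size_t toy_cache_line_of(size_t offset)
{
    return offset / TOY_CACHE_LINE;
}
```

If a later edit moves a cold field in front of the burst pointers, the static_assert lines fail the build instead of silently degrading the hot path.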
rte_eth_dev_data — Shared Device State
File: lib/ethdev/ethdev_driver.h (lines 128–214)
struct __rte_cache_aligned rte_eth_dev_data {
char name[RTE_ETH_NAME_MAX_LEN];
void **rx_queues; // Array of RX queue pointers
void **tx_queues; // Array of TX queue pointers
uint16_t nb_rx_queues;
uint16_t nb_tx_queues;
void *dev_private; // PMD-specific private data
struct rte_eth_link dev_link;
struct rte_eth_conf dev_conf;
uint16_t mtu;
uint16_t port_id;
int numa_node; // NUMA affinity
// Bitfield status flags (saves space)
uint8_t promiscuous : 1,
scattered_rx : 1,
all_multicast : 1,
dev_started : 1,
// ...
uint8_t rx_queue_state[RTE_MAX_QUEUES_PER_PORT];
uint8_t tx_queue_state[RTE_MAX_QUEUES_PER_PORT];
};
This structure is designed to be placed in shared memory for multi-process DPDK. The queue pointer arrays, configuration, and state flags all live here so that a secondary process can access them.
rte_mbuf — The Packet Buffer
File: lib/mbuf/rte_mbuf_core.h (simplified and abridged; field order follows recent releases, so check the header in your tree for the exact definition)
struct rte_mbuf {
    RTE_MARKER cacheline0;
    void *buf_addr;               // Virtual address of the segment buffer
    rte_iova_t buf_iova;          // IOVA of the buffer (for DMA)
    uint16_t data_off;            // Offset to data start
    RTE_ATOMIC(uint16_t) refcnt;  // Reference counter
    uint16_t nb_segs;             // Number of segments in the chain
    uint16_t port;                // Ingress port
    uint64_t ol_flags;            // Offload flags
    uint32_t packet_type;         // L2/L3/L4 classification
    uint32_t pkt_len;             // Total packet length (all segments)
    uint16_t data_len;            // Data in this segment
    // ... VLAN and RSS hash fields
    uint16_t buf_len;             // Buffer capacity
    struct rte_mempool *pool;     // Originating mempool
    RTE_MARKER cacheline1 __rte_cache_min_aligned;
    struct rte_mbuf *next;        // Next segment (scatter/gather)
    uint64_t tx_offload;          // Header lengths, TSO segment size
    // ... more fields
    uint32_t dynfield1[9];        // Application-specific dynamic fields
};
The rte_mbuf is deliberately split across two cache lines:
- cacheline0: buf_addr, buf_iova, data_off, refcnt, ol_flags, pkt_len, data_len. These are the fields a driver touches on every RX/TX operation, so receiving a packet dirties a single line.
- cacheline1: next, tx_offload, dynfield1. These are used on TX, for multi-segment packets, and by application logic.
This layout keeps the receive hot path within one cache line per packet, so the less frequently used fields cost nothing until they are actually needed.
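How buf_addr, data_off, and data_len relate is easiest to see on a toy structure. This sketch assumes the default headroom of 128 bytes; struct toy_mbuf and both helpers are hypothetical stand-ins for what rte_pktmbuf_mtod() and rte_pktmbuf_prepend() do on a real mbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HEADROOM 128 /* mirrors the default RTE_PKTMBUF_HEADROOM */

struct toy_mbuf {
    void    *buf_addr; /* start of the buffer */
    uint16_t data_off; /* where packet data begins inside the buffer */
    uint16_t data_len; /* bytes of packet data in this segment */
    uint16_t buf_len;  /* total buffer capacity */
};

/* Pointer to the first byte of packet data. */
static uint8_t *toy_mtod(const struct toy_mbuf *m)
{
    assert(m != NULL && m->buf_addr != NULL);
    return (uint8_t *)m->buf_addr + m->data_off;
}

/* Grow the packet at the front by len bytes, consuming headroom.
 * Returns the new data start, or NULL if the headroom is too small. */
static uint8_t *toy_prepend(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_off)
        return NULL;
    m->data_off = (uint16_t)(m->data_off - len);
    m->data_len = (uint16_t)(m->data_len + len);
    return toy_mtod(m);
}
```

Prepending into headroom moves data_off backwards instead of copying the payload, which is what makes it cheap to add a header in front of an existing packet.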
VirtIO PMD: Runtime Path Selection
The most interesting aspect of PMD is how it selects the optimal implementation at runtime. Here’s the VirtIO registration code:
File: drivers/net/virtio/virtio_ethdev.c (lines 1350–1420)
static void
virtio_set_rxtx_funcs(struct rte_eth_dev *eth_dev)
{
struct virtio_hw *hw = eth_dev->data->dev_private;
eth_dev->tx_pkt_prepare = virtio_xmit_pkts_prepare;
// Select TX burst function based on hardware features
if (virtio_with_packed_queue(hw)) {
if (hw->use_vec_tx)
eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; // SIMD
else
eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
} else {
if (hw->use_inorder_tx)
eth_dev->tx_pkt_burst = virtio_xmit_pkts_inorder;
else
eth_dev->tx_pkt_burst = virtio_xmit_pkts; // Standard
}
// Select RX burst function
if (virtio_with_packed_queue(hw)) {
if (hw->use_vec_rx)
eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed_vec; // SIMD
else if (virtio_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
eth_dev->rx_pkt_burst = &virtio_recv_mergeable_pkts_packed;
else
eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
} else {
if (hw->use_vec_rx)
eth_dev->rx_pkt_burst = virtio_recv_pkts_vec; // SIMD
else if (hw->use_inorder_rx)
eth_dev->rx_pkt_burst = &virtio_recv_pkts_inorder;
else if (virtio_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
eth_dev->rx_pkt_burst = &virtio_recv_mergeable_pkts;
else
eth_dev->rx_pkt_burst = virtio_recv_pkts; // Standard
}
}
The function pointer is set once at device startup. After that, every call to rte_eth_rx_burst() dispatches directly to the selected implementation. The only dispatch cost is one indirect call; there are no feature-checking branches on the hot path.
The selection above chooses among four TX and seven RX burst functions, depending on:
- Packed ring vs. split ring (VirtIO 1.1 feature)
- Vectorized (SIMD) vs. scalar
- In-order vs. out-of-order descriptor completion
- Mergeable buffers for large packets
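The select-once pattern can be reduced to a few lines. This sketch mirrors the priority order of the split-ring RX branch above (vectorized, then in-order, then mergeable, then scalar); struct toy_hw and the stand-in functions are hypothetical and return an ID only so the choice is observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t (*toy_rx_fn)(void);

/* Stand-in burst functions; each returns an ID identifying the path. */
static uint16_t toy_rx_scalar(void)    { return 1; }
static uint16_t toy_rx_mergeable(void) { return 2; }
static uint16_t toy_rx_inorder(void)   { return 3; }
static uint16_t toy_rx_vec(void)       { return 4; }

struct toy_hw {
    bool use_vec_rx;
    bool use_inorder_rx;
    bool mrg_rxbuf;
};

/* Evaluate the feature flags once and return the function to call
 * on every subsequent burst. */
static toy_rx_fn toy_select_rx(const struct toy_hw *hw)
{
    assert(hw != NULL);
    if (hw->use_vec_rx)
        return toy_rx_vec;
    if (hw->use_inorder_rx)
        return toy_rx_inorder;
    if (hw->mrg_rxbuf)
        return toy_rx_mergeable;
    return toy_rx_scalar;
}
```

The branches run once at startup; the polling loop then pays for a single indirect call per burst regardless of how many features were considered.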
RX Burst: Line by Line
File: drivers/net/virtio/virtio_rxtx.c (lines 992–1092), abridged here: local declarations and some error handling are omitted
uint16_t
virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts)
{
struct virtnet_rx *rxvq = rx_queue;
struct virtqueue *vq = virtnet_rxq_to_vq(rxvq);
struct virtio_hw *hw = vq->hw;
uint16_t nb_used, num, nb_rx;
nb_rx = 0;
// 1. Fast exit if device not started
if (unlikely(hw->started == 0))
return nb_rx;
// 2. Check how many descriptors the device has used
nb_used = virtqueue_nused(vq);
// 3. Determine batch size (min of available, requested, burst limit)
num = likely(nb_used <= nb_pkts) ? nb_used : nb_pkts;
if (unlikely(num > VIRTIO_MBUF_BURST_SZ))
num = VIRTIO_MBUF_BURST_SZ;
// Cache-line alignment optimization
if (likely(num > DESC_PER_CACHELINE))
num = num - ((vq->vq_used_cons_idx + num) % DESC_PER_CACHELINE);
// 4. Batch dequeue from virtqueue
num = virtqueue_dequeue_burst_rx(vq, rcv_pkts, len, num);
// 5. Process each mbuf
for (i = 0; i < num; i++) {
rxm = rcv_pkts[i];
// Validate packet length
if (unlikely(len[i] < hdr_size + RTE_ETHER_HDR_LEN)) {
virtio_discard_rxbuf(vq, rxm);
rxvq->stats.errors++;
continue;
}
// Set mbuf fields
rxm->port = hw->port_id;
rxm->data_off = RTE_PKTMBUF_HEADROOM;
rxm->ol_flags = 0;
rxm->pkt_len = (uint32_t)(len[i] - hdr_size);
rxm->data_len = (uint16_t)(len[i] - hdr_size);
// Process RX offload (checksum validation, etc.)
if (hw->has_rx_offload && virtio_rx_offload(rxm, hdr) < 0) {
virtio_discard_rxbuf(vq, rxm);
rxvq->stats.errors++;
continue;
}
rx_pkts[nb_rx++] = rxm;
}
// 6. Refill consumed descriptors with fresh mbufs
if (likely(!virtqueue_full(vq))) {
if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, free_cnt) == 0)) {
virtqueue_enqueue_recv_refill(vq, new_pkts, free_cnt);
}
}
// 7. Notify device only if needed
if (likely(nb_enqueued)) {
vq_update_avail_idx(vq);
if (unlikely(virtqueue_kick_prepare(vq)))
virtqueue_notify(vq);
}
return nb_rx;
}
Key optimization techniques visible in this code:
- Batch dequeue: virtqueue_dequeue_burst_rx pulls multiple descriptors at once
- Cache-line alignment: adjusts batch size so descriptor reads align to cache boundaries
- likely/unlikely: branch prediction hints for the compiler
- Bulk mbuf allocation: rte_pktmbuf_alloc_bulk allocates multiple mbufs from the pool in one call
- Conditional notification: only writes to the device’s doorbell register when necessary (virtqueue_kick_prepare), avoiding expensive MMIO writes
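The cache-line trimming in step 3 is plain modular arithmetic and can be tried in isolation. This sketch assumes 4 descriptors per cache line (a 64-byte line holding 16-byte descriptors) and a burst cap of 64; both constants and the function name are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DESC_PER_CACHELINE 4 /* assumption: 64-byte line / 16-byte descriptor */
#define TOY_BURST_MAX 64         /* assumption: stand-in for VIRTIO_MBUF_BURST_SZ */

/* Clamp the batch to what is available, requested, and allowed, then trim
 * it so the consumer index after the batch lands on a cache-line boundary. */
static uint16_t toy_rx_batch(uint16_t nb_used, uint16_t nb_pkts, uint16_t cons_idx)
{
    uint16_t num = nb_used <= nb_pkts ? nb_used : nb_pkts;

    if (num > TOY_BURST_MAX)
        num = TOY_BURST_MAX;
    if (num > TOY_DESC_PER_CACHELINE)
        num = (uint16_t)(num - (cons_idx + num) % TOY_DESC_PER_CACHELINE);
    assert(num <= nb_pkts);
    return num;
}
```

For example, with 10 used descriptors and a consumer index of 5, the batch shrinks from 10 to 7 so that the next burst starts at index 12, the beginning of a cache line. The three skipped descriptors are picked up by the following call.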
TX Burst: Line by Line
File: drivers/net/virtio/virtio_rxtx.c (lines 1859–1939), abridged here: local declarations and some error handling are omitted
uint16_t
virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts)
{
struct virtnet_tx *txvq = tx_queue;
struct virtqueue *vq = virtnet_txq_to_vq(txvq);
struct virtio_hw *hw = vq->hw;
uint16_t nb_tx = 0;
if (unlikely(hw->started == 0))
return 0;
// 1. Free completed TX descriptors if ring is getting full
nb_used = virtqueue_nused(vq);
if (likely(nb_used > vq->vq_nentries - vq->vq_free_thresh))
virtio_xmit_cleanup(vq, nb_used);
// 2. Enqueue each packet
for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
struct rte_mbuf *txm = tx_pkts[nb_tx];
int can_push = 0, use_indirect = 0;
// Optimization: header push — avoid a separate descriptor
if ((virtio_with_feature(hw, VIRTIO_F_ANY_LAYOUT) ||
virtio_with_feature(hw, VIRTIO_F_VERSION_1)) &&
rte_mbuf_refcnt_read(txm) == 1 &&
RTE_MBUF_DIRECT(txm) &&
txm->nb_segs == 1 &&
rte_pktmbuf_headroom(txm) >= hdr_size)
can_push = 1;
// Optimization: indirect descriptors for multi-segment packets
else if (virtio_with_feature(hw, VIRTIO_RING_F_INDIRECT_DESC) &&
txm->nb_segs < VIRTIO_MAX_TX_INDIRECT)
use_indirect = 1;
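// Descriptors needed: 1 if indirect, else one per segment (+1 for a separate header)
slots = use_indirect ? 1 : (txm->nb_segs + !can_push);
need = slots - vq->vq_free_cnt;
// If short on descriptors, clean up completed ones and re-check
if (unlikely(need > 0)) {
virtio_xmit_cleanup(vq, need);
need = slots - vq->vq_free_cnt;
if (unlikely(need > 0)) break; // Still no room
}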
// Enqueue to virtqueue
virtqueue_enqueue_xmit(txvq, txm, slots, use_indirect, can_push, 0);
}
// 3. Single notification after batch
if (likely(nb_tx)) {
vq_update_avail_idx(vq);
if (unlikely(virtqueue_kick_prepare(vq)))
virtqueue_notify(vq);
}
return nb_tx;
}
Two VirtIO-specific optimizations stand out:
Header Push: Instead of using a separate descriptor for the VirtIO header, the driver checks if the mbuf’s headroom has enough space. If so, it writes the header directly into the headroom — saving one descriptor and one DMA operation.
Indirect Descriptors: For multi-segment packets (scatter/gather), instead of consuming N descriptors for N segments, the driver uses a single “indirect” descriptor that points to a separate table in memory containing all segment descriptors. This reduces ring space consumption and improves cache behavior.
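The descriptor accounting behind both optimizations fits in one small function. This is a sketch of the rule as described above, with a hypothetical function name; it shows how many ring slots a packet consumes in each case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Ring slots one packet consumes:
 *  - indirect: always 1 (the ring entry points at an out-of-ring table)
 *  - header push: one per segment (the header sits in the first
 *    segment's headroom)
 *  - otherwise: one per segment plus one for the separate header */
static uint16_t toy_tx_slots(uint16_t nb_segs, bool can_push, bool use_indirect)
{
    assert(nb_segs >= 1);
    if (use_indirect)
        return 1;
    return (uint16_t)(nb_segs + (can_push ? 0 : 1));
}
```

A single-segment packet drops from 2 slots to 1 with header push, and a 4-segment packet drops from 5 slots to 1 with an indirect descriptor.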
The Virtqueue: Descriptor Ring
VirtIO uses a shared-memory ring structure between the driver and the device:
┌──────────────────────────────────────┐
│ Available Ring │
│ - Descriptor indices ready to process│
│ - Written by driver, read by device │
├──────────────────────────────────────┤
│ Used Ring │
│ - Descriptor indices completed │
│ - Written by device, read by driver │
├──────────────────────────────────────┤
│ Descriptor Table │
│ - Address, length, flags, next │
│ - Describes DMA buffers │
└──────────────────────────────────────┘
For RX: the driver pre-fills descriptors with empty mbuf addresses. When the device (hypervisor or physical NIC) receives a packet, it DMAs the data into the mbuf and records the descriptor in the Used Ring. The driver then dequeues it in the next rx_burst() call.
For TX: the driver fills descriptors with packet data addresses. The device DMAs the data out, sends it on the wire, and marks the descriptor as used. The driver cleans it up in the next tx_burst().
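The RX hand-off can be modeled with a single-threaded toy split ring. This is a sketch only: the toy_ types are hypothetical and simplified, and the memory barriers a real driver and device need between writing ring entries and publishing indices are left out because nothing here runs concurrently.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_SIZE 8 /* must be a power of two */

struct toy_desc      { uint64_t addr; uint32_t len; };
struct toy_used_elem { uint16_t id;   uint32_t len; };

struct toy_vq {
    struct toy_desc      desc[TOY_RING_SIZE];  /* descriptor table */
    uint16_t             avail[TOY_RING_SIZE]; /* driver -> device */
    struct toy_used_elem used[TOY_RING_SIZE];  /* device -> driver */
    uint16_t avail_idx;      /* written by driver, free-running */
    uint16_t used_idx;       /* written by device, free-running */
    uint16_t last_avail_idx; /* device-private consumer index */
    uint16_t used_cons_idx;  /* driver-private consumer index */
};

/* Driver: expose an empty buffer through descriptor slot id. */
static void toy_driver_post(struct toy_vq *vq, uint16_t id,
                            uint64_t addr, uint32_t len)
{
    assert(id < TOY_RING_SIZE);
    vq->desc[id] = (struct toy_desc){ addr, len };
    vq->avail[vq->avail_idx % TOY_RING_SIZE] = id;
    vq->avail_idx++;
}

/* Device: "receive" pkt_len bytes into the next available buffer.
 * Returns 0 if the driver has posted no buffers. */
static int toy_device_receive(struct toy_vq *vq, uint32_t pkt_len)
{
    if (vq->last_avail_idx == vq->avail_idx)
        return 0;
    uint16_t id = vq->avail[vq->last_avail_idx++ % TOY_RING_SIZE];
    vq->used[vq->used_idx % TOY_RING_SIZE] = (struct toy_used_elem){ id, pkt_len };
    vq->used_idx++;
    return 1;
}

/* Driver: completions not yet consumed. uint16_t subtraction wraps safely. */
static uint16_t toy_nused(const struct toy_vq *vq)
{
    return (uint16_t)(vq->used_idx - vq->used_cons_idx);
}

/* Driver: pop one completion; returns the descriptor id and its length. */
static uint16_t toy_driver_dequeue(struct toy_vq *vq, uint32_t *len)
{
    struct toy_used_elem e = vq->used[vq->used_cons_idx++ % TOY_RING_SIZE];
    *len = e.len;
    return e.id;
}
```

The indices are free-running 16-bit counters that are only reduced modulo the ring size when used as array indices, which is why the "how many are pending" check stays correct across wraparound.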
Performance Optimization Techniques
1. SIMD Vectorization
DPDK uses SSE/AVX/AVX-512/NEON to process multiple descriptors in a single instruction:
// Runtime detection in virtio_ethdev.c (lines 2363-2400)
#if defined(RTE_ARCH_X86_64) && defined(CC_AVX512_SUPPORT)
if (hw->use_vec_rx &&
(!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
!virtio_with_feature(hw, VIRTIO_F_IN_ORDER) ||
rte_vect_get_max_simd_bitwidth() < RTE_VECT_SIMD_512)) {
hw->use_vec_rx = 0; // Fall back to scalar
}
#elif defined(RTE_ARCH_ARM)
if (hw->use_vec_rx &&
(!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON) ||
!virtio_with_feature(hw, VIRTIO_F_IN_ORDER))) {
hw->use_vec_rx = 0;
}
#endif
The vectorized path (virtio_recv_pkts_vec) processes 4–8 descriptors per iteration using SIMD loads and stores, dramatically reducing instruction count.
2. Huge Pages and TLB Efficiency
// DPDK allocates all memory from Huge Pages (2MB or 1GB)
rte_eal_init(argc, argv); // Reserves Huge Pages at startup
// mbuf pool creation on the correct NUMA node
struct rte_mempool *mp =
rte_pktmbuf_pool_create("mbuf_pool",
NB_MBUF, MBUF_CACHE_SIZE, 0,
RTE_MBUF_DEFAULT_BUF_SIZE, numa_node);
2 MB Huge Pages reduce TLB entries by 512× compared to 4 KB pages. For a 2 GB mbuf pool, that’s 1024 TLB entries instead of 524,288 — a massive reduction in TLB miss rate.
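The arithmetic is easy to reproduce. A minimal helper (the name is hypothetical) that counts how many page mappings a pool of a given size needs:

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages (and therefore page mappings) needed to cover
 * pool_bytes with pages of page_bytes each, rounding up. */
static uint64_t toy_pages_needed(uint64_t pool_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    return (pool_bytes + page_bytes - 1) / page_bytes;
}
```

For a 2 GB pool this gives 524,288 pages at 4 KB, 1,024 pages at 2 MB, and 2 pages at 1 GB.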
3. NUMA-Aware Memory Placement
// Query the NIC's NUMA node (negative if it cannot be determined)
int numa_node = rte_eth_dev_socket_id(port);
// Allocate the mbuf pool on the same node
struct rte_mempool *mp =
    rte_pktmbuf_pool_create("mbuf_pool", NB_MBUF,
                            MBUF_CACHE_SIZE, 0,
                            RTE_MBUF_DEFAULT_BUF_SIZE,
                            numa_node);
Cross-NUMA memory access costs 40–80 ns extra latency on typical x86 servers. Always allocate mbuf pools on the same NUMA node as the NIC.
4. Per-Core Mempool Cache
struct rte_mempool *mp =
rte_pktmbuf_pool_create("mbuf_pool",
NB_MBUF, // Total mbufs
MBUF_CACHE_SIZE, // Per-core cache (e.g., 256)
0,
RTE_MBUF_DEFAULT_BUF_SIZE,
socket_id);
Each core has a local cache of 256 mbufs. Allocation/deallocation hits this cache first — no global lock contention. Only when the cache is exhausted does the core access the shared pool ring.
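The effect of the per-core cache can be shown with a toy allocator that counts how often the shared pool is touched. This is a single-threaded sketch with hypothetical names and sizes; the real mempool uses a lock-free ring for the shared part and its own refill and flush thresholds.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_POOL_SIZE  64
#define TOY_CACHE_SIZE 8

struct toy_pool {
    int      objs[TOY_POOL_SIZE]; /* shared stack of free object IDs */
    uint32_t count;
    uint32_t shared_accesses;     /* how often the shared pool was touched */
};

struct toy_cache {
    int      objs[TOY_CACHE_SIZE]; /* private to one core: no locking needed */
    uint32_t len;
};

static void toy_pool_init(struct toy_pool *p)
{
    for (uint32_t i = 0; i < TOY_POOL_SIZE; i++)
        p->objs[i] = (int)i;
    p->count = TOY_POOL_SIZE;
    p->shared_accesses = 0;
}

/* Allocate one object: hit the local cache first, refill in bulk on a miss.
 * Returns -1 when both the cache and the shared pool are empty. */
static int toy_alloc(struct toy_pool *p, struct toy_cache *c)
{
    if (c->len == 0) {
        uint32_t n = p->count < TOY_CACHE_SIZE ? p->count : TOY_CACHE_SIZE;
        if (n == 0)
            return -1;
        p->shared_accesses++; /* one shared-pool trip refills many objects */
        while (n--)
            c->objs[c->len++] = p->objs[--p->count];
    }
    return c->objs[--c->len];
}

/* Free one object: return it to the local cache, flushing half of the
 * cache to the shared pool in one trip when the cache is full. */
static void toy_free(struct toy_pool *p, struct toy_cache *c, int obj)
{
    if (c->len == TOY_CACHE_SIZE) {
        p->shared_accesses++;
        while (c->len > TOY_CACHE_SIZE / 2) {
            assert(p->count < TOY_POOL_SIZE);
            p->objs[p->count++] = c->objs[--c->len];
        }
    }
    c->objs[c->len++] = obj;
}
```

Allocating and then freeing 16 objects touches the shared pool 4 times instead of 32, and the ratio improves further as the cache size grows.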
5. Lock-Free Multi-Queue Processing
┌─────────────────────────────────────────┐
│ Physical NIC                            │
│ 4 RX queues + 4 TX queues               │
└────────────────────┬────────────────────┘
                     │ PCIe
     ┌──────────┬────┴─────┬──────────┐
     │          │          │          │
  Core 0     Core 1     Core 2     Core 3
  Queue 0    Queue 1    Queue 2    Queue 3
     │          │          │          │
     └──────────┴──────────┴──────────┘
             No lock contention
Each core processes exactly one RX queue and one TX queue. No mutex, no atomic operations, no false sharing. This is why RSS (Receive Side Scaling) is critical — it distributes packets across queues based on a hash of the 5-tuple, ensuring flow affinity to a specific core.
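Flow affinity follows directly from hashing: the same 5-tuple always yields the same queue. The sketch below uses FNV-1a as a stand-in hash and a plain modulo; that is an assumption made for brevity, since real NICs typically compute a Toeplitz hash with a configurable key and map it through an indirection table. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct toy_flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a over the 5-tuple fields, byte by byte. A stand-in only. */
static uint32_t toy_flow_hash(const struct toy_flow *f)
{
    const uint32_t words[4] = {
        f->src_ip, f->dst_ip,
        ((uint32_t)f->src_port << 16) | f->dst_port,
        f->proto
    };
    uint32_t h = 2166136261u;

    for (int i = 0; i < 4; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xffu;
            h *= 16777619u;
        }
    }
    return h;
}

/* Same flow -> same hash -> same queue, hence the same polling core. */
static uint16_t toy_rss_queue(const struct toy_flow *f, uint16_t nb_queues)
{
    assert(nb_queues > 0);
    return (uint16_t)(toy_flow_hash(f) % nb_queues);
}
```

Because every packet of a flow lands on one queue, per-flow state (connection tables, order books, reassembly buffers) can live on a single core without locks.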
Initialization Flow
1. EAL Init
   └─> rte_eal_init()
       ├─> Enumerate PCI devices
       ├─> Reserve Huge Pages
       └─> Set CPU affinity
2. PMD Probe (runs inside rte_eal_init())
   └─> rte_bus_probe()
       └─> Call driver .probe()
           ├─> rte_eth_dev_allocate()
           ├─> Fill eth_dev_ops
           └─> Set rx_pkt_burst / tx_pkt_burst
3. Device Configuration
   └─> rte_eth_dev_configure()
       └─> dev_ops->dev_configure()
4. Queue Setup
   ├─> rte_eth_rx_queue_setup()
   │   └─> dev_ops->rx_queue_setup()
   └─> rte_eth_tx_queue_setup()
       └─> dev_ops->tx_queue_setup()
5. Device Start
   └─> rte_eth_dev_start()
       └─> dev_ops->dev_start()
           └─> Finalize rx_pkt_burst / tx_pkt_burst selection
Steps 1–5 happen once at startup. After that, the application enters the tight polling loop and never touches the slow path again.
Practical Example: L2 Forwarding
A minimal DPDK application:
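#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_debug.h>
#include <rte_launch.h>
#include <rte_lcore.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#define BURST_SIZE 32
#define RX_RING_SIZE 1024
#define TX_RING_SIZE 1024
#define NUM_MBUFS 8191
#define MBUF_CACHE_SIZE 250
static volatile bool force_quit;
static void
signal_handler(int signum)
{
    (void)signum;
    force_quit = true;
}
// Worker entry point; rte_eal_remote_launch() requires int (*)(void *)
static int
lcore_main(void *arg)
{
    uint16_t port_id = (uint16_t)(uintptr_t)arg;
    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t nb_rx, nb_tx;
    while (!force_quit) {
        nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
        if (unlikely(nb_rx == 0))
            continue;
        // Send the burst back out of the same port
        nb_tx = rte_eth_tx_burst(port_id, 0, pkts, nb_rx);
        // Free unsent packets
        if (unlikely(nb_tx < nb_rx)) {
            for (uint16_t buf = nb_tx; buf < nb_rx; buf++)
                rte_pktmbuf_free(pkts[buf]);
        }
    }
    return 0;
}
int main(int argc, char *argv[])
{
    uint16_t port_id = 0;
    unsigned int lcore_id;
    // 1. Initialize EAL and install signal handlers
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");
    signal(SIGINT, signal_handler);
    signal(SIGTERM, signal_handler);
    // 2. Create mbuf pool on the port's NUMA node
    struct rte_mempool *mbuf_pool =
        rte_pktmbuf_pool_create("MBUF_POOL",
            NUM_MBUFS, MBUF_CACHE_SIZE, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_eth_dev_socket_id(port_id));
    if (mbuf_pool == NULL)
        rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");
    // 3. Configure and start the port (one RX queue, one TX queue)
    struct rte_eth_conf port_conf = {0};
    if (rte_eth_dev_configure(port_id, 1, 1, &port_conf) < 0 ||
        rte_eth_rx_queue_setup(port_id, 0, RX_RING_SIZE,
            rte_eth_dev_socket_id(port_id), NULL, mbuf_pool) < 0 ||
        rte_eth_tx_queue_setup(port_id, 0, TX_RING_SIZE,
            rte_eth_dev_socket_id(port_id), NULL) < 0 ||
        rte_eth_dev_start(port_id) < 0)
        rte_exit(EXIT_FAILURE, "Cannot set up port %u\n", port_id);
    rte_eth_promiscuous_enable(port_id);
    // 4. Launch the polling loop on one worker lcore. A queue must only
    //    be polled by one lcore at a time, so a single worker owns queue 0.
    lcore_id = rte_get_next_lcore(-1, 1, 0);
    if (lcore_id == RTE_MAX_LCORE)
        rte_exit(EXIT_FAILURE, "Need at least one worker lcore\n");
    rte_eal_remote_launch(lcore_main, (void *)(uintptr_t)port_id, lcore_id);
    // 5. Wait for the worker to exit (SIGINT/SIGTERM), then clean up
    rte_eal_mp_wait_lcore();
    rte_eth_dev_stop(port_id);
    rte_eth_dev_close(port_id);
    rte_eal_cleanup();
    return 0;
}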
Source Code Reading Guide
| Priority | File | What You’ll Learn |
|---|---|---|
| 1 | lib/ethdev/rte_ethdev.h | Public API documentation |
| 2 | lib/ethdev/ethdev_driver.h | PMD interface, rte_eth_dev structure |
| 3 | lib/mbuf/rte_mbuf_core.h | mbuf layout, cache-line split design |
| 4 | drivers/net/virtio/virtio_rxtx.c | RX/TX burst implementation |
| 5 | drivers/net/virtio/virtio_ethdev.c | Runtime path selection |
| 6 | examples/l2fwd/main.c | Minimal real application |
Start with VirtIO — it’s the simplest PMD. Once you understand the descriptor ring and burst API, move to hardware PMDs like ixgbe (Intel 82599) or ice (Intel E810) to see how physical NIC register access works.
Key Takeaways
- Polling trades CPU for latency: dedicate a core and remove interrupt and scheduling delay from the receive path
- Burst API amortizes overhead — always process packets in batches of 32+
- Function pointers at offset 0 — fast path has zero indirection beyond the initial call
- Runtime path selection — one binary supports scalar, SIMD, packed, and inorder paths
- Huge Pages + NUMA + per-core caches — the memory hierarchy is the performance
- Lock-free by design — one queue per core, no shared state on the hot path
Understanding PMD internals is essential for anyone building low-latency trading systems on DPDK. The same principles (cache alignment, batch processing, zero-copy DMA, runtime dispatch) apply across kernel-bypass data paths, from the VirtIO PMD to Solarflare’s ef_vi.