DPDK Poll Mode Driver: A Deep Dive from Architecture to Implementation

The Data Plane Development Kit (DPDK) lets user-space applications process packets at line rate (10 Gbps, 40 Gbps, even 100 Gbps) by keeping the Linux kernel out of the data path. At the heart of this architecture sits the Poll Mode Driver (PMD): a driver that replaces interrupt-driven I/O with a tight polling loop, cutting per-packet software overhead to the order of tens of nanoseconds and sustaining millions of packets per second per core.

This article is based on a thorough reading of the DPDK 26.03 source tree, with specific file paths and line numbers cited throughout. We’ll trace the complete path from rte_eth_rx_burst() down to the hardware descriptor ring, using the VirtIO PMD as our reference implementation — the simplest PMD that still demonstrates every major optimization technique.

Why Polling?

Traditional network I/O works like this:

NIC receives packet → hardware interrupt → interrupt handler → softirq
  → kernel scheduler → application wakes up → process packet

Each step adds latency. Context switches cost microseconds. Interrupt coalescing adds more. And the kernel’s per-packet overhead (skb allocation, netfilter hooks, socket buffering) makes rates of many millions of packets per second per core very hard to reach.
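To put a number on what "high PPS" means, here is a small sketch of the per-packet time budget at line rate. It assumes standard Ethernet framing overhead (8 bytes of preamble and 12 bytes of inter-frame gap per frame); the helper names are illustrative and not part of DPDK.

```c
#include <stdint.h>

/* Per-frame cost on the wire that is not part of the frame itself:
 * 8 bytes preamble/SFD + 12 bytes inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20u

/* Packets per second at a given link speed (bits/s) and frame size
 * (bytes, including the 4-byte FCS). Hypothetical helper. */
static double line_rate_pps(double link_bps, uint32_t frame_bytes)
{
    return link_bps / (8.0 * (frame_bytes + WIRE_OVERHEAD_BYTES));
}

/* Time available to handle one packet, in nanoseconds. */
static double ns_per_packet(double link_bps, uint32_t frame_bytes)
{
    return 1e9 / line_rate_pps(link_bps, frame_bytes);
}
```

For 64-byte frames at 10 Gbps this works out to roughly 14.88 million packets per second, or about 67 ns per packet, which is less than the cost of a single context switch.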

PMD inverts the model:

Application polls in a tight loop → reads hardware descriptors directly
  → zero interrupts, zero kernel involvement → process packet

Property	Interrupt Mode	Poll Mode (PMD)
Latency	Microseconds (interrupt + scheduling)	Sub-microsecond (descriptor ring polled from user space)
CPU Utilization	Event-driven, idle when no traffic	100% dedicated core
Context Switches	Frequent (kernel ↔ user)	None (pure user-space)
Batch Processing	Unfriendly (per-packet interrupts)	Natural (burst of 32/64/128)
Data Copies	Multiple (kernel → user)	Zero-copy (DMA to user-space)

The trade-off is clear: you sacrifice a CPU core to gain predictable, low-latency packet processing. In trading systems where every microsecond counts, that is usually a price worth paying.

Burst-Oriented API

PMD processes packets in bursts, not one at a time. The core API:

#define BURST_SIZE 32

struct rte_mbuf *pkts[BURST_SIZE];
uint16_t nb_rx, nb_tx, i;

while (1) {
    // Try to receive up to BURST_SIZE packets; returns 0..BURST_SIZE
    nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);

    for (i = 0; i < nb_rx; i++) {
        process_packet(pkts[i]);
    }

    // Try to send the received packets; may accept fewer than nb_rx
    nb_tx = rte_eth_tx_burst(port, queue, pkts, nb_rx);

    // The caller still owns whatever was not accepted
    for (i = nb_tx; i < nb_rx; i++) {
        rte_pktmbuf_free(pkts[i]);
    }
}

Batch processing amortizes function call overhead, improves cache locality, and enables SIMD vectorization. The typical burst size is 32 or 64: large enough to amortize the per-call cost, small enough that the burst’s mbuf pointers and packet headers stay resident in L1 cache.
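A burst transmit call may accept fewer packets than it was offered when the TX ring is short on descriptors, so callers usually retry or drop the remainder. The sketch below shows that pattern against a mock transmit function. The names mock_txq, mock_tx_burst, send_burst and the retry limit are hypothetical stand-ins; a real application would call rte_eth_tx_burst() and release dropped mbufs with rte_pktmbuf_free().

```c
#include <stdint.h>

/* Mock TX queue: accepts at most `room` packets in total, standing in
 * for a ring with a limited number of free descriptors. */
struct mock_txq { uint16_t room; };

/* Same shape as a burst TX call: returns how many packets were accepted. */
static uint16_t mock_tx_burst(struct mock_txq *q, void **pkts, uint16_t n)
{
    (void)pkts;
    uint16_t sent = n < q->room ? n : q->room;
    q->room -= sent;
    return sent;
}

/* Offer a burst, retrying the unsent tail up to max_retries times.
 * Returns the number of packets that were never accepted; the caller
 * still owns those and must free them. */
static uint16_t send_burst(struct mock_txq *q, void **pkts, uint16_t n,
                           unsigned max_retries)
{
    uint16_t sent = mock_tx_burst(q, pkts, n);

    while (sent < n && max_retries-- > 0)
        sent += mock_tx_burst(q, pkts + sent, (uint16_t)(n - sent));

    return (uint16_t)(n - sent);
}
```

Bounding the retries matters in a polling loop: spinning forever on a full TX ring would stall the RX side of the same core.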

Architecture: Three Layers

┌─────────────────────────────────────────────────┐
│         Application Layer                        │
│  rte_eth_rx_burst() / rte_eth_tx_burst()        │
└──────────────────┬──────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────┐
│      ethdev Library (lib/ethdev)                 │
│  Unified API + function pointer dispatch        │
└──────────────────┬──────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────┐
│      PMD Layer (drivers/net/*)                   │
│  Hardware-specific driver implementations        │
│  — virtio, ixgbe, i40e, ice, bnxt ...            │
└──────────────────┬──────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────┐
│         Hardware (NIC + PCIe)                    │
└─────────────────────────────────────────────────┘

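The key insight: rte_eth_rx_burst() doesn’t contain any RX logic. It is an inline wrapper that calls through a function pointer. The PMD stores its implementation in rte_eth_dev.rx_pkt_burst, and since DPDK 21.11 ethdev copies that pointer into a flat per-port table (rte_eth_fp_ops) that the wrapper indexes by port ID. The actual implementation is selected at device startup based on hardware capabilities: standard, inorder, packed ring, or vectorized (SIMD) path.

This is the fast path: no locks and no feature checks, just a table lookup and an indirect call (plus optional RX/TX callbacks and debug checks when those are compiled in). The slow path (configuration, link status, statistics) goes through eth_dev_ops, a separate function table that’s never touched on the hot path.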
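The dispatch pattern can be reduced to a few lines of plain C. This is a toy model, not the ethdev code: fp_ops, fp_table, port_start and the two stand-in receive functions are hypothetical names chosen for illustration.

```c
#include <stdint.h>

#define MAX_PORTS 4

/* Burst receive signature, shaped like eth_rx_burst_t. Packets are opaque. */
typedef uint16_t (*rx_burst_fn)(void *rxq, void **pkts, uint16_t n);

/* Hypothetical per-port fast-path entry: the function pointer plus the
 * queue handles it will be called with. */
struct fp_ops {
    rx_burst_fn rx_burst;
    void **rxq;
};

static struct fp_ops fp_table[MAX_PORTS];

/* Two stand-in implementations that only report how much they "received". */
static uint16_t rx_scalar(void *rxq, void **pkts, uint16_t n)
{ (void)rxq; (void)pkts; return n > 1 ? 1 : n; }

static uint16_t rx_vector(void *rxq, void **pkts, uint16_t n)
{ (void)rxq; (void)pkts; return n; }

/* Slow path: choose the implementation once, when the port starts. */
static void port_start(uint16_t port, void **queues, int has_simd)
{
    fp_table[port].rx_burst = has_simd ? rx_vector : rx_scalar;
    fp_table[port].rxq = queues;
}

/* Fast path: a table lookup and an indirect call, no feature checks. */
static inline uint16_t rx_burst(uint16_t port, uint16_t queue,
                                void **pkts, uint16_t n)
{
    const struct fp_ops *p = &fp_table[port];
    return p->rx_burst(p->rxq[queue], pkts, n);
}
```

The capability test (has_simd here) runs once in port_start(); rx_burst() never repeats it, which is the whole point of the design.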

Core Data Structures

`eth_rx_burst_t` / `eth_tx_burst_t`

File: lib/ethdev/rte_ethdev_core.h (lines 28–37)

typedef uint16_t (*eth_rx_burst_t)(void *rxq,
                                   struct rte_mbuf **rx_pkts,
                                   uint16_t nb_pkts);

typedef uint16_t (*eth_tx_burst_t)(void *txq,
                                   struct rte_mbuf **tx_pkts,
                                   uint16_t nb_pkts);

Simple and fast: a queue pointer, an array of mbuf pointers, and a count. Returns the actual number of packets processed.

`rte_eth_dev` — The Device Structure

File: lib/ethdev/ethdev_driver.h (lines 72–117)

struct __rte_cache_aligned rte_eth_dev {
    // Fast-path function pointers — placed at the start for cache friendliness
    eth_rx_burst_t rx_pkt_burst;
    eth_tx_burst_t tx_pkt_burst;
    eth_tx_prep_t tx_pkt_prepare;
    eth_rx_queue_count_t rx_queue_count;
    eth_rx_descriptor_status_t rx_descriptor_status;
    // ... more fast-path pointers

    // Device data (shared across processes)
    struct rte_eth_dev_data *data;

    // Per-process private data (not shared with secondary processes)
    void *process_private;
    const struct eth_dev_ops *dev_ops;  // Slow-path function table

    // Device handles
    struct rte_device *device;
    struct rte_intr_handle *intr_handle;
    // ... callbacks, state
};

Design decisions worth noting:

Fast-path pointers at offset 0: the first cache line contains rx_pkt_burst and tx_pkt_burst, which are read on every burst, so loading one brings in the other.
__rte_cache_aligned: the entire structure starts on a cache line boundary, preventing false sharing with adjacent data.
Separation of dev_ops: the slow-path table (configure, start, stop, stats) is a separate object reached through the dev_ops pointer. Hot-path code never dereferences it.
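Layout intentions like these can be checked at compile time. The sketch below does so for a toy structure; toy_dev and its fields are hypothetical, and the 64-byte cache line is an assumption that holds on most x86 and many Arm cores.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

typedef uint16_t (*burst_fn)(void *q, void **pkts, uint16_t n);

/* Hypothetical device structure: hot pointers first, cold ones after. */
struct toy_dev {
    alignas(CACHE_LINE) burst_fn rx_burst;  /* hot: read on every burst */
    burst_fn tx_burst;                      /* hot */
    void *data;                             /* warm: queue arrays */
    const void *slow_ops;                   /* cold: configuration callbacks */
};

/* Compile-time checks that the layout matches the intent. */
static_assert(alignof(struct toy_dev) == CACHE_LINE,
              "struct must start on a cache-line boundary");
static_assert(offsetof(struct toy_dev, rx_burst) == 0,
              "rx pointer must sit at offset 0");
static_assert(offsetof(struct toy_dev, tx_burst) + sizeof(burst_fn)
                  <= CACHE_LINE,
              "both burst pointers must share the first cache line");
```

If a later change pushes a hot field out of the first line, the build fails instead of the regression showing up as a slow benchmark.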

`rte_eth_dev_data` — Shared Device State

File: lib/ethdev/ethdev_driver.h (lines 128–214)

struct __rte_cache_aligned rte_eth_dev_data {
    char name[RTE_ETH_NAME_MAX_LEN];

    void **rx_queues;           // Array of RX queue pointers
    void **tx_queues;           // Array of TX queue pointers
    uint16_t nb_rx_queues;
    uint16_t nb_tx_queues;

    void *dev_private;          // PMD-specific private data
    struct rte_eth_link dev_link;
    struct rte_eth_conf dev_conf;
    uint16_t mtu;

    uint16_t port_id;
    int numa_node;              // NUMA affinity

    // Bitfield status flags (saves space)
    uint8_t promiscuous    : 1,
            scattered_rx   : 1,
            all_multicast  : 1,
            dev_started    : 1;
    // ... more flags elided

    uint8_t rx_queue_state[RTE_MAX_QUEUES_PER_PORT];
    uint8_t tx_queue_state[RTE_MAX_QUEUES_PER_PORT];
};

This structure is designed to be placed in shared memory for multi-process DPDK. The queue pointer arrays, configuration, and state flags all live here so that a secondary process can access them.

`rte_mbuf` — The Packet Buffer

File: lib/mbuf/rte_mbuf_core.h (included by rte_mbuf.h)

The listing is simplified from the default build: markers, unions, and a few fields are omitted, and details vary between releases.

struct rte_mbuf {
    // ---- Cache line 0: everything the RX fast path touches ----
    void *buf_addr;              // Virtual address of buffer
    rte_iova_t buf_iova;         // Physical/IOVA address (for DMA)

    uint16_t data_off;           // Offset from buf_addr to data start
    RTE_ATOMIC(uint16_t) refcnt; // Reference counter
    uint16_t nb_segs;            // Number of segments in the chain
    uint16_t port;               // Ingress port

    uint64_t ol_flags;           // Offload flags

    uint32_t packet_type;        // L2/L3/L4 classification
    uint32_t pkt_len;            // Total packet length (all segments)
    uint16_t data_len;           // Data in this segment
    uint16_t vlan_tci;           // VLAN tag, if stripped by the NIC
    // ... RSS hash / flow director union, vlan_tci_outer
    uint16_t buf_len;            // Buffer capacity
    struct rte_mempool *pool;    // Originating mempool

    // ---- Cache line 1: TX and slow-path fields ----
    struct rte_mbuf *next;       // Next segment (scatter/gather)
    uint64_t tx_offload;         // l2_len, l3_len, l4_len, tso_segsz, ...
    // ... shinfo, priv_size, timesync
    uint32_t dynfield1[9];       // Dynamic fields registered at runtime
};

The rte_mbuf spans exactly two cache lines, and fields are grouped by when they are used:

cacheline0: buf_addr, data_off, refcnt, port, ol_flags, packet_type, pkt_len, data_len, pool. Everything a PMD fills in on receive and an application reads to parse the packet.
cacheline1: next, tx_offload, dynfield1. Needed only for multi-segment packets, TX offloads, and optional dynamic fields.

This layout lets a simple RX path fill in and consume a packet while touching only one cache line of metadata per mbuf.
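The mbuf does not store a pointer to the packet data; it is derived from buf_addr and data_off, which is what rte_pktmbuf_mtod() expands to. The sketch below models that arithmetic on a toy structure. toy_mbuf and the toy_* helpers are hypothetical, and the 128-byte headroom is assumed to match the usual RTE_PKTMBUF_HEADROOM default.

```c
#include <assert.h>
#include <stdint.h>

#define HEADROOM 128  /* assumption: usual RTE_PKTMBUF_HEADROOM default */

/* Toy buffer descriptor with only the fields needed here. */
struct toy_mbuf {
    void    *buf_addr;  /* start of the buffer */
    uint16_t buf_len;   /* total buffer capacity */
    uint16_t data_off;  /* offset of the first packet byte */
    uint16_t data_len;  /* bytes of packet data in this segment */
};

/* Pointer to the first packet byte: buf_addr + data_off. */
static inline uint8_t *toy_mtod(const struct toy_mbuf *m)
{
    return (uint8_t *)m->buf_addr + m->data_off;
}

/* Space left after the data. */
static inline uint16_t toy_tailroom(const struct toy_mbuf *m)
{
    assert(m->data_off + m->data_len <= m->buf_len);
    return (uint16_t)(m->buf_len - m->data_off - m->data_len);
}

/* Grow the packet at the front by len bytes, for example to add a header.
 * Returns the new data start, or NULL if the headroom is too small. */
static inline uint8_t *toy_prepend(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_off)
        return 0;
    m->data_off = (uint16_t)(m->data_off - len);
    m->data_len = (uint16_t)(m->data_len + len);
    return toy_mtod(m);
}
```

Prepending only moves data_off backwards; no packet bytes are copied, which is why headers can be added or stripped cheaply.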

VirtIO PMD: Runtime Path Selection

The most interesting aspect of PMD is how it selects the optimal implementation at runtime. Here’s the VirtIO registration code:

File: drivers/net/virtio/virtio_ethdev.c (lines 1350–1420)

static void
virtio_set_rxtx_funcs(struct rte_eth_dev *eth_dev)
{
    struct virtio_hw *hw = eth_dev->data->dev_private;
    eth_dev->tx_pkt_prepare = virtio_xmit_pkts_prepare;

    // Select TX burst function based on hardware features
    if (virtio_with_packed_queue(hw)) {
        if (hw->use_vec_tx)
            eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed_vec; // SIMD
        else
            eth_dev->tx_pkt_burst = virtio_xmit_pkts_packed;
    } else {
        if (hw->use_inorder_tx)
            eth_dev->tx_pkt_burst = virtio_xmit_pkts_inorder;
        else
            eth_dev->tx_pkt_burst = virtio_xmit_pkts;           // Standard
    }

    // Select RX burst function
    if (virtio_with_packed_queue(hw)) {
        if (hw->use_vec_rx)
            eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed_vec; // SIMD
        else if (virtio_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
            eth_dev->rx_pkt_burst = &virtio_recv_mergeable_pkts_packed;
        else
            eth_dev->rx_pkt_burst = &virtio_recv_pkts_packed;
    } else {
        if (hw->use_vec_rx)
            eth_dev->rx_pkt_burst = virtio_recv_pkts_vec;         // SIMD
        else if (hw->use_inorder_rx)
            eth_dev->rx_pkt_burst = &virtio_recv_pkts_inorder;
        else if (virtio_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
            eth_dev->rx_pkt_burst = &virtio_recv_mergeable_pkts;
        else
            eth_dev->rx_pkt_burst = virtio_recv_pkts;            // Standard
    }
}

The function pointer is set once at device startup. After that, every call to rte_eth_rx_burst() dispatches directly to the selected implementation. The only per-call cost is the indirect call itself, with no feature-checking branches on the hot path.

The code above chooses among four TX and seven RX burst functions, depending on:

Packed ring vs. split ring (VirtIO 1.1 feature)
Vectorized (SIMD) vs. scalar
In-order vs. out-of-order descriptor completion
Mergeable buffers for large packets

RX Burst: Line by Line

File: drivers/net/virtio/virtio_rxtx.c (lines 992–1092)

uint16_t
virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
                 uint16_t nb_pkts)
{
    // Abridged: statistics, VLAN stripping and some error handling omitted
    struct virtnet_rx *rxvq = rx_queue;
    struct virtqueue *vq = virtnet_rxq_to_vq(rxvq);
    struct virtio_hw *hw = vq->hw;
    struct rte_mbuf *rxm;
    struct rte_mbuf *rcv_pkts[VIRTIO_MBUF_BURST_SZ];
    uint32_t len[VIRTIO_MBUF_BURST_SZ];
    struct virtio_net_hdr *hdr;
    uint32_t hdr_size = hw->vtnet_hdr_size;
    uint32_t i, nb_enqueued = 0;
    uint16_t nb_used, num, nb_rx;

    nb_rx = 0;

    // 1. Fast exit if device not started
    if (unlikely(hw->started == 0))
        return nb_rx;

    // 2. Check how many descriptors the device has used
    nb_used = virtqueue_nused(vq);

    // 3. Determine batch size (min of available, requested, burst limit)
    num = likely(nb_used <= nb_pkts) ? nb_used : nb_pkts;
    if (unlikely(num > VIRTIO_MBUF_BURST_SZ))
        num = VIRTIO_MBUF_BURST_SZ;

    // Trim the batch so it ends on a cache-line boundary of the ring
    if (likely(num > DESC_PER_CACHELINE))
        num = num - ((vq->vq_used_cons_idx + num) % DESC_PER_CACHELINE);

    // 4. Batch dequeue from virtqueue
    num = virtqueue_dequeue_burst_rx(vq, rcv_pkts, len, num);

    // 5. Process each mbuf
    for (i = 0; i < num; i++) {
        rxm = rcv_pkts[i];

        // Validate packet length; a discarded buffer goes back on the ring
        if (unlikely(len[i] < hdr_size + RTE_ETHER_HDR_LEN)) {
            nb_enqueued++;
            virtio_discard_rxbuf(vq, rxm);
            rxvq->stats.errors++;
            continue;
        }

        // Set mbuf fields
        rxm->port = hw->port_id;
        rxm->data_off = RTE_PKTMBUF_HEADROOM;
        rxm->ol_flags = 0;
        rxm->pkt_len = (uint32_t)(len[i] - hdr_size);
        rxm->data_len = (uint16_t)(len[i] - hdr_size);

        // The VirtIO header sits in the headroom, just before the data
        hdr = (struct virtio_net_hdr *)((char *)rxm->buf_addr +
            RTE_PKTMBUF_HEADROOM - hdr_size);

        // Process RX offload (checksum validation, etc.)
        if (hw->has_rx_offload && virtio_rx_offload(rxm, hdr) < 0) {
            virtio_discard_rxbuf(vq, rxm);
            rxvq->stats.errors++;
            continue;
        }

        rx_pkts[nb_rx++] = rxm;
    }

    // 6. Refill consumed descriptors with fresh mbufs
    if (likely(!virtqueue_full(vq))) {
        uint16_t free_cnt = vq->vq_free_cnt;
        struct rte_mbuf *new_pkts[free_cnt];

        if (likely(rte_pktmbuf_alloc_bulk(rxvq->mpool, new_pkts, free_cnt) == 0)) {
            virtqueue_enqueue_recv_refill(vq, new_pkts, free_cnt);
            nb_enqueued += free_cnt;
        }
    }

    // 7. Notify device only if something was put back on the ring
    if (likely(nb_enqueued)) {
        vq_update_avail_idx(vq);
        if (unlikely(virtqueue_kick_prepare(vq)))
            virtqueue_notify(vq);
    }

    return nb_rx;
}

Key optimization techniques visible in this code:

Batch dequeue — virtqueue_dequeue_burst_rx pulls multiple descriptors at once
Cache-line alignment — adjusts batch size so descriptor reads align to cache boundaries
likely/unlikely — branch prediction hints for the compiler
Bulk mbuf allocation — rte_pktmbuf_alloc_bulk allocates multiple mbufs from the pool in one call
Conditional notification — only writes to the device’s doorbell register when necessary (virtqueue_kick_prepare), avoiding expensive MMIO writes
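The batch-size rule is easy to get wrong, so here is the arithmetic in isolation. The constants are assumptions: a 64-byte cache line and 16-byte split-ring descriptors give four descriptors per line, and 64 stands in for the driver's internal burst cap. rx_batch_size is a hypothetical helper, not a driver function.

```c
#include <stdint.h>

/* Assumed values, see the text above. */
#define DESC_PER_CACHELINE 4u
#define MAX_BURST          64u

/* How many used descriptors to dequeue in this call. */
static uint16_t rx_batch_size(uint16_t nb_used, uint16_t nb_pkts,
                              uint16_t used_cons_idx)
{
    uint16_t num = nb_used <= nb_pkts ? nb_used : nb_pkts;

    if (num > MAX_BURST)
        num = MAX_BURST;

    /* Trim so that the batch ends on a cache-line boundary of the ring.
     * Small batches are left alone so that progress is always possible. */
    if (num > DESC_PER_CACHELINE)
        num = (uint16_t)(num - (used_cons_idx + num) % DESC_PER_CACHELINE);

    return num;
}
```

With 10 descriptors ready and the consumer index at 3, the function returns 9: the batch then ends at index 12, so the next call starts reading at the beginning of a cache line.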

TX Burst: Line by Line

File: drivers/net/virtio/virtio_rxtx.c (lines 1859–1939)

uint16_t
virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
                 uint16_t nb_pkts)
{
    // Abridged: statistics and logging omitted
    struct virtnet_tx *txvq = tx_queue;
    struct virtqueue *vq = virtnet_txq_to_vq(txvq);
    struct virtio_hw *hw = vq->hw;
    uint16_t hdr_size = hw->vtnet_hdr_size;
    uint16_t nb_used, nb_tx = 0;

    if (unlikely(hw->started == 0))
        return 0;

    // 1. Free completed TX descriptors if ring is getting full
    nb_used = virtqueue_nused(vq);
    if (likely(nb_used > vq->vq_nentries - vq->vq_free_thresh))
        virtio_xmit_cleanup(vq, nb_used);

    // 2. Enqueue each packet
    for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
        struct rte_mbuf *txm = tx_pkts[nb_tx];
        int can_push = 0, use_indirect = 0, slots, need;

        // Optimization: header push, which avoids a separate descriptor
        if ((virtio_with_feature(hw, VIRTIO_F_ANY_LAYOUT) ||
              virtio_with_feature(hw, VIRTIO_F_VERSION_1)) &&
            rte_mbuf_refcnt_read(txm) == 1 &&
            RTE_MBUF_DIRECT(txm) &&
            txm->nb_segs == 1 &&
            rte_pktmbuf_headroom(txm) >= hdr_size)
            can_push = 1;

        // Optimization: indirect descriptors for multi-segment packets
        else if (virtio_with_feature(hw, VIRTIO_RING_F_INDIRECT_DESC) &&
             txm->nb_segs < VIRTIO_MAX_TX_INDIRECT)
            use_indirect = 1;

        // Ring entries needed: 1 if indirect, else one per segment,
        // plus one for the header unless it was pushed into the headroom
        slots = use_indirect ? 1 : (txm->nb_segs + !can_push);
        need = slots - vq->vq_free_cnt;

        // If short on descriptors, reclaim completed ones and re-check
        if (unlikely(need > 0)) {
            nb_used = virtqueue_nused(vq);
            need = RTE_MIN(need, (int)nb_used);
            virtio_xmit_cleanup(vq, need);
            need = slots - vq->vq_free_cnt;
            if (unlikely(need > 0))
                break; // Still no room
        }

        // Enqueue to virtqueue
        virtqueue_enqueue_xmit(txvq, txm, slots, use_indirect, can_push, 0);
    }

    // 3. Single notification after batch
    if (likely(nb_tx)) {
        vq_update_avail_idx(vq);
        if (unlikely(virtqueue_kick_prepare(vq)))
            virtqueue_notify(vq);
    }

    return nb_tx;
}

Two VirtIO-specific optimizations stand out:

Header Push: Instead of using a separate descriptor for the VirtIO header, the driver checks if the mbuf’s headroom has enough space. If so, it writes the header directly into the headroom — saving one descriptor and one DMA operation.

Indirect Descriptors: For multi-segment packets (scatter/gather), instead of consuming N descriptors for N segments, the driver uses a single “indirect” descriptor that points to a separate table in memory containing all segment descriptors. This reduces ring space consumption and improves cache behavior.
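Both optimizations come down to how many main-ring descriptors a packet consumes. The sketch below states that rule as two small functions; the names are hypothetical, and the rule itself is my reading of the upstream virtio TX path rather than a quotation from it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Main-ring descriptors needed for one packet:
 *   indirect    -> 1 (the segments live in a side table)
 *   header push -> one per segment (header shares the first buffer)
 *   otherwise   -> one per segment plus one for the header */
static int tx_slots_needed(uint16_t nb_segs, bool can_push, bool use_indirect)
{
    if (use_indirect)
        return 1;
    return nb_segs + (can_push ? 0 : 1);
}

/* Shortfall against the free descriptor count. A positive result means
 * the caller must reclaim completed descriptors before enqueueing. */
static int tx_slots_short(int slots, uint16_t free_cnt)
{
    return slots - (int)free_cnt;
}
```

A three-segment packet costs four ring entries on the plain path but only one with indirect descriptors, which is where the ring-space saving comes from.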

The Virtqueue: Descriptor Ring

VirtIO’s split virtqueue, the original layout, is a shared-memory structure between the driver and the device made up of three areas (the packed ring introduced in VirtIO 1.1 merges them into a single descriptor ring):

┌──────────────────────────────────────┐
│  Available Ring                      │
│  - Descriptor indices ready to use   │
│  - Written by driver, read by device │
├──────────────────────────────────────┤
│  Used Ring                           │
│  - Descriptor indices completed      │
│  - Written by device, read by driver │
├──────────────────────────────────────┤
│  Descriptor Table                    │
│  - Address, length, flags, next      │
│  - Describes DMA buffers             │
└──────────────────────────────────────┘

For RX: the driver pre-fills descriptors with the addresses of empty mbufs. When the device (a hypervisor backend such as vhost, or a VirtIO-capable NIC) receives a packet, it writes the data into the mbuf and records the descriptor in the Used Ring. The driver dequeues it on its next rx_burst() call.

For TX: the driver fills descriptors with packet data addresses. The device reads the data, transmits it, and marks the descriptor as used. The driver reclaims the descriptor and frees the mbuf on a later tx_burst() call.
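The driver learns how much work is waiting by comparing two free-running 16-bit counters. The sketch below models that bookkeeping; toy_vq and its helpers are hypothetical, and a real driver additionally needs an acquire-ordered load of the device-written index before it reads the ring entries.

```c
#include <stdint.h>

/* Toy view of the driver side of a split ring. Both indices are
 * free-running 16-bit counters; they are reduced modulo the ring size
 * only when used to index the ring arrays. */
struct toy_vq {
    uint16_t size;           /* number of entries, a power of two */
    uint16_t used_idx;       /* written by the device */
    uint16_t used_cons_idx;  /* driver's read position */
};

/* Entries the device has completed that the driver has not consumed.
 * Unsigned 16-bit subtraction stays correct across wraparound. */
static uint16_t toy_nused(const struct toy_vq *vq)
{
    return (uint16_t)(vq->used_idx - vq->used_cons_idx);
}

/* Ring slot holding the next completed entry. */
static uint16_t toy_next_slot(const struct toy_vq *vq)
{
    return (uint16_t)(vq->used_cons_idx & (vq->size - 1));
}
```

Because the counters wrap at 65536 rather than at the ring size, a full ring and an empty ring are never confused: the difference is 0 when empty and equal to the ring size when full.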

Performance Optimization Techniques

1. SIMD Vectorization

DPDK uses SSE/AVX/AVX-512/NEON to process multiple descriptors in a single instruction:

// Runtime detection in virtio_ethdev.c (lines 2363-2400)
#if defined(RTE_ARCH_X86_64) && defined(CC_AVX512_SUPPORT)
    if (hw->use_vec_rx &&
        (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) ||
         !virtio_with_feature(hw, VIRTIO_F_IN_ORDER) ||
         rte_vect_get_max_simd_bitwidth() < RTE_VECT_SIMD_512)) {
        hw->use_vec_rx = 0; // Fall back to scalar
    }
#elif defined(RTE_ARCH_ARM)
    if (hw->use_vec_rx &&
        (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON) ||
         !virtio_with_feature(hw, VIRTIO_F_IN_ORDER))) {
        hw->use_vec_rx = 0;
    }
#endif

The vectorized paths (virtio_recv_pkts_vec for split rings, virtio_recv_pkts_packed_vec for packed rings) process 4–8 descriptors per loop iteration using SIMD loads and stores, which cuts the instruction count per packet considerably.

2. Huge Pages and TLB Efficiency

// DPDK allocates all memory from Huge Pages (2MB or 1GB)
rte_eal_init(argc, argv);  // Reserves Huge Pages at startup

// mbuf pool creation on the correct NUMA node
struct rte_mempool *mp =
    rte_pktmbuf_pool_create("mbuf_pool",
        NB_MBUF, MBUF_CACHE_SIZE, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, numa_node);

With 2 MB Huge Pages, the same amount of memory needs 512× fewer TLB entries than with 4 KB pages. For a 2 GB mbuf pool, that’s 1,024 pages to map instead of 524,288, which sharply reduces the TLB miss rate.
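The page counts above follow from a single division. A minimal sketch, with a hypothetical helper name:

```c
#include <stdint.h>

#define KiB 1024ull
#define MiB (1024ull * KiB)
#define GiB (1024ull * MiB)

/* Pages needed to map `bytes` of memory with a given page size,
 * rounding up to whole pages. */
static uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

For a 2 GiB pool, pages_needed() gives 524,288 pages at 4 KiB, 1,024 pages at 2 MiB, and just 2 pages at 1 GiB. Each page needs its own TLB entry, so fewer pages means the working set is far more likely to stay covered by the TLB.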

3. NUMA-Aware Memory Placement

// Query the NUMA node the NIC is attached to (-1 if it cannot be determined)
int numa_node = rte_eth_dev_socket_id(port);

// Allocate the mbuf pool on the same node
struct rte_mempool *mp =
    rte_pktmbuf_pool_create("mbuf_pool", NB_MBUF,
        MBUF_CACHE_SIZE, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE,
        numa_node);

A cross-NUMA memory access typically adds tens of nanoseconds (roughly 40–80 ns on common x86 servers). Always allocate mbuf pools on the same NUMA node as the NIC.

4. Per-Core Mempool Cache

struct rte_mempool *mp =
    rte_pktmbuf_pool_create("mbuf_pool",
        NB_MBUF,         // Total mbufs
        MBUF_CACHE_SIZE, // Per-core cache (e.g., 256)
        0,
        RTE_MBUF_DEFAULT_BUF_SIZE,
        socket_id);

Each core has a local cache sized for 256 mbufs. Allocation and deallocation hit this cache first, with no synchronization against other cores. Only when the cache runs empty (or overflows on free) does the core go to the shared pool ring, and it does so in bulk.
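The effect of the per-core cache can be shown with a toy allocator that counts how often the shared pool is touched. Everything here is a simplified stand-in: toy_pool is a plain stack rather than a multi-producer ring, and the sizes are arbitrary.

```c
#include <stdint.h>

#define POOL_SIZE   1024
#define CACHE_SIZE  256

/* Shared pool: a plain stack here; the real one is a ring shared by cores. */
struct toy_pool {
    int      objs[POOL_SIZE];
    uint32_t count;
    uint32_t shared_ops;   /* how many times the shared pool was touched */
};

/* Per-core cache sitting in front of the shared pool. */
struct toy_cache {
    int      objs[CACHE_SIZE];
    uint32_t len;
};

static void toy_pool_init(struct toy_pool *p)
{
    for (uint32_t i = 0; i < POOL_SIZE; i++)
        p->objs[i] = (int)i;
    p->count = POOL_SIZE;
    p->shared_ops = 0;
}

/* Get one object: serve it from the local cache, refilling the cache in
 * bulk from the shared pool only when it is empty.
 * Returns -1 if both the cache and the pool are empty. */
static int toy_get(struct toy_pool *p, struct toy_cache *c)
{
    if (c->len == 0) {
        uint32_t n = p->count < CACHE_SIZE ? p->count : CACHE_SIZE;

        if (n == 0)
            return -1;
        p->shared_ops++;
        while (n-- > 0)
            c->objs[c->len++] = p->objs[--p->count];
    }
    return c->objs[--c->len];
}
```

Allocating 512 objects one at a time touches the shared pool only twice, once per 256-object refill. In a real mempool each of those touches is where cross-core synchronization would happen.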

5. Lock-Free Multi-Queue Processing

┌─────────────────────────────────────────┐
│           Physical NIC                   │
│     4 RX queues + 4 TX queues           │
└──────────────────┬──────────────────────┘
                   │ PCIe
    ┌──────────┬───┴───┬──────────┐
    │          │       │          │
 Core 0     Core 1  Core 2    Core 3
 Queue 0   Queue 1  Queue 2   Queue 3
    │          │       │          │
    └──────────┴───────┴──────────┘
              No lock contention

Each core processes exactly one RX queue and one TX queue, so the data path needs no mutexes, no atomic operations, and shares no queue state between cores. This is why RSS (Receive Side Scaling) is critical: the NIC hashes header fields (typically the 5-tuple) to pick a queue, so every packet of a flow lands on the same core.
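Flow affinity follows directly from hashing a key that is identical for every packet of a flow. The sketch below uses FNV-1a purely as a stand-in: NICs typically compute a Toeplitz hash with a configurable key and then look the result up in an indirection table, neither of which is modeled here. All names are hypothetical.

```c
#include <stdint.h>

/* Flow key: the classic 5-tuple. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* One FNV-1a step over a single byte (stand-in hash, not Toeplitz). */
static uint32_t fnv1a_byte(uint32_t h, uint8_t b)
{
    return (h ^ b) * 16777619u;
}

/* Hash the key field by field so struct padding never leaks in. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;

    for (int i = 0; i < 4; i++)
        h = fnv1a_byte(h, (uint8_t)(k->src_ip >> (8 * i)));
    for (int i = 0; i < 4; i++)
        h = fnv1a_byte(h, (uint8_t)(k->dst_ip >> (8 * i)));
    for (int i = 0; i < 2; i++)
        h = fnv1a_byte(h, (uint8_t)(k->src_port >> (8 * i)));
    for (int i = 0; i < 2; i++)
        h = fnv1a_byte(h, (uint8_t)(k->dst_port >> (8 * i)));
    return fnv1a_byte(h, k->proto);
}

/* Map a flow to one of nb_queues RX queues. Every packet of a flow
 * carries the same key, so it always lands on the same queue and core. */
static uint16_t flow_to_queue(const struct flow_key *k, uint16_t nb_queues)
{
    return (uint16_t)(flow_hash(k) % nb_queues);
}
```

Because the mapping is a pure function of the key, per-flow state such as connection tables or sequence tracking can live in core-local memory without any locking.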

Initialization Flow

1. EAL Init
   └─> rte_eal_init()
       ├─> Enumerate PCI devices
       ├─> Reserve Huge Pages
       └─> Set CPU affinity

2. PMD Probe
   └─> rte_bus_probe() (called from rte_eal_init())
       └─> Call driver .probe()
           ├─> rte_eth_dev_allocate()
           ├─> Fill eth_dev_ops
           └─> Set rx_pkt_burst / tx_pkt_burst

3. Device Configuration
   └─> rte_eth_dev_configure()
       └─> dev_ops->dev_configure()

4. Queue Setup
   ├─> rte_eth_rx_queue_setup()
   │   └─> dev_ops->rx_queue_setup()
   └─> rte_eth_tx_queue_setup()
       └─> dev_ops->tx_queue_setup()

5. Device Start
   └─> rte_eth_dev_start()
       └─> dev_ops->dev_start()
           └─> Finalize rx_pkt_burst / tx_pkt_burst selection

Steps 1–5 happen once at startup. After that, the application enters the tight polling loop, and the data path itself never calls into the slow path.

Practical Example: L2 Forwarding

A minimal DPDK application:

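#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#include <rte_branch_prediction.h>
#include <rte_debug.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32
#define RX_RING_SIZE 1024
#define TX_RING_SIZE 1024
#define NUM_MBUFS 8191
#define MBUF_CACHE_SIZE 250

static volatile bool force_quit;

static void
signal_handler(int signum)
{
    (void)signum;
    force_quit = true;
}

// Worker: receive on queue 0 and send the same packets back out
static int
lcore_main(void *arg)
{
    uint16_t port_id = (uint16_t)(uintptr_t)arg;
    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t nb_rx, nb_tx;

    while (!force_quit) {
        nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
        if (unlikely(nb_rx == 0))
            continue;

        nb_tx = rte_eth_tx_burst(port_id, 0, pkts, nb_rx);

        // Free unsent packets
        if (unlikely(nb_tx < nb_rx)) {
            for (uint16_t buf = nb_tx; buf < nb_rx; buf++)
                rte_pktmbuf_free(pkts[buf]);
        }
    }
    return 0;
}

int main(int argc, char *argv[])
{
    uint16_t port_id = 0;
    unsigned int lcore_id;

    // 1. Initialize EAL
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    signal(SIGINT, signal_handler);
    signal(SIGTERM, signal_handler);

    // 2. Create mbuf pool
    struct rte_mempool *mbuf_pool =
        rte_pktmbuf_pool_create("MBUF_POOL",
            NUM_MBUFS, MBUF_CACHE_SIZE, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (mbuf_pool == NULL)
        rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");

    // 3. Configure and start the port (one RX queue, one TX queue)
    struct rte_eth_conf port_conf = {0};

    if (rte_eth_dev_configure(port_id, 1, 1, &port_conf) < 0 ||
        rte_eth_rx_queue_setup(port_id, 0, RX_RING_SIZE,
            rte_eth_dev_socket_id(port_id), NULL, mbuf_pool) < 0 ||
        rte_eth_tx_queue_setup(port_id, 0, TX_RING_SIZE,
            rte_eth_dev_socket_id(port_id), NULL) < 0 ||
        rte_eth_dev_start(port_id) < 0)
        rte_exit(EXIT_FAILURE, "Cannot set up port %u\n",
            (unsigned int)port_id);
    rte_eth_promiscuous_enable(port_id);

    // 4. Launch one worker: a queue must only be polled by one lcore
    lcore_id = rte_get_next_lcore(-1, 1, 0);
    if (lcore_id == RTE_MAX_LCORE)
        rte_exit(EXIT_FAILURE, "Need at least one worker lcore\n");
    rte_eal_remote_launch(lcore_main, (void *)(uintptr_t)port_id, lcore_id);

    // 5. Wait for the worker to exit (after SIGINT/SIGTERM), then clean up
    rte_eal_mp_wait_lcore();
    rte_eth_dev_stop(port_id);
    rte_eth_dev_close(port_id);
    rte_eal_cleanup();

    return 0;
}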

Source Code Reading Guide

Priority	File	What You’ll Learn
1	`lib/ethdev/rte_ethdev.h`	Public API documentation
2	`lib/ethdev/ethdev_driver.h`	PMD interface, `rte_eth_dev` structure
3	`lib/mbuf/rte_mbuf_core.h`	`rte_mbuf` definition, cache-line layout
4	`drivers/net/virtio/virtio_rxtx.c`	RX/TX burst implementation
5	`drivers/net/virtio/virtio_ethdev.c`	Runtime path selection
6	`examples/l2fwd/main.c`	Minimal real application

Start with VirtIO — it’s the simplest PMD. Once you understand the descriptor ring and burst API, move to hardware PMDs like ixgbe (Intel 82599) or ice (Intel E810) to see how physical NIC register access works.

Key Takeaways

Polling trades CPU for latency: dedicate a core, get sub-microsecond response
Burst API amortizes overhead: process packets in batches, typically 32 or 64
Function pointers at offset 0: the fast path costs one indirect call
Runtime path selection: one binary supports scalar, SIMD, packed, and inorder paths
Huge Pages + NUMA + per-core caches: memory placement largely determines performance
Lock-free by design: one queue per core, no shared state on the hot path

Understanding PMD internals is essential for anyone building low-latency trading systems on DPDK. The same principles (cache alignment, batch processing, zero-copy DMA, runtime dispatch) apply across PMD implementations and carry over to other kernel-bypass interfaces such as Solarflare’s ef_vi.