yuqi-zheng

POSIX Sockets API: Tips and Pitfalls


The POSIX sockets API (also called Berkeley or BSD sockets) is the lowest-level networking interface available on Linux. It is also riddled with subtle failure modes. Unless you have a compelling reason to work directly at this layer, a cross-platform library such as libuv, libevent, or ASIO handles most of these issues for you.

This article documents the pitfalls you will encounter if you do write against the raw API, and the correct way to handle each one.


Suppressing SIGPIPE

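Writing to a TCP socket after the peer has closed or reset the connection causes the kernel to deliver SIGPIPE to the process, and the default disposition for SIGPIPE is to terminate it. Installing a handler is a poor fix: signal dispositions are shared by every thread in the process, so a library cannot change how SIGPIPE is handled without affecting, and possibly racing with, the rest of the program.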

Linux and most BSDs (OpenBSD, FreeBSD, NetBSD)

Pass MSG_NOSIGNAL on every send call:

ssize_t n = send(fd, msg, len, MSG_NOSIGNAL);

The call returns -1 with errno = EPIPE instead of raising the signal.

macOS

MSG_NOSIGNAL is not available on macOS. Use the SO_NOSIGPIPE socket option instead:

int opt = 1;
if (setsockopt(fd, SOL_SOCKET, SO_NOSIGPIPE, &opt, sizeof(opt)) == -1) {
    perror("setsockopt");
    return -1;
}

Portable fallback

Ignore SIGPIPE process-wide:

signal(SIGPIPE, SIG_IGN);

This works everywhere but is blunt: it suppresses SIGPIPE for every file descriptor in the process, including pipes to child processes.


Handling EINTR

Any blocking system call can be interrupted by a signal and return -1 with errno = EINTR. Whether the call restarts automatically depends on whether SA_RESTART was set when the signal handler was installed, and on the specific system call. epoll_wait on Linux always returns EINTR regardless of SA_RESTART.

Always check for and retry on EINTR:

ssize_t n;
do {
    n = recv(fd, buf, len, 0);
} while (n == -1 && errno == EINTR);  // retry if a signal interrupted the call
if (n == -1) {
    perror("recv");
    return -1;
}

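The same pattern applies to send, accept, epoll_wait, and other blocking calls. connect is the exception: when a blocking connect fails with EINTR, POSIX specifies that the connection attempt continues in the background. Instead of calling connect again, wait for the socket to become writable and read SO_ERROR, exactly as in the non-blocking case described later in this article.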
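Wrapping the retry in helper functions keeps call sites readable. The sketch below is one possible shape, with hypothetical names recv_retry and send_all; send_all also continues after short writes and assumes a blocking socket.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: recv() that retries when interrupted by a signal. */
ssize_t recv_retry(int fd, void *buf, size_t len, int flags)
{
    ssize_t n;
    do {
        n = recv(fd, buf, len, flags);
    } while (n == -1 && errno == EINTR);
    return n;
}

/* Hypothetical helper: send the whole buffer on a blocking socket,
 * retrying on EINTR and continuing after short writes.
 * Returns 0 on success, -1 with errno set on failure. */
int send_all(int fd, const void *buf, size_t len, int flags)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, flags);
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted before sending anything */
            return -1;
        }
        p += n;                     /* advance past the bytes that went out */
        len -= (size_t)n;
    }
    return 0;
}
```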


Asynchronous hostname resolution

getaddrinfo is the POSIX hostname resolution function. It is synchronous and can block for seconds waiting for DNS responses. It cannot be used directly in an event loop.

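If your code only needs to handle IP address literals, pass AI_NUMERICHOST. The call then parses the string locally and never contacts a DNS server, so it does not block on the network:

struct addrinfo hints = { .ai_flags = AI_NUMERICHOST }, *res = NULL;
int rc = getaddrinfo("192.0.2.1", "80", &hints, &res);  // returns 0 or an EAI_* code
if (rc != 0) { fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc)); return -1; }
// ... use res, then release it with freeaddrinfo(res)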
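A fuller version of the numeric-only lookup might look like the sketch below. The function name parse_numeric_addr is hypothetical, and adding AI_NUMERICSERV (which also rules out a service-name lookup for the port) is an assumption beyond what the text above requires.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: parse a numeric host and port without any lookup.
 * Returns 0 on success, or a non-zero EAI_* code (for example when host
 * is a name rather than an address literal). */
int parse_numeric_addr(const char *host, const char *port,
                       struct sockaddr_storage *out, socklen_t *outlen)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* accept IPv4 or IPv6 literals */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;

    int rc = getaddrinfo(host, port, &hints, &res);
    if (rc != 0)
        return rc;                      /* caller can pass rc to gai_strerror */

    memcpy(out, res->ai_addr, res->ai_addrlen);
    *outlen = res->ai_addrlen;
    freeaddrinfo(res);
    return 0;
}
```

Because a non-literal host fails immediately instead of triggering a query, the same call can be used to decide whether a string needs to go to an asynchronous resolver at all.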

For actual DNS resolution in an async context, use a purpose-built library:

  • c-ares: widely used, integrates with any event loop
  • getdns: DNSSEC-aware, higher-level API

Avoid getaddrinfo_a (glibc’s asynchronous variant): it uses threads and signals internally, which makes it difficult to integrate with epoll.


Non-blocking connect

Step 1: create a non-blocking socket

int fd = socket(res->ai_family,
                res->ai_socktype | SOCK_NONBLOCK | SOCK_CLOEXEC,
                res->ai_protocol);
if (fd == -1) { perror("socket"); return -1; }

SOCK_NONBLOCK sets the socket non-blocking at creation, avoiding a subsequent fcntl call. SOCK_CLOEXEC prevents the fd from leaking into child processes after fork/exec.

Step 2: call connect

int rc = connect(fd, res->ai_addr, res->ai_addrlen);
if (rc == -1 && errno != EINPROGRESS && errno != EINTR) {
    perror("connect");
    close(fd);
    return -1;
}
// rc == 0: connected immediately (possible on loopback)
// errno == EINPROGRESS: connection in progress
// errno == EINTR: the attempt continues in the background; treat it like
// EINPROGRESS rather than calling connect again

Step 3: detect completion

When the event loop reports the fd as writable (EPOLLOUT) or as having an error (EPOLLERR, EPOLLHUP), check the outcome:

int err;
socklen_t errlen = sizeof(err);
if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) == -1) {
    perror("getsockopt"); return -1;
}
if (err != 0) {
    errno = err;
    perror("connect"); return -1;
}
// connection established

Non-blocking accept

Setup

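int lfd = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0);
if (lfd == -1) { perror("socket"); return -1; }
// addr is a struct sockaddr_in6 already filled in with the listen address and port
if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) == -1) { perror("bind"); close(lfd); return -1; }
if (listen(lfd, 128) == -1) { perror("listen"); close(lfd); return -1; }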

Accept loop

When epoll reports the listening socket as readable, drain it completely:

for (;;) {
    int fd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
    if (fd == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) break;
        if (errno == EINTR || errno == ECONNABORTED) continue;
        perror("accept4"); return -1;
    }
    // handle fd
}

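accept4 is a Linux extension (since adopted by several BSDs) that sets flags on the accepted socket atomically, closing the window between accept and fcntl in which another thread could fork and leak the descriptor. ECONNABORTED occurs when the client resets the connection before accept returns; it is safe to ignore. On Linux, accept can also report network errors already pending on the new connection, such as ENETDOWN, EPROTO, or EHOSTUNREACH; the accept(2) manual page recommends treating these like EAGAIN rather than as fatal.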


Nagle algorithm and delayed ACK interaction

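TCP’s Nagle algorithm holds back small writes while earlier data is still unacknowledged, coalescing them into fewer packets. TCP’s delayed ACK mechanism postpones acknowledging received data (typically for about 40 ms on Linux, and longer on some other systems) in case the application has a response to piggyback on. The two interact badly in request-response protocols when the sender issues a request as two small writes: the second write waits in the send buffer until the first is acknowledged, and the receiver delays that acknowledgment because it has no response to send yet. Each such request stalls for the full delayed-ACK interval.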

Disable Nagle (TCP_NODELAY)

For RPC-style or low-latency protocols:

int opt = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));

Caution: if your application calls write in a tight loop with single-byte payloads, disabling Nagle will flood the network with tiny packets. Assemble each message in user space and hand it to the kernel in a single call with writev or sendmsg instead.

Disable delayed ACK (TCP_QUICKACK)

For clients that send requests and receive large responses:

int opt = 1;
setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &opt, sizeof(opt));

TCP_QUICKACK is a Linux extension and is not sticky: the kernel may switch the socket back to delayed ACKs as the connection proceeds. Set it again after each recv call if consistent behavior is required.
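Re-arming the option after every receive is easiest to get right in a wrapper. The following is a minimal Linux-only sketch; the names set_quickack and recv_quickack are hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: request immediate ACKs on a TCP socket (Linux only). */
int set_quickack(int fd)
{
    int opt = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &opt, sizeof(opt));
}

/* Hypothetical helper: recv() followed by re-arming TCP_QUICKACK, since the
 * kernel may have switched the socket back to delayed ACKs. */
ssize_t recv_quickack(int fd, void *buf, size_t len, int flags)
{
    ssize_t n = recv(fd, buf, len, flags);
    if (n > 0)
        (void)set_quickack(fd);     /* best effort; failure is not fatal */
    return n;
}
```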


Achieving low latency

Kernel bypass

For the lowest possible latency (single-digit microseconds), bypass the kernel network stack entirely:

  • DPDK: general-purpose kernel bypass
  • OpenOnload (Solarflare): transparent acceleration with an LD_PRELOAD shim
  • Mellanox VMA: similar to OpenOnload for Mellanox/NVIDIA NICs
  • Exablaze: FPGA-based, sub-microsecond latency

Busy polling (kernel path)

Linux 3.11 added SO_BUSY_POLL, which polls the NIC driver from within the socket call instead of waiting for an interrupt:

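int usecs = 10000;  // maximum time to busy-poll per call, in microseconds
if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) == -1)
    perror("setsockopt");  // EPERM: raising the value requires CAP_NET_ADMIN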

User-space busy polling

For the lowest kernel-path latency without kernel bypass, spin in user space:

ssize_t n;
do {
    n = recv(fd, buf, len, MSG_DONTWAIT);  // returns immediately if no data is queued
} while (n == -1 &&
         (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK));
if (n == -1) { perror("recv"); return -1; }
// n == 0 means the peer closed the connection

This burns a CPU core but eliminates interrupt and scheduler latency.

Disable interrupt coalescing

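Most NICs and their drivers coalesce interrupts by default to improve throughput, which delays the delivery of individual packets. For latency-sensitive applications, disable coalescing (the parameters a given driver accepts vary):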

ethtool -C eth0 adaptive-rx off rx-usecs 0 rx-frames 0

Packet timestamping

Linux supports hardware and software timestamps on received and transmitted packets via SO_TIMESTAMPING. Applications can use these to measure one-way latency (with a synchronized reference clock), detect scheduling jitter, or audit network paths. See the Linux kernel documentation on network timestamping for the full API.