yuqi-zheng

Python Multiprocessing: Zero-Copy Data Transfer with shared_memory and Protocol 5


This is Part 2 of a two-part series. Part 1 covers the internals of PEP 574 and the PickleBuffer API.


Python multiprocessing sends data between processes by pickling it. For a small list of integers, that’s fine. For an 800 MB DataFrame, standard serialization takes around 12 seconds — before any computation starts.

There are two tools for eliminating this overhead: Pickle Protocol 5 and multiprocessing.shared_memory. They solve different problems.


Why Pickle Is Slow for Large Data

multiprocessing serializes arguments with pickle and sends the byte stream through a pipe or socket. For a NumPy array, the old path looks like:

  1. __reduce_ex__ allocates a bytes copy of the array data
  2. pickle.dumps copies that into the pickle stream
  3. The stream is written to the pipe
  4. The worker reads the pipe into a buffer
  5. pickle.loads copies the buffer into a new array

Three to four copies for one array. CPU time and peak memory both scale with the array size.


Tool 1: Pickle Protocol 5 (PEP 574)

Protocol 5 separates large binary payloads from the metadata stream. The data is passed as a PickleBuffer — a zero-copy view — rather than embedded in the pickle bytes.

import pickle
import numpy as np

a = np.zeros((10_000, 10_000))  # 800 MB

buffers = []
data = pickle.dumps(a, protocol=5, buffer_callback=buffers.append)
# `buffers` holds a memoryview of a's memory — no copy yet

b = pickle.loads(data, buffers=buffers)
# b shares a's memory — still no copy

In a multiprocessing context, the worker allocates a destination buffer (in shared memory or otherwise) and passes it as buffers during deserialization. The array data lands directly in the target, bypassing intermediate copies.

NumPy and Apache Arrow already implement __reduce_ex__ with PickleBuffer support. You get this for free when using protocol=5.

When to use: one-shot transfer of a large object to a worker that will not write back to the caller.


Tool 2: multiprocessing.shared_memory

shared_memory maps the same physical memory pages into multiple processes. There is no pickle, no pipe, no copy — the worker reads and writes the array directly.

import numpy as np
from multiprocessing import shared_memory, Process

# main process: allocate and populate
arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shared_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared_arr[:] = arr[:]

def worker(name, shape, dtype):
    shm = shared_memory.SharedMemory(name=name)
    work_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    work_arr[0] += 1          # writes visible to main process immediately
    shm.close()

p = Process(target=worker, args=(shm.name, arr.shape, arr.dtype))
p.start()
p.join()

print(shared_arr[0])  # 2.0 — worker's write is visible here

shm.close()
shm.unlink()  # must be called exactly once to free the OS resource

The only data crossing the process boundary is shm.name — a short string. The array itself never moves.

When to use: a large dataset that multiple workers read or update repeatedly — a shared feature matrix, a rolling price buffer, an inference model loaded once and queried many times.


Comparison

shared_memoryPickle Protocol 5
Copy overheadNone — direct memory mapNear-zero — buffer view, minimal metadata
Programming modelManual: allocate, attach, sync, unlinkTransparent: library support in NumPy / Arrow
Write-backWorkers can write; main process sees changesOne-way transfer; no write-back
ConcurrencyRequires explicit Lock / SemaphoreNot applicable — separate copies per worker
RiskResource leak if unlink() not called; data races if unsynchronizedVersion dependency (needs Protocol 5 environment)
Best fitPersistent shared state, repeated accessLarge one-shot argument to a worker

Avoiding Resource Leaks with shared_memory

shm.unlink() deletes the OS-level shared memory segment. If the creating process crashes before calling it, the segment persists until reboot (on Linux, it appears in /dev/shm).

Use a try/finally to guarantee cleanup:

shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
try:
    # ... use shm ...
finally:
    shm.close()
    shm.unlink()

Workers that attach to an existing segment should call shm.close() but not shm.unlink() — only the creator unlinks.


Synchronization in Shared Memory

Multiple processes writing the same shared array concurrently will corrupt data without explicit synchronization:

from multiprocessing import Lock

lock = Lock()

def worker(name, shape, dtype, lock):
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    with lock:
        arr[0] += 1
    shm.close()

If workers partition the array into non-overlapping regions, no locking is needed — each worker owns its slice.


Practical Decision Guide

ScenarioRecommendation
800 MB DataFrame → worker, result returnedProtocol 5 (one-way, no write-back needed)
Model weights loaded once, many workers queryshared_memory (read-only, no lock needed)
Rolling price buffer updated by producer, read by consumersshared_memory + Lock on write
Small objects (< a few MB)Standard pickle — overhead is negligible
Cross-machineNeither — use Arrow IPC, gRPC, or Ray’s object store

Third-Party Libraries

LibraryWhat it adds
joblibTransparent shared_memory for NumPy arrays in Parallel
rayGlobal object store with plasma shared memory, zero-copy across processes and machines
daskDistributed scheduler with lazy computation, spills to disk
npshmexThin wrapper around ProcessPoolExecutor + shared_memory for NumPy

Ray is worth calling out specifically: it uses Plasma (Apache Arrow’s shared memory store) so that objects transferred between local workers never copy at all, and remote transfers use Protocol 5-style out-of-band buffers.


Summary

  • Standard pickle copies large arrays 3–4 times. Both tools here eliminate most or all of that.
  • Protocol 5 is the low-friction option: upgrade NumPy + set protocol=5, get near-zero-copy transfers for free. Best for one-shot worker arguments.
  • shared_memory is zero-copy but requires manual lifecycle management and explicit synchronization for concurrent writes. Best for persistent shared state across many worker invocations.
  • For most scientific computing workloads, joblib or Ray handles the complexity and you interact with a clean API.

References