Python Multiprocessing: Zero-Copy Data Transfer with shared_memory and Protocol 5
This is Part 2 of a two-part series. Part 1 covers the internals of PEP 574 and the PickleBuffer API.
Python multiprocessing sends data between processes by pickling it. For a small list of integers, that’s fine. For an 800 MB DataFrame, standard serialization takes around 12 seconds — before any computation starts.
There are two tools for eliminating this overhead: Pickle Protocol 5 and multiprocessing.shared_memory. They solve different problems.
Why Pickle Is Slow for Large Data
multiprocessing serializes arguments with pickle and sends the byte stream through a pipe or socket. For a NumPy array, the old path looks like:
__reduce_ex__allocates abytescopy of the array datapickle.dumpscopies that into the pickle stream- The stream is written to the pipe
- The worker reads the pipe into a buffer
pickle.loadscopies the buffer into a new array
Three to four copies for one array. CPU time and peak memory both scale with the array size.
Tool 1: Pickle Protocol 5 (PEP 574)
Protocol 5 separates large binary payloads from the metadata stream. The data is passed as a PickleBuffer — a zero-copy view — rather than embedded in the pickle bytes.
import pickle
import numpy as np
a = np.zeros((10_000, 10_000)) # 800 MB
buffers = []
data = pickle.dumps(a, protocol=5, buffer_callback=buffers.append)
# `buffers` holds a memoryview of a's memory — no copy yet
b = pickle.loads(data, buffers=buffers)
# b shares a's memory — still no copy
In a multiprocessing context, the worker allocates a destination buffer (in shared memory or otherwise) and passes it as buffers during deserialization. The array data lands directly in the target, bypassing intermediate copies.
NumPy and Apache Arrow already implement __reduce_ex__ with PickleBuffer support. You get this for free when using protocol=5.
When to use: one-shot transfer of a large object to a worker that will not write back to the caller.
Tool 2: multiprocessing.shared_memory
shared_memory maps the same physical memory pages into multiple processes. There is no pickle, no pipe, no copy — the worker reads and writes the array directly.
import numpy as np
from multiprocessing import shared_memory, Process
# main process: allocate and populate
arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shared_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared_arr[:] = arr[:]
def worker(name, shape, dtype):
shm = shared_memory.SharedMemory(name=name)
work_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
work_arr[0] += 1 # writes visible to main process immediately
shm.close()
p = Process(target=worker, args=(shm.name, arr.shape, arr.dtype))
p.start()
p.join()
print(shared_arr[0]) # 2.0 — worker's write is visible here
shm.close()
shm.unlink() # must be called exactly once to free the OS resource
The only data crossing the process boundary is shm.name — a short string. The array itself never moves.
When to use: a large dataset that multiple workers read or update repeatedly — a shared feature matrix, a rolling price buffer, an inference model loaded once and queried many times.
Comparison
shared_memory | Pickle Protocol 5 | |
|---|---|---|
| Copy overhead | None — direct memory map | Near-zero — buffer view, minimal metadata |
| Programming model | Manual: allocate, attach, sync, unlink | Transparent: library support in NumPy / Arrow |
| Write-back | Workers can write; main process sees changes | One-way transfer; no write-back |
| Concurrency | Requires explicit Lock / Semaphore | Not applicable — separate copies per worker |
| Risk | Resource leak if unlink() not called; data races if unsynchronized | Version dependency (needs Protocol 5 environment) |
| Best fit | Persistent shared state, repeated access | Large one-shot argument to a worker |
Avoiding Resource Leaks with shared_memory
shm.unlink() deletes the OS-level shared memory segment. If the creating process crashes before calling it, the segment persists until reboot (on Linux, it appears in /dev/shm).
Use a try/finally to guarantee cleanup:
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
try:
# ... use shm ...
finally:
shm.close()
shm.unlink()
Workers that attach to an existing segment should call shm.close() but not shm.unlink() — only the creator unlinks.
Synchronization in Shared Memory
Multiple processes writing the same shared array concurrently will corrupt data without explicit synchronization:
from multiprocessing import Lock
lock = Lock()
def worker(name, shape, dtype, lock):
shm = shared_memory.SharedMemory(name=name)
arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
with lock:
arr[0] += 1
shm.close()
If workers partition the array into non-overlapping regions, no locking is needed — each worker owns its slice.
Practical Decision Guide
| Scenario | Recommendation |
|---|---|
| 800 MB DataFrame → worker, result returned | Protocol 5 (one-way, no write-back needed) |
| Model weights loaded once, many workers query | shared_memory (read-only, no lock needed) |
| Rolling price buffer updated by producer, read by consumers | shared_memory + Lock on write |
| Small objects (< a few MB) | Standard pickle — overhead is negligible |
| Cross-machine | Neither — use Arrow IPC, gRPC, or Ray’s object store |
Third-Party Libraries
| Library | What it adds |
|---|---|
joblib | Transparent shared_memory for NumPy arrays in Parallel |
ray | Global object store with plasma shared memory, zero-copy across processes and machines |
dask | Distributed scheduler with lazy computation, spills to disk |
npshmex | Thin wrapper around ProcessPoolExecutor + shared_memory for NumPy |
Ray is worth calling out specifically: it uses Plasma (Apache Arrow’s shared memory store) so that objects transferred between local workers never copy at all, and remote transfers use Protocol 5-style out-of-band buffers.
Summary
- Standard pickle copies large arrays 3–4 times. Both tools here eliminate most or all of that.
- Protocol 5 is the low-friction option: upgrade NumPy + set
protocol=5, get near-zero-copy transfers for free. Best for one-shot worker arguments. shared_memoryis zero-copy but requires manual lifecycle management and explicit synchronization for concurrent writes. Best for persistent shared state across many worker invocations.- For most scientific computing workloads,
joblibor Ray handles the complexity and you interact with a clean API.
References
- PEP 574 – Pickle protocol 5 with out-of-band data
- Python docs:
multiprocessing.shared_memory - Python docs: pickle protocol versions