Python Multiprocessing: Zero-Copy Data Transfer with shared

This is Part 2 of a two-part series. Part 1 covers the internals of PEP 574 and the PickleBuffer API.

Python multiprocessing sends data between processes by pickling it. For a small list of integers, that’s fine. For an 800 MB DataFrame, standard serialization takes around 12 seconds — before any computation starts.

There are two tools for eliminating this overhead: Pickle Protocol 5 and multiprocessing.shared_memory. They solve different problems.

Why Pickle Is Slow for Large Data

multiprocessing serializes arguments with pickle and sends the byte stream through a pipe or socket. For a NumPy array, the old path looks like:

__reduce_ex__ allocates a bytes copy of the array data
pickle.dumps copies that into the pickle stream
The stream is written to the pipe
The worker reads the pipe into a buffer
pickle.loads copies the buffer into a new array

Three to four copies for one array. CPU time and peak memory both scale with the array size.

Tool 1: Pickle Protocol 5 (PEP 574)

Protocol 5 separates large binary payloads from the metadata stream. The data is passed as a PickleBuffer — a zero-copy view — rather than embedded in the pickle bytes.

import pickle
import numpy as np

a = np.zeros((10_000, 10_000))  # 800 MB

buffers = []
data = pickle.dumps(a, protocol=5, buffer_callback=buffers.append)
# `buffers` holds a memoryview of a's memory — no copy yet

b = pickle.loads(data, buffers=buffers)
# b shares a's memory — still no copy

In a multiprocessing context, the worker allocates a destination buffer (in shared memory or otherwise) and passes it as buffers during deserialization. The array data lands directly in the target, bypassing intermediate copies.

NumPy and Apache Arrow already implement __reduce_ex__ with PickleBuffer support. You get this for free when using protocol=5.

When to use: one-shot transfer of a large object to a worker that will not write back to the caller.

Tool 2: `multiprocessing.shared_memory`

shared_memory maps the same physical memory pages into multiple processes. There is no pickle, no pipe, no copy — the worker reads and writes the array directly.

import numpy as np
from multiprocessing import shared_memory, Process

# main process: allocate and populate
arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shared_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared_arr[:] = arr[:]

def worker(name, shape, dtype):
    shm = shared_memory.SharedMemory(name=name)
    work_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    work_arr[0] += 1          # writes visible to main process immediately
    shm.close()

p = Process(target=worker, args=(shm.name, arr.shape, arr.dtype))
p.start()
p.join()

print(shared_arr[0])  # 2.0 — worker's write is visible here

shm.close()
shm.unlink()  # must be called exactly once to free the OS resource

The only data crossing the process boundary is shm.name — a short string. The array itself never moves.

When to use: a large dataset that multiple workers read or update repeatedly — a shared feature matrix, a rolling price buffer, an inference model loaded once and queried many times.

Comparison

	`shared_memory`	Pickle Protocol 5
Copy overhead	None — direct memory map	Near-zero — buffer view, minimal metadata
Programming model	Manual: allocate, attach, sync, unlink	Transparent: library support in NumPy / Arrow
Write-back	Workers can write; main process sees changes	One-way transfer; no write-back
Concurrency	Requires explicit `Lock` / `Semaphore`	Not applicable — separate copies per worker
Risk	Resource leak if `unlink()` not called; data races if unsynchronized	Version dependency (needs Protocol 5 environment)
Best fit	Persistent shared state, repeated access	Large one-shot argument to a worker

Avoiding Resource Leaks with `shared_memory`

shm.unlink() deletes the OS-level shared memory segment. If the creating process crashes before calling it, the segment persists until reboot (on Linux, it appears in /dev/shm).

Use a try/finally to guarantee cleanup:

shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
try:
    # ... use shm ...
finally:
    shm.close()
    shm.unlink()

Workers that attach to an existing segment should call shm.close() but not shm.unlink() — only the creator unlinks.

Synchronization in Shared Memory

Multiple processes writing the same shared array concurrently will corrupt data without explicit synchronization:

from multiprocessing import Lock

lock = Lock()

def worker(name, shape, dtype, lock):
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    with lock:
        arr[0] += 1
    shm.close()

If workers partition the array into non-overlapping regions, no locking is needed — each worker owns its slice.

Practical Decision Guide

Scenario	Recommendation
800 MB DataFrame → worker, result returned	Protocol 5 (one-way, no write-back needed)
Model weights loaded once, many workers query	`shared_memory` (read-only, no lock needed)
Rolling price buffer updated by producer, read by consumers	`shared_memory` + `Lock` on write
Small objects (< a few MB)	Standard pickle — overhead is negligible
Cross-machine	Neither — use Arrow IPC, gRPC, or Ray’s object store

Third-Party Libraries

Library	What it adds
`joblib`	Transparent `shared_memory` for NumPy arrays in `Parallel`
`ray`	Global object store with plasma shared memory, zero-copy across processes and machines
`dask`	Distributed scheduler with lazy computation, spills to disk
`npshmex`	Thin wrapper around `ProcessPoolExecutor` + `shared_memory` for NumPy

Ray is worth calling out specifically: it uses Plasma (Apache Arrow’s shared memory store) so that objects transferred between local workers never copy at all, and remote transfers use Protocol 5-style out-of-band buffers.

Summary

Standard pickle copies large arrays 3–4 times. Both tools here eliminate most or all of that.
Protocol 5 is the low-friction option: upgrade NumPy + set protocol=5, get near-zero-copy transfers for free. Best for one-shot worker arguments.
shared_memory is zero-copy but requires manual lifecycle management and explicit synchronization for concurrent writes. Best for persistent shared state across many worker invocations.
For most scientific computing workloads, joblib or Ray handles the complexity and you interact with a clean API.

References

PEP 574 – Pickle protocol 5 with out-of-band data
Python docs: multiprocessing.shared_memory
Python docs: pickle protocol versions

Python Multiprocessing: Zero-Copy Data Transfer with shared_memory and Protocol 5