Python Multiprocessing vs Threading

Intermediate 10 min read

What you'll learn

✓ Why the GIL exists and what it actually blocks
✓ When threads beat processes despite the GIL
✓ How multiprocessing shares data and the cost involved
✓ Choosing between thread pools, process pools, and asyncio
✓ Patterns for mixing all three safely

Prerequisites

• Basic Python

What and why

Python has two ways to run code concurrently in the same program: threads and processes. They look similar from the API surface (ThreadPoolExecutor, ProcessPoolExecutor), but they behave very differently because of the Global Interpreter Lock (GIL).

The GIL is a mutex inside CPython that allows only one thread to execute Python bytecode at a time. The interpreter releases it around blocking I/O, and C extensions can release it during heavy native work, which is why threading is still useful. For pure-Python CPU loops, threading gives you no speedup.
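To see the release-on-blocking behaviour in isolation, here is a minimal sketch that uses time.sleep as a stand-in for a blocking I/O call (an assumption, so the example needs no network). The helper name timed_sleeps is illustrative only.

```python
import threading
import time

def timed_sleeps(n_threads=4, delay=0.2):
    """Run a `delay`-second sleep in each of n_threads threads; return elapsed wall time."""
    threads = [threading.Thread(target=time.sleep, args=(delay,)) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    # time.sleep releases the GIL while it blocks, so the four sleeps overlap.
    print(f"4 threads x 0.2 s sleep took {timed_sleeps():.2f} s")
```

The elapsed time should land close to one sleep, not four, because a blocked thread does not hold the GIL.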

Mental model

Threads share memory and a single Python interpreter; processes each have their own interpreter and their own memory. The trade-off: threads are cheap and share data for free but compete for the GIL; processes scale across cores but pay a serialization cost to share anything.

Threading (one process, multiple threads)
+-----------------------------------------+
| Python interpreter + GIL                |
|   T1  T2  T3   <- only one holds GIL    |
|   |   |   |       at a time             |
|   shared memory, shared modules         |
+-----------------------------------------+

Multiprocessing (N processes)
+---------------+ +---------------+ +---------------+
| Interp + GIL  | | Interp + GIL  | | Interp + GIL  |
| own memory    | | own memory    | | own memory    |
+---------------+ +---------------+ +---------------+
     \              |              /
      \             v             /
       +--- pickle over pipe ----+
          (cost of sharing data)

Threads vs processes inside CPython
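The memory model in the diagram can be checked directly. The sketch below mutates a module-level dict once from a thread and once from a child process; the names counter, bump, bump_in_thread, and bump_in_process are made up for the illustration.

```python
import threading
from multiprocessing import Process

counter = {"value": 0}

def bump():
    counter["value"] += 1

def bump_in_thread():
    # The thread shares this process's memory, so the parent sees the change.
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    return counter["value"]

def bump_in_process():
    # The child works on its own copy of `counter`; the parent's copy is untouched.
    p = Process(target=bump)
    p.start()
    p.join()
    return counter["value"]

if __name__ == "__main__":
    print("after thread: ", bump_in_thread())
    print("after process:", bump_in_process())
```

Both lines should print the same value: the thread's increment is visible to the parent, while the child's increment happens in a separate address space and is lost when the child exits.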

Hands-on example

A CPU-bound function: count primes under N.

import math
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count_primes(n):
    # Trial division up to isqrt(x): deliberately CPU-heavy pure Python.
    return sum(1 for x in range(2, n) if all(x % d for d in range(2, math.isqrt(x) + 1)))

def bench(executor_cls, n=200_000, workers=4):
    # Time `workers` identical tasks submitted to the given executor type.
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        list(ex.map(count_primes, [n] * workers))
    return time.perf_counter() - start

# The guard is required for process pools under the spawn start method.
if __name__ == "__main__":
    print("threads:  ", bench(ThreadPoolExecutor))
    print("processes:", bench(ProcessPoolExecutor))

On a four-core machine, the thread pool takes about as long as running the four tasks back to back, because the GIL serializes the work. The process pool takes roughly a quarter of that time, plus a small process-startup cost, because each interpreter runs on its own core.

For I/O-bound work, threads win. They are cheap and the GIL is released around socket reads.

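import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com"] * 50

def fetch(url):
    # Always pass a timeout; requests waits indefinitely by default.
    return requests.get(url, timeout=10)

with ThreadPoolExecutor(max_workers=20) as ex:
    results = list(ex.map(fetch, urls))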

Spinning up twenty processes for this would waste seconds on process startup and pickling.

Sharing state between processes requires serialization. multiprocessing.Queue pickles items across a pipe. For large NumPy arrays, use multiprocessing.shared_memory so workers see the same buffer without copying.

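from multiprocessing import shared_memory
import numpy as np

a = np.arange(10_000_000, dtype=np.int64)
# Allocate a named shared block large enough for the array.
shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
# View the block as an ndarray; this wraps the buffer without copying.
buf = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
buf[:] = a[:]  # one-time copy of the data into shared memory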

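Children attach by name, with shared_memory.SharedMemory(name=shm.name), and operate on the same buffer. Each process calls close() when it is done, and the creating process calls unlink() once to free the block.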
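As a sketch of the attach-by-name step, the example below uses only the standard library, with a few raw bytes standing in for the NumPy array (an assumption made to keep it dependency-free). The function names write_greeting and main are hypothetical.

```python
from multiprocessing import Process, shared_memory

def write_greeting(name):
    """Attach to an existing block by name and write into it in place."""
    shm = shared_memory.SharedMemory(name=name)  # attach; does not create
    try:
        shm.buf[:5] = b"hello"
    finally:
        shm.close()  # drop this process's mapping; the block itself survives

def main():
    shm = shared_memory.SharedMemory(create=True, size=16)
    try:
        child = Process(target=write_greeting, args=(shm.name,))
        child.start()
        child.join()
        print(bytes(shm.buf[:5]))  # the parent reads what the child wrote
    finally:
        shm.close()
        shm.unlink()  # the creator frees the block exactly once

if __name__ == "__main__":
    main()
```

Only the block's name crosses the process boundary; the payload is never pickled. With NumPy, the worker would wrap shm.buf in np.ndarray the same way the parent does, and would need the shape and dtype passed alongside the name.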

Common pitfalls

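On Windows and macOS, multiprocessing defaults to the spawn start method (macOS since Python 3.8), which starts a fresh interpreter and re-imports your main module in every child. Put process-creation code under if __name__ == "__main__": so children skip it on import. Without the guard, each child tries to create processes of its own, which CPython usually stops with an error during child startup and which otherwise amounts to fork-bombing yourself.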
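A minimal skeleton of the guard, assuming a trivial worker function named square (hypothetical). It also prints the active start method so you can see which behaviour applies on your platform.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def square(x):
    # Module-level function: child processes can import it by name.
    return x * x

def main():
    print("start method:", mp.get_start_method())
    with ProcessPoolExecutor(max_workers=2) as ex:
        print(list(ex.map(square, range(5))))

if __name__ == "__main__":
    # Runs only in the parent; children that re-import this module skip it.
    main()
```

Keeping everything that creates processes inside main() means a re-import in a child only defines functions and never starts a pool.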

Pickling errors are the most common failure. Anything you send to a process pool must be picklable: no lambdas, no local functions, no open file handles. Use top-level functions or define classes at module scope.
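A quick way to catch these failures before they surface inside a pool is to try pickling the object yourself. The helper is_picklable and the function scale below are illustrative names, not part of any library.

```python
import pickle
from functools import partial

def scale(x, factor=2):
    return x * factor

def is_picklable(obj):
    """Return True if obj survives pickle.dumps, which process pools rely on."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

if __name__ == "__main__":
    print("top-level function:", is_picklable(scale))
    print("partial of it:     ", is_picklable(partial(scale, factor=3)))
    print("lambda:            ", is_picklable(lambda x: x * 2))
```

functools.partial over a top-level function is a common replacement for a lambda when a worker needs extra arguments bound in advance.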

The GIL is not magic. C extensions that hold the GIL across long operations (some image processing libraries, some database drivers) starve other threads. If threading.active_count() is high but throughput is flat, suspect a C extension hogging the GIL.

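Threads and the signal module do not mix well. Python-level signal handlers always run in the main thread, and signal.signal() can only be called from it. Have the handler set a threading.Event and let worker threads check it to coordinate shutdown.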
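One way to wire this up, as a sketch: the main thread installs a SIGINT handler that only sets an event, and the worker uses Event.wait as an interruptible sleep. The names worker and run, and the 0.05-second tick, are assumptions for the example.

```python
import signal
import threading
import time

def worker(stop, ticks):
    # Event.wait doubles as an interruptible sleep; it returns True once stop is set.
    while not stop.wait(timeout=0.05):
        ticks.append(time.monotonic())  # stand-in for one unit of real work

def run(seconds):
    """Run one worker for up to `seconds`, or until Ctrl+C, then shut it down."""
    stop = threading.Event()
    previous = None
    if threading.current_thread() is threading.main_thread():
        # Handlers can only be installed from the main thread; this one just sets the event.
        previous = signal.signal(signal.SIGINT, lambda signum, frame: stop.set())
    ticks = []
    t = threading.Thread(target=worker, args=(stop, ticks))
    t.start()
    stop.wait(timeout=seconds)  # main thread idles until Ctrl+C or the timeout
    stop.set()                  # ask the worker to leave its loop
    t.join(timeout=1)
    if previous is not None:
        signal.signal(signal.SIGINT, previous)  # restore the earlier handler
    return t.is_alive(), len(ticks)

if __name__ == "__main__":
    print(run(0.3))
```

The handler does no real work, which keeps it safe to run at an arbitrary point in the main thread; all cleanup happens in ordinary code after the event is set.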

Mixing asyncio with threads requires loop.run_in_executor or asyncio.to_thread. Calling sync blocking code directly from a coroutine stalls the loop.
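A minimal sketch of the asyncio.to_thread route (Python 3.9+), with time.sleep standing in for a blocking library call. blocking_read is a hypothetical function name.

```python
import asyncio
import time

def blocking_read(delay):
    """Stand-in for a synchronous, blocking call such as a legacy client library."""
    time.sleep(delay)
    return delay

async def main():
    start = time.perf_counter()
    # Each call runs in the default thread pool, so the event loop stays responsive.
    results = await asyncio.gather(
        *(asyncio.to_thread(blocking_read, 0.2) for _ in range(3))
    )
    return results, time.perf_counter() - start

if __name__ == "__main__":
    print(asyncio.run(main()))
```

Calling blocking_read(0.2) directly inside the coroutine would instead freeze the loop for the full duration of each call, one after another.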

Production tips

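Default to asyncio for I/O at scale. A thread-per-connection design tends to top out at thousands of connections, since every thread needs its own stack and a share of the scheduler; a single asyncio event loop can hold tens of thousands of mostly idle connections on the same hardware.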
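To give a feel for that fan-out, the sketch below launches thousands of coroutines on one thread, with asyncio.sleep standing in for waiting on a socket. The names fake_request and fan_out and the concurrency limit are assumptions chosen for the example, not recommendations.

```python
import asyncio

async def fake_request(i, delay=0.1):
    await asyncio.sleep(delay)  # stand-in for awaiting a network response
    return i

async def fan_out(n=5_000, limit=1_000):
    sem = asyncio.Semaphore(limit)  # cap the number of in-flight "requests"

    async def bounded(i):
        async with sem:
            return await fake_request(i)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(bounded(i) for i in range(n)))

if __name__ == "__main__":
    results = asyncio.run(fan_out())
    print(len(results), "requests completed on a single thread")
```

The semaphore is optional here, but with real sockets a cap like this protects both the remote service and your own file-descriptor limit.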

Use processes for CPU work that does not fit a single core. Bound pool size to the number of physical cores; oversubscription costs more in context switches than it gives in parallelism.

Pin pool size from configuration, not from os.cpu_count() in containers. Inside Kubernetes you see the node’s cores, not the pod’s allocation, and you will starve neighbors.
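A sketch of a configuration-first lookup. The environment variable name POOL_WORKERS and the function pool_size are hypothetical; substitute whatever your deployment already uses.

```python
import os

def pool_size(env_var="POOL_WORKERS", fallback=None):
    """Resolve the worker count from configuration, with a conservative fallback."""
    raw = os.environ.get(env_var, "").strip()
    if raw:
        return max(1, int(raw))  # explicit configuration wins
    if fallback is not None:
        return max(1, fallback)
    if hasattr(os, "sched_getaffinity"):
        # CPUs this process may run on (Linux). Reflects CPU affinity, not CPU quotas.
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

if __name__ == "__main__":
    print("workers:", pool_size())
```

In a pod you would set POOL_WORKERS from the same value as the CPU limit in the manifest, so the pool size and the allocation cannot drift apart.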

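For long-running pools, set max_tasks_per_child (Python 3.11+) so workers are replaced after a fixed number of tasks. This bounds memory leaks in third-party libraries you do not control. The option is incompatible with the fork start method; if you pass no mp_context, the pool uses spawn.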

ProcessPoolExecutor(max_workers=4, max_tasks_per_child=100)

Profile before optimizing. cProfile plus py-spy will tell you whether you are I/O-bound or CPU-bound. The wrong concurrency model is worse than no concurrency model.
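Before reaching for a full profiler, a coarse first check is to compare CPU time with wall-clock time for one call. This is a rough heuristic, not a substitute for cProfile or py-spy, and the names cpu_fraction and busy are illustrative.

```python
import time

def cpu_fraction(func, *args, **kwargs):
    """Return CPU time / wall time for one call: near 1 is CPU-bound, near 0 is waiting."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0

def busy(n=2_000_000):
    # Pure-Python arithmetic: spends its whole runtime on the CPU.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    print(f"sleep: {cpu_fraction(time.sleep, 0.2):.2f}")
    print(f"busy:  {cpu_fraction(busy):.2f}")
```

A function that scores near zero is a candidate for threads or asyncio; one that scores near one is a candidate for a process pool. time.process_time counts CPU time across all threads in the process, so run the check single-threaded.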

Wrap-up

Threads for I/O, processes for CPU, asyncio for high-fanout I/O. The GIL is why; honor it and your code scales, ignore it and you will benchmark identical numbers across configurations. Guard process startup with __main__, keep payloads picklable, and pin pool sizes deliberately. Once you internalize this, the choice for any given function takes about ten seconds.