Threading vs Multiprocessing in Python
Understand the Python GIL and pick the right concurrency tool: when threads help with I/O, when processes help with CPU, and how to use concurrent.futures.
What you'll learn
- What the GIL is and how it shapes Python concurrency
- When threads actually speed things up (I/O-bound work)
- When you need processes instead (CPU-bound work)
- How to use concurrent.futures for both threads and processes
- Common pitfalls: shared state, pickling, and overhead
Prerequisites
- Comfortable with Python functions and basic error handling
Concurrency in Python looks confusing from the outside. There are threads, there are processes, there is asyncio, and there is the GIL that everybody complains about. The good news is that picking the right tool is mostly a single question: is your code waiting on I/O, or is it crunching numbers? Answer that, and the choice is almost made for you.
The GIL in one paragraph
The Global Interpreter Lock is a mutex inside CPython that ensures only one thread runs Python bytecode at a time. So even on a 16-core machine, two Python threads doing pure-Python work do not actually run in parallel. They take turns. The lock is released around I/O and around some C-extension calls (NumPy, for example), which is why threads still help in some workloads.
PEP 703 introduced an optional free-threaded (“no-GIL”) build of CPython, which first shipped as an experimental build in Python 3.13, but the default CPython you run today still has the GIL. Plan accordingly.
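To find out which kind of interpreter a script is running on, a minimal sketch follows. It assumes sys._is_gil_enabled(), a private helper that exists on CPython 3.13 and later; the gil_status wrapper is a made-up name for illustration.

```python
import sys

def gil_status() -> str:
    """Report whether this interpreter currently runs with the GIL."""
    # sys._is_gil_enabled() exists on CPython 3.13+. Older CPython versions
    # lack it, and on those the GIL is always on.
    check = getattr(sys, "_is_gil_enabled", None)
    if check is None:
        return "enabled (no free-threaded support in this version)"
    return "enabled" if check() else "disabled"

print("GIL:", gil_status())
```

On a standard build this should report that the GIL is enabled; only a free-threaded build can report otherwise.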
What threads are actually good for
Threads in Python are great when your program spends most of its time waiting. Network requests, database queries, file reads, subprocess calls. While one thread is blocked on a socket, the GIL is released and another thread can run.
Here is a sequential example that fetches a few URLs:
import time
import urllib.request

URLS = [
    "https://example.com",
    "https://www.python.org",
    "https://httpbin.org/get",
]

def fetch(url: str) -> int:
    with urllib.request.urlopen(url, timeout=5) as r:
        return len(r.read())

start = time.perf_counter()
total = sum(fetch(u) for u in URLS)
print(total, f"in {time.perf_counter() - start:.2f}s")
Now with threads using concurrent.futures:
from concurrent.futures import ThreadPoolExecutor

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(fetch, URLS))
print(sum(sizes), f"in {time.perf_counter() - start:.2f}s")
On a real network, this takes roughly the time of the slowest request rather than the sum. The GIL is no obstacle because each thread spends most of its life waiting on the socket.
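If you want to reproduce that effect without touching the network, here is a minimal sketch that stands in for a slow request with time.sleep, which also releases the GIL while it waits. The fake_fetch and timed names and the 0.2-second delay are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # assumed per-request latency, in seconds

def fake_fetch(job_id: int) -> int:
    # time.sleep blocks the way a socket read would and releases the GIL.
    time.sleep(DELAY)
    return job_id

def timed(label, run):
    """Call run(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = run()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed

jobs = list(range(4))

def run_sequential():
    return [fake_fetch(j) for j in jobs]

def run_threaded():
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fake_fetch, jobs))

seq_result, seq_time = timed("sequential", run_sequential)
thr_result, thr_time = timed("threads", run_threaded)
```

The sequential run should take about four times DELAY and the threaded run about one DELAY, though the exact numbers depend on the machine.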
When threads do not help
Threads do almost nothing for CPU-bound work in pure Python. A loop that parses strings or runs a tight numerical kernel in Python bytecode will take about as long with eight threads as with one. Try it:
import time
from concurrent.futures import ThreadPoolExecutor

def crunch(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 10_000_000

start = time.perf_counter()
[crunch(N) for _ in range(4)]
print("sequential", time.perf_counter() - start)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(crunch, [N] * 4))
print("threads", time.perf_counter() - start)
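The threaded version runs at about the same speed, or slightly slower because of thread switching and contention for the lock. The GIL is the reason: only one of the four threads can execute bytecode at any moment.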
Processes for CPU-bound work
multiprocessing and ProcessPoolExecutor start separate Python interpreters. Each has its own GIL, so they run truly in parallel. The trade-off is overhead: starting a process is expensive, and data passed between processes has to be pickled.
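# Reuses time, crunch, and N from the previous example.
# crunch must live at module level so worker processes can import it.
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(crunch, [N] * 4))
    print("processes", time.perf_counter() - start)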
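On a 4-core machine that finishes in roughly a quarter of the sequential time, plus a little startup cost. The if __name__ == "__main__": guard is not optional on macOS or Windows, where new processes are started with the spawn method. Without it, each child re-imports your script and tries to create a pool of its own; current CPython versions usually catch this and fail with a RuntimeError rather than spawning processes forever.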
Things to keep in mind with processes:
- Arguments and return values are pickled and copied. Big NumPy arrays or large dicts hurt.
- You cannot share regular Python objects across processes. Use multiprocessing.Queue, Manager, or shared_memory for that.
- Process startup adds tens of milliseconds. Do not spawn a pool to crunch ten quick numbers.
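To get a feel for the pickling cost before committing to a pool, you can measure how many bytes a task's arguments turn into. The sketch below calls the standard pickle module directly; multiprocessing's own serialization may differ in detail, so treat the numbers as an estimate. The pickled_size helper is a made-up name.

```python
import pickle

def pickled_size(obj) -> int:
    """Bytes needed to pickle obj: a rough proxy for what crosses the process boundary."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

# Option 1: ship the whole dataset to the worker.
big_list = list(range(1_000_000))

# Option 2: ship only the bounds and let the worker rebuild the data.
bounds = (0, 1_000_000)

big = pickled_size(big_list)
small = pickled_size(bounds)
print(f"full list: {big:,} bytes")
print(f"bounds:    {small:,} bytes")
```

When a worker can regenerate or load its own input (a range, a file path, a database key), passing that small description instead of the data itself keeps the overhead low.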
concurrent.futures is the friendly API
concurrent.futures is the standard, modern way to use both pools. The API is the same for threads and processes, which makes it easy to switch.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    ...

urls = [...]

with ThreadPoolExecutor(max_workers=16) as pool:
    futures = {pool.submit(fetch, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            result = fut.result()
        except Exception as e:
            print(f"{url} failed: {e}")
        else:
            print(f"{url}: {result}")
submit schedules a callable and returns a Future. as_completed yields futures in the order they finish, which is what you usually want when some tasks are faster than others. Pair this with patterns from error handling so a single failure does not bring the whole batch down.
For simpler “do this list of jobs in parallel” use cases, pool.map is fine and reads cleaner.
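One detail worth seeing in action: pool.map hands results back in input order, even when later tasks finish first, while as_completed follows finish order. A minimal sketch, using a made-up slow_echo task whose delays are chosen so the first job finishes last:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_echo(delay: float) -> float:
    # Sleep for the given time, then return it so we can see the ordering.
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map: results line up with the inputs.
    mapped = list(pool.map(slow_echo, delays))
    # as_completed: results arrive as tasks finish.
    futures = [pool.submit(slow_echo, d) for d in delays]
    finished = [f.result() for f in as_completed(futures)]

print("map:         ", mapped)
print("as_completed:", finished)
```

The map list always matches the input order. The as_completed list will typically come out shortest delay first, but that depends on scheduling, so do not rely on it.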
A small decision table
A short guide that covers most cases:
- Waiting on lots of network or disk I/O, sync libraries: ThreadPoolExecutor.
- Waiting on lots of network or disk I/O, async libraries available: asyncio. Often less overhead than threads.
- CPU-heavy pure Python work: ProcessPoolExecutor.
- CPU-heavy work in NumPy/SciPy/PyTorch: usually already releases the GIL. Profile before reaching for processes; threads or vectorization often win.
- One slow blocking call inside an async program: loop.run_in_executor to push it to a thread.
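The last entry in that list can be sketched as follows. blocking_lookup is a hypothetical stand-in for whatever sync call you are stuck with; passing None as the executor asks the event loop to use its default thread pool.

```python
import asyncio
import time

def blocking_lookup(key: str) -> str:
    # Stand-in for a sync library call that cannot be awaited.
    time.sleep(0.1)
    return key.upper()

async def main() -> list:
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    pending = [
        loop.run_in_executor(None, blocking_lookup, key)
        for key in ("a", "b", "c")
    ]
    # The event loop stays free for other coroutines while the threads wait.
    return await asyncio.gather(*pending)

results = asyncio.run(main())
print(results)
```

asyncio.gather returns the results in the order the awaitables were passed in, so the output lines up with the keys.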
If you have not used asyncio, the article on generators and iterators is a good warmup since coroutines are built on similar machinery.
Shared state is where bugs live
Threads share memory. Two threads that update the same list, dict, or counter without coordination will eventually lose updates or observe inconsistent state, because a read-modify-write such as counter += 1 is not atomic. Use threading.Lock around critical sections, or design so that each thread owns its data and you merge results at the end.
from threading import Lock

counter = 0
lock = Lock()

def bump():
    global counter
    with lock:
        counter += 1
Processes do not share memory, which sidesteps these bugs, but it also means you cannot mutate a global from a worker and expect the parent to see it. Pass results back through the pool’s return values.
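The same return-and-merge approach works for threads too, and it removes the need for a lock altogether. A minimal sketch, with a made-up count_words task and toy input:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(lines: list) -> Counter:
    # Each call builds its own Counter; nothing is shared between threads.
    local = Counter()
    for line in lines:
        local.update(line.split())
    return local

chunks = [
    ["a b a", "c"],
    ["b b", "a c c"],
]

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_words, chunks))

# Merging happens in one thread, after the workers are done.
total = sum(partials, Counter())
print(total)
```

Because every worker only touches its own Counter, there is no critical section to protect, and swapping in ProcessPoolExecutor would need no change to the task itself.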
Common pitfalls
- Reaching for threads to speed up CPU work in pure Python. The GIL says no.
- Spawning a ProcessPoolExecutor inside a script without if __name__ == "__main__":. Every spawned child re-runs the script's top level and tries to start a pool of its own.
- Creating a new pool for every batch. Pools have setup cost. Create once, reuse.
- Catching every exception inside a worker and silently dropping it. The parent never sees the error. Let exceptions propagate via future.result() so they surface.
- Ignoring backpressure. A queue with no upper bound and a slow consumer becomes a memory bomb. Bound your queues and your pools.
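For the backpressure point, here is a minimal sketch of a bounded queue between one producer and one slow consumer. The maxsize of 2, the sentinel convention, and the function names are choices made for this example, not requirements of the queue module.

```python
import queue
import threading
import time

SENTINEL = None  # marks the end of the stream
work = queue.Queue(maxsize=2)  # put() blocks once two items are waiting
consumed = []

def producer(n: int) -> None:
    for i in range(n):
        work.put(i)  # blocks while the queue is full
    work.put(SENTINEL)

def consumer() -> None:
    while True:
        item = work.get()
        if item is SENTINEL:
            break
        time.sleep(0.01)  # pretend the consumer is slow
        consumed.append(item)

threads = [
    threading.Thread(target=producer, args=(10,)),
    threading.Thread(target=consumer),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(consumed)
```

Because put() blocks when the queue is full, the producer can never get more than two items ahead of the consumer, so memory use stays flat no matter how many items flow through.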
When to pick what, in one sentence each
- Threads: many slow waits, library is sync, you want a small change to existing code.
- Processes: heavy CPU work in pure Python, results worth the pickling cost.
- asyncio: many slow waits, library has an async API, you want low overhead at high concurrency.
Wrap up
The GIL is the rule that decides this whole topic. Threads in Python overlap waiting, not computing. Use ThreadPoolExecutor for I/O-bound work, ProcessPoolExecutor for CPU-bound work, and asyncio when your libraries support it. concurrent.futures gives you a single clean API for both pools, and as_completed plus proper exception handling makes batches resilient. Pick by workload, not by habit, and your Python programs will use your hardware much better.