Pandas Performance Optimization Tips: Make DataFrames 10x Faster
Practical pandas speedups: vectorization, dtype tuning, categorical columns, eval/query, and chunked I/O patterns that turn slow scripts into responsive pipelines.
What you'll learn
- ✓Why iterrows is slow and what to use instead
- ✓Choosing efficient dtypes and categoricals
- ✓Speeding up filters with query and eval
- ✓Profiling memory and CPU hotspots
- ✓Streaming large files with chunksize
Prerequisites
- •Familiar with Python
What and Why
Pandas is fast when you use it the way it expects to be used, and painfully slow when you do not. A script that takes ten minutes on a one-million-row DataFrame can often be rewritten to finish in under ten seconds without changing the result.
The biggest gains come from understanding three things: how pandas stores data in memory, where the Python interpreter overhead lives, and which operations are pushed down to NumPy or C. Once you internalize these, optimization stops being guesswork.
Mental Model
Every Series in pandas is backed by a typed NumPy array or an Arrow-backed extension array. Operations on the whole column run in compiled C code. Operations that loop in Python pay roughly 100 nanoseconds of interpreter overhead per row, which adds up fast.
Slow path (per-row Python) Fast path (vectorized C)
+-----------+ +---------------------+
| for row | 100 ns/row x N | df['a'] + df['b'] |
| in df: | -> seconds | -> microseconds |
| compute | +---------------------+
+-----------+
N rows N rows
1,000,000 ~100 ms+ overhead 1,000,000 ~5 ms total The rule of thumb: if you write a Python for loop over rows, you are leaving 10x to 100x on the table.
Hands-on Example
Let us measure a realistic transformation. We have a million-row DataFrame and we want a derived column.
import numpy as np
import pandas as pd
import time
n = 1_000_000
df = pd.DataFrame({
'price': np.random.rand(n) * 100,
'qty': np.random.randint(1, 10, n),
'tag': np.random.choice(['a','b','c','d'], n),
})
def timeit(label, fn):
t = time.perf_counter()
out = fn()
print(f'{label}: {time.perf_counter()-t:.3f}s')
return out
# 1. iterrows - slowest
def slow():
out = []
for _, r in df.iterrows():
out.append(r['price'] * r['qty'])
return out
# 2. apply - faster but still Python
def apply_row():
return df.apply(lambda r: r['price'] * r['qty'], axis=1)
# 3. vectorized - fastest
def vec():
return df['price'] * df['qty']
timeit('iterrows', slow)
timeit('apply', apply_row)
timeit('vector', vec)
Typical results: iterrows around 25 s, apply around 7 s, vectorized about 5 ms. Same answer, 5000x speedup.
Now compare filtering approaches:
mask = (df['price'] > 50) & (df['qty'] >= 5) & (df['tag'] == 'a')
df[mask] # standard boolean mask
df.query('price > 50 and qty >= 5 and tag == "a"') # numexpr
For large frames with several conditions, query (and eval) use numexpr under the hood and can be two to four times faster while allocating less memory.
Categoricals shrink memory and accelerate group-bys:
df['tag'] = df['tag'].astype('category')
df.groupby('tag', observed=True)['price'].mean()
A string column with millions of rows but only four unique values drops from roughly 60 MB to under 1 MB.
Common Pitfalls
The classic mistake is reaching for apply because it feels Pythonic. apply(axis=1) is essentially a hidden loop. Always look for a vectorized expression first; only fall back to apply when the logic truly cannot be expressed in column algebra.
Another trap is chained assignment, such as df[df.x > 0]['y'] = 5. This both warns and silently fails. Use df.loc[df.x > 0, 'y'] = 5 instead. It is faster and correct.
Dtype drift bites everyone. Reading CSV often produces object columns where you expected numbers, which kills performance. Pass dtype= to read_csv, or call convert_dtypes() after loading.
Finally, do not call df.append in a loop. It was deprecated for a reason: it copies the entire frame each time. Build a list of frames and pd.concat once at the end.
Practical Tips
Profile before optimizing. %timeit in Jupyter and df.info(memory_usage='deep') tell you where the time and memory actually go. Guessing leads to micro-optimizations that do not matter.
Downcast numeric columns with pd.to_numeric(s, downcast='integer') or 'float'. A column of small integers stored as int64 wastes seven bytes per row.
For repeated heavy filters, sort the frame and use searchsorted, or set an index and rely on loc. Both are logarithmic instead of linear.
When data exceeds memory, stream it. pd.read_csv('big.csv', chunksize=100_000) returns an iterator of DataFrames you can aggregate incrementally. For columnar workloads, switch to Parquet via pyarrow; it is faster to read, smaller on disk, and preserves dtypes.
For truly large jobs, consider the Arrow-backed dtypes introduced in pandas 2.0 (dtype_backend='pyarrow'), or graduate to Polars or DuckDB for sub-second analytics on tens of millions of rows.
Wrap-up
Pandas performance is not magic. Stay vectorized, keep dtypes tight, use categoricals for low-cardinality strings, prefer query for complex filters, and stream what does not fit in memory. Profile to confirm the win.
Pick the slowest cell in your current notebook and apply one of these techniques today. Most of the time, a thirty-second change cuts runtime by an order of magnitude, and your future self will thank you when the dataset grows.
Related articles
- Pandas Pandas MultiIndex Tutorial
A practical guide to Pandas MultiIndex: when to use it, how it really works, and the slicing, stacking, and groupby patterns that make hierarchical data manageable.
- Pandas Pandas Pivot Tables Tutorial: Reshape Data Like a Spreadsheet Pro
Learn how to use pandas pivot_table to summarize, reshape, and aggregate data with multiple indexes, columns, and custom aggregation functions in real workflows.
- Pandas Pandas groupby, merge, and concat
A practical guide to combining and summarising DataFrames — groupby with named aggregations, multi-column aggregates, the four merge styles, and stacking with concat.
- Pandas Pandas DataFrames: Reading, Selecting, and Filtering
A practical guide to the daily DataFrame moves — read_csv and read_json, head and info, column selection, loc vs iloc, boolean filtering, sorting, and value_counts.