Pandas GroupBy and Aggregation Tutorial
Master pandas groupby with single and multi-column aggregations, named outputs, transform, and filter for clean analytical pipelines.
What you'll learn
- ✓The split-apply-combine model behind groupby
- ✓Single and multi-column aggregations
- ✓Named aggregation for readable output columns
- ✓Transform vs aggregate vs filter
- ✓Performance and memory tips for large groups
Prerequisites
- •Familiar with how APIs work
What and Why
groupby is the workhorse of pandas analysis. Almost every interesting question about tabular data is some variation of “group by X, then compute Y.” Sales per region, average response time per service, retention by signup cohort. Knowing groupby well is the difference between writing five lines and writing fifty.
The mental model behind groupby is split-apply-combine: split the dataframe into groups, apply a function to each group, and combine the results back into a single dataframe or series.
Mental Model
dataframe
|
v split by group key
[group A] [group B] [group C]
| | |
v apply v apply v apply
result A result B result C
| | |
v combine into one frame
final result apply can be aggregation (one row per group), transformation (same shape as input), or filter (keep or drop entire groups). Picking the right operation prevents many bugs.
Hands-on Example
A small dataset to anchor the examples.
import pandas as pd
df = pd.DataFrame({
"region": ["NA","NA","EU","EU","APAC","APAC","NA","EU"],
"product": ["A","B","A","B","A","B","A","A"],
"revenue": [120, 95, 80, 130, 70, 60, 200, 150],
"units": [4, 3, 2, 5, 3, 2, 6, 4],
})
Single-column groupby and a single aggregation:
df.groupby("region")["revenue"].sum()
# NA 415
# EU 360
# APAC 130
Multiple aggregations on multiple columns:
df.groupby("region").agg({
"revenue": ["sum", "mean"],
"units": "sum",
})
The result has a MultiIndex on columns. Named aggregation gives you flat, readable column names:
df.groupby("region").agg(
total_revenue=("revenue", "sum"),
avg_revenue=("revenue", "mean"),
total_units=("units", "sum"),
).reset_index()
This pattern is the cleanest for production pipelines. Output columns mean exactly what they say.
Group by multiple columns to slice further:
df.groupby(["region", "product"], as_index=False).agg(
total_revenue=("revenue", "sum"),
)
as_index=False keeps the group keys as columns rather than as a MultiIndex, which is usually what you want when feeding the result into the next step.
Transform and Filter
transform returns a result with the same shape as the input, broadcasting the group-level value back to each row. Useful for computing a row’s share of its group.
df["region_total"] = df.groupby("region")["revenue"].transform("sum")
df["share_of_region"] = df["revenue"] / df["region_total"]
filter keeps or drops entire groups based on a predicate.
# Only keep regions with total revenue above 200
big_regions = df.groupby("region").filter(lambda g: g["revenue"].sum() > 200)
A common mistake: using apply for everything. transform and filter are faster and clearer when the operation fits their shape.
Trade-offs
Pandas groupby has performance characteristics worth knowing.
- Vectorized aggregations are fast.
sum,mean,count,min,max,std,nuniqueare implemented in C. - Lambda functions are slow.
agg(lambda x: ...)falls back to a Python loop per group. Use named functions or built-in strings instead. - High cardinality groups are expensive. Grouping by a column with millions of unique values can blow memory.
- Sorting on group keys is the default. Pass
sort=Falseto skip sorting when order does not matter; this can be significantly faster on large data. observed=Truematters for categorical group keys. Otherwise pandas creates entries for every level cross even if there is no data.
For very large data, consider polars or duckdb. They are dramatically faster on group-heavy workloads while preserving most of the mental model.
Practical Tips
- Use named aggregation for readable columns. It is a small typing cost and a large clarity gain.
- Prefer
groupby(...).agg(...)overgroupby(...).apply(...).applyis the slowest option and the easiest to misuse. - Pass
as_index=Falsewhen you will keep working with the result. It avoids a tediousreset_index()step. - Reach for
transformwhen you need group-level features per row. It is the right tool for “share of total,” “rank within group,” and “deviation from group mean.” - Use
agg("size")to count rows including nulls.agg("count")excludes nulls per column, which can confuse you when a key column has missing values. - Avoid groupby for simple boolean masks. A vectorized comparison is faster than splitting and filtering.
- Profile before optimizing. A
df.groupby(...).agg(...)that takes 4 seconds on 10M rows is usually fine. Spend energy on the slow steps.
A pattern that scales well: compute group aggregates once, then merge back to row-level data. This is faster than calling transform many times.
Wrap-up
GroupBy turns “I have a table” into “I have insight.” Use named aggregation for readable output, pick between agg, transform, and filter based on the shape you want, and skip lambdas in favor of vectorized aggregations. Once split-apply-combine becomes second nature, most analytical questions reduce to a few well-chosen lines.
Related articles
- Pandas Pandas: apply vs Vectorization
When to reach for .apply and when vectorized operations win. A practical comparison with benchmarks, mental models, and the patterns that keep Pandas code both readable and fast.
- Pandas Pandas Categorical Data Tutorial
Use Pandas Categorical dtype to cut memory, speed up groupby, and encode ordered categories cleanly with practical conversion and pitfall notes.
- Pandas Pandas Data Cleaning Techniques: A Practical Field Guide
Hands-on pandas patterns for cleaning messy real-world data, covering missing values, types, duplicates, strings, and a reliable cleaning pipeline.
- Pandas Pandas Merge, Join, and Concat Explained
Pick the right pandas combining function with clear examples of inner, outer, left, right joins, concat, and the gotchas that cause silent bugs.