Pandas GroupBy and Aggregation Tutorial

Beginner 10 min read

What you'll learn

✓The split-apply-combine model behind groupby
✓Single and multi-column aggregations
✓Named aggregation for readable output columns
✓Transform vs aggregate vs filter
✓Performance and memory tips for large groups

Prerequisites

•Familiar with how APIs work

What and Why

groupby is the workhorse of pandas analysis. Almost every interesting question about tabular data is some variation of “group by X, then compute Y.” Sales per region, average response time per service, retention by signup cohort. Knowing groupby well is the difference between writing five lines and writing fifty.

The mental model behind groupby is split-apply-combine: split the dataframe into groups, apply a function to each group, and combine the results back into a single dataframe or series.

Mental Model

dataframe
 |
 v split by group key
[group A] [group B] [group C]
 |          |         |
 v apply   v apply   v apply
result A  result B  result C
 |          |         |
 v combine into one frame
      final result

Split-apply-combine

apply can be aggregation (one row per group), transformation (same shape as input), or filter (keep or drop entire groups). Picking the right operation prevents many bugs.

Hands-on Example

A small dataset to anchor the examples.

import pandas as pd

df = pd.DataFrame({
    "region": ["NA","NA","EU","EU","APAC","APAC","NA","EU"],
    "product": ["A","B","A","B","A","B","A","A"],
    "revenue": [120, 95, 80, 130, 70, 60, 200, 150],
    "units":   [4, 3, 2, 5, 3, 2, 6, 4],
})

Single-column groupby and a single aggregation:

df.groupby("region")["revenue"].sum()
# NA      415
# EU      360
# APAC    130

Multiple aggregations on multiple columns:

df.groupby("region").agg({
    "revenue": ["sum", "mean"],
    "units": "sum",
})

The result has a MultiIndex on columns. Named aggregation gives you flat, readable column names:

df.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    total_units=("units", "sum"),
).reset_index()

This pattern is the cleanest for production pipelines. Output columns mean exactly what they say.

Group by multiple columns to slice further:

df.groupby(["region", "product"], as_index=False).agg(
    total_revenue=("revenue", "sum"),
)

as_index=False keeps the group keys as columns rather than as a MultiIndex, which is usually what you want when feeding the result into the next step.

Transform and Filter

transform returns a result with the same shape as the input, broadcasting the group-level value back to each row. Useful for computing a row’s share of its group.

df["region_total"] = df.groupby("region")["revenue"].transform("sum")
df["share_of_region"] = df["revenue"] / df["region_total"]

filter keeps or drops entire groups based on a predicate.

# Only keep regions with total revenue above 200
big_regions = df.groupby("region").filter(lambda g: g["revenue"].sum() > 200)

A common mistake: using apply for everything. transform and filter are faster and clearer when the operation fits their shape.

Trade-offs

Pandas groupby has performance characteristics worth knowing.

Vectorized aggregations are fast. sum, mean, count, min, max, std, nunique are implemented in C.
Lambda functions are slow. agg(lambda x: ...) falls back to a Python loop per group. Use named functions or built-in strings instead.
High cardinality groups are expensive. Grouping by a column with millions of unique values can blow memory.
Sorting on group keys is the default. Pass sort=False to skip sorting when order does not matter; this can be significantly faster on large data.
observed=True matters for categorical group keys. Otherwise pandas creates entries for every level cross even if there is no data.

For very large data, consider polars or duckdb. They are dramatically faster on group-heavy workloads while preserving most of the mental model.

Practical Tips

Use named aggregation for readable columns. It is a small typing cost and a large clarity gain.
Prefer groupby(...).agg(...) over groupby(...).apply(...). apply is the slowest option and the easiest to misuse.
Pass as_index=False when you will keep working with the result. It avoids a tedious reset_index() step.
Reach for transform when you need group-level features per row. It is the right tool for “share of total,” “rank within group,” and “deviation from group mean.”
Use agg("size") to count rows including nulls. agg("count") excludes nulls per column, which can confuse you when a key column has missing values.
Avoid groupby for simple boolean masks. A vectorized comparison is faster than splitting and filtering.
Profile before optimizing. A df.groupby(...).agg(...) that takes 4 seconds on 10M rows is usually fine. Spend energy on the slow steps.

A pattern that scales well: compute group aggregates once, then merge back to row-level data. This is faster than calling transform many times.

Wrap-up

GroupBy turns “I have a table” into “I have insight.” Use named aggregation for readable output, pick between agg, transform, and filter based on the shape you want, and skip lambdas in favor of vectorized aggregations. Once split-apply-combine becomes second nature, most analytical questions reduce to a few well-chosen lines.