Pandas groupby, merge, and concat
A practical guide to combining and summarising DataFrames — groupby with named aggregations, multi-column aggregates, the four merge styles, and stacking with concat.
What you'll learn
- ✓How to group rows and aggregate them with one or many functions
- ✓How to use named aggregations to get clean output columns
- ✓The four merge styles — inner, left, right, outer — and when to reach for each
- ✓When to use on vs left_on/right_on
- ✓How to stack DataFrames vertically and horizontally with concat
Prerequisites
- •Pandas DataFrames Basics — selecting, filtering, indexing
- •Comfort with Python dictionaries and lists
Loading a DataFrame is the easy part. Real analyst work is combining tables and rolling them up into summaries. This post is about the three tools that do most of that work: groupby, merge, and concat. Get fluent here and you can answer most “how does X compare across Y?” questions in a handful of lines.
If you haven’t done the basics yet, start with DataFrames basics.
The example data
We’ll use two small DataFrames throughout.
import pandas as pd
sales = pd.DataFrame({
"date": pd.to_datetime(["2026-01-04", "2026-01-04", "2026-01-05",
"2026-01-05", "2026-01-06", "2026-01-06"]),
"sku": ["A1", "B2", "A1", "C3", "B2", "A1"],
"region": ["north", "north", "south", "south", "north", "south"],
"units": [12, 7, 15, 3, 11, 9],
"revenue": [18.0, 8.4, 22.5, 11.25, 13.2, 13.5],
})
products = pd.DataFrame({
"sku": ["A1", "B2", "C3", "D4"],
"name": ["Apple", "Banana", "Cherry", "Date"],
"category": ["fruit", "fruit", "fruit", "fruit"],
})
groupby: the basics
groupby splits the DataFrame by the unique values of one or more columns, applies a function to each piece, and stitches the result back together. The mental model is split, apply, combine.
# Total units per region
print(sales.groupby("region")["units"].sum())
# output:
# region
# north 30
# south 27
# Name: units, dtype: int64
Three things happened: pandas split the rows into one group per region, summed units within each group, and returned a Series indexed by region.
Grouping by multiple columns gives a MultiIndex:
print(sales.groupby(["region", "sku"])["units"].sum())
# output:
# region sku
# north A1 12
# B2 18
# south A1 24
# C3 3
Want a flat DataFrame instead of a Series with a MultiIndex? Add .reset_index():
out = sales.groupby(["region", "sku"])["units"].sum().reset_index()
print(out)
# output:
# region sku units
# 0 north A1 12
# 1 north B2 18
# 2 south A1 24
# 3 south C3 3
This is the shape most downstream code (charts, joins, exports) expects.
Aggregating multiple columns
Apply the same function to several columns:
print(sales.groupby("region")[["units", "revenue"]].sum())
# output:
# units revenue
# region
# north 30 39.6
# south 27 47.25
Apply different functions to different columns with .agg:
print(sales.groupby("region").agg({
"units": "sum",
"revenue": "mean",
}))
# output:
# units revenue
# region
# north 30 13.200000
# south 27 15.750000
Apply multiple functions to one column:
print(sales.groupby("region")["units"].agg(["sum", "mean", "max"]))
# output:
# sum mean max
# region
# north 30 10.00000 12
# south 27 9.00000 15
Named aggregations
The previous outputs have awkward column names. Named aggregations are the cleaner pattern — you choose the output column name and tell pandas which input column and function to use.
summary = sales.groupby("region").agg(
total_units=("units", "sum"),
avg_revenue=("revenue", "mean"),
order_count=("sku", "count"),
)
print(summary)
# output:
# total_units avg_revenue order_count
# region
# north 30 13.20 3
# south 27 15.75 3
Read this as: output column name = (input column, function). The function can be a string ("sum", "mean", "max", "count", "nunique") or any callable.
This is the pattern to internalise. It produces clean, ready-to-export tables.
Custom aggregations
Any function that takes a Series and returns a scalar works:
def revenue_range(s):
return s.max() - s.min()
print(sales.groupby("region").agg(
rev_spread=("revenue", revenue_range),
))
# output:
# rev_spread
# region
# north 9.6
# south 11.25
Lambdas work too, though named functions are easier to read.
Try it yourself. Using the sales DataFrame, build a single named-aggregation call that, for each sku, returns total_units, total_revenue, and regions_sold (the number of distinct regions). Hint: "nunique" does the last one. The shape of the answer is more important than the exact numbers — once you can write this pattern from memory, most summary tables are a one-liner.
merge: joining two DataFrames
Once you have multiple tables, you need to combine them. pd.merge is pandas’ SQL join.
joined = sales.merge(products, on="sku")
print(joined)
# output:
# date sku region units revenue name category
# 0 2026-01-04 A1 north 12 18.00 Apple fruit
# 1 2026-01-05 A1 south 15 22.50 Apple fruit
# ...
The defaults: how="inner", and pandas joins on whatever columns share names. Being explicit is almost always better.
The four merge styles
The how argument controls which rows survive.
- inner — only rows with a key in both tables.
- left — all rows from the left table, matched where possible.
- right — all rows from the right table, matched where possible.
- outer — all rows from either, filling with NaN where there’s no match.
# Inner: only SKUs that appear in both
sales.merge(products, on="sku", how="inner")
# Left: every sales row, even if the SKU is missing from products
sales.merge(products, on="sku", how="left")
# Outer: every SKU from either side
sales.merge(products, on="sku", how="outer")
how="left" is the workhorse. Most “I want to enrich my main table with a lookup” tasks are left joins. The left table is your fact table; the right table is a dimension/reference table.
A useful trick — pass indicator=True to see where each row came from:
print(sales.merge(products, on="sku", how="outer", indicator=True))
# Adds a "_merge" column with values left_only / right_only / both
This is gold for debugging missing rows after a join.
When keys have different names
Use left_on and right_on when the join columns have different names in each table:
orders = pd.DataFrame({"product_id": ["A1", "B2"], "qty": [3, 5]})
sales.merge(
orders,
left_on="sku",
right_on="product_id",
how="left",
)
You’ll often get both columns in the output. Drop the redundant one:
result = sales.merge(orders, left_on="sku", right_on="product_id", how="left")
result = result.drop(columns="product_id")
For joins on the index, use left_index=True or right_index=True. Or skip merge entirely and use df.join, which defaults to joining on the index.
Duplicate key gotchas
Merging multiplies rows whenever the join key isn’t unique on the right side. This is the single most common source of “why did my DataFrame suddenly have 10x more rows?” bugs.
# Right side has two rows for sku "A1"
dupes = pd.DataFrame({"sku": ["A1", "A1"], "promo": [0.1, 0.2]})
print(sales.merge(dupes, on="sku", how="left"))
# Each sales row for A1 is now duplicated — one per matching promo row
Before merging, check uniqueness:
assert products["sku"].is_unique
A one-line assertion at the top of an analysis is worth its weight in saved debugging hours.
concat: stacking instead of joining
pd.concat glues DataFrames together along an axis. No join logic — it just stacks.
Stacking rows (the common case) — combine two DataFrames with the same columns:
january = pd.read_csv("sales_2026_01.csv")
february = pd.read_csv("sales_2026_02.csv")
all_sales = pd.concat([january, february], ignore_index=True)
ignore_index=True renumbers the combined DataFrame from 0. Without it, you keep the original indices, which often leads to duplicate index values.
Stacking columns — paste DataFrames side-by-side:
left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"b": [10, 20, 30]})
print(pd.concat([left, right], axis=1))
# output:
# a b
# 0 1 10
# 1 2 20
# 2 3 30
For column-wise concat, alignment is by index, not by row order. If the indices don’t match, you get NaNs. When in doubt, reset the index on both sides first.
merge vs concat: which to use
A quick rule:
- Same columns, different rows →
concat - Different columns, shared key →
merge
If you’re tempted to concat along columns to combine two related tables, stop and use merge instead. It’s safer because it checks the keys.
A realistic example
Put the three tools together. Goal: revenue by product category and region, including categories that had zero sales.
# 1. Enrich sales with product info (left join — keep all sales rows)
enriched = sales.merge(products, on="sku", how="left")
# 2. Group and aggregate with named outputs
summary = enriched.groupby(["category", "region"], as_index=False).agg(
total_units=("units", "sum"),
total_revenue=("revenue", "sum"),
skus_sold=("sku", "nunique"),
)
print(summary)
# 3. If we had multiple monthly summary files, stack them
# combined = pd.concat([jan_summary, feb_summary], ignore_index=True)
Three lines of pandas, one analyst-ready output. That shape — merge to enrich, groupby to summarise, optionally concat to stitch periods together — covers a huge fraction of routine reporting.
Try it yourself. Add a third DataFrame regions = pd.DataFrame({"region": ["north", "south", "east"], "country": ["US", "US", "US"]}). Left-join it onto the summary DataFrame so every row gets a country column. Then check: did the row count change? If yes, your join key probably has duplicates somewhere. This kind of sanity check (row count before and after a merge) is the cheapest bug catcher in pandas.
Common pitfalls
- Forgetting
as_index=False— leaves you with a grouped DataFrame that has the group keys as the index. Fine for chained ops; awkward for export. - Default inner joins.
mergedefaults to inner. If you expected to keep all left rows and silently lost some, this is almost always why. - Row duplication after merge. Caused by non-unique join keys on the right. Assert uniqueness early.
concatalong columns without matching indices. Causes NaN holes. Reset indices first.- Mixed dtypes in merge keys — joining a string
"1"against an int1will silently produce no matches. Cast both sides to the same type.
Recap
You now know:
groupbyfollows split, apply, combine- Named aggregations are the clean way to produce summary tables
mergehas four styles — inner, left, right, outer — andleftis the workhorse- Use
left_on/right_onwhen join columns have different names - Non-unique join keys multiply rows — assert uniqueness
concatstacks; it doesn’t join
Next steps
Once you can merge and summarise, you can feed clean tables into modelling. The natural next step is Train/Test Split and Classification Metrics.
Related: Pandas DataFrames Basics, What Is Pandas?, What Is Machine Learning?.
Questions or feedback? Email codeloomdevv@gmail.com.