Pandas groupby, merge, and concat

Intermediate 13 min read

What you'll learn

✓How to group rows and aggregate them with one or many functions
✓How to use named aggregations to get clean output columns
✓The four merge styles — inner, left, right, outer — and when to reach for each
✓When to use on vs left_on/right_on
✓How to stack DataFrames vertically and horizontally with concat

Prerequisites

•Pandas DataFrames Basics — selecting, filtering, indexing
•Comfort with Python dictionaries and lists

Loading a DataFrame is the easy part. Real analyst work is combining tables and rolling them up into summaries. This post is about the three tools that do most of that work: groupby, merge, and concat. Get fluent here and you can answer most “how does X compare across Y?” questions in a handful of lines.

If you haven’t done the basics yet, start with DataFrames basics.

The example data

We’ll use two small DataFrames throughout.

import pandas as pd

sales = pd.DataFrame({
    "date":    pd.to_datetime(["2026-01-04", "2026-01-04", "2026-01-05",
                                "2026-01-05", "2026-01-06", "2026-01-06"]),
    "sku":     ["A1", "B2", "A1", "C3", "B2", "A1"],
    "region":  ["north", "north", "south", "south", "north", "south"],
    "units":   [12, 7, 15, 3, 11, 9],
    "revenue": [18.0, 8.4, 22.5, 11.25, 13.2, 13.5],
})

products = pd.DataFrame({
    "sku":      ["A1", "B2", "C3", "D4"],
    "name":     ["Apple", "Banana", "Cherry", "Date"],
    "category": ["fruit", "fruit", "fruit", "fruit"],
})

groupby: the basics

groupby splits the DataFrame by the unique values of one or more columns, applies a function to each piece, and stitches the result back together. The mental model is split, apply, combine.

# Total units per region
print(sales.groupby("region")["units"].sum())
# output:
# region
# north    30
# south    27
# Name: units, dtype: int64

Three things happened: pandas split the rows into one group per region, summed units within each group, and returned a Series indexed by region.

Grouping by multiple columns gives a MultiIndex:

print(sales.groupby(["region", "sku"])["units"].sum())
# output:
# region  sku
# north   A1     12
#         B2     18
# south   A1     24
#         C3      3

Want a flat DataFrame instead of a Series with a MultiIndex? Add .reset_index():

out = sales.groupby(["region", "sku"])["units"].sum().reset_index()
print(out)
# output:
#   region sku  units
# 0  north  A1     12
# 1  north  B2     18
# 2  south  A1     24
# 3  south  C3      3

This is the shape most downstream code (charts, joins, exports) expects.

Aggregating multiple columns

Apply the same function to several columns:

print(sales.groupby("region")[["units", "revenue"]].sum())
# output:
#         units  revenue
# region
# north      30     39.6
# south      27     47.25

Apply different functions to different columns with .agg:

print(sales.groupby("region").agg({
    "units":   "sum",
    "revenue": "mean",
}))
# output:
#         units    revenue
# region
# north      30  13.200000
# south      27  15.750000

Apply multiple functions to one column:

print(sales.groupby("region")["units"].agg(["sum", "mean", "max"]))
# output:
#         sum      mean  max
# region
# north    30  10.00000   12
# south    27   9.00000   15

Named aggregations

The previous outputs have awkward column names. Named aggregations are the cleaner pattern — you choose the output column name and tell pandas which input column and function to use.

summary = sales.groupby("region").agg(
    total_units=("units", "sum"),
    avg_revenue=("revenue", "mean"),
    order_count=("sku", "count"),
)
print(summary)
# output:
#         total_units  avg_revenue  order_count
# region
# north            30        13.20            3
# south            27        15.75            3

Read this as: output column name = (input column, function). The function can be a string ("sum", "mean", "max", "count", "nunique") or any callable.

This is the pattern to internalise. It produces clean, ready-to-export tables.

Custom aggregations

Any function that takes a Series and returns a scalar works:

def revenue_range(s):
    return s.max() - s.min()

print(sales.groupby("region").agg(
    rev_spread=("revenue", revenue_range),
))
# output:
#         rev_spread
# region
# north         9.6
# south        11.25

Lambdas work too, though named functions are easier to read.

Try it yourself. Using the sales DataFrame, build a single named-aggregation call that, for each sku, returns total_units, total_revenue, and regions_sold (the number of distinct regions). Hint: "nunique" does the last one. The shape of the answer is more important than the exact numbers — once you can write this pattern from memory, most summary tables are a one-liner.

merge: joining two DataFrames

Once you have multiple tables, you need to combine them. pd.merge is pandas’ SQL join.

joined = sales.merge(products, on="sku")
print(joined)
# output:
#         date sku region  units  revenue    name category
# 0 2026-01-04  A1  north     12    18.00   Apple    fruit
# 1 2026-01-05  A1  south     15    22.50   Apple    fruit
# ...

The defaults: how="inner", and pandas joins on whatever columns share names. Being explicit is almost always better.

The four merge styles

The how argument controls which rows survive.

inner — only rows with a key in both tables.
left — all rows from the left table, matched where possible.
right — all rows from the right table, matched where possible.
outer — all rows from either, filling with NaN where there’s no match.

# Inner: only SKUs that appear in both
sales.merge(products, on="sku", how="inner")

# Left: every sales row, even if the SKU is missing from products
sales.merge(products, on="sku", how="left")

# Outer: every SKU from either side
sales.merge(products, on="sku", how="outer")

how="left" is the workhorse. Most “I want to enrich my main table with a lookup” tasks are left joins. The left table is your fact table; the right table is a dimension/reference table.

A useful trick — pass indicator=True to see where each row came from:

print(sales.merge(products, on="sku", how="outer", indicator=True))
# Adds a "_merge" column with values left_only / right_only / both

This is gold for debugging missing rows after a join.

When keys have different names

Use left_on and right_on when the join columns have different names in each table:

orders = pd.DataFrame({"product_id": ["A1", "B2"], "qty": [3, 5]})

sales.merge(
    orders,
    left_on="sku",
    right_on="product_id",
    how="left",
)

You’ll often get both columns in the output. Drop the redundant one:

result = sales.merge(orders, left_on="sku", right_on="product_id", how="left")
result = result.drop(columns="product_id")

For joins on the index, use left_index=True or right_index=True. Or skip merge entirely and use df.join, which defaults to joining on the index.

Duplicate key gotchas

Merging multiplies rows whenever the join key isn’t unique on the right side. This is the single most common source of “why did my DataFrame suddenly have 10x more rows?” bugs.

# Right side has two rows for sku "A1"
dupes = pd.DataFrame({"sku": ["A1", "A1"], "promo": [0.1, 0.2]})

print(sales.merge(dupes, on="sku", how="left"))
# Each sales row for A1 is now duplicated — one per matching promo row

Before merging, check uniqueness:

assert products["sku"].is_unique

A one-line assertion at the top of an analysis is worth its weight in saved debugging hours.

concat: stacking instead of joining

pd.concat glues DataFrames together along an axis. No join logic — it just stacks.

Stacking rows (the common case) — combine two DataFrames with the same columns:

january = pd.read_csv("sales_2026_01.csv")
february = pd.read_csv("sales_2026_02.csv")

all_sales = pd.concat([january, february], ignore_index=True)

ignore_index=True renumbers the combined DataFrame from 0. Without it, you keep the original indices, which often leads to duplicate index values.

Stacking columns — paste DataFrames side-by-side:

left  = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"b": [10, 20, 30]})

print(pd.concat([left, right], axis=1))
# output:
#    a   b
# 0  1  10
# 1  2  20
# 2  3  30

For column-wise concat, alignment is by index, not by row order. If the indices don’t match, you get NaNs. When in doubt, reset the index on both sides first.

merge vs concat: which to use

A quick rule:

Same columns, different rows → concat
Different columns, shared key → merge

If you’re tempted to concat along columns to combine two related tables, stop and use merge instead. It’s safer because it checks the keys.

A realistic example

Put the three tools together. Goal: revenue by product category and region, including categories that had zero sales.

# 1. Enrich sales with product info (left join — keep all sales rows)
enriched = sales.merge(products, on="sku", how="left")

# 2. Group and aggregate with named outputs
summary = enriched.groupby(["category", "region"], as_index=False).agg(
    total_units=("units", "sum"),
    total_revenue=("revenue", "sum"),
    skus_sold=("sku", "nunique"),
)
print(summary)

# 3. If we had multiple monthly summary files, stack them
# combined = pd.concat([jan_summary, feb_summary], ignore_index=True)

Three lines of pandas, one analyst-ready output. That shape — merge to enrich, groupby to summarise, optionally concat to stitch periods together — covers a huge fraction of routine reporting.

Try it yourself. Add a third DataFrame regions = pd.DataFrame({"region": ["north", "south", "east"], "country": ["US", "US", "US"]}). Left-join it onto the summary DataFrame so every row gets a country column. Then check: did the row count change? If yes, your join key probably has duplicates somewhere. This kind of sanity check (row count before and after a merge) is the cheapest bug catcher in pandas.

Common pitfalls

Forgetting as_index=False — leaves you with a grouped DataFrame that has the group keys as the index. Fine for chained ops; awkward for export.
Default inner joins. merge defaults to inner. If you expected to keep all left rows and silently lost some, this is almost always why.
Row duplication after merge. Caused by non-unique join keys on the right. Assert uniqueness early.
concat along columns without matching indices. Causes NaN holes. Reset indices first.
Mixed dtypes in merge keys — joining a string "1" against an int 1 will silently produce no matches. Cast both sides to the same type.

Recap

You now know:

groupby follows split, apply, combine
Named aggregations are the clean way to produce summary tables
merge has four styles — inner, left, right, outer — and left is the workhorse
Use left_on/right_on when join columns have different names
Non-unique join keys multiply rows — assert uniqueness
concat stacks; it doesn’t join

Next steps

Once you can merge and summarise, you can feed clean tables into modelling. The natural next step is Train/Test Split and Classification Metrics.

Questions or feedback? Email codeloomdevv@gmail.com.