Pandas Categorical Data Tutorial
Use Pandas Categorical dtype to cut memory, speed up groupby, and encode ordered categories cleanly with practical conversion and pitfall notes.
What you'll learn
- What the Categorical dtype actually stores
- How it cuts memory and speeds up groupby
- Ordered vs unordered categories
- How to add, remove, and reorder categories safely
- Common pitfalls when joining categoricals
Prerequisites
- Basic Pandas DataFrame usage
What and Why
Real datasets are full of repeated strings: country codes, plan tiers, status fields, product SKUs. Storing each value as a Python object string wastes memory and slows grouping. A column with one million rows and twelve unique countries should not hold one million separate strings.
The Categorical dtype solves this by storing each row as a small integer code that points into a tiny lookup table of unique values. Memory drops, group operations get faster, and you gain explicit control over allowed values.
Mental Model
A Categorical Series has two parts: an array of integer codes, one per row, and a categories index listing the unique values in order. Reads transparently look up codes in the categories array, so you still see strings everywhere in your code.
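Both parts can be inspected through the .cat accessor. The sketch below uses a small made-up Series (the values are illustrative only) and prints the lookup table and the per-row codes.

```python
import pandas as pd

# Small illustrative Series; the values are made up for this sketch.
s = pd.Series(["US", "IN", "US", "DE", "IN", "US"], dtype="category")

# The lookup table of unique values. When categories are inferred
# from sortable data, they come out in sorted order.
categories = list(s.cat.categories)

# One small integer per row, pointing into the categories table.
codes = s.cat.codes.tolist()

print(categories)
print(codes)
print(s.cat.codes.dtype)  # a compact integer type for small tables
```

Because the categories were inferred here, code 0 is "DE" rather than the first value seen.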
Categoricals can be ordered or unordered. Ordered ones unlock comparisons like df["tier"] >= "gold", which is useful for plan tiers, severity levels, and grades. Unordered is the default and right for things like country codes.
Hands-on Example
Convert a column and inspect the memory drop.
import pandas as pd

# Two low-cardinality string columns, 600,000 rows each.
df = pd.DataFrame({
    "country": ["US", "IN", "US", "DE", "IN", "US"] * 100_000,
    "tier": ["free", "pro", "gold", "free", "pro", "gold"] * 100_000,
})

print(df["country"].memory_usage(deep=True))  # ~37 MB (varies by version)
df["country"] = df["country"].astype("category")
print(df["country"].memory_usage(deep=True))  # ~600 KB
For ordered categories, declare the order explicitly.
tier_order = pd.CategoricalDtype(
    categories=["free", "pro", "gold", "platinum"], ordered=True
)
df["tier"] = df["tier"].astype(tier_order)

# Keep only the rows at gold or above.
gold_and_up = df[df["tier"] >= "gold"]
print(gold_and_up.head())
Now the comparison knows that gold outranks pro and free, even though the underlying values are strings.
object dtype (string per row):
  row 0: "US"  row 1: "IN"  row 2: "US"  row 3: "DE"  ...
  ^ each cell holds a full Python string

category dtype:
  codes:      [2, 1, 2, 0, 1, 2, ...]  <- int8 per row
  categories: ["DE", "IN", "US"]       <- stored once, sorted when inferred
  ordered:    False

memory: ~ N * 1 byte + tiny table
    vs N * ~50 bytes for object strings

A groupby on a categorical is also faster, because Pandas can group directly on the integer codes instead of hashing strings row by row.
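One way to check the groupby claim on your own machine is to time the same aggregation on both dtypes. This is a rough sketch on synthetic data: the column names, the row count, and the timed_sum helper are all hypothetical, and a single perf_counter measurement is only indicative.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data: a low-cardinality key and random values (illustrative only).
rng = np.random.default_rng(0)
n = 200_000
frame = pd.DataFrame({
    "country": rng.choice(["US", "IN", "DE"], size=n),
    "value": rng.random(n),
})
frame["country_cat"] = frame["country"].astype("category")


def timed_sum(key):
    """Group by `key`, sum `value`, and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = frame.groupby(key, observed=True)["value"].sum()
    return result, time.perf_counter() - start


obj_result, obj_seconds = timed_sum("country")
cat_result, cat_seconds = timed_sum("country_cat")

print(f"string key:   {obj_seconds:.4f}s")
print(f"category key: {cat_seconds:.4f}s")
```

Both calls produce the same sums, so only the timing should differ. For a steadier comparison, repeat each measurement several times.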
Trade-offs
The big wins are memory and group speed, both substantial when the cardinality is low relative to the row count. For a column with millions of rows and a few dozen unique values, the gain can be an order of magnitude.
The downside is rigidity. Assigning a value that is not in the categories raises an error. That is useful as a guard rail and annoying when ingesting messy data. Call .cat.add_categories first, or convert the column back to object during a noisy load and re-categorise after cleaning.
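The guard rail and the way around it look roughly like this. The tier values are made up for illustration, and the except clause catches both TypeError and ValueError because the exact exception type is an assumption across Pandas versions.

```python
import pandas as pd

s = pd.Series(["free", "pro", "free"], dtype="category")

# Assigning a value outside the known categories is rejected.
try:
    s.iloc[0] = "gold"
    raised = False
except (TypeError, ValueError):
    raised = True
print("assignment rejected:", raised)

# Register the new category first; then the same assignment succeeds.
s = s.cat.add_categories(["gold"])
s.iloc[0] = "gold"
print(s.tolist())
print(list(s.cat.categories))
```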
Joins are the most common footgun. Merging on two categoricals with different category lists silently downgrades the key column to a non-categorical dtype, typically object, throwing away your memory savings. Align the categories before merging, or accept the downgrade explicitly.
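The downgrade is easy to miss because nothing warns you. This sketch uses two tiny hypothetical frames whose key columns have different category lists, then checks the dtype after the merge.

```python
import pandas as pd

# Two hypothetical frames; the category lists differ ("IN" vs "DE").
left = pd.DataFrame({
    "country": pd.Categorical(["US", "IN"]),
    "users": [10, 20],
})
right = pd.DataFrame({
    "country": pd.Categorical(["US", "DE"]),
    "region": ["AMER", "EMEA"],
})

merged = left.merge(right, on="country", how="left")

# The key column is no longer categorical after the merge.
is_categorical = isinstance(merged["country"].dtype, pd.CategoricalDtype)
print(merged["country"].dtype, is_categorical)
```

Checking the dtype right after a merge is a cheap way to notice that the savings are gone.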
Sorting an unordered categorical sorts by category order, not by string order, which can surprise readers of your code. Make the ordering explicit when it matters.
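A short sketch of the surprise, using a made-up tier column whose categories are deliberately listed in a non-alphabetical order. The last step uses reorder_categories to state the intended rank explicitly.

```python
import pandas as pd

# Categories deliberately listed in a non-alphabetical order.
dtype = pd.CategoricalDtype(["gold", "free", "pro"])
s = pd.Series(["pro", "free", "gold"], dtype=dtype)

# Sorting follows the order of the category list...
by_category = s.sort_values().tolist()

# ...which differs from sorting the same values as plain strings.
by_string = s.astype(str).sort_values().tolist()

# Stating the intended rank makes the sort order explicit.
ranked = s.cat.reorder_categories(["free", "pro", "gold"], ordered=True)
by_rank = ranked.sort_values().tolist()

print(by_category)
print(by_string)
print(by_rank)
```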
Practical Tips
Categorise after cleaning, not before. Apply dtype changes once the values are stable; otherwise you fight category errors during every transform.
Use ordered categoricals for anything with a natural rank: education level, severity, tier, day of week. Comparison operators and min/max then do the right thing automatically.
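As a small illustration with hypothetical severity levels, min, max, and comparisons all follow the declared rank rather than alphabetical order.

```python
import pandas as pd

# Hypothetical severity scale, declared from lowest to highest.
severity = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
s = pd.Series(["medium", "low", "high", "low"], dtype=severity)

lowest = s.min()
highest = s.max()
at_least_medium = (s >= "medium").tolist()

print(lowest, highest)
print(at_least_medium)
```

Alphabetically, "high" would sort before "low"; the ordered dtype is what makes max return "high".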
Match categories across DataFrames before merging. Calling s.cat.set_categories(target_categories) on both key columns keeps the dtype after the join.
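A minimal sketch of that alignment, assuming two hypothetical frames and taking the union of both category lists as the target. Other targets, such as a fixed master list, would work the same way.

```python
import pandas as pd

# Two hypothetical frames whose key columns have different categories.
left = pd.DataFrame({
    "country": pd.Categorical(["US", "IN"]),
    "users": [10, 20],
})
right = pd.DataFrame({
    "country": pd.Categorical(["US", "DE"]),
    "region": ["AMER", "EMEA"],
})

# Use the union of both category lists as the shared target.
target_categories = sorted(
    set(left["country"].cat.categories) | set(right["country"].cat.categories)
)
left["country"] = left["country"].cat.set_categories(target_categories)
right["country"] = right["country"].cat.set_categories(target_categories)

merged = left.merge(right, on="country", how="left")
print(merged["country"].dtype)
```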
Remember remove_unused_categories after filtering. A categorical keeps every original category even if no row uses it, which can swell groupby output with empty groups whenever groupby runs with observed=False.
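The effect shows up on even a tiny made-up frame. Here observed=False is passed explicitly so that unused categories appear in the output regardless of the default in your Pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "country": pd.Categorical(["US", "IN", "US", "DE"]),
    "value": [1, 2, 3, 4],
})

# Filtering drops rows, but the dtype still remembers all three categories.
us_only = df[df["country"] == "US"].copy()
before = us_only.groupby("country", observed=False)["value"].sum()

# Dropping the unused categories removes the empty groups.
us_only["country"] = us_only["country"].cat.remove_unused_categories()
after = us_only.groupby("country", observed=False)["value"].sum()

print(before)
print(after)
```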
For string operations, use .str accessors as usual; Pandas applies them efficiently against the small categories table rather than the full row count. Note that the result comes back as a regular string column, not a categorical.
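A brief sketch with made-up lowercase codes shows the .str accessor working on a categorical. It also shows rename_categories as one possible alternative when the goal is to transform the labels while keeping the categorical dtype.

```python
import pandas as pd

s = pd.Series(["us", "in", "us", "de"], dtype="category")

# .str methods work directly on a categorical column.
upper = s.str.upper()
still_categorical = isinstance(upper.dtype, pd.CategoricalDtype)

# To transform the labels and stay categorical, rename the categories.
renamed = s.cat.rename_categories(str.upper)

print(upper.tolist(), still_categorical)
print(renamed.dtype, list(renamed.cat.categories))
```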
Watch IO. Parquet preserves categoricals natively; CSV does not. Round-tripping through CSV will quietly drop you back to a plain string dtype.
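The CSV half of this can be sketched with an in-memory buffer, so no file is needed. The frame is made up. Passing dtype to read_csv is one way to get the categorical back on load, although the category list is then re-inferred from the file contents.

```python
import io

import pandas as pd

df = pd.DataFrame({"country": pd.Categorical(["US", "IN", "US"])})

# Write to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# A plain read loses the categorical dtype.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Asking for the dtype explicitly restores it.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"country": "category"})

print(plain["country"].dtype)
print(restored["country"].dtype)
```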
Wrap-up
The Categorical dtype is the cheapest performance win in Pandas for any dataset with low-cardinality string columns. It trades a little rigidity for big memory savings and faster groupby. Apply it after cleaning, use ordered categories where rank matters, and align categories before merges. Your downstream code then stays fast without sacrificing readability.