Pandas Categorical Data Tutorial

Beginner 9 min read

What you'll learn

✓ What the Categorical dtype actually stores
✓ How it cuts memory and speeds up groupby
✓ Ordered vs unordered categories
✓ How to add, remove, and reorder categories safely
✓ Common pitfalls when joining categoricals

Prerequisites

• Basic Pandas DataFrame usage

What and Why

Real datasets are full of repeated strings: country codes, plan tiers, status fields, product SKUs. Storing each value as a Python object string wastes memory and slows grouping. A column with one million rows and twelve unique countries should not hold one million separate strings.

The Categorical dtype solves this by storing each row as a small integer code that points into a tiny lookup table of unique values. Memory drops, group operations get faster, and you gain explicit control over allowed values.

Mental Model

A Categorical Series has two parts: an array of integer codes, one per row, and a categories index listing the unique values in order. Reads transparently look up codes in the categories array, so you still see strings everywhere in your code.
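The two parts can be inspected directly through the .cat accessor. The snippet below is a minimal sketch on a four-row Series made up for illustration; it relies on categories being sorted by default when they are inferred from the data.

```python
import pandas as pd

s = pd.Series(["US", "IN", "US", "DE"], dtype="category")

# The lookup table: unique values, sorted by default when inferred.
categories = list(s.cat.categories)   # ['DE', 'IN', 'US']

# One small integer per row, indexing into the lookup table.
codes = s.cat.codes.tolist()          # [2, 1, 2, 0]

# Reading the Series maps each code back to its category.
decoded = [categories[c] for c in codes]
print(categories, codes, decoded)
```

With this few categories the codes are stored as int8, which is where the one-byte-per-row figure comes from.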

Categoricals can be ordered or unordered. Ordered ones unlock comparisons like df["tier"] >= "gold", which is useful for plan tiers, severity levels, and grades. Unordered is the default and is the right choice for things like country codes, where no value outranks another.

Hands-on Example

Convert a column and inspect the memory drop.

import pandas as pd

df = pd.DataFrame({
    "country": ["US", "IN", "US", "DE", "IN", "US"] * 100_000,
    "tier":    ["free", "pro", "gold", "free", "pro", "gold"] * 100_000,
})

print(df["country"].memory_usage(deep=True))   # ~37 MB

df["country"] = df["country"].astype("category")
print(df["country"].memory_usage(deep=True))   # ~600 KB

For ordered categories, declare the order explicitly.

tier_order = pd.CategoricalDtype(
    categories=["free", "pro", "gold", "platinum"], ordered=True
)
df["tier"] = df["tier"].astype(tier_order)

df[df["tier"] >= "gold"]

Now the comparison knows that gold outranks pro and free, even though the underlying values are strings.

object dtype (string per row):
row 0: "US"   row 1: "IN"   row 2: "US"   row 3: "DE"  ...
      ^               each cell holds a full Python string

category dtype:
codes:      [2, 1, 2, 0, 1, 2, ...]   <- int8 per row
categories: ["DE", "IN", "US"]        <- stored once, sorted by default
ordered:    False

memory: ~ N * 1 byte + tiny table
      vs N * ~50 bytes for object strings

A Categorical column is integer codes plus a small lookup table

groupby on a categorical is also faster, because Pandas can group directly on integer codes instead of hashing strings row by row.
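As a rough sketch, the same aggregation can be run on both dtypes to confirm that only the storage changes, not the result. The amount column is a made-up example, and timing is left out because it depends on the machine and pandas version; wrapping either call in timeit on your own data is the honest way to measure the gain.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "IN", "US", "DE", "IN", "US"] * 1_000,
    "amount": range(6_000),   # hypothetical numeric column
})

# Group on the plain string column.
by_string = df.groupby("country")["amount"].sum()

# Group on the categorical version of the same column.
df_cat = df.assign(country=df["country"].astype("category"))
# observed=True limits the output to categories that actually occur.
by_category = df_cat.groupby("country", observed=True)["amount"].sum()

same_totals = by_string.to_dict() == by_category.to_dict()
print(same_totals)
```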

Trade-offs

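The big wins are memory and group speed, both substantial when the cardinality is low relative to the row count. For a column with millions of rows and a few dozen unique values, the gain can be an order of magnitude or more. When nearly every value is unique, the lookup table is as large as the data itself and the gain disappears.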
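The effect of cardinality is easy to check on synthetic data. This is a sketch only: saving is a hypothetical helper, the two Series are invented for illustration, and the exact ratios vary with the Python and pandas versions in use.

```python
import pandas as pd

n = 60_000
low = pd.Series(["US", "IN", "DE"] * (n // 3))        # 3 unique values
high = pd.Series([f"user-{i}" for i in range(n)])     # every value unique


def saving(s):
    """Ratio of plain-string memory to categorical memory (hypothetical helper)."""
    before = s.memory_usage(index=False, deep=True)
    after = s.astype("category").memory_usage(index=False, deep=True)
    return before / after


low_ratio = saving(low)     # well above 1: categorical is much smaller
high_ratio = saving(high)   # below 1: codes are pure overhead
print(round(low_ratio, 1), round(high_ratio, 2))
```

A ratio below 1 for the all-unique column means the categorical version is larger than the original, since it stores every string once plus an integer code per row.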

The downside is rigidity. Assigning a value that is not in the categories raises an error. That is useful as a guard rail and annoying when ingesting messy data. Call add_categories first, or keep the column as object during a noisy load and categorise it after cleaning.
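A minimal sketch of the guard rail and the add_categories workaround follows. The exception type has differed between pandas versions, so the example catches both TypeError and ValueError rather than assuming one.

```python
import pandas as pd

s = pd.Series(["free", "pro", "free"], dtype="category")

# Assigning an unknown value is rejected. The exception type has varied
# across pandas versions, so this sketch catches both candidates.
try:
    s.iloc[0] = "gold"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Register the new category first, then the assignment succeeds.
s = s.cat.add_categories(["gold"])
s.iloc[0] = "gold"
print(rejected, s.tolist(), list(s.cat.categories))
```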

Joins are the most common foot gun. Merging on two categorical key columns whose category lists differ silently returns the key column as plain strings, throwing away your memory savings. Align the categories before merging, or accept the downgrade explicitly.
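The downgrade is easy to miss because nothing warns about it. The sketch below uses two small made-up frames (the users and region columns are illustrative) and checks the dtype of the key after the merge.

```python
import pandas as pd

left = pd.DataFrame({"country": ["US", "IN"], "users": [10, 20]})
right = pd.DataFrame({"country": ["US", "DE"], "region": ["AMER", "EMEA"]})

left["country"] = left["country"].astype("category")     # categories: IN, US
right["country"] = right["country"].astype("category")   # categories: DE, US

# The key columns are both categorical, but their category lists differ.
merged = left.merge(right, on="country", how="outer")

key_is_categorical = isinstance(merged["country"].dtype, pd.CategoricalDtype)
print(merged["country"].dtype, key_is_categorical)
```

A dtype check like this after each merge is a cheap way to catch the problem early in a pipeline.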

Sorting an unordered categorical sorts by category order, not by string order, which can surprise readers of your code. Make the ordering explicit when it matters.
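To make the difference concrete, here is a small sketch with an unordered categorical whose category list is deliberately not alphabetical. The tier values are illustrative.

```python
import pandas as pd

tier = pd.Series(
    ["pro", "free", "gold", "free"],
    # Unordered, but the category list has its own sequence.
    dtype=pd.CategoricalDtype(["free", "pro", "gold"]),
)

# Sorting the categorical follows the category list: free, pro, gold.
by_category = tier.sort_values().tolist()

# Sorting the same values as plain strings is alphabetical.
by_string = tier.astype(str).sort_values().tolist()

print(by_category)   # ['free', 'free', 'pro', 'gold']
print(by_string)     # ['free', 'free', 'gold', 'pro']
```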

Practical Tips

Categorise after cleaning, not before. Apply dtype changes once the values are stable; otherwise you fight category errors during every transform.

Use ordered categoricals for anything with a natural rank: education level, severity, tier, day of week. Comparison operators and min/max then do the right thing automatically.
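A short sketch with a made-up severity column shows the effect. Alphabetically "medium" would be the maximum; with an ordered dtype the declared rank decides instead.

```python
import pandas as pd

severity = pd.Series(
    ["low", "critical", "medium", "low"],
    dtype=pd.CategoricalDtype(
        ["low", "medium", "high", "critical"], ordered=True
    ),
)

# min and max follow the declared rank, not alphabetical order.
lowest, highest = severity.min(), severity.max()

# Comparisons against a category also use the rank.
at_least_medium = (severity >= "medium").tolist()

print(lowest, highest, at_least_medium)
```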

Match categories across DataFrames before merging. A small helper s.cat.set_categories(target_categories) keeps the dtype after the join.
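One way to write the helper named in the tip above is sketched here. align_categories is a hypothetical name, and the frames are invented for illustration; the idea is simply to give both key columns the union of their category lists before the merge.

```python
import pandas as pd


def align_categories(a, b):
    """Return copies of a and b sharing one category list (hypothetical helper)."""
    shared = a.cat.categories.union(b.cat.categories)
    return a.cat.set_categories(shared), b.cat.set_categories(shared)


left = pd.DataFrame({
    "country": pd.Series(["US", "IN"], dtype="category"),
    "users": [10, 20],
})
right = pd.DataFrame({
    "country": pd.Series(["US", "DE"], dtype="category"),
    "region": ["AMER", "EMEA"],
})

left["country"], right["country"] = align_categories(
    left["country"], right["country"]
)

# Both keys now share the same categorical dtype, so the merge keeps it.
merged = left.merge(right, on="country", how="left")
print(merged["country"].dtype)
```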

Remember remove_unused_categories after filtering. A categorical keeps every original category even if no row uses it, which can swell groupby output with empty groups when groupby runs with observed=False, the default in older Pandas releases.
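The sketch below filters out one country and shows the leftover category surfacing as an empty group. The amount column is made up, and observed=False is passed explicitly so the behaviour does not depend on the installed version's default.

```python
import pandas as pd

df = pd.DataFrame({
    "country": pd.Series(["US", "IN", "US", "DE"], dtype="category"),
    "amount": [1, 2, 3, 4],   # hypothetical numeric column
})
subset = df[df["country"] != "DE"].copy()

# The filtered column still lists DE as a category.
stale = list(subset["country"].cat.categories)

# observed=False asks groupby to report every category, used or not.
with_empty = subset.groupby("country", observed=False)["amount"].sum()

# Dropping unused categories removes the empty DE group.
subset["country"] = subset["country"].cat.remove_unused_categories()
trimmed = subset.groupby("country", observed=False)["amount"].sum()

print(with_empty.to_dict(), trimmed.to_dict())
```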

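For string operations, use the .str accessor as usual; Pandas runs the operation on the small categories table rather than on every row. The result comes back as a regular string Series, not a categorical, so re-categorise it if you still want the savings.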
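As a sketch, here is .str on a categorical next to rename_categories, which transforms the labels while keeping the categorical dtype. The sample values are illustrative, and rename_categories only suits transformations that keep the categories distinct.

```python
import pandas as pd

codes = pd.Series(["us", "in", "us", "de"], dtype="category")

# .str works on a categorical, but the result is a plain string Series.
upper = codes.str.upper()
still_categorical = isinstance(upper.dtype, pd.CategoricalDtype)

# Renaming the categories transforms the labels and keeps the dtype.
renamed = codes.cat.rename_categories(str.upper)

print(upper.tolist(), still_categorical, renamed.dtype)
```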

Watch IO. Parquet preserves categoricals natively; CSV does not. Round-tripping through CSV will quietly drop you back to plain strings unless you pass the dtype again when reading.
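The CSV half of this tip can be checked in memory without touching disk. This is a minimal sketch using io.StringIO as a stand-in for a file; the Parquet half is not shown because it needs an optional engine such as pyarrow.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "tier": pd.Series(["free", "pro", "free"], dtype="category"),
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# A plain read loses the dtype: CSV has nowhere to record it.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype on read restores the categorical.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"tier": "category"})

print(plain["tier"].dtype, restored["tier"].dtype)
```

For an ordered categorical, passing a pd.CategoricalDtype with ordered=True as the dtype is needed, since the string "category" carries no ordering.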

Wrap-up

The Categorical dtype is one of the cheapest performance wins in Pandas for any dataset with low-cardinality string columns. It trades a little rigidity for big memory savings and faster groupby. Apply it after cleaning, use ordered categories where rank matters, and align categories before merges. Your downstream code then stays fast without sacrificing readability.