Pandas DataFrames: Reading, Selecting, and Filtering

Beginner 12 min read

What you'll learn

✓How to read CSV and JSON files into a DataFrame
✓How to use head, info, and describe to inspect a new dataset
✓How to select one or more columns
✓The difference between loc (labels) and iloc (positions)
✓How to filter rows with boolean masks
✓How to sort and how to use value_counts

Prerequisites

•What Is Pandas? introduces the DataFrame
•Basic Python — see What Is Python?

The DataFrame is where almost all pandas work happens. This post covers the moves you’ll use every day: reading data, looking at it, selecting columns, picking rows by label or position, filtering, and counting. By the end you should be able to handle 80% of routine data tasks without looking anything up.

If you’re new to pandas, start with What Is Pandas? first.

Reading data in

Pandas reads almost every tabular format you’ll encounter. The two most common:

import pandas as pd

df = pd.read_csv("sales.csv")
df = pd.read_json("sales.json")

read_csv is the workhorse. Useful arguments to know:

df = pd.read_csv(
    "sales.csv",
    sep=",",                # delimiter
    parse_dates=["date"],   # turn date strings into real datetimes
    dtype={"sku": str},     # force a column type
    nrows=1000,             # only first N rows — useful for huge files
)

For Excel, Parquet, and SQL there are matching helpers: read_excel, read_parquet, read_sql. The argument patterns are similar.

For the rest of this post, assume we have:

import pandas as pd

df = pd.DataFrame({
    "date":     pd.to_datetime(["2026-01-04", "2026-01-04", "2026-01-05",
                                 "2026-01-05", "2026-01-06", "2026-01-06"]),
    "product":  ["apple", "banana", "apple", "cherry", "banana", "apple"],
    "region":   ["north", "north", "south", "south", "north", "south"],
    "units":    [12, 7, 15, 3, 11, 9],
    "revenue":  [18.0, 8.4, 22.5, 11.25, 13.2, 13.5],
})

Inspecting: head, info, describe

The first three moves on any new dataset:

print(df.head(3))
# output:
#         date product region  units  revenue
# 0 2026-01-04   apple  north     12    18.00
# 1 2026-01-04  banana  north      7     8.40
# 2 2026-01-05   apple  south     15    22.50

df.tail(3) shows the last three. df.shape returns a (rows, columns) tuple. df.columns lists the column names.

print(df.info())
# output:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 6 entries, 0 to 5
# Data columns (total 5 columns):
#  #   Column   Non-Null Count  Dtype
# ---  ------   --------------  -----
#  0   date     6 non-null      datetime64[ns]
#  1   product  6 non-null      object
#  2   region   6 non-null      object
#  3   units    6 non-null      int64
#  4   revenue  6 non-null      float64

info is how you spot the two most common data problems: a column that should be numeric loaded as object (string), or unexpected nulls.

print(df.describe())
# output:
#            units    revenue
# count   6.000000   6.000000
# mean    9.500000  14.475000
# std     4.230839   5.067876
# ...

describe() defaults to numeric columns. For strings, try df.describe(include="object") — you’ll see counts and the most common value per column.

Selecting columns

A single column with bracket notation:

print(df["product"])
# output:
# 0     apple
# 1    banana
# 2     apple
# 3    cherry
# 4    banana
# 5     apple
# Name: product, dtype: object

This returns a Series.

Multiple columns with a list inside the brackets:

print(df[["product", "revenue"]])
# output:
#   product  revenue
# 0   apple    18.00
# 1  banana     8.40
# 2   apple    22.50
# 3  cherry    11.25
# 4  banana    13.20
# 5   apple    13.50

This returns a DataFrame. The double brackets confuse beginners — the inner brackets are a Python list of column names; the outer brackets are the indexing operator. Two different things.

Dot access (df.product) works for simple column names too, but breaks for names with spaces or names that collide with method names. Use brackets in real code.

loc vs iloc

The single concept that trips up almost every new pandas user.

df.loc[...] selects by label — the row’s index value and the column’s name.
df.iloc[...] selects by integer position — like a NumPy array.

With our DataFrame indexed 0..5 these often coincide. They diverge as soon as the index changes.

# By label
print(df.loc[0])           # row with index label 0
print(df.loc[0, "product"])  # cell at row 0, column "product"
# output: apple

# By position
print(df.iloc[0])           # first row
print(df.iloc[0, 1])        # first row, second column
# output: apple

Slicing:

print(df.loc[1:3, ["product", "units"]])     # inclusive of 3
print(df.iloc[1:3, [1, 3]])                  # exclusive of 3, like Python

A subtle gotcha: df.loc slicing is inclusive on both ends. df.iloc follows normal Python half-open ranges. Internalise this once and you’ll save yourself hours.

Setting values uses the same syntax:

df.loc[0, "units"] = 13

A rule that keeps you out of trouble: always use loc or iloc when setting values. Avoid chained indexing like df[df["product"] == "apple"]["units"] = 100. It may or may not actually set the values, and pandas will warn you.

Boolean filtering

The pattern that does most of the work. Compare a column against a value to get a boolean Series, then index the DataFrame with it.

mask = df["units"] > 10
print(mask)
# output:
# 0     True
# 1    False
# 2     True
# 3    False
# 4     True
# 5    False
# Name: units, dtype: bool

print(df[mask])
# output:
#         date product region  units  revenue
# 0 2026-01-04   apple  north     12    18.00
# 2 2026-01-05   apple  south     15    22.50
# 4 2026-01-06  banana  north     11    13.20

You can combine masks with & (and), | (or), and ~ (not). Parentheses around each clause are required because of operator precedence:

big_apple = df[(df["product"] == "apple") & (df["units"] > 10)]
print(big_apple)
# output:
#         date product region  units  revenue
# 0 2026-01-04   apple  north     12    18.00
# 2 2026-01-05   apple  south     15    22.50

Forgetting parentheses gives a cryptic TypeError. When it happens, look for the missing brackets first.

Membership checks use .isin:

print(df[df["region"].isin(["north", "east"])])

String columns get a .str accessor:

print(df[df["product"].str.startswith("a")])
# output: apples only

For null handling, df["col"].isna() and df["col"].notna() produce masks the same way.

Try it yourself. Using the DataFrame above, write a single expression that returns all rows where the region is “south” and the revenue is greater than 11. Then write another that returns all rows where the product is not apple. Pandas filtering is a muscle — building it once with each operator (&, |, ~) is what makes it stick.

Sorting

sort_values sorts rows by one or more columns:

print(df.sort_values("units"))
# output: rows ordered from smallest units to largest

print(df.sort_values("units", ascending=False))
# output: largest first

print(df.sort_values(["product", "units"], ascending=[True, False]))
# output: grouped by product alphabetically, within each group by units descending

sort_values returns a new DataFrame. To sort in place, pass inplace=True — though new pandas code usually prefers reassignment for clarity:

df = df.sort_values("date")

sort_index sorts by the row index, which becomes useful after operations that rearrange it.

value_counts

The single most useful one-liner for exploring categorical data.

print(df["product"].value_counts())
# output:
# product
# apple     3
# banana    2
# cherry    1
# Name: count, dtype: int64

Want proportions instead?

print(df["product"].value_counts(normalize=True))
# output:
# product
# apple     0.500000
# banana    0.333333
# cherry    0.166667
# Name: proportion, dtype: float64

Combined with filtering, value_counts answers most “what is the breakdown of X?” questions in one line.

print(df[df["region"] == "north"]["product"].value_counts())
# output:
# product
# apple     1
# banana    2
# Name: count, dtype: int64

A small worked example

Pull all the moves into one tiny analysis.

import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["date"])

# Quick look
print(df.head())
print(df.info())

# Just the rows we care about: north region, units > 5
mask = (df["region"] == "north") & (df["units"] > 5)
north_big = df.loc[mask, ["date", "product", "units", "revenue"]]

# Sorted by revenue, top to bottom
north_big = north_big.sort_values("revenue", ascending=False)
print(north_big)

# Counts by product for context
print(df["product"].value_counts())

# Write a result file
north_big.to_csv("north_big.csv", index=False)

That’s a complete cycle — load, inspect, filter, project (select columns), sort, count, export. Most analyst code is variations on this skeleton.

Try it yourself. Take any CSV you have. In one script: (1) load it with read_csv, (2) print info(), (3) pick two columns, (4) keep rows where one numeric column is above its mean (hint: df["col"] > df["col"].mean()), (5) sort by another column, (6) print value_counts of a categorical column. If you can do that without looking anything up, you’re already useful in pandas.

Common pitfalls

SettingWithCopyWarning. Triggered by chained indexing on assignment. Use df.loc[mask, "col"] = value instead of df[mask]["col"] = value.
== with floats. Floating-point equality is unreliable. Use np.isclose(df["x"], 1.5) for tolerance.
loc is inclusive, iloc is exclusive. A surprising number of bugs trace back here.
Boolean ops without parentheses. Always wrap each comparison.
Mutating while iterating. Don’t loop over a DataFrame and modify it. Build a mask and assign once.

Recap

You now know:

read_csv and read_json load tabular data; head, info, describe inspect it
df["col"] returns a Series; df[["a", "b"]] returns a DataFrame
loc selects by label, iloc by integer position
Boolean masks drive filtering: df[df["x"] > 10], combined with &, |, ~ and parentheses
sort_values orders rows by one or more columns
value_counts summarises categorical columns in a line

Next steps

DataFrames feed naturally into machine-learning workflows — pandas in, model out.

→ Next: What Is Machine Learning?

Questions or feedback? Email codeloomdevv@gmail.com.