Pandas DataFrames: Reading, Selecting, and Filtering
A practical guide to the daily DataFrame moves — read_csv and read_json, head and info, column selection, loc vs iloc, boolean filtering, sorting, and value_counts.
What you'll learn
- ✓How to read CSV and JSON files into a DataFrame
- ✓How to use head, info, and describe to inspect a new dataset
- ✓How to select one or more columns
- ✓The difference between loc (labels) and iloc (positions)
- ✓How to filter rows with boolean masks
- ✓How to sort and how to use value_counts
Prerequisites
- •What Is Pandas? introduces the DataFrame
- •Basic Python — see What Is Python?
The DataFrame is where almost all pandas work happens. This post covers the moves you’ll use every day: reading data, looking at it, selecting columns, picking rows by label or position, filtering, and counting. By the end you should be able to handle 80% of routine data tasks without looking anything up.
If you’re new to pandas, start with What Is Pandas? first.
Reading data in
Pandas reads almost every tabular format you’ll encounter. The two most common:
import pandas as pd
df = pd.read_csv("sales.csv")
df = pd.read_json("sales.json")
read_csv is the workhorse. Useful arguments to know:
df = pd.read_csv(
"sales.csv",
sep=",", # delimiter
parse_dates=["date"], # turn date strings into real datetimes
dtype={"sku": str}, # force a column type
nrows=1000, # only first N rows — useful for huge files
)
For Excel, Parquet, and SQL there are matching helpers: read_excel, read_parquet, read_sql. The argument patterns are similar.
For the rest of this post, assume we have:
import pandas as pd
df = pd.DataFrame({
"date": pd.to_datetime(["2026-01-04", "2026-01-04", "2026-01-05",
"2026-01-05", "2026-01-06", "2026-01-06"]),
"product": ["apple", "banana", "apple", "cherry", "banana", "apple"],
"region": ["north", "north", "south", "south", "north", "south"],
"units": [12, 7, 15, 3, 11, 9],
"revenue": [18.0, 8.4, 22.5, 11.25, 13.2, 13.5],
})
Inspecting: head, info, describe
The first three moves on any new dataset:
print(df.head(3))
# output:
# date product region units revenue
# 0 2026-01-04 apple north 12 18.00
# 1 2026-01-04 banana north 7 8.40
# 2 2026-01-05 apple south 15 22.50
df.tail(3) shows the last three. df.shape returns a (rows, columns) tuple. df.columns lists the column names.
print(df.info())
# output:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 6 entries, 0 to 5
# Data columns (total 5 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 date 6 non-null datetime64[ns]
# 1 product 6 non-null object
# 2 region 6 non-null object
# 3 units 6 non-null int64
# 4 revenue 6 non-null float64
info is how you spot the two most common data problems: a column that should be numeric loaded as object (string), or unexpected nulls.
print(df.describe())
# output:
# units revenue
# count 6.000000 6.000000
# mean 9.500000 14.475000
# std 4.230839 5.067876
# ...
describe() defaults to numeric columns. For strings, try df.describe(include="object") — you’ll see counts and the most common value per column.
Selecting columns
A single column with bracket notation:
print(df["product"])
# output:
# 0 apple
# 1 banana
# 2 apple
# 3 cherry
# 4 banana
# 5 apple
# Name: product, dtype: object
This returns a Series.
Multiple columns with a list inside the brackets:
print(df[["product", "revenue"]])
# output:
# product revenue
# 0 apple 18.00
# 1 banana 8.40
# 2 apple 22.50
# 3 cherry 11.25
# 4 banana 13.20
# 5 apple 13.50
This returns a DataFrame. The double brackets confuse beginners — the inner brackets are a Python list of column names; the outer brackets are the indexing operator. Two different things.
Dot access (df.product) works for simple column names too, but breaks for names with spaces or names that collide with method names. Use brackets in real code.
loc vs iloc
The single concept that trips up almost every new pandas user.
df.loc[...]selects by label — the row’s index value and the column’s name.df.iloc[...]selects by integer position — like a NumPy array.
With our DataFrame indexed 0..5 these often coincide. They diverge as soon as the index changes.
# By label
print(df.loc[0]) # row with index label 0
print(df.loc[0, "product"]) # cell at row 0, column "product"
# output: apple
# By position
print(df.iloc[0]) # first row
print(df.iloc[0, 1]) # first row, second column
# output: apple
Slicing:
print(df.loc[1:3, ["product", "units"]]) # inclusive of 3
print(df.iloc[1:3, [1, 3]]) # exclusive of 3, like Python
A subtle gotcha: df.loc slicing is inclusive on both ends. df.iloc follows normal Python half-open ranges. Internalise this once and you’ll save yourself hours.
Setting values uses the same syntax:
df.loc[0, "units"] = 13
A rule that keeps you out of trouble: always use loc or iloc when setting values. Avoid chained indexing like df[df["product"] == "apple"]["units"] = 100. It may or may not actually set the values, and pandas will warn you.
Boolean filtering
The pattern that does most of the work. Compare a column against a value to get a boolean Series, then index the DataFrame with it.
mask = df["units"] > 10
print(mask)
# output:
# 0 True
# 1 False
# 2 True
# 3 False
# 4 True
# 5 False
# Name: units, dtype: bool
print(df[mask])
# output:
# date product region units revenue
# 0 2026-01-04 apple north 12 18.00
# 2 2026-01-05 apple south 15 22.50
# 4 2026-01-06 banana north 11 13.20
You can combine masks with & (and), | (or), and ~ (not). Parentheses around each clause are required because of operator precedence:
big_apple = df[(df["product"] == "apple") & (df["units"] > 10)]
print(big_apple)
# output:
# date product region units revenue
# 0 2026-01-04 apple north 12 18.00
# 2 2026-01-05 apple south 15 22.50
Forgetting parentheses gives a cryptic TypeError. When it happens, look for the missing brackets first.
Membership checks use .isin:
print(df[df["region"].isin(["north", "east"])])
String columns get a .str accessor:
print(df[df["product"].str.startswith("a")])
# output: apples only
For null handling, df["col"].isna() and df["col"].notna() produce masks the same way.
Try it yourself. Using the DataFrame above, write a single expression that returns all rows where the region is “south” and the revenue is greater than 11. Then write another that returns all rows where the product is not apple. Pandas filtering is a muscle — building it once with each operator (&, |, ~) is what makes it stick.
Sorting
sort_values sorts rows by one or more columns:
print(df.sort_values("units"))
# output: rows ordered from smallest units to largest
print(df.sort_values("units", ascending=False))
# output: largest first
print(df.sort_values(["product", "units"], ascending=[True, False]))
# output: grouped by product alphabetically, within each group by units descending
sort_values returns a new DataFrame. To sort in place, pass inplace=True — though new pandas code usually prefers reassignment for clarity:
df = df.sort_values("date")
sort_index sorts by the row index, which becomes useful after operations that rearrange it.
value_counts
The single most useful one-liner for exploring categorical data.
print(df["product"].value_counts())
# output:
# product
# apple 3
# banana 2
# cherry 1
# Name: count, dtype: int64
Want proportions instead?
print(df["product"].value_counts(normalize=True))
# output:
# product
# apple 0.500000
# banana 0.333333
# cherry 0.166667
# Name: proportion, dtype: float64
Combined with filtering, value_counts answers most “what is the breakdown of X?” questions in one line.
print(df[df["region"] == "north"]["product"].value_counts())
# output:
# product
# apple 1
# banana 2
# Name: count, dtype: int64
A small worked example
Pull all the moves into one tiny analysis.
import pandas as pd
df = pd.read_csv("sales.csv", parse_dates=["date"])
# Quick look
print(df.head())
print(df.info())
# Just the rows we care about: north region, units > 5
mask = (df["region"] == "north") & (df["units"] > 5)
north_big = df.loc[mask, ["date", "product", "units", "revenue"]]
# Sorted by revenue, top to bottom
north_big = north_big.sort_values("revenue", ascending=False)
print(north_big)
# Counts by product for context
print(df["product"].value_counts())
# Write a result file
north_big.to_csv("north_big.csv", index=False)
That’s a complete cycle — load, inspect, filter, project (select columns), sort, count, export. Most analyst code is variations on this skeleton.
Try it yourself. Take any CSV you have. In one script: (1) load it with read_csv, (2) print info(), (3) pick two columns, (4) keep rows where one numeric column is above its mean (hint: df["col"] > df["col"].mean()), (5) sort by another column, (6) print value_counts of a categorical column. If you can do that without looking anything up, you’re already useful in pandas.
Common pitfalls
- SettingWithCopyWarning. Triggered by chained indexing on assignment. Use
df.loc[mask, "col"] = valueinstead ofdf[mask]["col"] = value. ==with floats. Floating-point equality is unreliable. Usenp.isclose(df["x"], 1.5)for tolerance.locis inclusive,ilocis exclusive. A surprising number of bugs trace back here.- Boolean ops without parentheses. Always wrap each comparison.
- Mutating while iterating. Don’t loop over a DataFrame and modify it. Build a mask and assign once.
Recap
You now know:
read_csvandread_jsonload tabular data;head,info,describeinspect itdf["col"]returns a Series;df[["a", "b"]]returns a DataFramelocselects by label,ilocby integer position- Boolean masks drive filtering:
df[df["x"] > 10], combined with&,|,~and parentheses sort_valuesorders rows by one or more columnsvalue_countssummarises categorical columns in a line
Next steps
DataFrames feed naturally into machine-learning workflows — pandas in, model out.
→ Next: What Is Machine Learning?
Related: What Is Pandas?, Python Lists, Python Dictionaries.
Questions or feedback? Email codeloomdevv@gmail.com.