Feature Engineering Basics for Tabular Data

Intermediate · 12 min read

What you'll learn

✓ How to encode categorical features for linear and tree models
✓ When to scale numeric features and which scaler to use
✓ Strategies for handling missing values without losing signal
✓ How to construct interaction features and polynomial terms
✓ The subtle ways data leakage sneaks into your pipeline

Prerequisites

• A foundation in [what machine learning is](/blog/what-is-machine-learning)
• Comfort with [pandas dataframes](/blog/pandas-dataframes-basics)
• Familiarity with [train/test split and metrics](/blog/ml-train-test-split-and-metrics)

Most of the gap between a mediocre tabular model and a great one comes from feature engineering rather than from picking a better learner. Better features mean better gradients, better splits, and better generalisation. This article walks through the techniques that pay off most often, and ends with the trap that quietly invalidates more pipelines than any other: data leakage.

Encoding categorical features

Almost every real dataset has string columns. Linear models and many libraries require numeric inputs, so you must encode.

The simplest scheme is one-hot encoding, which creates a binary indicator column for each level. It works well when cardinality is low.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["Mumbai", "Delhi", "Mumbai", "Bengaluru"]})
# sparse_output requires scikit-learn 1.2+; older releases call it sparse.
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = enc.fit_transform(df[["city"]])
print(enc.get_feature_names_out(), encoded, sep="\n")

handle_unknown="ignore" is important. Unknown levels appear in production all the time; with the default setting, transform raises an error the moment a new city shows up, whereas "ignore" encodes the unseen level as a row of zeros.
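To make that behaviour concrete, here is a small sketch in which the encoder is fitted on three cities and then asked to transform a fourth. The city names and the variable names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["Mumbai", "Delhi", "Bengaluru"]})
new = pd.DataFrame({"city": ["Chennai"]})  # level never seen during fit

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["city"]])

# The default output is a sparse matrix; toarray() makes it printable.
row = enc.transform(new[["city"]]).toarray()
print(row)  # one row, three indicator columns, all zero
```

An all-zero row is indistinguishable from "none of the known cities", which is usually an acceptable fallback for a rare new level.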

For high-cardinality fields like zip codes or user IDs, one-hot encoding produces too many columns. Common alternatives are ordinal encoding for tree models, which simply assigns each level an integer, and target encoding, which replaces each level with the mean target value computed on the training set only.

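Tree-based models tolerate ordinal encoding because a sequence of threshold splits on the integer codes can isolate any group of levels, although an arbitrary grouping may take several splits. Linear models cannot, because they treat the code as a magnitude and assume that the order and spacing of the integers are meaningful.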
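The two high-cardinality alternatives can be sketched as follows. This is a minimal illustration on made-up data: the column names zip and churned are assumptions, and the target encoding is done by hand with pandas rather than with any particular library encoder.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "zip" is the categorical column, "churned" the target.
train = pd.DataFrame({
    "zip": ["560001", "110001", "560001", "400001", "110001", "560001"],
    "churned": [1, 0, 1, 0, 0, 0],
})
test = pd.DataFrame({"zip": ["560001", "999999"]})  # second level is unseen

# Ordinal encoding: unseen levels map to -1 instead of raising an error.
ord_enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ord_enc.fit(train[["zip"]])
test_ord = ord_enc.transform(test[["zip"]])

# Target encoding: per-level mean of the target, learned from train only.
level_means = train.groupby("zip")["churned"].mean()
global_mean = train["churned"].mean()  # fallback for unseen levels
test_te = test["zip"].map(level_means).fillna(global_mean)

print(test_ord.ravel())
print(test_te.round(3).tolist())
```

Note that mapping the same means back onto the training rows lets each row see its own label. Computing the means out-of-fold is a common remedy.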

Scaling numeric features

Linear models, k-nearest neighbours, and neural networks are all sensitive to feature scale. A column ranging from 0 to 1 million will dominate gradients and distance calculations next to a column ranging from 0 to 1.

StandardScaler subtracts the mean and divides by the standard deviation, producing roughly zero-mean, unit-variance features. MinMaxScaler rescales each feature into a chosen range, typically [0, 1]. RobustScaler uses the median and interquartile range (IQR) instead, which keeps outliers from blowing up the scale.

from sklearn.preprocessing import StandardScaler, RobustScaler
import numpy as np

x = np.array([[10], [20], [30], [40], [10_000]])
print("standard:\n", StandardScaler().fit_transform(x).round(2))
print("robust:\n", RobustScaler().fit_transform(x).round(2))
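For comparison, the same array can be passed through MinMaxScaler. A short sketch, reusing the toy values from the example above, shows how a single outlier squeezes every other value towards zero:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10], [20], [30], [40], [10_000]])

# The outlier defines the top of the range, so the other four values
# end up crowded into the bottom fraction of [0, 1].
scaled = MinMaxScaler().fit_transform(x)
print(scaled.round(3).ravel())
```

When a column has extreme values like this, RobustScaler or a log transform is often the safer starting point.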

Tree-based models do not require scaling because each split compares a single feature against a threshold, so only the ordering of values within that feature matters. Skipping the scaler for a random forest is a perfectly valid choice and removes one source of bugs.

Handling missing values

Three strategies cover most real situations.

The first is deletion. Drop rows with missing values when missingness is rare and random, and drop columns where most rows are empty.
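In pandas, deletion might look like the sketch below. The column names and the 50% threshold are assumptions chosen for illustration; the right cut-off depends on your data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, np.nan, 41],
    "income": [52_000, np.nan, 61_000, 58_000],
    "fax_number": [np.nan, np.nan, np.nan, "555-0100"],  # mostly empty
})

# Drop columns where more than half the rows are missing.
col_missing = df.isna().mean()
kept = df.loc[:, col_missing <= 0.5]

# Then drop the remaining rows that still contain a missing value.
complete = kept.dropna()
print(list(kept.columns), len(complete))
```

Dropping the sparse column first matters: calling dropna() on the original frame would have discarded every row with an empty fax_number.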

The second is imputation. Replace missing entries with a fixed value: the mean or median for numeric columns, the mode or a sentinel like “Unknown” for categorical columns. scikit-learn’s SimpleImputer does this.

from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imp = SimpleImputer(strategy="median")
print(imp.fit_transform(X))

The third, and often the most informative, is to add a missingness indicator. Create a new column feature_was_missing set to 1 when the value was absent before imputation. Whether a customer answered an optional form is often itself predictive.
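With scikit-learn you do not have to build the indicator by hand: SimpleImputer accepts add_indicator=True and appends one indicator column for each feature that had missing values during fit. A minimal sketch on the same toy array as above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Output columns: imputed col 0, imputed col 1, was-missing col 0, was-missing col 1.
imp = SimpleImputer(strategy="median", add_indicator=True)
out = imp.fit_transform(X)
print(out)
```

The model now receives both the filled-in value and the fact that it was filled in.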

For more sophisticated cases, IterativeImputer models each missing column as a function of the others, but it is much slower and rarely needed for a baseline.
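If you do want to try it, a minimal sketch looks like this. IterativeImputer is still marked experimental in scikit-learn, so it has to be enabled with an extra import first. The toy data, in which the second column is roughly twice the first, is made up for illustration.

```python
import numpy as np
# Required side-effect import: it unlocks IterativeImputer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

# Each column with gaps is regressed on the other column, repeatedly.
imp = IterativeImputer(random_state=0)
filled = imp.fit_transform(X)
print(filled.round(2))
```

Observed values pass through unchanged; only the gaps are replaced with model-based estimates.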

Constructing new features

Interactions and transforms often unlock signal that a linear model could never find on its own.

import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "items": [1, 2, 3, 4],
    "price_per_item": [10.0, 5.0, 8.0, 3.0],
})

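# Interaction: order value is the product of two raw columns.
orders["total"] = orders["items"] * orders["price_per_item"]
# log1p compresses the long right tail that totals often have.
orders["log_total"] = np.log1p(orders["total"])
# Binning: intervals are right-closed, so 4.0 would fall in "cheap".
orders["price_bucket"] = pd.cut(
    orders["price_per_item"],
    bins=[0, 4, 7, 100],
    labels=["cheap", "mid", "premium"],
)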
print(orders)

Date columns are especially rich. From a single timestamp you can derive day of week, hour of day, day of month, is_weekend, and time since the previous event for the same user. Each is a candidate feature that often matters more than the raw timestamp.
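The date-derived features listed above can be sketched with the pandas .dt accessor. The events table, its column names, and the choice of hours as the unit are assumptions for illustration.

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 1],
    "ts": pd.to_datetime([
        "2024-03-01 09:15", "2024-03-02 18:40",
        "2024-03-02 23:05", "2024-03-04 07:30",
    ]),
})

# Sort so that "previous event" is well defined within each user.
events = events.sort_values(["user_id", "ts"])
events["day_of_week"] = events["ts"].dt.dayofweek  # Monday=0 ... Sunday=6
events["hour"] = events["ts"].dt.hour
events["day_of_month"] = events["ts"].dt.day
events["is_weekend"] = events["day_of_week"] >= 5
# Hours since the same user's previous event; NaN for each user's first event.
events["hours_since_prev"] = (
    events.groupby("user_id")["ts"].diff().dt.total_seconds() / 3600
)
print(events)
```

The NaN in hours_since_prev for a user's first event is itself informative, so it pairs naturally with a missingness indicator.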

For polynomial interactions, PolynomialFeatures generates products of existing columns. Used judiciously on small numeric feature sets, it gives a linear model the expressive power to fit gently curved relationships.

Leakage: the silent killer

Data leakage is when information from the future or from the test set sneaks into training, producing an offline metric that looks great and a production model that fails. There are three common forms.

The first is target leakage: a feature that is essentially the target in disguise. If you are predicting whether a customer churns next month and you include “days since last login” computed when the dataset was extracted, after that month has already passed, you are letting future behaviour leak in: churned customers will show long gaps precisely because they churned. Always ask whether a feature, with that exact value, would actually be available at prediction time.
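One way to guard against this is to compute every feature as of an explicit cutoff and discard any event recorded after it. The sketch below assumes a hypothetical logins table and a hypothetical prediction date of 1 June 2024.

```python
import pandas as pd

logins = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "login_at": pd.to_datetime(
        ["2024-05-20", "2024-06-18", "2024-05-02", "2024-05-30"]
    ),
})
cutoff = pd.Timestamp("2024-06-01")  # hypothetical prediction date

# Keep only events the model could have seen at prediction time.
past = logins[logins["login_at"] < cutoff]
last_login = past.groupby("user_id")["login_at"].max()
days_since_last_login = (cutoff - last_login).dt.days
print(days_since_last_login)
```

User 1's login on 18 June is ignored, so the feature reflects what was known on the cutoff date rather than what happened afterwards.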

The second is split leakage: fitting preprocessing on the entire dataset before splitting. If you scale, impute, or target-encode using statistics computed from the full dataset, the test set has influenced training. The fix is to wrap the preprocessing inside a pipeline and call fit only on the training split.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print("clean test acc:", round(pipe.score(X_test, y_test), 3))
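The same idea extends to mixed column types. The following sketch, on made-up data with hypothetical income and city columns, uses ColumnTransformer so that imputation, scaling, and one-hot encoding are all fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and one categorical column.
train = pd.DataFrame({
    "income": [30.0, np.nan, 52.0, 75.0, 41.0, 66.0],
    "city": ["Mumbai", "Delhi", "Mumbai", "Bengaluru", "Delhi", "Bengaluru"],
})
y_train = [0, 0, 0, 1, 0, 1]
test = pd.DataFrame({"income": [np.nan, 80.0], "city": ["Chennai", "Mumbai"]})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
prep = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipe = Pipeline([("prep", prep), ("model", LogisticRegression())])

pipe.fit(train, y_train)    # every statistic is learned from train only
preds = pipe.predict(test)  # test rows are transformed, never fitted on
print(preds)
```

The missing income in the test set is filled with the training median, and the unseen city is encoded as zeros, so nothing from the test rows flows back into the fitted steps.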

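The third is temporal leakage in time series. A random split scatters later rows into the training set, so the model is trained on data from after the periods it is evaluated on. For any time-ordered problem you must split chronologically: train on earlier data and test on later data.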
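A chronological split can be sketched in a few lines. The daily toy series and the 80/20 ratio are assumptions; TimeSeriesSplit is scikit-learn's cross-validation counterpart to the single hold-out shown first.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": np.arange(10),
}).sample(frac=1.0, random_state=0)  # shuffled to mimic unordered storage

# Hold-out split: sort by time, train on the earliest 80%, test on the rest.
df = df.sort_values("ts").reset_index(drop=True)
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

# Cross-validation: every fold trains on the past and validates on the future.
folds = list(TimeSeriesSplit(n_splits=3).split(df))
for train_idx, val_idx in folds:
    print("train up to row", train_idx.max(), "| validate from row", val_idx.min())
```

Sorting before splitting is the step that is easiest to forget, because both splitters work on row position rather than on the timestamp itself.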

Wrap up

Good features beat clever models on tabular data. Encode categories appropriately for your learner, scale only when the model needs it, treat missingness as a signal, and reach for interactions and date-derived features before you reach for a bigger algorithm. Above all, build everything inside a pipeline so that preprocessing is fit only on training data and your evaluation metrics actually mean what they say.