ML Feature Engineering Techniques
Transform raw data into features that help models learn faster and generalize better with encoding, scaling, interactions, and target features.
What you'll learn
- Why feature engineering still matters in 2026
- Numerical scaling and binning techniques
- Encoding categorical variables without explosion
- Building useful interactions and date features
- Avoiding leakage when computing target-based features
Prerequisites
- Familiar with how APIs work
What and Why
Feature engineering is the craft of turning raw signals into representations that a model can learn from quickly. Gradient boosting, linear models, and even neural networks all benefit. Better features routinely beat fancier algorithms, and they are usually the cheapest improvement you can make.
The reason is simple: a model can only fit what is present in the input. If the right signal lives across two columns, a transformation that creates a third column with the answer can save a model from learning a complicated interaction from scratch.
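To make the "third column" idea concrete, here is a minimal sketch on synthetic data. The column names spend and visits are hypothetical, and the target is constructed to depend on their ratio, so this illustrates the effect rather than measuring it on real data. It compares a plain least-squares fit on the two raw columns with the same fit after adding the ratio as a feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical columns: the target depends on their ratio, not on either alone.
spend = rng.uniform(1.0, 10.0, size=n)
visits = rng.uniform(1.0, 10.0, size=n)
y = spend / visits + rng.normal(0.0, 0.05, size=n)


def linear_r2(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()


# Same linear model, with and without the engineered ratio column.
r2_raw = linear_r2(np.column_stack([spend, visits]), y)
r2_ratio = linear_r2(np.column_stack([spend, visits, spend / visits]), y)

print(f"raw columns only:   R^2 = {r2_raw:.3f}")
print(f"with ratio feature: R^2 = {r2_ratio:.3f}")
```

The linear model cannot represent a division on its own, so the fit improves once the ratio is handed to it as a column.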
Mental Model
Think of features as the vocabulary the model uses to describe a row. The richer and cleaner the vocabulary, the easier it is to explain the target.
raw row: { signup_date, country, total_orders, last_login_ts }
|
v
cleaning + scaling
|
v
+ days_since_signup
+ country_freq_encoded
+ log(total_orders + 1)
+ hours_since_last_login
|
v
model-ready vector
The goal is not “more columns.” It is “columns the model can use directly.” Each feature should either expose a pattern or simplify an existing pattern.
Hands-on Example
A small playbook covering the most common transformations.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
df = pd.read_csv("users.csv")
# 1. Numerical scaling
df["total_orders_log"] = np.log1p(df["total_orders"])
scaler = StandardScaler()
df["age_scaled"] = scaler.fit_transform(df[["age"]]).ravel()  # flatten the (n, 1) output into a single column
# 2. Datetime decomposition
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["signup_dow"] = df["signup_date"].dt.dayofweek
df["signup_month"] = df["signup_date"].dt.month
df["days_since_signup"] = (pd.Timestamp("2026-06-28") - df["signup_date"]).dt.days
# 3. Frequency encoding (cheap, robust to high cardinality)
country_freq = df["country"].value_counts(normalize=True)
df["country_freq"] = df["country"].map(country_freq)
# 4. Interaction
df["orders_per_day"] = df["total_orders"] / (df["days_since_signup"] + 1)
# 5. One-hot for low-cardinality categories
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
plan_oh = ohe.fit_transform(df[["plan"]])
Notice that every step has a clear “why.” Scaling stabilizes linear models. Log transforms tame skewed counts. Frequency encoding handles categoricals with thousands of values. Interactions surface combined effects.
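The playbook computes the country frequencies on the whole DataFrame for brevity. Once the data is split, the frequencies would normally be learned from the training rows and reused elsewhere. The sketch below assumes two hypothetical toy frames, train and test, and shows one way to handle a category that never appeared in training. Filling with 0.0 is an assumption; another default may suit your data better.

```python
import pandas as pd

# Hypothetical toy data; in practice these come from your own train/test split.
train = pd.DataFrame({"country": ["US", "US", "DE", "FR", "US", "DE"]})
test = pd.DataFrame({"country": ["US", "BR", "DE"]})  # "BR" never appears in train

# Learn the frequencies from the training rows only.
country_freq = train["country"].value_counts(normalize=True)

# Apply the same mapping to both sets. Unseen categories map to NaN,
# so they are filled with 0.0 here (an assumed default).
train["country_freq"] = train["country"].map(country_freq)
test["country_freq"] = test["country"].map(country_freq).fillna(0.0)

print(test)
```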
Trade-offs
Different feature techniques have very different strengths.
- Standard scaling is essential for linear models, SVMs, and neural networks. Tree-based models (XGBoost, LightGBM) do not need it.
- One-hot encoding explodes columns for high-cardinality features. With 10,000 unique categories you create 10,000 columns.
- Frequency or target encoding scales gracefully to high cardinality but needs care to avoid leakage.
- Binning loses information but can help linear models capture non-linearities.
- Polynomial features can quickly explode. degree=2 on 50 features creates ~1,300 columns (1 bias + 50 linear + 50 squared + 1,225 pairwise = 1,326).
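The binning bullet above can be sketched with pandas. This is a minimal example under assumptions: the age values are synthetic, quartiles are an arbitrary choice of bin count, and the bin edges are learned on the training column and then reused on new rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical "age" column for training, plus a few new rows to bin later.
train_age = pd.Series(rng.integers(18, 80, size=1_000), name="age")
new_age = pd.Series([17, 25, 42, 67, 95], name="age")

# Learn quartile edges on the training data only.
_, edges = pd.qcut(train_age, q=4, retbins=True, duplicates="drop")
edges = edges.astype(float)

# Widen the outer edges so values outside the training range still get a bin.
edges[0], edges[-1] = -np.inf, np.inf

# Reuse the learned edges on new data; labels=False returns integer bin ids.
new_bins = pd.cut(new_age, bins=edges, labels=False)
print(new_bins.tolist())
```

One-hot encoding the resulting bin ids gives a linear model a separate weight per range, which is how binning lets it approximate a non-linear relationship at the cost of losing detail within each bin.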
Target encoding is especially seductive and dangerous. Computing a category’s mean target on the full dataset before splitting leaks the target into your features. Compute it on training folds only, then apply to validation and test.
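One way to follow that advice is out-of-fold encoding: each training row gets a category mean computed from folds that do not contain it. The sketch below is an illustration under assumptions, not a reference implementation. The helper oof_target_encode, the churned target, and the toy data are all hypothetical, and there is no smoothing for rare categories.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(train, col, target, n_splits=5, seed=0):
    """Out-of-fold mean-target encoding (hypothetical helper).

    Returns the encoded training Series plus the category means and the
    global mean learned from the full training set, for use on new data.
    """
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        fit_part = train.iloc[fit_idx]
        fold_means = fit_part.groupby(col)[target].mean()
        # Rows in this fold only see means computed from the other folds.
        values = train.iloc[enc_idx][col].map(fold_means)
        # Categories missing from the other folds fall back to their mean target.
        values = values.fillna(fit_part[target].mean())
        encoded.iloc[enc_idx] = values.to_numpy()
    full_means = train.groupby(col)[target].mean()
    return encoded, full_means, train[target].mean()


# Hypothetical toy data with an assumed binary target.
train = pd.DataFrame({
    "country": ["US", "US", "DE", "DE", "FR", "US", "DE", "FR", "US", "DE"],
    "churned": [1, 1, 0, 0, 1, 1, 0, 1, 0, 1],
})
test = pd.DataFrame({"country": ["US", "BR"]})  # "BR" is unseen in training

train["country_te"], full_means, global_mean = oof_target_encode(
    train, "country", "churned"
)

# Validation and test rows use statistics learned from training data only.
test["country_te"] = test["country"].map(full_means).fillna(global_mean)
print(test)
```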
Practical Tips
A few habits separate clean feature engineering from spaghetti.
- Wrap transforms in Pipeline and ColumnTransformer. This guarantees the same transformations are applied at train and inference time and prevents leakage.
- Compute features deterministically. A feature that depends on the current wall-clock time will drift between training and serving.
- Decompose dates aggressively. Day of week, day of month, week of year, “is_holiday,” and time-since-event features unlock seasonality patterns linear models cannot find on their own.
- Log-transform heavy-tailed counts. np.log1p handles zeros and shrinks the tail so the model is not dominated by a few huge values.
- For high-cardinality strings, use frequency or hash encoding first. Reach for target encoding only when you have proper cross-validated training folds.
- Track feature importance. A weekly look at which features dominate is a cheap way to spot drift, leakage, or candidates for pruning.
- Document each feature. Name, definition, source, and intended meaning. Future-you and your teammates will thank you when debugging.
A simple test: if you cannot explain in one sentence what a feature represents, the model probably has trouble too.
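The Pipeline and ColumnTransformer tip can be sketched as follows. This is a minimal example under assumptions: the toy rows, the churned target, and the choice of LogisticRegression are hypothetical, while the column names age, total_orders, and plan follow the hands-on example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical toy data with an assumed binary target called "churned".
train = pd.DataFrame({
    "age": [22, 35, 41, 58, 29, 47],
    "total_orders": [0, 3, 120, 7, 1, 45],
    "plan": ["free", "pro", "pro", "free", "free", "team"],
    "churned": [1, 0, 0, 1, 1, 0],
})
new_rows = pd.DataFrame({
    "age": [31], "total_orders": [2], "plan": ["enterprise"],  # unseen plan
})

preprocess = ColumnTransformer([
    ("age", StandardScaler(), ["age"]),
    # Log-transform the heavy-tailed count, then scale it.
    ("orders", Pipeline([
        ("log", FunctionTransformer(np.log1p)),
        ("scale", StandardScaler()),
    ]), ["total_orders"]),
    # Unknown categories at inference time become an all-zero row.
    ("plan", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

model = Pipeline([
    ("features", preprocess),
    ("clf", LogisticRegression()),
])

features = ["age", "total_orders", "plan"]
# fit() learns scaler statistics and categories from the training rows only.
model.fit(train[features], train["churned"])

# predict_proba() replays the exact same fitted transforms on new rows.
proba = model.predict_proba(new_rows[features])
print(proba)
```

Because the preprocessing lives inside the estimator, passing model to a cross-validation routine refits the scalers and encoders on each training fold, which is what keeps validation rows out of the learned statistics.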
Wrap-up
Feature engineering is where domain knowledge meets the model. Clean numerical transforms, careful categorical encoding, decomposed dates, and a few well-chosen interactions usually deliver bigger gains than swapping algorithms. Build features inside a pipeline, watch for leakage, and document everything. The model is only as smart as the columns you feed it.