ML Feature Engineering Techniques

Intermediate 10 min read

What you'll learn

✓ Why feature engineering still matters in 2026
✓ Numerical scaling and binning techniques
✓ Encoding categorical variables without explosion
✓ Building useful interactions and date features
✓ Avoiding leakage when computing target-based features

Prerequisites

• Familiar with how APIs work

What and Why

Feature engineering is the craft of turning raw signals into representations that a model can learn from quickly. Gradient boosting, linear models, and even neural networks all benefit. Better features routinely beat fancier algorithms, and they are usually the cheapest improvement you can make.

The reason is simple: a model can only fit what is present in the input. If the signal lives in the relationship between two columns, a derived third column (their ratio or difference, for example) hands the model that relationship directly instead of forcing it to learn a complicated interaction from scratch.
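To make this concrete, here is a minimal sketch on synthetic data. The column names `spend` and `visits` and the data-generating rule are assumptions made up for illustration. The target depends on the ratio of the two columns, and a plain least-squares fit is compared with and without that ratio as an explicit feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the target depends on the
# ratio of two raw columns, not on either column alone.
spend = rng.uniform(10, 100, size=500)
visits = rng.uniform(1, 20, size=500)
target = spend / visits + rng.normal(0, 0.1, size=500)


def r_squared(features: np.ndarray, y: np.ndarray) -> float:
    """Fit ordinary least squares with an intercept and return in-sample R^2."""
    X = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1.0 - residuals.var() / y.var()


raw = np.column_stack([spend, visits])
engineered = np.column_stack([spend, visits, spend / visits])

r2_raw = r_squared(raw, target)
r2_engineered = r_squared(engineered, target)
print(f"raw columns only:  R^2 = {r2_raw:.3f}")
print(f"with ratio column: R^2 = {r2_engineered:.3f}")
```

The linear model cannot represent a ratio from the two raw columns alone, so the engineered column accounts for nearly all of the improvement in fit.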

Mental Model

Think of features as the vocabulary the model uses to describe a row. The richer and cleaner the vocabulary, the easier it is to explain the target.

raw row: { signup_date, country, total_orders, last_login_ts }
            |
            v
     cleaning + scaling
            |
            v
 + days_since_signup
 + country_freq_encoded
 + log(total_orders + 1)
 + hours_since_last_login
            |
            v
      model-ready vector

From raw data to model-ready features

The goal is not “more columns.” It is “columns the model can use directly.” Each feature should either expose a pattern or simplify an existing pattern.

Hands-on Example

A small playbook covering the most common transformations.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.read_csv("users.csv")

# 1. Numerical scaling
df["total_orders_log"] = np.log1p(df["total_orders"])
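# Fit on the full frame here only for brevity; in practice, fit on the training split.
scaler = StandardScaler()
df["age_scaled"] = scaler.fit_transform(df[["age"]]).ravel()  # flatten (n, 1) to 1-D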

# 2. Datetime decomposition
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["signup_dow"] = df["signup_date"].dt.dayofweek
df["signup_month"] = df["signup_date"].dt.month
df["days_since_signup"] = (pd.Timestamp("2026-06-28") - df["signup_date"]).dt.days

# 3. Frequency encoding (cheap, robust to high cardinality)
country_freq = df["country"].value_counts(normalize=True)
df["country_freq"] = df["country"].map(country_freq)

# 4. Interaction
df["orders_per_day"] = df["total_orders"] / (df["days_since_signup"] + 1)

# 5. One-hot for low-cardinality categories
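ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")  # sparse_output needs scikit-learn >= 1.2
plan_oh = ohe.fit_transform(df[["plan"]])
# Attach the one-hot columns (plan_<value>) back onto the frame
plan_cols = ohe.get_feature_names_out(["plan"])
df = df.join(pd.DataFrame(plan_oh, columns=plan_cols, index=df.index))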

Notice that every step has a clear “why.” Scaling stabilizes linear models. Log transforms tame skewed counts. Frequency encoding handles categoricals with thousands of values. Interactions surface combined effects.
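Binning is not shown in the playbook above. As a rough sketch, the two usual variants look like this with pandas. The ages, the bin edges, and the labels are hypothetical values chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; in the playbook above this would be df["age"].
ages = pd.Series([18, 22, 25, 31, 38, 44, 52, 61, 67, 80], name="age")

# Fixed bins with hand-chosen edges (the edges are an assumption).
# Intervals are right-inclusive by default: (0, 25], (25, 40], ...
age_band = pd.cut(
    ages,
    bins=[0, 25, 40, 60, np.inf],
    labels=["<=25", "26-40", "41-60", "60+"],
)

# Quantile bins: each bucket holds roughly the same number of rows.
# labels=False returns integer bucket ids instead of interval labels.
age_quartile = pd.qcut(ages, q=4, labels=False)

binned = pd.DataFrame(
    {"age": ages, "age_band": age_band, "age_quartile": age_quartile}
)
print(binned)
```

Fixed edges suit cases where the thresholds carry domain meaning. Quantile bins are a reasonable default when they do not. For a linear model, the band would typically be one-hot encoded so each bucket gets its own coefficient.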

Trade-offs

Different feature techniques have very different strengths.

• Standard scaling is essential for linear models, SVMs, and neural networks. Tree-based models (XGBoost, LightGBM) do not need it.
• One-hot encoding explodes columns for high-cardinality features. With 10,000 unique categories you create 10,000 columns.
• Frequency or target encoding scales gracefully to high cardinality but needs care to avoid leakage.
• Binning loses information but can help linear models capture non-linearities.
• Polynomial features can quickly explode. degree=2 on 50 features creates 1,326 columns once the constant term is counted.
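The column counts in the list above can be checked with a little arithmetic. The sketch below assumes a full polynomial expansion, meaning every monomial up to the given total degree with the constant term optionally included. The helper names are made up for illustration.

```python
from math import comb


def one_hot_columns(n_categories: int) -> int:
    """One-hot encoding emits one column per distinct category."""
    return n_categories


def polynomial_columns(n_features: int, degree: int, include_bias: bool = True) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables.

    The count is C(n_features + degree, degree), which includes the constant term.
    """
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1


print(one_hot_columns(10_000))                         # 10000
print(polynomial_columns(50, 2))                       # 1326
print(polynomial_columns(50, 2, include_bias=False))   # 1325 = 50 linear + 50 squares + 1225 pairs
print(polynomial_columns(50, 3))                       # 23426
```

Moving from degree 2 to degree 3 multiplies the column count by roughly 18, which is why higher-degree expansions are usually restricted to a handful of hand-picked features.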

Target encoding is especially seductive and dangerous. Computing a category’s mean target on the full dataset before splitting leaks the target into your features. Compute it on training folds only, then apply to validation and test.
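A minimal sketch of the fold-based approach, using only pandas and NumPy. The function name `target_encode_oof`, the toy data, and the `churned` target are hypothetical. Smoothing toward the prior, which production encoders usually add, is left out to keep the idea visible.

```python
import numpy as np
import pandas as pd


def target_encode_oof(train, test, col, target, n_splits=5, seed=0):
    """Return (train_encoded, test_encoded) mean-target encodings for `col`.

    Each training row is encoded with category means computed from the
    other folds only, so a row's own target never feeds its own feature.
    """
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(train)) % n_splits  # random, balanced folds

    train_encoded = pd.Series(np.nan, index=train.index, dtype=float)
    for fold in range(n_splits):
        in_fold = fold_ids == fold
        out_of_fold = train.loc[~in_fold]
        fold_means = out_of_fold.groupby(col)[target].mean()
        fold_prior = out_of_fold[target].mean()  # fallback for unseen categories
        encoded = train.loc[in_fold, col].map(fold_means).fillna(fold_prior)
        train_encoded.loc[in_fold] = encoded.to_numpy()

    # Validation and test rows use statistics from the whole training set.
    full_means = train.groupby(col)[target].mean()
    test_encoded = test[col].map(full_means).fillna(train[target].mean())
    return train_encoded, test_encoded


train = pd.DataFrame({
    "country": ["US", "US", "US", "DE", "DE", "FR", "US", "DE", "FR", "US"],
    "churned": [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
})
test = pd.DataFrame({"country": ["US", "DE", "BR"]})

train_te, test_te = target_encode_oof(train, test, "country", "churned")
print(test_te.tolist())  # US and DE get their training means; unseen BR gets the overall mean
```

Because a training row is never part of the statistics used to encode it, changing that row's label leaves its own encoded value unchanged, which is the property the leaky full-dataset version lacks.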

Practical Tips

A few habits separate clean feature engineering from spaghetti.

• Wrap transforms in Pipeline and ColumnTransformer. This guarantees the same transformations are applied at train and inference time and, when the whole pipeline is fit inside each cross-validation fold, keeps statistics from held-out rows out of the features.
• Compute features deterministically. A feature that depends on the current wall-clock time will drift between training and serving.
• Decompose dates aggressively. Day of week, day of month, week of year, “is_holiday,” and time-since-event features unlock seasonality patterns linear models cannot find on their own.
• Log-transform heavy-tailed counts. np.log1p handles zeros and shrinks the tail so the model is not dominated by a few huge values.
• For high-cardinality strings, use frequency or hash encoding first. Reach for target encoding only when you have proper cross-validated training folds.
• Track feature importance. A weekly look at which features dominate is a cheap way to spot drift, leakage, or candidates for pruning.
• Document each feature. Name, definition, source, and intended meaning. Future-you and your teammates will thank you when debugging.
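The first tip can be sketched with scikit-learn as follows. The data is a synthetic stand-in for users.csv, and the `churned` target and the choice of LogisticRegression are assumptions made for illustration, not part of the playbook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Synthetic stand-in for users.csv; "churned" is a hypothetical label.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "total_orders": rng.poisson(3, size=n),
    "plan": rng.choice(["free", "pro", "team"], size=n),
})
df["churned"] = (df["total_orders"] < 2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="churned"), df["churned"], test_size=0.25, random_state=0
)

# Each transformer is applied only to the columns listed for it.
preprocess = ColumnTransformer([
    ("scale_age", StandardScaler(), ["age"]),
    ("log_orders", FunctionTransformer(np.log1p), ["total_orders"]),
    ("plan_onehot", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

model = Pipeline([
    ("features", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns scaling statistics and category lists from the training rows
# only; score() and predict() reuse them unchanged on new rows.
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Passing `model` to a cross-validation helper refits the preprocessing inside every fold, which is what keeps held-out rows from influencing the scaler or the encoder.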

A simple test: if you cannot explain in one sentence what a feature represents, the model probably has trouble too.
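Two of the tips above, deterministic features and hash encoding, can be sketched with the standard library alone. The function names, the bucket count, and the example timestamps are assumptions chosen for illustration.

```python
import hashlib
from datetime import datetime, timezone


def hash_bucket(value: str, n_buckets: int = 1024) -> int:
    """Map a string to a stable bucket index in [0, n_buckets).

    Uses MD5 rather than the built-in hash(), which is salted per process
    for strings and could give different buckets in training and serving.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hours_since(event_ts: datetime, as_of: datetime) -> float:
    """Elapsed hours between an event and an explicit reference time."""
    return (as_of - event_ts).total_seconds() / 3600.0


# The reference time is passed in (for example, the snapshot or request time)
# instead of being read from the clock inside the feature code.
as_of = datetime(2026, 6, 28, 12, 0, tzinfo=timezone.utc)
last_login = datetime(2026, 6, 27, 6, 0, tzinfo=timezone.utc)

print(hash_bucket("example-category-value"))
print(hours_since(last_login, as_of))  # 30.0
```

Hash buckets can collide, so the bucket count trades memory against how often unrelated values share a column. Passing `as_of` explicitly means a training row can be rebuilt later with exactly the same feature values.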

Wrap-up

Feature engineering is where domain knowledge meets the model. Clean numerical transforms, careful categorical encoding, decomposed dates, and a few well-chosen interactions usually deliver bigger gains than swapping algorithms. Build features inside a pipeline, watch for leakage, and document everything. The model is only as smart as the columns you feed it.