ML Overfitting and Regularization

Beginner 10 min read

What you'll learn

✓ What overfitting actually looks like in training curves
✓ How L1 and L2 shape the weight distribution
✓ How dropout simulates ensembles
✓ When early stopping is your most effective lever
✓ How to diagnose underfit vs overfit quickly

Prerequisites

• Familiar with how APIs work

What and Why

Overfitting happens when a model learns the noise in your training data instead of the underlying pattern. It scores beautifully on training rows and disappoints on anything new. Regularization is the family of techniques that pushes a model toward simpler hypotheses, which generalize better.

Spotting overfit early and applying the right regularizer is one of the highest-leverage skills in applied ML. The wrong choice can also hurt: under-regularized models memorize noise; over-regularized models become too rigid to capture the real pattern.

Mental Model

Plot training loss and validation loss against epochs. The relationship between the two curves tells the whole story.

underfit:     train loss HIGH, val loss HIGH    -> need more capacity
good fit:     train loss LOW,  val loss LOW     -> ship it
overfit:      train loss LOW,  val loss HIGH    -> regularize

loss
|
|\ \                  ___/  val
| \ \________________/      <- overfit gap grows here
|  \
|   \______________________ train
+--------------------------- epochs

Reading the training and validation curves

The gap between train and validation curves is the overfit signal. A small stable gap is fine. A growing gap means the model is memorizing.
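The three cases in the table above can be turned into a rough check on the last recorded losses. This is a minimal sketch, not a standard diagnostic: the function name diagnose_fit and the thresholds high_loss and gap_tol are hypothetical and need to be set for your own loss scale.

```python
def diagnose_fit(train_losses, val_losses, high_loss=0.5, gap_tol=0.1):
    """Classify a run as 'underfit', 'overfit', or 'good fit'.

    high_loss and gap_tol are hypothetical thresholds; pick values that
    make sense for your own loss scale.
    """
    train_final = train_losses[-1]
    val_final = val_losses[-1]
    gap = val_final - train_final

    if train_final >= high_loss:
        # The model has not even fit the training data.
        return "underfit"
    if gap > gap_tol:
        # Training loss is low but validation loss lags well behind.
        return "overfit"
    return "good fit"


# Made-up loss curves for illustration only.
overfit_run = diagnose_fit([0.9, 0.4, 0.1, 0.05], [0.95, 0.6, 0.55, 0.7])
underfit_run = diagnose_fit([0.9, 0.85, 0.8], [0.95, 0.9, 0.88])
good_run = diagnose_fit([0.9, 0.4, 0.2], [0.95, 0.45, 0.25])
print(overfit_run, underfit_run, good_run)
```

A check like this only looks at the final epoch. It does not replace plotting the full curves, where a gap that keeps widening is visible long before it crosses any fixed threshold.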

Hands-on Example

In sklearn, L2 (ridge) and L1 (lasso) regularization are each controlled by a single hyperparameter, alpha. A larger alpha means a stronger penalty.

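from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
import numpy as np

# Placeholder data so the snippet runs; replace with your own X_train, y_train.
X_train, y_train = make_regression(n_samples=200, n_features=20,
                                   noise=10.0, random_state=0)

# Sweep alpha on a log scale; swap Ridge for Lasso to test the L1 penalty.
for alpha in [0.001, 0.01, 0.1, 1, 10, 100]:
    ridge = Ridge(alpha=alpha)
    # sklearn returns negative MSE, so flip the sign before printing.
    scores = cross_val_score(ridge, X_train, y_train, cv=5,
                             scoring="neg_mean_squared_error")
    print(f"alpha={alpha:>7}: mean CV MSE = {-np.mean(scores):.4f}")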

For neural networks, dropout and weight decay (L2) are the two staples.

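import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # zero 30% of activations during training
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 10),
)

# AdamW applies decoupled weight decay to the parameters at each step.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)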

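weight_decay shrinks every weight slightly at each optimizer step. With plain SGD this is equivalent to an L2 penalty on the loss; AdamW applies it in decoupled form, separately from the gradient update, but the regularizing effect is similar. Dropout zeroes out a random fraction of activations during training (30% here), forcing the network to spread learning across many paths. Call model.eval() before validation or inference so dropout is switched off.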
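To make the dropout mechanism concrete, here is a minimal plain-Python sketch of inverted dropout on a list of activations. The function name dropout and its arguments are illustrative, not a library API. Survivors are scaled by 1 / (1 - p), which is also what nn.Dropout does, so the expected activation stays the same between training and inference.

```python
import random


def dropout(activations, p=0.3, training=True, rng=random):
    """Inverted dropout on a list of floats (requires 0 <= p < 1).

    During training each activation is zeroed with probability p and the
    survivors are scaled by 1 / (1 - p) so the expected value is unchanged.
    At inference time the input passes through untouched.
    """
    if not training or p == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
acts = [1.0] * 10

train_out = dropout(acts, p=0.3, training=True, rng=rng)
eval_out = dropout(acts, p=0.3, training=False)

print(train_out)  # each entry is either 0.0 or about 1.43
print(eval_out)   # unchanged: all 1.0
```

In PyTorch the same switch is handled for you: model.train() enables dropout and model.eval() disables it.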

Trade-offs

Each regularizer has a personality.

L2 (ridge / weight decay) shrinks all weights smoothly toward zero. It is great when you have many small effects and want a stable solution.
L1 (lasso) drives weights exactly to zero, producing a sparse model. Useful for feature selection and when you suspect most features are irrelevant.
Elastic net mixes L1 and L2 for a balance of sparsity and stability.
Dropout is cheap and effective for neural networks. High dropout rates (above roughly 0.5) can slow convergence and push the model into underfitting.
Early stopping monitors validation loss and halts when it stops improving. It is arguably the most cost-effective regularizer because it also saves compute.
Data augmentation is often the strongest regularizer in vision and audio. Every transformed copy the model sees makes memorization harder.
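The difference between L2's smooth shrinkage and L1's exact zeros can be seen in a simplified setting. The sketch below assumes orthonormal features, where each regularized weight has a closed form in terms of the unregularized one: ridge divides by (1 + alpha), and lasso applies soft-thresholding. The function names and the weight values are made up for illustration, and alpha here is not on the same scale as sklearn's alpha, which normalizes the loss differently.

```python
def ridge_shrink(w, alpha):
    """L2: scale the unregularized weight toward zero, never exactly to it."""
    return w / (1.0 + alpha)


def lasso_shrink(w, alpha):
    """L1: soft-threshold, so any weight with |w| <= alpha becomes exactly 0."""
    if w > alpha:
        return w - alpha
    if w < -alpha:
        return w + alpha
    return 0.0


# Hypothetical unregularized weights: two strong effects, three weak ones.
unregularized = [3.0, -1.5, 0.4, -0.2, 0.05]
alpha = 0.5

ridge_w = [ridge_shrink(w, alpha) for w in unregularized]
lasso_w = [lasso_shrink(w, alpha) for w in unregularized]

print("ridge:", [round(w, 3) for w in ridge_w])
print("lasso:", [round(w, 3) for w in lasso_w])
print("zeros (ridge, lasso):", ridge_w.count(0.0), lasso_w.count(0.0))
```

Ridge keeps all five weights but makes each one smaller. Lasso removes the three weak weights entirely and subtracts a fixed amount from the two strong ones, which is why it doubles as a feature selector.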

There is no universal “best” choice. The right answer is to compare validation loss across a few options and let the data decide.

Practical Tips

A practical workflow for diagnosing and fixing overfit:

Plot the curves first. Do not tune regularization without seeing where on the underfit-to-overfit spectrum you are. If both losses are high, more capacity, not regularization, is the answer.
Add early stopping before anything else. It is one extra callback and prevents most catastrophic overfit.
Scale features before applying L1/L2. The penalty acts on coefficient magnitudes, and those depend on each feature's units, so unscaled features are penalized unevenly.
Tune alpha on a log scale. Try 0.001, 0.01, 0.1, 1, 10, 100. The right value usually sits in a narrow window you discover quickly.
For neural nets, use weight decay around 1e-4 as a default. Adjust based on the train/val gap.
Use dropout in dense layers, less in conv layers, sparingly in transformers. Modern transformers prefer layer normalization and small weight decay over heavy dropout.
Augment data when you can. A model trained on slightly noisy variants of your inputs almost always generalizes better than one trained on pristine data with heavier regularization.
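The early stopping step in the workflow above can be sketched as a small helper that is independent of any framework. The class name EarlyStopping, the patience and min_delta parameters, and the loss values are assumptions for illustration; most training libraries ship their own callback with similar options.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the patience counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improving, then drifting upward.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

In a real training loop you would also save a checkpoint whenever a new best loss is recorded and restore it after stopping, so the model you keep is the one from the best epoch rather than the last one.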

When nothing helps, the answer is sometimes “get more data.” Regularization is a stand-in for data you do not have, and more data is usually the more reliable fix.

Wrap-up

Overfitting is not a failure of the model; it is a mismatch between capacity and signal. Train/validation curves tell you which direction to move, and the regularization toolbox gives you choices proportional to the severity. Add early stopping, scale your features, tune alpha on a log scale, and read the curves before you reach for anything fancier.