ML Train Test Validation Split Explained

Beginner 9 min read

What you'll learn

✓ Why three splits exist instead of two
✓ How each split is used during model development
✓ How to choose split ratios for small and large datasets
✓ Common leakage traps and how to avoid them
✓ Stratified and time-based splits

Prerequisites

• Familiar with how APIs work

What and Why

If you train a model and evaluate it on the same data, you measure memorization, not generalization. The fix is to split your data so that the model never sees the rows you score it on. The standard practice is three splits: train, validation, and test.

The train set fits the model. The validation set tunes hyperparameters and selects between candidate models. The test set is touched once at the very end to estimate real-world performance.

Mental Model

Think of the splits as three different exams.

Train: open-book homework. The model learns from these rows directly.
Validation: a practice exam. You grade the model on it many times during tuning.
Test: the final exam. You grade it once. If you grade it twice and adjust based on the result, it stops being a real estimate of generalization.

Raw data (100%)
 |
 v
+---------+----------+-------+
|  Train  |   Val    | Test  |
|  (70%)  |  (15%)   | (15%) |
+---------+----------+-------+
 |          |          |
 v          v          v
fit model  tune model  report final score
                      (touch ONCE)

Three-way split and its purpose

The validation set is what lets you iterate without burning your test set. Without it, every time you tried a new hyperparameter you would be implicitly training on your test data.
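As a rough illustration of that loop, the sketch below tunes one hyperparameter on the validation set and scores the test set a single time at the end. The synthetic dataset, the choice of LogisticRegression, and the candidate values for C are all assumptions made for the example, not part of any prescribed recipe.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out roughly 15% as test, then roughly 15% of the original as validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, random_state=0, stratify=y_temp
)

# Iterate on the validation set: try several regularization strengths.
best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Touch the test set once, using only the configuration chosen above.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"best C={best_C}, val acc={best_val_acc:.3f}, test acc={test_acc:.3f}")
```

The validation accuracy is looked at four times here, the test accuracy only once, which is the whole point of keeping the two sets separate.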

Hands-on Example

scikit-learn's train_test_split helper produces only two splits per call, so you usually need two calls to get three.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the snippet runs; replace with your own X and y.
X, y = make_classification(n_samples=1000, random_state=42)

# First split off the test set (15% of all rows).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
# Then split train/val from the remaining 85%.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.1765, random_state=42, stratify=y_temp
)
# 0.1765 * 0.85 ~= 0.15, so the final ratio is roughly 70/15/15.

print(len(X_train), len(X_val), len(X_test))

stratify=y is critical for classification with class imbalance. Without it, a rare class can vanish from one split entirely.
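To see the effect, the sketch below compares the positive-class rate in the test split with and without stratification. The label array (20 positives in 1000 rows) and the placeholder features are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives out of 1000 rows (2%).
y = np.array([1] * 20 + [0] * 980)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Same seed and test size; the only difference is the stratify argument.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("overall positive rate:  ", y.mean())
print("unstratified test rate: ", y_test_plain.mean())
print("stratified test rate:   ", y_test_strat.mean())
```

The stratified rate should track the overall rate closely, while the unstratified rate depends on the luck of the shuffle and can drift noticeably when the rare class has only a handful of rows.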

For time series, do not randomly shuffle. Split by date so the validation and test sets are strictly after the train set.

train = df[df["date"] < "2025-01-01"]
val   = df[(df["date"] >= "2025-01-01") & (df["date"] < "2025-04-01")]
test  = df[df["date"] >= "2025-04-01"]
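A self-contained version of the same idea, with sanity checks, might look like the sketch below. The DataFrame is hypothetical daily data invented for the example, and the cutoff dates simply mirror the ones above.

```python
import pandas as pd

# Hypothetical daily data covering 2024-07-01 through 2025-06-30.
df = pd.DataFrame({"date": pd.date_range("2024-07-01", "2025-06-30", freq="D")})
df["value"] = range(len(df))

val_start = pd.Timestamp("2025-01-01")
test_start = pd.Timestamp("2025-04-01")

train = df[df["date"] < val_start]
val = df[(df["date"] >= val_start) & (df["date"] < test_start)]
test = df[df["date"] >= test_start]

# Sanity checks: splits are ordered in time and cover every row exactly once.
assert train["date"].max() < val["date"].min()
assert val["date"].max() < test["date"].min()
assert len(train) + len(val) + len(test) == len(df)

print(len(train), len(val), len(test))
```

Keeping the cutoffs in named variables makes it harder to leave a gap or an overlap between adjacent splits when the dates change later.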

Trade-offs

The “right” ratio depends on how much data you have.

Small datasets (a few thousand rows). Use cross-validation in place of a fixed validation set. An 80/20 train/test split with k-fold CV inside the train portion gives you more signal.
Medium datasets (tens of thousands). The classic 70/15/15 or 80/10/10 works well.
Large datasets (millions). You can shrink validation and test to 1% each because absolute counts matter more than ratios for stable estimates.
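For the small-dataset case, one way to combine a held-out test set with k-fold cross-validation is sketched below. The synthetic data, the LogisticRegression model, the five folds, and the candidate C values are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic stand-in for a small dataset (an assumption for this sketch).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# 80/20 split: the 20% test portion stays sealed until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 5-fold CV inside the train portion replaces a fixed validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    cv_scores[C] = cross_val_score(model, X_train, y_train, cv=cv).mean()

# Pick the best setting by mean CV accuracy, refit on all of train, test once.
best_C = max(cv_scores, key=cv_scores.get)
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"best C={best_C}, test acc={test_acc:.3f}")
```

Every training row serves as validation data in exactly one fold, so the selection signal comes from the whole train portion rather than a single small slice.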

The classic mistake is to spend the entire test set early. Treat it like a sealed envelope. If you open it ten times during model development, you no longer have an honest estimate; you have just overfit to it slowly.

Practical Tips

Leakage is the single biggest threat to honest evaluation. Some patterns to watch for:

Feature leakage. A feature that was built using future information (e.g. “total spend by this customer over all time”) will inflate scores and fail in production. Compute features only from data available at prediction time.
Preprocessing leakage. Fitting a scaler or imputer on the full dataset before splitting leaks information about the validation and test rows into training. Always .fit() on train only, then .transform() train, val, and test with the fitted object.
Group leakage. If the same customer or user appears in multiple rows, splitting at the row level lets the model see the same person in both train and test. Use GroupShuffleSplit so all rows for a group land in the same split.
Time leakage. Random shuffling of time series lets the model peek at the future. Always split chronologically for temporal data.
Duplicate leakage. Near-duplicate rows (slightly edited text, mirrored images) can land on both sides of the split. Deduplicate aggressively first.
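Two of these guards, the group-aware split and fitting preprocessing on train only, can be sketched together as follows. The customer IDs, features, and labels are randomly generated placeholders, and the 80/20 ratio is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 500 rows belonging to up to 50 customers.
n_rows = 500
customer_id = rng.integers(0, 50, size=n_rows)
X = rng.normal(size=(n_rows, 3))
y = rng.integers(0, 2, size=n_rows)

# Group-aware split: every row for a given customer lands on one side only.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=customer_id))

shared = set(customer_id[train_idx]) & set(customer_id[test_idx])
print("customers in both splits:", len(shared))

# Fit the scaler on train only, then reuse its statistics on test.
scaler = StandardScaler().fit(X[train_idx])
X_train_scaled = scaler.transform(X[train_idx])
X_test_scaled = scaler.transform(X[test_idx])
```

Note that test_size in GroupShuffleSplit refers to the share of groups, not rows, so the row-level ratio can differ from 80/20 when groups vary in size.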

A simple guard: before you train, write down the metric you will report and the test score you need in order to ship. Then evaluate on the test set once, after the final model is chosen, and ship or stop based on that number.

Wrap-up

Train, validation, and test splits exist because the only model that matters is the one that works on data it has not seen. A clean split, with stratification where needed and chronological ordering for time series, costs almost nothing and saves you from the embarrassment of a 95% offline score and a 60% production score. Treat your test set like a one-shot resource and you will rarely be surprised in deployment.