Train/Test Split and Classification Metrics

Intermediate 13 min read

What you'll learn

✓ Why holding out a test set is non-negotiable
✓ How to use train_test_split and why stratify matters
✓ Accuracy vs precision, recall, and F1: when each one matters
✓ How to read a confusion matrix at a glance
✓ What ROC-AUC tells you that accuracy doesn't
✓ When to graduate from a single split to cross-validation

Prerequisites

• What Is Machine Learning? introduces features, labels, and the train/test idea
• Basic Python and a working scikit-learn install

A model that scores 99% on the data it was trained on tells you nothing. A model that scores 80% on data it has never seen tells you something. This post is about how to set up that second number honestly, and how to pick the metric that actually reflects what you care about.

If you’re new to ML in general, What Is Machine Learning? is the right starting point.

Why split at all

The cardinal sin in ML is evaluating a model on the same data you trained it on. The model has already seen that data — of course it does well. The score is meaningless.

The minimum viable discipline is the holdout split: set aside some fraction of your data before training, and only ever look at it for evaluation.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% held out
    random_state=42,    # reproducibility
)

A 70/30 or 80/20 split is conventional. The exact number matters less than the principle: the test set is sacred. You don’t tune to it. You don’t peek at it. You evaluate once.
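To make the "training score is meaningless" point concrete, here is a minimal sketch on synthetic data. The dataset, the label noise (flip_y=0.1), and the choice of an unconstrained decision tree are all assumptions picked to make the gap easy to see; none of them come from a real project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some randomly assigned labels, so memorising is not the same as learning
X, y = make_classification(
    n_samples=1000, n_features=20, flip_y=0.1, random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
)

# A tree with no depth limit can memorise every training row
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The tree should score at or near 100% on the rows it memorised and typically lower on the held-out rows. Only the second number says anything about new data.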

Stratify when classes are imbalanced

If 95% of your labels are class 0 and 5% are class 1, a random split might leave very few positives in the test set — your test score becomes unstable. stratify=y keeps the class proportions the same in both splits.

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

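Rule of thumb: pass stratify=y for classification unless you have a reason not to. It costs nothing on balanced data and helps a lot on imbalanced data. The main exceptions are time-ordered or grouped data, where a time-based or group-aware split matters more, and classes with only one member, which train_test_split cannot stratify and rejects with an error.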
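A quick way to see the effect is to count positives in the test set with and without stratification. This is a sketch on hypothetical labels (950 negatives, 50 positives) with a placeholder feature column; only the label counts matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 950 negatives, 50 positives (a 95/5 imbalance)
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not meaningful

plain_counts, strat_counts = [], []
for seed in range(3):
    _, _, _, y_test_plain = train_test_split(
        X, y, test_size=0.2, random_state=seed,
    )
    _, _, _, y_test_strat = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y,
    )
    plain_counts.append(int(y_test_plain.sum()))
    strat_counts.append(int(y_test_strat.sum()))

print("positives in test, plain split:     ", plain_counts)
print("positives in test, stratified split:", strat_counts)
```

With stratify=y, every 200-row test set gets exactly 10 positives (5%). Without it, the count depends on the seed.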

For regression, stratify isn’t standard but you can stratify on binned target values if your target is heavily skewed.
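One way to do that is to cut the target into quantile bins and pass the bin index as the stratification label. This is a sketch under assumptions: the log-normal target, the five bins, and the variable names are all illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # heavily right-skewed target

# Four quantile edges give five bins; the bin index stands in for a class label
edges = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_bins = np.digitize(y, edges)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y_bins,
)

overall_counts = np.bincount(y_bins)
test_counts = np.bincount(np.digitize(y_test, edges))
print("rows per bin, full data:", overall_counts)
print("rows per bin, test set: ", test_counts)
```

The model still trains on the continuous y; the bins only steer the split so that the test set covers the whole range of the target, including the long tail.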

A baseline first

Before any fancy model, fit a baseline. For classification, “always predict the majority class” sets the floor.

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))

If your fancy model can’t beat this number, you don’t have a model — you have a bug. For an imbalanced dataset (95/5), “always predict 0” already scores 95% accuracy. Knowing the floor stops you from celebrating a useless model.

Accuracy and its limits

The simplest metric. The fraction of predictions that were correct.

from sklearn.metrics import accuracy_score

# `model` is any classifier already fitted on X_train, y_train
preds = model.predict(X_test)
print(accuracy_score(y_test, preds))

Accuracy is fine when classes are balanced and the cost of different errors is similar. It is misleading otherwise. If one transaction in a thousand is fraudulent, a detector that predicts “not fraud” every time is 99.9% accurate and 100% useless.

For anything imbalanced or any problem where false positives and false negatives have different costs, you need finer tools.
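The fraud example is easy to reproduce. The sketch below uses made-up labels (10 fraud cases in 10,000 transactions) and a "model" that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1 fraud case per 1,000 transactions
y_true = np.zeros(10_000, dtype=int)
y_true[::1000] = 1                 # 10 positives in total

y_pred = np.zeros_like(y_true)     # "never fraud"

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(f"accuracy: {acc:.3f}")      # accuracy: 0.999
print(f"recall:   {rec:.3f}")      # recall:   0.000
```

Accuracy rewards the 9,990 easy negatives. Recall shows that not a single fraud case was caught.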

The confusion matrix

Every binary classification result fits into a 2×2 grid:

                Predicted 0   Predicted 1
Actual 0         TN            FP
Actual 1         FN            TP

TN — true negative. Correctly said 0.
FP — false positive. Said 1, actually 0. (“False alarm.”)
FN — false negative. Said 0, actually 1. (“Missed it.”)
TP — true positive. Correctly said 1.

In code:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, preds)
print(cm)
# output:
# [[TN FP]
#  [FN TP]]

Read the matrix before reading any single-number metric. It tells you what your model is actually doing wrong.
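If you want the four cells as named numbers, flattening the matrix with ravel() is a common idiom. The labels and predictions below are hypothetical, chosen so the counts are easy to check by eye.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# labels=[0, 1] pins the row/column order even if a class happens to be missing
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=4 FP=2 FN=1 TP=3
```

Row-major order is what makes the unpacking work: first row is actual 0 (TN, FP), second row is actual 1 (FN, TP).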

Precision, recall, F1

Three numbers derived from the confusion matrix. Each answers a different question.

Precision = TP / (TP + FP). Of the things I flagged as positive, how many actually were? Answers: “When the model says yes, is it right?”
Recall = TP / (TP + FN). Of the actual positives, how many did I catch? Answers: “Did the model find all the real ones?”
F1 = 2 · precision · recall / (precision + recall), the harmonic mean of precision and recall. A single number that punishes you for being weak on either.

from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_test, preds))
print("Recall:   ", recall_score(y_test, preds))
print("F1:       ", f1_score(y_test, preds))

Or all at once:

from sklearn.metrics import classification_report

print(classification_report(y_test, preds))
# illustrative output (per-class rows only, from TN=188, FP=2, FN=4, TP=6):
#               precision    recall  f1-score   support
#            0       0.98      0.99      0.98       190
#            1       0.75      0.60      0.67        10
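It is worth computing these by hand once, straight from the confusion-matrix counts, and checking them against scikit-learn. The labels and predictions here are hypothetical toy values.

```python
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score,
)

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)   # 3 / 5
recall = tp / (tp + fn)      # 3 / 4
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.600 recall=0.750 f1=0.667

# The library functions give the same numbers
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)
```

Note that F1 (0.667) sits below the arithmetic mean of 0.600 and 0.750; the harmonic mean leans towards the weaker of the two.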

Which one matters

Pick the metric by thinking about the cost of each error.

Spam filter. False positives (real emails going to spam) are worse than false negatives (some spam slipping through). Optimise for precision.
Cancer screening. False negatives (missing a real case) are catastrophic. False positives lead to follow-up tests. Optimise for recall.
Fraud detection. Both errors cost real money. Often track F1 or a custom weighted score.
Balanced classes, equal costs. Accuracy is fine.

There is no universal “best” metric. There is the metric that matches the problem.
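When the two errors don't cost the same, one option for a "weighted" score is the F-beta score, which scikit-learn exposes as fbeta_score. A beta below 1 leans towards precision and a beta above 1 leans towards recall. The toy predictions below are hypothetical, and the beta values are just examples, not recommendations.

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical predictions with precision 0.60 and recall 0.75
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

f_half = fbeta_score(y_true, y_pred, beta=0.5)  # weights precision more
f_one = f1_score(y_true, y_pred)                # equal weight
f_two = fbeta_score(y_true, y_pred, beta=2)     # weights recall more

print(f"F0.5={f_half:.3f}  F1={f_one:.3f}  F2={f_two:.3f}")
# F0.5=0.625  F1=0.667  F2=0.714
```

Because recall is the stronger number in this toy case, the score rises as beta shifts weight towards recall. A spam filter might lean towards F0.5, a screening test towards F2.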

Try it yourself. Pretend you’re building a model that flags student essays for plagiarism review. False positives drag innocent students into a stressful process. False negatives let cheating slip through. Which metric — precision or recall — would you optimise for, and why? There’s no single right answer; this is exactly the conversation that should happen on day one of any classification project.

ROC-AUC, briefly

Most classifiers don’t just output a class; they output a probability or score. The decision threshold (default 0.5) turns that probability into a class. Raising the threshold usually buys precision at the cost of recall, and lowering it does the reverse.
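You can apply a threshold yourself by comparing the predicted probabilities against it. The sketch below sweeps three thresholds on synthetic data; the dataset, the 90/10 class mix, and the threshold values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y,
)

model = LogisticRegression(max_iter=500).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of class 1

recalls = []
for threshold in (0.2, 0.5, 0.8):
    preds = (probs >= threshold).astype(int)
    # zero_division=0 avoids a warning if nothing is flagged at a high threshold
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds)
    recalls.append(r)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Recall can only stay flat or fall as the threshold rises, because fewer rows get flagged. Precision usually moves the other way, though it is not guaranteed to.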

ROC-AUC scores a classifier across all possible thresholds. It answers: “If I pick a random positive and a random negative, how often will the model rank the positive higher?”

from sklearn.metrics import roc_auc_score

probs = model.predict_proba(X_test)[:, 1]  # probability of class 1
print(roc_auc_score(y_test, probs))

0.5 — the model is no better than a coin flip.
0.7–0.8 — useful.
0.9+ — strong.
1.0 — perfect (or, more likely, leakage).
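The "random positive vs random negative" reading can be checked by brute force: compare every positive score with every negative score and count how often the positive wins. The labels and scores below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model scores
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive with every negative; a tie counts as half a win
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

library_auc = roc_auc_score(y_true, scores)
print(f"pairwise: {pairwise_auc:.4f}")  # pairwise: 0.8333
print(f"sklearn:  {library_auc:.4f}")   # sklearn:  0.8333
```

Ten of the twelve positive/negative pairs are ranked correctly, so the AUC is 10/12. The two misses both come from the negative scored at 0.8.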

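ROC-AUC doesn’t depend on your threshold choice, and it is insensitive to class balance because it only measures how well positives are ranked above negatives. It’s a great single-number summary for ranking-style problems. That insensitivity cuts both ways: on very imbalanced data a high ROC-AUC can coexist with poor precision, so PR-AUC (precision-recall AUC) is often more informative.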
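A common way to get a PR-AUC style number in scikit-learn is average_precision_score, which summarises the precision-recall curve. The sketch below prints it next to ROC-AUC on synthetic data with roughly 2% positives; the dataset and model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 2% positives (an assumption for illustration)
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y,
)

model = LogisticRegression(max_iter=500).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, probs)
avg_precision = average_precision_score(y_test, probs)
prevalence = y_test.mean()  # share of positives in the test set

print(f"ROC-AUC:           {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
print(f"Positive rate:     {prevalence:.3f}")
```

The useful comparison is against each metric's chance level: about 0.5 for ROC-AUC, but roughly the positive rate for average precision. On rare-positive data that makes average precision a much harsher, and often more honest, yardstick.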

A worked example

Putting it together on a small imbalanced dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix,
    classification_report, roc_auc_score,
)

# 1. Make a 95/5 imbalanced dataset
X, y = make_classification(
    n_samples=2000, n_features=10,
    weights=[0.95, 0.05], random_state=42,
)

# 2. Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

# 3. Train
model = LogisticRegression(max_iter=500, class_weight="balanced")
model.fit(X_train, y_train)

# 4. Predictions and probabilities
preds = model.predict(X_test)
probs = model.predict_proba(X_test)[:, 1]

# 5. Evaluate
print("Accuracy:", accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
print("ROC-AUC:", roc_auc_score(y_test, probs))

Notice class_weight="balanced" — it tells logistic regression to weight the rare class more heavily, which is one of the easier ways to deal with imbalance without resampling.
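To see what the option changes, you can fit the same model with and without it and compare precision and recall on the rare class. This sketch reuses the same kind of synthetic 95/5 data; the results dictionary and its keys are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=10,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

results = {}
for class_weight in (None, "balanced"):
    model = LogisticRegression(max_iter=500, class_weight=class_weight)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[str(class_weight)] = (
        precision_score(y_test, preds, zero_division=0),
        recall_score(y_test, preds),
    )

for name, (p, r) in results.items():
    print(f"class_weight={name:<9} precision={p:.2f} recall={r:.2f}")
```

Weighting the rare class typically raises recall and lowers precision, which is the same trade-off as moving the threshold. Whether that is a good deal depends on the error costs discussed earlier.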

Cross-validation: the next step up

A single holdout is fine for quick experiments. For more reliable estimates, k-fold cross-validation splits the data into k chunks, trains k times, each time holding out a different chunk for evaluation. You get k scores; the mean and standard deviation tell you how stable the model is.

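from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Cross-validate on the training portion only, so the held-out test set stays untouched
scores = cross_val_score(
    LogisticRegression(max_iter=500),
    X_train, y_train,
    cv=5,
    scoring="f1",
)
print(scores, scores.mean(), scores.std())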

When you pass an integer cv and a classifier, cross_val_score uses StratifiedKFold by default, which keeps the class proportions roughly the same in every fold.
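Writing the loop out by hand once makes it clearer what cross_val_score is doing. The sketch below runs an explicit StratifiedKFold loop on the training portion of a synthetic dataset and compares it with the one-liner; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split,
)

X, y = make_classification(
    n_samples=2000, n_features=10,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

# Manual loop: fit on k-1 folds, score on the remaining fold
cv = StratifiedKFold(n_splits=5)
manual_scores = []
for fit_idx, val_idx in cv.split(X_train, y_train):
    model = LogisticRegression(max_iter=500)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    val_preds = model.predict(X_train[val_idx])
    manual_scores.append(f1_score(y_train[val_idx], val_preds))

# The one-liner should reproduce the same fold scores
auto_scores = cross_val_score(
    LogisticRegression(max_iter=500), X_train, y_train, cv=5, scoring="f1",
)
print("manual:", np.round(manual_scores, 3))
print("auto:  ", np.round(auto_scores, 3))
```

A fresh model is created for every fold, so nothing learned on one fold carries over to the next. X_test and y_test are never touched.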

A reasonable progression for an ML project:

1. Single holdout split. Quick baseline.
2. Cross-validation for model selection and hyperparameter tuning.
3. A final, untouched test set you only evaluate once, at the very end.

The third step is what keeps you honest. Anything you’ve tuned against is no longer a valid evaluation.
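In code, that progression might look like the sketch below: tune with cross-validation on the training portion, then score the held-out test set once. GridSearchCV is one common way to do the tuning; the parameter grid and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(
    n_samples=2000, n_features=10,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

# Tuning happens with 5-fold CV inside the training data only
search = GridSearchCV(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("CV F1:      ", round(search.best_score_, 3))

# One final evaluation on data the search never saw
final_f1 = f1_score(y_test, search.predict(X_test))
print("test F1:    ", round(final_f1, 3))
```

If the test F1 disappoints, the honest move is to report it, not to go back and tune until it improves.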

Data leakage: the silent killer

Leakage happens when information the model would not have at prediction time, including anything from the test set, bleeds into training. Symptoms: suspiciously high scores that collapse in production. Common causes:

Fitting a scaler on the full dataset before splitting (it sees test statistics).
Including features computed from the future.
Duplicate or near-duplicate rows across train and test.
Target-derived features (e.g. “average price by customer” computed using the test rows).

The fix: split first, fit any preprocessing on the train set only, then apply to test. Pipelines make this almost automatic:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))

The scaler now fits on train only and transforms test using train statistics. No leakage.
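The same idea carries over to cross-validation: if you pass the whole pipeline to cross_val_score, the scaler is refit on the training folds of every split. The sketch below uses make_pipeline (a shorthand that names steps automatically) on synthetic data, and checks that the fitted scaler stores training-set statistics; the dataset is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=2000, n_features=10,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# Each fold refits scaler + model on that fold's training rows only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")
print("CV F1 per fold:", np.round(cv_scores, 3))

# After a plain fit, the scaler holds train statistics, not full-data statistics
pipe.fit(X_train, y_train)
scaler_mean = pipe.named_steps["standardscaler"].mean_
print("uses train mean:", np.allclose(scaler_mean, X_train.mean(axis=0)))
```

Scaling X before calling cross_val_score would leak each validation fold's statistics into training. Keeping the scaler inside the pipeline avoids that without any extra bookkeeping.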

Try it yourself. Take the worked example above. Drop stratify=y from the split and re-run several times with different random_state values. Watch how the test-set class balance — and the recall — wobble. Then put stratify=y back and re-run. The stability difference is why stratification is the default advice for classification.

Common pitfalls

Reporting train accuracy. Always report test accuracy. The training score is for diagnosing overfitting, nothing else.
Tuning on the test set. Once you’ve tuned hyperparameters against it, it’s no longer a test set. Use cross-validation for tuning; keep a final holdout untouched.
Single number tunnel vision. Always look at the confusion matrix, not just one metric.
Ignoring the baseline. 95% accuracy on a 95/5 dataset is the baseline, not a win.
Mismatched train/test distributions. If production data looks different from your test set, your test scores lie.

Recap

You now know:

The holdout split is non-negotiable; stratify=y keeps it stable
Always check a baseline before celebrating a model
The confusion matrix is the source of truth for binary classification
Precision for “is the model right when it says yes?”, recall for “did it find them all?”, F1 to combine
ROC-AUC evaluates ranking across thresholds
Cross-validation is the more reliable evaluation, with a final untouched test set

Next steps

Solid evaluation is the foundation. From here, the natural directions are feature engineering, hyperparameter tuning, and putting models into production. Each is a longer post on its own.

Questions or feedback? Email codeloomdevv@gmail.com.