ML Precision, Recall, and F1 Explained

Beginner 9 min read

What you'll learn

✓ What precision, recall, and F1 actually measure
✓ Why accuracy lies on imbalanced data
✓ How threshold choice shifts the precision/recall trade-off
✓ PR curve vs ROC curve and when to use each
✓ Choosing a metric that matches your product cost

Prerequisites

• Familiar with how APIs work

What and Why

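Classification models output probabilities, but business decisions need yes/no answers. You set a threshold and turn each probability into a predicted class. Comparing those predicted classes with the true labels yields four buckets (true positives, false positives, false negatives, and true negatives), and those buckets are the foundation of every metric you will use.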

Picking the right metric is the difference between a model that looks great in a report and a model that actually solves a problem. Pick the wrong metric and you optimize for the wrong thing.

Mental Model

Every binary prediction lands in one of four cells of the confusion matrix.

                Predicted +    Predicted -
Actual +        TP             FN
Actual -        FP             TN

precision = TP / (TP + FP)    "of what I flagged, how much was right?"
recall    = TP / (TP + FN)    "of what was actually positive, how much did I catch?"
F1        = 2 * P * R / (P + R)    "harmonic mean of precision P and recall R"
accuracy  = (TP + TN) / total

Confusion matrix and the metrics built on it

Precision punishes false alarms. Recall punishes misses. F1 forces you to do well on both at once.
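As a minimal sketch of how the four cells come about, the snippet below thresholds a handful of scores and counts each cell by hand. The scores, labels, and the helper name confusion_counts are hypothetical and exist only for illustration; labels are assumed to be encoded as 0/1.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FP, FN, TN) for binary labels encoded as 0/1."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # flagged and actually positive
        elif predicted == 1 and actual == 0:
            fp += 1   # flagged but actually negative (false alarm)
        elif predicted == 0 and actual == 1:
            fn += 1   # not flagged but actually positive (miss)
        else:
            tn += 1   # not flagged and actually negative
    return tp, fp, fn, tn


# Hypothetical model scores and true labels, for illustration only.
scores = [0.91, 0.80, 0.65, 0.40, 0.35, 0.10]
y_true = [1, 1, 0, 1, 0, 0]

threshold = 0.5
y_pred = [1 if s >= threshold else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("TP, FP, FN, TN =", (tp, fp, fn, tn))
# TP, FP, FN, TN = (2, 1, 1, 2)
```

With these counts, precision and recall are both 2/3: one false alarm out of three flags, and one miss out of three actual positives.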

A worked example: a fraud detector flags 100 transactions, and 80 of those are actually fraud. There were 200 fraudulent transactions in total. Precision = 80/100 = 0.80. Recall = 80/200 = 0.40. The model is precise, but it misses 60% of the fraud.
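The arithmetic of the fraud example can be checked with a few lines of Python. This is only a sketch: the helper name precision_recall_f1 is made up for illustration, and returning 0.0 when a denominator is empty is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Assumed convention: return 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Numbers from the fraud example: 100 flagged, 80 correct, 200 frauds in total.
flagged = 100
tp = 80
total_fraud = 200
fp = flagged - tp        # 20 false alarms
fn = total_fraud - tp    # 120 missed frauds

p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
# precision=0.80 recall=0.40 f1=0.533
```

F1 lands at about 0.53, closer to the weaker of the two numbers than the plain average of 0.60 would be.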

Hands-on Example

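scikit-learn (sklearn) has clean implementations of every metric. The snippet assumes a fitted binary classifier named model and a held-out validation split X_val, y_val.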

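import numpy as np
from sklearn.metrics import (
    precision_score, recall_score, f1_score, classification_report,
    confusion_matrix, precision_recall_curve, roc_auc_score
)

# Assumes a fitted binary classifier `model` and a validation split (X_val, y_val).
probs = model.predict_proba(X_val)[:, 1]  # predicted probability of the positive class
y_pred = (probs >= 0.5).astype(int)       # apply the default 0.5 threshold

print("Precision:", precision_score(y_val, y_pred))
print("Recall:   ", recall_score(y_val, y_pred))
print("F1:       ", f1_score(y_val, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_val, probs))
# Rows are actual, columns are predicted, labels ordered [0, 1]: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

# Tune the threshold
# precision and recall have one more entry than thresholds: the final point
# (precision=1, recall=0) has no threshold, so drop it before indexing.
precision, recall, thresholds = precision_recall_curve(y_val, probs)

# Pick the highest threshold that still reaches a target recall of 0.90.
# Recall never increases as the threshold rises, so take the last qualifying index.
idx = np.flatnonzero(recall[:-1] >= 0.90)[-1]
print("threshold:", thresholds[idx], "precision:", precision[idx], "recall:", recall[idx])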

The default 0.5 threshold is rarely the right one. Pick the threshold that matches your operating point, not the algorithm’s default.
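To make the trade-off concrete, here is a small sketch that sweeps three thresholds over a toy set of scores. The scores, labels, and the helper precision_recall_at are hypothetical, chosen only to show the direction of the effect.

```python
# Hypothetical model scores and true labels (1 = positive), for illustration only.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def precision_recall_at(threshold, scores, labels):
    """Return (precision, recall) when scores >= threshold are flagged positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # Assumed convention: report 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.25, 0.50, 0.80):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.25  precision=0.67  recall=1.00
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.80  precision=1.00  recall=0.50
```

Raising the threshold trades recall for precision on this toy data. On real data the curve is usually noisier, but the direction of the trade-off is the same.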

Trade-offs

There is no metric that is “correct” in all situations. Match the metric to the cost of mistakes.

• High precision matters when false positives are expensive: spam filters, content moderation, and medical screenings where a positive prediction triggers an expensive intervention.
• High recall matters when missed positives are expensive: cancer detection, fraud detection on big-ticket transactions, safety-critical alerts.
• F1 balances the two. It is useful when you have no strong prior on which error costs more.
• F-beta lets you weight one over the other. F2 treats recall as twice as important as precision; F0.5 does the opposite.
• Accuracy is fine when classes are balanced and the costs are symmetric. It is misleading when one class is rare.
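The F-beta score is commonly written as (1 + beta^2) * P * R / (beta^2 * P + R). As a sketch, the snippet below applies that formula to the fraud example from earlier (precision 0.80, recall 0.40). The function name f_beta is hypothetical, and returning 0.0 on an empty denominator is an assumed convention.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    # Assumed convention: return 0.0 when precision and recall are both zero.
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


precision, recall = 0.80, 0.40  # the fraud detector from the worked example

for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(precision, recall, beta):.3f}")
# F0.5 = 0.667
# F1 = 0.533
# F2 = 0.444
```

The same model scores 0.667 when precision is favored and 0.444 when recall is favored, which is why the choice of beta should come from the cost of each error rather than from habit.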

On a 99/1 imbalanced dataset, predicting “always negative” gives 99% accuracy and 0% recall. That model is useless and the metric does not reveal it.
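The "always negative" case is easy to reproduce. The sketch below builds a hypothetical 1,000-example dataset with 1% positives and scores a classifier that never flags anything.

```python
# Hypothetical dataset: 10 positives and 990 negatives (a 99/1 split).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)  # a "model" that always predicts negative

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# accuracy=0.99 recall=0.00
```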

PR vs ROC

Two curves dominate model selection: ROC (true positive rate vs false positive rate) and PR (precision vs recall).

• ROC-AUC does not change when the class ratio changes, which makes it easy to compare across datasets, but it can look deceptively good on highly imbalanced data: the false positive rate FP / (FP + TN) stays tiny when TN is enormous, even if false positives outnumber true positives.
• PR-AUC is the better choice when positives are rare. Precision and recall ignore TN entirely, so a huge true-negative population cannot hide poor performance.

A rule of thumb: if positives are below 10% of the data, prefer PR-AUC for model comparisons.
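A single operating point shows why the two curves can disagree. The counts below are hypothetical (0.1% positives) and chosen only to illustrate how a large TN count keeps the false positive rate small while precision collapses.

```python
# Hypothetical confusion-matrix counts on a dataset with 0.1% positives.
tp, fn = 800, 200            # 1,000 actual positives
fp, tn = 9_990, 989_010      # 999,000 actual negatives

tpr = tp / (tp + fn)         # true positive rate (recall), the ROC y-axis
fpr = fp / (fp + tn)         # false positive rate, the ROC x-axis
precision = tp / (tp + fp)   # the PR y-axis

print(f"TPR={tpr:.2f} FPR={fpr:.3f} precision={precision:.3f}")
# TPR=0.80 FPR=0.010 precision=0.074
```

On the ROC plane this point (FPR 0.01, TPR 0.80) looks strong, yet fewer than 8 in 100 flagged items are real positives. The PR view exposes that directly.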

Practical Tips

• Always look at the confusion matrix, not just the headline metric. Many bugs are visible the moment you see the four numbers laid out.
• Tune the threshold to a business target. “Hit 90% recall and report whatever precision falls out” is a clearer product specification than “maximize F1.”
• Compare models at the same operating point. Comparing model A at threshold 0.3 to model B at threshold 0.7 is meaningless, because raw thresholds are not comparable across models. Fix a target such as 90% recall and compare the precision each model reaches there.
• For multi-class, decide between macro and weighted averaging. Macro treats every class equally; weighted respects class frequency. The right pick depends on whether rare classes matter as much as common ones.
• Report a confidence interval on small validation sets. A precision of 0.83 measured on 50 positives carries a wide interval. Bootstrap to get error bars.
• Track metrics over time in production. A model that scores 0.85 F1 in offline eval can drift to 0.70 over a few months as data shifts.
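One way to get the error bars mentioned above is a percentile bootstrap. The sketch below is a simplified version that resamples only the flagged examples; a fuller treatment would resample the whole validation set. The data (60 flagged examples, 50 of them correct), the function name bootstrap_precision_ci, and its defaults are all hypothetical.

```python
import random


def bootstrap_precision_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for precision.

    outcomes holds one entry per flagged example:
    1 if it was a true positive, 0 if it was a false positive.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(outcomes, k=n)  # resample with replacement
        estimates.append(sum(sample) / n)
    estimates.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical validation result: 60 flagged examples, 50 of them correct.
outcomes = [1] * 50 + [0] * 10
point_estimate = sum(outcomes) / len(outcomes)

lo, hi = bootstrap_precision_ci(outcomes)
print(f"precision={point_estimate:.3f}  95% interval=({lo:.3f}, {hi:.3f})")
```

With this few flagged examples the interval spans well over ten points of precision, so a difference of a few points between two models is not evidence that one is better.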

Wrap-up

Precision, recall, and F1 are not interchangeable. Each encodes a different cost structure, and the right one is dictated by the consequences of each kind of error. Look at the confusion matrix, tune the threshold to the operating point your product needs, and use PR curves when positives are rare. Once the metric matches the problem, model improvement becomes a much cleaner conversation.