ML Precision Recall and F1 Explained
Decode precision, recall, F1, and accuracy with concrete intuition, threshold tuning, and PR vs ROC curve guidance for imbalanced data.
What you'll learn
- What precision, recall, and F1 actually measure
- Why accuracy lies on imbalanced data
- How threshold choice shifts the precision/recall trade-off
- PR curve vs ROC curve and when to use each
- Choosing a metric that matches your product cost
Prerequisites
- Familiar with how APIs work
What and Why
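Classification models output probabilities, but business decisions need yes/no answers. You set a threshold and turn each prediction into a class. Comparing those predicted classes with the true labels sorts every example into one of four buckets: true positives, false positives, false negatives, and true negatives. Those four buckets are the foundation of every metric you will use.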
Picking the right metric is the difference between a model that looks great in a report and a model that actually solves a problem. Pick the wrong metric and you optimize for the wrong thing.
Mental Model
Every binary prediction lands in one of four cells of the confusion matrix.
            Predicted +   Predicted -
Actual +    TP            FN
Actual -    FP            TN
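As a minimal sketch of how the four cells get filled, the snippet below tallies them from two label lists. The helper name `confusion_counts` and the sample labels are made up for illustration, and it assumes binary labels where 1 is the positive class.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Tally TP, FP, FN, TN for binary labels where 1 is the positive class."""
    cells = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            cells["TP"] += 1   # flagged and actually positive
        elif actual == 0 and predicted == 1:
            cells["FP"] += 1   # flagged but actually negative (false alarm)
        elif actual == 1 and predicted == 0:
            cells["FN"] += 1   # positive that was not flagged (miss)
        else:
            cells["TN"] += 1   # correctly left alone
    return cells

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts["TP"], counts["FP"], counts["FN"], counts["TN"])
```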
precision = TP / (TP + FP)    "of what I flagged, how much was right?"
recall    = TP / (TP + FN)    "of what was actually positive, how much did I catch?"
F1        = 2 * P * R / (P + R)    where P is precision and R is recall
accuracy  = (TP + TN) / total
Precision punishes false alarms. Recall punishes misses. F1 forces you to do well on both at once.
A worked example: a fraud detector flags 100 transactions. 80 are actually fraud. There were 200 fraudulent transactions in total. Precision = 80/100 = 0.80. Recall = 80/200 = 0.40. The model is precise but misses too much fraud.
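The arithmetic in the fraud example can be checked with a few lines of plain Python. This is only a sketch: the function names are illustrative, and returning 0.0 when a denominator is zero is an assumed convention rather than something the formulas dictate.

```python
def precision(tp, fp):
    # Assumed convention: 0.0 when nothing was flagged.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Assumed convention: 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Counts from the fraud example: 100 flagged, 80 of them fraud, 200 frauds in total.
tp = 80
fp = 100 - tp   # flagged but legitimate
fn = 200 - tp   # fraud the model missed

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.2f} recall={r:.2f} F1={f1(p, r):.2f}")
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two numbers than a simple average would.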
Hands-on Example
scikit-learn provides implementations of all of these metrics in sklearn.metrics.
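import numpy as np
from sklearn.metrics import (
    precision_score, recall_score, f1_score, classification_report,
    confusion_matrix, precision_recall_curve, roc_auc_score,
)

# Assumes `model` is a fitted binary classifier and (X_val, y_val) is a held-out split.
probs = model.predict_proba(X_val)[:, 1]  # probability of the positive class
y_pred = (probs >= 0.5).astype(int)       # default threshold

print("Precision:", precision_score(y_val, y_pred))
print("Recall:   ", recall_score(y_val, y_pred))
print("F1:       ", f1_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))    # sklearn layout: [[TN, FP], [FN, TP]]
print(classification_report(y_val, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_val, probs))

# Tune the threshold
precision, recall, thresholds = precision_recall_curve(y_val, probs)
# precision and recall have one more entry than thresholds, so drop the last point.
# Recall falls as the threshold rises: take the highest threshold with recall >= 0.90.
idx = np.flatnonzero(recall[:-1] >= 0.90)[-1]
print("threshold:", thresholds[idx], "precision:", precision[idx], "recall:", recall[idx])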
The default 0.5 threshold is rarely the right one. Pick the threshold that matches your operating point, not the algorithm’s default.
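To make the trade-off concrete without a trained model, here is a small self-contained sketch on ten made-up scores. The data, the helper names, and the choice to report precision as 1.0 when nothing is flagged are all assumptions for illustration.

```python
# Made-up validation scores and labels, sorted by score for readability.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Precision and recall when every score >= threshold is flagged positive."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    tp = sum(flagged)
    precision = tp / len(flagged) if flagged else 1.0  # assumed convention for an empty flag set
    recall = tp / sum(labels)
    return precision, recall

def highest_threshold_for_recall(target):
    """Scan thresholds from strict to lenient; return the first that reaches the target recall."""
    for threshold in sorted(set(scores), reverse=True):
        _, recall = precision_recall_at(threshold)
        if recall >= target:
            return threshold
    return None

# A stricter threshold buys precision at the cost of recall, and vice versa.
for t in (0.85, 0.50, 0.35):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")

print("threshold for recall >= 0.90:", highest_threshold_for_recall(0.90))
```

On this toy data, raising the threshold from 0.35 to 0.85 lifts precision while recall drops, which is the same movement you would read off a PR curve.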
Trade-offs
There is no metric that is “correct” in all situations. Match the metric to the cost of mistakes.
- High precision matters when false positives are expensive. Examples include spam filters, content moderation, and medical screenings where a positive prediction triggers an expensive intervention.
- High recall matters when missed positives are expensive. Examples include cancer detection, fraud detection on big-ticket transactions, and safety-critical alerts.
- F1 balances them. Useful when you have no strong prior on which error costs more.
- F-beta lets you weight one over the other. F2 weights recall twice as much as precision; F0.5 does the opposite.
- Accuracy is fine when classes are balanced and the costs are symmetric. It is misleading when one class is rare.
On a 99/1 imbalanced dataset, predicting “always negative” gives 99% accuracy and 0% recall. That model is useless and the metric does not reveal it.
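Both the F-beta weighting and the always-negative baseline can be checked numerically. The sketch below uses the standard F-beta formula with the precision and recall from the fraud example (0.80 and 0.40), plus a made-up dataset of 1,000 rows with 10 positives. The function name `fbeta` is illustrative.

```python
def fbeta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

# Precision and recall from the fraud example earlier in the article.
p, r = 0.80, 0.40
f_half, f_one, f_two = fbeta(p, r, 0.5), fbeta(p, r, 1), fbeta(p, r, 2)
print(f"F0.5={f_half:.3f} F1={f_one:.3f} F2={f_two:.3f}")

# "Always negative" on a 99/1 dataset: 1,000 rows, 10 of them positive (made up).
tp, fp, fn, tn = 0, 0, 10, 990
accuracy = (tp + tn) / (tp + fp + fn + tn)
recall_baseline = tp / (tp + fn)
print(f"accuracy={accuracy:.2f} recall={recall_baseline:.2f}")
```

Since recall is the weak side of this model, F2 scores it lowest and F0.5 scores it highest, while the baseline shows high accuracy sitting next to zero recall.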
PR vs ROC
Two curves dominate model selection: ROC (true positive rate vs false positive rate) and PR (precision vs recall).
- ROC-AUC does not change with class prevalence, which makes it easy to compare across datasets, but it can look deceptively good on highly imbalanced data. The false positive rate FP / (FP + TN) stays small when TN is enormous, even if false positives outnumber true positives.
- PR-AUC is the better choice when positives are rare. The PR curve does not let a huge true-negative population hide poor performance.
A rule of thumb: if positives are below 10% of the data, prefer PR-AUC for model comparisons.
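A single operating point shows why the two views can disagree. The counts below are made up (100 positives among 100,000 rows); the sketch only illustrates how a small false positive rate can coexist with poor precision.

```python
# Made-up dataset: 100 positives and 99,900 negatives (0.1% positive rate).
positives, negatives = 100, 99_900

tp = 80               # the model catches 80% of positives
fp = 999              # and flags 1% of negatives by mistake
fn = positives - tp
tn = negatives - fp

tpr = tp / (tp + fn)          # y-axis of the ROC curve (same as recall)
fpr = fp / (fp + tn)          # x-axis of the ROC curve
precision = tp / (tp + fp)    # y-axis of the PR curve

print(f"TPR={tpr:.2f} FPR={fpr:.3f} precision={precision:.3f}")
```

On the ROC plane this point looks strong (high TPR at a low FPR), yet most flagged rows are false alarms, which only the precision axis makes visible.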
Practical Tips
- Always look at the confusion matrix, not just the headline metric. Many bugs are visible the moment you see the four numbers laid out.
- Tune the threshold to a business target. “Hit 90% recall and report whatever precision falls out” is a clearer product specification than “maximize F1.”
- Compare models at the same operating point. Comparing model A at threshold 0.3 to model B at threshold 0.7 is meaningless.
- For multi-class, decide between macro and weighted averaging. Macro treats every class equally; weighted averaging weights each class by its frequency. The right pick depends on whether rare classes matter as much as common ones.
- Report a confidence interval on small validation sets. A precision of 0.83 measured on 50 predicted positives has a wide interval. Bootstrap to get error bars.
- Track metrics over time in production. A model that scores 0.85 F1 in offline eval can drift to 0.70 over a few months as data shifts.
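The bootstrap tip above can be sketched with the standard library alone. This is one simple variant (a percentile bootstrap over resampled validation rows), not the only way to do it. The validation set, the helper names, the 2,000 resamples, and the fixed seed are all assumptions for illustration.

```python
import random

def precision_of(pairs):
    """Precision over (actual, predicted) pairs; None if nothing was flagged."""
    flagged = [actual for actual, predicted in pairs if predicted == 1]
    return sum(flagged) / len(flagged) if flagged else None

def bootstrap_precision_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for precision, resampling rows with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(pairs, k=len(pairs))
        value = precision_of(sample)
        if value is not None:        # skip resamples with no flagged rows
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Made-up validation set: 48 flagged rows (40 correct), 10 misses, 142 true negatives.
pairs = [(1, 1)] * 40 + [(0, 1)] * 8 + [(1, 0)] * 10 + [(0, 0)] * 142

point = precision_of(pairs)
lo, hi = bootstrap_precision_ci(pairs)
print(f"precision={point:.2f} 95% interval=({lo:.2f}, {hi:.2f})")
```

With only a few dozen flagged rows the interval is wide enough that two models with slightly different headline precision may not be distinguishable.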
Wrap-up
Precision, recall, and F1 are not interchangeable. Each encodes a different cost structure, and the right one is dictated by the consequences of each kind of error. Look at the confusion matrix, tune the threshold to the operating point your product needs, and use PR curves when positives are rare. Once the metric matches the problem, model improvement becomes a much cleaner conversation.