Logistic Regression for Binary Classification
Learn how logistic regression turns a linear score into a probability, how to train it with scikit-learn, and how to evaluate binary classifiers using ROC-AUC.
What you'll learn
- Why we squash a linear score with the sigmoid function
- How the decision boundary works in feature space
- Training a logistic regression model with scikit-learn
- Evaluating classifiers with ROC curves and AUC
- When logistic regression is the right starting point
Prerequisites
- A foundation in [what machine learning is](/blog/what-is-machine-learning)
- Familiarity with [pandas dataframes](/blog/pandas-dataframes-basics)
- Understanding of [train/test split and metrics](/blog/ml-train-test-split-and-metrics)
Despite the name, logistic regression is a classification model. It is one of the most widely deployed models in production because it is fast, interpretable, and ships sensible probability estimates out of the box. If you are predicting a yes or no outcome, train a logistic regression first and demand that any fancier model beat it.
From linear score to probability
If you have ever trained linear regression, you already know two-thirds of logistic regression. The model still computes a weighted sum of features, z = w1*x1 + w2*x2 + ... + b. The difference is what happens next. Linear regression returns z directly as a real-valued prediction. Logistic regression passes z through the sigmoid function:
sigmoid(z) = 1 / (1 + exp(-z))
The sigmoid squashes any real number into the interval (0, 1), which we can read as a probability. Large positive z produces probabilities close to 1, large negative z produces probabilities close to 0, and z = 0 produces exactly 0.5.
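To make the formula concrete, here is a minimal NumPy sketch of the two steps, linear score then sigmoid. The weights, bias, and feature values are made up for illustration and do not come from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and feature row, chosen only for illustration.
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = w @ x + b      # linear score: 0.8*2.0 - 0.5*1.0 + 0.1 = 1.2
p = sigmoid(z)     # read as the probability of the positive class

# Large negative, zero, and large positive scores.
scores = np.array([-10.0, 0.0, 10.0])
probs_demo = sigmoid(scores)

print("z =", round(float(z), 3), "p =", round(float(p), 3))
print("sigmoid of", scores, "=", probs_demo)
```

The middle entry of probs_demo is exactly 0.5, while the outer two sit very close to 0 and 1.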
The decision boundary
If we classify by thresholding the probability at 0.5, the cutoff happens exactly where z = 0, because sigmoid(0) = 0.5. In feature space, the set of points where z = 0 is a straight line, plane, or hyperplane, depending on the number of features. Logistic regression is therefore a linear classifier even though its output is a non-linear function of the features.
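One way to check this equivalence is to compare the raw score with the probability on a fitted model. The sketch below uses a small synthetic dataset from make_classification, which is an assumption for illustration only; decision_function returns the linear score z for each row.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic two-feature problem, used only for illustration.
X_d, y_d = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    random_state=0,
)
clf = LogisticRegression().fit(X_d, y_d)

z = clf.decision_function(X_d)        # linear score for each row
p = clf.predict_proba(X_d)[:, 1]      # probability of class 1

# The score is the weighted sum of features plus the intercept.
z_manual = X_d @ clf.coef_[0] + clf.intercept_[0]

# Thresholding p at 0.5 gives the same labels as thresholding z at 0.
same_labels = np.array_equal(p > 0.5, z > 0)
print("same labels:", same_labels)
```

The boundary itself is the line where clf.coef_[0] @ x + clf.intercept_[0] equals zero.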
A consequence is that logistic regression cannot separate classes that are interleaved in a non-linear way. If your two classes form concentric circles, no straight line will divide them well and you need a different model or feature engineering.
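The concentric-circles case can be reproduced with scikit-learn's make_circles. This is a rough sketch: the dataset parameters are arbitrary, and accuracy is measured on the training data only to keep the example short. Adding the squared distance from the origin as an extra feature is one possible piece of feature engineering, not the only one.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Two concentric rings: the inner ring is class 1, the outer ring is class 0.
X_c, y_c = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# Raw coordinates only: the boundary has to be a straight line.
raw_model = LogisticRegression().fit(X_c, y_c)
raw_acc = raw_model.score(X_c, y_c)

# Engineered feature: squared distance from the origin.
r2 = (X_c ** 2).sum(axis=1, keepdims=True)
X_aug = np.hstack([X_c, r2])
aug_model = LogisticRegression().fit(X_aug, y_c)
aug_acc = aug_model.score(X_aug, y_c)

print("raw features:       ", round(raw_acc, 3))
print("with squared radius:", round(aug_acc, 3))
```

With the extra feature the classes become separable by a plane in the three-dimensional feature space, which corresponds to a circle in the original two dimensions.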
Training in scikit-learn
We will use the well-known breast cancer dataset bundled with scikit-learn so the example is reproducible.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, roc_auc_score, classification_report,
)

# Load the features and the binary target.
data = load_breast_cancer()
X, y = data.data, data.target

# Hold out 20% for testing, keeping the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale the features, then fit the classifier.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=2000, C=1.0)),
])
pipe.fit(X_train, y_train)

preds = pipe.predict(X_test)               # hard 0/1 labels
probs = pipe.predict_proba(X_test)[:, 1]   # probability of class 1

print("accuracy:", round(accuracy_score(y_test, preds), 3))
print("ROC-AUC:", round(roc_auc_score(y_test, probs), 3))
print(classification_report(y_test, preds))
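A few practical notes. The default L2 penalty treats every coefficient alike, so unscaled features would be penalised unevenly, and the solver also converges more slowly on them; that is why we wrap the model in a pipeline with StandardScaler. The C hyperparameter is the inverse of regularisation strength: smaller values mean stronger regularisation, smaller coefficients, and a simpler model. We call predict_proba rather than predict whenever we want the probability itself, either for a downstream threshold or for ROC-AUC.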
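The effect of C is easy to see by comparing the size of the fitted coefficients. The sketch below refits the same pipeline with three arbitrary values of C and, to stay short, fits on the full dataset rather than a train split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:   # illustrative values, not a tuned grid
    pipe_c = Pipeline([
        ("scale", StandardScaler()),
        ("logreg", LogisticRegression(max_iter=5000, C=C)),
    ])
    pipe_c.fit(X, y)
    coef = pipe_c.named_steps["logreg"].coef_[0]
    coef_norms[C] = float(np.linalg.norm(coef))

# Smaller C means a stronger penalty and therefore smaller coefficients.
for C, norm in coef_norms.items():
    print(f"C={C:<6} coefficient L2 norm = {norm:.2f}")
```

In practice C is usually chosen by cross-validation rather than by inspecting coefficient norms.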
ROC curves and AUC
Accuracy is a tempting metric but it is misleading on imbalanced data. If 95 percent of records are negatives, predicting negative for every input gets you 95 percent accuracy with zero useful behaviour.
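The 95 percent scenario can be reproduced in a few lines. The labels below are synthetic and the "model" simply predicts the negative class for every row.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Synthetic labels: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A useless classifier that always predicts negative.
y_pred = np.zeros_like(y_true)
constant_scores = np.zeros(len(y_true))

acc = accuracy_score(y_true, y_pred)             # fraction of correct labels
rec = recall_score(y_true, y_pred)               # fraction of positives found
auc = roc_auc_score(y_true, constant_scores)     # ranking quality of the scores

print("accuracy:", acc)
print("recall:  ", rec)
print("ROC-AUC: ", auc)
```

Accuracy looks strong, yet recall is zero and the constant scores rank positives no better than chance.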
The Receiver Operating Characteristic (ROC) curve sweeps the classification threshold across its full range and plots the true positive rate against the false positive rate at each step. The Area Under the Curve, or AUC, summarises that curve in a single number between 0 and 1. An AUC of 0.5 corresponds to random guessing and 1.0 is a perfect ranker. Values between 0.8 and 0.9 are often considered good in practice, although what counts as good depends heavily on the problem.
AUC has an attractive interpretation: it is the probability that a randomly chosen positive scores higher than a randomly chosen negative. Because it does not depend on the threshold, AUC tells you how well the model ranks examples, which is often what you care about when you will later tune a threshold for a business cost function.
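The ranking interpretation can be verified directly by comparing every positive with every negative. The labels and scores below are randomly generated for illustration; ties are counted as half a win.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels and scores: positives score higher on average.
y_true = rng.integers(0, 2, size=200)
scores = rng.normal(loc=y_true * 1.0, scale=1.0)

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs where the positive scores higher.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true, scores)

print("pairwise estimate:", round(float(pairwise_auc), 4))
print("roc_auc_score:    ", round(float(sklearn_auc), 4))
```

The pairwise comparison is quadratic in the number of rows, so it is useful as a sanity check rather than as a replacement for roc_auc_score.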
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label="logreg")
plt.plot([0, 1], [0, 1], "--", label="random")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
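The arrays returned by roc_curve can also be used to pick a threshold for a cost function. The sketch below assumes hypothetical costs of 1 per false positive and 5 per false negative, and it rebuilds the model so that it runs on its own. For brevity it tunes on the test split; in a real project the threshold would be chosen on a separate validation set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=2000, C=1.0)),
]).fit(X_train, y_train)
probs = pipe.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)

# Hypothetical business costs, chosen only for illustration.
COST_FP = 1.0
COST_FN = 5.0

n_pos = int((y_test == 1).sum())
n_neg = int((y_test == 0).sum())

# Error counts implied by each point on the ROC curve.
false_positives = fpr * n_neg
false_negatives = (1 - tpr) * n_pos
total_cost = COST_FP * false_positives + COST_FN * false_negatives

best = int(np.argmin(total_cost))
best_threshold = thresholds[best]

print("best threshold:", round(float(best_threshold), 3))
print("cost at best threshold:", total_cost[best])
```

A row is then labelled positive when its predicted probability is at least best_threshold.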
When to choose logistic regression
Reach for logistic regression when you need a probability rather than just a label (its estimates are usually reasonably well calibrated), when the dataset is small to medium sized, when interpretability matters because coefficients show the direction and rough size of each feature’s effect, or when latency matters because scoring a row takes only a dot product and one sigmoid evaluation.
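As a rough illustration of the interpretability point, the fitted coefficients can be paired with feature names and ranked by magnitude. This sketch fits on the full breast cancer dataset for brevity. Because the features are standardised, each coefficient describes the effect of a one standard deviation change, and in this dataset class 1 is the benign label.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=2000, C=1.0)),
]).fit(data.data, data.target)

coefs = pipe.named_steps["logreg"].coef_[0]

# Rank features by absolute coefficient, largest first.
order = np.argsort(-np.abs(coefs))

# A positive sign pushes the prediction towards class 1, a negative sign towards class 0.
for i in order[:5]:
    print(f"{data.feature_names[i]:<25} {coefs[i]:+.2f}")
```

Correlated features share credit between them, so the ranking is a rough guide rather than a precise measure of importance.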
You should look elsewhere when the relationships in your data are strongly non-linear, when interactions between features dominate the signal, or when you have millions of sparse text features and would benefit from a model designed for that scale.
A note on regularisation
LogisticRegression in scikit-learn applies L2 regularisation by default, which is usually what you want. Switch to penalty="l1" with solver="liblinear" or solver="saga" if you want feature selection as a side effect of training. The L1 penalty drives coefficients of unhelpful features to exactly zero, which can make the resulting model both smaller and easier to interpret.
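The sparsity effect can be checked by counting coefficients that are exactly zero under each penalty. The value C=0.1 below is an arbitrary choice that makes the effect visible, and the models are fitted on the full dataset to keep the sketch short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Default L2 penalty: shrinks coefficients but rarely makes them exactly zero.
l2_model = LogisticRegression(C=0.1, max_iter=2000).fit(X_scaled, y)

# L1 penalty with a solver that supports it.
l1_model = LogisticRegression(
    penalty="l1", solver="liblinear", C=0.1
).fit(X_scaled, y)

n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))

print("zero coefficients with L2:", n_zero_l2)
print("zero coefficients with L1:", n_zero_l1)
```

Features whose L1 coefficient is zero contribute nothing to the score and could be dropped from the input.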
Wrap up
Logistic regression turns a linear score into a probability with the sigmoid function, draws a linear decision boundary, and trains in seconds on tabular data. Pair it with a careful look at your metrics, prefer ROC-AUC to raw accuracy on imbalanced problems, and use the resulting baseline to judge whether any heavier classifier is actually earning its complexity.