Logistic Regression for Binary Classification

Intermediate 10 min read

What you'll learn

✓ Why we squash a linear score with the sigmoid function
✓ How the decision boundary works in feature space
✓ Training a logistic regression model with scikit-learn
✓ Evaluating classifiers with ROC curves and AUC
✓ When logistic regression is the right starting point

Prerequisites

• A foundation in [what machine learning is](/blog/what-is-machine-learning)
• Familiarity with [pandas dataframes](/blog/pandas-dataframes-basics)
• Understanding of [train/test split and metrics](/blog/ml-train-test-split-and-metrics)

Despite the name, logistic regression is a classification model. It is one of the most widely deployed models in production because it is fast, interpretable, and ships sensible probability estimates out of the box. If you are predicting a yes or no outcome, train a logistic regression first and demand that any fancier model beat it.

From linear score to probability

If you have ever trained linear regression you already know two thirds of logistic regression. The model still computes a weighted sum of features, z = w1*x1 + w2*x2 + ... + b. The difference is what happens next. Linear regression returns z directly as a real-valued prediction. Logistic regression passes z through the sigmoid function:

sigmoid(z) = 1 / (1 + exp(-z))

The sigmoid squashes any real number into the interval (0, 1), which we can read as a probability. Large positive z produces probabilities close to 1, large negative z produces probabilities close to 0, and z = 0 produces exactly 0.5.
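
As a minimal sketch of the two steps above, the snippet below computes z for a single example and passes it through the sigmoid. The weights, bias, and feature values are made up for illustration and do not come from a fitted model.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias, and features, chosen only for illustration.
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = w @ x + b      # linear score: 0.8*2.0 - 0.5*1.0 + 0.1 = 1.2
p = sigmoid(z)     # read as the probability of the positive class

print(f"z = {z:.2f}, p = {p:.3f}")
# Large negative, zero, and large positive scores.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```

This direct formula is fine for moderate scores. For very large negative z, np.exp(-z) can overflow, and scipy.special.expit is a numerically safer drop-in.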

The decision boundary

If we classify by thresholding the probability at 0.5, the cutoff falls exactly where z = 0, because sigmoid(0) = 0.5. In feature space, the set of points where z = 0 is a straight line, plane, or hyperplane, depending on the number of features. Logistic regression is therefore a linear classifier, even though the probability it outputs is a non-linear function of the features.
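
One way to check this numerically is to fit a model on two synthetic features and compare thresholding z at 0 with thresholding the probability at 0.5. The make_classification dataset and its parameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature problem (illustrative parameters, not from the article).
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
clf = LogisticRegression().fit(X, y)

w = clf.coef_.ravel()   # learned weights w1, w2
b = clf.intercept_[0]   # learned bias
z = X @ w + b           # linear score for every row

by_score = (z > 0).astype(int)                             # threshold z at 0
by_prob = (clf.predict_proba(X)[:, 1] > 0.5).astype(int)   # threshold p at 0.5
print("same labels:", np.array_equal(by_score, by_prob))

# With two features, the boundary z = 0 is the line x2 = -(w1*x1 + b) / w2.
print(f"boundary: x2 = {-w[0] / w[1]:.3f} * x1 + {-b / w[1]:.3f}")
```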

A consequence is that logistic regression cannot separate classes that are interleaved in a non-linear way. If your two classes form concentric circles, no straight line will divide them well and you need a different model or feature engineering.
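
The concentric-circles case can be sketched with scikit-learn's make_circles. The snippet compares a plain logistic regression with one that first expands the inputs into degree-2 polynomial features, which is one possible form of feature engineering. The sample size, noise level, and polynomial degree are assumptions for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two classes arranged as an inner and an outer circle (illustrative settings).
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Plain model: can only draw a straight line through the plane.
linear = LogisticRegression().fit(X_train, y_train)

# Same model on x1, x2, x1^2, x1*x2, x2^2: a linear boundary in the expanded
# space can correspond to a circle in the original space.
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(),
).fit(X_train, y_train)

linear_acc = linear.score(X_test, y_test)
poly_acc = poly.score(X_test, y_test)
print(f"raw features:      {linear_acc:.3f}")
print(f"degree-2 features: {poly_acc:.3f}")
```

The model is still linear in the features it is given. The extra expressiveness comes entirely from the added squared terms.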

Training in scikit-learn

We will use the well-known breast cancer dataset bundled with scikit-learn so the example is reproducible.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, roc_auc_score, classification_report,
)

# Load the feature matrix and the binary target.
data = load_breast_cancer()
X, y = data.data, data.target

# Hold out 20% for testing; stratify keeps the class ratio equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is fitted on the training data only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=2000, C=1.0)),
])
pipe.fit(X_train, y_train)

preds = pipe.predict(X_test)               # hard 0/1 labels
probs = pipe.predict_proba(X_test)[:, 1]   # probability of class 1

print("accuracy:", round(accuracy_score(y_test, preds), 3))
print("ROC-AUC:", round(roc_auc_score(y_test, probs), 3))
print(classification_report(y_test, preds))

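A few practical notes. Regularised logistic regression is sensitive to feature scale, because the penalty treats all coefficients alike and the solver converges more slowly on unscaled inputs, so we wrap it in a pipeline with StandardScaler. Putting the scaler inside the pipeline also means it is fitted on the training data only. The C hyperparameter is the inverse of regularisation strength: smaller values mean stronger regularisation, which shrinks the coefficients and produces simpler models. We call predict_proba rather than predict whenever we want the probability itself, either for a downstream threshold or for ROC-AUC.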
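
To make the role of C concrete, the sketch below fits the same model at three values of C and reports the overall size of the coefficient vector. The specific C values are arbitrary choices for illustration, and the data is scaled once up front because nothing is being evaluated on held-out rows here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

norms = {}
for C in (0.01, 1.0, 100.0):   # illustrative values: strong to weak regularisation
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_scaled, y)
    # L2 norm of the weights: a rough measure of how "large" the model is.
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} coefficient norm = {norms[C]:.3f}")
```

Smaller C should give a smaller norm. In practice C is usually chosen by cross-validation rather than by inspecting coefficients.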

ROC curves and AUC

Accuracy is a tempting metric but it is misleading on imbalanced data. If 95 percent of records are negatives, predicting negative for every input gets you 95 percent accuracy with zero useful behaviour.
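
The 95 percent example can be reproduced in a few lines. The labels below are synthetic and exist only to mirror the ratio in the paragraph above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Synthetic labels: 950 negatives and 50 positives (95% / 5%).
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that predicts negative for every input.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)            # share of correct labels
rec = recall_score(y_true, y_pred)              # share of positives found
auc = roc_auc_score(y_true, y_pred.astype(float))  # constant score, no ranking

print(f"accuracy = {acc:.2f}, recall = {rec:.2f}, ROC-AUC = {auc:.2f}")
```

Accuracy looks strong while recall shows that not a single positive was caught, and the constant score earns the same AUC as random guessing.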

The Receiver Operating Characteristic (ROC) curve sweeps the classification threshold from 0 to 1 and plots the true positive rate against the false positive rate at each step. The Area Under the Curve, or AUC, summarises that curve in a single number between 0 and 1. An AUC of 0.5 corresponds to random guessing, 1.0 is a perfect ranker, and anything below 0.5 means the model ranks examples worse than chance. Values of 0.8 to 0.9 are often regarded as good, although what counts as good depends on the problem.

AUC has an attractive interpretation: it is the probability that a randomly chosen positive scores higher than a randomly chosen negative. Because it does not depend on the threshold, AUC tells you how well the model ranks examples, which is often what you care about when you will later tune a threshold for a business cost function.
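
The pairwise interpretation can be verified directly: compare every positive test example with every negative one, count how often the positive scores higher, and give ties half credit. This sketch refits the same kind of pipeline used earlier so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

pos = probs[y_test == 1]   # scores of the actual positives
neg = probs[y_test == 0]   # scores of the actual negatives

# Matrix of score differences for every (positive, negative) pair.
diff = pos[:, None] - neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print("pairwise estimate:", round(float(pairwise_auc), 6))
print("roc_auc_score:    ", round(roc_auc_score(y_test, probs), 6))
```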

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label="logreg")
plt.plot([0, 1], [0, 1], "--", label="random")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
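
The arrays returned by roc_curve can also drive threshold selection. The sketch below picks the threshold that minimises a simple cost function. The two cost values are hypothetical stand-ins for a real business cost, and the pipeline is refitted so the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)
n_pos = int((y_test == 1).sum())
n_neg = int((y_test == 0).sum())

# Hypothetical costs: a missed positive costs five times a false alarm.
COST_FN, COST_FP = 5.0, 1.0
# Expected number of each error type at every candidate threshold.
total_cost = COST_FN * (1 - tpr) * n_pos + COST_FP * fpr * n_neg

best = int(np.argmin(total_cost))
best_threshold = thresholds[best]
preds = (probs >= best_threshold).astype(int)   # same rule roc_curve uses

print(f"threshold = {best_threshold:.3f}, cost = {total_cost[best]:.1f}")
```

For brevity this tunes the threshold on the test split, which gives an optimistic view. In a real project the threshold would be chosen on a separate validation split.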

When to choose logistic regression

Reach for logistic regression when you need a reasonably well-calibrated probability rather than just a label, when the dataset is small to medium sized, when interpretability matters because coefficients show the direction and rough size of each feature’s effect, or when latency matters because scoring a row takes only a dot product and a sigmoid.
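
As an illustration of the interpretability point, the sketch below fits the pipeline on the breast cancer data and lists the five coefficients with the largest magnitude. Because the features are standardised, each coefficient is the change in z per one standard deviation of that feature. The choice of five is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=2000)),
])
pipe.fit(data.data, data.target)

coefs = pipe.named_steps["logreg"].coef_.ravel()
order = np.argsort(np.abs(coefs))[::-1]   # largest magnitude first

# A positive coefficient pushes the prediction towards class 1.
print("class 1 is:", data.target_names[1])
for i in order[:5]:
    print(f"{data.feature_names[i]:<25} {coefs[i]:+.3f}")
```

Correlated features share credit between their coefficients, so these numbers are a rough guide to direction and size, not a causal ranking.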

You should look elsewhere when the relationships in your data are strongly non-linear, when interactions between features dominate the signal and you cannot engineer them by hand, or when you have millions of sparse text features and would benefit from a model or solver designed for that scale.

A note on regularisation

LogisticRegression in scikit-learn applies L2 regularisation by default, which is usually what you want. Switch to penalty="l1" with solver="liblinear" or solver="saga" if you want feature selection as a side effect of training. The L1 penalty drives coefficients of unhelpful features to exactly zero, which can make the resulting model both smaller and easier to interpret.
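
The sparsity effect can be seen by counting zero coefficients under each penalty. This sketch uses the penalty and solver arguments named above. The value C=0.1 is an assumption, picked to make the regularisation strong enough for the effect to show, and recent scikit-learn releases may emit a deprecation warning for the penalty argument.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Default L2 penalty: shrinks coefficients but rarely makes them exactly zero.
l2 = LogisticRegression(C=0.1, max_iter=2000).fit(X_scaled, y)

# L1 penalty: needs a solver that supports it, such as liblinear or saga.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_scaled, y)

n_zero_l2 = int(np.sum(l2.coef_ == 0))
n_zero_l1 = int(np.sum(l1.coef_ == 0))
print(f"zero coefficients with L2: {n_zero_l2} of {l2.coef_.size}")
print(f"zero coefficients with L1: {n_zero_l1} of {l1.coef_.size}")
```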

Wrap up

Logistic regression turns a linear score into a probability with the sigmoid function, draws a linear decision boundary, and trains in seconds on tabular data. Pair it with a careful look at your metrics, prefer ROC-AUC to raw accuracy on imbalanced problems, and use the resulting baseline to judge whether any heavier classifier is actually earning its complexity.