Linear Regression: The First Model You Should Train

Intermediate · 10 min read

What you'll learn

✓ The intuition behind fitting a straight line to data
✓ How Ordinary Least Squares minimizes squared error
✓ Training a model with scikit-learn LinearRegression
✓ Comparing R-squared and MAE for regression evaluation
✓ The core assumptions and when linear regression fails

Prerequisites

• A foundation in [what machine learning is](/blog/what-is-machine-learning)
• Familiarity with [pandas dataframes](/blog/pandas-dataframes-basics)
• Comfort with [train/test split and metrics](/blog/ml-train-test-split-and-metrics)

Linear regression is the model you should reach for first whenever you face a regression problem. It is fast to train, easy to explain to stakeholders, and its diagnostics tell you a lot about your data. Even if you ultimately ship a gradient-boosted model, linear regression provides a baseline you can defend.

The intuition

Suppose you have a dataset of house sale prices with a single feature, square footage. If you plot price against square footage, you typically see a roughly linear trend: bigger houses cost more. Linear regression draws the single straight line that best summarises that trend.

With one feature the model is y = w*x + b. With many features it becomes y = w1*x1 + w2*x2 + ... + wn*xn + b. The training algorithm searches for the weights w and intercept b that make the line fit your training data as closely as possible.
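As a minimal sketch of that formula, the snippet below computes predictions for two houses with NumPy. The weights and intercept are made-up illustrative values, not the output of any training run.

```python
import numpy as np

# Hypothetical weights, chosen by hand for illustration only:
# dollars per square foot and dollars per bedroom.
w = np.array([180.0, 12_000.0])
b = 50_000.0  # intercept

# Two example houses, one per row: [sqft, bedrooms]
X = np.array([
    [1000, 2],
    [2000, 3],
])

# y = w1*x1 + w2*x2 + b, evaluated for every row at once
y_hat = X @ w + b
print(y_hat)
```

Training is simply the process of finding the w and b that you would otherwise have to guess.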

Ordinary Least Squares

The classic way to fit a linear regression is Ordinary Least Squares, or OLS. For each training row we measure the residual, which is the difference between the actual y and the predicted y. OLS chooses the weights that minimise the sum of the squared residuals.

Squaring serves two purposes. It stops positive and negative residuals from cancelling each other out, and it penalises large mistakes more heavily than small ones. The squared-error objective also has a closed-form solution in linear algebra, so scikit-learn's LinearRegression solves it directly with a least-squares routine rather than an iterative optimiser such as gradient descent.
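To make the closed-form idea concrete, here is a small sketch that solves the least-squares problem with NumPy and checks that it agrees with scikit-learn. The data and the true coefficients (3.0, 2.0, -1.5) are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data with known coefficients so the result can be sanity-checked.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is estimated as one more weight.
X1 = np.column_stack([X, np.ones(len(X))])

# Direct least-squares solution of X1 @ beta ≈ y (no iterative training loop).
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_closed, b_closed = beta[:2], beta[2]

# The same fit through scikit-learn for comparison.
model = LinearRegression().fit(X, y)

print("numpy:  ", w_closed, b_closed)
print("sklearn:", model.coef_, model.intercept_)
```

Both lines should print essentially the same numbers, because both minimise the same sum of squared residuals.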

Training with scikit-learn

Here is a complete training pipeline using a small synthetic dataset.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

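# Reproducible synthetic data: 500 houses with a known price formula plus noise.
rng = np.random.default_rng(42)
n = 500
sqft = rng.integers(600, 3500, size=n)    # 600 to 3499 (upper bound is exclusive)
bedrooms = rng.integers(1, 6, size=n)     # 1 to 5 bedrooms
noise = rng.normal(0, 25_000, size=n)     # price noise, standard deviation of 25,000
# True relationship the model should recover: 50,000 base, 180 per sqft, 12,000 per bedroom.
price = 50_000 + 180 * sqft + 12_000 * bedrooms + noise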

df = pd.DataFrame({"sqft": sqft, "bedrooms": bedrooms, "price": price})

X = df[["sqft", "bedrooms"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

print("intercept:", model.intercept_)
print("coefficients:", dict(zip(X.columns, model.coef_)))

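A couple of things to notice. We split the data before fitting and fit on the training portion only, so the test rows stay unseen until evaluation. The coef_ attribute gives you a direct, interpretable estimate: each coefficient is the change in the predicted price for a one-unit increase in that feature, with the other features held fixed. Because the data was generated with 180 per square foot and 12,000 per bedroom, the fitted coefficients should land close to those values.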
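One way to convince yourself of that reading of coef_ is to predict for two houses that differ by a single square foot. This is a self-contained sketch that regenerates similar synthetic data with a different seed; the base house of 1,500 sqft and 3 bedrooms is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data in the same style as the pipeline above (different seed).
rng = np.random.default_rng(1)
sqft = rng.integers(600, 3500, size=300)
bedrooms = rng.integers(1, 6, size=300)
price = 50_000 + 180 * sqft + 12_000 * bedrooms + rng.normal(0, 25_000, size=300)

X = np.column_stack([sqft, bedrooms])
model = LinearRegression().fit(X, price)

base = np.array([[1500, 3]])
bigger = np.array([[1501, 3]])  # one extra square foot, same number of bedrooms

# The difference between the two predictions equals the sqft coefficient.
delta = model.predict(bigger)[0] - model.predict(base)[0]
print("prediction change:", delta)
print("sqft coefficient: ", model.coef_[0])
```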

Evaluating: R-squared vs MAE

Two metrics get used most often for regression.

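R-squared, or the coefficient of determination, measures the proportion of variance in the target that the model explains. A value of 1.0 means a perfect fit, and 0.0 means the model does no better than always predicting the mean. On held-out data it can also go negative, which means the model is doing worse than that mean-only baseline.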

Mean Absolute Error reports the average size of the prediction error in the original units of the target. If MAE is 18,000 dollars, your typical prediction is off by about that much.

preds = model.predict(X_test)
print("R^2:", round(r2_score(y_test, preds), 3))
print("MAE:", round(mean_absolute_error(y_test, preds), 2))
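If it helps to see what those two functions compute, here is a hand-rolled sketch of both metrics in NumPy. The function names r2_manual and mae_manual and the four example prices are made up for illustration; in practice you would keep using the scikit-learn versions.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    # 1 minus (squared error of the model / squared error of predicting the mean)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae_manual(y_true, y_pred):
    # Average absolute error, in the same units as the target
    return np.mean(np.abs(y_true - y_pred))

# Made-up actual and predicted prices, each prediction off by 10,000
y_true = np.array([200_000.0, 250_000.0, 300_000.0, 350_000.0])
y_pred = np.array([210_000.0, 240_000.0, 310_000.0, 340_000.0])

print("R^2:", r2_manual(y_true, y_pred))
print("MAE:", mae_manual(y_true, y_pred))

# Predicting the mean for every row gives an R^2 of exactly zero.
mean_only = np.full_like(y_true, y_true.mean())
print("R^2 of mean-only baseline:", r2_manual(y_true, mean_only))
```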

Use R-squared to compare different models on the same dataset. Use MAE to communicate error in business terms. If you want a metric that punishes large mistakes more heavily, use Root Mean Squared Error. Like MAE it is expressed in the units of the target, but because errors are squared before averaging, a few large misses raise it much more than they raise MAE.
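A quick sketch of how RMSE and MAE react differently to one large mistake. The two error vectors are invented: both have the same MAE, but one concentrates all of the error in a single prediction.

```python
import numpy as np

def mae(errors):
    return np.mean(np.abs(errors))

def rmse(errors):
    return np.sqrt(np.mean(errors ** 2))

# Four predictions, each off by 10,000
even = np.array([10_000.0, -10_000.0, 10_000.0, -10_000.0])
# Three perfect predictions and one that is off by 40,000
spiky = np.array([0.0, 0.0, 0.0, 40_000.0])

print("even:  MAE", mae(even), "RMSE", rmse(even))
print("spiky: MAE", mae(spiky), "RMSE", rmse(spiky))
```

MAE cannot tell the two cases apart, while RMSE doubles for the one with the single large miss.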

Assumptions that matter

Linear regression has a small number of assumptions. Knowing them tells you when the model can be trusted.

The first is linearity. The relationship between each feature and the target should be approximately a straight line. If price grows with the square of square footage, a plain linear model under-fits unless you add a sqft_squared feature.
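Here is a sketch of that fix on invented data where price really does grow with the square of square footage. The constants in the price formula are arbitrary; the point is only the comparison between the two fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: price depends on sqft squared, plus noise.
rng = np.random.default_rng(2)
sqft = rng.uniform(600, 3500, size=400)
price = 40_000 + 0.05 * sqft**2 + rng.normal(0, 15_000, size=400)

# Plain linear model: one feature
X_linear = sqft.reshape(-1, 1)
# Same model class with an extra sqft_squared column
X_quad = np.column_stack([sqft, sqft**2])

r2_linear = LinearRegression().fit(X_linear, price).score(X_linear, price)
r2_quad = LinearRegression().fit(X_quad, price).score(X_quad, price)

print("R^2 with sqft only:        ", round(r2_linear, 4))
print("R^2 with sqft and sqft^2:  ", round(r2_quad, 4))
```

The model is still linear in its weights. Only the feature set changed.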

The second is independent errors. Residuals on one row should not depend on residuals from another. Time series data routinely breaks this assumption and needs special handling.
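One rough way to check for dependent errors is to measure how strongly each residual correlates with the one before it. The sketch below is a simple illustration on simulated residuals, not a formal test; the helper name lag1_autocorr and the 0.8 carry-over factor are assumptions made for the example.

```python
import numpy as np

def lag1_autocorr(residuals):
    # Correlation between each residual and the previous one
    r = residuals - residuals.mean()
    return np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)

rng = np.random.default_rng(3)

# Independent errors: plain white noise
white = rng.normal(size=1000)

# Dependent errors: each one carries over 80% of the previous error,
# as often happens with time-ordered data.
correlated = np.zeros(1000)
for t in range(1, 1000):
    correlated[t] = 0.8 * correlated[t - 1] + white[t]

print("independent:", round(lag1_autocorr(white), 3))
print("dependent:  ", round(lag1_autocorr(correlated), 3))
```

A value near zero is what you hope to see. A value far from zero suggests the rows are not independent.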

The third is constant variance, sometimes called homoscedasticity. The spread of residuals should be roughly the same across the range of predictions. Plotting residuals against predicted values quickly reveals violations, which typically show up as a funnel shape that widens or narrows along the axis.
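If you want a number to go with the plot, a crude check is to compare the residual spread for low and high predictions. The sketch below simulates residuals whose spread grows with the prediction; the 10% scaling is an assumption made purely to produce a visible violation.

```python
import numpy as np

rng = np.random.default_rng(4)
predicted = rng.uniform(100_000, 500_000, size=2000)

# Simulated residuals whose standard deviation is 10% of the prediction,
# so expensive houses get noisier errors (non-constant variance).
residuals = rng.normal(0, 0.1 * predicted)

# Sort residuals by prediction, then split into a low half and a high half.
order = np.argsort(predicted)
low_half, high_half = np.array_split(residuals[order], 2)

spread_ratio = high_half.std() / low_half.std()
print("residual std, low predictions: ", round(low_half.std()))
print("residual std, high predictions:", round(high_half.std()))
print("ratio:", round(spread_ratio, 2))
```

With constant variance the ratio would sit close to 1.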

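Finally, features should not be near-duplicates of one another. This problem, called multicollinearity, hurts interpretability more than it hurts predictions. If two features carry nearly the same information, the model can split the effect between their coefficients in unstable ways. Compute correlations between features before reading too much into a single coefficient.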
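A minimal sketch of that correlation check, using an invented near-duplicate feature: the same floor area expressed in square metres with a little measurement noise added.

```python
import numpy as np

rng = np.random.default_rng(5)
sqft = rng.uniform(600, 3500, size=500)
# Hypothetical redundant feature: area in square metres, plus small noise
sqm = sqft * 0.0929 + rng.normal(0, 2, size=500)
bedrooms = rng.integers(1, 6, size=500)

# Rows are houses, columns are features
features = np.column_stack([sqft, sqm, bedrooms])
corr = np.corrcoef(features, rowvar=False)

print(np.round(corr, 3))
```

The sqft and sqm entry will be very close to 1, which is the signal to drop or combine one of the two before interpreting their coefficients.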

When linear regression fails

If your evaluation metrics are poor on both training and test sets, the model is under-fitting. Possible fixes include polynomial features, log transforms of skewed columns, or moving to a non-linear model such as a gradient boosted tree.

If training error is low but test error is high, you are over-fitting, which often happens when you have many features relative to rows. Ridge regression and Lasso add regularisation that penalises large coefficients and tends to help in this regime.
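The sketch below sets up that regime on invented data: 40 features, only three of which matter, and 50 training rows. The alpha value of 10.0 is an arbitrary starting point, not a recommendation; in practice you would tune it with cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n_rows, n_features = 100, 40
X = rng.normal(size=(n_rows, n_features))
# Only the first three features influence the target; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=1.0, size=n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)  # alpha chosen arbitrarily

for name, m in [("OLS", ols), ("Ridge", ridge)]:
    print(
        name,
        "train R^2:", round(m.score(X_train, y_train), 3),
        "test R^2:", round(m.score(X_test, y_test), 3),
        "coef norm:", round(float(np.linalg.norm(m.coef_)), 3),
    )
```

Ridge always fits the training rows slightly worse than plain OLS and ends up with smaller coefficients. Whether that trade improves the test score is what the printed numbers let you judge.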

If a single coefficient looks unreasonable, check whether the feature is highly correlated with another or whether the column contains outliers that drag the line away from the bulk of the data.
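To see how much one outlier can drag the line, here is a tiny sketch with ten points that lie exactly on a line of slope 2, followed by the same data with a single corrupted target value. The numbers are invented for illustration.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0  # points lying exactly on a line with slope 2

# Least-squares line through the clean data
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Corrupt a single target value at the right-hand edge
y_outlier = y.copy()
y_outlier[-1] += 100.0

slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

print("slope without outlier:", round(slope_clean, 3))
print("slope with outlier:   ", round(slope_outlier, 3))
```

One bad row out of ten is enough to more than triple the slope here, which is why an odd-looking coefficient is a good reason to look at the raw column.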

Wrap up

Linear regression is small, fast, and clear. It gives you a baseline error number, a set of interpretable coefficients, and a starting point for diagnosing whether the data has the shape your modelling assumes. Train it first, evaluate with the right metrics, and then decide whether the gains from a heavier model are worth the loss of interpretability.