Linear Regression: The First Model You Should Train
A practical introduction to linear regression with scikit-learn covering OLS, evaluation with R-squared and MAE, and the assumptions that make or break the model.
What you'll learn
- The intuition behind fitting a straight line to data
- How Ordinary Least Squares minimizes squared error
- Training a model with scikit-learn LinearRegression
- Comparing R-squared and MAE for regression evaluation
- The core assumptions and when linear regression fails
Prerequisites
- A foundation in [what machine learning is](/blog/what-is-machine-learning)
- Familiarity with [pandas dataframes](/blog/pandas-dataframes-basics)
- Comfort with [train/test split and metrics](/blog/ml-train-test-split-and-metrics)
Linear regression is the model you should reach for first whenever you face a regression problem. It is fast to train, easy to explain to stakeholders, and the diagnostics it produces tell you a lot about your data. Even if you ultimately ship a gradient-boosted model, linear regression provides a baseline you can defend.
The intuition
Suppose you have a dataset of house sale prices with a single feature, square footage. If you plot price against square footage, you typically see a roughly linear trend: bigger houses cost more. Linear regression draws the single straight line that best summarises that trend.
With one feature the model is y = w*x + b. With many features it becomes y = w1*x1 + w2*x2 + ... + wn*xn + b. The training algorithm searches for the weights w and intercept b that make the line fit your training data as closely as possible.
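To make the equation concrete, here is a minimal sketch of the multi-feature prediction written as a dot product. The weights, intercept, and the two example houses are made-up illustrative values, not fitted ones.

```python
import numpy as np

# Hypothetical weights for two features (sqft, bedrooms) and an intercept.
w = np.array([180.0, 12_000.0])
b = 50_000.0

# Two example houses, one per row: [sqft, bedrooms].
X = np.array([
    [1_000, 2],
    [2_000, 3],
])

# y = w1*x1 + w2*x2 + b, computed for every row at once.
y_hat = X @ w + b
print(y_hat)
```

Training is then the search for the values of w and b that replace these hand-picked numbers.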
Ordinary Least Squares
The classic way to fit a linear regression is Ordinary Least Squares, or OLS. For each training row we measure the residual, which is the difference between the actual y and the predicted y. OLS chooses the weights that minimise the sum of the squared residuals.
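The quantity OLS minimises can be written in a few lines. This sketch uses a tiny made-up dataset and a hypothetical helper, sum_squared_residuals, to score two candidate lines. The lower the score, the better the fit.

```python
import numpy as np

# Tiny made-up dataset that lies exactly on the line y = 2*x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def sum_squared_residuals(w, b):
    """Sum of squared differences between actual and predicted y."""
    residuals = y - (w * x + b)
    return float(np.sum(residuals ** 2))

sse_true_line = sum_squared_residuals(2.0, 1.0)   # the line the data came from
sse_worse_line = sum_squared_residuals(1.5, 1.0)  # a candidate with the wrong slope
print(sse_true_line, sse_worse_line)
```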
Squaring serves two purposes. It makes positive and negative residuals contribute the same way, and it penalises large mistakes more heavily than small ones. The squared-error objective also has a closed-form solution in linear algebra, so scikit-learn's LinearRegression can solve it directly rather than running an iterative optimiser such as gradient descent.
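As a rough illustration of the direct solution, the sketch below solves the least-squares problem with NumPy's np.linalg.lstsq and compares the result with LinearRegression. The data is synthetic, and using lstsq here is my choice for the example, not a claim about what scikit-learn runs internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights (3, -2) and intercept 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.1, size=200)

# Append a column of ones so the intercept is learned as an extra weight.
X_design = np.column_stack([X, np.ones(len(X))])

# Solve the least-squares problem in one call, with no training loop.
solution, *_ = np.linalg.lstsq(X_design, y, rcond=None)
weights, intercept = solution[:2], solution[2]

model = LinearRegression().fit(X, y)
print("lstsq:  ", weights, intercept)
print("sklearn:", model.coef_, model.intercept_)
```

Both routes should agree up to floating-point noise, because they minimise the same objective.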
Training with scikit-learn
Here is a complete training pipeline using a small synthetic dataset.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
rng = np.random.default_rng(42)
n = 500
sqft = rng.integers(600, 3500, size=n)
bedrooms = rng.integers(1, 6, size=n)
noise = rng.normal(0, 25_000, size=n)
price = 50_000 + 180 * sqft + 12_000 * bedrooms + noise
df = pd.DataFrame({"sqft": sqft, "bedrooms": bedrooms, "price": price})
X = df[["sqft", "bedrooms"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=0
)
model = LinearRegression()
model.fit(X_train, y_train)
print("intercept:", model.intercept_)
print("coefficients:", dict(zip(X.columns, model.coef_)))
A couple of things to notice. We split the data before fitting and train on the training portion only, so the test rows stay unseen for evaluation. The coef_ attribute gives a direct, interpretable estimate of how much the prediction changes when a feature increases by one unit and the other features are held fixed.
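One way to check how a coefficient should be read is to change a single feature by one unit and watch the prediction move. The sketch below rebuilds the same synthetic data and, for brevity, fits on all rows rather than a training split. The two example houses are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same synthetic recipe as above: price = 50_000 + 180*sqft + 12_000*bedrooms + noise.
rng = np.random.default_rng(42)
n = 500
sqft = rng.integers(600, 3500, size=n)
bedrooms = rng.integers(1, 6, size=n)
price = 50_000 + 180 * sqft + 12_000 * bedrooms + rng.normal(0, 25_000, size=n)

X = pd.DataFrame({"sqft": sqft, "bedrooms": bedrooms})
model = LinearRegression().fit(X, price)

# Two hypothetical houses that differ by exactly one square foot.
house = pd.DataFrame({"sqft": [1500], "bedrooms": [3]})
bigger = pd.DataFrame({"sqft": [1501], "bedrooms": [3]})
delta = model.predict(bigger)[0] - model.predict(house)[0]

print("sqft coefficient:   ", model.coef_[0])
print("change from +1 sqft:", delta)
```

The two printed numbers match, and because the data was generated with 180 dollars per square foot, the fitted coefficient should land close to that value.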
Evaluating: R-squared vs MAE
Two metrics get used most often for regression.
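R-squared, or the coefficient of determination, measures the proportion of variance in the target that the model explains. A value of 1.0 means a perfect fit, and 0.0 means the model does no better than always predicting the mean. On held-out data it can also be negative, which means the model does worse than that mean baseline.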
Mean Absolute Error reports the average size of the prediction error in the original units of the target. If MAE is 18,000 dollars, your typical prediction is off by about that much.
preds = model.predict(X_test)
print("R^2:", round(r2_score(y_test, preds), 3))
print("MAE:", round(mean_absolute_error(y_test, preds), 2))
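Use R-squared to compare different models on the same dataset. Use MAE to communicate error in business terms. If you want a metric that punishes large mistakes more heavily, use Root Mean Squared Error (RMSE). Like MAE it is expressed in the original units of the target, but because errors are squared before averaging, a few large misses raise it much more than they raise MAE.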
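To see what these metrics compute, here is a small sketch that derives MAE, RMSE, and R-squared by hand with NumPy and cross-checks them against scikit-learn. The four prices and predictions are made up, with one deliberately large miss.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices; the last prediction is off by 40,000.
y_true = np.array([200_000.0, 250_000.0, 300_000.0, 350_000.0])
y_pred = np.array([210_000.0, 240_000.0, 300_000.0, 390_000.0])

errors = y_true - y_pred

# MAE: average absolute error, in dollars.
mae = np.mean(np.abs(errors))

# RMSE: square, average, then take the root, so it is also in dollars.
rmse = np.sqrt(np.mean(errors ** 2))

# R-squared: 1 minus (model's squared error / squared error of predicting the mean).
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE: ", mae, mean_absolute_error(y_true, y_pred))
print("RMSE:", rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
print("R^2: ", r2, r2_score(y_true, y_pred))
```

RMSE comes out larger than MAE here because the single 40,000 dollar miss dominates once the errors are squared.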
Assumptions that matter
Linear regression has a small number of assumptions. Knowing them tells you when the model can be trusted.
The first is linearity. The relationship between each feature and the target should be approximately a straight line. If price grows with the square of square footage, a plain linear model under-fits unless you add a sqft_squared feature.
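A minimal sketch of that fix, assuming synthetic data where price really does grow with the square of square footage. The generating formula and its constants are invented for illustration, and the scores are computed on the training data only to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
sqft = rng.uniform(600, 3500, size=300)

# Assumed curved relationship, for illustration only.
price = 40_000 + 0.03 * sqft ** 2 + rng.normal(0, 10_000, size=300)

df = pd.DataFrame({"sqft": sqft})
df["sqft_squared"] = df["sqft"] ** 2  # the extra feature

plain = LinearRegression().fit(df[["sqft"]], price)
curved = LinearRegression().fit(df[["sqft", "sqft_squared"]], price)

plain_r2 = plain.score(df[["sqft"]], price)
curved_r2 = curved.score(df[["sqft", "sqft_squared"]], price)
print("R^2 with sqft only:        ", round(plain_r2, 3))
print("R^2 with sqft_squared added:", round(curved_r2, 3))
```

The model is still linear in its weights. Only the inputs changed, which is why the same LinearRegression class can fit a curve.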
The second is independent errors. Residuals on one row should not depend on residuals from another. Time series data routinely breaks this assumption and needs special handling.
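One simple, informal check is the correlation between each residual and the one before it. The sketch below uses a hypothetical helper, lag1_autocorrelation, on two simulated residual series: one independent and one where each error carries over most of the previous one, as often happens with time-ordered data.

```python
import numpy as np

def lag1_autocorrelation(residuals):
    """Correlation between each residual and the residual just before it."""
    r = np.asarray(residuals, dtype=float)
    return float(np.corrcoef(r[:-1], r[1:])[0, 1])

rng = np.random.default_rng(0)

# Independent errors: no row tells you anything about the next.
independent = rng.normal(size=1_000)

# Dependent errors: each one keeps 90% of the previous error plus fresh noise.
dependent = np.zeros(1_000)
for t in range(1, 1_000):
    dependent[t] = 0.9 * dependent[t - 1] + rng.normal()

independent_corr = lag1_autocorrelation(independent)
dependent_corr = lag1_autocorrelation(dependent)
print("independent:", round(independent_corr, 3))
print("dependent:  ", round(dependent_corr, 3))
```

A value near zero is consistent with independent errors, while a value far from zero suggests the rows are ordered in a way the model has not accounted for.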
The third is constant variance, sometimes called homoscedasticity. The spread of residuals should be roughly the same across the range of predictions. Plotting residuals against predicted values quickly reveals violations, which typically show up as a funnel shape where the spread widens or narrows.
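If you want a number rather than a plot, a rough check is to compare the residual spread for low and high predictions. The sketch below simulates data whose noise grows with the feature, which is an assumption made purely to trigger the violation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=1_000)

# Assumed data where the noise scale grows with x, violating constant variance.
y = 2.0 * x + rng.normal(0, 0.5 * x)

features = x.reshape(-1, 1)
model = LinearRegression().fit(features, y)
preds = model.predict(features)
residuals = y - preds

# Compare the residual spread in the lower and upper halves of the predictions.
cutoff = np.median(preds)
low_spread = residuals[preds <= cutoff].std()
high_spread = residuals[preds > cutoff].std()
print("spread for low predictions: ", round(low_spread, 2))
print("spread for high predictions:", round(high_spread, 2))
```

Two clearly different spreads are the numeric counterpart of the widening pattern you would see in the residual plot.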
Finally, the features should not be strongly correlated with one another. This problem, called multicollinearity, hurts interpretability. If two features carry nearly the same information, the model can split the coefficient between them in unstable ways. Compute pairwise correlations between features before reading too much into a single coefficient.
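A quick way to run that check is the pandas correlation matrix. In this sketch the sq_metres column is a hypothetical near-duplicate of sqft, added only to show what a problematic pair looks like.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sqft = rng.uniform(600, 3500, size=500)

df = pd.DataFrame({
    "sqft": sqft,
    # Hypothetical near-duplicate: the same area in square metres plus small noise.
    "sq_metres": sqft * 0.0929 + rng.normal(0, 2, size=500),
    "bedrooms": rng.integers(1, 6, size=500),
})

# Pairwise Pearson correlations between all feature columns.
corr = df.corr()
print(corr.round(2))
```

A pair with a correlation close to 1 or -1 is a candidate for dropping one of the two columns before you interpret the coefficients.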
When linear regression fails
If your evaluation metrics are poor on both training and test sets, the model is under-fitting. Possible fixes include polynomial features, log transforms of skewed columns, or moving to a non-linear model such as a gradient boosted tree.
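As an illustration of the log-transform option, the sketch below applies np.log1p to a simulated right-skewed column and compares skewness before and after. The lot_size column and its distribution are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed column: most values are modest, a few are huge.
lot_size = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1_000))

# log1p computes log(1 + x), which also behaves well if a value is zero.
log_lot_size = np.log1p(lot_size)

raw_skew = lot_size.skew()
log_skew = log_lot_size.skew()
print("skew before:", round(raw_skew, 2))
print("skew after: ", round(log_skew, 2))
```

Whether the transformed column actually improves the fit still has to be confirmed with your evaluation metrics.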
If training error is low but test error is high, you are over-fitting, which often happens when you have many features relative to rows. Ridge regression and Lasso add regularisation that penalises large coefficients and tends to help in this regime.
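The sketch below sets up that regime on purpose: 60 rows, 40 features, and only three features that matter. The data is synthetic, and alpha=10.0 is an arbitrary value picked for illustration, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows, n_features = 60, 40
X = rng.normal(size=(n_rows, n_features))

# Only the first three features influence the target; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(0, 1.0, size=n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)  # alpha sets the penalty strength

for name, model in [("OLS", ols), ("Ridge", ridge)]:
    print(
        name,
        "train R^2:", round(model.score(X_train, y_train), 3),
        "test R^2:", round(model.score(X_test, y_test), 3),
        "coef norm:", round(float(np.linalg.norm(model.coef_)), 3),
    )
```

The ridge coefficients are smaller in overall size than the OLS ones. Whether that improves the test score depends on the data and on alpha, so in practice alpha is usually tuned with cross-validation.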
If a single coefficient looks unreasonable, check whether the feature is highly correlated with another or whether the column contains outliers that drag the line away from the bulk of the data.
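To show how much a single outlier can move the line, this sketch fits the same ten made-up points twice, once clean and once with the last target replaced by an extreme value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Ten made-up points that lie exactly on y = 2*x + 1.
x = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

clean_slope = LinearRegression().fit(x, y).coef_[0]

# Replace the last target (originally 21) with one extreme value.
y_outlier = y.copy()
y_outlier[-1] = 100.0
outlier_slope = LinearRegression().fit(x, y_outlier).coef_[0]

print("slope without outlier:", round(clean_slope, 3))
print("slope with outlier:   ", round(outlier_slope, 3))
```

Because OLS squares the residuals, one far-off row at the edge of the feature range is enough to pull the slope well away from the value that describes the other nine.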
Wrap up
Linear regression is small, fast, and clear. It gives you a baseline error number, a set of interpretable coefficients, and a starting point for diagnosing whether the data has the shape your modelling assumes. Train it first, evaluate with the right metrics, and then decide whether the gains from a heavier model are worth the loss of interpretability.