ML Gradient Descent Explained

Intermediate 10 min read

What you'll learn

✓ What a gradient really is
✓ How descent steps work
✓ Batch vs stochastic vs mini-batch
✓ Common failure modes
✓ How learning rate shapes training

Prerequisites

• Basic calculus intuition

Gradient descent is the single idea that ties almost every modern ML model together. Linear regression, logistic regression, neural networks, and the large transformers that power chat assistants all rely on it. This post explains it from the ground up without drowning in math.

What gradient descent really is

A model has parameters, like the weights of a neural network. A loss function measures how wrong the model’s predictions are on training data. The goal is to find parameter values that make the loss small.

Gradient descent does this by repeatedly nudging the parameters in the direction that decreases the loss fastest. That direction is given by the negative gradient of the loss with respect to the parameters. You take a step, recompute the gradient, take another step, and so on until you stop improving.

Mental model

Imagine the loss as a landscape and the parameters as your coordinates. The gradient tells you which way is uphill. The negative gradient points downhill. You walk a short distance downhill, look around, and walk downhill again. The size of each step is the learning rate.

current params w
 |
 v
compute loss L(w)
 |
 v
compute gradient g = dL/dw
 |
 v
update: w := w - lr * g
 |
 v
repeat until converged

One step of gradient descent
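
To make the loop in the diagram concrete, here is a minimal sketch on a one-parameter loss, L(w) = (w - 3)^2, whose gradient is 2(w - 3). The loss, the starting point, and the learning rate are illustrative choices for this sketch.

```python
def loss_grad(w):
    # Gradient of the toy loss L(w) = (w - 3)**2
    return 2.0 * (w - 3.0)

w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (step size)

for step in range(100):
    g = loss_grad(w)   # compute gradient
    w = w - lr * g     # update: w := w - lr * g

print(round(w, 4))
```

Each step shrinks the distance to the minimum at w = 3 by a constant factor, so w should end up very close to 3.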

Hands-on example

Here is gradient descent fitting a line to data using only NumPy, with no ML framework.

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.1, 8.0, 10.2])

w, b = 0.0, 0.0
lr = 0.01

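for step in range(2000):
    pred = w * X + b               # model prediction for every sample
    err = pred - y                 # signed error per sample
    grad_w = (2 * err * X).mean()  # d(MSE)/dw
    grad_b = (2 * err).mean()      # d(MSE)/db
    w -= lr * grad_w               # step against the gradient
    b -= lr * grad_b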

print(f"w={w:.3f}, b={b:.3f}")

After 2,000 steps, w lands near 2.03 and b near -0.03, close to the least-squares fit for this data, which roughly follows y = 2x. The gradient was computed analytically from the mean squared error. Real frameworks like PyTorch or JAX use automatic differentiation (autograd) to compute gradients for any model you build.
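
When you derive a gradient by hand, as in the example above, a common sanity check is to compare it against a finite-difference estimate. The sketch below does that for the same data. The helper names (mse, analytic_grad, numeric_grad) and the test point are made up for illustration.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.1, 8.0, 10.2])

def mse(w, b):
    return ((w * X + b - y) ** 2).mean()

def analytic_grad(w, b):
    # Same formulas as in the training loop
    err = w * X + b - y
    return (2 * err * X).mean(), (2 * err).mean()

def numeric_grad(w, b, eps=1e-6):
    # Central differences: nudge one parameter at a time
    gw = (mse(w + eps, b) - mse(w - eps, b)) / (2 * eps)
    gb = (mse(w, b + eps) - mse(w, b - eps)) / (2 * eps)
    return gw, gb

ga = analytic_grad(0.5, -1.0)
gn = numeric_grad(0.5, -1.0)
print(ga, gn)
```

If the two pairs disagree by more than a small rounding error, the hand-written gradient is the first place to look.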

There are three common variants. Batch gradient descent uses all training data per step, which gives the exact gradient of the training loss but is expensive on large datasets. Stochastic gradient descent uses one sample at a time, which is noisy but cheap per step. Mini-batch gradient descent uses a small group, like 32 or 256 samples, and is the standard choice in practice.
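
The three variants differ only in how many samples feed each update, so one function can cover all of them. This is a sketch on the line-fitting data from above. The function name fit_line and its defaults are assumptions for illustration.

```python
import numpy as np

def fit_line(X, y, batch_size, lr=0.01, epochs=2000, seed=0):
    """Fit y ~ w*x + b by gradient descent on the mean squared error.

    batch_size == len(X) is batch gradient descent,
    batch_size == 1 is stochastic gradient descent,
    anything in between is mini-batch.
    """
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * X[idx] + b - y[idx]
            w -= lr * (2 * err * X[idx]).mean()
            b -= lr * (2 * err).mean()
    return w, b

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.1, 8.0, 10.2])

for bs in (5, 2, 1):
    w, b = fit_line(X, y, batch_size=bs)
    print(f"batch_size={bs}: w={w:.3f}, b={b:.3f}")
```

On a dataset this small all three should land close to the same line. The smaller batch sizes take more, noisier steps per pass over the data.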

Trade-offs

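A large learning rate trains fast but can overshoot the minimum or diverge entirely. A small learning rate is stable but can take far longer to converge. Schedules help: a short warmup that ramps the rate up from a small value, followed by a decay such as cosine or simple step decay, so that steps are large while there is progress to make and shrink as training settles.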
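
As a sketch of one common schedule shape, the function below combines linear warmup with cosine decay. The name lr_at and the default values are hypothetical. Frameworks ship their own schedulers with their own parameters.

```python
import math

def lr_at(step, total_steps, base_lr=0.1, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly from a small value to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 50, 99, 100, 550, 1000):
    print(s, round(lr_at(s, total_steps=1000), 4))
```

In a training loop you would call lr_at(step, total_steps) once per step and use the result in place of a fixed lr.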

Batch size affects both speed and generalization. Larger batches use hardware better, but very large batches have often been reported to settle in sharper minima that generalize worse, though this depends on how the learning rate is tuned. Smaller batches are noisier, which can help escape bad regions but slows wall-clock progress.

Plain gradient descent struggles with ravines in the loss surface, where one direction is much steeper than another. Momentum-based and adaptive optimizers such as SGD with momentum, Adam, and AdamW smooth out the updates and usually converge much faster on real problems. Adam or AdamW is a common default for deep learning workloads.
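
To see the ravine effect, here is a small sketch on a toy quadratic that is 100 times steeper in one direction than the other. It compares plain gradient descent with classic (heavy-ball) momentum at the same learning rate. The loss, the constants, and the function names are illustrative assumptions, and this is not how Adam works internally.

```python
import numpy as np

CURVATURE = np.array([100.0, 1.0])  # steep in w[0], nearly flat in w[1]

def loss(w):
    return 0.5 * np.sum(CURVATURE * w ** 2)

def grad(w):
    return CURVATURE * w

def run(beta, lr=0.019, steps=200):
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)  # running sum of past gradients; beta=0 is plain GD
        w = w - lr * v
    return loss(w)

plain_loss = run(beta=0.0)
momentum_loss = run(beta=0.9)
print(f"plain: {plain_loss:.2e}, momentum: {momentum_loss:.2e}")
```

The learning rate is capped by the steep direction, so plain gradient descent crawls along the flat one. Momentum accumulates velocity there and should finish at a much lower loss.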

Local minima are less of a problem than people fear, especially in high-dimensional models. Saddle points and flat regions are usually the bigger trouble.

Practical tips

Always normalize your inputs. Features with very different scales create elongated loss surfaces where one direction is steep and another flat. Gradient descent zigzags through these and converges slowly.
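
One way to put a number on "elongated" is the condition number of X.T @ X, which for linear regression with squared error is proportional to the curvature of the loss. The sketch below uses two made-up features on very different scales and standardizes them to zero mean and unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical features: one with spread ~1, one with spread ~1000
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

def condition_number(X):
    # Ratio of the largest to smallest curvature of the squared-error loss
    return np.linalg.cond(X.T @ X)

# Standardize each column: subtract its mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(f"raw: {condition_number(X):.1f}")
print(f"standardized: {condition_number(X_std):.1f}")
```

A condition number near 1 means the loss surface is close to round, so a single learning rate suits every direction.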

Watch the training loss curve. If it spikes or oscillates, the learning rate is the first suspect and is probably too high. If it barely moves, it is probably too low. Loss curves are the first thing to inspect when training misbehaves.
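
Both symptoms are easy to reproduce on the line-fitting data from the hands-on example. This sketch records the loss at every step for three learning rates. The specific values are chosen for illustration.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.1, 8.0, 10.2])

def loss_curve(lr, steps=50):
    w, b = 0.0, 0.0
    losses = []
    for _ in range(steps):
        err = w * X + b - y
        losses.append((err ** 2).mean())  # record loss before the update
        w -= lr * (2 * err * X).mean()
        b -= lr * (2 * err).mean()
    return losses

for lr in (0.1, 0.01, 0.00001):
    curve = loss_curve(lr)
    print(f"lr={lr}: first={curve[0]:.3g}, last={curve[-1]:.3g}")
```

With the largest rate the loss should blow up, with the smallest it should hardly change, and the middle one should drop quickly.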

Use gradient clipping for recurrent and transformer models. Large gradients can blow up updates and destabilize training. Clipping the norm to a fixed value, like 1.0, prevents this.
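
Clipping by norm is only a few lines. Below is a NumPy sketch with a hypothetical helper, clip_by_norm. In practice you would use your framework's built-in, for example torch.nn.utils.clip_grad_norm_ in PyTorch.

```python
import numpy as np

def clip_by_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1 when the norm is already small enough, so small gradients pass through
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # combined norm is 13
clipped = clip_by_norm(grads, max_norm=1.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(clipped_norm)
```

Note that clipping shrinks all gradients by the same factor, so the direction of the update is preserved and only its length changes.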

Set a seed for reproducibility, but do not chase tiny differences. Two runs with different seeds will land at slightly different losses. Compare ranges, not single numbers.
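
Here is a toy illustration of both points. The function noisy_final_loss is a made-up stand-in for a training run: noisy gradient descent on L(w) = w^2 where the seed controls both the initialization and the gradient noise.

```python
import numpy as np

def noisy_final_loss(seed, steps=200, lr=0.05):
    """Toy stand-in for a training run: SGD on L(w) = w**2 with noisy gradients."""
    rng = np.random.default_rng(seed)
    w = rng.normal()                       # seed-dependent initialization
    for _ in range(steps):
        g = 2 * w + rng.normal(scale=0.5)  # true gradient plus noise
        w -= lr * g
    return w ** 2

# Same seed, same result
print(noisy_final_loss(0) == noisy_final_loss(0))

# Different seeds: report a range rather than a single number
losses = [noisy_final_loss(seed) for seed in range(5)]
print(f"min={min(losses):.4f}, max={max(losses):.4f}")
```

If a change to your model moves the loss by less than the spread across seeds, it is hard to say the change did anything.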

When debugging, overfit a tiny subset first. If your model cannot drive the loss near zero on 16 samples, there is a bug somewhere in the model, the data, or the loss. Solve that before scaling up.
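
The same check in miniature: a linear model with more parameters than samples should be able to memorize 16 random targets. The data here is synthetic and the sizes are arbitrary. With a real model you would slice 16 examples from your training set instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tiny = rng.normal(size=(16, 64))  # 16 samples, 64 features: easy to memorize
y_tiny = rng.normal(size=16)        # random targets, so only memorization fits them

w = np.zeros(64)
lr = 0.05
for step in range(500):
    err = X_tiny @ w - y_tiny
    grad = 2 * X_tiny.T @ err / len(y_tiny)  # gradient of the mean squared error
    w -= lr * grad

final_loss = np.mean((X_tiny @ w - y_tiny) ** 2)
print(f"final loss on 16 samples: {final_loss:.2e}")
```

A loss that stalls well above zero in a setup like this points to a bug in the gradient, the data pipeline, or the loss itself.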

Wrap-up

Gradient descent is short to describe and deep to use. The core idea is to step downhill on the loss landscape, but the practical details, like learning rate, batch size, and optimizer choice, decide whether training is fast and stable or slow and broken. Build the intuition first, then trust your favorite framework to handle the calculus for you.