Gradient Descent

Gradient descent is one of the most important ideas in modern optimization and machine learning.
It’s not a model like linear regression or random forest — it’s the engine that trains many models by minimizing a loss function.

If you understand gradient descent deeply, you understand:

how models “learn,”
why learning rates matter,
why training can be unstable,
and why techniques like momentum, Adam, and normalization exist.

1) What gradient descent is (in one sentence)

Gradient descent is an iterative optimization method that minimizes a function by repeatedly moving in the direction of the negative gradient (the steepest downhill direction).

2) The core intuition (feel it before the math)

Imagine you’re standing on a hilly landscape in fog and your goal is to reach the lowest point.

The gradient tells you the direction of steepest uphill.
So the negative gradient tells you the direction of steepest downhill.
You take a step downhill, then repeat.

Two things determine what happens:

Direction (negative gradient)
Step size (learning rate)

Too small → you crawl.
Too big → you bounce around or explode.

3) The math definition

Let your objective function be:

\min_{\boldsymbol{\theta}} \; J(\boldsymbol{\theta})

where $\boldsymbol{\theta}$ is a vector of parameters (weights).

The gradient is:

\nabla J(\boldsymbol{\theta}) = \begin{bmatrix} \frac{\partial J}{\partial \theta_1}\\ \vdots\\ \frac{\partial J}{\partial \theta_d} \end{bmatrix}

Gradient descent update rule:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla J(\boldsymbol{\theta}_t)

$\eta$ is the learning rate (step size)
$t$ is the iteration number

4) A concrete example: linear regression with MSE

Suppose:

\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}

Loss (MSE):

J(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_2^2

Gradient:

\nabla_{\boldsymbol{\beta}} J = -\frac{2}{n}\mathbf{X}^\top(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})

Update:

\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} + \eta \frac{2}{n}\mathbf{X}^\top(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})

This shows a helpful interpretation:

the term $(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})$ is the residual,
$\mathbf{X}^\top(\text{residual})$ measures correlation between features and errors,
you update weights to reduce those errors.

5) Learning rate: the one knob that can break everything

5.1 Too small

slow convergence
looks stable but takes forever

5.2 Too large

oscillation around minimum
divergence (loss explodes)

5.3 Practical learning rate strategy

start with a moderate value
reduce when loss plateaus
or use learning rate schedules / adaptive optimizers

6) Variants of gradient descent (batch, SGD, mini-batch)

Assume objective is a sum over samples:

J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n \ell_i(\boldsymbol{\theta})

6.1 Batch Gradient Descent

Uses full dataset each step:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla J(\boldsymbol{\theta}_t)

Pros:

stable, smooth loss curve

Cons:

expensive for large datasets

6.2 Stochastic Gradient Descent (SGD)

Uses one random sample $i$ :

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla \ell_i(\boldsymbol{\theta}_t)

Pros:

fast updates, good for massive/streaming data

Cons:

noisy updates, loss fluctuates

6.3 Mini-batch Gradient Descent (most common)

Uses a batch $B$ of size $m$ :

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \frac{1}{m}\sum_{i\in B} \nabla \ell_i(\boldsymbol{\theta}_t)

Pros:

efficient on GPUs
less noisy than SGD
scalable

Cons:

still needs tuning

7) Convergence: when does gradient descent work?

7.1 Convex vs non-convex

If $J(\boldsymbol{\theta})$ is convex, any local minimum is global.
If it’s non-convex (deep nets), GD may find local minima or saddle points.

7.2 Strong convexity and smoothness (intuition)

Convergence is easiest when:

gradients don’t change too violently (smoothness)
there’s a clear curvature toward the minimum (strong convexity)

In practice:

feature scaling improves smoothness
regularization can help conditioning

8) Conditioning: why scaling matters so much

If your loss contours are long and skinny (like a stretched ellipse), gradient descent zig-zags.

This happens when features have different scales or strong correlations.

8.1 What it looks like

one direction is steep, another is shallow
GD bounces across steep walls and slowly progresses along shallow direction

8.2 Fixes

standardize features
use momentum or adaptive methods
use second-order methods (Newton, quasi-Newton) when feasible

9) Momentum: “roll downhill with inertia”

Momentum accumulates a running velocity:

\mathbf{v}_{t+1} = \mu \mathbf{v}_t - \eta \nabla J(\boldsymbol{\theta}_t)

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{v}_{t+1}

$\mu$ is momentum (e.g., 0.9)
helps accelerate along consistent directions
reduces zig-zagging

10) Adaptive optimizers (AdaGrad, RMSProp, Adam)

These change the learning rate per parameter based on gradient history.

10.1 AdaGrad

Accumulates squared gradients:

\mathbf{G}_t = \sum_{\tau=1}^t \mathbf{g}_\tau^2

Update per component:

\theta_{t+1,j} = \theta_{t,j} - \frac{\eta}{\sqrt{G_{t,j}}+\epsilon} g_{t,j}

Good for sparse features, but learning rate can decay too much.

10.2 RMSProp

Uses an exponential moving average of squared gradients.

10.3 Adam (most popular)

Adam combines momentum + RMSProp-style scaling.

Conceptually:

keep moving average of gradients (first moment)
keep moving average of squared gradients (second moment)
bias-correct them
update with normalized step

Adam is a strong default for deep learning, but SGD with momentum can generalize better in some settings.

11) Learning rate schedules (how training is often made stable)

Common schedules:

step decay
exponential decay
cosine annealing
warmup then decay

General idea:

start with larger learning rate to explore
reduce later to fine-tune near optimum

12) Practical stopping criteria

You usually stop when:

validation loss stops improving (early stopping)
gradient norm becomes small: $\|\nabla J(\boldsymbol{\theta})\|$ is near 0
parameter changes become tiny
max iterations reached

Early stopping is also a form of regularization.

13) Pros and cons

13.1 Pros

Scales to huge datasets
Simple and widely applicable
Works for many model families (linear models, neural nets, logistic regression)
Easy to implement and parallelize (mini-batch)

13.2 Cons

Sensitive to learning rate and hyperparameters
Can be slow on ill-conditioned problems
Can get stuck/slow at saddle points in non-convex optimization
Requires careful validation to avoid overfitting

14) When to use gradient descent (and when not to)

14.1 Use gradient descent when

you can compute gradients efficiently
dataset is large enough that closed-form solutions are expensive
you’re training neural networks or large-scale linear models
you want an online/streaming learning setup

14.2 Avoid (or reconsider) gradient descent when

the problem has a closed-form solution and the dataset is small (e.g., OLS with QR/SVD)
you can use faster second-order/quasi-Newton methods (L-BFGS) for medium-sized convex problems
gradients are noisy and the optimization landscape is poorly conditioned without preprocessing

15) Practical projects to practice (Kaggle-friendly + from-scratch)

Project A: Implement linear regression with gradient descent (from scratch)

Goal:

fit $\boldsymbol{\beta}$ using batch GD and compare to closed-form OLS

Practice:

learning rate tuning
convergence plots (loss vs iterations)
effect of feature scaling

Dataset keyword: house prices or any simple regression dataset

Deliverables:

GD vs closed-form coefficients
RMSE comparison
learning rate sensitivity experiment

Project B: Logistic regression with SGD (from scratch)

Goal:

implement binary classification with SGD updates

Practice:

mini-batch vs SGD stability
decision boundary visualization (2D toy dataset)
early stopping

Dataset keyword: titanic or breast cancer

Deliverables:

training curve, validation curve
accuracy/F1/ROC-AUC

Project C: Gradient descent path visualization (2D function)

Goal:

visualize GD trajectories on a convex bowl and an ill-conditioned ellipse

Practice:

momentum effect
step size effect
conditioning intuition

Deliverables:

plots of path, loss curves

Project D: Compare optimizers on a small neural net

Goal:

train a tiny MLP on digits or fashion MNIST (if you have it locally)

Practice:

SGD vs Adam
learning rate schedules
generalization discussion

Deliverables:

validation accuracy comparison
training stability notes

16) Common pitfalls (most real issues)

16.1 Not scaling features (for linear/logistic regression)

This makes the objective ill-conditioned, causing zig-zags and slow convergence.

16.2 Using one learning rate forever

You often need decay schedules or adaptive optimizers.

16.3 No validation monitoring

Training loss can keep going down while validation loss goes up (overfitting).

16.4 Batch size extremes

too small: noisy gradients, unstable training
too large: slow updates, may generalize worse, needs large memory

17) Extensions worth knowing

17.1 Newton’s method (second-order)

Uses Hessian $\nabla^2 J(\boldsymbol{\theta})$ :

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - [\nabla^2 J(\boldsymbol{\theta}_t)]^{-1}\nabla J(\boldsymbol{\theta}_t)

Very fast near optimum for convex problems, but Hessian is expensive in high dimensions.

17.2 Quasi-Newton (L-BFGS)

Approximates Hessian efficiently; great for medium-scale convex optimization.

17.3 Conjugate gradient

Efficient for large sparse linear systems, related to quadratic objectives.

18) Minimal mental model

Gradient descent is “walk downhill using the slope.”
Learning rate controls stability vs speed.
Mini-batches give scalable updates.
Momentum reduces zig-zagging.
Adaptive optimizers scale step sizes per parameter.
Feature scaling improves conditioning and makes GD behave.