CONTENTS

Gradient Descent

Gradient descent is one of the most important ideas in modern optimization and machine learning.
It’s not a model like linear regression or random forest — it’s the engine that trains many models by minimizing a loss function.

If you understand gradient descent deeply, you understand:

  • how models “learn,”
  • why learning rates matter,
  • why training can be unstable,
  • and why techniques like momentum, Adam, and normalization exist.

1) What gradient descent is (in one sentence)

Gradient descent is an iterative optimization method that minimizes a function by repeatedly moving in the direction of the negative gradient (the steepest downhill direction).

2) The core intuition (feel it before the math)

Imagine you’re standing on a hilly landscape in fog and your goal is to reach the lowest point.

  • The gradient tells you the direction of steepest uphill.
  • So the negative gradient tells you the direction of steepest downhill.
  • You take a step downhill, then repeat.

Two things determine what happens:

  1. Direction (negative gradient)
  2. Step size (learning rate)

Too small → you crawl.
Too big → you bounce around or explode.


3) The math definition

Let your objective function be:

minθ  J(θ)\min_{\boldsymbol{\theta}} \; J(\boldsymbol{\theta})

where θ\boldsymbol{\theta} is a vector of parameters (weights).

The gradient is:

J(θ)=[Jθ1Jθd]\nabla J(\boldsymbol{\theta}) = \begin{bmatrix} \frac{\partial J}{\partial \theta_1}\\ \vdots\\ \frac{\partial J}{\partial \theta_d} \end{bmatrix}
Gradient descent update rule:
θt+1=θtηJ(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla J(\boldsymbol{\theta}_t)
  • η\eta is the learning rate (step size)
  • tt is the iteration number

4) A concrete example: linear regression with MSE

Suppose:

y^=Xβ\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}

Loss (MSE):

J(β)=1nyXβ22J(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_2^2

Gradient:

βJ=2nX(yXβ)\nabla_{\boldsymbol{\beta}} J = -\frac{2}{n}\mathbf{X}^\top(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})

Update:

ββ+η2nX(yXβ)\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} + \eta \frac{2}{n}\mathbf{X}^\top(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})

This shows a helpful interpretation:

  • the term (yXβ)(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) is the residual,
  • X(residual)\mathbf{X}^\top(\text{residual}) measures correlation between features and errors,
  • you update weights to reduce those errors.

5) Learning rate: the one knob that can break everything

5.1 Too small

  • slow convergence
  • looks stable but takes forever

5.2 Too large

  • oscillation around minimum
  • divergence (loss explodes)

5.3 Practical learning rate strategy

  • start with a moderate value
  • reduce when loss plateaus
  • or use learning rate schedules / adaptive optimizers

6) Variants of gradient descent (batch, SGD, mini-batch)

Assume objective is a sum over samples:

J(θ)=1ni=1ni(θ)J(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n \ell_i(\boldsymbol{\theta})

6.1 Batch Gradient Descent

Uses full dataset each step:

θt+1=θtηJ(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla J(\boldsymbol{\theta}_t)

Pros:

  • stable, smooth loss curve

Cons:

  • expensive for large datasets

6.2 Stochastic Gradient Descent (SGD)

Uses one random sample ii:

θt+1=θtηi(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla \ell_i(\boldsymbol{\theta}_t)

Pros:

  • fast updates, good for massive/streaming data

Cons:

  • noisy updates, loss fluctuates

6.3 Mini-batch Gradient Descent (most common)

Uses a batch BB of size mm:

θt+1=θtη1miBi(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \frac{1}{m}\sum_{i\in B} \nabla \ell_i(\boldsymbol{\theta}_t)

Pros:

  • efficient on GPUs
  • less noisy than SGD
  • scalable

Cons:

  • still needs tuning

7) Convergence: when does gradient descent work?

7.1 Convex vs non-convex

  • If J(θ)J(\boldsymbol{\theta}) is convex, any local minimum is global.
  • If it’s non-convex (deep nets), GD may find local minima or saddle points.

7.2 Strong convexity and smoothness (intuition)

Convergence is easiest when:

  • gradients don’t change too violently (smoothness)
  • there’s a clear curvature toward the minimum (strong convexity)

In practice:

  • feature scaling improves smoothness
  • regularization can help conditioning

8) Conditioning: why scaling matters so much

If your loss contours are long and skinny (like a stretched ellipse), gradient descent zig-zags.

This happens when features have different scales or strong correlations.

8.1 What it looks like

  • one direction is steep, another is shallow
  • GD bounces across steep walls and slowly progresses along shallow direction

8.2 Fixes

  • standardize features
  • use momentum or adaptive methods
  • use second-order methods (Newton, quasi-Newton) when feasible

9) Momentum: “roll downhill with inertia”

Momentum accumulates a running velocity:

vt+1=μvtηJ(θt)\mathbf{v}_{t+1} = \mu \mathbf{v}_t - \eta \nabla J(\boldsymbol{\theta}_t) θt+1=θt+vt+1\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbf{v}_{t+1}
  • μ\mu is momentum (e.g., 0.9)
  • helps accelerate along consistent directions
  • reduces zig-zagging

10) Adaptive optimizers (AdaGrad, RMSProp, Adam)

These change the learning rate per parameter based on gradient history.

10.1 AdaGrad

Accumulates squared gradients:

Gt=τ=1tgτ2\mathbf{G}_t = \sum_{\tau=1}^t \mathbf{g}_\tau^2

Update per component:

θt+1,j=θt,jηGt,j+ϵgt,j\theta_{t+1,j} = \theta_{t,j} - \frac{\eta}{\sqrt{G_{t,j}}+\epsilon} g_{t,j}

Good for sparse features, but learning rate can decay too much.

10.2 RMSProp

Uses an exponential moving average of squared gradients.

Adam combines momentum + RMSProp-style scaling.

Conceptually:

  • keep moving average of gradients (first moment)
  • keep moving average of squared gradients (second moment)
  • bias-correct them
  • update with normalized step

Adam is a strong default for deep learning, but SGD with momentum can generalize better in some settings.


11) Learning rate schedules (how training is often made stable)

Common schedules:

  • step decay
  • exponential decay
  • cosine annealing
  • warmup then decay

General idea:

  • start with larger learning rate to explore
  • reduce later to fine-tune near optimum

12) Practical stopping criteria

You usually stop when:

  • validation loss stops improving (early stopping)
  • gradient norm becomes small: J(θ)\|\nabla J(\boldsymbol{\theta})\| is near 0
  • parameter changes become tiny
  • max iterations reached

Early stopping is also a form of regularization.


13) Pros and cons

13.1 Pros

  • Scales to huge datasets
  • Simple and widely applicable
  • Works for many model families (linear models, neural nets, logistic regression)
  • Easy to implement and parallelize (mini-batch)

13.2 Cons

  • Sensitive to learning rate and hyperparameters
  • Can be slow on ill-conditioned problems
  • Can get stuck/slow at saddle points in non-convex optimization
  • Requires careful validation to avoid overfitting

14) When to use gradient descent (and when not to)

14.1 Use gradient descent when

  • you can compute gradients efficiently
  • dataset is large enough that closed-form solutions are expensive
  • you’re training neural networks or large-scale linear models
  • you want an online/streaming learning setup

14.2 Avoid (or reconsider) gradient descent when

  • the problem has a closed-form solution and the dataset is small (e.g., OLS with QR/SVD)
  • you can use faster second-order/quasi-Newton methods (L-BFGS) for medium-sized convex problems
  • gradients are noisy and the optimization landscape is poorly conditioned without preprocessing

15) Practical projects to practice (Kaggle-friendly + from-scratch)

Project A: Implement linear regression with gradient descent (from scratch)

Goal:

  • fit β\boldsymbol{\beta} using batch GD and compare to closed-form OLS

Practice:

  • learning rate tuning
  • convergence plots (loss vs iterations)
  • effect of feature scaling
Dataset keyword: house prices or any simple regression dataset

Deliverables:

  • GD vs closed-form coefficients
  • RMSE comparison
  • learning rate sensitivity experiment

Project B: Logistic regression with SGD (from scratch)

Goal:

  • implement binary classification with SGD updates

Practice:

  • mini-batch vs SGD stability
  • decision boundary visualization (2D toy dataset)
  • early stopping
Dataset keyword: titanic or breast cancer

Deliverables:

  • training curve, validation curve
  • accuracy/F1/ROC-AUC

Project C: Gradient descent path visualization (2D function)

Goal:

  • visualize GD trajectories on a convex bowl and an ill-conditioned ellipse

Practice:

  • momentum effect
  • step size effect
  • conditioning intuition

Deliverables:

  • plots of path, loss curves

Project D: Compare optimizers on a small neural net

Goal:

  • train a tiny MLP on digits or fashion MNIST (if you have it locally)

Practice:

  • SGD vs Adam
  • learning rate schedules
  • generalization discussion

Deliverables:

  • validation accuracy comparison
  • training stability notes

16) Common pitfalls (most real issues)

16.1 Not scaling features (for linear/logistic regression)

This makes the objective ill-conditioned, causing zig-zags and slow convergence.

16.2 Using one learning rate forever

You often need decay schedules or adaptive optimizers.

16.3 No validation monitoring

Training loss can keep going down while validation loss goes up (overfitting).

16.4 Batch size extremes

  • too small: noisy gradients, unstable training
  • too large: slow updates, may generalize worse, needs large memory

17) Extensions worth knowing

17.1 Newton’s method (second-order)

Uses Hessian 2J(θ)\nabla^2 J(\boldsymbol{\theta}):

θt+1=θt[2J(θt)]1J(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - [\nabla^2 J(\boldsymbol{\theta}_t)]^{-1}\nabla J(\boldsymbol{\theta}_t)

Very fast near optimum for convex problems, but Hessian is expensive in high dimensions.

17.2 Quasi-Newton (L-BFGS)

Approximates Hessian efficiently; great for medium-scale convex optimization.

17.3 Conjugate gradient

Efficient for large sparse linear systems, related to quadratic objectives.


18) Minimal mental model

  • Gradient descent is “walk downhill using the slope.”
  • Learning rate controls stability vs speed.
  • Mini-batches give scalable updates.
  • Momentum reduces zig-zagging.
  • Adaptive optimizers scale step sizes per parameter.
  • Feature scaling improves conditioning and makes GD behave.