
Gradient Descent

Gradient descent is a widely used algorithm in machine learning, playing a key role in
tasks ranging from optimizing linear regression models to training complex neural
networks. Its primary objective is to minimize a given function, regardless of the number
of parameters involved.

Gradient Descent in 2D (animation):
https://upload.wikimedia.org/wikipedia/commons/transcoded/4/4c/Gradient_Descent_in_2D.webm/Gradient_Descent_in_2D.webm.720p.vp9.webm

Defining the Model

Understanding Gradient Descent

Gradient descent is a systematic method for finding the parameter values (w and b) that minimize the cost function J(w, b).

It is applicable not only to linear regression but also to more complex models, including deep learning.

Source: Khan Academy

Process of Gradient Descent


Start with initial guesses for the parameters, often set to 0 (w = 0, b = 0).

Iteratively adjust parameters in the direction of steepest descent to reduce the cost function until reaching
a minimum.

At a high level, the algorithm works as follows:

Start with some parameters w, b.

Compute the gradient of the cost function at the current parameters.

Keep changing the values of w and b to reduce the cost function J(w, b).

Continue until we settle at or near a minimum. Note that some functions may have more than one minimum.
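The steps above can be sketched in code. This is a minimal illustration, not an implementation from the source: `grad_fn` is an assumed helper that returns the pair (∂J/∂w, ∂J/∂b) at the current parameters.

```python
# Hypothetical sketch of the high-level gradient descent loop.
# grad_fn is assumed to return (dJ/dw, dJ/db) for the current parameters.

def gradient_descent(grad_fn, w=0.0, b=0.0, alpha=0.01, num_iters=1000):
    """Repeatedly step the parameters in the direction of steepest descent."""
    for _ in range(num_iters):
        dj_dw, dj_db = grad_fn(w, b)   # slope of the cost at (w, b)
        w = w - alpha * dj_dw          # step downhill in w
        b = b - alpha * dj_db          # step downhill in b
    return w, b
```

With a convex cost such as J(w, b) = (w − 3)² + (b + 1)², the loop settles near the minimum at (3, −1).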

💡 Example: an analogy is used to explain gradient descent through a hilly landscape

Imaginary Hill: You stand on a hill, with high points as hills and low points as valleys.

Finding the Path: You look around to find the steepest descent and take a small step in that
direction.

Iterative Process: Repeat this process until you reach the valley (local minimum).

This example illustrates how gradient descent works by continuously adjusting your position based
on the steepest descent until the minimum cost is found.

Local Minima in Gradient Descent


Different starting points can lead to different local
minima, meaning the algorithm may converge to various
solutions based on initial values.

This property highlights the importance of the starting point in the optimization process.

Local Minima

Defining Algorithm
The gradient descent algorithm repeats the following updates until convergence, for each parameter:

w = w − α · ∂J(w, b)/∂w

b = b − α · ∂J(w, b)/∂b

α = the learning rate, which controls the size of the steps taken during optimization. Smaller values (between 0 and 1) lead to more gradual updates, while larger values result in bigger steps.

∂J(w, b)/∂w = the partial derivative of the cost function, which determines the direction of each step. It is the slope of the cost curve at the current point.

💡 NOTE: Simultaneous Updates

Both w and b should be updated simultaneously to ensure correct calculations.

The correct implementation involves calculating updates for both parameters before applying
them, avoiding errors that can arise from non-simultaneous updates.
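A minimal sketch of the simultaneous update; the toy partial-derivative functions here are made up for illustration and are not from the source.

```python
# Illustration of a correct simultaneous update with toy partial derivatives.

def dj_dw(w, b):
    return 2 * w + b   # hypothetical partial derivative w.r.t. w

def dj_db(w, b):
    return w + 2 * b   # hypothetical partial derivative w.r.t. b

alpha = 0.1
w, b = 1.0, 1.0

# Correct: compute BOTH updates from the old values, then assign.
tmp_w = w - alpha * dj_dw(w, b)
tmp_b = b - alpha * dj_db(w, b)
w, b = tmp_w, tmp_b

# Incorrect (for comparison): assigning w first would make dj_db(w, b)
# see the new w, so b's update would use an inconsistent, mixed state.
```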

Gradient Descent Intuition


Understanding the Derivative
The derivative indicates the slope of the cost
function at a given point, guiding the direction of
the parameter update.

A positive slope results in decreasing the parameter w, while a negative slope increases w; both moves reduce the cost function.

positive slope: w decreases

negative slope: w increases

💡 Example:

The left plot fixes b = 100 and shows the slope of the cost curve, ∂J(w, b)/∂w, at three points. The slope is positive to the right of the minimum and negative to the left, so gradient descent always moves toward the bottom of the curve, where the gradient is zero.

Over both parameters, gradient descent uses both ∂J(w, b)/∂w and ∂J(w, b)/∂b. The right quiver plot visualizes the combined gradients: arrow size shows magnitude, and direction shows the ratio of the two partial derivatives. The gradients point away from the minimum, but since gradient descent subtracts the gradient, it moves the parameters toward lower cost.

❓ Why does the gradient point away from the minimum?

Example
source: TowardsDataScience

Consider that you are walking along the graph below, and you are currently at the ‘green’ dot. You aim to reach the minimum, i.e., the ‘red’ dot, but from your position you cannot see it. Possible actions would be:

You might go upward or downward.

Once you decide which way to go, you might take a bigger step or a smaller step toward your destination.

Essentially, there are two things you need to know to reach the minimum: which way to go and how big a step to take.

⇒ Gradient Descent Algorithm: helps us make efficient decisions using derivatives, which represent the
slope of a graph at a point. This slope, shown by a tangent line, guides us toward the minimum.

The Minimum Value and Steep Slope


A tangent at the green point shows that moving upward takes us away from the minimum. Its steep slope produces larger steps, while the gentler slope at the blue point produces smaller steps as we approach the minimum.

Choosing Learning Rate


The learning rate (α) significantly affects the efficiency of gradient descent.

Too small α:

Leads to very slow convergence.

Requires many steps, increasing computation time.

Too large α:

Can overshoot the minimum.

May cause the algorithm to diverge away from the minimum (fail to converge).

The challenge increases with complex graphs:

Multiple local minima and maxima make optimization harder.

Choosing the right α is crucial for stable and effective convergence.
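The effect of the learning rate can be demonstrated on a tiny example. The cost J(w) = w² (with gradient 2w) is an illustrative choice, not from the source; the three α values are likewise arbitrary.

```python
# Sketch: effect of the learning rate on J(w) = w**2, whose gradient is 2w.

def run(alpha, w=1.0, steps=20):
    """Run a fixed number of gradient descent steps and return the final w."""
    for _ in range(steps):
        w = w - alpha * 2 * w   # gradient descent update for J(w) = w**2
    return w

w_small = run(alpha=0.01)   # too small: converges, but very slowly
w_good  = run(alpha=0.1)    # reasonable: converges quickly
w_big   = run(alpha=1.1)    # too large: overshoots and |w| grows each step
```

With α = 1.1 each update multiplies w by (1 − 2.2) = −1.2, so the iterate oscillates in sign while its magnitude grows: the divergence described above.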

Gradient descent can reach a local minimum even with a fixed learning rate α. Near a local minimum:

The derivative automatically gets smaller.

The update steps therefore become smaller.
Optimizations
source: GitHub

How can we check that gradient descent is working correctly?


source: Vinija_Notes

There are two ways to check this. We can compute the cost function J on the training set and plot its value at each iteration (i.e., each simultaneous update of the parameters w, b) of gradient descent.

We can also use an automatic convergence test. We choose ϵ to be a very small number. If the cost J decreases by less than ϵ in one iteration, then we are likely on the flattened part of the curve and can declare convergence.
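Both checks can be combined in one loop. This is a sketch under stated assumptions: `cost_fn` and `grad_fn` are placeholders for the model's cost and gradient, and the default ϵ is arbitrary.

```python
# Sketch of an automatic convergence test: stop when the cost decreases
# by less than epsilon between iterations, recording the learning curve.

def descend_until_converged(cost_fn, grad_fn, w=0.0, b=0.0,
                            alpha=0.1, epsilon=1e-6, max_iters=10000):
    J_prev = cost_fn(w, b)
    history = [J_prev]                 # learning-curve data: J per iteration
    for _ in range(max_iters):
        dj_dw, dj_db = grad_fn(w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db   # simultaneous update
        J = cost_fn(w, b)
        history.append(J)
        if J_prev - J < epsilon:       # drop in cost smaller than epsilon
            break                      # declare convergence
        J_prev = J
    return w, b, history
```

Plotting `history` against the iteration index gives exactly the learning curve described above.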

Learning curve

Gradient descent aims to minimize the cost function J(w, b). Plotting J over the iterations gives the learning curve, which ideally should drop quickly and then level off as the algorithm converges. If J increases at any point, it may indicate a poor learning rate α or a bug. The required number of iterations varies by application, making it hard to predict convergence time in advance.

Automatic convergence test


An automatic convergence test can stop training when the drop in the cost J(w, b) between iterations is less than or equal to ϵ. Selecting the right ϵ can be challenging, so it is best used together with a learning curve.

Debugging
Although a cost function that fails to decrease or fluctuates is often due to a high learning rate, it may also indicate a coding bug. Using a very small learning rate can help reveal bugs: with a sufficiently small α, the cost should decrease on every iteration, so if it still fails to decrease, the implementation likely has a bug.

Gradient Descent for Linear Regression


Linear regression model:

f_{w,b}(x) = w·x + b

Cost function (the squared error cost function):

J(w, b) = (1/(2m)) Σ_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i))²

Pre-derived gradient descent algorithm:

repeat until convergence:

w = w − α · ∂J(w, b)/∂w

b = b − α · ∂J(w, b)/∂b

where the partial derivatives are:

∂J(w, b)/∂w = (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i)) · x^(i)

∂J(w, b)/∂b = (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i))
Final gradient descent algorithm:

w = w − α · (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i)) · x^(i)

b = b − α · (1/m) Σ_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i))

💡 The cost function of linear regression is convex

The squared error cost function is convex, ensuring that gradient descent will always converge to the global minimum, avoiding local minima issues.
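The gradient formulas above can be translated directly into code. This is a sketch: the function names, dataset, and hyperparameters below are illustrative choices, not from the source.

```python
# Gradient descent for linear regression, using the per-parameter
# gradient formulas derived above. The dataset follows y = 2x + 1.

def compute_gradient(x, y, w, b):
    """Return (dJ/dw, dJ/db) averaged over all m training examples."""
    m = len(x)
    dj_dw = dj_db = 0.0
    for i in range(m):
        err = (w * x[i] + b) - y[i]    # f_wb(x^(i)) - y^(i)
        dj_dw += err * x[i]
        dj_db += err
    return dj_dw / m, dj_db / m

def fit(x, y, alpha=0.01, num_iters=10000):
    """Run gradient descent from w = b = 0 and return the fitted parameters."""
    w = b = 0.0
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db   # simultaneous update
    return w, b
```

Because the squared error cost is convex, any reasonable α recovers essentially the same (w, b) regardless of the starting point.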

Batch Gradient Descent


Batch gradient descent uses the entire training dataset to compute the updates at each step, ensuring a comprehensive approach to optimization.

This method contrasts with other versions of gradient descent that use smaller subsets of the data.

Source: https://statusneo.com/efficientdl-mini-batch-gradient-descent-explained/ (accessed 23/05/2025)

When we trace the search for the minimum on a contour plot, the path looks like the following:


That means, if the number of training examples is 47, then m = 47:

J(w, b) = (1/(2m)) Σ_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i))²
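The cost above sums over every training example, which is what makes this the "batch" variant. A minimal sketch of that computation (the tiny dataset in the test is made up for illustration):

```python
# Minimal implementation of the batch squared error cost for f(x) = w*x + b.
# Every one of the m examples contributes to the sum, as in the formula above.

def compute_cost(x, y, w, b):
    m = len(x)                          # e.g. m = 47 for 47 training examples
    total = 0.0
    for i in range(m):
        f_wb = w * x[i] + b             # model prediction for example i
        total += (f_wb - y[i]) ** 2     # squared error for example i
    return total / (2 * m)              # 1/(2m) * sum of squared errors
```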

Example:
In the figure below, the left graph shows w's progression over the first few steps of gradient descent: w oscillates from positive to negative, and the cost grows rapidly. Gradient descent operates on both w and b simultaneously, so the 3-D plot on the right is needed for the complete picture.

source: w1lab4

