Lecture 3 ML_optimization

OPTIMIZATION IN
MACHINE LEARNING
Dr. Dinesh K Vishwakarma

Professor, Department of Information Technology
Delhi Technological University, Delhi-110042, India
Email: dinesh@dtu.ac.in
Mobile:9971339840
Webpage: http://www.dtu.ac.in/web/departments/informationtechnology/faculty/dkvishwakarma.php
Introduction
− Optimization is the process of training a model iteratively so that an objective function is driven to its maximum or minimum value. Obtaining better results in this way is one of the central concerns of ML.
− An optimization problem consists of maximizing or minimizing a real function by systematically choosing input values.
− Optimization is the most essential ingredient in the recipe of machine learning algorithms. It starts with defining some kind of loss function/cost function.
− The choice of optimization algorithm can make the difference between reaching good accuracy in hours or in days.
Maxima & Minima
Types of Optimization
− The most popular types of optimization are:
 Maximum likelihood
 Expectation maximization
 Gradient descent
Maximum Likelihood
− Many methods are used for estimating unknown parameters from data.
− The maximum likelihood estimate (MLE) answers the question: for which parameter value do the observed data have the biggest probability?
− The MLE is an example of a point estimate because it gives a single value for the unknown parameter.
− It is often easy to compute, and it agrees with our intuition in simple examples.
Maximum Likelihood…
− Problem Statement
 Consider a random sample (RS) $X_1, X_2, \ldots, X_n$ whose probability distribution depends on some unknown parameter $\theta$.
 The goal of maximum likelihood estimation is to find a point estimator $u(X_1, X_2, \ldots, X_n)$ such that $u(x_1, x_2, \ldots, x_n)$ is a "good" point estimate of $\theta$, where $x_1, x_2, \ldots, x_n$ are the observed values of the random sample.
 For example, consider a RS $X_1, X_2, \ldots, X_n$ for which the $X_i$ are assumed to be normally distributed with mean $\mu$ and variance $\sigma^2$. Our goal is then to find a good estimate of $\mu$, say, using the data $x_1, x_2, \ldots, x_n$ obtained from our specific random sample.
− Basic Idea
 A good estimate of the unknown parameter $\theta$ is the value of $\theta$ that maximizes the probability of obtaining the data we observed.
 Suppose we have a RS $X_1, X_2, \ldots, X_n$ for which the probability density or mass function of each $X_i$ is $f(x_i, \theta)$.
 The joint probability density or mass function of $X_1, X_2, \ldots, X_n$, viewed as a function of $\theta$, is $L(\theta)$, also called the likelihood function:
 $L(\theta) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = f(x_1, \theta) \cdot f(x_2, \theta) \cdots f(x_n, \theta) = \prod_{i=1}^{n} f(x_i, \theta)$
 The first equality is just the definition of the joint probability mass function. The second equality comes from the fact that we have a RS, which implies by definition that the $X_i$ are independent.
 We therefore treat the "likelihood function" $L(\theta)$ as a function of $\theta$ and find the value of $\theta$ that maximizes it.
− Example
 Consider a RS $X_1, X_2, \ldots, X_n$, where
 $X_i = 0$ if a randomly selected student does not own a sports car, and
 $X_i = 1$ if a randomly selected student does own a sports car.
 Assuming the $X_i$ are independent Bernoulli random variables with unknown parameter $p$, find the maximum likelihood estimator of $p$, the proportion of students who own a sports car.
− Solution
 If the $X_i$ are independent Bernoulli random variables with unknown parameter $p$, then the PMF is $f(x_i, p) = p^{x_i} (1 - p)^{1 - x_i}$ for $x_i = 0$ or $1$ and $0 < p < 1$.
 The likelihood function is
 $L(p) = \prod_{i=1}^{n} f(x_i, p) = p^{x_1}(1-p)^{1-x_1} \times p^{x_2}(1-p)^{1-x_2} \times \cdots \times p^{x_n}(1-p)^{1-x_n} = p^{\sum x_i} (1-p)^{n - \sum x_i}$
 Now, in order to implement the method of maximum likelihood, we need to find the $p$ that maximizes the likelihood $L(p)$.
 Here, we can recall our calculus.
− Solution
 To make the calculation easier, take the logarithm of both sides of $L(p) = p^{\sum x_i}(1-p)^{n-\sum x_i}$:
 $\log L(p) = \left(\sum x_i\right) \log p + \left(n - \sum x_i\right) \log(1 - p)$
 Take the derivative of both sides with respect to $p$ and set it to 0:
 $\frac{\partial \log L(p)}{\partial p} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p} = 0$
 Multiply both sides by $p(1 - p)$ to get
 $(1 - p)\sum x_i - p\left(n - \sum x_i\right) = \sum x_i - p\sum x_i - np + p\sum x_i = 0$, or $\sum x_i - np = 0$
 Hence the estimate is $\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}$; alternatively, the estimator is $\hat{p} = \frac{\sum_{i=1}^{n} X_i}{n}$.
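As a quick numerical check of this result, the sketch below evaluates the Bernoulli log-likelihood on a grid of candidate values of p and compares the maximizer with the sample mean. The sample `xs`, the grid resolution, and the function name `bernoulli_log_likelihood` are illustrative assumptions, not part of the lecture.

```python
import math

def bernoulli_log_likelihood(p, xs):
    """log L(p) = (sum x_i) * log(p) + (n - sum x_i) * log(1 - p)."""
    s = sum(xs)
    n = len(xs)
    return s * math.log(p) + (n - s) * math.log(1.0 - p)

# Hypothetical sample: 1 = owns a sports car, 0 = does not.
xs = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Closed-form MLE: the sample mean.
p_hat = sum(xs) / len(xs)

# Independent check: search a grid of candidate values with 0 < p < 1.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, xs))

print("closed form:", p_hat, "grid search:", p_grid)
```

Both values should agree up to the grid spacing, which is what the calculus argument predicts.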
Example 1
− A coin is flipped 100 times. Given that there were 55 heads, find the maximum likelihood estimate for the probability $p$ of heads on a single toss.
− Solution
− For a given value of $p$, the probability of getting 55 heads in this experiment is the binomial probability.
− The probability of getting 55 heads depends on the value of $p$, so we write it as a conditional probability:
 $P(55 \text{ heads} \mid p) = \binom{100}{55} p^{55} (1 - p)^{45}$: the probability of 55 heads given $p$
 The maximum likelihood estimate is the value of $p$ that maximizes this probability. Setting the derivative to zero:
 $\frac{dP(55 \text{ heads} \mid p)}{dp} = \binom{100}{55} \left( 55 p^{54} (1-p)^{45} - 45 p^{55} (1-p)^{44} \right) = 0$
 Dividing by $\binom{100}{55} p^{54} (1-p)^{44}$ gives $55(1 - p) = 45p$, so the MLE is $\hat{p} = 0.55$.
Example 1
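− Alternate Solution: Log Likelihood
− The log likelihood is:
 $\ln P(55 \text{ heads} \mid p) = \ln \binom{100}{55} + 55 \ln p + 45 \ln(1 - p)$
 Maximizing the likelihood is the same as maximizing the log likelihood, because the logarithm is an increasing function.
 $\frac{d(\text{log likelihood})}{dp} = \frac{d}{dp}\left[ \ln \binom{100}{55} + 55 \ln p + 45 \ln(1 - p) \right] = \frac{55}{p} - \frac{45}{1 - p} = 0$
 $55(1 - p) = 45p$
 The MLE from the log likelihood is again $\hat{p} = 0.55$.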
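The coin example can also be checked numerically. The following sketch, with illustrative function names, evaluates the log likelihood and its derivative for 55 heads in 100 flips; the derivative should change sign at p = 0.55, and the log likelihood should be smaller on either side of it.

```python
import math

def log_likelihood(p, heads=55, flips=100):
    """ln P(heads | p) for a binomial experiment."""
    return (math.log(math.comb(flips, heads))
            + heads * math.log(p)
            + (flips - heads) * math.log(1.0 - p))

def d_log_likelihood(p, heads=55, flips=100):
    """Derivative of the log likelihood: heads/p - (flips - heads)/(1 - p)."""
    return heads / p - (flips - heads) / (1.0 - p)

for p in (0.50, 0.55, 0.60):
    print(f"p={p:.2f}  log-likelihood={log_likelihood(p):.4f}  "
          f"derivative={d_log_likelihood(p):+.4f}")
```

The derivative is positive below 0.55 and negative above it, so 0.55 is a maximum rather than a minimum.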
Expectation Maximization
− Expectation Maximization (EM) Algorithm
− The EM algorithm is an approach for maximum likelihood estimation in the presence of latent variables.
− Latent variables are variables that are not directly observed but are inferred or estimated from other observed variables in a statistical or mathematical model.
− The EM algorithm is an iterative approach that cycles between two modes.
− The first mode attempts to estimate the missing or latent variables, called the expectation step or E-step.
− The second mode attempts to optimize the parameters of the model to best explain the data, called the maximization step or M-step.
− The EM algorithm can be used in unsupervised machine learning, for example in clustering and density estimation.
Expectation Maximization
− Expectation Maximization (EM) Algorithm
− E-step: the algorithm estimates the latent variables, i.e. it computes the expectation of the log-likelihood using the current parameter estimates.
− M-step: the algorithm determines the parameters that maximize the expected log-likelihood obtained in the E-step, and the model parameters are updated accordingly.
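To make the E-step/M-step cycle concrete, here is a minimal sketch of EM for a mixture of two one-dimensional Gaussians, a common textbook setting that the slides do not specify. The latent variable is the unknown component each point came from. The synthetic data, the crude initialization, and all names (`em_two_gaussians`, `normal_pdf`) are assumptions made for illustration only.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_gaussians(data, n_iter=50):
    # Crude initialization (an assumption): means at the data extremes.
    mu = [min(data), max(data)]
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to
        # component 0, given the current parameter estimates.
        resp0 = []
        for x in data:
            a = weight[0] * normal_pdf(x, mu[0], sigma[0])
            b = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp0.append(a / (a + b))
        # M-step: re-estimate the parameters that maximize the expected
        # log-likelihood under those posterior probabilities.
        for k in (0, 1):
            r = resp0 if k == 0 else [1.0 - v for v in resp0]
            total = sum(r)
            mu[k] = sum(ri * x for ri, x in zip(r, data)) / total
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / total
            sigma[k] = math.sqrt(max(var, 1e-6))  # guard against collapse
            weight[k] = total / len(data)
    return mu, sigma, weight

# Synthetic data: two clusters centred near 0 and 5 (hypothetical).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])

mu, sigma, weight = em_two_gaussians(data)
print("means:", mu, "std devs:", sigma, "weights:", weight)
```

A fixed number of iterations is used here for brevity; a practical implementation would usually stop when the log-likelihood no longer improves.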
Gradient Descent (GD)
− Gradient descent is an iterative first-order optimization algorithm that finds a local minimum of a given function (or a local maximum, when the step direction is reversed).
− GD is commonly used in machine learning (ML) and deep learning (DL) to minimize a cost/loss function (e.g. in linear regression).
− It is also used in control engineering (robotics, chemical, etc.), computer games, and mechanical engineering.
− It was first suggested by Augustin-Louis Cauchy in 1847.
Gradient Descent (GD)…
[Figure: slope of y = x², and the slope at successive points as they move towards the minimum]
Fundamentals of GD
− Function requirements
− The GD algorithm does not work for all functions. It has two specific requirements: the function has to be
− Differentiable
− Convex
− A differentiable function has a derivative at each point in its domain. Not all functions meet this criterion.
Fundamentals of GD…
E.g. differentiable functions
Non-differentiable functions have a step, a cusp, or a discontinuity.

− Convexity in GD optimization
− Our goal is to minimize the cost function in order to improve the accuracy of the model. For a linear model, MSE is a convex (and twice differentiable) function of the parameters. This means that any local minimum is also the global minimum, so gradient descent with a suitable learning rate converges to the global minimum.
− Convexity in GD optimization
− A univariate function can be checked for convexity mathematically: if its second derivative is greater than 0 everywhere, the function is (strictly) convex.
− $\frac{d^2 f(x)}{dx^2} > 0$; e.g. $f(x) = x^2 - x + 3$; $\frac{df(x)}{dx} = 2x - 1$ and $\frac{d^2 f(x)}{dx^2} = 2$
− Hence, $f(x)$ is convex.
− There may also be a quasi-convex function such as $f(x) = x^4 - 2x^3 + 2$; $\frac{df(x)}{dx} = 4x^3 - 6x^2 = x^2(4x - 6)$; the derivative is zero at $x = 0$ and $x = 1.5$, so these are the candidate locations for extrema.
− Let us check the second derivative: $\frac{d^2 f(x)}{dx^2} = 12x^2 - 12x = 12x(x - 1)$. Its value is zero for $x = 0$ and $x = 1$. These locations are called inflexion points: places where the curvature changes sign from convex to concave or vice versa.
− Convexity in GD optimization: Quasi-Convex function…
− At $x = 0$ both the first and the second derivative are equal to zero, and the curvature changes sign there, so this is a saddle point. At $x = 1.5$ the first derivative is zero and the second derivative is positive, so this is a minimum (here the global minimum).
 For multivariate functions, the most appropriate way to check whether a point is a saddle point is to compute the Hessian matrix, which involves somewhat more complex calculations.
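The derivative checks for $f(x) = x^4 - 2x^3 + 2$ can be reproduced in a few lines. This is a sketch using the analytic derivatives; the helper `classify` and its tolerance are illustrative assumptions. Note that the second-derivative test is inconclusive when both derivatives vanish, which is exactly the situation at x = 0.

```python
def f(x):
    return x**4 - 2 * x**3 + 2

def df(x):
    # First derivative: 4x^3 - 6x^2
    return 4 * x**3 - 6 * x**2

def d2f(x):
    # Second derivative: 12x^2 - 12x
    return 12 * x**2 - 12 * x

def classify(x, tol=1e-12):
    """Classify a point of f using the second-derivative test."""
    if abs(df(x)) > tol:
        return "not a critical point"
    if d2f(x) > tol:
        return "local minimum"
    if d2f(x) < -tol:
        return "local maximum"
    return "inconclusive (possible saddle/inflexion point)"

for x in (0.0, 1.0, 1.5):
    print(f"x={x}: f'={df(x)}, f''={d2f(x)} -> {classify(x)}")
```

At x = 1 the second derivative is zero but the first is not, so it is an inflexion point without being a critical point.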
Gradient Descent…
− There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
− Batch gradient descent
− Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters $\theta$ for the entire training dataset:
− $\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$, where $\eta$ is the learning rate
− The GD algorithm iteratively calculates the next point using the gradient at the current position: it scales the gradient (by the learning rate) and subtracts the obtained value from the current position (makes a step).
− It subtracts the value because we want to minimize the function (to maximize it, we would add).
Batch gradient descent
− $\eta$ scales the gradient and thus controls the step size. In machine learning it is called the learning rate, and it has a strong influence on performance.
− With a smaller value of $\eta$, GD may take longer to converge, or may reach the maximum number of iterations before reaching the optimum point.
− With a higher value of $\eta$, the algorithm may not converge to the optimal point (it jumps around) or may even diverge completely.
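The effect of the learning rate can be seen on a simple quadratic. The sketch below runs a fixed number of gradient steps on the illustrative function $f(x) = x^2 - 4x + 1$ (minimum at x = 2) with three learning rates; the specific values 0.001, 0.1 and 1.1, the start point, and the step count are assumptions chosen to show slow convergence, convergence, and divergence.

```python
def gradient(x):
    # Derivative of the illustrative function f(x) = x^2 - 4x + 1.
    return 2 * x - 4

def run_gd(eta, x0=9.0, n_steps=50):
    """Apply n_steps gradient descent updates with learning rate eta."""
    x = x0
    for _ in range(n_steps):
        x = x - eta * gradient(x)
    return x

for eta in (0.001, 0.1, 1.1):
    x_final = run_gd(eta)
    print(f"eta={eta}: x after 50 steps = {x_final:.4f}, "
          f"distance to minimum = {abs(x_final - 2):.4f}")
```

For this function each step multiplies the distance to the minimum by (1 - 2·eta), so the iteration diverges once that factor exceeds 1 in absolute value.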
Steps in BGD
1. Choose a starting point (initialization)
2. Calculate gradient at this point
3. Make a scaled step in the opposite direction
to the gradient (objective: minimize)
4. Repeat steps 2 and 3 until one of the following
criteria is met:
 Maximum number of iterations reached
 Step size is smaller than the tolerance (due to
scaling or a small gradient)
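The steps above can be sketched as a short loop for a univariate function. The function name `gradient_descent`, its default `max_iter` and `tol` values, and the test function $(x - 3)^2$ are hypothetical choices for illustration.

```python
def gradient_descent(grad, start, learning_rate, max_iter=1000, tol=1e-6):
    """Minimize a univariate function given its gradient `grad`.

    Returns the final point and the number of iterations used.
    """
    x = start                              # step 1: starting point
    for step in range(1, max_iter + 1):
        g = grad(x)                        # step 2: gradient at this point
        delta = learning_rate * g          # step 3: scaled step
        if abs(delta) < tol:               # stop: step size below tolerance
            return x, step
        x = x - delta                      # move against the gradient
    return x, max_iter                     # stop: maximum iterations reached

# Usage on the hypothetical function f(x) = (x - 3)^2, gradient 2(x - 3).
x_min, steps = gradient_descent(lambda x: 2 * (x - 3), start=0.0,
                                learning_rate=0.1)
print(x_min, steps)
```

Returning the iteration count makes it easy to see which of the two stopping criteria ended the run.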
E.g. BGD
− A quadratic function: $f(x) = x^2 - 4x + 1$
− It is a univariate function with $\frac{df(x)}{dx} = 2x - 4$
− Let us consider $\eta = 0.1$ and starting point $x_0 = 9$. Then the calculation is as follows:
− $x_1 = 9 - 0.1 \times (2 \times 9 - 4) = 7.6$
− $x_2 = 7.6 - 0.1 \times (2 \times 7.6 - 4) = 6.48$
− $x_3 = 6.48 - 0.1 \times (2 \times 6.48 - 4) = 5.584$
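The arithmetic of this example is easy to recompute directly from the update rule. The sketch below applies three updates with learning rate 0.1 starting from x = 9; the variable names are illustrative.

```python
def df(x):
    # Derivative of f(x) = x^2 - 4x + 1.
    return 2 * x - 4

eta = 0.1
x = 9.0
trajectory = [x]
for _ in range(3):
    x = x - eta * df(x)      # x_{t+1} = x_t - eta * f'(x_t)
    trajectory.append(x)

print([round(v, 3) for v in trajectory])  # [9.0, 7.6, 6.48, 5.584]
```

Continuing the loop moves x steadily towards the minimum at x = 2.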
E.g. BGD
− A function with a saddle point: $f(x) = x^4 - 2x^3 + 2$.
− Results for two learning rates and two different starting points.
[Figure: gradient descent with a learning rate of 0.4 and a starting point $x = -0.5$]
BGD…
− It requires calculating the gradients for the whole dataset to perform just one update.
− BGD can be very slow and is intractable for datasets that don't fit in memory. It also doesn't allow us to update the model online, i.e. BGD cannot be applied to a dataset that is updated continuously.
Stochastic GD
− Stochastic gradient descent
− Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$.
− $\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t; x^{(i)}; y^{(i)})$
• BGD performs redundant computations for large datasets; SGD avoids this redundancy by performing one update at a time.
• It is therefore usually much faster and can also be used to learn online.
• SGD performs frequent updates with a high variance, which cause the objective function to fluctuate heavily.
Stochastic GD…
− On the one hand, SGD's fluctuation enables it to jump to new and potentially better local minima. On the other hand, it ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.
− However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behavior as BGD.
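A minimal sketch of SGD with a slowly decreasing learning rate is given below, fitting a single weight w in the model y ≈ w·x with a squared-error loss. The synthetic data (true slope 3), the decay schedule `eta0 / (1 + 0.01 * step)`, and all constants are assumptions for illustration; the slides do not prescribe a particular schedule.

```python
import random

random.seed(0)

# Hypothetical data: y = 3x plus a little Gaussian noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.1)) for x in xs]

w = 0.0       # single parameter to learn
eta0 = 0.5    # initial learning rate
step = 0

for epoch in range(20):
    random.shuffle(data)                    # visit examples in random order
    for x, y in data:
        step += 1
        eta = eta0 / (1.0 + 0.01 * step)    # slowly decaying learning rate
        grad = 2.0 * (w * x - y) * x        # gradient of (w*x - y)^2 for ONE example
        w -= eta * grad                     # one update per training example

print("estimated slope:", w)
```

With a constant learning rate the estimate would keep jittering around the minimum; the decay shrinks those fluctuations over time.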
Mini Batch GD
− Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of $n$ training examples.
− $\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t; x^{(i:i+n)}; y^{(i:i+n)})$
− a) It reduces the variance of the parameter updates, which can lead to more stable convergence; b) it can make use of the highly optimized matrix optimizations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient.
− Common mini-batch sizes range between 50 and 256.
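The three variants differ only in how many examples contribute to each update, which the sketch below makes explicit through a single `batch_size` parameter: the full dataset size gives batch GD, 1 gives SGD, and anything in between gives mini-batch GD. The model (y ≈ w·x), the synthetic data, the learning rate, and the function name `minibatch_gd` are illustrative assumptions.

```python
import random

def minibatch_gd(data, batch_size, eta=0.5, epochs=20, seed=0):
    """Fit y ~ w*x by minimizing MSE, with one update per mini-batch."""
    rng = random.Random(seed)
    data = list(data)                       # work on a copy
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]  # examples i .. i+n
            # Average gradient of (w*x - y)^2 over the mini-batch.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= eta * grad
    return w

# Hypothetical data: y = 3x plus Gaussian noise.
rng = random.Random(1)
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1))
        for x in (rng.uniform(-1.0, 1.0) for _ in range(200))]

for batch_size in (len(data), 50, 1):
    w = minibatch_gd(data, batch_size)
    print(f"batch_size={batch_size}: estimated slope = {w:.3f}")
```

In a real deep learning library the per-batch sum would be a vectorized matrix operation rather than a Python loop.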
References
− https://online.stat.psu.edu/stat415/lesson/1/1.2
− https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21
− https://medium.com/analytics-vidhya/gradient-descent-optimization-techniques-4316419c5b74
THANK YOU
CONTACT: DINESH@DTU.AC.IN
MOBILE: +91-9971339840
9/12/2023 Dinesh K. Vishwakarma, Ph.D.
