(MLP) Lecture Notes
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.
Supervised Learning
In supervised learning, we are given a data set in which we already know what the correct output should look like, with the idea that there is a relationship between the input and the output.
Example 1 - Regression
Example 2 - Classification
a) Single Feature:
b) Multiple Features:
Unsupervised Learning
Unsupervised learning allows us to approach problems with little or no idea what our results should
look like. We can derive structure from data where we don't necessarily know the effect of the
variables. We can derive this structure by clustering the data based on relationships among the
variables in the data. With unsupervised learning there is no feedback based on the prediction
results.
Linear Regression
We’ll use x(i) to denote the “input” variables (features), and y(i) to denote the “output” or target variable that we are trying to predict. A pair (x(i), y(i)) is called a training example; a list of m training examples (x(i), y(i)) is called a training set.
Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the
corresponding value of y.
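As a minimal sketch for a single input feature (the function name and parameter layout here are illustrative, not fixed by the notes), a linear hypothesis hθ(x) = θ0 + θ1·x could look like:

```python
def hypothesis(theta, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    theta0, theta1 = theta
    return theta0 + theta1 * x
```

For example, with θ = (1, 2) the prediction at x = 3 is 1 + 2·3 = 7.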
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This takes an
average difference of all the results of the hypothesis with inputs from x's and the actual output y's.
The mean is halved as a convenience for the computation of the gradient descent, as the derivative
term of the square function will cancel out the (½) term.
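A vectorized sketch of this squared-error cost, assuming a design matrix X with a leading column of ones (the helper name is ours, not the notes'):

```python
import numpy as np

def compute_cost(theta, X, y):
    """Squared-error cost J(theta) = (1 / (2m)) * sum((h(x) - y)^2).

    X: (m, n+1) design matrix with a leading column of ones,
    y: (m,) target vector, theta: (n+1,) parameter vector.
    """
    m = len(y)
    residuals = X @ theta - y          # h(x) - y for every example
    return residuals @ residuals / (2 * m)
```

The halved mean matches the remark above: differentiating the squared term produces a factor of 2 that cancels the ½.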
We need an algorithm to automatically find the set of parameters θ0 and θ1 that minimizes the cost function J(θ0, θ1), which is always a convex function.
Gradient Descent
Repeat the following update until convergence (convergence denotes the state where the algorithm has effectively minimized the cost function, leaving the model parameters stable and near-optimal):
This is simply gradient descent on the original cost function J. Because this method looks at every example in the entire training set on every step, it is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has a single global optimum and no other local optima (J is a convex quadratic function); thus gradient descent always converges to the global minimum (assuming the learning rate α is not too large).
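A sketch of batch gradient descent for linear regression, vectorized with NumPy (the function name and the default values of α and the iteration count are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression.

    Every step uses the whole training set:
        theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_i_j
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # gradient over all m examples
        theta -= alpha * grad
    return theta
```

On data generated by y = 1 + x, the parameters converge toward θ = (1, 1), the global minimum.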
Multivariate Linear Regression
We can speed up gradient descent by having each of our input values in roughly the same range.
This is because θ will descend quickly on small ranges and slowly on large ranges, and so will
oscillate inefficiently down to the optimum when the variables are very uneven.
Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value of an input variable from the values for that input variable, resulting in a new average value of zero for that variable.
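Both techniques together, sketched column-wise over a NumPy feature matrix (the helper name is illustrative):

```python
import numpy as np

def normalize_features(X):
    """Mean normalization plus feature scaling, per column:
        x := (x - mean) / range
    Each feature ends up with mean 0 and a spread of roughly 1.
    """
    mu = X.mean(axis=0)
    rng = X.max(axis=0) - X.min(axis=0)   # range of each input variable
    return (X - mu) / rng, mu, rng
```

The returned mu and rng should be kept so that new examples can be normalized the same way at prediction time.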
It has been proven that if α is sufficiently small, then J(θ) will decrease on every iteration.
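A quick numeric check of this claim on toy data (the data and the value of α here are arbitrary choices): with a sufficiently small α, the recorded costs decrease monotonically.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
m = len(y)
theta = np.zeros(2)
alpha = 0.01   # deliberately small

costs = []
for _ in range(100):
    costs.append(((X @ theta - y) ** 2).sum() / (2 * m))
    theta -= alpha * X.T @ (X @ theta - y) / m

# J(theta) went down on every single iteration
assert all(b < a for a, b in zip(costs, costs[1:]))
```

With a much larger α the same loop can instead diverge, which is why the guarantee is stated only for sufficiently small α.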
Normal Equation
Gradient descent gives one way of minimizing J. Let’s discuss a second way of doing so, this time
performing the minimization explicitly and without resorting to an iterative algorithm. In the "Normal
Equation" method, we will minimize J by explicitly taking its derivatives with respect to the θj’s, and
setting them to zero. This allows us to find the optimum theta without iteration.
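A sketch of the Normal Equation in NumPy, using the pseudo-inverse rather than a plain inverse so the computation stays well defined even when XᵀX is noninvertible:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form minimizer: theta = (X^T X)^{-1} X^T y.

    np.linalg.pinv (pseudo-inverse) keeps this well defined even
    when X^T X is noninvertible, e.g. with redundant features.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```

On the same y = 1 + x data as before, this recovers θ = (1, 1) exactly, with no learning rate and no iteration.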
If XᵀX is noninvertible, the common causes are:
Redundant features, where two features are very closely related (i.e. they are linearly dependent).
Too many features (e.g. m ≤ n).
Solutions include deleting a feature that is linearly dependent with another, deleting some features when there are too many, or using regularization.
Classification
The function g(z), shown here, maps any real number to the (0, 1) interval, making it useful for
transforming an arbitrary-valued function into a function better suited for classification.
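g(z) here is the sigmoid (logistic) function; a one-line sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), mapping R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```

Note that g(0) = 0.5, g(z) approaches 1 as z grows large, and approaches 0 as z grows very negative.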
Note that writing the cost function in this way guarantees that J(θ) is
convex for logistic regression.
Simplified form:
If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
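The simplified (cross-entropy) cost can be sketched as follows, assuming labels y ∈ {0, 1} and a design matrix X (helper names are ours):

```python
import numpy as np

def logistic_cost(theta, X, y):
    """Cross-entropy cost for logistic regression:
        J(theta) = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h)),
    with h = g(X @ theta), the sigmoid of the linear score.
    """
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```

This reproduces the behavior described above: when y = 0 and h approaches 1, the log(1 − h) term blows up, so the cost goes to infinity.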
Multiclass Classification
The Problem of Overfitting
How to Address Overfitting
Regularization
Regularized Linear Regression
The factor (1 − αλ/m) multiplying θj is usually a number slightly less than 1: the learning rate α is small and m is large, so it evaluates to 1 minus a small number, typically somewhere around 0.95 to 0.99. In effect θj gets multiplied by, say, 0.99 on every update, making the squared norm of θ a little smaller.
The second term is exactly the same as in the original gradient descent update.
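One regularized update step, sketched in NumPy (the shrink-then-step split mirrors the description above; the function and variable names are illustrative):

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    """One gradient descent step for regularized linear regression:

        theta_j := theta_j * (1 - alpha * lam / m)
                   - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_i_j

    The intercept theta_0 is conventionally not shrunk.
    """
    m = len(y)
    grad = X.T @ (X @ theta - y) / m             # same term as plain GD
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0                              # leave the bias alone
    return theta * shrink - alpha * grad
```

With zero residuals the gradient term vanishes and the step reduces to pure shrinkage of the non-intercept weights, which makes the multiplicative effect easy to see.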
Regularized Logistic Regression