Machine Learning Notes

MACHINE LEARNING
What is Machine Learning?

Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of
study that gives computers the ability to learn without being explicitly programmed." This is an
older, informal definition.
Tom Mitchell provides a more modern definition: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E."
Example: playing checkers.
E = the experience of playing many games of checkers
T = the task of playing checkers.
P = the probability that the program will win the next game.
In general, any machine learning problem can be assigned to one of two broad classifications:
Supervised learning and Unsupervised learning.
Supervised Learning
In supervised learning, we are given a data set and already know what our correct output should
look like, having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into "regression" and "classification" problems. In
a regression problem, we are trying to predict results within a continuous output, meaning that
we are trying to map input variables to some continuous function. In a classification problem, we
are instead trying to predict results in a discrete output. In other words, we are trying to map input
variables into discrete categories.
Example 1:
Given data about the size of houses on the real estate market, try to predict their price. Price as
a function of size is a continuous output, so this is a regression problem.
We could turn this example into a classification problem by instead making our output about
whether the house "sells for more or less than the asking price." Here we are classifying the
houses based on price into two discrete categories.
Example 2:
(a) Regression - Given a picture of a person, we have to predict their age on the basis of the
given picture
(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is
malignant or benign.
Unsupervised Learning
Unsupervised learning allows us to approach problems with little or no idea what our results
should look like. We can derive structure from data where we don't necessarily know the effect of
the variables.
We can derive this structure by clustering the data based on relationships among the variables in
the data.
With unsupervised learning there is no feedback based on the prediction results.
Example:
Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group
these genes into groups that are somehow similar or related by different variables, such as
lifespan, location, roles, and so on.
Non-clustering: The "Cocktail Party Algorithm" allows you to find structure in a chaotic
environment (i.e., identifying individual voices and music from a mesh of sounds at a cocktail
party).
LINEAR REGRESSION WITH ONE VARIABLE:
1) Model Representation
To establish notation for future use, we’ll use x(i) to denote the “input” variables (living area in
this example), also called input features, and y(i) to denote the “output” or target variable that we
are trying to predict (price). A pair (x(i),y(i)) is called a training example, and the dataset that we’ll
be using to learn—a list of m training examples (x(i),y(i));i=1,...,m—is called a training set. Note
that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to
do with exponentiation. We will also use X to denote the space of input values, and Y to denote
the space of output values. In this example, X = Y = ℝ.
To describe the supervised learning problem slightly more formally, our goal is, given a training
set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of
y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is
therefore like this:
When the target variable that we’re trying to predict is continuous, such as in our housing
example, we call the learning problem a regression problem. When y can take on only a small
number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a
house or an apartment, say), we call it a classification problem.
2) Cost Function
3) Cost Function - Intuition I
If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are
trying to make a straight line (defined by hθ(x)) which passes through these scattered data
points.
Our objective is to get the best possible line. The best possible line is the one for which the
average squared vertical distance of the scattered points from the line is smallest. Ideally,
the line should pass through all the points of our training data set. In such a case, the value of
J(θ0,θ1) will be 0. The following example shows the ideal situation where we have a cost function
of 0.
When θ1=1, we get a slope of 1 which goes through every single data point in our model.
Conversely, when θ1=0.5, we see the vertical distance from our fit to the data points increase.
This increases our cost function to 0.58. Plotting several other points yields the following
graph:
Thus as a goal, we should try to minimize the cost function. In this case, θ1=1 is our global
minimum.
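The two cost values quoted above can be reproduced with a short sketch. It assumes the squared-error
cost J(θ0,θ1) = 1/(2m) · Σ (hθ(x(i)) − y(i))² and a three-point training set (1,1), (2,2), (3,3).
Neither the formula nor the data appears in these notes, so both are assumptions, chosen because
they are consistent with the stated cost of 0.58 at θ1=0.5.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost: half the mean squared vertical distance to the line."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Assumed training set: three points lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(cost(0.0, 1.0, xs, ys))            # 0.0, the line passes through every point
print(round(cost(0.0, 0.5, xs, ys), 2))  # 0.58
```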
Cost Function - Intuition II

A contour plot is a graph that contains many contour lines. A contour line of a two variable
function has a constant value at all points of the same line. An example of such a graph is the
one to the right below.
Taking any color and going along the 'circle', one would expect to get the same value of the cost
function. For example, the three green points found on the green line above have the same value
for J(θ0,θ1) and as a result, they are found along the same line. The circled x displays the value
of the cost function for the graph on the left when θ0 = 800 and θ1 = -0.15. Taking another h(x)
and plotting its contour plot, one gets the following graphs:
When θ0 = 360 and θ1 = 0, the value of J(θ0,θ1) in the contour plot gets closer to the center thus
reducing the cost function error. Now giving our hypothesis function a slightly positive slope
results in a better fit of the data.
The graph above minimizes the cost function as much as possible; consequently, θ1 and θ0 come
out at around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to
put our point in the center of the innermost 'circle'.
Gradient Descent
So we have our hypothesis function and we have a way of measuring how well it fits into the
data. Now we need to estimate the parameters in the hypothesis function. That's where gradient
descent comes in.
Imagine that we graph our hypothesis function based on its parameters θ0 and θ1 (actually we are
graphing the cost function as a function of the parameter estimates). We are not graphing x and
y themselves, but the parameter range of our hypothesis function and the cost resulting from
selecting a particular set of parameters.
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The
points on our graph will be the result of the cost function using our hypothesis with those specific
theta parameters. The graph below depicts such a setup.
We will know that we have succeeded when our cost function is at the very bottom of the pits in
our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the
graph.
The way we do this is by taking the derivative of our cost function (the slope of the line tangent
to the function). The slope of the tangent is the derivative at that point, and it will give us a
direction to move towards. We make steps down the cost function in the direction of steepest descent.
The size of each step is determined by the parameter α, which is called the learning rate.
For example, the distance between each 'star' in the graph above represents a step determined
by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger
step. The direction in which the step is taken is determined by the partial derivative of J(θ0,θ1).
Depending on where one starts on the graph, one could end up at different points. The image
above shows us two different starting points that end up in two different places.
The gradient descent algorithm is:
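repeat until convergence:
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
where j = 0, 1 is the parameter index, and θ0 and θ1 are updated simultaneously.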
Gradient Descent Intuition
Here we explore the scenario where we use one parameter θ1 and plot its cost function to
implement gradient descent. Our formula for a single parameter is:
Repeat until convergence:
\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)
Regardless of the sign of the slope \frac{d}{d\theta_1} J(\theta_1), θ1 eventually converges to
the value that minimizes J. The following graph shows that when the slope is negative, the
value of θ1 increases, and when it is positive, the value of θ1 decreases.
On a side note, we should adjust our parameter α to ensure that the gradient descent algorithm
converges in a reasonable time. Failure to converge or too much time to obtain the minimum
value imply that our step size is wrong.
How does gradient descent converge with a fixed step size α?
The intuition behind the convergence is that \frac{d}{d\theta_1} J(\theta_1) approaches 0 as we
approach the bottom of our convex function. At the minimum, the derivative will always be 0 and
thus we get:
\theta_1 := \theta_1 - \alpha \cdot 0
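A minimal sketch of the one-parameter update, assuming the illustrative convex cost
J(θ1) = (θ1 − 3)². This function is hypothetical and chosen only because its minimum at θ1 = 3 is
easy to verify. The sketch records each step size to show the steps shrinking on their own as the
derivative approaches 0, even though α stays fixed.

```python
def dJ(theta1):
    """Derivative of the assumed cost J(theta1) = (theta1 - 3)**2."""
    return 2.0 * (theta1 - 3.0)


def gradient_descent_1d(theta1, alpha, iterations):
    """Apply theta1 := theta1 - alpha * dJ(theta1) and record each step size."""
    step_sizes = []
    for _ in range(iterations):
        step = alpha * dJ(theta1)
        theta1 = theta1 - step
        step_sizes.append(abs(step))
    return theta1, step_sizes


# Start left of the minimum: the slope is negative, so theta1 increases.
theta1, step_sizes = gradient_descent_1d(theta1=0.0, alpha=0.1, iterations=100)
print(theta1)          # very close to 3
print(step_sizes[:3])  # each step is smaller than the one before
```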
Gradient Descent For Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient descent
equation can be derived. We can substitute our actual cost function and our actual hypothesis
function and modify the equation to:
repeat until convergence: {
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} ((h_\theta(x_i) - y_i) x_i)
}
where m is the size of the training set, θ0 is a constant that will be changing simultaneously
with θ1, and x_i, y_i are values of the given training set (data).
Note that we have separated out the two cases for θj into separate equations for θ0 and θ1, and
that for θ1 we are multiplying by x_i at the end due to the derivative. The following is a
derivation of \frac{\partial}{\partial \theta_j} J(\theta) for a single example:
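A sketch of that derivation, assuming the per-example cost J(θ) = ½(hθ(x) − y)² and the linear
hypothesis hθ(x) = θ0·x0 + θ1·x1 with x0 = 1. Here x_j denotes the j-th feature of the single
example, not the j-th training example.

```latex
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2 \\
  &= 2 \cdot \frac{1}{2} \left( h_\theta(x) - y \right) \cdot
     \frac{\partial}{\partial \theta_j} \left( h_\theta(x) - y \right) \\
  &= \left( h_\theta(x) - y \right) \cdot
     \frac{\partial}{\partial \theta_j} \left( \theta_0 x_0 + \theta_1 x_1 - y \right) \\
  &= \left( h_\theta(x) - y \right) x_j
\end{aligned}
```

For j = 0 the factor x_0 is 1, which is why the update for θ0 has no trailing multiplication.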
The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply
these gradient descent equations, our hypothesis will become more and more accurate.
So, this is simply gradient descent on the original cost function J. This method looks at every
example in the entire training set on every step, and is called batch gradient descent. Note that,
while gradient descent can be susceptible to local minima in general, the optimization problem
we have posed here for linear regression has only one global, and no other local, optima; thus
gradient descent always converges (assuming the learning rate α is not too large) to the global
minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it
is run to minimize a quadratic function.
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory
taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by
straight lines) mark the successive values of θ that gradient descent went through as it
converged to its minimum.
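Batch gradient descent for one-variable linear regression can be sketched as below. The sketch
assumes the squared-error cost with gradients (1/m)·Σ(h(x) − y) for θ0 and (1/m)·Σ(h(x) − y)·x for
θ1. The data set, the learning rate and the iteration count are hypothetical values picked so the
result is easy to check.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # "Batch": every step looks at the entire training set.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical data generated from y = 1 + 2x, so the expected fit is known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
print(theta0, theta1)  # close to 1 and 2
```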
LINEAR REGRESSION WITH MULTIPLE VARIABLES:

1) Gradient Descent for Multiple Variables
The following image compares gradient descent with one variable to gradient descent with
multiple variables:
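In code, the multi-variable update can be sketched as follows, assuming the rule
θj := θj − α·(1/m)·Σ (hθ(x(i)) − y(i))·xj(i) applied to every j at once, with x0 = 1 for each
example. The function names and the data set are hypothetical.

```python
def predict(theta, features):
    """h_theta(x) = sum over j of theta[j] * x[j], where features[0] == 1."""
    return sum(t * f for t, f in zip(theta, features))


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta[j] over the whole training set."""
    m = len(X)
    errors = [predict(theta, row) - target for row, target in zip(X, y)]
    return [
        theta_j - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
        for j, theta_j in enumerate(theta)
    ]


# Hypothetical data from y = 1 + 2*x1 + 3*x2; each row starts with x0 = 1.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta = [0.0, 0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)
print(theta)  # close to [1, 2, 3]
```

With a single feature this reduces to the two-equation update used for one-variable regression.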
2) Gradient Descent in Practice I - Feature Scaling
3) Gradient Descent in Practice II - Learning Rate
Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now
plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever
increases, then you probably need to decrease α.
Automatic convergence test. Declare convergence if J(θ) decreases by less than ε in
one iteration, where ε is some small value such as 10^{−3}. However, in practice it's
difficult to choose this threshold value.
It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on
every iteration.
To summarize:
If α is too small: slow convergence.
If α is too large: may not decrease on every iteration and thus may not converge.
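Both checks can be sketched together: track J(θ) after every iteration, stop with a warning as
soon as it increases, and declare convergence once the decrease falls below ε. The data set, the
three candidate values of α and the status strings are hypothetical, and the squared-error cost for
one-variable linear regression is assumed.

```python
def cost(theta0, theta1, xs, ys):
    """Assumed squared-error cost for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def run(xs, ys, alpha, epsilon=1e-3, max_iterations=1000):
    """Return (status, cost_history) for batch gradient descent."""
    m = len(xs)
    theta0 = theta1 = 0.0
    history = [cost(theta0, theta1, xs, ys)]
    for _ in range(max_iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        history.append(cost(theta0, theta1, xs, ys))
        if history[-1] > history[-2]:
            # J went up: the step overshot the minimum.
            return "diverging: decrease alpha", history
        if history[-2] - history[-1] < epsilon:
            # J decreased by less than epsilon in one iteration.
            return "converged", history
    return "iteration limit reached", history


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
for alpha in (0.01, 0.1, 1.0):
    status, history = run(xs, ys, alpha)
    print(alpha, status, "after", len(history) - 1, "iterations")
```

Plotting `history` against the iteration index gives the debugging plot described above.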
Features and Polynomial Regression

Normal Equation
LOGISTIC REGRESSION (CLASSIFICATION)

Hypothesis Representation
Decision Boundary
Cost Function
Simplified Cost Function and Gradient Descent (IMP)
Advanced Optimization(COMPLEX TOPIC)
Multiclass Classification: One-vs-all
The following image shows how one could classify 3 classes:

REGULARIZATION
The Problem of Overfitting
The terminology of overfitting applies to both linear and logistic regression. There are two main options to
address the issue of overfitting:
1) Reduce the number of features:
• Manually select which features to keep.
• Use a model selection algorithm (studied later in the course).
2) Regularization
• Keep all the features, but reduce the magnitude of parameters θj.
• Regularization works well when we have a lot of slightly useful features.

Cost Function
Using the above cost function with the extra summation, we can smooth the output of our
hypothesis function to reduce overfitting. If λ is chosen to be too large, it may smooth out
the function too much and cause underfitting. Conversely, what would happen if λ = 0 or is too small?
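A minimal sketch of a regularized cost for linear regression, assuming the form
J(θ) = 1/(2m) · [Σ (hθ(x(i)) − y(i))² + λ · Σ θj²], where the penalty runs over j ≥ 1 so that the
bias parameter θ0 is not penalized. The formula and the sample data are assumptions made for
illustration.

```python
def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty on every parameter except theta[0]."""
    m = len(X)
    squared_errors = sum(
        (sum(t * f for t, f in zip(theta, row)) - target) ** 2
        for row, target in zip(X, y)
    )
    # Assumed convention: the bias parameter theta[0] is left out of the penalty.
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return (squared_errors + penalty) / (2 * m)


# Hypothetical data on the line y = x; each row starts with x0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
theta = [0.0, 1.0]  # fits the data exactly

print(regularized_cost(theta, X, y, lam=0.0))  # 0.0, no penalty at all
print(regularized_cost(theta, X, y, lam=6.0))  # 1.0, the penalty alone
```

With λ = 0 the penalty vanishes and nothing discourages large parameters. As λ grows, the penalty
dominates the cost even for a perfect fit, which pushes the parameters towards 0.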
Regularized Linear Regression
Regularized Logistic Regression

We can regularize logistic regression in much the same way that we regularize linear regression. As a
result, we can avoid overfitting. The following image shows how the regularized function,
displayed by the pink line, is less likely to overfit than the non-regularized function represented by
the blue line:
NEURAL NETWORKS
Model Representation I
Example: If layer 1 has 2 input nodes and layer 2 has 4 activation nodes, the dimension of Θ(1) is
going to be 4×3, where s_j = 2 and s_{j+1} = 4, so s_{j+1} × (s_j + 1) = 4×3.
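The dimension rule can be written as a one-line helper. The function name is hypothetical; the
rule itself is the s_{j+1} × (s_j + 1) formula from the example.

```python
def theta_shape(units_in_layer_j, units_in_next_layer):
    """Shape (rows, columns) of the weight matrix mapping layer j to layer j+1."""
    # The "+ 1" column multiplies the bias unit that is added to layer j.
    return (units_in_next_layer, units_in_layer_j + 1)


print(theta_shape(2, 4))  # (4, 3)
```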
Model Representation II
Examples and Intuitions I
Where g(z) is the following:
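A minimal sketch, assuming g is the sigmoid (logistic) function g(z) = 1 / (1 + e^(−z)). The
single-unit example uses the weights −30, 20, 20, which are an assumption chosen so that the unit
behaves like a logical AND of two binary inputs.

```python
import math


def g(z):
    """Sigmoid (logistic) activation: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_and(x1, x2):
    """Single unit with assumed weights [-30, 20, 20]; the bias input is 1."""
    return g(-30.0 + 20.0 * x1 + 20.0 * x2)


for x1 in (0, 1):
    for x2 in (0, 1):
        # The rounded output is 1 only when both inputs are 1.
        print(x1, x2, round(neuron_and(x1, x2)))
```

Because g(z) is close to 0 for z ≤ −10 and close to 1 for z ≥ 10, only the input (1, 1), which
gives z = 10, drives the output towards 1.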

Examples and Intuitions II
Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values.
Say we wanted to classify our data into one of four categories. We will use the following example
to see how this classification is done. This algorithm takes as input an image and classifies it
accordingly:
We can define our set of resulting classes as y:
Each y(i) represents a different image corresponding to either a car, pedestrian, truck, or
motorcycle. The inner layers each provide us with some new information, which leads to our final
hypothesis function. The setup looks like:
Our resulting hypothesis for one set of inputs may look like:
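As an illustration, the label vectors and the reading of a hypothesis output can be sketched as
below. The encoding order car, pedestrian, truck, motorcycle and the sample output values are
assumptions.

```python
# Assumed ordering of the four output units.
CLASSES = ["car", "pedestrian", "truck", "motorcycle"]


def one_hot(label):
    """Label vector y with a 1 in the position of the given class."""
    return [1 if name == label else 0 for name in CLASSES]


def predicted_class(h):
    """Pick the class whose output unit has the largest activation."""
    return CLASSES[max(range(len(h)), key=lambda k: h[k])]


print(one_hot("truck"))                            # [0, 0, 1, 0]
print(predicted_class([0.05, 0.10, 0.80, 0.05]))   # truck
```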
Cost Function
We have added a few nested summations to account for our multiple output nodes. In the first
part of the equation, before the square brackets, we have an additional nested summation that
loops through the number of output nodes.
In the regularization part, after the square brackets, we must account for multiple theta matrices.
The number of columns in our current theta matrix is equal to the number of nodes in our current
layer (including the bias unit). The number of rows in our current theta matrix is equal to the
number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we
square every term.
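The structure of this cost can be sketched in plain Python. The sketch assumes the logistic
(cross-entropy) cost summed over every output unit, plus λ/(2m) times the sum of squared weights,
with the bias column of each matrix left out of the penalty. The argument layout and the function
name are hypothetical.

```python
import math


def nn_cost(outputs, targets, thetas, lam):
    """Regularized cost of a network with several output units.

    outputs[i][k]: activation of output unit k for training example i.
    targets[i][k]: 0/1 label for the same unit.
    thetas: list of weight matrices (lists of rows); column 0 holds bias weights.
    """
    m = len(outputs)
    # Nested sums over examples and output units: one logistic cost per unit.
    data_term = 0.0
    for h, y in zip(outputs, targets):
        for h_k, y_k in zip(h, y):
            data_term += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    # Sums over matrices, rows and columns, skipping each bias column (assumed).
    reg_term = sum(w ** 2 for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam * reg_term / (2 * m)


# One example, two output units, one 1x3 weight matrix (all values hypothetical).
print(nn_cost([[0.5, 0.5]], [[1, 0]], [[[9.0, 1.0, 2.0]]], lam=2.0))
```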
Note:
• the double sum simply adds up the logistic regression costs calculated for each cell
in the output layer
• the triple sum simply adds up the squares of all the individual Θs in the entire
network.
• the i in the triple sum does not refer to training example i

Backpropagation Algorithm
Backpropagation Intuition
Implementation Note: Unrolling Parameters
To summarize:
Gradient Checking
We previously saw how to calculate the deltaVector. So once we compute our gradApprox
vector, we can check that gradApprox ≈ deltaVector.
Once you have verified that your backpropagation algorithm is correct, you don't need to
compute gradApprox again. The code to compute gradApprox can be very slow.
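A minimal sketch of the check, assuming gradApprox is computed with a two-sided finite difference,
(J(θ + ε) − J(θ − ε)) / (2ε), one parameter at a time. The cost function here is a hypothetical
stand-in with a known gradient; in practice J would be the network's cost and the analytic
gradient would come from backpropagation.

```python
def grad_approx(J, theta, epsilon=1e-4):
    """Two-sided finite-difference estimate of the gradient of J at theta."""
    approx = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += epsilon
        minus[i] -= epsilon
        approx.append((J(plus) - J(minus)) / (2 * epsilon))
    return approx


def J(theta):
    """Hypothetical cost with a known gradient, standing in for J(Theta)."""
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]


def analytic_gradient(theta):
    """Exact gradient of the stand-in cost; plays the role of deltaVector."""
    return [2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]]


theta = [1.0, 2.0]
print(grad_approx(J, theta))     # approximately [8.0, 3.0]
print(analytic_gradient(theta))  # [8.0, 3.0]
```

Each entry of gradApprox needs two evaluations of J, which is why the check is slow for a network
with many weights.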
Random Initialization
Initializing all theta weights to zero does not work with neural networks. When we backpropagate,
all nodes will update to the same value repeatedly. Instead we can randomly initialize our
weights for our Θ matrices using the following method:
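One common approach, sketched here as an assumption rather than as the exact method from the
course, is to draw every weight uniformly from a small symmetric interval around 0. The name
`init_epsilon` and its default value 0.12 are illustrative choices.

```python
import random


def random_init(rows, cols, init_epsilon=0.12, seed=None):
    """Matrix of weights drawn uniformly from [-init_epsilon, init_epsilon)."""
    rng = random.Random(seed)
    return [
        # rng.random() is in [0, 1); rescale it to the symmetric interval.
        [rng.random() * 2 * init_epsilon - init_epsilon for _ in range(cols)]
        for _ in range(rows)
    ]


# Weight matrix for a layer with 2 inputs (plus bias) feeding 4 units.
theta1 = random_init(4, 3, seed=0)
for row in theta1:
    print(row)
```

Because the weights differ from one another, the hidden units no longer compute identical values
and receive different updates.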
Putting it Together
First, pick a network architecture; choose the layout of your neural network, including how many
hidden units in each layer and how many layers in total you want to have.
• Number of input units = dimension of features x(i)
• Number of output units = number of classes
• Number of hidden units per layer = usually the more the better (must balance with the cost
of computation, which increases with more hidden units)
• Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is
recommended that you have the same number of units in every hidden layer.
Training a Neural Network
1. Randomly initialize the weights
2. Implement forward propagation to get hΘ(x(i)) for any x(i)
3. Implement the cost function
4. Implement backpropagation to compute partial derivatives
5. Use gradient checking to confirm that your backpropagation works. Then disable
gradient checking.
6. Use gradient descent or a built-in optimization function to minimize the cost
function with the weights in theta.
When we perform forward and back propagation, we loop on every training example:
The following image gives us an intuition of what is happening as we are implementing our
neural network:
Ideally, you want hΘ(x(i)) ≈ y(i). This will minimize our cost function. However, keep in mind that
J(Θ) is not convex and thus we can end up in a local minimum instead.
