
Machine Learning

Supervised learning problems are categorized into regression and classification problems. Regression predicts continuous outputs while classification predicts discrete categories. Gradient descent is used to estimate the parameters of a hypothesis function by minimizing a cost function. It works by taking steps in the direction of steepest descent as indicated by the derivative of the cost function. Multiple features can be accommodated by repeating the gradient descent equation for each feature.


Supervised Learning: Model for Prediction

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Example 1:

Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.

We could turn this example into a classification problem by instead making our output about whether the house "sells for more or less than the asking price." Here we are classifying the houses based on price into two discrete categories.
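To make the regression/classification distinction concrete, here is a minimal Python sketch with made-up house data: the continuous target is the sale price, and the discrete target is whether the house sold above its asking price. The numbers and the least-squares fit are purely illustrative.

```python
import numpy as np

# Hypothetical house sizes (square feet), sale prices, and asking prices.
sizes = np.array([1400.0, 1600.0, 1700.0, 1875.0, 2350.0])
prices = np.array([245000.0, 312000.0, 279000.0, 308000.0, 405000.0])
asking = np.array([250000.0, 300000.0, 285000.0, 300000.0, 400000.0])

# Regression target: a continuous value (the price itself).
X = np.column_stack([np.ones_like(sizes), sizes])    # add an intercept column
theta, *_ = np.linalg.lstsq(X, prices, rcond=None)   # least-squares line fit
print("predicted price for 2000 sq ft:", theta[0] + theta[1] * 2000)

# Classification target: a discrete label (sold above asking price or not).
labels = (prices > asking).astype(int)               # 1 = above, 0 = at/below
print("discrete labels:", labels)
```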

Example 2:
(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture.

(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
Unsupervised Learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.
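As a hedged illustration of clustering like the gene grouping above, here is a small k-means sketch on synthetic data. The three numeric "variables" per gene, the cluster count, and the plain Lloyd's-algorithm implementation are all arbitrary choices for illustration, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the gene example: 300 "genes", each described by three
# numeric variables (e.g. lifespan, location, role scores); synthetic data.
genes = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 3)),
])

def kmeans(data, k, iters=50):
    """Plain Lloyd's algorithm: assign points to nearest centroid, recompute."""
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        new_centroids = []
        for j in range(k):
            members = data[assignment == j]
            # Keep the old centroid if a cluster happens to be empty.
            new_centroids.append(members.mean(axis=0) if len(members) else centroids[j])
        centroids = np.array(new_centroids)
    return assignment, centroids

labels, centers = kmeans(genes, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
```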

Non-clustering: The "Cocktail Party Algorithm" allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).

Hypothesis function and Cost function

There are 2 functions:
 Hypothesis function
 Cost function

We'll define something called the cost function; this will let us figure out how to fit the best possible straight line to our data.
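A minimal sketch of these two functions in Python, assuming the straight-line hypothesis h(x) = θ0 + θ1·x and the usual squared-error cost J(θ0, θ1) = 1/(2m) Σ(h(x) − y)² used with it; the tiny dataset is made up for illustration.

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(x)
    errors = hypothesis(theta0, theta1, x) - y
    return (errors ** 2).sum() / (2 * m)

# Tiny made-up dataset to exercise the two functions.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
print(cost(0.0, 2.0, x, y))   # cost of the line h(x) = 2x on this data
```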
Now, for the purpose of illustration, instead of 3D surfaces we will show the cost function J with contour plots, also called contour figures.

A contour plot is a graph that contains many contour lines. A contour line of a two-variable function has a constant value at all points of the same line.

Taking any color and going along the 'circle', one would expect to get the same value of the cost function. For example, three green points found on the same green contour line have the same value for J(θ0, θ1) and, as a result, they are found along the same line. In the lecture's plot, the circled x marks the value of the cost function when θ0 = 800 and θ1 = -0.15. Taking another h(x) and plotting its contour plot: when θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) gets closer to the centre, thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

The hypothesis that minimizes the cost function as much as possible ends up with θ1 and θ0 around 0.12 and 250 respectively. Plotting those values on the contour plot puts the point in the centre of the innermost 'circle'.
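The contour figure can be reproduced on synthetic data. The sketch below invents a dataset whose best fit is roughly θ0 = 250 and θ1 = 0.12, to echo the numbers above, evaluates J on a grid of (θ0, θ1) values, and draws the contour lines with matplotlib; none of the data comes from the original notes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data standing in for the housing example in the text.
rng = np.random.default_rng(1)
x = rng.uniform(1000, 3000, size=50)              # e.g. house sizes
y = 250 + 0.12 * x + rng.normal(0, 30, size=50)   # roughly theta0=250, theta1=0.12

def cost(theta0, theta1):
    errors = theta0 + theta1 * x - y
    return (errors ** 2).sum() / (2 * len(x))

# Evaluate J on a grid of (theta0, theta1) values and draw contour lines.
t0 = np.linspace(-200, 800, 200)
t1 = np.linspace(-0.3, 0.5, 200)
T0, T1 = np.meshgrid(t0, t1)
J = np.vectorize(cost)(T0, T1)

plt.contour(T0, T1, J, levels=30)
plt.xlabel("theta0")
plt.ylabel("theta1")
plt.title("Contour lines of J(theta0, theta1)")
plt.show()
```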
Gradient Descent

So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.

Imagine that we graph our hypothesis function based on its parameters θ0 and θ1 (actually we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. In the lecture, a 3D surface plot depicts such a setup.

In this video we explored the scenario where we used one parameter θ1 and plotted its cost function to implement a gradient descent. Our formula for a single parameter was:

repeat until convergence: θ1 := θ1 - α · d/dθ1 J(θ1)

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. (In the lecture's surface plot, red arrows mark the minimum points.)

On a side note, we should adjust our parameter α to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too much time to obtain the minimum value, implies that our step size is wrong.

The way we do this is by taking the derivative of our cost function (the slope of the tangent line to the function at a point). The slope of the tangent is the derivative at that point, and it gives us a direction to move towards. We make steps down the cost function in the direction of the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.
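A sketch of that procedure for the straight-line hypothesis, assuming the squared-error cost: each iteration moves θ0 and θ1 against their partial derivatives, scaled by α. The dataset and the chosen α are invented for illustration.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.

    Each iteration takes a step of size alpha in the direction of steepest
    descent, i.e. against the partial derivatives of the squared-error cost.
    """
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        errors = theta0 + theta1 * x - y
        grad0 = errors.sum() / m          # dJ/dtheta0
        grad1 = (errors * x).sum() / m    # dJ/dtheta1
        theta0 -= alpha * grad0           # simultaneous update of both parameters
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
print(gradient_descent(x, y, alpha=0.05, iters=5000))  # roughly (1, 2)
```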

For example, the distance between each 'star' in the lecture's graph represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(θ0, θ1). Depending on where one starts on the graph, one could end up at different points. The lecture's illustration shows two different starting points that end up in two different places.
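Building on the same kind of sketch, here is a self-contained check of how the step size α affects convergence; the dataset and the three α values are arbitrary choices, not from the original notes.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
m = len(x)

def run(alpha, iters=100):
    """Run gradient descent and return the final cost J for a given step size."""
    theta0 = theta1 = 0.0
    for _ in range(iters):
        errors = theta0 + theta1 * x - y
        theta0 -= alpha * errors.sum() / m
        theta1 -= alpha * (errors * x).sum() / m
    return ((theta0 + theta1 * x - y) ** 2).sum() / (2 * m)

for alpha in (0.001, 0.05, 0.5):
    print(f"alpha={alpha}: final cost {run(alpha):.4g}")
# Too small an alpha converges slowly; too large an alpha overshoots and diverges.
```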

The gradient descent algorithm is:

repeat until convergence: θj := θj - α · ∂/∂θj J(θ0, θ1)    (for j = 0 and j = 1, updated simultaneously)

Multiple Features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

hθ(x) = θ0 + θ1·x1 + θ2·x2 + ... + θn·xn

The gradient descent method is used to reduce the cost function. The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:

repeat until convergence: θj := θj - α · (1/m) · Σi=1..m (hθ(x(i)) - y(i)) · xj(i)    (simultaneously for j = 0, 1, ..., n, with x0(i) = 1)

More tips on gradient descent

The method to make gradient descent faster: this is done by feature scaling & mean normalization.
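A sketch of mean normalization, feature scaling, and the vectorized multi-feature update described above; the two features, their scales, and the learning rate are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with n = 2 features on very different scales
# (e.g. house size in square feet and number of bedrooms).
size = rng.uniform(800, 3500, size=100)
rooms = rng.integers(1, 6, size=100).astype(float)
y = 50.0 + 0.1 * size + 20.0 * rooms + rng.normal(0, 5, size=100)

# Mean normalization and feature scaling: (x - mean) / std for each feature.
features = np.column_stack([size, rooms])
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Add the intercept column x0 = 1 and run the same update for every theta_j.
X = np.column_stack([np.ones(len(y)), scaled])
theta = np.zeros(X.shape[1])
alpha, m = 0.1, len(y)
for _ in range(2000):
    gradient = X.T @ (X @ theta - y) / m   # vector of dJ/dtheta_j for all j
    theta -= alpha * gradient
print("theta after scaling:", theta)
```

With the features scaled to comparable ranges, a single learning rate works well for every θj, which is the point of the speed-up.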

Features and polynomial regression

Polynomial regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behaviour or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
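A sketch of polynomial regression by treating powers of x as extra features of an otherwise linear hypothesis; the data is synthetic and the least-squares solver stands in for whatever fitting method is used.

```python
import numpy as np

# Synthetic data with a clearly non-linear trend.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=60)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 0.3, size=60)

# Polynomial regression: treat x, x^2, x^3 as separate features of a
# linear hypothesis h(x) = theta0 + theta1*x + theta2*x^2 + theta3*x^3.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted coefficients:", theta)   # compare with 1, 0.5, 0.3 (no cubic term)
```

If gradient descent is used instead of a direct solve, feature scaling matters even more here, because x, x² and x³ have very different ranges.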
Gradient descent vs normal equation

If the number of features is less than 1000, use the normal equation, but if the number of features is greater than 1000, use gradient descent.
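For the small-feature case, the normal equation θ = (XᵀX)⁻¹ Xᵀ y solves for the parameters in one step, with no learning rate and no iterations. A minimal sketch on a synthetic design matrix (the sizes and true coefficients are made up):

```python
import numpy as np

# Small synthetic design matrix (with an intercept column) and targets.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = X @ np.array([2.0, 1.0, -3.0, 0.5]) + rng.normal(0, 0.1, size=20)

# Normal equation: theta = (X^T X)^(-1) X^T y, computed directly.
# Using the pseudo-inverse keeps this robust even if X^T X is singular.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # close to [2, 1, -3, 0.5]
```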

pinv finds the pseudo-inverse, whereas inv finds only the inverse.

The pseudo-inverse gives a matrix even if the matrix on which it is applied is non-invertible. The same cannot be said for inv: we can't find the inverse of a non-invertible matrix.
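The text refers to pinv and inv as in Octave; numpy's np.linalg.pinv and np.linalg.inv behave the same way for this comparison, so the sketch below uses them on a deliberately singular matrix.

```python
import numpy as np

# A non-invertible (singular) matrix: the second row is twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(np.linalg.pinv(A))       # pseudo-inverse: always returns a matrix

try:
    print(np.linalg.inv(A))    # ordinary inverse: fails for a singular matrix
except np.linalg.LinAlgError as err:
    print("inv failed:", err)
```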
Things to keep in mind

 Gradient descent algorithm
