
Machine Learning

Supervised learning problems are categorized into regression and classification problems. Regression predicts continuous outputs while classification predicts discrete categories. Gradient descent is used to estimate the parameters of a hypothesis function by minimizing a cost function. It works by taking steps in the direction of steepest descent as indicated by the derivative of the cost function. Multiple features can be accommodated by repeating the gradient descent equation for each feature.


Supervised Learning: Model for Prediction

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Example 1:

Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.

We could turn this example into a classification problem by instead making our output about whether the house "sells for more or less than the asking price." Here we are classifying the houses based on price into two discrete categories.
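To make the regression/classification distinction concrete, here is a minimal Python sketch with made-up house data: the continuous target is the sale price, and the discrete target is whether the house sold above its asking price. The numbers and the least-squares fit are purely illustrative.

```python
import numpy as np

# Hypothetical house sizes (square feet), sale prices, and asking prices.
sizes = np.array([1400.0, 1600.0, 1700.0, 1875.0, 2350.0])
prices = np.array([245000.0, 312000.0, 279000.0, 308000.0, 405000.0])
asking = np.array([250000.0, 300000.0, 285000.0, 300000.0, 400000.0])

# Regression target: a continuous value (the price itself).
X = np.column_stack([np.ones_like(sizes), sizes])    # add an intercept column
theta, *_ = np.linalg.lstsq(X, prices, rcond=None)   # least-squares line fit
print("predicted price for 2000 sq ft:", theta[0] + theta[1] * 2000)

# Classification target: a discrete label (sold above asking price or not).
labels = (prices > asking).astype(int)               # 1 = above, 0 = at/below
print("discrete labels:", labels)
```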

Example 2:
(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture.

(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
Unsupervised Learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.
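As a hedged illustration of clustering like the gene grouping above, here is a small k-means sketch on synthetic data. The three numeric "variables" per gene, the cluster count, and the plain Lloyd's-algorithm implementation are all arbitrary choices for illustration, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the gene example: 300 "genes", each described by three
# numeric variables (e.g. lifespan, location, role scores); synthetic data.
genes = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 3)),
])

def kmeans(data, k, iters=50):
    """Plain Lloyd's algorithm: assign points to nearest centroid, recompute."""
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        new_centroids = []
        for j in range(k):
            members = data[assignment == j]
            # Keep the old centroid if a cluster happens to be empty.
            new_centroids.append(members.mean(axis=0) if len(members) else centroids[j])
        centroids = np.array(new_centroids)
    return assignment, centroids

labels, centers = kmeans(genes, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
```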

Non-clustering: The "Cocktail Party Algorithm" allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).

Hypothesis function and Cost function

There are 2 functions:
 Hypothesis function
 Cost function

We'll define something called the cost function; this will let us figure out how to fit the best possible straight line to our data.
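A minimal sketch of these two functions in Python, assuming the straight-line hypothesis h(x) = θ0 + θ1·x and the usual squared-error cost J(θ0, θ1) = 1/(2m) Σ(h(x) − y)² used with it; the tiny dataset is made up for illustration.

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(x)
    errors = hypothesis(theta0, theta1, x) - y
    return (errors ** 2).sum() / (2 * m)

# Tiny made-up dataset to exercise the two functions.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
print(cost(0.0, 2.0, x, y))   # cost of the line h(x) = 2x on this data
```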
Now, for the purpose of illustration, instead of 3D surfaces we will show the cost function J with contour plots, also called contour figures.

A contour plot is a graph that contains many contour lines. A contour line of a two-variable function has a constant value at all points of the same line.

Taking any color and going along the 'circle', one would expect to get the same value of the cost function. For example, three green points found on the same green contour line have the same value for J(θ0, θ1) and, as a result, they are found along the same line. In the lecture's plot, the circled x marks the value of the cost function when θ0 = 800 and θ1 = -0.15. Taking another h(x) and plotting its contour plot: when θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) gets closer to the centre, thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

The hypothesis that minimizes the cost function as much as possible ends up with θ1 and θ0 around 0.12 and 250 respectively. Plotting those values on the contour plot puts the point in the centre of the innermost 'circle'.
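The contour figure can be reproduced on synthetic data. The sketch below invents a dataset whose best fit is roughly θ0 = 250 and θ1 = 0.12, to echo the numbers above, evaluates J on a grid of (θ0, θ1) values, and draws the contour lines with matplotlib; none of the data comes from the original notes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data standing in for the housing example in the text.
rng = np.random.default_rng(1)
x = rng.uniform(1000, 3000, size=50)              # e.g. house sizes
y = 250 + 0.12 * x + rng.normal(0, 30, size=50)   # roughly theta0=250, theta1=0.12

def cost(theta0, theta1):
    errors = theta0 + theta1 * x - y
    return (errors ** 2).sum() / (2 * len(x))

# Evaluate J on a grid of (theta0, theta1) values and draw contour lines.
t0 = np.linspace(-200, 800, 200)
t1 = np.linspace(-0.3, 0.5, 200)
T0, T1 = np.meshgrid(t0, t1)
J = np.vectorize(cost)(T0, T1)

plt.contour(T0, T1, J, levels=30)
plt.xlabel("theta0")
plt.ylabel("theta1")
plt.title("Contour lines of J(theta0, theta1)")
plt.show()
```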
Gradient Descent

So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.

Imagine that we graph our hypothesis function based on its parameters θ0 and θ1 (actually we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. In the lecture, a 3D surface plot depicts such a setup.

In this video we explored the scenario where we used one parameter θ1 and plotted its cost function to implement a gradient descent. Our formula for a single parameter was:

repeat until convergence: θ1 := θ1 - α · d/dθ1 J(θ1)

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. (In the lecture's surface plot, red arrows mark the minimum points.)

On a side note, we should adjust our parameter α to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too much time to obtain the minimum value, implies that our step size is wrong.

The way we do this is by taking the derivative of our cost function (the slope of the tangent line to the function at a point). The slope of the tangent is the derivative at that point, and it gives us a direction to move towards. We make steps down the cost function in the direction of the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.
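A sketch of that procedure for the straight-line hypothesis, assuming the squared-error cost: each iteration moves θ0 and θ1 against their partial derivatives, scaled by α. The dataset and the chosen α are invented for illustration.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.

    Each iteration takes a step of size alpha in the direction of steepest
    descent, i.e. against the partial derivatives of the squared-error cost.
    """
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        errors = theta0 + theta1 * x - y
        grad0 = errors.sum() / m          # dJ/dtheta0
        grad1 = (errors * x).sum() / m    # dJ/dtheta1
        theta0 -= alpha * grad0           # simultaneous update of both parameters
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
print(gradient_descent(x, y, alpha=0.05, iters=5000))  # roughly (1, 2)
```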

For example, the distance between each 'star' in the lecture's graph represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(θ0, θ1). Depending on where one starts on the graph, one could end up at different points. The lecture's illustration shows two different starting points that end up in two different places.
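Building on the same kind of sketch, here is a self-contained check of how the step size α affects convergence; the dataset and the three α values are arbitrary choices, not from the original notes.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
m = len(x)

def run(alpha, iters=100):
    """Run gradient descent and return the final cost J for a given step size."""
    theta0 = theta1 = 0.0
    for _ in range(iters):
        errors = theta0 + theta1 * x - y
        theta0 -= alpha * errors.sum() / m
        theta1 -= alpha * (errors * x).sum() / m
    return ((theta0 + theta1 * x - y) ** 2).sum() / (2 * m)

for alpha in (0.001, 0.05, 0.5):
    print(f"alpha={alpha}: final cost {run(alpha):.4g}")
# Too small an alpha converges slowly; too large an alpha overshoots and diverges.
```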

The gradient descent algorithm is:

repeat until convergence: θj := θj - α · ∂/∂θj J(θ0, θ1)    (for j = 0 and j = 1, updated simultaneously)

Multiple Features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

hθ(x) = θ0 + θ1·x1 + θ2·x2 + ... + θn·xn

The gradient descent method is used to reduce the cost function. The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features:

repeat until convergence: θj := θj - α · (1/m) · Σi=1..m (hθ(x(i)) - y(i)) · xj(i)    (simultaneously for j = 0, 1, ..., n, with x0(i) = 1)

More tips on gradient descent

The method to make gradient descent faster: this is done by feature scaling & mean normalization.
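A sketch of mean normalization, feature scaling, and the vectorized multi-feature update described above; the two features, their scales, and the learning rate are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with n = 2 features on very different scales
# (e.g. house size in square feet and number of bedrooms).
size = rng.uniform(800, 3500, size=100)
rooms = rng.integers(1, 6, size=100).astype(float)
y = 50.0 + 0.1 * size + 20.0 * rooms + rng.normal(0, 5, size=100)

# Mean normalization and feature scaling: (x - mean) / std for each feature.
features = np.column_stack([size, rooms])
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Add the intercept column x0 = 1 and run the same update for every theta_j.
X = np.column_stack([np.ones(len(y)), scaled])
theta = np.zeros(X.shape[1])
alpha, m = 0.1, len(y)
for _ in range(2000):
    gradient = X.T @ (X @ theta - y) / m   # vector of dJ/dtheta_j for all j
    theta -= alpha * gradient
print("theta after scaling:", theta)
```

With the features scaled to comparable ranges, a single learning rate works well for every θj, which is the point of the speed-up.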

Features and polynomial regression

Polynomial regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behaviour or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
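A sketch of polynomial regression by treating powers of x as extra features of an otherwise linear hypothesis; the data is synthetic and the least-squares solver stands in for whatever fitting method is used.

```python
import numpy as np

# Synthetic data with a clearly non-linear trend.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=60)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 0.3, size=60)

# Polynomial regression: treat x, x^2, x^3 as separate features of a
# linear hypothesis h(x) = theta0 + theta1*x + theta2*x^2 + theta3*x^3.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted coefficients:", theta)   # compare with 1, 0.5, 0.3 (no cubic term)
```

If gradient descent is used instead of a direct solve, feature scaling matters even more here, because x, x² and x³ have very different ranges.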
Gradient descent vs normal equation

If the number of features is less than 1000, use the normal equation, but if the number of features is greater than 1000, use gradient descent.
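For the small-feature case, the normal equation θ = (XᵀX)⁻¹ Xᵀ y solves for the parameters in one step, with no learning rate and no iterations. A minimal sketch on a synthetic design matrix (the sizes and true coefficients are made up):

```python
import numpy as np

# Small synthetic design matrix (with an intercept column) and targets.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = X @ np.array([2.0, 1.0, -3.0, 0.5]) + rng.normal(0, 0.1, size=20)

# Normal equation: theta = (X^T X)^(-1) X^T y, computed directly.
# Using the pseudo-inverse keeps this robust even if X^T X is singular.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)   # close to [2, 1, -3, 0.5]
```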

pinv finds the pseudo-inverse, whereas inv finds only the inverse.

The pseudo-inverse gives a matrix even if the matrix on which it is applied is non-invertible. The same cannot be said for inv: we can't find the inverse of a non-invertible matrix.
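The text refers to pinv and inv as in Octave; numpy's np.linalg.pinv and np.linalg.inv behave the same way for this comparison, so the sketch below uses them on a deliberately singular matrix.

```python
import numpy as np

# A non-invertible (singular) matrix: the second row is twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(np.linalg.pinv(A))       # pseudo-inverse: always returns a matrix

try:
    print(np.linalg.inv(A))    # ordinary inverse: fails for a singular matrix
except np.linalg.LinAlgError as err:
    print("inv failed:", err)
```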
Things to keep in mind

 Gradient descent algorithm
