Jia arshad

MACHINE LEARNING
Machine learning grew out of work in AI. It develops new capabilities for computers.

Examples:

 Database mining
 Applications that cannot be programmed by hand
 Self-customizing programs
 Understanding human learning

What is Machine Learning?

Arthur Samuel (1959) defined it as the field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell (1998) gives a more formal definition: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Example: playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

The two main types of machine learning algorithms are:


 supervised learning
 unsupervised learning

Supervised Learning:
In supervised learning, we are given a data set and already know what our correct output should
look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into "regression" and "classification" problems. In a
regression problem, we are trying to predict results within a continuous output, meaning that we are
trying to map input variables to some continuous function. In a classification problem, we are
instead trying to predict results in a discrete output. In other words, we are trying to map input
variables into discrete categories.

Example 1:

Given data about the size of houses on the real estate market, try to predict their price. Price as a
function of size is a continuous output, so this is a regression problem.

We could turn this example into a classification problem by instead making our output about
whether the house "sells for more or less than the asking price." Here we are classifying the houses
based on price into two discrete categories.

Example 2:

(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given
picture

(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant
or benign.

Types of supervised learning: regression and classification.

Unsupervised Learning:


Unsupervised learning allows us to approach problems with little or no idea what our results
should look like. We can derive structure from data where we don't necessarily know the effect of
the variables.

We can derive this structure by clustering the data based on relationships among the variables in
the data.

With unsupervised learning there is no feedback based on the prediction results.

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group
these genes into groups that are somehow similar or related by different variables, such as
lifespan, location, roles, and so on.

Non-clustering: The "Cocktail Party Algorithm", allows you to find structure in a chaotic
environment. (i.e. identifying individual voices and music from a mesh of sounds at a cocktail
party).

Types of unsupervised learning: clustering and non-clustering.

MODEL REPRESENTATION:


To establish notation for future use, we'll use x^{(i)} to denote the "input" variables (living area in this example), also called input features, and y^{(i)} to denote the "output" or target variable that we are trying to predict (price). A pair (x^{(i)}, y^{(i)}) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^{(i)}, y^{(i)}); i = 1, ..., m}, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y to denote the space of output values. In this example, X = Y = ℝ.

To describe the supervised learning problem slightly more formally, our goal is, given a training
set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of
y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is
therefore like this:

When the target variable that we’re trying to predict is continuous, such as in our housing
example, we call the learning problem a regression problem. When y can take on only a
small number of discrete values (such as if, given the living area, we wanted to predict if a
dwelling is a house or an apartment, say), we call it a classification problem.


COST FUNCTION:

The cost function is also known as the squared error function.


We can measure the accuracy of our hypothesis function by using a cost function. This takes
an average difference (actually a fancier version of an average) of all the results of the
hypothesis with inputs from x's and the actual output y's.
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2 = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2

To break it apart, it is \frac{1}{2} \bar{x}, where \bar{x} is the mean of the squares of h_\theta(x_i) - y_i, or the difference between the predicted value and the actual value.

This function is otherwise called the "squared error function", or "mean squared error". The mean is halved (\frac{1}{2}) as a convenience for the computation of gradient descent, as the derivative term of the square function will cancel out the \frac{1}{2} term.
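As an illustration, here is a minimal Octave sketch of this cost function for the one-variable hypothesis h_\theta(x) = \theta_0 + \theta_1 x. The function name computeCost and the variable names are illustrative assumptions, not part of the original notes.

function J = computeCost(x, y, theta)
  % x and y are m x 1 vectors of inputs and outputs; theta = [theta0; theta1].
  m = length(y);                          % number of training examples
  predictions = theta(1) + theta(2) * x;  % h_theta(x) for every example
  J = (1 / (2 * m)) * sum((predictions - y) .^ 2);  % halved mean of squared errors
end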

COST FUNCTION INTUITION:


If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by h_\theta(x)) which passes through these scattered data points.

Our objective is to get the best possible line. The best possible line will be such that the average squared vertical distances of the scattered points from the line will be the least. Ideally, the line should pass through all the points of our training data set. In such a case, the value of J(\theta_0, \theta_1) will be 0. The following example shows the ideal situation where we have a cost function of 0.

When \theta_1 = 1, we get a slope of 1 which goes through every single data point in our model. Conversely, when \theta_1 = 0.5, we see the vertical distance from our fit to the data points increase.


This increases our cost function to 0.58. Plotting several other points yields the following graph:

Thus as a goal, we should try to minimize the cost function. In this case, \theta_1 = 1 is our global minimum.

Cost Function - Intuition II


A contour plot is a graph that contains many contour lines. A contour line of a two variable
function has a constant value at all points of the same line. An example of such a graph is the
one to the right below.


Taking any color and going along the 'circle', one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for J(\theta_0, \theta_1) and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when \theta_0 = 800 and \theta_1 = -0.15. Taking another h(x) and plotting its contour plot, one gets the following graphs:

When \theta_0 = 360 and \theta_1 = 0, the value of J(\theta_0, \theta_1) in the contour plot gets closer to the center, thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.


The graph above minimizes the cost function as much as possible and consequently, the results for \theta_1 and \theta_0 tend to be around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to put our point in the center of the inner most 'circle'.

Gradient Descent
So we have our hypothesis function and we have a way of measuring how well it fits into the
data. Now we need to estimate the parameters in the hypothesis function. That's where gradient
descent comes in.

Imagine that we graph our hypothesis function based on its fields \theta_0 and \theta_1 (actually we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put \theta_0 on the x axis and \theta_1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.


We will know that we have succeeded when our cost function is at the very bottom of the pits in
our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the
graph.

The way we do this is by taking the derivative (the tangential line to a function) of our cost
function. The slope of the tangent is the derivative at that point and it will give us a direction to
move towards. We make steps down the cost function in the direction with the steepest descent.
The size of each step is determined by the parameter α, which is called the learning rate.

For example, the distance between each 'star' in the graph above represents a step determined
by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger
step. The direction in which the step is taken is determined by the partial derivative of
J(\theta_0, \theta_1). Depending on where one starts on the graph, one could end up at
different points. The image above shows us two different starting points that end up in two
different places.

The gradient descent algorithm is:

repeat until convergence:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

where

j = 0, 1 represents the feature index number.


At each iteration j, one should simultaneously update the parameters \theta_1, \theta_2, ..., \theta_n. Updating a specific parameter prior to calculating another one on the j^{th} iteration would yield a wrong implementation.

Gradient Descent Intuition


In this video we explored the scenario where we used one parameter \theta_1 and plotted its cost function to implement a gradient descent. Our formula for a single parameter was:

Repeat until convergence:

\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)


Regardless of the slope's sign for \frac{d}{d\theta_1} J(\theta_1), \theta_1 eventually converges to its minimum value. The following graph shows that when the slope is negative, the value of \theta_1 increases, and when it is positive, the value of \theta_1 decreases.


On a side note, we should adjust our parameter \alpha to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or too much time to obtain the minimum value, implies that our step size is wrong.


How does gradient descent converge with a fixed step size \alpha?

The intuition behind the convergence is that \frac{d}{d\theta_1} J(\theta_1) approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get:

\theta_1 := \theta_1 - \alpha \cdot 0

Gradient Descent For Linear Regression


Note: [At 6:15 "h(x) = -900 - 0.1x" should be "h(x) = 900 - 0.1x"]

When specifically applied to the case of linear regression, a new form of the gradient descent
equation can be derived. We can substitute our actual cost function and our actual hypothesis
function and modify the equation to :

repeat until convergence: {

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} ((h_\theta(x_i) - y_i) x_i)

}

where m is the size of the training set, \theta_0 a constant that will be changing simultaneously with \theta_1, and x_i, y_i are values of the given training set (data).

Note that we have separated out the two cases for \theta_j into separate equations for \theta_0 and \theta_1; and that for \theta_1 we are multiplying x_i at the end due to the derivative. The following is a derivation of \frac{\partial}{\partial \theta_j} J(\theta) for a single example:

The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply
these gradient descent equations, our hypothesis will become more and more accurate.
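The following Octave sketch shows one way these two update rules could be implemented as batch gradient descent for the single-feature case. The names gradientDescent, alpha and num_iters are illustrative assumptions, not from the original notes.

function theta = gradientDescent(x, y, theta, alpha, num_iters)
  % Batch gradient descent for h_theta(x) = theta(1) + theta(2) * x.
  m = length(y);
  for iter = 1:num_iters
    predictions = theta(1) + theta(2) * x;           % current hypothesis on all examples
    % Simultaneous update: compute both gradients before changing theta.
    grad0 = (1 / m) * sum(predictions - y);
    grad1 = (1 / m) * sum((predictions - y) .* x);
    theta(1) = theta(1) - alpha * grad0;
    theta(2) = theta(2) - alpha * grad1;
  end
end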

So, this is simply gradient descent on the original cost function J. This method looks at every
example in the entire training set on every step, and is called batch gradient descent. Note that,
while gradient descent can be susceptible to local minima in general, the optimization problem
we have posed here for linear regression has only one global, and no other local, optima; thus
gradient descent always converges (assuming the learning rate α is not too large) to the global
minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it
is run to minimize a quadratic function.

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory
taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by
straight lines) mark the successive values of θ that gradient descent went through as it
converged to its minimum.

Matrices and Vectors


Matrices are 2-dimensional arrays:


\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \\ j & k & l \end{bmatrix}

The above matrix has four rows and three columns, so it is a 4 x 3 matrix.

A vector is a matrix with one column and many rows:

\begin{bmatrix} w \\ x \\ y \\ z \end{bmatrix}

So vectors are a subset of matrices. The above vector is a 4 x 1 matrix.

Notation and terms:

 A_{ij} refers to the element in the ith row and jth column of matrix A.
 A vector with 'n' rows is referred to as an 'n'-dimensional vector.
 v_i refers to the element in the ith row of the vector.
 In general, all our vectors and matrices will be 1-indexed. Note that for some programming languages, the arrays are 0-indexed.
 Matrices are usually denoted by uppercase names while vectors are lowercase.
 "Scalar" means that an object is a single value, not a vector or matrix.
 \mathbb{R} refers to the set of scalar real numbers.
 \mathbb{R}^n refers to the set of n-dimensional vectors of real numbers.

Addition and Scalar Multiplication


Addition and subtraction are element-wise, so you simply add or subtract each corresponding
element:

\begin{bmatrix} a & b \\ c & d \end{bmatrix} + \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a+w & b+x \\ c+y & d+z \end{bmatrix}

Subtracting Matrices:


\begin{bmatrix} a & b \\ c & d \end{bmatrix} - \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a-w & b-x \\ c-y & d-z \end{bmatrix}

To add or subtract two matrices, their dimensions must be the same.

In scalar multiplication, we simply multiply every element by the scalar value:

\begin{bmatrix} a & b \\ c & d \end{bmatrix} * x = \begin{bmatrix} a*x & b*x \\ c*x & d*x \end{bmatrix}

In scalar division, we simply divide every element by the scalar value:

\begin{bmatrix} a & b \\ c & d \end{bmatrix} / x = \begin{bmatrix} a/x & b/x \\ c/x & d/x \end{bmatrix}

Experiment with the Octave/Matlab commands for matrix addition and scalar multiplication. Feel free to try out different commands. Try to write out your answers for each command before running the code below.
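For instance, a small Octave session along these lines might look like the following (the particular matrices and scalar are made-up examples):

A = [1 2; 3 4];
B = [5 6; 7 8];
s = 2;

add_AB  = A + B   % element-wise addition:        [6 8; 10 12]
sub_AB  = A - B   % element-wise subtraction:     [-4 -4; -4 -4]
mult_As = A * s   % multiply every element by s:  [2 4; 6 8]
div_As  = A / s   % divide every element by s:    [0.5 1; 1.5 2]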

Matrix-Vector Multiplication

We map the column of the vector onto each row of the matrix, multiplying each element and
summing the result.

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a*x + b*y \\ c*x + d*y \\ e*x + f*y \end{bmatrix}

The result is a vector. The number of columns of the matrix must equal the number of rows of
the vector.

An m x n matrix multiplied by an n x 1 vector results in an m x 1 vector.

Matrix-Matrix Multiplication
We multiply two matrices by breaking it into several vector multiplications and concatenating the
result.

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a*w + b*y & a*x + b*z \\ c*w + d*y & c*x + d*z \\ e*w + f*y & e*x + f*z \end{bmatrix}

An m x n matrix multiplied by an n x o matrix results in an m x o matrix. In the above example, a 3 x 2 matrix times a 2 x 2 matrix resulted in a 3 x 2 matrix.

To multiply two matrices, the number of columns of the first matrix must equal the number of
rows of the second matrix.
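A short Octave sketch of both operations, using made-up example matrices:

A = [1 2; 3 4; 5 6];   % 3 x 2 matrix
v = [1; 2];            % 2 x 1 vector
B = [1 0; 2 3];        % 2 x 2 matrix

Av = A * v   % 3 x 2 times 2 x 1 gives a 3 x 1 vector: [5; 11; 17]
AB = A * B   % 3 x 2 times 2 x 2 gives a 3 x 2 matrix: [5 6; 11 12; 17 18]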

Matrix Multiplication Properties



 Matrices are not commutative: A∗B \neq B∗A
 Matrices are associative: (A∗B)∗C = A∗(B∗C)
The identity matrix, when multiplied by any matrix of the same dimensions, results in the
original matrix. It's just like multiplying numbers by 1. The identity matrix simply has 1's on the
diagonal (upper left to lower right diagonal) and 0's elsewhere.

\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
When multiplying the identity matrix after some matrix (A∗I), the square identity matrix's
dimension should match the other matrix's columns. When multiplying the identity matrix before
some other matrix (I∗A), the square identity matrix's dimension should match the other matrix's
rows.

Inverse and Transpose


The inverse of a matrix A is denoted A^{-1}. Multiplying by the inverse results in the identity
matrix.

A non-square matrix does not have an inverse. We can compute the inverse of a matrix in Octave with the pinv(A) function and in Matlab with the inv(A) function. Matrices that don't have an inverse are singular or degenerate.

The transposition of a matrix is like rotating the matrix 90° in the clockwise direction and then reversing it. We can compute the transposition of a matrix in Matlab with the transpose(A) function or A':

A = \begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}

A^T = \begin{bmatrix} a & c & e \\ b & d & f \end{bmatrix}

In other words:

A_{ij} = A^T_{ji}
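In Octave, for example (the matrices here are made-up examples):

A = [3 4; 2 16];       % a square, invertible matrix

A_inv = pinv(A);       % inverse of A
check = A * A_inv      % approximately the 2 x 2 identity matrix

B = [1 2; 3 4; 5 6];   % a 3 x 2 matrix
B_T = B'               % its transpose, a 2 x 3 matrix: [1 3 5; 2 4 6]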

Multiple Features

Note: [7:25 - \theta^T is a 1 by (n+1) matrix and not an (n+1) by 1 matrix]

Linear regression with multiple variables is also known as "multivariate linear regression".

We now introduce notation for equations where we can have any number of input variables.

x_j^{(i)} = value of feature j in the ith training example
x^{(i)} = the input (features) of the ith training example
m = the number of training examples
n = the number of features
The multivariable form of the hypothesis function accommodating these multiple features is as
follows:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n

In order to develop intuition about this function, we can think about \theta_0 as the basic price of a house, \theta_1 as the price per square meter, \theta_2 as the price per floor, etc. x_1 will be the number of square meters in the house, x_2 the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be
concisely represented as:

h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x

This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.

Remark: Note that for convenience reasons in this course we assume x_0^{(i)} = 1 for i \in \{1, \dots, m\}. This allows us to do matrix operations with theta and x, hence making the two vectors '\theta' and x^{(i)} match each other element-wise (that is, have the same number of elements: n+1).

Gradient Descent For Multiple Variables


The gradient descent equation itself is generally the same form; we just have to repeat it for our
'n' features:

repeat until convergence: {

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)}
\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)}
\cdots

}

In other words:

repeat until convergence: {

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \quad \text{for } j := 0 \dots n

}

The following image compares gradient descent with one variable to gradient descent with
multiple variables:
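As a sketch, the simultaneous update of all n+1 parameters can be written in vectorized Octave as follows, assuming a design matrix X whose first column is all ones (so x_0 = 1 for every example); the function and variable names are illustrative.

function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
  % X is m x (n+1) with a leading column of ones; theta is (n+1) x 1.
  m = length(y);
  for iter = 1:num_iters
    predictions = X * theta;                     % m x 1 vector of h_theta(x^(i))
    grad = (1 / m) * (X' * (predictions - y));   % (n+1) x 1 gradient vector
    theta = theta - alpha * grad;                % simultaneous update of every theta_j
  end
end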

Gradient Descent in Practice I & II -


Feature Scaling
Note: [6:20 - The average size of a house is 1000 but 100 is accidentally written instead]

We can speed up gradient descent by having each of our input values in roughly the same
range. This is because θ will descend quickly on small ranges and slowly on large ranges, and
so will oscillate inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly
the same. Ideally:

−1 ≤ x_{(i)} ≤ 1

or

−0.5 ≤ x_{(i)} ≤ 0.5

These aren't exact requirements; we are only trying to speed things up. The goal is to get all
input variables into roughly one of these ranges, give or take a few.


Two techniques to help with this are feature scaling and mean normalization. Feature scaling
involves dividing the input values by the range (i.e. the maximum value minus the minimum
value) of the input variable, resulting in a new range of just 1. Mean normalization involves
subtracting the average value for an input variable from the values for that input variable resulting
in a new average value for the input variable of just zero. To implement both of these techniques,
adjust your input values as shown in this formula:

x_i := \frac{x_i - \mu_i}{s_i}

where \mu_i is the average of all the values for feature (i) and s_i is the range of values (max - min), or s_i is the standard deviation.

Note that dividing by the range, or dividing by the standard deviation, gives different results. The quizzes in this course use the range; the programming exercises use the standard deviation.

For example, if x_i represents housing prices with a range of 100 to 2000 and a mean value of 1000, then x_i := \frac{price - 1000}{1900}.
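A possible Octave sketch of mean normalization combined with standard-deviation scaling (the name featureNormalize is an assumption; the notes above leave the choice between range and standard deviation open):

function [X_norm, mu, sigma] = featureNormalize(X)
  % X is m x n (without the column of ones); each column is one feature.
  mu = mean(X);                 % 1 x n row of feature means
  sigma = std(X);               % 1 x n row of feature standard deviations
  X_norm = (X - mu) ./ sigma;   % subtract the mean, then scale, column by column
end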

Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the
cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then
you probably need to decrease α.

Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 10^{-3}. However in practice it's difficult to choose this threshold value.

It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every
iteration.


To summarize:

If \alpha is too small: slow convergence.

If \alpha is too large: may not decrease on every iteration and thus may not converge.

Features and Polynomial Regression


We can improve our features and the form of our hypothesis function in a couple different ways.

We can combine multiple features into one. For example, we can combine x_1 and x_2 into a new feature x_3 by taking x_1 \cdot x_2.

Polynomial Regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

We can change the behavior or curve of our hypothesis function by making it a quadratic,
cubic or square root function (or any other form).

For example, if our hypothesis function is h_\theta(x) = \theta_0 + \theta_1 x_1, then we can create additional features based on x_1, to get the quadratic function h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 or the cubic function h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3.

In the cubic version, we have created new features x_2 and x_3 where x_2 = x_1^2 and x_3 = x_1^3.


To make it a square root function, we could do: h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}

One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.

e.g. if x_1 has range 1 - 1000 then the range of x_1^2 becomes 1 - 1000000 and that of x_1^3 becomes 1 - 1000000000.
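As a small Octave sketch, a cubic feature mapping followed by feature scaling might look like this (the variable names are illustrative):

x1 = (1:1000)';                    % original feature with range 1 - 1000
X_poly = [x1, x1.^2, x1.^3];       % new features x1, x1^2, x1^3

% Feature scaling matters here: the three columns now have wildly different ranges.
mu = mean(X_poly);
sigma = std(X_poly);
X_poly = (X_poly - mu) ./ sigma;   % normalized polynomial features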

Normal Equation
Note: [8:00 to 8:44 - The design matrix X (in the bottom right side of the slide) given in the example should have elements x with subscript 1 and superscripts varying from 1 to m, because for all m training examples there are only 2 features x_0 and x_1. 12:56 - The X matrix is m by (n+1) and NOT n by n.]

Gradient descent gives one way of minimizing J. Let’s discuss a second way of doing so, this
time performing the minimization explicitly and without resorting to an iterative algorithm. In the
"Normal Equation" method, we will minimize J by explicitly taking its derivatives with respect to
the θj ’s, and setting them to zero. This allows us to find the optimum theta without iteration. The
normal equation formula is given below:

\theta = (X^T X)^{-1} X^T y

There is no need to do feature scaling with the normal equation.

The following is a comparison of gradient descent and the normal equation:


Gradient Descent:
 Need to choose alpha
 Needs many iterations
 O(kn^2)
 Works well when n is large

Normal Equation:
 No need to choose alpha
 No need to iterate
 O(n^3), need to calculate the inverse of X^T X
 Slow if n is very large


With the normal equation, computing the inversion has complexity O(n^3). So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.
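In Octave the normal equation can be written directly; a sketch, assuming X is the m x (n+1) design matrix with a leading column of ones and y is the m x 1 vector of targets:

% Closed-form solution for theta. Using pinv (rather than inv) still returns a
% value even when X' * X happens to be non-invertible, which anticipates the
% noninvertibility discussion in the next section.
theta = pinv(X' * X) * X' * y;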

Normal Equation Noninvertibility


When implementing the normal equation in Octave we want to use the 'pinv' function rather than 'inv'. The 'pinv' function will give you a value of \theta even if X^T X is not invertible.

If X^T X is noninvertible, the common causes might be:


 Redundant features, where two features are very closely related (i.e. they are linearly
dependent)
 Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization" (to
be explained in a later lesson).
Solutions to the above problems include deleting a feature that is linearly dependent with another
or deleting one or more features when there are too many features.

Classification:

To attempt classification, one method is to use linear regression and map all predictions greater
than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn't work well because
classification is not actually a linear function.

The classification problem is just like the regression problem, except that the values we now
want to predict take on only a small number of discrete values. For now, we will focus on the
binary classification problem in which y can take on only two values, 0 and 1. (Most of what
we say here will also generalize to the multiple-class case.) For instance, if we are trying to build
a spam classifier for email, then x^{(i)} may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. Hence, y ∈ {0,1}. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols "-" and "+." Given x^{(i)}, the corresponding y^{(i)} is also called the label for the training example.

Hypothesis Representation
We could approach the classification problem ignoring the fact that y is discrete-valued, and use
our old linear regression algorithm to try to predict y given x. However, it is easy to construct
examples where this method performs very poorly. Intuitively, it also doesn't make sense for h_\theta(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, let's change the form of our hypotheses h_\theta(x) to satisfy 0 \leq h_\theta(x) \leq 1. This is accomplished by plugging \theta^T x into the Logistic Function.

Our new form uses the "Sigmoid Function," also called the "Logistic Function":

h_\theta(x) = g(\theta^T x), \quad z = \theta^T x, \quad g(z) = \frac{1}{1 + e^{-z}}
The following image shows us what the sigmoid function looks like:


The function g(z), shown here, maps any real number to the (0, 1) interval, making it useful for
transforming an arbitrary-valued function into a function better suited for classification.

h_\theta(x) will give us the probability that our output is 1. For example, h_\theta(x) = 0.7 gives us a probability of 70% that our output is 1. Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if the probability that it is 1 is 70%, then the probability that it is 0 is 30%).

h_\theta(x) = P(y = 1 | x; \theta) = 1 - P(y = 0 | x; \theta)
P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1
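A minimal Octave sketch of the sigmoid, and of reading its output as a probability (the function name sigmoid and the example number are assumptions):

function g = sigmoid(z)
  % Works element-wise on scalars, vectors and matrices.
  g = 1 ./ (1 + exp(-z));
end

% Example: if theta' * x evaluates to 0.85, then sigmoid(0.85) is roughly 0.70,
% i.e. about a 70% probability that y = 1 and about 30% that y = 0.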

Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis
function as follows:

h_\theta(x) \geq 0.5 \rightarrow y = 1
h_\theta(x) < 0.5 \rightarrow y = 0

The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:

g(z) \geq 0.5 \text{ when } z \geq 0

Remember:

z = 0, e^{0} = 1 \Rightarrow g(z) = 1/2
z \rightarrow \infty, e^{-\infty} \rightarrow 0 \Rightarrow g(z) = 1
z \rightarrow -\infty, e^{\infty} \rightarrow \infty \Rightarrow g(z) = 0

So if our input to g is \theta^T x, then that means:

h_\theta(x) = g(\theta^T x) \geq 0.5 \text{ when } \theta^T x \geq 0

From these statements we can now say:

\theta^T x \geq 0 \Rightarrow y = 1
\theta^T x < 0 \Rightarrow y = 0

The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.

Example:


\theta = \begin{bmatrix} 5 \\ -1 \\ 0 \end{bmatrix}

y = 1 \text{ if } 5 + (-1) x_1 + 0 x_2 \geq 0
5 - x_1 \geq 0
-x_1 \geq -5
x_1 \leq 5

In this case, our decision boundary is a straight vertical line placed on the graph where x_1 = 5, and everything to the left of that denotes y = 1, while everything to the right denotes y = 0.

Again, the input to the sigmoid function g(z) (e.g. \theta^T x) doesn't need to be linear, and could be a function that describes a circle (e.g. z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2) or any shape to fit our data.

Cost Function
We cannot use the same cost function that we use for linear regression because the Logistic
Function will cause the output to be wavy, causing many local optima. In other words, it will not
be a convex function.

Instead, our cost function for logistic regression looks like:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})
\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1
\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0

When y = 1, we get the following plot for J(\theta) vs h_\theta(x):

Similarly, when y = 0, we get the following plot for J(\theta) vs h_\theta(x):


\mathrm{Cost}(h_\theta(x), y) = 0 \text{ if } h_\theta(x) = y
\mathrm{Cost}(h_\theta(x), y) \rightarrow \infty \text{ if } y = 0 \text{ and } h_\theta(x) \rightarrow 1
\mathrm{Cost}(h_\theta(x), y) \rightarrow \infty \text{ if } y = 1 \text{ and } h_\theta(x) \rightarrow 0

If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also
outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.

If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If
our hypothesis approaches 0, then the cost function will approach infinity.

Note that writing the cost function in this way guarantees that J(θ) is convex for logistic
regression.

Simplified Cost Function and Gradient Descent

Note: [6:53 - the gradient descent equation should have a 1/m factor]

We can compress our cost function's two conditional cases into one case:

\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))

Notice that when y is equal to 1, then the second term (1 - y)\log(1 - h_\theta(x)) will be zero and will not affect the result. If y is equal to 0, then the first term -y\log(h_\theta(x)) will be zero and will not affect the result.

We can fully write out our entire cost function as follows:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

A vectorized implementation is:

h = g(X\theta)
J(\theta) = \frac{1}{m} \cdot (-y^T \log(h) - (1 - y)^T \log(1 - h))

Gradient Descent

Remember that the general form of gradient descent is:

Repeat { \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) }

We can work out the derivative part using calculus to get:

Repeat { \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} }

Notice that this algorithm is identical to the one we used in linear regression. We still have to
simultaneously update all values in theta.

A vectorized implementation is:

\theta := \theta - \frac{\alpha}{m} X^T (g(X\theta) - \vec{y})
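Putting the vectorized cost and gradient together, a sketch of a logistic regression cost function in Octave could look like the following; it assumes the sigmoid function sketched earlier, and the name costFunctionLogistic is illustrative. A function of this shape (returning both the cost and the gradient) is exactly what the fminunc example in the next section expects.

function [J, grad] = costFunctionLogistic(theta, X, y)
  % X is m x (n+1) with a leading column of ones; y is m x 1 of 0/1 labels.
  m = length(y);
  h = sigmoid(X * theta);                                 % h = g(X * theta)
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % vectorized cost
  grad = (1 / m) * (X' * (h - y));                        % vectorized gradient
end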

Advanced Optimization
Note: [7:35 - '100' should be 100 instead. The value provided should be an integer and not a
character string.]

"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ
that can be used instead of gradient descent. We suggest that you should not write these more
sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the
libraries instead, as they're already tested and highly optimized. Octave provides them.

We first need to provide a function that evaluates the following two functions for a given input
value θ:

J(\theta) \quad \text{and} \quad \frac{\partial}{\partial \theta_j} J(\theta)
We can write a single function that returns both of these:

Page | 29
Jia arshad

function [jVal, gradient] = costFunction(theta)

  jVal = [...code to compute J(theta)...];

  gradient = [...code to compute derivative of J(theta)...];

end

Then we can use octave's "fminunc()" optimization algorithm along with the "optimset()" function
that creates an object containing the options we want to send to "fminunc()". (Note: the value for
MaxIter should be an integer, not a character string - errata in the video at 7:30)

options = optimset('GradObj', 'on', 'MaxIter', 100);

initialTheta = zeros(2,1);

[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);


We give to the function "fminunc()" our cost function, our initial vector of theta values, and the
"options" object that we created beforehand.

The Problem of Overfitting


Consider the problem of predicting y from x ∈ ℝ. The leftmost figure below shows the result of fitting y = \theta_0 + \theta_1 x to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good.

Instead, if we had added an extra feature x^2, and fit y = \theta_0 + \theta_1 x + \theta_2 x^2, then we obtain a slightly better fit to the data (see middle figure). Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features: the rightmost figure is the result of fitting a 5th order polynomial y = \sum_{j=0}^{5} \theta_j x^j. We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor of, say, housing prices (y) for different living areas (x). Without formally defining what these terms mean, we'll say the figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting.

Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend
of the data. It is usually caused by a function that is too simple or uses too few features. At the
other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the
available data but does not generalize well to predict new data. It is usually caused by a
complicated function that creates a lot of unnecessary curves and angles unrelated to the data.


This terminology is applied to both linear and logistic regression. There are two main options to
address the issue of overfitting:

1) Reduce the number of features:

 Manually select which features to keep.

 Use a model selection algorithm (studied later in the course).

2) Regularization

 Keep all the features, but reduce the magnitude of parameters \theta_j.

 Regularization works well when we have a lot of slightly useful features.

Cost Function
Note: [5:18 - There is a typo. It should be \sum_{j=1}^{n} \theta_j^2 instead of \sum_{i=1}^{n} \theta_j^2]

If we have overfitting from our hypothesis function, we can reduce the weight that some of the
terms in our function carry by increasing their cost.

Say we wanted to make the following function more quadratic:

\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4

We'll want to eliminate the influence of \theta_3 x^3 and \theta_4 x^4. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:

\min_\theta \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000 \cdot \theta_3^2 + 1000 \cdot \theta_4^2

We've added two extra terms at the end to inflate the cost of \theta_3 and \theta_4. Now, in order for the cost function to get close to zero, we will have to reduce the values of \theta_3 and \theta_4 to near zero. This will in turn greatly reduce the values of \theta_3 x^3 and \theta_4 x^4 in our hypothesis function. As a result, we see that the new hypothesis (depicted by the pink curve) looks like a quadratic function but fits the data better due to the extra small terms \theta_3 x^3 and \theta_4 x^4.


We could also regularize all of our theta parameters in a single summation as:

\min_\theta \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2

The λ, or lambda, is the regularization parameter. It determines how much the costs of our
theta parameters are inflated.

Using the above cost function with the extra summation, we can smooth the output of our
hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out
the function too much and cause underfitting. Hence, what would happen if \lambda = 0 or is too small?

Regularized Linear Regression


Note: [8:43 - It is said that X is non-invertible if m \leq n. The correct statement should be that X is non-invertible if m < n, and may be non-invertible if m = n.]

We can apply regularization to both linear regression and logistic regression. We will approach
linear regression first.

Gradient Descent

We will modify our gradient descent function to separate out \theta_0 from the rest of the parameters because we do not want to penalize \theta_0.


Repeat {

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}

\theta_j := \theta_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \right] \quad j \in \{1, 2, \dots, n\}

}

The term \frac{\lambda}{m} \theta_j performs our regularization. With some manipulation, our update rule can also be represented as:

\theta_j := \theta_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}

The first term in the above equation, 1 - \alpha \frac{\lambda}{m}, will always be less than 1. Intuitively you can see it as reducing the value of \theta_j by some amount on every update. Notice that the second term is now exactly the same as it was before.
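A vectorized Octave sketch of this regularized update (assuming X has a leading column of ones; theta(1) corresponds to \theta_0 because Octave is 1-indexed, and lambda is the regularization parameter):

% One regularized gradient descent step.
h = X * theta;                          % current predictions, m x 1
grad = (1 / m) * (X' * (h - y));        % unregularized gradient, (n+1) x 1
reg = (lambda / m) * theta;
reg(1) = 0;                             % do not penalize theta_0
theta = theta - alpha * (grad + reg);   % simultaneous regularized update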

Normal Equation

Now let's approach regularization using the alternate method of the non-iterative normal
equation.

To add in regularization, the equation is the same as our original, except that we add another
term inside the parentheses:

\theta = (X^T X + \lambda \cdot L)^{-1} X^T y

where L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}

L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including x_0), multiplied by a single real number λ.

Recall that if m < n, then X^T X is non-invertible. However, when we add the term λ⋅L, then X^T X + λ⋅L becomes invertible.

Regularized Logistic Regression


We can regularize logistic regression in a similar way that we regularize linear regression. As a
result, we can avoid overfitting. The following image shows how the regularized function,
displayed by the pink line, is less likely to overfit than the non-regularized function represented by
the blue line:


Cost Function

Recall that our cost function for logistic regression was:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

We can regularize this equation by adding a term to the end:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

The second sum, \sum_{j=1}^{n} \theta_j^2, means to explicitly exclude the bias term, \theta_0. I.e. the θ vector is indexed from 0 to n (holding n+1 values, \theta_0 through \theta_n), and this sum explicitly skips \theta_0, by running from 1 to n, skipping 0. Thus, when computing the equation, we should continuously update the two following equations:

Model Representation I
Let's examine how we will represent a hypothesis function using neural networks. At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical inputs (called "spikes") that are channeled to outputs (axons). In our model, our dendrites are like the input features x_1 \cdots x_n, and the output is the result of our hypothesis function. In this model our x_0 input node is sometimes called the "bias unit." It is always equal to 1. In neural networks, we use the same logistic function as in classification, \frac{1}{1 + e^{-\theta^T x}}, yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our "theta" parameters are sometimes called "weights".

Visually, a simplistic representation looks like:

\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow [\ \ \ ] \rightarrow h_\theta(x)

Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which
finally outputs the hypothesis function, known as the "output layer".

We can have intermediate layers of nodes between the input and output layers called the
"hidden layers."

In this example, we label these intermediate or "hidden" layer nodes a_0^{(2)} \cdots a_n^{(2)} and call them "activation units."

a_i^{(j)} = "activation" of unit i in layer j
\Theta^{(j)} = matrix of weights controlling function mapping from layer j to layer j+1

If we had one hidden layer, it would look like:

\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \rightarrow \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix} \rightarrow h_\theta(x)

The values for each of the "activation" nodes is obtained as follows:

a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3)
a_2^{(2)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3)
a_3^{(2)} = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)
h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})
This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix \Theta^{(2)} containing the weights for our second layer of nodes.

Each layer gets its own matrix of weights, \Theta^{(j)}.


The dimensions of these matrices of weights is determined as follows:

If a network has s_j units in layer j and s_{j+1} units in layer j+1, then \Theta^{(j)} will be of dimension s_{j+1} \times (s_j + 1).

The +1 comes from the addition in \Theta^{(j)} of the "bias nodes," x_0 and \Theta_0^{(j)}. In other words the output nodes will not include the bias nodes while the inputs will. The following image summarizes our model representation:

Example: If layer 1 has 2 input nodes and layer 2 has 4 activation nodes, the dimension of \Theta^{(1)} is going to be 4×3, where s_j = 2 and s_{j+1} = 4, so s_{j+1} \times (s_j + 1) = 4 \times 3.

Model Representation II
To re-iterate, the following is an example of a neural network:

a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3)
a_2^{(2)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3)
a_3^{(2)} = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)
h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})


In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable z_k^{(j)} that encompasses the parameters inside our g function. In our previous example, if we replaced all the parameters by the variable z, we would get:

a_1^{(2)} = g(z_1^{(2)})
a_2^{(2)} = g(z_2^{(2)})
a_3^{(2)} = g(z_3^{(2)})

In other words, for layer j = 2 and node k, the variable z will be:

z_k^{(2)} = \Theta_{k,0}^{(1)} x_0 + \Theta_{k,1}^{(1)} x_1 + \cdots + \Theta_{k,n}^{(1)} x_n

The vector representation of x and z^{(j)} is:

x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \quad z^{(j)} = \begin{bmatrix} z_1^{(j)} \\ z_2^{(j)} \\ \vdots \\ z_n^{(j)} \end{bmatrix}

Setting x = a^{(1)}, we can rewrite the equation as:

z^{(j)} = \Theta^{(j-1)} a^{(j-1)}

We are multiplying our matrix \Theta^{(j-1)} with dimensions s_j \times (n+1) (where s_j is the number of our activation nodes) by our vector a^{(j-1)} with height (n+1). This gives us our vector z^{(j)} with height s_j. Now we can get a vector of our activation nodes for layer j as follows:

a^{(j)} = g(z^{(j)})

Where our function g can be applied element-wise to our vector z^{(j)}.

We can then add a bias unit (equal to 1) to layer j after we have computed a^{(j)}. This will be element a_0^{(j)} and will be equal to 1. To compute our final hypothesis, let's first compute another z vector:

z^{(j+1)} = \Theta^{(j)} a^{(j)}

We get this final z vector by multiplying the next theta matrix after \Theta^{(j-1)} with the values of all the activation nodes we just got. This last theta matrix \Theta^{(j)} will have only one row, which is multiplied by one column a^{(j)} so that our result is a single number. We then get our final result with:

h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})

Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing
as we did in logistic regression. Adding all these intermediate layers in neural networks allows us
to more elegantly produce interesting and more complex non-linear hypotheses.
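A forward-propagation sketch in Octave for a network with one hidden layer, following the equations above (Theta1 and Theta2 stand for the weight matrices \Theta^{(1)} and \Theta^{(2)}, and the sigmoid function sketched earlier is assumed):

% x is an n x 1 input vector without the bias unit.
a1 = [1; x];               % layer 1: prepend the bias unit x_0 = 1
z2 = Theta1 * a1;          % z^(2) = Theta^(1) * a^(1)
a2 = [1; sigmoid(z2)];     % layer 2 activations, with bias unit a_0^(2) = 1
z3 = Theta2 * a2;          % z^(3) = Theta^(2) * a^(2)
h = sigmoid(z3);           % h_Theta(x) = a^(3)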


Examples and Intuitions I


A simple example of applying neural networks is by predicting x_1 AND x_2, which is the logical 'and' operator and is only true if both x_1 and x_2 are 1.

The graph of our functions will look like:

\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow [g(z^{(2)})] \rightarrow h_\Theta(x)

Remember that x_0 is our bias variable and is always 1.

Let's set our first theta matrix as:

\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}

This will cause the output of our hypothesis to only be positive if both x_1 and x_2 are 1. In other words:

h_\Theta(x) = g(-30 + 20 x_1 + 20 x_2)

x_1 = 0 and x_2 = 0, then g(-30) ≈ 0
x_1 = 0 and x_2 = 1, then g(-10) ≈ 0
x_1 = 1 and x_2 = 0, then g(-10) ≈ 0
x_1 = 1 and x_2 = 1, then g(10) ≈ 1
So we have constructed one of the fundamental operations in computers by using a small neural
network rather than using an actual AND gate. Neural networks can also be used to simulate all
the other logical gates. The following is an example of the logical operator 'OR', meaning either
x_1 is true or x_2 is true, or both:

Where g(z) is the following:


Examples and Intuitions II

The \Theta^{(1)} matrices for AND, NOR, and OR are:

AND: \Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}
NOR: \Theta^{(1)} = \begin{bmatrix} 10 & -20 & -20 \end{bmatrix}
OR: \Theta^{(1)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}
We can combine these to get the XNOR logical operator (which gives 1 if x_1 and x_2 are both 0 or both 1).

\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \end{bmatrix} \rightarrow [a^{(3)}] \rightarrow h_\Theta(x)

For the transition between the first and second layer, we'll use a \Theta^{(1)} matrix that combines the values for AND and NOR:

\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \\ 10 & -20 & -20 \end{bmatrix}

For the transition between the second and third layer, we'll use a \Theta^{(2)} matrix that uses the value for OR:

\Theta^{(2)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}

Let's write out the values for all our nodes:

a^{(2)} = g(\Theta^{(1)} \cdot x)
a^{(3)} = g(\Theta^{(2)} \cdot a^{(2)})
h_\Theta(x) = a^{(3)}
And there we have the XNOR operator using a hidden layer with two nodes! The following
summarizes the above algorithm:
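Using the forward-propagation sketch from earlier, the XNOR network can be checked numerically in Octave with the Θ matrices given above:

Theta1 = [-30 20 20; 10 -20 -20];   % row 1 computes AND, row 2 computes NOR
Theta2 = [-10 20 20];               % OR of the two hidden units

for x1 = 0:1
  for x2 = 0:1
    a1 = [1; x1; x2];                % input with bias unit
    a2 = [1; sigmoid(Theta1 * a1)];  % hidden layer with bias unit
    h  = sigmoid(Theta2 * a2);       % approximately 1 exactly when x1 == x2
    printf('x1=%d x2=%d -> h=%.4f\n', x1, x2, h);
  end
end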


Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values.
Say we wanted to classify our data into one of four categories. We will use the following example
to see how this classification is done. This algorithm takes as input an image and classifies it
accordingly:

We can define our set of resulting classes as y:


Each y^{(i)} represents a different image corresponding to either a car, pedestrian, truck, or motorcycle. The inner layers each provide us with some new information which leads to our final hypothesis function. The setup looks like:

Our resulting hypothesis for one set of inputs may look like:

h_\Theta(x) = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}

In which case our resulting class is the third one down, or h_\Theta(x)_3, which represents the motorcycle.

Cost Function
Let's first define a few variables that we will need to use:

 L = total number of layers in the network
 s_l = number of units (not counting bias unit) in layer l
 K = number of output units/classes


Recall that in neural networks, we may have many output nodes. We denote h_\Theta(x)_k as being a hypothesis that results in the k^{th} output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

For neural networks, it is going to be slightly more complicated:

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log((h_\Theta(x^{(i)}))_k) + (1 - y_k^{(i)}) \log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{j,i}^{(l)})^2

We have added a few nested summations to account for our multiple output nodes. In the first
part of the equation, before the square brackets, we have an additional nested summation that
loops through the number of output nodes.

In the regularization part, after the square brackets, we must account for multiple theta matrices.
The number of columns in our current theta matrix is equal to the number of nodes in our current
layer (including the bias unit). The number of rows in our current theta matrix is equal to the
number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we
square every term.

Note:

 the double sum simply adds up the logistic regression costs calculated for each cell in the
output layer
 the triple sum simply adds up the squares of all the individual Θs in the entire network.
 the i in the triple sum does not refer to training example i

Backpropagation Algorithm
"Backpropagation" is neural-network terminology for minimizing our cost function, just like what
we were doing with gradient descent in logistic and linear regression. Our goal is to compute:

\min_\Theta J(\Theta)

That is, we want to minimize our cost function J using an optimal set of parameters in theta. In
this section we'll look at the equations we use to compute the partial derivative of J(Θ):

\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)

To do so, we use the following algorithm:


Backpropagation Algorithm

Given training set \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}

 Set \Delta_{i,j}^{(l)} := 0 for all (l, i, j) (hence you end up having a matrix full of zeros)

For training example t = 1 to m:

1. Set a^{(1)} := x^{(t)}

2. Perform forward propagation to compute a^{(l)} for l = 2, 3, …, L


3. Using y^{(t)}, compute \delta^{(L)} = a^{(L)} - y^{(t)}

Where L is our total number of layers and a^{(L)} is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:

4. Compute \delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)} using \delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}) \ .*\ a^{(l)} \ .*\ (1 - a^{(l)})

The delta values of layer l are calculated by multiplying the delta values in the next layer with the
theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime,
which is the derivative of the activation function g evaluated with the input values given by
z^{(l)}.

The g-prime derivative terms can also be written out as:

g'(z^{(l)}) = a^{(l)} \ .*\ (1 - a^{(l)})

5. \Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}, or with vectorization, \Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T

Hence we update our new \Delta matrix.

 D_{i,j}^{(l)} := \frac{1}{m} (\Delta_{i,j}^{(l)} + \lambda \Theta_{i,j}^{(l)}), if j ≠ 0.
 D_{i,j}^{(l)} := \frac{1}{m} \Delta_{i,j}^{(l)}, if j = 0.

The capital-delta matrix D is used as an "accumulator" to add up our values as we go along and eventually compute our partial derivative. Thus we get \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}.
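For a single hidden layer, the loop described above might be sketched in Octave as follows. Theta1, Theta2 and sigmoid are as in the earlier forward-propagation sketch; X is m x n without the bias column, Y is m x K with one-hot rows, and lambda is the regularization parameter. This is an illustrative sketch under those assumptions, not the full course implementation.

Delta1 = zeros(size(Theta1));   % accumulators, same shapes as the weight matrices
Delta2 = zeros(size(Theta2));

for t = 1:m
  a1 = [1; X(t, :)'];                       % 1. forward propagation
  a2 = [1; sigmoid(Theta1 * a1)];
  a3 = sigmoid(Theta2 * a2);

  d3 = a3 - Y(t, :)';                       % 3. output-layer error
  d2 = (Theta2' * d3) .* a2 .* (1 - a2);    % 4. hidden-layer error (includes bias entry)
  d2 = d2(2:end);                           %    drop the error for the bias unit

  Delta1 = Delta1 + d2 * a1';               % 5. accumulate the gradients
  Delta2 = Delta2 + d3 * a2';
end

D1 = (1 / m) * Delta1;  D1(:, 2:end) += (lambda / m) * Theta1(:, 2:end);
D2 = (1 / m) * Delta2;  D2(:, 2:end) += (lambda / m) * Theta2(:, 2:end);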

Backpropagation Intuition
Note: [4:39, the last term for the calculation for z_1^{(3)} (three-color handwritten formula) should be a_2^{(2)} instead of a_1^{(2)}. 6:08 - the equation for cost(i) is incorrect. The first term is missing parentheses for the log() function, and the second term should be (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})). 8:50 - \delta^{(4)} = y - a^{(4)} is incorrect and should be \delta^{(4)} = a^{(4)} - y.]

Recall that the cost function for a neural network is:

J(\Theta) = -\frac{1}{m} \sum_{t=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(t)} \log(h_\Theta(x^{(t)}))_k + (1 - y_k^{(t)}) \log(1 - h_\Theta(x^{(t)})_k) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{j,i}^{(l)})^2

If we consider simple non-multiclass classification (k = 1) and disregard regularization, the cost is computed with:

cost(t) = y^{(t)} \log(h_\Theta(x^{(t)})) + (1 - y^{(t)}) \log(1 - h_\Theta(x^{(t)}))
Intuitively, \delta_j^{(l)} is the "error" for a_j^{(l)} (unit j in layer l). More formally, the delta values are actually the derivative of the cost function:

\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)


Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. Let us consider the following neural network below and see how we could calculate some \delta_j^{(l)}:


In the image above, to calculate \delta_2^{(2)}, we multiply the weights \Theta_{12}^{(2)} and \Theta_{22}^{(2)} by their respective \delta values found to the right of each edge. So we get \delta_2^{(2)} = \Theta_{12}^{(2)} * \delta_1^{(3)} + \Theta_{22}^{(2)} * \delta_2^{(3)}. To calculate every single possible \delta_j^{(l)}, we could start from the right of our diagram. We can think of our edges as our \Theta_{ij}. Going from right to left, to calculate the value of \delta_j^{(l)}, you can just take the overall sum of each weight times the \delta it is coming from. Hence, another example would be \delta_2^{(3)} = \Theta_{12}^{(3)} * \delta_1^{(4)}.

Implementation Note: Unrolling Parameters
With neural networks, we are working with sets of matrices:

\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \dots \quad D^{(1)}, D^{(2)}, D^{(3)}, \dots
In order to use optimizing functions such as "fminunc()", we will want to "unroll" all the elements
and put them into one long vector:
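In Octave, a sketch of the unrolling itself might look like this (assuming the matrices are named Theta1, Theta2, Theta3 and D1, D2, D3):

thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];   % stack every element into one long column vector
deltaVector = [ D1(:); D2(:); D3(:) ];               % same for the gradient matrices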

If the dimensions of Theta1 are 10x11, Theta2 is 10x11 and Theta3 is 1x11, then we can get back our original matrices from the "unrolled" versions as follows:
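A sketch of the corresponding reshape calls; the index ranges follow from the 10x11, 10x11 and 1x11 dimensions stated above:

Theta1 = reshape(thetaVector(1:110), 10, 11);
Theta2 = reshape(thetaVector(111:220), 10, 11);
Theta3 = reshape(thetaVector(221:231), 1, 11);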

To summarize:
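As a rough sketch of how the pieces fit together (costFunction and the variable names here are placeholders, not a fixed API): unroll the initial parameters, hand them to the optimizer, and have the cost function reshape them back before running forward and back propagation.

initialTheta = [ initialTheta1(:); initialTheta2(:); initialTheta3(:) ];
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, cost] = fminunc(@(t) costFunction(t, X, y, lambda), initialTheta, options);

% Inside costFunction: reshape t back into Theta1, Theta2, Theta3, run forward and
% back propagation to obtain J and the D matrices, and return J together with
% gradVec = [ D1(:); D2(:); D3(:) ].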


Gradient Checking
Gradient checking will assure that our backpropagation works as intended. We can approximate
the derivative of our cost function with:

\dfrac{\partial}{\partial \Theta} J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}

With multiple theta matrices, we can approximate the derivative with respect to \Theta_j as follows:

\dfrac{\partial}{\partial \Theta_j} J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}

A small value for \epsilon (epsilon), such as \epsilon = 10^{-4}, guarantees that the math works out properly. If the value for \epsilon is too small, we can end up with numerical problems.

Hence, we are only adding or subtracting epsilon to the \Theta_j matrix. In Octave we can do it as follows:

epsilon = 1e-4;

for i = 1:n,

  thetaPlus = theta;


  thetaPlus(i) += epsilon;

  thetaMinus = theta;

  thetaMinus(i) -= epsilon;

  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon);

end;

We previously saw how to calculate the deltaVector. So once we compute our gradApprox
vector, we can check that gradApprox ≈ deltaVector.

Once you have verified once that your backpropagation algorithm is correct, you don't need to
compute gradApprox again. The code to compute gradApprox can be very slow.

Random Initialization
Initializing all theta weights to zero does not work with neural networks. When we backpropagate,
all nodes will update to the same value repeatedly. Instead we can randomly initialize our
weights for our \ThetaΘ matrices using the following method:


Hence, we initialize each \Theta^{(l)}_{ij} to a random value in [-\epsilon, \epsilon]. Using the formula Theta = rand(rows, columns) * (2 * INIT_EPSILON) - INIT_EPSILON (as in the code below) guarantees that we get the desired bound. The same procedure applies to all the \Theta's. Below is some working code you could use to experiment.

If the dimensions of Theta1 are 10x11, Theta2 is 10x11 and Theta3 is 1x11:

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;


rand(x,y) is just a function in Octave that will initialize a matrix of random real numbers between 0 and 1.

(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)

Putting it Together
First, pick a network architecture; choose the layout of your neural network, including how many
hidden units in each layer and how many layers in total you want to have.

 Number of input units = dimension of features x^{(i)}
 Number of output units = number of classes
 Number of hidden units per layer = usually the more the better (must balance with the cost of computation, which increases with more hidden units)
 Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that
you have the same number of units in every hidden layer.
Training a Neural Network

1. Randomly initialize the weights


2. Implement forward propagation to get h_\Theta(x^{(i)}) for any x^{(i)}
3. Implement the cost function
4. Implement backpropagation to compute partial derivatives
5. Use gradient checking to confirm that your backpropagation works. Then disable gradient
checking.
6. Use gradient descent or a built-in optimization function to minimize the cost function with the
weights in theta.
When we perform forward and back propagation, we loop on every training example:

for i = 1:m,


   Perform forward propagation and backpropagation using example (x(i), y(i))

   (Get activations a(l) and delta terms d(l) for l = 2,...,L)

end;

The following image gives us an intuition of what is happening as we are implementing our
neural network:

Ideally, you want h_\Theta(x^{(i)}) \approx y^{(i)}. This will minimize our cost function. However, keep in mind that J(\Theta) is not convex and thus we can end up in a local minimum instead.


Evaluating a Hypothesis
Once we have done some troubleshooting for errors in our predictions by:

 Getting more training examples


 Trying smaller sets of features
 Trying additional features
 Trying polynomial features
 Increasing or decreasing λ
We can move on to evaluate our new hypothesis.

A hypothesis may have a low error for the training examples but still be inaccurate (because of
overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up
the data into two sets: a training set and a test set. Typically, the training set consists of 70 %
of your data and the test set is the remaining 30 %.

The new procedure using these two sets is then:

1. Learn \Theta and minimize J_{train}(\Theta) using the training set

2. Compute the test set error J_{test}(\Theta)

The test set error


1. For linear regression: J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}} (h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2
2. For classification ~ misclassification error (aka 0/1 misclassification error):

err(h_\Theta(x), y) = \begin{cases} 1 & \text{if } h_\Theta(x) \geq 0.5 \text{ and } y = 0, \text{ or } h_\Theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases}
This gives us a binary 0 or 1 error result based on a misclassification. The average test error for
the test set is:

\text{Test Error} = \dfrac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})

This gives us the proportion of the test data that was misclassified.
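As an illustration, a small Octave sketch of this 0/1 test error for a logistic-style hypothesis (assuming Xtest already contains the bias column, theta is the learned parameter vector, ytest holds 0/1 labels, and sigmoid is a helper function):

predictions = sigmoid(Xtest * theta) >= 0.5;       % 1 wherever we predict the positive class
testError = mean(double(predictions ~= ytest));    % fraction of misclassified test examples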

Model Selection and Train/Validation/Test Sets

Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis. It could overfit and, as a result, your predictions on the test set would be poor. The error of your
hypothesis as measured on the data set with which you trained the parameters will be lower than
the error on any other data set.


Given many models with different polynomial degrees, we can use a systematic approach to identify
the 'best' function. In order to choose the model of your hypothesis, you can test each degree of
polynomial and look at the error result.

One way to break down our dataset into the three sets is:

 Training set: 60%

 Cross validation set: 20%

 Test set: 20%

We can now calculate three separate error values for the three different sets using the following
method:

1. Optimize the parameters in Θ using the training set for each polynomial degree.

2. Find the polynomial degree d with the least error using the cross validation set.

3. Estimate the generalization error using the test set with J_{test}(\Theta^{(d)}), (d = theta from the polynomial with the lowest error);

This way, the degree of the polynomial d has not been trained using the test set.
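A rough Octave sketch of this selection loop is given below. The helpers polyFeatures(X, d) (map each example to its degree-d polynomial features, including the intercept term) and trainLinearReg(X, y, lambda) (fit \Theta by minimizing the regularized cost) are hypothetical names used only for illustration.

for d = 1:10
  theta{d} = trainLinearReg(polyFeatures(Xtrain, d), ytrain, 0);               % step 1: fit on the training set
  Xcv_d = polyFeatures(Xcv, d);
  cvError(d) = (1 / (2 * length(ycv))) * sum((Xcv_d * theta{d} - ycv) .^ 2);   % step 2: cross validation error
end
[dummy, d] = min(cvError);                                                     % degree with the lowest CV error
Xtest_d = polyFeatures(Xtest, d);
testError = (1 / (2 * length(ytest))) * sum((Xtest_d * theta{d} - ytest) .^ 2); % step 3: generalization estimate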

Diagnosing Bias vs. Variance


In this section we examine the relationship between the degree of the polynomial d and the
underfitting or overfitting of our hypothesis.

 We need to distinguish whether bias or variance is the problem contributing to bad


predictions.
 High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden
mean between these two.
The training error will tend to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error will tend to decrease as we increase d up to a point,
and then it will increase as d is increased, forming a convex curve.

High bias (underfitting): both J_{train}(\Theta) and J_{CV}(\Theta) will be high. Also, J_{CV}(\Theta) \approx J_{train}(\Theta).

High variance (overfitting): J_{train}(\Theta) will be low and J_{CV}(\Theta) will be much greater than J_{train}(\Theta).

This is summarized in the figure below:


Regularization and Bias/Variance



In the figure above, we see that as \lambda increases, our fit becomes more rigid. On the other hand, as \lambda approaches 0, we tend to overfit the data. So how do we choose our parameter \lambda to get it 'just right'? In order to choose the model and the regularization term \lambda, we need to:

1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});


2. Create a set of models with different degrees or any other variants.


3. Iterate through the \lambdas and for each \lambda go through all the models to learn some \Theta.
4. Compute the cross validation error using the learned \Theta (computed with \lambda) on J_{CV}(\Theta) without regularization, i.e. with \lambda = 0.
5. Select the best combo that produces the lowest error on the cross validation set.
6. Using the best combo \Theta and \lambda, apply it on J_{test}(\Theta) to see if it generalizes well to the problem.
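A compact Octave sketch of steps 3-5, reusing the hypothetical trainLinearReg helper from the earlier sketch (the variable names are again illustrative):

lambdas = [0 0.01 0.02 0.04 0.08 0.16 0.32 0.64 1.28 2.56 5.12 10.24];
for i = 1:length(lambdas)
  theta = trainLinearReg(Xtrain, ytrain, lambdas(i));                    % step 3: learn Theta with this lambda
  cvError(i) = (1 / (2 * length(ycv))) * sum((Xcv * theta - ycv) .^ 2);  % step 4: unregularized CV error
end
[dummy, best] = min(cvError);                                            % step 5: lowest cross validation error
bestLambda = lambdas(best);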

Learning Curves
Training an algorithm on very few data points (such as 1, 2 or 3) will easily yield 0 errors because we can always find a quadratic curve that passes exactly through those points. Hence:

 As the training set gets larger, the error for a quadratic function increases.
 The error value will plateau out after a certain m, or training set size.
Experiencing high bias:

Low training set size: causes J_{train}(\Theta) to be low and J_{CV}(\Theta) to be high.

Large training set size: causes both J_{train}(\Theta) and J_{CV}(\Theta) to be high with J_{train}(\Theta) \approx J_{CV}(\Theta).

If a learning algorithm is suffering from high bias, getting more training data will not (by itself)
help much.

Experiencing high variance:

Low training set size: J_{train}(\Theta) will be low and J_{CV}(\Theta) will be high.

Large training set size: J_{train}(\Theta) increases with training set size and J_{CV}(\Theta) continues to decrease without leveling off. Also, J_{train}(\Theta) < J_{CV}(\Theta), but the difference between them remains significant.


If a learning algorithm is suffering from high variance, getting more training data is likely to help.
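A minimal Octave sketch of how these two curves could be computed, again using the hypothetical trainLinearReg helper and some fixed lambda; the error definitions are the unregularized J_train and J_CV:

for i = 1:m
  theta = trainLinearReg(Xtrain(1:i, :), ytrain(1:i), lambda);                   % train on the first i examples
  Jtrain(i) = (1 / (2 * i)) * sum((Xtrain(1:i, :) * theta - ytrain(1:i)) .^ 2);
  Jcv(i) = (1 / (2 * length(ycv))) * sum((Xcv * theta - ycv) .^ 2);              % always evaluate on the full CV set
end
plot(1:m, Jtrain, 1:m, Jcv);                                                      % the learning curves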

Deciding What to Do Next Revisited


Our decision process can be broken down as follows:

 Getting more training examples: Fixes high variance


 Trying smaller sets of features: Fixes high variance
 Adding features: Fixes high bias
 Adding polynomial features: Fixes high bias
 Decreasing λ: Fixes high bias
 Increasing λ: Fixes high variance.

Diagnosing Neural Networks

 A neural network with fewer parameters is prone to underfitting. It is also


computationally cheaper.
 A large neural network with more parameters is prone to overfitting. It is also
computationally expensive. In this case you can use regularization (increase λ) to address
the overfitting.
Using a single hidden layer is a good starting default. You can train your neural network on a
number of hidden layers using your cross validation set. You can then select the one that
performs best.

Model Complexity Effects:

 Lower-order polynomials (low model complexity) have high bias and low variance. In this
case, the model fits poorly consistently.
 Higher-order polynomials (high model complexity) fit the training data extremely well and the
test data extremely poorly. These have low bias on the training data, but very high variance.
 In reality, we would want to choose a model somewhere in between, that can generalize
well but also fits the data reasonably well.

Prioritizing What to Work On


System Design Example:

Given a data set of emails, we could construct a vector for each email. Each entry in this vector
represents a word. The vector normally contains 10,000 to 50,000 entries gathered by finding the
most frequently used words in our data set. If a word is found in the email, we assign its respective entry a 1; if it is not found, that entry is a 0. Once we have all our x vectors ready, we train our algorithm and, finally, we can use it to classify whether an email is spam or not.
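As a small illustration, a sketch of building one such binary feature vector in Octave (vocabList and wordIndices are hypothetical variables: the chosen vocabulary and the vocabulary indices of the words found in a single email):

n = length(vocabList);       % e.g. the 10,000 to 50,000 most frequent words
x = zeros(n, 1);
x(wordIndices) = 1;          % entry j is 1 if vocabulary word j appears in the email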


So how could you spend your time to improve the accuracy of this classifier?

 Collect lots of data (for example, a "honeypot" project, though this doesn't always work)
 Develop sophisticated features (for example: using email header data in spam emails)
 Develop algorithms to process your input in different ways (recognizing misspellings in
spam).
It is difficult to tell which of the options will be most helpful.

Error Analysis
The recommended approach to solving machine learning problems is to:

 Start with a simple algorithm, implement it quickly, and test it early on your cross validation
data.
 Plot learning curves to decide if more data, more features, etc. are likely to help.
 Manually examine the errors on examples in the cross validation set and try to spot a trend
where most of the errors were made.
For example, assume that we have 500 emails and our algorithm misclassifies 100 of them.
We could manually analyze the 100 emails and categorize them based on what type of emails
they are. We could then try to come up with new cues and features that would help us classify
these 100 emails correctly. Hence, if most of our misclassified emails are those which try to steal
passwords, then we could find some features that are particular to those emails and add them to our model. We could also see how classifying each word according to its root changes our error rate:

It is very important to get error results as a single, numerical value. Otherwise it is difficult to
assess your algorithm's performance. For example if we use stemming, which is the process of
treating the same word with different forms (fail/failing/failed) as one word (fail), and get a 3%
error rate instead of 5%, then we should definitely add it to our model. However, if we try to
distinguish between upper case and lower case letters and end up getting a 3.2% error rate
instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get
a numerical value for our error rate, and based on our result decide whether we want to keep the
new feature or not.
