Machine Learning
Machine learning grew out of work in artificial intelligence (AI). It develops new capabilities for computers.
Tom Mitchell gives a more formal definition: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Example: playing checkers.

E = the experience of playing many games of checkers.
T = the task of playing checkers.
P = the probability that the program will win the next game.
Machine learning problems fall into two main types:

supervised learning
unsupervised learning

Supervised Learning:
In supervised learning, we are given a data set and already know what our correct output should
look like, having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into "regression" and "classification" problems. In a
regression problem, we are trying to predict results within a continuous output, meaning that we are
trying to map input variables to some continuous function. In a classification problem, we are
instead trying to predict results in a discrete output. In other words, we are trying to map input
variables into discrete categories.
Example 1:
Given data about the size of houses on the real estate market, try to predict their price. Price as a
function of size is a continuous output, so this is a regression problem.
We could turn this example into a classification problem by instead making our output about
whether the house "sells for more or less than the asking price." Here we are classifying the houses
based on price into two discrete categories.
Example 2:
(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given
picture
(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant
or benign.
[Figure: the two types of supervised learning, regression and classification]
Unsupervised Learning:
Unsupervised learning allows us to approach problems with little or no idea what our results
should look like. We can derive structure from data where we don't necessarily know the effect of
the variables.
We can derive this structure by clustering the data based on relationships among the variables in
the data.
Example:
Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group
these genes into groups that are somehow similar or related by different variables, such as
lifespan, location, roles, and so on.
Non-clustering: The "Cocktail Party Algorithm", allows you to find structure in a chaotic
environment. (i.e. identifying individual voices and music from a mesh of sounds at a cocktail
party).
[Figure: the two types of unsupervised learning, clustering and non-clustering]
MODEL REPRESENTATION:
To establish notation for future use, we'll use x^{(i)} to denote the "input" variables (living area in this example), also called input features, and y^{(i)} to denote the "output" or target variable that we are trying to predict (price). A pair (x^{(i)}, y^{(i)}) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^{(i)}, y^{(i)}); i = 1, ..., m}, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y to denote the space of output values. In this example, X = Y = ℝ.
To describe the supervised learning problem slightly more formally, our goal is, given a training
set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of
y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is
therefore like this:
When the target variable that we’re trying to predict is continuous, such as in our housing
example, we call the learning problem a regression problem. When y can take on only a
small number of discrete values (such as if, given the living area, we wanted to predict if a
dwelling is a house or an apartment, say), we call it a classification problem.
COST FUNCTION:
We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from the x's and the actual output y's.

J(θ_0, θ_1) = (1/2m) Σ_{i=1}^{m} (ŷ_i - y_i)^2 = (1/2m) Σ_{i=1}^{m} (h_θ(x_i) - y_i)^2

To break it apart, it is (1/2) x̄ where x̄ is the mean of the squares of h_θ(x_i) - y_i, or the difference between the predicted value and the actual value.

This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved (1/2) as a convenience for the computation of gradient descent, as the derivative term of the square function will cancel out the 1/2 term.
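As a minimal sketch in Octave, this cost function could be computed as follows (the design matrix X with a leading column of ones, the output vector y, and the function name computeCost are assumptions for illustration, not something these notes define):

function J = computeCost(X, y, theta)
  % Squared error cost for linear regression
  % X is m x 2 with a leading column of ones, y is m x 1, theta is 2 x 1
  m = length(y);
  predictions = X * theta;              % h_theta(x) for every example
  sqrErrors = (predictions - y) .^ 2;
  J = (1 / (2 * m)) * sum(sqrErrors);
end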
Our objective is to get the best possible line. The best possible line will be such that the average squared vertical distances of the scattered points from the line is the least. Ideally, the line should pass through all the points of our training data set. In such a case, the value of J(θ_0, θ_1) will be 0. The following example shows the ideal situation where we have a cost function of 0.

When θ_1 = 1, we get a slope of 1 which goes through every single data point in our model. Conversely, when θ_1 = 0.5, we see the vertical distance from our fit to the data points increase.
This increases our cost function to 0.58. Plotting several other points yields the following graph:

Thus as a goal, we should try to minimize the cost function. In this case, θ_1 = 1 is our global minimum.
Taking any color and going along the 'circle', one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for J(θ_0, θ_1) and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when θ_0 = 800 and θ_1 = -0.15. Taking another h(x) and plotting its contour plot, one gets the following graphs:
The graph above minimizes the cost function as much as possible and consequently, the resulting values of θ_1 and θ_0 tend to be around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to put our point in the center of the innermost 'circle'.
Gradient Descent
So we have our hypothesis function and we have a way of measuring how well it fits into the
data. Now we need to estimate the parameters in the hypothesis function. That's where gradient
descent comes in.
Imagine that we graph our hypothesis function based on its fields θ_0 and θ_1 (actually we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put θ_0 on the x axis and θ_1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.
We will know that we have succeeded when our cost function is at the very bottom of the pits in
our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the
graph.
The way we do this is by taking the derivative (the tangential line to a function) of our cost
function. The slope of the tangent is the derivative at that point and it will give us a direction to
move towards. We make steps down the cost function in the direction with the steepest descent.
The size of each step is determined by the parameter α, which is called the learning rate.
For example, the distance between each 'star' in the graph above represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(θ_0, θ_1). Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.
The gradient descent algorithm is:

repeat until convergence: {
  θ_j := θ_j - α ∂/∂θ_j J(θ_0, θ_1)
}

where j = 0, 1 represents the feature index number.
At each iteration j, one should simultaneously update the parameters θ_1, θ_2, ..., θ_n. Updating a specific parameter prior to calculating another one on the j-th iteration would yield a wrong implementation.
On a side note, we should adjust our parameter α to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too much time to obtain the minimum value, implies that our step size is wrong.
How does gradient descent converge with a fixed step size α?

The intuition behind the convergence is that d/dθ_1 J(θ_1) approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get:

θ_1 := θ_1 - α * 0
When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:

repeat until convergence: {
  θ_0 := θ_0 - α (1/m) Σ_{i=1}^{m} (h_θ(x_i) - y_i)
  θ_1 := θ_1 - α (1/m) Σ_{i=1}^{m} ((h_θ(x_i) - y_i) x_i)
}

where m is the size of the training set, θ_0 is a constant that will be changing simultaneously with θ_1, and x_i, y_i are values of the given training set (data).

Note that we have separated out the two cases for θ_j into separate equations for θ_0 and θ_1; and that for θ_1 we are multiplying x_i at the end due to the derivative. The following is a derivation of ∂/∂θ_j J(θ) for a single example:
The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply
these gradient descent equations, our hypothesis will become more and more accurate.
So, this is simply gradient descent on the original cost function J. This method looks at every
example in the entire training set on every step, and is called batch gradient descent. Note that,
while gradient descent can be susceptible to local minima in general, the optimization problem
we have posed here for linear regression has only one global, and no other local, optima; thus
gradient descent always converges (assuming the learning rate α is not too large) to the global
minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it
is run to minimize a quadratic function.
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory
taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by
straight lines) mark the successive values of θ that gradient descent went through as it
converged to its minimum.
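As an illustration, a minimal Octave sketch of batch gradient descent for linear regression might look like the following (it reuses the computeCost sketch from earlier; X, y, alpha and num_iters are assumed names, not defined in these notes):

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  % Batch gradient descent: every step uses all m training examples
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    errors = X * theta - y;                          % h_theta(x) - y for every example
    theta = theta - (alpha / m) * (X' * errors);     % simultaneous update of all theta_j
    J_history(iter) = computeCost(X, y, theta);      % track the cost to check convergence
  end
end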
Matrices and Vectors

Matrices are 2-dimensional arrays:

[a b c; d e f; g h i; j k l]

The above matrix has four rows and three columns, so it is a 4 x 3 matrix.

A vector is a matrix with one column and many rows:

[w; x; y; z]

So vectors are a subset of matrices. The above vector is a 4 x 1 matrix.
A_ij refers to the element in the ith row and jth column of matrix A.
A vector with 'n' rows is referred to as an 'n'-dimensional vector.
v_i refers to the element in the ith row of the vector.
In general, all our vectors and matrices will be 1-indexed. Note that for some programming languages, the arrays are 0-indexed.
Matrices are usually denoted by uppercase names while vectors are lowercase.
"Scalar" means that an object is a single value, not a vector or matrix.
ℝ refers to the set of scalar real numbers.
ℝ^n refers to the set of n-dimensional vectors of real numbers.
Adding Matrices:

[a b; c d] + [w x; y z] = [a+w b+x; c+y d+z]
Subtracting Matrices:
[a b; c d] - [w x; y z] = [a-w b-x; c-y d-z]

Scalar Multiplication:

[a b; c d] * x = [a*x b*x; c*x d*x]

Scalar Division:

[a b; c d] / x = [a/x b/x; c/x d/x]
Experiment below with the Octave/Matlab commands for matrix addition and scalar multiplication. Feel free to try out different commands. Try to write out your answers for each command before running them.
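A small set of commands to experiment with might look like the following sketch (the specific matrices and the scalar are made-up examples):

% Initialize two 3x2 matrices and a scalar
A = [1 2; 4 5; 7 8];
B = [1 1; 2 2; 3 3];
s = 2;

add_AB = A + B        % element-wise addition
sub_AB = A - B        % element-wise subtraction
mult_As = A * s       % multiply every element by the scalar
div_As = A / s        % divide every element by the scalar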
Matrix-Vector Multiplication
We map the column of the vector onto each row of the matrix, multiplying each element and
summing the result.
[a b; c d; e f] * [x; y] = [a*x + b*y; c*x + d*y; e*x + f*y]
The result is a vector. The number of columns of the matrix must equal the number of rows of
the vector.
Matrix-Matrix Multiplication
We multiply two matrices by breaking it into several vector multiplications and concatenating the
result.
[a b; c d; e f] * [w x; y z] = [a*w + b*y, a*x + b*z; c*w + d*y, c*x + d*z; e*w + f*y, e*x + f*z]
To multiply two matrices, the number of columns of the first matrix must equal the number of
rows of the second matrix.
The identity matrix I has 1's on the diagonal and 0's everywhere else:

[1 0 0; 0 1 0; 0 0 1]
When multiplying the identity matrix after some matrix (A∗I), the square identity matrix's
dimension should match the other matrix's columns. When multiplying the identity matrix before
some other matrix (I∗A), the square identity matrix's dimension should match the other matrix's
rows.
A non-square matrix does not have an inverse matrix. We can compute inverses of matrices in Octave with the pinv(A) function and in Matlab with the inv(A) function. Matrices that don't have an inverse are singular or degenerate.

The transposition of a matrix is like rotating the matrix 90° in a clockwise direction and then reversing it. We can compute the transposition of matrices in Matlab with the transpose(A) function or A':
A = [a b; c d; e f]

A^T = [a c e; b d f]
In other words:

A_ij = (A^T)_ji
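The following Octave sketch ties these operations together on small made-up matrices:

A = [1 2; 3 4];
B = [5 6; 7 8];

AB = A * B            % matrix-matrix multiplication
I = eye(2);           % 2x2 identity matrix
A * I                 % returns A unchanged
A_inv = pinv(A);      % (pseudo-)inverse of A
A * A_inv             % approximately the identity matrix
A_trans = A'          % transpose of A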
Multiple Features
Linear regression with multiple variables is also known as "multivariate linear regression".
We now introduce notation for equations where we can have any number of input variables.
h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_3 + ... + θ_n x_n
In order to develop intuition about this function, we can think about θ_0 as the basic price of a house, θ_1 as the price per square meter, θ_2 as the price per floor, etc. x_1 will be the number of square meters in the house, x_2 the number of floors, etc.
Using the definition of matrix multiplication, our multivariable hypothesis function can be
concisely represented as:
h_θ(x) = [θ_0 θ_1 ... θ_n] [x_0; x_1; ...; x_n] = θ^T x
This is a vectorization of our hypothesis function for one training example; see the lessons on
vectorization to learn more.
Remark: Note that for convenience reasons in this course we assume x_0^{(i)} = 1 for (i ∈ 1, ..., m). This allows us to do matrix operations with theta and x, making the two vectors θ and x^{(i)} match each other element-wise (that is, have the same number of elements: n+1).
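As a small Octave sketch of this vectorized hypothesis (the numbers are made up purely for illustration):

% Each row of X is one training example [1, x1, ..., xn]; theta is (n+1) x 1
X = [1 2104 5; 1 1416 3; 1 1534 3];
theta = [80; 0.1; 25];
predictions = X * theta      % h_theta(x) for every training example at once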
Gradient Descent for Multiple Variables

The gradient descent equation is the same form as before; we just repeat it for all n features:

repeat until convergence: {
  θ_0 := θ_0 - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) · x_0^{(i)}
  θ_1 := θ_1 - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) · x_1^{(i)}
  θ_2 := θ_2 - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) · x_2^{(i)}
  ...
}
In other words:
The following image compares gradient descent with one variable to gradient descent with
multiple variables:
We can speed up gradient descent by having each of our input values in roughly the same
range. This is because θ will descend quickly on small ranges and slowly on large ranges, and
so will oscillate inefficiently down to the optimum when the variables are very uneven.
The way to prevent this is to modify the ranges of our input variables so that they are all roughly
the same. Ideally:
-1 ≤ x_i ≤ 1

or

-0.5 ≤ x_i ≤ 0.5
These aren't exact requirements; we are only trying to speed things up. The goal is to get all
input variables into roughly one of these ranges, give or take a few.
Two techniques to help with this are feature scaling and mean normalization. Feature scaling
involves dividing the input values by the range (i.e. the maximum value minus the minimum
value) of the input variable, resulting in a new range of just 1. Mean normalization involves
subtracting the average value for an input variable from the values for that input variable resulting
in a new average value for the input variable of just zero. To implement both of these techniques,
adjust your input values as shown in this formula:

x_i := (x_i - μ_i) / s_i

Where μ_i is the average of all the values for feature (i) and s_i is the range of values (max - min), or s_i is the standard deviation.
Note that dividing by the range, or dividing by the standard deviation, gives different results. The quizzes in this course use the range; the programming exercises use the standard deviation.
For example, if x_i represents housing prices with a range of 100 to 2000 and a mean value of 1000, then:

x_i := (price - 1000) / 1900
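A minimal Octave sketch of mean normalization over a feature matrix X (one column per feature, the leading column of ones excluded; the function name is an assumption):

function [X_norm, mu, sigma] = featureNormalize(X)
  % Subtract each feature's mean and divide by its standard deviation
  mu = mean(X);                 % 1 x n row vector of feature means
  sigma = std(X);               % 1 x n row vector of feature standard deviations
  X_norm = (X - mu) ./ sigma;   % broadcast over the rows of X
end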
Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the
cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then
you probably need to decrease α.
Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 10^{-3}. However, in practice it's difficult to choose this threshold value.
It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every
iteration.
To summarize:

If α is too small: slow convergence.
If α is too large: J(θ) may not decrease on every iteration and thus may not converge.
We can combine multiple features into one. For example, we can combine x_1 and x_2 into a new feature x_3 by taking x_1 · x_2.
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit the data well.
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form). For example, if our hypothesis function is h_θ(x) = θ_0 + θ_1 x_1, then we can create additional features based on x_1 to get the quadratic function h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_1^2 or the cubic function h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_1^2 + θ_3 x_1^3.

In the cubic version, we have created new features x_2 and x_3 where x_2 = x_1^2 and x_3 = x_1^3.
To make it a square root function, we could do: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 √(x_1)
One important thing to keep in mind is that if you choose your features this way, then feature scaling becomes very important.
e.g. if x_1 has range 1 - 1000, then the range of x_1^2 becomes 1 - 1000000 and that of x_1^3 becomes 1 - 1000000000.
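As a sketch, creating polynomial features and then scaling them in Octave might look like this (x1 is an assumed column vector holding the original feature):

x1 = [1; 3; 10; 50; 1000];        % made-up values of the original feature
X_poly = [x1, x1.^2, x1.^3];      % new features x2 = x1^2 and x3 = x1^3
% Feature scaling matters here: the columns now have wildly different ranges
mu = mean(X_poly);
sigma = std(X_poly);
X_poly = (X_poly - mu) ./ sigma;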
Normal Equation
Note: [8:00 to 8:44 - The design matrix X (in the bottom right side of the slide) given in the example should have elements x with subscript 1 and superscripts varying from 1 to m, because for all m training sets there are only 2 features x_0 and x_1. 12:56 - The X matrix is m by (n+1) and NOT n by n.]
Gradient descent gives one way of minimizing J. Let’s discuss a second way of doing so, this
time performing the minimization explicitly and without resorting to an iterative algorithm. In the
"Normal Equation" method, we will minimize J by explicitly taking its derivatives with respect to
the θ_j's, and setting them to zero. This allows us to find the optimum theta without iteration. The normal equation formula is given below:

θ = (X^T X)^{-1} X^T y
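In Octave this is a one-liner; a small sketch, assuming X is the design matrix with a leading column of ones and y is the vector of outputs:

% No feature scaling and no learning rate are needed with the normal equation
theta = pinv(X' * X) * X' * y;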
When implementing the normal equation, X^T X may be noninvertible. The common causes are:

Redundant features, where two features are very closely related (i.e. they are linearly dependent).

Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization" (to be explained in a later lesson).
Solutions to the above problems include deleting a feature that is linearly dependent with another
or deleting one or more features when there are too many features.
Classification:
To attempt classification, one method is to use linear regression and map all predictions greater
than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn't work well because
classification is not actually a linear function.
The classification problem is just like the regression problem, except that the values we now
want to predict take on only a small number of discrete values. For now, we will focus on the
binary classification problem in which y can take on only two values, 0 and 1. (Most of what
we say here will also generalize to the multiple-class case.) For instance, if we are trying to build
a spam classifier for email, then x^{(i)} may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. Hence, y ∈ {0,1}. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols "-" and "+." Given x^{(i)}, the corresponding y^{(i)} is also called the label for the training example.
Hypothesis Representation
We could approach the classification problem ignoring the fact that y is discrete-valued, and use
our old linear regression algorithm to try to predict y given x. However, it is easy to construct
examples where this method performs very poorly. Intuitively, it also doesn't make sense for h_θ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, let's change the form for our hypotheses h_θ(x) to satisfy 0 ≤ h_θ(x) ≤ 1. This is accomplished by plugging θ^T x into the Logistic Function.
Our new form uses the "Sigmoid Function," also called the "Logistic Function":
h_θ(x) = g(θ^T x)
z = θ^T x
g(z) = 1 / (1 + e^(-z))
The following image shows us what the sigmoid function looks like:
The function g(z), shown here, maps any real number to the (0, 1) interval, making it useful for
transforming an arbitrary-valued function into a function better suited for classification.
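A vectorized Octave sketch of this function (it works element-wise on scalars, vectors, or matrices):

function g = sigmoid(z)
  % Logistic (sigmoid) function applied element-wise
  g = 1 ./ (1 + exp(-z));
end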
h_θ(x) will give us the probability that our output is 1. For example, h_θ(x) = 0.7 gives us a probability of 70% that our output is 1. Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if the probability that it is 1 is 70%, then the probability that it is 0 is 30%).

h_θ(x) = P(y=1 | x; θ) = 1 - P(y=0 | x; θ)
P(y=0 | x; θ) + P(y=1 | x; θ) = 1
Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis
function as follows:
h_θ(x) ≥ 0.5 → y = 1
h_θ(x) < 0.5 → y = 0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its
output is greater than or equal to 0.5:
g(z) ≥ 0.5 when z ≥ 0
Remember:

z = 0, e^0 = 1 ⇒ g(z) = 1/2
z → ∞, e^(-∞) → 0 ⇒ g(z) = 1
z → -∞, e^(∞) → ∞ ⇒ g(z) = 0
So if our input to g is θ^T X, then that means:

h_θ(x) = g(θ^T x) ≥ 0.5 when θ^T x ≥ 0
From these statements we can now say:
θ^T x ≥ 0 ⇒ y = 1
θ^T x < 0 ⇒ y = 0
The decision boundary is the line that separates the area where y = 0 and where y = 1. It is
created by our hypothesis function.
Example:
θ = [5; -1; 0]

y = 1 if 5 + (-1)x_1 + 0x_2 ≥ 0
5 - x_1 ≥ 0
-x_1 ≥ -5
x_1 ≤ 5

In this case, our decision boundary is a straight vertical line placed on the graph where x_1 = 5, and everything to the left of that denotes y = 1, while everything to the right denotes y = 0.
Again, the input to the sigmoid function g(z) (e.g. θ^T X) doesn't need to be linear, and could be a function that describes a circle (e.g. z = θ_0 + θ_1 x_1^2 + θ_2 x_2^2) or any shape to fit our data.
Cost Function
We cannot use the same cost function that we use for linear regression because the Logistic
Function will cause the output to be wavy, causing many local optima. In other words, it will not
be a convex function.
Instead, our cost function for logistic regression looks like:

J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^{(i)}), y^{(i)})
Cost(h_θ(x), y) = -log(h_θ(x))       if y = 1
Cost(h_θ(x), y) = -log(1 - h_θ(x))   if y = 0
Similarly, when y = 0, we get the following plot for J(θ) vs h_θ(x):
Cost(h_θ(x), y) = 0 if h_θ(x) = y
Cost(h_θ(x), y) → ∞ if y = 0 and h_θ(x) → 1
Cost(h_θ(x), y) → ∞ if y = 1 and h_θ(x) → 0
If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also
outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If
our hypothesis approaches 0, then the cost function will approach infinity.
Note that writing the cost function in this way guarantees that J(θ) is convex for logistic
regression.
We can compress our cost function's two conditional cases into one case:
Cost(h_θ(x), y) = -y log(h_θ(x)) - (1 - y) log(1 - h_θ(x))

We can fully write out our entire cost function as follows:

J(θ) = -(1/m) Σ_{i=1}^{m} [ y^{(i)} log(h_θ(x^{(i)})) + (1 - y^{(i)}) log(1 - h_θ(x^{(i)})) ]

A vectorized implementation is:

h = g(Xθ)
J(θ) = (1/m) · (-y^T log(h) - (1 - y)^T log(1 - h))
Gradient Descent
The general form of gradient descent is:

Repeat {
  θ_j := θ_j - α ∂/∂θ_j J(θ)
}

Working out the derivative part, we get:

Repeat {
  θ_j := θ_j - (α/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)}
}
Notice that this algorithm is identical to the one we used in linear regression. We still have to
simultaneously update all values in theta.
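A minimal vectorized Octave sketch of one such update step, using the sigmoid function defined earlier (X, y, theta and alpha are assumed to already exist):

m = length(y);
h = sigmoid(X * theta);                                 % hypothesis for all examples
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));     % vectorized cost
grad = (1/m) * (X' * (h - y));                          % vector of partial derivatives
theta = theta - alpha * grad;                           % simultaneous update of all theta_j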
Advanced Optimization
Note: [7:35 - '100' should be 100 instead. The value provided should be an integer and not a
character string.]
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ
that can be used instead of gradient descent. We suggest that you should not write these more
sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the
libraries instead, as they're already tested and highly optimized. Octave provides them.
We first need to provide a function that evaluates the following two functions for a given input
value θ:
J(θ)
∂/∂θ_j J(θ)
We can write a single function that returns both of these:
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end
Then we can use octave's "fminunc()" optimization algorithm along with the "optimset()" function
that creates an object containing the options we want to send to "fminunc()". (Note: the value for
MaxIter should be an integer, not a character string - errata in the video at 7:30)
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
We give to the function "fminunc()" our cost function, our initial vector of theta values, and the
"options" object that we created beforehand.
The Problem of Overfitting

Consider the problem of predicting y from x ∈ ℝ. The leftmost figure below shows the result of fitting y = θ_0 + θ_1 x to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good.

Instead, if we had added an extra feature x^2, and fit y = θ_0 + θ_1 x + θ_2 x^2, then we obtain a slightly better fit to the data (see middle figure). Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features: the rightmost figure is the result of fitting a 5th order polynomial y = Σ_{j=0}^{5} θ_j x^j. We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor of, say, housing prices (y) for different living areas (x). Without formally defining what these terms mean, we'll say the figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting.
Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend
of the data. It is usually caused by a function that is too simple or uses too few features. At the
other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the
available data but does not generalize well to predict new data. It is usually caused by a
complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:

1) Reduce the number of features: manually select which features to keep, or use a model selection algorithm (studied later in the course).

2) Regularization: keep all the features, but reduce the magnitude of the parameters θ_j.
Cost Function
Note: [5:18 - There is a typo. It should be Σ_{j=1}^{n} θ_j^2 instead of Σ_{i=1}^{n} θ_j^2]
If we have overfitting from our hypothesis function, we can reduce the weight that some of the
terms in our function carry by increasing their cost.
We'll want to eliminate the influence of θ_3 x^3 and θ_4 x^4. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:

min_θ (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)})^2 + 1000·θ_3^2 + 1000·θ_4^2
We've added two extra terms at the end to inflate the cost of θ_3 and θ_4. Now, in order for the cost function to get close to zero, we will have to reduce the values of θ_3 and θ_4 to near zero. This will in turn greatly reduce the values of θ_3 x^3 and θ_4 x^4 in our hypothesis function. As a result, we see that the new hypothesis (depicted by the pink curve) looks like a quadratic function but fits the data better due to the extra small terms θ_3 x^3 and θ_4 x^4.
We could also regularize all of our theta parameters in a single summation as:

min_θ (1/2m) [ Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)})^2 + λ Σ_{j=1}^{n} θ_j^2 ]

The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting. Hence, what would happen if λ = 0 or is too small?
We can apply regularization to both linear regression and logistic regression. We will approach
linear regression first.
Gradient Descent
We will modify our gradient descent function to separate out θ_0 from the rest of the parameters because we do not want to penalize θ_0.
Repeat {
  θ_0 := θ_0 - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_0^{(i)}
  θ_j := θ_j - α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)} + (λ/m) θ_j ],   j ∈ {1, 2, ..., n}
}

With some manipulation, the second update rule can also be represented as:

θ_j := θ_j (1 - α λ/m) - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)}

The first term in the above equation, 1 - α λ/m, will always be less than 1. Intuitively you can see it as reducing the value of θ_j by some amount on every update. Notice that the second term is now exactly the same as it was before.
Normal Equation
Now let's approach regularization using the alternate method of the non-iterative normal
equation.
To add in regularization, the equation is the same as our original, except that we add another
term inside the parentheses:
θ = (X^T X + λ·L)^{-1} X^T y,   where L = diag(0, 1, 1, ..., 1)
L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should
have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including
x_0), multiplied with a single real number λ.
Recall that if m < n, then X^T X is non-invertible. However, when we add the term λ·L, then X^T X + λ·L becomes invertible.
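A small Octave sketch of this regularized normal equation (n, X, y and lambda are assumed to be defined):

L = eye(n + 1);
L(1, 1) = 0;                                 % do not regularize the theta_0 term
theta = pinv(X' * X + lambda * L) * X' * y;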
Regularized Logistic Regression

Cost Function
We can regularize the logistic regression cost function by adding a term to the end:

J(θ) = -(1/m) Σ_{i=1}^{m} [ y^{(i)} log(h_θ(x^{(i)})) + (1 - y^{(i)}) log(1 - h_θ(x^{(i)})) ] + (λ/2m) Σ_{j=1}^{n} θ_j^2
The second sum, Σ_{j=1}^{n} θ_j^2, means to explicitly exclude the bias term, θ_0. I.e. the θ vector is indexed from 0 to n (holding n+1 values, θ_0 through θ_n), and this sum explicitly skips θ_0 by running from 1 to n. Thus, when computing the gradient descent updates, we should continuously update the two following equations:

θ_0 := θ_0 - α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_0^{(i)}
θ_j := θ_j - α [ (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)} + (λ/m) θ_j ],   j ∈ {1, 2, ..., n}
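A minimal Octave sketch of this regularized cost and gradient (the names X, y and lambda are assumptions):

function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  reg = (lambda / (2*m)) * sum(theta(2:end) .^ 2);            % skip theta_0 in the penalty
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) + reg;
  grad = (1/m) * (X' * (h - y));
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);    % regularize all but theta_0
end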
Neural Networks

Model Representation I
Let's examine how we will represent a hypothesis function using neural networks. At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical inputs (called "spikes") that are channeled to outputs (axons). In our model, our dendrites are like the input features x_1 ... x_n, and the output is the result of our hypothesis function. In this model our x_0 input node is sometimes called the "bias unit." It is always equal to 1. In neural networks, we use the same logistic function as in classification, 1/(1 + e^(-θ^T x)), though here it is often called the sigmoid (logistic) activation function, and the "theta" parameters are sometimes called "weights".
Visually, a simplistic representation looks like:

[x_0; x_1; x_2] → [   ] → h_θ(x)
Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which
finally outputs the hypothesis function, known as the "output layer".
We can have intermediate layers of nodes between the input and output layers called the
"hidden layers."
In this example, we label these intermediate or "hidden" layer nodes a_0^{(2)} ... a_n^{(2)} and call them "activation units."
[x_0; x_1; x_2; x_3] → [a_1^{(2)}; a_2^{(2)}; a_3^{(2)}] → h_θ(x)
The values for each of the "activation" nodes are obtained as follows:

a_1^{(2)} = g(Θ_{10}^{(1)} x_0 + Θ_{11}^{(1)} x_1 + Θ_{12}^{(1)} x_2 + Θ_{13}^{(1)} x_3)
a_2^{(2)} = g(Θ_{20}^{(1)} x_0 + Θ_{21}^{(1)} x_1 + Θ_{22}^{(1)} x_2 + Θ_{23}^{(1)} x_3)
a_3^{(2)} = g(Θ_{30}^{(1)} x_0 + Θ_{31}^{(1)} x_1 + Θ_{32}^{(1)} x_2 + Θ_{33}^{(1)} x_3)
h_Θ(x) = a_1^{(3)} = g(Θ_{10}^{(2)} a_0^{(2)} + Θ_{11}^{(2)} a_1^{(2)} + Θ_{12}^{(2)} a_2^{(2)} + Θ_{13}^{(2)} a_3^{(2)})
This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We
apply each row of the parameters to our inputs to obtain the value for one activation node. Our
hypothesis output is the logistic function applied to the sum of the values of our activation nodes,
which have been multiplied by yet another parameter matrix \Theta^{(2)}Θ(2) containing the
weights for our second layer of nodes.
If a network has s_j units in layer j and s_{j+1} units in layer j+1, then Θ^{(j)} will be of dimension s_{j+1} × (s_j + 1).
The +1 comes from the addition in Θ^{(j)} of the "bias nodes," x_0 and Θ_0^{(j)}. In other words the output nodes will not include the bias nodes while the inputs will. The following image summarizes our model representation:
Example: If layer 1 has 2 input nodes and layer 2 has 4 activation nodes, then the dimension of Θ^{(1)} is going to be 4×3, where s_j = 2 and s_{j+1} = 4, so s_{j+1} × (s_j + 1) = 4 × 3.
Model Representation II
To re-iterate, the following is an example of a neural network:
a_1^{(2)} = g(Θ_{10}^{(1)} x_0 + Θ_{11}^{(1)} x_1 + Θ_{12}^{(1)} x_2 + Θ_{13}^{(1)} x_3)
a_2^{(2)} = g(Θ_{20}^{(1)} x_0 + Θ_{21}^{(1)} x_1 + Θ_{22}^{(1)} x_2 + Θ_{23}^{(1)} x_3)
a_3^{(2)} = g(Θ_{30}^{(1)} x_0 + Θ_{31}^{(1)} x_1 + Θ_{32}^{(1)} x_2 + Θ_{33}^{(1)} x_3)
h_Θ(x) = a_1^{(3)} = g(Θ_{10}^{(2)} a_0^{(2)} + Θ_{11}^{(2)} a_1^{(2)} + Θ_{12}^{(2)} a_2^{(2)} + Θ_{13}^{(2)} a_3^{(2)})
In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable z_k^{(j)} that encompasses the parameters inside our g function. In our previous example, if we replaced the parameters by the variable z, we would get:
a_1^{(2)} = g(z_1^{(2)})
a_2^{(2)} = g(z_2^{(2)})
a_3^{(2)} = g(z_3^{(2)})
In other words, for layer j = 2 and node k, the variable z will be:

z_k^{(2)} = Θ_{k,0}^{(1)} x_0 + Θ_{k,1}^{(1)} x_1 + ... + Θ_{k,n}^{(1)} x_n

The vector representation of x and z^{(j)} is:

x = [x_0; x_1; ...; x_n]
z^{(j)} = [z_1^{(j)}; z_2^{(j)}; ...; z_n^{(j)}]
Setting x = a^{(1)}, we can rewrite the equation as:

z^{(j)} = Θ^{(j-1)} a^{(j-1)}
We are multiplying our matrix Θ^{(j-1)} with dimensions s_j × (n+1) (where s_j is the number of our activation nodes) by our vector a^{(j-1)} with height (n+1). This gives us our vector z^{(j)} with height s_j. Now we can get a vector of our activation nodes for layer j as follows:

a^{(j)} = g(z^{(j)})
We can then add a bias unit (equal to 1) to layer j after we have computed a^{(j)}. This will be element a_0^{(j)} and will be equal to 1. To compute our final hypothesis, let's first compute another z vector:

z^{(j+1)} = Θ^{(j)} a^{(j)}
We get this final z vector by multiplying the next theta matrix after Θ^{(j-1)} with the values of all the activation nodes we just got. This last theta matrix Θ^{(j)} will have only one row which is multiplied by one column a^{(j)} so that our result is a single number. We then get our final result with:

h_Θ(x) = a^{(j+1)} = g(z^{(j+1)})
Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing
as we did in logistic regression. Adding all these intermediate layers in neural networks allows us
to more elegantly produce interesting and more complex non-linear hypotheses.
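Putting these vectorized steps together, forward propagation for the 3-layer network above could look like this Octave sketch (Theta1, Theta2 and the input column vector x are assumed to be given, and sigmoid is the function defined earlier):

a1 = [1; x];              % add the bias unit to the input layer
z2 = Theta1 * a1;         % z^(2) = Theta^(1) a^(1)
a2 = [1; sigmoid(z2)];    % layer-2 activations plus their bias unit
z3 = Theta2 * a2;         % z^(3) = Theta^(2) a^(2)
h  = sigmoid(z3);         % h_Theta(x) = a^(3)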
Examples and Intuitions I

A simple example of applying neural networks is predicting x_1 AND x_2, the logical 'and' operator, which is true only if both x_1 and x_2 are 1. The graph of our functions will look like:

[x_0; x_1; x_2] → [g(z^{(2)})] → h_Θ(x)
Remember that x_0 is our bias variable and is always 1.
Θ^{(1)} = [-30 20 20]
This will cause the output of our hypothesis to only be positive if both x_1 and x_2 are 1. In other words:
h_Θ(x) = g(-30 + 20x_1 + 20x_2)

x_1 = 0 and x_2 = 0 then g(-30) ≈ 0
x_1 = 0 and x_2 = 1 then g(-10) ≈ 0
x_1 = 1 and x_2 = 0 then g(-10) ≈ 0
x_1 = 1 and x_2 = 1 then g(10) ≈ 1
So we have constructed one of the fundamental operations in computers by using a small neural network rather than using an actual AND gate. Neural networks can also be used to simulate all the other logical gates. The following is an example of the logical operator 'OR', meaning either x_1 is true or x_2 is true, or both:
Examples and Intuitions II
The Θ^{(1)} matrices for AND, NOR, and OR are:
AND:  Θ^{(1)} = [-30 20 20]
NOR:  Θ^{(1)} = [10 -20 -20]
OR:   Θ^{(1)} = [-10 20 20]
We can combine these to get the XNOR logical operator (which gives 1 if x_1 and x_2 are both 0 or both 1).
[x_0; x_1; x_2] → [a_1^{(2)}; a_2^{(2)}] → [a^{(3)}] → h_Θ(x)
For the transition between the first and second layer, we'll use a Θ^{(1)} matrix that combines the values for AND and NOR:
Θ^{(1)} = [-30 20 20; 10 -20 -20]
For the transition between the second and third layer, we'll use a Θ^{(2)} matrix that uses the value for OR:
Θ^{(2)} = [-10 20 20]
a^{(2)} = g(Θ^{(1)} · x)
a^{(3)} = g(Θ^{(2)} · a^{(2)})
h_Θ(x) = a^{(3)}
And there we have the XNOR operator using a hidden layer with two nodes! The following
summarizes the above algorithm:
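A tiny Octave sketch that evaluates this XNOR network on all four input combinations, using the Theta matrices given above and the sigmoid function from earlier:

Theta1 = [-30 20 20; 10 -20 -20];   % AND in the first row, NOR in the second
Theta2 = [-10 20 20];               % OR applied to the two hidden units
for x1 = 0:1
  for x2 = 0:1
    a2 = sigmoid(Theta1 * [1; x1; x2]);   % hidden layer activations
    h  = sigmoid(Theta2 * [1; a2]);       % output approximates x1 XNOR x2
    printf('x1=%d x2=%d -> h=%.4f\n', x1, x2, h);
  end
end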
Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values.
Say we wanted to classify our data into one of four categories. We will use the following example
to see how this classification is done. This algorithm takes as input an image and classifies it
accordingly:
Each y^{(i)} represents a different image corresponding to either a car, pedestrian, truck, or motorcycle. The inner layers each provide us with some new information which leads to our final hypothesis function. The setup looks like:
Our resulting hypothesis for one set of inputs may look like:
h_Θ(x) = [0; 0; 1; 0]
In which case our resulting class is the third one down, or h_Θ(x)_3, which represents the motorcycle.
Cost Function
Let's first define a few variables that we will need to use:

L = total number of layers in the network
s_l = number of units (not counting the bias unit) in layer l
K = number of output units/classes
Recall that in neural networks, we may have many output nodes. We denote h_Θ(x)_k as being a hypothesis that results in the k-th output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:

J(θ) = -(1/m) Σ_{i=1}^{m} [ y^{(i)} log(h_θ(x^{(i)})) + (1 - y^{(i)}) log(1 - h_θ(x^{(i)})) ] + (λ/2m) Σ_{j=1}^{n} θ_j^2

For neural networks, it is going to be slightly more complicated:

J(Θ) = -(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} [ y_k^{(i)} log((h_Θ(x^{(i)}))_k) + (1 - y_k^{(i)}) log(1 - (h_Θ(x^{(i)}))_k) ] + (λ/2m) Σ_{l=1}^{L-1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_{j,i}^{(l)})^2
We have added a few nested summations to account for our multiple output nodes. In the first
part of the equation, before the square brackets, we have an additional nested summation that
loops through the number of output nodes.
In the regularization part, after the square brackets, we must account for multiple theta matrices.
The number of columns in our current theta matrix is equal to the number of nodes in our current
layer (including the bias unit). The number of rows in our current theta matrix is equal to the
number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we
square every term.
Note:
the double sum simply adds up the logistic regression costs calculated for each cell in the
output layer
the triple sum simply adds up the squares of all the individual Θs in the entire network.
the i in the triple sum does not refer to training example i
Backpropagation Algorithm
"Backpropagation" is neural-network terminology for minimizing our cost function, just like what
we were doing with gradient descent in logistic and linear regression. Our goal is to compute:
\min_\Theta J(\Theta)minΘJ(Θ)
That is, we want to minimize our cost function J using an optimal set of parameters in theta. In
this section we'll look at the equations we use to compute the partial derivative of J(Θ):
\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta)∂Θi,j(l)∂J(Θ)
Back propagation algorithm. Given a training set {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}:

Set Δ_{i,j}^{(l)} := 0 for all (l, i, j) (hence you end up having a matrix full of zeros).

For training example t = 1 to m:

1. Set a^{(1)} := x^{(t)}
2. Perform forward propagation to compute a^{(l)} for l = 2, 3, ..., L
3. Using y^{(t)}, compute δ^{(L)} = a^{(L)} - y^{(t)}
Where L is our total number of layers and a^{(L)}a(L) is the vector of outputs of the activation
units for the last layer. So our "error values" for the last layer are simply the differences of our
actual results in the last layer and the correct outputs in y. To get the delta values of the layers
before the last layer, we can use an equation that steps us back from right to left:

δ^{(l)} = ((Θ^{(l)})^T δ^{(l+1)}) .* g'(z^{(l)})
The delta values of layer l are calculated by multiplying the delta values in the next layer with the
theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime,
which is the derivative of the activation function g evaluated with the input values given by
z^{(l)}z(l).
The capital-delta matrix D is used as an "accumulator" to add up our values as we go along and eventually compute our partial derivative. Thus we get:

∂J(Θ) / ∂Θ_{ij}^{(l)} = D_{ij}^{(l)}
Backpropagation Intuition
Note: [4:39, the last term for the calculation for z_1^{(3)} (three-color handwritten formula) should be a_2^{(2)} instead of a_1^{(2)}. 6:08 - the equation for cost(i) is incorrect. The first term is missing parentheses for the log() function, and the second term should be (1 - y^{(i)}) log(1 - h_θ(x^{(i)})). 8:50 - δ^{(4)} = y - a^{(4)} is incorrect and should be δ^{(4)} = a^{(4)} - y.]
Recall the cost function for a neural network:

J(Θ) = -(1/m) Σ_{t=1}^{m} Σ_{k=1}^{K} [ y_k^{(t)} log((h_Θ(x^{(t)}))_k) + (1 - y_k^{(t)}) log(1 - (h_Θ(x^{(t)}))_k) ] + (λ/2m) Σ_{l=1}^{L-1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_{j,i}^{(l)})^2
With neural networks, we are working with sets of matrices:

Θ^{(1)}, Θ^{(2)}, Θ^{(3)}, ...
D^{(1)}, D^{(2)}, D^{(3)}, ...
In order to use optimizing functions such as "fminunc()", we will want to "unroll" all the elements
and put them into one long vector:
If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11, then we can get back
our original matrices from the "unrolled" versions as follows:
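A sketch of the unrolling and the reverse reshaping in Octave, using the matrix names and sizes from the example above (D1, D2, D3 stand for the corresponding gradient matrices):

% Unroll the matrices into one long column vector
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];
deltaVector = [D1(:); D2(:); D3(:)];

% Get the original matrices back from the unrolled vector
Theta1 = reshape(thetaVector(1:110), 10, 11);
Theta2 = reshape(thetaVector(111:220), 10, 11);
Theta3 = reshape(thetaVector(221:231), 1, 11);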
To summarize: unroll the matrices into one long vector to pass them to the optimizer, then reshape them back into matrices inside the cost function.
Gradient Checking
Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:

∂J(Θ)/∂Θ ≈ (J(Θ + ε) - J(Θ - ε)) / (2ε)

With multiple theta matrices, we can approximate the derivative with respect to Θ_j as follows:

∂J(Θ)/∂Θ_j ≈ (J(Θ_1, ..., Θ_j + ε, ..., Θ_n) - J(Θ_1, ..., Θ_j - ε, ..., Θ_n)) / (2ε)
A small value for ε (epsilon), such as ε = 10^{-4}, guarantees that the math works out properly. If the value for ε is too small, we can end up with numerical problems.

Hence, we are only adding or subtracting epsilon to the Θ_j matrix. In Octave we can do it as follows:
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end;
We previously saw how to calculate the deltaVector. So once we compute our gradApprox
vector, we can check that gradApprox ≈ deltaVector.
Once you have verified that your backpropagation algorithm is correct, you don't need to compute gradApprox again. The code to compute gradApprox can be very slow.
Random Initialization
Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our Θ matrices using the following method:
If the dimensions of Theta1 are 10x11, Theta2 is 10x11 and Theta3 is 1x11, we can initialize them as follows:
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
rand(x,y) is just a function in octave that will initialize a matrix of random real numbers between 0
and 1.
(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
Putting it Together
First, pick a network architecture; choose the layout of your neural network, including how many
hidden units in each layer and how many layers in total you want to have.
for i = 1:m,
  Perform forward propagation and backpropagation using example (x^{(i)}, y^{(i)})
  (Get activations a^{(l)} and delta terms d^{(l)} for l = 2, ..., L)
The following image gives us an intuition of what is happening as we are implementing our
neural network:
Ideally, you want h_Θ(x^{(i)}) ≈ y^{(i)}. This will minimize our cost function. However, keep in mind that J(Θ) is not convex and thus we can end up in a local minimum instead.
Evaluating a Hypothesis
Once we have done some troubleshooting for errors in our predictions by:

Getting more training examples
Trying smaller sets of features
Trying additional features
Trying polynomial features
Increasing or decreasing λ

we can move on to evaluating our new hypothesis.
A hypothesis may have a low error for the training examples but still be inaccurate (because of
overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up
the data into two sets: a training set and a test set. Typically, the training set consists of 70 %
of your data and the test set is the remaining 30 %.
For classification problems we can use the misclassification error: err(h_Θ(x), y) = 1 if h_Θ(x) ≥ 0.5 and y = 0, or if h_Θ(x) < 0.5 and y = 1, and 0 otherwise. The average test error is then:

Test Error = (1/m_test) Σ_{i=1}^{m_test} err(h_Θ(x_test^{(i)}), y_test^{(i)})

This gives us the proportion of the test data that was misclassified.
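A quick Octave sketch of this misclassification rate on a test set (Xtest, ytest and a learned theta are assumed names, and the hypothesis is the logistic one used above):

predictions = sigmoid(Xtest * theta) >= 0.5;       % 1 where we predict the positive class
testError = mean(double(predictions ~= ytest));    % fraction of test examples misclassified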
Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis. It could overfit and as a result your predictions on the test set would be poor. The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any other data set.
Given many models with different polynomial degrees, we can use a systematic approach to identify
the 'best' function. In order to choose the model of your hypothesis, you can test each degree of
polynomial and look at the error result.
One way to break down our dataset into the three sets is:

Training set: 60%
Cross validation set: 20%
Test set: 20%
We can now calculate three separate error values for the three different sets using the following
method:
1. Optimize the parameters in Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set with J_test(Θ^{(d)}), where d is the degree of polynomial with the lowest error on the cross validation set.
This way, the degree of the polynomial d has not been trained using the test set.
The training error will tend to decrease as we increase the degree d of the polynomial. At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.
In the figure above, we see that as λ increases, our fit becomes more rigid. On the other hand, as λ approaches 0, we tend to overfit the data. So how do we choose our parameter λ to get it 'just right'? In order to choose the model and the regularization term λ, we need to:

1. Create a list of lambdas (e.g. λ ∈ {0, 0.01, 0.02, 0.04, 0.08, ..., 10.24}).
2. Create a set of models with different degrees or any other variants.
3. Iterate through the λs and for each λ go through all the models to learn some Θ.
4. Compute the cross validation error using the learned Θ (computed with λ) on J_CV(Θ) without regularization or λ = 0.
5. Select the best combo that produces the lowest error on the cross validation set.
6. Using the best combo Θ and λ, apply it on J_test(Θ) to see if it has a good generalization of the problem.
Learning Curves
Training an algorithm on a very small number of data points (such as 1, 2 or 3) will easily have 0 errors because we can always find a quadratic curve that touches exactly that number of points. Hence:
As the training set gets larger, the error for a quadratic function increases.
The error value will plateau out after a certain m, or training set size.
Experiencing high bias:

Low training set size: causes J_train(Θ) to be low and J_CV(Θ) to be high.
Large training set size: causes both J_train(Θ) and J_CV(Θ) to be high, with J_train(Θ) ≈ J_CV(Θ).

If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

Experiencing high variance:

Low training set size: J_train(Θ) will be low and J_CV(Θ) will be high.
Large training set size: J_train(Θ) increases with training set size and J_CV(Θ) continues to decrease without leveling off. Also, J_train(Θ) < J_CV(Θ) but the difference between them remains significant.
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
Lower-order polynomials (low model complexity) have high bias and low variance. In this
case, the model fits poorly consistently.
Higher-order polynomials (high model complexity) fit the training data extremely well and the
test data extremely poorly. These have low bias on the training data, but very high variance.
In reality, we would want to choose a model somewhere in between, that can generalize
well but also fits the data reasonably well.
Given a data set of emails, we could construct a vector for each email. Each entry in this vector
represents a word. The vector normally contains 10,000 to 50,000 entries gathered by finding the
most frequently used words in our data set. If a word is to be found in the email, we would assign
its respective entry a 1, else if it is not found, that entry would be a 0. Once we have all our x
vectors ready, we train our algorithm and finally, we could use it to classify whether an email is spam or not.
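A small Octave sketch of building one such binary feature vector (vocabList and emailWordIndices are hypothetical variables holding the dictionary and the indices of the dictionary words found in one email):

n = numel(vocabList);          % e.g. 10,000 to 50,000 dictionary words
x = zeros(n, 1);               % one entry per dictionary word
x(emailWordIndices) = 1;       % mark the words that appear in the email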
So how could you spend your time to improve the accuracy of this classifier?
Collect lots of data (for example the "honeypot" project, but it doesn't always work)
Develop sophisticated features (for example: using email header data in spam emails)
Develop algorithms to process your input in different ways (recognizing misspellings in
spam).
It is difficult to tell which of the options will be most helpful.
Error Analysis
The recommended approach to solving machine learning problems is to:
Start with a simple algorithm, implement it quickly, and test it early on your cross validation
data.
Plot learning curves to decide if more data, more features, etc. are likely to help.
Manually examine the errors on examples in the cross validation set and try to spot a trend
where most of the errors were made.
For example, assume that we have 500 emails and our algorithm misclassifies 100 of them.
We could manually analyze the 100 emails and categorize them based on what type of emails
they are. We could then try to come up with new cues and features that would help us classify
these 100 emails correctly. Hence, if most of our misclassified emails are those which try to steal
passwords, then we could find some features that are particular to those emails and add them to
our model. We could also see how classifying each word according to its root changes our error
rate:
It is very important to get error results as a single, numerical value. Otherwise it is difficult to
assess your algorithm's performance. For example if we use stemming, which is the process of
treating the same word with different forms (fail/failing/failed) as one word (fail), and get a 3%
error rate instead of 5%, then we should definitely add it to our model. However, if we try to
distinguish between upper case and lower case letters and end up getting a 3.2% error rate
instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get
a numerical value for our error rate, and based on our result decide whether we want to keep the
new feature or not.