Machine Learning Revision Notes
Material
March 1, 2016
Decision Trees
1. ID3
For a set S with 9 positive and 5 negative examples, the entropy is
\[
\mathrm{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940
\]
The information gain of an attribute A is
\[
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
\]
where S is the set of all examples at the node and S_v is the set of examples in the child node corresponding to a value v of A. The attribute with maximum information gain is selected at the root (a sketch of this computation follows the list below).
2. C4.5
(a) If the training data contains significant noise, the learned tree over-fits it.
(b) Over-fitting can be reduced by post-pruning. One successful method is called Rule Post-Pruning; a variant of this method is used by C4.5.
(c) Infer the decision tree from the training set, growing the tree until
the training data is fit as well as possible and allowing over-fitting to
occur
(d) Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Why rules? Pruning a precondition of a rule is less drastic than removing a subtree: each path is pruned in its own context, the distinction between tests near the root and near the leaves disappears, and rules are easier to read.
(e) Prune (generalize) each rule by removing any precondition whose removal improves its estimated accuracy. Repeat on that rule until removing further preconditions worsens the estimated accuracy.
(f) Sort the pruned rules by their estimated accuracy, and consider them
in this sequence when classifying subsequent instances
(g) A drawback is that estimating rule accuracy requires a separate validation set. To apply the algorithm using the same training set, use a pessimistic estimate of accuracy instead.
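A minimal sketch of the entropy and information gain computations above, in Python. The dataset encoding (a list of (feature dict, label) pairs), the function names, and the toy Wind data are illustrative assumptions, not from the source.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    # Partition S into the subsets S_v, one per value v of the attribute.
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[attribute], []).append(label)
    for subset in partitions.values():
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# Toy data: 9 positive / 5 negative examples, so Entropy(S) = 0.940 bits.
data = ([({"Wind": "Weak"}, "Yes")] * 6 + [({"Wind": "Strong"}, "Yes")] * 3
        + [({"Wind": "Weak"}, "No")] * 2 + [({"Wind": "Strong"}, "No")] * 3)
print(entropy([label for _, label in data]))   # ~0.940
print(information_gain(data, "Wind"))          # ~0.048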
Other Modifications
1. Missing values: substitute the most frequently occurring value of the attribute at that node.
2. Weighted features: divide the information gain by the feature's weight.
3. Continuous values: introduce candidate thresholds and treat each threshold as a boolean attribute.
4. Many-valued attributes: plain information gain is biased toward them, so use a modified measure such as the gain ratio (split information ratio) for choosing the splitting attribute; see the formulas below.
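For reference, one standard form of this measure is the gain ratio (used by C4.5 and given in Mitchell, Chapter 3), which divides the information gain by the split information of the attribute:
\[
\mathrm{SplitInformation}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}
\]
\[
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}
\]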
References
Machine Learning - Tom Mitchell, Chapter 3
Logistic Regression
Algorithm
1. The hypothesis function is of the form (for binary classification)
\[
h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
\]
The likelihood of the parameters over m training examples is
\[
L(\theta) = P(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}
\]
8. Can also use Newton's method to maximize the log likelihood for faster convergence, but it involves computing the inverse of the Hessian of the log likelihood with respect to θ at each iteration.
9. For multi-class classification, use Softmax Regression, which can also be derived from GLM theory.
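A minimal sketch of fitting the model by batch gradient ascent on the log likelihood, in NumPy. The learning rate, iteration count, and toy data are illustrative assumptions; the source does not specify an implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, iters=2000):
    """Batch gradient ascent on the log likelihood.

    X: (m, n) design matrix with the intercept column x_0 = 1 already included.
    y: (m,) vector of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        # Gradient of the log likelihood: X^T (y - h_theta(x)).
        theta += lr * X.T @ (y - h) / len(y)
    return theta

# Toy usage: two 1-D clusters with an intercept column.
X = np.column_stack([np.ones(6), [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic(X, y)
print(theta, sigmoid(X @ theta).round(2))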
References
Stanford CS229 Notes 1
Linear Regression
Algorithm
1. The hypothesis function is of the form
\[
h_\theta(x) = \theta^T x
\]
with x_0 = 1 (the intercept term).
2. Intuitively, we fit θ by minimizing the sum of squared errors over the training data:
\[
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
\]
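A minimal sketch of minimizing J(θ) in closed form via the normal equations, θ = (XᵀX)⁻¹Xᵀy, in NumPy. The toy data is an illustrative assumption; the same θ can also be reached by batch gradient descent as in the CS229 notes.

import numpy as np

def fit_linear(X, y):
    """Least-squares theta minimizing J(theta) = 1/2 * sum (h_theta(x) - y)^2.

    Solves the normal equations X^T X theta = X^T y.
    X: (m, n) design matrix with x_0 = 1 included; y: (m,) targets.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage: recover y = 1 + 2x from noisy samples.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(20)
print(fit_linear(X, y))   # approximately [1.0, 2.0]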
References
Stanford CS229 Notes 1
Nearest Neighbours

Gaussian Discriminant Analysis
Algorithm
1. GDA is a generative algorithm
2. Intuitive explanation for generative algorithms - First, looking at elephants, we can build a model of what elephants look like. Then, looking
at dogs, we can build a separate model of what dogs look like. Finally, to
classify a new animal, we can match the new animal against the elephant
model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in
the training set.
3. The model is
\[
y \sim \mathrm{Bernoulli}(\phi)
\]
\[
x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)
\]
\[
x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)
\]
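A minimal sketch of the closed-form maximum likelihood estimates for the GDA parameters (φ, μ_0, μ_1 and the shared Σ), in NumPy. The toy data and function name are illustrative assumptions.

import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for y ~ Bernoulli(phi), x|y ~ N(mu_y, Sigma)."""
    m = len(y)
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Shared covariance: average outer product of each example minus its class mean.
    centred = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centred.T @ centred / m
    return phi, mu0, mu1, sigma

# Toy usage: two Gaussian clusters in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
phi, mu0, mu1, sigma = fit_gda(X, y)
print(phi, mu0.round(2), mu1.round(2))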
Naive Bayes
Bayes Networks
Adaboost
SVM
K-means Clustering
Expectation Maximization
SVD
PCA
Random Forests