Machine Learning Revision Notes
Material
March 1, 2016
Decision Trees
1. ID3
For a set S with 9 positive and 5 negative examples, the entropy is
\[
\mathrm{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940
\]
The information gain of an attribute A is
\[
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
\]
where S is the set of all examples at the node and S_v is the set of examples in the child node corresponding to a value v of A. The attribute with maximum information gain is selected at the root (a sketch of this computation follows the list below).
2. C4.5
(a) If the training data contains significant noise, the learned tree over-fits it.
(b) Over-fitting can be reduced by post-pruning. One successful method is called Rule Post-Pruning; a variant of this method is used by C4.5.
(c) Infer the decision tree from the training set, growing the tree until
the training data is fit as well as possible and allowing over-fitting to
occur
(d) Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Why rules? Pruning a precondition of a rule is less drastic than removing a subtree: each path is pruned in its own context, the distinction between tests near the root and near the leaves disappears, and rules are easier to read.
(e) Prune (generalize) each rule by removing any precondition whose removal improves its estimated accuracy. Repeat on that rule until removing further preconditions worsens the estimated accuracy.
(f) Sort the pruned rules by their estimated accuracy, and consider them
in this sequence when classifying subsequent instances
(g) A drawback is that estimating rule accuracy requires a separate validation set. To apply the algorithm using the same training set, use a pessimistic estimate of accuracy instead.
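A minimal sketch of the entropy and information gain computations above, in Python. The dataset encoding (a list of (feature dict, label) pairs), the function names, and the toy Wind data are illustrative assumptions, not from the source.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    # Partition S into the subsets S_v, one per value v of the attribute.
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[attribute], []).append(label)
    for subset in partitions.values():
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# Toy data: 9 positive / 5 negative examples, so Entropy(S) = 0.940 bits.
data = ([({"Wind": "Weak"}, "Yes")] * 6 + [({"Wind": "Strong"}, "Yes")] * 3
        + [({"Wind": "Weak"}, "No")] * 2 + [({"Wind": "Strong"}, "No")] * 3)
print(entropy([label for _, label in data]))   # ~0.940
print(information_gain(data, "Wind"))          # ~0.048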
Other Modifications
1. Missing values: substitute the most frequently occurring value of the attribute at that node.
2. Weighted features: divide the information gain by the feature's weight.
3. Continuous values: introduce candidate thresholds and treat each threshold as a boolean attribute.
4. Many-valued attributes: plain information gain is biased toward them, so use a modified measure such as the gain ratio (split information ratio) for choosing the splitting attribute; see the formulas below.
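For reference, one standard form of this measure is the gain ratio (used by C4.5 and given in Mitchell, Chapter 3), which divides the information gain by the split information of the attribute:
\[
\mathrm{SplitInformation}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}
\]
\[
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}
\]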
References
Machine Learning - Tom Mitchell, Chapter 3
Logistic Regression
Algorithm
1. The hypothesis function is of the form (for binary classification)
\[
h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
\]
The likelihood of the parameters over m training examples is
\[
L(\theta) = P(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}
\]
8. Can also use Newton's method to maximize the log likelihood for faster convergence, but it involves computing the inverse of the Hessian of the log likelihood with respect to θ at each iteration.
9. For multi-class classification, use Softmax Regression, which can also be derived from GLM theory.
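A minimal sketch of fitting the model by batch gradient ascent on the log likelihood, in NumPy. The learning rate, iteration count, and toy data are illustrative assumptions; the source does not specify an implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, iters=2000):
    """Batch gradient ascent on the log likelihood.

    X: (m, n) design matrix with the intercept column x_0 = 1 already included.
    y: (m,) vector of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        # Gradient of the log likelihood: X^T (y - h_theta(x)).
        theta += lr * X.T @ (y - h) / len(y)
    return theta

# Toy usage: two 1-D clusters with an intercept column.
X = np.column_stack([np.ones(6), [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic(X, y)
print(theta, sigmoid(X @ theta).round(2))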
References
Stanford CS229 Notes 1
Linear Regression
Algorithm
1. The hypothesis function is of the form
\[
h_\theta(x) = \theta^T x
\]
with x_0 = 1 (the intercept term).
2. Intuitively, we fit θ by minimizing the sum of squared errors over the training data:
\[
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
\]
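A minimal sketch of minimizing J(θ) in closed form via the normal equations, θ = (XᵀX)⁻¹Xᵀy, in NumPy. The toy data is an illustrative assumption; the same θ can also be reached by batch gradient descent as in the CS229 notes.

import numpy as np

def fit_linear(X, y):
    """Least-squares theta minimizing J(theta) = 1/2 * sum (h_theta(x) - y)^2.

    Solves the normal equations X^T X theta = X^T y.
    X: (m, n) design matrix with x_0 = 1 included; y: (m,) targets.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage: recover y = 1 + 2x from noisy samples.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(20)
print(fit_linear(X, y))   # approximately [1.0, 2.0]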
References
Stanford CS229 Notes 1
Nearest Neighbours

Gaussian Discriminant Analysis
Algorithm
1. GDA is a generative algorithm
2. Intuitive explanation for generative algorithms - First, looking at elephants, we can build a model of what elephants look like. Then, looking
at dogs, we can build a separate model of what dogs look like. Finally, to
classify a new animal, we can match the new animal against the elephant
model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in
the training set.
3. The model is
\[
y \sim \mathrm{Bernoulli}(\phi)
\]
\[
x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)
\]
\[
x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)
\]
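A minimal sketch of the closed-form maximum likelihood estimates for the GDA parameters (φ, μ_0, μ_1 and the shared Σ), in NumPy. The toy data and function name are illustrative assumptions.

import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for y ~ Bernoulli(phi), x|y ~ N(mu_y, Sigma)."""
    m = len(y)
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Shared covariance: average outer product of each example minus its class mean.
    centred = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centred.T @ centred / m
    return phi, mu0, mu1, sigma

# Toy usage: two Gaussian clusters in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
phi, mu0, mu1, sigma = fit_gda(X, y)
print(phi, mu0.round(2), mu1.round(2))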
Naive Bayes
Bayes Networks
Adaboost
SVM
K-means Clustering
Expectation Maximization
SVD
PCA
Random Forests