Assignment 6
Introduction to Machine Learning
Prof. B. Ravindran
1. When building models using decision trees we essentially split the entire input space using
(a) axis-parallel hyper-rectangles
(b) polynomial curves of order greater than two
(c) polynomial curves of the same order as the depth of the decision tree
(d) none of the above
Sol. (a)
2. In building a decision tree model, to control the size of the tree, we need to control the number
of regions. One approach to do this would be to split tree nodes only if the resultant decrease
in the sum of squares error exceeds some threshold. For the described method, which among
the following are true?
(a) it would, in general, help restrict the size of the trees
(b) it has the potential to affect the performance of the resultant regression/classification
model
(c) it is computationally infeasible
Sol. (a), (b)
While this approach may restrict the eventual number of regions produced, the main problem
with this approach is that it is too restrictive and may result in poor performance. It is very
common for splits at one level, which themselves are not that good (i.e., they do not decrease
the error significantly), to lead to very good splits (i.e., where the error is significantly reduced)
down the line. Think about the XOR problem.
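The XOR caveat can be made concrete: on the XOR dataset, no single axis-parallel split reduces the entropy at all, so an error-decrease threshold would stop splitting immediately, even though a depth-2 tree classifies the data perfectly. A minimal sketch in plain Python (no tree library assumed):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

# XOR data: label = x1 XOR x2
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

# Any first split on x1 leaves both children with a 50/50 class mix,
# so the information gain of that split is zero:
left  = [label for point, label in zip(X, y) if point[0] == 0]  # [0, 1]
right = [label for point, label in zip(X, y) if point[0] == 1]  # [1, 0]
gain = entropy(y) - (len(left) / 4 * entropy(left)
                     + len(right) / 4 * entropy(right))
print(gain)  # zero gain: any positive pre-pruning threshold stops here,
             # yet splitting on x1 and then x2 classifies XOR perfectly
```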
3. Suppose we use the decision tree model for solving a multi-class classification problem. As we
continue building the tree, w.r.t. the generalisation error of the model,
(a) the error due to bias increases
(b) the error due to bias decreases
(c) the error due to variance increases
(d) the error due to variance decreases
Sol. (b) & (c)
As we continue to build the decision tree model, it is possible that we overfit the data. In
this case, the model is sufficiently complex, i.e., the error due to bias is low. However, due to
overfitting, the error due to variance starts increasing.
4. (2 marks) Having built a decision tree, we are using reduced error pruning to reduce the size
of the tree. We select a node to collapse. For this particular node, on the left branch, there are
3 training data points with the following outputs: 5, 7, 9.6 and for the right branch, there are
four training data points with the following outputs: 8.7, 9.8, 10.5, 11. The average value of the
outputs of data points denotes the response of a branch. The original responses for data points
along the two branches (left and right respectively) were response left and response right, and the new response after collapsing the node is response new. What are the values of response left, response right and response new (numbers in the options are given in the same order)?
(a) 21.6, 40, 61.6
(b) 7.2, 10, 8.8
(c) 3, 4, 7
(d) depends on the tree height.
Sol. (b)
Original responses:
Left: (5 + 7 + 9.6)/3 = 7.2
Right: (8.7 + 9.8 + 10.5 + 11)/4 = 10
New response: 7.2 × (3/7) + 10 × (4/7) = 8.8
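The arithmetic above can be sketched in a few lines of plain Python (a minimal illustration, not tied to any tree library): each branch predicts the mean of its training outputs, and the collapsed node predicts the mean over all seven points, which equals the size-weighted average of the two branch responses.

```python
left_outputs  = [5, 7, 9.6]
right_outputs = [8.7, 9.8, 10.5, 11]

# each branch predicts the mean of its own training outputs
response_left  = sum(left_outputs) / len(left_outputs)      # ≈ 7.2
response_right = sum(right_outputs) / len(right_outputs)    # ≈ 10.0

# the collapsed node predicts the mean over all seven points
n_total = len(left_outputs) + len(right_outputs)
response_new = (sum(left_outputs) + sum(right_outputs)) / n_total  # ≈ 8.8

print(response_left, response_right, response_new)
```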
5. (2 marks) Consider the following dataset:
feature1 feature2 output
11.7 183.2 a
12.8 187.6 a
15.3 177.4 a
13.9 198.6 a
17.2 175.3 a
16.8 151.1 b
17.5 171.4 b
23.6 162.8 b
16.9 179.5 b
19.1 173.8 b
Which among the following split-points for feature1 would give the best split according to
the information gain measure?
(a) 14.6
(b) 16.05
(c) 16.85
(d) 17.35
Sol. (b)
info_feature1(14.6)(D) = (3/10)(−(3/3) log2(3/3) − (0/3) log2(0/3)) + (7/10)(−(2/7) log2(2/7) − (5/7) log2(5/7)) = 0.6042
info_feature1(16.05)(D) = (4/10)(−(4/4) log2(4/4) − (0/4) log2(0/4)) + (6/10)(−(1/6) log2(1/6) − (5/6) log2(5/6)) = 0.39
info_feature1(16.85)(D) = (5/10)(−(4/5) log2(4/5) − (1/5) log2(1/5)) + (5/10)(−(1/5) log2(1/5) − (4/5) log2(4/5)) = 0.7219
info_feature1(17.35)(D) = (7/10)(−(5/7) log2(5/7) − (2/7) log2(2/7)) + (3/10)(−(0/3) log2(0/3) − (3/3) log2(3/3)) = 0.6042
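These figures can be checked programmatically. A short sketch in plain Python (splitting as feature1 ≤ split-point, which matches the class counts above; lower weighted child entropy means higher information gain):

```python
from math import log2

# feature1 values and class labels from the table in question 5
feature1 = [11.7, 12.8, 15.3, 13.9, 17.2, 16.8, 17.5, 23.6, 16.9, 19.1]
labels   = ['a'] * 5 + ['b'] * 5

def entropy(ys):
    """Shannon entropy of a list of class labels (0 log 0 taken as 0)."""
    n = len(ys)
    return -sum((ys.count(c) / n) * log2(ys.count(c) / n) for c in set(ys))

def weighted_entropy(split):
    """Weighted child entropy for the split feature1 <= split."""
    left  = [l for x, l in zip(feature1, labels) if x <= split]
    right = [l for x, l in zip(feature1, labels) if x > split]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

for s in [14.6, 16.05, 16.85, 17.35]:
    print(s, round(weighted_entropy(s), 4))
# 16.05 yields the lowest weighted entropy (≈ 0.39), hence the highest gain
```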
6. (2 marks) For the same dataset, which among the following split-points for feature2 would
give the best split according to the gini index measure?
(a) 172.6
(b) 176.35
(c) 178.45
(d) 185.4
Sol. (a)
gini_feature2(172.6)(D) = (7/10) × 2 × (5/7) × (2/7) + (3/10) × 2 × (0/3) × (3/3) = 0.2857
gini_feature2(176.35)(D) = (5/10) × 2 × (1/5) × (4/5) + (5/10) × 2 × (4/5) × (1/5) = 0.32
gini_feature2(178.45)(D) = (6/10) × 2 × (2/6) × (4/6) + (4/10) × 2 × (3/4) × (1/4) = 0.4167
gini_feature2(185.4)(D) = (2/10) × 2 × (2/2) × (0/2) + (8/10) × 2 × (3/8) × (5/8) = 0.375
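As with the entropy case, these gini figures can be verified with a short plain-Python sketch (splitting as feature2 ≤ split-point; for two classes, the gini impurity of a node is 2 · p_a · p_b):

```python
# feature2 values and class labels from the table in question 5
feature2 = [183.2, 187.6, 177.4, 198.6, 175.3,
            151.1, 171.4, 162.8, 179.5, 173.8]
labels   = ['a'] * 5 + ['b'] * 5

def gini(ys):
    """Gini impurity for two classes: 2 * p_a * p_b."""
    p_a = ys.count('a') / len(ys)
    return 2 * p_a * (1 - p_a)

def weighted_gini(split):
    """Weighted child gini for the split feature2 <= split."""
    left  = [l for x, l in zip(feature2, labels) if x <= split]
    right = [l for x, l in zip(feature2, labels) if x > split]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

for s in [172.6, 176.35, 178.45, 185.4]:
    print(s, round(weighted_gini(s), 4))
# 172.6 yields the lowest weighted gini (≈ 0.2857), hence the best split
```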
7. In which of the following situations is it appropriate to introduce a new category ‘Missing’ for
missing values? (multiple options may be correct)
(a) When values are missing because the 108 emergency operator is sometimes attending a
very urgent distress call.
(b) When values are missing because the attendant spilled coffee on the papers from which
the data was extracted.
(c) When values are missing because the warehouse storing the paper records went up in
flames and burnt parts of it.
(d) When values are missing because the nurse/doctor finds the patient’s situation too urgent.
Sol. (a),(d)
We typically introduce a ‘Missing’ value when the fact that a value is missing can also be a
relevant feature. In the case of (a), it can imply that the call was so urgent that the operator
couldn’t note the value down. This urgency could potentially be useful for determining the target.
But a coffee spill corrupting the records is likely to be completely random and we glean no
new information from it. In this case, a better method is to try to predict the missing data
from the available data.