
CS 391L: Machine Learning Spring 2024

Homework 1 - Theory - Solutions


Lecture: Prof. Adam Klivans
Keywords: Boolean functions, mistake bounds, PAC learning

1. Since f, g are {−1, 1}-valued, f(x)g(x) is +1 whenever f(x) = g(x) and −1 whenever f(x) ≠ g(x). Thus we can write

E_{x∼D}[f(x)g(x)] = (+1) · P_{x∼D}[f(x) = g(x)] + (−1) · P_{x∼D}[f(x) ≠ g(x)]
= 1 − 2 P_{x∼D}[f(x) ≠ g(x)],

using the fact that P_{x∼D}[f(x) = g(x)] = 1 − P_{x∼D}[f(x) ≠ g(x)]. Rearranging the above equation proves the result.
Notice that we did not need any properties of the domain or the distribution for this proof, only that f, g be {−1, 1}-valued. Thus the statement holds over arbitrary domains.
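As a quick numerical sanity check of this identity (not part of the proof), the following Python snippet compares E[f(x)g(x)] with 1 − 2 P[f(x) ≠ g(x)] for random {−1, 1}-valued functions; the 32-point domain and the uniform distribution are arbitrary illustrative choices.

```python
import random

random.seed(0)
domain = range(32)  # any finite domain works; uniform D for simplicity
f = {x: random.choice([-1, 1]) for x in domain}
g = {x: random.choice([-1, 1]) for x in domain}

# E_{x~D}[f(x)g(x)] under the uniform distribution
expectation = sum(f[x] * g[x] for x in domain) / len(domain)
# P_{x~D}[f(x) != g(x)]
disagreement = sum(f[x] != g[x] for x in domain) / len(domain)

# The identity: E[fg] = 1 - 2 P[f != g]
assert abs(expectation - (1 - 2 * disagreement)) < 1e-12
print(expectation, 1 - 2 * disagreement)
```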

2. We can write the decision tree as a polynomial by decomposing it in terms of its root-to-leaf paths. Consider an example path where we move along x1 = 1 at the root x1, along x3 = −1 at x3, along x8 = 1 at x8, and then output −1; in other words, if x1 = 1, x3 = −1, and x8 = 1, output −1. We can represent this by the following term:

(−1) · ((1 + x1)/2) · ((1 − x3)/2) · ((1 + x8)/2)

Clearly this is nonzero iff x1 = 1, x3 = −1, x8 = 1, in which case it evaluates to −1. In this
way we can represent any path by a polynomial term. We can now write the whole decision
tree f by summing the terms for each of the t root-to-leaf paths. Since any assignment
of values to x1 , . . . , xn will follow exactly one of the root-to-leaf paths, exactly one of the
corresponding terms will be nonzero, and the overall sum will be +1 or −1 according to the
value at the leaf. Thus we obtain a polynomial p such that f (x) = p(x) for all x ∈ {−1, 1}n .
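To make the path-to-term construction concrete, here is a minimal Python sketch; the (index, required value) encoding of a path and the list-of-paths encoding of the tree are assumed representations for illustration, not fixed by the problem.

```python
def path_term(x, path, leaf_label):
    """One root-to-leaf path as a polynomial term.

    path: list of (index, required_value) pairs, values in {-1, +1}.
    The product of factors is 1 exactly when x follows this path, else 0.
    """
    term = leaf_label
    for i, v in path:
        term *= (1 + v * x[i]) / 2  # (1+x_i)/2 if v = +1, (1-x_i)/2 if v = -1
    return term

def tree_poly(x, paths):
    """Sum over all root-to-leaf paths; exactly one term is nonzero."""
    return sum(path_term(x, path, label) for path, label in paths)

# The example path from the text (0-indexed here): x1=1, x3=-1, x8=1 -> output -1.
x = [1, 1, -1, 1, 1, 1, 1, 1]
print(path_term(x, [(0, 1), (2, -1), (7, 1)], -1))  # -1.0
```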

3. Overall accuracy is 74%. The decision tree is shown below:

Z
├─ Z = 0 → Y
│          ├─ Y = 0 → output 0
│          └─ Y = 1 → output 1
└─ Z = 1 → X
           ├─ X = 0 → output 1
           └─ X = 1 → output 1

The calculations are as follows:

Root node: Note that the potential for all the examples is C(Pr[Pos]) = 2 · (165/250) · (85/250) = 0.4488.
If we choose X to be the root node, then the new expected potential is Pr[X = 0] · C(Pr[Pos|X = 0]) + Pr[X = 1] · C(Pr[Pos|X = 1]) = (150/250) · C(105/150) + (100/250) · C(60/100) = 0.444.
If we choose Y, it is (120/250) · C(70/120) + (130/250) · C(95/130) = 0.437949.
If we choose Z, it is (120/250) · C(60/120) + (130/250) · C(105/130) = 0.401538.
So, Z minimizes the new potential, or in other words, it maximizes the information gain
which is 0.4488 − 0.401538 = 0.047262. Thus, we should pick Z to be the root node.
Left node: Since an example only enters the left node when Z = 0, we will restrict ourselves
to the examples where Z = 0.
Note that the total potential in this case is C(60/120) = 0.5.
If we choose X to be the left node, then the new expected potential is (80/120) · C(45/80) + (40/120) · C(15/40) = 0.484375.
If we choose Y to be the left node, then the new expected potential is (50/120) · C(15/50) + (70/120) · C(45/70) = 0.442857.
Thus, choosing Y minimizes the new expected potential, or maximizes the information gain, which is 0.5 − 0.442857 = 0.057143. Since in this case we have more examples labeled negative than positive when Y = 0, we will output negative when Y = 0. Similarly, we have more examples labeled positive than negative when Y = 1, and so when Y = 1 we will output positive.
Right node: Since an example only enters the right node when Z = 1, we will restrict
ourselves to the examples where Z = 1.
Note that the total potential in this case is C(105/130) ≈ 0.310651.
If we choose X to be the right node, then the new expected potential is (70/130) · C(60/70) + (60/130) · C(45/60) = 0.304945.
If we choose Y to be the right node, then the new expected potential is (70/130) · C(55/70) + (60/130) · C(50/60) = 0.309524.
Thus, choosing X minimizes the new expected potential, or maximizes the information gain, which is 0.310651 − 0.304945 = 0.005706. Since in this case we have more examples labeled positive than negative when X = 0, we will output positive when X = 0. Similarly, we have more examples labeled positive than negative when X = 1, and so when X = 1 we will output positive.
Our accuracy is computed as (number of examples our decision tree labels correctly)/(total number of examples). In this case, this is 185/250 = 0.74 = 74%.
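The arithmetic above is easy to reproduce. The following Python snippet recomputes each expected potential from the example counts, using the potential function C(p) = 2p(1 − p) that appears in the calculations:

```python
def C(p):
    """Potential of a node whose positive fraction is p: C(p) = 2p(1-p)."""
    return 2 * p * (1 - p)

def expected_potential(splits):
    """splits: list of (n_in_branch, n_pos_in_branch, n_total) per branch."""
    return sum((n / total) * C(pos / n) for n, pos, total in splits)

# Root: 250 examples, 165 positive.
print(C(165 / 250))                                           # 0.4488
print(expected_potential([(150, 105, 250), (100, 60, 250)]))  # X: 0.444
print(expected_potential([(120, 70, 250), (130, 95, 250)]))   # Y: 0.437949
print(expected_potential([(120, 60, 250), (130, 105, 250)]))  # Z: 0.401538

# Left child (Z = 0): 120 examples, 60 positive.
print(expected_potential([(80, 45, 120), (40, 15, 120)]))     # X: 0.484375
print(expected_potential([(50, 15, 120), (70, 45, 120)]))     # Y: 0.442857

# Right child (Z = 1): 130 examples, 105 positive.
print(expected_potential([(70, 60, 130), (60, 45, 130)]))     # X: 0.304945
print(expected_potential([(70, 55, 130), (60, 50, 130)]))     # Y: 0.309524
```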

4. This is actually a strictly simpler, instructive version of the axis-parallel rectangles problem
from lecture. As in that problem, the algorithm is very natural: draw a large number (say m)
of examples, and then pick the “tightest-fitting” threshold function. More concretely, we can
arrange our training data in ascending order in terms of x, and then pick the largest x that
is labeled −1 as our threshold. (Another approach would be to pick the smallest x that is
labeled +1; this also works and has a nearly identical analysis.) This can be done efficiently
in O(m) time by going through all m points. (Aside: we cannot use binary search, since the points don't arrive in sorted order.) What remains is the analysis of how large m needs to be in order to have high confidence that our output has low error.

To do this, suppose hθ , for some θ ∈ R, is the true threshold function that is labeling the data.
Suppose that the threshold we obtain (by picking the largest x labeled −1 in our dataset) is
θ̂. Notice that we will necessarily have θ̂ ≤ θ, and the only area where hθ̂ differs from the
true hθ is the interval [θ̂, θ]. This is where our classifier errs, and the error is precisely the
probability mass of this interval.
Define B to be the interval immediately to the left of θ that has probability mass ϵ. Observe
that if we get even one training point in B, then we are guaranteed that [θ̂, θ] lies within B,
and so our error will be at most ϵ. Thus, our bad event is that none of our m training points fall in B. The probability of this happening is (1 − ϵ)^m ≤ e^{−ϵm} (using the fact that 1 + x ≤ e^x for all x), which can be made at most δ by picking m = (1/ϵ) log(1/δ). This completes the analysis.

[Figure: a number line with the −-labeled examples on the left and the +-labeled examples on the right; the interval B of probability mass ϵ sits immediately to the left of the true threshold θ, and the learned threshold θ̂ lies within it.]
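To see the sample bound in action, here is a small Monte Carlo check of the tightest-fit learner; the uniform distribution on [0, 1] and the true threshold θ = 0.5 are illustrative assumptions, so the printed failure rate is only a sanity check that it stays below δ.

```python
import math
import random

random.seed(0)
eps, delta, theta = 0.05, 0.01, 0.5
m = math.ceil((1 / eps) * math.log(1 / delta))  # 93 examples here

failures, trials = 0, 2000
for _ in range(trials):
    xs = [random.random() for _ in range(m)]          # i.i.d. draws from U[0,1]
    negatives = [x for x in xs if x <= theta]         # points labeled -1 by h_theta
    theta_hat = max(negatives) if negatives else 0.0  # largest x labeled -1
    # Under U[0,1], the error of h_theta_hat is the mass of [theta_hat, theta].
    failures += (theta - theta_hat) > eps

print(failures / trials, "<=", delta)  # empirical failure rate vs. delta
```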

5. (a) If err(h) > ϵ, then the probability that h labels a single randomly drawn example
correctly is less than 1 − ϵ. The probability of getting k independent examples right is
less than (1 − ϵ)^k ≤ e^{−ϵk} (using the fact that 1 + x ≤ e^x for all x). By picking k = (1/ϵ) log(1/δ′),
this is at most δ′ (which, as we'll see, will be picked based on our final desired δ).
(b) Since A has a mistake bound of t, and it only updates when it makes a mistake, it can
go through at most t + 1 hypotheses (its initial one plus one per mistake). This means that
if we view our examples as consisting of t + 1 blocks of size k, then at least one block must
be mistake-free: a mistake in every block would mean more than t mistakes in total.
(c) First we state the algorithm, then give the analysis. Our overall PAC learner works as
follows:
i. The learner draws a block of k random examples from D.
ii. At the start of the block, we assume that algorithm A, with its current state/hypothesis
hi, satisfies err(hi) ≤ ϵ, and we use the examples in the block to test whether this is indeed
the case: the learner feeds the examples in the block one by one to A. If A makes a mistake
on one of them, we start again at step 5(c)i. If it does not, we stop and output hi.
Let hi denote A's hypothesis (or state) after i mistakes. Based on our steps so far, the
most natural idea for when we decide to stop and output hi is to do so when hi has
gotten its block of k examples right.[1] So the hypothesis our PAC learner eventually
outputs will be one of the hi that A goes through. Define the event Ei to be the event
that we output hi such that err(hi) > ϵ. The PAC learner's failure event is precisely that
it outputs a hypothesis with error greater than ϵ, and is thus described by E = ∪i Ei.
Here the union is taken over all the blocks we go through, which number at most t + 1.

[1] Important pedagogical note: when we talk of the event Ei, we are saying something about the examples, not the
hypothesis per se: we are saying that the ith block is misleading in the technical sense that all k examples happen
to fall into hi's "good area". That is, the function is fixed, and it is the examples that are random. It's important
to understand this fact in this kind of analysis. If you like, the sample space consists of realizations of our random
draws of examples from D.
We want to ensure that P[E] ≤ δ. By the union bound, we can say that P[E] ≤ Σi P[Ei].
Recall that we picked k such that P[Ei] ≤ δ′. Since there are at most t + 1 events Ei,
picking δ′ = δ/(t + 1), we have P[E] ≤ (t + 1)δ′ ≤ δ, as desired. This proves the correctness
of our PAC learner. The total number of examples used is (t + 1)k = ((t + 1)/ϵ) log((t + 1)/δ).
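Putting (a)–(c) together, here is a minimal Python sketch of the conversion; the predict/update interface for the online learner A is an assumed API for illustration, standing in for any conservative mistake-bounded learner.

```python
import math

def pac_from_mistake_bound(A, t, draw_example, eps, delta):
    """Convert a conservative online learner A with mistake bound t into a
    PAC learner: test A's current hypothesis on blocks of k fresh examples
    and output it once it survives a whole block with no mistakes."""
    k = math.ceil((1 / eps) * math.log((t + 1) / delta))  # delta' = delta/(t+1)
    for _ in range(t + 1):  # at most t + 1 blocks can contain a mistake
        clean = True
        for _ in range(k):
            x, y = draw_example()       # one i.i.d. labeled example from D
            if A.predict(x) != y:
                A.update(x, y)          # conservative: update only on mistakes
                clean = False
                break                   # restart the test with a fresh block
        if clean:
            return A.predict            # hypothesis passed its block of k tests
    return A.predict                    # unreachable if the mistake bound holds
```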
