
CS 391L: Machine Learning Spring 2024

Homework 1 - Theory - Solutions


Lecture: Prof. Adam Klivans
Keywords: Boolean functions, mistake bounds, PAC learning

1. Since f, g are {−1, 1}-valued, f(x)g(x) is +1 whenever f(x) = g(x) and −1 whenever f(x) ≠ g(x). Thus we can write

E_{x∼D}[f(x)g(x)] = (+1) · P_{x∼D}[f(x) = g(x)] + (−1) · P_{x∼D}[f(x) ≠ g(x)]
= 1 − 2 P_{x∼D}[f(x) ≠ g(x)],

using the fact that P_{x∼D}[f(x) = g(x)] = 1 − P_{x∼D}[f(x) ≠ g(x)]. Rearranging the above equation proves the result.
Notice that we did not need any properties of the domain or the distribution for this proof, only that f, g be {−1, 1}-valued. Thus the statement holds over arbitrary domains.
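As a quick numerical sanity check of this identity (not part of the proof), the following Python snippet compares E[f(x)g(x)] with 1 − 2 P[f(x) ≠ g(x)] for random {−1, 1}-valued functions; the 32-point domain and the uniform distribution are arbitrary illustrative choices.

```python
import random

random.seed(0)
domain = range(32)  # any finite domain works; uniform D for simplicity
f = {x: random.choice([-1, 1]) for x in domain}
g = {x: random.choice([-1, 1]) for x in domain}

# E_{x~D}[f(x)g(x)] under the uniform distribution
expectation = sum(f[x] * g[x] for x in domain) / len(domain)
# P_{x~D}[f(x) != g(x)]
disagreement = sum(f[x] != g[x] for x in domain) / len(domain)

# The identity: E[fg] = 1 - 2 P[f != g]
assert abs(expectation - (1 - 2 * disagreement)) < 1e-12
print(expectation, 1 - 2 * disagreement)
```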

2. We can write the decision tree as a polynomial by decomposing it in terms of its root-to-leaf paths. Consider an example path where we move along x1 = 1 at the root x1, along x3 = −1 at x3, along x8 = 1 at x8, and then output −1; in other words, if x1 = 1, x3 = −1, and x8 = 1, output −1. We can represent this by the following term:

(−1) · ((1 + x1)/2) · ((1 − x3)/2) · ((1 + x8)/2)

Clearly this is nonzero iff x1 = 1, x3 = −1, x8 = 1, in which case it evaluates to −1. In this
way we can represent any path by a polynomial term. We can now write the whole decision
tree f by summing the terms for each of the t root-to-leaf paths. Since any assignment
of values to x1 , . . . , xn will follow exactly one of the root-to-leaf paths, exactly one of the
corresponding terms will be nonzero, and the overall sum will be +1 or −1 according to the
value at the leaf. Thus we obtain a polynomial p such that f (x) = p(x) for all x ∈ {−1, 1}n .
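To make the path-to-term construction concrete, here is a minimal Python sketch; the (index, required value) encoding of a path and the list-of-paths encoding of the tree are assumed representations for illustration, not fixed by the problem.

```python
def path_term(x, path, leaf_label):
    """One root-to-leaf path as a polynomial term.

    path: list of (index, required_value) pairs, values in {-1, +1}.
    The product of factors is 1 exactly when x follows this path, else 0.
    """
    term = leaf_label
    for i, v in path:
        term *= (1 + v * x[i]) / 2  # (1+x_i)/2 if v = +1, (1-x_i)/2 if v = -1
    return term

def tree_poly(x, paths):
    """Sum over all root-to-leaf paths; exactly one term is nonzero."""
    return sum(path_term(x, path, label) for path, label in paths)

# The example path from the text (0-indexed here): x1=1, x3=-1, x8=1 -> output -1.
x = [1, 1, -1, 1, 1, 1, 1, 1]
print(path_term(x, [(0, 1), (2, -1), (7, 1)], -1))  # -1.0
```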

3. Overall accuracy is 74%. The decision tree is shown below:

Z
├─ Z = 0 → Y
│          ├─ Y = 0 → output 0
│          └─ Y = 1 → output 1
└─ Z = 1 → X
           ├─ X = 0 → output 1
           └─ X = 1 → output 1

The calculations are as follows:

Root node: Note that the potential for all the examples is C(Pr[Pos]) = 2 · (165/250) · (85/250) = 0.4488.
If we choose X to be the root node, then the new expected potential is Pr[X = 0] · C(Pr[Pos|X = 0]) + Pr[X = 1] · C(Pr[Pos|X = 1]) = (150/250) · C(105/150) + (100/250) · C(60/100) = 0.444.
If we choose Y, it is (120/250) · C(70/120) + (130/250) · C(95/130) = 0.437949.
If we choose Z, it is (120/250) · C(60/120) + (130/250) · C(105/130) = 0.401538.
So, Z minimizes the new potential, or in other words, it maximizes the information gain
which is 0.4488 − 0.401538 = 0.047262. Thus, we should pick Z to be the root node.
Left node: Since an example only enters the left node when Z = 0, we will restrict ourselves
to the examples where Z = 0.
Note that the total potential in this case is C(60/120) = 0.5.
If we choose X to be the left node, then the new expected potential is (80/120) · C(45/80) + (40/120) · C(15/40) = 0.484375.
If we choose Y to be the left node, then the new expected potential is (50/120) · C(15/50) + (70/120) · C(45/70) = 0.442857.
Thus, choosing Y minimizes the new expected potential, or maximizes the information gain, which is 0.5 − 0.442857 = 0.057143. Since in this case we have more examples labeled negative than positive when Y = 0, we will output negative when Y = 0. Similarly, we have more examples labeled positive than negative when Y = 1, and so when Y = 1 we will output positive.
Right node: Since an example only enters the right node when Z = 1, we will restrict
ourselves to the examples where Z = 1.
Note that the total potential in this case is C(105/130) ≈ 0.310651.
If we choose X to be the right node, then the new expected potential is (70/130) · C(60/70) + (60/130) · C(45/60) = 0.304945.
If we choose Y to be the right node, then the new expected potential is (70/130) · C(55/70) + (60/130) · C(50/60) = 0.309524.
Thus, choosing X minimizes the new expected potential, or maximizes the information gain, which is 0.310651 − 0.304945 = 0.005706. Since in this case we have more examples labeled positive than negative when X = 0, we will output positive when X = 0. Similarly, we have more examples labeled positive than negative when X = 1, and so when X = 1 we will output positive.
Our accuracy is computed as (number of examples our decision tree labels correctly)/(total number of examples). In this case, this is 185/250 = 0.74 = 74%.
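The arithmetic above is easy to reproduce. The following Python snippet recomputes each expected potential from the example counts, using the potential function C(p) = 2p(1 − p) that appears in the calculations:

```python
def C(p):
    """Potential of a node whose positive fraction is p: C(p) = 2p(1-p)."""
    return 2 * p * (1 - p)

def expected_potential(splits):
    """splits: list of (n_in_branch, n_pos_in_branch, n_total) per branch."""
    return sum((n / total) * C(pos / n) for n, pos, total in splits)

# Root: 250 examples, 165 positive.
print(C(165 / 250))                                           # 0.4488
print(expected_potential([(150, 105, 250), (100, 60, 250)]))  # X: 0.444
print(expected_potential([(120, 70, 250), (130, 95, 250)]))   # Y: 0.437949
print(expected_potential([(120, 60, 250), (130, 105, 250)]))  # Z: 0.401538

# Left child (Z = 0): 120 examples, 60 positive.
print(expected_potential([(80, 45, 120), (40, 15, 120)]))     # X: 0.484375
print(expected_potential([(50, 15, 120), (70, 45, 120)]))     # Y: 0.442857

# Right child (Z = 1): 130 examples, 105 positive.
print(expected_potential([(70, 60, 130), (60, 45, 130)]))     # X: 0.304945
print(expected_potential([(70, 55, 130), (60, 50, 130)]))     # Y: 0.309524
```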

4. This is actually a strictly simpler, instructive version of the axis-parallel rectangles problem
from lecture. As in that problem, the algorithm is very natural: draw a large number (say m)
of examples, and then pick the “tightest-fitting” threshold function. More concretely, we can
arrange our training data in ascending order in terms of x, and then pick the largest x that
is labeled −1 as our threshold. (Another approach would be to pick the smallest x that is
labeled +1; this also works and has a nearly identical analysis.) This can be done efficiently
in O(m) time by going through all m points. (Aside: we cannot use binary search, since the points don't arrive in sorted order.) What remains is the analysis of how large m needs to be in order to have high confidence that our output has low error.

To do this, suppose hθ , for some θ ∈ R, is the true threshold function that is labeling the data.
Suppose that the threshold we obtain (by picking the largest x labeled −1 in our dataset) is
θ̂. Notice that we will necessarily have θ̂ ≤ θ, and the only area where hθ̂ differs from the
true hθ is the interval [θ̂, θ]. This is where our classifier errs, and the error is precisely the
probability mass of this interval.
Define B to be the interval immediately to the left of θ that has probability mass ϵ. Observe
that if we get even one training point in B, then we are guaranteed that [θ̂, θ] lies within B,
and so our error will be at most ϵ. Thus, our bad event is that none of our m training points fall in B. The probability of this happening is (1 − ϵ)^m ≤ e^{−ϵm} (using the fact that 1 + x ≤ e^x for all x), which can be made at most δ by picking m = (1/ϵ) log(1/δ). This completes the analysis.

[Figure: a number line with the −-labeled examples on the left and the +-labeled examples on the right; the interval B of probability mass ϵ sits immediately to the left of the true threshold θ, and the learned threshold θ̂ lies within it.]
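To see the sample bound in action, here is a small Monte Carlo check of the tightest-fit learner; the uniform distribution on [0, 1] and the true threshold θ = 0.5 are illustrative assumptions, so the printed failure rate is only a sanity check that it stays below δ.

```python
import math
import random

random.seed(0)
eps, delta, theta = 0.05, 0.01, 0.5
m = math.ceil((1 / eps) * math.log(1 / delta))  # 93 examples here

failures, trials = 0, 2000
for _ in range(trials):
    xs = [random.random() for _ in range(m)]          # i.i.d. draws from U[0,1]
    negatives = [x for x in xs if x <= theta]         # points labeled -1 by h_theta
    theta_hat = max(negatives) if negatives else 0.0  # largest x labeled -1
    # Under U[0,1], the error of h_theta_hat is the mass of [theta_hat, theta].
    failures += (theta - theta_hat) > eps

print(failures / trials, "<=", delta)  # empirical failure rate vs. delta
```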

5. (a) If err(h) > ϵ, then the probability that h labels a single randomly drawn example
correctly is less than 1 − ϵ. The probability of getting k independent examples right is
less than (1 − ϵ)^k ≤ e^{−ϵk} (using the fact that 1 + x ≤ e^x for all x). By picking k = (1/ϵ) log(1/δ′),
this is at most δ′ (which, as we'll see, will be picked based on our final desired δ).
(b) Since A has a mistake bound of t, and it only updates when it makes a mistake, it can
go through at most t + 1 hypotheses (its initial one plus one per mistake). This means that
if we view our examples as consisting of t + 1 blocks of size k, then at least one block must
be mistake-free: a mistake in every block would mean more than t mistakes in total.
(c) First we state the algorithm, then give the analysis. Our overall PAC learner works as
follows:
i. The learner draws a block of k random examples from D.
ii. At the start of the block, we assume that algorithm A, with its current state/hypothesis
hi, satisfies err(hi) ≤ ϵ, and we use the examples in the block to test whether this is indeed
the case: the learner feeds the examples in the block one by one to A. If A makes a mistake
on one of them, we start again at step 5(c)i. If it does not, we stop and output hi.
Let hi denote A's hypothesis (or state) after i mistakes. Based on our steps so far, the
most natural idea for when we decide to stop and output hi is to do so when hi has
gotten its block of k examples right.[1] So the hypothesis our PAC learner eventually
outputs will be one of the hi that A goes through. Define the event Ei to be the event
that we output hi such that err(hi) > ϵ. The PAC learner's failure event is precisely that
it outputs a hypothesis with error greater than ϵ, and is thus described by E = ∪i Ei.
Here the union is taken over all the blocks we go through, which number at most t + 1.

[1] Important pedagogical note: when we talk of the event Ei, we are saying something about the examples, not the
hypothesis per se: we are saying that the ith block is misleading in the technical sense that all k examples happen
to fall into hi's "good area". That is, the function is fixed, and it is the examples that are random. It's important
to understand this fact in this kind of analysis. If you like, the sample space consists of realizations of our random
draws of examples from D.
We want to ensure that P[E] ≤ δ. By the union bound, we can say that P[E] ≤ Σi P[Ei].
Recall that we picked k such that P[Ei] ≤ δ′. Since there are at most t + 1 events Ei,
picking δ′ = δ/(t + 1), we have P[E] ≤ (t + 1)δ′ ≤ δ, as desired. This proves the correctness
of our PAC learner. The total number of examples used is (t + 1)k = ((t + 1)/ϵ) log((t + 1)/δ).
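Putting (a)–(c) together, here is a minimal Python sketch of the conversion; the predict/update interface for the online learner A is an assumed API for illustration, standing in for any conservative mistake-bounded learner.

```python
import math

def pac_from_mistake_bound(A, t, draw_example, eps, delta):
    """Convert a conservative online learner A with mistake bound t into a
    PAC learner: test A's current hypothesis on blocks of k fresh examples
    and output it once it survives a whole block with no mistakes."""
    k = math.ceil((1 / eps) * math.log((t + 1) / delta))  # delta' = delta/(t+1)
    for _ in range(t + 1):  # at most t + 1 blocks can contain a mistake
        clean = True
        for _ in range(k):
            x, y = draw_example()       # one i.i.d. labeled example from D
            if A.predict(x) != y:
                A.update(x, y)          # conservative: update only on mistakes
                clean = False
                break                   # restart the test with a fresh block
        if clean:
            return A.predict            # hypothesis passed its block of k tests
    return A.predict                    # unreachable if the mistake bound holds
```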
