Practice Midterm 2010
Notes:
1. The midterm will have about 5-6 long questions, and about 8-10 short questions. Space
will be provided on the actual midterm for you to write your answers.
2. The midterm is meant to be educational, and as such some questions could be quite
challenging. Use your time wisely to answer as much as you can!
3. For additional practice, please see CS 229 extra problem sets available at
http://see.stanford.edu/see/materials/aimlcs229/assignments.aspx
    \ell(\theta) = \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big).
Give a set of conditions on b(y), T(y), and a(η) which ensure that the log-likelihood is
a concave function of θ (and thus has a unique maximum). Your conditions must be
reasonable, and should be as weak as possible. (E.g., the answer “any b(y), T(y), and
a(η) so that ℓ(θ) is concave” is not reasonable. Similarly, overly narrow conditions,
including ones that apply only to specific GLMs, are also not reasonable.)
(b) [3 points] When the response variable is distributed according to a Normal distribution
(with unit variance), we have b(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}, T(y) = y, and
a(\eta) = \frac{\eta^2}{2}. Verify that the condition(s) you gave in part (a) hold for this setting.
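As a quick sanity check on the notation, substituting these choices into the exponential-family
form p(y; η) = b(y) exp(ηT(y) − a(η)) recovers the unit-variance Gaussian density with natural
parameter η:

    b(y)\exp\big(\eta T(y) - a(\eta)\big)
      = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2} \exp\!\Big(\eta y - \frac{\eta^2}{2}\Big)
      = \frac{1}{\sqrt{2\pi}} \exp\!\Big(-\frac{(y-\eta)^2}{2}\Big).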
in our usual linear least-squares model.^1 Let a set of m IID training examples be given
(with x^{(i)} ∈ R^{n+1}). Recall that the MAP estimate of the parameters θ is given by:
    \theta_{\mathrm{MAP}} = \arg\max_{\theta} \left( \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big) \right) p(\theta)
Find, in closed form, the MAP estimate of the parameters θ. For this problem, you should
treat τ^2 and σ^2 as fixed, known constants. [Hint: Your solution should involve deriving
something that looks a bit like the Normal equations.]
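One way to sanity-check a closed-form answer is to compare it against a direct numerical
maximization of the log-posterior on synthetic data. The sketch below uses the likelihood
from footnote 1 and assumes, as an assumption not stated explicitly above, the usual
Gaussian prior θ ∼ N(0, τ^2 I):

    import numpy as np
    from scipy.optimize import minimize

    # Numerical sanity check for the MAP estimate on synthetic data.
    # Assumes y^(i) | x^(i), theta ~ N(theta^T x^(i), sigma^2) and (as an
    # assumption) a Gaussian prior theta ~ N(0, tau^2 I).
    rng = np.random.default_rng(0)
    m, d = 50, 4                                  # d = n + 1 features per example
    sigma2, tau2 = 0.25, 1.0                      # fixed, known constants
    X = rng.normal(size=(m, d))
    theta_true = rng.normal(size=d)
    y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=m)

    def neg_log_posterior(theta):
        # -log[(prod_i p(y^(i) | x^(i), theta)) p(theta)], dropping theta-free constants
        resid = y - X @ theta
        return resid @ resid / (2 * sigma2) + theta @ theta / (2 * tau2)

    theta_map = minimize(neg_log_posterior, np.zeros(d)).x
    print(theta_map)   # should agree with whatever closed form you derive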
(a) [8 points] Let K(x, z) be a valid (Mercer) kernel over R^n × R^n. Consider the function
given by
    K_e(x, z) = \exp(K(x, z)).
Show that K_e is a valid kernel. [Hint: There are many ways of proving this result,
but you might find the following two facts useful: (i) the Taylor expansion of e^x is
given by e^x = \sum_{j=0}^{\infty} \frac{1}{j!} x^j; (ii) if a sequence of non-negative
numbers a_i \geq 0 has a limit a = \lim_{i \to \infty} a_i, then a \geq 0.]
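A quick empirical check (not a proof, and using an arbitrary choice of base kernel) is that
applying exp elementwise to a valid kernel's Gram matrix leaves it positive semidefinite:

    import numpy as np

    # Empirical check: exp applied elementwise to the Gram matrix of a valid kernel
    # (here the linear kernel K(x, z) = x^T z on random points) should stay PSD.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 5))           # 20 random points in R^5
    K = X @ X.T                            # Gram matrix of the linear kernel
    K_e = np.exp(K)                        # K_e(x, z) = exp(K(x, z)), elementwise
    print(np.linalg.eigvalsh(K_e).min())   # >= 0 up to floating-point error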
(b) [8 points] The Gaussian kernel is given by the function
    K(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{\sigma^2} \right),
where σ^2 > 0 is some fixed, positive constant. We said in class that this is a valid
kernel, but did not prove it. Prove that the Gaussian kernel is indeed a valid kernel.
[Hint: The following fact may be useful: \|x - z\|^2 = \|x\|^2 - 2x^T z + \|z\|^2.]
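Similarly, one can spot-check (again, no substitute for a proof) that Gaussian-kernel Gram
matrices on random points are positive semidefinite for several values of σ^2:

    import numpy as np

    # Empirical check: Gaussian-kernel Gram matrices should have no negative
    # eigenvalues (up to floating-point error) for any sigma^2 > 0.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 3))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x - z||^2
    for sigma2 in (0.1, 1.0, 10.0):
        K = np.exp(-sq_dists / sigma2)
        print(sigma2, np.linalg.eigvalsh(K).min())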
(a) [9 points] The primal optimization problem for the one-class SVM was given above.
Write down the corresponding dual optimization problem. Simplify your answer as
much as possible. In particular, w should not appear in your answer.
^1 Equivalently, y^{(i)} = θ^T x^{(i)} + ε^{(i)}, where the ε^{(i)}'s are distributed IID N(0, σ^2).
^2 This turns out to be useful for anomaly detection, but I assume you already have enough to keep you
entertained for the 2h 15min of the midterm, and thus wouldn’t want to read about it here. See the midterm
solutions for details.
(b) [4 points] Can the one-class SVM be kernelized (both in training and testing)?
Justify your answer.
(c) [5 points] Give an SMO-like algorithm to optimize the dual. I.e., give an algorithm
that in every optimization step optimizes over the smallest possible subset of variables.
Also give in closed-form the update equation for this subset of variables. You should
also justify why it is sufficient to consider this many variables at a time in each step.
5. [18 points] Uniform Convergence
In this problem, we consider trying to estimate the mean of a biased coin toss. We will
repeatedly toss the coin and keep a running estimate of the mean. We would like to prove
that (with high probability), after some initial set of N tosses, the running estimate from
that point on will always be accurate and never deviate too much from the true value.
More formally, let Xi ∼ Bernoulli(φ) be IID random variables. Let φ̂n be our estimate for
φ after n observations:
    \hat{\phi}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.
We’d like to show that after a certain number of coin flips, our estimates will stay close
to the true value of φ. More formally, we’d like to show that for all γ, δ ∈ (0, 1/2], there
exists a value N such that
    P\left( \max_{n \ge N} \big| \phi - \hat{\phi}_n \big| > \gamma \right) \le \delta.
Show that in order to make the guarantee above, it suffices to have
N = O\!\left( \frac{1}{\gamma^2} \log \frac{1}{\delta\gamma} \right).
You may need to use the fact that for \gamma \in (0, 1/2],
\log\!\left( \frac{1}{1 - \exp(-2\gamma^2)} \right) = O\!\left( \log \frac{1}{\gamma} \right).
[Hint: Let A_n be the event that |φ − φ̂_n| > γ and consider taking a union bound over the
set of events A_n, A_{n+1}, A_{n+2}, . . . .]
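For intuition (this is a simulation, not the required proof), one can watch how the worst
deviation of the running estimate over n ≥ N shrinks as N grows, for an arbitrary choice of φ:

    import numpy as np

    # Simulate Bernoulli(phi) tosses, form the running estimate phi_hat_n, and
    # report the worst deviation |phi - phi_hat_n| over n >= N within the run.
    rng = np.random.default_rng(3)
    phi, total = 0.3, 20000                 # arbitrary illustrative choices
    tosses = rng.binomial(1, phi, size=total)
    phi_hat = np.cumsum(tosses) / np.arange(1, total + 1)   # phi_hat[n-1] after n tosses
    for N in (100, 1000, 10000):
        print(N, np.abs(phi - phi_hat[N - 1:]).max())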
6. [40 points] Short Answers
The following questions require a true/false accompanied by one sentence of explanation,
or a reasonably short answer (usually at most 1-2 sentences or a figure).
To discourage random guessing, one point will be deducted for a wrong answer
on multiple choice questions! Also, no credit will be given for answers without
a correct explanation.
(a) [5 points] Let there be a binary classification problem with continuous-valued features.
In Problem Set #1, you showed that if we apply Gaussian discriminant analysis using the
same covariance matrix Σ for both classes, then the resulting decision boundary will be
linear. What will the decision boundary look like if we model the two classes using
separate covariance matrices Σ_0 and Σ_1? (I.e., x^{(i)} | y^{(i)} = b ∼ N(µ_b, Σ_b),
for b = 0 or 1.)
(b) [5 points] Consider a sequence of examples (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(m)}, y^{(m)}).
Assume that for all i we have \|x^{(i)}\| ≤ D and that the data are linearly separated with
a margin γ. Suppose that the perceptron algorithm makes exactly (D/γ)^2 mistakes
on this sequence of examples. Now, suppose we use a feature mapping φ(·) to a
higher dimensional space and use the corresponding kernel perceptron algorithm on
the same sequence of data (now in the higher-dimensional feature space). Then the
kernel perceptron (implicitly operating in this higher dimensional feature space) will
make a number of mistakes that is
^3 MI(S_i, y) = \sum_{S_i} \sum_{y} P(S_i, y) \log \frac{P(S_i, y)}{P(S_i) P(y)}, where the first summation is over all possible values
of the features in S_i.
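To make the footnote's formula concrete, here is how MI(S_i, y) would be computed from a
made-up joint distribution table over a single binary feature and a binary label:

    import numpy as np

    # MI(S_i, y) = sum_{S_i} sum_{y} P(S_i, y) log[ P(S_i, y) / (P(S_i) P(y)) ]
    # for a made-up 2x2 joint distribution (rows: feature values, cols: labels).
    P_joint = np.array([[0.4, 0.1],
                        [0.1, 0.4]])
    P_s = P_joint.sum(axis=1, keepdims=True)    # marginal P(S_i)
    P_y = P_joint.sum(axis=0, keepdims=True)    # marginal P(y)
    mi = (P_joint * np.log(P_joint / (P_s * P_y))).sum()
    print(mi)   # mutual information in nats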