hw3 Solution
November 2, 2011
• [2 points] True or false (if true, give a 1-sentence justification; if false, give a counterexample): When learning an HMM for a fixed set of observations, and assuming we do not know the true number of hidden states (which is often the case), we can always increase the training data likelihood by permitting more hidden states.
Answer: True. In the extreme case, we could assign one hidden state to each output value in the training sequence and achieve a perfect fit. (Note: if you considered the scenario where the number of states is even larger than the number of observed output values and answered 'False', you will also get full credit.)
• [4 points] Show that if any elements of the parameters π (start probability) or A (transition
probability) for a hidden Markov model are initially set to zero, then those elements will
remain zero in all subsequent updates of the EM algorithm.
Answer: In the E step, because the corresponding elements of π and A are zero, zero expected counts are assigned to starting in those states and to taking those transitions. Hence, in the M step, which simply renormalizes these expected counts, the updated probabilities remain zero.
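In terms of the standard Baum-Welch quantities (a sketch of the update equations; ajk denotes a transition probability and bk(xt) an emission probability), the E step computes
ξt(j, k) = αt(j) ajk bk(xt+1) βt+1(k) / P(x | M)   and   γ1(k) ∝ πk bk(x1) β1(k),
and the M step sets the updated Ajk ∝ Σt ξt(j, k) and the updated πk ∝ γ1(k). If ajk = 0, then every ξt(j, k) = 0 and the updated Ajk is 0; likewise, if πk = 0 then γ1(k) = 0 and the updated πk stays 0.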
1.2 HMM for DNA Sequence
In this problem, you will use an HMM to decode a simple DNA sequence. It is well known that a DNA sequence is a series of symbols from {A, C, G, T }. Now let's assume there is one hidden variable S that controls the generation of the DNA sequence. S takes 2 possible states {S1 , S2 }. Assume the following transition probabilities for HMM M:
P (S1 |S1 ) = 0.8, P (S2 |S1 ) = 0.2, P (S1 |S2 ) = 0.2, P (S2 |S2 ) = 0.8
• [5 points] P (x|M ) using the forward algorithm. Show your work to get full credit.
Answer:
Initialization: α1k = πk P(x1 | Sk) for k = 1, 2.
Iteration: αtk = P(xt | Sk) Σj α(t−1)j P(Sk | Sj), for t = 2, . . . , 6.
Results:
α Sk = S1 Sk = S2
α1k 0.05 0.2
α2k 0.032 0.017
α3k 0.0029 0.008
α4k 3.9200e-04 0.0028
α5k 3.4880e-04 2.3120e-04
α6k 1.3011e-04 2.5472e-05
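The sequence likelihood is obtained by summing the final forward values: P(x | M) = α6,S1 + α6,S2 = 1.3011e-04 + 2.5472e-05 ≈ 1.5558e-04.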
• [5 points] The posterior probabilities P (πi = S1 |x, M ) for i = 1, . . . , 6. Show your work to
get full credit.
Answer:
First, use the backward algorithm to get βtk : β6k = 1, and βtk = Σj P(Sj | Sk) P(xt+1 | Sj) β(t+1)j for t = 5, . . . , 1. The posteriors then combine αtk and βtk, as shown after the table.
Results:
β Sk = S1 Sk = S2
β1k 7.9744e-04 5.7856e-04
β2k 0.0022 0.0051
β3k 0.0122 0.0150
β4k 0.1120 0.0400
β5k 0.3400 0.1600
β6k 1 1
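Specifically, P(πi = S1 | x, M) = αi,S1 βi,S1 / P(x | M). For example, for i = 1: 0.05 × 7.9744e-04 / 1.5558e-04 ≈ 0.2563, which matches the first entry of the table below.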
Posterior:
Sk = S1
P (π1 = S1 |x, M ) 0.2563
P (π2 = S1 |x, M ) 0.4476
P (π3 = S1 |x, M ) 0.4476
P (π4 = S1 |x, M ) 0.2822
P (π5 = S1 |x, M ) 0.7622
P (π6 = S1 |x, M ) 0.8363
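As a cross-check, the forward and backward tables above can be reproduced with a short script. This is only a sketch: the transition matrix is the one given in the problem, but the initial distribution, emission probabilities, and observed sequence are not reproduced in this excerpt, so pi, B, and obs below are placeholders that must be replaced with the actual values from the assignment.

    import numpy as np

    # Transition probabilities given in the problem (rows: current state, cols: next state).
    A = np.array([[0.8, 0.2],
                  [0.2, 0.8]])

    # Placeholders only -- the true values are defined in the original problem statement.
    pi = np.array([0.5, 0.5])              # placeholder initial distribution
    B = np.array([[0.1, 0.4, 0.4, 0.1],    # placeholder P(output | S1) over {A, C, G, T}
                  [0.4, 0.1, 0.1, 0.4]])   # placeholder P(output | S2)
    obs = [0, 1, 2, 3, 0, 1]               # placeholder length-6 observation sequence

    T, K = len(obs), len(pi)

    # Forward pass: alpha[t, k] = P(x_1..x_t, state_t = S_k)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)

    # Backward pass: beta[t, k] = P(x_{t+1}..x_T | state_t = S_k)
    beta = np.zeros((T, K))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    likelihood = alpha[-1].sum()           # P(x | M)
    posterior = alpha * beta / likelihood  # P(state_t = S_k | x, M)
    print(likelihood)
    print(posterior[:, 0])                 # P(pi_i = S1 | x, M) for i = 1..6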
• [5 points] The most likely path of hidden states using the Viterbi algorithm. Show your
work to get full credit.
Answer:
First, calculate Vtk using the Viterbi algorithm: V1k = πk P(x1 | Sk) and Vtk = P(xt | Sk) maxj V(t−1)j P(Sk | Sj). Record in P tr(k, t) the previous state that achieves the maximum at each step; the most likely path is then read off by tracing back through P tr (see below the tables).
Results:
Vtk Sk = S1 Sk = S2
V1k 0.0500 0.2000
V2k 0.0160 0.0160
V3k 0.0013 0.0013
V4k 1.0240e-04 0.0016
V5k 1.3107e-04 1.3107e-04
V6k 4.1943e-05 1.0486e-05
P tr(k, t) Sk = S1 Sk = S2
P tr(k, 1) S1 S1
P tr(k, 2) S1 S2
P tr(k, 3) S1 S2
P tr(k, 4) S1 S2
P tr(k, 5) S2 S2
P tr(k, 6) S1 S2
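Reading the tables above, with P tr(k, t) interpreted as the best predecessor of state k at step t: the larger final value is V6,S1 , so π6 = S1; then π5 = P tr(S1, 6) = S1, π4 = P tr(S1, 5) = S2, π3 = P tr(S2, 4) = S2, π2 = P tr(S2, 3) = S2, and π1 = P tr(S2, 2) = S2. The most likely path is therefore S2, S2, S2, S2, S1, S1, which agrees with the state-wise posteriors computed above.

A minimal Viterbi sketch, intended to be run after the placeholder forward-backward script above (the same caveats about pi, B, and obs apply):

    # Viterbi pass: V[t, k] = probability of the best path ending in state k at time t.
    V = np.zeros((T, K))
    ptr = np.zeros((T, K), dtype=int)
    V[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] * A      # scores[j, k] = V[t-1, j] * P(S_k | S_j)
        ptr[t] = scores.argmax(axis=0)      # best predecessor of each state k
        V[t] = B[:, obs[t]] * scores.max(axis=0)

    # Trace back the most likely state sequence.
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    path.reverse()
    print(path)                             # indices 0/1 correspond to S1/S2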
2 Bayesian Network
2.1 True or False
• (a) True. Let's say the variables are X1 , . . . , XN . Construct the fully connected DAG with edges X1 → Xi for all i > 1, X2 → Xi for all i > 2, and so on. This network encodes no conditional-independence assumptions, so it can represent any probability distribution over X1 , . . . , XN . (Note: I will not deduct points on this question since I feel the statement is not clear enough.)
• (b) False. Some distributions in D can have additional independence assumptions not encoded
in the network. For example, consider the 3-clique defined by X1 → X2 → X3 and X1 → X3 ,
where all X are boolean. Let the conditional probability table of P (X3 | X1 , X2 ) be
(X1 , X2 ) = (0, 0) (X1 , X2 ) = (0, 1) (X1 , X2 ) = (1, 0) (X1 , X2 ) = (1, 1)
X3 = 1 0.7 0.7 0.4 0.4
X3 = 0 0.3 0.3 0.6 0.6
Observe that P (X3 | X1 , X2 = 0) = P (X3 | X1 , X2 = 1). This shows that X3 is conditionally independent of X2 given X1 , even though this independence cannot be read off from the Bayes net G.
• (b) We need 1 Bernoulli parameter for each of P (A), P (B), P (G); 2 parameters for each of P (D | C) and P (F | D) (1 Bernoulli parameter for each of the 2 settings of the conditioning variable); and 4 parameters for each of P (E | C, G) and P (C | A, B) (1 Bernoulli parameter for each of the 4 settings of the conditioning variables). In total, we need 3(1) + 2(2) + 2(4) = 3 + 4 + 8 = 15 parameters.
2.5 D-Separation
• A ⊥ B | C: False. The only trail from A to B is A → C ← B, and it is active because we
condition on the inverted fork variable C.
• B ⊥ G | C, E: True. The only trail from B to G is B → C → E ← G; although conditioning on E activates the collider at E, the trail is still blocked at C because we condition on C, so B and G are d-separated.
2. Objective function optimized - a CRF maximizes the conditional probability P (Y | X), while an HMM maximizes the joint likelihood P (X, Y ).
3. Require a normalization constant - a CRF requires a global normalization constant (the partition function Z(x)), while an HMM does not (the standard forms of both models are sketched after this list).
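For reference, the standard linear-chain forms being contrasted (a sketch, not taken from the original problem statement):
HMM: P (X, Y ) = Πt P (yt | yt−1 ) P (xt | yt ), which is locally normalized and needs no global constant;
CRF: P (Y | X) = (1/Z(X)) exp{ Σt Σk λk fk (yt , yt−1 , X) }, where Z(X) = ΣY exp{ Σt Σk λk fk (yt , yt−1 , X) }.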
2. nk = I[tag(yi ) = “Proper noun” AND Xi is uppercase] - Yes, it can be written in the form g(yi , x).
3. ok = I[yi = yi+2 ] - No, a standard linear-chain CRF only permits dependencies between adjacent states, i.e., features of (yi , yi+1 ).
4. nk = I[tag(yi ) = “Proper noun” AND Xi−1 is an article] - Yes, it can be written in the form g(yi , x), since x contains the entire observation sequence (in particular Xi−1 ).
2.
z(x) = Σy1 ,y2 ,y3 ,y4 ,y5 exp{ λ [f (y2 , y1 , x) + f (y3 , y2 , x) + f (y4 , y2 , x) + f (y5 , y3 , x) + f (y5 , y4 , x)] + µ [g(y1 , x) + g(y2 , x) + g(y3 , x) + g(y4 , x) + g(y5 , x)] }
3.4 Computing the normalization constant [4 points]
1. 2^5 = 32
2. 2^n
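(Assuming each label yi is binary, as the counts above imply: z(x) contains one term per joint assignment of (y1 , . . . , y5 ), hence 2^5 = 32 terms, and with n binary label variables the brute-force sum has 2^n terms.)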
2.
p(µk = u | x, z, µ \ {µk }) ∝ [ Π_{i=1..N} p(xi | zi , µk = u, µ \ {µk })^δ(zi =k) ] p(µk = u)
∝ [ Π_{i=1..N} ( (2π)^{−D/2} exp( −(1/2) ∥xi − u∥₂² ) )^δ(zi =k) ] (2π)^{−D/2} exp( −(1/2) ∥u∥₂² )
∝ [ Π_{i=1..N} exp( −(1/2) ∥xi − u∥₂² )^δ(zi =k) ] exp( −(1/2) ∥u∥₂² )
∝ exp( −0.5 [ ∥u∥₂² + Σ_{i=1..N} δ(zi = k) ∥xi − u∥₂² ] )
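Completing the square in u, with Nk = Σi δ(zi = k), the exponent equals −(1/2)(1 + Nk ) ∥u − mk ∥₂² plus a constant, where mk = ( Σi δ(zi = k) xi ) / (1 + Nk ). Hence the conditional is Gaussian: µk | x, z, µ \ {µk } ∼ N( mk , I/(1 + Nk ) ).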
2. The likelihood terms in both cases (except when we have to add a new Gaussian in the infinite prior) are the same. In the K-Gaussians case, the prior over z gives each Gaussian uniform weight, but under the infinite prior each Gaussian is weighted by how many data points are assigned to it (the standard forms are sketched after this list).
3. With an infinite uniform prior, we would have an infinite number of potential z values, each with identical non-zero probability mass. This is impossible under the restriction that the total probability mass over all z values must sum to 1.
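As a sketch of the standard forms being contrasted (notation not taken from the original problem): with K Gaussians and a uniform prior, p(zi = k) = 1/K for every component, whereas under the infinite (Chinese restaurant process) prior, p(zi = k | z−i ) ∝ n−i,k for an existing component k (where n−i,k is the number of other points currently assigned to k) and p(zi = new component | z−i ) ∝ α. Occupied components are thus weighted by their current counts, and only finitely many components ever receive non-zero mass, which is what makes the infinite prior well defined.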