10-701 Machine Learning, Fall 2011: Homework 3 Solutions

November 2, 2011

1 Hidden Markov Model [25 points, Bin]


1.1 General Questions
• [4 points] For each of the following data sets, is it appropriate to use HMM? Provide a one
sentence explanation to your answer.

– Stock market price data


Answer: True. Stock prices form a time series in which each observation depends on the recent past, which matches the sequential structure an HMM assumes.
– Collaborative filtering on a database of movie reviews: for example, Netflix challenge:
predict about how much someone is going to enjoy a movie based on their and other
users’ movie preferences
Answer: False. A user’s movie preferences do not change much over time, so there is no meaningful temporal sequence to model.
– Daily precipitation data in Pittsburgh
Answer: True. Whether it rains today depends largely on whether it rained yesterday
(though this may be less true for Pittsburgh).
– Optical character recognition
Answer: True. Word recognition is sensitive to the character sequence, e.g., distinguishing
“Rib” from “Rob”. (Note: if you only consider isolated single-character recognition and
answered ‘No’, you will also get full credit.)

• [2 points] True or false: (if true, give a 1 sentence justification; if false, give a counter
example.) When learning an HMM for a fixed set of observations, assume we do not know
the true number of hidden states (which is often the case), we can always increase the training
data likelihood by permitting more hidden states.
Answer: True. In the worst case, we could assign one hidden state to each output symbol in
the training sequence and achieve a perfect fit. (Note: if you consider the scenario where
the number of states is even larger than the number of observed output values, and answered
‘False’, you will also get full credit.)

• [4 points] Show that if any elements of the parameters π (start probability) or A (transition
probability) for a hidden Markov model are initially set to zero, then those elements will
remain zero in all subsequent updates of the EM algorithm.
Answer: In the E step, any state with zero start probability and any transition with zero
probability receives zero posterior responsibility, because every term of the corresponding
expected count contains that zero parameter as a factor. In the M step, each parameter is
re-estimated as a ratio of these expected counts, so the updated value is again zero.
This is written out explicitly below.
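For concreteness, here is the argument written out with the standard Baum-Welch quantities (this is the usual notation, not taken from the homework handout). The E step computes

$$
\xi_t(j,k) = P(\pi_t = S_j, \pi_{t+1} = S_k \mid x) = \frac{\alpha_t^j\, a_{jk}\, P(x_{t+1} \mid S_k)\, \beta_{t+1}^k}{P(x)},
$$

and the M step sets

$$
\hat{a}_{jk} = \frac{\sum_t \xi_t(j,k)}{\sum_t \sum_{k'} \xi_t(j,k')} .
$$

If $a_{jk} = 0$, then $\xi_t(j,k) = 0$ for every $t$, so $\hat{a}_{jk} = 0$. The same argument applied to $\hat{\pi}_k = \gamma_1(k) \propto \pi_k\, P(x_1 \mid S_k)\, \beta_1^k$ shows that a zero start probability also remains zero.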

1.2 HMM for DNA Sequence
In this problem, you will use HMM to decode a simple DNA sequence. It is well known that a DNA
sequence is a series of components from {A, C, G, T }. Now let’s assume there is one hidden variable
S that controls the generation of DNA sequence. S takes 2 possible states {S1 , S2 }. Assume the
following transition probabilities for HMM M

P (S1 |S1 ) = 0.8, P (S2 |S1 ) = 0.2, P (S1 |S2 ) = 0.2, P (S2 |S2 ) = 0.8

emission probabilities as following

P (A|S1 ) = 0.4, P (C|S1 ) = 0.1, P (G|S1 ) = 0.4, P (T |S1 ) = 0.1


P (A|S2 ) = 0.1, P (C|S2 ) = 0.4, P (G|S2 ) = 0.1, P (T |S2 ) = 0.4

and start probabilities as following

P (S1 ) = 0.5, P (S2 ) = 0.5

Assume the observed sequence is x = CGTCAG. Calculate:

• [5 points] P (x|M ) using the forward algorithm. Show your work to get full credit.
Answer:
Initialization:

$\forall S_k \in \{S_1, S_2\}: \quad \alpha_1^k = P(x_1 \mid \pi_1 = S_k)\, P(\pi_1 = S_k) \qquad (1)$

Iteration:

$\forall S_k \in \{S_1, S_2\},\ t \in \{2, 3, 4, 5, 6\}: \qquad (2)$

$\alpha_t^k = P(x_t \mid \pi_t = S_k) \sum_{i \in \{1,2\}} \alpha_{t-1}^i\, a_{i,k} \qquad (3)$

Results:

α_t^k        S_k = S_1      S_k = S_2
α_1^k        0.05           0.2
α_2^k        0.032          0.017
α_3^k        0.0029         0.008
α_4^k        3.9200e-04     0.0028
α_5^k        3.4880e-04     2.3120e-04
α_6^k        1.3011e-04     2.5472e-05

Summing the last row gives the requested likelihood: P(x|M) = α_6^1 + α_6^2 = 1.3011e-04 + 2.5472e-05 ≈ 1.5558e-04. (A NumPy sketch that reproduces all of the tables in this section appears at the end of Section 1.2.)

• [5 points] The posterior probabilities P (πi = S1 |x, M ) for i = 1, . . . , 6. Show your work to
get full credit.
Answer:
First, run the backward algorithm to get β_t^k.
Results:

β_t^k        S_k = S_1      S_k = S_2
β_1^k        7.9744e-04     5.7856e-04
β_2^k        0.0022         0.0051
β_3^k        0.0122         0.0150
β_4^k        0.1120         0.0400
β_5^k        0.3400         0.1600
β_6^k        1              1

Then, calculate P (πi = S1 |x, M ) by



$P(\pi_i = S_1 \mid x, M) = \alpha_i^1 \beta_i^1 \Big/ \sum_{k \in \{1,2\}} \alpha_i^k \beta_i^k \qquad (4)$

Posterior:

                      S_k = S_1
P(π_1 = S_1 | x, M)   0.2563
P(π_2 = S_1 | x, M)   0.4476
P(π_3 = S_1 | x, M)   0.2267
P(π_4 = S_1 | x, M)   0.2822
P(π_5 = S_1 | x, M)   0.7622
P(π_6 = S_1 | x, M)   0.8363

• [5 points] The most likely path of hidden states using the Viterbi algorithm. Show your
work to get full credit.

Answer:
First, compute V_t^k using the Viterbi recursion, and record the state Ptr(k, t) that achieves
the maximum at each step.
Results:
V_t^k        S_k = S_1      S_k = S_2
V_1^k        0.0500         0.2000
V_2^k        0.0160         0.0160
V_3^k        0.0013         0.0051
V_4^k        1.0240e-04     0.0016
V_5^k        1.3107e-04     1.3107e-04
V_6^k        4.1943e-05     1.0486e-05

Ptr(k, t)    S_k = S_1      S_k = S_2
Ptr(k, 1)    S1             S1
Ptr(k, 2)    S1             S2
Ptr(k, 3)    S1             S2
Ptr(k, 4)    S1             S2
Ptr(k, 5)    S2             S2
Ptr(k, 6)    S1             S2

Backtracking from the larger of the two values of V_6^k through Ptr gives the most likely path S_2, S_2, S_2, S_2, S_1, S_1.
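For readers who want to check the numbers above, here is a minimal NumPy sketch of the forward, backward, posterior, and Viterbi computations for this particular HMM (the variable names are our own choices, not part of the assignment):

import numpy as np

# Model from the problem statement: states S1, S2; symbols A, C, G, T.
start = np.array([0.5, 0.5])                      # P(S1), P(S2)
trans = np.array([[0.8, 0.2],                     # rows: from S1, S2; cols: to S1, S2
                  [0.2, 0.8]])
emit = np.array([[0.4, 0.1, 0.4, 0.1],            # P(A|S1), P(C|S1), P(G|S1), P(T|S1)
                 [0.1, 0.4, 0.1, 0.4]])           # P(A|S2), P(C|S2), P(G|S2), P(T|S2)
sym = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
x = [sym[c] for c in "CGTCAG"]
T = len(x)

# Forward recursion: alpha[t, k] = P(x_1..x_t, pi_t = S_k)
alpha = np.zeros((T, 2))
alpha[0] = start * emit[:, x[0]]
for t in range(1, T):
    alpha[t] = emit[:, x[t]] * (alpha[t - 1] @ trans)
print("P(x|M) =", alpha[-1].sum())                # ~1.5558e-04

# Backward recursion: beta[t, k] = P(x_{t+1}..x_T | pi_t = S_k)
beta = np.ones((T, 2))
for t in range(T - 2, -1, -1):
    beta[t] = trans @ (emit[:, x[t + 1]] * beta[t + 1])

# Posterior P(pi_t = S_k | x, M)
gamma = alpha * beta / alpha[-1].sum()
print("P(pi_t = S1 | x, M):", gamma[:, 0])

# Viterbi: most likely state path
V = np.zeros((T, 2))
ptr = np.zeros((T, 2), dtype=int)
V[0] = start * emit[:, x[0]]
for t in range(1, T):
    scores = V[t - 1][:, None] * trans            # scores[i, k] = V[t-1, i] * a_{i,k}
    ptr[t] = scores.argmax(axis=0)
    V[t] = emit[:, x[t]] * scores.max(axis=0)
path = [int(V[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(ptr[t][path[-1]]))
path.reverse()
print("Most likely path:", ["S1" if k == 0 else "S2" for k in path])

Running this prints P(x|M) ≈ 1.5558e-04, the posterior column for S_1, and the path S2, S2, S2, S2, S1, S1, matching the tables above up to rounding and tie-breaking in the Viterbi pointers.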

2 Bayesian Network
2.1 True or False
• (a) True. Let’s say the variables are X_1, . . . , X_N. Construct the fully connected DAG: add
edges X_1 → X_i for every i > 1, X_2 → X_i for every i > 2, and so on. This network assumes no
conditional independencies, and therefore it can encode any probability distribution over
X_1, . . . , X_N. (Note: I will not deduct points on this question since I feel the statement is
not clear enough.)

• (b) False. Some distributions in D can have additional independence assumptions not encoded
in the network. For example, consider the 3-clique defined by X1 → X2 → X3 and X1 → X3 ,
where all X are boolean. Let the conditional probability table of P (X3 | X1 , X2 ) be
(X1 , X2 ) = (0, 0) (X1 , X2 ) = (0, 1) (X1 , X2 ) = (1, 0) (X1 , X2 ) = (1, 1)
X3 = 1 0.7 0.7 0.4 0.4
X3 = 0 0.3 0.3 0.6 0.6
Observe that P (X3 | X1 , X2 = 0) = P (X3 | X1 , X2 = 1). This proves that X3 is conditionally
independent of X2 given X1 , even though we cannot derive this independence from the Bayes
Net G.

2.2 Joint Probability

P (A, B, C, D, E, F, G) = P (A) P (B) P (G) P (C | A, B) P (E | C, G) P (D | C) P (F | D)

2.3 Number of Parameters


• (a) There are 7 binary variables, so the full joint table must encode the probabilities of 2^7 = 128
possible settings of (A, B, C, D, E, F, G). Since these probabilities must sum to 1, this requires
2^7 − 1 = 127 parameters.

• (b) We need 1 Bernoulli parameter for each of P(A), P(B), P(G); 2 parameters for each of
P(D | C) and P(F | D) (1 Bernoulli parameter for each of the 2 settings of the conditioning
variable); and 4 parameters for each of P(C | A, B) and P(E | C, G) (1 Bernoulli parameter for
each of the 4 settings of the conditioning variables). In total, we need 3(1) + 2(2) + 2(4) =
3 + 4 + 8 = 15 parameters; the short sketch below arrives at the same count.
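The sketch simply sums 2^|parents| over the nodes, using the parent sets implied by the factorization in Section 2.2 (the dictionary is our own encoding, not from the handout):

parents = {
    "A": [], "B": [], "G": [],
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["C", "G"],
    "F": ["D"],
}
# Each binary node with p parents needs 2**p Bernoulli parameters.
print(sum(2 ** len(p) for p in parents.values()))   # 15
print(2 ** 7 - 1)                                    # 127, the full joint table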

2.4 Markov Blanket


The Markov blanket of C includes all parents of C (namely A, B), all children of C (namely D, E),
and all “co-parents” of C (defined as any variable X that shares a child with C, which here is
just G). Hence the Markov blanket of C is {A, B, D, E, G}; the short sketch below computes it
directly from the edge list.
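A minimal sketch, using the edge list implied by Section 2.2 (our own encoding of the graph):

edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("G", "E"), ("D", "F")]

def markov_blanket(node, edges):
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    coparents = {u for u, v in edges if v in children and u != node}
    return parents | children | coparents

print(sorted(markov_blanket("C", edges)))   # ['A', 'B', 'D', 'E', 'G']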

2.5 D-Separation
• A ⊥ B | C: False. The only trail from A to B is A → C ← B, and it is active because we
condition on the inverted fork variable C.

• A ⊥ G | E: False. The only trail from A to G is A → C → E ← G, and it is active since (1)
we do not condition on C, and (2) we condition on the inverted fork variable E.

• B ⊥ G | C, E: True. The only trail from B to G is B → C → E ← G, and it is d-separated
since we condition on C.

• F ⊥ G: True. The only trail from F to G is F ← D ← C → E ← G, and it is d-separated
since we do not condition on the inverted fork variable E (or on any of its descendants).
A programmatic check of all four statements is given in the sketch after this list.
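The sketch below implements the standard moralized-ancestral-graph test: X ⊥ Y | Z holds iff X and Y are disconnected after restricting to the ancestors of X, Y, Z, moralizing, and deleting Z. The edge list is the one implied by Section 2.2.

from collections import deque

edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("G", "E"), ("D", "F")]

def d_separated(x, y, z, edges):
    z = set(z)
    # 1) Restrict to x, y, Z and all of their ancestors.
    keep, frontier = set(), {x, y} | z
    while frontier:
        n = frontier.pop()
        if n not in keep:
            keep.add(n)
            frontier |= {u for u, v in edges if v == n}
    sub = [(u, v) for u, v in edges if u in keep and v in keep]
    # 2) Moralize: drop edge directions and connect ("marry") co-parents.
    und = {(u, v) for u, v in sub} | {(v, u) for u, v in sub}
    for n in keep:
        par = [u for u, v in sub if v == n]
        for i in range(len(par)):
            for j in range(i + 1, len(par)):
                und |= {(par[i], par[j]), (par[j], par[i])}
    # 3) Remove Z and test whether y is still reachable from x.
    seen, queue = {x}, deque([x])
    while queue:
        n = queue.popleft()
        for u, v in und:
            if u == n and v not in seen and v not in z:
                seen.add(v)
                queue.append(v)
    return y not in seen

print(d_separated("A", "B", ["C"], edges))        # False: not independent
print(d_separated("A", "G", ["E"], edges))        # False
print(d_separated("B", "G", ["C", "E"], edges))   # True
print(d_separated("F", "G", [], edges))           # True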

3 Conditional Random Fields [Suyash, 25 points]


3.1 CRFs and HMMs [6 points]
1. Type of model - a CRF is a discriminative model, while an HMM is generative.

2. Objective function optimized - a CRF maximizes the conditional probability P(Y | X), while
an HMM maximizes the joint likelihood P(X, Y).

3. Normalization constant - a CRF requires a global normalization constant Z(x), while an HMM
does not, since its local conditional probabilities are already normalized.

3.2 Features in CRFs [8 points]


Consider the standard CRF discussed in class (Lecture 12, slide 22). For each of the following
feature functions, state whether it can be represented by the CRF probability distribution, and
briefly explain your answer.

1. mk = I[yi = yi+1 ] - Yes, it could be written in the form of f (yi+1 , yi , x)

2. nk = I[tag(yi ) = “Proper noun” AND Xi is uppercase] - Yes, it could be written in the form
of g(yi , x)

3. ok = I[yi = yi+2 ] - No, the standard linear-chain CRF only permits one-step correlation
between states, i.e., features of adjacent pairs (yi+1 , yi ).

4. nk = I[tag(yi ) = “Proper noun” AND Xi−1 is an article] - Yes, it could be written in the
form of g(yi , x)

3.3 Complex CRFs [7 points]


1.
$$
P(y \mid x) = \frac{1}{z(x)} \exp\Big\{ \lambda \big[ f(y_2, y_1, x) + f(y_3, y_2, x) + f(y_4, y_2, x) + f(y_5, y_3, x) + f(y_5, y_4, x) \big] + \mu \big[ g(y_1, x) + g(y_2, x) + g(y_3, x) + g(y_4, x) + g(y_5, x) \big] \Big\}
$$

2.
$$
z(x) = \sum_{y_1, y_2, y_3, y_4, y_5} \exp\Big\{ \lambda \big[ f(y_2, y_1, x) + f(y_3, y_2, x) + f(y_4, y_2, x) + f(y_5, y_3, x) + f(y_5, y_4, x) \big] + \mu \big[ g(y_1, x) + g(y_2, x) + g(y_3, x) + g(y_4, x) + g(y_5, x) \big] \Big\}
$$

3.4 Computing the normalization constant [4 points]
1. $2^5$

2. $2^n$

3. The computation time grows exponentially with the length of x.
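To make this concrete, here is a brute-force computation of z(x) for the 5-label CRF of Section 3.3 that enumerates all 2^5 = 32 binary label assignments; the feature functions f, g and the weights λ, µ below are arbitrary placeholders, not values from the homework:

import itertools
import math

lam, mu = 1.0, 0.5
pairs = [(2, 1), (3, 2), (4, 2), (5, 3), (5, 4)]   # (i, j) for the f(y_i, y_j, x) terms

def f(yi, yj, x):          # placeholder pairwise feature
    return float(yi == yj)

def g(yi, x):              # placeholder node feature
    return float(yi == 1)

def z(x):
    total = 0.0
    for y in itertools.product([0, 1], repeat=5):   # 2^5 = 32 terms
        score = lam * sum(f(y[i - 1], y[j - 1], x) for i, j in pairs)
        score += mu * sum(g(yi, x) for yi in y)
        total += math.exp(score)
    return total

print(z(x="some observed sequence"))

For a label sequence of length n, the same enumeration would have 2^n terms, matching answers 2 and 3 above.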

4 Gibbs sampling for an infinite Gaussian mixture model


4.1 Uniform discrete prior for z
1.

$$
\begin{aligned}
p(z_i = k \mid x, \mu, z \setminus \{z_i\}) &\propto p(x_i \mid z_i = k, \mu)\, p(z_i = k) \\
&\propto (2\pi)^{-D/2} \exp\!\left( -\tfrac{1}{2} \|x_i - \mu_k\|_2^2 \right) \cdot \frac{1}{K} \\
&\propto \exp\!\left( -\tfrac{1}{2} \|x_i - \mu_k\|_2^2 \right)
\end{aligned}
$$

2.
$$
\begin{aligned}
p(\mu_k = u \mid x, z, \mu \setminus \{\mu_k\}) &\propto \left[ \prod_{i=1}^{N} p(x_i \mid z_i, \mu_k = u, \mu \setminus \{\mu_k\})^{\delta(z_i = k)} \right] p(\mu_k = u) \\
&\propto \left[ \prod_{i=1}^{N} \left( (2\pi)^{-D/2} \exp\!\left( -\tfrac{1}{2} \|x_i - u\|_2^2 \right) \right)^{\delta(z_i = k)} \right] (2\pi)^{-D/2} \exp\!\left( -\tfrac{1}{2} \|u\|_2^2 \right) \\
&\propto \left[ \prod_{i=1}^{N} \exp\!\left( -\tfrac{1}{2} \|x_i - u\|_2^2 \right)^{\delta(z_i = k)} \right] \exp\!\left( -\tfrac{1}{2} \|u\|_2^2 \right) \\
&\propto \exp\!\left( -\tfrac{1}{2} \left[ \|u\|_2^2 + \sum_{i=1}^{N} \delta(z_i = k) \|x_i - u\|_2^2 \right] \right)
\end{aligned}
$$
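The two conditionals above can be turned into a runnable Gibbs sweep. The sketch below assumes, as in the derivation, unit-variance components and a standard normal prior on each mean; the synthetic data, K, and the number of sweeps are placeholders for illustration only.

import numpy as np

rng = np.random.default_rng(0)

K, D, N = 3, 2, 100
x = rng.normal(size=(N, D)) + rng.integers(0, K, size=N)[:, None] * 3.0
mu = rng.normal(size=(K, D))
z = rng.integers(0, K, size=N)

def gibbs_sweep(x, z, mu):
    N, D = x.shape
    K = mu.shape[0]
    # Resample each z_i: p(z_i = k | ...) ∝ exp(-0.5 ||x_i - mu_k||^2), as in item 1.
    for i in range(N):
        logp = -0.5 * np.sum((x[i] - mu) ** 2, axis=1)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
    # Resample each mu_k: the expression in item 2 normalizes to a Gaussian with
    # mean sum(x_i assigned to k) / (n_k + 1) and covariance I / (n_k + 1).
    for k in range(K):
        members = x[z == k]
        n_k = len(members)
        mean = members.sum(axis=0) / (n_k + 1)
        mu[k] = rng.normal(mean, 1.0 / np.sqrt(n_k + 1))
    return z, mu

for _ in range(20):
    z, mu = gibbs_sweep(x, z, mu)
print(mu)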

4.2 An infinite prior over z


1.

If $\#[z \setminus \{z_i\} = k] > 0$, i.e., component $k$ already has at least one other point assigned to it:

$$
\begin{aligned}
p(z_i = k \mid x, \mu, z \setminus \{z_i\}) &\propto p(x_i \mid z_i = k, \mu)\, p(z_i = k \mid z \setminus \{z_i\}) \\
&\propto (2\pi)^{-D/2} \exp\!\left( -\tfrac{1}{2} \|x_i - \mu_k\|_2^2 \right) \cdot \frac{\#[z \setminus \{z_i\} = k]}{N + \alpha} \\
&\propto \exp\!\left( -\tfrac{1}{2} \|x_i - \mu_k\|_2^2 \right) \cdot \#[z \setminus \{z_i\} = k]
\end{aligned}
$$

2.

If $k$ is the smallest positive integer such that $\#[z \setminus \{z_i\} = k] = 0$, i.e., $k$ indexes a brand-new component:

$$
\begin{aligned}
p(z_i = k \mid x, \mu, z \setminus \{z_i\}) &\propto p(x_i \mid z_i = k, \mu)\, p(z_i = k \mid z \setminus \{z_i\}) \\
&\propto (2\pi)^{-D/2} \exp\!\left( -\tfrac{1}{2} \|x_i - \mu_k\|_2^2 \right) \cdot \frac{\alpha}{N + \alpha} \\
&\propto \exp\!\left( -\tfrac{1}{2} \|x_i - \mu_k\|_2^2 \right)
\end{aligned}
$$
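The prior over z used in the two cases above (existing components weighted by their counts, a new component weighted by α) is the usual Chinese restaurant process prior. Here is a tiny sketch of drawing a partition from it sequentially; α and N are placeholders.

import numpy as np

rng = np.random.default_rng(0)
alpha, N = 1.0, 20
z = []
for i in range(N):
    # Existing components are chosen proportionally to their current counts,
    # a brand-new component proportionally to alpha.
    counts = np.bincount(z) if z else np.array([])
    weights = np.append(counts, alpha).astype(float)
    z.append(int(rng.choice(len(weights), p=weights / weights.sum())))
print(z)   # e.g. a partition such as [0, 0, 1, 0, 2, ...]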

4.3 A few subtleties


1. Here $k$ is the smallest positive integer such that $\#[z \setminus \{z_i\} = k] = 0$, i.e., a new component whose mean must be integrated out:

$$
\begin{aligned}
p(z_i = k \mid x, \mu, z \setminus \{z_i\}) &\propto \left[ \int_u p(x_i \mid z_i = k, \mu_k = u, \mu \setminus \{\mu_k\})\, p(\mu_k = u)\, du \right] p(z_i = k \mid z \setminus \{z_i\}) \\
&\propto \left[ \int_u (2\pi)^{-D/2} \exp\!\left( -\tfrac{1}{2} \|x_i - u\|_2^2 \right) (2\pi)^{-D/2} \exp\!\left( -\tfrac{1}{2} \|u\|_2^2 \right) du \right] \frac{\alpha}{N + \alpha} \\
&\propto \int_u \exp\!\left( -\tfrac{1}{2} \|x_i - u\|_2^2 - \tfrac{1}{2} \|u\|_2^2 \right) du
\end{aligned}
$$

2. The likelihood terms in both cases are the same (except when we must add a new Gaussian under
the infinite prior). In the K-Gaussians case, the prior over z gives each Gaussian uniform
weight, whereas under the infinite prior each Gaussian is weighted by how many data points are
currently assigned to it.

3. With an infinite uniform prior, we would have an infinite number of potential z values, each
assigned identical non-zero probability mass. This is impossible under the restriction that the
total probability mass over all z values must equal 1.
