Information Theory and Machine Learning
I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means.
II: Information Bottleneck Method.
Lossy Compression
Example: Speech.
Relevance (2)
Tishby, Pereira, and Bialek (1999): measure relevance directly via Shannon's mutual information, by defining the relevance of the compressed representation c as the mutual information I(c, y) with the relevant variable y.
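As a concrete illustration (not from the slides), a minimal numpy sketch of computing the mutual information of a discrete joint distribution; the array p_cy and its values are assumed purely for illustration:

import numpy as np

def mutual_information(p_joint):
    """I(C; Y) in bits for a discrete joint distribution p(c, y)."""
    p_joint = p_joint / p_joint.sum()            # normalize, just in case
    p_c = p_joint.sum(axis=1, keepdims=True)     # marginal p(c)
    p_y = p_joint.sum(axis=0, keepdims=True)     # marginal p(y)
    mask = p_joint > 0                            # avoid log(0)
    ratio = p_joint[mask] / (p_c @ p_y)[mask]
    return float(np.sum(p_joint[mask] * np.log2(ratio)))

# toy joint distribution p(c, y): rows = c, columns = y
p_cy = np.array([[0.30, 0.05],
                 [0.10, 0.55]])
print(mutual_information(p_cy))   # > 0: c carries information about y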
Solution: p(c|x) = [p(c) / Z(x, β)] exp[−β d(x, x_c)]
x_c from (d/dx_c) ⟨d(x, x_c)⟩_p(x|c) = 0 (centroid condition)
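A minimal sketch of evaluating this solution on a finite sample, assuming squared-error distortion as a stand-in for d(x, x_c); the data, codebook, and prior below are illustrative only:

import numpy as np

def soft_assignments(x, codebook, p_c, beta):
    """p(c|x) = p(c) exp(-beta d(x, x_c)) / Z(x, beta), with squared-error d."""
    d = 0.5 * (x[:, None] - codebook[None, :]) ** 2   # d(x, x_c), shape (N, K)
    unnorm = p_c[None, :] * np.exp(-beta * d)          # p(c) exp(-beta d)
    Z = unnorm.sum(axis=1, keepdims=True)              # partition function Z(x, beta)
    return unnorm / Z

x = np.array([0.1, 0.2, 1.9, 2.1])       # a few scalar data points
codebook = np.array([0.0, 2.0])          # codevectors x_c, K = 2
p_c = np.array([0.5, 0.5])               # prior over codevectors
print(soft_assignments(x, codebook, p_c, beta=5.0))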
Rate distortion curve
Family of optimal solutions, one for each value of the Lagrange multiplier β. This parameter controls the trade-off between compression and fidelity.
p(c|x) = [p(c) / Z(x, β)] exp[−β d(x, x_c)]
Evaluate the objective function at the optimum for each value of β and plot I versus D.
=> Rate-distortion curve.
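A minimal sketch of tracing the curve numerically, assuming squared-error distortion (so the centroid condition becomes a weighted mean, as noted in the Remarks below), uniform empirical weights p(x) = 1/N, and illustrative toy data:

import numpy as np

def rd_point(x, K, beta, n_iter=200):
    """Iterate the self-consistent equations at fixed beta; return (<d>, I(x;c))."""
    N = len(x)
    p_x = np.full(N, 1.0 / N)                                  # empirical p(x), uniform weights
    xc = np.quantile(x, np.linspace(0.1, 0.9, K))              # spread-out initial codevectors
    p_c = np.full(K, 1.0 / K)                                  # initial prior p(c)
    for _ in range(n_iter):
        d = 0.5 * (x[:, None] - xc[None, :]) ** 2              # d(x, x_c), shape (N, K)
        # p(c|x) ∝ p(c) exp(-beta d); subtracting the row minimum of d is for stability only
        p_c_given_x = p_c[None, :] * np.exp(-beta * (d - d.min(axis=1, keepdims=True)))
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
        p_c = p_x @ p_c_given_x                                 # p(c) = sum_x p(x) p(c|x)
        joint = p_x[:, None] * p_c_given_x                      # p(x, c)
        xc = (joint * x[:, None]).sum(axis=0) / np.maximum(p_c, 1e-12)  # centroid <x>_p(x|c)
    D = float((joint * d).sum())                                # expected distortion <d>
    ratio = np.where(joint > 0, joint / (p_x[:, None] * p_c[None, :]), 1.0)
    I = float((joint * np.log2(ratio)).sum())                   # I(x; c) in bits
    return D, I

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.3, 50), rng.normal(2.0, 0.3, 50)])
for beta in np.geomspace(0.1, 50.0, 10):
    D, I = rd_point(x, K=2, beta=beta)
    print(f"beta = {beta:6.2f}   D = {D:.3f}   I = {I:.3f} bits")
# plotting I versus D for these points traces out the rate-distortion curve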
Remarks
For squared-error distortion, d(x, x_c) = (x − x_c)² / 2, the centroid condition reduces to
x_c = ⟨x⟩_p(x|c)
since d/dx_c ⟨(x − x_c)² / 2⟩_p(x|c) = x_c − ⟨x⟩_p(x|c).
p(c|x) = [p(c) / Z(x, β)] exp[−β d(x, x_c)]
Remarks
In the zero-temperature limit, β → ∞, we have deterministic (or “hard”) assignments, because p(c|x) puts all its mass on the cluster c that minimizes d(x, x_c).
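A small illustrative sketch (assumed toy numbers) of this limit: as β grows, the soft assignments approach the one-hot argmin assignments of ordinary hard K-means:

import numpy as np

x = np.array([0.1, 0.9, 2.0])                      # toy data points
codebook = np.array([0.0, 2.0])                    # codevectors x_c
p_c = np.array([0.5, 0.5])                         # prior p(c)
d = 0.5 * (x[:, None] - codebook[None, :]) ** 2    # squared-error distortion

for beta in (1.0, 10.0, 1000.0):
    # subtracting the row minimum of d before exponentiating does not change p(c|x)
    p = p_c * np.exp(-beta * (d - d.min(axis=1, keepdims=True)))
    p /= p.sum(axis=1, keepdims=True)
    print(beta, np.round(p, 3))                    # rows approach one-hot vectors

print(np.argmin(d, axis=1))                        # the hard (K-means) assignment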
[Figure: rate-distortion plane, bit rate versus distortion, showing the feasible and infeasible regions and curves for K = 2, K = 3, etc.]
How to choose the distortion function?
Example: Speech
Intelligibility criterion?
Constrained optimization
max_{p(c|x)} [I(y, c) − λ I(x, c)]
Constrained optimization
max_{p(c|x)} [I(y, c) − λ I(x, c)]
[Figure: I(c, x), K = 2.]
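For Part II, a minimal sketch of the self-consistent updates of Tishby, Pereira, and Bialek (1999) for this variational problem, assuming discrete x and y with a given joint distribution p(x, y); β here plays the role of 1/λ in the objective above, and the toy joint below is illustrative only:

import numpy as np

def ib_updates(p_xy, K, beta, n_iter=300, seed=0):
    """Iterate the information bottleneck self-consistent equations."""
    rng = np.random.default_rng(seed)
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1)                           # p(x)
    p_y_given_x = p_xy / p_x[:, None]                # p(y|x)
    p_c_given_x = rng.random((len(p_x), K))          # random soft initialization of p(c|x)
    p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_c = p_x @ p_c_given_x                      # p(c) = sum_x p(x) p(c|x)
        # p(y|c) = sum_x p(x|c) p(y|x)
        p_y_given_c = (p_c_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_c /= np.maximum(p_c[:, None], 1e-12)
        # KL[p(y|x) || p(y|c)] for every (x, c) pair
        ratio = np.where(p_y_given_x[:, None, :] > 0,
                         p_y_given_x[:, None, :] / np.maximum(p_y_given_c[None, :, :], 1e-12),
                         1.0)
        kl = (p_y_given_x[:, None, :] * np.log(ratio)).sum(axis=2)
        # p(c|x) ∝ p(c) exp(-beta KL), normalized over c (max subtracted for stability)
        logits = np.log(np.maximum(p_c[None, :], 1e-12)) - beta * kl
        logits -= logits.max(axis=1, keepdims=True)
        p_c_given_x = np.exp(logits)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    return p_c_given_x

# toy joint p(x, y): 4 x-values, 2 y-values
p_xy = np.array([[0.20, 0.02],
                 [0.18, 0.05],
                 [0.03, 0.22],
                 [0.05, 0.25]])
print(np.round(ib_updates(p_xy, K=2, beta=5.0), 3))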
Homework