The document discusses using information-theory concepts like rate distortion theory and the information bottleneck method for lossy data compression and machine learning tasks like clustering. It describes how to formulate clustering as an optimization problem that trades off compression rate and distortion, and how the information bottleneck method measures relevance directly through mutual information rather than requiring an ad hoc distortion function. Soft K-means clustering is presented as an example algorithm that solves this optimization problem.

Information theory and machine learning

I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means.
II: Information Bottleneck Method.
Lossy Compression

Summarize data by keeping only relevant information and throwing away irrelevant information.

Need a measure for information -> Shannon's mutual information.

Need a notion of relevance.

Relevance (1)

Shannon (1948): Define a function that measures the distortion between the original signal and its compressed representation.

Note: This is related to the similarity measure in unsupervised learning / cluster analysis.
Distortion function

Degree of freedom: which function to use is up to the experimenter.

It is not always obvious what function should be used, especially if the data do not live in a metric space, and so there is no "natural" measure.

Example: Speech.
Relevance (2)

Tishby, Pereira, Bialek (1999): Measure relevance directly via Shannon's mutual information by defining:

Relevant information = the information about a variable of interest.

Then there is no need to define a distortion function ad hoc, and the appropriate similarity measure arises naturally.
Learning and lossy data compression

When we build a model of a data set, we map observations to a representation that summarizes the data in an efficient way.

Example: K-means. Map N data points to K clusters with centroids c. If K << N, then we get a substantial compression: log(K) << log(N).

Recall: Entropy H[X] ~ log(N).

How "efficient" is the representation?

Entropy can be used to measure the compactness of the model. Sometimes called "statistical complexity" (J. P. Crutchfield).

This is a good measure of complexity if we are searching for a deterministic map (in clustering called a "hard partition").

In general, we may search over probabilistic assignments (clustering: "soft partition").
Trade-off between compression and accuracy

Think about a continuous variable.

To describe it exactly, you need infinitely many bits.

But finitely many bits suffice to describe it up to some accuracy.
Rate distortion theory used for clustering

Find assignments p(c|x) from data x ∈ X to clusters c = 1, ..., K, and find class representatives x_c ("cluster centers"), such that the average distortion

$D = \langle d(x, x_c) \rangle$

is small.

Compress the data by summarizing it efficiently in terms of bit cost: minimize the coding rate, the information which the clusters retain about the raw data,

$I(x, c) = \left\langle \log \frac{p(x, c)}{p(x)\, p(c)} \right\rangle$
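To make the two competing quantities concrete, here is a minimal sketch (not from the slides) that computes the coding rate I(x, c) and the average distortion ⟨d(x, x_c)⟩ for given soft assignments, assuming discrete data with empirical weights p(x) = 1/N and a squared-error distortion; the function and variable names are ours.

```python
import numpy as np

def rate_and_distortion(X, p_c_given_x, centroids):
    """Coding rate I(x, c) in bits and average distortion <d(x, x_c)>.

    X           : (N, d) data points, each carrying empirical weight p(x) = 1/N
    p_c_given_x : (N, K) soft assignments, rows sum to 1
    centroids   : (K, d) class representatives x_c
    """
    N, K = p_c_given_x.shape
    p_x = np.full(N, 1.0 / N)                # empirical p(x)
    p_xc = p_c_given_x * p_x[:, None]        # joint p(x, c)
    p_c = p_xc.sum(axis=0)                   # marginal p(c)

    # I(x, c) = sum_{x,c} p(x,c) log2[ p(x,c) / (p(x) p(c)) ]
    mask = p_xc > 0
    denom = p_x[:, None] * p_c[None, :]
    I = np.sum(p_xc[mask] * np.log2(p_xc[mask] / denom[mask]))

    # <d(x, x_c)> with squared-error distortion d = ||x - x_c||^2 / 2
    d = 0.5 * ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    D = np.sum(p_xc * d)
    return I, D
```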
Constrained optimization

$\min_{p(c|x),\, x_c} \left[ I(x, c) + \beta \langle d(x, x_c) \rangle \right]$

Solution:

$p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$

with the x_c obtained from

$\left\langle \frac{d}{dx_c}\, d(x, x_c) \right\rangle_{p(x|c)} = 0$

(centroid condition).
Rate distortion curve

Family of optimal solutions, one for each value of the Lagrange multiplier beta. This parameter controls the trade-off between compression and fidelity.

$p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$

Evaluate the objective function at the optimum for each value of beta and plot I vs. D.

=> Rate-distortion curve.
Remarks

For squared-error distortion, $d(x, x_c) = (x - x_c)^2 / 2$, the centroid condition reduces to

$x_c = \langle x \rangle_{p(x|c)}$

"Soft K-means" because assignments can be probabilistic (fuzzy):

$p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$
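A quick, self-contained illustration of this reduction (toy numbers of our own choosing): with squared-error distortion the centroid is simply the p(x|c)-weighted mean of the data.

```python
import numpy as np

# Hypothetical toy example: N = 5 one-dimensional points, K = 2 clusters.
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0]])
p_c_given_x = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                        [0.1, 0.9], [0.05, 0.95]])   # soft assignments p(c|x)

p_x = np.full(len(X), 1.0 / len(X))        # empirical p(x)
p_xc = p_c_given_x * p_x[:, None]          # joint p(x, c)
p_c = p_xc.sum(axis=0)                     # marginal p(c)
p_x_given_c = p_xc / p_c[None, :]          # columns sum to 1

# Centroid condition for d = (x - x_c)^2 / 2:  x_c = <x>_{p(x|c)}
centroids = p_x_given_c.T @ X
print(centroids)   # each centroid is the probability-weighted mean of its cluster
```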
Remarks

In the zero-temperature limit, β → ∞, we have deterministic (or "hard") assignments, because with

$c^* := \arg\min_c d(x, x_c), \qquad \Delta(x, c) := d(x, x_c) - d(x, x_{c^*}) > 0,$

we get

$p(c^*|x) = \frac{\exp[-\beta\, d(x, x_{c^*})]}{\sum_c \exp[-\beta\, d(x, x_c)]} = \left[ 1 + \sum_{c \neq c^*} \exp[-\beta\, \Delta(x, c)] \right]^{-1} \to 1$

The analogy to thermodynamics inspired Deterministic Annealing (Rose, 1990).
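The sharpening of the assignments can be seen numerically. A small sketch with hypothetical distortion values and uniform priors p(c):

```python
import numpy as np

def gibbs_assignment(d, beta):
    """p(c|x) proportional to exp(-beta * d(x, x_c)), assuming uniform priors p(c)."""
    w = np.exp(-beta * (d - d.min()))      # subtract d.min() for numerical stability
    return w / w.sum()

d = np.array([0.4, 0.5, 2.0])              # hypothetical distortions to K = 3 centroids
for beta in [0.1, 1.0, 10.0, 100.0]:
    print(beta, gibbs_assignment(d, beta).round(3))
# As beta -> infinity the distribution tends to (1, 0, 0):
# a hard assignment to c* = argmin_c d(x, x_c).
```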
Soft K-means algorithm

Choose a distortion measure.

Fix the "temperature" T to a very large value (corresponds to a small β = 1/T).

Solve iteratively, until convergence:

Assignments: $p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$

Centroids: $\left\langle \frac{d}{dx_c}\, d(x, x_c) \right\rangle_{p(x|c)} = 0$

Lower the temperature, T <- aT, and repeat.

a = "annealing rate", a positive number smaller than 1. (A sketch in code follows below.)

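Below is a minimal Python/NumPy sketch of this loop, assuming squared-error distortion and uniform data weights p(x) = 1/N; the function name, parameter defaults, and the small centroid perturbation (which lets coincident centroids split as the temperature drops) are our own choices, not prescribed by the slides.

```python
import numpy as np

def soft_kmeans(X, K, T0=100.0, T_min=1e-3, a=0.9, n_inner=50, seed=0):
    """Soft K-means with a deterministic-annealing temperature schedule.

    X : (N, d) data;  K : number of clusters;  T0 : initial (large) temperature;
    a : annealing rate, 0 < a < 1.  Returns soft assignments p(c|x) and centroids x_c.
    """
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    centroids = X[rng.choice(N, size=K, replace=False)]   # initial x_c
    p_c = np.full(K, 1.0 / K)                              # cluster priors p(c)
    T = T0
    while T > T_min:
        beta = 1.0 / T
        for _ in range(n_inner):
            # squared-error distortion d(x, x_c) = ||x - x_c||^2 / 2
            d = 0.5 * ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
            # assignments: p(c|x) = p(c) exp(-beta d) / Z(x, beta)
            logits = np.log(p_c)[None, :] - beta * d
            logits -= logits.max(axis=1, keepdims=True)    # numerical stability
            p_c_given_x = np.exp(logits)
            p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
            # centroid condition for squared error: x_c = <x>_{p(x|c)}
            p_xc = p_c_given_x / N                         # joint, with p(x) = 1/N
            p_c = np.maximum(p_xc.sum(axis=0), 1e-12)      # marginal p(c)
            centroids = (p_xc.T @ X) / p_c[:, None]
        # tiny perturbation so that coincident centroids can split as T decreases
        centroids = centroids + 1e-4 * rng.standard_normal(centroids.shape)
        T *= a                                             # lower the temperature
    return p_c_given_x, centroids
```

At large T all centroids sit near the data mean and the assignments are nearly uniform; as T is lowered they split into progressively finer clusters, and recording (I, D) along the way approximately traces out the rate-distortion curve sketched on the next slide.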
Rate-distortion curve

[Figure: rate-distortion curve — bit rate I plotted against distortion D; the curve separates the feasible from the infeasible region, with branches for K = 2, K = 3, etc.]
How to choose the distortion function?

Example: Speech.

Cluster speech such that signals which encode the same word will group together.

It may be extremely difficult to define a distortion function which achieves this.

An intelligibility criterion? It should measure how well the meaning is preserved.
Information Bottleneck Method

Tishby, Pereira, Bialek, 1999

Instead of guessing a distortion function, define relevant information as the information the data carries about a quantity of interest (example: phonemes or words).

The data is compressed such that the relevant information is maximally preserved.

[Diagram: X -> C -> Y, with min I(x, c) on the compression side and max I(c, y) on the relevance side.]
Constrained optimization

$\max_{p(c|x)} \left[ I(y, c) - \lambda\, I(x, c) \right]$

Optimal assignment rule:

$p(c|x) = \frac{p(c)}{Z(x, \lambda)} \exp\left[ -\frac{1}{\lambda}\, D_{\mathrm{KL}}[\, p(y|x) \,\|\, p(y|c) \,] \right]$

The Kullback-Leibler divergence emerges as the distortion function:

$D_{\mathrm{KL}}[\, p(y|x) \,\|\, p(y|c) \,] = \sum_y p(y|x) \log_2 \frac{p(y|x)}{p(y|c)}$
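A minimal sketch of the resulting self-consistent iterations for a discrete joint distribution p(x, y), alternating between the assignment rule above and the induced p(c) and p(y|c); the function and variable names are ours, and natural logarithms are used inside the update (the slides define D_KL in bits, which only rescales λ).

```python
import numpy as np

def information_bottleneck(p_xy, K, lam=0.1, n_iter=200, seed=0):
    """Iterate the IB self-consistent equations for a discrete joint p(x, y).

    p_xy : (Nx, Ny) joint distribution, entries sum to 1
    K    : number of clusters c;  lam : trade-off parameter (smaller -> more detail kept)
    """
    rng = np.random.default_rng(seed)
    Nx, Ny = p_xy.shape
    eps = 1e-12
    p_x = p_xy.sum(axis=1)                               # marginal p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)            # rows sum to 1

    p_c_given_x = rng.random((Nx, K))                    # random soft initialisation
    p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_c = p_c_given_x.T @ p_x                        # p(c) = sum_x p(x) p(c|x)
        # p(y|c) = sum_x p(c|x) p(x) p(y|x) / p(c)
        p_y_given_c = (p_c_given_x * p_x[:, None]).T @ p_y_given_x / (p_c[:, None] + eps)
        # D_KL[p(y|x) || p(y|c)] for every pair (x, c)
        log_ratio = np.log((p_y_given_x[:, None, :] + eps) / (p_y_given_c[None, :, :] + eps))
        dkl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=-1)     # shape (Nx, K)
        # assignment rule: p(c|x) = p(c) exp(-D_KL / lam) / Z(x, lam)
        logits = np.log(p_c + eps)[None, :] - dkl / lam
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        p_c_given_x = np.exp(logits)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    return p_c_given_x, p_y_given_c
```

Sweeping λ and recording I(x, c) and I(c, y) at the converged solutions traces out the curve in the information plane shown on the next slide.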
Information Plane

[Figure: information plane — I(c, y) plotted against I(c, x); optimal solutions marked for K = 2, 3, 4, etc., with the infeasible region above the curve.]
Homework

Implement the Soft K-means algorithm.

Helpful reading (on course website): K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, no. 11, pp. 2210-2239, November 1998.
