The document discusses using information-theory concepts like rate distortion theory and the information bottleneck method for lossy data compression and machine learning tasks like clustering. It describes how to formulate clustering as an optimization problem that trades off compression rate and distortion, and how the information bottleneck method measures relevance directly through mutual information rather than requiring an ad hoc distortion function. Soft K-means clustering is presented as an example algorithm that solves this optimization problem.

Information theory and machine learning

I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means.
II: Information Bottleneck Method.
Lossy Compression

Summarize data by keeping only relevant information and throwing away irrelevant information.

Need a measure for information -> Shannon's mutual information.

Need a notion of relevance.

Relevance (1)

Shannon (1948): Define a function that measures the distortion between the original signal and its compressed representation.

Note: This is related to the similarity measure in unsupervised learning / cluster analysis.
Distortion function

Degree of freedom: which function to use is up to the experimenter.

It is not always obvious what function should be used, especially if the data do not live in a metric space, and so there is no "natural" measure.

Example: Speech.
Relevance (2)

Tishby, Pereira, Bialek (1999): Measure relevance directly via Shannon's mutual information by defining:

Relevant information = the information about a variable of interest.

Then there is no need to define a distortion function ad hoc, and the appropriate similarity measure arises naturally.
Learning and lossy data compression

When we build a model of a data set, we map observations to a representation that summarizes the data in an efficient way.

Example: K-means. Map N data points to K clusters with centroids c. If K << N, then we get a substantial compression: log(K) << log(N).

Recall: Entropy H[X] ~ log(N).

How "efficient" is the representation?

Entropy can be used to measure the compactness of the model. Sometimes called "statistical complexity" (J. P. Crutchfield).

This is a good measure of complexity if we are searching for a deterministic map (in clustering called a "hard partition").

In general, we may search over probabilistic assignments (clustering: "soft partition").
Trade-off between compression and accuracy

Think about a continuous variable.

To describe it exactly, you need infinitely many bits.

But finitely many bits suffice to describe it up to some accuracy.
Rate distortion theory used for clustering

Find assignments p(c|x) from data x ∈ X to clusters c = 1, ..., K, and find class representatives x_c ("cluster centers"), such that the average distortion

$D = \langle d(x, x_c) \rangle$

is small.

Compress the data by summarizing it efficiently in terms of bit cost: minimize the coding rate, the information which the clusters retain about the raw data,

$I(x, c) = \left\langle \log \frac{p(x, c)}{p(x)\, p(c)} \right\rangle$
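To make the two competing quantities concrete, here is a minimal sketch (not from the slides) that computes the coding rate I(x, c) and the average distortion ⟨d(x, x_c)⟩ for given soft assignments, assuming discrete data with empirical weights p(x) = 1/N and a squared-error distortion; the function and variable names are ours.

```python
import numpy as np

def rate_and_distortion(X, p_c_given_x, centroids):
    """Coding rate I(x, c) in bits and average distortion <d(x, x_c)>.

    X           : (N, d) data points, each carrying empirical weight p(x) = 1/N
    p_c_given_x : (N, K) soft assignments, rows sum to 1
    centroids   : (K, d) class representatives x_c
    """
    N, K = p_c_given_x.shape
    p_x = np.full(N, 1.0 / N)                # empirical p(x)
    p_xc = p_c_given_x * p_x[:, None]        # joint p(x, c)
    p_c = p_xc.sum(axis=0)                   # marginal p(c)

    # I(x, c) = sum_{x,c} p(x,c) log2[ p(x,c) / (p(x) p(c)) ]
    mask = p_xc > 0
    denom = p_x[:, None] * p_c[None, :]
    I = np.sum(p_xc[mask] * np.log2(p_xc[mask] / denom[mask]))

    # <d(x, x_c)> with squared-error distortion d = ||x - x_c||^2 / 2
    d = 0.5 * ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    D = np.sum(p_xc * d)
    return I, D
```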
Constrained optimization

$\min_{p(c|x),\, x_c} \left[ I(x, c) + \beta \langle d(x, x_c) \rangle \right]$

Solution:

$p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$

with the x_c obtained from

$\left\langle \frac{d}{dx_c}\, d(x, x_c) \right\rangle_{p(x|c)} = 0$

(centroid condition).
Rate distortion curve

Family of optimal solutions, one for each value of the Lagrange multiplier beta. This parameter controls the trade-off between compression and fidelity.

$p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$

Evaluate the objective function at the optimum for each value of beta and plot I vs. D.

=> Rate-distortion curve.
Remarks

For squared-error distortion, $d(x, x_c) = (x - x_c)^2 / 2$, the centroid condition reduces to

$x_c = \langle x \rangle_{p(x|c)}$

"Soft K-means" because assignments can be probabilistic (fuzzy):

$p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$
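A quick, self-contained illustration of this reduction (toy numbers of our own choosing): with squared-error distortion the centroid is simply the p(x|c)-weighted mean of the data.

```python
import numpy as np

# Hypothetical toy example: N = 5 one-dimensional points, K = 2 clusters.
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0]])
p_c_given_x = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                        [0.1, 0.9], [0.05, 0.95]])   # soft assignments p(c|x)

p_x = np.full(len(X), 1.0 / len(X))        # empirical p(x)
p_xc = p_c_given_x * p_x[:, None]          # joint p(x, c)
p_c = p_xc.sum(axis=0)                     # marginal p(c)
p_x_given_c = p_xc / p_c[None, :]          # columns sum to 1

# Centroid condition for d = (x - x_c)^2 / 2:  x_c = <x>_{p(x|c)}
centroids = p_x_given_c.T @ X
print(centroids)   # each centroid is the probability-weighted mean of its cluster
```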
Remarks

In the zero-temperature limit, β → ∞, we have deterministic (or "hard") assignments, because with

$c^* := \arg\min_c d(x, x_c), \qquad \Delta(x, c) := d(x, x_c) - d(x, x_{c^*}) > 0,$

we get

$p(c^*|x) = \frac{\exp[-\beta\, d(x, x_{c^*})]}{\sum_c \exp[-\beta\, d(x, x_c)]} = \left[ 1 + \sum_{c \neq c^*} \exp[-\beta\, \Delta(x, c)] \right]^{-1} \to 1$

The analogy to thermodynamics inspired Deterministic Annealing (Rose, 1990).
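The sharpening of the assignments can be seen numerically. A small sketch with hypothetical distortion values and uniform priors p(c):

```python
import numpy as np

def gibbs_assignment(d, beta):
    """p(c|x) proportional to exp(-beta * d(x, x_c)), assuming uniform priors p(c)."""
    w = np.exp(-beta * (d - d.min()))      # subtract d.min() for numerical stability
    return w / w.sum()

d = np.array([0.4, 0.5, 2.0])              # hypothetical distortions to K = 3 centroids
for beta in [0.1, 1.0, 10.0, 100.0]:
    print(beta, gibbs_assignment(d, beta).round(3))
# As beta -> infinity the distribution tends to (1, 0, 0):
# a hard assignment to c* = argmin_c d(x, x_c).
```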
Soft K-means algorithm

Choose a distortion measure.

Fix the "temperature" T to a very large value (corresponds to a small β = 1/T).

Solve iteratively, until convergence:

Assignments: $p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$

Centroids: $\left\langle \frac{d}{dx_c}\, d(x, x_c) \right\rangle_{p(x|c)} = 0$

Lower the temperature, T <- aT, and repeat.

a = "annealing rate", a positive number smaller than 1. (A sketch in code follows below.)

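Below is a minimal Python/NumPy sketch of this loop, assuming squared-error distortion and uniform data weights p(x) = 1/N; the function name, parameter defaults, and the small centroid perturbation (which lets coincident centroids split as the temperature drops) are our own choices, not prescribed by the slides.

```python
import numpy as np

def soft_kmeans(X, K, T0=100.0, T_min=1e-3, a=0.9, n_inner=50, seed=0):
    """Soft K-means with a deterministic-annealing temperature schedule.

    X : (N, d) data;  K : number of clusters;  T0 : initial (large) temperature;
    a : annealing rate, 0 < a < 1.  Returns soft assignments p(c|x) and centroids x_c.
    """
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    centroids = X[rng.choice(N, size=K, replace=False)]   # initial x_c
    p_c = np.full(K, 1.0 / K)                              # cluster priors p(c)
    T = T0
    while T > T_min:
        beta = 1.0 / T
        for _ in range(n_inner):
            # squared-error distortion d(x, x_c) = ||x - x_c||^2 / 2
            d = 0.5 * ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
            # assignments: p(c|x) = p(c) exp(-beta d) / Z(x, beta)
            logits = np.log(p_c)[None, :] - beta * d
            logits -= logits.max(axis=1, keepdims=True)    # numerical stability
            p_c_given_x = np.exp(logits)
            p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
            # centroid condition for squared error: x_c = <x>_{p(x|c)}
            p_xc = p_c_given_x / N                         # joint, with p(x) = 1/N
            p_c = np.maximum(p_xc.sum(axis=0), 1e-12)      # marginal p(c)
            centroids = (p_xc.T @ X) / p_c[:, None]
        # tiny perturbation so that coincident centroids can split as T decreases
        centroids = centroids + 1e-4 * rng.standard_normal(centroids.shape)
        T *= a                                             # lower the temperature
    return p_c_given_x, centroids
```

At large T all centroids sit near the data mean and the assignments are nearly uniform; as T is lowered they split into progressively finer clusters, and recording (I, D) along the way approximately traces out the rate-distortion curve sketched on the next slide.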
Rate-distortion curve

[Figure: rate-distortion curve — bit rate I plotted against distortion D; the curve separates the feasible from the infeasible region, with branches for K = 2, K = 3, etc.]
How to choose the distortion function?

Example: Speech.

Cluster speech such that signals which encode the same word will group together.

It may be extremely difficult to define a distortion function which achieves this.

An intelligibility criterion? It should measure how well the meaning is preserved.
Information Bottleneck Method

Tishby, Pereira, Bialek, 1999

Instead of guessing a distortion function, define relevant information as the information the data carries about a quantity of interest (example: phonemes or words).

The data is compressed such that the relevant information is maximally preserved.

[Diagram: X -> C -> Y, with min I(x, c) on the compression side and max I(c, y) on the relevance side.]
Constrained optimization

$\max_{p(c|x)} \left[ I(y, c) - \lambda\, I(x, c) \right]$

Optimal assignment rule:

$p(c|x) = \frac{p(c)}{Z(x, \lambda)} \exp\left[ -\frac{1}{\lambda}\, D_{\mathrm{KL}}[\, p(y|x) \,\|\, p(y|c) \,] \right]$

The Kullback-Leibler divergence emerges as the distortion function:

$D_{\mathrm{KL}}[\, p(y|x) \,\|\, p(y|c) \,] = \sum_y p(y|x) \log_2 \frac{p(y|x)}{p(y|c)}$
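A minimal sketch of the resulting self-consistent iterations for a discrete joint distribution p(x, y), alternating between the assignment rule above and the induced p(c) and p(y|c); the function and variable names are ours, and natural logarithms are used inside the update (the slides define D_KL in bits, which only rescales λ).

```python
import numpy as np

def information_bottleneck(p_xy, K, lam=0.1, n_iter=200, seed=0):
    """Iterate the IB self-consistent equations for a discrete joint p(x, y).

    p_xy : (Nx, Ny) joint distribution, entries sum to 1
    K    : number of clusters c;  lam : trade-off parameter (smaller -> more detail kept)
    """
    rng = np.random.default_rng(seed)
    Nx, Ny = p_xy.shape
    eps = 1e-12
    p_x = p_xy.sum(axis=1)                               # marginal p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)            # rows sum to 1

    p_c_given_x = rng.random((Nx, K))                    # random soft initialisation
    p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_c = p_c_given_x.T @ p_x                        # p(c) = sum_x p(x) p(c|x)
        # p(y|c) = sum_x p(c|x) p(x) p(y|x) / p(c)
        p_y_given_c = (p_c_given_x * p_x[:, None]).T @ p_y_given_x / (p_c[:, None] + eps)
        # D_KL[p(y|x) || p(y|c)] for every pair (x, c)
        log_ratio = np.log((p_y_given_x[:, None, :] + eps) / (p_y_given_c[None, :, :] + eps))
        dkl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=-1)     # shape (Nx, K)
        # assignment rule: p(c|x) = p(c) exp(-D_KL / lam) / Z(x, lam)
        logits = np.log(p_c + eps)[None, :] - dkl / lam
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        p_c_given_x = np.exp(logits)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    return p_c_given_x, p_y_given_c
```

Sweeping λ and recording I(x, c) and I(c, y) at the converged solutions traces out the curve in the information plane shown on the next slide.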
Information Plane

[Figure: information plane — I(c, y) plotted against I(c, x); optimal solutions marked for K = 2, 3, 4, etc., with the infeasible region above the curve.]
Homework

Implement the Soft K-means algorithm.

Helpful reading (on course website): K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, no. 11, pp. 2210-2239, November 1998.
