Supervised Learning
Massih-Reza Amini
http://ama.liglab.fr/~amini/Cours/ML/ML.html
[Figure: documents labelled Sport or Politics (1/0), together with unlabelled documents marked "?"]
2. Unsupervised Learning:
   - Generative models and the EM algorithm
   - CEM algorithm
3. Semi-supervised Learning:
   - Graphical and Generative models
   - Discriminant models
Program based on Chapters 1, 2, 3 & 5 of [Amini 15]
[email protected] Machine Learning Fundamentals
Organization
- Course format
  - Theoretical courses - 12 weeks (1.5h per week - 3 ECTS)
  - Timetable available at
    https://edt.grenoble-inp.fr/2018-2019/ensimag/etudiant/
Pattern recognition
0. Training set
1. Representation vector
2. Find the separators
3. New examples
Medicine
Approximation - Interpolation
Is it reasonable?
Occam's razor
Basic Hypotheses
Aims
→ How can one do predictions with past data? What are the
hypotheses?
Probabilistic model
Formally
Notations
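Symbol                                               Definition
$\mathcal{X} \subseteq \mathbb{R}^d$                 Input space
$\mathcal{Y}$                                        Output space
$S = (x_i, y_i)_{1 \le i \le m}$                     Training set of size $m$
$\mathcal{D}$                                        Probability distribution generating the data (i.i.d.)
$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$   Instantaneous loss
$\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$  Class of functions
$L(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f(x), y)]$   Generalization error of $f$
$\hat{L}_m(f, S)$ or $\hat{L}(w)$                    Empirical loss of $f$ over $S$
$w$                                                  Parameters of the prediction function
$\mathbb{1}_{\pi}$                                   Indicator function, equal to 1 if $\pi$ is true and 0 otherwise
$\mathfrak{R}_m(\mathcal{F})$                        Rademacher complexity of the class of functions $\mathcal{F}$
$\hat{\mathfrak{R}}_m(\mathcal{F}, S)$               Empirical Rademacher complexity of $\mathcal{F}$ estimated over $S$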
Supervised Learning
- Discriminant models directly find a classification function $f : \mathcal{X} \to \mathcal{Y}$ from a given class of functions $\mathcal{F}$;
- The function found should be the one having the lowest probability of error
\[ L(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f(x), y)] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y)\, d\mathcal{D}(x, y) \]
where $\ell$ is an instantaneous loss defined as
\[ \ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+ \]
The loss function considered in classification is usually the misclassification error:
\[ \forall (x, y);\ \ell(f(x), y) = \mathbb{1}_{f(x) \neq y} \]
where $\mathbb{1}_{\pi}$ is equal to 1 if the predicate $\pi$ is true and 0 otherwise.
\[ \hat{L}_m(f, S) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i) \]
\[ \hat{L}_m(f_S, S) \xrightarrow{P} L(f_S) \]
These two conditions imply that the empirical error $\hat{L}_m(f_S, S)$ of
the prediction function $f_S$ found by the learning algorithm over a
training set $S$ converges in probability to its generalization
error $L(f_S)$ and to $\inf_{g \in \mathcal{F}} L(g)$:
[Figure: true risk and empirical risk]
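The empirical risk under the misclassification loss can be sketched in a few lines of C. This is only an illustration: the function name `empirical_risk` and the convention that `pred[i]` already holds $f(x_i)$ are assumptions made for the example, not part of the course material.

```c
#include <stddef.h>

/* Empirical 0/1 risk: fraction of the m examples whose predicted label
   pred[i] = f(x_i) differs from the true label y[i].
   The name and the zero-indexed arrays are illustrative assumptions. */
double empirical_risk(const int *pred, const int *y, size_t m)
{
    size_t i, errors = 0;

    if (m == 0)
        return 0.0;                 /* empty sample: nothing to average */
    for (i = 0; i < m; i++)
        if (pred[i] != y[i])        /* indicator 1_{f(x_i) != y_i} */
            errors++;
    return (double)errors / (double)m;
}
```

With two mistakes out of four examples, the function returns 0.5.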
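[Figure: formal neuron with inputs up to $x_d$ and synaptic weights up to $w_d$; distance $|h_w(x)| / \|\bar{w}\|$ between a point $x$ and its projection $x_p$ on the separating hyperplane; perceptron update from $w^{(t)}$ to $w^{(t+1)}$ on the misclassified example $(x_3, -1)$, among the examples $(x_1, +1)$, $(x_2, +1)$, $(x_3, -1)$, $(x_4, -1)$]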
Perceptron (algorithm)
Perceptron (convergence)
[Novikoff, 1962] showed that
q and, w̄(0) = 0, η = 1,
Homework
1. We suppose that all the examples in the training set lie within a
hypersphere of radius $R$ (i.e. $\forall x_i \in S, \|x_i\| \le R$). Further, we initialise
the weight vector to the null vector (i.e. $w^{(0)} = 0$) and set the
learning rate to $\eta = 1$. Show that after $k$ updates, the norm of the
current weight vector satisfies:
\[ \|w^{(k)}\|^2 \le k \times R^2 \qquad (1) \]
Hint: you can consider $\|w^{(k)}\|^2$ as $\|w^{(k)} - w^{(0)}\|^2$.
2. Under the same conditions as in the previous question, show that
after $k$ updates of the weight vector we have
\[ \left\langle \frac{w^*}{\|w^*\|}, w^{(k)} \right\rangle \ge k \times \rho \qquad (2) \]
3. Deduce from equations (1) and (2) that the number of iterations $k$ is
bounded by
\[ k \le \left\lfloor \left( \frac{R}{\rho} \right)^2 \right\rfloor \]
where $\lfloor x \rfloor$ represents the floor function (this result is due to Novikoff,
1962).
Perceptron Program
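#include <stdlib.h>   /* rand() */
#include "defs.h"

/* Stochastic perceptron. Arrays are 1-indexed, as in the original program:
   X[1..m][1..d] holds the examples, Y[1..m] the labels in {-1,+1} and
   w[0..d] the weights, w[0] being the bias. */
void perceptron(double **X, double *Y, double *w, long int m, long int d,
                double eta, long int T)
{
  long int i, j, t=0;
  double ProdScal;

  // Initialisation of the weight vector
  for(j=0; j<=d; j++)
    w[j]=0.0;

  while(t<T)
  {
    // Draw a training example at random
    i=(rand()%m) + 1;
    // Compute h_w(x_i) = w_0 + <w, x_i>
    for(ProdScal=w[0], j=1; j<=d; j++)
      ProdScal+=w[j]*X[i][j];
    // Update the weights only if x_i is misclassified
    if(Y[i]*ProdScal <= 0.0){
      w[0]+=eta*Y[i];
      for(j=1; j<=d; j++)
        w[j]+=eta*Y[i]*X[i][j];
    }
    t++;
  }
}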
source: http://ama.liglab.fr/~amini/Perceptron/
\[ \hat{L}(w) = \frac{1}{m} \sum_{i=1}^{m} (y_i - h_w(x_i))^2 \]
Adaline
- ADAptive LInear NEuron
- Linear prediction function:
\[ h_w : \mathcal{X} \to \mathbb{R}, \qquad x \mapsto \langle \bar{w}, x \rangle + w_0 \]
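A minimal sketch of how Adaline could be trained by stochastic gradient descent on the squared loss. The update rule $w \leftarrow w + \eta (y - h_w(x)) x$ is the usual Widrow-Hoff rule and is an assumption here, since it is not spelled out above; the factor 2 of the gradient is absorbed into $\eta$. The function names and the zero-indexed input vector are also illustrative choices.

```c
#include <stddef.h>

/* Linear prediction h_w(x) = w[0] + <w[1..d], x>; x is zero-indexed,
   w[0] is the bias w_0 (illustrative layout). */
double adaline_predict(const double *w, const double *x, size_t d)
{
    double h = w[0];
    size_t j;

    for (j = 0; j < d; j++)
        h += w[j + 1] * x[j];
    return h;
}

/* One stochastic gradient step on (y - h_w(x))^2 for the example (x, y).
   The gradient with respect to w is -2 (y - h_w(x)) x; the factor 2 is
   folded into the learning rate eta. */
void adaline_update(double *w, const double *x, double y, size_t d, double eta)
{
    double err = y - adaline_predict(w, x, d);
    size_t j;

    w[0] += eta * err;               /* bias sees the constant input x_0 = 1 */
    for (j = 0; j < d; j++)
        w[j + 1] += eta * err * x[j];
}
```

Unlike the perceptron, the update is applied to every example, including correctly classified ones, because it depends on the real-valued output rather than on its sign.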
Formal models
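[Figure: formal neuron with inputs $x_0 = 1, x_1, x_2, \ldots, x_d$, weighted by $w_0, w_1, w_2, \ldots, w_d$ and summed ($\Sigma$) to give $h_w(x)$]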
Perceptron vs Adaline
which marked the 1st winter of NN; and the genesis of research
in ML.
\[ L(\Theta) = \ln \prod_{i=1}^{m} P(x_i \mid \Theta) \]
\[ (2\pi)^{d/2} \, |\Sigma_k|^{1/2} \]
\[ \ln \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \langle \bar{w}, x \rangle + w_0 \]
Logistic regression
- Logistic regression has been proposed to model the
posterior probability of classes via logistic functions.
\[ P(y = 1 \mid x) = \frac{1}{1 + e^{-\langle \bar{w}, x \rangle - w_0}} = g_w(x) \]
\[ P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = \frac{1}{1 + e^{\langle \bar{w}, x \rangle + w_0}} = 1 - g_w(x) \]
\[ P(y \mid x) = (g_w(x))^{y} (1 - g_w(x))^{1-y} \]
[Figure: the logistic function 1/(1+exp(-<w,x>)) plotted against <w,x> for values between -6 and 6]
Logistic regression
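- For
\[ g : \mathbb{R} \to \, ]0, 1[, \qquad x \mapsto \frac{1}{1 + e^{-x}} \]
we have
\[ g'(x) = \frac{\partial g}{\partial x} = g(x)(1 - g(x)) \]
\[ \hat{L}(w) = \frac{1}{m} \sum_{i=1}^{m} \ln\left(1 + e^{-y_i h_w(x_i)}\right) \]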
AdaBoost, algorithm
Algorithm 3 The algorithm of Boosting
1: Training set $S = \{(x_i, y_i) \mid i \in \{1, \ldots, m\}\}$
2: Initialize the distribution over examples: $\forall i \in \{1, \ldots, m\}, D_1(i) = \frac{1}{m}$
3: $T$, the maximum number of iterations (or classifiers to be combined)
4: for each $t = 1, \ldots, T$ do
5:   Train a weak classifier $f_t : \mathcal{X} \to \{-1, +1\}$ by using the distribution $D_t$
6:   Set $\epsilon_t = \sum_{i : f_t(x_i) \neq y_i} D_t(i)$
7:   Choose $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
8:   Update the distribution of weights
     \[ \forall i \in \{1, \ldots, m\}, \quad D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i f_t(x_i)}}{Z_t} \]
     where
     \[ Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i f_t(x_i)} \]
source: http://ama.liglab.fr/~amini/RankBoost/
- Draw at random an index $U \in \{1, \ldots, m\}$ and a real value
$V \in [0, \max_{i \in \{1, \ldots, m\}} D_t(i)]$; if $D_t(U) > V$ then accept the
example $(x_U, y_U)$, otherwise draw again.
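The rejection sampling step above can be sketched as follows. The function name `sample_index`, the zero-based indices and the use of `rand()` (which has a slight modulo bias) are assumptions made to keep the example short.

```c
#include <stdlib.h>
#include <stddef.h>

/* Draws an index according to the distribution D[0..m-1] by rejection
   sampling. Precondition: m >= 1 and at least one D[i] > 0.
   Illustrative sketch only; rand() % m is slightly biased. */
size_t sample_index(const double *D, size_t m)
{
    double dmax = 0.0;
    size_t i;

    for (i = 0; i < m; i++)             /* max_i D_t(i) */
        if (D[i] > dmax)
            dmax = D[i];

    for (;;) {
        size_t U = (size_t)rand() % m;                          /* candidate index */
        double V = dmax * ((double)rand() / (double)RAND_MAX);  /* V in [0, max D] */
        if (D[U] > V)                   /* accept with probability D[U] / max D */
            return U;
    }
}
```

Examples with a large weight $D_t(i)$ are accepted more often, so a weak learner that cannot handle weights directly can be trained on the resampled set instead.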
Homework
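1. If we denote by $\forall x, H(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$ and
$F(x) = \mathrm{sign}(H(x))$, show that
\[ \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{y_i \neq F(x_i)} \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} \]
2. Deduce that
\[ \frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} = Z_1 \sum_{i=1}^{m} D_2(i) \prod_{t>1} e^{-y_i \alpha_t f_t(x_i)} \]
and
\[ \frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} = \prod_{t=1}^{T} Z_t \qquad (4) \]
3. The minimization of (4) is carried out by minimizing each
of its terms. Using the definition of $\epsilon_t$, show that:
\[ \forall t, \quad Z_t = \epsilon_t e^{\alpha_t} + (1 - \epsilon_t) e^{-\alpha_t} \]
4. Further show that the minimum of the normalisation term
with respect to the combination weight $\alpha_t$ is reached for
\[ \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} \]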
Property
Property (2)
- Every vector $w - w^*$ can be uniquely decomposed in
this basis
\[ w - w^* = \sum_{i=1}^{d} q_i v_i \]
- That is to say
\[ \hat{L}(w) = \hat{L}(w^*) + \frac{1}{2} \sum_{i=1}^{d} \lambda_i q_i^2 \]
- Furthermore, the Hessian matrix is positive definite,
because of the definition of the global minimum
\[ (w - w^*)^\top H (w - w^*) = \sum_{i=1}^{d} \lambda_i q_i^2 = 2(\hat{L}(w) - \hat{L}(w^*)) \ge 0 \]
Property (3)
- So
\[ \forall i \in \{1, \ldots, d\}, \quad q_i^{(t)} = (1 - \eta \lambda_i)^t \, q_i^{(0)} \]
and the algorithm converges if
\[ \eta < \frac{1}{2 \lambda_{\max}} \]
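The geometric decay of the coordinates $q_i$ can be checked numerically with a few lines of C. The sketch assumes the gradient descent update $w^{(t+1)} = w^{(t)} - \eta \nabla \hat{L}(w^{(t)})$ on the quadratic form above, so that in the eigenbasis each coordinate is updated as $q_i \leftarrow q_i - \eta \lambda_i q_i$; the function name `gd_eigenbasis` is illustrative.

```c
#include <stddef.h>

/* Gradient descent on L(w) = L(w*) + 1/2 sum_i lambda_i q_i^2, written in
   the eigenbasis of the Hessian: the gradient along v_i is lambda_i q_i,
   so each step multiplies q_i by (1 - eta * lambda_i). */
void gd_eigenbasis(double *q, const double *lambda, size_t d,
                   double eta, unsigned steps)
{
    unsigned t;
    size_t i;

    for (t = 0; t < steps; t++)
        for (i = 0; i < d; i++)
            q[i] -= eta * lambda[i] * q[i];
}
```

For example, with eigenvalues 1 and 2 and $\eta = 0.125 < 1/(2\lambda_{\max}) = 0.25$, three steps starting from $q = (1, 1)$ give $(0.875^3, 0.75^3)$, in agreement with the closed form.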
- Update
\[ w^{(t+1)} \leftarrow w^{(t)} + \eta_t p_t \]
where $\eta_t$ is a positive learning rate making $w^{(t+1)}$
acceptable for the next iteration.
Wolfe conditions
Armijo condition
\[ \forall t \in \mathbb{N}^*, \quad p_t^\top \nabla \hat{L}(w^{(t)} + \eta_t p_t) \ge \beta \, p_t^\top \nabla \hat{L}(w^{(t)}) \]
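Armijo condition
So
\[ \hat{L}(w^{(t)} + \hat{\eta}_t p_t) = \hat{L}(w^{(t)}) + \alpha \hat{\eta}_t \, p_t^\top \nabla \hat{L}(w^{(t)}) \]
4. We finally get
\[ p_t^\top \nabla \hat{L}(w^{(t)} + \hat{\eta}_t p_t) \ge \alpha \, p_t^\top \nabla \hat{L}(w^{(t)}) \ge \beta \, p_t^\top \nabla \hat{L}(w^{(t)}) \]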
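In practice a step satisfying the Armijo condition is often found by backtracking. The sketch below assumes the usual sufficient-decrease form $\hat{L}(w^{(t)} + \eta p_t) \le \hat{L}(w^{(t)}) + \alpha \eta \, p_t^\top \nabla \hat{L}(w^{(t)})$ and simply halves $\eta$ until it holds; it does not check the curvature condition with $\beta$. The names `armijo_step`, `objective_fn` and `half_squared_norm`, the halving factor and the cap of 50 trials are all assumptions of this example.

```c
#include <stddef.h>

typedef double (*objective_fn)(const double *w, size_t d);

/* Example objective used only for illustration: L(w) = 1/2 ||w||^2. */
double half_squared_norm(const double *w, size_t d)
{
    double s = 0.0;
    size_t i;

    for (i = 0; i < d; i++)
        s += w[i] * w[i];
    return 0.5 * s;
}

/* Backtracking search: starts from eta0 and halves eta until
   L(w + eta p) <= L(w) + alpha * eta * p^T grad.
   trial is a caller-provided workspace of size d.
   Returns the accepted eta, or 0.0 if none is found in 50 trials. */
double armijo_step(objective_fn L, const double *w, const double *p,
                   const double *grad, size_t d, double alpha, double eta0,
                   double *trial)
{
    const int max_halvings = 50;
    double L0 = L(w, d), slope = 0.0, eta = eta0;
    size_t i;
    int k;

    for (i = 0; i < d; i++)             /* p^T grad, negative for a descent direction */
        slope += p[i] * grad[i];

    for (k = 0; k < max_halvings; k++) {
        for (i = 0; i < d; i++)
            trial[i] = w[i] + eta * p[i];
        if (L(trial, d) <= L0 + alpha * eta * slope)
            return eta;
        eta *= 0.5;
    }
    return 0.0;
}
```

On $L(w) = \frac{1}{2} w^2$ with $w = 2$, $p = -2$, $\alpha = 0.25$ and an initial step of 4, the steps 4 and 2 are rejected and the step 1 is accepted.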
Does it work?
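Theorem (Zoutendijk)
Let $\hat{L} : \mathbb{R}^d \to \mathbb{R}$ be a differentiable objective function, bounded from below,
with a Lipschitz gradient. Let $\mathcal{A}$ be an algorithm
generating $(w^{(t)})_{t \in \mathbb{N}}$ defined by
\[ \forall t \in \mathbb{N}, \quad w^{(t+1)} = w^{(t)} + \eta_t p_t \]
\[ \cos(\theta_t) = \frac{-p_t^\top \nabla \hat{L}(w^{(t)})}{\|\nabla \hat{L}(w^{(t)})\| \times \|p_t\|} \]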
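\[ \forall t, \quad p_t^\top \left( \nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}) \right) \ge (\beta - 1) \left( p_t^\top \nabla \hat{L}(w^{(t)}) \right) \]
\begin{align*}
p_t^\top \left( \nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}) \right) &\le \|\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)})\| \times \|p_t\| \\
&\le L \|w^{(t+1)} - w^{(t)}\| \times \|p_t\| \\
&\le L \eta_t \|p_t\|^2
\end{align*}
\[ \forall t, \quad 0 \le (\beta - 1) \left( p_t^\top \nabla \hat{L}(w^{(t)}) \right) \le L \eta_t \|p_t\|^2 \]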
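4. For $\eta_t \ge \frac{\beta - 1}{L} \, \frac{p_t^\top \nabla \hat{L}(w^{(t)})}{\|p_t\|^2} > 0$ we get from Armijo's condition
\begin{align*}
\hat{L}(w^{(t)}) - \hat{L}(w^{(t+1)}) &\ge \alpha \, \frac{1 - \beta}{L} \, \frac{\left( p_t^\top \nabla \hat{L}(w^{(t)}) \right)^2}{\|p_t\|^2} \\
&\ge \alpha \, \frac{1 - \beta}{L} \cos^2(\theta_t) \, \|\nabla \hat{L}(w^{(t)})\|^2 \ge 0
\end{align*}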
is convergent.
is convergent.
- Hence, the sequence $(\nabla \hat{L}(w^{(t)}))_t$ tends to 0 when $t$ tends to
infinity.
Can we do better?
Conjugate gradient method
- The adaptive search of the learning rate with the line
search algorithm does not prevent the weight vector from
oscillating around the minimiser of the objective
function
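Hence
\[ \forall t, \quad \eta_t = \frac{p_t^\top H (w^* - w^{(0)})}{p_t^\top H p_t} \]
\[ p_t^\top H w^{(t)} = p_t^\top H w^{(0)} \]
That is
\[ \forall t, \quad \eta_t = - \frac{p_t^\top \nabla \hat{L}(w^{(t)})}{p_t^\top H p_t} \qquad (5) \]
\[ \forall t, \quad p_t^\top \left( \nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}) \right) = - p_t^\top \nabla \hat{L}(w^{(t)}) \]
Which gives
\[ \forall t, \quad p_t^\top \nabla \hat{L}(w^{(t+1)}) = 0 \]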
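- For
\[ \forall t, \quad \beta_t = \frac{p_t^\top H \nabla \hat{L}(w^{(t+1)})}{p_t^\top H p_t} \]
it is easy to show that the descent directions are mutually
conjugate.
Followed by others
\[ \forall t, \quad \beta_t = \frac{\nabla^\top \hat{L}(w^{(t+1)}) \nabla \hat{L}(w^{(t+1)})}{\nabla^\top \hat{L}(w^{(t)}) \nabla \hat{L}(w^{(t)})} \qquad \text{(Fletcher and Reeves, 64)} \]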
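A minimal sketch of the conjugate gradient method on a quadratic objective, using the step of Eq. (5) and the coefficient $\beta_t = p_t^\top H \nabla \hat{L}(w^{(t+1)}) / p_t^\top H p_t$. Several points are assumptions of the example: the objective is written $\hat{L}(w) = \frac{1}{2} w^\top H w - b^\top w$ with $H$ symmetric positive definite (so $\nabla \hat{L}(w) = Hw - b$), the first direction is $p_0 = -\nabla \hat{L}(w^{(0)})$, and the direction update is $p_{t+1} = -\nabla \hat{L}(w^{(t+1)}) + \beta_t p_t$. The function names are illustrative.

```c
#include <stdlib.h>
#include <stddef.h>

static double dot(const double *a, const double *b, size_t d)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < d; i++)
        s += a[i] * b[i];
    return s;
}

static void matvec(const double *H, const double *v, double *out, size_t d)
{
    size_t i;
    for (i = 0; i < d; i++)
        out[i] = dot(H + i * d, v, d);   /* row i of the row-major matrix H */
}

/* Minimises 1/2 w^T H w - b^T w, H symmetric positive definite (d x d,
   row-major), starting from w and running at most iters iterations.
   Returns 0 on success, -1 if memory allocation fails. */
int conjugate_gradient(const double *H, const double *b, double *w,
                       size_t d, size_t iters)
{
    double *g = malloc(d * sizeof *g);
    double *p = malloc(d * sizeof *p);
    double *Hp = malloc(d * sizeof *Hp);
    size_t t, i;

    if (!g || !p || !Hp) {
        free(g); free(p); free(Hp);
        return -1;
    }
    matvec(H, w, g, d);
    for (i = 0; i < d; i++) {
        g[i] -= b[i];                    /* gradient H w - b */
        p[i] = -g[i];                    /* first direction: steepest descent */
    }
    for (t = 0; t < iters; t++) {
        double pHp, eta, beta;

        matvec(H, p, Hp, d);
        pHp = dot(p, Hp, d);
        if (pHp <= 0.0)                  /* p = 0: the gradient has vanished */
            break;
        eta = -dot(p, g, d) / pHp;       /* Eq. (5) */
        for (i = 0; i < d; i++) {
            w[i] += eta * p[i];
            g[i] += eta * Hp[i];         /* gradient at w + eta p */
        }
        beta = dot(Hp, g, d) / pHp;      /* p^T H grad / p^T H p, H symmetric */
        for (i = 0; i < d; i++)
            p[i] = -g[i] + beta * p[i];
    }
    free(g); free(p); free(Hp);
    return 0;
}
```

On a $d$-dimensional quadratic the method reaches the minimiser in at most $d$ iterations in exact arithmetic; for instance, with $H = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}$ and $b = (1, 2)$, two iterations give $w \approx (1/11, 7/11)$.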
return(g);
}
double FoncLoss(double *w, double **X, double *y, long int m, long int d)
{
double S=0.0, ps;
long int i, j;
return (S);
}
source: http://ama.liglab.fr/~amini/LR/
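\[ |\Phi(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_m) - \Phi(x_1, \ldots, x_{i-1}, x', x_{i+1}, \ldots, x_m)| \le c_i \]
We then have
\[ \forall \epsilon > 0, \quad P(\Phi(x_1, \ldots, x_m) - \mathbb{E}[\Phi] > \epsilon) \le e^{-2\epsilon^2 / \sum_{i=1}^{m} c_i^2} \]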
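\[ \to \mathbb{E}_S \sup_{f \in \mathcal{F}} \left( L(f) - \hat{L}_m(f, S) \right) = \mathbb{E}_S \sup_{f \in \mathcal{F}} \left[ \mathbb{E}_{S'} \left( \hat{L}_m(f, S') - \hat{L}_m(f, S) \right) \right] \]
\[ \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \left( \ell(f(x'_i), y'_i) - \ell(f(x_i), y_i) \right) \right] \le \cdots \]
\[ \mathbb{E}_S \mathbb{E}_{S'} \sup_{f \in \mathcal{F}} \left[ \hat{L}_m(f, S') - \hat{L}_m(f, S) \right] \le \underbrace{2 \, \mathbb{E}_S \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \ell(f(x_i), y_i)}_{\le \mathfrak{R}_m(\ell \circ \mathcal{F})} \qquad (9) \]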
Thus for $\delta/2 = e^{-m\epsilon^2/2}$, we have with probability at least $1 - \delta/2$:
\[ \mathfrak{R}_m(\ell \circ \mathcal{F}) \le \hat{\mathfrak{R}}_m(\ell \circ \mathcal{F}, S) + 2 \sqrt{\frac{\ln \frac{2}{\delta}}{2m}} \]
From the first point (Eq. 7) of Theorem 2, we also have with probability
at least $1 - \delta/2$:
\[ \forall f \in \mathcal{F}, \forall S, \quad L(f) \le \hat{L}_m(f, S) + \mathfrak{R}_m(\ell \circ \mathcal{F}) + \sqrt{\frac{\ln \frac{2}{\delta}}{2m}} \]
The second point (Eq. 8) of Theorem 2 is then obtained by combining
the two previous results using the union bound.
Regularization
In summary
Homework
http://archive.ics.uci.edu/ml/datasets/Mushroom
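\[ \ell(f(x), y) = \mathbb{1}_{f(x) \neq y} \]
\[ f^* = \operatorname*{argmin}_{f \in \mathcal{F}} \hat{L}_m(f, S) = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i) \]
\[ h : \mathbb{R}^d \to \mathbb{R}^K, \qquad x \mapsto (h(x, 1), \ldots, h(x, K)) \]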
Approaches
- Uncombined approaches
  - K-Nearest Neighbour (K-NN),
  - Generative models,
  - Discriminative models (MLP, M-SVM, M-AdaBoost, etc.)
OvO
OvA
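[Figure: multilayer perceptron with an input layer $x = (x_1, \ldots, x_d)$, one hidden layer of units up to $z_\ell$, and an output layer $y = (y_1, \ldots, y_K)$]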
For the model above, the value of the $j$th unit of the hidden layer for
an input observation $x = (x_i)_{i=1 \ldots d}$ is obtained by
composition:
\[ \forall j \in \{1, \ldots, \ell\}, \quad a_j^{(1)} = \langle w_{j.}^{(1)}, x \rangle + w_{j0}^{(1)} = \sum_{i=0}^{d} w_{ji}^{(1)} x_i \]
- The values of the output units $(h_1, \ldots, h_K)$ are obtained in
the same manner from the vector of the hidden layer
$z_j, j \in \{0, \ldots, \ell\}$ and the weights linking this layer to the
output, $w_{k.}^{(2)} = (w_{kj}^{(2)})_{j=1,\ldots,\ell};\ k \in \{1, \ldots, K\}$,
- the predicted output for an observation $x$ is a composite
transformation of the input, which for the previous model is
\[ \forall x, \forall k \in \{1, \ldots, K\}, \quad h(x, k) = \bar{H}(a_k) = \bar{H}\left( \sum_{j=0}^{\ell} w_{kj}^{(2)} \times \bar{H}\left( \sum_{i=0}^{d} w_{ji}^{(1)} \times x_i \right) \right) \]
\[ \forall (x, y), \quad \ell(h(x), y) = \frac{1}{2} \|h(x) - y\|^2 = \frac{1}{2} \times \sum_{k=1}^{K} (h_k - y_k)^2 \]
where $\frac{\partial a_j}{\partial w_{ji}} = z_i$.
- In the case where the unit $j$ is on the output layer we have
\[ \delta_j = \frac{\partial \ell(h(x), y)}{\partial a_j} = \bar{H}'(a_j) \times (h_j - y_j). \]
where $Af(j)$ is the set of units on the layer that
succeeds the one containing unit $j$.
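[Figure: backpropagation of error through unit $j$. Forward pass: $a_j = \sum_{i \in Be(j)} w_{ji} z_i$ and $z_j = \bar{H}(a_j)$; backward pass: $\delta_j = \bar{H}'(a_j) \sum_{l \in Af(j)} \delta_l w_{lj}$. Figure from Amini (Apprentissage Machine)]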
References
Massih-Reza Amini
Apprentissage Machine de la théorie à la pratique
Éditions Eyrolles, 2015.
Translated into Chinese.
V. Koltchinskii
Rademacher penalties and structural risk minimization
IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar
Foundations of Machine Learning
2012.
A.B. Novikoff
On convergence proofs on perceptrons.
Symposium on the Mathematical Theory of Automata, 12: 615–622, 1962.
F. Rosenblatt
The perceptron: A probabilistic model for information storage and
organization in the brain.
Psychological Review, 65: 386–408, 1958.