Supervised Learning

This document provides an introduction to machine learning fundamentals and supervised learning. It discusses what machine learning is, the process of inference involving observing phenomena, constructing models, and making predictions. It also outlines three main machine learning frameworks - supervised learning, unsupervised learning, and semi-supervised learning. The document relates these frameworks to inductive reasoning and provides an overview of topics that will be covered, including supervised learning techniques and applications such as pattern recognition.

Uploaded by

Mohammad Gamal

UE: Machine Learning Fundamentals

Part I : Supervised Learning

Massih-Reza Amini
http://ama.liglab.fr/~amini/Cours/ML/ML.html

Université Grenoble Alpes


Laboratoire d’Informatique de Grenoble
[email protected]
What is Machine Learning?

- Wikipedia: Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed for it!
- Machine learning programs are hence designed to make inference.

[email protected] Machine Learning Fundamentals


Learning and Inference

The process of inference is done in three steps:
1. Observe a phenomenon,
2. Construct a model of the phenomenon,
3. Make predictions.

- These steps are involved in more or less all natural sciences!

  "All that is necessary to reduce the whole of nature to laws similar to those which Newton discovered with the aid of calculus, is to have a sufficient number of observations and a mathematics that is complex enough" (Marquis de Condorcet, 1785)

- The aim of learning is to automate this process,
- The aim of learning theory is to formalize the process.


Three main Frameworks

[Figure: three panels titled Supervised Learning, Unsupervised Learning and Semi-supervised Learning, showing documents labelled "Sport" or "Politics" and unlabelled documents marked "?"]

Related to inductive reasoning

Induction vs. deduction

- Induction is the process of deriving general principles from particular facts or instances.
- Deduction is, on the other hand, the process of reasoning in which a conclusion follows necessarily from the stated premises; it is an inference by reasoning from the general to the specific. This is how mathematicians prove theorems from axioms.


What will you learn here? [1]

1. Supervised Learning:
   - The Empirical Risk Minimization principle
   - Binary models and their link with the ERM principle
   - Unconstrained Convex Optimization
   - Consistency of the ERM principle
   - Multi-class classification

2. Unsupervised Learning:
   - Generative models and the EM algorithm
   - CEM algorithm

3. Semi-supervised Learning:
   - Graphical and Generative models
   - Discriminant models

[1] Program based on Chapters 1, 2, 3 & 5 of [Amini 15]

Organization

- Course format
  - Theoretical courses - 12 weeks (1.5h per week - 3 ECTS)
- Practical information (important dates, timetables, defence schedule, etc.) is available at
  http://mosig.imag.fr
  https://msiam.imag.fr
- Timetable available at
  https://edt.grenoble-inp.fr/2018-2019/ensimag/etudiant/
- Research projects http://im2ag-pcarre.e.ujf-grenoble.fr/


Pattern recognition

If we consider the context of supervised learning for pattern recognition:
- The data consist of pairs of examples (vector representation of an observation, class label),
- Class labels are often $\mathcal{Y} = \{1, \dots, K\}$ with $K$ large (but in the theory of ML we consider the binary classification case $\mathcal{Y} = \{-1, +1\}$),
- The learning algorithm constructs an association between the vector representation of an observation → class label,
- Aim: Make few errors on unseen examples.


Pattern recognition (Example)

IRIS classification, Ronald Fisher (1936)

[Figure: Iris Setosa, Iris Versicolor, Iris Virginica]


Pattern recognition (Example)

- The first step is to formalize the perception of the flowers with relevant common characteristics, which constitute the features of their vector representations.
- This usually requires expert knowledge.


Pattern recognition (Example)

- If observations are from a field of irises, then they become ... [figure omitted]
- Building vectorised observations and their associated labels is generally time-consuming.
- Many studies are now focused on representation learning using deep neural networks.
- Second step: learning then amounts to searching for a function that maps vectorised observations (inputs) to their associated outputs.


Pattern recognition

[Figure: the learning pipeline, with the following labels]
0. Training set
1. Vector representation
2. Find the separators
3. New examples
5. Predict the labels of the new examples


Some popular applications

Machine Learning for:
- Medicine
- Image recognition
- Finance and business


Approximation - Interpolation

It is always possible to construct a function that exactly fits the data.

Is it reasonable?


Occam's razor

Idea: Search for regularities (or repetitions) in the observed phenomenon; generalization is done from past observations to new, future ones ⇒ take the simplest model ...
But how to measure simplicity?
1. Number of constants,
2. Number of parameters,
3. ...


Basic Hypotheses

Two types of hypotheses:
- Past observations are related to the future ones
  → The phenomenon is stationary
- Observations are independently generated from a source
  → Notion of independence


Aims

→ How can one make predictions with past data? What are the hypotheses?

- Give a formal definition of learning, generalization, overfitting,
- Characterize the performance of learning algorithms,
- Construct better algorithms.


Probabilistic model

Relations between the past and future observations.

- Independence: Each new observation provides a maximum of individual information,
- Identically distributed: Observations provide information on the phenomenon which generates the observations.


Formally

We consider an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and an output space $\mathcal{Y}$.

Assumption: Example pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are independently and identically distributed (i.i.d.) with respect to an unknown but fixed probability distribution $\mathcal{D}$.

Samples: We observe a sequence of $m$ pairs of examples $(x_i, y_i)$ generated i.i.d. from $\mathcal{D}$.

Aim: Construct a prediction function $f : \mathcal{X} \to \mathcal{Y}$ which predicts an output $y$ for a given new $x$ with a minimum probability of error.


Notations

Symbol | Definition
$\mathcal{X} \subseteq \mathbb{R}^d$ | Input space
$\mathcal{Y}$ | Output space
$S = (x_i, y_i)_{1 \le i \le m}$ | Training set of size $m$
$\mathcal{D}$ | Probability distribution generating the data (i.i.d.)
$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ | Instantaneous loss
$\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$ | Class of functions
$L(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f(x), y)]$ | Generalization error of $f$
$\hat{L}_m(f, S)$ or $\hat{L}(w)$ | Empirical loss of $f$ over $S$
$w$ | Parameters of the prediction function
$\mathbf{1}_{\pi}$ | Indicator function, equal to 1 if $\pi$ is true and 0 otherwise
$\mathfrak{R}_m(\mathcal{F})$ | Rademacher complexity of the class of functions $\mathcal{F}$
$\hat{\mathfrak{R}}_m(\mathcal{F}, S)$ | Empirical Rademacher complexity of $\mathcal{F}$ estimated over $S$


Supervised Learning

- Discriminant models directly find a classification function $f : \mathcal{X} \to \mathcal{Y}$ from a given class of functions $\mathcal{F}$;
- The function found should be the one having the lowest probability of error
  \[ L(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f(x), y)] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y) \, d\mathcal{D}(x, y) \]
  where $\ell$ is an instantaneous loss defined as
  \[ \ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+ \]

The loss considered in classification is usually the misclassification error:
\[ \forall (x, y); \quad \ell(f(x), y) = \mathbf{1}_{f(x) \neq y} \]
where $\mathbf{1}_{\pi}$ is equal to 1 if the predicate $\pi$ is true and 0 otherwise.

Empirical risk minimization (ERM) principle

- As the probability distribution $\mathcal{D}$ is unknown, the analytic form of the true risk cannot be derived, so the prediction function cannot be found by minimizing $L(f)$ directly.
- Empirical risk minimization (ERM) principle: Find $f$ by minimizing the unbiased estimator of its generalization error $L(f)$ on a given training set $S = (x_i, y_i)_{i=1}^{m}$:
  \[ \hat{L}_m(f, S) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i) \]
- However, without restricting the class of functions, this is not the right way of proceeding (Occam's razor) ...
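The empirical risk above can be computed in a few lines. The following is a minimal sketch, assuming the 0/1 loss and a hypothetical one-dimensional threshold classifier (the names `threshold_classifier`, `empirical_risk` and the parameter `theta` are illustrative and do not come from the course).

```c
#include <stddef.h>

/* 0/1 loss: 1 if the prediction differs from the label, 0 otherwise. */
static double zero_one_loss(int prediction, int label)
{
    return prediction != label ? 1.0 : 0.0;
}

/* Hypothetical classifier for d = 1: predicts +1 if x >= theta, else -1. */
static int threshold_classifier(double x, double theta)
{
    return x >= theta ? +1 : -1;
}

/* Empirical risk of the threshold classifier over m > 0 training pairs:
   the average 0/1 loss on the sample. */
double empirical_risk(const double *x, const int *y, size_t m, double theta)
{
    double sum = 0.0;
    for (size_t i = 0; i < m; i++)
        sum += zero_one_loss(threshold_classifier(x[i], theta), y[i]);
    return sum / (double)m;
}
```

Minimizing this quantity over `theta` would be one instance of the ERM principle restricted to the class of threshold functions.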


ERM principle, problem

Suppose that the input dimension is $d = 1$, let the input space $\mathcal{X}$ be the interval $[a, b] \subset \mathbb{R}$ where $a$ and $b$ are real values such that $a < b$, and suppose that the output space is $\{-1, +1\}$. Moreover, suppose that the distribution $\mathcal{D}$ generating the examples $(x, y)$ is a uniform distribution over $[a, b] \times \{-1\}$. Consider now a learning algorithm which minimizes the empirical risk by choosing a function in the function class $\mathcal{F} = \{f : [a, b] \to \{-1, +1\}\}$ (also denoted as $\mathcal{F} = \{-1, +1\}^{[a,b]}$) in the following way: after reviewing a training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, the algorithm outputs the prediction function $f_S$ such that
\[ f_S(x) = \begin{cases} -1, & \text{if } x \in \{x_1, \dots, x_m\} \\ +1, & \text{otherwise} \end{cases} \]


Consistency of the ERM principle

- For the above problem, the classifier found has an empirical risk equal to 0, and this holds for any given training set. However, as the classifier makes an error over the entire infinite set $[a, b]$ except on a finite training set (of measure zero), its generalization error is always equal to 1.
- So the question is: in which cases is the ERM principle likely to generate a general learning rule?
  ⇒ The answer to this question lies in a statistical notion called consistency.


Consistency of the ERM principle (2)

This concept indicates two conditions that a learning algorithm has to fulfil, namely
(a) the algorithm must return a prediction function whose empirical error reflects its generalization error when the size of the training set tends to infinity:
\[ \forall \epsilon > 0, \quad \lim_{m \to \infty} P\left(|\hat{L}_m(f_S, S) - L(f_S)| > \epsilon\right) = 0, \quad \text{denoted as} \quad \hat{L}_m(f_S, S) \xrightarrow{P} L(f_S) \]
(b) in the asymptotic case, the algorithm must make it possible to find the function which minimises the generalization error in the considered function class:
\[ \hat{L}_m(f_S, S) \xrightarrow{P} \inf_{g \in \mathcal{F}} L(g) \]


Consistency of the ERM principle (3)

These two conditions imply that the empirical error $\hat{L}_m(f_S, S)$ of the prediction function $f_S$ found by the learning algorithm over a training set $S$ converges in probability to its generalization error $L(f_S)$ and to $\inf_{g \in \mathcal{F}} L(g)$:

[Figure: curves of the true risk and of the empirical risk]


Study the consistency of the ERM principle

The fundamental result of learning theory [Vapnik 88, theorem 2.1, p.38] concerning the consistency of the ERM principle exhibits another relation, involving the supremum over the function class in the form of a one-sided uniform convergence, which stipulates that:

The ERM principle is consistent if and only if:
\[ \forall \epsilon > 0, \quad \lim_{m \to \infty} P\left( \sup_{f \in \mathcal{F}} \left[ L(f) - \hat{L}_m(f, S) \right] > \epsilon \right) = 0 \]


Study the consistency of the ERM principle

- A direct implication of this result is a uniform bound over the generalization error of all prediction functions $f \in \mathcal{F}$ learned on a training set $S$ of size $m$, which writes:
  \[ \forall \delta \in ]0, 1], \quad P\left( \forall f \in \mathcal{F}, \; L(f) - \hat{L}_m(f, S) \le C(\mathcal{F}, m, \delta) \right) \ge 1 - \delta \]
  where $C$ depends on the size of the function class, the size of the training set, and the desired precision $\delta \in ]0, 1]$. There are different ways to measure the size of a function class; the measure commonly used is called the complexity or the capacity of the function class.


Usual binary classification models

First attempts to build learnable artificial models

It began at the end of the 19th century with the work of Santiago Ramón y Cajal, who first represented the biological neuron:

[figure omitted]


McCulloch & Pitts formal neuron (1943)

[Figure: formal neuron. The input signals $x_1, \dots, x_d$ are weighted by the synaptic weights $w_1, \dots, w_d$ and summed with $w_0$ to produce the output $h_w(\cdot)$]

- Linear prediction function
  \[ h_w : \mathbb{R}^d \to \mathbb{R}, \qquad x \mapsto \langle \bar{w}, x \rangle + w_0 \]
- Different learning rules have been proposed; the most popular one was Hebb's rule (1949): neurons that fire together, wire together.



Perceptron [Rosenblatt, 1958]

- Linear prediction function
  \[ h_w : \mathbb{R}^d \to \mathbb{R}, \qquad x \mapsto \langle \bar{w}, x \rangle + w_0 \]
- A principled way to learn: find the parameters $w = (\bar{w}, w_0)$ by minimising the distance from the misclassified examples to the decision boundary.

[Figure: decision boundary $h_w(x) = \langle \bar{w}, x \rangle + w_0 = 0$; the distance between a point $x$ and its projection $x_p$ on the boundary is $\frac{|h_w(x)|}{\|\bar{w}\|}$]


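Learning Perceptron parameters

- Objective function, where $I$ denotes the set of indices of the misclassified examples:
  \[ \hat{L}(w) = - \sum_{i' \in I} y_{i'} \left( \langle \bar{w}, x_{i'} \rangle + w_0 \right) \]
- Update rule: gradient descent
  \[ \forall t \ge 1, \quad w^{(t)} \leftarrow w^{(t-1)} - \eta \nabla_{w^{(t-1)}} \hat{L}(w^{(t-1)}) \]
- Derivatives of $\hat{L}$ with respect to the parameters
  \[ \frac{\partial \hat{L}(w)}{\partial w_0} = - \sum_{i' \in I} y_{i'}, \qquad \nabla_{\bar{w}} \hat{L}(w) = - \sum_{i' \in I} y_{i'} x_{i'} \]
- Stochastic gradient descent
  \[ \forall (x, y), \ \text{if } y (\langle \bar{w}, x \rangle + w_0) \le 0 \ \text{then} \ \begin{pmatrix} w_0 \\ \bar{w} \end{pmatrix} \leftarrow \begin{pmatrix} w_0 \\ \bar{w} \end{pmatrix} + \eta \begin{pmatrix} y \\ y x \end{pmatrix} \]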


Graphical depiction of the online update rule

[Figure: positive examples $(x_1, +1)$, $(x_2, +1)$ and negative examples $(x_3, -1)$, $(x_4, -1)$; adding $-x_3$ to the weight vector $w^{(t)}$ gives $w^{(t+1)}$]


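Perceptron (algorithm)

Algorithm 1 The perceptron algorithm
1: Training set $S = \{(x_i, y_i) \mid i \in \{1, \dots, m\}\}$
2: Initialize the weights $w^{(0)} \leftarrow 0$
3: $t \leftarrow 0$
4: Learning rate $\eta > 0$
5: repeat
6:    Choose randomly an example $(x^{(t)}, y^{(t)}) \in S$
7:    if $y^{(t)} \left( \langle \bar{w}^{(t)}, x^{(t)} \rangle + w_0^{(t)} \right) \le 0$ then
8:       $w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \times y^{(t)}$
9:       $\bar{w}^{(t+1)} \leftarrow \bar{w}^{(t)} + \eta \times y^{(t)} \times x^{(t)}$
10:   end if
11:   $t \leftarrow t + 1$
12: until $t > T$

The test in line 7 is non-strict, as in the stochastic update rule: with $w^{(0)} = 0$ a strict inequality would never trigger an update.

But do these updates converge?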


Perceptron (convergence)

[Novikoff, 1962] showed that
- if there exists a weight vector $\bar{w}^*$ such that $\forall i \in \{1, \dots, m\}, \; y_i \times \langle \bar{w}^*, x_i \rangle > 0$,
- then, by denoting $\rho = \min_{i \in \{1, \dots, m\}} \left( y_i \left\langle \frac{\bar{w}^*}{\|\bar{w}^*\|}, x_i \right\rangle \right)$,
- and $R = \max_{i \in \{1, \dots, m\}} \|x_i\|$,
- and $\bar{w}^{(0)} = 0$, $\eta = 1$,
- we have a bound over the maximum number of updates $k$:
  \[ k \le \left\lfloor \left( \frac{R}{\rho} \right)^2 \right\rfloor \]
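To get a feel for the bound, it can be evaluated on a small sample for a known separating vector. The sketch below is illustrative only: the function name `novikoff_bound`, the row-major storage of the inputs and the return value -1 for a non-separating vector are assumptions made here. It uses the identity $(R/\rho)^2 = R^2 \|\bar{w}^*\|^2 / (\min_i y_i \langle \bar{w}^*, x_i \rangle)^2$, which avoids square roots.

```c
#include <stddef.h>

/* Novikoff bound floor((R / rho)^2) for a given vector w_star (no bias
   term, as on the slide). X is an m x d row-major matrix with m > 0,
   y holds labels in {-1, +1}. Returns -1 if w_star does not separate
   the sample. */
long novikoff_bound(const double *X, const int *y, const double *w_star,
                    size_t m, size_t d)
{
    double w_norm2 = 0.0, r2 = 0.0, min_margin = 0.0;

    for (size_t j = 0; j < d; j++)
        w_norm2 += w_star[j] * w_star[j];

    for (size_t i = 0; i < m; i++) {
        double dot = 0.0, x_norm2 = 0.0;
        for (size_t j = 0; j < d; j++) {
            dot += w_star[j] * X[i * d + j];
            x_norm2 += X[i * d + j] * X[i * d + j];
        }
        double margin = y[i] * dot;          /* unnormalised margin */
        if (margin <= 0.0)
            return -1;                       /* w_star is not a separator */
        if (i == 0 || margin < min_margin)
            min_margin = margin;
        if (x_norm2 > r2)
            r2 = x_norm2;                    /* R^2 = max_i ||x_i||^2 */
    }
    /* (R / rho)^2 = R^2 * ||w*||^2 / (min_i y_i <w*, x_i>)^2 */
    return (long)(r2 * w_norm2 / (min_margin * min_margin));
}
```

The bound depends on the chosen $\bar{w}^*$: a separator with a larger normalised margin $\rho$ gives a smaller bound.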


Homework

1. We suppose that all the examples in the training set are within a hypersphere of radius $R$ (i.e. $\forall x_i \in S, \|x_i\| \le R$). Further, we initialise the weight vector to be the null vector (i.e. $w^{(0)} = 0$) and set the learning rate to $\eta = 1$. Show that after $k$ updates, the norm of the current weight vector satisfies:
   \[ \|w^{(k)}\|^2 \le k \times R^2 \qquad (1) \]
   Hint: you can consider $\|w^{(k)}\|^2$ as $\|w^{(k)} - w^{(0)}\|^2$.

2. Using the same conditions as in the previous question, show that after $k$ updates of the weight vector we have
   \[ \left\langle \frac{w^*}{\|w^*\|}, w^{(k)} \right\rangle \ge k \times \rho \qquad (2) \]

3. Deduce from equations (1) and (2) that the number of iterations $k$ is bounded by
   \[ k \le \left\lfloor \left( \frac{R}{\rho} \right)^2 \right\rfloor \]
   where $\lfloor x \rfloor$ represents the floor function (this result is due to Novikoff, 1962).

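Perceptron Program
#include <stdlib.h>   // rand()
#include "defs.h"

// Perceptron trained with random single-example updates.
// X: training inputs, Y: labels in {-1, +1}, w: weights (w[0] is the bias),
// m: number of examples, d: input dimension, eta: learning rate,
// T: number of iterations.
// The code assumes arrays indexed from 1: X[1..m][1..d] and Y[1..m].
void perceptron(double **X, double *Y, double *w, long int m, long int d,
                double eta, long int T)
{
    long int i, j, t = 0;
    double ProdScal;

    // Initialisation of the weight vector
    for (j = 0; j <= d; j++)
        w[j] = 0.0;

    while (t < T) {
        // Choose a training example at random
        i = (rand() % m) + 1;
        // Score h_w(x_i) = w_0 + <w, x_i>
        for (ProdScal = w[0], j = 1; j <= d; j++)
            ProdScal += w[j] * X[i][j];
        // Update the weights only if the example is misclassified
        if (Y[i] * ProdScal <= 0.0) {
            w[0] += eta * Y[i];
            for (j = 1; j <= d; j++)
                w[j] += eta * Y[i] * X[i][j];
        }
        t++;
    }
}
source: http://ama.liglab.fr/~amini/Perceptron/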
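The program above depends on `defs.h` and on random sampling, which makes it hard to run in isolation. The following self-contained variant is a sketch under two assumptions that differ from the original: examples are visited cyclically rather than at random, and arrays are 0-based with the inputs stored row by row. The name `perceptron_cyclic` and the `max_epochs` parameter are introduced here for illustration.

```c
#include <stddef.h>

/* Deterministic perceptron variant. X is an m x d row-major matrix,
   y holds labels in {-1, +1}, w has d + 1 entries with w[0] the bias.
   Stops after a full pass without mistakes or after max_epochs passes.
   Returns the total number of weight updates. */
size_t perceptron_cyclic(const double *X, const int *y, double *w,
                         size_t m, size_t d, double eta, size_t max_epochs)
{
    size_t updates = 0;

    for (size_t j = 0; j <= d; j++)
        w[j] = 0.0;

    for (size_t epoch = 0; epoch < max_epochs; epoch++) {
        size_t mistakes = 0;
        for (size_t i = 0; i < m; i++) {
            double score = w[0];
            for (size_t j = 0; j < d; j++)
                score += w[j + 1] * X[i * d + j];
            if (y[i] * score <= 0.0) {       /* misclassified or on the boundary */
                w[0] += eta * y[i];
                for (size_t j = 0; j < d; j++)
                    w[j + 1] += eta * y[i] * X[i * d + j];
                mistakes++;
                updates++;
            }
        }
        if (mistakes == 0)                   /* the sample is separated: stop */
            break;
    }
    return updates;
}
```

On the four points (2, 1), (1, 2) with label +1 and (-1, -2), (-2, -1) with label -1, a single update on the first example is enough to separate the sample.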


ADAptive LInear NEuron [Widrow & Hoff, 1960]

- Linear prediction function:
  \[ h_w : \mathcal{X} \to \mathbb{R}, \qquad x \mapsto \langle \bar{w}, x \rangle + w_0 \]
- Find parameters that minimise a convex upper bound of the empirical 0/1 loss
  \[ \hat{L}(w) = \frac{1}{m} \sum_{i=1}^{m} (y_i - h_w(x_i))^2 \]
- Update rule: stochastic gradient descent algorithm with a learning rate $\eta > 0$
  \[ \forall (x, y), \quad \begin{pmatrix} w_0 \\ \bar{w} \end{pmatrix} \leftarrow \begin{pmatrix} w_0 \\ \bar{w} \end{pmatrix} + \eta (y - h_w(x)) \begin{pmatrix} 1 \\ x \end{pmatrix} \qquad (3) \]


Adaline

- ADAptive LInear NEuron
- Linear prediction function:
  \[ h_w : \mathcal{X} \to \mathbb{R}, \qquad x \mapsto \langle \bar{w}, x \rangle + w_0 \]

Algorithm 2 The Adaline algorithm
1: Training set $S = \{(x_i, y_i) \mid i \in \{1, \dots, m\}\}$
2: Initialize the weights $w^{(0)} \leftarrow 0$
3: $t \leftarrow 0$
4: Learning rate $\eta > 0$
5: repeat
6:    Choose randomly an example $(x^{(t)}, y^{(t)}) \in S$
7:    $w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \times (y^{(t)} - h_w(x^{(t)}))$
8:    $\bar{w}^{(t+1)} \leftarrow \bar{w}^{(t)} + \eta \times (y^{(t)} - h_w(x^{(t)})) \times x^{(t)}$
9:    $t \leftarrow t + 1$
10: until $t > T$
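Lines 7 and 8 of the Adaline algorithm amount to one small function. The sketch below shows a single update on one example; the name `adaline_step`, the 0-based layout with `w[0]` as the bias, and the choice to return the residual are assumptions made for illustration.

```c
#include <stddef.h>

/* One Adaline (Widrow-Hoff) update on a single example (x, y).
   w has d + 1 entries, w[0] being the bias. Unlike the perceptron,
   the update is applied to every example, scaled by the residual.
   Returns the residual y - h_w(x) computed before the update. */
double adaline_step(double *w, const double *x, double y, size_t d, double eta)
{
    double h = w[0];
    for (size_t j = 0; j < d; j++)
        h += w[j + 1] * x[j];

    double residual = y - h;
    w[0] += eta * residual;
    for (size_t j = 0; j < d; j++)
        w[j + 1] += eta * residual * x[j];
    return residual;
}
```

Calling the function repeatedly on the same example shrinks the residual as long as the learning rate is small enough for that example.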


Formal models

[Figure: formal model. The inputs $x_0 = 1, x_1, x_2, \dots, x_d$ of $x$ are weighted by $w_0, w_1, w_2, \dots, w_d$ and summed to give $h_w(x)$]


Perceptron vs Adaline

[figure omitted: scatter plots of a two-class sample]


Perceptron and Adaline sparked excitement, but ... [figure omitted] ... which marked the first winter of neural networks (NN) and the genesis of research in ML.

Logistic regression: generative models

Each example $x$ is supposed to be generated by a mixture model of parameters $\Theta$:
\[ P(x \mid \Theta) = \sum_{k=1}^{K} P(y = k) P(x \mid y = k, \Theta) \]

Logistic regression: generative models

- The aim is then to find the parameters $\Theta$ for which the model best explains the observations,
- That is done by maximizing the log-likelihood of the data $S = \{(x_i, y_i); i \in \{1, \dots, m\}\}$
  \[ L(\Theta) = \ln \prod_{i=1}^{m} P(x_i \mid \Theta) \]
- Classical density functions are Gaussian density functions
  \[ P(x \mid y = k, \Theta) = \frac{1}{(2\pi)^{\frac{d}{2}} |\Sigma_k|^{\frac{1}{2}}} \, e^{-\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)} \]


Logistic regression: generative models

- Once the parameters $\Theta$ are estimated, the generative model can be used for classification by applying Bayes' rule:
  \[ \forall x; \quad y^* = \operatorname*{argmax}_k P(y = k \mid x) = \operatorname*{argmax}_k P(y = k) \times P(x \mid y = k, \Theta) \]
- Problem: in most real-life applications the distributional assumption over the data does not hold,
- The Logistic Regression model does not make any assumption except that
  \[ \ln \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \langle \bar{w}, x \rangle + w_0 \]


Logistic regression

- Logistic regression has been proposed to model the posterior probability of classes via logistic functions.
  \[ P(y = 1 \mid x) = \frac{1}{1 + e^{-\langle \bar{w}, x \rangle - w_0}} = g_w(x) \]
  \[ P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = \frac{1}{1 + e^{\langle \bar{w}, x \rangle + w_0}} = 1 - g_w(x) \]
  \[ P(y \mid x) = (g_w(x))^{y} (1 - g_w(x))^{1 - y} \]

[Figure: plot of the logistic function 1/(1+exp(-<w,x>)) against <w,x> over the range -6 to 6]


Logistic regression

- For
  \[ g : \mathbb{R} \to ]0, 1[, \qquad x \mapsto \frac{1}{1 + e^{-x}} \]
  we have
  \[ g'(x) = \frac{\partial g}{\partial x} = g(x)(1 - g(x)) \]
- Model parameters $w$ are found by maximizing the complete log-likelihood which, assuming that the $m$ training examples are generated independently, writes
  \[ \ln \prod_{i=1}^{m} P(x_i, y_i) = \ln \prod_{i=1}^{m} P(y_i \mid x_i) + \ln \prod_{i=1}^{m} P(x_i) \approx \sum_{i=1}^{m} \ln \left[ (g_w(x_i))^{y_i} (1 - g_w(x_i))^{1 - y_i} \right] \]
  where the last step drops the term $\ln \prod_{i=1}^{m} P(x_i)$, which does not depend on $w$.


Logistic Regression: link with the ERM principle

- If we consider the function $h_w : x \mapsto \langle \bar{w}, x \rangle + w_0$, the maximization of the log-likelihood $L$ is equivalent to the minimization of the empirical logistic loss in the case where $\forall i, y_i \in \{-1, +1\}$.
  \[ \hat{L}(w) = \frac{1}{m} \sum_{i=1}^{m} \ln\left(1 + e^{-y_i h_w(x_i)}\right) \]
- Minimization can be carried out with usual convex optimization techniques (e.g. conjugate gradient or the quasi-Newton method)
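The equivalence with the logistic loss can be checked in two steps. The derivation below is a sketch under the assumption that the labels $y_i \in \{0, 1\}$ used in the likelihood are recoded as $\tilde{y}_i = 2 y_i - 1 \in \{-1, +1\}$ (the symbol $\tilde{y}_i$ is introduced here for illustration only) and that the term $\ln \prod_i P(x_i)$, which does not depend on $w$, is left out.

```latex
% Both class posteriors fit in a single formula once labels are in {-1, +1}:
% y_i = 1 gives 1 / (1 + e^{-h_w(x_i)}), y_i = 0 gives 1 / (1 + e^{h_w(x_i)}).
\[
P(y_i \mid x_i) = \frac{1}{1 + e^{-\tilde{y}_i h_w(x_i)}},
\qquad \tilde{y}_i = 2 y_i - 1 \in \{-1, +1\}.
\]
% Minus the normalised conditional log-likelihood is the empirical logistic loss:
\[
-\frac{1}{m} \sum_{i=1}^{m} \ln P(y_i \mid x_i)
= \frac{1}{m} \sum_{i=1}^{m} \ln\left(1 + e^{-\tilde{y}_i h_w(x_i)}\right)
= \hat{L}(w).
\]
```

Maximizing the log-likelihood and minimizing $\hat{L}(w)$ therefore select the same parameters.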


Adaline vs Logistic regression

[Figure: two panels plotting the output $y$ over the inputs $(x_1, x_2)$, with the levels 0 and 0.5 marked]


ADAptive BOOSTing [Schapire, 1999]

- The AdaBoost algorithm generates a set of weak learners and combines them with a weighted majority vote in order to produce an efficient final classifier.
- Each weak classifier is trained sequentially so as to take into account the classification errors of the previous classifier.
  + This is done by assigning weights to training examples and, at each iteration, increasing the weights of those that the current classifier misclassifies.
  + In this way the new classifier is focused on hard examples that have been misclassified by the previous classifier.


AdaBoost, algorithm

Algorithm 3 The AdaBoost algorithm
1: Training set $S = \{(x_i, y_i) \mid i \in \{1, \dots, m\}\}$
2: Initialize the distribution over examples: $\forall i \in \{1, \dots, m\}, D_1(i) = \frac{1}{m}$
3: $T$, the maximum number of iterations (or classifiers to be combined)
4: for each $t = 1, \dots, T$ do
5:    Train a weak classifier $f_t : \mathcal{X} \to \{-1, +1\}$ by using the distribution $D_t$
6:    Set $\epsilon_t = \sum_{i : f_t(x_i) \neq y_i} D_t(i)$
7:    Choose $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
8:    Update the distribution of weights
      \[ \forall i \in \{1, \dots, m\}, \quad D_{t+1}(i) = \frac{D_t(i) e^{-\alpha_t y_i f_t(x_i)}}{Z_t} \]
      where
      \[ Z_t = \sum_{i=1}^{m} D_t(i) e^{-\alpha_t y_i f_t(x_i)} \]
9: end for each
10: The final classifier: $\forall x, F(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right)$

source: http://ama.liglab.fr/~amini/RankBoost/
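Steps 6 to 8 of the algorithm can be sketched without any call to `exp` or `log`. Substituting $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$ into the update gives $e^{\alpha_t} / Z_t = \frac{1}{2 \epsilon_t}$ and $e^{-\alpha_t} / Z_t = \frac{1}{2 (1 - \epsilon_t)}$, so misclassified examples are divided by $2 \epsilon_t$ and the others by $2 (1 - \epsilon_t)$. The function name `adaboost_reweight` and the `miss` flag array are assumptions made for this illustration.

```c
#include <stddef.h>

/* One AdaBoost reweighting round. D holds the current distribution D_t
   and is overwritten with D_{t+1}; miss[i] is 1 if the weak classifier
   f_t misclassifies example i and 0 otherwise. Returns epsilon_t.
   If epsilon_t is 0 or 1, alpha_t is infinite and D is left unchanged. */
double adaboost_reweight(double *D, const int *miss, size_t m)
{
    double eps = 0.0;

    for (size_t i = 0; i < m; i++)
        if (miss[i])
            eps += D[i];                     /* weighted error of f_t */

    if (eps <= 0.0 || eps >= 1.0)
        return eps;

    /* Closed form of D_t(i) * exp(-alpha_t y_i f_t(x_i)) / Z_t. */
    for (size_t i = 0; i < m; i++)
        D[i] /= miss[i] ? 2.0 * eps : 2.0 * (1.0 - eps);
    return eps;
}
```

After the update, the misclassified examples carry a total weight of 1/2, which is why the next weak classifier has to concentrate on them.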




How to sample using a distribution $D_t$

[Figure: the weights $D_t(i)$ plotted against the index $i$, with a point $(U, V)$ drawn below $D_t(U)$]

- Choose randomly an index $U \in \{1, \dots, m\}$ and a real value $V \in [0, \max_{i \in \{1, \dots, m\}} D_t(i)]$; if $D_t(U) > V$ then accept the example $(x_U, y_U)$.
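The acceptance rule above is a rejection sampler, and can be sketched as follows. The function name `sample_index`, the 0-based indices and the use of the standard `rand()` generator are assumptions made here; `rand() % m` carries a slight modulo bias, which a careful implementation would remove.

```c
#include <stdlib.h>

/* Draws one index in {0, ..., m-1} according to the distribution D by
   rejection sampling: pick U uniformly among the indices, pick V
   uniformly in [0, max_i D[i]], and accept U if D[U] > V.
   Assumes that D has at least one strictly positive entry. */
size_t sample_index(const double *D, size_t m)
{
    double d_max = 0.0;

    for (size_t i = 0; i < m; i++)
        if (D[i] > d_max)
            d_max = D[i];

    for (;;) {
        size_t u = (size_t)rand() % m;
        double v = d_max * ((double)rand() / (double)RAND_MAX);
        if (D[u] > v)
            return u;                        /* accepted */
        /* rejected: draw a new pair (U, V) */
    }
}
```

Indices with a zero weight are never accepted, and an index is accepted more often the closer its weight is to the maximum.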


AdaBoost, geometry interpretation

[Figure: three weak classifiers on a two-class sample, with weights $\alpha_1 = 0.5$, $\alpha_2 = 0.1$, $\alpha_3 = 0.75$, followed by a fourth panel showing the same sample]


Homework

1. If we denote $\forall x, H(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$ and $F(x) = \operatorname{sign}(H(x))$, show that
   \[ \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{y_i \neq F(x_i)} \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} \]

2. Deduce that
   \[ \frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} = Z_1 \sum_{i=1}^{m} D_2(i) \prod_{t > 1} e^{-y_i \alpha_t f_t(x_i)} \]
   and
   \[ \frac{1}{m} \sum_{i=1}^{m} e^{-y_i H(x_i)} = \prod_{t=1}^{T} Z_t \qquad (4) \]


Homework

3. The minimization of (4) is carried out by minimizing each of its terms. Using the definition of $\epsilon_t$, show that:
   \[ \forall t, \quad Z_t = \epsilon_t e^{\alpha_t} + (1 - \epsilon_t) e^{-\alpha_t} \]
4. Further show that the minimum of the normalisation term with respect to the combination weight $\alpha_t$ is reached for $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
5. By setting $\gamma_t = \frac{1}{2} - \epsilon_t$, and when $\epsilon_t < \frac{1}{2}$, show that
   \[ \forall t, \quad Z_t = \sqrt{1 - 4 \gamma_t^2} \le e^{-2 \gamma_t^2} \]
6. Finally, show that the empirical misclassification error decreases exponentially to 0:
   \[ \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{y_i \neq F(x_i)} \le \prod_{t=1}^{T} Z_t \le e^{-2 \sum_{t=1}^{T} \gamma_t^2} \]


Unconstrained convex optimization

Common convex upper bounds for the misclassification error

[figure omitted]


Property

- The learning problem is cast as an easier unconstrained convex optimization problem.
- Consider the Taylor formula of the objective function around its minimiser $w^*$, where $H$ is the Hessian matrix at $w^*$:
  \[ \hat{L}(w) = \hat{L}(w^*) + (w - w^*)^\top \underbrace{\nabla \hat{L}(w^*)}_{=0} + \frac{1}{2} (w - w^*)^\top H (w - w^*) + o(\|w - w^*\|^2) \]
- The Hessian matrix is symmetric (Schwarz's theorem), so its eigenvectors $(v_i)_{i=1}^{d}$ form an orthonormal basis:
  \[ \forall (i, j) \in \{1, \dots, d\}^2, \quad H v_i = \lambda_i v_i, \quad \text{and} \quad v_i^\top v_j = \begin{cases} +1, & \text{if } i = j, \\ 0, & \text{otherwise.} \end{cases} \]


Property (2)

- Every weight vector $w - w^*$ can be uniquely decomposed in this basis
  \[ w - w^* = \sum_{i=1}^{d} q_i v_i \]
- That is to say, up to the remainder term $o(\|w - w^*\|^2)$,
  \[ \hat{L}(w) = \hat{L}(w^*) + \frac{1}{2} \sum_{i=1}^{d} \lambda_i q_i^2 \]
- Furthermore, the Hessian matrix is positive semi-definite, because of the definition of the global minimum
  \[ (w - w^*)^\top H (w - w^*) = \sum_{i=1}^{d} \lambda_i q_i^2 = 2 (\hat{L}(w) - \hat{L}(w^*)) \ge 0 \]
  All the eigenvalues of $H$ are then non-negative (and strictly positive when $w^*$ is a strict minimum).


Property (3)

- This implies that the level lines of $\hat{L}$, defined by the weight points for which $\hat{L}$ is constant, are ellipses.

[figure omitted]


Gradient descent algorithm [Rumelhart et al., 1986]

- The gradient descent algorithm is an iterative algorithm that updates the weight vector at each step:
  \[ \forall t \in \mathbb{N}, \quad w^{(t+1)} = w^{(t)} - \eta \nabla \hat{L}(w^{(t)}) \]
  where $\eta > 0$ is the learning rate.


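Convergence of the gradient descent algorithm

- Take the decomposition of any vector $w - w^*$ in the orthonormal basis $(v_i)_{i=1}^{d}$ formed by the eigenvectors of the Hessian matrix
  \[ \nabla \hat{L}(w) = \sum_{i=1}^{d} q_i \lambda_i v_i \]
- Let $w^{(t)}$ be the weight vector obtained from $w^{(t-1)}$ after applying the gradient descent rule
  \[ w^{(t)} - w^{(t-1)} = \sum_{i=1}^{d} \left( q_i^{(t)} - q_i^{(t-1)} \right) v_i = -\eta \nabla \hat{L}(w^{(t-1)}) = -\eta \sum_{i=1}^{d} q_i^{(t-1)} \lambda_i v_i \]
- So
  \[ \forall i \in \{1, \dots, d\}, \quad q_i^{(t)} = (1 - \eta \lambda_i)^t q_i^{(0)} \]
  and the algorithm converges if
  \[ \eta < \frac{1}{2 \lambda_{\max}} \]
  This condition is sufficient: each factor then satisfies $|1 - \eta \lambda_i| < 1$, which more generally holds whenever $\eta < \frac{2}{\lambda_{\max}}$.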
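The geometric behaviour of the coordinates $q_i^{(t)}$ can be observed numerically. The sketch below assumes a diagonal quadratic objective $\hat{L}(w) = \frac{1}{2} \sum_i \lambda_i w_i^2$, chosen here so that the eigenvectors are the coordinate axes and the minimiser is $w^* = 0$; the function name `gd_quadratic` is illustrative.

```c
#include <stddef.h>

/* Gradient descent on L(w) = 1/2 * sum_i lambda[i] * w[i]^2, whose
   Hessian is diag(lambda) and whose minimiser is w* = 0. Each
   coordinate evolves as w[i] <- (1 - eta * lambda[i]) * w[i]. */
void gd_quadratic(double *w, const double *lambda, size_t d,
                  double eta, size_t steps)
{
    for (size_t t = 0; t < steps; t++)
        for (size_t i = 0; i < d; i++)
            w[i] -= eta * lambda[i] * w[i];  /* gradient is lambda[i] * w[i] */
}
```

With eigenvalues 1 and 4, a learning rate of 0.1 (below $1 / (2 \lambda_{\max}) = 0.125$) drives both coordinates to zero, while a learning rate of 0.6 (above $2 / \lambda_{\max} = 0.5$) makes the coordinate associated with $\lambda = 4$ grow in magnitude at every step.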
OK, but how to find a good learning rate?

Line search

At each iteration $t$, at $w^{(t)}$:
- Estimate the descent direction $p_t$ (i.e. $p_t^\top \nabla \hat{L}(w^{(t)}) < 0$)
- Update
  \[ w^{(t+1)} \leftarrow w^{(t)} + \eta_t p_t \]
  where $\eta_t$ is a positive learning rate making $w^{(t+1)}$ acceptable for the next iteration.


Wolfe conditions

- To find the sequence $(w^{(t)})_{t \in \mathbb{N}}$ following the line search rule, the following necessary condition
  \[ \forall t \in \mathbb{N}, \quad \hat{L}(w^{(t+1)}) < \hat{L}(w^{(t)}) \]
  is not sufficient to guarantee the convergence of the sequence to the minimiser of $\hat{L}$.
- In two situations, the previous condition is satisfied but there is no convergence.


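1. The decrease of $\hat{L}$ is too small with respect to the length of the jumps

Consider the following example: $d = 1$, $\hat{L}(w) = w^2$ with $w^{(0)} = 2$, $(p_t = (-1)^{t+1})_{t \in \mathbb{N}}$ and $(\eta_t = 2 + \frac{3}{2^{t+1}})_{t \in \mathbb{N}}$. With the update $w^{(t+1)} = w^{(t)} + \eta_t p_t$, the sequence of updates would then be
\[ \forall t \in \mathbb{N}, \quad w^{(t)} = (-1)^t (1 + 2^{-t}) \]
The loss decreases at every step, but $w^{(t)}$ oscillates between values close to $-1$ and $+1$ and never approaches the minimiser $w^* = 0$.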


1. The decrease of $\hat{L}$ is too small with respect to the length of the jumps

⇒ Armijo condition: require that for a given $\alpha \in (0, 1)$,
\[ \forall t \in \mathbb{N}^*, \quad \hat{L}(w^{(t)} + \eta_t p_t) \le \hat{L}(w^{(t)}) + \alpha \eta_t p_t^\top \nabla \hat{L}(w^{(t)}) \]
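The Armijo condition is easy to test numerically in dimension 1. The sketch below uses the objective $\hat{L}(w) = w^2$ from the example; the halving strategy in `backtrack` is an assumption made here (a common way to find an admissible step) and is not stated on the slides, and all function names are illustrative.

```c
/* Objective L(w) = w^2 and its derivative, for the case d = 1. */
static double loss(double w)      { return w * w; }
static double loss_grad(double w) { return 2.0 * w; }

/* Returns 1 if the step eta along direction p satisfies the Armijo
   condition L(w + eta p) <= L(w) + alpha * eta * p * L'(w), else 0. */
int armijo_ok(double w, double p, double eta, double alpha)
{
    return loss(w + eta * p) <= loss(w) + alpha * eta * p * loss_grad(w);
}

/* Backtracking: start from eta0 and halve the step until the Armijo
   condition holds. Assumes p is a descent direction at w. */
double backtrack(double w, double p, double eta0, double alpha)
{
    double eta = eta0;
    while (!armijo_ok(w, p, eta, alpha))
        eta *= 0.5;
    return eta;
}
```

For $w = 2$, $p = -1$ and $\alpha = 0.5$, the condition reduces to $\eta \le 2$, so a trial step of 3.5 is rejected and the first accepted step is 1.75.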


Armijo condition

[Figure: Armijo's admissible learning rate values]


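2. The jumps of the weight vectors are too small

Consider the following example: $d = 1$, $\hat{L}(w) = w^2$ with $w^{(0)} = 2$, $(p_t = -1)_{t \in \mathbb{N}}$ and $(\eta_t = 2^{-(t+1)})_{t \in \mathbb{N}}$. With the update $w^{(t+1)} = w^{(t)} + \eta_t p_t$, the sequence of updates would then be
\[ \forall t \in \mathbb{N}, \quad w^{(t)} = 1 + 2^{-t} \]
The loss decreases at every step, but $w^{(t)}$ converges to 1 and not to the minimiser $w^* = 0$.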


2. The jumps of the weight vectors are too small

⇒ Curvature condition: $\exists \beta \in (\alpha, 1)$ such that
\[ \forall t \in \mathbb{N}^*, \quad p_t^\top \nabla \hat{L}(w^{(t)} + \eta_t p_t) \ge \beta \, p_t^\top \nabla \hat{L}(w^{(t)}) \]
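This second condition can also be checked on the one-dimensional objective $\hat{L}(w) = w^2$ used in the example. The function names `quad_grad` and `curvature_ok` and the value $\beta = 0.8$ mentioned below are illustrative choices, not taken from the slides.

```c
/* Derivative of the objective L(w) = w^2, for the case d = 1. */
static double quad_grad(double w) { return 2.0 * w; }

/* Returns 1 if the step eta along direction p satisfies
   p * L'(w + eta p) >= beta * p * L'(w), else 0. */
int curvature_ok(double w, double p, double eta, double beta)
{
    return p * quad_grad(w + eta * p) >= beta * p * quad_grad(w);
}
```

With $\beta = 0.8$, the step $\eta = 0.5$ from $w = 2$ along $p = -1$ is accepted, because the slope along the direction has flattened enough ($-3 \ge -3.2$), whereas the shorter step $\eta = 0.25$ from $w = 1.5$ is rejected ($-2.5 < -2.4$): the jump is too small.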




Existence of learning rates verifying Wolfe conditions

- Let $p_t$ be a descent direction of $\hat{L}$ at $w^{(t)}$. Suppose that the function $\psi_t : \eta \mapsto \hat{L}(w^{(t)} + \eta p_t)$ is differentiable and lower bounded; then there exists $\eta_t$ verifying both Wolfe conditions.

Proof:
1. Consider
   \[ E = \{ a \in \mathbb{R}_+ \mid \forall \eta \in ]0, a], \; \hat{L}(w^{(t)} + \eta p_t) \le \hat{L}(w^{(t)}) + \alpha \eta p_t^\top \nabla \hat{L}(w^{(t)}) \} \]
   As $p_t$ is a descent direction of $\hat{L}$ at $w^{(t)}$, for all $\alpha < 1$ there exists $\bar{a} > 0$ such that
   \[ \forall \eta \in ]0, \bar{a}], \quad \hat{L}(w^{(t)} + \eta p_t) < \hat{L}(w^{(t)}) + \alpha \eta p_t^\top \nabla \hat{L}(w^{(t)}) \]


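Existence of learning rates verifying Wolfe conditions

2. So $E \neq \emptyset$. Furthermore, as the function $\psi_t$ is lower bounded, the largest rate in $E$, $\hat{\eta}_t = \sup E$, exists. By continuity of $\psi_t$ we have
   \[ \hat{L}(w^{(t)} + \hat{\eta}_t p_t) \le \hat{L}(w^{(t)}) + \alpha \hat{\eta}_t p_t^\top \nabla \hat{L}(w^{(t)}) \]
3. Let $(\eta_n)_{n \in \mathbb{N}}$ be a sequence converging to $\hat{\eta}_t$ from above, i.e. $\forall n \in \mathbb{N}, \eta_n > \hat{\eta}_t$ and $\lim_{n \to +\infty} \eta_n = \hat{\eta}_t$. As $(\eta_n)_{n \in \mathbb{N}} \notin E$ we get
   \[ \forall n \in \mathbb{N}, \quad \hat{L}(w^{(t)} + \eta_n p_t) > \hat{L}(w^{(t)}) + \alpha \eta_n p_t^\top \nabla \hat{L}(w^{(t)}) \]
   Passing to the limit and combining with the inequality of step 2, we get
   \[ \hat{L}(w^{(t)} + \hat{\eta}_t p_t) = \hat{L}(w^{(t)}) + \alpha \hat{\eta}_t p_t^\top \nabla \hat{L}(w^{(t)}) \]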


73

Existence of learning rates verifying Wolfe


conditions

4. We finally get

p⊤
t ∇L̂(w
(t)
+ η̂t pt ) ≥ αp⊤ ⊤
t ∇L̂(w ) ≥ βpt ∇L̂(w )
(t) (t)

Where β ∈ (α, 1) and p⊤


t ∇L̂(w ) < 0.
(t)

⇒ The learning rate η̂t verifies both Wolfe conditions

Does it work?

Theorem (Zoutendijk)
Let $\hat{L} : \mathbb{R}^d \to \mathbb{R}$ be a differentiable, lower-bounded objective function with a Lipschitz continuous gradient. Let $\mathcal{A}$ be an algorithm generating $(w^{(t)})_{t \in \mathbb{N}}$ defined by
$$\forall t \in \mathbb{N}, \quad w^{(t+1)} = w^{(t)} + \eta_t p_t$$
where $p_t$ is a descent direction of $\hat{L}$ and $\eta_t$ a learning rate verifying both Wolfe conditions. Consider the angle $\theta_t$ between the descent direction $p_t$ and the opposite of the gradient:
$$\cos(\theta_t) = \frac{-p_t^\top \nabla \hat{L}(w^{(t)})}{\|\nabla \hat{L}(w^{(t)})\| \times \|p_t\|}$$
Then the following series is convergent:
$$\sum_t \cos^2(\theta_t)\, \|\nabla \hat{L}(w^{(t)})\|^2$$

Proof of Zoutendijk's theorem

1. Using the second Wolfe condition and subtracting $p_t^\top \nabla \hat{L}(w^{(t)})$ from both sides of the inequality, we get
$$\forall t, \quad p_t^\top (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)})) \ge (\beta - 1)\, p_t^\top \nabla \hat{L}(w^{(t)})$$
2. Using the Lipschitz property of the gradient of the objective function, with constant $L$,
$$p_t^\top (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)})) \le \|\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)})\| \times \|p_t\| \le L \|w^{(t+1)} - w^{(t)}\| \times \|p_t\| \le L \eta_t \|p_t\|^2$$
3. By combining both inequalities it comes
$$\forall t, \quad 0 \le (\beta - 1)\,(p_t^\top \nabla \hat{L}(w^{(t)})) \le L \eta_t \|p_t\|^2$$
4. For $\eta_t \ge \frac{\beta - 1}{L} \frac{p_t^\top \nabla \hat{L}(w^{(t)})}{\|p_t\|^2} > 0$ we get from Armijo's condition
$$\hat{L}(w^{(t)}) - \hat{L}(w^{(t+1)}) \ge -\alpha \eta_t\, p_t^\top \nabla \hat{L}(w^{(t)}) \ge \alpha \frac{1 - \beta}{L} \frac{(p_t^\top \nabla \hat{L}(w^{(t)}))^2}{\|p_t\|^2} \ge \alpha \frac{1 - \beta}{L} \cos^2(\theta_t)\, \|\nabla \hat{L}(w^{(t)})\|^2 \ge 0$$
5. As the objective function is lower bounded and the successive decreases telescope, the series of general term $\hat{L}(w^{(t)}) - \hat{L}(w^{(t+1)}) \ge 0$ is convergent.
6. Hence, the series
$$\sum_t \cos^2(\theta_t)\, \|\nabla \hat{L}(w^{(t)})\|^2$$
is convergent.

Corollary of Zoutendijk's theorem

Guarantee of convergence

- In the case where the descent direction and the gradient are not orthogonal:
$$\exists \kappa > 0,\ \forall t \ge T,\quad \cos^2(\theta_t) \ge \kappa$$
- Following Zoutendijk's theorem, the series
$$\sum_t \|\nabla \hat{L}(w^{(t)})\|^2$$
is convergent.
- Hence, the sequence $(\nabla \hat{L}(w^{(t)}))_t$ tends to 0 when $t$ tends to infinity.

Can we do better?
Conjugate gradient method
- The adaptive search of the learning rate with the line search algorithm does not prevent the oscillations of the weight vector around the minimiser of the objective function.

- We work in the neighbourhood of the minimiser $w^*$ of the objective function, where the quadratic approximation holds.
- Suppose that we have $d$ conjugate directions $\{p_t, t \in [[0, d-1]]\}$:
$$\forall (t, t') \in [[0, d-1]]^2,\ t \ne t', \quad p_t^\top H p_{t'} = 0$$
- As the Hessian matrix $H$ is symmetric positive definite, we can show that the directions $\{p_t\}$ are linearly independent and that they form a basis. We have
$$w^* - w^{(0)} = \sum_{t=0}^{d-1} \eta_t p_t$$
Hence
$$\forall t, \quad \eta_t = \frac{p_t^\top H (w^* - w^{(0)})}{p_t^\top H p_t}$$
- Let
$$w^{(t)} = w^{(0)} + \sum_{i=0}^{t-1} \eta_i p_i$$
We get the following update rule
$$\forall t \in [[0, d-1]], \quad w^{(t+1)} = w^{(t)} + \eta_t p_t$$
- From the mutual conjugacy of $(p_t)_{t=0}^{d-1}$:
$$p_t^\top H w^{(t)} = p_t^\top H w^{(0)}$$
That is, since $\nabla \hat{L}(w^{(t)}) = H (w^{(t)} - w^*)$ under the quadratic approximation,
$$\forall t, \quad \eta_t = -\frac{p_t^\top \nabla \hat{L}(w^{(t)})}{p_t^\top H p_t} \qquad (5)$$
- With the previous definition of learning rates, it is simple to show that the current gradient is orthogonal to all the previous descent directions. In fact
$$\forall t, \quad \nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}) = H \underbrace{(w^{(t+1)} - w^{(t)})}_{\eta_t p_t}$$
- By multiplying by $p_t^\top$ from the left and using the definition of $\eta_t$
$$\forall t, \quad p_t^\top (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)})) = -p_t^\top \nabla \hat{L}(w^{(t)})$$
which gives
$$\forall t, \quad p_t^\top \nabla \hat{L}(w^{(t+1)}) = 0$$
- For a given index $t \in [[0, d-1]]$, we finally get
$$\forall t', \forall t,\ t < t', \quad p_t^\top \nabla \hat{L}(w^{(t')}) = 0 \qquad (6)$$
- Hence, if the descent directions are conjugate, after $d$ updates
$$\forall t \in [[0, d-1]], \quad w^{(t+1)} = w^{(t)} + \eta_t p_t$$
using the learning rate above, we arrive at a point where the gradient of the objective function is orthogonal to all the descent directions, and which is therefore the minimum.

Conjugate gradient algorithm

- The necessary condition to get the previous result is to have descent directions $(p_t)_{t=0}^{d-1}$ that are mutually conjugate.
- Consider the following sequence:
$$\begin{cases} p_0 = -\nabla \hat{L}(w^{(0)}) \\ p_{t+1} = -\nabla \hat{L}(w^{(t+1)}) + \beta_t p_t & \text{if } t \ge 0 \end{cases}$$
- For
$$\forall t, \quad \beta_t = \frac{p_t^\top H \nabla \hat{L}(w^{(t+1)})}{p_t^\top H p_t}$$
it is easy to show that the descent directions are mutually conjugate.
- The coefficients $(\beta_t)$ can be estimated without the use of the Hessian matrix (Hestenes and Stiefel, 52):
$$\forall t, \quad \beta_t = \frac{\nabla^\top \hat{L}(w^{(t+1)}) H p_t}{p_t^\top H p_t} = \frac{\nabla^\top \hat{L}(w^{(t+1)}) (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}))}{p_t^\top (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}))}$$
Followed by others:
$$\forall t, \quad \beta_t = \frac{\nabla^\top \hat{L}(w^{(t+1)}) (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}))}{\nabla^\top \hat{L}(w^{(t)}) \nabla \hat{L}(w^{(t)})} \quad \text{(Polak and Ribière, 69)}$$
$$\forall t, \quad \beta_t = \frac{\nabla^\top \hat{L}(w^{(t+1)}) \nabla \hat{L}(w^{(t+1)})}{\nabla^\top \hat{L}(w^{(t)}) \nabla \hat{L}(w^{(t)})} \quad \text{(Fletcher and Reeves, 64)}$$
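To illustrate the finite termination property, here is a minimal sketch of the method on an exactly quadratic objective $\frac{1}{2} w^\top H w - b^\top w$ in dimension 2, using the learning rate of Eq. (5) and the Fletcher-Reeves coefficient. The objective, the constant `DIM` and the function names are assumptions made for this example; they do not come from the course code.

```c
#define DIM 2

/* y = H x for a DIM x DIM matrix. */
static void matvec(double H[DIM][DIM], const double *x, double *y)
{
    for (int r = 0; r < DIM; r++) {
        y[r] = 0.0;
        for (int c = 0; c < DIM; c++)
            y[r] += H[r][c] * x[c];
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int j = 0; j < DIM; j++)
        s += a[j] * b[j];
    return s;
}

/* Minimises 0.5 w^T H w - b^T w (gradient H w - b), H symmetric positive
   definite, with at most DIM conjugate gradient updates. w is updated in place. */
void cg_quadratic(double H[DIM][DIM], const double *b, double *w)
{
    double g[DIM], p[DIM], Hp[DIM];

    matvec(H, w, g);
    for (int j = 0; j < DIM; j++) {
        g[j] -= b[j];                  /* gradient at w(0) */
        p[j] = -g[j];                  /* p_0 = -gradient */
    }
    for (int t = 0; t < DIM; t++) {
        double gg = dot(g, g);
        if (gg == 0.0)
            break;                     /* already at the minimiser */
        matvec(H, p, Hp);
        double eta = -dot(p, g) / dot(p, Hp);      /* Eq. (5) */
        for (int j = 0; j < DIM; j++) {
            w[j] += eta * p[j];
            g[j] += eta * Hp[j];       /* gradient at the new point */
        }
        double beta = dot(g, g) / gg;              /* Fletcher-Reeves */
        for (int j = 0; j < DIM; j++)
            p[j] = -g[j] + beta * p[j];
    }
}
```

With $H = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}$ and $b = (1, 2)^\top$, two updates reach the solution of $Hw = b$, namely $w^* = (1/11, 7/11)^\top$, up to rounding errors.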

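#include <math.h>
#include <stdlib.h>

// Conjugate gradient minimisation with the Fletcher-Reeves coefficient.
// FoncLoss and Gradient belong to the logistic regression program;
// rchln is the line search routine, which is not shown in these slides.
void grdcnj(double **X, double *Y, long int m, long int d, double *w, double epsilon)
{
    long int j;
    double *wold, OldLoss, NewLoss, *g, *p, *h, dgg, ngg, beta;

    wold = (double *) malloc((d+1) * sizeof(double));
    p    = (double *) malloc((d+1) * sizeof(double));

    // Random initialisation of the weights in [-1, 1]
    for(j=0; j<=d; j++)
        wold[j] = 2.0*(rand() / (double) RAND_MAX) - 1.0;
    NewLoss = FoncLoss(wold, X, Y, m, d);
    OldLoss = NewLoss + 2*epsilon;
    g = Gradient(wold, X, Y, m, d);
    for(j=0; j<=d; j++)
        p[j] = -g[j];                        // p_0 <- -gradient of the loss at w(0)

    while(fabs(OldLoss-NewLoss) > (fabs(OldLoss)*epsilon))
    {
        OldLoss = NewLoss;
        // Line search along p: w and NewLoss receive the new weights and loss
        rchln(wold, OldLoss, g, p, w, &NewLoss, X, Y, m, d);
        h = Gradient(w, X, Y, m, d);         // New gradient, at w(t+1)
        for(dgg=0.0, ngg=0.0, j=0; j<=d; j++){
            dgg += g[j]*g[j];
            ngg += h[j]*h[j];
        }
        if(dgg == 0.0){                      // Zero gradient: nothing left to do
            free(h);
            break;
        }
        beta = ngg/dgg;                      // Fletcher-Reeves coefficient
        for(j=0; j<=d; j++){
            wold[j] = w[j];
            g[j] = h[j];
            p[j] = -g[j] + beta*p[j];        // New descent direction
        }
        free(h);                             // Gradient allocates a new vector at each call
    }
    free(wold);
    free(p);
    free(g);
}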

Logistic Regression Program

// Logistic function x -> 1/(1+e^{-x})
double Logistic(double x)
{
    return (1.0/(1.0+exp(-x)));
}

// Estimation of the gradient vector
double *Gradient(double *w, double **X, double *y, long int m, long int d)
{
    double ps, *g;
    long int i, j;

    g = (double *) malloc((d+1)*sizeof(double));

    for(j=0; j<=d; j++)
        g[j] = 0.0;

    // Examples are indexed from 1 to m; w[0] is the bias
    for(i=1; i<=m; i++){
        for(ps=w[0], j=1; j<=d; j++)
            ps += w[j]*X[i][j];
        g[0] += (Logistic(y[i]*ps)-1.0)*y[i];
        for(j=1; j<=d; j++)
            g[j] += (Logistic(y[i]*ps)-1.0)*y[i]*X[i][j];
    }

    for(j=0; j<=d; j++)
        g[j] /= (double) m;

    return(g);
}

double FoncLoss(double *w, double **X, double *y, long int m, long int d)
{
    double S=0.0, ps;
    long int i, j;

    // Average logistic loss log(1 + exp(-y * <w, x>)) over the m examples
    for(i=1; i<=m; i++){
        for(ps=w[0], j=1; j<=d; j++)
            ps += w[j]*X[i][j];
        S += log(1.0+exp(-y[i]*ps));
    }
    S /= (double) m;

    return (S);
}

void RegressionLogistique(double *w, DATA TrainingSet, LR_PARAM params)
{
    // Minimization of the logistic loss using the conjugate gradient
    grdcnj(TrainingSet.X, TrainingSet.y, TrainingSet.m, TrainingSet.d, w, params.eps);
}

source: http://ama.liglab.fr/~amini/LR/

Consistency of the ERM principle

Estimation of the generalization error on a test set

- Recall that the examples of a test set are generated i.i.d. with respect to the same probability distribution $\mathcal{D}$ which has generated the training set.
- Consider $f_S$ a function learned over the training set $S$, and let $T = \{(x_i, y_i); i \in \{1, \ldots, n\}\}$ be a test set of size $n$.
- The values $\ell(f_S(x_i), y_i)$ can be considered as independent copies of the same random variable:
$$\mathbb{E}_{T \sim \mathcal{D}^n} \hat{L}_n(f_S, T) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{T \sim \mathcal{D}^n} [\ell(f_S(x_i), y_i)] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{(x,y) \sim \mathcal{D}} [\ell(f_S(x), y)] = L(f_S)$$

⇒ The empirical error of $f_S$ on the test set, $\hat{L}_n(f_S, T)$, is an unbiased estimator of its generalization error.
[Hoeffding 63] Inequality

Let $X_1, \ldots, X_n$ be independent random variables and define their sum $S_n = X_1 + \cdots + X_n$. Assume that the $X_i$ are almost surely bounded within the interval $[a_i, b_i]$. Then for any $\epsilon > 0$, Theorem 2 of Hoeffding proves the inequalities
$$P(S_n - E[S_n] \ge \epsilon) \le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$
$$P(|S_n - E[S_n]| \ge \epsilon) \le 2 \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$

Estimation of the generalization error on a test set

- For each test example $(x_i, y_i)$ let $X_i$ be the random variable $\frac{1}{n} \ell(f_S(x_i), y_i)$.
- Further, all the random variables $X_i, i \in \{1, \ldots, n\}$ are independent and, the loss taking values in $\{0, 1\}$, they take values in $\{0, \frac{1}{n}\}$.
- By noting that $\hat{L}_n(f_S, T) = \sum_{i=1}^{n} X_i$ and $L(f_S) = E\left(\sum_{i=1}^{n} X_i\right)$, following [Hoeffding 63] we get
$$\forall \epsilon > 0, \quad P\left(L(f_S) - \hat{L}_n(f_S, T) > \epsilon\right) \le e^{-2n\epsilon^2}$$
- To better understand this result, let us solve the equation $e^{-2n\epsilon^2} = \delta$ with respect to $\epsilon$, i.e. $\epsilon = \sqrt{\frac{\ln 1/\delta}{2n}}$, and:
$$\forall \delta \in ]0, 1], \quad P\left(L(f_S) \le \hat{L}_n(f_S, T) + \sqrt{\frac{\ln 1/\delta}{2n}}\right) \ge 1 - \delta$$

- For a small $\delta$, according to the previous equation, we have the following inequality, which holds with high probability over all test sets of size $n$:
$$L(f_S) \le \hat{L}_n(f_S, T) + \sqrt{\frac{\ln 1/\delta}{2n}}$$
- From this result, we have a bound on the generalization error of a learned function which can be estimated using any test set, and in the case where $n$ is sufficiently large, this bound gives a very accurate estimate of the latter.
- Example: suppose that the empirical error of a prediction function $f_S$ over a test set $T$ of size $n = 1000$ is $\hat{L}_n(f_S, T) = 0.23$. For $\delta = 0.01$, i.e. $\sqrt{\frac{\ln(1/\delta)}{2n}} \approx 0.048$, the generalization error of $f_S$ is upper bounded by 0.278 with probability at least 0.99.
A uniform generalization error bound

- As part of the study of the consistency of the ERM principle, we would now like to establish a uniform bound on the generalization error of a learned function depending on its empirical error over a training set.
- We cannot reach this result by using the same development as previously.
- This is mainly due to the fact that, when the learned function $f_S$ has knowledge of the training data $S = \{(x_i, y_i); i \in \{1, \ldots, m\}\}$, the random variables $X_i = \frac{1}{m} \ell(f_S(x_i), y_i); i \in \{1, \ldots, m\}$ involved in the estimation of the empirical error of $f_S$ on $S$ are all dependent on each other.
⇒ Indeed, if we change an example of the training set, the selected function $f_S$ will also change, as well as the instantaneous errors of all the other examples.
Rademacher complexity [Koltchinskii 01]

- In the derivation of uniform generalization error bounds, different capacity measures of the class of functions have been proposed. Among them, the Rademacher complexity allows an accurate estimate of the capacity of a class of functions, and it depends on the training sample.
- The empirical Rademacher complexity estimates the richness of a function class $\mathcal{F}$ by measuring the degree to which the latter is able to fit random noise on a training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ of size $m$ generated i.i.d. with respect to a probability distribution $\mathcal{D}$.
- This complexity is estimated through Rademacher variables $\sigma = (\sigma_1, \ldots, \sigma_m)^\top$, which are independent discrete random variables taking values in $\{-1, +1\}$ with the same probability 1/2, i.e. $\forall i \in \{1, \ldots, m\}; P(\sigma_i = -1) = P(\sigma_i = +1) = 1/2$, and is defined as:
$$\hat{R}_m(\mathcal{F}, S) = \mathbb{E}_\sigma \left[ \frac{2}{m} \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(x_i) \;\middle|\; x_1, \ldots, x_m \right]$$
- Furthermore, we define the Rademacher complexity of the class of functions $\mathcal{F}$ independently of a given training set by
$$R_m(\mathcal{F}) = \mathbb{E}_{S \sim \mathcal{D}^m} \hat{R}_m(\mathcal{F}, S) = \mathbb{E}_{S\sigma} \left[ \frac{2}{m} \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(x_i) \right]$$
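For a small finite class of functions, the empirical Rademacher complexity can be computed exactly by enumerating the $2^m$ sign vectors. The sketch below assumes that the class is given by a table of its values on the sample (a representation chosen for this example) and follows the definition above, without absolute value inside the supremum.

```c
#include <stddef.h>

/* Exact empirical Rademacher complexity of a finite class.
   vals[k * m + i] holds f_k(x_i) for k < n_funcs (n_funcs >= 1) and i < m.
   All 2^m sign vectors are enumerated, so m must stay small. */
double empirical_rademacher(const double *vals, size_t n_funcs, unsigned m)
{
    unsigned long n_sigma = 1UL << m;
    double total = 0.0;

    for (unsigned long s = 0; s < n_sigma; s++) {
        double best = 0.0;
        for (size_t k = 0; k < n_funcs; k++) {
            double sum = 0.0;
            for (unsigned i = 0; i < m; i++) {
                /* bit i of s encodes sigma_i */
                double sigma = ((s >> i) & 1UL) ? 1.0 : -1.0;
                sum += sigma * vals[k * m + i];
            }
            if (k == 0 || sum > best)
                best = sum;            /* supremum over the class */
        }
        total += best;
    }
    return (2.0 / m) * total / (double) n_sigma;
}
```

A class reduced to one function cannot fit the noise and gets complexity 0, whereas the class $\{f, -f\}$ with $f \equiv 1$ on $m = 2$ points gets complexity 1.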

A uniform generalization error bound

Theorem (Generalization bound with the Rademacher complexity)
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a vector space and $\mathcal{Y} = \{-1, +1\}$ an output space. Suppose that the pairs of examples $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are generated i.i.d. with respect to the probability distribution $\mathcal{D}$. Let $\mathcal{F}$ be a class of functions having values in $\mathcal{Y}$ and $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$ a given instantaneous loss. Then for all $\delta \in ]0, 1]$, we have with probability at least $1 - \delta$ the following inequality:
$$\forall f \in \mathcal{F}, \quad L(f) \le \hat{L}_m(f, S) + R_m(\ell \circ \mathcal{F}) + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}} \qquad (7)$$
and also with probability at least $1 - \delta$
$$L(f) \le \hat{L}_m(f, S) + \hat{R}_m(\ell \circ \mathcal{F}, S) + 3 \sqrt{\frac{\ln \frac{2}{\delta}}{2m}} \qquad (8)$$
where $\ell \circ \mathcal{F} = \{(x, y) \mapsto \ell(f(x), y) \mid f \in \mathcal{F}\}$.

A uniform generalization error bound (1)

1. Link the supremum of $L(f) - \hat{L}_m(f, S)$ on $\mathcal{F}$ with its expectation

The study of this bound is achieved by linking the supremum of $L(f) - \hat{L}_m(f, S)$ over $\mathcal{F}$ with its expectation through a powerful tool developed for empirical processes by [McDiarmid 89], and known as the theorem of bounded differences.

Let $I \subset \mathbb{R}$ be a real-valued interval, and $(X_1, \ldots, X_m)$, $m$ independent random variables taking values in $I^m$. Let $\Phi : I^m \to \mathbb{R}$ be defined such that $\forall i \in \{1, \ldots, m\}, \exists c_i \in \mathbb{R}$ for which the following inequality holds for any $(x_1, \ldots, x_m) \in I^m$ and $\forall x' \in I$:
$$|\Phi(x_1, .., x_{i-1}, x_i, x_{i+1}, .., x_m) - \Phi(x_1, .., x_{i-1}, x', x_{i+1}, .., x_m)| \le c_i$$
We then have
$$\forall \epsilon > 0, \quad P(\Phi(x_1, \ldots, x_m) - E[\Phi] > \epsilon) \le e^{\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}}$$

Consider the following function
$$\Phi : S \mapsto \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$$
McDiarmid's inequality can then be applied to the function $\Phi$ with $c_i = 1/m, \forall i$, thus:
$$\forall \epsilon > 0, \quad P\left( \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)] - \mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)] > \epsilon \right) \le e^{-2m\epsilon^2}$$

A uniform generalization error bound (2)

2. Bound $\mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$ with respect to $R_m(\ell \circ \mathcal{F})$

This step is a symmetrisation step and it consists in introducing a second virtual sample $S'$, also generated i.i.d. with respect to $\mathcal{D}^m$, into $\mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$.

→ $$\mathbb{E}_S \sup_{f \in \mathcal{F}} (L(f) - \hat{L}_m(f, S)) = \mathbb{E}_S \sup_{f \in \mathcal{F}} [\mathbb{E}_{S'} (\hat{L}_m(f, S') - \hat{L}_m(f, S))] \le \mathbb{E}_S \mathbb{E}_{S'} \sup_{f \in \mathcal{F}} [\hat{L}_m(f, S') - \hat{L}_m(f, S)]$$

→ On the other hand,
$$\mathbb{E}_S \mathbb{E}_{S'} \sup_{f \in \mathcal{F}} [\hat{L}_m(f, S') - \hat{L}_m(f, S)] = \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i (\ell(f(x'_i), y'_i) - \ell(f(x_i), y_i)) \right]$$

By applying the triangle inequality for $\sup = \|.\|_\infty$ it comes
$$\mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i (\ell(f(x'_i), y'_i) - \ell(f(x_i), y_i)) \right] \le \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \ell(f(x'_i), y'_i) + \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} (-\sigma_i) \ell(f(x_i), y_i)$$

Finally, as $\forall i$, $\sigma_i$ and $-\sigma_i$ have the same distribution, we have
$$\mathbb{E}_S \mathbb{E}_{S'} \sup_{f \in \mathcal{F}} [\hat{L}_m(f, S') - \hat{L}_m(f, S)] \le \underbrace{2\, \mathbb{E}_S \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \ell(f(x_i), y_i)}_{\le R_m(\ell \circ \mathcal{F})} \qquad (9)$$

Summarizing the results obtained so far, we have:
1. $\forall f \in \mathcal{F}, \forall S, \quad L(f) - \hat{L}_m(f, S) \le \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$
2. $\forall \epsilon > 0, \quad P\left( \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)] - \mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)] > \epsilon \right) \le e^{-2m\epsilon^2}$
3. $\mathbb{E}_S \sup_{f \in \mathcal{F}} (L(f) - \hat{L}_m(f, S)) \le R_m(\ell \circ \mathcal{F})$

The first point (Eq. 7) of Theorem 2 is obtained by solving the equation $e^{-2m\epsilon^2} = \delta$ with respect to $\epsilon$.

A uniform generalization error bound (3)

3. Bound $R_m(\ell \circ \mathcal{F})$ with respect to $\hat{R}_m(\ell \circ \mathcal{F}, S)$

→ Apply the McDiarmid inequality to the function $\Phi : S \mapsto \hat{R}_m(\ell \circ \mathcal{F}, S)$:
$$\forall \epsilon > 0, \quad P(R_m(\ell \circ \mathcal{F}) > \hat{R}_m(\ell \circ \mathcal{F}, S) + \epsilon) \le e^{-m\epsilon^2/2}$$
Thus for $\delta/2 = e^{-m\epsilon^2/2}$, we have with probability at least equal to $1 - \delta/2$:
$$R_m(\ell \circ \mathcal{F}) \le \hat{R}_m(\ell \circ \mathcal{F}, S) + 2 \sqrt{\frac{\ln \frac{2}{\delta}}{2m}}$$
From the first point (Eq. 7) of Theorem 2, we also have with probability at least equal to $1 - \delta/2$:
$$\forall f \in \mathcal{F}, \forall S, \quad L(f) \le \hat{L}_m(f, S) + R_m(\ell \circ \mathcal{F}) + \sqrt{\frac{\ln \frac{2}{\delta}}{2m}}$$
The second point (Eq. 8) of Theorem 2 is then obtained by combining the two previous results using the union bound.
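The combination can be written out explicitly. The following derivation only restates the two inequalities of this step; each holds with probability at least $1 - \delta/2$, so by the union bound both hold simultaneously with probability at least $1 - \delta$.

```latex
% Solving \delta/2 = e^{-m\epsilon^2/2} for \epsilon:
%   \epsilon = \sqrt{\frac{2\ln\frac{2}{\delta}}{m}} = 2\sqrt{\frac{\ln\frac{2}{\delta}}{2m}}
\begin{align*}
L(f) &\le \hat{L}_m(f, S) + R_m(\ell \circ \mathcal{F}) + \sqrt{\frac{\ln\frac{2}{\delta}}{2m}} \\
     &\le \hat{L}_m(f, S) + \hat{R}_m(\ell \circ \mathcal{F}, S)
          + 2\sqrt{\frac{\ln\frac{2}{\delta}}{2m}} + \sqrt{\frac{\ln\frac{2}{\delta}}{2m}} \\
     &= \hat{L}_m(f, S) + \hat{R}_m(\ell \circ \mathcal{F}, S) + 3\sqrt{\frac{\ln\frac{2}{\delta}}{2m}}
\end{align*}
```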

Structural Risk Minimization

[Figure: curves labelled "Complexity", "Empirical error" and "Empirical error + complexity"]

Structural Risk Minimization (2)

Image from: http://www.svms.org/srm/

Regularization

- Find a predictor by minimising the empirical risk with an added penalty for the size of the model.
- A simple approach consists in choosing a large class of functions $\mathcal{F}$ and defining on $\mathcal{F}$ a regularizer, typically a norm $\|f\|$, then minimizing the regularized empirical risk
$$\hat{f} = \operatorname*{argmin}_{f \in \mathcal{F}} \hat{L}_m(f, S) + \underbrace{\gamma}_{\text{hyperparameter}} \times \|f\|^2$$
- The hyperparameter $\gamma$, or regularisation parameter, allows one to choose the right trade-off between fit and complexity.

K-fold cross validation

- Create a K-fold partition of the dataset.
- For each of K experiments, use K − 1 folds for training and a different fold for testing; this procedure is illustrated in the following figure for K = 4.

[Figure: four cross-validation experiments (Crossval. 1 to 4), each training on three folds and testing on the remaining fold (Test 1 to Test 4)]

- The value of the hyperparameter corresponds to the value of $\gamma_k$ for which the testing performance is the highest on one of the folds.
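A K-fold partition only requires assigning each example index to one fold. The sketch below uses a simple modulo assignment and assumes that the examples have already been shuffled; `kfold_split` and its parameters are illustrative names, not part of the course code.

```c
/* Splits the indices 0..m-1 for the k-th experiment (0 <= k < K):
   index i is held out for testing when i % K == k, otherwise it is
   used for training. train and test must each have room for m entries. */
void kfold_split(long m, int K, int k,
                 long *train, long *n_train, long *test, long *n_test)
{
    *n_train = 0;
    *n_test = 0;
    for (long i = 0; i < m; i++) {
        if (i % K == k)
            test[(*n_test)++] = i;
        else
            train[(*n_train)++] = i;
    }
}
```

Over the K experiments every example is tested exactly once, so a candidate value of $\gamma$ can be scored on each held-out fold in turn.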
In summary

- For induction, we should control the capacity of the class of functions.
- The study of the consistency of the ERM principle led to the second fundamental principle of machine learning, called structural risk minimization (SRM).
- Learning is a compromise between a low empirical risk and a high capacity of the class of functions in use.

Homework

- Randomly split each of the following datasets into 60% training and 40% test sets:
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
http://archive.ics.uci.edu/ml/datasets/Ionosphere
http://archive.ics.uci.edu/ml/datasets/Mushroom
- Train each of Perceptron, Adaline, Logistic Regression and AdaBoost with perceptron on the training sets.
- Compare their accuracy on the test sets.

Multiclass Classification

Real-life classification applications

- In most real-life classification applications the number of classes is more than two.

Multi-class classification problem

- There are two cases to be distinguished:
  - the mono-label case, where each example is labeled with a single class. In this case the output space $\mathcal{Y}$ is a finite set of classes, generally marked with numbers for convenience, $\mathcal{Y} = \{1, \ldots, K\}$;
  - the multi-label case, where each example can be labeled with several classes; $\mathcal{Y} = \{-1, +1\}^K$.
- In both cases, the learning algorithm takes as input a labeled training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$, where pairs of examples $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are supposed i.i.d. with respect to an unknown yet fixed probability distribution $\mathcal{D}$.

- The aim of learning is then to find a prediction function from $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$ with the lowest generalization error:
$$L(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}} [\ell(f(x), y)] \qquad (10)$$
where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is the instantaneous classification error, and $f(x) = (f_1(x), \ldots, f_K(x)) \in \mathcal{Y}$ is the predicted output vector for example $x$ in the multi-label case, or the class label of $x$ in the mono-label case.
- In the multi-label case, the instantaneous error is based on the Hamming distance, which counts the number of components that differ between the predicted label vector $f(x)$ and the true label vector $y$ of $x$:
$$\ell(f(x), y) = \frac{1}{2} \sum_{k=1}^{K} (1 - y_k f_k(x))$$
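As a small illustration of the multi-label error, the following sketch computes the Hamming loss between a predicted vector and a true label vector, both with components in $\{-1, +1\}$. The function name `hamming_loss` is an assumption made for this example.

```c
/* Hamming loss between the prediction f and the label y, both in {-1,+1}^K:
   each component on which they disagree contributes 1. */
double hamming_loss(const int *f, const int *y, int K)
{
    double loss = 0.0;
    for (int k = 0; k < K; k++)
        loss += 1.0 - (double) (y[k] * f[k]);
    return 0.5 * loss;
}
```

For $y = (+1, -1, +1)$ and $f(x) = (+1, +1, -1)$, two components differ and the loss equals 2.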

- In the mono-label case, the instantaneous error is simply:
$$\ell(f(x), y) = \mathbb{1}_{f(x) \ne y}$$
- As in binary classification, the prediction function is found according to the Empirical Risk Minimization principle using a training set $S = \{(x_i, y_i); i \in \{1, \ldots, m\}\} \in (\mathcal{X} \times \mathcal{Y})^m$:
$$f^* = \operatorname*{argmin}_{f \in \mathcal{F}} \hat{L}_m(f, S) = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i)$$

- In practice, the learned function $h$ is defined as:
$$h : \mathbb{R}^d \to \mathbb{R}^K, \quad x \mapsto (h(x, 1), \ldots, h(x, K))$$
where $h \in \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ is found by minimizing a convex, differentiable upper bound of the empirical error.
- For an example $x$, the prediction is hence obtained by thresholding the outputs $h(x, k), k \in \{1, \ldots, K\}$ in the multi-label case, or by taking the class index giving the highest prediction in the mono-label case:
$$\forall x, \quad f_h(x) = \operatorname*{argmax}_{k \in \{1, \ldots, K\}} h(x, k)$$

Approaches

- Combined approaches (on the basis of binary classification)
  - One-versus-All (OvA),
  - One-versus-One (OvO),
  - Error-Correcting Output Codes (ECOC).
- Uncombined approaches
  - K-Nearest Neighbour (K-NN),
  - Generative models,
  - Discriminative models (MLP, M-SVM, M-AdaBoost, etc.)

Combined approach - OvA

Algorithm 5 The OvA approach
Training set $S = \{(x_i, y_i) \mid i \in \{1, \ldots, m\}\}$
for each $k = 1 \ldots K$ do
    $\tilde{S} \leftarrow \emptyset$
    for each $i = 1 \ldots m$ do
        if $y_i == k$ then
            $\tilde{S} \leftarrow \tilde{S} \cup \{(x_i, +1)\}$
        else
            $\tilde{S} \leftarrow \tilde{S} \cup \{(x_i, -1)\}$
        end if
    end for each
    Learn a classifier $h_k : \mathcal{X} \to \mathbb{R}$ on $\tilde{S}$
end for each
The final classifier: $\forall x \in \mathcal{X}, f(x) = \operatorname*{argmax}_{k \in \{1, \ldots, K\}} h_k(x)$
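The two non-learning parts of the OvA approach, the relabelling of the training set and the final argmax, can be sketched as follows. The learning of each binary classifier $h_k$ is left out, and the function names `ova_relabel` and `ova_predict` are hypothetical.

```c
/* Builds the binary labels of the k-th one-versus-all problem:
   +1 for the examples of class k, -1 for all the others. */
void ova_relabel(const int *y, long m, int k, int *y_bin)
{
    for (long i = 0; i < m; i++)
        y_bin[i] = (y[i] == k) ? +1 : -1;
}

/* Final classifier: scores[k-1] holds h_k(x) for k = 1..K.
   Returns the class in 1..K with the highest score
   (the lowest index wins in case of a tie). */
int ova_predict(const double *scores, int K)
{
    int best = 1;
    for (int k = 2; k <= K; k++)
        if (scores[k - 1] > scores[best - 1])
            best = k;
    return best;
}
```

The tie-breaking rule is a choice made for this sketch; the algorithm above does not specify one.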

Combined approach - OvO

Algorithm 6 The OvO approach
Training set $S = \{(x_i, y_i) \mid i \in \{1, \ldots, m\}\}$
for each $k = 1 \ldots K - 1$ do
    for each $\ell = k + 1 \ldots K$ do
        $\tilde{S} \leftarrow \emptyset$
        for each $i = 1 \ldots m$ do
            if $y_i == k$ then
                $\tilde{S} \leftarrow \tilde{S} \cup \{(x_i, +1)\}$
            else if $y_i == \ell$ then
                $\tilde{S} \leftarrow \tilde{S} \cup \{(x_i, -1)\}$
            end if
        end for each
        Learn a classifier $h_{k\ell} : \mathcal{X} \to \mathbb{R}$ on $\tilde{S}$
    end for each
end for each
The final classifier: $\forall x \in \mathcal{X}, f(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \left| \{ y' \in \mathcal{Y}, y' \ne y \mid f_{yy'}(x) = +1 \} \right|$
where, $\forall x \in \mathcal{X}, \forall (y, y') \in \mathcal{Y}^2, y \ne y'$,
$$f_{yy'}(x) = \begin{cases} \mathrm{sgn}(h_{yy'}(x)), & \text{if } y < y' \\ -f_{y'y}(x), & \text{if } y' < y \end{cases}$$
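The final OvO classifier amounts to a majority vote over the $K(K-1)/2$ pairwise classifiers. The sketch below assumes that the pairwise outputs for a given $x$ are stored in a $K \times K$ table, of which only the upper triangle is read, that classes are numbered from 0, and that a non-positive output counts as a vote for the second class of the pair.

```c
#include <stdlib.h>

/* h[k * K + l], for 0 <= k < l < K, holds the output of the classifier
   trained on class k (+1) against class l (-1).
   Returns the class with the most votes (lowest index on ties),
   or -1 if the vote counters cannot be allocated. */
int ovo_predict(const double *h, int K)
{
    int *votes = calloc((size_t) K, sizeof *votes);
    if (votes == NULL)
        return -1;

    for (int k = 0; k < K - 1; k++)
        for (int l = k + 1; l < K; l++) {
            if (h[k * K + l] > 0.0)
                votes[k]++;            /* the pair (k, l) votes for k */
            else
                votes[l]++;            /* the pair (k, l) votes for l */
        }

    int best = 0;
    for (int k = 1; k < K; k++)
        if (votes[k] > votes[best])
            best = k;

    free(votes);
    return best;
}
```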


Combined approach - ECOC

This technique is composed of three steps:
- Each class $k \in \{1, \ldots, K\}$ is first coded (or represented) by a code word, which is generally a binary vector of length $n$, $D_k \in \{-1, +1\}^n$.
- With the resulting matrix of codes $D \in \{-1, +1\}^{K \times n}$, $n$ binary classifiers $(f_j)_{j=1}^{n}$ are learned after creating $n$ training sets $\tilde{S}_j$ from the initial training set $S$:
  $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \forall j \in \{1, \ldots, n\}$, the associated coded example is $(x, D_y(j))$.
- To predict the class of an example $x$, let $f(x)$ denote the vector $f(x) = (f_1(x), \ldots, f_n(x))$; the associated class is the one whose row vector in $D$ has the lowest Hamming distance to $f(x)$:
$$\forall x \in \mathcal{X}, \quad y^* = \operatorname*{argmin}_{k \in \{1, \ldots, K\}} \frac{1}{2} \sum_{j=1}^{n} (1 - \mathrm{sgn}(D_k(j) f_j(x)))$$
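The decoding step can be sketched as follows, assuming a row-major code matrix and classes numbered from 0; `ecoc_decode` and the helper `sign` are names introduced for this example, with $\mathrm{sgn}(0) = 0$ as a convention.

```c
/* Sign function with sign(0) = 0. */
static int sign(double v)
{
    return (v > 0.0) - (v < 0.0);
}

/* D is the K x n code matrix stored row by row, with entries in {-1,+1};
   f[j] is the output of the j-th binary classifier on x.
   Returns the row index whose code word is closest to f in Hamming
   distance (the lowest index wins in case of a tie). */
int ecoc_decode(const int *D, int K, int n, const double *f)
{
    int best = 0;
    double best_dist = 0.0;

    for (int k = 0; k < K; k++) {
        double dist = 0.0;
        for (int j = 0; j < n; j++)
            dist += 0.5 * (1.0 - sign(D[k * n + j] * f[j]));
        if (k == 0 || dist < best_dist) {
            best = k;
            best_dist = dist;
        }
    }
    return best;
}
```

With the one-hot style codes $D_1 = (+1, -1, -1)$, $D_2 = (-1, +1, -1)$, $D_3 = (-1, -1, +1)$ and outputs $f(x) = (-0.5, 0.8, -0.2)$, the second code word is at distance 0 and is selected.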

Uncombined approach - K-NN

- The K-Nearest Neighbors algorithm is a non-parametric method used for classification.
- The input consists of the K closest training examples in the feature space.
- For each observation $x \in \mathcal{X}$ the class membership is decided by a majority vote of its neighbours.

Uncombined approach - Generative models

- Estimate $p(y)$ and $p(x \mid y)$ by maximizing the complete log-likelihood.
- For prediction, use the Bayes rule $p(y \mid x) \propto p(y) \times p(x \mid y)$.
- Assign an observation $x$ to $y^* = \operatorname*{argmax}_y p(y \mid x)$.

Figure from Duda, Hart and Stork (Pattern Classification)


Uncombined approach - MLP

- Multi-Layer Perceptron is a feed-forward model

[Figure: feed-forward network with an input layer ($x_0$ bias, $x_1, \ldots, x_d$), a hidden layer ($z_0$ bias, $z_1, \ldots, z_\ell$) and an output layer ($y_1, \ldots, y_k$)]

For the model above, the value of the jth unit of the hidden layer for an observation $x = (x_i)_{i=1 \ldots d}$ in input is obtained by composition:
- of a dot product $a_j$ between $x$ and the weight vector $w^{(1)}_{j.} = (w^{(1)}_{ji})_{i=1,\ldots,d}; j \in \{1, \ldots, \ell\}$ linking the features of $x$ to this jth unit, plus the bias parameter $w^{(1)}_{j0}$:
$$\forall j \in \{1, \ldots, \ell\}, \quad a_j = \langle w^{(1)}_{j.}, x \rangle + w^{(1)}_{j0} = \sum_{i=0}^{d} w^{(1)}_{ji} x_i$$
with the convention $x_0 = 1$ for the bias unit,
- and of a bounded transfer function $\bar{H}(.) : \mathbb{R} \to \mathbb{R}$:
$$\forall j \in \{1, \ldots, \ell\}, \quad z_j = \bar{H}(a_j)$$
- The values of the units of the output $(h_1, \ldots, h_K)$ are obtained in the same manner between the vector of the hidden layer $z_j, j \in \{0, \ldots, \ell\}$ and the weights linking this layer to the output, $w^{(2)}_{k.} = (w^{(2)}_{kj})_{j=1,\ldots,\ell}; k \in \{1, \ldots, K\}$.
- The predicted output for an observation $x$ is a composite transformation of the input, which for the previous model is
$$\forall x, \forall k \in \{1, \ldots, K\}, \quad h(x, k) = \bar{H}(a_k) = \bar{H}\left( \sum_{j=0}^{\ell} w^{(2)}_{kj} \times \bar{H}\left( \sum_{i=0}^{d} w^{(1)}_{ji} \times x_i \right) \right)$$

- An efficient way to estimate the parameters of NNs is the backpropagation algorithm [Rumelhart et al., 1986].
- For the mono-label classification case, an indicator vector is associated to each class:
$$\forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \quad y = k \Leftrightarrow \mathbf{y}^\top = \Big( \underbrace{y_1, \ldots, y_{k-1}}_{\text{all equal to 0}}, \underbrace{y_k}_{=1}, \underbrace{y_{k+1}, \ldots, y_K}_{\text{all equal to 0}} \Big)$$
- After the phase of propagation of information for an example $(x, y)$, an error is estimated between the prediction and the desired output:
$$\forall (x, y), \quad \ell(h(x), y) = \frac{1}{2} \|h(x) - y\|^2 = \frac{1}{2} \times \sum_{k=1}^{K} (h_k - y_k)^2$$

- And the weights are corrected accordingly from the output to the input using the gradient descent algorithm
$$w_{ji} \leftarrow w_{ji} - \eta \frac{\partial \ell(h(x), y)}{\partial w_{ji}}$$
- Using the chain rule
$$\frac{\partial \ell(h(x), y)}{\partial w_{ji}} = \underbrace{\frac{\partial \ell(h(x), y)}{\partial a_j}}_{= \delta_j} \frac{\partial a_j}{\partial w_{ji}}$$
where $\frac{\partial a_j}{\partial w_{ji}} = z_i$.
- In the case where the unit $j$ is on the output layer we have
$$\delta_j = \frac{\partial \ell(h(x), y)}{\partial a_j} = \bar{H}'(a_j) \times (h_j - y_j).$$
- If the unit $j$ is on the hidden layer, we have by applying the chain rule again:
$$\delta_j = \frac{\partial \ell(h(x), y)}{\partial a_j} = \sum_{l \in Af(j)} \frac{\partial \ell(h(x), y)}{\partial a_l} \frac{\partial a_l}{\partial a_j} = \bar{H}'(a_j) \sum_{l \in Af(j)} \delta_l \times w_{lj}$$
where $Af(j)$ is the set of units that are on the layer which succeeds the one containing unit $j$.
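For a network with a single hidden layer, the two formulas for $\delta_j$ translate directly into code. The sketch below assumes a logistic transfer function, for which $\bar{H}'(a) = \bar{H}(a)(1 - \bar{H}(a))$, and a row-major weight matrix with the bias weight in column 0; these choices and the name `backprop_deltas` are assumptions made for this example.

```c
/* Error terms of the output and hidden units for one example.
   z[0..l-1]: hidden activations, h[0..K-1]: outputs, y[0..K-1]: indicator
   vector of the true class, w2: K x (l+1) output weights stored row by row.
   delta_out (K entries) and delta_hid (l entries) receive the results. */
void backprop_deltas(const double *w2, const double *z, const double *h,
                     const double *y, int l, int K,
                     double *delta_out, double *delta_hid)
{
    /* Output layer: delta_k = H'(a_k) * (h_k - y_k) */
    for (int k = 0; k < K; k++)
        delta_out[k] = h[k] * (1.0 - h[k]) * (h[k] - y[k]);

    /* Hidden layer: delta_j = H'(a_j) * sum_k delta_k * w_kj */
    for (int j = 0; j < l; j++) {
        double s = 0.0;
        for (int k = 0; k < K; k++)
            s += delta_out[k] * w2[k * (l + 1) + j + 1];
        delta_hid[j] = z[j] * (1.0 - z[j]) * s;
    }
}
```

The gradient with respect to a weight is then the product of the error term of its destination unit and the activation of its source unit, as given by the chain rule above.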

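Backpropagation at a glance

[Figure: propagation into unit $j$ of the activations $z_i$ of the preceding units $i \in Be(j)$ through the weights $w_{ji}$ and the transfer function $\bar{H}$, and backpropagation of the error $\delta_j = \bar{H}'(a_j) \sum_{l \in Af(j)} \delta_l w_{lj}$ from the following units $l \in Af(j)$]
Figure from Amini (Apprentissage Machine)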

References

Massih-Reza Amini
Apprentissage Machine de la théorie à la pratique
Éditions Eyrolles, 2015.
Translated in Chinese: iTuring edition, 2018.

R.O. Duda, P.E. Hart, D.G. Stork
Pattern Classification
2000.

T. Hastie, R. Tibshirani, J. Friedman
The Elements of Statistical Learning
2009.

W. Hoeffding
Probability inequalities for sums of bounded random variables
Journal of the American Statistical Association, 58:13–30, 1963.

C. McDiarmid
On the method of bounded differences
Surveys in combinatorics, 141:148–188, 1989.

V. Koltchinskii
Rademacher penalties and structural risk minimization
IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.

Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar
Foundations of Machine Learning
2012.

A.B. Novikoff
On convergence proofs on perceptrons.
Symposium on the Mathematical Theory of Automata, 12:615–622, 1962.

F. Rosenblatt
The perceptron: A probabilistic model for information storage and organization in the brain.
Psychological Review, 65:386–408, 1958.

D. E. Rumelhart, G. E. Hinton and R. Williams
Learning internal representations by error propagation.
Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1986.

R.E. Schapire
Theoretical views of boosting and applications.
In Proceedings of the 10th International Conference on Algorithmic Learning Theory, pages 13–25, 1999.

B. Widrow and M. Hoff
Adaptive switching circuits.
Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, 4:96–104, 1960.

V. Vapnik
The nature of statistical learning theory.
Springer-Verlag, 1998.
