
Supervised Learning

This document provides an introduction to machine learning fundamentals and supervised learning. It discusses what machine learning is, the process of inference involving observing phenomena, constructing models, and making predictions. It also outlines three main machine learning frameworks - supervised learning, unsupervised learning, and semi-supervised learning. The document relates these frameworks to inductive reasoning and provides an overview of topics that will be covered, including supervised learning techniques and applications such as pattern recognition.

Uploaded by

Mohammad Gamal


UE: Machine Learning Fundamentals

Part I : Supervised Learning

Massih-Reza Amini
http://ama.liglab.fr/~amini/Cours/ML/ML.html

Université Grenoble Alpes

Laboratoire d’Informatique de Grenoble
[email protected]

What is Machine Learning?

- Wikipedia: Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed for it!

[email protected] Machine Learning Fundamentals


- Machine learning programs are hence designed to make inferences.

Learning and Inference

The process of inference is done in three steps:

1. Observe a phenomenon,
2. Construct a model of the phenomenon,
3. Make predictions.


- These steps are involved in more or less all natural sciences!

"All that is necessary to reduce the whole of nature to laws similar to those which Newton discovered with the aid of calculus is to have a sufficient number of observations and a mathematics that is complex enough." (Marquis de Condorcet, 1785)

- The aim of learning is to automate this process.

- The aim of learning theory is to formalize this process.

Three main Frameworks

[Figure: documents labelled "Sport" or "Politics" and unlabelled documents marked "?", illustrating the three frameworks: Supervised Learning, Unsupervised Learning and Semi-supervised Learning]

All three frameworks are related to inductive reasoning.

Induction vs. deduction

- Induction is the process of deriving general principles from particular facts or instances.

- Deduction is, on the other hand, the process of reasoning in which a conclusion follows necessarily from the stated premises; it is an inference by reasoning from the general to the specific.

This is how mathematicians prove theorems from axioms.

What will you learn here? [1]

1. Supervised Learning:
   - The Empirical Risk Minimization principle
   - Binary models and their link with the ERM principle
   - Unconstrained Convex Optimization
   - Consistency of the ERM principle
   - Multi-class classification

2. Unsupervised Learning:
   - Generative models and the EM algorithm
   - CEM algorithm

3. Semi-supervised Learning:
   - Graphical and Generative models
   - Discriminant models

[1] Program based on Chapters 1, 2, 3 & 5 of [Amini 15]

Organization

- Course format
  - Theoretical courses: 12 weeks (1.5h per week, 3 ECTS)

- Practical information (important dates, timetables, defence schedule, etc.) is available at
  http://mosig.imag.fr
  https://msiam.imag.fr

- Timetable available at
  https://edt.grenoble-inp.fr/2018-2019/ensimag/etudiant/

- Research projects: http://im2ag-pcarre.e.ujf-grenoble.fr/

Pattern recognition

If we consider the context of supervised learning for pattern recognition:
- The data consist of pairs of examples (vector representation of an observation, class label),
- Class labels are often $\mathcal{Y} = \{1, \ldots, K\}$ with $K$ large (but in the theory of ML we consider the binary classification case $\mathcal{Y} = \{-1, +1\}$),
- The learning algorithm constructs an association between the vector representation of an observation and its class label,
- Aim: make few errors on unseen examples.

Pattern recognition (Example)

Iris classification, Ronald Fisher (1936)

[Figure: Iris Setosa, Iris Versicolor, Iris Virginica]

- The first step is to formalize the perception of the flowers with relevant common characteristics, which constitute the features of their vector representations.

- This usually requires expert knowledge.

- If the observations come from a field of irises, then they become ...

- Building the vectorised observations and their associated labels is generally time consuming.

- Many studies now focus on representation learning using deep neural networks.

- Second step: learning then amounts to searching for a function that maps vectorised observations (inputs) to their associated outputs.

Pattern recognition

0. Training set
1. Vector representation
2. Find the separators
3. New examples
5. Predict the labels of the new examples

Some popular applications

Machine Learning for:
- Medicine
- Image recognition
- Finance and business

Approximation - Interpolation

It is always possible to construct a function that exactly fits the data.

Is it reasonable?

Occam's razor

Idea: search for regularities (or repetitions) in the observed phenomenon; generalization is done from past observations to new, future ones ⇒ take the simplest model ...
But how to measure simplicity?
1. Number of constants,
2. Number of parameters,
3. ...

Basic Hypotheses

Two types of hypotheses:

- Past observations are related to future ones
  → The phenomenon is stationary

- Observations are independently generated from a source
  → Notion of independence

Aims

→ How can one make predictions from past data? What are the hypotheses?

- Give a formal definition of learning, generalization and overfitting,
- Characterize the performance of learning algorithms,
- Construct better algorithms.

Probabilistic model

Relations between past and future observations.

- Independence: each new observation provides maximal individual information,
- Identically distributed: observations provide information on the phenomenon that generates them.

Formally

We consider an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and an output space $\mathcal{Y}$.

Assumption: Example pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are independently and identically distributed (i.i.d.) with respect to an unknown but fixed probability distribution $\mathcal{D}$.

Samples: We observe a sequence of $m$ pairs of examples $(x_i, y_i)$ generated i.i.d. from $\mathcal{D}$.

Aim: Construct a prediction function $f : \mathcal{X} \to \mathcal{Y}$ which predicts an output $y$ for a given new $x$ with a minimum probability of error.

Notations

Symbol                                                      Definition
$\mathcal{X} \subseteq \mathbb{R}^d$                        Input space
$\mathcal{Y}$                                               Output space
$S = (x_i, y_i)_{1 \le i \le m}$                            Training set of size $m$
$\mathcal{D}$                                               Probability distribution generating the data (i.i.d.)
$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$    Instantaneous loss
$\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$         Class of functions
$L(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(f(x), y)]$   Generalization error of $f$
$\hat{L}_m(f, S)$ or $\hat{L}(w)$                           Empirical loss of $f$ over $S$
$w$                                                         Parameters of the prediction function
$\mathbb{1}_{\pi}$                                          Indicator function, equal to 1 if $\pi$ is true and 0 otherwise
$R_m(\mathcal{F})$                                          Rademacher complexity of the class of functions $\mathcal{F}$
$\hat{R}_m(\mathcal{F}, S)$                                 Empirical Rademacher complexity of $\mathcal{F}$ estimated over $S$

Supervised Learning
- Discriminant models directly find a classification function $f : \mathcal{X} \to \mathcal{Y}$ from a given class of functions $\mathcal{F}$;
- The function found should be the one having the lowest probability of error

$$L(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(f(x), y)] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y)\, d\mathcal{D}(x, y)$$

where $\ell$ is an instantaneous loss defined as

$$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$$

The loss usually considered in classification is the misclassification error:

$$\forall (x, y);\ \ell(f(x), y) = \mathbb{1}_{f(x) \neq y}$$

where $\mathbb{1}_{\pi}$ is equal to 1 if the predicate $\pi$ is true and 0 otherwise.

Empirical risk minimization (ERM) principle

- As the probability distribution $\mathcal{D}$ is unknown, the analytic form of the true risk cannot be derived, so the prediction function cannot be found by directly minimizing $L(f)$.

- Empirical risk minimization (ERM) principle: find $f$ by minimizing the unbiased estimator of its generalization error $L(f)$ on a given training set $S = (x_i, y_i)_{i=1}^m$:

$$\hat{L}_m(f, S) = \frac{1}{m} \sum_{i=1}^m \ell(f(x_i), y_i)$$

- However, without restricting the class of functions this is not the right way to proceed (Occam's razor) ...

ERM principle, problem

Suppose that the input dimension is $d = 1$, let the input space $\mathcal{X}$ be the interval $[a, b] \subset \mathbb{R}$ where $a$ and $b$ are real values such that $a < b$, and suppose that the output space is $\{-1, +1\}$. Moreover, suppose that the distribution $\mathcal{D}$ generating the examples $(x, y)$ is a uniform distribution over $[a, b] \times \{-1\}$. Consider now a learning algorithm which minimizes the empirical risk by choosing a function in the function class $\mathcal{F} = \{f : [a, b] \to \{-1, +1\}\}$ (also denoted as $\mathcal{F} = \{-1, +1\}^{[a,b]}$) in the following way: after reviewing a training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, the algorithm outputs the prediction function $f_S$ such that

$$f_S(x) = \begin{cases} -1, & \text{if } x \in \{x_1, \ldots, x_m\} \\ +1, & \text{otherwise} \end{cases}$$

Consistency of the ERM principle

- For the above problem, the classifier found has an empirical risk equal to 0, and this holds for any given training set. However, as the classifier makes an error over the entire infinite set $[a, b]$ except on the finite training set (of measure zero), its generalization error is always equal to 1.

- So the question is: in which cases is the ERM principle likely to generate a general learning rule?
  ⇒ The answer to this question lies in a statistical notion called consistency.
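The memorising rule discussed above can be sketched in C as follows. This is only an illustration under stated assumptions: the names `Sample`, `f_S` and `zero_one_risk` are hypothetical, every true label is taken to be -1 as in the example, and the generalization error is merely probed on a few fresh points rather than computed over the whole interval.

```c
#include <stddef.h>

/* Training inputs x[0..m-1] drawn from [a, b]; every true label is -1. */
typedef struct {
    const double *x;
    size_t m;
} Sample;

/* Memorising rule f_S: -1 on the training points, +1 everywhere else. */
int f_S(const Sample *S, double x)
{
    for (size_t i = 0; i < S->m; i++)
        if (S->x[i] == x)
            return -1;
    return +1;
}

/* Average 0/1 loss of f_S over n points whose true label is always -1. */
double zero_one_risk(const Sample *S, const double *points, size_t n)
{
    size_t errors = 0;
    for (size_t i = 0; i < n; i++)
        if (f_S(S, points[i]) != -1)
            errors++;
    return (double)errors / (double)n;
}
```

Evaluated on the training inputs themselves, `zero_one_risk` returns 0; evaluated on any points outside the training set, it returns 1, which mirrors the gap between the empirical risk and the generalization error.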


Consistency of the ERM principle (2)

This concept indicates two conditions that a learning algorithm has to fulfil, namely
(a) the algorithm must return a prediction function whose empirical error reflects its generalization error when the size of the training set tends to infinity:

$$\forall \epsilon > 0,\ \lim_{m \to \infty} \mathbb{P}\left(|\hat{L}_m(f_S, S) - L(f_S)| > \epsilon\right) = 0, \quad \text{denoted as} \quad \hat{L}_m(f_S, S) \xrightarrow{\mathbb{P}} L(f_S)$$

(b) in the asymptotic case, the algorithm must make it possible to find the function which minimises the generalization error in the considered function class:

$$\hat{L}_m(f_S, S) \xrightarrow{\mathbb{P}} \inf_{g \in \mathcal{F}} L(g)$$

Consistency of the ERM principle (3)

These two conditions imply that the empirical error $\hat{L}_m(f_S, S)$ of the prediction function $f_S$ found by the learning algorithm over a training set $S$ converges in probability both to its generalization error $L(f_S)$ and to $\inf_{g \in \mathcal{F}} L(g)$.

[Figure: true risk and empirical risk]

Study the consistency of the ERM principle

The fundamental result of learning theory [Vapnik 88, theorem 2.1, p.38] concerning the consistency of the ERM principle exhibits another relation, involving the supremum over the function class in the form of a one-sided uniform convergence, which stipulates that:

The ERM principle is consistent if and only if:

$$\forall \epsilon > 0,\ \lim_{m \to \infty} \mathbb{P}\left(\sup_{f \in \mathcal{F}} \left[L(f) - \hat{L}_m(f, S)\right] > \epsilon\right) = 0$$

- A direct implication of this result is a uniform bound over the generalization error of all prediction functions $f \in \mathcal{F}$ learned on a training set $S$ of size $m$, which writes:

$$\forall \delta \in ]0, 1],\ \mathbb{P}\left(\forall f \in \mathcal{F},\ L(f) - \hat{L}_m(f, S) \le C(\mathcal{F}, m, \delta)\right) \ge 1 - \delta$$

where $C$ depends on the size of the function class, the size of the training set, and the desired precision $\delta \in ]0, 1]$. There are different ways to measure the size of a function class; the measure commonly used is called the complexity or the capacity of the function class.

Usual binary classiﬁcation models

First attempts to build learnable artificial models

It began at the end of the 19th century with the work of Santiago Ramón y Cajal, who first represented the biological neuron.

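McCulloch & Pitts formal neuron (1943)

[Figure: formal neuron. The input signals $x_1, x_2, \ldots, x_d$ are weighted by the synaptic weights $w_1, w_2, \ldots, w_d$ and summed (Σ) together with $w_0$ to produce the output $h_w(\cdot)$]

- Linear prediction function

$$h_w : \mathbb{R}^d \to \mathbb{R}, \qquad x \mapsto \langle \bar{w}, x \rangle + w_0$$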

- Different learning rules have been proposed; the most popular one was Hebb's rule (1949): neurons that fire together, wire together.

Perceptron [Rosenblatt, 1958]

- Linear prediction function

$$h_w : \mathbb{R}^d \to \mathbb{R}, \qquad x \mapsto \langle \bar{w}, x \rangle + w_0$$

- A principled way to learn: find the parameters $w = (\bar{w}, w_0)$ by minimising the distance between the misclassified examples and the decision boundary.

[Figure: decision boundary $h_w(x) = \langle \bar{w}, x \rangle + w_0 = 0$ with normal vector $\bar{w}$; a point $x$ and its projection $x_p$ on the boundary are separated by the distance $|h_w(x)| / \|\bar{w}\|$]

Learning Perceptron parameters

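- Objective function

$$\hat{L}(w) = - \sum_{i' \in I} y_{i'} \left(\langle \bar{w}, x_{i'} \rangle + w_0\right)$$

where $I$ is the set of indices of the misclassified examples.

- Update rule: gradient descent

$$\forall t \ge 1,\ w^{(t)} \leftarrow w^{(t-1)} - \eta \nabla_{w^{(t-1)}} \hat{L}(w^{(t-1)})$$

- Derivatives of $\hat{L}$ with respect to the parameters

$$\frac{\partial \hat{L}(w)}{\partial w_0} = - \sum_{i' \in I} y_{i'}, \qquad \nabla \hat{L}(\bar{w}) = - \sum_{i' \in I} y_{i'} x_{i'}$$

- Stochastic gradient descent

$$\forall (x, y),\ \text{if } y(\langle \bar{w}, x \rangle + w_0) \le 0 \text{ then } \begin{pmatrix} w_0 \\ \bar{w} \end{pmatrix} \leftarrow \begin{pmatrix} w_0 \\ \bar{w} \end{pmatrix} + \eta \begin{pmatrix} y \\ y x \end{pmatrix}$$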

Graphical depiction of the online update rule

[Figure: positive examples $(x_1, +1)$, $(x_2, +1)$ and negative examples $(x_3, -1)$, $(x_4, -1)$; the update on $x_3$ adds $-x_3$ to $w^{(t)}$ to give $w^{(t+1)}$]

Perceptron (algorithm)

Algorithm 1 The perceptron algorithm

Training set $S = \{(x_i, y_i) \mid i \in \{1, \ldots, m\}\}$
Initialize the weights $w^{(0)} \leftarrow 0$
$t \leftarrow 0$
Learning rate $\eta > 0$
repeat
    Choose randomly an example $(x^{(t)}, y^{(t)}) \in S$
    if $y^{(t)} \left(\langle \bar{w}^{(t)}, x^{(t)} \rangle + w_0^{(t)}\right) \le 0$ then
        $w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \times y^{(t)}$
        $\bar{w}^{(t+1)} \leftarrow \bar{w}^{(t)} + \eta \times y^{(t)} \times x^{(t)}$
    end if
    $t \leftarrow t + 1$
until $t > T$

→ But do these updates converge?

Perceptron (convergence)
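[Novikoff, 1962] showed that

- if there exists a weight vector $\bar{w}^*$ such that $\forall i \in \{1, \ldots, m\},\ y_i \times \langle \bar{w}^*, x_i \rangle > 0$,
- then, by denoting $\rho = \min_{i \in \{1, \ldots, m\}} \left( y_i \left\langle \frac{\bar{w}^*}{\|\bar{w}^*\|}, x_i \right\rangle \right)$,
- and $R = \max_{i \in \{1, \ldots, m\}} \|x_i\|$,
- and $\bar{w}^{(0)} = 0$, $\eta = 1$,
- we have a bound over the maximum number of updates $k$:

$$k \le \left\lfloor \left( \frac{R}{\rho} \right)^2 \right\rfloor$$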
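As an illustration, the bound can be evaluated numerically from the quantities $\rho$ and $R$ defined above. The sketch below is not part of the course material: the function name `novikoff_bound`, the row-major 0-based layout of `X` and the return value -1 for a non-separating vector are all assumptions made for this example. It works with squared quantities, so no square root is needed.

```c
#include <stddef.h>

/* Returns floor((R / rho)^2) for the sample (X, y) and a candidate
   separating vector w_star, or -1 if w_star does not separate the sample
   with a strictly positive margin.
   X holds m rows of d features (row-major), y holds labels in {-1, +1}.
   Uses (R / rho)^2 = R^2 * ||w_star||^2 / (min_i y_i <w_star, x_i>)^2. */
long novikoff_bound(const double *X, const double *y, size_t m, size_t d,
                    const double *w_star)
{
    double norm2 = 0.0;       /* ||w_star||^2 */
    double radius2 = 0.0;     /* R^2 = max_i ||x_i||^2 */
    double min_margin = 0.0;  /* min_i y_i <w_star, x_i> */

    if (m == 0)
        return -1;
    for (size_t j = 0; j < d; j++)
        norm2 += w_star[j] * w_star[j];

    for (size_t i = 0; i < m; i++) {
        double dot = 0.0, x2 = 0.0;
        for (size_t j = 0; j < d; j++) {
            dot += w_star[j] * X[i * d + j];
            x2 += X[i * d + j] * X[i * d + j];
        }
        double margin = y[i] * dot;
        if (margin <= 0.0)
            return -1;  /* w_star does not separate the sample */
        if (i == 0 || margin < min_margin)
            min_margin = margin;
        if (x2 > radius2)
            radius2 = x2;
    }
    return (long)(radius2 * norm2 / (min_margin * min_margin));
}
```

For instance, with the four points (1, 0), (0, 2), (-1, 0), (0, -3), labels +1, +1, -1, -1 and $\bar{w}^* = (1, 1)$, we get $R = 3$ and $\rho = 1/\sqrt{2}$, hence a bound of 18 updates.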


Homework
1. We suppose that all the examples in the training set are within a hypersphere of radius $R$ (i.e. $\forall x_i \in S, \|x_i\| \le R$). Further, we initialise the weight vector to the null vector (i.e. $w^{(0)} = 0$) and set the learning rate to $\eta = 1$. Show that after $k$ updates, the norm of the current weight vector satisfies:

$$\|w^{(k)}\|^2 \le k \times R^2 \qquad (1)$$

hint: you can consider $\|w^{(k)}\|^2$ as $\|w^{(k)} - w^{(0)}\|^2$.

2. Under the same conditions as in the previous question, show that after $k$ updates of the weight vector we have

$$\left\langle \frac{w^*}{\|w^*\|}, w^{(k)} \right\rangle \ge k \times \rho \qquad (2)$$

3. Deduce from equations (1) and (2) that the number of updates $k$ is bounded by

$$k \le \left\lfloor \left( \frac{R}{\rho} \right)^2 \right\rfloor$$

where $\lfloor x \rfloor$ represents the floor function (this result is due to Novikoff, 1962).

Perceptron Program
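#include <stdlib.h>   // rand
#include "defs.h"

// Perceptron trained with stochastic updates.
// Indexing is 1-based as in the original program: X[1..m][1..d] are the
// examples, Y[1..m] the labels in {-1, +1}, and w[0..d] the weights with
// w[0] the bias (the arrays are assumed to be allocated accordingly).
void perceptron(double **X, double *Y, double *w, long int m, long int d,
                double eta, long int T)
{
  long int i, j, t = 0;
  double ProdScal;

  // Initialisation of the weight vector
  for (j = 0; j <= d; j++)
    w[j] = 0.0;

  while (t < T)
  {
    // Draw an example index in {1, ..., m}
    i = (rand() % m) + 1;
    // Score of example i: w[0] + <w, X[i]>
    for (ProdScal = w[0], j = 1; j <= d; j++)
      ProdScal += w[j] * X[i][j];
    // Update only when the example is misclassified (or has a zero score)
    if (Y[i] * ProdScal <= 0.0) {
      w[0] += eta * Y[i];
      for (j = 1; j <= d; j++)
        w[j] += eta * Y[i] * X[i][j];
    }
    t++;
  }
}
source: http://ama.liglab.fr/~amini/Perceptron/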

ADAptive LInear NEuron

[Widrow & Hoff, 1960]
- ADAptive LInear NEuron
- Linear prediction function:

$$h_w : \mathcal{X} \to \mathbb{R}, \qquad x \mapsto \langle \bar{w}, x \rangle + w_0$$

- Find the parameters that minimise a convex upper bound of the empirical 0/1 loss

$$\hat{L}(w) = \frac{1}{m} \sum_{i=1}^m (y_i - h_w(x_i))^2$$

- Update rule: stochastic gradient descent algorithm with a learning rate $\eta > 0$

$$\forall (x, y),\ \begin{pmatrix} w_0 \\ \bar{w} \end{pmatrix} \leftarrow \begin{pmatrix} w_0 \\ \bar{w} \end{pmatrix} + \eta (y - h_w(x)) \begin{pmatrix} 1 \\ x \end{pmatrix} \qquad (3)$$

Adaline
Algorithm 2 The Adaline algorithm

Training set $S = \{(x_i, y_i) \mid i \in \{1, \ldots, m\}\}$
Initialize the weights $w^{(0)} \leftarrow 0$
$t \leftarrow 0$
Learning rate $\eta > 0$
repeat
    Choose randomly an example $(x^{(t)}, y^{(t)}) \in S$
    $w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \times (y^{(t)} - h_w(x^{(t)}))$
    $\bar{w}^{(t+1)} \leftarrow \bar{w}^{(t)} + \eta \times (y^{(t)} - h_w(x^{(t)})) \times x^{(t)}$
    $t \leftarrow t + 1$
until $t > T$
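The Adaline steps above could be sketched in C along the lines of the perceptron program. This is a minimal illustration, not the author's code: the names `adaline_score` and `adaline_train` are hypothetical, arrays are 0-based with a separate bias `w0`, and, so that runs are reproducible, the examples are swept cyclically instead of being drawn at random as in the algorithm.

```c
#include <stddef.h>

/* Linear score h_w(x) = w0 + <w, x> for a d-dimensional input. */
double adaline_score(const double *w, double w0, const double *x, size_t d)
{
    double s = w0;
    for (size_t j = 0; j < d; j++)
        s += w[j] * x[j];
    return s;
}

/* Adaline training by stochastic updates of rule (3).
   X holds m rows of d features (row-major), y holds labels in {-1, +1}.
   Assumption of this sketch: example t % m is used at step t (cyclic
   sweep) rather than a randomly chosen example. */
void adaline_train(const double *X, const double *y, size_t m, size_t d,
                   double eta, size_t T, double *w, double *w0)
{
    /* Initialisation of the weights */
    *w0 = 0.0;
    for (size_t j = 0; j < d; j++)
        w[j] = 0.0;

    for (size_t t = 0; t < T; t++) {
        size_t i = t % m;
        const double *x = X + i * d;
        /* Residual y - h_w(x), computed before any weight is changed */
        double err = y[i] - adaline_score(w, *w0, x, d);
        *w0 += eta * err;
        for (size_t j = 0; j < d; j++)
            w[j] += eta * err * x[j];
    }
}
```

Unlike the perceptron, the update is applied to every example, whether or not it is misclassified, since the squared error is non-zero as soon as $h_w(x) \neq y$.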


Formal models

[Figure: formal neuron with a constant input $x_0 = 1$; the inputs $x_0, x_1, x_2, \ldots, x_d$ are weighted by $w_0, w_1, w_2, \ldots, w_d$ and summed (Σ) to give $h_w(x)$]

Perceptron vs Adaline

[Figure: two-class samples (circles and squares) comparing the Perceptron and Adaline]

Perceptron and Adaline sparked excitement but [figure missing], which marked the first winter of neural networks and the genesis of research in ML.

Logistic regression: generative models

Each example $x$ is supposed to be generated by a mixture model of parameters $\Theta$:

$$P(x \mid \Theta) = \sum_{k=1}^K P(y = k) P(x \mid y = k, \Theta)$$

- The aim is then to find the parameters $\Theta$ for which the model best explains the observations,
- This is done by maximizing the log-likelihood of the data $S = \{(x_i, y_i); i \in \{1, \ldots, m\}\}$

$$\mathcal{L}(\Theta) = \ln \prod_{i=1}^m P(x_i \mid \Theta)$$

- Classical density functions are Gaussian density functions

$$P(x \mid y = k, \Theta) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)}$$

- Once the parameters $\Theta$ are estimated, the generative model can be used for classification by applying the Bayes rule:

$$\forall x;\ y^* = \operatorname{argmax}_k P(y = k \mid x) = \operatorname{argmax}_k P(y = k) \times P(x \mid y = k, \Theta)$$

- Problem: in most real-life applications the distributional assumption over the data does not hold,
- The Logistic Regression model does not make any assumption except that

$$\ln \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \langle \bar{w}, x \rangle + w_0$$

Logistic regression
- Logistic regression has been proposed to model the posterior probability of classes via logistic functions.

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-\langle \bar{w}, x \rangle - w_0}} = g_w(x)$$

$$P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = \frac{1}{1 + e^{\langle \bar{w}, x \rangle + w_0}} = 1 - g_w(x)$$

$$P(y \mid x) = (g_w(x))^y (1 - g_w(x))^{1-y}$$

[Figure: plot of the logistic function 1/(1+exp(-<w,x>)) for <w,x> between -6 and 6]
- For

$$g : \mathbb{R} \to ]0, 1[, \qquad x \mapsto \frac{1}{1 + e^{-x}}$$

we have

$$g'(x) = \frac{\partial g}{\partial x} = g(x)(1 - g(x))$$

- Model parameters $w$ are found by maximizing the complete log-likelihood which, assuming that the $m$ training examples are generated independently, writes

$$\ln \prod_{i=1}^m P(x_i, y_i) = \ln \prod_{i=1}^m P(y_i \mid x_i) + \ln \prod_{i=1}^m P(x_i) \approx \sum_{i=1}^m \ln \left[ (g_w(x_i))^{y_i} (1 - g_w(x_i))^{1 - y_i} \right]$$

where the term $\ln \prod_{i=1}^m P(x_i)$ is dropped because it does not depend on $w$.

Logistic Regression: link with the ERM principle

- If we consider the function $h_w : x \mapsto \langle \bar{w}, x \rangle + w_0$, the maximization of the log-likelihood $\mathcal{L}$ is equivalent to the minimization of the empirical logistic loss in the case where $\forall i, y_i \in \{-1, +1\}$.

$$\hat{L}(w) = \frac{1}{m} \sum_{i=1}^m \ln(1 + e^{-y_i h_w(x_i)})$$

- Minimization can be carried out with usual convex optimization techniques (e.g. conjugate gradient or the quasi-Newton method)
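The equivalence can be checked in two lines. The sketch below assumes that the negative class, written $y = 0$ in the logistic model, is relabelled $y = -1$; the two posteriors then collapse into a single expression, and the negative average log-likelihood of the labels is exactly the empirical logistic loss.

```latex
\begin{align*}
P(y = +1 \mid x) &= \frac{1}{1 + e^{-h_w(x)}}, \qquad
P(y = -1 \mid x) = \frac{1}{1 + e^{h_w(x)}}
\;\Longrightarrow\; P(y \mid x) = \frac{1}{1 + e^{-y\,h_w(x)}} \\
-\frac{1}{m}\sum_{i=1}^m \ln P(y_i \mid x_i)
  &= \frac{1}{m}\sum_{i=1}^m \ln\left(1 + e^{-y_i h_w(x_i)}\right) = \hat{L}(w)
\end{align*}
```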


Adaline vs Logistic regression

[Figure: two plots with axes $x_1$, $x_2$ and $y$, annotated with the values 0 and 0.5]

ADAptive BOOSTing [Schapire, 1999]

- The AdaBoost algorithm generates a set of weak learners and combines them with a weighted majority vote in order to produce an efficient final classifier.

- Each weak classifier is trained sequentially so as to take into account the classification errors of the previous classifier
  → This is done by assigning weights to the training examples and, at each iteration, increasing the weights of those that the current classifier misclassifies.
  → In this way the new classifier focuses on hard examples that have been misclassified by the previous classifier.

AdaBoost, algorithm
Algorithm 3 The Boosting algorithm

Training set $S = \{(x_i, y_i) \mid i \in \{1, \ldots, m\}\}$
Initialize the distribution over examples: $\forall i \in \{1, \ldots, m\},\ D_1(i) = \frac{1}{m}$
$T$, the maximum number of iterations (or classifiers to be combined)
for each $t = 1, \ldots, T$ do
    Train a weak classifier $f_t : \mathcal{X} \to \{-1, +1\}$ by using the distribution $D_t$
    Set $\epsilon_t = \sum_{i : f_t(x_i) \neq y_i} D_t(i)$
    Choose $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
    Update the distribution of weights
        $\forall i \in \{1, \ldots, m\},\ D_{t+1}(i) = \frac{D_t(i) e^{-\alpha_t y_i f_t(x_i)}}{Z_t}$, where $Z_t = \sum_{i=1}^m D_t(i) e^{-\alpha_t y_i f_t(x_i)}$
end for each
The final classifier: $\forall x,\ F(x) = \operatorname{sign}\left(\sum_{t=1}^T \alpha_t f_t(x)\right)$

source: http://ama.liglab.fr/~amini/RankBoost/
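To make the weight update concrete, here is one round worked out by hand on a hypothetical toy case that is not taken from the slides: $m = 4$ examples, and a first weak classifier $f_1$ that misclassifies only example 4.

```latex
% Hypothetical toy round: m = 4, f_1 misclassifies only example 4.
\begin{align*}
D_1(i) &= \tfrac{1}{4}, \qquad \epsilon_1 = D_1(4) = \tfrac{1}{4}, \qquad
\alpha_1 = \tfrac{1}{2}\ln\frac{1-\epsilon_1}{\epsilon_1} = \tfrac{1}{2}\ln 3 \approx 0.549 \\
Z_1 &= \tfrac{3}{4}\,e^{-\alpha_1} + \tfrac{1}{4}\,e^{\alpha_1}
     = \tfrac{3}{4}\cdot\tfrac{1}{\sqrt{3}} + \tfrac{1}{4}\cdot\sqrt{3}
     = \tfrac{\sqrt{3}}{2} \approx 0.866 \\
D_2(i) &= \frac{\tfrac{1}{4}\,e^{-\alpha_1}}{Z_1} = \tfrac{1}{6} \quad (i = 1, 2, 3), \qquad
D_2(4) = \frac{\tfrac{1}{4}\,e^{\alpha_1}}{Z_1} = \tfrac{1}{2}
\end{align*}
```

After the update, the single misclassified example carries half of the total weight, so the next weak classifier is pushed to classify it correctly.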


How to sample using a distribution Dt

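[Figure: the weights $D_t(i)$ plotted against the index $i$; a point of coordinates $(U, V)$ is compared with $D_t(U)$]

- Choose randomly an index $U \in \{1, \ldots, m\}$ and a real value $V \in [0, \max_{i \in \{1, \ldots, m\}} D_t(i)]$; if $D_t(U) > V$ then accept the example $(x_U, y_U)$, otherwise draw again.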
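This accept/reject rule could be sketched in C as follows. The function name `sample_index` is hypothetical, indices are 0-based here (the slide uses 1 to m), and the standard `rand()` generator is used for simplicity even though `rand() % m` is only approximately uniform. The sketch also assumes that at least one weight is strictly positive, otherwise the loop would never terminate.

```c
#include <stdlib.h>
#include <stddef.h>

/* Draws one index in {0, ..., m-1} with probability proportional to D[i],
   by rejection sampling. Assumes D has at least one strictly positive
   entry. */
size_t sample_index(const double *D, size_t m)
{
    /* Upper end of the interval in which V is drawn: max_i D[i] */
    double d_max = 0.0;
    for (size_t i = 0; i < m; i++)
        if (D[i] > d_max)
            d_max = D[i];

    for (;;) {
        size_t U = (size_t)rand() % m;                          /* candidate index */
        double V = d_max * ((double)rand() / (double)RAND_MAX); /* V in [0, d_max] */
        if (D[U] > V)
            return U;  /* accept the example of index U */
        /* otherwise reject and draw a new pair (U, V) */
    }
}
```

An example of weight zero is never accepted, and an example whose weight equals the maximum is accepted almost every time it is proposed.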


AdaBoost, geometry interpretation

[Figure: three successive weak classifiers on a two-class sample, with weights α1 = 0.5, α2 = 0.1 and α3 = 0.75, followed by their combination]

Homework
1. If we denote $\forall x,\ H(x) = \sum_{t=1}^T \alpha_t f_t(x)$ and $F(x) = \operatorname{sign}(H(x))$, show that

$$\frac{1}{m} \sum_{i=1}^m \mathbb{1}_{y_i \neq F(x_i)} \le \frac{1}{m} \sum_{i=1}^m e^{-y_i H(x_i)}$$

2. Deduce that

$$\frac{1}{m} \sum_{i=1}^m e^{-y_i H(x_i)} = Z_1 \sum_{i=1}^m D_2(i) \prod_{t > 1} e^{-y_i \alpha_t f_t(x_i)}$$

and

$$\frac{1}{m} \sum_{i=1}^m e^{-y_i H(x_i)} = \prod_{t=1}^T Z_t \qquad (4)$$
3. The minimization of (4) is carried out by minimizing each of its terms. Using the definition of $\epsilon_t$, show that:

$$\forall t,\ Z_t = \epsilon_t e^{\alpha_t} + (1 - \epsilon_t) e^{-\alpha_t}$$

4. Further show that the minimum of the normalisation term with respect to the combination weight $\alpha_t$ is reached for $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$

5. Setting $\gamma_t = \frac{1}{2} - \epsilon_t$, and when $\epsilon_t < \frac{1}{2}$, show that

$$\forall t,\ Z_t = \sqrt{1 - 4\gamma_t^2} \le e^{-2\gamma_t^2}$$

6. Finally show that the empirical misclassification error decreases exponentially to 0

$$\frac{1}{m} \sum_{i=1}^m \mathbb{1}_{y_i \neq F(x_i)} \le \prod_{t=1}^T Z_t \le e^{-2 \sum_{t=1}^T \gamma_t^2}$$

Unconstrained convex optimization

Common convex upper bounds for the misclassification error

Property

- The learning problem is cast into an easier unconstrained convex optimization problem.
- Consider the Taylor expansion of the objective function around its minimiser $w^*$

$$\hat{L}(w) = \hat{L}(w^*) + (w - w^*)^\top \underbrace{\nabla \hat{L}(w^*)}_{=0} + \frac{1}{2} (w - w^*)^\top H (w - w^*) + o(\|w - w^*\|^2)$$

- By Schwarz's theorem the Hessian matrix $H$ is symmetric, so its eigenvectors $(v_i)_{i=1}^d$ can be chosen to form an orthonormal basis.

$$\forall (i, j) \in \{1, \ldots, d\}^2,\ H v_i = \lambda_i v_i, \text{ and } v_i^\top v_j = \begin{cases} +1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

Property (2)
- Every vector $w - w^*$ can be uniquely decomposed in this basis

$$w - w^* = \sum_{i=1}^d q_i v_i$$

- That is to say, in the quadratic approximation,

$$\hat{L}(w) = \hat{L}(w^*) + \frac{1}{2} \sum_{i=1}^d \lambda_i q_i^2$$

- Furthermore the Hessian matrix is positive definite, because of the definition of the global minimum

$$(w - w^*)^\top H (w - w^*) = \sum_{i=1}^d \lambda_i q_i^2 = 2(\hat{L}(w) - \hat{L}(w^*)) \ge 0$$

All the eigenvalues of $H$ are then positive.

Property (3)

- This implies that the level lines of $\hat{L}$, defined as the sets of weight vectors for which $\hat{L}$ is constant, are ellipses.

Gradient descent algorithm

[Rumelhart et al., 1986]
- The gradient descent algorithm is an iterative algorithm that updates the weight vector at each step:

$$\forall t \in \mathbb{N},\ w^{(t+1)} = w^{(t)} - \eta \nabla \hat{L}(w^{(t)})$$

where $\eta > 0$ is the learning rate.

Convergence of the gradient descent algorithm

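- Take the decomposition of any vector $w - w^*$ in the orthonormal basis $(v_i)_{i=1}^d$ formed by the eigenvectors of the Hessian matrix; in the quadratic approximation,

$$\nabla \hat{L}(w) = \sum_{i=1}^d q_i \lambda_i v_i$$

- Let $w^{(t)}$ be the weight vector obtained from $w^{(t-1)}$ after applying the gradient descent rule

$$w^{(t)} - w^{(t-1)} = \sum_{i=1}^d \left(q_i^{(t)} - q_i^{(t-1)}\right) v_i = -\eta \nabla \hat{L}(w^{(t-1)}) = -\eta \sum_{i=1}^d q_i^{(t-1)} \lambda_i v_i$$

- So

$$\forall i \in \{1, \ldots, d\},\ q_i^{(t)} = (1 - \eta \lambda_i)^t q_i^{(0)}$$

and the algorithm converges if

$$\eta < \frac{1}{2 \lambda_{\max}}$$

since then $0 < 1 - \eta \lambda_i < 1$ for every $i$ (the eigenvalues being positive).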
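The closed form above can be checked numerically on a small quadratic. The sketch below is an illustration under simplifying assumptions: the objective is written directly in the eigenbasis of the Hessian, so that coordinate $i$ of the gradient is $\lambda_i q_i$, and the names `gradient_descent_quadratic` and `closed_form` are hypothetical.

```c
#include <stddef.h>

/* Gradient descent on L(q) = 0.5 * sum_i lambda[i] * q[i]^2, expressed in
   the eigenbasis of the Hessian: the i-th coordinate of the gradient is
   lambda[i] * q[i]. The vector q is updated in place. */
void gradient_descent_quadratic(const double *lambda, double *q, size_t d,
                                double eta, size_t steps)
{
    for (size_t t = 0; t < steps; t++)
        for (size_t i = 0; i < d; i++)
            q[i] -= eta * lambda[i] * q[i];
}

/* Closed form of one coordinate after t steps: (1 - eta * lambda)^t * q0. */
double closed_form(double lambda, double q0, double eta, size_t t)
{
    double factor = 1.0;
    for (size_t k = 0; k < t; k++)
        factor *= 1.0 - eta * lambda;
    return factor * q0;
}
```

For example, with eigenvalues 1 and 4 and $\eta = 0.1 < 1/(2 \times 4)$, both coordinates shrink geometrically, the one associated with the larger eigenvalue shrinking faster.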

OK, but how to find a good learning rate?

Line search

At each iteration $t$, at $w^{(t)}$:

- Estimate a descent direction $p_t$ (i.e. $p_t^\top \nabla \hat{L}(w^{(t)}) < 0$)

- Update

$$w^{(t+1)} \leftarrow w^{(t)} + \eta_t p_t$$

where $\eta_t$ is a positive learning rate making $w^{(t+1)}$ acceptable for the next iteration.

Wolfe conditions

- To find the sequence $(w^{(t)})_{t \in \mathbb{N}}$ following the line search rule, the necessary condition

$$\forall t \in \mathbb{N},\ \hat{L}(w^{(t+1)}) < \hat{L}(w^{(t)})$$

is not sufficient to guarantee the convergence of the sequence to the minimiser of $\hat{L}$.

- In two situations, the previous condition is satisfied but there is no convergence.

1. The decrease of $\hat{L}$ is too small with respect to the length of the jumps

Consider the following example: $d = 1$, $\hat{L}(w) = w^2$ with $w^{(0)} = 2$, $(p_t = (-1)^{t+1})_{t \in \mathbb{N}}$ and $(\eta_t = 2 + \frac{3}{2^{t+1}})_{t \in \mathbb{N}}$. The sequence of updates would then be

$$\forall t \in \mathbb{N}^*,\ w^{(t)} = (-1)^t (1 + 2^{-t})$$

which oscillates between values close to $-1$ and $+1$ instead of converging to the minimiser 0.

⇒ Armijo condition: require that for a given $\alpha \in (0, 1)$,

$$\forall t \in \mathbb{N},\ \hat{L}(w^{(t)} + \eta_t p_t) \le \hat{L}(w^{(t)}) + \alpha \eta_t p_t^\top \nabla \hat{L}(w^{(t)})$$
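A short C sketch can show the Armijo condition rejecting the oscillating sequence of this example. It assumes the indexing $w^{(t+1)} = w^{(t)} + \eta_t p_t$ starting at $t = 0$, with $p_t = (-1)^{t+1}$ and $\eta_t = 2 + 3 / 2^{t+1}$; the function names are hypothetical.

```c
/* Objective L(w) = w^2 of the one-dimensional example and its derivative. */
double loss(double w) { return w * w; }
double loss_grad(double w) { return 2.0 * w; }

/* Returns 1 if the step eta along direction p from w satisfies the Armijo
   condition with constant alpha, and 0 otherwise. */
int armijo_holds(double w, double p, double eta, double alpha)
{
    return loss(w + eta * p) <= loss(w) + alpha * eta * p * loss_grad(w);
}

/* Runs the example w(0) = 2, p_t = (-1)^(t+1), eta_t = 2 + 3 / 2^(t+1)
   and returns the first iteration t whose step violates the Armijo
   condition, or -1 if no violation occurs within max_iter iterations. */
int first_armijo_failure(double alpha, int max_iter)
{
    double w = 2.0;
    double pow2 = 2.0;  /* holds 2^(t+1) */
    for (int t = 0; t < max_iter; t++) {
        double p = (t % 2 == 0) ? -1.0 : 1.0;
        double eta = 2.0 + 3.0 / pow2;
        if (!armijo_holds(w, p, eta, alpha))
            return t;
        w += eta * p;
        pow2 *= 2.0;
    }
    return -1;
}
```

With $\alpha = 0.1$, the first step ($t = 0$) is accepted but the step at $t = 1$ is already rejected: the loss decreases from 2.25 to 1.5625 only, whereas the condition requires a value of at most 1.425.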

Armijo condition

[Figure: learning rate values admissible under the Armijo condition]

2. The jumps of the weight vectors are too small

Consider the following example: $d = 1$, $\hat{L}(w) = w^2$ with $w^{(0)} = 2$, $(p_t = -1)_{t \in \mathbb{N}}$ and $(\eta_t = 2^{-(t+1)})_{t \in \mathbb{N}}$. The sequence of updates would then be

$$\forall t \in \mathbb{N}^*,\ w^{(t)} = 1 + 2^{-t}$$

which converges to 1 and not to the minimiser 0.

⇒ Second Wolfe condition: $\exists \beta \in (\alpha, 1)$ such that

$$\forall t \in \mathbb{N},\ p_t^\top \nabla \hat{L}(w^{(t)} + \eta_t p_t) \ge \beta p_t^\top \nabla \hat{L}(w^{(t)})$$
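The same kind of check can be written for this second condition. The sketch below assumes the indexing $w^{(t+1)} = w^{(t)} + \eta_t p_t$ starting at $t = 0$, with $p_t = -1$ and $\eta_t = 2^{-(t+1)}$; the function names are hypothetical.

```c
/* Derivative of the objective L(w) = w^2 of the example. */
double quad_grad(double w) { return 2.0 * w; }

/* Returns 1 if the step eta along direction p from w satisfies the second
   Wolfe condition with constant beta, and 0 otherwise. */
int second_wolfe_holds(double w, double p, double eta, double beta)
{
    return p * quad_grad(w + eta * p) >= beta * p * quad_grad(w);
}

/* Runs the example w(0) = 2, p_t = -1, eta_t = 2^-(t+1) and returns the
   first iteration t whose step violates the second Wolfe condition, or -1
   if no violation occurs within max_iter iterations. */
int first_second_wolfe_failure(double beta, int max_iter)
{
    double w = 2.0;
    double eta = 0.5;  /* eta_0 = 2^-(0+1) */
    for (int t = 0; t < max_iter; t++) {
        if (!second_wolfe_holds(w, -1.0, eta, beta))
            return t;
        w -= eta;    /* w(t+1) = w(t) + eta_t * p_t with p_t = -1 */
        eta *= 0.5;
    }
    return -1;
}
```

With $\beta = 0.8$, the step at $t = 0$ passes, but at $t = 1$ the slope along $p_t$ only moves from $-3$ to $-2.5$, which is not enough: the steps have become too short to make real progress.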


Existence of learning rates verifying Wolfe conditions

- Let $p_t$ be a descent direction of $\hat{L}$ at $w^{(t)}$. Suppose that the function $\psi_t : \eta \mapsto \hat{L}(w^{(t)} + \eta p_t)$ is differentiable and bounded from below; then there exists $\eta_t$ verifying both Wolfe conditions.

Proof:
1. Consider

$$E = \{a \in \mathbb{R}_+ \mid \forall \eta \in ]0, a],\ \hat{L}(w^{(t)} + \eta p_t) \le \hat{L}(w^{(t)}) + \alpha \eta p_t^\top \nabla \hat{L}(w^{(t)})\}$$

As $p_t$ is a descent direction of $\hat{L}$ at $w^{(t)}$, for all $\alpha < 1$ there exists $\bar{a} > 0$ such that

$$\forall \eta \in ]0, \bar{a}],\ \hat{L}(w^{(t)} + \eta p_t) < \hat{L}(w^{(t)}) + \alpha \eta p_t^\top \nabla \hat{L}(w^{(t)})$$
2. So $E \neq \emptyset$. Furthermore, as the function $\psi_t$ is bounded from below, the largest rate in $E$, $\hat{\eta}_t = \sup E$, exists. By continuity of $\psi_t$ we have

$$\hat{L}(w^{(t)} + \hat{\eta}_t p_t) \le \hat{L}(w^{(t)}) + \alpha \hat{\eta}_t p_t^\top \nabla \hat{L}(w^{(t)})$$

3. Let $(\eta_n)_{n \in \mathbb{N}}$ be a sequence converging to $\hat{\eta}_t$ from above, i.e. $\forall n \in \mathbb{N}, \eta_n > \hat{\eta}_t$ and $\lim_{n \to +\infty} \eta_n = \hat{\eta}_t$. As $\forall n \in \mathbb{N}, \eta_n \notin E$, we get

$$\forall n \in \mathbb{N},\ \hat{L}(w^{(t)} + \eta_n p_t) > \hat{L}(w^{(t)}) + \alpha \eta_n p_t^\top \nabla \hat{L}(w^{(t)})$$

So, by continuity,

$$\hat{L}(w^{(t)} + \hat{\eta}_t p_t) = \hat{L}(w^{(t)}) + \alpha \hat{\eta}_t p_t^\top \nabla \hat{L}(w^{(t)})$$

4. We finally get

$$p_t^\top \nabla \hat{L}(w^{(t)} + \hat{\eta}_t p_t) \ge \alpha p_t^\top \nabla \hat{L}(w^{(t)}) \ge \beta p_t^\top \nabla \hat{L}(w^{(t)})$$

where $\beta \in (\alpha, 1)$ and $p_t^\top \nabla \hat{L}(w^{(t)}) < 0$.

⇒ The learning rate $\hat{\eta}_t$ verifies both Wolfe conditions.

Does it work?
Theorem (Zoutendijk)

Let $\hat{L} : \mathbb{R}^d \to \mathbb{R}$ be a differentiable objective function, bounded from below, with a Lipschitz continuous gradient. Let $A$ be an algorithm generating $(w^{(t)})_{t \in \mathbb{N}}$ defined by

$$\forall t \in \mathbb{N},\ w^{(t+1)} = w^{(t)} + \eta_t p_t$$

where $p_t$ is a descent direction of $\hat{L}$ and $\eta_t$ a learning rate verifying both Wolfe conditions. Consider the angle $\theta_t$ between the descent direction $p_t$ and the steepest descent direction $-\nabla \hat{L}(w^{(t)})$:

$$\cos(\theta_t) = \frac{-p_t^\top \nabla \hat{L}(w^{(t)})}{\|\nabla \hat{L}(w^{(t)})\| \times \|p_t\|}$$

Then the following series is convergent

$$\sum_t \cos^2(\theta_t) \|\nabla \hat{L}(w^{(t)})\|^2$$

Proof of Zoutendijk's theorem

1. Using the second Wolfe condition and subtracting $p_t^\top \nabla \hat{L}(w^{(t)})$ from both sides of the inequality, we get
$$\forall t,\ p_t^\top (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)})) \ge (\beta - 1)\left(p_t^\top \nabla \hat{L}(w^{(t)})\right)$$
2. Using the Lipschitz property of the gradient of the objective function (with constant $L$)
$$p_t^\top (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)})) \le \|\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)})\| \times \|p_t\| \le L \|w^{(t+1)} - w^{(t)}\| \times \|p_t\| \le L \eta_t \|p_t\|^2$$

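Proof of Zoutendijk's theorem

3. Combining both inequalities, we get
$$\forall t,\ 0 \le (\beta - 1)\left(p_t^\top \nabla \hat{L}(w^{(t)})\right) \le L \eta_t \|p_t\|^2$$
4. For $\eta_t \ge \frac{\beta - 1}{L} \frac{p_t^\top \nabla \hat{L}(w^{(t)})}{\|p_t\|^2} > 0$ we get from Armijo's condition
$$\hat{L}(w^{(t)}) - \hat{L}(w^{(t+1)}) \ge -\alpha \eta_t\, p_t^\top \nabla \hat{L}(w^{(t)}) \ge \alpha \frac{1 - \beta}{L} \frac{\left(p_t^\top \nabla \hat{L}(w^{(t)})\right)^2}{\|p_t\|^2} \ge \alpha \frac{1 - \beta}{L} \cos^2(\theta_t)\, \|\nabla \hat{L}(w^{(t)})\|^2 \ge 0$$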

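Proof of Zoutendijk's theorem

5. As the objective function is lower bounded and the sequence $(\hat{L}(w^{(t)}))_t$ is decreasing, the series of general term $\hat{L}(w^{(t)}) - \hat{L}(w^{(t+1)}) \ge 0$ is convergent: its partial sums telescope to $\hat{L}(w^{(0)}) - \hat{L}(w^{(T)})$, which is bounded.

6. Hence, the series
$$\sum_t \cos^2(\theta_t)\, \|\nabla \hat{L}(w^{(t)})\|^2$$
is convergent.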

Corollary of Zoutendijk's theorem

Guarantee of convergence

q In the case where the descent direction and the gradient do not become orthogonal:
$$\exists \kappa > 0,\ \exists T,\ \forall t \ge T,\ \cos^2(\theta_t) \ge \kappa$$
q Following Zoutendijk's theorem, the series
$$\sum_t \|\nabla \hat{L}(w^{(t)})\|^2$$
is convergent.
q Hence, the sequence $(\nabla \hat{L}(w^{(t)}))_t$ tends to 0 when $t$ tends to infinity.

Can we do better?
Conjugate gradient method
q The adaptive search of the learning rate with the line
search algorithm does not prevent the oscillations of the
weight vector around the minimiser of the objective
function


Conjugate gradient method

q In the neighbourhood of the minimiser of the objective function, where the quadratic approximation holds:
q Suppose that we have $d$ conjugate directions $\{p_t, t \in [\![0, d-1]\!]\}$
$$\forall (t, t') \in [\![0, d-1]\!]^2,\ t \neq t',\ p_t^\top H p_{t'} = 0$$
q As the Hessian matrix is symmetric positive definite, we can show that the directions $\{p_t\}$ are linearly independent and that they form a basis. We have
$$w^* - w^{(0)} = \sum_{t=0}^{d-1} \eta_t p_t$$
Hence
$$\forall t,\ \eta_t = \frac{p_t^\top H (w^* - w^{(0)})}{p_t^\top H p_t}$$

Conjugate gradient method

q Let
$$w^{(t)} = w^{(0)} + \sum_{i=0}^{t-1} \eta_i p_i$$
We get the following update rule
$$\forall t \in [\![0, d-1]\!],\ w^{(t+1)} = w^{(t)} + \eta_t p_t$$
q From the mutual conjugacy of $(p_t)_{t=0}^{d-1}$:
$$p_t^\top H w^{(t)} = p_t^\top H w^{(0)}$$
Since, under the quadratic approximation, $\nabla \hat{L}(w) = H(w - w^*)$, that is
$$\forall t,\ \eta_t = -\frac{p_t^\top \nabla \hat{L}(w^{(t)})}{p_t^\top H p_t} \quad (5)$$

Conjugate gradient method

q With the previous definition of learning rates, it is simple to show that the current gradient is orthogonal to all the previous descent directions. In fact
$$\forall t,\ \nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}) = H \underbrace{(w^{(t+1)} - w^{(t)})}_{\eta_t p_t}$$
q By multiplying by $p_t^\top$ from the left and by the definition of $\eta_t$
$$\forall t,\ p_t^\top (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)})) = -p_t^\top \nabla \hat{L}(w^{(t)})$$
Which gives
$$\forall t,\ p_t^\top \nabla \hat{L}(w^{(t+1)}) = 0$$

Conjugate gradient method

q For a given index $t \in [\![0, d-1]\!]$, we finally get
$$\forall t', \forall t,\ t < t',\ p_t^\top \nabla \hat{L}(w^{(t')}) = 0 \quad (6)$$
q Hence, if the descent directions are conjugate, after $d$ updates
$$\forall t \in [\![0, d-1]\!],\ w^{(t+1)} = w^{(t)} + \eta_t p_t$$
using the learning rate above, we arrive at a point where the gradient of the objective function is orthogonal to all the descent directions, and which is therefore the minimum.

Conjugate gradient algorithm

q The necessary condition to get the previous result is to have descent directions $(p_t)_{t=0}^{d-1}$ that are mutually conjugate
q Consider the following sequence
$$p_0 = -\nabla \hat{L}(w^{(0)}), \qquad p_{t+1} = -\nabla \hat{L}(w^{(t+1)}) + \beta_t p_t \ \text{ if } t \ge 0$$
q For
$$\forall t,\ \beta_t = \frac{p_t^\top H \nabla \hat{L}(w^{(t+1)})}{p_t^\top H p_t}$$
it is easy to show that the descent directions are mutually conjugate.
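The claim that d conjugate updates reach the minimum can be tried on a small quadratic. The sketch below assumes an objective of the form 0.5·wᵀHw − bᵀw with H symmetric positive definite, so that the gradient is Hw − b; it uses the learning rate of Eq. (5) and the coefficient β_t given above. The function names and the row-major storage of H are assumptions made for illustration.

```c
#include <stddef.h>

/* y = H v for a d x d matrix stored row by row */
static void mat_vec(const double *H, const double *v, double *y, size_t d)
{
    for (size_t i = 0; i < d; i++) {
        y[i] = 0.0;
        for (size_t j = 0; j < d; j++)
            y[i] += H[i * d + j] * v[j];
    }
}

static double dot(const double *a, const double *b, size_t d)
{
    double s = 0.0;
    for (size_t i = 0; i < d; i++)
        s += a[i] * b[i];
    return s;
}

/* Minimise 0.5 * w^T H w - b^T w with at most d conjugate updates.
 * w holds w^(0) on entry and the result on exit.
 * g, p, Hp are caller-provided work arrays of size d. */
void cg_quadratic(const double *H, const double *b, double *w,
                  double *g, double *p, double *Hp, size_t d)
{
    mat_vec(H, w, g, d);
    for (size_t j = 0; j < d; j++) {
        g[j] -= b[j];                        /* gradient H w - b */
        p[j] = -g[j];                        /* p_0 = -gradient */
    }
    for (size_t t = 0; t < d; t++) {
        mat_vec(H, p, Hp, d);
        double pHp = dot(p, Hp, d);
        if (pHp <= 0.0)                      /* p = 0: gradient already vanished */
            break;
        double eta = -dot(p, g, d) / pHp;    /* learning rate of Eq. (5) */
        for (size_t j = 0; j < d; j++) {
            w[j] += eta * p[j];
            g[j] += eta * Hp[j];             /* gradient difference is eta * H p */
        }
        double beta = dot(Hp, g, d) / pHp;   /* p^T H gradient / p^T H p */
        for (size_t j = 0; j < d; j++)
            p[j] = -g[j] + beta * p[j];      /* next conjugate direction */
    }
}
```

For H = [[4, 1], [1, 3]], b = (1, 2) and w⁽⁰⁾ = (0, 0), two updates give w = (1/11, 7/11), which is the solution of Hw = b up to rounding.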

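Conjugate gradient algorithm

q The coefficients $(\beta_t)$ can be estimated without the use of the Hessian matrix (Hestenes and Stiefel, 52)
$$\forall t,\ \beta_t = \frac{\nabla^\top \hat{L}(w^{(t+1)}) H p_t}{p_t^\top H p_t} = \frac{\nabla^\top \hat{L}(w^{(t+1)}) (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}))}{p_t^\top (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}))}$$
Followed by others
$$\forall t,\ \beta_t = \frac{\nabla^\top \hat{L}(w^{(t+1)}) (\nabla \hat{L}(w^{(t+1)}) - \nabla \hat{L}(w^{(t)}))}{\nabla^\top \hat{L}(w^{(t)}) \nabla \hat{L}(w^{(t)})} \quad \text{(Polak and Ribiere, 69)}$$
$$\forall t,\ \beta_t = \frac{\nabla^\top \hat{L}(w^{(t+1)}) \nabla \hat{L}(w^{(t+1)})}{\nabla^\top \hat{L}(w^{(t)}) \nabla \hat{L}(w^{(t)})} \quad \text{(Fletcher and Reeves, 64)}$$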
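The three Hessian-free coefficients differ only in which inner products of gradients they combine. A minimal sketch, assuming the gradients at w⁽ᵗ⁾ and w⁽ᵗ⁺¹⁾ are available as plain arrays `g_old` and `g_new` (the function names are hypothetical):

```c
#include <stddef.h>

/* a^T (g_new - g_old) */
static double dot_diff(const double *a, const double *g_new,
                       const double *g_old, size_t d)
{
    double s = 0.0;
    for (size_t j = 0; j < d; j++)
        s += a[j] * (g_new[j] - g_old[j]);
    return s;
}

static double sq_norm(const double *g, size_t d)
{
    double s = 0.0;
    for (size_t j = 0; j < d; j++)
        s += g[j] * g[j];
    return s;
}

/* Hestenes and Stiefel: also needs the current direction p */
double beta_hestenes_stiefel(const double *g_new, const double *g_old,
                             const double *p, size_t d)
{
    return dot_diff(g_new, g_new, g_old, d) / dot_diff(p, g_new, g_old, d);
}

/* Polak and Ribiere */
double beta_polak_ribiere(const double *g_new, const double *g_old, size_t d)
{
    return dot_diff(g_new, g_new, g_old, d) / sq_norm(g_old, d);
}

/* Fletcher and Reeves */
double beta_fletcher_reeves(const double *g_new, const double *g_old, size_t d)
{
    return sq_norm(g_new, d) / sq_norm(g_old, d);
}
```

None of these functions guards against a zero denominator; a caller would typically fall back to β = 0 (a plain gradient step) in that case.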

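Conjugate gradient algorithm

#include <math.h>
#include <stdlib.h>

// FoncLoss and Gradient are given in the logistic regression program;
// rchln is the line search routine and is assumed to be declared beforehand.
double FoncLoss(double *w, double **X, double *y, long int m, long int d);
double *Gradient(double *w, double **X, double *y, long int m, long int d);

void grdcnj(double **X, double *Y, long int m, long int d, double *w, double epsilon)
{
    long int j;
    double *wold, OldLoss, NewLoss, *g, *p, *h, dgg, ngg, beta;

    // Previous weights and descent direction (d+1 entries, bias included)
    wold = (double *) malloc((d+1) * sizeof(double));
    p    = (double *) malloc((d+1) * sizeof(double));

    // Random initialisation of the weights in [-1, 1]
    for(j=0; j<=d; j++)
        wold[j] = 2.0*(rand() / (double) RAND_MAX) - 1.0;
    NewLoss = FoncLoss(wold, X, Y, m, d);
    OldLoss = NewLoss + 2*epsilon;
    g = Gradient(wold, X, Y, m, d);
    for(j=0; j<=d; j++)
        p[j] = -g[j];                        // p_0 <- -grad L(w^(0))

    while(fabs(OldLoss-NewLoss) > (fabs(OldLoss)*epsilon))
    {
        OldLoss = NewLoss;
        rchln(wold, OldLoss, g, p, w, &NewLoss, X, Y, m, d);  // line search along p
        h = Gradient(w, X, Y, m, d);         // new gradient: grad L(w^(t+1))
        for(dgg=0.0, ngg=0.0, j=0; j<=d; j++){
            dgg += g[j]*g[j];
            ngg += h[j]*h[j];
        }
        beta = (dgg > 0.0) ? ngg/dgg : 0.0;  // Fletcher-Reeves coefficient
        for(j=0; j<=d; j++){
            wold[j] = w[j];
            g[j]    = h[j];
            p[j]    = -g[j] + beta*p[j];     // new descent direction
        }
        free(h);
    }
    free(wold);
    free(p);
    free(g);
}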

Logistic Regression Program

#include <math.h>
#include <stdlib.h>

// Logistic function x -> 1/(1+e^{-x})
double Logistic(double x)
{
    return (1.0/(1.0+exp(-x)));
}

// Estimation of the gradient vector of the logistic loss.
// Examples are indexed from 1 to m and features from 1 to d; w[0] is the bias.
double *Gradient(double *w, double **X, double *y, long int m, long int d)
{
    double ps, *g;
    long int i, j;

    g = (double *) malloc((d+1) * sizeof(double));

    for(j=0; j<=d; j++)
        g[j]=0.0;

    for(i=1; i<=m; i++){
        // ps = w[0] + <w, x_i>
        for(ps=w[0],j=1; j<=d; j++)
            ps+=w[j]*X[i][j];
        g[0]+=(Logistic(y[i]*ps)-1.0)*y[i];
        for(j=1; j<=d; j++)
            g[j]+=(Logistic(y[i]*ps)-1.0)*y[i]*X[i][j];
    }

    for(j=0; j<=d; j++)
        g[j]/=(double) m;

    return(g);
}

Logistic Regression Program

// Empirical logistic loss; same indexing convention as Gradient
double FoncLoss(double *w, double **X, double *y, long int m, long int d)
{
    double S=0.0, ps;
    long int i, j;

    for(i=1; i<=m; i++){
        for(ps=w[0],j=1; j<=d; j++)
            ps+=w[j]*X[i][j];
        S+=log(1.0+exp(-y[i]*ps));
    }
    S/=(double) m;

    return (S);
}

void RegressionLogistique(double *w, DATA TrainingSet, LR_PARAM params)
{
    // Minimization of the logistic loss using the conjugate gradient
    grdcnj(TrainingSet.X, TrainingSet.y, TrainingSet.m, TrainingSet.d, w, params.eps);
}

source: http://ama.liglab.fr/~amini/LR/

Consistency of the ERM principle

Estimation of the generalization error on a test set

q Recall that the examples of a test set are generated i.i.d. with respect to the same probability distribution $\mathcal{D}$ which has generated the training set,
q Consider $f_S$ a function learned over the training set $S$, and let $T = \{(x_i, y_i); i \in \{1, \dots, n\}\}$ be a test set of size $n$,
q The variables $(f_S(x_i), y_i) \mapsto \ell(f_S(x_i), y_i)$ can be considered as independent copies of the same random variable:
$$\mathbb{E}_{T \sim \mathcal{D}^n} \hat{L}_n(f_S, T) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{T \sim \mathcal{D}^n} [\ell(f_S(x_i), y_i)] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(x,y) \sim \mathcal{D}} [\ell(f_S(x), y)] = L(f_S)$$
⇒ The empirical error of $f_S$ on the test set, $\hat{L}_n(f_S, T)$, is an unbiased estimator of its generalization error.

[Hoeffding 63] Inequality

Let $X_1, \dots, X_n$ be independent random variables and define their sum: $S_n = X_1 + \dots + X_n$. Assume that the $X_i$ are almost surely bounded within the interval $[a_i, b_i]$. Then for any $\epsilon > 0$, Theorem 2 of Hoeffding proves the inequalities
$$P(S_n - \mathbb{E}[S_n] \ge \epsilon) \le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$
$$P(|S_n - \mathbb{E}[S_n]| \ge \epsilon) \le 2 \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$

Estimation of the generalization error on a test set

q For each test example $(x_i, y_i)$ let $X_i$ be the random variable $\frac{1}{n} \ell(f_S(x_i), y_i)$
q Further, all the random variables $X_i, i \in \{1, \dots, n\}$ are independent and take values in $\{0, \frac{1}{n}\}$
q By noting that $\hat{L}_n(f_S, T) = \sum_{i=1}^n X_i$ and $L(f_S) = \mathbb{E}\left(\sum_{i=1}^n X_i\right)$, following [Hoeffding 63] we get
$$\forall \epsilon > 0,\ P\left(L(f_S) - \hat{L}_n(f_S, T) > \epsilon\right) \le e^{-2n\epsilon^2}$$
q To better understand this result, let us solve the equation $e^{-2n\epsilon^2} = \delta$ with respect to $\epsilon$; i.e. $\epsilon = \sqrt{\frac{\ln 1/\delta}{2n}}$, and:
$$\forall \delta \in ]0, 1],\ P\left(L(f_S) \le \hat{L}_n(f_S, T) + \sqrt{\frac{\ln 1/\delta}{2n}}\right) \ge 1 - \delta$$

Estimation of the generalization error on a test set

q For a small $\delta$, according to the previous equation, we have the following inequality which holds with high probability over all test sets of size $n$:
$$L(f_S) \le \hat{L}_n(f_S, T) + \sqrt{\frac{\ln 1/\delta}{2n}}$$
q From this result, we have a bound over the generalization error of a learned function which can be estimated using any test set, and in the case where $n$ is sufficiently large, this bound gives a very accurate estimate of the latter.
q Example: suppose that the empirical error of a prediction function $f_S$ over a test set $T$ of size $n = 1000$ is $\hat{L}_n(f_S, T) = 0.23$. For $\delta = 0.01$, i.e. $\sqrt{\frac{\ln(1/\delta)}{2n}} \approx 0.048$, the generalization error of $f_S$ is upper bounded by 0.278 with a probability at least 0.99.

A uniform generalization error bound

q As part of the study of the consistency of the ERM principle, we would now like to establish a uniform bound on the generalization error of a learned function depending on its empirical error over a training set.
q We cannot reach this result by using the same development as previously.
q This is mainly due to the fact that when the learned function $f_S$ has knowledge of the training data $S = \{(x_i, y_i); i \in \{1, \dots, m\}\}$, the random variables $X_i = \frac{1}{m} \ell(f_S(x_i), y_i); i \in \{1, \dots, m\}$ involved in the estimation of the empirical error of $f_S$ on $S$ are all dependent on each other.
⇒ Indeed, if we change an example of the training set, the selected function $f_S$ will also change, as well as the instantaneous errors of all the other examples.

Rademacher complexity [Koltchinskii 01]

q In the derivation of uniform generalization error bounds, different capacity measures of the class of functions have been proposed. Among these, the Rademacher complexity allows an accurate estimate of the capacity of a class of functions, and it depends on the training sample.

q The empirical Rademacher complexity estimates the richness of a function class $\mathcal{F}$ by measuring the degree to which the latter is able to fit random noise on a training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$ of size $m$ generated i.i.d. with respect to a probability distribution $\mathcal{D}$.

Rademacher complexity [Koltchinskii 01]

q This complexity is estimated through Rademacher variables $\sigma = (\sigma_1, \dots, \sigma_m)^\top$ which are independent discrete random variables taking values in $\{-1, +1\}$ with the same probability 1/2, i.e. $\forall i \in \{1, \dots, m\}; P(\sigma_i = -1) = P(\sigma_i = +1) = 1/2$, and is defined as:
$$\hat{R}_m(\mathcal{F}, S) = \frac{2}{m} \mathbb{E}_\sigma \left[\sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(x_i) \,\middle|\, x_1, \dots, x_m\right]$$
q Furthermore, we define the Rademacher complexity of the class of functions $\mathcal{F}$ independently of a given training set by
$$R_m(\mathcal{F}) = \mathbb{E}_{S \sim \mathcal{D}^m} \hat{R}_m(\mathcal{F}, S) = \frac{2}{m} \mathbb{E}_{S\sigma} \left[\sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(x_i)\right]$$
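For a finite class and a handful of points, the expectation over σ in the empirical Rademacher complexity can be computed exactly by enumerating all sign vectors. The sketch below follows the definition above (factor 2/m, no absolute value); restricting to a finite class, the row-major layout of `values` and the function name are assumptions made for illustration.

```c
#include <stddef.h>

/* Exact empirical Rademacher complexity of a finite class:
 * (2/m) * E_sigma [ max_k sum_i sigma_i f_k(x_i) ].
 * values[k * m + i] holds f_k(x_i) for K >= 1 functions and m points.
 * All 2^m sign vectors are enumerated, so m must stay small. */
double empirical_rademacher(const double *values, size_t K, size_t m)
{
    unsigned long n_sigma = 1UL << m;
    double total = 0.0;
    for (unsigned long mask = 0; mask < n_sigma; mask++) {
        double best = 0.0;
        for (size_t k = 0; k < K; k++) {
            double s = 0.0;
            for (size_t i = 0; i < m; i++) {
                /* bit i of mask encodes sigma_i */
                double sigma = ((mask >> i) & 1UL) ? 1.0 : -1.0;
                s += sigma * values[k * m + i];
            }
            if (k == 0 || s > best)
                best = s;                    /* supremum over the class */
        }
        total += best;
    }
    return 2.0 * total / ((double) m * (double) n_sigma);
}
```

A class reduced to a single function cannot adapt to the signs and gets complexity 0, while the class {f, −f} with f = (1, 1) on two points gets complexity 1.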

A uniform generalization error bound

Theorem (Generalization bound with the Rademacher complexity)

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a vector space and $\mathcal{Y} = \{-1, +1\}$ an output space. Suppose that the pairs of examples $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are generated i.i.d. with respect to the probability distribution $\mathcal{D}$. Let $\mathcal{F}$ be a class of functions having values in $\mathcal{Y}$ and $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$ a given instantaneous loss. Then for all $\delta \in ]0, 1]$, we have with probability at least $1 - \delta$ the following inequality:
$$\forall f \in \mathcal{F},\ L(f) \le \hat{L}_m(f, S) + R_m(\ell \circ \mathcal{F}) + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}} \quad (7)$$
and also with probability at least $1 - \delta$
$$L(f) \le \hat{L}_m(f, S) + \hat{R}_m(\ell \circ \mathcal{F}, S) + 3\sqrt{\frac{\ln \frac{2}{\delta}}{2m}} \quad (8)$$
where $\ell \circ \mathcal{F} = \{(x, y) \mapsto \ell(f(x), y) \mid f \in \mathcal{F}\}$.

A uniform generalization error bound (1)

1. Link the supremum of $L(f) - \hat{L}_m(f, S)$ on $\mathcal{F}$ with its expectation

The study of this bound is achieved by linking the supremum appearing in the right hand side of the above inequality with its expectation, through a powerful tool developed for empirical processes by [McDiarmid 89] and known as the theorem of bounded differences.

Let $I \subset \mathbb{R}$ be a real valued interval, and $(X_1, \dots, X_m)$, $m$ independent random variables taking values in $I^m$. Let $\Phi : I^m \to \mathbb{R}$ be defined such that: $\forall i \in \{1, \dots, m\}, \exists c_i \in \mathbb{R}$, the following inequality holds for any $(x_1, \dots, x_m) \in I^m$ and $\forall x' \in I$:
$$|\Phi(x_1, \dots, x_{i-1}, x_i, x_{i+1}, \dots, x_m) - \Phi(x_1, \dots, x_{i-1}, x', x_{i+1}, \dots, x_m)| \le c_i$$
We have then
$$\forall \epsilon > 0,\ P(\Phi(x_1, \dots, x_m) - \mathbb{E}[\Phi] > \epsilon) \le \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2}\right)$$

A uniform generalization error bound (1)

1. Link the supremum of $L(f) - \hat{L}_m(f, S)$ on $\mathcal{F}$ with its expectation

Consider the following function
$$\Phi : S \mapsto \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$$
McDiarmid's inequality can then be applied to the function $\Phi$ with $c_i = 1/m, \forall i$, thus:
$$\forall \epsilon > 0,\ P\left(\sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)] - \mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)] > \epsilon\right) \le e^{-2m\epsilon^2}$$

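A uniform generalization error bound (2)

2. Bound $\mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$ with respect to $R_m(\ell \circ \mathcal{F})$

This step is a symmetrisation step: it consists in introducing a second virtual sample $S'$, also generated i.i.d. with respect to $\mathcal{D}^m$, into $\mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$.

→ Since $L(f) = \mathbb{E}_{S'} \hat{L}_m(f, S')$,
$$\mathbb{E}_S \sup_{f \in \mathcal{F}} (L(f) - \hat{L}_m(f, S)) = \mathbb{E}_S \sup_{f \in \mathcal{F}} [\mathbb{E}_{S'} (\hat{L}_m(f, S') - \hat{L}_m(f, S))] \le \mathbb{E}_S \mathbb{E}_{S'} \sup_{f \in \mathcal{F}} [\hat{L}_m(f, S') - \hat{L}_m(f, S)]$$
→ On the other hand,
$$\mathbb{E}_S \mathbb{E}_{S'} \sup_{f \in \mathcal{F}} [\hat{L}_m(f, S') - \hat{L}_m(f, S)] = \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_\sigma \left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i (\ell(f(x'_i), y'_i) - \ell(f(x_i), y_i))\right]$$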

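A uniform generalization error bound (2)

2. Bound $\mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$ with respect to $R_m(\ell \circ \mathcal{F})$

By applying the triangle inequality ($\sup = \|\cdot\|_\infty$) we get
$$\mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_\sigma \left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i (\ell(f(x'_i), y'_i) - \ell(f(x_i), y_i))\right] \le \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i \ell(f(x'_i), y'_i) + \mathbb{E}_S \mathbb{E}_{S'} \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m (-\sigma_i) \ell(f(x_i), y_i)$$
Finally, as $\forall i$, $\sigma_i$ and $-\sigma_i$ have the same distribution, we have
$$\mathbb{E}_S \mathbb{E}_{S'} \sup_{f \in \mathcal{F}} [\hat{L}_m(f, S') - \hat{L}_m(f, S)] \le \underbrace{2\, \mathbb{E}_S \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i \ell(f(x_i), y_i)}_{\le R_m(\ell \circ \mathcal{F})} \quad (9)$$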

A uniform generalization error bound (2)

2. Bound $\mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$ with respect to $R_m(\ell \circ \mathcal{F})$

Summarizing the results obtained so far, we have:

1. $\forall f \in \mathcal{F}, \forall S,\ L(f) - \hat{L}_m(f, S) \le \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)]$
2. $\forall \epsilon > 0,\ P\left(\sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)] - \mathbb{E}_S \sup_{f \in \mathcal{F}} [L(f) - \hat{L}_m(f, S)] > \epsilon\right) \le e^{-2m\epsilon^2}$
3. $\mathbb{E}_S \sup_{f \in \mathcal{F}} (L(f) - \hat{L}_m(f, S)) \le R_m(\ell \circ \mathcal{F})$

The first point of Theorem 2 is obtained by solving the equation $e^{-2m\epsilon^2} = \delta$ with respect to $\epsilon$.

A uniform generalization error bound (3)

3. Bound $R_m(\ell \circ \mathcal{F})$ with respect to $\hat{R}_m(\ell \circ \mathcal{F}, S)$

→ Apply McDiarmid's inequality to the function $\Phi : S \mapsto \hat{R}_m(\ell \circ \mathcal{F}, S)$
$$\forall \epsilon > 0,\ P(R_m(\ell \circ \mathcal{F}) > \hat{R}_m(\ell \circ \mathcal{F}, S) + \epsilon) \le e^{-m\epsilon^2/2}$$
Thus for $\delta/2 = e^{-m\epsilon^2/2}$, we have with probability at least equal to $1 - \delta/2$:
$$R_m(\ell \circ \mathcal{F}) \le \hat{R}_m(\ell \circ \mathcal{F}, S) + 2\sqrt{\frac{\ln \frac{2}{\delta}}{2m}}$$
From the first point (Eq. 7) of Theorem 2, we also have with probability at least equal to $1 - \delta/2$:
$$\forall f \in \mathcal{F}, \forall S,\ L(f) \le \hat{L}_m(f, S) + R_m(\ell \circ \mathcal{F}) + \sqrt{\frac{\ln \frac{2}{\delta}}{2m}}$$
The second point (Eq. 8) of Theorem 2 is then obtained by combining the two previous results using the union bound.

Structural Risk Minimization

[Figure: curves of complexity, empirical error, and empirical error + complexity]

Structural Risk Minimization (2)

Image from : http://www.svms.org/srm/

Regularization

q Find a predictor by minimising the empirical risk with an added penalty for the size of the model,
q A simple approach consists in choosing a large class of functions $\mathcal{F}$, defining on $\mathcal{F}$ a regularizer, typically a norm $\|f\|$, and then minimizing the regularized empirical risk
$$\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}}\ \hat{L}_m(f, S) + \underbrace{\gamma}_{\text{hyperparameter}} \times \|f\|^2$$
q The hyperparameter, or regularisation parameter, allows one to choose the right trade-off between fit and complexity.

K-fold cross validation

q Create a K-fold partition of the dataset
q For each of the K experiments, use K − 1 folds for training and a different fold for testing; this procedure is illustrated in the following figure for K = 4

[Figure: 4-fold cross validation; in each of the four experiments (Crossval. 1 to 4) a different fold is held out for testing and the three remaining folds are used for training]

q The value of the hyperparameter corresponds to the value of $\gamma_k$ for which the testing performance is the highest on one of the folds.
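The K-fold partition itself is easy to sketch. The code below assumes that the m examples have already been shuffled and are indexed from 0, and assigns them to K contiguous folds whose sizes differ by at most one; the function names are hypothetical.

```c
#include <stddef.h>

/* Fold (0..K-1) of example i when m examples are split into K contiguous,
 * nearly equal folds. */
size_t fold_of(size_t i, size_t m, size_t K)
{
    return (i * K) / m;
}

/* Fill is_test[i] with 1 if example i belongs to the held-out fold `fold`
 * and 0 if it is used for training. Returns the number of test examples. */
size_t kfold_split(size_t m, size_t K, size_t fold, int *is_test)
{
    size_t n_test = 0;
    for (size_t i = 0; i < m; i++) {
        is_test[i] = (fold_of(i, m, K) == fold);
        n_test += (size_t) is_test[i];
    }
    return n_test;
}
```

Running `kfold_split` for fold = 0, …, K − 1 yields the K train/test experiments; each example is used for testing exactly once.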

In summary

q For induction, we should control the capacity of the class of functions.

q The study of the consistency of the ERM principle led to the second fundamental principle of machine learning called structural risk minimization (SRM).

q Learning is a compromise between a low empirical risk and a high capacity of the class of functions in use.

Homework

q Randomly split each of the following datasets into 60% training and 40% test sets
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
http://archive.ics.uci.edu/ml/datasets/Ionosphere
http://archive.ics.uci.edu/ml/datasets/Mushroom

q Train each of Perceptron, Adaline, Logistic Regression and AdaBoost with perceptron on the training sets

q Compare their accuracy on the test sets

Multiclass Classification

Real-life classification applications

q In most real-life classification applications the number of classes is more than two.

Multi-class classification problem

q There are two cases to be distinguished:

q the mono-label case, where each example is labeled with a single class. In this case the output space $\mathcal{Y}$ is a finite set of classes, generally marked with numbers for convenience: $\mathcal{Y} = \{1, \dots, K\}$,
q the multi-label case, where each example can be labeled with several classes; $\mathcal{Y} = \{-1, +1\}^K$.
q In both cases, the learning algorithm takes as input a labeled training set $S = \{(x_1, y_1), \dots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$ where pairs of examples $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are supposed i.i.d. with respect to an unknown yet fixed probability distribution $\mathcal{D}$.

Multi-class classification problem

q The aim of learning is then to find a prediction function from $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$ with the lowest generalization error:
$$L(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}} [\ell(f(x), y)], \quad (10)$$
where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is the instantaneous classification error, and $f(x) = (f_1(x), \dots, f_K(x)) \in \mathcal{Y}$ is a predicted output vector for example $x$ in the multi-label case, or the class label of $x$ in the mono-label case.
q In the multi-label case, the instantaneous error is based on the Hamming distance, which counts the number of different components in the predicted, $f(x)$, and the true, $y$, class labels for $x$.
$$\ell(f(x), y) = \frac{1}{2} \sum_{k=1}^K (1 - y_k f_k(x))$$

Multi-class classification problem

q In the mono-label case, the instantaneous error is simply:
$$\ell(f(x), y) = \mathbb{1}_{f(x) \neq y}$$
q As in binary classification, the prediction function is found according to the Empirical Risk Minimization principle using a training set $S = \{(x_i, y_i); i \in \{1, \dots, m\}\} \in (\mathcal{X} \times \mathcal{Y})^m$:
$$f^* = \operatorname{argmin}_{f \in \mathcal{F}} \hat{L}_m(f, S) = \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \ell(f(x_i), y_i)$$

Multi-class classification problem

q In practice, the learned function $h$ is defined as:
$$h : \mathbb{R}^d \to \mathbb{R}^K, \quad x \mapsto (h(x, 1), \dots, h(x, K))$$
where $h \in \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$, by minimizing a convex differentiable upper bound of the empirical error.

q For an example $x$, the prediction is hence obtained by thresholding the outputs $h(x, k), k \in \{1, \dots, K\}$ in the multi-label case, or by taking the class index giving the highest prediction in the mono-label case:
$$\forall x,\ f_h(x) = \operatorname{argmax}_{k \in \{1, \dots, K\}} h(x, k)$$

Approaches

q Combined approaches (on the basis of binary classification)

q One-versus-All (OvA),
q One-versus-One (OvO),
q Error-correction codes (ECOC).

q Uncombined approaches
q K-Nearest Neighbour (K-NN),
q Generative models,
q Discriminative models (MLP, M-SVM, M-AdaBoost, etc.)

Combined approach - OvA

Algorithm 5 The OvA approach

Training set $S = \{(x_i, y_i) \mid i \in \{1, \dots, m\}\}$
for each k = 1 . . . K do
    $\tilde{S} \leftarrow \emptyset$
    for each i = 1 . . . m do
        if $y_i = k$ then
            $\tilde{S} \leftarrow \tilde{S} \cup \{(x_i, +1)\}$
        else
            $\tilde{S} \leftarrow \tilde{S} \cup \{(x_i, -1)\}$
        end if
    end for each
    Learn a classifier $h_k : \mathcal{X} \to \mathbb{R}$ on $\tilde{S}$;
end for each
The final classifier: $\forall x \in \mathcal{X},\ f(x) = \operatorname{argmax}_{k \in \{1, \dots, K\}} h_k(x)$
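The two parts of the OvA scheme that do not depend on the binary learner can be sketched as follows: relabelling the training set for the k-th binary problem, and taking the argmax over the K real-valued outputs. Classes are assumed to be numbered 1 to K as in the text; the function names and the array layout are illustrative assumptions.

```c
#include <stddef.h>

/* Labels of the k-th OvA binary problem: +1 for class k, -1 otherwise. */
void ova_relabel(const int *y, size_t m, int k, int *y_binary)
{
    for (size_t i = 0; i < m; i++)
        y_binary[i] = (y[i] == k) ? +1 : -1;
}

/* Final decision: scores[k - 1] holds h_k(x) for k = 1..K.
 * Returns the class in 1..K with the highest score (first one on ties). */
int ova_predict(const double *scores, int K)
{
    int best = 1;
    for (int k = 2; k <= K; k++)
        if (scores[k - 1] > scores[best - 1])
            best = k;
    return best;
}
```

Any binary classifier with real-valued outputs, such as the logistic regression program given earlier, could be trained on each relabelled set.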


Combined approach - OvO

Algorithm 6 The OvO approach

Training set $S = \{(x_i, y_i) \mid i \in \{1, \dots, m\}\}$
for each k = 1 . . . K − 1 do
    for each ℓ = k + 1 . . . K do
        $\tilde{S} \leftarrow \emptyset$
        for each i = 1 . . . m do
            if $y_i = k$ then
                $\tilde{S} \leftarrow \tilde{S} \cup \{(x_i, +1)\}$
            else if $y_i = \ell$ then
                $\tilde{S} \leftarrow \tilde{S} \cup \{(x_i, -1)\}$
            end if
        end for each
        Learn a classifier $h_{k\ell} : \mathcal{X} \to \mathbb{R}$ on $\tilde{S}$;
    end for each
end for each
The final classifier: $\forall x \in \mathcal{X},\ f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \left|\{y' \in \mathcal{Y}, y' \neq y \mid f_{yy'}(x) = +1\}\right|$
where, $\forall x \in \mathcal{X}, \forall (y, y') \in \mathcal{Y}^2, y \neq y'$: $f_{yy'}(x) = \operatorname{sgn}(h_{yy'}(x))$ if $y < y'$, and $f_{yy'}(x) = -f_{y'y}(x)$ if $y' < y$


Combined approach - ECOC

This technique is composed of three steps:
q Each class $k \in \{1, \dots, K\}$ is first coded (or represented) by a code word which is generally a binary vector of length $n$, $D_k \in \{-1, +1\}^n$,
q With the resulting matrix of codes $D \in \{-1, +1\}^{K \times n}$, $n$ binary classifiers $(f_j)_{j=1}^n$ are learned after creating $n$ training sets $\tilde{S}_j$ from the initial training set $S$:
$\forall (x, y) \in S, \forall j \in \{1, \dots, n\}$, the associated example in $\tilde{S}_j$ is $(x, D_y(j))$
q To predict the class of an example $x$, let $f(x)$ denote the vector $f(x) = (f_1(x), \dots, f_n(x))$; the associated class is the one having the lowest Hamming distance with the row vectors of $D$:
$$\forall x \in \mathcal{X},\ y^* = \operatorname{argmin}_{k \in \{1, \dots, K\}} \frac{1}{2} \sum_{j=1}^n (1 - \operatorname{sgn}(D_k(j) f_j(x)))$$
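The decoding step can be sketched directly from the formula above. The code matrix is assumed to be stored row by row as integers in {−1, +1}, sgn(0) is taken as +1, and ties are broken in favour of the smallest class index; these conventions and the function name are assumptions made for illustration.

```c
#include <stddef.h>

/* ECOC decoding: D is a K x n code matrix (row-major, entries -1 or +1),
 * f holds the n real-valued outputs f_1(x)..f_n(x).
 * Returns the class in 1..K whose code word is closest to sgn(f(x))
 * in Hamming distance. */
int ecoc_decode(const int *D, size_t K, size_t n, const double *f)
{
    int best = 1;
    double best_dist = 0.0;
    for (size_t k = 0; k < K; k++) {
        double dist = 0.0;
        for (size_t j = 0; j < n; j++) {
            double agree = ((double) D[k * n + j] * f[j] >= 0.0) ? 1.0 : -1.0;
            dist += 0.5 * (1.0 - agree);     /* adds 1 when the signs differ */
        }
        if (k == 0 || dist < best_dist) {
            best_dist = dist;
            best = (int) k + 1;
        }
    }
    return best;
}
```

With longer code words, a few wrong binary predictions can still lead to the right class, provided the rows of D are far enough apart in Hamming distance.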

Uncombined approach - K-NN

q The K-Nearest Neighbors algorithm is a non-parametric method used for classification,
q The input consists of the K closest training examples in the feature space.
q For each observation $x \in \mathcal{X}$ the class membership is decided by a majority vote of its neighbours.
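A minimal sketch of the majority vote, assuming the squared Euclidean distance, classes numbered 1 to `n_classes`, and training examples stored row by row in a flat array; the neighbours are found by repeated linear scans, which is simple but costs O(K·m·d) per query. The function name and signature are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* K-NN prediction for one query x (d features).
 * X: m x d training matrix (row-major), y: labels in 1..n_classes.
 * Returns the majority class among the K nearest examples
 * (smallest label on ties), or -1 if memory allocation fails. */
int knn_predict(const double *X, const int *y, size_t m, size_t d,
                const double *x, size_t K, int n_classes)
{
    char *used  = calloc(m, sizeof *used);
    int  *votes = calloc((size_t) n_classes + 1, sizeof *votes);
    int best = -1;
    if (used == NULL || votes == NULL) {
        free(used);
        free(votes);
        return -1;
    }
    for (size_t n = 0; n < K && n < m; n++) {
        size_t nearest = m;                  /* m means "none found yet" */
        double nearest_dist = 0.0;
        for (size_t i = 0; i < m; i++) {
            if (used[i])
                continue;
            double dist = 0.0;
            for (size_t j = 0; j < d; j++) {
                double diff = X[i * d + j] - x[j];
                dist += diff * diff;
            }
            if (nearest == m || dist < nearest_dist) {
                nearest = i;
                nearest_dist = dist;
            }
        }
        used[nearest] = 1;                   /* do not pick it again */
        votes[y[nearest]]++;
    }
    for (int c = 1; c <= n_classes; c++)
        if (best == -1 || votes[c] > votes[best])
            best = c;
    free(used);
    free(votes);
    return best;
}
```

There is no training phase: the whole training set is kept and scanned at prediction time.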

Uncombined approach - Generative models

q Estimate $p(y)$ and $p(x \mid y)$ by maximizing the complete log-likelihood,
q For prediction, use the Bayes rule $p(y \mid x) \propto p(y) \times p(x \mid y)$
q Assign an observation $x$ to $y^* = \operatorname{argmax}_y p(y \mid x)$

Figure from Duda, Hart and Stork (Pattern Classification)

Uncombined approach - MLP

q Multi-Layer Perceptron is a feed-forward model

[Figure: feed-forward network with an input layer $x_0$ (bias), $x_1, \dots, x_d$; a hidden layer $z_0$ (bias), $z_1, \dots, z_\ell$; and an output layer $y_1, \dots, y_K$]

Uncombined approach - MLP

For the model above, the value of the jth unit of the hidden layer for an observation $x = (x_i)_{i=1 \dots d}$ in input is obtained by composition:
q of a dot product $a_j$ between $x$ and the weight vector $w^{(1)}_{j.} = (w^{(1)}_{ji})_{i=1,\dots,d}; j \in \{1, \dots, \ell\}$ linking the features of $x$ to this jth unit, plus the bias parameter $w^{(1)}_{j0}$ (with $x_0 = 1$):
$$\forall j \in \{1, \dots, \ell\},\ a_j = \langle w^{(1)}_{j.}, x \rangle + w^{(1)}_{j0} = \sum_{i=0}^d w^{(1)}_{ji} x_i$$
q and a bounded transfer function $\bar{H}(.) : \mathbb{R} \to \mathbb{R}$:
$$\forall j \in \{1, \dots, \ell\},\ z_j = \bar{H}(a_j)$$

Uncombined approach - MLP

q The values of the units of the output $(h_1, \dots, h_K)$ are obtained in the same manner between the vector of the hidden layer $z_j, j \in \{0, \dots, \ell\}$ and the weights linking this layer to the output $w^{(2)}_{k.} = (w^{(2)}_{kj})_{j=1,\dots,\ell}; k \in \{1, \dots, K\}$,
q the predicted output for an observation $x$ is a composite transformation of the input, which for the previous model is
$$\forall x, \forall k \in \{1, \dots, K\},\ h(x, k) = \bar{H}(a_k) = \bar{H}\left(\sum_{j=0}^{\ell} w^{(2)}_{kj} \times \bar{H}\left(\sum_{i=0}^d w^{(1)}_{ji} \times x_i\right)\right)$$

Uncombined approach - MLP

q An efficient way to estimate the parameters of NNs is the backpropagation algorithm [Rumelhart et al., 1986],
q For the mono-label classification case, an indicator vector is associated to each class
$$\forall (x, y) \in \mathcal{X} \times \mathcal{Y},\ y = k \Leftrightarrow y^\top = \Big(\underbrace{y_1, \dots, y_{k-1}}_{\text{all equal to } 0}, \underbrace{y_k}_{=1}, \underbrace{y_{k+1}, \dots, y_K}_{\text{all equal to } 0}\Big)$$
q After the phase of propagation of information for an example $(x, y)$, an error is estimated between the prediction and the desired output
$$\forall (x, y),\ \ell(h(x), y) = \frac{1}{2} \|h(x) - y\|^2 = \frac{1}{2} \times \sum_{k=1}^K (h_k - y_k)^2$$

Uncombined approach - MLP

q And the weights are corrected accordingly from the output to the input using the gradient descent algorithm
$$w_{ji} \leftarrow w_{ji} - \eta \frac{\partial \ell(h(x), y)}{\partial w_{ji}}$$
q Using the chain rule
$$\frac{\partial \ell(h(x), y)}{\partial w_{ji}} = \underbrace{\frac{\partial \ell(h(x), y)}{\partial a_j}}_{=\delta_j} \frac{\partial a_j}{\partial w_{ji}}$$
where $\frac{\partial a_j}{\partial w_{ji}} = z_i$.
q In the case where the unit $j$ is on the output layer, we have
$$\delta_j = \frac{\partial \ell(h(x), y)}{\partial a_j} = \bar{H}'(a_j) \times (h_j - y_j).$$

Uncombined approach - MLP

q If the unit $j$ is on the hidden layer, we have by applying the chain rule again:
$$\delta_j = \frac{\partial \ell(h(x), y)}{\partial a_j} = \sum_{l \in Af(j)} \frac{\partial \ell(h(x), y)}{\partial a_l} \frac{\partial a_l}{\partial a_j} = \bar{H}'(a_j) \sum_{l \in Af(j)} \delta_l \times w_{lj}$$
where $Af(j)$ is the set of units that are on the layer which succeeds the one containing unit $j$.

Backpropagation in one glance

[Figure: a unit $j$ between the preceding units $Be(j)$ and the following units $Af(j)$. Propagation: $z_j = \bar{H}\left(\sum_{i \in Be(j)} w_{ji} z_i\right)$. Backpropagation of error: $\delta_j = \bar{H}'(a_j) \sum_{l \in Af(j)} \delta_l w_{lj}$]

Figure from Amini (Apprentissage Machine)

References

Massih-Reza Amini
Apprentissage Machine de la théorie à la pratique
éditions Eyrolles, 2015.
Translated in Chinese: iTuring edition, 2018

R.O. Duda, P.E. Hart, D.G. Stork
Pattern Classification
2000.

T. Hastie, R. Tibshirani, J. Friedman
The Elements of Statistical Learning
2009.

W. Hoeffding
Probability inequalities for sums of bounded random variables
Journal of the American Statistical Association, 58:13–30,
1963.

C. McDiarmid
On the method of bounded differences
Surveys in combinatorics, 141:148–188,
1989.

V. Koltchinskii
Rademacher penalties and structural risk minimization
IEEE Transactions on Information Theory, 47(5):1902–1914,
2001.

Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar
Foundations of Machine Learning
2012.

A.B. Novikoff
On convergence proofs on perceptrons.
Symposium on the Mathematical Theory of Automata, 12: 615–622.
1962

F. Rosenblatt
The perceptron: A probabilistic model for information storage and organization in the brain.
Psychological Review, 65: 386–408.
1958

D. E. Rumelhart, G. E. Hinton and R. Williams
Learning internal representations by error propagation.
Parallel Distributed Processing: Explorations in the Microstructure of Cognition,
1986

R.E. Schapire
Theoretical views of boosting and applications.
In Proceedings of the 10th International Conference on Algorithmic Learning Theory, pages 13–25.
1999

B. Widrow and M. Hoff
Adaptive switching circuits.
Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, 4: 96–104, 1960.

V. Vapnik.
The nature of statistical learning theory.
Springer-Verlag, 1998.

Python Modules Guide
No ratings yet
Python Modules Guide
9 pages
r05320505 Neural Networks
100% (2)
r05320505 Neural Networks
5 pages
Module I
No ratings yet
Module I
109 pages
Machine Learning Introduction
100% (1)
Machine Learning Introduction
20 pages
Information Retrieval Course
No ratings yet
Information Retrieval Course
24 pages
Self-Serving Bias & Social Psychology Test
No ratings yet
Self-Serving Bias & Social Psychology Test
3 pages
ML - Unit I - Final
No ratings yet
ML - Unit I - Final
132 pages
Machine Learning FDP: Real-World Applications
No ratings yet
Machine Learning FDP: Real-World Applications
5 pages
R22 B.tech CSE 1 1 Sem Syllabus
No ratings yet
R22 B.tech CSE 1 1 Sem Syllabus
20 pages
AI Assignment
No ratings yet
AI Assignment
6 pages
Single-Layer Perceptrons Guide
No ratings yet
Single-Layer Perceptrons Guide
11 pages
MLUnit 1
No ratings yet
MLUnit 1
131 pages
Machine Learning Process Lifecycle: Talat@amii - Ca Luke@amii - Ca Shazan@amii - Ca Sankalp@amii - Ca
No ratings yet
Machine Learning Process Lifecycle: Talat@amii - Ca Luke@amii - Ca Shazan@amii - Ca Sankalp@amii - Ca
13 pages
Machine Learning: Instructor: Prof. Ayesha
No ratings yet
Machine Learning: Instructor: Prof. Ayesha