
05 Ensemble Learning

Ensemble learning involves combining multiple simple classifiers to obtain an ensemble classifier with better performance. There are two key components to ensemble learning: 1) Creating a set of diverse classifiers using techniques like bagging, boosting, and AdaBoost that generate different training datasets or parameters, and 2) Combining the predictions of individual classifiers, with techniques like majority voting that strengthen correct predictions and weaken incorrect ones. AdaBoost in particular iteratively trains classifiers on weighted versions of the training data, where misclassified examples receive higher weight, to focus newer classifiers on correcting mistakes.


Ensemble learning

Ensembles of simple classifiers

Ensembles: Boosting, Weak and Strong Learning, AdaBoost
Ensemble learning
◼ ENSEMBLE
❑ group, set (of classifiers)

◼ ENSEMBLE LEARNING
❑ learning a set of classifiers

◼ also called ENSEMBLE BASED SYSTEM

AIDP, M.Oravec, ÚIM FEI STU


Main literature
◼ R. Polikar: Ensemble Based Systems in Decision Making
❑ IEEE Circuits and Systems Magazine, vol.6, no.3, pp. 21-45, 2006

◼ http://users.rowan.edu/~polikar/RESEARCH/PUBLICATIONS/csm06.pdf



Ensembles (in Slovak: súbory)
◼ „ensemble based systems“
❑ we often seek a second opinion before making a decision,
sometimes a third, and sometimes many more
❑ we weigh the individual opinions and combine them through
some thought process to reach a final decision (the process of
consulting “several experts” before making a final decision)

❑ other names:
◼ multiple classifier systems
◼ committee of classifiers
◼ mixture of experts



Principle
◼ one of the possibilities:

The training data is split into Data 1, Data 2, …, Data m; each subset is given to Learner 1, Learner 2, …, Learner m, producing Model 1, Model 2, …, Model m; the combination of the models yields the final model.
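
A minimal sketch of this principle; the synthetic dataset, the decision-tree learners and the plain majority-vote combiner are illustrative choices, not prescribed by the slides:

# Ensemble principle: split the training data, train one simple learner per
# part, then combine the resulting models by majority voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
rng = np.random.default_rng(0)

m = 5                                                     # number of learners
models = []
for chunk in np.array_split(rng.permutation(len(X)), m):  # Data 1 ... Data m
    learner = DecisionTreeClassifier(max_depth=3, random_state=0)
    models.append(learner.fit(X[chunk], y[chunk]))        # Learner i -> Model i

# Combination of the models: majority vote over the m predictions
votes = np.stack([model.predict(X) for model in models])  # shape (m, n_samples)
final = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("ensemble training accuracy:", (final == y).mean())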



Why to use ensemble based systems
◼ statistical reasons
❑ good classification of training data does not mean good behavior for test data
(we also know from neural networks)
❑ a combination (averaging) of several classifiers will help

◼ large amounts of data


❑ dividing the data into smaller parts, training the classifiers, then combining
their outputs

◼ small amounts of data


❑ „resampling" - creation of several randomly overlapping subsets of data,
training of classifiers, then a combination of their outputs



Why to use ensemble based systems
◼ too demanding a task for one classifier
❑ divide and rule (divide et impera) - dividing space into smaller (and less
demanding) parts, classifiers for these simpler parts, then their combination



Why to use ensemble based systems

◼ data fusion
◼ several data sets from different sources (heterogeneous data)
❑ data from each modality to the appropriate classifier, then a
combination



Generating individual classifiers
◼ generally two types of combination

1. classifier selection
◼ each classifier is an expert for a certain subspace
◼ combination:
❑ the classifier closest (based on the metric) to the input vector has the
highest weight
❑ several such local experts will be allowed to vote

2. classifier fusion
◼ the whole set of classifiers learns the whole space
◼ combination:
❑ the combination of individual (WEAK) classifiers creates one (STRONG)
expert with the best performance
❑ e.g. bagging, boosting, ...



Diversity

◼ diversity
❑ strategy for ensemble based systems:
◼ create many classifiers, combine their outputs
◼ the overall performance will be better than for one classifier
❑ individual classifiers must make errors on different examples (each
classifier should be unique)

◼ ensemble of diverse classifiers:
❑ classifiers whose decision boundaries are different



How to achieve classifier diversity?
1) using different training datasets to train individual classifiers
◼ a) resampling techniques – bootstrapping, bagging (training data subsets are
drawn randomly with replacement)

How to achieve classifier diversity?
◼ b) k-fold data split (in Slovak: k-násobné delenie dát)
❑ k different overlapping data sets; the subsets are drawn without replacement
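
A small sketch contrasting the two subset-generation schemes above; the sizes and names are illustrative only:

# Diverse training subsets: bootstrap samples are drawn WITH replacement,
# k-fold style subsets are built WITHOUT replacement.
import numpy as np

rng = np.random.default_rng(1)
n = 12                                    # pretend we have 12 training examples
indices = np.arange(n)

# a) bootstrapping / bagging: each subset has n indices, duplicates allowed
bootstrap_subsets = [rng.choice(indices, size=n, replace=True) for _ in range(3)]

# b) k-fold data split: k overlapping subsets, each one leaving a different
#    fold out; within a subset every index appears at most once
k = 3
folds = np.array_split(rng.permutation(indices), k)
kfold_subsets = [np.concatenate(folds[:i] + folds[i + 1:]) for i in range(k)]

print(bootstrap_subsets[0])               # duplicates are likely here
print(kfold_subsets[0])                   # no duplicates here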
How to achieve classifier diversity?
2) using different training parameters for individual classifiers
❑ e.g. set of MLPs, different initializations, different configurations, different
required errors ...
❑ the instability of individual MLPs can be exploited to obtain diversity

3) a combination of completely different types of classifiers


❑ e.g. MLP + SVM + decision trees + nearest neighbor classifiers

4) by using different features

❑ so-called random subspace method

◼ Diversity is most often achieved through 1)



Two key components of ensemble based systems

1. strategy for creating a set of classifiers with maximum diversity
◼ bagging, boosting, AdaBoost, ...

2. strategy for combining classifier outputs
◼ the right decisions are strengthened, the wrong ones are discarded



1) Strategy for creating a set of classifiers with maximum diversity

bagging, boosting, AdaBoost, ...



Weak learner, base classifier

◼ weak learner
❑ classifier which is to be learnt

◼ Base Classifier (BC)


❑ a simple classifier that is able to classify any input sample better
than randomly (probability of success greater than 50%)



Bagging
◼ bagging = bootstrap aggregating
◼ bootstrapped replicas of the training
data
◼ combination of outputs - by majority
voting
◼ suitable for small datasets
◼ large portions of the samples (75% to
100%) are drawn into each subset
◼ neural networks and decision trees are
suitable classifiers
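
A minimal bagging sketch along these lines; the decision-tree base classifier, the 80% replica size and the helper name bagging_predict are my choices, not fixed by the slide:

# Bagging = bootstrap aggregating: train each base classifier on a bootstrapped
# replica of the training data, combine the outputs by majority voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)

n_classifiers = 11
replica_size = int(0.8 * len(X))          # "large portions of the samples"
ensemble = []
for _ in range(n_classifiers):
    idx = rng.choice(len(X), size=replica_size, replace=True)  # bootstrap replica
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    ensemble.append(tree.fit(X[idx], y[idx]))

def bagging_predict(models, X_new):
    """Majority vote over the predictions of all base classifiers."""
    votes = np.stack([m.predict(X_new) for m in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

print("training accuracy:", (bagging_predict(ensemble, X) == y).mean())

scikit-learn's BaggingClassifier implements essentially this scheme.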



Boosting
◼ boosting - 3 weak classifiers:
❑ C1 - training on a random subset
❑ C2 - training set: ½ correctly
classified examples from C1 and
½ misclassified
❑ C3 - training on examples for
which C1 and C2 disagree

◼ C1, C2 and C3 are combined by majority voting into a strong classifier
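
A rough sketch of this three-classifier scheme, assuming a binary problem, decision stumps as the weak classifiers and arbitrary subset sizes (real implementations choose the set sizes more carefully):

# Boosting with three weak classifiers, as described above: C1 on a random
# subset, C2 on a half-correct / half-misclassified mix w.r.t. C1,
# C3 on the examples where C1 and C2 disagree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# C1: trained on a random subset of the data
idx1 = rng.choice(len(X), size=300, replace=False)
C1 = DecisionTreeClassifier(max_depth=1).fit(X[idx1], y[idx1])

# C2: half correctly classified and half misclassified examples (w.r.t. C1)
pred1 = C1.predict(X)
correct, wrong = np.where(pred1 == y)[0], np.where(pred1 != y)[0]
half = min(len(correct), len(wrong), 150)
idx2 = np.concatenate([rng.choice(correct, half, replace=False),
                       rng.choice(wrong, half, replace=False)])
C2 = DecisionTreeClassifier(max_depth=1).fit(X[idx2], y[idx2])

# C3: examples on which C1 and C2 disagree
idx3 = np.where(C1.predict(X) != C2.predict(X))[0]
if len(idx3) == 0:                         # degenerate case: C1 and C2 always agree
    idx3 = idx1
C3 = DecisionTreeClassifier(max_depth=1).fit(X[idx3], y[idx3])

# strong classifier: majority vote of C1, C2, C3
votes = np.stack([c.predict(X) for c in (C1, C2, C3)])
strong = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("majority-vote training accuracy:", (strong == y).mean())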



Adaboost.M1

◼ AdaBoost - more versions


❑ AdaBoost.M1 (multiclass)
❑ AdaBoost.R (regression)

◼ training of a weak classifier on examples selected from an iteratively
updated distribution over the training data
❑ the update ensures that examples that were misclassified by the previous
classifier are more likely to be included in the training data of the new
classifier

◼ combination by weighted majority vote



Adaboost.M1
• weight distribution Dt(i) for training
samples xi, i = 1, . . . , N from which training
data subsets St are chosen for each
consecutive classifier (hypothesis) ht

• during initialization the distribution is uniform, so all examples have
the same chance to be selected into the first training set
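
In the standard AdaBoost.M1 notation (the formula itself is not reproduced on the slide), the uniform initialization reads:

D_1(i) = \frac{1}{N}, \qquad i = 1, \dots, N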



Adaboost.M1
• the training error εt of the classifier ht is also weighted by this
distribution

• the error εt must be less than ½

• computation of the normalized error βt: for 0 < εt < ½ we have 0 < βt < 1
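
The slide states the properties but not the formulas; in the textbook AdaBoost.M1 form they are:

\varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i), \qquad \beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}

which indeed gives 0 < \beta_t < 1 whenever 0 < \varepsilon_t < 1/2.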



Adaboost.M1

distribution update rule:

• the weights of examples correctly classified by the current hypothesis
are reduced by the factor βt

• the weights of misclassified examples do not change

• thus, after normalization, the weights of misclassified examples increase
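
Written out (textbook AdaBoost.M1 form; the normalization constant Z_t is not named explicitly on the slide):

D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot \begin{cases} \beta_t, & \text{if } h_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases}

where Z_t is chosen so that D_{t+1} sums to one.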



Adaboost.M1
• combination by weighted majority voting (as opposed to bagging and boosting)

• classifiers that classify well during training are rewarded with higher
voting weights than the others

• 1/βt is a measure of performance: for a small error it is large - sometimes
too large, so to avoid possible stability problems the logarithm log(1/βt)
is used
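
The corresponding decision rule, in the usual AdaBoost.M1 form (not written out on the slide):

H_{\text{final}}(x) = \arg\max_{\omega_j} \sum_{t:\, h_t(x) = \omega_j} \log\frac{1}{\beta_t}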



Adaboost.M1 - block diagram
◼ algorithm is sequential:
◼ classifier CK is created before CK+1, i.e. βK and DK are available
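
A compact AdaBoost.M1 training sketch following the quantities above (Dt, εt, βt, weighted majority voting); using sample weights instead of literally resampling St from Dt, and the decision-stump base classifier, are simplifications on my part:

# AdaBoost.M1: iteratively reweight the training examples, train a weak
# classifier against each distribution D_t, combine by weighted majority voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
N, T = len(X), 20
D = np.full(N, 1.0 / N)                  # D_1(i) = 1/N (uniform initialization)
hypotheses, betas = [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1, random_state=t)
    h.fit(X, y, sample_weight=D)         # weak classifier trained w.r.t. D_t
    miss = h.predict(X) != y
    eps = D[miss].sum()                  # weighted training error eps_t
    if eps == 0 or eps >= 0.5:           # degenerate: already perfect or too weak
        break
    beta = eps / (1.0 - eps)             # normalized error, 0 < beta_t < 1
    D[~miss] *= beta                     # reduce weights of well-classified examples
    D /= D.sum()                         # normalize -> misclassified weights grow
    hypotheses.append(h)
    betas.append(beta)

def adaboost_predict(X_new):
    """Weighted majority vote with voting weights log(1/beta_t)."""
    classes = np.unique(y)
    scores = np.zeros((len(X_new), len(classes)))
    for h, beta in zip(hypotheses, betas):
        pred = h.predict(X_new)
        for k, c in enumerate(classes):
            scores[pred == c, k] += np.log(1.0 / beta)
    return classes[scores.argmax(axis=1)]

print("AdaBoost.M1 training accuracy:", (adaboost_predict(X) == y).mean())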



Adaboost.M1
◼ the training error E (ensemble error) is bounded above

◼ since εt <1/2, E is guaranteed to decrease with each new classifier

◼ resistance against overtraining!


◼ relation to margin theory
❑ we studied margins with SVM (maximizing the margin between classes)
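
The upper bound mentioned above is, in the standard AdaBoost analysis (the slide does not reproduce it, so this is the textbook statement):

E \le \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}

Each factor is smaller than 1 whenever \varepsilon_t < 1/2, so the bound on the ensemble training error shrinks with every added classifier.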



! AdaBoost and SVM
◼ support vectors are said to define the margin that separates the classes
◼ both SVM and AdaBoost maximize margin
◼ AdaBoost also boosts the margins



Good material about Adaboost:
◼ http://www.comp.leeds.ac.uk/scsjso/adaboost_talk.pdf



2) Strategy for combining classifier outputs



Strategy for combining classifier outputs

◼ 2 taxonomies

1) trainable vs. non-trainable combination rules,


◼ trainable (dynamic)
❑ parameters of combiner (weights) are determined by a separate training algorithm (e.g. EM
– expectation maximization)
◼ nontrainable
❑ e.g. weighted majority voting

2) combination rules that apply either to class labels or to class-specific
continuous outputs

◼ application to class labels ωj, j = 1, . . . , C
◼ application directly to continuous outputs of individual classifiers
❑ e.g. to continuous outputs of an MLP or RBF network



Strategy for combining classifier outputs

(2) combination rules that apply either to class labels or to class-specific
continuous outputs

◼ combining class labels


❑ Majority Voting
❑ Weighted Majority Voting
❑ Behavior Knowledge Space (BKS)
❑ Borda Count

◼ combining continuous outputs


❑ Algebraic combiners
▪ mean rule, weighted average, trimmed mean, minimum/maximum/median rule,
product rule, generalized mean
❑ Decision Templates
❑ Dempster-Shafer Based Combination
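
A small sketch of two of these combiners; the three classifiers, four samples and soft outputs below are made-up numbers for illustration:

# Combining classifier outputs: majority voting on class labels vs. the mean
# rule on class-specific continuous outputs (e.g. class probabilities).
import numpy as np

# predicted labels of 3 classifiers for 4 samples (classes 0, 1, 2)
labels = np.array([[0, 1, 2, 1],
                   [0, 2, 2, 1],
                   [1, 1, 2, 0]])
majority = np.apply_along_axis(lambda v: np.bincount(v, minlength=3).argmax(), 0, labels)
print("majority vote:", majority)                   # -> [0 1 2 1]

# continuous (soft) outputs, shape (n_classifiers, n_samples, n_classes)
soft = np.array([[[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
                 [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]],
                 [[0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]])
mean_rule = soft.mean(axis=0).argmax(axis=1)        # average the supports, pick the max
print("mean rule:", mean_rule)                      # -> [0 1]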



Which ensemble generation or
combination rule is the best?

◼ there is no best classifier for all classification problems


❑ the best algorithm depends on the data structure and a priori knowledge

◼ similar applies to combination rules



! dropout ensembles

◼ close relationship between dropout in deep NNs and ensemble learning:
❑ switching off different neurons yields a set of different NN architectures!

https://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/
Illustrations
◼ The graph shows 2 classes of 100 objects. Banana shaped classes were
used to generate data. 40% of the data was used for training, the rest
was used for testing.
◼ bpxnc classifier (back-propagation), MLP classifier.

Left: Banana set, MLP classifier with 3 neurons in the hidden layer. Right: Banana set, MLP classifiers with 5, 15 and 50 neurons in the hidden layer.



◼ Result of an ensemble of classifiers

Ensembles of classifiers with mean, voting and maximum combiners.



◼ Separate classifiers and the result of an ensemble of classifiers

Left: various classifiers (linear, quadratic, Parzen and back-propagation). Right: ensemble of these classifiers with mean, voting, maximum and product combiners.
