Decision Trees and Boosting
Helge Voss (MPI–K, Heidelberg)
TMVA Workshop, CERN, 21 Jan 2011
Boosted Decision Trees
Decision Tree: Sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background
used for a long time in general “data-mining” applications, less well known in (High Energy) Physics
similar to “simple cuts”: each leaf node is a set of cuts, i.e. many boxes in phase space, each attributed either to signal or background
independent of monotonic variable transformations, robust against outliers
weak variables are ignored (and don’t (much) degrade performance)
Disadvantage: very sensitive to statistical fluctuations in the training data
Boosted Decision Trees (1996): combine a whole forest of Decision Trees, derived from the same sample, e.g. using different event weights
overcomes the stability problem
became popular in HEP since MiniBooNE ([Link] et al., NIM 543 (2005))
Growing a Decision Tree
start with training sample at the root node
split training sample at node into two, using a cut
in the variable that gives best separation gain
continue splitting until:
minimal #events per node
maximum number of nodes
maximum depth specified
a split doesn’t give a minimum separation gain
leaf nodes classify S/B according to the majority of events, or give an S/B probability (see the sketch below)
Why no multiple branches (splits) per node?
Fragments the data too quickly; also: multiple splits per node = a series of binary node splits
What about multivariate splits?
time consuming
other methods are better adapted to such correlations
we’ll see later that for “boosted” DTs weak (dull) classifiers are often better anyway
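Not from the original slides: a minimal Python sketch of this recursive growth, assuming events are stored as (features, label, weight) tuples; the function and parameter names (grow_tree, min_events, …) are illustrative only, and best_split is the cut search sketched on the “Separation Gain” slide below.

```python
# Illustrative sketch of recursive decision-tree growth (not the TMVA implementation).
# An event is a tuple (features, label, weight) with label 1 = signal, 0 = background.

def grow_tree(events, depth=0, max_depth=3, min_events=20, min_gain=1e-4):
    """Recursively split 'events' and return the tree as a nested dict."""
    n_sig = sum(w for _, y, w in events if y == 1)
    n_tot = sum(w for _, _, w in events)
    purity = n_sig / n_tot if n_tot > 0 else 0.0

    # stopping criteria from the slide: too few events, maximum depth, no useful split
    split = None
    if len(events) >= min_events and depth < max_depth:
        split = best_split(events)          # (variable, cut, gain); see the next slide
    if split is None or split[2] < min_gain:
        # leaf node: majority vote, or keep the purity as an S/B probability
        return {"leaf": True, "purity": purity, "is_signal": purity > 0.5}

    var, cut, _gain = split
    left  = [e for e in events if e[0][var] <  cut]
    right = [e for e in events if e[0][var] >= cut]
    return {"leaf": False, "var": var, "cut": cut,
            "left":  grow_tree(left,  depth + 1, max_depth, min_events, min_gain),
            "right": grow_tree(right, depth + 1, max_depth, min_events, min_gain)}
```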
Separation Gain
What do we mean by “best separation gain”?
define a measure of how mixed S and B are in a node:
Misclassification: 1 − max(p, 1−p)
Gini index: p(1−p), with p = purity (Corrado Gini 1912, typically used to measure income inequality)
Cross entropy: −( p ln p + (1−p) ln(1−p) )
[plot: misidentification, Gini index and cross entropy as a function of the node purity p]
the differences between the various indices are small; most commonly used: the Gini index
separation gain: e.g. N_parent·Gini_Parent − N_left·Gini_LeftNode − N_right·Gini_RightNode
Consider all variables and all possible cut values
select variable and cut that maximises the separation gain.
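Not part of the slides: a minimal Python sketch of the Gini index and of this cut search (it provides the best_split helper used in the growth sketch on the previous slide); the event format and names are the same illustrative assumptions as before.

```python
# Gini index, separation gain and exhaustive cut search (illustrative sketch).
# An event is a tuple (features, label, weight) with label 1 = signal, 0 = background.

def gini(events):
    """p(1-p) for a node, with p the (weighted) signal purity."""
    n_sig = sum(w for _, y, w in events if y == 1)
    n_tot = sum(w for _, _, w in events)
    if n_tot == 0:
        return 0.0
    p = n_sig / n_tot
    return p * (1.0 - p)

def separation_gain(parent, left, right):
    """N_parent*Gini_parent - N_left*Gini_left - N_right*Gini_right."""
    weight = lambda evts: sum(w for _, _, w in evts)
    return (weight(parent) * gini(parent)
            - weight(left)  * gini(left)
            - weight(right) * gini(right))

def best_split(events):
    """Scan all variables and all event values as cut candidates; return (var, cut, gain)."""
    best = None
    n_var = len(events[0][0])
    for var in range(n_var):
        for cut in sorted({x[var] for x, _, _ in events}):
            left  = [e for e in events if e[0][var] <  cut]
            right = [e for e in events if e[0][var] >= cut]
            if not left or not right:
                continue
            gain = separation_gain(events, left, right)
            if best is None or gain > best[2]:
                best = (var, cut, gain)
    return best
```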
Separation Gain
Misclassification error is sort of the classical way to choose the cut.
BUT: there are cases where the simple “misclassification error” does not have any optimum at all!
[plots: cumulative distributions of the separation indices as a function of the cut value]
Another example: a parent node with S=400, B=400, split either into (S=300, B=100) and (S=100, B=300), or into (S=200, B=0) and (S=200, B=400):
both splits are equal in terms of misclassification error, but Gini index / cross entropy favour the latter (a quick numeric check follows below)
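Not on the original slide: a quick numeric check of this example, reusing the index definitions from the previous slide.

```python
# Check: both splits of the (S=400, B=400) parent leave 200 events misclassified,
# but only the Gini index (and likewise the cross entropy) prefers the second split.

def misclass_index(s, b):                 # (1 - max(p, 1-p)) weighted by the node size
    p = s / (s + b)
    return (s + b) * (1.0 - max(p, 1.0 - p))

def gini_index(s, b):                     # p(1-p) weighted by the node size
    p = s / (s + b)
    return (s + b) * p * (1.0 - p)

parent  = (400, 400)
split_1 = [(300, 100), (100, 300)]
split_2 = [(200, 0), (200, 400)]

for name, split in (("split 1", split_1), ("split 2", split_2)):
    mis_gain  = misclass_index(*parent) - sum(misclass_index(s, b) for s, b in split)
    gini_gain = gini_index(*parent)     - sum(gini_index(s, b)     for s, b in split)
    print(f"{name}: misclassification gain = {mis_gain:.0f}, Gini gain = {gini_gain:.1f}")
# -> split 1: misclassification gain = 200, Gini gain = 50.0
# -> split 2: misclassification gain = 200, Gini gain = 66.7
```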
Decision Tree Pruning
One can continue node splitting until all leaf nodes
are basically pure (using the training sample)
obviously: that’s overtraining
Two possibilities:
stop growing earlier
generally not a good idea: even a seemingly useless split might open up subsequent useful splits
grow the tree to the end and “cut back” nodes that seem statistically dominated: pruning
e.g. cost complexity pruning: assign to every sub-tree T the cost
C(T, α) = Σ_{leaves of T} Σ_{events in leaf} | y(x) − ŷ_leaf | + α · N_leaf nodes
(the first term is the loss function, the second the regularisation term with cost parameter α)
find the sub-tree T with minimal C(T, α) for a given α, using subsequent “weakest link” pruning
which cost parameter α?
large enough to avoid overtraining
a tuning parameter, or “cross validation” (still to come in TMVA, hopefully soon…)
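Not spelled out on the slide: to my understanding, “weakest link” pruning refers to the standard CART recipe, added here as a reference in my own notation (R(t) is the loss of node t kept as a leaf, R(T_t) the loss of the branch below it).

```latex
% Standard CART weakest-link pruning (added for reference, not from the slide):
% repeatedly prune the internal node t with the smallest effective cost parameter
\alpha_{\mathrm{eff}}(t) \;=\; \frac{R(t) - R(T_t)}{N_{\mathrm{leaves}}(T_t) - 1}
% this traces out the sequence of sub-trees that are optimal for increasing values
% of the cost parameter \alpha in C(T,\alpha) above.
```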
Decision Tree Pruning
“Real life” example of an optimally pruned Decision Tree:
[figures: the decision tree before pruning and after pruning]
Pruning algorithms are developed and applied on individual trees
optimally pruned single trees are not necessarily optimal in a forest!
actually they tend to be TOO big when boosted, no matter how hard you prune!
Boosting
[flow diagram: Training Sample → classifier C^(0)(x); re-weight → Weighted Sample → classifier C^(1)(x); re-weight → Weighted Sample → classifier C^(2)(x); … ; re-weight → Weighted Sample → classifier C^(m)(x)]
the final classifier is the weighted sum of the individual classifiers:
y(x) = Σ_i^{N_classifier} w_i · C^(i)(x)
Adaptive Boosting (AdaBoost)
[flow diagram as on the previous slide: Training Sample → C^(0)(x); re-weight → Weighted Sample → C^(1)(x); re-weight → Weighted Sample → C^(2)(x); … ; re-weight → Weighted Sample → C^(m)(x)]
AdaBoost re-weights events misclassified by the previous classifier by:
(1 − f_err) / f_err , with f_err = (misclassified events) / (all events)
AdaBoost also weights the classifiers, using the error rate of the individual classifier, according to:
y(x) = Σ_i^{N_classifier} log( (1 − f_err^(i)) / f_err^(i) ) · C^(i)(x)
(a minimal code sketch of this scheme follows below)
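Not from the slides: a minimal Python sketch of exactly this weighting scheme, reusing the grow_tree sketch from the “Growing a Decision Tree” slide as the weak classifier (depth-1 trees, i.e. stumps); all names are illustrative.

```python
# Illustrative AdaBoost sketch following the two formulas on this slide.
import math

def tree_response(tree, x):
    """Walk a tree from the growth sketch; return +1 for signal leaves, -1 for background."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["var"]] < tree["cut"] else tree["right"]
    return 1.0 if tree["is_signal"] else -1.0

def adaboost(events, n_trees=10, max_depth=1):
    """events: list of (features, label in {0,1}, weight). Returns [(alpha_i, tree_i), ...]."""
    events = [(x, y, w) for x, y, w in events]      # work on a copy of the weights
    forest = []
    for _ in range(n_trees):
        tree = grow_tree(events, max_depth=max_depth)            # weak classifier C^(i)
        wrong = [(x, y, w) for x, y, w in events
                 if (tree_response(tree, x) > 0) != (y == 1)]
        f_err = sum(w for _, _, w in wrong) / sum(w for _, _, w in events)
        if f_err == 0.0 or f_err >= 0.5:
            break                                                # no useful weak classifier
        alpha = math.log((1.0 - f_err) / f_err)                  # classifier weight
        boost = (1.0 - f_err) / f_err                            # event re-weighting factor
        events = [(x, y, w * boost) if (tree_response(tree, x) > 0) != (y == 1) else (x, y, w)
                  for x, y, w in events]
        forest.append((alpha, tree))
    return forest

def forest_response(forest, x):
    """y(x) = sum_i log((1 - f_err^(i)) / f_err^(i)) * C^(i)(x)."""
    return sum(alpha * tree_response(tree, x) for alpha, tree in forest)
```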
Boosted Decision Trees
The result of ONE Decision Tree for a test event is either “Signal” or “Background”: the tree gives one fixed signal efficiency and background rejection.
For a whole forest, however, the output y(x) is continuous, with y(B) → 0 and y(S) → 1, so the cut on y(x) can be varied. [figure: forest output distributions for signal and background]
AdaBoost in Pictures
[figures: AdaBoost step by step: start here with equal event weights; misclassified events get larger weights; … and so on]
Boosted Decision Trees – Control Plots
A very well behaved example
Boosted Decision Trees – Control Plots
A more “difficult” example
AdaBoost: A simple demonstration
The example (somewhat artificial… but nice for demonstration):
• data file with three “bumps”
• weak classifier, i.e. one single simple “cut” ↔ a decision tree stump (var(i) > x vs. var(i) <= x)
[figure: the Var0 distribution with the signal (S) and background (B) bumps and the two candidate cuts a) and b)]
Two reasonable cuts:
a) Var0 > 0.5: εsignal = 66%, εbkg ≈ 0%, misclassified events in total 16.5%
or
b) Var0 < -0.5: εsignal = 33%, εbkg ≈ 0%, misclassified events in total 33%
the training of a single decision tree stump will find “cut a)”
AdaBoost: A simple demonstration
The first “tree”, choosing cut a), will give an error fraction err = 0.165
before building the next “tree”: weight the wrongly classified training events by (1 − err)/err ≈ 5
the next “tree” then sees essentially the following re-weighted data sample [figure] … and hence will choose “cut b)”: Var0 < -0.5
The combined classifier: Tree 1 + Tree 2
the (weighted) average of the response to a test event from both trees is able to separate signal from background as well as one would expect from the most powerful classifier (a back-of-the-envelope check follows below)
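Not on the slides: a back-of-the-envelope check of why the two-stump output already separates the three regions. It assumes ±1 stump responses and a 50/50 signal/background mixture; err = 0.165 is taken from the slide, while the second error fraction is my own estimate from the re-weighted sample.

```python
# Rough arithmetic for the two-stump example (illustrative; assumes +-1 stump responses
# and a 50/50 signal/background mixture, with err1 = 0.165 taken from the slide).
import math

err1  = 0.165                                  # error of cut a), from the slide
a1    = math.log((1 - err1) / err1)            # ~1.62, weight of tree 1
boost = (1 - err1) / err1                      # ~5.06, re-weighting of misclassified events

# re-weighted sample seen by tree 2 (cut b): the left signal bump now carries weight ~0.835
w_left_sig, w_right_sig, w_bkg = 0.165 * boost, 0.33, 0.50
err2 = w_right_sig / (w_left_sig + w_right_sig + w_bkg)   # ~0.20: cut b) misses the right bump
a2   = math.log((1 - err2) / err2)             # ~1.40, weight of tree 2

# combined response y = a1*C1 + a2*C2 in the three regions (C1: Var0 > 0.5, C2: Var0 < -0.5)
regions = {"Var0 > 0.5  (signal)":     a1 * (+1) + a2 * (-1),
           "Var0 < -0.5 (signal)":     a1 * (-1) + a2 * (+1),
           "middle      (background)": a1 * (-1) + a2 * (-1)}
for name, y in regions.items():
    print(f"{name:26s} y = {y:+.2f}")
# both signal regions end up near y ~ +-0.2 while the background sits near y ~ -3,
# so a cut on y separates signal from background even with only two stumps.
```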
AdaBoost: A simple demonstration
[figures: classifier output with only 1 tree “stump” vs. with only 2 tree “stumps” combined with AdaBoost]
“A Statistical View of Boosting” (Friedman 1998 [Link])
Boosted Decision Trees: two different interpretations
give events that are “difficult to categorize” more weight, and afterwards average the results of all classifiers that were obtained with the different weights
see each tree as a “basis function” of a possible classifier:
• boosting or bagging is just a means to generate a set of “basis functions”
• a linear combination of the basis functions gives the final classifier, or: the final classifier is an expansion in the basis functions,
y(α, x) = Σ_{trees i} α_i · T_i(x)
• every “boosting” algorithm can be interpreted as optimising the loss function in a “greedy stagewise” manner
• i.e. from the current point in the optimisation – [Link] of the decision tree forest – choose the parameters for the next boost step (weights) such that one moves along the steepest gradient of the loss function
• AdaBoost: “exponential loss function” = exp(−y₀·y(α,x)), where y₀ = −1 (bkg), y₀ = +1 (signal)
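Not on the slide: the “greedy stagewise” statement written out in standard forward stagewise additive modelling notation (my notation, following the Friedman et al. picture): at boost step m the forest built so far, y_{m-1}(x), is kept fixed and only the new tree and its weight are chosen.

```latex
% Greedy stagewise minimisation of the AdaBoost exponential loss (standard notation,
% added for illustration); the previously built forest y_{m-1}(x) stays fixed.
(\alpha_m, T_m) \;=\; \underset{\alpha,\,T}{\arg\min}\;
    \sum_{\text{events } j} \exp\!\Big( -\,y_{0,j}\,\big[\, y_{m-1}(x_j) + \alpha\, T(x_j) \,\big] \Big)
% choosing a different loss function in place of the exponential one gives a different
% boosting algorithm (e.g. Gradient Boost on the next slide).
```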
Gradient Boost
Gradient Boost is a way to implement “boosting” with arbitrary “loss functions”, by approximating “somehow” the gradient of the loss function
AdaBoost: exponential loss, exp(−y₀·y(α,x)) → theoretically sensitive to outliers
binomial log-likelihood loss, ln(1 + exp(−2·y₀·y(α,x))) → a more well-behaved loss function
(the corresponding “GradientBoost” is implemented in TMVA)
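Not on the slide: a tiny numeric illustration of why the binomial log-likelihood loss is better behaved for outliers, i.e. events with a large negative margin y₀·y(x); the margin values are arbitrary.

```python
# Compare the two loss functions from this slide as a function of the "margin" y0*y(x):
# for badly misclassified events (large negative margin) the exponential loss explodes,
# while the binomial log-likelihood loss grows only linearly.
import math

for margin in (2.0, 0.0, -2.0, -5.0):
    exp_loss = math.exp(-margin)                        # AdaBoost: exp(-y0*y)
    bll_loss = math.log(1.0 + math.exp(-2.0 * margin))  # GradientBoost: ln(1 + exp(-2*y0*y))
    print(f"margin {margin:+.1f}: exponential loss = {exp_loss:8.2f}, "
          f"binomial log-likelihood loss = {bll_loss:6.2f}")
```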
Bagging and Randomised Trees
other classifier combinations:
Bagging:
combine trees grown from “bootstrap” samples (i.e. re-sample the training data with replacement)
Randomised Trees: (Random Forest: trademark [Link], [Link])
combine trees grown with:
a random bootstrap sample (or random subset) of the training data only
considering at each node only a random subset of the variables for the split
NO pruning! (a minimal sketch follows below)
These combined classifiers work surprisingly well, are very stable and
almost perfect “out of the box” classifiers
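Not on the slide: a minimal sketch of the bagging ingredient (bootstrap resampling of the training data), reusing the earlier grow_tree sketch; for “randomised trees” one would additionally restrict best_split to a random subset of the variables at each node, which would need a small extension of that helper. All names are illustrative.

```python
# Illustrative bagging sketch (not the TMVA or Random Forest implementation).
import random

def bootstrap(events, rng):
    """Re-sample the training data with replacement (same size as the original sample)."""
    return [rng.choice(events) for _ in range(len(events))]

def bagged_forest(events, n_trees=100, max_depth=10, seed=1):
    """Grow each (un-pruned, fairly deep) tree on its own bootstrap replica of the data."""
    rng = random.Random(seed)
    return [grow_tree(bootstrap(events, rng), max_depth=max_depth) for _ in range(n_trees)]

def bagged_response(forest, x):
    """Simple average of the individual tree purities (no classifier weights, no boosting)."""
    purities = []
    for tree in forest:
        node = tree
        while not node["leaf"]:
            node = node["left"] if x[node["var"]] < node["cut"] else node["right"]
        purities.append(node["purity"])
    return sum(purities) / len(purities)
```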
AdaBoost vs Bagging and Randomised Forests
Sometimes people present “boosting” as nothing else than just “smearing”, in order to make the Decision Trees more stable w.r.t. statistical fluctuations in the training.
Clever “boosting”, however, can do more than that, for example for the previous “three bumps” example:
[figure: classifier output for Random Forests, Bagging and AdaBoost on the “three bumps” example]
Random Forests / Bagging: in this case pure statistical fluctuations are not enough to enhance the 2nd peak sufficiently
AdaBoost: however, a “fully grown decision tree” is much more than a “weak classifier”, so there the “stabilization” aspect is more important
Surprisingly: often using smaller trees (weaker classifiers) in AdaBoost and other clever boosting algorithms (i.e. gradient boost) seems to give overall significantly better performance!
Boosting at Work
Boosting seems to work best on “weak” classifiers (i.e. small, dumb trees)
Tuning (tree building) parameter settings are important
For good out-of-the-box performance: large numbers of very small trees (a booking sketch follows below)
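Not on the slide: roughly how such a configuration could be booked in TMVA from PyROOT. The option names (NTrees, MaxDepth, BoostType, SeparationType, nCuts) follow the TMVA examples of this era but should be checked against the Users Guide of your release, and the variable/tree registration is only hinted at, so this fragment is a sketch rather than a runnable script.

```python
# Sketch only: booking a BDT with "many very small trees" in TMVA via PyROOT.
# Option names follow the era's TMVA examples; verify them against your TMVA version.
import ROOT

output  = ROOT.TFile("tmva_bdt.root", "RECREATE")
factory = ROOT.TMVA.Factory("TMVAClassification", output, "AnalysisType=Classification")

# ... AddVariable / AddSignalTree / AddBackgroundTree / PrepareTrainingAndTestTree ...

factory.BookMethod(ROOT.TMVA.Types.kBDT, "BDT_smallTrees",
                   "NTrees=1000:MaxDepth=2:BoostType=AdaBoost:"
                   "SeparationType=GiniIndex:nCuts=20")

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
output.Close()
```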
Generalised Classifier Boosting
Principle (just as in BDT): multiple training cycles, each time wrongly
classified events get a higher event weight
[flow diagram as before: Training Sample → C^(0)(x); re-weight → Weighted Sample → C^(1)(x); re-weight → Weighted Sample → C^(2)(x); … ; re-weight → Weighted Sample → C^(m)(x)]
the response is the weighted sum of each classifier response:
y(x) = Σ_i^{N_classifier} log( (1 − f_err^(i)) / f_err^(i) ) · C^(i)(x)
Boosting might be interesting especially for simple (weak) Methods like Cuts, Linear
Discriminants, simple (small, few nodes) MLPs
AdaBoost On a linear Classifier (e.g. Fisher)
Oops… there is still a problem in TMVA’s generalized boosting. This example doesn’t work yet!
Boosting a Fisher Discriminant in TMVA…
100 boosts of a “Fisher Discriminant” as multivariate tree split (yes… it is in TMVA, although I argued against multivariate splits earlier; I hoped to cope better with linear correlations that way…)
generalised boosting of the Fisher classifier: something isn’t quite correct yet!
[figures: decision boundaries after the 1st, 2nd and 65th Fisher cut]
Learning with Rule Ensembles
Following the RuleFit approach by Friedman-Popescu (Tech. Rep., Stat. Dept., Stanford U., 2003)
Model is linear combination of rules, where a rule is a sequence of cuts (i.e. a
branch of a decision tree)
RuleFit classifier:
y_RF(x) = a_0 + Σ_{m=1}^{M_R} a_m · r_m(x̂) + Σ_{k=1}^{n_R} b_k · x̂_k
the r_m are the rules (cut sequences: r_m = 1 if all cuts are satisfied, = 0 otherwise), the x̂_k are the normalised discriminating event variables; the first sum is the “sum of rules”, the second a linear “Fisher” term
The problem to solve is:
create the rule ensemble: use a forest of decision trees
pruning removes topologically equal rules (same variables in the cut sequence)
add a “Fisher term” to capture linear correlations
fit the coefficients a_m, b_k: gradient directed regularization, minimising the risk (Friedman et al.)
(aside, explaining the illustration on the slide: one of the elementary cellular automaton rules (Wolfram 1983, 2002); it specifies the next color of a cell depending on its color and its immediate neighbors, and its rule outcomes are encoded in the binary representation 30 = 00011110 in binary)
Regression Trees
Rather than calling leaves Signal or Background, one could also give them “values” (i.e. the “mean value” of all target values attributed to the training events that end up in the node): a Regression Tree
Node splitting: separation gain → gain in variance (RMS) of the target function (written out below)
Boosting: error fraction → a “distance” measure from the mean (linear, square or exponential)
Use this to model ANY non-analytic function of which you have “training data”, e.g.:
the energy in your calorimeter as a function of shower parameters
training data from a test beam
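Not written out on the slide: the variance-based analogue of the separation gain, in the same spirit as the Gini-based gain from the classification slides (my phrasing).

```latex
% Regression-tree splitting criterion, analogous to the Gini-based separation gain
% (added for illustration); y denotes the regression target of the training events.
\text{gain} \;=\; N_{\text{parent}}\,\mathrm{Var}_{\text{parent}}(y)
           \;-\; N_{\text{left}}\,\mathrm{Var}_{\text{left}}(y)
           \;-\; N_{\text{right}}\,\mathrm{Var}_{\text{right}}(y)
% each leaf then returns the mean target value of the training events ending up in it.
```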
Regression Trees
Leaf nodes: one output value each
[figures: regression-tree output vs. the target value, with a zoom]
Regression Trees seem to need, DESPITE BOOSTING, larger trees
Summary
Boosted Decision Trees: a “brute force” method that works “out of the box”
check the tuning parameters anyway
start with “small trees” (limit the maximum number of splits / the tree depth)
automatic tuning-parameter optimisation: a first implementation is done, but it obviously needs LOTs of time!
be as careful as with “cuts” and check against data
Boosting can (in principle) be applied to any (weak) classifier
Boosted Regression Trees: at least as much “brute force”
little experience with them yet… but probably equally robust and powerful