
Problems in ML & Performance Evaluation

Lương Thái Lê
Outline of the Lecture
1. Learning Machine
2. Problems in ML
• Empirical Risk Minimization
• Feature Engineering
• Overfitting
3. ML Model Performance Evaluation
4. Evaluation Metrics
Learning Machine
• A learning machine is capable of implementing a set of functions $f(x, w),\ w \in W$
• The learning problem is to choose, from the given set of functions, the one which best approximates the supervisor's response.
• The selection is based on training samples $(x_i, y_i),\ i = 1, \ldots, l$

• => We need to choose and optimize an appropriate loss (depending on the concrete problem)


Outline of the Lecture
1. Learning Machine
2. Problems in ML
• Empirical Risk Minimization
• Feature Engineering
• Overfitting
3. ML Model Performance Evaluation
4. Evaluation Metrics
Regression Example: Find the Road Surface
• Given the points, estimate the parameters of the fitted plane
• Data/Features:
• dimension (p = 2): $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$
• l training samples: $(x_1, y_1), \ldots, (x_l, y_l)$
• parameters: $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)^T$
• Evaluation Metric:
• How good is the fitted plane?
Loss Function
• To choose the best function, it makes sense to minimize a loss between the response of the supervisor and that of the learning machine, given an input x:
$L(y, f(x, \alpha))$
• Since we want to minimize the loss over all samples, we are interested in minimizing the expected loss:
$R(\alpha) = \int L(y, f(x, \alpha)) \, dF(x, y)$
• $R(\alpha)$ is called the risk function
• $F(x, y)$ is the joint probability distribution function
=> Find $f(x, \alpha^*)$ that minimizes $R(\alpha)$ when the only available information is contained in the training set $(x_i, y_i),\ i = 1, \ldots, l$
Empirical Risk Minimization Principle
$R(\alpha) = \int L(y, f(x, \alpha)) \, dF(x, y)$
• Since $F(x, y)$ is unknown, the risk functional is replaced by the empirical risk functional:
$R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} L(y_i, f(x_i, \alpha))$
=> Find $f(x, \alpha^*)$ that minimizes $R_{emp}(\alpha)$ over the class of functions $f(x, \alpha)$
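To make this concrete, here is a minimal Python sketch (an illustration, not from the slides) of the empirical risk of a linear model under squared loss:

import numpy as np

def empirical_risk(alpha, X, y):
    # R_emp(alpha) = (1/l) * sum_i L(y_i, f(x_i, alpha)) with squared loss
    # and the linear model f(x, alpha) = alpha_0 + alpha_1*x_1 + ... + alpha_p*x_p
    preds = alpha[0] + X @ alpha[1:]
    return np.mean((y - preds) ** 2)

# toy data: l = 4 samples, p = 2 features
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 5.0])
print(empirical_risk(np.array([0.0, 2.0, 1.0]), X, y))  # 0.0: this alpha fits exactly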


Loss Function: A Probabilistic View
• For squared loss: $L(y, f(x, \alpha)) = \sum_{i=1}^{l} (y_i - f(x_i, \alpha))^2 = \sum_{i=1}^{l} (y_i - \alpha_0 - \sum_{j=1}^{p} x_{ij} \alpha_j)^2$
• Let the noise be $\varepsilon_i = y_i - f(x_i, \alpha)$
• If we model the noise as a zero-mean Gaussian random variable with variance $\delta^2$, its distribution is:
$p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\,\delta} \exp\left(-\frac{\varepsilon_i^2}{2\delta^2}\right) = \frac{1}{\sqrt{2\pi}\,\delta} \exp\left(-\frac{(y_i - \alpha_0 - \sum_{j=1}^{p} x_{ij}\alpha_j)^2}{2\delta^2}\right) = \frac{1}{\sqrt{2\pi}\,\delta} \exp\left(-\frac{(y_i - \alpha^T x_i)^2}{2\delta^2}\right)$
• For independent samples:
$p(\varepsilon) = \prod_{i=1}^{l} p(\varepsilon_i) = \frac{1}{(\sqrt{2\pi}\,\delta)^l} \exp\left(-\frac{1}{2} \cdot \frac{\sum_{i=1}^{l}(y_i - \alpha^T x_i)^2}{\delta^2}\right)$
Likelihood Function
$p(\varepsilon) = p(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_l) = \frac{1}{(\sqrt{2\pi}\,\delta)^l} \exp\left(-\frac{1}{2} \cdot \frac{\|y - \alpha^T x\|^2}{\delta^2}\right)$
• We can view this joint probability as a function of the parameters:
$L(\alpha) = p(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_l \mid \alpha) = \frac{1}{(\sqrt{2\pi}\,\delta)^l} \exp\left(-\frac{1}{2} \cdot \frac{\|y - \alpha^T x\|^2}{\delta^2}\right)$
=> We need to maximize this likelihood function


Maximum Likelihood Function
• Maximize the likelihood over all available samples:
$\alpha^* = \operatorname{argmax}_\alpha \, p(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_l \mid \alpha) = \operatorname{argmax}_\alpha \frac{1}{(\sqrt{2\pi}\,\delta)^l} \exp\left(-\frac{1}{2} \cdot \frac{\|y - \alpha^T x\|^2}{\delta^2}\right)$
• Since log is a monotonic function, the log-likelihood is often used instead:
$\alpha^* = \operatorname{argmax}_\alpha \left(-\sum_{i=1}^{l} (y_i - \alpha^T x_i)^2 + \text{const}\right)$
=> That is where Gradient Descent comes in
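A minimal Python sketch (an illustration, not part of the original slides) of gradient descent applied to this least-squares objective:

import numpy as np

def gradient_descent(X, y, lr=0.05, steps=2000):
    # Minimize sum_i (y_i - alpha^T x_i)^2; a column of ones is prepended
    # so that alpha[0] plays the role of the intercept alpha_0.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    alpha = np.zeros(Xb.shape[1])
    for _ in range(steps):
        residual = y - Xb @ alpha                # y_i - alpha^T x_i for every i
        grad = -2.0 * Xb.T @ residual / len(y)   # averaged gradient of the squared error
        alpha -= lr * grad
    return alpha

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])               # generated by y = 1 + 2x
print(gradient_descent(X, y))                    # approaches [1.0, 2.0]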
Outline of the Lecture
1. Learning Machine
2. Problems in ML
• Empirical Risk Minimization
• Feature Engineering
• Overfitting
3. ML Model Performance Evaluation
4. Evaluation Metrics
Feature Engineering
• Features are individual independent variables that act as inputs to your system. Put simply, a feature is a column of data in your input dataset. Ex: Age, Sex, Income, …
• Two types of features:
• Categorical: takes a small set of discrete values, such as Sex, Class of ticket (Economy, Premium, …), Color, …
• Numerical: takes continuous or discrete numeric values, such as Age, Price, …
• A new feature can be created from a root feature to improve learning
• Name: "Mr. John May" => create a new feature Title (Mr, Miss, …)
• Group feature values with few examples into a generic value
• Titles that occur rarely, e.g. Rev, Dr, Capt, … => Others
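A minimal pandas sketch of both steps (the column names and the list of common titles are illustrative assumptions, in the spirit of the example above):

import pandas as pd

df = pd.DataFrame({"Name": ["Mr. John May", "Miss. Ann Lee", "Rev. Tom Cox"]})

# create a new Title feature from the root Name feature
df["Title"] = df["Name"].str.extract(r"^(\w+)\.", expand=False)

# group rarely occurring titles into a generic 'Others' value
common = {"Mr", "Mrs", "Miss"}
df["Title"] = df["Title"].apply(lambda t: t if t in common else "Others")
print(df)  # Titles: Mr, Miss, Others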
Outline of the Lecture
1. Learning Machine
2. Problems in ML
• Empirical Risk Minimization
• Feature Engineering
• Overfitting
3. ML Model Performance Evaluation
4. Evaluation Metrics
Overfitting
• Definition
A learned objective function F is said to overfit a training set if there exists another objective function F' such that:
• F' is less accurate than F on the training set, but
• F' is more accurate than F on the entire dataset (including examples encountered in the future)
• Causes of overfitting:
• Errors (noise) in the training set (due to the data collection/construction process)
• The number of training examples is too small to represent the full example space of the problem
=> Prefer the simplest objective function that fits the training examples (not necessarily perfectly)
An Overfitting Example
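The original slide illustrates this with a figure; a small numpy sketch (my illustration) reproduces the effect: the high-degree fit wins on the training set but loses on fresh data.

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=10)   # linear truth plus noise

x_new = np.linspace(0, 1, 100)                        # stand-in for "the entire dataset"
y_new = 2 * x_new

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)     # fit a polynomial of this degree
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    new_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(degree, train_err, new_err)  # degree 9: near-zero train error, larger error on new data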
Outline of the Lecture
1. Learning Machine
2. Problems in ML
• Empirical Risk Minimization
• Feature Engineering
• Overfitting
3. ML Model Performance Evaluation
4. Evaluation Metrics
The Model Performance Evaluation (1)
• Evaluation of machine learning system performance is often performed empirically, rather than analytically.
The Model Performance Evaluation (2)
• The performance of a system depends not only on the machine learning algorithm used, but also on:
• The class distribution
• The cost of misclassification
• The size of the training set
• The size of the test set
• How can we obtain a reliable assessment of system performance?
• The larger the training set, the better the performance of the learning system
• The larger the test set, the more accurate the evaluation
• Problem: it is very difficult (rare) to obtain (very) large data sets
Evaluation Methods
• Hold-out
• Stratified sampling
• Repeated hold-out
• Cross-validation
• k-fold
• Leave-one-out
• Bootstrap sampling
Hold-out
• The data set D is divided into 2 parts:
• Training set: D_train
• Test set: D_test
• Requirements:
• No example in D_test is used in the training process
• No example in D_train is used in the model evaluation process
• Common split: |D_train| = 2/3 · |D|; |D_test| = 1/3 · |D|
• Suitable for a large set of examples D
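A minimal Python sketch (an illustration, not from the slides) of the 2/3 : 1/3 hold-out split:

import numpy as np

def hold_out_split(n, train_frac=2/3, seed=0):
    # shuffle the n example indices, then cut them into D_train and D_test
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(train_frac * n)
    return idx[:cut], idx[cut:]

train_idx, test_idx = hold_out_split(30)
print(len(train_idx), len(test_idx))  # 20 10: no index appears in both parts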
Cross-Validation – k-fold
• The entire set of examples D is divided into k non-intersecting subsets (called "folds") of approximately equal size
• In each of the k iterations, one subset is used as D_test and the (k−1) remaining subsets are used as D_train
• The k error values (one per fold) are averaged to obtain the overall error value
• Common choices: k = 10 or k = 5
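A minimal Python sketch (illustrative) of the k-fold procedure:

import numpy as np

def k_fold_indices(n, k=5, seed=0):
    # split the shuffled indices into k non-intersecting folds of roughly equal size;
    # each iteration uses one fold as D_test and the other k-1 folds as D_train
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(sorted(test_idx))  # each example lands in exactly one test fold
# the k per-fold error values would then be averaged into the overall error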
Bootstrap Sampling
• The bootstrap sampling method uses repeated sampling with replacement to create a training set
• Suppose the whole set D consists of n examples
• From the set D, randomly select an example x (but do not remove x from D)
• Put example x into the training set: D_train = D_train ∪ {x}
• Repeat the above 2 steps n times
• Use D_train to train the model
• D_test = {z ∈ D; z ∉ D_train} is used to test the model
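A minimal Python sketch (illustrative) of bootstrap sampling:

import numpy as np

def bootstrap_split(n, seed=0):
    # draw n examples with replacement for D_train;
    # D_test is every example that was never drawn
    train_idx = np.random.default_rng(seed).integers(0, n, size=n)
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(10)
print(sorted(train_idx), sorted(test_idx))
# on average ~63.2% of the examples end up in D_train; the rest form D_test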
Evaluation Criteria
• Accuracy
• The predictive (classification) performance of the (trained) model on test instances
• Efficiency
• The cost in time and resources (memory) required for model training and testing
• Robustness
• The system's ability to tolerate noisy (erroneous) or missing values
• Scalability
• How system performance (e.g. learning/classification speed) changes with the size of the data set
• Interpretability
• How easily the user can understand the system's results and operations
• Complexity
• The complexity of the learned system model (objective function)
Outline of the Lecture
1. Learning Machine
2. Problems in ML
• Empirical Risk Minimization
• Feature Engineering
• Overfitting
3. ML Model Performance Evaluation
4. Evaluation Metrics
Accuracy
• Shows how accurately the model solves the problem
• For a classification problem:
$\text{Accuracy} = \frac{1}{|D\_test|} \sum_{x \in D\_test} id(m(x), r(x)), \qquad id(a, b) = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{otherwise} \end{cases}$
• m(x) is the class that the model predicts for example x
• r(x) is the real class of x
• For a regression problem:
$\text{Error} = \frac{1}{|D\_test|} \sum_{x \in D\_test} |m(x) - r(x)|$
• m(x) is the model's prediction for x
• r(x) is the real output of x
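Both metrics in a short Python sketch (illustrative):

import numpy as np

def accuracy(m, r):
    # fraction of test examples where the predicted class m(x) equals the real class r(x)
    return np.mean(np.asarray(m) == np.asarray(r))

def regression_error(m, r):
    # mean absolute difference between prediction m(x) and real output r(x)
    return np.mean(np.abs(np.asarray(m) - np.asarray(r)))

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))      # 0.75
print(regression_error([2.0, 3.0], [2.5, 3.0]))  # 0.25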
Confusion Matrix (contingency table)
• Only used for classification problems
• TP: the number of examples belonging to class c that are correctly classified into class c
• TN: the number of examples not belonging to class c that are correctly determined not to belong to class c
• FP: the number of examples not in class c that are (wrongly) classified into class c
• FN: the number of examples belonging to class c that are (wrongly) classified into another class
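A short Python sketch (illustrative) computing the four counts for one class c:

import numpy as np

def confusion_counts(m, r, c):
    # TP, TN, FP, FN for class c, given predicted labels m and real labels r
    m, r = np.asarray(m), np.asarray(r)
    tp = np.sum((m == c) & (r == c))   # in class c, classified as c
    tn = np.sum((m != c) & (r != c))   # not in class c, not classified as c
    fp = np.sum((m == c) & (r != c))   # not in class c, but classified as c
    fn = np.sum((m != c) & (r == c))   # in class c, but classified as another class
    return tp, tn, fp, fn

print(confusion_counts(["a", "a", "b", "b"], ["a", "b", "b", "a"], c="a"))  # (1, 1, 1, 1)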
Precision and Recall for each class c
• Often used in text classification
• Precision for class c_i: the number of examples correctly classified into class c_i, divided by the total number of examples the model classified into class c_i
• Recall for class c_i: the number of examples correctly classified into class c_i, divided by the total number of examples actually in class c_i
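In terms of the confusion-matrix counts, these word definitions correspond to the standard formulas (the original slide shows them as images):

$\text{Precision}(c_i) = \frac{TP_i}{TP_i + FP_i} \qquad \text{Recall}(c_i) = \frac{TP_i}{TP_i + FN_i}$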
Precision and Recall over all classes
• Assume the model classifies data into a set of classes $C = \{c_i\}_{i=1}^{n}$; we then obtain $TP_i, TN_i, FP_i, FN_i$ for each $c_i$
• Micro-averaging: pool the counts over all classes, then compute the metrics (see the formulas below)
• Macro-averaging: compute the metrics per class, then average them over the classes
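The formulas on the original slide are images; the standard definitions they correspond to are:

$P_{micro} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)} \qquad R_{micro} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FN_i)}$

$P_{macro} = \frac{1}{n} \sum_{i=1}^{n} \text{Precision}(c_i) \qquad R_{macro} = \frac{1}{n} \sum_{i=1}^{n} \text{Recall}(c_i)$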
F1
• F1 is the harmonic mean of Precision and Recall
• F1 tends toward the smaller of the two values, Precision and Recall
• F1 is large only if both Precision and Recall are large
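In formula form (standard definition; shown as an image on the original slide):

$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$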
Q&A - Thank you!
