IIM Rohtak
Business Analytics
Chapters 6-12
Today's objectives
Supervised Learning
Machine Learning Approach
Predictive Analysis
Classification analysis
Accuracy Theory & Decision Tree based Approach
Indian Institute of Management (IIM),Rohtak
ORANGE
Clustering
Now select any cluster; the same selection
will then appear in the
Data Table and the Scatter Plot.
K Means
Understanding K means
Random allocation of data points
k-Means
Suppose you want k = 4:
then fix k = 4.
If instead you want the
system to find the
optimal value of k
using silhouette scores,
select that option
in the k-Means window.
The system now suggests
k = 2 (its silhouette score is the
highest among the candidates).
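To make the silhouette comparison concrete, here is a minimal pure-Python sketch of the score. The eight points and the two candidate labelings (one for k = 2, one for k = 4) are made-up illustration data, not the dataset used in the slides, and the function `silhouette_score` is a hypothetical helper rather than the Orange implementation.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette over all points; labels[i] is the cluster of points[i]."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two well-separated groups of four points each.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
candidate_labelings = {
    2: [0, 0, 0, 0, 1, 1, 1, 1],   # one cluster per group
    4: [0, 0, 1, 1, 2, 2, 3, 3],   # each group split in half
}
scores = {k: silhouette_score(points, labels)
          for k, labels in candidate_labelings.items()}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On this toy data the k = 2 labeling scores higher, which mirrors how the widget ranks candidate values of k.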
Predictive Learning?
March: in most Indian schools, the
annual examinations are held.
Most parents (recall the conversations
with your own parents) tell you to work
hard in subjects like maths and science.
On what basis do your parents suggest
this, and why?
Machine Learning
Machine Learning
Supervised learning
Classification
•Logistic Regression
•Support Vector Machine
•k-Nearest Neighbours
•Decision Tree
•Ensemble Methods (Bagging
and Boosting)
•Random Forest
•Gradient Boosting
•AdaBoost
•Neural Network
•Naive Bayes
Classification of machine learning
algorithms
Classification
Decision trees are used to predict a label (usually
binary) for a dependent variable, such as:
• Will a person suffer a heart attack in the next year?
• Will a voter vote for the BJP in the next national
election?
• Will student X clear the IAS exam this time?
For this type of problem, a decision tree gives a basic
solution based on past data.
Decision Tree
Tree-based learning algorithms are among the most widely used
supervised learning methods. Tree-based methods can give
predictive models high accuracy, stability and ease of
interpretation. Unlike linear models, they map non-linear
relationships quite well. They adapt to both kinds of supervised
problem: classification and regression.
Classification
(a decision tree structure)
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision
trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that
are correctly classified by the model
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Confusion Matrix
False Positive (FP) – Type I Error
•The predicted value was falsely predicted.
•The actual value was negative, but the model predicted a positive value.
•Also known as the type I error.
False Negative (FN) – Type II Error
•The predicted value was falsely predicted.
•The actual value was positive, but the model predicted a negative value.
•Also known as the type II error.
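The four cells of a confusion matrix can be counted directly from paired actual and predicted labels. The sketch below assumes "yes" is the positive class; the six label pairs are made-up illustration data and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, FN and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # Type I error: predicted positive, actually negative
        elif a == positive:
            fn += 1          # Type II error: predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical labels, for illustration only.
actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
counts = confusion_counts(actual, predicted)
print(counts)
```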
Confusion Matrix
It is easy to “game” the accuracy metric when making
predictions for an imbalanced dataset like this one,
where only 3 of 60 emails are spam. To do that,
predict that nothing will happen and label every email
as non-spam. A model that predicts the majority
(non-spam) class all the time will mostly be right,
leading to very high accuracy.
In this specific example, the accuracy is 95%: the
model missed every spam email, but it was
still right in 57 cases out of 60.
However, this accuracy is meaningless. The
model does not serve the primary goal or help
identify the target event.
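The arithmetic of this example can be checked with a short sketch. It assumes, as the 57-out-of-60 figure implies, that 3 of the 60 emails are spam; the variable names are illustrative.

```python
# 60 emails: 3 spam, 57 non-spam (counts implied by the example above).
actual = ["spam"] * 3 + ["non-spam"] * 57
# A "model" that always predicts the majority class.
predicted = ["non-spam"] * 60

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall on the spam class: share of actual spam that was detected.
spam_found = sum(a == "spam" and p == "spam" for a, p in zip(actual, predicted))
spam_recall = spam_found / actual.count("spam")

print(f"accuracy = {accuracy:.0%}, spam recall = {spam_recall:.0%}")
```

Accuracy comes out at 95% while recall on the spam class is 0%, which is exactly the gap the slide describes.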
Confusion Matrix
Pros:
•Accuracy is a helpful metric when you deal with balanced classes and
care about the overall model “correctness” and not the ability to
predict a specific class.
•Accuracy is easy to explain and communicate.
Cons:
•If you have imbalanced classes, accuracy is less useful since it gives
equal weight to the model’s ability to predict all categories.
•Communicating accuracy in such cases can be misleading and disguise
low performance on the target class.
When you see an imbalanced example like the spam example above, it
is very intuitive to suggest a different approach to model evaluation
that overcomes the limitation of accuracy: we do not need the
“overall” correctness. We want to find spam emails, after all! Can we
focus on how well we see and detect them specifically?
Precision and recall are the two metrics that help with that.
Confusion Matrix
Calculate Recall | Sensitivity | True Positive Rate (TPR)
$\text{TPR} = \frac{TP}{\text{Actual Yes}} = \frac{TP}{TP + FN} = \frac{95}{95 + 5} = 95\%$
Confusion Matrix
The false positive rate is calculated as
FP/(FP+TN), where FP is the number of
false positives and TN is the number of
true negatives (FP+TN being the total
number of actual negatives). It is the
probability that a false alarm will be
raised: that a positive result will be
given when the true value is negative.
$\text{FPR} = \frac{FP}{FP + TN} = \frac{5}{5 + 45} = \frac{5}{50} = 10\%$
Confusion Matrix
True Negative Rate (Specificity):
$\text{TNR} = \frac{TN}{TN + FP} = \frac{45}{45 + 5} = \frac{45}{50} = 90\%$
Confusion Matrix
False Negative – The predicted value is negative, but the actual value is
positive, i.e., the model falsely predicted positive class labels to be
negative.
False Negative Rate – The ratio of false negatives to all actual positives,
i.e.,
FNR = FN / P
FNR = FN / (FN+TP)
NOTE: False negative (FN) is also called ‘type-2 error’.
FNR = FN / (FN+TP) = 5/(5+95) = 5/100 = 5%
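The four rates worked out on these slides all come from the same confusion matrix (TP = 95, FN = 5, FP = 5, TN = 45). A minimal sketch that recomputes them, with illustrative variable names:

```python
# Counts taken from the worked examples above.
TP, FN, FP, TN = 95, 5, 5, 45

tpr = TP / (TP + FN)   # recall / sensitivity
fnr = FN / (FN + TP)   # miss rate
fpr = FP / (FP + TN)   # false alarm rate
tnr = TN / (TN + FP)   # specificity

print(f"TPR={tpr:.0%}  FNR={fnr:.0%}  FPR={fpr:.0%}  TNR={tnr:.0%}")
```

Note that TPR + FNR = 1 and FPR + TNR = 1, because each pair splits the same group of actual positives or actual negatives.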
Confusion Matrix
Precision = TP / Predicted Yes = TP / (TP + FP)
When the model predicts Yes, how often is it correct?
Confusion Matrix
What is F1-Score?
F1-score is the harmonic mean of precision and recall. It gives us an
overall measure of classifier performance by balancing both the
precision and recall values. It is given by the formula
$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
It ranges from 0 to 1, with 1 being the best possible score.
Why is the F1-score the harmonic mean of precision and recall rather
than the arithmetic mean?
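One way to approach the question above is to compare the two means on a lopsided classifier. The precision and recall values below are hypothetical, chosen only to illustrate the effect: the harmonic mean is pulled towards the smaller of the two values, so a model cannot hide poor recall behind high precision.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier: very precise, but it finds few positives.
precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_score(precision, recall)

print(f"arithmetic mean = {arithmetic_mean:.2f}, F1 = {f1:.2f}")
```

The arithmetic mean reports 0.50, which looks acceptable, while F1 reports 0.18 and exposes the weak recall.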
Confusion Matrix
Accuracy: Overall, how often is the classifier correct?
•(TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate: Overall, how often is it wrong?
•(FP+FN)/total = (10+5)/165 = 0.09
•equivalent to 1 minus Accuracy
•also known as "Error Rate"
Confusion Matrix
True Positive Rate: When it's actually yes, how often does it
predict yes?
•TP/actual yes = 100/105 = 0.95
•also known as "Sensitivity" or "Recall"
False Positive Rate: When it's actually no, how often does it
predict yes?
•FP/actual no = 10/60 = 0.17
True Negative Rate: When it's actually no, how often does it
predict no?
•TN/actual no = 50/60 = 0.83
•equivalent to 1 minus False Positive Rate
Precision: When it predicts yes, how often is it correct?
•TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur
in our sample?
•actual yes/total = 105/165 = 0.64
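All of the figures on these two slides follow from one confusion matrix with TP = 100, TN = 50, FP = 10 and FN = 5 (total 165). A short sketch that recomputes them; the dictionary keys are illustrative names.

```python
# Counts used in the worked example above.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN          # 165
actual_yes = TP + FN               # 105
actual_no = TN + FP                # 60
predicted_yes = TP + FP            # 110

metrics = {
    "accuracy": (TP + TN) / total,
    "misclassification_rate": (FP + FN) / total,
    "true_positive_rate": TP / actual_yes,
    "false_positive_rate": FP / actual_no,
    "true_negative_rate": TN / actual_no,
    "precision": TP / predicted_yes,
    "prevalence": actual_yes / total,
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```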
Confusion Matrix
What is Cohen's kappa coefficient?
Cohen's kappa coefficient is a statistic that
quantifies the agreement between two raters or
classifiers, taking into account the possibility of
agreement by chance. It is often used to assess the
reliability of human annotations or the accuracy of
machine learning models for classification tasks. The
kappa coefficient ranges from -1 to 1, where -1
means complete disagreement, 0 means agreement no
better than chance, and 1 means perfect agreement.
A higher kappa value indicates better performance of
the model or the rater.
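As a sketch of how kappa is computed, the example below applies the standard definition, kappa = (p_o - p_e) / (1 - p_e), to the 165-case confusion matrix used earlier for accuracy (TP = 100, TN = 50, FP = 10, FN = 5). Here p_o is the observed agreement and p_e is the agreement expected by chance from the row and column totals. Reusing that matrix here is an assumption made for illustration.

```python
# Confusion matrix from the earlier accuracy example.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN

# Observed agreement: share of cases where prediction and actual label match.
p_o = (TP + TN) / total

# Chance agreement: probability both say "yes" plus probability both say "no",
# if predictions and actual labels were independent.
actual_yes, actual_no = TP + FN, TN + FP
pred_yes, pred_no = TP + FP, TN + FN
p_e = (actual_yes * pred_yes + actual_no * pred_no) / total ** 2

kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.2f}")
```

Accuracy here is about 0.91, but kappa is 0.80, because roughly 55% agreement would be expected by chance alone.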
Model Construction
Process (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classifier (Model):
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
Testing Data → Classifier
Unseen Data (Jeff, Professor, 4) → Classifier → Tenured?

Testing Data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
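The two steps can be sketched in a few lines: encode the rule learned in Process (1), estimate its accuracy on the testing data, then apply it to the unseen case. The function name `predict_tenured` is a hypothetical helper; the rule and the rows come from the slides.

```python
def predict_tenured(rank, years):
    """Rule from Process (1): professor OR more than 6 years -> tenured."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, actual tenured label).
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the known label of each test sample with the model's prediction.
correct = sum(predict_tenured(rank, years) == tenured
              for _, rank, years, tenured in test_data)
accuracy = correct / len(test_data)

# Unseen data: (Jeff, Professor, 4).
jeff_prediction = predict_tenured("Professor", 4)

print(f"test accuracy = {accuracy:.0%}, Jeff tenured? {jeff_prediction}")
```

The rule misclassifies Merlisa (7 years, but not tenured), so the accuracy on the test set is 3 out of 4.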
Decision Tree Induction: Training Dataset
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

New case to classify (YES or NO?):
>40     low      no        fair            ??
Decision Tree Induction: An Example
❑ Training data set: Buys_computer (the same 14 rows as on the previous slide)
❑ Resulting tree:

age?
  <=30   → student?
             no  → no
             yes → yes
  31…40  → yes
  >40    → credit rating?
             excellent → no
             fair      → yes

New case: >40 low no fair → YES
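The resulting tree can be written as nested rules and checked against the 14 training rows. This is a minimal sketch of the finished tree, not of the induction algorithm that builds it; `buys_computer` is a hypothetical function name.

```python
def buys_computer(age, income, student, credit_rating):
    """Decision tree from the slide: split on age, then student or credit rating."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    # Remaining branch: age == ">40"
    return "yes" if credit_rating == "fair" else "no"

# Training rows: (age, income, student, credit_rating, buys_computer).
training = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

# The tree should reproduce every training label.
training_correct = sum(buys_computer(*row[:4]) == row[4] for row in training)

# The new case from the slide: >40, low, no, fair.
new_case = buys_computer(">40", "low", "no", "fair")
print(f"{training_correct}/{len(training)} training rows correct; new case -> {new_case}")
```

Note that income never appears in the tree: on this data, age, student and credit rating are enough to separate the classes.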
Thank you!