
IIM Rohtak: Machine Learning Insights

The document covers k-means clustering in Orange and supervised learning in machine learning, focusing on predictive analysis and classification techniques such as decision trees and logistic regression. It emphasizes the importance of model accuracy, confusion matrices, and metrics like precision, recall, and F1-score for evaluating model performance. Additionally, it introduces Cohen's kappa coefficient for assessing agreement between classifiers.

Uploaded by

Mirgank Tirkha
Copyright
© All Rights Reserved

IIM Rohtak

Business Analytics
Chapters 6-12
Today's objectives

Supervised Learning
Machine Learning Approach
Predictive Analysis
Classification analysis
Accuracy Theory & Decision Tree based Approach
Indian Institute of Management (IIM),Rohtak

ORANGE

Clustering

Now select any cluster; the same points should then appear in the
data table and the scatter plot.

K Means

Understanding K means

Random allocation of data points

k-Means
[Link]


Suppose you want k = 4: then fix k = 4.

If instead you want the system to find the optimal value of k using
silhouette scores, select that option in the k-Means window.


The system now suggests k = 2, because its silhouette score is the
highest among the candidate values of k.
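
The idea of letting silhouette scores pick k can be sketched outside Orange as well. The following is a minimal illustration in plain Python, not Orange's own implementation: the helper names `kmeans` and `silhouette`, the toy points, and the candidate range k = 2..4 are all assumptions made for this example.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of (x, y) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                # Simplification: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points.

    a = mean distance to the other points in the same cluster,
    b = smallest mean distance to the points of any other cluster.
    """
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated groups of four points each.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
scores = {k: silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On this toy data the two natural groups give the highest mean silhouette, which mirrors the slide's situation where the tool suggests k = 2.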


Predictive Learning?
March: most Indian schools hold their annual examinations.
Most parents (recall your conversations with your own parents)
advise you to work hard in subjects such as maths and science.
On what basis do your parents make this suggestion, and why?

Machine Learning

Supervised learning


Classification

•Logistic Regression
•Support Vector Machine
•k-Nearest Neighbours
•Decision Tree
•Ensemble Methods (Bagging and Boosting)
•Random Forest
•Gradient Boosting
•AdaBoost
•Neural Network
•Naive Bayes


[Link]
a-types-in-statistics-347e152e8bee

Classification of machine learning algorithms


Classification

Decision trees are used to predict a label (usually a binary
dependent variable), such as:
• Will a person suffer a heart attack in the next year?
• Will a voter vote for the BJP in the next national election?
• Will student X clear the IAS exam this time?
For this type of problem, a decision tree gives a basic solution
based on past data.


Decision Tree
Tree-based learning algorithms are considered among the best and most
widely used supervised learning methods. Tree-based methods give
predictive models high accuracy, stability and ease of interpretation.
Unlike linear models, they map non-linear relationships quite well.
They can be adapted to both classification and regression problems.


Classification
(a decision tree structure)
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision
trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that
are correctly classified by the model
– If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known

Confusion Matrix


False Positive (FP) – Type I Error
•The predicted value was falsely predicted.
•The actual value was negative, but the model predicted a positive value.
•Also known as the type I error.

False Negative (FN) – Type II Error
•The predicted value was falsely predicted.
•The actual value was positive, but the model predicted a negative value.
•Also known as the type II error.

Confusion Matrix
It is easy to “game” the accuracy metric when making predictions for a
dataset like this. To do that, you must predict that nothing will
happen and label every email as non-spam. The model predicting the
majority (non-spam) class all the time will mostly be right, leading
to very high accuracy.

In this specific example, the accuracy is 95%: yes, the model missed
every spam email, but it was still right in 57 cases out of 60.
However, this accuracy is now meaningless. The model does not serve
the primary goal or help identify the target event.
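
The spam example can be reproduced in a few lines. This is a small sketch that assumes a hypothetical label list matching the slide's numbers (60 emails, of which 3 are spam, since 57 of 60 predictions were right); the variable names are illustrative only.

```python
# Hypothetical labels matching the slide's numbers: 60 emails, 3 of them spam.
actual = ["spam"] * 3 + ["non-spam"] * 57
# The "always predict the majority class" model labels every email non-spam.
predicted = ["non-spam"] * 60

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall on the spam class: how many of the actual spam emails were caught?
true_pos = sum(a == "spam" and p == "spam" for a, p in zip(actual, predicted))
recall = true_pos / actual.count("spam")

print(f"accuracy    = {accuracy:.0%}")
print(f"spam recall = {recall:.0%}")
```

The accuracy comes out at 95% while the recall on the spam class is 0%, which is exactly why accuracy alone is misleading here.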


Confusion Matrix
Pros:
•Accuracy is a helpful metric when you deal with balanced classes and
care about the overall model “correctness” and not the ability to
predict a specific class.
•Accuracy is easy to explain and communicate.
Cons:
•If you have imbalanced classes, accuracy is less useful since it gives
equal weight to the model’s ability to predict all categories.
•Communicating accuracy in such cases can be misleading and disguise
low performance on the target class.
When you see an imbalanced example like the spam example above, it
is very intuitive to suggest a different approach to model evaluation
that overcomes the limitation of accuracy: we do not need the
“overall” correctness. We want to find spam emails, after all! Can we
focus on how well we see and detect them specifically?
Precision and recall are the two metrics that help with that.

Confusion Matrix
Calculate Recall | Sensitivity | True Positive Rate (TPR)

Recall = TP / Actual Yes = TP / (TP+FN) = 95 / (95+5) = 95/100 = 95%

Confusion Matrix

The false positive rate is calculated as FP/(FP+TN), where FP is the
number of false positives and TN is the number of true negatives
(FP+TN being the total number of negatives). It’s the probability that
a false alarm will be raised: that a positive result will be given
when the true value is negative.

FPR = FP / (FP+TN) = 5 / (5+45) = 5/50 = 10%


Confusion Matrix

Specificity (True Negative Rate):

TNR = TN / (TN+FP) = 45 / (45+5) = 45/50 = 90%


Confusion Matrix
False Negative – The predicted value is negative, but the actual value is
positive, i.e., the model falsely predicted the positive class labels to be
negative.
False Negative Rate – The ratio of false negatives to all actual positives,
i.e.,
FNR = FN / P
FNR = FN / (FN+TP)
NOTE: False negative (FN) is also called ‘type-2 error’.

FNR = FN / (FN+TP) = 5 / (5+95) = 5/100 = 5%
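
The four rates worked out on these slides can be checked together. The sketch below assumes the counts used in the slide arithmetic (TP = 95, FN = 5, FP = 5, TN = 45); the variable names are chosen for this example.

```python
# Counts taken from the slide arithmetic.
tp, fn, fp, tn = 95, 5, 5, 45

recall = tp / (tp + fn)   # true positive rate (sensitivity)
fnr = fn / (fn + tp)      # false negative rate
fpr = fp / (fp + tn)      # false positive rate
tnr = tn / (tn + fp)      # true negative rate (specificity)

print(f"Recall = {recall:.0%}, FNR = {fnr:.0%}")
print(f"FPR = {fpr:.0%}, TNR = {tnr:.0%}")

# Each pair shares a denominator, so the two rates in a pair sum to 1.
assert abs(recall + fnr - 1) < 1e-9
assert abs(fpr + tnr - 1) < 1e-9
```

Recall and FNR split the actual positives between them, and FPR and TNR split the actual negatives, so knowing one rate in each pair gives the other.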


Confusion Matrix
Precision = TP / Predicted Yes = TP / (TP+FP)

When it is predicted Yes, how often is it correct?


Confusion Matrix
What is F1-Score?
F1-score is the harmonic mean of precision and recall. It gives us an
overall measure of classifier performance by balancing both the
precision and recall values. It is given by the formula

F1 = 2 × (Precision × Recall) / (Precision + Recall)

It ranges from 0 to 1, with 1 being the best possible score.

Why is the F1-score the harmonic mean of precision and recall rather
than the arithmetic mean?
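
One way to approach this question is to compare the two means on a lopsided classifier. The sketch below uses the standard definition F1 = 2 × P × R / (P + R); the precision and recall values and the helper names `f1_score` and `arithmetic_mean` are hypothetical, chosen only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def arithmetic_mean(precision: float, recall: float) -> float:
    return (precision + recall) / 2

# Hypothetical classifier: flags very few items, all correctly
# (precision 1.0), but misses 90% of the positives (recall 0.1).
p, r = 1.0, 0.1
print(f"arithmetic mean = {arithmetic_mean(p, r):.2f}")
print(f"F1 (harmonic)   = {f1_score(p, r):.2f}")
```

The arithmetic mean reports 0.55, which hides the poor recall, while the harmonic mean drops to about 0.18 because it is pulled toward the smaller of the two values. A model therefore only scores well on F1 when precision and recall are both high.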


Confusion Matrix

Accuracy: Overall, how often is the classifier correct?
•(TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate: Overall, how often is it wrong?
•(FP+FN)/total = (10+5)/165 = 0.09
•equivalent to 1 minus Accuracy
•also known as "Error Rate"

Confusion Matrix
True Positive Rate: When it's actually yes, how often does it
predict yes?
•TP/actual yes = 100/105 = 0.95
•also known as "Sensitivity" or "Recall"
False Positive Rate: When it's actually no, how often does it
predict yes?
•FP/actual no = 10/60 = 0.17
True Negative Rate: When it's actually no, how often does it
predict no?
•TN/actual no = 50/60 = 0.83
•equivalent to 1 minus False Positive Rate
Precision: When it predicts yes, how often is it correct?
•TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur
in our sample?
•actual yes/total = 105/165 = 0.64
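
All of the rates for the 165-case example can be derived from the four cells of the confusion matrix. The sketch below assumes the counts implied by the slide arithmetic (TP = 100, TN = 50, FP = 10, FN = 5); the dictionary keys are names chosen for this example.

```python
# Cell counts implied by the slide arithmetic (total = 165).
tp, tn, fp, fn = 100, 50, 10, 5
total = tp + tn + fp + fn

metrics = {
    "accuracy": (tp + tn) / total,
    "misclassification_rate": (fp + fn) / total,
    "true_positive_rate": tp / (tp + fn),   # recall / sensitivity
    "false_positive_rate": fp / (fp + tn),
    "true_negative_rate": tn / (tn + fp),   # 1 - false positive rate
    "precision": tp / (tp + fp),
    "prevalence": (tp + fn) / total,
}

for name, value in metrics.items():
    print(f"{name:>24}: {value:.2f}")
```

The printed values match the slide figures: 0.91, 0.09, 0.95, 0.17, 0.83, 0.91 and 0.64.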

Confusion Matrix
What is the Cohen's kappa coefficient?
The Cohen's kappa coefficient is a statistic that quantifies the
agreement between two raters or classifiers, taking into account the
possibility of agreement by chance. It is often used to assess the
reliability of human annotations or the accuracy of machine learning
models for classification tasks. The kappa coefficient ranges from -1
to 1, where -1 means complete disagreement, 0 means agreement no
better than chance, and 1 means perfect agreement. A higher kappa
value indicates a better performance of the model or the rater.
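
As a rough illustration of how chance agreement is removed, the sketch below applies the standard formula kappa = (p_o - p_e) / (1 - p_e) to a 2x2 confusion matrix. Reusing the 165-case counts from the accuracy slides (TP = 100, FP = 10, FN = 5, TN = 50) is an assumption made for this example, and `cohen_kappa` is a helper name invented here.

```python
def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    # Observed agreement: share of cases where prediction and truth match.
    p_observed = (tp + tn) / total
    # Expected chance agreement, from the row and column totals.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)  # both say "yes"
    p_no = ((fn + tn) / total) * ((fp + tn) / total)   # both say "no"
    p_expected = p_yes + p_no
    return (p_observed - p_expected) / (1 - p_expected)

kappa = cohen_kappa(tp=100, fp=10, fn=5, tn=50)
print(f"kappa = {kappa:.2f}")
```

With these counts the observed agreement is 150/165 and the chance agreement is 6/11, so kappa is 0.80, noticeably lower than the 0.91 accuracy because part of that accuracy is attributable to chance.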

Model Construction


Process (1): Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Process (2): Using the Model in Prediction

Testing Data → Classifier ← Unseen Data

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
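
The two process slides can be traced in code. This is a minimal sketch that hard-codes the rule from the model-construction slide (rank is professor OR years > 6) and applies it to the four test records and to the unseen record for Jeff; the function name `predict_tenured` is hypothetical.

```python
def predict_tenured(rank: str, years: int) -> str:
    """Rule from the model-construction slide: professor OR more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual tenured label).
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 1: estimate accuracy by comparing predictions with the known labels.
correct = sum(predict_tenured(rank, years) == actual
              for _, rank, years, actual in test_data)
accuracy = correct / len(test_data)
print(f"test accuracy = {accuracy:.0%}")

# Step 2: if the accuracy is acceptable, classify the unseen record.
jeff = predict_tenured("Professor", 4)
print(f"Jeff tenured? {jeff}")
```

The rule gets three of the four test records right (it predicts "yes" for Merlisa, whose actual label is "no"), so the estimated accuracy is 75%, and it predicts "yes" for Jeff.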

Decision Tree Induction: Training Dataset

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Record to classify (YES or NO?):
>40     low     no       fair           ??

Decision Tree Induction: An Example
❑ Training data set: Buys_computer (the 14 records listed under
  "Decision Tree Induction: Training Dataset")
❑ Resulting tree:

age?
  <=30   → student?
             no  → no
             yes → yes
  31…40  → yes
  >40    → credit rating?
             excellent → no
             fair      → yes

Record to classify: >40, low, no, fair → YES
(age >40, then credit rating fair)
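
The resulting tree can be written as nested conditions and checked against the training records. This is a sketch under one assumption: the leaf for age >40 with an excellent credit rating is taken to be "no", which is read off the training records because the tree drawing is only partly legible in this transcript. The function name and the "31..40" spelling of the middle age band are choices made for this example.

```python
def buys_computer(age: str, income: str, student: str, credit_rating: str) -> str:
    """Decision tree for buys_computer; income is never tested by this tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # Remaining branch: age > 40, decided by credit rating.
    return "no" if credit_rating == "excellent" else "yes"

# The 14 training records: (age, income, student, credit_rating, buys_computer).
training = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

# The tree should reproduce every training label.
matches = sum(buys_computer(*row[:4]) == row[4] for row in training)
print(f"{matches} of {len(training)} training records classified correctly")

# The record the slide asks about: age >40, low income, not a student, fair credit.
answer = buys_computer(">40", "low", "no", "fair")
print(f"buys_computer = {answer}")
```

The tree reproduces all 14 training labels and returns "yes" for the new record, in line with the YES shown on the slide.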

Thank you!

