Intermediate Analytics
ALY6015
Northeastern University
By: Behzad Abdi
Week Three
Standard Binary Classification Algorithms:
• K-Nearest Neighbors (KNN)
• Logistic Regression
• Support Vector Machine (SVM)
• Decision Trees
• Random Forest
• Gradient Boosting
• Neural Networks
• Naive Bayes
• AdaBoost
• Stochastic Gradient Descent (SGD)
Illustration: scores from 0 to 10 with a decision threshold at 5. Scores below the threshold are classified as 0, scores above it as 1, and the score sitting exactly at the threshold (shown as "?") depends on where the cutoff is placed.
What is a Threshold?
How to Use the ROC Curve for Optimal Threshold Selection?
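A minimal sketch of one common approach (assuming Python with scikit-learn, which the slides do not prescribe): compute the ROC curve, then pick the threshold that maximizes Youden's J statistic (TPR minus FPR).

```python
# A minimal sketch (not from the slides): choosing a classification threshold
# from the ROC curve by maximizing Youden's J statistic (TPR - FPR).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Simulated, imbalanced binary data (illustrative only)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]        # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y, scores)  # one (FPR, TPR) point per threshold
best = np.argmax(tpr - fpr)                  # Youden's J: maximize TPR - FPR
print("chosen threshold:", thresholds[best])

# Apply the chosen threshold instead of the default 0.5
y_pred = (scores >= thresholds[best]).astype(int)
```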
Generalized Linear Models
Introduction to Generalized Linear Models (GLM)
Structure of a GLM
Common Types of GLM
Applications of GLMs
Advantages and Limitations of GLM
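To make the structure concrete, here is a minimal sketch (assuming Python with statsmodels; the library and the simulated data are illustrative, not from the slides) of fitting one common GLM, logistic regression, i.e. a Binomial family with a logit link.

```python
# A minimal sketch of fitting a GLM with statsmodels: logistic regression,
# i.e. Binomial family with the (default) logit link. Data are simulated
# purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# True model: the log-odds of y = 1 are linear in the predictors
log_odds = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

X_design = sm.add_constant(X)                                # intercept column
model = sm.GLM(y, X_design, family=sm.families.Binomial())   # logit link by default
result = model.fit()
print(result.summary())                                      # coefficients, deviance, etc.
```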
Classifier Evaluation Methods
Introducing the Confusion Matrix
What is a Confusion Matrix?
• A table used to evaluate the performance of a classification model.
• Compares predicted labels with actual labels.
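A minimal sketch (assuming Python with scikit-learn; the labels below are made up for illustration) of comparing predicted labels with actual labels in a confusion matrix:

```python
# A minimal sketch: compare predicted labels with actual labels in a
# 2x2 confusion matrix (rows = actual class, columns = predicted class).
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # ground-truth labels
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]   # model output

cm = confusion_matrix(y_actual, y_predicted)
tn, fp, fn, tp = cm.ravel()                     # unpack the four cells
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```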
Accuracy
Definition: Proportion of correctly predicted samples (both positive and negative) out of the total samples.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example: If a spam filter correctly classifies 90 out of 100 emails, its accuracy is 90%.
Use Case:
• When the dataset is balanced (i.e., the number of samples in each class is roughly equal).
• Example: Predicting whether an email is spam when spam and non-spam emails are equally
represented.
Limitations:
• Misleading for imbalanced datasets (e.g., predicting all samples as the majority class may still result
in high accuracy).
Example: Classifying transactions as fraudulent or non-fraudulent:
Dataset:
Total transactions: 10,000
• 9500 non-fraudulent transactions (majority class)
• 500 fraudulent transactions (minority class)
Model Prediction:
Let’s assume the model predicts that all transactions are non-fraudulent. In this case:
• It correctly predicts 9500 non-fraudulent transactions (True Negatives).
• It incorrectly predicts 500 fraudulent transactions as non-fraudulent (False Negatives).
Model Accuracy (high but misleading):
• Correct predictions: 9,500
• Total transactions: 10,000
• Accuracy = Correct predictions / Total samples = 9,500 / 10,000 = 95%
Dataset Recap:
• Total transactions: 10,000
• Non-fraudulent transactions (majority class): 9,500
• Fraudulent transactions (minority class): 500
Model Behavior:
• The model predicts all transactions as non-fraudulent.
• True Negatives (TN): 9,500 (correctly predicted non-fraudulent transactions).
• False Negatives (FN): 500 (fraudulent transactions incorrectly predicted as non-fraudulent).
Model Accuracy Calculation:
Accuracy = (TP + TN) / Total samples = (0 + 9,500) / 10,000 = 95%
The model has 95% accuracy, which seems high and impressive at first glance.
Example: Classifying transactions as fraudulent or non-fraudulent:
Why is this Misleading?
1. Model Ignores the Minority Class (Fraudulent Transactions):
• The model does not detect any fraudulent transactions (500 missed cases).
• False negatives are critical in fraud detection. Missing these cases could lead to serious financial losses
or risks.
2. Accuracy Favors the Majority Class:
• Since 95% of the data is non-fraudulent, predicting "non-fraudulent" for every transaction results in
high accuracy.
• However, the model fails to address the minority class (fraud).
3. Critical Errors Are Overlooked:
• A fraud detection model must prioritize detecting fraudulent transactions (minority class), even if it
sacrifices some accuracy for the majority class.
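A minimal sketch reproducing this accuracy paradox (assuming scikit-learn; the counts mirror the example above): the all-non-fraudulent model reaches 95% accuracy yet catches no fraud at all.

```python
# A minimal sketch of the accuracy paradox: 9,500 non-fraudulent (0) and
# 500 fraudulent (1) transactions, with a "model" that predicts
# non-fraudulent for every transaction.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_actual = np.array([0] * 9500 + [1] * 500)   # true labels
y_predicted = np.zeros_like(y_actual)          # always predict non-fraudulent

print("accuracy:", accuracy_score(y_actual, y_predicted))   # 0.95
print("recall:  ", recall_score(y_actual, y_predicted))     # 0.0 -- no fraud detected
```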
Precision
Definition: Proportion of true positive predictions out of all positive predictions.
Precision = TP / (TP + FP)
Use Case:
• When false positives (FP) are costly or critical to minimize.
• Example: Predicting whether a patient has cancer (false positives can lead to unnecessary treatments).
Strengths:
• Focuses on the quality of positive predictions.
• Useful when positive predictions should be highly reliable.
Recall (Sensitivity or True Positive Rate)
Definition: Proportion of true positive predictions out of all actual positive samples.
Recall = TP / (TP + FN)
Use Case:
• When false negatives (FN) are costly or critical to minimize.
• Example: Detecting fraud or diagnosing a rare disease (missing a positive case is unacceptable).
Strengths:
• Ensures that as many positive cases as possible are detected.
• Useful in applications where missing true positives has severe consequences.
F1-Score
Definition: Harmonic mean of Precision and Recall, balancing both metrics.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Use Case:
• When there is an uneven class distribution and a balance between Precision and Recall is desired.
• Example: Spam detection, where both false positives (wrongly classifying a legitimate email as spam)
and false negatives (missing a spam email) are important.
Strengths:
• Useful for imbalanced datasets.
• Provides a single score that balances the trade-off between Precision and Recall.
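A minimal sketch (assuming scikit-learn; the label vectors are hypothetical) computing Precision, Recall, and F1-Score side by side:

```python
# A minimal sketch computing Precision, Recall, and F1-Score for one set
# of (hypothetical) predictions on an imbalanced label vector.
from sklearn.metrics import precision_score, recall_score, f1_score

y_actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_predicted = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

print("precision:", precision_score(y_actual, y_predicted))  # TP / (TP + FP) = 3/5
print("recall:   ", recall_score(y_actual, y_predicted))     # TP / (TP + FN) = 3/4
print("f1-score: ", f1_score(y_actual, y_predicted))         # harmonic mean of the two
```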
When to Use Each Metric
Summary:
• Use Accuracy for balanced datasets or when all errors are equally important.
• Use Precision to avoid false alarms in applications where positive predictions must be highly reliable.
• Use Recall when failing to detect true positives is costly.
• Use F1-Score when the dataset is imbalanced, and you need to balance Precision and Recall.
Introduction to ROC Curve
Understanding AUC
Advantages and Limitations of ROC and AUC
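To accompany these topics, a minimal sketch (assuming scikit-learn; the model and simulated data are illustrative) of computing the AUC from predicted probabilities on held-out data:

```python
# A minimal sketch: fit a classifier on training data and report the area
# under its ROC curve (AUC) on a test split. 0.5 ~ random guessing,
# 1.0 = perfect ranking of positives above negatives.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]      # probability of the positive class

print("AUC:", roc_auc_score(y_test, probs))
```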