
Intro to ML Revision Notes

Contents:

- Introduction to ML
- Linear Regression
- Bias-Variance, Regularization
- Logistic Regression
- Classification Metrics
● Introduction to ML:

How is Machine Learning different from Classical Programming?

Classical Programming has rigid rules written by programmers and involves a lot of
hardcoding (i.e., rules are predefined and do not change with the data), while ML takes data
as input and makes predictions based on the rules it learns while training on that data.

Classical Programming: rigid rules written by programmers, a lot of hardcoding.

Machine Learning (ML): rules are learnt by training a system/model on the data.

When to use ML over Classical Programming ?

ML > Classical Prog. → applications which may not have easily visible patterns.

How to categorize the different types of ML models?

Based on the type of learning, ML algorithms are categorized as follows.

Criteria: Type of Learning

1. Supervised: the output label (Y) is present in the training data
2. Unsupervised: the output label (Y) is not present in the training data
3. Reinforcement: an agent takes actions in an "environment" to maximize a "reward"
and changes its "state" (used, e.g., for AI in games)

Criteria: Type of Task


1. Classification: input data needs to be classified into categories
2. Regression: input data needs to be mapped to a real-valued output
3. Clustering: grouping similar items together
4. Recommendation: recommending the items most likely to be relevant to a datapoint
5. Forecasting: understanding patterns in the data to predict future values

● Linear Regression:

How does the data look for Linear Regression ?


Regression problem → Data: $n$ samples, each with $d$ features $[f_1, f_2, \ldots, f_d]$, so the feature matrix is in $\mathbb{R}^{n \times d}$, where
- sample $x_i$ consists of the features
- each sample has a target label $y_i \in \mathbb{R}$.

How does the training data look for the Regression problem?

Regression → supervised task, the target $y_i$ is numerical

1. Input sample: $x_i \in \mathbb{R}^{d}$, $x_i = [x_{i1}, \ldots, x_{id}]^T$
2. Output sample: $y_i \in \mathbb{R}$
3. Training example: $(x_i, y_i)$
4. Training dataset: $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$

What is the goal of the ML model ?

Ans: To find $f: X \rightarrow y$ such that $f(x_i) \approx y_i$

How to define function f ?

Algebraic intuition → to find $\hat{y} = f(x_i)$, for Linear Regression we can say:

$f(x_{i1}, x_{i2}, \ldots, x_{id}) = w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3} + \ldots + w_d x_{id} + w_0$

$\hat{y}_i = f(x_i) = \sum_{j=1}^{d} w_j x_{ij} + w_0$

Now, with $w = [w_1, w_2, \ldots, w_d]$ and $x_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^T$, this becomes:

$\hat{y}_i = f(x_i) = w^T x_i + w_0$

How does the ML model find the function f ?

Ans: By updating weights of the model on the training dataset

How does the ML model update weights ?


Ans: By computing the gradient of the loss function with respect to the weights and
subtracting that gradient (scaled by the learning rate) from the weights.

Is 𝑓(𝑥𝑖) in Linear Regression analogous to 𝑦 = 𝑚𝑥 + 𝑐 ?


Ans: Yes, it is.

Linear Regression: finding the best d-dimensional hyperplane that fits the d-dimensional
data such that $\hat{y}_q \approx y_q$

How to find the best fit line of Linear Regression model ?


Ans: By optimizing the weight vector $w = [w_1, w_2, \ldots, w_d]^T$ with respect to the loss function.

How to say Linear Regression is optimized ?


Ans: When the loss function stops decreasing, i.e., it has reached a (local) minimum.

What loss function to use for linear regression optimization ?


Ans: Mean Squared Error → the mean of the squared difference between $y$ and $\hat{y}$:

$\min_{w, w_0} \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
How to find optimal weights for Lin. Reg. ?
Ans: Gradient Descent → minimizes the Mean Squared Error to reach the global minimum

How to find the Gradients of Mean Square Error ?


Ans: We define the loss function as:

$L(w, w_0) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + w_0)\right)^2$

Taking the gradient with respect to $w$:

$\frac{\partial L(w, w_0)}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \left(y_i - (w^T x_i + w_0)\right)^2}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + w_0)\right) \frac{\partial \left(y_i - (w^T x_i + w_0)\right)}{\partial w}$

Since $\frac{\partial (w^T x_i + w_0)}{\partial w} = x_i$, the equation simplifies to:

$\frac{\partial L(w, w_0)}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + w_0)\right) (-x_i)$

Similarly, the gradient with respect to $w_0$ becomes:

$\frac{\partial L(w, w_0)}{\partial w_0} = \frac{2}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + w_0)\right) (-1)$

Updating the weights $(w, w_0)$ with a learning rate $\alpha$:

$w = w - \alpha \times \frac{\partial L(w, w_0)}{\partial w}$

$w_0 = w_0 - \alpha \times \frac{\partial L(w, w_0)}{\partial w_0}$
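The update rules above translate directly into code. A minimal NumPy sketch, assuming illustrative choices for the learning rate `lr` and the number of epochs `n_epochs` (neither name comes from the notes):

```python
import numpy as np

def gradient_descent_linreg(X, y, lr=0.05, n_epochs=2000):
    """Fit y ≈ X·w + w0 by minimizing MSE with the gradients derived above."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_epochs):
        err = y - (X @ w + w0)                              # y_i - (w^T x_i + w0)
        grad_w = (2.0 / n) * np.sum(err[:, None] * (-X), axis=0)
        grad_w0 = (2.0 / n) * np.sum(err * (-1.0))
        w = w - lr * grad_w                                 # w  = w  - alpha * dL/dw
        w0 = w0 - lr * grad_w0                              # w0 = w0 - alpha * dL/dw0
    return w, w0

# Toy usage: y = 3*x1 + 2*x2 + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 1 + rng.normal(scale=0.1, size=200)
print(gradient_descent_linreg(X, y))   # weights ≈ [3, 2], bias ≈ 1
```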

Why use Learning Rate ?


Ans: Learning Rate α → a hyperparameter that controls the rate at which Gradient Descent
approaches the global minimum

What happens if a too small value of Learning Rate(α) is used ?


Ans: makes Gradient Descent reach the global minima very slowly

What happens if a too large value of Learning Rate(α) is used ?


Ans: may make the Gradient Descent overshoot the global minima

What will be the simplest model for predicting a value ?


Ans: Mean model → the mean of the entire data as its prediction.

After training the model, how to measure model performance ?

Ans: R-Squared metric → measures the performance of Linear Regression relative to a
mean model. It is defined as:

$R^2 = 1 - \frac{SS_{res}}{SS_{total}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, where $\bar{y}$ is the prediction of the mean model.

$SS_{res}$ — squared sum of errors of the regression line

$SS_{total}$ (total sum of squares) — squared sum of errors of the mean line

What will be the best value of $R^2$ ?

Ans: 1, when $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0$.

What will be the minimum value of $R^2$ ?

Ans: $-\infty$, when $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \gg \sum_{i=1}^{n} (y_i - \bar{y})^2$

What happens to R-squared if we add a new feature ?


Ans: If the feature is relevant, R-squared ↑.

But if the feature is not relevant, R-squared should ideally ↓ as model performance gets worse. However:
- The model's task is to minimize the loss
- So, if adding a feature would reduce performance (i.e., increase the loss)
- The model can simply assign a small or zero weight to the new feature to avoid the
decrease in performance.

Will the R-Square value increase or remain the same if we add a new feature?
Ans: Both are possible:
- A small or zero weight can be assigned to the new feature, which keeps model
performance (and hence R-squared) the same.
- The model can also start making spurious associations with the new feature (i.e. overfit),
causing model performance to increase on the train set.

As R-Square fails to trade off performance against model complexity (number of features),


What other metrics to use ?

Ans: Adjusted R-Squared, defined as:


$Adj\text{-}R^2 = 1 - \left[\frac{(1 - R^2)(n - 1)}{(n - d - 1)}\right]$,

where n is the number of samples and d is the number of features
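As a quick illustration, both metrics can be computed directly from the definitions above; the toy arrays below are made up for the example:

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # error of the regression line
    ss_total = np.sum((y - y.mean()) ** 2)   # error of the mean model
    return 1 - ss_res / ss_total

def adjusted_r_squared(y, y_hat, d):
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1 - ((1 - r2) * (n - 1)) / (n - d - 1)

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.1, 7.2, 8.9])
print(r_squared(y, y_hat))                   # close to 1: far better than the mean model
print(adjusted_r_squared(y, y_hat, d=2))     # slightly lower, penalized for d features
```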

How does Adj R-Square compare performance and model complexity ?

Ans: if the number of features (d) increases

with no significant feature:

- $R^2$ remains constant or increases slightly ⇒ $(n - d - 1)$ ↓ ⇒ Adj-R-squared ↓

with a significant feature:

- $R^2$ ↑ significantly ⇒ $\left[\frac{(1 - R^2)(n - 1)}{(n - d - 1)}\right]$ ↓ ⇒ Adj-R-squared ↑

How to determine which features impact the model most during prediction ?
Ans: The feature with highest weight → most important feature

What does the -ve sign mean in weight of model ?


Ans: The -ve sign means → if the feature value ↑, then $\hat{y}$ ↓

Why perform column standardization ?


Ans: To bring all features to the same scale.
i.e. Removes ambiguity in feature importances

For example: if there are two features $f_1$, $f_2$:

- $f_1$ value range >>> $f_2$ value range → the weights of $f_1$ and $f_2$ are on completely
different scales → comparing them directly may make $f_1$ look like the important feature
even though $f_2$ is the important one.

Column Standardization makes $f_1$ value range ≈ $f_2$ value range → the weights become
directly comparable ⇒ the truly important feature ($f_2$ here) stands out.
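As an illustration, a minimal column-standardization sketch assuming scikit-learn's `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1000.0, 0.1],   # f1 has a much larger value range than f2
              [2000.0, 0.3],
              [3000.0, 0.5]])

X_std = StandardScaler().fit_transform(X)      # each column -> (x - mean) / std
print(X_std.mean(axis=0), X_std.std(axis=0))   # ≈ [0, 0] and [1, 1]: same scale now
```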

Assumptions of Linear Regression :

a. Assumption of Linearity:
linear relationship between the features 𝑥 and target variable 𝑦

b. Features are not multi-collinear:

What is collinearity ?
Ans: Two features $(f_1, f_2)$ have a linear relationship between them: $f_1 = \alpha f_2$

What is Multicollinearity ?
Ans: Feature $f_1$ has collinearity across multiple features $f_2, f_3, f_4$:
$f_1 = \alpha_1 f_2 + \alpha_2 f_3 + \alpha_3 f_4$

Why is MultiCollinearity a problem ?

Ans: Multicollinearity → makes the feature importances unreliable and hurts model
interpretability.

How to resolve Multi- Collinearity ?


Ans: Using the Variance Inflation Factor (VIF), defined as:

$VIF_{f_j} = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the R-squared obtained by regressing $f_j$ on the remaining features.

The VIF algorithm works as follows (a code sketch follows below):

- Calculate the VIF of each feature

- If VIF >= 5 → high multicollinearity

- Remove the feature having the highest VIF

- Recalculate the VIF for the remaining features

- Again remove the feature having the highest VIF

- Repeat until all VIF < 5 or some number of iterations is reached
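One possible sketch of this iterative procedure, computing each $R_j^2$ by regressing feature $f_j$ on the remaining features; the helper names, the epsilon guard and the toy data are assumptions of the sketch:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(df, col):
    """VIF of one feature = 1 / (1 - R^2 of regressing it on the other features)."""
    X_others, y = df.drop(columns=[col]).values, df[col].values
    r2 = LinearRegression().fit(X_others, y).score(X_others, y)
    return 1.0 / (1.0 - r2 + 1e-12)            # epsilon avoids division by zero

def drop_multicollinear(df, threshold=5.0):
    df = df.copy()
    while df.shape[1] > 1:
        vifs = {c: vif(df, c) for c in df.columns}
        worst, worst_vif = max(vifs.items(), key=lambda kv: kv[1])
        if worst_vif < threshold:              # all VIF < 5 -> stop
            break
        df = df.drop(columns=[worst])          # remove the feature with the highest VIF
    return df

# Toy data: f3 is (almost) a linear combination of f1 and f2
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=100), rng.normal(size=100)
df = pd.DataFrame({"f1": f1, "f2": f2,
                   "f3": 2 * f1 + 3 * f2 + rng.normal(scale=0.01, size=100)})
print(drop_multicollinear(df).columns.tolist())   # one of the collinear features is dropped
```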

c. Errors are normally distributed:

Used to ensure there are no outliers present in the data

d. Heteroskedasticity should not exist:

Heteroskedasticity → unequal scatter of the error term → not having the same
variance

Why Heteroskedasticity is a problem ?


Ans: It indicates the model is inaccurate or there are outliers in the data.

How to check Heteroskedasticity ?

Ans: By plotting a residual plot → errors $(y - \hat{y})$ vs predictions $(\hat{y})$

e. No AutoCorrelation:

What is AutoCorrelation ?
Ans: When the current feature value depends upon its previous value

Why is AutoCorrelation a problem ?

Ans: Linear regression assumes $\hat{y}_1 = f(x)$ is independent of
$\hat{y}_2 = f(x + 1)$ → autocorrelation contradicts this assumption.

Is there any other way to solve Linear Regression ?

Ans: Closed Form/ Normal Equation

Why use Normal Equations ?


Ans: Finds the optimal weights without any iterating steps as done in Gradient Descent.

The optimal weights: $W = (X^T X)^{-1} X^T Y$

where $X$ → feature matrix $\in \mathbb{R}^{n \times d}$ (n samples, d dimensions), and $Y$ → target vector $\in \mathbb{R}^{n \times 1}$

Why even use gradient descent ?


Ans: $(X^T X)^{-1}$ is a computationally expensive operation ⇒ the normal equation is not
used when the number of features is high.
Dimensions of the matrix multiplication: $(d \times n) \cdot (n \times d) \Rightarrow (d \times d)$, and inverting a $d \times d$ matrix is costly for large d.
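A small NumPy sketch of the closed-form solution; appending a column of ones so that $w_0$ is learned as part of $W$ is a convention assumed by this sketch, not something stated in the notes:

```python
import numpy as np

def normal_equation(X, y):
    """W = (X^T X)^{-1} X^T Y, with a bias column of ones appended to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # last column models w0
    return np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y       # (d+1,) weight vector

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 1
print(normal_equation(X, y))                         # ≈ [3, 2, 1], i.e. (w1, w2, w0)
```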

What if there is non-linearity in the data , can Linear Regression work ?


Ans: No

What modifications can be done to Linear Regression for the model to be complex
enough to fit non-linear data ?

Ans: By using Polynomial Regression → transforms the linear equation of Linear
Regression into a polynomial equation

How does Polynomial Regression work ?


Ans: If Linear Regression has $\hat{y} = w_1 f_1 + w_2 f_2 + \ldots + w_d f_d + w_0$,

Polynomial Regression introduces higher-degree terms such as $f_1^2$ or $f_3^4$ (and transformed features like $\alpha f_2 + \alpha_0$),

making the model complex enough to handle non-linearity.
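A short scikit-learn sketch of this idea, generating polynomial feature terms before fitting an ordinary linear model; degree 2 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Non-linear target: y = x^2, which a plain straight line cannot fit
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))   # R^2 ≈ 1.0 once the squared term is available
```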

● Bias-Variance, Regularization:

If model A : covers all datapoints with high degree features


i.e. predicting hyperplane passes through all the datapoints

Model B : misses out only a handful datapoints using lower degree features,
I.e. predicting hyperplane misses a handful of datapoints

Then which model is better ?

Ans: Model B generalizes on the data → the model captures the pattern of the data and
does not get influenced by outliers (hence those points are missed).

Model B, being a simpler model than Model A, is preferred → Occam's Razor

What can we say about model A ?


Ans: Model overfits the data → fitting to outliers/noises in the data

When to say model underfits the data ?


Ans: When the model is not able to predict most of the datapoints in the data → model
has poor performance.

How is data split ?


Ans: Training, Validation and testing dataset.

How is training and test data related to underfit and overfit model ?
Ans: an underfitted model → has a high training and test loss
- An Overfitted model → very low training loss but a high testing loss.

What is a suitable model ?
Ans: A tradeoff between both, such that the model has a low training and testing loss
→ a perfectly fit model

We can understand Underfit and overfit using Bias and Variance.

What do we mean by Bias and Variance ?

Ans: Understanding bias- variance with a target shooting example

Observe
- High Bias → have a wrong aim
- High Variance → an unsteady aim.

How is underfit related to Bias and Variance ?


Ans: Now in Underfitting → predictions are consistent but are wrong→ for different
training sets → High Bias and low Variance

How is overfit related to Bias and Variance ?


Ans: Now in Overfitting → predictions vary too much and are wrong → for different
training sets → Low Bias and High Variance.

How to control Underfit Overfit tradeoff to find the perfect model ?


Ans: Regularization → adds a term $\sum_{j=1}^{d} w_j^2$ to the loss function → makes the weights small
for insignificant features

How does Regularization make weights small for insignificant features ?


Ans: The optimization algorithm now also minimizes the values of $w_j$:

$Total\ Loss = \min_{w_j}\ Loss\ function + \lambda \sum_{j=1}^{d} w_j^2$

How to control Regularization ?


Ans: By using regularization parameter λ:

since too much regularization → makes the model underfit the data

Too little regularization → makes the model overfit.

Thus ⇒ λ becomes hyperparameter → on tuning gives the overfit-underfit tradeoff

Is squaring of weights the only way for Regularization ?

Ans: No, Regularization is majorly of three types (a code sketch follows below):

A. L1 / Lasso Regularization: uses the term $\sum_{j=1}^{d} |w_j|$ → the penalty can push
$w_j$ exactly to 0 for insignificant features → making the weight vector sparse.

B. L2 / Ridge / Tikhonov Regularization: uses the term $\sum_{j=1}^{d} w_j^2$ → insignificant
features get weights close to 0 (but not exactly 0).

C. ElasticNet Regularization: combination of both L1 and L2 Regularization,
with $\lambda_1$ and $\lambda_2$ as the respective regularization parameters:

$Total\ Loss = \min_{w_j}\ Loss\ function + \lambda_1 \sum_{j=1}^{d} w_j^2 + \lambda_2 \sum_{j=1}^{d} |w_j|$
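A brief scikit-learn sketch comparing the three penalties; the `alpha` and `l1_ratio` values are illustrative hyperparameters playing the role of λ above:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.5, size=200)   # only 2 features matter

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: small but non-zero weights
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: useless features get exactly 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of both penalties

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))   # expect a sparse weight vector
print(np.round(enet.coef_, 2))
```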

Why split data into Validation dataset ?


Ans: hyperparameter tuning → done only on Validation data → test data solely used for
evaluating the model on unseen data.

What are the steps for a model building ?


Ans: The steps are :
- Train model → with some regularization parameter λ → on training data
- Measure the model performance → with different value of hyperparameters → on the
Validation dataset
- Pick the hyperparameters of the best performing model
- Measure the performance of the Best performing model on Test data.

If the data is too small to have a validation dataset, what to do then ?
Ans: Use the k-Fold CV algorithm (a code sketch follows the note below), which:
- splits the data into k smaller sets (folds)
- for each iteration, trains the model on k-1 folds
- validates on the remaining 1 fold
- averages the performance over all the iterations.

Note: Though k-Fold is a computationally expensive algorithm, it is useful when dataset


is small.
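A minimal k-Fold sketch assuming scikit-learn's `KFold`; 5 folds and a Ridge model are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])   # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))         # validate on the held-out fold
print(np.mean(scores))                                         # performance averaged over folds
```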

● Logistic Regression

Why do we need Logistic Regression ?


● Useful for binary classification

What are the assumptions of Logistic regression ?


● Data should be linearly separable

What is the goal of this algorithm ?


● To find a hyperplane π which accurately separates the data

How to perform prediction through logistic regression ?

● Given labels → $y_i \in \{0, 1\}$
● Compute a linear function of x → $z_i = w^T x_i + w_0 \in (-\infty, \infty)$
● Compute the sigmoid → $\sigma(z_i) = \frac{1}{1 + e^{-z_i}}$
● Predicted label → $\hat{y}_i = 1$ if $\sigma(z_i) > threshold$, else $0$ (see the sketch below)
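A small NumPy sketch of this prediction step; a threshold of 0.5 is the usual default and is assumed here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, w0, threshold=0.5):
    z = X @ w + w0                 # linear score in (-inf, inf)
    p = sigmoid(z)                 # probability of class 1, in (0, 1)
    return (p > threshold).astype(int)

X = np.array([[1.0, 2.0], [-1.0, -2.0]])
w, w0 = np.array([0.5, 0.5]), 0.0
print(predict(X, w, w0))           # [1 0]
```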

What are the properties of Sigmoid function ?


● Range -> (0, 1)
● Smooth and differentiable at all points

What is the derivative of the sigmoid function used in logistic regression ?

$\sigma'(z) = \sigma(z) \times [1 - \sigma(z)]$

What is the significance of the sigmoid function ?


● σ (𝑧𝑖 ) is the probability of xi belonging to class 1

Which loss function is used for Logistic Regression ?

● Log loss -> Combination of:


○ $-\log(\hat{y}_i)$ when $y_i = 1$
○ $-\log(1 - \hat{y}_i)$ when $y_i = 0$

$Log\ Loss = -y_i \log(\hat{y}_i) - (1 - y_i)\log(1 - \hat{y}_i)$

The total loss (with L2 regularization) becomes:

$Total\ Loss = Log\ Loss + \lambda \sum_{j=1}^{d} w_j^2$

How does log loss help train logistic regression model ?


● $-\log(\hat{y}_i)$ → high when $\hat{y}_i \approx 0$ and very low when $\hat{y}_i \approx 1$.
● $-\log(1 - \hat{y}_i)$ → high when $\hat{y}_i \approx 1$ and very low when $\hat{y}_i \approx 0$.

Thus both of these components penalize the model most when it makes wrong
predictions.

Also, if $y_i \in \{-1, 1\}$, then $Log\ Loss = \sum_{i=1}^{n} \log(1 + e^{-y_i z_i})$
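A short sketch computing log loss from the formula above; clipping the probabilities to avoid log(0) is a standard numerical safeguard added here:

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-15):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    # mean over samples of -y*log(y_hat) - (1-y)*log(1 - y_hat)
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])
good = np.array([0.9, 0.1, 0.8, 0.7])   # confident and correct
bad = np.array([0.2, 0.8, 0.3, 0.4])    # confident and wrong
print(log_loss(y, good))                # small loss
print(log_loss(y, bad))                 # much larger loss
```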
But why can’t we use Mean Square Error as in Linear regression ?
● With the sigmoid, MSE gives a non-convex loss curve: it contains a lot of local minima
● Difficult for Gradient Descent to reach the global minimum

What does the derivative of Log Loss look like ?


$\frac{\partial Log\ Loss}{\partial w} = \frac{\partial\left(-y \log(\hat{y}) - (1 - y)\log(1 - \hat{y})\right)}{\partial w}$, where $\hat{y} = \sigma(z)$ and $z = w^T x + w_0$

Intermediate steps : reference blog

On solving:

$\frac{\partial Log\ Loss}{\partial w} = -y \frac{\partial z}{\partial w} + y \hat{y} \frac{\partial z}{\partial w} + \hat{y} \frac{\partial z}{\partial w} - y \hat{y} \frac{\partial z}{\partial w}$

$\frac{\partial Log\ Loss}{\partial w} = -y \frac{\partial z}{\partial w} + \hat{y} \frac{\partial z}{\partial w}$

Also, since $\frac{\partial z}{\partial w} = x$: $\frac{\partial Log\ Loss}{\partial w} = (\hat{y} - y)\, x$

What if we want to predict odds of yi = 1 vs yi = 0 ?
● Log - odds: Shows how the model is similar to a linear model which is
predicting log-odds of 𝑦𝑖 = 1 𝑣𝑠 𝑦𝑖 = 0, defined as :

$\log_e(odds) = \log\left[\frac{p}{1-p}\right]$; where $p = \frac{1}{1 + e^{-z_i}} = \frac{e^{z_i}}{e^{z_i} + 1}$ and $1 - p = \frac{1}{e^{z_i} + 1}$

On substituting the values of p and 1-p and simplifying, we get:

$\log_e(odds) = \log_e\left[e^{z_i}\right] = z_i = w^T x_i + w_0$

Note: Hence the name Regression in Logistic Regression.

Is there any way to use Logistic Regression for multiclass classification ?


● One vs Rest Method (sketched below):
○ For $y_i \in \{1, 2, 3, \ldots, K\}$, train K binary Logistic Regression models
○ Final prediction → argmax over the predictions made by the K models
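A compact sketch of One-vs-Rest assuming scikit-learn's `OneVsRestClassifier`; the 3-class toy data is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

# Trains K binary logistic regressions; prediction is the argmax over their scores
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]), y[:5])
```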

● Classification Metrics

Issue with Accuracy ?


Ans: The Accuracy metric fails when the data is imbalanced.
- Consider there are 90 data points of class 1 and 10 data points of class 0.
- If the model predicts every datapoint as class 1,
⇒ the accuracy of the model is 90%, even though the model is useless.

What other metric to use ?
Ans: Confusion matrix

When a model predicts, there can be 4 scenarios:


A. True Positive (TP): Model predicts True, is Actually True
B. False Positive (FP): Model predicts True, is Actually False
- Aka Type 1 Error
C. True Negative (TN): Model predicts False, is Actually False
D. False Negative (FN): Model predicts False, is Actually True
- Aka Type 2 Error

Note:
- For a dumb model that predicts everything as negative, FP = TP = 0
- For an Ideal model that has no incorrect classification, FP = FN = 0

How to use CM to determine if a model with high accuracy is actually good?


Ans. High accuracy can be deceiving. It is only a good model if:

- both TP and TN are high

- both FP and FN are low

Given a CM, how can we calculate actual positives?

Ans. TP + FN = P (total actual positives)

⇒ Similarly, FP + TN = N (total actual negatives)

How to calculate accuracy from the confusion matrix ?


Ans: $Accuracy = \frac{Correct\ Predictions}{Total\ Number\ of\ Predictions} = \frac{TP + TN}{TP + TN + FN + FP}$

Which metric to use, when we cannot afford to have any false positives?
Ans: Precision: It tells us out of all points predicted to be positive, how many are
actually positive.
$Precision = \frac{TP}{TP + FP}$

For example:
- Misclassifying a spam email as not spam is somewhat acceptable, i.e. an FN
- However, classifying an important mail as spam can lead to a major loss, i.e. an FP
⇒ i.e. reducing FP is more critical.

What is the range of precision values?


Ans. Between 0 to 1.

Which metric to use, when we cannot afford to have any false negatives?
Recall / Sensitivity / Hit Rate: It tells us out of all the actually positive points, how many
of them are predicted to be positive
$Recall = \frac{TP}{TP + FN}$

For example:
- Classifying a healthy person as cancerous and carrying out further testing is
somewhat acceptable
- However, classifying a person with cancer as healthy can be a life-or-death situation.
⇒ Here, reducing FN is more critical.
Note:
- True Positive Rate (TPR): $TPR = \frac{TP}{TP + FN}$; this is the same as Recall

What is the range of recall values?


Ans. Between 0 to 1.
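A small sketch deriving accuracy, precision and recall from a confusion matrix; the labels are made up, and for binary labels scikit-learn's `confusion_matrix` returns [[TN, FP], [FN, TP]]:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of all predicted positives, how many are truly positive
recall = tp / (tp + fn)      # of all actual positives, how many were caught
print(accuracy, precision, recall)
```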

What other metrics to look for ?

- True Negative Rate (TNR): $TNR = \frac{TN}{FP + TN}$
- TNR tells us, out of all the actual negative points, how many have been predicted
as negative.
- Also called Specificity / Selectivity

- False Positive Rate (FPR): $FPR = \frac{FP}{FP + TN}$
- Intuitively, it tells us, out of all data points which are actually negative, how many
are misclassified as positive

- False Negative Rate (FNR): $FNR = \frac{FN}{FN + TP}$
- Intuitively, it tells us, out of all data points which are actually positive, how many
are misclassified as negative

Which metrics are used in the medical domain?


Sensitivity and Specificity are used to measure how good a test is at correctly identifying the
presence or absence of a disease.
In medical terms,
- Sensitivity : proportion of people with the disease who test positive for it
⇒ Test is good to be used as screening test
⇒ There is low chance of missing out a person with disease (low FN)
- Specificity : proportion of people without the disease who test negative for it
⇒ It means that test is good for confirmatory test
⇒ There will be low FP

What is F1 score? When is it used?


When both Precision and Recall are equally important measures for model evaluation,
we use the F1-Score:

$F1\ Score = \frac{2 \times (Precision \times Recall)}{Precision + Recall}$

For example, in a fintech company giving out loans:
- It is a loss to the business if they give a loan to people who are unable to repay it (FP)
- It is also a loss if they miss out on good people who would be able to repay (FN)
⇒ Here, we need to focus on both FP and FN.

Note:
- F1-score is just the Harmonic mean of Precision and Recall.
- F1-score Is also Useful when data is imbalanced.
- Range - [0, 1]

Why do we take harmonic mean in F1 score instead of arithmetic mean?
Harmonic Mean penalizes the reduction in Precision and Recall more than Arithmetic
Mean.

Can we adjust F1 score to give more preference to precision or recall?


A beta parameter can be added to make the F-score pay more attention to either
Precision or Recall:

$F_{\beta} = \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}$

- Beta = 2 if Recall more important than Precision.


- Beta = 0.5 when Precision is more important.
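A quick sketch of the F-beta formula, cross-checked against scikit-learn's `fbeta_score`; the labels are illustrative:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
beta = 2                                               # beta = 2 weights recall more heavily
manual = (1 + beta**2) * p * r / (beta**2 * p + r)
print(manual, fbeta_score(y_true, y_pred, beta=2))     # the two values match
```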

What is AU-ROC? Where is it used?


- AU-ROC: Area Under the Receiver Operating Characteristic curve
- Used to find the best model by plotting TPR vs FPR, taking each sorted value of
$\hat{y}_i$ as the threshold for the final prediction ($y_{pred}$).

How do we determine the better model using AU-ROC?


After plotting, the model whose curve covers the most area tends to be the better model.
For example:
- If the AUC of model B > the AUC of model A,
- then model B is better

Note:
- Unlike precision, recall or F1-score, AU-ROC does not work well for highly
imbalanced data.

What is the fundamental difference between AUC and the other metrics?
When we calculate Precision, Recall or F1 score:
- We calculate it at a certain threshold on $\hat{y}_i$
- This threshold is 0.5 by default
On the other hand, for AU-ROC:
- we calculate it using all possible thresholds

What will be the AUC of a random model?


The ROC curve will be diagonal. ⇒ Hence AUC will be 0.5

What to do if a model’s AUC < 0.5 ?


A simple fix is to invert your predictions.
⇒ After inverting, you will get area of 1-(actual area value)

Does AUC depend on the actual values of $\hat{y}_i$?

No. AU-ROC depends only on how the $\hat{y}_i$ are ordered, not on their actual values.

For example, say we have two models M1 and M2:

- Actual y labels: [1, 1, 0, 1, 1]
- $\hat{y}_i$ for M1: [0.95, 0.92, 0.80, 0.76, 0.71]
- $\hat{y}_i$ for M2: [0.2, 0.1, 0.08, 0.06, 0.01]

⇒ Since both have the same ordering of the $\hat{y}_i$, AUC(M1) = AUC(M2).
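This ordering argument can be checked directly, for example with scikit-learn's `roc_auc_score`, reusing the M1/M2 scores above:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 0, 1, 1]
m1 = [0.95, 0.92, 0.80, 0.76, 0.71]
m2 = [0.20, 0.10, 0.08, 0.06, 0.01]

# The score values differ, but their ordering is identical -> the AUCs are equal
print(roc_auc_score(y_true, m1), roc_auc_score(y_true, m2))
```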

AU-ROC is highly sensitive to imbalanced data, what metric can we use there?
We can use the Area Under the Precision-Recall Curve (AU-PRC).

This is a very good metric for imbalanced data.

How is PRC plotted?


- Precision on y axis
- Recall on x axis
- Similar to the ROC curve, we take each $\hat{y}_i$ as a threshold
Then we take the area under the PR curve to get the AU-PRC.

Which metric can we use to know how a model would do business wise?
- Business people need to know what difference our model would make in business
terms compared to random targeting.
- ⇒ Lift and Gain charts are used.
- ⇒ They help us graphically understand the benefit of using the model (in layman's
terms)

How to make lift and gain charts?


- Step 1: Obtain the predictions $\hat{y}_i$ on the cross-validation data Dcv
- Step 2: Sort the data by predicted probability in descending order, so the highest
probability is at the top and the lowest at the bottom.
- Step 3: Break the sorted cross-validation data into 10 groups (deciles).
- Step 4: Using these deciles, build a table as follows
⇒ the 1st decile has the datapoints with the highest predicted probability
⇒ the 10th decile has the datapoints with the lowest predicted probability

Calculation of Gain:
- Divide the cumulative number of responses by the total number of responses
- i.e. the cumulative number of positives up to that decile by the total number of positives.

Calculation of Lift:
- Divide the total % of positive points up to that decile (gain) for our model by the total
% of positive points up to that decile for a random model
- in other terms, it is the ratio of the gain of our model to the gain of a random model

⇒ After calculating the lift and gain values, we plot them for each decile.
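One possible pandas sketch of this decile table; the simulated scores and the decile construction via `qcut` are assumptions of the sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.2, size=1000)                              # actual positives (~20%)
score = np.clip(0.3 * y + rng.normal(0.4, 0.2, 1000), 0, 1)      # model's predicted probability

df = pd.DataFrame({"y": y, "score": score}).sort_values("score", ascending=False)
ranks = df["score"].rank(method="first", ascending=False)        # 1 = highest probability
df["decile"] = pd.qcut(ranks, 10, labels=list(range(1, 11)))     # decile 1 = top scores

table = df.groupby("decile", observed=True)["y"].sum().to_frame("positives")
table["gain"] = table["positives"].cumsum() / table["positives"].sum()   # cumulative share of positives
table["lift"] = table["gain"] / (np.arange(1, 11) / 10)                  # vs. random targeting
print(table)
```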

What does gain mean?


Gain for ith decile tells us "what percentage of positive points are in ith or smaller decile"
For example:
- If the gain is 97.87% for the 8th decile
⇒ it means we cover about 98% of the positive datapoints by the 8th decile.

What does lift mean?


- It means Cumulative percentage of positive points till ith decile divided by
cumulative percentage of positive points by random model
- It is intuitively telling how much better a model is compared to a random model.

How to use Accuracy metric even when data is imbalanced ?


Ans: G-Mean: When data is imbalanced, the Geometric Mean (G-Mean) measures model
performance on both the majority and minority classes:

$GMean = \sqrt{Specificity \times Sensitivity}$
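A tiny sketch of G-Mean computed from a confusion matrix; the labels are illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # imbalanced: very few positives
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall on the minority (positive) class
specificity = tn / (tn + fp)   # recall on the majority (negative) class
print(np.sqrt(sensitivity * specificity))
```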

Which metric to use when? (Cheat Sheet)


- If we want probabilities of classes: Log loss
- If classes are balanced: Accuracy
- If classes are imbalanced:
- If FP is more critical: Precision.
- If FN is more critical: Recall.
- F1 score is a balance between precision and recall.
- If our concern is both classes (TN and TP): ROC_AUC
- If severe imbalance: PR AUC

How are performance metrics different from loss functions?


- Loss functions are usually differentiable in the model’s parameters.
- Performance metrics don’t need to be differentiable.
- A metric that is differentiable can be used as a loss function also. For ex: MSE
