Intro To ML Revision Notes
● Introduction to ML:
ML is preferred over classical programming for applications whose patterns are not easily visible (and hence hard to hand-code as rules).
● Linear Regression:
How does the training data look for the Regression problem?
Ans: n labelled pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ is a d-dimensional feature vector and $y_i \in \mathbb{R}$ is a real-valued target.
Algebraic Intuition → find $\hat{y} = f(x_i)$; for Linear Regression we can say:

$f(x_{i1}, x_{i2}, \ldots, x_{id}) = w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3} + \ldots + w_d x_{id} + w_0$

$\hat{y}_i = f(x_i) = \sum_{j=1}^{d} w_j x_{ij} + w_0$

Now, if $w = [w_1, w_2, \ldots, w_d]^T$ and $x_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^T$, then:

$\hat{y}_i = f(x_i) = w^T x_i + w_0$

Linear Regression: finding the best d-dimensional hyperplane that fits the d-dimensional data such that $\hat{y}_q \approx y_q$.
$L(w, w_0) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (w^T x_i + w_0) \right)^2$

As we know $\frac{d}{du}(uv + c + a) = v$; hence, on simplifying, the gradients become:

$\frac{\partial L(w, w_0)}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} \left( y_i - (w^T x_i + w_0) \right)(-x_i)$

$\frac{\partial L(w, w_0)}{\partial w_0} = \frac{2}{n} \sum_{i=1}^{n} \left( y_i - (w^T x_i + w_0) \right)(-1)$

Gradient descent then updates both parameters with learning rate $\alpha$:

$w = w - \alpha \times \frac{\partial L(w, w_0)}{\partial w}$

$w_0 = w_0 - \alpha \times \frac{\partial L(w, w_0)}{\partial w_0}$
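A minimal NumPy sketch of these update rules (the function name, learning rate, and iteration count are illustrative assumptions, not from the notes):

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.1, n_iters=1000):
    """Gradient descent on the MSE loss; X is (n, d), y is (n,)."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        error = y - (X @ w + w0)                  # y_i - (w^T x_i + w0)
        w -= lr * (2.0 / n) * (X.T @ (-error))    # dL/dw
        w0 -= lr * (2.0 / n) * np.sum(-error)     # dL/dw0
    return w, w0

# toy usage: recover y = 3x + 2 from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + 0.1 * rng.normal(size=100)
print(fit_linear_regression(X, y))  # approaches ([3.0], 2.0)
```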
How do we measure the performance of a Linear Regression model?
Ans: The R-Squared metric → measures the performance of Linear Regression over a mean model. It is defined as:

$R^2 = 1 - \frac{SS_{res}}{SS_{total}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, where $\bar{y}$ is the mean model.
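A small sketch of this definition (the function name is my own):

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # SS_res
    ss_total = np.sum((y_true - y_true.mean()) ** 2)   # SS_total (mean model's error)
    return 1 - ss_res / ss_total

y = np.array([3.0, 5.0, 7.0, 9.0])
print(r_squared(y, np.array([2.9, 5.2, 6.8, 9.1])))  # close to 1
print(r_squared(y, np.full(4, y.mean())))            # the mean model scores exactly 0
```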
What will be the best value of $R^2$?
Ans: 1, when $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0$.

What will be the minimum value of $R^2$?
Ans: $-\infty$, when $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \gg \sum_{i=1}^{n} (y_i - \bar{y})^2$.
Will the R-Square value increase or remain the same if we add a new feature?
Ans: Both are possible:
- A small or zero weight can be assigned to the new feature, which keeps model performance the same.
- The model can start making spurious associations with the new feature (i.e., overfit), causing model performance to increase on the train set.
How does Adj R-Square compare performance and model complexity?
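The notes don't show the formula; the standard definition is $R_{adj}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - d - 1}$, where $n$ is the number of samples and $d$ the number of features. Adding a feature increases $d$, so unless $R^2$ improves enough to compensate, Adj R-Square decreases → it rewards performance while penalizing complexity.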
How to determine which features impact the model most during prediction?
Ans: The feature with the highest (absolute) weight → most important feature; note this comparison is only fair when the features are on the same scale, e.g. standardized.
a. Assumption of Linearity:
There is a linear relationship between the features $x$ and the target variable $y$.
What is Collinearity?
Ans: Two features $(f_1, f_2)$ have a linear relationship between them: $f_1 = \alpha f_2$
What is Multicollinearity?
Ans: Feature $f_1$ has collinearity across multiple features $f_2, f_3, f_4$:
$f_1 = \alpha_1 f_2 + \alpha_2 f_3 + \alpha_3 f_4$
Why is Multicollinearity a problem?
Ans: The learned weights become unstable and hard to interpret, since the same information is spread across several features. It is detected using the Variance Inflation Factor:

$VIF \; for \; f_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the R-squared obtained by regressing $f_j$ on the remaining features.
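statsmodels ships a helper for this; a minimal sketch on synthetic data (the rule of thumb that VIF > 10 signals trouble is a common convention, not from the notes):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# f1 is (almost) a linear combination of f2 and f3 -> multicollinear
rng = np.random.default_rng(0)
f2, f3 = rng.normal(size=200), rng.normal(size=200)
f1 = 2 * f2 + 3 * f3 + 0.01 * rng.normal(size=200)
X = np.column_stack([f1, f2, f3])

for j in range(X.shape[1]):
    print(f"VIF(f{j + 1}) = {variance_inflation_factor(X, j):.1f}")
# f1 (and its partners) get huge VIFs, flagging the multicollinearity
```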
Heteroskedasticity → unequal scatter of the error term, i.e., the errors not having the same variance everywhere.
How can we detect Heteroskedasticity?
Ans: By plotting a Residual plot → errors $(y - \hat{y})$ vs predictions $(\hat{y})$
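A quick matplotlib sketch of such a residual plot, on synthetic data built to be heteroskedastic (all names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic data whose noise grows with x -> heteroskedastic errors
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 300)
y = 2 * x + rng.normal(scale=0.5 * x)  # error variance increases with x
y_hat = 2 * x                          # stand-in for a fitted model's predictions

plt.scatter(y_hat, y - y_hat, s=10)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("prediction (y_hat)")
plt.ylabel("error (y - y_hat)")
plt.show()  # a widening funnel shape indicates heteroskedasticity
```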
e. No AutoCorrelation:
What is AutoCorrelation?
Ans: When the current feature value depends upon its previous value.
Why is it a problem?
Ans: Linear regression assumes $\hat{y}_1 = f(x)$ is independent of $\hat{y}_2 = f(x + 1)$ → AutoCorrelation contradicts this assumption.
The optimal weights (closed form): $W = (X^T X)^{-1} X^T Y$
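A NumPy sketch of this closed form on toy data (in practice `np.linalg.lstsq` or `pinv` is preferred over an explicit inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0           # known weights, intercept 4

X_aug = np.column_stack([np.ones(len(X)), X])      # a bias column absorbs w0
W = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y   # (X^T X)^{-1} X^T y
print(W)  # -> [4.0, 1.0, -2.0, 0.5]
```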
What modifications can be done to Linear Regression for the model to be complex enough to fit non-linear data?
Ans: Polynomial regression → introduces transformed terms like $f_1^2$, $f_3' = f_3^4$, $f_2' = \alpha f_2^2 + \alpha_0$
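A minimal sklearn sketch of this idea (the quadratic toy data is my own example):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# quadratic data that a plain linear model cannot fit
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[2.0]]))  # close to 4.0 = 2^2
```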
● Bias-Variance, Regularization:
Model B: misses out on only a handful of datapoints by using lower-degree features, i.e., the predicted hyperplane misses a handful of datapoints.
Which model is better?
Ans: Model B generalizes on the data → the model captures the pattern of the data and does not get influenced by outliers (hence those points are missed).
How are training and test data related to underfit and overfit models?
Ans: An underfitted model → has a high training and test loss.
- An overfitted model → very low training loss but a high testing loss.
What is a suitable model?
Ans: a tradeoff between both, such that the model has a low training and testing loss
→ a perfectly fit model
Observe (the dartboard analogy):
- High Bias → having a wrong aim.
- High Variance → having an unsteady aim.
$Total \; Loss = \min_{w_j} \left( Loss \; function + \lambda \sum_{j=1}^{d} w_j^2 \right)$
λ must be chosen carefully, since too much regularization → makes the model underfit the data.
B. L2 / Ridge / Tikhonov regularization: uses the term $\sum_{j=1}^{d} w_j^2$ → pushes the weights close to 0 for insignificant features.
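A short sklearn sketch; `alpha` plays the role of $\lambda$, and Lasso (L1) is shown alongside for contrast (the toy data is my own):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# only the first feature actually drives y; the other four are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

print(Ridge(alpha=1.0).fit(X, y).coef_)  # noise features -> weights close to 0
print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1 pushes them exactly to 0
```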
If the data is too small to have a validation dataset, what to do then?
Ans: Use the k-Fold CV algorithm, since it:
- splits the data into k smaller sets (folds)
- for each iteration, trains the model on k−1 folds
- validates on the remaining fold
- averages the performance over all the iterations.
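A minimal sklearn sketch of k-fold cross-validation (k = 5 and the scoring choice are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print(scores.mean())  # performance averaged over all 5 folds
```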
● Logistic Regression
● Given labels → $y_i \in \{0, 1\}$
● Compute a linear function of $x$ → $z_i = w^T x_i + w_0 \in (-\infty, \infty)$
● Compute the sigmoid → $\sigma(z_i) = \frac{1}{1 + e^{-z_i}}$
● Predicted label → $\hat{y}_i = 1$ if $\sigma(z_i) > threshold$, else 0
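These four steps as a NumPy sketch (the helper name is my own):

```python
import numpy as np

def predict_labels(X, w, w0, threshold=0.5):
    z = X @ w + w0                  # linear score z_i in (-inf, inf)
    p = 1.0 / (1.0 + np.exp(-z))    # sigmoid(z_i) in (0, 1)
    return (p > threshold).astype(int)

X = np.array([[2.0, 1.0], [-3.0, 0.5]])
print(predict_labels(X, w=np.array([1.0, -1.0]), w0=0.0))  # -> [1 0]
```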
$Total \; Loss = Log \; Loss + \lambda \sum_{j=1}^{d} w_j^2$

Thus both of these components help penalize the model most when it makes wrong predictions.

Also, if $y_i \in \{-1, 1\}$, then $Log \; Loss = \sum_{i=1}^{n} \log(1 + e^{-y_i z_i})$
But why can't we use Mean Squared Error as in Linear Regression?
● Non-convex loss curve: it contains a lot of local minima.
● It is difficult for Gradient Descent to reach the global minimum.
On solving:

$\frac{\partial Log Loss}{\partial w} = -y \frac{\partial z}{\partial w} + y \hat{y} \frac{\partial z}{\partial w} + \hat{y} \frac{\partial z}{\partial w} - y \hat{y} \frac{\partial z}{\partial w}$

$\frac{\partial Log Loss}{\partial w} = -y \frac{\partial z}{\partial w} + \hat{y} \frac{\partial z}{\partial w}$

Also, since $\frac{\partial z}{\partial w} = x$:

$\frac{\partial Log Loss}{\partial w} = (\hat{y} - y) \, x$
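This gradient drops straight into a training loop; a rough NumPy sketch (names and hyperparameters are illustrative):

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, n_iters=2000):
    """Gradient descent on log loss, using dLogLoss/dw = (y_hat - y) x."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w + w0)))  # sigmoid predictions
        grad = y_hat - y                             # (y_hat - y) per point
        w -= lr * (X.T @ grad) / n
        w0 -= lr * grad.mean()
    return w, w0

# toy usage: points are positive exactly when the first coordinate is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
w, w0 = fit_logistic_regression(X, y)
print(w)  # the weight on the first feature dominates
```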
What if we want to predict the odds of $y_i = 1$ vs $y_i = 0$?
● Log-odds: shows that the model behaves like a linear model predicting the log-odds of $y_i = 1$ vs $y_i = 0$, defined as:

$\log_e(odds) = \log\left[\frac{p}{1 - p}\right]$; here $p = \frac{1}{1 + e^{-z_i}} = \frac{e^{z_i}}{e^{z_i} + 1}$ and $1 - p = \frac{1}{e^{z_i} + 1}$, so:

$\log_e(odds) = \log_e\left[e^{z_i}\right] = z_i = w^T x_i + w_0$
● Classification Metrics
What other metric to use?
Ans: Confusion matrix
Note:
- For a dumb model that predicts everything as negative, FP = TP = 0
- For an Ideal model that has no incorrect classification, FP = FN = 0
Given a CM, how can we calculate the actual positives?
Ans: Actual positives = TP + FN (likewise, actual negatives = TN + FP).
Which metric to use when we cannot afford to have any false positives?
Ans: Precision: it tells us, out of all points predicted to be positive, how many are actually positive.

$Precision = \frac{TP}{TP + FP}$

For example:
- Misclassifying a spam email as not spam is somewhat acceptable, i.e., an FN.
- However, classifying an important mail as spam can lead to a major loss, i.e., an FP.
⇒ i.e., reducing FP is more critical here.
Which metric to use when we cannot afford to have any false negatives?
Ans: Recall / Sensitivity / Hit Rate: it tells us, out of all the actually positive points, how many of them are predicted to be positive.

$Recall = \frac{TP}{TP + FN}$

For example:
- Classifying a healthy person as cancerous and carrying out further testing is somewhat acceptable.
- However, classifying a person with cancer as healthy can be a life-or-death situation.
⇒ Here, reducing FN is more critical.
Note:
- True Positive Rate (TPR): $TPR = \frac{TP}{TP + FN}$; this is the same as Recall.
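A quick sanity check of these formulas against sklearn (the toy labels are my own):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))  # 0.75 both ways
print("recall/TPR:", tp / (tp + fn), recall_score(y_true, y_pred))    # 0.75 both ways
```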
What other metrics to look for?
- True Negative Rate (TNR): $TNR = \frac{TN}{FP + TN}$
- TNR tells us, out of all the actual negative points, how many have been correctly predicted as negative.
- Also called Specificity / Selectivity.
- False Positive Rate (FPR): $FPR = \frac{FP}{FP + TN}$
- Intuitively, it tells us, out of all data points which are actually negative, how many are misclassified as positive.
- False Negative Rate (FNR): $FNR = \frac{FN}{FN + TP}$
- Intuitively, it tells us, out of all data points which are actually positive, how many are misclassified as negative.
Note:
- F1-score is just the Harmonic Mean of Precision and Recall.
- F1-score is also useful when the data is imbalanced.
- Range: [0, 1]
Why do we take harmonic mean in F1 score instead of arithmetic mean?
Harmonic Mean penalizes the reduction in Precision and Recall more than Arithmetic
Mean.
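For reference, the harmonic-mean form is $F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$. For example, with Precision = 1.0 and Recall = 0.01, the arithmetic mean is 0.505 (looks acceptable), but $F1 = \frac{2 \times 1.0 \times 0.01}{1.01} \approx 0.02$, which correctly reflects the collapsed recall.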
Note:
- Unlike precision, recall or F1-score, AU-ROC does not work well for highly
imbalanced data.
What is the fundamental difference between AUC and the other metrics?
When we calculate Precision, Recall or F1-score:
- We calculate it for a certain threshold on $\hat{y}_i$.
- This threshold is 0.5 by default.
On the other hand, for AU-ROC:
- We are calculating it using all possible thresholds.

Does AUC depend on the actual values of $\hat{y}_i$?
No. AU-ROC depends on how the ordering of the $\hat{y}_i$ is done and not on their actual values.
AU-ROC is highly sensitive to imbalanced data; what metric can we use there?
Ans: We can use the Area under the Precision-Recall curve (AU-PRC). This is a very good metric for imbalanced data.
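A sketch contrasting the two metrics on imbalanced synthetic data (the labels, scores, and 5% positive rate are my own choices):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

# highly imbalanced labels (~5% positives) with overlapping, imperfect scores
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=2000)
y_score = 0.3 * y_true + rng.uniform(size=2000)

print("AU-ROC:", roc_auc_score(y_true, y_score))
prec, rec, _ = precision_recall_curve(y_true, y_score)
print("AU-PRC:", auc(rec, prec))  # typically much lower, exposing the imbalance
```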
Which metric can we use to know how a model would do business-wise?
- Business people need to know what difference our model would make in business terms compared to random targeting.
- ⇒ Lift and Gain charts are used.
- ⇒ They help us graphically understand the benefit of using that model (in layman's terms).
Calculation of Gain:
- Divide the cumulative responses by the total number of responses, i.e., the cumulative number of positives up to that decile by the total number of positives.
Calculation of Lift:
- Divide the total % of positive points up to that decile (the gain) for our model by the total % of positive points if we had a random model.
- In other terms, it is the ratio of the gain of our model to the gain of a random model.
⇒ After calculating the lift and gain values, we plot them for each decile.
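A pandas sketch of the decile computation (the data and column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, size=1000)        # actual responses (positives)
score = 0.3 * y + rng.uniform(size=1000)   # model score; higher = more likely

df = pd.DataFrame({"y": y, "score": score}).sort_values("score", ascending=False)
df["decile"] = np.repeat(np.arange(1, 11), len(df) // 10)

pos_per_decile = df.groupby("decile")["y"].sum()
gain = pos_per_decile.cumsum() / pos_per_decile.sum()  # cumulative positives / total positives
lift = gain / (np.arange(1, 11) / 10)                  # model's gain / random model's gain
print(pd.DataFrame({"gain": gain, "lift": lift}))
```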