Assignment 2 Documentation
Assignment 2
2) Mathematical Derivations.
B Say we have N data points in D dimensions, and we want to reduce the
number of dimensions while capturing the maximum variance from the original
data.
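$$\cos\theta = \frac{OA}{OB}$$

$$\Rightarrow \vec{u} \cdot \vec{x}_n = \|\vec{u}\| \, \|\vec{x}_n\| \cos\theta ; \quad \text{where } \|\vec{u}\| = \sqrt{u_1^2 + \dots + u_D^2}$$

$$\Rightarrow \vec{u} \cdot \vec{x}_n = \|\vec{u}\| \, OB \cos\theta$$

$$\Rightarrow \vec{u} \cdot \vec{x}_n = \|\vec{u}\| \, OA$$

$$\Rightarrow OA = \frac{\vec{u} \cdot \vec{x}_n}{\|\vec{u}\|}$$

Let $\hat{u} = \dfrac{\vec{u}}{\|\vec{u}\|}$

$$\Rightarrow OA = \hat{u} \cdot \vec{x}_n$$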
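The projection result above can be checked numerically. The following is a minimal sketch using NumPy; the vectors `x_n` and `u` are made-up values chosen for illustration and are not taken from the assignment data.

```python
import numpy as np

# Hypothetical 2-D example: x_n is a data point, u is an arbitrary direction.
x_n = np.array([3.0, 4.0])
u = np.array([2.0, 0.0])

u_hat = u / np.linalg.norm(u)      # unit vector along u
oa = float(u_hat @ x_n)            # scalar projection OA = u_hat . x_n

# Same quantity from the geometric definition OA = ||x_n|| cos(theta)
cos_theta = (u @ x_n) / (np.linalg.norm(u) * np.linalg.norm(x_n))
oa_geometric = float(np.linalg.norm(x_n) * cos_theta)

print(oa, oa_geometric)
```

Both values agree, which is the statement that the length of the projection depends only on the direction of u and not on its magnitude.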
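$$\text{Variance} = \frac{1}{N} \sum_{n=1}^{N} \left( \hat{u} \cdot \vec{x}_n - \hat{u} \cdot \bar{x} \right)^2 ; \quad \|\hat{u}\| = 1$$

$$\max_{\hat{u}} \; \frac{1}{N} \sum_{n=1}^{N} \left( \hat{u} \cdot \vec{x}_n - \hat{u} \cdot \bar{x} \right)^2 \quad \text{such that } \|\hat{u}\|^2 = 1$$

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N} \sum_{n=1}^{N} \left( \hat{u} \cdot (\vec{x}_n - \bar{x}) \right)^2 \quad \text{such that } \|\hat{u}\|^2 = 1$$

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N} \sum_{n=1}^{N} \left( \hat{u} \cdot (\vec{x}_n - \bar{x}) \right) \left( \hat{u} \cdot (\vec{x}_n - \bar{x}) \right)^T \quad \text{such that } \|\hat{u}\|^2 = 1$$

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N} \sum_{n=1}^{N} \hat{u} \cdot (\vec{x}_n - \bar{x}) (\vec{x}_n - \bar{x})^T \cdot \hat{u}^T \quad \text{such that } \|\hat{u}\|^2 = 1$$

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N} \, \hat{u} \cdot \left( \sum_{n=1}^{N} (\vec{x}_n - \bar{x}) (\vec{x}_n - \bar{x})^T \right) \cdot \hat{u}^T \quad \text{such that } \|\hat{u}\|^2 = 1$$

Now $\Sigma = \sum_{n=1}^{N} (\vec{x}_n - \bar{x}) (\vec{x}_n - \bar{x})^T$ is the covariance matrix.

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N} \, \hat{u} \cdot \Sigma \cdot \hat{u}^T \quad \text{such that } \|\hat{u}\|^2 = 1$$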
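$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N} \, \hat{u} \cdot \Sigma \cdot \hat{u}^T + \lambda \left( 1 - \|\hat{u}\|^2 \right)$$

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N} \, \hat{u} \cdot \Sigma \cdot \hat{u}^T + \lambda \left( 1 - \hat{u} \cdot \hat{u}^T \right)$$

Differentiating w.r.t. $\hat{u}$ (the constant factor $\frac{1}{N}$ does not change the maximizer and is absorbed into $\lambda$):

$$\Rightarrow \frac{\partial}{\partial \hat{u}} \left( \hat{u} \cdot \Sigma \cdot \hat{u}^T + \lambda \left( 1 - \hat{u} \cdot \hat{u}^T \right) \right) = 0$$

$$\Rightarrow 2 \Sigma \cdot \hat{u}^T - 2 \lambda \hat{u}^T = 0$$

$$\Rightarrow \Sigma \cdot \hat{u}^T = \lambda \hat{u}^T \quad \rightarrow \text{equation 1}$$

Differentiating w.r.t. $\lambda$:

$$\Rightarrow \frac{\partial}{\partial \lambda} \left( \hat{u} \cdot \Sigma \cdot \hat{u}^T + \lambda \left( 1 - \hat{u} \cdot \hat{u}^T \right) \right) = 0$$

$$\Rightarrow \hat{u} \cdot \hat{u}^T = 1 \quad \rightarrow \text{equation 2}$$

Equation 1 is an eigenvalue equation: each solution $\hat{u}$ is an eigenvector of $\Sigma$ with eigenvalue $\lambda$. Indexing the eigenpairs:

$$\Sigma \hat{u}_D^T = \lambda_D \hat{u}_D^T$$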
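The result of the derivation can be checked numerically. The sketch below uses synthetic data (not the assignment dataset) and assumes the 1/N factor is folded into the matrix, which rescales the eigenvalues but leaves the eigenvectors unchanged. It confirms that the leading eigenvector satisfies equations 1 and 2, and that the variance of the projections equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: N = 500 points in D = 3 dimensions with unequal spread
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])

X_centered = X - X.mean(axis=0)
N = X.shape[0]
# (1/N) * sum over n of (x_n - x_bar)(x_n - x_bar)^T
cov = X_centered.T @ X_centered / N

# eigh is intended for symmetric matrices; eigenvalues are returned ascending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u_hat = eigvecs[:, 0]                        # direction of maximum variance
projected_var = np.var(X_centered @ u_hat)   # variance of the projections (1/N form)

# Equation 1: cov @ u = lambda * u; equation 2: ||u|| = 1
residual = np.linalg.norm(cov @ u_hat - eigvals[0] * u_hat)
print(eigvals, projected_var, residual)
```

Any other unit direction gives a projected variance no larger than the leading eigenvalue, which is why the principal components are taken in order of decreasing eigenvalue.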
first 3
eigenvalues being higher than the remaining ones by a significant
margin.
ii. For standardized data:
1. The variance captured is seen to gradually increase
with the number of components
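The effect of standardization on the captured variance can be illustrated with a small sketch. It uses synthetic features on very different scales as a stand-in for the real data, and the helper `cumulative_explained_variance` is a hypothetical name introduced here, not part of the assignment code.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative fraction of total variance captured by the first k components."""
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / X.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # sort eigenvalues in descending order
    return np.cumsum(eigvals) / eigvals.sum()

rng = np.random.default_rng(0)
# Synthetic stand-in: 5 independent features on very different scales
X = rng.normal(size=(200, 5)) * np.array([1000.0, 100.0, 10.0, 1.0, 0.1])

raw_curve = cumulative_explained_variance(X)

# Standardize each feature to zero mean and unit variance before PCA
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_curve = cumulative_explained_variance(X_std)

print(raw_curve)
print(std_curve)
```

On unscaled data the largest-scale feature dominates the first component, while on standardized data the cumulative curve rises gradually because every feature contributes comparable variance.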
For this part of the assignment we applied PCA to the Hitters dataset to
determine the optimal number of principal components required for
regression analysis.
1) Method and Code implementation
a. We started by importing the dataset and filling null values with the column means.
import pandas as pd
df = pd.read_csv('Hitters.csv')
# Fill missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))
for i in numeric_list:
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    up_lim = Q3 + 1.5 * IQR
    low_lim = Q1 - 1.5 * IQR
    df.loc[df[i] > up_lim, i] = up_lim
    df.loc[df[i] < low_lim, i] = low_lim
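The capping loop above can also be written as a reusable function. This is a minimal sketch under stated assumptions: `cap_outliers_iqr`, the toy column `col_a`, and its values are hypothetical and only illustrate the 1.5 * IQR rule; `Series.clip` is used in place of the two `df.loc` assignments.

```python
import pandas as pd

def cap_outliers_iqr(df, columns, k=1.5):
    """Return a copy of df with each listed column clipped to [Q1 - k*IQR, Q3 + k*IQR]."""
    capped = df.copy()
    for col in columns:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        capped[col] = capped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped

# Hypothetical toy column with one extreme value
toy = pd.DataFrame({"col_a": [10.0, 12.0, 11.0, 13.0, 12.0, 200.0]})
capped = cap_outliers_iqr(toy, ["col_a"])
print(capped["col_a"].tolist())
```

Working on a copy leaves the original frame untouched, which makes it easier to compare the distributions before and after capping.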
# outlier query
for i in df:
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    up = Q3 + 1.5 * IQR
    low = Q1 - 1.5 * IQR
self.weights = np.zeros(num_features)
self.bias = 0
indices = np.random.permutation(num_samples)
X_shuffled = X[indices]
y_shuffled = y[indices]
X_batch = X_shuffled[i:i+self.batch_size]
y_batch = y_shuffled[i:i+self.batch_size]
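The fragments above appear to come from a mini-batch gradient descent regressor. The following is a sketch of how such pieces could fit together, not the original implementation: the class name, the mean squared error loss, the hyperparameter defaults, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

class MiniBatchLinearRegression:
    """Linear regression fitted by mini-batch gradient descent on mean squared error."""

    def __init__(self, learning_rate=0.05, epochs=200, batch_size=16, seed=0):
        # All defaults here are assumed values, not taken from the assignment
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.weights = np.zeros(num_features)
        self.bias = 0.0
        for _ in range(self.epochs):
            # Reshuffle every epoch so the batches differ between passes
            indices = self.rng.permutation(num_samples)
            X_shuffled, y_shuffled = X[indices], y[indices]
            for i in range(0, num_samples, self.batch_size):
                X_batch = X_shuffled[i:i + self.batch_size]
                y_batch = y_shuffled[i:i + self.batch_size]
                error = X_batch @ self.weights + self.bias - y_batch
                # Gradients of the batch mean squared error
                grad_w = 2.0 * (X_batch.T @ error) / len(y_batch)
                grad_b = 2.0 * error.mean()
                self.weights -= self.learning_rate * grad_w
                self.bias -= self.learning_rate * grad_b
        return self

    def predict(self, X):
        return X @ self.weights + self.bias

# Noiseless synthetic target so the true coefficients are known
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
model = MiniBatchLinearRegression().fit(X, y)
print(model.weights, model.bias)
```

Shuffling before slicing keeps each batch an unbiased sample of the data, and slicing with `i:i + self.batch_size` handles a final batch that is smaller than the others.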