
CS F320: Foundations of Data Science

Assignment 2

Name of Student           ID Number
Dev Gala                  2021A7PS0182H
Atharva Vinod Dashora     2021A7PS0127H
Kolasani Amit Vishnu      2021A7PS0151H


Assignment 2A: Implementing PCA and applying it on Car data
1) What is Principal Component Analysis?
a. As the number of features (dimensions) in a dataset increases, the number
of data points required to obtain a statistically significant result also
increases. This can lead to issues such as overfitting, increased computation
time, and reduced accuracy of machine learning models. This is known as the
curse of dimensionality.

b. Feature engineering approaches such as feature extraction and feature
selection are used to combat the curse of dimensionality. One such feature
extraction technique, dimensionality reduction, seeks to minimize the number
of input features while preserving as much of the original information as
possible.

c. Principal Component Analysis (PCA) was introduced in 1901 and is based on
the requirement that data mapped to a lower-dimensional space must capture
the maximum variance of the original data. It is a statistical procedure that
converts a set of correlated variables into a set of uncorrelated variables.
PCA reduces the dimensionality of a dataset by finding a new, smaller set of
variables that retains most of the information in the sample.
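As a quick illustration of the idea (not the assignment's own implementation,
which appears in section 3; scikit-learn and synthetic data are used here
purely as an assumed example):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 points in 5 dimensions whose variance lies almost entirely in 2 directions
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)              # project onto the top-2 components
print(X_reduced.shape)                        # (200, 2)
print(pca.explained_variance_ratio_.sum())    # close to 1 for this near-rank-2 data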

2) Mathematical Derivations.
Say we have N data points in D dimensions and we want to reduce the
dimensionality such that we capture maximum variance from the original data.

Consider projecting a data point $\vec{x}_n$ onto the direction of a vector
$\vec{u}$: with O the origin, B the point $\vec{x}_n$ (so $OB = \|\vec{x}_n\|$)
and A the foot of the perpendicular from B onto $\vec{u}$, the projection
length is OA and

$$\cos\theta = \frac{OA}{OB}$$

$$\vec{u} \cdot \vec{x}_n = \|\vec{u}\|\,\|\vec{x}_n\|\cos\theta, \qquad \text{where } \|\vec{u}\| = \sqrt{u_1^2 + \dots + u_D^2}$$

$$\Rightarrow \vec{u} \cdot \vec{x}_n = \|\vec{u}\|\,OB\cos\theta = \|\vec{u}\|\,OA$$

$$\Rightarrow OA = \frac{\vec{u} \cdot \vec{x}_n}{\|\vec{u}\|}$$

Let $\hat{u} = \frac{\vec{u}}{\|\vec{u}\|}$. Then

$$OA = \hat{u} \cdot \vec{x}_n$$
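As a quick numerical sanity check (a made-up example, not from the assignment's
car data): take $\vec{x}_n = (3, 4)$ and $\vec{u} = (2, 0)$. Then
$\hat{u} = (1, 0)$ and $OA = \hat{u} \cdot \vec{x}_n = 3$; consistently,
$OB = \|\vec{x}_n\| = 5$, $\cos\theta = 3/5$, and $OB\cos\theta = 3$.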

Now we compute the mean of the projected data, which equals the projection of
the mean $\bar{x}$ of the original D-dimensional data:

$$\text{projected mean} = \frac{1}{N}\sum_{n=1}^{N} \hat{u} \cdot \vec{x}_n = \hat{u} \cdot \bar{x}$$

The variance of the projected data is

$$\text{Variance} = \frac{1}{N}\sum_{n=1}^{N} \left(\hat{u} \cdot \vec{x}_n - \hat{u} \cdot \bar{x}\right)^2, \qquad \|\hat{u}\| = 1$$

So, for PCA, we need to solve the following optimization problem:

$$\max_{\hat{u}} \; \frac{1}{N}\sum_{n=1}^{N} \left(\hat{u} \cdot \vec{x}_n - \hat{u} \cdot \bar{x}\right)^2 \quad \text{such that } \|\hat{u}\|^2 = 1$$

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N}\sum_{n=1}^{N} \left(\hat{u} \cdot (\vec{x}_n - \bar{x})\right)^2 \quad \text{such that } \|\hat{u}\|^2 = 1$$

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N}\sum_{n=1}^{N} \left(\hat{u} \cdot (\vec{x}_n - \bar{x})\right)\left(\hat{u} \cdot (\vec{x}_n - \bar{x})\right)^T \quad \text{such that } \|\hat{u}\|^2 = 1$$

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N}\sum_{n=1}^{N} \hat{u} \cdot (\vec{x}_n - \bar{x})(\vec{x}_n - \bar{x})^T \cdot \hat{u}^T \quad \text{such that } \|\hat{u}\|^2 = 1$$

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N}\, \hat{u} \cdot \left(\sum_{n=1}^{N} (\vec{x}_n - \bar{x})(\vec{x}_n - \bar{x})^T\right) \cdot \hat{u}^T \quad \text{such that } \|\hat{u}\|^2 = 1$$

Now $\Sigma = \sum_{n=1}^{N} (\vec{x}_n - \bar{x})(\vec{x}_n - \bar{x})^T$ is the covariance matrix (up to the constant factor $1/N$), so

$$\Rightarrow \max_{\hat{u}} \; \frac{1}{N}\, \hat{u} \cdot \Sigma \cdot \hat{u}^T \quad \text{such that } \|\hat{u}\|^2 = 1$$

Since the constant factor $1/N$ does not change the maximizer, we drop it and
introduce a Lagrange multiplier $\lambda$ for the constraint:

$$\Rightarrow \max_{\hat{u}} \; \hat{u} \cdot \Sigma \cdot \hat{u}^T + \lambda\left(1 - \|\hat{u}\|^2\right) = \max_{\hat{u}} \; \hat{u} \cdot \Sigma \cdot \hat{u}^T + \lambda\left(1 - \hat{u} \cdot \hat{u}^T\right)$$
Differentiating with respect to $\hat{u}$ and setting the derivative to zero:

$$\frac{\partial}{\partial \hat{u}}\left(\hat{u} \cdot \Sigma \cdot \hat{u}^T + \lambda\left(1 - \hat{u} \cdot \hat{u}^T\right)\right) = 0$$

$$\Rightarrow 2\,\Sigma \hat{u}^T - 2\lambda \hat{u}^T = 0$$

$$\Rightarrow \Sigma \hat{u}^T = \lambda \hat{u}^T \qquad \text{(equation 1)}$$

Differentiating with respect to $\lambda$ and setting the derivative to zero:

$$\frac{\partial}{\partial \lambda}\left(\hat{u} \cdot \Sigma \cdot \hat{u}^T + \lambda\left(1 - \hat{u} \cdot \hat{u}^T\right)\right) = 0$$

$$\Rightarrow \hat{u} \cdot \hat{u}^T = 1 \qquad \text{(equation 2)}$$

From equations 1 and 2,

$$\hat{u} \cdot \Sigma \cdot \hat{u}^T = \lambda$$

From equation 1 we also see that $\lambda$ is an eigenvalue and $\hat{u}^T$ an
eigenvector of $\Sigma$. Since $\Sigma$ is a $D \times D$ matrix, there are D
eigenvalues and D eigenvectors, and every eigenvalue-eigenvector pair

$$\Sigma \hat{u}_1^T = \lambda_1 \hat{u}_1^T,\quad \Sigma \hat{u}_2^T = \lambda_2 \hat{u}_2^T,\quad \dots,\quad \Sigma \hat{u}_D^T = \lambda_D \hat{u}_D^T$$

is a feasible solution of the constrained optimization problem. To obtain the
optimal solution, we take the direction $\hat{u}^T$ for which the variance of
the projected points is maximum.

Total variance of the data:

$$\text{Total Variance} = \operatorname{trace}(\Sigma)$$

The diagonal entries of $\Sigma$ are the variances of the individual dimensions, so

$$\operatorname{trace}(\Sigma) = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_D^2$$

But the trace of $\Sigma$ also equals the sum of its eigenvalues:

$$\operatorname{trace}(\Sigma) = \lambda_1 + \lambda_2 + \dots + \lambda_D$$

Therefore $\sigma_1^2 + \sigma_2^2 + \dots + \sigma_D^2 = \lambda_1 + \lambda_2 + \dots + \lambda_D$, and $\lambda_i$ is the variance captured by the $i$-th eigenvector, so

$$\text{Total Variance} = \lambda_1 + \lambda_2 + \dots + \lambda_D.$$

Now, the percentage of variance captured by the $i$-th eigenvector (i.e. the
percentage of variance captured by collapsing the data onto that single
dimension) is

$$\frac{\lambda_i}{\sum_{d=1}^{D} \lambda_d} \times 100.$$

The $\hat{u}_i^T$ that maximizes the variance is the one corresponding to the
maximum eigenvalue. If we have D dimensions and want to collapse to m < D
dimensions, we take the m largest eigenvalues with their corresponding
eigenvectors and project the D-dimensional vectors onto the space spanned by
these m eigenvectors.
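These identities are easy to verify numerically. The following is a small
sketch on made-up data (the variable names cov, eigvals and eigvecs are
illustrative, not from the assignment code):

import numpy as np

# Made-up centered data: 100 points in 3 dimensions with unequal spread
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.2])
X = X - X.mean(axis=0)

cov = np.cov(X, rowvar=False)             # D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh is suitable for symmetric matrices

# trace(Sigma) = sum of per-dimension variances = sum of eigenvalues
print(np.isclose(np.trace(cov), eigvals.sum()))     # True
# percentage of variance captured by each eigenvector (largest first)
print(100 * eigvals[::-1] / eigvals.sum())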

3) Code Implementation for PCA


import numpy as np
# df: the car dataset (numeric columns) loaded earlier into a pandas DataFrame

# Step 1: Center the data
mean_vector = np.mean(df, axis=0)
df_centered = (df - mean_vector).copy()
centered_data = df_centered.copy()

# Step 2: Calculate the covariance matrix
covariance_matrix = np.cov(centered_data, rowvar=False)

# Step 3: Compute the eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

# Sort eigenvalues and eigenvectors in descending order of eigenvalue
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]

# Step 4: Choose the number of principal components
num_components = 2

# Step 5: Project the centered data onto the new feature space
projection_matrix = eigenvectors[:, :num_components]
pca_result = centered_data.dot(projection_matrix)
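Building on the sorted eigenvalues above, the cumulative explained variance
used for the plots in the next section can be computed as follows (a short
sketch that reuses the variables from the code above):

# Fraction of total variance captured by each component, and its running total
explained_variance_ratio = eigenvalues / np.sum(eigenvalues)
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
print(cumulative_explained_variance)  # per section 5, the first 2 entries reach roughly 0.99 for the centred car data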

4) Visual Representations and Plots


a. Plot of Cumulative Explained Variance vs Number of Principal Components
(a plotting sketch for this curve is given after this list).
   i. For centred data: most of the variance of the data is captured by the
   first 2 principal components. This is also observed in the first 3
   eigenvalues being higher than the remaining ones by a significant margin.
   ii. For standardized data: the variance captured increases gradually with
   the number of components.

b. Pair plots with the projected principal components.

5) Conclusion:
a. By using PCA, we could reduce the dimensions from 5 to 2 while
capturing almost 99% of the variance.
b. We also see that only 1 or 2 principal components are required to capture
most of the variance of the data, as the 2 largest eigenvalues are
significantly larger than the remaining ones.

End of Part A of Assignment 2


Assignment 2B:
PCA Analysis and Determining Optimal Number of
Components

For this part of the assignment, we applied PCA to the Hitters dataset and
determined the optimal number of principal components required for regression
analysis.
1) Method and Code implementation
a. We started by importing the dataset and imputing missing values with the
column means.

import pandas as pd

df = pd.read_csv('Hitters.csv')
# Fill missing values with column means (numeric_only avoids errors on the categorical columns)
df = df.fillna(df.mean(numeric_only=True))

b. Next, we separated the categorical and target variables and standardized
the remaining numeric features (a sketch of this step is given below).
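The report does not reproduce the code for this step; the following is a
minimal sketch of what it could look like. League, Division, NewLeague and
Salary are the usual Hitters column names, but the details here are
assumptions rather than the authors' exact code.

# Separate the target and categorical variables (assumed sketch)
y = df['Salary']                                        # target variable
categorical = df[['League', 'Division', 'NewLeague']]   # categorical columns, set aside
df = df.drop(columns=['League', 'Division', 'NewLeague', 'Salary'])

# Standardize the remaining numeric features (zero mean, unit variance)
numeric_list = list(df.columns)
df[numeric_list] = (df[numeric_list] - df[numeric_list].mean()) / df[numeric_list].std()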
c. We detected outliers using the IQR rule and capped values below
Q1 - 1.5*IQR at that lower limit, and values above Q3 + 1.5*IQR at the
corresponding upper limit, as shown in the code below.

# Outlier detection using the IQR rule: report which columns contain outliers
for i in df:
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    up = Q3 + 1.5 * IQR
    low = Q1 - 1.5 * IQR
    if df[(df[i] > up) | (df[i] < low)].any(axis=None):
        print(i, "yes")
    else:
        print(i, "no")

# Cap the outliers at the IQR limits
for i in numeric_list:
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    up_lim = Q3 + 1.5 * IQR
    low_lim = Q1 - 1.5 * IQR
    df.loc[df[i] > up_lim, i] = up_lim
    df.loc[df[i] < low_lim, i] = low_lim

# Outlier query: re-run the check to confirm no outliers remain after capping
for i in df:
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    up = Q3 + 1.5 * IQR
    low = Q1 - 1.5 * IQR
    if df[(df[i] > up) | (df[i] < low)].any(axis=None):
        print(i, "yes")
    else:
        print(i, "no")

d. Then, for each candidate number of principal components, we fit a multiple
linear regression model to predict Salary. We compute the MSE and RMSE for
each model and pick the optimal number of components as the one with the
least RMSE.

Linear Regression Implementation:
class LinearRegression:
    def __init__(self, learning_rate, num_iterations, batch_size):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.batch_size = batch_size
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.weights = np.zeros(num_features)
        self.bias = 0

        # Mini-batch gradient descent
        for _ in range(self.num_iterations):
            # Shuffle the data at the start of each epoch
            indices = np.random.permutation(num_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]

            for i in range(0, num_samples, self.batch_size):
                X_batch = X_shuffled[i:i + self.batch_size]
                y_batch = y_shuffled[i:i + self.batch_size]

                y_pred = np.dot(X_batch, self.weights) + self.bias

                # Gradients of the MSE loss with respect to the weights and bias
                dw = (1 / self.batch_size) * np.dot(X_batch.T, (y_pred - y_batch))
                db = (1 / self.batch_size) * np.sum(y_pred - y_batch)

                # Update weights and bias
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

For each number of components, we build a multiple linear regression model and
record its test RMSE:

error = []
for num in range(1, 24):
    X, _, _ = PCA(df, num)                       # project onto the first `num` principal components
    split = int(0.75 * X.shape[0])               # 75/25 train-test split
    data = pd.concat([X, y], axis=1)
    d_train, d_test = data.iloc[:split, :], data.iloc[split:, :]
    y_train = d_train['Salary'].to_numpy()
    y_test = d_test['Salary'].to_numpy()
    X_train = d_train.drop(['Salary'], axis=1).to_numpy()
    X_test = d_test.drop(['Salary'], axis=1).to_numpy()
    lr = LinearRegression(learning_rate, num_iterations, 32)
    lr.fit(X_train, y_train)
    predictions = lr.predict(X_test)
    error.append((mean_squared_error(y_test, predictions)) ** 0.5)   # RMSE
e. From all the RMSE values, we find that 19 components give the best
prediction with the least error (the selection is sketched below).
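The selection can be read off the error list directly; a small sketch,
assuming the list is ordered from 1 component upwards as in the loop above:

best_num_components = int(np.argmin(error)) + 1   # +1 because the loop starts at 1 component
print(best_num_components, min(error))            # reported as 19 components with the lowest RMSE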

2) Plots and Visualization


a. Distribution of the target variable (Salary).

b. Plots of the numeric variables against the target.

c. Cumulative Explained Variance vs Number of Components.
   i. The variance captured increases gradually with the number of components.

d. Number of Components vs RMSE for each Multiple Linear Regression model
(a plotting sketch for this curve is given below).
   i. The RMSE increases around 10 components, since the components added at
   that point do not capture much of the variance, and it starts to decrease
   again as the cumulative variance captured increases.
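A sketch of how this RMSE curve could be plotted from the error list (assuming
matplotlib; not the authors' exact plotting code):

import matplotlib.pyplot as plt

plt.plot(range(1, len(error) + 1), error, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('RMSE')
plt.title('Number of Components vs RMSE')
plt.show()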
3) Finding the Optimal Number of Components
a. Identifying the Most Efficient Model:
   i. The point on the graph where the RMSE starts to stabilize or show
   diminishing returns with additional components is at 19 components. The
   number of components corresponding to this point is likely the most
   efficient model in terms of prediction accuracy.
b. Significance of Selecting an Appropriate Number of Components:
i. Dimensionality Reduction: Principal Component Analysis
(PCA) is commonly used to reduce the dimensionality of the
data. By selecting an appropriate number of components,
you strike a balance between capturing the essential
information and avoiding overfitting.
ii. Computational Efficiency: Fewer components often lead to faster training
times and more efficient predictions. It reduces the computational burden
associated with a high-dimensional feature space.
c. Scatter plot of actual vs predicted values

End of Part B of Assignment 2
