EXPERIMENT – 2
AIM: Use PCA on a high-dimensional dataset to reduce its dimensionality while retaining most of the variance, and visualize the data.
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
df = pd.read_csv('USA_Housing.csv')
print(df.isnull().sum())  # no null values
df.drop(['Address'], axis=1, inplace=True)
# Putting feature variables into X
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
        'Avg. Area Number of Bedrooms', 'Area Population']]
# Output variable
y = df['Price']
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions for visualization
X_pca = pca.fit_transform(X_scaled)
# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
# Visualize the PCA results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.title('PCA of USA Housing dataset')
plt.xlabel('Principal Component 1 (Explained Variance: {:.2f}%)'.format(explained_variance[0]*100))
plt.ylabel('Principal Component 2 (Explained Variance: {:.2f}%)'.format(explained_variance[1]*100))
plt.show()
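Two components are enough for a 2-D plot, but it is worth confirming how much variance they retain. This is an optional sketch that fits a full PCA on the same X_scaled purely to print the cumulative explained variance:
# Optional check: cumulative explained variance over all components
pca_full = PCA().fit(X_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
for i, v in enumerate(cum_var, start=1):
    print('First {} component(s) explain {:.2f}% of the variance'.format(i, v*100))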
OUTPUT:
[Scatter plots: before PCA and after PCA]
RESULT: Hence, we have reduced the dimensionality of the dataset.
EXPERIMENT – 3
AIM: Perform a linear regression analysis on a dataset to predict a continuous target variable based on one or more predictor variables. Evaluate the model's performance using metrics like RMSE and R-squared.
CODE:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
df = pd.read_csv('USA_Housing.csv')
rint("Checking null values:\n",[Link]().sum(),"\n\n") # no null
p
values
[Link](['Address'],axis=1,inplace=True)
# Putting feature variables into X
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
        'Avg. Area Number of Bedrooms', 'Area Population']]
# Output variable
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Calculate R-squared
r_squared = r2_score(y_test, y_pred)
rint("Linear regression model performance:\nRoot Mean Squared Error
p
(RMSE):", rmse)
print("R-squared:", r_squared)
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, label='Predicted', color='cyan')
plt.scatter(y_test, y_test, alpha=0.5, label='Actual', color='blue')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.legend()
plt.show()
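As a sanity check, RMSE and R-squared can also be computed directly from their definitions; this minimal sketch reuses y_test and y_pred from above and should agree with the sklearn values:
# RMSE: square root of the mean squared residual
residuals = y_test - y_pred
rmse_manual = np.sqrt((residuals**2).mean())
# R-squared: 1 - residual sum of squares / total sum of squares
ss_res = (residuals**2).sum()
ss_tot = ((y_test - y_test.mean())**2).sum()
r2_manual = 1 - ss_res/ss_tot
print("Manual RMSE:", rmse_manual, "\nManual R-squared:", r2_manual)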
OUTPUT:
[Scatter plot: predicted vs actual values]
RESULT: Hence, we have trained and evaluated the model.
Evaluation Results are:
● Root Mean Squared Error (RMSE): 100444.06055558745
● R-squared: 0.9179971706834289
EXPERIMENT – 4
AIM: Compare the performance of various classification algorithms (e.g., Logistic Regression, Decision Trees, Random Forest, SVM and Naïve Bayes) on a common dataset using accuracy, precision, recall and F1-score.
CODE:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
df = pd.read_csv('gender_classification_v7.csv')
df['gender'] = df['gender'].apply(lambda x: 0 if x == 'Male' else 1)
plt.figure(figsize=(2, 4))
plt.title('Count of Gender', size=10)
sns.countplot(data=df, x="gender")
plt.ylabel('Count', size=12)
plt.xlabel('Gender', size=12)
sns.despine(top=True, right=True, left=False, bottom=False)
plt.show()
X = df.drop(columns=['gender'])
y = df['gender']
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)
Name, Accuracy, Precision, Recall, F1_Score = [], [], [], [], []
# Logistic Regression
regression = LogisticRegression()
regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Logistic Regression')
# Decision Tree
tree = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Decision Tree')
# Random Forest
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Random Forest')
# SVM
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('SVM')
# Naive Bayes
naiveBayes = GaussianNB()
naiveBayes.fit(X_train, y_train)
y_pred = naiveBayes.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Naive Bayes')
# Combining all models' performance
models = {'Model': Name, 'Accuracy': Accuracy, 'Precision': Precision, 'Recall': Recall, 'F1_Score': F1_Score}
model_df = pd.DataFrame(models)
model_df
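Since the five per-model blocks above repeat the same fit/score steps, they could equivalently be folded into one loop. This is an optional sketch of that refactor; it uses fresh result lists so it does not double-count into the lists already filled above:
def evaluate(name, model, results):
    # Fit on the training split, score on the test split, record metrics
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results['Model'].append(name)
    results['Accuracy'].append(round(accuracy_score(y_test, pred)*100, 2))
    results['Precision'].append(precision_score(y_test, pred))
    results['Recall'].append(recall_score(y_test, pred))
    results['F1_Score'].append(f1_score(y_test, pred))

results = {'Model': [], 'Accuracy': [], 'Precision': [], 'Recall': [], 'F1_Score': []}
for name, clf in [('Logistic Regression', LogisticRegression()),
                  ('Decision Tree', DecisionTreeClassifier(criterion="gini", random_state=100,
                                                           max_depth=3, min_samples_leaf=5)),
                  ('Random Forest', RandomForestClassifier(n_estimators=100)),
                  ('SVM', SVC(kernel='linear')),
                  ('Naive Bayes', GaussianNB())]:
    evaluate(name, clf, results)
pd.DataFrame(results)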
OUTPUT:
[Dataset preview and gender count plot]
RESULT: All models are trained and evaluated. Naïve Bayes performs best for the given dataset.
EXPERIMENT – 5
AIM: Implement ensemble methods such as Bagging (e.g., Random Forest) and Boosting (e.g., AdaBoost) on a classification task and compare their performance to individual models.
CODE:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
df = pd.read_csv('gender_classification_v7.csv')
df['gender'] = df['gender'].apply(lambda x: 0 if x == 'Male' else 1)
plt.figure(figsize=(2, 4))
plt.title('Count of Gender', size=10)
sns.countplot(data=df, x="gender")
plt.ylabel('Count', size=12)
plt.xlabel('Gender', size=12)
sns.despine(top=True, right=True, left=False, bottom=False)
plt.show()
X = df.drop(columns=['gender'])
y = df['gender']
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)
Name, Accuracy, Precision, Recall, F1_Score = [], [], [], [], []
# Ensemble Methods
# Bagging - Random Forest
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Random Forest')
# Boosting - AdaBoost Classifier
adaBoost = AdaBoostClassifier()
adaBoost.fit(X_train, y_train)
y_pred = adaBoost.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('AdaBoost')
# Combining all models' performance
models = {'Model': Name, 'Accuracy': Accuracy, 'Precision': Precision, 'Recall': Recall, 'F1_Score': F1_Score}
model_df = pd.DataFrame(models)
model_df
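The aim also asks for a comparison against individual models. A minimal sketch of such a baseline, assuming the same split as above, is a single decision tree; run it and then rebuild model_df to see all three rows:
from sklearn.tree import DecisionTreeClassifier

# Individual (non-ensemble) baseline for comparison
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Decision Tree (baseline)')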
OUTPUT:
[Dataset preview and gender count plot]
RESULT: Both models are trained and evaluated. The AdaBoost classifier performs best for the given dataset.
EXPERIMENT – 6
AIM: Write a code for feature selection techniques to reduce the number of features in a dataset while maintaining or improving the model's performance.
CODE:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import math
# Assuming 'df' is your DataFrame containing the dataset
df = pd.read_csv('USA_Housing.csv')
# Selecting features and output variable
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
        'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']
# Perform univariate feature selection using ANOVA F-test
selector = SelectKBest(score_func=f_regression, k=4)  # Select top 4 features based on F-test
X_selected = selector.fit_transform(X, y)
# Get selected feature indices
selected_indices = selector.get_support(indices=True)
# Get the names of selected features
selected_features = X.columns[selected_indices]
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)
# Initialize and fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Calculate performance metrics
rmse = math.sqrt(mean_squared_error(y_test, y_pred))
r_squared = r2_score(y_test, y_pred)
print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')
print(f'R-squared: {r_squared:.4f}')
print('Original Features: ', list(X.columns))
print(f'Selected Features: {list(selected_features)}\n')
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, label='Predicted', color='cyan')
plt.scatter(y_test, y_test, alpha=0.5, label='Actual', color='blue')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.legend()
plt.show()
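To see why these four features are kept, the per-feature F-scores computed by the fitted selector can be printed; this is an optional sketch reusing selector and X from above:
# Inspect the ANOVA F-score for every candidate feature
for feature, score in zip(X.columns, selector.scores_):
    print(f'{feature}: F = {score:.2f}')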
OUTPUT:
RESULT: After feature reduction using the ANOVA F-test, model performance is maintained. Results are shown below:
● Root Mean Squared Error (RMSE): 100367.9313
● R-squared: 0.9181
● Original Features: ['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']
● Selected Features: ['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Area Population']
EXPERIMENT – 7
AIM: Write a code to apply the Apriori algorithm to discover association rules in a retail transaction dataset and identify frequently co-occurring items in customer purchases.
CODE:
!pip install apyori
import pandas as pd
from mlxtend.frequent_patterns import association_rules
import matplotlib.pyplot as plt
import plotly.express as px
from apyori import apriori
# Create a dataframe and assign data from the CSV file
data = pd.read_csv('/content/Groceries_dataset.csv')
# One-hot encoding the products:
dummy = pd.get_dummies(data['itemDescription'])
products = list(dummy.columns)  # product names, used below when grouping transactions
data.drop(['itemDescription'], inplace=True, axis=1)
data = data.join(dummy)
data.head()
# Transaction: if a customer bought multiple products in one day, it will be considered as 1 transaction:
data1 = data.groupby(['Member_number', 'Date'])[products].sum()
data1 = data1.reset_index()[products]
rint("New Dimension", [Link])
p
[Link]()
# Replacing all non-zero values with the name of the product:
def product_names(x):
    for product in products:
        if x[product] > 0:
            x[product] = product
    return x

data1 = data1.apply(product_names, axis=1)
data1.head()
# Removing zeros, extracting the list of items bought per customer
x = data1.values
x = [sub[~(sub == 0)].tolist() for sub in x if sub[sub != 0].tolist()]
transactions = x
transactions[0:10]
rules = apriori(transactions, min_support=0.00030, min_lift=3, max_length=2, target="rules")
association_results = list(rules)
print(association_results[0])
for item in association_results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule : ", items[0], " -> " + items[1])
    print("Support : ", str(item[1]))
    print("Confidence : ", str(item[2][0][2]))
    print("Lift : ", str(item[2][0][3]))
    print("=============================")
OUTPUT:
RESULT: Eight association rules are identified in the retail (grocery) transaction dataset for frequently co-occurring items in customer purchases.
EXPERIMENT – 8
AIM: Implement k-fold cross-validation on a classification task to assess the model's performance, addressing the issue of overfitting.
CODE:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from numpy import mean, std
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['species'] = data.target
df
df['species'].value_counts()
X = df.drop(['species'], axis='columns')
Y = df.species
model = LogisticRegression(max_iter=1000)  # classifier to evaluate across the folds
for i in range(2, 16):
    kf = KFold(n_splits=i, random_state=1, shuffle=True)
    scores = cross_val_score(model, X, Y, scoring='accuracy', cv=kf, n_jobs=-1)
    print('n-split:', i)
    print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
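To make the repeated train/test process concrete, this optional sketch reproduces what cross_val_score does internally for a single choice of k=5, reusing model, X and Y from above:
from sklearn.metrics import accuracy_score

# Manual 5-fold loop: each fold serves once as the held-out test set
kf = KFold(n_splits=5, random_state=1, shuffle=True)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model.fit(X.iloc[train_idx], Y.iloc[train_idx])
    pred = model.predict(X.iloc[test_idx])
    fold_scores.append(accuracy_score(Y.iloc[test_idx], pred))
print('Per-fold accuracy:', fold_scores)
print('Mean accuracy:', mean(fold_scores))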
OUTPUT:
RESULT: K-fold cross-validation helps in obtaining a more reliable estimate of the model's performance by repeating the training and testing process k times with different subsets. It helps to identify models that generalize well to unseen data and reduces the risk of overfitting to specific patterns in the training data, leading to a more robust evaluation of model performance.
EXPERIMENT – 9
AIM: To implement a simple classification model to predict the species of iris flowers in the Iris dataset using basic algorithms like logistic regression or k-nearest neighbors.
CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['species'] = data.target
df.head()
X = df.drop(['species'], axis='columns')
Y = df.species
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# KNN Classifier
knn5 = KNeighborsClassifier(n_neighbors=5)
knn1 = KNeighborsClassifier(n_neighbors=1)
knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)
y_pred_knn5 = knn5.predict(X_test)
y_pred_knn1 = knn1.predict(X_test)
print("Accuracy with KNN at k=5", accuracy_score(y_test,
y_pred_knn5)*100)
print("Accuracy with KNN at k=1", accuracy_score(y_test,
y_pred_knn1)*100)
log_regr = LogisticRegression(solver='lbfgs', max_iter=1000)
log_regr.fit(X_train, y_train)
# Predict labels of unseen (test) data
y_pred_lr=log_regr.predict(X_test)
score=accuracy_score(y_test,y_pred_lr)
# accuracy_score returns the fraction of correctly classified test samples
print("Accuracy of logistic regression ", score*100)
OUTPUT:
RESULT: Simple classification models (k-nearest neighbors and logistic regression) are trained and evaluated.
EXPERIMENT – 10
AIM: Predict the quality of wine based on features like acidity, alcohol content, and pH by using either linear regression or decision trees.
CODE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv('/content/winequality-red.csv')  # wine quality CSV; filename assumed, adjust to your file
df = pd.DataFrame(data)
df
df.isnull().sum()
df.update(df.fillna(df.mean()))  # fill any missing values with column means
X = df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',
        'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates',
        'alcohol']].values
Y = df['quality'].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coeff_df = pd.DataFrame(regressor.coef_,
                        ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                         'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
                         'pH', 'sulphates', 'alcohol'],
                        columns=['Coefficient'])
coeff_df
print(regressor.intercept_)
y_pred = regressor.predict(X_test)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Calculate R-squared
r_squared = r2_score(y_test, y_pred)
rint("Linear regression model performance:\nRoot Mean Squared Error
p
(RMSE):", rmse)
print("R-squared:", r_squared)
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, label='Predicted', color='cyan')
plt.scatter(y_test, y_test, alpha=0.5, label='Actual', color='blue')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.legend()
plt.show()
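The aim allows either linear regression or decision trees; for comparison, a decision tree regressor can be fitted on the same split. This is an optional sketch with its import added here; max_depth=5 is an arbitrary choice:
from sklearn.tree import DecisionTreeRegressor

# Alternative model: a depth-limited regression tree on the same split
tree_reg = DecisionTreeRegressor(max_depth=5, random_state=0)
tree_reg.fit(X_train, y_train)
y_pred_tree = tree_reg.predict(X_test)
print("Decision tree RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_tree)))
print("Decision tree R-squared:", r2_score(y_test, y_pred_tree))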
OUTPUT:
RESULT: Hence, we have trained and evaluated the model.
Evaluation Results are:
● Root Mean Squared Error (RMSE): 0.7302836974721729
● R-squared: 0.3001119515373122