
Exercise 3: Write a program to implement the Naïve Bayesian classifier for
a sample training data set. Compute the accuracy of the classifier,
considering a few test data sets.
Data Set: SMS Spam Collection Dataset
Link to Download the Data set: https://www.kaggle.com/uciml/sms-spam-collection-dataset
Aim:
To implement the naïve Bayesian classifier and compute its accuracy for
SMS spam detection.
Procedural Steps:
1. Import necessary libraries and load the Data from dataset to a dataframe
2. Analyse the dataset
3. Visualize the Data
4. Preprocess the data
5. Split the Data into two parts: train and test
6. Create and Train the Model with Training Data using Multinomial Naive
Bayes Algorithm
7. Find the Accuracy of our Model with Test Data
8. Predict the output for new unclassified data

Step 1: Import necessary libraries and load the Data from dataset to a dataframe

PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
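# the CSV contains characters outside UTF-8, hence the latin-1 encoding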
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head(5)
OUTPUT:
Step 2: Analyze the Data
PROGRAM:
data.info()
OUTPUT:

Step 3: Visualize the Data


PROGRAM:
import seaborn as sns
plt.figure(dpi=150)
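# 'v1' holds the label (ham/spam) and 'v2' holds the message text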
sns.countplot(data=data,x='v1')
data['v1'].value_counts()
OUTPUT:
Step 4: Preprocess the Data
PROGRAM:
from sklearn.feature_extraction.text import CountVectorizer
f = CountVectorizer(stop_words = 'english')
X = f.fit_transform(data["v2"])
# fit_transform() learns the vocabulary and vectorizes the text in one step
print(X.shape)
OUTPUT:
(5572, 8404)
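To see what the vectorizer does, below is a minimal sketch on a three-message toy corpus (the messages are illustrative, not taken from the dataset):
from sklearn.feature_extraction.text import CountVectorizer
toy = ["free prize money", "call me tomorrow", "free call now"]
cv = CountVectorizer(stop_words='english')
counts = cv.fit_transform(toy)
print(cv.get_feature_names_out())  # vocabulary learned from the corpus
                                   # (get_feature_names() on scikit-learn < 1.0)
print(counts.toarray())            # one row of word counts per message
Each row of the printed array is one message and each column counts one vocabulary word; the spam.csv matrix above works the same way, just with 5572 rows and 8404 vocabulary words.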

Step 5: Split the Data into two parts: train and test
PROGRAM:
from sklearn.model_selection import train_test_split
y=data["v1"]
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(X_test.shape)
OUTPUT:
(4179, 8404)
(1393, 8404)
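Note that train_test_split shuffles randomly, so the exact rows in each part differ between runs. For a reproducible split that keeps the same ham/spam ratio in both parts, one common pattern is the following (the seed value 42 is an arbitrary choice):
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)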
Step 6: Create and Train the Model with Training Data using Multinomial Naive
Bayes Algorithm
PROGRAM:
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB() #Created a model
mnb.fit(X_train,y_train) #Train the model
OUTPUT:
MultinomialNB()

Step 7: Find the Accuracy of our Model with Test Data


PROGRAM:
# store the score in a separate variable so the model object stays usable in Step 8
acc = round(mnb.score(X_test, y_test)*100,2)
print('accuracy score = ', acc,'%')
OUTPUT:
accuracy score = 97.7 %

Step 8: Predict the output for new unclassified data


PROGRAM:
text_messages=['Free Entry to T20 World cup, text T20 to 82521',
               'you are the lucky winner of the day',
               'collect your prize money, click this link to get it',
               'add Rs. 200000 to your festive season budget with moneytap for credit line in just 4 minutes, click the link to get , prize',
               'get the prize',
               'prize',
               'amazon urgently recruiting for part time jobs, you can earn 20000 to 30000 everyday add whatsapp']
text_messages=f.transform(text_messages)  # reuse the vectorizer fitted in Step 4
mnb.predict(text_messages)
OUTPUT:
array(['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham'],
dtype='<U4')

Result:
Created a Machine Learning model using the Multinomial Naïve Bayes
Algorithm, tested the accuracy of the algorithm and predicted the output for new
unclassified data.
Exercise 4: Write a program to construct a Gaussian Naïve Bayes Model
considering medical data. Use this model to demonstrate the diagnosis of
heart patients using standard Heart Disease Data Set.
Data Set: Heart Disease Dataset
Link to Download the Data set: https://www.kaggle.com/johnsmith88/heart-disease-dataset
Aim:
To construct a Gaussian Naïve Bayes Model considering medical data
and use this model to demonstrate the diagnosis of heart patients using
standard Heart Disease Data Set
Procedural Steps:
1. Import necessary libraries and load the Data from dataset to a dataframe
2. Analyze the dataset
3. Split the Data into two parts: train and test
4. Create and Train the Model with Training Data using Gaussian Naive Bayes
Algorithm and Find the Accuracy of Model
5. Confusion Matrix
6. Plot Confusion Matrix
7. Classification Report
Understanding the Dataset
• age (#)
• sex: 1 = Male, 0 = Female (Binary)
• (cp) chest pain type (4 values, Ordinal): 1: typical angina, 2: atypical angina,
3: non-anginal pain, 4: asymptomatic
• (trestbps) resting blood pressure (#)
• (chol) serum cholesterol in mg/dl (#)
• (fbs) fasting blood sugar > 120 mg/dl (Binary) [1 = true; 0 = false]
• (restecg) resting electrocardiographic results [values 0, 1, 2]
• (thalach) maximum heart rate achieved (#)
• (exang) exercise induced angina (Binary) [1 = yes; 0 = no]
• (oldpeak) ST depression induced by exercise relative to rest (#)
• (slope) slope of the peak exercise ST segment (Ordinal) [1: upsloping, 2: flat, 3:
downsloping]
• (ca) number of major vessels (0-3, Ordinal) colored by fluoroscopy
• (thal) thalassemia (Ordinal) [3 = normal; 6 = fixed defect; 7 =
reversible defect]
Step 1: Import necessary libraries and load the Data from dataset to a dataframe

PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, plot_confusion_matrix
from sklearn.naive_bayes import GaussianNB
data = pd.read_csv('heart.csv')
data.head()

OUTPUT:

Step 2: Analyze the Data


PROGRAM:
data.info()
OUTPUT:
Step 3: Split the Data into two parts: train and test
PROGRAM:
y = data["target"]
X = data.drop('target',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
OUTPUT:
(820, 13) (205, 13) (820,) (205,)

Step 4: Create and Train the Model with Training Data using Gaussian Naive Bayes
Algorithm and Find the Accuracy of Model

PROGRAM:
nb = GaussianNB()
nb.fit(X_train,y_train)
nbpred = nb.predict(X_test)
nb_acc_score = accuracy_score(y_test, nbpred)
print("Accuracy of Naive Bayes model:",nb_acc_score*100,'\n')
OUTPUT:
Accuracy of Naive Bayes model: 80.97560975609757
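GaussianNB fits a per-class mean and variance for every feature and combines them with the class priors via Bayes' rule. As an optional check, the learned parameters can be inspected (a sketch; the var_ attribute was named sigma_ before scikit-learn 1.0):
print(nb.class_prior_)  # P(class) estimated from the training labels
print(nb.theta_)        # per-class mean of each feature
print(nb.var_)          # per-class variance of each feature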

Step 5: Confusion Matrix


PROGRAM:
nb_conf_matrix = confusion_matrix(y_test, nbpred)
print("Confussion matrix")
print(nb_conf_matrix)
print("\n")
OUTPUT:
Confusion matrix
[[90 25]
[14 76]]
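In this matrix, rows are actual classes and columns are predicted classes, so the diagonal holds the correct predictions. The per-class recall values reported in Step 7 can be re-derived from these counts as a sanity check (a sketch):
tn, fp, fn, tp = nb_conf_matrix.ravel()  # scikit-learn orders a 2x2 matrix as [[tn, fp], [fn, tp]]
print("recall for class 0 (specificity):", round(tn / (tn + fp), 2))
print("recall for class 1 (sensitivity):", round(tp / (tp + fn), 2))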

Step 6: Plot Confusion Matrix


PROGRAM:
plot_confusion_matrix(nb, X_test, y_test)
OUTPUT:
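Note: plot_confusion_matrix was removed in scikit-learn 1.2. On newer versions, an equivalent plot can be produced with (a drop-in sketch):
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(nb, X_test, y_test)
plt.show()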
Step 7: Classification Report
PROGRAM:
print(classification_report(y_test,nbpred))
OUTPUT:
              precision    recall  f1-score   support

           0       0.87      0.78      0.82       115
           1       0.75      0.84      0.80        90

    accuracy                           0.81       205
   macro avg       0.81      0.81      0.81       205
weighted avg       0.82      0.81      0.81       205

Result:
Created a Machine Learning model using Gaussian Naïve Bayes Algorithm,
and tested the accuracy of the algorithm.
Exercise 5: Write a program to implement the k-Nearest Neighbor algorithm to
classify the data set
Data Set: Pima Indians Diabetes Database
Link to Download the Data set: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Aim:
To write a program to implement the k-Nearest Neighbor algorithm to
classify the data set.

Procedural Steps:
1. Import necessary libraries and load the Data from dataset to a dataframe
2. Analyze the dataset
3. Split the Data into two parts: train and test
4. Calculate Error rates for different k values using k-Nearest Neighbor
algorithm
5. Plot the error rates to find optimum k value
6. Create and Train the Model with Training Data using k-Nearest
Neighbor algorithm with optimal k value and find the accuracy
7. Print the Classification Report
Understanding the Dataset
• Pregnancies - Number of times pregnant
• Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance
test
• BloodPressure - Diastolic blood pressure (mm Hg)
• SkinThickness - Triceps skin fold thickness (mm)
• Insulin - 2-Hour serum insulin (mu U/ml)
• BMI - Body mass index (weight in kg/(height in m)^2)
• DiabetesPedigreeFunction - Diabetes pedigree function
• Age - Age (years)
• Outcome - Class variable (0 or 1); 268 of the 768 records are 1, the others are 0

Step 1: Import necessary libraries and load the Data from dataset to a dataframe

PROGRAM:
#Load the necessary python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.neighbors import KNeighborsClassifier
#Load the dataset
df = pd.read_csv('diabetes.csv')
#Print the first 5 rows of the dataframe.
df.head()

OUTPUT:

Step 2: Analyze the Dataset


PROGRAM:
df.info()
OUTPUT:
Step 3: Split the Data into two parts: train and test
PROGRAM:
#Let's create numpy arrays for features and target
X = df.drop('Outcome',axis=1).values
y = df['Outcome'].values
#importing train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
OUTPUT:
(576, 8) (192, 8) (576,) (192,)

Step 4: Calculate Error rates for different k values using k-Nearest Neighbor
algorithm

PROGRAM:
# Finding error rates while the number of neighbors varies from 1 to 39
error_rate = []
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
print(error_rate)
OUTPUT:
[0.265625, 0.234375, 0.234375, 0.22395833333333334,
0.22395833333333334, 0.21875, 0.21875, 0.21354166666666666,
0.22395833333333334, 0.19270833333333334, 0.20833333333333334,
0.20833333333333334, 0.21354166666666666, 0.19791666666666666, 0.203125,
0.19270833333333334, 0.203125, 0.20833333333333334, 0.19791666666666666,
0.21354166666666666, 0.21875, 0.22395833333333334, 0.22395833333333334,
0.23958333333333334, 0.23958333333333334, 0.23958333333333334,
0.23958333333333334, 0.2552083333333333, 0.234375, 0.22395833333333334,
0.234375, 0.22395833333333334, 0.234375, 0.22916666666666666, 0.234375,
0.22916666666666666, 0.25, 0.22916666666666666, 0.24479166666666666]
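The best k can also be read off programmatically rather than from the plot in the next step (a sketch; list index i corresponds to k = i + 1 because the loop starts at k = 1):
optimal_k = int(np.argmin(error_rate)) + 1
print("optimal k:", optimal_k, "with error rate", min(error_rate))
For the output above this gives k = 10, which is the value used in Step 6.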

Step 5: Plot the error rates to find optimum k value


PROGRAM:
plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
OUTPUT:
Step 6: Create and Train the Model with Training Data using k-Nearest Neighbor
algorithm with optimal k value and find the accuracy
PROGRAM:
# now with the optimal value k=10 (the lowest error rate found in Step 4)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print(round(knn.score(X_test,y_test)*100,2),"%")
OUTPUT:
80.73 %
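k-NN is distance-based, so features with large numeric ranges (such as Insulin) dominate those with small ranges (such as DiabetesPedigreeFunction). Standardizing the features often improves accuracy; a hedged sketch using a scikit-learn Pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
scaled_knn.fit(X_train, y_train)
print(round(scaled_knn.score(X_test, y_test)*100, 2), "%")  # result varies with the random split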

Step 7: Print the Classification Report


PROGRAM:
print(confusion_matrix(y_test,pred))

print(classification_report(y_test,pred))
OUTPUT:
Result:
Created a Machine Learning model using k-Nearest Neighbor Algorithm,
and tested the accuracy of the algorithm.
Exercise 6: Implement Support Vector Classification to classify the Iris
Species
Data Set: Iris Species
Link to Download the Data set: https://www.kaggle.com/uciml/iris

Aim:
Implement Support Vector Classification to classify the Iris Species

Procedural Steps:
1. Import necessary libraries and load the Data from dataset to a dataframe
2. Analyze the dataset
3. Visualize the data using pairplot
4. Split the Data into two parts: train and test
5. Create and Train the Model with Training Data using Support Vector
Classification algorithm and find the accuracy
6. Print the Confusion Matrix
7. Print the Classification Report
Understanding the Dataset
The dataset is a CSV file which contains a set of 150 records under 5 attributes - Petal
Length, Petal Width, Sepal Length, Sepal Width and Class (Species).

Step 1: Import necessary libraries and load the Data from dataset to a dataframe
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

iris=pd.read_csv('Iris.csv')
iris.head()

OUTPUT:

Step 2: Analyze the Dataset


PROGRAM:
iris.info()
OUTPUT:
Step 3: Visualize the data using pairplot
PROGRAM:
# Creating a pairplot to visualize the similarities and especially the differences between the species
sns.pairplot(data=iris, hue='Species')

OUTPUT:
Step 4: Split the Data into two parts: train and test

PROGRAM:
# Separating the independent variables from dependent variables
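# column 0 is the Id, columns 1-4 are the four measurements, column 5 is the species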
X=iris.iloc[:,1:5]
y=iris.iloc[:,5]
X_train,X_test, y_train, y_test=train_test_split(X,y,test_size=0.30)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
OUTPUT:
(105, 4) (45, 4) (105,) (45,)

Step 5: Create and Train the Model with Training Data using Support Vector
Classification algorithm and find the accuracy.
PROGRAM:
model=SVC()
model.fit(X_train, y_train)
pred=model.predict(X_test)
print(round(model.score(X_test,y_test)*100,2),"%")
OUTPUT:
95.56 %
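SVC() above runs with its defaults: an RBF kernel and C=1.0. The kernel and the regularization parameter C are the main tuning knobs; a quick comparison sketch (scores vary with the random split):
for kernel in ['linear', 'rbf', 'poly']:
    m = SVC(kernel=kernel)
    m.fit(X_train, y_train)
    print(kernel, round(m.score(X_test, y_test)*100, 2), "%")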

Step 6: Print the Confusion Matrix


PROGRAM:
print(confusion_matrix(y_test,pred))
OUTPUT:

Step 7: Print the Classification Report


PROGRAM:
print(classification_report(y_test, pred))
OUTPUT:

Result:
Created a Support Vector Classification model and tested the accuracy of the
algorithm.
Exercise 7: Apply EM algorithm to cluster a set of data stored in a .CSV file.

Data Set: Height and Weight Data

Link to Download the Data set: https://www.kaggle.com/mayankkilhor/ml-lab-da6/data

Expectation Maximization algorithm process

Gaussian Mixture models are fitted with an algorithm called Expectation-Maximization, or EM. Given the number of clusters for a Gaussian Mixture model, the EM algorithm estimates the parameters of the Gaussian distributions in two alternating steps.

• The E-step assigns data points to clusters using the current parameter estimates: for each point it computes the probability (responsibility) that the point belongs to each Gaussian cluster.

• The M-step updates the cluster parameters based on the calculations from the E-step: the mixing weight, mean, and covariance of each cluster are re-estimated from the data points, weighted by their E-step responsibilities.

• The two steps are repeated, with the estimates refined on every pass, until convergence is reached.
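To make the two steps concrete, below is a minimal NumPy sketch of EM for a one-dimensional, two-component mixture on toy data (illustrative only; the GaussianMixture class used in the program performs these updates internally):
import numpy as np

# toy 1-D data and initial guesses for two components
x = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
mu = np.array([0.0, 4.0])    # means
var = np.array([1.0, 1.0])   # variances
pi = np.array([0.5, 0.5])    # mixing weights

for _ in range(10):
    # E-step: responsibility of each component for each point
    dens = pi * np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and variances from the responsibilities
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk

print(mu, var, pi)  # the means converge toward the two clumps in the data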

PROGRAM:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('Clustering_gmm.csv')

# training gaussian mixture model


from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4)
gmm.fit(data)

#predictions from gmm


labels = gmm.predict(data)
frame = pd.DataFrame(data)

frame['cluster'] = labels

color=['blue','green','cyan', 'black']

# Plotting the clustered data, one colour per cluster
plt.figure(dpi=200)
for k in range(0,4):
    points = frame[frame["cluster"]==k]   # rows assigned to cluster k
    plt.scatter(points["Weight"], points["Height"], c=color[k])
plt.show()
OUTPUT:
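Here n_components=4 was chosen by eye from the scatter plot. A common alternative is to compare the Bayesian Information Criterion across candidate component counts and keep the minimum (a sketch; random_state is fixed only for repeatability):
for k in range(1, 8):
    g = GaussianMixture(n_components=k, random_state=0).fit(data)
    print(k, round(g.bic(data), 1))  # lower BIC is better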
Exercise 8: Apply the technique of pruning and derive the decision tree from
this data. Analyze the results by comparing the structure of pruned and
unpruned tree.

Data Set: Pima Indians Diabetes Database


Link to Download the Data set: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Procedure :
1. Download above data set
2. Download and install the Graphviz software for Windows 10 using the link
https://graphviz.org/download/
3. Install the libraries pydotplus and graphviz
4. Import required libraries
5. Train the data with DecisionTreeClassifier
6. Visualize Decision Tree using graphviz
7. Maximum depth of the tree can be used as a control variable for pre-pruning
8. The classification rate is increased due to pre-pruning

PROGRAM:

1. Install the required libraries

pip install pydotplus


pip install graphviz

2. Import required libraries and import the data

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation
pima = pd.read_csv("diabetes.csv")
pima.head()

OUTPUT:

3. Train the model and find the accuracy score


feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'Insulin', 'BMI',
                'DiabetesPedigreeFunction', 'Age']
X = pima[feature_cols] # Features
y = pima['Outcome'] # Target variable
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier


clf = clf.fit(X_train,y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",round(metrics.accuracy_score(y_test, y_pred)*100,2))
OUTPUT:

Accuracy: 67.53

4. Visualize Decision Tree


from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                feature_names=feature_cols, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

OUTPUT:
5. In scikit-learn, the optimization of a decision tree classifier is
performed only by pre-pruning. The maximum depth of the tree can be
used as a control variable for pre-pruning. In the following example,
a decision tree is trained on the same data with max_depth=3.
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(max_depth=3)

# Train Decision Tree Classifier


clf = clf.fit(X_train,y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",round(metrics.accuracy_score(y_test, y_pred)*100,2))

OUTPUT:
Accuracy: 75.76

6. Visualize Decision Tree


dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                feature_names=feature_cols, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
OUTPUT:
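Besides pre-pruning with max_depth, scikit-learn (0.22 and later) also supports post-pruning via minimal cost-complexity pruning. A hedged sketch that compares tree size and accuracy across candidate ccp_alpha values taken from the pruning path:
# candidate alphas come from the pruning path of the unpruned tree
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas[::10]:  # sample every 10th alpha for brevity
    pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(round(alpha, 5), pruned.get_n_leaves(),
          round(metrics.accuracy_score(y_test, pruned.predict(X_test))*100, 2))
Larger alphas prune more aggressively, so the leaf count falls as alpha grows; comparing these accuracies against the unpruned (67.53) and max_depth=3 (75.76) trees above shows the effect of pruning on both structure and classification rate.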
