
Exercise 3: Write a program to implement the Naïve Bayesian classifier for
a sample training data set. Compute the accuracy of the classifier,
considering a few test data sets.
Data Set: SMS Spam Collection Dataset
Link to Download the Data set: https://www.kaggle.com/uciml/sms-spam-collection-dataset
Aim:
To implement the naïve Bayesian classifier and compute its accuracy for
SMS spam detection.
Procedural Steps:
1. Import necessary libraries and load the Data from dataset to a dataframe
2. Analyse the dataset
3. Visualize the Data
4. Preprocess the data
5. Split the Data into two parts: train and test
6. Create and Train the Model with Training Data using Multinomial Naive
Bayes Algorithm
7. Find the Accuracy of our Model with Test Data
8. Predict the output for new unclassified data

Step 1: Import necessary libraries and load the Data from dataset to a dataframe

PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
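# the CSV contains characters outside UTF-8, hence the latin-1 encoding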
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head(5)
OUTPUT:
Step 2: Analyze the Data
PROGRAM:
data.info()
OUTPUT:

Step 3: Visualize the Data


PROGRAM:
import seaborn as sns
plt.figure(dpi=150)
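# 'v1' holds the label (ham/spam) and 'v2' holds the message text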
sns.countplot(data=data,x='v1')
data['v1'].value_counts()
OUTPUT:
Step 4: Preprocess the Data
PROGRAM:
from sklearn.feature_extraction.text import CountVectorizer
f = CountVectorizer(stop_words = 'english')
X = f.fit_transform(data["v2"])
# fit_transform() learns the vocabulary and vectorizes the text in one step
print(X.shape)
OUTPUT:
(5572, 8404)
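To see what the vectorizer does, below is a minimal sketch on a three-message toy corpus (the messages are illustrative, not taken from the dataset):
from sklearn.feature_extraction.text import CountVectorizer
toy = ["free prize money", "call me tomorrow", "free call now"]
cv = CountVectorizer(stop_words='english')
counts = cv.fit_transform(toy)
print(cv.get_feature_names_out())  # vocabulary learned from the corpus
                                   # (get_feature_names() on scikit-learn < 1.0)
print(counts.toarray())            # one row of word counts per message
Each row of the printed array is one message and each column counts one vocabulary word; the spam.csv matrix above works the same way, just with 5572 rows and 8404 vocabulary words.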

Step 5: Split the Data into two parts: train and test
PROGRAM:
from sklearn.model_selection import train_test_split
y=data["v1"]
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(X_test.shape)
OUTPUT:
(4179, 8404)
(1393, 8404)
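Note that train_test_split shuffles randomly, so the exact rows in each part differ between runs. For a reproducible split that keeps the same ham/spam ratio in both parts, one common pattern is the following (the seed value 42 is an arbitrary choice):
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)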
Step 6: Create and Train the Model with Training Data using Multinomial Naive
Bayes Algorithm
PROGRAM:
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB() #Created a model
mnb.fit(X_train,y_train) #Train the model
OUTPUT:
MultinomialNB()

Step 7: Find the Accuracy of our Model with Test Data


PROGRAM:
# store the score in a separate variable so the model object stays usable in Step 8
acc = round(mnb.score(X_test, y_test)*100,2)
print('accuracy score = ', acc,'%')
OUTPUT:
accuracy score = 97.7 %

Step 8: Predict the output for new unclassified data


PROGRAM:
text_messages=['Free Entry to T20 World cup, text T20 to 82521',
               'you are the lucky winner of the day',
               'collect your prize money, click this link to get it',
               'add Rs. 200000 to your festive season budget with moneytap for credit line in just 4 minutes, click the link to get , prize',
               'get the prize',
               'prize',
               'amazon urgently recruiting for part time jobs, you can earn 20000 to 30000 everyday add whatsapp']
text_messages=f.transform(text_messages)  # reuse the vectorizer fitted in Step 4
mnb.predict(text_messages)
OUTPUT:
array(['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham'],
dtype='<U4')

Result:
Created a Machine Learning model using the Multinomial Naïve Bayes
Algorithm, tested the accuracy of the algorithm and predicted the output for new
unclassified data.
Exercise 4: Write a program to construct a Gaussian Naïve Bayes Model
considering medical data. Use this model to demonstrate the diagnosis of
heart patients using standard Heart Disease Data Set.
Data Set: Heart Disease Dataset
Link to Download the Data set: https://www.kaggle.com/johnsmith88/heart-disease-dataset
Aim:
To construct a Gaussian Naïve Bayes Model considering medical data
and use this model to demonstrate the diagnosis of heart patients using
standard Heart Disease Data Set
Procedural Steps:
1. Import necessary libraries and load the Data from dataset to a dataframe
2. Analyze the dataset
3. Split the Data into two parts: train and test
4. Create and Train the Model with Training Data using Gaussian Naive Bayes
Algorithm and Find the Accuracy of Model
5. Confusion Matrix
6. Plot Confusion Matrix
7. Classification Report
Understanding the Dataset
• age (#)
• sex: 1 = Male, 0 = Female (Binary)
• (cp) chest pain type (4 values, Ordinal): 1: typical angina, 2: atypical angina,
3: non-anginal pain, 4: asymptomatic
• (trestbps) resting blood pressure (#)
• (chol) serum cholesterol in mg/dl (#)
• (fbs) fasting blood sugar > 120 mg/dl (Binary) [1 = true; 0 = false]
• (restecg) resting electrocardiographic results [values 0, 1, 2]
• (thalach) maximum heart rate achieved (#)
• (exang) exercise induced angina (Binary) [1 = yes; 0 = no]
• (oldpeak) ST depression induced by exercise relative to rest (#)
• (slope) slope of the peak exercise ST segment (Ordinal) [1: upsloping, 2: flat, 3:
downsloping]
• (ca) number of major vessels (0-3, Ordinal) colored by fluoroscopy
• (thal) thalassemia (Ordinal) [3 = normal; 6 = fixed defect; 7 =
reversible defect]
Step 1: Import necessary libraries and load the Data from dataset to a dataframe

PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, plot_confusion_matrix
from sklearn.naive_bayes import GaussianNB
data = pd.read_csv('heart.csv')
data.head()

OUTPUT:

Step 2: Analyze the Data


PROGRAM:
data.info()
OUTPUT:
Step 3: Split the Data into two parts: train and test
PROGRAM:
y = data["target"]
X = data.drop('target',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
OUTPUT:
(820, 13) (205, 13) (820,) (205,)

Step 4: Create and Train the Model with Training Data using Gaussian Naive Bayes
Algorithm and Find the Accuracy of Model

PROGRAM:
nb = GaussianNB()
nb.fit(X_train,y_train)
nbpred = nb.predict(X_test)
nb_acc_score = accuracy_score(y_test, nbpred)
print("Accuracy of Naive Bayes model:",nb_acc_score*100,'\n')
OUTPUT:
Accuracy of Naive Bayes model: 80.97560975609757
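GaussianNB fits a per-class mean and variance for every feature and combines them with the class priors via Bayes' rule. As an optional check, the learned parameters can be inspected (a sketch; the var_ attribute was named sigma_ before scikit-learn 1.0):
print(nb.class_prior_)  # P(class) estimated from the training labels
print(nb.theta_)        # per-class mean of each feature
print(nb.var_)          # per-class variance of each feature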

Step 5: Confusion Matrix


PROGRAM:
nb_conf_matrix = confusion_matrix(y_test, nbpred)
print("Confussion matrix")
print(nb_conf_matrix)
print("\n")
OUTPUT:
Confusion matrix
[[90 25]
[14 76]]
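In this matrix, rows are actual classes and columns are predicted classes, so the diagonal holds the correct predictions. The per-class recall values reported in Step 7 can be re-derived from these counts as a sanity check (a sketch):
tn, fp, fn, tp = nb_conf_matrix.ravel()  # scikit-learn orders a 2x2 matrix as [[tn, fp], [fn, tp]]
print("recall for class 0 (specificity):", round(tn / (tn + fp), 2))
print("recall for class 1 (sensitivity):", round(tp / (tp + fn), 2))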

Step 6: Plot Confusion Matrix


PROGRAM:
plot_confusion_matrix(nb, X_test, y_test)
OUTPUT:
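Note: plot_confusion_matrix was removed in scikit-learn 1.2. On newer versions, an equivalent plot can be produced with (a drop-in sketch):
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(nb, X_test, y_test)
plt.show()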
Step 7: Classification Report
PROGRAM:
print(classification_report(y_test,nbpred))
OUTPUT:
              precision    recall  f1-score   support

           0       0.87      0.78      0.82       115
           1       0.75      0.84      0.80        90

    accuracy                           0.81       205
   macro avg       0.81      0.81      0.81       205
weighted avg       0.82      0.81      0.81       205

Result:
Created a Machine Learning model using Gaussian Naïve Bayes Algorithm,
and tested the accuracy of the algorithm.
Exercise 5: Write a program to implement the k-Nearest Neighbor algorithm to
classify the data set
Data Set: Pima Indians Diabetes Database
Link to Download the Data set: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Aim:
To write a program to implement the k-Nearest Neighbor algorithm to
classify the data set.

Procedural Steps:
1. Import necessary libraries and load the Data from dataset to a dataframe
2. Analyze the dataset
3. Split the Data into two parts: train and test
4. Calculate Error rates for different k values using k-Nearest Neighbor
algorithm
5. Plot the error rates to find optimum k value
6. Create and Train the Model with Training Data using k-Nearest
Neighbor algorithm with optimal k value and find the accuracy
7. Print the Classification Report
Understanding the Dataset
• Pregnancies - Number of times pregnant
• Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance
test
• BloodPressure - Diastolic blood pressure (mm Hg)
• SkinThickness - Triceps skin fold thickness (mm)
• Insulin - 2-Hour serum insulin (mu U/ml)
• BMI - Body mass index (weight in kg/(height in m)^2)
• DiabetesPedigreeFunction - Diabetes pedigree function
• Age - Age (years)
• Outcome - Class variable (0 or 1); 268 of the 768 records are 1, the others are 0

Step 1: Import necessary libraries and load the Data from dataset to a dataframe

PROGRAM:
#Load the necessary python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.neighbors import KNeighborsClassifier
#Load the dataset
df = pd.read_csv('diabetes.csv')
#Print the first 5 rows of the dataframe.
df.head()

OUTPUT:

Step 2: Analyze the Dataset


PROGRAM:
df.info()
OUTPUT:
Step 3: Split the Data into two parts: train and test
PROGRAM:
#Let's create numpy arrays for features and target
X = df.drop('Outcome',axis=1).values
y = df['Outcome'].values
#importing train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
OUTPUT:
(576, 8) (192, 8) (576,) (192,)

Step 4: Calculate Error rates for different k values using k-Nearest Neighbor
algorithm

PROGRAM:
# Finding error rates while the number of neighbors varies from 1 to 39
error_rate = []
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
print(error_rate)
OUTPUT:
[0.265625, 0.234375, 0.234375, 0.22395833333333334,
0.22395833333333334, 0.21875, 0.21875, 0.21354166666666666,
0.22395833333333334, 0.19270833333333334, 0.20833333333333334,
0.20833333333333334, 0.21354166666666666, 0.19791666666666666, 0.203125,
0.19270833333333334, 0.203125, 0.20833333333333334, 0.19791666666666666,
0.21354166666666666, 0.21875, 0.22395833333333334, 0.22395833333333334,
0.23958333333333334, 0.23958333333333334, 0.23958333333333334,
0.23958333333333334, 0.2552083333333333, 0.234375, 0.22395833333333334,
0.234375, 0.22395833333333334, 0.234375, 0.22916666666666666, 0.234375,
0.22916666666666666, 0.25, 0.22916666666666666, 0.24479166666666666]
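The best k can also be read off programmatically rather than from the plot in the next step (a sketch; list index i corresponds to k = i + 1 because the loop starts at k = 1):
optimal_k = int(np.argmin(error_rate)) + 1
print("optimal k:", optimal_k, "with error rate", min(error_rate))
For the output above this gives k = 10, which is the value used in Step 6.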

Step 5: Plot the error rates to find optimum k value


PROGRAM:
plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
OUTPUT:
Step 6: Create and Train the Model with Training Data using k-Nearest Neighbor
algorithm with optimal k value and find the accuracy
PROGRAM:
# now with the optimal value k=10 (the lowest error rate found in Step 4)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print(round(knn.score(X_test,y_test)*100,2),"%")
OUTPUT:
80.73 %
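k-NN is distance-based, so features with large numeric ranges (such as Insulin) dominate those with small ranges (such as DiabetesPedigreeFunction). Standardizing the features often improves accuracy; a hedged sketch using a scikit-learn Pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
scaled_knn.fit(X_train, y_train)
print(round(scaled_knn.score(X_test, y_test)*100, 2), "%")  # result varies with the random split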

Step 7: Print the Classification Report


PROGRAM:
print(confusion_matrix(y_test,pred))

print(classification_report(y_test,pred))
OUTPUT:
Result:
Created a Machine Learning model using k-Nearest Neighbor Algorithm,
and tested the accuracy of the algorithm.
Exercise 6: Implement Support Vector Classification to classify the Iris
Species
Data Set: Iris Species
Link to Download the Data set: https://www.kaggle.com/uciml/iris

Aim:
Implement Support Vector Classification to classify the Iris Species

Procedural Steps:
1. Import necessary libraries and load the Data from dataset to a dataframe
2. Analyze the dataset
3. Visualize the data using pairplot
4. Split the Data into two parts: train and test
5. Create and Train the Model with Training Data using Support Vector
Classification algorithm and find the accuracy
6. Print the Confusion Matrix
7. Print the Classification Report
Understanding the Dataset
The dataset is a CSV file which contains a set of 150 records under 5 attributes - Petal
Length, Petal Width, Sepal Length, Sepal Width and Class (Species).

Step 1: Import necessary libraries and load the Data from dataset to a dataframe
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

iris=pd.read_csv('Iris.csv')
iris.head()

OUTPUT:

Step 2: Analyze the Dataset


PROGRAM:
iris.info()
OUTPUT:
Step 3: Visualize the data using pairplot
PROGRAM:
# Creating a pairplot to visualize the similarities and especially the differences between the species
sns.pairplot(data=iris, hue='Species')

OUTPUT:
Step 4: Split the Data into two parts: train and test

PROGRAM:
# Separating the independent variables from dependent variables
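# column 0 is the Id, columns 1-4 are the four measurements, column 5 is the species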
X=iris.iloc[:,1:5]
y=iris.iloc[:,5]
X_train,X_test, y_train, y_test=train_test_split(X,y,test_size=0.30)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
OUTPUT:
(105, 4) (45, 4) (105,) (45,)

Step 5: Create and Train the Model with Training Data using Support Vector
Classification algorithm and find the accuracy.
PROGRAM:
model=SVC()
model.fit(X_train, y_train)
pred=model.predict(X_test)
print(round(model.score(X_test,y_test)*100,2),"%")
OUTPUT:
95.56 %
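SVC() above runs with its defaults: an RBF kernel and C=1.0. The kernel and the regularization parameter C are the main tuning knobs; a quick comparison sketch (scores vary with the random split):
for kernel in ['linear', 'rbf', 'poly']:
    m = SVC(kernel=kernel)
    m.fit(X_train, y_train)
    print(kernel, round(m.score(X_test, y_test)*100, 2), "%")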

Step 6: Print the Confusion Matrix


PROGRAM:
print(confusion_matrix(y_test,pred))
OUTPUT:

Step 7: Print the Classification Report


PROGRAM:
print(classification_report(y_test, pred))
OUTPUT:

Result:
Created a Support Vector Classification model and tested the accuracy of the
algorithm.
Exercise 7: Apply EM algorithm to cluster a set of data stored in a .CSV file.

Data Set: Height and Weight Data

Link to Download the Data set: https://www.kaggle.com/mayankkilhor/ml-lab-da6/data

Expectation Maximization algorithm process

Gaussian Mixture models are fitted with an algorithm called Expectation-Maximization, or EM. Given the number of clusters for a Gaussian Mixture model, the EM algorithm estimates the parameters of the Gaussian distributions in two alternating steps.

• The E-step assigns data points to clusters using the current parameter estimates: for each point it computes the probability (responsibility) that the point belongs to each Gaussian cluster.

• The M-step updates the cluster parameters based on the calculations from the E-step: the mixing weight, mean, and covariance of each cluster are re-estimated from the data points, weighted by their E-step responsibilities.

• The two steps are repeated, with the estimates refined on every pass, until convergence is reached.
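To make the two steps concrete, below is a minimal NumPy sketch of EM for a one-dimensional, two-component mixture on toy data (illustrative only; the GaussianMixture class used in the program performs these updates internally):
import numpy as np

# toy 1-D data and initial guesses for two components
x = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
mu = np.array([0.0, 4.0])    # means
var = np.array([1.0, 1.0])   # variances
pi = np.array([0.5, 0.5])    # mixing weights

for _ in range(10):
    # E-step: responsibility of each component for each point
    dens = pi * np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and variances from the responsibilities
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk

print(mu, var, pi)  # the means converge toward the two clumps in the data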

PROGRAM:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('Clustering_gmm.csv')

# training gaussian mixture model


from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4)
gmm.fit(data)

#predictions from gmm


labels = gmm.predict(data)
frame = pd.DataFrame(data)

frame['cluster'] = labels

color=['blue','green','cyan', 'black']

# Plotting the clustered data, one colour per cluster
plt.figure(dpi=200)
for k in range(0,4):
    points = frame[frame["cluster"]==k]   # rows assigned to cluster k
    plt.scatter(points["Weight"], points["Height"], c=color[k])
plt.show()
OUTPUT:
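Here n_components=4 was chosen by eye from the scatter plot. A common alternative is to compare the Bayesian Information Criterion across candidate component counts and keep the minimum (a sketch; random_state is fixed only for repeatability):
for k in range(1, 8):
    g = GaussianMixture(n_components=k, random_state=0).fit(data)
    print(k, round(g.bic(data), 1))  # lower BIC is better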
Exercise 8: Apply the technique of pruning and derive the decision tree from
this data. Analyze the results by comparing the structure of pruned and
unpruned tree.

Data Set: Pima Indians Diabetes Database


Link to Download the Data set: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Procedure :
1. Download above data set
2. Download and install the Graphviz software for Windows 10 using the link
https://graphviz.org/download/
3. Install the libraries pydotplus and graphviz
4. Import required libraries
5. Train the data with DecisionTreeClassifier
6. Visualize Decision Tree using graphviz
7. Maximum depth of the tree can be used as a control variable for pre-pruning
8. The classification rate is increased due to pre-pruning

PROGRAM:

1. Install the required libraries

pip install pydotplus


pip install graphviz

2. Import required libraries and import the data

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation
pima = pd.read_csv("diabetes.csv")
pima.head()

OUTPUT:

3. Train the model and find the accuracy score


feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'Insulin', 'BMI',
                'DiabetesPedigreeFunction', 'Age']
X = pima[feature_cols] # Features
y = pima['Outcome'] # Target variable
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier


clf = clf.fit(X_train,y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",round(metrics.accuracy_score(y_test, y_pred)*100,2))
OUTPUT:

Accuracy: 67.53

4. Visualize Decision Tree


from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                feature_names=feature_cols, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

OUTPUT:
5. In scikit-learn, the optimization of a decision tree classifier is
performed only by pre-pruning. The maximum depth of the tree can be
used as a control variable for pre-pruning. In the following example,
a decision tree is trained on the same data with max_depth=3.
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(max_depth=3)

# Train Decision Tree Classifier


clf = clf.fit(X_train,y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",round(metrics.accuracy_score(y_test, y_pred)*100,2))

OUTPUT:
Accuracy: 75.76

6. Visualize Decision Tree


dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                feature_names=feature_cols, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
OUTPUT:
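Besides pre-pruning with max_depth, scikit-learn (0.22 and later) also supports post-pruning via minimal cost-complexity pruning. A hedged sketch that compares tree size and accuracy across candidate ccp_alpha values taken from the pruning path:
# candidate alphas come from the pruning path of the unpruned tree
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas[::10]:  # sample every 10th alpha for brevity
    pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(round(alpha, 5), pruned.get_n_leaves(),
          round(metrics.accuracy_score(y_test, pruned.predict(X_test))*100, 2))
Larger alphas prune more aggressively, so the leaf count falls as alpha grows; comparing these accuracies against the unpruned (67.53) and max_depth=3 (75.76) trees above shows the effect of pruning on both structure and classification rate.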
