
BONAM VENKATA CHALAMAYYA ENGINEERING COLLEGE

(AUTONOMOUS)
APPROVED BY AICTE, NEW DELHI, AFFILIATED TO JNTUK KAKINADA
ODALAREVU, ALLAVARAL MANDAL, DR.B.R AMBEDKAR KONASEEMA DISTRICT, ANDHRA PRADESH – 533210.

DEEP LEARNING WITH TENSORFLOW LAB


Laboratory Manual
Prepared By:

B. Satish

K. Sai Ram

T. Vihar Ram

B. Navaneeth Krishna

S. Durga Prasad

III B.Tech II Semester


(BR20)

DEPARTMENT OF CSE-AI&ML
III YEAR II SEM    Code: 20AD6L04    L-T-P-C: 0-0-3-1.5
DEEP LEARNING WITH TENSORFLOW LAB
Course Outcomes:
On completion of this course, the student will be able to:
• Implement deep neural networks to solve real-world problems
• Choose an appropriate pre-trained model to solve a real-time problem
• Interpret the results of two different deep learning models
List of Experiments:
1. Implement multilayer perceptron algorithm for MNIST Handwritten Digit Classification.
2. Design a neural network for classifying movie reviews (Binary Classification) using IMDB dataset.
3. Design a neural network for classifying newswires (Multi-class Classification) using Reuters dataset.
4. Design a neural network for predicting house prices using Boston Housing Price dataset.
5. Build a Convolutional Neural Network for MNIST Handwritten Digit Classification.
6. Build a Convolutional Neural Network for simple image (Dogs and Cats) Classification.
7. Use a pre-trained Convolutional Neural Network (VGG16) for image classification.
8. Implement one-hot encoding of words or characters.
9. Implement word embeddings for IMDB dataset.
10. Implement a Recurrent Neural Network for IMDB movie review classification problem.
Text Books:
1. Reza Zadeh and Bharath Ramsundar, "TensorFlow for Deep Learning", O'Reilly Publishers, 2018
References:
1. https://github.com/fchollet/deep-learning-with-python-notebooks
TABLE OF CONTENTS

Sl.No  Name of Experiment
1.  Implement multilayer perceptron algorithm for MNIST Handwritten Digit Classification.
2.  Design a neural network for classifying movie reviews (Binary Classification) using IMDB dataset.
3.  Design a neural network for classifying newswires (Multi-class Classification) using Reuters dataset.
4.  Design a neural network for predicting house prices using Boston Housing Price dataset.
5.  Build a Convolutional Neural Network for MNIST Handwritten Digit Classification.
6.  Build a Convolutional Neural Network for simple image (Dogs and Cats) Classification.
7.  Use a pre-trained Convolutional Neural Network (VGG16) for image classification.
8.  Implement one-hot encoding of words or characters.
9.  Implement word embeddings for IMDB dataset.
10. Implement a Recurrent Neural Network for IMDB movie review classification problem.
EXPERIMENT:1

IMPLEMENT MULTILAYER PERCEPTRON ALGORITHM FOR MNIST HANDWRITTEN DIGIT CLASSIFICATION.

Aim: Implement multilayer perceptron algorithm for MNIST Handwritten Digit Classification.
Description:
• Handwritten digit recognition using the MNIST dataset is a significant project built with the help of neural networks. It is designed to detect scanned images of handwritten digits.
• We have taken this concept a step further by enhancing our handwritten digit recognition system to not only identify scanned images but also allow users to write digits directly on the screen using an integrated GUI for real-time recognition.
• The MNIST dataset (Modified National Institute of Standards and Technology) is a comprehensive collection of handwritten digits (0-9) widely used for training and testing machine learning models, especially in image classification and deep learning.
Key Features of MNIST:
• Contains 60,000 training images and 10,000 test images.
• Each image is 28×28 pixels in grayscale (values ranging from 0 to 255).
• Labels range from 0 to 9, representing the corresponding digit in the image.
• Commonly used for benchmarking neural network architectures like MLPs, CNNs, and RNNs.
PROCESS:
1. Importing libraries
The necessary libraries such as TensorFlow, Keras, NumPy, and Matplotlib are imported. These libraries help in defining the model, handling data, and visualizing results.

2. Loading and preprocessing the dataset
A dataset (e.g., MNIST, IMDB sentiment analysis, or Reuters news classification) is loaded. Data is split into training and testing sets. Features are normalized (scaling pixel values or encoding text data). Labels are one-hot encoded if it is a classification task.

3. Defining the MLP model
A sequential model is created using keras.Sequential(). Layers are added:
Input Layer: Specifies the input shape.
Hidden Layers: Fully connected (dense) layers with activation functions (e.g., ReLU).
Output Layer: Uses an activation like Softmax (for multi-class classification) or Sigmoid (for binary classification).

4. Compiling the model
The model is compiled with a loss function (e.g., categorical_crossentropy for classification), an optimizer (e.g., Adam, SGD), and metrics like accuracy to monitor performance.

5. Training the model
The model is trained using the fit() function. Training happens for a specified number of epochs with a batch size. Validation data is used to monitor performance.

6. Evaluating the model
The trained model is tested on unseen data using evaluate(). Performance metrics such as accuracy, loss, precision, and recall are analyzed.

7. Making predictions
New samples are passed through the model using predict(). The predictions are compared with actual labels.

8. Visualizing results
Accuracy and loss curves are plotted using Matplotlib. A confusion matrix and classification report are generated.

9. Saving and loading the model
The trained model is saved (model.save()) for future use. It can be loaded later using keras.models.load_model().
In [1]:

import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
import matplotlib.pyplot as plt

In [2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz

11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 8s 1us/step

In [3]:
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

In [4]:
gray_scale = 255
x_train /= gray_scale
x_test /= gray_scale

In [5]:
print("Feature matrix:", x_train.shape)
print("Target matrix:", x_test.shape)
print("Feature matrix:", y_train.shape)
print("Target matrix:", y_test.shape)

Feature matrix: (60000, 28, 28)


Target matrix: (10000, 28, 28)
Feature matrix: (60000,)
Target matrix: (10000,)

In [6]:
fig, ax = plt.subplots(10, 10)
k = 0
for i in range(10):
    for j in range(10):
        ax[i][j].imshow(x_train[k].reshape(28, 28), aspect='auto')
        k += 1
plt.show()
In [7]:
model = Sequential([
Flatten(input_shape=(28, 28)),
Dense(256, activation='sigmoid'),
Dense(128, activation='sigmoid'),
Dense(10, activation='sigmoid'),
])

C:\Users\reddy\anaconda3\Lib\site-packages\keras\src\layers\reshaping\flatten.py:37: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)

In [8]:
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])

In [9]:
model.fit(x_train, y_train, epochs=10,
batch_size=2000,
validation_split=0.2)

Epoch 1/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 2s 39ms/step - accuracy: 0.2255 - loss: 2.2490 - val_accuracy: 0.6957 - val_loss: 1.7442
Epoch 2/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - accuracy: 0.6951 - loss: 1.5704 - val_accuracy: 0.7732 - val_loss: 1.0563
Epoch 3/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.7818 - val_accuracy: 0.8565 - val_loss: 0.6860
Epoch 4/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.8551 - val_accuracy: 0.8866 - val_loss: 0.5025
Epoch 5/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.8828 - val_accuracy: 0.9003 - val_loss: 0.4083
Epoch 6/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8973 - val_accuracy: 0.9079 - val_loss: 0.3549
Epoch 7/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.9047 - val_accuracy: 0.9147 - val_loss: 0.3211
Epoch 8/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.9108 - val_accuracy: 0.9187 - val_loss: 0.2974
Epoch 9/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 26ms/step - accuracy: 0.9181 - val_accuracy: 0.9242 - val_loss: 0.2780
Epoch 10/10
24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 27ms/step - accuracy: 0.9222 - val_accuracy: 0.9267 - val_loss: 0.2628
Out[10]:
<keras.src.callbacks.history.History at 0x24bb73fcec0>

In [24]:
results = model.evaluate(x_test, y_test, verbose=0)
print('test loss, test acc:', results)

test loss, test acc: [0.2701019048690796, 0.9233999848365784]

In [ ]:
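
Steps 7-9 of the process above (making predictions, visualizing results, and saving the model) are not shown in the notebook. A minimal sketch of those steps, assuming the model, x_test, and y_test objects defined above (the file name mnist_mlp.keras is a hypothetical choice):

import numpy as np
import matplotlib.pyplot as plt

# Step 7: predict class probabilities for the test set and take the argmax as the label
probs = model.predict(x_test)
pred_labels = np.argmax(probs, axis=1)
print("First 10 predicted labels:", pred_labels[:10])
print("First 10 actual labels:   ", y_test[:10])

# Step 8: visualize one test image together with its predicted and actual label
plt.imshow(x_test[0].reshape(28, 28), cmap='gray')
plt.title(f"Predicted: {pred_labels[0]}, Actual: {y_test[0]}")
plt.show()

# Step 9: save the trained model and load it back later
model.save("mnist_mlp.keras")
reloaded = tf.keras.models.load_model("mnist_mlp.keras")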
EXPERIMENT:2

DESIGN A NEURAL NETWORK FOR CLASSIFYING MOVIE REVIEWS (BINARY CLASSIFICATION) USING IMDB DATASET.

AIM: Design a neural network for classifying movie reviews (Binary Classification) using IMDB dataset.

Description:
• The IMDB dataset is a popular benchmark dataset for binary sentiment classification, where movie reviews are categorized as either positive (1) or negative (0). It is widely utilized in natural language processing (NLP) tasks, particularly for sentiment analysis.
• Dataset Overview:
  o Total Size: 50,000 movie reviews
  o Training Set: 25,000 reviews
  o Test Set: 25,000 reviews
  Labels:
  o 1 → Positive review
  o 0 → Negative review
This dataset serves as a fundamental resource for training and evaluating machine learning models in sentiment classification.
Process:
STEP 1: Loading the dataset
STEP 2: Decoding the reviews
STEP 3: Padding the examples
STEP 4: Creating and training the model
STEP 5: Predictions and evaluation

We will be using:
• TensorFlow
• Keras
• IMDB dataset

TensorFlow: An open-source machine learning framework developed by Google for deep learning and numerical computation.
Keras: A high-level neural networks API, built on top of TensorFlow, that simplifies deep learning model development.
IMDB Dataset: A dataset of 50,000 movie reviews labeled as positive or negative, commonly used for sentiment analysis.

[1]: from keras.datasets import imdb

[2]: # Load the data, keeping only 10,000 of the most frequently occurring words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 60s 3us/step

[3]:
# Since we restricted ourselves to the top 10000 frequent words, no word index should exceed 10000
# we'll verify this below

# Here is a list of maximum indexes in every review --- we search the maximum index in this list of max indexes
print(type([max(sequence) for sequence in train_data]))

# Find the maximum of all max indexes
max([max(sequence) for sequence in train_data])

<class 'list'>
[3]: 9999
[4]: # Let's quickly decode a review

# step 1: load the dictionary mappings from word to integer index
word_index = imdb.get_word_index()

# step 2: reverse word index to map integer indexes to their respective words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# step 3: decode the review, mapping integer indices to words
#
# indices are off by 3 because 0, 1, and 2 are reserved indices for "padding", "start of sequence" and "unknown"
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
STEP 3:

STEP 4:

STEP 5:

STEP 6:
PROGRAM:
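
The original program and output for this experiment were included as screenshots that are not reproduced here. A minimal sketch of the binary classifier described above, assuming the train_data/test_data loaded earlier with a 10,000-word vocabulary (the layer sizes and 4-epoch schedule are illustrative choices, not the manual's exact settings):

import numpy as np
from tensorflow.keras import models, layers

# Multi-hot encode each review into a 10,000-dimensional 0/1 vector
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

# Two hidden ReLU layers and a sigmoid output for binary classification
model = models.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# Train with a validation split, then evaluate on the held-out test set
model.fit(x_train, y_train, epochs=4, batch_size=512, validation_split=0.2)
print(model.evaluate(x_test, y_test))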

OUTPUT:

EXPERIMENT:3

Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate
data set for building the decision tree and apply this knowledge to classify a new sample.

AIM:

Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate
data set for building the decision tree and apply this knowledge to classify a new sample.

Description:

The ID3 (Iterative Dichotomiser 3) algorithm builds a decision tree top-down: at each node it selects the attribute with the highest information gain (the largest reduction in entropy), splits the data on that attribute, and recurses until the examples in a node all share the same class. The resulting tree can then classify a new sample by following the branches that match its attribute values.
Training Dataset:

Test Dataset:

Program:
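
The training dataset, test dataset, program, and output for this experiment were provided as screenshots that are not reproduced here. A minimal sketch using a hypothetical PlayTennis-style dataset and scikit-learn's DecisionTreeClassifier with the entropy criterion (information gain) as a stand-in for a hand-written ID3:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Classic PlayTennis-style dataset (illustrative values only)
data = pd.DataFrame({
    'Outlook': ['Sunny','Sunny','Overcast','Rain','Rain','Rain','Overcast','Sunny','Sunny','Rain','Sunny','Overcast','Overcast','Rain'],
    'Temperature': ['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild'],
    'Humidity': ['High','High','High','High','Normal','Normal','Normal','High','Normal','Normal','Normal','High','Normal','High'],
    'Wind': ['Weak','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Strong'],
    'PlayTennis': ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No'],
})

# Encode categorical attributes as integers so scikit-learn can handle them
encoders = {col: LabelEncoder().fit(data[col]) for col in data.columns}
X = pd.DataFrame({col: encoders[col].transform(data[col]) for col in data.columns[:-1]})
y = encoders['PlayTennis'].transform(data['PlayTennis'])

# criterion='entropy' makes the splits information-gain based, as in ID3
tree = DecisionTreeClassifier(criterion='entropy')
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# Classify a new sample: Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong
sample = pd.DataFrame([{col: encoders[col].transform([val])[0]
                        for col, val in zip(X.columns, ['Sunny', 'Cool', 'High', 'Strong'])}])
print(encoders['PlayTennis'].inverse_transform(tree.predict(sample)))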

OUTPUT:

EXPERIMENT:4

Exercises to solve the real-world problems using the following machine learning methods: a) Linear
Regression b) Logistic Regression c) Binary Classifier

AIM:

Exercises to solve the real-world problems using the following machine learning methods: a) Linear
Regression b) Logistic Regression c) Binary Classifier

Description:
Linear Regression:

Simple Linear Regression is a type of Regression algorithms that models the relationship
between a dependent variable and a single independent variable. The relationship shown by
a Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple
Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on continuous
or categorical values.

Here we are taking a dataset that has two variables: salary (dependent variable) and experience (independent variable). The goals of this problem are:

• We want to find out if there is any correlation between these two variables.
• We will find the best-fit line for the dataset.
• We will see how the dependent variable changes as the independent variable changes.

Procedure:

Step 1: Data Pre-processing
Step 2: Fitting the Simple Linear Regression to the Training Set
Step 3: Prediction of test set results
Step 4: Visualizing the Training set results
Step 5: Visualizing the Test set results

Program:

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

import os

os.getcwd()

os.chdir("/content/drive/MyDrive/Datasets")

data_set= pd.read_csv('Salary_Data.csv')

data_set

x= data_set.iloc[:, :-1].values

y= data_set.iloc[:, 1].values

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)

x_test

y_test

x_train

y_train

from sklearn.linear_model import LinearRegression

regressor= LinearRegression()

regressor.fit(x_train, y_train)

y_pred= regressor.predict(x_test)

x_pred= regressor.predict(x_train)

plt.scatter(x_train, y_train, color="green")

plt.plot(x_train, x_pred, color="red")

plt.title("Salary vs Experience (Training Dataset)")

plt.xlabel("Years of Experience")

plt.ylabel("Salary(In Rupees)")

plt.show()

OUTPUT:

plt.scatter(x_test, y_test, color="blue")

plt.plot(x_train, x_pred, color="red")
plt.title("Salary vs Experience (Test Dataset)")

plt.xlabel("Years of Experience")

plt.ylabel("Salary(In Rupees)")

plt.show()

OUTPUT:

Logistic Regression:

• Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 and 1, it gives probabilistic values which lie between 0 and 1.
• Logistic Regression is similar to Linear Regression except in how the two are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
Logistic Function (Sigmoid Function):

f(x) = 1 / (1 + e^(-x))

The sigmoid maps any real value into the range (0, 1); values above a chosen threshold (e.g., 0.5) are classified as 1, and values below it as 0.
Logistic Regression Equation:
o We know the equation of a straight line can be written as:

    y = b0 + b1*x1 + b2*x2 + ... + bn*xn

o In Logistic Regression y can be between 0 and 1 only, so we divide the above equation by (1 - y):

    y / (1 - y)    (0 for y = 0, infinity for y = 1)

o But we need a range between -infinity and +infinity, so taking the logarithm of the equation gives:

    log[y / (1 - y)] = b0 + b1*x1 + b2*x2 + ... + bn*xn

The above equation is the final equation for Logistic Regression.

Example: There is a dataset that contains information about various users obtained from social networking sites. A car-making company has recently launched a new SUV, and the company wants to check how many users from the dataset want to purchase the car.

For this problem, we will build a Machine Learning model using the Logistic Regression algorithm. The dataset is shown in the image below. In this problem, we will predict the purchased variable (dependent variable) using age and salary (independent variables).

NOTE: In logistic regression, we perform feature scaling because we want accurate prediction results. Here we only scale the independent variables, because the dependent variable has only 0 and 1 values. Below is the code for it:

PROGRAM:

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

import os

os.getcwd()

os.chdir("/content/drive/MyDrive/Datasets")

data_set= pd.read_csv('car_data.csv')

data_set

x= data_set.iloc[:, [2,3]].values

y= data_set.iloc[:, 4].values

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

x_test

y_test

x_train
y_train

from sklearn.preprocessing import StandardScaler

st_x= StandardScaler()

x_train= st_x.fit_transform(x_train)

x_test= st_x.transform(x_test)

from sklearn.linear_model import LogisticRegression

classifier= LogisticRegression(random_state=0)

classifier.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,

intercept_scaling=1, l1_ratio=None, max_iter=100,

multi_class='warn', n_jobs=None, penalty='l2',

random_state=0, solver='warn', tol=0.0001, verbose=0,

warm_start=False)

y_pred= classifier.predict(x_test)

from sklearn.metrics import confusion_matrix

cm= confusion_matrix(y_test,y_pred)

from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train

x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),

np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))

plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),

alpha = 0.75, cmap = ListedColormap(('purple','green' )))

plt.xlim(x1.min(), x1.max())

plt.ylim(x2.min(), x2.max())

for i, j in enumerate(np.unique(y_set)):

    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)

plt.title('Logistic Regression (Training set)')

plt.xlabel('Age')

plt.ylabel('Estimated Salary')

plt.legend()

plt.show()

from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train

x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),

np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))

plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),

alpha = 0.75, cmap = ListedColormap(('purple','green' )))

plt.xlim(x1.min(), x1.max())

plt.ylim(x2.min(), x2.max())

for i, j in enumerate(np.unique(y_set)):

    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)

plt.title('Logistic Regression (Training set)')

plt.xlabel('Age')

plt.ylabel('Estimated Salary')

plt.legend()

plt.show()

OUTPUT

from matplotlib.colors import ListedColormap

x_set, y_set = x_test, y_test

x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),

np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))

plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),

alpha = 0.75, cmap = ListedColormap(('purple','green' )))

plt.xlim(x1.min(), x1.max())

plt.ylim(x2.min(), x2.max())

for i, j in enumerate(np.unique(y_set)):

    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)

plt.title('Logistic Regression (Test set)')


plt.xlabel('Age')

plt.ylabel('Estimated Salary')

plt.legend()

plt.show()

Binary Classifier:

• A Classifier in Machine Learning is an algorithm that determines the class to which the input data belongs, based on a set of features.
• A Binary Classifier is an instance of Supervised Learning. In Supervised Learning we have a set of input data and a set of labels, and our task is to map each data point to a label. A Binary Classifier classifies elements into two groups, either Zero or One.

Types of Classification
Classification is of two types:
1. Binary Classification: When we have to categorize given data into 2 distinct classes. Example – on the basis of the given health conditions of a person, we have to determine whether the person has a certain disease or not.
2. Multiclass Classification: The number of classes is more than 2. For example – on the basis of data about different species of flowers, we have to determine which species our observation belongs to.

Examples of Binary Classification:

• Email spam detection (spam or not).
• Churn prediction (churn or not).
• Conversion prediction (buy or not).

Popular algorithms that can be used for binary classification include:

• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes

Multi-Class Classification
Multi-class classification refers to those classification tasks that have more than two class labels. Examples include:

• Face classification.
• Plant species classification.
• Optical character recognition.

Popular algorithms that can be used for multi-class classification include:

• k-Nearest Neighbors.
• Decision Trees.
• Naive Bayes.
• Random Forest.
• Gradient Boosting.
PROGRAM:

from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot

# define dataset
X, y = make_blobs(n_samples=1000, centers=2, random_state=1)

# summarize dataset shape
print(X.shape, y.shape)

# summarize observations by class label
counter = Counter(y)
print(counter)

# summarize first few examples
for i in range(10):
    print(X[i], y[i])

# plot the dataset and color by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

OUTPUT:

EXPERIMENT:5

Develop a program for Bias, Variance, Remove duplicates, Cross Validation

AIM:

Develop a program for Bias, Variance, Remove duplicates, Cross Validation

Description:

Bias:

In machine learning, bias refers to the difference between the predictions made by a learning
algorithm and the true values of the target variable. It measures the systematic error or the
tendency of a model to consistently underfit or overfit the data.

Variance:

In machine learning, variance refers to the variability or instability of a model's predictions


when trained on different subsets of the training data. It measures the sensitivity of the
model to the randomness in the training data.

Cross Validation:

Cross-validation is a technique used in machine learning to assess the performance and


generalization ability of a model. It involves partitioning the available data into multiple
subsets, called folds, and iteratively training and evaluating the model on different
combinations of these folds.

PROGRAM:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate sample data


np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X + np.random.randn(100).reshape(-1, 1)

# Add duplicate samples
X = np.vstack((X, X[:10]))
y = np.vstack((y, y[:10]))

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Calculate the training and testing errors (bias and variance)
y_train_pred = model.predict(X_train)
train_error = mean_squared_error(y_train, y_train_pred)
y_test_pred = model.predict(X_test)
test_error = mean_squared_error(y_test, y_test_pred)

print("Training error (bias):", train_error)
print("Testing error (variance):", test_error)

# Remove duplicate samples
X_unique, indices = np.unique(X, axis=0, return_index=True)
y_unique = y[indices]

# Perform cross-validation
cross_val_errors = []
for i in range(5):
    X_train, X_val, y_train, y_val = train_test_split(X_unique, y_unique, test_size=0.2, random_state=i)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_val_pred = model.predict(X_val)
    val_error = mean_squared_error(y_val, y_val_pred)
    cross_val_errors.append(val_error)

print("Cross-validation errors:", cross_val_errors)


print("Average cross-validation error:", np.mean(cross_val_errors))

OUTPUT:

Training error (bias): 0.9087396281299075


Testing error (variance): 0.4189781106088322
Cross-validation errors: [0.8943963399542353, 0.711838267559308,
0.9664698336127481, 0.9156270854451775, 1.1754593805313762]
Average cross-validation error: 0.932758181420569

EXPERIMENT:6

Write a program to implement Categorical Encoding, One-hot Encoding

AIM:

Write a program to implement Categorical Encoding, One-hot Encoding

Description:

Categorical encoding:
Categorical encoding is a process of converting categorical variables (features) into numerical
representations that machine learning algorithms can understand. Categorical variables are
variables that represent discrete categories or groups, such as color, country, or product type.

There are several common methods for categorical encoding:

1. Label Encoding:
• Assigns a unique numerical label to each category in the variable.
• Useful for ordinal variables where the categories have an inherent order.
• Implemented using the LabelEncoder class in scikit-learn.

2. One-Hot Encoding:
• Creates binary columns for each category and represents the presence or absence of a category using 1s and 0s.
• Suitable for nominal variables where there is no inherent order.
• Implemented using the OneHotEncoder class in scikit-learn.

3. Ordinal Encoding:
• Assigns a numerical value to each category based on a predefined order or mapping.
• Useful when the categories have an order, but the numerical difference between them may not be meaningful.
• Can be implemented using mapping dictionaries or custom encoding functions.

4. Frequency Encoding:
• Replaces each category with its frequency or occurrence in the dataset.
• Useful when the frequency of a category may be informative for the model.
• Can be implemented using pandas' value_counts function or custom encoding functions.

5. Target Encoding:
• Replaces each category with the mean or median of the target variable for that category.
• Useful when the relationship between the category and the target variable is important.
• Requires careful handling to avoid leakage and overfitting.

6. Binary Encoding:
• Represents each category with binary codes.
• Useful for high-cardinality categorical variables (variables with many unique categories).
• Implemented using libraries like category_encoders or custom encoding functions.
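
The program below demonstrates label encoding and one-hot encoding only. A minimal sketch of ordinal and frequency encoding (methods 3 and 4 above), using a hypothetical 'Size' column, is given here for completeness:

import pandas as pd

df = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small', 'Large', 'Large']})

# Ordinal encoding: map categories to integers using a predefined order
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
df['Size_Ordinal'] = df['Size'].map(size_order)

# Frequency encoding: replace each category with its occurrence count in the column
freq = df['Size'].value_counts()
df['Size_Frequency'] = df['Size'].map(freq)

print(df)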

PROGRAM:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Sample dataset
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']}
df = pd.DataFrame(data)

# Categorical Encoding
label_encoder = LabelEncoder()
df['Color_Encoded'] = label_encoder.fit_transform(df['Color'])

# One-Hot Encoding
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded = onehot_encoder.fit_transform(df[['Color_Encoded']])
onehot_df = pd.DataFrame(onehot_encoded, columns=label_encoder.classes_)

# Print the original and encoded dataframes


print("Original DataFrame:")
print(df)
print("\nCategorical Encoded DataFrame:")
print(df[['Color', 'Color_Encoded']])
print("\nOne-Hot Encoded DataFrame:")
print(onehot_df)

OUTPUT:

EXPERIMENT:7

Build an Artificial Neural Network by implementing the Back propagation algorithm and test the same using
appropriate data sets.

AIM:

Build an Artificial Neural Network by implementing the Back propagation algorithm and test the same using
appropriate data sets.
Description:
The backpropagation algorithm is a widely used method for training artificial neural
networks (ANNs). It allows the network to learn from labeled training data by
iteratively adjusting the weights and biases of the network's connections to minimize
the error between predicted and actual outputs. Here is a step-by-step explanation of
the backpropagation algorithm:
1. Initialize the network:
• Define the network architecture, including the number of layers, neurons per layer, and activation functions.
• Randomly initialize the weights and biases for each connection in the network.
2. Forward propagation:
• Input an instance of training data to the network.
• Calculate the weighted sum of inputs and biases for each neuron in each layer.
• Apply the activation function to obtain the output of each neuron.
• Pass the outputs forward to the next layer until reaching the output layer.
• Compare the network's output with the actual output and calculate the error.
3. Backward propagation:
• Calculate the gradient of the error with respect to the weights and biases of the output layer.
• Update the weights and biases of the output layer using the gradient and a learning rate.
• Calculate the gradients for the previous layers using the chain rule.
• Update the weights and biases of the previous layers using the gradients and the learning rate.
• Repeat the above steps for all instances in the training dataset.
4. Repeat the steps above:
• Repeat steps 2 and 3 for a specified number of epochs or until the network reaches a satisfactory level of accuracy.
• Adjust the learning rate and other hyperparameters if necessary.
• Monitor the training progress by observing the decrease in the error over epochs.
5. Evaluate the trained network:
• Once training is complete, evaluate the performance of the trained network using validation or test data.
• Input new instances of data to the network and observe the predicted outputs.
• Calculate metrics such as accuracy, precision, recall, or others depending on the problem.

PROGRAM:

import tensorflow as tf
import numpy as np

# Define the dataset


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# Define the architecture of the neural network


n_input = 2
n_hidden = 2
n_output = 1

# Define the weights and biases as TensorFlow variables


weights = {
'hidden': tf.Variable(tf.random.normal([n_input, n_hidden])),
'output': tf.Variable(tf.random.normal([n_hidden, n_output]))
}

biases = {
'hidden': tf.Variable(tf.random.normal([n_hidden])),
'output': tf.Variable(tf.random.normal([n_output]))
}

# Define the forward pass


def forward_propagation(x):
    hidden_layer = tf.sigmoid(tf.add(tf.matmul(x, weights['hidden']), biases['hidden']))
    output_layer = tf.sigmoid(tf.add(tf.matmul(hidden_layer, weights['output']), biases['output']))
    return output_layer

# Define the backpropagation algorithm


def backpropagation(x, y):
    with tf.GradientTape() as tape:
        output_layer = forward_propagation(x)
        loss = tf.reduce_mean(0.5 * (y - output_layer) ** 2)

    gradients = tape.gradient(loss, [weights['hidden'], weights['output'], biases['hidden'], biases['output']])
    optimizer.apply_gradients(zip(gradients, [weights['hidden'], weights['output'], biases['hidden'], biases['output']]))

# Define the training loop


optimizer = tf.optimizers.SGD(learning_rate=0.1)
epochs = 10000

for epoch in range(epochs):
    backpropagation(X, y)
    if epoch % 1000 == 0:
        output = forward_propagation(X)
        loss = tf.reduce_mean(0.5 * (y - output) ** 2)
        print(f"Epoch: {epoch}, Loss: {loss}")
# Test the trained model

predictions = forward_propagation(X)
print("Predictions:")
print(predictions.numpy().round())

OUTPUT:

Epoch: 0, Loss: 0.13507995009422302


Epoch: 1000, Loss: 0.1251978725194931
Epoch: 2000, Loss: 0.12484432756900787
Epoch: 3000, Loss: 0.12444967031478882
Epoch: 4000, Loss: 0.12391756474971771
Epoch: 5000, Loss: 0.12311005592346191
Epoch: 6000, Loss: 0.12179925292730331
Epoch: 7000, Loss: 0.11967145651578903
Epoch: 8000, Loss: 0.11657140403985977
Epoch: 9000, Loss: 0.11282829940319061
Predictions:
[[0.]
[1.]
[0.]
[0.]]

EXPERIMENT:8

Write a program to implement k-Nearest Neighbor algorithm to classify the iris data set. Print both correct
and wrong predictions.

AIM:

Write a program to implement k-Nearest Neighbor algorithm to classify the iris data set. Print both correct
and wrong predictions.
Description:

The k-Nearest Neighbor (k-NN) algorithm is a popular supervised machine learning algorithm
used for both classification and regression tasks. It is a non-parametric method that makes
predictions based on the similarity between input data points.

Process:

1. Data Preparation: Gather a labeled training dataset consisting of input


feature vectors and their corresponding class labels or target values.
2. Choose a Distance Metric: Select an appropriate distance metric to measure
the similarity or dissimilarity between data points. Commonly used distance
metrics include Euclidean distance, Manhattan distance, or cosine similarity.
3. Choose the Value of k: Determine the value of k, which represents the
number of nearest neighbors to consider during prediction. The optimal value
of k can be determined using cross-validation or other model evaluation
techniques.
4. Compute Distances: Calculate the distance between the new input data
point and all the training data points in the feature space, using the chosen
distance metric.
5. Find k Nearest Neighbors: Select the k data points with the shortest
distances to the new input data point.
6. Make Predictions: For classification problems, assign the class label that
appears most frequently among the k nearest neighbors as the predicted class
for the new data point. For regression problems, calculate the average or
weighted average of the target values of the k nearest neighbors as the
predicted value for the new data point.
7. Evaluate Performance: Assess the performance of the k-NN model using
appropriate evaluation metrics such as accuracy, precision, recall, or mean
squared error, depending on the problem type.
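
Step 3 above mentions choosing k by cross-validation; the program below simply fixes k = 5. A minimal sketch of selecting k with scikit-learn's cross_val_score on the Iris data (not part of the original manual):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate candidate values of k with 5-fold cross-validation and keep the best
scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best k:", best_k, "with mean CV accuracy:", round(scores[best_k], 3))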

PROGRAM:

import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier


from sklearn.model_selection import train_test_split
from sklearn import metrics

import os
os.getcwd()

os.chdir("/content/drive/MyDrive/Datasets")

# Read dataset to pandas dataframe


names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv("iris_data.csv", names=names)

X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

X.head()
y.head()

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.10)

classifier = KNeighborsClassifier(n_neighbors=5).fit(Xtrain, ytrain)


ypred = classifier.predict(Xtest)

i=0
print ("\n ")
print ('%-25s %-25s %-25s' % ('Original Label', 'Predicted Label',
'Correct/Wrong'))
print (" ")
for label in ytest:
    print ('%-25s %-25s' % (label, ypred[i]), end="")
    if (label == ypred[i]):
        print (' %-25s' % ('Correct'))
    else:
        print (' %-25s' % ('Wrong'))
    i = i + 1
print (" ")
print("\nConfusion Matrix:\n",metrics.confusion_matrix(ytest, ypred))
print (" ")
print("\nClassification Report:\n",metrics.classification_report(ytest, ypred))
print (" ")
print('Accuracy of the classifer is %0.2f' % metrics.accuracy_score(ytest,ypred))
print (" ")

OUTPUT:
-------------------------------------------------------------------------
Original Label Predicted Label Correct/Wrong
-------------------------------------------------------------------------
Iris-versicolor           Iris-versicolor           Correct
Iris-setosa               Iris-setosa               Correct
Iris-versicolor           Iris-versicolor           Correct
Iris-virginica            Iris-virginica            Correct
Iris-versicolor           Iris-versicolor           Correct
Iris-virginica            Iris-virginica            Correct
Iris-versicolor           Iris-versicolor           Correct
Iris-versicolor           Iris-versicolor           Correct
Iris-virginica            Iris-virginica            Correct
Iris-setosa               Iris-setosa               Correct
Iris-virginica            Iris-virginica            Correct
Iris-versicolor           Iris-versicolor           Correct
Iris-versicolor           Iris-versicolor           Correct
Iris-virginica            Iris-virginica            Correct
Iris-setosa               Iris-setosa               Correct
-------------------------------------------------------------------------

Confusion Matrix:
[[3 0 0]
[0 7 0]
[0 0 5]]
...
-------------------------------------------------------------------------
Accuracy of the classifer is 1.00

EXPERIMENT:9

Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select
appropriate data set for your experiment and draw graphs.

AIM:

Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select
appropriate data set for your experiment and draw graphs.

Description:

The Locally Weighted Regression (LWR) algorithm is a non-parametric regression method


that aims to model the relationship between the input features and the target variable.
Unlike traditional regression algorithms, LWR assigns weights to the training data points
based on their proximity to the query point during prediction.

PROCESS:

1. Data Preparation: Gather a labeled training dataset consisting of input


feature vectors and their corresponding target values.
2. Choose a Kernel Function: Select a kernel function that assigns weights to
the training data points based on their proximity to the query point. Commonly
used kernel functions include Gaussian kernel, Epanechnikov kernel, and
triangular kernel.
3. Choose the Value of the Bandwidth Parameter: Determine the value of
the bandwidth parameter, which controls the width of the kernel and thus the
influence of nearby data points on the prediction. Smaller bandwidth values
give more weight to closer points, while larger bandwidth values consider
points farther away.
4. Compute Weights: For each query point, calculate the weights for all
training data points based on their distances to the query point, using the
selected kernel function and bandwidth parameter.
5. Fit Local Models: For each query point, fit a local regression model using the
weighted data points. This can be done by minimizing a weighted least squares
cost function, such as ordinary least squares or ridge regression.
6. Make Predictions: Once the local models are fitted, use them to predict the
target variable for new query points by applying the learned local regression
functions.
7. Evaluate Performance: Assess the performance of the LWR model using
appropriate evaluation metrics, such as mean squared error or R-squared,
depending on the problem type.

PROGRAM:

import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x, xi, tau):
    return np.exp((x - xi)**2 / (-2 * tau**2))

def locally_weighted_regression(X_train, y_train, x_query, tau):
    m = len(X_train)
    X = np.column_stack((np.ones(m), X_train))
    W = np.diag([gaussian_kernel(x_query, xi, tau) for xi in X_train])
    theta = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ y_train
    return theta

# Prepare the dataset


X_train = np.array([1, 2, 3, 4, 5, 6])
y_train = np.array([1, 3, 2, 5, 4, 6])

# Choose query points for prediction


x_query = np.linspace(0, 7, 100)

# Choose tau
tau = 0.5

# Perform locally weighted regression for each query point


y_pred = []
for x in x_query:
    theta = locally_weighted_regression(X_train, y_train, x, tau)
    y_pred.append(theta[0] + theta[1] * x)

# Plot the original data points and the fitted curve


plt.scatter(X_train, y_train, color='blue', label='Original Data')
plt.plot(x_query, y_pred, color='red', label='Locally Weighted Regression')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Locally Weighted Regression')
plt.legend()
plt.grid(True)
plt.show()

OUTPUT:

EXPERIMENT:10

Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform
this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and
recall for your data set.

AIM:

Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform
this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and
recall for your data set.
Description:

1. Data Preparation: Prepare your dataset of labeled documents. Each


document should be represented as a feature vector, and each vector should
be associated with a class label.
2. Data Preprocessing: Perform any necessary preprocessing steps such as
tokenization, removing stop words, and stemming to clean and normalize
the text data.
3. Feature Extraction: Convert the preprocessed text data into numerical
features that can be used by the Naive Bayes Classifier. One common approach
is to use the bag-of-words model, where each document is represented by a
vector indicating the presence or absence of each word in a predefined
vocabulary.
4. Train the Naive Bayes Classifier: Use the training set to estimate the
class prior probabilities and likelihood probabilities based on the feature
vectors. You can use Java's built-in classes or external libraries such as
Apache OpenNLP or Weka for this task.
5. Test the Classifier: Apply the trained Naive Bayes Classifier to classify the
documents in the test set. Compare the predicted class labels with the ground
truth labels to evaluate the accuracy, precision, and recall.
6. Calculate Evaluation Metrics: Compute the evaluation metrics using the
predicted labels and the ground truth labels. Accuracy measures the overall
correctness of the classifier's predictions. Precision measures the proportion
of true positives out of all predicted positives, while recall measures the
proportion of true positives out of all actual positives.

PROGRAM:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.Evaluation;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
public class NaiveBayesClassifierExample {
public static void main(String[] args) throws Exception {
// Step 1: Read the dataset file
BufferedReader reader = new BufferedReader(new FileReader("dataset.arff"));

Instances dataset = new Instances(reader);
reader.close();

// Step 2: Set the class attribute


dataset.setClassIndex(dataset.numAttributes() - 1);

// Step 3: Train the Naive Bayes Classifier


NaiveBayes classifier = new NaiveBayes();
classifier.buildClassifier(dataset);

// Step 4: Evaluate the Classifier using cross-validation


Evaluation evaluation = new Evaluation(dataset);
evaluation.crossValidateModel(classifier, dataset, 10, new java.util.Random(1));

// Step 5: Calculate evaluation metrics


double accuracy = evaluation.pctCorrect();
double precision = evaluation.weightedPrecision();
double recall = evaluation.weightedRecall();

// Step 6: Print evaluation metrics


System.out.println("Accuracy: " + accuracy);
System.out.println("Precision: " + precision);
System.out.println("Recall: " + recall);
}
}
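
The description above walks through preprocessing, bag-of-words feature extraction, and evaluation, which the Weka program does not show explicitly. A minimal Python sketch of the same pipeline with scikit-learn, using a tiny set of hypothetical labeled documents (this is an alternative illustration, not the manual's Java/Weka program):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labeled documents: 1 = positive, 0 = negative
docs = ["I loved this movie", "great acting and story", "what a wonderful film",
        "terrible plot and acting", "I hated this movie", "boring and very poor film"]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words feature extraction
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=1)

# Train the Naive Bayes classifier and compute accuracy, precision, and recall
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred, zero_division=0))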

EXPERIMENT:11

Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for clustering using k-Means
algorithm. Compare the results of these two algorithms and comment on the quality of clustering. You can
add Java/Python ML library classes/API in the program.

AIM:

Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for clustering using k-Means
algorithm. Compare the results of these two algorithms and comment on the quality of clustering. You can
add Java/Python ML library classes/API in the program.

Description:

The Expectation-Maximization (EM) algorithm is an iterative method used to estimate


the parameters of probabilistic models when dealing with missing or incomplete data. It is
widely used in various fields, including statistics, machine learning, and data clustering. The
EM algorithm seeks to find the maximum likelihood estimates of the model parameters by
iteratively updating the estimates based on the expected values of the missing data.

The k-Means algorithm is an iterative clustering algorithm used to partition a dataset


into k distinct clusters. It is one of the most popular and widely used clustering
algorithms due to its simplicity and efficiency. The goal of the k-Means algorithm is to
minimize the within-cluster variance by iteratively assigning data points to the nearest
cluster centroid and updating the centroid positions.
PROGRAM:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import cluster
import os

os.chdir("/content/drive/MyDrive/Bhavani")

df=pd.read_csv("heart.csv")

df.head()

X=df.drop(columns='target')

#Elbow Method
Sum_of_squared_distances = []
K = range(1,10)
for num_clusters in K:
    kmeans_model = cluster.KMeans(n_clusters=num_clusters)
    kmeans_model.fit(X[["trestbps","chol"]])
    Sum_of_squared_distances.append(kmeans_model.inertia_)
plt.plot(K,Sum_of_squared_distances,'bx-')
plt.xlabel('Values of K')
plt.ylabel('Sum of squared distances/Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()

kmeans=cluster.KMeans(n_clusters=2,init='k-means++')

#Considering Two attributes trestbps and chol


kmeans=kmeans.fit(X[["trestbps","chol"]])

X['Clusters']=kmeans.labels_

#Centroids
centers = np.array(kmeans.cluster_centers_)
centers

#Red Square Indicates Centroids


sns.scatterplot(x="trestbps",y="chol",hue="Clusters",data=X)
plt.scatter(centers[:,0], centers[:,1], marker="s", color='r')
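
The program above covers only the k-Means part of the aim. A minimal sketch of EM-based clustering on the same two attributes, using scikit-learn's GaussianMixture (an assumption; the original manual does not show this step), so the two sets of cluster labels can be compared:

from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# EM clustering with a 2-component Gaussian mixture on the same two attributes
gmm = GaussianMixture(n_components=2, random_state=0)
em_labels = gmm.fit_predict(X[["trestbps", "chol"]])

# Compare the EM clusters against the k-Means clusters found above
print("Agreement between EM and k-Means (adjusted Rand index):",
      adjusted_rand_score(X['Clusters'], em_labels))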

OUTPUT:

EXPERIMENT:12

Exploratory Data Analysis for Classification using Pandas or Matplotlib.

AIM:

Exploratory Data Analysis for Classification using Pandas or Matplotlib.

Description:
Exploratory Data Analysis (EDA) is an important step in the data analysis process. It
involves examining and visualizing the data to gain insights, identify patterns, and
understand the characteristics of the dataset. EDA helps in discovering relationships between
variables, detecting outliers, and preparing the data for further analysis or modeling. Here's a
general framework for performing EDA:

PROCESS:

1. Load the Data: Start by loading your dataset into a suitable data structure,
such as a Pandas DataFrame in Python.
2. Data Summary: Get an overview of the dataset by examining the dimensions, data types, and basic statistics of the variables. Some useful functions for this include shape, head, info, describe, and dtypes.
3. Missing Values: Check for missing values in the dataset and decide on an appropriate strategy for handling them. Use functions like isnull, isna, fillna, or dropna to handle missing data.
4. Data Visualization: Create visualizations to explore the distribution,
relationships, and patterns in the data. Some common plots include
histograms, box plots, scatter plots, bar plots, and correlation matrices.
Matplotlib, Seaborn, and Plotly are popular libraries for data visualization in
Python.
5. Univariate Analysis: Analyze each variable individually to understand its
distribution and characteristics. This can involve examining frequency counts,
summary statistics, histograms, or box plots for numerical variables, and bar
plots or pie charts for categorical variables.
6. Bivariate Analysis: Explore relationships between pairs of variables to identify
correlations, dependencies, or associations. Scatter plots, line plots, or
heatmaps can be used to visualize these relationships. Statistical tests such as
correlation coefficients or t-tests can also provide insights into the relationships.
7. Multivariate Analysis: Analyze the relationships among multiple variables
simultaneously. This can involve visualizations like pair plots, parallel
coordinates plots, or heatmaps to observe patterns and interactions.
8. Outlier Detection: Identify potential outliers in the dataset that deviate
significantly from the rest of the data. Box plots, scatter plots, or statistical
methods like z-scores or the IQR (interquartile range) can be used for outlier
detection.
9. Feature Engineering: Explore opportunities for creating new features or
transforming existing ones to improve the predictive power of your data.
This can involve scaling, normalization, encoding categorical variables, or
creating interaction variables.
10. Data Preprocessing: Clean and preprocess the data as needed for further
analysis or modeling. This can include handling missing values, dealing with
outliers, standardizing or normalizing variables, or performing feature
selection.

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt
import os
os.chdir("C://Users//INDIAN//AppData//Roaming//Microsoft//Windows//Start Menu//Programs//Python 3.7")
data = pd.read_csv("WASDE-DATA.csv")
# Check the dimensions of the dataset
print("Number of rows:", data.shape[0])
print("Number of columns:", data.shape[1])

# View the first few records

print(data.head())

# Check the data types of each column
print(data.dtypes)
print(data.describe())
class_counts = data['region'].value_counts()
print(class_counts)
data.hist(figsize=(10, 10))
plt.show()
data.boxplot(by='region', figsize=(10, 10))
plt.show()
categorical_columns = ['commodity', 'item']
for column in categorical_columns:
    data[column].value_counts().plot(kind='bar')
    plt.title(column)
    plt.show()
plt.scatter(data['commodity'], data['item'])
plt.xlabel('commodity')
plt.ylabel('item')
plt.show()
correlation_matrix = data.corr()
plt.figure(figsize=(10, 10))
plt.imshow(correlation_matrix, cmap='coolwarm', interpolation='nearest')
plt.colorbar()
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation=90)
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.title('Correlation Matrix')
plt.show()

OUTPUT:

Number of rows: 10000


Number of columns: 10
code report_month region commodity item \
0 WHEAT_WORLD_19 2023-01 World Less China Wheat Production
1 WHEAT_WORLD_19 2023-01 World Less China Wheat Production
2 WHEAT_WORLD_19 2023-01 World Less China Wheat Imports
3 WHEAT_WORLD_19 2023-01 World Less China Wheat Imports
4 WHEAT_WORLD_19 2023-01 World Less China Wheat Exports

year period value min_value max_value


0 2022/23 Proj. Jan 643.59 NaN NaN
1 2022/23 Proj. Dec 642.59 NaN NaN
2 2022/23 Proj. Jan 195.55 NaN NaN
3 2022/23 Proj. Dec 194.80 NaN NaN
4 2022/23 Proj. Jan 210.72 NaN NaN
code object
report_month object
region object
commodity object
item object
year object

period object
value float64
min_value float64
max_value float64
dtype: object
value min_value max_value
count 10000.000000 0.0 0.0
mean 73.100683 NaN NaN
std 149.321209 NaN NaN
min 0.000000 NaN NaN
25% 2.000000 NaN NaN
50% 11.570000 NaN NaN
75% 53.990000 NaN NaN
max 794.440000 NaN NaN
World Less China 455
World 3/ 455
United States 445
Argentina 441
Australia 441
Bangladesh 441
Brazil 441
Canada 441
China 441
India 441
Japan 441
Kazakhstan 441
Major Exporters 4/ 441
Major Importers 6/ 441
N. Africa 7/ 441
Nigeria 441
Russia 441
Sel. Mideast 8/ 441
Southeast Asia 9/ 441
Total Foreign 441
Ukraine 441
European Union 5/ 385
United Kingdom 266
EU-27+UK 5/ 56
Name: region, dtype: int64

EXPERIMENT:13

Write a Python program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using standard Heart Disease Data Set

AIM:

Write a Python program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using standard Heart Disease Data Set

Description:

A Bayesian network, also known as a Bayesian belief network or probabilistic graphical


model, is a graphical representation of probabilistic relationships among variables. It is based
on the principles of Bayesian probability and provides a compact and intuitive way to model
and reason about uncertain or probabilistic domains.

1. Conditional Probability Distributions (CPDs): CPDs specify the probability


distribution of a node given the values of its parent nodes. They quantify the
conditional dependencies in the network and are used to perform probabilistic
inference.
2. Inference: Inference in a Bayesian network involves computing the probability
distribution of one or more variables given evidence or observed values of other
variables. Various algorithms, such as variable elimination, enumeration, or
sampling methods, can be used for inference.
3. Learning: Bayesian networks can be learned from data using techniques such as maximum likelihood estimation or Bayesian parameter estimation. Learning the network structure and CPDs from data allows us to discover the underlying dependencies and make predictions or perform reasoning.

PROGRAM:

import numpy as np
import pandas as pd
import csv
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination
import os
os.chdir("C://Users//INDIAN//AppData//Roaming//Microsoft//Windows//Start Menu//Programs//Python 3.7")

heartDisease = pd.read_csv('new dataset.csv')


heartDisease = heartDisease.replace('?',np.nan)

print('Sample instances from the dataset are given below')


print(heartDisease.head())

print('\n Attributes and datatypes')


print(heartDisease.dtypes)

model = BayesianModel([('age','heartdisease'),('sex','heartdisease'),('exang','heartdisease'),('cp','heartdisease'),('heartdisease','restecg'),('heartdisease','chol')])
print('\nLearning CPD using Maximum likelihood estimators')
model.fit(heartDisease,estimator=MaximumLikelihoodEstimator)

print('\n Inferencing with Bayesian Network:')


HeartDiseasetest_infer = VariableElimination(model)

print('\n 1. Probability of HeartDisease given evidence= restecg')


q1=HeartDiseasetest_infer.query(variables=['heartdisease'],evidence={'restecg':1})
print(q1)

print('\n 2. Probability of HeartDisease given evidence= cp ')


q2=HeartDiseasetest_infer.query(variables=['heartdisease'],evidence={'cp':2})
print(q2)

OUTPUT:

Sample instances from the dataset are given below


age sex cp trestbps chol fbs restecg thalach exang oldpeak slope \
0 63 1 1 145 233 1 2 150 0 2.3 3
1 67 1 4 160 286 0 2 108 1 1.5 2
2 67 1 4 120 229 0 2 129 1 2.6 2
3 37 1 3 130 250 0 0 187 0 3.5 3
4 41 0 2 130 204 0 2 172 0 1.4 1

   ca thal  heartdisease
0   0    6             0
1   3    3             2
2   2    7             1
3   0    3             0
4   0    3             0

Attributes and datatypes


age int64
sex int64
cp int64
trestbps int64
chol int64
fbs int64
restecg int64
thalach int64
exang int64
oldpeak float64
slope int64
ca object
thal object
heartdisease int64
dtype: object

Learning CPD using Maximum likelihood estimators

Inferencing with Bayesian Network:

1. Probability of HeartDisease given evidence= restecg


+-----------------+---------------------+
| heartdisease    |   phi(heartdisease) |
+=================+=====================+
| heartdisease(0) |              0.1012 |
+-----------------+---------------------+
| heartdisease(1) |              0.0000 |
+-----------------+---------------------+
| heartdisease(2) |              0.2392 |
+-----------------+---------------------+
| heartdisease(3) |              0.2015 |
+-----------------+---------------------+
| heartdisease(4) |              0.4581 |
+-----------------+---------------------+

2. Probability of HeartDisease given evidence= cp


+-----------------+---------------------+
| heartdisease    |   phi(heartdisease) |
+=================+=====================+
| heartdisease(0) |              0.3610 |
+-----------------+---------------------+
| heartdisease(1) |              0.2159 |
+-----------------+---------------------+
| heartdisease(2) |              0.1373 |
+-----------------+---------------------+
| heartdisease(3) |              0.1537 |
+-----------------+---------------------+
| heartdisease(4) |              0.1321 |
+-----------------+---------------------+
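As a hedged follow-up (not part of the required program), the conditional probability tables that MaximumLikelihoodEstimator learned can be inspected directly after model.fit(...), which helps verify the tables behind the query results above:

# Run after model.fit(...) in the program above
for cpd in model.get_cpds():
    print("CPD learned for node:", cpd.variable)
print(model.get_cpds('heartdisease'))   # full conditional probability table of the target node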

EXPERIMENT:14

Write a program to Implement Support Vector Machines and Principal Component Analysis

AIM:
Write a program to Implement Support Vector Machines and Principal Component Analysis

Description:
Support Vector Machines (SVM) is a powerful supervised machine learning algorithm used
for classification and regression tasks. SVMs are particularly effective in cases where the data
has complex patterns and a clear margin of separation between classes.

Program:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
import os
os.chdir("C://Users//INDIAN//AppData//Roaming//Microsoft//Windows//Start Menu//Programs//Python 3.7")

# Load the dataset


data = pd.read_csv("dataset.csv")

# Split the dataset into features (X) and target (y)
X = data.drop('target', axis=1)
y = data['target']

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Perform PCA for dimensionality reduction


pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train the SVM classifier


svm = SVC()
svm.fit(X_train_pca, y_train)

# Make predictions on the test set


y_pred = svm.predict(X_test_pca)

# Calculate the accuracy of the SVM classifier


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

OUTPUT:
Accuracy: 0.7377049180327869
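A common refinement is to chain the scaling, PCA, and SVM steps in a single scikit-learn Pipeline, so that both the scaler and the PCA are fit only on the training data. The sketch below is illustrative, not part of the required program, and assumes the same dataset.csv and X_train/X_test split produced above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Scale -> reduce to 2 components -> classify, all fit on the training set only
pipe = Pipeline([
    ('scale', StandardScaler()),   # PCA is sensitive to feature scale
    ('pca', PCA(n_components=2)),
    ('svm', SVC(kernel='rbf'))
])
pipe.fit(X_train, y_train)
print("Pipeline accuracy:", pipe.score(X_test, y_test))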

EXPERIMENT:15

Write a program to Implement Principal Component Analysis

AIM:

Write a program to Implement Principal Component Analysis

Description:
Principal Component Analysis (PCA) is a popular dimensionality reduction technique used in
data analysis and machine learning. It aims to transform a high-dimensional dataset into a
lower-dimensional space while retaining as much information as possible.

USES:

1. Dimensionality Reduction: By selecting a subset of the top-k principal components, where k is lower than the original number of features, PCA reduces the dimensionality of the dataset. This can be helpful when dealing with high-dimensional data, as it simplifies subsequent analysis and modeling tasks.
2. Data Visualization: PCA can be used to visualize high-dimensional data in a lower-dimensional space. By projecting the data onto the principal components, it is possible to create two- or three-dimensional scatter plots that provide insights into the structure and relationships within the data.
3. Feature Extraction: The principal components themselves can be interpreted as new features that capture the most important patterns in the data. These new features can be used in subsequent analysis or modeling tasks, potentially leading to improved performance.
PROCESS:

1. Data Preprocessing: Prepare the dataset by handling missing values, normalizing or standardizing features, and ensuring that the data is in a suitable format.
2. Compute Covariance Matrix: Calculate the covariance matrix of the data, which measures the pairwise relationships between the features.
3. Compute Eigenvectors and Eigenvalues: Determine the eigenvectors and corresponding eigenvalues of the covariance matrix. The eigenvectors represent the principal components, while the eigenvalues indicate the amount of variance captured by each component.
4. Select Principal Components: Select the top-k eigenvectors with the largest eigenvalues to retain the most important components. Typically, you choose a value of k that retains a significant portion of the variance, such as 95% or 99%.
5. Project Data: Project the original data onto the selected principal components to obtain the lower-dimensional representation of the dataset (a NumPy sketch of these steps is given after this list).
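The following is a minimal NumPy sketch of steps 2-5 above. It uses a small random matrix as a stand-in for the real dataset; scikit-learn's PCA in the program below performs the same computation internally (possibly with different sign conventions):

import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))                 # stand-in data: 100 samples, 5 features

X_centered = X_toy - X_toy.mean(axis=0)           # step 1: center the data
cov = np.cov(X_centered, rowvar=False)            # step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)            # step 3: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]                 # step 4: sort components by variance
components = eigvecs[:, order[:2]]                # keep the top-2 principal components
X_projected = X_centered @ components             # step 5: project the data
print(X_projected.shape)                          # (100, 2)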

PROGRAM:

import pandas as pd
from sklearn.decomposition import PCA
import os
os.chdir("C://Users//INDIAN//AppData//Roaming//Microsoft//Windows//Start Menu//Programs//Python 3.7")

# Load the dataset


data = pd.read_csv("dataset.csv")

# Separate features and target variable


X = data.drop('target', axis=1)
y = data['target']

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Explained variance ratio


explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance_ratio)

# Access principal components


principal_components = pca.components_
print("Principal Components:\n", principal_components)

OUTPUT:

Explained Variance Ratio: [0.7475642 0.15037022]


Principal Components:
[[ 3.94611190e-02 -1.78278639e-03 -1.53716667e-03  4.75880705e-02
   9.98053283e-01  1.16389852e-04 -1.55243101e-03 -7.35838010e-03
   6.31483108e-04  1.32988432e-03 -9.99857233e-05  1.46773705e-03
   1.18215354e-03]
 [ 1.82186255e-01  7.93727347e-04 -1.25419057e-02  1.03810033e-01
  -1.94250905e-02  4.61971663e-04 -1.20213285e-03 -9.77188942e-01
   7.54817512e-03  1.79407185e-02 -1.04271838e-02  1.01095919e-02
   2.59241726e-03]]
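A hedged usage note: to choose the number of components k that retains, say, 95% of the variance (step 4 of the process above), PCA can be fit with all components and the cumulative explained-variance ratio inspected. This sketch assumes the same X as in the program above:

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X)                             # keep all components
cum = np.cumsum(pca_full.explained_variance_ratio_)
k = int(np.argmax(cum >= 0.95)) + 1                 # first k reaching 95% cumulative variance
print("Components needed for 95% variance:", k)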

Additional Experiments
EXPERIMENT 1 :

Exercises to solve real-world problems using machine learning methods such as multi-class classification

AIM:

Exercises to solve real-world problems using machine learning methods such as multi-class classification

Description:

Types of Classification
Classification is of two types:
1. Binary Classification: When we have to categorize the given data into 2 distinct classes. Example – on the basis of the given health conditions of a person, we have to determine whether the person has a certain disease or not.
2. Multiclass Classification: The number of classes is more than 2. Example – on the basis of data about different species of flowers, we have to determine which species our observation belongs to.

Examples of Binary Classification:

 Email spam detection (spam or not).


 Churn prediction (churn or not).
 Conversion prediction (buy or not).
Popular algorithms that can be used for binary classification include:

 Logistic Regression
 k-Nearest Neighbors
 Decision Trees
 Support Vector Machine
 Naive Bayes
Multi-Class Classification
Multi-class classification refers to those classification tasks that have more than two class
labels. Examples include:

 Face classification.
 Plant species classification.
 Optical character recognition.
Popular algorithms that can be used for multi-class classification include (a one-vs-rest sketch follows this list):

 k-Nearest Neighbors
 Decision Trees
 Naive Bayes
 Random Forest
 Gradient Boosting
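Inherently binary algorithms such as support vector machines can also be applied to multi-class problems by training one binary classifier per class. The following is a minimal sketch (using the built-in Iris dataset purely for illustration):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                  # 3 flower species, 4 features
ovr = OneVsRestClassifier(SVC()).fit(X, y)         # one binary SVM per class
print("Training accuracy:", ovr.score(X, y))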

PROGRAM:

from numpy import where

from collections import Counter

from sklearn.datasets import make_blobs

from matplotlib import pyplot

# define dataset

X, y = make_blobs(n_samples=1000, centers=3, random_state=1)

# summarize dataset shape
print(X.shape, y.shape)

# summarize observations by class label

counter = Counter(y)

print(counter)

# summarize first few examples

for i in range(10):

print(X[i], y[i])

# plot the dataset and color the points by class label

for label, _ in counter.items():

row_ix = where(y == label)[0]

pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))

pyplot.legend()

pyplot.show()

OUTPUT:

(The script prints the dataset shape, the per-class counts, and the first ten examples, and then displays a scatter plot of the three generated clusters, colored by class label.)
