
EXPERIMENT 4

Aim: Implementation of the Bayesian Algorithm Using a Colab Notebook


The aim of this experiment is to implement the Naive Bayes classifier learning model to
solve a classification problem for an online bookstore. The dataset will be acquired from an
open-source repository and stored in .csv format. We will classify the dataset's entries and
compute the accuracy of the model.

Theory:
1. Introduction to Data Mining:
Data mining refers to the process of discovering patterns, correlations, anomalies, and useful
information from large datasets. This information is often used for decision-making purposes.
It involves techniques such as classification, regression, clustering, and association rule
learning.
2. Classification Learning Models:
Classification is a supervised learning technique in machine learning where the model learns
from a labelled dataset to classify unseen data points into predefined classes. A classifier
algorithm, such as Naive Bayes, decision trees, or support vector machines, is used to build
models that assign categories to data points.
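For instance, scikit-learn exposes all such classifiers through a common fit/predict interface. The brief sketch below (toy data, not the experiment's dataset) trains a Naive Bayes classifier on labelled points and classifies an unseen one:

from sklearn.naive_bayes import GaussianNB

# Toy labelled dataset: two numeric features, two classes (0 and 1)
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = [0, 0, 1, 1]

model = GaussianNB()
model.fit(X, y)                       # learn from the labelled data
print(model.predict([[1.2, 1.9]]))    # classify an unseen point -> [0]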
3. Dataset and Different Types of Attributes:
A dataset is a collection of data organised in a structured manner, typically consisting of rows
(data points) and columns (attributes/features). Attributes can be categorised as:

● Numerical Attributes: Represent numeric values (e.g., price).

● Categorical Attributes: Represent discrete values or categories (e.g., currency,
format), as illustrated in the sketch below.
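As a brief illustration (hypothetical values, using the same column names as the experiment's dataset), a single row might look like this in pandas:

import pandas as pd

# One hypothetical row of a bookstore dataset
row = pd.DataFrame({
    'price': [12.99],                  # numerical attribute
    'book_depository_stars': [4.5],    # numerical attribute
    'format': ['Paperback'],           # categorical attribute
    'currency': ['USD'],               # categorical attribute
    'category': ['Fiction'],           # target class
})
print(row.dtypes)   # float64 for numerical columns, object for categorical ones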
4. Naive Bayes Classification Algorithm:
The Naive Bayes classifier is a probabilistic classification algorithm based on Bayes'
Theorem. It assumes that features are independent of each other, hence "naive." This
algorithm is particularly effective for text classification and other problems with independent
features.
Steps:
1. Convert the dataset into a frequency table.
2. Calculate the likelihood table based on the frequency table.
3. Use Bayes' Theorem to compute the posterior probability for each class.
4. Predict the class with the highest posterior probability.
Formula for Posterior Probability:

P(Y | X) = P(X | Y) · P(Y) / P(X)
Where:

● P(Y | X) is the posterior probability.

● P(X | Y) is the likelihood.

● P(Y) is the prior probability of the class.

● P(X) is the prior probability of the predictor.
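To make the four steps concrete, the following minimal sketch (hypothetical counts, not the experiment's dataset) applies Bayes' Theorem to a one-feature frequency table:

# Hypothetical frequency table: how often format == 'Paperback'
# occurs within each class
class_counts = {'Fiction': 40, 'Science': 10}      # class frequencies
paperback_counts = {'Fiction': 30, 'Science': 2}   # X = 'Paperback' per class

total = sum(class_counts.values())
p_x = sum(paperback_counts.values()) / total       # P(X), prior of the predictor

for y, n in class_counts.items():
    p_y = n / total                                # P(Y), prior of the class
    p_x_given_y = paperback_counts[y] / n          # P(X | Y), likelihood
    posterior = p_x_given_y * p_y / p_x            # Bayes' Theorem
    print(f"P({y} | Paperback) = {posterior:.4f}")

# Prints P(Fiction | Paperback) = 0.9375 and P(Science | Paperback) = 0.0625,
# so a paperback is assigned to Fiction, the class with the highest posterior (step 4).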

Dataset:
We are using a dataset of an online bookstore, which contains the following features:

● price (numeric)

● book_depository_stars (numeric)

● format (categorical)

● currency (categorical)

The target variable is category, which represents different book categories in the store.

Algorithm Implementation:
1. Acquiring the Dataset:
The dataset is acquired and stored in a .csv format. The following preprocessing steps are
applied:

● Handling missing values by removing rows with missing data.

● Removing currency symbols from the price column.

● Encoding categorical features using LabelEncoder.

2. Modelling the Naive Bayes Classifier:

● The dataset is split into 80% training and 20% testing data using train_test_split.

● The Naive Bayes classifier is instantiated using GaussianNB().

● The model is trained on the training data and used to make predictions on the test set.

3. Modelling the Random Forest Classifier:

● As an additional approach, we also test the dataset with the Random Forest classifier
to compare accuracy.

INPUT
# Colab's built-in sample visualisation cell, run to confirm that plotting
# and inline image display work in the notebook
import numpy as np
import IPython.display as display
from matplotlib import pyplot as plt
import io
import base64

# Generate 100 random values centred on 200 and plot them as a line,
# shading the area between the line and y = 195 wherever ys > 195
ys = 200 + np.random.randn(100)
x = [x for x in range(len(ys))]
fig = plt.figure(figsize=(4, 3), facecolor='w')
plt.plot(x, ys, '-')
plt.fill_between(x, ys, 195, where=(ys > 195), facecolor='g', alpha=0.6)
plt.title("Sample Visualization", fontsize=10)

# Encode the figure as a base64 PNG and display it inline as Markdown
data = io.BytesIO()
plt.savefig(data)
image = F"data:image/png;base64,{base64.b64encode(data.getvalue()).decode()}"
alt = "Sample Visualization"
display.display(display.Markdown(F"""![{alt}]({image})"""))
plt.close(fig)

OUTPUT

[Figure: "Sample Visualization" line plot of the random series, with the region above y = 195 shaded in green]

INPUT
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the dataset from the .csv file (the filename below is illustrative;
# substitute the actual path of the acquired dataset)
data = pd.read_csv('bookstore.csv')

# Preprocessing: keep only the columns relevant for classification
features = data[['price', 'book_depository_stars', 'format', 'currency']].copy()
target = data['category']

# Handle missing values by dropping rows with missing data
features = features.dropna()
target = target[features.index]

# Remove currency symbols and codes (e.g., $, US) from 'price' and convert it to a float
features['price'] = features['price'].replace(r'[^\d.]', '', regex=True).astype(float)

# Encode categorical features
label_encoders = {}
for column in ['format', 'currency']:
    le = LabelEncoder()
    features[column] = le.fit_transform(features[column])
    label_encoders[column] = le

# Feature scaling: standardise 'price' and 'book_depository_stars'
scaler = StandardScaler()
features[['price', 'book_depository_stars']] = scaler.fit_transform(
    features[['price', 'book_depository_stars']])

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Random Forest classifier, used as an additional approach for comparison
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_classifier.predict(X_test)

# Calculate accuracy for Random Forest
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf * 100:.2f}%")

# Print a classification report for more insight
print(classification_report(y_test, y_pred_rf))

# Naive Bayes classifier, trained on the same scaled features
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)

# Calculate accuracy for Naive Bayes
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"Naive Bayes Accuracy: {accuracy_nb * 100:.2f}%")

OUTPUT

Model Accuracy:
● Random Forest Accuracy: 8.25%

● Naive Bayes Accuracy: 7.48%

Conclusion:

In this experiment, we implemented and compared two classifiers: Naive Bayes and
Random Forest. The Random Forest classifier achieved the higher accuracy, owing to its
ability to model complex interactions between features, while Naive Bayes performed
worse because its assumption of feature independence does not hold for this data. Both
accuracies were nevertheless low, which suggests that the four available features carry
limited information about a book's category.
