Bayesian Algorithm
Theory:
1. Introduction to Data Mining:
Data mining refers to the process of discovering patterns, correlations, anomalies, and useful
information from large datasets. This information is often used for decision-making purposes.
It involves techniques such as classification, regression, clustering, and association rule
learning.
2. Classification Learning Models:
Classification is a supervised learning technique in machine learning where the model learns
from a labelled dataset to classify unseen data points into predefined classes. A classifier
algorithm, such as Naive Bayes, decision trees, or support vector machines, is used to build
models that assign categories to data points.
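As a minimal sketch of this train-then-classify workflow (the toy feature vectors and class labels below are invented purely for illustration):

from sklearn.naive_bayes import GaussianNB

# Labelled training data: feature vectors and their known classes (toy values)
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = ['cheap', 'cheap', 'expensive', 'expensive']

model = GaussianNB()
model.fit(X, y)                      # learn from the labelled dataset
print(model.predict([[5.5, 8.5]]))   # classify an unseen data point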
3. Dataset and Different Types of Attributes:
A dataset is a collection of data organised in a structured manner, typically consisting of rows
(data points) and columns (attributes/features). Attributes can be categorised as:
● Numeric: quantitative values, such as price or book_depository_stars.
● Categorical: discrete labels, such as format or currency.
4. Bayes' Theorem:
The Naive Bayes classifier is built on Bayes' theorem, which relates the posterior probability of
a class Y given observed features X to the prior and the likelihood:
P(Y | X) = P(X | Y) · P(Y) / P(X)
Where:
● P(Y | X) is the posterior probability of class Y given the features X.
● P(X | Y) is the likelihood of observing the features X given class Y.
● P(Y) is the prior probability of class Y.
● P(X) is the evidence, the overall probability of observing the features X.
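For example, the theorem can be checked directly in code (a quick sketch; the probabilities below are invented for illustration):

# Worked Bayes' theorem check with invented numbers
p_y = 0.2          # prior P(Y), e.g., probability a book is Fiction
p_x_given_y = 0.6  # likelihood P(X | Y)
p_x = 0.25         # evidence P(X)
posterior = p_x_given_y * p_y / p_x
print(posterior)   # P(Y | X) = 0.48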
Dataset:
We are using a dataset of an online bookstore, which contains the following features:
● price (numeric)
● book_depository_stars (numeric)
● format (categorical)
● currency (categorical)
The target variable is category, which represents different book categories in the store.
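Before modelling, a quick inspection of these columns can confirm their types and the class distribution (a sketch; the file name books.csv is an assumption):

import pandas as pd

# Load the bookstore dataset (file name is an assumption; adjust to your path)
df = pd.read_csv('books.csv')
print(df[['price', 'book_depository_stars', 'format', 'currency', 'category']].head())
print(df['category'].value_counts())  # distribution of the target classes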
Algorithm Implementation:
1. Acquiring the Dataset:
The dataset is acquired and stored in .csv format, after which the following steps are applied:
● The dataset is split into 80% training and 20% testing data using train_test_split.
2. Modelling the Naive Bayes Classifier:
● A Gaussian Naive Bayes model is trained on the training data and used to make predictions
on the test set.
3. Modelling the Random Forest Classifier:
● As an additional approach, we also train a Random Forest classifier on the same split to
compare accuracy.
INPUT
import numpy as np
import IPython.display as display
from matplotlib import pyplot as plt
import io
import base64
# Generate sample data: 100 points of Gaussian noise centred on 200
ys = 200 + np.random.randn(100)
x = list(range(len(ys)))

fig = plt.figure(figsize=(4, 3), facecolor='w')
plt.plot(x, ys, '-')
# Shade the region between the curve and y = 195 wherever the curve is above it
plt.fill_between(x, ys, 195, where=(ys > 195), facecolor='g', alpha=0.6)
plt.title("Sample Visualization", fontsize=10)

# Encode the figure as a base64 data URI and render it as an inline Markdown image
data = io.BytesIO()
plt.savefig(data)
image = F"data:image/png;base64,{base64.b64encode(data.getvalue()).decode()}"
alt = "Sample Visualization"
display.display(display.Markdown(F"![{alt}]({image})"))
plt.close(fig)
OUTPUT
[Figure: "Sample Visualization" line plot, with the region above y = 195 shaded in green]
INPUT
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# Load the bookstore dataset and select the features and the target
# (the file name 'books.csv' is an assumption; adjust it to your path)
df = pd.read_csv('books.csv')
features = df[['price', 'book_depository_stars', 'format', 'currency']].copy()
target = df['category']

# Handle missing values by dropping rows with missing data
features = features.dropna()
target = target[features.index]

# Remove currency symbols and codes (e.g., $, US) from 'price' and convert it to a float
features['price'] = features['price'].replace(r'[^\d.]', '', regex=True).astype(float)

# Encode the categorical columns as integers so GaussianNB can work with them
for col in ['format', 'currency']:
    features[col] = LabelEncoder().fit_transform(features[col])

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2,
                                                    random_state=42)

# Train the Gaussian Naive Bayes classifier and predict on the test set
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)
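The OUTPUT below also reports a Random Forest score, so the comparison step would look roughly like this (a sketch reusing the variables from the block above):

# Train the Random Forest classifier on the same split for comparison
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred_rf = rf_classifier.predict(X_test)

# Evaluate both models on the held-out test set
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf):.2%}")
print(f"Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_nb):.2%}")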
OUTPUT
Model Accuracy:
● Random Forest Accuracy: 8.25%
● Naive Bayes Accuracy: 7.48%
Conclusion:
In this experiment, we implemented and compared two classifiers: Naive Bayes and
Random Forest. The Random Forest classifier achieved the higher accuracy, owing to its
ability to capture complex interactions between features, while Naive Bayes performed
worse because of its assumption that features are independent.