Data Mining
In simple terms, Bayes' theorem allows us to update our belief about the probability of an event
occurring (posterior) by considering prior knowledge (prior) and the likelihood of observing specific
data.
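Formally, for a class C and an observed feature vector X, Bayes' theorem is written as:
P(C | X) = P(X | C) × P(C) / P(X)
where P(C) is the prior, P(X | C) is the likelihood, and P(C | X) is the posterior probability of the class given the data.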
The Naive Bayes classifier is a specific application of Bayes' theorem for classification tasks. It is called
"naive" because it makes a key simplifying assumption: it assumes that all features are independent
given the class label. In other words, the probability of observing one feature is independent of the
probability of observing another, given the class. This assumption, although often unrealistic in real-
world data, drastically simplifies the computations and enables the classifier to be both efficient and
scalable.
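Under this independence assumption, the likelihood factorizes into a product of per-feature terms:
P(X1, X2, ..., Xn | C) = P(X1 | C) × P(X2 | C) × ... × P(Xn | C)
so the classifier only needs to estimate a simple one-dimensional distribution for each feature.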
Code:
Here's Python code to perform classification using Naive Bayes with the scikit-learn library.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
# Load the dataset
data = load_iris()
X = data.data
y = data.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the Gaussian Naive Bayes classifier
gnb = GaussianNB()
# Train the classifier
gnb.fit(X_train, y_train)
# Predict on the test set
y_pred = gnb.predict(X_test)
# Output accuracy and classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", report)
Output:
After running the code, the output should show:
Accuracy: A percentage representing the model's accuracy on the test set.
Classification Report: A detailed breakdown of precision, recall, F1-score, and support for each class.
Experiment 2
Aim: To perform cluster analysis using the K-Means clustering method in Python.
Theory:
K-Means Clustering Algorithm
K-Means is a widely used clustering algorithm designed to partition data into a predefined number of clusters, denoted as k. Each cluster contains data points that are more similar to each other than to those in other clusters. The algorithm groups data by minimizing the variance (or distance) within each cluster, creating compact and well-separated groups. It is an unsupervised learning technique, often applied in data analysis, image segmentation, customer segmentation, and other tasks where natural groupings in data are needed.
Steps of the K-Means Algorithm
1. Initialize:
o Select k initial cluster centroids randomly from the dataset.
o These centroids serve as the starting points for the clusters.
2. Assign:
o For each data point, calculate the Euclidean distance to each centroid.
o Assign each data point to the nearest centroid, thus forming k clusters.
3. Update:
o For each cluster, recompute the centroid by calculating the mean of all data points assigned to that cluster.
o This centroid represents the "center" of the cluster and serves as the updated position for that cluster.
4. Repeat:
o Repeat steps 2 and 3 until one of the following conditions is met:
Centroids stabilize (i.e., the centroids do not change between iterations).
A predefined maximum number of iterations is reached.
The algorithm is guaranteed to converge, though it may not always find the global optimum due to random initialization of centroids. To improve results, K-Means is often run multiple times with different initializations, and the best outcome is chosen based on minimized variance.
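The steps above can be written out directly in a few lines of NumPy. The sketch below is only illustrative (random initialization, no handling of empty clusters, a fixed iteration cap); the scikit-learn implementation used in the Code section handles these details more robustly.
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids at random from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled in this simplified sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids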
Code:
Here's Python code to perform K-Means clustering using the scikit-learn library.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate synthetic data for clustering
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
# Initialize the KMeans model with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=0)
# Fit the model to the data
kmeans.fit(X)
# Predict the clusters
y_kmeans = kmeans.predict(X)
# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("K-Means Clustering with k=4")
plt.show()
Output:
Experiment 3
Aim: To perform hierarchical clustering using Python.
Theory:
Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters, represented by
a tree-like structure known as a dendrogram. Unlike methods that require specifying the number of
clusters in advance, hierarchical clustering provides a visual representation of the data's clustering
structure at different levels, allowing users to decide the number of clusters based on the dendrogram.
This method is commonly used in biological taxonomy, document clustering, and other fields where
hierarchical relationships are significant.
Approaches to Hierarchical Clustering
There are two main approaches: agglomerative (bottom-up) clustering, where each data point starts as its own cluster and the closest clusters are merged step by step, and divisive (top-down) clustering, where all points start in a single cluster that is split recursively.
The Dendrogram
A dendrogram is a tree-like diagram that illustrates the order in which clusters are merged (in
agglomerative clustering) or split (in divisive clustering). Each node in the dendrogram represents a
cluster, and the height of each node represents the distance (or dissimilarity) at which clusters are
merged or split. By cutting the dendrogram at a specific height, users can determine the number of
clusters in the data.
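As a quick illustration of "cutting" the dendrogram in code, SciPy's fcluster function turns a linkage matrix into flat cluster labels at a chosen threshold. The data and cut heights below are made-up values for demonstration only:
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np
# Tiny example dataset; Ward linkage builds the hierarchy
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
Z = linkage(X, method='ward')
# Cut the tree at distance 4.0: every merge above this height is ignored
labels = fcluster(Z, t=4.0, criterion='distance')
# Alternatively, request an exact number of clusters instead of a height
labels_k2 = fcluster(Z, t=2, criterion='maxclust')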
Linkage Methods
In hierarchical clustering, the choice of linkage method affects how distances between clusters are
calculated. The two most common linkage methods are:
1. Single Linkage:
o Measures the minimum distance between any two points in different clusters.
o Tends to create elongated, "chain-like" clusters, as clusters are merged based on the closest pair
of points.
o Sensitive to noise and outliers, leading to possible chaining effects, where clusters may be
combined due to a single close data point.
2. Complete Linkage:
o Measures the maximum distance between any two points in different clusters.
o Ensures that clusters are compact, as they are only merged if all points in one cluster are close to
all points in the other.
o Results in more compact and evenly shaped clusters, but may lead to smaller clusters being
absorbed by larger ones.
Other linkage methods include average linkage (based on the average distance between points in
clusters) and Ward’s method (minimizing the variance within clusters), each providing different cluster
shapes and stability.
Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
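The listing above stops after the imports; a minimal sketch of how the example could continue, using synthetic blob data and Ward linkage (the method argument is where the linkage criteria discussed above, such as 'single', 'complete', 'average', or 'ward', are selected):
# Generate synthetic data for clustering
X, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.8, random_state=0)
# Build the linkage matrix; method can be 'single', 'complete', 'average', or 'ward'
Z = linkage(X, method='ward')
# Plot the dendrogram
plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.title("Hierarchical Clustering Dendrogram (Ward linkage)")
plt.xlabel("Data points")
plt.ylabel("Distance")
plt.show()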
Y = b0 + b1X
where:
Y is the dependent variable,
X is the independent variable,
b0 is the intercept, and
b1 is the slope.
The objective in simple linear regression is to find the best-fitting line, meaning the line that minimizes the sum of squared differences between the observed values and the predicted values. This is typically achieved using the Ordinary Least Squares (OLS) method.
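In symbols, OLS chooses b0 and b1 so as to minimize the sum of squared errors:
SSE = ∑ (Yi − (b0 + b1·Xi))²
where the sum runs over all observations i.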
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
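The listing again stops at the imports; a possible continuation is sketched below, using synthetic data with a known linear trend (the coefficients and noise level are arbitrary illustrative choices):
# Generate synthetic data: Y = 4 + 3*X plus random noise
rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X.ravel() + rng.randn(100)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the simple linear regression model (OLS)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Report the fitted line and error metrics
print("Intercept (b0):", model.intercept_)
print("Slope (b1):", model.coef_[0])
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
# Plot the test data and the fitted line
plt.scatter(X_test, y_test, label="Observed")
plt.plot(X_test, y_pred, color='red', label="Fitted line")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()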
Outliers are data points that significantly deviate from the rest of the dataset, often lying far from the mean or median. These extreme values can arise due to various reasons, including natural variability in data, measurement errors, or mistakes in data entry or collection. While some outliers may represent important information (e.g., rare events), they can often distort statistical analyses, skew results, and reduce the accuracy of predictive models. Identifying and managing outliers is a crucial step in the data preprocessing phase of machine learning and statistical analysis.
Impacts of Outliers
Bias: Outliers can skew statistical estimates like the mean, variance, and standard deviation, leading to biased results.
Model Performance: In predictive modeling, outliers can reduce the performance of algorithms such as linear regression, as they can disproportionately influence the fit of the model, leading to inaccurate predictions.
Interpretation: Outliers may lead to misinterpretation of data trends and patterns, particularly when not properly handled.
Thus, detecting and addressing outliers is essential for improving the reliability of analytical models.
Common Methods for Outlier Detection
Several techniques can be used to identify outliers in a dataset. Below are the most widely used methods:
1. Z-score Method
The Z-score is a statistical measure that quantifies how far a data point is from the mean in terms of standard deviations. If a data point’s Z-score is significantly high or low, it can be considered an outlier.
Formula for Z-score:
Z = (X − μ) / σ
Where:
X is the individual data point,
μ is the mean of the dataset,
σ is the standard deviation.
A typical threshold for identifying outliers using the Z-score is ±3 standard deviations. If the Z-score is greater than 3 or less than -3, the data point is considered an outlier.
Example:
For a dataset of test scores, if a student’s score has a Z-score of 4, it is considered an outlier since it is 4 standard deviations away from the mean.
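A small sketch of the Z-score method in Python (the sample values are made up for illustration; the Code section below uses the IQR method instead):
import numpy as np
# Hypothetical test scores with one extreme value
scores = np.array([61, 64, 66, 59, 63, 67, 62, 65, 68, 60,
                   64, 66, 63, 61, 67, 65, 62, 68, 64, 120])
# Compute each point's Z-score
z_scores = (scores - scores.mean()) / scores.std()
# Flag points more than 3 standard deviations from the mean
outliers = scores[np.abs(z_scores) > 3]
print("Z-scores:", np.round(z_scores, 2))
print("Outliers:", outliers)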
2. Interquartile Range (IQR) Method
The IQR method is based on the spread of the middle 50% of the data. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1). Outliers are considered as data points that fall outside a range defined by the IQR.
Formula for Identifying Outliers:
Lower bound: Q1 − 1.5 × IQR
Upper bound: Q3 + 1.5 × IQR
Where:
Q1 is the first quartile (25th percentile),
Q3 is the third quartile (75th percentile),
IQR is the interquartile range, calculated as Q3 − Q1.
Any data point that falls below the lower bound or above the upper bound is considered an outlier.
Example:
For a dataset of house prices, if a house price falls below Q1−1.5×IQR or above Q3+1.5×IQR, it
would be flagged as an outlier.
3. Boxplot Visualization
A boxplot (also known as a box-and-whisker plot) is a graphical method that displays the distribution of data and visually highlights potential outliers. It uses the IQR to define the "whiskers" (the range of non-outlier data), and any data points outside this range are plotted as individual points, often called "outliers."
The box in a boxplot represents the interquartile range (from Q1 to Q3).
The line inside the box represents the median of the dataset.
The whiskers extend from the box to the smallest and largest values within the normal range (within 1.5 times the IQR from Q1 and Q3).
Data points outside the whiskers are plotted as individual dots or symbols, indicating outliers.
Example: A boxplot of salaries might show a few individuals earning much higher than the rest of the group. These higher earnings will be flagged as outliers by the boxplot.
Code:
Here, we’ll use the IQR method to detect outliers in a sample dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
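The listing stops after the imports; a minimal sketch of how the IQR-based detection (and a boxplot, since matplotlib is already imported) could proceed, using a small made-up dataset:
# Hypothetical sample data with two extreme values
data = pd.Series([23, 25, 27, 22, 24, 26, 28, 25, 95, 24, 23, 27, 26, 2])
# Compute the quartiles and the interquartile range
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
# Define the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Flag points outside the bounds as outliers
outliers = data[(data < lower_bound) | (data > upper_bound)]
print("Lower bound:", lower_bound, "Upper bound:", upper_bound)
print("Outliers:\n", outliers)
# Visualize the distribution and the outliers with a boxplot
plt.boxplot(data)
plt.title("Boxplot with Outliers")
plt.ylabel("Value")
plt.show()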
Output:
Experiment 7
Aim: To demonstrate association rule mining using the FP-Growth algorithm on supermarket transaction data.
Theory:
Introduction to FP-Growth
The FP-Growth (Frequent Pattern Growth) algorithm is an advanced and more efficient method for mining frequent itemsets in large datasets. It was developed as an alternative to the Apriori algorithm to overcome its limitations, particularly the large number of database scans and the candidate generation step, which can be computationally expensive. FP-Growth leverages a more efficient approach by utilizing a compact data structure known as an FP-tree (Frequent Pattern Tree).
The main advantage of FP-Growth is its ability to mine frequent itemsets without generating candidate itemsets, which significantly improves performance, especially with large datasets. This makes FP-Growth highly suitable for tasks such as market basket analysis, where identifying associations between items, like products often bought together, is key for business strategies such as product placement, promotions, and cross-selling.
Steps in the FP-Growth Algorithm
FP-Growth operates in two main steps:
1. Constructing the FP-tree
The first step in the FP-Growth algorithm involves constructing an FP-tree. Here’s how it works:
Scan the dataset: The algorithm starts by scanning the dataset to find frequent items. Only items that appear frequently enough (according to a minimum support threshold) are retained.
Sorting Items: The frequent items are sorted in decreasing order of their frequency. This ensures that the FP-tree is built in a way that higher-frequency items appear first, making the tree structure more compact.
Building the Tree:
o The FP-tree is constructed by reading the transactions in the dataset. Each transaction is represented as a path in the tree.
o If a transaction contains the itemset {A,B,C}, for instance, and A is the first frequent item, B the second, and C the third, the transaction will be represented as a path extending from the root node in the tree.
o If the itemset is already present in the tree, the algorithm increments the count for the respective node; otherwise, a new node is created.
The tree is thus "compressed" because it represents itemsets and their counts in a more efficient way than the Apriori algorithm’s candidate generation method.
2. Mining the FP-tree
Once the FP-tree is constructed, the next step is mining the FP-tree to extract frequent itemsets. This
process does not require generating candidate itemsets, which significantly reduces the computational cost compared to Apriori. Here's how mining works:
Conditional Pattern Base: For each frequent item in the FP-tree, a conditional pattern base is built. This is a set of patterns (prefix paths in the tree) that co-occur with the item. It represents the context in which the item is frequently found.
Conditional FP-tree: A new FP-tree is constructed for each frequent item based on the conditional pattern base. This process is repeated recursively for each item, and the frequent itemsets are extracted from each conditional FP-tree.
Frequent Itemset Extraction: Once the recursive process completes, all frequent itemsets are extracted, including combinations of items from various paths in the tree.
The key feature of FP-Growth is that it can mine frequent itemsets directly from the FP-tree without generating candidate itemsets, making it much faster and more scalable than Apriori, especially with large datasets.
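The code below starts from a one-hot encoded DataFrame in which each column is an item and each row is a transaction. If the transactions are instead available as raw item lists, mlxtend's TransactionEncoder can produce that format first; a small sketch, assuming the mlxtend library is installed:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# Raw transactions as lists of items (same contents as the 0/1 matrix used below)
transactions = [
    ['Milk', 'Bread', 'Eggs'],
    ['Bread', 'Butter', 'Eggs', 'Cheese'],
    ['Milk', 'Butter', 'Eggs', 'Cheese'],
    ['Milk', 'Bread', 'Butter'],
    ['Bread', 'Butter', 'Eggs']
]
# Convert to a one-hot encoded DataFrame (True/False per item)
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)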
Code:
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules
# Sample dataset representing supermarket transactions
data = {
'Milk': [1, 0, 1, 1, 0],
'Bread': [1, 1, 0, 1, 1],
'Butter': [0, 1, 1, 1, 1],
'Eggs': [1, 1, 1, 0, 1],
'Cheese': [0, 1, 1, 0, 0]
}
df = pd.DataFrame(data)
# Applying the FP-Growth algorithm to find frequent itemsets
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
# Generating association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
# Display results
print("Frequent Itemsets:\n", frequent_itemsets)
print("\nAssociation Rules:\n", rules)
Output:
Experiment 8
Aim: To perform statistical analysis on a dataset using Python.
Theory:
Statistical analysis is the process of collecting, organizing, summarizing, and interpreting data to uncover underlying patterns, trends, and insights. It is widely used in various fields such as business, science, engineering, and social studies to make informed decisions, predict outcomes, and understand relationships between variables. Statistical methods are typically divided into two main branches: Descriptive Statistics and Inferential Statistics.
1. Descriptive Statistics
Descriptive statistics involve summarizing or describing the features of a dataset using numerical measures and visual tools. These measures help to present large amounts of data in a more digestible form. The main measures in descriptive statistics include:
Mean:
o The mean (or average) is the sum of all values in the dataset divided by the total number of values. It provides a measure of the central tendency of the data.
o Formula: Mean = ∑Xi / n, where Xi is each data point and n is the number of data points.
Median:
o The median is the middle value in the dataset when it is arranged in ascending or
descending order. If there is an even number of values, the median is the average of the
two middle numbers. The median is useful for understanding the central tendency,
especially when data is skewed.
Mode:
o The mode is the value that appears most frequently in a dataset. A dataset may have one
mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all if no
value repeats.
Standard Deviation (SD):
o Standard deviation measures how spread out the data points are around the mean. A small SD indicates that the data points are clustered near the mean, while a large SD suggests that the data points are more spread out.
o Formula: SD = √( ∑(Xi − μ)² / n ), where μ is the mean of the data and Xi are the individual data points.
Variance:
o Variance is the square of the standard deviation. It measures the degree of spread in the data, indicating how far data points are from the mean.
o Formula: Variance = SD²
These measures of central tendency and variability provide a snapshot of the data, helping to summarize
large datasets into meaningful insights.
2. Inferential Statistics
Inferential statistics is the branch of statistics that makes inferences or generalizations about a population based on a sample of data. It goes beyond just describing data and attempts to make predictions or draw conclusions. Key techniques in inferential statistics include:
Hypothesis Testing:
o Hypothesis testing is a method used to determine whether there is enough statistical evidence to support a specific hypothesis or claim about the data. It involves two competing hypotheses:
Null Hypothesis (H₀): The assumption that there is no effect or relationship.
Alternative Hypothesis (H₁): The assumption that there is an effect or relationship.
o Statistical tests, such as the t-test, chi-square test, and ANOVA, are used to evaluate the hypothesis, with a p-value indicating whether the null hypothesis can be rejected.
Correlation:
o Correlation measures the strength and direction of the relationship between two variables. It tells us whether and how strongly pairs of variables are related.
o The most commonly used measure of correlation is the Pearson correlation coefficient (r), which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). A value of 0 indicates no linear relationship.
o Formula for Pearson correlation: r = ∑(Xi − X̄)(Yi − Ȳ) / √( ∑(Xi − X̄)² × ∑(Yi − Ȳ)² ), where Xi and Yi are data points, and X̄ and Ȳ are the means of X and Y, respectively.
Regression Analysis:
o Regression analysis is used to explore and model the relationship between one or more independent variables (predictors) and a dependent variable (outcome). The goal is often to make predictions or understand the impact of independent variables on the dependent variable.
o In linear regression, the relationship between the variables is modeled as a straight line: Y = β0 + β1X, where Y is the dependent variable, X is the independent variable, β0 is the intercept, and β1 is the slope (the effect of X on Y).
o Multiple regression extends this concept by modeling the relationship between a dependent variable and two or more independent variables.
These techniques in inferential statistics help to make predictions, test theories, and understand the relationships between different variables within a dataset.
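The Code section below computes the descriptive measures; as a quick, self-contained illustration of the inferential techniques just described, SciPy's stats module provides ready-made functions for a t-test, Pearson correlation, and simple linear regression (the numbers and the hypothesized mean of 7 below are made up purely for demonstration):
from scipy import stats
# Hypothetical paired observations
x = [1, 2, 3, 4, 5, 6]                       # independent variable
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]          # dependent variable
# One-sample t-test: is the mean of y different from an assumed value of 7?
t_stat, p_value = stats.ttest_1samp(y, popmean=7)
# Pearson correlation between x and y
r, r_pvalue = stats.pearsonr(x, y)
# Simple linear regression Y = β0 + β1·X
slope, intercept, r_value, reg_p, std_err = stats.linregress(x, y)
print("t-test p-value:", p_value)
print("Pearson r:", r)
print("Regression line: Y =", intercept, "+", slope, "* X")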
Code:
import pandas as pd
import numpy as np
from scipy import stats
# Sample data representing sales data
data = {
"Sales": [250, 300, 400, 200, 500, 600, 700, 550, 450, 300]
}
df = pd.DataFrame(data)
# Descriptive Statistics
mean = df["Sales"].mean()
median = df["Sales"].median()
mode = df["Sales"].mode()[0]
std_dev = df["Sales"].std()
variance = df["Sales"].var()