Data Mining
In simple terms, Bayes' theorem allows us to update our belief about the probability of an event
occurring (posterior) by considering prior knowledge (prior) and the likelihood of observing specific
data.
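Formally, for a class C and an observed feature vector X, Bayes' theorem is written as:
P(C | X) = P(X | C) × P(C) / P(X)
where P(C) is the prior, P(X | C) is the likelihood, and P(C | X) is the posterior probability of the class given the data.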
The Naive Bayes classifier is a specific application of Bayes' theorem for classification tasks. It is called
"naive" because it makes a key simplifying assumption: it assumes that all features are independent
given the class label. In other words, the probability of observing one feature is independent of the
probability of observing another, given the class. This assumption, although often unrealistic in real-
world data, drastically simplifies the computations and enables the classifier to be both efficient and
scalable.
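Under this independence assumption, the likelihood factorizes into a product of per-feature terms:
P(X1, X2, ..., Xn | C) = P(X1 | C) × P(X2 | C) × ... × P(Xn | C)
so the classifier only needs to estimate a simple one-dimensional distribution for each feature.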
Code:
Here's Python code to perform classification using Naive Bayes with the scikit-learn library.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
# Load the dataset
data = load_iris()
X = data.data
y = data.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the Gaussian Naive Bayes classifier
gnb = GaussianNB()
# Train the classifier
gnb.fit(X_train, y_train)
# Predict on the test set
y_pred = gnb.predict(X_test)
# Output accuracy and classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", report)
Output:
After running the code, the output should show:
Accuracy: A percentage representing the model's accuracy on the test set.
Classification Report: A detailed breakdown of precision, recall, F1-score, and support for each class.
Experiment 2
Aim: To perform cluster analysis using the K-Means clustering method in Python.
Theory:
K-Means Clustering Algorithm
K-Means is a widely used clustering algorithm designed to partition data into a predefined number of clusters, denoted as k. Each cluster contains data points that are more similar to each other than to those in other clusters. The algorithm groups data by minimizing the variance (or distance) within each cluster, creating compact and well-separated groups. It is an unsupervised learning technique, often applied in data analysis, image segmentation, customer segmentation, and other tasks where natural groupings in data are needed.
Steps of the K-Means Algorithm
1. Initialize:
o Select k initial cluster centroids randomly from the dataset.
o These centroids serve as the starting points for the clusters.
2. Assign:
o For each data point, calculate the Euclidean distance to each centroid.
o Assign each data point to the nearest centroid, thus forming k clusters.
3. Update:
o For each cluster, recompute the centroid by calculating the mean of all data points assigned to that cluster.
o This centroid represents the "center" of the cluster and serves as the updated position for that cluster.
4. Repeat:
o Repeat steps 2 and 3 until one of the following conditions is met:
Centroids stabilize (i.e., the centroids do not change between iterations).
A predefined maximum number of iterations is reached.
The algorithm is guaranteed to converge, though it may not always find the global optimum due to random initialization of centroids. To improve results, K-Means is often run multiple times with different initializations, and the best outcome is chosen based on minimized variance.
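The steps above can be written out directly in a few lines of NumPy. The sketch below is only illustrative (random initialization, no handling of empty clusters, a fixed iteration cap); the scikit-learn implementation used in the Code section handles these details more robustly.
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids at random from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled in this simplified sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids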
Code:
Here's Python code to perform K-Means clustering using the scikit-learn library.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate synthetic data for clustering
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
# Initialize the KMeans model with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=0)
# Fit the model to the data
kmeans.fit(X)
# Predict the clusters
y_kmeans = kmeans.predict(X)
# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("K-Means Clustering with k=4")
plt.show()
Output:
Experiment 3
Aim: To perform hierarchical clustering using Python.
Theory:
Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters, represented by
a tree-like structure known as a dendrogram. Unlike methods that require specifying the number of
clusters in advance, hierarchical clustering provides a visual representation of the data's clustering
structure at different levels, allowing users to decide the number of clusters based on the dendrogram.
This method is commonly used in biological taxonomy, document clustering, and other fields where
hierarchical relationships are significant.
Approaches to Hierarchical Clustering
There are two main approaches: agglomerative (bottom-up) clustering, where each data point starts as its own cluster and the closest clusters are merged step by step, and divisive (top-down) clustering, where all points start in a single cluster that is split recursively.
The Dendrogram
A dendrogram is a tree-like diagram that illustrates the order in which clusters are merged (in
agglomerative clustering) or split (in divisive clustering). Each node in the dendrogram represents a
cluster, and the height of each node represents the distance (or dissimilarity) at which clusters are
merged or split. By cutting the dendrogram at a specific height, users can determine the number of
clusters in the data.
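As a quick illustration of "cutting" the dendrogram in code, SciPy's fcluster function turns a linkage matrix into flat cluster labels at a chosen threshold. The data and cut heights below are made-up values for demonstration only:
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np
# Tiny example dataset; Ward linkage builds the hierarchy
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
Z = linkage(X, method='ward')
# Cut the tree at distance 4.0: every merge above this height is ignored
labels = fcluster(Z, t=4.0, criterion='distance')
# Alternatively, request an exact number of clusters instead of a height
labels_k2 = fcluster(Z, t=2, criterion='maxclust')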
Linkage Methods
In hierarchical clustering, the choice of linkage method affects how distances between clusters are
calculated. The two most common linkage methods are:
1. Single Linkage:
o Measures the minimum distance between any two points in different clusters.
o Tends to create elongated, "chain-like" clusters, as clusters are merged based on the closest pair
of points.
o Sensitive to noise and outliers, leading to possible chaining effects, where clusters may be
combined due to a single close data point.
2. Complete Linkage:
o Measures the maximum distance between any two points in different clusters.
o Ensures that clusters are compact, as they are only merged if all points in one cluster are close to
all points in the other.
o Results in more compact and evenly shaped clusters, but may lead to smaller clusters being
absorbed by larger ones.
Other linkage methods include average linkage (based on the average distance between points in
clusters) and Ward’s method (minimizing the variance within clusters), each providing different cluster
shapes and stability.
Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
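The listing above stops after the imports; a minimal sketch of how the example could continue, using synthetic blob data and Ward linkage (the method argument is where the linkage criteria discussed above, such as 'single', 'complete', 'average', or 'ward', are selected):
# Generate synthetic data for clustering
X, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.8, random_state=0)
# Build the linkage matrix; method can be 'single', 'complete', 'average', or 'ward'
Z = linkage(X, method='ward')
# Plot the dendrogram
plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.title("Hierarchical Clustering Dendrogram (Ward linkage)")
plt.xlabel("Data points")
plt.ylabel("Distance")
plt.show()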
Y = b0 + b1X
where:
Y is the dependent variable,
X is the independent variable,
b0 is the intercept, and
b1 is the slope.
The objective in simple linear regression is to find the best-fitting line, meaning the line that minimizes the sum of squared differences between the observed values and the predicted values. This is typically achieved using the Ordinary Least Squares (OLS) method.
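In symbols, OLS chooses b0 and b1 so as to minimize the sum of squared errors:
SSE = ∑ (Yi − (b0 + b1·Xi))²
where the sum runs over all observations i.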
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
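The listing again stops at the imports; a possible continuation is sketched below, using synthetic data with a known linear trend (the coefficients and noise level are arbitrary illustrative choices):
# Generate synthetic data: Y = 4 + 3*X plus random noise
rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X.ravel() + rng.randn(100)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the simple linear regression model (OLS)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Report the fitted line and error metrics
print("Intercept (b0):", model.intercept_)
print("Slope (b1):", model.coef_[0])
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
# Plot the test data and the fitted line
plt.scatter(X_test, y_test, label="Observed")
plt.plot(X_test, y_pred, color='red', label="Fitted line")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()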
Outliers are data points that significantly deviate from the rest of the dataset, often lying far from the mean or median. These extreme values can arise due to various reasons, including natural variability in data, measurement errors, or mistakes in data entry or collection. While some outliers may represent important information (e.g., rare events), they can often distort statistical analyses, skew results, and reduce the accuracy of predictive models. Identifying and managing outliers is a crucial step in the data preprocessing phase of machine learning and statistical analysis.
Impacts of Outliers
Bias: Outliers can skew statistical estimates like the mean, variance, and standard deviation, leading to biased results.
Model Performance: In predictive modeling, outliers can reduce the performance of algorithms such as linear regression, as they can disproportionately influence the fit of the model, leading to inaccurate predictions.
Interpretation: Outliers may lead to misinterpretation of data trends and patterns, particularly when not properly handled.
Thus, detecting and addressing outliers is essential for improving the reliability of analytical models.
Common Methods for Outlier Detection
Several techniques can be used to identify outliers in a dataset. Below are the most widely used methods:
1. Z-score Method
The Z-score is a statistical measure that quantifies how far a data point is from the mean in terms of standard deviations. If a data point’s Z-score is significantly high or low, it can be considered an outlier.
Formula for Z-score:
Z = (X − μ) / σ
Where:
X is the individual data point,
μ is the mean of the dataset,
σ is the standard deviation.
A typical threshold for identifying outliers using the Z-score is ±3 standard deviations. If the Z-score is greater than 3 or less than -3, the data point is considered an outlier.
Example:
For a dataset of test scores, if a student’s score has a Z-score of 4, it is considered an outlier since it is 4 standard deviations away from the mean.
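A small sketch of the Z-score method in Python (the sample values are made up for illustration; the Code section below uses the IQR method instead):
import numpy as np
# Hypothetical test scores with one extreme value
scores = np.array([61, 64, 66, 59, 63, 67, 62, 65, 68, 60,
                   64, 66, 63, 61, 67, 65, 62, 68, 64, 120])
# Compute each point's Z-score
z_scores = (scores - scores.mean()) / scores.std()
# Flag points more than 3 standard deviations from the mean
outliers = scores[np.abs(z_scores) > 3]
print("Z-scores:", np.round(z_scores, 2))
print("Outliers:", outliers)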
2. Interquartile Range (IQR) Method
The IQR method is based on the spread of the middle 50% of the data. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1). Outliers are considered as data points that fall outside a range defined by the IQR.
Formula for Identifying Outliers:
Lower bound: Q1 − 1.5 × IQR
Upper bound: Q3 + 1.5 × IQR
Where:
Q1 is the first quartile (25th percentile),
Q3 is the third quartile (75th percentile),
IQR is the interquartile range, calculated as Q3 − Q1.
Any data point that falls below the lower bound or above the upper bound is considered an outlier.
Example:
For a dataset of house prices, if a house price falls below Q1−1.5×IQR or above Q3+1.5×IQR, it
would be flagged as an outlier.
3. Boxplot Visualization
A boxplot (also known as a box-and-whisker plot) is a graphical method that displays the distribution of data and visually highlights potential outliers. It uses the IQR to define the "whiskers" (the range of non-outlier data), and any data points outside this range are plotted as individual points, often called "outliers."
The box in a boxplot represents the interquartile range (from Q1 to Q3).
The line inside the box represents the median of the dataset.
The whiskers extend from the box to the smallest and largest values within the normal range (within 1.5 times the IQR from Q1 and Q3).
Data points outside the whiskers are plotted as individual dots or symbols, indicating outliers.
Example: A boxplot of salaries might show a few individuals earning much higher than the rest of the group. These higher earnings will be flagged as outliers by the boxplot.
Code:
Here, we’ll use the IQR method to detect outliers in a sample dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
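The listing stops after the imports; a minimal sketch of how the IQR-based detection (and a boxplot, since matplotlib is already imported) could proceed, using a small made-up dataset:
# Hypothetical sample data with two extreme values
data = pd.Series([23, 25, 27, 22, 24, 26, 28, 25, 95, 24, 23, 27, 26, 2])
# Compute the quartiles and the interquartile range
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
# Define the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Flag points outside the bounds as outliers
outliers = data[(data < lower_bound) | (data > upper_bound)]
print("Lower bound:", lower_bound, "Upper bound:", upper_bound)
print("Outliers:\n", outliers)
# Visualize the distribution and the outliers with a boxplot
plt.boxplot(data)
plt.title("Boxplot with Outliers")
plt.ylabel("Value")
plt.show()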
Output:
Experiment 7
Aim: To demonstrate association rule mining using the FP-Growth algorithm on supermarket transaction data.
Theory:
Introduction to FP-Growth
The FP-Growth (Frequent Pattern Growth) algorithm is an advanced and more efficient method for mining frequent itemsets in large datasets. It was developed as an alternative to the Apriori algorithm to overcome its limitations, particularly the large number of database scans and the candidate generation step, which can be computationally expensive. FP-Growth leverages a more efficient approach by utilizing a compact data structure known as an FP-tree (Frequent Pattern Tree).
The main advantage of FP-Growth is its ability to mine frequent itemsets without generating candidate itemsets, which significantly improves performance, especially with large datasets. This makes FP-Growth highly suitable for tasks such as market basket analysis, where identifying associations between items, like products often bought together, is key for business strategies such as product placement, promotions, and cross-selling.
Steps in the FP-Growth Algorithm
FP-Growth operates in two main steps:
1. Constructing the FP-tree
The first step in the FP-Growth algorithm involves constructing an FP-tree. Here’s how it works:
Scan the dataset: The algorithm starts by scanning the dataset to find frequent items. Only items that appear frequently enough (according to a minimum support threshold) are retained.
Sorting Items: The frequent items are sorted in decreasing order of their frequency. This ensures that the FP-tree is built in a way that higher-frequency items appear first, making the tree structure more compact.
Building the Tree:
o The FP-tree is constructed by reading the transactions in the dataset. Each transaction is represented as a path in the tree.
o If a transaction contains the itemset {A,B,C}, for instance, and A is the first frequent item, B the second, and C the third, the transaction will be represented as a path extending from the root node in the tree.
o If the itemset is already present in the tree, the algorithm increments the count for the respective node; otherwise, a new node is created.
The tree is thus "compressed" because it represents itemsets and their counts in a more efficient way than the Apriori algorithm’s candidate generation method.
2. Mining the FP-tree
Once the FP-tree is constructed, the next step is mining the FP-tree to extract frequent itemsets. This
process does not require generating candidate itemsets, which significantly reduces the computational cost compared to Apriori. Here's how mining works:
Conditional Pattern Base: For each frequent item in the FP-tree, a conditional pattern base is built. This is a set of patterns (prefix paths in the tree) that co-occur with the item. It represents the context in which the item is frequently found.
Conditional FP-tree: A new FP-tree is constructed for each frequent item based on the conditional pattern base. This process is repeated recursively for each item, and the frequent itemsets are extracted from each conditional FP-tree.
Frequent Itemset Extraction: Once the recursive process completes, all frequent itemsets are extracted, including combinations of items from various paths in the tree.
The key feature of FP-Growth is that it can mine frequent itemsets directly from the FP-tree without generating candidate itemsets, making it much faster and more scalable than Apriori, especially with large datasets.
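The code below starts from a one-hot encoded DataFrame in which each column is an item and each row is a transaction. If the transactions are instead available as raw item lists, mlxtend's TransactionEncoder can produce that format first; a small sketch, assuming the mlxtend library is installed:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# Raw transactions as lists of items (same contents as the 0/1 matrix used below)
transactions = [
    ['Milk', 'Bread', 'Eggs'],
    ['Bread', 'Butter', 'Eggs', 'Cheese'],
    ['Milk', 'Butter', 'Eggs', 'Cheese'],
    ['Milk', 'Bread', 'Butter'],
    ['Bread', 'Butter', 'Eggs']
]
# Convert to a one-hot encoded DataFrame (True/False per item)
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)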
Code:
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules
# Sample dataset representing supermarket transactions
data = {
'Milk': [1, 0, 1, 1, 0],
'Bread': [1, 1, 0, 1, 1],
'Butter': [0, 1, 1, 1, 1],
'Eggs': [1, 1, 1, 0, 1],
'Cheese': [0, 1, 1, 0, 0]
}
df = pd.DataFrame(data)
# Applying the FP-Growth algorithm to find frequent itemsets
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
# Generating association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
# Display results
print("Frequent Itemsets:\n", frequent_itemsets)
print("\nAssociation Rules:\n", rules)
Output:
Experiment 8
Aim: To perform statistical analysis on a dataset using Python.
Theory:
Statistical analysis is the process of collecting, organizing, summarizing, and interpreting data to uncover underlying patterns, trends, and insights. It is widely used in various fields such as business, science, engineering, and social studies to make informed decisions, predict outcomes, and understand relationships between variables. Statistical methods are typically divided into two main branches: Descriptive Statistics and Inferential Statistics.
1. Descriptive Statistics
Descriptive statistics involve summarizing or describing the features of a dataset using numerical measures and visual tools. These measures help to present large amounts of data in a more digestible form. The main measures in descriptive statistics include:
Mean:
o The mean (or average) is the sum of all values in the dataset divided by the total number of values. It provides a measure of the central tendency of the data.
o Formula: Mean = ∑Xi / n, where Xi is each data point and n is the number of data points.
Median:
o The median is the middle value in the dataset when it is arranged in ascending or
descending order. If there is an even number of values, the median is the average of the
two middle numbers. The median is useful for understanding the central tendency,
especially when data is skewed.
Mode:
o The mode is the value that appears most frequently in a dataset. A dataset may have one
mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all if no
value repeats.
Standard Deviation (SD):
o Standard deviation measures how spread out the data points are around the mean. A small SD indicates that the data points are clustered near the mean, while a large SD suggests that the data points are more spread out.
o Formula: SD = √( ∑(Xi − μ)² / n ), where μ is the mean of the data and Xi are the individual data points.
Variance:
o Variance is the square of the standard deviation. It measures the degree of spread in the data, indicating how far data points are from the mean.
o Formula: Variance = SD²
These measures of central tendency and variability provide a snapshot of the data, helping to summarize
large datasets into meaningful insights.
2. Inferential Statistics
Inferential statistics is the branch of statistics that makes inferences or generalizations about a population based on a sample of data. It goes beyond just describing data and attempts to make predictions or draw conclusions. Key techniques in inferential statistics include:
Hypothesis Testing:
o Hypothesis testing is a method used to determine whether there is enough statistical evidence to support a specific hypothesis or claim about the data. It involves two competing hypotheses:
Null Hypothesis (H₀): The assumption that there is no effect or relationship.
Alternative Hypothesis (H₁): The assumption that there is an effect or relationship.
o Statistical tests, such as the t-test, chi-square test, and ANOVA, are used to evaluate the hypothesis, with a p-value indicating whether the null hypothesis can be rejected.
Correlation:
o Correlation measures the strength and direction of the relationship between two variables. It tells us whether and how strongly pairs of variables are related.
o The most commonly used measure of correlation is the Pearson correlation coefficient (r), which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). A value of 0 indicates no linear relationship.
o Formula for Pearson correlation: r = ∑(Xi − X̄)(Yi − Ȳ) / √( ∑(Xi − X̄)² × ∑(Yi − Ȳ)² ), where Xi and Yi are data points, and X̄ and Ȳ are the means of X and Y, respectively.
Regression Analysis:
o Regression analysis is used to explore and model the relationship between one or more independent variables (predictors) and a dependent variable (outcome). The goal is often to make predictions or understand the impact of independent variables on the dependent variable.
o In linear regression, the relationship between the variables is modeled as a straight line: Y = β0 + β1X, where Y is the dependent variable, X is the independent variable, β0 is the intercept, and β1 is the slope (the effect of X on Y).
o Multiple regression extends this concept by modeling the relationship between a dependent variable and two or more independent variables.
These techniques in inferential statistics help to make predictions, test theories, and understand the relationships between different variables within a dataset.
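The Code section below computes the descriptive measures; as a quick, self-contained illustration of the inferential techniques just described, SciPy's stats module provides ready-made functions for a t-test, Pearson correlation, and simple linear regression (the numbers and the hypothesized mean of 7 below are made up purely for demonstration):
from scipy import stats
# Hypothetical paired observations
x = [1, 2, 3, 4, 5, 6]                       # independent variable
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]          # dependent variable
# One-sample t-test: is the mean of y different from an assumed value of 7?
t_stat, p_value = stats.ttest_1samp(y, popmean=7)
# Pearson correlation between x and y
r, r_pvalue = stats.pearsonr(x, y)
# Simple linear regression Y = β0 + β1·X
slope, intercept, r_value, reg_p, std_err = stats.linregress(x, y)
print("t-test p-value:", p_value)
print("Pearson r:", r)
print("Regression line: Y =", intercept, "+", slope, "* X")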
Code:
import pandas as pd
import numpy as np
from scipy import stats
# Sample data representing sales data
data = {
"Sales": [250, 300, 400, 200, 500, 600, 700, 550, 450, 300]
}
df = pd.DataFrame(data)
# Descriptive Statistics
mean = df["Sales"].mean()
median = df["Sales"].median()
mode = df["Sales"].mode()[0]
std_dev = df["Sales"].std()
variance = df["Sales"].var()