Data Warehousing and Data Mining Lab
Code:
def cleanse_name(name):
    # Remove leading and trailing spaces
    cleaned_name = name.strip()
    # Replace multiple spaces with a single space
    cleaned_name = " ".join(cleaned_name.split())
    # Capitalize each word in the name
    cleaned_name = cleaned_name.title()
    return cleaned_name
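# Hypothetical usage (not from the original sheet): the expected result is shown in the comment
print(cleanse_name("   jOHN    doE  "))   # -> "John Doe"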
Output:
Experiment – 3
Aim: Write a program for data warehouse cleansing to remove redundancy in data.
Theory: Redundancy in data warehousing refers to the presence of
duplicate records or entries, which can result in inaccurate analysis,
increased storage costs, and performance inefficiencies. Redundant data
can arise due to data integration from multiple sources, manual data entry
errors, or other inconsistencies. Removing redundancy helps maintain data
quality, improves efficiency in data processing, and ensures accurate
analysis.
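A minimal sketch of the de-duplication step with pandas is shown below; the file name customers.csv and its CustomerID and Name columns are assumed for illustration only.
import pandas as pd

# Load the source data (hypothetical file name)
df = pd.read_csv("customers.csv")

# Remove fully identical rows to eliminate redundant records
df_cleaned = df.drop_duplicates()
print(f"Removed {len(df) - len(df_cleaned)} redundant records")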
# Alternatively, remove duplicates based on specific columns, e.g., 'CustomerID' and 'Name'
# df_cleaned = df.drop_duplicates(subset=['CustomerID', 'Name'])
4. Click on “Classify” and choose the "IBk" algorithm from the "lazy" section of the classifiers.
5. Configure the model by clicking on the “IBk” classifier.
Higher support indicates that the rule is relevant for a larger portion of the dataset.
Confidence: The likelihood of seeing Y in a transaction that already contains X. Confidence for X→Y is calculated as:
Confidence(X→Y) = Support(X ∪ Y) / Support(X)
Steps:
1. Launch the Weka GUI and click on the "Explorer" button to open the Weka Explorer.
2. Click on the "Open file" button to load your dataset.
3. Click on “Associate” and choose the "Apriori" algorithm from the "associations" section.
Steps:
1. Launch the Weka GUI and click on the "Explorer" button to open the Weka Explorer.
2. Click on the "Open file" button to load your dataset.
5. We can visualize the linear regression model in the visualization tab.
6. We can also visualize the scatter plot between any two attributes from the dataset.
7. We can also select a particular section of the scatter plot to visualize separately by clicking on “Select Instance”.
9. Click on “Submit”.
Experiment – 9
Aim: Perform Data Similarity Measure (Euclidean, Manhattan Distance).
Theory: Data similarity measures quantify how alike two data points or vectors are. These measures are
fundamental in various fields, including machine learning, data mining, and pattern recognition. Among the
most commonly used similarity measures are Euclidean distance and Manhattan distance, which are often
used to assess the closeness of points in a multidimensional space.
1. Euclidean Distance
Definition: Euclidean distance is the straight-line distance between two points in Euclidean space. It is
calculated using the Pythagorean theorem and is one of the most commonly used distance measures in
mathematics and machine learning.
Formula: For two points P and Q in an n-dimensional space, where P = (p1, p2, …, pn) and Q = (q1, q2, …, qn),
the Euclidean distance d is given by:
d(P, Q) = sqrt((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
For example, for P = (1, 2) and Q = (4, 6), d(P, Q) = sqrt(3² + 4²) = 5.
Properties:
• Non-negativity: d(P,Q)≥0
• Identity: d(P,Q)=0 if and only if P=Q
• Symmetry: d(P,Q)=d(Q,P)
• Triangle Inequality: d(P,R)≤d(P,Q)+d(Q,R) for any points P,Q,R
Applications:
• Used in clustering algorithms like K-means to determine the distance between points and centroids.
• Commonly applied in image recognition, pattern matching, and other areas requiring spatial analysis.
2. Manhattan Distance
Definition: Manhattan distance, also known as the "city block" or "taxicab" distance, measures the distance
between two points by summing the absolute differences of their coordinates. It reflects the total grid
distance one would need to travel in a grid-like path.
Formula: For two points P and Q in an n-dimensional space, the Manhattan distance d is calculated as:
d(P, Q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
For the same example points P = (1, 2) and Q = (4, 6), d(P, Q) = 3 + 4 = 7.
Properties:
• Non-negativity: d(P,Q)≥0
• Identity: d(P,Q)=0 if and only if P=Q
• Symmetry: d(P,Q)=d(Q,P)
• Triangle Inequality: d(P,R)≤d(P,Q)+d(Q,R)
Applications:
• Useful in scenarios where movement is restricted to a grid, such as in geographical data analysis.
• Often employed in clustering algorithms and machine learning models where linear relationships are
more meaningful than straight-line distances.
Comparison of Euclidean and Manhattan Distance
1. Geometric Interpretation:
o Euclidean distance measures the shortest path between two points, while Manhattan
distance measures the total path required to travel along axes.
2. Sensitivity to Dimensions:
o Euclidean distance can be sensitive to the scale of data and the number of dimensions, as
it tends to emphasize larger values. In contrast, Manhattan distance treats all dimensions
equally, summing absolute differences.
3. Use Cases:
o Euclidean distance is preferred in applications involving continuous data and geometric
spaces, whereas Manhattan distance is favored in discrete settings, such as grid-based
environments.
Code:
!pip install liac-arff pandas scipy
import arff
from google.colab import files
import pandas as pd
from scipy.spatial import distance
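# Illustrative continuation (assumed, not from the original sheet): compute both
# measures on two sample points instead of an uploaded ARFF file.
p = [1, 2, 3]
q = [4, 6, 8]
print("Euclidean distance:", distance.euclidean(p, q))   # sqrt(3**2 + 4**2 + 5**2) ≈ 7.07
print("Manhattan distance:", distance.cityblock(p, q))   # 3 + 4 + 5 = 12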
Output:
Experiment – 10
Aim: Perform Apriori algorithm to mine frequent item-sets.
Theory: The Apriori algorithm is a classic algorithm used in data mining for mining frequent itemsets and
learning association rules. It was proposed by R. Agrawal and R. Srikant in 1994. The algorithm is
particularly effective in market basket analysis, where the goal is to find sets of items that frequently co-
occur in transactions.
Key Concepts
1. Itemsets:
o An itemset is a collection of one or more items. For example, in a grocery store dataset,
an itemset might include items like {milk, bread}.
2. Frequent Itemsets:
o A frequent itemset is an itemset that appears in the dataset with a frequency greater than
or equal to a specified threshold, called support.
o The support of an itemset X is defined as the proportion of transactions in the dataset that
contain X.
3. Association Rules:
o An association rule is an implication of the form X→Y, indicating that the presence of
itemset X in a transaction implies the presence of itemset Y.
o Rules are evaluated based on two main metrics: support and confidence. The confidence of a
rule is the proportion of transactions that contain Y among those that contain X:
Confidence(X→Y) = Support(X ∪ Y) / Support(X)
4. Lift:
o Lift measures the effectiveness of a rule over random chance and is defined as:
Lift(X→Y) = Confidence(X→Y) / Support(Y)
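A minimal sketch of the experiment in Python is given below, assuming the mlxtend library and a small hypothetical list of transactions; the item names and thresholds are illustrative only.
!pip install mlxtend
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical market-basket transactions
transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "butter", "jam"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
encoded = te.fit(transactions).transform(transactions)
df = pd.DataFrame(encoded, columns=te.columns_)

# Mine itemsets with support >= 0.4, then derive rules with confidence >= 0.6
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(frequent_itemsets)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])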
Output:
Experiment – 11
Aim: Develop different clustering algorithms like K-Means, K-Medoids, Partitioning, and Hierarchical Clustering.
Theory: Clustering is a fundamental technique in data mining and machine learning used to group similar
data points into clusters based on their features. This unsupervised learning method helps in identifying
patterns and structures within datasets. Here, we explore four common clustering algorithms: K-Means, K-
Medoids, Partitioning Algorithm, and Hierarchical Clustering.
1. K-Means Clustering
Overview: K-Means is one of the most popular clustering algorithms that partitions a dataset into K
distinct, non-overlapping subsets (clusters). It aims to minimize the variance within each cluster while
maximizing the variance between clusters.
Algorithm Steps:
1. Initialization: Randomly select K initial centroids from the dataset.
2. Assignment: Assign each data point to the nearest centroid based on the Euclidean distance,
forming K clusters.
3. Update: Calculate the new centroids as the mean of all points assigned to each cluster.
4. Convergence: Repeat the assignment and update steps until the centroids no longer change
significantly or a predetermined number of iterations is reached.
Strengths:
• Simple and easy to implement.
• Efficient for large datasets.
• Scales well with data size.
Weaknesses:
• Requires specifying the number of clusters (K) in advance.
• Sensitive to the initial placement of centroids.
• Prone to converging to local minima.
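To make these steps concrete, a minimal sketch with scikit-learn's KMeans on synthetic data is shown below; the dataset parameters (300 points, three blobs) are assumptions for illustration, not part of the lab sheet.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three natural groups (assumed parameters)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K = 3; several random restarts (n_init) reduce the risk of a poor local minimum
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the clusters and their centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c="red", marker="x")
plt.title("K-Means clustering (K = 3)")
plt.show()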
2. K-Medoids Clustering
Overview: K-Medoids is similar to K-Means but instead of using the mean to represent a cluster, it uses
actual data points called medoids. This method is more robust to noise and outliers compared to K-Means.
Algorithm Steps:
1. Initialization: Randomly select K medoids from the dataset.
2. Assignment: Assign each data point to the nearest medoid based on the chosen distance
metric (commonly Manhattan distance).
3. Update: For each cluster, choose the data point with the smallest total distance to all other points
in the cluster as the new medoid.
4. Convergence: Repeat the assignment and update steps until no changes occur in the medoids.
Strengths:
• More robust to outliers than K-Means since it uses medoids.
• Does not require calculating means, which may not be meaningful in some contexts.
Weaknesses:
• Computationally more expensive than K-Means, especially for large datasets.
• Still requires the specification of the number of clusters (K).
4. Hierarchical Clustering
Overview: Hierarchical clustering builds a hierarchy of clusters either through a bottom-up (agglomerative)
or top-down (divisive) approach. This method does not require a predetermined number of clusters and
allows for the exploration of data at various levels of granularity.
Types:
1. Agglomerative Hierarchical Clustering:
o Starts with each data point as an individual cluster and iteratively merges the closest
clusters until one cluster remains or a stopping criterion is met.
o Common linkage criteria include single-linkage (minimum distance), complete-linkage
(maximum distance), and average-linkage (average distance).
2. Divisive Hierarchical Clustering:
o Starts with one cluster containing all data points and recursively splits it into smaller
clusters based on a chosen criterion.
Strengths:
• Does not require the number of clusters to be specified in advance.
• Produces a dendrogram (tree-like diagram) that visually represents the merging of clusters.
Weaknesses:
• Computationally intensive, especially for large datasets (at least O(n²) time and memory).
• Sensitive to noise and outliers, which can distort the hierarchy.
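As a small illustration of the agglomerative approach, the sketch below builds a dendrogram with SciPy; the synthetic dataset and Ward linkage are assumptions made for this example.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, dendrogram

# Small synthetic dataset (assumed) so the dendrogram stays readable
X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# Bottom-up merging with Ward linkage; Z records which clusters merge and at what distance
Z = linkage(X, method="ward")

dendrogram(Z)
plt.title("Agglomerative clustering dendrogram (Ward linkage)")
plt.show()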
Code:
!pip install scikit-learn
!pip install pyclustering
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import calculate_distance_matrix
from scipy.cluster.hierarchy import dendrogram, linkage
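# The listing below uses X; define it here with make_blobs (assumed parameters)
# so the snippet runs end to end.
X, _ = make_blobs(n_samples=300, n_features=2, centers=3, random_state=42)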
# Generate a smaller sample dataset (you can use a subset of your original data)
sampled_data = X[:50]  # Use the first 50 data points from your dataset
# Initialize the medoid indices (choose random initial medoids from the subset)
initial_medoids = [0, 10, 20]  # Choose medoid indices carefully
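# Illustrative continuation (assumed): run K-Medoids on the sample with the
# pyclustering API and report the resulting clusters and medoid indices.
kmedoids_instance = kmedoids(sampled_data.tolist(), initial_medoids)
kmedoids_instance.process()
print("Clusters:", kmedoids_instance.get_clusters())
print("Medoids:", kmedoids_instance.get_medoids())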
Output:
Experiment - 12
Aim: Apply Validity Measures to evaluate the quality of Data.
Theory: Evaluating data quality is critical in data analysis and machine learning as it directly affects the
performance of models and the validity of insights derived from the data. Various validity measures help
assess different aspects of data quality. Here, we outline the key validity measures demonstrated in the code below, along with their importance.
1. Missing Values
Definition: Missing values occur when data points for certain features are not recorded. They can introduce
bias and reduce the quality of the dataset.
Importance:
• A high proportion of missing values can lead to inaccurate models and biased results.
• Different strategies can be applied to handle missing values, including imputation, deletion, or using
algorithms that can work with missing data.
Measure:
• The code calculates the number of missing values per column, which provides insight into the extent
of the issue.
2. Duplicate Entries
Definition: Duplicate entries refer to identical records in a dataset. They can occur due to errors during data
collection or processing.
Importance:
• Duplicates can skew the results of analyses and lead to overfitting in machine learning models.
• Identifying and removing duplicates is crucial for maintaining data integrity.
Measure:
• The code checks for the number of duplicate entries, helping to quantify this issue in the dataset.
3. Outlier Detection
Definition: Outliers are data points that differ significantly from other observations in the dataset. They can
arise due to variability in the measurement or may indicate a measurement error.
Importance:
• Outliers can disproportionately affect statistical analyses and model training, leading to inaccurate
predictions.
• Identifying outliers allows for the option to investigate them further and decide whether to keep,
remove, or adjust them.
Measure:
• The code uses the Isolation Forest algorithm to detect outliers, reporting the number detected.
This helps understand the presence of anomalies in the dataset.
4. Multicollinearity
Definition: Multicollinearity occurs when two or more independent variables in a regression model are
highly correlated, meaning they contain similar information.
Importance:
• High multicollinearity can inflate the variance of coefficient estimates, making the model unstable
and reducing interpretability.
• It complicates the process of determining the importance of predictors.
Measure:
• The correlation matrix generated in the code reveals the relationships between features, allowing
for the identification of highly correlated features.
5. Feature Distribution (Skewness)
Definition: Skewness measures the asymmetry of the probability distribution of a real-valued random
variable. A skewed distribution can indicate that certain transformations may be needed before modeling.
Importance:
• Features that are heavily skewed can violate the assumptions of certain statistical tests and machine
learning algorithms, impacting model performance.
• Understanding skewness helps in selecting appropriate preprocessing methods (e.g., normalization,
logarithmic transformation).
Measure:
• The code calculates and prints the skewness of each feature, as well as visualizes the feature
distributions, which aids in identifying heavily skewed variables.
6. Class Imbalance
Definition: Class imbalance occurs when the number of instances in each class of a classification problem is
not approximately equal. This is common in binary classification tasks.
Importance:
• Imbalanced classes can lead to biased models that perform well on the majority class but poorly
on the minority class.
• It's crucial to evaluate the class distribution and apply techniques to handle imbalances, such as
resampling methods (oversampling, undersampling) or algorithm adjustments.
Measure:
• The code checks the distribution of the target class, helping to identify any potential imbalance that
could affect model training.
Code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Step 1: Generate sample data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, weights=[0.9, 0.1],
                           random_state=42)
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(10)])
df['Target'] = y

# 3. Detect outliers using Isolation Forest (anomaly detection), assuming 5% of the data is outliers
iso_forest = IsolationForest(contamination=0.05)
outliers = iso_forest.fit_predict(df.drop('Target', axis=1).fillna(df.mean()))  # Fill missing values
outlier_count = sum(outliers == -1)
print(f"\nNumber of Outliers Detected: {outlier_count}")
Output: