Chapter 1: Introduction to Data Mining
Numerical Problems:
1. Suppose a dataset contains 10,000 records, and each record has 20 attributes. Estimate
the storage required if each attribute takes 4 bytes.
2. If a data mining system can analyze 500 records per second, estimate how long it will
take to process 2 million records.
3. A company uses data mining to classify customer transactions. Given that 10% of the
transactions are fraudulent, estimate the number of fraudulent transactions from a
dataset of 50,000 transactions.
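The three estimates above reduce to single multiplications or divisions. A minimal Python sketch for checking them follows; the function names are illustrative and not part of the question bank, and the storage figure ignores any per-record or index overhead (an assumption).

```python
def storage_bytes(records, attributes, bytes_per_attribute):
    """Raw storage, assuming fixed-width attributes and no overhead."""
    return records * attributes * bytes_per_attribute


def processing_seconds(total_records, records_per_second):
    """Time to scan every record once at a constant rate."""
    return total_records / records_per_second


def expected_count(total, fraction):
    """Expected number of items in a class that makes up `fraction` of the data."""
    return round(total * fraction)


storage = storage_bytes(10_000, 20, 4)        # Problem 1
seconds = processing_seconds(2_000_000, 500)  # Problem 2
fraud = expected_count(50_000, 0.10)          # Problem 3

print(f"Storage: {storage} bytes ({storage / 1000:.0f} kB)")
print(f"Processing time: {seconds:.0f} s ({seconds / 60:.1f} min)")
print(f"Expected fraudulent transactions: {fraud}")
```

Under these assumptions the results are 800,000 bytes, 4,000 seconds (about 66.7 minutes), and 5,000 transactions.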
Previous Year Questions:
1. Define data mining and explain its different functionalities.
2. Discuss the major issues in data mining with examples.
3. Explain the Knowledge Discovery in Databases (KDD) process.
4. How does a data warehouse support data mining?
Chapter 2: Data Preprocessing
Numerical Problems:
1. Given the dataset: (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
o Normalize the dataset using min-max normalization to the range [0, 1].
o Perform Z-score normalization using mean and standard deviation.
2. Suppose a dataset contains missing values. If 20% of a 1,000-record dataset has missing
values, how many records need imputation?
3. Given a dataset with 5 categorical attributes, each having 4 possible values, calculate the
number of unique possible records.
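A possible way to check the preprocessing problems above is sketched below in Python. It assumes the population standard deviation (`statistics.pstdev`) for the z-score; if the sample standard deviation is intended, `statistics.stdev` would be used instead and the scores would differ slightly. Variable names are illustrative.

```python
import statistics

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Min-max normalization to [0, 1]: (x - min) / (max - min)
lo, hi = min(data), max(data)
min_max = [(x - lo) / (hi - lo) for x in data]

# Z-score normalization: (x - mean) / standard deviation
# Assumption: population standard deviation (divide by n, not n - 1).
mu = statistics.mean(data)
sigma = statistics.pstdev(data)
z_scores = [(x - mu) / sigma for x in data]

# Problem 2: 20% of 1,000 records need imputation.
records_to_impute = 1_000 * 20 // 100

# Problem 3: 5 categorical attributes with 4 values each.
unique_records = 4 ** 5

print("Min-max:", [round(v, 3) for v in min_max])
print(f"Mean = {mu}, std = {sigma:.4f}")
print("Z-scores:", [round(v, 3) for v in z_scores])
print("Records to impute:", records_to_impute)
print("Unique possible records:", unique_records)
```

Here the mean is 55 and the population standard deviation is the square root of 825, roughly 28.72; the remaining answers are 200 records and 1,024 combinations.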
Previous Year Questions:
1. What are different techniques for handling missing values in a dataset?
2. Explain data discretization and concept hierarchy generation with examples.
3. How is dimensionality reduction helpful in data mining? Explain CUR decomposition.
4. Describe feature selection and feature transformation.
Chapter 3: Mining Frequent Patterns, Associations, and Correlations
Numerical Problems:
1. Given the transactions:
T1: {Milk, Bread, Butter}
T2: {Milk, Bread}
T3: {Milk, Butter}
T4: {Bread, Butter}
o Calculate the support and confidence for the rule {Milk} → {Bread}.
o Find frequent itemsets using the Apriori algorithm with a minimum support of 50%.
2. Suppose you have the following association rules and confidence values:
{A, B} → {C} (Confidence = 70%)
{A} → {C} (Confidence = 50%)
o Compute the Lift Ratio and analyze the association rule strength.
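For Problem 1 above, support, confidence, and the frequent itemsets can be checked with a small level-wise search in the style of Apriori. The Python sketch below is one way to do it; the helper names are illustrative. Lift is defined here as confidence divided by the support of the consequent, so it is computed only for Problem 1: Problem 2 does not state the support of {C}, and that value would have to be assumed before its lift could be evaluated.

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},  # T1
    {"Milk", "Bread"},            # T2
    {"Milk", "Butter"},           # T3
    {"Bread", "Butter"},          # T4
]


def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """confidence / support(consequent); 1.0 indicates independence."""
    return confidence(antecedent, consequent) / support(consequent)


def apriori(min_support):
    """Level-wise search: build size-k candidates from frequent (k-1)-itemsets."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if support([i]) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        prev = set(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset, then count.
        level = [
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
            and support(c) >= min_support
        ]
    return frequent


print("support  =", support({"Milk", "Bread"}))
print("confidence =", round(confidence({"Milk"}, {"Bread"}), 3))
print("lift =", round(lift({"Milk"}, {"Bread"}), 3))
for itemset, s in apriori(0.5).items():
    print(sorted(itemset), s)
```

With these four transactions the rule {Milk} → {Bread} has support 0.5 and confidence 2/3, and its lift of 8/9 is slightly below 1. All three single items and all three pairs meet the 50% threshold, while the three-item set appears in only one transaction and is dropped.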
Previous Year Questions:
1. What are the major steps in the Apriori algorithm? Explain with an example.
2. Discuss how association rule mining differs from correlation analysis.
3. Explain different types of association rule techniques and their applications.
4. What are the methods for measuring the quality of association rules?
Chapter 4: Classification and Prediction
Numerical Problems:
1. A dataset has the following classification results for a binary classifier:
True Positives (TP) = 40, False Positives (FP) = 10
True Negatives (TN) = 30, False Negatives (FN) = 20
o Calculate Accuracy, Precision, Recall, and F1-score.
2. Given the following dataset:
Age: (25, 30, 45, 50, 60)
Salary: (40k, 50k, 80k, 90k, 100k)
o Fit a linear regression model to predict salary based on age.
o Estimate the salary for an employee aged 40.
3. A company uses a decision tree-based classifier. The training dataset has 100 records
with 4 attributes. If each attribute can take 3 different values, calculate the number of
possible attribute-value combinations.
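The classification metrics and the regression fit above can be verified with a few lines of Python. This is a sketch under stated assumptions: the regression uses ordinary least squares with salary expressed in thousands, and the names `fit_line`, `salaries_k`, and `salary_at_40` are illustrative.

```python
# Problem 1: metrics from the confusion matrix.
tp, fp, tn, fn = 40, 10, 30, 20
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Problem 2: simple linear regression, salary (in thousands) on age.
ages = [25, 30, 45, 50, 60]
salaries_k = [40, 50, 80, 90, 100]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ages, salaries_k)
salary_at_40 = slope * 40 + intercept

# Problem 3: 4 attributes with 3 possible values each.
attribute_combinations = 3 ** 4

print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} "
      f"Recall={recall:.3f} F1={f1:.3f}")
print(f"salary = {slope:.4f} * age + {intercept:.4f} (thousands)")
print(f"Predicted salary at age 40: {salary_at_40:.2f}k")
print("Attribute-value combinations:", attribute_combinations)
```

The metrics come out to accuracy 0.70, precision 0.80, recall about 0.667, and F1 about 0.727. The fitted slope is 1480/830 (about 1.783 thousand per year of age), giving roughly 68.43k at age 40, and Problem 3 yields 81 combinations.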
Previous Year Questions:
1. Explain the decision tree algorithm with an example.
2. Differentiate between classification and prediction.
3. What are the different performance evaluation techniques used for classification?
4. Explain the role of logistic regression in prediction problems.
Chapter 5: Cluster Analysis
Numerical Problems:
1. Apply the K-Means clustering algorithm to the dataset:
(2,2), (3,3), (8,8), (9,9)
o Assume K=2 and initial centroids as (2,2) and (8,8).
o Perform one iteration of K-Means and update the centroids.
2. A dataset has 200 points and is divided into 5 clusters. Compute the intra-cluster
distance if the average distance within each cluster is 2.5.
3. A hierarchical clustering algorithm merges clusters A and C. Given D(A, B) = 5 and
D(B, C) = 7, compute the distance between B and the merged cluster using:
o Single linkage method
o Complete linkage method
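One K-Means iteration for Problem 1 and the linkage distances for Problem 3 can be sketched in Python as follows. Two assumptions are made: Euclidean distance for K-Means, and, for Problem 3, that clusters A and C are the ones merged, so the quantity computed is the distance from B to the merged cluster. The function name `kmeans_step` is illustrative.

```python
import math

points = [(2, 2), (3, 3), (8, 8), (9, 9)]
centroids = [(2, 2), (8, 8)]


def kmeans_step(points, centroids):
    """One assignment step followed by one centroid-update step."""
    clusters = [[] for _ in centroids]
    for p in points:
        # Assign each point to its nearest centroid (Euclidean distance).
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    new_centroids = []
    for i, cluster in enumerate(clusters):
        if cluster:
            # New centroid is the coordinate-wise mean of the cluster.
            new_centroids.append(
                tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            )
        else:
            # Keep the old centroid if no point was assigned to it.
            new_centroids.append(centroids[i])
    return clusters, new_centroids


clusters, new_centroids = kmeans_step(points, centroids)
print("Clusters:", clusters)
print("Updated centroids:", new_centroids)

# Problem 3 (assumption: A and C are merged; distance is from B to the merger).
d_ab, d_bc = 5, 7
single_linkage = min(d_ab, d_bc)    # closest pair of members
complete_linkage = max(d_ab, d_bc)  # farthest pair of members
print("Single linkage:", single_linkage)
print("Complete linkage:", complete_linkage)
```

After one iteration the clusters are {(2,2), (3,3)} and {(8,8), (9,9)} with centroids (2.5, 2.5) and (8.5, 8.5). Under the stated merging assumption, single linkage gives 5 and complete linkage gives 7.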
Previous Year Questions:
1. Compare K-Means and Hierarchical clustering techniques.
2. What are the major issues in clustering high-dimensional data?
3. How does outlier detection affect clustering performance?
4. Explain agglomerative and divisive hierarchical clustering techniques.
Chapter 6: Web Mining and Other Data Mining Techniques
Numerical Problems:
1. Given a web log file with 500 entries, where 100 belong to a single user session,
calculate the session duration if the average time spent per entry is 2 minutes.
2. A website has the following clickstream data:
Page A → Page B (50 clicks)
Page B → Page C (30 clicks)
Page C → Page A (20 clicks)
o Compute the transition probability matrix for web usage mining.
3. Given a dataset of multimedia files, where images occupy 40% of the total storage,
videos 50%, and audio 10%, compute the storage occupied by each media type if the
total dataset size is 1 TB.
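The web mining problems above can be checked with the Python sketch below. It assumes the transition matrix is row-normalized (each row gives the probability of the next page given the current page) and that 1 TB is taken as 1,000 GB; with 1,024 GB per TB the storage figures scale accordingly. Names such as `transition_matrix` are illustrative.

```python
# Problem 1: 100 entries in the session at 2 minutes per entry.
session_minutes = 100 * 2

# Problem 2: row-normalized transition probabilities from click counts.
clicks = {("A", "B"): 50, ("B", "C"): 30, ("C", "A"): 20}
pages = sorted({page for pair in clicks for page in pair})


def transition_matrix(clicks, pages):
    """P[i][j] = clicks from page i to page j / all clicks leaving page i."""
    matrix = []
    for src in pages:
        out_total = sum(clicks.get((src, dst), 0) for dst in pages)
        row = [
            clicks.get((src, dst), 0) / out_total if out_total else 0.0
            for dst in pages
        ]
        matrix.append(row)
    return matrix


matrix = transition_matrix(clicks, pages)

# Problem 3: storage per media type (assumption: 1 TB = 1,000 GB).
total_gb = 1_000
shares_percent = {"images": 40, "videos": 50, "audio": 10}
storage_gb = {kind: total_gb * pct // 100 for kind, pct in shares_percent.items()}

print("Session duration:", session_minutes, "minutes")
print("Pages:", pages)
for src, row in zip(pages, matrix):
    print(src, row)
print("Storage (GB):", storage_gb)
```

Because each page in the clickstream has exactly one outgoing link, every row of the matrix contains a single probability of 1.0 (A → B, B → C, C → A). The session lasts 200 minutes, and the storage split is 400 GB of images, 500 GB of video, and 100 GB of audio under the 1,000 GB assumption.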
Previous Year Questions:
1. What is web usage mining? Explain its applications.
2. Discuss various types of web mining and their importance.
3. Explain how spatial and temporal data mining differ from traditional data mining
techniques.
4. What are the challenges in multimedia mining?