Tutorial_DataMiningENG

Uploaded by

chamarilk

Data mining: concepts and algorithms

Practice – Data mining

Objective
Exploit data mining algorithms to analyze a real dataset using the RapidMiner machine learning tool. The
practice session is organized in two parts. The first part focuses on classification algorithms while the second
one focuses on clustering algorithms.

Dataset
The Users dataset (Users.xls, downloadable from the course web page) collects census data about
American users of a given company. Users are classified as “basic” or “premium” according to the
services they most commonly request. Each dataset record corresponds to a different user. The dataset
contains around 1,000 users and includes some personal characteristics (e.g., age, sex, workclass) as well
as the corresponding class. The class attribute, which is used as the prediction target throughout the
practice, is the last attribute of each record.

The complete list of dataset attributes is reported below.

(1) Age
(2) Workclass
(3) FlnWgt
(4) Education
(5) Education-num
(6) Marital status
(7) Occupation
(8) Relationship
(9) Race
(10) Sex
(11) Capital Gain
(12) Capital loss
(13) Hours per week
(14) Native country
(15) class (class attribute)

Part I: Classification problem

Context
Analysts want to predict the class of new users based on the characteristics of already classified users. To
this purpose, analysts exploit three different classification algorithms: a decision tree (Decision Tree), a
Bayesian classifier (Naïve Bayes), and a distance-based classifier (K-NN). The Users dataset is used to train
the classifiers and to validate their performance.
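
As an illustration of the distance-based classifier named above, the following is a minimal sketch of K-NN prediction by majority vote. It is not RapidMiner's implementation: the function `knn_predict`, the toy records in `toy_users`, and the choice of two numeric features are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs with numeric features.
    """
    # Sort the training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # Count the labels of the k closest examples and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (age, hours per week) -> class.
toy_users = [
    ((25, 20), "basic"),
    ((30, 25), "basic"),
    ((28, 22), "basic"),
    ((45, 50), "premium"),
    ((50, 55), "premium"),
]

print(knn_predict(toy_users, (48, 52), k=3))  # the two nearby premium users win the vote
print(knn_predict(toy_users, (48, 52), k=5))  # all five users vote, so the majority class wins
```

On this toy data the same query point is labeled differently with K=3 and K=5, which is the kind of sensitivity to K that the practice asks you to study.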

Goal
The aim of this part of the practice is to generate and analyze different classification models and validate their
performance on the Users dataset using the RapidMiner tool. Different RapidMiner processes have to be
developed. To evaluate classification performance, different configuration settings have to be tested and
compared with each other. A 10-fold Stratified Cross-Validation process must be used to validate classifier
performance. The results achieved by each algorithm should be examined to understand the impact of the
main input parameters.
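
To make the validation scheme concrete, here is a minimal sketch of how a stratified 10-fold split can be built. It is only an illustration under simplifying assumptions: the function names are hypothetical, the records are dealt to folds in their original order (no shuffling), and the 70/30 class mix is made up.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=10):
    """Assign each example index to a fold so that class proportions stay similar per fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    # Deal the indices of each class round-robin across the folds.
    for indices in by_class.values():
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, n_folds=10):
    """Yield (train_indices, test_indices); each fold is used as the test set exactly once."""
    folds = stratified_folds(labels, n_folds)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical class distribution: 70 basic users and 30 premium users.
labels = ["basic"] * 70 + ["premium"] * 30
splits = list(cross_validation_splits(labels, n_folds=10))
for train, test in splits[:2]:
    premium_in_test = sum(labels[i] == "premium" for i in test)
    print(len(train), len(test), premium_in_test)
```

Each test fold keeps the 70/30 class ratio of the whole dataset, and the accuracy reported by cross-validation is the average over the ten train/test rounds.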

Questions
Answer the following questions:

1. Learn a Decision Tree using the whole dataset as training data and the default configuration setting
for the Decision Tree algorithm. (a) Which attribute is deemed to be the most discriminative one for class
prediction? (b) What is the height of the generated Decision Tree? (c) Find an example of a pure
partition in the generated Decision Tree.
2. Analyze the impact of the minimal gain parameter (using the gain ratio splitting criterion) on the
characteristics of the Decision Tree model learnt from the whole dataset (keep the default
configuration for all the other parameters).
3. Use a 10-fold Stratified Cross-Validation approach to validate the accuracy of the generated
classification model. What is the impact of the minimal gain parameter on the average accuracy
achieved by the generated Decision Tree? Compare the confusion matrices achieved using different
parameter settings (keep the default configuration for all the other parameters).
4. Considering the K-Nearest Neighbor (K-NN) classifier and performing a 10-fold Stratified
Cross-Validation, what is the impact of parameter K (number of considered neighbors) on the classifier
performance? Compare the confusion matrices achieved using different K parameter values. Perform
a 10-fold Stratified Cross-Validation with the Naïve Bayes classifier. Does K-NN perform on average
better or worse than the Naïve Bayes classifier on the analyzed data?
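
For Question 2, it may help to see how the gain ratio of a candidate split is computed. The sketch below uses the textbook definition (information gain divided by split information) for a categorical attribute; the function names and toy values are hypothetical, and RapidMiner's internal computation may differ in details.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on the categorical attribute `values`."""
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    # Information gain: entropy before the split minus the weighted entropy after it.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes that have many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records.
classes = ["premium", "premium", "basic", "basic"]
separating = ["M", "M", "F", "F"]      # splits the classes perfectly
uninformative = ["a", "b", "a", "b"]   # leaves both partitions mixed

print(gain_ratio(separating, classes))
print(gain_ratio(uninformative, classes))
```

A minimal gain threshold then acts as a cutoff on this score: a node is split only if the best split scores above the threshold, so raising it tends to produce smaller trees.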

Part II: Clustering problem

Context
Analysts want to identify groups of similar users. More specifically, they want to segment the users into a set of
groups (clusters). For each cluster an ad-hoc advertising campaign will be designed. To this purpose, analysts
exploit two different clustering algorithms: a k-Means clustering algorithm (K-Means) and a density-based
algorithm (DBScan). The Users dataset contains the users to analyze. Note that only the numerical
attributes are used during this second part of the practice.

Goal
The aim of this second part of the practice is to generate and analyze different clustering models and validate
their performance on the Users dataset using the RapidMiner tool. To evaluate clustering performance,
different configuration settings have to be tested and compared with each other. The average within-cluster
distance will be used to validate clustering performance. The results achieved by each algorithm should be
examined to understand the impact of the main input parameters.

Questions
Answer the following questions:

1. Apply the k-Means algorithm to cluster the users. Analyze the characteristics of the generated clusters
(e.g., the size of the extracted clusters).
2. Analyze the impact of parameter k (number of generated clusters) on the generated clusters. More
specifically, perform an empirical analysis by using the average within-cluster distance measure (a
Cluster Cohesion measure) to evaluate the impact of the value of k on the quality of the generated
clusters. What is the impact of k on the generated clusters?
3. Consider the DBScan algorithm (a density-based algorithm) and compare its performance with that
of the k-Means algorithm. What is the impact of parameter epsilon on the performance of DBScan
(in terms of average within-cluster distance)?
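
As background for Questions 1 and 2, the following is a minimal sketch of the standard k-means procedure (Lloyd's algorithm). It is an illustration only: the function `k_means` and the toy points are hypothetical, and taking the first k points as initial centroids is a simplification chosen to keep the example deterministic.

```python
import math


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points serve as initial centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


# Hypothetical, already normalized 2-D records forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

The cluster sizes asked for in Question 1 can be read off the returned assignment by counting how many points carry each cluster index.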

Program setup
- Run the RapidMiner application under Windows XP

Process building and analysis

- Create a new RapidMiner process.
- Build the data mining flow by dragging the operators available on the left-hand side menu and
dropping them into the main process window.

Figure 1. Operators

- To handle process execution, use the Start/Stop/Pause buttons. To view the results, change the
perspective from Design to Results.

Figure 2. Execution/perspective change buttons (callouts: “Switch to Design view”, “Switch to Results view”)

- Look into the content of the Users dataset, which is available in the Excel format (.xls).

Classification task

Figure 3. Decision tree classification process

- Execute the process and analyze the Decision Tree generated through the Results perspective.
- Change the configuration setting for the Decision Tree algorithm by clicking on the corresponding operator
and using the right-hand side menu in the Design perspective. Specifically, vary the minimal gain
threshold value to analyze its impact on the characteristics of the classification model.
- Modify the process flow in order to perform a 10-fold Stratified Cross-Validation. To this aim, first
replace the Decision Tree operator in the main process with the “Validation” operator.

Figure 4. 10-Fold Cross-Validation process.

Next, double-click the “Validation” operator and create a nested process like the one reported below:

Figure 5. Validation subprocess

- Moving to the Results perspective, analyze the confusion matrix generated by the validation process.
- Substitute the classifier operator with the Naïve Bayes classifier first and with the K-NN classifier next
and analyze the achieved results.
- Compare the performance of K-NN and Naïve Bayes in terms of average accuracy by
analyzing the corresponding confusion matrices. For the K-NN classifier, vary the value of parameter K using
the right-hand side menu in the Design perspective.
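
When reading the confusion matrices in the Results perspective, it helps to know how accuracy and per-class recall follow from the counts. The sketch below shows the arithmetic; the helper names and the ten hypothetical predictions are assumptions made for the example.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(matrix):
    """Fraction of examples whose predicted class equals the true class."""
    correct = sum(n for (true, pred), n in matrix.items() if true == pred)
    return correct / sum(matrix.values())


def recall(matrix, cls):
    """Fraction of the examples of class `cls` that were predicted as `cls`."""
    actual = sum(n for (true, _), n in matrix.items() if true == cls)
    return matrix[(cls, cls)] / actual if actual else 0.0


# Hypothetical outcome on ten users: 6 basic and 4 premium.
true_labels = ["basic"] * 6 + ["premium"] * 4
predicted = ["basic"] * 5 + ["premium"] + ["premium"] * 3 + ["basic"]

matrix = confusion_matrix(true_labels, predicted)
print(accuracy(matrix))           # 8 of 10 predictions are correct
print(recall(matrix, "premium"))  # 3 of 4 premium users are recognized
```

Two classifiers with similar accuracy can still differ in per-class recall, which is why the practice asks you to compare the full matrices and not only the average accuracy.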

Clustering task

- Import the source data into the Data Mining process by using the operator “Read Excel”. To import
data use the Data Import Wizard as follows:
o Select the source file (Step 1).
o Select all the spreadsheet content (Step 2).
o Annotate the first row as the attribute name (label “name”), while keeping all the remaining
rows unlabeled (“-”) at Step 3.
- Select only the numerical attributes (i.e., exclude non-numerical attributes) by means of
the “Select Attributes” operator. Set the “attribute filter type” parameter of the operator to “subset”,
then click on the “Select attributes” button and select the subset of numerical attributes (age,
FlnWgt, Education-Num, Capital gain, Capital loss, and Hours-per-week).
- Normalize data values by means of the operator “Normalize”. Set “attribute filter type” to all and
“method” to Z-transformation.
- Include the “k-Means” clustering operator at the end of the data mining flow. The currently generated
process looks like the following one:
Figure 6. k-means (clustering) process
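
The Z-transformation applied in the normalization step rescales each attribute to zero mean and unit standard deviation, so that attributes with large ranges (e.g., Capital gain) do not dominate the distance computation. A minimal sketch follows; the function name and the age values are hypothetical, and the use of the sample standard deviation here is an assumption.

```python
import statistics


def z_transform(values):
    """Rescale `values` to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    if std == 0:
        return [0.0 for _ in values]  # a constant attribute carries no information
    return [(v - mean) / std for v in values]


# Hypothetical ages of five users.
ages = [20, 30, 40, 50, 60]
normalized_ages = z_transform(ages)
print(normalized_ages)
```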

- Execute the process and analyze the generated clusters (number of clusters, number of data per
cluster).
- To validate the quality of the generated clusters in terms of cluster cohesion, two more operators must
be included in the process: the “Data to similarity” operator followed by the “Cluster density
performance” operator. “Data to similarity” has one input (the example dataset) and two outputs (the
distance between each pair of objects of the example dataset, and the dataset itself). It computes the
distance between the objects of the input dataset. Select “numerical measure” as measure type and
“EuclideanDistance” as measure. “Cluster density performance” computes the average within-cluster
distance, and hence the cohesion, of a set of clusters by averaging the distances between each pair of
examples of a cluster. It has three mandatory inputs: (i) the cluster model (i.e., the output of the
clustering algorithm), (ii) the set of considered objects (the dataset of users in our case), and (iii) the
distance between the considered objects (i.e., the first output of the “Data to similarity” operator). The
process now looks like the following one:

Figure 7. k-means (clustering) process + performance evaluation
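
The cohesion measure described above can be sketched in a few lines: for each cluster, average the Euclidean distances over all pairs of its members. The function name and toy points are hypothetical, and combining the per-cluster values into one overall figure by weighting with cluster size is an assumption, not necessarily what the RapidMiner operator reports.

```python
import math
from itertools import combinations


def avg_within_cluster_distance(points, assignment):
    """Return (per-cluster average pairwise distance, size-weighted overall average)."""
    clusters = {}
    for point, cluster in zip(points, assignment):
        clusters.setdefault(cluster, []).append(point)
    per_cluster = {}
    for cluster, members in clusters.items():
        pairs = list(combinations(members, 2))
        # A cluster with a single member has no pairs; treat its distance as 0.
        per_cluster[cluster] = (
            sum(math.dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
        )
    # Overall value: per-cluster averages weighted by cluster size (assumption).
    overall = sum(per_cluster[c] * len(clusters[c]) for c in clusters) / len(points)
    return per_cluster, overall


# Hypothetical points with a given cluster assignment.
sample_points = [(0, 0), (0, 2), (10, 0), (10, 4)]
sample_assignment = [0, 0, 1, 1]
per_cluster, overall = avg_within_cluster_distance(sample_points, sample_assignment)
print(per_cluster)
print(overall)
```

Lower values indicate more compact clusters. Since the measure tends to shrink as the number of clusters grows, it is usually more informative to look at how quickly it drops than at its absolute value.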

- Change the configuration setting for the k-Means algorithm by clicking on the corresponding operator
and using the right-hand side menu in the Design perspective. Specifically, vary the value of parameter
k and analyze the impact of its value on the quality of the generated clusters (i.e., the impact on the
average within-cluster distance measure).
- Substitute the K-Means clustering operator with the DBScan clustering operator and
analyze the achieved results. Consider different values of epsilon.
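
To see why epsilon matters, the following is a minimal sketch of the classic DBSCAN procedure. It is not RapidMiner's implementation: the function `dbscan`, the parameter name `min_points`, and the toy points are assumptions made for the example.

```python
import math

NOISE = -1


def dbscan(points, epsilon, min_points):
    """Return one cluster id per point; NOISE marks points that belong to no cluster."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within epsilon of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= epsilon]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later be claimed by a cluster as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins the cluster, does not expand it
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # core point: the cluster grows through it
        cluster_id += 1
    return labels


# Hypothetical points: two dense groups and one isolated point.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
for eps in (0.5, 1.5, 100):
    print(eps, dbscan(toy_points, epsilon=eps, min_points=3))
```

On this toy data a very small epsilon marks every point as noise, an intermediate value recovers the two groups and leaves the isolated point as noise, and a very large value merges everything into one cluster.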
