
Extension: Data Mining: Assignment 2

Individual Assignment 2
Due date: April 10, 2022
- This assignment provides hands-on experience with data mining tasks using
real-world datasets and the data mining platform WEKA.
- You can find the data sets for Tasks 1, 2 and 3 in the archive file (HW2 WinRAR).
 The tasks:
1. Task 1: Association rule mining/discovery
- You are to perform association rule mining/discovery on the data set
using the WEKA package (Apriori algorithm).
The data is contained in the file bank-data.csv. Each record is a customer
description in which the "pep" field indicates whether or not that customer
bought a PEP after the last mailing (for a detailed description of the fields,
see the table below).

Table 1: The data contains the following fields


id            a unique identification number
age           age of customer in years (numeric)
sex           MALE / FEMALE
region        inner_city / rural / suburban / town
income        income of customer (numeric)
married       is the customer married (YES/NO)
children      number of children (numeric)
car           does the customer own a car (YES/NO)
save_acct     does the customer have a savings account (YES/NO)
current_acct  does the customer have a current account (YES/NO)
mortgage      does the customer have a mortgage (YES/NO)
pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)

First perform the necessary preprocessing steps required for association rule
mining. Specifically, the "id" field must be removed and the numeric
attributes must be discretized.
Next, perform association rule discovery on the transformed data. Experiment
with different parameters until you obtain at least 20-30 strong rules (e.g., rules
with high lift and confidence that at the same time have relatively good
support).
In WEKA's Apriori interface, set "outputItemSets" to "True" so that you
can also view the frequent itemsets of different sizes in addition to the rules.
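The assignment expects these steps to be carried out in the WEKA Explorer GUI, but the same preprocessing and rule mining can also be scripted against WEKA's Java API. The sketch below is only an illustration of that route: the 3-bin equal-frequency discretization is an arbitrary choice you would tune, and it assumes the "id" column is the first attribute in bank-data.csv.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;
    import weka.filters.unsupervised.attribute.Discretize;
    import weka.associations.Apriori;

    public class BankApriori {
        public static void main(String[] args) throws Exception {
            // Load the data (DataSource handles both CSV and ARFF input).
            Instances data = new DataSource("bank-data.csv").getDataSet();

            // Remove the first attribute ("id"); it carries no predictive information.
            Remove remove = new Remove();
            remove.setAttributeIndices("1");
            remove.setInputFormat(data);
            data = Filter.useFilter(data, remove);

            // Discretize the numeric attributes (age, income, children).
            // Three equal-frequency bins is just one possible setting.
            Discretize disc = new Discretize();
            disc.setBins(3);
            disc.setUseEqualFrequency(true);
            disc.setInputFormat(data);
            data = Filter.useFilter(data, disc);

            // Mine association rules; raise the number of rules and adjust the
            // support/confidence thresholds until 20-30 strong rules appear.
            Apriori apriori = new Apriori();
            apriori.setNumRules(30);
            apriori.setOutputItemSets(true);   // also print the frequent itemsets
            apriori.buildAssociations(data);
            System.out.println(apriori);
        }
    }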
 The report:
a) Write a report regarding the results you achieved with the method and data
set above, addressing the following:
i. Cleaning:
- Discuss any cleaning you had to do on the given data set in order to apply/run the
algorithm/method.


b) Select the top 5 most "interesting" rules and for each specify the
following:
i. an explanation of the pattern and why you believe it is interesting
based on the business objectives of the company;
ii. any recommendations based on the discovered rule that might help the
company better understand the behavior of its customers or improve its
marketing campaign. Also provide some non-trivial, actionable
knowledge based on the underlying business objectives.

2. Task 2: K-Nearest Neighbor classification
- Use the following learning scheme/algorithm: K-Nearest Neighbor
[weka.classifiers.lazy.IBk].
- You will use an image segmentation data set.
- This data set contains information characterizing images, with each line
corresponding to one image. Each image is represented by 19 features
(these are the columns in the data and correspond to the feature names in
the list of attributes).
- The last column in the data contains the class labels corresponding to the
image types (brickface, sky, foliage, cement, window, path, grass). The data
set contains three files. The file "segment-train.arff" is the training data,
consisting of 30 instances (images) from each of the 7 categories. The test
data ("segment-test.arff") is used for evaluating the model built from the
training data and contains 2310 instances. A detailed description of the
data set, including the meanings of the various attributes, is provided in the
file "segment-description.txt". The data set used in this task is based on the
Image Segmentation data set at the UCI Machine Learning Repository.

Your tasks are the following:

a) Load the training image segment data into WEKA and select WEKA's KNN
implementation (called IBk) under the Classify tab (for details see the WEKA
documentation). Use the default evaluation setting (10-fold cross-validation).
Set an appropriate value for K (the number of neighbors) by opening the
classifier options dialog box.
 Run the classifier multiple times, experimenting with different values of
K (e.g., K = 5, 10, 15, 20 and so on) and with or without the
distance-weighting option set; a scripted version of this experiment is
sketched after this list.
 For each run, examine the evaluation results. Once you are satisfied that
you have the best set of options, record the final results by saving the
result buffer for the corresponding result set. You should submit this result
set and also provide a short summary of which options you tried and your
findings.
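If you prefer to script the experiment rather than repeat it in the GUI, a minimal sketch using WEKA's Java API is shown below. It runs 10-fold cross-validation for several values of K, with and without inverse-distance weighting (the "-I" option of IBk); check the option flags against your WEKA version.

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;

    public class SegmentKnnCv {
        public static void main(String[] args) throws Exception {
            Instances train = new DataSource("segment-train.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);   // class label is the last column

            int[] ks = {1, 5, 10, 15, 20};
            for (int k : ks) {
                for (boolean weighted : new boolean[]{false, true}) {
                    IBk knn = new IBk(k);
                    if (weighted) {
                        // "-K" sets the number of neighbors, "-I" weights votes by 1/distance.
                        knn.setOptions(new String[]{"-K", String.valueOf(k), "-I"});
                    }
                    Evaluation eval = new Evaluation(train);
                    eval.crossValidateModel(knn, train, 10, new Random(1));
                    System.out.printf("K=%d, distance-weighted=%b: %.2f%% correct%n",
                            k, weighted, eval.pctCorrect());
                }
            }
        }
    }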


b) Next, apply your model from part (a) to the test data. Under Test
options select "Supplied test set" and set the test set to the file segment-
test.arff. Under More options, make sure that "Output predictions" is
selected. Finally, run the KNN classifier on the test data. Compare the
evaluation results to the results from 10-fold cross-validation. Submit your
result set (including the predictions) along with a summary of your
observations. A scripted equivalent of this evaluation is sketched below.
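The same train/test comparison can be reproduced outside the GUI. The sketch below is illustrative only; K=5 merely stands in for whatever value of K you settled on in part (a).

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;

    public class SegmentKnnTest {
        public static void main(String[] args) throws Exception {
            Instances train = new DataSource("segment-train.arff").getDataSet();
            Instances test  = new DataSource("segment-test.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            IBk knn = new IBk(5);               // use the K chosen in part (a)
            knn.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(knn, test);      // evaluate on the supplied test set
            System.out.println(eval.toSummaryString("\n=== Test set results ===\n", false));
            System.out.println(eval.toMatrixString());   // per-class confusion matrix
        }
    }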

3. Task 3: Decision tree classification
- You are given two data sets (in ARFF format) contained in the Zip
archive bank-data.zip:
bank-data.arff - pre-classified training data set for building a model
(this is the data from Task 1)
bank-new.arff - a set of new customers from which to find the "hot
prospects" for the next target marketing campaign (i.e., those that are
likely to respond positively to an offer for a PEP). For details about the
data sets, please see Table 1 in Task 1.
- Note that since the ID attribute is not used for building the classifier, you
should begin by loading each of these data sets into WEKA, removing the
ID attribute in each case, and saving both filtered data sets into new
files.
a) Using the WEKA package, create a "C4.5" classification model based on the pre-
classified training data. In WEKA, the C4.5 algorithm is implemented by
"weka.classifiers.trees.J48". Use 10-fold cross-validation to evaluate your
model accuracy. Record the final decision tree and the model accuracy statistics
obtained from your model. Be sure to indicate the parameters you use in
building your classification model (if you experiment with non-default
values). You can save the statistics and results by right-clicking the last
result set in the "Result list" window and selecting "Save result buffer". You
should also generate a screenshot of your tree by selecting the
"Visualize tree" command from the same menu.
You should provide the decision tree together with the accuracy results from
the cross-validation as part of your submission. A scripted equivalent is sketched below.
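The model building and 10-fold cross-validation from part (a) can also be scripted, as in the sketch below. The file name bank-data-filtered.arff is a hypothetical name for the ID-free copy you saved earlier, and the options shown are simply J48's default pruning settings.

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;

    public class BankJ48 {
        public static void main(String[] args) throws Exception {
            // Training data with the "id" attribute already removed and saved.
            Instances train = new DataSource("bank-data-filtered.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);   // "pep" is the class attribute

            J48 tree = new J48();
            tree.setOptions(new String[]{"-C", "0.25", "-M", "2"});   // default confidence and leaf size
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(tree, train, 10, new Random(1));
            System.out.println(tree);                      // textual form of the decision tree
            System.out.println(eval.toSummaryString());    // accuracy statistics
        }
    }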

b) Next, apply the classification model from the previous part to the new
customers data set as the "Supplied test set." Be sure to select the
option "Output predictions" in the test options for the classifier
(under More Options). This option will show you the predicted classes for
the 200 new instances. In your final submitted result you should map the
resulting answers back to the original customer "id" field for the new
customers (this could be done using a spreadsheet program such as Excel
and the original new-customers data set in CSV format); a scripted
alternative is sketched below.
Provide your resulting predictions for the 200 new cases and other
supporting documentation as part of your submission.
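As an alternative to the spreadsheet step, the prediction and id mapping can be scripted. This is only a sketch: bank-data-filtered.arff and bank-new-filtered.arff are hypothetical names for the ID-free files saved earlier, and the original bank-new.arff is loaded alongside them purely to recover the "id" values in the same row order.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.classifiers.trees.J48;

    public class BankPredictNew {
        public static void main(String[] args) throws Exception {
            Instances train        = new DataSource("bank-data-filtered.arff").getDataSet();
            Instances newCustomers = new DataSource("bank-new-filtered.arff").getDataSet();
            // Unfiltered new-customer file, kept only for its "id" attribute.
            Instances withIds      = new DataSource("bank-new.arff").getDataSet();

            train.setClassIndex(train.numAttributes() - 1);
            newCustomers.setClassIndex(newCustomers.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(train);

            // Print "id,predicted_pep" for each of the 200 new customers.
            for (int i = 0; i < newCustomers.numInstances(); i++) {
                double pred = tree.classifyInstance(newCustomers.instance(i));
                String label = train.classAttribute().value((int) pred);
                String id = withIds.instance(i).toString(0);   // attribute 0 is "id"
                System.out.println(id + "," + label);
            }
        }
    }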


4. Task 4: Classifier comparison
- You are to conduct 10-fold cross-validation to evaluate the following
classification learning methods/algorithms:
 C4.5 "weka.classifiers.trees.J48"
 RIPPER "weka.classifiers.rules.JRip"
 Naïve Bayes "weka.classifiers.bayes.NaiveBayes"
 K-Nearest Neighbor "weka.classifiers.lazy.IBk"
- on the following data sets from the UCI Machine Learning Repository
(a sketch of the comparison loop is given after this list):
 Ecoli (ecoli.arff)
 Glass identification (glass.arff)
 Yeast (yeast.arff)
 Iris plant (iris.arff)
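The WEKA Explorer (or Experimenter) is the intended tool, but a minimal scripted version of the comparison is sketched below: each classifier is cross-validated on each data set, and accuracy and a rough wall-clock run time are printed. The K=3 setting for IBk is an arbitrary illustrative choice.

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.classifiers.rules.JRip;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.lazy.IBk;

    public class CompareClassifiers {
        public static void main(String[] args) throws Exception {
            String[] files = {"ecoli.arff", "glass.arff", "yeast.arff", "iris.arff"};

            for (String file : files) {
                Instances data = new DataSource(file).getDataSet();
                data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

                // Fresh classifier objects for each data set.
                Classifier[] classifiers = {new J48(), new JRip(), new NaiveBayes(), new IBk(3)};
                String[] names = {"C4.5 (J48)", "RIPPER (JRip)", "NaiveBayes", "KNN (IBk, K=3)"};

                for (int i = 0; i < classifiers.length; i++) {
                    long start = System.currentTimeMillis();
                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(classifiers[i], data, 10, new Random(1));
                    long elapsed = System.currentTimeMillis() - start;
                    System.out.printf("%-12s %-16s accuracy=%.2f%%  time=%d ms%n",
                            file, names[i], eval.pctCorrect(), elapsed);
                }
            }
        }
    }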
The report:
- Write a report regarding the results you achieved with the classification methods
and data sets above, addressing the following:
(I) Cleaning:
 Discuss any cleaning you had to do on the given data sets in order to apply/run the
algorithms/methods.
(II) Method utility:
 Report the classification accuracy and run time of each method/algorithm on
each data set. Discuss the results and determine whether there are significant
differences in accuracy (misclassification rates) and run time among the data sets.
If so, what differences in the characteristics of the data sets might account for this?
(III) Accuracy improvement:
 Present the method and data set for which you improved the accuracy through
parameter adjustment.
 State the accuracy improvement you achieved, and state what your
parameter adjustment was.

- NB: the data sets will not be given for Task 4, so you must download them yourself.
- For Task 4, obtain the data sets from the UCI Machine Learning
Repository.

