
DATA MINING

LABORATORY MANUAL

For
B.TECH (IV YEAR – I SEM) (2019-20)

Submitted by

[Your Name]

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


MALLA REDDY COLLEGE OF
ENGINEERING
Approved by AICTE, Permanently Affiliated to JNTUH & Accredited by NBA & NAAC

Recognized under Section 2(f) & 12(B) of the UGC Act 1956, An ISO 9001:2015
Certified Institution

Maisammaguda, Secunderabad, Telangana. PIN: 500100

CERTIFICATE
This is to certify that Mr./Mrs. ………………………… bearing

H.T.No. .......................... is a bonafide student of the Computer Science

and Engineering Department, and has successfully completed the

B.Tech IV-Year I-Semester Data Mining [WEKA] Lab for the

Academic Year 2019-20.

Internal Examiner          HOD          External Examiner

Index
S.No. Topic Page No Signature


Syllabus

Credit Risk Assessment


Description: The business of banks is making loans. Assessing the credit
worthiness of an applicant is of crucial importance. You have to develop a system to
help a loan officer decide whether the credit of a customer is good or bad. A bank's
business rules regarding loans must consider two opposing factors. On the one hand, a
bank wants to make as many loans as possible; interest on these loans is the bank's
profit source. On the other hand, a bank cannot afford to make too many bad loans;
too many bad loans could lead to the collapse of the bank. The bank's loan policy
must therefore involve a compromise: not too strict and not too lenient.

To do the assignment, you first and foremost need some knowledge about the world of
credit.
You can acquire such knowledge in a number of ways.

• Knowledge engineering: Find a loan officer who is willing to talk. Interview her
and try to represent her knowledge in a number of ways.
• Books: Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
• Common sense: Imagine yourself as a loan officer and make up reasonable rules
which can be used to judge the creditworthiness of a loan applicant.
• Case histories: Find records of actual cases where competent loan officers
correctly judged when to approve a loan application and when not to.

The German Credit Data


Actual historical credit data is not always easy to come by because of
confidentiality rules. Here is one such data set, consisting of 1000 actual cases
collected in Germany.

In spite of the fact that the data is German, you should probably make use of it
for this assignment (unless you really can consult a real loan officer!).
There are 20 attributes used in judging a loan applicant (i.e., 7 numerical
attributes and 13 categorical or nominal attributes). The goal is to classify the
applicant into one of two categories: good or bad.

Subtasks:

1. List all the categorical (or nominal) attributes and the real-valued attributes separately.

2. What attributes do you think might be crucial in making the credit assessment?
Come up with some simple rules in plain English using your selected attributes.

3. One type of model that you can create is a decision tree. Train a decision tree using
the complete data set as the training data. Report the model obtained after training.

4. Suppose you use your above model trained on the complete dataset, and classify credit
good/bad for each of the examples in the dataset. What % of examples can you classify
correctly? (This is also called testing on the training set.) Why do you think you cannot
get 100% training accuracy?

5. Is testing on the training set as you did above a good idea? Why or why not?

6. One approach for solving the problem encountered in the previous question is using
cross-validation. Describe briefly what cross-validation is. Train a decision tree again
using cross-validation and report your results. Does accuracy increase/decrease? Why?

7. Check to see if the data shows a bias against "foreign workers" or "personal-status".
One way to do this is to remove these attributes from the data set and see if the
decision tree created in those cases is significantly different from the full-dataset case
which you have already done. Did removing these attributes have any significant
effect? Discuss.

8. Another question might be: do you really need to input so many attributes to get
good results? Maybe only a few would do. For example, you could try just having
attributes 2, 3, 5, 7, 10, 17 and 21. Try out some combinations. (You had removed two
attributes in problem 7. Remember to reload the arff data file to get all the attributes
initially before you start selecting the ones you want.)

9. Sometimes, the cost of rejecting an applicant who actually has good credit might be
higher than accepting an applicant who has bad credit. Instead of counting the
misclassifications equally in both cases, give a higher cost to the first case (say cost 5)
and a lower cost to the second case. Train your decision tree using a cost matrix in
Weka and report the decision tree and cross-validation results. Are they significantly
different from the results obtained in problem 6?

10. Do you think it is a good idea to prefer simple decision trees instead of long,
complex decision trees? How does the complexity of a decision tree relate to the bias
of the model?

11. You can make your decision trees simpler by pruning the nodes. One approach is to
use Reduced Error Pruning. Explain this idea briefly. Try reduced error pruning for
training your decision trees using cross-validation, and report the decision trees you
obtain. Also report your accuracy using the pruned model. Does your accuracy increase?

12. How can you convert a decision tree into "if-then-else rules"? Make up your own
small decision tree consisting of 2-3 levels and convert it into a set of rules. There also
exist different classifiers that output the model in the form of rules; one such classifier
in Weka is rules.PART. Train this model and report the set of rules obtained. Sometimes
just one attribute can be good enough in making the decision, yes, just one! Can you
predict what attribute that might be in this data set? The OneR classifier uses a single
attribute to make decisions (it chooses the attribute based on minimum error). Report
the rule obtained by training a OneR classifier. Rank the performance of J48, PART
and OneR (see the command sketch after this list).
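
Each of the learners named in the subtasks can also be invoked from the Simple CLI or
an operating-system shell. A minimal sketch (the dataset file name is an assumption
for illustration):

java weka.classifiers.trees.J48 -t credit-g.arff
java weka.classifiers.rules.PART -t credit-g.arff
java weka.classifiers.rules.OneR -t credit-g.arff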

1. Weka Introduction
Weka was created by researchers at the University of Waikato, Hamilton,
New Zealand. Credits: Alex Seewald (original command-line primer) and
David Scuse (original Experimenter tutorial).

• It is a Java-based application.
• It is a collection of open source machine learning algorithms.
• The routines (functions) are implemented as classes and logically arranged in
packages.
• It comes with an extensive GUI interface.
• Weka routines can be used standalone via the command-line interface.
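
For example, a classifier can be run standalone from a shell (a minimal sketch; the jar
location and dataset path are assumptions for illustration):

java -cp weka.jar weka.classifiers.trees.J48 -t iris.arff

This trains J48 on iris.arff and, with only -t given, reports a 10-fold cross-validation
evaluation by default.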

The Graphical User Interface


The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point
for launching Weka’s main GUI applications and supporting tools. If one prefers a MDI
(“multiple document interface”) appearance, then this is provided by an alternative
launcher called “Main” (class weka.gui.Main). The GUI Chooser consists of four
buttons—one for each of the four major Weka applications—and four menus.

The buttons can be used to start the following applications:


• Explorer An environment for exploring data with
WEKA (the rest of this documentation deals with this
application in more detail).
• Experimenter An environment for performing experiments and conducting
statistical tests between learning schemes.

• Knowledge Flow This environment supports essentially the same


functions as the Explorer but with a drag-and-drop interface. One
advantage is that it supports incremental learning.
• SimpleCLI Provides a simple command-line interface that allows
direct execution of WEKA commands for operating systems that do
not provide their own command line interface.
I. Explorer

The Graphical User Interface

Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When
the Explorer is first started only the first tab is active; the others are greyed out. This is
because it is necessary to open (and potentially pre-process) a data set before starting to
explore the data.
The tabs are as follows:

• Preprocess. Choose and modify the data being acted on.


• Classify. Train and test learning schemes that classify or perform regression.
• Cluster. Learn clusters for the data.
• Associate. Learn association rules for the data.
• Select attributes. Select the most relevant attributes in the data.
• Visualize. View an interactive 2D plot of the data.

Once the tabs are active, clicking on them flicks between different screens, on
which the respective actions can be performed. The bottom area of the window
(including the status box, the log button, and the Weka bird) stays visible regardless of
which section you are in. The Explorer can be easily extended with custom tabs. The
Wiki article “Adding tabs in the Explorer” [7] explains this in detail.

II. Experimenter
2.1 Introduction
The Weka Experiment Environment enables the user to create, run, modify, and
analyse experiments in a more convenient manner than is possible when processing
the schemes individually. For example, the user can create an experiment that runs
several schemes against a series of datasets and then analyse the results to determine
if one of the schemes is (statistically) better than the other schemes.
The Experiment Environment can be run from the command line using the
Simple CLI. For example, the following commands could be typed into the CLI to run
the OneR scheme on the Iris dataset using a basic train and test process. (Note that the
commands would be typed on one line into the CLI.) While commands can be typed
directly into the CLI, this technique is not particularly convenient and the experiments
are not easy to modify. The Experimenter comes in two flavours, either with a simple
interface that provides most of the functionality one needs for experiments, or with an
interface with full access to the Experimenter’s capabilities. You can choose between
those two with the Experiment Configuration Mode radio buttons:

• Simple
• Advanced

Both setups allow you to setup standard experiments, that are run locally on
a single machine, or remote experiments, which are distributed between several hosts.
The distribution of experiments cuts down the time the experiments will take until
completion, but on the other hand the setup takes more time. The next section covers
the standard experiments (both, simple and advanced), followed by the remote
experiments and finally the analysing of the results.

III. Knowledge Flow


3.1 Introduction
The Knowledge Flow provides an alternative to the Explorer as a graphical front
end to WEKA’s core algorithms.
The KnowledgeFlow presents a data-flow inspired interface to WEKA. The user
can select WEKA components from a palette, place them on a layout canvas and connect
them together in order to form a knowledge flow for processing and analyzing data. At
present, all of WEKA's classifiers, filters, clusterers, associators, loaders and savers are
available in the KnowledgeFlow along with some extra tools.

The Knowledge Flow can handle data either incrementally or in batches (the
Explorer handles batch data only). Of course, learning from data incrementally
requires a classifier that can be updated on an instance-by-instance basis. Currently in
WEKA there are ten classifiers that can handle data incrementally.
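
As a sketch of what "updated on an instance-by-instance basis" means in code (the
dataset path is an assumption; NaiveBayesUpdateable is one of WEKA's updateable
classifiers):

import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class IncrementalTrain {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("weather.arff"));    // illustrative path
        Instances structure = loader.getStructure(); // header only, no instances yet
        structure.setClassIndex(structure.numAttributes() - 1);

        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(structure);               // initialise from the structure
        Instance current;
        while ((current = loader.getNextInstance(structure)) != null)
            nb.updateClassifier(current);            // learn one instance at a time
        System.out.println(nb);
    }
}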

The Knowledge Flow offers the following features:


• intuitive data flow style layout
• process data in batches or incrementally
• process multiple batches or streams in parallel (each
separate flow executes in its own thread)
• process multiple streams sequentially via a
user-specified order of execution
• chain filters together

• view models produced by classifiers for each fold in a cross validation


• visualize performance of incremental classifiers
during processing (scrolling plots of classification
accuracy, RMS error, predictions etc.)
• plugin “perspectives” that add major new functionality
(e.g. 3D data visualization, time series forecasting
environment etc.)

IV. Simple CLI

The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters,
clusterers, etc., but without the hassle of the CLASSPATH (it uses the CLASSPATH with
which Weka was started). It offers a simple Weka shell with separated command
line and output.

4.1 Commands

The following commands are available in the Simple CLI:


• java <classname> [<args>]
invokes a java class with the given arguments (if any)
• break
stops the current thread, e.g., a running classifier, in a friendly manner
• kill
stops the current thread in an unfriendly fashion
• cls
clears the output area
• capabilities <classname> [<args>]
lists the capabilities of the specified class, e.g., for a
classifier with its options:
capabilities weka.classifiers.meta.Bagging -W weka.classifiers.trees.Id3
• exit
exits the Simple CLI
• help [<command>]
provides an overview of the available commands if
without a command name as argument, otherwise more
help on the specified command
• Invocation
In order to invoke a Weka class, one has only to prefix the class with "java". This
command tells the Simple CLI to load a class and execute it with any given parameters.
E.g., the J48 classifier can be invoked on the iris dataset with the following command:

java weka.classifiers.trees.J48 -t c:/temp/iris.arff

This results in the usual J48 classifier output.

4.4 Command completion


Commands starting with java support completion for classnames and filenames via Tab
(Alt+BackSpace deletes parts of the command again). In case there are several
matches, Weka lists all possible matches.

• package name completion
java weka.cl<Tab>
results in the following output of possible matches of package names:
Possible matches:
weka.classifiers
weka.clusterers

• classname completion
java weka.classifiers.meta.A<Tab>
lists the following classes:
Possible matches:
weka.classifiers.meta.AdaBoostM1
weka.classifiers.meta.AdditiveRegression
weka.classifiers.meta.AttributeSelectedClassifier

• filename completion
In order for Weka to determine whether the string under the cursor is a classname or
a filename, filenames need to be absolute (Unix/Linux: /some/path/file; Windows:
C:\Some\Path\file) or relative and starting with a dot (Unix/Linux:
./some/other/path/file; Windows: .\Some\Other\Path\file).

Command redirection
Starting with this version of Weka one can perform a basic redirection:

java weka.classifiers.trees.J48 -t test.arff > j48.txt

Note: the > must be preceded and followed by a space, otherwise it is not recognized
as redirection, but part of another parameter.

2. ARFF File Format

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a
list of instances sharing a set of attributes.

ARFF files are not the only format one can load, but all files that can be
converted with Weka’s “core converters”. The following formats are currently
supported:

• ARFF (+ compressed)
• C4.5
• CSV
• libsvm
• binary serialized instances
• XRFF (+ compressed)

2.1 Overview
ARFF files have two distinct sections. The first section is the Header
information, which is followed by the Data information. The Header of the ARFF file
contains the name of the relation, a list of the attributes (the columns in the data), and
their types.
An example header on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments.

The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.

The ARFF Header Section

The ARFF Header section of the file contains the relation declaration and
attribute declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file.
The format is: @relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes
spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute
statements. Each attribute in the data set has its own @attribute statement which
uniquely defines the name of that attribute and its data type. The order in which the
attributes are declared indicates the column position in the data section of the file. For
example, if an attribute is the third one declared, then Weka expects that all of that
attribute's values will be found in the third comma-delimited column.

The format for the @attribute statement is:

@attribute <attribute-name> <datatype>

where the <attribute-name> must start with an alphabetic character. If spaces
are to be included in the name then the entire name must be quoted.

The <datatype> can be any of the four types supported by Weka:

• numeric
• integer (treated as numeric)
• real (treated as numeric)
• <nominal-specification>
• string
• date [<date-format>]
• relational, for multi-instance data (for future use)

where <nominal-specification> and <date-format> are defined below. The
keywords numeric, real, integer, string and date are case insensitive.

Numeric attributes
Numeric attributes can be real or integer numbers.

Nominal attributes
Nominal values are defined by providing a <nominal-specification> listing
the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Values that contain spaces must be quoted.

String attributes
String attributes allow us to create attributes containing arbitrary textual values.
This is very useful in text-mining applications, as we can create datasets with string
attributes and then write Weka filters to manipulate strings (such as the
StringToWordVector filter). String attributes are declared as follows:
@ATTRIBUTE LCC string

Date attributes
Date attribute declarations take the form:
@attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (this is the same format used
by SimpleDateFormat). The default format string accepts the ISO-8601 combined date
and time format: yyyy-MM-dd'T'HH:mm:ss. Dates must be specified in the data
section as the corresponding string representations of the date/time (see example
below).

Relational attributes
Relational attribute declarations take the form:
@attribute <name> relational
  <further attribute definitions>
@end <name>
For the multi-instance dataset MUSK1 the definition would look like this ("..." denotes
an omission):
@attribute molecule_name {MUSK-jf78,...,NON-MUSK-199}
@attribute bag relational
  @attribute f1 numeric
  ...
  @attribute f166 numeric
@end bag
@attribute class {0,1}
...

The ARFF Data Section


The ARFF Data section of the file contains the data declaration
line and the actual instance lines.

The @data Declaration


The @data declaration is a single line denoting the start of the data segment in the file.
The format is: @data
The instance data
Each instance is represented on a single line, with carriage returns denoting the
end of the instance. A percent sign (%) introduces a comment, which continues to the
end of the line.
Attribute values for each instance are delimited by commas. They must appear
in the order that they were declared in the header section (i.e. the data corresponding
to the nth @attribute declaration is always the nth field of the attribute).
Missing values are represented by a single question mark, as in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that contain space or
the comment- delimiter character % must be quoted. (The code suggests that double-
quotes are acceptable and that a backslash will escape individual characters.)

An example follows:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'

Dates must be specified in the data section using the string representation specified in the
attribute declaration.

For example:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"

Relational data must be enclosed within double quotes ("). For example, an instance of
the MUSK1 dataset ("..." denotes an omission):

MUSK-188,"42,...,30",1

3. Preprocess Tab
1. Loading Data
The first four buttons at the top of the preprocess section
enable you to load data into WEKA:

• Open file.... Brings up a dialog box allowing you to browse for the data file on the local
file system.

• Open URL.... Asks for a Uniform Resource Locator address for where the data is
stored.

• Open DB.... Reads data from a database. (Note that to make this work you might
have to edit the file in weka/experiment/DatabaseUtils.props.)

• Generate.... Enables you to generate artificial data from a variety of Data


Generators. Using the Open file... button you can read files in a variety of formats:
WEKA’s ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF
files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and
.names extension, and serialized Instances objects a .bsi extension.

The Current Relation: Once some data has been loaded, the Preprocess panel shows
a variety of information. The Current relation box (the “current relation” is the
currently loaded data, which can be interpreted as a single relational table in database
terminology) has three entries:

• Relation. The name of the relation, as given in the file it was loaded from. Filters
(described below) modify the name of a relation.

• Instances. The number of instances (data points/records) in the data.

• Attributes. The number of attributes (features) in the data.

2.3 Working with Attributes

Below the Current relation box is a box titled Attributes.


There are four Buttons, and beneath them is a list of the
attributes in the current relation.
The list has three columns:

• No... A number that identifies the attribute in the order they are specified in the data
file.

• Selection tick boxes. These allow you to select which attributes are present in the
relation.

• Name. The name of the attribute, as it was declared in the data file. When you click
on different rows in the list of attributes, the fields change in the box to the right titled
Selected attribute. This box displays the characteristics of the currently highlighted
attribute in the list:

• Name. The name of the attribute, the same as that given in the attribute list.
• Type. The type of attribute, most commonly Nominal or Numeric.
• Missing. The number (and percentage) of instances in the data for which this
attribute is missing (unspecified).
• Distinct. The number of different values that the data contains for this attribute.
• Unique. The number (and percentage) of instances in the data having a value for
this attribute that no other instances have.

Below these statistics is a list showing more information about the values stored
in this attribute, which differ depending on its type. If the attribute is nominal, the list
consists of each possible value for the attribute along with the number of instances that
have that value. If the attribute is numeric, the list gives four statistics describing the
distribution of values in the data—the minimum, maximum, mean and standard
deviation. And below these statistics there is a coloured histogram, colour-coded
according to the attribute chosen as the Class using the box above the histogram. (This
box will bring up a drop-down list of available selections when clicked.) Note that only
nominal Class attributes will result in a colour-coding. Finally, after pressing the
Visualize All button, histograms for all the attributes in the data are shown in a separate
window.

Returning to the attribute list, to begin with all the tick boxes are unticked.
They can be toggled on/off by clicking on them individually. The four buttons above
can also be used to change the selection:

• All. All boxes are ticked.
• None. All boxes are cleared (unticked).
• Invert. Boxes that are ticked become unticked and vice versa.
• Pattern. Enables the user to select attributes based on a Perl 5 regular expression.
E.g., .*_id selects all attributes whose name ends with _id.

Once the desired attributes have been selected, they can be removed by clicking
the Remove button below the list of attributes. Note that this can be undone by clicking
the Undo button, which is located next to the Edit button in the top-right corner of the
Preprocess panel.

Working With Filters

The preprocess section allows filters to be defined that transform the data in
various ways. The Filter box is used to set up the filters that are required. At the left of
the Filter box is a Choose button. By clicking this button it is possible to select one of
the filters in WEKA. Once a filter has been selected, its name and options are shown in
the field next to the Choose button. Clicking on this box with the left mouse button
brings up a GenericObjectEditor dialog box. A click with the right mouse button (or
Alt+Shift+left click) brings up a menu where you can choose, either to display the
properties in a GenericObjectEditor dialog box, or to copy the current setup string to
the clipboard.
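
For instance (a hypothetical setup string, for illustration only), after choosing the
Remove filter the field next to the Choose button might read:

weka.filters.unsupervised.attribute.Remove -R 1,3

which, when applied, removes the first and third attributes from the current relation.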
The GenericObjectEditor Dialog Box

The GenericObjectEditor dialog box lets you configure a filter.


The same kind of dialog box is used to configure other objects, such
as classifiers and clusterers
(see below). The fields in the window reflect the available options. Right-clicking (or
Alt+Shift+Left- Click) on such a field will bring up a popup menu, listing the following options:

• Show properties... has the same effect as left-clicking on the field, i.e., a dialog
appears allowing you to alter the settings.

• Copy configuration to clipboard copies the currently displayed configuration


string to the system's clipboard and therefore can be used anywhere else in WEKA or in
the console. This is rather handy if you have to set up complicated, nested schemes.

• Enter configuration... is the “receiving” end for configurations that got copied to
the clipboard earlier on. In this dialog you can enter a class name followed by options (if
the class supports these). This also allows you to transfer a filter setting from the
Preprocess panel to a Filtered Classifier used in the Classify panel.
Left-clicking on any of these gives an opportunity to alter the filter's settings. For
example, the setting may take a text string, in which case you type the string into the
text field provided. Or it may give a drop-down box listing several states to choose from.
Or it may do something else, depending on the information required. Information on
the options is provided in a tool tip if you let the mouse pointer hover over the
corresponding field. More information on the filter and its options can be obtained by
clicking on the More button in the About panel at the top of the GenericObjectEditor
window.
Some objects display a brief description of what they do in an About box, along
with a More button. Clicking on the More button brings up a window describing what
the different options do. Others have an additional button, Capabilities, which lists the
types of attributes and classes the object can handle.

At the bottom of the GenericObjectEditor dialog are four buttons. The first two,
Open... and Save... allow object configurations to be stored for future use. The Cancel
button backs out without remembering any changes that have been made. Once you are
happy with the object and settings you have chosen, click OK to return to the main
Explorer window.

Applying Filters
Once you have selected and configured a filter, you can apply it to the data by
pressing the Apply button at the right end of the Filter panel in the Preprocess panel.
The Preprocess panel will then show the transformed data. The change can be undone
by pressing the Undo button. You can also use the Edit...button to modify your data
manually in a dataset editor. Finally, the Save... button at the top right of the
Preprocess panel saves the current version of the relation in file formats that can
represent the relation, allowing it to be kept for future use.

Note: Some of the filters behave differently depending on whether a class attribute has been
set or not (using the box above the histogram, which will bring up a drop-down list of possible
selections when clicked). In particular, the “supervised filters” require a class attribute to be set,
and some of the “unsupervised attribute filters” will skip the class attribute if one is set. Note that it is
also possible to set Class to None, in which case no class is set.

4. Classification Tab

• Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field
that gives the name of the currently selected classifier, and its options. Clicking on the
text box with the left mouse button brings up a GenericObjectEditor dialog box, just
the same as for filters, that you can use to configure the options of the current
classifier. With a right click (or Alt+Shift+left click) you can once again copy the setup
string to the clipboard or display the properties in a GenericObjectEditor dialog box.
The Choose button allows you to choose one of the classifiers that are available in
WEKA.

• Test Options

The result of applying the chosen classifier will be tested according to the
options that are set by clicking in the Test options box. There are four test modes:

• Use training set. The classifier is evaluated on how well it predicts the class of the
instances it was trained on.

• Supplied test set. The classifier is evaluated on how well it predicts the class of a
set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing
you to choose the file to test on.

• Cross-validation. The classifier is evaluated by cross-validation, using the number


of folds that are entered in the Folds text field.

• Percentage split. The classifier is evaluated on how well it predicts a certain


percentage of the data which is held out for testing. The amount of data held out
depends on the value entered in the % field.

Note: No matter which evaluation method is used, the model that is output is always the
one built from all the training data. Further testing options can be set by clicking on the More
options... button:
• Output model. The classification model on the full training set is output so that it
can be viewed, visualized, etc. This option is selected by default.

• Output per-class stats. The precision/recall and true/false statistics for each class
are output. This option is also selected by default.

• Output entropy evaluation measures. Entropy evaluation measures are


included in the output. This option is not selected by default.

• Output confusion matrix. The confusion matrix of the classifier’s predictions is


included in the output. This option is selected by default.

• Store predictions for visualization. The classifier’s predictions are remembered so


that they can be visualized. This option is selected by default.

• Output predictions. The predictions on the evaluation data are output.

Note that in the case of a cross-validation the instance numbers do not correspond
to the location in the data!
• Output additional attributes. If additional attributes need to be output alongside
the predictions, e.g., an ID attribute for tracking misclassifications, then the index of
this attribute can be specified here. The usual Weka ranges are supported,“first” and
“last” are therefore valid indices as well (example: “first-3,6,8,12-last”).

• Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix.
The Set... button allows you to specify the cost matrix used (an illustrative matrix is
sketched after this list).

• Random seed for xval / % Split. This specifies the random seed used when
randomizing the data before it is divided up for evaluation purposes.

• Preserve order for % Split. This suppresses the randomization of the data
before splitting into train and test set.

• Output source code. If the classifier can output the built model as Java source
code, you can specify the class name here. The code will be printed in the “Classifier
output” area.
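
For the credit task described in the syllabus, the 2x2 cost matrix entered via the Set...
button could look like this (a sketch; rows are the actual class, columns the predicted
class):

              predicted good   predicted bad
actual good         0                5
actual bad          1                0

With these entries, rejecting an applicant with good credit costs five times as much as
accepting one with bad credit.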

• The Class Attribute


The classifiers in WEKA are designed to be trained to predict a single 'class'
attribute, which is the target for prediction. Some classifiers can only learn nominal
classes; others can only learn numeric classes (regression problems); still others can
learn both.
By default, the class is taken to be the last attribute in the data. If you want
to train a classifier to predict a different attribute, click on the box below the Test
options box to bring up a drop-down list of attributes to choose from.

• Training a Classifier

Once the classifier, test options and class have all been set, the learning process
is started by clicking on the Start button. While the classifier is busy being trained, the
little bird moves around. You can stop the training process at any time by clicking on
the Stop button. When training is complete, several things happen. The Classifier
output area to the right of the display is filled with text describing the results of
training and testing. A new entry appears in the Result list box. We look at the result
list below; but first we investigate the text that has been output.
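
The same train-and-evaluate cycle can also be scripted against Weka's Java API. A
minimal sketch (the dataset path is an assumption; the class is taken to be the last
attribute, as in the Explorer):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // illustrative path
        data.setClassIndex(data.numAttributes() - 1);      // class = last attribute

        J48 tree = new J48();        // C4.5 decision tree learner
        tree.buildClassifier(data);  // train on the full dataset
        System.out.println(tree);    // print the tree (cf. Classifier output area)

        // 10-fold cross-validation, like the Cross-validation test option
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}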

• The Classifier Output Text


The text in the Classifier output area has scroll bars allowing you to browse the
results. Clicking with the left mouse button into the text area, while holding Alt and
Shift, brings up a dialog that enables you to save the displayed output in a variety of
formats (currently, BMP, EPS, JPEG and PNG). Of course, you can also resize the
Explorer window to get a larger display area. The output is split into several sections:

• Run information. A list of information giving the learning scheme options,
relation name, instances, attributes and test mode that were involved in the process.

• Classifier model (full training set). A textual representation of the classification
model that was produced on the full training data.

• The results of the chosen test mode are broken down thus:

• Summary. A list of statistics summarizing how accurately the classifier was able to
predict the true class of the instances under the chosen test mode.

• Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's
prediction accuracy.

• Confusion Matrix. Shows how many instances have been assigned to each class.
Elements show the number of test examples whose actual class is the row and whose
predicted class is the column.

• Source code (optional). This section lists the Java source code if one chose "Output
source code" in the "More options" dialog.

5. Clustering Tab

Selecting a Clusterer

By now you will be familiar with the process of selecting and configuring objects.
Clicking on the clustering scheme listed in the Clusterer box at the top of the window
brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.
Cluster Modes

The Cluster mode box is used to choose what to cluster and how to evaluate
the results. The first three options are the same as for classification: Use training set,
Supplied test set and Percentage split (Section 5.3.1)—except that now the data is
assigned to clusters instead of trying to predict a specific class. The fourth mode,
Classes to clusters evaluation, compares how well the chosen clusters match up with a
pre-assigned class in the data. The drop-down box below this option selects the class,
just as in the Classify panel.
An additional option in the Cluster mode box, the Store clusters for visualization
tick box, determines whether or not it will be possible to visualize the clusters once
training is complete. When dealing with datasets that are so
large that memory becomes a problem it may be helpful to disable this option.

• Ignoring Attributes

Often, some attributes in the data should be ignored when clustering. The Ignore
attributes button brings up a small window that allows you to select which attributes
are ignored. Clicking on an attribute in the window highlights it, holding down the
SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles
individual attributes on and off. To cancel the selection, back out with the Cancel
button. To activate it, click the Select button. The next time clustering is invoked, the
selected attributes are ignored.

• Working with Filters

The Filtered Clusterer meta-clusterer offers the user the possibility to apply
filters directly before the clusterer is learned. This approach eliminates the manual
application of a filter in the Preprocess panel, since the data gets processed on the fly.
Useful if one needs to try out different filter setups.

• Learning Clusters

The Cluster section, like the Classify section, has Start/Stop buttons, a result text area
and a result list. These all behave just like their classification counterparts. Right-
clicking an entry in the result list brings up a similar menu, except that it shows only
two visualization options: Visualize cluster assignments and Visualize tree. The latter
is grayed out when it is not applicable.
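
As a programmatic counterpart (a sketch; SimpleKMeans and the file name are
illustrative choices), a clusterer can be built and evaluated like this:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunKMeans {
    public static void main(String[] args) throws Exception {
        // Load data; no class index is set, since clusterers work on
        // unlabelled data (the Explorer removes the class for you)
        Instances data = DataSource.read("weather.arff"); // illustrative path
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);     // ask for two clusters
        km.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}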

6. Associate Tab
• Setting Up
This panel contains schemes for learning association rules, and the learners are
chosen and configured in the same way as the clusterers, filters, and classifiers
in the other panels.
• Learning Associations
Once appropriate parameters for the association rule learner have been set,
click the Start button. When complete, right-clicking on an entry in the result list
allows the results to be viewed or saved.

7. Selecting Attributes Tab

Searching and Evaluating

Attribute selection involves searching through all possible combinations of


attributes in the data to find which subset of attributes works best for prediction.
To do this, two objects must be set up: an attribute evaluator and a search method. The
evaluator determines what method is used to assign a worth to each subset of
attributes. The search method determines what style of search is performed.
Options

The Attribute Selection Mode box has two options:

• Use full training set. The worth of the attribute subset is determined using the full
set of training data.

• Cross-validation. The worth of the attribute subset is determined by a process of


cross-validation. The Fold and Seed fields set the number of folds to use and the
random seed used when shuffling the data. As with Classify (Section 5.3.1), there is a
drop-down box that can be used to specify which attribute to treat as the class.

Performing Selection

Clicking Start starts running the attribute selection process. When it is finished,
the results are output into the result area, and an entry is added to the result list. Right-
clicking on the result list gives several options. The first three, (View in main window,
View in separate window and Save result buffer), are the same as for the classify panel. It
is also possible to Visualize reduced data, or if you have used an attribute transformer
such as Principal Components, Visualize transformed data. The reduced/transformed
data can be saved to a file with the Save reduced data... or Save transformed data...
option.

In case one wants to reduce/transform a training and a test set at the same time and not use
the AttributeSelectedClassifier from the classifier panel, it is best to use the AttributeSelection
filter (a supervised attribute filter) in batch mode ('-b') from the command line or in the
Simple CLI. The batch mode allows one to specify an additional input and output file pair
(options -r and -s), that is processed with the filter setup that was determined based on the
training data.
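
An illustrative batch-mode invocation (the evaluator, search method and file names are
example choices, not prescribed by this manual):

java weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.CfsSubsetEval -S weka.attributeSelection.BestFirst -b -i train.arff -o train_reduced.arff -r test.arff -s test_reduced.arff

Here -i/-o name the training input and output files and -r/-s the corresponding test
files; the filter setup learned on the training data is then applied unchanged to the
test data.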

8. Visualizing Tab

WEKA’s visualization section allows you to visualize 2D plots of the current relation.

• The scatter plot matrix


When you select the Visualize panel, it shows a scatter plot matrix for all the
attributes, colour coded according to the currently selected class. It is possible to
change the size of each individual 2D plot and the point size, and to randomly jitter the
data (to uncover obscured points). It is also possible to change the attribute used to
colour the plots, to select only a subset of attributes for inclusion in the scatter plot
matrix, and to subsample the data. Note that changes will only come into effect once
the Update button has been pressed.

• Selecting an individual 2D scatter plot


When you click on a cell in the scatter plot matrix, this will bring up a separate
window with a visualization of the scatter plot you selected. (We described
above how to visualize particular results in a separate window—for example, classifier
errors—the same visualization controls are used here.)
Data points are plotted in the main area of the window. At the top are two drop-down
list buttons for selecting the axes to plot. The one on the left shows which attribute is
used for the x-axis; the one on the right shows which is used for the y-axis.

Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This
allows you to colour the points based on the attribute selected. Below the plot
area, a legend describes what values the colours correspond to. If the values are
discrete, you can modify the colour used for each one by clicking on them and
making an appropriate selection in the window that pops up.
To the right of the plot area is a series of horizontal strips. Each strip represents
an attribute, and the dots within it show the distribution of values of the attribute.
These values are randomly scattered vertically to help you see concentrations of points.
You can choose what axes are used in the main graph by clicking on these strips. Left-
clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking
changes the y-axis. The ‘X’ and ‘Y’ written beside the strips shows what the current axes
are (‘B’ is used for ‘both X and Y’).
Above the attribute strips is a slider labelled Jitter, which is a random
displacement given to all points in the plot. Dragging it to the right increases the
amount of jitter, which is useful for spotting concentrations of points. Without jitter, a
million instances at the same point would look no different to just a
single lonely instance.

Selecting Instances


There may be situations where it is helpful to select a subset of the data using the
visualization tool. (A special case of this is the User Classifier in the Classify panel,
which lets you build your own classifier by interactively selecting instances.)
Below the y-axis selector button is a drop-down list button for choosing a
selection method. A group of data points can be selected in four ways:

• Select Instance. Clicking on an individual data point brings up a window listing its
attributes. If more than one point appears at the same location, more than one set of
attributes is shown.

• Rectangle. You can create a rectangle, by dragging, that selects the points inside it.
• Polygon. You can build a free-form polygon that selects the points inside it. Left-
click to add vertices to the polygon, right-click to complete it. The polygon will always
be closed off by connecting the first point to the last.

• Polyline. You can build a polyline that distinguishes the points on one side from
those on the other. Left-click to add vertices to the polyline, right-click to finish. The
resulting shape is open (as opposed to a polygon, which is always closed).
Once an area of the plot has been selected using Rectangle, Polygon or Polyline,
it turns grey. At this point, clicking the Submit button removes all instances from the
plot except those within the grey selection area. Clicking on the Clear button erases the
selected area without affecting the graph.
Once any points have been removed from the graph, the Submit button changes to a Reset
button. This button undoes all previous removals and returns you to the original graph
with all points included. Finally, clicking the Save button allows you to save the currently
visible instances to a new ARFF file.

SAMPLE ARFF FILES


FILE 01

Aim:

Create an Employee Table with the help of Data Mining Tool WEKA.

Description:

We need to create an Employee Table with training data set which includes attributes like name,
id, salary, experience, gender, phone number.

Procedure:

Steps:

• Open Start → Programs → Accessories → Notepad
• Type the following training data set with the help of Notepad for Employee Table.

@relation employee
@attribute name {x,y,z,a,b}

@attribute id numeric

@attribute salary {low,medium,high}

@attribute exp numeric

@attribute gender {male,female}

@attribute phone numeric

@data

x,101,low,2,male,250311

y,102,high,3,female,251665

z,103,medium,1,male,240238

a,104,low,5,female,200200

b,105,high,2,male,240240

• After that, the file is saved in the .arff file format.

• Minimize the arff file and then open Start → Programs → Weka-3-4.
• Click on weka-3-4, then Weka dialog box is displayed on the screen.

• In that dialog box there are four modes, click on explorer.

• Explorer shows many options. In that click on ‘open file’ and select the arff file

• Click on edit button which shows employee table on weka.


Training Data Set: Employee Table
Result:

This program has been successfully executed.

FILE 02

Aim:

Create a Weather Table with the help of Data Mining Tool WEKA.

Description:

We need to create a Weather table with training data set which includes attributes like outlook,
temperature, humidity, windy, play.

Procedure:

Steps:

• Open Start → Programs → Accessories → Notepad
• Type the following training data set with the help of Notepad for Weather Table.

@relation weather

@attribute outlook {sunny,rainy,overcast}

@attribute temparature numeric

@attribute humidity numeric


@attribute windy {true,false}

@attribute play {yes,no}

@data

sunny,85.0,85.0,false,no

overcast,80.0,90.0,true,no

sunny,83.0,86.0,false,yes

rainy,70.0,86.0,false,yes

rainy,68.0,80.0,false,yes

rainy,65.0,70.0,true,no

overcast,64.0,65.0,false,yes

sunny,72.0,95.0,true,no

sunny,69.0,70.0,false,yes

rainy,75.0,80.0,false,yes

• After that, the file is saved in the .arff file format.

• Minimize the arff file and then open Start → Programs → Weka-3-4.
• Click on weka-3-4, then Weka dialog box is displayed on the screen.

• In that dialog box there are four modes, click on explorer.

• Explorer shows many options. In that click on ‘open file’ and select the arff file

• Click on edit button which shows weather table on weka.


Training Data Set Weather Table
Result:

This program has been successfully executed.

FILE 03

Aim:

Apply Pre-Processing techniques to the training data set of Weather Table

Description:

Real-world databases are highly susceptible to noise, missing values and inconsistency
due to their huge size, so the data should be pre-processed to improve its quality and
the quality of the mining results; pre-processing also improves the efficiency of mining.

There are 3 pre-processing techniques they are:

• Add
• Remove

• Normalization

Creation of Weather Table:

Procedure:

• Open Start → Programs → Accessories → Notepad
• Type the following training data set with the help of Notepad for Weather Table.

@relation weather

@attribute outlook {sunny,rainy,overcast}

@attribute temparature numeric

@attribute humidity numeric

@attribute windy {true,false}

@attribute play {yes,no}

@data

sunny,85.0,85.0,false,no

overcast,80.0,90.0,true,no
sunny,83.0,86.0,false,yes

rainy,70.0,86.0,false,yes

rainy,68.0,80.0,false,yes

rainy,65.0,70.0,true,no

overcast,64.0,65.0,false,yes

sunny,72.0,95.0,true,no

sunny,69.0,70.0,false,yes

rainy,75.0,80.0,false,yes

• After that, the file is saved in the .arff file format.

• Minimize the arff file and then open Start → Programs → Weka-3-4.
• Click on weka-3-4, then Weka dialog box is displayed on the screen.

• In that dialog box there are four modes, click on explorer.

• Explorer shows many options. In that click on ‘open file’ and select the arff file

• Click on edit button which shows weather table on weka.



Add Pre-Processing Technique:

Procedure:

• Start → Programs → Weka-3-4 → Weka-3-4
• Click on explorer.

• Click on open file.

• Select Weather.arff file and click on open.

• Click on Choose button and select the Filters option.

• In Filters, we have Supervised and Unsupervised data.

• Click on Unsupervised data.

• Select the attribute Add.

• A new window is opened.

• In that window, we enter the attribute index, type, date format and nominal label values for the new attribute Climate.

• Click on OK.

• Press the Apply button, then a new attribute is added to the Weather Table.

• Save the file.

• Click on the Edit button, it shows a new Weather Table on Weka.

Weather Table after adding new attribute CLIMATE:


Remove Pre-Processing Technique:

Procedure:

• Start → Programs → Weka-3-4 → Weka-3-4
• Click on explorer.

• Click on open file.

• Select Weather.arff file and click on open.

• Click on Choose button and select the Filters option.

• In Filters, we have Supervised and Unsupervised data.

• Click on Unsupervised data.

• Select the attribute Remove.

• Select the attributes windy, play to Remove.

• Click Remove button and then Save.

• Click on the Edit button, it shows a new Weather Table on Weka.

Weather Table after removing attributes WINDY, PLAY:



Normalize Pre-Processing Technique:

Procedure:

• Start → Programs → Weka-3-4 → Weka-3-4
• Click on explorer.

• Click on open file.

• Select Weather.arff file and click on open.

• Click on Choose button and select the Filters option.

• In Filters, we have Supervised and Unsupervised data.

• Click on Unsupervised data.


• Select the attribute Normalize.

• Select the attributes temparature, humidity to Normalize.

• Click on Apply button and then Save.

• Click on the Edit button, it shows a new Weather Table with normalized values on Weka.

Weather Table after Normalizing TEMPARATURE, HUMIDITY:
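
The same transformation can be applied outside the GUI; a minimal API sketch (the
file name is an assumption for illustration):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // illustrative path
        Normalize norm = new Normalize();  // scales every numeric attribute to [0,1]
        norm.setInputFormat(data);         // tell the filter about the data format
        Instances normalized = Filter.useFilter(data, norm);
        System.out.println(normalized);    // temparature and humidity now in [0,1]
    }
}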


Result:

This program has been successfully executed.


9. Description of German Credit Data.

Credit Risk Assessment


Description: The business of banks is making loans. Assessing the credit
worthiness of an applicant is of crucial importance. You have to develop a system to
help a loan officer decide whether the credit of a customer is good or bad. A bank's
business rules regarding loans must consider two opposing factors. On the one hand, a
bank wants to make as many loans as possible; interest on these loans is the bank's
profit source. On the other hand, a bank cannot afford to make too many bad loans;
too many bad loans could lead to the collapse of the bank. The bank's loan policy
must therefore involve a compromise: not too strict and not too lenient.

To do the assignment, you first and foremost need some knowledge about the world of
credit.
You can acquire such knowledge in a number of ways.

• Knowledge engineering: Find a loan officer who is willing to talk. Interview her
and try to represent her knowledge in a number of ways.
• Books: Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
• Common sense: Imagine yourself as a loan officer and make up reasonable rules
which can be used to judge the creditworthiness of a loan applicant.
• Case histories: Find records of actual cases where competent loan officers
correctly judged when to approve a loan application and when not to.

The German Credit Data


Actual historical credit data is not always easy to come by because of
confidentiality rules. Here is one such data set, consisting of 1000 actual cases
collected in Germany.

In spite of the fact that the data is German, you should probably make use of it
for this assignment (unless you really can consult a real loan officer!).
There are 20 attributes used in judging a loan applicant (i.e., 7 numerical
attributes and 13 categorical or nominal attributes). The goal is to classify the
applicant into one of two categories: good or bad.
The attributes present in the German credit data are:
1. Checking_Status
2. Duration
3. Credit_history
4. Purpose
5. Credit_amount
6. Savings_status
7. Employment
8. Installment_Commitment
9. Personal_status
10. Other_parties
11. Residence_since
12. Property_Magnitude
13. Age
14. Other_payment_plans
15. Housing
16. Existing_credits
17. Job
18. Num_dependents
19. Own_telephone
20. Foreign_worker
21. Class
EXPERIMENT-1
• OBJECTIVE:

List all the categorical (or nominal) attributes and the real-valued attributes separately.
• PROCEDURE:
• Open the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Clicking on any attribute in the left panel will show the basic statistics on that selected
attribute.
• OUTPUT:
EXPERIMENT-2

• OBJECTIVE:

Which attributes do you think might be crucial in making the credit assessment? Come
up with some simple rules in plain English using your selected attributes.
• PROCEDURE:

• Given the Bank database for mining.


• Select EXPLORER in WEKA GUI Chooser.
• Load “Bank.csv” in Weka by Open file in Preprocess tab.
• Select only Nominal values.
• Go to Associate Tab.
• Select the Apriori algorithm from the "Choose" button present in Associator:
weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
• Select Start button
• Now we can see the sample rules.
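
The same run can be reproduced through the API; a minimal sketch (the file name is
an assumption, and the data must contain only nominal attributes, as in step 4 above):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunApriori {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-nominal.arff"); // illustrative path
        Apriori apriori = new Apriori();
        apriori.setNumRules(10);           // -N 10, as in the setup string above
        apriori.buildAssociations(data);
        System.out.println(apriori);       // prints the best association rules
    }
}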


OUTPUT:

EXPERIMENT-3

• OBJECTIVE:
One type of model that you can create is a decision tree. Train a decision tree using
the complete dataset as the training data. Report the model obtained after training.
PROCEDURE:

• Open Weka GUI Chooser.


• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Go to Classify tab.
• Here the c4.5 algorithm has been chosen which is entitled as j48 in Java and
can be selected by clicking the button choose and select tree j48
• Select Test options “Use training set”
• If needed, select the class attribute.
• Click Start.
• Now we can see the output details in the Classifier output.
• Right click on the result list and select “visualize tree”option .
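Programmatically, the same model can be built in a few lines of the Weka API. A sketch; the file name and the class being the last attribute are assumptions:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.csv");
        data.setClassIndex(data.numAttributes() - 1); // class assumed to be the last attribute
        J48 tree = new J48();
        tree.buildClassifier(data);  // train on the complete dataset
        System.out.println(tree);    // textual form of the decision tree
    }
}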

• OUTPUT:


EXPERIMENT-4
• OBJECTIVE:

Suppose you use your above model trained on the complete dataset, and classify
credit good/bad for each of the examples in the dataset. What % of examples can you
classify correctly? (This is also called testing on the training set) Why do you think you
cannot get 100 % training accuracy?
• PROCEDURE:

• Given the Bank database for mining.


• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Go to Classify tab.
• Choose Classifier “Tree”
• Select “NBTree”, i.e., the Naive Bayesian tree.
• Select Test options “Use training set”
• If needed, select the class attribute.
• Now click Start.
• Now we can see the output details in the Classifier output.
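Evaluation on the training data can also be scripted. A sketch using the Weka Evaluation API (file name assumed; any classifier can be substituted for J48):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainingSetEvaluation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.csv");
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);             // test on the same data it was trained on
        System.out.println(eval.toSummaryString()); // accuracy, error rates, etc.
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}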

• OUTPUT:

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances         554               92.3333 %
Incorrectly Classified Instances        46                7.6667 %
Kappa statistic                          0.845
Mean absolute error                      0.1389
Root mean squared error                  0.2636
Relative absolute error                 27.9979 %
Root relative squared error             52.9137 %
Total Number of Instances              600

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.894    0.052    0.935      0.894   0.914      0.936     YES
               0.948    0.106    0.914      0.948   0.931      0.936     NO
Weighted Avg.  0.923    0.081    0.924      0.923   0.923      0.936

=== Confusion Matrix ===

   a   b   <-- classified as
 245  29 |   a = YES
  17 309 |   b = NO

EXPERIMENT-5

• OBJECTIVE:

Is testing on the training set as you did above a good idea? Why or Why not?

• PROCEDURE:
• In Test options, select the Supplied test set radio button
• Click Set
• Choose the file which contains records that were not in the training set we used to create
the model.
• Click Start. (WEKA will run this test data set through the model we already created.)
• Compare the output results with those of the 4th experiment.
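The supplied-test-set run can be scripted as well. A minimal sketch, assuming a held-out file "bank-test.csv" (the name is hypothetical) with the same attribute structure as the training file:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSet {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("bank.csv");
        Instances test  = DataSource.read("bank-test.csv"); // hypothetical held-out records
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(train);            // model built on training data only
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);         // evaluated on unseen records
        System.out.println(eval.toSummaryString());
    }
}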

• OUTPUT:

Testing on the training set is not a good idea: the model has already seen those records, so the measured accuracy is optimistic. A held-out test set, as used here, gives a more honest estimate.

The important numbers to focus on are the "Correctly Classified Instances" (92.3 percent) and the "Incorrectly Classified Instances" (7.6 percent). Another important number is the "ROC Area" in the first row (0.936). Finally, the "Confusion Matrix" shows the numbers of false positives and false negatives: 29 false positives and 17 false negatives in this matrix.

Based on our accuracy rate of 92.3 percent, we say that upon initial analysis this is a good model.

One final step in validating our classification tree is to run our test set through the model and confirm its accuracy.

If the "Correctly Classified Instances" on the test set is close to the "Correctly Classified Instances" on the training set, the model is unlikely to break down when unknown or future data is applied to it.
EXPERIMENT-6

• OBJECTIVE:

One approach for solving the problem encountered in the previous question is using
cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again
using cross-validation and report your results. Does your accuracy
increase/decrease? Why?

• PROCEDURE:

• Given the Bank database for mining.


• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Go to Classify tab.
• Choose Classifier “Tree”
• Select J48
• Select Test options “Cross-validation”.
• Set “Folds” Ex:10
• If needed, select the class attribute.
• Now click Start.
• Now we can see the output details in the Classifier output.
• Compare the output results with those of the 4th experiment.
• Check whether the accuracy increased or decreased.
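Briefly: in k-fold cross-validation the data is split into k equal folds; each fold is held out once for testing while the model is trained on the remaining k-1 folds, and the k results are averaged. A sketch of the same run with the API (file name assumed):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class CrossValidateJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.csv");
        data.setClassIndex(data.numAttributes() - 1);
        Evaluation eval = new Evaluation(data);
        // 10-fold cross-validation; a fresh J48 is trained and tested on each fold
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}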





• OUTPUT:
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         539               89.8333 %
Incorrectly Classified Instances        61               10.1667 %
Kappa statistic                          0.7942
Mean absolute error                      0.167
Root mean squared error                  0.305
Relative absolute error                 33.6511 %
Root relative squared error             61.2344 %
Total Number of Instances              600

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.861    0.071    0.911      0.861   0.885      0.883     YES
               0.929    0.139    0.889      0.929   0.909      0.883     NO
Weighted Avg.  0.898    0.108    0.899      0.898   0.898      0.883

=== Confusion Matrix ===

   a   b   <-- classified as
 236  38 |   a = YES
  23 303 |   b = NO

EXPERIMENT-7

• OBJECTIVE

Check to see if the data shows a bias against "foreign workers" (attribute 20) or
"personal_status" (attribute 9). One way to do this (perhaps rather simple-minded) is to
remove these attributes from the dataset and see if the decision tree created in those
cases is significantly different from the full dataset case which you have already done.
To remove an attribute you can use the Preprocess tab in Weka's GUI Explorer. Did
removing these attributes have any significant effect? Discuss.

• PROCEDURE:

• Given the Bank database for mining.


• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• In the "Filter" panel, click on the "Choose" button. This will show a popup window with
list available filters.
• Select “weka.filters.unsupervised.attribute.Remove”
• Next, click on text box immediately to the right of the "Choose" button
• In the resulting dialog box, enter the index of the attribute to be filtered out (make
sure that the "invertSelection" option is set to false).
• Then click "OK". Now, in the filter box you will see "Remove -R 1".
• Click the "Apply" button to apply this filter to the data. This will remove the "id"
attribute and create a new working relation
• To save the new working relation as an ARFF file, click on save button in the top panel.
• Go to OPEN file and browse the file that is newly saved (attribute deleted file)
• Go to Classify tab.
• Choose Classifier “Tree”
• Select j48 tree
• Select Test options “Use training set”
• If needed, select the class attribute.
• Now click Start.
• Now we can see the output details in the Classifier output.
• Right click on the result list and select the “visualize tree” option.
• Compare the output results with those of the 4th experiment.
• Check whether the accuracy increased or decreased.
• Check whether removing these attributes had any significant effect.
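The same attribute removal can be done in code with the Remove filter. A sketch; the indices follow the objective's numbering (attribute 9 = personal_status, attribute 20 = foreign_worker), which should be verified against the loaded file:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.csv");
        Remove remove = new Remove();
        remove.setAttributeIndices("9,20");  // 1-based indices of the attributes to drop
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        System.out.println(reduced.numAttributes() + " attributes remain");
    }
}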

• OUTPUT:
EXPERIMENT-08
• OBJECTIVE:

Another question might be, do you really need to input so many attributes to get good
results? Maybe only a few would do. For example, you could try just having attributes
2, 3, 5, 7, 10, 17 (and 21, the class attribute (naturally)). Try out some combinations.
(You had removed two attributes in problem 7. Remember to reload the arff data file to
get all the attributes initially before you start selecting the ones you want).
• PROCEDURE:

• Given the Bank database for mining.


• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Select the attributes to be removed from the attributes list and remove them,
so that only the attributes necessary for classification are left in the
attributes panel.
• Then go to the Classify tab.
• Choose Classifier “Tree”
• Select j48
• Select Test options “Use training set”
• If needed, select the class attribute.
• Now click Start.
• Now we can see the output details in the Classifier output.
• Right click on the result list and select the “visualize tree” option.
• Compare the output results with those of the 4th experiment.
• Check whether the accuracy increased or decreased.
• Check whether removing these attributes had any significant effect.
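To keep just attributes 2, 3, 5, 7, 10, 17 and 21 programmatically, the Remove filter can be inverted. A sketch (file name assumed):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KeepSubset {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.csv");
        Remove keep = new Remove();
        keep.setAttributeIndices("2,3,5,7,10,17,21"); // attributes to keep (1-based)
        keep.setInvertSelection(true);                // invert: remove everything else
        keep.setInputFormat(data);
        Instances subset = Filter.useFilter(data, keep);
        System.out.println(subset.numAttributes() + " attributes kept");
    }
}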
• OUTPUT:
EXPERIMENT-09

• OBJECTIVE:

Sometimes, the cost of rejecting an applicant who actually has a good credit (case 1)
might be higher than accepting an applicant who has bad credit (case 2). Instead of
counting the misclassifications equally in both cases, give a higher cost to the first case
(say cost 5) and lower cost to the second case. You can do this by using a cost matrix
in Weka. Train your Decision Tree again and report the Decision Tree and cross-validation
results. Are they significantly different from results obtained in problem 6
(using equal cost)?
• PROCEDURE:
• Given the Bank database for mining.
• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Go to Classify tab.
• Choose Classifier “Tree”
• Select j48
• Select Test options “Use training set”.
• Click on “more options”.
• Select cost sensitive evaluation and click on set button
• Set the matrix values and click on resize. Then close the window.
• Click Ok
• Click start.
• We can see the output details in the Classifier output
• Select Test options “Cross-validation”.
• Set “Folds” Ex: 10
• If needed, select the class attribute.
• Now click Start.
• Now we can see the output details in the Classifier output.
• Compare the results of the two runs (training-set evaluation and cross-validation).
• Compare the results with those of experiment 6.
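A cost matrix can also be supplied through the Evaluation API. A sketch; the class order (row/column 0 = good, 1 = bad) is an assumption that must be checked against the loaded data, and the file name is assumed:

import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class CostSensitiveEval {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.csv");
        data.setClassIndex(data.numAttributes() - 1);
        CostMatrix costs = new CostMatrix(2);  // 2x2 matrix for the two classes
        // Assumed class order: 0 = good, 1 = bad -- verify on the loaded data
        costs.setCell(0, 1, 5.0); // cost 5: a good applicant classified as bad (case 1)
        costs.setCell(1, 0, 1.0); // cost 1: a bad applicant classified as good (case 2)
        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}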

• OUTPUT:
EXPERIMENT-10
• OBJECTIVE:

Do you think it is a good idea to prefer simple decision trees instead of having long
complex decision trees? How does the complexity of a Decision Tree relate to the
bias of the model?
• PROCEDURE:

This is a discussion question. Simpler (shorter) trees are generally preferable: they are
easier to interpret and less prone to overfitting the training data. Tree complexity trades
off against bias: a deep, complex tree has low bias but high variance (it can memorize the
training set), while a heavily pruned tree has higher bias but more stable predictions. How
much complexity is appropriate depends on the attribute set and the relationships among
attributes we want to study, judged against the database and user requirements.
EXPERIMENT-11

• OBJECTIVE:

You can make your Decision Trees simpler by pruning the nodes. One approach is to use
Reduced Error Pruning. Explain this idea briefly. Try reduced error pruning for training
your Decision Trees using cross-validation (you can do this in Weka) and report the
Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your
accuracy increase?

• PROCEDURE:

• Given the Bank database for mining.


• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Select some of the attributes from attributes list
• Go to Classify tab.
• Choose Classifier “Tree”
• Select “J48” (reduced-error pruning is an option of the J48 learner).
• Select Test options “Use training set”
• Right click on the text box beside the Choose button and select “show properties”.
• Now change the “reducedErrorPruning” option from “false” to “true”.
• Set the number of folds used for reduced-error pruning as needed.
• If needed, select the class attribute.
• Now click Start.
• Now we can see the output details in the Classifier output.
• Right click on the result list and select ” visualize tree “option.
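A sketch of the equivalent API usage (file name assumed; the fold counts are illustrative):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class ReducedErrorPruning {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.csv");
        data.setClassIndex(data.numAttributes() - 1);
        J48 pruned = new J48();
        pruned.setReducedErrorPruning(true); // -R: prune using a held-back subset
        pruned.setNumFolds(3);               // -N 3: one fold withheld for pruning
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(pruned, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}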

• OUTPUT:
EXPERIMENT-12

• OBJECTIVE:

(Extra Credit): How can you convert a Decision Tree into "if-then-else" rules? Make
up your own small Decision Tree consisting of 2-3 levels and convert it into a set of
rules. There also exist classifiers that output the model directly in the form of rules;
one such classifier in Weka is rules.PART. Train this model and report the set of rules
obtained. Sometimes just one attribute can be good enough in making the decision,
yes, just one! Can you predict what attribute that might be in this dataset? The OneR
classifier uses a single attribute to make decisions (it chooses the attribute based on
minimum error). Report the rule obtained by training a OneR classifier. Rank the
performance of J48, PART and OneR.

• RESOURCES:

Weka mining tool.

• PROCEDURE:

• Given the Bank database for mining.


• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Select some of the attributes from attributes list
• Go to Classify tab.
• Choose Classifier “Trees”
• Select “J48”.
• Select Test options “Use training set”
• If needed, select the class attribute.
• Now click Start.
• Now we can see the output details in the Classifier output.
• Right click on the result list and select the “visualize tree” option, or run
J48 from the command line:

java weka.classifiers.trees.J48 -t c:\temp\bank.arff

Procedure for “OneR”:


• Given the Bank database for mining.
• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Select some of the attributes from attributes list
• Go to Classify tab.
• Choose Classifier “Rules”
• Select “OneR”.
• Select Test options “Use training set”
• If needed, select the class attribute.
• Now click Start.
• Now we can see the output details in the Classifier output.

Procedure for “PART”:


• Given the Bank database for mining.
• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Select some of the attributes from attributes list
• Go to Classify tab.
• Choose Classifier “Rules”.
• Select “PART”.
• Select Test options “Use training set”
• If needed, select the class attribute.
• Now click Start.
• Now we can see the output details in the Classifier output.
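Both rule learners can be trained and compared in one small program. A sketch (file name assumed):

import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleLearners {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.csv");
        data.setClassIndex(data.numAttributes() - 1);

        OneR oneR = new OneR();
        oneR.buildClassifier(data);
        System.out.println(oneR);  // the single-attribute rule it settled on

        PART part = new PART();
        part.buildClassifier(data);
        System.out.println(part);  // decision list of if-then rules
    }
}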
For illustration, a rule learner's output reflects attribute relevance with respect to
the class, printing one rule per value of the chosen attribute, e.g.:

IF accounting = 1 THEN class = A (Error = 0, Coverage = 7 instances)
IF accounting = 0 THEN class = B (Error = 4/13, Coverage = 13 instances)

• OUTPUT:
