Latest Data Mining Lab Manual
LABORATORY MANUAL
For
B.TECH (IV YEAR – I SEM) (2019-20)
Submitted by
[Your Name]
Recognized under Section 2(f) & 12(B) of the UGC Act 1956, An ISO 9001:2015
Certified Institution
CERTIFICATE
This is to certify that, Mr./Mrs…………………………bearing
Internal Examiner        External Examiner        HOD
Index
S.No. Topic Page No Signature
Syllabus
To do the assignment, you first and foremost need some knowledge about the world of
credit.
You can acquire such knowledge in a number of ways.
• Knowledge engineering: Find a loan officer who is willing to talk. Interview her
and try to represent her knowledge in a number of ways.
• Books: Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
• Common sense: Imagine yourself as a loan officer and make up reasonable rules
which can be used to judge the credit worthiness of a loan applicant.
• Case histories: Find records of actual cases where competent loan officers
correctly judged when, and when not, to approve a loan application.
In spite of the fact that the data is German, you should probably make use of it
for this assignment (unless you really can consult a real loan officer!).
There are 20 attributes used in judging a loan applicant (i.e., 7 numerical
attributes and 13 categorical or nominal attributes). The goal is to classify the applicant
into one of two categories: good or bad.
Subtasks:
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
2. What attributes do you think might be crucial in making the credit assessment?
Come up with some simple rules in plain English using your selected attributes.
3. One type of model that you can create is a decision tree. Train a decision tree using
the complete dataset as the training data. Report the model obtained after training.
4. Suppose you use your above model, trained on the complete dataset, to classify credit
as good/bad for each of the examples in the dataset. What percentage of examples can you classify
correctly? (This is also called testing on the training set.) Why do you think you cannot get
100% training accuracy?
5. Is testing on the training set as you did above a good idea? Why or why not?
6. One approach for solving the problem encountered in the previous question is using
cross-validation. Briefly describe what cross-validation is. Train a decision tree again
using cross-validation and report your results. Does accuracy increase or decrease? Why?
7. Check to see if the data shows a bias against “foreign workers” or “personal-status”.
One way to do this is to remove these attributes from the dataset and see if the
decision tree created in those cases is significantly different from the full dataset case
which you have already done. Did removing these attributes have any significant
effect? Discuss.
8. Another question might be: do you really need to input so many attributes to get
good results? Maybe only a few would do. For example, you could try just having
attributes 2, 3, 5, 7, 10, 17 and 21. Try out some combinations. (You removed two
attributes in problem 7. Remember to reload the ARFF data file to get all the attributes
initially before you start selecting the ones you want.)
9. Sometimes the cost of rejecting an applicant who actually has good credit might be
higher than that of accepting an applicant who has bad credit. Instead of counting the
misclassifications equally in both cases, give a higher cost to the first case (say, cost 5)
and a lower cost to the second case, by using a cost matrix in Weka. Train your decision
tree and report the decision tree and cross-validation results. Are they significantly
different from the results obtained in problem 6?
10. Do you think it is a good idea to prefer simple decision trees instead of having long,
complex decision trees? How does the complexity of a decision tree relate to the bias
of the model?
11. You can make your decision trees simpler by pruning the nodes. One approach is to
use reduced error pruning. Explain this idea briefly. Try reduced error pruning for
training your decision trees using cross-validation and report the decision trees you
obtain. Also report your accuracy using the pruned model. Does your accuracy increase?
12. How can you convert a decision tree into “if-then-else” rules? Make up your own
small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist
different classifiers that output the model in the form of rules. One such classifier in
Weka is rules.PART; train this model and report the set of rules obtained. Sometimes
just one attribute can be good enough in making the decision, yes, just one! Can you
predict what attribute that might be in this dataset? The OneR classifier uses a single
attribute to make decisions (it chooses the attribute based on minimum error). Report
the rule obtained by training a OneR classifier. Rank the performance of
J48, PART and OneR.
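To make the single-attribute idea in the last subtask concrete, here is a minimal Python sketch of a OneR-style learner for nominal attributes: for each attribute it maps every value to its majority class and keeps the attribute with the fewest training errors. This is only an illustration under stated assumptions, not Weka's OneR implementation; the function name one_r and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, class_index=-1):
    """Return (attribute_index, rule, errors) for the best single attribute.

    rule maps each value of the chosen attribute to the majority class.
    Only nominal attributes are handled in this sketch.
    """
    n_cols = len(rows[0])
    class_index = class_index % n_cols
    best = None
    for a in range(n_cols):
        if a == class_index:
            continue
        # Count class labels for every value of attribute a.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[a]][row[class_index]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[a]] != row[class_index])
        if best is None or errors < best[2]:
            best = (a, rule, errors)
    return best

# Hypothetical toy data: (outlook, windy, play)
toy = [
    ("sunny", "false", "no"),
    ("sunny", "true", "no"),
    ("overcast", "false", "yes"),
    ("overcast", "true", "yes"),
    ("rainy", "false", "yes"),
    ("rainy", "true", "no"),
]
attr, rule, errors = one_r(toy)
print("best attribute index:", attr, "errors:", errors)
```

On the toy rows the first attribute wins because it misclassifies only one instance, while the second misclassifies two.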
1. Weka Introduction
Weka was created by researchers at the University of Waikato, Hamilton, New Zealand.
Credits: Alex Seewald (original command-line primer), David Scuse (original
Experimenter tutorial).
• Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When
the Explorer is first started only the first tab is active; the others are greyed out. This is
because it is necessary to open (and potentially pre-process) a data set before starting to
explore the data.
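The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.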
Once the tabs are active, clicking on them flicks between different screens, on
which the respective actions can be performed. The bottom area of the window
(including the status box, the log button, and the Weka bird) stays visible regardless of
which section you are in. The Explorer can be easily extended with custom tabs. The
Wiki article “Adding tabs in the Explorer” [7] explains this in detail.
II. Experimenter
2.1 Introduction
The Weka Experiment Environment enables the user to create, run, modify, and
analyse experiments in a more convenient manner than is possible when processing
the schemes individually. For example, the user can create an experiment that runs
several schemes against a series of datasets and then analyse the results to determine
if one of the schemes is (statistically) better than the other schemes.
The Experiment Environment can be run from the command line using the
Simple CLI. For example, the following commands could be typed into the CLI to run
the OneR scheme on the Iris dataset using a basic train and test process. (Note that the
commands would be typed on one line into the CLI.) While commands can be typed
directly into the CLI, this technique is not particularly convenient and the experiments
are not easy to modify. The Experimenter comes in two flavours, either with a simple
interface that provides most of the functionality one needs for experiments, or with an
interface with full access to the Experimenter’s capabilities. You can choose between
those two with the Experiment Configuration Mode radio buttons:
• Simple
• Advanced
Both setups allow you to set up standard experiments, which are run locally on
a single machine, or remote experiments, which are distributed between several hosts.
The distribution of experiments cuts down the time the experiments will take until
completion, but on the other hand the setup takes more time. The next section covers
the standard experiments (both simple and advanced), followed by the remote
experiments and finally the analysis of the results.
The Knowledge Flow can handle data either incrementally or in batches (the
Explorer handles batch data only). Of course, learning from data incrementally
requires a classifier that can be updated on an instance-by-instance basis. Currently in
WEKA there are ten classifiers that can handle data incrementally.
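What "updated on an instance-by-instance basis" means can be sketched in a few lines of Python. The class below keeps running counts for a Naive Bayes model over nominal attributes, so each new instance is folded in without revisiting earlier data. It is an illustrative sketch only and is not one of the WEKA classifiers referred to above; the class name, method names and sample stream are assumptions.

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Naive Bayes over nominal attributes, updated one instance at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # value_counts[(attribute_index, value, label)] -> count
        self.value_counts = defaultdict(int)
        # values_seen[attribute_index] -> set of distinct values
        self.values_seen = defaultdict(set)

    def update(self, attributes, label):
        """Fold a single labelled instance into the running counts."""
        self.class_counts[label] += 1
        for i, value in enumerate(attributes):
            self.value_counts[(i, value, label)] += 1
            self.values_seen[i].add(value)

    def predict(self, attributes):
        """Return the label with the highest Laplace-smoothed score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, n in self.class_counts.items():
            score = n / total
            for i, value in enumerate(attributes):
                k = len(self.values_seen.get(i, ())) or 1
                count = self.value_counts.get((i, value, label), 0)
                score *= (count + 1) / (n + k)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical stream of (attributes, label) pairs arriving one by one.
stream = [
    (("sunny", "false"), "no"),
    (("sunny", "true"), "no"),
    (("overcast", "false"), "yes"),
    (("rainy", "false"), "yes"),
]
model = IncrementalNaiveBayes()
for attributes, label in stream:
    model.update(attributes, label)
print(model.predict(("sunny", "true")))
```

A batch learner would need the whole list before training; here the model is usable after every call to update.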
The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters,
clusterers, etc., but without the hassle of the CLASSPATH (it uses the one with
which Weka was started). It offers a simple Weka shell with separated command
line and output.
4.1 Commands
Command redirection
Starting with this version of Weka one can perform a basic redirection:
java weka.classifiers.trees.J48 -t test.arff > j48.txt
Note: the > must be preceded and followed by a space, otherwise it is not recognized
as redirection, but part of another parameter.
ARFF files are not the only format one can load, but all files that can be
converted with Weka’s “core converters”. The following formats are currently
supported:
• ARFF (+ compressed)
• C4.5
• CSV
• libsvm
• binary serialized instances
• XRFF (+ compressed)
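To give a feel for what converting one of these formats involves, the following Python sketch turns a small CSV text into ARFF text by guessing each column's type. It is not Weka's converter: the function names, the type-guessing rule (a column is numeric if every value parses as a number, otherwise nominal) and the sample CSV are all assumptions made for illustration, and missing values are not handled.

```python
import csv
import io

def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(_is_number(v) for v in column):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: list the distinct values in sorted order.
            values = ",".join(sorted(set(column)))
            lines.append("@attribute " + name + " {" + values + "}")
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines)

# Hypothetical CSV input.
weather_csv = "outlook,temperature,play\nsunny,85,no\novercast,80,yes\nrainy,70,yes\n"
arff_text = csv_to_arff(weather_csv, "weather")
print(arff_text)
```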
10.1 Overview
ARFF files have two distinct sections. The first section is the Header
information, which is followed by the Data information. The Header of the ARFF file
contains the name of the relation, a list of the attributes (the columns in the data), and
their types.
An example header on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments.
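The header/data split described above can be illustrated with a small Python reader. This is a deliberately simplified sketch, not Weka's ARFF loader: it ignores quoting, sparse instances and multi-line declarations, and the function name parse_arff is an assumption.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

sample = """% 1. Title: Iris Plants Database
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""
relation, attributes, rows = parse_arff(sample)
print(relation, len(attributes), len(rows))
```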
Numeric attributes
Numeric attributes can be real or integer numbers.
Nominal attributes
Nominal values are defined by providing a <nominal-specification> listing
the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Values that contain spaces must be quoted.
String attributes
String attributes allow us to create attributes containing arbitrary textual values.
This is very useful in text-mining applications, as we can create datasets with string
attributes, then write Weka filters to manipulate strings (like
StringToWordVectorFilter). String attributes are declared as follows:
@ATTRIBUTE LCC string
Date attributes
Date attribute declarations take the form:
@attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (this is the same format used
by SimpleDateFormat). The default format string accepts the ISO-8601 combined date
and time format: yyyy-MM-dd'T'HH:mm:ss. Dates must be specified in the data
section as the corresponding string representations of the date/time (see example
below).
Relational attributes
Relational attribute declarations take the form:
@attribute <name> relational
  <further attribute definitions>
@end <name>
For the multi-instance dataset MUSK1 the definition would look like this ("..." denotes an
omission):
@attribute molecule_name {MUSK-jf78,...,NON-MUSK-199}
@attribute bag relational
  @attribute f1 numeric
  ...
  @attribute f166 numeric
@end bag
@attribute class {0,1}
...
An example follows:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
Dates must be specified in the data section using the string representation specified in the
attribute declaration. For example:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
Relational data must be enclosed within double quotes ("). For example, an instance of
the MUSK1 dataset ("..." denotes an omission):
MUSK-188,"42,...,30",1
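Returning to the date attribute example, the two timestamp values can be parsed in Python as a quick check of the declared format. Note that Python's strptime codes are not the SimpleDateFormat letters used in ARFF, so the pattern has to be translated by hand; the translation and the helper name parse_timestamp below are assumptions of this sketch.

```python
from datetime import datetime

# SimpleDateFormat "yyyy-MM-dd HH:mm:ss" written with Python's strptime codes.
PYTHON_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_timestamp(value):
    """Parse one quoted ARFF date value into a datetime object."""
    return datetime.strptime(value.strip('"'), PYTHON_FORMAT)

stamps = [
    parse_timestamp('"2001-04-03 12:12:12"'),
    parse_timestamp('"2001-05-03 12:59:55"'),
]
gap = stamps[1] - stamps[0]
print(gap)
```

A value that does not match the declared format raises ValueError, which is a convenient way to spot malformed dates before loading a file.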
3. Preprocess Tab
1. Loading Data
The first four buttons at the top of the preprocess section
enable you to load data into WEKA:
• Open file.... Brings up a dialog box allowing you to browse for the data file on the local
file system.
• Open URL.... Asks for a Uniform Resource Locator address for where the data is
stored.
• Open DB.... Reads data from a database. (Note that to make this work you might
have to edit the file in weka/experiment/DatabaseUtils.props.)
The Current Relation: Once some data has been loaded, the Preprocess panel shows
a variety of information. The Current relation box (the “current relation” is the
currently loaded data, which can be interpreted as a single relational table in database
terminology) has three entries:
1. Relation. The name of the relation, as given in the file it was loaded from. Filters
(described below) modify the name of a relation.
• No. A number that identifies the attribute in the order in which it is specified in the data
file.
• Selection tick boxes. These allow you to select which attributes are present in the
relation.
• Name. The name of the attribute, as it was declared in the data file.
When you click on different rows in the list of attributes, the fields change in the box
to the right titled Selected attribute. This box displays the characteristics of the
currently highlighted attribute in the list:
• Name. The name of the attribute, the same as that given in the attribute list.
• Type. The type of attribute, most commonly Nominal or Numeric.
• Missing. The number (and percentage) of instances in the data for which this
attribute is missing (unspecified).
• Distinct. The number of different values that the data contains for this attribute.
• Unique. The number (and percentage) of instances in the data having a value for
this attribute that no other instances have.
Below these statistics is a list showing more information about the values stored
in this attribute, which differ depending on its type. If the attribute is nominal, the list
consists of each possible value for the attribute along with the number of instances that
have that value. If the attribute is numeric, the list gives four statistics describing the
distribution of values in the data—the minimum, maximum, mean and standard
deviation. And below these statistics there is a coloured histogram, colour-coded
according to the attribute chosen as the Class using the box above the histogram. (This
box will bring up a drop-down list of available selections when clicked.) Note that only
nominal Class attributes will result in a colour-coding. Finally, after pressing the
Visualize All button, histograms for all the attributes in the data are shown in a separate
window.
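The statistics shown in the Selected attribute box can be reproduced by hand. The Python sketch below computes Missing, Distinct and Unique for one column, plus either per-value counts (nominal) or minimum, maximum, mean and standard deviation (numeric). It assumes missing values are written as "?" and uses the sample standard deviation; whether that matches the exact figure Weka displays has not been checked here. The function name is hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

def attribute_summary(values, missing="?"):
    """Summarise one attribute column given as a list of strings."""
    present = [v for v in values if v != missing]
    counts = Counter(present)
    summary = {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # Values that occur in exactly one instance.
        "unique": sum(1 for c in counts.values() if c == 1),
    }
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        numbers = None
    if numbers is None or len(numbers) < 2:
        # Treated as nominal: report how many instances carry each value.
        summary["counts"] = dict(counts)
    else:
        summary.update(
            min=min(numbers),
            max=max(numbers),
            mean=mean(numbers),
            stdev=stdev(numbers),
        )
    return summary

temperature_summary = attribute_summary(["85", "80", "83", "70", "?"])
outlook_summary = attribute_summary(["sunny", "sunny", "rainy"])
print(temperature_summary)
print(outlook_summary)
```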
Returning to the attribute list, to begin with all the tick boxes are unticked.
They can be toggled on/off by clicking on them individually. The four buttons above
can also be used to change the selection:
• All. All boxes are ticked.
• None. All boxes are cleared (unticked).
• Invert. Boxes that are ticked become unticked and vice versa.
• Pattern. Enables the user to select attributes based on a Perl 5 regular expression. E.g., .*_id
selects all attributes whose name ends with _id.
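The effect of the Pattern button can be imitated with Python's re module, whose syntax agrees with Perl 5 for simple patterns like this one. The sketch assumes the intended example pattern is .*_id and that a pattern must match the whole attribute name; the attribute names and the function name are made up for illustration.

```python
import re

def select_by_pattern(names, pattern):
    """Return the attribute names that the pattern matches in full."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]

# Hypothetical attribute names.
names = ["customer_id", "age", "loan_id", "identity"]
selected = select_by_pattern(names, r".*_id")
print(selected)
```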
Once the desired attributes have been selected, they can be removed by clicking
the Remove button below the list of attributes. Note that this can be undone by clicking
the Undo button, which is located next to the Edit button in the top-right corner of the
Preprocess panel.
The preprocess section allows filters to be defined that transform the data in
various ways. The Filter box is used to set up the filters that are required. At the left of
the Filter box is a Choose button. By clicking this button it is possible to select one of
the filters in WEKA. Once a filter has been selected, its name and options are shown in
the field next to the Choose button. Clicking on this box with the left mouse button
brings up a GenericObjectEditor dialog box. A click with the right mouse button (or
Alt+Shift+left click) brings up a menu where you can choose, either to display the
properties in a GenericObjectEditor dialog box, or to copy the current setup string to
the clipboard.
The GenericObjectEditor Dialog Box
• Show properties... has the same effect as left-clicking on the field, i.e., a dialog
appears allowing you to alter the settings.
• Enter configuration... is the “receiving” end for configurations that got copied to
the clipboard earlier on. In this dialog you can enter a class name followed by options (if
the class supports these). This also allows you to transfer a filter setting from the
Preprocess panel to a Filtered Classifier used in the Classify panel.
Left-clicking on any of these gives an opportunity to alter the filter's settings. For
example, the setting may take a text string, in which case you type the string into the
text field provided. Or it may give a drop-down box listing several states to choose from.
Or it may do something else, depending on the information required. Information on
the options is provided in a tool tip if you let the mouse pointer hover over the
corresponding field. More information on the filter and its options can be obtained by
clicking on the More button in the About panel at the top of the GenericObjectEditor
window.
Some objects display a brief description of what they do in an About box, along
with a More button. Clicking on the More button brings up a window describing what
the different options do. Others have an additional button, Capabilities, which lists the
types of attributes and classes the object can handle.
At the bottom of the GenericObjectEditor dialog are four buttons. The first two,
Open... and Save... allow object configurations to be stored for future use. The Cancel
button backs out without remembering any changes that have been made. Once you are
happy with the object and settings you have chosen, click OK to return to the main
Explorer window.
Applying Filters
Once you have selected and configured a filter, you can apply it to the data by
pressing the Apply button at the right end of the Filter panel in the Preprocess panel.
The Preprocess panel will then show the transformed data. The change can be undone
by pressing the Undo button. You can also use the Edit... button to modify your data
manually in a dataset editor. Finally, the Save... button at the top right of the
Preprocess panel saves the current version of the relation in file formats that can
represent the relation, allowing it to be kept for future use.
Note: Some of the filters behave differently depending on whether a class attribute has been
set or not (using the box above the histogram, which will bring up a drop-down list of possible
selections when clicked). In particular, the “supervised filters” require a class attribute to be set,
and some of the “unsupervised attribute filters” will skip the class attribute if one is set. Note that it is
also possible to set Class to None, in which case no class is set.
4. Classification Tab
• Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field
that gives the name of the currently selected classifier, and its options. Clicking on the
text box with the left mouse button brings up a GenericObjectEditor dialog box, just
the same as for filters, that you can use to configure the options of the current
classifier. With a right click (or Alt+Shift+left click) you can once again copy the setup
string to the clipboard or display the properties in a GenericObjectEditor dialog box.
The Choose button allows you to choose one of the classifiers that are available in
WEKA.
• Test Options
The result of applying the chosen classifier will be tested according to the
options that are set by clicking in the Test options box. There are four test modes:
• Use training set. The classifier is evaluated on how well it predicts the class of the
instances it was trained on.
• Supplied test set. The classifier is evaluated on how well it predicts the class of a
set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing
you to choose the file to test on.
Note: No matter which evaluation method is used, the model that is output is always the
one built from all the training data. Further testing options can be set by clicking on the More
options... button:
• Output model. The classification model on the full training set is output so that it
can be viewed, visualized, etc. This option is selected by default.
• Output per-class stats. The precision/recall and true/false statistics for each class
are output. This option is also selected by default.
Note that in the case of a cross-validation the instance numbers do not correspond
to the location in the data!
• Output additional attributes. If additional attributes need to be output alongside
the predictions, e.g., an ID attribute for tracking misclassifications, then the index of
this attribute can be specified here. The usual Weka ranges are supported; “first” and
“last” are therefore valid indices as well (example: “first-3,6,8,12-last”).
• Random seed for xval / % Split. This specifies the random seed used when
randomizing the data before it is divided up for evaluation purposes.
• Preserve order for % Split. This suppresses the randomization of the data
before splitting into train and test set.
• Output source code. If the classifier can output the built model as Java source
code, you can specify the class name here. The code will be printed in the “Classifier
output” area.
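The roles of the random seed and of the Preserve order option can be illustrated with a short Python sketch: the data is shuffled with a seeded generator before being divided, unless order is to be preserved. This is an illustration only, not Weka's evaluation code; the function names, default values and the simple unstratified fold assignment are assumptions.

```python
import random

def percentage_split(instances, percent, seed=1, preserve_order=False):
    """Split instances into (train, test), with `percent` used for training."""
    data = list(instances)
    if not preserve_order:
        random.Random(seed).shuffle(data)  # same seed, same shuffle
    cut = round(len(data) * percent / 100)
    return data[:cut], data[cut:]

def k_folds(instances, k, seed=1):
    """Yield (train, test) pairs for k-fold cross-validation."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    for i in range(k):
        test = data[i::k]  # every k-th instance, starting at offset i
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, test

ordered_train, ordered_test = percentage_split(range(10), 66, preserve_order=True)
folds = list(k_folds(range(10), 5))
print(ordered_train, ordered_test)
print(len(folds), "folds")
```

Because the shuffle happens before the split, the position of an instance inside a fold says nothing about where it sat in the original file.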
• Training a Classifier
Once the classifier, test options and class have all been set, the learning process
is started by clicking on the Start button. While the classifier is busy being trained, the
little bird moves around. You can stop the training process at any time by clicking on
the Stop button. When training is complete, several things happen. The Classifier
output area to the right of the display is filled with text describing the results of
training and testing. A new entry appears in the Result list box. We look at the result
list below; but first we investigate the text that has been output.
The results of the chosen test mode are broken down thus:
• Summary. A list of statistics summarizing how accurately the classifier was able to
predict the true class of the instances under the chosen test mode.
• Detailed Accuracy By Class. A more detailed per-class break down of the classifier’s
prediction accuracy.
• Confusion Matrix. Shows how many instances have been assigned to each class.
Elements show the number of test examples whose actual class is the row and whose
predicted class is the column.
• Source code (optional). This section lists the Java source code if one chose “Output
source code” in the “More options” dialog.
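The confusion matrix layout described above (rows are actual classes, columns are predicted classes) can be built in a few lines of Python. The sketch also shows how accuracy follows from the diagonal and how a cost matrix, such as the one in the cost-sensitive subtask where rejecting a good applicant costs 5, weights the off-diagonal cells. The labels, predictions and cost values are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are the actual class, columns are the predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of instances on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def total_cost(matrix, costs):
    """Sum of count * cost over all cells (same row/column layout)."""
    return sum(
        matrix[i][j] * costs[i][j]
        for i in range(len(matrix))
        for j in range(len(matrix))
    )

labels = ["good", "bad"]
actual = ["good", "good", "good", "bad", "bad"]
predicted = ["good", "bad", "good", "bad", "good"]
matrix = confusion_matrix(actual, predicted, labels)

# Hypothetical costs: actual good predicted bad costs 5, the reverse costs 1.
costs = [[0, 5], [1, 0]]
print(matrix, accuracy(matrix), total_cost(matrix, costs))
```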
5. Clustering Tab
The Cluster mode box is used to choose what to cluster and how to evaluate
the results. The first three options are the same as for classification: Use training set,
Supplied test set and Percentage split (Section 5.3.1)—except that now the data is
assigned to clusters instead of trying to predict a specific class. The fourth mode,
Classes to clusters evaluation, compares how well the chosen clusters match up with a
pre-assigned class in the data. The drop-down box below this option selects the class,
just as in the Classify panel.
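One simple way to picture a classes-to-clusters comparison is to label each cluster with the class most common among its members and count the instances that disagree. The Python sketch below does exactly that. It is a simplification and may differ from what Weka computes (for instance, it lets several clusters share one class); the function name and data are hypothetical.

```python
from collections import Counter, defaultdict

def classes_to_clusters(cluster_ids, class_labels):
    """Map each cluster to its majority class and return (mapping, error_rate)."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1
    mapping = {
        cluster: counts.most_common(1)[0][0]
        for cluster, counts in by_cluster.items()
    }
    wrong = sum(
        1
        for cluster, label in zip(cluster_ids, class_labels)
        if mapping[cluster] != label
    )
    return mapping, wrong / len(class_labels)

# Hypothetical cluster assignments and pre-assigned classes.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["yes", "yes", "no", "no", "no", "no"]
mapping, error_rate = classes_to_clusters(cluster_ids, class_labels)
print(mapping, error_rate)
```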
An additional option in the Cluster mode box, the Store clusters for visualization
tick box, determines whether or not it will be possible to visualize the clusters once
training is complete. When dealing with datasets that are so
large that memory becomes a problem it may be helpful to disable this option.
• Ignoring Attributes
Often, some attributes in the data should be ignored when clustering. The Ignore
attributes button brings up a small window that allows you to select which attributes
are ignored. Clicking on an attribute in the window highlights it, holding down the
SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles
individual attributes on and off. To cancel the selection, back out with the Cancel
button. To activate it, click the Select button. The next time clustering is invoked, the
selected attributes are ignored.
The FilteredClusterer meta-clusterer offers the user the possibility to apply
filters directly before the clusterer is learned. This approach eliminates the manual
application of a filter in the Preprocess panel, since the data gets processed on the fly.
This is useful if one needs to try out different filter setups.
• Learning Clusters
The Cluster section, like the Classify section, has Start/Stop buttons, a result text area
and a result list. These all behave just like their classification counterparts. Right-
clicking an entry in the result list brings up a similar menu, except that it shows only
two visualization options: Visualize cluster assignments and Visualize tree. The latter
is grayed out when it is not applicable.
6. Associate Tab
• Setting Up
This panel contains schemes for learning association rules, and the learners are
chosen and configured in the same way as the clusterers, filters, and classifiers
in the other panels.
• Learning Associations
Once appropriate parameters for the association rule learner have been set,
click the Start button. When complete, right-clicking on an entry in the result list
allows the results to be viewed or saved.
7. Select Attributes Tab
• Use full training set. The worth of the attribute subset is determined using the full
set of training data.
Clicking Start starts running the attribute selection process. When it is finished,
the results are output into the result area, and an entry is added to the result list. Right-
clicking on the result list gives several options. The first three, (View in main window,
View in separate window and Save result buffer), are the same as for the classify panel. It
is also possible to Visualize reduced data, or if you have used an attribute transformer
such as Principal Components, Visualize transformed data. The reduced/transformed
data can be saved to a file with the Save reduced data... or Save transformed data...
option.
In case one wants to reduce/transform a training and a test set at the same time and not use the
AttributeSelectedClassifier from the classifier panel, it is best to use the AttributeSelection
filter (a supervised attribute filter) in batch mode ('-b') from the command line or in the
Simple CLI. The batch mode allows one to specify an additional input and output file pair
(options -r and -s), which is processed with the filter setup that was determined based on the
training data.
8. Visualizing Tab
WEKA’s visualization section allows you to visualize 2D plots of the current relation.
Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This
allows you to colour the points based on the attribute selected. Below the plot
area, a legend describes what values the colours correspond to. If the values are
discrete, you can modify the colour used for each one by clicking on them and
making an appropriate selection in the window that pops up.
To the right of the plot area is a series of horizontal strips. Each strip represents
an attribute, and the dots within it show the distribution of values of the attribute.
These values are randomly scattered vertically to help you see concentrations of points.
You can choose what axes are used in the main graph by clicking on these strips. Left-
clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking
changes the y-axis. The ‘X’ and ‘Y’ written beside the strips shows what the current axes
are (‘B’ is used for ‘both X and Y’).
Above the attribute strips is a slider labelled Jitter, which is a random
displacement given to all points in the plot. Dragging it to the right increases the
amount of jitter, which is useful for spotting concentrations of points. Without jitter, a
million instances at the same point would look no different to just a
single lonely instance.
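The idea behind jitter can be shown with a short Python sketch that adds a small random displacement to every point, so that coincident points spread out and become visible. How Weka scales the slider value is not covered here; the uniform displacement, the function name and the sample points are assumptions.

```python
import random

def jitter(points, amount, seed=0):
    """Displace each (x, y) point by up to `amount` in both directions."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]

# A thousand instances stacked on the same spot.
stacked = [(1.0, 2.0)] * 1000
spread = jitter(stacked, 0.1)
print(len(set(stacked)), "distinct position before,", len(set(spread)), "after")
```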
• Select Instance. Clicking on an individual data point brings up a window listing its
attributes. If more than one point appears at the same location, more than one set of
attributes is shown.
• Rectangle. You can create a rectangle, by dragging, that selects the points inside it.
• Polygon. You can build a free-form polygon that selects the points inside it. Left-
click to add vertices to the polygon, right-click to complete it. The polygon will always
be closed off by connecting the first point to the last.
• Polyline. You can build a polyline that distinguishes the points on one side from
those on the other. Left-click to add vertices to the polyline, right-click to finish. The
resulting shape is open (as opposed to a polygon, which is always closed).
Once an area of the plot has been selected using Rectangle, Polygon or Polyline,
it turns grey. At this point, clicking the Submit button removes all instances from the
plot except those within the grey selection area. Clicking on the Clear button erases the
selected area without affecting the graph.
Once any points have been removed from the graph, the Submit button changes to a Reset
button. This button undoes all previous removals and returns you to the original graph
with all points included. Finally, clicking the Save button allows you to save the currently
visible instances to a new ARFF file.
Aim:
Create an Employee Table with the help of Data Mining Tool WEKA.
Description:
We need to create an Employee Table with training data set which includes attributes like name,
id, salary, experience, gender, phone number.
Procedure:
Steps:
• Open Start → Programs → Accessories → Notepad
• Type the following training data set with the help of Notepad for Employee Table.
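@relation employee
@attribute name {x,y,z,a,b}
@attribute id numeric
@attribute salary {low,medium,high}
@attribute experience numeric
@attribute gender {male,female}
@attribute phone numeric
@data
x,101,low,2,male,250311
y,102,high,3,female,251665
z,103,medium,1,male,240238
a,104,low,5,female,200200
b,105,high,2,male,240240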
• Minimize the arff file and then open Start → Programs → weka-3-4.
• Click on weka-3-4; the Weka dialog box is then displayed on the screen.
• Explorer shows many options. Among them, click on ‘Open file’ and select the arff file.
Training Data Set Employee Table
Result:
FILE 02
Aim:
Create a Weather Table with the help of Data Mining Tool WEKA.
Description:
We need to create a Weather table with training data set which includes attributes like outlook,
temperature, humidity, windy, play.
Procedure:
Steps:
• Open Start → Programs → Accessories → Notepad
• Type the following training data set with the help of Notepad for Weather Table.
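@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true,false}
@attribute play {yes,no}
@data
sunny,85.0,85.0,false,no
overcast,80.0,90.0,true,no
sunny,83.0,86.0,false,yes
rainy,70.0,86.0,false,yes
rainy,68.0,80.0,false,yes
rainy,65.0,70.0,true,no
overcast,64.0,65.0,false,yes
sunny,72.0,95.0,true,no
sunny,69.0,70.0,false,yes
rainy,75.0,80.0,false,yes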
• Explorer shows many options. In that click on ‘open file’ and select the arff file
Training Data Set Weather Table
Result:
FILE 03
Aim:
Description:
Real-world databases are highly susceptible to noisy, missing and inconsistent data because of
their huge size, so the data can be pre-processed to improve its quality and, with it, the
quality of the mining results; pre-processing also improves the efficiency of mining.
• Add
• Remove
• Normalization
Procedure:
• Open Start → Programs → Accessories → Notepad
• Type the following training data set with the help of Notepad for Weather Table.
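@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true,false}
@attribute play {yes,no}
@data
sunny,85.0,85.0,false,no
overcast,80.0,90.0,true,no
sunny,83.0,86.0,false,yes
rainy,70.0,86.0,false,yes
rainy,68.0,80.0,false,yes
rainy,65.0,70.0,true,no
overcast,64.0,65.0,false,yes
sunny,72.0,95.0,true,no
sunny,69.0,70.0,false,yes
rainy,75.0,80.0,false,yes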
• The Explorer shows many options. Click on ‘Open file’ and select the arff file.
Add Pre-Processing Technique:
Procedure:
• Start → Programs → Weka-3-4 → Weka-3-4
• Click on explorer.
• In the dialog, enter the attribute index, type, date format and nominal label values for the new attribute Climate.
• Click on OK.
• Press the Apply button, then a new attribute is added to the Weather Table.
Remove Pre-Processing Technique:
Procedure:
• Start → Programs → Weka-3-4 → Weka-3-4
• Click on explorer.
Normalize Pre-Processing Technique:
Procedure:
• Start → Programs → Weka-3-4 → Weka-3-4
• Click on explorer.
• Click on the Edit button; Weka shows the Weather Table with normalized values.
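The effect of the three techniques can be sketched outside Weka in a few lines of Python. This is only an illustration on three of the Weather rows, not Weka's own filter code. It assumes that Normalize rescales each numeric attribute to the range [0, 1] (min-max scaling) and that a newly added attribute starts out with missing values, written "?" in ARFF. The function names are hypothetical.

```python
# Sketch of the Add, Remove and Normalize steps on a few Weather rows.
# Assumption: Normalize rescales each numeric column to [0, 1] (min-max).
rows = [
    ["sunny", 85.0, 85.0, "false", "no"],
    ["overcast", 80.0, 90.0, "true", "no"],
    ["rainy", 65.0, 70.0, "true", "no"],
]


def add_attribute(table, index, default="?"):
    """Insert a new column at `index`; "?" marks a missing value in ARFF."""
    return [row[:index] + [default] + row[index:] for row in table]


def remove_attribute(table, index):
    """Drop the column at `index` from every row."""
    return [row[:index] + row[index + 1:] for row in table]


def normalize(table):
    """Min-max scale every numeric column to [0, 1]; leave others as-is."""
    result = [list(row) for row in table]
    for col in range(len(table[0])):
        column = [row[col] for row in table]
        if not all(isinstance(v, float) for v in column):
            continue  # nominal column, nothing to scale
        low, high = min(column), max(column)
        for out in result:
            out[col] = 0.0 if high == low else (out[col] - low) / (high - low)
    return result


with_climate = add_attribute(rows, 5)      # append a Climate column
without_windy = remove_attribute(rows, 3)  # drop the windy column
scaled = normalize(rows)
print(scaled[0])
```

Each function returns a new table, so the original rows stay available for comparison, much like reloading the arff file in the Explorer.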
To do the assignment, you first and foremost need some knowledge about the world of
credit.
You can acquire such knowledge in a number of ways.
• Knowledge engineering: Find a loan officer who is willing to talk. Interview her
and try to represent her knowledge in a number of ways.
• Books: Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
• Common sense: Imagine yourself as a loan officer and make up reasonable rules
which can be used to judge the credit worthiness of a loan applicant.
• Case histories: Find records of actual cases where competent loan officers
correctly judged when, and when not, to approve a loan application.
In spite of the fact that the data is German, you should probably make use of it
for this assignment (unless you really can consult a real loan officer!).
There are 20 attributes used in judging a loan applicant (i.e., 7 numerical
attributes and 13 categorical or nominal attributes). The goal is to classify the
applicant into one of two categories: good or bad.
The German credit data contains the following 21 attributes (20 predictors plus the class):
1. Checking_Status
2. Duration
3. Credit_history
4. Purpose
5. Credit_amount
6. Savings_status
7. Employment
8. Installment_Commitment
9. Personal_status
10. Other_parties
11. Residence_since
12. Property_Magnitude
13. Age
14. Other_payment_plans
15. Housing
16. Existing_credits
17. Job
18. Num_dependents
19. Own_telephone
20. Foreign_worker
21. Class
EXPERIMENT-1
• OBJECTIVE:
List all the categorical (or nominal) attributes and the real-valued attributes separately.
• PROCEDURE:
• Open the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Clicking on any attribute in the left panel will show the basic statistics on that selected
attribute.
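When the data is available as an ARFF file, the same split can be read straight from the header: an attribute declared with a value list in braces is nominal, and one declared numeric, real or integer is real-valued. The Python sketch below illustrates this on a short, made-up header excerpt (not the full 21-attribute file); split_attributes is a hypothetical helper name.

```python
# Sketch: split ARFF attributes into nominal and numeric lists.
# The header is an illustrative excerpt, not the full 21-attribute file.
HEADER = """@relation credit
@attribute duration numeric
@attribute credit_amount numeric
@attribute housing {rent,own,free}
@attribute age numeric
@attribute foreign_worker {yes,no}
@attribute class {good,bad}
"""


def split_attributes(header):
    """Return (nominal_names, numeric_names) from ARFF header text."""
    nominal, numeric = [], []
    for line in header.splitlines():
        parts = line.strip().split(None, 2)  # keyword, name, type
        if len(parts) < 3 or parts[0].lower() != "@attribute":
            continue
        name, kind = parts[1], parts[2]
        if kind.startswith("{"):
            nominal.append(name)  # a value list means nominal
        elif kind.lower() in ("numeric", "real", "integer"):
            numeric.append(name)
    return nominal, numeric


nominal, numeric = split_attributes(HEADER)
print("Nominal:", nominal)
print("Numeric:", numeric)
```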
• OUTPUT:
EXPERIMENT-2
• OBJECTIVE:
Which attributes do you think might be crucial in making the credit assessment? Come
up with some simple rules in plain English using your selected attributes.
• PROCEDURE:
•
OUTPUT:
EXPERIMENT-3
• OBJECTIVE:
One type of model that you can create is a decision tree. Train a decision tree using
the complete dataset as the training data. Report the model obtained after training.
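To get a feel for how a decision tree learner picks its first split, the information gain of each candidate attribute can be computed by hand. The Python sketch below does this for the two nominal attributes of the Weather table from File 02. It is a simplified illustration only: J48 builds on this idea but adds further refinements (such as gain ratio and the handling of numeric attributes) that are not shown here.

```python
# Sketch: information gain of the nominal Weather attributes.
import math
from collections import Counter

ATTRIBUTES = ["outlook", "windy"]
ROWS = [  # (outlook, windy, play), taken from the Weather table
    ("sunny", "false", "no"), ("overcast", "true", "no"),
    ("sunny", "false", "yes"), ("rainy", "false", "yes"),
    ("rainy", "false", "yes"), ("rainy", "true", "no"),
    ("overcast", "false", "yes"), ("sunny", "true", "no"),
    ("sunny", "false", "yes"), ("rainy", "false", "yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, col):
    """Entropy of the class minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[col] for row in rows):
        subset = [row[-1] for row in rows if row[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
```

The attribute with the larger gain separates the classes better and is the more natural choice for the root of a tree on this small table.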
PROCEDURE:
• OUTPUT:
•
EXPERIMENT-4
• OBJECTIVE:
Suppose you use your above model trained on the complete dataset, and classify
credit good/bad for each of the examples in the dataset. What % of examples can you
classify correctly? (This is also called testing on the training set) Why do you think you
cannot get 100 % training accuracy?
• PROCEDURE:
• OUTPUT:
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances         554               92.3333 %
Incorrectly Classified Instances        46                7.6667 %
Kappa statistic                          0.845
Mean absolute error                      0.1389
Root mean squared error                  0.2636
Relative absolute error                 27.9979 %
Root relative squared error             52.9137 %
Total Number of Instances              600
=== Confusion Matrix ===
   a   b   <-- classified as
 245  29 |   a = YES
  17 309 |   b = NO
EXPERIMENT-5
• OBJECTIVE:
Is testing on the training set, as you did above, a good idea? Why or why not?
• PROCEDURE:
• In Test options, select the Supplied test set radio button
• Click Set
• Choose the file which contains records that were not in the training set we used to create
the model.
• Click Start. (WEKA will run this test data set through the model we already created.)
• Compare the output results with those of Experiment 4.
• OUTPUT:
Testing on the training set tends to overestimate accuracy, because the model is scored on
the same examples it was built from.
The important numbers to focus on here are those next to "Correctly Classified
Instances" (92.3 percent) and "Incorrectly Classified Instances" (7.7 percent).
Another important number is in the first row of the "ROC Area" column (0.936).
Finally, the "Confusion Matrix" shows the number of false positives and false
negatives: 29 false positives and 17 false negatives in this matrix.
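The summary figures can be recomputed from the four cells of the confusion matrix, which is a useful check when reading Weka output. The Python sketch below uses the counts 245, 29, 17 and 309 from the Experiment 4 confusion matrix and assumes the Weka layout in which rows are actual classes and columns are predicted classes. The kappa formula is the standard Cohen's kappa.

```python
# Sketch: recompute the summary figures from the 2x2 confusion matrix.
# Rows are actual classes, columns are predicted classes (assumed layout).
matrix = [
    [245, 29],   # actual a = YES
    [17, 309],   # actual b = NO
]

total = sum(sum(row) for row in matrix)
correct = matrix[0][0] + matrix[1][1]
incorrect = total - correct

accuracy = 100.0 * correct / total
error_rate = 100.0 * incorrect / total

# Cohen's kappa: agreement beyond what the row/column totals give by chance.
row_totals = [sum(row) for row in matrix]
col_totals = [matrix[0][j] + matrix[1][j] for j in range(2)]
observed = correct / total
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (observed - expected) / (1 - expected)

print(f"Correctly classified:   {correct} ({accuracy:.4f} %)")
print(f"Incorrectly classified: {incorrect} ({error_rate:.4f} %)")
print(f"Kappa statistic:        {kappa:.3f}")
```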
Based on our accuracy rate of 92.3 percent, we say that upon initial analysis, this is a
good model.
One final step in validating the classification tree is to run the test set through
the model and check that its accuracy on the test set is not too different from its
accuracy on the training set.
Compare the "Correctly Classified Instances" from the test set with the "Correctly
Classified Instances" from the training set: if the two figures are close, the model is
unlikely to break down on unknown data or when future data is applied to it.
EXPERIMENT-6
• OBJECTIVE:
One approach for solving the problem encountered in the previous question is to use
cross-validation. Briefly describe what cross-validation is. Train a decision tree again
using cross-validation and report your results. Does your accuracy increase or
decrease? Why?
• PROCEDURE:
•
•
•
•
• OUTPUT:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances         539               89.8333 %
Incorrectly Classified Instances        61               10.1667 %
Relative absolute error                 33.6511 %
Root relative squared error             61.2344 %
Total Number of Instances              600
=== Confusion Matrix ===
   a   b   <-- classified as
 236  38 |   a = YES
  23 303 |   b = NO
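The "stratified" in the output heading means that each fold keeps roughly the same class proportions as the full data set. The Python sketch below shows one simple way to build such folds. It is an illustration under stated assumptions, not Weka's implementation: it does not shuffle the data, no classifier is trained, and the class column is made up.

```python
# Sketch of stratified k-fold splitting (no classifier is trained here).
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each example index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)  # deal out round-robin
            position += 1
    return folds


labels = ["yes"] * 6 + ["no"] * 4   # hypothetical class column
folds = stratified_folds(labels, k=2)

for i, test_fold in enumerate(folds):
    # Every fold serves once as the test set; the rest is training data.
    train = [j for f, fold in enumerate(folds) if f != i for j in fold]
    print(f"fold {i}: train on {len(train)}, test on {len(test_fold)}")
```

Because every example is tested exactly once by a model that never saw it during training, the resulting accuracy is usually a more honest estimate than the training-set figure.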
EXPERIMENT-7
• OBJECTIVE:
Check to see if the data shows a bias against "foreign workers" (attribute 20) or
"personal-status" (attribute 9). One way to do this (perhaps rather simple-minded) is to
remove these attributes from the dataset and see if the decision tree created in those
cases is significantly different from the full dataset case which you have already done.
To remove an attribute you can use the preprocess tab in Weka's GUI Explorer. Did
removing these attributes have any significant effect? Discuss.
• PROCEDURE:
• OUTPUT:
EXPERIMENT-8
• OBJECTIVE:
Another question might be, do you really need to input so many attributes to get good
results? Maybe only a few would do. For example, you could try just having attributes
2, 3, 5, 7, 10, 17 (and 21, the class attribute (naturally)). Try out some combinations.
(You had removed two attributes in problem 7. Remember to reload the arff data file to
get all the attributes initially before you start selecting the ones you want).
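When trying out combinations it helps to translate the attribute numbers into names, and to list which numbers must be removed to leave only the chosen ones. The Python sketch below does both, using the attribute order given in the list of German credit attributes earlier in this manual. The function names keep and removal_list are hypothetical.

```python
# Sketch: keep only the attributes with the given 1-based numbers.
ATTRIBUTES = [
    "Checking_Status", "Duration", "Credit_history", "Purpose",
    "Credit_amount", "Savings_status", "Employment",
    "Installment_Commitment", "Personal_status", "Other_parties",
    "Residence_since", "Property_Magnitude", "Age", "Other_payment_plans",
    "Housing", "Existing_credits", "Job", "Num_dependents",
    "Own_telephone", "Foreign_worker", "Class",
]


def keep(indices):
    """Map 1-based attribute numbers to names, preserving order."""
    return [ATTRIBUTES[i - 1] for i in indices]


def removal_list(indices):
    """1-based numbers of the attributes NOT kept, for a Remove step."""
    wanted = set(indices)
    return [i for i in range(1, len(ATTRIBUTES) + 1) if i not in wanted]


selected = [2, 3, 5, 7, 10, 17, 21]
print(keep(selected))
print(removal_list(selected))
```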
• PROCEDURE:
EXPERIMENT-9
• OBJECTIVE:
Sometimes, the cost of rejecting an applicant who actually has a good credit (case 1)
might be higher than accepting an applicant who has bad credit (case 2). Instead of
counting the misclassifications equally in both cases, give a higher cost to the first case
(say cost 5) and lower cost to the second case. You can do this by using a cost matrix
in Weka. Train your Decision Tree again and report the Decision Tree and
cross-validation results. Are they significantly different from results obtained in
problem 6 (using equal cost)?
• PROCEDURE:
• Given the Bank database for mining.
• Use the Weka GUI Chooser.
• Select EXPLORER present in Applications.
• Select Preprocess Tab.
• Go to OPEN file and browse the file that is already stored in the system “bank.csv”.
• Go to Classify tab.
• Choose Classifier “Tree”
• Select J48
• Select Test options “Training set”.
• Click on “more options”.
• Select cost sensitive evaluation and click on set button
• Set the matrix values and click on resize. Then close the window.
• Click Ok
• Click start.
• We can see the output details in the Classifier output
• Select Test options “Cross-validation”.
• Set “Folds” Ex: 10
• If needed, select the attribute.
• Click Start.
• Now we can see the output details in the Classifier output.
• Compare the results of the 15th and 20th steps (the training-set run and the cross-validation run).
• Compare the results with those of Experiment 6.
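The effect of a cost matrix on the evaluation can be worked out by hand: each cell of the confusion matrix is multiplied by the cost of that kind of decision and the products are added up. The Python sketch below illustrates this with the cross-validation confusion matrix reported for Experiment 6. Two assumptions are made for the sake of the example: class a = YES is read as "good credit", and both matrices have actual classes as rows and predicted classes as columns.

```python
# Sketch: total misclassification cost from a confusion matrix.
# Assumptions: class a = YES stands for "good credit", and the counts are
# the cross-validation confusion matrix reported for Experiment 6.
confusion = [
    [236, 38],   # actual good: predicted good, predicted bad
    [23, 303],   # actual bad:  predicted good, predicted bad
]
equal_cost = [
    [0, 1],
    [1, 0],
]
weighted_cost = [
    [0, 5],      # rejecting a good applicant costs 5
    [1, 0],      # accepting a bad applicant costs 1
]


def total_cost(counts, costs):
    """Sum count * cost over every cell of the matrix."""
    return sum(
        counts[i][j] * costs[i][j]
        for i in range(len(counts))
        for j in range(len(counts[i]))
    )


print("Equal cost:   ", total_cost(confusion, equal_cost))
print("Weighted cost:", total_cost(confusion, weighted_cost))
```

With equal costs the total is simply the number of misclassified instances; with the weighted matrix the 38 rejected good applicants dominate the total, which is why a cost-sensitive evaluation can rank the same tree very differently.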
• OUTPUT:
EXPERIMENT-10
• OBJECTIVE:
Do you think it is a good idea to prefer simple decision trees instead of having long
complex decision trees? How does the complexity of a Decision Tree relate to the
bias of the model?
PROCEDURE:
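The answer depends on the attribute set and on the relationships among the attributes
that we want to study, so it should be judged against the database and the user
requirements. In general, a simple tree has higher bias but is less likely to overfit
the training data, while a long, complex tree has lower bias but higher variance.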
EXPERIMENT-11
• OBJECTIVE:
You can make your decision trees simpler by pruning the nodes. One approach is to use
reduced error pruning. Explain this idea briefly. Try reduced error pruning for training
your decision trees using cross-validation (you can do this in Weka) and report the
decision tree you obtain. Also, report your accuracy using the pruned model. Does your
accuracy increase?
• PROCEDURE:
• OUTPUT:
EXPERIMENT-12
• OBJECTIVE:
(Extra Credit): How can you convert a decision tree into "if-then-else" rules? Make
up your own small decision tree consisting of 2-3 levels and convert it into a set of
rules. There also exist different classifiers that output the model in the form of rules;
one such classifier in Weka is rules.PART. Train this model and report the set of rules
obtained. Sometimes just one attribute can be good enough in making the decision,
yes, just one! Can you predict what attribute that might be in this dataset? The OneR
classifier uses a single attribute to make decisions (it chooses the attribute based on
minimum error). Report the rule obtained by training a OneR classifier. Rank the
performance of J48, PART and OneR.
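The idea behind OneR can be sketched in a few lines of Python: for every attribute, predict the majority class for each of its values, count the errors, and keep the attribute with the fewest errors. The sketch below applies this to the two nominal attributes of the Weather table from File 02. It is a simplified illustration, not Weka's OneR implementation; in particular the numeric attributes are left out, whereas Weka would discretize them first.

```python
# Sketch of the OneR idea on the nominal columns of the Weather table.
# Numeric columns are left out; Weka's OneR would discretize them first.
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "windy"]
ROWS = [  # (outlook, windy, play)
    ("sunny", "false", "no"), ("overcast", "true", "no"),
    ("sunny", "false", "yes"), ("rainy", "false", "yes"),
    ("rainy", "false", "yes"), ("rainy", "true", "no"),
    ("overcast", "false", "yes"), ("sunny", "true", "no"),
    ("sunny", "false", "yes"), ("rainy", "false", "yes"),
]


def one_r(rows, attributes):
    """Return (attribute, rule, errors) for the single best attribute."""
    best = None
    for col, name in enumerate(attributes):
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[col]][row[-1]] += 1   # value -> class frequencies
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(
            sum(c.values()) - max(c.values()) for c in counts.values()
        )
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best


attribute, rule, errors = one_r(ROWS, ATTRIBUTES)
print(f"Best attribute: {attribute}, errors: {errors}/{len(ROWS)}")
for value, prediction in rule.items():
    print(f"  if {attribute} = {value} then play = {prediction}")
```

The printed lines are already in "if-then" form, which is the same shape a decision tree takes when each path from the root to a leaf is written out as one rule.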
• RESOURCES:
• PROCEDURE:
• OUTPUT: