Lab Manual - DM
Index
Perform Remove, AddNoise, FirstOrder filters using WEKA with different data sets.
Create Clustering for a given data set using different algorithms in WEKA and
compare them.
Test Naive Bayes classifier using suitable data.
Analyze classification rule process on data set employee.arff using id3 algorithm.
Perform j48 algorithm on the data set employee.arff to identify classification rules.
Perform j48 algorithm and Id3 algorithm on the data set weather.arff and compare the results.
Compare J48 and J48graft algorithms using a suitable data set. Derive conclusions in
your own language.
Setting up a Knowledge Flow to load an arff file (batch mode) and perform a cross validation
using J48.
Perform Linear Regression for a given suitable data set.
PRACTICAL 1
Aim : Introducing the WEKA tool for data mining.
Launching Weka:
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka’s
main GUI applications and supporting tools. If one prefers an MDI (“multiple document interface”)
appearance, then this is provided by an alternative launcher called “Main” (class weka.gui.Main). The
GUI Chooser consists of four buttons—one for each of the four major Weka applications—and four
menus.
• LogWindow Opens a log window that captures all that is printed to stdout or stderr. Useful for
environments like MS Windows, where WEKA is normally not started from a terminal.
• Exit Closes WEKA.
• Tools Other useful applications.
Once the tabs are active, clicking on them flicks between different screens, on which the respective
actions can be performed. The bottom area of the window (including the status box, the log button, and
the Weka bird) stays visible regardless of which section you are in.
Status Box
The status box appears at the very bottom of the window. It displays messages that keep you informed
about what’s going on. For example, if the Explorer is busy loading a file, the status box will say
that. TIP: right-clicking the mouse anywhere inside the status box brings up a little menu. The menu
gives two options:
1. Memory information. Display in the log box the amount of memory available to WEKA.
2. Run garbage collector. Force the Java garbage collector to search for memory that is no longer needed
and free it up, allowing more memory for new tasks. Note that the garbage collector is constantly running
as a background task anyway.
Log Button
Clicking on this button brings up a separate window containing a scrollable text field. Each line of text is
stamped with the time it was entered into the log. As you perform actions in WEKA, the log keeps a
record of what has happened. For people using the command line or the SimpleCLI, the log now also
contains the full setup strings for classification, clustering, attribute selection, etc., so that it is possible
to copy/paste them elsewhere. Options for dataset(s) and, if applicable, the class attribute still have to be
provided by the user (e.g., -t for classifiers or -i and -o for filters).
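For example, setup strings like the following (a sketch; the file paths are placeholders) could be pasted into the SimpleCLI to train a J48 classifier on a training file, or to run the Remove filter over an input file:

java weka.classifiers.trees.J48 -C 0.25 -M 2 -t data/weather.arff
java weka.filters.unsupervised.attribute.Remove -R 3 -i data/iris.arff -o data/iris-reduced.arff

Here -t names the training file for the classifier, while -i and -o name the input and output files for the filter, as described above.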
PRACTICAL 2
Aim : Perform the Normalize and ReplaceMissingValues filters available in WEKA.
Select the supermarket data set. In the viewer it can be seen that many values are missing in this dataset. Apply the ReplaceMissingValues filter.
After applying this filter we can see that all missing values are filled.
We use the Normalize filter to normalize the numeric values. Select the iris data set and view it.
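The same preprocessing can also be scripted against WEKA's Java API. The sketch below is illustrative rather than part of the original practical; the file paths are placeholders and weka.jar is assumed to be on the classpath.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class Practical2 {
    public static void main(String[] args) throws Exception {
        // Replace missing values in the supermarket data with each
        // attribute's mean (numeric) or mode (nominal).
        Instances market = DataSource.read("data/supermarket.arff"); // placeholder path
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(market);
        Instances filled = Filter.useFilter(market, rmv);

        // Scale the numeric attributes of the iris data into [0, 1].
        Instances iris = DataSource.read("data/iris.arff"); // placeholder path
        Normalize norm = new Normalize();
        norm.setInputFormat(iris);
        Instances normalized = Filter.useFilter(iris, norm);

        System.out.println(filled.toSummaryString());
        System.out.println(normalized.toSummaryString());
    }
}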
PRACTICAL 3
Aim : Perform Remove, AddNoise, FirstOrder filters using WEKA with different data sets.
To use a filter, select and configure it, then apply it to the data by pressing the Apply button at the right
end of the Filter panel in the Preprocess panel. The Preprocess panel will then show the transformed data.
The change can be undone by pressing the Undo button. You can also use the Edit... button to modify
your data manually in a dataset editor. Finally, the Save... button at the top right of the Preprocess panel
saves the current version of the relation in file formats that can represent the relation, allowing it to be
kept for future use.
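For reference, the same filters can be applied programmatically; a minimal Java sketch (the file path and the filter options shown are illustrative assumptions):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddNoise;
import weka.filters.unsupervised.attribute.Remove;

public class Practical3 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.arff"); // placeholder path

        // Remove the third attribute (the -R index is 1-based).
        Remove remove = new Remove();
        remove.setOptions(new String[] {"-R", "3"});
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        System.out.println(reduced.numAttributes() + " attributes left after Remove");

        // Flip 10% of the values of the first (nominal) attribute.
        AddNoise noise = new AddNoise();
        noise.setOptions(new String[] {"-C", "first", "-P", "10"});
        noise.setInputFormat(data);
        Instances noisy = Filter.useFilter(data, noise);
        System.out.println(noisy.instance(0));

        // weka.filters.unsupervised.attribute.FirstOrder (for numeric
        // attributes) can be chained in exactly the same way.
    }
}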
Add
Add Noise
After applying the “AddNoise” filter, the values in the first column are changed due to noise.
FirstOrder
Remove
Data set view after applying the “Remove” filter: the third attribute is removed.
Practical- 4
AIM: Analyze the Apriori algorithm implemented in WEKA using different data sets.
This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample
dataset used for this example is contactlenses.arff
Step 1: Open the data file in the WEKA Explorer. It is presumed that the required data fields have been discretized;
in this example it is the age attribute.
Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.
Step 3: In order to change the parameters for the run (e.g., support, confidence) we click on the text box
immediately to the right of the Choose button.
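The same run can also be reproduced with WEKA's Java API; a minimal sketch (the file path is a placeholder and the parameter values are illustrative choices, not the practical's required settings):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Practical4 {
    public static void main(String[] args) throws Exception {
        // Apriori needs nominal data; contactlenses.arff is all-nominal.
        Instances data = DataSource.read("data/contactlenses.arff"); // placeholder path

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);              // how many rules to report
        apriori.setLowerBoundMinSupport(0.2); // minimum support
        apriori.setMinMetric(0.9);            // minimum confidence
        apriori.buildAssociations(data);

        System.out.println(apriori); // prints the generated rules
    }
}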
Dataset contactlenses.arff
The following screenshot shows the association rules that were generated when the Apriori algorithm is applied to
the given dataset.
Output:
PRACTICAL 5
Aim : Create clustering for a given data set using different algorithms in WEKA and compare
them.
Clustering in WEKA.
Selecting a Cluster:
By now you will be familiar with the process of selecting and configuring objects. Clicking on the
clustering scheme listed in the Cluster box at the top of the window brings up a GenericObjectEditor
dialog with which to choose a new clustering scheme.
Cluster Modes
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three
options are the same as for classification: Use training set, Supplied test set and Percentage split except
that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode,
Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class
in the data. The drop-down box below this option selects the class, just as in the Classify panel. An
additional option in the Cluster mode box, the Store clusters for visualization tick box, determines
whether or not it will be possible to visualize the clusters once training is complete. When dealing with
datasets that are so large that memory becomes a problem it may be helpful to disable this option.
Ignoring Attributes
Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings
up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the
window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding
down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel
button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes
are ignored.
The FilteredClusterer metaclusterer offers the user the possibility to apply filters directly before the
clusterer is learned. This approach eliminates the manual application of a filter in the Preprocess panel,
since the data gets processed on the fly. This is useful if one needs to try out different filter setups.
Learning Clusters
The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list.
These all behave just like their classification counterparts. Right-clicking an entry in the result list brings
up a similar menu, except that it shows only two visualization options: Visualize cluster assignments and
Visualize tree. The latter is grayed out when it is not applicable.
Clustering algorithms:
SimpleKMeans
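The SimpleKMeans run (with a “classes to clusters” evaluation) can also be reproduced in code. A minimal Java sketch follows; the glass file path and the cluster count are illustrative assumptions, and it uses the standard pattern of stripping the class attribute before training the clusterer:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class Practical5 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/glass.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // The clusterer must not see the class attribute, so strip it for training.
        Remove rm = new Remove();
        rm.setAttributeIndices("" + (data.classIndex() + 1));
        rm.setInputFormat(data);
        Instances train = Filter.useFilter(data, rm);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3); // illustrative choice
        kmeans.buildClusterer(train);

        // "Classes to clusters" evaluation against the held-back class attribute.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}

Swapping in HierarchicalClusterer, or DBSCAN where it is installed, follows the same pattern, which is how the comparison below was produced.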
HierarchicalClustering:
DBSCAN
Comparison Table:
For the glass data set: cluster 3 held 29 instances (14%), cluster 4 held 10 instances (5%), and cluster 5
held 9 instances (4%), with 4 instances left unclustered by one of the algorithms. For the second data set,
cluster 1 held 2948 instances (64%), with no unclustered instances.
Practical 6
AIM: Test Naive Bayes classifier using suitable data.
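Besides the Explorer, the classifier can be tested from WEKA's Java API. A minimal sketch, assuming the weather data as the “suitable data” (the file path is a placeholder):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Practical6 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);          // class is the last attribute

        // 10-fold cross-validation of a Naive Bayes classifier.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}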
Practical 7
AIM: Analyze classification rule process on data set employee.arff using id3 algorithm.
This experiment illustrates the use of the id3 classifier in WEKA. The sample data set used in this experiment is
the “employee” data, available in ARFF format. This document assumes that appropriate data preprocessing has been
performed.
Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.
Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “id3” classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of
the Choose button. In this example, we accept the default values. (Note that id3 builds an unpruned tree and
handles only nominal attributes.)
Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation
approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the
accuracy of the generated model.
Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation
statistics will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be
needed (either in preprocessing or in selecting better parameters for the classification).
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking
the last result set and selecting “Visualize tree” from the pop-up menu.
Step 8: In the main panel, under “Test options”, click the “Supplied test set” radio button and then click the
“Set” button. This will pop up a window which allows you to open the file containing the test instances.
@relation employee
@attribute salary {10k,15k,17k,20k,25k,30k,35k,32k}
@data
The following screenshot shows the classification rules that were generated when the id3 algorithm is applied to
the given dataset.
Creating employee.arff:
You can use the ArffViewer (Tools -> ArffViewer or Ctrl+A). Open your CSV file, then go to
File -> Save as... and select Arff data files (selected by default).
Note that your fields must be separated with a comma and not a semicolon.
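The id3 run can also be reproduced in code; a minimal sketch (the file path is a placeholder). Note that Id3 ships as weka.classifiers.trees.Id3 in older WEKA releases (e.g., 3.6); in newer versions it is installed via the simpleEducationalLearningSchemes package.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.Id3;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Practical7 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/employee.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Id3 handles nominal attributes only and builds an unpruned tree.
        Id3 id3 = new Id3();
        id3.buildClassifier(data);
        System.out.println(id3); // ASCII version of the tree

        // 10-fold cross-validation, as in Step 4.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Id3(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}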
Practical-8
Aim: Perform j48 algorithm on the data set employee.arff to identify classification rules.
This experiment illustrates the use of the j48 classifier in WEKA. The sample data set used in this experiment is
the “employee” data, available in ARFF format. This document assumes that appropriate data preprocessing has been
performed.
Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.
Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “j48” classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of
the Choose button. In this example, we accept the default values; the default version performs some pruning (but
does not perform reduced-error pruning).
Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation
approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the
accuracy of the generated model.
Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation
statistics will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be
needed (either in preprocessing or in selecting better parameters for the classification).
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking
the last result set and selecting “Visualize tree” from the pop-up menu.
Step 8: In the main panel, under “Test options”, click the “Supplied test set” radio button and then click the
“Set” button. This will pop up a window which allows you to open the file containing the test instances.
@relation employee
@attribute salary {10k,15k,17k,20k,25k,30k,35k,32k}
@data
48, 32k,good
The following screenshot shows the classification rules that were generated when the j48 algorithm is applied to the
given dataset.
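For reference, a minimal Java sketch of the same J48 run (the file path is a placeholder; the parameter values shown are J48's defaults):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Practical8 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/employee.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        J48 j48 = new J48();
        j48.setConfidenceFactor(0.25f); // default pruning confidence
        j48.setMinNumObj(2);            // default minimum instances per leaf
        j48.buildClassifier(data);
        System.out.println(j48); // ASCII version of the pruned tree

        // 10-fold cross-validation, as in Step 4.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}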
Practical-9
Aim: Perform J48 algorithm and Id3 algorithm on the data set weather.arff and compare the results.
This experiment illustrates the use of the j48 and id3 classifiers in WEKA. The sample data set used in this
experiment is the “weather” data, available in ARFF format. This document assumes that appropriate data
preprocessing has been performed.
Step 1: We begin the experiment by loading the data (weather.arff) into WEKA.
Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “j48” classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of
the Choose button. In this example, we accept the default values; the default version performs some pruning (but
does not perform reduced-error pruning).
Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation
approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the
accuracy of the generated model.
Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation
statistics will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be
needed (either in preprocessing or in selecting better parameters for the classification).
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking
the last result set and selecting “Visualize tree” from the pop-up menu. Repeat the run with the “id3” classifier
to compare the two results, as in the comparison below.
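The comparison can also be scripted; a minimal Java sketch (the file path is a placeholder, and since Id3 requires nominal attributes the all-nominal variant of the weather data is assumed):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.Id3;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Practical9 {
    public static void main(String[] args) throws Exception {
        // Id3 needs nominal attributes, so the all-nominal weather file is used.
        Instances data = DataSource.read("data/weather.nominal.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate both classifiers with the same 10-fold cross-validation.
        for (Classifier model : new Classifier[] {new Id3(), new J48()}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%s: %.1f%% correct%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}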
Id3 algorithm:

 a b   <-- classified as
 8 1 |  a = yes
 1 4 |  b = no

Id3 classifies 12 of the 14 instances correctly (about 86%).

J48 algorithm:

 a b   <-- classified as
 5 4 |  a = yes
 3 2 |  b = no

J48 classifies only 7 of the 14 instances correctly (50%), so on this data set id3 produces the more accurate model.
Practical-10
Aim: Compare J48 and J48graft algorithms using a suitable data set. Derive conclusions in your own language.
This experiment illustrates the use of the j48 and j48graft classifiers in WEKA. The sample data set used in this
experiment is the “weather” data, available in ARFF format. This document assumes that appropriate data
preprocessing has been performed.
Step 1: We begin the experiment by loading the data (weather.arff) into WEKA.
Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “j48” classifier (and, for
the comparison, repeat the run with the “J48graft” classifier).
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of
the Choose button. In this example, we accept the default values; the default version performs some pruning (but
does not perform reduced-error pruning).
Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation
approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the
accuracy of the generated model.
Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation
statistics will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be
needed (either in preprocessing or in selecting better parameters for the classification).
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking
the last result set and selecting “Visualize tree” from the pop-up menu.
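A minimal Java sketch of the comparison (the file path is a placeholder; note that J48graft ships with WEKA 3.6.x as weka.classifiers.trees.J48graft and was dropped from later releases):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.J48graft; // available in WEKA 3.6.x
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Practical10 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate both classifiers with the same 10-fold cross-validation.
        for (Classifier model : new Classifier[] {new J48(), new J48graft()}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%s: %.1f%% correct%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}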
Id3 algorithm:

 a b   <-- classified as
 8 1 |  a = yes
 1 4 |  b = no

J48 algorithm:

 a b   <-- classified as
 5 4 |  a = yes
 3 2 |  b = no
Practical-11
Aim: Setting up a flow to load an arff file (batch mode) and perform a cross validation using J48 (Weka's
C4.5 implementation).
Next click on the DataSources tab and choose "ArffLoader" from the
toolbar (the mouse pointer will change to cross hairs).
Next place the ArffLoader component on the layout area by clicking somewhere on the layout (A copy
of the ArffLoader icon will appear on the layout area).
Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on the layout.
A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu and browse to
the location of your arff file.
Next place a "ClassAssigner" component (from the Evaluation toolbar) on the layout.
Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select
"dataSet" under "Connections" in the menu. A "rubber band" line will appear.
Move the mouse over the ClassAssigner component and left click - a red line labeled "dataSet" will
connect the two components.
Next right click over the ClassAssigner and choose "Configure" from the menu. This will pop up a
window from which you can specify which column is the class in your data (last is the default).
Next grab a "CrossValidationFoldMaker" component from the Evaluation toolbar and place it on the
layout.
Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over "ClassAssigner"
and selecting "dataSet" from under "Connections" in the menu.
Next click on the "Classifiers" tab at the top of the window and scroll along the toolbar until you reach
the "J48" component in the "trees" section.
Connect the CrossValidationFoldMaker to J48 TWICE by first choosing "trainingSet" and then
"testSet" from the pop-up menu for the CrossValidationFoldMaker.
Next go back to the "Evaluation" tab and place a "ClassifierPerformanceEvaluator" component on the
layout.
Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.
Next go to the "Visualization" toolbar and place a "TextViewer" component on the layout.
Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the "text" entry from the
pop-up menu for ClassifierPerformanceEvaluator.
Now start the flow executing by selecting "Start loading" from the pop-up menu for ArffLoader.
When finished, you can view the results by choosing "Show results" from the pop-up menu for the
TextViewer component.
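The same batch evaluation can be reproduced outside the Knowledge Flow, for example from the SimpleCLI (a sketch; the file path is a placeholder). When no separate test file is supplied, classifiers run a cross-validation on the training file, with -x setting the number of folds:

java weka.classifiers.trees.J48 -t data/iris.arff -x 10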
Practical 12
AIM: Linear Regression using Excel
3. In the Add-Ins available box, check Analysis ToolPak and then click OK.
4. If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to locate it.
The fitted regression line has the form y = b0 + b1*x, where y is the dependent variable, x is the independent
variable, b0 is the intercept, and b1 is the slope.
In the Data Analysis tool, select Regression and then click OK.
Output
The estimated coefficients are given in the Coefficients column in the last block of the output.
The Regression Statistics block gives the goodness-of-fit measure (R Square).
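The coefficients Excel reports are the ordinary least-squares estimates: slope b1 = sum((x - xMean)*(y - yMean)) / sum((x - xMean)^2) and intercept b0 = yMean - b1*xMean. A self-contained Java sketch of this computation, using made-up sample data:

public class SimpleRegression {
    public static void main(String[] args) {
        // Made-up sample data: x is the predictor, y the response.
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 3.9, 6.2, 8.0, 9.8};

        double xMean = mean(x), yMean = mean(y);
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - xMean) * (y[i] - yMean); // sum of cross-deviations
            den += (x[i] - xMean) * (x[i] - xMean); // sum of squared deviations
        }
        double slope = num / den;                 // b1
        double intercept = yMean - slope * xMean; // b0
        System.out.printf("y = %.3f + %.3f x%n", intercept, slope);
    }

    static double mean(double[] v) {
        double sum = 0;
        for (double d : v) sum += d;
        return sum / v.length;
    }
}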
Practical 13
AIM : Correlation analysis in Excel
The correlation coefficient (a value between -1 and +1) tells you how strongly two variables are related to each
other. We can use the CORREL function or the Analysis ToolPak add-in in Excel to find the correlation
coefficient between two variables.
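For reference, the quantity CORREL computes is the Pearson coefficient r = sum((x - xMean)*(y - yMean)) / sqrt(sum((x - xMean)^2) * sum((y - yMean)^2)). A small self-contained Java sketch with made-up data:

public class CorrelationDemo {
    public static void main(String[] args) {
        // Made-up sample data for two variables.
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {2, 1, 4, 3, 5};
        System.out.printf("r = %.2f%n", pearson(a, b));
    }

    // Pearson correlation coefficient of two equal-length arrays.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}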
To use the Analysis ToolPak add-in in Excel to quickly generate correlation coefficients between multiple
variables, execute the following steps.
Note: if you can't find the Data Analysis button, load the Analysis ToolPak add-in first (see the previous practical).
6. Click OK.
Result.
Conclusion: variables A and C are positively correlated (0.91). Variables A and B are not correlated (0.19).
Variables B and C are also not correlated (0.11). You can verify these conclusions by looking at the graph.