Lab Manual - DM

This document outlines experiments to be performed using the WEKA data mining tool. It includes introducing WEKA and exploring its interface. Experiments will apply normalization, missing value replacement, and noise addition filters to different datasets. Classification algorithms like Naive Bayes and J48 will be tested on datasets. Clustering and association rule mining will also be performed and compared across algorithms and datasets. Evaluation metrics and visualizations will aid in analyzing results.

Data Mining (3160714)

Index

 Introducing WEKA for Data Mining.

 Perform Normalize, ReplaceMissingValues filters available in WEKA.

 Perform Remove, AddNoise, FirstOrder filters using WEKA with different data sets.

 Analyze the Apriori algorithm implemented in WEKA using different data sets.

 Create clusterings for a given data set using different algorithms in WEKA and compare them.

 Test the Naive Bayes classifier using suitable data.

 Analyze the classification rule process on the data set employee.arff using the Id3 algorithm.

 Perform the J48 algorithm on the data set employee.arff to identify classification rules.

 Perform the J48 and Id3 algorithms on the data set weather.arff and compare the results.

 Compare the J48 and J48graft algorithms using a suitable data set. Derive conclusions in your own language.

 Set up a Knowledge Flow to load an ARFF file (batch mode) and perform a cross-validation using J48.

 Perform linear regression for a given suitable data set.

 Perform correlation analysis for a given suitable data set.


PRACTICAL 1
Aim: Introducing the WEKA tool for data mining.

Launching Weka:
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's
main GUI applications and supporting tools. If one prefers an MDI ("multiple document interface")
appearance, then this is provided by an alternative launcher called "Main" (class weka.gui.Main). The
GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four
menus.

The buttons can be used to start the following applications:


• Explorer An environment for exploring data with WEKA (the rest of this documentation deals with
this application in more detail).
• Experimenter An environment for performing experiments and conducting statistical tests between
learning schemes.
• KnowledgeFlow This environment supports essentially the same functions as the Explorer but with a
drag-and-drop interface. One advantage is that it supports incremental learning.
• SimpleCLI Provides a simple command-line interface that allows direct execution of WEKA
commands for operating systems that do not provide their own command line interface.

The menu consists of four sections:


 Program


• LogWindow Opens a log window that captures all that is printed to stdout or stderr. Useful for
environments like MS Windows, where WEKA is normally not started from a terminal.
• Exit Closes WEKA.
 Tools Other useful applications.

• ArffViewer An MDI application for viewing ARFF files in spread-sheet format.


• SqlViewer Represents an SQL worksheet, for querying databases via JDBC.
• Bayes net editor An application for editing, visualizing and learning Bayes nets.

 Visualization Ways of visualizing data with WEKA.

• Plot For plotting a 2D plot of a dataset.


• ROC Displays a previously saved ROC curve.
• TreeVisualizer For displaying directed graphs, e.g., a decision tree.
• GraphVisualizer Visualizes XML BIF or DOT format graphs, e.g., for Bayesian networks.
• BoundaryVisualizer Allows the visualization of classifier decision boundaries in two dimensions.
 Help Online resources for WEKA can be found here.


• Weka homepage Opens a browser window with WEKA's homepage.


• HOWTOs, code snippets, etc. The general WekaWiki, containing lots of examples and HOWTOs
around the development and use of WEKA.
• Weka on Sourceforge WEKA’s project homepage on Sourceforge.net.
• SystemInfo Lists some internals about the Java/WEKA environment, e.g., the CLASSPATH.
Explorer:
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started
only the first tab is active; the others are grayed out. This is because it is necessary to open (and potentially
pre-process) a data set before starting to explore the data. The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.

Once the tabs are active, clicking on them flicks between different screens, on which the respective
actions can be performed. The bottom area of the window (including the status box, the log button, and
the Weka bird) stays visible regardless of which section you are in.

Status Box

The status box appears at the very bottom of the window. It displays messages that keep you informed
about what's going on. For example, if the Explorer is busy loading a file, the status box will say
that. TIP: right-clicking the mouse anywhere inside the status box brings up a little menu. The menu
gives two options:
1. Memory information. Display in the log box the amount of memory available to WEKA.
2. Run garbage collector. Force the Java garbage collector to search for memory that is no longer needed
and free it up, allowing more memory for new tasks. Note that the garbage collector is constantly running
as a background task anyway.
Log Button

Clicking on this button brings up a separate window containing a scrollable text field. Each line of text is
stamped with the time it was entered into the log. As you perform actions in WEKA, the log keeps a
record of what has happened. For people using the command line or the SimpleCLI, the log now also
contains the full setup strings for classification, clustering, attribute selection, etc., so that it is possible
to copy/paste them elsewhere. Options for dataset(s) and, if applicable, the class attribute still have to be
provided by the user (e.g., -t for classifiers or -i and -o for filters).
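
For example, a full J48 setup string copied from the log can be run from the SimpleCLI or a command prompt by adding the dataset option; a minimal example (the dataset path is a placeholder, and -C 0.25 -M 2 are J48's default options):

java weka.classifiers.trees.J48 -C 0.25 -M 2 -t /path/weather.arff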


WEKA Status Icon


To the right of the status box is the WEKA status icon. When no processes are running, the bird sits down
and takes a nap. The number beside the × symbol gives the number of concurrent processes running.
When the system is idle it is zero, but it increases as the number of processes increases. When any process
is started, the bird gets up and starts moving around. If it’s standing but stops moving for a long time, it’s
sick: something has gone wrong! In that case you should restart the WEKA Explorer.
Graphical output
Most graphical displays in WEKA, e.g., the GraphVisualizer or the TreeVisualizer, support saving the
output to a file. A dialog for saving the output can be brought up with Alt+Shift+left-click. Supported
formats are currently Windows Bitmap, JPEG, PNG and EPS (Encapsulated PostScript). The dialog also
allows you to specify the dimensions of the generated image.


PRACTICAL 2
Aim: Perform Normalize, ReplaceMissingValues filters available in WEKA.

We apply two filters to study data cleaning in WEKA.

To deal with missing values in WEKA:

Select the supermarket data set. In the viewer it can be seen that many values are missing in this dataset.

Use the ReplaceMissingValues filter to fill in the missing values.


After applying this filter we can see that all missing values are filled.

To normalize values in the WEKA tool:

We use the Normalize filter to normalize the numeric values. Select the iris data set and view it.


Select the Normalize filter and apply it.

All the values in iris data set are normalized.
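
The same preprocessing can also be scripted against the WEKA Java API rather than clicked through the Explorer. A minimal sketch, assuming the standard WEKA distribution; the file path is a placeholder:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class CleaningDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (placeholder path).
        Instances data = DataSource.read("/path/supermarket.arff");

        // Fill each missing value with the attribute's mean (numeric) or mode (nominal).
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        Instances filled = Filter.useFilter(data, rmv);

        // Scale all numeric attributes into the [0, 1] range.
        Normalize norm = new Normalize();
        norm.setInputFormat(filled);
        Instances normalized = Filter.useFilter(filled, norm);

        System.out.println(normalized.toSummaryString());
    }
}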


PRACTICAL 3
Aim: Perform Remove, AddNoise, FirstOrder filters using WEKA with different data sets.

To apply a filter, select and configure it, then press the Apply button at the right end of the Filter panel
in the Preprocess panel. The Preprocess panel will then show the transformed data. The change can be
undone by pressing the Undo button. You can also use the Edit... button to modify your data manually
in a dataset editor. Finally, the Save... button at the top right of the Preprocess panel saves the current
version of the relation in file formats that can represent the relation, allowing it to be kept for future use.

Explanation of some filters.

Add

Data set view before applying “add” filter.


“add” Filter description

Data set view after applying “add” filter

Add Noise

Data set view before applying “AddNoise” filter.


“AddNoise” Filter description

After applying the “AddNoise” filter the values in the first column are changed due to noise.


FirstOrder

Data set view before applying “FirstOrder” filter.

“FirstOrder” Filter description

Data set view after applying “FirstOrder” filter


Remove

Data set view before applying “Remove” filter.

“Remove” Filter description


Data set view after applying “Remove” filter. Third attribute is removed.
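
The three filters of this practical can likewise be applied programmatically. A minimal sketch, assuming the iris dataset; the path, the attribute ranges and the noise percentage are illustrative choices, and the option arrays mirror each filter's command-line options:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddNoise;
import weka.filters.unsupervised.attribute.FirstOrder;
import weka.filters.unsupervised.attribute.Remove;

public class FilterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/iris.arff"); // placeholder path

        // Remove: drop the third attribute.
        Remove remove = new Remove();
        remove.setAttributeIndices("3");
        remove.setInputFormat(data);
        Instances removed = Filter.useFilter(data, remove);

        // AddNoise: randomly change 10% of the values of a nominal attribute
        // (by default the last attribute, here the class).
        AddNoise noise = new AddNoise();
        noise.setOptions(new String[] { "-P", "10" });
        noise.setInputFormat(data);
        Instances noisy = Filter.useFilter(data, noise);

        // FirstOrder: replace the N numeric attributes in the range with N-1
        // attributes holding the differences between consecutive attribute values.
        FirstOrder fo = new FirstOrder();
        fo.setOptions(new String[] { "-R", "1-4" });
        fo.setInputFormat(data);
        Instances differenced = Filter.useFilter(data, fo);

        System.out.println(removed.numAttributes() + " attributes after Remove");
        System.out.println(noisy.toSummaryString());
        System.out.println(differenced.toSummaryString());
    }
}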


Practical-4
AIM: Analyze the Apriori algorithm implemented in WEKA using different data sets.

This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample
dataset used for this example is contactlenses.arff.

Step 1: Open the data file in the Weka Explorer. It is presumed that the required data fields have been discretized;
in this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (e.g., support, confidence) we click on the text box
immediately to the right of the Choose button.
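
The same run can be reproduced from the WEKA Java API. A minimal sketch, assuming the bundled contact-lenses data; the path and the two parameter values are illustrative:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Apriori requires nominal attributes; contactlenses.arff is all-nominal.
        Instances data = DataSource.read("/path/contactlenses.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);   // number of rules to report
        apriori.setMinMetric(0.9); // minimum confidence
        apriori.buildAssociations(data);

        System.out.println(apriori); // prints the generated association rules
    }
}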

Dataset contactlenses.arff


The following screenshot shows the association rules that were generated when the Apriori algorithm is applied on
the given dataset.

Output:


PRACTICAL 5
Aim: Create clusterings for a given data set using different algorithms in WEKA and compare them.

Clustering in WEKA.

Selecting a Cluster:

By now you will be familiar with the process of selecting and configuring objects. Clicking on the
clustering scheme listed in the Cluster box at the top of the window brings up a GenericObjectEditor
dialog with which to choose a new clustering scheme.

Cluster Modes


The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three
options are the same as for classification: Use training set, Supplied test set and Percentage split except
that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode,
Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class
in the data. The drop-down box below this option selects the class, just as in the Classify panel. An
additional option in the Cluster mode box, the Store clusters for visualization tick box, determines
whether or not it will be possible to visualize the clusters once training is complete. When dealing with
datasets that are so large that memory becomes a problem it may be helpful to disable this option.

Ignoring Attributes

Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings
up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the
window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding
down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel
button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes
are ignored.

Working with Filters


The FilteredClusterer metaclusterer offers the user the possibility to apply filters directly before the
clusterer is learned. This approach eliminates the manual application of a filter in the Preprocess panel,
since the data gets processed on the fly. Useful if one needs to try out different filter setups.

Learning Clusters

The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list.
These all behave just like their classification counterparts. Right-clicking an entry in the result list brings
up a similar menu, except that it shows only two visualization options: Visualize cluster assignments and
Visualize tree. The latter is grayed out when it is not applicable.

We analyzed the results of three clustering algorithms on three data sets.

Datasets: Glass, iris and supermarket

Clustering algorithms:

SimpleKMeans


HierarchicalClustering

DBSCAN
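
Each scheme can also be run outside the Explorer. A minimal sketch using SimpleKMeans on the iris data (the path and the cluster count are illustrative; HierarchicalClusterer can be swapped in the same way, while DBSCAN ships as an optional package in recent WEKA releases):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/iris.arff"); // placeholder path
        data.deleteAttributeAt(data.numAttributes() - 1);    // ignore the class attribute

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);
        km.buildClusterer(data);

        // Report cluster sizes and assignments, as in the Explorer's output.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}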


Comparison Table:
For glass Data set:

                        SimpleKMeans     HierarchicalClustering     DBSCAN

Time taken              0.02 sec         0.06 sec                   0.03 sec

Clustered instances     0: 87 (41%)      0: 211 (99%)               0: 70 (33%)
                        1: 84 (39%)      1: 1 (0%)                  1: 17 (8%)
                        2: 43 (20%)      2: 2 (1%)                  2: 75 (36%)
                                                                    3: 29 (14%)
                                                                    4: 10 (5%)
                                                                    5: 9 (4%)

Unclustered instances   -                -                          4

For supermarket Data set:

                        SimpleKMeans     HierarchicalClustering     DBSCAN

Time taken              0.97 sec         -                          0.22 sec

Clustered instances     0: 1679 (36%)    -                          0: 1679 (36%)
                        1: 2948 (64%)                               1: 2948 (64%)

Unclustered instances   -                -                          -


For iris Data set:

                        SimpleKMeans     HierarchicalClustering     DBSCAN

Time taken              0 sec            0.02 sec                   0.02 sec

Clustered instances     0: 50 (33%)      0: 50 (33%)                0: 50 (33%)
                        1: 50 (33%)      1: 50 (33%)                1: 50 (33%)
                        2: 50 (33%)      2: 50 (33%)                2: 50 (33%)

Unclustered instances   -                -                          -


Practical 6
AIM: Test the Naive Bayes classifier using suitable data.
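
The classifier can be tested from the Explorer (Classify tab, choose bayes > NaiveBayes, select 10-fold cross-validation and click Start) or from the WEKA Java API. A minimal sketch, assuming the bundled nominal weather data as the "suitable data"; the path is a placeholder:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute

        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());      // accuracy and error statistics
        System.out.println(eval.toClassDetailsString()); // per-class precision/recall
        System.out.println(eval.toMatrixString());       // confusion matrix
    }
}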


Practical 7
AIM: Analyze the classification rule process on the data set employee.arff using the Id3 algorithm.

This experiment illustrates the use of the Id3 classifier in WEKA. The sample data set used in this experiment is
the "employee" data available in ARFF format. This document assumes that appropriate data pre-processing has been
performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.

Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “Id3” classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of
the Choose button. In this example, we accept the default values.

Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation
approach. Since we don’t have a separate evaluation data set, this is necessary to get a reasonable idea of the
accuracy of the generated model.

Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation
statistics will appear in the right panel when the model construction is complete.

Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be
needed (either in preprocessing or in selecting different parameters for the classification).

Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking
the last result set and selecting “Visualize tree” from the pop-up menu.

Step 8: We will use our model to classify the new instances.

Step 9: In the main panel under “Test options” click the “Supplied test set” radio button and then click the “Set...”
button. This will show a pop-up window which allows you to open the file containing test instances.

Data set employee.arff:

@relation employee

@attribute age {25, 27, 28, 29, 30, 35, 48}

@attribute salary {10k,15k,17k,20k,25k,30k,35k,32k,34k}

@attribute performance {good, avg, poor}

@data


25, 10k, poor

27, 15k, poor

27, 17k, poor

28, 17k, poor

29, 20k, avg

30, 25k, avg

29, 25k, avg

30, 20k, avg

35, 32k, good

48, 34k, good

48, 32k, good

The following screenshot shows the classification rules that were generated when the Id3 algorithm is applied on
the given dataset.
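
The same model can be built and printed from the WEKA Java API. A minimal sketch; the path is a placeholder, and note that in recent WEKA releases Id3 is provided by the optional simpleEducationalLearningSchemes package rather than the core distribution:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.Id3;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Id3Demo {
    public static void main(String[] args) throws Exception {
        // Id3 handles only nominal attributes; employee.arff above is all-nominal.
        Instances data = DataSource.read("/path/employee.arff");
        data.setClassIndex(data.numAttributes() - 1); // performance is the class

        Id3 id3 = new Id3();
        id3.buildClassifier(data);
        System.out.println(id3); // ASCII rendering of the decision tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Id3(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}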

Creating employee.arff:

 Create an employee.csv file in Microsoft Excel.

 Use the columns: age, salary, performance.

 Insert the values into the .csv file as listed above.

 Open a command line with weka.jar on the CLASSPATH.

 Run the command:

java weka.core.converters.CSVLoader /path/employee.csv > /path/employee.arff

(Specify the folder as required.)

OR

 You can also use the ArffViewer (Tools -> ArffViewer or Ctrl+A). Then open your CSV file.
 Next go to File -> Save as... and select Arff data files (should be selected by default).
 Note that your fields must be separated with a comma and not a semicolon.


Practical-8
Aim: Perform the J48 algorithm on the data set employee.arff to identify classification rules.

This experiment illustrates the use of the J48 classifier in WEKA. The sample data set used in this experiment is
the "employee" data available in ARFF format. This document assumes that appropriate data pre-processing has been
performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.

Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “J48” classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of
the Choose button. In this example, we accept the default values; the default version does perform some pruning
but does not perform reduced-error pruning.

Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation
approach. Since we don’t have a separate evaluation data set, this is necessary to get a reasonable idea of the
accuracy of the generated model.

Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation
statistics will appear in the right panel when the model construction is complete.

Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be
needed (either in preprocessing or in selecting different parameters for the classification).

Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking
the last result set and selecting “Visualize tree” from the pop-up menu.

Step 8: We will use our model to classify the new instances.

Step 9: In the main panel under “Test options” click the “Supplied test set” radio button and then click the “Set...”
button. This will pop up a window which allows you to open the file containing test instances.

Data set employee.arff:

@relation employee

@attribute age {25, 27, 28, 29, 30, 35, 48}

@attribute salary {10k,15k,17k,20k,25k,30k,35k,32k,34k}

@attribute performance {good, avg, poor}

@data


25, 10k, poor

27, 15k, poor

27, 17k, poor

28, 17k, poor

29, 20k, avg

30, 25k, avg

29, 25k, avg

30, 20k, avg

35, 32k, good

48, 34k, good

48, 32k, good

The following screenshot shows the classification rules that were generated when the J48 algorithm is applied on the
given dataset.
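
Equivalently, the tree and its evaluation can be produced from the WEKA Java API. A minimal sketch; the path is a placeholder:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/employee.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48(); // defaults: pruned C4.5 tree (-C 0.25 -M 2)
        tree.buildClassifier(data);
        System.out.println(tree); // the classification rules as an ASCII tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}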


Practical-9
Aim: Perform the J48 and Id3 algorithms on the data set weather.arff and compare the results.

This experiment illustrates the use of the J48 and Id3 classifiers in WEKA. The sample data set used in this
experiment is the "weather" data available in ARFF format. This document assumes that appropriate data
pre-processing has been performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (weather.arff) into WEKA.

Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “J48” classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of
the Choose button. In this example, we accept the default values; the default version does perform some pruning
but does not perform reduced-error pruning.

Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation
approach. Since we don’t have a separate evaluation data set, this is necessary to get a reasonable idea of the
accuracy of the generated model.

Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation
statistics will appear in the right panel when the model construction is complete.

Step 6: Note the classification accuracy of the model reported in the output; a low accuracy indicates that more
work may be needed (either in preprocessing or in selecting different parameters for the classification).

Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking
the last result set and selecting “Visualize tree” from the pop-up menu.

Step 8: Repeat this process for the Id3 algorithm and compare the two sets of results, as in the sketch below.
(Id3 handles only nominal attributes, so the nominal version of the weather data, or a discretized copy, is needed
for it.)
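
A minimal sketch of the comparison via the WEKA Java API (the path is a placeholder; the nominal weather data is assumed because Id3 cannot handle the numeric attributes in plain weather.arff):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.Id3;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate both learners with identical 10-fold cross-validation splits.
        for (Classifier c : new Classifier[] { new J48(), new Id3() }) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%s: %.1f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}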


Id3 algorithm

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.889 0.2 0.889 0.889 0.889 0.844 yes

0.8 0.111 0.8 0.8 0.8 0.844 no

Weighted Avg. 0.857 0.168 0.857 0.857 0.857 0.844

=== Confusion Matrix ===

a b <-- classified as

8 1 | a = yes

1 4 | b = no

J48 algorithm

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.556 0.6 0.625 0.556 0.588 0.633 yes

0.4 0.444 0.333 0.4 0.364 0.633 no

Weighted Avg. 0.5 0.544 0.521 0.5 0.508 0.633

=== Confusion Matrix ===

a b <-- classified as

5 4 | a = yes

3 2 | b = no

From the confusion matrices, Id3 classifies 12 of the 14 instances correctly (85.7%) while J48 classifies only 7 of
14 (50%), so on this small data set Id3 achieves the better cross-validated accuracy.


Practical-10
Aim: Compare the J48 and J48graft algorithms using a suitable data set. Derive conclusions in your own
language.

This experiment illustrates the use of the J48 and J48graft classifiers in WEKA. The sample data set used in this
experiment is the "weather" data available in ARFF format. This document assumes that appropriate data
pre-processing has been performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (weather.arff) into WEKA.

Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “J48” classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of
the Choose button. In this example, we accept the default values; the default version does perform some pruning
but does not perform reduced-error pruning.

Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation
approach. Since we don’t have a separate evaluation data set, this is necessary to get a reasonable idea of the
accuracy of the generated model.

Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation
statistics will appear in the right panel when the model construction is complete.

Step 6: Note the classification accuracy of the model reported in the output; a low accuracy indicates that more
work may be needed (either in preprocessing or in selecting different parameters for the classification).

Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking
the last result set and selecting “Visualize tree” from the pop-up menu.

Step 8: Repeat this process for the J48graft algorithm and compare the two sets of results, as in the sketch below.
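
A minimal sketch of the comparison via the WEKA Java API (the path is a placeholder; note that J48graft ships with WEKA 3.6 and was removed from later releases):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.J48graft;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GraftCompare {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("/path/weather.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate both learners with identical 10-fold cross-validation splits.
        for (Classifier c : new Classifier[] { new J48(), new J48graft() }) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%s: %.1f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}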


Id3 algorithm

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.889 0.2 0.889 0.889 0.889 0.844 yes

0.8 0.111 0.8 0.8 0.8 0.844 no

Weighted Avg. 0.857 0.168 0.857 0.857 0.857 0.844

=== Confusion Matrix ===

a b <-- classified as

8 1 | a = yes

1 4 | b = no

J48 algorithm

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.556 0.6 0.625 0.556 0.588 0.633 yes

0.4 0.444 0.333 0.4 0.364 0.633 no

Weighted Avg. 0.5 0.544 0.521 0.5 0.508 0.633

=== Confusion Matrix ===

a b <-- classified as

5 4 | a = yes

3 2 | b = no


Practical-11
Aim: Setting up a Knowledge Flow to load an ARFF file (batch mode) and perform a cross-validation using J48
(Weka's C4.5 implementation).

 First start the KnowledgeFlow.

 Next click on the DataSources tab and choose "ArffLoader" from the
toolbar (the mouse pointer will change to a "cross hairs").

 Next place the ArffLoader component on the layout area by clicking somewhere on the layout (A copy
of the ArffLoader icon will appear on the layout area).
 Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on the layout.
 A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu and browse to
the location of your arff file.


 Next click the "Evaluation" tab at the top of the window


 Choose the "ClassAssigner" (allows you to choose which column to be the class)component from the
toolbar. Place this on the layout.

 Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader and select the
"dataSet" under "Connections" in the menu. A "rubber band" line will appear.
 Move the mouse over the ClassAssigner component and left click - a red line labeled "dataSet" will
connect the two components.
 Next right click over the ClassAssigner and choose "Configure" from the menu. This will pop up a
window from which you can specify which column is the class in your data (last is the default).


 Next grab a "CrossValidationFoldMaker" component from the Evaluation toolbar and place it on the
layout.
 Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over "ClassAssigner"
and selecting "dataSet" from under "Connections" in the menu.

 Next click on the "Classifiers" tab at the top of the window and scroll along the toolbar until you reach
the "J48" component in the "trees" section.
 Connect the CrossValidationFoldMaker to J48 TWICE by first choosing "trainingSet" and then
"testSet" from the pop-up menu for the CrossValidationFoldMaker.

Page 44
Data Mining (3160714)

 Place a J48 component on the layout.


 Next go back to the "Evaluation" tab and place a "ClassifierPerformanceEvaluator" component on the
layout.

 Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.


 Next go to the "Visualization" toolbar and place a "TextViewer" component on the layout.

 Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the "text" entry from the
pop-up menu for ClassifierPerformanceEvaluator.


 Now start the flow executing by selecting "Start loading" from the pop-up menu for ArffLoader.

 When finished you can view the results by choosing "Show results" from the pop-up menu for the
TextViewer component.


Practical 12
AIM: Linear Regression using Excel

Application: Microsoft Excel 2007

Click the Microsoft Office button, go to Excel Options, and click Add-Ins.

In the Add-Ins box, select Analysis ToolPak and click Go...

Install the Analysis ToolPak (continue)


3. In the Add-Ins available box, check Analysis ToolPak and then click OK.

4. If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to locate it.

Linear Regression using the Data Analysis Add-In

Suppose we want to determine whether Y is a function of X:

Y_i = a + b * X_i + error_i

where:

Y_i = value of Y for observation i

a = mean value of Y when X is zero (intercept coefficient)

b = average change in Y given a one-unit change in X, i.e. the slope of X

X_i = value of X for observation i

In the Data Analysis tool select Regression and then click OK.

Select the Input Data Set for Y and X values

2. After clicking the regression, a regression box appears.

3. Select the Inputs for Y range and X range.

4. Select the place where you want your output.

5. Check Labels and then click OK.

Output


The output is given in the Coefficients column in the last set of output:

a = 1.329 (intercept coefficient)

b = 0.801 (coefficient of X, i.e. slope)

So our regression equation is Y = 1.329 + 0.801(X).

The regression statistics output also gives a goodness-of-fit measure:

Adjusted R square = 0.3984, which measures the fit.

This means about 39.84% of the variation in Y is explained by X.
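
The same coefficients can be computed outside Excel as a cross-check. A minimal Java sketch of ordinary least squares; the sample arrays are hypothetical and should be replaced with the observations from your worksheet:

public class LeastSquares {
    public static void main(String[] args) {
        // Hypothetical sample data.
        double[] x = { 1, 2, 3, 4, 5 };
        double[] y = { 2.1, 2.9, 4.2, 4.8, 6.1 };

        double xMean = mean(x), yMean = mean(y);
        double sxy = 0, sxx = 0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - xMean) * (y[i] - yMean);
            sxx += (x[i] - xMean) * (x[i] - xMean);
        }
        double b = sxy / sxx;         // slope: average change in Y per unit change in X
        double a = yMean - b * xMean; // intercept: value of Y when X is zero

        System.out.printf("Y = %.3f + %.3f * X%n", a, b);
    }

    private static double mean(double[] v) {
        double sum = 0;
        for (double d : v) sum += d;
        return sum / v.length;
    }
}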


Practical 13
AIM: Correlation analysis in Excel

The correlation coefficient (a value between -1 and +1) tells you how strongly two variables are related to each
other. We can use the CORREL function or the Analysis Toolpak add-in in Excel to find the correlation
coefficient between two variables.

The equation for the correlation coefficient is:

r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² · Σ(y − ȳ)² )

where x̄ and ȳ are the sample means AVERAGE(array1) and AVERAGE(array2).

- A correlation coefficient of +1 indicates a perfect positive correlation. As variable X increases, variable Y
increases. As variable X decreases, variable Y decreases.

- A correlation coefficient of -1 indicates a perfect negative correlation. As variable X increases, variable Z
decreases. As variable X decreases, variable Z increases.


- A correlation coefficient near 0 indicates no correlation.
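
As a cross-check, CORREL can be reproduced in code. A minimal Java sketch of the formula above; the two arrays are hypothetical stand-ins for array1 and array2:

public class Correl {
    public static void main(String[] args) {
        // Hypothetical sample data.
        double[] x = { 1, 2, 3, 4, 5, 6 };
        double[] y = { 3, 5, 4, 6, 8, 9 };

        double xMean = mean(x), yMean = mean(y);
        double num = 0, dx = 0, dy = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - xMean) * (y[i] - yMean);
            dx += (x[i] - xMean) * (x[i] - xMean);
            dy += (y[i] - yMean) * (y[i] - yMean);
        }
        double r = num / Math.sqrt(dx * dy); // correlation coefficient in [-1, +1]
        System.out.printf("r = %.2f%n", r);
    }

    private static double mean(double[] v) {
        double sum = 0;
        for (double d : v) sum += d;
        return sum / v.length;
    }
}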

To use the Analysis Toolpak add-in in Excel to quickly generate correlation coefficients between multiple
variables, execute the following steps.

1. On the Data tab, click Data Analysis.

Note: if you can't find the Data Analysis button, load the Analysis ToolPak add-in first (see Practical 12).

2. Select Correlation and click OK.


3. For example, select the range A1:C6 as the Input Range.

4. Check Labels in first row.

5. Select cell A9 as the Output Range.

6. Click OK.

Result.


Conclusion: variables A and C are positively correlated (0.91). Variables A and B are not correlated (0.19).
Variables B and C are also not correlated (0.11). You can verify these conclusions by looking at the graph.
