Data Preparation in SPSS Modeler
CS/IS C415
Data Mining
Poonam Goyal
Saiyedul Islam
EDD Notes
Birla Institute of Technology & Science
Pilani (Rajasthan)
Getting Started with IBM SPSS Modeler
The IBM SPSS Modeler is a data mining, modeling, and reporting tool. It provides a
convenient GUI to carry out all the data mining tasks in the form of Nodes and Stream
Flows. Nodes are the icons or shapes that represent individual operations on the
data. The nodes are linked together in a stream to represent the flow of data
through each operation; that is, a set of actions (reading in, preprocessing,
classification/association rule mining/clustering, reporting, etc.) on some input data
is called a stream.
Fig 1 - The Modeler interface, showing the Nodes palette, the Projects panel and the Modeling area
Fig 2 - An abstract stream
Source Nodes
Some important data source nodes are-
Node type - Imports data from
Database Node - MS SQL Server, DB2, Oracle (using ODBC)
Variable File Node - Delimited text data (*.csv files)
User Input Node - Generates synthetic data
Record Operations Nodes
Some important record operations nodes are-
Sort Node - Sorts records into ascending or descending order based on the values of one or more fields
Merge Node - Takes multiple input records from different sources and creates a single output record containing some or all of the input fields
Append Node - Concatenates sets of records
Distinct Node - Removes duplicate records
Output Nodes
Output nodes provide the means to obtain information about data and models.
Some important output nodes are-
Node type - Function
Table Node - Displays the data in tabular format, which can also be written to a file
Matrix Node - Creates a table that shows relationships between fields
Export Nodes
These nodes provide a mechanism for exporting data in various formats to
interface with other software tools. Some important export nodes are-
Node type - Function
Database Export Node - Writes data to an ODBC-compliant relational data source
Flat File Export Node - Outputs data to a delimited text file
Excel Export Node - Outputs data in Microsoft Excel format (*.xls)
Lab 1
Data Preprocessing
Objective
Apply different kinds of preprocessing techniques to the given dataset.
Input Files-
Data file => iris_data.csv
Metadata file => iris_metadata.txt
Reading Data
1. Read the given metadata file to find out the number of fields and their
types (i.e. continuous, nominal, etc.), and also have a look at the input data file.
Notice that field names are not mentioned.
2. Drag-n-drop a Variable File Node (from Sources tab in Nodes palette) on to
Modeler Canvas.
3. Double-click on Variable File Node and go to "File" tab. Browse and select the
given input data file.
4. Uncheck "Read field names from file", and check "Specify number of fields" and
set the field count value according to information given in metadata file.
5. Go to "Data" tab and make sure that you are satisfied with the "Storage" type of
each field. Otherwise, check "Override" and change the storage type.
6. Go to "Filter" tab and change the output field names according to the names
mentioned in metadata file.
7. Go to "Types" and make sure "Measurement" of each field is correct (i.e.
continuous, nominal, etc.)
8. [Optional] To know the range of values that each field has in this data, click
on the "Values" cell in front of the field, select "<Read>" and click on the "Read
Values" button.
9. [Optional] To mark a field as the classifier field (the field on which classification
has to be performed), mark it as "Target" in its "Role" cell.
10. Click "Apply" and "OK". Now, a source node with your input file name must
appear on the modeler canvas.
11. [Optional] To view the output of this source node, add a "Table node",
connect it to the source node, and "Run" the stream.
Running a stream
To run the whole stream, use the "Run Stream" button in the toolbar.
To run down-stream of a particular node, right-click on it and select "Run From
Here".
Connecting Node N1 to N2
Right-click on N1, select "Connect" and then select N2, or
Select N1, press "F2" and then select N2.
Analyzing Input
1. To understand the input data, use "Data Audit Node" from "Output" tab in node
palette and connect it to the source node.
2. Double-click on data audit node and go to "Settings" tab.
3. Either select "Default" for auditing all fields or select "Use custom fields" and
select required fields.
4. Check all options in "Display".
5. Go to "Quality" tab and select "Outlier and Extreme Detection method". Specify
the values for outliers and extremes.
6. Click "Apply" and "OK". Now, a data audit node will appear on the modeler
canvas. "Run" the stream. (Ctrl+E)
7. Observe the output ("Audit" and "Quality" tabs) carefully and try to determine
the necessity of preprocessing.
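The standard-deviation-based detection used in the "Quality" tab can be illustrated in plain Python. This is only a hedged sketch, not the node's actual implementation; the function name and the 3 and 5 standard-deviation cut-offs are assumptions chosen to mirror commonly used defaults for outliers and extremes.

```python
# Hypothetical sketch of standard-deviation-based outlier detection,
# similar in spirit to the Data Audit node's quality check.  The 3- and
# 5-standard-deviation cut-offs are assumed defaults, not read from Modeler.
def flag_outliers(values, outlier_sd=3.0, extreme_sd=5.0):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    labels = []
    for v in values:
        dist = abs(v - mean)
        if sd > 0 and dist > extreme_sd * sd:
            labels.append("extreme")   # beyond 5 standard deviations
        elif sd > 0 and dist > outlier_sd * sd:
            labels.append("outlier")   # beyond 3 standard deviations
        else:
            labels.append("ok")
    return labels
```

Running this on a field with one far value marks that value as an outlier (or, when it lies even further out, an extreme) while the rest stay "ok".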
a. Click on the "Expression Builder" button. The left-hand side lists all functions with
their return values and the right-hand side lists the fields in the input dataset.
The connectors specified in the middle are used to join two expressions.
b. To select all records with a missing Field1 value, select the "@NULL" function from
the function list and insert it into the expression. Select Field1 from the field list and
insert it as the argument of the above-mentioned function. "@NULL" works for
numeric values only; for string values use "STRING1 matches STRING2" and
specify STRING2 as "".
c. To select all records with any one of the field values missing, write
expressions as explained above, joined by the "OR" operation.
d. Verify the expression by clicking "Check". Remove syntax error, if any. Click
"OK".
5. Click "Apply" and "OK". Now, a select node will appear on the modeler canvas.
6. [Optional] To view the output of this select node, add a "Table node". Connect
select node to the table node and "Run" the stream from the select node.
Discretization (Binning)
1. Select a "Binning Node" from the "Field Ops" tab in the nodes palette and connect
the select node of the previous step to this node. Go to the "Settings" tab.
2. Select all the numeric fields with "Continuous" values as "Bin Fields".
3. As per the requirement, you may choose among many binning methods like
"Fixed-width", "Tiles", "Ranks" or "Mean/Std Deviation based". For now,
select the "Fixed-width" method. Read the Help to learn what these terms mean.
4. Select "Number of bins" as appropriate.
5. [Optional] Go to the "Bin Values" tab and click on the "Read Values" button. You can
see the lower and upper bounds of each bin for each binned field.
6. Click "Apply" and "OK". Now, a binning node will appear on the modeler canvas.
7. New binned values are appended as new fields to the dataset. To remove the old
fields, select a "Filter Node" from the "Field Ops" tab in the nodes palette.
8. Connect binning node to this filter node and double click on it.
9. Click on "Filter" column of old fields to filter them out. Click "Apply" and "OK".
10. [Optional] To view the output of this node, add a "Table node". Connect this node
to the table node and "Run" the stream from this node.
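The "Fixed-width" method above can be sketched in a few lines of Python. This is an illustrative approximation, not the Binning node's implementation; the function name and the clamping of the maximum value into the last bin are choices made for this sketch.

```python
# Hedged sketch of fixed-width binning: the observed range of a field is
# split into a fixed number of equal-width intervals.
def fixed_width_bins(values, num_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    bins = []
    for v in values:
        b = int((v - lo) / width)          # 0-based bin index
        bins.append(min(b, num_bins - 1))  # the maximum value lands in the last bin
    return bins
```

For example, binning the values 0..9 into 2 bins puts 0-4 into bin 0 and 5-9 into bin 1.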
Sampling
1. Select a "Sample Node" from the "Record Ops" tab in the nodes palette. Connect
the select node of Step 1 to this sample node. Double-click on the sample node and
go to the "Settings" tab.
2. As per the requirement, you may choose among the "Simple" or "Complex"
sample method. For now, select "Simple" method.
3. You can select the sampling criterion as "First n", "1-in-n" or "Random %".
Try to understand the difference between these options.
4. Click "Apply" and "OK". Now, a sample node will appear on the modeler canvas.
5. [Optional] To view the output of this sample node, add a "Table node". Connect
this node to the table node and "Run" the stream.
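The three simple criteria can be mimicked in Python as follows. This is a hedged sketch; in particular, the assumption that "1-in-n" keeps the first record of each group of n, and that "Random %" keeps each record independently, is made for this illustration rather than taken from Modeler.

```python
# Hedged sketch of the three simple sampling criteria offered by the
# Sample node: "First n", "1-in-n" and "Random %".
import random

def first_n(records, n):
    # keep only the first n records
    return records[:n]

def one_in_n(records, n):
    # keep the 1st, (n+1)th, (2n+1)th, ... record
    return records[::n]

def random_pct(records, pct, seed=None):
    # keep each record independently with probability pct/100
    rng = random.Random(seed)
    return [r for r in records if rng.random() < pct / 100.0]
```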
Normalization
1. Select a "Filler Node" from the "Field Ops" tab in the nodes palette. Connect the
select node of Step 1 to this filler node. Double-click on the filler node and go to the
"Settings" tab. Select all numeric fields in "Fill in fields". Select the "Replace"
condition as "Always".
2. In the "Replace with" box, write your normalization expression. For example, for
0-1 normalization you should write
(@FIELD - @GLOBAL_MIN(@FIELD)) / (@GLOBAL_MAX(@FIELD) - @GLOBAL_MIN(@FIELD))
3. Click "Apply" and "OK". Now, a filler node will appear on the modeler canvas.
4. [Optional] To view the output of this filler node, add a "Table node". Connect this
node to the table node and "Run" the stream.
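The Filler-node expression above has a direct Python analogue. The following is a minimal sketch of 0-1 (min-max) rescaling; the handling of a constant field is an assumption added here, since the CLEM expression would divide by zero in that case.

```python
# Hedged sketch of 0-1 (min-max) normalization, mirroring
# (@FIELD - min) / (max - min) from the Filler-node expression.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                 # constant field: nothing to rescale (assumed behaviour)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For instance, the values 2, 4, 6 rescale to 0.0, 0.5 and 1.0.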
Correlation Determination
1. Select a "Statistics Node" from "Output" tab in nodes palette.
2. Connect select node of Step 1 to this node.
3. Double click on the node and go to "Settings" tab.
4. Under the "Examine" box, select all numeric fields. You may check all statistics
parameters under the "Statistics" box.
5. Under "Correlate" box, again select all numeric fields.
6. You may adjust correlation strength parameters for weak, medium and strong
correlation. Click "OK".
7. Click "Apply" and "OK". Now, a statistics node will appear on the modeler canvas.
8. Run the stream from this node and observe the correlations between different
fields. This will help you in selecting the features to be considered for further processing.
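The pairwise correlations reported by the Statistics node can be illustrated with the Pearson coefficient. This hedged sketch assumes equal-length numeric lists with non-zero variance; it is not the node's implementation.

```python
# Hedged sketch of the Pearson correlation coefficient for two fields.
def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5   # in [-1, 1]
```

Perfectly linearly related fields give +1 or -1; values near 0 indicate weak linear correlation.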
Assignment
1. Use "Auto Data Preparation" node and configure it to perform all types of
preprocessing techniques explained above.
2. Replace outlier values with null values using the "Data Audit" Node.
Hint: Go to the "Quality" tab, select an action in the "Action" column. Generate an
"Outlier and Extreme SuperNode". Connect the source node to this super node.
3. Replace missing values by the mean of the values of records having same class
value.
4. Replace missing values by majority voting of the values of records having same
class value.
5. Perform Stratified Sampling and observe the difference in the output.
6. Perform binning using Tiles, Ranks and Mean based methods.
7. Perform z-score normalization without using Auto Data Preparation Node.
Lab 2
Decision Tree based Classification
Objective
Use the given data set to build a decision tree based classifier using the C5.0
algorithm and use it to predict the class of the given cases. Also, analyze the
difference in outcome with and without pruning.
Model Nuggets
A model nugget is a container for a model, that is, the set of rules, formulae or
equations that represent the results of a model building operation. The main
purpose of a nugget is to score data to generate predictions, or to allow further
analysis of the model properties. Opening a model nugget on the screen enables you
to see various details about the model, such as the relative importance of the input
fields and/or the rule set used in creating the model. To view the predictions, you need
to attach and execute a further process or output node.
Rule set
A rule set is a set of rules that tries to make predictions for individual
records.In some cases, more than one rule may apply for any particular
record, or no rules at all may apply. If multiple rules apply, either
prediction of first applicable rule or prediction of majority voting of all
applicable rules becomes the class of record. If no rule applies, a default
prediction is assigned to the record.
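The scoring logic described above can be sketched as follows, with each rule modelled as a (predicate, prediction) pair. This representation is an assumption of the illustration, not how Modeler stores rule sets.

```python
# Hedged sketch of rule-set scoring for a single record.
def score_first_hit(record, rules, default):
    # the prediction of the first applicable rule wins
    for applies, prediction in rules:
        if applies(record):
            return prediction
    return default                            # no rule applies

def score_majority(record, rules, default):
    # majority vote over all applicable rules
    votes = [pred for applies, pred in rules if applies(record)]
    if not votes:
        return default
    return max(set(votes), key=votes.count)   # most frequent prediction
```

With hypothetical iris-style rules, a record matching the first rule gets that rule's class under first-hit scoring, while majority scoring counts every applicable rule.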
Generating Rule set
1. Go to the "Model" tab of the C5.0 node and select "Rule set" as the output type
this time. Select Mode as Simple. Go to Annotations and name the model "C5.0
Simple mode". Click "Apply" and "OK". Copy and paste this C5.0 node and connect
the Training Select node to it.
2. Go to the "Model" tab of this copied C5.0 node and likewise select "Rule set" as
the output type. Select Mode as Expert. Specify Pruning severity as 0 and Minimum
records per child branch as 1. Uncheck Use global pruning and Winnow
attributes. Uncheck Use partitioned data and Build model for each split.
3. Go to Annotations and name the model "C5.0 Expert mode". Click "Apply" and
"OK". This Expert mode node doesn't perform any type of pruning and creates an
unrestricted decision tree (or rule set, in this case).
4. "Run" this stream. Two model nuggets will get created. Go to their Model tab
and observe the difference in Rule sets.
5. Take a Merge node from the Record Ops tab in the nodes palette. Connect both model
nuggets to this node. Configure the merge node to merge the data on the basis of
Keys and select all keys except the predicted class and corresponding confidence
fields of both models. Rename these fields accordingly.
6. Connect merge node to an Analysis node. Run the stream and observe the
accuracy of both the models.
Predicting class of test cases
1. Go to the "Generate" option in the menu bar of the Simple mode model nugget and
select "Rule Trace Node". A SuperNode named "RULE" will be generated on the
modeler canvas. Rename it as "RULE: Simple mode". A SuperNode groups
multiple nodes into a single node by encapsulating sections of a data stream, in
this case, all the rules specified by the Rule Set.
2. Select this SuperNode and choose the "Zoom-in" option from the "SuperNode" menu
in the menu bar of SPSS Modeler. A stream of Derive nodes will appear, each
deriving a new field based on the conditions in the Rule Set. Double-click on the nodes
to see the deriving condition.
3. The third-last node derives a field holding the concatenation of all the rules, or the
value "No rules". (This node uses the concatenation operator ><.) The second-last
node derives another field which contains the concatenated rule set without the
trailing semicolon. The last node is a Filter node which filters out all the
intermediate fields generated by different rules from the final outcome.
4. The value of these derived fields contains the predicted outcome of the rule along
with its confidence value. In order to compare only the predicted value, we need to
modify the output. [Optional] Add a Table node to the last/second-last node to see
the output.
5. To make the prediction of the first applicable rule the prediction of the record, select
a "Derive" node from the "Field Ops" tab in the nodes palette. Insert it in between the
second-last and last nodes. Choose "Flag" under the "Derive as" option. Specify the
True value as "YES" and the False value as "NO". Specify the true condition as
'RULE' matches "YES*"
This will put "YES" in the newly derived field for all the records which had a string
starting with YES in their RULE field.
6. Go to the filter node, filter out the RULE field and rename the newly derived field as
"Simple mode prediction". Zoom out of the SuperNode. Repeat all the above steps for
the Expert mode model nugget. Do the renaming correspondingly. Add Table nodes
to these SuperNodes to see the predicted values.
Comparing performance
1. Select a "Merge" node from the "Record Ops" tab in the nodes palette. Connect the
Simple mode and Expert mode SuperNodes to this Merge node. Go to the Merge tab
and select "Keys" as the Merge Method. Select all possible keys.
2. Select a "Matrix" node from "Output" tab in nodes palette. Under Settings select
Simple mode prediction in Rows and Expert mode prediction in Columns.
3. Select Cell contents as Cross-tabulations. Click on Run to see the results. Analyze
them carefully.
4. Vary the "Pruning Severity", repeat all steps from beginning and see the change
in this matrix.
5. Connect a Table node to the Merge node. Under "Highlight records where", write an
appropriate condition for highlighting records where the predictions of the two
modes differ. Select "Output to file" and file type as *.html.
Assignment
1. Pass both the partitions to the C5.0 models and use Analysis node to know the
accuracy of model.
2. Observe the difference in model by performing following operations -
Using misclassification cost to heavily penalize false positives
Generating a Rule Set from already generated Decision Tree
Performing Majority Voting to select class of record in case of multiple rule
applicability
Preprocessing the data before applying C5.0
Performing Global Pruning and Winnow attributes
3. Select a field as Split field and generate models for each split.
Lab 3
K-nearest Neighbor based Classification
Objective
Use the given data set to build a K-nearest neighbor (KNN) model and use it to
predict the class of given cases.
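The prediction step of KNN can be sketched in a few lines. This is an illustrative implementation assuming numeric feature vectors and Euclidean distance, not the KNN node's internals.

```python
# Hedged sketch of k-nearest-neighbour classification: predict the
# majority class among the k training records closest to the query.
def knn_predict(train, query, k):
    # train: list of (feature_vector, class_label) pairs
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)   # majority vote
```

A query near one cluster of training points takes that cluster's class.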
Data Preprocessing
1. Read in the input dataset using an appropriate source node. Select an "Auto Data
Prep" node from the Field Ops tab in the nodes palette and connect the source node to
it. Go to the Objectives tab of the node and select "Optimize for accuracy". Analyze the
Settings tab and explore its options. Go to the Analysis tab and click on the
"Analyze Data" button in the menu bar. Select the "Fields" option from the drop-down
list on the left-hand side panel.
2. Select the "Do not use" option from the drop-down list adjoining the field names for
fields having Predictive Power below a threshold (0.1 in this case). Select "Field
Details" on the right-hand side panel and explore the details of all the fields one
by one.
3. Go to the Generate menu in the menu bar of the Auto Data Prep node and click
on "Filter node". Connect the auto data prep node to this filter node.
Joining Test Cases with Input Dataset-
1. Read the "to be predicted" dataset using an appropriate source node. Select an "Append"
node from the Record Ops tab in the nodes palette and connect the "to be predicted" dataset
and the filter node of the input data set to it. Go to the "Inputs" tab of the Append node and
make sure that the input dataset is above the "to be predicted" dataset. To view the
resulting fields, go to the "Append" tab.
2. Add a "Type" node from Field Ops tab in node palette and connect Append node to it.
Mark the Role of target field as Target under Types tab of this node.
Fig 3 - Model tab of Model Nugget
Lab 4
Bayesian Network based Classification
Objective
Use the given data set to build Bayesian network based classification model using
different structuring methods and analyze the effect of feature selection on these
methods.
Lab 5
Support Vector Machine based Classification
Objective
Use the given data set to build an SVM based classification model. Also, analyze the
difference in outcome by varying different parameters of SVM generation.
The SPSS Modeler offers two types of modes for the SVM model-
Simple mode
Expert mode
5. "Run" this stream from the SVM node; an "SVM Model Nugget" will be created and
placed on the stream canvas.
6. [Optional] Connect an Analysis node to this nugget and see the partition-wise
classification error.
Comparing performance-
6. Select a "Merge" node from "Record Ops" tab in nodes palette.
7. Connect Simple mode and Expert mode model nuggets to this Merge node.
8. Go to the Merge tab of the merge node and select "Keys" as the Merge Method. Select
all possible keys except the predicted target field and its probability field (fields whose
names start with the $S prefix).
9. Connect an Analysis node to the merge node. Run the stream.
10. Carefully analyze the results, i.e. the performance of the individual models and their
comparison with the original value of the target field which was given in the dataset.
11. Vary the parameters on Expert node to generate better models.
12. Export analysis results to *.html file. Concatenate your ID in their names. You
may be required to submit them for evaluation.
Assignment
1. Find out the default value of all the parameters in Simple mode.
Lab 6
C&R Tree based Classification
Objective
Explore different methods for performing tree based classification using C&R tree
node.
combining rule for categorical and continuous targets in the Ensembles menu.
Also, specify the value for the number of component models. Run the stream.
2. Connect the model nugget to an Analysis node. Run the stream from the model nugget.
3. Observe the change in accuracy from the Single Tree model. Try to tune various
parameters to get the best possible model.
Go to the "Gains" tab during the tree generation process and you can see statistics
for all terminal nodes in the tree. These statistics are shown for both the "Tree
Growing Set" and the "Overfit Prevention Set".
Go to the "Risks" tab during the tree generation process and you can see the chances
of misclassification at any level. You will see a "Misclassification Matrix" for both
the "Tree Growing Set" and the "Overfit Prevention Set".
3. In the end, go to the "Generate" menu and select "Generate Model". A new model
nugget will be placed on the canvas. Connect the partition node to it and connect it
to an analysis node. Compare the accuracy of all four models.
Assignment
1. Select "Highest Probability Wins" and "Highest Mean Probability" in the
combining rule for categorical target and observe the difference in the boosting
and bagging model.
Lab 7
Ensemble based Classification
Objective
Understand the working of ensemble based classification by implementing
Ensemble of the same model applied on subsets of records (Bagging)
Ensemble of different models applied on the same data
Ensemble of the same model applied on subsets of fields (Random Forest)
Ensemble Node
The Ensemble node combines two or more model nuggets to obtain more
accurate predictions than can be gained from any of the individual models. By
combining predictions from multiple models, limitations in individual models
may be avoided, resulting in a higher overall accuracy. Models combined in this
manner typically perform at least as well as the best of the individual models and
often better.
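Two of the combining rules offered by the Ensemble node (plain voting and confidence-weighted voting) can be sketched as follows; the data layout is an assumption of this illustration.

```python
# Hedged sketch of two ensemble combining rules.
def vote(predictions):
    # predictions: one class label per component model
    return max(set(predictions), key=predictions.count)

def confidence_weighted_vote(predictions):
    # predictions: one (label, confidence) pair per component model;
    # the label with the largest total confidence wins
    totals = {}
    for label, conf in predictions:
        totals[label] = totals.get(label, 0.0) + conf
    return max(totals, key=totals.get)
```

Note that the two rules can disagree: a single highly confident model can outweigh two hesitant ones under confidence weighting.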
9. Carefully analyze the output. See the accuracy of each model and the accuracy of
ensemble of these models. Try to tune each model individually in order to get
most accurate model.
Assignment
Change Ensemble method to Confidence-weighted voting and Highest confidence
wins and observe the changes in all three cases.
Lab 8
Apriori based Association Rule Mining
Objective
Use the given data set to generate association rules using Apriori algorithm.
The Apriori node discovers association rules in the data. Association rules are
statements of the form "if antecedent(s) then consequent(s)".
Support refers to the percentage of records in the training data for which the
antecedents are true. Confidence is based on the records for which the rule's
antecedents are true and is the percentage of those records for which the
consequent(s) are also true.
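These two definitions translate directly into code. The sketch below treats each transaction as a set of items and computes the percentages exactly as defined above; all names are illustrative.

```python
# Hedged sketch of support and confidence for one association rule,
# following the definitions in the text (support = % of records where
# the antecedents hold; confidence = % of those where the consequents
# also hold).
def support_confidence(transactions, antecedent, consequent):
    n = len(transactions)
    ante = [t for t in transactions if antecedent <= t]   # antecedents true
    both = [t for t in ante if consequent <= t]           # consequents also true
    support = 100.0 * len(ante) / n
    confidence = 100.0 * len(both) / len(ante)
    return support, confidence
```

For the rule "bread => milk" over four example baskets where three contain bread and two of those also contain milk, support is 75% and confidence is about 66.7%.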
Fig 1- Output of a web node
Assignment
1. Go to the Expert tab of the model and select mode as Expert. Generate association rules
through different evaluation measures -
Rule confidence - The default method; uses the confidence (or accuracy) of the rule to evaluate rules.
Confidence difference - The absolute difference between the rule's confidence and its prior confidence.
Confidence ratio - The ratio of rule confidence to prior confidence.
Information difference - Based on the concept of information gain.
Normalized Chi-square - A statistical index of association between antecedents and consequents.
Lab 9
K-means based Clustering
Objective
Find the optimal value of the number of clusters (K) for the K-means algorithm on a
given data set through scripting in SPSS Modeler.
Creating Nodes
create NODE
create NODE at X Y
create NODE between NODE1 and NODE2
create NODE connected between NODE1 and NODE2
NODE can be any valid node like variablefilenode, typenode, tablenode, etc.
Nodes can be created using variables also. For example -
var x
set x = create variablefilenode
rename ^x as "Input"
position ^x at 200 200
Properties of any node can be accessed using the dot (.) operator. For example,
set ^x.delimit_space = true sets the delimit_space property of the variable file node created above.
For a complete list of properties of individual nodes, refer to the SPSS Modeler Properties
reference manual.
Connecting Nodes
connect NODE1 to NODE2
connect NODE1 between NODE2 and NODE3
if...endif
if EXPR then
STATEMENTS 1
else
STATEMENTS 2
endif
for...endfor
for PARAMETER in LIST
STATEMENTS
endfor
Fig 1 - Stream after one iteration of for loop in the above script
Complete Script
var source_file_path
set source_file_path = "C:\Users\Dell\Documents\DM lab Manuals\Lab 6 Kmeans"
var source_file_name
set source_file_name = "\s1.csv"
var lower_limit
var upper_limit
set lower_limit = 2
set upper_limit = 5
var kname
var dname
var t
var temp
var d
var gname
var g
var vname
var ymid
set ymid = 150 * (^upper_limit / 2)
execute ^kname
insert model ^t:kmeansnode at 350 ^temp
connect ^vname to ^t:applykmeansnode
set dname = create derivenode
position ^dname at 450 ^temp
connect ^t:applykmeansnode to ^dname
set d = "Square Error " >< ^i
set ^dname.new_name = ^d
set ^dname.result_type = Formula
set ^dname.formula_expr = "'$KMD-" >< ^t >< "'*'$KMD-" >< ^t >< "'"
create reportnode
connect ^vname to :reportnode
set :reportnode.output_mode = File
set :reportnode.output_format = Text
set :reportnode.full_filename = ^source_file_path >< "kmeans-output.txt"
set :reportnode.text = ""
for i from ^lower_limit to ^upper_limit
set :reportnode.text = :reportnode.text >< ^i >< ", [@GLOBAL_SUM('Square Error " >< ^i >< "')]\n"
endfor
execute :reportnode
create plotnode
position :plotnode at 750 ^ymid
connect ^vname to :plotnode
set :plotnode.x_field = "No of Clusters"
set :plotnode.y_field = "Sum of Squared Errors"
set :plotnode.title = "Plot of SSE Vs. K"
execute :plotnode
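For comparison, the whole loop (build a clustering for each K and total the squared errors) can be sketched end to end in plain Python. This is an illustrative k-means on 2-D points, not what Modeler runs; plotting SSE against K and looking for the "elbow" in the curve then suggests the best K.

```python
# Hedged sketch of the SSE-vs-K computation behind the script above.
import random

def kmeans_sse(points, k, iters=20, seed=0):
    # points: list of (x, y) tuples
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centres
    for _ in range(iters):
        # assign each point to its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                  + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # recompute each centre as the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    # total within-cluster sum of squared errors
    return sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
               for p in points)

# SSE for K = 2..5; the "elbow" in this curve suggests the best K:
# sse_by_k = {k: kmeans_sse(data, k) for k in range(2, 6)}
```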
Lab 10
Self-organizing Map based Clustering
Objective
Use the given data set to build a Kohonen (SOM) clustering model.
The basic units of a Kohonen network, or self-organizing map, are neurons, and
they are organized into two layers: the input layer and the output layer (also
called the output map). All of the input neurons are connected to all of the
output neurons, and these connections have strengths, or weights, associated
with them. During training, each unit competes with all of the others to "win"
each record. As training proceeds, the weights on the grid units are adjusted so
that they form a two-dimensional "map" of the clusters.
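A single training step of this competitive process can be sketched as follows. Updating only the winning unit is a simplification of this illustration; a real Kohonen map also updates the winner's neighbourhood, with the learning rate and neighbourhood size decaying over time.

```python
# Hedged sketch of one Kohonen training step: the output unit whose
# weight vector is closest to the record "wins" and is pulled towards it.
def som_step(weights, record, learning_rate):
    # weights: one weight vector per output unit (modified in place)
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, record))
    winner = min(range(len(weights)), key=lambda i: dist2(weights[i]))
    weights[winner] = [wi + learning_rate * (xi - wi)
                       for wi, xi in zip(weights[winner], record)]
    return winner
```

Repeating this over many records moves each unit's weights towards the centre of the records it keeps winning, which is how the map of clusters forms.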
Assignment
1. Adjust the parameters in Expert mode such that value of Silhoutte measure gets
maximized. Uncheck the Show feedback graph option to avoid waiting long
during every execution of model.
Parameter - Description
Width and length - The size of the two-dimensional output map.
Learning rate decay - A weighting factor that decreases over time, such that the network starts off encoding large-scale features of the data and gradually focuses on finer-level detail.
Phase 1 - A rough estimation phase, used to capture the gross patterns in the data.
Phase 2 - A tuning phase, used to adjust the map to model the finer features of the data.
Neighborhood - Determines the number of "nearby" units that get updated along with the winning unit during training.
Lab 11
Real World Data Mining
Objective
Use learned data mining techniques to identify relevant patterns and useful
information in a given data set.
Performance Metrics
Explain what performance metric(s) will be used and why to evaluate the models
that you have constructed in order to answer questions identified (For example -
accuracy, error rate, etc).
Model Interpretation
Describe the resulting model (e.g., size of the model, readability, accuracy, etc.).
Summarize the model in your own words, focusing on the most
interesting/relevant patterns. Elaborate on if and how the model answers the
objectives identified.
Results
What is/are the model(s) with the highest performance for each objective of
each data set?
Discuss how well a data mining technique worked on these datasets. What
combination of parameters yielded particularly good results?
Overall project conclusion: strengths & weaknesses
Make a list of tasks that will work for any dataset and another list of tasks
particular to the given dataset.