TwoStep Cluster Analysis
Clustering Principles
In order to handle categorical and continuous variables, the TwoStep Cluster Analysis
procedure uses a likelihood distance measure which assumes that variables in the
cluster model are independent. Further, each continuous variable is assumed to have a
normal (Gaussian) distribution and each categorical variable is assumed to have a
multinomial distribution.
Empirical internal testing indicates that the procedure is fairly robust to violations of
both the assumption of independence and the distributional assumptions, but you
should try to be aware of how well these assumptions are met.
The two steps of the TwoStep Cluster Analysis procedure's algorithm can be
summarized as follows:
Step 1. The procedure begins with the construction of a Cluster Features (CF) Tree.
The tree begins by placing the first case at the root of the tree in a leaf node that
contains variable information about that case. Each successive case is then added to an
existing node or forms a new node, based upon its similarity to existing nodes and
using the distance measure as the similarity criterion. A node that contains multiple
cases contains a summary of variable information about those cases. Thus, the CF tree
provides a capsule summary of the data file.
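The insertion logic of Step 1 can be illustrated with a deliberately simplified Python sketch. It is not the procedure's actual implementation: it keeps a flat list of leaf summaries instead of a real tree, handles continuous variables only, and uses Euclidean distance to the leaf centroid as a stand-in for the likelihood distance. The names Leaf and insert_case and the threshold value are hypothetical.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Leaf:
    """Summary of the cases absorbed so far: a count and per-variable sums."""
    n: int = 0
    sums: list = field(default_factory=list)

    def centroid(self):
        return [s / self.n for s in self.sums]

    def add(self, case):
        if self.n == 0:
            self.sums = list(case)
        else:
            self.sums = [s + x for s, x in zip(self.sums, case)]
        self.n += 1


def insert_case(leaves, case, threshold):
    """Add `case` to the closest leaf, or start a new leaf if none is close enough."""
    best, best_dist = None, math.inf
    for leaf in leaves:
        # Assumption: Euclidean distance replaces the likelihood distance here.
        d = math.dist(leaf.centroid(), case)
        if d < best_dist:
            best, best_dist = leaf, d
    if best is None or best_dist > threshold:
        best = Leaf()
        leaves.append(best)
    best.add(case)
    return leaves


# Made-up cases with two continuous variables each.
leaves = []
for case in [(1.0, 1.0), (1.2, 0.9), (8.0, 8.1), (7.9, 8.3), (1.1, 1.1)]:
    insert_case(leaves, case, threshold=2.0)

print([(leaf.n, leaf.centroid()) for leaf in leaves])
```

Each leaf stores only a count and running sums, which is why a node holding many cases can still act as a compact summary of them.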
Step 2. The leaf nodes of the CF tree are then grouped using an agglomerative
clustering algorithm. The agglomerative clustering can be used to produce a range of
Distance Measure. This selection determines how the similarity between two clusters
is computed.
Number of Clusters. This selection allows you to specify how the number of clusters is
to be determined.
Assumptions. The likelihood distance measure assumes that variables in the cluster
model are independent. Further, each continuous variable is assumed to have a
normal (Gaussian) distribution, and each categorical variable is assumed to have a
multinomial distribution. Empirical internal testing indicates that the procedure is fairly
robust to violations of both the assumption of independence and the distributional
assumptions, but you should try to be aware of how well these assumptions are met.
Use the Bivariate Correlations procedure to test the independence of two continuous
variables. Use the Crosstabs procedure to test the independence of two categorical
variables. Use the Means procedure to test the independence between a continuous
variable and categorical variable. Use the Explore procedure to test the normality of a
continuous variable. Use the Chi-Square Test procedure to test whether a categorical
variable has a specified multinomial distribution.
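Outside the product, the first two checks can be approximated by hand. The following Python sketch computes a Pearson correlation for two continuous variables and a chi-square statistic for a two-way table of counts. It is only an illustration of the underlying statistics, not a reproduction of the Bivariate Correlations or Crosstabs procedures; it reports no significance levels, and the function names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chi_square_statistic(table):
    """Chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up data: a perfectly linear pair, and two 2x2 tables.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
print(chi_square_statistic([[10, 10], [10, 10]]))
print(chi_square_statistic([[20, 0], [0, 20]]))
```

A correlation near 0 and a small chi-square statistic are consistent with independence; large values suggest the independence assumption deserves a closer look.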
To Obtain a TwoStep Cluster Analysis
This feature requires the Statistics Base option.
1. From the menus choose:
Analyze > Classify > TwoStep Cluster...
2. Select one or more categorical or continuous variables.
Optionally, you can:
If you select noise handling and the CF tree fills, it will be regrown after placing
cases in sparse leaves into a "noise" leaf. A leaf is considered sparse if it
contains fewer than the specified percentage of cases of the maximum leaf
size. After the tree is regrown, the outliers will be placed in the CF tree if
possible. If not, the outliers are discarded.
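The sparse-leaf rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: leaves are represented only by their case counts, and the function name split_sparse_leaves and the 25 percent setting are hypothetical choices, not values taken from the procedure.

```python
def split_sparse_leaves(leaf_sizes, noise_percent):
    """Return (kept, noise) leaf indices.

    A leaf counts as sparse when it holds fewer cases than
    noise_percent percent of the largest leaf's size.
    """
    cutoff = max(leaf_sizes) * noise_percent / 100.0
    kept = [i for i, size in enumerate(leaf_sizes) if size >= cutoff]
    noise = [i for i, size in enumerate(leaf_sizes) if size < cutoff]
    return kept, noise


# Made-up leaf sizes; the largest leaf has 120 cases, so the cutoff is 30.
sizes = [120, 95, 4, 60, 2]
kept, noise = split_sparse_leaves(sizes, noise_percent=25)
print(kept, noise)
```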
If you do not select noise handling and the CF tree fills, it will be regrown using
a larger distance change threshold. After final clustering, values that cannot be
assigned to a cluster are labeled outliers. The outlier cluster is given an
Consult your system administrator for the largest value that you can specify on
your system.
The algorithm may fail to find the correct or specified number of clusters if this
value is too low.
Initial Distance Change Threshold. This is the initial threshold used to grow the
CF tree. If inserting a given case into a leaf of the CF tree would yield tightness
less than the threshold, the leaf is not split. If the tightness exceeds the
threshold, the leaf is split.
Maximum Branches (per leaf node). The maximum number of child nodes that
a leaf node can have.
Maximum Tree Depth. The maximum number of levels that the CF tree can
have.
Maximum Number of Nodes Possible. This indicates the maximum number of
CF tree nodes that could potentially be generated by the procedure, based on
the function (b^(d+1) - 1) / (b - 1), where b is the maximum branches and d is the
maximum tree depth. Be aware that an overly large CF tree can be a drain on
system resources and can adversely affect the performance of the procedure.
At a minimum, each node requires 16 bytes.
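As a worked example, the node count can be computed directly, taking the function to be the geometric series 1 + b + b^2 + ... + b^d = (b^(d+1) - 1) / (b - 1). The sketch below also multiplies by the 16-byte minimum per node to get a lower bound on memory. The function names are hypothetical, and 8 branches with a depth of 3 are example values chosen for illustration.

```python
def max_cf_nodes(max_branches, max_depth):
    """Maximum number of CF tree nodes: (b^(d+1) - 1) / (b - 1)."""
    b, d = max_branches, max_depth
    if b == 1:
        # The closed form divides by zero for b = 1; the series is then d + 1.
        return d + 1
    return (b ** (d + 1) - 1) // (b - 1)


def min_tree_bytes(max_branches, max_depth, bytes_per_node=16):
    """Lower bound on memory, using the stated minimum of 16 bytes per node."""
    return max_cf_nodes(max_branches, max_depth) * bytes_per_node


print(max_cf_nodes(8, 3))    # 585
print(min_tree_bytes(8, 3))  # 9360
```

Because the count grows exponentially with depth, raising the maximum tree depth by one level multiplies the potential tree size by roughly the number of branches.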
Cluster Model Update. This group allows you to import and update a cluster model
generated in a prior analysis. The input file contains the CF tree in XML format. The
model will then be updated with the data in the active file. You must select the
variable names in the main dialog box in the same order in which they were specified
in the prior analysis. The XML file remains unaltered, unless you specifically write the
new model information to the same filename. See the topic TwoStep Cluster Analysis
Output for more information.
Working Data File. This group allows you to save variables to the active dataset.
XML Files. The final cluster model and CF tree are two types of output files that can be
exported in XML format.
Export final model. The final cluster model is exported to the specified file in
XML (PMML) format. You can use this model file to apply the model
information to other data files for scoring purposes. See the topic Scoring
Wizard for more information.
Export CF tree. This option allows you to save the current state of the cluster
tree and update it later using newer data. See TwoStep Cluster Analysis Options
for more information on reading this file.
Using the main views and the various linked views in the Cluster Viewer, you can gain
insight to help you answer these questions.
Who uses clustering?
Clustering techniques are useful in a wide variety of situations, including:
To see information about the cluster model, activate (double-click) the Model Viewer
object in the Viewer.
Cluster Viewer
The Cluster Viewer is made up of two panels, the main view on the left and the linked,
or auxiliary, view on the right. There are two main views:
Model Summary (the default). See the topic Model Summary View for more
information.
Clusters. See the topic Clusters View for more information.
There are four linked/auxiliary views:
Predictor Importance. See the topic Cluster Predictor Importance View for
more information.
Cluster Sizes (the default). See the topic Cluster Sizes View for more
information.
Cell Distribution. See the topic Cell Distribution View for more information.
Cluster Comparison. See the topic Cluster Comparison View for more
information.
Clusters View
The Clusters view contains a cluster-by-features grid that includes cluster names, sizes,
and profiles for each cluster.
The columns in the grid contain the following information:
When you hover your mouse over a cell, the full name/label of the feature and the
importance value for the cell is displayed. Further information may be displayed,
depending on the view and feature type. In the Cluster Centers view, this includes the
cell statistic and the cell value; for example: Mean: 4.32. For categorical features the
cell shows the name of the most frequent (modal) category and its percentage.
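The cell contents described above amount to a simple per-cluster summary. The Python sketch below is an assumption-laden illustration, not the Cluster Viewer's own logic: it treats a list of numbers as a continuous feature and anything else as categorical, and the function name cell_summary and the sample values are made up.

```python
from collections import Counter
from statistics import mean


def cell_summary(values):
    """Mean for a continuous feature; modal category and its share otherwise."""
    if all(isinstance(v, (int, float)) for v in values):
        return f"Mean: {mean(values):.2f}"
    category, count = Counter(values).most_common(1)[0]
    share = 100 * count / len(values)
    return f"Most frequent category: {category} ({share:.1f}%)"


# Made-up values for one cluster.
print(cell_summary([4.1, 4.5, 4.36]))
print(cell_summary(["Automobile", "Truck", "Automobile", "Automobile"]))
```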
Within the Clusters view, you can select various ways to display the cluster
information:
Transpose clusters and features. See the topic Transpose Clusters and Features
for more information.
Sort features. See the topic Sort Features for more information.
Sort clusters. See the topic Sort Clusters for more information.
Select cell contents. See the topic Cell Contents for more information.
The size of the smallest cluster (both a count and percentage of the whole).
The size of the largest cluster (both a count and percentage of the whole).
The ratio of size of the largest cluster to the smallest cluster.
Categorical features are shown as dot plots, where each dot marks the most
frequent (modal) category for a cluster and the size of the dot indicates the
percentage of records in that category.
Continuous features are displayed as boxplots, which show overall medians and
the interquartile ranges.
For continuous features, square point markers and horizontal lines indicate the
median and interquartile range for each cluster.
Each cluster is represented by a different color, shown at the top of the view.
shown by default. Note: This check box is unavailable if no evaluation fields are
available.
Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect
the check box.
Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.
Maximum Number of Categories. Specify the maximum number of categories to
display in charts of categorical features; the default is 20.
Filtering Records
If you want to know more about the cases in a particular cluster or group of clusters,
you can select a subset of records for further analysis based on the selected clusters.
1. Select the clusters in the Clusters view of the Cluster Viewer. To select multiple
clusters, use Ctrl-click.
2. From the menus choose:
Generate > Filter Records...
3. Enter a filter variable name. Records from the selected clusters will receive a
value of 1 for this field. All other records will receive a value of 0 and will be
excluded from subsequent analyses until you change the filter status.
4. Click OK.
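The effect of the filter variable can be sketched in Python as follows. This is only an illustration of the 1/0 coding described in step 3; the function name make_filter_variable and the cluster memberships are hypothetical.

```python
def make_filter_variable(cluster_ids, selected_clusters):
    """Return 1 for records in a selected cluster and 0 for all others."""
    selected = set(selected_clusters)
    return [1 if cluster in selected else 0 for cluster in cluster_ids]


# Made-up cluster membership for six records; clusters 1 and 3 are selected.
cluster_ids = [1, 3, 2, 1, 2, 3]
filter_var = make_filter_variable(cluster_ids, selected_clusters=[1, 3])

# Records with a filter value of 0 are excluded from subsequent analyses.
kept_records = [i for i, flag in enumerate(filter_var) if flag == 1]
print(filter_var, kept_records)
```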
The Viewer contains a Model Viewer object. By activating (double-clicking) this object,
you gain an interactive view of the model. The default main view is the Model
Summary view.
The model summary table indicates that three clusters were found based on
the ten input features (fields) you selected.
The cluster quality chart indicates that the overall model quality is "Fair".
Cluster Distribution
Figure 1. Cluster distribution table
The Cluster Sizes view shows the frequency of each cluster. Hovering over a slice in the
pie chart reveals the number of records assigned to the cluster. 40.8% (62) of the
records were assigned to the first cluster, 25.7% (39) to the second, and 33.6% (51) to
the third.
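The percentages above follow directly from the cluster counts. The short Python sketch below recomputes them, along with the ratio of the largest cluster to the smallest; the function name cluster_size_summary is hypothetical, and the counts are the ones quoted in the text.

```python
def cluster_size_summary(counts):
    """Percentage of records per cluster and the largest-to-smallest size ratio."""
    total = sum(counts.values())
    percents = {c: round(100 * n / total, 1) for c, n in counts.items()}
    ratio = round(max(counts.values()) / min(counts.values()), 2)
    return {"total": total, "percents": percents, "ratio": ratio}


summary = cluster_size_summary({1: 62, 2: 39, 3: 51})
print(summary)
```

The rounded percentages sum to 100.1 rather than 100 because each is rounded separately.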
Cluster Profiles
1. In the main view, select Clusters from the dropdown to display the Clusters
view.
Figure 1. Clusters table
By default, clusters are sorted from left to right by cluster size, so they are currently
ordered 1, 3, 2.
Figure 2. Cluster profiles: cells show cluster centers
The cluster means suggest that the clusters are well separated.
2. The cluster means (for continuous fields) and modes (for categorical fields) are
useful, but only give information about the cluster centers. In order to get a
visualization of the distribution of values for each field by cluster, click on the
Cells show absolute distributions button in the toolbar.
Now you can see, for example, that there is some overlap between clusters 1 and 3 on
curb weight, engine size, and fuel capacity. There is considerably more overlap
between clusters 2 and 3 on these fields, with the difference that the vehicles with the
very highest curb weight and fuel capacity are in cluster 2 (column 3) and the vehicles
with the very highest engine size appear to be in cluster 3 (column 2).
Figure 4. Cluster profiles: cells show absolute distributions
3. To see this information for the evaluation fields, click on the Display button in
the toolbar.
4. Select Evaluation fields.
5. Click OK.
The evaluation fields should now appear in the cluster table.
Figure 5. Cluster profiles for evaluation fields: cells show absolute distributions
The distribution of sales is similar across clusters, except that clusters 1 and 2 have
longer tails than cluster 3 (column 2). There is a fair amount of overlap in the
distributions of 4-year resale value, but clusters 2 and 3 are centered on a higher value
than cluster 1, and cluster 3 has a longer tail than either cluster 1 or 2.
6. For another way to compare clusters, select the cluster numbers (column
headings) in the clusters table by using Ctrl-click.
7. In the auxiliary view, select Cluster Comparison from the dropdown.
Figure 6. Cluster comparison view: first four fields shown
For each categorical field, this shows a dot plot for the modal category of each cluster,
with dot size corresponding to the percentage of records. For continuous fields, this
shows a boxplot for the distribution of values within each cluster overlaid on a boxplot
for the distribution of values overall. These plots generally confirm what you've seen in
the Clusters view. The Cluster Comparison view can be especially helpful when there
are many clusters, and you want to compare only a few of them.
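The numbers behind such a comparison for a continuous field can be sketched in Python: a median and interquartile range per cluster, alongside the same statistics for all records. This is an illustration only. The function names and data are made up, and the quartile method is the default of Python's statistics.quantiles, which may differ from the method the Cluster Viewer uses.

```python
from statistics import median, quantiles


def box_stats(values):
    """Quartiles and median for one group of values."""
    q1, _, q3 = quantiles(values, n=4)
    return {"q1": q1, "median": median(values), "q3": q3}


def compare_to_overall(values_by_cluster):
    """Box statistics for each cluster plus the pooled 'overall' distribution."""
    overall = [v for values in values_by_cluster.values() for v in values]
    result = {"overall": box_stats(overall)}
    for cluster, values in values_by_cluster.items():
        result[cluster] = box_stats(values)
    return result


# Made-up values of one continuous field in two clusters.
stats = compare_to_overall({
    1: [1, 2, 3, 4, 5, 6, 7],
    2: [10, 11, 12, 13, 14, 15, 16],
})
for group, values in stats.items():
    print(group, values)
```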