
PREDICTIVE MODELING BASED ON
CLASSIFICATION AND PATTERN
MATCHING METHODS
by

Wei Wang
B.Sc. Beijing Polytechnic University, 1992

a thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science
in the School of Computing Science

© Wei Wang 1999


SIMON FRASER UNIVERSITY
May 1999

All rights reserved. This work may not be


reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL

Name: Wei Wang


Degree: Master of Science
Title of thesis: Predictive Modeling Based on Classification and Pattern
Matching Methods

Examining Committee: Dr. Brian Funt


Chair

Dr. Jiawei Han, Senior Supervisor

Dr. Tiko Kameda, Supervisor

Dr. Qiang Yang, External Examiner

Date Approved:

Abstract
Predictive modeling, i.e., predicting unknown values of certain attributes of interest
based on the values of other attributes, is a major task in data mining. Predictive
modeling has wide applications, including credit evaluation, sales promotion, financial
forecasting, and market trend analysis.

In this thesis, two predictive modeling methods are proposed. The first is a
classification-based method which integrates attribute-oriented induction with the
ID3 decision tree method. This method extracts prediction rules at multiple levels of
abstraction and handles large data sets and continuous numerical values in a scalable
way. Since the number of distinct values in each attribute is reduced by attribute-
oriented induction, the problem of favoring the attributes with a large number of
distinct values in the original ID3 method is overcome.

The second approach is a pattern matching-based method which integrates statistical
analysis with attribute-oriented induction to predict data values or value distributions
of the attribute of interest based on similar groups of data in the database.
The attributes which strongly influence the values of the attribute of interest are
identified first by the analysis of data relevance or correlation using a statistical
relevance analysis method. Moreover, by allowing users to specify their requests at
different concept levels, the system can perform prediction on user-desired concept
levels to make the result more interesting and suitable to the user's needs.

Both proposed methods are implemented and tested. The performance study and
experiments show that they work efficiently for large databases. Our study concludes
that predictive modeling can be conducted efficiently at multiple levels of abstraction
in large databases and that it is practical for solving some large-scale application problems.

Dedication
To my parents and my wife.

Acknowledgements
I want to thank Dr. Jiawei Han, my senior supervisor, for his guidance, encouragement
and support during my study. Despite his busy schedule, he is always available to give
me advice, support and guidance during the entire period of my study. His insight
and creative ideas are always the inspiration for me during the research.
I would also like to thank Dr. Tiko Kameda for serving on my supervisory com-
mittee. I am very grateful for his advice and support.
My thanks to Dr. Qiang Yang for serving as examiner of this thesis.
I also want to express my gratitude to Dr. Lou Hafer, Dr. Krishnamurti, Mr.
Russ Tront for their advice and support during my TA work. I want to thank Mrs.
Kersti Jaager and other secretaries who are always available for help.
My thanks also go to many of my fellow graduate students who make me feel
like a member of an extended family: Yongjian Fu, Krzysztof Koperski, Liang Yao,
Betty Xia, Jenny Chiang, Osmar Zaiane, Wan Gong, Yijun Lu, Nebojsa Stefanovic,
Micheline Kamber, Jianping Chen, Hongshen Chin, Ye Lu, Paul Tan, Shan Cheng,
Hui Li and Jie Wei.
My deepest gratitude goes to my wife Jing Jing for her love, care and support. I
am also deeply indebted to my parents for their everlasting love and encouragement.

Contents
Approval ii
Abstract iii
Acknowledgements v
List of Tables x
List of Figures xi
1 Introduction 1
1.1 Knowledge Discovery in Databases . . . . . . . . . . . . . . . . . . . 1
1.1.1 What is Knowledge Discovery in Databases . . . . . . . . . . 1
1.1.2 Tasks, Methods and Applications . . . . . . . . . . . . . . . . 3
1.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Predictive Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related Work 8
2.1 Attribute-Oriented Induction . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 What is Attribute-Oriented Induction? . . . . . . . . . . . . . 8
2.1.2 Motivation of Attribute-Oriented Induction . . . . . . . . . . 9
2.1.3 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Concept Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.5 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.1.6 Feasibility of Attribute-Oriented Induction . . . . . . . . . . . 15
2.1.7 Application of the Attribute-Oriented Induction . . . . . . . . 16
2.1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Statistical Approaches . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Machine Learning Approaches . . . . . . . . . . . . . . . . . . 21
2.2.3 Neural Net Approaches . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Summary of Classification Approaches . . . . . . . . . . . . . 25
2.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Statistical Approaches . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Classification-Based Approaches . . . . . . . . . . . . . . . . . 29
2.3.3 Neural Net Approaches . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Summary of Predictive Modeling Approaches . . . . . . . . . 34
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 Classification Based Method 36
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Why Use Classification-Based Method . . . . . . . . . . . . . 41
3.3.2 Why Use Decision Tree-Based Method . . . . . . . . . . . . . 41
3.3.3 Why Use Attribute-Oriented Induction . . . . . . . . . . . . . 42
3.4 Discussion of ID3 Method . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 General Ideas of Our Approach . . . . . . . . . . . . . . . . . . . . . 47
3.6 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.7 Variations of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 52
3.7.1 Using Single Unsplit Generalized Table to Generate Target Sub-
class: Algorithm PC USGS1 . . . . . . . . . . . . . . . . . . . 52
3.7.2 Using Split Generalized Table to Generate Target Subclass: Al-
gorithm PC STGS2 . . . . . . . . . . . . . . . . . . . . . . . . 53
3.7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.8 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Pattern Matching Based Method 61
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Relevance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 General Ideas of our Approach . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7 Variations of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 75
4.7.1 Using Base Table to Generate Contingency Table: Algorithm
PP S2BT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7.2 Using Hash Table to Generate Contingency Table: Algorithm
PP S1HT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.8 Example and Experiment Results . . . . . . . . . . . . . . . . . . . . 82
4.8.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8.2 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . 86
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 Performance Study 89
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Performance Study on Classification-Based Predictive Modeling Method 90
5.2.1 Scale Up Performance . . . . . . . . . . . . . . . . . . . . . . 90
5.2.2 Performance Study on Different Classification Thresholds . . . 91
5.2.3 Performance Study on Different Numbers of Attributes . . . . 92
5.2.4 Performance Study on Different Numbers of the Predictive At-
tribute Values . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.2.5 Performance Study on Different Numbers of the Descriptive At-
tribute Values . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Performance Study on Pattern Matching-Based Method . . . . . . . . 96
5.3.1 Scale Up Performance . . . . . . . . . . . . . . . . . . . . . . 97
5.3.2 Performance Study on Different Support Thresholds . . . . . . 97
5.3.3 Performance Study on Different Child/Parent Ratios . . . . . 99
5.3.4 Performance Study on Different Numbers of Attributes . . . . 100
5.3.5 Performance Study on Different Numbers of Predictive Attribute
Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.6 Performance Study on Different Numbers of Descriptive At-
tribute Values . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Summary of the Performance Study . . . . . . . . . . . . . . . . . . . 104
6 Conclusion and Future Work 105
6.1 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Comparison of the Two Proposed Predictive Modeling Approaches . . 107
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
A Database Generator 111
B Hierarchy Generator 113
Bibliography 115

List of Tables
2.1 An Initial Relation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 A Generalized Relation. . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 An Example Data Set (by Quinlan 1986). . . . . . . . . . . . . . . . 44
3.2 An Initial Relation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 A Generalized Relation. . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 The prime class represented as a generalized feature table. . . . . . . 56
3.5 The information gain for each attribute to classify the prime class. . . 56
3.6 Object distribution in each subclass. . . . . . . . . . . . . . . . . . . 57
3.7 The information gain for each remaining attribute in the two subclasses. 57
3.8 Object distribution in the subclass Salary:High and Salary:Medium. 58
3.9 Object distribution in the remaining subclasses of Salary:High. . . . . 58
3.10 Object distribution in the remaining subclasses of Salary:Medium. . . 59
4.1 Contingency Table for Attributes A and B. . . . . . . . . . . . . . . . 68
4.2 Contingency Table for Occupation and Salary. . . . . . . . . . . . . . 68
4.3 Database Table: SalaryInfo . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 Generalized Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.5 Contingency Table for Attribute Occupation and Salary. . . . . . . . 84
4.6 Another Generalized Data Set . . . . . . . . . . . . . . . . . . . . . . 85
4.7 Prediction Result For the Example . . . . . . . . . . . . . . . . . . . 86
4.8 Prediction Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

List of Figures
1.1 Steps of the KDD Process (Fayyad, et al., 1996). . . . . . . . . . . . . 2
2.1 Concept Hierarchy for Provinces . . . . . . . . . . . . . . . . . . . . . 12
2.2 The Bayesian Classication System. (Wu, et al., 1991) . . . . . . . . 18
2.3 The Classication Scheme that Integrates Heuristic and Bayesian Ap-
proaches. (Wu, et al., 1991) . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 A Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 A Three Layer Feedforward Neural Network. (Lu, et al., 1995) . . . . 23
3.1 A Common Classication System . . . . . . . . . . . . . . . . . . . . 37
3.2 A Decision Tree (by Quinlan 1986). . . . . . . . . . . . . . . . . . . . 45
3.3 A More Complicated Decision Tree (by Quinlan 1986). . . . . . . . . 46
3.4 The decision tree generated based on the query and the determining
attribute "House size". . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.1 Scale-Up Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Performance Study on Different Classification Thresholds . . . . . . 92
5.3 Performance Study on Different Numbers of Attributes . . . . . . . . 93
5.4 Performance Study on Different Numbers of the Predictive Attribute
Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Performance Study on Different Numbers of the Descriptive Attribute
Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.6 Scale-Up Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.7 Performance Study on Different Support Thresholds . . . . . . . . . 98

5.8 Performance Study on Different Child/Parent Ratios . . . . . . . . . 99
5.9 Performance Study on Different Numbers of Attributes . . . . . . . . 100
5.10 Performance Study on Different Numbers of Predictive Attribute Values 102
5.11 Performance Study on Different Numbers of Descriptive Attribute Values 103

Chapter 1
Introduction
1.1 Knowledge Discovery in Databases
1.1.1 What is Knowledge Discovery in Databases
Today, the amount of data stored in databases grows at an amazing speed. Wal-
Mart (a U.S. retailer) has the largest business database in the world, which handles
over 20 million transactions per day [8]. Mobil Oil Corporation is developing a
database which can store more than 100 terabytes of data relevant to oil exploration
[39]. The NASA Earth Observing System (EOS) of orbiting satellites and other space-
borne instruments is able to generate 50 gigabytes of remotely sensed image data
per hour [85]! Obviously, such a huge amount of data is far beyond human capabilities
to analyze using traditional manual methods of data analysis. This gives rise to
a significant need for new techniques and tools to assist humans in intelligently and
automatically analyzing such gigabytes or even terabytes of data to get useful
information. This increasing need gave birth to a new research field called Knowledge
Discovery in Databases (KDD) or Data Mining, which has attracted more and more
attention from researchers in many different fields including databases, artificial
intelligence, machine learning, pattern recognition, statistics, expert systems, and data
visualization.
The definition of knowledge discovery in databases is given by Fayyad, et al.,
as: "Knowledge discovery in databases is the non-trivial process of identifying valid,
novel, potentially useful, and ultimately understandable patterns in data" [23]. In their
opinion, a KDD process usually consists of several steps: data selection, preprocessing,
transformation, data mining, and interpretation/evaluation of the results, as shown
in Figure 1.1 [23]. Data mining is considered only one step of a KDD process.
They think the difference between KDD and data mining is that "KDD refers to the
overall process of discovering useful knowledge from data while data mining refers to
the application of algorithms for extracting patterns from data without the additional
steps of the KDD process" [23]. However, since data mining is the crucial part of the
KDD process, most researchers usually use both terms rather than strictly distinguish
them [80, 3, 35, 44]. In this thesis, the term Data Mining and the term Knowledge
Discovery in Databases are used interchangeably.

[Figure: the KDD process as a flow — Data → (Selection) → Target Data →
(Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data →
(Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge.]

Figure 1.1: Steps of the KDD Process (Fayyad, et al., 1996).
Figure 1.1 presents an overview of the steps comprising a KDD process. At the
low end is the data stored in the databases. A subset of the data (called target
data) is collected in the Selection step according to the data mining request. Noise
removal and missing-data handling are done in the Preprocessing step. Data reduction
and projection are performed in the Transformation step. Searching for interesting
patterns is accomplished in the Data Mining step. The interpretation and evaluation
of the discovered patterns in the final step ensures the derivation of valid, useful
knowledge which should be understandable by people. Although background knowledge
is not shown in this figure, it is very important in the whole KDD process.

1.1.2 Tasks, Methods and Applications


Generally, the goal of data mining in practice is prediction and description. Prediction
is to predict unknown or future values of the attributes of interest using other
attributes in the databases, while description is to find patterns that describe the data
in a manner understandable to humans. These two goals can be further classified into
the following data mining tasks: classification, regression, clustering, summarization,
discrimination, dependency modeling, prediction, and change and deviation detection.

Classification is to classify a data item into one of several predefined classes. Regression
is to map a data item to a real-valued prediction variable. Clustering is to
identify a finite set of categories or clusters to describe the data. Summarization is
to find a concise description for a subset of data. Discrimination is to discover the
features or properties that distinguish one set of data (called the target class) from
other sets of data (called contrasting classes). Dependency modeling is to find a model
which describes significant dependencies between variables. Prediction is to predict
unknown or future data behavior based on previous data patterns. Change and deviation
detection involves discovering significant changes in the data from previously measured
or normative values.
There are many different methodological approaches to data mining, including
machine learning, statistics, database-oriented, etc. Machine learning approaches
include learning from examples [64], conceptual clustering [66], decision tree induction
[71], etc. Mathematical and statistical approaches include Bayesian inference [16, 17]
and rough sets [69]. Database-oriented approaches include attribute-oriented induction
[34, 32], Apriori [1, 4], etc. There are also other approaches, including knowledge
representation approaches [28], visualization and interactive approaches [90, 46], and
neural network approaches [56].
Over the past few years, many KDD systems have been built based on fruitful
research, including Database Marketing for American Express [9], Coverstory
from IRI [78], KEFIR [60, 59] from GTE, SKICAT from JPL/Caltech [22], QUEST
[3, 82, 4] from IBM, DBMiner from SFU [35], IMACS [12, 11], Recon [81], Explora [48],
Spotlight [5], and several others. The wide application and great practical potential
of KDD have been shown by these prototypes, which have produced a lot of promising
experimental results.

1.2 Classification

Classification in the context of data mining is learning a function that maps (classifies)
a data item into one of several predefined classes [38, 86, 61]. The classification task
is to analyze the training data and to develop an accurate description or model for
each class according to the features present in the data. The future test data is then
classified by using the class descriptions, which can also be used to provide a better
understanding of each class in the database.

Since classification is an important task of data mining, much research has been
done based on different methodologies, including Bayesian inference [16, 83], neural
net approaches [56], decision tree methods [62], etc. Among these, the decision
tree-based methods are the most popular in the research; they adopt a machine
learning paradigm and perform classification on the given data by generating a
decision tree. However, since these approaches all perform classification on primitive
data stored in the databases, they inherit the same problems from the decision
tree-based machine learning methods they have adopted, such as difficulties in
handling large amounts of data and continuous numerical values, the tendency to
favor many-valued attributes in the selection of the determinant attribute, etc.
Although some approaches have overcome some of these problems (Mehta, et al.
[62], for example, solved the problem of handling large amounts of data and continuous
numerical values), other problems, such as a bushy tree due to the large size of the data,
remain untouched.
Although classification and prediction are different tasks of data mining, they are
closely related, and classification has actually been adopted by some approaches as a
method to do prediction [58].

Our motivation to study classification is to take advantage of some of the classification
methods developed in past research and extend them for predictive modeling
purposes. For example, from a classification result, we may know the data characteristics
of each class. Then we can predict the future behavior of an object by assigning
it to an existing class based on its particular characteristics.

1.3 Predictive Modeling


Prediction is to foresee or describe a future happening in advance as a result of
knowledge, experience, and reasoning. Prediction plays a very important role in helping
people make decisions. Predictive modeling for knowledge discovery in databases is to
predict unknown or future values of some attributes of interest based on the values
of other attributes in a database. Different methodologies have been used, including
Bayesian inference [50], classification [7, 58], statistical methods [20], and neural nets
[53].

All these approaches are either domain dependent, not capable of handling large
amounts of data, restricted to a single concept level, or quite time-consuming in
the learning phase.
One of our proposed approaches is based on attribute-oriented induction and a popular
machine learning method, ID3. We developed this classification-based method
to perform predictive modeling on the data stored in large databases. The method
can extract predictive modeling rules at different concept levels and makes it possible
for the user to get knowledge at the desired concept levels without the limitations
of the original ID3 method, such as the inability to handle large data sets and
continuous numerical values. We change the termination condition, which makes the
method realistic for large databases. Since the number of distinct values for each
attribute is reduced by attribute-oriented induction, we solve the problem caused by
the tendency of the original ID3 method to favor the attributes with a large number
of distinct values.
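To make the selection criterion concrete, the following sketch computes ID3-style information gain and illustrates the bias just mentioned: a near-unique, many-valued attribute attains maximal gain even though it generalizes poorly. The toy relation, attribute values, and class labels are invented for illustration and are not from the thesis.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    """Entropy reduction obtained by splitting `rows` on attribute index `attr`."""
    n = len(rows)
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

# Toy relation: attribute 0 is ID-like (all values distinct),
# attribute 1 is a coarse two-valued attribute.
rows = [("r1", "high"), ("r2", "high"), ("r3", "high"), ("r4", "low")]
labels = ["yes", "yes", "no", "no"]

print(information_gain(rows, 0, labels))  # 1.0 -- the many-valued attribute wins
print(information_gain(rows, 1, labels))  # ~0.31
```

Because attribute 0 splits every tuple into its own branch, its remainder entropy is zero and its gain is maximal; generalizing such an attribute first (as attribute-oriented induction does) removes this artifact.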
Another approach we proposed is a pattern matching-based method which integrates
a statistical method with attribute-oriented induction to predict data values
or value distributions of the attribute of interest based on similar groups of data in
the database. The attributes which strongly influence the values of the attribute of
interest are identified first by the analysis of data relevance or correlation using a
statistical relevance analysis method. Moreover, by allowing users to specify their
requests at different concept levels, the system can perform prediction on user-desired
concept levels to make the result more interesting and suitable to users' needs. This
approach is domain independent and capable of handling a large volume of data and
multiple concept levels.
Both of our proposed methods are implemented and tested. The performance
study and experiments we did on these two methods show that both methods work
efficiently against large databases.

1.4 Organization of the Thesis


The rest of this thesis is organized as follows. In Chapter 2, attribute-oriented
induction [31] is introduced and a brief summary of related research work on
classification and predictive modeling is given, including different approaches to
classification and predictive modeling, such as machine learning approaches, statistical
approaches, neural net approaches, etc. Several typical data mining systems are also
introduced.

Our classification-based predictive modeling method is presented in Chapter 3,
along with a brief introduction to the ID3 method [73, 74], which is used in this
approach. Examples are given to illustrate the proposed method. The computational
complexity and two variations of the proposed algorithm are also discussed.

In Chapter 4, we propose the pattern matching-based predictive modeling method
and introduce the statistical method used to analyze the relevance between attributes.
We also use examples and experimental results to illustrate this pattern matching-based
method. The computational complexity and two variations of the proposed algorithm
are examined as well.

We present our performance study on both of our proposed methods in Chapter
5, which shows that both methods work efficiently against large databases.

The conclusion of our study is drawn in Chapter 6, in which a comparison of the
two proposed approaches is also given and the major results are summarized. Some
interesting future research problems are discussed as well.
Chapter 2
Related Work
In this chapter, attribute-oriented induction, which is adopted in both of our proposed
predictive modeling approaches, will be introduced first. Since one of our proposed
approaches is based on classification, several approaches for classification will also be
briefly discussed in Section 2.2. Approaches for predictive modeling will be investigated
in Section 2.3. For Sections 2.1, 2.2, and 2.3, a short summary will be given at the
end of each section. The chapter is summarized in Section 2.4.

2.1 Attribute-Oriented Induction


2.1.1 What is Attribute-Oriented Induction?
Attribute-Oriented Induction (AOI) [15, 32] is a set-oriented data mining method. It
generalizes the target data set attribute-by-attribute into a generalized relation RG,
from which it extracts the general features of the target data set.

Attribute-oriented induction integrates a machine learning paradigm [64] with
database operations to extract generalized rules from a data set of interest and discover
high-level data regularities, which include characteristic rules, discriminant rules, and
association rules.

A characteristic rule describes the characteristics of the target data set. A
discriminant rule presents the features which can distinguish one data set from the
other data sets. An association rule reveals associations in the data. An association
rule is of the form "X1 ∧ ... ∧ Xi → Y1 ∧ ... ∧ Yj", which means objects Y1, ..., Yj
tend to appear together with objects X1, ..., Xi in the target data.

Attribute-oriented induction retrieves a target data set as the initial data
relation, performs the induction process attribute-by-attribute on the retrieved data
set, eliminates duplications in generalized tuples (with counts added up), and extracts
generalized relations and rules. This method has been implemented by Han, et al.,
in our knowledge discovery system DBMiner [35] and tested successfully against large
relational databases.
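The generalize-and-merge step just described can be sketched as follows. The tuples and the one-level city-to-province hierarchy are hypothetical; a real implementation would generalize several attributes in turn and respect a generalization threshold.

```python
from collections import Counter

# Hypothetical one-level concept hierarchy for the first attribute (city -> province).
hierarchy = {"Vancouver": "British Columbia", "Victoria": "British Columbia",
             "Toronto": "Ontario", "Ottawa": "Ontario"}

def generalize(tuples, attr):
    """Generalize one attribute via the hierarchy, then merge identical
    generalized tuples, accumulating a count for each."""
    counts = Counter()
    for t in tuples:
        g = list(t)
        g[attr] = hierarchy.get(g[attr], g[attr])  # climb one level if possible
        counts[tuple(g)] += 1
    return counts

data = [("Vancouver", "M.Sc."), ("Victoria", "M.Sc."), ("Toronto", "M.Sc.")]
for row, count in generalize(data, 0).items():
    print(row, count)
# ('British Columbia', 'M.Sc.') 2
# ('Ontario', 'M.Sc.') 1
```

The accumulated counts are what allow the generalized relation to stand in for the original tuples when rules are extracted.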

2.1.2 Motivation of Attribute-Oriented Induction


The motivation to develop attribute-oriented induction comes from the following
reasons:

• Although some data regularities (e.g., association rules) can be discovered and
represented at the primitive concept level (the original concept level of the data
actually stored in the databases) by some data mining methods [70, 1], more
interesting data regularities usually can only be discovered at higher concept
levels. Therefore, in many cases, in order to get useful knowledge, it is
necessary to generalize data at low concept levels to higher levels.

• Because it is difficult for the user to find desirable and useful knowledge in
the large volume of rules generated by costly automatic discovery processes, in
most cases it is desirable to constrain the knowledge discovery process by the
user's request, which describes his/her interest in a particular data set for desired
knowledge.

• Certain background knowledge not only enhances the efficiency of a data mining
process but also represents a user's preference for generalization, and it will guide
an efficient data mining process to discover user-desired knowledge.

2.1.3 Query
A data mining request, including the data set of interest and the type of knowledge
to be discovered, can be specified in a data mining query language (e.g., DMQL [36]),
which uses an SQL-like syntax. The following is an example:

use NSERC94
find characteristic rule for "CS Grants"
from award A, organization O
where O.org_code = A.org_code and A.disc_code = "Computer"
in relevance to province, amount, percentage(count), percentage(amount)

The above query specifies that the desired type of knowledge is "characteristic rule".
It also requires that the data set of interest is drawn from two relations, "award" and
"organization", must satisfy the condition specified in the where clause, and is in
relevance to four attributes: province, amount, percentage(count), and percentage(amount),
where percentage(x) = x/total_x (the ratio of x to the total of x, expressed as a
percentage). The two relations, "award" and "organization", are stored in the database
"NSERC94", which contains the information about 1994-1995 NSERC research grants.

As mentioned in Section 2.1.1, a characteristic rule describes the characteristics of
the target data set. The words in bold are reserved keywords of DMQL.

The process first retrieves the specified data set (target class) by executing an SQL
query according to the condition specified in the where clause. But different from
standard SQL, high-level concepts like "Computer Science", which are not stored
in the database, can also be specified in the query. Thus, before the SQL query is
submitted, a concept hierarchy (a set of mappings from a set of concepts to their higher-
level counterparts) will be consulted to map such high-level concepts (e.g., "Computer
Science" in the above example) to low-level concepts (e.g., different discipline codes
such as 25502, 25634, etc.) which are actually stored in the database.

2.1.4 Concept Hierarchy


A concept hierarchy defines a set of mappings from a set of concepts to their higher-
level counterparts. It represents the necessary background knowledge and reflects
the user's preference for generalization. It is partially ordered in a general-to-specific
order. In a concept hierarchy, the most general concept is usually denoted by the
reserved word "any", and the most specific concepts correspond to the specific data
actually stored in the database.

By using a concept hierarchy, a data mining process is no longer restricted to the
primitive concept level. Instead, it can discover knowledge at different concept levels
and generate more useful results in many cases.

Concept hierarchies can be provided by knowledge engineers or domain experts.
This is reasonable even for large databases, because a concept hierarchy contains only
the distinct attribute values or ranges of numerical values of an attribute, which are
usually not very numerous. In many cases, based on data semantics and data
distribution statistics, concept hierarchies can also be generated automatically [25].
Some concept hierarchies are stored in the database implicitly, such as geo_location
(city, province, country), which can be made explicit by specifying certain attribute
mapping rules. Sometimes, when a given concept hierarchy is not best suited for a
particular data mining task, it needs to be dynamically refined for the desired
knowledge [36].
A concept hierarchy for the attribute "province" is given in Figure 2.1, which is based
on the geographic and administrative regions of Canada. The highest-level concept is
the reserved word "Any", while the lowest-level (primitive-level) concepts correspond
to the distinct values of the attribute "province" stored in the database.
Based on different preferences, different concept hierarchies can be specified on
the same attribute. For instance, different concept hierarchies can be provided for
the attribute "City" based on administrative regions, geographic locations, size of cities,
etc. Different hierarchies on the same attribute can be selected explicitly in a data
mining request for a particular mining task. If no hierarchy is specified, a popularly
referenced hierarchy is used as the default. A lattice or DAG (directed acyclic graph)
can also be used to describe a concept hierarchy.
Canada
  Western:  B.C., Prairies (Alberta, Manitoba, Saskatchewan)
  Central:  Ontario, Quebec
  Maritime: NB, NFL, NS

Figure 2.1: Concept Hierarchy for Provinces
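The hierarchy of Figure 2.1 can be encoded as a child-to-parent map, with generalization performed by climbing toward the root. This is a minimal sketch; the function name is ours.

```python
# The province hierarchy of Figure 2.1 as a child-to-parent map (names
# abbreviated as in the figure); generalize() climbs toward the root "Canada".
PARENT = {
    "B.C.": "Western", "Prairies": "Western",
    "Alberta": "Prairies", "Manitoba": "Prairies", "Saskatchewan": "Prairies",
    "Ontario": "Central", "Quebec": "Central",
    "NB": "Maritime", "NFL": "Maritime", "NS": "Maritime",
    "Western": "Canada", "Central": "Canada", "Maritime": "Canada",
}

def generalize(concept, levels=1):
    """Replace a concept by its ancestor `levels` steps up the hierarchy."""
    for _ in range(levels):
        concept = PARENT.get(concept, concept)  # the root maps to itself
    return concept

print(generalize("Alberta"))      # Prairies
print(generalize("Alberta", 3))   # Canada
```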


2.1.5 Method
Several important concepts are first defined below before we start examining the
attribute-oriented induction method in detail.
- Generalized Relation: a relation in which some or all attribute values are concepts
at non-leaf nodes in the concept hierarchies.
- Initial Relation: a relation retrieved from a database by an SQL query.
- Attribute Threshold: an integer specifying the maximum number of distinct
values for an attribute contained in a generalized relation.
- Desirable Level: a level in a concept hierarchy at which the number of distinct
concepts (values) for an attribute is no more than the corresponding attribute
threshold.
- Prime Level: the lowest possible level in a concept hierarchy at which the number
of distinct concepts (values) for an attribute is no more than the corresponding
attribute threshold.
- Prime Relation of an initial relation (RI): the generalized relation (RG) of RI
in which every attribute is at the prime level.
- Generalization Pair: a pair consisting of a child concept (Cc) of an attribute in a
concept hierarchy and its corresponding parent concept (Cp), indicating that (Cc)
will be replaced by (Cp) in a generalization process.
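The prime level can be found by repeatedly generalizing an attribute's values until the number of distinct concepts falls within the attribute threshold. The following is a minimal sketch; the function and the small hierarchy in the example are illustrative.

```python
def prime_level_values(values, parent, threshold):
    """Generalize a column upward until its number of distinct concepts is
    no more than `threshold` (the prime level); return the mapping used."""
    mapping = {v: v for v in set(values)}
    while len(set(mapping.values())) > threshold:
        # replace every current concept by its parent (one level up)
        new = {v: parent.get(c, c) for v, c in mapping.items()}
        if new == mapping:          # reached the root everywhere; give up
            break
        mapping = new
    return mapping

# Example: generalize cities until at most 2 distinct concepts remain
# (hierarchy and threshold are illustrative).
parent = {"Vancouver": "B.C.", "Victoria": "B.C.", "Toronto": "Ontario",
          "B.C.": "Canada", "Ontario": "Canada"}
print(prime_level_values(["Vancouver", "Victoria", "Toronto"], parent, 2))
```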
Table 2.2 illustrates a generalized relation R' of the initial relation R which is
shown in Table 2.1.
It is desirable for every attribute in a generalized relation to have only a small
number of distinct values. This is achieved by specifying appropriate attribute
thresholds. The attribute threshold for each attribute can be adjusted by users, because
they may know better how many distinct values will be appropriate for each attribute in
order to get the desired knowledge. In addition to threshold control, users can also
explicitly specify a level in the concept hierarchy of an attribute as the desired level
to which the attribute should be generalized.
Suppose the initial relation is retrieved from a student information database of a
university, which contains the records shown in Table 2.1. During attribute-oriented
induction, the attribute Name is removed because there is no concept hierarchy
associated with it. The attribute Gender is not generalized because it has only two
values. The attribute Age is generalized from numerical values into intervals. The
attribute Birth place is generalized from cities to the corresponding countries. The
attribute Department is generalized by replacing the primitive level concepts, such as
\Computer Science", with the prime level concepts, such as \Applied Science". The
attribute GPA is generalized from continuous numerical values into high level concepts
such as \high, low, medium". The generalized relation generated by attribute-oriented
induction is shown in Table 2.2. The attribute Count is inserted into the generalized
relation during attribute-oriented induction which indicates the number of original
records covered by a generalized record.
Using the example query given in Section 2.1.3, we will take mining characteristic
rules as an example to illustrate attribute-oriented induction, which is performed
in the steps described below [33]:
Name       Gender  Age  Birth place  Department        GPA
C. Smith   female  18   Montreal     Computer Science  3.92
M. Jordan  male    20   Chicago      Engineering       2.36
G. Tong    male    26   Beijing      Math              3.25
...        ...     ...  ...          ...               ...

Table 2.1: An Initial Relation.

Gender  Age    Birth place  Department       GPA     Count
female  16-20  Canada       Applied Science  high    40
male    20-25  U.S.A.       Applied Science  low     15
male    25-30  China        Science          medium  10
...     ...    ...          ...              ...     ...

Table 2.2: A Generalized Relation.

Step 1 Initial data collection: the data mining request specified in DMQL is converted
into an SQL query, with high-level concepts replaced by their corresponding
primitive-level concepts by consulting the concept hierarchies. This SQL query is
then executed to collect the task-relevant data set into the initial relation.
Step 2 Derivation of the generalization scheme for each attribute: in the initial relation,
if the number of distinct values for an attribute is too large, the attribute will
be generalized by either attribute removal or generalization. The former is
performed when there is no concept hierarchy on the attribute, or its higher-level
concepts are expressed in the values of another attribute. The latter is performed
otherwise, by first determining the prime-level concepts for each attribute, and
then examining them together with the data in the initial relation to form
generalization pairs.
Step 3 Extraction of the prime relation: perform attribute-oriented generalization
(attribute by attribute) by substituting the primitive-level concepts of an attribute
in the initial relation with its corresponding prime-level concepts, using the
generalization pairs generated in Step 2. A prime relation is generated after the
generalization by eliminating duplicated records and accumulating the counts
accordingly in the retained records.
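Step 3 can be sketched as follows, with one generalization map per attribute and a Count column accumulated over merged duplicates. The function name and the example data are illustrative, not from the thesis.

```python
from collections import Counter

def prime_relation(tuples, generalizers):
    """Step 3 sketch: generalize each attribute of each tuple with its
    generalization map, merge duplicates, and accumulate a Count column."""
    counts = Counter()
    for t in tuples:
        g = tuple(gen.get(v, v) for v, gen in zip(t, generalizers))
        counts[g] += 1
    return [row + (n,) for row, n in counts.items()]

# Illustrative rows and generalization maps (Gender is left ungeneralized).
rows = [("female", "Montreal"), ("male", "Chicago"), ("female", "Toronto")]
maps = [{}, {"Montreal": "Canada", "Toronto": "Canada", "Chicago": "U.S.A."}]
print(sorted(prime_relation(rows, maps)))
# [('female', 'Canada', 2), ('male', 'U.S.A.', 1)]
```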
The above method integrates relational database operations with attribute-oriented
generalization. In Step 1, a relational database query is executed, and its optimization
depends on the well-developed relational database technology. Suppose the initial
relation R and the derived prime relation R' contain m and n records respectively.
Step 2 involves one scan (or less if a sampling technique is used) of R, and the worst-
case time complexity is O(m). Step 3 scans R, generalizes this relation attribute by
attribute, and inserts generalized records into R', which takes O(m log n) time if the
records in R' are ordered and a binary or tree-based search is used.
Knowledge expressed in rules or general data distributions can then be derived
from the prime relation, using statistics or machine learning techniques [34, 65, 89].

2.1.6 Feasibility of Attribute-Oriented Induction


Attribute-oriented induction (AOI) is more efficient than tuple-oriented induction
(TOI) in the induction process. AOI generalizes the initial relation on each attribute
uniformly for all the tuples to generate the prime relation. TOI examines different
possible combinations for a large number of tuples in the same generalization phase.
It is not efficient to investigate different possible generalization paths for different
tuples at an early generalization stage because these combinations will be merged in
further generalization.
With the incorporation of statistical information (like count) and the generation of
disjunctive rules, attribute-oriented induction is quite robust and capable of handling
noise. Furthermore, with the availability of count information, incremental learning
[25], data sampling [47] and parallelism can also be exploited in knowledge discovery
[34].
As a generalization-based method, attribute-oriented induction can discover knowledge
rules at general concept levels. It may not be suitable for applications such as
discovering association (at the primitive level) or dependency rules which require the
discovery of knowledge at primitive concept levels. Moreover, concept hierarchies have
to be provided for attribute-oriented induction to generalize non-numerical attributes.
Generally, this method can be used in most database-oriented applications in which
generalization of some or all of the relevant attributes is necessary.

2.1.7 Application of the Attribute-Oriented Induction


Based on attribute-oriented induction, Han, et al., have built a knowledge discovery
prototype, DBMiner [35], which can successfully find characteristic, association,
discriminant, classification and prediction rules from large relational databases. More
modules, such as evolution and deviation rule finders, which are all based on the
attribute-oriented induction technique, will be added to DBMiner in the near future
to enhance its functionality.
The attribute-oriented induction technique is also used by other researchers in
different applications.
Dao, et al. [21], use attribute-oriented induction to mine characteristic and
classification rules for individual attributes from heterogeneous databases. A heterogeneous
database (multi-database/federated database) is a collection of cooperating but
autonomous component database systems which are integrated to various degrees.
However, due to the differences in DBMSs and in data semantics, the component
databases are usually heterogeneous. Semantic heterogeneity is the most challenging
problem in heterogeneous databases. One important step in identifying and
resolving semantic heterogeneity is to determine equivalent attributes between
component databases, which Dao and his associates try to do using attribute-oriented
induction. In their approach, each mining request is conditioned on a subset of
attributes identified as "common" between the multiple databases. The researchers
developed a method to compare the discovered rules for two or more attributes from
different databases and use the similarity between the rules as a basis to suggest
similarity between attributes. As a result, they use relationships among the entire sets
of attributes from multiple databases to drive the schema integration process.
2.1.8 Summary
In brief, attribute-oriented induction is a set-oriented, generalization-based data
mining method which is efficient, robust and has wide applications. It can also be
extended to knowledge discovery in other kinds of databases, such as object-oriented
[37], deductive, and spatial databases [51]. However, attribute-oriented induction
is a generalization-based method which requires the availability of some background
knowledge, such as concept hierarchies. Therefore, it may not be suitable for
knowledge mining which is not based on generalization or background knowledge.

2.2 Classification
Classification is a data mining process which groups objects with common properties
into several predefined classes and produces a classification scheme over a set of data
objects. Since classification is fundamental to research in many fields in the social and
natural sciences, it has been extensively studied in statistics, machine learning and
neural network research [72, 55, 13, 10, 76].
One term which often causes confusion is clustering. Clustering is a form of
unsupervised learning that partitions objects into classes or clusters (collectively called
a clustering). Clustering differs from classification in that it classifies data into
classes which are not predefined and have to be found in the process, according to
some class similarity measurements. Although clustering is also an important task in
data mining and a lot of research has been done on it [24, 26, 66, 16, 6, 19, 54, 88, 18],
since one of our approaches is classification-based, the survey in this thesis is mainly
focused on classification instead of clustering.
There are many classification methods developed in different research works, which
can be classified into three groups: statistical methods, machine learning methods and
neural net methods.

2.2.1 Statistical Approaches


In data mining, one major statistical method for classification is Bayesian inference.
The traditional Bayes classifier (illustrated in Figure 2.2) can be described as follows.
Given the problem of classifying a set of patterns ai, i = 1, ..., n, each pattern is
perceived in terms of a measurement vector Fi obtained by a sensing device which is
capable of capturing its features. Each pattern ai can be associated with a classification
set Ci which includes all the possible classes that can be assigned to pattern ai.
In general, the classification sets can be distinct, but for simplicity, all classification
sets are considered identical. Thus, each pattern belongs to one of m possible classes
Ci, i = 1, ..., m. Also for simplicity, we only consider the case in which the same
feature measurements are made for each pattern.

[Figure: patterns a_j, j = 1, ..., n, undergo feature extraction to measurement
vectors F_j; likelihood functions L_1, ..., L_m are evaluated, and a maximum
selector outputs the decision.]

Figure 2.2: The Bayesian Classification System. (Wu, et al., 1991)


In order to classify a pattern into one of the m classes, a feature hyperspace is
constructed based on the measurement vector F, which can be considered as
measurements of true feature values corrupted by random noise. The multi-class
conditional probability density functions estimated from a learning data set represent
the uncertainty in discovered knowledge:

p[F|Ci], i = 1, ..., m.

Using the Bayes formula, the computation of the a posteriori probability results in:

P[Ci|F] = Li = p[F|Ci] P[Ci] / p[F], i = 1, ..., m.

Subsequently, statistical decision criteria can then be applied for classification. In
most cases, in order to obtain the theoretically optimal solution, the Bayesian
minimum error decision rule (known as maximum likelihood classification) is applied,
which is accomplished by minimizing wrong classification (error): a pattern is
classified into the class i with the highest likelihood or a posteriori probability:

Li = max_j {Lj}, j = 1, ..., m.
However, in practice, because of the difficulties in obtaining a reliable estimation
of multi-class and multivariate conditional probability density functions, which affect
the effectiveness of the traditional statistical method, some simplifications, such as
normal distribution and the statistical independence of each feature, are often made
in order to reduce the time and space complexity of the likelihood computation.
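Under these simplifications, the minimum-error rule reduces to picking the class that maximizes p[F|Ci]P[Ci]. The following is a minimal sketch assuming independent Gaussian features; the class parameters are illustrative, not from any real data set.

```python
import math

# Maximum-likelihood (minimum-error) decision rule under the usual
# simplifications: normally distributed, statistically independent features.
classes = {
    "C1": {"prior": 0.5, "mean": [0.0, 0.0], "var": [1.0, 1.0]},
    "C2": {"prior": 0.5, "mean": [3.0, 3.0], "var": [1.0, 1.0]},
}

def log_likelihood(f, c):
    """log( p[F|Ci] P[Ci] ) assuming independent Gaussian features."""
    ll = math.log(c["prior"])
    for x, m, v in zip(f, c["mean"], c["var"]):
        ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return ll

def classify(f):
    """Pick the class with the highest (log) likelihood."""
    return max(classes, key=lambda name: log_likelihood(f, classes[name]))

print(classify([0.2, -0.1]))  # C1
print(classify([2.8, 3.1]))   # C2
```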
One approach [87] using the Bayesian method aims at solving a class of pattern-
classification problems characterized by hierarchical structures of the hypothesis
space and domain-context dependency. Such problems are often encountered in
biomedical fields. The proposed approach is essentially an integration of a traditional
Bayesian classification system and a domain knowledge-based component, which is
illustrated in Figure 2.3.
To classify a pattern into one of the predefined classes, instead of globally applying
the Bayesian classification procedure to each pattern with respect to all the
classes in the classification set (an inefficient exhaustive search approach), one
constructs a classification tree using the hierarchical structures of the hypothesis space.
The structure of the classification tree can differ depending on the problem. For
many pattern-classification problems, expert knowledge which exists in the form of
domain heuristics and contextual constraints can be incorporated to guide search in
the classification tree, to prune inconsistent branches and to avoid exhaustive search.
In some cases where no heuristic rule is available due to overwhelming uncertainties
or ambiguities, the conventional Bayesian procedure can be applied instead, so
that a search can be performed by alternately applying heuristic and Bayesian
classification rules until an end node (singleton classification) is reached. By combining
both decision rules in the classification procedure, the global optimization using
traditional Bayesian classification, which can be inconsistent with the domain
constraints, is replaced by domain consistency and local optimization.

[Figure: a knowledge-based component (scope and context evaluation, heuristic
rules) combined with Bayesian classification; each pattern loops through heuristic
or Bayesian classification until a singleton decision is reached.]

Figure 2.3: The Classification Scheme that Integrates Heuristic and Bayesian
Approaches. (Wu, et al., 1991)
Wu, et al.'s algorithm illustrated in Figure 2.3 is summarized as follows. By
applying domain-contextual constraints and heuristic rules, domain knowledge is used
to guide the search in the tree. First, the classification context and the search scope
are evaluated. After the search scope is determined, depending on the availability
of heuristic rules, a pattern is classified by either a heuristic rule or the Bayesian
procedure according to its measured features. After a classification search has been
conducted, the pattern is checked to see if it has reached a leaf of the tree (can
be categorized into a single class). If it has, the decision is made, and the system
proceeds to the next unclassified pattern. If not, its classification context and scope
are evaluated again in step 1, and steps 2 and 3 are repeated. The whole procedure
terminates when the classification of each pattern has been completed.
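The alternating heuristic/Bayesian search can be sketched as follows. The `Node` class, the toy tree, and both decision callables are assumptions for illustration, not Wu et al.'s implementation.

```python
class Node:
    """A node of the classification tree; a leaf carries a class label."""
    def __init__(self, label=None, children=None):
        self.label, self.children = label, children or {}
    def is_leaf(self):
        return not self.children

def classify_pattern(features, node, heuristic, bayes_step):
    """Alternate heuristic and Bayesian classification until a leaf
    (singleton class) is reached; `heuristic` returns a branch key or None
    when no heuristic rule is available at the current node."""
    while not node.is_leaf():
        key = heuristic(node, features)
        if key is None:                       # no heuristic rule available
            key = bayes_step(node, features)  # fall back to Bayes decision
        node = node.children[key]
    return node.label

# Tiny hypothetical tree: a heuristic handles the root, Bayes the rest.
tree = Node(children={"left": Node(children={"a": Node("C1"),
                                             "b": Node("C2")}),
                      "right": Node("C3")})
heuristic = lambda n, f: ("left" if f[0] < 1 else "right") if n is tree else None
bayes = lambda n, f: "a" if f[1] < 0 else "b"
print(classify_pattern([0.5, -2.0], tree, heuristic, bayes))  # C1
```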
2.2.2 Machine Learning Approaches


The decision tree-based methods from machine learning are the most popular methods
in data mining research; they perform classification on a data set by generating a
decision tree. Each node in the tree represents a single test or decision, while the
outgoing branches of a node correspond to all the possible outcomes of the test at
the node. According to these outcomes, the set of objects is partitioned into subsets
which are assigned to the child nodes accordingly. Figure 2.4 illustrates an
example of a decision tree. The attribute Occupation has two values, "White Collar" and
"Blue Collar". The attribute Experience has three values, "over 10 years", "5-10 years",
and "less than 5 years". The attribute Salary also has three values, "high", "medium" and
"low". This decision tree classifies objects into three classes, salary "high", "medium"
and "low", based on Occupation and Experience.

Occupation = White Collar:
  Experience over 10 years     -> Salary high
  Experience 5-10 years        -> Salary medium
  Experience less than 5 years -> Salary low
Occupation = Blue Collar:
  Experience over 10 years     -> Salary medium
  Experience 5-10 years        -> Salary medium
  Experience less than 5 years -> Salary low

Figure 2.4: A Decision Tree
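Decision tree construction in the ID3 family selects, at each node, the attribute with the highest information gain. The following is a minimal sketch; the toy data is in the spirit of Figure 2.4, not taken from it.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Information gain of splitting on `attr` (ID3's selection measure)."""
    n = len(rows)
    split = {}
    for r, y in zip(rows, labels):
        split.setdefault(r[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

# Toy data (illustrative values loosely following Figure 2.4).
rows = [{"occ": "white", "exp": ">10"}, {"occ": "white", "exp": "<5"},
        {"occ": "blue", "exp": ">10"}, {"occ": "blue", "exp": "<5"}]
labels = ["high", "medium", "medium", "low"]
print(round(info_gain(rows, "exp", labels), 3))  # 0.5
```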


One approach using the decision tree method in data mining was proposed by
Mehta, et al. [62]. Their classification system SLIQ is a decision tree classifier that
can handle both numerical and categorical attributes. It uses a pre-sorting technique
in the tree-growth phase. This sorting procedure is integrated with a breadth-first
tree-growing strategy to enable classification of disk-resident data sets. In addition,
SLIQ uses a fast subsetting algorithm to determine splits for categorical attributes.
SLIQ also uses a new tree-pruning algorithm based on the Minimum Description Length
principle, which is efficient and produces compact trees. The combination of these
techniques enables SLIQ to scale to large data sets and to classify data sets regardless
of the number of classes, attributes, and records.
For numerical attributes, because the sorting time is a major factor when finding
the best split at a decision tree node, SLIQ sorts the training data just once for each
numerical attribute at the beginning of the tree-growth phase, instead of sorting the
data at each node of the decision tree.
Their experimental results show that SLIQ achieves good scalability and performs
well for large data sets.
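The benefit of sorting a numerical attribute once can be sketched as follows: with the values in sorted order, every candidate threshold is evaluated in a single scan. For simplicity this sketch scores splits by misclassification count, whereas SLIQ itself uses the gini index; the function name is ours.

```python
def best_numeric_split(values, labels):
    """Sort a numeric attribute once, then scan the sorted order evaluating
    each candidate threshold; return (threshold, error count)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = (None, len(labels))
    for k in range(1, len(order)):
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        # errors if each side predicts its majority class
        err = (len(left) - max(map(left.count, set(left))) +
               len(right) - max(map(right.count, set(right))))
        thresh = (values[order[k - 1]] + values[order[k]]) / 2
        if err < best[1]:
            best = (thresh, err)
    return best

print(best_numeric_split([1, 2, 8, 9], ["low", "low", "high", "high"]))
# (5.0, 0)
```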
Another classification approach using the decision tree method was proposed by
IBM Tokyo Lab [27]. They use an information gain-based decision tree method and
associations between attributes to perform the classification.
Since the decision tree method has been extensively used in classification, several
approaches have focused mainly on improving this method to enhance
its robustness [45], reduce the misclassification error [63], test its effectiveness [68]
and extend its range of application [57].
Further details of the decision tree-based method will be discussed in Chapter 3.

2.2.3 Neural Net Approaches


The classification approach mainly based on neural nets is called the connectionist
approach. In general, it generates a lower classification error rate than the decision
tree methods but requires longer training time [75, 77, 79]. The general impression
about the connectionist approach among researchers is that it is not well suited for
data mining. The major criticisms can be summarized as follows.
In order to obtain high classification accuracy, neural networks have to learn
the classification rules through multiple passes over the training data, which usually
results in very long learning (training) times.
The classification rules generated by neural networks are very difficult to express,
because a neural network is usually a layered graph with the output of one
node feeding into one or many other nodes in the next layer, which results in the
classification rules being buried in both the structure of the graph and the weights
assigned to the links between the nodes. For the same reason, it is also difficult to
apply available domain knowledge to a neural network.
Among the above-described disadvantages of the connectionist approach, the
expression problem is one of the major hurdles which needs to be solved in order to
adopt the technique in data mining, because it would be rather difficult to verify or
interpret classification rules without an explicit representation of them. One approach
by Lu and his associates [56] made a good attempt at solving this problem.
In their approach, they use a three-layer neural network to perform classification,
which extracts rules similar to those generated by the symbolic methods (like C4.5
[74]). Lu's approach is summarized as follows.

[Figure: a feedforward network with an input layer, a hidden layer and an output
layer.]

Figure 2.5: A Three-Layer Feedforward Neural Network. (Lu, et al., 1995)
Artificial neural networks consist of many simple computational elements (neurons)
which are densely interconnected. There are many network topologies [40], among
which the multi-layer perceptron is especially useful for classification purposes.
Figure 2.5 shows a three-layer feedforward network which has an input layer, a hidden
layer and an output layer. A node (neuron) in the network has a number of inputs
and a single output. For instance, neuron Ni has x1, x2, ..., xn as its inputs and ai
as its output. A link in the network is associated with a weight. The input links of
Ni have weights w1i, w2i, ..., wni. A node computes its output (the activation value) by
summing up its weighted inputs, subtracting a threshold, and passing the result to
a non-linear function f (the activation function). Outputs from neurons in one layer
are fed as inputs to neurons in the next layer. In this way, when an input tuple is
applied to the input layer, an output tuple is obtained at the output layer. For a well-
trained network which serves as a classifier, when a tuple (x1, x2, ..., xn) is applied
to the input layer of the network, an output tuple (c1, c2, ..., cm) should be obtained,
where ci has value 1 if the input tuple belongs to class Ci and 0 otherwise.
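The forward pass described above can be sketched as follows. The sigmoid activation and the tiny weight matrices are illustrative assumptions, not Lu et al.'s trained network.

```python
import math

def forward(x, weights, thresholds):
    """One layer: each neuron sums its weighted inputs, subtracts its
    threshold, and passes the result through a sigmoid activation f."""
    f = lambda s: 1.0 / (1.0 + math.exp(-s))
    return [f(sum(w * xi for w, xi in zip(ws, x)) - t)
            for ws, t in zip(weights, thresholds)]

def classify(x, net):
    """Three-layer feedforward pass (input -> hidden -> output); the class
    is the index of the largest output value."""
    h = forward(x, *net["hidden"])
    o = forward(h, *net["output"])
    return o.index(max(o))

# Illustrative 2-2-2 network: each (weights, thresholds) pair is one layer.
net = {"hidden": ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
       "output": ([[2.0, -2.0], [-2.0, 2.0]], [0.0, 0.0])}
print(classify([1.0, 0.0], net))  # 0
```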
Their approach, which uses neural networks to mine classification rules, consists of
the following steps.
In the first step, a three-layer neural network is trained in order to find the best
set of weights which allow the network to classify input tuples with a satisfactory
level of accuracy. After an initial set of weights is selected randomly in the interval
[-1, 1], these weights are updated using information involving the gradient of an error
function. This training phase terminates when the norm of the gradient of the error
function falls below a pre-specified value.
Since the network generated from the training phase is fully connected and may
have too many links and sometimes too many nodes, it is impossible to extract from
such a network concise rules which are meaningful to users and can be used to form
database queries. The second step is to remove redundant links and nodes without
increasing the classification error rate of the network, which makes it possible to
extract concise and comprehensible classification rules.
In the final step, classification rules are extracted from the pruned network. These
rules are of the form "if (a1 θ1 v1) and (a2 θ2 v2) and ... and (an θn vn) then Cj",
where the ai's are the attribute values of an input tuple, the vi's are constants, the
θi's are relational operators (=, ≤, ≥, <, >), and Cj is one of the class labels.
This approach was implemented in their NeuroRule system, which has produced
some interesting results. However, despite the fact that the speed of network training
was improved by the fast algorithms they designed, the time required by NeuroRule
is still longer than the time needed by a symbolic approach such as C4.5 [74].

2.2.4 Summary of Classification Approaches


Several classification approaches were examined in this section. Although in theory
the Bayesian method can provide the optimal error-minimizing solution to classification
problems, it is not feasible to apply it directly in practice because, for real
problems, its demand for complete probability data for all statistical dependencies among
variables is unrealistic. To solve real-world problems, different assumptions (such
as normal distribution, statistical independence of each feature, etc.) are often
made in Bayesian-based approaches to estimate the conditional probabilities. These
approaches can be considered as approximations to the Bayesian method, which may
work well when the adopted assumptions are true but may work poorly when they
are not.
Since decision trees and rules using univariate splits have a simple representational
form, the result is relatively easy for the user to understand. The decision tree-
based methods are also rather quick [86]. However, the functional form, and thus the
approximation power, of the model may be significantly restricted due to the restriction
to a particular tree or rule representation. In machine learning and applied statistics,
there are a large number of decision tree and rule induction algorithms [14, 74] which,
to a large extent, are based on likelihood-based model evaluation methods. In terms
of the complexity of the penalizing model, these algorithms have various degrees of
sophistication. Besides classification, trees and rules are also used for regression,
predictive modeling [7, 22] and summary descriptive modeling [2].
Neural nets in general are very powerful, but the time cost of the training phase
is unrealistically high. However, for most applications, if sufficient time is allowed for
convergence, the accuracy of training is usually very good. An appropriately complex
model can be found using the train-and-test technique, which requires the investigation
of networks of many different sizes. However, the number of possibilities that can
be examined is restricted by available computing resources. The challenge which
researchers are facing in this area is to develop faster training methods and faster and
effective techniques to find the best complexity fit.
With the tremendous growth of data stored in large databases, it becomes
imperative to develop efficient methods to perform classification on such huge amounts of
data. Although data in a relational database are usually well-formatted and modeled
by semantic data models [43], the contents of the data may not be classified. For
instance, a biological database may store a large amount of experimental data in a
relational format, but it is still necessary to classify the data in order to determine
the intrinsic regularity of the data. Since schemas and data formats obviously are
not equivalent to conceptual classes, the classification of data in a database is still
necessary for us to discover the patterns and regularities hidden behind the data and
present them in an understandable manner. Moreover, classification also plays an
important role in many predictive modeling approaches, which will be discussed later
in Section 2.3.

2.3 Prediction
Predictive modeling for knowledge discovery in databases is to predict unknown or
future values of some attributes of interest based on the values of other attributes
in a database. Different approaches have been taken in predictive modeling, based on
different methodologies. The major methods used in current research include
statistics-, classification-, and neural net-based methods.

2.3.1 Statistical Approaches


The linear regression method is one of the statistical methods used in predictive
modeling, in which a linear regression model is usually constructed based on previous data
in order to predict future data. The challenge is that the model should be
constructed in a way that best explains the previous data without over-emphasizing the
past, which may result in unreliable estimates of the future. In these linear
regression-based approaches, in order to know how well the constructed linear regression model
can predict the future, the existing data is often divided into a training set and a
validation set. The training set is used to construct the model, while the validation
set is used to test the prediction error.
One approach adopts the linear regression method [20] to solve the problem of
performing prediction on financial data with a small number of data points and high
dimensionality, for which classical economic forecasting techniques do not work.
This approach aims at solving a particular financial problem: predicting the
direction of the interest rate spread, which is the difference between the lending interest
rate (charged by financial institutions such as banks and credit card companies to
their customers) and the borrowing interest rate at which the financial institutions
can borrow. It is very important for financial institutions to be able to reliably
predict the interest spread so that they can maximize their profits by hedging (buying
insurance against a future spread decrease).
In adopting this approach, the researchers used a number of techniques which
trade off training error against model complexity to reduce dimensionality. They
also minimized the mean squared error of prediction and confirmed statistical validity
using bootstrap techniques [86], to predict whether the spread will increase or decrease
in the foreseeable future.
The details of this approach are summarized as follows.
Standard ordinary least squares regression minimizes the residual sum of squares
(RSS). Consider the following two equations, where (x_j, y_j) and (x*_j, y*_j) are training
and validation points respectively:

    RSS = sum_{j=1}^{n} (y_j - g(x_j))^2

    MSEP = (1/N) sum_{j=1}^{N} (y*_j - g(x*_j))^2

The researchers use g(x) to minimize RSS (g(x_j) is the output for training point x_j,
and x_j may be multi-dimensional; y_j is the observed value of y; g(x*_j) is the output for
validation point x*_j). The second equation is the validation error, or the mean square
error of prediction (MSEP). Although the summations look similar in the two equations,
the point (x*_j, y*_j) in the second equation is from the validation set, which was not used
in constructing g(x).
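The training/validation split and the two error measures above can be sketched as follows. The data here is synthetic and purely illustrative, not the financial series used in [20]; g(x) is fitted by ordinary least squares on the training set only.

```python
import numpy as np

# Illustrative training and validation data (not the paper's actual series).
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=(20, 2))        # 20 points, 2 explanatory variables
y_train = x_train @ np.array([1.5, -0.5]) + rng.normal(0, 0.1, 20)
x_valid = rng.uniform(0, 10, size=(10, 2))
y_valid = x_valid @ np.array([1.5, -0.5]) + rng.normal(0, 0.1, 10)

# Fit g(x) by ordinary least squares on the training set only.
X = np.column_stack([x_train, np.ones(len(x_train))])   # add an intercept column
beta, *_ = np.linalg.lstsq(X, y_train, rcond=None)

def g(x):
    return np.column_stack([x, np.ones(len(x))]) @ beta

# RSS is measured on the training points; MSEP on the held-out validation points.
rss = np.sum((y_train - g(x_train)) ** 2)
msep = np.mean((y_valid - g(x_valid)) ** 2)
```

Because the validation points play no role in fitting beta, MSEP gives an honest estimate of how well g(x) generalizes, which RSS alone cannot.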
The complexity of the model is another important issue which the researchers have
to deal with. In order to achieve optimal results, they have to select an appropriate
complexity for the model so that it is neither too simple nor too complicated. Too
simple a model may result in higher training and validation errors (underfitting), due to
the lack of enough free parameters to model the irregularities of the training set. Too
complicated a model may result in overfitting: the training error declines toward zero
as complexity grows, and the validation error declines as well at first, but beyond a
certain complexity the validation error begins to rise again.
"Structural risk minimization" or "capacity control" is the term often used for the
process of finding the optimal model complexity for a given training set [84].
One of the big problems the researchers encountered was to construct a linear
regression model properly, based on existing data with a large number of explanatory
variables but a relatively small number of samples, to predict the future spread while
avoiding overfitting of the samples. Furthermore, success in predicting the spread
depends on being able to predict the values of the explanatory variables. Their goal
is to find the best set of p explanatory variables which predicts the spread. They add
one variable at a time to the linear model until the mean square error of prediction
(MSEP) goes through its minimum. The main challenge of this approach is that
the best set of p variables is not necessarily a subset of the best set of (p+1) variables.
In order to find the minimum validation error, one must exhaustively search
C(k, p) combinations for each p (with k total variables), for a total of
2^k - 1 combinations over all p. Because of the substantial cost of hedging, the
researchers believe that the benefits are worth the use of computing resources even if
the exhaustive search takes days to complete.
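The exhaustive search over all 2^k - 1 non-empty variable subsets can be sketched as follows, on synthetic data with a deliberately small k. The subset chosen is the one minimizing validation MSEP; the data and coefficients are invented for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
k = 6                                            # total candidate explanatory variables
X_train = rng.normal(size=(40, k))
X_valid = rng.normal(size=(20, k))
true_beta = np.array([2.0, -1.0, 0, 0, 0, 0])    # only the first two variables matter
y_train = X_train @ true_beta + rng.normal(0, 0.1, 40)
y_valid = X_valid @ true_beta + rng.normal(0, 0.1, 20)

def msep_of(subset):
    """Fit OLS on the training set using only `subset`, score on validation."""
    idx = list(subset)
    A = np.column_stack([X_train[:, idx], np.ones(len(X_train))])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.column_stack([X_valid[:, idx], np.ones(len(X_valid))])
    return np.mean((y_valid - B @ beta) ** 2)

# Exhaustive search: C(k, p) subsets for each size p, 2^k - 1 subsets in total.
best_subset = min(
    (s for p in range(1, k + 1) for s in itertools.combinations(range(k), p)),
    key=msep_of,
)
```

With k = 21, as in the quarterly experiment, the same loop would visit over two million subsets, which is why the researchers accepted a search taking days.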
In their experiment, exhaustive search was applied to the quarterly data (between
1983 and the end of 1993, with twenty-one possible explanatory variables) to
find the p variables which minimize the MSEP. For the monthly data, they used a
sequential selection scheme in which the variable with the highest correlation
to the spread is chosen first. Then a linear model with only this variable is constructed
on the full training set. The residuals, which remain to be explained by
the remaining variables, are obtained. Next, the variable with the highest correlation to the
residuals is selected, a model with two variables is constructed, and the residuals
are recalculated. The process continues until all variables (or a large number of
variables) are ranked.
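The sequential scheme can be sketched as follows, again on invented synthetic data: at each step the variable most correlated with the current residuals joins the model, the model is refitted, and the residuals are recomputed.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 60, 5
X = rng.normal(size=(n, k))
# Synthetic "spread": dominated by variable 2, with a smaller effect from variable 0.
spread = 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(0, 0.1, n)

def fit(X_sub, y):
    """OLS fit with intercept; returns the fitted values."""
    A = np.column_stack([X_sub, np.ones(len(X_sub))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

ranked, residual = [], spread.copy()
for _ in range(k):
    # Pick the unranked variable most correlated with the current residuals.
    candidates = [j for j in range(k) if j not in ranked]
    j_best = max(candidates,
                 key=lambda j: abs(np.corrcoef(X[:, j], residual)[0, 1]))
    ranked.append(j_best)
    # Refit with all ranked variables and recompute the residuals.
    residual = spread - fit(X[:, ranked], spread)
```

Here variable 2 is ranked first (it correlates most with the spread itself) and variable 0 second (it explains most of what remains).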

2.3.2 Classification-Based Approaches


Since classification is closely related to predictive modeling, many researchers adopt
classification methods in their predictive modeling approaches. One such approach
was proposed by researchers at the IBM T. J. Watson Research Center [7].
Their experiments with capital market data suggest that the domain can be effectively
modeled by classification rules derived from historical data to make predictions
for equity investments. They believe that new classification techniques developed at
IBM Research, including minimal rule generation (R-MINI) and contextual feature
analysis, are robust enough for consistently extracting useful information from noisy
domains such as financial markets. They also believe that the classification rules obtained
by using these new techniques can be effectively used for numerical prediction,
which may eventually lead to an investment policy.
The R-MINI rule generation system [41] can be used to generate "minimal" classification
rules from tabular data sets with one of the columns as a "class" variable and
the remaining columns as "independent" features. A feature discretization subsystem
is used to completely discretize the data set before rule generation; it performs
feature ranking (of both continuous-valued and categorical features) and converts
the continuous-valued features into discretized features using an optimal cutting algorithm
[42]. Their method of ranking the features differs from the popularly used decision
tree method. In their approach, features are ranked according to merits computed
for each feature. Merits are computed by taking a set of "best" counterexamples
for each example in a class, and accumulating a figure for each feature which is a
function of the example-pair feature values. Compared to the tree-based methods,
the R-MINI contextual feature analyzer may be considered a full-level look-ahead
feature analyzer. The researchers believe that it will not suffer from falling into false
local minima, due to its ability to analyze the merits of features in a global context. When
rule generation is finished, the R-MINI system can be used to classify unseen data
sets and measure the performance using various error metrics.
In order to precisely quantify the predictive performance, it is necessary to use
the classification rules to predict the actual return instead of the discretized class
segments. To extend the classification model to a rule-based regression model, the
researchers calculate additional metrics for the rules based on the training examples
and the rules derived from this training data: mu, the mean of all actual class values of
training examples covered by that rule; sigma, the standard deviation of these values; and
N, the total number of training examples covered by that rule. When such a rule set
is applied to unseen data, for each example in that set there will be potentially zero or
more rules which apply to that example. If no rule covers an unseen example, a
numerical value can be assigned or predicted which suggests a default for the domain,
based on priors such as the normal expected mean for the class. If one or more rules
apply to an unseen example, an average is calculated from the rule coverage metrics,
which is then assigned as the class label for that example.
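A minimal sketch of this rule-based regression step follows. The rule predicates, feature names and metric values (mu, sigma, n per rule) are all invented for illustration; the averaging of covering rules and the domain default for uncovered examples follow the scheme described above.

```python
# Each rule carries a predicate over an example (a dict of features) plus the
# coverage metrics computed from the training data: mu (mean actual return of
# covered training examples), sigma (their standard deviation) and n (how many
# training examples the rule covers). All values here are made up.
rules = [
    {"match": lambda e: e["pe_ratio"] < 10,  "mu": 0.08, "sigma": 0.02, "n": 40},
    {"match": lambda e: e["momentum"] > 0.5, "mu": 0.05, "sigma": 0.03, "n": 25},
]
DEFAULT_RETURN = 0.01   # domain prior used when no rule covers an example

def predict_return(example):
    """Average the mean returns (mu) of all rules covering the example."""
    covering = [r for r in rules if r["match"](example)]
    if not covering:
        return DEFAULT_RETURN
    return sum(r["mu"] for r in covering) / len(covering)
```

For an example covered by both rules, the prediction is the average of their mu values; sigma and n could additionally be used to weight or qualify the estimate, a refinement not shown here.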
Another classification-based approach [58] focused on database marketing applications.
The goal in this domain is to predict customer behavior based on previous
actions. A usual approach is to develop training models which maximize
accuracy on the training and testing sets and then apply these models to the unseen
data. The researchers behind this approach argue that accuracy optimization is insufficient
by itself and explore different strategies that take the customer value into
account. They propose a framework for comparing the payoffs of different models and
use it to compare several different approaches for selecting the most valuable subset
of customers.
They use historical telephone customer records which contain billing data and
responses to previous special offers, and use a commercial neural network classifier to
group the customers into several classes according to their responses to such special
offers. Based on the predicted classes (responders, non-responders), offers are made
to customers who have a high predicted probability of response. However, in order
to maximize business value, not only must prediction accuracy be maximized, but
a group of "high value" customers among those highly likely to respond must also
be identified. How to achieve this goal remains an open problem. Is it
enough to just select the estimated high-value customers from the group of predicted
responders as a post-processing step, or might it be better to have the predictive model
itself take the value into consideration? They tried different strategies in order to get
a satisfying answer.
To evaluate the different approaches, they divide the data into one training set and
several testing sets of different sizes to calculate the merits of each approach.
In order to get better accuracy, they reduce the irrelevant attributes by excluding
dependent attributes and eliminating attributes with low correlation weight to the
target attribute.
The strategies on which they did experiments are baseline payoff calculation, post-processing,
value-based training, modeling stratified groups based on value, straight
merge of stratified logs, and merge of "optimal" subsets from the stratified logs. Details
of these strategies can be found in [58].
Their prediction approach is mainly based on a classification technique. They
construct the predictive model on top of the classification strategy, while taking the
business payoff into account.

2.3.3 Neural Net Approaches


Recursive neural nets are sometimes adopted in predictive modeling approaches.
In one approach [53], an available set of independent variables is augmented
with a weighted "effect" set to provide information about the moderating
effects of (not) knowing particular independent variables. Using an existing psychological
database, this approach showed that recursive neural nets were able to induce
and use relevant main and interaction effects.


The researchers of this approach argue that since statistical methods treat
prediction as the problem of finding an optimal function to predict a set of unknown
(dependent) values from a disjoint set of known (independent) variables, these methods
may not be suitable for the many cases where the distinction between dependent
and independent variables is not very clear, for the following reasons: (1) some
dependent variables may already be known while some independent variables may be
missing; (2) people can often draw the same conclusion based on different sets of
available information; (3) additional information can be added at any time. Since
the same variable may appear both as a dependent variable and as an independent
variable, it is very difficult to distinguish between these two types of variables under
such circumstances. To handle such situations, a data mining system must allow any
set of unknown variables to be predicted from arbitrary collections of already known
variables. To achieve this goal, a traditional system may recompute the predictor
function whenever a new situation arises. However, this may not be suitable in most
cases due to the cost of accessing all the appropriate databases. Another possible
approach is to compute a different predictor function for each conceivable pattern of
known and unknown variables, but this is computationally infeasible. For instance, if
there are 30 variables in a relatively simple case, there will be 2^30 different patterns
of known and unknown variables to be considered, and the computing cost will be
prohibitively high.
Their approach uses recursive neural nets to store information so that the performance
benefits from knowing which variables can or cannot be used in the prediction
process. Because known variables are allowed to differ according to their contribution
to a particular prediction, the prediction process is called the "weighted effect" approach,
which is described as follows.
Suppose set V (with elements v) contains all variables which are relevant in a
particular context. Subset IN ⊆ V includes all known variables. Subset OUT ⊆ V
includes all unknown variables. Then predictions are the result of the mapping:
V' = F(V, E),
where E represents the "effect" of knowing the value of each variable v. That is,
all variables in V, including the unknown variables in OUT, are used as input to the
function F to generate a new vector V' of predicted values, which includes predictions
for already known variables in IN, as moderated by the effects in E.
The researchers propose that the effect set E contain an entry for every variable
in V: e_i = 1 when the variable is known, and e_i = -1 when its value is not known.
The reasoning behind this is that such weights allow a neural net to learn all
linear main and interaction effects (perhaps some non-linear effects as well) associated
with the presence or absence of each variable.
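Constructing the combined V + E input vector can be sketched as follows. The sketch uses the idealized +1/-1 effect entries rather than the +0.9/-0.9 scaled values used in the actual experiment, and the variable values are invented.

```python
import numpy as np

def build_input(values, known_mask):
    """Build the V + E input vector for the net.

    `values` holds all variables in V; entries for unknown variables (OUT)
    are overwritten with 0. The effect set E is +1 where a variable is
    known (IN) and -1 where it is not (OUT).
    """
    v = np.where(known_mask, values, 0.0)
    e = np.where(known_mask, 1.0, -1.0)
    return np.concatenate([v, e])

# Example: 4 variables, of which the 3rd is unknown (in OUT).
values = np.array([0.2, -0.4, 0.7, 0.1])
known = np.array([True, True, False, True])
vin = build_input(values, known)      # length 4 + 4 = 8
```

Zeroing the unknown entry alone would be ambiguous (0 might be a legitimate value); the matching -1 in E is what tells the net the variable is genuinely missing.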
In order to test whether the addition of the effect set E improves prediction, the
researchers created two basic experimental conditions. The first condition is called
the "Effect condition", in which the effect set E is defined as described in the preceding
paragraphs. V is a copy of a record in the data set, and the function F is implemented
as a partially recursive neural net. This net has the 31 variables in V' as its outputs and
the 31 + 31 = 62 variables in V and E as its inputs. A standard back-propagation
algorithm [29] was used, and all variables were scaled to the range -0.9 to +0.9. The
data was randomly divided into training and testing sets of approximately the same
size.
There are two phases in the process. The first phase is the training phase, during
which knowledge of predictor variables was simulated by randomly selecting between
0 and 15 variables using an efficient algorithm described in [49]. The selected variables
were assigned the value 0.0 and were then added to OUT. Their entry in E was set to
-0.9. The remaining variables were added to IN, with their original values remaining
the same and their entry in E set to +0.9. The resulting V + E were then presented
to the back-propagation algorithm.
The second phase is the test phase, in which the weights obtained during the training
phase are validated on the test set by randomly selecting between 1 and 15 variables
to be included in the OUT set. Then the E set is constructed and used exactly as
during the forward propagation stage in the training phase.
The second condition is called the "Control condition", which is identical to the "Effect
condition" except that no set E is used during the training phase. Instead, the
value 0 is simply assigned to the randomly selected elements of OUT. That is, all
results are based on a fully recursive neural net with 2 intermediate layers, 31 outputs
(V') and 31 inputs (V).
Their experimental results showed that the addition of the effect set E to the
inputs minimized distortions in known variables during the transformation from V to
V'.

2.3.4 Summary of Predictive Modeling Approaches


Several different approaches to predictive modeling were discussed in this section.
The approaches using the statistical linear regression method work well on numerical
attributes but may not be suitable for applications which need to handle many
nominal attributes. In some cases, the assumption of a linear distribution of the data
stored in databases may not be valid, which will cause prediction errors. Faster
algorithms also need to be developed to find the best set of variables and the most
appropriate model complexity. Moreover, for different predictive attributes, different
models have to be constructed. Even for the same predictive attribute, the model has
to be adjusted or even rebuilt if different sets of known variables are used.
Classification-based approaches take advantage of previously developed classification
methods to classify data into several predefined classes and obtain the classification
rules which describe the characteristics of each class. The behavior of a
future object can then be predicted by assigning it to one of these classes and using the
description of that class. Because of the wide diversity of database data, one
potential problem for classification-based approaches is that an object may belong to
none, or to more than one, of the predefined classes, and in the latter case the probability
of the object falling into any one class may be quite low. It is difficult to assign such an
object to one class without losing prediction accuracy. Since a classifier is built on one
specific classification attribute, another problem is that a classifier has to be built for
each possible predictive attribute. Since the number of attributes may be quite large in a
real database, it is difficult to pre-construct and pre-store all the possible classifiers.
In this sense, classifiers may have to be built during a prediction process, so that the
cost of building such classifiers is vital to classification-based approaches. Some costly
models, such as neural net classifiers, may not be suitable for a predictive modeling
task which requires fast response.
The neural net approaches are appropriate when different sets of variables will
be used as either dependent (predictive) variables or independent variables. The
same model can be used for different cases, while, for the other two kinds of approaches,
different models have to be constructed for different cases. However, the training
phase is rather time-consuming, which may not be suitable for many applications.

2.4 Summary
In this chapter, a brief introduction to the work related to our proposed predictive
modeling approaches was given. The motivation, method, feasibility, and different
applications of attribute-oriented induction were discussed in Section 2.1, along with
the query and concept hierarchy which it uses for a data mining task. In Section 2.2,
several classification approaches based on different methodologies, such as the statistical,
machine learning and neural net methods, were presented. We also
examined the advantages and the weaknesses of the approaches based on the different
methodologies. In Section 2.3, different approaches to predictive modeling were investigated.
These approaches were divided into three groups based on the method
they use: statistical, classification-based, and neural net approaches. A brief summary
of each type of predictive modeling approach was also given.
Chapter 3
Predictive Modeling Based on
Classification Method
3.1 Introduction
Classification in terms of data mining is learning a function that maps (classifies) a
data item into one of several predefined classes [38, 86, 61], which can be described
as follows. The input data, called training data, consists of multiple objects. Each
object has multiple attributes and is tagged with a special class label. The classification
task is to analyze the training data and to develop an accurate description
or model for each class according to the features present in the data. Future
test data is then classified using the class descriptions, which can also be used to
provide a better understanding of each class in the database. Figure 3.1 illustrates
a common classification system. Classification has wide applications, such as
credit approval, target marketing, medical diagnosis, treatment effectiveness analysis,
etc.
Classification is closely related to predictive modeling. In statistics, the classification
problem is sometimes called the prediction problem. Generally, a classification
process can be thought of as having two phases: a model construction phase and a new
object identification phase. In the first phase, existing data is analyzed to obtain an
accurate description of each class. In the second phase, a new object is identified and assigned

CHAPTER 3. CLASSIFICATION BASED METHOD 37

[Figure: an object to be classified is passed to the classifier, which outputs a class
name using the class descriptions produced by the class analyzer from the training data.]

Figure 3.1: A Common Classification System

to one class according to the class descriptions obtained in the first phase. This phase
can actually be considered a prediction process, because the behavior of the future
object can be predicted from the common behavior of the class to which it belongs.
The so-called "common behavior" of one class represents the general characteristics
of all the previously known data in this class.
In terms of predictive modeling, the first phase of classification can be seen as
the predictor construction phase, while the second phase is the prediction phase.
Classification can be thought of as extracting the patterns of the existing data, and
prediction as the use of such patterns to forecast the behavior of future
data. In this sense, they can be treated as two phases in one data mining process.
Since classification has been extensively studied in previous research, with many
efficient methods developed, many approaches in predictive modeling are based on
classification [7, 58].
Our approach integrates the ID3 [73, 74] decision tree classification method with
attribute-oriented induction to predict object distribution over different classes. The
details of our approach will be discussed in the following sections.
The rest of this chapter is organized as follows. In Section 3.2, the problem we want
to tackle is defined. In Section 3.3, our motivation for integrating the ID3 method with
attribute-oriented induction is discussed. In Section 3.4, the ID3 decision tree method
is examined. In Section 3.5, the general idea of the approach is described, which is
followed by Section 3.6, where the algorithm and its rationale are
discussed. In Section 3.7, two variations of the algorithm are examined. Section 3.8
illustrates our method with an example. Finally, a brief summary is given in Section
3.9.

3.2 Problem Statement


Classification-based predictive modeling can be described as follows:
Suppose there is a group of existing objects G which contains X objects. Each object
O_i in G can be described as O_i(v_1h, v_2j, ..., v_np), which has a set of N attributes
A = <a_1, a_2, ..., a_n> with corresponding values v_1h, v_2j, ..., v_np (1 <= h <= H,
1 <= j <= J, 1 <= p <= P, where H, J and P are the numbers of distinct values of
attributes a_1, a_2 and a_n respectively). All the objects in G can be categorized into
a group of M classes C = <c_1, c_2, ..., c_M>. Suppose we are interested in a set of
objects S ⊆ G, and for objects in S we are only interested in a set of K (0 < K <= N)
attributes AA_s = <aa_1, aa_2, ..., aa_K>, where AA_s ⊆ A. All the objects in S can be
categorized into a set of L (0 < L <= M) classes CC_s = <cc_1, cc_2, ..., cc_L>, where
CC_s ⊆ C.
Our first goal is to derive a number of simple rules r_1, r_2, ..., r_z from the object set
S which can categorize each object in S, based on the values of attributes in AA_s, into
one or more classes with certain probability. Each rule can be described using one or
more attributes, the corresponding value(s) associated with the attribute(s), one or
more logical operators, one or more classes, and the probability associated with each
class. For example, a rule r_i can be described as: if aa_x = v_q and aa_y = v_w then
cc_d[70%] ∨ cc_e[25%] ∨ cc_f[5%] (1 <= x <= K, 1 <= y <= K, 1 <= q <= X, 1 <= w <= Y,
1 <= d <= L, 1 <= e <= L, 1 <= f <= L; aa_x ∈ AA_s, aa_y ∈ AA_s; cc_d, cc_e, cc_f ∈ CC_s;
X and Y are the numbers of distinct values of attributes aa_x and aa_y respectively;
[70%], [25%] and [5%] are the probabilities associated with each class).
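A rule of this form can be represented concretely as follows. The attribute names, values and probabilities are invented for illustration; a rule is a conjunction of attribute-value conditions mapped to a class distribution.

```python
# A rule r_i: a conjunction of (attribute, value) conditions mapped to a
# probability for each class. Attributes, values and probabilities invented.
rule = {
    "conditions": {"income_level": "high", "region": "urban"},
    "classes": {"good_credit": 0.70, "fair_credit": 0.25, "bad_credit": 0.05},
}

def rule_applies(rule, obj):
    """True iff every condition aa_x = v_q in the rule holds for the object."""
    return all(obj.get(a) == v for a, v in rule["conditions"].items())

obj = {"income_level": "high", "region": "urban", "age_group": "30-39"}
dist = rule["classes"] if rule_applies(rule, obj) else None
```

The class probabilities of a rule sum to 1, so a matching object receives a full distribution over CC_s rather than a single class label.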
On one hand, these derived rules can be used to study the behavior of the existing
objects and produce a useful prediction result. For example, we may obtain a number
of rules about house size in British Columbia, Canada, based on the existing
objects stored in a database. By studying these rules, a local paint company may
find that people with certain characteristics are very likely to own medium-sized
houses. The paint company can then use this knowledge to direct its promotion
of paint designed for medium-sized houses to those potential customers who
possess those characteristics.
On the other hand, the obtained rules can also be used to categorize one or more
new objects into certain classes, so that we can predict their behavior based on
the common behavior of the objects in the classes associated with them.
For instance, a credit card company may derive a number of rules about the credit
ratings of its current customers. The company may use these rules to filter new
applicants for credit cards, so that it can recruit new customers with potentially
good credit and reject applicants with potentially bad credit. Moreover, it can
also assign appropriate credit limits to the accepted new customers based on these
rules, so that it can maximize profits while minimizing risk.
There are three cases associated with the process of predicting a new object's behavior
based on the rules obtained using the classification-based predictive modeling method:
- Case 1: Based on the rules, the new object can be assigned to a single class.
- Case 2: According to the rules, the new object can be assigned to more than
one class, but with a different probability for each class.
- Case 3: The existing rules do not cover the new object.
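The three cases above can be handled uniformly by returning a class distribution instead of forcing a single label. The rule format here is the same hypothetical sketch as before, with invented conditions and probabilities; a single class at 100% corresponds to case 1, several classes to case 2, and no covering rule to case 3.

```python
def class_distribution(obj, rules):
    """Combine the class distributions of all rules covering `obj`.

    Returns None when no rule covers the object (case 3); otherwise a
    normalized distribution over the classes of the covering rules.
    """
    covering = [r for r in rules if all(obj.get(a) == v
                                        for a, v in r["conditions"].items())]
    if not covering:
        return None
    combined = {}
    for r in covering:
        for cls, p in r["classes"].items():
            combined[cls] = combined.get(cls, 0.0) + p
    total = sum(combined.values())
    return {cls: p / total for cls, p in combined.items()}

rules = [{"conditions": {"spend": "medium"},
          "classes": {"good": 0.7, "bad": 0.3}}]
```

Here `class_distribution({"spend": "medium"}, rules)` falls under case 2, while an object with `spend = "high"` falls under case 3 and yields None.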
Case 1 is the ideal case, which rarely happens when the rules are derived from a
large database, because of the wide diversity of the objects it contains.
In case 2, the common method is to compute the probability of the occurrence of
this object in the different classes and assign it to the class with the highest probability.
However, for a large database, the probability of the occurrence of the new object may
be quite low even for the class with the highest probability, due to the wide diversity of
data in the database. Assigning the new object to any class with such a low probability
is questionable, and the prediction based on this assignment may not be
accurate or useful. It would be desirable to know the probabilities of the new object
being assigned to the different classes, instead of reluctantly assigning it to one class.
In case 3, new rules which cover the new object have to be found. However,
this process may not always succeed, since there may be no similar object stored in the
database from which such new rules could be derived.
For a classification-based predictive modeling method, in order to extract predictive
rules, a classification model usually needs to be constructed first. Then the
rules can be derived from the established model for further analysis to produce useful
predictive results. Since the classification model has to be constructed based on predefined
classes, different classification models have to be created for different predictive
attributes. Thus the model construction cost has to be considered
when selecting the method. For example, since the time to construct a neural net
classifier is rather long, classification methods based on neural nets may not
be suitable for many predictive modeling applications which require fast response.
Another factor we need to consider is that the method we use should be able
to extract a limited number of simple rules from the existing objects. Otherwise, it
may not be useful or practical, because of the difficulty of analyzing a large number
of complicated rules, and because the more complicated (thus more specific and
less generic) the rules are, the less chance they have of covering a new object, which
makes the rules less useful for predicting the behavior of new objects. For example, one
rule may be described as "people who spend $500-1500 a month have good credit".
Another rule may be described as "people who spend $300-1000 a month, live in a
suburb, have full-time work with over $50k salary and a credit card balance less than
$200 have good credit". Obviously, the first rule is not only easier to analyze but also
likely to cover more new objects than the second one, and is thus more useful in the
future.
In the worst case, each derived rule describes only one object in the database,
and all the rules combined are actually the original object set itself - thus no extra
knowledge is gained in the predictive modeling process. Therefore, we must pay
special attention to selecting methods which can reliably extract a limited number
of simple rules, and avoid those which cannot reliably do so.

3.3 Motivation
3.3.1 Why Use a Classification-Based Method
- Classification and prediction are closely related. They can be considered as two
phases in one predictive modeling process. It is natural and easy to
perform prediction based on classification results.
- Many efficient methods have been developed in previous classification studies
which can be effectively utilized in predictive modeling.

3.3.2 Why Use a Decision Tree-Based Method


- The decision tree-based method is the most popular method for classification, and
ID3 is the most popular method used in decision tree classification; it has
been proven to be accurate, efficient and robust by many researchers [86, 45, 68,
67]. Moreover, in previous research, the ID3 method was also proven to be able to
reliably construct simple and concise decision trees, which lead to the derivation
of simple and concise rules.
- As mentioned in Section 3.2, the cost of model construction has to be taken
seriously in classification-based predictive modeling. Since the cost of decision
tree-based model construction (such as for ID3) is lower than that of other
models such as neural nets and Bayesian inference, more time can be spent
performing prediction on different predictive attributes, due to the savings in the
model construction phase.
3.3.3 Why Use Attribute-Oriented Induction


Most decision tree algorithms cannot handle large amounts of data, due to the limitation
that the training data has to fit in memory. Therefore they cannot be applied
to data mining directly, because a database may typically contain millions of
tuples, far beyond these algorithms' capability. Decision tree
algorithms have to be modified somehow in order to deal with such large amounts of
data, as the researchers did in the SLIQ algorithm [62]. Since attribute-oriented induction
can reduce the size of the data set of interest considerably by generalization, its
integration with the decision tree method provides another way to deal with the large
amounts of data stored in databases.
Furthermore, the data stored in databases without generalization is usually at
the primitive concept level, including continuous values for numerical attributes.
A classification model construction process performed on such data, as most decision
tree algorithms would do, will result in very bushy or meaningless trees. In the worst
case, the model cannot be constructed at all if the size of the data set is too large
for the algorithms to handle. For this reason, even in the examples given by Quinlan
[73], the data to be classified is at high concept levels, such as "hot" for temperature
instead of continuous values like "35" centigrade, and "high" for humidity instead of
"95%". Normally, this kind of high-concept-level data is not stored
in the database; instead, low-level data such as "25" for temperature is stored.
To get high-concept-level data, generalization such as attribute-oriented induction
is necessary, in which the continuous numerical values are discretized. With the
help of attribute-oriented induction, the inherent weakness of ID3 (the tendency to favor
many-valued attributes in the selection of the determinant attribute) is also avoided,
since attribute-oriented induction can reduce the large number of attribute values to
a small set of distinct values according to specified thresholds, which can be adjusted
flexibly.
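The discretization step can be sketched as mapping raw values to higher-level concepts through a small concept hierarchy. The concept names echo Quinlan's example, but the cut points below are invented for illustration.

```python
# A tiny concept hierarchy for temperature: raw degrees -> concept level.
# The cut points are invented for illustration.
def generalize_temperature(celsius):
    if celsius < 10:
        return "cool"
    if celsius < 25:
        return "mild"
    return "hot"

# Raw tuples at the primitive level generalize to a few distinct values,
# shrinking the data the decision tree algorithm must handle.
raw = [35, 8, 22, 31, 5, 19]
generalized = [generalize_temperature(t) for t in raw]
```

However many distinct raw temperatures the database holds, the generalized attribute has only three values, which also removes ID3's bias toward many-valued attributes.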
Moreover, in the decision tree algorithms, the classification process terminates
only when all the objects at a particular node belong to the same class. However,
this condition may not be achievable in real cases, because of the large
CHAPTER 3. CLASSIFICATION BASED METHOD 43

amount and wide diversity of data in large databases. In our proposed algorithm,
a classification threshold is used to terminate the process when a major portion of
the objects (indicated by the classification threshold) belongs to one class. Because
of the incorporation of quantitative information in attribute-oriented induction, the
proposed algorithm can also handle noisy or exceptional data, which is quite common in
large databases, by pruning the generalized tuples that have negligible counts using
particular thresholds, whereas most decision tree methods have difficulty handling such
data.
With the help of attribute-oriented induction, prediction can be made at different
desirable concept levels. With the association of quantitative information,
such as the count of objects at each node in the decision tree, prediction can be based
on the distribution of the target objects over different classes rather than on only
the one class with the highest probability, which makes the result more reasonable.

3.4 Discussion of ID3 Method


In the previous section, we discussed the motivation for developing the classification-based
predictive modeling method by integrating attribute-oriented induction with
the ID3 decision tree method. In this section, we examine the ID3 method in detail.
ID3 is a decision tree induction method developed by Quinlan [73, 74]. Suppose
there is a group of objects, as illustrated in Table 3.1, in which each object belongs
to one of the two mutually exclusive classes "P" and "N". Only two classes are
used here to simplify the description; the method can be extended to any
number of classes.
This set of objects can be used as a training set to develop a classification rule that
can determine the class of any object from the values of its attributes. A classification
rule can be expressed as a decision tree. A decision tree that classifies each object in
the training set is shown in Figure 3.2.
Given a group of objects, many decision trees can be constructed that correctly
classify them. For instance, Figure 3.3 shows another decision tree based on
the same set of data illustrated in Table 3.1, but it is considerably more complicated than

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Table 3.1: An Example Data Set (by Quinlan 1986).

the first decision tree shown in Figure 3.2. The preference for simpler decision trees
[72] leads the ID3 method to use an information-theoretic approach, which selects the
attribute that provides the highest information gain as the root of the tree (or
subtree). This selection process minimizes the expected number of tests needed to classify an
object and guarantees that a simple (though not necessarily the simplest) tree is found.
Let the prime class P contain pi objects of class Pi (for i = 1, ..., m), where each class
Pi is distinguished from another class Pj (i ≠ j) by their different values of the
determinant attribute. An arbitrary object belongs to class Pi with probability
pi/p, where p is the total number of objects in the prime class P. When a decision tree
is used to classify an object, it returns a class. A decision tree can thus be regarded
as a source of messages for the Pi's, with the expected information needed to generate this
message given by:

    I(p1, p2, ..., pm) = - sum_{i=1}^{m} (pi / p) log2(pi / p)

If an attribute A with values a1, a2, ..., ak is used for the root of the decision tree,

Figure 3.2: A Decision Tree (by Quinlan 1986). [The root tests Outlook: sunny leads
to a Humidity test (high: N, normal: P), overcast leads directly to P, and rain leads
to a Windy test (true: N, false: P).]

it partitions a class C into C1, C2, ..., Ck, where Cj contains those objects in C that
have value aj of A. Let Cj contain pij objects of class Pi. The expected information
required for the tree with A as the root is then obtained as the weighted average:

    E(A) = sum_{j=1}^{k} ((p1j + ... + pmj) / p) I(p1j, ..., pmj)

The information gained by branching on A is:

    gain(A) = I(p1, p2, ..., pm) - E(A)

ID3 examines all the candidate attributes and chooses the attribute A that maximizes
gain(A) to form the tree. It then uses the same process recursively to form a decision
tree for each residual subset C1, C2, ..., Ck.
We can use the training set in Table 3.1 to illustrate the selection of the
decision tree root as follows.
Let T be the set of objects in Table 3.1. Among the 14 objects, 9 belong to
class P and 5 belong to class N, so the information required for classification
is:

Figure 3.3: A More Complicated Decision Tree (by Quinlan 1986). [This tree tests
Temperature at the root and then, along its branches, tests Outlook, Windy and
Humidity in various orders; it classifies the same objects as the tree in Figure 3.2
but with many more nodes, including one null leaf.]


    I(p, n) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Then consider the outlook attribute, with values "sunny", "overcast" and "rain".
Five of the 14 objects in T have the value "sunny", two of them from class P and
three from class N, so

    p1 = 2, n1 = 3, I(p1, n1) = 0.971

and similarly

    p2 = 4, n2 = 0, I(p2, n2) = 0
    p3 = 3, n3 = 2, I(p3, n3) = 0.971

Therefore, the expected information requirement is:

    E(outlook) = (5/14) I(p1, n1) + (4/14) I(p2, n2) + (5/14) I(p3, n3) = 0.694

The gain of attribute outlook is then:

Name       Occupation           House-size (m2)  Address    Salary
J. Bush    Associate professor  680              Vancouver  65,000
W. Ng      Junior manager       450              Richmond   40,000
G. Butler  Full professor       1000             Burnaby    86,000
...        ...                  ...              ...        ...

Table 3.2: An Initial Relation.

    gain(outlook) = 0.940 - E(outlook) = 0.246


A similar calculation gives:

    gain(temperature) = 0.029, gain(humidity) = 0.151, gain(windy) = 0.048

Thus, in this case, ID3 would choose outlook as the attribute for the root of the
decision tree. The objects would be divided into subsets based on their values of
the outlook attribute, and a decision tree for each subset would then be induced in a
similar manner. The actual decision tree generated by ID3 from this training set is
shown in Figure 3.2.
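The gain computations above can be checked mechanically. The following Python sketch (not from the thesis; names such as DATA and gain are ours) implements I, E(A) and gain(A) for the training set of Table 3.1:

```python
from math import log2
from collections import Counter

# Training set of Table 3.1: (outlook, temperature, humidity, windy, class)
DATA = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]
ATTRS = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def info(counts):
    """I(p1, ..., pm): expected information of a class distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def expected_info(rows, attr_idx):
    """E(A): weighted average information after branching on attribute A."""
    by_value = {}
    for row in rows:
        by_value.setdefault(row[attr_idx], []).append(row[-1])
    return sum(len(labels) / len(rows) * info(list(Counter(labels).values()))
               for labels in by_value.values())

def gain(rows, attr_idx):
    """gain(A) = I(p1, ..., pm) - E(A)."""
    return info(list(Counter(r[-1] for r in rows).values())) - expected_info(rows, attr_idx)

for name, idx in ATTRS.items():
    print(f"gain({name}) = {gain(DATA, idx):.3f}")
```

Running it reproduces the values quoted above (gain(outlook) ≈ 0.246, gain(temperature) ≈ 0.029, gain(humidity) ≈ 0.151, gain(windy) ≈ 0.048, up to rounding of intermediate values), so outlook is indeed selected as the root.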
At each non-leaf node of the decision tree, the computational complexity of the
procedure is O(|T| × |A|), where |T| is the number of objects in the class and |A| is the number
of attributes of the objects [73]. Therefore, ID3's total computational requirement per
iteration is proportional to the product of the size of the training set, the number of
attributes, and the number of non-leaf nodes in the decision tree.
The experimental results in Quinlan's study show that ID3 is very efficient and
capable of producing simple decision trees in most cases, and that no exponential growth
in time or space was observed as the training set grew.

3.5 General Ideas of Our Approach


In this section, we illustrate the general ideas of our approach using the data
shown in Table 3.2.
In Table 3.2, there are five attributes: "Name", "Occupation", "House-size", "Address"
and "Salary". In this simplified case, suppose a local paint company wants

to know the general characteristics of professors and managers living in the Lower Mainland
of British Columbia, Canada, who own houses of particular sizes, in order to target
its promotion of a certain type of paint at the right group of people. We can use the
classification-based predictive modeling method to find the answer for them. We could
simply construct a decision tree on the above data using the ID3 algorithm, with
the attribute "House-size" as the target class attribute, and then derive
the predictive rules along the paths of the tree, but the result may not be very
good for the following reasons:
1. A very bushy and thus uninteresting decision tree could be constructed, due
to the large number of tuples in the database and of distinct values for some
attributes such as "Name" and "Salary", which may result in many unnecessary
and meaningless rules unrelated to the task.
2. Different people, at different times or in different cases, may want information
at different concept levels. For example, sometimes people only want
to know whether a house is big, medium or small, rather than its exact area
in square meters. Sometimes they may pay attention only to professors or
managers as a whole, instead of distinguishing between different kinds of professors or
managers. Decision trees based only on the primitive concept level may not
satisfy such requests.
3. The amount of data may simply be too large for the decision tree algorithm to handle,
so that the decision tree cannot be constructed at all.
Based on the above observations, we may want to remove the attributes with a
large number of distinct values which cannot be generalized and are irrelevant to the
task, such as the attribute "Name". We may use attribute-oriented induction to discretize
numerical attributes such as "Salary" to reduce the number of distinct values,
and to generalize the given data set to different concept levels to meet different requests.
For example, Table 3.2 can be generalized to Table 3.3.
In Table 3.3, the values of "Occupation" are generalized to "professor" and "manager";
the values of "House-size" are generalized to "medium", "small" and "big",

Occupation  House-size  Address    Salary  Count
professor   medium      Vancouver  medium  12
manager     small       Richmond   low     8
professor   big         Burnaby    high    46
...         ...         ...        ...     ...

Table 3.3: A Generalized Relation.

while the values of "Salary" are generalized to "medium", "low" and "high". The
attribute "Count" is inserted into the generalized relation during attribute-oriented
induction, indicating the number of original tuples covered by each generalized tuple.
Notice that any generalization (such as attribute removal or attribute-oriented induction)
may result in a non-clean classification; that is, originally distinct objects
which belong to different classes may become identical objects that still carry
different class labels. For example, according to house size, an assistant professor living in
Burnaby with a medium salary may belong to class "medium", while an associate professor
living in Burnaby with a medium salary belongs to class "big". After generalization
on the attribute "Occupation", these two objects become the same object, "Occupation
= professor, Address = Burnaby, Salary = medium", but in different classes, "medium"
and "big". We cannot categorize such objects into any single class. At the leaf nodes
of the decision tree, we may have to give the class distribution instead of a unique
class label. As we discussed in Section 3.2, such cases are very common in databases,
even without any generalization, due to the wide diversity of the data. The association
of quantitative information and the class distribution at the leaf nodes actually
provide a reasonable solution to this problem.
Based on the above discussion, we may first collect the task-relevant data from
the database into a table IT (as illustrated in Table 3.2) by executing an SQL query.
We then remove from IT the attributes with a large number of distinct values which
cannot be generalized or discretized. Next, we generalize the data set to a certain
level according to the request and put it into another table GT (as illustrated in
Table 3.3). Finally, we run the ID3 algorithm on GT and create a decision tree. From

the derived rules, we can obtain the desired predictive information.


Notice that in ID3, the recursive classification process terminates when all the
objects in a subclass belong to one class. Because of the large amount and wide
diversity of data in databases, it is rare that all the objects in a
target class belong to the same class. Thus, a classification threshold should be set,
such that further classification of a set of classified objects becomes unnecessary
if a substantial portion (i.e., no less than the prespecified classification threshold) of
the classified objects belongs to the same class.
Based on the above discussion, we present the algorithm of our proposed
classification-based predictive modeling method in the next section.

3.6 Algorithm
Algorithm 3.1 (Classification-Based Predictive Modeling Method) Derive predictive
modeling rules using attribute-oriented induction and ID3.
Input: A DMQL query for prediction.
Output: Predictive modeling rules derived from the constructed decision tree.
Method:
1. Data retrieval: According to the given request, collect the task-relevant data
and generalize it to the specified concept level to get the prime target class.
2. Model construction: Construct a decision tree based on the prime target class.
2.1 Compute the information gain for each candidate attribute based on the information-theoretic
approach, using the equations given in Section 3.4. Select one candidate
attribute as the classifying attribute at the current level and classify the
target class.
2.2 For each classified target subclass, repeat Step 2.1 to further classify it until
either (1) all or a substantial proportion (no less than the classification threshold
specified in the DMQL query) of the objects is in one (determinant) class, or
(2) no more classifying attributes can be used for further classification.
3. Output the rules derived from the constructed decision tree.
Rationale:
Step 1 is the attribute-oriented induction algorithm [15]. Step 2.1 is essentially the
ID3 algorithm [73, 74]. Step 2.2 determines whether a recursive, further classification
needs to be performed. If a substantial portion of the objects (no less than the specified
classification threshold) in the target class belongs to one class, the set of objects can
be viewed as having been successfully categorized, and it is not necessary to perform
further classification on it. Otherwise, further classification is performed on the
set of objects, in a process similar to Step 2.1. The classification process terminates
when either all the objects are so classified or no more classifying attributes can be
used for further classification. Step 3 generates rules according to the decision tree so
derived, which reflect the general regularity of the data in the database.
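Step 2 of the algorithm can be sketched as a recursive procedure over generalized tuples that carry counts. The Python below is an illustrative sketch, not the thesis implementation; the tuple layout (class value at class_idx, count in the last field), the helper names, and the 85% default threshold are all our assumptions:

```python
from math import log2
from collections import Counter, defaultdict

def class_counts(rows, class_idx):
    """Aggregate the count-weighted class distribution of generalized tuples."""
    dist = Counter()
    for row in rows:
        dist[row[class_idx]] += row[-1]   # last field is the generalized tuple count
    return dist

def info(dist):
    total = sum(dist.values())
    return -sum(c / total * log2(c / total) for c in dist.values() if c)

def best_attribute(rows, attrs, class_idx):
    """Choose the candidate attribute with the highest information gain."""
    base = info(class_counts(rows, class_idx))
    def gain(a):
        groups = defaultdict(list)
        for row in rows:
            groups[row[a]].append(row)
        n = sum(r[-1] for r in rows)
        return base - sum(sum(r[-1] for r in g) / n * info(class_counts(g, class_idx))
                          for g in groups.values())
    return max(attrs, key=gain)

def build_tree(rows, attrs, class_idx, threshold=0.85):
    dist = class_counts(rows, class_idx)
    total = sum(dist.values())
    label, count = dist.most_common(1)[0]
    # Stop when the majority class reaches the threshold or no attributes remain;
    # a leaf keeps the full class distribution, not a single label.
    if count / total >= threshold or not attrs:
        return {"leaf": True, "distribution": {k: v / total for k, v in dist.items()}}
    a = best_attribute(rows, attrs, class_idx)
    groups = defaultdict(list)
    for row in rows:
        groups[row[a]].append(row)
    children = {v: build_tree(g, [x for x in attrs if x != a], class_idx, threshold)
                for v, g in groups.items()}
    return {"leaf": False, "attribute": a, "children": children}

# Toy generalized table: (occupation, salary, house_size, count).
GT = [
    ("professor", "high", "big", 90), ("professor", "high", "medium", 10),
    ("manager", "low", "small", 95), ("manager", "low", "big", 5),
    ("professor", "medium", "big", 50), ("professor", "medium", "small", 50),
]
tree = build_tree(GT, attrs=[0, 1], class_idx=2)
print(tree["attribute"], sorted(tree["children"]))
```

On this toy table the root splits on column 1 (the salary-like attribute), the "high" and "low" branches stop immediately because one class reaches the 85% threshold, and the "medium" branch ends in a leaf that reports the 50/50 class distribution.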

Computational Complexity:
Theorem 3.1 Algorithm 3.1 takes O(Ngl × Nr × log Nr + Nnonleaf × Na × Nr) time
to finish, where Nr is the number of records in the initial data set (the task-relevant
data set initially collected from the database, at the primitive concept
level), Ngl is the number of concept levels through which the attributes in the initial data set
are generalized from the primitive concept level to the specified concept
level, Nnonleaf is the total number of non-leaf nodes in the decision tree, and Na is the
number of attributes in the initial data set.
Rationale:
Since attribute-oriented induction takes O(N × log N) time [15] (where N is the number
of records in a relational table), in order to generalize the initial data set to the
concept level specified in the request, the algorithm generalizes Ngl times (levels).
Because the number of records in the generalized table becomes smaller and smaller
after each generalization, it takes O(Ngl × Nr × log Nr) time to generalize the
initial data set to the desired concept level. The decision tree construction on
the generalized target data set then costs at most Nnonleaf × Na × Nr time [73]. So
the total computational requirement of this algorithm is O(Ngl × Nr × log Nr +
Nnonleaf × Na × Nr).

3.7 Variations of the Algorithm


Two variations of Algorithm 3.1 were developed during our study by exploring different
ways to share data structures and intermediate results for potential performance
enhancement. In the rest of this section, we briefly examine the two variations of
Algorithm 3.1.

3.7.1 Using a Single Unsplit Generalized Table to Generate Target Subclasses: Algorithm PC USGS1

The first variation stores the generalized target table (referred to below as the master
table), and for each node during the decision tree construction, its target subtable is
extracted from this master table. The algorithm is briefly summarized as follows.
Algorithm 3.2 (PC USGS1) A variation of Algorithm 3.1: use a single unsplit generalized
table to generate target subclasses.
Input: A DMQL query for prediction.
Output: Predictive modeling rules derived from the constructed decision tree.
Method:
1. Data retrieval: According to the given request, collect the task-relevant data,
generalize it to the specified concept level to get the prime target class,
and store it in the master table.
2. Model Construction: Construct a decision tree based on the prime target class.

2.1 Compute the information gain for each candidate attribute based on the information-theoretic
approach, using the equations given in Section 3.4. Select one candidate
attribute as the classifying attribute at the current level and classify the target
class.
2.2 For each classified target subclass, extract its subtable from the master table and
repeat Step 2.1 to further classify it until either (1) all or a substantial proportion
(no less than the classification threshold specified in the DMQL query) of the
objects are in one (determinant) class, or (2) no more classifying attributes can
be used for further classification.
3. Output the rules derived from the constructed decision tree.

3.7.2 Using a Split Generalized Table to Generate Target Subclasses: Algorithm PC STGS2

The second variation splits the generalized table along the tree, and for each node
during the decision tree construction, its target subtable is extracted from the subtable
of its parent node. The algorithm is briefly summarized as follows.
Algorithm 3.3 (PC STGS2) A variation of Algorithm 3.1: use a split generalized
table to generate target subclasses.
Input: A DMQL query for prediction.
Output: Predictive modeling rules derived from the constructed decision tree.
Method:
1. Data retrieval: According to the given request, collect the task-relevant data
and generalize it to the specified concept level to get the prime target class.
2. Model Construction: Construct a decision tree based on the prime target class.

2.1 Compute the information gain for each candidate attribute based on the information-theoretic
approach, using the equations given in Section 3.4. Select one candidate
attribute as the classifying attribute at the current level and classify the target
class. The subtable for each child node is derived from the prime target class if the
current node is the root, or from the subtable of its parent node otherwise.
The subtable so derived is then associated with each child node.
2.2 For each classified target subclass, use the subtable associated with it and repeat
Step 2.1 to further classify it until either (1) all or a substantial proportion (no
less than the classification threshold specified in the DMQL query) of the objects
are in one (determinant) class, or (2) no more classifying attributes can be used
for further classification.
3. Output the rules derived from the constructed decision tree.

3.7.3 Summary
Since the target tables become smaller and smaller from parent generations to
child generations, the cost of extracting subtables is lower in algorithm PC STGS2, which
results in better performance in most cases. However, algorithm PC STGS2 takes
a little more memory than algorithm PC USGS1, and when the number of records
is small, the performance difference between the two algorithms is not significant. The
detailed performance study of these two algorithms is presented in Chapter 5.
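The difference between the two variations lies only in how each node obtains its subtable. A minimal sketch of the two extraction strategies (illustrative names, not the thesis code; a path is the list of (attribute index, value) decisions from the root to the node):

```python
def subtable_from_master(master, path):
    """PC USGS1-style: filter the full master table by the whole root-to-node path."""
    return [row for row in master
            if all(row[attr] == value for attr, value in path)]

def subtable_from_parent(parent_rows, attr, value):
    """PC STGS2-style: filter only the parent's already-reduced subtable."""
    return [row for row in parent_rows if row[attr] == value]

master = [("professor", "high"), ("professor", "low"), ("manager", "high")]
path = [(0, "professor"), (1, "high")]
via_master = subtable_from_master(master, path)
via_parent = subtable_from_parent(subtable_from_master(master, path[:1]), 1, "high")
print(via_master == via_parent)  # → True
```

Both strategies yield the same subtable; PC STGS2 pays for the stored per-node subtables with extra memory, but each filter scans only the parent's rows rather than the whole master table, which matches the performance behavior described above.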

3.8 Example
In this section, we use a simplified example to illustrate our proposed classification-based
predictive modeling approach. Suppose a local paint company wants to market
its high-end paint to professors and managers living in the Lower Mainland of British
Columbia, Canada. In order to achieve the best result, they need to know the general
characteristics of the group of professors and managers who own certain sized houses

so that they can target their promotion at a particular group with the appropriate
products designed for the house sizes that people in this group are likely to
own. Suppose they have access to a database "HouseDB", which contains house information
about people living in the Lower Mainland, with the schema represented in
Table 3.2. They can then submit the following DMQL query to get the information
they need.
use HouseDB
find prediction rule for "House size"
from HouseInfo
where (Occupation = "professor" or Occupation = "manager")
and (Address = "Burnaby" or Address = "Vancouver" or Address = "Richmond")
and (Salary = "low" or Salary = "medium" or Salary = "high")
in relevance to Occupation, House size, Address, Salary
with classification threshold 85%

The query is then executed and the task-relevant data set is collected from the
database. This initial data set is generalized to the specified concept level, which
results in the prime class table (shown in Table 3.4 as a feature table).
Based on the extracted prime class table, the information gain is computed for each
candidate attribute ("Occupation", "Salary" and "Address") using the information-theoretic
method and the equations given in Section 3.4.
The computation results in Table 3.5, which implies that "Salary" should be chosen
as the root of the decision tree, and that the objects should be classified according to the
values (High, Medium, Low) of the attribute "Salary".
The object distribution in each subclass is presented in Table 3.6. Since the
determinant class "House size = Small" contains 95.61% of the objects in the subclass
"Salary:Low", which is above the specified classification threshold of 85%, it is unnecessary
to further classify the subclass "Salary:Low". Classification is further performed
on the other subclasses.

                              professor                  manager           Total
House-size  Address      L    M    H  Total        L    M    H  Total
big         Richmond     0   64  103   167         0   16   20    36      203
            Burnaby      2   27   46    75         1    8    9    18       93
            Vancouver    0   11   39    50         0    4    8    12       62
            Total        2  102  188   292         1   28   37    66      358
medium      Richmond     1   42   20    63         0   12    1    13       76
            Burnaby      0   21   62    83         0    3   18    21      104
            Vancouver    0   12   32    44         1    2    5     8       52
            Total        1   75  114   190         1   17   24    42      232
small       Richmond     4   10    0    14         8    0    0     8       22
            Burnaby     11    4    2    17         2    9    0    11       28
            Vancouver   35   24    5    64        49   41    6    96      160
            Total       50   38    7    95        59   50    6   115      210
Total                   53  215  309   577        61   95   67   223      800

Table 3.4: The prime class represented as a generalized feature table.

Attribute   Information Gain
Occupation  0.0865259036
Salary      0.3586720686
Address     0.1888694432

Table 3.5: The information gain for each attribute to classify the prime class.

The computation of the information gain for each remaining attribute in the two
subclasses leads to Table 3.7. Obviously, "Address" should be chosen as the classifying
attribute for both subclasses, "Salary:High" and "Salary:Medium".
The object distribution in each subclass is presented in Table 3.8. Since the
determinant class "House size = Big" contains 85.42% of the objects in the subclass
"Salary:High & Address:Richmond", which is above the classification threshold
of 85%, it is unnecessary to further classify this subclass. Classification is further
performed on the other remaining subclasses.

Salary   House size = Big   House size = Medium   House size = Small
High          59.84%              36.70%                 3.46%
Medium        41.93%              29.68%                28.39%
Low            2.63%               1.76%                95.61%

Table 3.6: Object distribution in each subclass.

               Information Gain
Attribute     Salary: High    Salary: Medium
Occupation    0.2164310451    0.3473828339
Address       0.2505954843    0.4922614419

Table 3.7: The information gain for each remaining attribute in the two subclasses.

For the remaining subclasses, based on the last classifying attribute "Occupation",
the object distribution in each leaf class is presented in Table 3.9 and Table 3.10,
respectively.
The model construction process finally terminates when there is no attribute left
in the subclasses for further classification. This process leads to the decision tree of
Figure 3.4.
Prediction rules, with the associated probability distribution information, can be
generated based on the tables and the decision tree so derived. For example, the rules
which associate house size with salary can be derived as follows:
if Salary = High
then (House size = Big) [59.84%] ∨ (House size = Medium) [36.70%]
∨ (House size = Small) [3.46%]
if Salary = Medium
then (House size = Big) [41.93%] ∨ (House size = Medium) [29.68%]
∨ (House size = Small) [28.39%]
if Salary = Low
then (House size = Big) [2.63%] ∨ (House size = Medium) [1.76%]

                  Salary: High                  Salary: Medium
             Big     Medium    Small        Big     Medium    Small
Burnaby    40.15%    58.39%    1.46%      48.61%    33.33%   18.06%
Richmond   85.42%    14.58%    0.00%      55.56%    37.50%    6.94%
Vancouver  49.47%    38.95%   11.58%      15.96%    14.89%   69.15%

Table 3.8: Object distribution in the subclasses Salary:High and Salary:Medium.

            Salary:High & Address:Burnaby    Salary:High & Address:Vancouver
               Big     Medium    Small           Big     Medium    Small
professor    41.82%    56.36%    1.82%         51.32%    42.10%     6.58%
manager      33.33%    66.67%    0.00%         42.10%    26.32%    31.58%

Table 3.9: Object distribution in the remaining subclasses of Salary:High.

∨ (House size = Small) [95.61%]

Rules associating "House size" with "Salary" and "Address", or with "Salary",
"Address", and "Occupation", can also be derived in a similar way.

Figure 3.4: The decision tree generated based on the query and the determining
attribute "House size". [The root tests Salary: the Low branch ends in a leaf (Small);
the High and Medium branches test Address; the Richmond branch under High ends
in a leaf (Big); the remaining Address branches test Occupation (Prof/Mgr).]
Notice that alternative choices can be explored for determining whether a subclass
needs to be further classified. One such alternative is to specify a noise threshold and
a disjunct threshold (the maximum number of disjuncts allowed). A subclass does not

            Sal:Medium & Adr:Burnaby   Sal:Medium & Adr:Richmond   Sal:Medium & Adr:Vancouver
              Big    Medium   Small      Big    Medium   Small       Big    Medium   Small
professor   51.92%   40.38%   7.69%    55.17%   36.21%   8.62%     23.40%   25.53%  51.06%
manager     40.00%   15.00%  45.00%    57.14%   42.86%   0.00%      8.51%    4.26%  87.23%

Table 3.10: Object distribution in the remaining subclasses of Salary:Medium.

need to be further classified if it contains only a small number of disjuncts (no more than the
disjunct threshold) after filtering out each disjunct that is below the noise threshold. For
example, in Table 3.6, if the noise threshold is 5% and the disjunct threshold is 2 (i.e.,
two disjuncts are allowed), the subclasses "Salary:High" and "Salary:Low" contain only
one or two disjuncts after filtering out each disjunct below the noise threshold. Further
classification needs to be performed only on one subclass, "Salary:Medium".
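This alternative stopping test can be stated compactly. A sketch under the same assumptions (5% noise threshold, disjunct threshold of 2; function and variable names are ours):

```python
def needs_further_classification(distribution, noise=0.05, max_disjuncts=2):
    """A subclass is final if, after dropping disjuncts below the noise
    threshold, at most max_disjuncts remain."""
    surviving = [p for p in distribution.values() if p >= noise]
    return len(surviving) > max_disjuncts

# Subclass distributions from Table 3.6, as fractions of objects.
subclasses = {
    "Salary:High":   {"Big": 0.5984, "Medium": 0.3670, "Small": 0.0346},
    "Salary:Medium": {"Big": 0.4193, "Medium": 0.2968, "Small": 0.2839},
    "Salary:Low":    {"Big": 0.0263, "Medium": 0.0176, "Small": 0.9561},
}
to_refine = [name for name, d in subclasses.items() if needs_further_classification(d)]
print(to_refine)  # → ['Salary:Medium']
```

Only "Salary:Medium" keeps three disjuncts above the 5% noise level, so it alone is classified further, matching the walk-through above.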
After studying the rules derived by the classification-based predictive modeling process,
the local paint company may decide to target its promotion of a type of paint
designed for houses of a certain size at the particular group of people who are likely to own
such houses. For example, they may market a brand of paint designed
for small houses to the professors and managers with relatively low salaries.

3.9 Summary
In this chapter, we first discussed the basics of classification, its close relation to
predictive modeling, and our idea of integrating the ID3 classification method into our
predictive modeling approach. In Section 3.2, we defined the problem to be
tackled by our classification-based predictive modeling. In Section 3.3, we discussed
our motivation for integrating the ID3 method and attribute-oriented induction
into our approach. In Section 3.4, we discussed the ID3 decision tree-based classification
method. In Section 3.5, we illustrated our general ideas using an example.
We then presented the details of our proposed classification-based predictive modeling
method and its algorithm in Section 3.6. The rationale and the computational complexity
of the algorithm were also discussed. In Section 3.7, two variations of the algorithm
were examined. In Section 3.8, we used an example to illustrate our method. Our
discussion in this chapter shows that classification-based predictive modeling is
useful and feasible. In Chapter 5, we present the details of our performance study,
which shows that it is also efficient.
Chapter 4
Predictive Modeling Based on
Pattern Matching Method
4.1 Introduction
As we discussed in the previous chapters, predictive modeling in data mining is to predict
future happenings based on the existing data or knowledge stored in databases.
This can be further described as predicting the behavior of a given data pattern based
on the behavior of existing data patterns. One natural approach is to find existing
patterns which match the given pattern and then predict the behavior of the given
pattern based on the behavior of the matching patterns. This approach consists of
two phases: a pattern collecting phase and an analyzing phase.
In the pattern collecting phase, all the existing patterns are examined against
the given pattern, and the matching patterns are collected for the analyzing phase.
There are three cases in this phase:
- Case 1: There is a large set of existing patterns which exactly match the given
pattern.
- Case 2: There is only a small set of existing patterns which exactly match the
given pattern.


- Case 3: There is no matching pattern at all.


In Case 1, prediction can be made with confidence based on the large set of matching
patterns. However, this ideal case is unfortunately rare in the real
world. Usually we end up with Case 2 or Case 3, in which we cannot make a good
prediction from such a small set of matching patterns. In our approach, we integrate
attribute-oriented induction into this phase so that, when Case 2 or Case 3
happens, we are still able to collect enough matching patterns by generalization.
We also use a statistical method to guide the generalization process, in order to minimize
the loss of relevant information during generalization.
In the analyzing phase, the behavior of the collected patterns is investigated, and
prediction is made based on the probabilities with which different behaviors may happen
for the given data pattern.
The rest of this chapter is organized as follows. In Section 4.2, the problem
to be tackled by this approach is defined. Our motivation for using attribute-oriented
induction and relevance analysis is explained in Section 4.3, followed by Section
4.4, where we illustrate the relevance analysis method we use. In Section 4.5,
the general ideas of the approach are described, followed by Section 4.6,
where the algorithm and its rationale are discussed. Two variations
of the algorithm are examined in Section 4.7. Section 4.8 shows the examples and
experimental results. The whole chapter is summarized in Section 4.9.

4.2 Problem Definition


In terms of a relational database, a data pattern Pi can be described as Pi = <V(A1),
V(A2), ..., V(An)>, where A1, A2, ..., An are attributes of a relational table. Suppose Ai
is one of the attributes, with a value set VSET(Ai) = {v1, v2, ..., vn}
consisting of all the possible values attribute Ai may have, including "any", which
means it can be any value. V(Ai) represents any value of attribute Ai; that is,
vi = V(Ai) and vi ∈ VSET(Ai). The behavior of pattern Pi is defined as
B(Pi) = V(AP), where AP is the predicting attribute in the table. It is the attribute
of interest and is different from the attributes included in Pi. Given a pattern Pinput,
our task in the data collecting phase is to find the set of matching patterns MPSET within the
existing pattern set PSET (PSET = {P1, P2, ..., Pn}). MPSET ⊆ PSET, and all
matching patterns in MPSET have the same value as Pinput for each attribute.
That is, for every Px ∈ MPSET, V(A1) of Px = V(A1) of Pinput, V(A2) of Px = V(A2) of
Pinput, ..., V(An) of Px = V(An) of Pinput. However, if the value of some attribute Aj
of the given pattern is "any", then V(Aj) of a matching pattern can be any value,
without necessarily being the same.
The task in the pattern analyzing phase is to examine the different behaviors of the collected patterns in MPSET and predict, on a probability basis, what behavior the given pattern may have. Suppose B1, B2, ..., Bm are the different behaviors of the collected patterns in MPSET. (In terms of a relational database, B1, B2, ..., Bm are the different values of the attribute of interest. For convenience in the following discussions, we call the attribute of interest the Predictive Attribute and the other attributes which make up the pattern the Descriptive Attributes. In principle, any attribute can be either a predictive or a descriptive attribute depending on the predictive modeling request.)
The behavior of the given pattern Pinput can be any one of them, but with different probabilities. Suppose the total number of patterns in MPSET is N and the numbers of patterns in MPSET with behaviors B1, B2, ..., Bm are n1, n2, ..., nm respectively (N = Σ_{i=1}^{m} n_i). Then the probability for Pinput to have behavior B1, B2, ..., Bm is n1/N, n2/N, ..., nm/N accordingly.
However, the quality of predictions based on different sets of matching patterns may differ. Obviously, predictions based on a large set of patterns are more reliable than those based on only a small set of patterns. We use Support to measure the reliability of the prediction.

Definition 4.1 (Support) Support is the number of existing matching patterns of the given pattern in a data set. That is, support = N, where N is the number of matching patterns of the given pattern.

Apparently, the more support, the more reliable the prediction is.
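As a concrete illustration of the probability computation above, the following Python sketch derives the support and the behavior distribution from the predictive-attribute values of the patterns in MPSET. The function name and the sample salary buckets are our own illustrative choices, not part of the system.

```python
from collections import Counter

def predict_distribution(behaviors, support_threshold):
    """behaviors: the predictive-attribute value of each pattern in MPSET.
    Returns (support, {behavior: probability}), or (support, None)
    when the support threshold is not reached."""
    support = len(behaviors)                      # N in the text
    if support < support_threshold:
        return support, None                      # not enough evidence yet
    counts = Counter(behaviors)                   # n1, n2, ..., nm
    return support, {b: n / support for b, n in counts.items()}

support, dist = predict_distribution(
    ["45-50k", "45-50k", "40-45k", "45-50k"], support_threshold=3)
# support == 4; dist == {"45-50k": 0.75, "40-45k": 0.25}
```

When the threshold is not met, the caller would trigger further generalization rather than report the (unreliable) distribution.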

Definition 4.2 (Support Threshold) Support Threshold is the support which is considered sufficient for a predictive modeling request.
It is evident that the support threshold for a particular predictive modeling request varies depending upon the situation. For instance, a researcher at Simon Fraser University may find support threshold = 100 sufficient to predict how much research grant he or she may get from NSERC, considering there are only a few hundred records in the NSERC research grant database. However, a new graduate with a B.A. degree who wants to become an accountant may find support threshold = 100 unsatisfactory for predicting how much salary he or she can ask for, considering there are thousands of records in the accountant job database.
If, in the real world, we were always able to find enough matching patterns for any given pattern to calculate the probabilities of its behaviors, the task would be very easy to accomplish. However, this is apparently not the case. In the following section, we discuss the motivation for integrating attribute-oriented induction and relevance analysis into the pattern collecting phase.

4.3 Motivation
Let's study an example first. Suppose a new graduate is looking for a job and one of his primary concerns is how much salary he can expect given his personal situation. Suppose there is an occupation database available for public access, which contains most occupation-related information such as age, education, experience, profession, region, etc. Assume there is a simple predictive modeling system available which can gather the information specified by the user and do some calculation to give a probability-based prediction. So the new graduate seeks the help of this system and hopes to get a general picture of how much salary he can ask for during job interviews. He gives the system his personal information as "Profession = programmer, Degree = M.S., Experience = 2 (years), Age = 25, Location = Burnaby, Special Interest = object-oriented database". The system then tries to find matching patterns in the existing records of the occupation database. In the database, there may be hundreds

of programmers, thousands of people who have an M.S. degree, and even more people who are at age 25. However, the set of records which match all the specified conditions is probably small. In this case, suppose there are only 12 matching records found. The new graduate probably won't trust a prediction based on such a small set of data and hopes that the system can help him find more records which are closest to the specified condition but not necessarily identical. One way to achieve this is to generalize the given condition and try to find more matching records based on the generalized result. In this case, suppose the condition is generalized to "Profession = programmer, Degree = M.S., Experience = 2, Age = 22-27, Location = BC, Special Interest = Computing Science". The system needs to find the matching records which satisfy the new condition. However, in the database, data is stored in its original values (that is, at the primitive concept level) and may not be at the concept level which the user desires. In this case, the values for attribute "Special Interest" are at the "A.I., object-oriented database, financial planning, etc." level but not at the "Computing Science, Business, etc." level. This means that the system needs to generalize the existing data to the desired concept level in order to find the matching records. This is one reason that we want to integrate attribute-oriented induction into the system. Suppose our system already has such capability, and finds 400 records which match the generalized condition. The new graduate may feel comfortable with a prediction based on this set of data. What the system needs to do next is to calculate the distribution of salary among the found records. However, since the values of salary vary, the result may not reflect the general trend. For example, the result may look like "$42,000: 3%, $45,680: 3.4%, ...". In order to clearly show the general trend, we need to generalize the predictive attribute, in this case "Salary". Thanks to attribute-oriented induction, the system generalizes the attribute "Salary" into four high-level values: "under $40,000", "$40,000-45,000", "$45,000-50,000", and "over $50,000". So the result becomes clearer: "under $40,000: 10%, $40,000-45,000: 25%, $45,000-50,000: 60%, over $50,000: 5%". Thus the new graduate knows he can probably ask for $45,000-50,000 during his job hunting. This is another reason to use attribute-oriented induction.
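The climbing of concept hierarchies described above can be sketched in Python as follows. The hierarchy contents and attribute names below are illustrative assumptions, not data from a real occupation database.

```python
# Hypothetical one-level concept hierarchies: each primitive value maps
# to its parent concept. Values with no entry are treated as top-level.
HIERARCHY = {
    "object-oriented database": "Computing Science",
    "A.I.": "Computing Science",
    "financial planning": "Business",
    "Burnaby": "BC",
    "Vancouver": "BC",
}

def generalize(value, hierarchy):
    """Climb one level in the concept hierarchy; a value with no parent
    (already at the top, or unknown) is left unchanged."""
    return hierarchy.get(value, value)

pattern = {"Location": "Burnaby",
           "Special Interest": "object-oriented database"}
generalized = {attr: generalize(v, HIERARCHY) for attr, v in pattern.items()}
# generalized == {"Location": "BC", "Special Interest": "Computing Science"}
```

The same mapping is applied both to the user's condition and to the stored records, so that matching can be performed at the higher concept level.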

One caution we need to take during the generalization process is that we do not want to generalize too much; otherwise the generalized pattern may become too generic and lose too much specific information from the original pattern, in which case the prediction may not be useful. For instance, if the generalized pattern of the above case is "Profession = White Collar, Degree = M.S., Experience = 0-10, Age = 20-40, Location = BC, Special Interest = Science", the result may not be very helpful for the new graduate to find out how much he can expect to earn. In this example, the system generalizes too much, losing too many specifics and making the pattern so generic that the result is not specific enough for the particular request. As we can see, obtaining enough support while still retaining as much necessary specific information as possible is the goal we must achieve in our approach.
For each prediction, we know that the relevance between the different descriptive attributes and the predictive attribute varies. Some attributes are more relevant to the prediction while others are less relevant or even not relevant at all. If the relevance between each attribute and the predictive attribute is known before generalization, such information can guide the generalization process so that the less relevant attributes are generalized more than the relevant attributes, which results in getting enough support while still keeping as much of the most relevant specific information as possible.
Take the previous case as an example. If the system knows that "Profession", "Experience" and "Degree" are more relevant to "Salary" and "Location" is least relevant, these three attributes can be generalized less while "Location" is generalized more during the generalization process, which may result in the following generalized pattern: "Profession = programmer, Degree = M.S., Experience = 2, Age = 22-47, Location = Western Canada, Special Interest = Computing Science". The new graduate can still get a clear picture of the salary even though "Location" has been generalized to a relatively higher concept level in order to obtain enough support.
As discussed above, the motivation for integrating relevance analysis into our approach is to guide the generalization process so that the relevant specific information is retained as much as possible by generalizing more on the less relevant attributes.

4.4 Relevance Analysis


As we discussed in the previous section, relevance analysis is necessary to guide the generalization process in the pattern collecting phase in order to retain the specific information relevant to the prediction. In this section, we introduce the method we adopt to analyze the relevance between attributes in our approach: contingency analysis, a well-adopted method in both statistics and data mining [70].
According to dictionary definitions, there are two opposite meanings of contingency: (1) the state of being accidental, or (2) the state of being dependent upon another event or situation. Statistics adopts the second meaning with a little modification: "related in other than a chance manner; associated in other than a random manner" [52].
Before we start the actual study of the computation of relevance between two attributes, we first need to know the contingency table, which is often used in the computation.
Suppose we have two attributes A and B. A1, A2, ..., An represent n groups of tuples which have the n distinct values of attribute A, and B1, B2, ..., Bm represent m groups of tuples which have the m distinct values of attribute B. Then a set of N tuples is classified into n categories for A and m categories for B. For example, if A represents attribute "Occupation", then A1 may represent "programmer", A2 "professor", and so on.
If N represents the number of distinct values of attribute A and M is the number of distinct values of attribute B, then a contingency table for attributes A and B will have N + 2 rows and M + 2 columns. The intersection of one row and one column in a contingency table is called a cell. The total number of cells is (N + 2) × (M + 2). The cells in the first row contain the distinct values of attribute B: beginning with column 2, each cell in this row contains one distinct value of attribute B. The first column of the contingency table contains the distinct values of attribute A. Starting

with the second row, each cell in this column contains one distinct value of attribute A. For convenience, the cells which contain the distinct values of the attributes will be referred to as value cells and the other cells as data cells. Each data cell in the contingency table, corresponding to one distinct value of attribute A and one distinct value of attribute B, contains the number of tuples which have these two attribute values. Each cell in the last row and the last column of the table contains the subtotal of the corresponding column or row. The last cell (the intersection of row N+2 and column M+2) contains the total number of tuples in the contingency table.
In a simplified example, let N = M = 2 and we get the following contingency table:

Attribute B
Attribute A B1 B2 Total
A1 (A1B1) (A1B2) (TA1)
A2 (A2B1) (A2B2) (TA2)
Total (TB1) (TB2) (TN)
Table 4.1: Contingency Table for Attributes A and B.

Salary
Occupation low high Total
white collar 100 500 600
blue collar 700 200 900
Total 800 700 1500
Table 4.2: Contingency Table for Occupation and Salary.

To best illustrate the meaning of the contingency table shown in Table 4.1, we can use a concrete example. Let A represent attribute "Occupation", with A1 representing "white collar" and A2 "blue collar". Let B represent attribute "Salary", with B1 representing "low" and B2 "high". Thus (A1B1) represents the number of tuples with "Occupation" as "white collar" and "Salary" as "low". This example is shown in Table 4.2.
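A minimal Python sketch of how the data cells and marginal totals of such a contingency table might be computed follows; the function name is our own, and the cell counts are those of Table 4.2.

```python
def marginals(cells):
    """Compute the subtotal row/column and grand total of a contingency
    table given its data cells: {(row_value, col_value): count}."""
    row_totals, col_totals, grand = {}, {}, 0
    for (r, c), n in cells.items():
        row_totals[r] = row_totals.get(r, 0) + n
        col_totals[c] = col_totals.get(c, 0) + n
        grand += n
    return row_totals, col_totals, grand

# Data cells of Table 4.2 (Occupation vs. Salary).
cells = {("white collar", "low"): 100, ("white collar", "high"): 500,
         ("blue collar", "low"): 700, ("blue collar", "high"): 200}
rows, cols, N = marginals(cells)
# rows == {"white collar": 600, "blue collar": 900}; N == 1500
```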

Since we have the data arranged in the contingency table, we are now ready to study the relevance between attribute "Occupation" and attribute "Salary".
We will first establish the existence of relevance by examining the following questions: (1) Is the proportion of people in the white collar group having high salary the same as for all of the people? (2) Is the proportion of people with low salary the same for people in the white collar group as for those in the blue collar group? To answer these two questions, we need to calculate a pair of percentages for each question. For question (1), we find that 83% of the people in the white collar group have high salary while 47% of all the people have high salary. For question (2), we find that of the people with low salary, 87.5% are in the blue collar group while 12.5% belong to the white collar group.
Since the percentages in each pair are different, we know that the two attributes, "Occupation" and "Salary", are not independent. Thus the answer to the question "Is there any relevance between these two attributes?" is YES.
Now that we know that relevance exists between the two attributes, we will examine the degree of relevance. In the above example, we can get some idea of the closeness of the relevance by comparing 83% and 47%. However, this is not very useful because it cannot be compared with percentage differences calculated from other contingency tables. We need a coefficient which measures the degree of relevance, is a pure number free from the effects of the absolute quantities or of the size of the marginal percentages, and can be compared across different contingency tables. One such coefficient is T, which measures the degree of relevance between attributes. It is based on the well-known Chi Square method and is frequently used in statistics [30]. The calculation of T is described as follows.
a11        a12        ...  a1n        a1(n+1)
a21        a22        ...  a2n        a2(n+1)
...
am1        am2        ...  amn        am(n+1)
a(m+1)1    a(m+1)2    ...  a(m+1)n    a(m+1)(n+1)

In the usual case, suppose we have two attributes, one with m values and the other with n values. Using an (m+1) × (n+1) matrix (row m+1 contains the subtotal of the corresponding column, while column n+1 contains the subtotal of the corresponding row), we can represent the contingency table as above.
Then we can calculate P (an intermediate measure used to calculate Chi Square):

P = Σ_{i=1}^{n} ( Σ_{j=1}^{m} a_{ji}² / a_{j(n+1)} ) / a_{(m+1)i}

where a_{ji} is the data cell in row j and column i, a_{j(n+1)} is the total of row j, and a_{(m+1)i} is the total of column i.
For the example shown in Table 4.2, P = 1.36.

Then we can get the Chi Square χ² (N is the total number of tuples in the table):

χ² = N · P − N

For our example, χ² = 1500 · 1.36 − 1500 = 540.
After getting χ², we can calculate T (φ² is an intermediate measure; s represents the number of rows and t represents the number of columns):

φ² = χ² / N
T² = φ² / √((s−1)(t−1))

For our example, φ² = 0.36 and T² = 0.36, thus T = 0.6.
Therefore the degree of relevance between "Occupation" and "Salary" measured by T is 0.6.
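The full P, Chi Square, T computation can be reproduced with the following Python sketch (the function name is our own); applied to the data cells of Table 4.2 it yields T of approximately 0.6, matching the hand calculation above.

```python
import math

def t_coefficient(cells):
    """Compute T from contingency-table data cells, following the
    P -> chi-square -> phi-square -> T steps described in the text."""
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    row_tot = {r: sum(cells.get((r, c), 0) for c in cols) for r in rows}
    col_tot = {c: sum(cells.get((r, c), 0) for r in rows) for c in cols}
    N = sum(cells.values())
    # P sums a_ji^2 / (row total * column total) over all data cells
    P = sum(cells.get((r, c), 0) ** 2 / (row_tot[r] * col_tot[c])
            for r in rows for c in cols)
    chi2 = N * P - N
    phi2 = chi2 / N
    t2 = phi2 / math.sqrt((len(rows) - 1) * (len(cols) - 1))
    return math.sqrt(t2)

cells = {("white collar", "low"): 100, ("white collar", "high"): 500,
         ("blue collar", "low"): 700, ("blue collar", "high"): 200}
T = t_coefficient(cells)   # approximately 0.6 for Table 4.2
```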
Obviously the computational complexity of the algorithm to calculate T is O(N × M), where N is the number of columns and M is the number of rows of the contingency table.
In the above discussion, we first illustrated the general idea of analyzing the relevance between two attributes. We then introduced the T coefficient, which is extensively used in statistics to measure the degree of relevance between two attributes. In the following sections, we will further discuss how to integrate this method into our predictive modeling system to perform the relevance analysis.

4.5 General Ideas of our Approach


Based on the previous discussions, we describe the general ideas of our approach in this section.
As discussed in Section 4.1, our proposed predictive modeling process consists of two phases: the data collecting phase and the pattern analyzing phase. In the data collecting phase, the input pattern is first specified in a DMQL [36] (Data Mining Query Language) request. DMQL is a superset of SQL in which, besides SQL statements, one can also specify the type of rule to be discovered (prediction in our case), the hierarchies to be used, the support to be achieved, etc. In a DMQL request, not only values at the primitive level but also those at higher concept levels in the concept hierarchy can be specified, since the DMQL parser will consult the concept hierarchies and map those higher-level values to their corresponding primitive-level values stored in the database.
The request is then examined by the DMQL parser and translated into corresponding SQL statements to retrieve the task-relevant data set. Attribute-oriented induction is then conducted on the collected data set to generalize the attributes to the concept levels specified in the query. Support is calculated based on the matching patterns in the generalized data set and then compared with the specified support threshold. If it is equal to or greater than the desired threshold, the distribution of the matching patterns across the values of the predictive attribute is calculated and the result is given. If the obtained support is less than the specified threshold, the system calculates the relevance between the predictive attribute and the other attributes and generalizes the least relevant attribute in the generalized data set to a higher concept level. This process repeats until either the obtained support is greater than or equal to the specified threshold or all the attributes have been generalized to the highest concept levels in the concept hierarchies.
Based on the above ideas, we develop the pattern matching-based predictive modeling method, which is examined in the next section.

4.6 Algorithm
Algorithm 4.1 (Pattern Matching-Based Predictive Modeling Method) Predict the value distribution of the predictive attribute for the given pattern based on the existing matching patterns.
Input: A DMQL query for prediction.
Output: The value distribution of the predictive attribute.
Method:
1. Data retrieval: According to the given data pattern (referred to as the Target Pattern), collect the task-relevant data set and generalize it to the specified concept level to create the generalized data set.
2. Check support: Check whether the number of matching patterns in the generalized data set is equal to or greater than the specified support threshold. If yes, go to step 5. If not, go to step 3.
3. Calculate relevance: Calculate the relevance between the predictive attribute and the descriptive attributes.
4. Further generalization: Further generalize the least relevant attribute of the generalized data set to its corresponding higher concept level to get a further generalized data set. Replace the value of this least relevant attribute in the Target Pattern with its corresponding value at the higher concept level of the hierarchy. Then go back to step 2. If the selected attribute is already at the top concept level of the hierarchy, try the next least relevant attribute. If all the attributes are at their top concept levels, stop.
5. Output result: Calculate the value distribution of the predictive attribute based on the matching patterns and output the result.
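The control flow of these steps can be outlined with the following simplified, self-contained Python sketch. To keep it runnable it replaces the T-coefficient relevance analysis of step 3 with a fixed least-relevant-first ordering supplied by the caller, and all data, attribute names, and function names are illustrative assumptions.

```python
def predict(table, target, parents, order, pred_attr, threshold):
    """Minimal sketch of Algorithm 4.1. `table` is a list of dict records,
    `parents[attr]` maps a value to its parent concept (one hierarchy level
    up), and `order` lists descriptive attributes from least to most
    relevant. Returns (support, {value: probability}) or (support, None)."""
    target = dict(target)

    def matches(rec):
        return all(target[a] in ("any", rec[a]) for a in target)

    while True:
        hits = [r for r in table if matches(r)]            # step 2
        if len(hits) >= threshold:
            dist = {}
            for r in hits:                                 # step 5
                dist[r[pred_attr]] = dist.get(r[pred_attr], 0) + 1
            return len(hits), {v: n / len(hits) for v, n in dist.items()}
        # steps 3-4: generalize the least relevant attribute that can climb
        for a in order:
            if target[a] in parents.get(a, {}):
                target[a] = parents[a][target[a]]
                table = [{**r, a: parents[a].get(r[a], r[a])} for r in table]
                break
        else:
            return len(hits), None   # all attributes at their top levels

records = [
    {"Loc": "Burnaby", "Salary": "45-50k"},
    {"Loc": "Surrey",  "Salary": "45-50k"},
    {"Loc": "Surrey",  "Salary": "40-45k"},
    {"Loc": "Burnaby", "Salary": "45-50k"},
]
parents = {"Loc": {"Burnaby": "BC", "Surrey": "BC"}}
support, dist = predict(records, {"Loc": "Burnaby"}, parents,
                        ["Loc"], "Salary", threshold=3)
# after generalizing Loc to "BC": support == 4, dist["45-50k"] == 0.75
```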
Rationale:

Step 1 is the attribute-oriented induction algorithm [15]. Step 2 checks whether the specified support threshold has been reached. Step 3 calculates the relevance between each descriptive attribute and the predictive attribute based on the method discussed in Section 4.4. Step 4 is also the attribute-oriented induction algorithm. Step 5 calculates the distribution of the different values of the predictive attribute as discussed in Section 4.2. The process terminates when either the number of matching patterns in the generalized data set exceeds the specified support threshold or all the attributes have been generalized to their top concept levels.

Computational Complexity:
Before we discuss the computational complexity of the algorithm, we first introduce three notations as follows.

(1) Initial Relation Table: the initial data set which is collected from the database and is at the primitive concept level.

(2) Ntl: Ntl = Σ_{i=1}^{n} Nli, where n is the number of attributes in the given data pattern and Nli is the number of concept levels (excluding the top concept level) in the concept hierarchy of attribute i.

(3) MAXNv: ∀ i (1 ≤ i ≤ N), MAXNv ≥ NVi, where NVi is the number of distinct values of attribute i in the initial relation table and N is the number of attributes in the initial relation table.

Theorem 4.1 In Algorithm 4.1, the calculation of the relevance for each descriptive attribute takes O(MAXNv²) time.

Rationale:
As we discussed in Section 4.4, it takes O(m × n) time to calculate the relevance between two attributes, where m is the number of distinct values of one attribute and n is the number of distinct values of the other. Since MAXNv is the maximum number of distinct values of any attribute in the initial relation table, and with each generalization the number of distinct values of each attribute either remains the same or becomes smaller, MAXNv is always greater than or equal to the number of distinct values of any attribute at any time. Therefore, the calculation of the relevance for each descriptive attribute takes O(MAXNv²) time.

Theorem 4.2 Algorithm 4.1 takes O((Ntl + Npl) × Nr × log Nr + (Ntl − Nil) × MAXNv² × Nda) time to finish. As defined above, Ntl is the total number of concept levels (excluding the top concept levels) in the concept hierarchies of all the descriptive attributes in the initial relation table. Npl is the number of concept levels through which the predictive attribute in the initial relation table is generalized from the primitive concept level to the default (or specified) concept level. Nr is the number of records in the initial relation table. Nil is the number of concept levels through which all the descriptive attributes in the initial relation table are generalized from the primitive concept level to the specified concept levels. Nda is the number of descriptive attributes in the initial relation table.
Rationale:
Since attribute-oriented induction takes O(N × log N) time [15] (N is the number of records in a relational table), in order to generalize the initial relation table to the concept level specified in the request, the algorithm will generalize Nil + Npl times (levels). Because the number of records in the generalized tables will be smaller than that in the initial relation table, it will take O((Nil + Npl) × Nr × log Nr) time to generalize the initial relation table to the desired concept level. Since the algorithm stops when either the total count of the matching patterns exceeds the specified threshold or all the attributes are generalized to the top concept levels, the worst case is that the algorithm stops only when all the attributes are generalized to the top concept levels. In this case, the maximum number of times it can generalize is (Ntl − Nil), and each generalization, including the calculation of the relevance for each descriptive attribute, takes O(MAXNv² × Nda + Nr × log Nr) time. So in the worst case it takes O((Ntl − Nil) × (MAXNv² × Nda + Nr × log Nr)) time to stop. Combining this with the time spent on the initial generalization, which generalizes the initial relation table to the specified concept level, the algorithm takes O((Nil + Npl) × Nr × log Nr + (Ntl − Nil) × (MAXNv² × Nda + Nr × log Nr)) time, which can be simplified as O((Ntl + Npl) × Nr × log Nr + (Ntl − Nil) × MAXNv² × Nda).

4.7 Variations of the Algorithm


Two variations of Algorithm 4.1 were developed during our study by exploring different ways to share data structures and intermediate results for potential performance enhancement. The two algorithms differ only in Step 3 of Algorithm 4.1; the other steps are the same. In the rest of this section, we examine the two variations of Algorithm 4.1 in detail.

4.7.1 Using Base Table to Generate Contingency Table: Algorithm PP S2BT
The first variation of Algorithm 4.1 (PP S2BT) extracts, in Step 3, a base table from the generalized table, consisting of the two attributes for which we want to calculate the relevance. The records in this base table are the combinations of the distinct values of the two attributes, and there are no duplicate records. The algorithm then scans the base table and generates the contingency table for further relevance calculation. This algorithm is described as follows.
Since the Input, Output, and Steps 1, 2, 4 and 5 of PP S2BT are the same as described in Algorithm 4.1 (and the same as in the second variation PP S1HT), we omit them here for simplicity.
Algorithm 4.2 (PP S2BT) A variation of Algorithm 4.1: use a base table to generate the contingency table.

Input: Same as Algorithm 4.1.


Output: Same as Algorithm 4.1.
Method:
1. Same as Algorithm 4.1.
2. Same as Algorithm 4.1.
3. Calculate relevance: Calculate the relevance between the predictive attribute and the descriptive attributes based on the contingency table generated by the following algorithm (presented in a syntax similar to C and Pascal, which should be self-explanatory). It takes the following input: the column index of a given descriptive attribute (givenNonPredAttrIndex), the column index of the predictive attribute (predAttrIndex), the target table (Tt), and the number of records in this table (nRecords), and generates the target contingency table Tct (the extracted base table is Tb).
    // Pass 1: build the base table Tb of distinct value combinations
    curRowInCTGBaseTable := 0
    for (row := 0; row < nRecords; row++) do begin
        found := FALSE
        rowIndexOfFoundRecord := 0
        for (rowInCTG := 0; rowInCTG < curRowInCTGBaseTable; rowInCTG++) do begin
            sourceValueNonPredAttr := Tt[row][givenNonPredAttrIndex]
            targetValueNonPredAttr := Tb[rowInCTG][nonPredAttrIndexInCTGBaseTable]
            if sourceValueNonPredAttr != targetValueNonPredAttr then
                match := FALSE
            else begin
                sourceValuePredAttr := Tt[row][givenPredAttrIndex]
                targetValuePredAttr := Tb[rowInCTG][predAttrIndexInCTGBaseTable]
                if sourceValuePredAttr = targetValuePredAttr then
                    match := TRUE
                else
                    match := FALSE
            end
            if match = TRUE then begin
                rowIndexOfFoundRecord := rowInCTG
                found := TRUE
                break
            end
        end
        if found != TRUE then begin
            // new value combination: append it to the base table
            Tb[curRowInCTGBaseTable][nonPredAttrIndexInCTGBaseTable]
                := Tt[row][givenNonPredAttrIndex]
            Tb[curRowInCTGBaseTable][predAttrIndexInCTGBaseTable]
                := Tt[row][givenPredAttrIndex]
            Tb[curRowInCTGBaseTable][countCol] := Tt[row][countCol]
            curRowInCTGBaseTable++
        end else
            // known combination: accumulate its count
            Tb[rowIndexOfFoundRecord][countCol]
                := Tb[rowIndexOfFoundRecord][countCol] + Tt[row][countCol]
    end
    sort_base_table(Tb)
    // Pass 2: scan the sorted base table and fill the contingency table Tct
    for (row := 0; row < Ntb; row++) do begin
        preRowValue := rowValue
        rowValue := Tb[row][Col1]
        colValue := Tb[row][Col2]
        if row = 0 then begin
            rowValueIndex := 0
            colValueIndex := 0
            add_col_value(colValueList, colValue)
            Tct[rowValueIndex][colValueIndex] := Tb[row][countCol]
        end else begin
            if rowValue != preRowValue then
                rowValueIndex++
            colValuefound := lookup_col_index(colValueList, colValue, colValueIndex)
            if colValuefound = TRUE then
                Tct[rowValueIndex][colValueIndex] := Tb[row][countCol]
            else begin
                colValueIndex := get_count(colValueList)
                add_col_value(colValueList, colValue)
                Tct[rowValueIndex][colValueIndex] := Tb[row][countCol]
            end
        end
    end
    // append the marginal totals
    nRows := rowValueIndex + 1
    nCols := get_count(colValueList)
    rowTotalIndex := nCols
    colTotalIndex := nRows
    for (row := 0; row < nRows; row++) do begin
        for (col := 0; col < nCols; col++) do begin
            Tct[colTotalIndex][col] := Tct[colTotalIndex][col] + Tct[row][col]
            Tct[row][rowTotalIndex] := Tct[row][rowTotalIndex] + Tct[row][col]
        end
        Tct[colTotalIndex][rowTotalIndex]
            := Tct[colTotalIndex][rowTotalIndex] + Tct[row][rowTotalIndex]
    end

4. Same as Algorithm 4.1.


5. Same as Algorithm 4.1.

4.7.2 Using Hash Table to Generate Contingency Table: Algorithm PP S1HT
The second variation of Algorithm 4.1 (PP S1HT) uses a hash table and scans the generalized table once to generate the contingency table for further relevance calculation in Step 3. The algorithm is presented below.
Since the Input, Output, and Steps 1, 2, 4 and 5 of PP S1HT are the same as described in Algorithm 4.1 (and the same as in the first variation PP S2BT), we omit them here for simplicity.
Algorithm 4.3 (PP S1HT) A variation of Algorithm 4.1: use a hash table to generate the contingency table.
Input: Same as Algorithm 4.1.
Output: Same as Algorithm 4.1.
Method:
1. Same as Algorithm 4.1.
2. Same as Algorithm 4.1.
3. Calculate relevance: Calculate the relevance between the predictive attribute and the descriptive attributes based on the contingency table generated by the following algorithm (presented in a syntax similar to C and Pascal, which should be self-explanatory). It takes the following input: the column index of a given descriptive attribute (givenAttrIndex), the column index of the predictive attribute (predAttrIndex), the target table (Tt), and the number of records in this table (N), and generates the target contingency table Tct (Tr is the hash table which contains the values of the row attribute; Tc is the hash table which contains the values of the column attribute).
(1)  for (row := 0; row < N; row++) do begin
(2)      rowValue := Tt[row][givenAttrIndex]
(3)      colValue := Tt[row][predAttrIndex]
(4)      rowValueIndex := 0
(5)      colValueIndex := 0
(6)      if row = 0 then begin
(7)          set_key_at(Tr, rowValue, rowValueIndex)
(8)          set_key_at(Tc, colValue, colValueIndex)
(9)          Tct[rowValueIndex][colValueIndex] := Tt[row][countCol]
(10)     end else begin
(11)         rowValueFound := FALSE
(12)         colValueFound := FALSE
(13)         rowValueFound := lookup_key(Tr, rowValue, rowValueIndex)
(14)         colValueFound := lookup_key(Tc, colValue, colValueIndex)
(15)         if rowValueFound = TRUE && colValueFound = TRUE then begin
(16)             Tct[rowValueIndex][colValueIndex]
                     := Tct[rowValueIndex][colValueIndex] + Tt[row][countCol]
(17)         end
(18)         if rowValueFound = TRUE && colValueFound != TRUE then begin
(19)             colValueIndex := get_number_of_keys(Tc)
(20)             set_key_at(Tc, colValue, colValueIndex)
(21)             Tct[rowValueIndex][colValueIndex] := Tt[row][countCol]
(22)         end
(23)         if rowValueFound != TRUE && colValueFound = TRUE then begin
(24)             rowValueIndex := get_number_of_keys(Tr)
(25)             set_key_at(Tr, rowValue, rowValueIndex)
(26)             Tct[rowValueIndex][colValueIndex] := Tt[row][countCol]
(27)         end
(28)         if rowValueFound != TRUE && colValueFound != TRUE then begin
(29)             rowValueIndex := get_number_of_keys(Tr)
(30)             set_key_at(Tr, rowValue, rowValueIndex)
(31)             colValueIndex := get_number_of_keys(Tc)
(32)             set_key_at(Tc, colValue, colValueIndex)
(33)             Tct[rowValueIndex][colValueIndex] := Tt[row][countCol]
(34)         end
(35)     end
(36) end
(37) nRows := get_number_of_keys(Tr)
(38) nCols := get_number_of_keys(Tc)
(39) rowTotalIndex := nCols
(40) colTotalIndex := nRows
(41) for (row := 0; row < nRows; row++) do begin
(42)     for (col := 0; col < nCols; col++) do begin
(43)         Tct[colTotalIndex][col] := Tct[colTotalIndex][col] + Tct[row][col]
(44)         Tct[row][rowTotalIndex] := Tct[row][rowTotalIndex] + Tct[row][col]
(45)     end
(46)     Tct[colTotalIndex][rowTotalIndex]
             := Tct[colTotalIndex][rowTotalIndex] + Tct[row][rowTotalIndex]
(47) end
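For readers who prefer executable code, the hash-table scheme above can be sketched in Python, with dicts standing in for the hash tables Tr and Tc. This is an illustrative sketch of the idea, not the thesis implementation; all function and variable names here are our own.

```python
def contingency_table(rows, given_idx, pred_idx, count_idx):
    """Build a contingency table (with an extra totals row and totals
    column) from generalized records. Dicts tr/tc assign each distinct
    attribute value its first-seen index, mimicking set_key_at/lookup_key."""
    tr, tc = {}, {}            # value -> row/column index
    cells = {}                 # (row_index, col_index) -> accumulated count
    for rec in rows:
        r = tr.setdefault(rec[given_idx], len(tr))
        c = tc.setdefault(rec[pred_idx], len(tc))
        cells[(r, c)] = cells.get((r, c), 0) + rec[count_idx]
    n_rows, n_cols = len(tr), len(tc)
    # Materialize the table; the last row/column hold the totals.
    tct = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for (r, c), cnt in cells.items():
        tct[r][c] = cnt
    for r in range(n_rows):
        for c in range(n_cols):
            tct[n_rows][c] += tct[r][c]        # column totals
            tct[r][n_cols] += tct[r][c]        # row totals
        tct[n_rows][n_cols] += tct[r][n_cols]  # grand total
    return tct, tr, tc
```

Using dict.setdefault gives each new value the next free index in first-seen order, which is exactly the bookkeeping that set_key_at and get_number_of_keys perform in the pseudocode.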
4. Same as Algorithm 4.1.
5. Same as Algorithm 4.1.
4.7.3 Summary
In most cases, Algorithm PP S1HT performs better than Algorithm PP S2BT.
However, when the number of records is small and the Child/Parent ratio of the
concept hierarchy (the desired number of children under each parent node) is very
low (e.g., 2), the overhead of the hash table may offset its benefit, and PP S2BT
performs slightly better, although the difference is not significant. The detailed
performance study of these two algorithms is presented in Chapter 5.
4.8 Example and Experiment Results

4.8.1 Example
We will use the following simplified example to illustrate the pattern matching-based
predictive modeling method.
Suppose there is a public database available which stores salary information
about employees of software companies located in the province of British Columbia,
represented as Table 4.3.
EmpID  Occupation  Company   Age  Gender  Degree  Major       Addrcd    Exp  Salary
23568  205         InfoSoft  25   m       B.Sc.   C.S.        46789324  2    41000
97621  102         ABCTech   30   f       M.A.    Accounting  15679562  5    55000
...

Table 4.3: Database Table: SalaryInfo
Suppose a new graduate with a B.Sc. degree in Computing Science from Simon
Fraser University has just started to look for a job and wants to know how much
salary he could ask for from a potential employer. He expects to get a reasonably
clear answer from our predictive modeling system based on the information stored
in the database.
We assume that the hierarchy for each attribute has already been created by the
system administrator and can be browsed by the new graduate, so that he can specify
his request at the desired concept levels without being restricted to the primitive
concept level stored in the database. Suppose the total number of records in this
database is 10000, the database name is "Employee" and the table name is
"SalaryInfo". The new graduate is only interested in the attributes "Occupation",
"Age", "Degree", "Major", "Address", "Experience" and "Salary". He also feels that
it will be good enough if the result is based on at least 500 records in the table.
So he may specify his request in DMQL as follows:
use Employee
find prediction rule for "Salary"
from SalaryInfo
where Occupation = "Programmer" and Age = 25 and Degree = "B.Sc."
    and Major = "Computing Science" and Address = "Vancouver" and Experience = 1
in relevance to Occupation, Age, Degree, Major, Address, Experience
with support 500
Our predictive modeling system will first parse this query and generate a SQL
statement at the corresponding primitive concept level to collect the task-relevant
data set (referred to as the initial data set) from the specified database table. Then,
by attribute-oriented induction, it will generalize the initial data set to the specified
concept level and create a generalized data set as follows:
Occupation  Age  Degree  Major       Addrcd     Exp  Salary    Count
Programmer  25   B.Sc.   C.S.        Burnaby    1    20K-30K   58
Accountant  30   M.A.    Accounting  Vancouver  5    Over 50K  3
...

Table 4.4: Generalized Data Set
By default, the system generalizes the predictive attribute "Salary" to a concept
level in the hierarchy at which there are only 5 - 10 distinct values. The user can
also specify in a DMQL query how many distinct values he wants for the predictive
attribute.
Then the system will find all the records which match the specified record (referred
to as the target record) < Occupation = "Programmer", Age = 25, Degree =
"B.Sc.", Major = "Computing Science", Address = "Vancouver", Experience = 1 >
and add up the corresponding counts. If the total count is greater than or equal to the
specified support threshold, the system will calculate the salary distribution based on
these records and output the result.
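The matching-and-distribution step just described can be sketched as follows. Records are represented as dicts with a Count column, as in Table 4.4; the function name and record layout are illustrative, not the DBMiner implementation.

```python
def salary_distribution(records, target, threshold):
    """Sum the counts of generalized records that match the target pattern.
    If the support threshold is met, return the distribution over the
    predictive attribute "Salary"; otherwise return None, signalling that
    the caller must generalize further."""
    matched = [r for r in records
               if all(r[k] == v for k, v in target.items())]
    support = sum(r["Count"] for r in matched)
    if support < threshold:
        return None
    dist = {}
    for r in matched:
        dist[r["Salary"]] = dist.get(r["Salary"], 0) + r["Count"]
    # Normalize counts into probabilities.
    return {sal: cnt / support for sal, cnt in dist.items()}
```

A target record here is just the subset of attribute/value pairs the user specified; any attribute absent from the target is left unconstrained.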
In this example, we assume the total count at this point is only 350, which is less
than the specified support threshold. Thus the system needs to generalize further
in order to meet the support requirement. Before the generalization, the system
needs to compare the relevance between each descriptive attribute and the predictive
attribute and then select the least relevant attribute to generalize. It first generates
a set of contingency tables from the generalized data set; each table contains one
descriptive attribute and the predictive attribute, "Salary" in this case.
We will use the following contingency table as an example to illustrate how the
system calculates the relevance between a descriptive attribute ("Occupation" in
this case) and the predictive attribute.
Occupation \ Salary  < 20K  20K-30K  30K-40K  40K-50K  > 50K  Row Total
Programmer           100    400      500      2000     1000   4000
Accountant           5      10       10       20       5      50
QA                   300    2400     200      95       5      3000
Manager              0      0        20       80       50     150
TechnicalSupport     800    1100     80       20       0      2000
SalesStaff           500    250      30       18       2      800
Column Total         1705   4160     840      2233     1062   10000

Table 4.5: Contingency Table for Attribute Occupation and Salary.
The system will calculate the relevance of attribute "Occupation" and attribute
"Salary" based on the above contingency table using the Chi Square method which
we introduced in Section 4.4:

P = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{a_{ji}^2}{a_{j,n+1} \, a_{m+1,i}}

In this case n is equal to the number of values of the attribute "Salary" and m is
equal to the number of values of the attribute "Occupation", so n = 5 and m = 6:

P = \sum_{i=1}^{5} \sum_{j=1}^{6} \frac{a_{ji}^2}{a_{j6} \, a_{7i}} = 1.85

The Chi Square \chi^2 is

\chi^2 = N \cdot P - N = 8500

\phi^2 = \frac{\chi^2}{N} = 0.85

T^2 = \frac{\phi^2}{\sqrt{(s-1)(t-1)}} = 0.19

where s = 6 and t = 5 are the numbers of rows and columns of the contingency table.
Thus

T = 0.43

So the relevance between the attribute "Occupation" and the attribute "Salary"
is 0.43.
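The computation can be checked numerically. The sketch below recomputes P, the Chi Square, and Tschuprow's T from the cell counts of Table 4.5 using the Section 4.4 formulas; exact arithmetic gives a Chi Square near 8509 and T near 0.436, which the text rounds to 8500 and 0.43. The function name is ours, not from the thesis code.

```python
import math

# Cell counts from Table 4.5: rows are Occupation values, columns are
# the five Salary ranges (<20K, 20K-30K, 30K-40K, 40K-50K, >50K).
table = [
    [100,  400, 500, 2000, 1000],  # Programmer
    [  5,   10,  10,   20,    5],  # Accountant
    [300, 2400, 200,   95,    5],  # QA
    [  0,    0,  20,   80,   50],  # Manager
    [800, 1100,  80,   20,    0],  # TechnicalSupport
    [500,  250,  30,   18,    2],  # SalesStaff
]

def relevance(a):
    """Tschuprow's T for an s x t contingency table of raw counts."""
    s, t = len(a), len(a[0])
    row_tot = [sum(r) for r in a]
    col_tot = [sum(r[i] for r in a) for i in range(t)]
    n = sum(row_tot)
    # P = sum of a_ji^2 / (row total * column total)
    p = sum(a[j][i] ** 2 / (row_tot[j] * col_tot[i])
            for j in range(s) for i in range(t))
    chi2 = n * p - n
    t_sq = (chi2 / n) / math.sqrt((s - 1) * (t - 1))
    return math.sqrt(t_sq)
```

Note that T is normalized by the table dimensions, so relevance values for attributes with different numbers of distinct values remain comparable.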
Then the system will follow the same steps to calculate the relevance of the other
descriptive attributes. We assume the results are as follows: the relevance between
"Degree" and "Salary" is 0.32, between "Experience" and "Salary" is 0.25, between
"Major" and "Salary" is 0.20, between "Age" and "Salary" is 0.18, and between
"Address" and "Salary" is 0.05.
Since "Address" is the least relevant attribute, the system will further generalize
this attribute to a higher concept level and get another generalized data set:
Occupation  Age  Degree  Major       Addrcd            Exp  Salary    Count
Programmer  25   B.Sc.   C.S.        Lower Mainland    1    20K-30K   75
Accountant  45   M.A.    Accounting  Vancouver Island  5    Over 50K  1
...

Table 4.6: Another Generalized Data Set
The system will also adjust the target record accordingly, from < Occupation =
"Programmer", Age = 25, Degree = "B.Sc.", Major = "Computing Science", Address
= "Vancouver", Experience = 1 > to < Occupation = "Programmer", Age = 25,
Degree = "B.Sc.", Major = "Computing Science", Address = "Lower Mainland",
Experience = 1 >.
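The generalization step used here, climbing "Address" one level in the concept hierarchy and merging rows that thereby become identical, can be sketched as follows. The record layout and the hierarchy mapping are illustrative assumptions, not the thesis code.

```python
def generalize(records, attr, parent_of):
    """Replace each value of `attr` by its parent concept (values absent
    from the mapping are kept as-is) and merge rows that become identical,
    summing their Count fields."""
    merged = {}
    for r in records:
        key = tuple((k, parent_of.get(v, v) if k == attr else v)
                    for k, v in r.items() if k != "Count")
        merged[key] = merged.get(key, 0) + r["Count"]
    return [dict(key, Count=cnt) for key, cnt in merged.items()]
```

Because identical rows are merged and their counts added, each generalization step shrinks the table while preserving the total record count, which is what lets the support of the target record grow.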
Then the system will find all the records in the further generalized data set that
match the target record and add up the corresponding counts. In this example, we
assume the total count is now 550, which passes the support threshold, so it is not
necessary to generalize the data set further. The system will then calculate the salary
distribution based on the matching records in the latest generalized data set and
output the result shown in Table 4.7.
Salary     Probability
Under 20K  10%
20K-30K    25%
30K-40K    60%
40K-50K    4%
Over 50K   1%

Table 4.7: Prediction Result For the Example
So the new graduate gets a relatively clear picture of how much salary he can
earn as a programmer working in British Columbia.
If the total count were still less than the specified support threshold after
generalizing the attribute "Address" to the higher concept level, the system would
repeat the previous steps until either the support threshold is passed or all the
attributes are generalized to their top concept levels. In the latter case, the user has
to lower the specified support threshold and resubmit his request.
If the user is not satisfied with the result, he can adjust his request and resubmit
the query. In our example, since the result is quite satisfactory to the new graduate,
he does not have to submit another query.
4.8.2 Experiment Results
We implemented the above method and integrated it into DBMiner to test it against
relational databases. Here is one example:
Input: A DMQL query
use NSERC95
find prediction rule for "amount"
from award A, organization O
where O.org_code = A.org_code and A.discd = "Computer"
    and province = "British Columbia"
in relevance to discd, amount, province
with support 100
The NSERC95 database is used in this experiment; it contains all the NSERC grant
allocation information for 1995 in Canada. Notice that the specified value
"Computer" for the attribute "discd" is a high-level concept which is not stored in
the NSERC database. The system has to consult the concept hierarchy to decide
which set of data is task-relevant. By specifying the above query, we want to know
how much grant money a professor in the computer science department of a
university located in British Columbia can get. With the support threshold set to
100, we get the following result:
discd     province          awd ($)  probability (in %)
Computer  British Columbia  0-20K    57.00
Computer  British Columbia  20K-40K  30.00
Computer  British Columbia  40K-60K  8.00
Computer  British Columbia  60K-     5.00

Table 4.8: Prediction Result
The result shows that for a professor in the computer science department of a
university located in British Columbia, the probability of getting $0-20K from
NSERC is 57%. For $20K-40K, $40K-60K and over $60K, the probabilities are 30%,
8% and 5% respectively.
4.9 Summary
In this chapter, we first introduced the idea of the pattern matching-based predictive
modeling approach. Then we defined the problem we are going to solve with this
approach. Two basic concepts, Support and Support Threshold, were also defined. In
Section 4.3, we discussed our motivation for integrating attribute-oriented induction
and relevance analysis into our pattern matching-based predictive modeling approach.
In Section 4.4, we introduced the Chi Square-based statistical method for calculating
the relevance between two attributes and gave a few examples to illustrate it. Then
we presented the general ideas of our approach in Section 4.5. In Section 4.6, we
examined the method we proposed for pattern matching-based predictive modeling;
the rationale of the algorithm and its computational complexity were also discussed.
Two variations of the algorithm were examined in Section 4.7. A detailed example
was given in Section 4.8, followed by some experiment results produced by DBMiner.
Our discussion in this chapter shows that the pattern matching-based predictive
modeling approach is useful and feasible for solving real-world problems. In Chapter
5, we will present the details of our performance study, which shows it is also efficient.
Chapter 5
Performance Study
5.1 Introduction
In this chapter, we will examine the performance of the two predictive modeling
approaches we discussed in the previous chapters. In order to study the performance,
we implemented these two algorithms and tested them on an IBM-compatible PC
with a Pentium II 300 MHz processor and 128 MB of main memory. We also
implemented two programs to generate our testing databases and the hierarchies
which we needed for generalization: a database generator, which generates our
synthetic databases for testing, and a hierarchy generator, which generates a
hierarchy for each attribute in our testing databases.
The database generator takes the following parameters: the desired number of
records, the number of attributes, and the name and the number of distinct values
of each attribute. It then generates a synthetic database using a random record
generation program which is described in Appendix A. The hierarchy generator
takes three parameters: the maximum number of generations, the Child/Parent
ratio, which specifies the desired number of children under each parent node, and
the value list of an attribute. This program is described in Appendix B.
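From the parameters just listed, the two generators can be sketched roughly as below. This is only a guess at their shape based on the stated parameters (the actual programs are given in Appendices A and B); the names are illustrative, and the maximum-number-of-generations parameter is omitted from the hierarchy sketch for brevity.

```python
import random

def make_database(n_records, attr_cardinalities, seed=0):
    """Generate a synthetic table: each record picks one of the distinct
    values of every attribute uniformly at random."""
    rng = random.Random(seed)
    return [[rng.randrange(card) for card in attr_cardinalities]
            for _ in range(n_records)]

def make_hierarchy(values, ratio):
    """Group leaf values under synthetic parents, `ratio` children per
    parent, repeating level by level until a single root remains."""
    parent_of, level, next_id = {}, list(values), 0
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), ratio):
            p = f"node{next_id}"
            next_id += 1
            for child in level[i:i + ratio]:
                parent_of[child] = p
            parents.append(p)
        level = parents
    return parent_of
```

A higher Child/Parent ratio makes each level of the hierarchy coarser, which is why, in the experiments below, larger ratios let a generalized table shrink in fewer generalization steps.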
The remaining part of this chapter is organized as follows. In Section 5.2, we
will present a performance study of our classification-based method. This section is
divided into six subsections; in each of the first five, we will demonstrate how a
particular factor affects the performance, and a summary is given in the last
subsection. The performance study of the pattern matching-based method is given
in Section 5.3, which is divided into seven subsections; similarly, in each subsection
except the last one, we will study how a particular factor affects the performance of
the pattern matching-based method. A brief summary is also given in the final
subsection. The whole chapter is summarized in Section 5.4.
5.2 Performance Study on Classification-Based Predictive Modeling Method
As we discussed in Chapter 3, the performance of the classification-based method is
affected by several factors such as the number of records, the classification threshold,
the number of attributes, etc. In this chapter, we will take a closer look at how
these factors actually affect the performance in our experiments, which are discussed
in detail in the following subsections. We also want to compare the performance of
the two variations of this method to examine which one is better, and under what
situations.
5.2.1 Scale-Up Performance
Figure 5.1 shows a scale-up experiment we did for our classification-based predictive
modeling method. In this experiment, we wanted to learn how the total number of
records affects the performance of this algorithm with the same request, all the other
variables being the same. The parameter values used to generate our synthetic
database are as follows. Number of attributes: 6 (including 5 descriptive attributes
and 1 predictive attribute). The numbers of distinct values for the attributes are:
100 for the first descriptive attribute, 80 for the second, 50 for the third, 20 for the
fourth, 15 for the fifth, and 5 for the predictive attribute. The parameter values
used to generate the hierarchies are: the maximum number of generations is 5 and
the Child/Parent ratio is 5. The classification threshold is 100%. The numbers of
records we use are 10000, 20000, 40000, 80000, and 160000. The experimental result
is shown in Figure 5.1, which shows that both variations scale up reasonably well
but PC STGS2 is better than PC USGS1.

Figure 5.1: Scale-Up Performance
5.2.2 Performance Study on Different Classification Thresholds
Figure 5.2 shows the performance study we did for our classification-based predictive
modeling method on the same set of data but with different classification thresholds.
In this experiment, we wanted to learn how different classification thresholds affect
the performance of this algorithm with the same request, all the other variables
remaining the same. The parameter values used to generate our testing database
are as follows. Number of attributes: 6 (including 5 descriptive attributes and 1
predictive attribute). The numbers of distinct values for the attributes are: 100 for
the first descriptive attribute, 80 for the second, 50 for the third, 20 for the fourth,
15 for the fifth, and 5 for the predictive attribute. The parameter values used to
generate the hierarchies are: the maximum number of generations is 5 and the
Child/Parent ratio is 5. The total number of records is 20000. The classification
thresholds we use are 10%, 20%, 40%, 60%, 80% and 100%.
The experimental result is shown in Figure 5.2, which shows that as the
classification threshold increases, the running time also increases, because more and
more nodes whose parent nodes cannot pass the increased classification threshold
have to be constructed, and information gain also has to be calculated at these
nodes' parents in order to select the next branching attribute. From Figure 5.2, we
can also see that after a certain threshold, the time increase flattens, because in this
testing scenario very few nodes can actually pass that classification threshold, so
the number of nodes which remain unconstructed at that threshold is limited; thus
the time increase is also limited when we further increase the classification
threshold. Similarly, when the classification threshold is below a certain point, the
time increase is limited, because the number of nodes which can pass the lowest
classification threshold but cannot pass this classification threshold is limited, and
thus the associated cost increase is limited. The above observation is valid for both
PC STGS2 and PC USGS1, but it is more obvious for PC USGS1 in Figure 5.2.

Figure 5.2: Performance Study on Different Classification Thresholds
5.2.3 Performance Study on Different Numbers of Attributes
Figure 5.3 shows the performance study we did for our classification-based predictive
modeling method on the same number of records but with different numbers of
attributes. In this experiment, we wanted to learn how different numbers of attributes
affect the performance of this algorithm with a similar request, all the other variables
being the same. The parameter values for our synthetic database are as follows:
1. Case 1: Number of attributes: 4 (including 3 descriptive attributes and 1
predictive attribute). The number of distinct values for each attribute is as
follows: 50 for all the descriptive attributes and 5 for the predictive attribute.
2. Case 2: Number of attributes: 8 (including 7 descriptive attributes and 1
predictive attribute). The number of distinct values for each attribute is as
follows: 50 for all the descriptive attributes and 5 for the predictive attribute.
3. Case 3: Number of attributes: 12 (including 11 descriptive attributes and 1
predictive attribute). The number of distinct values for each attribute is as
follows: 50 for all the descriptive attributes and 5 for the predictive attribute.
4. Case 4: Number of attributes: 16 (including 15 descriptive attributes and 1
predictive attribute). The number of distinct values for each attribute is as
follows: 50 for all the descriptive attributes and 5 for the predictive attribute.
We use the same value distribution to avoid possible distortion of the experimental
result due to different value distributions for each case.

Figure 5.3: Performance Study on Different Numbers of Attributes

The parameter values for the hierarchy generator are: the maximum number of
generations is 5 and the Child/Parent ratio is 5. The total number of records is
10000. The classification threshold we use is 100%. The experimental result is
shown in Figure 5.3, which shows that as the number of attributes in the database
increases, the running time also increases, primarily due to the increased overhead
of selecting the branching attribute at each node and of constructing more nodes at
additional levels of the decision tree. Figure 5.3 also shows that PC STGS2 is
better than PC USGS1.
5.2.4 Performance Study on Different Numbers of the Predictive Attribute Values

Figure 5.4: Performance Study on Different Numbers of the Predictive Attribute Values

Figure 5.4 shows the performance study we did for our classification-based predictive
modeling method on the same set of data but with different numbers of predictive
attribute values. In this experiment, we wanted to learn how different numbers of
predictive attribute values affect the performance of this algorithm with the same
request, all the other variables being the same. The parameter values for our
synthetic database are as follows. Number of attributes: 6 (including 5 descriptive
attributes and 1 predictive attribute). The numbers of distinct values for the
descriptive attributes are: 100 for the first, 80 for the second, 50 for the third, 20
for the fourth and 15 for the fifth. The parameter values for the hierarchy generator
are: the maximum number of generations is 5 and the Child/Parent ratio is 5. The
total number of records is 10000. The classification threshold we use is 100%. The
numbers of predictive attribute values we use are: 10, 20, 40, 80 and 160.
The experimental result is shown in Figure 5.4, from which we can see that as
the number of predictive attribute values increases, the running time also increases,
due to the overhead involved in calculating the value distribution for the predictive
attribute and the information gain at each node. The testing result also shows that
PC STGS2 does better than PC USGS1.
5.2.5 Performance Study on Different Numbers of the Descriptive Attribute Values

Figure 5.5: Performance Study on Different Numbers of the Descriptive Attribute Values

Figure 5.5 shows the performance study we did for our classification-based predictive
modeling method on the same number of records but with different numbers of
descriptive attribute values. In this experiment, we wanted to learn how different
numbers of descriptive attribute values affect the performance of this algorithm
with similar requests, all the other variables being the same. The parameter values
for our synthetic database are as follows. Number of attributes: 6 (including 5
descriptive attributes and 1 predictive attribute). The parameter values for the
hierarchy generator are: the maximum number of generations is 5 and the
Child/Parent ratio is 5. The total number of records is 20000. The classification
threshold we use is 100%. The number of predictive attribute values we use is 10.
The numbers of distinct values for each descriptive attribute in the four testing
cases are 20, 40, 80 and 160. We use the same value distribution (all descriptive
attributes have the same number of distinct values in each case) to avoid possible
distortion of the experimental result due to different value distributions in each case.
The experimental result is shown in Figure 5.5, from which we can see that as
the number of descriptive attribute values increases, the running time also increases,
due to the overhead involved in constructing more branches. The result also shows
that PC STGS2 does better than PC USGS1.
5.2.6 Summary
The above experimental results demonstrate that in most cases, PC STGS2 is better
than PC USGS1, as shown by the performance curves. The reason is that in
PC STGS2, the target tables become smaller and smaller from the parent generations
to the child generations, so the cost of extracting subtables is lower than in
PC USGS1. However, when the number of records is small, the classification
threshold is very low, and the number of distinct values of the attributes is small,
the performance difference between these two variations is not significant. So we can
conclude that PC STGS2 can reasonably be counted on as the better choice for the
classification-based predictive modeling method.
5.3 Performance Study on Pattern Matching-Based Method
As we discussed in Chapter 4, the performance of the pattern matching-based method
is affected by several factors such as the number of records, the support threshold,
the number of attributes, etc. In the following sections, we will examine how these
factors actually affect the performance in our experiments. We also want to compare
the performance of the two variations of this method to examine which one is better,
and under what situations.
5.3.1 Scale-Up Performance

Figure 5.6: Scale-Up Performance

Figure 5.6 shows the scale-up experiment we did for our pattern matching-based
predictive modeling method. In this experiment, we wanted to learn how the total
number of records affects the performance of this algorithm with the same request,
all the other variables being the same. The parameter values for our synthetic
database are as follows. Number of attributes: 6 (including 5 descriptive attributes
and 1 predictive attribute). The numbers of distinct values for the attributes are:
100 for the first descriptive attribute, 80 for the second, 50 for the third, 20 for the
fourth, 15 for the fifth, and 10 for the predictive attribute. The parameter values
for the hierarchy generator are: the maximum number of generations is 5 and the
Child/Parent ratio is 5. The support threshold is 30% of the total number of records
in our synthetic database. The numbers of records we use are 20000, 40000, 80000,
and 160000. The experimental result is shown in Figure 5.6, which demonstrates
that both variations scale up reasonably well as the number of records in the
database increases, but PP S1HT is better than PP S2BT.
5.3.2 Performance Study on Different Support Thresholds
Figure 5.7 shows the performance study we did for our pattern matching-based
predictive modeling method on the same set of data but with different support
thresholds. In this experiment, we wanted to learn how different support thresholds
affect the performance of this algorithm with the same request, all the other
variables being the same. The parameter values for our synthetic database are as
follows. Number of attributes: 6 (including 5 descriptive attributes and 1 predictive
attribute). The numbers of distinct values for the attributes are: 100 for the first
descriptive attribute, 80 for the second, 50 for the third, 20 for the fourth, 15 for
the fifth, and 10 for the predictive attribute. The parameter values for the hierarchy
generator are: the maximum number of generations is 5 and the Child/Parent ratio
is 2. The total number of records is 10000. The support thresholds we use are 5,
20, 80 and 320.

Figure 5.7: Performance Study on Different Support Thresholds

The experimental result is shown in Figure 5.7, from which we can see that as
the support threshold increases, the running time also increases, since the data set
needs to be generalized further in order to pass the increased support threshold.
Meanwhile, we can also notice that when the support threshold is increased from 5
to 20, the running time increases much more than when it is increased from 20 to
80. The reason is that the initial data set is much larger than the generalized data
set produced after several generalizations, so more time is required at the beginning.
However, once the initial data set has been generalized to relatively high levels, the
number of records in the generalized table becomes much smaller, so further
generalization takes less and less time to complete given a higher support threshold.
This is demonstrated clearly in Figure 5.7 by the flattened curve after the support
threshold exceeds 80. The result also shows that PP S1HT is better than PP S2BT.
5.3.3 Performance Study on Different Child/Parent Ratios

Figure 5.8: Performance Study on Different Child/Parent Ratios

Figure 5.8 shows the performance study we did for our pattern matching-based
predictive modeling method on the same set of data but with different Child/Parent
ratios in the hierarchies. In this experiment, we wanted to learn how different
Child/Parent ratios affect the performance of this algorithm with the same request,
all the other variables being the same. The parameter values for our synthetic
database are as follows. Number of attributes: 6 (including 5 descriptive attributes
and 1 predictive attribute). The numbers of distinct values for the attributes are:
100 for the first descriptive attribute, 80 for the second, 50 for the third, 20 for the
fourth, 15 for the fifth, and 10 for the predictive attribute. The parameter value for
the hierarchy generator is: the maximum number of generations is 5. The total
number of records is 10000 and the support threshold is 30% of the total number of
records in our synthetic database. The Child/Parent ratios we use are 2, 3, 4, 5 and
6. The experimental result is shown in Figure 5.8, from which we can see that as
the Child/Parent ratio increases, the running time decreases: each generalization
generalizes more records with a larger Child/Parent ratio, so fewer generalizations
are needed to pass the given support threshold, and each successive generalization
takes less time because the table shrinks faster with higher Child/Parent ratios.
The testing result also shows that, in general, PP S1HT does better than PP S2BT.
However, when the Child/Parent ratio is very low and the number of records is
small, PP S2BT does slightly better than PP S1HT, but the performance difference
is not significant.
5.3.4 Performance Study on Dierent Numbers of Attributes

Figure 5.9: Performance Study on Di erent Numbers of Attributes


Figure 5.9 shows the performance study we did for our pattern matching-based pre-
dictive modeling method on the same number of records but with di erent numbers of
attributes. In this experiment, we wanted to learn how di erent numbers of attributes
a ect the performance of this algorithm with similar requests, all the other variables
being the same. The parameter values for our synthetic database are as follows:
1. Case 1: Number of attributes: 3 (2 descriptive attributes and 1 predictive attribute).
2. Case 2: Number of attributes: 4 (3 descriptive attributes and 1 predictive attribute).
3. Case 3: Number of attributes: 5 (4 descriptive attributes and 1 predictive attribute).
4. Case 4: Number of attributes: 6 (5 descriptive attributes and 1 predictive attribute).
5. Case 5: Number of attributes: 7 (6 descriptive attributes and 1 predictive attribute).
6. Case 6: Number of attributes: 8 (7 descriptive attributes and 1 predictive attribute).
In every case, each descriptive attribute has 50 distinct values and the predictive
attribute has 10.
We use the same value distribution to avoid possible distortion of the experimental
result due to different value distributions for each case.
The parameter values for the hierarchy generator are: the maximum number of
generations is 5 and the Child/Parent ratio is 5. The total number of records is 10000.
The support threshold is 30% of the total number of records in our synthetic database.
The experimental result is shown in Figure 5.9, which shows that, as the number of
attributes increases, the running time also increases, primarily due to the increased
overhead of selecting the generalizing attribute before each generalization. The result
also shows that PP S1HT does better than PP S2BT.

5.3.5 Performance Study on Different Numbers of Predictive Attribute Values
Figure 5.10 shows the performance study we did for our pattern matching-based
predictive modeling method on the same set of data but with different numbers of
predictive attribute values. In this experiment, we wanted to learn how different
numbers of predictive attribute values affect the performance of this algorithm with
the same request, all the other variables being the same. The parameter values for our
synthetic database are as follows. Number of attributes: 6 (including 5 descriptive
attributes and 1 predictive attribute). The number of distinct values for each descriptive
attribute is as follows: 100 for the first descriptive attribute, 80 for the second, 50 for
the third, 20 for the fourth, and 15 for the fifth. The parameter values for the hierarchy
generator are: the maximum number of generations is 5 and the Child/Parent ratio is
5. The total number of records is 10000. The support threshold is 30% of the total
number of records in our synthetic database. The numbers of predictive attribute
values we use are 10, 20, 40, 80 and 160. The experimental result is shown in
Figure 5.10, from which we can see that, as the number of predictive attribute values
increases, the running time also increases, due to the overhead involved in calculating
the relevance of each descriptive attribute. The result also shows that PP S1HT is
better than PP S2BT.

Figure 5.10: Performance Study on Different Numbers of Predictive Attribute Values

5.3.6 Performance Study on Different Numbers of Descriptive Attribute Values
Figure 5.11 shows the performance study we did for our pattern matching-based
predictive modeling method on the same number of records but with different numbers
of descriptive attribute values. In this experiment, we wanted to study how different
numbers of descriptive attribute values affect the performance of this algorithm with
similar requests, all the other variables being the same. The parameter values for
our synthetic database are as follows. Number of attributes: 6 (including 5 descriptive
attributes and 1 predictive attribute). The parameter values for the hierarchy
generator are: the maximum number of generations is 5 and the Child/Parent ratio is
5. The total number of records is 10000. The support threshold is 30% of the total
number of records in our synthetic database. The number of predictive attribute
values we use is 10. The number of distinct values for each descriptive attribute in
the five testing cases is 20, 40, 80, 120 and 160, respectively. We use the same value
distribution (all descriptive attributes have the same number of distinct values in
each case) to avoid possible distortion of the experimental result due to different
value distributions in each case. The experimental result is shown in Figure 5.11,
from which we can see that, as the number of descriptive attribute values increases,
the running time also increases, mainly due to the overhead involved in calculating
the relevance of each descriptive attribute. This test also shows that PP S1HT is
better than PP S2BT.

Figure 5.11: Performance Study on Different Numbers of Descriptive Attribute Values

5.3.7 Summary
In this section, we have studied the performance of the two variations of the pattern
matching-based method. In most cases, PP S1HT is better than PP S2BT, but in
a few cases where the number of records is small and the Child/Parent ratio is very
low, the overhead related to the hash table more than offsets its benefit and PP S2BT
is slightly better than PP S1HT; even then, the performance difference is not
significant. So we conclude that PP S1HT can reasonably be counted on as the better
choice for this method.

5.4 Summary of the Performance Study


In this chapter, we presented the performance study we did for both the classification-based
and the pattern matching-based predictive modeling methods. In Sections 5.2 and
5.3 we examined the performance of both methods in different situations, such as
different numbers of records, different numbers of attributes, and different numbers of
predictive and descriptive attribute values. Our study shows that both methods
are efficient and scale up reasonably well against large databases.
Chapter 6
Conclusion and Future Work
In this chapter, we will first summarize the major work we did in this thesis and
discuss the conclusions we draw from our study. Then we will briefly discuss and
compare the two predictive modeling approaches we proposed. Finally, we will discuss
some interesting future research problems.

6.1 Summary and Conclusions


Knowledge Discovery in Databases has become more and more important due to the
fast growth of the data stored in databases. Such huge amounts of data are far beyond
the capability of traditional data analysis methods, so new techniques need to be
developed to assist humans in extracting valuable knowledge from such large amounts
of data. Much fruitful research has been conducted in this wide-open field, resulting
in many creative techniques and prototypes. Attribute-Oriented Induction is one
such technique: it integrates a machine learning paradigm [64] with database
operations, extracts generalized rules, and discovers high-level data regularities such
as characteristic rules from large databases.
Predictive Modeling is one of the major tasks of knowledge discovery in databases
and is used extensively to help people solve real-world problems such as credit
approval, financial forecasting, market trend analysis, and customer behavior study.

The major methods adopted in past research include statistics-, classification-, and
neural net-based methods. All of these major methods were introduced, with examples,
in Chapter 2 to give the reader an overall picture of the predictive modeling research
conducted in both academia and industry.
Based on attribute-oriented induction and one popular machine learning method,
ID3, we developed a classification-based method to perform predictive modeling
on data stored in large databases. This method can extract predictive modeling
rules at different concept levels and makes it possible for users to get knowledge at
the desired concept levels, without the limitations of the original ID3 method, such as
its inability to handle large data sets and continuous numerical values. We changed the
termination condition, which makes the method realistic for large databases. Since
the number of distinct values for each attribute can be reduced by attribute-oriented
induction, we avoid the problem caused by the tendency of the original ID3 method
to favor attributes with a large number of distinct values.
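The attribute-selection heuristic at the heart of ID3 is information gain: at each node, split on the attribute whose partition most reduces the entropy of the class attribute. The following is a minimal illustration only, not the thesis implementation, which combines this criterion with attribute-oriented induction and a modified termination condition:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    """Entropy reduction obtained by partitioning `rows` on column `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attrs):
    """ID3 splits on the attribute with the highest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, a, labels))
```

ID3 recurses on each partition produced by `best_attribute`; since attribute-oriented induction first reduces the number of distinct values per attribute, the well-known bias of this criterion toward many-valued attributes is mitigated.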
By integrating attribute-oriented induction with a widely adopted statistical method,
we proposed a pattern matching-based predictive modeling method to predict data
values or value distributions of the predictive attribute, based on similar groups of
data in the database. For example, one may predict the amount of research grant
that an applicant may receive, based on the data about similar groups of researchers.
Thanks to the integration with attribute-oriented induction, this method
can also be applied at multiple concept levels.
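At its core, the prediction step amounts to locating the group of (possibly generalized) records that matches the query pattern and reporting the distribution of the predictive attribute within that group. A simplified sketch, omitting the generalization and relevance-analysis machinery; the attribute names and values below are made up for illustration:

```python
from collections import Counter

def predict_distribution(records, pattern, predictive_attr):
    """Distribution of `predictive_attr` over the records matching `pattern`,
    where `pattern` maps descriptive attributes to required values."""
    matched = [r for r in records
               if all(r.get(a) == v for a, v in pattern.items())]
    counts = Counter(r[predictive_attr] for r in matched)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else {}

# hypothetical research-grant data
researchers = [
    {"field": "CS", "rank": "assistant", "grant": "20-40k"},
    {"field": "CS", "rank": "assistant", "grant": "20-40k"},
    {"field": "CS", "rank": "assistant", "grant": "40-60k"},
    {"field": "Biology", "rank": "full", "grant": "60-80k"},
]
dist = predict_distribution(researchers, {"field": "CS", "rank": "assistant"}, "grant")
```

Here `dist` reports that, among the matching group, two thirds received a 20-40k grant and one third a 40-60k grant.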
Both of our proposed methods have been implemented, and the performance study
and experiments we did showed that both algorithms work efficiently against large
databases.
However, if a request involves a very large data set, it may take nontrivial time
just to retrieve the data. In such cases, it would be desirable to precalculate some
models and store them for future use, so that the result can be given to the user in a
timely manner regardless of the size of the data set involved in the queries. However,
since the data stored in databases evolve, in many cases at a fast pace, we need to find
efficient ways to automatically and incrementally update the precalculated models,
which we believe is not a trivial problem.


In the two predictive modeling approaches we proposed, we assume that the data
contained in the databases have been cleaned by some preprocess, so that there are no
missing, null or invalid values in the data. However, it would be desirable for our
methods to handle data with missing, null or invalid values, which is within our future
research scope.
By integrating with attribute-oriented induction, our study broadens the scope of
predictive modeling to multiple concept levels; predictive modeling is no longer limited
to a single primitive concept level. Our study also demonstrates that predictive
modeling on large databases is feasible and can be efficient.

6.2 Comparison of the Two Proposed Predictive Modeling Approaches
In this thesis, we proposed two different predictive modeling approaches. The first
approach, discussed in Chapter 3, is derived from a decision tree-based
classification method. The second approach, examined in Chapter 4, is
based on a pattern matching method. There are some similarities between these
two approaches. First, both approaches leverage the power of the attribute-oriented
induction technique. Second, both approaches can be effectively applied to large
databases, as demonstrated in the performance study and the experiments we
did. Despite these similarities, the two approaches have some clear differences:
• Usually the user of the classification-based predictive modeling method is more
interested in a group of objects, and thus this method is better suited for
answering generic questions such as "What could be the possible salary ranges for
managers, programmers and accountants in Canada?", which may give the user
an overall picture of the underlying data and help the user make appropriate
decisions based on the result. For instance, a human resource consultant may
tell a company what salary ranges it may offer to its new hires based on
the result. On the other hand, the user of the pattern matching-based method
is often interested in one particular object, and thus this method is better suited
for answering specific questions such as "What is the possible salary for a
programmer with an M.S. degree in computing science, 3 years of experience, living
in Vancouver?", which may give the user an answer more tailored to his
particular situation. For example, based on the result, a new computing science
graduate may decide what salary range he can ask for during job interviews.
• The classification-based predictive modeling method is better suited for making
predictions at higher concept levels, due to the limitation of the ID3 method
discussed in Chapter 3. The pattern matching-based method has no
such limitation.
• The classification-based predictive modeling method is better suited for queries
based on fixed concept levels. Once the desired concept level is specified in
the query, the decision tree is constructed at that concept level and the final
result is at the same concept level. Although users can adjust the desired concept
level by modifying and resubmitting their requests, the entire decision tree has to
be rebuilt at the newly specified concept level. The pattern matching-based
method adjusts the concept level of each attribute automatically, based on the
specified support threshold and the relevance between each descriptive attribute
and the predictive attribute, so the final result may not be at the specified
concept level.
• The result of the classification-based method can be stored for future use. For
example, a credit card company may apply the rules derived from the
classification-based method to approve or reject new applicants. In most cases, the
result of the pattern matching-based method is more specific and thus is not
stored for future use.
These two approaches were proposed based on different real-world problems, and thus
each suits different requests. The decision to adopt one method over the other
should be made based on each user's particular needs. One interesting observation
is that, in some cases, they can be used as complementary methods. For example,
a credit card company may store some rules derived from the classification-based
method to process new applications. In most cases this may work well; however,
sometimes the stored rules may not cover a particular applicant. In that case, the
company may use the pattern matching-based method to predict the possible credit
rating of this particular applicant.

6.3 Future Work


In our pattern matching-based predictive modeling approach, we discussed the use of
relevance analysis to select the least relevant attribute as the next candidate to be
generalized. We may also integrate relevance analysis with other data mining
processes, such as classification-based predictive modeling and association rule finding.
For example, in classification-based predictive modeling, if we can use relevance
analysis to filter out some less relevant attributes before constructing the decision tree,
we can enhance the performance and, at the same time, make the decision tree
more concise, so the generated predictive rules will be further simplified.
However, further study is needed to examine how attribute filtering
will actually affect the result and what the filtering criteria should be in order to
achieve better performance with a similar or better result.
The second interesting problem is to perform predictive modeling on time-related
data, as in financial market trend forecasting, historical sales trend analysis, etc. We
believe an extension of pattern matching-based predictive modeling may be a promising
direction for such problems. However, more research needs to be done
to figure out how to define the patterns of time-related data, which criteria should be
used to determine whether two or more patterns match, and how to associate
background information with particular patterns to help find one or more matching
patterns.
One problem which also deserves some study in the future is that, in some cases,
two (or more) less relevant attributes combined together may be more relevant to the
predictive attribute than some other attributes. For example, the attributes "Province"
and "City" by themselves may be less relevant to the predictive attribute "House Size"
than the attribute "Occupation", but if we combine them together, as in "BC, West
Vancouver", they may become even more relevant than "Occupation". Since there
can be many combinations of descriptive attributes, we need further study to
figure out how to select the combination candidates and how they will affect the final
result.
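The phenomenon described above can be made concrete with a small, entirely hypothetical example: individually, neither attribute reduces the uncertainty about the predictive attribute, yet their combination determines it completely. Information gain is used here as an illustrative relevance measure, not necessarily the measure adopted in the thesis:

```python
import math
from collections import Counter

def gain(xs, ys):
    """Information gain of predictive values `ys` given descriptive values `xs`."""
    n = len(ys)
    def h(labels):
        cnt = Counter(labels)
        return -sum(c / len(labels) * math.log2(c / len(labels))
                    for c in cnt.values())
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return h(ys) - sum(len(g) / n * h(g) for g in groups.values())

a = ["p1", "p1", "p2", "p2"]               # e.g. Province
b = ["c1", "c2", "c1", "c2"]               # e.g. City
y = ["small", "large", "large", "small"]   # predictive attribute, XOR of a and b

print(gain(a, y))                 # -> 0.0: Province alone says nothing
print(gain(b, y))                 # -> 0.0: City alone says nothing
print(gain(list(zip(a, b)), y))   # -> 1.0: the pair determines the value
```

A naive per-attribute relevance filter would discard both attributes here, which is exactly why selecting combination candidates is a nontrivial problem.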
Another interesting problem is related to system implementation. For classification-based
predictive modeling, the decision tree building process can be very costly if
a very large data set (such as a retailer's transaction database) is involved. If
we can identify the queries which are used most often and store their results for future
use, it will substantially enhance the performance. However, since the data stored in
many operational databases are not static and in many cases evolve at a fast pace (a
retailer's transaction database, for example), we need to find efficient ways to update
the stored results, and ideally such updates should be incremental, which, we
believe, is not a trivial problem.
Appendix A
Database Generator
The following steps illustrate the algorithm we used to generate our testing
databases:

Input: the desired number of records, the number of attributes, the number of
distinct values for each attribute, and a name for each attribute.

1. Generate a value list for each attribute.
2. Create the database table and a hash table (used to detect duplicate records).
3. Point the row pointer at the first record in the table and the column pointer at
the first attribute.
4. Get a randomly generated value index for the current attribute, fetch the value
at that index from the current attribute's value list, and put the value in the
current attribute column of the current record.
5. If the current column is not the last column, advance the column pointer to the
next column and repeat step 4.
6. Check whether the new record is already in the existing table by looking for it
in the hash table. If it is found, regenerate the record (so that all generated
records are distinct); otherwise, put the new record into the hash table.
7. If the current record is not the last record, advance the row pointer to the next
record, reset the column pointer to the first attribute, and repeat from step 4;
otherwise, output the generated database table.
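The procedure above can be sketched in Python. This is a simplified illustration: the synthetic value strings and the `seed` parameter are invented for the example, and a Python set stands in for the hash table, whereas the thesis implementation writes into an actual database table.

```python
import random

def generate_table(num_records, attr_names, distinct_counts, seed=0):
    """Generate `num_records` distinct records; attribute i draws its value
    uniformly from a list of `distinct_counts[i]` synthetic values.
    Assumes the number of possible value combinations exceeds `num_records`,
    otherwise the duplicate-rejection loop cannot terminate."""
    rng = random.Random(seed)
    value_lists = [[f"{name}_v{i}" for i in range(n)]
                   for name, n in zip(attr_names, distinct_counts)]
    seen = set()          # plays the role of the hash table
    table = []
    while len(table) < num_records:
        record = tuple(rng.choice(values) for values in value_lists)
        if record not in seen:     # duplicates are regenerated
            seen.add(record)
            table.append(record)
    return table

rows = generate_table(100, ["A", "B", "C"], [50, 50, 10])
```

Each call yields a table of distinct records whose per-attribute value counts match the requested parameters.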


Appendix B
Hierarchy Generator
The following steps illustrate the algorithm we used to generate the hierarchies
used in the generalization process:

Input: the Child/Parent ratio, the maximum number of generations, and the value
list for the specified attribute.

1. Create all leaf nodes based on the value list of the given attribute.
2. Point the generation pointer at the first node of the leaf generation and set the
generation count to 1.
3. Create the parent generation of the current generation.
4. If there are not enough nodes in the parent generation to create its own parent
generation, create the root node and output the generated hierarchy.
5. Otherwise, increment the generation count. If the number of generations has
reached the maximum number of generations, create the root node and output
the generated hierarchy; otherwise, point the generation pointer at the first node
of the newly created generation and go back to step 3.
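A compact Python sketch of the same bottom-up construction. The exact "enough nodes" test and the synthetic parent labels are assumptions made for illustration; here merging stops when the next parent generation would contain fewer than two nodes:

```python
def build_hierarchy(leaf_values, ratio, max_generations):
    """Build a concept hierarchy bottom-up: every `ratio` sibling nodes get
    one synthetic parent, until too few nodes remain or the generation
    limit is reached; a single root node tops the hierarchy.
    Returns a child -> parent mapping."""
    parent_of = {}
    current = list(leaf_values)
    generation = 1
    while len(current) >= 2 * ratio and generation < max_generations:
        parents = []
        for i in range(0, len(current), ratio):
            name = f"g{generation}_n{i // ratio}"   # synthetic parent label
            for child in current[i:i + ratio]:
                parent_of[child] = name
            parents.append(name)
        current = parents
        generation += 1
    for node in current:            # the root tops whatever level remains
        parent_of[node] = "ROOT"
    return parent_of
```

For 50 leaf values with a Child/Parent ratio of 5 and a maximum of 5 generations, this produces two internal generations (10 nodes, then 2 nodes) below a single root.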


Bibliography
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of
items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management
of Data, pages 207-216, Washington, D.C., May 1993.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast
discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining,
pages 307-328. AAAI/MIT Press, 1996.
[3] R. Agrawal, M. Mehta, J. Shafer, R. Srikant, A. Arning, and T. Bollinger. The
Quest data mining system. In Proc. 1996 Int. Conf. Data Mining and Knowledge
Discovery (KDD'96), pages 244-249, Portland, Oregon, August 1996.
[4] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In
Proc. 1994 Int. Conf. Very Large Data Bases, pages 487-499, Santiago, Chile,
September 1994.
[5] T. Anand and G. Kahn. Opportunity explorer: Navigating large databases using
knowledge discovery templates. In Proc. AAAI-93 Workshop Knowledge Discovery
in Databases, pages 45-51, Washington DC, July 1993.
[6] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
[7] C. Apte and S. Hong. Predicting equity returns from securities data and minimal
rule generation. In Advances in Knowledge Discovery and Data Mining,
AAAI/MIT Press, 1995.
[8] C. Babcock. Parallel processing mines retail data. Computer World, 6, 1994.
[9] J. Berry. Database marketing. In Business Week, pages 56-62, 1994.
[10] L.B. Booker, D.E. Goldberg, and J.H. Holland. Classifier systems and genetic
algorithms. Artificial Intelligence, 40:235-282, 1989.


[11] R. Brachman, P. Selfridge, L. Terveen, B. Altman, F. Halper, T. Kirk, A. Lazar,
D. McGuinness, L. Resnick, and A. Borgida. Integrated support for data
archaeology. Inter. J. Intelligent and Cooperative Information Systems, 2:159-185,
June 1993.
[12] R. J. Brachman and T. Anand. The process of knowledge discovery in databases:
a first sketch. In Proc. AAAI'94 Workshop Knowledge Discovery in Databases
(KDD'94), pages 1-12, Seattle, WA, July 1994.
[13] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression
Trees. Wadsworth International Group, 1984.
[14] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression
Trees. Wadsworth, 1984.
[15] Y. Cai, N. Cercone, and J. Han. Attribute-oriented induction in relational
databases. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery
in Databases, pages 213-228. AAAI/MIT Press, 1991.
[16] P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned
data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery
and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
[17] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. AutoClass:
a Bayesian classification system. In Proc. Fifth Int. Conf. Machine Learning,
pages 54-64, San Mateo, California, 1988.
[18] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and
results. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
editors, Advances in Knowledge Discovery and Data Mining, pages 153-180.
AAAI/MIT Press, 1996.
[19] Y. Cheng and K. S. Fu. Conceptual clustering in knowledge organization. IEEE
Trans. Pattern Analysis and Machine Intelligence, 7:592-598, September 1985.
[20] C. Cortes, H. Drucker, D. Hoover, and V. Vapnik. Capacity and complexity
control in predicting the spread between borrowing and lending interest rates.
In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages
51-56, Montreal, Canada, August 1995.
[21] S. Dao and B. Perry. Applying a data miner to heterogeneous schema integration.
In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, pages 63-68,
Montreal, Canada, Aug. 1995.

[22] U. M. Fayyad, S. G. Djorgovski, and N. Weir. Automating the analysis and
cataloging of sky surveys. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining,
pages 471-493. AAAI/MIT Press, 1996.
[23] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.).
Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[24] D. Fisher. Improving inference through conceptual clustering. In Proc. 1987
AAAI Conf., pages 461-465, Seattle, Washington, July 1987.
[25] D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine
Learning, 2:139-172, 1987.
[26] D. Fisher. Optimization and simplification of hierarchical clusterings. In Proc.
1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 118-123,
Montreal, Canada, Aug. 1995.
[27] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Sonar: System for
optimized numeric association rules. In Proc. 1996 ACM-SIGMOD Int. Conf.
Management of Data, page 553, Montreal, Canada, June 1996.
[28] B. R. Gaines. Exception dags as knowledge structures. In Proc. AAAI'94 Workshop
Knowledge Discovery in Databases (KDD'94), pages 13-24, Seattle, WA,
July 1994.
[29] S. I. Gallant. Neural Network Learning and Expert Systems. Cambridge, MA:
MIT Press, 1993.
[30] M. J. Hagood. Statistics for Sociologists. Holt, 1952.
[31] J. Han, Y. Cai, and N. Cercone. Concept-based data classification in relational
databases. In 1991 AAAI Workshop Knowledge Discovery in Databases, pages
77-94, Anaheim, CA, July 1991.
[32] J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in
relational databases. IEEE Trans. Knowledge and Data Engineering, 5:29-40,
1993.
[33] J. Han and Y. Fu. Dynamic generation and refinement of concept hierarchies
for knowledge discovery in databases. In Proc. AAAI'94 Workshop Knowledge
Discovery in Databases (KDD'94), pages 157-168, Seattle, WA, July 1994.

[34] J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in
data mining. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
editors, Advances in Knowledge Discovery and Data Mining, pages 399-421.
AAAI/MIT Press, 1996.
[35] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu,
A. Rajan, N. Stefanovic, B. Xia, and O. R. Zaïane. DBMiner: A system for
mining knowledge in large relational databases. In Proc. 1996 Int. Conf. Data
Mining and Knowledge Discovery (KDD'96), pages 250-255, Portland, Oregon,
August 1996.
[36] J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaïane. DMQL: A data mining
query language for relational databases. In Proc. 1996 SIGMOD'96 Workshop
Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pages
27-34, Montreal, Canada, June 1996.
[37] J. Han, S. Nishio, H. Kawano, and W. Wang. Generalization-based data mining
in object-oriented databases using an object-cube model. Data and Knowledge
Engineering, 25:55-97, 1998.
[38] D. J. Hand. Discrimination and Classification. John Wiley & Sons, Chichester,
U.K., 1987.
[39] D. Harrison. Backing up. In Network Computing, pages 98-104, Montreal,
Canada, 1993.
[40] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural
Computation. Addison Wesley: Reading, MA, 1991.
[41] S. J. Hong. R-mini: A heuristic algorithm for generating minimal rules from
examples. In Proc. 3rd Pacific Rim Int. Conf. Artificial Intelligence (PRICAI'94),
pages 331-337, 1994.
[42] S. J. Hong. Use of contextual information for feature ranking and discretization.
In Technical Report RC 19664, IBM Research Division, 1994.
[43] R. Hull and R. King. Semantic database modeling: Survey, applications, and
research issues. ACM Comput. Surv., 19:201-260, 1987.
[44] T. Imielinski and A. Virmani. DataMine: interactive rule discovery system. In
Proc. 1995 ACM-SIGMOD Int. Conf. Management of Data, page 472, San Jose,
CA, May 1995.

[45] G. H. John. Robust decision trees: Removing outliers from databases. In Proc.
1st Int. Conf. Knowledge Discovery and Data Mining, pages 174-179, Montreal,
Canada, Aug. 1995.
[46] D. A. Keim, H.-P. Kriegel, and T. Seidl. Supporting data mining of large
databases by visual feedback queries. In Proc. 10th Int. Conf. Data Engineering,
pages 302-313, Houston, TX, Feb. 1994.
[47] J. Kivinen and H. Mannila. The power of sampling in knowledge discovery. In
Proc. 13th ACM Symp. Principles of Database Systems, pages 77-85, Minneapolis,
MN, May 1994.
[48] W. Klösgen. Explora: a multipattern and multistrategy discovery assistant.
In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors,
Advances in Knowledge Discovery and Data Mining, pages 249-271. AAAI/MIT
Press, 1996.
[49] D. E. Knuth. The Art of Computer Programming. Addison-Wesley, 1969.
[50] Petri Kontkanen, Petri Myllymäki, and Henry Tirri. Predictive data mining with
finite mixtures. In Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining
(KDD'96), pages 176-182, Portland, Oregon, August 1996.
[51] K. Koperski and J. Han. Discovery of spatial association rules in geographic
information databases. In Proc. 4th Int. Symp. Large Spatial Databases (SSD'95),
pages 47-66, Portland, Maine, Aug. 1995.
[52] A. K. Kurtz and H. A. Edgerton. Statistical Dictionary of Terms and Symbols.
Wiley, 1939.
[53] Rense Lange. An empirical test of the weighted effect approach to generalized
prediction using neural nets. In Proc. 2nd Int. Conf. Knowledge Discovery and
Data Mining (KDD'96), pages 183-188, Portland, Oregon, August 1996.
[54] P. Langley and S. Sage. Conceptual clustering as discrimination learning. In
Proc. 5th Canadian Conf. Artificial Intelligence, pages 95-98, London, Ontario,
1984.
[55] P. Langley, J. Zytkow, H. Simon, and G. Bradshaw. The search for regularity:
Four aspects of scientific discovery. In Michalski et al., editors, Machine Learning:
An Artificial Intelligence Approach, Vol. 2, pages 425-469. Morgan Kaufmann,
1986.

[56] H. Lu, R. Setiono, and H. Liu. NeuroRule: A connectionist approach to data
mining. In Proc. 21st Int. Conf. Very Large Data Bases, pages 478-489, Zurich,
Switzerland, Sept. 1995.
[57] M. Manago and Y. Kodratoff. Induction of decision trees from complex structured
data. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery
in Databases, pages 289-306. AAAI/MIT Press, 1991.
[58] B. Masand and G. Piatetsky-Shapiro. A comparison of approaches for maximizing
business payoff of prediction models. In Proc. 2nd Int. Conf. Knowledge Discovery
and Data Mining (KDD'96), pages 195-201, Portland, Oregon, August 1996.
[59] C. J. Matheus and G. Piatetsky-Shapiro. An application of KEFIR to the analysis
of healthcare information. In Proc. AAAI'94 Workshop Knowledge Discovery in
Databases (KDD'94), pages 441-452, Seattle, WA, July 1994.
[60] C.J. Matheus, G. Piatetsky-Shapiro, and D. McNeil. Selecting and reporting
what is interesting: The KEFIR application to healthcare data. In U.M.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances
in Knowledge Discovery and Data Mining, pages 495-516. AAAI/MIT Press,
1996.
[61] G. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. John
Wiley & Sons, New York, 1992.
[62] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data
mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96),
Avignon, France, March 1996.
[63] M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. In
Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages
216-221, Montreal, Canada, August 1995.
[64] R. S. Michalski. A theory and methodology of inductive learning. In Michalski
et al., editors, Machine Learning: An Artificial Intelligence Approach, Vol. 1,
pages 83-134. Morgan Kaufmann, 1983.
[65] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell. Machine Learning, An
Artificial Intelligence Approach, Vol. 2. Morgan Kaufmann, 1986.
[66] R. S. Michalski and R. Stepp. Automated construction of classifications: Conceptual
clustering versus numerical taxonomy. IEEE Trans. Pattern Analysis
and Machine Intelligence, 5:396-410, 1983.

67] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and


Statistical Classication. Ellis Horwood, 1994.
68] S. Murthy and S. Salzberg. Decision tree induction: How e ective is the greedy
heuristic? In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining
(KDD'95), pages 222{227, Montreal, Canada, August 1995.
69] Z. Pawlak. Rough sets. Intl. J. Computer and Information Sciences, 11:341{356,
1982.
70] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules.
In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in
Databases, pages 229{238. AAAI/MIT Press, 1991.
[71] J. R. Quinlan. Discovering rules by induction from large collections of examples.
In D. Michie, editor, Expert Systems in the Micro Electronic Age. Edinburgh
University Press, 1979.
[72] J. R. Quinlan. Learning efficient classification procedures and their application
to chess end-games. In Michalski et al., editor, Machine Learning: An Artificial
Intelligence Approach, Vol. 1, pages 463-482. Morgan Kaufmann, 1983.
[73] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
[74] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[75] J. R. Quinlan. Comparing connectionist and symbolic learning methods. MIT
Press, 1994.
[76] B. D. Ripley. Neural networks and related methods for classification. J. R. Statist.
Soc. B, 56:409-456, 1994.
[77] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-
Hall, 1995.
[78] J. Schmitz, G. Armstrong, and J. D. C. Little. Coverstory - automated news
finding in marketing. In L. Volin, editor, DSS Transactions, pages 46-54.
Providence, R.I., 1990.
[79] J. W. Shavlik, R. J. Mooney, and G. G. Towell. Symbolic and neural learning
algorithms: An experimental comparison. Machine Learning, 6:111-144, 1991.
[80] A. Silberschatz, M. Stonebraker, and J. D. Ullman. Database research: Achieve-
ments and opportunities into the 21st century. ACM SIGMOD Record, 25:52-63,
March 1996.
[81] E. Simoudis, B. Livezey, and R. Kerber. Using Recon for data cleaning. In Proc.
1st Int. Conf. Knowledge Discovery and Data Mining, pages 258-262, Montreal,
Canada, Aug. 1995.
[82] R. Srikant and R. Agrawal. Mining quantitative association rules in large re-
lational tables. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data,
pages 1-12, Montreal, Canada, June 1996.
[83] M. Stone. Cross-validatory choice and assessment of statistical predictions. Jour-
nal of the Royal Statistical Society, 36:111-147, 1974.
[84] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-
Verlag, 1982.
[85] J. Way and E. A. Smith. The evolution of synthetic aperture radar systems
and their progression to the EOS SAR. IEEE Transactions on Geoscience and
Remote Sensing, 29:962-985, 1991.
[86] S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification
and Prediction Methods from Statistics, Neural Nets, Machine Learning, and
Expert Systems. Morgan Kaufmann, 1991.
[87] Q. Wu, P. Suetens, and A. Oosterlinck. Integration of heuristic and Bayesian
approaches in a pattern-classification system. In G. Piatetsky-Shapiro and W. J.
Frawley, editors, Knowledge Discovery in Databases, pages 249-260. AAAI/MIT
Press, 1991.
[88] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data cluster-
ing method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf.
Management of Data, pages 103-114, Montreal, Canada, June 1996.
[89] W. Ziarko. Rough Sets, Fuzzy Sets and Knowledge Discovery. Springer-Verlag,
1994.
[90] J. Zytkow and J. Baker. Interactive mining of regularities in databases. In
G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in
Databases, pages 31-54. AAAI/MIT Press, 1991.