Predictive Modeling Based on Classification and Pattern Matching Methods
Wei Wang
B.Sc. Beijing Polytechnic University, 1992
Date Approved:
Abstract
Predictive modeling, i.e., predicting unknown values of certain attributes of interest
based on the values of other attributes, is a major task in data mining. Predictive
modeling has wide applications, including credit evaluation, sales promotion, financial
forecasting, and market trend analysis.
In this thesis, two predictive modeling methods are proposed. The first is a
classification-based method which integrates attribute-oriented induction with the
ID3 decision tree method. This method extracts prediction rules at multiple levels of
abstraction and handles large data sets and continuous numerical values in a scalable
way. Since the number of distinct values in each attribute is reduced by
attribute-oriented induction, the problem of favoring the attributes with a large number of
distinct values in the original ID3 method is overcome.
The second approach is a pattern matching-based method which integrates statistical
analysis with attribute-oriented induction to predict data values or value
distributions of the attribute of interest based on similar groups of data in the database.
The attributes which strongly influence the values of the attribute of interest are
identified first by the analysis of data relevance or correlation using a statistical relevance
analysis method. Moreover, by allowing users to specify their requests at different
concept levels, the system can perform prediction at user-desired concept levels to
make the result more interesting and suitable to the user's needs.
Both proposed methods are implemented and tested. The performance study and
experiments show that they work efficiently for large databases. Our study concludes
that predictive modeling can be conducted efficiently at multiple levels of abstraction
in large databases and that it is practical for solving some large-scale application problems.
Dedication
To my parents and my wife.
Acknowledgements
I want to thank Dr. Jiawei Han, my senior supervisor, for his guidance, encouragement
and support during my study. Despite his busy schedule, he was always available to give
me advice, support and guidance during the entire period of my study. His insight
and creative ideas were a constant inspiration to me during my research.
I would also like to thank Dr. Tiko Kameda for serving on my supervisory com-
mittee. I am very grateful for his advice and support.
My thanks to Dr. Qiang Yang for serving as examiner of this thesis.
I also want to express my gratitude to Dr. Lou Hafer, Dr. Krishnamurti and Mr.
Russ Tront for their advice and support during my TA work. I want to thank Mrs.
Kersti Jaager and the other secretaries, who were always available to help.
My thanks also go to many of my fellow graduate students who made me feel
like a member of an extended family: Yongjian Fu, Krzysztof Koperski, Liang Yao,
Betty Xia, Jenny Chiang, Osmar Zaiane, Wan Gong, Yijun Lu, Nebojsa Stefanovic,
Micheline Kamber, Jianping Chen, Hongshen Chin, Ye Lu, Paul Tan, Shan Cheng,
Hui Li and Jie Wei.
My deepest gratitude goes to my wife Jing Jing for her love, care and support. I
am also deeply indebted to my parents for their everlasting love and encouragement.
Contents
Approval ii
Abstract iii
Acknowledgements v
List of Tables x
List of Figures xi
1 Introduction 1
1.1 Knowledge Discovery in Databases . . . . . . . . . . . . . . . . . . . 1
1.1.1 What is Knowledge Discovery in Databases . . . . . . . . . . 1
1.1.2 Tasks, Methods and Applications . . . . . . . . . . . . . . . . 3
1.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Predictive Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related Work 8
2.1 Attribute-Oriented Induction . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 What is Attribute-Oriented Induction? . . . . . . . . . . . . . 8
2.1.2 Motivation of Attribute-Oriented Induction . . . . . . . . . . 9
2.1.3 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Concept Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.5 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.6 Feasibility of Attribute-Oriented Induction . . . . . . . . . . . 15
2.1.7 Application of the Attribute-Oriented Induction . . . . . . . . 16
2.1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Statistical Approaches . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Machine Learning Approaches . . . . . . . . . . . . . . . . . . 21
2.2.3 Neural Net Approaches . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Summary of Classification Approaches . . . . . . . . . . . . . 25
2.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Statistical Approaches . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Classification-Based Approaches . . . . . . . . . . . . . . . . . 29
2.3.3 Neural Net Approaches . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Summary of Predictive Modeling Approaches . . . . . . . . . 34
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3 Classification Based Method 36
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Why Use Classification-Based Method . . . . . . . . . . . . . 41
3.3.2 Why Use Decision Tree-Based Method . . . . . . . . . . . . . 41
3.3.3 Why Use Attribute-Oriented Induction . . . . . . . . . . . . . 42
3.4 Discussion of ID3 Method . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 General Ideas of Our Approach . . . . . . . . . . . . . . . . . . . . . 47
3.6 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.7 Variations of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 52
3.7.1 Using Single Unsplit Generalized Table to Generate Target Sub-
class: Algorithm PC USGS1 . . . . . . . . . . . . . . . . . . . 52
3.7.2 Using Split Generalized Table to Generate Target Subclass: Al-
gorithm PC STGS2 . . . . . . . . . . . . . . . . . . . . . . . . 53
3.7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.8 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Pattern Matching Based Method 61
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Relevance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 General Ideas of our Approach . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7 Variations of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 75
4.7.1 Using Base Table to Generate Contingency Table: Algorithm
PP S2BT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7.2 Using Hash Table to Generate Contingency Table: Algorithm
PP S1HT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.8 Example and Experiment Results . . . . . . . . . . . . . . . . . . . . 82
4.8.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8.2 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . 86
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 Performance Study 89
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Performance Study on Classification-Based Predictive Modeling Method 90
5.2.1 Scale Up Performance . . . . . . . . . . . . . . . . . . . . . . 90
5.2.2 Performance Study on Different Classification Thresholds . . . 91
5.2.3 Performance Study on Different Numbers of Attributes . . . . 92
5.2.4 Performance Study on Different Numbers of the Predictive Attribute Values . . . . 94
5.2.5 Performance Study on Different Numbers of the Descriptive Attribute Values . . . . 95
5.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Performance Study on Pattern Matching-Based Method . . . . . . . . 96
5.3.1 Scale Up Performance . . . . . . . . . . . . . . . . . . . . . . 97
5.3.2 Performance Study on Different Support Thresholds . . . . . . 97
5.3.3 Performance Study on Different Child/Parent Ratios . . . . . 99
5.3.4 Performance Study on Different Numbers of Attributes . . . . 100
5.3.5 Performance Study on Different Numbers of Predictive Attribute Values . . . . 101
5.3.6 Performance Study on Different Numbers of Descriptive Attribute Values . . . . 102
5.3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Summary of the Performance Study . . . . . . . . . . . . . . . . . . . 104
6 Conclusion and Future Work 105
6.1 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Comparison of the Two Proposed Predictive Modeling Approaches . . 107
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
A Database Generator 111
B Hierarchy Generator 113
Bibliography 115
List of Tables
2.1 An Initial Relation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 A Generalized Relation. . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 An Example Data Set (by Quinlan 1986). . . . . . . . . . . . . . . . 44
3.2 An Initial Relation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 A Generalized Relation. . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 The prime class represented as a generalized feature table. . . . . . . 56
3.5 The information gain for each attribute to classify the prime class. . . 56
3.6 Object distribution in each subclass. . . . . . . . . . . . . . . . . . . 57
3.7 The information gain for each remaining attribute in the two subclasses. 57
3.8 Object distribution in the subclass Salary:High and Salary:Medium. 58
3.9 Object distribution in the remaining subclasses of Salary:High. . . . . 58
3.10 Object distribution in the remaining subclasses of Salary:Medium. . . 59
4.1 Contingency Table for Attributes A and B. . . . . . . . . . . . . . . . 68
4.2 Contingency Table for Occupation and Salary. . . . . . . . . . . . . . 68
4.3 Database Table: SalaryInfo . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 Generalized Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.5 Contingency Table for Attributes Occupation and Salary. . . . . . . 84
4.6 Another Generalized Data Set . . . . . . . . . . . . . . . . . . . . . . 85
4.7 Prediction Result For the Example . . . . . . . . . . . . . . . . . . . 86
4.8 Prediction Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
List of Figures
1.1 Steps of the KDD Process (Fayyad, et al., 1996). . . . . . . . . . . . 2
2.1 Concept Hierarchy for Provinces . . . . . . . . . . . . . . . . . . . . . 12
2.2 The Bayesian Classification System. (Wu, et al., 1991) . . . . . . . . 18
2.3 The Classification Scheme that Integrates Heuristic and Bayesian Approaches. (Wu, et al., 1991) . . . . 20
2.4 A Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 A Three Layer Feedforward Neural Network. (Lu, et al., 1995) . . . . 23
3.1 A Common Classification System . . . . . . . . . . . . . . . . . . . . 37
3.2 A Decision Tree (by Quinlan 1986). . . . . . . . . . . . . . . . . . . . 45
3.3 A More Complicated Decision Tree (by Quinlan 1986). . . . . . . . . 46
3.4 The decision tree generated based on the query and the determining attribute "House size". . . . . 58
5.1 Scale-Up Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Performance Study on Different Classification Thresholds . . . . . . 92
5.3 Performance Study on Different Numbers of Attributes . . . . . . . . 93
5.4 Performance Study on Different Numbers of the Predictive Attribute Values . . . . 94
5.5 Performance Study on Different Numbers of the Descriptive Attribute Values . . . . 95
5.6 Scale-Up Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.7 Performance Study on Different Support Thresholds . . . . . . . . . 98
5.8 Performance Study on Different Child/Parent Ratios . . . . . . . . . 99
5.9 Performance Study on Different Numbers of Attributes . . . . . . . . 100
5.10 Performance Study on Different Numbers of Predictive Attribute Values . . 102
5.11 Performance Study on Different Numbers of Descriptive Attribute Values . . 103
Chapter 1
Introduction
1.1 Knowledge Discovery in Databases
1.1.1 What is Knowledge Discovery in Databases
Today, the amount of data stored in databases grows at an amazing speed. Wal-Mart
(a U.S. retailer) has the largest business database in the world, which handles
over 20 million transactions per day [8]. Mobil Oil Corporation is developing a
database which can store more than 100 terabytes of data relevant to oil exploration
[39]. NASA's Earth Observing System (EOS) of orbiting satellites and other
space-borne instruments is able to generate 50 gigabytes of remotely sensed image data
per hour [85]! Obviously, such huge amounts of data are far beyond human capabilities
to analyze using traditional manual methods of data analysis. This gives rise to
a significant need for new techniques and tools to assist humans intelligently and
automatically in analyzing such gigabytes or even terabytes of data to extract useful
information. This increasing need has given birth to a new research field called Knowledge
Discovery in Databases (KDD), or Data Mining, which has attracted more and more
attention from researchers in many different fields, including databases, artificial
intelligence, machine learning, pattern recognition, statistics, expert systems, and data
visualization.
The definition of knowledge discovery in databases given by Fayyad, et al., is "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".
(Figure 1.1 depicts the pipeline: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge.)
Figure 1.1: Steps of the KDD Process (Fayyad et, al., 1996).
Figure 1.1 presents an overview of the steps comprising a KDD process. At the
low end is the data stored in the databases. A subset of the data (called target
data) is collected in the Selection step according to the data mining request. Noise
removal and missing data handling are done in the Preprocessing step. Data reduction and
projection are performed in the Transformation step. Searching for interesting patterns is
accomplished in the Data Mining step. The interpretation and evaluation of the discovered
patterns in the final step ensure the derivation of valid, useful knowledge which
should be understandable by people. Although background knowledge is not
shown in this figure, it is very important in the whole KDD process.
representation approaches [28], visualization and interactive approaches [90, 46] and
neural network approaches [56].
Over the past few years, many KDD systems have been built based on fruitful
research, including Database Marketing for American Express [9], Coverstory
from IRI [78], KEFIR [60, 59] from GTE, SKICAT from JPL/Caltech [22], QUEST
[3, 82, 4] from IBM, DBMiner from SFU [35], IMACS [12, 11], Recon [81], Explora [48],
Spotlight [5], and several others. The wide applicability and great practical potential
of KDD have been shown by these prototypes, which have produced many promising
experimental results.
1.2 Classification
Classification in the context of data mining is learning a function that maps (classifies)
a data item into one of several predefined classes [38, 86, 61]. The classification task
is to analyze the training data and to develop an accurate description or model for
each class according to the features present in the data. Future test data is then
classified using the class descriptions, which can also be used to provide a better
understanding of each class in the database.
Since classification is an important task of data mining, much research has been
done based on different methodologies, including Bayesian inference [16, 83], neural
net approaches [56], decision tree methods [62], etc. Among these, the
decision tree-based methods are the most popular in the research; they adopt
a machine learning paradigm and perform classification on the given
data by generating a decision tree. However, since these approaches all perform
classification on primitive data stored in the databases, they inherit the
problems of the decision tree-based machine learning methods they adopt,
such as difficulties in handling large amounts of data and continuous numerical values,
the tendency to favor many-valued attributes in the selection of the determinant attribute,
etc. Although some approaches have overcome some of these problems (Mehta, et al. [62],
for example, solved the problem of handling large amounts of data and continuous
numerical values), other problems, such as a bushy tree due to the large size of the data,
remain untouched.
Although classification and prediction are different tasks of data mining, they are
closely related, and classification has actually been adopted by some approaches as a
method to do prediction [58].
Our motivation in studying classification is to take advantage of some of the classification
methods developed in past research and extend them for predictive modeling
purposes. For example, from a classification result, we may know the data characteristics
of each class. Then we can predict the future behavior of an object by assigning
it to an existing class based on its particular characteristics.
This integration also overcomes the tendency of the original ID3 method to favor the attributes with a large number of distinct values.
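The ID3 bias toward many-valued attributes can be seen in a short information-gain computation. The following is an illustrative sketch only; the toy tuples and the function names are invented for this example, not taken from the thesis implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """ID3 information gain obtained by splitting on the given attribute."""
    base = entropy(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder

# Toy data: attribute 0 is a unique ID, attribute 1 is a coarse grouping.
rows = [(1, "a"), (2, "a"), (3, "b"), (4, "b")]
labels = ["yes", "no", "yes", "no"]

# The many-valued ID attribute puts every record in its own branch, so it
# attains the maximum possible gain even though it is useless for prediction;
# generalizing attributes first (reducing their distinct values) removes this bias.
print(info_gain(rows, labels, 0))  # 1.0
print(info_gain(rows, labels, 1))  # 0.0
```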
Another approach we propose is a pattern matching-based method which integrates
a statistical method with attribute-oriented induction to predict data values
or value distributions of the attribute of interest based on similar groups of data in
the database. The attributes which strongly influence the values of the attribute of
interest are identified first by the analysis of data relevance or correlation using a
statistical relevance analysis method. Moreover, by allowing users to specify their
requests at different concept levels, the system can perform prediction at user-desired
concept levels to make the result more interesting and suitable to users' needs. This
approach is domain independent and capable of handling a large volume of data and
multiple concept levels.
Both of our proposed methods are implemented and tested. The performance
study and experiments we did on these two methods show that both work
efficiently against large databases.
CHAPTER 2. RELATED WORK 9
sets. An association rule reveals associations in the data. An association rule has the
form "X1 ∧ ... ∧ Xi → Y1 ∧ ... ∧ Yj", which means objects Y1, ..., Yj tend to appear
together with objects X1, ..., Xi in the target data.
Attribute-oriented induction retrieves a target data set as the initial data
relation, performs the induction process attribute-by-attribute on the retrieved data
set, eliminates duplications in generalized tuples (with counts added up), and extracts
generalized relations and rules. This method has been implemented by Han, et al.,
in our knowledge discovery system DBMiner [35] and tested successfully against large
relational databases.
2.1.3 Query
A data mining request, including the data set of interest and the type of knowledge
to be discovered, can be specified in a data mining query language (e.g., DMQL [36])
which uses an SQL-like syntax. The following is an example:
use NSERC94
find characteristic rule for "CS Grants"
from award A, organization O
where O.org_code = A.org_code and A.disc_code = "Computer"
in relevance to province, amount, percentage(count), percentage(amount)
The above query specifies that the desired type of knowledge is a "characteristic rule".
It also requires that the data set of interest come from two relations, "award" and
"organization", satisfy the condition specified in the where clause, and be in
relevance to four attributes: province, amount, percentage(count), and percentage(amount),
where percentage(x) = x/total_x × 100% (the ratio of x to the total of x). The two relations,
"award" and "organization", are stored in the database "NSERC94", which contains
the information about 1994-1995 NSERC research grants.
As mentioned in Section 2.1.1, a characteristic rule describes the characteristics of
the target data set. The words in bold are reserved key words of DMQL.
The process first retrieves the specified data set (target class) by executing an SQL
query according to the condition specified in the where clause. Different from
standard SQL, however, high-level concepts like "Computer Science", which are not stored
in the database, can also be specified in the query. Thus, before the SQL query is
submitted, a concept hierarchy (a set of mappings from a set of concepts to their
higher-level counterparts) is consulted to map such high-level concepts (e.g., "Computer
Science" in the above example) to low-level concepts (e.g., different discipline codes
such as 25502, 25634, etc.) which are actually stored in the database.
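The hierarchy-consultation step can be sketched as follows. This is a minimal illustration; the hierarchy contents and discipline codes below are placeholders, not the actual NSERC94 tables:

```python
# Concept hierarchy: high-level concept -> lower-level concepts (or codes).
# The layering and codes here are hypothetical placeholders.
hierarchy = {
    "Computer": ["Computer Science"],
    "Computer Science": ["25502", "25634"],
}

def to_primitive(concept):
    """Expand a (possibly high-level) concept into the primitive-level
    values actually stored in the database."""
    if concept not in hierarchy:
        return [concept]            # already a stored (primitive) value
    codes = []
    for child in hierarchy[concept]:
        codes.extend(to_primitive(child))
    return codes

# Rewrite the predicate A.disc_code = "Computer" into an IN-list of codes
# before submitting the SQL query.
codes = to_primitive("Computer")
predicate = "A.disc_code IN (%s)" % ", ".join(repr(c) for c in codes)
print(predicate)  # A.disc_code IN ('25502', '25634')
```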
(Figure 2.1 shows a concept hierarchy for provinces, rooted at "Canada".)
Step 1 Initial data collection: the data mining request specified in DMQL is converted
into an SQL query, with high-level concepts replaced by their corresponding
primitive-level concepts by consulting the concept hierarchies. This SQL query is
then executed to collect the task-relevant data set into the initial relation.
Step 2 Derivation of the generalization scheme for each attribute: in the initial relation,
if the number of distinct values of an attribute is too large, the attribute will
be generalized by either attribute removal or generalization. The former is
performed when there is no concept hierarchy on the attribute, or when its higher-level
concepts are expressed in the values of another attribute. The latter is performed
otherwise, by first determining the prime-level concepts for each attribute and
then examining them together with the data in the initial relation to form
generalization pairs.
Step 3 Extraction of the prime relation: perform attribute-oriented generalization
(attribute by attribute) by substituting the primitive-level concepts of an attribute
in the initial relation with their corresponding prime-level concepts, using the
generalization pairs generated in Step 2. A prime relation is generated after the
generalization by eliminating duplicated records and accumulating the counts
accordingly in the retained records.
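Step 3 can be sketched in a few lines. The attributes, generalization pairs and tuples below are invented for illustration; they are not the thesis implementation:

```python
from collections import Counter

attrs = ("city", "amount")

# Hypothetical generalization pairs: primitive value -> prime-level concept.
pairs = {
    "city": {"Burnaby": "British Columbia", "Vancouver": "British Columbia",
             "Calgary": "Alberta"},
    "amount": {10000: "10-20K", 15000: "10-20K", 25000: "20-30K"},
}

def generalize(relation, pairs):
    """Substitute each primitive value with its prime-level concept (via the
    generalization pairs), then merge duplicate tuples, accumulating a count
    for each retained tuple -- yielding the prime relation."""
    counts = Counter()
    for tup in relation:
        gen = tuple(pairs.get(attr, {}).get(v, v)
                    for attr, v in zip(attrs, tup))
        counts[gen] += 1
    return counts

initial = [("Burnaby", 10000), ("Vancouver", 15000), ("Calgary", 25000)]
prime = generalize(initial, pairs)
print(prime[("British Columbia", "10-20K")])  # 2
print(prime[("Alberta", "20-30K")])           # 1
```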
The above method integrates relational database operations with attribute-oriented
generalization. In Step 1, a relational database query is executed, and its optimization
relies on well-developed relational database technology. Suppose the initial
relation R and the derived prime relation R' contain m and n records respectively.
Step 2 involves one scan (or less, if a sampling technique is used) of R, so the
worst-case time complexity is O(m). Step 3 scans R, generalizes the relation attribute by
attribute, and inserts generalized records into R', which takes O(m log n) time if the
records in R' are ordered and a binary or tree-based search is used.
Knowledge expressed in rules or general data distributions can then be derived
from the prime relation, using statistics or machine learning techniques [34, 65, 89].
discovering association (at the primitive level) or dependency rules which require the
discovery of knowledge at primitive concept levels. Moreover, concept hierarchies have
to be provided for attribute-oriented induction to generalize non-numerical attributes.
Generally, this method can be used in most database-oriented applications in which
generalization of some or all of the relevant attributes is necessary.
2.1.8 Summary
In brief, attribute-oriented induction is a set-oriented, generalization-based data mining
method which is efficient, robust, and widely applicable. It can also be
extended to knowledge discovery in other kinds of databases, such as object-oriented
[37], deductive, and spatial databases [51]. However, attribute-oriented induction
is a generalization-based method which requires the availability of some background
knowledge, such as concept hierarchies. Therefore, it may not be suitable for
knowledge mining which is not based on generalization or background knowledge.
2.2 Classification
Classification is a data mining process which groups objects with common properties
into several predefined classes and produces a classification scheme over a set of data
objects. Since classification is fundamental to research in many fields of the social and
natural sciences, it has been extensively studied in statistics, machine learning and
neural network research [72, 55, 13, 10, 76].
One term which often causes confusion is clustering. Clustering is a form of
unsupervised learning that partitions objects into classes or clusters (collectively called
a clustering). Clustering differs from classification in that it classifies data into
classes which are not predefined and have to be found in the process, according to
some class similarity measurement. Although clustering is also an important task in
data mining on which much research has been done [24, 26, 66, 16, 6, 19, 54, 88, 18],
since one of our approaches is classification-based, the survey in this thesis focuses
mainly on classification rather than clustering.
Many classification methods have been developed in different research works; they
can be divided into three groups: statistical methods, machine learning methods and
neural net methods.
The traditional Bayes classifier (illustrated in Figure 2.2) can be described as follows.
Given the problem of classifying a set of patterns a_i, i = 1, ..., n, each pattern is
perceived in terms of a measurement vector F_i obtained by a sensing device which is
capable of capturing its features. Each pattern a_i can be associated with a classification
set C_i which includes all the possible classes that can be assigned to pattern a_i.
In general, the classification sets can be distinct, but for simplicity, all classification
sets are considered identical. Thus, each pattern belongs to one of m possible classes
C_i, i = 1, ..., m. Also for simplicity, we only consider the case in which the same feature
measurements are made for each pattern.
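The decision rule described above, with the maximum selector made explicit, can be sketched as follows. The priors and conditional likelihoods are made-up numbers for illustration, not values from Wu, et al.:

```python
def bayes_classify(feature, classes, likelihood, prior):
    """Assign the pattern to the class C_i maximizing the posterior
    P(C_i | F) ∝ P(F | C_i) * P(C_i) -- the 'maximum selector' step."""
    return max(classes, key=lambda c: likelihood(feature, c) * prior[c])

# Hypothetical two-class problem with a single binary feature measurement.
prior = {"C1": 0.6, "C2": 0.4}
cond = {("C1", 0): 0.9, ("C1", 1): 0.1,   # P(F = f | class), made up
        ("C2", 0): 0.2, ("C2", 1): 0.8}

def likelihood(f, c):
    return cond[(c, f)]

print(bayes_classify(0, ["C1", "C2"], likelihood, prior))  # C1
print(bayes_classify(1, ["C1", "C2"], likelihood, prior))  # C2
```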
(Figure 2.2 depicts the Bayesian classification system: feature extraction maps each pattern a_j, j = 1, ..., n, to a measurement vector F_j; the likelihood functions L_1, ..., L_m are evaluated; and a maximum selector produces the decision.)
both decision rules in the classification procedure, the global optimization using
traditional Bayesian classification, which can be inconsistent with the domain constraints,
is replaced by domain consistency and local optimization.
Figure 2.3: The Classification Scheme that Integrates Heuristic and Bayesian Approaches. (Wu, et al., 1991)
Wu, et al.'s algorithm, illustrated in Figure 2.3, is summarized as follows. By
applying domain-contextual constraints and heuristic rules, domain knowledge is used
to guide the search in the tree. First, the classification context and the search scope
are evaluated. After the search scope is determined, depending on the availability
of heuristic rules, a pattern is classified by either a heuristic rule or the Bayesian
procedure according to its measured features. After a classification search has been
conducted, the pattern is checked to see if it has reached a leaf of the tree (i.e., can
be categorized into a single class). If it has, the decision is made, and the system
proceeds to the next unclassified pattern. If not, its classification context and scope
are evaluated again in step 1, and steps 2 and 3 are repeated. The whole procedure
terminates when the classification of every pattern has been completed.
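The control flow of this algorithm can be sketched as follows. The heuristic rule, the likelihoods and the class names below are hypothetical stand-ins for the domain knowledge and Bayesian procedure used by Wu, et al.:

```python
def classify_pattern(pattern, scope, heuristics, bayes_narrow):
    """Repeatedly narrow the search scope -- using a domain heuristic when
    one applies, and the Bayesian procedure otherwise -- until the pattern
    reaches a leaf (a single candidate class)."""
    while len(scope) > 1:
        applied = False
        for condition, narrow in heuristics:
            if condition(pattern, scope):      # a heuristic rule fires
                scope = narrow(pattern, scope)
                applied = True
                break
        if not applied:                        # fall back on the Bayesian step
            scope = bayes_narrow(pattern, scope)
    return scope[0]

# Toy components: the pattern is a dict of measured features.
heuristics = [
    (lambda p, s: p["size"] > 10 and "cat" in s,
     lambda p, s: [c for c in s if c != "cat"]),   # big patterns are not cats
]

def bayes_narrow(pattern, scope):
    # Hypothetical likelihoods; keep the most likely half of the scope.
    likelihood = {"cat": 0.2, "dog": 0.3, "whale": 0.5}
    keep = max(1, len(scope) // 2)
    return sorted(scope, key=lambda c: -likelihood[c])[:keep]

print(classify_pattern({"size": 30}, ["cat", "dog", "whale"],
                       heuristics, bayes_narrow))  # whale
```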
(Figure 2.4 shows a decision tree whose internal nodes test Occupation and Experience and whose leaves give Salary.)
The classification rules generated by neural networks are very difficult to express,
because a neural network is usually a layered graph with the output of one
node feeding into one or many other nodes in the next layer; as a result, the
classification rules are buried in both the structure of the graph and the weights
assigned to the links between the nodes. For the same reason, it is also difficult to
apply available domain knowledge to a neural network.
Among the above-described disadvantages of the connectionist approach, this
expressiveness problem is one of the major hurdles which needs to be solved in order to
adopt the technique in data mining, because it would be rather difficult to verify or
interpret classification rules without an explicit representation of them. One approach
by Lu and his associates [56] made a good attempt at solving this problem.
In their approach, they use a three-layer neural network to perform classification,
which extracts rules similar to those generated by symbolic methods (like C4.5
[74]). Lu's approach is summarized as follows.
Figure 2.5: A Three Layer Feedforward Neural Network. (Lu, et al., 1995)
CHAPTER 2. RELATED WORK 24
attribute values of an input tuple, the v_i's are constants, the θ's are relational operators
(=, <, >), and C_j is one of the class labels).
This approach was implemented in their NeuroRule system, which has produced
some interesting results. However, despite the fact that the speed of network training
was improved by the fast algorithms they designed, the time required by NeuroRule
is still longer than the time needed by a symbolic approach such as C4.5 [74].
2.3 Prediction
Predictive modeling for knowledge discovery in databases is to predict unknown or
future values of some attributes of interest based on the values of other attributes
in a database. Different approaches have been taken to predictive modeling based on
different methodologies. The major methods used in current research
include statistics, classification, and neural net-based methods.
in order to predict the future data. The challenge is that the model should be
constructed in a way that best explains the previous data without over-emphasizing the past,
which may result in an unreliable estimate of the future. In these linear regression-based
approaches, in order to know how well the constructed linear regression model can
predict the future, the existing data is often divided into a training set and a validation
set. The training set is used to construct the model, while the validation set is used
to test the prediction error.
One approach adopts the linear regression method [20] to solve the problem of
performing prediction on financial data with a small number of data points and high
dimensionality, for which classical economic forecasting techniques do not work.
This approach aims at solving a particular financial problem: predicting the
direction of the interest rate spread, which is the difference between the lending interest
rate (charged by financial institutions such as banks and credit card companies to
their customers) and the borrowing interest rate at which the financial institutions
can borrow. It is very important for financial institutions to be able to reliably
predict the interest spread so that they can maximize their profits by hedging (buying
insurance against a future spread decrease).
In adopting this approach, the researchers used a number of techniques which
trade o training error against model complexity to reduce dimensionality. They
also minimized the mean squared error of prediction and conrmed statistical validity
using bootstrap techniques 86], to predict whether the spread will increase or decrease
in the foreseeable future.
The details of this approach are summarized as follows.
Standard ordinary least squares regression minimizes the residual sum of squares (RSS). Consider the following two equations, where (x_j, y_j) and (x'_j, y'_j) are training and validation points respectively:
RSS = \sum_{j=1}^{n} (y_j - g(x_j))^2

MSEP = \frac{1}{N} \sum_{j=1}^{N} (y'_j - g(x'_j))^2
The researchers use g(x) to minimize RSS (g(x_j) is the output for training point x_j, which may be multi-dimensional; y_j is the observed value of y; g(x'_j) is the output for validation point x'_j). The second equation is the validation error, or mean squared error of prediction (MSEP). Although the summations look similar in the two equations, the points (x'_j, y'_j) in the second equation come from the validation set, which was not used in constructing g(x).
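To make the split concrete, here is a minimal sketch (illustrative Python, not from the thesis) that fits a one-variable least-squares model g(x) = a + bx on the training set, then computes RSS on the training points and MSEP on held-out validation points; the function names and data are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one explanatory variable:
    returns (a, b) of g(x) = a + b*x, which minimizes RSS."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rss(xs, ys, g):
    """Residual sum of squares over the training points."""
    return sum((y - g(x)) ** 2 for x, y in zip(xs, ys))

def msep(xs, ys, g):
    """Mean squared error of prediction over validation points
    that were not used when fitting g."""
    return rss(xs, ys, g) / len(xs)

# Fit on the training set only, then judge on the validation set.
a, b = fit_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8])
g = lambda x: a + b * x
train_err = rss([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8], g)
valid_err = msep([4, 5], [9.1, 10.8], g)
```

The point of the two functions is exactly the distinction drawn above: a flexible enough g can drive `rss` toward zero, but only `msep` on unseen points says how well it predicts.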
The complexity of the model is another important issue which the researchers have to deal with. In order to achieve optimal results, they have to select an appropriate complexity for the model, so that it is neither too simple nor too complicated. Too simple a model may result in larger training and validation errors (underfitting), due to the lack of enough free parameters to model the irregularities of the training set. Too complicated a model may result in overfitting: the training error declines to zero, while the validation error declines only up to a certain complexity and then rises again.
"Structural risk minimization" or "capacity control" is the term often used to refer to the process of finding the optimal model complexity for a given training set [84].
One of the big problems the researchers encountered was to construct a linear regression model properly, based on existing data with a large number of explanatory variables but a relatively small number of samples, to predict the future spread while avoiding overfitting of the samples. Furthermore, success in predicting the spread depends on being able to predict the values of the explanatory variables. Their goal is to find the best set of p explanatory variables which predicts the spread. They add one variable at a time to the linear model until the mean squared error of prediction (MSEP) goes through its minimum. The main challenge of this approach is that the best set of p variables is not necessarily a subset of the best set of (p+1) variables. In order to find the minimum validation error, one must exhaustively search C(k, p) combinations for each p (with k the total number of variables), for a total of 2^k - 1 combinations over all p. Because of the substantial cost of hedging, the researchers believe that the benefits are worth the use of computing resources, even if the exhaustive search takes days to complete.
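The exhaustive search over all 2^k - 1 variable subsets can be sketched as follows (illustrative Python, not the researchers' code); `msep_of` stands in for whatever routine fits a model on a given subset and returns its validation MSEP.

```python
from itertools import combinations

def best_subset(variables, msep_of):
    """Search every non-empty subset of the k candidate variables:
    C(k, p) candidates for each size p, 2**k - 1 in total.  The full
    search is needed because the best p-subset is not necessarily
    contained in the best (p+1)-subset."""
    best, best_err = None, float("inf")
    for p in range(1, len(variables) + 1):
        for subset in combinations(variables, p):
            err = msep_of(subset)
            if err < best_err:
                best, best_err = subset, err
    return best, best_err
```

Since every subset is scored independently, the cost grows as 2^k model fits, which is why the thesis notes the search can take days for k around twenty.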
In their experiment, exhaustive search was applied to the quarterly data (between 1983 and the end of 1993, with twenty-one possible explanatory variables) to find the p variables which minimize the MSEP. For the monthly data, they used a sequential selection scheme in which the variable with the highest correlation to the spread is chosen first. Then a linear model with only this variable is constructed based on the full training set, and the residuals, which are to be explained by the remaining variables, are obtained. Then another variable, the one with the highest correlation to the residuals, is selected. A model with two variables is then constructed and the residuals are recalculated. The process continues until all variables (or a large number of variables) are ranked.
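A stagewise approximation of this sequential scheme might look like the following (illustrative Python; the thesis gives no code, and here each newly chosen variable is fitted to the current residuals rather than refitting the full multi-variable model):

```python
def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def rank_variables(variables, target):
    """Rank variables as described above: repeatedly pick the
    unranked variable most correlated (in absolute value) with the
    current residuals, fit it to those residuals by least squares,
    and subtract the fit to produce the new residuals."""
    residuals = list(target)
    remaining = dict(variables)
    ranked = []
    while remaining:
        name = max(remaining,
                   key=lambda v: abs(pearson(remaining[v], residuals)))
        xs = remaining.pop(name)
        n = len(xs)
        mx, mr = sum(xs) / n, sum(residuals) / n
        denom = sum((x - mx) ** 2 for x in xs)
        b = (sum((x - mx) * (r - mr) for x, r in zip(xs, residuals)) / denom
             if denom else 0.0)
        a = mr - b * mx
        residuals = [r - (a + b * x) for x, r in zip(xs, residuals)]
        ranked.append(name)
    return ranked
```

The first variable chosen is the one most correlated with the spread itself; every later choice is driven by what the earlier variables failed to explain.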
for each feature. Merits are computed by taking a set of "best" counterexamples for each example in a class, and accumulating a figure for each feature which is a function of the example-pair feature values. Compared to the tree-based methods, the R-MINI contextual feature analyzer may be considered a full-level look-ahead feature analyzer. The researchers believe that it will not suffer from falling into false local minima, due to its ability to analyze the merits of features in a global context. When rule generation is finished, the R-MINI system can be used to classify unseen data sets and measure the performance using various error metrics.
In order to precisely quantify the predictive performance, it is necessary to use the classification rules to predict the actual return instead of the discretized class segments. To extend the classification model to a rule-based regression model, the researchers calculate additional metrics for each rule, based on the training examples and the rules derived from this training data: μ, the mean of all actual class values of training examples covered by that rule; σ, the standard deviation of these values; and N, the total number of training examples covered by that rule. When such a rule set is applied to unseen data, for each example in that set there will potentially be zero or more rules which apply to that example. If no rule covers an unseen example, a numerical value which suggests a default for the domain, based on priors such as the normally expected mean for the class, can be assigned as the prediction. If one or more rules apply to an unseen example, an average is calculated from the rule coverage metrics, which is then assigned as the class label for that example.
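A sketch of this prediction step (illustrative Python, not R-MINI itself): each rule is represented as a (condition, μ, σ, N) tuple, and the averaging here weights each applicable rule's mean μ by its coverage N, which is one plausible reading of "an average calculated from the rule coverage metrics".

```python
def predict_value(example, rules, default):
    """Predict a numeric value for an unseen example from a rule set.
    Each rule is (cond, mu, sigma, n): cond tests the example, mu and
    sigma summarize the actual class values of the training examples
    the rule covered, and n counts them.  When no rule covers the
    example, the domain default (e.g. the prior expected mean) is
    returned."""
    hits = [(mu, n) for cond, mu, sigma, n in rules if cond(example)]
    if not hits:
        return default
    total = sum(n for _, n in hits)
    return sum(mu * n for mu, n in hits) / total
```

For instance, two hypothetical rules with means 0.8 (over 30 examples) and 0.6 (over 10 examples) that both cover an example would yield (0.8·30 + 0.6·10) / 40 = 0.75.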
Another classification-based approach [58] focused on database marketing applications. The goal in this domain is to predict customer behavior based on previous actions. A usual approach is to develop training models which maximize accuracy on the training and testing sets and then apply these models to the unseen data. The researchers behind this approach think that accuracy optimization is insufficient by itself and that different strategies should be explored to take the customer value into account. They propose a framework for comparing the payoffs of different models and use it to compare several different approaches for selecting the most valuable subset of customers.
They use historical telephone customer records which contain billing data and responses to previous special offers, and use a commercial neural network classifier to group the customers into several classes according to their responses to such special offers. Based on the predicted classes (responders, non-responders), offers are made to customers who have a high predicted probability of response. However, in order to maximize business value, not only does prediction accuracy need to be maximized, but a group of "high value" customers among those highly likely to respond also needs to be identified. How to achieve this goal still remains an open problem: is it enough to just select the estimated high-value customers from the group of predicted responders as a post-processing step, or might it be better to have the predictive model itself take the value into consideration? They tried different strategies in order to get a satisfying answer.
To evaluate the different approaches, they divide the data into one training set and several testing sets of different sizes to calculate the merits of each approach. In order to get better accuracy, they reduce the irrelevant attributes by excluding dependent attributes and eliminating attributes with a low correlation weight to the target attribute.
The strategies on which they did experiments are baseline payoff calculation, post-processing, value-based training, modeling stratified groups based on value, straight merge of stratified logs, and merge of "optimal" subsets from the stratified logs. Details of these strategies can be found in [58].
Their prediction approach is mainly based on a classification technique. They construct the predictive model on top of the classification strategy, while taking the business payoff into account.
where E represents the "effect" of knowing the value of each variable v. That is, all variables in V, including the unknown variables in OUT, are used as input to the function F to generate a new vector V' with predicted values, which includes predictions for the already known variables in IN, as moderated by the effects in E.
The researchers propose that the effect set E contain an entry for every variable in V: e_i = 1 when the variable is known, and e_i = -1 when its value is not known. The reason behind this assumption is that such weights allow a neural net to learn all linear main and interaction effects (and perhaps some non-linear effects as well) associated with the presence or absence of each variable.
In order to test whether the addition of the effect set E improves prediction, the researchers created two basic experimental conditions. The first condition is called the "Effect condition", in which the effect set E is defined as described in the preceding paragraphs. V is a copy of a record in the data set and the function F is implemented as a partially recursive neural net. This net has the 31 variables in V' as its outputs and the 31 + 31 = 62 variables in V and E as its inputs. A standard back-propagation algorithm [29] was used and all variables were scaled into the range -0.9 to +0.9. The data was randomly divided into training and testing sets of approximately the same size.
There are two phases in the process. The first is the training phase, during which knowledge of predictor variables was simulated by randomly selecting between 0 and 15 variables using an efficient algorithm described in [49]. The selected variables were assigned the value 0.0 and were then added to OUT. Their entry in E was set to -0.9. The remaining variables were added to IN with their original values unchanged and their entry in E set to +0.9. The resulting V + E were then presented to the back-propagation algorithm.
The second is the test phase, in which the weights obtained during the training phase are validated on the test set by randomly selecting between 1 and 15 variables to be included in the OUT set. The E set is then constructed and used exactly as during the forward propagation stage of the training phase.
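The training-phase input construction can be sketched like this (illustrative Python: the size 31 and the ±0.9 flags follow the description above; the variable-selection routine and everything else are assumptions, not the experiment's actual code):

```python
import random

def make_training_input(record):
    """Simulate partial knowledge of one 31-variable record: hide
    between 0 and 15 randomly chosen variables (the OUT set), set
    their values to 0.0, and mark them -0.9 in the effect vector E;
    the remaining variables (the IN set) keep their values and are
    marked +0.9.  The net's input is the concatenation V + E."""
    v = list(record)
    e = [0.9] * len(v)
    out = random.sample(range(len(v)), random.randint(0, 15))
    for i in out:
        v[i] = 0.0   # value removed: the variable joins OUT
        e[i] = -0.9  # effect entry says "unknown"
    return v + e, set(out)
```

Each call produces one 62-value input vector V + E together with the set of hidden indices, which is what the "Control condition" below omits by dropping E entirely.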
The second condition is called the "Control condition", which is identical to the "Effect condition" except that no set E is used during the training phase. Instead, the value 0 is simply assigned to the randomly selected elements of OUT. That is, all results are based on a fully recursive neural net with 2 intermediate layers, 31 outputs (V') and 31 inputs (V).
Their experimental results showed that the addition of the effect set E to the inputs minimized distortions in known variables during the transformation from V to V'.
2.4 Summary
In this chapter, a brief introduction to the work related to our proposed predictive modeling approaches was given. The motivation, method, feasibility, and different applications of attribute-oriented induction were discussed in Section 2.1, along with the query and concept hierarchy which it uses for a data mining task. In Section 2.2, several classification approaches based on different methodologies, such as statistical methods, machine learning methods and neural net methods, were presented. We also examined the advantages and weaknesses of the approaches based on the different methodologies. In Section 2.3, different approaches to predictive modeling were investigated. These approaches were divided into three groups based on the method they use: statistical, classification-based, and neural net approaches. A brief summary of each type of predictive modeling approach was also given.
Chapter 3
Predictive Modeling Based on
Classification Method
3.1 Introduction
Classification, in terms of data mining, is learning a function that maps (classifies) a data item into one of several predefined classes [38, 86, 61], which can be described as follows. The input data, called training data, consists of multiple objects. Each object has multiple attributes and is tagged with a special class label. The classification task is to analyze the training data and to develop an accurate description or model for each class according to the features present in the data. The future test data is then classified using the class descriptions, which can also be used to provide a better understanding of each class in the database. Figure 3.1 illustrates a common classification system. Classification has wide applications, such as credit approval, target marketing, medical diagnosis, treatment effectiveness analysis, etc.
Classification is closely related to predictive modeling. In statistics, the classification problem is sometimes called the prediction problem. Generally, a classification process can be thought of as having two phases: a model construction phase and a new object identification phase. In the first phase, existing data is analyzed to get an accurate description of each class. In the second phase, a new object is identified and assigned
CHAPTER 3. CLASSIFICATION BASED METHOD 37
[Figure 3.1: A common classification system — the training data is fed to a class analyzer, which produces the class descriptions.]
to one class according to the class description obtained in the first phase. This phase can actually be considered a prediction process, because the behavior of the future object can be predicted from the common behavior of the class to which it belongs. The so-called "common behavior" of one class represents the general characteristics of all the previously known data in this class.
In terms of predictive modeling, the first phase of classification can be seen as the predictor construction phase, while the second phase is the prediction phase. Classification can be thought of as extracting the patterns of the existing data, and prediction as the use of such patterns to forecast the behavior of future data. In this sense, they can be treated as two phases of one data mining process.
Since classification has been extensively studied in previous research, with many efficient methods developed, many approaches to predictive modeling are based on classification [7, 58].
Our approach integrates the ID3 [73, 74] decision tree classification method with attribute-oriented induction to predict object distributions over different classes. The details of our approach will be discussed in the following sections.
The rest of this chapter is organized as follows. In Section 3.2, the problem we want to tackle is defined. In Section 3.3, our motivation for integrating the ID3 method with attribute-oriented induction is discussed. In Section 3.4, the ID3 decision tree method is examined. In Section 3.5, the general idea of the approach is described, followed by Section 3.6, where the algorithm and its rationale are discussed. In Section 3.7, two variations of the algorithm are examined. Section 3.8 illustrates our method with an example. Finally, a brief summary is given in Section 3.9.
On one hand, these derived rules can be used to study the behavior of the existing objects and produce a useful prediction result. For example, we may obtain a number of rules about house size in British Columbia, Canada, based on the existing objects stored in a database. By studying these rules, a local paint company may find that people with certain characteristics are very likely to own medium-sized houses. This paint company can then use this knowledge to direct its promotion of certain paint designed for medium-sized houses to those potential customers who possess those characteristics.
On the other hand, these rules can also be used to categorize one or more new objects into certain classes, so that we can predict their behavior based on the common behavior of the objects in the classes associated with them.
For instance, a credit card company may derive a number of rules about the credit ratings of its current customers. The company may use these rules to filter new applicants for credit cards, so that it can recruit new customers with potentially good credit and reject those applicants with potentially bad credit. Moreover, it can also assign appropriate credit limits to the accepted new customers based on these rules, so that it can maximize profits while at the same time minimizing risk.
There are three cases associated with the process of predicting a new object's behavior based on the rules obtained using the classification-based predictive modeling method:
Case 1: Based on the rules, the new object can be assigned to a single class.
Case 2: According to the rules, the new object can be assigned to more than one class, with a different probability for each class.
Case 3: The existing rules do not cover the new object.
Case 1 is the ideal case, which rarely happens when the rules are derived from a large database, because of the wide diversity of the objects it contains.
In case 2, the common method is to compute the probability of the occurrence of this object in the different classes and assign it to the class with the highest probability. However, for a large database, the probability of the occurrence of the new object may be quite low even for the class with the highest probability, due to the wide diversity of data in the database. With such a low probability, assigning the new object to any one class is questionable, and the prediction based on this assignment may not be accurate or useful. It is more desirable to know the probabilities of the new object being assigned to the different classes, instead of reluctantly assigning it to one class.
In case 3, new rules which cover the new object have to be found. However, this process may not always be successful if there is no similar object stored in the database from which the new rules can be derived.
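The three cases can be distinguished mechanically; here is a minimal sketch (illustrative Python, assuming each rule derived from the model is a (condition, class_label, count) triple):

```python
def predict_classes(example, rules, default=None):
    """Return a class -> probability distribution for the example.
    Case 1 yields a single class with probability 1.0; case 2 yields
    several classes with their relative weights; case 3 (no covering
    rule) returns the caller-supplied default."""
    counts = {}
    for cond, label, count in rules:
        if cond(example):
            counts[label] = counts.get(label, 0) + count
    if not counts:
        return default  # case 3: the rules do not cover the object
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}
```

Returning the full distribution rather than a single label is exactly the remedy suggested for case 2: the caller sees, for example, {"big": 0.75, "medium": 0.25} instead of a forced assignment to "big".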
For a classification-based predictive modeling method, in order to extract predictive rules, a classification model usually needs to be constructed first. The rules can then be derived from the established model for further analysis to produce useful predictive results. Since the classification model has to be constructed based on predefined classes, different classification models have to be created for different predictive attributes. Thus the model construction cost has to be considered when selecting the method. For example, since the time needed to construct a neural net classifier is rather long, classification methods based on neural nets may not be suitable for the many predictive modeling applications which require fast response.
Another factor we need to consider is that the method we use should be able to extract a limited number of simple rules from the existing objects. Otherwise, it may not be useful or practical, because of the difficulty of analyzing a large number of complicated rules and the fact that the more complicated (thus more specific and less generic) the rules are, the less chance they have of covering a new object, which makes them not very useful in predicting the behavior of new objects. For example, one rule may be described as "people who spend $500-1500 a month have good credit". Another rule may be described as "people who spend $300-1000 a month, live in a suburb, have full-time work with over $50k salary and a credit card balance of less than $200 have good credit". Obviously, the first rule is not only easier to analyze but also likely to cover more new objects than the second one, and is thus more useful in the future.
In the worst case, each derived rule describes only one object in the database and all the rules combined are actually the original object set itself, so that no extra knowledge is gained in the predictive modeling process. Therefore, we must pay special attention to selecting methods which can reliably extract a limited number of simple rules, and avoid those which cannot.
3.3 Motivation
3.3.1 Why Use a Classification-Based Method
Classification and prediction are closely related. They can be considered as two phases of one predictive modeling process. It is natural and easy to perform prediction based on classification results.
Many efficient methods have been developed in previous classification studies which can be effectively utilized in predictive modeling.
amount and wide diversity of data in large databases. In our proposed algorithm, a classification threshold is used to terminate the process when a major portion of the objects (indicated by the classification threshold) belongs to one class. Because of the incorporation of quantitative information in attribute-oriented induction, the proposed algorithm can handle noisy or exceptional data, which is quite common in large databases, by pruning the generalized tuples which have negligible counts using particular thresholds, while most decision tree methods have difficulties in handling such data.
With the help of attribute-oriented induction, prediction can be made at different desirable concept levels. With quantitative information, such as the count of objects associated with each node in a decision tree, prediction can be based on the distribution of the target objects over different classes rather than on only the one class with the highest probability, which makes the result more reasonable.
the first decision tree shown in Figure 3.2. The preference for simpler decision trees [72] leads the ID3 method to use an information-theoretic approach which selects the attribute that provides the highest information gain as the root of the tree (or subtree). This selection process minimizes the expected number of tests needed to classify an object and guarantees that a simple (though not necessarily the simplest) tree is found.
Let the prime class P contain p_i objects of class P_i (for i = 1, ..., m), where each class P_i is distinguished from another class P_j (i ≠ j) by their different values in the determining attribute. An arbitrary object belongs to class P_i with probability p_i/p, where p is the total number of objects in the prime class P. When a decision tree is used to classify an object, it returns a class. A decision tree can thus be regarded as a source of messages for the P_i's, with the expected information needed to generate this message given by:
I(p_1, \ldots, p_m) = -\sum_{i=1}^{m} \frac{p_i}{p} \log_2 \frac{p_i}{p}

[Figure: a decision tree with outlook at the root, branching to humidity, P, and windy]
If an attribute A with values {a_1, ..., a_k} is used for the root of the decision tree, it will partition a class C into {C_1, C_2, ..., C_k}, where C_j contains those objects in C that have value a_j of A. Let C_j contain p_ij objects of class P_i. The expected information required for the tree with A as the root is then obtained as the weighted average:
E(A) = \sum_{j=1}^{k} \frac{p_{1j} + \cdots + p_{mj}}{p} I(p_{1j}, \ldots, p_{mj})
The information gained by branching on A is:

gain(A) = I(p_1, p_2, \ldots, p_m) - E(A)
ID3 examines all the candidate attributes and chooses the attribute A which maximizes gain(A) to form the tree. It then applies the same process recursively to form decision trees for the residual subsets C_1, C_2, ..., C_k.
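These quantities are straightforward to compute; the following sketch (illustrative Python, not the thesis's implementation) computes I, E(A) and gain(A) from class counts, and the last line uses the outlook counts from the worked example.

```python
from math import log2

def info(counts):
    """I(p1, ..., pm): expected information of a class distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def expected_info(partition):
    """E(A): weighted average of I over the subsets C1, ..., Ck
    produced by branching on A.  `partition` maps each value aj of A
    to its per-class counts [p1j, ..., pmj]."""
    total = sum(sum(c) for c in partition.values())
    return sum(sum(c) / total * info(c) for c in partition.values())

def gain(class_counts, partition):
    """gain(A) = I(p1, ..., pm) - E(A)."""
    return info(class_counts) - expected_info(partition)

# The weather training set: 9 P / 5 N objects, split by outlook.
g = gain([9, 5], {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]})
```

With these counts, info([9, 5]) ≈ 0.940, expected_info ≈ 0.694, and the gain comes to about 0.246-0.247 bits depending on rounding, matching the figures in the example below.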
We can use the training set in Table 3.1 to illustrate the selection of the decision tree root as follows.
Let T be the set of objects in Table 3.1. Among the 14 objects, 9 belong to class P and 5 belong to class N, so the information required for classification is:
I(p, n) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940

[Figure: a bushier decision tree for the same training set, with temperature at the root]
Then consider the outlook attribute, with values "sunny", "overcast" and "rain". Five of the 14 objects in T have the value "sunny", two of them from class P and three from class N, so
p_1 = 2, n_1 = 3, I(p_1, n_1) = 0.971

and similarly

p_2 = 4, n_2 = 0, I(p_2, n_2) = 0

p_3 = 3, n_3 = 2, I(p_3, n_3) = 0.971
Therefore, the expected information requirement is:

E(outlook) = \frac{5}{14} I(p_1, n_1) + \frac{4}{14} I(p_2, n_2) + \frac{5}{14} I(p_3, n_3) = 0.694
The gain of attribute outlook is then gain(outlook) = 0.940 - 0.694 = 0.246.
to know the general characteristics of professors and managers living in the lower mainland of British Columbia, Canada who own particular sized houses, in order to target its promotion of a certain type of paint to the right group of people, we can use the classification-based predictive modeling method to find the answer. We could simply construct a decision tree from the above data using the ID3 algorithm, with the attribute "House-size" as the target class attribute, and then derive the predictive rules along the paths of the tree. The result, however, may not be very good, for the following reasons:
1. A very bushy, and thus uninteresting, decision tree could be constructed, due to the large number of tuples in the database and of distinct values for some attributes, such as "Name" and "Salary", which may result in many unnecessary and meaningless rules not related to the task.
2. Different people, at different times or in different cases, may want to know information at different concept levels. For example, sometimes people only want to know whether a house is big, medium or small, rather than its exact area in square meters. Sometimes they may only pay attention to professors or managers as a whole, instead of distinguishing between different kinds of professors or managers. Decision trees based only on the primitive concept level may not satisfy such requests.
3. The amount of data may be too large for the decision tree algorithm to handle, so that the decision tree cannot be constructed at all.
According to the above observations, we may want to remove the attributes with a large number of distinct values which cannot be generalized and are irrelevant to the task, such as the attribute "Name". We may use attribute-oriented induction to discretize some numerical attributes, such as "Salary", to reduce the number of distinct values, and to generalize the given data set to different concept levels to meet different requests. For example, Table 3.2 can be generalized to Table 3.3.
In Table 3.3, the values of "Occupation" are generalized to "professor" and "manager", the values of "House-size" are generalized to "medium", "small" and "big", while the values of "Salary" are generalized to "medium", "low" and "high". The attribute "Count" is inserted into the generalized relation during attribute-oriented induction, indicating the number of original tuples covered by each generalized tuple.
Notice that any generalization (such as attribute removal or attribute-oriented induction) will result in a non-clean classification; that is, originally different objects which belong to different classes may become the same object while still belonging to different classes. For example, according to house size, an assistant professor living in Burnaby with a medium salary belongs to class "medium" and an associate professor living in Burnaby with a medium salary belongs to class "big". After generalization on attribute "Occupation", these two objects become the same object, "Occupation = professor, Address = Burnaby, Salary = medium", but in the different classes "medium" and "big". We cannot categorize such objects into any single class. At the leaf nodes of the decision tree, we may have to give a class distribution instead of a unique class label. As we discussed in Section 3.2, such cases are very common in databases, even without any generalization, due to the wide diversity of the data. The association of quantitative information and the class distributions at the leaf nodes actually provide a reasonable solution to this problem.
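The generalization-with-counts step can be sketched as follows (illustrative Python; the concept hierarchy contents below are invented for the example, not taken from the thesis's data):

```python
def generalize(rows, attrs, hierarchies):
    """Attribute-oriented induction, one level: map each attribute
    value to its higher-level concept where a hierarchy entry exists,
    then merge identical generalized tuples, accumulating a Count of
    the original tuples each one covers (as in Table 3.3)."""
    merged = {}
    for row in rows:
        key = tuple(hierarchies.get(a, {}).get(row[a], row[a]) for a in attrs)
        merged[key] = merged.get(key, 0) + 1
    return [dict(zip(attrs, key), Count=n) for key, n in merged.items()]

# Two kinds of professor merge into one generalized tuple, Count = 2.
occ = {"assistant professor": "professor", "associate professor": "professor"}
rows = [
    {"Occupation": "assistant professor", "House-size": "medium"},
    {"Occupation": "associate professor", "House-size": "medium"},
    {"Occupation": "manager", "House-size": "big"},
]
gt = generalize(rows, ["Occupation", "House-size"], {"Occupation": occ})
```

Note how the merge is what produces the non-clean classification discussed above: if the two professors had owned different-sized houses, the same generalized tuple would appear under two class labels, each with its own count.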
According to the above discussion, we may first collect the task-relevant data from the database into a table IT (as illustrated in Table 3.2) by executing an SQL query. We then remove from IT the attributes with a large number of distinct values which cannot be generalized or discretized. We then generalize the data set to a certain level according to the request and put it into another table GT (as illustrated in Table 3.3). Finally, we run the ID3 algorithm on GT and create a decision tree. From
3.6 Algorithm
Algorithm 3.1 (Classification-Based Predictive Modeling Method) Derive predictive modeling rules using attribute-oriented induction and ID3.
Input: A DMQL query for prediction.
Output: Predictive modeling rules derived from the constructed decision tree.
Method:
1. Data retrieval: According to the given request, collect the task-relevant data and generalize it to the specified concept level to get the prime target class.
2. Model construction: Construct a decision tree based on the prime target class.
2.1 Compute the information gain for each candidate attribute based on the information-theoretic approach, using the equations given in Section 3.4. Select one candidate attribute as the classifying attribute at the current level and classify the target class.
2.2 For each classified target subclass, repeat Step 2.1 to further classify it until either (1) all or a substantial proportion (no less than the classification threshold
Computational Complexity:
Theorem 3.1 Algorithm 3.1 takes O(N_gl · N_r log N_r + N_nonleaf · N_a · N_r) time to finish, where N_r is the number of records in the initial data set (the task-relevant data set which is initially collected from the database and is at the primitive concept level), N_gl is the number of concept levels through which the attributes in the initial data set are generalized from the primitive concept level to the specified concept level, N_nonleaf is the total number of non-leaf nodes in the decision tree, and N_a is the number of attributes in the initial data set.
Rationale:
Attribute-oriented induction takes O(N log N) time [15] (where N is the number of records in a relational table), and in order to generalize the initial data set to the concept level specified in the request, the algorithm generalizes N_gl times (once per level). Since the number of records in the generalized table after each generalization becomes smaller and smaller, it takes O(N_gl · N_r log N_r) time to generalize the initial data set to the desired concept level. The decision tree construction on the generalized target data set then costs at most N_nonleaf · N_a · N_r time [73]. So the total computational requirement for this algorithm is O(N_gl · N_r log N_r + N_nonleaf · N_a · N_r).
2.1 Compute the information gain for each candidate attribute based on the information-theoretic approach, using the equations given in Section 3.4. Select one candidate attribute as the classifying attribute at the current level and classify the target class.
2.2 For each classified target subclass, get the subtable from the master table and repeat Step 2.1 to further classify it until either (1) all or a substantial proportion (no less than the classification threshold specified in the DMQL query) of the objects are in one (determinant) class, or (2) no more classifying attributes can be used for further classification.
3. Output the rules derived from the constructed decision tree.
2.1 Compute the information gain for each candidate attribute based on the information-
theoretic approach using the equations given in Section 3.4. Select one candidate
attribute as the classifying attribute at this current level and classify the target
class. Subtable for each child node is derived from the prime target class if the
current node is the root or from the subtable of its parent node when the current
node is not the root. The so derived subtable is then associated with each child
node.
2.2 For each classified target subclass, use the subtable associated with it and repeat
Step 2.1 to further classify it until either (1) all or a substantial proportion (no
less than the classification threshold specified in the DMQL query) of the objects
are in one (determinant) class, or (2) no more classifying attributes can be used
for further classification.
3. Output the rules derived from the constructed decision tree.
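Steps 2.1-2.2 above amount to a recursive, entropy-guided partitioning with an early stop at the classification threshold. The following sketch is an illustration only, not the thesis implementation: it works on plain Python lists of dictionaries, and uses the standard ID3 information gain in place of the equations of Section 3.4.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, target):
    """Reduction in entropy of `target` from splitting on `attr` (ID3)."""
    base = entropy([r[target] for r in records])
    n = len(records)
    rest = 0.0
    for v in {r[attr] for r in records}:
        part = [r[target] for r in records if r[attr] == v]
        rest += len(part) / n * entropy(part)
    return base - rest

def classify(records, attrs, target, threshold):
    """Steps 2.1-2.2: pick the best classifying attribute and recurse until
    the dominant class reaches `threshold` or no attributes remain."""
    dist = Counter(r[target] for r in records)
    top_count = dist.most_common(1)[0][1]
    if not attrs or top_count / len(records) >= threshold:
        # Leaf: return the value distribution of the determining attribute.
        return {v: c / len(records) for v, c in dist.items()}
    best = max(attrs, key=lambda a: info_gain(records, a, target))
    rest = [a for a in attrs if a != best]
    return {(best, v): classify([r for r in records if r[best] == v],
                                rest, target, threshold)
            for v in {r[best] for r in records}}
```

With a threshold of 0.85, a subclass whose dominant class covers, say, 95% of its records becomes a leaf immediately, mirroring the "Salary:Low" case of Section 3.8.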
3.7.3 Summary
Since the target tables become smaller and smaller from parent generations to
child generations, the cost of extracting subtables is lower in algorithm PC STGS 2,
which results in better performance in most cases. However, algorithm PC STGS 2
takes a little more memory than algorithm PC USGS 1, and when the number of
records is small, the performance difference between the two algorithms is not
significant. A detailed performance study of these two algorithms is presented in
Chapter 5.
3.8 Example
In this section, we will use a simplified example to illustrate our proposed
classification-based predictive modeling approach. Suppose a local paint company
wants to market its high-end paint to professors and managers living in the lower
mainland of British Columbia, Canada. In order to achieve the best result, it needs
to know the general characteristics of the group of professors and managers who own
certain sized houses
so that it can target its promotion to a particular group with the appropriate
products designed for the certain sized houses that people in this group are likely to
own. Suppose the company has access to a database "HouseDB" which contains house
information of people living in the lower mainland; its schema is represented in
Table 3.2. The company can then submit the following DMQL query to get the
information it needs.
use HouseDB
find prediction rule for "House size"
from HouseInfo
where Occupation = "professor" or Occupation = "manager"
and Address = "Burnaby" or Address = "Vancouver"
or Address = "Richmond" and Salary = "low" or Salary = "medium"
or Salary = "high"
in relevance to Occupation, House size, Address, Salary
with classification threshold 85%
The query is then executed and the task-relevant data set is collected from the
database. This initial data set is then generalized to the specified concept level,
which results in the prime class table (shown in Table 3.4 as a feature table).
Based on the extracted prime class table, information gain is computed for each
candidate attribute ("Occupation", "Salary" and "Address") using the
information-theoretic method and the equations given in Section 3.4.
The computation results in Table 3.5, which implies that "Salary" should be chosen
as the root of the decision tree, and the objects should be classified according to the
values (High, Medium, Low) of the attribute "Salary".
The object distribution in each subclass is presented in Table 3.6. Since the
determinant class, "House size = Small", contains 95.61% of the objects in the
subclass "Salary:Low", which is above the specified classification threshold of 85%, it
is unnecessary to further classify the subclass "Salary:Low". Classification is further
performed in the other subclasses.
The computation of information gain for each remaining attribute in the two
subclasses leads to Table 3.7. Clearly, "Address" should be chosen as the classifying
attribute for both subclasses, "Salary:High" and "Salary:Medium".
The object distribution in each subclass is presented in Table 3.8. Since the
determinant class "House size = Big" contains 85.42% of the objects in the subclass
"Salary:High & Address:Richmond", which is above the classification threshold of
85%, it is unnecessary to further classify this subclass. Classification is further
performed in the remaining subclasses.
Salary    House size = Big    House size = Medium    House size = Small
High          59.84%               36.70%                  3.46%
Medium        41.93%               29.68%                 28.39%
Low            2.63%                1.76%                 95.61%
Table 3.6: The object distribution in each subclass.
Information Gain
Attribute Salary: High Salary: Medium
Occupation 0.2164310451 0.3473828339
Address 0.2505954843 0.4922614419
Table 3.7: The information gain for each remaining attribute in the two subclasses.
For the remaining subclasses, based on the last classification attribute
"Occupation", the object distribution in each leaf class is presented in Table 3.9 and
Table 3.10 respectively.
The model construction process finally terminates when there is no attribute left
in the subclasses for further classification. This process leads to the decision tree of
Figure 3.4.
Prediction rules can be generated, with probability distribution information
associated, based on the tables and the decision tree so derived. For example, the
rules which associate house size with salary can be derived as follows:
if Salary = High
then (House size = Big) [59.84%] ∨ (House size = Medium) [36.70%]
∨ (House size = Small) [3.46%]
if Salary = Medium
then (House size = Big) [41.93%] ∨ (House size = Medium) [29.68%]
∨ (House size = Small) [28.39%]
if Salary = Low
then (House size = Big) [2.63%] ∨ (House size = Medium) [1.76%]
∨ (House size = Small) [95.61%]
Figure 3.4: The decision tree generated based on the query and the determining
attribute "House size".
Notice that alternative choices can be explored for determining whether a subclass
needs to be further classified. One such alternative is to specify a noise threshold and
a disjunct threshold (the maximum number of disjuncts allowed). A subclass does not
need to be further classified if it contains only a small number of disjuncts (below the
disjunct threshold) after filtering out each disjunct which is below the noise threshold.
For example, in Table 3.6, if the noise threshold is 5% and the disjunct threshold is 2
(i.e., two disjuncts are allowed), the subclasses "Salary:High" and "Salary:Low"
contain only one or two disjuncts after filtering out each disjunct below the noise
threshold. Further classification needs to be performed only on one subclass,
"Salary:Medium".
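This alternative stopping rule is easy to state in code. Below is a minimal sketch (the helper name is ours, not the thesis's), applied to the three rows of Table 3.6 with a 5% noise threshold and a disjunct threshold of 2:

```python
def needs_further_classification(distribution, noise=0.05, max_disjuncts=2):
    """Drop disjuncts below the noise threshold; classify further only if
    more than `max_disjuncts` disjuncts survive."""
    surviving = [p for p in distribution if p >= noise]
    return len(surviving) > max_disjuncts

# Proportions of House size = (Big, Medium, Small) per Salary subclass:
print(needs_further_classification([0.5984, 0.3670, 0.0346]))  # Salary:High   -> False
print(needs_further_classification([0.4193, 0.2968, 0.2839]))  # Salary:Medium -> True
print(needs_further_classification([0.0263, 0.0176, 0.9561]))  # Salary:Low    -> False
```

Only "Salary:Medium" keeps more than two disjuncts, matching the conclusion above.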
After studying the rules derived by the classification-based predictive modeling
process, the local paint company may decide to target its promotion of one type of
paint designed for certain sized houses at the particular group of people who are likely
to own such sized houses. For example, it may market a certain brand of paint
designed for small houses to the professors and managers with relatively low salaries.
3.9 Summary
In this chapter, we first discussed the basics of classification, its close relation with
predictive modeling, and our ideas for integrating the ID3 classification method into
our predictive modeling approach. Then in Section 3.2, we defined the problem we
tackle in our classification-based predictive modeling. In Section 3.3, we discussed our
motivation for integrating the ID3 method and attribute-oriented induction into our
approach. In Section 3.4, we discussed the ID3 decision tree-based classification
method. In Section 3.5, we illustrated our general ideas using an example. Then we
presented the details of our proposed classification-based predictive modeling method
and algorithm in Section 3.6. The rationale and the computational complexity of the
algorithm were also discussed. In Section 3.7, two variations of the algorithm were
examined. In Section 3.8, we used an example to illustrate our method. Our
CHAPTER 4. PATTERN MATCHING BASED METHOD 62
of interest and is different from the attributes included in Pi. Given pattern Pinput,
our task in the data collecting phase is to find matching patterns MPSET in the
existing pattern set PSET (PSET = {P1, P2, ..., Pn}). MPSET ⊆ PSET, and all
matching patterns in MPSET have the same value as Pinput for each attribute.
That is, ∀ Px ∈ MPSET, V(A1) of Px = V(A1) of Pinput, V(A2) of Px = V(A2) of
Pinput, ..., V(An) of Px = V(An) of Pinput. However, if the value for one attribute Aj
of the given pattern is "any", the V(Aj) of the matching patterns can be any value
without necessarily being the same.
The task in the pattern analyzing phase is to examine the different behaviors of
the collected patterns in MPSET and predict, on a probability basis, what behavior
the given pattern may have. Suppose B1, B2, ..., Bm are the different behaviors of the
collected patterns in MPSET. (In terms of a relational database, B1, B2, ..., Bm are
the different values of the attribute of interest. For convenience in the following
discussions, we call the attribute of interest the Predictive Attribute and the other
attributes which make up the pattern the Descriptive Attributes. In principle, any
attribute can be either a predictive or a descriptive attribute depending on different
predictive modeling requests.)
The possible behavior of the given pattern Pinput can be any one of them but with
different probabilities. Suppose the total number of patterns in MPSET is N and the
numbers of patterns in MPSET with behaviors B1, B2, ..., Bm are n1, n2, ..., nm
(N = Σ_{i=1}^{m} n_i) respectively. Thus the probabilities for Pinput to have
behaviors B1, B2, ..., Bm are n1/N, n2/N, ..., nm/N accordingly.
However, the quality of prediction based on different sets of matching patterns
may differ. Obviously, predictions based on a large set of patterns are more reliable
than those based on only a small set of patterns. We use Support to measure the
reliability of the prediction.
Definition 4.1 (Support) Support is the number of existing matching patterns of
the given pattern in a data set. That is, support = N, where N is the number of
matching patterns of the given pattern.
Apparently, the larger the support, the more reliable the prediction.
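Concretely, matching with the "any" wildcard and the support and probability computation of this section can be sketched as follows. This is an illustration only; the record layout and attribute names are made up for the example.

```python
def matches(pattern, record):
    """A record matches if it agrees with the pattern on every descriptive
    attribute; the value "any" acts as a wildcard."""
    return all(v == "any" or record.get(a) == v for a, v in pattern.items())

def predict_distribution(pattern, records, predictive_attr):
    """Return (support, behaviour distribution) over the matching patterns."""
    matched = [r for r in records if matches(pattern, r)]
    support = len(matched)
    counts = {}
    for r in matched:
        counts[r[predictive_attr]] = counts.get(r[predictive_attr], 0) + 1
    return support, {b: n / support for b, n in counts.items()} if support else {}

records = [{"degree": "M.S.", "profession": "programmer", "salary": "high"},
           {"degree": "M.S.", "profession": "programmer", "salary": "low"},
           {"degree": "B.S.", "profession": "programmer", "salary": "high"}]
support, dist = predict_distribution(
    {"degree": "M.S.", "profession": "programmer"}, records, "salary")
print(support, dist)  # 2 {'high': 0.5, 'low': 0.5}
```

Replacing "M.S." with "any" in the pattern raises the support to 3, at the cost of a less specific question, which is exactly the trade-off the next section motivates.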
4.3 Motivation
Let us study an example first. Suppose a new graduate is looking for a job and one
of his primary concerns is how much salary he can expect given his personal situation.
Suppose there is an occupation database available for public access, which contains
most occupation-related information such as age, education, experience, profession,
region, etc. Assume there is a simple predictive modeling system available which
can gather the information specified by the user and do some calculation to give a
probability prediction. So the new graduate seeks the help of this system and hopes
to get a general picture of how much salary he can ask for during job interviews.
He gives the system his personal information as "Profession = programmer, Degree
= M.S., Experience = 2 (years), Age = 25, Location = Burnaby, Special Interest =
object-oriented database". The system then tries to find matching patterns in the
existing records of the occupation database. In the database, there may be hundreds
of programmers, thousands of people who have an M.S. degree, and even more people
who are 25 years old. However, the set of records which match all the specified
conditions is probably small. In this case, suppose there are only 12 matching records
found. The new graduate probably won't trust a prediction based on such a small set
of data, and hopes that the system can help him find more records which are closest
to the specified condition but not necessarily identical. One way to achieve this is to
generalize the given condition and try to find more matching records based on the
generalized result. In this case, suppose the condition is generalized to
"Profession = programmer, Degree = M.S., Experience = 2, Age = 22-27, Location =
BC, Special Interest = Computing Science". The system needs to find the matching
records which satisfy the new condition. However, in the database, data is stored
in its original values (that is, at the primitive concept level) and may not be at the
concept level which the user desires. In this case, the values for attribute "Special
Interest" are at the "A.I., object-oriented database, financial planning, etc." level but
not at the "Computing Science, Business, etc." level. This means that the system
needs to generalize the existing data to the desired concept level in order to find the
matching records. This is one reason that we want to integrate attribute-oriented
induction into the system to perform such a task. Suppose our system already has
such capability and finds 400 records which match the generalized condition. The
new graduate may feel comfortable with a prediction based on this set of data. What
the system needs to do next is to calculate the distribution of the salary among the
found records. However, since the values of salary vary, the result may not reflect the
general trend. For example, the result may look like "$42,000 - 3%, $45,680 - 3.4%,
...". In order to clearly show the general trend, we need to generalize the predictive
attribute, in this case "Salary". Thanks to attribute-oriented induction, the system
generalizes the attribute "Salary" into four high-level values: "under $40000,
$40000-45000, $45000-50000, and over $50000". So the result may look clearer:
"under $40000 - 10%, $40000-45000 - 25%, $45000-50000 - 60%, over $50000 - 5%".
Thus the new graduate knows that he can probably ask for $45000-50000 during his
job hunting. This is another reason to use attribute-oriented induction.
One caution we need to take during the generalization process is that we do not
want to generalize too much; otherwise the generalized pattern may be too generic
and lose too much specific information from the original pattern, so that the
prediction may not be useful. For instance, if the generalized pattern of the above
case is "Profession = White Collar, Degree = M.S., Experience = 0-10, Age = 20-40,
Location = BC, Special Interest = Science", the result may not be very helpful for
the new graduate to find out how much he can expect to earn. In this example, the
system generalizes too much, which results in losing too many specifics and making
the pattern so generic that the result is not specific enough for the particular request.
As we can see, obtaining enough support while still retaining as much necessary
specific information as possible is the goal we must achieve in our approach.
For each prediction, we know that the relevance between different descriptive
attributes and the predictive attribute differs. Some attributes are more relevant to
the prediction while others are less relevant or even not relevant at all. If the
relevance between each attribute and the predictive attribute is known before the
generalization, such information can guide the generalization process so that the less
relevant attributes are generalized more than the relevant attributes, which results in
getting enough support while still keeping as much of the most relevant specific
information as possible.
Take the previous case as an example: if the system knows that "Profession",
"Experience" and "Degree" are more relevant to "Salary" and "Location" is least
relevant, these three attributes can be generalized less while "Location" can be
generalized more during the generalization process, which may result in the following
generalized pattern: "Profession = programmer, Degree = M.S., Experience = 2, Age
= 22-47, Location = Western Canada, Special Interest = Computing Science". The
new graduate can still get a clear picture of the salary even though "Location" has
been generalized to a relatively high concept level in order to obtain enough support.
As discussed above, the motivation for integrating relevance analysis into our
approach is to guide the generalization process so that the relevant specific information
with the second row, each cell in this column contains one distinct value of attribute
A. For convenience, the cells which contain the distinct values of the attributes will
be referred to as value cells and the other cells will be referred to as data cells. Each
data cell in the contingency table, which corresponds to one distinct value of attribute
A and one distinct value of attribute B, contains the number of tuples which have
these two attribute values. Each cell in the last row and the last column of the table
contains the subtotal of the corresponding column or row. The last cell (the
intersection of row N+2 and column M+2) contains the total number of tuples in the
contingency table.
In a simplified example, let N = M = 2 and we get the following contingency table:
Attribute B
Attribute A B1 B2 Total
A1 (A1B1) (A1B2) (TA1)
A2 (A2B1) (A2B2) (TA2)
Total (TB1) (TB2) (TN)
Table 4.1: Contingency Table for Attributes A and B.
Salary
Occupation low high Total
white collar 100 500 600
blue collar 700 200 900
Total 800 700 1500
Table 4.2: Contingency Table for Occupation and Salary.
To best illustrate the meaning of the contingency table shown in Table 4.1, we
can use a concrete example. Let A represent attribute "Occupation", A1 represent
"white collar" and A2 represent "blue collar". Let B represent attribute "Salary",
B1 represent "low" and B2 represent "high". Thus (A1B1) represents the number of
tuples with "Occupation" as "white collar" and "Salary" as "low". This example is
shown in Table 4.2.
Now that we have the data arranged in a contingency table, we are ready to
study the relevance between the attributes "Occupation" and "Salary".
We will first study the existence of relevance by examining the following questions:
(1) Is the proportion of people in the white collar group having high salary the same
as for all of the people? (2) Is the proportion of people with low salary the same
for people who are in the white collar group as for those who are in the blue collar
group? To answer these two questions, we need to calculate a pair of percentages for
each question. For question (1), we find that 83% of the people in the white collar
group have high salary while 47% of all the people have high salary. For question
(2), we find that among people with low salary, 87.5% are in the blue collar group
while 12.5% belong to the white collar group.
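The two percentage pairs can be checked directly from the counts in Table 4.2; a small sketch:

```python
# Table 4.2 counts: Occupation (white/blue collar) vs. Salary (low/high).
counts = {("white", "low"): 100, ("white", "high"): 500,
          ("blue", "low"): 700, ("blue", "high"): 200}
total = sum(counts.values())                                        # 1500
white_total = counts[("white", "low")] + counts[("white", "high")]  # 600
low_total = counts[("white", "low")] + counts[("blue", "low")]      # 800

# Question (1): high salary among white collar vs. among everyone.
p_high_white = counts[("white", "high")] / white_total              # ~83%
p_high_all = (counts[("white", "high")]
              + counts[("blue", "high")]) / total                   # ~47%

# Question (2): occupation split among the low-salary people.
p_blue_low = counts[("blue", "low")] / low_total                    # 87.5%
p_white_low = counts[("white", "low")] / low_total                  # 12.5%

# The percentages within each pair differ, so the attributes are dependent.
print(round(100 * p_high_white), round(100 * p_high_all))  # 83 47
```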
Since the percentages in each pair are different, we know that the two attributes,
"Occupation" and "Salary", are not independent. Thus the answer is YES to the
question "Is there any relevance between these two attributes?".
Now that we know that relevance exists between the two attributes, we will
examine the degree of relevance. In the above example, we can get some idea of the
closeness of the relevance by comparing 83% and 47%. However, this is not very
useful because it cannot be compared with percentage differences calculated from
other contingency tables. We need a coefficient which measures the degree of
relevance, is a pure number free from the effects of the absolute quantities or of the
size of the marginal percentages, and can be compared across different contingency
tables. One such coefficient is T, which measures the degree of relevance between
attributes. It is based on the well-known Chi-Square method and is frequently used
in statistics [30]. The calculation of T is described as follows.
The calculation of T is described as follows.
a11        a12        ...  a1n        a1(n+1)
a21        a22        ...  a2n        a2(n+1)
...
am1        am2        ...  amn        am(n+1)
a(m+1)1    a(m+1)2    ...  a(m+1)n    a(m+1)(n+1)
In the general case, suppose we have two attributes, one with m values and the
other with n values. Using an (m+1) × (n+1) matrix (row m+1 contains the subtotal
of the corresponding column while column n+1 contains the subtotal of the
corresponding row), we can represent the contingency table as above.
Then we can calculate P (P is an intermediate measure used to calculate Chi
Square):

P = Σ_{i=1}^{n} Σ_{j=1}^{m} a_{ji}² / (a_{j(n+1)} × a_{(m+1)i})

Then we can get the Chi Square χ² (N is the total number of tuples in the table):

χ² = N × P − N

For our example, χ² = 1500 × 1.36 − 1500 = 540.

After getting χ², we can calculate T (φ² is an intermediate measure; s represents
the number of rows and t represents the number of columns):

φ² = χ² / N

T² = φ² / √((s−1)(t−1))

For our example, φ² = 0.36, T² = 0.36, thus T = 0.6.
Therefore the degree of relevance between "Occupation" and "Salary" measured
by T is 0.6.
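The whole computation of T from a table of raw counts fits in a few lines. The sketch below takes the table as a list of rows; this coefficient is known in the statistics literature as Tschuprow's T.

```python
import math

def t_coefficient(table):
    """Degree of relevance between the row and column attributes of a
    contingency table of raw counts, computed exactly as in the text:
    P, then Chi Square, then phi-square, then T."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    p = sum(table[j][i] ** 2 / (row_tot[j] * col_tot[i])
            for j in range(len(table)) for i in range(len(table[0])))
    chi2 = n * p - n
    phi2 = chi2 / n
    s, t = len(table), len(table[0])          # rows, columns
    return math.sqrt(phi2 / math.sqrt((s - 1) * (t - 1)))

# Table 4.2: Occupation (rows) vs. Salary (columns).
print(round(t_coefficient([[100, 500], [700, 200]]), 2))  # 0.6
```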
Obviously, the computational complexity of the algorithm to calculate T is
O(N × M), where N is the number of columns and M is the number of rows of the
contingency table.
In the above discussion, we first illustrated the general ideas for analyzing the
relevance between two attributes. Then we introduced the T coefficient, which is
extensively used in statistics to measure the degree of relevance between two
attributes. In the following sections, we will further discuss how to integrate this
method into our predictive modeling system to perform the relevance analysis.
4.6 Algorithm
Algorithm 4.1 (Pattern Matching-Based Predictive Modeling Method) Predict the
value distribution of the predictive attribute for the given pattern based on the
existing matching patterns.
Input: A DMQL query for prediction.
Output: The value distribution of the predictive attribute.
Method:
1. Data retrieval: According to the given data pattern (referred to as the Target
Pattern below), collect the task-relevant data set and generalize it to the specified
concept level to create the generalized data set.
2. Check support: Check whether the number of matching patterns in the
generalized data set is equal to or greater than the specified support threshold.
If yes, go to Step 5. If not, go to Step 3.
3. Calculate Relevance: Calculate the relevance between the predictive attribute
and the descriptive attributes.
4. Further Generalization: Further generalize the least relevant attribute of the
generalized data set to its corresponding higher concept level and get a further
generalized data set. Replace the value of this least relevant attribute in the
Target Pattern with its corresponding value at the higher concept level of the
hierarchy. Then go back to Step 2. If the selected attribute is already at the top
concept level of the hierarchy, try the next least relevant attribute. If all the
attributes are at their top concept levels, stop.
5. Output Result: Calculate the value distribution of the predictive attribute based
on the matching patterns and output the result.
Rationale:
Step 1 is the attribute-oriented induction algorithm [15]. Step 2 checks whether
the specified support threshold has been passed. Step 3 calculates the relevance
between each descriptive attribute and the predictive attribute based on the method
we discussed in Section 4.4. Step 4 is also the attribute-oriented induction algorithm.
Step 5 calculates the distribution of the different values of the predictive attribute as
we discussed in Section 4.2. The process terminates when either the number of
matching patterns in the generalized data set exceeds the specified support threshold
or all the attributes are generalized to their top concept levels.
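The loop of Steps 2-5 can be sketched as follows. This is an illustration only, not the thesis implementation: `hierarchies` maps an (attribute, value) pair to its parent concept, and `relevance` holds precomputed relevance scores (e.g. the T coefficient); both names are ours.

```python
def pattern_predict(target, records, hierarchies, relevance,
                    pred_attr, min_support):
    """Generalize the least relevant attribute until the matching patterns
    reach the support threshold, then return the value distribution."""
    target = dict(target)
    records = [dict(r) for r in records]
    while True:
        matched = [r for r in records
                   if all(r[a] == v for a, v in target.items())]
        if len(matched) >= min_support:          # Step 2: enough support
            counts = {}
            for r in matched:                    # Step 5: distribution
                counts[r[pred_attr]] = counts.get(r[pred_attr], 0) + 1
            return {v: c / len(matched) for v, c in counts.items()}
        # Step 4: least relevant attribute that can still be generalized.
        candidates = [a for a in target if (a, target[a]) in hierarchies]
        if not candidates:
            return None                          # all at top level: stop
        attr = min(candidates, key=lambda a: relevance[a])
        target[attr] = hierarchies[(attr, target[attr])]
        for r in records:                        # generalize the data set too
            r[attr] = hierarchies.get((attr, r[attr]), r[attr])
```

For example, with "Address" the least relevant attribute, one pass of the loop lifts "Vancouver" to "Lower Mainland" in both the target pattern and the data set, exactly as in the worked example of this chapter.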
Computational Complexity:
Before we discuss the computational complexity of the algorithm, we will first
introduce three notations as follows.
(1) Initial Relation Table: The Initial Relation Table is the initial data set which
is collected from the database and is at the primitive concept level.
(2) Ntl:

Ntl = Σ_{i=1}^{n} Nli

where n is the number of attributes in the given data pattern and Nli is the number of
concept levels (excluding the top concept level) in the concept hierarchy of attribute i.
(3) MAXNv:

∀ i (1 ≤ i ≤ N), MAXNv ≥ NVi

where NVi is the number of distinct values of attribute i in the initial relation table
and N is the number of attributes in the initial relation table.
Theorem 4.1 In Algorithm 4.1, the calculation of the relevance for each descriptive
attribute will take O(MAXNv²) time.
Rationale:
As we discussed in Section 4.4, it takes O(m × n) time to calculate the relevance
between two attributes, where m is the number of distinct values of one attribute and
n is the number of distinct values of the other attribute. Since MAXNv is the
maximum number of distinct values of any attribute in the initial relation table, and
with each generalization the number of distinct values of each attribute either remains
the same or becomes smaller, MAXNv will always be greater than or equal to the
number of distinct values of any attribute at any time. Therefore, the calculation of
the relevance for each descriptive attribute will take O(MAXNv²) time.
top concept levels. In this case, the maximum number of times it can generalize
is (Ntl − Nil), and each generalization, including the calculation of the relevance for
each descriptive attribute, will take O(MAXNv² × Nda + Nr log Nr) time. So in the
worst case it will take O((Ntl − Nil) × (MAXNv² × Nda + Nr log Nr)) time to stop.
Combining this time and the time spent on the initial generalization, which
generalizes the initial relation table to the specified concept level, the algorithm will
take O((Nil + Npl) × Nr log Nr + (Ntl − Nil) × (MAXNv² × Nda + Nr log Nr)) time,
which can be simplified to O((Ntl + Npl) × Nr log Nr + (Ntl − Nil) × MAXNv² × Nda).
(16)       else
(17)         match := FALSE
(18)       end
(19)     end
(20)     if match = TRUE then begin
(21)       rowIndexOfFoundRecord := rowInCTG
(22)       found := TRUE
(23)       break
(24)     end
(25)   end
(26)   if found != TRUE then begin
(27)     Tb[curRowInCTGBaseTable][nonPredAttrIndexInCTGBaseTable]
           := Tt[row][givenNonPredAttrIndex]
(28)     Tb[curRowInCTGBaseTable][predAttrIndexInCTGBaseTable]
           := Tt[row][givenPredAttrIndex]
(29)     Tb[curRowInCTGBaseTable][countCol] := Tt[row][countCol]
(30)     curRowInCTGBaseTable++
(31)   else
(32)     Tb[rowIndexOfFoundRecord][countCol]
           := Tb[rowIndexOfFoundRecord][countCol] + Tt[row][countCol]
(33)   end
(34) end
(35) sort_base_table(Tb)
(36) for (row := 0; row < Ntb; row++) do begin
(37)   preRowValue := rowValue
(38)   rowValue := Tb[row][Col1]
(39)   colValue := Tb[row][Col2]
(40)   if row = 0 then begin
(41)     rowValueIndex := 0
(42)     colValueIndex := 0
(43)     add_col_value(colValueList, colValue)
(44)     Tct[rowValueIndex][colValueIndex] := Tb[row][countCol]
(45)   else
(35)   end
(36) end
(37) nRows := get_number_of_keys(Tr)
(38) nCols := get_number_of_keys(Tc)
(39) rowTotalIndex := nCols
(40) colTotalIndex := nRows
(41) for (row := 0; row < nRows; row++) do begin
(42)   for (col := 0; col < nCols; col++) do begin
(43)     Tct[colTotalIndex][col] := Tct[colTotalIndex][col] + Tct[row][col]
(44)     Tct[row][rowTotalIndex] := Tct[row][rowTotalIndex] + Tct[row][col]
(45)   end
(46)   Tct[colTotalIndex][rowTotalIndex]
         := Tct[colTotalIndex][rowTotalIndex] + Tct[row][rowTotalIndex]
(47) end
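The listings above assemble the base and contingency tables with explicit scans and index bookkeeping. For comparison, the same contingency table, with row, column, and grand totals as in Table 4.1, can be built from generalized records in a few lines of Python; this sketch is ours, not the thesis code.

```python
from collections import Counter

def contingency_table(records, attr_a, attr_b):
    """Cross-tabulate attr_a (rows) against attr_b (columns) and append
    the subtotals, as in Table 4.1."""
    counts = Counter((r[attr_a], r[attr_b]) for r in records)
    row_vals = sorted({a for a, _ in counts})
    col_vals = sorted({b for _, b in counts})
    table = [[counts.get((a, b), 0) for b in col_vals] for a in row_vals]
    for row in table:
        row.append(sum(row))                     # row totals (last column)
    table.append([sum(c) for c in zip(*table)])  # column totals; the last
    return row_vals, col_vals, table             # cell is the grand total

records = ([{"occ": "white", "sal": "low"}] * 1
           + [{"occ": "white", "sal": "high"}] * 5
           + [{"occ": "blue", "sal": "low"}] * 7
           + [{"occ": "blue", "sal": "high"}] * 2)
rows, cols, table = contingency_table(records, "occ", "sal")
print(table)  # [[2, 7, 9], [5, 1, 6], [7, 8, 15]]
```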
4.7.3 Summary
In most cases, Algorithm PP S1HT has better performance than Algorithm PP S2BT.
However, when the number of records is small and the Child/Parent ratio (which
specifies the desired number of children under each parent node) of the concept
hierarchy is very low (e.g., 2), the overhead related to the hash table may offset its
benefit, and the performance of PP S2BT is slightly better, but the difference is not
significant. A detailed performance study of these two algorithms is presented in
Chapter 5.
EmpID Occupation Company Age Gender Degree Major Addrcd Exp Salary
23568 205 InfoSoft 25 m B. Sc. C.S. 46789324 2 41000
97621 102 ABCTech 30 f M.A. Accounting 15679562 5 55000
Suppose a new graduate with a B.Sc. degree in Computing Science from Simon
Fraser University has just started to look for a job and wants to know how much
salary he could ask for from a potential future employer. He expects to get a
reasonably clear answer from our predictive modeling system based on the
information stored in the database.
We assume that the hierarchy for each attribute has already been created by the
system administrator and can be browsed by the new graduate so that he can specify
his request at the desired concept levels without being restricted to the primitive
concept level stored in the database. Suppose the total number of records in this
database is 10000, the database name is "Employee" and the table name is
"SalaryInfo". The new graduate is only interested in the attributes "Occupation",
"Age", "Degree", "Major", "Address", "Experience" and "Salary". He also feels that
it will be good enough if the result is based on at least 500 records in the table.
So he may specify his request in DMQL as follows:
use Employee
find prediction rule for "Salary"
from SalaryInfo
where Occupation = "Programmer" and Age = 25 and Degree = "B.Sc."
and Major = "Computing Science" and Address = "Vancouver" and
Experience = 1
in relevance to Occupation, Age, Degree, Major, Address, Experience
with support 500
Our predictive modeling system will first parse this query and generate a SQL
statement at the corresponding primitive concept level to collect the task-relevant
data set (referred to as the initial data set below) from the specified database table.
Then, by attribute-oriented induction, it will generalize the initial data set to the
specified concept level and create a generalized data set as follows:
In this example, we assume the total count at this point is only 350, which is less
than the specified support threshold. Thus the system needs to generalize further in
order to meet the support requirement. Before the generalization, the system needs
to compare the relevance between each descriptive attribute and the predictive
attribute and then select the least relevant attribute to generalize. It first generates
a set of contingency tables from the generalized data set; each table contains one
descriptive attribute and the predictive attribute ("Salary" in this case).
We will use the following contingency table as an example to illustrate how the
system calculates the relevance between a descriptive attribute ("Occupation" in
this case) and the predictive attribute.
                                    Salary
Occupation         < 20K   20K-30K   30K-40K   40K-50K   > 50K   Row Total
Programmer           100      400       500      2000     1000      4000
Accountant             5       10        10        20        5        50
QA                   300     2400       200        95        5      3000
Manager                0        0        20        80       50       150
TechnicalSupport     800     1100        80        20        0      2000
SalesStaff           500      250        30        18        2       800
Column Total        1705     4160       840      2233     1062     10000
Table 4.5: Contingency Table for Attributes Occupation and Salary.
The system will calculate the relevance of attribute "Occupation" and attribute
"Salary" based on the above contingency table using the Chi-Square method which
we introduced in Section 4.4:

P = Σ_{i=1}^{n} Σ_{j=1}^{m} a_{ji}² / (a_{j(n+1)} × a_{(m+1)i})

In this case n is equal to the number of values of the attribute "Salary" and m is
equal to the number of values of the attribute "Occupation", so n = 5 and m = 6.

P = Σ_{i=1}^{5} Σ_{j=1}^{6} a_{ji}² / (a_{j6} × a_{7i}) = 1.85
The Chi Square χ² is
χ² = N × P − N = 8500

φ² = χ² / N = 0.85

T² = φ² / √((s−1)(t−1)) = 0.19

Thus

T = 0.43
So the relevance between the attribute "Occupation" and the attribute "Salary"
is 0.43.
Then the system will follow the same steps to calculate the relevance of the other
descriptive attributes. We assume the results are as follows.
The relevance between the attribute "Degree" and the attribute "Salary" is 0.32.
The relevance between the attribute "Experience" and the attribute "Salary" is 0.25.
The relevance between the attribute "Major" and the attribute "Salary" is 0.20.
The relevance between the attribute "Age" and the attribute "Salary" is 0.18.
The relevance between the attribute "Address" and the attribute "Salary" is 0.05.
Since "Address" is the least relevant attribute, the system will further generalize
this attribute to a higher concept level and get another generalized data set:
The system will also adjust the target record accordingly from < Occupation =
"Programmer", Age = 25, Degree = "B.Sc.", Major = "Computing Science", Address
= "Vancouver", Experience = 1 > to < Occupation = "Programmer", Age = 25,
Degree = "B.Sc.", Major = "Computing Science", Address = "Lower Mainland",
Experience = 1 >.
Then the system will find all the matching records of the target record in the
further generalized data set and add up all the corresponding counts. In this example,
we assume the total count is now 550, which passes the support threshold. Thus it is
not necessary to further generalize the data set. The system will then calculate the
salary distribution based on the matching records in the latest generalized data set
and output the result shown in Table 4.7.
Salary Probability
Under 20K 10%
20K-30K 25%
30K-40K 60%
40K-50K 4%
Over 50K 1%
Table 4.7: Prediction Result for the Example
So the new graduate gets a relatively clear picture of how much salary he can earn as a programmer working in British Columbia.
If the total count is still less than the specified support threshold after generalizing the attribute "Address" to the higher concept level, the system will repeat the previous steps until either the support threshold is passed or all the attributes are generalized to their top concept levels. In the latter case, the user has to lower the specified support threshold and resubmit his request.
If the user is not satisfied with the result, he can adjust his request and resubmit the query. In our example, since the result is quite satisfactory to the new graduate, he does not have to submit another query.
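The generalize-and-match loop described above can be sketched as follows. All names here are hypothetical, and the sketch simplifies the thesis method in two ways we note explicitly: each concept hierarchy is a single child-to-parent level, and when every attribute has been generalized the distribution over whatever matches exist is returned (in the thesis, the user instead lowers the support threshold and resubmits).

```python
def predict(records, target, relevance, hierarchy, threshold, pred_attr):
    """records: list of (attribute-dict, count) pairs; target: dict of
    descriptive attribute values to match; relevance: attribute -> score;
    hierarchy: attribute -> {child value: parent concept}."""
    attrs = sorted(relevance, key=relevance.get)      # least relevant first
    while True:
        matches = [(r, c) for r, c in records
                   if all(r[a] == target[a] for a in target)]
        total = sum(c for _, c in matches)
        if total >= threshold or not attrs:
            break
        a = attrs.pop(0)                              # generalize attribute a
        records = [({**r, a: hierarchy[a].get(r[a], r[a])}, c)
                   for r, c in records]
        target = {**target, a: hierarchy[a].get(target[a], target[a])}
    # value distribution of the predictive attribute among matching records
    dist = {}
    for r, c in matches:
        dist[r[pred_attr]] = dist.get(r[pred_attr], 0) + c
    return {v: c / total for v, c in dist.items()} if total else {}
```

With a target record that matches too few raw records, generalizing the least relevant attribute ("Address"-like) merges counts until the support threshold is passed, exactly as in the running example.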
use NSERC95
find prediction rule for "amount"
from award A, organization O
where O.org code = A.org code and A.discd = "Computer"
and province = "British Columbia"
in relevance to discd, amount, province
with support 100
The NSERC95 database is used in this experiment; it contains all the NSERC grant allocation information for 1995 in Canada. Notice that the specified value "Computer" for the attribute "discd" is a high-level concept which is not stored in the NSERC database, so the system has to consult the concept hierarchy to decide which set of data is task-relevant. By specifying the above query, we want to know how much grant money a professor in the computer science department of a university located in British Columbia can get. With the support threshold set to 100, we get the following result:
The result shows that for a professor who is in the computer science department of
a university located in British Columbia, the probability for him to get $0-20K from
NSERC is 57%. For $20K-40K, $40K-60K and over $60K, the probabilities are 30%,
8% and 5% respectively.
4.9 Summary
In this chapter, we first introduced the idea of the pattern matching-based predictive modeling approach. Then we defined the problem we are going to solve with this approach, along with two basic concepts, Support and Support Threshold. In Section 4.3, we discussed our motivation to integrate attribute-oriented induction and relevance analysis into the pattern matching-based predictive modeling approach. In Section 4.4, we introduced the Chi-square-based statistical method to calculate the relevance between two attributes and gave a few examples to illustrate it. Then we presented the general ideas of our approach in Section 4.5. In Section 4.6, we examined the method we proposed for pattern matching-based predictive modeling; the rationale of the algorithm and its computational complexity were also discussed. Two variations of the algorithm were examined in Section 4.7. A detailed example was given in Section 4.8, followed by some experimental results produced by DBMiner. Our discussion in this chapter shows that the pattern matching-based predictive modeling approach is useful and feasible for solving real-world problems. In Chapter 5, we will present the details of our performance study, which shows that it is also efficient.
Chapter 5
Performance Study
5.1 Introduction
In this chapter, we will examine the performance of the two predictive modeling approaches we discussed in the previous chapters. In order to study the performance, we implemented these two algorithms and tested them on an IBM-compatible PC with a Pentium II 300 MHz processor and 128 MB of main memory. We also implemented two programs to generate our testing databases and the hierarchies we needed for generalization: a database generator, which generates our synthetic databases for testing, and a hierarchy generator, which generates a hierarchy for each attribute in our testing databases.
The database generator takes the following parameters: the desired number of records, the number of attributes, and the name and the number of distinct values of each attribute. It then generates a synthetic database using a random record generation program which is described in Appendix A. The hierarchy generator takes three parameters: the maximum number of generations, the Child/Parent ratio, which specifies the desired number of children under each parent node, and the value list of an attribute. This program is described in Appendix B.
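A minimal sketch of such a database generator, following the duplicate check via a hash table that the flowchart in Appendix A describes, might look like the following. The function name, the use of a fixed seed, and the dict-based parameter layout are our own assumptions.

```python
import random

def generate_db(num_records, attributes, seed=0):
    """attributes: dict mapping attribute name -> list of distinct values.
    num_records must not exceed the number of distinct value combinations,
    or the loop below cannot terminate."""
    rng = random.Random(seed)
    seen, table = set(), []
    while len(table) < num_records:
        # pick one random value per attribute to form a record
        record = tuple(rng.choice(vals) for vals in attributes.values())
        if record in seen:                # duplicate: regenerate
            continue
        seen.add(record)
        table.append(record)
    return table
```

The hash-set membership test plays the role of the hash-table lookup in the flowchart, keeping duplicate rejection at constant cost per record.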
The remaining part of this chapter is organized as follows. In Section 5.2, we will present a performance study of our classification-based method. This section is divided into six subsections; in each of the first five subsections, we will
demonstrate how a particular factor affects the performance, and a summary is given in the last subsection. The performance study of the pattern matching-based method is given in Section 5.3, which is divided into seven subsections; similarly, in each subsection except the last one, we will study how a particular factor affects the performance of the pattern matching-based method. A brief summary is given in the final subsection. The whole chapter is summarized in Section 5.4.
we are using is 10000, 20000, 40000, 80000, and 160000. The experimental result is shown in Figure 5.1, which shows that both variations scale up reasonably well, but PC STGS2 is better than PC USGS1.
in the classification threshold, the running time also increases, because more and more nodes whose parent node cannot pass the increased classification threshold have to be constructed, and information gain also has to be calculated at these nodes' parent nodes in order to select the next branching attribute. From Figure 5.2, we can also see that after a certain threshold the time increase flattens, because in this testing scenario very few nodes can actually pass that classification threshold, so the number of nodes which remain unconstructed at that classification threshold is limited. Thus the time increase is also limited when we further increase the classification threshold. We can also see that when the classification threshold is below a certain point, the time increase is likewise limited, because in this testing scenario the number of nodes which can pass the lowest classification threshold but not this classification threshold is limited, and thus the associated cost increase is limited. The above observation holds for both PC STGS2 and PC USGS1, but it is more pronounced for PC USGS1 in Figure 5.2.
being the same. The parameter values for our synthetic database are as follows:
1. Case 1: Number of attributes: 4 (3 descriptive attributes and 1 predictive attribute). The number of distinct values for each attribute is as follows: 50 for each descriptive attribute and 5 for the predictive attribute.
2. Case 2: Number of attributes: 8 (7 descriptive attributes and 1 predictive attribute). The number of distinct values for each attribute is as follows: 50 for each descriptive attribute and 5 for the predictive attribute.
3. Case 3: Number of attributes: 12 (11 descriptive attributes and 1 predictive attribute). The number of distinct values for each attribute is as follows: 50 for each descriptive attribute and 5 for the predictive attribute.
4. Case 4: Number of attributes: 16 (15 descriptive attributes and 1 predictive attribute). The number of distinct values for each attribute is as follows: 50 for each descriptive attribute and 5 for the predictive attribute.
We use the same value distribution to avoid possible distortion of the experimental result due to different value distributions for each case.
The parameter values for the hierarchy generator are: the maximum number of generations is 5 and the Child/Parent ratio is 5. The total number of records is 10000. The classification threshold we use is 100%. The experimental result is shown in
Figure 5.3, which shows that with the increase in the number of attributes in the database, the running time also increases, primarily due to the increased overhead of selecting the branching attribute at each node and of constructing more nodes at additional levels in the decision tree. Figure 5.3 also shows that PC STGS2 is better than PC USGS1.
number of records is 10000. The classification threshold we are using is 100%. The numbers of predictive attribute values we use are 10, 20, 40, 80 and 160.
The experimental result is shown in Figure 5.4, from which we can see that, with the increase in the number of predictive attribute values, the running time also increases, due to the overhead involved in calculating the value distribution for the predictive attribute and the information gain at each node. The testing result also shows that PC STGS2 does better than PC USGS1.
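The per-node cost referred to here is the standard ID3 information gain computation; a textbook sketch (not the thesis code, with hypothetical function names) makes clear why it grows with the number of predictive attribute values:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of predictive-attribute values; cost grows with the
    number of distinct values, which is the effect measured in Figure 5.4."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(records, attr, label):
    """ID3 information gain of splitting `records` (list of dicts) on the
    descriptive attribute `attr`, with `label` the predictive attribute."""
    labels = [r[label] for r in records]
    base = entropy(labels)
    # expected entropy after splitting on attr, weighted by branch size
    by_value = {}
    for r in records:
        by_value.setdefault(r[attr], []).append(r[label])
    rem = sum(len(ls) / len(records) * entropy(ls) for ls in by_value.values())
    return base - rem
```

Both the value distribution (the Counter) and the per-branch entropies must be recomputed at every candidate node, so more predictive attribute values mean more work per node.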
The number of predictive attribute values we use is 10. The numbers of distinct values for each descriptive attribute in the four testing cases are 20, 40, 80 and 160. We use the same value distribution (all descriptive attributes have the same number of distinct values in each case) to avoid possible distortion of the experimental result due to different value distributions in each case.
The experimental result is shown in Figure 5.5, from which we can see that, with the increase in the number of descriptive attribute values, the running time also increases, due to the overhead involved in constructing more branches. The result also shows that PC STGS2 does better than PC USGS1.
5.2.6 Summary
The above experimental results demonstrate that in most cases PC STGS2 is better than PC USGS1, as shown by the performance curves. The reason is that in PC STGS2 the target tables become smaller and smaller from the parent generations to the child generations, so the cost of extracting subtables is lower than in PC USGS1. However, when the number of records is small, the classification threshold is very low, and the number of distinct values of the attributes is small, the performance difference between the two variations is not significant. So we can conclude that PC STGS2 can reasonably be counted on as the better choice for the classification-based predictive modeling method.
In this experiment, we wanted to learn how different support thresholds affect the performance of this algorithm with the same request, all the other variables being the same. The parameter values for our synthetic database are as follows. Number of attributes: 6 (5 descriptive attributes and 1 predictive attribute). The number of distinct values for each attribute is as follows: 100 for the first descriptive attribute, 80 for the second, 50 for the third, 20 for the fourth, 15 for the fifth, and 10 for the predictive attribute. The parameter values for the hierarchy generator are: the maximum number of generations is 5 and the Child/Parent ratio is 2. The total number of records is 10000. The support thresholds we are using are 5, 20, 80 and 320.
support threshold is more than 80. The result also shows that PP S1HT is better than PP S2BT.
generalization to pass the given support threshold, and each successive generalization takes less time because the table shrinks more quickly with higher Child/Parent ratios. The testing result also shows that, in general, PP S1HT does better than PP S2BT. However, when the Child/Parent ratio is very low and the number of records is small, PP S2BT does slightly better than PP S1HT, but the performance difference is not significant.
the same request, all the other variables being the same. The parameter values for our synthetic database are as follows. Number of attributes: 6 (5 descriptive attributes and 1 predictive attribute). The number of distinct values for each attribute is as follows: 100 for the first descriptive attribute, 80 for the second, 50 for the third, 20 for the fourth, and 15 for the fifth. The parameter values for the hierarchy generator are: the maximum number of generations is 5 and the Child/Parent ratio is 5. The total number of records is 10000. The support threshold is 30% of the total number of records in our synthetic database. The numbers of predictive attribute values we use are 10, 20, 40, 80 and 160. The experimental result is shown in Figure 5.10, from which we can see that, with the increase in the number of predictive attribute values, the running time also increases, due to the overhead involved in calculating the relevance for each descriptive attribute. The result also shows that PP S1HT is better than PP S2BT.
numbers of descriptive attribute values affect the performance of this algorithm with similar requests, all the other variables being the same. The parameter values for our synthetic database are as follows. Number of attributes: 6 (5 descriptive attributes and 1 predictive attribute). The parameter values for the hierarchy generator are: the maximum number of generations is 5 and the Child/Parent ratio is 5. The total number of records is 10000. The support threshold is 30% of the total number of records in our synthetic database. The number of predictive attribute values we use is 10. The numbers of distinct values for each descriptive attribute in the five testing cases are 20, 40, 80, 120 and 160. We use the same value distribution (all descriptive attributes have the same number of distinct values in each case) to avoid possible distortion of the experimental result due to different value distributions in each case. The experimental result is shown in Figure 5.11, from which we can see that, with the increase in the number of descriptive attribute values, the running time also increases, mainly due to the overhead involved in calculating the relevance for each descriptive attribute. This test also shows that PP S1HT is better than PP S2BT.
5.3.7 Summary
In this section, we have studied the performance of the two variations of the pattern matching-based method. In most cases, PP S1HT is better than PP S2BT, but in a few cases where the number of records is small and the Child/Parent ratio is very
low, the overhead related to the hash table more than offsets its benefit and PP S2BT is slightly better than PP S1HT, though the performance difference is not significant. So the conclusion is that PP S1HT can reasonably be counted on as the better choice for this method.
CHAPTER 6. CONCLUSION AND FUTURE WORK 106
The major methods which have been adopted in past research include statistics-, classification-, and neural net-based methods. All these major methods were introduced, and examples were given, in Chapter 2 to give the reader an overall picture of the predictive modeling research which has been conducted in both the academic and industrial worlds.
Based on attribute-oriented induction and one popular machine learning method, ID3, we developed a classification-based method to perform predictive modeling on data stored in large databases. This method can extract predictive modeling rules at different concept levels and makes it possible for users to get knowledge at the desired concept levels without the limitations of the original ID3 method, such as its inability to handle large data sets and continuous numerical values. We changed the termination condition, which makes the method realistic for large databases. Since the number of distinct values for each attribute can be reduced by attribute-oriented induction, we avoid the problem caused by the tendency of the original ID3 method to favor attributes with a large number of distinct values.
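The induction step referred to here, climbing each value one level up its concept hierarchy, can be sketched as follows (names and example values are hypothetical). Replacing leaf values by their parent concepts shrinks the set of distinct values an attribute can take, which is exactly what blunts ID3's bias toward many-valued attributes:

```python
def generalize(records, attr, parent):
    """Replace each value of `attr` by its higher-level concept.
    records: list of dicts; parent: dict mapping a value to its parent
    concept (values without a parent are left unchanged)."""
    return [{**r, attr: parent.get(r[attr], r[attr])} for r in records]
```

For instance, generalizing a city-valued "Address" attribute to regions collapses many distinct leaf values into a handful of higher-level concepts before the decision tree is built.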
By integrating attribute-oriented induction and a widely adopted statistical method,
we proposed a pattern matching-based predictive modeling method to predict data
values or value distributions on the predictive attribute, based on similar groups of
data in the database. For example, one may predict the amount of research grants
that an applicant may receive, based on the data about the similar groups of re-
searchers. Thanks to the integration with attribute-oriented induction, this method
can also be applied to multiple concept levels.
Both of our proposed methods have been implemented, and the performance study and experiments we did showed that both algorithms work efficiently against large databases.
However, if a request involves a very large data set, it may take nontrivial time just to retrieve the data. In such cases, it would be desirable to precalculate some models and store them for future use, so that the result can be given to the user in a timely manner regardless of the size of the data set involved in the queries. However, since data stored in databases evolve, in many cases at a fast pace, we need to find efficient ways to automatically and incrementally update the precalculated models,
is often interested in one particular object, and thus this method is more suited to answering specific questions such as "What is the possible salary for a programmer with an M.S. degree in computing science, 3 years' experience, living in Vancouver?", which may give the user an answer more tailored to his particular situation. For example, based on the result, a new computing science graduate may decide what salary range he can ask for during job interviews.
The classification-based predictive modeling method is more suited to making predictions at higher concept levels due to the limitation of the ID3 method, which was discussed in Chapter 3. The pattern matching-based method has no such limitation.
The classification-based predictive modeling method is more suited to queries based on fixed concept levels. Once the desired concept level is specified in the query, the decision tree will be constructed based on that concept level, and the final result will be at the same concept level. Although users can adjust the desired concept level by modifying and resubmitting their requests, the entire decision tree then has to be rebuilt on the newly specified concept level. The pattern matching-based method adjusts the concept level of each attribute automatically, based on the specified support threshold and the relevance between each descriptive attribute and the predictive attribute, so the final result may not be at the originally specified concept level.
The result of the classification-based method can be stored for future use. For example, a credit card company may apply the rules derived from the classification-based method to approve or reject new applicants. In most cases, the result of the pattern matching-based method is more specific and thus is not stored for future use.
These two approaches were proposed based on different real-world problems, and thus they each suit different requests. The decision to adopt one method over the other should be made based on each user's particular needs. One interesting observation is that in some cases they can be used as complementary methods. For example,
a credit card company may store some rules derived from the classification-based method to process new applications. In most cases this may work well. However, sometimes the stored rules may not cover a particular applicant. In this case, the company may use the pattern matching-based method to predict the possible credit rating of this particular applicant.
than the attribute "Occupation", but if we combine them together, such as "BC, West Vancouver", they may become even more relevant than "Occupation". Since there could be many combinations of descriptive attributes, we need further study to figure out how to select the combination candidates and how this will affect the final result.
Another interesting problem is related to system implementation. For classification-based predictive modeling, the decision tree building process could be very costly if a very large data set (such as a retailer's transaction database) is involved. If we can identify the queries which are used very often and store their results for future use, it will substantially enhance the performance. However, since the data stored in many operational databases is not static and in many cases evolves at a fast pace (a retailer's transaction database, for example), we need to find efficient ways to update the stored results, and in ideal cases such updates should be incremental, which, we believe, is not a trivial problem.
Appendix A
Database Generator
The following diagram illustrates the algorithm we used to generate our testing
databases:
[Flowchart: for each attribute of the current record, a random index is obtained and the corresponding value is taken from the attribute's value list and placed in the record; when the last column is filled, the completed record is looked up in a hash table of existing records and discarded if it is already present.]
[Flowchart fragment: all leaf nodes are created based on the value list of the given attribute, and the generation count is initialized to 1.]
APPENDIX B. HIERARCHY GENERATOR 114
[Flowchart fragment: parent nodes are created generation by generation, incrementing the generation count, until the maximum number of generations is reached.]
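Under the stated parameters (maximum number of generations, Child/Parent ratio, value list of an attribute), a hierarchy generator might be sketched as follows. The grouping scheme and the synthetic node names are our assumptions, not the algorithm in Appendix B:

```python
def build_hierarchy(values, ratio, max_generations):
    """Build a concept hierarchy bottom-up from leaf `values`.
    Returns a list of dicts, one per generation, mapping child -> parent;
    every `ratio` consecutive children share one synthetic parent node."""
    levels, current = [], list(values)
    for g in range(1, max_generations + 1):
        if len(current) <= 1:             # reached a single root: stop early
            break
        parents = {v: f"gen{g}_node{i // ratio}" for i, v in enumerate(current)}
        levels.append(parents)
        current = sorted(set(parents.values()))
    return levels
```

With 8 leaf values and a Child/Parent ratio of 2, this yields three generations (8 leaves, 4 nodes, 2 nodes, 1 root), after which the loop stops.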