CLASSIFICATION & PREDICTION
- Shailesh Yadav
Central University of Rajasthan
CONTENTS
Classification & Prediction
Methods Of Classification
Other Classification Methods
Prediction
Conclusion
Classification vs. Prediction
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data
Prediction
models continuous-valued functions, i.e., predicts unknown or missing
values
Typical applications:
1- Credit/loan approval
2- Medical diagnosis: if a tumor is cancerous or benign
3- Fraud detection: if a transaction is fraudulent
4- Web page categorization: which category it is
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple /sample is assumed to belong to a predefined class, as determined
by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified result
from the model
Accuracy rate is the percentage of test set samples that are correctly
classified by the model
Test set is independent of training set, otherwise over-fitting will occur
If the accuracy is acceptable, use the model to classify data tuples whose class
labels are not known
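The two-step process can be seen end to end in a short sketch. This is a minimal illustration, not part of the original slides: it assumes scikit-learn is available and uses its bundled iris data as a stand-in for any labeled training set.

```python
# Minimal sketch of the two-step process, assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Keep the test set independent of the training set to avoid over-fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)   # step 1: model construction
acc = accuracy_score(y_test, model.predict(X_test))      # step 2: model usage / accuracy
print(f"accuracy = {acc:.2f}")
```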
Process (1): Model Construction
A classification algorithm is applied to the training data and outputs the classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction
The classifier is first checked against the testing data, then applied to unseen data.

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured? yes (rank = 'professor')
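As a small illustration (the function name is mine, not from the slides), the learned rule can be written as a one-line test and applied to the unseen tuple:

```python
# The learned rule from Process (1) as a plain Python function.
def tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

print(tenured("Professor", 4))        # Jeff -> "yes"
print(tenured("Assistant Prof", 2))   # Tom  -> "no"
```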
Issues: Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Issues: Evaluating Classification Methods
Accuracy
classifier accuracy: predicting class label
predictor accuracy: estimating the value of the predicted attribute
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules
Methods Of Classification
By Decision Tree Induction
Bayesian Classification
Rule Based Classification
Decision Tree Induction
Decision tree induction is the learning of
decision trees from class-labeled training
tuples. A decision tree is a flowchart-like tree
structure, where each internal node (non-leaf
node) denotes a test on an attribute, each
branch represents an outcome of the test, and
each leaf node (or terminal node) holds a class
label. The topmost node in a tree is the root
node.
Decision Tree Induction: Training Dataset
This follows an example of Quinlan's ID3 (Playing Tennis):

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Output: A Decision Tree for “buys_computer”
age?
├─ <=30   → student?
│            ├─ no  → no
│            └─ yes → yes
├─ 31..40 → yes
└─ >40    → credit rating?
             ├─ excellent → no
             └─ fair      → yes
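The output tree can be read directly as nested attribute tests. A minimal transcription (the function name is mine, chosen for illustration):

```python
# Direct transcription of the tree above: age is tested at the root,
# student on the <=30 branch, credit_rating on the >40 branch.
def buys_computer(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))   # -> "yes"
```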
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain; a worked sketch follows this list)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
There are no samples left
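The attribute-selection heuristic can be made concrete. This is a minimal sketch assuming categorical attributes; entropy and info_gain are my own helper names, following the usual Info(D) and Gain(A) definitions:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Gain(A) = Info(D) - sum_v |D_v|/|D| * Info(D_v) for attribute A."""
    split = defaultdict(list)
    for row, y in zip(rows, labels):
        split[row[attr]].append(y)
    after = sum(len(p) / len(labels) * entropy(p) for p in split.values())
    return entropy(labels) - after
```

On the buys_computer table, age yields the highest gain, which is why it is the root test in the tree shown earlier.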
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
Foundation: Based on Bayes' theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes' Theorem: Basics
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis
holds given the observed data sample X
P(H) (prior probability), the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood), the probability of observing the sample X,
given that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is 31..40, medium
income
Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X),
follows Bayes' theorem:

P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
Informally, this can be written as
posteriori = likelihood x prior/evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among
all the P(Ck|X) for the k classes
Practical difficulty: require initial knowledge of many probabilities,
significant computational cost
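A minimal naïve Bayesian sketch over the buys_computer training table from earlier, assuming conditional independence of the attributes; posterior_score is my own helper name:

```python
# score(C) = P(C) * prod_k P(x_k | C): the numerator of Bayes' theorem
# under the naive conditional-independence assumption.
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

def posterior_score(x, label):
    in_class = [r for r in data if r[-1] == label]
    score = len(in_class) / len(data)                 # prior P(C)
    for k, value in enumerate(x):
        match = sum(1 for r in in_class if r[k] == value)
        score *= match / len(in_class)                # likelihood P(x_k | C)
    return score

x = ("<=30", "medium", "yes", "fair")
print(max(["yes", "no"], key=lambda c: posterior_score(x, c)))  # -> "yes"
```

For this tuple the scores come out to roughly 0.028 for yes and 0.007 for no, so yes is predicted.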
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
ncovers = # of tuples covered by R
ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
If more than one rule is triggered, need conflict resolution
Size ordering: assign the highest priority to the triggering rule that has the "toughest"
requirement (i.e., with the most attribute tests)
Class-based ordering: decreasing order of prevalence or misclassification cost per class
Rule-based ordering (decision list): rules are organized into one long priority list, according
to some measure of rule quality or by experts
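A small sketch of the coverage and accuracy measures above, applied to rule R on a few buys_computer tuples (coverage_and_accuracy and the rule lambda are my own illustrative names; on the full 14-tuple table, coverage is 2/14 and accuracy 2/2):

```python
rows = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "high",   "no",  "fair",      "yes"),
]

def coverage_and_accuracy(rule_matches, rule_label, rows):
    # coverage(R) = n_covers / |D|;  accuracy(R) = n_correct / n_covers
    covered = [r for r in rows if rule_matches(r)]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for r in covered if r[-1] == rule_label)
    return len(covered) / len(rows), correct / len(covered)

# R: IF age = youth (<=30) AND student = yes THEN buys_computer = yes
r_matches = lambda r: r[0] == "<=30" and r[2] == "yes"
print(coverage_and_accuracy(r_matches, "yes", rows))   # (0.5, 1.0) here
```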
Other Classification Methods
Genetic Algorithms
Rough Set Approach
Fuzzy Set Approach
What Is Prediction?
(Numerical) prediction is similar to classification
construct a model
use model to predict continuous or ordered value for a given input
Prediction is different from classification
Classification refers to predicting categorical class labels
Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent or predictor variables
and a dependent or response variable
Regression analysis
Linear and multiple regression
Non-linear regression
Other regression methods: generalized linear model, Poisson regression, log-linear
models, regression trees
Linear Regression
Linear regression: involves a response variable y and a single predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}
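A minimal sketch of these closed-form estimates in plain Python (fit_line is an illustrative name):

```python
def fit_line(xs, ys):
    # Least-squares estimates: w1 from the formula above, w0 = y_bar - w1*x_bar.
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - w1 * x_bar, w1   # (w0, w1)

w0, w1 = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(w0, w1)   # w0 = 0.15, w1 = 1.94 for this toy data
```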
Multiple linear regression: involves more than one predictor variable
Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
Solvable by extension of least square method or using SAS, S-Plus
Many nonlinear functions can be transformed into the above
Nonlinear Regression
Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into linear regression
model. For example,
y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear with new variables x2 = x², x3 = x³:
y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as power function, can also be transformed to linear
model
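A minimal sketch of this transformation, assuming NumPy is available: x, x², and x³ are treated as separate predictors and ordinary least squares is solved on the resulting design matrix (fit_cubic is my own name):

```python
import numpy as np

def fit_cubic(x, y):
    # Design matrix with columns 1, x, x^2, x^3: the new "linear" predictors.
    X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w   # [w0, w1, w2, w3]

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(fit_cubic(x, 1 + 2 * x + 0.5 * x ** 3))   # ≈ [1, 2, 0, 0.5]
```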
Some models are intractably nonlinear (e.g., sum of exponential terms)
possible to obtain least square estimates through extensive calculation on
more complex formulae
Other Regression-Based Models
Generalized linear model:
Foundation on which linear regression can be applied to modeling
categorical response variables
Variance of y is a function of the mean value of y, not a constant
Logistic regression: models the prob. of some event occurring as a linear
function of a set of predictor variables
Poisson regression: models the data that exhibit a Poisson distribution
Log-linear models: (for categorical data)
Approximate discrete multidimensional prob. distributions
Also useful for data compression and smoothing
Regression trees and model trees
Trees to predict continuous values rather than class labels
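A minimal regression-tree sketch, assuming scikit-learn is available; the tree's leaves hold continuous values rather than class labels:

```python
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [4], [5], [6]]       # one predictor variable
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]       # continuous response
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[2.5]]))             # a continuous prediction
```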
Predictor Error Measures
Measure predictor accuracy: measure how far off the predicted value is from the
actual known value
Loss function: measures the error between yi and the predicted value yi'
Absolute error: | yi – yi’|
Squared error: (yi – yi’)2
Test error (generalization error): the average loss over the test set
Mean absolute error:  \mathrm{MAE} = \frac{1}{d} \sum_{i=1}^{d} \lvert y_i - y_i' \rvert
Mean squared error:  \mathrm{MSE} = \frac{1}{d} \sum_{i=1}^{d} (y_i - y_i')^2
Relative absolute error:  \mathrm{RAE} = \frac{\sum_{i=1}^{d} \lvert y_i - y_i' \rvert}{\sum_{i=1}^{d} \lvert y_i - \bar{y} \rvert}
Relative squared error:  \mathrm{RSE} = \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}
The mean squared error exaggerates the presence of outliers; the (square) root mean squared error and, similarly, the root relative squared error are therefore popularly used.
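A minimal sketch computing all of these measures in one pass (the dictionary keys are my own labels):

```python
import math

def error_measures(y_true, y_pred):
    d = len(y_true)
    y_bar = sum(y_true) / d
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mse = sum(sq_err) / d
    return {
        "MAE":  sum(abs_err) / d,
        "MSE":  mse,
        "RMSE": math.sqrt(mse),   # root mean squared error
        "RAE":  sum(abs_err) / sum(abs(t - y_bar) for t in y_true),
        "RSE":  sum(sq_err) / sum((t - y_bar) ** 2 for t in y_true),
    }

print(error_measures([3.0, 5.0, 7.0], [2.5, 5.5, 6.0]))
```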
Conclusion
Classification and prediction are two forms of data analysis that can be used to
extract models describing important data classes or to predict future data trends.
Effective and scalable methods have been developed for decision tree
induction, naïve Bayesian classification, Bayesian belief networks, rule-based
classifiers, backpropagation, Support Vector Machines (SVM), associative
classification, nearest-neighbor classifiers, and case-based reasoning, as well as
other classification methods such as genetic algorithms and rough set and fuzzy
set approaches.
Linear, nonlinear, and generalized linear models of regression can be used for
prediction. Many nonlinear problems can be converted to linear problems by
performing transformations on the predictor variables. Regression trees and
model trees are also used for prediction.
Conclusion (cont.)
Stratified k-fold cross-validation is a recommended method for accuracy
estimation. Bagging and boosting can be used to increase overall accuracy
by learning and combining a series of individual models.
There have been numerous comparisons of the different classification and
prediction methods, and the matter remains a research topic
No single method has been found to be superior over all others for all data
sets
Issues such as accuracy, training time, robustness, interpretability, and
scalability must be considered and can involve trade-offs, further
complicating the quest for an overall superior method
References
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.
ANY QUERIES?
THANK YOU!