Multiple Discriminant Analysis and Logistic Regression
Examples:
• Gender – Male vs. Female
• Heavy Users vs. Light Users
• Purchasers vs. Non-purchasers
• Good Credit Risk vs. Poor Credit Risk
• Member vs. Non-Member
• Attorney, Physician or Professor
Group 1: Would purchase
Subject       X1     X2     X3
1              8      9      6
2              6      7      5
3             10      6      3
4              9      4      4
5              4      8      2
Group Mean   7.4    6.8    4.0

Group 2: Would not purchase
Subject       X1     X2     X3
6              5      4      7
7              3      7      2
8              4      5      5
9              2      4      3
10             2      2      2
Group Mean   3.2    4.4    3.8
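The group means in the table can be checked directly from the raw scores. The following is a minimal sketch in Python; the labels X1 to X3 and all variable names are assumptions made for illustration, since the slide does not name the three measured characteristics.

```python
# Scores for the ten subjects, copied from the table above.
# The labels X1, X2, X3 are placeholders; the slide does not name the variables.
would_purchase = [(8, 9, 6), (6, 7, 5), (10, 6, 3), (9, 4, 4), (4, 8, 2)]
would_not_purchase = [(5, 4, 7), (3, 7, 2), (4, 5, 5), (2, 4, 3), (2, 2, 2)]


def group_means(rows):
    """Return the mean of each column for one group of subjects."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


mean_1 = group_means(would_purchase)
mean_2 = group_means(would_not_purchase)

# Difference between group means: a larger gap suggests a variable that
# separates the two groups better when it is considered on its own.
differences = [round(a - b, 1) for a, b in zip(mean_1, mean_2)]

for label, m1, m2, d in zip(("X1", "X2", "X3"), mean_1, mean_2, differences):
    print(f"{label}: group 1 mean = {m1:.1f}, group 2 mean = {m2:.1f}, difference = {d:+.1f}")
```

On these ten subjects the first variable shows the largest gap between group means (7.4 vs. 3.2) and the third the smallest (4.0 vs. 3.8).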
[Figure: Groups A and B plotted on axes X1 and X2, with their projections A’ and B’ onto the discriminant function Z.]
Copyright © 2010 Pearson Education, Inc., publishing as Prentice-Hall. 5-6
Discriminant Analysis Decision Process
Key Assumptions
• Multivariate normality of the
independent variables.
Other Assumptions
• Minimal multicollinearity among
independent variables.
• Group sample sizes relatively equal.
• Linear relationships.
• Elimination of outliers.
Major Considerations:
• The statistical and practical rationale for
developing classification matrices,
• The cutting score determination,
• Construction of the classification matrices,
and
• Standards for assessing classification
accuracy.
Issues . . .
• Define the prior probabilities based either on
the relative sample sizes of the observed
groups or specified by the researcher (either
assumed to be equal or with values set by the
researcher), and
• Calculate the optimum cutting score value as a
weighted average based on the assumed sizes
of the groups (derived from the sample sizes).
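As a minimal sketch of the weighted-average calculation described above, the function below uses a commonly cited textbook form of the optimal cutting score, (N_A * Z_B + N_B * Z_A) / (N_A + N_B), which reduces to the midpoint of the two centroids when the groups are equal in size. The function name, the centroid values and the group sizes are hypothetical and chosen only for illustration.

```python
def cutting_score(z_a, z_b, n_a, n_b):
    """Weighted cutting score between two group centroids.

    z_a, z_b: centroids (mean discriminant Z scores) of groups A and B.
    n_a, n_b: group sizes used as weights, taken from the sample or set
              by the researcher as assumed group sizes.
    """
    return (n_a * z_b + n_b * z_a) / (n_a + n_b)


# Hypothetical centroids, not taken from the slides.
z_a, z_b = -1.0, 1.0

equal = cutting_score(z_a, z_b, n_a=25, n_b=25)    # midpoint of the centroids
unequal = cutting_score(z_a, z_b, n_a=40, n_b=10)  # shifts toward smaller group B

print(f"equal group sizes:   {equal:.2f}")
print(f"unequal group sizes: {unequal:.2f}")
```

With unequal sizes the cutting score moves toward the centroid of the smaller group, so a wider range of Z scores is assigned to the larger group.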
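[Figure: distributions of discriminant Z scores for Group A and Group B with centroids Z̄A and Z̄B; cases on Group A's side of the cutting score are classified as A (Nonpurchaser), cases on Group B's side as B (Purchaser).]
[Figure: unequal-sized Groups A and B with centroids Z̄A and Z̄B, comparing the unweighted cutting score with the optimal weighted cutting score.]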
                        Predicted Group
Actual            Would       Would Not     Actual     Percent Correct
Group             Purchase    Purchase      Total      Classification
(1)                  22           3           25            88%
(2)                   5          20           25            80%
Predicted Total      27          23           50

Percent Correctly Classified (hit ratio) =
100 x [(22 + 20)/50] = 84%
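The hit ratio and the per-group percentages can be reproduced from the four cells of the classification matrix. The snippet below is a small illustrative sketch; the variable names are assumptions, and only the cell counts come from the slide.

```python
# Classification matrix from the slide: rows are actual groups,
# columns are predicted groups (would purchase, would not purchase).
matrix = [
    [22, 3],   # actual group 1
    [5, 20],   # actual group 2
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal cells

hit_ratio = 100 * correct / total
per_group = [100 * row[i] / sum(row) for i, row in enumerate(matrix)]
predicted_totals = [sum(col) for col in zip(*matrix)]

print(f"hit ratio: {hit_ratio:.0f}%")
for g, pct in enumerate(per_group, start=1):
    print(f"group {g} correctly classified: {pct:.0f}%")
print(f"predicted totals: {predicted_totals}")
```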
Rules of Thumb 7–3
Assessing Predictive Accuracy
• The classification matrix and hit ratio replace R² as the
measure of model fit:
  o Assess the hit ratio both overall and by group.
  o If the estimation and analysis samples both exceed
    100 cases and each group exceeds 20 cases,
    derive separate standards for each sample. If not,
    derive a single standard from the overall sample.
• Analyze the misclassified observations both
graphically (territorial map) and empirically
(Mahalanobis D²).
Rules of Thumb 7–3 Continued . . .
Assessing Predictive Accuracy
• There are multiple criteria for comparison to the hit ratio:
  o The maximum chance criterion for evaluating the hit ratio is
    the most conservative, giving the highest baseline value to
    exceed.
  o Be cautious in using the maximum chance criterion in
    situations with overall samples less than 100 and/or group
    sizes under 20.
  o The proportional chance criterion considers all groups in
    establishing the comparison standard and is the most
    popular.
  o The actual predictive accuracy (hit ratio) should exceed
    any criterion value by at least 25%.
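The two chance criteria can be sketched as follows, assuming their usual textbook definitions: the proportional chance criterion is the sum of the squared group proportions, and the maximum chance criterion is the proportion of the largest group. The 25% margin is read here as 1.25 times the criterion value, which is an interpretation rather than something the slide spells out. Function names and the unequal group sizes are hypothetical.

```python
def chance_criteria(group_sizes):
    """Return (proportional chance, maximum chance) as proportions."""
    total = sum(group_sizes)
    shares = [n / total for n in group_sizes]
    proportional = sum(p * p for p in shares)  # sum of squared group shares
    maximum = max(shares)                      # share of the largest group
    return proportional, maximum


def required_hit_ratio(criterion, margin=0.25):
    """Hit ratio needed to beat a chance criterion by the given margin."""
    return criterion * (1 + margin)


# Equal groups of 25, as in the classification matrix example.
c_pro, c_max = chance_criteria([25, 25])
print(f"equal groups:   proportional = {c_pro:.3f}, maximum = {c_max:.3f}, "
      f"required = {required_hit_ratio(c_pro):.3f}")

# Hypothetical unequal groups, for comparison.
c_pro_u, c_max_u = chance_criteria([60, 40])
print(f"unequal groups: proportional = {c_pro_u:.3f}, maximum = {c_max_u:.3f}, "
      f"required (maximum chance) = {required_hit_ratio(c_max_u):.3f}")
```

For two equal groups both criteria equal 0.50, so under this reading the 84% hit ratio from the classification matrix clears the 62.5% threshold. With unequal groups the maximum chance criterion is the higher of the two baselines.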
Stage 5: Interpretation of the Results
Three Methods . . .
1. Standardized discriminant weights,
2. Discriminant loadings (structure
correlations), and
3. Partial F values.
Three Steps . . .
1. Selecting variables,
2. Stretching the vectors, and
3. Plotting the group centroids.
[Figure: group centroids for X1 - Customer Type (Less than 1 year, 1 to 5 years, Over 5 years) plotted on Function 1 (horizontal axis) and Function 2 (vertical axis).]
Rules of Thumb 7–4
Logistic Regression Defined
Logistic Regression . . . is a specialized
form of regression that is designed to predict
and explain a binary (two-group) categorical
variable rather than a metric dependent
measure. Its variate is similar to regular
regression and made up of metric
independent variables. It is less affected than
discriminant analysis when the basic
assumptions, particularly normality of the
independent variables, are not met.
Logistic Regression May Be Preferred . . .
Logistic Regression Decision Process
Stage 1: Objectives of Logistic Regression
Stage 2: Research Design for
Logistic Regression
• The binary nature of the dependent variable (0 – 1)
means the error term has a binomial distribution
instead of a normal distribution, and it thus invalidates
all testing based on the assumption of normality.
• The variance of the dichotomous variable is not
constant, creating instances of heteroscedasticity as
well.
• Neither of the above violations can be remedied
through transformations of the dependent or
independent variables. Logistic regression was
developed to specifically deal with these issues.
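The non-constant variance mentioned above follows from the fact that a 0/1 variable with probability p of being 1 has variance p(1 - p). A brief sketch, with arbitrary illustrative probabilities:

```python
def bernoulli_variance(p):
    """Variance of a 0/1 variable whose probability of a 1 is p."""
    return p * (1 - p)


# Illustrative probabilities, not taken from the slides.
probabilities = [0.1, 0.3, 0.5, 0.7, 0.9]
variances = [bernoulli_variance(p) for p in probabilities]

for p, v in zip(probabilities, variances):
    print(f"p = {p:.1f}  variance = {v:.2f}")
```

The variance peaks at p = 0.5 and shrinks toward 0 as p approaches 0 or 1, so it changes with the predicted value instead of staying constant.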
Stage 3: Assumptions of Logistic
Regression
Stage 4: Estimation of Logistic Regression Model
and Assessing Overall Fit
Estimating the Coefficients
Transforming a Probability into Odds
and Logit Values
o The logistic transformation has two basic steps:
Restating a probability as odds, and
Calculating the logit values.
o Instead of using ordinary least squares to
estimate the model, the maximum likelihood
method is used.
o The basic measure of how well the maximum
likelihood estimation procedure fits is the
likelihood value.
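The two steps of the logistic transformation can be sketched in a few lines of Python. The function names are illustrative, and the probabilities are arbitrary example values.

```python
import math


def odds(p):
    """Step 1: restate a probability as odds, p / (1 - p)."""
    return p / (1 - p)


def logit(p):
    """Step 2: take the natural log of the odds."""
    return math.log(odds(p))


def inverse_logit(z):
    """Convert a logit value back to a probability."""
    return 1 / (1 + math.exp(-z))


for p in (0.2, 0.5, 0.8):
    print(f"p = {p:.1f}  odds = {odds(p):.2f}  logit = {logit(p):+.3f}")
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0; probabilities above 0.5 give positive logits and those below give negative logits, so the logit is unbounded while the probability it maps back to stays between 0 and 1.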
Model Estimation Fit – Between Model
comparisons . . .
Comparison to Multiple Regression . . .
Stage 5: Interpretation of the Results
Directionality of the Relationship
A positive relationship means an increase in the
independent variable is associated with an increase in the
predicted probability, and vice versa. But the direction of
the relationship is reflected differently for the original and
exponentiated logistic coefficients.
• Original coefficient signs indicate the direction of the
relationship.
• Exponentiated coefficients are interpreted differently
since they are the antilogarithms of the original coefficients
and do not have negative values. Thus, exponentiated
coefficients above 1.0 represent a positive relationship
and values less than 1.0 represent negative
relationships.
Magnitude of the Relationship . . .
Rules of Thumb 7–5
Logistic Regression
• Logistic regression is the preferred method for two-
group (binary) dependent variables due to its
robustness, ease of interpretation and diagnostics.
• Sample size considerations for logistic regression are
primarily focused on the size of each group, which
should contain at least 10 observations for each
estimated model coefficient (the number of variables).
• Sample size should be met in both the analysis and
holdout samples.
• Model significance tests are made with a chi-square
test on the differences in the log likelihood values
(-2LL) between two models.
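The chi-square test on the difference in -2LL values can be sketched as below. The degrees of freedom are assumed to equal the number of coefficients added to the model, the -2LL values are hypothetical, and the helper computes the chi-square upper-tail probability from closed-form series that hold for integer degrees of freedom (a statistics library would normally be used instead).

```python
import math


def chi_square_p_value(statistic, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if statistic <= 0:
        return 1.0
    half = statistic / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite series.
    term, total = 1.0, 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= statistic / (2 * j - 1)
        total += term
    tail = math.sqrt(2 * statistic / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + tail


def likelihood_ratio_test(neg2ll_base, neg2ll_proposed, df):
    """Chi-square statistic and p-value for the drop in -2LL."""
    statistic = neg2ll_base - neg2ll_proposed
    return statistic, chi_square_p_value(statistic, df)


# Hypothetical -2LL values for a base model and a model with two added variables.
stat, p = likelihood_ratio_test(neg2ll_base=82.0, neg2ll_proposed=70.5, df=2)
print(f"chi-square = {stat:.2f}, df = 2, p = {p:.4f}")
```

A small p-value indicates that the added variables reduce -2LL by more than would be expected by chance.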
Rules of Thumb 7–5 continued . . .
Logistic Regression
• Coefficients are expressed in two forms: original and
exponentiated to assist in interpretation.
• Interpretation of the coefficients for direction and
magnitude is:
Direction can be directly assessed in the original
coefficients (positive or negative signs) or indirectly in
the exponentiated coefficients (less than 1 are
negative, greater than 1 are positive).
Magnitude is best assessed by the exponentiated
coefficient, with the percentage change in the
odds shown by: Percentage change =
(Exponentiated Coefficient – 1.0) * 100
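As an illustration of the two rules above, the sketch below exponentiates a coefficient, reads its direction, and applies the percentage-change formula, which describes the change in the odds for a one-unit increase in the independent variable. The variable names and coefficient values are hypothetical.

```python
import math


def interpret(coefficient):
    """Return (exponentiated coefficient, percentage change in odds, direction)."""
    exponentiated = math.exp(coefficient)
    percentage_change = (exponentiated - 1.0) * 100
    if coefficient > 0:
        direction = "positive"
    elif coefficient < 0:
        direction = "negative"
    else:
        direction = "none"
    return exponentiated, percentage_change, direction


# Hypothetical original (logit) coefficients for two predictors.
coefficients = {"X_a": 0.405, "X_b": -0.693}

for name, b in coefficients.items():
    exp_b, pct, direction = interpret(b)
    print(f"{name}: b = {b:+.3f}, exp(b) = {exp_b:.3f}, "
          f"{direction} relationship, odds change = {pct:+.1f}%")
```

A positive original coefficient always yields an exponentiated value above 1.0 and a negative one yields a value between 0 and 1.0, which is why the two forms agree on direction.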
Stage 6: Validation of the Results
Description of HBAT Primary Database Variables
Variable Description Variable Type
Data Warehouse Classification Variables
X1 Customer Type nonmetric
X2 Industry Type nonmetric
X3 Firm Size nonmetric
X4 Region nonmetric
X5 Distribution System nonmetric
Performance Perceptions Variables
X6 Product Quality metric
X7 E-Commerce Activities/Website metric
X8 Technical Support metric
X9 Complaint Resolution metric
X10 Advertising metric
X11 Product Line metric
X12 Salesforce Image metric
X13 Competitive Pricing metric
X14 Warranty & Claims metric
X15 New Products metric
X16 Ordering & Billing metric
X17 Price Flexibility metric
X18 Delivery Speed metric
Outcome/Relationship Measures
X19 Satisfaction metric
X20 Likelihood of Recommendation metric
X21 Likelihood of Future Purchase metric
X22 Current Purchase/Usage Level metric
X23 Consider Strategic Alliance/Partnership in Future nonmetric