
Linear Statistical Models: Regression

Problems with Stepwise Regression

This statement by Singer & Willett (2003) is one of the best on the use of stepwise approaches:

Never let a computer select predictors mechanically. The computer does not know your research questions nor the literature upon which they rest. It cannot distinguish predictors of direct substantive interest from those whose effects you want to control.

These comments are from the Stata FAQ pages (www.stata.com).

Frank Harrell's comments:

Here are some of the problems with stepwise variable selection.

It yields R-squared values that are badly biased high.
The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Statistics in Medicine).
It yields P-values that do not have the proper meaning and the proper correction for them is a very difficult problem.
It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
It has severe problems in the presence of collinearity.
It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses.
Increasing the sample size doesn't help very much (see Derksen and Keselman).
It allows us to not think about the problem.
It uses a lot of paper.

Note that "all possible subsets" regression does not solve any of these problems.

References

Altman, D. G. and P. K. Andersen. 1989. Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine 8: 771-783.

Shows that stepwise methods yield confidence limits that are far too narrow.

Derksen, S. and H. J. Keselman. 1992. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology 45: 265-282.

Conclusions

"The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model."

"The number of candidate predictor variables affected the number of noise variables that gained entry to the model."

"The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model."

"The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the
number of variables in the final model."

Roecker, Ellen B. 1991. Prediction error and its estimation for subset-selected models. Technometrics 33: 459-468.

Shows that all-possible-subsets regression can yield models that are "too small".

Mantel, Nathan. 1970. Why stepdown procedures in variable selection. Technometrics 12: 621-625.

Hurvich, C. M. and C. L. Tsai. 1990. The impact of model selection on inference in linear regression. American Statistician 44: 214-217.

Copas, J. B. 1983. Regression, prediction and shrinkage (with discussion). Journal of the Royal Statistical Society B 45: 311-354.

Shows why the number of CANDIDATE variables and not the number in the final model is the number of d.f. to consider.

Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58: 267-288.
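The degrees-of-freedom point made by Copas, and by Derksen and Keselman's fourth conclusion, can be illustrated with the ordinary adjusted R-squared formula. The snippet below is my own, with made-up numbers; it simply plugs the number of candidate predictors, rather than the number retained, into the usual adjustment, and is not necessarily the exact statistic those authors studied.

```python
# Usual adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1).
# The honest choice of p is the number of candidate variables screened,
# not the number that survived selection. (Illustrative numbers only.)
def adjusted_r2(r2, n, p):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

r2, n = 0.40, 100                   # apparent R-squared and sample size
print(adjusted_r2(r2, n, p=5))      # adjusted for the 5 variables kept: about 0.37
print(adjusted_r2(r2, n, p=50))     # adjusted for all 50 candidates: about -0.21
```

An apparently respectable fit evaporates once the adjustment acknowledges how many variables were actually tried.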

Ronan Conroy's comments:

I am struck by the fact that Judd and McClelland in their excellent book Data Analysis: A Model Comparison Approach (Harcourt Brace Jovanovich, ISBN 0-15-516765-0) devote less than 2 pages to stepwise methods. What they do say, however, is worth repeating:

Stepwise methods will not necessarily produce the best model if there are redundant predictors (a common problem).
All-possible-subset methods produce the best model for each possible number of terms, but larger models need not necessarily be subsets of smaller ones, causing serious conceptual problems about the underlying logic of the investigation.
Models identified by stepwise methods have an inflated risk of capitalising on chance features of the data. They frequently fail when applied to new datasets, and they are rarely tested in this way (a toy demonstration appears after these comments).
Since the interpretation of coefficients in a model depends on the other terms included, "it seems unwise," to quote J and McC, "to let an automatic algorithm determine the questions we do and do not ask about our data".
I quote this last point directly, as it is sane and succinct:
"It is our experience and strong belief that better models and a better understanding of one's data result from focussed data analysis, guided by substantive theory." (p 204)

They end with a quote from Henderson and Velleman's paper "Building multiple regression models interactively" (1981, Biometrics 37: 391-411):

"The data analyst knows more than the computer,"

and they add

"failure to use that knowledge produces inadequate data analysis".

Personally, I would no more let an automatic routine select my model than I would let some best-fit procedure pack my suitcase.
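The point about capitalising on chance is easy to demonstrate. The sketch below is my own construction, not Judd and McClelland's; it uses a crude correlation screen as a stand-in for automatic selection (the names, sample sizes, and the number of variables kept are arbitrary assumptions), chooses predictors on one pure-noise sample, and then scores the fitted model on a fresh sample from the same process.

```python
# Select predictors on one sample of pure noise, then test the chosen model on
# fresh data. The training R-squared looks respectable; the fresh-sample
# R-squared collapses to about zero or below.
import numpy as np

rng = np.random.default_rng(1)
n, k, keep = 80, 40, 8                  # sample size, candidate count, model size

def r_squared(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

X_train, y_train = rng.normal(size=(n, k)), rng.normal(size=n)
X_test,  y_test  = rng.normal(size=(n, k)), rng.normal(size=n)

# Crude stand-in for automatic selection: keep the predictors most correlated
# with the outcome in the training data only.
corr = np.abs([np.corrcoef(X_train[:, j], y_train)[0, 1] for j in range(k)])
chosen = np.argsort(corr)[-keep:]

# Ordinary least squares on the chosen columns, with an intercept.
A_train = np.column_stack([np.ones(n), X_train[:, chosen]])
beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
A_test = np.column_stack([np.ones(n), X_test[:, chosen]])

print(f"training R-squared:     {r_squared(y_train, A_train @ beta):.2f}")
print(f"fresh-sample R-squared: {r_squared(y_test, A_test @ beta):.2f}")
```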

Summary by Steve Blinkhorn:

So here is a brief abstract of the BJMSP paper, plus odd extracts from elsewhere:

The use of automated subset search algorithms is reviewed and issues concerning model selection and selection criteria are discussed. In addition, a Monte Carlo study is reported which presents data regarding the frequency with which authentic and noise variables are selected by automated subset algorithms. In particular, the effects of the correlation between predictor variables, the number of candidate predictor variables, the size of the sample, and the level of significance for entry and deletion of variables were studied for three automated subset selection algorithms: BACKWARD ELIMINATION, FORWARD SELECTION and STEPWISE. Results indicated that: (1) the degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model; (2) the number of candidate predictor variables affected the number of noise variables that gained entry to the model; (3) the size of the sample was of little practical importance in determining the number of authentic variables contained in the final model; and (4) the population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model.

..... the degree of collinearity between predictor variables was the most important factor influencing the selection of authentic variables.... ... the number of candidate predictor variables affected the number of noise variables that gained entry to the model ...

...Even in the most favourable case investigated ..... 20 per cent of the variables finding their way into the model were noise. In the worst case .... 74 per cent of the selected variables were noise.

... the average number of authentic variables found in the final subset models was always less than half the number of available authentic predictor variables.

.... the 'data mining' approach to model building is likely to result in final models containing a large percentage of noise variables which will be interpreted incorrectly as authentic.
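For anyone who wants to poke at this pattern themselves, here is a rough, simplified sketch of the same kind of Monte Carlo design, written by me in Python with numpy and statsmodels rather than taken from the paper (the same greedy forward-selection helper sketched earlier is repeated so the block runs on its own). A handful of authentic predictors plus a growing pool of correlated noise candidates are run through forward selection, and the noise variables among those selected are counted; the effect sizes, correlation, and sample size are arbitrary assumptions.

```python
# Authentic plus noise candidates, equicorrelated, run through greedy forward
# selection. More candidates generally means more noise in the final model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection by the entering variable's p-value."""
    selected = []
    while True:
        best_p, best_j = 1.0, None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            if fit.pvalues[-1] < best_p:
                best_p, best_j = fit.pvalues[-1], j
        if best_j is None or best_p >= alpha:
            return selected
        selected.append(best_j)

n, n_authentic, rho = 100, 4, 0.4       # sample size, true predictors, correlation
for n_noise in (10, 30, 60):            # growing pool of noise candidates
    k = n_authentic + n_noise
    cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)   # equicorrelated candidates
    X = rng.multivariate_normal(np.zeros(k), cov, size=n)
    beta = np.zeros(k)
    beta[:n_authentic] = 0.5            # only the first few candidates are authentic
    y = X @ beta + rng.normal(size=n)
    chosen = forward_select(X, y)
    noise = sum(j >= n_authentic for j in chosen)
    print(f"{k:3d} candidates -> {len(chosen):2d} selected, {noise:2d} of them noise")
```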

Linear Statistical Models Course

Phil Ender, 14jan00
