Multiple Regression Inference Guide

1) The lecture discusses inference for the multiple regression model, including assessing the significance of variables. Standard errors and confidence intervals allow estimating the variability and accuracy of regression coefficients.
2) Hypothesis tests examine whether individual variables are needed in the model and whether the overall regression relationship is statistically significant. The t-statistic and p-values assess individual predictors, while the F-statistic and p-value judge the entire model.
3) An example analyzes physical measures to predict biochemical levels in children. A full model is compared to submodels excluding variables, to see whether a more parsimonious relationship can adequately describe the data. Confounding must be considered when variables are removed.


STATS 330: Lecture 6

Inference for the Multiple Regression Model

31.07.2014
Getting RStudio

[Link]

[Link]
Inference for the regression model

Aim of today's lecture

- To discuss how we assess the significance of variables in the regression.
- Key concepts:
  - Standard errors
  - Confidence intervals for the coefficients
  - Tests of significance
Variability of the regression coefficients

- Imagine that we keep the xs fixed, but resample the errors and refit the plane. How much would the plane (the estimated coefficients) change?
- This gives us an idea of the variability (accuracy) of the estimated coefficients as estimates of the coefficients of the true regression plane.
[Figure: points scattered about a fitted regression plane, with axes Y, X1, and X2]
Variability of the regression coefficients

- Variability depends on
  - the arrangement of the xs (the more correlation, the more change);
  - the error variance (the more scatter about the true plane, the more the fitted plane changes).
- We measure variability by the standard error of the coefficients.
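The thought experiment above (hold the xs fixed, resample the errors, refit) can be simulated directly. Below is a minimal sketch in Python rather than R, using a made-up one-predictor model y = 1 + 2x + error with sigma = 3, purely for illustration:

```python
import random

# Hypothetical setup (not from the lecture): y = 1 + 2*x + error,
# errors ~ N(0, 3^2), with the xs held fixed across resamples.
random.seed(1)
x = [float(i) for i in range(20)]
b0_true, b1_true, sigma = 1.0, 2.0, 3.0

def ls_slope(y):
    """Least-squares slope Sxy/Sxx for the fixed xs."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx

# Resample the errors many times, refit, record the fitted slope.
slopes = []
for _ in range(2000):
    y = [b0_true + b1_true * xi + random.gauss(0, sigma) for xi in x]
    slopes.append(ls_slope(y))

mean_b1 = sum(slopes) / len(slopes)
sd_b1 = (sum((b - mean_b1) ** 2 for b in slopes) / (len(slopes) - 1)) ** 0.5

# Theoretical standard error for one predictor: sigma / sqrt(Sxx)
xbar = sum(x) / len(x)
sxx = sum((xi - xbar) ** 2 for xi in x)
theoretical_se = sigma / sxx ** 0.5
```

The empirical spread of the refitted slopes should land close to the theoretical standard error sigma/sqrt(Sxx); the Std. Error column in an lm summary is the estimated version of exactly this quantity.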


Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Confidence intervals

CI: estimated coefficient ± t × standard error

t: the 97.5% point of the t distribution with df degrees of freedom.

df = n - k - 1.

n: number of observations.

k: number of covariates (assuming we have a constant term).
Confidence intervals
Example: Cherries

Use the stats function confint

> confint([Link])
2.5 % 97.5 %
(Intercept) -75.68226247 -40.2930554
diameter 50.00206788 62.9937842
height 0.07264863 0.6058538
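These intervals can be reproduced by hand from the CI formula. For the cherries data n = 31 and k = 2, so df = 28; the 97.5% point of t with 28 df is about 2.0484 (hard-coded below, since computing it exactly needs a stats library). A quick check for the diameter coefficient, in Python:

```python
# Coefficient and standard error for diameter, read off the R summary above.
estimate, se = 56.4979, 3.1712
t_crit = 2.0484  # approximate 97.5% point of the t distribution with 28 df

lower = estimate - t_crit * se
upper = estimate + t_crit * se
# Should agree with confint's (50.002, 62.994) up to rounding.
```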
Hypothesis test

- Often we ask: do we need a particular variable, given that the others are in the model?
- Note that this is not the same as asking: is a particular variable related to the response?
- We can test the former by examining the ratio of the coefficient to its standard error.
Hypothesis test

- This ratio is the t-statistic.
- The bigger |t| is, the more we need the variable.
- Equivalently, the smaller the p-value, the more we need the variable.
Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
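Each t value in the summary is simply the estimate divided by its standard error. A quick arithmetic check (Python, with the values copied from the output above):

```python
# t value for height: estimate / standard error
t_height = 0.3393 / 0.1302     # about 2.606; R prints 2.607 from unrounded values

# t value for diameter
t_diameter = 56.4979 / 3.1712  # about 17.816, as printed
```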
Recall: p-value
[Figure: density of the t distribution with df = 28; the two tails beyond ±2.607 are shaded, with total area equal to the p-value of 0.0145]
Other hypotheses

- Overall significance of the regression: do none of the variables have a relationship with the response?
- Use the F-statistic: the bigger F, the more evidence that at least one variable has a relationship.
- Equivalently, the smaller the p-value, the more evidence that at least one variable has a relationship.
Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
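For a model with an intercept, the overall F statistic is related to R-squared by F = (R^2/k) / ((1 - R^2)/(n - k - 1)). This is a standard identity, not how R computes it, but it makes a handy sanity check against the printed output:

```python
r2, n, k = 0.948, 31, 2  # Multiple R-squared, observations, covariates

F = (r2 / k) / ((1 - r2) / (n - k - 1))
# about 255, matching the printed F-statistic up to rounding of R-squared
```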
Testing if a subset is required

- Often we want to test whether a subset of variables is unnecessary.
- Terminology:
  - Full model: the model containing all variables.
  - Submodel: the model with a set of variables removed.
- The test is based on comparing the RSS of the submodel with the RSS of the full model. The full model RSS is always smaller (why?)
Testing if a subset is required

- If the full model RSS is not much smaller than the submodel RSS, the submodel is adequate: we do not need the extra variables.
- To do the test, we
  - fit both models and get the RSS for both;
  - calculate the test statistic.
- If the test statistic is large (equivalently, the p-value is small), the submodel is not adequate.
Testing if a subset is required

- The test statistic is

      F = (RSS_sub - RSS_full) / (s^2 × (df_sub - df_full))

- df_sub - df_full is the number of variables dropped (df here denotes residual degrees of freedom, so the submodel has more).
- s^2 is the estimate of σ^2 from the full model (the residual mean square, RSS_full / (n - k - 1)).
- R has a function, anova, to do the calculation.


p-values

- If the submodel is correct, the test statistic has an F-distribution with df_sub - df_full and n - k - 1 degrees of freedom.
- We assess whether the value of F calculated from the sample is a plausible value from this distribution by means of a p-value.
- If the p-value is too small, we have evidence against the hypothesis that the submodel is OK.
p-values
[Figure: density of the F distribution with 2 and 16 degrees of freedom; the area to the right of the observed F value is shaded and equals the p-value]
Example: Free fatty acid data

- Use physical measures to model a biochemical parameter in overweight children.
- Variables are:
  - FFA: free fatty acid level in blood (the response variable)
  - Age: months
  - Weight: pounds
  - Skinfold thickness: inches


Analysis

Call:
lm(formula = ffa ~ age + weight + skinfold, data = [Link])

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714

This suggests:
- age is not required if weight and skinfold are retained;
- skinfold is not required if weight and age are retained.
- Can we get away with just weight?


Analysis

> [Link] <- lm(ffa~weight,data=[Link])


> anova([Link],[Link])
Analysis of Variance Table

Model 1: ffa ~ weight


Model 2: ffa ~ age + weight + skinfold
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 0.91007
2 16 0.79113 2 0.11895 1.2028 0.3261

- The small F and large p-value suggest that weight alone is adequate.
- But the test should be interpreted with caution: could there be confounding?
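The F value in the anova table comes straight from the subset-test formula: the drop in RSS per variable removed, divided by the full model's residual mean square. A check in Python, with the numbers copied from the table above:

```python
rss_sub, df_sub = 0.91007, 18    # ffa ~ weight
rss_full, df_full = 0.79113, 16  # ffa ~ age + weight + skinfold
n_dropped = df_sub - df_full     # residual df gained = 2 variables dropped

s2 = rss_full / df_full                       # residual mean square of full model
F = ((rss_sub - rss_full) / n_dropped) / s2   # about 1.2028, as printed
```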
Confounding?
- Confounding: a non-causal relation due to a missing variable.
- Its effect can be checked by comparing the coefficients in the full model and the submodel (where both are available).
> summary([Link])
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714

> summary([Link])
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.01651 0.37578 5.366 4.23e-05 ***
weight -0.02162 0.00608 -3.555 0.00226 **

The coefficient of weight changes only slightly when age and skinfold are dropped (-0.02007 to -0.02162), so there is little sign of confounding here.
