STATS 330: Lecture 6
Inference for the Multiple Regression Model
31.07.2014
Getting RStudio
[Link]
[Link]
Inference for the regression model
Aim of today's lecture
- To discuss how we assess the significance of variables in the regression.
- Key concepts:
  - Standard errors
  - Confidence intervals for the coefficients
  - Tests of significance
Variability of the regression coefficients
- Imagine that we keep the x's fixed, but resample the errors and refit the plane. How much would the plane (the estimated coefficients) change?
- This gives us an idea of the variability (accuracy) of the estimated coefficients as estimates of the coefficients of the true regression plane.
[Figure: fitted regression planes for a response Y against predictors X1 and X2, illustrating how the plane varies when the errors are resampled]
Variability of the regression coefficients
- Variability depends on
  - the arrangement of the x's (the more correlated the x's, the more the coefficients change);
  - the error variance (the more scatter about the true plane, the more the fitted plane changes).
- We measure variability by the standard errors of the coefficients; the simulation sketch below illustrates the idea.
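A small simulation makes the resampling idea concrete. This is only a sketch: the sample size, true coefficients, and error standard deviation below are made up, not taken from any data in this lecture.

set.seed(330)
n  <- 31
x1 <- runif(n)
x2 <- runif(n)
beta  <- c(1, 2, 3)   # hypothetical "true" coefficients
sigma <- 0.5          # hypothetical error standard deviation

## Keep the x's fixed, resample the errors, refit the plane 1000 times
coefs <- replicate(1000, {
  y <- beta[1] + beta[2] * x1 + beta[3] * x2 + rnorm(n, sd = sigma)
  coef(lm(y ~ x1 + x2))
})
apply(coefs, 1, sd)   # spread of each coefficient: what the standard errors estimate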
Example: Cherries
Call:
lm(formula = volume ~ diameter + height)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Confidence intervals
CI : estimated coefficient ± t × standard error.
t : the 97.5% point of the t-distribution with df degrees of freedom.
df : n − k − 1.
n : number of observations.
k : number of covariates (assuming we have a constant term).
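As a check, the interval can be computed by hand from the summary output. A minimal sketch, assuming `fit` holds the cherry-tree model fitted above:

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
tq  <- qt(0.975, df.residual(fit))   # 97.5% point of t with n - k - 1 df
cbind(lower = est - tq * se,
      upper = est + tq * se)         # agrees with confint(fit)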
Confidence intervals
Example: Cherries
Use the stats function confint:
> confint([Link])
2.5 % 97.5 %
(Intercept) -75.68226247 -40.2930554
diameter 50.00206788 62.9937842
height 0.07264863 0.6058538
Hypothesis test
- Often we ask: do we need a particular variable, given that the others are in the model?
- Note that this is not the same as asking: is a particular variable related to the response?
- We can test the former by examining the ratio of the coefficient to its standard error.
Hypothesis test
- This ratio is the t-statistic: t = estimated coefficient / standard error.
- The bigger |t| is, the more we need the variable.
- Equivalently, the smaller the p-value, the more we need the variable (see the sketch below).
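For instance, the t-statistic and p-value for height can be reproduced by hand from the cherry output; a sketch, again assuming `fit` is the fitted cherry-tree model:

b    <- coef(summary(fit))["height", ]
tval <- b["Estimate"] / b["Std. Error"]                 # 0.3393 / 0.1302 = 2.607
2 * pt(abs(tval), df.residual(fit), lower.tail = FALSE) # two-sided p-value, 0.0145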
Example: Cherries
Call:
lm(formula = volume ~ diameter + height)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Recall: p-value
[Figure: density of the t-distribution with df = 28, with the two-tailed area beyond ±2.607 shaded; this shaded area is the p-value, 0.0145]
Other hypotheses
- Overall significance of the regression: do none of the variables have a relationship with the response?
- Use the F-statistic: the bigger F, the more evidence that at least one variable has a relationship.
- Equivalently, the smaller the p-value, the more evidence that at least one variable has a relationship (the p-value can be reproduced directly, as shown below).
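A one-line check, using the F-statistic and degrees of freedom printed in the cherry output:

## Upper tail of the F(2, 28) distribution beyond the observed F = 255;
## this reproduces the overall p-value (< 2.2e-16) in the summary output
pf(255, df1 = 2, df2 = 28, lower.tail = FALSE)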
Example: Cherries
Call:
lm(formula = volume ~ diameter + height)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Testing if a subset is required
- Often we want to test whether a subset of variables is unnecessary.
- Terminology:
  Full model: model containing all the variables.
  Submodel: model with a set of variables removed.
- The test is based on comparing the RSS of the submodel with the RSS of the full model. The full model RSS is always smaller. (Why? Least squares over the larger model fits at least as well: setting the dropped coefficients to zero recovers the submodel.)
Testing if a subset is required
- If the full model RSS is not much smaller than the submodel RSS, the submodel is adequate: we do not need the extra variables.
- To do the test, we
  - fit both models and get the RSS for each;
  - calculate the test statistic;
  - if the test statistic is large (equivalently, the p-value is small), conclude that the submodel is not adequate.
Testing if a subset is required
- The test statistic is

  F = (RSS_sub − RSS_full) / (s² × (df_full − df_sub))

- df_full − df_sub is the number of variables dropped.
- s² is the estimate of σ² from the full model (the residual mean square).
- R has a function anova to do the calculation; a hand-computed sketch follows below.
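A minimal sketch of the calculation anova performs, assuming a hypothetical data frame `dat` with response y and covariates x1, x2, x3 (none of these names come from the lecture's data sets):

full.lm <- lm(y ~ x1 + x2 + x3, data = dat)
sub.lm  <- lm(y ~ x1, data = dat)

rss.full <- sum(resid(full.lm)^2)
rss.sub  <- sum(resid(sub.lm)^2)
m  <- df.residual(sub.lm) - df.residual(full.lm)   # number of variables dropped
s2 <- rss.full / df.residual(full.lm)              # residual mean square

F.stat <- (rss.sub - rss.full) / (m * s2)
pf(F.stat, m, df.residual(full.lm), lower.tail = FALSE)  # p-value

## anova(sub.lm, full.lm) reports the same F statistic and p-value.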
p-values
- If the submodel is correct, the test statistic has an F-distribution with df_full − df_sub and n − k − 1 degrees of freedom.
- We assess whether the value of F calculated from the sample is a plausible value from this distribution by means of a p-value.
- If the p-value is too small, we have evidence against the hypothesis that the submodel is adequate.
p-values
[Figure: density of the F-distribution with 2 and 16 degrees of freedom; the upper-tail area beyond the observed F-value is the p-value]
Example: Free fatty acid data
- Use physical measures to model a biochemical parameter in overweight children.
- Variables are
  FFA: free fatty acid level in blood (response variable)
  Age: months
  Weight: pounds
  Skinfold thickness: inches
Analysis
Call:
lm(formula = ffa ~ age + weight + skinfold, data = [Link])
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714
This suggests
- age is not required if weight and skinfold are retained;
- skinfold is not required if weight and age are retained.
- Can we get away with just weight?
Analysis
> [Link] <- lm(ffa~weight,data=[Link])
> anova([Link],[Link])
Analysis of Variance Table
Model 1: ffa ~ weight
Model 2: ffa ~ age + weight + skinfold
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 0.91007
2 16 0.79113 2 0.11895 1.2028 0.3261
- The small F statistic and large p-value suggest weight alone is adequate.
- But the test should be interpreted with caution: could there be confounding?
Confounding?
- Confounding: a non-causal relationship due to a missing variable.
- The effect can be checked by comparing the coefficients in the full model and the submodel (when both are available).
> summary([Link])
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714
> summary([Link])
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.01651 0.37578 5.366 4.23e-05 ***
weight -0.02162 0.00608 -3.555 0.00226 **
- The coefficient of weight hardly changes (−0.02007 in the full model, −0.02162 in the submodel), so there is little sign of confounding here.