Lab: Box-Cox and Multiple Linear Regression
(b) That the errors, ϵi, have constant variance. That is, the variation in the errors is theoretically the
same regardless of the value of x or ŷ.
• y versus x.
Assumption 1d is assessed with an autocorrelation (ACF) plot of the residuals. Assumption 1e is assessed
with a normal probability plot, and is considered the least crucial of the assumptions. We will see how
to generate the relevant graphical displays to help us assess whether the assumptions are met.
For this example, we will use the dataset "gpa.txt". As there are no headings in this dataset, it is often
convenient to use the function colnames to name each column.
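For instance, a minimal sketch of reading in the data and naming the columns (assuming the file is in the working directory and that the first column is GPA and the second is ACT, as used below):
gpa <- read.table("gpa.txt", header=FALSE)
colnames(gpa) <- c("GPA", "ACT")   # name the columns; this column order is an assumption
attach(gpa)   # so GPA and ACT can be referred to directly, as in the commands below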
We would like to construct a linear regression model of the variable GPA against ACT. In the
following exercises, we will investigate whether the regression assumptions are met.
(a) Let us first take a look at the scatterplot of GPA against ACT. In addition to the scatterplot,
we would also like to add the estimated regression line to the plot:
result<-lm(GPA~ACT)
plot(ACT,GPA, main="GPA against ACT")
abline(result, col="red")
The function abline() overlays a line on the plot. In this case, it overlays the estimated regression
line from result. What are the features to look out for in a scatterplot of the response against
the predictor?
(b) It is usually easier to assess the regression assumptions using a residual plot (residuals plotted
against either the fitted values or the predictor).
As you may recall from previous labs, $ can be used to access the components that are part
of a more complex object in R. For instance, the function lm returns a data object (which
we save as result) that contains many sub-components. The fitted values and the residuals of
the regression model are stored in vectors named fitted.values and residuals.
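As a sketch, a residual plot of the residuals against the fitted values could be produced as follows:
plot(result$fitted.values, result$residuals, xlab="Fitted values", ylab="Residuals", main="Residual plot")
abline(h=0, col="red")   # horizontal reference line at zero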
Based on this plot, what will you say about the regression assumptions?
(c) To assess independence of errors, we examine an autocorrelation (ACF) plot of the residuals.
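For example, the acf() function in base R produces such a plot from the stored residuals:
acf(result$residuals, main="ACF of residuals")   # bars inside the dashed bands are consistent with uncorrelated errors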
(d) The last thing we need to check is the normality assumption. We use a normal probability plot for
this.
qqnorm(result$residuals)
qqline(result$residuals, col="red")
The first command, qqnorm, draws a plot of the quantiles of the estimated residuals from our linear
regression model against the theoretical quantiles of the errors under the normality assumption. The
second command, qqline, adds a reference line to the first plot to make it easier to examine
whether the distribution of the residuals is consistent with the normality assumption.
Based on this plot, what would you say about the assumption regarding normality of the error terms?
(e) In this part of the lab, we will go through the procedure for carrying out the lack of fit (LOF)
test to check whether the linearity assumption is reasonable, carry out the Box-Cox transformation, and
apply relevant transformation(s) to the predictor and/or response variable. For this example, we
will use the dataset "training.txt". This data set comes from an experiment that investigates
the impact of the number of days of training (first column, the predictor) on the performance scores
(second column, the response variable). Read the data into R and name the columns accordingly.
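As a sketch (assuming the file is in the working directory), the data can be read in and the columns named to match the variable names Training and Performance used below:
training_data <- read.table("training.txt", header=FALSE)
colnames(training_data) <- c("Training", "Performance")
attach(training_data)   # so Training and Performance can be used directly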
i. Generate the scatterplot and residual plot (a sketch is given below). Comment on whether the regression assumptions
are satisfied. What are the consequences of violating these assumptions?
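A minimal sketch of the two plots, re-using the name result so that the boxcox() call in part ii works unchanged:
result <- lm(Performance~Training)
plot(Training, Performance, main="Performance against Training")
abline(result, col="red")
plot(result$fitted.values, result$residuals, xlab="Fitted values", ylab="Residuals", main="Residual plot")
abline(h=0, col="red")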
ii. Suppose we want to apply a Box-Cox transformation. To produce a plot of the profile log-
likelihoods for the parameter, λ, of the Box-Cox power transformation, type
library(MASS)
boxcox(result)
The boxcox() function is stored in the MASS library. You need to load this library to use this
function. What do you notice? For the boxcox() function, there is an optional argument
called lambda which allows us to change the range of λ for the Box-Cox transform. Type
?boxcox to see how to specify this argument.
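For instance, to evaluate the profile log-likelihood on a narrower, finer grid of λ values (the range shown here is only an illustration):
boxcox(result, lambda = seq(-0.5, 1.5, by = 0.01))   # illustrative range and step size for lambda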
iv. Next, we perform a lack of fit (LOF) test. To produce the ANOVA table associated with an
LOF test, type
reduced<-lm(Performance~Training)
full<-lm(Performance~0 + as.factor(Training))
anova(reduced,full)
Here we still use the function lm to construct the full regression model, with the following
modifications. First, we use the command as.factor(Training) to let R treat the variable
Training as a categorical variable, so that R will focus solely on the different levels of the variable
Training. Second, the regression model is specified by Performance~0 + as.factor(Training).
Here 0 + as.factor(Training) is used to specify a regression model without the intercept
term (which is not needed when fitting the full model in this form).
Solely based on the p-value of the LOF test, what conclusion can you draw?
vi. What transformation would you use? Apply the transformation to the data, perform the regres-
sion, and check whether the assumptions are met (an illustrative sketch is given below).
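Purely as an illustration (the actual choice should follow from your Box-Cox plot), if the profile log-likelihood peaked near λ = 0.5, a square-root transformation of the response could be fitted and checked like this:
sqrt_result <- lm(sqrt(Performance)~Training)   # hypothetical choice of transformation, for illustration only
plot(sqrt_result$fitted.values, sqrt_result$residuals, xlab="Fitted values", ylab="Residuals", main="Residual plot")
abline(h=0, col="red")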
2. Multiple Linear Regression in R
Here we will investigate the "Bears.txt" dataset, which contains information on 19 female wild bears.
The variables are Age (age in months), Neck (neck girth in inches), Length (length of bear in inches),
Chest (chest girth in inches), and Weight (weight of bear in pounds).
(a) Before fitting the multiple linear regression model, create separate plots of Weight against each
predictor (a sketch is given below). Comment on the association between Weight and each predictor.
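A sketch of one way to do this (assuming Bears.txt has a header row with the variable names listed above; otherwise, read it without a header and name the columns with colnames as before):
bears <- read.table("Bears.txt", header=TRUE)
attach(bears)
par(mfrow=c(2,2))   # arrange the four plots in a 2-by-2 grid
plot(Age, Weight, main="Weight against Age")
plot(Neck, Weight, main="Weight against Neck")
plot(Length, Weight, main="Weight against Length")
plot(Chest, Weight, main="Weight against Chest")
par(mfrow=c(1,1))   # reset the plotting layout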
(b) Fit a multiple regression model using Weight as the response and the other variables as predictors.
To use lm() for multiple regression, type
result<-lm(Weight~Age+Neck+Length+Chest)
(c) As in simple linear regression, the command summary(result) can be used to display the key infor-
mation from the regression results. Check the result of the above multiple linear regression. Also conduct
four different simple linear regressions of Weight against each of the four predictors (see the sketch below). Compare the
results with the result from the multiple linear regression. Do you think the multiple linear regression model
is appropriate for this data set?
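A sketch of the four simple linear regressions, whose summaries can then be compared with the multiple regression from part (b):
summary(result)              # multiple regression fitted in part (b)
summary(lm(Weight~Age))
summary(lm(Weight~Neck))
summary(lm(Weight~Length))
summary(lm(Weight~Chest))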