
Lab Practice: Box-Cox Transform and Multiple Linear Regression

BANA3010 Data-Driven Analytics    December 5, 2024

Lab Practice Submission Instructions:


• This is an individual lab practice and will typically be assigned in the laboratory (computer lab). You
can use your personal computer, but the practical exams will be performed on a lab computer.
• Your program should work correctly on all inputs. If there are any specifications about how the program
should be written (or how the output should appear), those specifications should be followed.
• Your code and functions/modules should be appropriately commented. However, avoid making your
code overly busy (e.g., by commenting on every line).
• Variables and functions should have meaningful names, and code should be organized into
functions/methods where appropriate.
• Academic honesty is required in all work you submit to be graded. You MUST NOT copy or share your
code with other students to avoid plagiarism issues.
• Use the template provided to prepare your solutions.
• You should upload your .R file(s) to Canvas before the end of the laboratory session unless the
instructor has specified a different deadline.
• Submit a separate .R file for each lab problem with the following naming format: Lab4_Q1.R. Failure
to submit a .R file for a lab or assignment will result in a 0.
• Late submission of lab practice without an approved extension is NOT allowed.



1. The simple linear regression model involves several assumptions. Among them are:

(a) That E(y|x), the mean value of y, is a straight-line function of x.

(b) That the errors, ϵi, have constant variance. That is, the variation in the errors is theoretically the
same regardless of the value of x or ŷ.

(c) The errors have mean 0.

(d) The errors are independent.

(e) The errors have a normal distribution.

To assess assumptions 1a to 1c, we can examine scatterplots of

• y versus x.

• residuals versus fitted values, ŷ, or x.

Assumption 1d is assessed with an autocorrelation (ACF) plot of the residuals. Assumption 1e is assessed
with a normal probability plot, and is considered the least crucial of the assumptions. We will see how
to generate the relevant graphical displays to help us assess whether the assumptions are met.

For this example, we will use the dataset "gpa.txt". As there are no headings in this dataset, it is
convenient to use the function colnames to name each column.

data<-read.table("gpa.txt", header=FALSE ,sep="")


colnames(data)<-c("GPA","ACT")
attach(data)

We would like to construct a linear regression model of the variable GPA against ACT. In the
following exercises, we will investigate whether the regression assumptions are met.

(a) Let us first take a look at the scatterplot of GPA against ACT. In addition to the scatterplot,
we would also like to add the estimated regression line to the plot:

result<-lm(GPA~ACT)
plot(ACT,GPA, main="GPA against ACT")
abline(result, col="red")

The function abline() overlays a line on an existing plot. In this case, it overlays the estimated
regression line from result. What features should you look out for in a scatterplot of the response
against the predictor?

(b) It is usually easier to assess the regression assumptions using a residual plot (residuals plotted
against either the fitted values or the predictor).

plot(result$fitted.values, result$residuals, main="Residuals vs fits")


abline(h=0,col="red")

As you may remember from previous labs, $ can be used to access the components of a more
complex object in R. For instance, the function lm returns an object (which we save as result)
that contains many sub-components. The fitted values and the residuals of the regression model
are stored in vectors named fitted.values and residuals.
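
For example, to list every component stored in the fitted model object and to preview these two
vectors, you can type the following (a quick aside, not required for the lab):

names(result)                  # list the components of the lm object
head(result$fitted.values)     # first few fitted values
head(result$residuals)         # first few residuals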

Based on this plot, what would you say about the regression assumptions?

(c) To assess independence of errors, we examine an autocorrelation (ACF) plot of the residuals.

acf(result$residuals, main="ACF of Residuals")



What does this plot tell us?

(d) The last thing we need to check is the normality assumption. We use a normal probability plot for
this.

qqnorm(result$residuals)
qqline(result$residuals, col="red")

The first command, qqnorm, plots the quantiles of the estimated residuals from our linear
regression model against the theoretical quantiles expected under the normality assumption. The
second command, qqline, adds a reference line to the plot, making it easier to examine whether
the distribution of the residuals is consistent with the normality assumption.

Based on this plot, what would you say about the assumption regarding normality of the error terms?

(e) In this part of the lab, we will go through the procedure for carrying out the lack of fit (LOF)
test to check whether the linearity assumption is reasonable, carry out the Box-Cox transformation,
and apply relevant transformation(s) to the predictor and/or response variable. For this example, we
will use the dataset "training.txt". This dataset comes from an experiment that investigates
the impact of the number of days of training (first column, the predictor) on performance scores
(second column, the response variable). Read the data into R and name the columns accordingly.

data<-read.table("training.txt", header=FALSE ,sep="")


colnames(data)<-c("Training","Performance")

i. Generate the scatterplot and residual plot (a sketch of the relevant commands follows). Comment
on whether the regression assumptions are satisfied. What are the consequences of violating
these assumptions?
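
As a starting point, the following reuses the commands from Question 1 (a minimal sketch;
refitting result on the training data here is also what the boxcox(result) call in part ii assumes):

result<-lm(Performance~Training)
plot(Training, Performance, main="Performance against Training")
abline(result, col="red")
plot(result$fitted.values, result$residuals, main="Residuals vs fits")
abline(h=0, col="red")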

ii. Suppose we want to apply a Box-Cox transformation. To produce a plot of the profile log-
likelihoods for the parameter, λ, of the Box-Cox power transformation, type

library(MASS)
boxcox(result)

The boxcox() function is stored in the MASS library. You need to load this library to use this
function. What do you notice? For the boxcox() function, there is an optional argument
called lambda which allows us to change the range of λ for the Box-Cox transform. Type
?boxcox to see how to specify this argument.
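
As an illustration (assuming result is the model fitted on the training data in part i), the
following widens the search range and reads off the value of λ with the highest profile
log-likelihood:

bc<-boxcox(result, lambda=seq(-2, 2, by=0.1))
bc$x[which.max(bc$y)]    # lambda value maximizing the profile log-likelihood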

iii. Should we transform the response variable?

iv. Next, we perform a lack of fit (LOF) test. To produce the ANOVA table associated with an
LOF test, type

reduced<-lm(Performance~Training)
full<-lm(Performance~0 + as.factor(Training))
anova(reduced,full)

Here we still use the function lm to construct the full regression model, with the following
modifications. First, we use the command as.factor(Training) so that R treats the variable
Training as a categorical variable and focuses solely on the distinct levels of Training. Second,
the regression model is specified by Performance~0 + as.factor(Training). Here 0 + is used to
specify a regression model without an intercept term (the intercept is redundant in the full
model, since each level of the factor receives its own coefficient).

Solely based on the p-value of the LOF test, what conclusion can you draw?



v. Does your conclusion from the LOF test contradict your earlier belief regarding the linear
relationship between the response variable and predictor? What do you think is going on here?

vi. What transformation will you use? Apply the transformation to the data, perform the
regression, and check whether the assumptions are met (a sketch of the mechanics appears below).
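
For example, if your Box-Cox plot supported λ near 0.5, the mechanics would look like this
(purely illustrative; substitute whatever transformation your own plot supports):

result2<-lm(sqrt(Performance)~Training)
plot(result2$fitted.values, result2$residuals, main="Residuals vs fits (transformed)")
abline(h=0, col="red")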

2. Multiple Linear Regression in R. Here we will investigate the dataset "Bears.txt", which
contains information on 19 female wild bears. The variables are Age (age in months), Neck (neck girth
in inches), Length (length of the bear in inches), Chest (chest girth in inches), and Weight (weight of
the bear in pounds).

(a) Before fitting the multiple linear regression model, create separate plots of Weight against each
predictor (see the sketch below). Comment on the association between Weight and each predictor.
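
One way to lay out the four scatterplots (a sketch; this assumes Bears.txt has no header row and
that the columns appear in the order listed above, so adjust header and the column names to
match the file):

data<-read.table("Bears.txt", header=FALSE, sep="")
colnames(data)<-c("Age","Neck","Length","Chest","Weight")
attach(data)
par(mfrow=c(2,2))    # arrange the four plots in a 2x2 grid
plot(Age, Weight, main="Weight against Age")
plot(Neck, Weight, main="Weight against Neck")
plot(Length, Weight, main="Weight against Length")
plot(Chest, Weight, main="Weight against Chest")
par(mfrow=c(1,1))    # reset the plotting layout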

(b) Fit a multiple regression model using Weight as the response and the other variables as predictors.
To use lm() for multiple regression, type

result<-lm(Weight~Age+Neck+Length+Chest)

The name of each additional predictor is added to the lm() formula after a + sign.

(c) Similar to simple linear regression, the command summary(result) can be used to display the key
information from the regression results. Check the results of the above multiple linear regression.
Also conduct four separate simple linear regressions of Weight against each of the four predictors
(a sketch follows). Compare the results with those from the multiple linear regression. Do you
think the multiple linear regression model is appropriate for this data set?
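
One way to produce the four simple fits for comparison (a sketch):

summary(result)             # multiple linear regression
summary(lm(Weight~Age))     # simple linear regressions, one per predictor
summary(lm(Weight~Neck))
summary(lm(Weight~Length))
summary(lm(Weight~Chest))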
