Lab 6 - ANOVA 1
Mansi Kumari (7908159)
2023-03-03
Learning Objectives
By the end of this lab, you should have a grasp of the following concepts:
• How to perform ANOVA in R, both step-by-step and with an easy R function.
• How to perform a simple investigation of the model assumptions.
Instructions
To complete this worksheet, add code as needed into the R code chunks given below. Do not delete the
question text. All text should be in complete English sentences. Be sure to change the author of this file to
reflect your name and student number.
To properly see the questions, knit this .Rmd file to .pdf and view the output. You will have a link in your
email that takes you to the Crowdmark submission page. Once you have completed the worksheet, knit it
to .pdf and upload your output to Crowdmark.
Exercises
Import the Games200 dataset. This dataset contains a random sample of 200 games released in 2019, along
with the metascore (average critic review), the userscore (average user review), and platform of release.
Games200 <- read.csv("~/Downloads/Games200.csv")
Our goal is to determine whether each video game platform receives the same metascore on average, or not,
based on this sample.
Make a boxplot comparing the metascores for each platform.
boxplot(Metascore ~ Platform, data = Games200)
[Figure: side-by-side boxplots of Metascore (vertical axis, ticks from 50 to 90) by Platform (PC, PlayStation 4, Switch, Xbox One).]
Use aggregate to calculate the mean of each group.
aggregate(Metascore ~ Platform, data = Games200, FUN = mean)
##        Platform Metascore
## 1            PC  74.63462
## 2 PlayStation 4  71.48889
## 3        Switch  72.24675
## 4      Xbox One  78.11538
Use aggregate to determine the sample size of each group.
aggregate(Metascore ~ Platform, data = Games200, FUN = length)
##        Platform Metascore
## 1            PC        52
## 2 PlayStation 4        45
## 3        Switch        77
## 4      Xbox One        26
Calculate the overall mean.
mean(Games200$Metascore)
## [1] 73.46
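As a quick cross-check, the overall mean can also be recovered from the group means and sample sizes printed above, because it is their size-weighted average. The sketch below types those printed values in by hand so that it runs without the CSV file; the vector names `group_means` and `group_n` are illustrative and not part of the lab.

```r
# Group means and sample sizes copied from the aggregate() output above
# (order: PC, PlayStation 4, Switch, Xbox One). Names are illustrative.
group_means <- c(74.63462, 71.48889, 72.24675, 78.11538)
group_n     <- c(52, 45, 77, 26)

# The overall mean is the size-weighted average of the group means.
overall_mean <- weighted.mean(group_means, w = group_n)
round(overall_mean, 2)
```

The rounded result should agree with the value returned by `mean(Games200$Metascore)`.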
Calculate the SSG by hand, using your earlier calculations.
my.SSG <- 52*(74.63-73.46)^2 + 45*(71.48-73.46)^2 + 77*(72.25-73.46)^2 + 26*(78.12-73.46)^2
my.SSG
## [1] 924.9421
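The by-hand value depends on how far the group means are rounded before they are plugged in. A minimal sketch, assuming the group means, sample sizes, and overall mean are typed in from the output above at their full printed precision (the vector names are illustrative), computes the same sum of squares in one vectorized line:

```r
# Values copied from the earlier output (order: PC, PlayStation 4, Switch, Xbox One).
group_means  <- c(74.63462, 71.48889, 72.24675, 78.11538)
group_n      <- c(52, 45, 77, 26)
overall_mean <- 73.46

# SSG = sum over groups of n_i * (group mean - overall mean)^2
ssg <- sum(group_n * (group_means - overall_mean)^2)
ssg
```

With the less-rounded means the result comes out near 923.4 rather than 924.94, so the difference between the two is rounding error, not a change in method.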
Calculate the MSG by hand, using your earlier calculations.
my.MSG <- my.SSG/(4 - 1)
my.MSG
## [1] 308.314
Use the aggregate function with var to find the sample variances, and then from there find the SSE.
aggregate(Metascore ~ Platform, FUN = var, data = Games200)
##        Platform Metascore
## 1            PC  58.78544
## 2 PlayStation 4  68.84646
## 3        Switch  57.42515
## 4      Xbox One  43.06615
my.SSE <- 51*58.79 + 44*68.85 + 76*57.43 + 25*43.07
my.SSE
## [1] 11469.12
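The SSE pools the within-group variation: each sample variance is multiplied by its own degrees of freedom, n_i - 1, and the products are summed. The sketch below does this with vectors instead of typing each term, assuming the variances and sample sizes are copied from the output above (the names `group_var`, `group_n`, and `df_error` are illustrative).

```r
# Sample variances and sizes copied from the aggregate() output above
# (order: PC, PlayStation 4, Switch, Xbox One).
group_var <- c(58.78544, 68.84646, 57.42515, 43.06615)
group_n   <- c(52, 45, 77, 26)

# SSE = sum over groups of (n_i - 1) * s_i^2
sse <- sum((group_n - 1) * group_var)

# Error degrees of freedom: total sample size minus number of groups.
df_error <- sum(group_n) - length(group_n)

sse
df_error
```

Using the unrounded variances gives a value slightly below the two-decimal hand calculation; the error degrees of freedom work out to 200 - 4 = 196.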
Calculate the MSE by hand, using your earlier calculations.
my.MSE <- my.SSE/(200 - 4)
my.MSE
## [1] 58.51592
Calculate the F test statistic, using your earlier calculations.
my.F <- my.MSG/my.MSE
my.F
## [1] 5.268892
Use pf to find the P-value for this test.
1 - pf(my.F, df1 = 3, df2 = 196)
## [1] 0.001622573
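The P-value is the area to the right of the observed F statistic under the F(3, 196) distribution. As a sketch, the same area can be requested directly with `lower.tail = FALSE`, and the decision can be cross-checked against the 5% critical value from `qf`. The F statistic is typed in from the output above, and the variable names are illustrative.

```r
f_stat   <- 5.268892  # F statistic printed above
df_group <- 3         # number of groups minus 1
df_error <- 196       # total sample size minus number of groups

# Upper-tail area, computed two equivalent ways.
p_upper      <- pf(f_stat, df1 = df_group, df2 = df_error, lower.tail = FALSE)
p_complement <- 1 - pf(f_stat, df1 = df_group, df2 = df_error)
c(p_upper, p_complement)

# Critical value for a 5% level test; reject H0 when f_stat exceeds it.
f_crit <- qf(0.95, df1 = df_group, df2 = df_error)
f_stat > f_crit
```

Both approaches lead to the same decision: comparing the P-value with 0.05 is equivalent to comparing the F statistic with the critical value.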
What is your conclusion?
The p-value is 0.00162, which is less than 0.05, so we reject the null hypothesis at the 5% level of significance. We
have sufficient evidence to conclude that not all platforms have the same mean metascore.
Repeat the earlier test, using the aov function.
my.aov <- aov(Metascore ~ Platform, data = Games200)
Use the summary function to print out the ANOVA results.
summary(my.aov)
##              Df Sum Sq Mean Sq F value  Pr(>F)
## Platform      3    923  307.80   5.261 0.00164 **
## Residuals   196  11468   58.51
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Create a histogram of the residuals of the ANOVA model.
hist(my.aov$residuals)
[Figure: "Histogram of my.aov$residuals"; horizontal axis my.aov$residuals (about -20 to 20), vertical axis Frequency (0 to 50).]
What does this tell you about your Normality assumption?
Use the aggregate function with sd to find the standard deviations of each group.
aggregate(Metascore ~ Platform, FUN = sd, data = Games200)
##        Platform Metascore
## 1            PC  7.667167
## 2 PlayStation 4  8.297377
## 3        Switch  7.577939
## 4      Xbox One  6.562481
What does this tell you about your equal-variances assumption?
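One way to approach this question is the rule of thumb used later in this lab: the equal-variances assumption is treated as reasonable when no group standard deviation is more than twice another. The sketch below applies that rule to the standard deviations printed above, which are typed in by hand; the names `group_sd` and `sd_ratio` are illustrative.

```r
# Standard deviations copied from the aggregate() output above.
group_sd <- c("PC" = 7.667167, "PlayStation 4" = 8.297377,
              "Switch" = 7.577939, "Xbox One" = 6.562481)

# Ratio of the largest to the smallest group standard deviation.
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio

# Rule of thumb: a ratio below 2 suggests the assumption is reasonable.
sd_ratio < 2
```

Checking only the largest-to-smallest ratio is enough, since every other pair of groups has a ratio at least as close to 1.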
Next we will do ANOVA on the iris dataset. Use the data function to load in this dataset.
data(iris)
This dataset contains the petal and sepal lengths and widths (in cm) for a sample of 150 iris flowers. They
are divided by their species: iris setosa, iris virginica, and iris versicolor.
We will do an analysis to determine if their sepal lengths differ significantly, on average.
Exercise: Write the hypotheses for this test in TeX
$H_0: \mu_{Setosa} = \mu_{Virginica} = \mu_{Versicolor}$ vs. $H_a:$ Not all means are equal
Exercise: Use the aov function to conduct a hypothesis test at the 5% level of significance to
determine whether the mean sepal lengths are equal for all species.
my_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(my_aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
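To connect this output with the step-by-step method from the first half of the lab, the sketch below recomputes the F statistic for the iris data from group sizes, means, and variances, and compares it with the value stored in the aov summary table. All variable names here are illustrative, and indexing the summary with `[[1]]` to get the table as a data frame is just one convenient way to pull the numbers out.

```r
data(iris)

# Per-species summaries of sepal length.
n_i    <- tapply(iris$Sepal.Length, iris$Species, length)
mean_i <- tapply(iris$Sepal.Length, iris$Species, mean)
var_i  <- tapply(iris$Sepal.Length, iris$Species, var)

k <- length(n_i)  # number of groups
N <- sum(n_i)     # total sample size

# Step-by-step ANOVA quantities.
ssg <- sum(n_i * (mean_i - mean(iris$Sepal.Length))^2)
sse <- sum((n_i - 1) * var_i)
f_by_hand <- (ssg / (k - 1)) / (sse / (N - k))
p_by_hand <- pf(f_by_hand, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

# The same quantities pulled from the aov() summary table.
aov_table  <- summary(aov(Sepal.Length ~ Species, data = iris))[[1]]
f_from_aov <- aov_table[1, "F value"]
p_from_aov <- aov_table[1, "Pr(>F)"]

c(by_hand = f_by_hand, aov = f_from_aov)
```

Because nothing is rounded along the way, the two F statistics should agree up to floating-point error.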
Exercise: Give a fully-worded conclusion to this test.
The p-value (less than 2e-16) is below 0.05, so we reject the null hypothesis at the 5% level of significance.
There is sufficient evidence to conclude that the mean sepal lengths are not equal for all
species.
Exercise: Check whether the ANOVA model assumptions appear to be accurate.
hist(my_aov$residuals)
[Figure: "Histogram of my_aov$residuals"; horizontal axis my_aov$residuals (about -2.0 to 1.5), vertical axis Frequency (0 to 60).]
aggregate(Sepal.Length ~ Species, data = iris, FUN = sd)
##      Species Sepal.Length
## 1     setosa    0.3524897
## 2 versicolor    0.5161711
## 3  virginica    0.6358796
The residuals appear to have an approximately normal shape, and none of the standard deviations
is more than twice the size of another, so the conditions of the test appear to be satisfied.
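Since a histogram is judged by eye, it can help to look at the bin counts as numbers as well. A minimal sketch, assuming the same one-way model refit under the illustrative name `fit`, uses `hist(..., plot = FALSE)` to return the bins without drawing anything:

```r
data(iris)

# Refit the one-way ANOVA model and extract its residuals.
fit <- aov(Sepal.Length ~ Species, data = iris)
res <- residuals(fit)

# Bin the residuals without drawing the plot, then tabulate the bins.
h <- hist(res, plot = FALSE)
bin_table <- data.frame(lower = head(h$breaks, -1),
                        upper = h$breaks[-1],
                        count = h$counts)
bin_table

# Residuals from a least-squares fit with an intercept average to zero.
mean(res)
```

Counts that rise toward the middle bins and fall off on both sides are consistent with the roughly bell-shaped histogram described above; this is an informal check, not a formal test of Normality.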