A BEGINNER'S
GUIDE TO
t-test and ANOVA
(Analysis of Variance)
in R Programming
Made by Habbee
Overview
T-test:
Independent t-test.
Paired t-test.
F-test:
One-way Analysis of Variance (ANOVA).
Two-way Analysis of Variance (ANOVA).
1. t-test:
Used to test the difference in means between two small samples (n < 30) drawn from
populations that are approximately normal.
(The two small samples are representative of their parent
populations.)
[Diagram: are Mean A and Mean B different?]
The test also assumes the two small samples are unrelated/independent:
one group's values do not depend on the other's.
1.1. Independent samples t-test
is applied when we want to test differences between the
means/averages of two completely independent groups (one does
not affect the other).
For instance, Ty goes on a three-mile run with his kids every
morning. He wants to test whether his son's running time (in minutes) is
significantly lower than his daughter's, meaning the boy can run
faster. To test the theory, he recorded their running times every day
for a week, as given in the following table:
First step: create the running-time records in RStudio.
We name the variables "Son" and "Daughter". Since R orders group labels
alphabetically, the daughter's data is processed before the son's, as the letter
D comes before S in the alphabet; thus our alternative hypothesis Ha
becomes μ (daughter) > μ (son), which is still equivalent to Ty's theory that
"his son's running time is significantly lower than his daughter's".
H0: μ (daughter) = μ (son)
Ha: μ (daughter) > μ (son)
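The slide's R code and data table are not reproduced here, so the values below are hypothetical, chosen only so the group means match the reported 22.29 and 18.14 minutes; the t-statistic and p-value will therefore differ from the slide's:

```r
# Hypothetical running times in minutes (one week, n = 7 each);
# chosen so the means match the slide: daughter 22.29, son 18.14
daughter <- c(21, 23, 22, 24, 21, 23, 22)
son      <- c(19, 17, 18, 20, 16, 19, 18)

# One-sided test of Ha: mu(daughter) > mu(son); with two vectors,
# the first argument plays the role of the "greater" group
res <- t.test(daughter, son, alternative = "greater")
res$statistic  # t-statistic
res$p.value    # one-sided p-value
```

Note the `alternative = "greater"` argument: it encodes the one-sided Ha above.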
From the result, the t-statistic is 2.0337 and the p-value is 0.03485, which is less
than 0.05 (using the 0.05 significance level); therefore, H0 is rejected. There is
sufficient evidence to support Ha that the daughter has a higher mean
running time than the son.
In addition, R also reports the mean running times for the daughter
(22.29 minutes) and the son (18.14 minutes); hence we can conclude that Ty's son
is faster on the three-mile route! Let's visualize it!
Last but not least, the sample sizes of the two groups are not always equal.
For example, what if Ty's daughter got busy one morning and could not
join the morning run with her brother and father that week? The sample
size for her running data would be 6 instead of 7!
If the group sizes differ greatly and the variances are also unequal (the
homogeneity-of-variance assumption is violated), the null hypothesis can be
falsely rejected (a Type I error: rejecting H0 when it is in fact true). This is
why R's t.test defaults to Welch's t-test, which does not assume equal variances.
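As a sketch of the unequal-sample-size case (again with hypothetical running times), note that `t.test` defaults to Welch's t-test, which does not assume equal variances; passing `var.equal = TRUE` gives the pooled Student's t-test instead:

```r
# Hypothetical week where the daughter missed one run: n = 6 vs n = 7
daughter <- c(21, 23, 22, 24, 21, 23)
son      <- c(19, 17, 18, 20, 16, 19, 18)

# Default: Welch's t-test (var.equal = FALSE), no equal-variance assumption
welch  <- t.test(daughter, son, alternative = "greater")
# Pooled Student's t-test: assumes equal variances
pooled <- t.test(daughter, son, alternative = "greater", var.equal = TRUE)

unname(pooled$parameter)  # pooled df: 6 + 7 - 2 = 11
unname(welch$parameter)   # Welch df: usually fractional, at most 11
```

Welch's version is the safer default when group sizes or spreads differ.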
1.2. Paired t-test:
is applied when we have two dependent (paired) samples from
just one population and want to see if they are significantly
different; useful for "before and after" situations.
[Diagram: is Mean A (before) different from Mean A (after)?]
For example, Ty wants to test the difference in means of his kids'
heart rates before and after the three-mile run.
Import the dataset into R for our paired t-test analysis.
H0: μ (before) = μ (after)
Ha: μ (before) ≠ μ (after)
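The original heart-rate dataset is not reproduced, so the bpm values below are hypothetical, chosen so the mean difference is near the reported 16.5 bpm while still spread enough that p > 0.05:

```r
# Hypothetical heart rates (bpm) for 7 paired measurements; the real
# dataset from the slide is not reproduced, so these are illustrative
before <- c(75, 80, 72, 78, 70, 74, 76)
after  <- before + c(1, 38, 4, 42, 2, 28, 1)  # noisy post-run increases

# Paired t-test: each "after" value is matched to its own "before"
res <- t.test(after, before, paired = TRUE)
res$p.value   # > 0.05 here, so we fail to reject H0
res$estimate  # mean of the differences (after - before)
```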
With p-value = 0.05772 (greater than 0.05), we fail to reject H0, as we
do not have sufficient evidence that Ty's kids' heart rates differ
significantly (statistically) before and after the three-mile run.
However, the result also shows that the mean of the differences is 16.5 bpm, and
if we visualize our paired t-test, we can see the mean bpm from “after” running is
higher than “before”. Our hearts tend to beat faster per minute after we exercise!
As a final point, the sample sizes of the two measurements in a paired t-test are
always identical (every subject is measured twice), unlike in the independent t-test.
2. F-test:
Analysis of Variance (ANOVA)
works like the t-test but compares more than two groups.
H0: μ (1) = μ (2) = μ (3)= … = μ (n)
Ha: at least two means are different.
[Diagram: are Means A, B, and C different from one another?]
Assumptions of ANOVA: each group of samples is normally
distributed, the groups have equal variances, and the groups are independent.
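A minimal sketch of checking these assumptions in R, on hypothetical data (`shapiro.test` for per-group normality, `bartlett.test` for equal variances):

```r
# Hypothetical data: three groups drawn from normal distributions
set.seed(1)
g1 <- rnorm(10, mean = 20, sd = 2)
g2 <- rnorm(10, mean = 25, sd = 2)
g3 <- rnorm(10, mean = 30, sd = 2)

values <- c(g1, g2, g3)
groups <- factor(rep(c("g1", "g2", "g3"), each = 10))

shapiro.test(g1)              # normality check, one group at a time
bartlett.test(values, groups) # homogeneity-of-variance check
```

Large p-values on these checks mean the assumptions look reasonable.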
2.1. One-way ANOVA
is used to analyze the difference between the means of more than
two groups.
Assume the dependent variable (DV) is how many miles a car
can travel per gallon of fuel (mpg), and the independent variable (IV)
is the car brand. Apply an analysis of variance to test whether
the means differ significantly between brands.
Let’s let R read our mpg data.
H0: μ (Toyota) = μ (Subaru) = μ (Lexus)
Ha: at least two means are different.
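The slide's mpg data are not reproduced here, so the readings below are hypothetical (loosely based on typical figures for these models); the resulting F and p will differ from the reported 77.17 and 2.1e-10:

```r
# Hypothetical mpg readings, five per brand (the slide's data not shown)
mpg_data <- data.frame(
  brand = factor(rep(c("Toyota", "Subaru", "Lexus"), each = 5)),
  mpg   = c(17, 18, 16, 17, 18,   # Toyota 4Runner
            29, 30, 31, 29, 30,   # Subaru Crosstrek
            22, 23, 21, 22, 23)   # Lexus RX350
)

model <- aov(mpg ~ brand, data = mpg_data)
summary(model)  # F value and Pr(>F) for the brand effect
```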
With an F-statistic of 77.17 and a p-value less than 0.05 (2.1e-10), we reject the
null hypothesis: there is enough evidence to claim that at least two means are
different.
.... but you may ask: which means are different? The "TukeyHSD(model)" syntax
helps us clarify that. Since the "p adj" values for each pair of brands are < 0.05,
we can state that there is a significant difference in average mpg between
Subaru and Lexus, Toyota and Lexus, and Toyota and Subaru, with the Toyota 4Runner
and the Subaru differing the most ("diff" = 12.0).
Last but not least, if a confidence interval does not contain 0, there is
a significant difference between the two group averages.
For example, the lower bound (lwr) and upper bound (upr) of the Subaru-Lexus
confidence interval are (5.0402, 9.9598), which does not include 0.
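A self-contained sketch of the Tukey step with hypothetical mpg values (the slide's data are not shown, so the diff, lwr, upr, and p adj numbers here will differ from those reported):

```r
# Self-contained sketch: hypothetical mpg values, five readings per brand
mpg_data <- data.frame(
  brand = factor(rep(c("Toyota", "Subaru", "Lexus"), each = 5)),
  mpg   = c(17, 18, 16, 17, 18, 29, 30, 31, 29, 30, 22, 23, 21, 22, 23)
)
model <- aov(mpg ~ brand, data = mpg_data)

# One row per brand pair: diff, lwr/upr CI bounds, and adjusted p-value
tk <- TukeyHSD(model)
tk$brand
plot(tk)  # intervals that cross 0 indicate no significant difference
```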
2.2. Two-way ANOVA:
is applied when we want to analyze how two independent
variables (IVs), in combination, affect a dependent variable (DV),
and in particular whether there is an interaction between the
two IVs' effects on the DV.
For instance, we want to know if the cars' mpg values mentioned
above differ when driven on the highway versus in the city.
The IVs now are car brand (Toyota, Subaru, and Lexus) and
where the car is driven (in the city or on the highway), with
mpg as our DV.
Here is how to create a two-way ANOVA data frame in R.
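The slide's data-frame code is not shown, so here is a hedged sketch with hypothetical city/highway mpg values; `brand * where` fits both main effects plus their interaction:

```r
# Hypothetical mpg values: 3 brands x 2 locations x 5 readings each
mpg_data <- data.frame(
  brand = factor(rep(c("Toyota", "Subaru", "Lexus"), each = 10)),
  where = factor(rep(rep(c("city", "highway"), each = 5), times = 3)),
  mpg   = c(16, 17, 16, 17, 16,  19, 20, 19, 20, 19,   # Toyota
            27, 28, 27, 28, 27,  31, 32, 31, 32, 31,   # Subaru
            20, 21, 20, 21, 20,  24, 25, 24, 25, 24)   # Lexus
)

# brand * where expands to brand + where + brand:where (the interaction)
model <- aov(mpg ~ brand * where, data = mpg_data)
summary(model)  # one row per main effect plus one for the interaction
```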
We now have three different hypotheses to test, the first being:
H0: μ (Toyota) = μ (Subaru) = μ (Lexus)
Ha: at least two means are different.
The p-value for "brand" is 4.12e-10, so we can claim a significant difference in
mpg between at least two of the three brands (Toyota 4Runner, Subaru Crosstrek,
and Lexus RX350).
Next, our second hypothesis is:
H0: μ (city) = μ (highway)
Ha: μ (city) ≠ μ (highway)
Similar to our variable "brand", "where" we drive our cars is another factor
that has a significant effect on mean miles per gallon, because its
p-value is less than 0.05 (2.28e-07).
In fact, the majority of cars on the market get higher mpg on the highway
than in the city.
Last but not least, our last hypothesis is:
H0: there is no interaction between what brand of car you drive and
where you drive it.
Ha: there is an interaction between what brand of car you drive and
where you drive it.
Our test statistic is 0.0304 and the p-value is 0.743. We fail to reject the
null hypothesis: there is not enough evidence to support the claim that
there is an interaction between car brand and where you drive.
Furthermore, the Tukey test helps us pinpoint where the differences lie,
i.e. which specific group means differ; it compares every possible pair
of means.
A ggplot graph of the group means also helps us understand the results better!
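The slide's ggplot code is not included, so as a stand-in here is base R's `interaction.plot` on hypothetical mpg values; roughly parallel lines are what "no interaction" looks like:

```r
# Hypothetical mpg values (same layout: 3 brands x 2 locations)
mpg_data <- data.frame(
  brand = factor(rep(c("Toyota", "Subaru", "Lexus"), each = 10)),
  where = factor(rep(rep(c("city", "highway"), each = 5), times = 3)),
  mpg   = c(16, 17, 16, 17, 16, 19, 20, 19, 20, 19,
            27, 28, 27, 28, 27, 31, 32, 31, 32, 31,
            20, 21, 20, 21, 20, 24, 25, 24, 25, 24)
)

# One line per brand; roughly parallel lines suggest no interaction
interaction.plot(mpg_data$where, mpg_data$brand, mpg_data$mpg,
                 xlab = "where driven", ylab = "mean mpg",
                 trace.label = "brand")
```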
Key Takeaways:
Independent t-test: if samples are from two populations.
Paired t-test: if samples are from one population, useful in the
“before-after” scenario.
One-way ANOVA: compare means for more than two groups.
Two-way ANOVA: compare means for each factor and test if
there is an interaction between factors for more than two
groups.