MATH1041 Assignment, Joseph Onyoo Ha, z5420498
Question 1
1a. The relationship between bluebottles’ handedness and their size (as length and sail height).
1b. Washed-up bluebottles of Maroubra (assuming they do not conclusively represent the global
population)
1c. 70 washed-up bluebottles on Maroubra’s shore
1d. n=70
1e. It’s an observational study; there is no attempt to influence responses while measuring individuals’
relevant features in the sample.
Question 2
2.a
median(BBData$Sail_Height); var(BBData$Sail_Height) # median, variance for sail height
median(BBData$Bluebottle_Length); var(BBData$Bluebottle_Length) # median, variance for bluebottle length
Sail_Height: M ≈ 0.46 cm, s² ≈ 0.09 cm²
Bluebottle_Length: M = 4.40 cm, s² ≈ 2.96 cm²
2b.
boxplot.stats(BBData$Sail_Height)$out # outliers flagged by the boxplot of Sail Height
boxplot.stats(BBData$Bluebottle_Length)$out # outliers flagged by the boxplot of Length
These commands reveal that Sail Height has 2 outliers (1.3, 1.3) and Length has 2 outliers (8.7, 10.3).
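As a cross-check, the 1.5 × IQR rule behind boxplot.stats() can be applied by hand. The sketch below uses a small made-up vector (x_demo is hypothetical, not the bluebottle data) and the hinges from fivenum(), which is what boxplot.stats() works from; quartiles from quantile() can differ slightly.

```r
# Hypothetical data for illustration only (not from BBData)
x_demo <- c(0.2, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.7, 1.3)

# fivenum() returns min, lower hinge, median, upper hinge, max
hinges <- fivenum(x_demo)
iqr_h  <- hinges[4] - hinges[2]    # spread between the hinges

# Fences sit 1.5 hinge-spreads beyond each hinge
lower_fence <- hinges[2] - 1.5 * iqr_h
upper_fence <- hinges[4] + 1.5 * iqr_h

# Values outside the fences are flagged as outliers
manual_out <- x_demo[x_demo < lower_fence | x_demo > upper_fence]
manual_out

# Should agree with the built-in helper
boxplot.stats(x_demo)$out
```

Swapping x_demo for BBData$Sail_Height or BBData$Bluebottle_Length would apply the same check to the real columns.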
2c.
length(which(BBData$Sail_Height > 0.6))/70 # count of sail height values greater than 0.6, over n=70 (proportion)
p̂_obs(Sail Height > 0.6 cm) ≈ 0.214 (3 sf)
2d.
greater0.6 <- which(BBData$Sail_Height>0.6) # indices where sail height exceeds 0.6 cm
longer6 <- which(BBData$Bluebottle_Length>6) # indices where length exceeds 6 cm
length(Reduce(intersect, list(greater0.6,longer6)))/70 # size of the intersection of the two index sets, divided by the sample size (= proportion)
p̂_obs(Sail Height > 0.6 cm ∩ Length > 6 cm) ≈ 0.114 (3 sf)
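The same proportions can be obtained more directly, because mean() of a logical vector gives the proportion of TRUE values. A minimal sketch on invented data follows; the data frame demo and its values are hypothetical stand-ins for BBData.

```r
# Hypothetical stand-in for BBData (values invented for illustration)
demo <- data.frame(
  Sail_Height       = c(0.3, 0.7, 0.5, 0.9, 0.65, 0.4),
  Bluebottle_Length = c(3.0, 6.5, 4.2, 7.1, 5.0, 6.8)
)

# mean() of a logical vector is the proportion of TRUE values
p_sail <- mean(demo$Sail_Height > 0.6)

# & combines the two conditions row by row (the intersection of events)
p_both <- mean(demo$Sail_Height > 0.6 & demo$Bluebottle_Length > 6)

# Same result as the which()/intersect() approach
idx_sail <- which(demo$Sail_Height > 0.6)
idx_long <- which(demo$Bluebottle_Length > 6)
p_both_check <- length(intersect(idx_sail, idx_long)) / nrow(demo)
```

Using nrow() instead of a hard-coded 70 keeps the proportion correct if the number of rows changes.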
Question 3.
3a.
#Left-handed
 Min. 1st Qu. Median  Mean 3rd Qu.   Max.
1.500   3.900  4.400 4.734   5.700 10.300
#Right-handed
 Min. 1st Qu. Median  Mean 3rd Qu.   Max.
2.200   4.100  4.700 5.164   6.000  8.500
3b.
3c.
Differences: Left has 2 outliers, while Right has none. The centre of the data (both median and mean) is lower for Left than for Right.
Similarities: Similar IQR and similar skewness (both right-skewed).
3d.
n_Left = 59, n_Right = 11
3e. The right-handed sample is much smaller than the left-handed sample. This means that estimates from the left-handed sample are more reliable. In larger samples, the sample mean is more likely to be close to the true mean, outliers that cause skew are more likely to be observed (consistent with my observations for Left), and margins of error are smaller overall.
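To put a rough number on the effect of sample size, the standard error of a mean is s/√n. The sketch below assumes, purely for illustration, that both groups have the same standard deviation; that assumption is not checked against the data. Under it, the ratio of the two standard errors depends only on the sample sizes.

```r
n_left  <- 59   # left-handed sample size (from 3d)
n_right <- 11   # right-handed sample size (from 3d)

# Assumption for illustration: both groups share the same standard deviation s.
# Then SE = s / sqrt(n), so SE_right / SE_left = sqrt(n_left / n_right).
se_ratio <- sqrt(n_left / n_right)
se_ratio
```

Under that assumption the right-handed mean carries a standard error a little over twice that of the left-handed mean.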
Question 4
4a. Length is the explanatory variable.
4b. Sail Height is the response variable
4c. A scatterplot is an appropriate graphical summary.
4d. A regression line fits a scatterplot.
4e.
4f. A relationship exists: it is moderate in strength, positive (increasing), and linear in shape.
4g. cor(BBData$Bluebottle_Length, BBData$Sail_Height)
r = 0.591 (3 sf)
4h. r measures the strength of the linear association between two variables, or the closeness of points to a line fitted to minimise their average distance to it. r can range from –1 to 1 (inclusive). The sign of r (+ or –) shows whether the response variable increases or decreases, respectively, as the explanatory variable increases. r is considered weak at a magnitude of 0 to 0.5, moderate between 0.5 and 0.7, and strong beyond that. In this case, at around +0.591, it shows a moderate positive association. The cluster of points follows the fitted line without curving away from it, so the relationship appears linear at first glance.
4i. Variability = r² = (0.5913363)² ≈ 0.350 = 35.0% (3 sf)
About 35% of the variability of sail height is explained by the length of a bluebottle.
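The link between r and explained variability can be checked numerically: for a regression with one explanatory variable, the square of the correlation equals the R² reported by lm(). The vectors len_demo and sail_demo below are invented for illustration and are not the bluebottle measurements.

```r
# Hypothetical data for illustration only (not from BBData)
len_demo  <- c(2.1, 3.4, 4.0, 4.4, 5.2, 6.0, 7.3)
sail_demo <- c(0.25, 0.40, 0.35, 0.50, 0.45, 0.70, 0.65)

r_demo   <- cor(len_demo, sail_demo)       # correlation coefficient
fit_demo <- lm(sail_demo ~ len_demo)       # least-squares regression line
r2_lm    <- summary(fit_demo)$r.squared    # proportion of variance explained

# For simple linear regression the two quantities agree
all.equal(r_demo^2, r2_lm)

# The value reported in 4i, using r from 4g
r2_4i <- 0.5913363^2
round(r2_4i, 3)
```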
4j. Measuring variability in this way is appropriate only under the assumption that there are no outliers.
4k
Using the 1.5 × IQR rule, no lower outliers were found, but one value (the 28th observation) exceeded the upper boundary. Therefore, the assumption in 4j was incorrect, and the regression line in 4e has been distorted by this outlier.
Question 5
5a.
5b.
5c. The normality of sample Length quantiles is fair.
5d. The sample must be random; as the bluebottles were collected and replaced, the length measurements can be treated as independent and identically distributed. Combined with a reasonably large sample size of 70, the sample mean should give a valid estimate of the true population mean.
5e.
mean(BBData$Bluebottle_Length) #sample mean of length
#df=70-1=69, sample size n=70
qt(0.975, 69) #finds t which marks middle 95%
sd(BBData$Bluebottle_Length) #sample standard deviation
mean(BBData$Bluebottle_Length) - qt(0.975, 69) *
sd(BBData$Bluebottle_Length)/sqrt(70) #minimum of interval
mean(BBData$Bluebottle_Length) + qt(0.975, 69) *
sd(BBData$Bluebottle_Length)/sqrt(70) #maximum
Therefore, from the derived minimum and maximum, CI_95%(μ) = [4.39, 5.21] (3 sf).
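The calculation can also be wrapped in a small helper that works from summary statistics. The function name t_ci is hypothetical, and the inputs are the rounded values reported in Question 6 (mean 4.80, sd 1.72, n = 70), so the endpoints are only expected to be close to the interval above.

```r
# Hypothetical helper: t-based confidence interval from summary statistics
t_ci <- function(xbar, s, n, level = 0.95) {
  t_star <- qt(1 - (1 - level) / 2, df = n - 1)  # critical value of t
  margin <- t_star * s / sqrt(n)                 # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Rounded summary values from the assignment: mean 4.80, sd 1.72, n = 70
ci_demo <- t_ci(4.80, 1.72, 70)
round(ci_demo, 2)

# With the full data available, t.test() reports the same kind of interval:
# t.test(BBData$Bluebottle_Length, conf.level = 0.95)$conf.int
```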
5f. There is no information on the population standard deviation, σ, so it must be estimated by the sample standard deviation s. The t-distribution is therefore used instead of the standard normal (z) distribution.
5g. For a single calculated interval, it is not meaningful to speak of the probability that it contains the true mean; the 95% refers to the procedure, in that about 95% of intervals constructed this way from repeated samples will contain the true mean. Individually, a sample interval either contains it or does not. Secondly, a 95% confidence interval does not imply that 95% of data points are centred around the mean, but rather that the interval gives a range of plausible values for the population mean.
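The repeated-sampling interpretation can be illustrated with a short simulation. This is a sketch under assumed conditions: the population is taken to be normal with a hypothetical mean of 5 and standard deviation of 2, neither of which is estimated from the bluebottle data.

```r
set.seed(1041)            # arbitrary seed so the run is reproducible

true_mean <- 5            # hypothetical population mean
true_sd   <- 2            # hypothetical population standard deviation
n         <- 70           # same sample size as the bluebottle sample
n_sims    <- 2000         # number of repeated samples

covers <- replicate(n_sims, {
  x  <- rnorm(n, mean = true_mean, sd = true_sd)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]   # TRUE if this interval contains the true mean
})

coverage <- mean(covers)  # proportion of intervals that contain the true mean
coverage
```

The proportion should land near 0.95, while each individual interval simply does or does not contain the true mean.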
Question 6
6a. Let $\mu_L$ be the true mean length of Maroubra’s beached bluebottles.
$H_0: \mu_L \le 3.4$ cm, $\tilde{H}_0: \mu_L = 3.4$ cm, $H_a: \mu_L > 3.4$ cm
Because $\sigma$ is unknown, we must use the test statistic $T = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$. The mean sample length in Maroubra was $\bar{x} \approx 4.80$ cm, sample standard deviation was $s = 1.72$ cm, and sample size was $n = 70$.
This yields $t \approx \dfrac{4.80 - 3.4}{1.72/\sqrt{70}} \approx 6.82$. Also, the null distribution at $\tilde{H}_0$ is $T \sim t(n-1) = t(70-1) = t(69)$.
Under the null hypothesis,
tval2=(mean(BBData$Bluebottle_Length)-3.4)/
(sd(BBData$Bluebottle_Length)/sqrt(70)) #the value of t
pt(tval2,df=69, lower.tail=FALSE) #p-value; lower.tail=FALSE gives P(T > t)
# > [1] 1.409422e-09
$P\text{-value} = P_{\mu_L = \mu_0}(T \ge 6.82) \approx 1.41 \times 10^{-9}$
The P-value, at about 1.41 × 10⁻⁷ %, is between 0 and 0.1%, providing very strong evidence against the null hypothesis; the data strongly suggest that the true mean length of beached bluebottles in Maroubra is greater than the South African estimate of 3.4 cm.
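The test can be reproduced approximately from the rounded summary statistics alone. Because the inputs below are rounded (4.80 and 1.72), the statistic and P-value are expected to be close to, but not identical to, the values computed from the full data.

```r
xbar <- 4.80   # sample mean length (cm), rounded
s    <- 1.72   # sample standard deviation (cm), rounded
n    <- 70     # sample size
mu0  <- 3.4    # hypothesised mean under the null (cm)

# One-sample t statistic: (sample mean - hypothesised mean) / standard error
t_obs <- (xbar - mu0) / (s / sqrt(n))

# One-sided P-value for H_a: mu > mu0, using t with n - 1 degrees of freedom
p_val <- pt(t_obs, df = n - 1, lower.tail = FALSE)

t_obs
p_val

# With the full data available, the equivalent single call would be:
# t.test(BBData$Bluebottle_Length, mu = 3.4, alternative = "greater")
```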
6b.
Yes, both follow t distributions.