03 Basic Statistical Data Analysis Using Excel
• Symmetric about the mean (50% of the data will fall below the mean and
half of the data will fall above the mean)
• Since a number of the most common statistical tests rely on the normality
of a sample or population, it is often useful to test whether the underlying
distribution is normal, or at least symmetric. This can be done via the
following approaches:
o Review the distribution graphically (via histograms, boxplots, QQ plots)
o Analyze the skewness and kurtosis
o Employ statistical tests (esp. Chi-square, Kolmogorov-Smirnov,
Shapiro-Wilk, Jarque-Bera, D’Agostino-Pearson)
• Note: If data is not symmetric, sometimes it is useful to make a
transformation whereby the transformed data is symmetric and so can be
analyzed more easily.
The following graphs can be used to check whether the data are normally
distributed:
• Histogram
• Boxplot
Note: The graphs are not as accurate as the formal tests for normality.
Using a Q-Q plot, determine whether the following data set of 20 elements is
normally distributed.
-14 -11 18 37 36 -21 -14 50 43 40
18 49 6 22 45 -24 24 20 4 -43
(…) 0 – Descending
(…) 1 - Ascending
of i.
Prob. of i = (i − 0.5)/n
Step 4. Find the theoretical z-score based on Prob. of i:
= NORM.S.INV(Prob. of i)
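The Q-Q plot coordinates can also be sketched outside Excel. The following is a minimal Python illustration, not the worksheet itself: it sorts the 20 observations in ascending order, applies Prob. of i = (i − 0.5)/n, and uses `statistics.NormalDist().inv_cdf` as a stand-in for `NORM.S.INV`. One assumption to note: tied values here simply receive consecutive positions, which may differ from how a spreadsheet rank function treats ties.

```python
from statistics import NormalDist

# The 20 observations from the example above.
data = [-14, -11, 18, 37, 36, -21, -14, 50, 43, 40,
        18, 49, 6, 22, 45, -24, 24, 20, 4, -43]

n = len(data)
sorted_data = sorted(data)           # ascending order, so position i runs 1..n
std_normal = NormalDist(mu=0, sigma=1)

qq_points = []
for i, x in enumerate(sorted_data, start=1):
    prob = (i - 0.5) / n             # Prob. of i
    z = std_normal.inv_cdf(prob)     # theoretical z-score for that probability
    qq_points.append((x, prob, z))

# Plotting x against z gives the Q-Q plot; points near a straight line
# suggest approximate normality.
for x, prob, z in qq_points:
    print(f"{x:>4}  {prob:.3f}  {z:+.3f}")
```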
$z \sim \mathrm{Normal}(0, 1)$
$$z = \frac{X - \mu}{\sigma}$$
Where:
$\mu$ = mean
$\sigma$ = standard deviation
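[Figure: anatomy of a boxplot. The box runs from Q1 (25th percentile) to Q3 (75th percentile), with the median marked inside it; its width is the interquartile range (IQR). The whiskers extend from the box to the minimum/lower fence (Q1 − 1.5 ∗ IQR) and to the maximum/upper fence (Q3 + 1.5 ∗ IQR).]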
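As a rough illustration of the fence formulas, the sketch below computes the quartiles, IQR, and fences for the 20-element data set used in the Q-Q plot example. It assumes the "inclusive" quartile method of Python's `statistics.quantiles`; other quartile conventions give slightly different values.

```python
from statistics import quantiles

data = [-14, -11, 18, 37, 36, -21, -14, 50, 43, 40,
        18, 49, 6, 22, 45, -24, 24, 20, 4, -43]

# Assumption: "inclusive" quartiles (linear interpolation between data points).
q1, median, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                     # interquartile range
lower_fence = q1 - 1.5 * iqr      # Q1 - 1.5 * IQR
upper_fence = q3 + 1.5 * iqr      # Q3 + 1.5 * IQR

# Observations beyond the fences are usually flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence, "Outliers:", outliers)
```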
[Figure: three distribution shapes: perfectly symmetric or normal distribution; positively skewed distribution (skewed to the right); negatively skewed distribution (skewed to the left).]
Kurtosis tells you the height and sharpness of the central peak, relative to
that of a standard bell curve.
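Skewness and kurtosis can be computed directly from the data. The sketch below uses the bias-adjusted sample formulas, which to my understanding are the ones behind Excel's `SKEW` and `KURT`; treat that correspondence as an assumption. The function names are illustrative only.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Bias-adjusted sample skewness (0 for perfectly symmetric data)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (about 0 for a normal curve)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

data = [-14, -11, 18, 37, 36, -21, -14, 50, 43, 40,
        18, 49, 6, 22, 45, -24, 24, 20, 4, -43]
print("skewness:", round(sample_skewness(data), 3))
print("excess kurtosis:", round(sample_excess_kurtosis(data), 3))
```

Values near zero for both measures are consistent with a roughly normal shape; a positive skewness indicates a longer right tail and a negative one a longer left tail.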
$SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$
$W = b^2 / SS$
= (b^2)/SS
or
= (b * b)/SS
The p-value lies between 0.50 and 0.90. The W value for 0.5 is 0.943 and
the W value for 0.9 is 0.973.
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 30
DEPARTMENT of
STATISTICS
Shapiro-Wilk W Test
We need to interpolate since $W = 0.971$ ($y$) lies between the tabulated $W$ values for the p-values 0.50 and 0.90.
The $W$ value for 0.5 ($x_1$) is 0.943 ($y_1$) and the $W$ value for 0.9 ($x_2$) is 0.973 ($y_2$).
Linear Interpolation:
$$y - y_1 = \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1)$$
$$0.971 - 0.943 = \frac{0.973 - 0.943}{0.9 - 0.5}\,(x - 0.5)$$
$$x = 0.873 \Rightarrow \text{P-value}$$
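The interpolation can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the helper name `interpolate_p_value` is made up for illustration.

```python
def interpolate_p_value(w, w_low, p_low, w_high, p_high):
    """Linearly interpolate the p-value for W between two tabulated points."""
    slope = (p_high - p_low) / (w_high - w_low)
    return p_low + slope * (w - w_low)

# Tabulated points from the slide: W = 0.943 at p = 0.5 and W = 0.973 at p = 0.9.
p_value = interpolate_p_value(0.971, w_low=0.943, p_low=0.5,
                              w_high=0.973, p_high=0.9)
print(round(p_value, 3))
```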
Note:
• The residuals are the differences between the observed and expected values.
• Excel will not perform nonparametric tests, even with the Data Analysis ToolPak add-in.
Hypothesis Testing
• One Population Parameter
o One Sample: One Sample t-test
• Two Population Parameter
o Dependent Samples: Paired Sample t-test
o Independent Samples: z-test (if 𝜎1 & 𝜎2 are known)
Critical Value
– value/s that separate the critical region from the values that would not
lead to rejection of 𝐻0
Level of significance 𝜶
• probability of making the mistake of rejecting the null hypothesis when it
is true
Left-tailed test:
• The critical region (CR) is in the extreme left region under the curve.
Right-tailed test:
• The CR is in the extreme right region under the curve.
Two-tailed test:
• The CR is in the two extreme regions under the curve.
1. Left-tailed test
• The CR is in the extreme left region under the curve.
Ha: 𝜇 < 𝜇𝑜
[Figure: normal curve with the rejection region in the left tail, separated from the acceptance region by the critical value.]
Rejection Region: if the test statistic falls in this region, we reject Ho.
Acceptance Region: if the test statistic falls in this region, we fail to reject Ho.
2. Right-tailed test:
• The CR is in the extreme right region under the curve.
Ha: 𝜇 > 𝜇𝑜
[Figure: normal curve with the rejection region in the right tail, separated from the acceptance region by the critical value.]
Acceptance Region: if the test statistic falls in this region, we fail to reject Ho.
Rejection Region: if the test statistic falls in this region, we reject Ho.
3. Two-tailed test:
• The CR is in the two extreme regions under the curve.
Ha: 𝜇 ≠ 𝜇𝑜
[Figure: normal curve with a rejection region in each tail and the acceptance region between them.]
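To make the three cases concrete, the sketch below finds the critical value(s) for a z-test and checks whether a test statistic falls in the critical region. It assumes a standard normal test statistic; for a t-test the same logic applies with t critical values. The function names and the `tail` labels are illustrative choices, not part of the slides.

```python
from statistics import NormalDist

def z_critical_values(alpha, tail):
    """Return the critical value(s) for a left-, right-, or two-tailed z-test."""
    z = NormalDist()
    if tail == "left":
        return (z.inv_cdf(alpha),)
    if tail == "right":
        return (z.inv_cdf(1 - alpha),)
    if tail == "two":
        upper = z.inv_cdf(1 - alpha / 2)   # alpha is split between both tails
        return (-upper, upper)
    raise ValueError("tail must be 'left', 'right', or 'two'")

def in_critical_region(test_stat, alpha, tail):
    """True when the test statistic leads to rejecting Ho."""
    crit = z_critical_values(alpha, tail)
    if tail == "left":
        return test_stat < crit[0]
    if tail == "right":
        return test_stat > crit[0]
    return test_stat < crit[0] or test_stat > crit[1]

for tail in ("left", "right", "two"):
    print(tail, [round(c, 3) for c in z_critical_values(0.05, tail)])
```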
a. State the null (Ho) and alternative (Ha) hypotheses. Identify the claim.
b. Determine the test statistic, critical value and tail of the distribution where
the rejection region is located.
c. Formulate the decision rule.
d. Compute the value of the test statistic.
e. Make a decision (whether to reject or fail to reject Ho).
f. Draw the conclusion by answering the original claim.
Decision Rule
• If the p-value is smaller than the significance level, Ho is rejected.
• If it is larger than the significance level, Ho is not rejected (we fail to
reject Ho).
P-value
• The probability of observing a sample value as extreme as, or more
extreme than, the value observed, given that the null hypothesis is true.
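As a small illustration of the definition and of the decision rule, the sketch below computes the p-value of a z test statistic for each type of test. It assumes a standard normal test statistic; the function names are hypothetical.

```python
from statistics import NormalDist

def z_p_value(z_stat, tail):
    """P-value of a z test statistic for a left-, right-, or two-tailed test."""
    cdf = NormalDist().cdf(z_stat)
    if tail == "left":
        return cdf
    if tail == "right":
        return 1 - cdf
    if tail == "two":
        return 2 * min(cdf, 1 - cdf)   # both tails are equally "extreme"
    raise ValueError("tail must be 'left', 'right', or 'two'")

def decide(p_value, alpha):
    """Decision rule: reject Ho only when the p-value is below alpha."""
    return "reject Ho" if p_value < alpha else "fail to reject Ho"

p = z_p_value(2.10, "two")
print(round(p, 4), decide(p, 0.05))
```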
The following table will help you when you summarize the results.
Decision            | Claim is Ho                                       | Claim is Ha
Reject Ho           | There is enough evidence to reject the claim.     | There is enough evidence to support the claim.
Failed to reject Ho | There is not enough evidence to reject the claim. | There is not enough evidence to support the claim.
Symbol | Key phrases | Example | Notation
< | less than | The mean weight is less than 50kg | 𝜇 < 50kg
≥ | greater than or equal to; at least | The mean weight is greater than or equal to 50kg; The mean weight is at least 50kg | 𝜇 ≥ 50kg
≤ | less than or equal to; at most | The mean weight is less than or equal to 50kg; The mean weight is at most 50kg | 𝜇 ≤ 50kg
= | equal | The mean weight is equal to 50kg | 𝜇 = 50kg
≠ | not equal | The mean weight is not equal to 50kg | 𝜇 ≠ 50kg
Example 1:
Test the hypothesis that the average sugar content of particular cookies is
10g, if the contents of a random sample of 10 cookies are:
10.3 9.7 10.1 10.3 10.1 9.8 9.9 10.5 10.3 9.9
Use a 0.01 level of significance and assume that the distribution of contents is
normal.
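Before running the test in Excel, the test statistic for this example can be sketched by hand. The code below computes the one-sample t statistic for Ho: 𝜇 = 10 against Ha: 𝜇 ≠ 10. The critical value 3.250 is an assumption taken from a standard t table (two-tailed, 𝛼 = 0.01, df = 9), not from the slides.

```python
from math import sqrt
from statistics import mean, stdev

sugar = [10.3, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.5, 10.3, 9.9]
mu_0 = 10.0                      # hypothesized mean from Ho

n = len(sugar)
df = n - 1
t_stat = (mean(sugar) - mu_0) / (stdev(sugar) / sqrt(n))

# Assumption: two-tailed critical value for alpha = 0.01, df = 9 (t table).
T_CRITICAL = 3.250
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f}, df = {df}, reject Ho: {reject_h0}")
```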
Step 3:
• For Variable 1 Range, select the actual data.
• For Variable 2 Range, select the dummy data.
• For Hypothesized Mean Difference, input the test value from the hypothesis.
Example 2:
A communication company conducted a study about the number of text
messages sent by a teenager in a day. They claim that, on average, a
teenager sends at least 65 text messages per day. To update the estimates, a
sample of 11 teenagers were asked how many text messages they sent in a day.
Their responses were:
51 175 47 49 44 54 145 203 21 42 100
At the 0.05 level of significance, can you conclude that the mean number of text
messages sent by teenagers per day is 65? Assume the data are normally
distributed.
Example 1:
A random sample of the number of sports offered by colleges for males
and females is shown. At 𝛼 = 0.10, is there enough evidence to
support the claim that there is a significant difference in the number
of sports offered by colleges for males and females?
Assume 𝜎1 = 𝜎2 = 3.3.
Males | Females
6 11 11 | 6 8 11
6 14 8 | 7 5 13
6 9 5 | 6 5 5
6 9 18 | 10 7 6
15 6 11 | 16 10 7
9 9 5 | 7 5 5
8 9 6 | 9 18 13
9 5 11 | 7 8 5
7 7 5 | 11 4 6
10 7 10 | 14 12 5
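The two-sample z-test for this example can be sketched as follows. The split of each data row into three male and three female values is an assumption based on the "Males Females" header; the known standard deviations come from the problem statement.

```python
from math import sqrt
from statistics import NormalDist, mean

# Assumption: first three values of each row are males, last three are females.
males = [6, 11, 11, 6, 14, 8, 6, 9, 5, 6, 9, 18, 15, 6, 11,
         9, 9, 5, 8, 9, 6, 9, 5, 11, 7, 7, 5, 10, 7, 10]
females = [6, 8, 11, 7, 5, 13, 6, 5, 5, 10, 7, 6, 16, 10, 7,
           7, 5, 5, 9, 18, 13, 7, 8, 5, 11, 4, 6, 14, 12, 5]

sigma_1 = sigma_2 = 3.3          # known population standard deviations
alpha = 0.10

std_error = sqrt(sigma_1 ** 2 / len(males) + sigma_2 ** 2 / len(females))
z_stat = (mean(males) - mean(females)) / std_error

# Two-tailed p-value, because Ha states that the means are different.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"z = {z_stat:.3f}, p-value = {p_value:.3f}, reject Ho: {p_value < alpha}")
```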
Example 1:
A teacher wants to know which of the two sections has a higher score. The
teacher will use these scores to make recommendations to the principal.
Random samples of students are asked about their scores. Test the claim that
the population mean scores are different for the two sections. Assume the
data are normally distributed with equal variances. Use 𝛼 = 0.05.
Section A 81 77 75 74 86 90 62 73 91 98
Section B 89 64 35 68 69 55 37 57 42 49
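The pooled-variance t statistic that Excel's "t-Test: Two-Sample Assuming Equal Variances" tool is expected to report can be sketched by hand. The critical value 2.101 is an assumption taken from a standard t table (two-tailed, 𝛼 = 0.05, df = 18).

```python
from math import sqrt
from statistics import mean, variance

section_a = [81, 77, 75, 74, 86, 90, 62, 73, 91, 98]
section_b = [89, 64, 35, 68, 69, 55, 37, 57, 42, 49]

n1, n2 = len(section_a), len(section_b)
df = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * variance(section_a)
              + (n2 - 1) * variance(section_b)) / df
t_stat = (mean(section_a) - mean(section_b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Assumption: two-tailed critical value for alpha = 0.05, df = 18 (t table).
T_CRITICAL = 2.101
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f}, df = {df}, reject Ho: {reject_h0}")
```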
5. The output will be displayed.
Example 2:
The production manager at Bellevue Steel, a manufacturer of wheelchairs,
wants to compare the number of defective wheelchairs produced on the day
shift with the number on the afternoon shift. A sample of the production
from 6 day shifts and 8 afternoon shifts revealed the following number of
defects. At the 0.05 significance level, test the claim that there is a difference
in the mean number of defects per shift. Assume the data are normally
distributed with unequal variances.
Day 5 8 7 6 9 7
Afternoon 8 10 7 11 9 12 14 9
Note: Assume unequal variances, so use t-Test: Two-Sample Assuming Unequal Variances
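For the unequal-variance case, the test statistic and the approximate degrees of freedom (the Welch-Satterthwaite formula) can be sketched as below. This is an illustration of the standard formulas, not a reproduction of Excel's output; spreadsheet tools typically report the degrees of freedom as a whole number.

```python
from math import sqrt
from statistics import mean, variance

day = [5, 8, 7, 6, 9, 7]
afternoon = [8, 10, 7, 11, 9, 12, 14, 9]

# Variance of each sample mean (s^2 / n).
v1 = variance(day) / len(day)
v2 = variance(afternoon) / len(afternoon)

t_stat = (mean(day) - mean(afternoon)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation for the degrees of freedom.
welch_df = (v1 + v2) ** 2 / (v1 ** 2 / (len(day) - 1)
                             + v2 ** 2 / (len(afternoon) - 1))

print(f"t = {t_stat:.3f}, approximate df = {welch_df:.2f}")
```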
Example 2:
Advertisements by Fitness Center claim that completing its course will result
in losing weight. A random sample of eight recent participants showed the
following weights before and after completing the course. At 0.01 level of
significance, can we conclude the students lost weight? Assume the data are
normally distributed.
Student 1 2 3 4 5 6 7 8
Before 155 228 141 162 211 164 184 172
After 154 201 147 157 196 150 170 165
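The paired t statistic for this example can be sketched from the before-minus-after differences. The claim "students lost weight" corresponds to Ha: mean difference > 0, a right-tailed test. The critical value 2.998 is an assumption taken from a standard t table (one-tailed, 𝛼 = 0.01, df = 7).

```python
from math import sqrt
from statistics import mean, stdev

before = [155, 228, 141, 162, 211, 164, 184, 172]
after = [154, 201, 147, 157, 196, 150, 170, 165]

# Positive differences mean the student lost weight.
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
df = n - 1
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))

# Assumption: right-tailed critical value for alpha = 0.01, df = 7 (t table).
T_CRITICAL = 2.998
reject_h0 = t_stat > T_CRITICAL

print(f"mean difference = {mean(diffs)}, t = {t_stat:.3f}, reject Ho: {reject_h0}")
```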
Solution: F-test
Step 1: Ho: $\sigma_1^2 = \sigma_2^2$ (Variances are equal)
Ha: $\sigma_1^2 \neq \sigma_2^2$ (Variances are not equal)
Step 2: F-test, 𝛼 = 0.05
Step 3: Reject Ho if F > F Critical one-tail
Reject Ho if P-value < 𝛼 = 0.05 (P-value Method)
Step 4: F = 2.46 and F Critical = 3.17
P-value = 0.098
Step 5: Since F is not greater than F Critical, we fail to reject Ho.
Since the P-value is not less than 𝛼 = 0.05, we fail to reject Ho.
Step 6: At the 5% level of significance, there is not enough evidence to conclude that the variances are unequal, so equal variances may be assumed.
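The F statistic in Step 4 appears to correspond to the Section A and Section B scores from the equal-variance example; assuming that is the data used, the ratio of sample variances can be sketched as follows. The critical value is the one quoted in Step 4, not recomputed here.

```python
from statistics import variance

# Assumption: the F-test was run on the two sections' scores.
section_a = [81, 77, 75, 74, 86, 90, 62, 73, 91, 98]
section_b = [89, 64, 35, 68, 69, 55, 37, 57, 42, 49]

var_a = variance(section_a)
var_b = variance(section_b)

# Put the larger sample variance in the numerator so that F >= 1.
f_stat = max(var_a, var_b) / min(var_a, var_b)

F_CRITICAL = 3.17                # one-tail critical value quoted in Step 4
reject_h0 = f_stat > F_CRITICAL

print(f"F = {f_stat:.2f}, reject Ho: {reject_h0}")
```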
Assumptions:
• The sample of paired (X,Y) data is a random sample.
• The data pairs fall approximately on a straight line and are measured at the
interval or ratio level.
• The pairs of (X,Y) data have a bivariate normal distribution.
Step 4: You'll see many options when you select this button; select Scatter.
Step 5: The chart will appear. Customize the chart through “Chart Design”,
“Format”, and “Quick Layout”.
[Figure: scatter plot of quarterly sales ($1000s) against student population (1000s).]
Types of Correlation
$$r = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{\sqrt{\left[n\sum X_i^2 - (\sum X_i)^2\right]\left[n\sum Y_i^2 - (\sum Y_i)^2\right]}}$$
𝑟 = −0.94
It means that the number of absences
and the final grade have a very strong
negative linear relationship.
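The correlation coefficient can be computed from the raw sums in the formula for r. The sketch below assumes the seven absence and final-grade pairs from the regression example in this deck; the function name `pearson_r` is illustrative.

```python
from math import sqrt

# Assumption: the seven (absences, final grade) pairs from the regression example.
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

def pearson_r(x, y):
    """Pearson correlation using the computational (raw sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(absences, grades)
print(round(r, 2))
```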
where:
t is the t-statistic computed from the correlation coefficient
df is the degrees of freedom, which is equal to n-2
tails is the number of tails of the distribution (‘1’ for a one-tailed analysis or
‘2’ for a two-tailed analysis)
Example 1 in Excel:
=PEARSON(B2:B8, C2:C8)
=(ABS(F2)*SQRT(F3-2))/(SQRT(1-ABS(F2)^2))
=TDIST(F4,F5,F6)
=PEARSON(A2:A11,B2:B11)
=(ABS(E2)*SQRT(E3-2))/(SQRT(1-E2^2))
=TDIST(E4,E5,E6)
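The middle formula in each group converts r into a t statistic with n − 2 degrees of freedom. A minimal Python sketch of that step follows, using r = −0.94 from the absences example and assuming n = 7 (the range B2:B8 holds seven rows). The critical value 2.571 is an assumption from a standard t table (two-tailed, 𝛼 = 0.05, df = 5); the p-value step done by `TDIST` is not reproduced here.

```python
from math import sqrt

def correlation_t_stat(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return abs(r) * sqrt(n - 2) / sqrt(1 - r ** 2)

r = -0.94                        # correlation reported for the absences example
n = 7                            # assumption: seven (X, Y) pairs
df = n - 2

t_stat = correlation_t_stat(r, n)

# Assumption: two-tailed critical value for alpha = 0.05, df = 5 (t table).
T_CRITICAL = 2.571
significant = t_stat > T_CRITICAL

print(f"t = {t_stat:.2f}, df = {df}, significant: {significant}")
```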
Other Correlations
• Pearson Product Moment Correlation – appropriate when both
variables are measured at the interval or ratio level.
• Spearman Rank Order Correlation (rho) – appropriate when the two
variables are measured at the ordinal level.
• Point-Biserial Correlation – appropriate when one measure is a
continuous interval level and the other is dichotomous (i.e., two-
category).
$b_0 = \bar{y} - b_1 \bar{x}$
$\hat{y}_i = b_0 + b_1 x_i$
Interpretation of the slope and the intercept:
▪ $b_0$ is the estimated average value of y when the value of x is zero
▪ $b_1$ is the estimated change (increase or decrease) in the average
value of y as a result of a one-unit increase in x
3. Input the range that contains the variables X and Y.
4. Choose the output options, then click OK.
Example: The data below shows the number of absences and the final grades of
seven randomly selected students from a statistics class.
Student | Number of absences (X) | Final grade (Y)
A | 6 | 82
B | 2 | 86
C | 15 | 43
D | 9 | 74
E | 12 | 58
F | 5 | 90
G | 8 | 78
1. Create a regression equation to predict the final grade of a student.
2. Interpret $b_0$ and $b_1$.
3. Compute the value of $R^2$.
4. What is the estimated value of the final grade if the number of absences is 1, 4, 7, 10, 20?
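A least-squares fit can be sketched by hand to see where $b_0$, $b_1$, and $R^2$ come from. The code below uses the seven pairs exactly as printed above; if the worksheet behind the slides held slightly different values, the fitted numbers need not match the coefficients quoted in the solution.

```python
from statistics import mean

absences = [6, 2, 15, 9, 12, 5, 8]          # X
grades = [82, 86, 43, 74, 58, 90, 78]       # Y

x_bar, y_bar = mean(absences), mean(grades)

# Sums of squares and cross-products about the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))
s_xx = sum((x - x_bar) ** 2 for x in absences)
s_yy = sum((y - y_bar) ** 2 for y in grades)

b1 = s_xy / s_xx                  # slope
b0 = y_bar - b1 * x_bar           # intercept: b0 = y_bar - b1 * x_bar
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(f"y_hat = {b0:.2f} + ({b1:.2f})x, R^2 = {r_squared:.4f}")
for x in (1, 4, 7, 10, 20):
    print(x, round(b0 + b1 * x, 2))
```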
Solution:
Regression Equation:
$\hat{y}_i = b_0 + b_1 x_i$
$\hat{y}_i = 101.31 - 3.51 x_i$
• 101.31 is the estimated average final grade when the number of absences
is zero.
• −3.51 means that the average final grade is estimated to decrease by 3.51 for
each one-unit increase in the number of absences.
Solution:
Coefficient of Determination:
$R^2 = 0.8927$ or 89.27%
Solution:
What is the estimated value of the final grade if the number of absences is 1, 4,
7, 10, and 20?
Regression Equation:
$\hat{y}_i = b_0 + b_1 x_i$
$\hat{y}_i = -10.997 + 1.05 x_i$
−10.997 is the estimated average height when the size of the arm span is zero.
1.05 is the estimated increase in the average height as a result of a one-unit
increase in the size of the arm span.
Solution:
Coefficient of Determination:
$R^2 = 0.8677$ or 86.77%
86.77% of the total variation in the
height was explained, or accounted
for, by the variation of the size of arm
span.
Solution:
What is the estimated value of the height if the arm span is 150, 200, or 250?
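Estimated values follow by substituting each x into the fitted equation. The sketch below does this for the two regression equations quoted in the solutions above; the helper name `predict` is illustrative only.

```python
def predict(b0, b1, x):
    """Estimated y for a given x from the line y_hat = b0 + b1 * x."""
    return b0 + b1 * x

# Height from arm span: y_hat = -10.997 + 1.05x
height_estimates = {x: round(predict(-10.997, 1.05, x), 3)
                    for x in (150, 200, 250)}

# Final grade from absences: y_hat = 101.31 - 3.51x
grade_estimates = {x: round(predict(101.31, -3.51, x), 2)
                   for x in (1, 4, 7, 10, 20)}

print(height_estimates)
print(grade_estimates)
```

Estimates for x values far outside the observed data (such as 20 absences, when the largest observed value is 15) are extrapolations and should be read with caution.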