0% found this document useful (0 votes)
179 views141 pages

03 Basic Statistical Data Analysis Using Excel

This document discusses how to assess normality of data using Microsoft Excel. It covers three key topics: 1. The normal distribution and its properties like the bell curve shape. 2. Graphical methods to test normality - histograms, normal Q-Q plots, and boxplots. Normal Q-Q plots compare empirical quantiles against theoretical quantiles from a normal distribution to see how close the data fits. 3. Steps for constructing a normal Q-Q plot in Excel, including ranking the data, calculating probabilities, determining theoretical z-scores, and plotting the values to check linearity.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
Download as pdf or txt
0% found this document useful (0 votes)
179 views141 pages

03 Basic Statistical Data Analysis Using Excel

This document discusses how to assess normality of data using Microsoft Excel. It covers three key topics: 1. The normal distribution and its properties like the bell curve shape. 2. Graphical methods to test normality - histograms, normal Q-Q plots, and boxplots. Normal Q-Q plots compare empirical quantiles against theoretical quantiles from a normal distribution to see how close the data fits. 3. Steps for constructing a normal Q-Q plot in Excel, including ranking the data, calculating probabilities, determining theoretical z-scores, and plotting the values to check linearity.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 141

Basic Statistical Data Analysis

Using Microsoft Excel


Glyzel Grace M. Francisco
STAT2100 – Statistical Analysis with Software Application
1st Semester, 2022-2023

Basic Statistical Data Analysis Using Microsoft Excel | 1


CENTRAL LUZON STATE UNIVERSITY
DEPARTMENT of
STATISTICS Learning Outcomes

After completing this chapter, the students must be able to

• Assess the normality of data using Microsoft Excel


• Perform hypothesis testing using Microsoft Excel
• Perform correlation and simple linear regression using Microsoft Excel

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 2


DEPARTMENT of
STATISTICS Normal Distribution

• also called as the Gaussian distribution

• It is a distribution related to continuous data from a statistical analysis


standpoint.

• It is sometimes called the “bell curve,” since it is a bell-shaped and


asymptotic about the x-axis

• Mean = Median = Mode

• Symmetric about the mean (50% of the data will fall below the mean and
half of the data will fall above the mean)

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 3


DEPARTMENT of
STATISTICS Normal Distribution

• Since a number of the most common statistical tests rely on the normality
of a sample or population, it is often useful to test whether the underlying
distribution is normal, or at least symmetric. This can be done via the
following approaches:
o Review the distribution graphically (via histograms, boxplots, QQ plots)
o Analyze the skewness and kurtosis
o Employ statistical tests (esp. Chi-square, Kolmogorov-Smirnov, Shapiro-
Wilk, Jarque-Barre, D’Agostino-Pearson)
• Note: If data is not symmetric, sometimes it is useful to make a
transformation whereby the transformed data is symmetric and so can be
analyzed more easily.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 4


DEPARTMENT of
STATISTICS Normal Distribution

A normal or Gaussian distribution :

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 5


DEPARTMENT of
STATISTICS Normal Distribution

The following graphs can be used to test whether the data is normally
distributed or not :

• Histogram

• Normal Q-Q plot

• Boxplot

Note: the graphs are not as accurate as the formal test for normality.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 6


DEPARTMENT of
STATISTICS 1. Histogram

➢ A histogram can be used to test


whether data is normally
distributed. This test simply
consists of looking at the
histogram and discerning
whether it approximates the
bell curve of a normal
distribution.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 7


DEPARTMENT of
STATISTICS 2. Q-Q plot

• A Q-Q plot is a short term for “quantile-quantile”


plot

• often used to assess whether or not a set of data


potentially came from some theoretical
distribution.

• In most cases, this type of plot is used to


determine whether or not a set of data follows a
normal distribution.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 8


DEPARTMENT of
STATISTICS 2. Q-Q plot

• an alternative graphical method of assessing normality to the histogram and


is easier to use when there are small sample sizes. The scatter should lie as
close to the line as possible with no obvious pattern coming away from the
line for the data to be considered normally distributed.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 9


DEPARTMENT of
STATISTICS 2. Q-Q plot
Example:

Using Q-Q plot determine whether the data set with 20 elements is normally
distributed.
-14 -11 18 37 36 -21 -14 50 43 40

-48 -1 -48 17 -23 18 -50 34 39 -41

0 38 31 -20 27 46 -18 29 -6 -24

-39 -14 -6 -47 45 -27 -37 6 -38 -29

18 49 6 22 45 -24 24 20 4 -43

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 10


DEPARTMENT of
STATISTICS 2. Q-Q plot
Step 1. Enter and sort the data Step 2. Find the rank of each data value

=RANK(number, ref, [order])

(…) 0 – Descending
(…) 1 - Ascending

Note: When a formula contains an absolute


reference, no matter which cell the formula occupies
the cell reference does not change: if you copy or
move the formula, it refers to the same cell as it did
in its original location. In an absolute reference, each
part of the reference (the letter that refers to the
row and the number that refers to the column) is
preceded by a “$” – for example, $A$1 is an absolute
reference to cell A1. Wherever the formula is copied
or moved, it always refers to cell A1.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 11


DEPARTMENT of
STATISTICS 2. Q-Q plot

Step 3. Find the cumulative


= (𝑖 − 0.5)/𝑛
probability of getting less than or
theoretical z-score of the z-score = (𝑖 − 0.5)/𝐶𝑂𝑈𝑁𝑇(𝑑𝑎𝑡𝑎 𝑟𝑎𝑛𝑔𝑒)

of i.

𝑷𝒓𝒐𝒃. 𝒐𝒇 𝒊 = (𝑖 − 0.5)/𝑛

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 12


DEPARTMENT of
STATISTICS 2. Q-Q plot

= 𝑁𝑂𝑅𝑀. 𝑆. 𝐼𝑁𝑉(𝑃𝑟𝑜𝑏 . 𝑜𝑓 𝑖)
Step 4. Find the theoretical
z-score based on Prob. of i

z ~ 𝑁𝑜𝑟𝑚𝑎𝑙 (0,1)

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 13


DEPARTMENT of
STATISTICS 2. Q-Q plot

Step 5. Get the standardized = ( 𝑋 − 𝐴𝑉𝐸𝑅𝐴𝐺𝐸 𝑑𝑎𝑡𝑎 𝑟𝑎𝑛𝑔𝑒 /


value of X 𝑆𝑇𝐷𝐸𝑉. 𝑆(𝑑𝑎𝑡𝑎 𝑟𝑎𝑛𝑔𝑒))

𝑋−𝜇
𝑧=
𝜎

Where:
𝜇 = 𝑀𝑒𝑎𝑛
𝜎 = 𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 14


DEPARTMENT of
STATISTICS 2. Q-Q plot
Step 6. Plot the z-score and standardized value of X
Insert ➝ scatter (z-score of i, standardized of X) then Add linear trend

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 15


DEPARTMENT of
STATISTICS 2. Q-Q plot
Step 6. Plot the z-score and standardized value of X

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 16


DEPARTMENT of
STATISTICS 2. Q-Q plot
Step 6. Plot the z-score and standardized value of X

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 17


DEPARTMENT of
STATISTICS 2. Q-Q plot
Step 6. Plot the z-score and standardized value of X
Note: Click the line then change the format
2.50 For normally distributed data,
2.00
observations should lie approximately on
a straight line. If the data is non-normal,
1.50
the points form a curve that deviates
1.00
markedly from a straight line. Possible
0.50 outliers are points at the ends of the line,
AXIS TITLE

0.00 distanced from the bulk of the


-3.00 -2.00 -1.00 0.00 1.00 2.00 3.00
-0.50 observations.
-1.00

-1.50 In this case, the data was merely


-2.00 approximately follows a normal
-2.50
distribution
AXIS TITLE

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 18


DEPARTMENT of
STATISTICS 3. Box Plot
• Box plots can’t actually be used to test for
normality, they can be useful for testing for
symmetry, which often is a sufficient
substitute for normality.

• The box plot shape will show if a statistical


data set is normally distributed or skewed.

• If your data comes from a NORMAL


DISTRIBUTION, the box will be symmetrical
with the mean and median in the center.

• If the data meets the assumption of


normality, there should also be few outliers.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 19


DEPARTMENT of
STATISTICS 3. Box Plot

Interquartile Range
(IQR)

whisker whisker
Minimum/ Maximum/
Lower Fence Median Upper Fence
𝑸𝟏 𝑸𝟑
(𝑸𝟏 − 1.5 ∗ IQR) (25th Percentile) (75th Percentile) (𝑸𝟑 + 1.5 ∗ IQR)

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 20


DEPARTMENT of
STATISTICS Skewness
Skewness is a measure of the asymmetry of the probability distribution of a random
variable about its mean.
• If skewness is 0, the data are perfectly symmetrical.
• If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
• If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is
moderately skewed.
• If skewness is less than -1 or greater than 1, the distribution is highly skewed.
In Microsoft Excel: =SKEW(number1, [number2], ...)
Moderately Approximately Moderately
Skewed symmetric Skewed
Highly Highly
Skewed Skewed
−𝟏 − 𝟏ൗ𝟐 𝟎 𝟏ൗ
𝟐 𝟏
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 21
DEPARTMENT of
STATISTICS Skewness

Perfectly Symmetric
or Normal Distribution
Positively skewed distribution Negatively skewed distribution
or Skewed to the right or Skewed to the left

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 22


DEPARTMENT of
STATISTICS Kurtosis

Kurtosis tells you the height and sharpness of the central peak, relative to
that of a standard bell curve.

• A perfect normal distribution has a kurtosis of 0 (mesokurtic distribution)


• Kurtosis value above 0 = Leptokurtic distribution (sharper peak and
longer/flatter tails)
• Kurtosis value below 0 = Platykurtic distribution (rounder peak and
shorter/thinner tails)

In Microsoft Excel: KURT(number1, [number2], ...)

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 23


DEPARTMENT of
STATISTICS Kurtosis

Platykurtic Distribution Normal Distribution or Leptokurtic Distribution


Kurtosis < 0 Mesokurtic Distribution Kurtosis > 0
Kurtosis = 0

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 24


DEPARTMENT of
STATISTICS Shapiro-Wilk W Test
Tests for assessing if data is normally distributed
using Shapiro-Wilk W Test

• developed by Shapiro and Wilk (1965)


• It is the ratio of two estimates of the variance of a normal distribution
based on a random sample of n observations.
• This test for normality has been found to be the most powerful test in
most situations.
• Hypothesis:
Ho: The data is normally distributed.
Ha: The data is not normally distributed.
• Decision Rule: Reject Ho if P-value < 𝛼 (level of significance)

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 25


DEPARTMENT of
STATISTICS Shapiro-Wilk W Test
Step 1. Arrange the data in ascending order.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 26


DEPARTMENT of
STATISTICS Shapiro-Wilk W Test

Step 2. Calculate 𝑆𝑆 as follows:

𝑆𝑆 = σ𝑛𝑖=1 𝑥𝑖 − 𝑥ҧ 2

Note: If 𝑛 is even, let 𝑚 = 𝑛/2, while if 𝑛


is odd let 𝑚 = (𝑛 − 1)/2

In excel use DEVSQ function to calculate


𝑆𝑆
= 𝐷𝐸𝑉𝑆𝑄(𝑑𝑎𝑡𝑎 𝑟𝑎𝑛𝑔𝑒)

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 27


DEPARTMENT of
STATISTICS Shapiro-Wilk W Test
Step 3. Calculate 𝑏 as follows
a. taking the 𝑎𝑖 weights from the
Shapiro-Wilk W Table 1.
b. Corresponding to each of the
6 coefficients a1 ,… , a6 , calculate
the values 𝑥12 − 𝑥1 , 𝑥11 − 𝑥2 ,…,
𝑥7 − 𝑥6 where 𝑥𝑖 is the ith data
element in ascending order.
Shapiro-Wilk W Table 1: Coefficients
c. Get the product of the = 𝑆𝑈𝑀(𝐼5: 𝐼10)
coefficients and difference values.
d. Get the sum of the product of the
coefficients and difference values.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 28


DEPARTMENT of
STATISTICS Shapiro-Wilk W Test
Step 4. Calculate the test
statistic

𝑊 = 𝑏2 /𝑆𝑆

= (𝑏^2)/𝑆𝑆
or
= (𝑏 ∗ 𝑏)/𝑆𝑆

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 29


DEPARTMENT of
STATISTICS Shapiro-Wilk W Test
Step 5. Find the value in the Shapiro-Wilk W Table 2 that is closest to 𝑊,
interpolating if necessary. This is the p-values for the test.

The p-value lies between 0.50 and 0.90. The W value for 0.5 is 0.943 and
the W value for 0.9 is 0.973.
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 30
DEPARTMENT of
STATISTICS Shapiro-Wilk W Test
We need to interpolate since 𝑊 = 0.971 𝑦 lies between 0.50 and 0.90.
The 𝑊 value for 0.5 𝑥1 is 0.943 𝑦1 and the 𝑊 value for 0.9 𝑥2 is 0.973
𝑦2 .
Linear Interpolation:
𝑦2 − 𝑦1
𝑦 − 𝑦1 = 𝑥 − 𝑥1
𝑥2 − 𝑥1
0.973 − 0.943
0.971 − 0.943 = 𝑥 − 0.5
0.9 − 0.5
𝑥 = 0.873 ⇒ 𝑃 − 𝑣𝑎𝑙𝑢𝑒

Therefore, the p-value is equivalent to 0.873

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 31


DEPARTMENT of
STATISTICS

Formal hypothesis testing for Shapiro-Wilk W Test


a. Ho: The data is normally distributed.
Ha: The data is not normally distributed.
b. Test to use: Shapiro-Wilk W Test
c. Decision Rule: P-value Method
Reject Ho if P-value < 𝛼 = 0.05
d. P-value =0.873
e. Decision: Since 0.873 ≮ 0.05, failed to reject Ho
f. Conclusion: We therefore conclude that the data is normally
distributed.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 32


DEPARTMENT of
STATISTICS

What if the data is not normally distributed?


If the data is not normally distributed, there are two options:
• Transform the dependent variable (repeating the normality checks on
the transformed data): Common transformations include taking the
log or square root of the dependent variable.
• Use a non-parametric test: Non-parametric tests are often called
distribution free tests and can be used instead of their parametric
equivalent.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 33


DEPARTMENT of
STATISTICS

Key to non-parametric tests:


Parametric Test What to check for normality Non-parametric test
Independent t-test Dependent variable by group Mann-Whitney test
Paired t-test Paired differences Wilcoxon signed rank test
Residuals/ dependent variable
One-way ANOVA Kruskal-Wallis test
by group
Repeated measures ANOVA Residuals at each time point Friedman test
Pearson’s correlation coefficient Both variables Spearman’s correlation
Simple linear regression Residuals N/A

Note:
• The residuals are the differences between the observed and expected values.
• Excel will not perform non-parametric tests even with the data analysis toolpak add in.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 34


DEPARTMENT of
STATISTICS Hypothesis Testing

One Population
One Sample One Sample t-test
Parameter

Hypothesis Dependent
Paired Sample t-test
Samples
Testing
Independent Sample z-test
Two Population
Parameter (𝑖𝑓 𝜎1 & 𝜎2 𝑎𝑟𝑒 𝑘𝑛𝑜𝑤𝑛)

Independent Independent Sample t-test


Samples (𝑖𝑓 𝜎1 & 𝜎2 𝑎𝑟𝑒 𝑢𝑛𝑘𝑛𝑜𝑤𝑛 𝑎𝑛𝑑 𝜎1 = 𝜎2 )

Independent Sample t-test


(𝑖𝑓 𝜎1 & 𝜎2 𝑎𝑟𝑒 𝑢𝑛𝑘𝑛𝑜𝑤𝑛 𝑎𝑛𝑑 𝜎1 ≠ 𝜎2 )

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 35


DEPARTMENT of
STATISTICS Components of Hypothesis Testing
Test of Hypothesis
Set of rules that lead to the acceptance or rejection of statistical method.
• Hypothesis - standard procedure for testing a claim about a property of a
population based on sample data
• It is a claim or statement about a property of a population.

Null Hypothesis (𝑯𝒐 ) Alternative Hypothesis (𝑯𝒂 𝒐𝒓𝑯𝟏 )


Tail of
distribution hypothesis that differs from the null
must contain the condition of equality
hypothesis
Two-tailed  Equal (=) 𝜇 = 𝜇𝑜  Not equal to (≠) 𝜇 ≠ 𝜇𝑜
Right-tailed  Less than or equal to (≤) 𝜇 ≤ 𝜇𝑜  Greater than (>) 𝜇 > 𝜇𝑜
Left-tailed  Greater than or equal to (≥) 𝜇 ≥ 𝜇𝑜  Less than (<) 𝜇 < 𝜇𝑜

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 36


DEPARTMENT of
STATISTICS Two Types of Error in Hypothesis Testing
• Type I error - mistake of Rejecting 𝐻𝑜 when it is actually true
𝛼 = probability of committing a Type I error = P(Reject 𝐻𝑜 | 𝐻𝑜 is true)

• Type II error - mistake of Failing to Reject 𝐻𝑜 when it is actually


false
𝛽 = probability of committing a Type II error = P(Reject 𝐻𝑜 | 𝐻𝑜 is false)

Result of testing a hypothesis


Our Decision
Actual Situation
Accept Ho Reject Ho
𝐻𝑜 is true Correct Decision (1-α) Type I error (α)
𝐻𝑜 is false Type II error (β) Correct Decision (1-β)
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 37
DEPARTMENT of
STATISTICS Components of Hypothesis Testing
Test statistic
– sample statistic or a value based on the sample data.
– used in making decision about the rejection of the null hypothesis

Critical region/Rejection region


– set of all values of the test statistic that would cause to reject 𝐻0

Critical Value
– value/s that separate the critical region from the values that would not
lead to rejection of 𝐻0

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 38


DEPARTMENT of
STATISTICS Components of Hypothesis Testing

Level of significance 𝜶
• probability of making the mistake of rejecting the null hypothesis when it
is true

Commonly used degree of confidence and level of significance

Degree of Confidence (1 − 𝛼) Level of significance (𝛼)


90% 0.10
95% 0.05
99% 0.01

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 39


DEPARTMENT of
STATISTICS Components of Hypothesis Testing
Tails in a distribution
• Examine the alternative hypothesis
• Extreme regions bounded by the critical values

Left-tailed test
• The CR is in the extreme left region under the curve.

Right-tailed test:
• The CR is in the extreme right region under the curve.

Two-tailed test:
• The CR is in the two extreme regions under the curve.
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 40
DEPARTMENT of
STATISTICS

1. Left-tailed test
• The CR is in the extreme left region under the curve.
Ha: 𝜇 < 𝜇𝑜

Acceptance Region

Rejection Region
It means that if the test
statistic fall in this region,
It means that if the test we failed to reject Ho
statistic fall in this region,
we reject Ho

Critical Value

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 41


DEPARTMENT of
STATISTICS

2. Right-tailed test:
• The CR is in the extreme right region under the curve.
Ha: 𝜇 > 𝜇𝑜

Acceptance Region

Rejection Region
It means that if the test
statistic fall in this region,
we failed to reject Ho It means that if the test
statistic fall in this region,
we reject Ho

Critical Value

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 42


DEPARTMENT of
STATISTICS

3. Two-tailed test:
• The CR is in the two extreme regions under the curve.
Ha: 𝜇 ≠ 𝜇𝑜

Rejection Region
Acceptance Region Rejection Region

It means that if the test


statistic fall in this region, It means that if the test It means that if the test
we reject Ho statistic fall in this region, statistic fall in this region,
we failed to reject Ho we reject Ho

Critical Region Critical Value

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 43


DEPARTMENT of
STATISTICS

Guidelines in Hypothesis Testing:

a. State the null (Ho) and alternative (Ha) hypothesis. Identify the claim.
b. Determine the test statistic, critical value and tail of the distribution where
the rejection region is located.
c. Formulate the decision rule.
d. Compute the value of the test statistic.
e. Make a decision. ( whether to reject or to accept Ho)
f. Draw the conclusion by answering the original claim.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 44


DEPARTMENT of
STATISTICS

Steps in testing a hypothesis: P-value Method


Identify the
State the test to use,
Make a
null and the tailed Formulate
decision
alternative of the the Determine Interpret
(whether
hypothesis. distribution decision the P-value the result
to reject or
Identify the and the rule
not the Ho)
claim level of
significance

Decision Rule
• If the p-value is smaller than the significance level, Ho is rejected.
• If it is larger than the significance level, Ho is not rejected (or failed to
reject Ho
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 45
DEPARTMENT of
STATISTICS

P-value
• The probability of observing a sample value as extreme as, or more
extreme than, the value observed, given that the null hypothesis is true.

INTERPRETING THE WEIGHT OF EVIDENCE AGAINST Ho


If the p-value is less than
(a) 0.10, we have some evidence that Ho is not true.
(b) 0.05, we have strong evidence that Ho is not true.
(c) 0.01, we have very strong evidence that Ho is not true.
(d) 0.001, we have extremely evidence that Ho is not true.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 46


DEPARTMENT of
STATISTICS

It is important to summarize the results of a statistical study correctly.

The following table will help you when you summarize the results.
Decision Claim is Ho Claim is Ha
Reject Ho There is enough There is enough
evidence to reject the evidence to support the
claim. claim.
Failed to reject Ho There is no enough There is no enough
evidence to reject the evidence to support the
claim. claim.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 47


DEPARTMENT of
STATISTICS

Symbol Definition Example

• greater than • The mean weight is greater than 50kg


> 𝜇 > 50kg
• more than • X is more than 50kg

< • less than • The mean weight is less than 50kg 𝜇 < 50kg
• The mean weight is greater than or
• greater than or equal to
≥ equal to 50kg 𝜇 ≥ 50kg
• at least
• The mean weight is at least 50kg
• The mean weight is less than or equal
• less than or equal to
≤ to 50kg 𝜇 ≤ 50kg
• at most
• The mean weight is at most 50kg
= • equal • The mean weight is equal to 50kg 𝜇 = 50kg
≠ • not equal • The mean weight is not equal to 50kg 𝜇 ≠ 50kg

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 48


DEPARTMENT of
STATISTICS

1. t-Test for One Sample Mean


• Used to compare the mean value of a sample with a constant value (𝜇0 ).

Example 1:
Test the hypothesis that the average sugar content of particular cookies is
10g, if the contents of a random sample of 10 cookies are:
10.3 9.7 10.1 10.3 10.1 9.8 9.9 10.5 10.3 9.9

Use 0.01 level of significance and assume that the distribution of contents is
normal. Assume the data are normally distributed.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 49


DEPARTMENT of
STATISTICS

Step in performing t-Test for One


Sample Mean

Step 1: Create a dummy second


variable and input at least 2 zeroes
in it.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 50


DEPARTMENT of
STATISTICS

Step 2: Click Data → Data Analysis → t-Test Two-Sample Assuming


Unequal Variances → OK

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 51


DEPARTMENT of
STATISTICS

Step 3:
• For variable 1 range,
select the actual
data.
• For variable 2 range,
select the dummy
data.
• For the hypothesized
mean difference,
input the test value
from the hypothesis.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 52


DEPARTMENT of
STATISTICS

Step 4: For the output,


• delete the dummy column,
• change the title of the
table
• delete the word mean in
the hypothesized mean
difference row.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 53


DEPARTMENT of
STATISTICS

Solution: P-value Method


Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim
• Ho : 𝜇 = 10g ⟹ The average sugar content of particular cookies is 10g (claim)
• Ha : 𝜇 ≠ 10g ⟹ The average sugar content of particular cookies is not 10g)
Step 2: Identify the test to use, the tailed of the distribution and the level of
significance
• t-Test for One Sample Mean
• Since the alternative hypothesis is 𝜇 ≠ 10g, therefore the tail of the
distribution is two-tailed test.
• Level of significance: 𝛼 = 0.01
Step 3: Formulate the decision rule
• Reject Ho if P-value < 𝛼 = 0.01
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 54
DEPARTMENT of
STATISTICS

Step 4: Determine the P-value


• Since the tail of the distribution is
two-tailed, therefore the P-value
is 0.30.

Step 5: Make a decision


(whether to reject or not the Ho)
• Since P-value is not less than 𝛼 =
0.01 , failed to reject Ho

Step 6: Interpret the results


• At 1% level of significance, there is no enough evidence to reject the claim
that the average sugar content of particular cookies is 10g.
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 55
DEPARTMENT of
STATISTICS

The following table will help you when you summarize the results.
Decision Claim is Ho Claim is Ha
Reject Ho There is enough There is enough
evidence to reject the evidence to support the
claim. claim.
Failed to reject Ho There is no enough There is no enough
evidence to reject the evidence to support the
claim. claim.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 56


DEPARTMENT of
STATISTICS

Example 2:
Communication company conducted a study about the number of text
messages sent by a teenager in a day. They claim that on the average,
teenager sent at least 65 text messages per day. To update the estimates, a
sample of 11 teenagers ask how many text messages they sent a day. Their
response were:
51 175 47 49 44 54 145 203 21 42 100

At 0.05 level of significance, can you conclude that the mean number of text
messages sent by teenagers per day is 65. Assume the data are normally
distributed.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 57


DEPARTMENT of
STATISTICS

Solution: P-value Method


Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim
• Ho : 𝜇 ≥ 65 ⟹ the average text messages per day is at least 65 (claim)
• Ha : 𝜇 < 65 ⟹ the average text messages per day is less than 65
Step 2: Identify the test to use, the tailed of the distribution and the level of
significance
• t-Test for One Sample Mean
• Since the alternative hypothesis is 𝜇 < 65 text messages per day,
therefore the tail of the distribution is one tailed to the right test.
• Level of significance: 𝛼 = 0.05
Step 3: Formulate the decision rule
• Reject Ho if P-value < 𝛼 = 0.05
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 58
DEPARTMENT of
STATISTICS

Step 4: Determine the P-value


• Since the tail of the distribution is
one-tailed, therefore the P-value
is 0.16.

Step 5: Make a decision


(whether to reject or not the Ho)
• Since P-value is not less than 𝛼 =
0.05 , failed to reject Ho

Step 6: Interpret the results


• At 5% level of significance, there is no enough evidence to reject the claim
that the average text messages per day is at least 65.
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 59
DEPARTMENT of
STATISTICS

The following table will help you when you summarize the results.
Decision Claim is Ho Claim is Ha
Reject Ho There is enough There is enough
evidence to reject the evidence to support the
claim. claim.
Failed to reject Ho There is no enough There is no enough
evidence to reject the evidence to support the
claim. claim.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 60


DEPARTMENT of
STATISTICS

Hypothesis Testing for Two Populations

• It is used when researchers wish to compare two sample means, using


experimental and control groups.
• Main interest is on the difference between the two populations.
• For example, the average lifetimes of two different brands of bus tires
might be compared to see whether there is any difference in tread wear.
Two different brands of fertilizer might be tested to see whether one is
better than the other for growing plants. Or two brands of cough syrup
might be tested to see whether one brand is more effective than the
other.
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 61
DEPARTMENT of
STATISTICS

2. z-Test for Two Sample Means


Main interest is on the difference between 𝜇1 and 𝜇2
Assumptions:
1. Both samples are random samples.
2. The samples must be independent of each other. That is, there can be no
relationship between the subjects in each sample.
3. The standard deviations of both populations must be known; and if the
sample sizes are less than 30, the populations must be normally or
approximately normally distributed.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 62


DEPARTMENT of
STATISTICS

Males Females
Example 1: 6 11 11 6 8 11
A random sample of the number of
6 14 8 7 5 13
sports offered by colleges for males
and females is shown. At 𝛼 = 0.10, 6 9 5 6 5 5
is there enough evidence to 6 9 18 10 7 6
support the claim that there is 15 6 11 16 10 7
significant difference in the number 9 9 5 7 5 5
of sports offered by colleges for 8 9 6 9 18 13
males and females?
9 5 11 7 8 5
Assume 𝜎1 = 𝜎2 = 3.3.
7 7 5 11 4 6
10 7 10 14 12 5

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 63


DEPARTMENT of
STATISTICS

Step in performing z-Test for Two Sample Means


1. Click the Data tab on the ribbon, then look for Data Analysis
2. Choose z-Test: Two Sample for Means, then click OK

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 64


DEPARTMENT of
STATISTICS

3. Input the range that


contains the variables 1
and 2.
4. Input the known
variance for variables 1
and 2.
5. Choose output
options, then click ok

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 65


DEPARTMENT of
STATISTICS

6. the output will be

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 66


DEPARTMENT of
STATISTICS

Solution: P-value Method


Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim
• Ho : 𝜇1 = 𝜇2 ⟹ There is no significant difference in the number of sports offered by colleges for males and female.
• Ha : 𝜇1 ≠ 𝜇2 ⟹ There is significant difference in the number of sports offered by colleges for males and female. (claim)
Step 2: Identify the test to use, the tailed of the distribution and the level of
significance
• z-Test for Two Sample Means
• Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the
distribution is two-tailed test.
• Level of significance: 𝛼 = 0.10
Step 3: Formulate the decision rule
• Reject Ho if P-value < 𝛼 = 0.10
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 67
DEPARTMENT of
STATISTICS

Step 4: Determine the P-value


• Since the tail of the distribution is
two-tailed, therefore the P-value is
0.619.
Step 5: Make a decision (whether to
reject or not the Ho)
• Since P-value is not less than 𝛼 =
0.10 , failed to reject Ho
Step 6: Interpret the results
• At 10% level of significance, there is no enough evidence to support the
claim that there is significant difference in the number of sports offered by
colleges for males and females.
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 68
DEPARTMENT of
STATISTICS

The following table will help you when you summarize the results.
Decision Claim is Ho Claim is Ha
Reject Ho There is enough There is enough
evidence to reject the evidence to support the
claim. claim.
Failed to reject Ho There is no enough There is no enough
evidence to reject the evidence to support the
claim. claim.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 69


DEPARTMENT of
STATISTICS

Example 2: Resident Commuter


The dean of students wants to see 22 19 23 23 32 18 26 18 36 35
whether there is a significant 25 19 24 21 18 23 23 22 23 24
difference in ages of resident 18 21 18 28 20 26 21 22 25 20
students and commuting students.
19 23 29 26 18 19 25 19 25 32
She selects a random sample of 50
students from each group. The ages 26 27 21 19 24 29 19 19 21 20
are shown here. At 𝛼 = 0.05, 22 26 21 22 19 20 23 19 18 19
determine if there is enough 19 26 26 18 23 20 22 22 27 18
evidence to reject the claim of no 25 18 27 22 22 20 35 28 19 17
difference in the ages of the two
20 19 18 26 18 18 21 22 35 17
groups. Assume 𝜎1 = 3.68 and
𝜎2 = 4.7. 30 19 19 18 20 30 18 19 24 16

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 70


DEPARTMENT of
STATISTICS

Solution: P-value Method


Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim
• Ho : 𝜇1 = 𝜇2 ⟹ There is no significant difference in ages of resident students and commuting students. (claim)
• Ha : 𝜇1 ≠ 𝜇2 ⟹ There is no significant difference in ages of resident students and commuting students.
Step 2: Identify the test to use, the tailed of the distribution and the level of
significance
• z-Test for Two Sample Means
• Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the
distribution is two-tailed test.
• Level of significance: 𝛼 = 0.05
Step 3: Formulate the decision rule
• Reject Ho if P-value < 𝛼 = 0.05
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 71
DEPARTMENT of
STATISTICS

Step 4: Determine the


P-value
• Since the tail of the
distribution is two-
tailed, therefore the
P-value is 0.130.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 72


DEPARTMENT of
STATISTICS

Step 5: Make a decision (whether to reject or not the Ho)


• Since P-value is not less than 𝛼 = 0.05 , failed to reject Ho

Step 6: Interpret the results


• At 5% level of significance, there is no enough evidence to reject the
claim that the there is no significant difference in ages of resident
students and commuting students

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 73


DEPARTMENT of
STATISTICS

3. t-Test for Two Independent Sample Means


• Samples are independent if the sample selected from one population is
not related to the sample selected from the other population.

Example 1:
A teacher wants to know which of the two sections has a higher score. The
teacher will use these scores to make recommendations to the principal.
Random samples of students are asked about their scores. Test the claim that
the population mean scores are different for the two sections? Assume the
data are normally distributed and equal variances. Use =0.05.
Section A 81 77 75 74 86 90 62 73 91 98
Section B 89 64 35 68 69 55 37 57 42 49

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 74


DEPARTMENT of
STATISTICS

Step in performing t-Test for Two Independent Sample Means


1. Click the Data tab on the ribbon, then look for Data Analysis
2. Choose t-Test: Two Sample…, then click OK

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 75


DEPARTMENT of
STATISTICS

3. Input the range that contains the variables 1 and 2.


4. Choose output options, then click ok

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 76


DEPARTMENT of
STATISTICS

5. the output
will be

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 77


DEPARTMENT of
STATISTICS

Solution: P-value Method


Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim
• Ho : 𝜇1 = 𝜇2 ⟹ The population mean scores are not different for the two sections
• Ha : 𝜇1 ≠ 𝜇2 ⟹ The population mean scores are different for the two sections (claim)
Step 2: Identify the test to use, the tailed of the distribution and the level of
significance
• t-Test for Two Independent Sample Means (assume equal variances)
• Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the distribution
is two-tailed test.
• Level of significance: 𝛼 = 0.05
Step 3: Formulate the decision rule
• Reject Ho if P-value < 𝛼 = 0.05
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 78
DEPARTMENT of
STATISTICS

Step 4: Determine the P-value


• Since the tail of the distribution
is two-tailed, therefore the P-
value is 0.001.
Step 5: Make a decision (whether
to reject or not the Ho)
• Since P-value is less than 𝛼 =
0.05 , reject Ho

Step 6: Interpret the results


• At 5% level of significance, there is enough evidence to support the claim
that the population mean scores are different for the two sections.
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 79
DEPARTMENT of
STATISTICS

The following table will help you when you summarize the results.
Decision Claim is Ho Claim is Ha
Reject Ho There is enough There is enough
evidence to reject the evidence to support the
claim. claim.
Failed to reject Ho There is no enough There is no enough
evidence to reject the evidence to support the
claim. claim.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 80


DEPARTMENT of
STATISTICS

Example 2:
The production manager at Bellevue Steel, a manufacturer of wheelchairs,
wants to compare the number of defective wheelchairs produced on the day
shift with the number on the afternoon shift. A sample of the production
from 6 day shifts and 8 afternoon shifts revealed the following number of
defects. At the 0.05 significance level, test the claim that there is a difference
in the mean number of defects per shift. Assume the data are normally
distributed and unequal variances.

Day 5 8 7 6 9 7
Afternoon 8 10 7 11 9 12 14 9

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 81


DEPARTMENT of
STATISTICS

Solution: P-value Method


Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim
• Ho : 𝜇1 = 𝜇2 ⟹ there is no difference in the mean number of defects per shift
• Ha : 𝜇1 ≠ 𝜇2 ⟹ there is a difference in the mean number of defects per shift (claim)
Step 2: Identify the test to use, the tailed of the distribution and the level of
significance
• t-Test for Two Independent Sample Means (assume unequal variances)
• Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the
distribution is two-tailed test.
• Level of significance: 𝛼 = 0.05
Step 3: Formulate the decision rule
• Reject Ho if P-value < 𝛼 = 0.05
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 82
DEPARTMENT of
STATISTICS

Note: Assume unequal variances, so use t-Test: Two-Sample Assuming Unequal Variances

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 83


DEPARTMENT of
STATISTICS

Step 4: Determine the P-value


• Since the tail of the distribution is two-tailed, therefore the P-value
is 0.010.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 84


DEPARTMENT of
STATISTICS

Step 5: Make a decision (whether to reject or not the Ho)


• Since P-value is less than 𝛼 = 0.05 , reject Ho

Step 6: Interpret the results


• At 5% level of significance, there is enough evidence to support the claim
that there is a difference in the mean number of defects per shift.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 85


DEPARTMENT of
STATISTICS

The following table will help you when you summarize the results.
Decision Claim is Ho Claim is Ha
Reject Ho There is enough There is enough
evidence to reject the evidence to support the
claim. claim.
Failed to reject Ho There is no enough There is no enough
evidence to reject the evidence to support the
claim. claim.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 86


DEPARTMENT of
STATISTICS

4. t-Test for Two Dependent Sample Means


• Dependent means one sample is related to the other sample.
• such samples are referred to as paired sample or matched sample
• get one value from each of two subjects sharing the same characteristics
Example:
The following are the scores of eight students before and after a review. At 5%
level of significance, test the claim that there is significant difference in the
scores before and after the review. Assume the data are normally distributed.
Student 1 2 3 4 5 6 7 8
Before 77 74 82 73 87 68 66 80
After 72 68 76 68 84 68 61 76
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 87
DEPARTMENT of
STATISTICS

Step in performing t-Test for Two Dependent Sample Means


1. Click the Data tab on the ribbon, then look for Data Analysis
2. Choose t-Test: Paired Two Sample for Means, then click OK

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 88


DEPARTMENT of
STATISTICS

3. Input the range that contains the variables 1 and 2.


4. Choose output options, then click ok

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 89


DEPARTMENT of
STATISTICS

5. the output will be

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 90


DEPARTMENT of
STATISTICS

Solution: P-value Method


Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim
• Ho : 𝜇1 = 𝜇2 ⟹ There is no significant difference in the scores before and after the review
• Ha : 𝜇1 ≠ 𝜇2 ⟹ There is a significant difference in the scores before and after the review (claim)
Step 2: Identify the test to use, the tailed of the distribution and the level of
significance
• t-Test for Two Dependent Sample Means
• Since the alternative hypothesis is 𝜇1 ≠ 𝜇2 , therefore the tail of the
distribution is two-tailed test.
• Level of significance: 𝛼 = 0.05
Step 3: Formulate the decision rule
• Reject Ho if P-value < 𝛼 = 0.05
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 91
DEPARTMENT of
STATISTICS

Step 4: Determine the P-value


• Since the tail of the distribution is
two-tailed, therefore the P-value is
0.0005.
Step 5: Make a decision (whether to
reject or not the Ho)
• Since P-value is less than 𝛼 = 0.05,
reject Ho
Step 6: Interpret the results
• At 5% level of significance, there is enough evidence to support the claim
that there is significant difference in the scores before and after the review.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 92


DEPARTMENT of
STATISTICS

The following table will help you when you summarize the results.
Decision Claim is Ho Claim is Ha
Reject Ho There is enough There is enough
evidence to reject the evidence to support the
claim. claim.
Failed to reject Ho There is no enough There is no enough
evidence to reject the evidence to support the
claim. claim.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 93


DEPARTMENT of
STATISTICS

Example 2:
Advertisements by Fitness Center claim that completing its course will result
in losing weight. A random sample of eight recent participants showed the
following weights before and after completing the course. At 0.01 level of
significance, can we conclude the students lost weight? Assume the data are
normally distributed.

Student 1 2 3 4 5 6 7 8
Before 155 228 141 162 211 164 184 172
After 154 201 147 157 196 150 170 165

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 94


DEPARTMENT of
STATISTICS

Solution: P-value Method


Step 1: State the null (Ho) and alternative (Ha) hypothesis. Identify the claim
• Ho : 𝜇1 = 𝜇2 ⟹ The students doesn’t lost weight
• Ha : 𝜇1 > 𝜇2 ⟹The students lost weight (claim)
Step 2: Identify the test to use, the tailed of the distribution and the level of
significance
• t-Test for Two Dependent Sample Means
• Since the alternative hypothesis is 𝜇1 > 𝜇2 , therefore the tail of the
distribution is one tailed to the right test.
• Level of significance: 𝛼 = 0.01
Step 3: Formulate the decision rule
• Reject Ho if P-value < 𝛼 = 0.01
GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 95
DEPARTMENT of
STATISTICS

Step 4: Determine the P-value


• Since the tail of the distribution
is two-tailed, therefore the P-
value is 0.016.
Step 5: Make a decision (whether to
reject or not the Ho)
• Since P-value is not less than
𝛼 = 0.01 , failed to reject Ho
Step 6: Interpret the results
• At 1% level of significance, there is no enough evidence to support the
claim that the students lost weight.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 96


DEPARTMENT of
STATISTICS

The following table will help you when you summarize the results.
Decision Claim is Ho Claim is Ha
Reject Ho There is enough There is enough
evidence to reject the evidence to support the
claim. claim.
Failed to reject Ho There is no enough There is no enough
evidence to reject the evidence to support the
claim. claim.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 97


DEPARTMENT of
STATISTICS
5. F-test for Two-Sample Variance
• It is used to test whether the variances of two populations are equal.
Step in performing F-test for Two-Sample Variance
1. Click the Data tab on the ribbon, then look for Data Analysis
2. Choose F-Test: Two Sample for Variance, then click OK

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 98


DEPARTMENT of
STATISTICS

3. Input the range that contains the variables 1 and 2.


4. Choose output options, then click ok
Note: The variance of variable 1
must be larger than the variance
of variable 2.

In this case, section B


have larger variance
that section A, so the
data of section B must
be put in the “Variable
1 Range” and section A
in “Variable 2 Range” .

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 99


DEPARTMENT of
STATISTICS

5. the output will be

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 100


DEPARTMENT of
STATISTICS

Solution: F-test
Step 1: Ho: 𝜎12 = 𝜎22 (Variances are equal)
Ha: 𝜎12 ≠ 𝜎22 (Variances are not equal)
Step 2: F-test, 𝛼=0.05
Step 3: Reject Ho F > F Critical one-tail
Reject Ho if P-value < 𝛼 = 0.05 (P-value Method)
Step 4: F = 2.46 and F Critical = 3.17
P-value = 0.098
Step 5: Since F is not greater than F Critical, failed to reject Ho
Since P-value is not less that 𝛼 = 0.05, failed to reject Ho
Step 6: At 5% level of significance, we can conclude that the variances are equal.

GGMFRANCISCO Basic Statistical Data Analysis Using Microsoft Excel | 101


DEPARTMENT of
STATISTICS Correlation Analysis

• Correlation is one of the most common and most useful statistics.


• Linear correlation analysis is a statistical technique to determine the
strength or degree of linear relationship existing between two variables.
• A measure of the degree of linear relationship is called population
coefficient ρ

Assumptions:
• The sample of paired (X,Y) data is a random sample.
• The data pairs fall approximately on a straight line and are measured at the
interval or ratio level.
• The pairs of (X,Y) data have a bivariate normal distribution.

Basic Statistical Data Analysis Using Microsoft Excel | 102


DEPARTMENT of
STATISTICS Scatter plot

Scatter Diagram / Scatter plot - is a type of mathematical diagram


using Cartesian coordinates to display values for two variables for a set
of data

Positive linear Negative linear Curvilinear No


Relationship Relationship relationship relationship

Basic Statistical Data Analysis Using Microsoft Excel | 103


DEPARTMENT of
STATISTICS Example
Suppose data below were collected from a sample of 10 Pizza Parlor restaurants
located near college campuses. The size of student population (in thousands) and the
quarterly sales (in thousands of dollars) for the 10 restaurants are summarized in the
table below.
Restaurant
1 2 3 4 5 6 7 8 9 10
(𝒊)
Student Population
(1000s) 2 6 8 8 12 16 19 20 22 26
(𝒙𝒊 )
Quarterly Sales
($1000s) 58 105 88 118 117 137 157 169 148 202
(𝒚𝒊 )

Basic Statistical Data Analysis Using Microsoft Excel | 104


DEPARTMENT of
STATISTICS Example: Scatter plot in Excel

Step 1: Highlight the cells that


contain the data you want to
use in the chart

Basic Statistical Data Analysis Using Microsoft Excel | 105


DEPARTMENT of
STATISTICS

Step 2: Click the insert tab on the ribbon

Step 3: Select the X, Y (scatter) you want to create

Basic Statistical Data Analysis Using Microsoft Excel | 106


DEPARTMENT of
STATISTICS

Step 4: You'll see many options when you select this button, select Scatter

Basic Statistical Data Analysis Using Microsoft Excel | 107


DEPARTMENT of
STATISTICS

Step 5: The chart will appear. Customize bar chart through ”chart design”,
“format”, “Quick Layout”.

Basic Statistical Data Analysis Using Microsoft Excel | 108


DEPARTMENT of
STATISTICS

The final output will be


250

200
QUARTERLY SALES ($1000S)

150

100

50

0
0 5 10 15 20 25 30
STUDENT POPULATION (1000S)

Basic Statistical Data Analysis Using Microsoft Excel | 109


DEPARTMENT of
STATISTICS

Types of Correlation

1. Positive linear correlation – if the independent variable increases, the


dependent variable also increases; or if the independent variable decreases, the
dependent variable also decreases.
2. Negative linear correlation – if the independent variable increases, the
dependent variable decreases; or if the independent variable decreases, the
independent variable increases.
3. Zero linear correlation – no linear relationship exists between the dependent
and independent variables. If the relationship is not linear, then it can be
exponential or logarithmic. Note that this does not imply that the independent
variable is not related with the independent variable.
Basic Statistical Data Analysis Using Microsoft Excel | 110
DEPARTMENT of
STATISTICS a

Linear Correlation Coefficient r


• Measures the strength of the linear relationship between the paired X and Y
values in a sample.
• It is sometimes referred to as Pearson Product Moment Correlation in
honor of Karl Pearson (1857-1936), who originally develop it.

𝑛 σ 𝑋𝑖 𝑌𝑖 − (σ 𝑋𝑖 )(σ 𝑌𝑖 )
𝑟=
2 2 2 2
σ σ σ σ
𝑛 𝑋𝑖 − ( 𝑋𝑖 ) 𝑛 𝑌𝑖 − ( 𝑌𝑖 )

Basic Statistical Data Analysis Using Microsoft Excel | 111


DEPARTMENT of
STATISTICS

Interpretations of the Linear


𝐶𝑃𝑋𝑌 Correlation Coefficient r
𝑟=
𝑆𝑆𝑋 𝑆𝑆𝑌
Value of r Interpretation
where
σ 𝑋𝑖 σ 𝑌𝑖 0.01 to 0.20 Very weak linear relationship
▪ 𝐶𝑃𝑋𝑌 = σ 𝑋𝑖 𝑌𝑖 − 𝑛
2
0.21 to 0.40 Weak linear relationship
2 σ 𝑋𝑖
▪ 𝑆𝑆𝑥 = σ 𝑋𝑖 - 𝑛 0.41 to 0.70 Moderate linear relationship
2 σ 𝑌𝑖 2
▪ 𝑆𝑆𝑦 = σ 𝑌𝑖 - 0.71 to 0.90 Strong linear relationship
𝑛

0.91 to 0.99 Very strong linear relationship

Basic Statistical Data Analysis Using Microsoft Excel | 112


DEPARTMENT of
STATISTICS

Number of Final grade


Example 1: Student
absences (X) (Y)
The data below shows the
1 6 82
number of absences and the
final grades of eight randomly 2 2 86
selected students from a 3 15 45
statistics class. Determine 4 9 74
whether there is significant 5 11 59
linear relationship between
6 5 90
number of absences and final
grade. 7 8 78
8 4 86

Basic Statistical Data Analysis Using Microsoft Excel | 113


DEPARTMENT of
STATISTICS

Step in performing Correlation:


1. Click the Data tab on the ribbon, then look for Data Analysis
2. Choose Correlation, then click OK

Basic Statistical Data Analysis Using Microsoft Excel | 114


DEPARTMENT of
STATISTICS

3. Input the range that contains the data.


4. Choose output options, then click ok

Basic Statistical Data Analysis Using Microsoft Excel | 115


DEPARTMENT of
STATISTICS

5. the output will be

𝑟 = −0.94
It means that the number of absences
and final grade have negative very
strong linear relationship
Basic Statistical Data Analysis Using Microsoft Excel | 116
DEPARTMENT of
STATISTICS

Properties of the Linear Correlation Coefficient r

1. The value of r is always -1 to 1. That is, -1 ≤ r ≤ 1.


2. A value near 0 indicates there is little linear relationship between variables
3. The value of r does not change if all values of either variable are converted
to a different scale.
4. The value of r is not affected by the choice of x or y. Interchange all x and y
values and the value of r will not change.
5. r measures the strength of a linear relationship. It is not designated to
measure the strength of a relationship that is not linear.
6. The value of r is sensitive to outliers.

Basic Statistical Data Analysis Using Microsoft Excel | 117


DEPARTMENT of
STATISTICS

Formal Hypothesis Test

➢The null and alternative hypothesis are:


Ho: 𝜌 = 0
(There is no significant linear correlation between variables X and Y.)
Ha: 𝜌 ≠ 0
(There is significant linear correlation between variables X and Y.)

➢The decision rule:


Reject Ho if P-value < 𝛼 (level of significance)

Basic Statistical Data Analysis Using Microsoft Excel | 118


DEPARTMENT of
STATISTICS

How to perform a Pearson correlation test in Excel


Step 1: Calculate the Pearson correlation coefficient in Excel
r = PEARSON(array1, array2)
Step 2: Calculate the t-statistic from the coefficient value
t = (r*SQRT(n-2))/(SQRT(1-r^2))
where: r is the Pearson correlation coefficient
n is the number of observation
Note: if your coefficient value is negative, then use the following
formula
t = (ABS(r)*SQRT(n-2))/(SQRT(1-ABS(r)^2))

Basic Statistical Data Analysis Using Microsoft Excel | 119


DEPARTMENT of
STATISTICS

Step 3: Calculate the p-value from the t statistic


P-value = TDIST(t, df, tails)

where:
t is the t-statistic from the coefficient value
df is the degrees of freedom which is equivalent to n-2
tails is the tailed of the distribution (‘1‘ for a one-tailed analysis or a
‘2‘ for a two-tailed analysis

Basic Statistical Data Analysis Using Microsoft Excel | 120


DEPARTMENT of
STATISTICS

Example 1 in Excel:

=PEARSON(B2:B8, C2:C8)

=(ABS(F2)*SQRT(F3-2))
/(SQRT(1-ABS(F2)^2))

=TDIST(F4,F5,F6)

Basic Statistical Data Analysis Using Microsoft Excel | 121


DEPARTMENT of
STATISTICS

Example 1: Formal Hypothesis Test of Correlation Analysis


1.) Ho: 𝜌 = 0 (There is no significant linear correlation between number of absences and final grade)
Ha: 𝜌 ≠ 0 (There is significant linear correlation between number of absences and final grade)
2.) Correlation analysis
3.) Reject Ho if P-value < 𝛼 = 0.05
4.) P-value = 0.0014
5.) Since P-value is less than 𝛼 = 0.05 , reject Ho
6.) At 5% level of significance, we can conclude that there is significant linear
correlation between number of absences and final grade.

Basic Statistical Data Analysis Using Microsoft Excel | 122


DEPARTMENT of
STATISTICS

Y-Height (cm) X-Span (cm)


Example 2:
171 173
The height and arm span of 10 195 193
adult males were measures (in 180 188
cm). 182 185
190 186
175 178
177 182
178 182
192 198
202 202

Basic Statistical Data Analysis Using Microsoft Excel | 123


DEPARTMENT of
STATISTICS

=PEARSON(A2:A11,B2:B11)

=(ABS(E2)*SQRT(E3-2))/(SQRT(1-E2^2))

=TDIST(E4,E5,E6)

Basic Statistical Data Analysis Using Microsoft Excel | 124


DEPARTMENT of
STATISTICS

Example 2: Formal Hypothesis Test of Correlation Analysis


1.) Ho: 𝜌 = 0 (There is no significant linear correlation between number of absences and final grade)
Ha: 𝜌 ≠ 0 (There is significant linear correlation between number of absences and final grade)
2.) Correlation analysis
3.) Reject Ho if P-value < 𝛼 = 0.05
4.) P-value = 0.000089
5.) Since P-value is less than 𝛼 = 0.05 , reject Ho
6.) At 5% level of significance, we can conclude that there is significant linear
correlation between height and arm span.

Basic Statistical Data Analysis Using Microsoft Excel | 125


DEPARTMENT of
STATISTICS

Other Correlations
• Pearson Product Moment Correlation - appropriate when both
variables are measured at an interval and ratio level.
• Spearman Rank Order Correlation (rho) – appropriate when the two
variables are in ordinal level.
• Point-Biserial Correlation – appropriate when one measure is a
continuous interval level and the other is dichotomous (i.e., two-
category).

Basic Statistical Data Analysis Using Microsoft Excel | 126


DEPARTMENT of
STATISTICS Simple Linear Regression Analysis

• Used to predict the value of a dependent variable based on the value of at


least one independent variable
• Used to explain the impact of changes in an independent variable on the
dependent variable.

Components of Regression Analysis:


1. Dependent/ Response variable (Y)
– the variable we wish to explain
2. Independent/ Predictor / Explanatory variable (X’s)
– the variable used to explain the dependent variable

Basic Statistical Data Analysis Using Microsoft Excel | 127


DEPARTMENT of
STATISTICS

Simple Linear Regression Model:


Population
Slope Independent
Population y Coefficient Variable
intercept
Dependent
Variable Random Error term,
𝒚 = 𝜷𝒐 + 𝜷𝟏 𝑿 + 𝜺 or residual

Linear Regression Assumptions


Error values (𝜀) are statistically independent
Error values are normally distributed for any given values of x
The probability distribution of the errors is normal
The probability distribution of the errors has constant variance
The underlying relationship between x variable and the y variable is linear
Basic Statistical Data Analysis Using Microsoft Excel | 128
DEPARTMENT of
STATISTICS Regression

Estimated Regression Model by Least Squares Method


Estimate of the Estimate of the Formulas for 𝒃𝟏 and 𝒃𝟎 :
Estimated (or regression regression
Independent σ 𝑥−𝑥ҧ 𝑦−𝑦ത 𝐶𝑃𝑥𝑦
predicted) intercept slope
variable
𝑏1 = σ 𝑥−𝑥ҧ 2
or 𝑏1 = 𝑆𝑆
y value 𝑥

𝑏0 = 𝑦ത − 𝑏1 𝑥ҧ
𝒚ෝ𝒊 = 𝒃𝒐 + 𝒃𝟏 𝑿
Interpretation of the slope and the intercept:
▪ 𝑏0 is the estimated average value of y when the value of x is zero
▪ 𝑏1 is the estimated change (increase or decrease) in the average
value of y as a result of a one-unit increase in x
Basic Statistical Data Analysis Using Microsoft Excel | 129
DEPARTMENT of
STATISTICS

Evaluating a regression equation’s ability to predict


𝐶𝑃𝑋𝑌
Coefficient of Determination 𝑅2 = 𝑏1 𝑆𝑆
𝑦
• The proportion of the total variation in the dependent variable Y that is
explained, or accounted for, by the variation in the independent variable X.
• The coefficient of determination is also called R-squared and is denoted as
𝑅2 .
• It is the squared of the correlation coefficient.
• If it were possible to make perfect predictions, the coefficient of
determination would be 100%. A coefficient of determination of 100% is
associated with a correlation coefficient of +1.0 or −1.0.

Basic Statistical Data Analysis Using Microsoft Excel | 130


DEPARTMENT of
STATISTICS

Step in performing Regression:


1. Click the Data tab on the ribbon, then look for Data Analysis
2. Choose Regression, then click OK

Basic Statistical Data Analysis Using Microsoft Excel | 131


DEPARTMENT of
STATISTICS

3. Input the
range that
contains the
variables X and Y.

4. Choose output
options, then
click ok

Basic Statistical Data Analysis Using Microsoft Excel | 132


DEPARTMENT of
STATISTICS

5. the output will be

Basic Statistical Data Analysis Using Microsoft Excel | 133


DEPARTMENT of
STATISTICS

Example: The data below shows the number of absences and the final grades of
seven randomly selected students from a statistics class.
Number of Final grade 1. Create regression equation
Student
absences (X) (Y) to predict the final grade of
A 6 82 student.
B 2 86 2. Interpret 𝑏0 and 𝑏1
C 15 43 3. Compute the value of 𝑅2
4. What is the estimated
D 9 74
value of the final grade if
E 12 58
the number of absences is
F 5 90 1, 4, 7, 10, 20
G 8 78
Basic Statistical Data Analysis Using Microsoft Excel | 134
DEPARTMENT of
STATISTICS a
Solution:

Regression Equation:
𝑦ෝ𝑖 = 𝑏𝑜 + 𝑏1 𝑥𝑖
𝑦ො𝑖 = 101.31 − 3.51𝑥𝑖

• 101.31 is the estimated average final grade when the number of absences
is zero.
• -3.51 is the estimated decreased in the average final grade as a result of
one-unit increase in the number of absences.
Basic Statistical Data Analysis Using Microsoft Excel | 135
DEPARTMENT of
STATISTICS

Solution:

Coefficient of Determination:

𝑅2 = 0.8927 or 89.27%

89.27% of the total variation in the final


grade was explained, or accounted for,
by the variation of the number of
absences.

Basic Statistical Data Analysis Using Microsoft Excel | 136


DEPARTMENT of
STATISTICS a

Solution:
What is the estimated value of the final grade if the number of absences is 1, 4,
7, 10, and 20

Final regression equation:


𝐹𝑖𝑛𝑎𝑙 𝑔𝑟𝑎𝑑𝑒 = 101.31 − 3.51 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑎𝑏𝑠𝑒𝑛𝑐𝑒𝑠

Case 1: no. of absences = 1 Final grade = 101.31 − 3.51 1 = 97.80


Case 2: no. of absences = 4 Final grade = 101.31 − 3.51 4 = 87.27
Case 3: no. of absences = 7 Final grade = 101.31 − 3.51 7 = 76.74
Case 4: no. of absences = 10 Final grade = 101.31 − 3.51 10 = 66.21
Case 3: no. of absences = 20 Final grade = 101.31 − 3.51 20 = 31.11

Basic Statistical Data Analysis Using Microsoft Excel | 137


DEPARTMENT of
STATISTICS

Y-Height (cm) X-Span (cm)


Example 2: 171 173
The height and arm span of 10 adult 195 193
males were measures (in cm). 180 188
1. Create regression equation to predict 182 185
the height of males.
190 186
2. Interpret 𝑏0 and 𝑏1
175 178
3. Determine and interpret 𝑅2
177 182
4. What is the estimated value of the
height if the span is 150, 200, 250. 178 182
192 198
202 202

Basic Statistical Data Analysis Using Microsoft Excel | 138


DEPARTMENT of
STATISTICS a
Solution:

Regression Equation:
𝑦ෝ𝑖 = 𝑏𝑜 + 𝑏1 𝑥𝑖
𝑦ො𝑖 = −10.997 + 1.05𝑥𝑖

−10.997 is the estimated average height when the size of arm span is zero.
1.05 is the estimated increased in the average height as a result of one-unit
increase in the size of arm span.

Basic Statistical Data Analysis Using Microsoft Excel | 139


DEPARTMENT of
STATISTICS

Solution:

Coefficient of Determination:
𝑅2 = 0.8677 or 86.77%
86.77% of the total variation in the
height was explained, or accounted
for, by the variation of the size of arm
span.

Basic Statistical Data Analysis Using Microsoft Excel | 140


DEPARTMENT of
STATISTICS

Solution:
What is the estimated value of the height if the span is 150, 200, 250.

Final regression equation:


ℎ𝑒𝑖𝑔ℎ𝑡 = −10.997 + 1.05(𝑎𝑟𝑚 𝑠𝑝𝑎𝑛)
Case 1: arm span = 150 ℎ𝑒𝑖𝑔ℎ𝑡 = −10.997 + 1.05 150 = 146.503
Case 2: arm span = 200 ℎ𝑒𝑖𝑔ℎ𝑡 = −10.997 + 1.05 200 = 199.003
Case 3:arm span = 250 ℎ𝑒𝑖𝑔ℎ𝑡 = −10.997 + 1.05 250 = 251.503

Basic Statistical Data Analysis Using Microsoft Excel | 141

You might also like