The Logic of Statistical Tests of Significance
In this module we take up the levels of inquiry in research studies, incorporating the concepts of estimation and hypothesis testing. The concepts of probability and levels of significance are also touched on.
I hope you will use this module as a good basis for understanding the succeeding modules, especially those of Unit III. Continue your reading habits and do the exercises and activities; they will sharpen your statistical skills. So again, good luck and study well.
OBJECTIVES
At the end of this module study, you will be able to:
History tells us that as far back as ancient Egypt, sporting men and women used four-sided astragali, made from animal heel bones, as dice to gamble. From this tradition, the Roman emperor Claudius (10 BCE – 54 CE) wrote the first known treatise on gambling. Modern dice grew popular in the Middle Ages, in time for a Renaissance rake, the Chevalier de Mere, to pose a mathematical puzzle: What is likelier to happen, rolling at least one six in four throws of a single die, or rolling at least one double six in 24 throws of a pair of dice? The Chevalier reasoned that the average number of successful rolls was the same for both gambles. Hence the following:
Chance of one six = 1/6
Average number in four rolls = 4(1/6) = 2/3
Chance of double six in one roll = 1/36
Average number in 24 rolls = 24(1/36) = 2/3
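The averages are indeed equal, but the probabilities of winning the two gambles are not, which is the heart of the puzzle. The following sketch (not from the module; the variable names are ours) computes both probabilities via the complement rule:

```python
# Illustrative sketch (not from the module): the Chevalier de Mere's two
# gambles, computed via the complement rule P(at least one) = 1 - P(none).

# At least one six in 4 throws of a single die:
p_one_six = 1 - (5 / 6) ** 4

# At least one double six in 24 throws of a pair of dice:
p_double_six = 1 - (35 / 36) ** 24

print(round(p_one_six, 4))    # 0.5177
print(round(p_double_six, 4)) # 0.4914
```

The first gamble is slightly better than even, the second slightly worse, even though the average number of successes is 2/3 in both.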
In the event that a coin is tossed, the random experiment consists of recording its outcome. The elementary outcomes are heads (H) and tails (T): with a tossed coin, two possible things can happen, the head or the tail of the coin.
Let us imagine a random experiment with n elementary outcomes: O1, O2, O3, …, On. We want to assign a numerical weight, or probability, to each outcome, which measures the likelihood of its occurring; we write the probability of O1 as P(O1). For example, in a fair coin toss, heads and tails are equally likely, and we assign them both the probability 0.5: each outcome comes up half the time. In the roll of two dice, there are 36 elementary outcomes, all equally likely, so each has probability 1/36. For instance,
P(Black 5, White 2) = 1/36
This means that if you rolled the dice a very large number of times, in the long run this outcome would occur 1/36 of the time.
If an event is certain to happen, we assign it the probability 1; in the long run, that is the proportion of times it will occur. The total probability of the sample space must be 1, that is, the probabilities of the elementary outcomes are nonnegative and sum to one:
P(Oi) ≥ 0 for each i
P(O1) + P(O2) + … + P(On) = 1
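These two rules can be checked mechanically. A minimal sketch (the helper name is ours, not from the module):

```python
# Illustrative sketch: checking that an assignment of probabilities to
# elementary outcomes satisfies the two rules in the text.

def is_valid_probability_model(weights):
    """Each P(Oi) must be nonnegative and all must sum to 1."""
    return all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9

fair_coin = [0.5, 0.5]               # heads, tails
two_dice = [1 / 36] * 36             # 36 equally likely outcomes

print(is_valid_probability_model(fair_coin))   # True
print(is_valid_probability_model(two_dice))    # True
print(is_valid_probability_model([0.7, 0.5]))  # False: sums to 1.2
```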
1. Classical probability – this is based on gambling ideas. The fundamental assumption is that the game is fair and all the elementary outcomes have the same probability.
2. Relative frequency – when an experiment can be repeated, an event's probability is the proportion of times the event occurs in the long run.
3. Personal probability – this is based on intuition and life experience. Most of life's events are not repeatable; personal probability is an individual's assessment of an outcome's likelihood.
SAQ 10-1
Estimation exists in our daily life. For example, we want to know the average length of time one would travel from Taft Avenue to San Agustin, Novaliches; or how much in sales we will make with our SKA IV manuals next year. There are many situations where we estimate. Sometimes we estimate that with soaring prices of food and commodities, we may need between P 5,000 and P 8,000 in expenses per month. The range from P 5,000 to P 8,000 is an interval estimate. This means that based on the sample mean and the standard error of the mean (the standard deviation of the sample mean), we can conclude with perhaps 95% confidence that the population mean is located within the range of estimated values.
There is always the possibility of error in making an estimate involving a range of values. The 95% confidence means that if we made similar interval estimates many times, then on the average about 95% of those intervals would contain the true population mean.
An estimate of a population parameter given by a single number is called a point estimate of that parameter. In many cases, we use x̄ (the sample mean) as a point estimate for μ (the population mean) and s (the sample standard deviation) as a point estimate for σ (the population standard deviation).
An estimate is not very valuable unless we have some measure of how “good” it is. The language of probability can give us an idea of the size of the error of estimate caused by using the sample mean x̄ as an estimate for the population mean.
Let us remember that x̄ is a random variable: each time we draw a sample of size n from a population, we can get a different value for x̄. According to the Central Limit Theorem, if the sample size is large, x̄ has a distribution that is approximately normal with mean μx̄ = μ, the population mean we are trying to estimate, and standard deviation σx̄ = σ/√n.
This information, together with our work on confidence levels, leads us to the probability statement
P( −zc(σ/√n) < x̄ − μ < zc(σ/√n) ) = c
This equation uses the language of probability to give us an idea of the size of the error of estimate for the corresponding confidence level c. In words, it says that the probability is c that our point estimate x̄ is within a distance ±zc(σ/√n) of the population mean μ. This relationship is shown in Figure 10.1.
[Figure 10.1. The sampling distribution of x̄, centered at the population mean μ, with the confidence interval extending zc(σ/√n) on either side of μ.]
Patterned after the example given by Brase and Brase (1983), let us see an application of this estimation.
Example: Lydia de Guzman enjoys jogging. She has been jogging over a period of several years, even when she was still working in Saudi Arabia, during which time her physical condition has remained consistently good. Usually she jogs 2 mi/day. During the past year, Lydia has sometimes recorded the times required to run 2 mi; she has a sample of 90 of these times. For these 90 times, the mean was x̄ = 15.60 min and the standard deviation was s = 1.80 min. Let μ be the mean jogging time for the entire distribution of Lydia's 2-mi running times (taken over the past year). Find a 0.95 confidence interval for μ.
Solution: The interval from x̄ − E to x̄ + E will be a 95% confidence interval for μ. In this case, c = 0.95, so zc = 1.96 (please refer to Fig. 10.1). The sample size n = 90 is large enough that we may approximate σ by s = 1.80 min. Therefore,
E = zc (s/√n)
E = 1.96 (1.80/√90)
E = 0.37
x̄ − E < μ < x̄ + E
15.60 − 0.37 < μ < 15.60 + 0.37
15.23 < μ < 15.97
Conclusion: We are 95% confident that the population mean μ of Lydia's jogging times lies between 15.23 min and 15.97 min.
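The whole solution can be reproduced in a few lines. A sketch, assuming the large-sample formula E = zc(s/√n) used in the solution (the function name is ours):

```python
# Sketch: a large-sample confidence interval for the mean, E = z_c * s / sqrt(n).
import math

def confidence_interval(xbar, s, n, z_c=1.96):
    e = z_c * s / math.sqrt(n)  # maximal error of estimate
    return xbar - e, xbar + e

# Lydia's 90 recorded 2-mi jogging times:
low, high = confidence_interval(xbar=15.60, s=1.80, n=90)
print(round(low, 2), round(high, 2))  # 15.23 15.97
```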
In hypothesis testing, the data gathered can either support or negate the hypothesis. There is no claim to “prove” that the hypothesis is true, because no single study can ever “prove” anything.
There are two kinds of hypotheses.
1. The null hypothesis represents the status quo to the party performing the sampling experiment – the hypothesis that will be accepted unless the data provide convincing evidence that it is false.
2. The research hypothesis, or alternative hypothesis, will be accepted only if the data provide convincing evidence of its truth. The researcher states this type of hypothesis, but when statistical treatment is applied, it is the corresponding null hypothesis that is actually tested against the data.
Example:
For this type of hypothesis testing, two groups of samples, to serve as the study group and the control group, are necessary.
ACTIVITY 10-1
Can you state at least three alternative hypotheses and their corresponding null hypotheses?
Levels of Significance
The level of significance tells you how “rare” a result must be before you conclude that you have found a real difference or relationship. Such a result will be labeled “statistically significant”. It is important that before the conduct of the research, the investigator decides what will be considered “rare”. Usually, the level of significance set by many investigators is p = .05 or p = .01. This means that the probability that the “rare” finding occurred by chance alone is only 5% for p = .05, or 1% for p = .01, corresponding to the 95% and 99% levels of confidence.
The p value is the smallest level of significance for which the observed sample statistic tells the investigator to reject the null hypothesis. How does one go about computing a p value? The p value must be expressed in terms of probability. Using the model of a right-tailed test of the mean, the p value is simply the probability that the sample mean computed from any random sample exceeds the observed x̄. In symbols, for a right-tailed test,
p value = P(x̄ computed from any random sample > observed x̄).
p values are the areas in the tail or tails of a probability distribution beyond the observed sample statistic. Figure 10.2 shows the p values for right-, left-, and two-tailed tests of the mean.
[Figure 10.2. p values as tail areas. Right-tailed test: the p value is the area to the right of the observed sample mean x̄, beyond k. Left-tailed test: the area to the left of the observed x̄. Two-tailed test (H0: μ = k): the p value is the sum of the areas in the two tails beyond k ± (x̄ − k), where x̄ is the observed sample mean.]
SAQ 10-2
(a) H0: μ = 30; H1: μ ≠ 30; α = 0.01; p value = 0.0213. Do we accept or reject H0?
(b) H0: μ1 = μ2; H1: μ1 < μ2; α = 0.05; p value = 0.0316. Do we accept or reject H0?
(c) H0: p = 0.15; H1: p < 0.15; α = 0.05; p value = 0.0171. Do we accept or reject H0?
(d) H0: p1 = p2; H1: p1 ≠ p2; α = 0.01; p value = 0.321. Do we accept or reject H0?
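The decision rule behind exercises like these is simply to compare the p value with α. A sketch with hypothetical numbers (deliberately not the SAQ values, so you can still work those out yourself):

```python
# Sketch: the standard decision rule — reject H0 when the p value does not
# exceed the chosen level of significance alpha.

def decide(p_value, alpha):
    return "reject H0" if p_value <= alpha else "accept H0"

# Hypothetical results:
print(decide(0.03, 0.05))  # reject H0 (0.03 <= 0.05)
print(decide(0.03, 0.01))  # accept H0 (0.03 > 0.01)
```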
With the null hypothesis, there are two types of errors. When the data have been analyzed, the investigator accepts the null hypothesis if there are no significant results. On the other hand, if significant differences have been found, then the null hypothesis is rejected.
A Type I error happens when we reject a true null hypothesis. This means the data indicate a statistically significant result when in fact there is no difference in the population. On the other hand, if we accept the null hypothesis when it is in fact false, we have made a Type II error.
The probability of making a Type I error is called alpha (α) and can be decreased by altering the level of significance. You can set α at .01 instead of .05; then there is only 1 chance in 100 that the “significant” result could occur by chance alone. However, in decreasing α you also decrease the power of the test, and there is the risk of increasing the chance of a Type II error.
A Type II error is accepting a false null hypothesis: if there is no significant result, you are likely to accept the null hypothesis, when in fact there really is a difference. One way to reduce Type II errors is to increase the sample size; another is to be willing to risk 10 chances in 100 of being wrong (p = .10) rather than only five chances in 100 (p = .05).
SAQ 10-3
You hypothesize that there is no significant difference between sophomores and juniors in terms
of weight. In each of the following, determine whether or not an error has been made, and if so,
what type of error?
(a) Juniors really weigh significantly more than the sophomores, and you accept
the null hypothesis.
(b) Sophomores really weigh more than the juniors and you reject the null
hypothesis.
(c) Juniors and sophomores really do weigh the same, and you accept the null
hypothesis.
(d) Juniors and sophomores really do weigh the same, and you reject the null
hypothesis.
SUMMARY
This module has brought you to the logic of statistical tests of significance. You saw the null and directional hypotheses and their difference, and how to detect Type I and Type II errors, passing along the way through the concepts of estimation and probability.
I hope you continue reading the succeeding modules. Keep up your interest.
11
Difference Between Means Test
INTRODUCTION
In this module, you will learn tests that measure differences between means. You will encounter many research projects that are designed to test the differences between means: between groups, and between population means and sample means. When the differences involve interval or ratio data, the analysis requires an evaluation of the means and distribution of each group. In this module, we shall review the z-score test in order to get into the t-test.
There will be exercises for you to practice your understanding of the concepts and procedures of the tests for difference between means. So, I encourage you to take a pleasant seat, read slowly, understand the concepts, and do the exercises to enhance your statistical skills. The difference between means test is usually called for in studies having two samples, and also in studies that deal with pre- and post-test analysis. The data must be interval or ratio so that we can deal with means of scores.
OBJECTIVES
There are statistical problems that require you and me to decide whether observed differences between two sample means can be attributed to chance. For instance, we may decide whether there is really a difference in the average electrical consumption of two kinds of washing machines; or whether a group of patients suffering from arthritis showed marked improvement after an average of 12 days taking a particular prescribed medicine, while under similar conditions another group of sample patients averaged 10 days. In the same vein, we may decide on the basis of samples whether teenage boys dance more than teenage girls, or whether retired professors are more active after retirement than office workers.
Examples will be given to make this concept of detecting the differences between means clearer. You will notice that in trying to solve for the differences, we always go back to the normal curve distribution and pass by the z-score. To test an observed difference between two sample means, we need a theory that lets us decide whether the difference can be attributed to chance. For example: if x̄1 and x̄2 are the means of two large independent samples of sizes n1 and n2, the sampling distribution of the statistic x̄1 − x̄2 can be closely approximated by a normal curve having the mean μ1 − μ2 and the standard deviation
√(σ1²/n1 + σ2²/n2)
where μ1, μ2, σ1, and σ2 are the means and the standard deviations of the two populations from which the two samples were taken. Such a standard deviation is referred to as the standard error of the difference between two means.
If the selection of one sample does not affect the selection of the other, then we call the samples independent. A “before and after” comparison cannot use this theory, because its design does not produce independent samples. In most practical situations, σ1 and σ2 are unknown, so when we limit ourselves to large samples (n1 and n2 should each be more than 30), we use the sample standard deviations s1 and s2 as estimates of σ1 and σ2. Now we shall test the null hypothesis μ1 − μ2 = 0 using the z-statistic. As you will recall, the formula for z is as follows:
z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
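This z-statistic can be sketched in a few lines. The data below are hypothetical, chosen only so that both samples exceed 30 as the text requires (the function name is ours):

```python
# Sketch: the large-sample z statistic for the difference between two means,
# z = (x1bar - x2bar) / sqrt(s1^2/n1 + s2^2/n2).
import math

def two_sample_z(x1bar, x2bar, s1, s2, n1, n2):
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)  # standard error of the difference
    return (x1bar - x2bar) / se

# Hypothetical large samples (n > 30 for both):
z = two_sample_z(x1bar=72.0, x2bar=70.0, s1=8.0, s2=9.0, n1=50, n2=60)
print(round(z, 2))  # 1.23
```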
We shall now proceed with the t-test. The t-test has been the technique commonly used to compare two groups. For this type of test, the data required come from two samples. The independent variable (group membership) can be at the nominal or ordinal level; the dependent variable should be at the interval or ratio level.
Before calculating the t-test statistic, let us keep in mind the following important points:
(1) The dependent measure requires interval-level data.
(2) Each subject belongs to only one group – this assures the independence of the groups.
(3) The distribution of the dependent measure is normal, not skewed; otherwise the t-test may be invalid.
(4) The two groups being compared should have similar variances. This is known as the homogeneity of variance.
The t-test is also known as Student's t-test. Its inventor, William Gosset, of French-English descent, described a set of distributions of means of randomly drawn samples. He published his description and findings under the pen name “Student”, hence the name Student's t-test. In the t-test, the distributions are described by the differences between pairs of sample means obtained by drawing pairs of samples from a population.
There are three different formulas based on the t-distribution that can be used to compare two groups of samples. These formulas are:
(a) pooled formula – the two groups of samples have met the requirement of the test for homogeneity of variance;
(b) separate formula – the variances are not equal;
(c) correlated t-test, or t-test for paired comparisons – you compare one group of subjects on their pre- and post-test scores.
Let us start by describing the pooled t-test. To understand this concept and technique, let us illustrate it with a research example. Suppose we are doing a nursing research project among post-stroke patients: we apply a structured physic-psycho biobehavioral nursing intervention to one group, which we call the study group, and we randomly draw another group of post-stroke patients to serve as our control group, to whom we do not apply the intervention. Both groups are randomly drawn from a population of post-stroke patients. After three months, we measure their physiological responses through scores measuring their mobility response, functionality response, and compliance-to-exercise response.
Our research question is: “Is the group receiving the physic-psycho nursing intervention different from the group that did not, in terms of physiological response scores?” We are interested in examining the group differences so that we can infer and make projections about the population of post-stroke patients. Because we are introducing an experiment in our study group, we are defining a new population – the population of post-stroke patients who receive the nursing intervention.
Table 11-1. Physiological Response Scores of the Study and Control Groups
From Table 11-1, the means show clearly that the study group, where the physic-psycho intervention was applied, obtained higher scores than the control group. We now want to find out how different the groups are. We first check the homogeneity of variance before deciding which type of t-test to use, i.e., the pooled or the separate formula.
The formula for the variance is:
s² = (Σx² − (Σx)²/n) / (n − 1)
Study group: n1 = 10, x̄1 = 15.10
s1² = (2445 − (151)²/10) / (10 − 1) = 164.90/9 = 18.32
Control group: n2 = 9, x̄2 = 9.78
s2² = (996 − (88)²/9) / (9 − 1) = 135.56/8 = 16.94
F(9, 8) = 18.32/16.94 = 1.08
From the F-table in Appendix E, the tabled values for 9 and 8 df are 3.39 (.05 level) and 5.91 (.01 level). We should double these probability levels for the two-tailed test, to .10 and .02. As our F value of 1.08 is not even significant at the .10 level (tabled value 3.39), it will not be significant at the .05 level. For this type of data, the pooled t-formula is the appropriate formula to use.
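The homogeneity check amounts to dividing the larger variance by the smaller. A sketch using the variances computed above (the helper names are ours):

```python
# Sketch: sample variance and the F ratio used to check homogeneity of variance.

def sample_variance(scores):
    # s^2 = sum((x - mean)^2) / (n - 1)
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

def f_ratio(var_a, var_b):
    # Larger variance over smaller, so F >= 1
    return max(var_a, var_b) / min(var_a, var_b)

print(sample_variance([4, 6, 8]))       # 4.0 (hypothetical scores)
print(round(f_ratio(18.32, 16.94), 2))  # 1.08, as in the text
```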
Our next task is to compare the group means as estimates of the means of two different populations. The study group's mean is 15.10 and that of the control group is 9.78; the study group has the higher mean. To check whether this difference is due to chance, or is a true difference because the two groups are different, we test the null hypothesis that there is no difference. We subject the data to the t-test, which compares the observed difference to the distribution of differences between pairs of means in the population.
You can see readily that the formula is similar to that of the z-score formula:
z = (x − x̄) / s
z = (x − μ) / σ  (z-formula for population parameters)
In our example in Table 11-1, we will use the t-test to analyze the group differences:
t = ((x̄1 − x̄2) − (μ1 − μ2)) / s(x̄1 − x̄2)
In the numerator, (x̄1 − x̄2) represents the difference between the means of the two groups; translated to scores, this is (15.10 − 9.78). The term (μ1 − μ2) is based on the null hypothesis, which assumes that the two populations are not different, so the difference is zero: (μ1 − μ2) = 0.
In this formula, the denominator represents the “pooled” variance of both groups and is appropriate because the variances were not significantly different. The denominator is the appropriate standard error for this t-statistic, and the formula for the standard error is:
s(x̄1 − x̄2) = √[ ((Σx1² + Σx2²)/(n1 + n2 − 2)) (1/n1 + 1/n2) ]
where
Σx1² = sum of squares of the study group
Σx2² = sum of squares of the control group
n1 = the number of scores in the study group
n2 = the number of scores in the control group
(a) When the two groups have equal n's, the formula simplifies to:
s(x̄1 − x̄2) = √[ (Σx1² + Σx2²) / (n(n − 1)) ]
(b) To find the sum of squares for the study group, the formula (applied to the raw scores) is:
Σx1² = Σ(raw scores)² − (Σ raw scores)²/n1
(c) For the control group, the formula is:
Σx2² = Σ(raw scores)² − (Σ raw scores)²/n2
t = ((15.10 − 9.78) − 0) / √[ ((164.90 + 135.56)/(10 + 9 − 2)) (1/10 + 1/9) ]
t = 5.32 / √[ (300.46/17)(0.21) ]
t = 5.32 / √[ (17.67)(0.21) ]
t = 5.32 / √3.71
t = 5.32 / 1.93
t = 2.76
We now compare this value of 2.76 to the distribution of t values for our df. We have two groups, and each group has a mean, so we calculate the degrees of freedom according to the following formula:
df = (n1 + n2) − 2, or df = total n − 2
df = 19 − 2
df = 17
Looking in Appendix F for a t-value of 2.76 with df = 17, we find it falls between the probability levels of .01 and .005 for a one-tailed test. This means the difference between the groups would occur by chance less than 1 time in 100. So the null hypothesis is rejected: the groups differed significantly, with the study group scoring higher in physiological responses than the control group.
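The pooled computation above can be sketched as follows (the function name is ours). Note that carrying full precision through the intermediate steps gives about 2.75; the 2.76 above comes from rounding 1/10 + 1/9 to 0.21 along the way:

```python
# Sketch: the pooled t statistic, using the sums of squares, means, and
# sample sizes from the worked example in the text.
import math

def pooled_t(x1bar, x2bar, ss1, ss2, n1, n2):
    # Pooled standard error: sqrt(((SS1 + SS2)/(n1 + n2 - 2)) * (1/n1 + 1/n2))
    se = math.sqrt(((ss1 + ss2) / (n1 + n2 - 2)) * (1 / n1 + 1 / n2))
    return (x1bar - x2bar) / se

t = pooled_t(x1bar=15.10, x2bar=9.78, ss1=164.90, ss2=135.56, n1=10, n2=9)
print(round(t, 2))  # 2.75 (the text's 2.76 reflects rounded intermediates)
```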
The separate t-test is a conservative formula for groups whose variances are not the same. The formula is:
t = ((x̄1 − x̄2) − (μ1 − μ2)) / √(s1²/n1 + s2²/n2)
where s1² = variance of the study group and s2² = variance of the control group.
To demonstrate this separate t-test formula, let us use the same example as for the pooled t-test formula:
t = ((15.10 − 9.78) − 0) / √(18.32/10 + 16.94/9)
t = 5.32 / √(1.832 + 1.882)
t = 5.32 / 1.93
t = 2.76
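A sketch of the separate-variance computation (the function name is ours):

```python
# Sketch: the separate-variance t statistic,
# t = (x1bar - x2bar) / sqrt(s1^2/n1 + s2^2/n2).
import math

def separate_t(x1bar, x2bar, var1, var2, n1, n2):
    se = math.sqrt(var1 / n1 + var2 / n2)  # separate (unpooled) standard error
    return (x1bar - x2bar) / se

t = separate_t(x1bar=15.10, x2bar=9.78, var1=18.32, var2=16.94, n1=10, n2=9)
print(round(t, 2))  # 2.76
```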
The Correlated or Paired t-test
In the correlated or paired t-test, the matched or paired samples are expected to have similar scores: the chance differences between the two groups will not be as large as when they are drawn independently. The formula is:
t = ((x̄1 − x̄2) − (μ1 − μ2)) / √( s1²/n1 + s2²/n2 − 2r(s1/√n1)(s2/√n2) )
where s1² and s2² are the group variances and r is the correlation between the paired scores.
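In practice the correlated t-test is usually computed from the difference scores d = post − pre, which is algebraically equivalent to the correlation formula for equal-sized paired samples. A sketch with hypothetical paired scores (the data and function name are ours):

```python
# Sketch: the paired t-test via difference scores, t = dbar / (s_d / sqrt(n)).
import math

def paired_t(pre, post):
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    dbar = sum(diffs) / n
    # Sample variance of the differences
    var_d = sum((d - dbar) ** 2 for d in diffs) / (n - 1)
    return dbar / math.sqrt(var_d / n)

pre = [10, 12, 9, 11, 13]    # hypothetical pre-test scores
post = [12, 14, 10, 14, 15]  # hypothetical post-test scores
print(round(paired_t(pre, post), 2))  # 6.32
```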
SAQ 11-1
Compute the following exercise with the t-test:
A random sample of five specimens is taken, and the level of significance is set at α = 0.01. A decision must be made whether the fat content of Selecta ice cream is less than 10%. However, it remains to be seen whether the difference between 10% and the sample's 9.6% is really significant. Set up your hypothesis and compute.
SUMMARY
This module has shown you how to compute the difference between means. The t-test was presented to you. You should now be aware of the uses of the t-test and what data are required to use this technique.
You do not have to be overwhelmed by the formulas. All you have to know is what type of data are needed, what type of research problems go with the t-test, and what the normal distribution means for your data. The computation can easily and quickly be done by statistical software packages. Now that you are a little more knowledgeable about the difference between means, it is time to conceptualize research problems so that you can apply what you have learned.
To conclude this module, let us be reminded of the basic steps in testing a hypothesis concerning means:
(1) Formulate a null hypothesis H0 in such a way that the probability of a Type I error can be calculated.
(2) Formulate an alternative hypothesis H1 so that the rejection of the null hypothesis H0 is equivalent to the acceptance of the alternative hypothesis H1.
(3) Specify the level of significance α. The most commonly used values of α are 0.01 and 0.05, but the investigator may choose any level of significance depending on the acceptable risk of committing a Type I error.
(4) Choose the appropriate test statistic. If the test concerns means, the z-statistic is used as long as the sampling distribution approximates the normal distribution. The t-statistic is used when the sampling distribution follows the shape of the Student t-distribution.
(5) Determine the critical region, which may lie entirely in one tail or be split into equal parts, one lying in the right tail and the other in the left tail of the distribution.
(6) Compute the value of the test statistic.
(7) Draw a conclusion. If the computed value of the test statistic falls within the region of rejection, we reject H0; otherwise, we accept H0 or reserve judgment.