Chi-Square Basics for Learners
Conduct the Chi-Square test of goodness of fit to verify whether a fitted distribution to a
set of observations is appropriate or not.
1.1 INTRODUCTION
Most research problems call for determining whether an association or interdependence exists between two or more variables. To this end, statistical methods help in measuring the relationships between variables. One such method is the Chi-Square test of independence (association). It is applied when we have two variables and is used to detect the existence or non-existence of an association between them. Note, however, that the Chi-Square test of association does not measure the degree of the association or relationship. The Chi-Square distribution is also used in evaluating the goodness of fit of a distribution to a given set of data. This kind of test is referred to as the Chi-Square test of goodness of fit, and it is of great importance in statistics.
The chi-square (χ²) distribution is obtained from the values of the ratio of the sample variance to the population variance, multiplied by the degrees of freedom, when the population is normally distributed with population variance σ².
The Chi-Square distribution will be used in investigating whether the expected frequencies are
significantly different from the observed frequencies obtained from the sample.
There are three cases in which we apply this kind of test.
i) Test of goodness of fit: testing whether a given model is acceptable or not.
ii) Chi-Square test of independence: testing whether two attributes are associated or not.
iii) Chi-Square test of homogeneity: testing whether several populations are homogeneous with respect to a certain classification.
CHI-SQUARE (χ²) TEST:
In most statistical tests, our decisions are based on the assumption that the population is normally distributed. When this assumption about the population cannot be made, it is necessary to use the chi-square (χ²) test. This test is suitable for nominal or ordinal scales of measurement. The nominal scale of measurement deals with data that can only be classified into categories, such as male and female, or freshmen, juniors, and seniors, and so on. There is no particular order for these groupings, and they are mutually exclusive, so that an item in one category is not included in another category. The ordinal scale of measurement assigns ranks to the categories: one category may be superior in standing, and another may be good or fair, and so on. The χ² test is used for analyzing qualitative variables such as opinions of persons, religious affiliations, smoking habits, etc. It deals with judgments about proportions of two or more populations.
Example: Find the critical value of χ² from the table of the χ²-distribution for a given level of significance α and number of degrees of freedom; a worked illustration of reading the table is given later in this section.
THE CHI-SQUARE TEST OF INDEPENDENCE:
The chi-square test of independence is a bivariate statistical technique that is used to detect the existence of an association between two attributes or variables. Knowing whether such an association exists is often of great interest. Some examples of such pairs of variables are the following.
Type of school to which families send their children and income of families.
Religion and family size.
Time studied and exam result.
Demand and price of a commodity.
Age and number of teeth.
It is one of the most commonly applied statistical techniques in research conducted in different disciplines.
Suppose that we have two attributes (characteristics), say A and B. We want to test the hypothesis
H0: There is no association between attribute A and attribute B.
versus the alternative
HA: There is an association between attribute A and attribute B.
In the test for independence, the claim is that the row and column variables are independent of
each other. This is the null hypothesis. The test statistic used is the same as the chi-square
goodness-of-fit test. The principle behind the test for independence is the same as the principle
behind the goodness-of-fit test. The test for independence is always a right-tail test.
If attribute A has r categories (levels) and attribute B has c categories (levels), then the table in which the two attributes (variables) are cross-classified contains r rows and c columns. The table has rc cells and is usually referred to as an r×c (r by c) contingency table.
Now suppose a sample of size n is taken and cross-classified. Let O_ij denote the observed frequency of the ith category (level) of A and the jth category (level) of B. Recall that our interest is to test the null hypothesis that there is no association between the two attributes A and B. The test statistic to be used is:
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}}, \quad e_{ij} = n P\{A_i \cap B_j\}

where P{A_i ∩ B_j} is the probability of cell (i, j).
If the null hypothesis is true, then

e_{ij} = n P\{A_i \cap B_j\} = n P(A_i) P(B_j) = n \cdot \frac{O_{i.}}{n} \cdot \frac{O_{.j}}{n} = \frac{O_{i.} O_{.j}}{n}

where O_i. is the total frequency of the ith row and O_.j is the total frequency of the jth column.
The above test statistic has a chi-square distribution with (r − 1)(c − 1) degrees of freedom. The rejection criterion is:

\chi^2 > \chi^2_{\alpha}[(r-1)(c-1)]
The multiplication rule states that if two events are independent, then the probability of both occurring is the product of the probabilities of each occurring. This is key to working the test for independence: the expected frequencies are computed under the assumption of independence. If you end up rejecting the null hypothesis, then the assumption must have been wrong, and the row and column variables are dependent. Remember, all hypothesis testing is done under the assumption that the null hypothesis is true.
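As an illustration, the following is a minimal sketch in Python of how the expected frequencies and the test statistic can be computed for an r×c table. It assumes the numpy and scipy packages are available; the observed counts used here are hypothetical, not from the examples below.

```python
import numpy as np
from scipy.stats import chi2

# Observed frequencies O_ij in an r x c contingency table (hypothetical data)
O = np.array([[30, 10],
              [20, 40]], dtype=float)

n = O.sum()                                  # total sample size
row_totals = O.sum(axis=1, keepdims=True)    # O_i.
col_totals = O.sum(axis=0, keepdims=True)    # O_.j
E = row_totals @ col_totals / n              # e_ij = O_i. * O_.j / n

chi_sq = ((O - E) ** 2 / E).sum()            # the test statistic
df = (O.shape[0] - 1) * (O.shape[1] - 1)     # (r-1)(c-1) degrees of freedom
critical = chi2.ppf(0.95, df)                # right-tail critical value at alpha = 0.05

print(f"chi-square = {chi_sq:.3f}, critical value = {critical:.3f}")
print("reject H0" if chi_sq > critical else "fail to reject H0")
```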
Example: To test the hypothesis that color of eye and color of hair are associated, data on color of eye and color of hair for 6,800 individuals were compiled.
                      Hair
Eye        Fair    Brown    Black    Red
Blue       1768     808      190     47
Green       946    1387      746     43
Brown       115     444      288     18
Test whether there is an association between the two attributes at the 1% level of significance.
Solution:
The hypothesis we want to test is:
H0: There is no association between color of eye and color of hair.
HA: There is association between color of eye and color of hair.
The expected frequencies are computed as e_{ij} = O_{i.} O_{.j} / n and are presented in the following table.

e_ij       Fair      Brown     Black     Red
Blue      1170.29   1091.69   506.34   44.68
Green     1298.84   1211.61   561.96   49.58
Brown      359.87    335.70   155.70   13.74
The value of the test statistic is

\chi^2 = \sum_{i=1}^{3} \sum_{j=1}^{4} \frac{(O_{ij} - e_{ij})^2}{e_{ij}} = 1074.43

The degrees of freedom are (r − 1)(c − 1) = (3 − 1)(4 − 1) = 6, so the tabulated value is χ²_{0.01}(6) = 16.81. Since the calculated value, 1074.43, is greater than the tabulated value, we reject the null hypothesis and conclude that there is an association between the two attributes, eye color and hair color.
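As a check, the same result can be reproduced with SciPy's built-in routine (a sketch, assuming scipy is installed); chi2_contingency computes the expected frequencies and the statistic exactly as described above.

```python
from scipy.stats import chi2_contingency

observed = [[1768,  808, 190, 47],   # Blue eyes
            [ 946, 1387, 746, 43],   # Green eyes
            [ 115,  444, 288, 18]]   # Brown eyes

stat, p_value, df, expected = chi2_contingency(observed)
print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.4g}")
# chi-square is about 1074.4 with df = 6, so H0 is rejected at the 1% level.
```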
Exercise:
Suppose that 500 university students were randomly selected and classified by year and
smoking habit.
Smoking habit
Year Non-Smokers Casual-Smokers Heavy-Smokers
Freshman 90 42 22
Sophomore 65 37 36
Junior 45 28 30
Senior 25 43 37
Test whether the two attributes, year and smoking habit, are related (associated) or not.
Example 20.9
Suppose we are interested in checking whether the qualification and salary of employees are dependent or not. We may classify qualification into three categories (r = 3): 12 complete, Diploma holder, and First Degree or higher.
We may also classify salary into three categories (s = 3): less than 200 Birr, 200 up to 499 Birr, and 500 Birr or more. We then randomly select employees and classify them into one of the r×s = 3 × 3 = 9 categories (cells). Suppose a random sample of 80 employees is taken and the following result is obtained.
                              Qualification
Salary              12 complete   Diploma holder   1st degree or higher   Row total
< 200 Birr               10             2                   0                12
200 – 499 Birr           16            20                   2                38
500 Birr or more          6             2                  22                30
Column total             32            24                  24                80
In tests of independence, the null and alternative hypotheses are of the form:
HO : The two classifications are independent
H1 : The two classifications are dependent.
The null hypothesis can also be written as “ There is no association between the two
classifications”.
The test statistic used to test the hypothesis of independence is the Chi-Square statistic.
A chi-square distribution, denoted by χ², is a continuous probability distribution. Unlike the normal distribution, the chi-square distribution is asymmetric (not symmetric): it is a positively (right) skewed distribution, and it cannot assume a negative value. The χ² values for a given level of significance (α) and number of degrees of freedom (d.f.) can be read from the χ² distribution table.
Notation: χ²_α denotes the value of χ² for which the area to its right is α, for a given number of degrees of freedom.
[Figure: the χ² density curve, with the right-tail area α lying to the right of χ²_α.]
For example, to find χ²_{0.05}(28), look up the value α = 0.05 and, under this value of α, look for the number of degrees of freedom (d.f.), which is equal to 28, in the chi-square distribution table. From the chi-square table, this value is 41.337. Similarly, χ²_{0.01}(15) = 30.578.
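The same critical values can also be obtained with software instead of the printed table; the following is a sketch in Python, assuming scipy is available. Note that chi2.ppf takes the area to the left, so the upper-tail value χ²_α(d.f.) is ppf(1 − α, d.f.).

```python
from scipy.stats import chi2

print(round(chi2.ppf(1 - 0.05, 28), 3))  # 41.337, i.e. chi^2_0.05(28)
print(round(chi2.ppf(1 - 0.01, 15), 3))  # 30.578, i.e. chi^2_0.01(15)
```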
Example 20.10
Look at the previous example about qualification and salary.
Test if there is a relationship between qualification and salary at the 5 percent level of
significance.
Solution: -
HO : Qualification and salary are independent
H1 : Qualification and salary are dependent
α = 0.05
Computing for the expected frequencies, we have the following table.
                              Qualification
Salary              12 complete   Diploma holder   1st degree or higher   Row total
< 200 Birr             10 (4.8)       2 (3.6)            0 (3.6)             12
200 – 499 Birr         16 (15.2)     20 (11.4)           2 (11.4)            38
500 Birr or more        6 (12)        2 (9)             22 (9)               30
Column total           32            24                 24                   80

The value of the test statistic is

\chi^2 = \sum \sum \frac{(O_{ij} - e_{ij})^2}{e_{ij}} = \frac{(10-4.8)^2}{4.8} + \frac{(2-3.6)^2}{3.6} + \dots + \frac{(22-9)^2}{9} \approx 51.45

with (r − 1)(s − 1) = (3 − 1)(3 − 1) = 4 degrees of freedom. Since 51.45 > χ²_{0.05}(4) = 9.49, we reject H0 and conclude that qualification and salary are dependent.
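The computation can be verified with SciPy (a sketch, assuming scipy is installed):

```python
from scipy.stats import chi2_contingency

observed = [[10,  2,  0],    # < 200 Birr
            [16, 20,  2],    # 200 - 499 Birr
            [ 6,  2, 22]]    # 500 Birr or more

stat, p_value, df, expected = chi2_contingency(observed)
print(f"chi-square = {stat:.2f}, df = {df}")   # about 51.45 with df = 4
# Since 51.45 > chi^2_0.05(4) = 9.49, qualification and salary are dependent.
```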
Self-Reviews 9.3
An electronic company wants to check if advertisement has a significant effect on the number of
TV sets that are sold within six months of production. A random sample of 600 TV sets reveals
the following results.
THE CHI-SQUARE TEST OF GOODNESS OF FIT:
The idea behind the chi-square goodness-of-fit test is to see whether the sample comes from a population with the claimed distribution. Another way of looking at it is to ask whether the frequency distribution fits a specific pattern. Here we want to test whether observed data (a frequency distribution) are sufficiently close to a theoretical (fitted) distribution.
Suppose that the data are classified into k classes (a one-way classification). Let us designate the expected and observed frequencies of the ith class by e_i and O_i, respectively. Expected frequencies are calculated based on the proposed distribution. Observed frequencies, on the other hand, are those obtained by observation; they are the sample frequencies.
The test statistic for the goodness-of-fit test is:

\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}
The test statistic has a chi-square distribution when the following assumptions are met: the data are obtained from a random sample; the classes are mutually exclusive and exhaustive; and the expected frequency of each class is at least 5 (classes with smaller expected frequencies are usually pooled with adjacent classes). If the above assumptions are satisfied, the test statistic has a Chi-Square distribution with (k − 1) degrees of freedom if no parameter is estimated in the process; if m parameters are estimated from the data, the degrees of freedom become k − 1 − m.
The idea is that if the observed frequencies are really close to the claimed (expected) frequencies, then the squared deviations will be small. If the sum of these weighted squared deviations is small, the observed frequencies are close to the expected frequencies, and there would be no reason to reject the claim that the sample came from that distribution. Only when the sum is large do we have reason to question the distribution. In other words, we reject the null hypothesis when the value of the calculated test statistic is very large. Therefore, the chi-square goodness-of-fit test is always a right-tail test.
The rejection region is \chi^2 > \chi^2_{\alpha}(k-1).
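The following is a minimal sketch in Python of the goodness-of-fit computation, assuming scipy is available. The counts below are hypothetical; scipy.stats.chisquare implements exactly the sum defined above.

```python
from scipy.stats import chisquare, chi2

observed = [50, 30, 20]          # O_i (hypothetical counts)
expected = [40, 40, 20]          # e_i under the claimed distribution

stat, p_value = chisquare(observed, f_exp=expected)
df = len(observed) - 1           # k - 1 (no parameters estimated)
print(f"chi-square = {stat:.3f}, p-value = {p_value:.4f}")
print("reject H0" if stat > chi2.ppf(0.95, df) else "fail to reject H0")
```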
Example: In an experiment in pea breeding, the following frequencies of seeds were obtained: 316 round and yellow, 102 wrinkled and yellow, 109 round and green, and 33 wrinkled and green. Theory predicts that the frequencies should be in the proportions 9:3:3:1. Apply the Chi-Square goodness-of-fit test to examine the correspondence between theory and practice.
Solution
H0: The frequencies are in the proportions 9:3:3:1 (the theory is correct).
HA: Not H0.
The test statistic is

\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}

The expected frequencies are e_i = nP_i, where n is the sample size and P_i is the probability of the ith class. Here n = 316 + 102 + 109 + 33 = 560, so the expected frequencies are 560(9/16) = 315, 560(3/16) = 105, 560(3/16) = 105, and 560(1/16) = 35.
Since the calculated value is less than the tabulated value at 5% (χ² = 0.356 < χ²_{0.05}(3) = 7.81), we accept the null hypothesis and conclude that the observed and expected frequencies are close to one another.
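The pea-breeding example can be checked with SciPy (a sketch, assuming scipy is installed):

```python
from scipy.stats import chisquare

observed = [316, 102, 109, 33]                 # round-yellow, wrinkled-yellow, round-green, wrinkled-green
n = sum(observed)                              # 560
expected = [n * p / 16 for p in (9, 3, 3, 1)]  # 315, 105, 105, 35 from the 9:3:3:1 ratio

stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi-square = {stat:.4f}")              # about 0.356, well below chi^2_0.05(3) = 7.81
```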
Example 2: Fit a normal distribution to the following frequency distribution and test whether the fit is good.

Class      Frequency
1 – 4         13
5 – 8         18
9 – 12         6
13 – 16       10
17 – 20       16
Solution:
The hypothesis we want to test is that the observations come from a normal population. The test statistic is

\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}
The e_i's are the expected frequencies, which will be computed using the fitted normal curve. The unbiased estimators of μ and σ² are

\bar{X} = \frac{\sum_{i=1}^{k} f_i x_i}{n} \quad \text{and} \quad S^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{X})^2}{n-1}

where x_i is the class mark of the ith class. For the given data, \bar{X} = 10.37 and S = 6.1.
In computing the expected frequencies, we use the class boundaries, since the variable is a continuous random variable. Note that if X is assumed to have a normal distribution, then

Z = \frac{X - \mu}{\sigma} \approx \frac{X - \bar{X}}{S}

has the standard normal distribution. For example,

P_1 = P\{X < 4.5\} = P\left\{\frac{X - \bar{X}}{S} < \frac{4.5 - 10.37}{6.1}\right\} = P\{Z < -0.96\} = 0.5 - P\{0 < Z < 0.96\} = 0.1685

P_3 = P\{8.5 < X < 12.5\} = P\left\{\frac{8.5 - 10.37}{6.1} < Z < \frac{12.5 - 10.37}{6.1}\right\} = P\{-0.31 < Z < 0.35\} = P\{0 < Z < 0.31\} + P\{0 < Z < 0.35\} = 0.1217 + 0.1368 = 0.2585

The remaining probabilities are computed in the same way.
The above results may be summarized in the following table so as to facilitate the remaining calculation.

Class boundaries    O_i     P_i      e_i = nP_i    (O_i − e_i)²/e_i
< 4.5               13     0.1685      10.6             0.536
4.5 – 8.5           18     0.2098      13.2             1.731
8.5 – 12.5           6     0.2585      16.3             6.496
12.5 – 16.5         10     0.2070      13.0             0.709
> 16.5              16     0.1562       9.8             3.855
Total               63     1.0000      63              13.33
The calculated value is 13.33. The number of degrees of freedom is k − 1 − 2 = 5 − 1 − 2 = 2, since two parameters (μ and σ) were estimated from the data, so the tabulated value is χ²_{0.01}(2) = 9.21. Since the calculated value is greater than the tabulated value, we reject the null hypothesis that the observations come from the normal distribution. Accordingly, we conclude that the fit is not good; in other words, the fitted curve does not describe the given frequency distribution.
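Finally, the whole normal-fit procedure can be scripted. The following is a sketch in Python, assuming numpy and scipy are available, using the class marks, boundaries, and frequencies from the example above; small rounding differences from the hand computation are expected.

```python
import numpy as np
from scipy.stats import norm, chi2

freq = np.array([13, 18, 6, 10, 16], dtype=float)   # observed frequencies O_i
marks = np.array([2.5, 6.5, 10.5, 14.5, 18.5])      # class marks x_i
n = freq.sum()

mean = (freq * marks).sum() / n                             # X-bar, about 10.37
s = np.sqrt((freq * (marks - mean) ** 2).sum() / (n - 1))   # S, about 6.1

# Class boundaries, treating the first and last classes as open-ended
cuts = np.array([-np.inf, 4.5, 8.5, 12.5, 16.5, np.inf])
probs = np.diff(norm.cdf(cuts, loc=mean, scale=s))  # P_i for each class
expected = n * probs                                # e_i = n * P_i

stat = ((freq - expected) ** 2 / expected).sum()    # close to the hand value 13.33
df = len(freq) - 1 - 2                              # k - 1 - 2: two parameters estimated
print(f"chi-square = {stat:.2f}, critical value = {chi2.ppf(0.99, df):.2f}")
```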