
CHAPTER - SIX

THE CHI-SQUARE DISTRIBUTION


OBJECTIVES:
The aim of this unit is to provide the learner with the basic applications of the Chi-Square
distribution in the analysis of frequencies.

At the end of the unit, the reader is expected to:


 Perform the Chi-Square test of association (independence)

 Conduct the Chi-Square test of goodness of fit to verify whether a distribution fitted to a
set of observations is appropriate or not.
1.1 INTRODUCTION

Most research problems call for determining whether an association or interdependence exists
between two or more variables. To this end, statistical methods help in measuring the
relationship between variables. One such method is the Chi-Square test of independence
(association). This method is applied when we have two variables, and it is used to detect the
existence or non-existence of association between them. It should be noted, however, that the
Chi-Square test of association does not measure the degree of association or relationship. The
Chi-Square distribution is also used in evaluating the goodness of fit of a distribution to a
given set of data. This kind of test is referred to as the Chi-Square test of goodness of fit,
and it is of great importance in statistics.

1.2 THE CHI-SQUARE DISTRIBUTION

The chi-square (χ²) distribution is obtained from the values of the ratio of the sample variance
to the population variance multiplied by the degrees of freedom. That is, when the population is
normally distributed with population variance σ², the quantity (n − 1)S²/σ² follows a chi-square
distribution with n − 1 degrees of freedom.
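As an illustrative sketch (not part of the textbook derivation, and assuming NumPy is available), this claim can be checked by simulation: the ratio (n − 1)S²/σ² computed from many normal samples should behave like a chi-square variable with n − 1 degrees of freedom, whose mean is n − 1 and whose variance is 2(n − 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 10, 4.0

# Draw many samples of size n from a normal population with variance sigma^2
samples = rng.normal(loc=5.0, scale=np.sqrt(sigma2), size=(100_000, n))

# Form the ratio (n - 1) * S^2 / sigma^2 for each sample
ratios = (n - 1) * samples.var(axis=1, ddof=1) / sigma2

# A chi-square variable with n - 1 = 9 degrees of freedom has
# mean 9 and variance 18; the simulated values should be close.
print(round(ratios.mean(), 1), round(ratios.var(), 1))
```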

Properties of the Chi-Square

 Chi-square is non-negative. It is the ratio of two non-negative values, and therefore must be
non-negative itself.
 Chi-square is non-symmetric.
 There are many different chi-square distributions, one for each number of degrees of freedom.
 The number of degrees of freedom when working with a single population variance is n − 1.

The Chi-Square distribution will be used in investigating whether the expected frequencies are
significantly different from the observed frequencies obtained from the sample.
There are three cases in which we apply this kind of test.
i) Test of goodness of fit: testing whether a given model is acceptable or not.
ii) Chi-Square test of independence: testing whether two attributes are associated or not.
iii) Chi-Square test of homogeneity: testing whether several populations are homogeneous with
respect to a certain classification.
THE CHI-SQUARE (χ²) TEST:
In most statistical tests, our decisions are based on the assumption that the population is
normally distributed. When this assumption about the population cannot be made, it is necessary
to use the chi-square (χ²) test. This test is suitable for the nominal or ordinal scale of
measurement. The nominal scale of measurement deals with data that can only be classified into
categories, such as male and female, or freshmen, juniors and seniors, and so on. There is no
particular order to these groupings, and they are mutually exclusive, so that an item in one
category is not included in another category. The ordinal scale of measurement assigns different
ranks to the categories: one category may be superior in standing, another may be good or fair,
and so on. The χ² test is used for analyzing qualitative variables such as opinions of persons,
religious affiliations, smoking habits, etc. It deals with judgments about proportions of two or
more populations.

Properties of the chi-square distribution

1- It involves squared observations and hence is always positive, i.e., greater than or equal to
zero.
2- The distribution is not symmetrical. It is skewed to the right, so its skewness is positive.
However, as the number of degrees of freedom increases, the chi-square distribution approaches a
symmetrical distribution.
3- Similar to the t-distribution, there is a family of chi-square distributions: a particular
distribution for each number of degrees of freedom.
The number of degrees of freedom of the χ²-distribution is determined by the number of categories
in which the various attributes of the sample are placed: if there are k categories, the number
of degrees of freedom (df) is (k − 1). For two or more independent classifications (as in a
contingency table), the df is (k − 1)(r − 1), where r is the number of rows and k the number of
columns. For example, if a sample of 100 students were categorized as freshmen, sophomores,
juniors and seniors, then there are four categories and k is 4.

So the degrees of freedom df = k − 1 = 3.


The following illustration shows the family of χ² curves with varying degrees of freedom; it can
be seen that as the number of degrees of freedom increases, the χ² distribution approaches the
normal curve.

The χ² test is used to test whether there is a significant difference between the observed number
of responses in each category and the expected number of responses for that category under the
assumption of the null hypothesis. In other words, the objective is to find how well the
distribution of observed frequencies (fo) fits the distribution of expected frequencies (fe).
Hence this test is also called the goodness-of-fit test.

Example:
Find the critical value of χ² from the table of the χ²-distribution if the level of significance
α is 0.05 and the degrees of freedom is 2.

Answer: χ²0.05(2) = 5.991
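The same critical value can also be obtained programmatically; the sketch below uses SciPy's chi2 distribution object (an assumption that SciPy is available) rather than a printed table.

```python
from scipy.stats import chi2

alpha = 0.05   # level of significance
df = 2         # degrees of freedom

# The critical value leaves an area of alpha in the right tail,
# so it is the (1 - alpha) quantile of the chi-square distribution.
critical = chi2.ppf(1 - alpha, df)
print(round(critical, 3))  # 5.991
```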

1.3 CHI-SQUARE TEST OF INDEPENDENCE (ASSOCIATION)

The chi-square test of independence is a bivariate statistical technique that is used to detect the
existence of association between two attributes or variables. It is of great interest to know the
existence of association between variables. Some examples of such variables are the following.
 Type of school to which families send their children and income of families.
 Religion and family size.
 Time studied and exam result.
 Demand and price of a commodity.
 Age and number of teeth.
It is one of the most commonly applied statistical techniques in research conducted in different
disciplines.
Suppose that we have two attributes (characteristics) say A and B. We want to test the hypothesis
H0: There is no association between attribute A and attribute B.
Versus the alternative
HA: There is association between attribute A and attribute B.
In the test for independence, the claim is that the row and column variables are independent of
each other. This is the null hypothesis. The test statistic used is the same as in the chi-square
goodness-of-fit test, and the principle behind the test for independence is the same as the
principle behind the goodness-of-fit test. The test for independence is always a right-tail test.
If attribute A has r categories (levels) and attribute B has c categories (levels), then the table in
which the two attributes (variables) are cross classified contains r rows and c columns. The table
has rc cells. This table is usually referred to as rxc (r by c) contingency table.
Now suppose a sample of size n is taken and cross-classified. Let Oij denote the observed
frequency of the ith category (level) of A and the jth category (level) of B. Recall that our
interest is to test the null hypothesis that there is no association between the two attributes A
and B. The test statistic to be used is:

χ² = Σᵢ₌₁ʳ Σⱼ₌₁ᶜ (Oij − eij)² / eij

where eij = n P{Ai ∩ Bj}, and P{Ai ∩ Bj} is the probability of cell (i, j).

If the null hypothesis is true, then

eij = n P{Ai ∩ Bj} = n P(Ai) P(Bj) = n (Oi./n)(O.j/n) = Oi. O.j / n

where Oi. is the total frequency of the ith row and O.j is the total frequency of the jth column.

The above test statistic has a chi-square distribution with (r − 1)(c − 1) degrees of freedom.
The rejection criterion is:

χ² > χ²α[(r − 1)(c − 1)]

The multiplication rule says that if two events are independent, then the probability of both
occurring is the product of the probabilities of each occurring. This is key to working the test
for independence. If you end up rejecting the null hypothesis, then the assumption must have been
wrong, and the row and column variables are dependent. Remember, all hypothesis testing is done
under the assumption that the null hypothesis is true.

Example: To test the hypothesis that color of eye and color of hair are associated, data on color
of eye and color of hair for 6,800 individuals were compiled.

                    Hair
           Fair   Brown   Black   Red
Eye Blue   1768    808     190     47
    Green   946   1387     746     43
    Brown   115    444     288     18
Test whether there is association between the two attributes at 1% level of significance.
Solution:
The hypothesis we want to test is:
H0: There is no association between color of eye and color of hair.
HA: There is association between color of eye and color of hair.

                    Hair
           Fair   Brown   Black   Red    Oi.
Eye Blue   1768    808     190     47    2813
    Green   946   1387     746     43    3122
    Brown   115    444     288     18     865
    O.j    2829   2639    1224    108    O.. = 6800
The values of eij for the different combinations of i and j are calculated using the formula
eij = Oi. O.j / n and presented in the following table.

eij       1         2        3       4
1     1170.29   1091.69   506.34   44.68
2     1298.84   1211.61   561.96   49.58
3      359.87    335.70   155.70   13.74

χ² = Σᵢ₌₁³ Σⱼ₌₁⁴ (Oij − eij)²/eij
   = (1768 − 1170.29)²/1170.29 + (808 − 1091.69)²/1091.69 + ... + (18 − 13.74)²/13.74
   = 1074.43
At the 1% level, the rejection region is χ² > χ²0.01[(3 − 1)(4 − 1)] = 16.81. Since the calculated
value, 1074.43, is greater than the tabulated value, we reject the null hypothesis and conclude
that there is association between the two attributes, eye color and hair color.
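As an illustrative sketch of how this computation might be checked in software (assuming SciPy is available), chi2_contingency carries out exactly the row-total-times-column-total expected-frequency calculation shown above:

```python
from scipy.stats import chi2, chi2_contingency

# Observed eye-color (rows) by hair-color (columns) frequencies
observed = [
    [1768,  808, 190, 47],   # Blue
    [ 946, 1387, 746, 43],   # Green
    [ 115,  444, 288, 18],   # Brown
]

stat, p, df, expected = chi2_contingency(observed)
critical = chi2.ppf(0.99, df)  # 1% level, df = (3-1)(4-1) = 6

print(f"chi-square = {stat:.2f}, df = {df}, critical = {critical:.2f}")
# The statistic (about 1074) far exceeds the critical value (16.81),
# so the null hypothesis of no association is rejected.
```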

Exercise:
Suppose that 500 university students were randomly selected and classified by year and
smoking habit.
Smoking habit
Year Non-Smokers Casual-Smokers Heavy-Smokers
Freshman 90 42 22
Sophomore 65 37 36
Junior 45 28 30
Senior 25 43 37

Test whether the two attributes, year and smoking habit, are related (associated) or not.
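A hedged sketch of how an answer to this exercise could be checked (again assuming SciPy is available); the conclusion is left to the reader:

```python
from scipy.stats import chi2_contingency

# Year (rows) by smoking habit (columns)
observed = [
    [90, 42, 22],   # Freshman
    [65, 37, 36],   # Sophomore
    [45, 28, 30],   # Junior
    [25, 43, 37],   # Senior
]

stat, p, df, expected = chi2_contingency(observed)
print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p:.4f}")
# Compare the statistic with the tabulated value chi-square_alpha(df)
# at your chosen level of significance.
```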

TEST OF ASSOCIATION OF ATTRIBUTES (TEST OF INDEPENDENCE)


In real-life situations, sometimes our interest may be to determine whether two classifications
or variables are dependent or independent. For instance, we may be interested to know whether the
qualification of employees and their salary are dependent or not, or whether advertising expense
and sales of a company are dependent or not. In such cases, we apply tests of independence.

Suppose we have two classifications: classification 1 consisting of r categories and
classification 2 consisting of s categories. Take a random sample and classify each item into one
of the r × s categories, called cells.

Example 20.9
Suppose we are interested to check whether qualification and salary of employees are dependent or
not. Then we may classify qualification into three categories (r = 3): 12 complete, Diploma
holder, and First Degree or higher.

We may also classify salary into three categories (s = 3): less than 200 Birr, 200 up to 499
Birr, and 500 Birr or more. Then we randomly select employees and classify them into one of the
r × s = 3 × 3 = 9 categories (cells). Suppose a random sample of 80 employees is taken and the
following result is obtained.

                          Qualification
Salary             12 complete   Diploma holder   1st degree or higher   Row total
< 200 Birr             10              2                   0                12
200 – 499 Birr         16             20                   2                38
500 Birr or more        6              2                  22                30
Column Total           32             24                  24                80

Notation: Ors = observed frequency of the rth row and sth column.
          Ers = expected frequency of the rth row and sth column.
Ers is computed as:

Ers = (rth row total) × (sth column total) / overall total

Consider the above example:

O11 = 10 and E11 = (1st row total) × (1st column total) / overall total = (12 × 32)/80 = 4.8
O12 = 2  and E12 = (1st row total) × (2nd column total) / overall total = (12 × 24)/80 = 3.6
O13 = 0  and E13 = (1st row total) × (3rd column total) / overall total = (12 × 24)/80 = 3.6
⋮
O33 = 22 and E33 = (3rd row total) × (3rd column total) / overall total = (30 × 24)/80 = 9
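The expected-frequency formula above is just an outer product of the marginal totals divided by the overall total; a minimal sketch of computing every cell at once (assuming NumPy is available):

```python
import numpy as np

# Observed frequencies: salary (rows) by qualification (columns)
O = np.array([
    [10,  2,  0],   # < 200 Birr
    [16, 20,  2],   # 200 - 499 Birr
    [ 6,  2, 22],   # 500 Birr or more
])

row_totals = O.sum(axis=1)   # [12, 38, 30]
col_totals = O.sum(axis=0)   # [32, 24, 24]
n = O.sum()                  # 80

# E_rs = (row total * column total) / overall total, for every cell at once
E = np.outer(row_totals, col_totals) / n
print(E[0, 0], E[0, 1], E[2, 2])  # 4.8 3.6 9.0
```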

In tests of independence, the null and alternative hypotheses are of the form:
HO : The two classifications are independent
H1 : The two classifications are dependent.
The null hypothesis can also be written as “ There is no association between the two
classifications”.

The test statistic used to test the hypothesis of independence is called a chi-square test.
A chi-square distribution, denoted by χ², is a continuous probability distribution. Unlike the
normal distribution, the chi-square distribution is asymmetric (not symmetric); it is a
positively (right) skewed distribution and cannot assume a negative value. The χ² values for a
given level of significance (α) and number of degrees of freedom (d.f.) can be read from the
χ²-distribution table.

Notation: χ²α denotes the value of χ² for which the area to its right is α, with a given number
of degrees of freedom.
This is displayed below.

For example, to find χ²0.05(28), look up the value α = 0.05 and, under this value of α, look for
the number of degrees of freedom (d.f.), which is equal to 28, in the chi-square distribution
table. From the chi-square table, this value is 41.337. Similarly, χ²0.01(15) = 30.578.

To test for independence, the critical value is:

χ²α[(r − 1)(s − 1)], i.e., d.f. = (r − 1)(s − 1), where r is the number of rows and s is the
number of columns. To accept or reject the null hypothesis, compare χ²cal with this critical
value χ²α[(r − 1)(s − 1)] (the tabulated value).
The test criterion is to reject HO if:

χ²cal > χ²α[(r − 1)(s − 1)]

Example 20.10
Look at the previous example about qualification and salary.
Test if there is a relationship between qualification and salary at the 5 percent level of
significance.

Solution: -
HO : Qualification and salary are independent
H1 : Qualification and salary are dependent
α = 0.05
Computing for the expected frequencies, we have the following table.

                          Qualification
Salary             12 complete   Diploma holder   1st degree or higher   Row total
< 200 Birr          10 (4.8)        2 (3.6)            0 (3.6)              12
200 – 499 Birr      16 (15.2)      20 (11.4)           2 (11.4)             38
500 Birr or more     6 (12)         2 (9)             22 (9)                30
Column total           32             24                  24                80

The values in brackets are the expected frequencies.


The test statistic is:

χ²cal = Σ (Ors − Ers)² / Ers,   where Ors is the observed and Ers the expected frequency

      = (10 − 4.8)²/4.8 + (2 − 3.6)²/3.6 + (0 − 3.6)²/3.6 + ... + (22 − 9)²/9
      = 51.44

α = 0.05 and d.f. = (r − 1)(s − 1) = (3 − 1)(3 − 1) = 2 × 2 = 4

The critical value is:

χ²α[(r − 1)(s − 1)] = χ²0.05(4) = 9.488 (from the chi-square table)

As χ²cal > χ²[(r − 1)(s − 1)] tabulated, i.e., 51.44 > 9.488, HO is rejected and H1 is accepted;
i.e., salary of employees and their qualification are dependent, that is, they are associated.
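Again as an illustrative sketch (assuming SciPy is available), the whole of this example collapses into one library call:

```python
from scipy.stats import chi2, chi2_contingency

# Salary (rows) by qualification (columns), from Example 20.9
observed = [
    [10,  2,  0],
    [16, 20,  2],
    [ 6,  2, 22],
]

stat, p, df, expected = chi2_contingency(observed)
critical = chi2.ppf(0.95, df)   # alpha = 0.05, df = (3-1)(3-1) = 4

print(f"chi-square = {stat:.2f} vs critical = {critical:.3f}")
# The statistic (about 51.4) exceeds the critical value 9.488,
# so qualification and salary are judged dependent.
```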

Self-Review 9.3
An electronics company wants to check if advertisement has a significant effect on the number of
TV sets that are sold within six months of production. A random sample of 600 TV sets reveals the
following results.

                        Number of TV sets sold   Number of TV sets not
                        within 6 months          sold within 6 months
Before advertisement          150                       150
After advertisement           165                       135

Is the effect of advertisement significant? Use α = 0.05.

1.4 CHI-SQUARE TEST OF GOODNESS OF FIT

The idea behind the chi-square goodness-of-fit test is to see if the sample comes from a
population with the claimed distribution. Another way of looking at it is to ask whether the
frequency distribution fits a specific pattern. Here we want to test whether observed data (a
frequency distribution) are sufficiently close to a theoretical (fitted) distribution.
Suppose that the data are classified into k classes (one-way classification). Let us designate
the expected and observed frequencies of the ith class by ei and Oi, respectively. Expected
frequencies are calculated based on the proposed distribution. Observed frequencies, on the other
hand, are those obtained by observation; they are the sample frequencies.
The test statistic for the goodness-of-fit test is:

χ² = Σᵢ₌₁ᵏ (Oi − ei)² / ei

The test statistic has a chi-square distribution when the following assumptions are met:

 The data are obtained from a random sample.
 The expected frequency of each category must be at least 5, so that the chi-square
approximation to the distribution of the test statistic is adequate.

If the above assumptions are satisfied, the test statistic will have a chi-square distribution
with (k − 1) degrees of freedom if no parameter is estimated in the process.

The idea is that if the observed frequencies are really close to the claimed (expected)
frequencies, then the squared deviations will be small. If the sum of these weighted squared
deviations is small, the observed frequencies are close to the expected frequencies and there is
no reason to reject the claim that the sample came from that distribution. Only when the sum is
large do we have reason to question the distribution. In other words, we reject the null
hypothesis when the value of the calculated test statistic is very large. Therefore, the
chi-square goodness-of-fit test is always a right-tail test.

The rejection region is χ² > χ²α(k − 1).

Example: In an experiment on pea breeding, the following frequencies of seeds were obtained:
316 round and yellow, 102 wrinkled and yellow, 109 round and green, and 33 wrinkled and green.
Theory predicts that the frequencies should be in the proportions 9:3:3:1. Apply the chi-square
goodness-of-fit test to examine the correspondence between theory and practice.

Solution

The hypothesis we want to test is

H0: The proportion of the frequencies in the four classes is 9:3:3:1

HA: Not H0

The test statistic is χ² = Σᵢ₌₁ᵏ (Oi − ei)² / ei.

In the given problem there are four classes. These are:

i=1 ….round and yellow

i=2…..wrinkled and yellow

i=3…..round and green

i=4…..wrinkled and green

The expected frequencies are ei = nPi, where n is the sample size and Pi is the probability of
the ith class. Here n = 316 + 102 + 109 + 33 = 560.

Class   Observed (Oi)   Pi     Expected (ei = nPi)   (Oi − ei)²/ei
1           316         9/16          315                0.0032
2           102         3/16          105                0.0857
3           109         3/16          105                0.1524
4            33         1/16           35                0.1143
Total       560          1            560                0.3556

Thus the calculated test statistic becomes χ² = 0.3556.

Since the calculated value is less than the tabulated value at 5% (χ² = 0.3556 < χ²0.05(3) =
7.81), we accept the null hypothesis and conclude that the observed and expected frequencies are
close to one another.
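A sketch of the same test in software (assuming SciPy is available); scipy.stats.chisquare computes exactly the Σ(Oi − ei)²/ei statistic:

```python
from scipy.stats import chisquare

observed = [316, 102, 109, 33]
n = sum(observed)                        # 560
ratios = [9, 3, 3, 1]                    # theoretical 9:3:3:1 proportions
expected = [n * r / 16 for r in ratios]  # [315, 105, 105, 35]

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.4f}, p-value = {p:.3f}")
# The statistic is far below the 5% critical value 7.81 (df = 3),
# so the 9:3:3:1 theory is not rejected.
```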

Example 2: Fit a normal distribution to the following frequency distribution and test whether the
fit is good.

Class Frequency
1-4 13
5-8 18
9-12 6
13-16 10
17-20 16

Solution:

The hypothesis we want to test is that the observations come from a normal population. The test
statistic is χ² = Σᵢ₌₁ᵏ (Oi − ei)² / ei.

The ei's are the expected frequencies, which will be computed using the fitted normal curve.
The estimators of μ and σ² are

X̄ = Σᵢ₌₁ᵏ fi xi / n   and   S² = Σᵢ₌₁ᵏ fi (xi − X̄)² / (n − 1)

where xi is the class mark of the ith class.

Class marks (xi)   Frequency (fi)
      2.5               13
      6.5               18
     10.5                6
     14.5               10
     18.5               16
    Total               63

X̄ = Σᵢ₌₁⁵ fi xi / n = 653.5/63 = 10.37   and   S² = Σᵢ₌₁⁵ fi (xi − X̄)²/(n − 1) = 37.15, so
S = 6.1.

In computing the expected frequencies we use the class boundaries, since the variable is a
continuous random variable. Note that if X is assumed to have a normal distribution, then
Z = (X − μ)/σ ≈ (X − X̄)/S will have the standard normal distribution.

Clearly, ei = nPi, and

P1 = P{X < 4.5} = P{Z < (4.5 − 10.37)/6.1} = P{Z < −0.96} = 0.5 − P{0 < Z < 0.96} = 0.1685

P2 = P{4.5 < X < 8.5} = P{−0.96 < Z < −0.31} = P{0 < Z < 0.96} − P{0 < Z < 0.31}
   = 0.3315 − 0.1217 = 0.2098

P3 = P{8.5 < X < 12.5} = P{−0.31 < Z < 0.35} = P{0 < Z < 0.31} + P{0 < Z < 0.35}
   = 0.1217 + 0.1368 = 0.2585

P4 = P{12.5 < X < 16.5} = P{0.35 < Z < 1.01} = P{0 < Z < 1.01} − P{0 < Z < 0.35}
   = 0.3438 − 0.1368 = 0.2070

P5 = P{X > 16.5} = 1 − Σᵢ₌₁⁴ Pi = 1 − 0.1685 − 0.2098 − 0.2585 − 0.2070 = 0.1562

The above results may be summarized in the following table so as to facilitate the remaining
calculation.

Class boundaries   Oi     Pi      ei = nPi   (Oi − ei)²/ei
    < 4.5          13   0.1685     10.6         0.536
   4.5 – 8.5       18   0.2098     13.2         1.731
   8.5 – 12.5       6   0.2585     16.3         6.496
  12.5 – 16.5      10   0.2070     13.0         0.709
    > 16.5         16   0.1562      9.8         3.855
    Total          63     1          63        13.33
The calculated value is 13.33. The tabulated value is χ²0.01(5 − 1 − 2) = χ²0.01(2) = 9.21.
Since the calculated value is greater than the tabulated value, we reject the null hypothesis
that the observations come from a normal distribution. Accordingly, we conclude that the fit is
not good; in other words, the fitted curve does not describe the given frequency distribution.
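The whole fitting procedure can be sketched as follows (assuming SciPy and NumPy are available; exact z-values are used instead of rounded table values, so the statistic differs slightly from 13.33):

```python
import numpy as np
from scipy.stats import norm, chi2

marks = np.array([2.5, 6.5, 10.5, 14.5, 18.5])   # class marks
freq = np.array([13, 18, 6, 10, 16])             # observed frequencies
n = freq.sum()                                   # 63

# Estimate the normal parameters from the grouped data
mean = (freq * marks).sum() / n                      # about 10.37
var = (freq * (marks - mean) ** 2).sum() / (n - 1)   # about 37.15
s = np.sqrt(var)

# Class boundaries; the outer classes are open-ended
bounds = np.array([-np.inf, 4.5, 8.5, 12.5, 16.5, np.inf])
cdf = norm.cdf(bounds, loc=mean, scale=s)
p = np.diff(cdf)          # cell probabilities P1..P5
e = n * p                 # expected frequencies

stat = ((freq - e) ** 2 / e).sum()
critical = chi2.ppf(0.99, df=5 - 1 - 2)   # two parameters were estimated
print(f"chi-square = {stat:.2f}, critical = {critical:.2f}")
# The statistic (about 13) exceeds the critical value,
# so the normal fit is rejected.
```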
