Final Biostatistics Lecture Notes
BIOSTATISTICS
STATISTICS - is the science of collecting, summarizing, presenting and interpreting data, and of using them to
estimate the magnitude of associations and test hypotheses
BIOSTATISTICS is a branch of applied statistics that is concerned with the application of statistical methods to
biological events (medicine, clinical trials, demography, population estimation, modelling, community diagnosis
and surveys). When the different statistical methods are applied in biological, medical and public health data, they
constitute the discipline of biostatistics. In general, the purpose of using biostatistics is to gather data that can be
used to provide honest information about unanswered biomedical questions. Biostatistics is now considered an
essential tool in the planning and delivery of health care systems. The knowledge and ability to use biostatistical
techniques have also become increasingly important in health sciences. The medical practitioner in the 21st century
will need a far greater ability to evaluate new information than in the past. A good understanding of biostatistics
can improve clinical thinking, decision making, evaluations and medical research. The role of biostatistics in
medical education is now well recognised.
• Constant – A quantity that does not vary, e.g. in biostatistics, the mean and standard deviation are considered constant for a
population
• Variable – A characteristic which takes different values for different persons, places or things, such as height, weight,
blood pressure
• Parameter – It is a constant that describes a population e.g. in a college there are 40% girls. This describes the
population, hence it is a parameter.
• Statistic – Statistic is a constant that describes the sample e.g. out of 200 students of the same college, 45% are girls.
This 45% will be a statistic as it describes the sample
• Attribute - A characteristic based on which the population can be described into categories or classes e.g. gender,
caste, religion.
Essential features of statistics
a. Principles and methods for the collection, presentation, analysis and interpretation of numerical data of
different kinds.
1. Observational data, qualitative data.
2. Data that has been obtained by a repetitive operation.
3. Data affected to a marked degree by a multiplicity of causes.
b. The science and art of dealing with variation in such a way as to obtain reliable results.
c. Controlled objective methods whereby group trends are abstracted from observations on many separate
individuals.
d. The science of experimentation which may be regarded as mathematics applied to observational data.
WHY STATISTICS?
Variability in measurement can be handled using statistics. E.g.: the investigator makes observations
according to his judgement of the situation. (Depending upon his skills, knowledge, and experience).
Statistics provides a way of organizing information on a wider and more formal basis than relying on the
exchange of anecdotes and personal experience.
There is a great deal of intrinsic (inherent) variation in most biological processes.
Public health and medicine are becoming increasingly quantitative. As technology progresses, the physician
encounters more and more quantitative rather than descriptive information. In one sense, statistics is the
language of assembling and handling quantitative material. Even if one’s concern is only with the results of
other people’s manipulation and assemblage of data, it is important to achieve some understanding of this
language to interpret their results properly.
The planning, conduct, and interpretation of much of medical research are becoming increasingly reliant on
statistical technology. Is this new drug or procedure better than the one commonly in use? How much better?
What, if any, are the risks of side effects associated with its use? In testing a new drug how many patients
must be treated, and in what manner, to demonstrate its worth? What is the normal variation in some clinical
measurements? How reliable and valid is the measurement? What is the magnitude and effect of laboratory
and technical error? How does one interpret abnormal values?
Statistics pervades the medical literature. As a consequence of the increasingly quantitative nature of public
health and medicine and its reliance on statistical methodology, the medical literature is replete with reports
in which statistical techniques are used extensively.
Epidemiology and Biostatistics are sister sciences or disciplines.
Epidemiology collects facts relating to a group of population in places, times and situations.
Biostatistics converts all the facts into figures and at the end translates them into facts, interpreting
the significance of their results.
Epidemiology and biostatistics both deal with the facts-figures-facts
QUANTITATIVE METHODOLOGY
USES OF BIOSTATISTICS
1. To test whether the difference between two populations is real or by chance occurrence.
2. To study the correlation between attributes in the same population.
3. To evaluate the efficacy of vaccines/drugs.
4. To measure mortality and morbidity.
5. To evaluate the achievements of public health programs/research
6. To fix priorities in public health programs
7. To help promote health legislation and create administrative standards for oral health.
DATA
What is DATA?
Data are facts or figures, or information, especially numerical facts, collected together for reference or information.
Data is the plural of "datum". The collective recording of observations either numerical or otherwise is called data.
Many of the steps to conducting a field investigation rely on identifying relevant existing data or collecting new
data that address the key investigation objectives.
Types of data
Qualitative
Quantitative
Qualitative data
Observation or information characterized by measurement on a categorical scale (dichotomous, nominal or ordinal
scale).
Data that describe the quality of the subject studied. E.g. gender, ethnicity, death or survival, nationality etc.
Generally describes in terms of percentages or proportions. Mostly displayed by using contingency tables, pie
charts, and bar charts.
Quantitative data
Data in numerical quantities such as continuous measurements or counts. Observation for which the differences
between numbers have meaning on a numerical scale. They measure the quantity of something.
Types of numerical scales;
Continuous scale (e.g. Age, height)
Discrete scale (e.g. Number of pregnancies)
Described in terms of means and standard deviation.
Frequency tables and histograms are most often used to display this type of information.
How to analyse DATA?
Using STATISTICS!
A small representative ‘sample’ is used to study a big ‘population’
Why?
Expensive to conduct a very large study
Impossible to collect information from everyone in the population
POPULATIONS AND SAMPLES
POPULATION - is the collection or set of all of the values that a variable may have. The population is a
complete collection of data on the group under study. e.g.: If we are interested in the weights of students
enrolled in Vet. Med at the University of Abuja, then our population consists of the weights of all of these
students, and our variable of interest is the weight.
Population Size (N): The number of elements in the population is called the population size and is denoted by N.
Populations can be thought of as existing or conceptual. Existing populations are well–defined sets of data
containing elements that could be identified explicitly while Conceptual populations are non–existing, yet
visualized, or imaginable sets of measurements. Examples:
Existing populations
a. Red blood cells (RBC) counts of children diagnosed with malaria as of October 30, 2016 at the UATH
Gwagwalada.
b. Amount of active drug in all 50 mg berenil sachet manufactured in October 2009.
c. Presence or absence of prior myocardial infarction (MI) in all male horses between 3 and 6 years of age brought
to VTH Gwagwalada.
Conceptual populations
Could be thought of as characteristics of all people with a disease, now or shortly, also as the outcomes of some
treatment that were given to a large group of subjects.
a. Bioavailability of a drug’s oral dose relative to i. v. dose in all healthy subjects under identical conditions.
b. Presence or absence of MI’s in all current and future high blood pressure patients who receive short–acting
calcium channel blockers.
c. Positive or negative result of all pregnant women using a particular type of pregnancy test kit.
Target Population
Target population refers to the ENTIRE group of individuals or objects to which researchers are interested in
generalizing the conclusions. The target population usually has varying characteristics and it is also known as the
theoretical population.
Accessible Population
The accessible population is the population in research to which the researchers can apply their conclusions. This
population is a subset of the target population and is also known as the study population. It is from the accessible
population that researchers draw their samples. The factor which determines the choice of the population is the
problem under investigation. The population should be such that it can provide the most authentic and dependable
data necessary for solving the problem and should be such that the generalizations or conclusions from the study
can validly apply to it. So when a researcher is specifying his research population, he is setting some standard
against which his study will be judged.
SAMPLE
A sample is a part of a population. From the population, we select various elements on which we collect our data.
This part of the population on which we collect data is called the sample. E.g. suppose we are interested in
studying the characteristics of the weights of the students enrolled in Vet. Med. at the Uni. Abuja. If we randomly
select 30 students among the students in Vet. Med. at the Uni. Abuja and measure their weights, then the
weights of these 30 students form our sample. Sample Size (n): The number of elements in the sample is called the
sample size and is denoted by n. A sample is a collection of sampling units selected from the population.
Sampling unit: Is a member of the population.
SAMPLING
Sampling Methods
Non-probability Sampling: Sampling units do not have a known probability of being selected, as in convenience or
voluntary response surveys
PROBABILITY SAMPLING
In probability sampling, it is possible to both determine which sampling units belong to which sample and the
probability that each sample will be selected. The following sampling methods are examples of probability
sampling:
Simple Random Sampling
Stratified Sampling
Cluster Sampling
Systematic Sampling
Multistage Sampling (in which some of the methods above are combined in stages).
NON-PROBABILITY SAMPLING (Purposive selection)
With non-probability sampling methods, we do not know the probability that each population element will be
chosen, and/or we cannot be sure that each population element has a non-zero chance of being chosen.
Non-probability sampling methods offer two potential advantages - convenience and cost. The main disadvantage
is that non-probability sampling methods do not allow you to estimate the extent to which sample statistics are
likely to differ from population parameters. Only probability sampling methods permit that kind of analysis.
Examples of non-probability sampling methods are:
VOLUNTARY SAMPLE
A voluntary sample is made up of people who self-select into the survey. Often, these folks have a strong interest
in the main topic of the survey.
E.g. Suppose, that a news show asks viewers to participate in an online poll. This would be a volunteer sample.
The sample is chosen by the viewers, not by the survey administrator.
CONVENIENCE SAMPLE
A convenience sample is made up of people who are easy to reach.
SIMPLE RANDOM SAMPLING
A simple random sample (n) is drawn from a population (N) in such a way that every possible sample of size (n)
has an equal opportunity of being chosen.
Simple random sampling refers to any sampling method that has the following properties.
-The population consists of N objects.
-The sample consists of n objects.
-If all possible samples of n objects are equally likely to occur, the sampling method is called simple random
sampling.
There are many ways to obtain a simple random sample. One way would be the lottery method. Each of the N
population members is assigned a unique number. The numbers are placed in a bowl and thoroughly mixed. Then,
a blindfolded researcher selects n numbers. Population members having the selected numbers are included in the
sample.
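The lottery method can be sketched in a few lines of Python. This is only an illustration: the population of 200 numbered members, the sample size of 30 and the function name `simple_random_sample` are assumptions made for the example, not part of the notes.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every possible sample of size n is equally likely."""
    rng = random.Random(seed)          # seeded generator so the draw can be repeated
    return rng.sample(population, n)   # sampling without replacement, like drawing from a bowl

# Hypothetical population: each of the N = 200 members is assigned a unique number.
population_ids = list(range(1, 201))
sample_ids = simple_random_sample(population_ids, n=30, seed=1)

print(len(sample_ids), "members selected out of", len(population_ids))
```

Fixing the seed only makes the illustration repeatable; in practice the draw would be left unseeded.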
STRATIFIED SAMPLING
With stratified sampling, the population is divided into groups, based on some characteristic. Then, within each
group, a probability sample (often a simple random sample) is selected. The sample resulting from combining these
samples is called a stratified random sample. In stratified sampling, the groups are called strata. For example,
suppose we conduct a national survey. We might divide the population into groups or strata, based on geography -
north, east, south, and west. Then, within each stratum, we might randomly select survey respondents.
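A minimal sketch of the geographic example follows, assuming four hypothetical strata of 50 respondents each and an equal allocation of 5 respondents per stratum. The labels and sizes are invented for illustration.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of fixed size from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical strata based on geography; each member carries a made-up label.
strata = {
    "north": [f"N{i}" for i in range(50)],
    "east":  [f"E{i}" for i in range(50)],
    "south": [f"S{i}" for i in range(50)],
    "west":  [f"W{i}" for i in range(50)],
}

chosen = stratified_sample(strata, n_per_stratum=5, seed=1)
for name, members in chosen.items():
    print(name, members)
```

Every stratum contributes to the final sample, which is the defining feature of stratified sampling.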
CLUSTER SAMPLING
With cluster sampling, every member of the population is assigned to one, and only one, group. Each group is
called a cluster. A sample of clusters is chosen, using a probability method (often simple random sampling). Only
individuals within sampled clusters are surveyed.
Note the difference between cluster sampling and stratified sampling. With stratified sampling, the sample includes
elements from each stratum. With cluster sampling, in contrast, the sample includes elements only from sampled
clusters.
MULTISTAGE SAMPLING
In multistage sampling, we select a sample by using combinations of different sampling methods. E.g., in Stage 1,
we might use cluster sampling to choose clusters from a population, in Stage 2, we might use simple random
sampling to select a subset of elements from each chosen cluster for the final sample.
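The two-stage example can be sketched as below. The ten hypothetical villages (clusters) of 20 people each, and the choice of 3 clusters and 4 people per cluster, are assumptions for illustration only.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: simple random sample inside each chosen cluster."""
    rng = random.Random(seed)
    chosen_clusters = rng.sample(sorted(clusters), n_clusters)
    return {c: rng.sample(clusters[c], n_per_cluster) for c in chosen_clusters}

# Hypothetical population: every person belongs to one, and only one, village.
clusters = {f"village_{k}": [f"v{k}_person_{i}" for i in range(20)]
            for k in range(10)}

final_sample = two_stage_sample(clusters, n_clusters=3, n_per_cluster=4, seed=1)
for village, people in final_sample.items():
    print(village, people)
```

Only the sampled clusters contribute elements, in contrast to stratified sampling where every group is represented.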
SAMPLING ERRORS
Non-Sampling errors
o Coverage errors- due to non-response or noncooperation of the informant.
o Observational errors: interview bias, imperfect experimental technique.
o Processing errors: Statistical Analysis
DATA PRESENTATION
Two main types of data presentation are:
• Tabulation
• Graphic representation - charts and diagrams
TABULATION
Tables are a simple device used for the presentation of statistical data.
PRINCIPLES:
Tables should be as simple as possible. (2-3 small tables).
Data should be presented according to size or importance, chronologically or alphabetically.
Should be self-explanatory.
Each row and column should be labelled concisely and clearly.
A Specific unit of measure for the data should be given.
The title should be clear, concise and to the point.
Total should be shown.
Every table should contain a title as to what is depicted in the table.
In the small table, vertical lines separating the column may not be necessary.
If the data are not original, their source should be given in a footnote.
TYPES OF TABLES
3. The values of the variable are presented on the horizontal or X-axis and frequency on the vertical line Y-axis.
6. The scale of the division of both the axes should be proportional and the divisions should be marked along with the details of the variable and frequencies
presented on the axes.
BAR CHARTS
• Represents qualitative data.
• Bars can be either vertical or horizontal.
• Suitable scale is chosen
• Bars are usually equally spaced
• They are of three types:
• Simple bar chart- represents only one variable.
• Multiple bar chart - for each category of a variable there is a set of bars.
• Component /proportional bar chart, - individual bar is divided into 2 or more parts.
PIE CHART
HISTOGRAM
• Pictorial presentation of frequency distribution
• No space between the cells on a histogram.
• Class intervals are given on the horizontal axis and frequencies on the vertical axis
A histogram is the graph of the frequency distribution of continuous measurement variables. It is constructed based on the following principles:
a) The horizontal axis is a continuous scale running from one end of the distribution to the other. It should be labelled with the name of the variable and the units of
measurement.
b) For each class in the distribution a vertical rectangle is drawn with,
Its base on the horizontal axis extending from one class boundary of the class to the other class boundary, there will never be any gap between the histogram
rectangles.
The bases of all rectangles will be determined by the width of the class intervals.
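The counting that sits behind a histogram can be sketched as follows. The ten weights, the lower boundary of 50, the class width of 10 and the function name `class_frequencies` are all hypothetical values chosen for illustration.

```python
def class_frequencies(values, lower, width, n_classes):
    """Count how many values fall in each class [lower + k*width, lower + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)   # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical weights in kg; classes are 50-60, 60-70, 70-80, 80-90 with no gaps between them.
weights = [52, 55, 61, 64, 67, 70, 71, 73, 78, 85]
counts = class_frequencies(weights, lower=50, width=10, n_classes=4)

for k, c in enumerate(counts):
    low = 50 + 10 * k
    print(f"{low}-{low + 10}: " + "#" * c)   # one rectangle per class, drawn sideways
```

Because each class starts exactly where the previous one ends, the rectangles touch, which is why a histogram has no spaces between cells.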
Histogram
Descriptive statistics
Describe the frequency and distribution to characterize data collected from a group of samples to represent the population.
E.g. - Percentage of patients attending the diabetes clinic
– Gender, age group, education level of the patients
– Patients waiting time for doctors consultation
– Patients fasting glucose and HbA1c level
– Etc.
VARIABLE
A variable is a characteristic that can take on different values for different members of the group under study e.g. a group of university students will be found to differ in
gender, height, attitudes, intelligence and many ways. These characteristics are called variables.
• Categories of variable
– Continuous vs. Discrete
– Independent vs. Dependent
• Continuous variable
– Can take on any values on the measurement scale under study
– Do not fit into a finite number of categories
– Referred to as measurement data
– E.g. weight, height, age, blood pressure etc.
• Discrete variable
– Only designated values or integer values i.e. 1, 2, 3…
– Fit into limited categories
– Referred to as count data (dichotomous/ multichotomous)
• E.g. dichotomous
– Male-Female
– Yes-No
• E.g. multichotomous
– Malay-Chinese-Indian
– Man Utd-Arsenal-Chelsea-Man City-Liverpool
• Type of scales
– Nominal – classify observation that cannot be numerically arranged (no order)
– Ordinal – assign an order to categories so that one category is higher than another
– Interval / ratio –sequential ranking of values (as ordinal scales)
MEASURES OF STATISTICAL AVERAGES OR CENTRAL TENDENCY
• Central value around which all the other observations are distributed.
• Main objective is to condense the entire mass of data and to facilitate the comparison.
• the most common measures of central tendency that are used in sciences:
– MEAN
– MEDIAN
– MODE
Mode
• Most frequently occurring observation in a data is called mode
• Not often used in medical statistics.
• Is the most frequently occurring value in a set of discrete data
• Can be more than one mode if two or more values are equally common
E.g.
– Raw data: 1, 3, 4, 7, 2, 5, 9, 4, 6, 7, 8, 9, 3, 4, 9, 6, 4, 5, 2, 1, 6, 6, 7, 4, 3
– Arranged in order: 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 9, 9, 9
Mode = 4
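As a quick check of the example, the mode can be found by counting how often each value occurs. This is a minimal sketch using Python's standard library; the variable names are chosen here for illustration.

```python
from collections import Counter

data = [1, 3, 4, 7, 2, 5, 9, 4, 6, 7, 8, 9, 3, 4, 9, 6, 4, 5,
        2, 1, 6, 6, 7, 4, 3]

counts = Counter(data)                 # value -> number of occurrences
highest = max(counts.values())         # frequency of the most common value
# Keep every value that reaches the highest frequency, since there can be more than one mode.
modes = sorted(v for v, c in counts.items() if c == highest)

print("Mode(s):", modes, "occurring", highest, "times")  # Mode(s): [4] occurring 5 times
```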
Median
• The middle value when the observations are arranged in order of magnitude.
Mean
• The sample mean is an estimator available for estimating the population mean.
• Its value depends equally on all of the data which may include outliers.
• Refers to the arithmetic mean
• It is obtained by adding the individual observations divided by the total number of observations.
• Advantages – it is easy to calculate.
• Most useful of all the averages.
• Disadvantages – influenced by abnormal values.
• E.g. (N=10) (3, 4, 7, 2, 5, 7, 5, 5, 1, 2)
$\text{Mean} = \dfrac{3+4+7+2+5+7+5+5+1+2}{10} = \dfrac{41}{10} = 4.1$
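The same ten observations can be checked with Python's `statistics` module. The extra value of 60 appended below is a hypothetical outlier, added only to illustrate how an abnormal value pulls the mean while leaving the median almost unchanged.

```python
import statistics

data = [3, 4, 7, 2, 5, 7, 5, 5, 1, 2]

mean = statistics.mean(data)       # sum of observations divided by their number
median = statistics.median(data)   # middle value of the ordered observations

print("Mean:", mean)      # Mean: 4.1
print("Median:", median)  # Median: 4.5

# Hypothetical outlier to show the mean's sensitivity to abnormal values.
with_outlier = data + [60]
print("Mean with outlier:", round(statistics.mean(with_outlier), 2))
print("Median with outlier:", statistics.median(with_outlier))
```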
MEASUREMENT OF DISPERSION
• Dispersion is the degree of spread or variation of the variable about a central value.
• Helps to know how widely the observations are spread on either side of the average.
• Used to describe the variability (spread and dispersion) in a given sample
Dispersion measurement;
– Range
– Percentiles
– Variance
– Standard deviation
– Standard error
– Interquartile range
Range
– Defined as the difference between the value of the largest item and the smallest item.
– Gives no information about the values that lie between the extreme values.
– Difference between the highest and the lowest value
Percentiles
– Indicate the % of individuals who have values equal to or below a given value.
Variance
– Provides information about how individuals differ within sample.
Standard deviation (SD)
– Gives information about the spread/variability of scores around the mean.
The Standard error (SE)
– Indicates the certainty of the mean itself, i.e. how precisely the sample mean estimates the population mean.
Interquartile Range (IQR) the distance between 1st and 3rd quartile
Range
• Is the difference between the smallest and largest value in a set of observation
• Range = (the largest value – the smallest value)
– E.g. 3,5,6,7,9,10
– Range = 10 – 3 = 7
• Uses only extreme values and ignores the other values in the data set.
Variance
• The greater the S.D, the greater will be the magnitude of dispersion from the mean.
• The more widely the values are spread out, the larger the variance and the S.D.
Standard Error (of the Mean)
Interquartile Range (IQR)
• IQR is the distance between the 1st and 3rd quartile.
• It is not sensitive to extreme values (outliers).
• Thus, it is usually described together with the median in a skewed distribution of observation.
• Formula: IQR = (Q3 – Q1)
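The dispersion measures listed above can be computed together, as in this sketch on the six observations 3, 5, 6, 7, 9, 10. It assumes the sample versions of the variance and standard deviation (dividing by n - 1) and uses the default quartile method of Python's `statistics.quantiles`; other packages use slightly different quartile conventions, so the IQR may differ a little.

```python
import math
import statistics

data = [3, 5, 6, 7, 9, 10]

data_range = max(data) - min(data)            # largest value minus smallest value
variance = statistics.variance(data)          # sample variance (divides by n - 1)
sd = statistics.stdev(data)                   # sample standard deviation
se = sd / math.sqrt(len(data))                # standard error of the mean = SD / sqrt(n)
q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartiles; q2 is the median
iqr = q3 - q1                                 # IQR = Q3 - Q1

print("Range:", data_range)                   # Range: 7
print("Variance:", round(variance, 2))
print("SD:", round(sd, 2))
print("SE:", round(se, 2))
print("IQR:", iqr)
```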
DRAWING INFERENCE FROM RESEARCH DATA
Statistics could be classified into two: descriptive and inferential statistics. While descriptive statistics deals with the methods and techniques of summarising and
describing information (data), inferential statistics goes beyond mere summarising and description of data. Inferential statistics is concerned with gaining knowledge of
a population’s characteristics from information collected from a random sample of the population. In other words, it is concerned with drawing inference or
generalizations about the characteristics of a population based on data collected from a random sample of that population. Therefore, with inferential statistics, we can
draw conclusions that apply beyond the actual subjects studied and extend to other subjects not studied. Conclusions
that can apply only to the actual elements or subjects studied will be of limited applicability; and this can hardly be the aim of any meaningful research. Rather, any
meaningful research should be interested in conclusions that, although based on a limited number of subjects/elements actually studied, could still apply to other
subjects/elements not studied. Generally, this is what we desire to achieve in any research; and only inferential statistics can help us to realize such a desire. To this
extent, inferential statistics has contributed immensely to the development of research by providing more efficient ways of handling data and dealing with complex
problems. Our understanding of the effects has also widened through the application of inferential statistics.
Hypothesis Testing
Drawing inference about a population based on a random sample from that population involves formulation and testing of hypothesis. A hypothesis could be conceived
generally as an informed guess, a hunch or conjecture about the solution of the problem under investigation. However, in inferential statistics, the term assumes a very
specific meaning. In this context, a hypothesis is a guess, a hunch, or a conjecture about one or more population parameters. In other words, any hypothesis to be
subjected to inferential statistical test must specify the relevant population parameter(s) on which the test is to be carried out. The hypothesis that is tested is called the null hypothesis,
represented with the symbol H0. A null hypothesis is one which posits that no difference or no relationship exists between two variables. It is a hypothesis of no
difference or no effect. For instance, in a study to compare the performance of male and female students in science, a simple null hypothesis could be: ‘There is no
significant difference between the mean performance of male and female students in science'. Using symbols, this can be expressed thus:
$H_0: \mu_B = \mu_G$ or $\mu_B - \mu_G = 0$
Where $\mu_B$ = population mean of male students and $\mu_G$ = population mean of female students.
This null hypothesis is usually tested against what is called the alternative hypothesis, given the symbol $H_a$. The alternative hypothesis specifies the possible conditions
not included in the null hypothesis. In other words, it states what holds when the null hypothesis does not hold. It is the hypothesis which we accept when the null hypothesis is rejected. In the example
above, an alternative hypothesis could be:
The mean performance of male is significantly higher than the mean performance of female in science.
i.e. $H_a: \mu_B > \mu_G$ or $\mu_B - \mu_G > 0$
The mean performance of female is significantly higher than the mean performance of male in science.
i.e. $H_a: \mu_G > \mu_B$ or $\mu_G - \mu_B > 0$
Decision Rule
In testing a hypothesis, we usually compare the calculated value of the test statistic with a critical or table value of the test statistic. The critical or table value of a test
statistic, therefore, serves as a criterion value. This serves as the basis of rejecting or not rejecting the null hypothesis. As a rule, the decision to reject or not reject the
null hypothesis depends on whether the calculated value of the test statistic is greater than or less than the critical value.
Decision One
Reject the Null hypothesis if the calculated value of the test statistic is greater than the critical value.
Decision two
Do not reject the Null hypothesis if the calculated value of the test statistic is less than the critical value.
The critical values of the various test statistics are usually read off from standard statistical tables or from online sources.
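The decision rule can be written as a small function. This is a sketch under two assumptions that are not stated in the notes: for a two-tailed test the absolute value of the calculated statistic is compared with the critical value, and a calculated value exactly equal to the critical value is treated as "do not reject". The function name `decide` is hypothetical.

```python
def decide(calculated, critical, two_tailed=True):
    """Apply the decision rule: reject H0 only when the calculated value exceeds the critical value."""
    # For a two-tailed test the sign of the statistic does not matter (assumption noted above).
    value = abs(calculated) if two_tailed else calculated
    return "reject H0" if value > critical else "do not reject H0"

# Illustrative values: a calculated statistic of 1.45 against a table value of 2.064,
# and a calculated statistic of 6.34 against the two-tailed normal critical value 1.96.
print(decide(1.45, 2.064))  # do not reject H0
print(decide(6.34, 1.96))   # reject H0
```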
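When the researcher decides to reject or not to reject the Null hypothesis, he does so fully aware that his decision cannot be perfectly correct. Decisions of this sort are
usually characterised by some degree of error and the researcher is usually keen, not only on reducing such errors, but also on knowing their magnitude. Let us
consider the possible situations that can occur when we decide to either reject or not to reject a Null hypothesis. There are four such situations as follows:
1. A true Null hypothesis is not rejected (correct decision).
2. A true Null hypothesis is rejected (Type I Error).
3. A false Null hypothesis is rejected (correct decision).
4. A false Null hypothesis is not rejected (Type II Error).
Type I Error is made when a true Null hypothesis is rejected.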
Type II Error is made when a false Null hypothesis is not rejected.
The table below shows the possible conditions in making a decision about the Null hypothesis, including the two types of errors.
level of significance too small. If we make it too small, we reduce type I error but increase type II error. The determination of the probability associated with type II
error is not easy, and we will not be going into it here.
Fortunately, in the field, the choice of an alpha level is no longer much of a problem as the 0.05 and 0.01 levels of significance have come to be accepted as desirable.
Testing Hypothesis about the Difference between Two Population Means when the sample size is large: The Z-test
The Z-test is usually adopted in testing hypotheses about the difference between two population means when the sample size is large. Generally, a sample is considered to
be large if its size is equal to or greater than 30. Otherwise, the sample is regarded as small.
Assuming $\bar{x}_1$ and $\bar{x}_2$ represent the means of two independent samples, and $S_1$, $S_2$ and $n_1$, $n_2$ the corresponding standard deviations and sample sizes, respectively, then the Z-
test statistic (or ratio) is computed using the formula:
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{SD_{\bar{x}}}$
Where $SD_{\bar{x}}$ = standard error of the difference between the means, so that
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$
To illustrate the application of Z-test in testing for the significance of the difference between two independent means, let us look at the following example. In a study to
determine the effectiveness of a new instructional method (E-Learning method) relative to the conventional methods, 40 students were assigned to each of the two
methods. The score obtained by the two groups on the post-test administered after the treatment were as shown below.
Post-exam scores of students exposed to the new E-learning method and the conventional method
E-Learning Method:
23, 18, 22, 15, 14, 16, 13, 11, 20, 19,
28, 17, 20, 17, 12, 24, 14, 10, 17, 14,
8, 21, 19, 23, 26, 18, 23, 22, 14, 20,
14, 17, 19, 14, 24, 24, 29, 20, 12, 16
Conventional Method:
8, 15, 10, 5, 5, 11, 14, 13, 10, 17, 12,
9, 16, 0, 4, 23, 15, 6, 15, 30, 0, 10, 7,
16, 9, 13, 8, 15, 11, 12, 10, 3, 19, 5,
20, 14, 10, 3, 11, 9
The researcher now wants to determine whether those exposed to the E-learning method did better than those exposed to the conventional method. To do this, he has to
formulate and test an appropriate hypothesis about the difference between the means of the two groups of students. The procedures are as follows
Step 1: The appropriate null hypothesis under the Z-test is:
H0: There is no significant difference between the mean score of the
students exposed to the E-learning method and the mean score of those exposed to the conventional method, i.e. µE − µC = 0 or µE = µC
Where µE = population mean score of the group exposed to the E-learning method
And µC = population mean score of the group exposed to the conventional method.
The alternative hypothesis against which the null hypothesis will be tested is stated as follows:
HA: There is a significant difference between the mean score of the students exposed to the E-learning method and the mean score of those exposed to the conventional
method, i.e. µE ≠ µC or µE − µC ≠ 0, where µE and µC retain their previous meanings.
Step 2: Level of significance
The null hypothesis will be tested at the 0.05 or 5% level of significance.
Step 3: Computation of the test statistic.
The test statistics in this case is the Z ratio. Before we can compute it, we must first calculate the means and the standard deviations of the two groups of students.
We will not go into the computation of the means and the standard deviation here since we are already familiar with the procedure involved. The means and the standard
deviations of the two groups have been calculated to be:
                      E-Learning method    Conventional method
Mean                  x̄E = 18.18           x̄C = 10.83
Standard deviation    SE = 4.95            SC = 5.43
Sample size           nE = 40              nC = 40
Verify that the means and standard deviation of the two groups computed in the table above are correct.
Z-test (ratio) can then be computed as follows:
$Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$
By substituting,
$Z = \dfrac{18.18 - 10.83}{\sqrt{\dfrac{(4.95)^2}{40} + \dfrac{(5.43)^2}{40}}} = \dfrac{7.35}{1.16} = 6.34$
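The substitution can be reproduced with a short function working from the summary statistics (means 18.18 and 10.83, standard deviations 4.95 and 5.43, 40 students per group). The function name `z_statistic` is hypothetical, and the value 1.96 used for comparison is the standard two-tailed normal critical value at the 0.05 level, which is an addition here rather than something taken from the notes.

```python
import math

def z_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Z ratio for the difference between two independent sample means."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # SE of the difference
    return (mean1 - mean2) / standard_error

# Summary statistics of the E-learning and conventional groups.
z = z_statistic(18.18, 4.95, 40, 10.83, 5.43, 40)
print("Z =", round(z, 2))

# Two-tailed comparison at the 0.05 level (1.96 is an assumption noted above).
print("Reject H0" if abs(z) > 1.96 else "Do not reject H0")
```

Carrying the unrounded standard error through gives a Z of about 6.33; the hand calculation reaches 6.34 because the denominator is rounded to 1.16 before dividing. The conclusion is the same either way.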
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$
Where:
$\bar{x}$, $S$ and $n$ denote the mean, standard deviation and number of cases (sample size) for the groups.
$df = n_1 + n_2 - 2$
Where $n_1$ and $n_2$ are the sizes of the two samples.
Example 1
In a study to investigate the influence of gender on student’s achievement in Biostatistics, the following scores were obtained in a Biostatistics achievement test for 400
level DVM students.
Scores of 400 Level DVM students in Biostatistics achievement test
Male Students: 9, 12, 16, 15, 5, 15, 10, 18, 18, 20, 26, 10, 8, 19
Female Students: 6, 10, 12, 18, 13, 17, 11, 9, 19, 5, 15, 10
The researcher wants to test if gender is a significant factor in the achievement of 400 level DVM students in Biostatistics.
Since n < 30, the appropriate test statistic is the t-test. Here are the steps involved in carrying out the t-test of difference between means:
Step one: State the null and alternative hypotheses.
Ho: There is no significant difference between the mean Biostats achievement of male and female 400 level DVM students.
Ha: There is a significant difference between the mean Biostats achievement of male and female 400 level DVM students.
Step two: Choose an alpha level (level of significance). The test will be conducted at 0.05 level of significance,
Step three: Calculate the test statistic namely the t-ratio, using the formula:
$t = \dfrac{\bar{x}_m - \bar{x}_f}{\sqrt{\dfrac{S_m^2}{n_m} + \dfrac{S_f^2}{n_f}}} = \dfrac{2.92}{\sqrt{4.08}} = \dfrac{2.92}{2.02} = 1.45$
Step four: Having computed the test statistic, we now determine the critical value (table value) of that test statistic. In the case of the t-test, we first calculate the
degree of freedom (df) using the formula.
df = n1 + n2 - 2
= 14 + 12 – 2
= 26 – 2 = 24
df = 24
With the df =24 and 0.05 level of significance, reference is made to the table for the t-distribution for a two-tailed test as suggested by the non-directional alternative
hypothesis. At 0.05 level of significance and 24 df for two-tailed test, critical or table value of t=2.064.
Step five: We now state our decision rule as follows: Reject Ho in favour of Ha if the calculated value of t exceeds the critical (table) value. Otherwise, do not reject Ho.
Step six: Inference/Decision
The calculated t-value is 1.45 while the critical (table) value is 2.064. Since the calculated t-value is less than the critical (table) value, we do not reject the null hypothesis (i.e. we retain the null hypothesis).
Step seven: Conclusion
The result of the test suggests that the difference between the mean achievement of the 400 level DVM male and female students in Biostats is not statistically significant. The probability that the observed difference resulted from sampling error is high (i.e. greater than 0.05). We therefore conclude that there is no significant difference between the mean achievements of male and female 400L DVM students in Biostats. Any difference observed could have arisen from sampling error.
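The seven steps above can be traced in a few lines of Python. This is only a minimal sketch, not part of the original worked example: it reuses the intermediate figures printed in the notes (a mean difference of 2.92 and a variance term of 4.08) instead of recomputing them from the raw scores, and the critical value 2.064 is simply copied from the t-table. The function names `t_ratio` and `decide` are illustrative assumptions.

```python
import math


def t_ratio(mean_difference, variance_term):
    """t = (mean_1 - mean_2) / sqrt(S1^2/n1 + S2^2/n2).

    `variance_term` is the whole quantity under the square root.
    """
    return mean_difference / math.sqrt(variance_term)


def decide(t_calculated, t_critical):
    """Decision rule: reject Ho only if |t| exceeds the table value."""
    if abs(t_calculated) > t_critical:
        return "reject Ho"
    return "do not reject Ho"


# Figures taken as printed in the worked example (assumed correct).
t_calculated = t_ratio(mean_difference=2.92, variance_term=4.08)
df = 14 + 12 - 2               # n1 + n2 - 2
T_CRITICAL_TWO_TAILED = 2.064  # table value for df = 24, alpha = 0.05

print(f"t = {t_calculated:.2f}, df = {df}")
print(decide(t_calculated, T_CRITICAL_TWO_TAILED))
```

Keeping the decision rule in its own function makes it easy to repeat the comparison with a different table value, for example the one-tailed critical value.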
The table containing relevant information from the test is presented below:
Two-tailed t-test of difference between the means of male and female 400 Level DVM students in Biostats
Had the researcher formulated a directional alternative hypothesis, a one-tailed test would have been required. Let us test our null hypothesis against the following alternative hypothesis.
Ha: The mean achievement of male 400L DVM students is significantly higher than the mean achievement of their female counterparts in Biostats.
The steps involved in the one-tailed test are the same as those of the two-tailed test; the only difference is in the critical (table) value of the test statistic. Our level of significance is 0.05, as before. For a one-tailed test at 24 df, the critical (table) value of t is 1.711. The calculated t-value of 1.45 is still less than this critical value, so we do not reject Ho.
Conclusion
There is no significant difference between the mean achievement of male and female 400 L DVM students in Biostats.
The table containing relevant information from the test is presented below:
One-tailed t-test of difference between the means of male and female 400 Level DVM students in Biostats
The researcher is interested in testing whether or not these frequencies fit the expected frequencies under the null hypothesis that the observed subject preference pattern of
the Vet students is due to chance.
The expected frequencies are those that would occur if the null hypothesis were true. In this case, if the null hypothesis is true, we would expect 50% of the Vet students to prefer Biostat and the other 50% to prefer Microbiology. Accordingly, with 120 students, the expected frequencies will be 60 for Biostat and 60 for Microbiology.
Observed and expected frequencies of course preference of 120 Vet students
COURSE PREFERENCE
                      Biostat    Microbiology
Observed Frequency    65         55
Expected Frequency    60         60
We then state the null (Ho) and alternative (Ha) hypotheses as follows:
Ho: The subject preference of Vet students is due to chance, i.e. Vet students do not show any preference for either Biostat or Microbiology.
Ha: The subject preference pattern of Vet students is not due to chance, i.e. Vet students show more preference for one of Biostat or Microbiology.
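We then choose the 5% (0.05) level of significance for this test. To compute the χ², we apply the formula:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where O = Observed frequency
E = Expected or theoretical frequency
And Σ = Sum of
$$\chi^2 = \frac{(65 - 60)^2}{60} + \frac{(55 - 60)^2}{60} = \frac{25}{60} + \frac{25}{60} = 0.83$$
The df in this case = 1 (the number of categories minus one). We then refer to the table of the sampling distribution of χ² at the 0.05 level of significance and 1 df to obtain the critical χ² as 3.84. Since the calculated χ² (0.83) is less than the critical χ² (3.84), we do not reject the null hypothesis: the observed preference pattern is consistent with chance.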
Let us consider another example involving the goodness-of-fit test. In the admission policy of Federal Universities in Nigeria, the following criteria are adopted:
Merit = 40%
Locality = 30%
Educationally less developed states (ELDS) = 20%
University discretion = 10%
In the University of Abuja in 2020, out of a total of 250 candidates offered admission, 80 were admitted on the basis of merit, 70 on the basis of locality, 65 on the basis of ELDS and 35 on the basis of University discretion. Let us test the null hypothesis that the distribution of candidates offered admission into the University of Abuja does not differ from that stipulated by the admission policy of Universities in Nigeria. The alternative hypothesis is that the distribution of candidates offered admission into the University of Abuja differs significantly from that stipulated in the admission policy.
We now compute the expected frequencies based on the null hypothesis. If the null hypothesis is true, the expected frequencies of candidates admitted under the various criteria will be:
Merit = 40% of 250 = 40/100 × 250 = 100
Locality = 30% of 250 = 30/100 × 250 = 75
ELDS = 20% of 250 = 20/100 × 250 = 50
University Discretion = 10% of 250 = 10/100 × 250 = 25
Both the observed and expected frequencies are presented in a table as follows:
Observed and Expected Frequencies of Admission into a University
                Merit    Locality    ELDS    University Discretion
Observed (O)    80       70          65      35
Expected (E)    100      75          50      25
$$\chi^2 = \frac{(80-100)^2}{100} + \frac{(70-75)^2}{75} + \frac{(65-50)^2}{50} + \frac{(35-25)^2}{25} = 4.00 + 0.33 + 4.50 + 4.00 = 12.83$$
The calculated χ² = 12.83. The associated degrees of freedom is 4 − 1 = 3. At the 0.05 level of significance and 3 df, the critical χ² = 7.815.
The calculated χ² is greater than the critical χ²; we therefore reject the null hypothesis. This implies that there is a significant discrepancy between the frequencies of candidates admitted under the various categories and those specified in the admission policy.
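The goodness-of-fit calculation for the admission data can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than a prescribed method: the helper names `expected_frequencies` and `chi_square` are made up for the example, and the critical value 7.815 is copied from the χ² table rather than computed.

```python
def expected_frequencies(percentages, total):
    """Expected count in each category if the null hypothesis is true."""
    return [pct * total / 100 for pct in percentages]


def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Order of categories: Merit, Locality, ELDS, University discretion.
observed = [80, 70, 65, 35]
policy_percentages = [40, 30, 20, 10]

expected = expected_frequencies(policy_percentages, sum(observed))
chi_sq = chi_square(observed, expected)
df = len(observed) - 1
CHI_CRITICAL = 7.815  # table value for df = 3, alpha = 0.05

print("Expected:", expected)
print(f"Chi-square = {chi_sq:.2f}, df = {df}")
print("reject Ho" if chi_sq > CHI_CRITICAL else "do not reject Ho")

# The same helper applied to the 65/55 course-preference data (expected 60/60).
course_chi_sq = chi_square([65, 55], [60, 60])
print(f"Course preference chi-square = {course_chi_sq:.2f}")
```

Working with percentages as whole numbers keeps the expected counts exact and avoids small floating-point discrepancies.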
The χ² Test of Independence
Another major and popular application of the χ² test is testing for the independence of two variables. In this case, two factors or variables, each having two or more levels/categories, are involved and the researcher wants to test whether or not the two variables are independent. The table in which the observed and expected frequencies associated with the various levels of the two variables are presented is called a contingency table. It is so called because it displays data associated with two variables that are possibly contingent upon (i.e. dependent on) one another. A contingency table is usually named by the number of rows (R) and number of columns (C) it has, as an R×C contingency table. For instance, if there are 3 rows and 5 columns, the contingency table will be called a 3×5 contingency table.
In a contingency table, it is customary to put the expected frequency in the same cell as the corresponding observed frequency. The expected frequency is then distinguished by enclosing it in brackets. Consider the following example:
EXAMPLE: A researcher studying the attitude to science of 300 students from different socio-economic backgrounds on a 4-point Likert-type scale, obtained the
following information:
Attitude to Science of 300 students from different Socio-economic Backgrounds
The researcher is interested in knowing whether or not science attitude is dependent on the socio-economic status of the students. The null and alternative hypotheses under the χ² test of independence are formulated as follows:
Ho: The attitude of students towards science is independent of their socio-economic status.
Ha: The attitude of students towards science is dependent on their socio-economic status.
Let us choose the 0.05 level of significance for testing the hypothesis. We now calculate the χ². To do this, the expected frequencies have to be calculated first. In a contingency table, the expected frequency of each cell is calculated as follows:
$$E(RC) = \frac{f_R \times f_C}{N}$$
Where E(RC) = Expected frequency of the cell
fR = total row frequency
fC = total column frequency
N = total frequency
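Let us compute the expected frequencies for this example:
Row 1 Cell 1 E = (80 × 110)/300 = 29.33
Row 1 Cell 2 E = (80 × 89)/300 = 23.73
Row 1 Cell 3 E = (80 × 63)/300 = 16.80
Row 1 Cell 4 E = (80 × 38)/300 = 10.13
Row 2 Cell 1 E = (100 × 110)/300 = 36.67
Row 2 Cell 2 E = (100 × 89)/300 = 29.67
Row 2 Cell 3 E = (100 × 63)/300 = 21.00
Row 2 Cell 4 E = (100 × 38)/300 = 12.67
Row 3 Cell 1 E = (120 × 110)/300 = 44.00
Row 3 Cell 2 E = (120 × 89)/300 = 35.60
Row 3 Cell 3 E = (120 × 63)/300 = 25.20
Row 3 Cell 4 E = (120 × 38)/300 = 15.20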
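The cell-by-cell arithmetic above follows directly from the marginal totals, so it can be generated in one pass. The sketch below is a hedged illustration: the function name `expected_frequencies` is hypothetical, and the third row total of 120 is an assumption inferred from the grand total (300 − 80 − 100), since only the first two row totals appear explicitly in the worked lines.

```python
def expected_frequencies(row_totals, column_totals, grand_total):
    """Return the table of E = (row total x column total) / N."""
    return [
        [row_total * column_total / grand_total for column_total in column_totals]
        for row_total in row_totals
    ]


row_totals = [80, 100, 120]        # third total assumed: 300 - 80 - 100
column_totals = [110, 89, 63, 38]
expected = expected_frequencies(row_totals, column_totals, grand_total=300)

for row_number, row in enumerate(expected, start=1):
    print(f"Row {row_number}:", [round(e, 2) for e in row])
```

A quick check on any such table is that the expected frequencies add up to the grand total, just as the observed frequencies do.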
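These expected frequencies are now presented along with the corresponding observed frequencies in a 3 × 4 contingency table as shown below.
A 3 × 4 Contingency table
         Cell 1        Cell 2        Cell 3        Cell 4        Total
Row 1    20 (29.33)    15 (23.73)    30 (16.80)    15 (10.13)    80
Row 2    40 (36.67)    35 (29.67)    15 (21.00)    10 (12.67)    100
Row 3    50 (44.00)    39 (35.60)    18 (25.20)    13 (15.20)    120
Total    110           89            63            38            300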
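The figures in brackets are the expected frequencies. Next, we compute the χ² as follows:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
$$\begin{aligned}
\chi^2 &= \frac{(20-29.33)^2}{29.33} + \frac{(15-23.73)^2}{23.73} + \frac{(30-16.80)^2}{16.80} + \frac{(15-10.13)^2}{10.13} \\
&\quad + \frac{(40-36.67)^2}{36.67} + \frac{(35-29.67)^2}{29.67} + \frac{(15-21.00)^2}{21.00} + \frac{(10-12.67)^2}{12.67} \\
&\quad + \frac{(50-44.00)^2}{44.00} + \frac{(39-35.60)^2}{35.60} + \frac{(18-25.20)^2}{25.20} + \frac{(13-15.20)^2}{15.20} \\
&= 2.97 + 3.21 + 10.37 + 2.34 + 0.30 + 0.96 + 1.71 + 0.56 + 0.82 + 0.32 + 2.06 + 0.32 = 25.94
\end{aligned}$$
i.e. the calculated χ² = 25.94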
To determine the critical χ² value, we first determine the associated degrees of freedom. The degrees of freedom in a contingency table are given by:
df = (R − 1)(C − 1)
Where R = number of rows
C = number of columns
Therefore, in the present case, df = (3 − 1) × (4 − 1) = 2 × 3 = 6
We now refer to the table of the sampling distribution of χ² for 6 df at the 0.05 level of significance. The critical χ² for 6 df at the 0.05 level of significance is 12.592.
χ² Cal. = 25.94
χ² Crit. = 12.592
The calculated value exceeds the critical value; hence we reject the null hypothesis. This implies that the attitude of the students to science is dependent on their socio-economic status. In other words, students from different socio-economic backgrounds hold different attitudes about science.
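The whole test of independence can be sketched in Python from the observed frequencies alone. This is a minimal sketch under stated assumptions, not the notes' own procedure: the function name `independence_test` is illustrative, the critical value 12.592 is copied from the χ² table, and the last two entries of the third row (18 and 13) are inferred from the column totals 63 and 38. Because the code does not round each term before adding, its total comes out at about 25.95 rather than the hand-computed 25.94.

```python
def independence_test(observed):
    """Return (chi-square, df) for an R x C table of observed counts."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)

    chi_sq = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * column_totals[j] / grand_total
            chi_sq += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_sq, df


observed = [
    [20, 15, 30, 15],
    [40, 35, 15, 10],
    [50, 39, 18, 13],  # 18 and 13 assumed from the column totals 63 and 38
]
chi_sq, df = independence_test(observed)
CHI_CRITICAL = 12.592  # table value for df = 6, alpha = 0.05

print(f"Chi-square = {chi_sq:.2f}, df = {df}")
print("reject Ho" if chi_sq > CHI_CRITICAL else "do not reject Ho")
```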
Computation of χ² from a 2×2 Contingency table
Suppose we have a 2×2 contingency table whose cell frequencies and marginal totals are represented as follows:
a    b    k
c    d    l
m    n    N
with a, b, c, d as the cell frequencies, k, l, m, n as the marginal frequencies and N as the total frequency.
To compute χ² from this kind of contingency table, we use the formula:
$$\chi^2 = \frac{N(ad - bc)^2}{klmn}$$
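Example:
In a study, the opinions of male and female students were sought on the introduction of artificial insemination in all food animals in Nigeria, and the following data were obtained.
          Yes        No         Total
Male      50 (a)     46 (b)     96 (k)
Female    70 (c)     61 (d)     131 (l)
Total     120 (m)    107 (n)    227 (N)
$$\chi^2 = \frac{N(ad - bc)^2}{klmn} = \frac{227(50 \times 61 - 46 \times 70)^2}{96 \times 131 \times 120 \times 107} = \frac{227 \times 28900}{161475840} = \frac{6560300}{161475840} = 0.04$$
With df = (2 − 1)(2 − 1) = 1, the critical χ² at the 0.05 level of significance is 3.84. The calculated value is far below this, so we do not reject the null hypothesis: opinion on the introduction of artificial insemination does not appear to depend on the gender of the students.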
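The shortcut formula for a 2×2 table is easy to check in Python. The sketch below is illustrative only: the function name `chi_square_2x2` is an assumption made for the example, and the marginal totals are computed from the four cell frequencies rather than typed in. For the survey data it evaluates to roughly 0.04.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2x2 table via N(ad - bc)^2 / (k * l * m * n)."""
    row_1_total = a + b   # k
    row_2_total = c + d   # l
    col_1_total = a + c   # m
    col_2_total = b + d   # n
    grand_total = row_1_total + row_2_total  # N
    numerator = grand_total * (a * d - b * c) ** 2
    denominator = row_1_total * row_2_total * col_1_total * col_2_total
    return numerator / denominator


# Male: 50 Yes, 46 No; Female: 70 Yes, 61 No.
chi_sq = chi_square_2x2(a=50, b=46, c=70, d=61)
CHI_CRITICAL = 3.84  # table value for df = 1, alpha = 0.05

print(f"Chi-square = {chi_sq:.4f}")
print("reject Ho" if chi_sq > CHI_CRITICAL else "do not reject Ho")
```

Using integer arithmetic for the numerator and denominator before the final division keeps the large products exact.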
Identifying a problem highlighted by the gap in the literature and framing a purpose for the study.
1.0 Introduction
1.1 Statement of the research problems
1.2 Research questions/Hypothesis
1.3 Justification of the study
1.4 Aim of the study
1.5 Objectives of the study
2.0 Methodology
2.1 Study location
2.2 Sampling
2.3 Sampling method
2.4 Sample size determination
2.5 Sample collection
2.6 Sample processing
2.7 Sample analysis/laboratory analysis
2.8 Statistical analysis
2.9 Ethical consideration
2.10 Budget for the study
2.11 Time frame for the study
2.12 Expected outcome/impact of the study
3.0 References
Outline of a Typical Format of a Project
Title Page
Abstract
- Stress content not intent
- Assume a knowledgeable reader
- Write the abstract last
- Avoid passive voice
- Keep it short, not more than 500 words
- Make quantitative and not qualitative statements
- Do not use equations or other mathematical notations
- Empathise with the first time reader
Keywords
List of figures and tables
Acknowledgements
Table of contents
Statement of Original authorship
1.0 Introduction
1.1 Background of research
1.2 Research problem
1.3 Research questions/Hypotheses
1.4 Justification for the research
2.0 Literature Review
3.0 Methods and materials
4.0 Results and Discussion
5.0 Conclusion and suggestions for further work
6.0 References
Appendices
There may be some slight changes in the structure, but you will need to justify them.
BASIC STRUCTURE OF A PROJECT
QUESTION                                  SECTION OF PROJECT
1. Why am I doing it?                     Introduction: Significance
2. What is known?                         Review of research
3. What is unknown?                       Identifying gaps
4. What do I hope to discover?            Aims
5. How am I going to discover it?         Methodology
6. What have I found?                     Results
7. What does it mean?                     Discussion
8. So what? What are the possible         Conclusion
   applications or recommendations?
   What contribution does it make
   to knowledge?
   What next?                             Recommendation for further research