Introduction to Biostatistics - lecture 1
Dr Martin C. Simuunza
Department of Disease Control
School of Veterinary Medicine
University of Zambia
1. Introduction
Definition of statistics
Statistics is the science of collecting, organizing, summarizing, analysing and interpreting data in order to draw conclusions and support decisions.
Nominal variables
• Variables where no natural order or ranking can be imposed on the data.
Examples
– Gender => male / female
– Disease status => diseased / not diseased
Ordinal variables
• Classify data into categories that can be ranked or have a natural order.
Examples
– Marital status => married / single / divorced / separated / widowed
– Smoking habit => non-smoker / ex-smoker / light smoker / heavy smoker
– Performance => very good / good / bad / very bad
Quantitative variables
• Variables which assume numerical values
• Also known as numerical variables
• Divided into discrete and continuous variables
Discrete variables
• assume a finite or countable number of possible values.
Usually obtained by counting.
• E.g. number of teats in a cow, number of lactations, number of diseased animals in a population
Continuous variables
• Variables which assume an infinite number of possible values. Usually obtained by measurement and quantified by comparison with a defined unit.
• E.g. weight of a calf at birth, haemoglobin values, litres of milk produced, etc.
– A response variable is one that is affected by another, called an explanatory variable; e.g. an animal's weight may be a response to feed intake (which is an explanatory variable).
Mean
• It uses every score in the distribution, making it the most accurate measure of central tendency.
• The value of the mean may be an abstract one, e.g. mean = 5.8 children.
• It is affected by extreme values in the distribution, which can make it unsuitable for skewed distributions.
• It changes in the same manner and to the same extent as the scale of measurement is changed.
(Table comparing the properties of the mean, median and mode not reproduced.)
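A minimal Python sketch (the litter-size values below are made up for illustration) showing that the mean uses every value and is pulled towards an extreme score, while the median barely moves:

from statistics import mean, median, mode

litters = [8, 9, 10, 10, 11]          # hypothetical litter sizes
print(mean(litters))                   # 9.6  (uses every value)
print(median(litters))                 # 10
print(mode(litters))                   # 10

litters_outlier = litters + [25]       # one extreme value added
print(mean(litters_outlier))           # about 12.2 (mean pulled up)
print(median(litters_outlier))         # 10.0 (median hardly changes)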
• Sample Range
• It is the simplest measure of variability and is
defined as the difference between the largest and
smallest observations.
• i.e. r = max(xi) – min(xi )
• Example: If the distribution is 4, 8, 9, 7, 5, 3, the largest
score is 9 and the lowest is 3. According to the formula, the
range is 9 – 3 = 6.
• An approximate measure of variability.
• Has the advantage of easy computation.
• Disproportionately affected by an extreme score as
in the case of the mean.
• Does not take into account the form of the
distribution
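The range of the example distribution can be checked with a one-line helper (a sketch, not part of the original slides):

def sample_range(xs):
    # r = max(xi) - min(xi)
    return max(xs) - min(xs)

print(sample_range([4, 8, 9, 7, 5, 3]))   # 9 - 3 = 6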
• Variance and Standard Deviation
• Another important way to describe data is the
dispersion of the mass around the theoretical mean.
• A natural basis for introducing this concept is (X -
µ).
• If we assume that X1, X2, ……, Xn are n measurements of a variable X, then the sum of all the (Xi − µ) could be a good start.
• However, (X1 − µ) + …… + (Xn − µ) will end up close to 0 if X is symmetrically distributed.
• One way of avoiding this is to use the absolute values |Xi − µ|.
• This has been tried, and (1/n) ∑|Xi − µ| is called the total variation.
• However, this formula is very difficult to work with mathematically.
• The most important measure of variability or
dispersion is the variance
• The variance is the average of the squared deviations from the mean of a distribution and, for a population, is given by the formula
σ² = ∑(Xi − µ)² / N
where µ is the population mean and N is the population size.
• For a sample, the variance is
s² = ∑(xi − x̄)² / (n − 1)
where x̄ is the sample mean.
Shortcut Formula for Calculating s²
s² = [∑x² − (∑x)² / n] / (n − 1)
where n − 1 is the degrees of freedom.
There is a problem with the variance: because the deviations were squared, the units are squared as well.
To get the units back to those of the original data values, the square root must be taken; the result, s = √s², is the standard deviation.
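A short Python sketch (reusing the illustrative values 4, 8, 9, 7, 5, 3 from the range example) computing the sample variance with both the definitional and the shortcut formula, then taking the square root to recover the standard deviation:

import math

def sample_variance(xs):
    # definitional formula: s^2 = sum((x - xbar)^2) / (n - 1)
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    # shortcut formula: s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1)
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [4, 8, 9, 7, 5, 3]
s2 = sample_variance(data)
print(s2, sample_variance_shortcut(data))   # both print 5.6
print(math.sqrt(s2))                        # standard deviation, about 2.37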
The standard error of the mean (SEM) is the standard deviation of the sample mean's estimate of a population mean.
SEM is usually estimated by the sample estimate of the population standard deviation (the sample standard deviation) divided by the square root of the sample size (assuming statistical independence of the values in the sample):
SEM = s / √n
where
• s is the sample standard deviation, and
• n is the size (number of observations) of the sample.
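A brief sketch of the SEM calculation, again on the illustrative values used above:

import math
from statistics import stdev

data = [4, 8, 9, 7, 5, 3]
s = stdev(data)                     # sample standard deviation (n - 1 in the denominator)
sem = s / math.sqrt(len(data))      # SEM = s / sqrt(n)
print(sem)                          # about 0.97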
Empirical Rule
The empirical rule is only valid for bell-shaped (normal)
distributions. The following statements are true.
• Approximately 68% of the data values fall within one standard
deviation of the mean.
• Approximately 95% of the data values fall within two standard
deviations of the mean.
• Approximately 99.7% of the data values fall within three
standard deviations of the mean.
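A quick simulation sketch (mean 50 and standard deviation 10 are arbitrary choices) showing that normally distributed data roughly obey the empirical rule:

import random
from statistics import mean, stdev

random.seed(1)
values = [random.gauss(50, 10) for _ in range(10_000)]   # simulated normal data
m, s = mean(values), stdev(values)

for k in (1, 2, 3):
    share = sum(m - k * s <= x <= m + k * s for x in values) / len(values)
    print(k, round(share, 3))   # roughly 0.68, 0.95, 0.997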
Percentiles, Deciles, Quartiles
Percentiles (100 regions)
The kth percentile is the number which has k% of the
values below it. The data must be ranked.
• Rank the data
• Find k% (k /100) of the sample size, n.
• If this is an integer, add 0.5. If it isn't an integer, round up.
• Find the number in this position. If your depth ends in 0.5,
then take the midpoint between the two numbers.
The 80th percentile is the number which has 80%
below it and 20% above it.
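A Python sketch of the ranking procedure just described; the function name percentile_by_rank is ours, and the steps follow the rule above:

import math

def percentile_by_rank(data, k):
    xs = sorted(data)                       # rank the data
    depth = k / 100 * len(xs)               # k% of the sample size n
    if depth == int(depth):
        depth += 0.5                        # integer depth: add 0.5
    else:
        depth = math.ceil(depth)            # otherwise round up
    if depth == int(depth):
        return xs[int(depth) - 1]           # whole-number depth: take that value
    lower, upper = xs[int(depth) - 1], xs[int(depth)]
    return (lower + upper) / 2              # depth ending in .5: midpoint of the two values

print(percentile_by_rank([4, 8, 9, 7, 5, 3], 80))   # depth 4.8 -> round up to 5 -> 5th value -> 8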
Sample Space
• An exhaustive list of all the possible outcomes of an
experiment.
• Each possible result of such a study is represented by
one and only one point in the sample space, which is
usually denoted by S.
Examples
1. Experiment: Rolling a die once:
Sample space S = {1,2,3,4,5,6}
Probability
• A probability provides a quantitative description of the
likely occurrence of a particular event.
• Probability is conventionally expressed on a scale from
0 to 1; a rare event has a probability close to 0, a very
common event has a probability close to 1.
• The probability of an event has been defined as its
long-run relative frequency.
• It can also be thought of as a personal degree of
belief that a particular event will occur (subjective
probability).
Examples
1. The probability of drawing a spade from a pack of
52 well-shuffled playing cards is 13/52 = 1/4 = 0.25
since event E = 'a spade is drawn'; the number of
outcomes corresponding to E = 13 (spades); the
total number of outcomes = 52 (cards).
Example
Suppose that a man and a woman each have a pack of 52
playing cards.
• Each draws a card from his/her pack. Find the probability that they each draw the ace of clubs.
• The two draws are independent, so P(both draw the ace of clubs) = 1/52 × 1/52 = 1/2704 ≈ 0.0004.
Addition Rule
• used to determine the probability that event A or
event B occurs or both occur.
The result is often written as follows, using set notation:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
where:
• P(A) = probability that event A occurs
• P(B) = probability that event B occurs
• P(A ∪ B) = probability that event A or event B occurs
• P(A ∩ B) = probability that event A and event B both occur
• For mutually exclusive events, that is events which cannot occur together:
P(A ∩ B) = 0
The addition rule therefore reduces to
P(A ∪ B) = P(A) + P(B)
• For independent events, that is events which have no influence on each other:
P(A ∩ B) = P(A) × P(B)
so the addition rule becomes
P(A ∪ B) = P(A) + P(B) − P(A) × P(B)
Example
• Suppose we wish to find the probability of drawing either a king or
a spade in a single draw from a pack of 52 playing cards.
• We define the events A = 'draw a king' and B = 'draw a spade'
• Since there are 4 kings in the pack and 13 spades, but 1 card is both a king and a spade, we have:
P(A ∪ B) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13 ≈ 0.31
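A sketch that verifies the king-or-spade result by enumerating a 52-card deck and counting outcomes:

from fractions import Fraction

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['clubs', 'diamonds', 'hearts', 'spades']
deck = [(r, s) for r in ranks for s in suits]              # 52 cards

kings = sum(r == 'K' for r, s in deck)                     # 4
spades = sum(s == 'spades' for r, s in deck)               # 13
both = sum(r == 'K' and s == 'spades' for r, s in deck)    # 1

p = Fraction(kings, 52) + Fraction(spades, 52) - Fraction(both, 52)
print(p, float(p))                                         # 4/13, about 0.31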
Multiplication Rule
• Used to determine the probability that two events,
A and B, both occur.
• The multiplication rule follows from the definition
of conditional probability.
• The result is often written as follows, using set notation:
P(A ∩ B) = P(A | B) × P(B) = P(B | A) × P(A)
where:
• P(A) = probability that event A occurs
• P(B) = probability that event B occurs
• P(A ∩ B) = probability that event A and event B both occur
• P(A | B) = the conditional probability that event A occurs given that event B has occurred already
• P(B | A) = the conditional probability that event B occurs given that event A has occurred already
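A short sketch of the multiplication rule with conditional probability, using a scenario not in the slides (two cards drawn without replacement from one pack; probability that both are spades):

from fractions import Fraction

p_a = Fraction(13, 52)          # P(A): first card is a spade
p_b_given_a = Fraction(12, 51)  # P(B | A): second card is a spade, given the first was
p_both = p_a * p_b_given_a      # multiplication rule: P(A and B) = P(B | A) * P(A)
print(p_both, float(p_both))    # 1/17, about 0.059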
Example: probability distribution for one roll of a fair die
x      1    2    3    4    5    6    Sum
p(x)  1/6  1/6  1/6  1/6  1/6  1/6  6/6 = 1
Normal Distribution
Any Normal Distribution
• Bell-shaped
• Symmetric about mean
• Continuous
• Never touches the x-axis
• Total area under curve is 1.00
• Approximately 68% lies within 1 standard deviation of the
mean, 95% within 2 standard deviations, and 99.7% within
3 standard deviations of the mean. This is the Empirical
Rule mentioned earlier.
• Data values are represented by x, which has mean µ and standard deviation σ.
• Probability (density) function given by
f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))
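A minimal sketch evaluating the normal probability density function directly from the formula above (standard normal, µ = 0 and σ = 1, chosen for illustration):

import math

def normal_pdf(x, mu, sigma):
    # f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(normal_pdf(0, 0, 1))   # about 0.399, the peak of the bell at the mean
print(normal_pdf(2, 0, 1))   # about 0.054, two standard deviations out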
Interquartile Range (IQR)
• The interquartile range is the difference between the third and first quartiles, i.e. IQR = Q3 − Q1.
Interval estimation
In statistics, interval estimation is the use of sample
data to calculate an interval of possible (or
probable) values of an unknown population
parameter
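A sketch of a simple interval estimate on hypothetical calf birth weights, building an approximate 95% interval as the sample mean ± 1.96 × SEM (a t multiplier would be more exact for a sample this small):

import math
from statistics import mean, stdev

weights = [32.1, 29.8, 35.0, 31.4, 30.6, 33.2, 28.9, 34.1, 31.7, 32.5]  # hypothetical birth weights (kg)

xbar = mean(weights)
sem = stdev(weights) / math.sqrt(len(weights))
lower, upper = xbar - 1.96 * sem, xbar + 1.96 * sem   # approximate 95% interval estimate
print(round(xbar, 2), (round(lower, 2), round(upper, 2)))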