
Introduction to Biostatistics

Dr Martin C. Simuunza
Department of Disease Control
School of Veterinary Medicine
University of Zambia
1. Introduction
Definition of statistics

Statistics is a collection of methods for planning experiments,
obtaining data, and then organizing, summarizing, presenting,
analyzing, interpreting, and drawing conclusions.

Statistics is divided into two branches:

Descriptive statistics: provides simple summaries about the
sample and about the observations that have been made.
Inferential statistics: the process of drawing conclusions
from data subject to random variation, for example observational
errors or sampling variation. That is, statistical inference mainly makes
propositions about populations, using data drawn from the
population of interest via some form of random sampling.
2. Variables
Variable : Characteristic or attribute that can
assume different values
Random Variable : A variable whose values are
determined by chance.

Variables are classified into quantitative and qualitative variables.
Qualitative variables
• Also called categorical variables
• assume non-numerical values.
• describe the property or characteristic of an animal, i.e. its
membership of a group or class.
• Divided into two: Nominal and ordinal variables

Nominal variables
Those where no natural order or ranking can be imposed on
the data.
Examples
Gender => Male / female,
Disease status => Diseased / not diseased,
Ordinal variables
• classify data into categories that can be ranked or have a
natural order.
Examples
– Marital status => married / single / divorced / separated / widowed,
– Smoking habit => non-smoker/ex-smoker/light smoker/heavy
smoker.
– Performance => Very good /good / bad / very bad

Quantitative variables
• Variables which assume numerical values
• Also known as numerical variables
• Divided into discrete and continuous variables
Discrete variables
• assume a finite or countable number of possible values.
Usually obtained by counting.
• E.g. number of teats in a cow, Number of lactations, number of
diseased animals in a population

Continuous variables
Variables which assume an infinite number of possible values.
Usually obtained by measurement. Are quantified by
comparison with a defined unit.
E.g. weight of a calf at birth, haemoglobin values, litres of
milk produced, etc.
– A response variable is one that is affected by another variable,
called an explanatory variable; e.g. an animal’s weight may
be a response to feed intake (which is an explanatory
variable).

– Response variables may also be called dependent variables, and
explanatory variables independent variables.
Some Definitions
Population
• All subjects possessing a common characteristic that is
being studied.
Sample
• A subgroup or subset of the population.
Parameter
• Characteristic or measure obtained from a population.
Statistic (not to be confused with Statistics)
• Characteristic or measure obtained from a sample.
Descriptive statistics
• Descriptive statistics describe patterns and general trends
in a data set.
• In some sense, descriptive statistics is one of the bridges
between measurement and understanding.
• The following methods can be used to describe (summarise)
data:
• Tables and graphs - common tables include:
   - frequency tables
   - histograms / bar charts / pie charts
   - graphs
• Measures of central tendency (mean, mode, median)
• Measures of dispersion (range, variance, standard
deviation, interquartile range)
Measures of Central tendency
These describe what is meant by the term “average”.
e.g. How much does a dairy farmer spend on feed on a daily
basis?
How many cigarettes does he smoke in a day?
What is the average salary of a new vet graduate in
Zambia?
On the average, how many 2nd year vet students
proceed to the third year of study?
• From the above we see that
– Measures of central tendency, or "location", attempt to
quantify what we mean by the "typical" or "average" score
in a data set.

• This is an extremely important concept, encountered
frequently in daily life.
Mode
• The simplest, but also the least widely used, measure of central
tendency is the mode.
• The mode in a distribution of data is simply the score that
occurs most frequently.
Example: The milk yield of 10 cows is measured in liters and the
following is obtained
13.4, 13.7, 14.1, 13.6 , 15.2, 14.2, 13.9, 14.9, 16.0, 14.2

The mode is 14.2


• When a frequency distribution indicates two values of the
same highest frequency, both values are modes and such a
distribution is bimodal
• For grouped data, the Mode is contained in the interval
with the highest frequency.
• Mode defined as the mid-point of the class-interval with
the highest frequency
• Easy method for determining a measure of central
tendency
• However, its usefulness is greatly reduced if the highest
frequency is not much higher than the others.
• As a rule of thumb, the modal value should constitute at least a third of
the total data in order to have some representativeness.
• In grouped data, the mode lacks stability, as its value depends
on the set of class-intervals and might change if the
intervals were reduced or increased.
Median.
• Technically, the median of a distribution is the
value that cuts the distribution exactly in half, such
that an equal number of scores are larger than that
value as there are smaller than that value.

• This is an ideal definition, but often distributions
cannot be cut exactly in half in this way; we can still
define the median of the distribution.

• Distributions of qualitative data do not have a median.

• The median is computed by sorting the data in the data set from
smallest to largest.
• The median is the "middle" score in the distribution.
• Suppose we have the following scores in a data set: 5, 7,
6, 1, 8. Sorting the data, we have: 1, 5, 6, 7, 8.
• The "middle score" is 6, so the median is 6. Half of the
(remaining) scores are larger than 6 and half of the
(remaining) scores are smaller than 6.
Rule for computing the median
• First, compute (n+1)/2, where n is the number of data
points. Here, n = 5. If (n+1)/2 is an integer, the median
is the value that is in the (n+1)/2 location in the sorted
distribution
• (n+1)/2 = 6/2 or 3, which is an integer.
• The median is the 3rd score in the sorted distribution,
which is 6
• If (n+1)/2 is not an integer, then there is no
"middle" score.
• In such a case, the median is defined as one half of
the sum of the two data points that hold the two
nearest locations to (n+1)/2.
• For example, suppose the data are 0, 1, 4, 5, 6, 8
• n = 6, and (n+1)/2 = 7/2 = 3.5.
• This is not an integer. So the median is one half of
the sum of the 3rd and 4th scores in the sorted
distribution.
• The 3rd score is 4 and the 4th score is 5. One half
of 4 + 5 is 9/2 or 4.5. So the median is 4.5
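As an illustration (not part of the original lecture), here is a minimal Python sketch of the rule just described; the function name and the sample data are made up for the example.

def median(scores):
    # Median via the (n + 1)/2 rule described above.
    data = sorted(scores)
    n = len(data)
    pos = (n + 1) / 2              # 1-based location of the middle score
    if pos.is_integer():
        return data[int(pos) - 1]
    # No single middle score: average the two nearest positions.
    lower = data[int(pos) - 1]     # e.g. the 3rd score when pos = 3.5
    upper = data[int(pos)]         # e.g. the 4th score
    return (lower + upper) / 2

print(median([5, 7, 6, 1, 8]))     # 6
print(median([0, 1, 4, 5, 6, 8]))  # 4.5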
• Mean
• Most widely used measure of central tendency.
• The mean is defined technically as the sum of all
the data scores divided by n (the number of scores
in the distribution).
• symbolised as x̄, pronounced "X-bar".
• Example, using the data from above, where the n
= 5 values of X were 5, 7, 6, 1, and 8, the mean is
(5 + 7 + 6 + 1 + 8) / 5 = 5.4.

• Distributions of qualitative data do not have a mean.

• Note that x̄ is an estimate of the true
population mean µ. As n increases,
x̄ becomes a better estimate of µ.

• µ is a parameter and refers to the population, while x̄ is a
random variable (a statistic) and refers to the sample.

• As the sample size increases, the statistic x̄ becomes
a better estimate of the parameter µ.
Some properties of the mean
1. Effects of scale changes on the mean
Addition: if all X scores of a distribution are changed by adding a
(positive or negative) constant c to each of them, then the new mean is
x̄' = x̄ + c.
Multiplication: if all X scores of a distribution are changed by
multiplying (or dividing) each of them by a constant c, then
x̄' = c·x̄ (or x̄' = x̄/c).
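A quick numerical check of these two properties, written as an illustrative Python sketch (the data set is arbitrary and not from the lecture):

x = [13.4, 13.7, 14.1, 13.6, 15.2, 14.2, 13.9, 14.9, 16.0, 14.2]
mean = sum(x) / len(x)

c = 2.0
mean_shifted = sum(v + c for v in x) / len(x)   # add c to every score
mean_scaled  = sum(v * c for v in x) / len(x)   # multiply every score by c

print(abs(mean_shifted - (mean + c)) < 1e-9)    # True: x̄' = x̄ + c
print(abs(mean_scaled  - (mean * c)) < 1e-9)    # True: x̄' = c·x̄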

The mean is susceptible to extreme values.


Comparison of the mean, median and mode
Mode
• It is the score that occurs most often in a group
• It does not include all the scores in the distribution
• It is therefore not affected by extreme values
• It can only be considered as a good measure of central tendency
when:
a) the occurring frequency is very high compared to all others (at
least a third of the total number of scores)
b) the distribution is unimodal
• It can even be applied to scores using nominal values
• The value of the mode in ungrouped data is always a score of the
data set.
• In grouped data, the value of the mode is approximately
Mode ≈ mean − 3(mean − median)
Median
• It is middle value of an ordered set of scores
• It does not include all the scores in the distribution
• Therefore not affected by extreme values
• Can be applied to scores on an ordinal scale
• The value of the median might not belong to the set of the scores

Mean
• It uses every score in the distribution – most accurate measure of
central tendency
• The value of the mean might be an abstract one, e.g. mean = 5.8
children
• Affected by extreme values in the distribution (which can make it unsuitable)
• Changes in the same manner and to the same extent as the scale is changed
Property                       Mean     Median   Mode
Always exists                  No (1)   Yes      No (2)
Uses all data values           Yes      No       No
Affected by extreme values     Yes      No       No
• MEASURES OF DISPERSION OR
VARIABILITY

• Sample Range
• It is the simplest measure of variability and is
defined as the difference between the largest and
smallest observations.
• i.e. r = max(xi) – min(xi )
• Example: If the distribution is 4, 8, 9, 7, 5, 3, the largest
score is 9 and the lowest is 3. According to the formula, the
range is 9 – 3 = 6.
• An approximate measure of variability.
• Has the advantage of easy computation.
• Disproportionately affected by an extreme score as
in the case of the mean.
• Does not take into account the form of the
distribution
• Variance and Standard Deviation
• Another important way to describe data is the
dispersion of the mass around the theoretical mean.
• A natural basis for introducing this concept is (X -
µ).
• If we assume that X1, X2, ..., Xn are n
measurements of a variable X, then the sum of all the
(Xi − µ) terms could be a good start.
• However, (X1 − µ) + ... + (Xn − µ) will end up
nearly 0 if X is symmetrically distributed.
• One way of avoiding this is to use the absolute
values |Xi − µ|.
• This has been tried, and (1/n) Σ|Xi − µ| is called the
total variation.
• However, this formula is very difficult to work with
mathematically.
• The most important measure of variability or
dispersion is the variance
• The variance is the average of the squared
deviations from the mean of a distribution; for a
population it is given by the formula
σ² = Σ(Xi − µ)² / N
where µ is the population mean.

• The population standard deviation σ is the positive square root of
the variance.
• The sample variance s² is the unbiased estimator of the
population variance and is calculated using the following formula:

Sample variance: s² = Σ(xi − x̄)² / (n − 1)
Shortcut formula for calculating s²:
s² = [Σx² − (Σx)²/n] / (n − 1)
where n − 1 is the number of degrees of freedom.
There is a problem with variances since the deviations were
squared.
That means that the units were also squared.
To get the units back the same as the original data values, the
square root must be taken.
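The two formulas give identical results, which can be checked with a short Python sketch (illustrative only; the data set is the one used in the mean example above):

import math

def sample_variance(x):
    # Definitional formula: s² = Σ(xi − x̄)² / (n − 1)
    n = len(x)
    xbar = sum(x) / n
    return sum((v - xbar) ** 2 for v in x) / (n - 1)

def sample_variance_shortcut(x):
    # Shortcut formula: s² = [Σx² − (Σx)²/n] / (n − 1)
    n = len(x)
    return (sum(v ** 2 for v in x) - sum(x) ** 2 / n) / (n - 1)

x = [5, 7, 6, 1, 8]
s2 = sample_variance(x)
print(s2, sample_variance_shortcut(x))   # both 7.3
print(math.sqrt(s2))                     # sample standard deviation s ≈ 2.70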
The standard error of the mean (SEM) is the standard
deviation of the sample-mean's estimate of a
population mean
SEM is usually estimated by the sample estimate of the
population standard deviation (sample standard
deviation) divided by the square root of the sample size
(assuming statistical independence of the values in the
sample):

SEM = s / √n

where
• s is the sample standard deviation, and
• n is the size (number of observations) of the sample.
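For example (an illustrative Python sketch, reusing the same small data set as above):

import math

def sem(x):
    # SEM = s / √n, with s the sample standard deviation.
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    return s / math.sqrt(n)

print(sem([5, 7, 6, 1, 8]))   # ≈ 1.21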
Empirical Rule
The empirical rule is only valid for bell-shaped (normal)
distributions. The following statements are true.
• Approximately 68% of the data values fall within one standard
deviation of the mean.
• Approximately 95% of the data values fall within two standard
deviations of the mean.
• Approximately 99.7% of the data values fall within three
standard deviations of the mean.
Percentiles, Deciles, Quartiles
Percentiles (100 regions)
The kth percentile is the number which has k% of the
values below it. The data must be ranked.
• Rank the data
• Find k% (k /100) of the sample size, n.
• If this is an integer, add 0.5. If it isn't an integer round up.
• Find the number in this position. If your depth ends in 0.5,
then take the midpoint between the two numbers.
The 80th percentile is the number which has 80%
below it and 20% above it.

Note: The 50th percentile is the median.


If you wish to find the percentile for a number (rather than
locating the kth percentile), then
– Take the number of values below the number
– Add 0.5
– Divide by the total number of values
– Convert it to a percent
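Both procedures can be written as short Python functions. The sketch below follows the steps listed above; the function names and the ten-value data set are only illustrative.

import math

def kth_percentile(data, k):
    # Rank the data, find k% of n, add 0.5 if that is an integer,
    # otherwise round up; a depth ending in .5 means "take the midpoint".
    data = sorted(data)
    n = len(data)
    depth = k / 100 * n
    depth = depth + 0.5 if depth == int(depth) else math.ceil(depth)
    if depth != int(depth):                       # depth ends in .5
        i = int(depth)
        return (data[i - 1] + data[i]) / 2        # midpoint of two values
    return data[int(depth) - 1]

def percentile_of(data, value):
    # Percentile of a given number: (count below + 0.5) / n, as a percent.
    below = sum(1 for v in data if v < value)
    return (below + 0.5) / len(data) * 100

scores = [2, 3, 5, 5, 6, 7, 8, 9, 9, 10]
print(kth_percentile(scores, 50))   # 6.5 (the median of these 10 values)
print(percentile_of(scores, 7))     # 55.0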

Example 1: sample size of 20


• The median will be in position 10.5.
• The lower half is positions 1 - 10 and the upper half is positions
11 - 20.
• The lower quartile is the median of the lower half and would be
in position 5.5.
• The upper quartile is the median of the upper half and would be
in position 5.5 starting with original position 11 as position 1 --
this is the original position 15.5.
• Example 2: sample size of 21
• The median is in position 11.
• The lower half is positions 1 - 11 and the upper half
is positions 11 - 21.
• The first quartile is the median of the lower half
and would be in position 6.
• The third quartile is the median of the upper half
and would be in position 6 when starting at position
11 -- this is original position 16.
Deciles (10 regions)
• The percentiles divide the data into 100 equal
regions.
• The deciles divide the data into 10 equal regions.
• The instructions are the same for finding a
percentile, except instead of dividing by 100 in step
2, divide by 10.
Five Number Summary
• consists of the minimum value, lower hinge (first quartile), median,
upper hinge (third quartile), and maximum value.
• Some textbooks use the quartiles instead of the hinges.

• Box and Whiskers Plot


• A graphical representation of the five number
summary.
• A box is drawn between the first (25th percentile) and
third (75th percentile) quartiles, with a line at the
median. Whiskers (a single line, not a box) extend from
the quartiles to lines at the minimum and maximum
values.
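A minimal Python sketch of the five number summary, using the median-of-halves convention from the two quartile examples above (the data set is illustrative, not from the lecture):

def five_number_summary(data):
    # Minimum, lower quartile, median, upper quartile, maximum.
    data = sorted(data)
    n = len(data)

    def mid(xs):
        m = len(xs)
        return xs[m // 2] if m % 2 else (xs[m // 2 - 1] + xs[m // 2]) / 2

    med = mid(data)
    if n % 2:   # odd n: the median belongs to both halves (as in Example 2)
        lower, upper = data[:n // 2 + 1], data[n // 2:]
    else:
        lower, upper = data[:n // 2], data[n // 2:]
    return min(data), mid(lower), med, mid(upper), max(data)

print(five_number_summary([2, 3, 5, 5, 6, 7, 8, 9, 9, 10]))
# (2, 5, 6.5, 9, 10): the box runs from 5 to 9 with a line at 6.5,
# and the whiskers extend out to 2 and 10.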
Probability
Outcome
• An outcome is the result of an experiment or other situation
involving uncertainty.
• The set of all possible outcomes of a probability experiment is
called a sample space.

Sample Space
• An exhaustive list of all the possible outcomes of an
experiment.
• Each possible result of such a study is represented by
one and only one point in the sample space, which is
usually denoted by S.
Examples
1. Experiment: Rolling a die once:
Sample space S = {1,2,3,4,5,6}

2. Experiment: Tossing a coin:


Sample space S = {Heads,Tails}

3. Experiment: Measuring the height (cm) of a girl on
her first day at school:
Sample space S = the set of all possible real numbers
Event
• Defined as any collection of outcomes of an
experiment.
• Formally, any subset of the sample space is an
event.
• Any event which consists of a single outcome in the
sample space is called an elementary or simple
event.
• Events which consist of more than one outcome are
called compound events.
• Set theory is used to represent relationships among
events. In general, if A and B are two events in the
sample space S, then

– A ∪ B (A union B) = 'either A or B occurs or both occur'
– A ∩ B (A intersection B) = 'both A and B occur'
– A ⊂ B (A is a subset of B) = 'if A occurs, so does B'
– A' or Ā (the complement of A) = 'event A does not occur'
– ∅ (the empty set) = an impossible event
– S (the sample space) = an event that is certain to occur
Example
Experiment: rolling a die once -
• Sample space S = {1,2,3,4,5,6}
• Events A = 'score < 4' = {1,2,3}
• B = 'score is even' = {2,4,6}
• C = 'score is 7' = ∅

• A ∪ B = 'the score is < 4 or even or both' = {1,2,3,4,6}
• A ∩ B = 'the score is < 4 and even' = {2}
• A' = 'event A does not occur' = {4,5,6}

Relative Frequency
• Relative frequency is another term for proportion;
• calculated by dividing the number of times an event
occurs by the total number of times an experiment
is carried out.
• The probability of an event can be thought of as its
long-run relative frequency when the experiment is
carried out many times.
• If an experiment is repeated n times, and event E
occurs r times, then the relative frequency of the
event E is defined to be
rfn(E) = r/n
Example
• Experiment: Tossing a fair coin 50 times (n = 50)
• Event E = 'heads'
• Result: 30 heads, 20 tails, so r = 30
• Relative frequency: rfn(E) = r/n = 30/50 = 3/5 = 0.6
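This kind of repeated experiment is easy to simulate; the Python sketch below is illustrative (the random seed and toss counts are arbitrary choices, not from the lecture).

import random

random.seed(1)   # fixed seed so the illustration is reproducible

def relative_frequency_of_heads(n_tosses):
    heads = sum(1 for _ in range(n_tosses) if random.random() < 0.5)
    return heads / n_tosses

for n in (50, 500, 5000, 50000):
    print(n, relative_frequency_of_heads(n))
# The printed relative frequencies settle down towards 0.5 as n grows.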

If an experiment is repeated many times without
changing the experimental conditions, the relative
frequency of any particular event will settle down
to some value.
The probability of the event can be defined as the
limiting value of the relative frequency:
P(E) = lim (as n → ∞) of rfn(E)

For example, in the above experiment, the relative
frequency of the event 'heads' will settle down to a
value of approximately 0.5 if the experiment is
repeated many more times.

Probability
• A probability provides a quantitative description of the
likely occurrence of a particular event.
• Probability is conventionally expressed on a scale from
0 to 1; a rare event has a probability close to 0, a very
common event has a probability close to 1.
• The probability of an event has been defined as its
long-run relative frequency.
• It can also be thought of as a personal degree of
belief that a particular event will occur (subjective
probability).

• In some experiments, all outcomes are equally likely.
• For example if you were to choose one winner in a
raffle from a hat, all raffle ticket holders are equally
likely to win, that is, they have the same probability
of their ticket being chosen.
• This is the equally-likely outcomes model and is
defined to be:
• P(E) = (number of outcomes corresponding to event E) /
         (total number of outcomes)

Examples
1. The probability of drawing a spade from a pack of
52 well-shuffled playing cards is 13/52 = 1/4 = 0.25
since event E = 'a spade is drawn'; the number of
outcomes corresponding to E = 13 (spades); the
total number of outcomes = 52 (cards).

2. When tossing a coin, we assume that the results
'heads' or 'tails' each have equal probabilities of 0.5.

Independent Events
• Two events are independent if the occurrence of
one of the events gives us no information about
whether or not the other event will occur; that is,
the events have no influence on each other.

• In probability theory we say that two events, A and
B, are independent if the probability that they both
occur is equal to the product of the probabilities of
the two individual events, i.e.
P(A ∩ B) = P(A) · P(B)
The idea of independence can be extended to more than
two events. For example, A, B and C are independent if:
a) A and B are independent, A and C are independent,
and B and C are independent (pairwise
independence); and
b) P(A ∩ B ∩ C) = P(A) · P(B) · P(C).

Note: if two events (each with non-zero probability) are
independent, then they cannot be mutually exclusive
(disjoint), and vice versa.

Example
Suppose that a man and a woman each have a pack of 52
playing cards.
• Each draws a card from his/her pack. Find the
probability that they each draw the ace of clubs.

• We define the events:
– A = 'man draws the ace of clubs', P(A) = 1/52
– B = 'woman draws the ace of clubs', P(B) = 1/52
• Clearly events A and B are independent, so:
P(A ∩ B) = P(A) · P(B) = 1/52 × 1/52 ≈ 0.00037

• That is, there is a very small chance that the man
and the woman will both draw the ace of clubs.
Mutually Exclusive Events
• Two events are mutually exclusive (or disjoint) if it is
impossible for them to occur together.

• Formally, two events A and B are mutually exclusive
if and only if A ∩ B = ∅, i.e. P(A ∩ B) = 0.

• If two events are mutually exclusive, they cannot be
independent, and vice versa.
Examples
• Experiment: Rolling a die once
• Sample space S = {1,2,3,4,5,6}
• Events A = 'observe an odd number' = {1,3,5}
• B = 'observe an even number' = {2,4,6}
• A ∩ B = ∅ (the empty set), so A and B are mutually
exclusive.

Addition Rule
• Used to determine the probability that event A or
event B occurs, or both occur.
The result is often written as follows, using set
notation:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

where:
• P(A) = probability that event A occurs
• P(B) = probability that event B occurs
• P(A ∪ B) = probability that event A or event B occurs
• P(A ∩ B) = probability that event A and event B both
occur
• For mutually exclusive events, that is events which
cannot occur together:
P(A ∩ B) = 0
The addition rule therefore reduces to
P(A ∪ B) = P(A) + P(B)
For independent events, that is events which have no influence on
each other:
P(A ∩ B) = P(A) · P(B)
The addition rule therefore reduces to
P(A ∪ B) = P(A) + P(B) − P(A) · P(B)

Example
• Suppose we wish to find the probability of drawing either a king or
a spade in a single draw from a pack of 52 playing cards.
• We define the events A = 'draw a king' and B = 'draw a spade'
• Since there are 4 kings in the pack and 13 spades, but 1 card is
both a king and a spade, we have:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 4/52 + 13/52 − 1/52 = 16/52

So, the probability of drawing either a king or a spade
is 16/52 (= 4/13).
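The same answer can be obtained by brute-force counting over a 52-card deck; the Python sketch below is illustrative (the card labels are made up for the example).

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['clubs', 'diamonds', 'hearts', 'spades']
deck = [(r, s) for r in ranks for s in suits]      # 52 cards

kings  = {c for c in deck if c[0] == 'K'}          # event A
spades = {c for c in deck if c[1] == 'spades'}     # event B

direct = len(kings | spades) / len(deck)           # count A ∪ B directly
by_rule = (len(kings) + len(spades) - len(kings & spades)) / len(deck)
print(direct, by_rule, 16 / 52)                    # all three ≈ 0.3077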

Multiplication Rule
• Used to determine the probability that two events,
A and B, both occur.
• The multiplication rule follows from the definition
of conditional probability.
• The result is often written as follows, using set
notation:

P(A ∩ B) = P(A | B) · P(B) = P(B | A) · P(A)

where:
• P(A) = probability that event A occurs
• P(B) = probability that event B occurs
• P(A ∩ B) = probability that event A and event B both occur
• P(A | B) = the conditional probability that event A
occurs given that event B has occurred already
• P(B | A) = the conditional probability that event B
occurs given that event A has occurred already

• For independent events, that is events which have
no influence on one another, the rule simplifies to:
P(A ∩ B) = P(A) · P(B)
Conditional Probability
• In many situations, once more information becomes
available, we are able to revise our estimates for the
probability of further outcomes or events happening.

• For example, suppose you go out for lunch at the same
place and time every Friday and you are served lunch
within 15 minutes with probability 0.9.

• However, given that you notice that the restaurant is
exceptionally busy, the probability of being served
lunch within 15 minutes may reduce to 0.7.

• This is the conditional probability of being served lunch
within 15 minutes given that the restaurant is
exceptionally busy.
The usual notation for "event A occurs given that
event B has occurred" is "A | B" (A given B).

The symbol | is a vertical line and does not imply
division. P(A | B) denotes the probability that event
A will occur given that event B has occurred already.

A rule that can be used to determine a conditional
probability from unconditional probabilities is:

P(A | B) = P(A ∩ B) / P(B)

where:
• P(A | B) = the (conditional) probability that event A
will occur given that event B has occurred already
• P(A ∩ B) = the (unconditional) probability that event A
and event B both occur
• P(B) = the (unconditional) probability that event B
occurs
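As a small illustration (not from the lecture), the rule can be applied to the die-rolling events defined earlier, where A = 'score < 4' and B = 'score is even':

def conditional(p_a_and_b, p_b):
    # P(A | B) = P(A ∩ B) / P(B)
    return p_a_and_b / p_b

p_a_and_b = 1 / 6    # A ∩ B = {2}, one outcome out of six
p_b = 3 / 6          # B = {2, 4, 6}
print(conditional(p_a_and_b, p_b))   # 1/3: given an even score, P(score < 4)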
Probability Distributions
A listing of all the values a random variable can
assume with their corresponding probabilities make
a probability distribution.
Note: A random variable does not mean that the
values can be anything (a random number).
Random variables have a well defined set of
outcomes and well defined probabilities for the
occurrence of each outcome.
The “random” refers to the fact that the outcomes
happen by chance -- that is, you don't know which
outcome will occur next.
Example: probability distribution that results from the
rolling of a single fair die.

x      1     2     3     4     5     6     sum
p(x)   1/6   1/6   1/6   1/6   1/6   1/6   6/6 = 1

A probability distribution is described by its mean,
variance and standard deviation.
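For the fair-die distribution above, the usual formulas are µ = Σ x·p(x) and σ² = Σ (x − µ)²·p(x); a small Python check (illustrative, not part of the lecture):

xs = [1, 2, 3, 4, 5, 6]
p = 1 / 6                               # each outcome equally likely

mu = sum(x * p for x in xs)             # mean of the distribution
var = sum((x - mu) ** 2 * p for x in xs)
print(mu, var, var ** 0.5)              # 3.5, ≈ 2.917, ≈ 1.708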
Binomial Probability Distribution
Binomial Experiment
A binomial experiment is an experiment which
satisfies these four conditions
• A fixed number of trials
• Each trial is independent of the others
• There are only two outcomes
• The probability of each outcome remains constant
from trial to trial.
Examples of binomial experiments
• Tossing a coin 20 times to see how many tails occur.
• Asking 200 people if they watch ABC news.
• Rolling a die to see if a 5 appears.

Examples which aren't binomial experiments


• Rolling a die until a 6 appears (not a fixed number
of trials)
• Asking 20 people how old they are (not two
outcomes)
• Drawing 5 cards from a deck for a poker hand (done
without replacement, so not independent)
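The standard binomial probability formula (not shown in the slides) gives the chance of exactly k successes in n trials as C(n, k)·p^k·(1 − p)^(n − k). The Python sketch below applies it to the coin-tossing example above; the choice of 8 tails is illustrative.

from math import comb

def binomial_pmf(k, n, p):
    # P(exactly k successes) = C(n, k) · p^k · (1 − p)^(n − k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 8 tails in 20 tosses of a fair coin.
print(binomial_pmf(8, 20, 0.5))   # ≈ 0.1201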
Poisson Distribution
Named after the French mathematician Simeon
Poisson, Poisson probabilities are useful when there
are a large number of independent trials with a
small probability of success on a single trial and the
variables occur over a period of time.
It can also be used when a density of items is
distributed over a given area or volume.
Lambda in the formula is the mean number of
occurrences.
If you're approximating a binomial probability using
the Poisson, then lambda is the same as mu or n *
p.
Example:
If there are 500 customers per eight-hour day in a
check-out lane, what is the probability that there
will be exactly 3 in line during any five-minute
period?
The expected value during any one five minute period
would be 500 / 96 = 5.2083333.
The 96 is because there are 96 five-minute periods in
eight hours.
So, you expect about 5.2 customers in 5 minutes and
want to know the probability of getting exactly 3.

P(X = 3) = e^(−λ) · λ³ / 3! = e^(−5.2083) × 5.2083³ / 6 ≈ 0.1288
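A short Python check of this calculation (illustrative only):

from math import exp, factorial

def poisson_pmf(k, lam):
    # P(exactly k occurrences) = e^(−λ) · λ^k / k!
    return exp(-lam) * lam ** k / factorial(k)

lam = 500 / 96                 # ≈ 5.2083 customers expected per 5-minute period
print(poisson_pmf(3, lam))     # ≈ 0.1288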
Normal Distribution
Any Normal Distribution
• Bell-shaped
• Symmetric about mean
• Continuous
• Never touches the x-axis
• Total area under curve is 1.00
• Approximately 68% lies within 1 standard deviation of the
mean, 95% within 2 standard deviations, and 99.7% within
3 standard deviations of the mean. This is the Empirical
Rule mentioned earlier.
• Data values are represented by x, which has mean µ and
standard deviation σ.
• Probability density function given by
f(x) = 1 / (σ√(2π)) · e^(−(x − µ)² / (2σ²))
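A small Python sketch of this density and of the Empirical Rule percentages (illustrative; it uses the standard result that P(µ − kσ < X < µ + kσ) = erf(k/√2) for a normal distribution):

from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    # f(x) = 1 / (σ√(2π)) · e^(−(x − µ)² / (2σ²))
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def prob_within(k):
    # P(µ − kσ < X < µ + kσ) for any normal distribution
    return erf(k / sqrt(2))

print(normal_pdf(0, 0, 1))      # ≈ 0.399, the peak of the standard normal
print(prob_within(1), prob_within(2), prob_within(3))
# ≈ 0.683, 0.954, 0.997 — the Empirical Rule percentages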
Interquartile Range (IQR)
• The interquartile range is the difference between
the third and first quartiles. That's it: Q3 - Q1
Interval estimation
In statistics, interval estimation is the use of sample
data to calculate an interval of possible (or
probable) values of an unknown population
parameter
