Biostatistics Module-1 PDF
discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/292177058
All content following this page was uploaded by Avijit Hazra on 15 April 2016.
Key Words: Boxplot, confidence interval, data, descriptive statistics, measures of central
tendency, measures of dispersion, normal distribution, stem‑and‑leaf plot, variable
behavioral variations, individuals residing as neighbors in the same locality may differ greatly in their perception of stigma associated with a common skin disease like vitiligo. Often the degree of variability is substantial even when observational or interventional conditions are held as uniform and constant as possible. The challenge for the biomedical researcher is to unearth the patterns that are being obscured by the variability of responses in living systems. Further, the researcher is often interested in small differences or changes. For instance, if we give you two antibiotics and say that drug A has a 10% cure rate in folliculitis with 7 days of treatment while drug B has a 90% cure rate in the same situation, and ask you to choose one for your patient, the choice would be obvious. However, if we were to say that the cure rates for drugs A and B are 95% and 97% respectively, then your choice will not be so obvious. Very likely, you will be wondering whether the difference of 2% is worth changing practice if you are accustomed to using drug A, or maybe you will look at other factors such as the toxicity profile, cost or ease of use. Statistics gives us the tools, albeit mathematical, to make an appropriate choice by judging the “significance” of such small observed differences or changes.

Furthermore, it is important to remember that statistics is the science of generalization. We are generally not in a position to carry out “census” type of studies that cover entire populations. Therefore, we usually study subsets or samples of a population and hope that the conclusions drawn from studying such a subset can be generalized to the population as a whole. This process is fraught with errors, and we require statistical techniques to make the generalizations tenable.

Before the advent of computers and statistical software, researchers and others dealing with statistics had to do most of their analysis by hand, taking recourse to books of statistical formulas and statistical tables. This required one to be proficient in the mathematics underlying statistics. This is no longer mandatory since increasingly user‑friendly software takes the drudgery out of calculations and obviates the need for looking up statistical tables. Therefore, today, understanding the applied aspects of statistics suffices for the majority of researchers, and we seldom need to dig into the mathematical depths of statistics to make sense of the data that we generate or scrutinize.

The applications of biostatistics broadly cover three domains: description of patterns in data sets through various descriptive measures (descriptive statistics), drawing conclusions regarding populations through various statistical tests applied to sample data (inferential statistics), and application of modeling techniques to understand relationships between variables (statistical modeling), sometimes with the goal of prediction.

In this series, we will look at the applied uses of statistics without delving into mathematical depths. This is not to deny the mathematical underpinnings of statistics; these can be found in statistics textbooks. Our goal here is to present the concepts and look at the applications from the point of view of the applied user of biostatistics.

Data and Variables

Data constitute the raw material for statistical work. They are records of measurements or observations or simply counts. A variable refers to a particular character on which a set of data are recorded. Data are thus the values of a variable. Before a study is undertaken it is important to consider the nature of the variables that are to be recorded. This will influence the manner in which observations are undertaken, the way in which they are summarized and the choice of statistical tests that will be used.

At the most basic level, it is important to distinguish between two types of data or variables. The first type includes those measured on a suitable scale using an appropriate measuring device and is called a quantitative variable. Since quantitative variables always have values expressed as numbers, and the differences between values have numerical meaning, they are also referred to as numerical variables. The second type includes those which are defined by some characteristic, or quality, and is referred to as a qualitative variable. Because qualitative data are best summarized by grouping the observations into categories and counting the numbers in each, they are often referred to as categorical variables.

A quantitative variable can be continuous or discrete. A continuous variable can, in theory at least, take on any value within a given range, including fractional values. A discrete variable can take on only certain discrete values within a given range; often these values are integers. Sometimes variables (e.g., age of adults) are treated as discrete variables although strictly speaking they are continuous. A qualitative variable can be a nominal variable or an ordinal variable. A nominal variable covers categories that cannot be ranked, and no category is more important than another. The data are generated simply by naming, on the basis of a qualitative attribute, the appropriate category to which the observation belongs. An ordinal variable has categories that follow a logical hierarchy and hence can be ranked. We can assign numbers (scores) to nominal and ordinal categories, although the differences among those numbers do not have numerical meaning. However, category counts do have numerical significance. A special case may exist for both categorical and numerical variables when the variable in question can take on only one of two numerical values or belong to only one of two categories; these are known as binary or dichotomous variables [Table 1].
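The point that numbers assigned to nominal categories are only labels, while category counts are genuine numbers, can be illustrated with a minimal Python sketch. The diagnoses and the numeric codes below are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical nominal data: diagnosis recorded for eight patients.
diagnoses = ["vitiligo", "psoriasis", "acne", "psoriasis",
             "acne", "acne", "vitiligo", "acne"]

# Arbitrary numeric codes for the categories (labels, not measurements).
codes = {"acne": 1, "psoriasis": 2, "vitiligo": 3}
coded = [codes[d] for d in diagnoses]

# Category counts and relative frequencies are meaningful summaries.
counts = Counter(diagnoses)
relative = {name: n / len(diagnoses) for name, n in counts.items()}

# The mean of the codes would change if the labels were renumbered,
# so it says nothing about the patients.
mean_of_codes = sum(coded) / len(coded)

print(counts)
print(relative)
print("mean of arbitrary codes (not interpretable):", mean_of_codes)
```

Renumbering the categories would change `mean_of_codes` but leave `counts` untouched, which is why categorical data are summarized by counts and proportions.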
Numerical data can be recorded on an interval scale or a ratio scale. On an interval scale, the differences between two consecutive numbers carry equal significance in any part of the scale, unlike the scoring of an ordinal variable (“ordinal scale”). For example, when measuring height, the difference between 100 and 102 cm is the same as the difference between 176 and 178 cm. The ratio scale is a special case of recording interval data. With interval scale data the 0 value can be arbitrary, such as the position of 0 on some temperature scales: the Fahrenheit 0 is at a different position to that of the Celsius scale. With a ratio scale, 0 actually indicates the point where nothing is scored on the scale (“true 0”), such as 0 on the absolute or Kelvin scale of temperature. Thus, we can say that an interval scale of measurement has the properties of identity, magnitude, and equal intervals while the ratio scale has the additional property of a true 0. Only on a ratio scale can differences be judged in the form of ratios. 0°C is not 0 heat, nor is 26°C twice as hot as 13°C; whereas these value judgments hold with the Kelvin scale. In practice, this distinction is not tremendously important so far as the handling of numerical data in statistical tests is concerned.

Changing data scales is possible so that numerical data may become ordinal, and ordinal data may become nominal (even dichotomous). This may be done when the researcher is not confident about the accuracy of the measuring instrument, is unconcerned about the loss of fine detail, or where group numbers are not large enough to adequately represent a variable of interest. It may also make clinical interpretation easier. For example, the Dermatology Life Quality Index (DLQI) is used to assess how much of an adult subject’s skin problem is affecting his or her quality of life. A DLQI score <6 indicates that the skin problem is hardly affecting the quality of life, a score of 6–20 indicates a moderate to large effect on quality of life, while a score >20 indicates that the problem is severely degrading the quality of life. This categorization may be more relevant to the clinician than the actual DLQI score achieved. In contrast, converting from categorical to numerical will not be feasible without having actual measurements.

When exploring the relationship between variables, some can be considered as dependent (dependent variable) on others (independent variables). For instance, when exploring the relationship between height and age, it is obvious that height depends on age, at least until a certain age. Thus, age is the independent variable, which influences the value of the dependent variable height. When exploring the relationship between multiple variables, usually in a modeling situation, the value of the outcome (response) variable depends on the value of one or more predictor (explanatory) variables. In this situation, some variables may be identified that cannot be accurately measured or controlled and only serve to confuse the results. They are called confounding variables or confounders. Thus, in a study of the protective effect of a sunscreen in preventing skin cancer, the amount of time spent in outdoor activity could be a major confounder. The extent of skin pigmentation would be another confounder. There could even be confounders whose existence is unknown or effects unsuspected, for instance, undeclared consumption of antioxidants by the subjects, which is quite possible because the study would go on for a long time. Such unsuspected confounders have been called lurking variables.

Numerical or categorical variables may sometimes need to be ranked, that is, arranged in ascending order and new values assigned to them serially. Values that tie are each assigned the average of the ranks they encompass. Thus, a data series 2, 3, 3, 10, 23, 35, 37, 39, 45 can be ranked as 1, 2.5, 2.5, 4, 5, 6, 7, 8, 9 since the two 3s encompass ranks 2 and 3, giving an average rank value of 2.5. Note that when a numerical variable is ranked, it gets converted to an ordinal variable. Ranking obviously does not apply to nominal variables because their values do not follow any order.
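The ranking rule for tied values can be sketched in a few lines of Python. This is an illustrative implementation, not code from the article; the function name `average_ranks` is an assumption, and the data series is the one used in the text.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of the ranks they span."""
    # Indices of the observations, ordered by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Sorted positions pos..end (0-based) correspond to ranks pos+1..end+1.
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


series = [2, 3, 3, 10, 23, 35, 37, 39, 45]
ranks = average_ranks(series)
print(ranks)  # [1.0, 2.5, 2.5, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```

The two 3s occupy sorted positions 2 and 3, so each receives (2 + 3) / 2 = 2.5, matching the worked example.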
• Abbreviations: DLQI = Dermatology life quality index; PASI = Psoriasis Area and Severity Index; VAS =
Visual Analog Scale
• We can interconvert variable types. For example, age (which is actually a measurement of time since birth) is a
continuous numerical variable, but we usually treat it as discrete by recording it as the completed number of
years. After recording hair length we may classify a subject into the long, medium or short hair category. If we
are interested in only two categories of hair length, say long or short, then this becomes a binary variable.
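As a concrete case of such interconversion, the DLQI cut-offs quoted in the text (<6, 6–20, >20) can be applied to turn a numerical score into an ordinal band and then into a binary variable. The sketch below is illustrative only: the function names, band labels and scores are assumptions, not part of the DLQI instrument.

```python
def dlqi_band(score):
    """Map a numerical DLQI score to the three ordinal bands described in the text."""
    if score < 6:
        return "hardly affected"
    if score <= 20:
        return "moderate to large effect"
    return "severely degraded"


def is_severe(score):
    """Collapse the scale further into a binary (dichotomous) variable."""
    return score > 20


scores = [3, 6, 14, 20, 21, 27]  # hypothetical DLQI scores
bands = [dlqi_band(s) for s in scores]
binary = [is_severe(s) for s in scores]

for s, band, flag in zip(scores, bands, binary):
    print(s, band, flag)
```

The conversion runs in one direction only: the original score cannot be recovered from a band, which is why going from categorical back to numerical requires the actual measurements.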
define a range in which the true population value is likely to lie, and this range is the CI while its two terminal values are the confidence limits. The width of the CI depends on the standard error and the degree of confidence required. Conventionally, the 95% CI is most commonly used. From the properties of a normal distribution curve (see below) it can be shown that the 95% CI of the mean would cover a range 1.96 standard errors either side of the sample mean, and will have a 95% probability of including the population mean; while the 99% CI will span 2.58 standard errors either side of the sample mean and will have a 99% probability of including the population mean. Thus, a fundamental relation that needs to be remembered is:

95% CI of mean = Sample mean ± 1.96 × SEM.

It is evident that the CI would be narrower if the SEM is smaller. Thus if a sample is larger, the SEM would be smaller and the CI would be correspondingly narrower and thus more “focused” on the true mean. Large samples therefore increase precision. It is interesting to note that although increasing sample size improves precision, it is a somewhat costly approach, since halving the SEM requires a 4‑fold increase in sample size.

CIs can be used to estimate most population parameters from sample statistics (means, proportions, correlation coefficients, regression coefficients, odds ratios, relative risks, etc.). In all cases, the principles and the general pattern of estimating the CI remain the same, that is:

95% CI of a parameter = Sample statistic ± 1.96 × standard error for that statistic.

The formulae for estimating the standard error, however, vary for different statistics, and in some instances are quite elaborate. Fortunately, we generally rely on computer software to do the calculations.

Frequency Distributions

It is useful to summarize a set of raw numbers with a frequency distribution. The summary may be in the form of a table or a graph (plot). Many frequency distributions are encountered in medical literature [Figure 1] and it is important to be familiar with commonly encountered ones.

The majority of distributions that quantitative clinical data follow are unimodal, that is, the data have a single peak (mode) with a tail on either side. The most common of these unimodal distributions is the bell‑shaped symmetrical distribution called the normal distribution or the Gaussian distribution [Figure 2]. In this distribution, the values of mean, median and mode will coincide. However, some distributions are skewed with a substantially longer tail on one side. The type of skew is determined by the direction of the longer tail. A positively skewed distribution has a longer tail to the right. In this case, the mean will be greater than the median because the mean is strongly influenced by the extreme values in the right‑hand tail. On the other hand, a negatively skewed distribution has a longer tail to the left; in this instance, the mean will be smaller than the median. Thus, the relationship between mean and median gives an idea of the distribution of numerical data.
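The mean-versus-median comparison can be tried on small made-up datasets. The sketch below uses Python's standard `statistics` module; the three datasets and the helper name `mean_vs_median` are hypothetical, and the verdict is only a rough hint, not a formal test of skewness.

```python
import statistics


def mean_vs_median(data):
    """Return (mean, median, verdict); the verdict is a rough hint about skew."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if mean > median:
        verdict = "mean > median: suggests positive (right) skew"
    elif mean < median:
        verdict = "mean < median: suggests negative (left) skew"
    else:
        verdict = "mean = median: consistent with symmetry"
    return mean, median, verdict


symmetric = [2, 3, 4, 5, 6, 7, 8]
right_skewed = [2, 3, 3, 4, 4, 5, 30]      # one extreme value in the right tail
left_skewed = [1, 26, 27, 27, 28, 28, 29]  # one extreme value in the left tail

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(name, mean_vs_median(data))
```

In the right-skewed set the single value of 30 pulls the mean well above the median, while the median barely moves, which is the behavior the paragraph describes.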
Figure 1: Examples of frequency distribution shapes. Note that the normal distribution is symmetrical but there can be distributions that are symmetrical but not normal
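The relation 95% CI of mean = sample mean ± 1.96 × SEM, stated earlier, can be checked numerically. The sketch below assumes the usual definition SEM = SD / √n and uses hypothetical blood pressure readings; the multipliers 1.96 and 2.58 are recovered from the standard normal distribution with `statistics.NormalDist`. For small samples a t-based multiplier is normally preferred, which this sketch does not attempt.

```python
import math
import statistics

# Multipliers from the standard normal curve: the central 95% and 99% areas.
z95 = statistics.NormalDist().inv_cdf(0.975)
z99 = statistics.NormalDist().inv_cdf(0.995)


def ci_of_mean(sample, z):
    """Return (lower, upper) confidence limits as mean ± z × SEM."""
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(len(sample))  # assumed SEM formula
    return mean - z * sem, mean + z * sem


def sem_for(sd, n):
    """SEM for a given SD and sample size."""
    return sd / math.sqrt(n)


# Hypothetical systolic blood pressure readings (mmHg).
readings = [118, 122, 126, 130, 124, 120, 128, 132, 126, 124]
low95, high95 = ci_of_mean(readings, z95)
low99, high99 = ci_of_mean(readings, z99)

# With the SD held fixed, a 4-fold larger sample halves the SEM.
ratio = sem_for(10.0, 400) / sem_for(10.0, 100)

print(round(z95, 2), round(z99, 2))
print((low95, high95), (low99, high99))
print("SEM ratio after 4-fold increase in n:", ratio)
```

The 99% interval is wider than the 95% interval for the same data, and the SEM ratio of 0.5 shows why precision gained through sample size alone is expensive.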
It is possible that datasets may have more than one peak (mode). Such data can be difficult to manage and it may be the case that neither the mean nor the median is a representative measure. However, it is important to remember that bimodal or multimodal distributions are rare and may even be artifactual. A distribution with two peaks may actually be reflecting a combination of two unimodal distributions, for instance, one for each gender or different age groups. In such cases, appropriate subdivision, categorization, or even re‑collection of the data may be required to eliminate multiple peaks.

Figure 2: Relation between the measures of central tendency for commonly encountered frequency distributions. (a) Negatively skewed, (b) Normal, (c) Positively skewed

Probability Distributions

A random variable is a numerical quantity whose values are determined by the outcomes of a random experiment. The possible values of a random variable and the associated probabilities constitute a statistical probability distribution. The concepts of probability distributions and frequency distributions are similar in that each associates a number with the possible values of a variable. However, for a frequency distribution, the number is a frequency, while for a probability distribution, this number is a probability. A frequency distribution describes a set of data that has been observed; it is thus empirical. A probability distribution describes data that might be observed under certain specified conditions; hence it is theoretical. Probability distributions are part of descriptive statistics, and they can be used to predict how random variables are expected to behave under certain conditions. If the empirical data deviate considerably from the predictions of a probability distribution model, the correctness of the model or its assumptions can be questioned, and we may look for alternative models to fit the empirical data. Table 2 provides examples of statistical probability distributions. Note that they are broadly classified as continuous or discrete probability distributions depending on whether the random variable in question is a continuous or a discrete variable.

Figure 3: Normal distribution of a variable x with mean µ and SD σ. The bottom panel shows Z-score transformation of x to derive the standard normal curve (Z distribution) with mean 0 and SD 1. SD: Standard deviation

Of the many probability distributions that can be used to model biological events or observations, the most common is the normal distribution. In such a distribution, the values of the random variable tend to cluster around a central value, with a symmetrical positive and negative dispersion about this point. The more extreme values become less frequent the further they lie from the central point [Figure 3]. The term “normal” relates to the sense of 'standard' against which other distributions may be compared. It is also referred to as a Gaussian distribution after the German mathematician, Carl Friedrich Gauss (1777–1855), although Gauss was not the first person to describe such a distribution. The bell curve was named 'normal curve' by the great Karl Pearson. Important properties of a normal distribution are:
• Unimodal bell‑shaped distribution
• Symmetric about the mean
• Flattens symmetrically as the variance is increased
• Excess kurtosis is 0 (“kurtosis” refers to how peaked a distribution is)
• The tails may extend toward infinity, but the total area is taken as 1.

In a normal distribution curve, the mean, median, and mode coincide. The area delimited by one SD either side of the mean includes 68% of the total area, two SDs 95.4%, and three SDs 99.7%; 95% of the values lie within 1.96 SDs on either side of the mean. It is for this reason that the interval denoted by mean ± 1.96 × SD is often taken as the normal range or reference range for many physiological variables.

If we look at the equation for the normal distribution, it is evident that there are two parameters that define the curve, namely µ (the mean) and σ (the SD):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\mu)^2/\sigma^2}$$

The standard normal distribution curve is a special case of the normal distribution for which probabilities have been calculated. It is a symmetrical bell‑shaped curve with a mean of 0 and a variance (or SD) of 1. The random variable of a standard normal distribution is the Z-score of the corresponding value of the variable for the normal distribution. A standard normal distribution table shows the cumulative probability associated with particular Z-scores and can be used to estimate probabilities of particular values of a normally distributed variable.

In all biomedical research where samples are used to learn about populations, some random procedure is essential for subject selection to avoid many kinds of bias. This takes the form of random sampling from a population or randomized allocation of participants to interventional groups. Randomness is a property of the procedure rather than of the sample and ensures that every potential subject has a fair and equal chance of getting selected. The resulting sample is called a random sample. As the number of observations increases (say, n > 100), the shape of a random sampling distribution will approximate a normal distribution curve even if the distribution of the variable in question is not normal. This is explained by the central limit theorem and is one reason why the normal distribution is so important in biomedical research.

Many statistical techniques require the assumption of normality of the dataset. It is not mandatory for the sample data to be normally distributed, but it should represent a population that is normally distributed.

Table 2: Examples of probability distributions used in biomedical research

Continuous probability distributions | Discrete probability distributions
Normal distribution (Gaussian) | Bernoulli distribution
Log normal distribution | Binomial distribution
Continuous uniform distribution | Multinomial distribution
Student’s t distribution | Negative binomial distribution (Pascal)
Chi-square distribution | Geometric distribution
F distribution | Hypergeometric distribution
Weibull distribution | Poisson distribution
Gompertz distribution | Discrete uniform distribution

Note: With a discrete probability distribution, each possible value of the random variable can be associated with a nonzero probability. Thus, a discrete probability distribution can always be presented in tabular form. With a continuous probability distribution it is possible that the random variable would have 0 probability of assuming certain values. Hence a continuous probability distribution is not usually depicted in a tabular form but as a plot whose shape is determined by the equation for the continuous probability distribution (called a probability density function). The probability that a continuous random variable assumes a value between two boundaries is equal to the area under the curve between these two boundaries

Presenting Data

Once summary measures of data have been calculated, they need to be presented in tables and graphs. Appropriate data presentation summarizes the data in a compact and meaningful manner without burdening the reader with a surfeit of information, enables conclusions to be drawn simply by looking at the summarized data and, of course, helps in further statistical analysis where necessary.

Regarding data presentation in tables, it is helpful to remember the following:
• Tables should be numbered
• Each table must have a concise and self‑explanatory title
• Tables must be formatted with an appropriate number of rows and columns but should not be too large. Larger tables can usually be split into multiple simpler tables
• Column headings and row classifiers must be clear and concise
• For tables showing frequency distributions, it must be clear whether the frequencies depicted in each class or class interval represent absolute frequency, relative frequency (i.e., the percentage of the total) or the cumulative frequency
• For tables depicting percentages, it must be clear whether the percentages represent percentages with respect to the row (row percentage) or the column
(column percentage) in which the cell is located
• The mean is to be used for numerical data and symmetric (nonskewed) distributions
• The median should be used for ordinal data or for numerical data if the distribution is skewed
• The mode is generally used only for examining bimodal or multimodal distributions
• The range may be used for numerical data to emphasize extreme values
• The SD is to be used along with the mean
• Interquartile range or percentiles should be used along with the median
• SDs and percentiles may also be used when the objective is to depict a set of norms (“normative data”)
• The CV may be used if the intent is to compare variability between datasets measured on different numerical scales
• 95% CIs should be used whenever the intent is to draw inferences about populations from samples
• Additional information required to interpret the table (e.g., explanation of column headings, other abbreviations, explanatory remarks) can be appended as footnotes.

For presenting data graphically, it is usually necessary to obtain the summary measures, counts or percentages of the data. These can then be utilized to draw different types of graphs (or charts or plots or diagrams). The more common types with some of their variants are summarized in Table 3 and Figure 4. Although charts are visually appealing, they should not replace tabulation of important summary data. Further, if not constructed or scaled appropriately, charts can be misleading.

Table 3: Examples of data representation strategies

Tabular (in tables): numerical or categorical data | Graphical: numerical data | By charts: categorical data
Frequency distribution tables | Line diagram | Pictogram
Contingency tables | Bar chart | Pie chart
Data summary tables | Histogram | Bar chart
Data comparison tables | Frequency polygon | Map diagram
 | Distribution curves |
 | Dot plot |
 | Stem‑and‑leaf plot |
 | Box and whiskers plot |
 | Scatter plot |
 | Survival plot |

Frequency tables depict the frequencies (absolute, relative or cumulative) for a series of categories or class intervals. Contingency tables depict data in a matrix of rows and columns; the simplest is a 2 × 2 table that distributes the total n among 4 cells arranged in two rows and two columns. The counts in the cells are mutually exclusive. Most contingency tables are one way in the sense that the rows are not stratified. If the rows are stratified by another variable then multi‑way contingency tables are generated. Data summary tables depict summary measures of central tendency, dispersion and precision. Group comparison tables present comparison of two or more groups in a study. Most plots mentioned in this table have been explained in the text. Various other plots are used in biomedical literature for data summary or comparison purposes

A pictogram represents quantity by presenting stylized pictures or icons of the variable being depicted, with the number or size of the icon being proportional to the frequency. When comparing between groups using a pictogram, it is preferable that same‑sized icons be used across groups (with their numbers varying); otherwise the picture may be misleading. Pictograms are more often used in mass media presentations than in serious biomedical literature.

Pie chart depicts frequency distribution of categorical data in a circle (the “pie”), with the sectors of the circle proportional in size to the frequencies in the respective categories. A particular category can be emphasized by pulling out that sector. All sectors are pulled out in an “exploded” pie chart. Pie charts can be made highly attractive by using color and three‑dimensional design enhancements, but become cumbersome if there are too many categories.

Bar chart (also called column chart) depicts categorical or numerical data as a series of vertical (or horizontal) bars, with the bar heights (lengths) being proportional to the frequencies or the means. The bar widths and separation between bars should be uniform but are of little significance other than to indicate that the bars denote separate series or categories. Bars depicting subcategories can be stacked one on top of another (stacked or segmented or component bar chart). The frequencies can be converted to percentages so that the total numbers in each category add up to 100%, giving a 100% stacked bar chart where all the bars are of equal height. Two or more data series or subcategories can be depicted on the same bar chart by placing corresponding bars side by side; different patterns or colors are used to distinguish the different series or subcategories (compound or multiple or cluster bar chart).

The histogram is similar to the bar chart in appearance but is used for summarizing continuous numerical data and hence there should not be any gaps between the bars. The bar widths correspond to the class intervals. The bars are usually drawn vertically, with the class intervals along the horizontal axis and the frequencies along the vertical axis. A histogram is popularly used to depict the frequency distribution in a large data series. Accordingly, the class intervals should be so chosen that the bars are narrow enough to illustrate patterns in the data but not so narrow that they become too large in number. A histogram
must be labeled carefully to depict clearly where the boundaries lie.

A frequency polygon is a line diagram representation of the frequency distribution depicted by the histogram and is obtained by joining the midpoints of the upper boundary of the histogram blocks. As such it depicts the frequency distribution of numerical data as a curve.

Dot plot [Figure 5] depicts frequency distribution of numerical variables like histograms but with the advantage of depicting individual values as well. Instead of bars, it has a series of dots for each value or class interval, each dot representing one observation. The alignment can be vertical or horizontal. Dot plots are useful in highlighting clusters and gaps in data sets as well as outliers. They are conceptually simple but become cumbersome for large data sets. Scatter plots (sometimes erroneously called dot plots) are used for depicting association between two variables with the X and Y coordinates of each dot representing the corresponding values of the two variables. A bubble plot is an extension of the scatter plot to depict the relation between three variables; here each dot is expanded into a bubble with the diameter of the bubble being proportional to the value of the third variable. This is preferable to depicting the third variable on a Z axis since it is difficult to comprehend depth on a two‑dimensional surface.

Stem‑and‑leaf plot or stem plot [Figure 6] is a sort of mixture of a diagram and a table. It has been devised to depict frequency distribution, as well as individual values for numerical data. The data values are examined to determine their last significant digit (the “leaf” item), and this is attached to the previous digits (the “stem” item). The stem items are usually arranged in ascending or descending order vertically, and a vertical line is usually drawn to separate the stem from the leaf. The number of leaf items should total up to the number of observations. However, it becomes cumbersome with large data sets.

Box‑and‑whiskers plot (or box plot) is a graphical representation of numerical data based on the five‑number summary: minimum value, 25th percentile, median (50th percentile), 75th percentile and maximum value [Figure 7]. A rectangle is drawn extending from the lower quartile to the upper quartile, with the median dividing this “box” but not necessarily equally. Lines (“whiskers”) are drawn from the ends of the box to the extreme values. Outliers may be indicated beyond the extreme values by dots or asterisks; in such “modified” or “refined” box plots, the whiskers have lengths not exceeding 1.5 times the box length. The whole plot may be aligned vertically or horizontally. Box plots are ideal for summarizing large samples and are being increasingly used. Multiple box plots, arranged side by side, allow ready comparison of data sets.
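The numbers behind a box plot can be computed without drawing anything. The sketch below derives the five-number summary and the whisker limits of a “modified” box plot using the 1.5 × box length rule from the text. It relies on `statistics.quantiles` with the `inclusive` method; other software may use slightly different percentile conventions and so give slightly different quartiles. The data and function names are hypothetical.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, 25th percentile, median, 75th percentile, maximum)."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


def modified_box_plot(data):
    """Return box, whisker ends and outliers for a 'modified' box plot."""
    _, q1, median, q3, _ = five_number_summary(data)
    box_length = q3 - q1  # the interquartile range
    lower_fence = q1 - 1.5 * box_length
    upper_fence = q3 + 1.5 * box_length
    # Whiskers stop at the most extreme observations that lie within the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "box": (q1, median, q3),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


values = [12, 14, 15, 16, 17, 18, 19, 20, 45]  # hypothetical measurements
summary = five_number_summary(values)
plot = modified_box_plot(values)

print(summary)
print(plot)
```

Here the value 45 lies more than 1.5 box lengths above the upper quartile, so it would be drawn as a separate point while the upper whisker ends at 20.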
Figure 5: A dot plot depicting age of 156 women in reproductive age group enrolled in a study. Note that individual values are conserved while distribution, gaps and clusters are evident

100 | 022466
110 | 0022444688
120 | 002222224444444444666666688888
130 | 000222224444446666668888
140 | 00002222444446666888
150 | 0022244688
Key: 120 | 2 means 122 mmHg

Figure 6: A stem-and-leaf plot depicting systolic blood pressure recordings (recorded as even values only) in 100 individuals. Note that the plot gives an idea of the underlying distribution while retaining all the individual values

Figure 7: A horizontal box plot depicting the five number summary of numerical data (labeled elements: outlier, minimum value, 25th percentile, median, 75th percentile, maximum value, interquartile range). Note that this particular dataset is not symmetrical but is skewed to the left

We have looked at the commonly used plots for summarizing data and depicting underlying patterns. Many other plots are used in biostatistics for depicting data distributions, time trends in observations, relationships between two or more variables, exploring goodness‑of‑fit to hypothesized data distributions and drawing inferences by comparing data sets. We will get introduced to select other plots in subsequent modules in this series.

Financial support and sponsorship
Nil.

Conflicts of interest
There are no conflicts of interest.

Further Reading
1. Samuels ML, Witmer JA, Schaffner AA, editors. Description of samples and populations. In: Statistics for the Life Sciences. 4th ed. Boston: Pearson Education; 2012. p. 26‑80.
2. Kirk RE, editor. Random variables and probability distributions. In: Statistics: An Introduction. 5th ed. Belmont: Thomson Wadsworth; 2008. p. 207‑27.
3. Kirk RE, editor. Normal distribution and sampling distributions. In: Statistics: An Introduction. 5th ed. Belmont: Thomson Wadsworth; 2008. p. 229‑55.
4. Dawson B, Trapp RG, editors. Summarizing data & presenting data in tables and graphs. In: Basic & Clinical Biostatistics. 4th ed. New York: McGraw‑Hill; 2004. p. 23‑60.