Module 005 - Descriptive Statistics

Quantitative Methods
At the end of this module you are expected to:

1. Explain descriptive statistics;
2. Show the measures of central location; and
3. Illustrate the measures of dispersion and variability
Descriptive statistics are used to describe the basic features of the data
in a study. They provide simple summaries about the sample and the
measures. Together with simple graphics analysis, they form the basis
of virtually every quantitative analysis of data.
Descriptive statistics are typically distinguished from inferential
statistics. With descriptive statistics you are simply describing what is
or what the data shows. With inferential statistics, you are trying to
reach conclusions that extend beyond the immediate data alone. For
instance, we use inferential statistics to try to infer from the sample
data what the population might think. Or, we use inferential statistics
to make judgments of the probability that an observed difference
between groups is a dependable one or one that might have happened
by chance in this study. Thus, we use inferential statistics to make
inferences from our data to more general conditions; we use
descriptive statistics simply to describe what's going on in our data.
Descriptive statistics are used to present quantitative descriptions in a
manageable form. In a research study we may have lots of measures. Or
we may measure a large number of people on any measure. Descriptive
statistics help us to simplify large amounts of data in a sensible way.
Each descriptive statistic reduces lots of data into a simpler summary.
For instance, consider a simple number used to summarize how well a
batter is performing in baseball, the batting average. This single
number is simply the number of hits divided by the number of times at
bat (reported to three significant digits). A batter who is hitting .333 is
getting a hit one time in every three at bats. One batting .250 is hitting
one time in four. The single number describes a large number of
discrete events. Or, consider the scourge of many students, the Grade
Point Average (GPA). This single number describes the general
performance of a student across a potentially wide range of course
experiences.
Every time you try to describe a large set of observations with a single
indicator you run the risk of distorting the original data or losing
important detail. The batting average doesn't tell you whether the
batter is hitting home runs or singles. It doesn't tell whether she's been
in a slump or on a streak. The GPA doesn't tell you whether the student
was in difficult courses or easy ones, or whether they were courses in
their major field or in other disciplines. Even given these limitations,
descriptive statistics provide a powerful summary that may enable
comparisons across people or other units.
Descriptive statistics can be useful for two purposes: 1) to provide
basic information about variables in a dataset and 2) to highlight
potential relationships between variables.
Descriptive statistics can be displayed graphically or pictorially, and
the three most common kinds are measures of central tendency,
dispersion, and association:
• Graphical/Pictorial Methods
• Measures of Central Tendency
• Measures of Dispersion
• Measures of Association
Graphical/Pictorial Methods
There are several graphical and pictorial methods that enhance
researchers' understanding of individual variables and the
relationships between variables. Graphical and pictorial methods
provide a visual representation of the data. Some of these methods
include:
• Histograms
• Scatter plots
• Geographic Information Systems (GIS)
• Sociograms
Histograms
• Visually represent the frequencies with which values of
variables occur
• Each value of a variable is displayed along the bottom of a
histogram, and a bar is drawn for each value
• The height of the bar corresponds to the frequency with which
that value occurs
Scatter plots
• Display the relationship between two quantitative or numeric
variables by plotting one variable against the value of another
variable
• For example, one axis of a scatter plot could represent height
and the other could represent weight. Each person in the data
would receive one data point on the scatter plot that
corresponds to his or her height and weight
Geographic Information Systems (GIS)
• A GIS is a computer system capable of capturing, storing,
analyzing, and displaying geographically referenced
information; that is, data identified according to location
• Using a GIS program, a researcher can create a map to represent
data relationships visually
Sociograms
• Display networks of relationships among variables, enabling
researchers to identify the nature of relationships that would
otherwise be too complex to conceptualize
Univariate Analysis
Univariate analysis involves the examination across cases of one
variable at a time. There are three major characteristics of a single
variable that we tend to look at:
• the distribution
• the central tendency
• the dispersion
In most situations, we would describe all three of these characteristics
for each of the variables in our study.
The Distribution
The distribution is a summary of the frequency of individual values or
ranges of values for a variable. The simplest distribution would list
every value of a variable and the number of persons who had each
value. For instance, a typical way to describe the distribution of college
students is by year in college, listing the number or percent of students
at each of the four years. Or, we describe gender by listing the number
or percent of males and females. In these cases, the variable has few
enough values that we can list each one and summarize how many
sample cases had the value. But what do we do for a variable like
income or GPA? With these variables there can be a large number of
possible values, with relatively few people having each one. In this case,
we group the raw scores into categories according to ranges of values.
For instance, we might look at GPA according to the letter grade ranges.
Or, we might group income into four or five ranges of income values.
Table 1: Frequency distribution table.
One of the most common ways to describe a single variable is with
a frequency distribution. Depending on the particular variable, all of
the data values may be represented, or you may group the values into
categories first. For example, with age, price, or temperature variables
it would usually not be sensible to determine the frequency of each
individual value; rather, the values are grouped into ranges and the
frequency of each range is determined. Frequency distributions can be
depicted in two ways, as a table or as a graph. Table 1 shows an age
frequency distribution with five categories of age ranges defined. The
same frequency distribution can be depicted in a graph as shown in
Figure 1. This type of graph is often referred to as a histogram or bar
chart.
Figure 1: Frequency distribution bar chart.

URL: https://www.socialresearchmethods.net/kb/statdesc.htm
Retrieved: September 08, 2018
Distributions may also be displayed using percentages. For example,
you could use percentages to describe the:
• percentage of people in different income levels
• percentage of people in different age ranges
• percentage of people in different ranges of standardized test
scores
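Grouping raw values into ranges and reporting each range as a count and as a percentage can be sketched in a few lines of Python. This is only an illustration: the ages, the five range boundaries, and the names `ages`, `bins`, and `bin_label` are hypothetical and are not taken from Table 1.

```python
from collections import Counter

# Hypothetical ages; these values are illustrative and not taken from Table 1.
ages = [19, 23, 27, 31, 34, 35, 38, 42, 45, 47, 51, 58, 63, 67, 72]

# Assumed age ranges as (label, lower bound, upper bound), bounds inclusive.
bins = [
    ("under 25", 0, 24),
    ("25-34", 25, 34),
    ("35-44", 35, 44),
    ("45-54", 45, 54),
    ("55 and over", 55, 200),
]


def bin_label(age):
    """Return the label of the range that contains the given age."""
    for label, low, high in bins:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} is outside every range")


# Frequency distribution: how many ages fall into each range.
counts = Counter(bin_label(age) for age in ages)

# The same distribution expressed as percentages of all cases.
percentages = {label: 100 * counts[label] / len(ages) for label, _, _ in bins}

for label, _, _ in bins:
    print(f"{label:12} {counts[label]:3d} {percentages[label]:6.1f}%")
```

The printed rows form a small frequency table; the same counts could be drawn as the bars of a histogram.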
Measures of Central Location
Central tendency is defined as “the statistical measure that identifies a
single value as representative of an entire distribution.” It aims to
provide an accurate description of the entire data. It is the single value
that is most typical/representative of the collected data. The term
“number crunching” is used to illustrate this aspect of data description.
The mean, median and mode are the three commonly used measures of
central tendency.
Mean (or the arithmetic average) is the sum of all the scores divided by
the number of scores. The mean may be influenced profoundly by
extreme values. For example, the average stay of organophosphorus
poisoning patients in the ICU may be influenced by a single patient who
stays in the ICU for around 5 months because of septicemia. Such extreme
values are called outliers. In short, the mean is the sum of all the values
in a set, divided by the number of values. The mean of a whole
population is usually denoted by µ, while the mean of a sample is usually
denoted by x̄.
Thus the mean of a set {a_1, a_2, …, a_n} is given by
$$\mu = \frac{a_1 + a_2 + \dots + a_n}{n}$$
where
µ is the population mean, or
x̄ is the sample mean,
n is the total number of items in the set, and
a_i is the i-th element of the set.
Example:
Given the set of values {1, 2, 4, 7}, we substitute the values into the
given formula:
$$\mu = \frac{1 + 2 + 4 + 7}{4} = \frac{14}{4} = 3.5$$
The mean of the given set {1, 2, 4, 7} is 3.5.
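The same calculation can be written as a short Python function. This is a minimal sketch of the formula above; the function name `mean` is our own choice for illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty set is undefined")
    return sum(values) / len(values)


print(mean([1, 2, 4, 7]))  # 3.5
```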
Median is defined as the middle of a distribution of ranked data (with
half of the values in the sample above and half below the median
value). If the number of values in a set is odd, the median is the single
middle value. If the number of values in a set is even, then the median
is the sum of the two middle values divided by two (2).
Mode is the most frequently occurring value in a distribution. A set
can have more than one mode.
• Unimodal – A set that has only one mode
• Bimodal – A set with two modes
• Multimodal – A set with three or more modes
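The even-count rule for the median and the possibility of more than one mode can be checked with Python's standard `statistics` module. The small data sets below are hypothetical and chosen only for illustration; `multimode` requires Python 3.8 or later.

```python
from statistics import median, multimode

# Hypothetical set with an even number of values: the two middle values are 5 and 8.
even_count = [3, 5, 8, 10]
print(median(even_count))  # (5 + 8) / 2 = 6.5

# multimode returns every value that occurs most often.
one_mode = multimode([1, 2, 2, 3])       # unimodal set
two_modes = multimode([1, 1, 2, 2, 3])   # bimodal set
print(one_mode, two_modes)
```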
Example:
Find the mean, median, mode, and range for the following list of
values:
13, 18, 13, 14, 13, 16, 14, 21, 13
The mean is the usual average, so I'll add and then divide:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
Note that the mean, in this case, isn't a value from the original list.
This is a common result. You should not assume that your mean will
be one of your original numbers.
The median is the middle value, so first I'll have to rewrite the list in
numerical order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be the
(9 + 1) ÷ 2 = 10 ÷ 2 = 5th number:
1 2 3 4 5 6 7 8 9
13, 13, 13, 13, 14, 14, 16, 18, 21
So the median is 14.
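The mode is the number that is repeated more often than any other,
so 13 is the mode.
The range is the difference between the largest and smallest values,
so the range is 21 – 13 = 8.
mean: 15
median: 14
mode: 13
range: 8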
Note: The formula for the place to find the median is "([the number of
data points] + 1) ÷ 2", but you don't have to use this formula. You can
just count in from both ends of the list until you meet in the middle, if
you prefer, especially if your list is short. Either way will work.
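The worked example can be reproduced with Python's standard `statistics` module. This is a sketch for checking the arithmetic rather than a required method; the variable names are our own.

```python
from statistics import mean, median, mode

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

example_mean = mean(values)                # sum of 135 divided by 9 values
example_median = median(values)            # middle (5th) value of the sorted list
example_mode = mode(values)                # most frequent value
example_range = max(values) - min(values)  # largest minus smallest

print(example_mean, example_median, example_mode, example_range)
```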
Measures of Dispersion
Measures of spread describe how similar or varied the set of observed
values is for a particular variable (data item). Measures of spread
include the range, quartiles and the interquartile range, variance and
standard deviation.
Standard deviation
Standard deviation measures the typical distance between each value
and the mean, that is, how far the data are spread out from the mean. A
low standard deviation indicates that the data points tend to be close to
the mean of the data set, while a high standard deviation indicates that
the data points are spread out over a wider range of values.
There are situations when we have to choose between the sample and the
population standard deviation.
When we are asked to find the SD of some part of a population, that is, a
segment of the population, we use the sample standard deviation:
$$SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
where
x̄ is the mean of the sample,
n is the number of elements in the sample, and
x_i is the i-th element of the sample.
But when we have to deal with a whole population, we use the
population standard deviation:
$$SD = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$
where
µ is the mean of the population,
N is the number of elements in the population, and
X_i is the i-th element of the population.
Although a sample is part of a population, the two SD formulas are not
the same: the sample formula divides by n − 1, while the population
formula divides by N.
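The difference between the two divisors is easiest to see by computing both on the same numbers. The sketch below reuses the set {1, 2, 4, 7} from the mean example; the function names `sample_sd` and `population_sd` are our own and are used only for illustration.

```python
import math


def sample_sd(values):
    """Sample standard deviation: squared deviations divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("a sample SD needs at least two values")
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def population_sd(values):
    """Population standard deviation: squared deviations divided by N."""
    n = len(values)
    if n < 1:
        raise ValueError("a population SD needs at least one value")
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


data = [1, 2, 4, 7]  # mean 3.5, squared deviations sum to 21
print(round(sample_sd(data), 4))      # sqrt(21 / 3) = 2.6458
print(round(population_sd(data), 4))  # sqrt(21 / 4) = 2.2913
```

For the same data the sample SD is always the larger of the two, because it divides the same sum by a smaller number.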
Variance
Variance is the average of the squared distances between each value and
the mean. That is, it is the square of the standard deviation.
Variance is a measure of how spread out the distribution is. It gives an
indication of how closely the individual observations cluster about the
mean value. The variance of a population is defined by the following
formula:
$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$
where:
σ² is the population variance,
µ is the population mean,
X_i is the i-th element from the population, and
N is the number of elements in the population.
The variance of a sample is defined by a slightly different formula:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
where:
s² is the sample variance,
x̄ is the sample mean,
x_i is the i-th element from the sample, and
n is the number of elements in the sample.
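Both variance formulas are available in Python's standard `statistics` module, which also makes it easy to confirm that the variance is the square of the corresponding standard deviation. The data set {1, 2, 4, 7} is reused from the mean example; this is a sketch for checking the formulas, not a prescribed tool.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [1, 2, 4, 7]  # mean 3.5, squared deviations sum to 21

pop_var = pvariance(data)   # population variance: 21 / N = 21 / 4
samp_var = variance(data)   # sample variance: 21 / (n - 1) = 21 / 3
print(pop_var, samp_var)

# The variance equals the square of the matching standard deviation.
print(math.isclose(pstdev(data) ** 2, pop_var))
print(math.isclose(stdev(data) ** 2, samp_var))
```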
Range
Range is one of the simplest techniques of descriptive statistics. It is the
difference between the lowest and highest values. For the data set
12, 24, 41, 51, 67, 67, 85, 99
the range is 99 – 12 = 87.
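As a quick check, the range can be computed directly from the largest and smallest values. The sketch assumes the data set whose lowest value is 12 and highest value is 99, the same eight values listed in the Percentile section.

```python
data = [12, 24, 41, 51, 67, 67, 85, 99]

# Range: highest value minus lowest value.
value_range = max(data) - min(data)
print(value_range)  # 87
```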
Percentile
A percentile is a way to represent the position of a value in a data set. To
calculate a percentile, the values in the data set should always be in
ascending order.
12, 24, 41, 51, 67, 67, 85, 99
The median, 59, has 4 of the 8 values less than itself. Put another way,
59 is the 50th percentile of the data set because 50% of the values are
less than 59. In general, if k is the nth percentile, then n% of the
values are less than k.
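The "percentage of values less than k" reading of a percentile can be sketched as a small function. The name `percent_below` is hypothetical, and statistical software often uses slightly different percentile conventions, so treat this only as an illustration of the definition given here.

```python
def percent_below(data, k):
    """Percentage of the values in data that are strictly less than k."""
    if not data:
        raise ValueError("data must not be empty")
    return 100 * sum(1 for x in data if x < k) / len(data)


data = [12, 24, 41, 51, 67, 67, 85, 99]
print(percent_below(data, 59))  # 4 of 8 values are below 59, so 50.0
```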
Quartiles
The interquartile range is a measure of where the “middle fifty percent”
is in a data set. Where a range is a measure of where the beginning and
end are in a set, an interquartile range is a measure of where the bulk of
the values lie. That's why it's preferred over many other measures of
spread (e.g. the range) when reporting things like school
performance or SAT scores.
The interquartile range formula is the first quartile subtracted from the
third quartile:
IQR = Q3 – Q1.
Where:
IQR is the interquartile range
Q3 is the 3rd quartile
Q1 is the 1st quartile
In statistics and probability, quartiles are values that divide your data
into quarters provided data is sorted in an ascending order.
Figure 2: Quartiles
URL: https://statsmethods.wordpress.com/2013/05/09/iqr/
Retrieved: September 08, 2018
There are three quartile values. The first quartile is at the 25th
percentile, the second quartile is at the 50th percentile, and the third
quartile is at the 75th percentile. The second quartile (Q2) is the median
of the whole data. The first quartile (Q1) is the median of the lower half
of the data, and the third quartile (Q3) is the median of the upper half
of the data.
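12, 24, 41, 51, 67, 67, 85, 99, 115
So here, by analogy,
Q2 = 67: the 50th percentile of the whole data, which is the median.
Q1 = 41: the 25th percentile of the data.
Q3 = 85: the 75th percentile of the data.
Interquartile range (IQR) = Q3 - Q1 = 85 - 41 = 44
With nine values, these quartiles result when the median (67) is counted
in both the lower half and the upper half.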
Note: If the data are sorted in descending order and the quartiles are
read off in that order, Q3 - Q1 comes out as -44. The magnitude is the
same; only the sign differs. Because the IQR is the larger quartile minus
the smaller one, we prefer ascending order (Q3 - Q1).
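The quartiles in this example can be reproduced with a short Python sketch. It assumes the nine-value data set 12, 24, 41, 51, 67, 67, 85, 99, 115, which is consistent with Q1 = 41, and it assumes that for an odd number of values the median is counted in both halves. The function name `quartiles_inclusive` is our own.

```python
from statistics import median


def quartiles_inclusive(values):
    """Return (Q1, Q2, Q3), counting the median in both halves when n is odd."""
    data = sorted(values)          # ascending order, whatever the input order
    n = len(data)
    if n < 2:
        raise ValueError("at least two values are needed")
    half = (n + 1) // 2            # size of each half; the halves overlap only if n is odd
    lower = data[:half]
    upper = data[n - half:]
    return median(lower), median(data), median(upper)


data = [12, 24, 41, 51, 67, 67, 85, 99, 115]
q1, q2, q3 = quartiles_inclusive(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 41 67 85 44
```

Because the function sorts its input first, passing the values in descending order gives the same positive IQR.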
Steps:
Step 1: Put the numbers in order.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
Step 2: Find the median.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27. The median is 9.
Step 3: Place parentheses around the numbers above and below the
median.
Not necessary statistically, but it makes Q1 and Q3 easier to spot.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
Step 4: Find Q1 and Q3.
Think of Q1 as a median in the lower half of the data and think of Q3
as a median for the upper half of data.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27). Q1 = 5 and Q3 = 18.
Step 5: Subtract Q1 from Q3 to find the interquartile range.
18 – 5 = 13.
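The five steps translate almost line for line into Python. In this sketch the median itself is left out of both halves when the number of values is odd, as in Step 3; the function name `iqr_by_steps` is hypothetical.

```python
from statistics import median


def iqr_by_steps(values):
    """Return (Q1, Q3, IQR) by taking medians of the halves around the median."""
    data = sorted(values)                 # Step 1: put the numbers in order
    n = len(data)
    if n < 2:
        raise ValueError("at least two values are needed")
    mid = n // 2                          # Step 2: position of the median
    lower = data[:mid]                    # Step 3: values below the median
    upper = data[mid + 1:] if n % 2 else data[mid:]  # values above the median
    q1 = median(lower)                    # Step 4: median of each half
    q3 = median(upper)
    return q1, q3, q3 - q1                # Step 5: subtract Q1 from Q3


print(iqr_by_steps([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]))  # (5, 18, 13)
```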
References and Supplementary Materials

Books and Journals
1. Zealure C Holcomb; 2016; Fundamentals of Descriptive Statistics; USA;
Routledge
2. Peter Goos, David Meintrup; 2015; Statistics with JMP: Graphs,
Descriptive Statistics and Probability; UK; John Wiley & Sons Ltd.
Online Supplementary Reading Materials
1. Measures of Spread;
https://www.coursera.org/lecture/probability-intro/measures-of-spread-t9Wbk;
September 08, 2018
2. Measures of Spread;
http://maths.nayland.school.nz/Year_11/AS1.10_Multivar_data/9_Measures_Spread.htm;
September 08, 2018
3. Descriptive Statistics;
https://www.socialresearchmethods.net/kb/statdesc.htm;
September 08, 2018
4. Descriptive Statistics;
https://www.researchconnections.org/childcare/datamethods/descriptivestats.jsp;
September 08, 2018
Online Instructional Videos
1. This course introduces you to sampling and exploring data, as well as
basic probability theory and Bayes' rule. You will examine various types
of sampling methods, and discuss how such methods can impact the
scope of inference. A variety of exploratory data analysis techniques
will be covered, including numeric summary statistics and basic data
visualization. You will be guided through installing and using R and
RStudio (free statistical software), and will use this software for lab
exercises and a final project. The concepts and techniques in this course
will serve as building blocks for the inference and modeling courses in
the Specialization;
https://www.coursera.org/lecture/probability-intro/measures-of-spread-t9Wbk;
September 08, 2018