Module 005 - Descriptive Statistics
Descriptive Statistics
Descriptive statistics are used to describe the basic features of the data
in a study. They provide simple summaries about the sample and the
measures. Together with simple graphics analysis, they form the basis
of virtually every quantitative analysis of data.
Course Module
discrete events. Or, consider the scourge of many students, the Grade
Point Average (GPA). This single number describes the general
performance of a student across a potentially wide range of course
experiences.
Every time you try to describe a large set of observations with a single
indicator you run the risk of distorting the original data or losing
important detail. The batting average doesn't tell you whether the
batter is hitting home runs or singles. It doesn't tell whether she's been
in a slump or on a streak. The GPA doesn't tell you whether the student
was in difficult courses or easy ones, or whether they were courses in
their major field or in other disciplines. Even given these limitations,
descriptive statistics provide a powerful summary that may enable
comparisons across people or other units.
Graphical/Pictorial Methods
Measures of Central Tendency
Measures of Dispersion
Measures of Association
Graphical/Pictorial Methods
Histograms
Scatter plots
Display the relationship between two quantitative or numeric
variables by plotting the value of one variable against the value
of another.
For example, one axis of a scatter plot could represent height
and the other could represent weight. Each person in the data
would receive one data point on the scatter plot that
corresponds to his or her height and weight.
Sociograms
Univariate Analysis
the distribution
the central tendency
the dispersion
The Distribution
value. For instance, a typical way to describe the distribution of college
students is by year in college, listing the number or percent of students
at each of the four years. Or, we describe gender by listing the number
or percent of males and females. In these cases, the variable has few
enough values that we can list each one and summarize how many
sample cases had the value. But what do we do for a variable like
income or GPA? With these variables there can be a large number of
possible values, with relatively few people having each one. In this case,
we group the raw scores into categories according to ranges of values.
For instance, we might look at GPA according to the letter grade ranges.
Or, we might group income into four or five ranges of income values.
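As a minimal sketch of both approaches, the Python snippet below tallies a variable with few values and then groups a many-valued variable into ranges. The sample data, the bin edges, and the helper name `bin_label` are assumptions made for illustration; none of them come from this module.

```python
from collections import Counter

# Hypothetical sample: year in college for ten students.
years = ["1st", "2nd", "2nd", "3rd", "1st", "4th", "2nd", "3rd", "1st", "2nd"]

# Few distinct values: list each value with its count and percent.
year_counts = Counter(years)
year_percents = {year: 100 * count / len(years)
                 for year, count in year_counts.items()}

# Hypothetical GPAs: many possible values, so group them into ranges.
gpas = [3.9, 2.4, 3.1, 3.5, 1.8, 2.9, 3.7, 2.2, 3.0, 3.3]


def bin_label(gpa):
    """Return the label of the (assumed) range that a GPA falls into."""
    if gpa >= 3.5:
        return "3.5-4.0"
    if gpa >= 2.5:
        return "2.5-3.49"
    if gpa >= 1.5:
        return "1.5-2.49"
    return "below 1.5"


gpa_counts = Counter(bin_label(g) for g in gpas)

print(year_counts)
print(year_percents)
print(gpa_counts)
```

The choice of ranges is a judgment call: wider ranges give a simpler summary but hide more detail.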
Mean (or the arithmetic average) is the sum of all the scores divided by
the number of scores. The mean may be influenced profoundly by
extreme values. For example, the average stay of organophosphorus
poisoning patients in the ICU may be influenced by a single patient who
stays in the ICU for around 5 months because of septicemia. Such extreme
values are called outliers. The mean of a whole population is usually
denoted by µ, while the mean of a sample is usually denoted by x̄.
Thus the mean of a set {a_1, a_2, …, a_n} is given by
$$\mu = \frac{a_1 + a_2 + \dots + a_n}{n}$$
where
µ is the population mean (written x̄ for a sample mean),
n is the total number of items in the set, and
a_i is the i-th element of the set.
Example:
Given the set of values {1, 2, 4, 7}, we substitute the values into the
given formula:
$$\mu = \frac{1 + 2 + 4 + 7}{4} = \frac{14}{4} = 3.5$$
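The same calculation can be sketched in a few lines of Python. The second half illustrates how a single outlier pulls the mean; the lengths of stay are hypothetical numbers chosen for illustration, not data from this module.

```python
# Mean of the example set {1, 2, 4, 7}.
values = [1, 2, 4, 7]
mean_value = sum(values) / len(values)
print(mean_value)  # 3.5

# Hypothetical lengths of stay in days (assumed data).
typical_stays = [3, 4, 5, 4]
with_outlier = typical_stays + [150]  # one patient stays about 5 months

mean_typical = sum(typical_stays) / len(typical_stays)
mean_with_outlier = sum(with_outlier) / len(with_outlier)
print(mean_typical)       # 4.0
print(mean_with_outlier)  # 33.2
```

One extreme value moves the mean from 4 days to more than 33 days, even though four of the five patients stayed 5 days or fewer.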
Example:
Find the mean, median, mode, and range for the following list of
values:
13, 18, 13, 14, 13, 16, 14, 21, 13
The mean is the usual average, so I'll add and then divide:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
Note that the mean, in this case, isn't a value from the original list.
This is a common result. You should not assume that your mean will
be one of your original numbers.
The median is the middle value, so first I'll have to rewrite the list in
numerical order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be the
(9 + 1) ÷ 2 = 10 ÷ 2 = 5th number, which is 14.
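The mode is the number that is repeated more often than any other,
so 13 is the mode.
The range is the difference between the largest and smallest values,
so the range is 21 − 13 = 8.
mean: 15
median: 14
mode: 13
range: 8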
Note: The formula for the place to find the median is "([the number of
data points] + 1) ÷ 2", but you don't have to use this formula. You can
just count in from both ends of the list until you meet in the middle, if
you prefer, especially if your list is short. Either way will work.
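As a quick check of this worked example, the sketch below uses Python's standard `statistics` module. The variable names are illustrative, and the range is computed directly as largest minus smallest because the module has no dedicated range function.

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # middle value of the sorted list
mode_value = statistics.mode(values)      # most frequent value
range_value = max(values) - min(values)   # largest minus smallest

print(sorted(values))
print(mean_value, median_value, mode_value, range_value)
```

`statistics.median` sorts internally, so the list does not have to be ordered first. For a list with an even number of values it returns the average of the two middle values.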
Measures of Dispersion
Measures of spread describe how similar or varied the set of observed
values is for a particular variable (data item). Measures of spread
include the range, quartiles and the interquartile range, variance, and
standard deviation.
Standard deviation
Standard deviation measures the typical distance between each value
and the mean, that is, how far the data are spread out from the mean. A
low standard deviation indicates that the data points tend to be close to
the mean of the data set, while a high standard deviation indicates that
the data points are spread out over a wider range of values.
For a sample, the standard deviation is
$$SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
where
x̄ is the mean of the sample,
n is the number of elements in the sample, and
x_i is the i-th element from the sample.
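For a population, the standard deviation is
$$SD = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$
where
µ is the mean of the population,
N is the number of elements in the population, and
X_i is the i-th element from the population.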
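The two formulas differ only in the divisor (n − 1 for a sample, N for a population). The sketch below implements both from scratch; the function names and the data set are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
import math


def sample_sd(xs):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(squared_deviations / (n - 1))


def population_sd(xs):
    """Population standard deviation: divide the squared deviations by N."""
    n = len(xs)
    mean = sum(xs) / n
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(squared_deviations / n)


# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_sd(data))  # sqrt(32 / 8) = 2.0
print(sample_sd(data))      # sqrt(32 / 7), slightly larger
```

The sample version is always a little larger than the population version for the same numbers, because it divides by a smaller quantity.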
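Variance is the square of the standard deviation. The population
variance is
$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$
where:
σ² is the population variance,
µ is the population mean,
X_i is the i-th element from the population, and
N is the number of elements in the population.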
The sample variance is
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
where:
s² is the sample variance,
x̄ is the sample mean,
x_i is the i-th element from the sample, and
n is the number of elements in the sample.
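In practice these quantities are rarely computed by hand. As a hedged illustration, Python's standard `statistics` module provides `pvariance` (divides by N) and `variance` (divides by n − 1); the data below are hypothetical.

```python
import math
import statistics

# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = statistics.pvariance(data)  # 32 / 8
sample_variance = statistics.variance(data)       # 32 / 7

# The standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, population_sd)
print(sample_variance, sample_sd)
```

Which function to use depends on whether the numbers are the whole population of interest or a sample drawn from it.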
Range
Range is 99–12 = 87
Percentile
The median 59 has 4 values less than itself out of 8. Put another way,
59 is the 50th percentile of the data set because 50% of the values are
less than 59. In general, if k is the nth percentile, then n% of the
values are less than k.
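The definition above can be sketched as a small function that reports the percentage of values strictly less than k. The function name `percent_below` and the eight data values are hypothetical; the values were chosen only so that 59 has four smaller values, as in the text. Other percentile conventions exist (for example, counting values less than or equal to k), so treat this as one possible reading.

```python
def percent_below(data, k):
    """Return the percentage of values in data that are strictly less than k."""
    count_below = sum(1 for x in data if x < k)
    return 100 * count_below / len(data)


# Hypothetical data set of eight values (assumed for illustration).
data = [12, 25, 38, 47, 59, 68, 81, 99]

print(percent_below(data, 59))  # 4 of 8 values are below 59
```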
Quartiles
The interquartile range formula is the first quartile subtracted from the
third quartile:
IQR = Q3 – Q1.
Where:
IQR is the interquartile range
Q3 is the 3rd quartile
Q1 is the 1st quartile
In statistics and probability, quartiles are values that divide your data
into quarters, provided the data are sorted in ascending order.
Quantitative Methods
Figure 2: Quartiles
URL: https://statsmethods.wordpress.com/2013/05/09/iqr/
Retrieved: September 08, 2018
So here, by analogy,
Note: If you sort the data in descending order, the IQR will be -44. The
magnitude is the same; only the sign differs. A negative IQR is fine if
your data is in descending order, but because we normally subtract the
smaller value from the larger one, ascending order (Q3 - Q1) is
preferred.
Steps:
Step 3: Place parentheses around the numbers above and below the
median.
Not necessary statistically, but it makes Q1 and Q3 easier to spot.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
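The grouping in Step 3 can be sketched in Python. The other steps are not shown in this section, so the procedure below is an assumption: sort the data, find the median, then take Q1 and Q3 as the medians of the lower and upper halves (leaving the median itself out when the count is odd). The variable names are illustrative.

```python
import statistics

# The eleven values from Step 3, deliberately given out of order.
data = [27, 1, 19, 2, 18, 5, 15, 6, 12, 7, 9]

ordered = sorted(data)  # ascending order
n = len(ordered)
half = n // 2

median = statistics.median(ordered)

# Values below and above the median; skip the median itself when n is odd.
lower = ordered[:half]
upper = ordered[half + 1:] if n % 2 else ordered[half:]

q1 = statistics.median(lower)  # median of (1, 2, 5, 6, 7)
q3 = statistics.median(upper)  # median of (12, 15, 18, 19, 27)
iqr = q3 - q1

print(lower, median, upper)
print(q1, q3, iqr)
```

For these values the sketch gives Q1 = 5, Q3 = 18, and IQR = 13. Software packages use several different quartile conventions, so results can differ slightly from this hand method on other data sets.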