
Quantitative Methods

Descriptive Statistics

Module 005 – Descriptive Statistics

At the end of this module you are expected to:


1. Explain descriptive statistics;
2. Show the measures of central location; and
3. Illustrate the measures of dispersion and variability

Descriptive Statistics

Descriptive statistics are used to describe the basic features of the data
in a study. They provide simple summaries about the sample and the
measures. Together with simple graphics analysis, they form the basis
of virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential
statistics. With descriptive statistics you are simply describing what is
or what the data shows. With inferential statistics, you are trying to
reach conclusions that extend beyond the immediate data alone. For
instance, we use inferential statistics to try to infer from the sample
data what the population might think. Or, we use inferential statistics
to make judgments of the probability that an observed difference
between groups is a dependable one or one that might have happened
by chance in this study. Thus, we use inferential statistics to make
inferences from our data to more general conditions; we use
descriptive statistics simply to describe what's going on in our data.

Descriptive statistics are used to present quantitative descriptions in a
manageable form. In a research study we may have lots of measures. Or
we may measure a large number of people on any measure. Descriptive
statistics help us to simplify large amounts of data in a sensible way.
Each descriptive statistic reduces lots of data into a simpler summary.
For instance, consider a simple number used to summarize how well a
batter is performing in baseball, the batting average. This single
number is simply the number of hits divided by the number of times at
bat (reported to three decimal places). A batter who is hitting .333 is
getting a hit one time in every three at bats. One batting .250 is hitting
one time in four. The single number describes a large number of
discrete events. Or, consider the scourge of many students, the Grade
Point Average (GPA). This single number describes the general
performance of a student across a potentially wide range of course
experiences.

Every time you try to describe a large set of observations with a single
indicator you run the risk of distorting the original data or losing
important detail. The batting average doesn't tell you whether the
batter is hitting home runs or singles. It doesn't tell whether she's been
in a slump or on a streak. The GPA doesn't tell you whether the student
was in difficult courses or easy ones, or whether they were courses in
their major field or in other disciplines. Even given these limitations,
descriptive statistics provide a powerful summary that may enable
comparisons across people or other units.

Descriptive statistics can be useful for two purposes: 1) to provide
basic information about variables in a dataset and 2) to highlight
potential relationships between variables.

The most common descriptive statistics fall into four groups:

• Graphical/Pictorial Methods
• Measures of Central Tendency
• Measures of Dispersion
• Measures of Association

Graphical/Pictorial Methods

There are several graphical and pictorial methods that enhance
researchers' understanding of individual variables and the
relationships between variables. Graphical and pictorial methods
provide a visual representation of the data. Some of these methods
include:
• Histograms
• Scatter plots
• Geographic Information Systems (GIS)
• Sociograms

Histograms

• Visually represent the frequencies with which values of
variables occur
• Each value of a variable is displayed along the bottom of a
histogram, and a bar is drawn for each value
• The height of the bar corresponds to the frequency with which
that value occurs
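
The histogram idea described above can be sketched in a few lines of Python. This is only an illustration: the function name `text_histogram` and the sample of student years are assumptions made for the example, not part of the module.

```python
from collections import Counter

def text_histogram(values):
    """Return one line per distinct value: the value, then a bar whose
    length equals the frequency of that value."""
    counts = Counter(values)
    return [f"{value:>3} | {'#' * counts[value]}" for value in sorted(counts)]

# Hypothetical sample: year in college for ten students.
years = [1, 2, 2, 3, 1, 4, 2, 3, 1, 2]
for line in text_histogram(years):
    print(line)
```

Each printed bar plays the role of a histogram bar turned on its side: the longer the bar, the more often the value occurs.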

Scatter plots

• Display the relationship between two quantitative or numeric
variables by plotting one variable against the value of another
variable
• For example, one axis of a scatter plot could represent height
and the other could represent weight. Each person in the data
would receive one data point on the scatter plot that
corresponds to his or her height and weight

Geographic Information Systems (GIS)

• A GIS is a computer system capable of capturing, storing,
analyzing, and displaying geographically referenced
information; that is, data identified according to location
• Using a GIS program, a researcher can create a map to represent
data relationships visually

Sociograms

• Display networks of relationships among variables, enabling
researchers to identify the nature of relationships that would
otherwise be too complex to conceptualize

Univariate Analysis

Univariate analysis involves the examination across cases of one
variable at a time. There are three major characteristics of a single
variable that we tend to look at:

• the distribution
• the central tendency
• the dispersion

In most situations, we would describe all three of these characteristics
for each of the variables in our study.

The Distribution

The distribution is a summary of the frequency of individual values or
ranges of values for a variable. The simplest distribution would list
every value of a variable and the number of persons who had each
value. For instance, a typical way to describe the distribution of college
students is by year in college, listing the number or percent of students
at each of the four years. Or, we describe gender by listing the number
or percent of males and females. In these cases, the variable has few
enough values that we can list each one and summarize how many
sample cases had the value. But what do we do for a variable like
income or GPA? With these variables there can be a large number of
possible values, with relatively few people having each one. In this case,
we group the raw scores into categories according to ranges of values.
For instance, we might look at GPA according to the letter grade ranges.
Or, we might group income into four or five ranges of income values.
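
Grouping raw scores into ranges can be sketched as follows. The ages, the range boundaries, and the function name `grouped_frequencies` are hypothetical values chosen for illustration; each range includes its lower bound and excludes its upper bound.

```python
def grouped_frequencies(values, edges):
    """Count the values falling in each range [edges[i], edges[i + 1])
    and report each count as a percentage of all values."""
    total = len(values)
    table = []
    for low, high in zip(edges, edges[1:]):
        count = sum(1 for v in values if low <= v < high)
        table.append((low, high, count, 100 * count / total))
    return table

# Hypothetical ages and range boundaries, for illustration only.
ages = [23, 35, 41, 19, 52, 67, 38, 29, 45, 71]
edges = [0, 20, 40, 60, 80]

for low, high, count, percent in grouped_frequencies(ages, edges):
    print(f"{low:>2}-{high - 1:<2}  count={count}  percent={percent:.0f}%")
```

The same table gives both views of the distribution: the raw counts and the percentage of cases in each range.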

Table 1: Frequency distribution table.

One of the most common ways to describe a single variable is with
a frequency distribution. Depending on the particular variable, all of
the data values may be represented, or you may group the values into
categories first. For example, with age, price, or temperature variables, it would
usually not be sensible to determine the frequencies for each value;
rather, the values are grouped into ranges and the frequencies
determined for each range. Frequency distributions can be depicted in two ways, as
a table or as a graph. Table 1 shows an age frequency distribution with
five categories of age ranges defined. The same frequency distribution
can be depicted in a graph as shown in Figure 1. This type of graph is
often referred to as a histogram or bar chart.

Figure 1: Frequency distribution bar chart.


URL: https://www.socialresearchmethods.net/kb/statdesc.htm
Retrieved: September 08, 2018

Distributions may also be displayed using percentages. For example,
you could use percentages to describe the:

• percentage of people in different income levels
• percentage of people in different age ranges
• percentage of people in different ranges of standardized test
scores

Measures of Central Location

Central tendency is defined as “the statistical measure that identifies a
single value as representative of an entire distribution.” It aims to
provide an accurate description of the entire data. It is the single value
that is most typical/representative of the collected data. The term
“number crunching” is used to illustrate this aspect of data description.
The mean, median and mode are the three commonly used measures of
central tendency.

Mean (or the arithmetic average) is the sum of all the scores divided by
the number of scores. The mean may be influenced profoundly by
extreme values. For example, the average stay of organophosphorus
poisoning patients in the ICU may be influenced by a single patient who
stays in the ICU for around 5 months because of septicemia. Such extreme
values are called outliers. In short, the mean is the sum of all the values
in a set, divided by the number of values. The mean of a whole
population is usually denoted by µ, while the mean of a sample is usually
denoted by x̄.
Thus the mean of a set {a1, a2, … , an} is given by

\[ \mu = \frac{a_1 + a_2 + \dots + a_n}{n} \]

where
µ is the population mean (or x̄, the sample mean),
n is the total number of items in the set, and
ai is the i th element of the set
Example:
Given the set of values {1, 2, 4, 7}, we substitute the values into the
given formula.

\[ \mu = \frac{1 + 2 + 4 + 7}{4} = \frac{14}{4} = 3.5 \]

The mean of the given set {1, 2, 4, 7} is 3.5.
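
As a minimal sketch, the same calculation can be written in Python. The function name `mean` and the list of ICU stays are assumptions made for illustration; the second call shows how a single outlier pulls the mean away from the typical value.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty set is undefined")
    return sum(values) / len(values)

print(mean([1, 2, 4, 7]))  # 14 / 4 = 3.5

# Hypothetical ICU stays in days; the last patient is an outlier.
typical_stays = [3, 4, 5, 4]
with_outlier = typical_stays + [150]
print(mean(typical_stays), mean(with_outlier))
```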

Median is defined as the middle value of a distribution of ranked data (with
half of the values in the sample above and half below the median
value). If the number of values in a set is even, then the median is the
sum of the two middle values divided by two (2).

Mode is the most frequently occurring value in a distribution. A set
can have more than one mode.
• Unimodal – A set that has only one mode
• Bimodal – A set with two modes
• Multimodal – A set with three or more modes

Example:

Find the mean, median, and mode for the following list of
values:

13, 18, 13, 14, 13, 16, 14, 21, 13

The mean is the usual average, so I'll add and then divide:

(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15

Note that the mean, in this case, isn't a value from the original list.
This is a common result. You should not assume that your mean will
be one of your original numbers.

The median is the middle value, so first I'll have to rewrite the list in
numerical order:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be the
(9 + 1) ÷ 2 = 10 ÷ 2 = 5th number:

1 2 3 4 5 6 7 8 9
13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.

The mode is the number that is repeated more often than any other,
so 13 is the mode.

mean: 15
median: 14
mode: 13

Note: The formula for the place to find the median is "([the number of
data points] + 1) ÷ 2", but you don't have to use this formula. You can
just count in from both ends of the list until you meet in the middle, if
you prefer, especially if your list is short. Either way will work.
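
The worked example can be checked with Python's standard `statistics` module. This is a sketch rather than part of the module; `statistics.multimode` assumes Python 3.8 or later and returns every mode, so it also covers bimodal and multimodal sets.

```python
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean_value = statistics.mean(values)      # 135 / 9
median_value = statistics.median(values)  # middle value of the sorted list
modes = statistics.multimode(values)      # list of the most frequent values

print(mean_value, median_value, modes)

# With an even number of values the median is the average of the two middle ones.
even_median = statistics.median([1, 2, 4, 7])  # (2 + 4) / 2
print(even_median)
```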

Measures of Dispersion

Measures of spread describe how similar or varied the set of observed
values are for a particular variable (data item). Measures of spread
include the range, quartiles and the interquartile range, variance and
standard deviation.

Standard deviation
Standard deviation measures the typical distance between each value and
the mean; that is, how far the data are spread out from the mean. A
low standard deviation indicates that the data points tend to be close to
the mean of the data set, while a high standard deviation indicates that
the data points are spread out over a wider range of values.

There are situations when we have to choose between the sample and the
population standard deviation.

When we are asked to find the SD of some part of a population (a segment
of the population), we use the sample standard deviation:

\[ SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]

where
x̄ is the mean of the sample
n is the number of elements in the sample
xi is the i th element from the sample

But when we have to deal with a whole population, we use the
population standard deviation:

\[ SD = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} \]

where
µ is the mean of the population
N is the number of elements in the population
Xi is the i th element from the population

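Although a sample is part of a population, the two SD formulas are not
the same. The sample formula divides by n − 1 instead of N (Bessel's
correction) because the deviations are measured from the sample mean
rather than from the true population mean.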
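
The two standard deviation formulas can be compared side by side in a short sketch. The function names `sample_sd` and `population_sd` are illustrative, and the data set {1, 2, 4, 7} is reused from the mean example; the results are cross-checked against `statistics.stdev` and `statistics.pstdev` from the standard library.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation: squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def population_sd(values):
    """Population standard deviation: squared deviations divided by N."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

data = [1, 2, 4, 7]  # mean is 3.5; the squared deviations sum to 21
print(sample_sd(data))      # square root of 21 / 3
print(population_sd(data))  # square root of 21 / 4

# Cross-check against the standard library implementations.
print(math.isclose(sample_sd(data), statistics.stdev(data)))
print(math.isclose(population_sd(data), statistics.pstdev(data)))
```

The sample value is always the larger of the two, because the same sum of squared deviations is divided by a smaller number.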
Variance

Variance is the average of the squared distances between each value and
the mean; that is, it is the square of the standard deviation.
Variance is a measure of how spread out the distribution is. It gives an
indication of how closely individual observations cluster about the
mean value. The variance of a population is defined by the following
formula:

\[ \sigma^2 = \frac{\sum (X_i - \mu)^2}{N} \]

where:
σ² is the population variance,
µ is the population mean,
Xi is the i th element from the population, and
N is the number of elements in the population.

The variance of a sample is defined by a slightly different formula:

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

where:
s² is the sample variance,
x̄ is the sample mean,
xi is the i th element from the sample, and
n is the number of elements in the sample.
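
As a hedged illustration, both variances can be computed with the standard `statistics` module, here on the list of nine values from the central location example (whose mean is 15). The last line checks that the variance is the square of the standard deviation.

```python
import math
import statistics

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]  # mean is 15

# The squared deviations from 15 sum to 64.
population_variance = statistics.pvariance(values)  # 64 / 9
sample_variance = statistics.variance(values)       # 64 / 8

print(population_variance, sample_variance)
print(math.isclose(statistics.pstdev(values) ** 2, population_variance))
```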

Range

Range is one of the simplest techniques of descriptive statistics. It is the
difference between the highest and the lowest value.

12, 24, 41, 51, 67, 67, 85, 99

Range is 99 – 12 = 87
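
A one-line sketch of the range, assuming the data set is 12, 24, 41, 51, 67, 67, 85, 99 (so that the highest value is 99 and the lowest is 12). The function name `value_range` is illustrative.

```python
def value_range(values):
    """Range: the highest value minus the lowest value."""
    return max(values) - min(values)

data = [12, 24, 41, 51, 67, 67, 85, 99]
print(value_range(data))  # 99 - 12
```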

Percentile

A percentile is a way to represent the position of a value in a data set. To
calculate percentiles, the values in the data set should always be in ascending
order.

12, 24, 41, 51, 67, 67, 85, 99

The median, 59, has 4 of the 8 values less than itself. Equivalently,
59 is the 50th percentile of the data set because 50% of the values are
less than 59. In general, if k is the nth percentile, then n% of the
values are less than k.
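
The definition used here (the percentage of values strictly less than k) can be sketched directly. The function name `percent_below` is an assumption for illustration; other percentile conventions exist and may give slightly different answers.

```python
def percent_below(values, k):
    """Percentage of the values that are strictly less than k."""
    return 100 * sum(1 for v in values if v < k) / len(values)

data = [12, 24, 41, 51, 67, 67, 85, 99]
print(percent_below(data, 59))  # 4 of the 8 values are below 59
print(percent_below(data, 85))  # 6 of the 8 values are below 85
```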

Quartiles

The interquartile range is a measure of where the “middle fifty” percent
is in a data set. Where a range is a measure of where the beginning and end
are in a set, an interquartile range is a measure of where the bulk of the
values lie. That’s why it’s often preferred over the range, which is
sensitive to extreme values, when reporting things like school
performance or SAT scores.

The interquartile range formula is the first quartile subtracted from the
third quartile:

IQR = Q3 – Q1.

Where:
IQR is the interquartile range
Q3 is the 3rd quartile
Q1 is the 1st quartile

In statistics and probability, quartiles are values that divide your data
into quarters provided data is sorted in an ascending order.

Figure 2: Quartiles
URL: https://statsmethods.wordpress.com/2013/05/09/iqr/
Retrieved: September 08, 2018

There are three quartile values. The first quartile is at the 25th percentile,
the second quartile is at the 50th percentile, and the third quartile is at the
75th percentile. The second quartile (Q2) is the median of the whole data. The
first quartile (Q1) is the median of the lower half of the data, and the third
quartile (Q3) is the median of the upper half of the data.

12, 24, 41, 51, 67, 67, 85, 99, 115

So here, counting the median as part of both halves,

Q2 = 67: the 50th percentile of the whole data, which is the median.
Q1 = 41: the 25th percentile of the data (the median of 12, 24, 41, 51, 67).
Q3 = 85: the 75th percentile of the data (the median of 67, 67, 85, 99, 115).

Interquartile range (IQR) = Q3 - Q1 = 85 - 41 = 44
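
The quartiles quoted above can be reproduced with `statistics.quantiles` from the Python standard library (Python 3.8 or later). This sketch assumes the data set is the nine values 12, 24, 41, 51, 67, 67, 85, 99, 115 (41 is taken to be present because Q1 = 41) and uses `method="inclusive"`, which for this data set gives the same values as in the text. Other quartile conventions can give different results.

```python
import statistics

# Assumed data set: nine values, including 41.
data = [12, 24, 41, 51, 67, 67, 85, 99, 115]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```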

Note: The IQR is never negative. Quartiles are defined on data sorted in
ascending order, so Q3 is always at least as large as Q1. If you read the
quartile positions off a list sorted in descending order, the subtraction
would give -44; the magnitude is the same, and the sign only shows that
the two quartiles were swapped. Always compute Q3 - Q1 on ascending data.

Steps:

Step 1: Put the numbers in order.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.

Step 2: Find the median.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27. The median is 9.

Step 3: Place parentheses around the numbers above and below the
median.
Not necessary statistically, but it makes Q1 and Q3 easier to spot.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).

Step 4: Find Q1 and Q3.
Think of Q1 as the median of the lower half of the data and think of Q3
as the median of the upper half of the data.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27). Q1 = 5 and Q3 = 18.

Step 5: Subtract Q1 from Q3 to find the interquartile range.
18 – 5 = 13.
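
The five steps can be sketched as a small Python function. The name `iqr_by_halves` is illustrative; following Step 3, the median itself is left out of both halves when the number of values is odd, which is one convention among several.

```python
import statistics

def iqr_by_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves steps."""
    ordered = sorted(values)              # Step 1: put the numbers in order
    n = len(ordered)
    half = n // 2                         # Step 2: locate the median position
    lower = ordered[:half]                # Step 3: values below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above it
    q1 = statistics.median(lower)         # Step 4: median of each half
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1                # Step 5: subtract Q1 from Q3

# The same eleven numbers as in the steps, deliberately given out of order.
numbers = [9, 1, 27, 5, 18, 2, 12, 6, 19, 7, 15]
print(iqr_by_halves(numbers))
```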

References and Supplementary Materials

Books and Journals
1. Zealure C. Holcomb; 2016; Fundamentals of Descriptive Statistics; USA;
Routledge
2. Peter Goos, David Meintrup; 2015; Statistics with JMP: Graphs,
Descriptive Statistics and Probability; UK; John Wiley & Sons Ltd.
Online Supplementary Reading Materials
1. Measures of Spread;
https://www.coursera.org/lecture/probability-intro/measures-of-spread-t9Wbk;
September 08, 2018
2. Measures of Spread;
http://maths.nayland.school.nz/Year_11/AS1.10_Multivar_data/9_Measures_Spread.htm;
September 08, 2018
3. Descriptive Statistics;
https://www.socialresearchmethods.net/kb/statdesc.htm; September
08, 2018
4. Descriptive Statistics;
https://www.researchconnections.org/childcare/datamethods/descriptivestats.jsp;
September 08, 2018
Online Instructional Videos

1. This course introduces you to sampling and exploring data, as well as
basic probability theory and Bayes' rule. You will examine various types
of sampling methods, and discuss how such methods can impact the
scope of inference. A variety of exploratory data analysis techniques
will be covered, including numeric summary statistics and basic data
visualization. You will be guided through installing and using R and
RStudio (free statistical software), and will use this software for lab
exercises and a final project. The concepts and techniques in this course
will serve as building blocks for the inference and modeling courses in
the Specialization;
https://www.coursera.org/lecture/probability-intro/measures-of-spread-t9Wbk;
September 08, 2018
