Chapter 3
Data Description
Outline
3-1 Measures of Central Tendency
3-2 Measures of Variation
3-3 Measures of Position
3-4 Exploratory Data Analysis
OBJECTIVES
1 Summarize data, using measures of central tendency, such as the mean, median, mode, and midrange.
2 Describe data, using measures of variation, such as the range, variance, and standard deviation.
3 Identify the position of a data value in a data set, using various measures of position, such as percentiles, deciles, and quartiles.
4 Use the techniques of exploratory data analysis, including boxplots and five-number summaries, to discover various aspects of data.
Introduction
• Measures of central tendency: measures of average, which indicate the center of the distribution or the most typical value.
• Measures of variation: measures of dispersion, which determine the spread of the data values.
• Measures of position: measures that tell where a data value falls within the data set.
The measures covered in this chapter are:
• Measures of central tendency: mean, median, mode, midrange
• Measures of variation: range, variance, standard deviation
• Measures of position: percentiles, deciles, quartiles
3.1 Measures of Central Tendency
Measures: Statistic vs. Parameter
• A statistic is a characteristic or measure obtained by using the data values from a sample.
• A parameter is a characteristic or measure obtained by using all the data values for a specific population.
Thus, the average household (HH) income obtained from a sample of households is a statistic, and the average household income obtained from the entire population of households is a parameter.
Measures of Central Tendency:
The Mean
The mean, also known as the arithmetic average, is found by adding the values of the data and dividing by the total number of values.
Sample mean:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Population mean (denoted by the Greek letter μ, mu):
\[ \mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N} \]
Example 3-1-page 112 Avian Flu Cases:
Example 3-1: Days Off per Year
The data represent the number of days off per year for a
sample of individuals selected from nine different
countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} = \frac{276}{9} \approx 30.7 \]
The mean number of days off is 30.7 days.
*The mean, in most cases, is not an actual data value.
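The calculation above can be checked with a few lines of Python. This is only a minimal sketch; the names `days_off` and `sample_mean` are illustrative and do not come from the textbook.

```python
# Days off per year for the nine sampled individuals (data from the example).
days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


mean_days_off = sample_mean(days_off)
print(round(mean_days_off, 1))
```

Note that the result, rounded to one decimal place, is not one of the nine data values.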
Measures of Central Tendency: Mean for Grouped Data
Finding the Mean for Grouped Data
• Step 1 Make a table as shown.
A B C D
Class Frequency f Midpoint Xm f * Xm
• Step 2 Find the midpoints of each class and place them in column C.
• Step 3 Multiply the frequency by the midpoint for each class, and
place the product in column D.
• Step 4 Find the sum of column D.
• Step 5 Divide the sum obtained in column D by the sum of the
frequencies obtained in column B.
• The formula for the mean is \( \bar{X} = \frac{\sum f \cdot X_m}{n} \), where \( n = \sum f \).
Example 3-3: Miles Run
Class          Frequency f    Midpoint Xm            f · Xm
5.5 - 10.5     1              (5.5 + 10.5)/2 = 8     1 × 8 = 8
10.5 - 15.5    2              13                     2 × 13 = 26
15.5 - 20.5    3              18                     54
20.5 - 25.5    5              23                     115
25.5 - 30.5    4              28                     112
30.5 - 35.5    3              33                     99
35.5 - 40.5    2              38                     76
Total          n = Σf = 20                           Σ f · Xm = 490

\[ \bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{490}{20} = 24.5 \text{ miles} \]
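The five steps for grouped data can be sketched in Python as follows. The tuple layout (lower boundary, upper boundary, frequency) and the function name `grouped_mean` are assumptions made for this illustration.

```python
# (lower boundary, upper boundary, frequency) for each class in the Miles Run example.
classes = [
    (5.5, 10.5, 1), (10.5, 15.5, 2), (15.5, 20.5, 3), (20.5, 25.5, 5),
    (25.5, 30.5, 4), (30.5, 35.5, 3), (35.5, 40.5, 2),
]


def grouped_mean(class_table):
    """Mean for grouped data: sum of f * Xm divided by the sum of the frequencies."""
    n = sum(f for _, _, f in class_table)                         # sum of column B
    total = sum(f * (lo + hi) / 2 for lo, hi, f in class_table)   # sum of column D
    return total / n


print(grouped_mean(classes))
```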
Measures of Central Tendency: Weighted Mean
• Sometimes, you must find the mean of a data set in
which not all values are equally represented.
• The type of mean that considers an additional factor
is called the weighted mean, and it is used when the
values are not all equally represented.
• Weighted mean of a variable is obtained by
multiplying each value (x) by its corresponding weight
(w) and dividing the sum of the products by the sum
of the weights.
Example 3-14: Grade Point Average
A student received the following grades. Find the
corresponding GPA.
Course                        Credits, w    Grade, X
English Composition           3             A (4 points)
Introduction to Psychology    3             C (2 points)
Biology                       4             B (3 points)
Physical Education            2             D (1 point)

\[ \bar{X} = \frac{\sum wX}{\sum w} = \frac{3 \cdot 4 + 3 \cdot 2 + 4 \cdot 3 + 2 \cdot 1}{3 + 3 + 4 + 2} = \frac{32}{12} \approx 2.7 \]
The grade point average is 2.7.
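A short Python sketch of the weighted mean, using the grades above. The function name `weighted_mean` and the (course, credits, points) layout are illustrative choices, not part of the textbook.

```python
# (course, credits w, grade points X) from the example.
courses = [
    ("English Composition", 3, 4),
    ("Introduction to Psychology", 3, 2),
    ("Biology", 4, 3),
    ("Physical Education", 2, 1),
]


def weighted_mean(weights, values):
    """Sum of w * x divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


credits = [w for _, w, _ in courses]
points = [x for _, _, x in courses]
gpa = weighted_mean(credits, points)
print(round(gpa, 1))
```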
Measures of Central Tendency: Median
• The median is the midpoint of the data array. The symbol
for the median is MD.
Steps in computing the median of a data array
Step 1 Arrange the data in order.
Step 2 Select the middle point
• The median will be one of the data values if there is an odd number of values, e.g., for 1, 2, 3 the median is 2.
• The median will be the average of the two middle data values if there is an even number of values, e.g., for 1, 2, 3, 4 the median is (2 + 3)/2 = 2.5.
Example 3-4: Tablet Sales
Example 3-5: Tornadoes in the U.S.
The number of tornadoes that have occurred in
the United States over an 8-year period follows.
Find the median.
684, 764, 656, 702, 856, 1133, 1132, 1303
Arrange the data in order:
656, 684, 702, 764, 856, 1132, 1133, 1303
Find the average of the two middle values:
\[ MD = \frac{764 + 856}{2} = \frac{1620}{2} = 810 \]
The median number of tornadoes is 810.
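The two steps (arrange, then select the middle) might be written in Python like this. The function name `median` is chosen for the illustration; Python's standard library offers `statistics.median`, which gives the same result.

```python
def median(values):
    """Return the midpoint of the data array."""
    ordered = sorted(values)          # Step 1: arrange the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the middle value itself
        return ordered[mid]
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


tornadoes = [684, 764, 656, 702, 856, 1133, 1132, 1303]
print(median(tornadoes))
```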
Measures of Central Tendency: Mode
• The mode is the value that occurs most often
in a data set.
• It is sometimes said to be the most typical
case.
• There may be no mode, one mode (unimodal),
two modes (bimodal), or many modes
(multimodal).
Examples: Mode
Unimodal: Find the mode of the signing bonuses of eight NFL players
for a specific year. The bonuses in millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
You may find it easier to sort first.
10, 10, 10, 11.3, 12.4, 14.0, 18.0, 34.5
The mode is 10.
Find the mode for the number of branches that six banks have.
401, 344, 209, 201, 227, 353
Since each value occurs only once, there is no mode.
Note: Do not say that the mode is zero. That would be incorrect,
because in some data, such as temperature, zero can be an actual
value.
Example 3-7: Licensed Nuclear Reactors
The data show the number of licensed nuclear
reactors in the United States for a recent 15-year
period. Find the mode.
104 104 104 104 104
107 109 109 109 110
109 111 112 111 109
104 and 109 both occur the most (5 times). The data
set is said to be bimodal.
The modes are 104 and 109.
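The three cases (no mode, one mode, several modes) can be handled by one small Python function. This is a sketch that assumes a non-empty data set; the name `modes` and the choice to return an empty list for "no mode" are illustrative.

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency, or [] if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                     # no mode (do not report zero)
    return sorted(v for v, c in counts.items() if c == highest)


reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110,
            109, 111, 112, 111, 109]
bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
branches = [401, 344, 209, 201, 227, 353]

print(modes(reactors), modes(bonuses), modes(branches))
```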
Example 3-9,3-10: Mode of grouped data
Measures of Central Tendency: Midrange
• The midrange is a rough estimate of the middle.
• The midrange is the average of the lowest and highest values in a data set.
\[ MR = \frac{\text{Lowest} + \text{Highest}}{2} \]
Properties and Uses of Central Tendency
The Mean
• Defined rigorously by a mathematical formula, which makes it highly amenable to mathematical treatment
• Uses all data values.
• Varies less than the median or mode when samples are taken
from the same population and all three measures are
computed for these samples.
• Used in computing other statistics, such as the variance
• Unique, usually not one of the data values
• Cannot be used with open-ended classes
• Affected by extremely high or low values, called outliers
Properties of the Median
➢Gives the midpoint
➢Used when it is necessary to find out whether
the data values fall into the upper half or lower
half of the distribution.
➢Can be used for an open-ended distribution.
➢Affected less than the mean by extremely high
or extremely low values.
Properties of the Mode
➢Used when the most typical case is desired
➢Easiest average to compute
➢Can be used with nominal data
➢Not always unique or may not exist
Properties of the Midrange
➢Easy to compute.
➢Gives the midpoint.
➢Affected by extremely high or low values in a
data set
Distribution Shapes
Exercise
3. High Temperatures The reported high temperatures (in degrees Fahrenheit) for
selected world cities on an October day are shown below. Find (i) the mean, (ii) the
median, (iii) the mode, and (iv) the mid-range. Which measure of central tendency
do you think best describes these data?
62 72 66 79 83 61 62 85 72 64 74 71
42 38 91 66 77 90 74 63 64 68 42
Solution:
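(i) Mean:
\[ \bar{x} = \frac{62 + 72 + 66 + \cdots + 42}{23} = \frac{1566}{23} \approx 68.1 \]
(ii) Arrange the observations in ascending order:
38 42 42 61 62 62 63 64 64 66 66 68 71 72 72 74 74 77 79 83 85 90 91
The median is the 12th of the 23 ordered values: Median = 68.
(iii) The modes are 42, 62, 64, 66, 72, and 74 (each occurs twice).
(iv) Midrange:
\[ MR = \frac{38 + 91}{2} = \frac{129}{2} = 64.5 \]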
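The four answers can be verified with Python's standard `statistics` module. This is a sketch; the variable names are illustrative, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# High temperatures (degrees Fahrenheit) from the exercise.
temps = [62, 72, 66, 79, 83, 61, 62, 85, 72, 64, 74, 71,
         42, 38, 91, 66, 77, 90, 74, 63, 64, 68, 42]

mean_temp = statistics.mean(temps)
median_temp = statistics.median(temps)
modes_temp = sorted(statistics.multimode(temps))   # all values tied for most frequent
midrange_temp = (min(temps) + max(temps)) / 2

print(round(mean_temp, 1), median_temp, modes_temp, midrange_temp)
```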
Exercise
14 . Hourly Compensation for Production Workers The hourly compensation
costs (in U.S. dollars) for production workers in selected countries are
represented below. Find the (a) mean, and (b) modal class.
Solution:
Class             Frequency f    Midpoint Xm    f · Xm
2.48 – 7.48       7              4.98           34.86
7.49 – 12.49      3              9.99           29.97
12.50 – 17.50     1              15.00          15.00
17.51 – 22.51     7              20.01          140.07
22.52 – 27.52     5              25.02          125.10
27.53 – 32.53     5              30.03          150.15
Total             n = Σf = 28                   Σ f · Xm = 495.15

(a) Mean:
\[ \bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{495.15}{28} \approx 17.68 \]
(b) Modal classes: 2.48 – 7.48 and 17.51 – 22.51 (each has the highest frequency, 7).
Extending the concept
36. If the mean of five values is 64, find the sum of the values. (Answer: 320)
37. If the mean of five values is 8.2 and four of the values are 6, 10, 7, and 12, find the fifth value.
38. Find the mean of 10, 20, 30, 40, and 50.
a. Add 10 to each value and find the mean. (Answer: 40)
b. Subtract 10 from each value and find the mean. (Answer: 20)
c. Multiply each value by 10 and find the mean. (Answer: 300)
d. Divide each value by 10 and find the mean. (Answer: 3)
e. Make a general statement about each situation.
3-2 Measures of Variation
In statistics, to describe a data set accurately, we must know, in addition to the measures of central tendency, how the data values are spread out.
Consider the example of the test scores of two groups of students.
          Group 1    Group 2
          45         70
          100        75
          80         80
Sum       225        225
The mean score for both groups of students is 75, but their performance is not the same.
How Can We Measure Variability?
• Range
• Variance
• Standard Deviation
• Coefficient of Variation
• Chebyshev’s Theorem
• Empirical Rule (Normal)
Measures of Variation: Range
• The range is the difference between the highest and
lowest values in a data set. R=highest – lowest
Example 3-15/16: Outdoor Paint
Two experimental brands of outdoor paint are tested to see how
long each will last before fading. Six cans of each brand constitute
a small population. The results (in months) are shown. Find the
mean and range of each group.
Brand A    Brand B
10         35
60         45
50         30
30         35
40         40
20         25
Solution:
Brand A:
\[ \mu = \frac{\sum X}{N} = \frac{210}{6} = 35, \qquad R = 60 - 10 = 50 \]
Brand B:
\[ \mu = \frac{\sum X}{N} = \frac{210}{6} = 35, \qquad R = 45 - 25 = 20 \]
The average for both brands is the same, but the range
for Brand A is much greater than the range for Brand B.
Which brand would you buy?
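A brief Python sketch of the comparison, assuming the six cans of each brand are stored as lists. The names `brand_a`, `brand_b`, and `data_range` are illustrative.

```python
# Months before fading for the six cans of each brand (from the example).
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]


def data_range(values):
    """Range R = highest value - lowest value."""
    return max(values) - min(values)


for name, data in (("Brand A", brand_a), ("Brand B", brand_b)):
    print(name, "mean:", sum(data) / len(data), "range:", data_range(data))
```

Both brands have the same mean, so the mean alone cannot distinguish them; the range shows that Brand A is far less consistent.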
Bluman, Chapter 3
Measures of Variation: Variance & Standard Deviation
Variance: the average of the squares of the distance each value is from the mean.
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]
Standard deviation: the square root of the variance. The standard deviation is a measure of how spread out your data are.
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \]
Uses or purposes:
- To determine the spread of the data.
- To determine the consistency of a variable.
- To determine the number of data values that fall within a specified interval in a distribution (Chebyshev’s Theorem).
- Used in inferential statistics.
Example 3-21: Outdoor Paint
Find the variance and standard deviation for the data set for
Brand A paint. 10, 60, 50, 30, 40, 20
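\[ \mu = \frac{\sum X}{N} = \frac{210}{6} = 35 \]
Months, X    X – µ    (X – µ)²
10           –25      625
60           25       625
50           15       225
30           –5       25
40           5        25
20           –15      225
Total        0        1750

\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} \approx 291.7 \]
\[ \sigma = \sqrt{\frac{1750}{6}} \approx 17.1 \]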
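The table-based calculation for Brand A can be reproduced in Python. This is a minimal sketch of the population formulas; the function name `population_variance` is illustrative, and the standard library's `statistics.pvariance` and `statistics.pstdev` compute the same quantities.

```python
import math

brand_a = [10, 60, 50, 30, 40, 20]   # months, treated as a small population


def population_variance(values):
    """Average of the squared distances of each value from the mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


variance_a = population_variance(brand_a)
std_dev_a = math.sqrt(variance_a)     # standard deviation = square root of variance
print(round(variance_a, 1), round(std_dev_a, 1))
```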
Measures of Variation: Variance & Standard Deviation
(Sample Theoretical Model)
• The sample variance is
\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \]
• The sample standard deviation is
\[ s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]
Variance & Standard Deviation
(Shortcut, or sample computational, formulas for \( s^2 \) and \( s \))
• Is mathematically equivalent to the theoretical formula.
• Saves time when calculating by hand.
• Does not use the mean.
• Is more accurate when the mean has been rounded.
The sample variance is
\[ s^2 = \frac{n \sum X^2 - \left( \sum X \right)^2}{n(n - 1)} \]
The sample standard deviation is
\[ s = \sqrt{s^2} \]
Example : European Auto Sales
Find the variance and standard deviation for the amount of
European auto sales for a sample of 6 years. The data are in
millions of dollars.
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
X        X²
11.2     125.44
11.9     141.61
12.0     144.00
12.8     163.84
13.4     179.56
14.3     204.49
Total    75.6     958.94

\[ s^2 = \frac{n \sum X^2 - \left( \sum X \right)^2}{n(n - 1)} = \frac{6(958.94) - (75.6)^2}{6(5)} \approx 1.28 \]
\[ s = \sqrt{1.28} \approx 1.13 \]
Note that \( \sum X^2 \) is not the same as \( \left( \sum X \right)^2 \):
- \( \sum X^2 \) means to square the values first, then sum;
- \( \left( \sum X \right)^2 \) means to sum the values first, then square the sum.
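The claim that the shortcut formula is equivalent to the theoretical one can be checked numerically. The sketch below implements both for the auto sales data; the function names are illustrative.

```python
import math

sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]   # millions of dollars


def sample_variance_shortcut(values):
    """Shortcut formula: [n * sum(X^2) - (sum X)^2] / [n(n - 1)]."""
    n = len(values)
    sum_x = sum(values)                     # sum first ...
    sum_x_sq = sum(x * x for x in values)   # ... versus square first, then sum
    return (n * sum_x_sq - sum_x ** 2) / (n * (n - 1))


def sample_variance_definition(values):
    """Theoretical formula: sum of (X - mean)^2 divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


s2 = sample_variance_shortcut(sales)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))
```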
Shortcut (sample computational) formulas for \( s^2 \) and \( s \) for grouped data
• The steps for finding the variance and standard deviation for grouped data are summarized in this Procedure Table.
Example 3-22: Miles run per week
Find the variance and the standard deviation for the frequency
distribution of the data in Example 2–7. The data represent the
number of miles that 20 runners ran during one week.
Solution of example 3-24
Substitute in the formula and solve for \( s^2 \) to get the variance.
\[ s^2 = \frac{n \left( \sum f \cdot X_m^2 \right) - \left( \sum f \cdot X_m \right)^2}{n(n - 1)} = \frac{20(13{,}310) - (490)^2}{20(20 - 1)} \approx 68.7 \]
Take the square root to get the standard deviation.
\[ s = \sqrt{68.7} \approx 8.3 \]
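A Python sketch of the grouped-data shortcut formula follows. It assumes the class midpoints and frequencies are those of the Miles Run frequency distribution shown earlier in this chapter (they reproduce the sums 490 and 13,310 used above); the function name is illustrative.

```python
import math

# Class midpoints Xm and frequencies f (assumed from the Miles Run table).
midpoints = [8, 13, 18, 23, 28, 33, 38]
frequencies = [1, 2, 3, 5, 4, 3, 2]


def grouped_sample_variance(xm, f):
    """[n * sum(f * Xm^2) - (sum(f * Xm))^2] / [n(n - 1)], with n = sum(f)."""
    n = sum(f)
    sum_fx = sum(fi * x for fi, x in zip(f, xm))
    sum_fx2 = sum(fi * x * x for fi, x in zip(f, xm))
    return (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))


s2_miles = grouped_sample_variance(midpoints, frequencies)
s_miles = math.sqrt(s2_miles)
print(round(s2_miles, 1), round(s_miles, 1))
```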
Measures of Variation: Coefficient of Variation
Whenever two samples have the same units of measure, the variance and standard deviation for each can be compared directly.
But what if we want to compare the standard deviations of two different
variables, such as the number of sales per salesperson over a 3-month
period and the commissions made by these salespeople?
Coefficient of variation is a statistic that allows you to compare
standard deviations when the units are different.
The coefficient of variation is the standard deviation divided by the mean, expressed as a percentage.
\[ CVar = \frac{s}{\bar{X}} \cdot 100\% \]
Example 3-23: Sales of Automobiles
The mean of the number of sales of cars over a 3-month
period is 87, and the standard deviation is 5.
The mean of the commissions is $5225, and the standard
deviation is $773. Compare the variations of the two.
Sales:
\[ CVar = \frac{5}{87} \cdot 100\% \approx 5.7\% \]
Commissions:
\[ CVar = \frac{773}{5225} \cdot 100\% \approx 14.8\% \]
Commissions are more variable than sales.
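The comparison can be sketched in Python as below. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """CVar = (s / mean) * 100, expressed as a percentage."""
    return std_dev / mean * 100


cvar_sales = coefficient_of_variation(5, 87)            # number of cars sold
cvar_commissions = coefficient_of_variation(773, 5225)  # dollars

print(round(cvar_sales, 1), round(cvar_commissions, 1))
```

Because the units cancel in the ratio, the two percentages can be compared even though one variable is a count and the other is in dollars.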
Range Rule of Thumb
The range can be used to approximate the standard
deviation. The approximation is called the range rule
of thumb.
The Range Rule of Thumb approximates the standard deviation as
\[ s \approx \frac{\text{Range}}{4} \]
Note: use this approximation only when the distribution is unimodal and approximately symmetric.
Example : Range Rule of Thumb
• For example, for the data set 5, 8, 8, 9, 10, 12, 13, the standard deviation is 2.7 and the range is 13 – 5 = 8. The range rule of thumb gives s ≈ 8/4 = 2.
• The range rule of thumb in this case underestimates the standard deviation somewhat.
• A note of caution: the range rule of thumb is only an approximation and should be used only when the distribution of data values is unimodal and roughly symmetric.
Range Rule of Thumb
The range rule of thumb can be used to estimate the largest and smallest
data values of a data set. The smallest data value will be approximately 2
standard deviations below the mean, and the largest data value will be
approximately 2 standard deviations above the mean of the data set.
\[ \text{Smallest data value} \approx \bar{X} - 2s \]
\[ \text{Largest data value} \approx \bar{X} + 2s \]
Example: \( \bar{X} = 10 \), Range = 12, so \( s \approx \frac{12}{4} = 3 \)
Low: \( 10 - 2(3) = 4 \)
High: \( 10 + 2(3) = 16 \)
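Both uses of the range rule of thumb can be sketched in Python. The function names are illustrative, and the results are rough estimates only, under the assumption of a unimodal, roughly symmetric distribution.

```python
import statistics


def estimate_std_dev(data_range):
    """Range rule of thumb: s is roughly Range / 4."""
    return data_range / 4


def estimate_extremes(mean, std_dev):
    """Approximate smallest and largest values as mean - 2s and mean + 2s."""
    return mean - 2 * std_dev, mean + 2 * std_dev


# Example from the slide: mean 10, range 12.
s_est = estimate_std_dev(12)
low, high = estimate_extremes(10, s_est)
print(s_est, low, high)

# Comparison with the actual sample standard deviation of a small data set.
data = [5, 8, 8, 9, 10, 12, 13]
print(estimate_std_dev(max(data) - min(data)), round(statistics.stdev(data), 1))
```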
Measures of Variation: Chebyshev’s Theorem
• Note: The variance and standard deviation of a variable
can be used to determine the spread, or dispersion, of a
variable. That is, the larger the variance or standard
deviation, the more the data values are dispersed.
• For example, if two variables measured in the same
units have the same mean, say, 70, and the first variable
has a standard deviation of 1.5 while the second variable
has a standard deviation of 10, then the data for the
second variable will be more spread out than the data for
the first variable.
• Chebyshev’s theorem, developed by the Russian mathematician
Chebyshev (1821–1894), specifies the proportions of the spread in
terms of the standard deviation.
• Chebyshev’s Theorem: The proportion of values from any data set that fall within k standard deviations of the mean will be at least \( 1 - 1/k^2 \), where k is a number greater than 1 (k is not necessarily an integer).
Number of standard    Minimum proportion within    Minimum percentage within
deviations, k         k standard deviations        k standard deviations
2                     1 – 1/4 = 3/4                75%
3                     1 – 1/9 = 8/9                88.89%
4                     1 – 1/16 = 15/16             93.75%
Example : Chebyshev’s Theorem
For a variable that has a mean of 70 and a standard deviation of 1.5, at least three-fourths, or 75%, of the data values fall between 67 and 73.
These values are found by adding 2 standard deviations to the mean and subtracting 2 standard deviations from the mean, as shown:
70 + 2(1.5) = 70 + 3 = 73
70 – 2(1.5) = 70 – 3 = 67
Example 3-25: Prices of Homes
The mean price of houses in a certain neighborhood is $50,000, and
the standard deviation is $10,000. Find the price range for which at
least 75% of the houses will sell.
Solution:
Chebyshev’s Theorem states that at least 75% of a data set will fall within 2 standard deviations of the mean:
Lower limit = 50,000 – 2(10,000) = 30,000
Upper limit = 50,000 + 2(10,000) = 70,000
Thus, at least 75% of all homes sold in the area will have a price between $30,000 and $70,000.
Example 3-26: Travel Allowances
A survey of local companies found that the mean amount of travel
allowance for executives was $0.25 per mile. The standard
deviation was 0.02. Using Chebyshev’s theorem, find the
minimum percentage of the data values that will fall between
$0.20 and $0.30.
Solution:
We have k = (upper limit – mean)/standard deviation = (0.30 – 0.25)/0.02 = 2.5.
Thus,
\[ 1 - \frac{1}{k^2} = 1 - \frac{1}{2.5^2} = 1 - \frac{1}{6.25} = 0.84 \]
At least 84% of the data values will fall between $0.20 and $0.30.
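Chebyshev’s bound is easy to compute for any k greater than 1. The following Python sketch reproduces the values for k = 2, 3, 4 and the travel allowance example; the function name `chebyshev_min_proportion` is illustrative.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations: 1 - 1/k^2 (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(k, round(chebyshev_min_proportion(k) * 100, 2))

# Travel allowance example: mean 0.25, standard deviation 0.02, upper limit 0.30.
k_travel = (0.30 - 0.25) / 0.02
print(round(chebyshev_min_proportion(k_travel), 2))
```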
Measures of Variation: Empirical Rule (Normal)
Chebyshev’s theorem applies to any distribution regardless of its
shape. However, when a distribution is bell-shaped (or what is
called normal), the following statements, which make up the
empirical rule, are true.
• Other way: The percentage of values from a data set
that fall within k standard deviations of the mean in a
normal (bell-shaped) distribution is listed below.
Number of standard deviations, k    Approximate percentage within k standard deviations
1                                   68%
2                                   95%
3                                   99.7%
Suppose that the scores on a national achievement exam have a mean of 480 and a standard deviation of 90. If these scores are normally distributed, then approximately 95% of the scores will fall between 300 and 660 (480 – 2 · 90 = 300 and 480 + 2 · 90 = 660).
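The empirical rule intervals for the exam scores can be listed with a short Python sketch. The dictionary and function names are illustrative, and the percentages are the approximate values from the table, valid only for bell-shaped (normal) distributions.

```python
# Approximate percentages from the empirical rule, keyed by k.
EMPIRICAL_PERCENTAGES = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_interval(mean, std_dev, k):
    """Return (low, high, approximate percentage) for mean +/- k standard deviations."""
    if k not in EMPIRICAL_PERCENTAGES:
        raise ValueError("the empirical rule is stated only for k = 1, 2, 3")
    return mean - k * std_dev, mean + k * std_dev, EMPIRICAL_PERCENTAGES[k]


for k in (1, 2, 3):
    print(empirical_interval(480, 90, k))
```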
Linear transformation of the data
Sometimes, it is necessary to transform the data values into other
data values.
Example:
Sometimes temperature values are collected using the Celsius scale, but in some places the data must be converted to the Fahrenheit scale.
This change is called a linear transformation of the data.
The question now is:
How does the transformation of the data values affect the mean and the standard deviation?
Example:
Suppose you own a store with five employees whose hourly salaries are $10, $13, $10, $11, $16.
\( \bar{x} = 12 \) dollars, \( s = 2.550 \)
Then you decide to give each employee a raise of $1.00 per hour, so the new salaries are $11, $14, $11, $12, $17.
\( \bar{x} = 13 \) dollars, \( s = 2.550 \)
So the mean increases by the amount added to each data value, but the standard deviation does not change.
Example:
Suppose that the five employees worked the following numbers of hours per week: 15, 12, 18, 20, 10.
\( \bar{x} = 15 \), \( s = 4.123 \)
You next decide to double each employee’s hours for December: 30, 24, 36, 40, 20.
\( \bar{x} = 30 \), \( s = 8.246 \)
So both the mean and the standard deviation are doubled.
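Both examples can be checked with Python's `statistics` module. The sketch below adds a constant to the salaries and multiplies the hours by a constant, then compares the sample mean and sample standard deviation before and after; the variable names are illustrative.

```python
import math
import statistics

salaries = [10, 13, 10, 11, 16]          # dollars per hour
raised = [x + 1 for x in salaries]       # add $1.00 to every value

hours = [15, 12, 18, 20, 10]             # hours per week
doubled = [2 * x for x in hours]         # multiply every value by 2

# Adding a constant shifts the mean but leaves the standard deviation unchanged.
print(statistics.mean(salaries), round(statistics.stdev(salaries), 3))
print(statistics.mean(raised), round(statistics.stdev(raised), 3))

# Multiplying by a constant multiplies both the mean and the standard deviation.
print(statistics.mean(hours), round(statistics.stdev(hours), 3))
print(statistics.mean(doubled), round(statistics.stdev(doubled), 3))
```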