L1 Descriptive Stats
HMCS106
Introduction
86 80 25 77 73 76 100 90 69 93
90 83 70 73 73 70 90 83 71 95
40 58 68 69 100 78 87 97 92 74
Scale Frequency
10
20 1
30
40 1
50 1
60 3
70 10
80 5
90 7
100 2
• We then construct the histogram by drawing for
each group, or class, a vertical bar whose
length is the number of observations in that
group.
• In our example, the bar labelled 100 is 2 units
long, the bar labelled 90 is 7 units long, and so
on
• the individual data values are lost, but we know
the number in each class
• This number is called the frequency of the
class, hence the name frequency histogram
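The counting step behind the table can be sketched in Python. This is only an illustration: the binning rule (each score is assigned to the scale label obtained by rounding it down to a multiple of 10) is inferred from the table above, and the function name `frequency_table` is made up for this sketch.

```python
from collections import Counter

# The 30 scores listed at the start of this section.
scores = [
    86, 80, 25, 77, 73, 76, 100, 90, 69, 93,
    90, 83, 70, 73, 73, 70, 90, 83, 71, 95,
    40, 58, 68, 69, 100, 78, 87, 97, 92, 74,
]


def frequency_table(values, width=10):
    """Count observations per class, labelling each class by its lower bound."""
    return Counter((v // width) * width for v in values)


table = frequency_table(scores)
for label in range(10, 101, 10):
    # A Counter returns 0 for labels with no observations (10 and 30 here).
    # The row of '#' characters is a crude text histogram bar.
    print(f"{label:>5} {table[label]:>3} {'#' * table[label]}")
```

Each printed row carries the class label, its frequency, and a bar whose length equals that frequency, which is exactly what the frequency histogram draws.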
Frequency Histogram
Constructing a frequency table
• the weight (to the nearest gram) of 30
packets of salt, all marked 65g
68 52 49 56 68
74 41 59 79 81
42 57 60 88 87
47 68 55 68 65
50 78 61 90 85
65 66 72 63 95
• Step 1: Determine the data range
$r = x_{\max} - x_{\min} = 95 - 41 = 54$
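As a rough sketch, the range and the class frequencies can be computed directly from the raw weights. The class limits [40;50), ..., [90;100) with width 10 are taken from the frequency table used later in this section; the variable names are illustrative only.

```python
# Weights (to the nearest gram) of the 30 packets of salt.
weights = [
    68, 52, 49, 56, 68,
    74, 41, 59, 79, 81,
    42, 57, 60, 88, 87,
    47, 68, 55, 68, 65,
    50, 78, 61, 90, 85,
    65, 66, 72, 63, 95,
]

# Step 1: the data range r = x_max - x_min.
data_range = max(weights) - min(weights)

# Assumed classes: [40;50), [50;60), ..., [90;100), each of width 10.
lower_limits = range(40, 100, 10)
frequencies = {
    lo: sum(1 for w in weights if lo <= w < lo + 10) for lo in lower_limits
}

print("range:", data_range)
for lo, f in frequencies.items():
    print(f"[{lo};{lo + 10}) {f}")
```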
[Figure: histogram of the weight of the packets of salt. Vertical axis: Number of packets (f); horizontal axis: Weight in gram]
[Figure: Cumulative frequency polygon of the weight of 30 packets of salt. Vertical axis: Cumulative frequency (F); horizontal axis: Weight of packet]
Scatterplots
[Figure: scatterplot. Vertical axis: Number of days; horizontal axis: Number of errors]
[Figure: Bar chart of the number of typing errors. Vertical axis: Number of days; horizontal axis: Number of errors]
Graphical illustrations of continuous data
[Figure: histogram. Vertical axis: Number of packets (f); horizontal axis: Weight in gram]
Relative Frequency Histograms
Measures of Central Location
• Any kind of “average” is meant to be an answer
to the question “Where do the data centre?”
• It is therefore a measure of the central location
of a data set
• 3 main measures of central location
– arithmetic mean,
– mode, and
– median (also called the 2nd quartile or the 50th
percentile).
• the nature of the data set, as indicated by
a relative frequency histogram, will
determine what constitutes a good answer
The mean
• The first measure of central location is the
usual “average”, also known as the
arithmetic mean or simply the “mean”
• Defn.: the sample mean of a set of n sample
data is the number x̄ defined by the formula
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Class fi
[40;50) 4
[50;60) 6
[60;70) 10
[70;80) 4
[80;90) 4
[90;100) 2
Total 30
iii. Calculate $\sum_{i=1}^{6} f_i x_i$ and divide by n.
Class fi xi fixi
[40;50) 4 45 180
[50;60) 6 55 330
[60;70) 10 65 650
[70;80) 4 75 300
[80;90) 4 85 340
[90;100) 2 95 190
Total 30 1990
$\bar{x} = \frac{1}{30}\sum_{i=1}^{6} f_i x_i = \frac{1}{30}(1990) = 66.3333$
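The grouped-data mean can be checked with a few lines of Python. This is a minimal sketch that takes the class midpoints and frequencies from the table above; the variable names are illustrative.

```python
# Class midpoints x_i and frequencies f_i from the table above.
midpoints = [45, 55, 65, 75, 85, 95]
frequencies = [4, 6, 10, 4, 4, 2]

n = sum(frequencies)                                        # number of packets
total = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f_i * x_i
grouped_mean = total / n

print(n, total, round(grouped_mean, 4))
```

Because every observation in a class is replaced by the class midpoint, the result is an approximation of the mean of the raw data.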
• The population mean of a set of N population
data is the number μ defined by the
formula:
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Median
• Suppose we are interested in the average yearly
income of employees at a large corporation
• We take a random sample of seven employees,
obtaining the sample data (rounded to the nearest
hundred dollars, and expressed in thousands of
dollars)
24.8 ;22.8; 24.6; 192.5; 25.2; 18.5; 23.7
• The mean (rounded to one decimal place) is x̅
=47.4
• Hence “the average income of employees at this
corporation is $47,400” - surely misleading!
• The mean is approximately twice what six of
the seven employees in the sample make
• What went wrong?
• the presence of the one executive in the
sample
• The number 192.5 in our data set is called an
outlier
• Many times an outlier is the result of some
sort of error, but not always, as is the case
here
• We would get a better measure of the
“centre” of the data if we were to arrange the
data in numerical order,
• 18.5 22.8 23.7 24.6 24.8 25.2 192.5
• then select the middle number in the list, in
this case 24.6
• This is the median, characterised by the fact that
roughly half of the measurements are larger
than it and roughly half are smaller
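The effect of the outlier on the two measures can be reproduced with Python's standard `statistics` module. This is only a sketch using the seven incomes from the example (in thousands of dollars).

```python
from statistics import mean, median

# Yearly incomes of the seven sampled employees, in thousands of dollars.
incomes = [24.8, 22.8, 24.6, 192.5, 25.2, 18.5, 23.7]

sample_mean = mean(incomes)      # pulled upwards by the outlier 192.5
sample_median = median(incomes)  # middle value of the sorted data

# Dropping the outlier shows how strongly it affects the mean.
without_outlier = [x for x in incomes if x != 192.5]

print(round(sample_mean, 1), sample_median)
print(round(mean(without_outlier), 1), median(without_outlier))
```

The median moves very little when the outlier is dropped, while the mean changes substantially.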
Defn.
• The sample median x̃ of a set of sample
data for which there are an odd number of
measurements is the middle measurement
when the data are arranged in numerical
order.
• The sample median x̃ of a set of sample
data for which there are an even number of
measurements is the mean of the two middle
measurements when the data are arranged in
numerical order.
The Median
Skewness of Relative Frequency Histograms
The relationship between the mean and the
median for several common shapes of
distributions
• When the distribution is symmetric, as in panels (a)
and (b) of the figure, the mean and the median are equal
• When the distribution is as shown in panel (c) of the figure, it
is said to be skewed right (also called positively skewed or
right tailed)
– The mean has been pulled to the right of the median by
the long “right tail” of the distribution, the few relatively
large data values
• When the distribution is as shown in panel (d) of the figure, it
is said to be skewed left (also called negatively skewed or left
tailed)
– The mean has been pulled to the left of the median by
the long “left tail” of the distribution
Median
is in the
Middle
Definition
1, 3, 7, 10, 13
Median = 7
Grouped data
• The median can be determined graphically or
by using a formula
• Graphically (on the cumulative frequency
polygon):
i. Compute the order number $\frac{n+1}{2}$ of the median observation
ii. Mark the value on the vertical axis and
project horizontally to the polygon
iii. Project it down to the horizontal axis and
read the value of the median
[Figure: Cumulative frequency polygon of the weight of 30 packets of salt, with the median Me marked on the horizontal axis between 60 and 70. Vertical axis: Cumulative frequency (F); horizontal axis: Weight of packet]
• Formula
• The median interval is the first interval for which the cumulative frequency is
$\frac{n+1}{2}$ or more
$Me = L + c\,\frac{\frac{n}{2} - F_{me-1}}{f_{me}}$
• L = lower limit of the median interval,
• $\frac{n}{2}$ = order number of the median,
• $F_{me-1}$ = cumulative frequency of the interval before the
median interval,
• c = width of the median interval,
• $f_{me}$ = frequency of the median interval
Class fi Fi
[40;50) 4 4
[50;60) 6 10
[60;70) 10 20
[70;80) 4 24
[80;90) 4 28
[90;100) 2 30
Total 30
• Order number = position of Me = $\frac{n}{2} = \frac{30}{2} = 15$
• Median interval = [60;70)
• L = 60; $\frac{n}{2}$ = 15; c = 10; $F_{me-1}$ = 10 and
$f_{me}$ = 10.
• Therefore: $Me = L + c\,\frac{\frac{n}{2} - F_{me-1}}{f_{me}} = 60 + 10 \cdot \frac{15 - 10}{10} = 65$
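The same calculation can be written as a small Python function. This is a sketch, not a definitive implementation: it assumes classes of equal width, uses n/2 as the order number (as in the worked example), and the name `grouped_median` is hypothetical.

```python
def grouped_median(lower_limits, frequencies, width):
    """Median of grouped data: Me = L + c * (n/2 - F_prev) / f_me."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("empty frequency table")
    order = n / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= order:
            # First class whose cumulative frequency reaches the order number.
            return lower + width * (order - cumulative) / f
        cumulative += f
    raise ValueError("median interval not found")


# Frequency table of the 30 packets of salt.
lower_limits = [40, 50, 60, 70, 80, 90]
frequencies = [4, 6, 10, 4, 4, 2]

median_weight = grouped_median(lower_limits, frequencies, width=10)
print(median_weight)
```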
Mode
• Consider the statement: “The average number
of mobile phones owned by Harare residents
is 1.37,”
• the thought of a fraction of a mobile phone can
be baffling
• In such a context the following measure for
central location might make more sense
• Defn: The sample mode of a set of sample
data is the most frequently occurring value
On a relative frequency histogram, the highest point of
the histogram corresponds to the mode of the data set
Definition
Mode
is the most
Popular
Definition
• Mode – the number that
appears most frequently in a
set of numbers.
1, 1, 3, 7, 10, 13
Mode = 1
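For ungrouped data the mode can be found with Python's standard `statistics` module. In this small sketch the second list is a hypothetical data set, added only to show what happens when two values tie for most frequent.

```python
from statistics import mode, multimode

values = [1, 1, 3, 7, 10, 13]
sample_mode = mode(values)  # the single most frequent value

# Hypothetical data with a tie: multimode returns every most frequent value.
tied = [1, 1, 3, 3, 7]
all_modes = multimode(tied)

print(sample_mode, all_modes)
```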
Grouped data
• Graphically
[Figure: Histogram of the weight of 30 packets of salt, with the mode Mo marked on the horizontal axis between 60 and 70. Vertical axis: Number of packets (f); horizontal axis: Weight in gram]
Grouped data
• Using a formula: $Mo = L + c\,\frac{f_m - f_{m-1}}{2f_m - f_{m-1} - f_{m+1}}$
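A sketch of the formula in Python follows. The symbols are not defined on the slide, so their meaning is an assumption made by analogy with the median formula: L is the lower limit of the modal class, c its width, f_m its frequency, and f_(m-1), f_(m+1) the frequencies of the neighbouring classes (taken as 0 at the ends of the table). The function name `grouped_mode` is hypothetical.

```python
def grouped_mode(lower_limits, frequencies, width):
    """Mode of grouped data: Mo = L + c * (f_m - f_prev) / (2 f_m - f_prev - f_next)."""
    # Index of the modal class, i.e. the class with the highest frequency.
    m = max(range(len(frequencies)), key=frequencies.__getitem__)
    f_m = frequencies[m]
    f_prev = frequencies[m - 1] if m > 0 else 0
    f_next = frequencies[m + 1] if m + 1 < len(frequencies) else 0
    denominator = 2 * f_m - f_prev - f_next
    if denominator == 0:
        raise ValueError("modal class has the same frequency as both neighbours")
    return lower_limits[m] + width * (f_m - f_prev) / denominator


# Frequency table of the 30 packets of salt.
lower_limits = [40, 50, 60, 70, 80, 90]
frequencies = [4, 6, 10, 4, 4, 2]

mode_weight = grouped_mode(lower_limits, frequencies, width=10)
print(mode_weight)
```

For the salt data the modal class is [60;70), so the result lies between 60 and 70, in line with the graphical method.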
• Geometric mean: $GM = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$
• Harmonic mean: $HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$
• Both formulae are for ungrouped data
• Trimmed mean or trimean or truncated
mean
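Both special means are available in Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later). The two short data sets below are hypothetical and chosen only so that the results are easy to check by hand.

```python
from statistics import geometric_mean, harmonic_mean

# Hypothetical data: the geometric mean of 2 and 8 is sqrt(2 * 8) = 4.
gm = geometric_mean([2, 8])

# Hypothetical data: the harmonic mean of 40 and 60 is 2 / (1/40 + 1/60) = 48.
hm = harmonic_mean([40, 60])

print(round(gm, 4), round(hm, 4))
```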
Summary
• The mean, the median, and the mode
each answer the question “Where is the
centre of the data set?”
• The nature of the data set, as indicated by
a relative frequency histogram, determines
which one gives the best answer.
• There are other “special” measures of
central location
Measures of
Dispersion/Variability
Learning objectives
Dispersion: the extent to which the observations
are scattered about the central value
• most common measures of dispersion are:
– Range
– Mean deviation
– Standard deviation
– Variance
• Look at the two data sets above and their
graphical representations in charts called
“dot plots” (below)
• The two sets of ten measurements each
centre at the same value: they both have
mean, median, and mode 40
• Yet a glance at the charts shows that they are
markedly different
• In Data Set I the measurements vary only
slightly from the centre, while for Data Set II
the measurements vary greatly
• Just as we have attached numbers to a
data set to locate its centre, we now
associate to each data set numbers that
measure quantitatively how the data either
scatter away from the centre or cluster
close to it.
• These new quantities are called measures
of variability or dispersion
Range
• The range of a data set is the number R
defined by the formula
R= xmax − xmin
where xmax is the largest measurement in the
data set and xmin is the smallest
• For Data Set I the maximum is 43 and the
minimum is 38, so the range is R=43−38=5.
• For Data Set II the maximum is 47 and the
minimum is 33, so the range is R=47−33=14
• The range is a measure of variability
because it indicates the size of the interval
over which the data points are distributed
• A smaller range indicates less variability
(less dispersion) among the data, whereas
a larger range indicates the opposite
Mean deviation (MD)
• The sum of the deviations around the arithmetic
mean equals zero, hence it is not useful as a measure of dispersion
• The mean deviation is defined by the following
equation:
$MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
where $\bar{x}$ is the mean, $x_i$ is the ith value
and n is the sample size
example
• Consider these scores and calculate the MD:
8.10 ; 7.10 ; 6.65 ; 8.60 ; 6.20 ; 6.55 ; 7.75
$\bar{x} = \frac{1}{7}(50.95) = 7.2786$
$MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
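The remaining arithmetic can be carried out with a short Python sketch of the mean deviation, using absolute deviations from the mean. The function name `mean_deviation` is illustrative.

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n


scores = [8.10, 7.10, 6.65, 8.60, 6.20, 6.55, 7.75]

x_bar = sum(scores) / len(scores)
md = mean_deviation(scores)

print(round(x_bar, 4), round(md, 4))
```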
Sample variance
Defn:
• The sample variance of a set of n sample data
is the number $s^2$ defined by the formula:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Sample standard deviation
Defn:
• The sample standard deviation of a set of n
sample data is the square root of the sample
variance, hence it is the number s defined by
the formula:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
soln
Data set 2
homework
Data set 1
• Show that the sample variance and standard
deviation of Data Set I are the much smaller
numbers $s^2 = 20/9 \approx 2.22$ and $s = \sqrt{20/9} \approx 1.49$
Population parameters
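• If the data set comprises the whole population, then
the population standard deviation, denoted σ (the
lower case Greek letter sigma), and its square, the
population variance $\sigma^2$, are defined as follows:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ and $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$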
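The difference between the sample and population versions is the divisor: n − 1 for a sample, N for a population. A minimal sketch with Python's `statistics` module, reusing the seven scores from the mean deviation example purely as illustrative data:

```python
import math
from statistics import pstdev, pvariance, stdev, variance

scores = [8.10, 7.10, 6.65, 8.60, 6.20, 6.55, 7.75]
n = len(scores)

s2 = variance(scores)       # sample variance, divides by n - 1
s = stdev(scores)           # sample standard deviation
sigma2 = pvariance(scores)  # population variance, divides by n
sigma = pstdev(scores)      # population standard deviation

print(round(s2, 4), round(s, 4), round(sigma2, 4), round(sigma, 4))
```

Treating the same numbers as a sample always gives the slightly larger value, because the sum of squared deviations is divided by n − 1 instead of n.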
• Since most data sets are samples, we will
always work with the sample standard
deviation and variance
• In most applications we deal with comparing
the means and standard deviations of two data
sets
• The figures below illustrate how a difference in
one or both of the sample mean and the
sample standard deviation is reflected in the
appearance of the data set as shown by the
curves derived from the relative frequency
histograms built using the data
Grouped data
Range
Example:
• Consider the frequency table of the “salt”
problem and calculate the range
Class xi fi fi/n Fi Fi/n
[40;50) 45 4 0.1333 4 0.1333
[50;60) 55 6 0.2000 10 0.3333
[60;70) 65 10 0.3333 20 0.6667
[70;80) 75 4 0.1333 24 0.8000
[80;90) 85 4 0.1333 28 0.9333
[90;100) 95 2 0.0667 30 1.0000
Total 30 1.0000
r = 100 – 40 = 60
Mean deviation (MD)
$MD = \frac{1}{n}\sum_{i=1}^{k} f_i\,|x_i - \bar{x}|$
$\bar{x} = 66.3333$
Class xi fi |xi − x̄| fi|xi − x̄|
[40;50) 45 4 21.3333 85.3332
[50;60) 55 6 11.3333 67.9998
[60;70) 65 10 1.3333 13.3333
[70;80) 75 4 8.6667 34.6668
[80;90) 85 4 18.6667 74.6668
[90;100) 95 2 28.6667 57.3334
Total 30 333.3333
$MD = \frac{1}{30}(333.3333) = 11.1111$
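The table can be reproduced with a short Python sketch. It weights the absolute deviation of each class midpoint by the class frequency; the variable names are illustrative.

```python
# Class midpoints x_i and frequencies f_i for the salt data.
midpoints = [45, 55, 65, 75, 85, 95]
frequencies = [4, 6, 10, 4, 4, 2]

n = sum(frequencies)
x_bar = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# One entry per class: f_i * |x_i - x_bar|.
weighted_deviations = [f * abs(x - x_bar) for f, x in zip(frequencies, midpoints)]
md = sum(weighted_deviations) / n

print(round(x_bar, 4), round(sum(weighted_deviations), 4), round(md, 4))
```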
Standard deviation (s)
$s = \sqrt{\frac{1}{n-1}\left[\sum_{i=1}^{k} f_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{k} f_i x_i\right)^2\right]}$
Example
• calculate the standard deviation and
variance of the weight of the packets of
salt after being summarised in a frequency
table
Class xi fi fixi fixi²
[40;50) 45 4 180 8100
[50;60) 55 6 330 18150
[60;70) 65 10 650 42250
[70;80) 75 4 300 22500
[80;90) 85 4 340 28900
[90;100) 95 2 190 18050
Total 30 1990 137950
$s = \sqrt{\frac{1}{30-1}\left[137950 - \frac{1}{30}(1990)^2\right]} = \sqrt{\frac{1}{29}(137950 - 132003.3333)} = 14.3198$
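A sketch of the same computation in Python, using the column totals of the table. The variance, which the example also asks for, is the quantity under the square root; the variable names are illustrative.

```python
import math

# Class midpoints x_i and frequencies f_i for the salt data.
midpoints = [45, 55, 65, 75, 85, 95]
frequencies = [4, 6, 10, 4, 4, 2]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))       # sum of f_i * x_i
sum_fx2 = sum(f * x**2 for f, x in zip(frequencies, midpoints))   # sum of f_i * x_i^2

variance = (sum_fx2 - sum_fx**2 / n) / (n - 1)
s = math.sqrt(variance)

print(sum_fx, sum_fx2, round(variance, 4), round(s, 4))
```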
Properties of measures of
dispersion
• standard deviation
– Most stable and powerful statistic of dispersion
– Often easier to work with the standard deviation
squared i.e. the variance
• mean deviation
– Accurate but mathematically difficult to handle
in further statistical analysis
• Range
– Easy and quick to compute but least stable
KEY TAKEAWAY
Relative position of data
Learning objectives
1. To learn the concept of the relative position
of an element of a data set
2. To learn the meanings and computations
of:
– the percentile rank
– the z-score
– the three quartiles
– the five-number summary and the box plot
associated to it, and how to interpret the box
plot
introduction
• When you take an exam, what is often as important
as your actual score on the exam is the way your
score compares to other students’ performance.
• If you made a 70 but the average score was 85, you
did relatively poorly
• If you made a 70 but the average score was only 55
then you did relatively well.
• In general, the significance of one observed value in a
data set strongly depends on how that value
compares to the other observed values in a data set.
• We wish to attach to each observed value a number
that measures its relative position
Percentiles and Quartiles
• Defn: Given an observed value x in a data
set, x is the Pth percentile of the data if the
percentage of the data that are less than or
equal to x is P. The number P is the
percentile rank of x.
example
• A random sample of some 10 “scores”
(GPA) are given below:
solution
• The data written in increasing order are:
1.39; 1.76; 1.90; 2.12; 2.53; 2.71; 3.00; 3.33;
3.71; 4.00
• The only data value that is less than or equal
to 1.39 is 1.39 itself. Since 1 is 1∕10 = .10 or
10% of 10, the value 1.39 is the 10th
percentile.
• Eight data values are less than or equal to
3.33. Since 8 is 8∕10 = .80 or 80% of 10, the
value 3.33 is the 80th percentile
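The counting argument in this solution can be sketched as a small Python function that applies the definition directly (the percentage of the data less than or equal to x). The name `percentile_rank` is illustrative.

```python
def percentile_rank(data, x):
    """Percentage of the observations that are less than or equal to x."""
    at_or_below = sum(1 for value in data if value <= x)
    return 100 * at_or_below / len(data)


gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]

rank_low = percentile_rank(gpas, 1.39)
rank_high = percentile_rank(gpas, 3.33)

print(rank_low, rank_high)
```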
• The Pth percentile cuts the data set in two
so that approximately P% of the data lie
below it and (100−P)% of the data lie
above it
• In particular, the three percentiles that cut
the data into fourths are called the
quartiles
Data Division by Quartiles
definitions
For any data set:
1. The second quartile Q2 of the data set is its median.
2. Define two subsets:
1. the lower set: all observations that are strictly less
than Q2;
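2. the upper set: all observations that are strictly
greater than Q2.
3. The first quartile Q1 is the median of the lower set, and the third
quartile Q3 is the median of the upper set.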
solution
• The data written in increasing order are:
1.39; 1.76; 1.90; 2.12; 2.53; 2.71; 3.00; 3.33;
3.71; 4.00
• This data set has n = 10 observations. Since
10 is an even number, the median is the
mean of the two middle observations:
x̃ = (2.53 + 2.71)/2 = 2.62. Thus the second
quartile is Q2 = 2.62
• The lower and upper subsets are:
Lower: L={1.39,1.76,1.90,2.12,2.53}
Upper: U={2.71,3.00,3.33,3.71,4.00}
• Each has an odd number of elements, so the
median of each is its middle observation.
• Thus the first quartile is Q1=1.90, the median of
L, and the third quartile is Q3=3.33, the median
of U
• Summary:
Q1 = 1.90
Q2 = 2.62
Q3 = 3.33
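The quartile procedure used in this solution can be sketched in Python as follows. Note that this is the "median of the lower and upper sets" rule from the definitions above; library routines such as `statistics.quantiles` use different interpolation rules and may return slightly different values. The function name `quartiles` is illustrative.

```python
from statistics import median


def quartiles(data):
    """Q1, Q2, Q3 using the lower-set / upper-set rule."""
    q2 = median(data)
    lower = [x for x in data if x < q2]  # strictly less than the median
    upper = [x for x in data if x > q2]  # strictly greater than the median
    return median(lower), q2, median(upper)


gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]
q1, q2, q3 = quartiles(gpas)

print(q1, round(q2, 2), q3)
```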
five-number summary
• In addition to the three quartiles, the two
extreme values, the minimum xmin and the
maximum xmax are also useful in describing
the entire data set
• Together these five numbers are called the
five-number summary of the data set:
{xmin, Q1, Q2, Q3, xmax}
• The five-number summary is used to
construct a box plot
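A minimal Python sketch of the five-number summary, built on the same lower-set / upper-set quartile rule and applied to the GPA data from the earlier example. The function name `five_number_summary` is illustrative.

```python
from statistics import median


def five_number_summary(data):
    """Return (x_min, Q1, Q2, Q3, x_max)."""
    ordered = sorted(data)
    q2 = median(ordered)
    q1 = median([x for x in ordered if x < q2])
    q3 = median([x for x in ordered if x > q2])
    return ordered[0], q1, q2, q3, ordered[-1]


gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]
summary = five_number_summary(gpas)

print(summary)
```

In a box plot, the box spans Q1 to Q3 with a line at Q2, and the whiskers extend to the minimum and maximum.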
The Box Plot
Z-scores
• Another way to locate a particular observation
x in a data set is to compute its distance from
the mean in units of standard deviation
Defn: the z-score of an observation x in a data
set is the number z given by the formula
(according to whether the data set is a sample
or the entire population):
$z = \frac{x - \bar{x}}{s}$ or $z = \frac{x - \mu}{\sigma}$
• The formulas in the definition allow us to
compute the z-score when x is known.
• If the z-score is known then x can be
recovered using the corresponding inverse
formulas
$x = \bar{x} + sz$ or $x = \mu + \sigma z$
• The z-score indicates how many standard
deviations an individual observation x is from
the center of the data set, its mean.
• If z is negative then x is below average. If z is
0 then x is equal to the average.
• If z is positive then x is above average
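The z-score and its inverse can be sketched in Python using the sample versions of the formulas. The GPA data from the percentile example is used here as illustrative input.

```python
from statistics import mean, stdev

gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]

x_bar = mean(gpas)
s = stdev(gpas)  # sample standard deviation (divisor n - 1)

# z-score of each observation: distance from the mean in units of s.
z_scores = [(x - x_bar) / s for x in gpas]

# Inverse formula: recover each observation from its z-score.
recovered = [x_bar + s * z for z in z_scores]

for x, z in zip(gpas, z_scores):
    print(f"{x:.2f} {z:+.2f}")
```

Observations below the mean get negative z-scores and those above it get positive ones, and the z-scores of a whole data set always sum to zero.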
x-Scale versus z-Score
exercise
1. Find the z scores for all the 10 observations
in the GPA sample data (given earlier), i.e.
Heights of Men
68.7 72.3 71.3 72.5 70.6 68.2 70.1 68.4 68.6 70.6
73.7 70.5 71.0 70.9 69.3 69.4 69.7 69.1 71.5 68.6
70.9 70.0 70.4 68.9 69.4 69.4 69.2 70.7 70.5 69.9
69.8 69.8 68.6 69.5 71.6 66.2 72.4 70.7 67.7 69.1
68.8 69.3 68.9 74.8 68.0 71.2 68.3 70.2 71.9 70.4
71.9 72.2 70.0 68.7 67.9 71.1 69.0 70.8 67.3 71.8
70.3 68.8 67.2 73.0 70.4 67.8 70.0 69.5 70.1 72.0
72.2 67.6 67.0 70.3 71.2 65.6 68.1 70.8 71.4 70.2
70.1 67.5 71.3 71.5 71.0 69.1 69.5 71.1 66.8 71.8
69.6 72.7 72.8 69.6 65.9 68.0 69.7 68.7 69.8 69.7
The Empirical Rule
• The mean and standard deviation of the data,
rounded to two decimal places, are
x̄ = 69.92 and s = 1.70
• The number of observations within one
standard deviation of the mean, i.e., that are
between 69.92 − 1.70 = 68.22 and
69.92 + 1.70 = 71.62 inches, is 69
• The number within two standard deviations of the
mean is 95
• All of the measurements are within three
standard deviations of the mean
Heights of Men relative frequency histogram
The Empirical Rule
If a data set has an approximately bell-shaped relative
frequency histogram, then:
1. approximately 68% of the data lie within one
standard deviation of the mean, that is, in the
interval with endpoints x̄ ± s for samples and with
endpoints μ ± σ for populations;
2. approximately 95% of the data lie within two
standard deviations of the mean, that is, in the
interval with endpoints x̄ ± 2s for samples and with
endpoints μ ± 2σ for populations; and
3. approximately 99.7% of the data lie within three
standard deviations of the mean, that is, in the
interval with endpoints x̄ ± 3s for samples and with
endpoints μ ± 3σ for populations
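One way to compare a data set with the Empirical Rule is to count the share of observations within k standard deviations of the mean. The sketch below does this for a small hypothetical sample (not data from these slides); the function name `share_within` is illustrative.

```python
from statistics import mean, stdev


def share_within(data, k):
    """Proportion of observations within k sample standard deviations of the mean."""
    x_bar = mean(data)
    s = stdev(data)
    inside = sum(1 for x in data if abs(x - x_bar) <= k * s)
    return inside / len(data)


# Hypothetical sample of ten measurements with mean 40.
sample = [38, 39, 39, 40, 40, 40, 40, 41, 41, 42]

for k in (1, 2, 3):
    print(k, share_within(sample, k))
```

With only ten observations the shares can differ noticeably from 68%, 95% and 99.7%; the rule is an approximation for bell-shaped data, and it works better for larger data sets.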
The Empirical Rule
Key points
1. the data distribution must be approximately bell-shaped
– The Empirical Rule does not apply to data sets with severely
asymmetric distributions
Key points
1. Chebyshev’s Theorem applies to any data
set
2. The theorem gives the minimum proportion of
the data which must lie within a given number of
standard deviations of the mean
– the true proportions found within the indicated
regions could be greater than what the theorem
guarantees
Section Summary
1. The Empirical Rule is an approximation that
applies only to data sets with a bell-shaped
relative frequency histogram
– It estimates the proportion of the measurements
that lie within one, two, and three standard
deviations of the mean
2. Chebyshev’s Theorem is a fact that applies to
all possible data sets.
– It describes the minimum proportion of the
measurements that must lie within one, two, or
more standard deviations of the mean
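The statement of Chebyshev’s Theorem is not reproduced in this section. Its standard form, assumed here, says that for any k > 1 at least 1 − 1/k² of the data lie within k standard deviations of the mean. A minimal sketch (the function name is illustrative):

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of any data set within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound 1 - 1/k^2 is zero or negative, so it says nothing.
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


for k in (2, 3):
    print(k, round(chebyshev_min_proportion(k), 4))
```

Compare the result for k = 2 (at least 75%) with the Empirical Rule's approximately 95%: Chebyshev's bound is weaker because it must hold for every possible data set, not only bell-shaped ones.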
exercises
1. A sample data set with a bell-shaped distribution
has mean x̄ = 6 and standard deviation s = 2. Find
the approximate proportion of observations in the
data set that lie:
a. between 4 and 8;
b. between 2 and 10;
c. between 0 and 12
2. A population data set has mean μ = 2 and standard
deviation σ = 1.1. Find the minimum proportion of
observations in the data set that must lie:
a. between −0.2 and 4.2;
b. between −1.3 and 5.3