Statistics for Meteorology

HMCS106
Introduction

The Grand Picture of Statistics

• Statistics is a study of data: describing properties of data (descriptive statistics), and drawing conclusions about a population based on information in a sample (inferential statistics).
The Grand Picture of Statistics

• The distinction between a population together with its parameters and a sample together with its statistics is a fundamental concept in inferential statistics
The Grand Picture of Statistics

• Information in a sample is used to make inferences about the population from which the sample was drawn
• Defn: infer - deduce or conclude (something) from evidence and reasoning rather than from explicit statements
The Grand Picture of Statistics
Definitions

• Statistics - a collection of methods for collecting, displaying, analysing, and drawing conclusions from data.
Definitions

• Descriptive statistics is the branch of statistics that involves organizing, displaying, and describing data.
• Inferential statistics - the branch of statistics that involves drawing conclusions about a population based on information contained in a sample taken from that population.
Definitions
• Qualitative data are measurements for which there is no
natural numerical scale, but which consist of attributes, labels,
or other non-numerical characteristics

• Quantitative data are numerical measurements that arise from a natural numerical scale
Descriptive Statistics

The aim is to summarize a collection of data
Descriptive Statistics

• The starting point for dealing with a collection of data is to organize, display, and summarize it effectively
– These are the objectives of descriptive statistics
• Common techniques employed in descriptive statistics are:
– Graphical description – graphs
– Tabular description – tables
– Parametric description – in which we estimate the values of certain parameters that are assumed to complete the description of the data set
Frequency distributions
• Generally, techniques for grouping data in
tabular and graphical frequency
distributions are applied to observations
for which (time) order is not important
• When (time) order of observations cannot
be ignored, the observations are treated
as time series (to be covered later)
Graphical representation
• Graphical output can convey information more widely and quickly than numerical tables and written reports
• The most common kinds of graphical illustrations
include:
– Stem and leaf (stemplot) diagram
– The line graph,
– The histogram,
– Area diagrams, etc
Stem and Leaf Diagrams
• Suppose 30 students in a statistics class
took a test and made the following scores:

86 80 25 77 73 76 100 90 69 93
90 83 70 73 73 70 90 83 71 95
40 58 68 69 100 78 87 97 92 74

• How did the class do on the test?


• A quick glance at the set of 30 numbers
does not immediately give a clear answer
• the data set may be reorganized and
rewritten to make relevant information
more visible
• One way to do so is to construct a stem
and leaf diagram
• The numbers in the “tens” place, from 2
through 9, and additionally the number 10,
are the “stems,” and are arranged in
numerical order from top to bottom to the
left of a vertical line
• The number in the units place in each
measurement is a “leaf,” and is placed in a
row to the right of the corresponding stem
• The display is made even more useful for
some purposes by rearranging the leaves in
numerical order
• Either way, with the data reorganized certain
information of interest becomes apparent
immediately
– There are two perfect scores;
– three students made scores under 60;
– most students scored in the 70s, 80s and
90s;
– and the overall average is probably in the
high 70s or low 80s
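The stem-and-leaf display described above can be generated in a few lines of code. The following is a minimal sketch assuming the 30 test scores are typed in as a plain Python list; the helper name make_stem_and_leaf is ours, not part of any library.

# Minimal sketch: stem-and-leaf display for the 30 test scores.
# Stems are the "tens" values (2..10); leaves are the units digits.
scores = [86, 80, 25, 77, 73, 76, 100, 90, 69, 93,
          90, 83, 70, 73, 73, 70, 90, 83, 71, 95,
          40, 58, 68, 69, 100, 78, 87, 97, 92, 74]

def make_stem_and_leaf(data):
    stems = {}
    for x in sorted(data):
        stem, leaf = divmod(x, 10)          # e.g. 86 -> stem 8, leaf 6
        stems.setdefault(stem, []).append(leaf)
    for stem in range(min(stems), max(stems) + 1):   # include empty stems such as 3
        leaves = "".join(str(l) for l in stems.get(stem, []))
        print(f"{stem:>3} | {leaves}")

make_stem_and_leaf(scores)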
Frequency Histograms
• The stem and leaf diagram is not practical
for large data sets, so we need a different,
purely graphical way to represent data
• A frequency histogram is such a device
• For the 30 scores on the exam, it is natural
to group the scores on the standard ten-
point scale, and count the number of
scores in each group
Thus there are two 100s, seven scores in the 90s,
five in the 80s, and so on

Scale   Frequency
10      0
20      1
30      0
40      1
50      1
60      3
70      10
80      5
90      7
100     2
• We then construct the histogram by drawing for
each group, or class, a vertical bar whose
length is the number of observations in that
group.
• In our example, the bar labelled 100 is 2 units
long, the bar labelled 90 is 7 units long, and so
on
• the individual data values are lost, but we know
the number in each class
• This number is called the frequency of the
class, hence the name frequency histogram
Frequency Histogram
Constructing a frequency table
• the weight (to the nearest gram) of 30
packets of salt, all marked 65g
68 52 49 56 68
74 41 59 79 81
42 57 60 88 87
47 68 55 68 65
50 78 61 90 85
65 66 72 63 95
• Step 1: Determine the data range

  r = xmax − xmin = 95 − 41 = 54

• Step 2: Decide on the number of classes (intervals).
– Sturges' rule: k = 1 + 3.3 log10(n)
  k = 1 + 3.3 log10(30) = 1 + 3.3(1.4771) = 5.8744
– therefore 6 classes or intervals (always round up)
• Step 3: Determine the class width - a convenient width, e.g. 5 or 10

  c = Range / k = 54 / 6 = 9

– Hence 10 is a convenient class width in this case
• Step 4: Determine the class limits (class boundaries) so that there is no doubt about the class in which an observation should fall
– the first class would be [40;50), the second one [50;60), etc.
– the round bracket ")" excludes the value, while the square bracket "[" includes the value concerned
• Q: to which class does an observation of 50 belong?
• Step 5: Tabulate the data values
– A tally table is used to assist with the assignment of
each data value (observation) to one and only one
class interval

Weight (in gram) Tally Frequency


[40;50) |||| 4
[50;60) |||| | 6
[60;70) |||| |||| 10
[70;80) |||| 4
[80;90) |||| 4
[90;100) || 2
Total n = 30
Frequency table
• Notation:
– fi = frequency of the i-th class
– xi = midpoint of the i-th class
– Fi = cumulative frequency of the i-th class
– n = total number of observations (sum of the frequencies)
Class midpoint: xi = 0.5 × (lower limit + upper limit)
Relative frequency of the i-th class: fi / n

The frequency table


Class      xi   fi   fi/n     Fi   Fi/n
[40;50)    45   4    0.1333   4    0.1333
[50;60)    55   6    0.2000   10   0.3333
[60;70)    65   10   0.3333   20   0.6667
[70;80)    75   4    0.1333   24   0.8000
[80;90)    85   4    0.1333   28   0.9333
[90;100)   95   2    0.0667   30   1.0000
Total           30   1.0000
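The whole construction above (range, Sturges' rule, class width, class frequencies, relative and cumulative frequencies) can be checked numerically. This is a hedged sketch using numpy; the variable names and the choice of numpy.histogram are ours, and the bin edges 40, 50, ..., 100 are taken from the worked example.

import numpy as np

# Weights (in grams) of the 30 packets of salt from the example.
weights = [68, 52, 49, 56, 68, 74, 41, 59, 79, 81,
           42, 57, 60, 88, 87, 47, 68, 55, 68, 65,
           50, 78, 61, 90, 85, 65, 66, 72, 63, 95]

n = len(weights)
r = max(weights) - min(weights)               # Step 1: range = 95 - 41 = 54
k = int(np.ceil(1 + 3.3 * np.log10(n)))       # Step 2: Sturges' rule -> 6 classes
c = 10                                        # Step 3: convenient class width

edges = np.arange(40, 40 + c * (k + 1), c)    # Step 4: limits 40, 50, ..., 100
freq, _ = np.histogram(weights, bins=edges)   # Step 5: class frequencies fi

rel_freq = freq / n                           # fi / n
cum_freq = np.cumsum(freq)                    # Fi
for i in range(k):
    print(f"[{edges[i]};{edges[i+1]})  f={freq[i]:2d}  "
          f"f/n={rel_freq[i]:.4f}  F={cum_freq[i]:2d}")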
Comparison with raw data
• the weight (to the nearest gram) of 30
packets of salt, all marked 65g
68 52 49 56 68
74 41 59 79 81
42 57 60 88 87
47 68 55 68 65
50 78 61 90 85
65 66 72 63 95
Graphical illustrations of continuous data
Histogram of the weight of 30 packets of salt

[Figure: histogram — x-axis: Weight in gram (30 to 110); y-axis: Number of packets (f)]
Cumulative frequency polygon of the weight of 30 packets of salt
[Figure: cumulative frequency polygon — x-axis: Weight of packet (30 to 100); y-axis: Cumulative frequency (F), 0 to 30]
Scatterplots

A scatterplot (or scatter diagram) is a plot of paired (x, y) data with a horizontal x-axis and a vertical y-axis
Discrete and Continuous data

• Discrete data: observations can take on only specific values – often integer (whole number) values
• E.g.
– The number of students in a class
– The outcome when a die is tossed
• Continuous data: observations can take on any possible value in an interval (class)
• E.g.
– time taken to travel to university daily
– height of a person
Graphical illustrations of discrete data

• A line diagram or a bar chart is used to illustrate (represent) discrete data
• Example

Errors per day    0     1    2    3    4    5    6
Number of days    143   90   42   12   9    3    1
Line diagram of the number of typing errors
[Figure: line diagram — x-axis: Number of errors (0 to 6); y-axis: Number of days (0 to 160)]
Bar chart of the number of typing errors
[Figure: bar chart — x-axis: Number of errors (0 to 6); y-axis: Number of days (0 to 160)]
Graphical illustrations of continuous data

• A histogram is used to illustrate (represent) continuous data
• Example
– the weight of 30 packets of salt, all marked 65g
Graphical illustrations of continuous data
Histogram of the weight of 30 packets of salt

[Figure: histogram — x-axis: Weight in gram (30 to 110); y-axis: Number of packets (f)]
Relative Frequency Histograms

• In our example of the exam scores in a statistics class, five students scored in the 80s
• Since there are 30 students in the entire
statistics class, the proportion who scored in
the 80s is 5/30
• The number 5/30, which could also be
expressed as 0.1667, or as 16.67%, is the
relative frequency of the group labelled “80s”
• We can thus construct a diagram by drawing
for each group, or class, a vertical bar whose
length is the relative frequency of that group
• The diagram is a relative frequency
histogram for the data
• A key point is that, if each vertical bar has
width 1 unit, then the total area of all the bars
is 1 or 100%
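As a quick illustration of the relative frequency histogram described above, the sketch below plots one with matplotlib for the 30 exam scores used earlier. The normalisation via the weights argument and the figure styling are our own choices, not prescribed by the lecture.

import numpy as np
import matplotlib.pyplot as plt

scores = [86, 80, 25, 77, 73, 76, 100, 90, 69, 93,
          90, 83, 70, 73, 73, 70, 90, 83, 71, 95,
          40, 58, 68, 69, 100, 78, 87, 97, 92, 74]

edges = np.arange(20, 111, 10)                  # classes: 20s, 30s, ..., 100s
w = np.ones(len(scores)) / len(scores)          # each score contributes 1/n

plt.hist(scores, bins=edges, weights=w, edgecolor="black")
plt.xlabel("Score")
plt.ylabel("Relative frequency")
plt.title("Relative frequency histogram of the 30 exam scores")
plt.show()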
Relative Frequency Histogram
• Although the histograms have the same
appearance, the relative frequency
histogram is more important for us
• The relative frequency histogram is
important because the labelling on the
vertical axis reflects what is important
visually: the relative sizes of the bars
Sample Size and Relative Frequency Histograms
A Very Fine Relative Frequency Histogram
KEY TAKEAWAYS
• Graphical representations of large data sets
provide a quick overview of the nature of the data
• A population or a very large data set may be
represented by a smooth curve. This curve is a
very fine relative frequency histogram in which the
exceedingly narrow vertical bars have been omitted
• When a curve derived from a relative frequency
histogram is used to describe a data set, the
proportion of data with values between two
numbers a and b is the area under the curve
between a and b
Measures of Central Location
• Learning objectives
– To learn the concept of the “centre” of a data
set
– To learn the meaning of some measures of
the centre of a data set

49
Measures of Central Location
• Any kind of “average” is meant to be an answer
to the question “Where do the data centre?”
• It is therefore a measure of the central location
of a data set
• 3 main measures of central location
– arithmetic mean,
– mode, and
– median (also called the 2nd quartile or the 50th
percentile).
50
• the nature of the data set, as indicated by
a relative frequency histogram, will
determine what constitutes a good answer

• Different shapes of the histogram call for different measures of central location
The mean
• The first measure of central location is the
usual “average”, also known as the
arithmetic mean or simply the “mean”
• Defn.: the sample mean of a set of sample data is the number x̄ defined by the formula below
• Σ (sigma) is the standard summation notation
For ungrouped data
x̄ = (sum of all observations) / (number of observations)
   = (x1 + x2 + x3 + ... + xn) / n
   = (1/n) Σ xi      (sum over i = 1 to n)
• Given: 8.10 ; 7.10 ; 6.65 ; 8.60 ; 6.20 ; 6.55 ; 7.75
• Calculate the arithmetic mean.

x̄ = (1/7) Σ xi
   = (1/7)(8.10 + 7.10 + 6.65 + 8.60 + 6.20 + 6.55 + 7.75)
   = (1/7)(50.95)
   = 7.2786
Grouped data
• The class midpoint is taken as the “representative” value of the class (interval)
• Arithmetic mean:

  x̄ = (1/n) Σ fi xi      (sum over i = 1 to k, where n = Σ fi)

• k = number of classes
• n = number of observations
• xi = class midpoint
• fi = frequency of the i-th class
Consider the frequency table and calculate the
arithmetic mean

Class fi
[40;50) 4
[50;60) 6
[60;70) 10
[70;80) 4
[80;90) 4
[90;100) 2
Total 30

56
i. Calculate the midpoint of each class:
   e.g. x1 = ½(40 + 50) = 45 ; x2 = ½(50 + 60) = 55
ii. Calculate fi xi:
   e.g. f1 x1 = 4 × 45 = 180 ; f2 x2 = 6 × 55 = 330
iii. Write the values obtained in (i) and (ii) in the frequency table
iv. Calculate Σ fi xi (sum over i = 1 to 6) and divide by n.
Class fi xi fixi
[40;50) 4 45 180
[50;60) 6 55 330
[60;70) 10 65 650
[70;80) 4 75 300
[80;90) 4 85 340
[90;100) 2 95 190

Total 30 1990

x̄ = (1/30) Σ fi xi = (1/30)(1990) = 66.3333
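The grouped-data mean can be verified in a few lines; this is a sketch using the class midpoints and frequencies from the table above (the variable names are ours).

# Grouped-data arithmetic mean: x_bar = (1/n) * sum(fi * xi)
midpoints   = [45, 55, 65, 75, 85, 95]    # xi
frequencies = [4, 6, 10, 4, 4, 2]         # fi

n = sum(frequencies)                                           # 30
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
print(n, mean)                                                 # 30 66.333...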
• The population mean of a set of N population data values is the number μ defined by the formula:

  μ = (1/N) Σ xi      (sum over all N values)
Median
• Suppose we are interested in the average yearly
income of employees at a large corporation
• We take a random sample of seven employees,
obtaining the sample data (rounded to the nearest
hundred dollars, and expressed in thousands of
dollars)
24.8 ;22.8; 24.6; 192.5; 25.2; 18.5; 23.7
• The mean (rounded to one decimal place) is x̄ = 47.4
• Hence “the average income of employees at this
corporation is $47,400” - surely misleading!
60
Median
• The mean is approximately twice what six of
the seven employees in the sample make
• What went wrong?
• the presence of the one executive in the
sample
• The number 192.5 in our data set is called an
outlier
• Many times an outlier is the result of some sort of error, but not always, as is the case here
• We would get a better measure of the
“centre” of the data if we were to arrange the
data in numerical order,
• 18.5 22.8 23.7 24.6 24.8 25.2 192.5
• then select the middle number in the list, in
this case 24.6
• This is the median characterised by that,
roughly half of the measurements are larger
than it is, and roughly half are smaller
62
Defn.
• The sample median x̃ of a set of sample
data for which there are an odd number of
measurements is the middle measurement
when the data are arranged in numerical
order.
• The sample median x̃ of a set of sample
data for which there are an even number of
measurements is the mean of the two middle
measurements when the data are arranged in
numerical order.
63
The Median

64
Skewness of Relative Frequency Histograms

65
The relationship between the mean and the
median for several common shapes of
distributions
• When the distribution is symmetric, as in panels (a) and (b) of the figure, the mean and the median are equal
• When the distribution is as shown in panel (c) of the figure, it is said to be skewed right (positive skew, or right tailed)
– The mean has been pulled to the right of the median by the long “right tail” of the distribution, i.e. the few relatively large data values
• When the distribution is as shown in panel (d) of the figure, it is said to be skewed left (negative skew, or left tailed)
– The mean has been pulled to the left of the median by the long “left tail” of the distribution
Median is in the Middle
67
Definition

• Median – the middle number in a set of ordered numbers.

1, 3, 7, 10, 13
Median = 7
68
Grouped data
• The median can be determined graphically or by using a formula
• Graphically (on the cumulative frequency polygon):
  i. find the value of the (n + 1)/2 th observation
  ii. mark this value on the vertical axis and project horizontally to the polygon
  iii. project it down to the horizontal axis and read off the value of the median
Cumulative frequency polygon of the weight of 30 packets of salt
[Figure: cumulative frequency polygon — x-axis: Weight of packet (30 to 100), with the median Me marked between 60 and 70; y-axis: Cumulative frequency (F), 0 to 30]
• Formula
• The median interval is the first interval for which the cumulative frequency is (n + 1)/2 or more

  Me = L + c × ( (n/2 − Fme−1) / fme )

• L = lower limit of the median interval,
• n/2 = order number of the median,
• Fme−1 = cumulative frequency of the interval before the median interval,
• fme = frequency of the median interval, and
• c = class width.
Consider the frequency table of the “salt” problem and
calculate the median

Class fi Fi
[40;50) 4 4
[50;60) 6 10
[60;70) 10 20
[70;80) 4 24
[80;90) 4 28
[90;100) 2 30
Total 30

72
• Order number = position of Me = n/2 = 30/2 = 15
• Median interval = [60;70)
• L = 60 ; n/2 = 15 ; c = 10 ; Fme−1 = 10 and fme = 10.
• Therefore:

  Me = L + c × ( (n/2 − Fme−1) / fme )
     = 60 + 10 × ( (15 − 10) / 10 )
     = 65
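The interpolation formula above can be written as a small helper. This is a sketch assuming the class lower limits, frequencies and class width of the salt example; the function name grouped_median is ours.

def grouped_median(lower_limits, freqs, width):
    """Median of grouped data via Me = L + c*((n/2 - F_prev) / f_me)."""
    n = sum(freqs)
    cum = 0
    for L, f in zip(lower_limits, freqs):
        if cum + f >= (n + 1) / 2:        # first interval reaching (n + 1)/2
            return L + width * ((n / 2 - cum) / f)
        cum += f

lowers = [40, 50, 60, 70, 80, 90]
freqs  = [4, 6, 10, 4, 4, 2]
print(grouped_median(lowers, freqs, 10))   # 65.0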
Mode
• Consider the statement: “The average number
of mobile phones owned by Harare residents
is 1.37,”
• the thought of a fraction of a mobile phone can
be baffling
• In such a context the following measure for
central location might make more sense
• Defn: The sample mode of a set of sample
data is the most frequently occurring value
74
On a relative frequency histogram, the highest point of the histogram corresponds to the mode of the data set; the accompanying figure illustrates the mode
Definition

Mode is the most Popular
76
Definition
• Mode – the number that appears most frequently in a set of numbers.
1, 1, 3, 7, 10, 13
Mode = 1
77
Grouped data
• Graphically
Histogram of the weight of 30 packets of salt
[Figure: histogram — x-axis: Weight in gram (30 to 110), with the mode Mo marked between 60 and 70; y-axis: Number of packets (f)]
Grouped data
• Using a formula:

  Mo = L + c × ( (fm − fm−1) / (2 fm − fm−1 − fm+1) )

• L = lower limit of the modal interval,
• c = class width (upper limit − lower limit),
• fm = frequency of the modal interval,
• fm−1 = frequency of the interval preceding the modal interval,
• fm+1 = frequency of the interval following the modal interval.
• Consider the “salt” problem.
• The modal interval (class with the highest
frequency) is the 3rd interval, i.e. [60;70).
• Thus L = 60 ; c = 70 − 60 = 10 ; fm = 10 ; fm−1 = 6 and fm+1 = 4.

  Mo = L + c × ( (fm − fm−1) / (2 fm − fm−1 − fm+1) )
     = 60 + 10 × ( (10 − 6) / (2(10) − 6 − 4) )
     = 60 + 10 × (4/10)
     = 64
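A corresponding sketch for the mode formula, using the same salt-problem figures; grouped_mode is our own helper name, and the handling of a modal class at either end of the table is our assumption.

def grouped_mode(lower_limits, freqs, width):
    """Mode of grouped data via Mo = L + c*(fm - fm_prev)/(2*fm - fm_prev - fm_next)."""
    m = freqs.index(max(freqs))                        # modal interval index
    fm = freqs[m]
    fm_prev = freqs[m - 1] if m > 0 else 0
    fm_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    return lower_limits[m] + width * (fm - fm_prev) / (2 * fm - fm_prev - fm_next)

lowers = [40, 50, 60, 70, 80, 90]
freqs  = [4, 6, 10, 4, 4, 2]
print(grouped_mode(lowers, freqs, 10))   # 64.0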
Comparing the mean (x̄), mode (Mo) and median (Me)
• Choice depends on the shape of the frequency distribution
• 3 possible shapes:
  i. symmetric
     – all three measures have identical values
  ii. left skewed
     – the median or mode is more representative than the mean
  iii. right skewed
     – the median may be the best measure of central location, as it is least affected by extreme values
Other measures of central location

• Geometric mean: GM = n x1  x2  x3    xn
n
• Harmonic mean: HM = 1
+ 1 ++ 1
x1 x2 xn
• Both formulae for ungrouped data
• Trimmed mean or trimean or truncated
mean

82
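As a hedged illustration of the two formulas above, the snippet below computes the geometric and harmonic means of the seven scores used earlier and compares them with the standard-library routines statistics.geometric_mean and statistics.harmonic_mean (available in recent Python versions).

import math
import statistics

x = [8.10, 7.10, 6.65, 8.60, 6.20, 6.55, 7.75]

gm = math.prod(x) ** (1 / len(x))          # n-th root of the product
hm = len(x) / sum(1 / v for v in x)        # n divided by the sum of reciprocals

print(gm, statistics.geometric_mean(x))    # the two values should agree
print(hm, statistics.harmonic_mean(x))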
Summary
• The mean, the median, and the mode
each answer the question “Where is the
centre of the data set?”
• The nature of the data set, as indicated by
a relative frequency histogram, determines
which one gives the best answer.
• There are other “special” measures of
central location
83
Measures of Dispersion/Variability

Learning objectives

• To learn the concept of the variability of a data set
• To learn how to compute some measures of the variability of a data set: the range, the mean deviation, the variance, and the standard deviation
Dispersion: the extent to which the observations are scattered about the central value

[Figure: two density curves centred on the same value — equal means, but different variances]
• most common measures of dispersion are:
– Range
– Mean deviation
– Standard deviation
– Variance

87
• Look at the two data sets (Data Set I and Data Set II, each of ten measurements) and their graphical representations in charts called “dot plots”
• The two sets of ten measurements each
centre at the same value: they both have
mean, median, and mode 40
• Yet a glance at the charts shows that they are
markedly different
• In Data Set I the measurements vary only
slightly from the centre, while for Data Set II
the measurements vary greatly
90
• Just as we have attached numbers to a
data set to locate its centre, we now
associate to each data set numbers that
measure quantitatively how the data either
scatter away from the centre or cluster
close to it.
• These new quantities are called measures
of variability or dispersion

91
Range
• The range of a data set is the number R
defined by the formula
R= xmax − xmin
where xmax is the largest measurement in the
data set and xmin is the smallest
• For Data Set I the maximum is 43 and the
minimum is 38, so the range is R=43−38=5.
• For Data Set II the maximum is 47 and the
minimum is 33, so the range is R=47−33=14
92
• The range is a measure of variability
because it indicates the size of the interval
over which the data points are distributed
• A smaller range indicates less variability
(less dispersion) among the data, whereas
a larger range indicates the opposite

93
Mean deviation (MD)
• The sum of the deviations around the arithmetic mean equals zero, hence it is not useful as a measure of dispersion
• The mean deviation is defined by the following equation:

  MD = (1/n) Σ |xi − x̄|      (sum over i = 1 to n)

  where x̄ is the mean, xi is the i-th value and n is the sample size
example
• Consider these scores and calculate the MD:
8.10 ; 7.10 ; 6.65 ; 8.60 ; 6.20 ; 6.55 ; 7.75

x = (50.95) = 7.2786
1
7
1 n
MD =  xi − x
n i =1

= ( 8.10 − 7.2786 + 7.10 − 7.2786 +  + 7.75 − 7.2786 )


1
7
= (0.8214 + 0.1786 +  + 0.4714 )
1
7
= 0.7469 95
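The mean-deviation calculation above is easy to verify in code; a minimal sketch using the same seven values.

x = [8.10, 7.10, 6.65, 8.60, 6.20, 6.55, 7.75]

mean = sum(x) / len(x)                          # 7.2786 to 4 d.p.
md = sum(abs(v - mean) for v in x) / len(x)     # mean absolute deviation
print(round(mean, 4), round(md, 4))             # 7.2786 0.7469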
Variance and Standard Deviation
• These two measures of variability depend
on whether the data set is just a sample
drawn from a much larger population or is
the whole population itself (that is, a
census)

96
Sample variance
Defn:
• The sample variance s² of a set of n sample data values is the number defined by the formula:

  s² = (1/(n − 1)) Σ (xi − x̄)²

• This is equivalent to:

  s² = (1/(n − 1)) [ Σ xi² − (Σ xi)² / n ]
Sample standard deviation
Defn:
• The sample standard deviation of a set of n sample data values is the square root of the sample variance, hence it is the number s defined by the formula:

  s = √s² = √[ (1/(n − 1)) Σ (xi − x̄)² ]

• Although the first formula in each case looks less complicated than the second, the latter is easier to use in hand computations, and is called a shortcut formula
example
• Calculate the sample variance and sample
sd for data sets 1 and 2 (given before)

99
soln
Data set 2

100
homework
Data set 1
• Show that the sample variance and standard deviation of Data Set I are the much smaller numbers s² = 20/9 ≈ 2.22 and s = √(20/9) ≈ 1.49

• Note that the variance, standard deviation and mean deviation are always non-negative
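Sample variance and standard deviation can be checked with the statistics module. The sketch below uses a small illustrative sample of our own (the individual values of Data Sets I and II appear only in the original figure and are not reproduced here); it shows that the definition and the shortcut formula agree with each other and with the library routines.

import statistics

sample = [38, 39, 40, 40, 41, 43]            # illustrative sample only

n = len(sample)
mean = sum(sample) / n
var_defn  = sum((v - mean) ** 2 for v in sample) / (n - 1)                 # definition
var_short = (sum(v * v for v in sample) - sum(sample) ** 2 / n) / (n - 1)  # shortcut

print(var_defn, var_short)                   # the two forms agree
print(statistics.variance(sample), statistics.stdev(sample))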
Population parameters
• If the data set comprises the whole population, then the population standard deviation, denoted σ (the lower-case Greek letter sigma), and its square, the population variance σ², are defined as follows:

  σ² = (1/N) Σ (xi − μ)²   and   σ = √σ²

  where N is the population size and μ is the population mean
• Since most data sets are samples, we will
always work with the sample standard
deviation and variance
• In most applications we deal with comparing
the means and standard deviations of two data
sets
• The figures below illustrate how a difference in
one or both of the sample mean and the
sample standard deviation are reflected in the
appearance of the data set as shown by the
curves derived from the relative frequency
histograms built using the data
Grouped data
Range

r = upper limit (highest class) − lower limit (lowest class)

Example:
• Consider the frequency table of the “salt”
problem and calculate the range

106
Class      xi   fi   fi/n     Fi   Fi/n
[40;50)    45   4    0.1333   4    0.1333
[50;60)    55   6    0.2000   10   0.3333
[60;70)    65   10   0.3333   20   0.6667
[70;80)    75   4    0.1333   24   0.8000
[80;90)    85   4    0.1333   28   0.9333
[90;100)   95   2    0.0667   30   1.0000
Total           30   1.0000

r = 100 − 40 = 60
Mean deviation (MD)
  MD = (1/n) Σ fi |xi − x̄|      (sum over i = 1 to k)

where
• n = number of observations,
• k = number of classes,
• fi = frequency of class i,
• xi = midpoint of class i,
• x̄ = arithmetic mean
Example
• Use the frequency table in the previous
example and calculate the mean deviation

109
x̄ = 66.3333

Class      xi   fi   |xi − x̄|   fi |xi − x̄|
[40;50)    45   4    21.3333    85.3332
[50;60)    55   6    11.3333    67.9998
[60;70)    65   10   1.3333     13.3333
[70;80)    75   4    8.6667     34.6668
[80;90)    85   4    18.6667    74.6668
[90;100)   95   2    28.6667    57.3334
Total           30              333.3333

MD = (1/30)(333.3333) = 11.1111
Standard deviation (s)

  s = √[ (1/(n − 1)) ( Σ fi xi² − (1/n)(Σ fi xi)² ) ]      (sums over i = 1 to k)
Example
• calculate the standard deviation and
variance of the weight of the packets of
salt after being summarised in a frequency
table

112
Class xi fi f i xi f i xi2
[40;50) 45 4 180 8100
[50;60) 55 6 330 18150
[60;70) 65 10 650 42250
[70;80) 75 4 300 22500
[80;90) 85 4 340 28900
[90;100) 95 2 190 18050
Total 30 1990 137950

113
1  2
137950 − (1990 ) 
1
s=
30 − 1  30 
=
1
(137950 − 132003.3333)
29
= 14.3198

• And the variance is s2 = 205.0575.

114
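The grouped-data shortcut above translates directly into code; a sketch using the fi and xi columns from the table (variable names are ours).

import math

midpoints   = [45, 55, 65, 75, 85, 95]    # xi
frequencies = [4, 6, 10, 4, 4, 2]         # fi

n = sum(frequencies)
sum_fx  = sum(f * x for f, x in zip(frequencies, midpoints))        # 1990
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))    # 137950

variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
print(variance, math.sqrt(variance))      # approximately 205.06 and 14.32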
Properties of measures of
dispersion
• standard deviation
– Most stable and powerful statistic of dispersion
– Often easier to work with the standard deviation
squared i.e. the variance
• mean deviation
– Accurate but mathematically difficult to handle
in further statistical analysis
• Range
– Easy and quick to compute but least stable
115
KEY TAKEAWAY

• The range, the mean deviation, the standard deviation, and the variance each give a quantitative answer to the question “How variable are the data?”

116
Relative position of data

117
Learning objectives
1. To learn the concept of the relative position
of an element of a data set
2. To learn the meanings and computations
of:
– the percentile rank
– the z-score
– the three quartiles
– the five-number summary and the box plot
associated to it, and how to interpret the box
plot
118
introduction
• When you take an exam, what is often as important
as your actual score on the exam is the way your
score compares to other students’ performance.
• If you made a 70 but the average score was 85, you
did relatively poorly
• If you made a 70 but the average score was only 55
then you did relatively well.
• In general, the significance of one observed value in a
data set strongly depends on how that value
compares to the other observed values in a data set.
• We wish to attach to each observed value a number
that measures its relative position
119
Percentiles and Quartiles
• Defn: Given an observed value x in a data
set, x is the Pth percentile of the data if the
percentage of the data that are less than or
equal to x is P. The number P is the
percentile rank of x.

120
example
• A random sample of 10 “scores” (GPAs) consists of the following values: 1.39; 1.76; 1.90; 2.12; 2.53; 2.71; 3.00; 3.33; 3.71; 4.00

• What percentile is the value 1.39 in this data set?
• What percentile is the value 3.33?

121
solution
• The data written in increasing order are:
1.39; 1.76; 1.90; 2.12; 2.53; 2.71; 3.00; 3.33;
3.71; 4.00
• The only data value that is less than or equal
to 1.39 is 1.39 itself. Since 1 is 1∕10 = .10 or
10% of 10, the value 1.39 is the 10th
percentile.
• Eight data values are less than or equal to
3.33. Since 8 is 8∕10 = .80 or 80% of 10, the
value 3.33 is the 80th percentile
122
• The Pth percentile cuts the data set in two
so that approximately P% of the data lie
below it and (100−P)% of the data lie
above it
• In particular, the three percentiles that cut
the data into fourths are called the
quartiles

123
Data Division by Quartiles

124
definitions
For any data set:
1. The second quartile Q2 of the data set is its median.
2. Define two subsets:
1. the lower set: all observations that are strictly less
than Q2;
2. the upper set: all observations that are strictly
greater than Q2.

3. The first quartile Q1 of the data set is the median of


the lower set.

4. The third quartile Q3 of the data set is the median of


the upper set
125
example
• Find the quartiles of the data set of GPAs in the earlier example, i.e. 1.39; 1.76; 1.90; 2.12; 2.53; 2.71; 3.00; 3.33; 3.71; 4.00

126
solution
• The data written in increasing order are:
1.39; 1.76; 1.90; 2.12; 2.53; 2.71; 3.00; 3.33;
3.71; 4.00
• This data set has n = 10 observations. Since
10 is an even number, the median is the
mean of the two middle observations:
x˜=(2.53 + 2.71)/2=2.62. Thus the second
quartile is Q2=2.62
127
• The lower and upper subsets are:
Lower: L={1.39,1.76,1.90,2.12,2.53}
Upper: U={2.71,3.00,3.33,3.71,4.00}
• Each has an odd number of elements, so the
median of each is its middle observation.
• Thus the first quartile is Q1=1.90, the median of
L, and the third quartile is Q3=3.33, the median
of U
• Summary:
Q1 = 1.90
Q2 = 2.62
Q3 = 3.33
128
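The quartile procedure defined above (the median of the lower and upper halves, excluding observations equal to Q2) can be coded directly. Note that library routines such as numpy.percentile use slightly different interpolation rules, so the sketch below implements the lecture's own definition; quartiles_by_halves is our own helper name.

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def quartiles_by_halves(values):
    s = sorted(values)
    q2 = median(s)
    lower = [v for v in s if v < q2]      # strictly less than Q2
    upper = [v for v in s if v > q2]      # strictly greater than Q2
    return median(lower), q2, median(upper)

gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]
print(quartiles_by_halves(gpas))          # (1.9, 2.62, 3.33)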
five-number summary
• In addition to the three quartiles, the two
extreme values, the minimum xmin and the
maximum xmax are also useful in describing
the entire data set
• Together these five numbers are called the
five-number summary of the data set:
{xmin, Q1, Q2, Q3, xmax}
• The five-number summary is used to
construct a box plot
129
The Box Plot

• Note that the distance from Q1 to Q3 is the length of the interval over which the middle half of the data range. Thus it has the following special name.
Defn: The interquartile range (IQR) is the quantity IQR = Q3 − Q1
Schematic plot

131
Z-scores
• Another way to locate a particular observation
x in a data set is to compute its distance from
the mean in units of standard deviation
Defn: the z-score of an observation x in a data set is the number z given by the formula (according to whether the data set is a sample or the entire population):

  z = (x − x̄) / s      or      z = (x − μ) / σ
132
• The formulas in the definition allow us to
compute the z-score when x is known.
• If the z-score is known then x can be
recovered using the corresponding inverse
formulas
x = x̄ + sz      or      x = μ + σz
• The z-score indicates how many standard
deviations an individual observation x is from
the center of the data set, its mean.
• If z is negative then x is below average. If z is
0 then x is equal to the average.
• If z is positive then x is above average
133
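A hedged sketch of the z-score computation and its inverse, reusing the seven scores from the earlier mean example rather than the GPA data (so as not to pre-empt the exercise below); statistics.stdev gives the sample standard deviation s.

import statistics

x = [8.10, 7.10, 6.65, 8.60, 6.20, 6.55, 7.75]   # scores used in earlier examples

mean = statistics.mean(x)
s = statistics.stdev(x)                  # sample standard deviation

z = [(v - mean) / s for v in x]          # z = (v - xbar) / s
print([round(zi, 2) for zi in z])

# Inverse: recover the original value from its z-score, v = xbar + s*z
print(mean + s * z[0], x[0])             # both ~8.10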
x-Scale versus z-Score

134
exercise
1. Find the z-scores for all 10 observations in the GPA sample data (given earlier)
2. Suppose the mean and standard deviation of the GPAs of all currently registered students at a college are μ = 2.70 and σ = 0.50. The z-scores of the GPAs of two students, Antonio and Alice, are z = −0.62 and z = 1.28, respectively. What are their GPAs?
135
Section Summary
• The percentile rank and z-score of a
measurement indicate its relative position
with regard to the other measurements in a
data set
• The three quartiles divide a data set into
fourths
• The five-number summary and its associated
box plot summarize the location and
distribution of the data
136
The Empirical Rule and Chebyshev’s
Theorem

1. To learn what the value of the standard deviation of a data set implies about how the data scatter away from the mean, as described by the Empirical Rule and Chebyshev's Theorem
2. To use the Empirical Rule and Chebyshev’s
Theorem to draw conclusions about a data
set
137
The Empirical Rule

• The following table shows the heights in inches of 100 randomly selected adult men

138
Heights of Men
68.7 72.3 71.3 72.5 70.6 68.2 70.1 68.4 68.6 70.6

73.7 70.5 71.0 70.9 69.3 69.4 69.7 69.1 71.5 68.6

70.9 70.0 70.4 68.9 69.4 69.4 69.2 70.7 70.5 69.9

69.8 69.8 68.6 69.5 71.6 66.2 72.4 70.7 67.7 69.1

68.8 69.3 68.9 74.8 68.0 71.2 68.3 70.2 71.9 70.4

71.9 72.2 70.0 68.7 67.9 71.1 69.0 70.8 67.3 71.8

70.3 68.8 67.2 73.0 70.4 67.8 70.0 69.5 70.1 72.0

72.2 67.6 67.0 70.3 71.2 65.6 68.1 70.8 71.4 70.2

70.1 67.5 71.3 71.5 71.0 69.1 69.5 71.1 66.8 71.8

69.6 72.7 72.8 69.6 65.9 68.0 69.7 68.7 69.8 69.7
139
The Empirical Rule
• The mean and standard deviation of the data, rounded to two decimal places, are x̄ = 69.92 and s = 1.70
• The number of observations within one standard deviation of the mean, i.e. between 69.92 − 1.70 = 68.22 and 69.92 + 1.70 = 71.62 inches, is 69
• The number within two standard deviations of the mean is 95
• All of the measurements are within three standard deviations of the mean
Heights of Men relative frequency histogram

141
The Empirical Rule
If a data set has an approximately bell-shaped relative frequency histogram, then:
1. approximately 68% of the data lie within one standard deviation of the mean, that is, in the interval with endpoints x̄ ± s for samples and with endpoints μ ± σ for populations;
2. approximately 95% of the data lie within two standard deviations of the mean, that is, in the interval with endpoints x̄ ± 2s for samples and with endpoints μ ± 2σ for populations; and
3. approximately 99.7% of the data lie within three standard deviations of the mean, that is, in the interval with endpoints x̄ ± 3s for samples and with endpoints μ ± 3σ for populations
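The proportions in the Empirical Rule can be checked empirically for any sample. The sketch below uses a synthetic bell-shaped sample generated with numpy (the mean 69.92 and standard deviation 1.70 are borrowed from the heights example only as plausible parameters; the heights table itself is not re-typed here).

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=69.92, scale=1.70, size=100)   # synthetic bell-shaped sample

xbar, s = data.mean(), data.std(ddof=1)
for k in (1, 2, 3):
    inside = np.mean(np.abs(data - xbar) <= k * s)   # observed proportion within k sd
    print(f"within {k} sd: {inside:.2%}  (rule says ~{[68, 95, 99.7][k-1]}%)")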
The Empirical Rule

143
Key points
1. the data distribution must be approximately bell-shaped
– The Empirical Rule does not apply to data sets with severely
asymmetric distributions

2. the percentages are only approximately true


– The actual percentage of observations in any of the intervals
specified by the rule could be either greater or less than those given
in the rule
• The Empirical Rule does not apply to all data sets, only to those
that are bell-shaped, and even then is stated in terms of
approximations
144
Chebyshev’s Theorem
For any numerical data set:
1. at least 3/4 of the data lie within two standard deviations of the mean, that is, in the interval with endpoints x̄ ± 2s for samples and with endpoints μ ± 2σ for populations;
2. at least 8/9 of the data lie within three standard deviations of the mean, that is, in the interval with endpoints x̄ ± 3s for samples and with endpoints μ ± 3σ for populations;
3. at least 1 − 1/k² of the data lie within k standard deviations of the mean, that is, in the interval with endpoints x̄ ± ks for samples and with endpoints μ ± kσ for populations, where k is any positive whole number that is greater than 1.
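Chebyshev's lower bound 1 − 1/k² is easy to tabulate alongside the Empirical Rule's approximate percentages; a small sketch (the printed comparison values are simply the rule's 68/95/99.7 figures).

# Chebyshev's guaranteed minimum proportion within k standard deviations,
# compared with the Empirical Rule's approximation for bell-shaped data.
empirical = {1: 0.68, 2: 0.95, 3: 0.997}

for k in (2, 3, 4):
    chebyshev = 1 - 1 / k**2
    note = f"(Empirical Rule ~{empirical[k]:.1%})" if k in empirical else ""
    print(f"k={k}: at least {chebyshev:.2%} of any data set {note}")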
Chebyshev’s Theorem

146
Key points
1. Chebyshev's Theorem applies to any data set
2. The theorem gives the minimum proportion of the data which must lie within a given number of standard deviations of the mean
   – the true proportions found within the indicated regions could be greater than what the theorem guarantees
Section Summary
1. The Empirical Rule is an approximation that
applies only to data sets with a bell-shaped
relative frequency histogram
– It estimates the proportion of the measurements
that lie within one, two, and three standard
deviations of the mean
2. Chebyshev’s Theorem is a fact that applies to
all possible data sets.
– It describes the minimum proportion of the measurements that must lie within one, two, or more standard deviations of the mean
exercises
1. A sample data set with a bell-shaped distribution has mean x̄ = 6 and standard deviation s = 2. Find the approximate proportion of observations in the data set that lie:
a. between 4 and 8;
b. between 2 and 10;
c. between 0 and 12
2. A population data set has mean μ = 2 and standard
deviation σ = 1.1. Find the minimum proportion of
observations in the data set that must lie:
a. between −0.2 and 4.2;
b. between −1.3 and 5.3
149
