
Chapter 3: Data Description

By
Dr. Abdelfattah Mustafa
Associate Professor of Mathematical Statistics
Mathematics Department, Faculty of Science,
Islamic University of Madinah, KSA

Reference: Allan G. Bluman, Elementary Statistics: A Step by Step Approach, 8th edition,
McGraw-Hill, 2012.

March 8, 2024

Dr. Abdelfattah Mustafa STAT 3111: General Statistics March 8, 2024 1 / 65
Contents

1 Measures of Central Tendency

2 Measures of Variation

3 Measures of Position

4 Measures of Skewness and Kurtosis

5 Exploratory Data Analysis


3-1 Measures of Central Tendency

A statistic is a characteristic or measure obtained by using the data values from a sample.

A parameter is a characteristic or measure obtained by using all the data values from a specific population.

The Mean:
The mean is the sum of the values divided by the total number of values.
The sample mean:
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad (1)$$
where n represents the total number of values in the sample.


The population mean:
$$\mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i \qquad (2)$$
where N represents the total number of values in the population.

Example 1 (Example 3-1: Days Off per Year)

The data represent the number of days off per year for a sample of individuals
selected from nine different countries. Find the mean.

20 26 40 36 23 42 35 24 30

Solution:
The sample mean is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} = \frac{276}{9} = 30.7 \text{ days}$$


Example 2 (Example 3-2: Hospital Infections)

The data show the number of patients in a sample of six hospitals who acquired an
infection while hospitalized. Find the mean.

110 76 29 38 105 31

Solution:
The sample mean is
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{110 + 76 + 29 + 38 + 105 + 31}{6} = \frac{389}{6} = 64.8$$

The mean of the number of hospital infections for the six hospitals is 64.8.
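
The two examples above can be checked with a few lines of Python. This is a minimal sketch of formula (1); the function name `sample_mean` is an illustrative choice, not something taken from the textbook.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]   # Example 1
infections = [110, 76, 29, 38, 105, 31]           # Example 2

print(round(sample_mean(days_off), 1))    # 30.7
print(round(sample_mean(infections), 1))  # 64.8
```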

▶ The procedure for finding the mean for grouped data uses the midpoints of the classes.
$$\bar{X} = \frac{1}{n}\sum_{m=1}^{k} x_m f_m \qquad (3)$$
where $x_m$ is the midpoint and $f_m$ is the frequency of class m, m = 1, 2, · · · , k.

Example 3 (Example 3-3: Miles Run per Week)

Using the following frequency distribution, find the mean. The data represent the
number of miles run during one week for a sample of 20 runners.

Class boundaries Frequency


5.5 – 10.5 1
10.5 – 15.5 2
15.5 – 20.5 3
20.5 – 25.5 5
25.5 – 30.5 4
30.5 – 35.5 3
35.5 – 40.5 2
Total 20

Solution:
The calculations are summarized in the following table.

Class boundaries   Frequency (fm)   Midpoint (xm)   xm fm
5.5 – 10.5         1                8               8
10.5 – 15.5        2                13              26
15.5 – 20.5        3                18              54
20.5 – 25.5        5                23              115
25.5 – 30.5        4                28              112
30.5 – 35.5        3                33              99
35.5 – 40.5        2                38              76
Total              20                               490

The sample mean is
$$\bar{X} = \frac{1}{n}\sum_{m=1}^{k} x_m f_m = \frac{490}{20} = 24.5 \text{ miles}$$
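
The grouped-data calculation can be sketched in Python as follows. The function name `grouped_mean` and the representation of each class as a (lower, upper) pair of boundaries are assumptions made for illustration.

```python
def grouped_mean(boundaries, frequencies):
    """Mean of grouped data: sum(x_m * f_m) / n, where x_m is the class midpoint."""
    n = sum(frequencies)
    total = sum((lo + hi) / 2 * f for (lo, hi), f in zip(boundaries, frequencies))
    return total / n


# Miles run per week (Example 3)
boundaries = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
              (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
frequencies = [1, 2, 3, 5, 4, 3, 2]

print(grouped_mean(boundaries, frequencies))  # 24.5
```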

The Median:
The median is the midpoint of the data array. The symbol for the median is MD.


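Steps in computing the median of a data array:

1 Arrange the data in order.

2 Select the middle point.

(a) If n is odd, then the middle point is $X_{\left(\frac{n+1}{2}\right)}$ and the median is
$$MD = X_{\left(\frac{n+1}{2}\right)}.$$
(b) If n is even, then the middle points are $X_{\left(\frac{n}{2}\right)}$ and $X_{\left(\frac{n}{2}+1\right)}$, and the median is
$$MD = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}.$$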

Example 4 (Example 3-4: Hotel Rooms)

The number of rooms in the seven hotels in downtown Pittsburgh is 713, 300, 618,
595, 311, 401, and 292. Find the median.


Solution:
1 Arrange the data in order.

292, 300, 311, 401, 595, 618, 713

2 Select the middle value. Since n = 7 is odd, the middle value is the 4th one, and the median is
$$MD = X_{\left(\frac{7+1}{2}\right)} = X_{(4)} = 401 \text{ rooms}$$

Example 5 (Example 3-8: Magazines Purchased)

Six customers purchased these numbers of magazines: 1, 7, 3, 2, 3, 4. Find the median.

Solution:
Arrange the data in order.

1, 2, 3, 3, 4, 7

Since n = 6 is even, the median number of magazines purchased is
$$MD = \frac{X_{(3)} + X_{(4)}}{2} = \frac{3 + 3}{2} = 3$$
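
The two-step procedure (sort, then pick the middle point) can be written as a short Python sketch. The function name `median` is illustrative; note that Python lists are 0-based, so the 1-based position $X_{(k)}$ is `data[k - 1]`.

```python
def median(values):
    """Median of a data array: sort, then take the middle value(s)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                     # X_((n+1)/2) in 1-based notation
    return (data[mid - 1] + data[mid]) / 2   # average of X_(n/2) and X_(n/2 + 1)


print(median([713, 300, 618, 595, 311, 401, 292]))  # 401  (Example 4)
print(median([1, 7, 3, 2, 3, 4]))                   # 3.0  (Example 5)
```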

The Mode:

The value that occurs most often in a data set is called the mode.

A data set that has only one value that occurs with the greatest frequency is
said to be unimodal.

If a data set has two values that occur with the same greatest frequency, both
values are considered to be the mode and the data set is said to be bimodal.

If a data set has more than two values that occur with the same greatest
frequency, each value is used as the mode, and the data set is said to be
multimodal.

When no data value occurs more than once, the data set is said to have no mode.

Example 6 (Example 3-9: NFL Signing Bonuses)

Find the mode of the signing bonuses of eight NFL players for a specific year. The
bonuses in millions of dollars are: 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10

Solution:
Since the value 10 occurs three times, more often than any other value, the mode is $10 million.

Example 7 (Example 3-10: Branches of Large Banks)

Find the mode for the number of branches that six banks have.

401, 344, 209, 201, 227, 353

Solution:
Since each value occurs only once, there is no mode.


Example 8 (Example 3-11: Licensed Nuclear Reactors)

The data show the number of licensed nuclear reactors in the United States for a
recent 15-year period. Find the mode.

104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109.

Solution:
Since the values 104 and 109 both occur 5 times, the modes are 104 and 109. The
data set is said to be bimodal.
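
The three mode examples can be reproduced with the sketch below. It follows the convention stated above that a data set in which no value repeats has no mode, so it returns an empty list in that case; the function name `modes` is an assumption made for illustration.

```python
from collections import Counter


def modes(values):
    """Return all values with the greatest frequency, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []          # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)


bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
branches = [401, 344, 209, 201, 227, 353]
reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109]

print(modes(bonuses))   # [10]        unimodal
print(modes(branches))  # []          no mode
print(modes(reactors))  # [104, 109]  bimodal
```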

The median and mode can be found for grouped data as follows:

The median can be calculated by using the following formula
$$MD = L + \left(\frac{n/2 - (\sum f)_{m-1}}{f_m}\right) \times h, \qquad (4)$$
where
L: the lower class boundary of the median class,
$f_m$: the frequency of the median class,
$(\sum f)_{m-1}$: the cumulative frequency before the median class,
h: the width of the median class.

The mode can be calculated by using the following formula
$$\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times h, \qquad (5)$$
where
L: the lower class boundary of the modal class,
h: the width of the modal class,
$\Delta_1 = f^* - f_1$, $\Delta_2 = f^* - f_2$,
$f^*$: the largest frequency (the frequency of the modal class),
$f_1$: the frequency of the class before the modal class,
$f_2$: the frequency of the class after the modal class.

Example 9 (Example 3-12: Miles Run per Week)

The following frequency distribution represents the number of miles run during one
week for a sample of 20 runners. Find the median and mode.
Classes        Frequency
5.5 – 10.5     1
10.5 – 15.5    2
15.5 – 20.5    3
20.5 – 25.5    5
25.5 – 30.5    4
30.5 – 35.5    3
35.5 – 40.5    2

Solution:

The median: the cumulative frequency distribution is


Upper class boundaries Cumulative frequency
Less than 5.5 0
Less than 10.5 1
Less than 15.5 3
Less than 20.5 6
Less than 25.5 11
Less than 30.5 15
Less than 35.5 18
Less than 40.5 20


Since n/2 = 10, the cumulative frequencies show that the median class is 20.5–25.5, so L = 20.5, $(\sum f)_{m-1} = 6$, $f_m = 5$, h = 5.

Therefore, the median is given by
$$MD = L + \left(\frac{n/2 - (\sum f)_{m-1}}{f_m}\right) \times h = 20.5 + \left(\frac{10 - 6}{5}\right) \times 5 = 24.5$$

The mode: the largest frequency is 5, so the modal class is 20.5–25.5, and

L = 20.5, $f^* = 5$, $f_1 = 3$, $f_2 = 4$, h = 5,
$\Delta_1 = f^* - f_1 = 5 - 3 = 2$, $\Delta_2 = f^* - f_2 = 5 - 4 = 1$.

Using the formula,
$$\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times h = 20.5 + \left(\frac{2}{2 + 1}\right) \times 5 = 23.83$$
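
Formulas (4) and (5) can be sketched in Python as shown below. The function names are illustrative, and the sketch makes three assumptions that the slides do not state: all classes share the same width h, the first class with the largest frequency is taken as the modal class, and a missing neighbouring class (when the modal class is first or last) is treated as having frequency 0.

```python
def grouped_median(lower_bounds, width, freqs):
    """Formula (4): L + ((n/2 - cumulative frequency before) / f_m) * h."""
    n = sum(freqs)
    cumulative = 0
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= n / 2:          # first class that reaches n/2
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must contain at least one positive value")


def grouped_mode(lower_bounds, width, freqs):
    """Formula (5): L + (d1 / (d1 + d2)) * h for the modal class."""
    m = freqs.index(max(freqs))              # first class with the largest frequency
    f1 = freqs[m - 1] if m > 0 else 0        # assumption: missing neighbour counts as 0
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1, d2 = freqs[m] - f1, freqs[m] - f2
    return lower_bounds[m] + d1 / (d1 + d2) * width


lower = [5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 35.5]
freqs = [1, 2, 3, 5, 4, 3, 2]

print(round(grouped_median(lower, 5, freqs), 2))  # 24.5
print(round(grouped_mode(lower, 5, freqs), 2))    # 23.83
```

If the modal class and both of its neighbours have equal frequencies, the denominator d1 + d2 is zero and the formula does not apply.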

The Midrange:
The midrange is defined as the sum of the lowest and highest values in the data
set, divided by 2. The symbol MR is used for the midrange.
$$MR = \frac{\text{lowest value} + \text{highest value}}{2} \qquad (6)$$

Example 10 (Example 3-15: Water-Line Breaks)

In the last two winter seasons, the city of Brownsville, Minnesota, reported these
numbers of water-line breaks per month: 2, 3, 6, 8, 4, 1. Find the midrange.

Solution:
The lowest value is 1 and the highest value is 8, so
$$MR = \frac{1 + 8}{2} = 4.5$$
Hence, the midrange is 4.5.

Example 11 (Example 3-16: NFL Signing Bonuses)

Find the midrange of data for the NFL signing bonuses in Example 3-9. The bonuses
in millions of dollars are: 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10

Solution:
The lowest bonus is $10 million and the highest is $34.5 million, so
$$MR = \frac{10 + 34.5}{2} = \frac{44.5}{2} = \$22.25 \text{ million}$$

The Weighted Mean:

The weighted mean of a variable X can be calculated by using the following formula
$$\bar{X} = \frac{X_1 w_1 + X_2 w_2 + \cdots + X_n w_n}{w_1 + w_2 + \cdots + w_n} \qquad (7)$$
where $w_1, w_2, \ldots, w_n$ are the weights and $X_1, X_2, \ldots, X_n$ are the values.

Example 12 (Example 3-17: Grade Point Average)

A student received an A in English Composition I (3 credits), a C in Introduction to Psychology (3 credits), a B in Biology I (4 credits), and a D in Physical Education (2 credits). Assuming A = 4 grade points, B = 3 grade points, C = 2 grade points, D = 1 grade point, and F = 0 grade points, find the student’s grade point average.

Solution:
Course                       Credits (w)   Grade (X)
English Composition I        3             A (4 points)
Introduction to Psychology   3             C (2 points)
Biology I                    4             B (3 points)
Physical Education           2             D (1 point)

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i w_i}{\sum_{i=1}^{n} w_i} = \frac{3(4) + 3(2) + 4(3) + 2(1)}{3 + 3 + 4 + 2} = \frac{32}{12} = 2.7$$
The grade point average is 2.7.
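
Formula (7) translates directly into code. The sketch below uses the credits as weights and the grade points as values; the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(X_i * w_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


grade_points = [4, 2, 3, 1]   # A, C, B, D
credits = [3, 3, 4, 2]

print(round(weighted_mean(grade_points, credits), 1))  # 2.7
```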


Properties and Uses of Central Tendency:

The Mean:

1 The mean is found by using all the values of the data.

2 The mean is used in computing other statistics, such as the variance.

3 The mean for the data set is unique and not necessarily one of the data values.

4 The mean cannot be computed for the data in a frequency distribution that has
an open-ended class.

5 The mean is affected by extremely high or low values, called outliers, and may
not be the appropriate average to use in these situations.

The Median:

1 The median is used to find the center or middle value of a data set.


2 The median is used when it is necessary to find out whether the data values fall
into the upper half or lower half of the distribution.

3 The median is used for an open-ended distribution.

4 The median is affected less than the mean by extremely high or extremely low
values.

The Mode:

1 The mode is used when the most typical case is desired.

2 The mode is the easiest average to compute.

3 The mode can be used when the data are nominal or categorical, such as
religious preference, gender, or political affiliation.

4 The mode is not always unique. A data set can have more than one mode, or
the mode may not exist for a data set.


The Midrange:
1 The midrange is easy to compute.
2 The midrange gives the midpoint.
3 The midrange is affected by extremely high or low values in a data set.

Distribution Shapes:

Frequency distributions can assume many shapes. The three most important
shapes are positively skewed, symmetric, and negatively skewed. The following
figures show histograms of each.


1 In a positively skewed or right-skewed distribution, the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution; the “tail” is to the right, see Figure a. Also,

mode < median < mean

2 In a symmetric distribution, the data values are evenly distributed on both sides
of the mean, see Figure b. In addition, when the distribution is unimodal, then

mean = median = mode



3 When the majority of the data values fall to the right of the mean and cluster at
the upper end of the distribution, with the tail to the left, the distribution is
said to be negatively skewed or left-skewed, see Figure c. Also,

mean < median < mode

3-2 Measures of Variation

In statistics, to describe the data set accurately, statisticians must know more
than the measures of central tendency.

Example 13 (Example 3-18: Comparison of Outdoor Paint)

A testing lab wishes to test two experimental brands of outdoor paint to see how
long each will last before fading. The testing lab makes 6 gallons of each paint to
test. Since different chemical agents are added to each group and only six cans are
involved, these two groups constitute two small populations.

The results (in months) are shown. Find the mean of each group.

Brand A 10 60 50 30 40 20
Brand B 35 45 30 35 40 25

Solution:
The mean for brand A is
$$\mu_A = \frac{1}{N_1}\sum_{i=1}^{N_1} X_{1i} = \frac{210}{6} = 35 \text{ months}$$
The mean for brand B is
$$\mu_B = \frac{1}{N_2}\sum_{i=1}^{N_2} X_{2i} = \frac{210}{6} = 35 \text{ months}$$

Since µA = µB , you might conclude that both brands of paint last equally well.

However, when the data sets are examined graphically, a somewhat different
conclusion might be drawn. See the following figure.

Figure 1: Variation of paint in months.

As Figure 1 shows, even though the means are the same for both brands, the spread, or variation, is quite different.

Figure 1 shows that brand B performs more consistently; it is less variable.


For the spread or variability of a data set, three measures are commonly used:
range, variance, and standard deviation.

Range:
The range is the highest value minus the lowest value. The symbol R is used for
the range.
R = highest value − lowest value (8)

Example 14 (Example 3-19: Comparison of Outdoor Paint)

Find the ranges for the paints in Example 13.

Solution:
For brand A, the range is:

RA = 60 − 10 = 50 months


For brand B, the range is

RB = 45 − 25 = 20 months

Therefore,
RA > RB

Example 15 (Example 3-20: Employee Salaries)

The salaries for the staff of the XYZ Manufacturing Co. are shown here. Find the
range.
Staff Salary
Owner $100,000
Manager 40,000
Sales representative 30,000
Workers 25,000
15,000
18,000


Solution:
The range for salary is

R = $100,000 − $15,000 = $85,000

Variance and Standard Deviation:


The variance is the average of the squares of the distance each value is from the
mean.

The population variance is
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2 \qquad (9)$$

The population standard deviation is the square root of the variance.
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2} \qquad (10)$$


Example 16 (Examples 3-21 and 3-22: Comparison of Outdoor Paint)

Find the variance and standard deviation for brands A and B paint data in Example
13. The months were
Brand A 10 60 50 30 40 20
Brand B 35 45 30 35 40 25

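Solution:
Step 1: Find the mean of each brand; from Example 13,
$$\mu_1 = \frac{10 + 60 + 50 + 30 + 40 + 20}{6} = \frac{210}{6} = 35, \qquad \mu_2 = \frac{35 + 45 + 30 + 35 + 40 + 25}{6} = \frac{210}{6} = 35$$
where $\mu_1$ is the mean for brand A and $\mu_2$ is the mean for brand B.

Step 2: Construct the following table, using the formula
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$$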


Brand A   X1i − µ1   (X1i − µ1)²   Brand B   X2i − µ2   (X2i − µ2)²
10        -25        625           35        0          0
60        25         625           45        10         100
50        15         225           30        -5         25
30        -5         25            35        0          0
40        5          25            40        5          25
20        -15        225           25        -10        100
Total                1750                               250

Step 3: Compute the variance for A and B:
$$\sigma_A^2 = \frac{1}{N_1}\sum_{i=1}^{N_1}(X_{1i} - \mu_1)^2 = \frac{1750}{6} = 291.67$$
$$\sigma_B^2 = \frac{1}{N_2}\sum_{i=1}^{N_2}(X_{2i} - \mu_2)^2 = \frac{250}{6} = 41.7$$

Step 4: The standard deviations are
$$\sigma_A = \sqrt{\sigma_A^2} = \sqrt{291.67} = 17.08, \qquad \sigma_B = \sqrt{\sigma_B^2} = \sqrt{41.7} = 6.5$$


Therefore, the variation in brand A is greater than the variation in brand B.
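
The four steps can be condensed into a short Python sketch of formulas (9) and (10). The function name `population_variance` is an illustrative choice. The printed values are rounded to two decimals, so brand B shows 41.67 and 6.45 where the slides round to 41.7 and 6.5.

```python
import math


def population_variance(values):
    """Population variance: mean of the squared deviations from the mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, data in (("Brand A", brand_a), ("Brand B", brand_b)):
    var = population_variance(data)
    print(name, round(var, 2), round(math.sqrt(var), 2))
# Brand A 291.67 17.08
# Brand B 41.67 6.45
```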

The sample variance, denoted by $s^2$, is given by the formula
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2. \qquad (11)$$

To find the standard deviation of a sample, take the square root of the sample variance.
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}. \qquad (12)$$

The shortcut formula for computing the variance for data obtained from samples is as follows; the standard deviation is its square root.
$$s^2 = \frac{1}{n(n-1)}\left[n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right] \qquad (13)$$


Example 17 (Example 3-23: European Auto Sales)

Find the sample variance and standard deviation for the amount of European auto
sales for a sample of 6 years shown. The data are in millions of dollars: 11.2, 11.9,
12.0, 12.8, 13.4, 14.3

Solution:
We can summarize our calculation in the following table:

X            X²
11.2         125.44
11.9         141.61
12.0         144.00
12.8         163.84
13.4         179.56
14.3         204.49
ΣX = 75.6    ΣX² = 958.94

$$\therefore\; s^2 = \frac{1}{n(n-1)}\left[n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right] = \frac{6(958.94) - (75.6)^2}{6(6-1)} = 1.276$$


The standard deviation can be obtained as
$$s = \sqrt{s^2} = \sqrt{1.276} = 1.13$$
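
The shortcut formula (13) needs only the sum of the values and the sum of their squares, which makes it easy to code. The sketch below reproduces the European auto sales result; the function name `sample_variance_shortcut` is illustrative.

```python
import math


def sample_variance_shortcut(values):
    """Sample variance via formula (13): [n*sum(X^2) - (sum X)^2] / [n(n - 1)]."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed for a sample variance")
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
s2 = sample_variance_shortcut(sales)
print(round(s2, 3), round(math.sqrt(s2), 2))  # 1.276 1.13
```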

Variance and Standard Deviation for Grouped Data:

$$s^2 = \frac{1}{n-1}\sum_{m=1}^{k}\left(x_m - \bar{X}\right)^2 f_m; \quad \text{or} \qquad (14)$$
$$s^2 = \frac{1}{n(n-1)}\left[n\sum_{m=1}^{k} x_m^2 f_m - \left(\sum_{m=1}^{k} x_m f_m\right)^2\right], \qquad (15)$$
where $x_m$ is the midpoint of the class, $f_m$ is the frequency, k is the number of classes, and $n = \sum_{m=1}^{k} f_m$ is the sample size.


Example 18 (Example 3-24: Miles Run per Week)

The following data represent the number of miles that 20 runners ran during one
week. Find the variance and the standard deviation.

Solution:
We can summarize the calculations in the following table.

Class boundaries   Frequency (fm)   Midpoint (xm)   xm fm   xm² fm
5.5 – 10.5         1                8               8       64
10.5 – 15.5        2                13              26      338
15.5 – 20.5        3                18              54      972
20.5 – 25.5        5                23              115     2645
25.5 – 30.5        4                28              112     3136
30.5 – 35.5        3                33              99      3267
35.5 – 40.5        2                38              76      2888
Total              20                               490     13310

$$s^2 = \frac{1}{n(n-1)}\left[n\sum_{m=1}^{k} x_m^2 f_m - \left(\sum_{m=1}^{k} x_m f_m\right)^2\right] = \frac{20(13310) - (490)^2}{20(20-1)} = 68.7$$


The standard deviation is
$$s = \sqrt{s^2} = \sqrt{68.7} = 8.3$$
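
The grouped-data shortcut formula can be sketched in the same way. The function name `grouped_sample_variance` is illustrative, and the class midpoints are passed in directly rather than computed from the boundaries.

```python
import math


def grouped_sample_variance(midpoints, freqs):
    """Grouped sample variance: [n*sum(x^2 f) - (sum(x f))^2] / [n(n - 1)]."""
    n = sum(freqs)
    sum_xf = sum(x * f for x, f in zip(midpoints, freqs))
    sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))
    return (n * sum_x2f - sum_xf ** 2) / (n * (n - 1))


midpoints = [8, 13, 18, 23, 28, 33, 38]
freqs = [1, 2, 3, 5, 4, 3, 2]

s2 = grouped_sample_variance(midpoints, freqs)
print(round(s2, 1), round(math.sqrt(s2), 1))  # 68.7 8.3
```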

Coefficient of Variation:

Whenever two samples have the same units of measure, the variance and standard deviation for each can be compared directly.
To compare the variation of two variables, the variance or the standard deviation cannot be used directly when:
1 The variables have different means.
2 The variables have different units.

We need a measure of relative variation that does not depend on the units or on how large the values are. This measure is the coefficient of variation (CV, written CVar in the formulas), which is defined by
$$CVar = \frac{\text{standard deviation}}{\text{mean}} \times 100\% \qquad (16)$$


Example 19 (Example 3-25: Sales of Automobiles)

The mean of the number of sales of cars over a 3-month period is 87, and the
standard deviation is 5. The mean of the commissions is $5225, and the standard
deviation is $773. Compare the variations of the two.

Solution:
The coefficients of variation are

Sales:
$$CVar_1 = \frac{s_1}{\bar{X}_1} \times 100 = \frac{5}{87} \times 100 = 5.7\%$$
Commissions:
$$CVar_2 = \frac{s_2}{\bar{X}_2} \times 100 = \frac{773}{5225} \times 100 = 14.8\%$$
Since the coefficient of variation is larger for commissions, the commissions are more
variable than the sales.


Example 20 (Example 3–26: Pages in Women’s Fitness Magazines)

The mean for the number of pages of a sample of women’s fitness magazines is 132,
with a variance of 23; the mean for the number of advertisements of a sample of
women’s fitness magazines is 182, with a variance of 62. Compare the variations.

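Solution:
The variances are given, so the standard deviations are their square roots. The coefficients of variation are
$$CVar_1 = \frac{s_1}{\bar{x}_1} \times 100 = \frac{\sqrt{23}}{132} \times 100 = 3.6\%$$
$$CVar_2 = \frac{s_2}{\bar{x}_2} \times 100 = \frac{\sqrt{62}}{182} \times 100 = 4.33\%$$

Since the coefficient of variation is larger for the number of advertisements, the number of advertisements is more variable than the number of pages in women’s fitness magazines.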
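
The coefficient of variation is a one-line computation. The sketch below reproduces the figures of Examples 19 and 20; the function name `coefficient_of_variation` is illustrative. When only a variance is given, as in the magazine example, the standard deviation is taken as its square root before dividing by the mean.

```python
import math


def coefficient_of_variation(std_dev, mean):
    """CVar = (standard deviation / mean) * 100, expressed as a percentage."""
    return std_dev / mean * 100


# Example 19: standard deviations are given directly
print(round(coefficient_of_variation(5, 87), 1))       # 5.7
print(round(coefficient_of_variation(773, 5225), 1))   # 14.8

# Example 20: variances are given, so take the square root first
print(round(coefficient_of_variation(math.sqrt(23), 132), 1))  # 3.6
print(round(coefficient_of_variation(math.sqrt(62), 182), 1))  # 4.3
```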


3-3 Measures of Position

In addition to measures of central tendency and measures of variation, there are measures of position or location. These measures include standard scores, percentiles, deciles, and quartiles.

Standard Scores (Z):

Suppose that a student scored 90 on a Math test and 45 on an English exam.

Direct comparison of raw scores is impossible, since the exams might not be
equivalent in terms of number of questions, value of each question, and so on.
However, a comparison of a relative standard similar to both can be made.

$$Z \text{ score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \qquad (17)$$


Example 21 (Example 3-29: Test Scores)

A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative positions on the two tests.

Solution:
For calculus the z score is
$$z_1 = \frac{X - \bar{X}}{s_1} = \frac{65 - 50}{10} = 1.5$$

For history the z score is
$$z_2 = \frac{Y - \bar{Y}}{s_2} = \frac{30 - 25}{5} = 1.0$$

Since the Z score for calculus is larger, her relative position in the calculus class is
higher than her relative position in the history class.

Example 22 (Example 3-30: Test Scores)

Find the z score for each test, and state which is higher.

Test A XA = 38 X̄A = 40 sA = 5
Test B XB = 94 X̄B = 100 sB = 10

Solution:
For test A:
$$z_A = \frac{X_A - \bar{X}_A}{s_A} = \frac{38 - 40}{5} = -0.4$$
For test B:
$$z_B = \frac{X_B - \bar{X}_B}{s_B} = \frac{94 - 100}{10} = -0.6$$
The score for test A is relatively higher than the score for test B.
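
Both z-score examples can be verified with the small sketch below, which implements formula (17). The function name `z_score` is illustrative.

```python
def z_score(value, mean, std_dev):
    """Standard score: how many standard deviations a value lies from the mean."""
    return (value - mean) / std_dev


# Example 21: calculus vs. history
print(z_score(65, 50, 10), z_score(30, 25, 5))     # 1.5 1.0

# Example 22: test A vs. test B (both below the mean, so both are negative)
print(z_score(38, 40, 5), z_score(94, 100, 10))    # -0.4 -0.6
```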


Percentiles:

Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group.

Percentiles divide the data set into 100 equal groups. They are symbolized by P1, P2, P3, · · · , P99.

Percentiles are not the same as percentages.

Percentile graphs can be constructed by using the same values as the cumulative relative frequency graphs.


Example 23 (Example 3-31: Systolic Blood Pressure)

The frequency distribution for the systolic blood pressure readings (in millimeters of
mercury, mm Hg) of 200 randomly selected college students is shown here. Construct
a percentile graph.
Class boundaries Frequency
89.5 – 104.5 24
104.5 – 119.5 62
119.5 – 134.5 72
134.5 – 149.5 26
149.5 – 164.5 12
164.5 – 179.5 4

Solution:
Find the cumulative percentages. To do this, use the formula
$$\text{Cumulative \%} = \frac{\text{cumulative frequency}}{n} \times 100$$


Class boundaries Cumulative frequency Cumulative percent


Less than 89.5 0 0
Less than 104.5 24 12
Less than 119.5 86 43
Less than 134.5 158 79
Less than 149.5 184 92
Less than 164.5 196 98
Less than 179.5 200 100

Graph the data, using class boundaries for the x axis and the percentages for the
y axis.


From the percentile graph, the 40th percentile corresponds to a value of approximately 118.

The graph also shows that a blood pressure of 130 corresponds to approximately the 70th percentile.

Several mathematical methods exist for computing percentiles for data.

These methods can be used to find the approximate percentile rank of a data
value or to find a data value corresponding to a given percentile.

Finding a data value corresponding to a given percentile

Arrange the data in order from lowest to highest.

Substitute into the formula
$$c = \frac{n \times p}{100}$$
where n = total number of values and p = percentile.

If c is not a whole number, round up to the next whole number.


Starting at the lowest value, count over to the number that corresponds to the
rounded-up value.

If c is a whole number, use the value halfway between the cth and (c + 1)st
values when counting up from the lowest value.


Example 24 (Example 3–34: Test Scores)

A teacher gives a 20-point test to 10 students. The scores are shown here: 18, 15, 12,
6, 8, 2, 3, 5, 20, 10. Find the value corresponding to the 25th percentile.

Solution:
Arrange the data in order from smallest to largest.

2, 3, 5, 6, 8, 10, 12, 15, 18, 20

Compute
$$c = \frac{n \times p}{100} = \frac{10(25)}{100} = 2.5$$
Since c is not a whole number, round it up to the next whole number; in this case, c = 3.

Hence, the value 5 corresponds to the 25th percentile.


Example 25 (Example 3–35)

Using the data set in Example 24, find the value that corresponds to the 60th
percentile

Solution:
Arrange the data in order from smallest to largest.

2, 3, 5, 6, 8, 10, 12, 15, 18, 20

Compute
$$c = \frac{n \times p}{100} = \frac{10(60)}{100} = 6$$
Since c is a whole number, use the value halfway between the 6th and 7th values:
$$\frac{10 + 12}{2} = 11$$

Hence, the value 11 corresponds to the 60th percentile.
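
The percentile procedure used in Examples 24 and 25 can be sketched as follows. The function name `percentile_value` is illustrative, and the sketch assumes p is a whole number between 1 and 99 so that integer arithmetic can decide whether c is a whole number.

```python
def percentile_value(values, p):
    """Data value for the p-th percentile using c = n*p/100 and the rounding rule."""
    if not 0 < p < 100:
        raise ValueError("p must be between 1 and 99")
    data = sorted(values)
    c, remainder = divmod(len(data) * p, 100)
    if remainder:                          # c is not whole: round up, 1-based position c + 1
        return data[c]
    return (data[c - 1] + data[c]) / 2     # c is whole: halfway between c-th and (c+1)-st


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_value(scores, 25))  # 5
print(percentile_value(scores, 60))  # 11.0
```

Other textbooks and software packages use different percentile conventions, so results may differ slightly from this rule on small data sets.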


Quartiles and Deciles:

Quartiles divide the distribution into four groups.

Quartiles can be computed by using the formula given for computing percentiles,
such that Q1 = P25 , Q2 = P50 and Q3 = P75 .

Finding Data Values Corresponding to Q1 , Q2 and Q3 :


1 Arrange the data in order from lowest to highest.

2 Find the median of the data values. (This is Q2 )

3 Find the median of the data values that fall below Q2 . (This is Q1 ).

4 Find the median of the data values that fall above Q2 . (This is Q3 ).


Example 26 (Example 3-36.)

Find Q1 , Q2 , and Q3 for the data set

15, 13, 6, 5, 12, 50, 22, 18.

Solution:
1 Arrange the data in order.

5, 6, 12, 13, 15, 18, 22, 50

2 Find the median (Q2). Since n = 8 is even,
$$Q_2 = MD = \frac{X_{(4)} + X_{(5)}}{2} = \frac{13 + 15}{2} = 14$$
3 Find the median of the data values less than 14, that is, 5, 6, 12, 13 (n1 = 4):
$$Q_1 = \frac{X_{(2)} + X_{(3)}}{2} = \frac{6 + 12}{2} = 9$$
4 Find the median of the data values greater than 14, that is, 15, 18, 22, 50 (n2 = 4), taking the 2nd and 3rd values of this upper half:
$$Q_3 = \frac{18 + 22}{2} = 20$$

In addition to dividing the data set into four groups, quartiles can be used as a
rough measurement of variability.

The interquartile range (IQR) is defined as the difference between Q3 and Q1, IQR = Q3 − Q1, and is the range of the middle 50% of the data.

The interquartile range is used to identify outliers, and it is also used as a measure of variability in exploratory data analysis, as shown in Section 3–4.

Deciles divide the distribution into 10 groups. They are denoted by D1, D2, · · · , D9.


Deciles can be found by using the formulas given for percentiles.

Taken altogether then, these are the relationships among percentiles, deciles,
and quartiles.
Deciles are denoted by D1 , D2 , D3 , · · · , D9 , and they correspond to P10 , P20 , P30 ,
· · · , P90 .
Quartiles are denoted by Q1 , Q2 , Q3 and they correspond to P25 , P50 , P75 .
The median is the same as P50 or Q2 or D5 .

Outliers:

An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.


Procedure for Identifying Outliers


1 Arrange the data in order and find Q1 and Q3 .

2 Find the interquartile range: IQR = Q3 − Q1 .

3 Multiply the IQR by 1.5.

4 Subtract the value obtained in 3 from Q1 and add the value to Q3 .

5 Check the data set for any data value that is smaller than Q1 − 1.5(IQR) or
larger than Q3 + 1.5(IQR).

Example 27 (Example 3-37.)

Check the following data set for outliers.


5, 6, 12, 13, 15, 18, 22, 50

Solution:


1 From Example 26; we have Q1 = 9 and Q3 = 20.

2 The interquartile range: IQR = Q3 − Q1 = 20 − 9 = 11

3 1.5(IQR) = 1.5(11) = 16.5

4 Calculate the interval:

Q1 − 1.5(IQR) = 9 − 16.5 = −7.5

Q3 + 1.5(IQR) = 20 + 16.5 = 36.5

5 The value 50 is outside the interval [−7.5, 36.5]; hence, it can be considered an outlier for these data.
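
The quartile and outlier procedures can be combined in one sketch. The function names are illustrative. One assumption goes beyond the text: the lower and upper halves are split by position in the sorted data, and the middle value is left out of both halves when n is odd. The data set must contain at least two values.

```python
def _median_sorted(data):
    """Median of an already sorted list."""
    n = len(data)
    mid = n // 2
    return data[mid] if n % 2 else (data[mid - 1] + data[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]   # assumption: skip the middle value if n is odd
    return _median_sorted(lower), _median_sorted(data), _median_sorted(upper)


def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and the interval itself."""
    q1, _, q3 = quartiles(values)
    step = 1.5 * (q3 - q1)
    low, high = q1 - step, q3 + step
    return [x for x in values if x < low or x > high], (low, high)


data = [15, 13, 6, 5, 12, 50, 22, 18]
print(quartiles(data))     # (9.0, 14.0, 20.0)
print(iqr_outliers(data))  # ([50], (-7.5, 36.5))
```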

Measures of Skewness and Kurtosis:

Skewness is a measure of symmetry, or more precisely, the lack of symmetry.


A distribution, or data set, is symmetric if it looks the same to the left and right
of the center point.


Pearson’s first coefficient of skewness
$$PC = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \qquad (18)$$

If the data have a weak mode or several modes, Pearson’s second coefficient may be superior,
$$PC = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} \qquad (19)$$
1 If −0.5 < PC < 0.5, the data are nearly symmetrical.

2 If −1 < PC < −0.5 or 0.5 < PC < 1, the data are slightly skewed.

3 If PC < −1 or PC > 1, the data are extremely skewed.

4 If PC > 0, the data are positively skewed; if PC < 0, the data are negatively skewed.

5 If PC = 0, the mean coincides with the mode (or the median), and the coefficient indicates no skewness.


Example 28

Calculate the coefficient of skewness for the data set 5, 6, 12, 13, 15, 18, 22, 50
(the data of Example 26).

Solution:
Since these data have no mode, we use Pearson’s second coefficient of skewness.
With \sum X = 141 and \sum X^2 = 3907,

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{141}{8} = 17.625 \]

\[ s = \sqrt{\frac{n\sum X^2 - \left(\sum X\right)^2}{n(n-1)}} = \sqrt{\frac{8(3907) - (141)^2}{8(8-1)}} = 14.252 \]

Since n = 8 is even,

\[ \text{Med.} = \frac{X_{(4)} + X_{(5)}}{2} = \frac{13 + 15}{2} = 14 \]

Then

\[ PC = \frac{3(\bar{X} - \text{Med.})}{s} = \frac{3(17.625 - 14)}{14.252} = 0.763 \]

Since 0.5 < PC < 1, the data are slightly positively skewed.
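
As a quick check of the arithmetic in Example 28, the second coefficient can be computed with Python's standard `statistics` module. This is a minimal sketch; the function names are hypothetical, and `stdev` is the sample standard deviation with divisor n − 1, matching the formula for s used in the example.

```python
from statistics import mean, median, stdev


def pearson_second_skewness(values):
    """Pearson's second coefficient: 3(mean - median) / sample standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)


def describe_skewness(pc):
    """Classify PC with the thresholds listed in the slides.

    Assumption: the boundary values +/-0.5 are counted as slightly skewed.
    """
    if abs(pc) < 0.5:
        return "nearly symmetrical"
    if abs(pc) < 1:
        return "slightly skewed"
    return "extremely skewed"


pc = pearson_second_skewness([5, 6, 12, 13, 15, 18, 22, 50])
print(round(pc, 3), describe_skewness(pc))  # 0.763 slightly skewed
```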

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed
relative to a normal distribution.

The kurtosis coefficient is calculated by the following formula



\[ ku = \frac{m_4}{s^4}, \tag{20} \]

where m_4 is the fourth moment about the mean,

\[ m_4 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^4 . \]

Types of kurtosis:

Leptokurtic or heavy-tailed distribution (ku > 3): the tails are long and heavy,
so outliers are more likely.

Mesokurtic (ku ≅ 3): the kurtosis is close to 3, the value for the normal
distribution.

Platykurtic or short-tailed distribution (ku < 3): the tails are shorter and
lighter than those of the normal distribution, so most of the data points lie
relatively close to the mean.


Example 29

Calculate the coefficient of kurtosis for the data set 5, 6, 12, 13, 15, 18, 22, 50
(the data of Example 26).

Solution

From the data, we have

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{141}{8} = 17.625, \]

and

  i        X    (X − X̄)²       (X − X̄)⁴
  1        5    159.3906       25405.3713
  2        6    135.1406       18262.9885
  3       12     31.6406        1001.1292
  4       13     21.3906         457.5588
  5       15      6.8906          47.4807
  6       18      0.1406           0.0198
  7       22     19.1406         366.3635
  8       50   1048.1406     1098598.7698
  Total  141   1421.8750     1144139.6816


Then,

\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2} = \sqrt{\frac{1421.875}{8-1}} = 14.252 \]

and

\[ m_4 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^4 = \frac{1144139.682}{8} = 143017.460 \]

then,

\[ ku = \frac{m_4}{s^4} = \frac{143017.460}{(14.252)^4} = 3.466 \]

Therefore, the distribution of the data is leptokurtic (heavy-tailed), because ku > 3.
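
The calculation in Example 29 can be reproduced with a short Python sketch. The function name is hypothetical; following formula (20), the fourth moment about the mean is divided by n, while s is the sample standard deviation with divisor n − 1.

```python
from statistics import mean, stdev


def kurtosis_coefficient(values):
    """ku = m4 / s^4, with m4 the fourth moment about the mean (divisor n)
    and s the sample standard deviation (divisor n - 1)."""
    n = len(values)
    x_bar = mean(values)
    m4 = sum((x - x_bar) ** 4 for x in values) / n
    return m4 / stdev(values) ** 4


ku = kurtosis_coefficient([5, 6, 12, 13, 15, 18, 22, 50])
print(round(ku, 3))  # 3.466
```

Other definitions of kurtosis (for example, dividing both moments by n, or subtracting 3 to report excess kurtosis) give different numbers for the same data, so results from statistical software may not match this value.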

3-4 Exploratory Data Analysis

Exploratory data analysis is a methodology in statistics you can use to
investigate your raw data for patterns, trends, and anomalies.


In exploratory data analysis (EDA), data can be organized using a stem and leaf
plot.

The measure of central tendency used in EDA is the median.

The measure of variation used in EDA is the interquartile range Q3 − Q1 .

In EDA the data are represented graphically using a boxplot (sometimes called a
box-and-whisker plot).

A boxplot can be used to graphically represent the data set. These plots involve
five specific values:

The Five-Number Summary and Boxplots

1 The lowest value of the data set (i.e., minimum)

2 Q1

3 The median

4 Q3

5 The highest value of the data set (i.e., maximum)


These values are called a five-number summary of the data set.

A boxplot is a graph of a data set obtained by drawing a horizontal line from
the minimum data value to Q1 , drawing a horizontal line from Q3 to the
maximum data value, and drawing a box whose vertical sides pass through Q1
and Q3 with a vertical line inside the box passing through the median or Q2 .

Example 30 (Example 3-38: Number of Meteorites Found)

The number of meteorites found in 10 states of the United States is 89, 47, 164, 296,
30, 215, 138, 78, 48, 39. Construct a boxplot for the data.

Solution:


1 Arrange the data in order:

30, 39, 47, 48, 78, 89, 138, 164, 215, 296

2 Find the median. Since n = 10 is even,

\[ MD = \frac{X_{(5)} + X_{(6)}}{2} = \frac{78 + 89}{2} = 83.5 \]

3 Find the median for the first half of data: 30, 39, 47, 48, 78, then Q1 = 47.

4 Find the median for the second half of data: 89, 138, 164, 215, 296, then
Q3 = 164.

5 Draw a scale for the data on the x axis.

6 Locate the lowest value, Q1 , median, Q3 , and the highest value on the scale.

7 Draw a box around Q1 and Q3 , draw a vertical line through the median, and
connect the upper value and the lower value to the box.
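
Steps 1 to 4 above, which produce the five-number summary, can be sketched in Python as follows. The function name is hypothetical, and Q1 and Q3 are assumed to be the medians of the lower and upper halves of the ordered data (leaving out the middle value when n is odd), as in the solution above.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves of
    the ordered data; the middle value is left out when n is odd.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    data = sorted(values)
    half = len(data) // 2
    return (data[0], median(data[:half]), median(data),
            median(data[-half:]), data[-1])


meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
print(five_number_summary(meteorites))  # (30, 47, 83.5, 164, 296)
```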


Figure 2: The boxplot for Number of Meteorites Found.

Some information can be obtained from a boxplot as follows:

1 If the median is near the center of the box, the distribution is approximately
symmetric.

2 If the median falls to the left of the center of the box, the distribution is
positively skewed.

3 If the median falls to the right of the center, the distribution is negatively
skewed.

4 If the lines are about the same length, the distribution is approximately
symmetric.

5 If the right line is longer than the left line, the distribution is positively skewed.

6 If the left line is longer than the right line, the distribution is negatively skewed.

From Figure 2, the distribution of the data for Example 30 is positively skewed.
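
The six reading rules can also be applied numerically to a five-number summary. The sketch below is only illustrative: the function name and the tolerance `tol`, which decides what counts as "near the center" or "about the same length", are assumptions and not part of the rules themselves.

```python
def boxplot_shape(minimum, q1, med, q3, maximum, tol=0.1):
    """Return (reading from the box, reading from the lines).

    tol is a hypothetical relative tolerance for "near the center" and
    "about the same length"; the slides do not specify one.
    """
    # Rules 1-3: position of the median relative to the center of the box.
    offset = med - (q1 + q3) / 2
    if abs(offset) <= tol * (q3 - q1):
        box = "approximately symmetric"
    elif offset < 0:
        box = "positively skewed"   # median falls to the left of the center
    else:
        box = "negatively skewed"   # median falls to the right of the center

    # Rules 4-6: compare the lengths of the left and right lines.
    left, right = q1 - minimum, maximum - q3
    if abs(right - left) <= tol * max(left, right):
        lines = "approximately symmetric"
    elif right > left:
        lines = "positively skewed"
    else:
        lines = "negatively skewed"
    return box, lines


# Five-number summary of the meteorite data in Example 30.
print(boxplot_shape(30, 47, 83.5, 164, 296))
# ('positively skewed', 'positively skewed')
```

For the meteorite data, the median (83.5) lies to the left of the box center (105.5), and the right line (296 − 164 = 132) is much longer than the left line (47 − 30 = 17), so both readings agree.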

Exercises 3
Page 118: 1, 3, 18.

Page 120: 26, 28, 30

Page 137: 7, 19, 27

Page 139: 30, 31

Page 153: 9, 12, 13

Page 154: 20, 23

Page 166: 1, 8, 11
