Chapter 03
By
Dr. Abdelfattah Mustafa
Associate Professor of Mathematical Statistics
March 8, 2024
Dr. Abdelfattah Mustafa STAT 3111: General Statistics March 8, 2024 1 / 65
Contents
1 Measures of Central Tendency
2 Measures of Variation
3 Measures of Position
4 Measures of Skewness and Kurtosis
5 Exploratory Data Analysis
Measures of Central Tendency
The Mean:
The Mean is the sum of the values, divided by the total number of values.
The sample mean:
\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i, \tag{1} \]
where n represents the total number of values in the sample.
The data represent the number of days off per year for a sample of individuals
selected from nine different countries. Find the mean.
20 26 40 36 23 42 35 24 30
Solution:
The sample mean is
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} \approx 30.7 \text{ days} \]
The data show the number of patients in a sample of six hospitals who acquired an
infection while hospitalized. Find the mean.
110 76 29 38 105 31
Solution:
The sample mean is
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{110 + 76 + 29 + 38 + 105 + 31}{6} \approx 64.8 \]
The mean of the number of hospital infections for the six hospitals is 64.8.
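The two sample means above can be checked with a short Python sketch. The variable names are illustrative only and are not part of the course material.

```python
from statistics import mean

days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]
infections = [110, 76, 29, 38, 105, 31]

# Equation (1) written out: sum of the values divided by their count
mean_days = sum(days_off) / len(days_off)

# The same computation through the standard library
mean_infections = mean(infections)

print(round(mean_days, 1))        # 30.7
print(round(mean_infections, 1))  # 64.8
```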
▶ The procedure for finding the mean for grouped data uses the midpoints of the
classes.
\[ \bar{X} = \frac{1}{n}\sum_{m=1}^{k} x_m f_m \tag{3} \]
Using the following frequency distribution, find the mean. The data represent the
number of miles run during one week for a sample of 20 runners.
Solution:
\[ \bar{X} = \frac{1}{n}\sum_{m=1}^{k} x_m f_m = \frac{490}{20} = 24.5 \text{ miles} \]
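As a rough check of the grouped-data mean, the sketch below applies equation (3) to the runners' frequency distribution given elsewhere in this chapter (classes 5.5–10.5 through 35.5–40.5). The list names are assumptions made for illustration.

```python
# Class boundaries and frequencies for the 20 runners
boundaries = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
              (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
freqs = [1, 2, 3, 5, 4, 3, 2]

# The midpoint of each class stands in for every value in that class
midpoints = [(lo + hi) / 2 for lo, hi in boundaries]

n = sum(freqs)
grouped_mean = sum(x * f for x, f in zip(midpoints, freqs)) / n

print(n, grouped_mean)  # 20 24.5
```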
The Median:
The median is the midpoint of the data array. The symbol for the median is MD.
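When n is odd, the median is the middle value of the ordered data, $X_{((n+1)/2)}$. When n is even, it is the average of the two middle values:
\[ MD = \frac{X_{(n/2)} + X_{(n/2+1)}}{2}. \]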
The number of rooms in the seven hotels in downtown Pittsburgh is 713, 300, 618,
595, 311, 401, and 292. Find the median.
Solution:
1 Arrange the data in order.
Solution:
1, 2, 3, 3, 4, 7
Since n = 6 is even, the median number of magazines purchased is
\[ MD = \frac{X_{(3)} + X_{(4)}}{2} = \frac{3 + 3}{2} = 3 \]
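The median rule for odd and even n can be sketched in Python as follows. The function name `median_value` is hypothetical; the data are the hotel rooms (n = 7, odd) and the magazine purchases (n = 6, even) from the examples above.

```python
def median_value(values):
    """Median of a list: middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

rooms = [713, 300, 618, 595, 311, 401, 292]
magazines = [1, 2, 3, 3, 4, 7]

print(median_value(rooms))      # 401
print(median_value(magazines))  # 3.0
```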
The Mode:
The value that occurs most often in a data set is called the mode.
A data set that has only one value that occurs with the greatest frequency is
said to be unimodal.
If a data set has two values that occur with the same greatest frequency, both
values are considered to be the mode and the data set is said to be bimodal.
If a data set has more than two values that occur with the same greatest
frequency, each value is used as the mode, and the data set is said to be
multimodal.
When no data value occurs more than once, the data set is said to have no mode.
Find the mode of the signing bonuses of eight NFL players for a specific year. The
bonuses in millions of dollars are: 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
Solution:
Since the value 10 occurs three times, then the mode is $10 million.
Find the mode for the number of branches that six banks have.
Solution:
Since each value occurs only once, there is no mode.
The data show the number of licensed nuclear reactors in the United States for a
recent 15-year period. Find the mode.
104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109.
Solution:
Since the values 104 and 109 both occur 5 times, the modes are 104 and 109. The
data set is said to be bimodal.
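The mode examples can be reproduced with `statistics.multimode` (Python 3.8 or later). The wrapper `modes` is a hypothetical helper that follows the convention used here: when no value repeats, the data set has no mode.

```python
from statistics import multimode

def modes(values):
    """Return the list of modes, or an empty list when no value occurs more than once."""
    found = multimode(values)
    # multimode returns every value when all are unique; treat that as "no mode"
    return [] if len(found) == len(values) else found

bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110, 109, 111, 112, 111, 109]

print(modes(bonuses))       # [10]
print(modes(reactors))      # [104, 109]
print(modes([1, 2, 3, 4]))  # []
```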
The mode and median can be found for grouped data as follows:
$f_m$: the frequency of the median class,
$(\sum f)_{m-1}$: the cumulative frequency before the median class,
$h$: the width of the median class,
$\Delta_1 = f^* - f_1$, $\Delta_2 = f^* - f_2$.
The following frequency distribution represents the number of miles run during one
week for a sample of 20 runners. Find the median and mode.
Classes 5.5-10.5 10.5-15.5 15.5-20.5 20.5-25.5 25.5-30.5 30.5-35.5 35.5-40.5
Frequency 1 2 3 5 4 3 2
Solution:
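Since n/2 = 10, from the cumulative frequency, the median class is (20.5–25.5), so $L = 20.5$, $(\sum f)_{m-1} = 6$, $f_m = 5$, $h = 5$.
For the mode, the class with the highest frequency is also (20.5–25.5), so $L = 20.5$, $f^* = 5$, $f_1 = 3$, $f_2 = 4$, $h = 5$, and
\[ \Delta_1 = f^* - f_1 = 5 - 3 = 2, \qquad \Delta_2 = f^* - f_2 = 5 - 4 = 1 \]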
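The quantities above can be combined in a short Python sketch. It assumes the usual textbook interpolation formulas, median = L + ((n/2 − (Σf)_{m−1}) / f_m)·h and mode = L + (Δ1 / (Δ1 + Δ2))·h, because they match the symbols defined here; treat both formulas and all variable names as assumptions rather than the course's stated method.

```python
boundaries = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
              (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
freqs = [1, 2, 3, 5, 4, 3, 2]
n = sum(freqs)

# Median class: the first class whose cumulative frequency reaches n/2
cum_before = 0
for (lower, upper), f in zip(boundaries, freqs):
    if cum_before + f >= n / 2:
        grouped_median = lower + (n / 2 - cum_before) / f * (upper - lower)
        break
    cum_before += f

# Modal class: the class with the highest frequency f*
i = freqs.index(max(freqs))
lower, upper = boundaries[i]
f_before = freqs[i - 1] if i > 0 else 0             # f1, class before the modal class
f_after = freqs[i + 1] if i + 1 < len(freqs) else 0  # f2, class after the modal class
delta1 = freqs[i] - f_before
delta2 = freqs[i] - f_after
grouped_mode = lower + delta1 / (delta1 + delta2) * (upper - lower)

print(grouped_median)          # 24.5
print(round(grouped_mode, 2))  # 23.83
```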
The Midrange:
The midrange is defined as the sum of the lowest and highest values in the data
set, divided by 2. The symbol MR is used for the midrange.
In the last two winter seasons, the city of Brownsville, Minnesota, reported these
numbers of water-line breaks per month: 2, 3, 6, 8, 4, 1. Find the midrange.
Solution:
\[ MR = \frac{1 + 8}{2} = 4.5 \]
Hence, the midrange is 4.5.
Find the midrange of data for the NFL signing bonuses in Example 3-9. The bonuses
in millions of dollars are: 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
Solution:
\[ MR = \frac{10 + 34.5}{2} = \frac{44.5}{2} = \$22.25 \text{ million} \]
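Both midrange examples follow directly from the definition, as this small sketch shows. The function name `midrange` is illustrative.

```python
def midrange(values):
    """Sum of the lowest and highest values, divided by 2."""
    return (min(values) + max(values)) / 2

breaks = [2, 3, 6, 8, 4, 1]
bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]

print(midrange(breaks))   # 4.5
print(midrange(bonuses))  # 22.25
```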
Solution:
Course                       Credits (w)   Grade (X)
English Composition I        3             A (4 points)
Introduction to Psychology   3             C (2 points)
Biology I                    4             B (3 points)
Physical Education           2             D (1 point)
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i w_i}{\sum_{i=1}^{n} w_i} = \frac{3(4) + 3(2) + 4(3) + 2(1)}{3 + 3 + 4 + 2} = \frac{32}{12} \approx 2.7 \]
The grade point average is 2.7.
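The weighted mean above can be verified with a few lines of Python. The list names are illustrative; the credits act as the weights w and the grade points as the values X.

```python
credits = [3, 3, 4, 2]  # weights w
points = [4, 2, 3, 1]   # grade points X

# Weighted mean: sum of X*w divided by the sum of the weights
gpa = sum(x * w for x, w in zip(points, credits)) / sum(credits)

print(round(gpa, 1))  # 2.7
```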
The Mean:
3 The mean for the data set is unique and not necessarily one of the data values.
4 The mean cannot be computed for the data in a frequency distribution that has
an open-ended class.
5 The mean is affected by extremely high or low values, called outliers, and may
not be the appropriate average to use in these situations.
The Median:
1 The median is used to find the center or middle value of a data set.
2 The median is used when it is necessary to find out whether the data values fall
into the upper half or lower half of the distribution.
4 The median is affected less than the mean by extremely high or extremely low
values.
The Mode:
3 The mode can be used when the data are nominal or categorical, such as
religious preference, gender, or political affiliation.
4 The mode is not always unique. A data set can have more than one mode, or
the mode may not exist for a data set.
The Midrange:
1 The midrange is easy to compute.
2 The midrange gives the midpoint.
3 The midrange is affected by extremely high or low values in a data set.
Distribution Shapes:
Frequency distributions can assume many shapes. The three most important
shapes are positively skewed, symmetric, and negatively skewed. The following
figures show histograms of each.
2 In a symmetric distribution, the data values are evenly distributed on both sides of the mean, see Figure b. In addition, when the distribution is unimodal, then the mean, median, and mode are the same and lie at the center of the distribution.
3 When the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left, the distribution is said to be negatively skewed or left-skewed, see Figure c. Also, the mean is typically to the left of the median, and the mode is to the right of the median.
Measures of Variation
In statistics, to describe a data set accurately, statisticians must know more than the measures of central tendency.
A testing lab wishes to test two experimental brands of outdoor paint to see how long each will last before fading. The testing lab makes 6 gallons of each paint to test. Since different chemical agents are added to each group and only six cans are involved, these two groups constitute two small populations.
The results (in months) are shown. Find the mean of each group.
Brand A 10 60 50 30 40 20
Brand B 35 45 30 35 40 25
Solution:
The mean for brand A is
\[ \mu_A = \frac{1}{N_1}\sum_{i=1}^{N_1} X_{1i} = \frac{210}{6} = 35 \text{ months} \]
The mean for brand B is
\[ \mu_B = \frac{1}{N_2}\sum_{i=1}^{N_2} X_{2i} = \frac{210}{6} = 35 \text{ months} \]
Since µA = µB , you might conclude that both brands of paint last equally well.
However, when the data sets are examined graphically, a somewhat different
conclusion might be drawn. See the following figure.
As Figure 1 shows, even though the means are the same for both brands, the spread, or variation, is quite different.
For the spread or variability of a data set, three measures are commonly used:
range, variance, and standard deviation.
Range:
The range is the highest value minus the lowest value. The symbol R is used for
the range.
R = highest value − lowest value (8)
Solution:
For brand A, the range is:
RA = 60 − 10 = 50 months
For brand B, the range is:
RB = 45 − 25 = 20 months
Therefore,
RA > RB
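Equation (8) applied to both brands, as a quick Python check. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Highest value minus lowest value, as in equation (8)."""
    return max(values) - min(values)

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

print(value_range(brand_a), value_range(brand_b))  # 50 20
```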
The salaries for the staff of the XYZ Manufacturing Co. are shown here. Find the
range.
Staff Salary
Owner $100,000
Manager 40,000
Sales representative 30,000
Workers 25,000
15,000
18,000
Solution:
The range for salary is
R = $100,000 − $15,000 = $85,000
Find the variance and standard deviation for brands A and B paint data in Example
13. The months were
Brand A 10 60 50 30 40 20
Brand B 35 45 30 35 40 25
Solution:
Step 1: Find the mean; from Example 13,
\[ \mu = \frac{1}{N}\sum_{i=1}^{N} X_i = \frac{35 + 45 + 30 + 35 + 40 + 25}{6} = \frac{210}{6} = 35 \]
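The remaining steps can be sketched in Python, assuming the usual population definitions (the squared deviations from the mean are summed and divided by N), since the two groups are treated as small populations. `pvariance` and `pstdev` are the standard-library functions for exactly these definitions; the variable names are illustrative.

```python
from statistics import pvariance, pstdev

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

# Population variance and standard deviation (divide by N, not N - 1)
var_a, sd_a = pvariance(brand_a), pstdev(brand_a)
var_b, sd_b = pvariance(brand_b), pstdev(brand_b)

print(round(var_a, 1), round(sd_a, 1))  # 291.7 17.1
print(round(var_b, 1), round(sd_b, 1))  # 41.7 6.5
```

Under these assumptions brand A varies far more than brand B, even though both have the same mean of 35 months.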
To find the standard deviation of a sample, you must take the square root of the sample variance.
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}. \tag{12} \]
The shortcut formulas for computing the variance and standard deviation for data obtained from samples are as follows.
\[ s^2 = \frac{1}{n(n-1)}\left[ n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 \right] \tag{13} \]
Find the sample variance and standard deviation for the amount of European auto
sales for a sample of 6 years shown. The data are in millions of dollars: 11.2, 11.9,
12.0, 12.8, 13.4, 14.3
Solution:
We can summarize our calculation in the following table:
X       X²
11.2    125.44
11.9    141.61
12.0    144.00
12.8    163.84
13.4    179.56
14.3    204.49
ΣX = 75.6    ΣX² = 958.94
\[ \therefore s^2 = \frac{1}{n(n-1)}\left[ n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 \right] = \frac{6(958.94) - (75.6)^2}{6(6-1)} = 1.276 \]
\[ s = \sqrt{s^2} = \sqrt{1.276} = 1.13 \]
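The shortcut formula (13) can be checked numerically against the definitional sample variance. This is a minimal sketch with illustrative variable names; `statistics.variance` uses the n − 1 divisor.

```python
from math import sqrt
from statistics import variance

sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
n = len(sales)

sum_x = sum(sales)                  # 75.6
sum_x2 = sum(x * x for x in sales)  # 958.94

# Shortcut formula (13)
s2 = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
s = sqrt(s2)

print(round(s2, 3), round(s, 2))  # 1.276 1.13

# Both routes agree up to floating-point error
assert abs(s2 - variance(sales)) < 1e-9
```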
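\[ s^2 = \frac{1}{n-1}\sum_{m=1}^{k} \left(x_m - \bar{X}\right)^2 f_m; \quad \text{or} \tag{14} \]
\[ s^2 = \frac{1}{n(n-1)}\left[ n\sum_{m=1}^{k} x_m^2 f_m - \left(\sum_{m=1}^{k} x_m f_m\right)^2 \right], \tag{15} \]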
The following data represent the number of miles that 20 runners ran during one
week. Find the variance and the standard deviation.
Solution:
We can summarize the calculations in the following table
Class boundaries   Frequency (f_m)   Midpoint (x_m)   x_m f_m   x_m² f_m
5.5 – 10.5         1                 8                8         64
10.5 – 15.5        2                 13               26        338
15.5 – 20.5        3                 18               54        972
20.5 – 25.5        5                 23               115       2645
25.5 – 30.5        4                 28               112       3136
30.5 – 35.5        3                 33               99        3267
35.5 – 40.5        2                 38               76        2888
Total              20                                 490       13310
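\[ s^2 = \frac{1}{n(n-1)}\left[ n\sum_{m=1}^{k} x_m^2 f_m - \left(\sum_{m=1}^{k} x_m f_m\right)^2 \right] = \frac{20(13310) - (490)^2}{20(20-1)} = 68.7 \]
\[ s = \sqrt{68.7} \approx 8.3 \text{ miles} \]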
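The grouped-data calculation can be reproduced from the table with the shortcut formula (15). The variable names in this sketch are illustrative.

```python
from math import sqrt

midpoints = [8, 13, 18, 23, 28, 33, 38]  # x_m
freqs = [1, 2, 3, 5, 4, 3, 2]            # f_m

n = sum(freqs)
sum_xf = sum(x * f for x, f in zip(midpoints, freqs))       # 490
sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))  # 13310

# Shortcut formula (15) for grouped data
s2 = (n * sum_x2f - sum_xf ** 2) / (n * (n - 1))
s = sqrt(s2)

print(sum_xf, sum_x2f)            # 490 13310
print(round(s2, 1), round(s, 1))  # 68.7 8.3
```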
Coefficient of Variation:
Whenever two samples have the same units of measure, the variance and
standard deviation for each can be compared directly.
If we want to compare the variation of two variables, we cannot use the variance or the standard deviation when:
1 The variables might have different means.
2 The variables might have different units.
We need a measure of relative variation that does not depend on either the units or on how large the values are. This measure is the coefficient of variation (CVar), which is defined by:
\[ \text{CVar} = \frac{\text{standard deviation}}{\text{mean}} \times 100 \tag{16} \]
The mean of the number of sales of cars over a 3-month period is 87, and the
standard deviation is 5. The mean of the commissions is $5225, and the standard
deviation is $773. Compare the variations of the two.
Solution:
The coefficients of variation are
Sales:
\[ \text{CVar}_1 = \frac{s_1}{\bar{X}_1} \times 100 = \frac{5}{87} \times 100 = 5.7\% \]
Commissions:
\[ \text{CVar}_2 = \frac{s_2}{\bar{X}_2} \times 100 = \frac{773}{5225} \times 100 = 14.8\% \]
Since the coefficient of variation is larger for commissions, the commissions are more
variable than the sales.
The mean for the number of pages of a sample of women’s fitness magazines is 132,
with a variance of 23; the mean for the number of advertisements of a sample of
women’s fitness magazines is 182, with a variance of 62. Compare the variations.
Solution:
The coefficients of variation are
\[ \text{CVar}_1 = \frac{s_1}{\bar{x}_1} \times 100 = \frac{\sqrt{23}}{132} \times 100 = 3.6\% \]
\[ \text{CVar}_2 = \frac{s_2}{\bar{x}_2} \times 100 = \frac{\sqrt{62}}{182} \times 100 = 4.33\% \]
Since the coefficient of variation is larger for the number of advertisements, the number of advertisements is more variable than the number of pages in women's fitness magazines.
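Equation (16) applied to both examples, as a minimal Python sketch. The function name `cvar` is illustrative; in the second example the standard deviation is the square root of the given variance.

```python
from math import sqrt

def cvar(std_dev, mean):
    """Coefficient of variation in percent, as in equation (16)."""
    return std_dev / mean * 100

# Sales versus commissions
print(round(cvar(5, 87), 1))      # 5.7
print(round(cvar(773, 5225), 1))  # 14.8

# Pages versus advertisements (variances 23 and 62)
print(round(cvar(sqrt(23), 132), 1))  # 3.6
print(round(cvar(sqrt(62), 182), 2))  # 4.33
```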
Measures of Position
Direct comparison of raw scores is impossible, since the exams might not be
equivalent in terms of number of questions, value of each question, and so on.
However, a comparison of a relative standard similar to both can be made.
\[ Z \text{ score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \tag{17} \]
Solution:
For calculus the z score is
\[ z_1 = \frac{X - \bar{X}}{s_1} = \frac{65 - 50}{10} = 1.5 \]
For history the z score is
\[ z_2 = \frac{Y - \bar{Y}}{s_2} = \frac{30 - 25}{5} = 1.0 \]
Since the Z score for calculus is larger, her relative position in the calculus class is
higher than her relative position in the history class.
Find the z score for each test, and state which is higher.
Test A XA = 38 X̄A = 40 sA = 5
Test B XB = 94 X̄B = 100 sB = 10
Solution:
For test A:
\[ z_A = \frac{X_A - \bar{X}_A}{s_A} = \frac{38 - 40}{5} = -0.4 \]
For test B:
\[ z_B = \frac{X_B - \bar{X}_B}{s_B} = \frac{94 - 100}{10} = -0.6 \]
The score for test A is relatively higher than the score for test B.
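The z scores in both examples follow from equation (17). The function name `z_score` in this sketch is illustrative.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# Calculus versus history
print(z_score(65, 50, 10), z_score(30, 25, 5))  # 1.5 1.0

# Test A versus test B
z_a = z_score(38, 40, 5)
z_b = z_score(94, 100, 10)
print(z_a, z_b)  # -0.4 -0.6
print("A" if z_a > z_b else "B")  # A
```

A negative z score means the value is below the mean; the less negative score (test A) is the relatively higher one.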
Percentiles:
The frequency distribution for the systolic blood pressure readings (in millimeters of
mercury, mm Hg) of 200 randomly selected college students is shown here. Construct
a percentile graph.
Class boundaries Frequency
89.5 – 104.5 24
104.5 – 119.5 62
119.5 – 134.5 72
134.5 – 149.5 26
149.5 – 164.5 12
164.5 – 179.5 4
Solution:
Find cumulative percentages. To do this use the formula.
\[ \text{Cumulative \%} = \frac{\text{cumulative frequency}}{n} \times 100 \]
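The cumulative percentages for the blood pressure data can be computed as in the sketch below. The list names are illustrative; each percentage is paired with the upper boundary of its class, which is where it would be plotted on the percentile graph.

```python
from itertools import accumulate

upper_boundaries = [104.5, 119.5, 134.5, 149.5, 164.5, 179.5]
freqs = [24, 62, 72, 26, 12, 4]

n = sum(freqs)
cum_freqs = list(accumulate(freqs))
cum_pct = [cf * 100 / n for cf in cum_freqs]

for boundary, cf, pct in zip(upper_boundaries, cum_freqs, cum_pct):
    print(boundary, cf, pct)
# 104.5 24 12.0
# 119.5 86 43.0
# 134.5 158 79.0
# 149.5 184 92.0
# 164.5 196 98.0
# 179.5 200 100.0
```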
Graph the data, using class boundaries for the x axis and the percentages for the
y axis.
From the figure, a blood pressure of 130 corresponds to approximately the 70th percentile.
These methods can be used to find the approximate percentile rank of a data
value or to find a data value corresponding to a given percentile.
If c is a whole number, use the value halfway between the cth and (c + 1)st
values when counting up from the lowest value.
A teacher gives a 20-point test to 10 students. The scores are shown here: 18, 15, 12,
6, 8, 2, 3, 5, 20, 10. Find the value corresponding to the 25th percentile.
Solution:
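Arrange the data in order from smallest to largest: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
Compute
\[ c = \frac{n \times p}{100} = \frac{10(25)}{100} = 2.5 \]
Since c is not a whole number, round it up to the next whole number; in this case, c = 3. Counting up from the lowest value, the third value is 5; hence, the value 5 corresponds to the 25th percentile.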
Using the data set in Example 24, find the value that corresponds to the 60th
percentile
Solution:
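Arrange the data in order from smallest to largest: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
Compute
\[ c = \frac{n \times p}{100} = \frac{10(60)}{100} = 6 \]
Since c is a whole number, use the value halfway between the 6th and 7th values:
\[ \frac{10 + 12}{2} = 11 \]
Hence, the value 11 corresponds to the 60th percentile.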
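The counting rule used in the two percentile examples can be sketched as a small Python function. The name `percentile_value` is hypothetical, and the sketch assumes a whole-number percentile p with 0 < p < 100.

```python
import math

def percentile_value(values, p):
    """Data value at the p-th percentile, using c = n*p/100 and the rounding rule above."""
    ordered = sorted(values)
    n = len(ordered)
    if (n * p) % 100 == 0:
        # c is a whole number: halfway between the c-th and (c+1)-st values
        c = n * p // 100
        return (ordered[c - 1] + ordered[c]) / 2
    # c is not a whole number: round up and take the c-th value
    c = math.ceil(n * p / 100)
    return ordered[c - 1]

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

print(percentile_value(scores, 25))  # 5
print(percentile_value(scores, 60))  # 11.0
```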
Quartiles can be computed by using the formula given for computing percentiles,
such that Q1 = P25 , Q2 = P50 and Q3 = P75 .
3 Find the median of the data values that fall below Q2 . (This is Q1 ).
4 Find the median of the data values that fall above Q2 . (This is Q3 ).
Solution:
1 Arrange the data in order.
2 Find the median (Q2).
\[ MD = \frac{X_{(4)} + X_{(5)}}{2} = \frac{13 + 15}{2} = 14 \]
3 Find the median of the data values less than 14 (n1 = 4)
\[ Q_1 = \frac{X_{(2)} + X_{(3)}}{2} = \frac{6 + 12}{2} = 9 \]
4 Find the median of the data values greater than 14 (n2 = 4)
\[ Q_3 = \frac{X_{(2)} + X_{(3)}}{2} = \frac{18 + 22}{2} = 20 \]
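The median-of-halves procedure for quartiles can be sketched in Python. The data set 5, 6, 12, 13, 15, 18, 22, 50 is assumed here because it matches the ordered values used in the solution; the function name `quartiles` is illustrative.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 found as the medians of the lower half, the whole set, and the upper half."""
    ordered = sorted(values)
    q2 = median(ordered)
    lower = [x for x in ordered if x < q2]  # values that fall below Q2
    upper = [x for x in ordered if x > q2]  # values that fall above Q2
    return median(lower), q2, median(upper)

data = [5, 6, 12, 13, 15, 18, 22, 50]

print(quartiles(data))  # (9.0, 14.0, 20.0)
```

Statistical software often uses other interpolation rules for quartiles, so library results can differ slightly from this textbook method.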
In addition to dividing the data set into four groups, quartiles can be used as a
rough measurement of variability.
Deciles divide the distribution into 10 groups, as shown. They are denoted by
D1 , D2 · · · , D9 .
Taken altogether then, these are the relationships among percentiles, deciles,
and quartiles.
Deciles are denoted by D1 , D2 , D3 , · · · , D9 , and they correspond to P10 , P20 , P30 ,
· · · , P90 .
Quartiles are denoted by Q1 , Q2 , Q3 and they correspond to P25 , P50 , P75 .
The median is the same as P50 or Q2 or D5 .
Outliers:
5 Check the data set for any data value that is smaller than Q1 − 1.5(IQR) or
larger than Q3 + 1.5(IQR).
Solution:
5 The value 50 is outside the interval [−7.5, 36.5]; hence, it can be considered an outlier for these data.
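The outlier check can be sketched in Python. It assumes the data set 5, 6, 12, 13, 15, 18, 22, 50 with Q1 = 9 and Q3 = 20, which reproduce the interval [−7.5, 36.5] quoted above; the variable names are illustrative.

```python
data = [5, 6, 12, 13, 15, 18, 22, 50]
q1, q3 = 9, 20

iqr = q3 - q1                # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside [lower_fence, upper_fence] is flagged as an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(iqr, lower_fence, upper_fence)  # 11 -7.5 36.5
print(outliers)                       # [50]
```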
Measures of Skewness and Kurtosis
A distribution, or data set, is symmetric if it looks the same to the left and right
of the center point.
\[ PC = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \tag{18} \]
If the data have a low mode or several modes, Pearson's second coefficient may be superior:
\[ PC = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} \tag{19} \]
1 If −0.5 < P C < 0.5, the data are nearly symmetrical.
2 If −1 < P C < −0.5 or 0.5 < P C < 1, the data are slightly skewed.
Example 28
Calculate the coefficient of skewness for the data set: 5, 6, 12, 13, 15, 18, 22, 50 (Example 26).
Solution:
Since these data do not have a mode, we will use Pearson's second coefficient of skewness. With $\sum X = 141$ and $\sum X^2 = 3907$,
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{141}{8} = 17.625 \]
\[ s = \sqrt{\frac{n\sum X^2 - (\sum X)^2}{n(n-1)}} = \sqrt{\frac{8(3907) - (141)^2}{8(8-1)}} = 14.252 \]
\[ \text{Med.} = \frac{X_{(4)} + X_{(5)}}{2} = \frac{13 + 15}{2} = 14 \]
then
\[ PC = \frac{3(\bar{X} - \text{Med.})}{s} = \frac{3(17.625 - 14)}{14.252} = 0.763 \]
Therefore, since 0.5 < PC < 1, the data are slightly positively skewed.
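Example 28 can be reproduced with the standard library, as in this minimal sketch. `statistics.stdev` is the sample standard deviation (n − 1 divisor), matching the value of s used above; the variable names are illustrative.

```python
from statistics import mean, median, stdev

data = [5, 6, 12, 13, 15, 18, 22, 50]

x_bar = mean(data)  # 17.625
med = median(data)  # 14.0
s = stdev(data)     # sample standard deviation, about 14.252

# Pearson's second coefficient of skewness, equation (19)
pc = 3 * (x_bar - med) / s

print(round(s, 3), round(pc, 3))  # 14.252 0.763
```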
\[ ku = \frac{m'_4}{s^4}, \tag{20} \]
where $m'_4$ is the fourth moment about the mean,
\[ m'_4 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^4. \]
Types of kurtosis:
Leptokurtic or heavy-tailed distribution (ku > 3): the distribution has long, heavy tails, which means outliers are more likely.
Mesokurtic (ku ≅ 3): the kurtosis is close to 3, the same as that of the normal distribution.
Platykurtic or short-tailed distribution (ku < 3): the distribution has light tails, which means most of the data points lie close to the mean.
Example 29
Calculate the coefficient of kurtosis for the data set: 5, 6, 12, 13, 15, 18, 22, 50 (Example 26).
Solution
From the data, we have
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{141}{8} = 17.625, \]
and
i       X     (X − X̄)²     (X − X̄)⁴
1       5     159.3901     25405.3713
2       6     135.1406     18262.9885
3       12    31.6406      1001.1292
4       13    21.3906      457.5588
5       15    6.8906       47.4807
6       18    0.1406       0.0198
7       22    19.1406      366.3635
8       50    1048.1406    1098598.7700
Total   141   1421.875     1144139.6820
Then,
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2} = \sqrt{\frac{1421.875}{8-1}} = 14.252 \]
and
\[ m'_4 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^4 = \frac{1144139.682}{8} = 143017.460 \]
then,
\[ ku = \frac{m'_4}{s^4} = \frac{143017.460}{(14.252)^4} = 3.466 \]
Therefore, the distribution of the data is leptokurtic, or heavy-tailed, because ku > 3.
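The kurtosis calculation in Example 29 can be checked with the sketch below, which follows equation (20) as stated: the fourth moment uses the divisor n, while s² uses n − 1. The variable names are illustrative.

```python
data = [5, 6, 12, 13, 15, 18, 22, 50]
n = len(data)
x_bar = sum(data) / n

# Fourth moment about the mean (divisor n)
m4 = sum((x - x_bar) ** 4 for x in data) / n

# Sample variance (divisor n - 1); s^4 is its square
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)

ku = m4 / s2 ** 2

print(round(m4, 2), round(s2, 3))  # 143017.46 203.125
print(round(ku, 3))                # 3.466
```

Other texts and libraries often define kurtosis with the divisor n in both places, which gives a different number for the same data.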
Exploratory Data Analysis
In exploratory data analysis (EDA), data can be organized using a stem and leaf
plot.
In EDA the data are represented graphically using a boxplot (sometimes called a
box-and-whisker plot).
A boxplot can be used to graphically represent the data set. These plots involve five specific values:
1 The lowest value of the data set
2 Q1
3 The median
4 Q3
5 The highest value of the data set
The number of meteorites found in 10 states of the United States is 89, 47, 164, 296,
30, 215, 138, 78, 48, 39. Construct a boxplot for the data.
Solution:
1 Arrange the data in order:
30, 39, 47, 48, 78, 89, 138, 164, 215, 296
2 Find the median:
\[ MD = \frac{X_{(5)} + X_{(6)}}{2} = \frac{78 + 89}{2} = 83.5 \]
3 Find the median for the first half of data: 30, 39, 47, 48, 78, then Q1 = 47.
4 Find the median for the second half of data: 89, 138, 164, 215, 296, then
Q3 = 164.
6 Locate the lowest value, Q1 , median, Q3 , and the highest value on the scale.
7 Draw a box around Q1 and Q3 , draw a vertical line through the median, and
connect the upper value and the lower value to the box.
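The five values needed for the boxplot can be computed as in this sketch, which uses the median-of-halves rule from the steps above. The variable names are illustrative; drawing the plot itself is left to a plotting library.

```python
from statistics import median

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
ordered = sorted(meteorites)

md = median(ordered)
q1 = median([x for x in ordered if x < md])  # median of the first half
q3 = median([x for x in ordered if x > md])  # median of the second half

five_number_summary = (ordered[0], q1, md, q3, ordered[-1])
print(five_number_summary)  # (30, 47, 83.5, 164, 296)
```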
1 If the median is near the center of the box, the distribution is approximately
symmetric.
2 If the median falls to the left of the center of the box, the distribution is
positively skewed.
3 If the median falls to the right of the center, the distribution is negatively
skewed.
4 If the lines are about the same length, the distribution is approximately
symmetric.
5 If the right line is larger than the left line, the distribution is positively skewed.
6 If the left line is larger than the right line, the distribution is negatively skewed.
From Figure 2, the distribution of the data for Example 30 is positively skewed.
Exercises 3
Page 118: 1, 3, 18.
Page 166: 1, 8, 11