Describing Data
Measures of Variability
Variability - refers to how spread out data points are, or how much they differ from each other and from
the center of a distribution. It indicates the degree to which individual data points are similar to or
different from one another within a dataset or distribution.
Figure 3-6 explanation:
● Distribution A (left graph):
○ Scores spread widely from 0 to 100.
○ This means students’ test scores are very spread out (high variability).
○ Even though the average (mean) score is 50, scores are all over the place.
● Distribution B (right graph):
○ Scores are tightly clustered between 40 and 60.
○ This means students’ scores are close together (low variability).
○ Again, the average (mean) is still 50, but most students scored near that
average.
Main Point:
Both distributions have the same mean (50), but their variability is very different.
● Distribution A shows a wide range of scores.
● Distribution B shows a narrow range of scores.
Range - the range of a distribution is equal to the difference between the highest and lowest
scores.
Example:
Suppose the test scores in a class are: 45, 50, 60, 70, 80
● Highest score = 80
● Lowest score = 45
● Range = 80 − 45 = 35
So, the range of the test scores is 35.
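As a quick check, here is a minimal Python sketch (the variable names are just for illustration) that reproduces the range calculation above:

# Test scores from the example above
scores = [45, 50, 60, 70, 80]

# Range = highest score minus lowest score
range_value = max(scores) - min(scores)
print(range_value)  # 35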
So, as a descriptive statistic of variation, the range provides a quick but gross description of the
spread of scores. Because its value is based entirely on the most extreme scores in a distribution, the
resulting description of variation may be understated or overstated. Better measures of variation include the
interquartile range and the semi-interquartile range.
Interquartile Range - The IQR shows the spread of the middle 50% of the data, so it’s less
affected by extreme values (outliers).
Quartile - refers to a specific point
● Quartiles help you see who’s in the lower group, middle group, and higher group.
● Quartile → splits data into 4 equal parts.
Quarter - refers to an interval
● “Quarter” just means one-fourth of the whole.
● Quarter → splits anything (time, money, objects, etc.) into 4 equal parts.
Semi-interquartile Range - half of the interquartile range (IQR ÷ 2), showing the typical deviation from the median.
Example:
Suppose we have this data set of test scores: 10, 20, 30, 40, 50, 60, 70, 80, 90
1. Median = 50
2. Q2 (50th percentile) = 50 (equal to the median)
3. Q1 (25th percentile) = 30
4. Q3 (75th percentile) = 70
● IQR = Q3 - Q1 = 70 - 30 = 40
● SIQR = IQR ÷ 2 = 40 ÷ 2 = 20
In a perfectly symmetrical distribution, Q1 and Q3 will be exactly the same distance from the
median. If these distances are unequal then there is a lack of symmetry. This lack of symmetry
is referred to as skewness.
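A short Python sketch of the same calculations, assuming NumPy is available. Note that np.percentile's default (linear) interpolation happens to reproduce the quartiles in this worked example, though other quartile rules can give slightly different values:

import numpy as np

# Test scores from the example above
scores = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])

# Quartiles: Q1 (25th), Q2 (50th / median), Q3 (75th percentile)
q1, q2, q3 = np.percentile(scores, [25, 50, 75])

iqr = q3 - q1        # 70 - 30 = 40
siqr = iqr / 2       # 40 / 2 = 20

# Symmetry check: equal distances of Q1 and Q3 from the median suggest no skew
print(q1, q2, q3)        # 30.0 50.0 70.0
print(iqr, siqr)         # 40.0 20.0
print(q2 - q1, q3 - q2)  # 20.0 20.0 -> symmetric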
The Mean Absolute Deviation (MAD) - another tool that can be used to describe the amount
of variability in a distribution.
The formula is:
MAD = ∑|X − X̄| / n
The bars on each side of X − X̄ indicate that it is the absolute value of the deviation score
(ignoring the positive or negative sign and treating all deviation scores as positive). All the
deviation scores are then summed and divided by the total number of scores (n) to arrive at the
average deviation.
● It is the “average” of the “positive distances” of each point from the mean.
● It tells us, on average, how far the numbers are from the mean.
● It is the “average distance” between each data value and the mean.
We make all the differences positive in Mean Absolute Deviation because:
● Some numbers will be above the mean (positive differences) and some will be below
the mean (negative differences).
● If we add them all up, the positives and negatives would cancel each other out, and it
might look like there's no variation when there actually is.
● By taking the absolute value (making the difference positive), we measure the actual
distance from the mean, no matter if it's above or below.
● That's why we make all the differences positive — so we can see the true distance from
the average.
Example data set: 2, 4, 6, 8, 10
1. Find the mean: xˉ = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6
2. Find deviations from the mean:
● |2 – 6| = 4
● |4 – 6| = 2
● |6 – 6| = 0
● |8 – 6| = 2
● |10 – 6| = 4
3. Find the average of these deviations:
MAD = (4 + 2 + 0 + 2 + 4) / 5 = 12 / 5 = 2.4
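The same MAD calculation as a small Python sketch (standard library only; the variable names are illustrative):

# Data set from the example above
data = [2, 4, 6, 8, 10]

mean = sum(data) / len(data)              # 6.0

# Absolute deviations: distance of each value from the mean, sign ignored
abs_devs = [abs(x - mean) for x in data]  # [4.0, 2.0, 0.0, 2.0, 4.0]

mad = sum(abs_devs) / len(data)           # 12 / 5 = 2.4
print(mad)  # 2.4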
Standard Deviation - refers to how spread out the numbers are from the average/mean. It
shows how consistent or how varied the data is.
● If the standard deviation is small, the numbers are close to the mean → the data is
consistent (less varied).
● Consistent results show the values are close to each other and close to the average.
● This means most people/things performed similarly.
● If the standard deviation is large, the numbers are spread out from the mean → the
data is varied (less consistent)
● Varied results show the values are spread out and far apart from the average.
● This means that there's a big difference in performance/measurement.
● The symbol for standard deviation has variously been represented as s, S, SD and the
lowercase Greek letter sigma (σ).
● One custom (the one we adhere to) has it that s refers to the sample standard deviation
and σ refers to the population standard deviation.
● Population standard deviation is used when you have data for the entire population
(all members).
● Sample standard deviation is used when you only have a subset/sample (a part of the
population).
● The “-1” is called Bessel's correction, and it helps make the estimate more accurate
since a sample usually underestimates the true spread.
Why is n-1 called degrees of freedom?
● When we calculate the sample SD, we first need the sample mean.
● Once the sample mean is known, one piece of data is no longer free to vary; it's
already determined by the others.
● The formula for the sample standard deviation is:
s = √( ∑(X − X̄)² / (n − 1) )
Example Data (Sample): 5, 7, 3, 7, 9
Here, n = 5
Step 1: Find the Sample Mean
xˉ = (5 + 7 + 3 + 7 + 9) / 5 = 31 / 5 = 6.2
Step 2: Find Deviations from the Mean and Square Them
(5 − 6.2)² = (−1.2)² = 1.44
(7 − 6.2)² = (0.8)² = 0.64
(3 − 6.2)² = (−3.2)² = 10.24
(7 − 6.2)² = (0.8)² = 0.64
(9 − 6.2)²= (2.8)² = 7.84
Sum of squared deviations = 1.44 + 0.64 + 10.24 + 0.64 + 7.84 = 20.8
Step 3: Divide by n − 1 (degrees of freedom)
s² = 20.8 / (5 − 1) = 20.8 / 4 = 5.2
Step 4: Take the Square Root
s = √5.2 ≈ 2.28
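A minimal Python sketch of the sample standard deviation steps above; statistics.stdev from the standard library also divides by n − 1, so it should agree with the hand calculation:

import statistics

# Sample data from the example above
sample = [5, 7, 3, 7, 9]

mean = sum(sample) / len(sample)           # 6.2
ss = sum((x - mean) ** 2 for x in sample)  # 20.8 (sum of squared deviations)
s = (ss / (len(sample) - 1)) ** 0.5        # divide by n - 1, then take the square root

print(round(s, 2))                          # 2.28
print(round(statistics.stdev(sample), 2))   # 2.28 (same rule: divides by n - 1)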
● The formula for the population standard deviation is:
σ = √( ∑(X − µ)² / N )
Example Data (Population): 2, 4, 6, 8, 10
Population size: N = 5
Step 1: Find the Mean (µ)
μ = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6
Step 2: Find the Squared Deviations
(2 − 6)² = (−4)² = 16
(4 − 6)² =(−2)² = 4
(6 − 6)² = (0)² = 0
(8 − 6)² = (2)² = 4
(10 − 6)² = (4)² = 16
Sum of squared deviations = 16 + 4 + 0 + 4 + 16 = 40
Step 3: Find the Variance
σ² = 40 / N = 40 / 5 = 8
Step 4: Find the Standard Deviation
σ = √8 ≈ 2.83
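A matching sketch for the population standard deviation, where we divide by N instead of n − 1; statistics.pstdev uses the same rule:

import statistics

# Population data from the example above
population = [2, 4, 6, 8, 10]

mu = sum(population) / len(population)       # 6.0
ss = sum((x - mu) ** 2 for x in population)  # 40.0
sigma = (ss / len(population)) ** 0.5        # divide by N, then take the square root

print(round(sigma, 2))                          # 2.83
print(round(statistics.pstdev(population), 2))  # 2.83 (divides by N)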
Variance - it is equal to the arithmetic mean of the squares of the differences between the
scores in a distribution and their mean. The formula used to calculate the variance (s2) using
deviation scores is:
s² = ∑(X − X¯ )² / n
Example Data Set: 4, 8, 6, 10, 12
Step 1: Find the Mean
xˉ = (4 + 8 + 6 + 10 + 12) / 5 = 40 / 5 = 8
Step 2: Find the Deviations from the Mean
● (4 – 8) = –4
● (8 – 8) = 0
● (6 – 8) = –2
● (10 – 8) = 2
● (12 – 8) = 4
Step 3: Square the Deviations
● (–4)² = 16
● 0² = 0
● (–2)² = 4
● 2² = 4
● 4² = 16
Step 4: Find the Average of Squared Deviations
Variance = (16 + 0 + 4 + 4 + 16) / 5 = 40 / 5 = 8
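A quick Python check of the variance formula above (dividing by n, as in s² = ∑(X − X̄)² / n); statistics.pvariance uses the same divisor:

import statistics

# Data set from the example above
data = [4, 8, 6, 10, 12]

mean = sum(data) / len(data)                                # 8.0
variance = sum((x - mean) ** 2 for x in data) / len(data)   # 40 / 5 = 8.0

print(variance)                    # 8.0
print(statistics.pvariance(data))  # also 8 (divides by n)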
Skewness - refers to the nature and extent to which symmetry is absent. It is an indication of
how the measurements in a distribution are distributed.
● Positive skew - when relatively few of the scores fall at the high end of the distribution.
● It shows that the tail goes to the right (higher values; 10, 15, 20).
● Most data are on the low end, but a few very high numbers pull the curve to the right.
● For example: the test scores of the students in an exam. Most students score low, but a
few students get very high scores.
● Its results may indicate that the test was too difficult.
● Negative skew - when relatively few of the scores fall at the low end of the distribution.
● It shows that the tail goes to the left (lower values; -5, 0, 2).
● Most data are on the high end, but a few very low numbers pull the curve to the left.
● For example: the test scores of the students in an exam. Most students score very high,
but a few fail badly.
● Its results may indicate that the test was too easy.
● Zero skew - the distribution is perfectly symmetrical around the mean.
● It shows the balanced bell curve and no leaning to the right or left.
● The data is spread evenly around the average.
● Its results show that the left and right sides of the distribution are mirror images.
● For example: IQ tests. By design, IQ tests are made to follow a normal (balanced)
distribution.
Why are the peaks around 5-7?
● Both the positive skew and negative skew graphs have their peak (the mode, where
most values occur) around 5-7.
● This happens because skewness doesn't move the peak much.
● Skewness only stretches one side of the distribution (the tail), not the whole graph.
● So, the majority of the data (the bulk, the peak) stays in the middle. What changes is
how far the extreme values go on one side.
● Both graphs peak at 5-7 because that's where the bulk of the data lies.
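These notes do not give a formula for skewness, but one common way to quantify it is the moment-based (Fisher-Pearson) coefficient: the average of the cubed z-scores. The sketch below uses that formula with a made-up data set, just to show that a long right tail produces a positive value:

import statistics

def skewness(data):
    # Moment-based skewness: mean of cubed z-scores (one common definition)
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)   # population SD (divides by n)
    return sum(((x - mean) / sd) ** 3 for x in data) / len(data)

# Hypothetical exam scores: most are low, a few very high scores pull the tail to the right
scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 15, 20]
print(round(skewness(scores), 2))  # positive value -> positive (right) skew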
Kurtosis - refers to the steepness of a distribution in its center.
● Platykurtic - relatively flat (low kurtosis)
● It shows a flat peak + light tails.
● Values are spread out more evenly, with few extreme outliers because the tails are light (short).
● For example: exam scores. Where students’ scores spread out evenly from low to high
— no strong clustering and no big outliers.
● Leptokurtic - relatively peaked (high kurtosis)
● It shows a tall, skinny peak + heavy tails.
● But heavy tails → more extreme outliers on both sides.
● For example: stock market returns → most days have small changes, but sometimes
there are huge gains/losses.
● Mesokurtic - somewhere in the middle or between (normal curve)
● It's a “just right” curve. The distribution looks like the usual bell shape.
● Neither too peaked nor too flat.
● Tails are moderate → average number of outliers.
● For example: heights of adults → most people are near the average, a few are
taller/shorter.
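Similarly, one common way to quantify kurtosis is the average of the z-scores raised to the fourth power, minus 3 (so a normal curve comes out near 0). This is only an illustrative sketch with made-up data; the data sets and values are not from the notes:

import statistics

def excess_kurtosis(data):
    # Moment-based excess kurtosis: mean of z-scores to the 4th power, minus 3 (normal ≈ 0)
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(((x - mean) / sd) ** 4 for x in data) / len(data) - 3

flat_like = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # evenly spread -> platykurtic-like
peaked = [50] * 20 + [5, 95]                      # tight cluster plus extreme tails -> leptokurtic-like

print(round(excess_kurtosis(flat_like), 2))  # negative -> flatter than normal
print(round(excess_kurtosis(peaked), 2))     # positive -> more peaked, heavier tails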
● Outliers are data points (values) that are much higher or much lower than most of
the other values in a dataset.
● They are the “odd ones out” — unusual values that don't fit the general pattern.
● For example: student exam scores. Most students score between 70-90 on a test.
However, one student scores 20 (very low) and another scores 100 (perfect).
● In short, outliers are extreme values that stand far away from the rest of the data.