Statistics Chapter
Statistics
Statistics is used in many different areas of life to show relationships between certain things.
There are many different types of statistics, of which you will learn about one major segment. This is
where you are given a “bunch” of data that tells you about a certain group of people or things,
for example the different heights in your class.
For example, your teacher measured the heights of 20 people in your class and came up with the
following data:
145 cm, 176 cm, 136 cm, 178 cm, 159 cm, 166 cm, 176 cm, 159 cm, 143 cm, 189 cm, 155 cm, 176 cm,
162 cm, 155 cm, 156 cm, 178 cm, 122 cm, 157 cm, 165 cm, 148 cm
To enter data simply type the first number, e.g. 145 into the calculator and then press to
enter the data into your calculator, then enter the rest of the data into your calculator.
Remember to make sure that your memory for the previous data is cleared before you
enter your new data. To do this press:
Remember that your calculator remembers all your data until you purposefully clear the
memory.
Arranged in ascending order, the heights (in cm) are:
122; 136; 143; 145; 148; 155; 155; 156; 157; 159; 159; 162; 165; 166; 176; 176; 176; 178; 178; 189.
The Mode:
The mode is the value in your ungrouped data that appears the MOST.
The Median:
The median is the value that is in the MIDDLE of your data.
The Mean:
The mean is the average of the data. It is called the “mean” one because it is NASTY to work out.
To find the mode look at your data and look for the value that appears the most times: in our
example this would be: 176.
In this case we can see that the mode does not accurately represent the data that we have.
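To find the median, first find the middle position. With 20 values the middle position is
(20 + 1) ÷ 2 = 10.5. This means that our median lies between the 10th and 11th values: 159 and 159.
It is easy to see that our median is therefore 159.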
To find the mean there are two methods – the long and the short way:
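As a cross-check on the three measures above, here is a minimal Python sketch. It assumes that the “long way” means adding up all the values and dividing by how many there are; the variable names are illustrative only and are not part of the calculator method this chapter uses.

```python
import statistics

# Heights (in cm) of the 20 learners from the example above
heights = [145, 176, 136, 178, 159, 166, 176, 159, 143, 189,
           155, 176, 162, 155, 156, 178, 122, 157, 165, 148]

# Long way for the mean: add everything up, then divide by the number of values
mean_long_way = sum(heights) / len(heights)

# The standard library gives the same three measures directly
mode = statistics.mode(heights)      # the value that appears most often
median = statistics.median(heights)  # the middle of the sorted data
mean = statistics.mean(heights)

print("mode:", mode)
print("median:", median)
print("mean:", mean, "(long way:", mean_long_way, ")")
```

Note that the mode sits well above both the median and the mean, which is why it is a poor summary of this particular data set.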
Grouped data is data that has already been organised in some form. For example,
your class writes a maths test and gets the following marks:
56; 78; 34; 89; 67; 45; 65; 67; 69; 21;
49; 35; 67; 72; 78; 83; 75; 48; 63; 58;
Your teacher groups the data in order to give herself an idea of what the spread of the data looks
like.
To find the median, you need to add a cumulative frequency column to the table:
Mark       Freq.   Cum. Freq.
0 – 9      0       0
10 – 19    0       0
20 – 29    1       1
30 – 39    2       3
40 – 49    4       7
50 – 59    5       12
60 – 69    9       21
70 – 79    5       26
80 – 89    3       29
90 – 99    1       30
Then look for the first group whose cumulative frequency reaches 15 (half of the 30 marks). That
group is the median group, in this case 60 – 69.
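The search for the median group can be sketched in Python as follows. This is only an illustration of the steps described above; the function name `median_group` and the way the table is stored are assumptions made for the example.

```python
# Frequency table from the text: (group label, frequency)
table = [
    ("0-9", 0), ("10-19", 0), ("20-29", 1), ("30-39", 2), ("40-49", 4),
    ("50-59", 5), ("60-69", 9), ("70-79", 5), ("80-89", 3), ("90-99", 1),
]

def median_group(rows):
    """Return the first group whose cumulative frequency reaches half the total."""
    total = sum(freq for _, freq in rows)
    half = total / 2
    cumulative = 0
    for label, freq in rows:
        cumulative += freq          # build the cumulative frequency as we go
        if cumulative >= half:
            return label, cumulative
    raise ValueError("the table has no data")

label, cumulative = median_group(table)
print("median group:", label, "with cumulative frequency", cumulative)
```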
The range gives the “width” of your data, in the same way that a range of shoes means all the
different types of shoes.
Range = maximum – minimum, or biggest – smallest
Maximum = 88
Minimum = 6
Range = 88 – 6 = 82
Quartiles:
Just as the median finds the middle position of the data (half of the data), the quartiles find the
quarters of the data. Before you can find the quartiles you need to make sure that your data is
arranged in ascending order (i.e. from smallest to biggest).
So looking at the previous example, find the first and third quartile:
Therefore Q1 = 21
Therefore Q3 = 46
Interquartile range:
IQR = Q3 – Q1
= 46 – 21
= 25
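The range, quartiles and interquartile range can be checked with a short Python sketch. It assumes the example data is the set of 11 values that appears in the standard deviation table later in this chapter (minimum 6, maximum 88). It uses the standard library's `statistics.quantiles` with `method="exclusive"`, which places the quartiles at positions based on (n + 1); for 11 values those positions are whole numbers, so no interpolation is involved.

```python
import statistics

# Assumed example data: the 11 values from the standard deviation table
data = [6, 17, 21, 23, 24, 36, 36, 40, 46, 87, 88]

data_range = max(data) - min(data)

# Quartile positions are 1/4, 2/4 and 3/4 of (n + 1): the 3rd, 6th and 9th values here
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print("range:", data_range)
print("Q1:", q1, "median:", q2, "Q3:", q3)
print("IQR:", iqr)
```

For data sets where the quartile position is not a whole number, `statistics.quantiles` interpolates between neighbouring values, which may differ slightly from the hand method your teacher uses.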
Percentiles:
Percentiles divide the data into 100 equal parts – in other words you need to give the actual data
point for the percentage mentioned.
Percentile position =
Percentile position =
It is important to remember that your sample size affects how reliable your mean is, and
how well you can estimate your mean from the median and mode.
For data that is not skewed, the bigger your sample size, the closer together your median
and mean will tend to be and the better you will be able to predict your mean from your median.
The smaller your sample size, the less you will be able to predict your mean from
the median.
In the same way, if your sample size is bigger your standard deviation will be
more accurate and will give a better indication of the spread of the data, whereas
the smaller your sample size, the less accurate your standard deviation is.
Standard Deviation
There are two ways to find the standard deviation – the long way and the short way.
The formula for the standard deviation is $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$
To find the variance simply square your standard deviation answer. Standard deviation tells us
how spread out the data is – the bigger the value of the standard deviation the more spread out
the data is and vice versa.
If the data is not skewed (it is roughly normally distributed) and comes from a relatively large
sample, then we can say the following things about the data:
Approximately 99,7% of the data lies within 3 standard deviations of the mean. Approximately
95% of the data lies within 2 standard deviations of the mean. Finally, approximately 68%
(roughly two thirds) of the data lies within one standard deviation of the mean.
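$\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$
What this formula means is: take each x-data point, subtract the average (mean) and square the
result; add up all of these squares; divide the sum by the number of observations; and finally
take the square root.
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{424}{11} = 38.545455$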
Now put your information in a table:
x      x – x̄          (x – x̄)²
6      6 – 38.54      1 059.206612
17     17 – 38.54     464.2066116
21     21 – 38.54     307.8429752
23     23 – 38.54     241.661157
24     24 – 38.54     211.5702479
36     36 – 38.54     6.479338843
36     36 – 38.54     6.479338843
40     40 – 38.54     2.115702479
46     46 – 38.54     55.57024793
87     87 – 38.54     2 347.842975
88     88 – 38.54     2 445.752066
Total, or ∑(x – x̄)²   7 148.727273
$\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}} = \sqrt{\dfrac{7\,148.727273}{11}} = \sqrt{649.884298} \approx 25.49$
This means that, on average, the data points lie about 25.49 away from the mean.
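The long way can also be written out step by step in Python, which is a useful check on the arithmetic in the table. This is a minimal sketch using the 11 values from the table; it divides by n (the number of observations), as the formula in this chapter does, and the variable names are illustrative only.

```python
import math

# The 11 data values from the table
data = [6, 17, 21, 23, 24, 36, 36, 40, 46, 87, 88]

n = len(data)
mean = sum(data) / n

# One entry per row of the table: (x - mean) squared
squared_deviations = [(x - mean) ** 2 for x in data]
total = sum(squared_deviations)

variance = total / n                 # divide by the number of observations
std_dev = math.sqrt(variance)        # the standard deviation is the square root

print("mean:", round(mean, 6))
print("sum of squared deviations:", round(total, 6))
print("variance:", round(variance, 6))
print("standard deviation:", round(std_dev, 2))
```

Squaring the standard deviation gives back the variance, as described earlier in this section.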
Press
To input your data from the example press the first value (in this case ) and then your
calculator should say DATA SET = 1.
Now do this with all your other data from the example.
Activity 1
1.78; 2.97
h) How many cultures (bacterial experiments) were in the top 15% for growth and
how big would the smallest of this group have been?
i) How would the mean be affected if we added the following 5 results to the data?
Show your working out.
j) Would the standard deviation also be affected by these changes (from question (i)),
and by how much would it change?
C. Box and Whisker Plots
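A box and whisker plot is a diagram of what the data looks like. There are 5 important things
(often called the 5-number summary) that you need to remember in order to draw the plot.
5-number summary: minimum, first quartile (Q1), median, third quartile (Q3), maximum
[Figure: box and whisker plot drawn above a number line from 1 to 12, with the minimum and maximum labelled]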
Example: A coffee shop counts the number of cappuccinos that they sell on any one day for two
weeks. These are their results:
34; 44; 99; 39; 10; 56; 71; 71; 41; 93; 89; 11;
77; 68
First arrange the data in ascending order:
10; 11; 34; 39; 41; 44; 56; 68; 71; 71; 77; 89; 93; 99
Minimum → 10 Maximum → 99
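The rest of the 5-number summary can be sketched in Python. The quartile convention below is an assumption inferred from the answers to Activity 2: the position is a multiple of (n + 1) ÷ 4, and when that position is not a whole number the midpoint of the two neighbouring values is used. Other textbooks and software use slightly different conventions, so treat the helper names and the method as illustrative only.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position; a fractional position gives the midpoint of its neighbours."""
    lower = int(position)                      # whole part of the position
    if position == lower:
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2

def five_number_summary(data):
    ordered = sorted(data)                     # the data must be in ascending order
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    median = value_at_position(ordered, (n + 1) / 2)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return ordered[0], q1, median, q3, ordered[-1]

# Cappuccinos sold per day over two weeks (from the example above)
cappuccinos = [34, 44, 99, 39, 10, 56, 71, 71, 41, 93, 89, 11, 77, 68]

minimum, q1, median, q3, maximum = five_number_summary(cappuccinos)
print("minimum:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "maximum:", maximum)
```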
Activity 2
1. Your teacher measures the heights of your classmates in meters and gets the following
results in your class of 18:
b) Draw box and whisker plots for both sets of data and use them to answer the
questions that follow.
e) What percentage of girls watched between 14 and 28 movies and what percentage
of boys watched between 14 and 28 movies? Therefore were there more girls that
watched between 14 and 28 movies, than boys or vice versa?
3. Below are two box and whisker diagrams for the geography marks of two different classes.
Study them carefully before answering the questions that follow.
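Class 1: minimum 28, Q1 37, median 64, Q3 77, maximum 91
Class 2: minimum 14, Q1 43, median 56, Q3 74, maximum 83
[Figure: the two box and whisker diagrams drawn on a scale from 0 to 100]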
b) In your opinion, which class did better in the test? Give a reason for your answer.
c) What is the interquartile range for each class? Do you think that the interquartile
range gives a better indication of the spread of the data than the range? Give a
reason for your answer.
D. Histograms, Frequency Polygons and Ogives
A histogram represents the distribution of data for grouped data (or the frequency of data against
the type of observations) – like a bar graph but with NO gaps.
We can draw a histogram with the marks on the x-axis and the frequency on the y-axis.
[Figure: histogram of frequency (y-axis) against marks (x-axis)]
A frequency polygon is a line graph with the frequency on the y-axis and the groups on the x-axis.
The frequency is plotted at the mid-point of each group.
[Figure: frequency polygon of frequency (y-axis) against marks (x-axis)]
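The points of a frequency polygon can be worked out before plotting. The sketch below assumes the grouped marks table from earlier in the chapter and takes the mid-point of each group as (lowest mark + highest mark) ÷ 2; the names used are illustrative only.

```python
# Grouped marks from the earlier table: (lowest mark, highest mark, frequency)
groups = [
    (0, 9, 0), (10, 19, 0), (20, 29, 1), (30, 39, 2), (40, 49, 4),
    (50, 59, 5), (60, 69, 9), (70, 79, 5), (80, 89, 3), (90, 99, 1),
]

# Each point of the frequency polygon is (mid-point of the group, frequency)
polygon_points = [((low + high) / 2, freq) for low, high, freq in groups]

for midpoint, freq in polygon_points:
    print(f"plot frequency {freq} at mark {midpoint}")
```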
An ogive is a cumulative frequency graph in which the plotted points are joined by a line. The
cumulative frequency is plotted at the end of each corresponding group, either in frequency (most
commonly) or percentage format.
From the table our ogive should look like this:
[Figure: ogive of cumulative frequency (y-axis) against marks (x-axis)]
You can use your ogive to find your median and your first and third quartiles as well.
Find 15.5 on your cumulative frequency axis, draw a line across until it meets the graph and then
draw a line down to the x-axis to find the corresponding median.
To find the first and third quartile follow the same procedure:
First Quartile:
Third Quartile:
[Figure: ogive of cumulative frequency against marks, with Q1, the median (M) and Q3 read off the marks axis]
Thus the first quartile is approximately 53.
Activity 3
1. The department of road-works needs to decide whether a certain road needs another lane.
They decide to count the number of cars that travel on a certain part of the road every day
at a certain time. They do this for 1 month. This is the data that they found:
a) Draw up a table containing the above information and group the data in groups of
500 along with their frequency and cumulative frequency.
i) The median
g) Give a possible reason for the day that there were only 300 cars.
2. The tuck-shop at school decides to look at the number of fizzas children at school eat
every day. Below is the table of their findings:
i) The median
3. A car company interviewed 700 university students to determine the age at which they got
their licences. Below are their findings:
4. A certain school decides to find out the average age that a student from their school gets
their first full-time job after leaving school. This is a summary of what they found:
c) From the ogive determine the median, first quartile, and third quartile.
E. The Normal Distribution and Skewness of Data
A normal distribution is a frequency polygon that is symmetrical and where the mean is equal to
the median which is equal to the mode.
Examples of normal distributions would be the graphs of the height of a population or the
intelligence of a population and so on.
In a normal distribution approximately 68% of the data lies within one standard deviation on
either side of the mean. Approximately 95% of the data lies within two standard deviations on
either side of the mean, and approximately 99,7% of the data lies within three standard deviations
on either side of the mean.
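These percentages can be checked against the theoretical normal curve. The sketch below uses Python's `statistics.NormalDist` (available from Python 3.8) with a mean of 0 and a standard deviation of 1; the helper name `share_within` is made up for this example. Real data will only follow these percentages approximately.

```python
from statistics import NormalDist

# A standard normal curve: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k) * 100:.1f}%")
```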
Skewness
The skewness of your data tells you how your data looks and which side of the graph your data
leans towards.
In a normal distribution, your data is unimodal – i.e. it has only one mode or one peak. Sometimes
you will get a bimodal distribution which will have two modes or two peaks.
If your frequency polygon or histogram is symmetrical it means that the peak is approximately in
the middle of the graph and the two ends look approximately the same. On the following page is
an example:
If the mode or tallest column is towards the left then the data tails off to the right and is said to be
skewed to the right or positively skewed.
If the mode or tallest column is towards the right then the data tails off to the left and is said to be
skewed to the left or negatively skewed.
[Figures: graphs c) to h)]
2. Look at the following box-and-whisker plots and determine their skewness.
[Figures: box-and-whisker plots a) to e)]
F. Scatter Plots, Lines of Best Fit and The Regression Line
Scatter plots are graphs with two variables, x and y, for example the number of hours spent
studying vs the mark achieved for that test:
Hours 2 5 3 3 4 1 2 4 6
Mark 62 75 73 69 86 22 47 72 89
Now we use the number of hours as our x-axis because the number of hours spent studying
affects the mark achieved. And we use the marks achieved as the y-axis because the marks are
affected by the number of hours spent studying.
Sometimes you cannot tell which variable affects the other variable, in which case you can
choose which variable goes on which axis.
[Figure: scatter plot of marks (y-axis) against hours spent studying (x-axis)]
From the scatter plot you can see that there is a relationship between the number of hours spent
studying and the marks achieved. If you tried to draw a line through the dots or points plotted it
would look like a straight line.
The line of best fit is the line that works the best with all the plotted data. For example we can see
that the line that fits best with the above data is a straight line shown on the next page:
[Figure: the scatter plot of marks with four candidate lines drawn through it, labelled Line 1 to Line 4]
Remember that data can also have a line that is exponential in shape, logarithmic,
parabolic or even hyperbolic in shape. Be aware of this for when your teacher asks you
what kind of shape or line best fits the given data. Also be aware that you need to be able
to interpret the shape.
Relationships of data
Look at the scatter plot above – the gradient is positive. This means that there is a positive
relationship between the variables.
If the gradient is negative there is a negative relationship between the variables.
Strength
When the line of best fit lies perfectly on all the plotted points it means that the relationship is
ideal or perfect – the one variable directly affects the other variable in proportion.
When the points are spread a little bit from the line, or not all points lie on the line the
relationship is less strong.
When the points are spread very far away from the line it means that the relationship is very
weak.
If you cannot draw a line of best fit it means that there is no relationship between the variables.
[Figure: scatter plot showing no relationship]
Activity 5
1. A driving company does a survey on the number of years spent driving vs the number of
accidents occurring in the last year for that driver.
Number of years driving   2   4   6   8   10  12  14  16  18  20
Number of accidents       17  15  12  7   10  4   5   3   1   1
2. Die-gogo uses a particular pesticide. They test the pesticides on a bug to determine the
resistance that a bug would build up to the pesticide over time.
Number of sprays             1   2   3   4   5   6   7   8   9
Number of bugs still alive   70  42  26  2   8   14  46  53  56
3. A scientist wants to try to determine whether there is a relationship between the heat of
the day and how many murders are committed that day.
Temperature         25  27  23  32  35  19  28  29  26
Number of murders   58  7   21  51  24  14  36  22  13
a) Draw a scatter plot of the above information.
b) Is there a relationship between the temperature and the number of murders per
day?
4. State whether the following have a negative or positive relationship, and whether it is a
strong or weak relationship or no relationship at all.
[Figures: scatter plots a) to f)]
Regression Lines
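The regression line is the line of best fit that is calculated, rather than drawn by eye. It takes
into account the distance of each point from the line, so that the distances above and below the
line cancel each other out. This is very difficult to calculate by hand, so we use a calculator
for this section.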
Press: 1 and then press 1 for Line data.
To enter data put the x-variable in first and then press and then put in the y-variable and
then press .
From our example you would press: 2 62 then the next set: 5 75 and so
on until all the data is entered.
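If you want to check the calculator's answer, the regression line can also be worked out in Python. The sketch below assumes the usual least-squares method, which the text does not name explicitly, and writes the line as y = a + bx; the function name `least_squares_line` is made up for this example.

```python
# Hours studied (x) and marks achieved (y) from the example
hours = [2, 5, 3, 3, 4, 1, 2, 4, 6]
marks = [62, 75, 73, 69, 86, 22, 47, 72, 89]

def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # gradient of the line
    a = mean_y - b * mean_x       # the line passes through (mean_x, mean_y)
    return a, b

a, b = least_squares_line(hours, marks)
print(f"y = {a:.2f} + {b:.2f}x")

# Predicted mark for a learner who studies for 3 hours
print("predicted mark for 3 hours:", round(a + b * 3, 1))
```

The positive gradient agrees with the positive relationship seen in the scatter plot.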
2. A chef wants to determine the type of relationship there is between the number of
customers at his restaurant and the number of eggs he uses on that day. Below is a table of
his information.
Number of customers   94   29  49   64   69   44   98
Number of eggs        384  66  126  276  192  132  390
3. Zookeepers around the world tried to determine the number of bananas a troupe of
monkeys would eat against the number of monkeys in the troupe.
Below is the data they gathered:
Number of monkeys   6   7   12   3   5   1   7   8   2
Number of bananas   75  53  175  22  38  21  84  96  18
The first question only allows a yes or no answer while the second question allows a person to
express their own opinion.
The first question can be used to present skewed information, whereas the second question will
give a more accurate idea of what hot drinks are popular.
If you were a travel company and you wanted to show the dramatic increase in fuel prices which
graph would you use?
If you were the government and wanted to show that diesel prices had not increased very much
which graph would you use?
The only difference between these two graphs is the scale they are drawn with. The first has a
scale of 0 to 10 and goes up in twos, while the second graph has a scale that goes from 0 to 30 and
goes up in fives. Scale makes a big difference.
You need to look out for a couple of things:
- Who does the research / statistics benefit?
- What is being compared? And why are they comparing it?
- Look at the size of the sample → statistics for a small sample are much easier to manipulate
simply by changing one or two values.
- Look at the scale of the graph – think about why they would use that scale.
Activity 7
1. Look at the graphs below with regards to the number of accidents per month in a certain
area which represent the same data:
[Figure: Graph A and Graph B, two line graphs of the number of accidents per month (Jan to Dec) for JHB and CPT]
[Figure: Graph A and Graph B, two graphs of the number of rhino deaths per year (2000 to 2012)]
a) Which graph in your opinion more effectively conveys the plight of the rhinos?
b) Looking at both graphs, what general trend is occurring with regards to the number
of deaths of the rhino per year?
c) Which graph would you use if you were trying to raise funds to raise awareness
about the rhino’s plight?
Answers to Activities:
Activity 1:
1. a) $\bar{x} = \dfrac{\sum x}{n}$
c) 1 800
Interquartile Range = Q3 – Q1
= 1 780 – 1 130
= 650
f) Note: You won’t be expected to use the long method for more than 15 data values.
2. a) $\bar{x} = \dfrac{\sum x}{n}$
b) To find the median, we first need to arrange the data in ascending order:
1,00 1,32 1,78 2,02 2,63 2,71
2,97 3,21 3,70 3,78 3,79 3,79
3,84 3,92 4,06 4,11 4,33 4,57
5,11 5,32
c) mode = 3,79
Interquartile Range = Q3 – Q1
= 4,085 – 2,67
= 1,415
f)
h) 15% of 20 = 3
the third biggest growth of bacteria = 4,57
i) New mean
j) new
Yes, the standard deviation would be affected: it would increase by 0,545.
Activity 2
1. a) Before you can give a 5-number summary, you need to put the data in ascending
order:
1,16 1,19 1,26 1,32 1,39 1,39
1,41 1,49 1,51 1,55 1,56 1,67
1,70 1,75 1,77 1,78 1,83 1,96
c) 50%
2. a) Girls: 5 9 12 13 14 14 21 28 29 30
Minimum: 5 Maximum: 32
Quartile 1: 10,5 Quartile 3: 28,5
Median: 14
Boys: 1 14 14 15 18 23 24 28 29 30
Minimum: 1 Maximum: 30
Quartile 1: 14 Quartile 3: 28,5
Median: 20,5
c) Girls: The data is very spread out on the right side of the box and whisker plot,
while the left side of the graph is very “clumped” together.
Boys: The data is more spread out on the left than on the right and 25% of the
values fall between 28,5 and 30.
d) More girls watched fewer than 14 movies: 50% of the girls watched fewer than 14 movies,
while 50% of the boys watched fewer than 20,5 movies.
3. a) Class 1 Range = 91 – 28 = 63
Class 2 Range = 83 – 14 = 69
Therefore Class 2 has the greater range of marks.
b) The first class did better: its lowest mark was 28%, 50% of its marks fall
between 37% and 77%, and its top 25% of marks fall between 77% and 91%. This is
better than the marks for class two, whose lowest mark was 14%, where 50% of the
marks fall between 43% and 74% and the top 25% of the class only ranged from 74% to
83%, which is much lower than the first class.
c) IQR class 1 = 77 – 37 = 40
IQR class 2 = 74 – 43 = 31
Yes – the interquartile range gives a much better indication of how close the values
are to each other.
Activity 3
1. a)
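Number of Cars     Frequency   Cumulative Frequency
0 – 500            1           1
500 – 1 000        3           4
1 000 – 1 500      4           8
1 500 – 2 000      3           11
2 000 – 2 500      2           13
2 500 – 3 000      8           21
3 000 – 3 500      4           25
3 500 – 4 000      2           27
4 000 – 4 500      1           28
4 500 – 5 000      2           30
Total              30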
b)
c) 2 500 – 3 000
d)
e) Ogive:
f) i) Median → 2 500 – 3 000
ii) Quartile 1 → 1 000 – 1 500
iii) Quartile 3 → 3 000 – 3 500
2. a)
b)
c)
3. a)
b)
c)
4. a)
Activity 5
1. a)
b) A straight line.
c)
d) A strong negative relationship.
2. a)
c) The optimum number of sprays would be when the least number of bugs remain
alive – thus the optimum number of sprays would be 4.
3. a)
b) No there does not appear to be a relationship between the temperature and the
number of murders committed on the day.
4. a) no relationship
b) strong, negative relationship
c) relatively strong, positive relationship
d) relatively strong, negative relationship
e) weak positive relationship
f) relatively strong positive relationship.
Activity 6
1. Regression line:
2. a)
b) a positive relationship
c)
3. a)
b) a positive relationship
c)
4. a)
b) a negative relationship
c)
d)
Activity 7
b) Graph A
c) Graph A – the gap between the two lines looks bigger – thus it would be more
effective in giving the impression that JHB is less safe than CPT.
d) Graph B – the gap between the two lines looks smaller, so it would look as if
approximately the same number of accidents occurred and that there was
therefore not much difference in the safety.
e) The difference is in the size of the graphs and the scale of the graphs. Because
graph A’s scale is more spread out than Graph B’s scale the lines are more spread
out – it also makes the data easier to read.
2. a) Graph A – the pictures between the graph are very graphic and thus effective in
highlighting the plight of the rhinos.
b) The number of rhino deaths has been increasing drastically over the last year.
c) Graph A.