Introduction To Statistical Programming - PPT Week 2 - Descriptive Statistics
• We can describe categorical variables using frequency distribution tables and graphs such as bar
charts, pie charts, and Pareto diagrams.
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES
• A frequency distribution is a table used to organize data. The left column (called classes or
groups) includes all possible responses on a variable being studied. The right column is a
list of the frequencies, or number of observations, for each class.
• A relative frequency distribution is obtained by dividing each frequency by the number of
observations and multiplying the resulting proportion by 100%.
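As a minimal sketch of the two definitions above, the following Python snippet builds a frequency distribution and a relative frequency distribution. The survey responses are hypothetical and are not taken from the slides.

```python
from collections import Counter

# Hypothetical survey responses (illustrative only, not from the slides).
responses = ["Yes", "No", "Yes", "Undecided", "Yes", "No", "Yes", "No"]

# Frequency distribution: each class mapped to its number of observations.
frequencies = Counter(responses)
n = len(responses)

# Relative frequency: each frequency divided by n, then multiplied by 100%.
relative = {cls: 100 * count / n for cls, count in frequencies.items()}

# Print the table, most frequent class first.
for cls, count in frequencies.most_common():
    print(f"{cls:<10} {count:>3} {relative[cls]:6.1f}%")
```

The relative frequencies always sum to 100%, which is a quick check that no observation was dropped.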
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES
• The classes that we use to construct frequency distribution tables of a categorical variable
are simply the possible responses to the categorical variable.
• If our intent is to draw attention to the frequency of each category, then we will most likely
draw a bar chart.
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES
A cross table, sometimes called a crosstab or a contingency table, lists the number of
observations for every combination of values for two categorical or ordinal variables. The
combination of all possible intervals for the two variables defines the cells in a table. A cross
table with r rows and c columns is referred to as an r × c cross table.
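A cross table can be sketched in a few lines of Python by counting each combination of values. The two variables and their values below (region and product) are hypothetical names chosen only for illustration.

```python
from collections import Counter

# Hypothetical paired observations: (region, preferred product).
observations = [
    ("North", "A"), ("North", "B"), ("South", "A"),
    ("South", "A"), ("North", "A"), ("South", "B"),
]

# Count every (row value, column value) combination.
cells = Counter(observations)
rows = sorted({r for r, _ in observations})
cols = sorted({c for _, c in observations})

# Build the r x c table; a Counter returns 0 for combinations never observed.
cross_table = {r: {c: cells[(r, c)] for c in cols} for r in rows}

print("       " + "  ".join(cols))
for r in rows:
    print(f"{r:<6} " + "  ".join(str(cross_table[r][c]) for c in cols))
```

Here the result is a 2 × 2 cross table, and the cell counts add up to the number of observations.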
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES
Figure 1.2 displays this information in a component or stacked bar chart. Figure 1.3 is a cluster, or side-
by-side, bar chart of the same data.
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES
• If we want to draw attention to the proportion of frequencies in each category, then we will probably
use a pie chart to depict the division of a whole into its constituent parts.
• The circle (or “pie”) represents the total, and the segments (or “pieces of the pie”) cut from its center
depict shares of that total.
• The pie chart is constructed so that the area of each segment is proportional to the corresponding
frequency.
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES
A Pareto diagram is a bar chart that displays the frequency of defect causes. The bar at the
left indicates the most frequent cause and the bars to the right indicate causes with
decreasing frequencies. A Pareto diagram is used to separate the “vital few” from the
“trivial many.”
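The ordering behind a Pareto diagram can be sketched as follows. The defect causes and counts are hypothetical, and the cumulative percentage column is a common companion to the bars rather than something the slide specifies.

```python
# Hypothetical defect counts by cause (illustrative only).
defects = {
    "Scratches": 12,
    "Misalignment": 45,
    "Wrong label": 8,
    "Cracks": 30,
    "Other": 5,
}

# Sort causes from most frequent to least frequent (left to right on the chart).
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)
total = sum(defects.values())

# Attach the running cumulative percentage to each cause.
pareto = []
running = 0
for cause, count in ordered:
    running += count
    pareto.append((cause, count, 100 * running / total))

for cause, count, cum_pct in pareto:
    print(f"{cause:<13} {count:>3} {cum_pct:6.1f}%")
```

In this made-up data the first two causes account for 75% of all defects, which is the kind of "vital few" pattern the diagram is meant to reveal.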
GRAPHS TO DESCRIBE TIME-SERIES DATA
Frequency Distributions
• Similar to a frequency distribution for categorical data, a frequency distribution for numerical data
is a table that summarizes data by listing the classes in the left column and the number of
observations in each class in the right column.
• Determining the classes of a frequency distribution for numerical data requires answers to certain
questions: How many classes should be used? How wide should each class be? There are some
general rules
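The sketch below builds a frequency distribution for numerical data using equal-width classes. The data, the starting value, the class width, and the number of classes are all assumptions made for illustration; the slides do not state which rules were used to choose them.

```python
# Hypothetical numerical observations (illustrative only).
data = [12, 17, 21, 25, 28, 31, 33, 38, 42, 47]

# Assumed class choices: 4 classes of width 10 starting at 10,
# i.e. 10 to <20, 20 to <30, 30 to <40, 40 to <50.
lower, width, k = 10, 10, 4

# Count the observations that fall in each class.
counts = [0] * k
for x in data:
    index = (x - lower) // width
    counts[index] += 1

for i, count in enumerate(counts):
    start = lower + i * width
    print(f"{start} to less than {start + width}: {count}")
```

Classes must be inclusive and nonoverlapping, so each observation lands in exactly one class and the counts sum to the number of observations.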
GRAPHS TO DESCRIBE NUMERICAL VARIABLES
Histogram
A histogram is a graph that consists of vertical bars constructed on a horizontal line that is marked off
with intervals for the variable being displayed. The intervals correspond to the classes in a frequency
distribution table. The height of each bar is proportional to the number of observations in that interval.
The number of observations can be displayed above the bars.
Ogive
An ogive, sometimes called a cumulative line graph, is a line that connects points that are the cumulative
percent of observations below the upper limit of each interval in a cumulative frequency distribution.
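The points that an ogive connects can be computed from a frequency distribution, as in this sketch. The class upper limits and frequencies are hypothetical.

```python
from itertools import accumulate

# Hypothetical class upper limits and class frequencies (illustrative only).
upper_limits = [20, 30, 40, 50]
frequencies = [2, 3, 3, 2]

n = sum(frequencies)

# Running totals give the cumulative frequency distribution.
cumulative = list(accumulate(frequencies))

# Convert each cumulative frequency to a cumulative percent.
cumulative_pct = [100 * c / n for c in cumulative]

# Each (upper limit, cumulative percent) pair is one point on the ogive.
ogive_points = list(zip(upper_limits, cumulative_pct))
print(ogive_points)
```

The last point always reaches 100%, because every observation lies below the upper limit of the final class.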
GRAPHS TO DESCRIBE NUMERICAL VARIABLES
Stem-and-Leaf Display
• A stem-and-leaf display is an exploratory data analysis (EDA) graph that is an alternative to the histogram. Data are grouped
according to their leading digits (called stems), and the final digits (called leaves) are listed
separately for each member of a class. The leaves are displayed individually in ascending order
after each of the stems.
• Describe the following random sample of 10 final exam grades for an introductory accounting class
with a stem-and-leaf display.
88 51 63 85 79 65 79 70 73 77
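One possible way to produce the display for these ten grades is sketched below, taking the tens digit as the stem and the units digit as the leaf. The variable names are illustrative.

```python
from collections import defaultdict

# The ten final exam grades from the example above.
grades = [88, 51, 63, 85, 79, 65, 79, 70, 73, 77]

# Group leaves (units digits) under their stems (tens digits).
# Sorting first keeps the leaves in ascending order within each stem.
stems = defaultdict(list)
for grade in sorted(grades):
    stem, leaf = divmod(grade, 10)
    stems[stem].append(leaf)

# One text line per stem, stems in ascending order.
lines = [
    f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
print("\n".join(lines))
# 5 | 1
# 6 | 3 5
# 7 | 0 3 7 9 9
# 8 | 5 8
```

The display shows at a glance that half of the grades fall in the 70s, while still preserving every individual value.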
GRAPHS TO DESCRIBE NUMERICAL VARIABLES
Scatter Plot
We can prepare a scatter plot by locating one point for each pair of two variables that represent an
observation in the data set. The scatter plot provides a picture of the data, including the following:
1. The range of each variable
2. The pattern of values over the range
3. A suggestion as to a possible relationship between the two variables
4. An indication of outliers (extreme points)
GRAPHS TO DESCRIBE NUMERICAL VARIABLES
DESCRIBING DATA: NUMERICAL
MEASURES OF CENTRAL TENDENCY AND LOCATION
Arithmetic Mean
The arithmetic mean (or simply mean) of a set of data is the sum of the data values divided by the
number of observations. If the data set is the entire population of data, then the population mean, μ, is a
parameter given by
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
where N is the population size.
MEASURES OF CENTRAL TENDENCY AND LOCATION
Median
The median is the middle observation of a set of observations that are arranged in increasing (or
decreasing) order. If the sample size, n, is an odd number, the median is the middle observation. If the
sample size, n, is an even number, the median is the average of the two middle observations. The median
will be the number located in the 0.50(n + 1)th ordered position.
Mode
The mode, if one exists, is the most frequently occurring value. A distribution with one mode is called
unimodal; with two modes, it is called bimodal; and with more than two modes, the distribution is
said to be multimodal. The mode is most commonly used with categorical data.
MEASURES OF CENTRAL TENDENCY AND LOCATION
The demand for bottled water increases during the hurricane season in Florida. The number of 1-gallon bottles of
water sold for a random sample of n = 12 hours in one store during hurricane season is:
60 84 65 67 75 72
80 85 63 82 70 75
Describe the central tendency of the data.
MEASURES OF CENTRAL TENDENCY AND LOCATION
The average or mean hourly number of 1-gallon bottles of water demanded is found as follows:
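The arithmetic for the bottled water sample can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; it also reports the median and mode, since the example asks for the central tendency of the data.

```python
import statistics

# Hourly sales of 1-gallon bottles from the example above (n = 12).
sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

# Mean: sum of the values (878) divided by the number of observations (12).
mean = statistics.mean(sales)

# Median: with n even, the average of the 6th and 7th ordered values (72 and 75).
median = statistics.median(sales)

# Mode: the most frequently occurring value (75 appears twice).
mode = statistics.mode(sales)

print(f"mean   = {mean:.1f}")   # mean   = 73.2
print(f"median = {median}")     # median = 73.5
print(f"mode   = {mode}")       # mode   = 75
```

The mean and median are close to each other here, which suggests the sample is not strongly skewed.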
MEASURES OF CENTRAL TENDENCY AND LOCATION
The five-number summary refers to the five descriptive measures: minimum, first quartile, median,
third quartile, and maximum.
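As an illustration, the five-number summary of the bottled water sample can be computed as below. Quartile conventions differ between textbooks and software; this sketch assumes the convention that places the pth quantile at ordered position p(n + 1), which is what the `"exclusive"` method of `statistics.quantiles` (Python 3.8 or later) implements.

```python
import statistics

# Hourly sales of 1-gallon bottles from the earlier example (n = 12).
sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

# Quartiles under the assumed p(n + 1) position convention.
q1, median, q3 = statistics.quantiles(sales, n=4, method="exclusive")

five_number = (min(sales), q1, median, q3, max(sales))
print(five_number)  # (60, 65.5, 73.5, 81.5, 85)

# Two simple measures of spread that follow from the summary.
data_range = max(sales) - min(sales)   # maximum minus minimum
iqr = q3 - q1                          # third quartile minus first quartile
print(data_range, iqr)                 # 25 16.0
```

Other packages may report slightly different quartiles for the same data if they use a different interpolation rule.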
• In this section we present descriptive numbers that measure the variability or spread of the observations from the
mean. In particular, we include the range, interquartile range, variance, standard deviation, and coefficient of
variation.
• While two data sets could have the same mean, the individual observations in one set could vary more
from the mean than do the observations in the second set. Consider the following two sets of sample
data
• Although the mean is 10 for both samples, clearly the data in sample A are farther from 10 than are
the data in sample B. We need descriptive numbers to measure this spread.
RANGE AND INTERQUARTILE RANGE
Box-and-Whisker Plot
A box-and-whisker plot is a graph that describes the shape of a distribution in terms of the five-number summary:
the minimum value, first quartile (25th percentile), the median, the third quartile (75th percentile), and the
maximum value. The inner box shows the numbers that span the range from the first to the third quartile. A line is
drawn through the box at the median. There are two “whiskers.” One whisker is the line from the 25th percentile to
the minimum value; the other whisker is the line from the 75th percentile to the maximum value.
RANGE AND INTERQUARTILE RANGE
Calculate the standard deviation of daily sales for Gilotti Pizzeria, Location 1. The daily sales for
Location 1 are:
6 8 10 12 14 9 11 7 13 11
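To calculate the sample variance and standard deviation, follow these three steps:
• Step 1: Calculate the sample mean, x̄. It is equal to 10.1.
• Step 2: Find the difference between each of the daily sales and the mean of 10.1.
• Step 3: Square each difference. The sample variance is the sum of the squared differences divided by
n - 1, and the sample standard deviation is the square root of the variance.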
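The steps for the Location 1 data can be sketched in Python as follows. The final two lines go one step beyond the listed steps by summing the squared differences, dividing by n - 1 to get the sample variance, and taking the square root to get the standard deviation.

```python
import math

# Daily sales for Gilotti Pizzeria, Location 1.
sales = [6, 8, 10, 12, 14, 9, 11, 7, 13, 11]
n = len(sales)

# Step 1: sample mean.
xbar = sum(sales) / n                      # 10.1

# Step 2: difference between each observation and the mean.
deviations = [x - xbar for x in sales]

# Step 3: square each difference.
squared = [d ** 2 for d in deviations]

# Sample variance: sum of squared differences divided by n - 1.
variance = sum(squared) / (n - 1)

# Sample standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean = {xbar}, variance = {variance:.2f}, std dev = {std_dev:.2f}")
# mean = 10.1, variance = 6.77, std dev = 2.60
```

Dividing by n - 1 rather than n is what distinguishes the sample variance from the population variance.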
VARIANCE AND STANDARD DEVIATION
Correlation Coefficient
The correlation coefficient is computed by dividing the covariance by the product of the
standard deviations of the two variables.
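The definition above can be traced step by step in a short sketch. The paired values for x and y are hypothetical, and the sample versions of the covariance and standard deviations (dividing by n - 1) are assumed.

```python
import math

# Hypothetical paired observations (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Sample covariance: average cross-product of deviations, using n - 1.
cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

# Sample standard deviations of each variable.
sx = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))

# Correlation coefficient: covariance divided by the product of the standard deviations.
r = cov / (sx * sy)
print(f"cov = {cov}, r = {r:.4f}")  # cov = 1.5, r = 0.7746
```

Because the covariance is scaled by both standard deviations, r always lies between -1 and +1 and does not depend on the units of measurement.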
MEASURES OF RELATIONSHIPS BETWEEN VARIABLES
Correlation Coefficient
OUTPUT SPSS
NOMINAL ASSOCIATION WITH SPSS
Click Continue, then OK. Your output will look like this:
In the Value column, find the row in which “General Happiness” is the dependent variable. The lambda value is .000,
which indicates that knowing “SEX” does not reduce errors in predicting the dependent variable “Happy”. By this
measure, there is no association between the two variables.
EXERCISES
1.
EXERCISES
2.
REFERENCES
December 2023