Introduction To Statistical Programming - PPT Week 2 - Descriptive Statistics

DESCRIPTIVE STATISTICS

UNIVERSITAS BINA NUSANTARA

SUBJECT MATTER EXPERT


Rinda Nariswari, S.Si., M.Si.
LEARNING OUTCOME

• LO1: Describe the basic concepts of descriptive and inferential statistics.
• LO2: Calculate the statistical measures related to descriptive and inferential statistics.
• LO3: Use SPSS and SmartPLS to do statistical analysis.
• LO4: Interpret the results of statistical analysis output.
SUBTOPICS

1. Describing Data: Graphical


2. Describing Data: Numerical
3. Association Between Two Variables
4. Calculating Univariate Parameters with SPSS
5. Nominal Associations with SPSS
DESCRIBING DATA: GRAPHICAL
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES

• We can describe categorical variables using frequency distribution tables and graphs such as bar
charts, pie charts, and Pareto diagrams.
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES

• A frequency distribution is a table used to organize data. The left column (called classes or
groups) includes all possible responses on a variable being studied. The right column is a
list of the frequencies, or number of observations, for each class.
• A relative frequency distribution is obtained by dividing each frequency by the number of
observations and multiplying the resulting proportion by 100%.
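
A minimal sketch of how a frequency distribution and a relative frequency distribution could be built for a categorical variable. The survey responses are hypothetical and only illustrate the two definitions above.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data, not from the slides).
responses = ["Yes", "No", "Yes", "Undecided", "Yes", "No", "Yes", "No"]

def frequency_table(values):
    """Return rows of (class, frequency, relative frequency in percent)."""
    counts = Counter(values)
    n = len(values)
    # most_common() lists the classes from most to least frequent.
    return [(cls, freq, 100 * freq / n) for cls, freq in counts.most_common()]

table = frequency_table(responses)
for cls, freq, pct in table:
    print(f"{cls:<10} {freq:>3} {pct:6.1f}%")
```

The relative frequencies sum to 100%, which is a quick check that no observation was dropped.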
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES

• The classes that we use to construct frequency distribution tables of a categorical variable
are simply the possible responses to the categorical variable.

• If our intent is to draw attention to the frequency of each category, then we will most likely
draw a bar chart.
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES

A cross table, sometimes called a crosstab or a contingency table, lists the number of
observations for every combination of values for two categorical or ordinal variables. The
combination of all possible intervals for the two variables defines the cells in a table. A cross
table with r rows and c columns is referred to as an r × c cross table.
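
A small sketch of how the cells of an r × c cross table could be counted. The variable names and the six observations are hypothetical; each observation is a (row value, column value) pair.

```python
from collections import Counter

# Hypothetical observations: (row variable, column variable) pairs.
observations = [
    ("Female", "Yes"), ("Male", "No"), ("Female", "No"),
    ("Male", "Yes"), ("Female", "Yes"), ("Male", "No"),
]

def cross_table(pairs):
    """Return row labels, column labels, and the r x c matrix of counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    cells = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, cells

rows, cols, cells = cross_table(observations)
print(" " * 8 + "".join(f"{c:>6}" for c in cols))
for label, row in zip(rows, cells):
    print(f"{label:<8}" + "".join(f"{v:>6}" for v in row))
```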
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES

Figure 1.2 displays this information in a component or stacked bar chart. Figure 1.3 is a cluster, or
side-by-side, bar chart of the same data.
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES

• If we want to draw attention to the proportion of frequencies in each category, then we will probably
use a pie chart to depict the division of a whole into its constituent parts.
• The circle (or “pie”) represents the total, and the segments (or “pieces of the pie”) cut from its center
depict shares of that total.
• The pie chart is constructed so that the area of each segment is proportional to the corresponding
frequency.
GRAPHS TO DESCRIBE CATEGORICAL VARIABLES

A Pareto diagram is a bar chart that displays the frequency of defect causes. The bar at the
left indicates the most frequent cause and the bars to the right indicate causes with
decreasing frequencies. A Pareto diagram is used to separate the “vital few” from the
“trivial many.”
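
The ordering behind a Pareto diagram can be sketched as follows. The defect causes and counts are hypothetical; the cumulative percentage column is a common companion to the bars and is added here as an assumption, not something the slide requires.

```python
# Hypothetical defect counts by cause (illustrative only).
defects = {"Scratches": 12, "Misalignment": 45, "Wrong label": 8,
           "Dents": 30, "Other": 5}

def pareto_rows(counts):
    """Sort causes by decreasing frequency and add the cumulative percent."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for cause, freq in ordered:
        running += freq
        rows.append((cause, freq, 100 * running / total))
    return rows

pareto = pareto_rows(defects)
for cause, freq, cum_pct in pareto:
    print(f"{cause:<14} {freq:>4} {cum_pct:6.1f}%")
```

In this made-up example the first two causes account for 75% of all defects, which is the "vital few" pattern the diagram is meant to expose.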
GRAPHS TO DESCRIBE TIME-SERIES DATA

• A time series is a set of measurements, ordered over time, on a particular quantity of
interest.
• In a time series the sequence of the observations is important. A line chart, also called a
time-series plot, is a series of data plotted at various time intervals.
• Examples of time-series data include:
  • annual university enrolment
  • annual interest rates
  • the gross domestic product over a period of years
  • daily closing prices for shares of common stock
  • daily exchange rates between various world currencies
  • government receipts and expenditures over a period of years
  • monthly product sales
  • quarterly corporate earnings
  • social network weekly traffic
GRAPHS TO DESCRIBE TIME-SERIES DATA

The time-series plot in Figure 1.7 shows the annual GDP data growing rather steadily over a
long period of time from 1929 through 2009. This pattern clearly shows a strong upward
trend component that is stronger in some periods than in others.
GRAPHS TO DESCRIBE TIME-SERIES DATA

From the data file RELEVANT Magazine we obtain the number of weekly new visitors for a
recent 9-week period from both Facebook and Twitter. This information is given in Table 1.5.
The time-series plot in Figure 1.12 shows the trend over this same time period.
GRAPHS TO DESCRIBE NUMERICAL VARIABLES

Frequency Distributions
• Similar to a frequency distribution for categorical data, a frequency distribution for numerical data
is a table that summarizes data by listing the classes in the left column and the number of
observations in each class in the right column.
• Determining the classes of a frequency distribution for numerical data requires answers to certain
questions: How many classes should be used? How wide should each class be? There are some
general rules for both choices.
GRAPHS TO DESCRIBE NUMERICAL VARIABLES

Example 1.9 Employee Completion Times

The supervisor of a very large plant obtained the time (in seconds) for a random sample of
n = 110 employees to complete a particular task. The goal is to complete this task in less than
4.5 minutes. Table 1.6 contains these times (in seconds). The data are stored in the data file
Completion Times. What do the data indicate?
GRAPHS TO DESCRIBE NUMERICAL VARIABLES

Histogram
A histogram is a graph that consists of vertical bars constructed on a horizontal line that is marked off
with intervals for the variable being displayed. The intervals correspond to the classes in a frequency
distribution table. The height of each bar is proportional to the number of observations in that interval.
The number of observations can be displayed above the bars.

Ogive
An ogive, sometimes called a cumulative line graph, is a line that connects points that are the cumulative
percent of observations below the upper limit of each interval in a cumulative frequency distribution.
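
A sketch of the numbers behind a histogram and an ogive: class frequencies give the bar heights, and their running total as a percentage gives the ogive points. The completion times, the starting value of 220, the class width of 10, and the half-open classes [lower, upper) are all assumptions chosen for illustration.

```python
# Hypothetical completion times in seconds (not the Table 1.6 data).
times = [222, 231, 238, 241, 245, 247, 250, 252,
         255, 258, 261, 266, 270, 274, 283]

def class_frequencies(values, start, width, k):
    """Count observations in k classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * k
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < k:
            counts[i] += 1
    return counts

def cumulative_percent(counts):
    """Cumulative percent of observations below each class's upper limit."""
    total = sum(counts)
    running, out = 0, []
    for c in counts:
        running += c
        out.append(100 * running / total)
    return out

counts = class_frequencies(times, start=220, width=10, k=7)
ogive = cumulative_percent(counts)
for i, (c, p) in enumerate(zip(counts, ogive)):
    lower = 220 + 10 * i
    print(f"{lower}-{lower + 10}: {'#' * c:<5} {c:>2} {p:6.1f}%")
```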
GRAPHS TO DESCRIBE NUMERICAL VARIABLES

Stem-and-Leaf Display
• A stem-and-leaf display is an EDA graph that is an alternative to the histogram. Data are grouped
according to their leading digits (called stems), and the final digits (called leaves) are listed
separately for each member of a class. The leaves are displayed individually in ascending order
after each of the stems.

• Describe the following random sample of 10 final exam grades for an introductory accounting class
with a stem-and-leaf display.
88 51 63 85 79 65 79 70 73 77
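
One possible way to build the display for these ten grades, taking the tens digit as the stem and the units digit as the leaf (an assumption that fits two-digit data like this):

```python
from collections import defaultdict

grades = [88, 51, 63, 85, 79, 65, 79, 70, 73, 77]

def stem_and_leaf(values):
    """Group values by tens digit (stem) with units digits (leaves) in order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

display = stem_and_leaf(grades)
for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

The longest row is stem 7, so most grades in this sample fall in the 70s.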
GRAPHS TO DESCRIBE NUMERICAL VARIABLES

Scatter Plot
We can prepare a scatter plot by locating one point for each pair of two variables that represent an
observation in the data set. The scatter plot provides a picture of the data, including the following:
1. The range of each variable
2. The pattern of values over the range
3. A suggestion as to a possible relationship between the two variables
4. An indication of outliers (extreme points)
DESCRIBING DATA: NUMERICAL
MEASURES OF CENTRAL TENDENCY AND LOCATION

Arithmetic Mean
The arithmetic mean (or simply mean) of a set of data is the sum of the data values divided by the
number of observations. If the data set is the entire population of data, then the population mean, μ, is a
parameter given by
μ = (x_1 + x_2 + ... + x_N) / N
where N is the population size.
MEASURES OF CENTRAL TENDENCY AND LOCATION

Median
The median is the middle observation of a set of observations that are arranged in increasing (or
decreasing) order. If the sample size, n, is an odd number, the median is the middle observation. If the
sample size, n, is an even number, the median is the average of the two middle observations. The median
will be the number located in the 0.50(n + 1)th ordered position.

Mode
The mode, if one exists, is the most frequently occurring value. A distribution with one mode is called
unimodal; with two modes, it is called bimodal; and with more than two modes, the distribution is
said to be multimodal. The mode is most commonly used with categorical data.
MEASURES OF CENTRAL TENDENCY AND LOCATION

Example 2.1 Demand for Bottled Water

The demand for bottled water increases during the hurricane season in Florida. The number of 1-gallon bottles of
water sold for a random sample of n = 12 hours in one store during hurricane season is:
60 84 65 67 75 72
80 85 63 82 70 75
Describe the central tendency of the data.
MEASURES OF CENTRAL TENDENCY AND LOCATION

Example 2.1 Demand for Bottled Water

The average or mean hourly number of 1-gallon bottles of water demanded is found as follows:
xbar = (60 + 84 + 65 + ... + 75) / 12 = 878 / 12 ≈ 73.2
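
The three measures of central tendency for the Example 2.1 data can be checked with a short sketch that uses Python's standard `statistics` module (the choice of tool is ours, not the slides'):

```python
import statistics

# Hourly sales of 1-gallon bottles from Example 2.1 (n = 12).
bottles = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

mean = statistics.mean(bottles)      # sum of the values divided by n
median = statistics.median(bottles)  # average of the 6th and 7th ordered values
mode = statistics.mode(bottles)      # most frequently occurring value

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
```

With n = 12 (even), the median is the average of the two middle ordered values, 72 and 75. The only value that occurs more than once is 75, so the data are unimodal.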
MEASURES OF CENTRAL TENDENCY AND LOCATION

Percentiles and Quartiles


• To find percentiles and quartiles, data must first be arranged in order from the smallest to the largest
values.
• The Pth percentile is a value such that approximately P% of the observations are at or below that
number. Percentiles separate large ordered data sets into 100ths. The 50th percentile is the median.
• The Pth percentile is found as follows:

Pth percentile = value located in the (P/100)(n + 1)th ordered position
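
A sketch of the position rule above, applied to the Example 2.1 bottled-water data. When the position is not a whole number, this version interpolates linearly between the two neighbouring ordered values; that interpolation step and the clamping at the ends are assumptions, since the slide states only the position formula.

```python
def percentile(data, p):
    """Value at the (p/100)(n + 1)th ordered position, with linear interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    position = p / 100 * (n + 1)   # 1-based ordered position
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part used for interpolation
    if lower < 1:                  # position falls before the first value
        return ordered[0]
    if lower >= n:                 # position falls at or after the last value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

bottles = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]
q1 = percentile(bottles, 25)   # position 3.25
q2 = percentile(bottles, 50)   # position 6.5
q3 = percentile(bottles, 75)   # position 9.75
print(q1, q2, q3)
```

For n = 12 the 25th percentile sits at position 0.25 × 13 = 3.25, one quarter of the way from the 3rd ordered value (65) to the 4th (67).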


MEASURES OF CENTRAL TENDENCY AND LOCATION

Percentiles and Quartiles


• Quartiles are descriptive measures that separate large data sets into four quarters. The first quartile,
Q1, (or 25th percentile) separates approximately the smallest 25% of the data from the remainder of the
data. The second quartile, Q2, (or 50th percentile) is the median.
• The third quartile, Q3, (or 75th percentile), separates approximately the smallest 75% of the data from
the remaining largest 25% of the data.
MEASURES OF CENTRAL TENDENCY AND LOCATION

The five-number summary refers to the five descriptive measures: minimum, first quartile, median,
third quartile, and maximum.

minimum < Q1 < median < Q3 < maximum


MEASURES OF CENTRAL TENDENCY AND LOCATION

Example 2.5 Demand for Bottled Water


MEASURES OF VARIABILITY

• In this section we present descriptive numbers that measure the variability or spread of the observations from the
mean. In particular, we include the range, interquartile range, variance, standard deviation, and coefficient of
variation.
• While two data sets could have the same mean, the individual observations in one set could vary more
from the mean than do the observations in the second set. Consider the following two sets of sample
data

• Although the mean is 10 for both samples, clearly the data in sample A are farther from 10 than are
the data in sample B. We need descriptive numbers to measure this spread.
RANGE AND INTERQUARTILE RANGE

Box-and-Whisker Plot
A box-and-whisker plot is a graph that describes the shape of a distribution in terms of the five-number summary:
the minimum value, first quartile (25th percentile), the median, the third quartile (75th percentile), and the
maximum value. The inner box shows the numbers that span the range from the first to the third quartile. A line is
drawn through the box at the median. There are two “whiskers.” One whisker is the line from the 25th percentile to
the minimum value; the other whisker is the line from the 75th percentile to the maximum value.
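
The numbers a box-and-whisker plot is drawn from can be sketched as below, reusing the Example 2.1 bottled-water data. The sketch assumes the usual definitions range = maximum − minimum and interquartile range = Q3 − Q1, and uses `statistics.quantiles` with `method="exclusive"`, which places quartiles at the (P/100)(n + 1)th ordered position.

```python
import statistics

bottles = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

# Quartiles at the (P/100)(n + 1)th position, interpolating when needed.
q1, median, q3 = statistics.quantiles(bottles, n=4, method="exclusive")

five_number = (min(bottles), q1, median, q3, max(bottles))
data_range = max(bottles) - min(bottles)   # length spanned by both whiskers and the box
iqr = q3 - q1                              # length of the box

print("five-number summary:", five_number)
print("range:", data_range, "IQR:", iqr)
```

The box covers the middle half of the observations, so the IQR is not affected by a single extreme value in the way the range is.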
RANGE AND INTERQUARTILE RANGE

Example 2.8 Gilotti’s Pizzeria

• The distribution of sales for Location 3 is skewed left, which indicates the presence of
days with sales less than most of the other days ($200 and $300) or perhaps a
data-entry error.
• The distribution of sales in Location 4 is skewed right, indicating the presence of
sales higher than most of the other days ($2,200 and $2,000) or the possibility that
sales were incorrectly recorded.
VARIANCE AND STANDARD DEVIATION

Example 2.9 Gilotti’s Pizzeria Sales

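Calculate the standard deviation of daily sales for Gilotti’s Pizzeria, Location 1. The daily sales for
Location 1 are:
6 8 10 12 14 9 11 7 13 11

To calculate the sample variance and standard deviation, follow these steps:
• Step 1: Calculate the sample mean, xbar. It is equal to 10.1.
• Step 2: Find the difference between each of the daily sales and the mean of 10.1.
• Step 3: Square each difference.
• Step 4: Sum the squared differences (60.9) and divide by n − 1 = 9 to obtain the sample variance, about 6.77.
• Step 5: Take the square root of the variance to obtain the sample standard deviation, about 2.60.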
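
The calculation for Location 1 can be carried through in a short sketch. It assumes the sample (n − 1) form of the variance, which is what Python's `statistics.variance` also uses.

```python
import math
import statistics

sales = [6, 8, 10, 12, 14, 9, 11, 7, 13, 11]   # Location 1 daily sales
n = len(sales)

mean = sum(sales) / n                           # Step 1: sample mean
deviations = [x - mean for x in sales]          # Step 2: differences from the mean
squared = [d ** 2 for d in deviations]          # Step 3: squared differences

variance = sum(squared) / (n - 1)               # sum of squares divided by n - 1
std_dev = math.sqrt(variance)                   # square root of the variance

# Cross-check against the standard library's sample variance.
assert math.isclose(variance, statistics.variance(sales))
print(f"mean = {mean:.1f}, s^2 = {variance:.2f}, s = {std_dev:.2f}")
```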
COEFFICIENT OF VARIATION
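
The slide content for this heading is not reproduced here. As a sketch, assuming the usual sample definition CV = (s / xbar) × 100% for data with a positive mean, the coefficient of variation of the Location 1 sales from Example 2.9 would be computed like this:

```python
import statistics

sales = [6, 8, 10, 12, 14, 9, 11, 7, 13, 11]   # Location 1 daily sales

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean (mean must be positive)."""
    return statistics.stdev(data) / statistics.mean(data) * 100

cv = coefficient_of_variation(sales)
print(f"CV = {cv:.1f}%")
```

Because it is a ratio, the CV has no units, which makes it useful for comparing the spread of data sets with different means or scales.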
ASSOCIATION BETWEEN TWO VARIABLES
MEASURES OF RELATIONSHIPS BETWEEN VARIABLES

• Covariance (Cov) is a measure of the linear relationship between two variables. A
positive value indicates a direct or increasing linear relationship, and a negative value
indicates a decreasing linear relationship.
MEASURES OF RELATIONSHIPS BETWEEN VARIABLES

Correlation Coefficient
The correlation coefficient is computed by dividing the covariance by the product of the
standard deviations of the two variables.
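
A sketch of both measures on a small hypothetical data set of five (x, y) pairs. It assumes the sample versions, with n − 1 in the denominators of the covariance and of both standard deviations.

```python
import math

# Hypothetical paired observations (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sample_covariance(xs, ys):
    """Average product of deviations from the means, using n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)

def sample_std(values):
    """Sample standard deviation, using n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

cov = sample_covariance(x, y)
r = cov / (sample_std(x) * sample_std(y))   # covariance divided by the product of the s.d.s
print(f"Cov(x, y) = {cov:.2f}, r = {r:.3f}")
```

The covariance depends on the units of x and y, while the correlation coefficient always lies between −1 and +1, so the two can be compared across data sets only through r.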
MEASURES OF RELATIONSHIPS BETWEEN VARIABLES

Correlation Coefficient

Figure 2.4 Scatter Plots and Correlation


CALCULATING UNIVARIATE PARAMETERS WITH SPSS
CALCULATING UNIVARIATE PARAMETERS WITH SPSS

• This section uses the sample dataset spread.sav.
• There are two ways to calculate univariate parameters with SPSS:
1. Most descriptive parameters can be calculated by clicking the menu items Analyze → Descriptive
Statistics → Frequencies. In the menu that opens, first select the variables for which the univariate
statistics are to be calculated. If there is a cardinal variable among them, deactivate the option Display
frequency tables.
2. Univariate statistics can also be obtained by selecting Analyze → Descriptive
Statistics → Descriptives... Once again, select the desired variables and indicate the univariate
parameters in the submenu Options.
CALCULATING UNIVARIATE PARAMETERS WITH SPSS

OUTPUT SPSS
NOMINAL ASSOCIATION WITH SPSS

• Go to Analyze → Descriptive Statistics → Crosstabs.
• Enter your dependent variable in the “Row” box and the independent variable in the “Column” box.
Using the GSS 2008 (1500 cases) database, we can test for the association of the independent variable
“SEX” and the dependent variable “Happy”.
NOMINAL ASSOCIATION WITH SPSS

Click Statistics. Since we are looking at a nominal and an ordinal variable, we will use lambda.
NOMINAL ASSOCIATION WITH SPSS

Click Continue, then OK. Your output will look like this:

Look under the Value column in the row for the dependent variable “General Happiness”. The lambda value is .000,
which means that knowing “SEX” does not reduce the error in predicting “Happy”; by this measure there is no
association between the two variables.
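
To see how a lambda of .000 can arise, here is a sketch of Goodman and Kruskal's lambda computed by hand. The counts are hypothetical and are not the GSS 2008 output. Rows hold the dependent variable's categories and columns hold the independent variable's categories, matching the Crosstabs layout above.

```python
def goodman_kruskal_lambda(table):
    """Lambda = (E1 - E2) / E1 for table[i][j], rows = dependent, columns = independent.

    E1: prediction errors when always guessing the overall modal row.
    E2: prediction errors when guessing the modal row within each column.
    """
    n = sum(sum(row) for row in table)
    e1 = n - max(sum(row) for row in table)
    e2 = sum(sum(col) - max(col) for col in zip(*table))
    return (e1 - e2) / e1 if e1 else 0.0

# Hypothetical counts: rows = happiness levels, columns = (male, female).
same_mode = [
    [200, 230],   # very happy
    [380, 420],   # pretty happy (most common answer for both sexes)
    [100, 110],   # not too happy
]
# Hypothetical 2 x 2 table where the most common answer differs by column.
different_mode = [
    [30, 10],
    [10, 30],
]

print(goodman_kruskal_lambda(same_mode))
print(goodman_kruskal_lambda(different_mode))
```

In the first table the most common answer is the same in both columns, so knowing the column does not change the best guess and lambda is 0 even if the percentages differ slightly. In the second table the best guess changes with the column and half of the prediction errors are removed.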
EXERCISES

1.
EXERCISES

2.
REFERENCES

Paul Newbold, William L. Carlson, Betty M. Thorne. (2023). Statistics for Business and Economics.
10th ed. Pearson Education. Chapters 1, 2.

Thomas Cleff. (2019). Applied Statistics and Multivariate Data Analysis for Business and Economics:
A Modern Approach Using SPSS, Stata, and Excel. Chapters 3, 4.
THANK YOU
DESCRIPTIVE STATISTICS
UNIVERSITAS BINA NUSANTARA

December 2023
