Notes (Chapter 1 - 3)
CHAPTER 1 – DATA AND STATISTICS
What is statistics?
can refer to numerical facts (i.e. averages, medians, percentages, and maximums) that help us understand a variety of business and economic situations
can also refer to the art and science of collecting, analyzing, presenting, and interpreting data
Applications in Business and Economics
ACCOUNTING
Public accounting firms use statistical sampling procedures when conducting audits for their clients.
ECONOMICS
Economists use statistical information in making forecasts about the future of the economy or some aspect of it.
FINANCE
Financial advisors use price-earnings ratios and dividend yields to guide their investment advice.
MARKETING
Electronic point-of-sale scanners at retail checkout counters are used to collect data for a variety of marketing research applications.
PRODUCTION
A variety of statistical quality control charts are used to monitor the output of a production process.
INFORMATION SYSTEMS
A variety of statistical information helps administrators assess the performance of computer networks.
Data and Data Sets
Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.
Data Set refers to all the data collected in a particular study
Elements, Variables, and Observations
Elements are the entities on which data are collected
Variable is a characteristic of interest for the elements
Observation is the set of measurements obtained for a particular element
o a data set with n elements contains n observations
o the total no. of data values in a complete data set is the number of elements multiplied by the number of variables
ORDINAL
The data have the properties of nominal data and the order or rank of the data is meaningful
A nonnumeric label or numeric code may be used
Example: The nonnumeric rating labels from AAA to F used for Fitch rating. These can be rank ordered from best credit rating AAA to poorest credit rating F.
Numerical code can also be used, e.g. class rank of a student in school.
INTERVAL
The data have the properties of ordinal data, and the interval between observations is expressed in terms of a fixed unit of measure.
Interval data is always numeric
Example: Melissa has a SAT score of 1985, while Kevin has a SAT score of 1880. Melissa scored 105 points more than Kevin.
RATIO
The data have all the properties of interval data and the ratio of two values is meaningful.
Variables such as distance, height, weight, and time use the ratio scale.
This scale must contain a zero value that indicates that nothing exists for the variable at the zero point.
Example: Melissa’s college record shows 36 credit hours earned, while Kevin’s record shows 72 credit hours earned. Kevin has twice as many credit hours earned as Melissa (Melissa : Kevin = 1:2).
Categorical and Quantitative Data
Data can be further classified as being categorical or quantitative.
Categorical Data
o Labels or names used to identify an attribute of each element
o Often referred to as qualitative data
o Use either the nominal or ordinal scale of measurement
o Can be either numeric or nonnumeric
o Appropriate statistical analysis is rather limited
Quantitative Data
o Indicates how many or how much
Discrete if measuring how many
Continuous if measuring how much
o are always numeric
o ordinary arithmetic operations are meaningful for quantitative data.
AGENCIES
STATISTICAL STUDIES – OBSERVATIONAL
In observational (nonexperimental) studies no attempt
is made to control or influence the variables of interest
Example: survey (studies of smokers and nonsmokers
are observational studies because researchers do not
determine or control who will smoke and who will not
smoke)
STATISTICAL STUDIES – EXPERIMENTAL
In experimental studies the variable of interest is first identified. Then one or more other variables are identified and controlled so that data can be obtained about how they influence the variable of interest.
Example: The largest experimental study ever conducted is believed to be the 1954 Public Health Service experiment for the Salk polio vaccine. Nearly two million U.S. children (grades 1-3) were selected.
Data Acquisition Considerations
TIME REQUIREMENT
searching for information can be time consuming
information may no longer be useful by the time it is available
COST OF ACQUISITION
Organizations often charge for information even when it is not their primary business activity.
DATA ERRORS
Using any data that happen to be available or were acquired with little care can lead to misleading information.
Numerical Descriptive Statistics
The most common numerical descriptive statistic is the mean (or average). The mean demonstrates a measure of the central tendency, or central location, of the data for a variable.
Example: Hudson’s mean cost of parts, based on the 50 tune-ups studied, is $79 (found by summing up the 50 cost values and then dividing by 50).
Statistical Inference
Population: the set of all elements of interest in a particular study
Sample: a subset of the population
Statistical Inference: the process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population.
Census: collecting data for the entire population
Sample Survey: collecting data for a sample
Statistical Analysis Using Microsoft Excel
The most effective data mining systems use automated procedures to discover relationships in the data and predict future outcomes, … prompted by only general, even vague, queries by the user.
The major applications of data mining have been made by companies with a strong consumer focus such as retail, financial, and communication firms.
As another example, data mining is used to identify customers who should receive special discount offers based on their past purchasing volumes.
Requirements:
Statistical methodology (i.e. multiple regression, logistic regression, and correlation) is heavily used.
Computer science technologies involving artificial intelligence and machine learning are also needed.
A significant investment in time and money is required.
Model Reliability:
A statistical model that works for a particular sample may not be applicable to other data.
The data set can be partitioned into a training set (for model development) and a test set (for validating the model).
Overfitting the model is a danger: it can make misleading associations and conclusions appear to exist.
Careful interpretation of results and extensive testing are important.
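The training/test partition described above can be sketched as follows. This is only an illustration: the function name `partition`, the 25% test fraction, the fixed seed, and the 20 placeholder records are all assumptions, not part of the notes.

```python
import random


def partition(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (training, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(20))  # hypothetical stand-in for 20 observations
training_set, test_set = partition(records)
print(len(training_set), len(test_set))  # 15 5
```

The model would be developed on `training_set` only, and `test_set` is held back to check whether the associations found still hold on data the model has not seen.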
Ethical Guidelines of Statistical Practice
unethical behavior can take a variety of forms including:
improper sampling
Inappropriate analysis of the data
Development of misleading graphs
Use of inappropriate summary statistics
Biased interpretation of the statistical results
Be fair, thorough, objective, and neutral as you collect,
analyze, and present data.
“Ethical Guidelines for Statistical Practice” developed by
American Statistical Association
It contains 67 guidelines organized into 8 topic areas:
Professionalism
Responsibilities to Funders, Clients, Employers
Responsibilities in Publications and Testimony
Responsibilities to Research Subjects
Responsibilities to Research Team Colleagues
Responsibilities to Other Statisticians/Practitioners
Responsibilities Regarding Allegations of Misconduct
Responsibilities of Employers Including Organizations, Individuals, Attorneys, or Other Clients
Analytics
Scientific process of transforming data into insight for making better decisions.
Types:
DESCRIPTIVE ANALYTICS: Analytical techniques that describe what happened in the past
PREDICTIVE ANALYTICS: Analytical techniques that use models constructed from past data to predict the future. They help assess the impact of one variable on another
PRESCRIPTIVE ANALYTICS: Analytical techniques that yield a best course of action to take
Data Warehousing
is the capturing, storing, and maintaining of data; it is a significant undertaking.
Organizations obtain large amounts of data on a daily basis
by means of magnetic card readers, bar code scanners,
point of sale terminals, and touch screen monitors.
Example(s): Wal-Mart captures data on 20-30 million
transactions per day; Visa processes 6,800 payment
transactions per second.
Data Mining
is used to identify related products that customers who
have already purchased a specific product are also likely to
purchase (and then pop-ups are used to draw attention to
those related products).
CHAPTER 2A: DESCRIPTIVE
STATISTICS (TABULAR AND
GRAPHICAL DISPLAYS)
Categorical Data
FREQUENCY DISTRIBUTION
- is a tabular summary of data showing the number
(frequency) of observations in each of several non-
overlapping categories or classes.
- Objective: to provide insights about the data that cannot
be quickly obtained by looking only at the original data.
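A frequency distribution for categorical data, together with its relative and percent frequencies, can be sketched as below. The function name `frequency_table` and the ten response labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def frequency_table(values):
    """Map each category to (frequency, relative frequency, percent frequency)."""
    counts = Counter(values)
    n = len(values)
    # most_common() lists categories from most to least frequent.
    return {cat: (f, f / n, 100 * f / n) for cat, f in counts.most_common()}


responses = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]  # hypothetical data
table = frequency_table(responses)
for category, (freq, rel, pct) in table.items():
    print(f"{category}: frequency={freq}, relative={rel:.2f}, percent={pct:.0f}%")
```

Because the categories do not overlap, the frequencies add up to the number of observations and the relative frequencies add up to 1.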
BAR CHART
- is a graphical display for depicting qualitative data
- bar charts are used to identify the most important causes of problems.
- horizontal axis: we specify the labels that are used for each class
- vertical axis: frequency, relative frequency, or percent frequency scale
- a bar of fixed width is drawn above each class label, and we extend its height appropriately
- bars are separated to emphasize the fact that each class is separate.
- Pareto Diagram: a bar chart in which the bars are arranged in descending order of height from left to right (with the most frequently occurring cause appearing first). It is named after Vilfredo Pareto, an Italian economist.
PIE CHART
- is a commonly used graphical display for presenting
relative frequency and percent frequency distributions
for categorical data.
- Insights obtained from Percent Frequency Distribution:
40% of the audits required from 15 to 19 days
Another 25% of the audits required 20 to 24 days
Only 5% of the audits required 30 or more days
DOT PLOT
- one of the simplest graphical summaries of data
- horizontal axis: range of data values
- then each data value is represented by a dot placed above the axis
- Example: Sanderson and Clifford
HISTOGRAM
- Common graphical display of quantitative data
- Horizontal axis: variable of interest
- A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency.
- has no natural separation between rectangles of adjacent classes.
- Example: Sanderson and Clifford
CUMULATIVE DISTRIBUTIONS
- Cumulative Frequency Distribution: shows the number of items with values less than or equal to the upper limit of each class. (Last entry = total no. of observations)
- Cumulative Relative Frequency Distribution: shows the proportion of items with values less than or equal to the upper limit of each class. (Last entry = 1.00)
- Cumulative Percent Frequency Distribution: shows the percentage of items with values less than or equal to the upper limit of each class. (Last entry = 100)
- Example: Sanderson and Clifford
STEM-AND-LEAF DISPLAY
- shows both the rank order and shape of the distribution of the data.
- It is similar to a histogram on its side, but it has the advantage of showing the actual data values.
- the first digits of each data item are arranged to the left of a vertical line.
- to the right of the vertical line we record the last digit for each item in rank order.
- each line (row) in the display is referred to as a stem.
- Each digit on a stem is a leaf.
- Example: The number of questions answered correctly on an aptitude test by 50 students is analysed with the help of a stem-and-leaf display. The relevant data is given in the following table.
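Since the aptitude test table is not reproduced in these notes, here is a minimal sketch of how a stem-and-leaf display is built, using ten hypothetical scores. It assumes non-negative whole numbers with the last digit as the leaf; the function name `stem_and_leaf` is an illustration only.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf display (leaf = last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):          # sorting puts the leaves in rank order
        stems[v // 10].append(v % 10)
    rows = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(d) for d in stems.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}".rstrip())
    return rows


scores = [112, 72, 69, 97, 107, 73, 92, 76, 86, 73]  # hypothetical scores
rows = stem_and_leaf(scores)
for row in rows:
    print(row)
```

Each printed row is one stem, the digits to the right of the bar are its leaves, and the row lengths show the shape of the distribution the way a sideways histogram would.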
- Simpson’s Paradox: the reversal of conclusions based on aggregate and unaggregated data.
- Scatter diagrams and trendlines are useful in exploring the relationship between two variables.
Scatter Diagram is a graphical presentation of the relationship between two quantitative variables.
One variable is shown on the horizontal axis and the other variable is shown on the vertical axis.
The general pattern of the plotted points suggests the overall relationship between the variables.
Trendline provides an approximation of the relationship
Data Visualization: Best Practices in Creating Effective Graphical Displays
- Data Visualization describes the use of graphical displays to summarize and present information about a data set.
- The goal is to communicate as effectively and clearly as possible the key information about the data.
Choosing the Type of Graphical Display
Displays used to show the distribution of data:
Bar Chart to show the frequency distribution or relative frequency distribution for categorical data
Pie Chart to show the relative frequency or percent frequency for categorical data
Dot Plot to show the distribution for quantitative data over the entire range of the data
Histogram to show the frequency distribution for quantitative data over a set of class intervals
Stem-and-Leaf Display to show both the rank order and shape of the distribution for quantitative data
Displays used to make comparisons:
Side-by-Side Bar Chart to compare two variables
Stacked Bar Chart to compare the relative frequency or percent frequency of two categorical variables
Displays used to show relationships:
Scatter Diagram to show the relationship between two quantitative variables
Trendline to approximate the relationship of data in a scatter diagram
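The notes do not say how a trendline is fitted. One common choice, assumed here, is the least squares line. The sketch below fits such a line to five hypothetical (x, y) points; the function name `fit_trendline` is an illustration only.

```python
def fit_trendline(x, y):
    """Least squares slope and intercept for the line y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


x = [1, 2, 3, 4, 5]   # hypothetical values for the horizontal axis
y = [2, 4, 5, 4, 5]   # hypothetical values for the vertical axis
slope, intercept = fit_trendline(x, y)
print(f"trendline: y = {intercept:.1f} + {slope:.1f}x")
```

A positive slope indicates that the plotted points tend to rise from left to right, which is the overall pattern a scatter diagram is meant to reveal.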
Data Dashboard
Data dashboard: a widely used data visualization tool
It organizes and presents key performance indicators (KPIs) used to monitor an organization or process.
It provides timely, summary information that is easy to read, understand, and interpret.
Some additional guidelines include . . .
o Minimize the need for screen scrolling
o Avoid unnecessary use of color or 3D
o Use borders between charts to improve readability
Side-by-side bar chart
- is a graphical display for depicting multiple bar charts on the same display.
- Each cluster of bars represents one value of the first variable
- Each bar within a cluster represents one value of the second variable.
CHAPTER 3A: DESCRIPTIVE STATISTICS
(NUMERICAL MEASURES)
Numerical Measures
Sample statistics: measures computed for data from a sample.
Population parameters: measures computed for data from a population.
Point estimator: a sample statistic used to estimate the corresponding population parameter.
Measures of Location
MEAN [Excel Function AVERAGE(data cell range)]
- most important measure of location
- provides a measure of central point
- the mean of a data set is the average of all the data values
- The sample mean x̄ is the point estimator of the population mean µ.
- Sample Mean: $\bar{x} = \frac{\sum x_i}{n}$
- Population Mean: $\mu = \frac{\sum x_i}{N}$
MEDIAN [Excel Function MEDIAN(data cell range)]
- is the value in the middle when the data items are arranged in ascending order (least to greatest).
- is the measure of location most often reported for annual income and property value data.
- Whenever a data set has extreme values, median is the preferred measure of central location. A few extremely large incomes or property values can inflate the mean.
- Trimmed Mean
o another measure sometimes used when extreme values are present
o it is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.
o Example: the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
MODE [Excel Function MODE.SNGL(data cell range)]
- is the value that occurs with greatest frequency.
- greatest frequency can occur at two or more different values
- Bimodal: if the data have exactly two modes
- Multimodal: if the data have more than two modes
- Example: Monthly Starting Salary
The only monthly starting salary that occurs more than once is $3,880. Mode = 3,880
WEIGHTED MEAN
- In some instances the mean is computed by giving each observation a weight that reflects its relative importance. The choice of weights depends on the application (e.g. no. of credit hrs. earned for each grade, GPA)
- Example: Purchase of Raw Material
Consider the following sample of five purchases of a raw material over a period of three months:
GEOMETRIC MEAN [Excel Function GEOMEAN(data cell range)]
- is calculated by finding the nth root of the product of
n values.
- It is often used in analyzing growth rates in financial
data (where using the arithmetic mean will provide
misleading results).
- It should be applied anytime you want to determine
the mean rate of change over several successive
periods (be it years, quarters, weeks, . . .).
- Other common applications include: changes in
populations of species, crop yields, pollution levels,
and birth and death rates.
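The measures of location above can be sketched with Python's standard `statistics` module. All numbers are hypothetical, and `weighted_mean` and `trimmed_mean` are illustrative helper names, not Excel functions. Rounding the trim count down is also an assumption.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, percent):
    """Drop `percent` % of the items from each end, then average the rest."""
    xs = sorted(values)
    k = int(len(xs) * percent / 100)  # items removed from each end (rounded down)
    return statistics.mean(xs[k:len(xs) - k])


data = [2, 4, 4, 5, 7, 9, 40]                    # hypothetical, 40 is extreme
mean = statistics.mean(data)                     # pulled upward by the 40
median = statistics.median(data)                 # 5, unaffected by the 40
mode = statistics.mode(data)                     # 4, the most frequent value
trimmed = trimmed_mean(data, 15)                 # drops the 2 and the 40

# Grade points weighted by credit hours (hypothetical GPA example).
gpa = weighted_mean([4, 3, 2], [3, 4, 3])

# Growth factors for +25% then -20%: the net change is zero.
growth_factors = [1.25, 0.80]
mean_factor = statistics.geometric_mean(growth_factors)
arithmetic_factor = statistics.mean(growth_factors)

print(mean, median, mode, trimmed, gpa)
print(mean_factor, arithmetic_factor)
```

The last two values show why the geometric mean is preferred for growth rates: the arithmetic mean of the factors suggests 2.5% growth per period, while the geometric mean correctly reports a factor of about 1, that is, no net growth.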
PERCENTILES [Excel Function PERCENTILE.EXC(data range, p/100)]
- provides information about how the data are spread over the interval from the smallest value to the largest value. (e.g. Admission test scores for colleges and universities are frequently reported in terms of percentiles.)
- pth percentile is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.
- Example: Monthly Starting Salary (80th percentile)
Measures of Variability
- It is often desirable to consider measures of variability (dispersion), as well as measures of location.
- E.g. in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
RANGE
- is the difference between the largest and smallest data values.
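The notes do not state the rule used to locate a percentile. The sketch below assumes the location formula L = (p/100)(n + 1) with linear interpolation between neighbouring values, which is intended to mirror an exclusive-style percentile. The function name `percentile_exc` and the nine data values are hypothetical. The range is computed on the same data.

```python
def percentile_exc(data, p):
    """pth percentile using location (p/100)(n + 1) and linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    loc = p * (n + 1) / 100           # 1-based position in the sorted data
    if loc < 1 or loc > n:
        raise ValueError("percentile is outside the range this rule supports")
    lower = int(loc)                  # whole part of the position
    frac = loc - lower                # fractional part of the position
    if lower == n:
        return xs[-1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])


values = [10, 20, 30, 40, 50, 60, 70, 80, 90]   # hypothetical data
p80 = percentile_exc(values, 80)                # position 8   -> 80
p25 = percentile_exc(values, 25)                # position 2.5 -> 25.0
data_range = max(values) - min(values)          # largest minus smallest
print(p80, p25, data_range)
```

With position 2.5 the result falls halfway between the second and third sorted values, which is how a percentile can be a number that does not appear in the data.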
COEFFICIENT OF VARIATION
- indicates how large the standard deviation is in
relation to the mean.
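As a sketch, the coefficient of variation can be computed as the sample standard deviation divided by the mean, expressed as a percentage. The delivery times for the two suppliers below are hypothetical, and `coefficient_of_variation` is an illustrative name.

```python
import statistics


def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100


supplier_a = [10, 10, 11, 10, 9]   # hypothetical delivery times in days
supplier_b = [7, 13, 10, 5, 15]    # same mean of 10 days, more spread

cv_a = coefficient_of_variation(supplier_a)
cv_b = coefficient_of_variation(supplier_b)
print(f"A: {cv_a:.1f}%  B: {cv_b:.1f}%")
```

Both suppliers average 10 days, so the mean alone cannot separate them. The larger coefficient for supplier B shows that its delivery times vary much more relative to that mean.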
o Moderately Skewed Left
- skewness is negative
- mean will usually be less than the median
o Moderately Skewed Right
- skewness is positive
- mean will usually be more than the median
CHEBYSHEV’S THEOREM
- At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.
- Chebyshev’s theorem requires z > 1, but z need not be an integer.
- At least 75% of the data values must be within z = 2 standard deviations of the mean.
- At least 89% of the data values must be within z = 3 standard deviations of the mean.
- At least 94% of the data values must be within z = 4 standard deviations of the mean.
- Example: Marks of Students
Suppose the marks of 100 students in a course had a mean of 70 and a standard deviation of 5. We want to know the number of students having test scores between 60 and 80.
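The marks example can be worked through as follows. The function name `chebyshev_lower_bound` is an illustration only; the mean, standard deviation, and class size come from the example.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


mean, std_dev, n_students = 70, 5, 100
low, high = 60, 80

# 60 and 80 are both 10 marks from the mean, i.e. z = 10 / 5 = 2.
z = (high - mean) / std_dev
proportion = chebyshev_lower_bound(z)          # 1 - 1/4 = 0.75
min_students = proportion * n_students         # at least 75 students
print(z, proportion, min_students)
```

So at least 75 of the 100 students scored between 60 and 80. The theorem gives a lower bound that holds for any distribution shape, which is why the actual count may well be higher.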
EMPIRICAL RULE
- When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.
- rule is based on the normal distribution (chap.6)
- For data having a bell-shaped distribution:
o Approximately 68% of the data values will be within +/- 1 standard deviation of its mean.
o Approximately 95% of the data values will be within +/- 2 standard deviations of its mean.
o Almost all of the data values will be within +/- 3 standard deviations of its mean.
Z-SCORES
- is often called the standardized value
- It denotes the number of standard deviations a data value xi is from the mean: $z_i = \frac{x_i - \bar{x}}{s}$
- Excel’s STANDARDIZE function can be used to compute the z-score.
- An observation’s z-score is a measure of the relative location of the observation in a data set.
- A data value less than the sample mean will have a z-score less than zero.
- A data value greater than the sample mean will have a z-score greater than zero.
- A data value equal to the sample mean will have a z-score of zero.
- Example: Class size data
DETECTING OUTLIERS
- Outlier is an unusually small or unusually large
value in a data set.
- A data value with a z-score less than -3 or greater than
+3 might be considered an outlier.
- It might be:
o an incorrectly recorded data value
o a data value that was incorrectly included in the
data set
o a correctly recorded unusual data value that
belongs in the data set
- Example: Class size data
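A minimal sketch of z-scores and the +/- 3 screening rule is shown below. The class size figures are not reproduced in these notes, so the 16 values used here are hypothetical, and `z_scores` and `flag_outliers` are illustrative names. The sample standard deviation is assumed.

```python
import statistics


def z_scores(sample):
    """Number of sample standard deviations each value lies from the mean."""
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)
    return [(x - x_bar) / s for x in sample]


def flag_outliers(sample, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(sample, z_scores(sample)) if abs(z) > threshold]


values = [10] * 15 + [100]          # hypothetical data with one extreme value
scores = z_scores(values)
print(round(scores[0], 2), round(scores[-1], 2))   # -0.25 3.75
print(flag_outliers(values))                       # [100]
```

A flagged value is only a candidate: as listed above, it may be a recording error, a value that does not belong in the data set, or a correct but unusual observation that should be kept.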
o CORRELATION COEFFICIENT
- Correlation is a measure of linear association and not necessarily causation.
- Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
- The coefficient can take on values between -1 and +1
o Strong negative linear relationship: values near -1
o Strong positive linear relationship: values near +1
- The closer the correlation is to zero, the weaker the relationship.
- Example: Stereo and Sound Equipment Store
The store’s manager wants to determine the relationship between the number of weekend television commercials shown and the sales at the store during the following week
Box Plot
- is a graphical summary of data that is based on a five-number summary.
- A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.
- Box plots provide another way to identify outliers
- Example: Monthly Starting Salary
o A box is drawn with its ends located at the first and third quartiles.
o A vertical line is drawn in the box at the location of the median (second quartile).
o Limits are located (not drawn) using the interquartile range (IQR).
o Data outside these limits are considered outliers
o The location of each outlier is shown with the symbol ●
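The box plot quantities can be sketched as below. The notes do not state how far the limits sit from the box. The conventional 1.5 x IQR below Q1 and above Q3 is assumed here, as is the "exclusive" quartile method of Python's `statistics.quantiles`. The data values and the name `box_plot_summary` are hypothetical.

```python
import statistics


def box_plot_summary(sample, k=1.5):
    """Five-number summary, IQR limits, and the values outside the limits."""
    xs = sorted(sample)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="exclusive")
    iqr = q3 - q1
    lower_limit = q1 - k * iqr       # assumed convention: 1.5 IQR below Q1
    upper_limit = q3 + k * iqr       # assumed convention: 1.5 IQR above Q3
    return {
        "min": xs[0], "q1": q1, "median": median, "q3": q3, "max": xs[-1],
        "iqr": iqr, "lower_limit": lower_limit, "upper_limit": upper_limit,
        "outliers": [x for x in xs if x < lower_limit or x > upper_limit],
    }


data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 30]   # hypothetical data
summary = box_plot_summary(data)
print(summary)
```

Here the box runs from 5 to 11 with the median line at 8, the limits fall at -4 and 20, and the value 30 lies outside them, so it would be marked as an outlier.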
Data Dashboards: Adding Numerical Measures to Improve
Effectiveness
- Data dashboards are not limited to graphical displays.
- The addition of numerical measures, such as the mean
and standard deviation of KPIs, to a data dashboard is
often critical.
- Dashboards are often interactive
- Drilling Down refers to functionality in interactive dashboards that allows the user to access information and analyses at increasingly detailed levels.