Engineering Data Analysis
WHAT IS STATISTICS
The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in
making more effective decisions. As the definition suggests, the first step in investigating a problem is to
collect relevant data. The data must then be organized in some way and perhaps presented in a chart, such as
Chart 1-1.
Types of Statistics
DESCRIPTIVE STATISTICS - methods of organizing, summarizing, and presenting data in an
informative way.
For instance, the United States government reports that the population of the United States was 179,323,000
in 1960; 203,302,000 in 1970; 226,542,000 in 1980; 248,709,000 in 1990; 265,000,000 in 2000; and
308,400,000 in 2010. This information is descriptive statistics. It is also descriptive statistics if we calculate
the percentage growth from one decade to the next.
This kind of data can be organized into a FREQUENCY DISTRIBUTION.
INFERENTIAL STATISTICS - also called statistical inference and inductive statistics. The methods
used to determine something about a population, based on a sample.
Population - a collection of all possible individuals, objects, or measurements of interest.
Sample - a portion, or part, of the population. To infer something about a population, we usually take a
sample. As noted, taking a sample to learn something about a population is done extensively in
business, agriculture, politics, and government.
Example.
Gamous and Associates, a public accounting firm, is conducting an audit of Pronto Printing Company. To
begin, the accounting firm selects a random sample of 100 invoices and checks each invoice for accuracy.
There is at least one error on five of the invoices; hence, the accounting firm estimates that 5 percent of the
population of invoices contains at least one error.
Self-Review 1-1
Chicago-based Market Facts asked a sample of 1,960 customers to try a newly developed frozen
fish dinner by Morton called Fish Delight. Of the 1,960 sampled, 1,176 said they would purchase the dinner
if it is marketed. (a) What would Market Facts report to Morton Foods regarding acceptance of
Fish Delight in the population? (b) Is this an example of descriptive statistics or inferential statistics?
Explain.
(a) 1,176/1,960 × 100 = 60%, so Market Facts could report that about 60 percent of the population would
purchase Fish Delight if it is marketed. (b) Inferential statistics, because a sample was used to draw a
conclusion about the population.
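As a quick check of this arithmetic, a minimal Python sketch (the variable names are illustrative, not from the exercise):

sample_size = 1960       # customers who tried Fish Delight
would_purchase = 1176    # said they would purchase it if marketed

# Sample proportion used to infer acceptance in the population.
acceptance = would_purchase / sample_size * 100
print(f"Estimated acceptance: {acceptance:.0f}%")   # 60%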
Types of Variables
1. Qualitative variable (or attribute) - when the characteristic or variable being studied is nonnumeric.
Examples: gender, religious affiliation, type of automobile owned, city of birth, eye color, etc. We are
usually interested in how many or what proportion fall in each category. For example: What percent of the
population has blue eyes? How many Catholics and Protestants are there in the United States? What
percent of the total number of cars sold last month were Buicks? Qualitative data are often summarized in
charts and bar graphs.
2. Quantitative variable - when the variable studied can be reported numerically. A quantitative
variable is either discrete or continuous. Discrete variables can assume only certain values; they
result from counting.
Example: a 100-bed hospital, a 3- or 4-bedroom house. There cannot be a 3.56-bedroom house.
Continuous variables can assume any value within a given range; they usually result from measuring.
Levels of measurement
Data can be classified according to levels of measurement. The level of measurement of the data often
dictates the calculations that can be done to summarize and present the data.
1. Nominal Level Data
2. Ordinal Level Data
3. Interval Level Data
4. Ratio Level Data
Basics of Statistics
FREQUENCY DISTRIBUTION
- is a table that displays the frequency of various outcomes in a sample. Each entry in the table
contains the frequency or count of the occurrences of values within a particular group or interval, and in
this way, the table summarizes the distribution of values in the sample.
STEPS IN ORGANIZING A FREQUENCY DISTRIBUTION TABLE
1. Determine the range. Range = Highest score − Lowest score.
2. Divide the range by 15 to estimate the approximate size of the class interval. A widely accepted
practice is to have between 10 and 20 intervals.
3. List the intervals, beginning at the bottom. Let the lowest interval begin with a number that is
a multiple of the interval size.
4. Tally the frequencies.
5. Summarize these tallies under a column labeled f.
6. Total this column and record the number of cases at the bottom. Note that even if this total,
obtained by adding the frequencies, equals the known number of cases, it does not follow that no
mistake has been made. To check the work, the scores should be re-tallied.
SAMPLE FREQUENCY DISTRIBUTION
Class Interval   Tally   f
39-41            //      2
36-38            //      2
33-35                    0
30-32                    0
27-29            /       1
N = 40
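The steps above can be sketched in Python. This is a minimal illustration only; the scores and the interval size of 3 are made up to mirror the width-3 classes in the sample distribution, not taken from the original notes.

from collections import Counter

# Hypothetical raw scores (illustrative only; not the data behind the table above).
scores = [27, 39, 41, 36, 38, 28, 40, 29, 35, 31, 33, 37]

# Step 1: determine the range.
lowest, highest = min(scores), max(scores)
data_range = highest - lowest

# Step 2: choose the class interval size (here fixed at 3 to mirror 27-29, 30-32, ...).
interval_size = 3

# Step 3: start the lowest interval at a multiple of the interval size.
start = (lowest // interval_size) * interval_size

# Steps 4-5: tally each score into its class interval.
freq = Counter((s - start) // interval_size for s in scores)

# Step 6: print the table from the highest interval down, then the total N.
n_classes = (highest - start) // interval_size + 1
for k in reversed(range(n_classes)):
    low = start + k * interval_size
    high = low + interval_size - 1
    print(f"{low}-{high}\t{freq.get(k, 0)}")
print("N =", sum(freq.values()))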
TYPES OF GRAPHS
Frequency Polygon (Line Graph) - a graphical device for understanding the shapes of distributions.
Frequency polygons serve the same purpose as histograms, but are especially helpful for comparing sets
of data.
Bar Graph - used to display data in a similar way to line graphs. However, rather than using a point on a
plane to define a value, a bar graph uses a horizontal or vertical rectangular bar that levels off at the
appropriate level.
[Figure: Histogram]
Pie Diagram - a circular statistical graphic which is divided into slices to illustrate numerical
proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area) is
proportional to the quantity it represents.
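A minimal matplotlib sketch of these chart types, using made-up class frequencies (none of the data are from the notes):

import matplotlib.pyplot as plt

# Hypothetical class frequencies (illustrative only).
labels = ["27-29", "30-32", "33-35", "36-38", "39-41"]
freqs = [1, 0, 0, 2, 2]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Frequency polygon: plot the class frequencies as a connected line.
axes[0].plot(labels, freqs, marker="o")
axes[0].set_title("Frequency polygon")

# Bar graph: rectangular bars that level off at each frequency.
axes[1].bar(labels, freqs)
axes[1].set_title("Bar graph")

# Pie diagram: slice angles proportional to each frequency (zero counts skipped).
nonzero = [(l, f) for l, f in zip(labels, freqs) if f > 0]
axes[2].pie([f for _, f in nonzero], labels=[l for l, _ in nonzero])
axes[2].set_title("Pie diagram")

plt.tight_layout()
plt.show()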
SKEWNESS - is a measure of the symmetry of the probability distribution of a real-valued random
variable about its mean. The skewness value can be positive, negative, or undefined.
Measure of Skewness
Skew(X) = Σ(x − μ)³ / (N σ³)
Interpretation of the Measure of Skewness
1. If Sk = 0, the frequency distribution is symmetric, as in the normal (or Gaussian) distribution. The area
under the normal curve is equal to 1.0. Normal distributions are denser in the center and less dense in the
tails, and they are defined by two parameters, the mean (μ) and the standard deviation (σ).
2. If Sk is (+), then the frequency distribution is positively skewed. Positive skew: the right tail is longer;
the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-
skewed, right-tailed, or skewed to the right, despite the fact that the curve itself appears to be skewed
or leaning to the left; "right" instead refers to the right tail being drawn out and, often, the mean being
skewed to the right of a typical center of the data. A right-skewed distribution usually appears as a left-
leaning curve.
3. If Sk is (−), then the frequency distribution is negatively skewed. Negative skew: the left tail is longer;
the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-
skewed, left-tailed, or skewed to the left, despite the fact that the curve itself appears to be skewed or
leaning to the right; "left" instead refers to the left tail being drawn out and, often, the mean being
skewed to the left of a typical center of the data. A left-skewed distribution usually appears as a right-
leaning curve.
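A minimal Python sketch of the skewness measure above, using made-up scores (illustrative only):

import math

# Hypothetical scores (illustrative only).
x = [2, 3, 3, 4, 4, 4, 5, 5, 9]

n = len(x)
mu = sum(x) / n
sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)   # population standard deviation

# Skew(X) = sum of (x - mu)^3, divided by N * sigma^3
sk = sum((v - mu) ** 3 for v in x) / (n * sigma ** 3)

if sk > 0:
    print(f"Sk = {sk:.3f}: positively skewed (longer right tail)")
elif sk < 0:
    print(f"Sk = {sk:.3f}: negatively skewed (longer left tail)")
else:
    print("Sk = 0: the distribution is symmetric")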
Measure of Kurtosis (percentile coefficient of kurtosis)
Ku = Q / (P90 − P10), where Q = (Q3 − Q1) / 2
EXCESS KURTOSIS - is defined as kurtosis minus 3.
Mesokurtic- Distributions with zero excess kurtosis are called mesokurtic, or mesokurtotic.
Leptokurtic- A distribution with positive excess kurtosis is called leptokurtic, or leptokurtotic. In terms of
shape, a leptokurtic distribution has fatter tails.
Platykurtic- A distribution with negative excess kurtosis is called platykurtic, or platykurtotic. In terms of
shape, a platykurtic distribution has thinner tails.
[Figure: mesokurtic, leptokurtic, and platykurtic curves]
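A minimal Python sketch of the percentile coefficient of kurtosis defined above, using numpy for the percentiles (the scores are made up):

import numpy as np

# Hypothetical scores (illustrative only).
x = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 8, 9])

q1, q3 = np.percentile(x, [25, 75])
p10, p90 = np.percentile(x, [10, 90])

q = (q3 - q1) / 2        # semi-interquartile range
ku = q / (p90 - p10)     # percentile coefficient of kurtosis

print(f"Ku = {ku:.3f}")  # reference value for a normal distribution: about 0.263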
Bimodal Distribution - has two peaks. It can be pictured as two normal distribution curves combined,
showing two peaks.
Bell Curve - another name for a normal distribution curve (sometimes just shortened to "normal
curve") or Gaussian distribution.
AVERAGES
ARITHMETIC MEAN
- or simply the mean or average when the context is clear, is the sum of a collection of numbers divided
by the number of numbers in the collection.
Mean: X̄ = (Sum of scores) / (Number of cases) = ΣX / N
MEDIAN
- the value separating the higher half of a data sample, a population, or a probability distribution, from
the lower half.
For a data set, it may be thought of as the "middle" value. For example, in the data set {1, 3, 3, 6, 7, 8, 9},
the median is 6, the fourth largest, and also the fourth smallest, number in the sample. For a continuous
probability distribution, the median is the value such that a number is equally likely to fall above or
below it.
MODE - of a set of data values is the value that appears most often. It is the value x at which its
probability mass function takes its maximum value. In other words, it is the value that is most likely to
be sampled. A mode of a continuous probability distribution is often considered to be any value x at
which its probability density function has a local maximum, so any peak is a mode.
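A minimal Python sketch of the three averages, using the standard library's statistics module and the data set from the median example above:

import statistics

data = [1, 3, 3, 6, 7, 8, 9]     # the data set used in the median example

mean = statistics.mean(data)      # sum of scores / number of cases = 37 / 7
median = statistics.median(data)  # middle value = 6
mode = statistics.mode(data)      # most frequent value = 3

print(mean, median, mode)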
MEASURES OF VARIABILITY
RANGE - of a set of data is the difference between the largest and smallest values.
STANDARD DEVIATION - (SD, also represented by the Greek letter sigma σ or the Latin letter s) is a
measure that is used to quantify the amount of variation or dispersion of a set of data values. A low
standard deviation indicates that the data points tend to be close to the mean (also called the expected
value) of the set, while a high standard deviation indicates that the data points are spread out over a
wider range of values.
Sample Standard Deviation: S = √(Σx² / N), where x = X − X̄ and x² = (X − X̄)²
also S = (1/N) √[ N ΣX² − (ΣX)² ]
Population Standard Deviation: σ = √(Σx² / n)
VARIANCE - is the expectation of the squared deviation of a random variable from its mean. Informally,
it measures how far a set of (random) numbers are spread out from their average value.
Population Variance: σ² = Σ(X − μ)² / N
Sample Variance: S² = Σ(X − X̄)² / N
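A minimal Python sketch of range, standard deviation, and variance as defined above, computed directly from the formulas with made-up scores (note that these formulas divide by N, like the statistics module's pstdev/pvariance rather than stdev/variance):

import math

# Hypothetical scores (illustrative only).
X = [7, 9, 10, 12, 12, 14]
N = len(X)

data_range = max(X) - min(X)                   # RANGE = largest - smallest

mean = sum(X) / N                              # X-bar
sum_sq_dev = sum((v - mean) ** 2 for v in X)   # sum of x^2, where x = X - X-bar

sd = math.sqrt(sum_sq_dev / N)                 # S = sqrt(sum(x^2) / N)
variance = sum_sq_dev / N                      # S^2

print(data_range, round(mean, 2), round(sd, 3), round(variance, 3))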
PARAMETRIC TEST VS NON - PARAMETRIC TEST
                                    Parametric                            Non-parametric
Assumed distribution                Normal                                Any
Assumed variance                    Homogeneous                           Any
Typical data                        Ratio or interval                     Ordinal or nominal
Data set relationships              Independent                           Any
Usual central measure               Mean                                  Median
Benefits                            Can draw more conclusions             Simplicity; less affected by outliers

Test                                Choosing a parametric test            Choosing a non-parametric test
Correlation test                    Pearson                               Spearman
Independent measures, 2 groups      Independent-measures t-test           Mann-Whitney test
Independent measures, >2 groups     One-way independent-measures ANOVA    Kruskal-Wallis test
Repeated measures, 2 conditions     Matched-pair t-test                   Wilcoxon test
Repeated measures, >2 conditions    One-way repeated-measures ANOVA       Friedman's test
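As an illustration of the correlation row of the table, a minimal scipy sketch comparing the parametric (Pearson) and non-parametric (Spearman) tests on made-up paired data:

from scipy import stats

# Hypothetical paired measurements (illustrative only).
x = [1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7]
y = [2.0, 2.9, 3.5, 5.1, 4.9, 6.8, 8.1]

# Parametric: Pearson correlation (assumes interval/ratio data, roughly normal).
r, p_r = stats.pearsonr(x, y)

# Non-parametric: Spearman rank correlation (works on ordinal data, less affected by outliers).
rho, p_rho = stats.spearmanr(x, y)

print(f"Pearson r = {r:.3f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")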
t-TEST FOR INDEPENDENT SAMPLES
It is used to compare the means of two independent samples, to decide whether the two population means
differ significantly.
Formula: t = (X̄₁ − X̄₂) / √[ sp² (1/n₁ + 1/n₂) ], where sp² = [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2)
1. α (alpha)
2. z
Under a two-tail test and df = 18, the tabular t-values or critical values are −2.101 and +2.101.
Sample 1   Sample 2
9.8        12.0
13.2       7.4
11.2       9.8
9.5        11.5
13.0       13.0
12.1       12.5
9.8        9.8
12.3       10.5
7.9        13.5
10.2       12.0
9.7
Computed t = −0.396
6. State the decision rule. Since the computed t = −0.396 falls between the critical values of −2.101 and
+2.101, the null hypothesis is not rejected; the two sample means do not differ significantly.
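A minimal scipy sketch of this kind of test, assuming the two columns above are the two independent samples (the grouping is inferred from the layout, so the output may not match the printed t and df exactly):

from scipy import stats

# The two columns above, read as two independent samples (assumed grouping).
sample1 = [9.8, 13.2, 11.2, 9.5, 13.0, 12.1, 9.8, 12.3, 7.9, 10.2, 9.7]
sample2 = [12.0, 7.4, 9.8, 11.5, 13.0, 12.5, 9.8, 10.5, 13.5, 12.0]

# Pooled-variance (equal_var=True) independent-samples t-test.
t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)

# Two-tailed critical value at alpha = 0.05 for the resulting degrees of freedom.
df = len(sample1) + len(sample2) - 2
t_crit = stats.t.ppf(1 - 0.05 / 2, df)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, critical t = ±{t_crit:.3f}")
if abs(t_stat) < t_crit:
    print("Fail to reject the null hypothesis: no significant difference in means.")
else:
    print("Reject the null hypothesis: the means differ significantly.")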