Engineering Data Analysis
WHAT IS STATISTICS
The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in
making more effective decisions. As the definition suggests, the first step in investigating a problem is to
collect relevant data. The data must then be organized in some way and perhaps presented in a chart, such as
Chart 1-1.
Types of Statistics
DESCRIPTIVE STATISTICS - methods of organizing, summarizing, and presenting data in an
informative way.
For instance, the United States government reports that the population of the United States was 179,323,000
in 1960; 203,302,000 in 1970; 226,542,000 in 1980; 248,709,000 in 1990; 265,000,000 in 2000; and
308,400,000 in 2010. This information is descriptive statistics. It is also descriptive statistics if we calculate
the percentage growth from one decade to the next.
This kind of data can be organized into a FREQUENCY DISTRIBUTION.
INFERENTIAL STATISTICS - also called statistical inference and inductive statistics. The methods
used to determine something about a population, based on a sample.
Population - a collection of all possible individuals, objects, or measurements of interest.
Sample - a portion, or part, of the population. To infer something about a population, we usually take a
sample. As noted, taking a sample to learn something about a population is done extensively in
business, agriculture, politics, and government.
Example.
Gamous and Associates, a public accounting firm, is conducting an audit of Pronto Printing Company. To
begin, the accounting firm selects a random sample of 100 invoices and checks each invoice for accuracy.
There is at least one error on five of the invoices; hence, the accounting firm estimates that 5 percent of the
population of invoices contains at least one error.
Self-Review 1-1
Chicago-based Market Facts asked a sample of 1,960 customers to try a newly developed frozen
fish dinner by Morton called Fish Delight. Of the 1,960 sampled, 1,176 said they would purchase the dinner
if it is marketed. (a) What would Market Facts report to Morton Foods regarding acceptance of
Fish Delight in the population? (b) Is this an example of descriptive statistics or inferential statistics?
Explain.
(a) 1,176/1,960 × 100 = 60%, so Market Facts could report that about 60 percent of the population would
purchase Fish Delight if it is marketed. (b) Inferential statistics, because a sample was used to draw a
conclusion about the population.
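As a quick check of this arithmetic, a minimal Python sketch (the variable names are illustrative, not from the exercise):

sample_size = 1960       # customers who tried Fish Delight
would_purchase = 1176    # said they would purchase it if marketed

# Sample proportion used to infer acceptance in the population.
acceptance = would_purchase / sample_size * 100
print(f"Estimated acceptance: {acceptance:.0f}%")   # 60%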
Types of Variables
1. Qualitative variable (or attribute) - when the characteristic or variable being studied is nonnumeric.
Examples: gender, religious affiliation, type of automobile owned, city of birth, eye color, etc. We are
usually interested in how many or what proportion fall in each category. For example: What percent of the
population has blue eyes? How many Catholics and Protestants are there in the United States? What
percent of the total number of cars sold last month were Buicks? Qualitative data are often summarized in
charts and bar graphs.
2. Quantitative variable - when the variable studied can be reported numerically. A quantitative
variable is either discrete or continuous. Discrete variables can assume only certain values; they
result from counting.
Example: a 100-bed hospital, a 3- or 4-bedroom house. There cannot be a 3.56-bedroom house.
Continuous variables can assume any value within a given range; they usually result from measuring.
Levels of measurement
Data can be classified according to levels of measurement. The level of measurement of the data often
dictates the calculations that can be done to summarize and present the data.
1. Nominal Level Data
2. Ordinal Level Data
3. Interval Level Data
4. Ratio Level Data
Basics of Statistics
FREQUENCY DISTRIBUTION
- is a table that displays the frequency of various outcomes in a sample. Each entry in the table
contains the frequency or count of the occurrences of values within a particular group or interval, and in
this way, the table summarizes the distribution of values in the sample.
STEPS IN ORGANIZING A FREQUENCY DISTRIBUTION TABLE
1. Determine the range. Range = Highest score − Lowest score.
2. Divide the range by 15 to estimate the approximate size of the class interval. A widely accepted
practice is to have between 10 and 20 intervals.
3. List the intervals, beginning at the bottom. Let the lowest interval begin with a number that is
a multiple of the interval size.
4. Tally the frequencies.
5. Summarize these tallies under a column labeled f.
6. Total this column and record the number of cases at the bottom. Note that even if this total,
obtained by adding the frequencies, equals the known number of cases, it does not follow that no
mistake has been made. To check the work, the scores should be re-tallied.
SAMPLE FREQUENCY DISTRIBUTION
Class Interval   Tally   f
39-41            //      2
36-38            //      2
33-35                    0
30-32                    0
27-29            /       1
N = 40
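The steps above can be sketched in Python. This is a minimal illustration only; the scores and the interval size of 3 are made up to mirror the width-3 classes in the sample distribution, not taken from the original notes.

from collections import Counter

# Hypothetical raw scores (illustrative only; not the data behind the table above).
scores = [27, 39, 41, 36, 38, 28, 40, 29, 35, 31, 33, 37]

# Step 1: determine the range.
lowest, highest = min(scores), max(scores)
data_range = highest - lowest

# Step 2: choose the class interval size (here fixed at 3 to mirror 27-29, 30-32, ...).
interval_size = 3

# Step 3: start the lowest interval at a multiple of the interval size.
start = (lowest // interval_size) * interval_size

# Steps 4-5: tally each score into its class interval.
freq = Counter((s - start) // interval_size for s in scores)

# Step 6: print the table from the highest interval down, then the total N.
n_classes = (highest - start) // interval_size + 1
for k in reversed(range(n_classes)):
    low = start + k * interval_size
    high = low + interval_size - 1
    print(f"{low}-{high}\t{freq.get(k, 0)}")
print("N =", sum(freq.values()))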
TYPES OF GRAPHS
Frequency Polygon (Line Graph) - a graphical device for understanding the shapes of distributions.
Frequency polygons serve the same purpose as histograms, but are especially helpful for comparing sets
of data.
Bar Graph - used to display data in a similar way to line graphs. However, rather than using a point on a
plane to define a value, a bar graph uses a horizontal or vertical rectangular bar that levels off at the
appropriate level.
[Figure: Histogram]
Pie Diagram - a circular statistical graphic which is divided into slices to illustrate numerical
proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area) is
proportional to the quantity it represents.
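A minimal matplotlib sketch of these chart types, using made-up class frequencies (none of the data are from the notes):

import matplotlib.pyplot as plt

# Hypothetical class frequencies (illustrative only).
labels = ["27-29", "30-32", "33-35", "36-38", "39-41"]
freqs = [1, 0, 0, 2, 2]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Frequency polygon: plot the class frequencies as a connected line.
axes[0].plot(labels, freqs, marker="o")
axes[0].set_title("Frequency polygon")

# Bar graph: rectangular bars that level off at each frequency.
axes[1].bar(labels, freqs)
axes[1].set_title("Bar graph")

# Pie diagram: slice angles proportional to each frequency (zero counts skipped).
nonzero = [(l, f) for l, f in zip(labels, freqs) if f > 0]
axes[2].pie([f for _, f in nonzero], labels=[l for l, _ in nonzero])
axes[2].set_title("Pie diagram")

plt.tight_layout()
plt.show()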
SKEWNESS - is a measure of the symmetry of the probability distribution of a real-valued random
variable about its mean. The skewness value can be positive, negative, or undefined.
Measure of Skewness
Skew(X) = Σ(x − μ)³ / (N σ³)
Interpretation of the Measure of Skewness
1. If Sk = 0, the frequency distribution is symmetric, as in the normal (or Gaussian) distribution. The area
under the normal curve is equal to 1.0. Normal distributions are denser in the center and less dense in the
tails, and they are defined by two parameters, the mean (μ) and the standard deviation (σ).
2. If Sk is (+), then the frequency distribution is positively skewed. Positive skew: the right tail is longer;
the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-
skewed, right-tailed, or skewed to the right, despite the fact that the curve itself appears to be skewed
or leaning to the left; "right" instead refers to the right tail being drawn out and, often, the mean being
skewed to the right of a typical center of the data. A right-skewed distribution usually appears as a left-
leaning curve.
3. If Sk is (−), then the frequency distribution is negatively skewed. Negative skew: the left tail is longer;
the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-
skewed, left-tailed, or skewed to the left, despite the fact that the curve itself appears to be skewed or
leaning to the right; "left" instead refers to the left tail being drawn out and, often, the mean being
skewed to the left of a typical center of the data. A left-skewed distribution usually appears as a right-
leaning curve.
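A minimal Python sketch of the skewness measure above, using made-up scores (illustrative only):

import math

# Hypothetical scores (illustrative only).
x = [2, 3, 3, 4, 4, 4, 5, 5, 9]

n = len(x)
mu = sum(x) / n
sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)   # population standard deviation

# Skew(X) = sum of (x - mu)^3, divided by N * sigma^3
sk = sum((v - mu) ** 3 for v in x) / (n * sigma ** 3)

if sk > 0:
    print(f"Sk = {sk:.3f}: positively skewed (longer right tail)")
elif sk < 0:
    print(f"Sk = {sk:.3f}: negatively skewed (longer left tail)")
else:
    print("Sk = 0: the distribution is symmetric")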
Measure of Kurtosis (percentile coefficient of kurtosis)
Ku = Q / (P90 − P10), where Q = (Q3 − Q1) / 2
EXCESS KURTOSIS - is defined as kurtosis minus 3.
Mesokurtic- Distributions with zero excess kurtosis are called mesokurtic, or mesokurtotic.
Leptokurtic- A distribution with positive excess kurtosis is called leptokurtic, or leptokurtotic. In terms of
shape, a leptokurtic distribution has fatter tails.
Platykurtic- A distribution with negative excess kurtosis is called platykurtic, or platykurtotic. In terms of
shape, a platykurtic distribution has thinner tails.
[Figure: mesokurtic, leptokurtic, and platykurtic curves]
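A minimal Python sketch of the percentile coefficient of kurtosis defined above, using numpy for the percentiles (the scores are made up):

import numpy as np

# Hypothetical scores (illustrative only).
x = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 8, 9])

q1, q3 = np.percentile(x, [25, 75])
p10, p90 = np.percentile(x, [10, 90])

q = (q3 - q1) / 2        # semi-interquartile range
ku = q / (p90 - p10)     # percentile coefficient of kurtosis

print(f"Ku = {ku:.3f}")  # reference value for a normal distribution: about 0.263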
Bimodal Distribution - has two peaks. It can be pictured as two normal distribution curves combined,
showing two peaks.
Bell Curve - another name for a normal distribution curve (sometimes just shortened to "normal
curve") or Gaussian distribution.
AVERAGES
ARITHMETIC MEAN
- or simply the mean or average when the context is clear, is the sum of a collection of numbers divided
by the number of numbers in the collection.
Mean: X̄ = (Sum of scores) / (Number of cases) = ΣX / N
MEDIAN
- the value separating the higher half of a data sample, a population, or a probability distribution, from
the lower half.
For a data set, it may be thought of as the "middle" value. For example, in the data set {1, 3, 3, 6, 7, 8, 9},
the median is 6, the fourth largest, and also the fourth smallest, number in the sample. For a continuous
probability distribution, the median is the value such that a number is equally likely to fall above or
below it.
MODE - of a set of data values is the value that appears most often. It is the value x at which its
probability mass function takes its maximum value. In other words, it is the value that is most likely to
be sampled. A mode of a continuous probability distribution is often considered to be any value x at
which its probability density function has a local maximum, so any peak is a mode.
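A minimal Python sketch of the three averages, using the standard library's statistics module and the data set from the median example above:

import statistics

data = [1, 3, 3, 6, 7, 8, 9]     # the data set used in the median example

mean = statistics.mean(data)      # sum of scores / number of cases = 37 / 7
median = statistics.median(data)  # middle value = 6
mode = statistics.mode(data)      # most frequent value = 3

print(mean, median, mode)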
MEASURES OF VARIABILITY
RANGE - of a set of data is the difference between the largest and smallest values.
STANDARD DEVIATION - (SD, also represented by the Greek letter sigma σ or the Latin letter s) is a
measure that is used to quantify the amount of variation or dispersion of a set of data values. A low
standard deviation indicates that the data points tend to be close to the mean (also called the expected
value) of the set, while a high standard deviation indicates that the data points are spread out over a
wider range of values.
Sample Standard Deviation: S = √(Σx² / N), where x = X − X̄ and x² = (X − X̄)²
also S = (1/N) √[ N ΣX² − (ΣX)² ]
Population Standard Deviation: σ = √(Σx² / n)
VARIANCE - is the expectation of the squared deviation of a random variable from its mean. Informally,
it measures how far a set of (random) numbers are spread out from their average value.
Population Variance: σ² = Σ(X − μ)² / N
Sample Variance: S² = Σ(X − X̄)² / N
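A minimal Python sketch of range, standard deviation, and variance as defined above, computed directly from the formulas with made-up scores (note that these formulas divide by N, like the statistics module's pstdev/pvariance rather than stdev/variance):

import math

# Hypothetical scores (illustrative only).
X = [7, 9, 10, 12, 12, 14]
N = len(X)

data_range = max(X) - min(X)                   # RANGE = largest - smallest

mean = sum(X) / N                              # X-bar
sum_sq_dev = sum((v - mean) ** 2 for v in X)   # sum of x^2, where x = X - X-bar

sd = math.sqrt(sum_sq_dev / N)                 # S = sqrt(sum(x^2) / N)
variance = sum_sq_dev / N                      # S^2

print(data_range, round(mean, 2), round(sd, 3), round(variance, 3))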
PARAMETRIC TEST VS NON - PARAMETRIC TEST
                                    Parametric                            Non-parametric
Assumed distribution                Normal                                Any
Assumed variance                    Homogeneous                           Any
Typical data                        Ratio or interval                     Ordinal or nominal
Data set relationships              Independent                           Any
Usual central measure               Mean                                  Median
Benefits                            Can draw more conclusions             Simplicity; less affected by outliers

Test                                Choosing a parametric test            Choosing a non-parametric test
Correlation test                    Pearson                               Spearman
Independent measures, 2 groups      Independent-measures t-test           Mann-Whitney test
Independent measures, >2 groups     One-way independent-measures ANOVA    Kruskal-Wallis test
Repeated measures, 2 conditions     Matched-pair t-test                   Wilcoxon test
Repeated measures, >2 conditions    One-way repeated-measures ANOVA       Friedman's test
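As an illustration of the correlation row of the table, a minimal scipy sketch comparing the parametric (Pearson) and non-parametric (Spearman) tests on made-up paired data:

from scipy import stats

# Hypothetical paired measurements (illustrative only).
x = [1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7]
y = [2.0, 2.9, 3.5, 5.1, 4.9, 6.8, 8.1]

# Parametric: Pearson correlation (assumes interval/ratio data, roughly normal).
r, p_r = stats.pearsonr(x, y)

# Non-parametric: Spearman rank correlation (works on ordinal data, less affected by outliers).
rho, p_rho = stats.spearmanr(x, y)

print(f"Pearson r = {r:.3f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")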
t-TEST FOR INDEPENDENT SAMPLES
It is used to compare the means of two independent samples, to decide whether the two population means
differ significantly.
Formula: t = (X̄₁ − X̄₂) / √[ sp² (1/n₁ + 1/n₂) ], where sp² = [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2)
1. α (alpha)
2. z
Under a two-tail test and df = 18, the tabular t-values or critical values are −2.101 and +2.101.
Sample 1   Sample 2
9.8        12.0
13.2       7.4
11.2       9.8
9.5        11.5
13.0       13.0
12.1       12.5
9.8        9.8
12.3       10.5
7.9        13.5
10.2       12.0
9.7
Computed t = −0.396
6. State the decision rule. Since the computed t = −0.396 falls between the critical values of −2.101 and
+2.101, the null hypothesis is not rejected; the two sample means do not differ significantly.
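A minimal scipy sketch of this kind of test, assuming the two columns above are the two independent samples (the grouping is inferred from the layout, so the output may not match the printed t and df exactly):

from scipy import stats

# The two columns above, read as two independent samples (assumed grouping).
sample1 = [9.8, 13.2, 11.2, 9.5, 13.0, 12.1, 9.8, 12.3, 7.9, 10.2, 9.7]
sample2 = [12.0, 7.4, 9.8, 11.5, 13.0, 12.5, 9.8, 10.5, 13.5, 12.0]

# Pooled-variance (equal_var=True) independent-samples t-test.
t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)

# Two-tailed critical value at alpha = 0.05 for the resulting degrees of freedom.
df = len(sample1) + len(sample2) - 2
t_crit = stats.t.ppf(1 - 0.05 / 2, df)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, critical t = ±{t_crit:.3f}")
if abs(t_stat) < t_crit:
    print("Fail to reject the null hypothesis: no significant difference in means.")
else:
    print("Reject the null hypothesis: the means differ significantly.")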