
Process Data Analysis

Dr M. A. A. Shoukat Choudhury
Department of Chemical Engineering
BUET, Dhaka - 1000
DATA MINING
• Data explosion problem
  – Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

• We are drowning in data, but starving for knowledge!

• Solution: data mining and knowledge extraction

• Data mining (knowledge discovery in databases):
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Elementary Concepts
• Variables: Variables are things that we measure, control, or manipulate in research. They differ in many respects, most notably in the role they are given in our research and in the type of measures that can be applied to them.

• Observational vs. experimental research: Most empirical research belongs clearly to one of these two general categories. In observational research we do not (or at least try not to) influence any variables, but only measure them and look for relations (correlations) between some set of variables. In experimental research, we manipulate some variables and then measure the effects of this manipulation on other variables.

• Dependent vs. independent variables: Independent variables are those that are manipulated, whereas dependent variables are only measured or registered.
Random Variable

• In most cases, we cannot know what the “true” value is unless there is an independent determination (i.e. a different measurement technique).

• We can only consider estimates of the error.

• Discrepancy is the difference between two or more observations; this gives rise to uncertainty.
Population vs. sample
• Population: an entire collection of measurements
  – (e.g. reaction times, IQ scores, or the heights of all male BUET students)
• Sample: a smaller subset of observations taken from the population
  – a sample should be drawn randomly to make inferences about the population; random assignment to groups improves validity
In general:

  (Parent Parameter) = lim_{N→∞} (Sample Parameter)

i.e. a sample statistic approaches the corresponding population (parent) parameter as the number of observations, N, goes to infinity.

Population vs. sample
• In general:
  – population parameters = Greek letters
  – sample statistics = English (Latin) letters

              Population        Sample
  mean        μ (mu)            x̄ (x-bar)
  variance    σ² (sigma²)       s²
Statistics

• Two essential components of a dataset are:
  (i) the central tendency of the data, and
  (ii) the spread of the data (e.g. standard deviation)

• Although the mean (central tendency) and standard deviation (spread) are most commonly used, other measures can also be useful
Measures of central tendency
• Mode
  – the most frequent observation: e.g. in 1, 2, 2, 3, 4, 5 the mode is 2

• Median
  – the middle number of a dataset arranged in numerical order: e.g. in 0, 1, 2, 5, 1000 the median is 2
    (the average of the middle two numbers when there is an even number of scores)
  – relatively uninfluenced by outliers

• Mean
  – the arithmetic average of all observations, x̄ = (x₁ + x₂ + … + x_N) / N
  – sensitive to outliers
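
Using only the Python standard library, the two example datasets above can be summarised as follows (a minimal illustration):

from statistics import mean, median, mode

data = [1, 2, 2, 3, 4, 5]          # mode example from the slide
skewed = [0, 1, 2, 5, 1000]        # median example from the slide

print(mode(data))      # 2     -> the most frequent observation
print(median(skewed))  # 2     -> middle value; barely affected by the outlier 1000
print(mean(skewed))    # 201.6 -> the mean is dragged upward by the outlier
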
Measures of dispersion
• Several ways to measure the spread of data:
  – Range (max − min), IQR or Inter-Quartile Range (the middle 50%), Mean Absolute Deviation
  – Variance: the average of the squared deviations from the mean
    e.g. the variance of the population of 3 scores (−10, 0, 10) is 200/3 ≈ 66.67
  – Standard deviation: simply the square root of the variance
Calculating sample variance
• Population variance (σ²) is the true variance of the population, calculated by
      σ² = (1/N) Σ (xᵢ − μ)²
  – this equation is used when we have all values in the population (unusual)

• However, a sample tends to show less spread than the population it was drawn from, so dividing by n would underestimate σ². To correct for this bias, the sample variance (s²) is calculated by
      s² = (1/(n − 1)) Σ (xᵢ − x̄)²

• Standard deviation: the positive square root of the variance
  – small std dev: observations are clustered tightly around the mean
  – large std dev: observations are scattered widely around the mean
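
A small check of the two formulas with NumPy (assuming NumPy is available); ddof=0 gives the population formula and ddof=1 the bias-corrected sample formula:

import numpy as np

scores = np.array([-10.0, 0.0, 10.0])   # the 3-score example from the earlier slide

print(np.var(scores, ddof=0))   # 66.67 -> population variance, divide by N
print(np.var(scores, ddof=1))   # 100.0 -> sample variance, divide by N - 1
print(np.std(scores, ddof=1))   # 10.0  -> sample standard deviation
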
Data Distribution
• A histogram is a useful graphical representation of the information content of a sample or parent population
• Many statistical tests assume values are normally distributed
  – this is not always the case!
  – examine the data prior to processing
Normal/Gaussian Distribution
• Many real-life variables (height, weight, IQ, etc.) follow this familiar bell-shaped distribution

• A mathematical equation mimics this normal (or Gaussian) distribution
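
The equation referred to is presumably the Gaussian probability density function, written here in LaTeX notation:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

where μ is the mean and σ is the standard deviation of the distribution.
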
Normal Distribution
• The mathematical normal distribution is useful because its known mathematical properties give us useful information about our real-life variable (assuming that variable is normally distributed)

• For example, scores more than 2 standard deviations above the mean make up roughly the extreme 2.5% of scores

• Consequently, a person with an IQ score of 130 (M = 100, SD = 15) would be in roughly the top 2.5% (assuming IQ is normally distributed)
For Gaussian or normal error distributions:

• The total area underneath the curve is 1.00 (100%)
• 68.27% of observations lie within ±1 std dev of the mean
• 95.45% of observations lie within ±2 std dev of the mean
• 99.73% of observations lie within ±3 std dev of the mean

• Variance, standard deviation, probable error, mean, and weighted root mean square error are commonly used statistical terms in geodesy.

• Use these measures to compare datasets, rather than to attach significance to a numerical value on its own.
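
These percentages can be verified numerically; a small sketch using SciPy's normal distribution (assuming SciPy is installed), where norm.cdf gives the cumulative probability and norm.sf the upper-tail probability:

from scipy.stats import norm

# Probability mass within +/- k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within +/-{k} std dev: {100 * p:.2f}%")   # 68.27%, 95.45%, 99.73%

# Fraction of people with IQ above 130, for IQ ~ N(mean=100, sd=15)
print(norm.sf(130, loc=100, scale=15))   # about 0.023, i.e. roughly the top 2.5%
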


Steps of Data Analysis
• Extract data from the database
• Visualize the data
• Select an appropriate segment
• Preprocessing of data: removal and replacement of outliers
• Transform the data, if needed
• Analyse the data to find patterns or to extract information
(a minimal end-to-end sketch of these steps follows below)
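
Below is a minimal end-to-end sketch of these steps in Python with pandas. The file name process_data.csv and the column names timestamp and temperature are hypothetical placeholders, and the 3-standard-deviation outlier rule is just one common choice:

import pandas as pd
import matplotlib.pyplot as plt

# 1. Extract data from the database (here via a CSV export -- hypothetical file name)
df = pd.read_csv("process_data.csv", parse_dates=["timestamp"])

# 2. Visualize the raw data before any processing
df.plot(x="timestamp", y="temperature")
plt.show()

# 3. Select an appropriate segment (e.g. one week of steady operation)
segment = df[(df["timestamp"] >= "2024-01-01") & (df["timestamp"] < "2024-01-08")]

# 4. Preprocess: flag points more than 3 std devs from the mean and replace them
x = segment["temperature"]
outliers = (x - x.mean()).abs() > 3 * x.std()
clean = x.mask(outliers).interpolate()

# 5. Transform the data, if needed (here: standardise to mean 0, std dev 1)
z = (clean - clean.mean()) / clean.std()

# 6. Analyse the data, e.g. summary statistics, histograms, correlations
print(z.describe())
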
Data Visualisation
• Plot the data and have a look
• The basic rule is to select the plot which represents what you want to say in the clearest and simplest way
• Avoid ‘chart junk’ (e.g. plotting in 3D where 2D would be clearer)
• Popular options include time trends, bar charts, histograms, pie charts, etc. (see the short plotting sketch below)
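
As one illustration, a time trend and a histogram of the same signal with matplotlib; the data here are synthetic and made up purely for demonstration:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic process signal: a slow drift plus measurement noise (demo data only)
t = np.arange(500)
y = 50 + 0.01 * t + np.random.normal(scale=1.0, size=t.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(t, y)                      # time trend: shows how the signal drifts over time
ax1.set_xlabel("sample")
ax1.set_ylabel("measurement")
ax2.hist(y, bins=30)                # histogram: shows the distribution of the values
ax2.set_xlabel("measurement")
plt.tight_layout()
plt.show()
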
Transforming data
• One reason we might ‘transform’ data is to convert from one scale to another
  – e.g. feet into inches, centigrade into Fahrenheit, raw IQ scores into standard IQ scores

• Scale conversion can usually be achieved by a simple linear transformation (multiplying/dividing by a constant and adding/subtracting a constant):
      Xnew = b*Xold + c
  – so to convert centigrade data into Fahrenheit we would apply:
      F = 1.8*C + 32     (i.e. b = 9/5 and c = 32)
Transforming data
• The z-transform (standardisation) is one common type of linear transform, which produces a new variable with mean 0 and standard deviation 1

• z-score:  z = (X − x̄) / s    (subtract the mean, divide by the standard deviation)

• Standardisation is useful when comparing the same dimension measured on different scales
• After standardisation these scales could also be added together (adding two quantities on different scales is obviously problematic)
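
A short sketch of standardisation in Python; the two example score lists below are hypothetical and simply stand for the same dimension measured on two different scales:

import numpy as np

def zscore(x):
    # Standardise to mean 0 and std dev 1 (sample std dev, divisor n - 1)
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

test_a = [52, 61, 47, 70, 65]        # scored out of 100
test_b = [3.1, 4.0, 2.8, 4.6, 4.2]   # scored out of 5

za, zb = zscore(test_a), zscore(test_b)
combined = za + zb                   # only meaningful once both are on a common scale
print(np.round(combined, 2))
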
Process Time trends
Correlation Analysis
• Correlation is a measure of the relation between two or more variables
• The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data
• Correlation coefficients can range from −1.00 to +1.00:
  – −1.00 represents a perfect negative correlation
  – +1.00 represents a perfect positive correlation
  – 0.00 represents a lack of correlation
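
A minimal example of computing a (Pearson) correlation coefficient with NumPy; the variables are synthetic demo data:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)   # y is positively related to x
z = rng.normal(size=200)                         # z is unrelated to x

print(np.corrcoef(x, y)[0, 1])   # strong positive correlation (theoretically about +0.85)
print(np.corrcoef(x, z)[0, 1])   # near 0: essentially uncorrelated
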
A Nice Statistics Website
http://www.statsoft.com/textbook/stathome.html
Data Analysis
• Outliers: removal and replacement (one common detection rule is sketched below)
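
The slides do not specify an outlier-detection rule; one common choice is the inter-quartile-range (IQR) fence, sketched here with replacement by interpolation (the data are a hypothetical example with one spike):

import pandas as pd

x = pd.Series([10.1, 10.3, 9.9, 10.2, 55.0, 10.0, 10.4, 9.8])  # 55.0 is a spike

# IQR fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Remove the flagged points and replace them by linear interpolation
cleaned = x.mask(outliers).interpolate()
print(cleaned.tolist())
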
Flow Network Problem
