
Process Data Analysis

Dr M. A. A. Shoukat Choudhury
Department of Chemical Engineering
BUET, Dhaka - 1000
DATA MINING
• Data explosion problem
  – Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

• We are drowning in data, but starving for knowledge!

• Solution: data mining and knowledge extraction

• Data mining (knowledge discovery in databases):
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Elementary Concepts
• Variables: Variables are things that we measure, control, or manipulate in research. They differ in many respects, most notably in the role they are given in our research and in the type of measures that can be applied to them.

• Observational vs. experimental research: Most empirical research belongs clearly to one of these two general categories. In observational research we do not (or at least try not to) influence any variables, but only measure them and look for relations (correlations) between some set of variables. In experimental research, we manipulate some variables and then measure the effects of this manipulation on other variables.

• Dependent vs. independent variables: Independent variables are those that are manipulated, whereas dependent variables are only measured or registered.
Random Variable

• In most cases, we cannot know what the “true” value is unless there is an independent determination (i.e. a different measurement technique).

• We can only consider estimates of the error.

• Discrepancy is the difference between two or more observations; this gives rise to uncertainty.
Population vs. sample
• Population: an entire collection of measurements
  – (e.g. reaction times, IQ scores, or the heights of all male BUET students)
• Sample: a smaller subset of observations taken from the population
  – a sample should be drawn randomly to make inferences about the population; random assignment to groups improves validity
In general:

  (Parent Parameter) = lim_{N→∞} (Sample Parameter)

i.e. a sample statistic approaches the corresponding population (parent) parameter as the number of observations, N, goes to infinity.

Population vs. sample
• In general:
  – population parameters = Greek letters
  – sample statistics = English (Latin) letters

              Population        Sample
  mean        μ (mu)            x̄ (x-bar)
  variance    σ² (sigma²)       s²
Statistics

• Two essential components of a dataset are:
  (i) the central tendency of the data, and
  (ii) the spread of the data (e.g. standard deviation)

• Although the mean (central tendency) and standard deviation (spread) are most commonly used, other measures can also be useful
Measures of central tendency
• Mode
  – the most frequent observation: e.g. in 1, 2, 2, 3, 4, 5 the mode is 2

• Median
  – the middle number of a dataset arranged in numerical order: e.g. in 0, 1, 2, 5, 1000 the median is 2
    (the average of the middle two numbers when there is an even number of scores)
  – relatively uninfluenced by outliers

• Mean
  – the arithmetic average of all observations, x̄ = (x₁ + x₂ + … + x_N) / N
  – sensitive to outliers
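
Using only the Python standard library, the two example datasets above can be summarised as follows (a minimal illustration):

from statistics import mean, median, mode

data = [1, 2, 2, 3, 4, 5]          # mode example from the slide
skewed = [0, 1, 2, 5, 1000]        # median example from the slide

print(mode(data))      # 2     -> the most frequent observation
print(median(skewed))  # 2     -> middle value; barely affected by the outlier 1000
print(mean(skewed))    # 201.6 -> the mean is dragged upward by the outlier
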
Measures of dispersion
• Several ways to measure the spread of data:
  – Range (max − min), IQR or Inter-Quartile Range (the middle 50%), Mean Absolute Deviation
  – Variance: the average of the squared deviations from the mean
    e.g. the variance of the population of 3 scores (−10, 0, 10) is 200/3 ≈ 66.67
  – Standard deviation: simply the square root of the variance
Calculating sample variance
• Population variance (σ²) is the true variance of the population, calculated by
      σ² = (1/N) Σ (xᵢ − μ)²
  – this equation is used when we have all values in the population (unusual)

• However, a sample tends to show less spread than the population it was drawn from, so dividing by n would underestimate σ². To correct for this bias, the sample variance (s²) is calculated by
      s² = (1/(n − 1)) Σ (xᵢ − x̄)²

• Standard deviation: the positive square root of the variance
  – small std dev: observations are clustered tightly around the mean
  – large std dev: observations are scattered widely around the mean
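
A small check of the two formulas with NumPy (assuming NumPy is available); ddof=0 gives the population formula and ddof=1 the bias-corrected sample formula:

import numpy as np

scores = np.array([-10.0, 0.0, 10.0])   # the 3-score example from the earlier slide

print(np.var(scores, ddof=0))   # 66.67 -> population variance, divide by N
print(np.var(scores, ddof=1))   # 100.0 -> sample variance, divide by N - 1
print(np.std(scores, ddof=1))   # 10.0  -> sample standard deviation
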
Data Distribution
• A histogram is a useful graphical representation of the information content of a sample or parent population
• Many statistical tests assume values are normally distributed
  – this is not always the case!
  – examine the data prior to processing
Normal/Gaussian Distribution
• Many real-life variables (height, weight, IQ, etc.) follow this familiar bell-shaped distribution

• A mathematical equation mimics this normal (or Gaussian) distribution
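
The equation referred to is presumably the Gaussian probability density function, written here in LaTeX notation:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

where μ is the mean and σ is the standard deviation of the distribution.
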
Normal Distribution
• The mathematical normal distribution is useful because its known mathematical properties give us useful information about our real-life variable (assuming that variable is normally distributed)

• For example, scores more than 2 standard deviations above the mean make up roughly the extreme 2.5% of scores

• Consequently, a person with an IQ score of 130 (M = 100, SD = 15) would be in roughly the top 2.5% (assuming IQ is normally distributed)
For Gaussian or normal error distributions:

• The total area underneath the curve is 1.00 (100%)
• 68.27% of observations lie within ±1 std dev of the mean
• 95.45% of observations lie within ±2 std dev of the mean
• 99.73% of observations lie within ±3 std dev of the mean

• Variance, standard deviation, probable error, mean, and weighted root mean square error are commonly used statistical terms in geodesy.

• Use these measures to compare datasets, rather than to attach significance to a numerical value on its own.
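
These percentages can be verified numerically; a small sketch using SciPy's normal distribution (assuming SciPy is installed), where norm.cdf gives the cumulative probability and norm.sf the upper-tail probability:

from scipy.stats import norm

# Probability mass within +/- k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within +/-{k} std dev: {100 * p:.2f}%")   # 68.27%, 95.45%, 99.73%

# Fraction of people with IQ above 130, for IQ ~ N(mean=100, sd=15)
print(norm.sf(130, loc=100, scale=15))   # about 0.023, i.e. roughly the top 2.5%
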


Steps of Data Analysis
• Extract data from the database
• Visualize the data
• Select an appropriate segment
• Preprocessing of data: removal and replacement of outliers
• Transform the data, if needed
• Analyse the data to find patterns or to extract information
(a minimal end-to-end sketch of these steps follows below)
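
Below is a minimal end-to-end sketch of these steps in Python with pandas. The file name process_data.csv and the column names timestamp and temperature are hypothetical placeholders, and the 3-standard-deviation outlier rule is just one common choice:

import pandas as pd
import matplotlib.pyplot as plt

# 1. Extract data from the database (here via a CSV export -- hypothetical file name)
df = pd.read_csv("process_data.csv", parse_dates=["timestamp"])

# 2. Visualize the raw data before any processing
df.plot(x="timestamp", y="temperature")
plt.show()

# 3. Select an appropriate segment (e.g. one week of steady operation)
segment = df[(df["timestamp"] >= "2024-01-01") & (df["timestamp"] < "2024-01-08")]

# 4. Preprocess: flag points more than 3 std devs from the mean and replace them
x = segment["temperature"]
outliers = (x - x.mean()).abs() > 3 * x.std()
clean = x.mask(outliers).interpolate()

# 5. Transform the data, if needed (here: standardise to mean 0, std dev 1)
z = (clean - clean.mean()) / clean.std()

# 6. Analyse the data, e.g. summary statistics, histograms, correlations
print(z.describe())
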
Data Visualisation
• Plot the data and have a look
• The basic rule is to select the plot which represents what you want to say in the clearest and simplest way
• Avoid ‘chart junk’ (e.g. plotting in 3D where 2D would be clearer)
• Popular options include time trends, bar charts, histograms, pie charts, etc. (see the short plotting sketch below)
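
As one illustration, a time trend and a histogram of the same signal with matplotlib; the data here are synthetic and made up purely for demonstration:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic process signal: a slow drift plus measurement noise (demo data only)
t = np.arange(500)
y = 50 + 0.01 * t + np.random.normal(scale=1.0, size=t.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(t, y)                      # time trend: shows how the signal drifts over time
ax1.set_xlabel("sample")
ax1.set_ylabel("measurement")
ax2.hist(y, bins=30)                # histogram: shows the distribution of the values
ax2.set_xlabel("measurement")
plt.tight_layout()
plt.show()
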
Transforming data
• One reason we might ‘transform’ data is to convert from one scale to another
  – e.g. feet into inches, centigrade into Fahrenheit, raw IQ scores into standard IQ scores

• Scale conversion can usually be achieved by a simple linear transformation (multiplying/dividing by a constant and adding/subtracting a constant):
      Xnew = b*Xold + c
  – so to convert centigrade data into Fahrenheit we would apply:
      F = 1.8*C + 32     (i.e. b = 9/5 and c = 32)
Transforming data
• The z-transform (standardisation) is one common type of linear transform, which produces a new variable with mean 0 and standard deviation 1

• z-score:  z = (X − x̄) / s    (subtract the mean, divide by the standard deviation)

• Standardisation is useful when comparing the same dimension measured on different scales
• After standardisation these scales could also be added together (adding two quantities on different scales is obviously problematic)
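
A short sketch of standardisation in Python; the two example score lists below are hypothetical and simply stand for the same dimension measured on two different scales:

import numpy as np

def zscore(x):
    # Standardise to mean 0 and std dev 1 (sample std dev, divisor n - 1)
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

test_a = [52, 61, 47, 70, 65]        # scored out of 100
test_b = [3.1, 4.0, 2.8, 4.6, 4.2]   # scored out of 5

za, zb = zscore(test_a), zscore(test_b)
combined = za + zb                   # only meaningful once both are on a common scale
print(np.round(combined, 2))
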
Process Time trends
Correlation Analysis
• Correlation is a measure of the relation between two or more variables
• The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data
• Correlation coefficients can range from −1.00 to +1.00:
  – −1.00 represents a perfect negative correlation
  – +1.00 represents a perfect positive correlation
  – 0.00 represents a lack of correlation
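
A minimal example of computing a (Pearson) correlation coefficient with NumPy; the variables are synthetic demo data:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)   # y is positively related to x
z = rng.normal(size=200)                         # z is unrelated to x

print(np.corrcoef(x, y)[0, 1])   # strong positive correlation (theoretically about +0.85)
print(np.corrcoef(x, z)[0, 1])   # near 0: essentially uncorrelated
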
A Nice Statistics Website
http://www.statsoft.com/textbook/stathome.html
Data Analysis
• Outliers: removal and replacement (one common detection rule is sketched below)
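
The slides do not specify an outlier-detection rule; one common choice is the inter-quartile-range (IQR) fence, sketched here with replacement by interpolation (the data are a hypothetical example with one spike):

import pandas as pd

x = pd.Series([10.1, 10.3, 9.9, 10.2, 55.0, 10.0, 10.4, 9.8])  # 55.0 is a spike

# IQR fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Remove the flagged points and replace them by linear interpolation
cleaned = x.mask(outliers).interpolate()
print(cleaned.tolist())
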
Flow Network Problem
