Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments,
human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., Occupation = “ ” (missing data)
Binning
first sort the data and partition it into (equal-frequency) bins
then smooth by bin means, bin medians, bin boundaries, etc.
(a minimal sketch follows this list)
Regression
smooth by fitting the data to regression functions
Clustering
detect and remove outliers
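A minimal sketch of the binning approach above (equal-frequency bins, smoothing by bin means), using numpy; the price values are illustrative:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Sort values into (near-)equal-frequency bins, then replace
    each value with the mean of its bin (smoothing by bin means)."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(sorted_vals, n_bins)  # near-equal-frequency partition
    return [list(np.full(len(b), b.mean())) for b in bins]

# Illustrative sorted prices partitioned into 3 bins of 3 values each
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# bin 1: 4, 8, 15   -> mean 9
# bin 2: 21, 21, 24 -> mean 22
# bin 3: 25, 28, 34 -> mean 29
```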
Χ2 (chi-square) test
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose
actual count is very different from the expected count
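A minimal sketch of the χ² computation for a contingency table, using numpy; the 2×2 counts below are hypothetical, with expected counts derived from the row and column totals:

```python
import numpy as np

def chi_square(observed):
    """Pearson chi-square statistic for a contingency table:
    chi2 = sum((observed - expected)^2 / expected), where the
    expected count of a cell is row_total * col_total / n."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    return ((observed - expected) ** 2 / expected).sum()

# Hypothetical 2x2 table of counts for two categorical variables
table = [[250, 200],
         [50, 1000]]
print(chi_square(table))  # ~507.93: large value, variables likely related
```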
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
Correlation Analysis
Correlation coefficient (Pearson's product-moment coefficient):

$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA
and σB are the respective standard deviations of A and B, and Σ(aibi) is the sum of the
AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do).
The higher the value, the stronger the correlation.
rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
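A minimal sketch of the coefficient above, using numpy; np.std defaults to the population standard deviation (ddof = 0), matching the division by n in the formula. The attribute values are made up:

```python
import numpy as np

def pearson_correlation(a, b):
    """r_{A,B} = sum((a_i - mean_A)(b_i - mean_B)) / (n * sigma_A * sigma_B),
    with population standard deviations as in the formula above."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    return ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())

# Hypothetical attribute values; r near +1 means strong positive correlation
A = [2, 4, 6, 8, 10]
B = [1, 3, 5, 7, 11]
print(pearson_correlation(A, B))  # ~0.99
```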
Data Reduction
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical results
Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Data compression
Dimensionality reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less
meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
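A minimal PCA sketch via eigendecomposition of the covariance matrix, using numpy; the 3-D points are made up for illustration, and this is one standard way to compute PCA (SVD is also common in practice):

```python
import numpy as np

def pca(X, k):
    """Project an n x d data matrix X onto its top-k principal components:
    center the data, take the eigenvectors of the covariance matrix with
    the largest eigenvalues, and project onto them."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k largest components
    return X_centered @ top_k

# Hypothetical 3-D points reduced to 2 dimensions
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9]])
print(pca(X, 2))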
Min-max normalization: to [new_minA, new_maxA]

$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is
mapped to (73,000 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 ≈ 0.709
Z-score normalization (μ: mean, σ: standard deviation):

$v' = \frac{v - \mu_A}{\sigma_A}$
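A minimal sketch of both normalizations, using numpy; the income values echo the example above:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min"""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """v' = (v - mu_A) / sigma_A, with population standard deviation"""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

incomes = np.array([12000, 73000, 98000])
print(min_max_normalize(incomes))  # 73,000 maps to 61000/86000 ~ 0.709
print(z_score_normalize(incomes))
```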
Progressive sampling: starts with a very small sample and then increases the sample size
until a sample of sufficient size is obtained.
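A minimal sketch of that progressive-sampling loop, assuming a caller-supplied (hypothetical) evaluate function that scores a sample (e.g., model accuracy) and a simple plateau criterion for "sufficient size":

```python
import numpy as np

def progressive_sample(data, evaluate, start=100, growth=2, tol=0.01):
    """Grow the sample until the evaluation score stops improving by
    more than `tol` (a hypothetical sufficiency criterion)."""
    rng = np.random.default_rng(42)
    size, prev_score = start, -np.inf
    while size <= len(data):
        sample = rng.choice(data, size=size, replace=False)
        score = evaluate(sample)
        if score - prev_score < tol:   # plateau: sample deemed sufficient
            return sample
        prev_score, size = score, size * growth
    return data  # fall back to the full data set
```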
Dimensionality Reduction
Why is it needed?
Many algorithms work better in low dimensionality.
Allows better data visualization.
Reduces the amount of processing time and memory required.
Dimensionality reduction reduces dimensionality by creating new attributes
that are combinations of existing attributes.
Reducing dimensionality by selecting new attributes that are a subset of the
old attributes is known as feature subset selection.
Curse of Dimensionality
Feature Subset Selection
Creating a new set of features from the original set of features is known as
feature extraction.
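A minimal filter-style sketch of feature subset selection, keeping the k attributes most correlated (in absolute value) with a numeric target; the correlation-based ranking is just one simple scoring choice among many:

```python
import numpy as np

def select_top_k_features(X, y, k):
    """Filter-style feature subset selection: score each attribute by
    |correlation with the target y| and keep the k highest-scoring ones."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    keep = np.argsort(scores)[::-1][:k]   # indices of the k best attributes
    return X[:, keep], keep
```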
Discretization and Binarization
Mapping continuous-valued attributes to categorical attributes is called
discretization.
Mapping continuous-valued attributes to one or more binary attributes is
called binarization.
Example: Discretization of Continuous-Valued Attributes
Unsupervised Discretization
Equal-width intervals
Equal-depth intervals (a sketch of both unsupervised variants follows this list)
Supervised Discretization
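A minimal sketch of the two unsupervised variants, using numpy; equal depth is implemented via ranks, one of several reasonable choices:

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width and
    return the bin index of each value."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

def equal_depth_bins(values, n_bins):
    """Assign values to bins holding (approximately) the same number
    of values each (equal-depth / equal-frequency)."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values))   # rank of each value
    return ranks * n_bins // len(values)

vals = [5, 7, 8, 12, 15, 30, 31, 50]
print(equal_width_bins(vals, 3))   # equal widths: [5,20), [20,35), [35,50]
print(equal_depth_bins(vals, 3))   # (nearly) equal counts per bin
```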