Data Mining: Concepts and Techniques
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
technology limitations
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
Clustering
detect and remove outliers
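A minimal Python sketch of binning-based smoothing, using the textbook's sorted price list and replacing each value by its bin mean (the function name is illustrative):

# Equal-frequency binning with smoothing by bin means (a sketch).
def smooth_by_bin_means(values, n_bins):
    data = sorted(values)                    # first sort the data
    size = len(data) // n_bins               # equal-frequency partition
    smoothed = []
    for i in range(n_bins):
        lo = i * size
        hi = len(data) if i == n_bins - 1 else lo + size
        mean = sum(data[lo:hi]) / (hi - lo)  # replace each value by its bin mean
        smoothed += [mean] * (hi - lo)
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # the book's price example
print(smooth_by_bin_means(prices, 3))        # -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]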
Correlation Analysis (Nominal Data)
χ² (chi-square) test: χ² = Σ (Observed − Expected)² / Expected
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to the third variable: population
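A minimal Python sketch of this computation, using the textbook's play-chess vs. like-science-fiction contingency table (expected counts derive from the row and column totals):

# Chi-square test on a 2x2 contingency table (a sketch).
rows = [[250, 200],    # like science fiction: plays chess / does not
        [50, 1000]]    # dislikes science fiction: plays chess / does not
n = sum(map(sum, rows))
row_tot = [sum(r) for r in rows]
col_tot = [sum(r[j] for r in rows) for j in range(2)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / n           # e.g., 450 * 300 / 1500 = 90
        chi2 += (rows[i][j] - expected) ** 2 / expected  # (Observed - Expected)^2 / Expected
print(round(chi2, 2))  # -> 507.94, far above the 0.001-significance threshold of 10.828 for 1 degree of freedom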
Covariance Calculation: An Example
Covariance: Cov(A, B) = E(A·B) − E(A)·E(B)
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
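The arithmetic can be verified with a few lines of Python (a sketch assuming equal-probability samples):

# Covariance check for the stock example above.
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)
mean_a, mean_b = sum(A) / n, sum(B) / n                        # E(A) = 4.0, E(B) = 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b   # E(A·B) − E(A)·E(B)
print(round(cov, 2))  # -> 4.0; positive, so A and B tend to rise together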
Visually Evaluating Correlation
(Figure: scatter plots showing correlations ranging from −1 to 1.)
LAB: Visual correlation
Open Weka -> explorer
Choose iris dataset
Click visualize tab
See the correlation between petal_length and petal_width
Click the figure (graph) to open a more detailed graph
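Outside Weka, the same relationship can be checked numerically; a sketch assuming scikit-learn's bundled copy of the iris data:

# Pearson correlation between petal length and petal width in iris.
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
petal_length = iris.data[:, 2]    # third column: petal length (cm)
petal_width = iris.data[:, 3]     # fourth column: petal width (cm)
r = np.corrcoef(petal_length, petal_width)[0, 1]
print(round(r, 2))                # -> 0.96, a strong positive correlation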
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
The possible combinations of subspaces grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
What Is Wavelet Transform?
Decomposes a signal into different frequency subbands
Applicable to n-dimensional signals
Data are transformed to preserve relative distances between objects at different levels of resolution
Allows natural clusters to become more distinguishable
Used for image compression
Wavelet Transformation
(Figure: Haar-2 and Daubechies-4 wavelet basis functions.)
Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
Method:
Length, L, must be an integer power of 2 (pad with 0s when necessary)
Each transform has 2 functions: smoothing and difference
Applies the functions to pairs of data, resulting in two sets of data of length L/2
Applies the two functions recursively until it reaches the desired length
Wavelet Decomposition
Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2¾, −1¼, ½, 0, 0, −1, −1, 0] (the overall average, followed by detail coefficients from coarsest to finest resolution)
Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients are
retained
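A minimal sketch of this decomposition, assuming the Haar scheme of pairwise averages (smoothing) and pairwise half-differences (detail):

# Haar wavelet decomposition: recursive pairwise averaging and differencing.
from fractions import Fraction

def haar(signal):
    s = [Fraction(x) for x in signal]      # exact arithmetic, for readability
    details = []
    while len(s) > 1:
        avgs = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        diffs = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = diffs + details          # finer detail coefficients go last
        s = avgs                           # recurse on the smoothed half
    return s + details

print([str(c) for c in haar([2, 2, 0, 2, 3, 5, 4, 4])])
# -> ['11/4', '-5/4', '1/2', '0', '0', '-1', '-1', '0'], i.e. S^ = [2¾, −1¼, ½, 0, 0, −1, −1, 0]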
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance
matrix, and these eigenvectors define the new space
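A compact sketch of these steps, assuming NumPy and synthetic 2-D data purely for illustration:

# PCA via eigenvectors of the covariance matrix (a sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated 2-D data

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # its eigenvectors define the new space
order = np.argsort(eigvals)[::-1]        # sort by variance captured, descending
top = eigvecs[:, order[:1]]              # keep the first principal component
X_reduced = Xc @ top                     # project onto the smaller space
print(X_reduced.shape)                   # -> (100, 1): dimensionality reduced from 2 to 1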
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model; estimate the model parameters and store only the parameters
Non-parametric methods (e.g., histograms, clustering, sampling)
Do not assume models; store reduced representations of the data
Regression Analysis
(Figure: data points with a fitted line y = x + 1; at X1, the observed value Y1 differs from the predicted value Y1'.)
Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called a response variable or measurement) and of one or more independent variables (also known as explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
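A minimal sketch of a one-predictor least-squares fit; the sample points are illustrative and lie exactly on the figure's line y = x + 1:

# Closed-form least-squares fit of y = w0 + w1*x.
def least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    w0 = my - w1 * mx                       # intercept from the means
    return w0, w1

w0, w1 = least_squares([1, 2, 3, 4], [2, 3, 4, 5])
print(w0, w1)   # -> 1.0 1.0, i.e. the fitted line is y = x + 1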
LAB: Linear Regression
Open Weka -> explorer
Choose cpu dataset (numerical)
Histogram
Equal-width: in an equal-width histogram, the width of each bucket range is uniform (e.g., each bucket spans a $10 price range).
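A short sketch of equal-width bucketing, with illustrative prices and $10-wide buckets:

# Equal-width histogram buckets of width $10.
from collections import Counter

prices = [5, 8, 12, 14, 14, 18, 21, 23, 25, 25, 28, 34]   # illustrative data
width = 10
counts = Counter((p // width) * width for p in prices)    # bucket by lower bound
for lo in sorted(counts):
    print(f"${lo}-${lo + width - 1}: {counts[lo]}")
# -> $0-$9: 2   $10-$19: 4   $20-$29: 5   $30-$39: 1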
Sampling
Simple random sampling: there is an equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set into strata, and draw samples from each stratum (e.g., proportionally)
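A sketch of the three schemes using Python's standard random module (the population and strata are illustrative):

# Simple random sampling with/without replacement, plus stratified sampling.
import random

population = list(range(1, 101))
random.seed(42)                               # for reproducibility

srswor = random.sample(population, 10)        # without replacement: no repeats
srswr = random.choices(population, k=10)      # with replacement: repeats possible

strata = {"low": population[:50], "high": population[50:]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]
print(len(srswor), len(srswr), len(stratified))   # -> 10 10 10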
Sampling: With or without Replacement
(Figure: raw data reduced by SRSWOR, simple random sampling without replacement, and by SRSWR, simple random sampling with replacement.)
Sampling: Cluster or Stratified Sampling
LAB: stratified samples
Open weather.numerical dataset
Open the supervised.instance.StratifiedRemoveFolds filter
Configure the filter as shown in the figure
Data Compression
(Figure: original data is compressed and then reconstructed; lossless compression recovers the original data exactly, while lossy compression recovers only an approximation.)
Normalization
Z-score normalization (μ: mean, σ: standard deviation of the attribute): v' = (v − μ)/σ
Ex. Let μ = 54,000 and σ = 16,000. Then v = 73,600 is normalized to (73,600 − 54,000)/16,000 = 1.225
Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
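A small sketch of both normalizations; the z-score call reproduces the 1.225 example above, and the decimal-scaling input is illustrative:

# Z-score and decimal-scaling normalization (a sketch).
def z_score(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:   # smallest j with max(|v'|) < 1
        j += 1
    return [v / 10 ** j for v in values]

print(z_score([73600], mu=54000, sigma=16000))   # -> [1.225]
print(decimal_scaling([986, -351]))              # -> [0.986, -0.351] (j = 3)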
LAB: Normalize
Open Weka -> explorer
Choose weather.numerical dataset
Click filter and select filters.unsupervised.attribute.Normalize
Click on the filter to change the normalization scale if needed
Click apply
Check the new data
4. Label Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic rank
Numeric—real numbers, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Split (top-down) vs. merge (bottom-up)
Prepare for further analysis, e.g., classification
Example: age discretization; Young: 18-29; Career: 30-40; Mid-Life: 41-55; Empty-Nester: 56-69; Senior: 70+
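A minimal sketch of this mapping from ages to interval labels (boundaries follow the example; ages under 18 fall into Young here for simplicity):

# Replace raw ages with interval labels.
def discretize_age(age):
    if age <= 29: return "Young"          # 18-29 in the example
    if age <= 40: return "Career"
    if age <= 55: return "Mid-Life"
    if age <= 69: return "Empty-Nester"
    return "Senior"                       # 70+

print([discretize_age(a) for a in [22, 35, 48, 61, 75]])
# -> ['Young', 'Career', 'Mid-Life', 'Empty-Nester', 'Senior']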
LAB: Label Discretize
Open Weka -> explorer
Choose weather.numerical dataset
Click filter and select filters.unsupervised.attribute.Discretize
attributeIndices = 2 (attribute 2 is temperature), number of bins = 3, precision = 2
Click on the filter to change the discretization scale if needed
Click apply
Check the new data
Data Discretization Methods
Typical methods: All the methods can be applied
recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Correlation (e.g., χ²) analysis (supervised, bottom-up merge)
Summary
Data integration from multiple sources: remove redundancies, detect inconsistencies
Data reduction: dimensionality reduction, numerosity reduction, data compression