D06B-Data Preprocessing 2


Data Preparation

1. Data Cleaning
2. Data Reduction
3. Data Transformation and Data Discretization
4. Data Integration
(Data preparation within CRISP-DM)
Exercise
• Import the MissingDataValue-Noisy.csv data
• Use a regular expression (the Replace operator) to replace all noisy data in the nominal attributes with "N"
Exercise
1. Import the MissingDataValue-Noisy-Multiple.csv data
2. Use the Replace Missing Value operator to fill in the missing values
3. Use a regular expression (the Replace operator) to replace all noisy data in the nominal attributes with "N"
4. Use the Map operator to replace all entries of Face, FB, and Fesbuk with Facebook


1. Import the MissingDataValue-Noisy-Multiple.csv data
2. The Replace Missing Value operator fills in the missing values
3. The Replace operator replaces all noisy data in the nominal attributes with "N"
4. The Map operator replaces all entries of Face, FB, and Fesbuk with Facebook
2. Data Reduction

Data Reduction Methods
• Data Reduction
• Obtain a reduced representation of the data set that is much smaller in volume yet produces the same analytical results
• Why Data Reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis takes a very long time to run on the complete data set

• Data Reduction Methods
1. Dimensionality Reduction
   1. Feature Extraction
   2. Feature Selection
      1. Filter Approach
      2. Wrapper Approach
      3. Embedded Approach
2. Numerosity Reduction (Data Reduction)
   • Regression and log-linear models
   • Histograms, clustering, sampling
1. Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce the time and space required in data mining
• Allow easier visualization

• Dimensionality Reduction Methods:
1. Feature Extraction: Wavelet transforms, Principal Component Analysis (PCA)
2. Feature Selection: Filter, Wrapper, Embedded
Principal Component Analysis (Steps)
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
1. Normalize the input data so each attribute falls within the same range
2. Compute k orthonormal (unit) vectors, i.e., the principal components
3. Each input data vector is a linear combination of the k principal component vectors
4. The principal components are sorted in order of decreasing "significance" or strength
5. Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance

• Works for numeric data only
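The steps above can be followed end to end in code. Below is a minimal, illustrative sketch using Python and scikit-learn rather than the RapidMiner operators used in the exercises; the random placeholder data and the choice of k = 3 components are assumptions.

```python
# Minimal PCA sketch (scikit-learn), mirroring the steps above.
# Illustrative only; the exercises in these slides use RapidMiner operators instead.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Placeholder numeric data: 100 examples with 9 attributes (e.g., the Glass attributes).
rng = np.random.default_rng(0)
X = rng.random((100, 9))

# Step 1: normalize the input data so each attribute falls within the same range.
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: compute orthonormal principal components, sorted by decreasing variance,
# and keep only the k strongest components (here k = 3, an arbitrary choice).
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                 # (100, 3): the reduced representation
print(pca.explained_variance_ratio_)   # "significance" (variance share) of each component
```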
Exercise
• Carry out the experiment following Markus Hofmann's book (RapidMiner - Data Mining Use Cases), Chapter 4 (k-Nearest Neighbor Classification II), pp. 45-51
• Dataset: glass.data
• Analyze which preprocessing methods are used and why they are needed for this dataset!
• Compare the accuracy of k-NN and PCA + k-NN
Initial data before PCA (screenshot)

Data after PCA (screenshot)
Exercise
• Rebuild the process so that it compares the models produced by k-NN and PCA + k-NN
• Use 10-Fold X-Validation
Cross-Validation Method
• Cross-validation is used to avoid overlap in the test data
• Cross-validation steps:
1. Split the data into k subsets of equal size
2. Use each subset in turn as the test data and the remaining subsets as the training data
• Also called k-fold cross-validation
• The subsets are often stratified before cross-validation is performed, because stratification reduces the variance of the estimate
10-Fold Cross-Validation

Experiment    Accuracy
1             93%
2             91%
3             90%
4             93%
5             93%
6             91%
7             94%
8             93%
9             91%
10            90%
Average accuracy: 92%

(In each experiment, the orange subset is used as the test data and the rest as training data.)
10-Fold Cross-Validation
• The standard evaluation method is stratified 10-fold cross-validation
• Why 10? Extensive experiments and theoretical work have shown that 10-fold cross-validation is the best choice for obtaining accurate validation results
• 10-fold cross-validation repeats the test 10 times, and the reported measurement is the average of the 10 runs
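As a rough counterpart to the comparison exercise, the sketch below runs stratified 10-fold cross-validation over k-NN and PCA + k-NN in Python with scikit-learn. It is not the RapidMiner X-Validation process from the slides; the built-in iris data stands in for glass.data, and k = 5 neighbors and 3 components are arbitrary assumptions.

```python
# Illustrative comparison of k-NN vs. PCA + k-NN with stratified 10-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)  # placeholder; load the Glass attributes and label instead

models = {
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "PCA + k-NN": make_pipeline(StandardScaler(), PCA(n_components=3),
                                KNeighborsClassifier(n_neighbors=5)),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)  # one accuracy value per fold
    print(f"{name}: mean accuracy over 10 folds = {scores.mean():.3f}")
```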
Exercise
• Review which operators can be used for feature extraction
• Replace PCA with another feature extraction method
• Compare them and determine the best feature extraction method for Glass.data, using 10-fold cross-validation
Feature/Attribute Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained
in one or more other attributes
• E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data
mining task at hand
• E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Feature Selection Approach

A number of proposed approaches for feature selection can broadly be categorized into the following three classifications: wrapper, filter, and embedded (Liu & Tu, 2004)
1. In the filter approach, statistical analysis of the feature set is required, without utilizing any learning model (Dash & Liu, 1997)
2. In the wrapper approach, a predetermined learning model is assumed, wherein features are selected that justify the learning performance of the particular learning model (Guyon & Elisseeff, 2003)
3. The embedded approach attempts to utilize the complementary strengths of the wrapper and filter approaches (Huang, Cai, & Xu, 2007)
Wrapper Approach vs Filter Approach
[Diagram: the wrapper approach compared with the filter approach]
Feature Selection Approach
1. Filter Approach:
• information gain
• chi-square
• log-likelihood ratio
• etc.

2. Wrapper Approach:
• forward selection
• backward elimination
• randomized hill climbing
• etc.

3. Embedded Approach:
• decision tree
• weighted naïve Bayes
• etc.
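As one concrete illustration of the filter approach, the sketch below ranks features with a chi-square statistic and keeps the top k, using scikit-learn rather than a RapidMiner weighting operator; the iris data and k = 2 are assumptions.

```python
# Illustrative filter-approach sketch: score each feature with chi-square against the
# class label and keep the top k, without involving any learning model.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)       # placeholder data; chi2 needs non-negative features
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # chi-square score per feature (higher = more relevant)
print(selector.get_support())  # boolean mask of the k selected features
```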
2. Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation

1. Parametric methods (e.g., regression)
• Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
• Ex.: Log-linear models obtain the value at a point in m-D space as the product of appropriate marginal subspaces

2. Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, ...
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
• Data are modeled to fit a straight line
• Often uses the least-squares method to fit the line
• Multiple regression
• Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model
• Approximates discrete multidimensional probability distributions
Regression Analysis
• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: a fitted line y = x + 1, with the observed value Y1 at X1 and its fitted value Y1']
Regression Analysis and Log-Linear Models
• Linear regression: Y = w X + b
• Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
• Apply the least squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
• Multiple regression: Y = b0 + b1 X1 + b2 X2
• Many nonlinear functions can be transformed into the above
• Log-linear models:
• Approximate discrete multidimensional probability distributions
• Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
• Useful for dimensionality reduction and data smoothing
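To make the linear case concrete, here is a small sketch that estimates w and b of Y = w X + b by least squares with NumPy; the toy data points are made up for illustration.

```python
# Estimate the regression coefficients w and b of Y = w*X + b by least squares.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy explanatory values
Y = np.array([1.9, 4.1, 6.2, 7.8, 10.1])  # toy response values

# Design matrix: one column for X (slope w) and a column of ones for the intercept b.
A = np.column_stack([X, np.ones_like(X)])
(w, b), *_ = np.linalg.lstsq(A, Y, rcond=None)

print(f"Y ~ {w:.3f} * X + {b:.3f}")  # only these two parameters need to be stored
```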
Histogram Analysis
• Divide data into buckets and store the average (or sum) for each bucket
• Partitioning rules:
• Equal-width: equal bucket range
• Equal-frequency (or equal-depth)
[Figure: example histogram of a price attribute; x-axis 10,000 to 90,000, y-axis counts up to 40]
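A minimal sketch of histogram-based reduction with NumPy, using made-up price values: the raw values are replaced by equal-width buckets for which only a count (or average) is stored.

```python
# Equal-width histogram sketch: keep only per-bucket counts/averages instead of raw values.
import numpy as np

prices = np.array([12000, 15000, 31000, 45000, 47000, 52000, 68000, 71000, 88000, 91000])

counts, edges = np.histogram(prices, bins=5, range=(10000, 100000))  # equal-width buckets
print(edges)   # bucket boundaries
print(counts)  # number of values per bucket

# Average per bucket (stored in place of the raw data); empty buckets give nan.
bucket_ids = np.digitize(prices, edges[1:-1])
averages = [prices[bucket_ids == i].mean() if np.any(bucket_ids == i) else float("nan")
            for i in range(len(counts))]
print(averages)
```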
Clustering
• Partition data set into clusters based on
similarity, and store cluster representation (e.g.,
centroid and diameter) only
• Can be very effective if data is clustered but not
if data is “smeared”
• Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
• There are many choices of clustering definitions
and clustering algorithms

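A small sketch of clustering-based numerosity reduction with scikit-learn's KMeans (an assumption; any clustering algorithm would do): instead of the full data, only each cluster's centroid and a rough diameter are kept.

```python
# Cluster the data and keep only a compact representation (centroid + diameter per cluster).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # placeholder 2-D data

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
for c in range(5):
    members = X[kmeans.labels_ == c]
    centroid = kmeans.cluster_centers_[c]
    diameter = np.linalg.norm(members - centroid, axis=1).max() * 2  # crude diameter estimate
    print(f"cluster {c}: centroid={centroid.round(2)}, diameter~{diameter:.2f}, size={len(members)}")
```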
Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
• Key principle: choose a representative subset of the data
• Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods, e.g., stratified sampling

• Note: Sampling may not reduce database I/Os (page at a time)
Types of Sampling

• Simple random sampling


• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling
• Partition the data set, and draw samples from each partition (proportionally,
i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data

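The sampling variants above map directly onto NumPy's random choice; the sketch below is illustrative, with made-up data and sample sizes.

```python
# Simple random sampling with and without replacement (NumPy).
import numpy as np

population = np.arange(100)          # placeholder "data set N"
rng = np.random.default_rng(42)

srs_wor = rng.choice(population, size=10, replace=False)  # without replacement: no repeats
srs_wr = rng.choice(population, size=10, replace=True)    # with replacement: repeats possible

print(sorted(srs_wor))
print(sorted(srs_wr))
```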
Sampling: With or without Replacement
[Figure: samples drawn from the raw data with and without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data on the left, a cluster/stratified sample on the right]
Stratified Sampling
• Stratification is the process of dividing members of the population into homogeneous subgroups before sampling
• Suppose that in a company there are the following staff:
• Male, full-time: 90
• Male, part-time: 18
• Female, full-time: 9
• Female, part-time: 63
• Total: 180
• We are asked to take a sample of 40 staff, stratified according to the above categories
• An easy way to calculate each group's share of the sample is to multiply the group size by the sample size and divide by the total population:
• Male, full-time = 90 × (40 ÷ 180) = 20
• Male, part-time = 18 × (40 ÷ 180) = 4
• Female, full-time = 9 × (40 ÷ 180) = 2
• Female, part-time = 63 × (40 ÷ 180) = 14
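The proportional allocation above is just group size × sample size ÷ population; a few lines of Python reproduce the numbers (the staff counts are the ones from the slide).

```python
# Proportional (stratified) allocation: group_size * sample_size / total_population.
staff = {"Male, full-time": 90, "Male, part-time": 18,
         "Female, full-time": 9, "Female, part-time": 63}
sample_size = 40
total = sum(staff.values())  # 180

for group, size in staff.items():
    allocation = size * sample_size // total  # 20, 4, 2, 14
    print(f"{group}: {allocation}")
```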
3. Data Transformation and Data Discretization

Data Transformation
• A function that maps the entire set of values of a given
attribute to a new set of replacement values
• Each old value can be identified with one of the new values
• Methods:
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
Normalization
• Min-max normalization: to [new_min_A, new_max_A]

  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

• Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):

  v' = (v − μ_A) / σ_A

• Ex. Let μ_A = 54,000 and σ_A = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
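The three normalization formulas are easy to check in code; the sketch below reproduces the income example from the slide with plain NumPy (illustrative, not tied to RapidMiner).

```python
# Min-max, z-score, and decimal-scaling normalization of the income example.
import numpy as np

v = 73_600.0

# Min-max normalization to [0.0, 1.0] for incomes ranging from 12,000 to 98,000.
min_a, max_a, new_min, new_max = 12_000.0, 98_000.0, 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min   # ~ 0.716

# Z-score normalization with mean 54,000 and standard deviation 16,000.
mu, sigma = 54_000.0, 16_000.0
v_zscore = (v - mu) / sigma                                                # 1.225

# Decimal scaling: divide by 10^j, with the smallest j such that max(|v'|) < 1.
values = np.array([v])
j = int(np.ceil(np.log10(np.abs(values).max())))
v_decimal = values / 10**j                                                 # 0.736

print(round(v_minmax, 3), round(v_zscore, 3), v_decimal)
```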
Discretization

• Three types of attributes
• Nominal: values from an unordered set, e.g., color, profession
• Ordinal: values from an ordered set, e.g., military or academic rank
• Numeric: real numbers, e.g., integer or real values
• Discretization: divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepares the data for further analysis, e.g., classification
Data Discretization Methods
Typical methods (all the methods can be applied recursively):
• Binning: top-down split, unsupervised
• Histogram analysis: top-down split, unsupervised
• Clustering analysis: unsupervised, top-down split or bottom-up merge
• Decision-tree analysis: supervised, top-down split
• Correlation (e.g., χ²) analysis: unsupervised, bottom-up merge
Simple Discretization: Binning

• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
• The most straightforward, but outliers may dominate the presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34

• Partition into equal-frequency (equi-depth) bins:


• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34

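The worked example above can be reproduced with a short script; this is a sketch of equal-depth binning plus smoothing by bin means, using the same price list (plain Python/NumPy, not a RapidMiner operator).

```python
# Equal-frequency (equal-depth) binning and smoothing by bin means for the price example.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = np.array_split(prices, 3)  # three bins of (approximately) equal frequency

smoothed_by_means = [round(float(b.mean())) for b in bins for _ in range(len(b))]
print([b.tolist() for b in bins])  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed_by_means)           # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
```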
Discretization Without Using Class Labels (Binning vs. Clustering)
[Figure, four panels: the raw data, equal interval width (binning), equal frequency (binning), and K-means clustering; K-means clustering leads to better results]
Discretization by Classification & Correlation Analysis
• Classification (e.g., decision tree analysis)
• Supervised: given class labels, e.g., cancerous vs. benign
• Uses entropy to determine the split point (discretization point)
• Top-down, recursive split
• Correlation analysis (e.g., ChiMerge: χ²-based discretization)
• Supervised: uses class information
• Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
• Merging is performed recursively, until a predefined stopping condition is met
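As a small illustration of the ChiMerge criterion (a sketch with made-up counts, using SciPy rather than any tool from the slides): the χ² statistic is computed over the class counts of two neighboring intervals, and a low value indicates similar class distributions, i.e., a good pair to merge.

```python
# ChiMerge-style merge test: chi-square over the class distributions of two adjacent intervals.
from scipy.stats import chi2_contingency

# Rows = two neighboring intervals, columns = class counts (made-up numbers).
contingency = [[10, 2],   # interval 1: 10 examples of class A, 2 of class B
               [9, 3]]    # interval 2: 9 examples of class A, 3 of class B

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")  # low chi2 -> similar distributions -> merge
```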
4. Data Integration

Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
• Object identification: the same attribute or object may have different names in different databases
• Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Summary
1. Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
2. Data cleaning: e.g., missing/noisy values, outliers
3. Data reduction
• Dimensionality reduction
• Numerosity reduction
4. Data transformation and data discretization
• Normalization
5. Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
