D06B-Data Preprocessing 2


Data Preparation

1. Data Cleaning
2. Data Reduction
3. Data Transformation and Data Discretization
4. Data Integration
(Data preparation within CRISP-DM)
Exercise
• Import the MissingDataValue-Noisy.csv data
• Use a regular expression (the Replace operator) to replace all noisy data in the nominal attributes with "N"
Exercise
1. Import the MissingDataValue-Noisy-Multiple.csv data
2. Use the Replace Missing Value operator to fill in the missing values
3. Use a regular expression (the Replace operator) to replace all noisy data in the nominal attributes with "N"
4. Use the Map operator to replace all entries of Face, FB, and Fesbuk with Facebook


1. Import the MissingDataValue-Noisy-Multiple.csv data
2. The Replace Missing Value operator fills in the missing values
3. The Replace operator replaces all noisy data in the nominal attributes with "N"
4. The Map operator replaces all entries of Face, FB, and Fesbuk with Facebook
2. Data Reduction

Data Reduction Methods
• Data Reduction
• Obtain a reduced representation of the data set that is much smaller in volume yet produces the same analytical results
• Why Data Reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis takes a very long time to run on the complete data set

• Data Reduction Methods
1. Dimensionality Reduction
   1. Feature Extraction
   2. Feature Selection
      1. Filter Approach
      2. Wrapper Approach
      3. Embedded Approach
2. Numerosity Reduction (Data Reduction)
   • Regression and log-linear models
   • Histograms, clustering, sampling
1. Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce the time and space required in data mining
• Allow easier visualization

• Dimensionality Reduction Methods:
1. Feature Extraction: Wavelet transforms, Principal Component Analysis (PCA)
2. Feature Selection: Filter, Wrapper, Embedded
Principal Component Analysis (Steps)
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
1. Normalize the input data so each attribute falls within the same range
2. Compute k orthonormal (unit) vectors, i.e., the principal components
3. Each input data vector is a linear combination of the k principal component vectors
4. The principal components are sorted in order of decreasing "significance" or strength
5. Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance

• Works for numeric data only
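The steps above can be followed end to end in code. Below is a minimal, illustrative sketch using Python and scikit-learn rather than the RapidMiner operators used in the exercises; the random placeholder data and the choice of k = 3 components are assumptions.

```python
# Minimal PCA sketch (scikit-learn), mirroring the steps above.
# Illustrative only; the exercises in these slides use RapidMiner operators instead.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Placeholder numeric data: 100 examples with 9 attributes (e.g., the Glass attributes).
rng = np.random.default_rng(0)
X = rng.random((100, 9))

# Step 1: normalize the input data so each attribute falls within the same range.
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: compute orthonormal principal components, sorted by decreasing variance,
# and keep only the k strongest components (here k = 3, an arbitrary choice).
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                 # (100, 3): the reduced representation
print(pca.explained_variance_ratio_)   # "significance" (variance share) of each component
```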
Exercise
• Carry out the experiment following Markus Hofmann's book (RapidMiner - Data Mining Use Cases), Chapter 4 (k-Nearest Neighbor Classification II), pp. 45-51
• Dataset: glass.data
• Analyze which preprocessing methods are used and why they are needed for this dataset!
• Compare the accuracy of k-NN and PCA + k-NN
Initial data before PCA (screenshot)

Data after PCA (screenshot)
Exercise
• Rebuild the process so that it compares the models produced by k-NN and PCA + k-NN
• Use 10-Fold X-Validation
Cross-Validation Method
• Cross-validation is used to avoid overlap in the test data
• Cross-validation steps:
1. Split the data into k subsets of equal size
2. Use each subset in turn as the test data and the remaining subsets as the training data
• Also called k-fold cross-validation
• The subsets are often stratified before cross-validation is performed, because stratification reduces the variance of the estimate
10-Fold Cross-Validation

Experiment    Accuracy
1             93%
2             91%
3             90%
4             93%
5             93%
6             91%
7             94%
8             93%
9             91%
10            90%
Average accuracy: 92%

(In each experiment, the orange subset is used as the test data and the rest as training data.)
10-Fold Cross-Validation
• The standard evaluation method is stratified 10-fold cross-validation
• Why 10? Extensive experiments and theoretical work have shown that 10-fold cross-validation is the best choice for obtaining accurate validation results
• 10-fold cross-validation repeats the test 10 times, and the reported measurement is the average of the 10 runs
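As a rough counterpart to the comparison exercise, the sketch below runs stratified 10-fold cross-validation over k-NN and PCA + k-NN in Python with scikit-learn. It is not the RapidMiner X-Validation process from the slides; the built-in iris data stands in for glass.data, and k = 5 neighbors and 3 components are arbitrary assumptions.

```python
# Illustrative comparison of k-NN vs. PCA + k-NN with stratified 10-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)  # placeholder; load the Glass attributes and label instead

models = {
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "PCA + k-NN": make_pipeline(StandardScaler(), PCA(n_components=3),
                                KNeighborsClassifier(n_neighbors=5)),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)  # one accuracy value per fold
    print(f"{name}: mean accuracy over 10 folds = {scores.mean():.3f}")
```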
Exercise
• Review which operators can be used for feature extraction
• Replace PCA with another feature extraction method
• Compare them and determine the best feature extraction method for Glass.data, using 10-fold cross-validation
Feature/Attribute Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained
in one or more other attributes
• E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data
mining task at hand
• E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Feature Selection Approach

A number of proposed approaches for feature selection can broadly be categorized into the following three classifications: wrapper, filter, and embedded (Liu & Tu, 2004)
1. In the filter approach, statistical analysis of the feature set is required, without utilizing any learning model (Dash & Liu, 1997)
2. In the wrapper approach, a predetermined learning model is assumed, wherein features are selected that justify the learning performance of the particular learning model (Guyon & Elisseeff, 2003)
3. The embedded approach attempts to utilize the complementary strengths of the wrapper and filter approaches (Huang, Cai, & Xu, 2007)
Wrapper Approach vs Filter Approach
[Diagram: the wrapper approach compared with the filter approach]
Feature Selection Approach
1. Filter Approach:
• information gain
• chi-square
• log-likelihood ratio
• etc.

2. Wrapper Approach:
• forward selection
• backward elimination
• randomized hill climbing
• etc.

3. Embedded Approach:
• decision tree
• weighted naïve Bayes
• etc.
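As one concrete illustration of the filter approach, the sketch below ranks features with a chi-square statistic and keeps the top k, using scikit-learn rather than a RapidMiner weighting operator; the iris data and k = 2 are assumptions.

```python
# Illustrative filter-approach sketch: score each feature with chi-square against the
# class label and keep the top k, without involving any learning model.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)       # placeholder data; chi2 needs non-negative features
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # chi-square score per feature (higher = more relevant)
print(selector.get_support())  # boolean mask of the k selected features
```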
2. Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation

1. Parametric methods (e.g., regression)
• Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
• Ex.: Log-linear models obtain the value at a point in m-D space as the product of appropriate marginal subspaces

2. Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, ...
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
• Data are modeled to fit a straight line
• Often uses the least-squares method to fit the line
• Multiple regression
• Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model
• Approximates discrete multidimensional probability distributions
Regression Analysis
• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: a fitted line y = x + 1, with the observed value Y1 at X1 and its fitted value Y1']
Regression Analysis and Log-Linear Models
• Linear regression: Y = w X + b
• Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
• Apply the least squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
• Multiple regression: Y = b0 + b1 X1 + b2 X2
• Many nonlinear functions can be transformed into the above
• Log-linear models:
• Approximate discrete multidimensional probability distributions
• Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
• Useful for dimensionality reduction and data smoothing
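To make the linear case concrete, here is a small sketch that estimates w and b of Y = w X + b by least squares with NumPy; the toy data points are made up for illustration.

```python
# Estimate the regression coefficients w and b of Y = w*X + b by least squares.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy explanatory values
Y = np.array([1.9, 4.1, 6.2, 7.8, 10.1])  # toy response values

# Design matrix: one column for X (slope w) and a column of ones for the intercept b.
A = np.column_stack([X, np.ones_like(X)])
(w, b), *_ = np.linalg.lstsq(A, Y, rcond=None)

print(f"Y ~ {w:.3f} * X + {b:.3f}")  # only these two parameters need to be stored
```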
Histogram Analysis
• Divide data into buckets and store the average (or sum) for each bucket
• Partitioning rules:
• Equal-width: equal bucket range
• Equal-frequency (or equal-depth)
[Figure: example histogram of a price attribute; x-axis 10,000 to 90,000, y-axis counts up to 40]
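A minimal sketch of histogram-based reduction with NumPy, using made-up price values: the raw values are replaced by equal-width buckets for which only a count (or average) is stored.

```python
# Equal-width histogram sketch: keep only per-bucket counts/averages instead of raw values.
import numpy as np

prices = np.array([12000, 15000, 31000, 45000, 47000, 52000, 68000, 71000, 88000, 91000])

counts, edges = np.histogram(prices, bins=5, range=(10000, 100000))  # equal-width buckets
print(edges)   # bucket boundaries
print(counts)  # number of values per bucket

# Average per bucket (stored in place of the raw data); empty buckets give nan.
bucket_ids = np.digitize(prices, edges[1:-1])
averages = [prices[bucket_ids == i].mean() if np.any(bucket_ids == i) else float("nan")
            for i in range(len(counts))]
print(averages)
```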
Clustering
• Partition data set into clusters based on
similarity, and store cluster representation (e.g.,
centroid and diameter) only
• Can be very effective if data is clustered but not
if data is “smeared”
• Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
• There are many choices of clustering definitions
and clustering algorithms

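A small sketch of clustering-based numerosity reduction with scikit-learn's KMeans (an assumption; any clustering algorithm would do): instead of the full data, only each cluster's centroid and a rough diameter are kept.

```python
# Cluster the data and keep only a compact representation (centroid + diameter per cluster).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # placeholder 2-D data

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
for c in range(5):
    members = X[kmeans.labels_ == c]
    centroid = kmeans.cluster_centers_[c]
    diameter = np.linalg.norm(members - centroid, axis=1).max() * 2  # crude diameter estimate
    print(f"cluster {c}: centroid={centroid.round(2)}, diameter~{diameter:.2f}, size={len(members)}")
```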
Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
• Key principle: choose a representative subset of the data
• Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods, e.g., stratified sampling

• Note: Sampling may not reduce database I/Os (page at a time)
Types of Sampling

• Simple random sampling


• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling
• Partition the data set, and draw samples from each partition (proportionally,
i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data

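The sampling variants above map directly onto NumPy's random choice; the sketch below is illustrative, with made-up data and sample sizes.

```python
# Simple random sampling with and without replacement (NumPy).
import numpy as np

population = np.arange(100)          # placeholder "data set N"
rng = np.random.default_rng(42)

srs_wor = rng.choice(population, size=10, replace=False)  # without replacement: no repeats
srs_wr = rng.choice(population, size=10, replace=True)    # with replacement: repeats possible

print(sorted(srs_wor))
print(sorted(srs_wr))
```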
Sampling: With or without Replacement
[Figure: samples drawn from the raw data with and without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data on the left, a cluster/stratified sample on the right]
Stratified Sampling
• Stratification is the process of dividing members of the population into homogeneous subgroups before sampling
• Suppose that in a company there are the following staff:
• Male, full-time: 90
• Male, part-time: 18
• Female, full-time: 9
• Female, part-time: 63
• Total: 180
• We are asked to take a sample of 40 staff, stratified according to the above categories
• An easy way to calculate each group's share of the sample is to multiply the group size by the sample size and divide by the total population:
• Male, full-time = 90 × (40 ÷ 180) = 20
• Male, part-time = 18 × (40 ÷ 180) = 4
• Female, full-time = 9 × (40 ÷ 180) = 2
• Female, part-time = 63 × (40 ÷ 180) = 14
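The proportional allocation above is just group size × sample size ÷ population; a few lines of Python reproduce the numbers (the staff counts are the ones from the slide).

```python
# Proportional (stratified) allocation: group_size * sample_size / total_population.
staff = {"Male, full-time": 90, "Male, part-time": 18,
         "Female, full-time": 9, "Female, part-time": 63}
sample_size = 40
total = sum(staff.values())  # 180

for group, size in staff.items():
    allocation = size * sample_size // total  # 20, 4, 2, 14
    print(f"{group}: {allocation}")
```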
3. Data Transformation and Data Discretization

Data Transformation
• A function that maps the entire set of values of a given
attribute to a new set of replacement values
• Each old value can be identified with one of the new values
• Methods:
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
Normalization
• Min-max normalization: to [new_min_A, new_max_A]

  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

• Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):

  v' = (v − μ_A) / σ_A

• Ex. Let μ_A = 54,000 and σ_A = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
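The three normalization formulas are easy to check in code; the sketch below reproduces the income example from the slide with plain NumPy (illustrative, not tied to RapidMiner).

```python
# Min-max, z-score, and decimal-scaling normalization of the income example.
import numpy as np

v = 73_600.0

# Min-max normalization to [0.0, 1.0] for incomes ranging from 12,000 to 98,000.
min_a, max_a, new_min, new_max = 12_000.0, 98_000.0, 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min   # ~ 0.716

# Z-score normalization with mean 54,000 and standard deviation 16,000.
mu, sigma = 54_000.0, 16_000.0
v_zscore = (v - mu) / sigma                                                # 1.225

# Decimal scaling: divide by 10^j, with the smallest j such that max(|v'|) < 1.
values = np.array([v])
j = int(np.ceil(np.log10(np.abs(values).max())))
v_decimal = values / 10**j                                                 # 0.736

print(round(v_minmax, 3), round(v_zscore, 3), v_decimal)
```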
Discretization

• Three types of attributes
• Nominal: values from an unordered set, e.g., color, profession
• Ordinal: values from an ordered set, e.g., military or academic rank
• Numeric: real numbers, e.g., integer or real values
• Discretization: divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepares the data for further analysis, e.g., classification
Data Discretization Methods
Typical methods (all the methods can be applied recursively):
• Binning: top-down split, unsupervised
• Histogram analysis: top-down split, unsupervised
• Clustering analysis: unsupervised, top-down split or bottom-up merge
• Decision-tree analysis: supervised, top-down split
• Correlation (e.g., χ²) analysis: unsupervised, bottom-up merge
Simple Discretization: Binning

• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
• The most straightforward, but outliers may dominate the presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34

• Partition into equal-frequency (equi-depth) bins:


• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34

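The worked example above can be reproduced with a short script; this is a sketch of equal-depth binning plus smoothing by bin means, using the same price list (plain Python/NumPy, not a RapidMiner operator).

```python
# Equal-frequency (equal-depth) binning and smoothing by bin means for the price example.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = np.array_split(prices, 3)  # three bins of (approximately) equal frequency

smoothed_by_means = [round(float(b.mean())) for b in bins for _ in range(len(b))]
print([b.tolist() for b in bins])  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed_by_means)           # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
```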
Discretization Without Using Class Labels (Binning vs. Clustering)
[Figure, four panels: the raw data, equal interval width (binning), equal frequency (binning), and K-means clustering; K-means clustering leads to better results]
Discretization by Classification & Correlation Analysis
• Classification (e.g., decision tree analysis)
• Supervised: given class labels, e.g., cancerous vs. benign
• Uses entropy to determine the split point (discretization point)
• Top-down, recursive split
• Correlation analysis (e.g., ChiMerge: χ²-based discretization)
• Supervised: uses class information
• Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
• Merging is performed recursively, until a predefined stopping condition is met
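As a small illustration of the ChiMerge criterion (a sketch with made-up counts, using SciPy rather than any tool from the slides): the χ² statistic is computed over the class counts of two neighboring intervals, and a low value indicates similar class distributions, i.e., a good pair to merge.

```python
# ChiMerge-style merge test: chi-square over the class distributions of two adjacent intervals.
from scipy.stats import chi2_contingency

# Rows = two neighboring intervals, columns = class counts (made-up numbers).
contingency = [[10, 2],   # interval 1: 10 examples of class A, 2 of class B
               [9, 3]]    # interval 2: 9 examples of class A, 3 of class B

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")  # low chi2 -> similar distributions -> merge
```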
4. Data Integration

Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
• Object identification: the same attribute or object may have different names in different databases
• Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Summary
1. Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
2. Data cleaning: e.g., missing/noisy values, outliers
3. Data reduction
• Dimensionality reduction
• Numerosity reduction
4. Data transformation and data discretization
• Normalization
5. Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
