Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments,
human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., Occupation = “ ” (missing data)
Binning
first sort the data and partition it into (equal-frequency) bins
then smooth by bin means, bin medians, bin boundaries, etc.
(a minimal sketch follows this list)
Regression
smooth by fitting the data to regression functions
Clustering
detect and remove outliers
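A minimal sketch of the binning approach above (equal-frequency bins, smoothing by bin means), using numpy; the price values are illustrative:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Sort values into (near-)equal-frequency bins, then replace
    each value with the mean of its bin (smoothing by bin means)."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(sorted_vals, n_bins)  # near-equal-frequency partition
    return [list(np.full(len(b), b.mean())) for b in bins]

# Illustrative sorted prices partitioned into 3 bins of 3 values each
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# bin 1: 4, 8, 15   -> mean 9
# bin 2: 21, 21, 24 -> mean 22
# bin 3: 25, 28, 34 -> mean 29
```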
Χ2 (chi-square) test
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose
actual count is very different from the expected count
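A minimal sketch of the χ² computation for a contingency table, using numpy; the 2×2 counts below are hypothetical, with expected counts derived from the row and column totals:

```python
import numpy as np

def chi_square(observed):
    """Pearson chi-square statistic for a contingency table:
    chi2 = sum((observed - expected)^2 / expected), where the
    expected count of a cell is row_total * col_total / n."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    return ((observed - expected) ** 2 / expected).sum()

# Hypothetical 2x2 table of counts for two categorical variables
table = [[250, 200],
         [50, 1000]]
print(chi_square(table))  # ~507.93: large value, variables likely related
```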
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
Correlation Analysis
Correlation coefficient (Pearson's product-moment coefficient):

$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA
and σB are the respective standard deviations of A and B, and Σ(aibi) is the sum of the
AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do).
The higher the value, the stronger the correlation.
rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
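A minimal sketch of the coefficient above, using numpy; np.std defaults to the population standard deviation (ddof = 0), matching the division by n in the formula. The attribute values are made up:

```python
import numpy as np

def pearson_correlation(a, b):
    """r_{A,B} = sum((a_i - mean_A)(b_i - mean_B)) / (n * sigma_A * sigma_B),
    with population standard deviations as in the formula above."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    return ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())

# Hypothetical attribute values; r near +1 means strong positive correlation
A = [2, 4, 6, 8, 10]
B = [1, 3, 5, 7, 11]
print(pearson_correlation(A, B))  # ~0.99
```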
Data Reduction
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical results
Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Data compression
Dimensionality reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less
meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
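A minimal PCA sketch via eigendecomposition of the covariance matrix, using numpy; the 3-D points are made up for illustration, and this is one standard way to compute PCA (SVD is also common in practice):

```python
import numpy as np

def pca(X, k):
    """Project an n x d data matrix X onto its top-k principal components:
    center the data, take the eigenvectors of the covariance matrix with
    the largest eigenvalues, and project onto them."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k largest components
    return X_centered @ top_k

# Hypothetical 3-D points reduced to 2 dimensions
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9]])
print(pca(X, 2))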
Min-max normalization: to [new_minA, new_maxA]

$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is
mapped to (73,000 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 ≈ 0.709
Z-score normalization (μ: mean, σ: standard deviation):

$v' = \frac{v - \mu_A}{\sigma_A}$
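A minimal sketch of both normalizations, using numpy; the income values echo the example above:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min"""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """v' = (v - mu_A) / sigma_A, with population standard deviation"""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

incomes = np.array([12000, 73000, 98000])
print(min_max_normalize(incomes))  # 73,000 maps to 61000/86000 ~ 0.709
print(z_score_normalize(incomes))
```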
Progressive sampling: starts with a very small sample and then increases the sample size
until a sample of sufficient size is obtained.
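A minimal sketch of that progressive-sampling loop, assuming a caller-supplied (hypothetical) evaluate function that scores a sample (e.g., model accuracy) and a simple plateau criterion for "sufficient size":

```python
import numpy as np

def progressive_sample(data, evaluate, start=100, growth=2, tol=0.01):
    """Grow the sample until the evaluation score stops improving by
    more than `tol` (a hypothetical sufficiency criterion)."""
    rng = np.random.default_rng(42)
    size, prev_score = start, -np.inf
    while size <= len(data):
        sample = rng.choice(data, size=size, replace=False)
        score = evaluate(sample)
        if score - prev_score < tol:   # plateau: sample deemed sufficient
            return sample
        prev_score, size = score, size * growth
    return data  # fall back to the full data set
```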
Dimensionality Reduction
Why is it needed?
Many algorithms work better in low dimensionality.
Allows better data visualization.
Reduces the amount of processing time and memory required.
Dimensionality reduction reduces dimensionality by creating new attributes
that are combinations of existing attributes.
Reducing dimensionality by selecting new attributes that are a subset of the
old attributes is known as feature subset selection.
Curse of Dimensionality
Feature Subset Selection
Creating a new set of features from the original set of features is known as
feature extraction.
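A minimal filter-style sketch of feature subset selection, keeping the k attributes most correlated (in absolute value) with a numeric target; the correlation-based ranking is just one simple scoring choice among many:

```python
import numpy as np

def select_top_k_features(X, y, k):
    """Filter-style feature subset selection: score each attribute by
    |correlation with the target y| and keep the k highest-scoring ones."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    keep = np.argsort(scores)[::-1][:k]   # indices of the k best attributes
    return X[:, keep], keep
```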
Discretization and Binarization
Mapping continuous-valued attributes to categorical attributes is called
discretization.
Mapping continuous-valued attributes to one or more binary attributes is
called binarization.
Example: Discretization of Continuous-Valued Attributes
Unsupervised Discretization
Equal-width intervals
Equal-depth intervals (a sketch of both unsupervised variants follows this list)
Supervised Discretization
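A minimal sketch of the two unsupervised variants, using numpy; equal depth is implemented via ranks, one of several reasonable choices:

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width and
    return the bin index of each value."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

def equal_depth_bins(values, n_bins):
    """Assign values to bins holding (approximately) the same number
    of values each (equal-depth / equal-frequency)."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values))   # rank of each value
    return ranks * n_bins // len(values)

vals = [5, 7, 8, 12, 15, 30, 31, 50]
print(equal_width_bins(vals, 3))   # equal widths: [5,20), [20,35), [35,50]
print(equal_depth_bins(vals, 3))   # (nearly) equal counts per bin
```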