Data Preprocessing

This document discusses various techniques for data preprocessing including data cleaning, integration, reduction, and transformation. Data cleaning involves handling missing values and noisy data through techniques like mean imputation, binning, and regression. Data integration combines data from multiple sources by resolving inconsistencies and schema conflicts. Data reduction reduces data size through dimensionality reduction, numerosity reduction, and compression. Data transformation techniques normalize and discretize data into meaningful ranges.


Data preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization


Data Pre-processing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
Data Cleaning

 Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty instruments,
human or computer error, transmission error
 incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
 e.g., Occupation=“ ” (missing data)

 noisy: containing noise, errors, or outliers


 e.g., Salary=“−10” (an error)

 inconsistent: containing discrepancies in codes or names, e.g.,


 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records

 Intentional (e.g., disguised missing data)


 Jan. 1 as everyone’s birthday?
How to handle missing data

 Ignore the tuple: usually done when class label is missing


(when doing classification)—not effective when the % of
missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based such as Bayesian formula or decision tree
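As a hedged illustration of the automatic fill-in options above, here is a minimal pandas sketch; the DataFrame, column names, and values are invented for the example:

```python
import numpy as np
import pandas as pd

# Toy data with missing income values (all names/values are illustrative)
df = pd.DataFrame({
    "income": [48000, np.nan, 52000, np.nan, 61000],
    "class":  ["A", "A", "B", "B", "A"],
})

# Global constant (a sentinel standing in for "unknown")
df["income_const"] = df["income"].fillna(-1)

# Attribute mean over all tuples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean computed per class (usually the smarter choice)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```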
How to handle noisy data

 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.

 Regression
 smooth by fitting the data into regression functions

 Clustering
 detect and remove outliers

 Combined computer and human inspection


 detect suspicious values and check by human (e.g., deal with possible outliers)
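A small sketch of equal-frequency binning and smoothing with pandas follows; the price values are a made-up toy series, not data from these slides:

```python
import pandas as pd

# Toy values, already sorted
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-frequency bins
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: each value is replaced by the mean of its bin
by_means = prices.groupby(bins).transform("mean")

# Smooth by bin boundaries: each value snaps to the closer of its bin's min/max
lo = prices.groupby(bins).transform("min")
hi = prices.groupby(bins).transform("max")
by_bounds = lo.where((prices - lo).abs() <= (hi - prices).abs(), hi)

print(pd.DataFrame({"raw": prices, "bin": bins, "means": by_means, "bounds": by_bounds}))
```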
Data integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources are
different
 Possible reasons: different representations, different scales, e.g., metric
vs. British units
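To make the schema-integration and value-conflict points concrete, here is a hypothetical pandas sketch; the table names, columns, and the inch-to-centimetre conversion are assumptions for illustration only:

```python
import pandas as pd

# Two hypothetical sources whose schemas disagree on names and units
a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [180.0, 165.0]})
b = pd.DataFrame({"cust_num": [1, 2], "height_in": [70.9, 65.0]})

# Schema integration: map B's attribute name onto A's (cust_num is the same attribute as cust_id)
b = b.rename(columns={"cust_num": "cust_id"})

# Resolve a data value conflict caused by different units (inches -> cm)
b["height_cm_b"] = b["height_in"] * 2.54

merged = a.merge(b[["cust_id", "height_cm_b"]], on="cust_id", how="outer")
print(merged)
```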
Handling Redundancy in Data Integration

 Redundant data often occur when integrating multiple databases


 Object identification: The same attribute or object may have different names in different
databases
 Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue

 Redundant attributes may be detected by correlation analysis and covariance analysis
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality
Correlation Analysis

 χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
 The larger the χ² value, the more likely the variables are related
 The cells that contribute the most to the χ² value are those whose
actual count is very different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population
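A minimal sketch of the chi-square test using scipy; the 2×2 contingency table below is a hypothetical example, not data from the slides:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = values of attribute A, cols = values of attribute B
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
# A large chi2 (tiny p) suggests the two attributes are related, though not causally.
```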
Correlation Analysis

 Correlation coefficient (also called Pearson’s product moment coefficient)


i 1 (ai  A)(bi  B) 
n n
(ai bi )  n AB
rA, B   i 1
(n  1) A B (n  1) A B

where n is the number of tuples, and are the respective means of A and B, σA
and σB are the respective standard deviation of A and B, and Σ(aibi) is the sum of the
AB cross-product.
 If rA,B > 0, A and B are positively correlated (A’s values increase as B’s).
The higher, the stronger correlation.
 rA,B = 0: independent; rAB < 0: negatively correlated
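The same coefficient can be computed directly with numpy; the two attribute vectors below are made up for illustration:

```python
import numpy as np

# Illustrative numeric attributes A and B
A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

n = len(A)
# Sample form of the slide's formula: cross-deviations over (n-1) * sigma_A * sigma_B
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

# Same value via the library call
assert np.isclose(r, np.corrcoef(A, B)[0, 1])
print(round(r, 3))
```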
Data Reduction
 Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
 Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation

 Numerosity reduction (some simply call it: Data Reduction)


 Regression and Log-Linear Models
 Histograms, clustering, sampling
 Data cube aggregation

 Data compression
Dimensionality reduction

 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which are critical to clustering and outlier analysis, become less
meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)
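As a brief illustration of one of these techniques, a PCA sketch with scikit-learn; the random data are generated purely for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 100 points in 5 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Project onto the 2 principal components that retain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```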
Normalization
 Min-max normalization: to [new_min_A, new_max_A]
v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is
mapped to (73,000 − 12,000) / (98,000 − 12,000) = 0.709
 Z-score normalization (μ: mean, σ: standard deviation):
v′ = (v − μ_A) / σ_A
 Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 − 54,000) / 16,000 = 1.188
 Normalization by decimal scaling:
v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
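The three normalization formulas above, applied with numpy; this is a sketch that reuses the slide's income example, with $73,000 taken as the value being normalized:

```python
import numpy as np

v = np.array([12000.0, 73000.0, 98000.0])       # income values from the example

# Min-max normalization to [0.0, 1.0]
min_a, max_a = 12000.0, 98000.0
minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0

# Z-score normalization with the example mean and standard deviation
mu, sigma = 54000.0, 16000.0
zscore = (v - mu) / sigma

# Decimal scaling: j is the smallest integer such that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10 ** j

print(minmax.round(3))   # [0.    0.709 1.   ]
print(zscore.round(3))   # [-2.625  1.188  2.75 ]
print(decimal.round(3))  # [0.12 0.73 0.98]
```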
Aggregation

 Combining two or more records into a single object
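A short pandas illustration of aggregation; the sales table is invented for the example:

```python
import pandas as pd

# Illustrative daily sales records
sales = pd.DataFrame({
    "store":  ["S1", "S1", "S2", "S2"],
    "month":  ["Jan", "Jan", "Jan", "Feb"],
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Combine several records into single objects: one row per (store, month)
monthly = sales.groupby(["store", "month"], as_index=False)["amount"].sum()
print(monthly)
```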


Sampling Techniques
 Simple random sampling
 There is an equal probability of selecting any particular item
 Sampling without replacement
 Once an object is selected, it is removed from the population
 Sampling with replacement
A selected object is not removed from the population
 Stratified sampling:
 Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
 Used in conjunction with skewed data
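A pandas sketch of these sampling schemes on an invented, skewed dataset (90% label A, 10% label B):

```python
import pandas as pd

df = pd.DataFrame({"label": ["A"] * 90 + ["B"] * 10, "x": range(100)})

# Simple random sampling without replacement
srs_without = df.sample(n=10, replace=False, random_state=1)

# Simple random sampling with replacement
srs_with = df.sample(n=10, replace=True, random_state=1)

# Stratified sampling: draw the same percentage (10%) from each label partition
stratified = df.groupby("label", group_keys=False).apply(
    lambda part: part.sample(frac=0.1, random_state=1)
)

print(srs_without["label"].value_counts())
print(stratified["label"].value_counts())   # keeps the 90/10 proportion
```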
Progressive Sampling

 Starts with a very small sample and then progressively increases the sample size
until a sample of sufficient size is obtained.
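A toy sketch of the idea, assuming a hypothetical accuracy_at() function standing in for training and evaluating a model on a sample of the given size:

```python
import numpy as np

def accuracy_at(sample_size):
    """Hypothetical stand-in: train/evaluate a model on a sample of this size."""
    return 1.0 - 1.0 / np.sqrt(sample_size)   # made-up learning curve

size, prev_acc, tol = 100, 0.0, 0.01
while True:
    acc = accuracy_at(size)
    if acc - prev_acc < tol:        # accuracy has plateaued: sample is large enough
        break
    prev_acc, size = acc, size * 2  # otherwise keep growing the sample
print("sufficient sample size:", size)
```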
Dimensionality Reduction

 Need?
 Many algorithms work better with low-dimensional data.
 Allows better data visualization
 Reduces the processing time and memory required
 Dimensionality reduction reduces dimensionality by creating new attributes
which are combinations of existing attributes.
 The reduction of dimensionality by selecting new attributes that are a subset of the
old attributes is known as feature subset selection.
 Curse of Dimensionality
Feature Subset Selection

 There are three standard methods for feature subset selection


 Embedded Subset Selection
 Filter approach
 Wrapper Approach
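A scikit-learn sketch contrasting the three approaches on synthetic data; the specific estimators and parameters are illustrative choices, not prescribed by the slides:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Filter approach: score features independently of any particular model
filt = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print("filter keeps:", filt.get_support(indices=True))

# Wrapper approach: repeatedly fit a model and recursively eliminate weak features
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("wrapper keeps:", wrap.get_support(indices=True))

# Embedded selection: the model itself zeroes out features (L1 regularization)
emb = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print("embedded keeps:", (emb.coef_ != 0).sum(), "non-zero coefficients")
```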
Feature Extraction

 Creating a new set of features from the original set of features is known as
feature extraction.
Discretization and Binarization
 Mapping continuous-valued attributes to categorical attributes is called discretization.
 Mapping continuous-valued attributes to one or more binary attributes is called binarization.
 Example: age can be discretized into {young, middle-aged, senior}, or binarized into a
single 0/1 attribute such as adult (age ≥ 18).
Discretization of Continuous-Valued Attributes
 Unsupervised Discretization
 Equal Width intervals
 Equal Depth Intervals
 Supervised Discretization
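A short pandas sketch of unsupervised discretization and binarization; the age values and bin labels are made up for the example:

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Equal-width intervals: each bin spans the same range of values
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-depth (equal-frequency) intervals: each bin holds roughly the same count
equal_depth = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Binarization: one 0/1 attribute per category
binary = pd.get_dummies(equal_width, prefix="age")

print(pd.concat([ages, equal_width, equal_depth], axis=1, keys=["age", "width", "depth"]))
print(binary.head())
```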
