CH1: Data Preprocessing
Data Mining
• Two different users may have two different assessments of the quality of the data
Major Tasks in Data Preprocessing
Data Cleaning
Data Cleaning tasks
• In those cases, averaging and imputation cause information loss – in other words, "missing" can itself be information!
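A minimal sketch of keeping the "missingness" signal before imputing, assuming pandas and a hypothetical income column:

import pandas as pd
import numpy as np

# Hypothetical toy data: income is missing for some records
df = pd.DataFrame({"income": [52000, np.nan, 48000, np.nan, 61000]})

# Keep the fact that a value was missing as its own feature ...
df["income_missing"] = df["income"].isna().astype(int)

# ... then impute, e.g. with the column mean
df["income"] = df["income"].fillna(df["income"].mean())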
Missing data caveats (ctd)
Noisy data
• Noise: Random error in a measured variable.
• Causes include technology limitations, e.g. faulty measurement instruments
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
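A small Python sketch reproducing the binning example above (the depth of 4 is taken from the example):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]   # equi-depth bins

# Smoothing by bin means: replace each value with its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the closest bin boundary
by_bounds = [[min((b[0], b[-1]), key=lambda edge: abs(v - edge)) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]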
Detect and remove outliers
Data Integration
Data integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Data reduction: Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
• Data compression
• Sampling
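As a quick illustration of sampling as a reduction strategy, a pandas sketch (the DataFrame and the 1% rate are hypothetical):

import pandas as pd

df = pd.DataFrame({"price": range(100_000)})               # hypothetical large table

srs = df.sample(frac=0.01, random_state=42)                # simple random sample without replacement
boot = df.sample(n=1_000, replace=True, random_state=42)   # sample with replacement (e.g. for bootstrapping)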
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
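The sparsity/distance claim above can be illustrated with a short numpy sketch (the sizes are arbitrary): as the number of dimensions grows, the nearest and farthest points from a query become almost equally far away.

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    points = rng.random((500, d))                     # 500 uniform random points in d dimensions
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")   # shrinks as d grows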
Feature selection
Basic Heuristics
• Compute pairwise correlations between attributes and remove highly correlated attributes:
• Naive Bayes assumes (conditionally) independent attributes and will benefit from removing correlated attributes
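A minimal pandas sketch of this heuristic; the 0.9 threshold and the function name drop_correlated are illustrative choices, not a fixed rule:

import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    # Absolute pairwise Pearson correlations, upper triangle only (avoid self/duplicate pairs)
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Greedily drop one attribute from each highly correlated pair
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)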
PCA: Principal Component Analysis
• Feature selection methods select a subset of attributes: no new attributes are created
• PCA instead constructs new attributes (principal components) from the original ones
• PCA was introduced by Karl Pearson, also known for Pearson's correlation coefficient
PCA (ctd)
• Idea: transform the coordinate system so that each new coordinate (principal component) is as expressive as possible
• expressivity: variance of the variable
• the 1st, 2nd, 3rd, ... PC should account for as much variance as possible
Source: https://knowledge.dataiku.com/latest/ml-analytics/statistics/concept-principal-component-analysis-pca.html
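A minimal sketch of PCA with scikit-learn (the data matrix here is synthetic and the choice of two components is illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # hypothetical numeric data: 200 samples, 5 attributes

X_std = StandardScaler().fit_transform(X)      # standardize so no attribute dominates by scale alone
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)               # the new, constructed attributes (PC1, PC2)
print(pca.explained_variance_ratio_)           # share of total variance captured by each PC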
Data Transformation
Data transformation
• Transformation: A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values
• Methods:
• Conversion
• Discretization
• Smoothing: remove noise from data (binning, clustering, regression)
• Normalization: scaled to fall within a small, specified range
• Attribute/feature construction: New attributes constructed from the given ones
Conversion
• Binary to numeric:
e.g., student = yes/no → convert to 0, 1
• Nominal to numeric
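A small pandas sketch of both conversions (the student/city columns are hypothetical):

import pandas as pd

df = pd.DataFrame({"student": ["yes", "no", "yes"],
                   "city": ["Paris", "Lyon", "Paris"]})

# Binary to numeric: map yes/no to 1/0
df["student"] = df["student"].map({"yes": 1, "no": 0})

# Nominal to numeric: one-hot (dummy) encoding, one 0/1 column per category
df = pd.get_dummies(df, columns=["city"])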
Normalization
• Variables tend to have ranges that vary greatly from each other
• The measurement unit used can affect the data analysis
• For some data mining algorithms, differences in ranges will lead to a tendency for the variable with a greater range to have undue influence on the results
• Data miners should normalize their numeric variables in order to standardize the scale of effect each variable has on the results
• Decimal scaling ensures that every normalized value lies between -1 and 1: v' = v / 10^d
• d = number of digits in the data value with the largest absolute value
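A numpy sketch of common normalization schemes, including the decimal scaling defined above (the values are toy data):

import numpy as np

v = np.array([-120.0, 45.0, 300.0, 987.0])

minmax = (v - v.min()) / (v.max() - v.min())    # min-max normalization to [0, 1]
zscore = (v - v.mean()) / v.std()               # z-score: zero mean, unit standard deviation

d = len(str(int(np.abs(v).max())))              # 987 has 3 digits, so d = 3
decimal = v / 10 ** d                           # decimal scaling: every value now lies in (-1, 1)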
Normalization - remarks
• Normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling
• The normalization parameters now become model parameters, and the same values should be used when the model is applied to new data (e.g. testing data)
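A scikit-learn sketch of this point (the tiny train/test arrays are just for illustration): the scaler is fit on the training data only and the same parameters are reused on new data.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[25.0]])

scaler = StandardScaler().fit(X_train)      # normalization parameters learned on training data only
X_train_n = scaler.transform(X_train)
X_test_n = scaler.transform(X_test)         # same mean/std reused on the new (test) data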
Transformations to achieve normality
• Some data mining algorithms and statistical methods require that the variables be normally distributed
Transformations to achieve normality
• Common transformations to achieve normality: ln(x), sqrt(x), 1/x, …
• Many methods have been developed, but this is still an active area of research
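A short sketch, using a synthetic right-skewed variable, showing how these transformations reduce skewness (scipy's skew is used here as a rough normality check):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=0.8, size=1000)   # hypothetical right-skewed variable (e.g. incomes)

print("skewness before:", round(skew(x), 2))
print("after ln(x):    ", round(skew(np.log(x)), 2))
print("after sqrt(x):  ", round(skew(np.sqrt(x)), 2))
print("after 1/x:      ", round(skew(1 / x), 2))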