
Chapter 1: Data Preprocessing

Data Mining

Heger Arfaoui - ENIT - 2023


References

Chapter 2: Data Preprocessing


Outline

• Data Preprocessing: An Overview


• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Why Data Preprocessing?
Real-world data

• Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names

• GIGO! Garbage In, Garbage Out


A multi-dimensional measure of data quality

• A well-accepted multi-dimensional view:
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?

• Two different users may have two different assessments of the quality of the data
Major Tasks in Data Preprocessing
Data Cleaning
Data Cleaning tasks

• Fill in missing values


• Identify outliers and smooth out noisy data
• Correct inconsistent data
Missing Data
• Data is not always available

• Missing data may be due to:


• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• intentional

• Missing data may need to be inferred


How to handle missing data?
• Ignore records with missing values
• acceptable especially for large datasets with only a few missing records
• Replace missing values with...
• a default or special value (e.g., 0, “missing”)
• the average/median value for numerics
• the most frequent value for nominals

• Try to predict missing values:


• handle missing values as learning problem
• target: attribute which has missing values
• training data: instances where the attribute is present
• test data: instances where the attribute is missing
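
A minimal sketch of these strategies with pandas; the toy table and its `age`/`city` columns are hypothetical:

```python
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "age":  [23, None, 31, 27, None],
    "city": ["Tunis", "Sfax", None, "Tunis", "Tunis"],
})

# Option 1: ignore records with missing values
dropped = df.dropna()

# Option 2: replace with a statistic (median for numerics, mode for nominals)
filled = df.copy()
filled["age"]  = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# Option 3: treat the attribute with missing values as a prediction target,
# training on rows where it is present and predicting the rows where it is missing
```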
Missing data: caveats
Note: values may be missing for various reasons
...and, more importantly: at random vs. not at random

• Examples of not-at-random missingness:
– Non-mandatory questions in questionnaires
• e.g., “how often do you drink alcohol?”
– Values only valid for certain data sub-populations
• e.g., “are you currently pregnant?”
– Sensors failing under certain conditions
• e.g., at high temperatures

• In those cases, averaging and imputation cause information loss – in other words, “missing” can itself be information!
Missing data caveats (ctd)
Noisy data
• Noise: Random error in a measured variable.

• Incorrect attribute values may be due to:
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions


How to handle noisy data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection

• detect suspicious values and check by human (e.g., deal with possible outliers)
Binning method for data smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
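
A small, library-free sketch of equi-depth binning and both smoothing variants for the price list above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

n_bins = 3
size = len(prices) // n_bins                       # equi-depth: 4 values per bin
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: replace every value by its bin's mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]
# -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace every value by the closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
# -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```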
Detect and remove outliers
Data Integration
Data integration

• Data integration:
• Combines data from multiple sources into a coherent store
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William
Clinton

• Detecting and resolving data value conflicts


• For the same real-world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling redundant data in data integration

• Redundant data occur often when integrating multiple DBs


• The same attribute may have different names in different databases
• One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may often be detected by correlation analysis, e.g., with the Pearson correlation coefficient:

$$ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} $$
• Careful integration can help reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
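
A short sketch of such a correlation check with pandas; the 0.9 threshold is an arbitrary illustrative choice:

```python
import pandas as pd

def redundant_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return pairs of numeric attributes whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    return [(cols[i], cols[j], corr.iloc[i, j])
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > threshold]
```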

Data Reduction
Data reduction

• A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.

• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results

Data reduction strategies

• Numerosity reduction (some simply call it: data reduction)
• Histograms, clustering
• Sampling

• Dimensionality reduction: e.g., remove unimportant attributes
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation

• Data compression
Sampling

• Sampling: obtaining a small sample s to represent the whole data set N


• Allow a mining algorithm to run in complexity that is potentially sub-linear to
the size of the data

• Key principle: Choose a representative subset of the data


• Simple random sampling may have very poor performance in the
presence of skew

• Develop adaptive sampling methods, e.g., stratified sampling


Types of sampling

• Simple random sampling


• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
• Used in conjunction with skewed data
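
A sketch of both sampling schemes with pandas; the `label` column used for stratification is hypothetical:

```python
import pandas as pd

def simple_random_sample(df: pd.DataFrame, frac: float = 0.1, replace: bool = False):
    # Every record has the same probability of being selected;
    # with replace=True a record can be drawn more than once.
    return df.sample(frac=frac, replace=replace, random_state=42)

def stratified_sample(df: pd.DataFrame, strata_col: str = "label", frac: float = 0.1):
    # Partition by the stratification column and draw roughly the same
    # percentage from each partition - useful for skewed data.
    return (df.groupby(strata_col, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=42)))
```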
Dimensionality reduction

• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
Feature selection
Basic Heuristics

• Remove nominal attributes...


• which have more than p% identical values
• example: millionaire=false
• which have more than p% different values
• example: names, IDs

• Remove numerical attributes


• which have little variation, i.e., standard deviation < s

• Compute pairwise correlations between attributes and remove highly correlated attributes
• e.g., Naive Bayes assumes independent attributes and will benefit from removing correlated attributes
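
A rough sketch of these heuristics with pandas; the thresholds `p`, `s`, and `corr_max` are illustrative defaults, not values from the slides:

```python
import pandas as pd

def select_features(df: pd.DataFrame, p: float = 0.95, s: float = 0.01, corr_max: float = 0.95):
    keep = []
    n = len(df)
    for col in df.columns:
        values = df[col]
        if values.dtype == object:                          # nominal attribute
            top_share = values.value_counts(normalize=True).iloc[0]
            if top_share > p or values.nunique() / n > p:   # nearly constant, or ID-like
                continue
        elif values.std() < s:                              # numerical attribute with little variation
            continue
        keep.append(col)

    # Remove one attribute of each highly correlated numeric pair
    corr = df[keep].corr(numeric_only=True).abs()
    drop = {b for a in corr.columns for b in corr.columns
            if a < b and corr.loc[a, b] > corr_max}
    return [c for c in keep if c not in drop]
```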
PCA: Principal Component Analysis

• Feature selection methods select a subset of attributes: no new attributes are created

• PCA creates a (smaller set of) new attributes


• artificial linear combinations of existing attributes
• as expressive as possible

• Dates back to the pre-computer age


• invented by Karl Pearson (1857-1936)


• also known for Pearson's correlation coefficient
PCA (ctd)
• Idea: transform the coordinate system so that each new coordinate (principal component) is as expressive as possible
• expressivity: variance of the variable
• the 1st, 2nd, 3rd, … PCs should account for as much variance as possible

• further PCs can be neglected


Source: https://knowledge.dataiku.com/latest/ml-analytics/statistics/concept-principal-component-analysis-pca.html
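
A minimal PCA sketch with scikit-learn; the data matrix and the choice of 2 components are arbitrary, and standardizing first is a common (not mandatory) practice so that no attribute dominates because of its scale:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)                  # hypothetical data: 100 records, 5 attributes

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per attribute
pca = PCA(n_components=2)                   # keep the 2 most expressive components
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)        # variance accounted for by each principal component
```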
Data Transformation
Data transformation

• Transformation: A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of the new values

• Methods:
• Conversion
• Discretization
• Smoothing: remove noise from data (binning, clustering, regression)
• Normalization: scaled to fall within a small, specified range
• Attribute/feature construction: new attributes constructed from the given ones
Conversion

• Binary to numeric
• e.g., student = yes/no converted to 1/0

• Ordered to numeric: ordered attributes (e.g., grade) can be converted to numbers preserving order
– A → 4.0
– A- → 3.7
– B+ → 3.3
– B → 3.0

• Nominal to numeric
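
A small sketch of these conversions with pandas; the `student`, `grade`, and `color` columns are hypothetical, and one-hot encoding is one common choice for the nominal-to-numeric step:

```python
import pandas as pd

df = pd.DataFrame({"student": ["yes", "no", "yes"],
                   "grade":   ["A", "B+", "A-"],
                   "color":   ["red", "blue", "red"]})

# Binary to numeric
df["student"] = df["student"].map({"no": 0, "yes": 1})

# Ordered to numeric, preserving order
df["grade"] = df["grade"].map({"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0})

# Nominal to numeric (one-hot encoding)
df = pd.get_dummies(df, columns=["color"])
```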
Normalization

• Variables tend to have ranges that vary greatly from each other
• The measurement unit used can affect the data analysis
• For some data mining algorithms, differences in ranges will lead to a tendency for the
variable with a greater range to have undue influence on the results

• Data miners should normalize their numeric variables in order to standardize the
scale of effect each variable has on the results

• Algorithms that make use of distance measures (such as k-Nearest Neighbors)


benefit from normalization

• The terms normalization and scaling are often used interchangeably


Min-max normalization

• Performs a linear transformation on the original data

• Min-max normalization preserves the relationships among the original data values

• Values range between 0 and 1

• Min-max normalization will encounter an « out-of-bounds » error if a future input case for normalization falls outside of the original data range of X
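
In symbols, writing $v$ for an original value of attribute $X$ and $v'$ for its normalized value, the standard min-max transformation is:

$$ v' = \frac{v - \min(X)}{\max(X) - \min(X)} $$

so that $\min(X)$ maps to 0 and $\max(X)$ maps to 1.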
Z-score normalization

• Also called zero-mean normalization

• Z-score standardization works by taking the difference between the field value and the field mean, and scaling this difference by the standard deviation of the field values

• Z-score normalization is useful when the actual minimum and maximum of an attribute X are unknown, or when there are outliers that dominate the min-max normalization
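
In symbols, with $\bar{X}$ the field mean and $\sigma_X$ the field standard deviation:

$$ v' = \frac{v - \bar{X}}{\sigma_X} $$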
Decimal scaling

• Decimal scaling ensures that every normalized value lies between -1 and 1

• d = number of digits in the data value with the largest absolute value
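
In symbols, with $d$ as defined above:

$$ v' = \frac{v}{10^{d}} $$

For example, if the largest absolute value in the data is 917, then $d = 3$ and 917 is scaled to 0.917.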
Normalization - remarks

• Normalization can change the original data quite a bit, especially when using
the z-score normalization or decimal scaling

• It is necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner

• The normalization parameters now become model parameters and the same
value should be used when the model is used on new data (e.g. testing data)
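
A brief sketch of this "fit on training data, reuse on new data" pattern with scikit-learn's StandardScaler; the train/test arrays are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(80, 3)    # hypothetical training data
X_test  = np.random.rand(20, 3)    # hypothetical new / testing data

scaler = StandardScaler().fit(X_train)     # normalization parameters (mean, std) estimated on training data only
X_train_norm = scaler.transform(X_train)
X_test_norm  = scaler.transform(X_test)    # the same parameters are reused, never re-fitted on new data
```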
Transformations to achieve normality

• Some data mining algorithms and statistical methods require that the
variables be normally distributed

• z-score transformation does not achieve normality


Transformations to achieve normality

• The skewness of a distribution is measured by a skewness coefficient (see below)

• Most real-world data is right-skewed, especially most financial data.
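
One commonly used measure (Pearson's second skewness coefficient; stated here as an assumption about which formula is intended) is:

$$ \text{skewness} = \frac{3\,(\bar{x} - \text{median})}{s} $$

where $\bar{x}$ is the mean and $s$ the standard deviation; it is positive for right-skewed and negative for left-skewed data.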


Transformations to achieve normality
• Common transformations to achieve normality: ln(x), sqrt(x), 1/x, …

• The log transformation is suitable for strongly right-skewed data; the sqrt transformation is suitable for slightly right-skewed data
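
A short numpy sketch of these transformations on a hypothetical right-skewed sample:

```python
import numpy as np

x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)  # hypothetical right-skewed data

x_log  = np.log(x)     # for strongly right-skewed data
x_sqrt = np.sqrt(x)    # for slightly right-skewed data
x_inv  = 1.0 / x       # an even stronger correction (note: it reverses the order of values)
```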
Recap
Recap
• Raw data has many problems:
• missing values
• errors
• high dimensionality

• Good preprocessing is essential for good data mining


• one of the first steps in the pipeline
• often the most time-consuming step of the pipeline
• requires lots of experimentation and fine-tuning

• Data preparation includes:


• Data cleaning, data integration, data reduction, feature selection, normalization,…

• A lot of methods have been developed, but this is still an active area of research
