
Chapter 1: Data Preprocessing

Data Mining

Heger Arfaoui - ENIT - 2023


References

Chapter 2: Data Preprocessing


Outline

• Data Preprocessing: An Overview


• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Why Data Preprocessing?
Real-world data

• Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names

• GIGO! Garbage In, Garbage Out


A multi-dimensional measure of data quality

• A well-accepted multi-dimensional view:
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?

• Two different users may have two different assessments of the quality of the data
Major Tasks in Data Preprocessing
Data Cleaning
Data Cleaning tasks

• Fill in missing values


• Identify outliers and smooth out noisy data
• Correct inconsistent data
Missing Data
• Data is not always available

• Missing data may be due to:


• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• intentional

• Missing data may need to be inferred


How to handle missing data?
• Ignore records with missing values
• acceptable especially for large datasets with only a few missing records
• Replace missing values with...
• a default or special value (e.g., 0, “missing”)
• the average/median value for numerics
• the most frequent value for nominals

• Try to predict missing values:


• handle missing values as learning problem
• target: attribute which has missing values
• training data: instances where the attribute is present
• test data: instances where the attribute is missing
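
A minimal sketch of these strategies with pandas; the toy table and its `age`/`city` columns are hypothetical:

```python
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "age":  [23, None, 31, 27, None],
    "city": ["Tunis", "Sfax", None, "Tunis", "Tunis"],
})

# Option 1: ignore records with missing values
dropped = df.dropna()

# Option 2: replace with a statistic (median for numerics, mode for nominals)
filled = df.copy()
filled["age"]  = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# Option 3: treat the attribute with missing values as a prediction target,
# training on rows where it is present and predicting the rows where it is missing
```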
Missing data: caveats
Note: values may be missing for various reasons
...and, more importantly: at random vs. not at random

• Examples of not-at-random missingness:
– Non-mandatory questions in questionnaires
• e.g., “how often do you drink alcohol?”
– Values only valid for certain data sub-populations
• e.g., “are you currently pregnant?”
– Sensors failing under certain conditions
• e.g., at high temperatures

• In those cases, averaging and imputation cause information loss – in other words, “missing” can itself be information!
Missing data caveats (ctd)
Noisy data
• Noise: Random error in a measured variable.

• Incorrect attribute values may be due to:
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions


How to handle noisy data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection

• detect suspicious values and check by human (e.g., deal with possible outliers)
Binning method for data smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
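
A small, library-free sketch of equi-depth binning and both smoothing variants for the price list above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

n_bins = 3
size = len(prices) // n_bins                       # equi-depth: 4 values per bin
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: replace every value by its bin's mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]
# -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace every value by the closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
# -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```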
Detect and remove outliers
Data Integration
Data integration

• Data integration:
• Combines data from multiple sources into a coherent store
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William
Clinton

• Detecting and resolving data value conflicts


• For the same real-world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling redundant data in data integration

• Redundant data occur often when integrating multiple DBs


• The same attribute may have different names in different databases
• One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may often be detected by correlation analysis, e.g., with the Pearson correlation coefficient:

$$ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} $$
• Careful integration can help reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
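
A short sketch of such a correlation check with pandas; the 0.9 threshold is an arbitrary illustrative choice:

```python
import pandas as pd

def redundant_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return pairs of numeric attributes whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    return [(cols[i], cols[j], corr.iloc[i, j])
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > threshold]
```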

Data Reduction
Data reduction

• A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.

• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results

Data reduction strategies

• Numerosity reduction (some simply call it: data reduction)
• Histograms, clustering
• Sampling

• Dimensionality reduction: e.g., remove unimportant attributes
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation

• Data compression
Sampling

• Sampling: obtaining a small sample s to represent the whole data set N


• Allow a mining algorithm to run in complexity that is potentially sub-linear to
the size of the data

• Key principle: Choose a representative subset of the data


• Simple random sampling may have very poor performance in the
presence of skew

• Develop adaptive sampling methods, e.g., stratified sampling


Types of sampling

• Simple random sampling


• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
• Used in conjunction with skewed data
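
A sketch of both sampling schemes with pandas; the `label` column used for stratification is hypothetical:

```python
import pandas as pd

def simple_random_sample(df: pd.DataFrame, frac: float = 0.1, replace: bool = False):
    # Every record has the same probability of being selected;
    # with replace=True a record can be drawn more than once.
    return df.sample(frac=frac, replace=replace, random_state=42)

def stratified_sample(df: pd.DataFrame, strata_col: str = "label", frac: float = 0.1):
    # Partition by the stratification column and draw roughly the same
    # percentage from each partition - useful for skewed data.
    return (df.groupby(strata_col, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=42)))
```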
Dimensionality reduction

• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
Feature selection
Basic Heuristics

• Remove nominal attributes...


• which have more than p% identical values
• example: millionaire=false
• which have more than p% different values
• example: names, IDs

• Remove numerical attributes


• which have little variation, i.e., standard deviation < s

• Compute pairwise correlations between attributes and remove highly correlated attributes
• e.g., Naive Bayes assumes independent attributes and will benefit from removing correlated attributes
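
A rough sketch of these heuristics with pandas; the thresholds `p`, `s`, and `corr_max` are illustrative defaults, not values from the slides:

```python
import pandas as pd

def select_features(df: pd.DataFrame, p: float = 0.95, s: float = 0.01, corr_max: float = 0.95):
    keep = []
    n = len(df)
    for col in df.columns:
        values = df[col]
        if values.dtype == object:                          # nominal attribute
            top_share = values.value_counts(normalize=True).iloc[0]
            if top_share > p or values.nunique() / n > p:   # nearly constant, or ID-like
                continue
        elif values.std() < s:                              # numerical attribute with little variation
            continue
        keep.append(col)

    # Remove one attribute of each highly correlated numeric pair
    corr = df[keep].corr(numeric_only=True).abs()
    drop = {b for a in corr.columns for b in corr.columns
            if a < b and corr.loc[a, b] > corr_max}
    return [c for c in keep if c not in drop]
```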
PCA: Principal Component Analysis

• Feature selection methods select a subset of attributes: no new attributes are created

• PCA creates a (smaller set of) new attributes


• artificial linear combinations of existing attributes
• as expressive as possible

• Dates back to the pre-computer age


• invented by Karl Pearson (1857-1936)


• also known for Pearson's correlation coefficient
PCA (ctd)
• Idea: transform the coordinate system so that each new coordinate (principal component) is as expressive as possible
• expressivity: variance of the variable
• the 1st, 2nd, 3rd, … PCs should account for as much variance as possible

• further PCs can be neglected


Source: https://knowledge.dataiku.com/latest/ml-analytics/statistics/concept-principal-component-analysis-pca.html
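
A minimal PCA sketch with scikit-learn; the data matrix and the choice of 2 components are arbitrary, and standardizing first is a common (not mandatory) practice so that no attribute dominates because of its scale:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)                  # hypothetical data: 100 records, 5 attributes

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per attribute
pca = PCA(n_components=2)                   # keep the 2 most expressive components
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)        # variance accounted for by each principal component
```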
Data Transformation
Data transformation

• Transformation: A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of the new values

• Methods:
• Conversion
• Discretization
• Smoothing: remove noise from data (binning, clustering, regression)
• Normalization: scaled to fall within a small, specified range
• Attribute/feature construction: new attributes constructed from the given ones
Conversion

• Binary to numeric
• e.g., student = yes/no converted to 1/0

• Ordered to numeric: ordered attributes (e.g., grade) can be converted to numbers preserving order
– A → 4.0
– A- → 3.7
– B+ → 3.3
– B → 3.0

• Nominal to numeric
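
A small sketch of these conversions with pandas; the `student`, `grade`, and `color` columns are hypothetical, and one-hot encoding is one common choice for the nominal-to-numeric step:

```python
import pandas as pd

df = pd.DataFrame({"student": ["yes", "no", "yes"],
                   "grade":   ["A", "B+", "A-"],
                   "color":   ["red", "blue", "red"]})

# Binary to numeric
df["student"] = df["student"].map({"no": 0, "yes": 1})

# Ordered to numeric, preserving order
df["grade"] = df["grade"].map({"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0})

# Nominal to numeric (one-hot encoding)
df = pd.get_dummies(df, columns=["color"])
```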
Normalization

• Variables tend to have ranges that vary greatly from each other
• The measurement unit used can affect the data analysis
• For some data mining algorithms, differences in ranges will lead to a tendency for the
variable with a greater range to have undue influence on the results

• Data miners should normalize their numeric variables in order to standardize the
scale of effect each variable has on the results

• Algorithms that make use of distance measures (such as k-Nearest Neighbors)


benefit from normalization

• The terms normalization and scaling are often used interchangeably


Min-max normalization

• Performs a linear transformation on the original data

• Min-max normalization preserves the relationships among the original data values

• Values range between 0 and 1

• Min-max normalization will encounter an « out-of-bounds » error if a future input case for normalization falls outside of the original data range of X
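
In symbols, writing $v$ for an original value of attribute $X$ and $v'$ for its normalized value, the standard min-max transformation is:

$$ v' = \frac{v - \min(X)}{\max(X) - \min(X)} $$

so that $\min(X)$ maps to 0 and $\max(X)$ maps to 1.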
Z-score normalization

• Also called zero-mean normalization

• Z-score standardization works by taking the difference between the field value and the field mean, and scaling this difference by the standard deviation of the field values

• Z-score normalization is useful when the actual minimum and maximum of an attribute X are unknown, or when there are outliers that dominate the min-max normalization
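
In symbols, with $\bar{X}$ the field mean and $\sigma_X$ the field standard deviation:

$$ v' = \frac{v - \bar{X}}{\sigma_X} $$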
Decimal scaling

• Decimal scaling ensures that every normalized value lies between -1 and 1

• d = number of digits in the data value with the largest absolute value
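
In symbols, with $d$ as defined above:

$$ v' = \frac{v}{10^{d}} $$

For example, if the largest absolute value in the data is 917, then $d = 3$ and 917 is scaled to 0.917.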
Normalization - remarks

• Normalization can change the original data quite a bit, especially when using
the z-score normalization or decimal scaling

• It is necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner

• The normalization parameters now become model parameters and the same
value should be used when the model is used on new data (e.g. testing data)
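
A brief sketch of this "fit on training data, reuse on new data" pattern with scikit-learn's StandardScaler; the train/test arrays are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(80, 3)    # hypothetical training data
X_test  = np.random.rand(20, 3)    # hypothetical new / testing data

scaler = StandardScaler().fit(X_train)     # normalization parameters (mean, std) estimated on training data only
X_train_norm = scaler.transform(X_train)
X_test_norm  = scaler.transform(X_test)    # the same parameters are reused, never re-fitted on new data
```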
Transformations to achieve normality

• Some data mining algorithms and statistical methods require that the
variables be normally distributed

• z-score transformation does not achieve normality


Transformations to achieve normality

• The skewness of a distribution is measured by a skewness coefficient (see below)

• Most real-world data is right-skewed, especially most financial data.
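
One commonly used measure (Pearson's second skewness coefficient; stated here as an assumption about which formula is intended) is:

$$ \text{skewness} = \frac{3\,(\bar{x} - \text{median})}{s} $$

where $\bar{x}$ is the mean and $s$ the standard deviation; it is positive for right-skewed and negative for left-skewed data.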


Transformations to achieve normality
• Common transformations to achieve normality: ln(x), sqrt(x), 1/x, …

• The log transformation is suitable for strongly right-skewed data; the sqrt transformation is suitable for slightly right-skewed data
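
A short numpy sketch of these transformations on a hypothetical right-skewed sample:

```python
import numpy as np

x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)  # hypothetical right-skewed data

x_log  = np.log(x)     # for strongly right-skewed data
x_sqrt = np.sqrt(x)    # for slightly right-skewed data
x_inv  = 1.0 / x       # an even stronger correction (note: it reverses the order of values)
```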
Recap
Recap
• Raw data has many problems:
• missing values
• errors
• high dimensionality

• Good preprocessing is essential for good data mining


• one of the first steps in the pipeline
• often the most time-consuming step of the pipeline
• requires lots of experimentation and fine-tuning

• Data preparation includes:


• Data cleaning, data integration, data reduction, feature selection, normalization,…

• A lot of methods have been developed, but this is still an active area of research
