UNIT 2
DATA PRE-PROCESSING
Data pre-processing is the process of preparing raw data and making it suitable
for a machine learning model. It is the first and most crucial step when creating a
machine learning model.
When working on a machine learning project, we do not always come across
clean and formatted data. Before performing any operation on the data, it must
be cleaned and put into a consistent format. This is what the data
pre-processing task is for.
TASKS IN DATA PRE-PROCESSING
➢ Data cleaning
o Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
➢ Data integration
o Integration of multiple databases, data cubes, or files
➢ Data reduction
o Dimensionality reduction
o Numerosity reduction
o Data compression
➢ Data transformation and data discretization
o Normalization
o Concept hierarchy generation
NEED OF DATA PRE-PROCESSING
Accuracy: is the data correct or wrong, accurate or not?
Completeness: is some data not recorded or unavailable?
Consistency: have some records been modified while others have not? Are there dangling references?
Timeliness: is the data updated in time?
Believability: how far can the data be trusted to be correct?
DATA CLEANING
The data cleaning process detects and removes the errors and inconsistencies
present in the data and improves its quality. Data quality problems occur due to
misspellings during data entry, missing values or other invalid data.
Basically, “dirty” data is transformed into clean data. “Dirty” data does not
produce accurate and good results: garbage in gives garbage out. So it
becomes very important to handle this data.
Need of data cleaning
Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., due
to faulty instruments, human or computer error, or transmission errors
❖ incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
▪ e.g., Occupation=“ ” (missing data)
❖ noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
❖ inconsistent: containing discrepancies in codes or names, e.g.,
▪ Age=“42”, Birthday=“03/07/2010”
▪ Was rating “1, 2, 3”, now rating “A, B, C”
▪ discrepancy between duplicate records
❖ Intentional (e.g., disguised missing data)
▪ Jan. 1 as everyone’s birthday?
❖ Data is not always available
▪ E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
❖ Missing data may be due to
▪ equipment malfunction
▪ inconsistent with other recorded data and thus deleted
▪ data not entered due to misunderstanding
▪ certain data may not be considered important at the time of entry
▪ not register history or changes of the data
What to do to clean data?
1. Handle Missing Values
2. Handle Noise and Outliers
3. Remove Unwanted data
Handle Missing Values
Missing values cannot be overlooked in a data set; they must be handled, and
many models do not accept missing values at all. There are several techniques to
handle missing data, and choosing the right one is of the utmost importance. The
choice of technique depends on the problem domain and the goal of the data
mining process. The common ways to handle missing data are listed below
(a short pandas sketch follows the list):
1. Ignore the data row: This method is suggested for records where most of
the data is missing, rendering the record meaningless. It is usually avoided
when only a few attribute values are missing; if every row with a missing
value is ignored, i.e. removed, the result will be poor performance.
2. Fill the missing values manually: This is a very time-consuming method
and hence infeasible for almost all scenarios.
3. Use a global constant to fill in for missing values: A global constant like
“NA” or 0 can be used to fill all the missing data. This method is used when
missing values are difficult to predict.
4. Use attribute mean or median: Mean or median of the attribute is used
to fill the missing value.
5. Use the forward fill or backward fill method: Either the previous value
or the next value is used to fill the missing value. The mean of the previous
and succeeding values may also be used.
6. Use a data-mining algorithm to predict the most probable value
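A minimal pandas sketch of a few of these options; the data frame, column names
and values are made up for illustration, and the right technique always depends on
the data set at hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (NaN / None)
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 35, np.nan],
    "salary": [50000, 64000, np.nan, 58000, 61000],
    "city":   ["Pune", None, "Delhi", "Delhi", "Mumbai"],
})

# 1. Ignore rows where most values are missing (keep rows with at least 2 non-null values)
mostly_complete = df.dropna(thresh=2)

# 3. Fill a categorical attribute with a global constant such as "NA"
df["city"] = df["city"].fillna("NA")

# 4. Fill a numeric attribute with its mean (median works the same way)
df["age"] = df["age"].fillna(df["age"].mean())

# 5. Forward fill, then backward fill, the remaining gaps
df["salary"] = df["salary"].ffill().bfill()

print(mostly_complete)
print(df)
```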
DATA INTEGRATION
In this step, a coherent data source is prepared. This is done by collecting and
integrating data from multiple sources like databases, legacy systems, flat files,
data cubes etc.
Issues in Data Integration
1. Schema integration: Metadata (i.e. the schema) from different sources
may not be compatible. This leads to the entity identification
problem. Example: consider two data sources R and S. The customer id
is represented as cust_id in R and as c_id in S. They represent the same
thing but have different names, which leads to integration problems.
Detecting and resolving such conflicts is very important to obtain a
coherent data source.
2. Data value conflicts: The values, metrics or representations of the same
real-world entity may differ across data sources. This leads to different
representations of the same data, different scales, etc. Example: weight
is represented in kilograms in data source R and in grams in source S. To
resolve this, data representations should be made consistent and
conversions should be performed accordingly (a small pandas sketch
illustrating this appears further below, before the redundancy-handling
techniques).
3. Redundant data: Duplicate attributes or tuples may occur as a result of
integrating data from various sources, which may also lead to
inconsistencies. These redundancies and inconsistencies can be reduced
by careful integration of data from multiple sources, which helps to
improve mining speed and quality. Correlation analysis can also be
performed to detect redundant data.
During data integration in data mining, various data stores are used, which
can lead to redundancy in the data. An attribute (a column or feature of the
data set) is called redundant if it can be derived from another attribute or
set of attributes. Inconsistencies in attribute or dimension naming can also
lead to redundancies in the data set.
Data redundancy refers to the duplication of data in a computer system.
This duplication can occur at various levels, such as at the hardware or
software level, and can be intentional or unintentional. The main purpose
of data redundancy is to provide a backup copy of data in case the primary
copy is lost or becomes corrupted. This can help to ensure the availability
and integrity of the data in the event of a failure or other problem.
Advantages of data redundancy include:
1. Increased data availability and reliability, as there are multiple copies of
the data that can be used in case the primary copy is lost or becomes
unavailable.
2. Improved data integrity, as multiple copies of the data can be compared
to detect and correct errors.
3. Increased fault tolerance, as the system can continue to function even if
one copy of the data is lost or corrupted.
Disadvantages of data redundancy include:
1. Increased storage requirements, as multiple copies of the data must be
maintained.
2. Increased complexity of the system, as managing multiple copies of the
data can be difficult and time-consuming.
3. Increased risk of data inconsistencies, as multiple copies of the data may
become out of sync if updates are not properly propagated to all copies.
4. Reduced performance, as the system may have to perform additional
work to maintain and access multiple copies of the data.
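To illustrate the first two integration issues above (entity identification and data
value conflicts), here is a minimal pandas sketch; the cust_id/c_id names and the
kilograms-versus-grams conflict follow the examples given earlier, and the values
themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical source R: customer id stored as cust_id, weight in kilograms
r = pd.DataFrame({"cust_id": [1, 2], "weight": [70.0, 55.5]})

# Hypothetical source S: the same id stored as c_id, weight in grams
s = pd.DataFrame({"c_id": [3, 4], "weight": [80000, 61000]})

# Entity identification: map both schemas to a common attribute name
s = s.rename(columns={"c_id": "cust_id"})

# Data value conflict: convert grams to kilograms so the representations agree
s["weight"] = s["weight"] / 1000.0

# Integrate the two sources into one coherent data set
combined = pd.concat([r, s], ignore_index=True)
print(combined)
```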
Techniques to handle data redundancy (see the sketch after this list)
• χ² test (used for nominal, i.e. categorical or qualitative, data)
• Correlation coefficient and covariance (used for numeric, i.e.
quantitative, data)
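A minimal sketch of both checks using scipy and pandas; the attributes and values
are made up so that the redundancy is obvious (gender perfectly determines
preference, and weight_g is just weight_kg in grams).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: two categorical attributes and two numeric attributes
df = pd.DataFrame({
    "gender":     ["M", "F", "F", "M", "F", "M", "F", "M"],
    "preference": ["A", "B", "B", "A", "B", "A", "B", "A"],
    "weight_kg":  [70, 55, 60, 80, 52, 75, 58, 82],
    "weight_g":   [70000, 55000, 60000, 80000, 52000, 75000, 58000, 82000],
})

# Chi-square test for nominal attributes: a very small p-value suggests the
# attributes are related, so one may be redundant given the other
table = pd.crosstab(df["gender"], df["preference"])
chi2, p, dof, expected = chi2_contingency(table)
print("chi2 =", chi2, "p-value =", p)

# Correlation coefficient / covariance for numeric attributes: a correlation
# close to +1 or -1 indicates one attribute can be derived from the other
print("correlation =", df["weight_kg"].corr(df["weight_g"]))
print("covariance  =", df["weight_kg"].cov(df["weight_g"]))
```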
DATA REDUCTION
If the data is very large, data reduction is performed. Sometimes, it is also
performed to find the most suitable subset of attributes from a large number of
attributes. This is known as dimensionality reduction. Data reduction also
involves reducing the number of attribute values and/or the number of tuples.
Various data reduction techniques are:
1. Data cube aggregation: In this technique the data is reduced by applying
OLAP operations like slice, dice or roll-up. The data is aggregated at the
smallest level of detail necessary to solve the problem.
2. Dimensionality reduction: The data attributes or dimensions are
reduced, since not all attributes are required for data mining. The most
suitable subset of attributes is selected using techniques like forward
selection, backward elimination, decision tree induction, or a combination
of forward selection and backward elimination.
3. Data compression: In this technique, large volumes of data are compressed,
i.e. the number of bits used to store the data is reduced. This can be done
using lossy or lossless compression. In lossy compression, data quality is
compromised in exchange for more compression. In lossless compression,
data quality is not compromised, but the achievable compression level is lower.
4. Numerosity reduction: This technique reduces the volume of data by
choosing smaller forms of data representation. Numerosity reduction
can be done using histograms, clustering or sampling of the data. It is
necessary because processing the entire data set is expensive and
time consuming (a short sketch of dimensionality and numerosity
reduction follows this list).
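A minimal sketch on synthetic data: PCA (principal component analysis) is used here
as a common dimensionality-reduction technique, although the list above mentions
attribute subset selection methods, and random sampling stands in for numerosity
reduction; the data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical numeric data set with 5 attributes and 1000 tuples
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)),
                  columns=["a1", "a2", "a3", "a4", "a5"])

# Dimensionality reduction: project the 5 attributes onto 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(df)
print(reduced.shape)          # (1000, 2)

# Numerosity reduction by sampling: keep a 10% random sample of the tuples
sample = df.sample(frac=0.1, random_state=0)
print(sample.shape)           # (100, 5)
```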
DATA TRANSFORMATION
The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements.
There are several methods for data transformation:
• Smoothing: With the help of algorithms, we can remove noise from the
dataset, which helps in identifying the important features of the dataset.
Smoothing also makes it easier to detect even small changes that help in
prediction.
• Aggregation: In this method, the data is stored and presented in the form
of a summary. Data collected from multiple sources is integrated into a
summary description for data analysis. This is an important step, since the
accuracy of the results depends on the quantity and quality of the data:
when both are good, the results are more relevant.
• Discretization: The continuous data is split into intervals, which reduces
the data size. For example, rather than specifying the exact class time, we
can use an interval such as 3 pm-5 pm or 6 pm-8 pm (see the sketch after
this list).
• Normalization: This is the method of scaling the data so that it can be
represented in a smaller range, for example from -1.0 to 1.0.
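A minimal pandas sketch of discretization and simple min-max normalization; the
class-hour values and interval labels are invented for illustration.

```python
import pandas as pd

# Hypothetical class-time data (hour of day, 24-hour clock)
df = pd.DataFrame({"class_hour": [15, 16, 17, 18, 19, 20]})

# Discretization: split the continuous hour values into two labelled intervals
df["slot"] = pd.cut(df["class_hour"],
                    bins=[15, 17, 20],
                    labels=["3 pm-5 pm", "6 pm-8 pm"],
                    include_lowest=True)

# Normalization to [0, 1] using min-max scaling: (x - min) / (max - min)
x = df["class_hour"]
df["hour_scaled"] = (x - x.min()) / (x.max() - x.min())

print(df)
```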
FEATURE SCALING
Feature scaling is a method used to normalize the range of independent
variables or features of data. In data processing, it is also known as data
normalization and is generally performed during the data preprocessing step.
For example, if you have multiple independent variables like age, salary, and
height, with ranges of 18-100 years, 25,000-75,000 euros, and 1-2 meters
respectively, feature scaling brings them all into the same range, for example
centered around 0 or within (0, 1), depending on the scaling technique.
Normalization
Also known as min-max scaling or min-max normalization, it is the simplest
method and consists of rescaling the range of features to [0, 1]. The general
formula for normalization is:

x' = (x − min(x)) / (max(x) − min(x))

Here, max(x) and min(x) are the maximum and the minimum values of the
feature, respectively.
We can also normalize over a different interval, e.g. choosing to have the variable
lie in any interval [a, b], where a and b are real numbers. To rescale the range to
an arbitrary interval [a, b], the formula becomes:

x' = a + ((x − min(x)) (b − a)) / (max(x) − min(x))
Standardization
Feature standardization makes the values of each feature in the data have zero
mean and unit variance. The general method is to determine the mean and
standard deviation of each feature and calculate the new data point by the
following formula:

z = (x − x̄) / σ

Here, σ is the standard deviation of the feature vector, and x̄ is the mean of the
feature vector.
Normalization vs. Standardization
• Normalization uses the minimum and maximum values for scaling;
standardization uses the mean and standard deviation.
• Normalization is helpful when features are on different scales;
standardization is helpful when the mean of a variable should be 0 and its
standard deviation 1.
• Normalization scales values to a range such as [0, 1] or [-1, 1];
standardized values are not restricted to any specific range.
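A minimal scikit-learn sketch of both techniques, reusing the age/salary/height
example from above; the numeric values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features with very different ranges: age (years), salary (euros), height (m)
X = np.array([[18, 25000, 1.60],
              [35, 40000, 1.75],
              [60, 75000, 1.90],
              [100, 60000, 1.68]], dtype=float)

# Normalization (min-max scaling) to [0, 1]; feature_range=(a, b) rescales to any [a, b]
minmax = MinMaxScaler(feature_range=(0, 1))
print(minmax.fit_transform(X))

# Standardization: each column ends up with zero mean and unit variance
standard = StandardScaler()
print(standard.fit_transform(X))
```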