UNIT 2
DATA PRE-PROCESSING
Data pre-processing is the process of preparing raw data and making it suitable
for a machine learning model. It is the first and most crucial step when creating a
machine learning model.
When working on a machine learning project, we do not always come across
clean and formatted data. Before performing any operation on the data, it must
be cleaned and put into a consistent format. This is what the data
pre-processing task is for.
TASKS IN DATA PRE-PROCESSING
➢ Data cleaning
o Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
➢ Data integration
o Integration of multiple databases, data cubes, or files
➢ Data reduction
o Dimensionality reduction
o Numerosity reduction
o Data compression
➢ Data transformation and data discretization
o Normalization
o Concept hierarchy generation
NEED OF DATA PRE-PROCESSING
Accuracy: is the data correct or wrong, accurate or not?
Completeness: is some data not recorded or unavailable?
Consistency: have some records been modified while others have not? Are there dangling references?
Timeliness: is the data updated in time?
Believability: how far can the data be trusted to be correct?
DATA CLEANING
The data cleaning process detects and removes the errors and inconsistencies
present in the data and improves its quality. Data quality problems occur due to
misspellings during data entry, missing values or other invalid data.
Basically, “dirty” data is transformed into clean data. “Dirty” data does not
produce accurate and good results: garbage in gives garbage out. So it
becomes very important to handle this data.
Need of data cleaning
Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., due
to faulty instruments, human or computer error, or transmission errors
❖ incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
▪ e.g., Occupation=“ ” (missing data)
❖ noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
❖ inconsistent: containing discrepancies in codes or names, e.g.,
▪ Age=“42”, Birthday=“03/07/2010”
▪ Was rating “1, 2, 3”, now rating “A, B, C”
▪ discrepancy between duplicate records
❖ Intentional (e.g., disguised missing data)
▪ Jan. 1 as everyone’s birthday?
❖ Data is not always available
▪ E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
❖ Missing data may be due to
▪ equipment malfunction
▪ inconsistent with other recorded data and thus deleted
▪ data not entered due to misunderstanding
▪ certain data may not be considered important at the time of entry
▪ not register history or changes of the data
What to do to clean data?
1. Handle Missing Values
2. Handle Noise and Outliers
3. Remove Unwanted data
Handle Missing Values
Missing values cannot be overlooked in a data set; they must be handled, and
many models do not accept missing values at all. There are several techniques to
handle missing data, and choosing the right one is of the utmost importance. The
choice of technique depends on the problem domain and the goal of the data
mining process. The common ways to handle missing data are listed below
(a short pandas sketch follows the list):
1. Ignore the data row: This method is suggested for records where most of
the data is missing, rendering the record meaningless. It is usually avoided
when only a few attribute values are missing; if every row with a missing
value is ignored, i.e. removed, the result will be poor performance.
2. Fill the missing values manually: This is a very time-consuming method
and hence infeasible for almost all scenarios.
3. Use a global constant to fill in for missing values: A global constant like
“NA” or 0 can be used to fill all the missing data. This method is used when
missing values are difficult to predict.
4. Use attribute mean or median: Mean or median of the attribute is used
to fill the missing value.
5. Use the forward fill or backward fill method: Either the previous value
or the next value is used to fill the missing value. The mean of the previous
and succeeding values may also be used.
6. Use a data-mining algorithm to predict the most probable value
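A minimal pandas sketch of a few of these options; the data frame, column names
and values are made up for illustration, and the right technique always depends on
the data set at hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (NaN / None)
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 35, np.nan],
    "salary": [50000, 64000, np.nan, 58000, 61000],
    "city":   ["Pune", None, "Delhi", "Delhi", "Mumbai"],
})

# 1. Ignore rows where most values are missing (keep rows with at least 2 non-null values)
mostly_complete = df.dropna(thresh=2)

# 3. Fill a categorical attribute with a global constant such as "NA"
df["city"] = df["city"].fillna("NA")

# 4. Fill a numeric attribute with its mean (median works the same way)
df["age"] = df["age"].fillna(df["age"].mean())

# 5. Forward fill, then backward fill, the remaining gaps
df["salary"] = df["salary"].ffill().bfill()

print(mostly_complete)
print(df)
```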
DATA INTEGRATION
In this step, a coherent data source is prepared. This is done by collecting and
integrating data from multiple sources like databases, legacy systems, flat files,
data cubes etc.
Issues in Data Integration
1. Schema integration: Metadata (i.e. the schema) from different sources
may not be compatible. This leads to the entity identification
problem. Example: consider two data sources R and S. The customer id
is represented as cust_id in R and as c_id in S. They represent the same
thing but have different names, which leads to integration problems.
Detecting and resolving such conflicts is very important to obtain a
coherent data source.
2. Data value conflicts: The values, metrics or representations of the same
real-world entity may differ across data sources. This leads to different
representations of the same data, different scales, etc. Example: weight
is represented in kilograms in data source R and in grams in source S. To
resolve this, data representations should be made consistent and
conversions should be performed accordingly (a small pandas sketch
illustrating this appears further below, before the redundancy-handling
techniques).
3. Redundant data: Duplicate attributes or tuples may occur as a result of
integrating data from various sources, which may also lead to
inconsistencies. These redundancies and inconsistencies can be reduced
by careful integration of data from multiple sources, which helps to
improve mining speed and quality. Correlation analysis can also be
performed to detect redundant data.
During data integration in data mining, various data stores are used, which
can lead to redundancy in the data. An attribute (a column or feature of the
data set) is called redundant if it can be derived from another attribute or
set of attributes. Inconsistencies in attribute or dimension naming can also
lead to redundancies in the data set.
Data redundancy refers to the duplication of data in a computer system.
This duplication can occur at various levels, such as at the hardware or
software level, and can be intentional or unintentional. The main purpose
of data redundancy is to provide a backup copy of data in case the primary
copy is lost or becomes corrupted. This can help to ensure the availability
and integrity of the data in the event of a failure or other problem.
Advantages of data redundancy include:
1. Increased data availability and reliability, as there are multiple copies of
the data that can be used in case the primary copy is lost or becomes
unavailable.
2. Improved data integrity, as multiple copies of the data can be compared
to detect and correct errors.
3. Increased fault tolerance, as the system can continue to function even if
one copy of the data is lost or corrupted.
Disadvantages of data redundancy include:
1. Increased storage requirements, as multiple copies of the data must be
maintained.
2. Increased complexity of the system, as managing multiple copies of the
data can be difficult and time-consuming.
3. Increased risk of data inconsistencies, as multiple copies of the data may
become out of sync if updates are not properly propagated to all copies.
4. Reduced performance, as the system may have to perform additional
work to maintain and access multiple copies of the data.
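To illustrate the first two integration issues above (entity identification and data
value conflicts), here is a minimal pandas sketch; the cust_id/c_id names and the
kilograms-versus-grams conflict follow the examples given earlier, and the values
themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical source R: customer id stored as cust_id, weight in kilograms
r = pd.DataFrame({"cust_id": [1, 2], "weight": [70.0, 55.5]})

# Hypothetical source S: the same id stored as c_id, weight in grams
s = pd.DataFrame({"c_id": [3, 4], "weight": [80000, 61000]})

# Entity identification: map both schemas to a common attribute name
s = s.rename(columns={"c_id": "cust_id"})

# Data value conflict: convert grams to kilograms so the representations agree
s["weight"] = s["weight"] / 1000.0

# Integrate the two sources into one coherent data set
combined = pd.concat([r, s], ignore_index=True)
print(combined)
```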
Techniques to handle data redundancy (see the sketch after this list)
• χ² test (used for nominal, i.e. categorical or qualitative, data)
• Correlation coefficient and covariance (used for numeric, i.e.
quantitative, data)
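A minimal sketch of both checks using scipy and pandas; the attributes and values
are made up so that the redundancy is obvious (gender perfectly determines
preference, and weight_g is just weight_kg in grams).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: two categorical attributes and two numeric attributes
df = pd.DataFrame({
    "gender":     ["M", "F", "F", "M", "F", "M", "F", "M"],
    "preference": ["A", "B", "B", "A", "B", "A", "B", "A"],
    "weight_kg":  [70, 55, 60, 80, 52, 75, 58, 82],
    "weight_g":   [70000, 55000, 60000, 80000, 52000, 75000, 58000, 82000],
})

# Chi-square test for nominal attributes: a very small p-value suggests the
# attributes are related, so one may be redundant given the other
table = pd.crosstab(df["gender"], df["preference"])
chi2, p, dof, expected = chi2_contingency(table)
print("chi2 =", chi2, "p-value =", p)

# Correlation coefficient / covariance for numeric attributes: a correlation
# close to +1 or -1 indicates one attribute can be derived from the other
print("correlation =", df["weight_kg"].corr(df["weight_g"]))
print("covariance  =", df["weight_kg"].cov(df["weight_g"]))
```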
DATA REDUCTION
If the data is very large, data reduction is performed. Sometimes, it is also
performed to find the most suitable subset of attributes from a large number of
attributes. This is known as dimensionality reduction. Data reduction also
involves reducing the number of attribute values and/or the number of tuples.
Various data reduction techniques are:
1. Data cube aggregation: In this technique the data is reduced by applying
OLAP operations like slice, dice or roll-up. The data is aggregated at the
smallest level of detail necessary to solve the problem.
2. Dimensionality reduction: The data attributes or dimensions are
reduced, since not all attributes are required for data mining. The most
suitable subset of attributes is selected using techniques like forward
selection, backward elimination, decision tree induction, or a combination
of forward selection and backward elimination.
3. Data compression: In this technique, large volumes of data are compressed,
i.e. the number of bits used to store the data is reduced. This can be done
using lossy or lossless compression. In lossy compression, data quality is
compromised in exchange for more compression. In lossless compression,
data quality is not compromised, but the achievable compression level is lower.
4. Numerosity reduction: This technique reduces the volume of data by
choosing smaller forms of data representation. Numerosity reduction
can be done using histograms, clustering or sampling of the data. It is
necessary because processing the entire data set is expensive and
time consuming (a short sketch of dimensionality and numerosity
reduction follows this list).
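A minimal sketch on synthetic data: PCA (principal component analysis) is used here
as a common dimensionality-reduction technique, although the list above mentions
attribute subset selection methods, and random sampling stands in for numerosity
reduction; the data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical numeric data set with 5 attributes and 1000 tuples
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)),
                  columns=["a1", "a2", "a3", "a4", "a5"])

# Dimensionality reduction: project the 5 attributes onto 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(df)
print(reduced.shape)          # (1000, 2)

# Numerosity reduction by sampling: keep a 10% random sample of the tuples
sample = df.sample(frac=0.1, random_state=0)
print(sample.shape)           # (100, 5)
```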
DATA TRANSFORMATION
The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the requirements.
There are several methods for data transformation:
• Smoothing: With the help of algorithms, we can remove noise from the
dataset, which helps in identifying the important features of the dataset.
Smoothing also makes it easier to detect even small changes that help in
prediction.
• Aggregation: In this method, the data is stored and presented in the form
of a summary. Data collected from multiple sources is integrated into a
summary description for data analysis. This is an important step, since the
accuracy of the results depends on the quantity and quality of the data:
when both are good, the results are more relevant.
• Discretization: The continuous data is split into intervals, which reduces
the data size. For example, rather than specifying the exact class time, we
can use an interval such as 3 pm-5 pm or 6 pm-8 pm (see the sketch after
this list).
• Normalization: This is the method of scaling the data so that it can be
represented in a smaller range, for example from -1.0 to 1.0.
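A minimal pandas sketch of discretization and simple min-max normalization; the
class-hour values and interval labels are invented for illustration.

```python
import pandas as pd

# Hypothetical class-time data (hour of day, 24-hour clock)
df = pd.DataFrame({"class_hour": [15, 16, 17, 18, 19, 20]})

# Discretization: split the continuous hour values into two labelled intervals
df["slot"] = pd.cut(df["class_hour"],
                    bins=[15, 17, 20],
                    labels=["3 pm-5 pm", "6 pm-8 pm"],
                    include_lowest=True)

# Normalization to [0, 1] using min-max scaling: (x - min) / (max - min)
x = df["class_hour"]
df["hour_scaled"] = (x - x.min()) / (x.max() - x.min())

print(df)
```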
FEATURE SCALING
Feature scaling is a method used to normalize the range of independent
variables or features of data. In data processing, it is also known as data
normalization and is generally performed during the data preprocessing step.
For example, if you have multiple independent variables like age, salary, and
height, with ranges of 18-100 years, 25,000-75,000 euros, and 1-2 meters
respectively, feature scaling brings them all into the same range, for example
centered around 0 or within (0, 1), depending on the scaling technique.
Normalization
Also known as min-max scaling or min-max normalization, it is the simplest
method and consists of rescaling the range of features to [0, 1]. The general
formula for normalization is:

x' = (x − min(x)) / (max(x) − min(x))

Here, max(x) and min(x) are the maximum and the minimum values of the
feature, respectively.
We can also normalize over a different interval, e.g. choosing to have the variable
lie in any interval [a, b], where a and b are real numbers. To rescale the range to
an arbitrary interval [a, b], the formula becomes:

x' = a + ((x − min(x)) (b − a)) / (max(x) − min(x))
Standardization
Feature standardization makes the values of each feature in the data have zero
mean and unit variance. The general method is to determine the mean and
standard deviation of each feature and calculate the new data point by the
following formula:

z = (x − x̄) / σ

Here, σ is the standard deviation of the feature vector, and x̄ is the mean of the
feature vector.
Normalization vs. Standardization
• Normalization uses the minimum and maximum values for scaling;
standardization uses the mean and standard deviation.
• Normalization is helpful when features are on different scales;
standardization is helpful when the mean of a variable should be 0 and its
standard deviation 1.
• Normalization scales values to a range such as [0, 1] or [-1, 1];
standardized values are not restricted to any specific range.
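A minimal scikit-learn sketch of both techniques, reusing the age/salary/height
example from above; the numeric values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features with very different ranges: age (years), salary (euros), height (m)
X = np.array([[18, 25000, 1.60],
              [35, 40000, 1.75],
              [60, 75000, 1.90],
              [100, 60000, 1.68]], dtype=float)

# Normalization (min-max scaling) to [0, 1]; feature_range=(a, b) rescales to any [a, b]
minmax = MinMaxScaler(feature_range=(0, 1))
print(minmax.fit_transform(X))

# Standardization: each column ends up with zero mean and unit variance
standard = StandardScaler()
print(standard.fit_transform(X))
```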