
U2 DATA MINING

Preprocessing in Data Mining:


Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
Raw data can have many irrelevant and missing parts. Data cleaning is done to handle these problems. It involves handling of missing data, noisy data, etc.

(a) Missing Data:
This situation arises when values are absent for some attributes in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.

2. Fill the missing values:
There are various ways to do this. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value.
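
As a small illustration, here is a hedged sketch of these options using pandas; the DataFrame and its column names ("age", "income") are invented for this example.

import numpy as np
import pandas as pd

# Toy data with missing values (NaN) in two attributes.
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan],
    "income": [30000, 42000, np.nan, 58000, 61000],
})

dropped     = df.dropna()                              # ignore (drop) tuples with missing values
filled_mean = df.fillna(df.mean(numeric_only=True))    # fill with the attribute mean
filled_mode = df.fillna(df.mode().iloc[0])             # fill with the most frequent value

print(filled_mean)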

(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size and each segment is handled separately. All values in a segment can be replaced by the segment mean, or the boundary values of the segment can be used to smooth it (see the sketch after this list).

2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers either go undetected or fall outside the clusters.
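
A minimal sketch of the binning method mentioned above, assuming equal-size bins on a small invented set of sorted values; both smoothing by bin means and smoothing by bin boundaries are shown.

import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = data.reshape(4, 3)                      # 4 equal-size bins of 3 values each

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = np.repeat(bins.mean(axis=1, keepdims=True), 3, axis=1)

# Smoothing by bin boundaries: each value is replaced by the nearer bin boundary.
lo, hi = bins[:, [0]], bins[:, [-1]]
by_boundaries = np.where(np.abs(bins - lo) <= np.abs(bins - hi), lo, hi)

print(by_means)
print(by_boundaries)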

2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute by interval labels or conceptual labels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in a concept hierarchy. For example, the attribute “city” can be generalized to “country”.
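
A toy sketch of concept hierarchy generation, climbing from "city" to "country"; the mapping table and the city names are invented for illustration.

import pandas as pd

# Hypothetical city-to-country mapping (one level of a concept hierarchy).
city_to_country = {
    "Mumbai": "India",
    "Delhi": "India",
    "Paris": "France",
    "Lyon": "France",
}

df = pd.DataFrame({"city": ["Mumbai", "Paris", "Delhi", "Lyon"]})
df["country"] = df["city"].map(city_to_country)   # generalize to the higher level
print(df)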

3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder when working with such volumes. Data reduction techniques address this: they aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data to construct a data cube.

2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. To perform attribute selection, one can use a significance level and the p-value of each attribute: an attribute with a p-value greater than the significance level can be discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.

4. Dimensionality Reduction:
This reduces the size of the data using encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
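
As an illustration of the PCA option, here is a minimal sketch with scikit-learn; the random data and the choice of 2 retained components are assumptions made for this example.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))        # 100 tuples, 6 attributes

pca = PCA(n_components=2)            # keep only the 2 strongest principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)               # (100, 2)
print(pca.explained_variance_ratio_) # share of variance kept by each component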

Data Integration
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data sources
into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes,
databases, or flat files.
Data integration approaches are formally defined as a triple <G, S, M>, where
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mappings between queries over the source schemas and the global schema.

There are two major approaches to data integration: the “tight coupling approach” and the “loose coupling approach”.
Tight Coupling:
• Here, a data warehouse is treated as an information retrieval component.
• In this coupling, data is combined from different sources into a single physical location through the process of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
• Here, an interface is provided that takes the query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result.
• The data remains only in the actual source databases.
Issues in Data Integration:
There are three issues to consider during data integration: schema integration, redundancy detection, and resolution of data value conflicts. These are explained briefly below.
1. Schema Integration:
• Integrate metadata from different sources.
• Matching equivalent real-world entities across multiple sources is known as the entity identification problem.
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
• Inconsistencies in attributes can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
3. Detection and resolution of data value conflicts:
• This is the third critical issue in data integration.
• Attribute values from different sources may differ for the same real-world entity.
• An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.

Data Transformation
The data are transformed into forms that are ideal for mining. Data transformation involves the following steps:
1. Smoothing:
Smoothing is a process used to remove noise from the dataset using suitable algorithms. It highlights the important features present in the dataset and helps in predicting patterns. Data collected can be manipulated to eliminate or reduce variance or other forms of noise.
The idea behind data smoothing is that simple changes can be identified to help predict trends and patterns. This helps analysts or traders who need to look at a lot of data, which can otherwise be difficult to digest, to find patterns they would not see otherwise.
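
A brief sketch of one common smoothing technique, a rolling (moving) average with pandas; the noisy series and the window size of 5 are invented for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
trend = np.linspace(0, 10, 50)                              # underlying pattern
noisy = pd.Series(trend + rng.normal(scale=1.5, size=50))   # pattern plus noise

smoothed = noisy.rolling(window=5, center=True).mean()      # smooth out the noise
print(smoothed.head(10))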
2. Aggregation:
Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple data sources and integrated into a single data analysis description. This is a crucial step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used: gathering accurate data of high quality, and enough of it, is necessary to produce relevant results.
Aggregated data is useful for everything from decisions concerning the financing or business strategy of a product to pricing, operations, and marketing strategies.
For example, sales data may be aggregated to compute monthly and annual total amounts.
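
A small sketch of that sales example with pandas; the daily sales figures and date range are fabricated for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
sales = pd.Series(
    rng.integers(100, 1000, size=730),                        # two years of daily sales
    index=pd.date_range("2013-01-01", periods=730, freq="D"),
)

monthly_totals = sales.groupby([sales.index.year, sales.index.month]).sum()
annual_totals = sales.groupby(sales.index.year).sum()
print(annual_totals)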

3. Discretization:
Discretization is the process of transforming continuous data into a set of small intervals. Most real-world data mining activities involve continuous attributes, yet many existing data mining frameworks are unable to handle these attributes directly.
Even when a data mining task can manage a continuous attribute, its efficiency can often be improved significantly by replacing the continuous values with discrete ones.
For example, numeric values may be grouped into intervals (1-10, 11-20, ...) and age may be grouped into labels (young, middle age, senior).
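
A short sketch of discretizing an "age" attribute into the labels above with pandas; the ages, bin edges, and labels are illustrative choices.

import pandas as pd

ages = pd.Series([12, 23, 37, 45, 58, 67, 71])
age_groups = pd.cut(
    ages,
    bins=[0, 30, 60, 120],                       # interval boundaries (assumed)
    labels=["young", "middle age", "senior"],    # conceptual labels for the intervals
)
print(age_groups)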
4. Attribute Construction:
New attributes are created from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes mining more efficient.
5. Generalization:
Generalization converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, age values initially in numerical form (22, 25) are converted into categorical values (young, old).
Similarly, categorical attributes such as house addresses may be generalized to higher-level definitions such as town or country.
6. Normalization: Data normalization involves converting all values of a variable into a given range. A combined sketch of the three techniques below follows the decimal-scaling example.
Techniques that are used for normalization are:
• Min-Max Normalization:
   • This transforms the original data linearly.
   • Suppose min_A is the minimum and max_A is the maximum value of an attribute A, and [new_min_A, new_max_A] is the new range.
   • A value v of A is normalized to v' by computing:
       v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
   • Here v is the value you want to map into the new range, and v' is the new value you get after normalizing the old value.
Solved example:
Suppose the minimum and maximum values for an attribute profit (P) are Rs. 10,000 and Rs. 100,000, and we want to map profit into the range [0, 1]. Using min-max normalization, the value Rs. 20,000 for attribute profit is mapped to:
   v' = ((20,000 - 10,000) / (100,000 - 10,000)) * (1 - 0) + 0 = 10,000 / 90,000 ≈ 0.11
And hence we get the value of v' as 0.11.


• Z-Score Normalization:
   • In z-score normalization (or zero-mean normalization) the values of an attribute A are normalized based on the mean of A and its standard deviation.
   • A value v of attribute A is normalized to v' by computing:
       v' = (v - mean_A) / std_A
For example:
Let the mean of an attribute P be 60,000 and its standard deviation be 10,000. Using z-score normalization, a value of 85,000 for P is transformed to:
   v' = (85,000 - 60,000) / 10,000 = 2.5
And hence we get the value of v' to be 2.5.


• Decimal Scaling:
   • It normalizes the values of an attribute by moving their decimal points.
   • The number of positions the decimal point is moved is determined by the maximum absolute value of attribute A.
   • A value v of attribute A is normalized to v' by computing:
       v' = v / 10^j
   • where j is the smallest integer such that Max(|v'|) < 1.
For example:
   • Suppose the values of an attribute P vary from -99 to 99.
   • The maximum absolute value of P is 99.

   • To normalize the values, we divide them by 100 (i.e., j = 2, the number of digits in the largest absolute value), so the values come out as 0.99, 0.98, 0.97, and so on.
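
As promised above, here is a combined sketch of the three normalization techniques on a toy "profit" attribute; the values loosely reuse the worked examples, and the new range [0, 1] is an assumption.

import numpy as np

P = np.array([10_000.0, 20_000.0, 60_000.0, 85_000.0, 100_000.0])

# Min-max normalization into the new range [0, 1];
# 20,000 maps to roughly 0.11, matching the worked example above.
new_min, new_max = 0.0, 1.0
minmax = (P - P.min()) / (P.max() - P.min()) * (new_max - new_min) + new_min

# Z-score normalization using this attribute's own mean and standard deviation.
zscore = (P - P.mean()) / P.std()

# Decimal scaling: divide by 10**j with the smallest j such that max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(P).max() + 1)))
decimal_scaled = P / 10**j

print(minmax)
print(zscore)
print(decimal_scaled)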

Data reduction methods aim to obtain a condensed description of the original data that is much smaller in volume yet preserves the quality of the original data.
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, imagine that the information gathered for your analysis covers the years 2012 to 2014 and records your company's revenue every three months. If you are interested in annual sales rather than quarterly figures, you can summarize the data so that the result shows total sales per year instead of per quarter.
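
A tiny sketch of that quarterly-to-annual roll-up with pandas; the revenue figures for 2012-2014 are invented.

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "revenue": [120, 135, 150, 160, 170, 180, 175, 190, 200, 210, 205, 220],
})

# Roll the cube up from the quarter level to the year level.
annual = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(annual)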
2. Dimension Reduction:
Whenever we come across attributes that are only weakly relevant, we keep just the attributes required for our analysis. Dimension reduction shrinks the data size by eliminating outdated or redundant features.
• Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes is added to the set based on its relevance, which in statistics can be judged by its p-value.
Suppose the following attributes are in the data set, a few of which are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


• Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst remaining attribute in the set.
Suppose the following attributes are in the data set, a few of which are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


• Combination of Forward and Backward Selection –
It allows us to remove the worst and select the best attributes at each step, saving time and making the process faster.
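
A hedged sketch of step-wise forward and backward selection using scikit-learn's SequentialFeatureSelector; the synthetic data set, the logistic-regression scorer, and the choice of keeping 3 attributes are assumptions for this example.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 attributes, only some of which are informative.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)
model = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(model, n_features_to_select=3,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(model, n_features_to_select=3,
                                     direction="backward").fit(X, y)

print("forward keeps attributes:", forward.get_support(indices=True))
print("backward keeps attributes:", backward.get_support(indices=True))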
3. Data Compression:
Data compression reduces the size of files using different encoding mechanisms (e.g., Huffman encoding and run-length encoding). It can be divided into two types based on the compression technique used.
• Lossless Compression –
Encoding techniques such as run-length encoding allow a simple and modest reduction in data size. Lossless data compression uses algorithms that restore the precise original data from the compressed data.
• Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of compression. For example, the JPEG image format uses lossy compression, but the decompressed image is still meaningfully equivalent to the original. In lossy compression, the decompressed data may differ from the original data but remain useful enough to retrieve information from.
4. Numerosity Reduction:
In this reduction technique the actual data is replaced with a mathematical model or a smaller representation of the data. For a parametric method, only the model parameters need to be stored; non-parametric methods include clustering, histograms, and sampling.
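
A minimal sketch of the parametric case: fit a linear regression and keep only its parameters instead of the raw tuples. The synthetic data and the assumed linear relationship are for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(0, 100, size=10_000).reshape(-1, 1)
y = 3.5 * x.ravel() + 12 + rng.normal(scale=2.0, size=10_000)

model = LinearRegression().fit(x, y)

# Only these two parameters need to be stored, not the 10,000 original tuples.
print("slope:", model.coef_[0], "intercept:", model.intercept_)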
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques are used to divide attributes of a continuous nature into data with intervals.

We replace the many continuous values of an attribute with labels of small intervals, so that mining results are presented in a concise and easily understandable way.
• Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of attribute values, and then repeat this method on the resulting parts until the end, the process is known as top-down discretization, also called splitting.
• Bottom-up discretization –
If you first consider all the continuous values as potential split points and then discard some by merging neighbourhood values into intervals, the process is called bottom-up discretization, also called merging.
Concept Hierarchies:
A concept hierarchy reduces the data size by collecting and replacing low-level concepts (such as an age of 43) with high-level concepts (categorical values such as middle aged or senior).
For numeric data, the following techniques can be used:
• Binning –
Binning is the process of changing numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user.
• Histogram analysis –
Like binning, a histogram is used to partition the values of an attribute X into disjoint ranges called buckets (brackets). There are several partitioning rules:
1. Equal-frequency partitioning: partition the values so that each bucket holds roughly the same number of occurrences from the data set.
2. Equal-width partitioning: partition the values into buckets of a fixed width determined by the number of bins, e.g., buckets covering ranges such as 0-20.
3. Clustering: group similar data together.
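
A short sketch contrasting equal-width and equal-frequency partitioning with pandas; the skewed attribute values and the choice of 4 buckets are illustrative assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
x = pd.Series(rng.exponential(scale=10, size=1000))   # a skewed continuous attribute

equal_width = pd.cut(x, bins=4)    # 4 buckets of equal width
equal_freq = pd.qcut(x, q=4)       # 4 buckets with (roughly) equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())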
