Unit 2: Preprocessing in Data Mining
2. Regression:
Here, data can be smoothed by fitting it to a regression function. The regression may be linear
(having one independent variable) or multiple (having several independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the
clusters.
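A minimal sketch of the clustering idea, using scikit-learn's KMeans; the sample data, cluster count, and distance threshold are assumptions for illustration, not part of the notes:

import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups plus one point that should fall outside both clusters.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)),
                  rng.normal(8, 1, (50, 2)),
                  [[20.0, 20.0]]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
# Distance of each point to its own cluster centre.
dist = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)

threshold = dist.mean() + 3 * dist.std()   # simple heuristic cut-off
print(data[dist > threshold])              # candidate outliers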
2. Data Transformation:
This step is taken to transform the data into forms appropriate for the mining process. It involves the
following ways:
1. Normalization:
It is done to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with
such volumes. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and
reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
An aggregation operation is applied to the data to construct the data cube.
2. Numerosity Reduction:
This enables storing a model of the data instead of the whole data; for example, regression models.
3. Dimensionality Reduction:
This reduces the size of the data through encoding mechanisms. It can be lossy or lossless. If the original
data can be retrieved after reconstruction from the compressed data, the reduction is called lossless;
otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
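A minimal sketch of PCA-based dimensionality reduction with scikit-learn; the random data and the choice of two components are assumptions for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 4))        # 100 records with 4 attributes

pca = PCA(n_components=2)               # keep only 2 principal components
reduced = pca.fit_transform(data)       # lossy reduction: now 100 x 2
print(reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)    # variance retained by each component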
Data Integration
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data sources
into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes,
databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between the queries of the source and global schemas.
There are two major approaches for data integration: the “tight coupling approach” and the
“loose coupling approach”.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical location through the process of
ETL – Extraction, Transformation, and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in a way the source database can
understand, and then sends the query directly to the source databases to obtain the result.
And the data only remains in the actual source databases.
Issues in Data Integration:
There are three issues to consider during data integration: schema integration, redundancy detection, and
resolution of data value conflicts. These are explained briefly below.
1. Schema Integration:
Integrate metadata from different sources.
Matching equivalent real-world entities from multiple sources is referred to as the entity identification problem.
2. Redundancy:
An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
Inconsistencies in attributes can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
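A minimal sketch of redundancy detection via correlation analysis; the column names and values are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    "price":        [100, 200, 300, 400],
    "price_in_usd": [100, 200, 300, 400],   # same information recorded twice
    "quantity":     [5, 3, 8, 2],
})

corr = df.corr()                            # Pearson correlation matrix
print(corr.loc["price", "price_in_usd"])    # 1.0 -> strong redundancy candidate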
3. Detection and resolution of data value conflicts:
This is the third critical issue in data integration.
Attribute values from different sources may differ for the same real-world entity.
An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
The data are transformed into forms that are ideal for mining. Data transformation involves the following
steps:
1. Smoothing:
It is a process used to remove noise from the dataset with the help of algorithms. It allows important
features in the dataset to be highlighted and helps in predicting patterns. When collecting data, it
can be manipulated to eliminate or reduce variance or any other form of noise.
The concept behind data smoothing is that it can identify simple changes that help predict trends and patterns.
This serves as a help to analysts or traders who need to look at a lot of data, which can often be difficult to
digest, to find patterns they would not otherwise see.
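A minimal sketch of smoothing with a simple moving average; the series and the window size of 3 are assumptions for illustration:

import pandas as pd

values = pd.Series([21, 45, 22, 48, 25, 47, 23, 46])
smoothed = values.rolling(window=3, center=True).mean()
print(smoothed)   # local noise is damped while the overall level remains visible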
2. Aggregation:
Data aggregation is the method of storing and presenting data in a summary format. The data
may be obtained from multiple data sources and integrated into a data analysis description. This is a
crucial step, since the accuracy of data analysis insights depends heavily on the quantity and quality of the data
used. Gathering accurate data of high quality, and in large enough quantity, is necessary to produce relevant results.
Aggregated data is useful for everything from decisions concerning the financing or business strategy of the
product to pricing, operations, and marketing strategies.
For example, sales data may be aggregated to compute monthly and annual total amounts.
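A minimal sketch of aggregating sales data into monthly and annual totals with pandas; the dates and amounts are invented for illustration:

import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2014-01-05", "2014-01-20", "2014-02-11", "2015-03-02"]),
    "amount": [200.0, 150.0, 320.0, 410.0],
})

monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual  = sales.groupby(sales["date"].dt.year)["amount"].sum()
print(monthly)   # totals per month
print(annual)    # totals per year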
3. Discretization:
It is a process of transforming continuous data into a set of small intervals. Most real-world data mining
activities involve continuous attributes, yet many existing data mining frameworks are unable to handle
these attributes.
Also, even if a data mining task can manage a continuous attribute, its efficiency can be significantly improved by
replacing the continuous values of such an attribute with discrete values.
For example, numeric values grouped into intervals such as (1-10, 11-20), or age mapped to (young, middle-aged, senior).
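A minimal sketch of discretizing an age attribute with pandas; the cut points and labels are assumptions for illustration:

import pandas as pd

ages = pd.Series([19, 25, 42, 57, 63, 71])
labels = pd.cut(ages, bins=[0, 30, 60, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())   # ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']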
4. Attribute Construction:
Here, new attributes are created from the given set of attributes and applied to assist the mining process. This
simplifies the original data and makes the mining more efficient.
5. Generalization:
It converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, an age
attribute initially in numerical form (22, 25) is converted into a categorical value (young, old).
Similarly, categorical attributes, such as house addresses, may be generalized to higher-level definitions, such as
town or country.
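A minimal sketch of generalization through a concept hierarchy expressed as a simple mapping; the cities and countries are made up for illustration:

import pandas as pd

hierarchy = {"Mumbai": "India", "Pune": "India", "Paris": "France"}
cities = pd.Series(["Mumbai", "Paris", "Pune"])
print(cities.map(hierarchy).tolist())   # ['India', 'France', 'India']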
6. Normalization: Data normalization involves converting all values of a data variable into a given range.
Techniques that are used for normalization are:
Min-Max Normalization:
This transforms the original data linearly.
Suppose that min_A is the minimum and max_A is the maximum value of an attribute A. A value v of A is mapped to v'
in the new range [new_min_A, new_max_A] as:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Z-Score Normalization:
Here, the values of an attribute A are normalized using its mean and standard deviation:
v' = (v - mean_A) / std_A
For example:
Let the mean of an attribute P be 60,000 and its standard deviation 10,000. Using z-score normalization, a value
of 85,000 for P is transformed to (85,000 - 60,000) / 10,000 = 2.5.
Decimal Scaling:
This normalizes by moving the decimal point of the values: each value is divided by 10^j, where j is the smallest
integer such that the largest resulting absolute value is less than 1. For example, for values such as 98 and 97
we divide by 100 (i.e., j = 2, the number of digits in the largest number), so the normalized values come out as
0.98, 0.97, and so on.
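A minimal sketch of the three normalization techniques above; the sample values are assumptions for illustration:

import numpy as np

values = np.array([20000.0, 60000.0, 85000.0, 98000.0])

# Min-max normalization to the range [0.0, 1.0]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j so the largest absolute value drops below 1
j = len(str(int(abs(values).max())))
decimal_scaled = values / (10 ** j)

print(min_max)
print(z_score)
print(decimal_scaled)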
Data reduction methods can achieve a condensed description of the original data that is much smaller in
quantity but preserves the quality of the original data.
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data in a simpler form. For example, imagine that the information you gathered for
your analysis covers the years 2012 to 2014 and includes the revenue of your company every three months. If you are
interested in annual sales rather than the quarterly figures, you can summarize the data so that the result shows the
total sales per year instead of per quarter.
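A minimal sketch of the quarterly-to-annual summarization described above; the revenue figures are invented for illustration:

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": [1, 2, 3, 4] * 3,
    "revenue": [110, 130, 120, 150, 140, 160, 155, 170, 165, 180, 175, 190],
})

yearly = quarterly.groupby("year")["revenue"].sum()
print(yearly)   # one total per year instead of four quarterly figures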
2. Dimension reduction:
Whenever we come across data that is only weakly relevant for our analysis, we keep just the attributes that are
required. Dimension reduction reduces data size by eliminating outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes is
added to the set based on its relevance, which in statistics is typically judged with a p-value.
Suppose there are the following attributes in the data set, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
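A minimal sketch of step-wise forward selection; the data is invented, and absolute correlation with a target is used here as a stand-in for the statistical relevance score, so this is an illustration rather than the exact procedure:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(100, 6)),
                 columns=["X1", "X2", "X3", "X4", "X5", "X6"])
y = 2 * X["X1"] + X["X2"] + 0.5 * X["X5"] + rng.normal(scale=0.1, size=100)

selected, remaining = [], list(X.columns)
for _ in range(3):                                  # keep the three most relevant attributes
    best = max(remaining, key=lambda c: abs(X[c].corr(y)))
    selected.append(best)
    remaining.remove(best)

print(selected)   # typically ['X1', 'X2', 'X5'], matching the steps above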
In discretization, many continuous values of an attribute are replaced by labels of small intervals. This means that
mining results are shown in a concise and easily understandable way.
Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of
attribute values, and then repeat this method on the resulting intervals until the end, the process is known as
top-down discretization, also known as splitting.
Bottom-up discretization –
If you first consider all the continuous values as potential split points, and some are then discarded by merging
neighbourhood values into intervals, the process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
A concept hierarchy reduces the data size by collecting and then replacing low-level concepts (such as the age value 43)
with high-level concepts (categorical variables such as middle-aged or senior).
For numeric data, the following techniques can be used:
Binning –
Binning is the process of changing numerical variables into categorical counterparts. The number of categorical
counterparts depends on the number of bins specified by the user.
Histogram analysis –
Like the process of binning, the histogram is used to partition the value for the attribute X, into disjoint ranges
called brackets. There are several partitioning rules:
1. Equal Frequency Partitioning: Partitioning the values so that each range holds roughly the same number of
occurrences from the data set.
2. Equal Width Partitioning: Partitioning the values into ranges of a fixed width determined by the number of
bins, e.g. each range covering a span of values such as 0-20.
3. Clustering: Grouping the similar data together.
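A minimal sketch contrasting equal-width and equal-frequency partitioning with pandas; the values and the bin count of 5 are assumptions for illustration:

import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 20, 40, 60, 100])

equal_width = pd.cut(x, bins=5)     # 5 ranges of identical width
equal_freq  = pd.qcut(x, q=5)       # 5 ranges holding roughly equal counts
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())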