Unit 2: Preprocessing in Data Mining
2. Regression:
Here, data can be smoothed by fitting it to a regression function. The regression may be linear
(having one independent variable) or multiple (having several independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the
clusters.
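A minimal sketch of the clustering idea, using scikit-learn's KMeans; the sample data, cluster count, and distance threshold are assumptions for illustration, not part of the notes:

import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups plus one point that should fall outside both clusters.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)),
                  rng.normal(8, 1, (50, 2)),
                  [[20.0, 20.0]]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
# Distance of each point to its own cluster centre.
dist = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)

threshold = dist.mean() + 3 * dist.std()   # simple heuristic cut-off
print(data[dist > threshold])              # candidate outliers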
2. Data Transformation:
This step is taken to transform the data into forms appropriate for the mining process. It involves the
following ways:
1. Normalization:
It is done to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with
such volumes. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and
reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
An aggregation operation is applied to the data to construct the data cube.
2. Numerosity Reduction:
This enables storing a model of the data instead of the whole data; for example, regression models.
3. Dimensionality Reduction:
This reduces the size of the data through encoding mechanisms. It can be lossy or lossless. If the original
data can be retrieved after reconstruction from the compressed data, the reduction is called lossless;
otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
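A minimal sketch of PCA-based dimensionality reduction with scikit-learn; the random data and the choice of two components are assumptions for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 4))        # 100 records with 4 attributes

pca = PCA(n_components=2)               # keep only 2 principal components
reduced = pca.fit_transform(data)       # lossy reduction: now 100 x 2
print(reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)    # variance retained by each component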
Data Integration
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data sources
into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes,
databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between the queries of the source and global schemas.
There are two major approaches for data integration: the “tight coupling approach” and the
“loose coupling approach”.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical location through the process of
ETL – Extraction, Transformation, and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in a way the source database can
understand, and then sends the query directly to the source databases to obtain the result.
And the data only remains in the actual source databases.
Issues in Data Integration:
There are three issues to consider during data integration: schema integration, redundancy detection, and
resolution of data value conflicts. These are explained briefly below.
1. Schema Integration:
Integrate metadata from different sources.
Matching equivalent real-world entities from multiple sources is referred to as the entity identification problem.
2. Redundancy:
An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
Inconsistencies in attributes can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
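A minimal sketch of redundancy detection via correlation analysis; the column names and values are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    "price":        [100, 200, 300, 400],
    "price_in_usd": [100, 200, 300, 400],   # same information recorded twice
    "quantity":     [5, 3, 8, 2],
})

corr = df.corr()                            # Pearson correlation matrix
print(corr.loc["price", "price_in_usd"])    # 1.0 -> strong redundancy candidate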
3. Detection and resolution of data value conflicts:
This is the third critical issue in data integration.
Attribute values from different sources may differ for the same real-world entity.
An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
The data are transformed into forms that are ideal for mining. Data transformation involves the following
steps:
1. Smoothing:
It is a process used to remove noise from the dataset with the help of algorithms. It allows important
features in the dataset to be highlighted and helps in predicting patterns. When collecting data, it
can be manipulated to eliminate or reduce variance or any other form of noise.
The concept behind data smoothing is that it can identify simple changes that help predict trends and patterns.
This serves as a help to analysts or traders who need to look at a lot of data, which can often be difficult to
digest, to find patterns they would not otherwise see.
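A minimal sketch of smoothing with a simple moving average; the series and the window size of 3 are assumptions for illustration:

import pandas as pd

values = pd.Series([21, 45, 22, 48, 25, 47, 23, 46])
smoothed = values.rolling(window=3, center=True).mean()
print(smoothed)   # local noise is damped while the overall level remains visible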
2. Aggregation:
Data aggregation is the method of storing and presenting data in a summary format. The data
may be obtained from multiple data sources and integrated into a data analysis description. This is a
crucial step, since the accuracy of data analysis insights depends heavily on the quantity and quality of the data
used. Gathering accurate data of high quality, and in large enough quantity, is necessary to produce relevant results.
Aggregated data is useful for everything from decisions concerning the financing or business strategy of the
product to pricing, operations, and marketing strategies.
For example, sales data may be aggregated to compute monthly and annual total amounts.
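A minimal sketch of aggregating sales data into monthly and annual totals with pandas; the dates and amounts are invented for illustration:

import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2014-01-05", "2014-01-20", "2014-02-11", "2015-03-02"]),
    "amount": [200.0, 150.0, 320.0, 410.0],
})

monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual  = sales.groupby(sales["date"].dt.year)["amount"].sum()
print(monthly)   # totals per month
print(annual)    # totals per year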
3. Discretization:
It is a process of transforming continuous data into a set of small intervals. Most real-world data mining
activities involve continuous attributes, yet many existing data mining frameworks are unable to handle
these attributes.
Also, even if a data mining task can manage a continuous attribute, its efficiency can be significantly improved by
replacing the continuous values of such an attribute with discrete values.
For example, numeric values grouped into intervals such as (1-10, 11-20), or age mapped to (young, middle-aged, senior).
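A minimal sketch of discretizing an age attribute with pandas; the cut points and labels are assumptions for illustration:

import pandas as pd

ages = pd.Series([19, 25, 42, 57, 63, 71])
labels = pd.cut(ages, bins=[0, 30, 60, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())   # ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']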
4. Attribute Construction:
Here, new attributes are created from the given set of attributes and applied to assist the mining process. This
simplifies the original data and makes the mining more efficient.
5. Generalization:
It converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, an age
attribute initially in numerical form (22, 25) is converted into a categorical value (young, old).
Similarly, categorical attributes, such as house addresses, may be generalized to higher-level definitions, such as
town or country.
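A minimal sketch of generalization through a concept hierarchy expressed as a simple mapping; the cities and countries are made up for illustration:

import pandas as pd

hierarchy = {"Mumbai": "India", "Pune": "India", "Paris": "France"}
cities = pd.Series(["Mumbai", "Paris", "Pune"])
print(cities.map(hierarchy).tolist())   # ['India', 'France', 'India']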
6. Normalization: Data normalization involves converting all values of a data variable into a given range.
Techniques that are used for normalization are:
Min-Max Normalization:
This transforms the original data linearly.
Suppose that min_A is the minimum and max_A is the maximum value of an attribute A. A value v of A is mapped to v'
in the new range [new_min_A, new_max_A] as:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Z-Score Normalization:
Here, the values of an attribute A are normalized using its mean and standard deviation:
v' = (v - mean_A) / std_A
For example:
Let the mean of an attribute P be 60,000 and its standard deviation 10,000. Using z-score normalization, a value
of 85,000 for P is transformed to (85,000 - 60,000) / 10,000 = 2.5.
Decimal Scaling:
This normalizes by moving the decimal point of the values: each value is divided by 10^j, where j is the smallest
integer such that the largest resulting absolute value is less than 1. For example, for values such as 98 and 97
we divide by 100 (i.e., j = 2, the number of digits in the largest number), so the normalized values come out as
0.98, 0.97, and so on.
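A minimal sketch of the three normalization techniques above; the sample values are assumptions for illustration:

import numpy as np

values = np.array([20000.0, 60000.0, 85000.0, 98000.0])

# Min-max normalization to the range [0.0, 1.0]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j so the largest absolute value drops below 1
j = len(str(int(abs(values).max())))
decimal_scaled = values / (10 ** j)

print(min_max)
print(z_score)
print(decimal_scaled)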
Data reduction methods can achieve a condensed description of the original data that is much smaller in
quantity but preserves the quality of the original data.
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data in a simpler form. For example, imagine that the information you gathered for
your analysis covers the years 2012 to 2014 and includes the revenue of your company every three months. If you are
interested in annual sales rather than the quarterly figures, you can summarize the data so that the result shows the
total sales per year instead of per quarter.
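A minimal sketch of the quarterly-to-annual summarization described above; the revenue figures are invented for illustration:

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": [1, 2, 3, 4] * 3,
    "revenue": [110, 130, 120, 150, 140, 160, 155, 170, 165, 180, 175, 190],
})

yearly = quarterly.groupby("year")["revenue"].sum()
print(yearly)   # one total per year instead of four quarterly figures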
2. Dimension reduction:
Whenever we come across data that is only weakly relevant for our analysis, we keep just the attributes that are
required. Dimension reduction reduces data size by eliminating outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes is
added to the set based on its relevance, which in statistics is typically judged with a p-value.
Suppose there are the following attributes in the data set, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
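A minimal sketch of step-wise forward selection; the data is invented, and absolute correlation with a target is used here as a stand-in for the statistical relevance score, so this is an illustration rather than the exact procedure:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(100, 6)),
                 columns=["X1", "X2", "X3", "X4", "X5", "X6"])
y = 2 * X["X1"] + X["X2"] + 0.5 * X["X5"] + rng.normal(scale=0.1, size=100)

selected, remaining = [], list(X.columns)
for _ in range(3):                                  # keep the three most relevant attributes
    best = max(remaining, key=lambda c: abs(X[c].corr(y)))
    selected.append(best)
    remaining.remove(best)

print(selected)   # typically ['X1', 'X2', 'X5'], matching the steps above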
In discretization, many continuous values of an attribute are replaced by labels of small intervals. This means that
mining results are shown in a concise and easily understandable way.
Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of
attribute values, and then repeat this method on the resulting intervals until the end, the process is known as
top-down discretization, also known as splitting.
Bottom-up discretization –
If you first consider all the continuous values as potential split points, and some are then discarded by merging
neighbourhood values into intervals, the process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
A concept hierarchy reduces the data size by collecting and then replacing low-level concepts (such as the age value 43)
with high-level concepts (categorical variables such as middle-aged or senior).
For numeric data, the following techniques can be used:
Binning –
Binning is the process of changing numerical variables into categorical counterparts. The number of categorical
counterparts depends on the number of bins specified by the user.
Histogram analysis –
Like the process of binning, the histogram is used to partition the value for the attribute X, into disjoint ranges
called brackets. There are several partitioning rules:
1. Equal Frequency Partitioning: Partitioning the values so that each range holds roughly the same number of
occurrences from the data set.
2. Equal Width Partitioning: Partitioning the values into ranges of a fixed width determined by the number of
bins, e.g. each range covering a span of values such as 0-20.
3. Clustering: Grouping the similar data together.
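A minimal sketch contrasting equal-width and equal-frequency partitioning with pandas; the values and the bin count of 5 are assumptions for illustration:

import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 20, 40, 60, 100])

equal_width = pd.cut(x, bins=5)     # 5 ranges of identical width
equal_freq  = pd.qcut(x, q=5)       # 5 ranges holding roughly equal counts
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())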