
Module 1: DATA MINING AND DATA PREPROCESSING

Introduction: Data Mining: What Motivated Data Mining, Kinds of Data to be Mined, Kinds of Patterns Mined,
Technologies Used, Kinds of Applications Targeted, Major Issues in Data Mining.
Data Pre-processing: Need for Pre-processing the Data, Major Tasks in Data Pre-processing, Data
Cleaning, Data Integration, Data Reduction, Data Transformation and Data Discretization
Introduction:
What motivated data mining? Why is it important? The major reason that data mining has attracted a great
deal of attention in the information industry in recent years is the wide availability of huge amounts of
data and the imminent need for turning such data into useful information and knowledge. The information
and knowledge gained can be used for applications ranging from business management, production
control, and market analysis to engineering design and science exploration.

What is data mining? Data mining refers to extracting or mining “knowledge” from large amounts of data.
There are many other terms related to data mining, such as knowledge mining, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym
for another popularly used term, Knowledge Discovery in Databases or KDD

The evolution of database technology: Data mining can be viewed as a result of the natural evolution of
database technology, from data collection and database creation, through data management (storage,
retrieval, and transaction processing), to advanced data analysis involving data warehousing and data mining.

A typical data mining system has the following major components:
 A knowledge base that contains the domain knowledge used to guide the search or to evaluate the
interestingness of resulting patterns. For example, the knowledge base may contain metadata that
describes data from multiple heterogeneous sources.
 A data mining engine, which consists of a set of functional modules for tasks such as
characterization, association, classification, cluster analysis, and evolution and deviation analysis.
 A pattern evaluation module that works in tandem with the data mining modules by employing
interestingness measures to help focus the search toward interesting patterns.
 A graphical user interface that allows the user to interact with the data mining system.

Kinds of Data to be Mined: Data mining means extracting or mining knowledge from huge amounts of
data. Data mining is generally used in places where a huge amount of data is stored and processed. For
example, the banking system uses data mining on the huge amounts of data it stores and processes
constantly.
In data mining, hidden patterns in the data are discovered and organized into categories of useful
information. This data is assembled in an area such as a data warehouse for analysis, and data mining
algorithms are applied to it. The resulting information facilitates effective decisions that cut costs and
increase revenue.
The various kinds of data sources on which data mining can be performed are as follows −
Relational Databases − A database system, also called a database management system, consists of a set
of interrelated data, called a database, and a set of software programs to manage and access the data.
A relational database is a collection of tables, each of which is assigned a unique name. Each table consists
of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple
in a relational table represents an object identified by a unique key and described by a set of attribute
values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed
for relational databases. An ER data model represents the database as a set of entities and their relationships.
Data warehouse Databases: A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single site. Data warehouses are
constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic
data refreshing. A data warehouse is usually modeled by a multidimensional data structure, called a data
cube.
Transactional Databases − A transactional database consists of a file where each record represents a
transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of
the items making up the transaction (such as items purchased in a store).
The transactional database may have additional tables associated with it, which contain other information
regarding the sale, such as the date of the transaction, the customer ID number, and the ID number of the
salesperson and of the branch at which the sale occurred.
 Object-Relational Databases − Object-relational databases are constructed based on an object-
relational data model. This model extends the relational model by providing a rich data type for
handling complex objects and object orientation.
 Temporal Database − A temporal database typically stores relational data that includes time-
related attributes. These attributes may involve several timestamps, each having different semantics.
 Sequence Database − A sequence database stores sequences of ordered events, with or without a
concrete notion of time. Examples include customer shopping sequences, Web click streams, and
biological sequences.
 Time-Series Database − A time-series database stores sequences of values or events obtained
over repeated measurements of time (e.g., hourly, daily, weekly). Examples include data
collected from the stock exchange, inventory control, and the observation of natural phenomena
(like temperature and wind).
Kinds of patterns Mined (or) Data Mining Functionalities:
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. Data
mining tasks are classified into two categories:
1. Descriptive
2. Predictive
Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining
tasks make predictions based on the current database.
Kind of patterns to be found in data mining tasks
 Concept / Class Description:
 Characterization and Discrimination
 Mining frequent patterns, Associations, and Correlations
 Classification and Prediction
 Cluster Analysis
 Outlier Analysis
Concept/Class Description: Descriptions of individual classes (sets of data) or concepts in summarized,
concise, and yet precise terms are called class or concept descriptions. These descriptions can be
derived via 1. Characterization
2. Discrimination
Data characterization: It is a summarization of the general characteristics of a target class of data. The
data corresponding to the user-specified class are collected by a database query. The output of data
characterization can be presented in various forms, such as pie charts, bar charts, curves, multidimensional
cubes, and multidimensional tables.
Data Discrimination: It is a comparison of the general features of a target class of data objects with those
of one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the
corresponding data objects are retrieved through database queries. For example, a comparison of products
whose sales increased by 10% in the last year with those whose sales decreased by 30% during the same
period is a case of data discrimination.
Mining frequent patterns, Associations, and Correlations:
A). Frequent patterns: A frequent itemset typically refers to a set of items that often appear together in
transactional data. For example, milk and bread are frequently purchased together by many customers.
Similarly, an electronics store can identify the products that are frequently purchased together by its
customers. In general, everyday household items are purchased frequently by most customers.
B). Association Analysis: Association analysis is the discovery of association rules showing attribute-
value conditions that occur frequently together in a given data set. An association rule has the form X => Y.
For example, a data mining system working on an 'Electronics' relational database may find an association
rule stating that customers who buy a 'computer' also tend to buy 'software':
Buys(X, "Computer") => buys(X, "Software") [support = 1%, confidence = 50%]
A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that
she will buy software as well. A 1% support means that 1% of all transactions under analysis show that
computers and software are purchased together.
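To make the measures concrete, here is a small Python sketch (not from the original text; the toy transaction
list is assumed purely for illustration) that computes the support and confidence of a rule such as
computer => software:

# Minimal sketch; the toy transactions are invented for illustration only.
transactions = [
    {"computer", "software", "printer"},
    {"computer"},
    {"milk", "bread"},
    {"computer", "software"},
    {"bread", "software"},
]

def support(itemset, transactions):
    # Fraction of all transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # Of the transactions containing `lhs`, the fraction that also contain `rhs`.
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"computer", "software"}, transactions))       # 0.4  (40% support)
print(confidence({"computer"}, {"software"}, transactions))  # ~0.67 (67% confidence)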

Classification and Regression for Prediction:


A).Classification is the process of finding a set of models that describes and distinguishes data classes or
concepts. The model is derived based on the analysis of a set of training data (i.e., data objects for which
the class labels are known). The model is used to predict the class label of objects for which the class label
is unknown.
The derived model may be represented in various forms such as classification (IF-THEN) rules,
decision trees, mathematical formulae, or neural networks. A decision tree is a flowchart-like tree
structure, where each node denotes a test on an attribute value, each branch represents an outcome of the
test, and the tree leaves represent classes or class distributions. Decision trees can easily be converted to
classification rules.
A neural network, when used for classification, is typically a collection of neuron-like processing units
with weighted connections between the units.
Regression: Regression analysis is a statistical methodology that is most often used for numeric
prediction, although other methods exist as well. Regression also encompasses the identification of
distribution trends based on the available data.
Classification and Regression may need to be preceded by relevance analysis, which attempts to
identify attributes that are significantly relevant to the classification and regression process.
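As an illustrative sketch (assuming scikit-learn is available; the tiny training set and attribute names are
invented), the following Python code derives a decision tree model from labeled training data, prints its
IF-THEN style rules, and predicts the class label of an object whose label is unknown:

# Minimal sketch, assuming scikit-learn; the toy data are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Training data: [age, income]; class labels ("yes"/"no" = buys computer) are known.
X_train = [[25, 30000], [45, 80000], [35, 60000], [50, 40000], [23, 20000]]
y_train = ["no", "yes", "yes", "no", "no"]

model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(export_text(model, feature_names=["age", "income"]))  # IF-THEN style rules

# Predict the class label of an object whose label is unknown.
print(model.predict([[40, 75000]]))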
Cluster Analysis: Clustering is a method of grouping data into different groups so that each group shares
similar trends and patterns. Unlike classification, clustering analyzes data objects without consulting class
labels and can be used to generate class labels for a group of data. The objects are clustered or grouped
based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
Clustering can also facilitate taxonomy formation, which is the organization of observations into a hierarchy
of classes that group similar events together.
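A brief clustering sketch (assuming scikit-learn; the 2-D points are invented) that groups unlabeled objects
and generates cluster labels for them:

# Minimal sketch, assuming scikit-learn; the 2-D points are invented for illustration.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one natural group
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]   # another natural group

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1] -- generated cluster labels for the objects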
Outlier Analysis: A database may contain data objects that do not comply with the general behavior or
model of the data. Such data objects are outliers. Most data mining methods discard outliers as noise or
exceptions; however, in some applications, such as fraud detection, the rare events are the interesting ones.
The analysis of outlier data is referred to as outlier mining.
For example, outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
unusually large amounts when compared with a customer's regular purchases.
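As a rough illustration (assuming NumPy; the purchase amounts are invented), an unusually large purchase
can be flagged with a simple statistical rule:

# Minimal sketch, assuming NumPy; the purchase amounts are invented for illustration.
import numpy as np

amounts = np.array([42, 55, 38, 61, 47, 52, 4900])  # one unusually large purchase
z = (amounts - amounts.mean()) / amounts.std()

print(amounts[np.abs(z) > 2])  # flagged as potential outliers, e.g. [4900]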
Which Technologies Are Used in data mining?
Data mining has incorporated many techniques from other domains such as statistics, machine learning,
pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms,
high performance computing, and many application domains. The interdisciplinary nature of data mining
research and development contributes significantly to the success of data mining and its extensive
applications.

Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data
mining has an inherent connection with statistics. A statistical model is a set of mathematical functions
that describe the behavior of the objects in a target class in terms of random variables and their associated
probability distributions.
Machine learning investigates how computers can learn (or improve their performance) based on data. A
main research area is for computer programs to automatically learn to recognize complex patterns and
make intelligent decisions based on data. Machine learning is a fast-growing discipline, and many of its
classic problems are highly related to data mining.
Database systems research focuses on the creation, maintenance, and use of databases for organizations
and end-users. Particularly, database systems researchers have established highly recognized principles in
data models, query languages, query processing and optimization methods, data storage, and indexing and
accessing methods.

Information retrieval (IR) is the science of searching for documents or information in documents.
Documents can be text or multimedia, and may reside on the Web.

Kinds of Applications Targeted:


Data mining is primarily used by organizations with intense consumer demands, and it has wide and diverse
applications.
Many customized data mining tools have been developed for domain-specific applications, including
finance, the retail industry, telecommunications, bioinformatics, and other science, engineering, and
government data analysis.
Some application domains
1. Financial data analysis
2. Retail industry
3. Telecommunication industry
4. Biological data analysis
5. Data Mining in other Scientific Applications.

1. Data Mining for Financial Data Analysis: Financial data collected in banks and financial institutions
are often relatively complete, reliable, and of high quality. Typical applications include loan payment
prediction and customer credit policy analysis.
Examples: LIC, Private Fund Organizations.
2. Data Mining for Retail Industry: The retail industry collects huge amounts of data on sales, customer
shopping history, etc. Applications of retail data mining are identifying customer buying behaviors,
discovering customer shopping patterns and trends, and improving the quality of customer service.
Examples: Supermarkets, Sales, and Shopping malls. Multidimensional analysis of sales, customers,
products, time, and region.
3. Data Mining for Telecommunication Industry: A rapidly expanding and highly competitive industry
with a great demand for data mining. Telecommunication data is intrinsically multidimensional, i.e., calling
time, duration, location of the caller, location of the called party, type of call, etc., which lends itself to
multidimensional analysis.
Examples: Fraudulent pattern analysis and the identification of unusual patterns.
4. Data Mining for Biomedical Data Analysis: DNA sequences are composed of 4 basic building blocks
(nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T). A gene is a sequence of hundreds of
individual nucleotides arranged in a particular order. Humans have around 30,000 genes.
Examples: Semantic integration of heterogeneous, distributed genomic and proteomic databases.
Discovery of structural patterns and analysis of genetic networks and protein pathways.
5. Data Mining in Other Scientific Applications
 Satellite Imagery, GIS, Rocket Launching.
 Graph-based mining.
 Visualization tools and domain-specific knowledge.

Major Issues in Data Mining: Data mining is not an easy task, as the algorithms used can get very
complex and data is not always available in one place. There are three major categories of issues in data mining:
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues
Mining Methodology and User Interaction Issues
This category refers to the following kinds of issues:
Mining different kinds of knowledge in databases − Different users may be interested in different kinds
of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery
tasks.
Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to
be interactive because it allows users to focus the search for patterns, providing and refining data mining
requests based on the returned results.
 Incorporation of background knowledge − To guide the discovery process and to express the
discovered patterns, background knowledge can be used. Background knowledge may be used to
express the discovered patterns not only in concise terms but at multiple levels of abstraction.
 Presentation and visualization of data mining results − Once the patterns are discovered, they
need to be expressed in high-level languages and visual representations. These representations
should be easily understandable.
 Handling noisy or incomplete data − The data cleaning methods are required to handle the noise
and incomplete objects while mining the data regularities. If the data cleaning methods are not
there then the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered should be interesting to the user; a discovered
pattern may be uninteresting if, for example, it merely represents common knowledge.
Performance Issues:
 Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be efficient
and scalable.
 Parallel, distributed, and incremental mining algorithms − The factors such as the huge size of
databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms.
Diverse Data Types Issues:
 Handling of relational and complex types of data − The database may contain complex data
objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system
to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − The
data is available at different data sources on LAN or WAN. These data sources may be structured,
semi-structured, or unstructured. Therefore mining the knowledge from them adds challenges to
data mining.

Preprocessing in Data Mining: Today's real-world databases are highly susceptible to noisy, missing,
and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous
sources. Low-quality data will lead to low-quality mining results. "How can the data be preprocessed in
order to help improve the quality of the data and, consequently, of the mining results? How can the data
be preprocessed so as to improve the efficiency and ease of the mining process?" Data preprocessing is a
set of data mining techniques used to transform raw data into a useful and efficient format.

Major Tasks in Data Preprocessing


Data cleaning: If users believe the data are dirty, they are unlikely to trust the results of any data mining
that has been applied. Data cleaning routines work to "clean" the data by filling in missing values, smoothing
noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration: It is likely that you will want to include data from multiple sources in your analysis. This
would involve integrating multiple databases, data cubes, or files.
Data reduction: It obtains a reduced representation of the data set that is much smaller in volume, yet
produces the same (or almost the same) analytical results. Data reduction strategies include
dimensionality reduction (removing irrelevant attributes) and numerosity reduction (replacing the data by
alternative, smaller representations).
Data Transformation: The data to be analyzed are transformed, for example normalized, that is, scaled to a
smaller range such as [0.0, 1.0].
Data Discretization: Discretization and concept hierarchy generation are powerful tools for data mining in
that they allow data mining at multiple abstraction levels.

Data Cleaning: The data can have many irrelevant and missing parts. To handle this part, data cleaning
is done. It involves handling of missing data, noisy data etc.
Missing Values: For filling in the missing values of attributes, the following methods are used.
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task
involves classification). This method is not very effective, unless the tuple contains several attributes with
missing values. By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple.
2. Fill in the missing value manually: This approach is time consuming and may not be feasible given
a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like "Unknown" or a value like −∞. If missing values are replaced by "Unknown"
the method is simple, but it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing
value: Measures of central tendency indicate the "middle" value of a data distribution. For normal
(symmetric) data distributions the mean can be used, while skewed data distributions should employ the median.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple : In
classifying customers according to credit risk, we may replace the missing value with the mean income
value for customers in the same credit risk category as that of the given tuple. If the data distribution for a
given class is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using a Bayesian formalism, or decision tree.
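A small sketch (assuming pandas; the toy credit-risk/income table is invented) of methods 4 and 5, filling
missing values with the attribute mean and with the class-conditional mean:

# Minimal sketch, assuming pandas; the toy data are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [52000, None, 31000, None, 48000],
})

# Method 4: fill missing income with the overall mean of the attribute.
overall = df["income"].fillna(df["income"].mean())

# Method 5: fill missing income with the mean income of the same credit-risk class.
by_class = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))
print(by_class)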

Noisy Data: Noise is a random error or variance in a measured variable. Basic statistical description
techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify
outliers, which may represent noise. To "smooth" out the data and remove the noise, the following data
smoothing techniques are used.
1. Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing.
2. Regression: Data smoothing can also be done by regression, a technique that conforms data values to a
function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one
attribute can be used to predict the other.
3. Outlier analysis: Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.”
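For illustration, a short sketch (assuming pandas; the sorted price values are illustrative only) of smoothing
by bin means with equal-frequency bins:

# Minimal sketch, assuming pandas; the sorted values are illustrative only.
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted

# Partition into 3 equal-frequency bins, then smooth each value by its bin mean.
bins = pd.qcut(prices, q=3, labels=False)
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())  # each value replaced by the mean of its bin, e.g. 9, 22, 29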

Data Integration in Data Mining :Data integration in data mining refers to the process of combining
data from multiple sources into a single, unified view. This can involve cleaning and transforming the
data, as well as resolving any inconsistencies or conflicts that may exist between the different sources.
The goal of data integration is to make it easier to access and analyze data that is spread across multiple
systems or platforms, in order to gain a more complete and accurate understanding of the data.
1. Entity Identification Problem: It is likely that your data analysis task will involve data integration,
which combines data from multiple sources into a coherent data store, as in data warehousing. These
sources may include multiple databases, data cubes, or flat files. The entity identification problem asks how
equivalent real-world entities from multiple data sources can be matched up; for example, how can the
analyst be sure that customer_id in one database and cust_number in another refer to the same attribute?
2. Redundancy and Correlation Analysis: Redundancy is another important issue in data integration. An
attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from another
attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies
in the resulting data set.
3. Tuple Duplication: In addition to detecting redundancies between attributes, duplication should also be
detected at the tuple level (e.g., where there are two or more identical tuples for a given unique data entry
case). The use of denormalized tables (often done to improve performance by avoiding joins) is another
source of data redundancy.
4. Data Value Conflict Detection and Resolution: Data integration also involves the detection and
resolution of data value conflicts. For example, for the same real-world entity, attribute values from
different sources may differ. This may be due to differences in representation, scaling, or encoding.
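To illustrate the correlation-based redundancy check described under point 2 (a sketch assuming pandas;
the attribute values are invented), a correlation coefficient close to +1 or −1 between two numeric attributes
suggests that one of them may be redundant:

# Minimal sketch, assuming pandas; the attribute values are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 12, 15, 20, 22],
    "annual_revenue":  [120, 144, 180, 240, 264],  # derivable from monthly_revenue
    "num_employees":   [3, 5, 4, 8, 6],
})

# Pearson correlation matrix; a coefficient near +1 or -1 flags likely redundancy.
print(df.corr().round(2))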

Data Reduction: Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, and closely maintains the integrity of the original data. So, mining
on the reduced data set should be more efficient yet produce the same (or almost the same) analytical
results.
1. Numerosity reduction techniques replace the original data volume by alternative, smaller forms of
data representation. These techniques may be parametric or nonparametric.
2. Data compression: Transformations are applied so as to obtain a reduced or "compressed"
representation of the original data. If the original data can be reconstructed from the compressed data
without any information loss, the data reduction is called lossless.
Dimensionality reduction is the process of reducing the number of random variables or attributes under
consideration. Dimensionality reduction methods include wavelet transforms and principal components
analysis, which transform or project the original data onto a smaller space.
1. Wavelet Transforms: The discrete wavelet transform (DWT) is a linear signal processing technique
that, applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients.
The two vectors are of the same length. When applying this technique to data reduction, we consider each
tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the
tuple from n database attributes.
2. Principal Components Analysis: Principal components analysis (PCA) is a method of dimensionality
reduction that searches for k n-dimensional orthogonal vectors (the principal components) that can best be
used to represent the data, where k ≤ n, so that the original data are projected onto a much smaller space.
3. Attribute Subset Selection: Data sets for analysis may contain hundreds of attributes, many of which
may be irrelevant to the mining task or redundant.
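A short dimensionality-reduction sketch (assuming scikit-learn; the data matrix is randomly generated
purely for illustration) that projects the original data onto a smaller space with PCA:

# Minimal sketch, assuming scikit-learn; the data matrix is random, for illustration only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 tuples, 10 attributes

pca = PCA(n_components=3)               # keep the 3 strongest principal components
X_reduced = pca.fit_transform(X)        # data projected onto a 3-dimensional space

print(X_reduced.shape)                  # (100, 3)
print(pca.explained_variance_ratio_)    # variance captured by each component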

Data Transformation and Data Discretization: The data are transformed or consolidated so that the
resulting mining process may be more efficient, and the patterns found may be easier to understand. Data
discretization is a form of data transformation. Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and
clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added
from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to
1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval
labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).

*************************************************************************************
(Additional Information)
Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data
transformation can involve the following:
 Smoothing, which works to remove noise from the data. Such techniques include binning,
regression, and clustering.
 Aggregation, where summary or aggregation operations are applied to the data. For example, the
daily sales data may be aggregated so as to compute monthly and annual total amounts. This step
is typically used in constructing a data cube for analysis of the data at multiple granularities.
 Generalization of the data, where low-level or “primitive” (raw) data are replaced by higher-level
concepts through the use of concept hierarchies. For example, categorical attributes, like street, can
be generalized to higher-level concepts, like city or country. Similarly, values for numerical
attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.
 Normalization, where the attribute data are scaled so as to fall within a small specified range, such
as −1.0 to 1.0, or 0.0 to 1.0.
 Attribute construction (or feature construction), where new attributes are constructed and added
from the given set of attributes to help the mining process.
Data Smoothing
 Smoothing data entails eliminating noise from the data collection under consideration. We've seen
how techniques like binning, regression, and clustering are used to eliminate noise from results.
 Binning: This approach divides the sorted data into several bins and smooths the data values in
each bin based on the values in the surrounding neighborhood.
 Regression: This approach determines the relationship between two dependent attributes so that we
can use one attribute to predict the other.
 Clustering: In this process, related data values are grouped together to create a cluster. Outliers are
values that are located outside of a cluster.
Aggregation of Data
 By performing an aggregation process on a large collection of data, data aggregation reduces the
volume of the data set.
Generalization
The nominal data or nominal attribute is a set of values with a finite number of unique values and no
ordering between them. Job category, age category, geographic region, and item category are examples of
nominal attributes. By grouping a set of attributes, the nominal attributes form a concept hierarchy.
For example, a concept hierarchy can be created by combining attributes like street, city, state, and nation.
The data is divided into several levels using a concept hierarchy. At the schema level, the concept
hierarchy can be created by defining a partial or complete ordering between the attributes. Alternatively, a
concept hierarchy can be created by explicitly grouping data at a portion of the intermediate level.
Normalization
The process of data normalization entails translating all data variables into a specific set. Data
normalization is the process of reducing the number of data values to a smaller range, such as [-1, 1] or
[0.0, 1.0]. The following are some examples of normalization techniques:
1. Min-Max Normalization: Min-max normalization is a linear transformation of the original data.
Assume that the minimum and maximum values of an attribute A are min_A and max_A, respectively,
and that A is mapped to the new range [new_min_A, new_max_A]. A value v of A is normalized to v' by

    v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

where v is the original value and v' is the normalized value in the new range.
For example, suppose the minimum and maximum values for the attribute 'income' are $12,000 and
$98,000, respectively, and we want to map income to the range [0.0, 1.0]. The value $73,600 is then
normalized by the min-max method to

    v' = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716
2. Z-score Normalization: Using the mean and standard deviation, this approach normalizes the
value of attribute A. The formula is as follows:

    v' = (v − Ā) / σ_A

Here Ā and σ_A are the mean and standard deviation of attribute A, respectively. For instance, suppose
attribute A (income) has a mean and standard deviation of $54,000 and $16,000, respectively. With
z-score normalization, the value $73,600 is transformed to

    v' = (73,600 − 54,000) / 16,000 = 1.225
3. Decimal Scaling: By shifting the decimal point in the value, this method normalizes the value of
attribute A. The maximum absolute value of A determines how far the decimal point is moved. The
decimal scaling formula is as follows:

    v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.

For example, the observed values for attribute A range from −986 to 917, with 986 as the
maximum absolute value. To use decimal scaling to normalize each value of attribute A, we
divide each value by 1000, i.e., j = 3. As a result, the value −986 is normalized to −0.986,
and the value 917 is normalized to 0.917.
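The three normalization methods can be reproduced with a short sketch (assuming NumPy; it simply
re-computes the worked values above):

# Minimal sketch, assuming NumPy; it reproduces the worked normalization examples above.
import numpy as np

v = 73_600.0

# Min-max normalization to [0.0, 1.0] with min_A = 12,000 and max_A = 98,000.
min_a, max_a = 12_000.0, 98_000.0
print((v - min_a) / (max_a - min_a))          # 0.716...

# Z-score normalization with mean = 54,000 and standard deviation = 16,000.
print((v - 54_000.0) / 16_000.0)              # 1.225

# Decimal scaling for values ranging from -986 to 917 (j = 3, divide by 10**3).
values = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(values).max())))
print(values / 10**j)                          # [-0.986  0.917]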
Attribute Construction
In the attribute construction process, new attributes are created by consulting an existing collection of
attributes, resulting in a new data set that is easier to mine. Consider the following scenario: we have a
data set containing measurements of various plots, such as the height and width of each plot. So, using the
attributes 'height' and 'width', we can create a new attribute called 'area'. This often aids in the
comprehension of the relationships between the attributes in a data set.
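For instance, a tiny sketch (assuming pandas; the plot measurements are invented) of constructing the
'area' attribute:

# Minimal sketch, assuming pandas; the plot measurements are invented for illustration.
import pandas as pd

plots = pd.DataFrame({"height": [10, 20, 15], "width": [4, 5, 6]})
plots["area"] = plots["height"] * plots["width"]   # constructed attribute
print(plots)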
Data Discretization
Data discretization supports data transformation by replacing raw numeric data values with interval labels.
For the attribute "age," for example, the values may be substituted with interval labels (0-10, 11-20, ...) or
with conceptual labels (kid, youth, adult, senior). Data discretization can be classified into two types:
supervised discretization, where the class information is used, and unsupervised discretization, where it is
not. Discretization can also be characterized by which direction the process proceeds, i.e., a 'top-down
splitting strategy' or a 'bottom-up merging strategy'. For discretization, the different attribute types need
consideration, as follows:
Nominal Attribute: Nominal Attributes only provide enough information to distinguish one object from
another. For example, the person's sex and the student's roll number.
Ordinal Attribute: The value of the ordinal attribute is sufficient to order the objects. Rankings, grades,
and height are only a few examples.
Numeric Attributes: A numeric attribute is quantitative, in the sense that the quantity can be measured and
expressed in integer or real numbers.
Ratio-Scaled Attribute: A ratio-scaled attribute is a numeric attribute for which both differences and
ratios are meaningful. For example, age, height, and weight.
Types of Data Discretization
Top-down Discretization: Top-down discretization or splitting begins by finding one or a few points
(called split points or cut points) to split the entire attribute range, and then repeats this process
recursively on the resulting intervals.

Bottom-up Discretization: Bottom-up discretization or merging is a process that begins by
considering all continuous values as potential split points and then removes some by merging
neighborhood values to form intervals. Discretization can be applied recursively to an attribute to create a
concept hierarchy, which is a hierarchical partitioning of the attribute values.

Supervised Discretization: Supervised discretization takes the class information into account when
making discretization decisions. The discretization must be determined solely by the training set and not
the test set.
Unsupervised Discretization: Unsupervised discretization algorithms are the simplest algorithms to make
use of, because the only parameter you would specify is the number of intervals to use; or else, how many
values should be included in each interval.
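An unsupervised discretization sketch (assuming pandas; the ages and the bin boundaries are invented)
that replaces raw ages with interval labels and with conceptual labels:

# Minimal sketch, assuming pandas; the ages and bin boundaries are invented for illustration.
import pandas as pd

ages = pd.Series([4, 15, 23, 37, 45, 62, 71])

# Replace raw values with interval labels.
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])
# Replace raw values with conceptual labels.
concepts = pd.cut(ages, bins=[0, 12, 19, 59, 100],
                  labels=["kid", "youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))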

Data Reduction
Terabytes of data may be stored in a database or data center. As a result, data processing and mining on
such massive datasets can take a long time. Data reduction methods may be used to create a smaller-
volume representation of the data set that still contains important information. The data reduction process
reduces the size of data and makes it suitable and feasible for analysis. In the reduction process, the
integrity of the data must be preserved, and data volume is reduced. Many strategies can be used for data
reduction.
1. Data Cube Aggregation: In the construction of a data cube, aggregation operations are applied to the
data. This method is used to condense data into a more manageable format. As an example, consider data
collected for a study from 2012 to 2014 that includes the company's revenue every three months. Rather
than keeping the quarterly figures, the data can be aggregated so that the annual revenue is reported.
2. Dimension Reduction: We keep only the attributes needed for the analysis and drop attributes that are
weakly significant. This shrinks the data by removing obsolete or redundant attributes. The following are
the methods used for dimension reduction:
 Step-wise Forward Selection
 Step-wise Backward Selection
 Combination of Forward and Backward Selection
Step-wise Forward Selection: The selection begins with an empty set of attributes; at each step, the best
of the remaining original attributes is determined and added to the reduced set.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
 Step-1: {X1}
 Step-2: {X1, X2}
 Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
Step-wise Backward Selection: This selection starts with the complete set of attributes in the original
data and, at each step, it eliminates the worst attribute remaining in the set.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }
 Step-1: {X1, X2, X3, X4, X5}
 Step-2: {X1, X2, X3, X5}
 Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
Combination of Forward and Backward Selection: This combines the two approaches, at each step
selecting the best attribute and removing the worst from among the remaining attributes, which helps to
eliminate the worst attributes and pick the best ones, saving time and speeding up the process.
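A sketch of step-wise selection (assuming scikit-learn's SequentialFeatureSelector; the six-attribute data
set X1..X6 is randomly generated, so the selected subset depends on the data):

# Minimal sketch, assuming scikit-learn; the data for attributes X1..X6 are random.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                       # attributes X1..X6
y = (X[:, 0] + X[:, 1] + X[:, 4] > 0).astype(int)   # labels depend mostly on X1, X2, X5

selector = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=3, direction="forward")  # or "backward"
selector.fit(X, y)

names = np.array(["X1", "X2", "X3", "X4", "X5", "X6"])
print(names[selector.get_support()])                # reduced attribute set, e.g. X1, X2, X5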
Data Compression
Using various encoding methods, the data compression technique reduces the size of files (Huffman
Encoding & run-length Encoding). Based on the compression techniques used, we can split it into two
forms.
 Lossless Compression
 Lossy Compression
Lossless Compression – Encoding techniques (Run Length Encoding) allow for a quick and painless
data reduction. Algorithms are used in lossless data compression to recover the exact original data
from the compressed data.
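A tiny sketch of lossless run-length encoding in Python (the input string is invented); the exact original
data can be reconstructed from the encoded form:

# Minimal sketch of run-length encoding; the input string is invented for illustration.
from itertools import groupby

def rle_encode(data):
    # Replace each run of identical symbols with a (symbol, run_length) pair.
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    # Lossless: the exact original sequence is reconstructed.
    return "".join(symbol * count for symbol, count in pairs)

encoded = rle_encode("AAAABBBCCDAA")
print(encoded)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
print(rle_decode(encoded))  # AAAABBBCCDAA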
Lossy Compression – Examples of this compression include the Discrete Wavelet Transform
technique and PCA (principal component analysis). The JPEG image format, for example, uses lossy
compression, yet we can recover an image that is a close approximation of the original. The decompressed
data in lossy data compression differ from the original data, but they are still useful for retrieving
information.
Numerosity Reduction
Numerosity Reduction is a data reduction strategy that uses a smaller type of data representation to replace
the original data. There are two approaches for reducing numerosity: parametric and nonparametric.
Parametric Methods: Parametric methods use a model to represent data. The model is used to
estimate data, requiring only data parameters to be processed rather than real data. These models are
built using regression and log-linear methods.
 Regression
Simple linear regression and multiple linear regression are two types of regression. When there is only
one independent attribute, the regression model is referred to as simple linear regression, and when
there are multiple independent attributes, it is referred to as multiple linear regression.
 Log-Linear Model
Based on a smaller subset of dimensional combinations, a log-linear model may be used to estimate
the likelihood of each data point in a multidimensional space for a collection of discretized attributes.
This enables the development of a higher-dimensional data space from lower-dimensional attributes.
Non-Parametric Methods: Histograms, clustering, sampling, and data cube aggregation are examples
of non-parametric methods for storing reduced representations of data.
 Histograms: A histogram is a frequency representation of data. It is a common method of data
reduction that employs binning to approximate data distribution.
 Clustering: Clustering is the division of data into classes or clusters. This method divides all
of the data into distinct clusters. The cluster representation of the data is used to replace the
actual data in data reduction. It also aids in the detection of data outliers.
 Sampling: Sampling is a data reduction technique that allows a large data set to be represented
by a much smaller random data set.
Aggregation of Data Cubes: Aggregation of data cubes entails moving data from a detailed level to a
summarized level with a smaller number of dimensions. The resulting data set is smaller in size while
retaining all of the information required for the analysis task.
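A short sampling sketch (assuming pandas; the "large" data set is simulated) showing how a much smaller
simple random sample, drawn without replacement, can represent the data:

# Minimal sketch, assuming pandas; the "large" data set is simulated for illustration.
import numpy as np
import pandas as pd

big = pd.DataFrame({"amount": np.random.default_rng(0).exponential(100, size=100_000)})

# Simple random sample without replacement: 1% of the tuples.
sample = big.sample(frac=0.01, replace=False, random_state=0)

print(len(sample))                                    # 1000 tuples instead of 100,000
print(big["amount"].mean(), sample["amount"].mean())  # sample mean approximates the full mean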
