Data Mining and Data Analysis UNIT-1 Notes For Print
College for Women (A), Nellore
DATA MINING AND DATA ANALYSIS
UNIT-I
Data mining - KDD Vs Data Mining, Stages of the Data Mining Process-
Task Primitives, Data Mining Techniques – Data Mining Knowledge
Representation. Major Issues in Data Mining – Measurement and Data –
Data Preprocessing – Data Cleaning - Data transformation- Feature
Selection - Dimensionality reduction
A typical data mining system may have the following major components:
- Database, data warehouse, or other information repository
- Database or data warehouse server
- Knowledge base
- Data mining engine
- Pattern evaluation module
- Graphical user interface

2. Data Mining Process:
1. Classification:
This technique assigns data items to predefined classes based on
their attributes. It is used to obtain important and relevant
information about data and metadata, and it helps organize the data
into different classes.
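A minimal sketch of classification, using a simple nearest-neighbour rule; the training points and class labels below are invented for illustration.

```python
# Minimal classification sketch: a 1-nearest-neighbour classifier.
# The training data and the "low"/"high" labels are invented examples.

def classify(sample, training_data):
    """Assign `sample` the class label of its closest training point."""
    def sq_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training_data, key=lambda item: sq_distance(sample, item[0]))
    return nearest[1]

training = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
            ((8.0, 9.0), "high"), ((9.0, 8.5), "high")]

print(classify((1.1, 0.9), training))  # a point near the "low" group
```

In practice a classifier is first trained on labeled data and then applied to unlabeled records; the nearest-neighbour rule shown here is only one of many classification techniques.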
2. Clustering:
Clustering is the division of data into groups of related objects.
Describing the data by a few clusters loses some fine detail but
achieves simplification: the data is modeled by its clusters.
Historically, clustering is rooted in statistics, mathematics, and
numerical analysis.
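The idea of modeling data by its clusters can be sketched with a tiny k-means loop; the 1-D data points and starting centroids below are invented for illustration.

```python
# Minimal k-means sketch (pure Python, 1-D data for brevity).
# The data points and initial centroids are invented examples.

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 8.0, 8.4, 7.6], [0.0, 10.0])
print(centroids)  # the two centroids settle near 1.0 and 8.0
```

The six points collapse to two representative centroids, which is exactly the "describe the data by a few clusters" simplification the text mentions.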
3. Regression:
Regression analysis is used in the data mining process to identify
and analyze the relationship between variables, that is, how a
dependent variable changes in the presence of other factors. It is
used to predict the value of a specific variable.
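A minimal sketch of simple linear regression fitted by least squares; the (x, y) pairs are invented so that y is exactly 2x + 1.

```python
# Least-squares simple linear regression sketch (pure Python).
# The data points are invented examples lying exactly on y = 2x + 1.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0 for this exactly linear data
```

With the fitted slope and intercept, the value of the dependent variable can be predicted for new inputs, e.g. x = 5 gives 2.0 * 5 + 1.0 = 11.0.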
4. Association Rules:
This data mining technique helps to discover a link between two or
more items that frequently occur together. It finds hidden patterns
in the data set.
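Association rules are usually scored by support and confidence; a minimal sketch over a handful of invented market-basket transactions:

```python
# Association-rule sketch: support and confidence for a rule A -> B.
# The transactions (sets of items) are invented examples.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 2 of 4 baskets -> 0.5
print(confidence({"bread"}, {"milk"}))  # 2 of 3 bread baskets, about 0.67
```

A rule such as bread -> milk is kept only if both its support and its confidence exceed user-chosen thresholds.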
5. Outlier detection:
This type of data mining technique identifies data items in the data
set that do not match an expected pattern or expected behavior. It
may be used in various domains such as intrusion detection and fraud
detection. It is also known as outlier analysis or outlier mining.
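One common way to flag values that "do not match expected behavior" is a standard-deviation test; the data and the 2-sigma threshold below are invented for illustration.

```python
# Outlier-detection sketch: flag values more than 2 standard
# deviations from the mean. Data and threshold are invented examples.
import statistics

def outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

data = [10, 11, 9, 10, 12, 10, 11, 50]
print(outliers(data))  # 50 stands far from the rest
```

In fraud or intrusion detection the same idea applies to transaction amounts or request rates, though real systems use more robust statistics than a plain mean.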
6. Sequential Patterns:
This data mining technique evaluates sequential data, i.e., events
ordered in time, to discover recurring sequential patterns.
7. Prediction:
Prediction used a combination of other data mining techniques such
as trends, clustering, classification, etc. It analyzes past events or
instances in the right sequence to predict a future event.
Task-relevant data (a data mining task primitive) specifies the
portion of the database or the set of data in which the user is
interested. This includes the database attributes or data warehouse
dimensions of interest (the relevant attributes or dimensions).
8. Data Preprocessing:
1. Data Cleaning:
Real-world data can have many irrelevant and missing parts. Data
cleaning handles these problems, including missing data and noisy
data.
Binning Method:
This method works on sorted data in order to smooth it. The sorted
data is divided into segments (bins) of equal size, and each bin is
handled separately: all values in a bin can be replaced by the bin
mean (smoothing by bin means) or by the nearest bin boundary
(smoothing by bin boundaries).
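Smoothing by bin means can be sketched in a few lines; the data values are invented for illustration.

```python
# Binning sketch: sort the data, split it into equal-size bins, then
# replace every value in a bin by the bin mean. Values are invented.

def smooth_by_bin_means(values, bin_size):
    values = sorted(values)
    smoothed = []
    for i in range(0, len(values), bin_size):
        bin_ = values[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# bins [4, 8, 15], [21, 21, 24], [25, 28, 34] -> means 9.0, 22.0, 29.0
```

Smoothing by bin boundaries works the same way except each value is replaced by whichever of the bin's minimum or maximum is closer.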
Regression:
Here data can be smoothed by fitting it to a regression function. The
regression used may be linear (one independent variable) or multiple
(several independent variables).
Clustering:
This approach groups similar data into clusters. Values that fall
outside the clusters can be treated as outliers (noise).
2. Data Transformation:
This step transforms the data into forms appropriate for the mining
process. It involves the following ways:
1. Normalization:
This scales the data values into a specified range, such as -1.0 to
1.0 or 0.0 to 1.0.
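Min-max normalization, one common way to rescale values into such a range, can be sketched as follows; the input values are invented for illustration.

```python
# Min-max normalization sketch: rescale values into [new_min, new_max].
# The input values are invented examples.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min)
            for v in values]

print(min_max_normalize([20, 40, 60, 80, 100]))
# -> [0.0, 0.25, 0.5, 0.75, 1.0]
```

Other normalization schemes exist (e.g. z-score normalization, which centers values on the mean); min-max is simply the most direct fit for a fixed target range.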
2. Attribute Selection (attribute construction):
In this strategy, new attributes are constructed from the given set
of attributes to help the mining process.
3. Discretization:
This replaces the raw values of a numeric attribute with interval
labels or conceptual labels.
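A minimal discretization sketch mapping a numeric attribute to conceptual labels; the age cut-offs and label names are invented for illustration.

```python
# Discretization sketch: map raw numeric values to interval labels.
# The cut-offs and labels below are invented examples.

def discretize(value, bins):
    """bins: list of (upper_bound, label) pairs, checked in order."""
    for upper, label in bins:
        if value <= upper:
            return label
    return bins[-1][1]

age_bins = [(17, "minor"), (64, "adult"), (float("inf"), "senior")]
print([discretize(a, age_bins) for a in [12, 30, 70]])
# -> ['minor', 'adult', 'senior']
```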
4. Concept Hierarchy Generation:
Here attributes are generalized from a lower level to a higher level
in a hierarchy. For example, the attribute "city" can be generalized
to "country".
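The city-to-country generalization above amounts to a lookup along the hierarchy; a minimal sketch, with an invented mapping table:

```python
# Concept-hierarchy sketch: generalize "city" up to "country" using a
# lookup table. The mapping entries are invented examples.

city_to_country = {"Nellore": "India", "Paris": "France", "Tokyo": "Japan"}

records = [{"city": "Nellore"}, {"city": "Paris"}]
generalized = [{"country": city_to_country[r["city"]]} for r in records]
print(generalized)
# -> [{'country': 'India'}, {'country': 'France'}]
```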
3. Data Reduction:
Data reduction is a crucial step in the data mining process that
involves reducing the size of the dataset while preserving the
important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps
involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant
features from the dataset. Feature selection is often performed to
remove irrelevant or redundant features. It can be done using
techniques such as correlation analysis and mutual information (PCA,
which constructs new features rather than selecting existing ones,
belongs under feature extraction below).
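Correlation-based selection of the kind mentioned above can be sketched by dropping one of any pair of highly correlated features; the feature columns and 0.95 threshold are invented for illustration.

```python
# Feature-selection sketch: keep a feature only if it is not highly
# correlated with an already-kept feature. Data is invented; column
# "b" is exactly 2 * "a", so it is redundant.
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

features = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
}

def select(features, threshold=0.95):
    kept = []
    for name, col in features.items():
        if all(abs(pearson(col, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

print(select(features))  # "b" is dropped as redundant with "a"
```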
Feature Extraction: This involves transforming the data into a
lower-dimensional space while preserving the important information.
Feature extraction is often used when the original features are high-
dimensional and complex. It can be done using techniques such as
PCA, linear discriminant analysis (LDA), and non-negative matrix
factorization (NMF).
Sampling: This involves selecting a subset of data points from the
dataset. Sampling is often used to reduce the size of the dataset while
preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic
sampling.
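Two of the sampling schemes named above can be sketched with the standard library; the dataset and sample sizes are invented for illustration.

```python
# Sampling sketch: simple random sampling and systematic sampling.
# The dataset (row ids 0..99) and sample sizes are invented examples.
import random

data = list(range(100))

# Simple random sampling: pick k rows uniformly without replacement.
random.seed(0)  # fixed seed so the sketch is reproducible
simple = random.sample(data, k=10)

# Systematic sampling: take every n-th row after a random start.
step = 10
start = random.randrange(step)
systematic = data[start::step]

print(len(simple), len(systematic))  # 10 10
```

Stratified sampling would instead split the data into strata (e.g. by class label) and sample from each stratum in proportion to its size.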
Clustering: This involves grouping similar data points together into
clusters. Clustering is often used to reduce the size of the dataset by
replacing similar data points with a representative centroid. It can be
done using techniques such as k-means, hierarchical clustering, and
density-based clustering.
Compression: This involves compressing the dataset while
preserving the important information. Compression is often used to
reduce the size of the dataset for storage and transmission purposes.
It can be done using techniques such as wavelet compression, JPEG
compression, and gzip compression.
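Of the techniques listed, gzip compression is available directly in the standard library; the repetitive sample dataset below is invented for illustration.

```python
# Compression sketch: shrink a repetitive dataset with gzip (stdlib).
# The CSV-like payload is an invented example.
import gzip

raw = b"temperature,humidity\n" + b"21.5,60\n" * 1000
packed = gzip.compress(raw)

print(len(raw), len(packed))  # the packed form is far smaller
restored = gzip.decompress(packed)
assert restored == raw        # lossless: the original is fully recovered
```

gzip is lossless, so it suits storage and transmission; wavelet and JPEG compression are lossy and trade some fidelity for much smaller sizes.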
Feature Selection
Feature selection is the process of selecting a subset of the
relevant features and leaving out the irrelevant features in a
dataset, in order to build a model of high accuracy. In other words,
it is a way of selecting the optimal features from the input dataset.
Common techniques of Dimensionality Reduction
a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder
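One of the simplest techniques in the list, the Low Variance Filter (f), can be sketched directly; the feature columns and the 0.01 threshold are invented for illustration.

```python
# Low Variance Filter sketch: drop features whose variance falls below
# a threshold, since a nearly constant column carries little
# information. Feature values and the threshold are invented examples.
import statistics

features = {
    "nearly_constant": [5.0, 5.0, 5.1, 5.0, 5.0],
    "informative": [1.0, 7.0, 3.0, 9.0, 5.0],
}

def low_variance_filter(features, threshold=0.01):
    return {name: col for name, col in features.items()
            if statistics.pvariance(col) >= threshold}

print(list(low_variance_filter(features)))  # -> ['informative']
```

The Missing Value Ratio filter (e) works the same way, except the per-column statistic is the fraction of missing entries rather than the variance.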