Data Mining and Data Analysis UNIT-1 Notes For Print
College for Women (A), Nellore
DATA MINING AND DATA ANALYSIS
UNIT-I
Data mining - KDD Vs Data Mining, Stages of the Data Mining Process-
Task Primitives, Data Mining Techniques – Data Mining Knowledge
Representation. Major Issues in Data Mining – Measurement and Data –
Data Preprocessing – Data Cleaning - Data transformation- Feature
Selection - Dimensionality reduction
A typical data mining system may have the following major components:
- Database, data warehouse, or other information repository
- Database or data warehouse server
- Knowledge base
- Data mining engine
- Pattern evaluation module
- Graphical user interface

2. Data Mining Process:
1. Classification:
This technique assigns data items to predefined classes based on
their attributes. It is used to obtain important and relevant
information about data and metadata, and it helps organize the data
into different classes.
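A minimal sketch of classification, using a simple nearest-neighbour rule; the training points and class labels below are invented for illustration.

```python
# Minimal classification sketch: a 1-nearest-neighbour classifier.
# The training data and the "low"/"high" labels are invented examples.

def classify(sample, training_data):
    """Assign `sample` the class label of its closest training point."""
    def sq_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training_data, key=lambda item: sq_distance(sample, item[0]))
    return nearest[1]

training = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
            ((8.0, 9.0), "high"), ((9.0, 8.5), "high")]

print(classify((1.1, 0.9), training))  # a point near the "low" group
```

In practice a classifier is first trained on labeled data and then applied to unlabeled records; the nearest-neighbour rule shown here is only one of many classification techniques.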
2. Clustering:
Clustering is the division of data into groups of related objects.
Describing the data by a few clusters loses some fine detail but
achieves simplification: the data is modeled by its clusters.
Historically, clustering is rooted in statistics, mathematics, and
numerical analysis.
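The idea of modeling data by its clusters can be sketched with a tiny k-means loop; the 1-D data points and starting centroids below are invented for illustration.

```python
# Minimal k-means sketch (pure Python, 1-D data for brevity).
# The data points and initial centroids are invented examples.

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 8.0, 8.4, 7.6], [0.0, 10.0])
print(centroids)  # the two centroids settle near 1.0 and 8.0
```

The six points collapse to two representative centroids, which is exactly the "describe the data by a few clusters" simplification the text mentions.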
3. Regression:
Regression analysis is used in the data mining process to identify
and analyze the relationship between variables, that is, how a
dependent variable changes in the presence of other factors. It is
used to predict the value of a specific variable.
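A minimal sketch of simple linear regression fitted by least squares; the (x, y) pairs are invented so that y is exactly 2x + 1.

```python
# Least-squares simple linear regression sketch (pure Python).
# The data points are invented examples lying exactly on y = 2x + 1.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0 for this exactly linear data
```

With the fitted slope and intercept, the value of the dependent variable can be predicted for new inputs, e.g. x = 5 gives 2.0 * 5 + 1.0 = 11.0.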
4. Association Rules:
This data mining technique helps to discover a link between two or
more items that frequently occur together. It finds hidden patterns
in the data set.
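Association rules are usually scored by support and confidence; a minimal sketch over a handful of invented market-basket transactions:

```python
# Association-rule sketch: support and confidence for a rule A -> B.
# The transactions (sets of items) are invented examples.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 2 of 4 baskets -> 0.5
print(confidence({"bread"}, {"milk"}))  # 2 of 3 bread baskets, about 0.67
```

A rule such as bread -> milk is kept only if both its support and its confidence exceed user-chosen thresholds.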
5. Outlier detection:
This type of data mining technique identifies data items in the data
set that do not match an expected pattern or expected behavior. It
may be used in various domains such as intrusion detection and fraud
detection. It is also known as outlier analysis or outlier mining.
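One common way to flag values that "do not match expected behavior" is a standard-deviation test; the data and the 2-sigma threshold below are invented for illustration.

```python
# Outlier-detection sketch: flag values more than 2 standard
# deviations from the mean. Data and threshold are invented examples.
import statistics

def outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

data = [10, 11, 9, 10, 12, 10, 11, 50]
print(outliers(data))  # 50 stands far from the rest
```

In fraud or intrusion detection the same idea applies to transaction amounts or request rates, though real systems use more robust statistics than a plain mean.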
6. Sequential Patterns:
This data mining technique evaluates sequential data, i.e., events
ordered in time, to discover recurring sequential patterns.
7. Prediction:
Prediction used a combination of other data mining techniques such
as trends, clustering, classification, etc. It analyzes past events or
instances in the right sequence to predict a future event.
Task-relevant data (a data mining task primitive) specifies the
portion of the database or the set of data in which the user is
interested. This includes the database attributes or data warehouse
dimensions of interest (the relevant attributes or dimensions).
8. Data Preprocessing:
1. Data Cleaning:
Real-world data can have many irrelevant and missing parts. Data
cleaning handles these problems, including missing data and noisy
data.
Binning Method:
This method works on sorted data in order to smooth it. The sorted
data is divided into segments (bins) of equal size, and each bin is
handled separately: all values in a bin can be replaced by the bin
mean (smoothing by bin means) or by the nearest bin boundary
(smoothing by bin boundaries).
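Smoothing by bin means can be sketched in a few lines; the data values are invented for illustration.

```python
# Binning sketch: sort the data, split it into equal-size bins, then
# replace every value in a bin by the bin mean. Values are invented.

def smooth_by_bin_means(values, bin_size):
    values = sorted(values)
    smoothed = []
    for i in range(0, len(values), bin_size):
        bin_ = values[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# bins [4, 8, 15], [21, 21, 24], [25, 28, 34] -> means 9.0, 22.0, 29.0
```

Smoothing by bin boundaries works the same way except each value is replaced by whichever of the bin's minimum or maximum is closer.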
Regression:
Here data can be smoothed by fitting it to a regression function. The
regression used may be linear (one independent variable) or multiple
(several independent variables).
Clustering:
This approach groups similar data into clusters. Values that fall
outside the clusters can be treated as outliers (noise).
2. Data Transformation:
This step transforms the data into forms appropriate for the mining
process. It involves the following ways:
1. Normalization:
This scales the data values into a specified range, such as -1.0 to
1.0 or 0.0 to 1.0.
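Min-max normalization, one common way to rescale values into such a range, can be sketched as follows; the input values are invented for illustration.

```python
# Min-max normalization sketch: rescale values into [new_min, new_max].
# The input values are invented examples.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min)
            for v in values]

print(min_max_normalize([20, 40, 60, 80, 100]))
# -> [0.0, 0.25, 0.5, 0.75, 1.0]
```

Other normalization schemes exist (e.g. z-score normalization, which centers values on the mean); min-max is simply the most direct fit for a fixed target range.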
2. Attribute Selection (attribute construction):
In this strategy, new attributes are constructed from the given set
of attributes to help the mining process.
3. Discretization:
This replaces the raw values of a numeric attribute with interval
labels or conceptual labels.
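A minimal discretization sketch mapping a numeric attribute to conceptual labels; the age cut-offs and label names are invented for illustration.

```python
# Discretization sketch: map raw numeric values to interval labels.
# The cut-offs and labels below are invented examples.

def discretize(value, bins):
    """bins: list of (upper_bound, label) pairs, checked in order."""
    for upper, label in bins:
        if value <= upper:
            return label
    return bins[-1][1]

age_bins = [(17, "minor"), (64, "adult"), (float("inf"), "senior")]
print([discretize(a, age_bins) for a in [12, 30, 70]])
# -> ['minor', 'adult', 'senior']
```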
4. Concept Hierarchy Generation:
Here attributes are generalized from a lower level to a higher level
in a hierarchy. For example, the attribute "city" can be generalized
to "country".
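The city-to-country generalization above amounts to a lookup along the hierarchy; a minimal sketch, with an invented mapping table:

```python
# Concept-hierarchy sketch: generalize "city" up to "country" using a
# lookup table. The mapping entries are invented examples.

city_to_country = {"Nellore": "India", "Paris": "France", "Tokyo": "Japan"}

records = [{"city": "Nellore"}, {"city": "Paris"}]
generalized = [{"country": city_to_country[r["city"]]} for r in records]
print(generalized)
# -> [{'country': 'India'}, {'country': 'France'}]
```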
3. Data Reduction:
Data reduction is a crucial step in the data mining process that
involves reducing the size of the dataset while preserving the
important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps
involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant
features from the dataset. Feature selection is often performed to
remove irrelevant or redundant features. It can be done using
techniques such as correlation analysis and mutual information (PCA,
which constructs new features rather than selecting existing ones,
belongs under feature extraction below).
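Correlation-based selection of the kind mentioned above can be sketched by dropping one of any pair of highly correlated features; the feature columns and 0.95 threshold are invented for illustration.

```python
# Feature-selection sketch: keep a feature only if it is not highly
# correlated with an already-kept feature. Data is invented; column
# "b" is exactly 2 * "a", so it is redundant.
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

features = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
}

def select(features, threshold=0.95):
    kept = []
    for name, col in features.items():
        if all(abs(pearson(col, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

print(select(features))  # "b" is dropped as redundant with "a"
```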
Feature Extraction: This involves transforming the data into a
lower-dimensional space while preserving the important information.
Feature extraction is often used when the original features are high-
dimensional and complex. It can be done using techniques such as
PCA, linear discriminant analysis (LDA), and non-negative matrix
factorization (NMF).
Sampling: This involves selecting a subset of data points from the
dataset. Sampling is often used to reduce the size of the dataset while
preserving the important information. It can be done using techniques
such as random sampling, stratified sampling, and systematic
sampling.
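Two of the sampling schemes named above can be sketched with the standard library; the dataset and sample sizes are invented for illustration.

```python
# Sampling sketch: simple random sampling and systematic sampling.
# The dataset (row ids 0..99) and sample sizes are invented examples.
import random

data = list(range(100))

# Simple random sampling: pick k rows uniformly without replacement.
random.seed(0)  # fixed seed so the sketch is reproducible
simple = random.sample(data, k=10)

# Systematic sampling: take every n-th row after a random start.
step = 10
start = random.randrange(step)
systematic = data[start::step]

print(len(simple), len(systematic))  # 10 10
```

Stratified sampling would instead split the data into strata (e.g. by class label) and sample from each stratum in proportion to its size.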
Clustering: This involves grouping similar data points together into
clusters. Clustering is often used to reduce the size of the dataset by
replacing similar data points with a representative centroid. It can be
done using techniques such as k-means, hierarchical clustering, and
density-based clustering.
Compression: This involves compressing the dataset while
preserving the important information. Compression is often used to
reduce the size of the dataset for storage and transmission purposes.
It can be done using techniques such as wavelet compression, JPEG
compression, and gzip compression.
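Of the techniques listed, gzip compression is available directly in the standard library; the repetitive sample dataset below is invented for illustration.

```python
# Compression sketch: shrink a repetitive dataset with gzip (stdlib).
# The CSV-like payload is an invented example.
import gzip

raw = b"temperature,humidity\n" + b"21.5,60\n" * 1000
packed = gzip.compress(raw)

print(len(raw), len(packed))  # the packed form is far smaller
restored = gzip.decompress(packed)
assert restored == raw        # lossless: the original is fully recovered
```

gzip is lossless, so it suits storage and transmission; wavelet and JPEG compression are lossy and trade some fidelity for much smaller sizes.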
Feature Selection
Feature selection is the process of selecting a subset of the
relevant features and leaving out the irrelevant features in a
dataset, in order to build a model of high accuracy. In other words,
it is a way of selecting the optimal features from the input dataset.
Common techniques of Dimensionality Reduction
a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder
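One of the simplest techniques in the list, the Low Variance Filter (f), can be sketched directly; the feature columns and the 0.01 threshold are invented for illustration.

```python
# Low Variance Filter sketch: drop features whose variance falls below
# a threshold, since a nearly constant column carries little
# information. Feature values and the threshold are invented examples.
import statistics

features = {
    "nearly_constant": [5.0, 5.0, 5.1, 5.0, 5.0],
    "informative": [1.0, 7.0, 3.0, 9.0, 5.0],
}

def low_variance_filter(features, threshold=0.01):
    return {name: col for name, col in features.items()
            if statistics.pvariance(col) >= threshold}

print(list(low_variance_filter(features)))  # -> ['informative']
```

The Missing Value Ratio filter (e) works the same way, except the per-column statistic is the fraction of missing entries rather than the variance.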