
Data Mining & Data Warehousing 15CS651

MODULE 2: DATA MINING

WHAT IS DATA MINING?


• Data Mining is the process of automatically discovering useful information in large data repositories.
• DM techniques
→ can be used to search large DB to find useful patterns that might otherwise remain
unknown
→ provide capabilities to predict the outcome of future observations

Why do we need Data Mining?


• Conventional database systems provide users with query & reporting tools.
• To some extent the query & reporting tools can assist in answering questions like,
where did the largest number of students come from last year?
• But these tools cannot provide any intelligence about why it happened.

Taking an Example of University Database System


• The OLTP system will quickly be able to answer a query like “how many students
are enrolled in the university?”
• The OLAP system, using the data warehouse, will be able to show the trends in students’
enrollments (ex: how many students prefer BCA).
• Data mining will be able to answer where the university should focus its marketing.

DATA MINING AND KNOWLEDGE DISCOVERY


• Data Mining is an integral part of KDD (Knowledge Discovery in Databases).
KDD is the overall process of converting raw data into useful information (Figure: 1.1).

• The input-data is stored in various formats such as flat files, spreadsheets or relational tables.
• Purpose of preprocessing: to transform the raw input-data into an appropriate format
for subsequent analysis.


• The steps involved in data-preprocessing include


→ combine data from multiple sources
→ clean data to remove noise & duplicate observations, and
→ select records & features that are relevant to the DM task at hand
• Data-preprocessing is perhaps the most time-consuming step in the overall knowledge
discovery process.
• “Closing the loop" refers to the process of integrating DM results into decision support
systems.
• Such integration requires a postprocessing step. This step ensures that only valid and
useful results are incorporated into the decision support system.
• An example of postprocessing is visualization.
Visualization can be used to explore data and DM results from a variety of
viewpoints.
• Statistical measures can also be applied during postprocessing to eliminate bogus DM results.

MOTIVATING CHALLENGES
Scalability
• Nowadays, data-sets with sizes of terabytes or even petabytes are becoming common.
• DM algorithms must be scalable in order to handle these massive data sets.
• Scalability may also require the implementation of novel data structures to access
individual records in an efficient manner.
• Scalability can also be improved by developing parallel & distributed algorithms.
High Dimensionality
• Traditional data-analysis techniques can only deal with low-dimensional data.
• Nowadays, data-sets with hundreds or thousands of attributes are becoming common.
• Data-sets with temporal or spatial components also tend to have high dimensionality.
• The computational complexity increases rapidly as the dimensionality increases.
Heterogeneous and Complex Data
• Traditional analysis methods can deal only with attributes of a homogeneous type.
• Recent years have also seen the emergence of more complex data-objects.
• DM techniques for complex objects should take into consideration relationships in the data,
such as
→ temporal & spatial autocorrelation
→ parent-child relationships between the elements in semi-structured text & XML
documents
Data Ownership & Distribution
• Sometimes, the data is geographically distributed among resources belonging to multiple
entities.
• Key challenges include:
1) How to reduce amount of communication needed to perform the distributed
computation
2) How to effectively consolidate the DM results obtained from multiple sources &
3) How to address data-security issues
Non Traditional Analysis
• The traditional statistical approach is based on a hypothesize-and-test paradigm.
In other words, a hypothesis is proposed, an experiment is designed to gather the data,
and then the data is analyzed with respect to the hypothesis.
• Current data analysis tasks often require the generation and evaluation of thousands
of hypotheses, and consequently, the development of some DM techniques has been
motivated by the desire to automate the process of hypothesis generation and evaluation.

THE ORIGIN OF DATA MINING


• Data mining draws upon ideas from
→ sampling, estimation, and hypothesis testing from statistics
→ search algorithms, modeling techniques and learning theories from AI, pattern
recognition and machine learning
• Traditional techniques may be unsuitable due to
→ Enormity of data → High dimensionality of data → Heterogeneous nature
of data
• Data mining has also been quick to adopt ideas from other areas, including
→ Optimization → Evolutionary computing → Signal processing
→ Information theory
• Database systems are needed to provide support for efficient storage, indexing and query
processing.
• Parallel computing and distributed technology address the size and distribution of data in
data mining and help increase performance.

DATA MINING TASKS


• DM tasks are generally divided into 2 major categories.
Predictive Tasks
• The objective is to predict the value of a particular attribute based on the
values of other attributes.
• The attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the
prediction are known as the explanatory or independent variables.
Descriptive Tasks
• The objective is to derive patterns (correlations, trends, clusters,
trajectories and anomalies) that summarize the relationships in data.
• Descriptive DM tasks are often exploratory in nature and frequently require
postprocessing techniques to validate and explain the results.

Four of the Core Data Mining Tasks


1) Predictive Modeling
• This refers to the task of building a model for the target variable as a
function of the explanatory variable.
• The goal is to learn a model that minimizes the error between the
predicted and true values of the target variable.


• There are 2 types of predictive modeling tasks:


i) Classification: used for discrete target variables.
Ex: predicting whether a web user will make a purchase at an online bookstore is a
classification task, because the target variable is binary-valued.
ii) Regression: used for continuous target variables.
Ex: forecasting the future price of a stock is a regression task, because
price is a continuous-valued attribute.
2) Cluster Analysis
• This seeks to find groups of closely related observations so that
observations that belong to the same cluster are more similar to each other
than observations that belong to other clusters.
• Clustering has been used
→ to group sets of related customers
→ to find areas of the ocean that have a significant impact on the Earth's
climate

3) Association Analysis
• This is used to discover patterns that describe strongly associated features in the data.
• The goal is to extract the most interesting patterns in an efficient manner.
• Useful applications include
→ finding groups of genes that have related functionality or
→ identifying web pages that are accessed together
• Ex: market basket analysis
We may discover the rule {Diapers} -> {Milk}, which suggests that
customers who buy diapers also tend to buy milk.
4) Anomaly Detection
• This is the task of identifying observations whose characteristics are
significantly different from the rest of the data. Such observations are
known as anomalies or outliers.
• The goal is
→ to discover the real anomalies and
→ to avoid falsely labeling normal objects as anomalous.


• Applications include the detection of fraud, network intrusions, and unusual patterns
of disease.
Example 1.4 (Credit Card Fraud Detection).
• A credit card company records the transactions made by every credit
card holder, along with personal information such as credit limit, age,
annual income, and address.
• Since the number of fraudulent cases is relatively small compared to
the number of legitimate transactions, anomaly detection techniques can
be applied to build a profile of legitimate transactions for the users.
• When a new transaction arrives, it is compared against the profile of the user.
• If the characteristics of the transaction are very different from the
previously created profile, then the transaction is flagged as potentially
fraudulent

WHAT IS A DATA OBJECT?


• A data-set refers to a collection of data-objects and their attributes.
• Other names for a data-object are record, transaction, vector, event, entity, sample or
observation.
• Data-objects are described by a number of attributes such as
→ mass of a physical object or
→ time at which an event occurred.
• Other names for an attribute are dimension, variable, field, feature or characteristic.
WHAT IS AN ATTRIBUTE?
• An attribute is a characteristic of an object that may vary, either
→ from one object to another or
→ from one time to another.
• For example, eye color varies from person to person.
Eye color is a symbolic attribute with a small no. of possible values {brown, black, blue,
green}.

Example 2.2 (Student Information).


• Often, a data-set is a file, in which the objects are records(or rows) in the file
and each field (or column) corresponds to an attribute.
• For example, Table 2.1 shows a data-set that consists of student information.
• Each row corresponds to a student and each column is an attribute that
describes some aspect of a student, such as grade point average(GPA) or
identification number(ID).

PROPERTIES OF ATTRIBUTE VALUES


• The type of an attribute depends on which of the following properties it possesses:
1) Distinctness: =, ≠
2) Order: <, >
3) Addition: +, -
4) Multiplication: *, /
• Nominal attribute: uses only distinctness.
Examples: ID numbers, eye color, pin codes
• Ordinal attribute: uses distinctness & order.
Examples: grades in {SC, FC, FCD}, shirt sizes in {S, M, L, XL}
• Interval attribute: uses distinctness, order & addition.
Examples: calendar dates, temperatures in Celsius or Fahrenheit
• Ratio attribute: uses all 4 properties.
Examples: temperature in Kelvin, length, time, counts

DIFFERENT TYPES OF ATTRIBUTES


DESCRIBING ATTRIBUTES BY THE NUMBER OF VALUES


1) Discrete
• Has only a finite or countably infinite set of values.
• Examples: pin codes, ID Numbers, or the set of words in a collection of documents.
• Often represented as integer variables.
• Binary attributes are a special case of discrete attributes and assume only 2 values.
E.g. true/false, yes/no, male/female or 0/1
2) Continuous
• Has real numbers as attribute values.
• Examples: temperature, height, or weight.
• Often represented as floating-point variables.

ASYMMETRIC ATTRIBUTES
• Binary attributes where only non-zero values are important are called asymmetric
binary attributes.
• Consider a data-set where each object is a student and each attribute records whether
or not a student took a particular course at a university.
• For a specific student, an attribute has a value of 1 if the student took the course
associated with that attribute and a value of 0 otherwise.
• Because students take only a small fraction of all available courses, most of the values
in such a data-set would be 0.
• Therefore, it is more meaningful and more efficient to focus on the non-zero values.
• This type of attribute is particularly important for association analysis.

TYPES OF DATA SETS


1) Record data
→ Transaction (or Market based data)
→ Data matrix
→ Document data or Sparse data matrix
2) Graph data


→ Data with relationship among objects (World Wide Web)


→ Data with objects that are Graphs (Molecular Structures)
3) Ordered data
→ Sequential data (Temporal data)
→ Sequence data
→ Time series data
→ Spatial data

GENERAL CHARACTERISTICS OF DATA SETS


• Following 3 characteristics apply to many data-sets:
1) Dimensionality
• Dimensionality of a data-set is no. of attributes that the objects in the
data-set possess.
• Data with a small number of dimensions tends to be qualitatively
different than moderate or high-dimensional data.
• The difficulties associated with analyzing high-dimensional data are
sometimes referred to as the curse of dimensionality.
• Because of this, an important motivation in preprocessing data is
dimensionality reduction.
2) Sparsity
• For some data-sets, such as those with asymmetric features, most attributes of an object
have values of 0.
• In practical terms, sparsity is an advantage because usually only the
non-zero values need to be stored & manipulated.
• This results in significant savings with respect to computation-time and
storage.
• Some DM algorithms work well only for sparse data.
3) Resolution
• It is frequently possible to obtain data at different levels of resolution, and
often the properties of the data are different at different resolutions.
• Ex: the surface of the earth seems very uneven at a resolution of few meters,
but is relatively smooth at a resolution of tens of kilometers.
• The patterns in the data also depend on the level of resolution.
• If the resolution is too fine, a pattern may not be visible or may be buried in
noise. If the resolution is too coarse, the pattern may disappear.

RECORD DATA
• Data-set is a collection of records.
Each record consists of a fixed set of attributes.
• Every record has the same set of attributes.
• There is no explicit relationship among records or attributes.
• The data is usually stored either in flat files or in relational databases.


TYPES OF RECORD DATA


1) Transaction (Market Basket Data)
• Each transaction consists of a set of items.
• Consider a grocery store.
The set of products purchased by a customer represents a transaction, while the
individual products represent items.
• This type of data is called market basket data because
the items in each transaction are the products in a person's "market basket."
• Data can also be viewed as a set of records whose fields are asymmetric
attributes.
2) Data Matrix
• An m*n matrix, where there are m rows, one for each object, and n columns,
one for each attribute. Such a matrix is called a data-matrix.
• Since a data-matrix consists of numeric attributes, standard matrix
operations can be applied to manipulate the data.
3) Sparse Data Matrix


• This is a special case of a data-matrix.


• The attributes are of the same type and are asymmetric i.e. only
non-zero values are important.

Document Data
• A document can be represented as a ‘vector’,
where each term is an attribute of the vector and the value of each attribute is the number of
times the corresponding term occurs in the document.
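As a small illustration of this representation, the following Python sketch builds term-frequency vectors for two short, made-up documents over a shared vocabulary (the documents and variable names are only for the example):

```python
from collections import Counter

def term_frequency_vector(document, vocabulary):
    """Return the count of each vocabulary term in `document`."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

docs = ["data mining finds patterns in data",
        "a data warehouse stores data for analysis"]

# Shared vocabulary: the union of all terms, in sorted order.
vocabulary = sorted(set(" ".join(docs).lower().split()))

for doc in docs:
    print(term_frequency_vector(doc, vocabulary))
```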

GRAPH BASED DATA


• Sometimes, a graph can be a convenient and powerful representation for data.
• We consider 2 specific cases:

1) Data with Relationships among Objects


• The relationships among objects frequently convey important information.
• In particular, the data-objects are mapped to nodes of the graph,
while relationships among objects are captured by link properties such as direction &
weight.
• For ex, in web, the links to & from each page provide a great deal of
information about the relevance of a web-page to a query, and thus, must also
be taken into consideration.

2) Data with Objects that are Graphs


• If the objects contain sub-objects that have relationships, then such
objects are frequently represented as graphs.
• For ex, the structure of chemical compounds can be
represented by a graph, where nodes are atoms and
links between nodes are chemical bonds.


ORDERED DATA
Sequential Data (Temporal Data)
• This can be thought of as an extension of record-data, where each record has a time
associated with it.
• A time can also be associated with each attribute.
• For example, each record could be the purchase history of a customer,
with a listing of items purchased at different times.
• Using this information, it is possible to find patterns such as "people who buy DVD
players tend to buy DVDs in the period immediately following the purchase."
Sequence Data
• This consists of a data-set that is a sequence of individual entities, such
as a sequence of words or letters.
• This is quite similar to sequential data, except that there are no time stamps;
instead, there are positions in an ordered sequence.
• For example, the genetic information of plants and animals can be
represented in the form of sequences of nucleotides that are known as
genes.
Time Series Data
• This is a special type of sequential data in which a series of measurements
are taken over time.
• For example, a financial data-set might contain objects that are time
series of the daily prices of various stocks.
• An important aspect of temporal-data is temporal-autocorrelation i.e.
if two measurements are close in time, then the values of those
measurements are often very similar.
Spatial Data
• Some objects have spatial attributes, such as positions or areas.
• An example is weather-data (temperature, pressure) that is collected for
a variety of geographical locations.
• An important aspect of spatial-data is spatial-autocorrelation i.e.
objects that are physically close tend to be similar in other ways as well.


Data Quality
Preventing data quality problems is often not possible. Hence, data mining focuses on:

1. Detection and correction of data quality problems; this step is called data cleaning.
2. Use of algorithms that can tolerate poor data quality.

Measurement and Data Collection Issues:


Examples of data quality problems:
1. Measurement and Data Collection Errors
2. Noise and Artifacts
3. Precision, Bias, Accuracy
4. Outliers
5. Missing Values
6. Inconsistent Values
7. Duplicate Data
Measurement and Data Collection Errors

Measurement Error: refers to any problem resulting from the measurement process. A common
problem is that the value recorded differs from the true value to some extent.

Data Collection Error: refers to errors such as omitting data objects or attribute values, or
inappropriately including a data object.

Noise and Artifacts


 Noise: It is the random component of a measurement error. It may involve the distortion
of a value or the addition of spurious objects.

[Figure: (a) Time series, (b) Time series with noise]


 Elimination of noise is difficult. Therefore Data Mining focuses on devising robust
algorithms that produce acceptable results even when noise is present.
 Data Errors may be the result of a more deterministic phenomenon, such as a streak in the
same place on a set of photographs. Such deterministic distortions of data are often
referred to as artifacts.


Precision, Bias, Accuracy

Precision: The closeness of repeated measurements (of the same quantity) to one another.
Bias: A systematic variation of measurements from the quantity being measured.
Accuracy: Closeness of measurements to the true value of the quantity being measured.
Outliers:

Outliers are either,

1. Data objects that have characteristics that are different from most of the other data
objects in the data set, or
2. Values of an attribute that are unusual with respect to the typical values for that attribute.

Missing Values

Reasons for missing values:


- Information was not collected
Eg. Some people decline to give their age or weight
- Some attributes are not applicable for all objects
Eg. Annual income is not applicable to children.

Several Strategies to handle missing values are:

1. Eliminate Data objects or attributes


- Eliminate objects with missing values
- A related strategy is to eliminate attributes that have missing values
2. Estimate missing values:
- Sometimes, missing data can be reliably estimated.
- Eg. in a time series that changes smoothly, a missing value can be estimated from the
values at nearby times.
- If the attribute is continuous, then the average attribute value of the nearest neighbors is used.
- If the attribute is categorical, then the most commonly occurring attribute value can be
taken.
3. Ignore the missing value during Analysis
- Many Data Mining approaches can be modified.
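A minimal sketch of the estimation strategy above, on a small made-up table with one continuous and one categorical attribute: a missing continuous value is replaced by the mean of the observed values (the nearest-neighbour variant mentioned above would restrict this average to similar objects), and a missing categorical value by the most common observed value.

```python
from statistics import mean, mode

# Hypothetical records: (age, favourite_course); None marks a missing value.
records = [(21, "DBMS"), (None, "DBMS"), (23, "OS"), (22, None), (24, "DBMS")]

ages = [age for age, _ in records if age is not None]
courses = [course for _, course in records if course is not None]

age_estimate = mean(ages)        # estimate for a missing continuous value
course_estimate = mode(courses)  # estimate for a missing categorical value

filled = [(age if age is not None else age_estimate,
           course if course is not None else course_estimate)
          for age, course in records]
print(filled)
```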

Inconsistent Values

 Data can contain inconsistent values.


Eg. in an address field, the specified zip code may not belong to the specified city.
 Regardless of the cause of inconsistent values, it is important to detect and correct (if
possible) such problems.
 Some types of inconsistencies are easy to detect.
Eg. a person’s height should not be negative.
In other cases, it is necessary to consult an external source of information.


Duplicate Data

 A dataset may include data objects that are duplicates, or almost duplicates of one
another
 To detect and eliminate such duplicates, two main issues must be addressed.
1. If there are two objects that actually represent a single object, then the values of
corresponding attributes may differ and these inconsistent values must be resolved.
2. Care needs to be taken to avoid accidentally combining data objects that are similar, but not
duplicates, such as two distinct people with identical names.
 The term deduplication is used to refer to the process of dealing with these issues.

Issues related to Applications:

1. Timeliness:
Some data starts to age as soon as it has been collected
Eg. Snapshot of some ongoing process represents reality for only a limited time.
2. Relevance:
The available data must contain the information necessary for the application.
3. Knowledge about the data:
 Ideally, the data sets are accompanied by documentation that describes different aspects
of data. The quality of this documentation can either aid or hinder the subsequent
analysis.
 If the documentation is poor, then our analysis of the data may be faulty.

DATA PREPROCESSING
• Data preprocessing is a broad area and consists of a number of different strategies and
techniques that are interrelated in complex ways.
• Different data preprocessing techniques are:
1. Aggregation 2. Sampling 3. Dimensionality reduction 4. Feature subset selection
5. Feature creation 6. Discretization and binarization 7. Variable transformation

AGGREGATION
• This refers to combining 2 or more attributes (or objects) into a single attribute (or object).
For example, merging daily sales figures to obtain monthly sales figures

Consider a dataset consisting of transactions (data objects) recording the daily sales of
products in various store locations for different days over the course of a year.

Transaction ID   Item    Store location   Date     Price
101              Watch   Chicago          9/1/19   $29.99
102              Shoes   Ottawa           9/1/19   $31.44
103              Shirt   New York         9/1/19   $30.00

 One way to aggregate transactions for this data set is to replace all the transactions of
a single store (on a given day) with a single store-wide transaction.


 Issue: How is an aggregate transaction created? (A small sketch follows the list below.)


1. Quantitative attributes (like "Price") are aggregated by taking a sum or an average.
2. Qualitative attributes (like "Item") can either be omitted or summarized as the
set of all items that were sold at that location.
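A minimal sketch of such an aggregation in Python, using the three made-up transactions from the table above: the quantitative attribute (price) is summed per (store location, date), and the qualitative attribute (item) is summarized as the set of items sold.

```python
from collections import defaultdict

# (transaction_id, item, store_location, date, price) - from the example table above.
transactions = [
    (101, "watch", "Chicago",  "9/1/19", 29.99),
    (102, "shoes", "Ottawa",   "9/1/19", 31.44),
    (103, "shirt", "New York", "9/1/19", 30.00),
]

totals = defaultdict(float)   # (store, date) -> total sales
items = defaultdict(set)      # (store, date) -> set of items sold

for _, item, store, date, price in transactions:
    totals[(store, date)] += price
    items[(store, date)].add(item)

for key in totals:
    print(key, round(totals[key], 2), items[key])
```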

• Motivations for aggregation:


1) Data reduction: The smaller data-sets require
→ less memory → less processing time.
Because of aggregation, more expensive algorithms can be used.
2) Aggregation can act as a change of scale by providing a high-level view of
the data instead of a low-level view. E.g. Cities aggregated into districts,
states, countries, etc
3) The behavior of groups of objects is often more stable than that of individual objects.
• Disadvantage: The potential loss of interesting details.

SAMPLING
• This is a method used for selecting a subset of the data-objects to be analyzed.
• This is often used for both
→ preliminary investigation of the data → final data analysis
• Q: Why sampling?
Ans: Obtaining & processing the entire set of “data of interest” is too expensive or time
consuming.
• Sampling can reduce the data-size to the point where a better but more expensive algorithm can be
used.
• Key principle for effective sampling: Using a sample will work almost as well as
using entire data-set, if the sample is representative.
Sampling Methods
1) Simple Random Sampling
• There is an equal probability of selecting any particular object.
• There are 2 variations on random sampling:
i) Sampling without Replacement
• As each object is selected, it is removed from the population.
ii) Sampling with Replacement
• Objects are not removed from the population as they are selected for the
sample.
• The same object can be picked up more than once.
• When the population consists of different types(or number) of objects, simple random
sampling can fail to adequately represent those types of objects that are less frequent.
2) Stratified Sampling
• This starts with pre-specified groups of objects.
• In the simplest version, equal numbers of objects are drawn from each group even
though the groups are of different sizes.
• In another variation, the number of objects drawn from each group is
proportional to the size of that group.

3) Progressive Sampling
• If proper sample-size is difficult to determine then progressive sampling can be used.
• This method starts with a small sample, and then increases the sample-size until a
sample of sufficient size has been obtained.
• This method requires a way to evaluate the sample to judge if it is large enough.
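The sketch below illustrates these ideas with Python's standard random module on a made-up population: simple random sampling without and with replacement, and a proportional stratified sample in which the number drawn from each group is proportional to the group's size.

```python
import random

random.seed(1)
population = list(range(100))                 # hypothetical data objects
groups = {"FCD": list(range(0, 60)),          # hypothetical pre-specified strata
          "FC":  list(range(60, 90)),
          "SC":  list(range(90, 100))}

without_replacement = random.sample(population, 10)   # no object repeats
with_replacement = random.choices(population, k=10)   # objects may repeat

# Proportional stratified sample of total size 10.
stratified = []
for name, members in groups.items():
    k = round(10 * len(members) / len(population))
    stratified.extend(random.sample(members, k))

print(without_replacement, with_replacement, stratified, sep="\n")
```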


DIMENSIONALITY REDUCTION
• Key benefit: many DM algorithms work better if the dimensionality is lower.
Purpose
• May help to eliminate irrelevant features or reduce noise.
• Can lead to a more understandable model (which can be easily visualized).
• Reduce amount of time and memory required by DM algorithms.
• Avoid curse of dimensionality.
The Curse of Dimensionality
• Data-analysis becomes significantly harder as the dimensionality of the data increases.
• For classification, this can mean that there are not enough data-objects to allow
the creation of a model that reliably assigns a class to all possible objects.
• For clustering, the definitions of density and the distance between points (which are
critical for clustering) become less meaningful.
• As a result, we get
→ reduced classification accuracy &
→ poor quality clusters.
Techniques
– Principal Component Analysis (PCA) – finds a projection that captures the
largest amount of variation in the data.
– Singular Value Decomposition (SVD) – a linear algebra technique for
continuous attributes that finds new attributes which are linear combinations of the original ones.
– Others: supervised and non-linear techniques

FEATURE SUBSET SELECTION


• Another way to reduce the dimensionality is to use only a subset of the features.
• Although it might seem that such an approach would lose information, this is not the case if
redundant and irrelevant features are present.
1) Redundant features duplicate much or all of the information contained in one
or more other attributes.
For example: purchase price of a product and the amount of sales tax paid.
2) Irrelevant features contain almost no useful information for the DM task at hand.
For example: students' ID numbers are irrelevant to the task of predicting
students' grade point averages.
Techniques for Feature Selection
1) Embedded approaches: Feature selection occurs naturally as part of
DM algorithm. Specifically, during the operation of the DM algorithm,
the algorithm itself decides which attributes to use and which to ignore.
2) Filter approaches: Features are selected before the DM algorithm is run.
3) Wrapper approaches: Use DM algorithm as a black box to find best subset of
attributes.
An Architecture for Feature Subset Selection
• The feature selection process is viewed as consisting of 4 parts:
1) A measure of evaluating a subset,


2) A search strategy that controls the generation of a new subset of features,


3) A stopping criterion and
4) A validation procedure.

FEATURE CREATION

Create new attributes that can capture the important information in a data set much more
efficiently than the original attributes. Furthermore, the number of new attributes can be smaller
than the number of original attributes .

Three general methodologies:


– Feature extraction (Creation of new set of features from the original raw data is
known as feature extraction. It is highly domain specific)
 Example: extracting edges from images
– Feature construction
– Sometimes the features in the original datasets have the necessary information,
but it is not in a form suitable for data mining algorithm.
– In this situation, one or more new features constructed out of the original features
can be more useful than the original features.
 Example: dividing mass by volume to get density
– Mapping data to new space
 Example: Fourier and wavelet analysis
DISCRETIZATION AND BINARIZATION
• Some DM algorithms (especially classification algorithms) require that the data be in the
form of categorical attributes.
• Algorithms that find association patterns require that the data be in the form of binary
attributes.
• Transforming a continuous attribute into a categorical attribute is called discretization,
and transforming either continuous or discrete attributes into binary attributes is called
binarization.

Binarization
• A simple technique to binarize a categorical attribute is the following: If
there are m categorical values, then uniquely assign each original value to an
integer in interval [0,m-1].
• Next, convert each of these m integers to a binary number.


• Since n = ⌈log2(m)⌉ binary digits are required to represent these integers, represent these
binary numbers using n binary attributes.

• Such a transformation can cause two complications:

1. It can create unintended relationships among the transformed attributes.
Eg. attributes x2 and x3 are correlated, because information about the “good” value is
encoded using both attributes.

2. It produces symmetric binary attributes, but association analysis requires asymmetric
binary attributes, where only the presence of the attribute (value = 1) is important. The
solution is to introduce one asymmetric binary attribute for each categorical value.
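A small sketch of both binarization schemes for a categorical attribute with m = 3 values (the attribute values and function names are made up for the example): the first maps each value to an integer and then to n = ⌈log2 m⌉ binary digits, which can introduce the unintended correlations noted above; the second introduces one asymmetric binary attribute per categorical value (one-hot encoding).

```python
from math import ceil, log2

values = ["awful", "ok", "good"]        # hypothetical categorical values (m = 3)
n_bits = ceil(log2(len(values)))        # n = ceil(log2(m)) binary digits

def to_binary_digits(value):
    """Encode a categorical value as n binary attributes (may correlate them)."""
    index = values.index(value)
    return [(index >> bit) & 1 for bit in reversed(range(n_bits))]

def to_one_hot(value):
    """One asymmetric binary attribute per categorical value."""
    return [1 if v == value else 0 for v in values]

for v in values:
    print(v, to_binary_digits(v), to_one_hot(v))
```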

Discretization of continuous attributes

• Discretization is typically applied to attributes that are used in classification or


association analysis.
• In general, the best discretization depends on
→ the algorithm being used, as well as
→ the other attributes being considered
• Transformation of a continuous attribute to a categorical attribute involves two
subtasks:
→ deciding how many categories to have:
after the values of the continuous attribute are sorted, they are divided into n
intervals by specifying n-1 split points.
→ determining how to map the values of the continuous attribute to these
categories:
all the values in one interval are mapped to the same categorical value.

Conclusion:
The problem of discretization is one of deciding how many split points to choose and where to place
them. The result can be represented either as a set of intervals {(x0, x1], (x1, x2], ..., (xn-1, xn]},
where x0 and xn may be -∞ and +∞ respectively, or equivalently as a series of inequalities
x0 < x <= x1, ..., xn-1 < x <= xn.

Discretization may be either

 Unsupervised (class information is not used)
 Supervised (class information is used)

Unsupervised Discretization:
 Equal width approach – divides the range of the attribute into a user-specified
number of intervals, each having the same width.
Disadvantage: badly affected by outliers
 Equal frequency (depth) approach – tries to put the same number of objects into each interval
 Clustering methods (e.g. K-means)

[Figure: the original data and the results of equal-width, equal-frequency and K-means discretization]
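A minimal NumPy sketch of equal-width and equal-frequency discretization of a made-up continuous attribute into 4 intervals: equal width uses evenly spaced bin edges over the range, while equal frequency places the edges at quantiles so each interval holds roughly the same number of values.

```python
import numpy as np

np.random.seed(0)
x = np.random.exponential(scale=10.0, size=20)   # hypothetical continuous attribute
n_bins = 4

# Equal width: split points evenly spaced between min and max.
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
equal_width = np.digitize(x, width_edges[1:-1])

# Equal frequency: split points at the quantiles of the data.
freq_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
equal_freq = np.digitize(x, freq_edges[1:-1])

print(np.bincount(equal_width, minlength=n_bins))  # counts per interval (uneven)
print(np.bincount(equal_freq, minlength=n_bins))   # counts per interval (~equal)
```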

Supervised Discretization:
The split points are placed in a way that maximizes the purity of the intervals,
e.g. using an entropy-based approach.
Definition of Entropy:
The entropy of the ith interval is
          k
   ei = - Σ pij log2 pij
         j=1


Where,
k = the number of different class labels
pij = mij / mi is the probability of class j in the ith interval
mi = the number of values in the ith interval of a partition
mij = the number of values of class j in interval i

The total entropy, e, of the partition is the weighted average of the individual interval entropies:
        n
   e = Σ wi ei
       i=1
where wi = mi / m is the fraction of values in the ith interval, m is the total number of values,
and n is the number of intervals.
 The entropy of an interval is a measure of the purity of the interval.
 If an interval contains only values of one class (is perfectly pure), then the entropy is
0 and it contributes nothing to the overall entropy.
 If the classes of values in an interval occur equally often (the interval is as impure as
possible), then the entropy is maximum.
 A simple approach for partitioning a continuous attribute starts by bisecting the
initial values so that the resulting two intervals give minimum entropy.
 The splitting process is then repeated with another interval, typically choosing the
interval with the worst (highest) entropy, until a user-specified number of intervals is
reached, or a stopping criterion is satisfied.
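The sketch below computes the interval entropies and the weighted total entropy for a made-up partition, following the formulas above; each interval is represented simply by the list of class labels of the values that fall into it.

```python
from math import log2

def interval_entropy(labels):
    """e_i = -sum_j p_ij * log2(p_ij) for one interval's class labels."""
    m_i = len(labels)
    entropy = 0.0
    for cls in set(labels):
        p = labels.count(cls) / m_i
        entropy -= p * log2(p)
    return entropy

def total_entropy(intervals):
    """Weighted average of interval entropies, with weights w_i = m_i / m."""
    m = sum(len(labels) for labels in intervals)
    return sum(len(labels) / m * interval_entropy(labels) for labels in intervals)

# Hypothetical 3-interval partition of a continuous attribute (class labels only).
partition = [["A", "A", "A"], ["A", "B", "B", "B"], ["B", "B"]]
print([round(interval_entropy(i), 3) for i in partition])  # pure intervals give 0
print(round(total_entropy(partition), 3))
```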

VARIABLE TRANSFORMATION

It is a function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values
• This refers to a transformation that is applied to all the values of a variable.
• Ex: converting a floating point value to an absolute value.
• Two types are:
1) Simple Functions
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of transformations include e^x, 1/x, log(x) and
sin(x).
2) Normalization (or Standardization)
• The goal is to make an entire set of values have a particular property.
• A traditional example is that of "standardizing a variable" in statistics.
• If x̄ is the mean of the attribute values and sx is their standard
deviation, then the transformation x' = (x - x̄)/sx creates a new variable
that has a mean of 0 and a standard deviation of 1.
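A short sketch of standardizing a made-up variable so that the transformed values have mean 0 and standard deviation 1, matching the formula above (the sample standard deviation is used here):

```python
from statistics import mean, stdev

x = [4.0, 8.0, 6.0, 5.0, 7.0]           # hypothetical attribute values

x_bar, s_x = mean(x), stdev(x)           # mean and (sample) standard deviation
x_std = [(v - x_bar) / s_x for v in x]   # x' = (x - x_bar) / s_x

print(round(mean(x_std), 10))            # ~0
print(round(stdev(x_std), 10))           # ~1
```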

MEASURE OF SIMILARITY AND DISSIMILARITY

• Similarity & dissimilarity are important because they are used by a no. of DM techniques
such as clustering, classification & anomaly detection.
• Proximity is used to refer to either similarity or dissimilarity.
• The similarity between 2 objects is a numerical measure of degree to which the 2 objects are
alike.
• Consequently, similarities are higher for pairs of objects that are more alike.
• Similarities are usually non-negative and are often between 0(no similarity) and
1(complete similarity).
• The dissimilarity between 2 objects is a numerical measure of the degree to which the
2 objects are different.
• Dissimilarities are lower for more similar pairs of objects.
• The term distance is used as a synonym for dissimilarity.

• Dissimilarities sometimes fall in the interval [0,1], but it is also common for them to range
from 0 to infinity.

DISSIMILARITIES BETWEEN DATA OBJECTS


Distances
• The Euclidean distance, d, between 2 points, x and y, in one-, two-, three- or
higher-dimensional space, is given by the following familiar formula:

   d(x, y) = sqrt( Σ (xk - yk)^2 )       (sum over k = 1, ..., n)     ... (2.1)

where n is the number of dimensions, and xk and yk are respectively the kth attributes of x and y.

• The Euclidean distance measure given in equation 2.1 is generalized by the Minkowski
distance metric, given by

   d(x, y) = ( Σ |xk - yk|^r )^(1/r)     (sum over k = 1, ..., n)

where r is a parameter.

• The following are the three most common examples of Minkowski distances:
r = 1. City block (Manhattan, L1 norm) distance.
A common example is the Hamming distance, which is the number of bits
that are different between two objects that have only binary attributes, i.e.
between two binary vectors.
r = 2. Euclidean distance (L2 norm).
r = ∞. Supremum (L∞ or Lmax norm) distance. This is the maximum difference
between any attribute of the objects. It is defined by

   d(x, y) = max over k of |xk - yk|

• If d(x,y) is the distance between two points, x and y, then the following properties hold
1) Positivity
d(x,y) >= 0 for all x and y, and d(x,y) = 0 only if x = y.
2) Symmetry
d(x,y)=d(y,x) for all x and y.
3) Triangle inequality
d(x,z)<=d(x,y)+d(y,z) for all points x,y and z.
• Measures that satisfy all three properties are known as metrics.
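A small Python sketch of the Minkowski distance for the three common cases above, evaluated on two made-up points: r = 1 gives the city block distance, r = 2 the Euclidean distance, and r = ∞ the supremum distance.

```python
from math import inf

def minkowski(x, y, r):
    """Minkowski distance between points x and y with parameter r."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == inf:                          # supremum (L-max) distance
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

x, y = (0, 2, 3), (4, 0, 3)               # hypothetical points
print(minkowski(x, y, 1))                 # city block: 6
print(minkowski(x, y, 2))                 # Euclidean: ~4.47
print(minkowski(x, y, inf))               # supremum: 4
```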

SIMILARITIES BETWEEN DATA OBJECTS


• For similarities, the triangle inequality typically does not hold, but symmetry and positivity
typically do.
• If s(x,y) is the similarity between points x and y, then the typical properties of similarities are
the following
1) s(x,y)=1 only if x=y
2) s(x,y)=s(y,x) for all x and y. (Symmetry)
• For ex, cosine and Jaccard similarity.


EXAMPLES OF PROXIMITY MEASURES


• Similarity measures between objects that contain only binary attributes are called
similarity coefficients.
• Typically, they have values between 0 and 1.
• A value of 1 indicates that the two objects are completely similar,
while a value of 0 indicates that the objects are not at all similar.
• Let x and y be 2 objects that consist of n binary attributes.
• Comparison of 2 objects, ie, 2 binary vectors, leads to the following four quantities
(frequencies):
f00=the number of attributes where x is 0 and y is 0.
f01=the number of attributes where x is 0 and y is 1.
f10=the number of attributes where x is 1 and y is 0.
f11=the number of attributes where x is 1 and y is 1.

SIMPLE MATCHING COEFFICIENT


• One commonly used similarity coefficient is the SMC, which is defined as

   SMC = (number of matching attribute values) / (number of attributes)
       = (f11 + f00) / (f01 + f10 + f11 + f00)

• This measure counts both presences and absences equally.

JACCARD COEFFICIENT
• The Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary
attributes.
• The Jaccard coefficient, J, is given by the following equation:

   J = (number of matching presences) / (number of attributes not involved in 00 matches)
     = f11 / (f01 + f10 + f11)
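A sketch computing f00, f01, f10, f11 and then SMC and the Jaccard coefficient; the two binary vectors used are the ones from question 20 of the question bank at the end of these notes.

```python
def binary_similarities(x, y):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(binary_similarities(x, y))   # SMC = 0.7, Jaccard = 0.0
```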

COSINE SIMILARITY
• Documents are often represented as vectors, where each attribute represents the frequency
with which a particular term (or word) occurs in the document.
• This is more complicated, since certain common words are ignored and various
processing techniques are used to account for
→ different forms of the same word
→ differing document lengths and
→ different word frequencies.
• The cosine similarity is one of the most common measure of document similarity.
• If x and y are two document vectors, then

   cos(x, y) = (x · y) / (||x|| ||y||)

where · indicates the vector dot product and ||x|| is the length (norm) of vector x.
• As indicated by Figure 2.16, cosine similarity really is a measure of the angle between x and y.
• Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same
except for magnitude (length).
• If the cosine similarity is 0, then the angle between x and y is 90° and they do not share any terms.
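A minimal sketch of the cosine similarity between two made-up term-frequency vectors, following the formula above:

```python
from math import sqrt

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors for two documents.
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine_similarity(d1, d2), 3))   # ~0.315
```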

EXTENDED JACCARD COEFFICIENT (TANIMOTO COEFFICIENT)


• This can be used for document data.
• This coefficient, EJ, is defined by the following equation:

   EJ(x, y) = (x · y) / (||x||^2 + ||y||^2 - x · y)

Correlation
Correlation between two data objects that have binary or continuous variables is a measure of
the linear relationship between the attributes of the objects.
More precisely, Pearson's correlation coefficient between two data objects x and y is defined
by,
   corr(x, y) = covariance(x, y) / ( std_deviation(x) * std_deviation(y) ) = sxy / (sx sy)

where,
   sxy = 1/(n-1) Σ (xk - x̄)(yk - ȳ)        (sum over k = 1, ..., n)
   sx  = sqrt( 1/(n-1) Σ (xk - x̄)^2 )
   sy  = sqrt( 1/(n-1) Σ (yk - ȳ)^2 )

and the mean of x is x̄ = ( Σ xk ) / n (similarly for ȳ).

Ex. x = (-3, 6, 0, 3, -6)
    y = (1, -2, 0, -1, 2)

Ans:

sxy = -15/2
sx = sqrt(45/2)
sy = sqrt(5/2)
corr(x, y) = -1
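The worked example above can be checked with the short sketch below, which computes sxy, sx, sy and the Pearson correlation directly from the definitions:

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient, computed from the definitions above."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

x = (-3, 6, 0, 3, -6)
y = (1, -2, 0, -1, 2)
print(round(pearson(x, y), 6))   # -1.0 (y = -x/3, a perfect negative linear relationship)
```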
ISSUES IN PROXIMITY CALCULATION
1) How to handle the case in which attributes have different scales and/or are correlated.
2) How to calculate proximity between objects that are composed of different types of
attributes e.g. quantitative and qualitative.
3) How to handle proximity calculation when attributes have different weights.

COMBINING SIMILARITIES FOR HETEROGENEOUS ATTRIBUTES


• A general approach is needed when the attributes are of different types.
• One straightforward approach is to compute the similarity between each attribute
separately and then combine these similarities using a method that results in a
similarity between 0 and 1.
• Typically, the overall similarity is defined as the average of all the individual attribute
similarities.


DATA MINING APPLICATIONS


Prediction & Description
• Data mining may be used to answer questions like
→ "would this customer buy a product" or
→ "is this customer likely to leave?”
• DM techniques may also be used for sales forecasting and analysis.
Relationship Marketing
• Customers have a lifetime value, not just the value of a single sale.
• Data mining can help
→ in analyzing customer profiles and improving direct marketing plans
→ in identifying critical issues that determine client loyalty and
→ in improving customer retention
Customer Profiling
• This is the process of using the relevant and available information
→ to describe the characteristics of a group of customers
→ to identify their discriminators from ordinary consumers and
→ to identify drivers for their purchasing decisions
• This can help an enterprise identify its most valuable customers
so that the enterprise may differentiate their needs and values.
Outliers Identification & Detecting Fraud
• For this, examples include:
→ identifying unusual expense claims by staff
→ identifying anomalies in expenditure between similar units of an
enterprise
→ identifying fraud involving credit cards
Customer Segmentation
• This is a way to assess & view individuals in market based on their status &
needs.
• Data mining may be used
→ to understand & predict customer behavior and profitability
→ to develop new products & services and
→ to effectively market new offerings
Web site Design & Promotion
• Web mining may be used to discover how users navigate a web site and the
results can help in improving the site design.
• Web mining may also be used in cross-selling by suggesting to a web
customer, items that he may be interested in.

Question Bank
1. What is data mining? Explain Data Mining and Knowledge Discovery? (10)
2. What are different challenges that motivated the development of DM? (10)
3. Explain Origins of data mining (5)
4. Discuss the tasks of data mining with suitable examples. (10)
5. Explain Anomaly Detection .Give an Example? (5)
6. Explain Descriptive tasks in detail? (10)
7. Explain Predictive tasks in detail by example? (10)
8. Explain Data set. Give an Example? (5)
9. Explain 4 types of attributes by giving appropriate example? (10)


10. With example, explain


i) Continuous ii) Discrete iii) Asymmetric Attributes. (10)
11. Explain general characteristics of data sets. (6)
12. Explain record data & its types. (10)
13. Explain graph based data. (6)
14. Explain ordered data & its types. (10)
15. Explain shortly any five data pre-processing approaches. (10)
16. What is sampling? Explain simple random sampling vs.
stratified sampling vs. progressive sampling. (10)
17. Write a short note on the following: (15)
i) Dimensionality reduction ii) Variable transformation iii) Feature selection
18. Distinguish between
i) SMC & Jaccard coefficient ii) Discretization & binarization (6)
19. Explain various distances used for measuring dissimilarity between objects. (6)
20. Consider the following 2 binary vectors
X=(1,0,0,0,0,0,0,0,0,0)
Y=(0,0,0,0,0,0,1,0,0,1)
Find i)hamming distance ii)SMC iii)Jaccard coefficient (4)
21. List out issues in proximity calculation. (4)
22. List any 5 applications of data mining. (8)

