Data Mining Written Notes 1
CLASS NOTES
Unit-I
Introduction to Data Mining
1.1. Introduction to Data Mining:
Data mining refers to extracting or “mining” knowledge from large amounts of data.
Data Mining is defined as the procedure of extracting information from huge sets of data.
Data mining is often used as a synonym for another popularly used term, Knowledge Discovery from
Data, or KDD.
1.1.2 What is Data Mining?
Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data. The information or
knowledge so extracted can be used for any of the following applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Apart from these, data mining is also applied in areas such as sports, astrology, and Internet Web surfing (Surf-Aid).
Financial Planning and Asset Evaluation − It involves cash flow analysis and
prediction, and contingent claim analysis to evaluate assets.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to detect
fraud. In the case of fraudulent telephone calls, it helps to find the destination of the call, the duration of the call,
the time of the day or week, etc. It also analyzes the patterns that deviate from expected norms.
Data Mining − On What Kinds of Data?
Relational Databases
Data Warehouses
Transactional Databases
Advanced Data and Information Systems and Advanced Applications
Object-Relational Databases
Temporal Databases, Sequence Databases, and Time-Series Databases
Spatial Databases and Spatiotemporal Databases
1.2. KDD (Knowledge Discovery from Data):
Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
1.3. Challenges:
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here we discuss the major issues regarding mining methodology and user interaction, performance, and diverse data types.
1.4. Data Mining Functionalities:
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, they can be classified into two categories:
Descriptive
Classification and Prediction
Descriptive mining tasks characterize the general properties of the data in the
database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Describe data mining functionalities, and the kinds of patterns they can discover (or) Define
each of the following data mining functionalities: characterization, discrimination,
association and correlation analysis, classification, prediction, clustering, and evolution
analysis. Give examples of each data mining functionality, using a real-life database that you
are familiar with.
Data can be associated with classes or concepts. It describes a given set of data in a concise
and summarative manner, presenting interesting general properties of the data. These
descriptions can be derived via:
(1) data characterization, by summarizing the data of the class under study (often called the target class);
(2) data discrimination, by comparison of the target class with one or a set of comparative classes; or
(3) both data characterization and discrimination.
Data characterization
It is a summarization of the general characteristics or features of a target class of data.
Example: A data mining system should be able to produce a description summarizing the
characteristics of a student who has obtained more than 75% in every semester; the result
could be a general profile of the student.
Data Discrimination is a comparison of the general features of target class data objects with
the general features of objects from one or a set of contrasting classes.
Example
The general features of students with high GPA’s may be compared with the general features
of students with low GPA’s. The resulting description could be a general comparative profile
of the students such as 75% of the students with high GPA’s are fourth-year computing
science students while 65% of the students with low GPA’s are not.
The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as generalized relations,
or in rule form called characteristic rules.
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
DECISION TREE:
Prediction:
Prediction − It is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction. Prediction can also be used
for identification of distribution trends based on available data.
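As a small illustration (not part of the notes), the sketch below fits a least-squares regression line and uses it to predict a continuous value; the income and spending figures are made-up assumptions:
import numpy as np
# Made-up data: annual income (in thousands) vs. spending on computer
# equipment (in dollars); used only to illustrate numeric prediction.
income = np.array([30.0, 45.0, 60.0, 75.0, 90.0])
spent = np.array([400.0, 650.0, 900.0, 1150.0, 1400.0])
# Fit a least-squares regression line: spent ~ a * income + b
a, b = np.polyfit(income, spent, deg=1)
# Predict the continuous value for a new customer with income of 55 (thousand)
print(round(a * 55.0 + b, 2))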
1.4.4 Clustering analysis
Clustering analyzes data objects without consulting a known class label. The objects are
clustered or grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of
objects.
Clustering can also facilitate taxonomy formation, that is, the organization of observations
into a hierarchy of classes that group similar events together.
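As a simple illustration (not from the notes), the sketch below groups 2-D points with scikit-learn's KMeans so that intraclass similarity is high and interclass similarity is low; the sample points and the choice of two clusters are assumptions:
from sklearn.cluster import KMeans
# Made-up 2-D data points forming two visually separated groups.
points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
          [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]]
# Group the objects so that points within a cluster are close to each other
# and far from the points of other clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster label assigned to each point
print(km.cluster_centers_)  # the two cluster centroids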
Outlier analysis
Example
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges
incurred by the same account. Outlier values may also be detected with respect to the
location and type of purchase, or the purchase frequency.
1.5. Data Preprocessing:
Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
Data Transformation: Data is normalized, aggregated and generalized.
Data cleaning as a process begins with discrepancy detection; commercial tools are available that can aid in this step.
Noisy data:
Noise is a random error or variance in a measured variable. Data smoothing techniques are used for
removing such noisy data.
In this binning technique:
The data are first sorted and partitioned into bins.
Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
Smoothing by bin means: Each value in the bin is replaced by the mean value of the bin.
Smoothing by bin medians: Each value in the bin is replaced by the bin median.
Smoothing by boundaries: The min and max values of a bin are identified as the bin
boundaries. Each bin value is replaced by the closest boundary value.
Example: Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins (depth of 4, since each bin contains four values):
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
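A minimal Python sketch of equi-depth binning with smoothing by bin means, reproducing the example above; the function name and the rounding convention are illustrative assumptions:
def smooth_by_bin_means(values, depth):
    # Sort values, split into equi-depth bins, replace each value by its bin mean
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), depth):
        bin_vals = data[start:start + depth]
        mean = round(sum(bin_vals) / len(bin_vals))
        smoothed.extend([mean] * len(bin_vals))
    return smoothed
# Price data from the example above, bins of depth 4
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 4))  # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]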
Data integration, which combines data from multiple sources into a coherent data store, as in data
warehousing
Correlation analysis: For numerical attributes, the correlation between two attributes A and B can be measured by the correlation coefficient rA,B, where −1 ≤ rA,B ≤ +1.
For categorical (discrete) attributes, a correlation relationship between A and B can be discovered by a chi-square test:
χ² = Σi Σj (oij − eij)² / eij
where oij is the observed frequency and eij is the expected frequency of (Ai, Bj).
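A small, illustrative computation of the chi-square statistic from a 2x2 contingency table; the observed counts below are made up for the example:
# Observed joint counts oij of (Ai, Bj) for two binary attributes (made-up values)
observed = [[250, 200],
            [50, 1000]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
total = sum(row_totals)
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total   # expected frequency eij
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))   # a large value suggests A and B are strongly correlated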
1.5.2.Data Transformation
The data are transformed or consolidated into forms appropriate for mining. Data transformation can
involve
Smoothing
Aggregation
Generalization
Normalization
Min-max normalization
z-score normalization
Normalization by decimal scaling
Attribute construction
Min-max normalization: performs a linear transformation on the original data. A value v of attribute A is mapped to
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
z-score normalization: a value v of attribute A is normalized using the mean and standard deviation of A:
v' = (v − meanA) / std_devA
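A minimal sketch of min-max and z-score normalization; the helper names and the example income values are illustrative assumptions:
def min_max(values, new_min=0.0, new_max=1.0):
    # Linearly map each value from [min, max] to [new_min, new_max]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
def z_score(values):
    # Normalize each value using the mean and standard deviation of the attribute
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
incomes = [30000, 45000, 60000, 75000, 98000]
print(min_max(incomes))   # values rescaled into [0, 1]
print(z_score(incomes))   # values centered at 0 with unit variance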
Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data
Similarity Measure
Numerical measure of how alike two data objects are.
Often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
Numerical measure of how different two data objects are.
Ranges from 0 (objects are alike) to ∞ (objects are completely different).
Similarity Between Objects:
Euclidean distance:
d(i, j) = sqrt( |xi1 − xj1|² + |xi2 − xj2|² + ... + |xip − xjp|² )
Minkowski distance:
d(i, j) = ( |xi1 − xj1|^q + |xi2 − xj2|^q + ... + |xip − xjp|^q )^(1/q)
Manhattan distance:
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
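A minimal sketch (function name is an assumption) computing these distances between two p-dimensional objects i and j:
def minkowski(x, y, q):
    # Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)
i = [1.0, 3.0, 5.0]
j = [4.0, 1.0, 2.0]
print(minkowski(i, j, 1))  # Manhattan distance: 3 + 2 + 3 = 8
print(minkowski(i, j, 2))  # Euclidean distance: sqrt(9 + 4 + 9), about 4.69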
Unit-II
Association Rules
• Given:
― a set I of all the items;
― a database D of transactions;
― minimum support s;
― minimum confidence c;
• Find:
― all association rules X ⇒ Y with a minimum support s and confidence c.
Problem Definition:
Let I = {i1, i2, ..., in} be a set of items and let D = {t1, t2, ..., tm} be a database of transactions. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅.
The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand-side or LHS)
and the consequent (right-hand-side or RHS) of the rule respectively.
Example database with 5 transactions and 4 items (1 indicates the item is present in the transaction):
transaction ID  milk  bread  butter  beer
1               1     1      0       0
2               0     0      1       0
3               0     0      0       1
4               1     1      1       0
5               0     1      0       0
For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0
in the database, which means that for 100% of the transactions containing
butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk
is bought as well). Confidence can be interpreted as an estimate of the probability P(Y | X),
the probability of finding the RHS of the rule in transactions under the condition that these
transactions also contain the LHS.
The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),
or the ratio of the observed support to that expected if X and Y were independent. The
conviction of a rule is defined as conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)),
and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is
to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent, divided by the observed frequency of incorrect predictions.
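A small illustrative sketch computing support, confidence, and lift for the rule {butter, bread} ⇒ {milk} over the example database above; the helper name supp is an assumption:
D = [
    {"milk", "bread"}, {"butter"}, {"beer"},
    {"milk", "bread", "butter"}, {"bread"},
]
def supp(itemset):
    # Fraction of transactions containing every item of the itemset
    return sum(1 for t in D if itemset <= t) / len(D)
X, Y = {"butter", "bread"}, {"milk"}
confidence = supp(X | Y) / supp(X)
lift = supp(X | Y) / (supp(X) * supp(Y))
print(supp(X | Y), confidence, lift)   # 0.2 1.0 2.5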
Market basket analysis: This process analyzes customer buying habits by finding associations between the different
items that customers place in their shopping baskets. The discovery of such associations can
help retailers develop marketing strategies by gaining insight into which items are frequently
purchased together by customers.
Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count
for each item, and collecting those items that satisfy minimum support. The resulting set is
denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find
L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
Example: Consider the following transaction database D (nine transactions):
TID     List of item IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 on L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get C3 = L2 x L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent, so they are removed in the prune step.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (strong rules satisfy both minimum support and minimum confidence).
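A minimal Python sketch of the level-wise Apriori search described above; the function and variable names and the use of Python sets are illustrative assumptions, not part of the original algorithm description:
from itertools import combinations
def apriori(transactions, min_sup):
    # Level-wise search: use frequent k-itemsets to explore (k+1)-itemsets
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]   # candidate 1-itemsets C1
    frequent = {}
    k = 1
    while current:
        # Count support of each candidate with one scan of the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        # Join step: Lk x Lk to form candidate (k+1)-itemsets; prune any
        # candidate having an infrequent k-subset (Apriori property)
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(s) in Lk for s in combinations(union, k)
                ):
                    candidates.add(union)
        current = list(candidates)
        k += 1
    return frequent
# Transaction database D from the example above, with min sup = 2
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
for itemset, count in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), count)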
Mining multilevel association rules from transactional databases
Multilevel association rule: Multilevel association rules can be defined as applying
association rules over different levels of data abstraction.
Unit-III
Classification and Prediction
Classification
Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the
expenditures in dollars of potential customers on computer equipment given their income
and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
A marketing manager at a company needs to analyze customers with a given profile to
predict who will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing
data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are asked to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model or
a predictor will be constructed that predicts a continuous-valued-function or ordered value.
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us understand
the working of classification. The data classification process includes two steps −
Building the classifier or model (the learning step, in which a classification model is constructed from training data).
Using the classifier for classification (in which the model is used to predict class labels for new data).
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
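As an illustration of the learning and classification steps (not from the notes), the sketch below fits a small decision tree with scikit-learn; the encoded toy data loosely mirrors the buy_computer example, and all feature encodings and values are made-up assumptions:
from sklearn.tree import DecisionTreeClassifier
# Toy buy_computer data (illustrative): features [age_group, student, credit_rating]
# encoded as small integers.
X = [
    [0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 0, 0],
    [2, 1, 0], [2, 1, 1], [1, 1, 1], [0, 0, 0],
]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X, y)                       # learning step: build the tree from training data
print(clf.predict([[1, 1, 0]]))     # classification step: predict a class label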
Bayesian (Naive Bayes) Classification:
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize
P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci),
the naive assumption of class conditional independence is made. This presumes
that the values of the attributes are conditionally independent of one another,
given the class label of the tuple. Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).
If Ak is continuous-valued, then we need to do a bit more work, but the calculation is
pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian
distribution with a mean μ and standard deviation σ, defined by
g(x, μ, σ) = (1 / (sqrt(2π) σ)) × e^(−(x − μ)² / (2σ²)), so that P(xk|Ci) = g(xk, μCi, σCi).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each
class Ci. The classifier predicts that the class label of tuple X is the class
Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
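A minimal sketch of the naive Bayesian procedure described in steps 3-5, assuming continuous (Gaussian) attributes; the function names and the toy training data are illustrative assumptions:
import math
from collections import defaultdict
def fit_gaussian_nb(X, y):
    # Estimate the prior P(Ci) and per-attribute Gaussian parameters (mu, sigma) per class
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        params = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col)
            params.append((mu, math.sqrt(var) or 1e-9))
        model[label] = (prior, params)
    return model
def predict(model, x):
    # Choose the class Ci maximizing P(X|Ci) P(Ci) under class conditional independence
    def g(v, mu, sigma):
        return math.exp(-((v - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    best, best_score = None, -1.0
    for label, (prior, params) in model.items():
        score = prior
        for v, (mu, sigma) in zip(x, params):
            score *= g(v, mu, sigma)   # multiply per-attribute likelihoods (naive assumption)
        if score > best_score:
            best, best_score = label, score
    return best
# Made-up training data: [age, income] with class labels "yes"/"no" (buys computer)
X = [[25, 30000], [30, 40000], [45, 80000], [50, 90000], [35, 60000]]
y = ["no", "no", "yes", "yes", "yes"]
print(predict(fit_gaussian_nb(X, y), [28, 35000]))   # should print "no" for this toy data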
Unit-IV
Clustering
4.1. Problem Definition: