Data Mining Written Notes 1
CLASS NOTES
Unit-I
Introduction to Data Mining
1.1. Introduction to Data Mining:
Data mining refers to extracting or “mining” knowledge from large amounts of data.
Data Mining is defined as the procedure of extracting information from huge sets of data.
Data mining is often used as a synonym for another popularly used term, Knowledge Discovery from
Data, or KDD.
1.1.2 What is Data Mining?
Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data. The information or
knowledge so extracted can be used for any of the following applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Apart from these, data mining is also applied in areas such as sports, astrology, and Internet Web surfing (Surf-Aid).
Financial Planning and Asset Evaluation − It involves cash flow analysis and
prediction, and contingent claim analysis to evaluate assets.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to detect
fraud. In the case of fraudulent telephone calls, it helps to find the destination of the call, the duration of the call,
the time of the day or week, etc. It also analyzes the patterns that deviate from expected norms.
Data Mining − On What Kinds of Data?
Relational Databases
Data Warehouses
Transactional Databases
Advanced Data and Information Systems and Advanced Applications
Object-Relational Databases
Temporal Databases, Sequence Databases, and Time-Series Databases
Spatial Databases and Spatiotemporal Databases
1.2. KDD (Knowledge Discovery from Data):
Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
1.3. Challenges:
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here we discuss the major issues regarding mining methodology and user interaction, performance, and diverse data types.
1.4. Data Mining Functionalities:
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, they can be classified into two categories:
Descriptive
Classification and Prediction
Descriptive mining tasks characterize the general properties of the data in the
database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Describe data mining functionalities, and the kinds of patterns they can discover (or) Define
each of the following data mining functionalities: characterization, discrimination,
association and correlation analysis, classification, prediction, clustering, and evolution
analysis. Give examples of each data mining functionality, using a real-life database that you
are familiar with.
Data can be associated with classes or concepts. It describes a given set of data in a concise
and summarative manner, presenting interesting general properties of the data. These
descriptions can be derived via:
(1) data characterization, by summarizing the data of the class under study (often called the target class);
(2) data discrimination, by comparison of the target class with one or a set of comparative classes; or
(3) both data characterization and discrimination.
Data characterization
It is a summarization of the general characteristics or features of a target class of data.
Example: A data mining system should be able to produce a description summarizing the
characteristics of a student who has obtained more than 75% in every semester; the result
could be a general profile of the student.
Data Discrimination is a comparison of the general features of target class data objects with
the general features of objects from one or a set of contrasting classes.
Example
The general features of students with high GPA’s may be compared with the general features
of students with low GPA’s. The resulting description could be a general comparative profile
of the students such as 75% of the students with high GPA’s are fourth-year computing
science students while 65% of the students with low GPA’s are not.
The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as generalized relations,
or in rule form called characteristic rules.
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
DECISION TREE:
Prediction:
Prediction − It is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction. Prediction can also be used
for identification of distribution trends based on available data.
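As a small illustration (not part of the notes), the sketch below fits a least-squares regression line and uses it to predict a continuous value; the income and spending figures are made-up assumptions:
import numpy as np
# Made-up data: annual income (in thousands) vs. spending on computer
# equipment (in dollars); used only to illustrate numeric prediction.
income = np.array([30.0, 45.0, 60.0, 75.0, 90.0])
spent = np.array([400.0, 650.0, 900.0, 1150.0, 1400.0])
# Fit a least-squares regression line: spent ~ a * income + b
a, b = np.polyfit(income, spent, deg=1)
# Predict the continuous value for a new customer with income of 55 (thousand)
print(round(a * 55.0 + b, 2))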
1.4.4 Clustering analysis
Clustering analyzes data objects without consulting a known class label. The objects are
clustered or grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of
objects.
Clustering can also facilitate taxonomy formation, that is, the organization of observations
into a hierarchy of classes that group similar events together.
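As a simple illustration (not from the notes), the sketch below groups 2-D points with scikit-learn's KMeans so that intraclass similarity is high and interclass similarity is low; the sample points and the choice of two clusters are assumptions:
from sklearn.cluster import KMeans
# Made-up 2-D data points forming two visually separated groups.
points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
          [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]]
# Group the objects so that points within a cluster are close to each other
# and far from the points of other clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster label assigned to each point
print(km.cluster_centers_)  # the two cluster centroids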
Outlier analysis
Example
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges
incurred by the same account. Outlier values may also be detected with respect to the
location and type of purchase, or the purchase frequency.
1.5. Data Preprocessing:
Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
Data Transformation: Data is normalized, aggregated and generalized.
Data cleaning as a process begins with discrepancy detection; commercial tools are available that can aid in this step.
Noisy data:
Noise is a random error or variance in a measured variable. Data smoothing techniques are used for
removing such noisy data.
In this binning technique:
The data are first sorted and partitioned into bins.
Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
Smoothing by bin means: Each value in the bin is replaced by the mean value of the bin.
Smoothing by bin medians: Each value in the bin is replaced by the bin median.
Smoothing by boundaries: The min and max values of a bin are identified as the bin
boundaries. Each bin value is replaced by the closest boundary value.
Example: Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins (depth of 4, since each bin contains four values):
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
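A minimal Python sketch of equi-depth binning with smoothing by bin means, reproducing the example above; the function name and the rounding convention are illustrative assumptions:
def smooth_by_bin_means(values, depth):
    # Sort values, split into equi-depth bins, replace each value by its bin mean
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), depth):
        bin_vals = data[start:start + depth]
        mean = round(sum(bin_vals) / len(bin_vals))
        smoothed.extend([mean] * len(bin_vals))
    return smoothed
# Price data from the example above, bins of depth 4
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 4))  # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]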
Data integration, which combines data from multiple sources into a coherent data store, as in data
warehousing
Correlation analysis: For numerical attributes, the correlation between two attributes A and B can be measured by the correlation coefficient rA,B, where −1 ≤ rA,B ≤ +1.
For categorical (discrete) attributes, a correlation relationship between A and B can be discovered by a chi-square test:
χ² = Σi Σj (oij − eij)² / eij
where oij is the observed frequency and eij is the expected frequency of (Ai, Bj).
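A small, illustrative computation of the chi-square statistic from a 2x2 contingency table; the observed counts below are made up for the example:
# Observed joint counts oij of (Ai, Bj) for two binary attributes (made-up values)
observed = [[250, 200],
            [50, 1000]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
total = sum(row_totals)
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total   # expected frequency eij
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))   # a large value suggests A and B are strongly correlated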
1.5.2.Data Transformation
The data are transformed or consolidated into forms appropriate for mining. Data transformation can
involve
Smoothing
Aggregation
Generalization
Normalization
Min-max normalization
z-score normalization
Normalization by decimal scaling
Attribute construction
Min-max normalization: performs a linear transformation on the original data. A value v of attribute A is mapped to
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
z-score normalization: a value v of attribute A is normalized using the mean and standard deviation of A:
v' = (v − meanA) / std_devA
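A minimal sketch of min-max and z-score normalization; the helper names and the example income values are illustrative assumptions:
def min_max(values, new_min=0.0, new_max=1.0):
    # Linearly map each value from [min, max] to [new_min, new_max]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
def z_score(values):
    # Normalize each value using the mean and standard deviation of the attribute
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
incomes = [30000, 45000, 60000, 75000, 98000]
print(min_max(incomes))   # values rescaled into [0, 1]
print(z_score(incomes))   # values centered at 0 with unit variance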
Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data
Similarity Measure
Numerical measure of how alike two data objects are.
Often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
Numerical measure of how different two data objects are.
Ranges from 0 (objects are alike) to ∞ (objects are completely different).
Similarity Between Objects:
Euclidean distance:
d(i, j) = sqrt( |xi1 − xj1|² + |xi2 − xj2|² + ... + |xip − xjp|² )
Minkowski distance:
d(i, j) = ( |xi1 − xj1|^q + |xi2 − xj2|^q + ... + |xip − xjp|^q )^(1/q)
Manhattan distance:
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
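A minimal sketch (function name is an assumption) computing these distances between two p-dimensional objects i and j:
def minkowski(x, y, q):
    # Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)
i = [1.0, 3.0, 5.0]
j = [4.0, 1.0, 2.0]
print(minkowski(i, j, 1))  # Manhattan distance: 3 + 2 + 3 = 8
print(minkowski(i, j, 2))  # Euclidean distance: sqrt(9 + 4 + 9), about 4.69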
Unit-II
Association Rules
• Given:
― a set I of all the items;
― a database D of transactions;
― minimum support s;
― minimum confidence c;
• Find:
― all association rules X ⇒ Y with a minimum support s and confidence c.
Problem Definition:
Let I = {i1, i2, ..., in} be a set of items and let D = {t1, t2, ..., tm} be a database of transactions. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅.
The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand-side or LHS)
and the consequent (right-hand-side or RHS) of the rule respectively.
Example database with 5 transactions and 4 items (1 indicates the item is present in the transaction):
transaction ID  milk  bread  butter  beer
1               1     1      0       0
2               0     0      1       0
3               0     0      0       1
4               1     1      1       0
5               0     1      0       0
For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0
in the database, which means that for 100% of the transactions containing
butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk
is bought as well). Confidence can be interpreted as an estimate of the probability P(Y | X),
the probability of finding the RHS of the rule in transactions under the condition that these
transactions also contain the LHS.
The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),
or the ratio of the observed support to that expected if X and Y were independent. The
conviction of a rule is defined as conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)),
and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is
to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent, divided by the observed frequency of incorrect predictions.
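A small illustrative sketch computing support, confidence, and lift for the rule {butter, bread} ⇒ {milk} over the example database above; the helper name supp is an assumption:
D = [
    {"milk", "bread"}, {"butter"}, {"beer"},
    {"milk", "bread", "butter"}, {"bread"},
]
def supp(itemset):
    # Fraction of transactions containing every item of the itemset
    return sum(1 for t in D if itemset <= t) / len(D)
X, Y = {"butter", "bread"}, {"milk"}
confidence = supp(X | Y) / supp(X)
lift = supp(X | Y) / (supp(X) * supp(Y))
print(supp(X | Y), confidence, lift)   # 0.2 1.0 2.5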
Market basket analysis: This process analyzes customer buying habits by finding associations between the different
items that customers place in their shopping baskets. The discovery of such associations can
help retailers develop marketing strategies by gaining insight into which items are frequently
purchased together by customers.
Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count
for each item, and collecting those items that satisfy minimum support. The resulting set is
denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find
L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
Example: Consider the following transaction database D (nine transactions):
TID     List of item IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 on L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get C3 = L2 x L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent, so they are removed in the prune step.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (strong rules satisfy both minimum support and minimum confidence).
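A minimal Python sketch of the level-wise Apriori search described above; the function and variable names and the use of Python sets are illustrative assumptions, not part of the original algorithm description:
from itertools import combinations
def apriori(transactions, min_sup):
    # Level-wise search: use frequent k-itemsets to explore (k+1)-itemsets
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]   # candidate 1-itemsets C1
    frequent = {}
    k = 1
    while current:
        # Count support of each candidate with one scan of the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        # Join step: Lk x Lk to form candidate (k+1)-itemsets; prune any
        # candidate having an infrequent k-subset (Apriori property)
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(s) in Lk for s in combinations(union, k)
                ):
                    candidates.add(union)
        current = list(candidates)
        k += 1
    return frequent
# Transaction database D from the example above, with min sup = 2
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
for itemset, count in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), count)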
Mining multilevel association rules from transactional databases
Multilevel association rule: Multilevel association rules can be defined as applying
association rules over different levels of data abstraction.
Unit-III
Classification and Prediction
Classification
Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to categorize
bank loan applications as either safe or risky, or a prediction model to predict the
expenditures in dollars of potential customers on computer equipment given their income
and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
A marketing manager at a company needs to analyze customers with a given profile to
predict who will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical
labels. These labels are risky or safe for loan application data and yes or no for marketing
data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are asked to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model or
a predictor will be constructed that predicts a continuous-valued-function or ordered value.
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us understand
the working of classification. The data classification process includes two steps −
Building the classifier or model (the learning step, in which a classification model is constructed from training data).
Using the classifier for classification (in which the model is used to predict class labels for new data).
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
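As an illustration of the learning and classification steps (not from the notes), the sketch below fits a small decision tree with scikit-learn; the encoded toy data loosely mirrors the buy_computer example, and all feature encodings and values are made-up assumptions:
from sklearn.tree import DecisionTreeClassifier
# Toy buy_computer data (illustrative): features [age_group, student, credit_rating]
# encoded as small integers.
X = [
    [0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 0, 0],
    [2, 1, 0], [2, 1, 1], [1, 1, 1], [0, 0, 0],
]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X, y)                       # learning step: build the tree from training data
print(clf.predict([[1, 1, 0]]))     # classification step: predict a class label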
Bayesian (Naive Bayes) Classification:
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize
P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci),
the naive assumption of class conditional independence is made. This presumes
that the values of the attributes are conditionally independent of one another,
given the class label of the tuple. Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).
If Ak is continuous-valued, then we need to do a bit more work, but the calculation is
pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian
distribution with a mean μ and standard deviation σ, defined by
g(x, μ, σ) = (1 / (sqrt(2π) σ)) × e^(−(x − μ)² / (2σ²)), so that P(xk|Ci) = g(xk, μCi, σCi).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each
class Ci. The classifier predicts that the class label of tuple X is the class
Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
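A minimal sketch of the naive Bayesian procedure described in steps 3-5, assuming continuous (Gaussian) attributes; the function names and the toy training data are illustrative assumptions:
import math
from collections import defaultdict
def fit_gaussian_nb(X, y):
    # Estimate the prior P(Ci) and per-attribute Gaussian parameters (mu, sigma) per class
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        params = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col)
            params.append((mu, math.sqrt(var) or 1e-9))
        model[label] = (prior, params)
    return model
def predict(model, x):
    # Choose the class Ci maximizing P(X|Ci) P(Ci) under class conditional independence
    def g(v, mu, sigma):
        return math.exp(-((v - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    best, best_score = None, -1.0
    for label, (prior, params) in model.items():
        score = prior
        for v, (mu, sigma) in zip(x, params):
            score *= g(v, mu, sigma)   # multiply per-attribute likelihoods (naive assumption)
        if score > best_score:
            best, best_score = label, score
    return best
# Made-up training data: [age, income] with class labels "yes"/"no" (buys computer)
X = [[25, 30000], [30, 40000], [45, 80000], [50, 90000], [35, 60000]]
y = ["no", "no", "yes", "yes", "yes"]
print(predict(fit_gaussian_nb(X, y), [28, 35000]))   # should print "no" for this toy data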
Unit-IV
Clustering
4.1. Problem Definition: