Data Mining
Introduction
Data mining functions determine what kinds of patterns can be mined. Depending on the kind of data to be
mined, there are two categories of functions involved in data mining, listed below:
Descriptive
Class/Concept Description
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Class/concept description associates data with classes or concepts. For example, in a
company, classes of items for sale include computers and printers, and concepts of customers
include big spenders and budget spenders. Such descriptions of a class or a concept are called
class/concept descriptions. These descriptions can be derived in the following two ways:
Data Characterization - This refers to summarizing the data of the class under study. The class
under study is called the Target Class.
Data Discrimination - This refers to comparing the target class with one or more
predefined contrasting groups or classes.
Mining of Frequent Patterns
Frequent patterns are patterns that occur frequently in transactional data. Here is the list of
kinds of frequent patterns:
Frequent Item Set - This refers to a set of items that frequently appear together, for example
milk and bread.
Decision Trees
Mathematical Formulae
Neural Networks
Here is the list of functions involved in this:
Classification - It predicts the class of objects whose class label is unknown. Its objective
is to find a derived model that describes and distinguishes data classes or concepts. The derived
model is based on the analysis of a set of training data, i.e. data objects whose class labels are
known.
Prediction - It is used to predict missing or unavailable numerical data values rather than
class labels. Regression analysis is generally used for prediction. Prediction can also be used to
identify distribution trends based on the available data.
Outlier Analysis - Outliers may be defined as data objects that do not comply
with the general behaviour or model of the available data.
Evolution Analysis - Evolution analysis refers to describing and modelling regularities or
trends for objects whose behaviour changes over time.
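As a rough illustration of outlier analysis, the sketch below flags values that fall far outside the interquartile range (IQR). The 1.5 x IQR rule is one common heuristic chosen here for illustration, not a method prescribed by these notes; the function name iqr_outliers and the sample purchase amounts are hypothetical. It assumes Python 3.8 or later for statistics.quantiles.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical purchase amounts; one value is far from the rest.
amounts = [12, 15, 14, 13, 16, 15, 14, 250]
outliers = iqr_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise or an interesting event (for example, fraud) depends on the application, so the threshold k is best treated as a tunable parameter.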
Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query.
A data mining query is defined in terms of data mining task primitives.
Note: These primitives allow us to communicate interactively with the data
mining system. Here is the list of Data Mining Task Primitives:
Database Attributes
Characterization
Discrimination
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
BACKGROUND KNOWLEDGE TO BE USED IN DISCOVERY PROCESS
Background knowledge allows data to be mined at multiple levels of abstraction. For example,
concept hierarchies are one kind of background knowledge that allows data to be mined at
multiple levels of abstraction.
INTERESTINGNESS MEASURES AND THRESHOLDS FOR PATTERN EVALUATION
These are used to evaluate the patterns discovered by the knowledge discovery process.
There are different interestingness measures for different kinds of knowledge.
REPRESENTATION FOR VISUALIZING THE DISCOVERED PATTERNS
This refers to the form in which discovered patterns are to be displayed. These representations
may include the following:
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
Performance Issues
Mining different kinds of knowledge in databases - Different users have different
needs and may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge discovery tasks.
Data Reduction - The basic idea of this theory is to reduce the data representation, trading
accuracy for speed, in response to the need to obtain quick approximate answers to queries
on very large databases. Some of the data reduction techniques are as follows:
Wavelets
Regression
Log-linear models
Histograms
Clustering
Sampling
Data Compression - The basic idea of this theory is to compress the given data by
encoding in terms of the following:
Bits
Association Rules
Decision Trees
Clusters
Pattern Discovery - The basic idea of this theory is to discover patterns occurring in the
database. The following areas contribute to this theory:
Machine Learning
Neural Network
Association Mining
Clustering
Probability Theory - This theory is based on statistical theory. The basic idea behind this
theory is to discover joint probability distributions of random variables.
Microeconomic View - According to this theory, data mining is finding the patterns that are
interesting only to the extent that they can be used in the decision-making process of some
enterprise.
Inductive Databases - According to this theory, a database schema consists
of data and patterns that are stored in the database. Therefore data
mining is the task of performing induction on databases.
Apart from the database-oriented techniques, there are also statistical
techniques available for data analysis. These techniques can be applied to scientific data and
to data from the economic and social sciences as well.
Statistical Data Mining
Some of the Statistical Data Mining Techniques are as follows:
Regression - Regression methods are used to predict the value of a response variable
from one or more predictor variables, where the variables are numeric. The following are
several forms of regression:
Linear
Multiple
Weighted
Polynomial
Nonparametric
Robust
Logistic Regression
Poisson Regression
Generalized linear models, such as logistic and Poisson regression, allow a categorical response variable to be related to a set of predictor
variables in a manner similar to the modelling of a numeric response variable using linear
regression.
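To make the idea of regression concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares with one numeric predictor. The function name fit_line and the sample data are assumptions made for illustration; the other regression forms listed in these notes would need different fitting procedures.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data generated from y = 2x + 1, so the fit should recover it.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

# Predict the response for an unseen predictor value.
prediction = slope * 6 + intercept
print(slope, intercept, prediction)
```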
Autoregression Methods
Data Visualization
Data Mining
Visual Data Mining is closely related to the following:
Computer Graphics
Multimedia Systems
Pattern Recognition
Data Visualization - The data in databases or data warehouses can be viewed in
several visual forms, listed below:
Boxplots
3-D Cubes
Curves
Surfaces
Data Mining Result Visualization - This is the presentation
of the results of data mining in visual forms, such as scatter plots and
boxplots.
Data Mining Process Visualization - This presents the
several processes of data mining. It allows users to see how the data are extracted, and
from which database or data warehouse the data are cleaned, integrated,
preprocessed, and mined.
Audio Data Mining
Audio Data Mining uses audio signals to indicate the patterns of data or the features of data
mining results. By transforming patterns into sound and music, instead of watching
pictures we can listen to pitches and tunes in order to identify anything interesting.
Data Mining and Collaborative Filtering
Today the consumer is faced with a large variety of goods and services while shopping. During live
customer transactions, a recommender system helps the consumer by making product
recommendations. The collaborative filtering approach is generally used for recommending
products to customers. These recommendations are based on the opinions of other customers.
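The following is a minimal sketch of one collaborative filtering variant (user-based, with cosine similarity over co-rated items). It is only an illustration of how other customers' opinions can drive a recommendation; the ratings table, the customer names, and the function names cosine and predict are all hypothetical.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: customer -> {product: rating}.
ratings = {
    "ann": {"laptop": 5, "printer": 3, "camera": 4},
    "bob": {"laptop": 4, "printer": 3, "camera": 5, "phone": 4},
    "cat": {"laptop": 1, "printer": 5, "phone": 2},
}


def cosine(u, v):
    """Cosine similarity computed over the items both customers rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other customers' ratings for item."""
    numerator = denominator = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine(ratings[user], their_ratings)
        numerator += sim * their_ratings[item]
        denominator += sim
    return numerator / denominator if denominator else None


# Estimate how ann would rate a product she has not rated yet.
estimate = predict("ann", "phone")
print(estimate)
```

Because ann's ratings resemble bob's more than cat's, the estimate is pulled towards bob's rating of the phone.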
Introduction
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data
mining system. DMQL is based on the Structured Query
Language (SQL). Data mining query languages can be designed to support ad hoc and
interactive data mining. DMQL provides commands for specifying primitives, and it
can work with databases and data warehouses as well. DMQL can be used to
define data mining tasks. In particular, we examine how to define data warehouses and data marts
in DMQL.
Task-Relevant Data Specification Syntax
Here is the syntax of DMQL for specifying the task relevant data:
use database database_name
or
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
Specifying Kind of Knowledge Syntax
Here we will discuss the syntax for Characterization, Discrimination, Association, Classification
and Prediction.
CHARACTERIZATION
The syntax for characterization is:
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, a description of customer purchasing habits:
mine characteristics as customerPurchasing
analyze count%
DISCRIMINATION
The syntax for Discrimination is:
mine comparison [as {pattern_name}]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}
For example, a user may define bigSpenders as customers who purchase items that cost $100
or more on average, and budgetSpenders as customers who purchase items at less than $100 on
average. The mining of discriminant descriptions for customers from each of these categories can
be specified in DMQL as:
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) >= $100
versus budgetSpenders where avg(I.price) < $100
analyze count
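For readers without access to a DMQL system, the sketch below mimics what the purchaseGroups comparison computes: it splits customers into the two classes by average item price and reports the count for each class. It is a plain Python approximation, not DMQL semantics; the purchase records and the function name purchase_groups are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (customer_id, item_price) purchase records.
purchases = [
    ("c1", 250.0), ("c1", 120.0),
    ("c2", 40.0), ("c2", 15.0),
    ("c3", 99.0), ("c3", 101.0),
    ("c4", 30.0),
]


def purchase_groups(records, threshold=100.0):
    """Count customers per class, split on average purchased item price."""
    prices = defaultdict(list)
    for customer, price in records:
        prices[customer].append(price)

    counts = {"bigSpenders": 0, "budgetSpenders": 0}
    for customer_prices in prices.values():
        # bigSpenders: avg price >= threshold; budgetSpenders: avg price < threshold.
        is_big = mean(customer_prices) >= threshold
        counts["bigSpenders" if is_big else "budgetSpenders"] += 1
    return counts


group_counts = purchase_groups(purchases)
print(group_counts)
```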
ASSOCIATION
The syntax for Association is:
mine associations [ as {pattern_name} ]
{matching {metapattern} }
For Example:
mine associations as buyingHabits
As a market manager of a company, you would like to characterize the buying habits of
customers who purchase items priced at no less than $100, with respect to the customer's age, the type of item
purchased, and the place in which the item was made. You would like to know the percentage of
customers having that characteristic. In particular, you are only interested in purchases made in
Canada and paid for with an American Express ("AmEx") credit card. You would like to view the
resulting descriptions in the form of a table.
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age,I.type,I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price >= 100
with noise threshold = 5%
display as table
Data Mining Languages Standardization
Standardizing the Data Mining Languages will serve the following purposes:
Answer:
(a) What is the mean of the data? What is the median?
The (arithmetic) mean of the data is: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
= \frac{809}{27} \approx 30$ (Equation 2.1). The median (middle value of the ordered set, as the number of values in the
set is odd) of the data is: 25.
(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
This data set has two values that occur with the same highest frequency and is, therefore, bimodal.
The modes (values occurring with the greatest frequency) of the data are 25 and 35.
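These figures can be checked with Python's standard statistics module. The 27 age values are not listed in this answer, so the sketch assumes they are the values that appear in the bins of the smoothing answer to question 2; statistics.multimode requires Python 3.8 or later.

```python
import statistics

# Age values reassembled from the nine bins in the answer to question 2
# (an assumption: Exercise 2.4 itself is not reproduced in these notes).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean_age = statistics.mean(ages)      # sum of values divided by N
median_age = statistics.median(ages)  # middle value, since N is odd
modes = statistics.multimode(ages)    # every value with the highest frequency

print(sum(ages), len(ages), round(mean_age, 2), median_age, modes)
```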
2. Using the data for age given in Exercise 2.4, answer the following.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your
steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
Answer:
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your
steps. Comment on the effect of this technique for the given data.
The following steps are required to smooth the above data using smoothing by bin means with a bin depth
of 3.
Step 1: Sort the data. (This step is not required here as the data are already sorted.)
Step 2: Partition the data into equal-frequency bins of size 3.
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin 3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70
Step 3: Calculate the arithmetic mean of each bin.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14 2/3, 14 2/3, 14 2/3 Bin 2: 18 1/3, 18 1/3, 18 1/3 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26 2/3, 26 2/3, 26 2/3 Bin 6: 33 2/3, 33 2/3, 33 2/3
Bin 7: 35, 35, 35 Bin 8: 40 1/3, 40 1/3, 40 1/3 Bin 9: 56, 56, 56
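The four steps above can be sketched in Python as follows. Exact fractions are used so that the bin means match the mixed fractions shown (for example 44/3 = 14 2/3). The function name smooth_by_bin_means is hypothetical, and the input is assumed to be already sorted.

```python
from fractions import Fraction


def smooth_by_bin_means(sorted_values, depth):
    """Replace each value by the mean of its equal-frequency bin."""
    smoothed = []
    for start in range(0, len(sorted_values), depth):
        bin_values = sorted_values[start:start + depth]
        bin_mean = Fraction(sum(bin_values), len(bin_values))
        # Every value in the bin is replaced by the bin mean.
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# The sorted age values, as partitioned into bins in Step 2.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

smoothed = smooth_by_bin_means(ages, depth=3)
bin_means = smoothed[::3]  # one representative mean per bin
print([str(m) for m in bin_means])
```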
3. Design a data warehouse for a regional weather bureau. The weather bureau has about 1,000 probes,
which are scattered throughout various land and ocean locations in the region to collect basic weather
data, including air pressure, temperature, and precipitation at each hour. All data are sent to the central
station, which has collected such data for over 10 years. Your design should facilitate efficient querying
and on-line analytical processing, and derive general weather patterns in multidimensional space.
Answer:
Since the weather bureau has about 1,000 probes scattered throughout various land and ocean locations,
we need to construct a spatial data warehouse so that a user can view weather patterns on a map by
month, by region, and by different combinations of temperature and precipitation, and can dynamically
drill down or roll up along any dimension to explore desired patterns.
The star schema of this weather spatial data warehouse can be constructed as shown in Figure 3.4.
4.
TID   items bought
T100  {M, O, N, K, E, Y}
T200  {D, O, N, K, E, Y}
T300  {M, A, K, E}
T400  {M, U, C, K, Y}
T500  {C, O, O, K, I, E}
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of
the two mining processes.
(b) List all of the strong association rules (with support s and confidence c) matching the following
metarule, where X is a variable representing customers, and item_i denotes variables representing
items (e.g., A, B, etc.):
∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of
the two mining processes.
Apriori:
C1 = {m: 3, o: 3, n: 2, k: 5, e: 4, y: 3, d: 1, a: 1, u: 1, c: 2, i: 1}
L1 = {m: 3, o: 3, k: 5, e: 4, y: 3}
C2 = {mo: 1, mk: 3, me: 2, my: 2, ok: 3, oe: 3, oy: 2, ke: 4, ky: 3, ey: 2}
L2 = {mk: 3, ok: 3, oe: 3, ke: 4, ky: 3}
C3 = {oke: 3, key: 2}
L3 = {oke: 3}
Efficiency comparison: Apriori has to do multiple scans of the database, while FP-growth needs only
two scans (one to count item frequencies and one to build the FP-tree). Candidate generation in Apriori is expensive (owing to the self-join), while
FP-growth does not generate any candidates.
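The Apriori levels above can be reproduced with the short sketch below. It assumes a minimum support count of 3 (60% of five transactions), which is the threshold implied by L1 but not stated in this answer. Unlike the hand-worked C3, this version applies the subset-pruning step before counting, so the candidate key is discarded early because ey is not frequent; the resulting frequent itemsets are the same. The function names are hypothetical.

```python
from itertools import combinations

# The five transactions; set("COOKIE") collapses the duplicate O in T500.
transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"),
                set("MUCKY"), set("COOKIE")]


def support_count(itemset, db):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def apriori(db, min_count):
    """Return {frequent itemset: support count} using level-wise search."""
    current = {frozenset([item]) for t in db for item in t}  # C1
    frequent = {}
    k = 1
    while current:
        counts = {c: support_count(c, db) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in joined
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))}
    return frequent


frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```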
(b) List all of the strong association rules (with support s and confidence c) matching the following
metarule, where X is a variable representing customers, and item_i denotes variables representing
items (e.g., A, B, etc.):
∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
k, o ⇒ e [0.6, 1]
e, o ⇒ k [0.6, 1]
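The two rules can be verified by computing support and confidence directly from the transactions. The sketch below assumes a minimum confidence of 0.8; this answer does not state the threshold, so treat that value as an assumption. Under it, the third possible rule from the itemset {o, k, e}, namely k, e ⇒ o, is rejected because its confidence is only 0.75. The function names are hypothetical.

```python
# The five transactions; set("COOKIE") collapses the duplicate O in T500.
transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"),
                set("MUCKY"), set("COOKIE")]


def support(itemset):
    """Fraction of transactions containing every item of itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def rules_from(itemset, min_conf):
    """Rules (item1, item2) => item3 built from one frequent 3-itemset."""
    found = []
    s = support(itemset)
    for consequent in sorted(itemset):
        antecedent = tuple(sorted(set(itemset) - {consequent}))
        # confidence(A => B) = support(A and B) / support(A)
        confidence = s / support(antecedent)
        if confidence >= min_conf:
            found.append((antecedent, consequent, s, confidence))
    return found


# min_conf=0.8 is an assumed threshold, not taken from the answer text.
rules = rules_from({"O", "K", "E"}, min_conf=0.8)
for antecedent, consequent, s, c in rules:
    print(", ".join(antecedent), "=>", consequent, [s, c])
```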