Ch-4: DATA MINING PRIMITIVES
• Data Mining:
Data mining refers to extracting or "mining"
knowledge from large amounts of data.
• Data Mining Primitives:
A data mining task can be specified in the form of a data
mining query, which is input to the data mining system.
• A data mining query is defined in terms of the following primitives:
Task-Relevant Data
The Kind of Knowledge to Be Mined
Background Knowledge: Concept Hierarchies
Interestingness Measures
Presentation and Visualization of Discovered Patterns
TASK-RELEVANT DATA
• The set of task-relevant data can be collected via a relational query (in SQL
or DMQL) involving operations such as selection, projection, join,
and aggregation.
• The data collection process results in a new data relation called the
initial data relation.
• The initial relation may or may not correspond to a physical relation
in the database.
• Virtual relations are called views in the field of databases, so the set of
task-relevant data for data mining is called a minable view.
• The task-relevant data can be specified by providing the following
information:
The name of the database or data warehouse to be used
The names of the tables or data cubes containing the relevant data
Conditions for selecting the relevant data
The relevant attributes or dimensions
How the retrieved data should be grouped by certain attributes,
such as "group by date"
• The set of task-relevant data can also be specified by condition-based
data filtering, or by slicing or dicing the data cube.
• For example, a concept hierarchy on item that specifies that "home
entertainment" is at a higher concept level, composed of the lower-level
concepts {"TV", "CD player", "VCR"}, can be used in the
collection of the task-relevant data.
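The collection step above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names, and sample rows are invented, and a real system might use DMQL or a data warehouse instead. The query combines selection, projection, join, and aggregation, and storing it as a view gives a virtual relation, that is, a minable view.

```python
import sqlite3

# In-memory database with hypothetical tables and rows (not from the source).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE sales (item_id INTEGER, sale_date TEXT, amount REAL);
    INSERT INTO item VALUES (1, 'TV', 'home entertainment'),
                            (2, 'VCR', 'home entertainment'),
                            (3, 'toaster', 'kitchen');
    INSERT INTO sales VALUES (1, '2024-01-05', 400.0),
                             (2, '2024-01-05', 150.0),
                             (1, '2024-01-06', 420.0),
                             (3, '2024-01-06', 30.0);
""")

# Selection (WHERE), projection (column list), join, and aggregation (SUM)
# define the initial data relation; the view keeps it virtual.
conn.execute("""
    CREATE VIEW minable_view AS
    SELECT s.sale_date, i.category, SUM(s.amount) AS total_amount
    FROM sales AS s JOIN item AS i ON s.item_id = i.item_id
    WHERE i.category = 'home entertainment'
    GROUP BY s.sale_date, i.category
""")

rows = conn.execute("SELECT * FROM minable_view ORDER BY sale_date").fetchall()
for row in rows:
    print(row)
```

Here the concept "home entertainment" acts as the selection condition, and the grouping attribute is the sale date.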
THE KIND OF KNOWLEDGE TO BE MINED
• The kinds of knowledge include concept description
(characterization and discrimination), association, classification,
prediction, clustering, and evolution analysis.
• In addition, the user can provide pattern templates that all discovered
patterns must match. These templates, or metapatterns, can be used to
guide the discovery process.
• For example, an association rule annotated with its support and confidence:
age(X, "30…39") ^ income(X, "40K…49K") => buys(X, "VCR") [2.2%, 60%]
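One possible way to hold such a rule in code, and to check it against a simple template, is sketched below. The Rule class and the matches_template helper are hypothetical names chosen for illustration; the template here only constrains the number of antecedent predicates and the consequent predicate, which is a simplification of what a metapattern can express.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: tuple   # ((predicate, value), ...)
    consequent: tuple   # (predicate, value)
    support: float      # fraction of tuples, e.g. 0.022 for 2.2%
    confidence: float   # fraction, e.g. 0.60 for 60%


def matches_template(rule, n_antecedents, consequent_predicate):
    """Check the shape P1(X, _) ^ ... ^ Pn(X, _) => consequent_predicate(X, _)."""
    return (len(rule.antecedent) == n_antecedents
            and rule.consequent[0] == consequent_predicate)


# The rule from the example above.
vcr_rule = Rule(
    antecedent=(("age", "30…39"), ("income", "40K…49K")),
    consequent=("buys", "VCR"),
    support=0.022,
    confidence=0.60,
)

print(matches_template(vcr_rule, 2, "buys"))
```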
BACKGROUND KNOWLEDGE: CONCEPT HIERARCHIES
• Background knowledge is information about the domain to be
mined that can be useful in the discovery process.
• One important form of background knowledge is the concept hierarchy. Concept
hierarchies allow the discovery of knowledge at multiple levels of
abstraction.
• A concept hierarchy defines a sequence of mappings from a set of
low-level concepts to higher-level, more general concepts.
Concept hierarchy
• A concept hierarchy is represented as a set of nodes organized in a
tree, where each node represents a concept.
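The tree structure can be sketched as a child-to-parent mapping. The concepts reuse the "home entertainment" example from the task-relevant data section; the root node "all" and the helper names are assumptions made for this illustration.

```python
# Each key is a concept; its value is the parent (more general) concept.
PARENT = {
    "TV": "home entertainment",
    "CD player": "home entertainment",
    "VCR": "home entertainment",
    "home entertainment": "all",  # assumed root of the tree
}


def ancestors(concept):
    """Return the chain of more general concepts, nearest first."""
    path = []
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def generalize(concept, levels=1):
    """Move up the hierarchy by the given number of levels, stopping at the root."""
    for _ in range(levels):
        concept = PARENT.get(concept, concept)
    return concept


print(ancestors("VCR"))
print(generalize("TV"))
```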
• There are four types of concept hierarchies:
Schema hierarchies
Set-grouping hierarchies
Operation-derived hierarchies
Rule-based hierarchies
• Schema hierarchies: a schema hierarchy is a total or partial order among attributes in the
database schema. For example:
street < city < state < country
• Set-grouping hierarchies: a set-grouping hierarchy organizes the values for a given attribute or
dimension into groups of constants or range values. For example:
{young, middle_aged} ⊂ all(age)
{20…39} ⊂ young
{40…59} ⊂ middle_aged
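A set-grouping hierarchy like this one can be sketched as a lookup from a raw value to its group. The ranges come from the example above; the function name, and the choice to return None for ages that fall outside both groups, are assumptions for illustration.

```python
# Age ranges from the example; range(20, 40) covers 20 through 39 inclusive.
AGE_GROUPS = {
    "young": range(20, 40),
    "middle_aged": range(40, 60),
}


def age_group(age):
    """Map an integer age to its group, or None if no group is defined for it."""
    for name, span in AGE_GROUPS.items():
        if age in span:
            return name
    return None


print(age_group(35))
print(age_group(47))
```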
• Operation-derived hierarchies: these are based on operations such as the decoding of
information-encoded strings and information extraction from complex data objects.
For example, an e-mail address can be decoded into the hierarchy
login-name < department < university < country.
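The decoding operation for the e-mail example might look like the sketch below. It assumes addresses of the form login@department.university.country; real addresses often do not follow this layout. The sample address and the function name are hypothetical.

```python
def decode_email(address):
    """Split an address into login-name < department < university < country."""
    login, _, domain = address.partition("@")
    parts = domain.split(".")
    if not login or len(parts) != 3:
        raise ValueError("expected login@department.university.country")
    department, university, country = parts
    return {
        "login_name": login,
        "department": department,
        "university": university,
        "country": country,
    }


# Hypothetical address used only to illustrate the decoding.
print(decode_email("jdoe@cs.exampleu.ca"))
```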
• Rule-based hierarchies: a rule-based hierarchy is defined by a set of rules and is evaluated dynamically based
on the current database data and the rule definition. For example:
low_profit_margin(X) <= price(X, P1) ^ cost(X, P2) ^ ((P1 - P2) < $50)
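The rule above can be evaluated against current data as in the sketch below. The $50 threshold comes from the rule; the item records and prices are invented for illustration.

```python
def low_profit_margin(price, cost, threshold=50.0):
    """True when the margin (price - cost) is below the threshold in dollars."""
    return (price - cost) < threshold


# Hypothetical items; the rule is re-evaluated whenever this data changes.
items = [
    {"name": "item_a", "price": 120.0, "cost": 90.0},   # margin 30
    {"name": "item_b", "price": 500.0, "cost": 400.0},  # margin 100
    {"name": "item_c", "price": 75.0, "cost": 25.0},    # margin 50, not below 50
]

low_margin_items = [i["name"] for i in items
                    if low_profit_margin(i["price"], i["cost"])]
print(low_margin_items)
```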
INTERESTINGNESS MEASURES
• Interestingness measures are used to reduce the number of uninteresting
patterns returned by the process. This can be achieved by specifying
measures that estimate the simplicity, certainty, utility, and novelty
of patterns.
• Each measure is associated with a threshold that can be controlled by the
user.
• SIMPLICITY:
Simplicity can be viewed as a function of the pattern
structure, defined in terms of the pattern size in bits or the number of
attributes or operators appearing in the pattern, for example rule length.
• CERTAINTY:
Each discovered pattern should have a measure of certainty
associated with it that assesses the validity or trustworthiness of the
pattern. A certainty measure for association rules of the form
"A => B", where A and B are sets of items, is confidence:
confidence(A => B) = (#_tuples_containing_both_A_and_B) / (#_tuples_containing_A)
• UTILITY:
The potential usefulness of a pattern can be estimated by a utility function such as support. The
support of an association pattern refers to the percentage of task-relevant
data tuples for which the pattern is true. For association rules of the form
"A => B", where A and B are sets of items:
support(A => B) = (#_tuples_containing_both_A_and_B) / (total_#_of_tuples)
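The confidence and support definitions can be checked on a small example. The sketch below treats each tuple as a set of items; the four transactions are invented, and the function names are chosen for illustration.

```python
def support(transactions, a, b):
    """Fraction of all tuples that contain every item in both A and B."""
    both = sum(1 for t in transactions if (a | b) <= t)
    return both / len(transactions)


def confidence(transactions, a, b):
    """Among tuples containing A, the fraction that also contain B."""
    containing_a = sum(1 for t in transactions if a <= t)
    both = sum(1 for t in transactions if (a | b) <= t)
    return both / containing_a if containing_a else 0.0


# Hypothetical task-relevant tuples.
transactions = [
    {"TV", "VCR"},
    {"TV"},
    {"TV", "VCR", "CD player"},
    {"CD player"},
]
A, B = {"TV"}, {"VCR"}

print(support(transactions, A, B))     # 2 of 4 tuples contain both
print(confidence(transactions, A, B))  # 2 of the 3 tuples containing TV
```

Both measures share the same numerator; they differ only in the denominator, which is why a rule can have high confidence but low support.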
• NOVELTY:
Novel patterns are those that contribute new information or increased performance to the given
pattern set. One strategy for detecting novelty is to remove redundant patterns. For example, a data
exception may be considered novel in that it differs from what is expected based on a
statistical model or user beliefs. Example rule, annotated with its support and confidence:
location(X, "CANADA") => buys(X, "SONY_TV") [8%, 70%]
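Since each measure has a user-controlled threshold, pattern filtering can be sketched as a simple comparison against minimum values. The two rules and their percentages are the ones quoted in this chapter; the threshold values and the function name are assumptions for illustration.

```python
rules = [
    {"rule": 'age(X, "30…39") ^ income(X, "40K…49K") => buys(X, "VCR")',
     "support": 0.022, "confidence": 0.60},
    {"rule": 'location(X, "CANADA") => buys(X, "SONY_TV")',
     "support": 0.08, "confidence": 0.70},
]


def filter_rules(rules, min_support, min_confidence):
    """Keep only rules that meet both user-specified thresholds."""
    return [r for r in rules
            if r["support"] >= min_support and r["confidence"] >= min_confidence]


# Hypothetical thresholds: at least 5% support and 65% confidence.
kept = filter_rules(rules, min_support=0.05, min_confidence=0.65)
for r in kept:
    print(r["rule"])
```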
PRESENTATION AND VISUALIZATION OF DISCOVERED PATTERNS
• A data mining system should be able to display the discovered patterns
in multiple forms, such as rules, tables, crosstabs, pie charts,
decision trees, cubes, or other visual representations.
• A data mining system should employ concept hierarchies to
implement drill-down and roll-up operations, so that users may
inspect discovered patterns at multiple levels of abstraction.
• In addition, pivoting, slicing, and dicing operations aid the user in
viewing generalized data and knowledge from different perspectives.
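A roll-up along a concept hierarchy can be sketched as re-aggregating counts one level up. The hierarchy reuses the "home entertainment" example; the counts and the function name are invented for illustration. Drill-down is the reverse step, returning to the finer-grained counts.

```python
from collections import Counter

# Child-to-parent mapping for the item hierarchy.
PARENT = {
    "TV": "home entertainment",
    "CD player": "home entertainment",
    "VCR": "home entertainment",
}

# Hypothetical counts at the lowest level of the hierarchy.
sales_by_item = Counter({"TV": 3, "CD player": 2, "VCR": 1})


def roll_up(counts, parent):
    """Aggregate counts to the next higher concept level."""
    rolled = Counter()
    for concept, n in counts.items():
        rolled[parent.get(concept, concept)] += n
    return rolled


sales_by_category = roll_up(sales_by_item, PARENT)
print(sales_by_category)
```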
Various forms of presenting and visualizing the
discovered patterns