CS423
DATA WAREHOUSING AND
DATA MINING
Chapter 6a
Frequent Patterns Analysis
Dr. Hammad Afzal
[email protected]
Department of Computer Software Engineering
National University of Sciences and Technology
(NUST)
MINING FREQUENT PATTERNS, ASSOCIATION
AND CORRELATIONS: BASIC CONCEPTS AND
METHODS
Basic Concepts
Frequent Itemset Mining Methods
Which Patterns Are Interesting?—Pattern Evaluation
Methods
Summary
WHAT IS FREQUENT PATTERN
ANALYSIS?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
Motivation: Finding inherent regularities in data
What products were often purchased together?— Milk and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
WHAT IS FREQUENT PATTERN
ANALYSIS?
Frequent pattern: a pattern (a set of items) that occurs frequently
in a data set.
Milk and Bread
Frequent pattern: a pattern (subsequences) that occurs frequently
in a data set
Buying first PC and then digital Camera
After clicking on one web page, where did the user click next?
Frequent pattern: a pattern (substructures) that occurs frequently
in a data set.
Sub-Graphs
BASIC CONCEPTS: FREQUENT
PATTERNS
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

itemset: A set of one or more items
k-itemset X = {x1, …, xk}
E.g. 2-itemset X = {x1, x2}

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
BASIC CONCEPTS: FREQUENT
PATTERNS
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(absolute) support, or support count, of X:
the frequency or number of occurrences of an itemset X
(relative) support, s,
is the fraction of transactions that contain X (i.e., the
probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a
minsup threshold
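The definitions above can be sketched in Python as follows. This is a minimal illustration and not part of the original slides: the function names support_count, relative_support and is_frequent are hypothetical, and the transactions are the five rows of the example table.

```python
# Transactions from the example table (Tid -> set of items bought).
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def support_count(itemset, db):
    """Absolute support: number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for bought in db.values() if items <= bought)


def relative_support(itemset, db):
    """Relative support: fraction of transactions that contain itemset."""
    return support_count(itemset, db) / len(db)


def is_frequent(itemset, db, minsup):
    """An itemset is frequent if its relative support is no less than minsup."""
    return relative_support(itemset, db) >= minsup
```

With minsup = 50%, for example, {Beer, Diaper} is contained in 3 of the 5 transactions (relative support 0.6) and is therefore frequent, while {Milk} is contained in only 2 of 5 and is not.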
BASIC CONCEPTS: ASSOCIATION RULES
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X → Y with minimum support and confidence
support, s, probability that a transaction contains X ∪ Y
confidence, c, conditional probability that a transaction
having X also contains Y
BASIC CONCEPTS: ASSOCIATION RULES
Support(X → Y) = P(X ∪ Y)
Confidence(X → Y) = P(Y | X)
= Support(X ∪ Y) / Support(X)
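As a worked step, the confidence of a rule can be written in terms of support counts alone. The symbols N (total number of transactions) and count(·) (absolute support) are introduced here for illustration and do not appear on the slide.

```latex
\[
\mathrm{Confidence}(X \rightarrow Y)
  = P(Y \mid X)
  = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}
  = \frac{\mathrm{count}(X \cup Y) / N}{\mathrm{count}(X) / N}
  = \frac{\mathrm{count}(X \cup Y)}{\mathrm{count}(X)}
\]
```

Because N cancels, confidence is not symmetric: swapping X and Y changes only the denominator.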
BASIC CONCEPTS: ASSOCIATION RULES
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Let minsup = 50%, minconf = 50%
Freq. Pat.:
Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules: (many more!)
Beer → Diaper (60%, 100%)
Diaper → Beer (60%, 75%)
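The example above can be checked with a brute-force sketch that enumerates every candidate itemset and every split of a frequent itemset into a rule. This is only an illustration under stated assumptions: it is plain exhaustive enumeration rather than any particular mining algorithm, and the names frequent_itemsets and association_rules are hypothetical.

```python
from itertools import combinations

# The five transactions from the example table.
db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def frequent_itemsets(db, minsup):
    """Map each frequent itemset to its support count (exhaustive search)."""
    items = sorted(set().union(*db))
    min_count = minsup * len(db)
    freq = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            count = sum(1 for t in db if itemset <= t)
            if count >= min_count:
                freq[itemset] = count
    return freq


def association_rules(freq, n_transactions, minconf):
    """Return (X, Y, support, confidence) for every rule X -> Y meeting minconf."""
    rules = []
    for itemset, count in freq.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                x = frozenset(lhs)
                # x is a subset of a frequent itemset, so it is in freq too.
                conf = count / freq[x]
                if conf >= minconf:
                    rules.append((set(x), set(itemset - x),
                                  count / n_transactions, conf))
    return rules


freq = frequent_itemsets(db, minsup=0.5)
rules = association_rules(freq, len(db), minconf=0.5)
```

On this table, freq holds the five frequent patterns listed above, and rules contains Beer → Diaper and Diaper → Beer with the support and confidence values shown.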
COMPUTATIONAL COMPLEXITY OF FREQUENT ITEMSET
MINING
How many itemsets may be generated in the worst case?
The number of frequent itemsets to be generated is sensitive to the
minsup threshold
When minsup is low, there is potentially an exponential number of
frequent itemsets
The worst case: $M^N$, where M: # distinct items, and N: max length of
transactions.
A long pattern contains a combinatorial number of sub-patterns,
e.g., {a1, …, a100} contains
$\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$
sub-patterns!
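The count above can be verified directly with exact integer arithmetic. The helper name num_subpatterns is hypothetical and used only for this check.

```python
from math import comb


def num_subpatterns(n):
    """Number of non-empty sub-patterns of an n-item pattern: sum of C(n, k), k = 1..n."""
    return sum(comb(n, k) for k in range(1, n + 1))


total = num_subpatterns(100)
# The binomial sum equals 2^100 - 1 exactly (Python integers are arbitrary precision).
assert total == 2**100 - 1
print(f"{total:.2e}")
```

Each of the 100 items is either in or out of a sub-pattern, which gives 2^100 subsets; subtracting the empty set gives 2^100 - 1.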
CLOSED PATTERNS AND MAX-PATTERNS
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists
no super-pattern Y ⊃ X with the same support as X.
An itemset X is a max-pattern if X is frequent and
there exists no frequent super-pattern Y ⊃ X.
A closed pattern is a lossless compression of frequent
patterns
Reducing the # of patterns and rules
CLOSED PATTERNS AND MAX-PATTERNS
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
Min_sup = 1.
What is the set of closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
What is the set of max-patterns?
<a1, …, a100>: 1
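The exercise itself cannot be checked by enumeration, since it has 2^100 - 1 frequent itemsets, but a scaled-down analogue can. The sketch below assumes a hypothetical database DB = {<a1, …, a4>, <a1, a2>} with Min_sup = 1; the sizes 4 and 2 and all function names are illustrative choices, not taken from the slides.

```python
from itertools import combinations


def frequent_itemsets(db, min_sup):
    """Map each frequent itemset to its absolute support (exhaustive search)."""
    items = sorted(set().union(*db))
    freq = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            count = sum(1 for t in db if itemset <= t)
            if count >= min_sup:
                freq[itemset] = count
    return freq


def closed_patterns(freq):
    """Frequent itemsets with no proper superset of the same support."""
    return {x: c for x, c in freq.items()
            if not any(x < y and c == cy for y, cy in freq.items())}


def max_patterns(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {x: c for x, c in freq.items()
            if not any(x < y for y in freq)}


# Scaled-down stand-in for the exercise: 4 and 2 items instead of 100 and 50.
db = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
freq = frequent_itemsets(db, min_sup=1)
closed = closed_patterns(freq)
maximal = max_patterns(freq)
```

Here freq has 15 entries, closed keeps only <a1, …, a4>: 1 and <a1, a2>: 2, and maximal keeps only <a1, …, a4>: 1, mirroring the answers to the exercise.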