Association Rule Mining
Algorithms for frequent itemset mining
Apriori
ECLAT
FP-Growth
Acknowledgement
Lecture slides taken/modified from:
Course Material of CIS527, 2004, Temple University
Jiawei Han (http://www-sal.cs.uiuc.edu/~hanj/DM_Book.html)
Vipin Kumar (http://www-users.cs.umn.edu/~kumar/csci5980/index.html)
Motivation: Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions:

TID | Items
  1 | Bread, Milk
  2 | Bread, Diaper, Beer, Eggs
  3 | Milk, Diaper, Beer, Coke
  4 | Bread, Milk, Diaper, Beer
  5 | Bread, Milk, Diaper, Coke
Example of Association Rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Applications: Association Rule Mining
  * → Annual Maintenance Contract (AMC):
    what should the store do to boost Maintenance Agreement sales?
  Home Electronics → *:
    what other products should the store stock up on?
Some options:
  Attached mailing in direct marketing
  Marketing and Sales Promotion
  Supermarket shelf management
Definition: Frequent Itemset
Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Support count (σ)
  Frequency of occurrence of an itemset
  E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
  Fraction of transactions that contain an itemset
  E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater
than or equal to a minsup threshold
TID | Items
  1 | Bread, Milk
  2 | Bread, Diaper, Beer, Eggs
  3 | Milk, Diaper, Beer, Coke
  4 | Bread, Milk, Diaper, Beer
  5 | Bread, Milk, Diaper, Coke
Definition: Association Rule
Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}
TID | Items
  1 | Bread, Milk
  2 | Bread, Diaper, Beer, Eggs
  3 | Milk, Diaper, Beer, Coke
  4 | Bread, Milk, Diaper, Beer
  5 | Bread, Milk, Diaper, Coke
Rule Evaluation Metrics
  Support (s)
    Fraction of transactions that contain both X and Y
  Confidence (c)
    Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
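As a sanity check, both metrics can be computed directly; the sketch below (plain Python, function names are illustrative, not from the lecture) reproduces the numbers above on the market-basket data.

    # Minimal sketch: support and confidence of a rule X -> Y over a
    # list of transactions, where each transaction is a set of items.
    def support(itemset, transactions):
        # fraction of transactions that contain every item in `itemset`
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(X, Y, transactions):
        # how often items in Y appear in transactions that contain X
        return support(X | Y, transactions) / support(X, transactions)

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]
    X, Y = {"Milk", "Diaper"}, {"Beer"}
    print(support(X | Y, transactions))    # 0.4
    print(confidence(X, Y, transactions))  # 0.666...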
Association Rule Mining Task
Given a set of transactions T, the goal of
association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!
Computational Complexity
Given d unique items:
  Total number of itemsets = 2^d
  Total number of possible association rules:

    R = Σ_{k=1}^{d-1} [ C(d,k) × Σ_{j=1}^{d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

  If d = 6, R = 602 rules
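The closed form can be checked by brute force; a small Python check (using math.comb) for d = 6:

    from math import comb

    d = 6
    R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
            for k in range(1, d))
    print(R, 3**d - 2**(d + 1) + 1)  # both print 602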
Mining Association Rules: Decoupling
TID | Items
  1 | Bread, Milk
  2 | Bread, Diaper, Beer, Eggs
  3 | Milk, Diaper, Beer, Coke
  4 | Bread, Milk, Diaper, Beer
  5 | Bread, Milk, Diaper, Coke

Example of Rules:
  {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but
can have different confidence
Thus, we may decouple the support and confidence requirements
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that
f → L - f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, candidate rules:
  ABC → D,   ABD → C,   ACD → B,   BCD → A,
  A → BCD,   B → ACD,   C → ABD,   D → ABC,
  AB → CD,   AC → BD,   AD → BC,   BC → AD,   BD → AC,   CD → AB

If |L| = k, then there are 2^k - 2 candidate association rules
(ignoring L → ∅ and ∅ → L)
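A short sketch (plain Python, names illustrative) that enumerates these binary partitions of a frequent itemset into candidate rules:

    # Enumerate all candidate rules f -> (L - f) from a frequent itemset L,
    # skipping the empty antecedent and the empty consequent.
    from itertools import combinations

    def candidate_rules(L):
        L = frozenset(L)
        for r in range(1, len(L)):            # proper, non-empty subsets
            for f in combinations(L, r):
                f = frozenset(f)
                yield f, L - f

    rules = list(candidate_rules({"A", "B", "C", "D"}))
    print(len(rules))  # 2**4 - 2 = 14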
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup
2. Rule Generation
Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
Brute-force approach:
Each itemset is a candidate frequent itemset
Count the support of each candidate by scanning the
database
Transactions:

TID | Items
  1 | Bread, Milk
  2 | Bread, Diaper, Beer, Eggs
  3 | Milk, Diaper, Beer, Coke
  4 | Bread, Milk, Diaper, Beer
  5 | Bread, Milk, Diaper, Coke

Match each transaction against every candidate in the list of candidates.
Complexity ~ O(NMw), where N is the number of transactions, M the number
of candidates, and w the maximum transaction width
⇒ expensive since M = 2^d !!!
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
Complete search: M = 2^d
Use pruning techniques to reduce M
Reduce the number of transactions (N)
Reduce size of N as the size of itemset increases
Use a subsample of N transactions
Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or
transactions
No need to match every candidate against every
transaction
Reducing Number of Candidates: Apriori
Apriori principle:
If an itemset is frequent, then all of its subsets must also
be frequent
If an itemset is infrequent, then all its supersets must
also be infrequent.
Apriori principle holds due to the following property
of the support measure:
∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Support of an itemset never exceeds the support of its
subsets
This is known as the anti-monotone property of support
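The anti-monotone property is what makes candidate pruning sound; a minimal Python sketch of the resulting subset test (names illustrative):

    # A (k+1)-candidate can be pruned as soon as any of its k-subsets
    # is missing from the set of frequent k-itemsets.
    from itertools import combinations

    def has_infrequent_subset(candidate, frequent_k):
        k = len(candidate) - 1
        return any(frozenset(s) not in frequent_k
                   for s in combinations(candidate, k))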
Illustrating Apriori Principle
[Itemset lattice figure: the full lattice over items A..E, from the null
set down to ABCDE. Once AB is found to be infrequent, all of its supersets
(ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the search space.]
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):

Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):

Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets):

Itemset               | Count
{Bread, Milk, Diaper} | 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
Apriori Algorithm
Method:
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
Generate length (k+1) candidate itemsets from length k
frequent itemsets
Prune candidate itemsets containing subsets of length k that
are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those
that are frequent
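Putting the steps together, here is a compact, runnable Apriori sketch (plain Python; a didactic sketch, not an optimized implementation):

    # Minimal Apriori: `transactions` is a list of sets of items,
    # `minsup` is an absolute support count.
    from itertools import combinations

    def apriori(transactions, minsup):
        # frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s for s, c in counts.items() if c >= minsup}
        frequent, k = set(Lk), 1
        while Lk:
            # candidate generation: join frequent k-itemsets pairwise
            cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            # prune candidates that have an infrequent k-subset
            cands = {c for c in cands
                     if all(frozenset(s) in Lk for s in combinations(c, k))}
            # one scan of the DB to count candidate supports
            counts = {c: sum(1 for t in transactions if c <= t) for c in cands}
            Lk = {c for c, n in counts.items() if n >= minsup}
            frequent |= Lk
            k += 1
        return frequent

On the five market-basket transactions with minsup = 3, this returns the frequent itemsets found on the previous slide, including {Bread, Milk, Diaper}.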
Extensions to Apriori
Transaction reduction
A transaction that does not contain any frequent k-itemset can be removed
from the database for further scans
Partitioning
First scan:
  Subdivide the transactions of database D into n non-overlapping partitions
  If the minimum (relative) support in D is min_sup, then the minimum
  support count for a partition is min_sup × the number of transactions
  in that partition
  Local frequent itemsets are determined
  A local frequent itemset may not be a frequent itemset in D
Second scan:
  Frequent itemsets are determined from the local frequent itemsets
Sampling
Pick a random sample S of D
Search for local frequent itemsets in S
Use a lower support threshold
Determine frequent itemsets from the local frequent itemsets
Frequent itemsets of D may be missed
For completeness, a second scan is done
Bottlenecks of Apriori
Candidate generation can result in huge candidate sets:
  10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
  To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100},
  one needs to generate 2^100 ≈ 10^30 candidates.
Multiple scans of database:
  Needs (n + 1) scans, where n is the length of the longest pattern
ECLAT: Another Method for Frequent Itemset
Generation
ECLAT: for each item, store a list of transaction
ids (tids); vertical data layout
Horizontal Data Layout:

TID | Items
  1 | A, B, E
  2 | B, C, D
  3 | C, E
  4 | A, C, D
  5 | A, B, C, D
  6 | A, E
  7 | A, B
  8 | A, B, C
  9 | A, C, D
 10 | B

Vertical Data Layout (one TID-list per item):

Item | TID-list
  A  | 1, 4, 5, 6, 7, 8, 9
  B  | 1, 2, 5, 7, 8, 10
  C  | 2, 3, 4, 5, 8, 9
  D  | 2, 4, 5, 9
  E  | 1, 3, 6
ECLAT: Another Method for Frequent Itemset
Generation
Determine the support of any k-itemset by intersecting the TID-lists of
two of its (k-1)-subsets.

  A: 1, 4, 5, 6, 7, 8, 9
  B: 1, 2, 5, 7, 8, 10
  ⇒ AB: 1, 5, 7, 8
3 traversal approaches:
top-down, bottom-up and hybrid
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too
large for memory
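The core ECLAT operation maps directly onto set intersection; a minimal Python sketch over the TID-lists above:

    # ECLAT-style support counting: TID-lists as sets, so the support of
    # AB is the size of the intersection of the TID-lists of A and B.
    tidlist = {
        "A": {1, 4, 5, 6, 7, 8, 9},
        "B": {1, 2, 5, 7, 8, 10},
    }
    tid_AB = tidlist["A"] & tidlist["B"]
    print(sorted(tid_AB), len(tid_AB))  # [1, 5, 7, 8] 4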
FP-growth: Another Method for Frequent
Itemset Generation
Use a compressed representation of the
database using an FP-tree
Once an FP-tree has been constructed, it uses a
recursive divide-and-conquer approach to mine
the frequent itemsets
FP-Tree Construction
TID | Items
  1 | {A,B}
  2 | {B,C,D}
  3 | {A,C,D,E}
  4 | {A,D,E}
  5 | {A,B,C}
  6 | {A,B,C,D}
  7 | {B,C}
  8 | {A,B,C}
  9 | {A,B,D}
 10 | {B,C,E}

After reading TID=1, the tree is a single path: null → A:1 → B:1.
After reading TID=2, a second branch is added: null → B:1 → C:1 → D:1
(alongside null → A:1 → B:1).
FP-Tree Construction
Reading the full transaction database (above) yields the FP-tree below.
A header table keeps, for each item A..E, a pointer into a linked list
(the node-links) of all tree nodes labelled with that item.

FP-tree (indentation shows parent/child structure):

null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1
Pointers are used to assist
frequent itemset generation
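A minimal FP-tree construction sketch in Python (class and variable names are illustrative; items in each transaction are assumed pre-sorted by decreasing global frequency, as in the slides):

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}            # item -> FPNode

    def insert(root, transaction, header):
        node = root
        for item in transaction:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)  # node-links
            node = node.children[item]
            node.count += 1

    root, header = FPNode(None, None), {}
    for t in [["A","B"], ["B","C","D"], ["A","C","D","E"], ["A","D","E"],
              ["A","B","C"], ["A","B","C","D"], ["B","C"], ["A","B","C"],
              ["A","B","D"], ["B","C","E"]]:
        insert(root, t, header)
    print(root.children["A"].count)  # 7, matching the tree above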
FP-growth
Build the conditional pattern base for E by following the node-links for E
in the FP-tree above and collecting each prefix path together with E's
count on that path:

P = {(A:1, C:1, D:1),
     (A:1, D:1),
     (B:1, C:1)}

Recursively apply FP-growth on P.
FP-growth
Conditional pattern base for E (prefix paths, with E's count):
P = {(A:1, C:1, D:1, E:1),
     (A:1, D:1, E:1),
     (B:1, C:1, E:1)}

Conditional tree for E:

null
  A:2
    C:1
      D:1
        E:1
    D:1
      E:1
  B:1
    C:1
      E:1

Count for E is 3: {E} is a frequent itemset.
Recursively apply FP-growth on P.
FP-growth
Conditional pattern base for D within the conditional base for E:
P = {(A:1, C:1, D:1),
     (A:1, D:1)}

Conditional tree for D within the conditional tree for E:

null
  A:2
    C:1
      D:1
    D:1

Count for D is 2: {D,E} is a frequent itemset.
Recursively apply FP-growth on P.
FP-growth
Conditional pattern base for C within D within E:
P = {(A:1, C:1)}

Conditional tree for C within D within E:

null
  A:1
    C:1

Count for C is 1: {C,D,E} is NOT a frequent itemset.
FP-growth
Conditional tree for A within D within E:

null
  A:2

Count for A is 2: {A,D,E} is a frequent itemset.

Next step: construct the conditional tree for C within the conditional
tree for E. Continue until exploring the conditional tree for A (which
has only the node A).
Benefits of the FP-tree Structure
Performance study shows FP-growth is an order of magnitude faster than
Apriori, and is also faster than tree-projection.

Reasoning:
  No candidate generation, no candidate test
  Uses a compact data structure
  Eliminates repeated database scans
  Basic operation is counting and FP-tree building

[Figure: run time (sec., 0-100) vs. support threshold (%, 0-2.5) on
dataset D1, comparing D1 FP-growth runtime with D1 Apriori runtime;
Apriori's run time grows much faster than FP-growth's as the support
threshold decreases.]
Complexity of Association Mining
Choice of minimum support threshold
lowering support threshold results in more frequent itemsets
this may increase number of candidates and max length of
frequent itemsets
Dimensionality (number of items) of the data set
more space is needed to store support count of each item
if number of frequent items also increases, both computation and
I/O costs may also increase
Size of database
since Apriori makes multiple passes, run time of algorithm may
increase with number of transactions
Average transaction width
transaction width increases with denser data sets
This may increase max length of frequent itemsets and traversals
of hash tree (number of subsets in a transaction increases with its
width)
Mining Frequent Patterns With
FP-trees
Idea: Frequent pattern growth
Recursively grow frequent patterns by pattern and
database partition
Method
For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
Repeat the process on each newly created conditional
FP-tree
Until the resulting FP-tree is empty, or it contains only one path (a
single path will generate all the combinations of its sub-paths, each
of which is a frequent pattern)
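One step of this method, building an item's conditional pattern base, can be sketched on top of the FPNode/header structures from the earlier construction sketch (again illustrative names, assuming the same tree):

    # Build the conditional pattern base for `item`: walk each node-link
    # upward to the root, collecting the prefix path and the node's count.
    def conditional_pattern_base(item, header):
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((list(reversed(path)), node.count))
        return base

    print(conditional_pattern_base("E", header))
    # [(['A', 'C', 'D'], 1), (['A', 'D'], 1), (['B', 'C'], 1)]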
Compact Representation of Frequent
Itemsets
Some itemsets are redundant because they have
identical support as their supersets
TID   | A1 .. A10 | B1 .. B10 | C1 .. C10
1-5   | all 1     | all 0     | all 0
6-10  | all 0     | all 1     | all 0
11-15 | all 0     | all 0     | all 1

(Transactions 1-5 each contain exactly the items A1-A10, transactions
6-10 exactly B1-B10, and transactions 11-15 exactly C1-C10.)
Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10,k)

Need a compact representation
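A quick check of the count (assuming minsup is at most 5 transactions here, so that every non-empty subset of each 10-item block is frequent):

    from math import comb
    print(3 * sum(comb(10, k) for k in range(1, 11)))  # 3 * (2**10 - 1) = 3069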
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets
is frequent
[Itemset lattice figure: a border line separates the frequent itemsets
(upper part of the lattice) from the infrequent itemsets (lower part).
The maximal frequent itemsets are exactly the frequent itemsets lying
directly on the border, i.e., those none of whose immediate supersets
is frequent.]
Closed Itemset
Problem with maximal frequent itemsets:
  The support of their subsets is not known, so additional DB scans are needed

An itemset is closed if none of its immediate supersets has the same
support as the itemset.
TID | Items
  1 | {A,B}
  2 | {B,C,D}
  3 | {A,B,C,D}
  4 | {A,B,D}
  5 | {A,B,C,D}

Itemset | Support
{A}     | 4
{B}     | 5
{C}     | 3
{D}     | 4
{A,B}   | 4
{A,C}   | 2
{A,D}   | 3
{B,C}   | 3
{B,D}   | 4
{C,D}   | 3

Itemset   | Support
{A,B,C}   | 2
{A,B,D}   | 3
{A,C,D}   | 2
{B,C,D}   | 3
{A,B,C,D} | 2
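Given such a table of supports, both properties reduce to a check over immediate supersets; a small Python sketch (assuming `supports` maps every itemset, as a frozenset, to its support count):

    # Closed: no immediate superset has the same support.
    # Maximal: no immediate superset is frequent. (By anti-monotonicity,
    # checking immediate supersets is enough.)
    def is_closed(s, supports):
        return all(supports[t] < supports[s]
                   for t in supports if len(t) == len(s) + 1 and s < t)

    def is_maximal(s, supports, minsup):
        return all(supports[t] < minsup
                   for t in supports if len(t) == len(s) + 1 and s < t)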
Maximal vs Closed Frequent Itemsets
Minimum support = 2

TID | Items
  1 | ABC
  2 | ABCD
  3 | BCE
  4 | ACDE
  5 | DE

[Itemset lattice figure: each itemset is annotated with the TIDs of the
transactions containing it (e.g., A: 1,2,4; B: 1,2,3; C: 1,2,3,4;
D: 2,4,5; E: 3,4,5). Frequent itemsets that are closed but not maximal,
and those that are both closed and maximal, are marked on the lattice.]

# Closed = 9
# Maximal = 4
Maximal vs Closed Itemsets
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets

[Figure: three nested sets, with the maximal frequent itemsets innermost,
the closed frequent itemsets in the middle, and all frequent itemsets
outermost.]
Presentation of Association Rules (Table Form)
[Figure: example of mined association rules presented in table form.]
Extra
Rule Generation
How to efficiently generate rules from frequent
itemsets?
In general, confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)

But the confidence of rules generated from the same itemset does have an
anti-monotone property:
  e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
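A sketch of confidence-based rule generation from one frequent itemset (plain Python, illustrative names; `supports` maps frozensets to support counts):

    # Grow rule consequents level by level; only consequents of
    # high-confidence rules are extended, which is safe because
    # confidence is anti-monotone in the size of the RHS.
    def rules_from_itemset(L, supports, minconf):
        L = frozenset(L)
        rules, consequents = [], {frozenset([i]) for i in L}
        while consequents:
            next_level = set()
            for Y in consequents:
                X = L - Y
                if not X:
                    continue
                conf = supports[L] / supports[X]
                if conf >= minconf:
                    rules.append((X, Y, conf))
                    next_level.update(Y | {i} for i in X)
            consequents = next_level
        return rules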
Visualization of Association Rules Using Plane Graph
[Figure]

Visualization of Association Rules Using Rule Graph
[Figure]