Association Rule Mining

Algorithms for frequent itemset mining


Apriori
ECLAT
FP-Growth
Acknowledgement
Lecture slides taken/modified from:
Course Material of CIS527, 2004, Temple University
Jiawei Han (http://www-sal.cs.uiuc.edu/~hanj/DM_Book.html)
Vipin Kumar (http://www-users.cs.umn.edu/~kumar/csci5980/index.html)

Motivation: Association Rule Mining


Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Association Rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Applications: Association Rule Mining


Annual Maintenance Contract (AMC):
What should the store do to boost Maintenance Agreement sales?

Home Electronics:
What other products should the store stock up on?

Some options:
Attached mailing in direct marketing
Marketing and sales promotion
Supermarket shelf management

Definition: Frequent Itemset


Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}

k-itemset
An itemset that contains k items

Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
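For illustration, a minimal Python sketch (not part of the original slides) that computes support count and support over the table above:

```python
# Minimal sketch: support count and support over the slide's transaction table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))                      # 2
print(support_count(X, transactions) / len(transactions))  # s(X) = 0.4
```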

Definition: Association Rule


Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Rule Evaluation Metrics


Support (s)
Fraction of transactions that contain both X and Y

Confidence (c)
Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → Beer

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
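For both metrics, a minimal Python sketch (reusing the `transactions` list defined in the earlier sketch):

```python
def rule_metrics(X, Y, transactions):
    """Support and confidence of the rule X -> Y."""
    n_xy = sum(1 for t in transactions if X | Y <= t)   # sigma(X u Y)
    n_x = sum(1 for t in transactions if X <= t)        # sigma(X)
    return n_xy / len(transactions), n_xy / n_x         # (s, c)

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(round(s, 2), round(c, 2))  # 0.4 0.67
```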

Association Rule Mining Task


Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold

Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!

Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1

If d = 6, R = 602 rules
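The closed form can be checked by brute force with a small Python sketch (assumes Python 3.8+ for `math.comb`):

```python
from math import comb

def rule_count(d):
    # Choose k items for the LHS, then j of the remaining d-k items for the RHS.
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(rule_count(6))    # 602
print(3**6 - 2**7 + 1)  # 602, via the closed form 3^d - 2^(d+1) + 1
```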

Mining Association Rules: Decoupling


TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but
can have different confidence
Thus, we may decouple the support and confidence requirements

Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, the candidate rules are:

ABC → D,  ABD → C,  ACD → B,  BCD → A,
A → BCD,  B → ACD,  C → ABD,  D → ABC,
AB → CD,  AC → BD,  AD → BC,  BC → AD,  BD → AC,  CD → AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L). A sketch enumerating them follows.
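A short Python sketch that enumerates these candidate rules and confirms the 2^k − 2 count:

```python
from itertools import combinations

L = frozenset({"A", "B", "C", "D"})
rules = [(frozenset(lhs), L - frozenset(lhs))
         for r in range(1, len(L))                # LHS sizes 1 .. k-1,
         for lhs in combinations(sorted(L), r)]   # so neither side is empty

print(len(rules), 2**len(L) - 2)  # 14 14
```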

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup

2. Rule Generation
Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation


Brute-force approach:
Each itemset is a candidate frequent itemset
Count the support of each candidate by scanning the database, matching each transaction against every candidate in the list of candidates

Transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width ⇒ expensive, since M = 2^d!
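A direct transcription of this brute-force approach into Python (a sketch, reusing the `transactions` list from earlier; it enumerates all M = 2^d − 1 candidate itemsets and scans all N transactions for each):

```python
from itertools import chain, combinations

def brute_force_frequent(transactions, minsup_count):
    items = sorted(set().union(*transactions))          # d unique items
    candidates = chain.from_iterable(                   # all 2^d - 1 itemsets
        combinations(items, k) for k in range(1, len(items) + 1))
    frequent = {}
    for cand in candidates:                             # M candidates ...
        cand = frozenset(cand)
        n = sum(1 for t in transactions if cand <= t)   # ... x N transactions
        if n >= minsup_count:
            frequent[cand] = n
    return frequent

print(len(brute_force_frequent(transactions, 3)))  # 8 frequent itemsets
```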

Frequent Itemset Generation Strategies


Reduce the number of candidates (M)
Complete search: M = 2^d
Use pruning techniques to reduce M

Reduce the number of transactions (N)


Reduce size of N as the size of itemset increases
Use a subsample of N transactions

Reduce the number of comparisons (NM)


Use efficient data structures to store the candidates or
transactions
No need to match every candidate against every
transaction

Reducing Number of Candidates: Apriori


Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent.
Equivalently, if an itemset is infrequent, then all of its supersets must also be infrequent.

The Apriori principle holds due to the following property of the support measure:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets.
This is known as the anti-monotone property of support.

Illustrating Apriori Principle


[Figure: itemset lattice from null down to ABCDE. Once {A,B} is found to be infrequent, all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned without counting.]

Illustrating Apriori Principle


Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3

If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning:
6 + 6 + 1 = 13 candidates

Pairs (2-itemsets):
(No need to generate candidates involving Coke or Eggs)

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):

Itemset                Count
{Bread, Milk, Diaper}  2

(The single surviving triplet candidate has support count 2 in this database, consistent with σ({Milk, Bread, Diaper}) = 2 earlier, so it falls below minsup.)

Apriori Algorithm
Method:
Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Prune candidate itemsets containing subsets of length k that are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those that are frequent
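The method above can be written compactly; here is an illustrative Python sketch (not the lecture's code), reusing the `transactions` list from the earlier sketches:

```python
from itertools import combinations

def count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, minsup_count):
    items = {frozenset([i]) for t in transactions for i in t}
    level = {c for c in items if count(c, transactions) >= minsup_count}  # F_1
    frequent, k = {}, 1
    while level:
        frequent.update({c: count(c, transactions) for c in level})
        # Generate length-(k+1) candidates by joining frequent k-itemsets ...
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        # ... prune candidates that contain an infrequent k-subset ...
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k))}
        # ... then count the survivors with one scan of the DB.
        level = {c for c in cands if count(c, transactions) >= minsup_count}
        k += 1
    return frequent

print(len(apriori(transactions, 3)))  # 8, matching the brute-force result
```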

Extensions to Apriori

Transaction reduction
A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset, so it can be removed from the database for further scans.

Partitioning
First scan:
Subdivide the transactions of database D into n non-overlapping partitions
If the minimum support count in D is min_sup, then the minimum support count for a partition is min_sup × (number of transactions in that partition / number of transactions in D)
Local frequent itemsets are determined
A local frequent itemset may not be a frequent itemset in D

Second scan:
Frequent itemsets of D are determined from the local frequent itemsets


Sampling
Pick a random sample S of D
Search for local frequent itemsets in S
Use a lower support threshold
Determine frequent itemsets from the local frequent itemsets
Frequent itemsets of D may be missed
For completeness, a second scan over D is done

Bottlenecks of Apriori

Candidate generation can result in huge candidate sets:
10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g. {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates

Multiple scans of the database:
Needs (n + 1) scans, where n is the length of the longest pattern

ECLAT: Another Method for Frequent Itemset Generation

ECLAT: for each item, store a list of transaction ids (tids); vertical data layout

Horizontal Data Layout:

TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical Data Layout (TID-lists):

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

ECLAT: Another Method for Frequent Itemset Generation

Determine the support of any k-itemset by intersecting the tid-lists of two of its (k−1)-subsets, e.g.:

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
A ∩ B → AB: 1, 5, 7, 8 (support count 4)

3 traversal approaches:
top-down, bottom-up and hybrid

Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too large for memory
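A minimal Python sketch of this tid-list intersection (the `tidlists` dictionary transcribes the vertical layout above):

```python
# Vertical layout: item -> set of transaction ids (tid-list), as on the slide.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

# Support of {A,B} = size of the intersection of the tid-lists of A and B.
ab = tidlists["A"] & tidlists["B"]
print(sorted(ab), len(ab))  # [1, 5, 7, 8] 4
```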

FP-growth: Another Method for Frequent Itemset Generation

Use a compressed representation of the database, an FP-tree
Once an FP-tree has been constructed, use a recursive divide-and-conquer approach to mine the frequent itemsets

FP-Tree Construction

Transaction Database:

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1:

null
└─ A:1
   └─ B:1

After reading TID=2:

null
├─ A:1
│  └─ B:1
└─ B:1
   └─ C:1
      └─ D:1

FP-Tree Construction

After reading all ten transactions (header table: one entry per item A through E, with pointers linking all nodes for that item):

null
├─ A:7
│  ├─ B:5
│  │  ├─ C:3
│  │  │  └─ D:1
│  │  └─ D:1
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:3
   └─ C:3
      ├─ D:1
      └─ E:1

Pointers are used to assist frequent itemset generation.
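For illustration, a bare-bones FP-tree construction in Python (a sketch; the header table and node-links shown on the slide are omitted for brevity, and the item order is assumed given rather than derived from item frequencies):

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, order):
    """Insert each transaction along a shared-prefix path, bumping counts."""
    root = FPNode(None, None)
    for t in transactions:
        node = root
        for item in sorted(t, key=order.index):  # fixed global item order
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

def dump(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        dump(child, depth + 1)

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
      {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
      {"A","B","D"}, {"B","C","E"}]
dump(build_fp_tree(db, order=["A", "B", "C", "D", "E"]))  # prints A:7, B:5, ...
```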

FP-growth

Build the conditional pattern base for E by following the node-links for E in the FP-tree above:
P = {(A:1, C:1, D:1),
     (A:1, D:1),
     (B:1, C:1)}

Recursively apply FP-growth on P.

FP-growth

Conditional pattern base for E (with the E nodes included):
P = {(A:1, C:1, D:1, E:1),
     (A:1, D:1, E:1),
     (B:1, C:1, E:1)}

Conditional tree for E:

null
├─ A:2
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:1
   └─ C:1
      └─ E:1

Count for E is 3: {E} is a frequent itemset.

Recursively apply FP-growth on P.

FP-growth

Conditional tree for D within the conditional tree for E:

null
└─ A:2
   ├─ C:1
   │  └─ D:1
   └─ D:1

Conditional pattern base for D within the conditional base for E:
P = {(A:1, C:1, D:1),
     (A:1, D:1)}

Count for D is 2: {D,E} is a frequent itemset.

Recursively apply FP-growth on P.

FP-growth

Conditional tree for C within D within E:

null
└─ A:1
   └─ C:1

Conditional pattern base for C within D within E:
P = {(A:1, C:1)}

Count for C is 1: {C,D,E} is NOT a frequent itemset.

FP-growth

Conditional tree for A within D within E:

null
└─ A:2

Count for A is 2: {A,D,E} is a frequent itemset.

Next step:
Construct the conditional tree for C within the conditional tree for E
Continue until exploring the conditional tree for A (which has only node A)

Benefits of the FP-tree Structure

Performance studies show FP-growth is an order of magnitude faster than Apriori, and also faster than tree-projection.

Reasoning:
No candidate generation, no candidate test
Compact data structure
Repeated database scans are eliminated
The basic operations are counting and FP-tree building

[Figure: run time (sec., 0 to 100) vs. support threshold (%, 0 to 2.5) on dataset D1, with one curve for D1 FP-growth runtime and one for D1 Apriori runtime.]

Complexity of Association Mining


Choice of minimum support threshold
lowering support threshold results in more frequent itemsets
this may increase number of candidates and max length of
frequent itemsets

Dimensionality (number of items) of the data set


more space is needed to store support count of each item
if number of frequent items also increases, both computation and
I/O costs may also increase

Size of database
since Apriori makes multiple passes, run time of algorithm may
increase with number of transactions

Average transaction width


transaction width increases with denser data sets
This may increase max length of frequent itemsets and traversals
of hash tree (number of subsets in a transaction increases with its
width)

Mining Frequent Patterns With FP-trees

Idea: frequent pattern growth
Recursively grow frequent patterns by pattern and database partition

Method:
For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern (see the sketch below)
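To illustrate the single-path base case, a small Python sketch (the helper name `patterns_from_single_path` is hypothetical): every non-empty subset of the path's nodes, joined with the current suffix, is a frequent pattern whose support is the minimum count among the chosen nodes.

```python
from itertools import chain, combinations

def patterns_from_single_path(path, suffix):
    """path: [(item, count), ...] down a single-branch conditional FP-tree."""
    subsets = chain.from_iterable(
        combinations(path, k) for k in range(1, len(path) + 1))
    return {frozenset(i for i, _ in s) | suffix: min(c for _, c in s)
            for s in subsets}

# The conditional tree for A within D within E is the single path A:2:
print(patterns_from_single_path([("A", 2)], suffix=frozenset({"D", "E"})))
# {frozenset({'A', 'D', 'E'}): 2}
```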

Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have identical support to their supersets. Consider the following 15 transactions over the 30 items A1–A10, B1–B10, C1–C10:
TID 1 to 5:   A1 … A10 = 1, all other items = 0
TID 6 to 10:  B1 … B10 = 1, all other items = 0
TID 11 to 15: C1 … C10 = 1, all other items = 0

Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k) = 3 × (2^10 − 1) = 3069

We need a compact representation.

Maximal Frequent Itemset


An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice from null down to ABCDE, with a border separating the frequent from the infrequent itemsets; the maximal frequent itemsets are the frequent nodes adjacent to the border.]

Closed Itemset

Problem with maximal frequent itemsets:
The support of their subsets is not known; additional DB scans are needed.

An itemset is closed if none of its immediate supersets has the same support as the itemset.

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support
{A}      4
{B}      5
{C}      3
{D}      4
{A,B}    4
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3

Itemset    Support
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2
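For illustration, a small Python sketch (a hypothetical helper, not from the slides) that derives the closed and maximal frequent itemsets from the transactions above:

```python
from itertools import chain, combinations

def closed_and_maximal(supports, minsup):
    """supports: dict mapping frozenset -> support count, for all itemsets."""
    frequent = {s for s, n in supports.items() if n >= minsup}
    closed, maximal = set(), set()
    for s in frequent:
        supersets = [t for t in frequent if s < t]
        if all(supports[t] < supports[s] for t in supersets):
            closed.add(s)    # no superset matches s's support
        if not supersets:
            maximal.add(s)   # no frequent superset at all
    return closed, maximal

db = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"}, {"A","B","D"}, {"A","B","C","D"}]
items = sorted(set().union(*db))
supports = {frozenset(s): sum(1 for t in db if set(s) <= t)
            for s in chain.from_iterable(
                combinations(items, k) for k in range(1, len(items) + 1))}
closed, maximal = closed_and_maximal(supports, minsup=2)
print(len(closed), len(maximal))  # 6 1 ({A,B,C,D} is the only maximal itemset)
```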

Maximal vs Closed Frequent Itemsets

Minimum support = 2

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice for this database, each node annotated with the TIDs of the transactions that contain it (A: 124, B: 123, C: 1234, D: 245, E: 345, AB: 12, and so on). Some nodes are closed but not maximal; others are closed and maximal.]

# Closed = 9
# Maximal = 4

Maximal vs Closed Itemsets

[Figure: nested sets showing Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.]

Presentation of Association Rules (Table Form)

Extra

Rule Generation

How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)

But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g. for L = {A,B,C,D}:

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
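A sketch of rule generation that exploits this property (the `gen_rules` helper is illustrative, not from the slides; support counts are taken from the market-basket example used earlier):

```python
def gen_rules(L, supports, minconf):
    """Grow the RHS level by level; once a rule fails minconf, no rule with
    a superset of that RHS (from the same itemset L) is ever generated."""
    rules, rhs_level = [], {frozenset([i]) for i in L}
    while rhs_level:
        next_level = set()
        for rhs in rhs_level:
            lhs = L - rhs
            if not lhs:
                continue
            conf = supports[L] / supports[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                next_level |= {rhs | frozenset([i]) for i in lhs}
        rhs_level = next_level
    return rules

supports = {frozenset(x): n for x, n in [
    (("Milk",), 4), (("Diaper",), 4), (("Beer",), 3),
    (("Milk", "Diaper"), 3), (("Milk", "Beer"), 2), (("Diaper", "Beer"), 3),
    (("Milk", "Diaper", "Beer"), 2)]}
for lhs, rhs, c in gen_rules(frozenset({"Milk", "Diaper", "Beer"}), supports, 0.6):
    print(set(lhs), "->", set(rhs), round(c, 2))  # 4 rules survive
```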

Visualization of Association Rules Using Plane Graph

Visualization of Association Rules Using Rule Graph
