ASSOCIATION RULE MINING
Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that
appear in a data set frequently.
For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card,
if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure
occurs frequently, it is called a (frequent) structured pattern.
Finding such frequent patterns plays an essential role in mining associations, correlations,
and many other interesting relationships among data.
ASSOCIATION RULE MINING
How can we find frequent itemsets from large amounts of data, where the data are either
transactional or relational?
How can we mine association rules in multilevel and multidimensional space?
Which association rules are the most interesting?
How can we help or guide the mining procedure to discover interesting associations or
correlations?
How can we take advantage of user preferences or constraints to speed up the mining
process?
Market Basket Analysis: A Motivating Example
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets”.
The discovery of such associations can help retailers develop marketing strategies by
gaining insight into which items are frequently purchased together by customers.
For instance, if customers are buying milk, how likely are they to also buy bread (and
what kind of bread) on the same trip to the supermarket?
Such information can lead to increased sales by helping retailers do selective marketing
and plan their shelf space.
Market Basket Analysis:
If we think of the universe as the set of items available at the store, then each item has a
Boolean variable representing the presence or absence of that item.
Each basket can then be represented by a Boolean vector of values assigned to these
variables.
The Boolean vectors can be analyzed for buying patterns that reflect items that are
frequently associated or purchased together.
These patterns can be represented in the form of association rules. For example, the
information that customers who purchase computers also tend to buy antivirus software at
the same time is represented by the following association rule:
computer ⇒ antivirus software [support = 2%; confidence = 60%]
Support and Confidence measures
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules.
A support of 2% for this rule means that 2% of all the transactions under analysis
show that computer and antivirus software are purchased together.
A confidence of 60% means that 60% of the customers who purchased a computer also
bought the software.
Typically, association rules are considered interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold. Such thresholds can be set by
users or domain experts.
Frequent Itemsets, Closed Itemsets, and Association Rules
Frequent Itemsets:
Let I = {I1, I2, ..., Im} be a set of items. Let D, the task-relevant data, be a set of database
transactions where each transaction T is a set of items such that T ⊆ I . Each transaction is
associated with an identifier, called TID. Let A be a set of items. A transaction T is said to
contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒B, where
A⊆I , B⊆I , and A∩B=∅.
The rule A⇒B holds in the transaction set D with support s, where s is the percentage of
transactions in D that contain A ∪ B (i.e., the union of sets A and B, or, equivalently, both
A and B). This is taken to be the probability P(A ∪ B).
support(A⇒B) = P(A ∪ B)
The rule A⇒B has confidence c in the transaction set D, where c is the percentage of
transactions in D containing A that also contain B. This is taken to be the conditional
probability, P(B|A). That is,
confidence(A⇒B) =P(B|A)
Frequent Itemsets, Closed Itemsets, and Association Rules
A set of items is referred to as an itemset.
An itemset that contains k items is a k-itemset. The set {computer, antivirus software} is a
2-itemset. The occurrence frequency of an itemset is the number of transactions that
contain the itemset. This is also known, simply, as the frequency, support count, or count
of the itemset.
confidence(A⇒B) = P(B|A) = support(A ∪ B) / support(A) = support count(A ∪ B) / support count(A)
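To make these measures concrete, the following minimal Python sketch computes support and confidence directly from the definitions above. The toy transaction set and item names are illustrative only, not taken from any table in this section.

transactions = [
    {"computer", "antivirus software"},
    {"computer", "printer"},
    {"computer", "antivirus software", "printer"},
    {"milk", "bread"},
    {"computer"},
]

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # Fraction of transactions containing the itemset, i.e., P(itemset).
    return support_count(itemset, transactions) / len(transactions)

def confidence(A, B, transactions):
    # confidence(A => B) = support count(A u B) / support count(A).
    return support_count(A | B, transactions) / support_count(A, transactions)

A, B = {"computer"}, {"antivirus software"}
print(support(A | B, transactions))    # 0.4: support of the rule A => B
print(confidence(A, B, transactions))  # 0.5: confidence of the rule A => B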
Frequent Itemsets, Closed Itemsets, and Association Rules
Closed Itemset:
An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y
has the same support count as X in S. An itemset X is a closed frequent itemset in set S if
X is both closed and frequent in S. An itemset X is a maximal frequent itemset (or max-
itemset) in set S if X is frequent, and there exists no super-itemset Y such that X ⊂ Y and
Y is frequent in S.
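The following sketch shows how the two definitions differ, assuming a precomputed table mapping each itemset to its support count; the itemsets and counts below are made up for illustration, and the tests only examine supersets that appear in the table.

# frozenset -> support count (illustrative values only)
supports = {
    frozenset("A"): 4, frozenset("B"): 3,
    frozenset("AB"): 3, frozenset("ABC"): 2,
}
min_sup = 2

def is_closed(X, supports):
    # X is closed if no proper super-itemset has the same support count as X.
    return all(not (X < Y and supports[Y] == supports[X]) for Y in supports)

def is_maximal(X, supports, min_sup):
    # X is maximal frequent if X is frequent and no proper super-itemset is frequent.
    return (supports[X] >= min_sup and
            all(not (X < Y and supports[Y] >= min_sup) for Y in supports))

print(is_closed(frozenset("B"), supports))              # False: {A, B} has the same count, 3
print(is_maximal(frozenset("ABC"), supports, min_sup))  # True: no frequent superset in the table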
Frequent Itemsets, Closed Itemsets, and Association Rules
Association Rules
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets:
By definition, each of these itemsets will occur at least as frequently as a predetermined
minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets:
By definition, these rules must satisfy minimum support and minimum confidence.
FREQUENT ITEM SET GENERATION - THE APRIORI PRINCIPLE
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules. The name of the algorithm is based on the
fact that the algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning
the database to accumulate the count for each item, and collecting those items that satisfy
minimum support. The resulting set is denoted L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and
so on, until no more frequent k-itemsets can be found. The finding of each Lk requires
one full scan of the database.
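The level-wise search just described can be sketched in a few lines of Python. This is a simplified illustration rather than the published algorithm: candidate generation is a naive self-join of the previous level with itself, and every candidate is counted with one full scan per level; the subset-based pruning that the Apriori property allows is sketched after the next passage.

def apriori(transactions, min_sup_count):
    # First scan: count each item and collect the frequent 1-itemsets, L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup_count}
    frequent = {s: counts[s] for s in Lk}

    k = 2
    while Lk:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # One full scan of the database per level counts the candidates.
        level_counts = {c: sum(1 for t in transactions if c <= t)
                        for c in candidates}
        Lk = {c for c, n in level_counts.items() if n >= min_sup_count}
        frequent.update({c: level_counts[c] for c in Lk})
        k += 1
    return frequent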
FREQUENT ITEM SET GENERATION - THE APRIORI PRINCIPLE
To improve the efficiency of the level-wise generation of frequent itemsets, an important
property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset I
does not satisfy the minimum support threshold, min_sup, then I is not frequent; that is,
P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e.,
I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either; that
is, P(I ∪ A) < min_sup.
This property belongs to a special category of properties called antimonotone in the sense
that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called
antimonotone because the property is monotonic in the context of failing a test.
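In code, the pruning step this property licenses looks like the sketch below: before counting a k-candidate, test its (k-1)-subsets against the previous frequent set and discard the candidate as soon as one is missing. Filtering the candidate set of the earlier apriori() sketch with this test before the counting scan is the intended use.

from itertools import combinations

def has_infrequent_subset(candidate, L_prev):
    # By the Apriori property, if any (k-1)-subset of the candidate is absent
    # from L(k-1), then the candidate itself cannot be frequent.
    k = len(candidate)
    return any(frozenset(s) not in L_prev
               for s in combinations(candidate, k - 1))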
FREQUENT ITEM SET GENERATION - THE APRIORI PRINCIPLE
Let's look at a concrete example of Apriori, based on the AllElectronics transaction
database, D, shown in the accompanying table. There are nine transactions in this database; that is, |D| = 9.
Generating Association Rules from Frequent Itemsets
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong association
rules satisfy both minimum support and minimum confidence).
confidence(A⇒B) = P(B|A) = support count(A ∪ B) / support count(A)
The conditional probability is expressed in terms of itemset support count, where support
count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support
count(A) is the number of transactions containing the itemset A. Based on this equation,
association rules can be generated as follows:
1. For each frequent itemset l, generate all nonempty proper subsets of l.
2. For every nonempty proper subset s of l, output the rule s ⇒ (l − s) if
support count(l) / support count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
Because the rules are generated from frequent itemsets, each one automatically satisfies minimum support.
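A sketch of this procedure in Python, reusing the frequent map (frozenset to support count) produced by the earlier apriori() sketch; by the Apriori property every subset of a frequent itemset is itself in the map, so the lookup below cannot fail.

from itertools import combinations

def generate_rules(frequent, min_conf):
    # Collect strong rules (antecedent, consequent, confidence) from a
    # frozenset -> support count map of frequent itemsets.
    rules = []
    for l, l_count in frequent.items():
        if len(l) < 2:
            continue  # a rule needs a nonempty antecedent and a nonempty consequent
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = l_count / frequent[s]  # support count(l) / support count(s)
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules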
FREQUENT ITEM SET GENERATION – FP Growth Algorithm
Mining Frequent Itemsets without Candidate Generation
The first scan of the database is the same as in Apriori: it derives the set of frequent
items (1-itemsets) and their support counts (frequencies). Let the minimum support count
be 2. The set of frequent items is sorted in descending order of support count. The
resulting set or list is denoted L. Thus, we have L = {{I2: 7}, {I1: 6}, {I3: 6},
{I4: 2}, {I5: 2}}.
An FP-tree is then constructed as follows. First, create the root of the tree, labeled with
“null.” Scan database D a second time. The items in each transaction are processed in L
order (i.e., sorted according to descending support count), and a branch is created for each
transaction. For example, the scan of the first transaction, “T100: I1, I2, I5,” which
contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of
the tree with three nodes, <I2: 1>, <I1: 1>, and <I5: 1>, where I2 is linked as a child of the
root, I1 is linked to I2, and I5 is linked to I1.
FREQUENT ITEM SET GENERATION – FP Growth Algorithm
The second transaction, T200, contains the items I2 and I4 in L order, which would result
in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch
would share a common prefix, I2, with the existing path for T100.
Therefore, we instead increment the count of the I2 node by 1, and create a new node,
<I4: 1>, which is linked as a child of <I2: 2>. In general, when considering the branch to
be added for a transaction, the count of each node along a common prefix is incremented
by 1, and nodes for the items following the prefix are created and linked accordingly. To
facilitate tree traversal, an item header table is built so that each item points to its
occurrences in the tree via a chain of node-links. The tree obtained after scanning all of
the transactions is shown in the accompanying figure with the associated node-links. In
this way, the problem of mining frequent patterns in databases is transformed to that of
mining the FP-tree.
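The two-scan construction just described can be sketched as follows. The node layout (item, count, parent and child pointers) and the list-based header table are assumptions of this sketch, not a prescribed data structure; ties in the support-count ordering are broken arbitrarily here.

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions, min_sup_count):
    # Scan 1: find the frequent items and their support counts.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    L = [i for i in sorted(counts, key=counts.get, reverse=True)
         if counts[i] >= min_sup_count]

    root = FPNode(None, None)    # the root, labeled "null"
    header = {i: [] for i in L}  # item -> chain of node-links
    # Scan 2: insert each transaction in L order, sharing common prefixes.
    for t in transactions:
        node = root
        for item in [i for i in L if i in t]:
            if item in node.children:
                node.children[item].count += 1  # shared prefix: bump the count
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header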
FREQUENT ITEM SET GENERATION – FP Growth Algorithm
The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an
initial suffix pattern), construct its conditional pattern base (a "subdatabase,"
which consists of the set of prefix paths in the FP-tree co-occurring with
the suffix pattern), then construct its (conditional) FP-tree, and perform mining
recursively on such a tree. The pattern growth is achieved by the concatenation of
the suffix pattern with the frequent patterns generated from a conditional FP-tree.
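Given the tree and header table from the construction sketch above, collecting a conditional pattern base is just a walk from each node-link up toward the root. In the running example, calling this with item I5 yields the two prefix paths <I2, I1: 1> and <I2, I1, I3: 1> that the text derives next.

def conditional_pattern_base(item, header):
    # Prefix paths (with counts) co-occurring with the item, found by following
    # its chain of node-links and climbing from each occurrence to the root.
    base = []  # list of (prefix_path, count) pairs
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base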
FREQUENT ITEM SET GENERATION – FP Growth Algorithm
Mining of the FP-tree is summarized in the accompanying table and detailed as follows. We first consider
I5, which is the last item in L, rather than the first. The reason for starting at the end of the
list will become apparent as we explain the FP-tree mining process. I5 occurs in two
branches of the FP-tree. (The occurrences of I5 can easily be found by
following its chain of node-links.) The paths formed by these branches are <I2, I1, I5: 1>
and <I2, I1, I3, I5: 1>. Therefore, considering I5 as a suffix, its corresponding two prefix
paths are <I2, I1: 1> and <I2, I1, I3: 1>, which form its conditional pattern base. Its
conditional FP-tree contains only a single path, <I2: 2, I1: 2>; I3 is not included because
its support count of 1 is less than the minimum support count. The single path generates
all the combinations of frequent patterns: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}. For I4, its
two prefix paths form the conditional pattern base, {{I2, I1: 1}, {I2: 1}}, which generates a
single-node conditional FP-tree, <I2: 2>, and derives one frequent
pattern, {I2, I4: 2}. Notice that although I5 follows I4 in the first branch, there is no need
to include I5 in the analysis here because any frequent pattern involving I5 is analyzed in
the examination of I5.
Similar to the above analysis, I3's conditional pattern base is {{I2, I1: 2}, {I2: 2}, {I1:
2}}. Its conditional FP-tree has two branches, <I2: 4, I1: 2> and <I1: 2>, as shown in the
accompanying figure, which generates the set of patterns {{I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}}.
Finally, I1's conditional pattern base is {{I2: 4}}, whose FP-tree contains only one node,
<I2: 4>, which generates one frequent pattern, {I2, I1: 4}. This mining process is
summarized in the accompanying figure.
Compact frequent itemsets - Maximal frequent itemset:
The number of frequent itemsets generated by the Apriori algorithm can often be very
large, so it is beneficial to identify a small representative set from which every frequent
itemset can be derived. One such approach is using maximal frequent itemsets.
A maximal frequent itemset is a frequent itemset for which none of its immediate
supersets are frequent. To illustrate this concept, consider the example given below:
Compact frequent itemsets - Maximal frequent itemset:
The support counts are shown on the top left of each node in the figure. Assume a support
threshold of 50%, that is, each itemset must occur in 2 or more transactions. Based on that
threshold, the frequent itemsets are a, b, c, d, ab, ac, and ad (the shaded nodes).
Out of these 7 frequent itemsets, 3 are identified as maximal frequent (outlined in red in
the figure):
ab: Immediate supersets abc and abd are infrequent.
ac: Immediate supersets abc and acd are infrequent.
ad: Immediate supersets abd and acd are infrequent.
The remaining 4 frequent nodes (a, b, c and d) cannot be maximal frequent because they
all have at least 1 immediate superset that is frequent.
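The example can be reproduced with a short sketch: starting from the seven frequent itemsets, keep exactly those with no frequent proper superset.

# The frequent itemsets from the example above, as sets of single-letter items.
frequent = [frozenset(s) for s in ["a", "b", "c", "d", "ab", "ac", "ad"]]

# Maximal frequent itemsets: those with no frequent proper superset.
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]
print(sorted("".join(sorted(m)) for m in maximal))  # ['ab', 'ac', 'ad']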
Compact frequent itemsets - Closed frequent itemset:
An itemset is closed if none of its immediate supersets has the same support as the
itemset itself. A closed frequent itemset is an itemset that is both closed and frequent.