UNIT - II
ASSOCIATION RULES
By
P.Laxmi
Frequent Item set
• Itemset: a set of items.
Example: {computer, printer, MS Office software} is a 3-itemset;
{milk, bread} is a 2-itemset. Similarly, a set of k items is
called a k-itemset.
• Frequent patterns are patterns that appear frequently in a data
set. Patterns may be itemsets, subsequences or substructures.
Example: a set of items, such as milk and butter, that frequently
appear together in a transaction data set (also called a frequent
itemset).
• Frequent itemset mining leads to the discovery of associations
and correlations among items in large transactional (or)
relational data sets.
• This helps in many business decision-making processes, such as
catalog design and customer shopping behaviour analysis.
Market Basket Analysis
• This is a classic example of frequent itemset mining. The process
analyzes customer buying habits by finding associations
between the different items that customers place in their shopping
baskets.
Data is collected using bar-code scanners in supermarkets. Such
market basket databases consist of a large number of transaction
records. Each record lists all items bought by a customer in a
single purchase.
• Retailers can use the result by placing the items that are
frequently purchased together in proximity to further
encourage the combined sale of such items. In our
example (in the figure), {milk, bread} is frequent, so these
items can be kept in proximity. Retailers could use this data for
adjusting store layouts (placing items optimally with
respect to each other), for cross-selling, for promotions,
and to identify customer segments based on buying
patterns.
• Association rules provide information of this type in the
form of if-then statements. These rules are computed
from the data and, unlike the if-then rules of logic,
association rules are probabilistic in nature.
In association analysis, the antecedent (if) and
the consequent (then), are sets of items (called itemsets)
that are disjoint (do not have any items in common).
Association Rule Mining (ARM)
Association rule mining finds interesting associations and
relationships among large sets of data items
Problem Definition:
The problem of association rule mining is defined as:
Let I = {i1, i2, ..., in} be a set of n binary attributes called items.
Let D = {t1, t2, ..., tm} be a set of transactions called the database.
Each transaction in D has a unique ID and contains a subset of
the items in I.
A rule is defined as an implication of the form X ⟹ Y, where
X, Y ⊆ I and X ∩ Y = ∅
Important Concepts of ARM
• Support: This is the percentage of transactions in D that
contain A⋃B. Here A⋃B means every item in A and every item
in B. Support is also written as P(A⋃B). It is also called
Relative support.
Therefore, Support (A ⟹ B) = P(A⋃B).
• Confidence: This is the percentage of transactions in D
containing A that also contain B. It is also written as P(B|A).
Confidence(A ⟹ B) = P(B|A).
= support(A⋃B) / support(A)
= support count (A⋃B) / support count (A)
• Support count or Frequency: Number of transactions that
contain the item set. It is also called Absolute support.
• The lift of a rule is defined as
lift(A ⟹ B) = supp(A⋃B) / (supp(A) × supp(B))
• The conviction of a rule is defined as
conv(A ⟹ B) = (1 − supp(B)) / (1 − conf(A ⟹ B))
• Any association rules that satisfy both a minimum
support threshold (min_sup) and a minimum confidence
threshold (min_conf) are called strong association rules.
[Note: frequent itemsets are those itemsets that satisfy the
min_sup threshold.
Thresholds like min_sup and min_conf can be set by
users or domain experts.]
Example
I = { Beer, Bread, Jelly, Milk, PeanutButter}
Support (s) of {A} = No. of transactions in which A appears /
Total no. of transactions
Support of {Bread}= 4/5 = 0.8= 80%
support of {Bread, PeanutButter} = 3/5 = 0.6= 60%
Confidence (A ⟹ B) = support(A⋃B) / support(A)
Confidence of {Bread ⟹ PeanutButter} = 0.6/0.8=0.75=75%
Support count (σ): frequency of occurrence of an itemset
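The following Python sketch shows how these measures are computed. The five transactions are hypothetical, chosen only so that the supports quoted above hold (support(Bread) = 4/5, support({Bread, PeanutButter}) = 3/5); they are not the original table.

# Hypothetical transactions, consistent with the supports quoted above.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in `itemset` (relative support).
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # confidence(A => B) = support(A U B) / support(A)
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    # lift(A => B) = support(A U B) / (support(A) * support(B))
    return support(set(antecedent) | set(consequent)) / (support(antecedent) * support(consequent))

print(support({"Bread"}))                       # 0.8
print(support({"Bread", "PeanutButter"}))       # 0.6
print(confidence({"Bread"}, {"PeanutButter"}))  # 0.75
print(lift({"Bread"}, {"PeanutButter"}))        # 1.25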
Frequent Pattern Mining: A Road Map
• Based on completeness of patterns to be mined:
• We can mine the complete set of frequent itemsets
• Closed frequent itemsets and maximal frequent itemsets
given minimum support threshold
• Constrained frequent itemsets (i.e., those that satisfy a set
of user-defined constraints)
• Approximate frequent itemsets (those for which only
approximate support counts are derived)
• Near-match frequent itemsets (those that tally the support
count of the near or almost matching item-sets )
• Based on levels of abstraction involved in the rule set
buys(X, “computer”) ⟹ buys(X, “Printer”)
buys(X, “laptop_computer”) ⟹ buys(X, “Printer”)
“computer” is a higher-level abstraction of “laptop_computer”
• Based on number of data dimensions involved in the rule
• single dimensional association rule
e.g.: buys(X, “milk”) ⟹ buys(X, “bread”)
• Multidimensional association rule: 2 dimensions or
predicates [Inter-dimension assoc. rules (no repeated
predicates)]
e.g.: age(X, ”19-25”) ∧ occupation(X, “student”) ⟹ buys(X, “coke”)
• Hybrid-dimension assoc. rules (repeated predicates)
age(X, ”19-25”) ∧ buys(X, “popcorn”) ⟹ buys(X, “coke”)
• Based on types of values handled in the rule
Boolean association rule
Quantitative association rule (if the rule describes associations
between quantitative items or attributes; these are typically
multidimensional rules)
• Based on kinds of rules to be mined
• Association rules
• Correlation rules (for statistical correlations)
• Based on kinds of patterns to be mined
• Sequential pattern mining
• e.g.: sequential datasets
• Structured pattern mining
• e.g.: Graphs, trees
Association Rule Mining Task
• Given a set of transactions T, the goal of association
rule mining is to find all rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold
• Brute-force approach: List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf
thresholds
• ⇒Computationally prohibitive!
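For d items, the number of possible association rules is 3^d − 2^(d+1) + 1 (each item can go to the antecedent, the consequent, or neither, minus the rules with an empty side). Even d = 6 already gives 3^6 − 2^7 + 1 = 602 candidate rules, which is why the brute-force approach does not scale.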
Mining Association Rules
• Association rule mining can be viewed as a two-step process:
1. Find all frequent item sets
a. Apriori algorithm
b. FP-growth algorithm
2. Generate strong association rules from the frequent item sets.
• By definition, these rules must satisfy minimum support
and minimum confidence.
Frequent Itemset Generation
• Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the
database
• Scanning the database for every candidate is computationally expensive
• Strategies to be followed to generate frequent item sets
could be:
• Reduce the number of candidates
• Reduce the number of transactions
• Reduce the number of comparisons
Reducing Number of Candidates
• Apriori property:
• All nonempty subsets of a frequent itemset must
also be frequent
• Apriori principle holds due to the following property of
the support measure:
• Support of an itemset never exceeds the support of
its subsets
• This is known as the anti-monotone property of
support
The Apriori Algorithm
Example
Consider the following transactional dataset D. For it, we have to find
the frequent itemsets and then generate association rules from them.
Let min. support count = 2
Transactional Dataset D
Example
Step 4: Generate the candidate set C4 from L3 (join step) and scan
D for the count of each candidate.
Therefore C4 = ∅, because the subset {I1, I3, I5} of the itemset
{I1, I2, I3, I5} is not frequent (prune step), so there is no itemset in C4.
Hence the algorithm terminates.
Note: If minimum support is given as a percentage, then
min. support count (as a number) = number_of_transactions ×
(min_support % / 100)
E.g.: There are 10 transactions in the database and the minimum support is
70%. The minimum support count is therefore 10 × 70 / 100 = 7.
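A minimal Python sketch of the Apriori level-wise search is shown below. It assumes transactions are given as sets of items and min_sup is an absolute support count (as in the example above); the function and variable names are our own, not part of any standard library.

from itertools import combinations

def apriori(transactions, min_sup):
    # Level-wise search: L1, L2, ... using the Apriori (anti-monotone) property.
    # `transactions` is a list of sets; `min_sup` is an absolute support count.
    items = {item for t in transactions for item in t}
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    Lk = {s for s, c in counts.items() if c >= min_sup}           # frequent 1-itemsets
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    k = 2
    while Lk:
        # Join step: combine (k-1)-itemsets that together give a k-itemset.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Scan D once per level to count candidate supports.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: n for c, n in counts.items() if n >= min_sup})
        k += 1
    return frequent   # maps frozenset(itemset) -> support count

Note that each level costs one full scan of D; the improvements discussed later (hashing, transaction reduction, partitioning, sampling, dynamic itemset counting) all aim at reducing this cost.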
Generating Association Rules from
Frequent Item sets
• Once the frequent itemsets from the transactions in a database
D have been found, generate strong association rules
from them (where strong association rules satisfy both
minimum support and minimum confidence).
• Association rules can be generated as follows:
Confidence(A ⟹ B)=P(B | A)=support count (A⋃B)/support count (A)
Example
• Let’s consider the frequent itemset, l= {I1, I2, I5}.
The nonempty subsets of l are: {I1}, {I2}, {I5}, {I1,I2},
{I1,I5}, {I2,I5}
For each nonempty subset s, the candidate rule is s ⟹ (l − s); e.g., for
s = {I1}: l − s = {I1, I2, I5} − {I1} = {I2, I5}
• Resulting association rules are as follows:
I1 ⟹ I2 ∧ I5, confidence = 2/6 = 33%
I2 ⟹ I1 ∧ I5, confidence = 2/7 = 29%
I5 ⟹ I1 ∧ I2, confidence = 2/2 = 100%
I1 ∧ I2 ⟹ I5, confidence = 2/4 = 50%
I1 ∧ I5 ⟹ I2, confidence = 2/2 = 100%
I2 ∧ I5 ⟹ I1, confidence = 2/2 = 100%
• If min. confidence is 70%, then only the third, fifth and sixth rules are
strong (and are output).
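The rule-generation step used in this example can be sketched as follows (assuming a dict `frequent` that maps every frequent itemset to its support count, e.g. as returned by the Apriori sketch above; names are illustrative).

from itertools import combinations

def generate_rules(frequent, min_conf):
    # For every frequent itemset l and every nonempty proper subset s, output
    # the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf.
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = sup_l / frequent[s]   # support_count(A U B) / support_count(A)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

With the support counts from this example and min_conf = 0.7, the rules produced for l = {I1, I2, I5} are exactly the three strong rules above: I5 ⟹ I1 ∧ I2, I1 ∧ I5 ⟹ I2 and I2 ∧ I5 ⟹ I1.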
Advantages & Disadvantages of Apriori
algorithm
• Advantages of Apriori algorithm
1. Easy to implement
2. Uses the large-itemset (Apriori) property
• Disadvantages of Apriori algorithm
1. Requires many database scans
2. Very slow
3. Needs a large search space, and the computational cost is
high
Improving the efficiency of Apriori
a) Hash Based Techniques
b) Transaction Reduction
c) Partitioning
d) Dynamic Itemset Counting
e) Sampling
Hash Based Technique
h(x, y)=((order of x)*10+(order of y))mod 7
Summation of each column and each row should be greater than or equal
to min. support count.
Use the join property to combine {I2,I3} and {I2,I4}, which gives {I2,I3,I4}.
{I1,I2} and {I3,I4} are not considered because they cannot be combined with any other itemset.
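The hash-based (DHP-style) idea behind the bucket function h(x, y) above can be sketched as follows: during the first scan, every 2-itemset of every transaction is hashed into one of 7 buckets; a candidate 2-itemset whose bucket count is below min_sup cannot be frequent and is dropped from C2. The order() mapping below (I1 -> 1, I2 -> 2, ...) is an assumption for illustration.

from itertools import combinations

def bucket_counts(transactions, num_buckets=7):
    # Hash every 2-itemset of every transaction into a bucket and count.
    order = lambda item: int(item[1:])            # assumed item ordering: 'I3' -> 3
    h = lambda x, y: (order(x) * 10 + order(y)) % num_buckets
    counts = [0] * num_buckets
    for t in transactions:
        for x, y in combinations(sorted(t, key=order), 2):
            counts[h(x, y)] += 1
    return counts

def prune_c2(c2, transactions, min_sup, num_buckets=7):
    # Keep only candidate 2-itemsets whose bucket count reaches min_sup.
    order = lambda item: int(item[1:])
    counts = bucket_counts(transactions, num_buckets)
    def bucket(pair):
        x, y = sorted(pair, key=order)
        return (order(x) * 10 + order(y)) % num_buckets
    return [pair for pair in c2 if counts[bucket(pair)] >= min_sup]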
Dynamic Itemset Counting
• A dynamic itemset counting technique was proposed in
which the database is partitioned into blocks marked by
start points. In this variation, new candidate itemsets can be
added at any start point, unlike in Apriori, which
determines new candidate itemsets only immediately before
each complete database scan.
• The technique is dynamic in that it estimates the support of
all of the itemsets that have been counted so far, adding
new candidate itemsets if all of their subsets are estimated
to be frequent.
• The resulting algorithm requires fewer database scans than
Apriori.
Mining Frequent Itemsets without
Candidate Generation
• Can we design a method that mines the complete set of
frequent itemsets without candidate generation?
• FP-growth (Frequent pattern growth) is the solution.
• It adopts a divide-and-conquer strategy as follows:
• First, compress the database representing frequent
items into a frequent-pattern tree, or FP-tree, which
retains the itemset association information.
• Then divide the compressed database into a set of
conditional databases (a special kind of projected
database), each associated with one frequent item or
“pattern fragment”, and mine each such database
separately.
Steps followed to mine the frequent pattern
using frequent pattern growth algorithm
1) The first step is to scan the database to find the occurrences of
the itemsets in the database. This step is the same as the first
step of Apriori. The count of 1-itemsets in the database is
called support count or frequency of 1-itemset.
2) The second step is to construct the FP tree. For this, create the
root of the tree. The root is represented by null.
3) The next step is to scan the database again and examine the
transactions. Examine the first transaction and find the items
in it. The item with the maximum count is taken at the top,
then the item with the next lower count, and so on. That is, the
branch of the tree is constructed with the transaction's items
in descending order of count.
4) The next transaction in the database is examined. Its items
are again ordered in descending order of count. If any
item of this transaction is already present in another
branch (for example, from the 1st transaction), then this
transaction's branch shares a common prefix with that branch,
starting from the root.
This means that the new nodes for the remaining items of this
transaction are linked below the common prefix.
5) Also, node counts are incremented as items occur in
the transactions: the count of each common (shared-prefix)
node is increased by 1, while new nodes are created with a
count of 1 and linked according to the transaction.
6) The next step is to mine the constructed FP tree. For this, the
lowest node (least frequent item) is examined first, along with
its node-links. The lowest node represents a frequent pattern
of length 1. From it, traverse the paths in the FP tree; these
paths are called the conditional pattern base.
A conditional pattern base is a sub-database consisting of the
prefix paths in the FP tree that co-occur with the lowest node
(the suffix).
7) Construct a Conditional FP Tree, which is formed by a
count of itemsets in the path. The itemsets meeting the
threshold support are considered in the Conditional FP
Tree.
8) Frequent Patterns are generated from the Conditional FP
Tree.
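Steps 2-5 can be condensed into the following FP-tree construction sketch (the node structure and names are our own, for illustration only). Items within each transaction are reordered by descending global support count, and transactions sharing a prefix share nodes whose counts are incremented.

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                 # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # Scan 1: support count of every 1-itemset.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}

    root = FPNode(None, None)              # the root is labelled "null"
    header = {}                            # item -> list of nodes (node-link chain)

    # Scan 2: insert each transaction with items in descending frequency order.
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:              # new node: create it and add a node-link
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1               # shared-prefix nodes just gain count
            node = child
    return root, header

Mining (steps 6-8) then works bottom-up from the header table: for each item, its node-links give the prefix paths that form its conditional pattern base.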
Example
• Mining of FP-tree is summarized below:
• We first consider I5, which is the last item in L, rather than the
first.
• I5 occurs in two branches of the FP-tree. The occurrences of I5
can easily be found by following its chain of node-links.
• The paths formed by these branches are (I2, I1,I5: 1) and (I2,
I1, I3, I5: 1). Therefore, considering I5 as a suffix, its
corresponding two prefix paths are (I2, I1: 1) and (I2, I1, I3: 1),
which form its conditional pattern base. Its conditional FP-tree
contains only a single path, (I2: 2, I1: 2); I3 is not included
because its support count of 1 is less than the minimum support
count.
The single path generates all the combinations of frequent
patterns: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}.
Advantages:
1. This algorithm needs to scan the database only twice when
compared to Apriori which scans the transactions for each
iteration.
2. The database is stored in a compact version in memory.
3. It is efficient and scalable for mining both long and short
frequent patterns.
Disadvantages:
1. When the database is large, the FP-tree may not fit in
main memory.
2. Constructing the FP-tree can be expensive.
Differences Between FP-growth and
Apriori
Mining Frequent Itemsets Using Vertical
Data Format
• Both Apriori and FP-growth methods mine frequent patterns
from a set of transactions in TID-itemset format (that
is{TID:itemset}).
• This data format is known as horizontal data format.
• Data can also be represented in item-TID_set format (that is,
{item:TID_set}), where item is an item name and TID_set is the
set of transaction identifiers containing the item.
• This format is known as vertical data format.
• ECLAT (Equivalence CLASS Transformation) algorithm is
used to efficiently mine frequent itemsets using vertical data
format.
ECLAT Example
• Mining is performed on the data set by
intersecting the TID_sets of every pair of
frequent single itemsets.
• Advantages:
1. Depth-first search reduces memory requirements.
2. Usually (considerably) faster than Apriori.
3. No need to scan the database to find the support of (k+1)
itemsets, for k>=1.
• Disadvantage:
1. The TID-sets can be quite long, hence expensive to
manipulate.
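A minimal sketch of this vertical-format search: start from the TID_set of every single item and obtain the support of longer itemsets purely by intersecting TID_sets, with no further database scans (the function and variable names are illustrative).

def eclat(vertical, min_sup, prefix=frozenset(), out=None):
    # `vertical` maps item -> set of TIDs; `min_sup` is an absolute support count.
    # Depth-first: extend `prefix` one item at a time; the TID_set of a larger
    # itemset is the intersection of the TID_sets of its generators.
    if out is None:
        out = {}
    items = sorted(vertical)
    for i, item in enumerate(items):
        tids = vertical[item]
        if len(tids) < min_sup:
            continue
        itemset = prefix | {item}
        out[itemset] = len(tids)
        # Conditional vertical database for extensions of `itemset`.
        suffix = {other: tids & vertical[other] for other in items[i + 1:]}
        eclat(suffix, min_sup, itemset, out)
    return out

# Usage (hypothetical vertical database):
# eclat({"I1": {1, 4, 5}, "I2": {1, 2, 3}}, min_sup=2)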
Frequent, Closed, Maximal Itemset
• The lattice diagram above shows the maximal, closed and
frequent itemsets. The itemsets that are circled with blue are the
frequent itemsets. The itemsets that are circled with the thick blue
are the closed frequent itemsets. The itemsets that are circled with
the thick blue and have the yellow fill are the maximal frequent
itemsets. To determine which of the frequent itemsets are
closed, check whether they have the same support as any of
their supersets; if they do, they are not closed.
For example, ad is a frequent itemset but has the same support
as abd, so it is NOT a closed frequent itemset; c, on the other hand,
is a closed frequent itemset because all of its supersets (ac, bc, and
cd) have supports less than 3.
As you can see there are a total of 9 frequent itemsets, 4 of them
are closed frequent itemsets and out of these 4, 2 of them are
maximal frequent itemsets. This brings us to the relationship
between the three representations of frequent itemsets.
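Given the support count of every frequent itemset, the closed/maximal distinction described above can be checked mechanically, as in this small sketch (the itemset supports would come from the lattice; names are illustrative).

def classify(frequent):
    # `frequent` maps frozenset(itemset) -> support count.
    # Closed:  no proper superset with the same support.
    # Maximal: no proper superset that is frequent at all.
    closed, maximal = set(), set()
    for s, sup in frequent.items():
        supersets = [t for t in frequent if s < t]
        if all(frequent[t] < sup for t in supersets):
            closed.add(s)
        if not supersets:
            maximal.add(s)
    return closed, maximal

Every maximal frequent itemset is also closed, reflecting the containment maximal ⊆ closed ⊆ frequent mentioned above.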
Example
Mining Closed Frequent Itemsets
• Item Merging - If every transaction containing a frequent
itemset X also contains an itemset Y but not any proper
superset of Y, then X ∪ Y forms a frequent closed itemset
and there is no need to search for any itemset
containing X but not Y.
• Sub-itemset pruning: If a frequent itemset X is a proper
subset of an already found frequent closed itemset Y and
support_count(X) = support_count(Y), then X and all of
X's descendants in the set enumeration tree cannot be
frequent closed itemsets and can be pruned.
• Item Skipping –
In depth-first mining of closed itemsets, each prefix itemset X
is associated with a header table and a projected database.
• If a local frequent item p has the same support in several header
tables at different levels, one can safely prune p from the header
tables at higher levels.
E.g., if a2 has the same support in the global header table and in a1's
projection, a2 can be pruned from the higher-level (global) header table.
• Closure Checking
Check if superset / subset of already found closed frequent
itemsets with same support
• Superset Checking
Handled in Item Merging
• Subset Checking
Pattern tree – maintain set of closed itemsets mined so far
(Similar to FP tree)
The current itemset (Sc) is subsumed by an already found
closed itemset (Sa) if:
both have the same support,
the length of Sc is smaller than that of Sa, and
all items in Sc are contained in Sa.
Mining Various Kinds of Association Rules
• Mining Multilevel association Rules
Using Uniform Minimum Support
Using Reduced Minimum Support at Lower levels
Using Item or Group Based Support
Mining Multidimensional Association Rules
• Based on number of data dimensions involved in the rule
• single dimensional association rule
e.g.: buys(X, “milk”) ⟹ buys(X, “bread”)
• Multidimensional association rule: 2 dimensions or
predicates [Inter-dimension assoc. rules (no repeated
predicates)]
e.g.: age(X, ”19-25”) ∧ occupation(X, “student”) ⟹
buys(X, “coke”)
• Hybrid-dimension assoc. rules (repeated predicates)
age(X, ”19-25”) ∧ buys(X, “popcorn”) ⟹ buys(X,
“coke”)
Quantitative Attributes Discretized Using
Predefined Concept Hierarchy
Mining Quantitative Association Rules
ARCS (Association Rules Clustering System)
From Association Mining to Correlation
Analysis
• Correlation is the relationship that exists between two or more
variables.
• When a change in one variable is accompanied by a change in
another variable, the two variables are said to be correlated.
From Association Analysis To Correlation Analysis
Constraint-Based Association Mining
Rule constraints
• Rule constraints can be classified into five categories:
antimonotonic, monotonic, succinct, convertible, and
inconvertible.
• A constraint Ca is anti-monotone iff. for any pattern S not
satisfying Ca, none of the super-patterns of S can satisfy Ca
• A constraint Cm is monotone iff. for any pattern S satisfying Cm,
every super-pattern of S also satisfies it.
• A subset of items Is is a succinct set if it can be expressed as
σp(I) for some selection predicate p, where σ is the selection
operator.
• A constraint C is convertible anti-monotone iff a pattern S
satisfying the constraint implies that each suffix of S w.r.t. R also
satisfies C.
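As a small illustration of an anti-monotone rule constraint being pushed into mining (the price table and threshold below are invented for the example), a constraint such as sum(S.price) <= v can be tested as soon as a candidate itemset is formed: once an itemset violates it, every superset also violates it and can be pruned immediately.

# Hypothetical item prices, used only to illustrate an anti-monotone constraint.
price = {"I1": 40, "I2": 70, "I3": 30, "I4": 90, "I5": 20}

def satisfies_sum_le(itemset, v=100):
    # Anti-monotone constraint sum(S.price) <= v: if S fails it, so does every
    # superset of S, so S and all its extensions can be pruned during mining.
    return sum(price[i] for i in itemset) <= v

# {"I1", "I2"} has total price 110 > 100, so it is pruned and no superset such
# as {"I1", "I2", "I3"} ever needs to be generated or counted.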
Metarule-Guided Mining of Association
Rules
Constraint Pushing