Mining Infrequent Patterns
JOHAN BJARNLE (JOHBJ551)
PETER ZHU (PETZH912)
LINKÖPING UNIVERSITY, 2009
TNM033 DATA MINING
Contents
1 Introduction
2 Techniques
  2.1 Negative Patterns
  2.2 Negatively Correlated Patterns
  2.3 Mining Negative Patterns
  2.4 Mining Negatively Correlated Patterns
    2.4.1 Concept Hierarchy
    2.4.2 Indirect Association
3 Applications
4 Conclusion
References
1 Introduction
In data mining, the focus is often on frequent patterns. Although the majority of the most interesting patterns lie among the frequent ones, there are important patterns that are ignored by this approach. These are called infrequent patterns.
Take for example the sale of VHS tapes and DVDs. Few people buy both of them, so in terms of data mining the itemset {VHS, DVD} is infrequent and therefore ignored. However, people who buy DVDs tend not to buy VHS tapes, and vice versa. The two items are competing, and an interesting pattern has been found.
Data mining comes with many concepts and terms. To follow the concepts and definitions in this report, one should be familiar with two basic measures: support and confidence.
Support is defined as how frequent an itemset is among all the records:

SUPPORT: $s(X) = \dfrac{\sigma(X)}{|T|}$

where $\sigma(X)$ is the number of occurrences of itemset X and $|T|$ is the total number of records.
Confidence is defined as the conditional probability $P(X \mid Y)$, which measures how often the itemset X appears in records that contain itemset Y:

CONFIDENCE: $c = \dfrac{\sigma(X \cup Y)}{\sigma(Y)}$

where $\sigma(X \cup Y)$ and $\sigma(Y)$ are the numbers of occurrences of the respective itemsets.
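As a concrete illustration (our own sketch in Python, not taken from [1]), both measures can be computed directly from a list of records, each represented as a set of items:

```python
# A minimal sketch of the support and confidence measures.
# Records are represented as sets of items; the data below is made up.
records = [{"VHS"}, {"DVD"}, {"DVD"}, {"VHS", "DVD"}, {"DVD", "TV"}]

def support(itemset, records):
    """s(X): fraction of records that contain every item in the itemset."""
    return sum(1 for r in records if set(itemset) <= r) / len(records)

def confidence(x, y, records):
    """P(X | Y): how often itemset x appears in records containing itemset y."""
    return support(set(x) | set(y), records) / support(y, records)

print(support({"DVD"}, records))              # 0.8
print(confidence({"VHS"}, {"DVD"}, records))  # 0.25
```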
Infrequent patterns are itemsets or rules that do not pass the minsup criterion. The minsup threshold is set manually and separates frequent from infrequent patterns. If minsup is set too high, itemsets involving interesting rare items can be missed. For example, in a shopping mall, itemsets containing expensive colognes might be ignored, causing potentially interesting patterns to be missed.
DEFINITION 1: An infrequent pattern is an itemset or a rule whose support is less than the minsup
threshold.
Most infrequent patterns are not interesting, and they account for a very large number of itemsets. Algorithms are therefore needed that filter out the interesting patterns while remaining computationally efficient. The definitions in this report were extracted from Introduction to Data Mining [1].
2 Techniques
2.1 Negative Patterns
Given an itemset $I = \{i_1, i_2, \dots, i_d\}$, $\bar{i}_k$ denotes the absence of item $i_k$ from a record. For example, $\overline{\text{coffee}}$ corresponds to the absence of coffee. With the negative item defined, a negative itemset can be described as an itemset with both positive and negative items that passes the minsup threshold.

DEFINITION 2: A negative itemset X contains positive items A and negative items $\bar{B}$,

$X = A \cup \bar{B}$,

where $|\bar{B}| \geq 1$ and $s(X) \geq \text{minsup}$.
From these negative itemsets, negative association rules can be defined.
DEFINITION 3: A negative association rule has the following three properties: the rule is extracted from a negative itemset, the support of the rule is greater than or equal to minsup, and the confidence of the rule is greater than or equal to minconf.
Such rules will be denoted as $\text{coffee} \rightarrow \overline{\text{tea}}$, which means that people who buy coffee tend not to buy tea. Negative itemsets and negative association rules will be referred to as negative patterns in this report.
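A candidate rule can be validated against Definition 3 using the measures from the introduction. The sketch below is our own illustration; it assumes negative items are encoded as distinct symbols (e.g. "not_tea"), a convention of ours rather than of the report:

```python
def is_negative_association_rule(x, y, records, minsup, minconf):
    """Definition 3: the rule x -> y must meet both the support and the
    confidence threshold. Negative items are encoded as plain symbols
    such as "not_tea", so the usual measures apply unchanged."""
    rule_support = support(x | y, records)
    rule_confidence = rule_support / support(x, records)  # confidence of x -> y
    return rule_support >= minsup and rule_confidence >= minconf
```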
2.2 Negatively Correlated Patterns
The basic idea of a negatively correlated pattern is that its actual support is less than its expected support, as stated in the definitions below:
DEFINITION 4: An itemset X is negatively correlated if

$s(X) < \prod_{j=1}^{k} s(x_j) = s(x_1) \times s(x_2) \times \dots \times s(x_k)$,

where $s(x_j)$ is the support of an item $x_j \in X$.
The right-hand side, the expected support, is an estimate of the support of X under the assumption that all the items in X are statistically independent. The smaller the actual support is compared to the expected support, the more negatively correlated the pattern is.
DEFINITION 5: An association rule $X \rightarrow Y$ is negatively correlated if

$s(X \cup Y) < s(X)\,s(Y)$,

where X and Y are disjoint itemsets, that is $X \cap Y = \emptyset$. This definition, however, only provides a partial condition for negative correlation between the items in X and Y. The full condition is stated below:

$s(X \cup Y) < \prod_{i} s(x_i) \prod_{j} s(y_j)$,

where $x_i \in X$ and $y_j \in Y$.
The negatively correlated itemsets and association rules will be referred to as negatively
correlated patterns.
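Definition 4 translates directly into code. A minimal sketch (our own illustration, reusing the support function from the introduction):

```python
from math import prod

def is_negatively_correlated(itemset, records):
    """Definition 4: the actual support of the itemset is smaller than
    the product of the individual item supports (the expected support)."""
    actual = support(itemset, records)
    expected = prod(support({item}, records) for item in itemset)
    return actual < expected
```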
These types of patterns are closely related. Note that infrequent patterns and negatively correlated patterns refer to itemsets that only contain positive items. The relationship is shown in the figure below.
Figure 1. Comparison between infrequent patterns, negative patterns,
and negatively correlated patterns.
2.3 Mining Negative Patterns
This technique utilizes a binary table, where every record is translated into binary values. Considering the negative patterns described earlier, the records can be binarized by augmenting them with negative items. The tables below demonstrate how this works.
RecordID  Items
1         {A, B}
2         {B, C}
3         {A}

Table 1. Original records table.

RecordID  A  Ā  B  B̄  C  C̄
1         1  0  1  0  0  1
2         0  1  1  0  1  0
3         1  0  0  1  0  1

Table 2. The records translated to binary values.
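A sketch of the binarization step (our own illustration; the not_X column naming for negative items is our convention):

```python
def binarize(records, items):
    """Translate records into binary rows augmented with negative items:
    for each item i, column i is 1 if i is present and column not_i is 1
    if i is absent, exactly as in Table 2."""
    rows = []
    for r in records:
        row = {}
        for i in items:
            row[i] = 1 if i in r else 0
            row["not_" + i] = 1 if i not in r else 0
        rows.append(row)
    return rows

records = [{"A", "B"}, {"B", "C"}, {"A"}]
for row in binarize(records, ["A", "B", "C"]):
    print(row)  # e.g. {'A': 1, 'not_A': 0, 'B': 1, 'not_B': 0, 'C': 0, 'not_C': 1}
```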
The binary table can then be combined with an algorithm such as Apriori [2] to derive all the negative itemsets. This technique is only usable when the data set covers few items, for the following reasons:
1. The number of items is doubled, since every item gets a corresponding negative item. When exploring the itemsets, a larger space than the usual $2^d$ has to be evaluated, where d is the number of items in the original data set.
2. Pruning based on support is no longer effective, since for every item, either $x$ or $\bar{x}$ has a support greater than or equal to 50%.
3. The width of each record increases when the corresponding negative item is added for each positive one. If the total number of items is d, an original record is typically much narrower than d; in a food market, for example, a record is almost guaranteed not to include all items. With this technique, however, the width of each record becomes exactly d.
Many existing algorithms cannot handle data sets of this size and will eventually break down.
To decrease the computational cost, the support of a negative itemset can be computed from the supports of the positive parts only:

EQUATION 1: $s(A \cup \bar{B}) = s(A) + \sum_{k=1}^{n} \left\{ (-1)^k \times \sum_{C \subseteq B,\ |C| = k} s(A \cup C) \right\}$

where $n = |B|$.
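Equation 1 is an inclusion-exclusion sum over the subsets of B. A minimal sketch (our own illustration, reusing the support function from the introduction):

```python
from itertools import combinations

def support_negative(A, B, records):
    """Equation 1: s(A ∪ B̄) computed from positive itemsets only.
    A is the set of positive items; B holds the items to be negated."""
    total = support(A, records)  # the k = 0 term is simply s(A)
    for k in range(1, len(B) + 1):
        for C in combinations(B, k):
            total += (-1) ** k * support(A | set(C), records)
    return total

# Records with A but without B: only {A}, so the support is 1/3.
print(support_negative({"A"}, {"B"}, [{"A", "B"}, {"B", "C"}, {"A"}]))
```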
2.4 Mining Negatively Correlated Patterns
The second technique is based on the expected support. An infrequent pattern is considered to be
interesting if the actual support is less than the expected support. Two approaches to calculate the
expected support will be discussed. The first is based upon a concept hierarchy and the second uses a
neighborhood-based approach.
2.4.1 Concept Hierarchy
The usual support calculation (see Section 1) is not a sufficient measure for filtering away uninteresting infrequent patterns. Consider an itemset such as {TV, coke}, whose individual items are frequent. The itemset is infrequent due to low support and perhaps negatively correlated, but it is uninteresting because the items belong to different categories. To an expert this may seem obvious, but for a computer to ignore these kinds of infrequent patterns, a good measure is required.
Figure 2. Concept hierarchy
This approach is based on the assumption that products within the same product family interact with other items in similar ways. For example, if {Cookies, Pork} (see Figure 2) is a frequent itemset, then its children are expected to have similar relations. The expected support for {Oatmeal, Bacon} can then be calculated with the following equation:
EXAMPLE 1: $E[s(\{\text{Oatmeal, Bacon}\})] = s(\{\text{Cookies, Pork}\}) \times \dfrac{s(\text{Oatmeal})}{s(\text{Cookies})} \times \dfrac{s(\text{Bacon})}{s(\text{Pork})}$
O(VWW9TO) O([W\)
If the actual support for {Oatmeal, Bacon} is less than the expected support calculated above then we
have found an interesting infrequent pattern.
The algorithm for this case where we only consider direct children of the frequent itemset (in example 1
{Cookies, Pork}) can be generalized into the following:
EQUATION 2: $E[s(\{a, b, \dots, t\})] = s(\hat{a} \cup \hat{b} \cup \dots \cup \hat{t}) \times \dfrac{s(a) \times s(b) \times \dots \times s(t)}{s(\hat{a}) \times s(\hat{b}) \times \dots \times s(\hat{t})}$

where $\{\hat{a}, \hat{b}, \dots, \hat{t}\}$ is a frequent itemset and $\hat{a}$ denotes the parent of $a$.
In the second case we consider some of the items in the frequent itemset together with children of the other items, for example {Cookies, Bacon}. The general equation for this is:

EQUATION 3: $E[s(\{a, b, c, \dots, t\})] = s(a \cup b \cup \hat{c} \cup \dots \cup \hat{t}) \times \dfrac{s(c) \times \dots \times s(t)}{s(\hat{c}) \times \dots \times s(\hat{t})}$
The third case considers some of the items in the frequent itemset together with siblings of the other items, for example {Cookies, Chicken}. The general equation for this is:

EQUATION 4: $E[s(\{a, b, c, \dots, t\})] = s(a \cup b \cup c' \cup \dots \cup t') \times \dfrac{s(c) \times \dots \times s(t)}{s(c') \times \dots \times s(t')}$

where $t'$ denotes a sibling of $t$.
To keep the example simple, we only considered frequent itemsets containing two items; when mining real data this is not the case. These three cases are the only interesting ones when mining for negative itemsets; for example, we do not consider itemsets consisting only of siblings.
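As an illustration of Equation 2 (the all-children case), the expected support can be computed from a child-to-parent map. This is our own sketch; the parent dictionary is a hypothetical fragment of a hierarchy like the one in Figure 2:

```python
# Hypothetical concept hierarchy fragment: child -> parent.
parent = {"Oatmeal": "Cookies", "Chocolate Chip": "Cookies",
          "Bacon": "Pork", "Ham": "Pork"}

def expected_support(children, records):
    """Equation 2: take the support of the parent itemset and scale it
    by s(child) / s(parent) for every item. Assumes every parent
    occurs in the data (non-zero support)."""
    parents = {parent[c] for c in children}
    estimate = support(parents, records)
    for c in children:
        estimate *= support({c}, records) / support({parent[c]}, records)
    return estimate

# {Oatmeal, Bacon} is interesting if its actual support falls below the estimate:
# support({"Oatmeal", "Bacon"}, records) < expected_support({"Oatmeal", "Bacon"}, records)
```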
2.4.2 Indirect Association
In the previous section we described how to calculate the expected support using a concept hierarchy. This technique instead uses indirect association to obtain the expected support. Indirect association means relating two items through a set of items they have in common, the mediator set.
Figure 3. Indirect association between a and b
through a mediator set Y.
DEFINITION 6: A pair of items a, b is indirectly associated through a mediator set Y if the following conditions hold:

1. $s(\{a, b\}) < t_s$ (itempair support condition).
2. $\exists Y \neq \emptyset$ such that:
   I. $s(\{a\} \cup Y) \geq t_f$ and $s(\{b\} \cup Y) \geq t_f$ (mediator support condition).
   II. $d(\{a\}, Y) \geq t_d$ and $d(\{b\}, Y) \geq t_d$, where $d(X, Z)$ is an objective measure of the association between X and Z (mediator dependence condition).
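Definition 6 can be checked almost verbatim in code. In the sketch below (our own illustration) we use the interest factor $s(X \cup Z) / (s(X)\,s(Z))$ as the objective measure d; both that choice and the threshold values are hypothetical:

```python
def interest(x, z, records):
    """One possible objective measure d(X, Z): the interest factor."""
    return support(x | z, records) / (support(x, records) * support(z, records))

def indirectly_associated(a, b, Y, records, t_s=0.1, t_f=0.3, t_d=1.2):
    """Definition 6: items a and b are indirectly associated via mediator Y."""
    if support({a, b}, records) >= t_s:            # 1. itempair support condition
        return False
    if min(support({a} | Y, records),
           support({b} | Y, records)) < t_f:       # 2.I mediator support condition
        return False
    return (interest({a}, Y, records) >= t_d and
            interest({b}, Y, records) >= t_d)      # 2.II mediator dependence condition
```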
3 Applications
Infrequent patterns can be used in many applications.
• In text mining, indirect associations can be used to find synonyms, antonyms, or words that are used in different contexts. For example, the word data might be indirectly associated with the word gold through the mediator mining.
• In the market basket domain, indirect associations can be used to find competing items, such as desktop computers and laptops, where people who buy desktop computers tend not to buy laptops.
• Infrequent patterns can be used to detect errors. For example, if {Fire = Yes} is frequent but {Fire = Yes, Alarm = On} is infrequent, then the alarm system is probably faulty.
When evaluating Weka [3], we could not find any implementation for mining infrequent patterns.
4 Conclusion
The techniques discussed in this report show that we can find interesting infrequent patterns that would not be discovered with the usual algorithms. One may believe that these techniques are computationally heavy, but that is not the case. Savasere et al. [4] ran the negative correlation algorithm on a SPARCstation 5, a machine released almost fifteen years ago [5]. Completing the algorithm over 50,000 records, with an average of ten items per record, took between 300 and 1,000 seconds. On a modern computer the same operations would probably finish within seconds.
It is noteworthy that negative and negatively correlated patterns do not have to be infrequent; see Figure 1.
It is a shame that Weka does not contain any of the discussed algorithms for finding infrequent patterns. However, Apriori is implemented, and since the binary table technique uses Apriori to find negative patterns, it should be possible to extend the current Apriori implementation in Weka.
References
[1] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Addison-Wesley, pages 457-472, 2006.
[2] Aida Vitoria, Lecture 7, PowerPoint slides, Linköping University, 2009.
[3] Weka, http://www.cs.waikato.ac.nz/ml/weka/, 2009.
[4] Ashok Savasere, Edward Omiecinski, Shamkant Navathe, Mining for Strong Negative Associations in a Large Database of Customer Transactions. In Proc. of the 14th Intl. Conf. on Data Engineering, pages 494-502, 1998.
[5] Wikipedia, http://en.wikipedia.org/wiki/SPARCstation_5, 2009.