Mining Infrequent Patterns
JOHAN BJARNLE (JOHBJ551)
PETER ZHU (PETZH912)
LINKÖPING UNIVERSITY, 2009
TNM033 DATA MINING
Contents
1 Introduction
2 Techniques
  2.1 Negative Patterns
  2.2 Negatively Correlated Patterns
  2.3 Mining Negative Patterns
  2.4 Mining Negatively Correlated Patterns
    2.4.1 Concept Hierarchy
    2.4.2 Indirect Association
3 Applications
4 Conclusion
References
1 Introduction
In data mining, the focus is often on frequent patterns. Although the majority of the most interesting patterns lie among the frequent ones, there are important patterns that are ignored by this approach. These are called infrequent patterns.
Take for example the sale of VHS tapes and DVDs. Few people buy both of them, so in terms of data mining the itemset {VHS, DVD} is infrequent and therefore ignored. However, people who buy DVDs tend not to buy VHS tapes, and vice versa. The two items are competing, and an interesting pattern has been found.
Data mining comes with many concepts and terms. To follow the concepts and definitions in this report, one should be familiar with two basic measures: support and confidence.
Support is defined as how frequent an itemset is among all the records:

SUPPORT: $s(X) = \dfrac{\sigma(X)}{|T|}$

where $\sigma(X)$ is the number of occurrences of itemset X and $|T|$ is the total number of records.
Confidence is defined as the conditional probability $P(X \mid Y)$, which measures how often the itemset X appears in records that contain itemset Y:

CONFIDENCE: $c = \dfrac{\sigma(X \cup Y)}{\sigma(Y)}$

where $\sigma(X \cup Y)$ and $\sigma(Y)$ are the numbers of occurrences of the respective itemsets.
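As a concrete illustration (our own sketch in Python, not taken from [1]), both measures can be computed directly from a list of records, each represented as a set of items:

```python
# A minimal sketch of the support and confidence measures.
# Records are represented as sets of items; the data below is made up.
records = [{"VHS"}, {"DVD"}, {"DVD"}, {"VHS", "DVD"}, {"DVD", "TV"}]

def support(itemset, records):
    """s(X): fraction of records that contain every item in the itemset."""
    return sum(1 for r in records if set(itemset) <= r) / len(records)

def confidence(x, y, records):
    """P(X | Y): how often itemset x appears in records containing itemset y."""
    return support(set(x) | set(y), records) / support(y, records)

print(support({"DVD"}, records))              # 0.8
print(confidence({"VHS"}, {"DVD"}, records))  # 0.25
```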
Infrequent patterns are itemsets or rules that do not pass the minsup criterion. The minsup threshold is set manually and separates frequent from infrequent patterns. If minsup is set too high, itemsets involving interesting rare items can be missed. For example, in a shopping mall, itemsets containing expensive colognes might be ignored, causing potentially interesting patterns to be missed.
DEFINITION 1: An infrequent pattern is an itemset or a rule whose support is less than the minsup
threshold.
Most infrequent patterns are not interesting, and they account for a very large number of itemsets. Algorithms are therefore needed that filter out the interesting patterns while remaining computationally efficient. The definitions in this report were extracted from Introduction to Data Mining [1].
2 Techniques
2.1 Negative Patterns
Given an itemset $I = \{i_1, i_2, \dots, i_d\}$, $\bar{i}_k$ denotes the absence of item $i_k$ from a record. For example, $\overline{\text{coffee}}$ corresponds to the absence of coffee. With the negative item defined, a negative itemset can be described as an itemset with both positive and negative items that passes the minsup threshold.

DEFINITION 2: A negative itemset X contains positive items A and negative items $\bar{B}$,

$X = A \cup \bar{B}$,

where $|\bar{B}| \geq 1$ and $s(X) \geq \text{minsup}$.
From these negative itemsets, negative association rules can be defined.
DEFINITION 3: A negative association rule has the following three properties: the rule is extracted from a negative itemset, the support of the rule is greater than or equal to minsup, and the confidence of the rule is greater than or equal to minconf.
Such rules will be denoted as $\text{coffee} \rightarrow \overline{\text{tea}}$, which means that people who buy coffee tend not to buy tea. Negative itemsets and negative association rules will be referred to as negative patterns in this report.
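A candidate rule can be validated against Definition 3 using the measures from the introduction. The sketch below is our own illustration; it assumes negative items are encoded as distinct symbols (e.g. "not_tea"), a convention of ours rather than of the report:

```python
def is_negative_association_rule(x, y, records, minsup, minconf):
    """Definition 3: the rule x -> y must meet both the support and the
    confidence threshold. Negative items are encoded as plain symbols
    such as "not_tea", so the usual measures apply unchanged."""
    rule_support = support(x | y, records)
    rule_confidence = rule_support / support(x, records)  # confidence of x -> y
    return rule_support >= minsup and rule_confidence >= minconf
```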
2.2 Negatively Correlated Patterns
The basic idea of a negatively correlated pattern is that its actual support is less than its expected support, as stated in the definitions below:
DEFINITION 4: An itemset X is negatively correlated if

$s(X) < \prod_{j=1}^{k} s(x_j) = s(x_1) \times s(x_2) \times \dots \times s(x_k)$,

where $s(x_j)$ is the support of an item $x_j \in X$.
The right-hand side, the expected support, is an estimate of the support of X under the assumption that all the items in X are statistically independent. The smaller the actual support is compared to the expected support, the more negatively correlated the pattern is.
DEFINITION 5: An association rule $X \rightarrow Y$ is negatively correlated if

$s(X \cup Y) < s(X)\,s(Y)$,

where X and Y are disjoint itemsets, that is $X \cap Y = \emptyset$. This definition, however, only provides a partial condition for negative correlation between the items in X and Y. The full condition is stated below:

$s(X \cup Y) < \prod_{i} s(x_i) \prod_{j} s(y_j)$,

where $x_i \in X$ and $y_j \in Y$.
The negatively correlated itemsets and association rules will be referred to as negatively
correlated patterns.
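Definition 4 translates directly into code. A minimal sketch (our own illustration, reusing the support function from the introduction):

```python
from math import prod

def is_negatively_correlated(itemset, records):
    """Definition 4: the actual support of the itemset is smaller than
    the product of the individual item supports (the expected support)."""
    actual = support(itemset, records)
    expected = prod(support({item}, records) for item in itemset)
    return actual < expected
```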
These types of patterns are closely related. Note that infrequent patterns and negatively correlated patterns refer to itemsets that only contain positive items. The relationship is shown in the figure below.
Figure 1. Comparison between infrequent patterns, negative patterns,
and negatively correlated patterns.
2.3 Mining Negative Patterns
This technique utilizes a binary table, where every record is translated into binary values. Considering the negative patterns described earlier, the records can be binarized by augmenting them with negative items. The tables below demonstrate how this works.
RecordID  Items
1         {A, B}
2         {B, C}
3         {A}

Table 1. Original records table.

RecordID  A  Ā  B  B̄  C  C̄
1         1  0  1  0  0  1
2         0  1  1  0  1  0
3         1  0  0  1  0  1

Table 2. The records translated to binary values.
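A sketch of the binarization step (our own illustration; the not_X column naming for negative items is our convention):

```python
def binarize(records, items):
    """Translate records into binary rows augmented with negative items:
    for each item i, column i is 1 if i is present and column not_i is 1
    if i is absent, exactly as in Table 2."""
    rows = []
    for r in records:
        row = {}
        for i in items:
            row[i] = 1 if i in r else 0
            row["not_" + i] = 1 if i not in r else 0
        rows.append(row)
    return rows

records = [{"A", "B"}, {"B", "C"}, {"A"}]
for row in binarize(records, ["A", "B", "C"]):
    print(row)  # e.g. {'A': 1, 'not_A': 0, 'B': 1, 'not_B': 0, 'C': 0, 'not_C': 1}
```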
The binary table can then be combined with an algorithm such as Apriori [2] to derive all the negative itemsets. This technique is only usable when the data set covers few items, for the following reasons:
1. The number of items is doubled, since every item gets a corresponding negative item. When exploring the itemsets, a larger space than the usual $2^d$ has to be evaluated, where d is the number of items in the original data set.
2. Pruning based on support is no longer effective, since for every item, either $x$ or $\bar{x}$ has a support greater than or equal to 50%.
3. The width of each record increases when the corresponding negative item is added for each positive one. If the total number of items is d, an original record is typically much narrower than d; in a food market, for example, a record is almost guaranteed not to include all items. With this technique, however, the width of each record becomes exactly d.
Many existing algorithms cannot handle data sets of this size and will eventually break down.
To decrease the computational cost, the support of a negative itemset can be computed from the supports of the positive parts only:

EQUATION 1: $s(A \cup \bar{B}) = s(A) + \sum_{k=1}^{n} \left\{ (-1)^k \times \sum_{C \subseteq B,\ |C| = k} s(A \cup C) \right\}$

where $n = |B|$.
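Equation 1 is an inclusion-exclusion sum over the subsets of B. A minimal sketch (our own illustration, reusing the support function from the introduction):

```python
from itertools import combinations

def support_negative(A, B, records):
    """Equation 1: s(A ∪ B̄) computed from positive itemsets only.
    A is the set of positive items; B holds the items to be negated."""
    total = support(A, records)  # the k = 0 term is simply s(A)
    for k in range(1, len(B) + 1):
        for C in combinations(B, k):
            total += (-1) ** k * support(A | set(C), records)
    return total

# Records with A but without B: only {A}, so the support is 1/3.
print(support_negative({"A"}, {"B"}, [{"A", "B"}, {"B", "C"}, {"A"}]))
```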
2.4 Mining Negatively Correlated Patterns
The second technique is based on the expected support. An infrequent pattern is considered to be
interesting if the actual support is less than the expected support. Two approaches to calculate the
expected support will be discussed. The first is based upon a concept hierarchy and the second uses a
neighborhood-based approach.
2.4.1 Concept Hierarchy
The usual support calculation (see Section 1) is not a sufficient measure for filtering away uninteresting infrequent patterns. Consider an itemset such as {TV, coke}, whose individual items are frequent. The itemset is infrequent due to low support and perhaps negatively correlated, but it is uninteresting because the items belong to different categories. To an expert this may seem obvious, but for a computer to ignore these kinds of infrequent patterns, a good measure is required.
Figure 2. Concept hierarchy
This approach is based on the assumption that products within the same product family interact with other items in similar ways. For example, if {Cookies, Pork} (see Figure 2) is a frequent itemset, then its children are expected to have similar relations. The expected support for {Oatmeal, Bacon} can then be calculated with the following equation:
EXAMPLE 1: $E[s(\{\text{Oatmeal, Bacon}\})] = s(\{\text{Cookies, Pork}\}) \times \dfrac{s(\text{Oatmeal})}{s(\text{Cookies})} \times \dfrac{s(\text{Bacon})}{s(\text{Pork})}$
O(VWW9TO) O([W\)
If the actual support for {Oatmeal, Bacon} is less than the expected support calculated above then we
have found an interesting infrequent pattern.
The algorithm for this case where we only consider direct children of the frequent itemset (in example 1
{Cookies, Pork}) can be generalized into the following:
EQUATION 2: $E[s(\{a, b, \dots, t\})] = s(\hat{a} \cup \hat{b} \cup \dots \cup \hat{t}) \times \dfrac{s(a) \times s(b) \times \dots \times s(t)}{s(\hat{a}) \times s(\hat{b}) \times \dots \times s(\hat{t})}$

where $\{\hat{a}, \hat{b}, \dots, \hat{t}\}$ is a frequent itemset and $\hat{a}$ denotes the parent of $a$.
In the second case we consider some of the items in the frequent itemset together with children of the other items, for example {Cookies, Bacon}. The general equation for this is:

EQUATION 3: $E[s(\{a, b, c, \dots, t\})] = s(a \cup b \cup \hat{c} \cup \dots \cup \hat{t}) \times \dfrac{s(c) \times \dots \times s(t)}{s(\hat{c}) \times \dots \times s(\hat{t})}$
The third case considers some of the items in the frequent itemset together with siblings of the other items, for example {Cookies, Chicken}. The general equation for this is:

EQUATION 4: $E[s(\{a, b, c, \dots, t\})] = s(a \cup b \cup c' \cup \dots \cup t') \times \dfrac{s(c) \times \dots \times s(t)}{s(c') \times \dots \times s(t')}$

where $t'$ denotes a sibling of $t$.
To keep the example simple, we only considered frequent itemsets containing two items; when mining real data this is not the case. These three cases are the only interesting ones when mining for negative itemsets; for example, we do not consider itemsets consisting only of siblings.
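As an illustration of Equation 2 (the all-children case), the expected support can be computed from a child-to-parent map. This is our own sketch; the parent dictionary is a hypothetical fragment of a hierarchy like the one in Figure 2:

```python
# Hypothetical concept hierarchy fragment: child -> parent.
parent = {"Oatmeal": "Cookies", "Chocolate Chip": "Cookies",
          "Bacon": "Pork", "Ham": "Pork"}

def expected_support(children, records):
    """Equation 2: take the support of the parent itemset and scale it
    by s(child) / s(parent) for every item. Assumes every parent
    occurs in the data (non-zero support)."""
    parents = {parent[c] for c in children}
    estimate = support(parents, records)
    for c in children:
        estimate *= support({c}, records) / support({parent[c]}, records)
    return estimate

# {Oatmeal, Bacon} is interesting if its actual support falls below the estimate:
# support({"Oatmeal", "Bacon"}, records) < expected_support({"Oatmeal", "Bacon"}, records)
```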
2.4.2 Indirect Association
In the previous section we described how to calculate the expected support using a concept hierarchy. This technique instead uses indirect association to obtain the expected support. Indirect association means relating two items through a set of items they have in common, the mediator set.
Figure 3. Indirect association between a and b
through a mediator set Y.
DEFINITION 6: A pair of items a, b is indirectly associated through a mediator set Y if the following conditions hold:

1. $s(\{a, b\}) < t_s$ (itempair support condition).
2. $\exists Y \neq \emptyset$ such that:
   I. $s(\{a\} \cup Y) \geq t_f$ and $s(\{b\} \cup Y) \geq t_f$ (mediator support condition).
   II. $d(\{a\}, Y) \geq t_d$ and $d(\{b\}, Y) \geq t_d$, where $d(X, Z)$ is an objective measure of the association between X and Z (mediator dependence condition).
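Definition 6 can be checked almost verbatim in code. In the sketch below (our own illustration) we use the interest factor $s(X \cup Z) / (s(X)\,s(Z))$ as the objective measure d; both that choice and the threshold values are hypothetical:

```python
def interest(x, z, records):
    """One possible objective measure d(X, Z): the interest factor."""
    return support(x | z, records) / (support(x, records) * support(z, records))

def indirectly_associated(a, b, Y, records, t_s=0.1, t_f=0.3, t_d=1.2):
    """Definition 6: items a and b are indirectly associated via mediator Y."""
    if support({a, b}, records) >= t_s:            # 1. itempair support condition
        return False
    if min(support({a} | Y, records),
           support({b} | Y, records)) < t_f:       # 2.I mediator support condition
        return False
    return (interest({a}, Y, records) >= t_d and
            interest({b}, Y, records) >= t_d)      # 2.II mediator dependence condition
```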
3 Applications
Infrequent patterns can be used in many applications.
• In text mining, indirect associations can be used to find synonyms, antonyms, or words that are used in different contexts. For example, the word data might be indirectly associated with the word gold through the mediator mining.
• In the market basket domain, indirect associations can be used to find competing items, such as desktop computers and laptops, where people who buy desktop computers tend not to buy laptops.
• Infrequent patterns can be used to detect errors. For example, if {Fire = Yes} is frequent but {Fire = Yes, Alarm = On} is infrequent, then the alarm system is probably faulty.
When evaluating Weka [3], we could not find any implementation for mining infrequent patterns.
4 Conclusion
The techniques discussed in this report show that we can find interesting infrequent patterns that would not be discovered with the usual algorithms. One may believe that these techniques are computationally heavy, but that is not the case. Savasere et al. [4] ran the negative correlation algorithm on a SPARCstation 5, a machine released almost fifteen years ago [5]. Completing the algorithm over 50,000 records, with an average of ten items per record, took between 300 and 1,000 seconds. On a modern computer the same operations would probably finish within seconds.
It is noteworthy that negative and negatively correlated patterns do not have to be infrequent; see Figure 1.
It is a shame that Weka does not contain any of the discussed algorithms for finding infrequent patterns. However, Apriori is implemented, and since the binary table technique uses Apriori to find negative patterns, it should be possible to extend the current Apriori implementation in Weka.
References
[1] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Addison-Wesley, pages 457-472, 2006.
[2] Aida Vitoria, Lecture 7, PowerPoint slides, Linköping University, 2009.
[3] Weka, http://www.cs.waikato.ac.nz/ml/weka/, 2009.
[4] Ashok Savasere, Edward Omiecinski, Shamkant Navathe, Mining for Strong Negative Associations in a Large Database of Customer Transactions. In Proc. of the 14th Intl. Conf. on Data Engineering, pages 494-502, 1998.
[5] Wikipedia, http://en.wikipedia.org/wiki/SPARCstation_5, 2009.