Data Mining With Apriori Algorithm
When you talk of data mining, the discussion would not be complete without mentioning the term 'Apriori Algorithm.'
This algorithm, introduced by R Agrawal and R Srikant in 1994, has great significance in data mining. We shall see the importance of the apriori algorithm in data mining in this article.
Consider a shopkeeper who places potatoes and onions next to each other. Why does he do so? He realises that people who buy potatoes also buy onions. Therefore, by bunching them together, he makes it easy for the customers. At the same time, he also increases his sales performance, and it allows him to offer discounts.
Similarly, when you go to a supermarket, you will find bread, butter, and jam bundled together. It is evident that the idea is to make it convenient for the customer to buy these three food items in one place.
The Walmart beer-and-diapers parable is another example of this phenomenon. People who buy diapers tend to buy beer as well. The logic is that raising kids is a stressful job, and people drink beer to relieve stress. Walmart saw a spurt in the sales of both diapers and beer.
The three examples listed above are perfect examples of Association Rules in Data Mining, and they help us understand the concept of the apriori algorithm.
The Apriori algorithm is a classic algorithm that is useful for mining frequent itemsets and relevant association
rules. Usually, you operate this algorithm on a database containing a large number of transactions, such as the items customers buy at a supermarket.
It helps the customers buy their items with ease, and enhances the sales performance of the
departmental store.
This algorithm has utility in the field of healthcare as it can help in detecting adverse drug reactions
(ADR) by producing association rules to indicate the combination of medications and patient
characteristics that could lead to ADRs.
It has got this odd name because it uses ‘prior’ knowledge of frequent itemset properties. The credit
for introducing this algorithm goes to Rakesh Agrawal and Ramakrishnan Srikant in 1994. We shall
now explore the apriori algorithm implementation in detail.
Three significant components comprise the apriori algorithm. They are as follows.
Support
Confidence
Lift
As mentioned earlier, you need a big database. Let us suppose you have 2000 customer transactions in a supermarket. You have to find the Support, Confidence, and Lift for two items, say bread and jam, because people frequently buy these two items together.
Out of the 2000 transactions, 200 contain jam, whereas 300 contain bread. These 300 transactions include 100 that contain both bread and jam. Using this data, we shall find out the support, confidence, and lift.
Support
Support is the default popularity of any item. You calculate the Support for an item by dividing the number of transactions containing that item by the total number of transactions. Hence, in our example,
Support (jam) = (Transactions containing jam) / (Total transactions)
= 200/2000 = 10%
Confidence
In our example, Confidence is the likelihood that a customer who bought jam also bought bread. Dividing the number of transactions that include both bread and jam by the total number of transactions involving jam gives the Confidence figure.
Confidence = (Transactions involving both bread and jam) / (Total Transactions involving jam)
= 100/200 = 50%
It implies that 50% of customers who bought jam bought bread as well.
Lift
According to our example, Lift is the increase in the ratio of the sale of bread when you sell jam. The mathematical formula of Lift is as follows:
Lift (jam → bread) = Confidence (jam → bread) / Support (jam)
= 50 / 10 = 5
It says that the likelihood of a customer buying both jam and bread together is 5 times higher than the chance of purchasing jam alone. If the Lift value is less than 1, the customers are unlikely to buy both items together. The greater the value, the better the combination.
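To make the arithmetic above concrete, here is a minimal Python sketch using the counts from this example. The variable names are purely illustrative, and the lift is computed the way the article's figures imply (confidence divided by the 10% support value calculated above).

```python
# Support, Confidence, and Lift for the bread-and-jam example above.
# The counts come straight from the example; variable names are illustrative.
total_transactions = 2000
jam_count = 200      # transactions containing jam
both_count = 100     # transactions containing both bread and jam

support_jam = jam_count / total_transactions      # 200 / 2000 = 0.10 (10%)
confidence = both_count / jam_count               # 100 / 200  = 0.50 (50%)
# Note: lift is computed here as confidence / support(jam), following the
# article's arithmetic above.
lift = confidence / support_jam                   # 0.50 / 0.10 = 5.0

print(f"Support(jam)             = {support_jam:.0%}")
print(f"Confidence(jam -> bread) = {confidence:.0%}")
print(f"Lift(jam -> bread)       = {lift:.1f}")
```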
Apriori Algorithm Example
Consider a supermarket scenario where the itemset is I = {Onion, Burger, Potato, Milk, Beer}. The
database consists of six transactions where 1 represents the presence of the item and 0 the absence.
Step 1
Create a frequency table of all the items that occur in all the transactions. Now, prune the frequency table to include only those items having a support level above the 50% threshold. We arrive at this frequency table.
Step 2
Make pairs of items such as OP, OB, OM, PB, PM, BM. This frequency table is what you arrive at.
Step 3
Apply the same 50% support threshold and keep only the pairs that meet it (in this case, a count of 3 and above).
Step 4
Look for a set of three items that the customers buy together. Thus we get this combination: OP and OB give OPB, and PB and PM give PBM.
Step 5
Determine the frequency of these two itemsets. You get this frequency table.
If you apply the threshold assumption, you can deduce that the set of three items frequently purchased
by the customers is OPB.
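The six-transaction table itself is not reproduced above, so the 1/0 matrix in the sketch below is a hypothetical stand-in rather than the article's actual data. It is only meant to show how Steps 1 to 5 (counting singles, pairs, and triples against the 50% threshold) can be expressed in Python, and it is chosen so that the same conclusion, OPB, falls out.

```python
from itertools import combinations

# Hypothetical 1/0 transaction matrix (NOT the original table from the article).
# Columns: Onion, Burger, Potato, Milk, Beer; each row is one transaction.
items = ["Onion", "Burger", "Potato", "Milk", "Beer"]
matrix = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
]
transactions = [{items[i] for i, flag in enumerate(row) if flag} for row in matrix]
min_count = 0.5 * len(transactions)   # 50% support threshold -> 3 transactions

def support_count(itemset):
    """Number of transactions that contain every item of the candidate set."""
    return sum(itemset <= t for t in transactions)

# Steps 1-5: count singles, pairs, and triples, keeping only the frequent ones.
for size in (1, 2, 3):
    frequent = {c: support_count(frozenset(c))
                for c in combinations(items, size)
                if support_count(frozenset(c)) >= min_count}
    print(f"Frequent itemsets of size {size}: {frequent}")
```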
We have taken a simple example to explain the apriori algorithm in data mining. In reality, you have
hundreds and thousands of such combinations.
The apriori algorithm does have drawbacks. At times, you need a large number of candidate rules, which can become computationally expensive. Calculating support is also expensive because the calculation has to scan the entire database.
Techniques such as hash-based itemset counting and transaction reduction can improve the efficiency of the apriori algorithm. There are other methods as well, such as partitioning, sampling, and dynamic itemset counting.
We have seen an example of the apriori algorithm concerning frequent itemset generation. There are
many uses of apriori algorithm in data mining. One such use is finding association rules efficiently.
Find all rules having a Support value greater than the threshold Support
and a Confidence value greater than the threshold Confidence
Use Brute Force – List out all the rules and determine the support and confidence levels for
each rule. The next step is to eliminate the values below the threshold support and confidence
levels. It is a tedious exercise.
The Two-Step Approach – This method is a better one as compared to the Brute Force
method.
Step 1
We have seen earlier in this article how to prepare the frequency table and find itemsets having
support greater than the threshold support.
Step 2
Use binary partitions of the frequent itemsets to create rules and look for the partitions with the highest confidence levels. These are also referred to as candidate rules.
In our example, we found out that the OPB combination was the frequent itemset. Apply Step 2 and
find out all the rules using OPB.
You find that there are six combinations. Therefore, if you have k elements, there will be 2^k − 2 candidate association rules.
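As a quick sketch of this binary-partition step (plain Python, with the item names from the example), the snippet below enumerates every antecedent/consequent split of the frequent itemset OPB; for k = 3 it produces the 2^k − 2 = 6 candidate rules mentioned above.

```python
from itertools import combinations

# Binary partition of one frequent itemset into candidate association rules.
frequent_itemset = {"Onion", "Potato", "Burger"}   # the OPB itemset from the example

rules = []
for r in range(1, len(frequent_itemset)):                    # antecedent sizes 1 .. k-1
    for antecedent in combinations(sorted(frequent_itemset), r):
        consequent = frequent_itemset - set(antecedent)
        rules.append((set(antecedent), consequent))          # antecedent -> consequent

for antecedent, consequent in rules:
    print(f"{sorted(antecedent)} -> {sorted(consequent)}")
print(f"{len(rules)} candidate rules")                       # 2**3 - 2 = 6
```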
Final Words
In this age of eCommerce retail shops and globalisation, it becomes imperative for businesses to use
machine learning and artificial intelligence to stay ahead of the competition. Hence, data analysis has
a great scope in today’s environment. The use of apriori algorithms has great value in data analysis.
We have seen how you can derive useful associations from transaction data and enhance the sales performance of the supermarket. That was one example of the utility of the apriori algorithm.
This concept is also used in other critical industries such as healthcare, where it enables practitioners to bundle drugs that cause the fewest ADRs given a patient's characteristics. Read Digital Vidya Blogs to know more about machine learning aspects.
First, I would like to thank Josh Mitchell for bringing back my charger that I left at his place at one AM. If it were not for him, I wouldn't have been able to write this and get a lot of stuff done tonight.
Without further ado, let's start talking about the Apriori algorithm. It is a classic algorithm used in data mining for learning association rules. It is nowhere near as complex as it sounds; on the contrary, it is very simple. Let me give you an example to explain it. Suppose you have records of a large number of transactions at a shopping center, as follows:
T2 Item1, Item2
T3 Item2, Item5
Learning association rules basically means finding the items that are purchased together more frequently than
others.
For example in the above table you can see Item1 and item2 are bought together frequently.
Also, if you are familiar with Amazon, they use association mining to recommend items based on the item you are currently browsing or buying.
Another application is Google auto-complete, where after you type in a word it suggests words that users frequently type after that particular word.
So, as I said, Apriori is the classic and probably the most basic algorithm for doing this. Now, if you search online you can easily find the pseudo-code, mathematical equations and stuff. I would like to make it more intuitive and easy, if I can.
I would like a 10th or 12th grader to be able to understand this without any problem. So I will try not to use any terminology or jargon.
Now, we follow a simple golden rule: we say an item/itemset is frequently bought if it is bought at least 60% of the time. For the five transactions below, that means it should be bought at least 3 times.
For simplicity
M = Mango
O = Onion
And so on……
Original table:
T1 {M, O, N, K, E, Y}
T2 {D, O, N, K, E, Y}
T3 {M, A, K, E}
T4 {M, U, C, K, Y}
T5 {C, O, O, K, I, E}
Step 1: Count the number of transactions in which each item occurs. Note that 'O = Onion' is bought 4 times in total, but it occurs in just 3 transactions.
C1
Item    No. of transactions
M 3
O 3
N 2
K 5
E 4
Y 3
D 1
A 1
U 1
C 2
I 1
Step 2: Now remember, we said an item is frequently bought if it is bought at least 3 times. So in this step we remove all the items that are bought fewer than 3 times from the above table, and we are left with
L1
Item    Number of transactions
M 3
O 3
K 5
E 4
Y 3
These are the single items that are bought frequently. Now let's say we want to find pairs of items that are bought frequently. We continue from the above table (the table in Step 2).
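Steps 1 and 2 are easy to reproduce in code. Here is a small Python sketch using the five transactions from the original table; each transaction is treated as a set, so the duplicate O in T5 is counted only once.

```python
from collections import Counter

# The five transactions from the "Original table" above. Sets are used because
# we only care whether an item occurs in a transaction, not how many times
# (so the duplicate O in T5 counts once).
transactions = [
    {"M", "O", "N", "K", "E", "Y"},   # T1
    {"D", "O", "N", "K", "E", "Y"},   # T2
    {"M", "A", "K", "E"},             # T3
    {"M", "U", "C", "K", "Y"},        # T4
    {"C", "O", "K", "I", "E"},        # T5
]
min_count = 3   # the "golden rule": at least 60% of the 5 transactions

# Step 1: count the number of transactions each item occurs in (table C1).
c1 = Counter(item for t in transactions for item in t)

# Step 2: keep only the items bought at least 3 times (table L1).
l1 = {item: count for item, count in c1.items() if count >= min_count}
print(dict(sorted(l1.items())))   # {'E': 4, 'K': 5, 'M': 3, 'O': 3, 'Y': 3}
```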
Step 3: We start making pairs from the first item, like MO, MK, ME, MY, and then we start with the second item, like OK, OE, OY. We did not do OM because we already did MO when we were making pairs with M, and buying a Mango and an Onion together is the same as buying an Onion and a Mango together. After making all the pairs we get:
C2
Item pairs
MO
MK
ME
MY
OK
OE
OY
KE
KY
EY
Step 4: Now we count how many times each pair is bought together. For example, M and O are bought together only in {M, O, N, K, E, Y}, so MO gets a count of 1.
L2
MO 1
MK 3
ME 2
MY 2
OK 3
OE 3
OY 2
KE 4
KY 3
EY 2
Step 5: The golden rule to the rescue. Remove all the item pairs bought together in fewer than three transactions, and we are left with
L2
MK 3
OK 3
OE 3
KE 4
KY 3
These are the pairs of items frequently bought together.
Now let's say we want to find a set of three items that are bought together.
We use the above table (the table in Step 5) and make sets of 3 items.
Step 6: To make the sets of three items we need one more rule (it's termed self-join).
It simply means that, from the item pairs in the above table, we find two pairs with the same first letter, so we get:
OK and OE, which give OKE
KE and KY, which give KEY
Then we find how many times O, K, E are bought together in the original table, and the same for K, E, Y, and we get the following table
OKE 3
KEY 2
While we are on this, suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, BCD, and you want to generate item sets of 4 items. You look for two sets having the same first two letters, so ABC and ABD give ABCD, and ACD and ACE give ACDE.
And so on… In general, you have to look for sets that differ only in the last letter/item.
Step 7: So we again apply the golden rule, that is, the item set must be bought together at least 3 times. This leaves us with just OKE, since K, E, Y are bought together only two times.
Thus the set of three items that are bought together most frequently is O, K, E.
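Putting Steps 1 to 7 together, here is a minimal, unoptimised Python sketch of the whole level-wise loop on the same five transactions. For simplicity it builds candidates by merging any two frequent k-item sets whose union has k + 1 items; this is a looser join than the "same first letters" rule described above and can produce a few extra candidates, but the support check removes them anyway, so the frequent itemsets come out the same.

```python
from itertools import combinations

transactions = [
    {"M", "O", "N", "K", "E", "Y"},   # T1
    {"D", "O", "N", "K", "E", "Y"},   # T2
    {"M", "A", "K", "E"},             # T3
    {"M", "U", "C", "K", "Y"},        # T4
    {"C", "O", "K", "I", "E"},        # T5
]
min_count = 3   # the golden rule: bought together at least 3 times

def support_count(itemset):
    """Number of transactions containing every item of the candidate set."""
    return sum(itemset <= t for t in transactions)

# Level 1: frequent single items (L1).
all_items = sorted({item for t in transactions for item in t})
frequent = [frozenset([i]) for i in all_items
            if support_count(frozenset([i])) >= min_count]

k = 1
while frequent:
    print(k, {"".join(sorted(s)): support_count(s) for s in frequent})
    # Join frequent k-sets into (k+1)-item candidates, then prune the
    # candidates that do not reach the minimum count.
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
    frequent = [c for c in candidates if support_count(c) >= min_count]
    k += 1
# The frequent singles, pairs, and the final triple {O, K, E} match the
# hand-computed tables above.
```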