
Assignment 1

Data Mining
MGSC5126 - 10

First name: Nguyen


Last name: Le
Student ID: 20194224

Important Note: This assignment was prepared by a student from a non-technical background, to the best of her ability, by reading the slides, watching YouTube tutorials, and examining research papers, given that no programming language was taught in her course. Kindly take this into consideration when marking.

Thank you and best regards!

NL
Question 1
Euclidean distance: d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2}

Manhattan distance: d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|

Minkowski distance: d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h \right)^{1/h}

Cosine similarity: cos(i, j) = \frac{i \cdot j}{\|i\| \, \|j\|}

a. The Euclidean distance is \sqrt{(22-20)^2 + (1-0)^2 + (42-36)^2 + (10-8)^2} = \sqrt{45} = 6.7082

b. The Manhattan distance is |22-20| + |1-0| + |42-36| + |10-8| = 11

c. The Minkowski distance with h = 3 is \left( |22-20|^3 + |1-0|^3 + |42-36|^3 + |10-8|^3 \right)^{1/3} = 233^{1/3} = 6.1534
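
For illustration, a minimal R sketch that reproduces the three values above, assuming the two data points x = (22, 1, 42, 10) and y = (20, 0, 36, 8) implied by the calculations:

x <- c(22, 1, 42, 10)
y <- c(20, 0, 36, 8)
sqrt(sum((x - y)^2))       # Euclidean distance: 6.7082
sum(abs(x - y))            # Manhattan distance: 11
sum(abs(x - y)^3)^(1/3)    # Minkowski distance with h = 3: 6.1534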
Question 2
a.
     Euclidean distance        Manhattan distance     Cosine similarity
x1   √(0.1² + 0.1²) = 0.14     |0.1| + |0.1| = 0.2    (1.4×1.5 + 1.6×1.7) / (√(1.4² + 1.6²) × √(1.5² + 1.7²)) = 0.99999
x2   √(0.6² + 0.3²) = 0.67     |0.6| + |0.3| = 0.9    (1.4×2.0 + 1.6×1.9) / (√(1.4² + 1.6²) × √(2.0² + 1.9²)) = 0.99575
x3   √(0.2² + 0.2²) = 0.28     |0.2| + |0.2| = 0.4    (1.4×1.6 + 1.6×1.8) / (√(1.4² + 1.6²) × √(1.6² + 1.8²)) = 0.99997
x4   √(0.2² + 0.1²) = 0.22     |0.2| + |0.1| = 0.3    (1.4×1.2 + 1.6×1.5) / (√(1.4² + 1.6²) × √(1.2² + 1.5²)) = 0.99903
x5   √(0.1² + 0.6²) = 0.61     |0.1| + |0.6| = 0.7    (1.4×1.5 + 1.6×1.0) / (√(1.4² + 1.6²) × √(1.5² + 1.0²)) = 0.96536
Based on similarity (most similar first), the data points are ranked as follows:
Euclidean distance: x1, x4, x3, x5, x2
Manhattan distance: x1, x4, x3, x5, x2
Cosine similarity: x1, x3, x4, x2, x5

b. The normalized query from x = (1.4, 1.6) is calculated as below:

\left( \frac{1.4}{\sqrt{1.4^2 + 1.6^2}}, \frac{1.6}{\sqrt{1.4^2 + 1.6^2}} \right) = (0.659, 0.753)

Similarly, the new normalized dataset is given as follows:


A1 A2
x1 0.662 0.750
x2 0.725 0.689
x3 0.664 0.747
x4 0.625 0.781
x5 0.832 0.555
The new Euclidean distances to the normalized query (0.659, 0.753) are:

     Euclidean distance
x1   √((0.662 − 0.659)² + (0.750 − 0.753)²) = 0.004
x2   √((0.725 − 0.659)² + (0.689 − 0.753)²) = 0.092
x3   √((0.664 − 0.659)² + (0.747 − 0.753)²) = 0.008
x4   √((0.625 − 0.659)² + (0.781 − 0.753)²) = 0.044
x5   √((0.832 − 0.659)² + (0.555 − 0.753)²) = 0.263

After transformation, the ranking of the data set becomes: x1, x3, x4, x2, x5, which matches the ranking by cosine similarity.
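
A minimal R sketch of this normalization and re-ranking, assuming the query (1.4, 1.6) and the five data points implied by the calculations above:

# Query and data points from Question 2 (coordinates inferred from the values above)
q <- c(1.4, 1.6)
X <- rbind(x1 = c(1.5, 1.7), x2 = c(2.0, 1.9), x3 = c(1.6, 1.8),
           x4 = c(1.2, 1.5), x5 = c(1.5, 1.0))

normalize <- function(v) v / sqrt(sum(v^2))   # scale a vector to unit length

qn <- normalize(q)
Xn <- t(apply(X, 1, normalize))

# Euclidean distance of each normalized point to the normalized query
d <- sqrt(rowSums(sweep(Xn, 2, qn)^2))
sort(d)   # x1, x3, x4, x2, x5 -- the same order as the cosine-similarity ranking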

Question 3
According to the Data Mining Instructor's Manual, “The ChiMerge algorithm takes the Iris dataset and the max_interval as input and gives the split-points and final intervals as output” [1]. As the stopping condition, I used max_interval = 4.

According to raiyan1102006 Github, ChiMerge “is a robust algorithm that works in the following manner:

1. Sort the data based on the attribute’s values in an ascending order.


2. Define each distinct value in the attribute as an interval on its own.
3. Construct a frequency table where the various class frequencies for each distinct attribute value is
computed.
4. Calculate the Chi square values for each of the adjacent rows (intervals) in the frequency table.
5. Merge adjacent rows with the smallest Chi square value. This leads to a new frequency table.
6. Repeat steps 4 & 5 until stopping condition is met” [2]

We can refer to the package “arules” in R to discretize the variables in the Iris dataset. Please refer to the coding file for more information.
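
For illustration only, a minimal R sketch of the six steps above applied to one Iris attribute; this is not the submitted coding file, and the function name, the use of table(), and max_interval = 4 are assumptions of this sketch:

# Minimal ChiMerge sketch following steps 1-6 listed above
chimerge <- function(x, class, max_interval = 4) {
  # Steps 1-2: sort the values and start with one interval per distinct value
  # Step 3: class-frequency table, one row per distinct value (rows come out sorted)
  tab <- table(x, class)
  while (nrow(tab) > max_interval) {
    # Step 4: chi-square statistic for every pair of adjacent rows (intervals)
    chi <- sapply(1:(nrow(tab) - 1), function(i) {
      obs   <- tab[i:(i + 1), , drop = FALSE]
      expct <- outer(rowSums(obs), colSums(obs)) / sum(obs)
      sum((obs - expct)^2 / ifelse(expct == 0, 1, expct))
    })
    # Step 5: merge the adjacent pair with the smallest chi-square value
    i <- which.min(chi)
    tab[i, ] <- tab[i, ] + tab[i + 1, ]
    tab <- tab[-(i + 1), , drop = FALSE]
    # Step 6: repeat until only max_interval intervals remain
  }
  as.numeric(rownames(tab))   # lower bounds of the final intervals (split points)
}

data(iris)
chimerge(iris$Sepal.Length, iris$Species, max_interval = 4)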

Question 4
a.
Given min support = 60% and min confidence = 80%. Here I use absolute support: there are 5 transactions in total, so min support = 60% translates to a minimum support count of 3, meaning any itemset with a support count of 1 or 2 will be eliminated.

Using Apriori Algorithm


First scan to generate candidate 1-itemsets:

Item    Count
A       1
C       2
D       1
E       4
I       1
K       5
M       3
N       2
O       3
U       1
Y       3

Pruning itemsets with support counts smaller than 3:

Item    Count
E       4
K       5
M       3
O       3
Y       3

Second scan to generate candidate 2-itemsets:

Itemset    Count
E, K       4
E, M       2
E, O       3
E, Y       2
K, M       3
K, O       3
K, Y       3
M, O       1
M, Y       2
O, Y       2

Pruning itemsets with support counts < 3:

Itemset    Count
E, K       4
E, O       3
K, M       3
K, O       3
K, Y       3

Third scan to generate candidate 3-itemsets:

Itemset      Count
E, K, O      3
K, M, O      1
K, M, Y      2

Pruning itemsets with support counts smaller than 3:

Itemset      Count
E, K, O      3

The frequent itemsets found by Apriori are:

L1 = {E, K, M, O, Y}
L2 = {EK, EO, KM, KO, KY}
L3 = {EKO}
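
As a cross-check, a minimal sketch using the “arules” package in R; the five transactions are those listed in the FP-Growth table below, and the parameter values mirror the thresholds in part (a):

library(arules)

# The five transactions from this question, coded as an arules transactions object
trans <- as(list(
  T100 = c("M", "O", "N", "K", "E", "Y"),
  T200 = c("D", "O", "N", "K", "E", "Y"),
  T300 = c("M", "A", "K", "E"),
  T400 = c("M", "U", "C", "K", "Y"),
  T500 = c("C", "O", "K", "I", "E")
), "transactions")

# Frequent itemsets at 60% minimum support (support count >= 3)
freq <- apriori(trans, parameter = list(supp = 0.6, target = "frequent itemsets"))
inspect(sort(freq, by = "support"))   # expected: E, K, M, O, Y, EK, EO, KM, KO, KY, EKO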

Using FP-Growth Algorithm


First scan: find the frequent 1-itemsets with support counts equal to or greater than 3 and order them by descending support count, which gives the frequent-item list: {K:5, E:4, M:3, O:3, Y:3}.

TID     Items bought              Ordered frequent items
T100    {M, O, N, K, E, Y}        K, E, M, O, Y
T200    {D, O, N, K, E, Y}        K, E, O, Y
T300    {M, A, K, E}              K, E, M
T400    {M, U, C, K, Y}           K, M, Y
T500    {C, O, O, K, I, E}        K, E, O

Next, generate the FP-tree.


Frequent itemsets using FP-Growth:

Item ending with    Conditional pattern base                    Conditional FP-tree (min support = 3)    Frequent patterns generated
Y                   {E, K, M, O: 1}, {E, K, O: 1}, {K, M: 1}    ⟨K: 3⟩                                   {K, Y: 3}
O                   {E, K, M: 1}, {E, K: 2}                     ⟨E: 3, K: 3⟩                             {E, K, O: 3}, {K, O: 3}, {E, O: 3}
M                   {E, K: 2}, {K: 1}                           ⟨K: 3⟩                                   {K, M: 3}
E                   {K: 4}                                      ⟨K: 4⟩                                   {E, K: 4}

The final set of frequent itemsets from FP-Growth is:

{ {E: 4}, {K: 5}, {M: 3}, {O: 3}, {Y: 3}, {K, Y: 3}, {E, K, O: 3}, {K, O: 3}, {E, O: 3}, {K, M: 3}, {E, K: 4} }

FP-Growth is more efficient, especially on large datasets, because it mines the conditional pattern bases, which greatly reduces the size of the data sets to be searched.

b. Now generate association rules using minimum confidence = 80%.


1. [E, K] → O: confidence = support(E, K, O) / support(E, K) = 3/4 = 75% (discard)
2. [K, O] → E: confidence = 3/3 = 100%
3. [E, O] → K: confidence = 3/3 = 100%
4. E → [K, O]: confidence = 3/4 = 75% (discard)
5. K → [E, O]: confidence = 3/5 = 60% (discard)
6. O → [E, K]: confidence = 3/3 = 100%
Therefore, rules #2, #3, and #6 are strong association rules and can be written as below:
∀X ∈ transaction, buys(X, K) ∧ buys(X, O) ⇒ buys(X, E) [60%, 100%]
∀X ∈ transaction, buys(X, E) ∧ buys(X, O) ⇒ buys(X, K) [60%, 100%]
∀X ∈ transaction, buys(X, O) ⇒ buys(X, E) ∧ buys(X, K) [60%, 100%]
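
Continuing the arules sketch from part (a), the same strong rules can be checked in R; the minlen = 2 setting (to exclude rules with an empty left-hand side) is a choice of this sketch:

# Association rules at 60% support and 80% confidence
rules <- apriori(trans, parameter = list(supp = 0.6, conf = 0.8, minlen = 2))
inspect(rules)   # expected to include {K,O} => {E}, {E,O} => {K} and {O} => {E,K}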

Question 5

Apriori Frequent Itemset Algorithm

According to M. S. Mythili, the Apriori algorithm follows a two-step process:
• Join step: Ck is generated by joining Lk-1 with itself.
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset [4].
According to the Data Mining slides, Chapter 6, prepared by Jamileh, candidate generation can be implemented in SQL, assuming the items in Lk-1 are listed in order [3]:

Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck [3]
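
To make the join-and-prune idea concrete, here is a small illustrative R helper (a sketch of my own, not taken from the slides) that generates Ck from Lk-1 and reproduces L3 = {EKO} from the L2 of Question 4:

# Generate candidate k-itemsets from the frequent (k-1)-itemsets by join and prune.
# Each itemset is a character vector with its items kept in sorted order.
gen_candidates <- function(Lk_1) {
  Ck <- list()
  for (i in seq_along(Lk_1)) for (j in seq_along(Lk_1)) {
    p <- Lk_1[[i]]; q <- Lk_1[[j]]; k1 <- length(p)
    # Self-join: first k-2 items equal, last item of p comes before last item of q
    if (identical(p[-k1], q[-k1]) && p[k1] < q[k1]) {
      cand <- c(p, q[k1])
      # Prune: every (k-1)-subset of the candidate must itself be in Lk-1
      ok <- all(sapply(seq_along(cand), function(m)
        any(sapply(Lk_1, identical, cand[-m]))))
      if (ok) Ck[[length(Ck) + 1]] <- cand
    }
  }
  Ck
}

# Example with L2 from Question 4: only {E, K, O} survives the prune step
L2 <- list(c("E", "K"), c("E", "O"), c("K", "M"), c("K", "O"), c("K", "Y"))
gen_candidates(L2)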

We can refer to the package “arules” in R to implement the Apriori algorithm on the Adult dataset. Please refer to the coding file for more information.
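
A hedged sketch of what such an implementation might look like (the arules package ships the Adult census data as a ready-made transactions object; the 0.5 support and 0.9 confidence thresholds are illustrative assumptions, not necessarily those in the coding file):

library(arules)

data("Adult")                                # Adult census data, already coded as transactions
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9))
inspect(head(sort(rules, by = "lift"), 5))   # the five rules with the highest lift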

Compare Apriori Frequent Itemset Algorithm vs FP-Growth Algorithm

1. Storage structure: Apriori is array based; FP-Growth is tree based [4].
2. Search type: Apriori uses breadth-first search (items at level k are used to generate items of level k+1, requiring many database scans); FP-Growth uses depth-first search (divide and conquer) [4].
3. Technique: Apriori joins and prunes; FP-Growth constructs a conditional frequent-pattern tree that satisfies minimum support [4].
4. Number of database scans: Apriori needs k+1; FP-Growth needs 2 [4].
5. Memory utilization: Apriori needs large memory (candidate generation); FP-Growth needs less memory (no candidate generation) [4].
6. Database: Apriori suits sparse/dense datasets; FP-Growth suits large and medium datasets [4].
7. Run time: Apriori takes more time; FP-Growth takes less time [4], [6].
8. Accuracy: Apriori is less accurate; FP-Growth is more accurate [5].

Besides, based on research conducted by Jeff Heaton, titled “Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms” [7], while Apriori is a common method for frequent itemset mining, it has difficulty scaling and consumes memory much faster than Eclat and FP-Growth. B. Santhosh Kumar's research, titled “Implementation of Web Usage Mining Using APRIORI and FP Growth Algorithms”, showed that because the Apriori algorithm uses candidate generation, it is costly to handle the large number of candidate sets, and it requires multiple scans of the database and checks the candidate sets by pattern matching, whereas an efficient FP-tree based method like FP-Growth needs only two phases to output the frequent patterns [10].

EFFECTS OF DATASET DENSITY [7]

Fig. 4. Frequent Itemset Density’s Effect on Runtime (seconds)

The Eclat and FP-Growth algorithms show similar runtime growth as the frequent-itemset density increases; Apriori follows a similar trend until the density exceeds 70%, and Eclat is somewhat ahead of FP-Growth when the density is low [7].

EFFECTS OF BASKET SIZE [7]

Fig. 5. Maximum Basket Size’s Effect on Runtime (seconds)

The algorithms show almost the same performance for maximum basket sizes up to about 60; beyond that, Apriori's runtime grows much faster than that of Eclat and FP-Growth, most noticeably between maximum transaction sizes of 60 and 70, possibly because of the memory increase consumed by Apriori [7].

Based on one of Borgelt's studies, titled “Frequent item set mining”, while his implementation of the Apriori algorithm performed much better for high support thresholds, it performed much worse for small thresholds, mainly due to differences in the implementation of some of the data structures and procedures used [8]. Another study, from Zheng et al. in their article “Real-world performance of association rule algorithms”, showed that different machine architectures sometimes behave differently under the same algorithms. The research revealed that the choice of algorithm only matters at support levels that generate more rules than are needed in practice. For example, at a support level that generates a small, human-understandable number of rules, Apriori finishes processing in under a minute, so performance improvements are not interesting; at support levels that generate about 1,000,000 rules, which are no longer human-understandable, the Apriori algorithm still finishes processing in less than 10 minutes; however, if the support level continues to drop (which is often the case for prediction purposes), the number of frequent itemsets and association rules grows exponentially, which makes memory space the main concern for most algorithms, as they either run out of memory or of reasonable disk space [9].

Based on a thesis titled “Efficient Frequent Pattern Mining” by Bart Goethals, it was concluded that the algorithms' performance also differs across the kinds of data sets on which they are run. As long as the database fits in main memory, the Hybrid algorithm, a combination of an optimized version of Apriori and Eclat, is the most efficient algorithm, while Eclat is better for dense databases; if the database does not fit into memory, the choice of algorithm depends on the density of the database: Hybrid should be chosen for sparse datasets, while Eclat should be considered for dense datasets [11].

References

[1] Han, J., Kamber, M., & Pei, J. (2012, Jan 2). Data Mining: Concepts and Techniques, 3rd Edition, Solution Manual. Elsevier Textbook. https://textbooks.elsevier.com/manualsprotectedtextbooks/9780123814791/Instructor%27s_manual.pdf

[2] Raiyan. (2018, Nov 4). ChiMerge [Source code]. GitHub. https://github.com/raiyan1102006/ChiMerge/blob/master/ChiMerge.py

[3] Jamileh, Y. (2020, Oct 4). Data Transformation. Data Mining Concepts. https://cbulms2.cbu.ca/moodle/pluginfile.php/70165/mod_resource/content/3/ML_Week4.pdf

[4] Mythili, M. S., & Mohamed Shanavas, A. R. (2013). Performance Evaluation of Apriori and FP-Growth Algorithms. International Journal of Computer Applications. https://doi.org/10.5120/13779-1650

[5] Muliono, R., Muhathir, Khairina, N., & Harahap, M. K. (2019). Analysis of Frequent Itemsets Mining Algorithm Against Models of Different Datasets. Journal of Physics: Conference Series. doi:10.1088/1742-6596/1361/1/012036

[6] Amneh, S. (2018). Hybrid user action prediction system for automated home using association rules and ontology. The Institution of Engineering and Technology. doi:10.1049/iet-wss.2018.5032

[7] Heaton, J. (2016). Comparing dataset characteristics that favor the Apriori, Eclat or FP-Growth frequent itemset mining algorithms. Institute of Electrical and Electronics Engineers. doi:10.1109/secon.2016.7506659

[8] Borgelt, C. (2012). Frequent item set mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6), 437–456.

[9] Zheng, Z., Kohavi, R., & Mason, L. (2001). Real world performance of association rule algorithms. In F. Provost & R. Srikant (Eds.), Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 401–406). ACM Press.

[10] Santhosh Kumar, B., & Rukmani, K. V. (2010). Implementation of Web Usage Mining Using APRIORI and FP Growth Algorithms. International Journal of Advanced Networking and Applications.

[11] Goethals, B. (2012, Dec). Efficient Frequent Pattern Mining. School of Information Technology. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.4137&rep=rep1&type=pdf
