Assignment 1: Data Mining
MGSC5126 - 10
Important Note: This assignment was prepared by a student from a non-technical background, to the best of her ability, by reading the slides, watching YouTube tutorials, and examining research papers, given that she was not trained in any programming language in her course. Therefore, kindly consider this and mark her accordingly.
Question 1
Euclidean distance: d(i, j) = √((x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_ip − x_jp)²)
Manhattan distance: d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
Minkowski distance: d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + … + |x_ip − x_jp|^h)^(1/h)
a. The Euclidean distance is √((22−20)² + (1−0)² + (42−36)² + (10−8)²) = 6.7082
b. The Manhattan distance is |22−20| + |1−0| + |42−36| + |10−8| = 11
c. The Minkowski distance with h = 3 is (|22−20|³ + |1−0|³ + |42−36|³ + |10−8|³)^(1/3) = 6.1534
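These results can be verified with a few lines of R (a minimal sketch using the two data points from the question):

p <- c(22, 1, 42, 10)
q <- c(20, 0, 36, 8)
sqrt(sum((p - q)^2))        # Euclidean distance = 6.7082
sum(abs(p - q))             # Manhattan distance = 11
sum(abs(p - q)^3)^(1/3)     # Minkowski distance with h = 3 = 6.1534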
Question 2
a.
      Euclidean distance       Manhattan distance   Cosine similarity
x1    √(0.1² + 0.1²) = 0.14    0.1 + 0.1 = 0.2      (1.4×1.5 + 1.6×1.7) / (√(1.4² + 1.6²) × √(1.5² + 1.7²)) = 0.99999
x2    √(0.6² + 0.3²) = 0.67    0.6 + 0.3 = 0.9      (1.4×2.0 + 1.6×1.9) / (√(1.4² + 1.6²) × √(2.0² + 1.9²)) = 0.99575
x3    √(0.2² + 0.2²) = 0.28    0.2 + 0.2 = 0.4      (1.4×1.6 + 1.6×1.8) / (√(1.4² + 1.6²) × √(1.6² + 1.8²)) = 0.99997
x4    √(0.2² + 0.1²) = 0.22    0.2 + 0.1 = 0.3      (1.4×1.2 + 1.6×1.5) / (√(1.4² + 1.6²) × √(1.2² + 1.5²)) = 0.99903
x5    √(0.1² + 0.6²) = 0.61    0.1 + 0.6 = 0.7      (1.4×1.5 + 1.6×1.0) / (√(1.4² + 1.6²) × √(1.5² + 1.0²)) = 0.96536
Based on similarity to x, the data points are ranked as follows:
Euclidean distance: x1, x4, x3, x5, x2
Manhattan distance: x1, x4, x3, x5, x2
Cosine similarity: x1, x3, x4, x2, x5
After transforming each data point to unit length, the ranking by Euclidean distance becomes: x1, x3, x4, x2, x5 (the same order as the cosine-similarity ranking).
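As a cross-check, here is a short R sketch (assuming the query point x = (1.4, 1.6) and the five points above) that computes the three measures and the rankings, including the ranking after normalizing every point to unit length:

x   <- c(1.4, 1.6)
pts <- rbind(x1 = c(1.5, 1.7), x2 = c(2.0, 1.9), x3 = c(1.6, 1.8),
             x4 = c(1.2, 1.5), x5 = c(1.5, 1.0))

euclid <- apply(pts, 1, function(p) sqrt(sum((x - p)^2)))
manhat <- apply(pts, 1, function(p) sum(abs(x - p)))
cosine <- apply(pts, 1, function(p) sum(x * p) / (sqrt(sum(x^2)) * sqrt(sum(p^2))))

names(sort(euclid))                       # ranking by Euclidean distance (ascending)
names(sort(manhat))                       # ranking by Manhattan distance (ascending)
names(sort(cosine, decreasing = TRUE))    # ranking by cosine similarity (descending)

# After the transformation: normalize every point to unit length, then use Euclidean distance
unit  <- function(v) v / sqrt(sum(v^2))
pts_u <- t(apply(pts, 1, unit))
names(sort(apply(pts_u, 1, function(p) sqrt(sum((unit(x) - p)^2)))))   # matches the cosine ranking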
Question 3
According to the Data Mining solution manual, "The ChiMerge algorithm takes the Iris dataset and the designated max_interval as the input and gives the split-points and final intervals as output" [1]. As the stopping condition, I used max_interval = 4.
According to raiyan1102006's ChiMerge implementation on GitHub [2], the algorithm works bottom-up: each distinct value of the attribute starts in its own interval, the chi-square statistic is computed for every pair of adjacent intervals, the pair with the lowest chi-square value is merged, and the process repeats until the stopping condition (here, max_interval = 4) is reached.
We can refer to the package "arules" in R to discretize the variables in the Iris dataset. Please refer to the coding file for more information.
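For illustration, here is a minimal stand-alone sketch of the ChiMerge merging loop on one iris attribute (an illustrative helper written for this write-up, not code from the arules package or from the coding file; max_interval = 4 is the only stopping condition):

chimerge <- function(x, y, max_interval = 4) {
  y <- factor(y)
  cuts <- sort(unique(x))                    # every distinct value starts as its own interval
  class_counts <- function(lo, hi) {
    table(factor(y[x >= lo & x < hi], levels = levels(y)))
  }
  chi2 <- function(obs) {                    # chi-square statistic for two adjacent intervals
    expct <- outer(rowSums(obs), colSums(obs)) / sum(obs)
    expct[expct == 0] <- 0.1                 # common smoothing to avoid division by zero
    sum((obs - expct)^2 / expct)
  }
  while (length(cuts) > max_interval) {
    hi <- c(cuts[-1], Inf)                   # upper bound of each interval
    scores <- sapply(seq_len(length(cuts) - 1), function(i) {
      chi2(rbind(class_counts(cuts[i], hi[i]),
                 class_counts(cuts[i + 1], hi[i + 1])))
    })
    cuts <- cuts[-(which.min(scores) + 1)]   # merge the adjacent pair with the smallest chi-square
  }
  cuts                                       # split points = lower bounds of the final intervals
}

data(iris)
chimerge(iris$Sepal.Length, iris$Species, max_interval = 4)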
Question 4
a.
The minimum support is 60% and the minimum confidence is 80%. Here I use absolute support: there are 5 transactions in total, so min support = 60% translates to a minimum support count of 3, and any itemset with a support count of 1 or 2 is eliminated.
Candidate 1-itemsets (C1) with support counts:

Item    Count
A       1
C       2
D       1
E       4
I       1
K       5
M       3
N       2
O       3
U       1
Y       3

Frequent 1-itemsets (L1):

Item    Count
E       4
K       5
M       3
O       3
Y       3

Candidate 2-itemsets (C2):

Itemset    Count
E, K       4
E, M       2
E, O       3
E, Y       2
K, M       3
K, O       3
K, Y       3
M, O       1
M, Y       2
O, Y       2

Frequent 2-itemsets (L2):

Itemset    Count
E, K       4
E, O       3
K, M       3
K, O       3
K, Y       3

Candidate 3-itemsets (C3):

Itemset      Count
E, K, O      3
K, M, O      1
K, M, Y      2
Pruning itemsets with a support count smaller than 3 leaves the frequent 3-itemsets (L3):

Itemset      Count
E, K, O      3
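As a sanity check, the same frequent itemsets can be mined with the arules package in R (a small sketch; the duplicate O in T500 is assumed to count only once within that transaction):

library(arules)
trans <- as(list(
  T100 = c("M", "O", "N", "K", "E", "Y"),
  T200 = c("D", "O", "N", "K", "E", "Y"),
  T300 = c("M", "A", "K", "E"),
  T400 = c("M", "U", "C", "K", "Y"),
  T500 = c("C", "O", "K", "I", "E")
), "transactions")

freq <- apriori(trans, parameter = list(supp = 0.6, target = "frequent itemsets"))
inspect(sort(freq, by = "support"))    # {E, K, O} is the only frequent 3-itemset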
For FP-growth, the items in each transaction are first re-ordered by descending support count (K: 5, E: 4, M: 3, O: 3, Y: 3) after dropping the infrequent items:

TID     Items bought            Ordered frequent items
T100    {M, O, N, K, E, Y}      K, E, M, O, Y
T200    {D, O, N, K, E, Y}      K, E, O, Y
T300    {M, A, K, E}            K, E, M
T400    {M, U, C, K, Y}         K, M, Y
T500    {C, O, O, K, I, E}      K, E, O
Item ending with    Conditional pattern base                      Conditional FP-tree (min support count = 3)    Frequent patterns generated
Y                   {K, E, M, O: 1}, {K, E, O: 1}, {K, M: 1}      ⟨K: 3⟩                                          {K, Y: 3}
O                   {K, E, M: 1}, {K, E: 2}                       ⟨K: 3, E: 3⟩                                    {K, O: 3}, {E, O: 3}, {E, K, O: 3}
M                   {K, E: 2}, {K: 1}                             ⟨K: 3⟩                                          {K, M: 3}
E                   {K: 4}                                        ⟨K: 4⟩                                          {E, K: 4}
FP-growth is more efficient, especially for large datasets, because it mines the (much smaller) conditional pattern bases recursively, which greatly reduces the size of the data sets that must be examined, and it does so without generating candidate itemsets.
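To make the "conditional pattern base" column above concrete, here is a small R sketch (an illustration written for this write-up, not a full FP-growth implementation) that extracts each item's pattern base as the prefix preceding it in every ordered transaction:

ordered <- list(T100 = c("K", "E", "M", "O", "Y"),
                T200 = c("K", "E", "O", "Y"),
                T300 = c("K", "E", "M"),
                T400 = c("K", "M", "Y"),
                T500 = c("K", "E", "O"))

cond_pattern_base <- function(item) {
  prefixes <- lapply(ordered, function(t) {
    pos <- match(item, t)                       # position of the item in the ordered transaction
    if (!is.na(pos) && pos > 1) t[seq_len(pos - 1)] else NULL
  })
  Filter(Negate(is.null), prefixes)             # drop transactions that do not contain the item
}

cond_pattern_base("Y")   # {K, E, M, O}, {K, E, O}, {K, M} -- each occurring once
cond_pattern_base("O")   # {K, E, M} once and {K, E} twice, i.e. {K, E, M: 1}, {K, E: 2}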
Question 5
According to M. S. Mythili, the Apriori algorithm follows a two-step process:
Join step: Ck is generated by joining Lk-1 with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, so candidates containing one are removed [4].
According to the Data Mining slides (Chapter 6) prepared by Jamileh, candidate generation can be implemented in SQL. Suppose the items in Lk-1 are listed in an order:

Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck [3]
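To illustrate the join and prune steps outside SQL, here is a small R sketch (an illustration written for this write-up, not taken from the slides or the coding file) that generates C3 from the L2 itemsets found in Question 4; only {E, K, O} survives pruning:

gen_candidates <- function(Lk_1) {
  k_1 <- length(Lk_1[[1]])                  # size of the itemsets in L(k-1)
  Ck <- list()
  # Join step: combine pairs of (k-1)-itemsets that share their first k-2 items
  for (i in seq_along(Lk_1)) for (j in seq_along(Lk_1)) {
    p <- Lk_1[[i]]; q <- Lk_1[[j]]
    if (identical(head(p, k_1 - 1), head(q, k_1 - 1)) && p[k_1] < q[k_1]) {
      Ck[[length(Ck) + 1]] <- c(head(p, k_1 - 1), p[k_1], q[k_1])
    }
  }
  # Prune step: drop any candidate that has an infrequent (k-1)-subset
  keep <- sapply(Ck, function(cand) {
    subs <- combn(cand, k_1, simplify = FALSE)
    all(sapply(subs, function(s) any(sapply(Lk_1, identical, s))))
  })
  Ck[keep]
}

L2 <- list(c("E", "K"), c("E", "O"), c("K", "M"), c("K", "O"), c("K", "Y"))
gen_candidates(L2)    # only {E, K, O} remains after the prune step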
We can refer to the package "arules" in R to implement the Apriori algorithm on the Adult dataset. Please refer to the coding file for more information.
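For reference, here is a minimal sketch of what such a call might look like (the support and confidence thresholds below are placeholders for illustration, not necessarily the ones used in the coding file):

library(arules)
data("Adult")                                 # census data shipped with arules as a transactions object
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.8, minlen = 2))
inspect(head(sort(rules, by = "lift"), 5))    # show the five rules with the highest lift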
Besides, based on research conducted by Jeff Heaton, titled "Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms" [7], while Apriori is a common method for frequent itemset mining, it has difficulty scaling and consumes memory much faster than Eclat and FP-Growth. B. Santhosh Kumar's research, titled "Implementation of Web Usage Mining Using APRIORI and FP Growth Algorithms", showed that because the Apriori algorithm uses candidate generation, it is costly when handling a large number of candidate sets; it also requires multiple scans of the database and checks the candidate sets by pattern matching, whereas an efficient FP-tree based method like FP-Growth needs only two phases to output the frequent patterns [10].
The Eclat and FP-Growth algorithms show similar growth in runtime as frequent-itemset density increases; a similar trend holds for Apriori until the density exceeds 70%; Eclat is somewhat ahead of FP-Growth when density is low [7].
The algorithms above show almost the same performance for basket sizes up to 60; above 60, however, Apriori's runtime grows much faster than that of Eclat and FP-Growth, with the difference most pronounced between a maximum transaction size of 60 and 70; this is possibly because of the increase in memory consumed by Apriori [7].
Based on one of Borgelt's studies, titled "Frequent item set mining", his implementation of the Apriori algorithm performed much better for high support thresholds but much worse for small thresholds, mainly due to differences in the implementation of some of the data structures and procedures used [8].
Another study, from Zheng et al. in their article "Real-world performance of association rule algorithms", showed that different machine architectures sometimes behave differently under the same algorithms. The research revealed that the choice of algorithm only matters at support levels that generate more rules than are needed in practice. For example, at a support level that generates a small enough number of rules to be human-understandable, Apriori's processing time is under a minute, so performance improvements are not interesting; at support levels that generate about 1,000,000 rules, which are no longer human-understandable, the Apriori algorithm still finishes processing in less than 10 minutes; however, if the support level continues to drop (which is often the case for prediction purposes), the number of frequent itemsets and association rules grows exponentially, which makes memory space the main concern for most algorithms - they either run out of memory or out of reasonable disk space [9].
Based on a thesis titled "Efficient Frequent Pattern Mining" by Bart Goethals, it was concluded that the algorithms' performance also differs with the kind of data set on which they are run. As long as the database fits in main memory, the Hybrid algorithm (a combination of an optimized version of Apriori and Eclat) is the most efficient, while the Eclat algorithm is better for dense databases; if the database does not fit into memory, the choice of algorithm depends on the density of the database: Hybrid should be chosen for sparse datasets, while Eclat should be considered for dense datasets [11].
References
[1] Han, J., Kamber, M., & Pei, J. (2012, Jan 2). Data Mining: Concepts and Techniques 3rd Solution Manual.
%27s_manual.pdf
[2] raiyan1102006. ChiMerge [Python implementation]. GitHub. https://github.com/raiyan1102006/ChiMerge/blob/master/ChiMerge.py
[3] Jamileh, Y. (2020, Oct 4). Data Transformation. Data Mining course slides.
https://cbulms2.cbu.ca/moodle/pluginfile.php/70165/mod_resource/content/3/ML_Week4.pdf
[4] Mythili, M. S., & Mohamed Shanavas, A. R. (2013). Performance Evaluation of Apriori and FP-Growth Algorithms. International Journal of Computer Applications. https://doi.org/10.5120/13779-1650
[5] Muliono, R., Muhathir, Khairina, N., & Harahap, M. K. (2019). Analysis of Frequent Itemsets Mining Algorithm Against Models of Different Datasets. Journal of Physics: Conference Series. doi:10.1088/1742-6596/1361/1/012036
[6] Amneh S. (2018). Hybrid user action prediction system for automated home using association rules and
ontology. The Institution of Engineering and Technology. doi: 10.1049/iet-wss.2018.5032
[7] Heaton, J. (2016). Comparing dataset characteristics that favor the Apriori, Eclat or FP-Growth frequent itemset mining algorithms. Institute of Electrical and Electronics Engineers (IEEE). doi:10.1109/secon.2016.7506659
[8] C. Borgelt, “Frequent item set mining,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, vol. 2, no. 6, pp. 437–456, 2012.
[9] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In F. Provost
and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 401–406. ACM Press, 2001
[10] Santhosh Kumar, B., & Rukmani, K. V. (2010). Implementation of Web Usage Mining Using APRIORI
and FP Growth Algorithms. Int. J. of Advanced Networking and Applications.
[11] Goethals, B. (2012, Dec). Efficient Frequent Pattern Mining. School of Information Technology.
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.4137&rep=rep1&type=pdf