Parallel MS-Kmeans Clustering Algorithm Based on MapReduce
Guodong Li ( lgd@ncepu.edu.cn )
North China Electric Power University
Chunhong Wang
North China Electric Power University
Kai Li
Xinjiang Information and Telecommunication Company
Research Article
DOI: https://doi.org/10.21203/rs.3.rs-1857679/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Aimed at the problems of initial point selection, outlier influence, and cluster instability that the traditional K-means algorithm faces in big data clustering, an MS-Kmeans algorithm based on the MapReduce framework was proposed. The algorithm selected multiple non-outlier points as candidate center points, applied mean shift to the candidate center points, used the maximum-minimum principle to select K initial center points from the candidates, and then executed the K-means algorithm to find the final center points. To improve the running speed of the algorithm when clustering large data sets, the algorithm was parallelized with the MapReduce framework on the Hadoop platform. The experimental results showed that the parallel MS-Kmeans algorithm is feasible for big data clustering and performs well in terms of clustering quality, speed, and stability.
Keywords: Clustering, MapReduce, Hadoop, MS-Kmeans
1 Introduction
Clustering is an iterative process [1] that divides a data set into clusters, placing objects with high similarity in the same cluster and objects with low similarity in different clusters [2, 3]. Clustering can be applied to
data analysis in many computer-related fields, such as text analysis, pattern recognition, and artificial intelligence. The K-means algorithm is one of the classical clustering algorithms and is distance-based [4, 5]. Its basic idea is to divide the data set into K subsets, each representing a cluster. The K-means algorithm is highly dependent on the selection of the initial center points and is therefore susceptible to the influence of outliers, which causes the clustering to converge to local optima. For this reason, several papers have proposed improved K-means algorithms [6, 7] to solve the initial center point selection problem of the traditional K-means algorithm, and these algorithms give good results on small data sets.
The K-means algorithm requires a large amount of computational resources when dealing with large-scale data, and its convergence rate becomes slower [8]. Clustering with a parallel K-means algorithm is therefore a popular topic in data mining. To improve the storage and computational capacity for massive data, the Apache Foundation developed the Hadoop distributed system infrastructure, which is highly reliable, scalable, fault-tolerant, and low-cost [9, 10]. HDFS and MapReduce are the two core modules of Hadoop: HDFS is a distributed file management system running across multiple machines [11, 12], while MapReduce is a programming framework for distributed computing that can turn serial algorithms into parallel ones, reducing their time complexity [13–15] and computing time; it is centered on three functions, the Map function, the Combine function, and the Reduce function [16, 17]. However, K-means remains susceptible to the influence of the initial values and of outlier points during clustering, which makes each result unstable and prone to converging to a local optimum, degrading the clustering performance.
In this paper, we propose a K-means variant with a changed initial center point selection, named MS-Kmeans. It first randomly selects M non-outlier points from the data set as candidate center points, in order to reduce the influence of outliers, and then runs the Mean Shift algorithm on the M candidate center points so that they move toward data-dense areas. To avoid the clustering result falling into a local optimum, the K points farthest from each other among the M candidate center points are selected as the initial center points of K-means clustering using the maximum-minimum principle, and the data are then clustered by K-means. In addition, a parallel MS-Kmeans algorithm is developed with the MapReduce framework in order to reduce the running time, making the algorithm more suitable for handling large data sets.
2 Related work
Traditional clustering algorithms were efficient and obtained satisfactory clustering results when dealing with small data sets, but performed poorly on large data sets owing to memory, data volume, and computational power constraints. To improve the performance of clustering on large data
were assigned to the nearest cluster. To improve efficiency and reduce complexity, the algorithm was parallelized on the Hadoop platform. The literature [23] used a hash algorithm to map data with high similarity to the same address space and data with low similarity to different address spaces; K initial center points were then selected from K data-rich address spaces such that the distances between different clusters were as large as possible. The hash-function mapping can effectively prevent outlier points from becoming initial clustering centers and helps mine the clustering relationships in the data, so that better initial center points are selected. A KMEANS-BRMS algorithm based on interval means and sampling was proposed in the literature [24]: to reduce the influence of outliers on the clustering results, the mean range method (MRM) was used to obtain the initial clustering centers, and to avoid the data skew caused by the uneven distribution of data across nodes during parallelization, the BSA strategy, based on pond sampling and a first-adaptive algorithm, was proposed, thus improving the overall clustering efficiency.
In this paper, based on the MapReduce framework, a parallel MS-Kmeans clustering algorithm is proposed. It selects non-outlier points as candidate center points, uses the Mean Shift algorithm to move the candidate center points to data-dense regions [25], uses the maximum-minimum principle to select K points from the candidate center points as the initial center points of the K-means algorithm, and then performs K-means clustering on the data to find the final cluster center points and assign all data to the nearest clusters. This algorithm effectively reduces the clustering running time and improves clustering performance and stability.
The main contributions of this paper are as follows: (1) We propose a parallel MS-Kmeans algorithm that speeds up the clustering of large data and improves the performance and stability of clustering. (2) By checking for other data points in the surrounding high-dimensional region, we can effectively determine whether the current point is an outlier, thus reducing the impact of outliers on clustering performance. (3) By moving the candidate center points to dense regions that represent the data distribution with the Mean Shift algorithm, the search for center points is accelerated and the stability of the algorithm is increased. (4) The maximum-minimum principle is introduced to select the initial center points from the candidate center points, so that K-means avoids converging to local optima.
Fig. 1 Mean Shift process for each candidate center vector. $x'_m$ in panel (a) is a candidate center vector; panel (b) draws all shift vectors in the specified area; panel (c) shows the distance and direction of one shift of $x'_m$, where $M(x'_m)$ is the mean shift vector and the marked point is the position of $x'_m$ after its first shift; panel (d) shows the result of one shift of a candidate center vector; panel (e) shows the result of multiple shifts of a candidate center vector; finally, panel (f) shows the result of the complete mean shift iteration for a candidate center vector.
3 MS-Kmeans algorithm
Suppose there are N data vectors, each P-dimensional: $X = \{x_n = (x_{n1}, x_{n2}, \ldots, x_{np}, \ldots, x_{nP}),\ n = 1, 2, \ldots, N;\ p = 1, 2, \ldots, P\}$, where $x_n$ denotes the n-th P-dimensional data vector. $X' = \{x'_m = (x'_{m1}, x'_{m2}, \ldots, x'_{mp}, \ldots, x'_{mP}),\ m = 1, 2, \ldots, M;\ p = 1, 2, \ldots, P\}$ is a set of M candidate center vectors randomly selected from X. $R = \{R_1, R_2, \ldots, R_k, \ldots, R_K\}$ denotes the K final clusters, with $R_k$ the k-th cluster. $C = \{c_k = (c_{k1}, c_{k2}, \ldots, c_{kp}, \ldots, c_{kP}),\ k = 1, 2, \ldots, K;\ p = 1, 2, \ldots, P\}$ is the set of center vectors of the K clusters, with $c_k$ the k-th cluster center, where $K < M < N$.
The following are the steps of the algorithm proposed in this paper:
Select M non-outlier points from the data set X as the candidate center vector set X′. For each candidate center vector, mean shift is performed using the Mean Shift algorithm until the shift distance equals zero or the maximum number of shifts is reached, as shown in Figure 1.
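The paper does not spell out the outlier test; reading contribution (2) as "a point with too few neighbours inside the radius-D region is an outlier", a minimal serial sketch in Python might look as follows (the threshold min_neighbors and the use of Euclidean distance are assumptions, not the paper's stated criterion):

```python
import numpy as np

def select_candidates(X, M, D, min_neighbors=5, rng=None):
    """Randomly pick M non-outlier rows of X as candidate center vectors.

    A point is treated as an outlier when fewer than `min_neighbors`
    other points fall inside the radius-D ball around it (an assumed
    reading of the paper's outlier test).
    """
    rng = np.random.default_rng(rng)
    candidates = []
    for i in rng.permutation(len(X)):
        # count the neighbours of X[i] inside the radius-D region
        dists = np.linalg.norm(X - X[i], axis=1)
        if np.sum(dists <= D) - 1 >= min_neighbors:  # exclude the point itself
            candidates.append(X[i])
            if len(candidates) == M:
                break
    return np.array(candidates)
```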
The Mean Shift algorithm for each candidate center vector at time t is described in steps (1)-(4):

(1) Find all data vectors $x_j$ that lie in the high-dimensional region $S_D$ of radius $D$ around the candidate center vector $(x'_m)^t$, where $(x'_m)^t$ is the m-th candidate center vector in state $t$.

(2) Calculate the average of the offsets of all vectors $x_j$ in $S_D$ with formula (1) to obtain the mean shift vector $M(x'_m)^t$:

$$M(x'_m)^t = \frac{1}{G} \sum_{x_j \in S_D} \left( x_j - (x'_m)^t \right) \qquad (1)$$

where $G$ is the number of data vectors in $S_D$.

(3) Shift the candidate center vector by the mean shift vector using formula (2):

$$(x'_m)^{t+1} = (x'_m)^t + M(x'_m)^t \qquad (2)$$

where $(x'_m)^{t+1}$ is the m-th candidate center vector in state $t + 1$.

(4) Repeat steps (2) and (3) until the shift distance is equal to zero or the maximum number of shifts is reached.
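As a serial illustration of steps (1)-(4), the following Python sketch implements formulas (1) and (2) for a single candidate center vector (the tolerance argument tol is an implementation convenience, not part of the paper's description):

```python
import numpy as np

def mean_shift(x_m, X, D, max_shifts=100, tol=0.0):
    """Shift one candidate center vector x_m toward a data-dense region.

    The mean shift vector is the average of (x_j - x_m) over the G data
    vectors x_j inside the radius-D ball S_D (formula (1)); the candidate
    is moved by that vector (formula (2)) until the shift distance is
    (near) zero or max_shifts is reached.
    """
    x = x_m.astype(float).copy()
    for _ in range(max_shifts):
        in_ball = np.linalg.norm(X - x, axis=1) <= D     # membership in S_D
        G = int(np.sum(in_ball))
        if G == 0:
            break
        shift = (X[in_ball] - x).sum(axis=0) / G         # formula (1)
        x = x + shift                                    # formula (2)
        if np.linalg.norm(shift) <= tol:                 # shift distance zero
            break
    return x
```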
After the Mean Shift algorithm is executed, M candidate center vectors located in high-density regions are obtained. By the maximum-minimum principle, the K vectors farthest from each other are selected from the candidate center vectors as the initial center vector set C. The specific steps are (5)-(7):

(5) Randomly select a vector $x'_m$ from the set X′ and add it to set C.

(6) Calculate the distance from each vector in X′ to all vectors in set C; the vector $x'_m$ that attains

$$\max_{x'_m \in X'} \ \min \left( \operatorname{dist}(x'_m, c_1), \operatorname{dist}(x'_m, c_2), \ldots, \operatorname{dist}(x'_m, c_k) \right)$$

is added to C, where $\operatorname{dist}(x'_m, c_k)$ denotes the Euclidean distance between $x'_m$ and $c_k$.

(7) Repeat step (6) until K initial center vectors have been selected as the center vectors of the K clusters.
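Steps (5)-(7) can be sketched as follows; this is an illustrative serial version, not the paper's parallel implementation:

```python
import numpy as np

def max_min_init(candidates, K, rng=None):
    """Pick K initial centers from the shifted candidates (steps (5)-(7)).

    The first center is chosen at random; each subsequent center is the
    candidate whose minimum distance to the already-chosen centers is
    maximal (the maximum-minimum principle).
    """
    rng = np.random.default_rng(rng)
    centers = [candidates[rng.integers(len(candidates))]]      # step (5)
    while len(centers) < K:                                    # step (7)
        # minimum distance from every candidate to the chosen centers
        d = np.min(
            np.linalg.norm(
                candidates[:, None, :] - np.array(centers)[None, :, :],
                axis=2),
            axis=1)
        centers.append(candidates[np.argmax(d)])               # step (6)
    return np.array(centers)
```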
All data are clustered by K-means based on the selected K initial center vectors. The specific steps are (8)-(11):

(8) For each vector $x_n$ in X, its Euclidean distance to the K center vectors is calculated using formula (3), and $x_n$ is assigned to the nearest cluster:

$$\operatorname{dis}(x_n, c_k) = \sqrt{\sum_{p=1}^{P} (x_{np} - c_{kp})^2} \qquad (3)$$

(9) Recalculate each center vector using formula (4) and check whether the distance between the new and previous center vectors is equal to zero:

$$c_k = \frac{1}{|R_k|} \sum_{x_n \in R_k} x_n \qquad (4)$$

where $|R_k|$ is the number of data vectors in cluster $R_k$.

(10) Repeat steps (8) and (9) until the K center vectors no longer change, indicating that the clustering has converged; the final K center vectors are then obtained.
(11) All data vectors are assigned to the nearest cluster, and the clustering results R are output.
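A serial sketch of steps (8)-(11), using formulas (3) and (4); the empty-cluster guard is an implementation detail the paper does not discuss:

```python
import numpy as np

def kmeans(X, centers, max_iters=300):
    """Lloyd-style K-means from the chosen initial centers (steps (8)-(11))."""
    centers = centers.astype(float).copy()
    for _ in range(max_iters):
        # step (8): Euclidean distance (formula (3)); nearest cluster wins
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # step (9): recompute each center as the mean of its cluster R_k (formula (4))
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))])
        if np.allclose(new_centers, centers):       # step (10): centers unchanged
            break
        centers = new_centers
    # step (11): final assignment of every vector to its nearest cluster
    labels = np.argmin(
        np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
    return labels, centers
```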
Fig. 2 Flow chart of the parallel MS-Kmeans algorithm: the data are processed by MapReduce in the Mean Shift stage and then in the K-means stage, each stage iterating its Reduce step until the center vectors are unchanged, after which the clustering results are output.
4.2 K-means
The main work of the K-means stage is to select K initial center vectors from the candidate center vectors generated at the end of the Mean Shift stage, and then to perform K-means clustering on the data. This stage also comprises three components: Map, Combine, and Reduce.
Map phase: The map function calculates the center vector closest to each data object vector. The data set is input into the map function as <key5, value5> key-value pairs. In this function, the center vector closest to each data vector is computed, an identifier is added, and the result is output as a <key6, value6> pair, where value6 is the data vector and key6 is the identifier of the center vector closest to value6. Each iteration of the map function places each data vector into the cluster of the nearest center vector and may thereby change the center vectors, so the center vectors need to be recalculated.
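The paper does not include code for this phase; as an illustration only, a Hadoop Streaming style mapper in Python could look like the sketch below, where the side-file name centers.txt and the comma-separated record layout are assumptions:

```python
#!/usr/bin/env python3
# Map phase sketch (Hadoop Streaming style; illustrative, not the
# paper's actual implementation).
import sys
import numpy as np

# current center vectors, distributed to every mapper before the job runs
centers = np.loadtxt("centers.txt", delimiter=",", ndmin=2)

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    x = np.array(line.split(","), dtype=float)          # value5: one data vector
    k = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    print(f"{k}\t{line}")                               # key6 = nearest center id
```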
Combine phase: Similar to the combine function in the Mean Shift stage, the combine function merges the key-value pairs that share the same key in order to reduce the work of the reduce function. Its output is a <key7, value7> key-value pair, where key7 is the center vector identifier and value7 is the combination of all vectors with the same key7.
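A matching combiner sketch is shown below. Since the reduce function only needs the cluster mean, emitting a count together with an element-wise partial sum is one common realization of "combining all vectors of the same key"; the exact value format is an assumption:

```python
#!/usr/bin/env python3
# Combine phase sketch: pre-aggregate the mapper output per center id.
import sys
from collections import defaultdict
import numpy as np

sums, counts = {}, defaultdict(int)

for line in sys.stdin:
    key, vec = line.rstrip("\n").split("\t")
    x = np.array(vec.split(","), dtype=float)
    sums[key] = x if key not in sums else sums[key] + x
    counts[key] += 1

for key, s in sums.items():
    # key7 = center id, value7 = count plus element-wise partial sum
    print(f"{key}\t{counts[key]},{','.join(map(str, s))}")
```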
Reduce phase: The reduce function processes the data vectors belonging to the same cluster and generates a new center vector. The input of the reduce function is the output of the combine function. At this stage, the new center vectors are calculated and written to an HDFS file to be read by the map function in the next iteration; the output is a <key8, value8> pair, where value8 is the new center vector and key8 is the identifier of the new center vector. After each iteration, the distance between each new center vector and the previous one is calculated; if the distance is equal to zero, the iteration ends and the clustering result has converged.
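A matching reducer sketch, again illustrative rather than the paper's actual code (the record format follows the combiner sketch above):

```python
#!/usr/bin/env python3
# Reduce phase sketch: merge the combiners' partial sums and emit the new
# center vector of each cluster id; in the real job this output would be
# written back to HDFS for the next map iteration.
import sys
from collections import defaultdict
import numpy as np

totals, counts = {}, defaultdict(int)

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    fields = value.split(",")
    n, s = int(fields[0]), np.array(fields[1:], dtype=float)
    totals[key] = s if key not in totals else totals[key] + s
    counts[key] += n

for key, s in totals.items():
    center = s / counts[key]                       # new center = cluster mean
    print(f"{key}\t{','.join(map(str, center))}")  # key8, value8
```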
Three metrics, the average Silhouette coefficient, the Davies-Bouldin index (DBI), and the sum of squared errors (SSE), are selected to evaluate the performance of the algorithm, and the proposed algorithm is compared with K-means and K-means++. To avoid the effect of randomness, each of the three clustering algorithms was run 20 times on each of the three data sets, and the average values of the clustering indicators were taken, as shown in Table 2. Compared with the K-means and K-means++ algorithms, the MS-Kmeans algorithm proposed in this paper performs better on all data sets under all three metrics, with a particularly significant improvement on the KDD-Cup-1999 data set. It can also be seen from the table that the larger the amount of data, the more pronounced the performance improvement of the proposed algorithm.
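For reference, the three indicators can be computed per run roughly as follows (scikit-learn is used here for convenience; the paper does not state which implementation it used, and ms_kmeans is a hypothetical driver for the pipeline sketched in Section 3):

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

def evaluate(X, labels, centers):
    """The three indicators used in the experiments, for one clustering run."""
    sse = float(np.sum((X - centers[labels]) ** 2))   # sum of squared errors
    return {"Average Silhouette": silhouette_score(X, labels),
            "DBI": davies_bouldin_score(X, labels),
            "SSE": sse}

# Averaging over 20 independent runs, as in the experimental protocol:
# scores = [evaluate(X, *ms_kmeans(X, K)) for _ in range(20)]
```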
Fig. 3 Speedup ratio. The graph depicts the speedup ratio of the MS-Kmeans algorithm on the three data sets for different numbers of nodes.
6 Conclusion
To solve the problems that the traditional K-means algorithm is susceptible to outliers, time-consuming, and unstable in the context of big data, this paper proposes the MS-Kmeans algorithm, which detects outliers in the high-dimensional space, uses the Mean Shift algorithm to move candidate center points to data-dense regions, and selects the initial center points with the maximum-minimum principle; the algorithm is parallelized within the MapReduce framework. Experiments were conducted on the public Iris, Exp-nonscale, and KDD-Cup-1999 data sets to compare the K-means, K-means++, and MS-Kmeans algorithms and to verify the advantages of the proposed algorithm. The experimental results show that the clustering speed, performance, and stability of the MS-Kmeans algorithm on large data sets are effectively improved.
Declarations
• Funding The authors did not receive support from any organization for the submitted work; no funding was received to assist with the preparation of this manuscript or for conducting this study.
• Conflict of interest/Competing interests The authors have no relevant financial or non-financial interests to disclose and no competing interests relevant to the content of this article.
• Ethics approval The manuscript will not be submitted to multiple journals for consideration at the same time. The submitted work is original and has not been published elsewhere in any form or language (in part or in whole). The results are presented clearly and honestly, without fabrication, falsification, or improper data manipulation.
• Consent to participate Not applicable.
• Consent for publication
Submission of work requires that the piece to be reviewed has not been
previously published. Upon acceptance, the Author assigns to the Journal
of Grid Computing the right to publish and distribute the manuscript.
• Availability of data and materials
The datasets analysed during the current study are available in the UCI
and Kaggle. These datasets were derived from the following public domain
resources:
https://archive.ics.uci.edu/ml/datasets/Iris
https://www.kaggle.com/datasets/zechengzhang/cluster-exp-nonscale?select=kappa_omega_test.txt
http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data
• Authors’ contributions
Li Guodong: Data Curation, Writing, Review, Editing, Supervision, Project
administration.
Wang Chunhong: Methodology, Software, Validation, Investigation, Writing-
Original draft preparation.
Li Kai: Conceptualization, Visualization.
• Acknowledgements
On the completion of this thesis, I would like to express my deepest gratitude to all those whose kindness and advice have made this work possible. I am greatly indebted to my advisor Li Guodong, who gave me valuable instructions and helped me improve my language. His effective advice and shrewd comments have kept the thesis in the right direction.
I would like to thank my partner Li Kai for his friendship and constructive suggestions; he constantly encouraged me when I felt frustrated with this dissertation.
References
[1] Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M.,
Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al.: Apache hadoop
yarn: Yet another resource negotiator. In: Proceedings of the 4th Annual
Symposium on Cloud Computing, pp. 1–16 (2013)
[2] Saeed, Z., Abbasi, R.A., Maqbool, O., Sadaf, A., Razzak, I., Daud, A.,
Aljohani, N.R., Xu, G.: What’s happening around the world? a survey
and framework on event detection techniques on twitter. Journal of Grid
Computing 17(2), 279–312 (2019)
[3] da Rosa Righi, R., Lehmann, M., Gomes, M.M., Nobre, J.C., da Costa,
C.A., Rigo, S.J., Lena, M., Mohr, R.F., de Oliveira, L.R.B.: A survey on
global management view: toward combining system monitoring, resource
management, and load prediction. Journal of Grid Computing 17(3), 473–
502 (2019)
[5] Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344(6191), 1492–1496 (2014)
[6] Zhang, G., Zhang, C., Zhang, H.: Improved k-means algorithm based on density canopy. Knowledge-Based Systems 145, 289–297 (2018)
[7] Ling, Y., Zhang, X.: An improved k-means algorithm based on multiple
clustering and density. In: 2021 13th International Conference on Machine
Learning and Computing, pp. 86–92 (2021)
[9] Feng, D., Zhu, L., Zhang, L.: Review of hadoop performance optimiza-
tion. In: 2016 2nd IEEE International Conference on Computer and
Communications (ICCC), pp. 65–68 (2016). IEEE
[10] Yonggui, W., Chao, W., Wei, D.: Random sampling k-means algorithm
based on mapreduce. Computer Engineering and Applications 52(8), 74–
79 (2016)
[11] Khan, M.A., Memon, Z.A., Khan, S.: Highly available hadoop namenode
architecture. In: 2012 International Conference on Advanced Computer
Science Applications and Technologies (ACSAT), pp. 167–172 (2012).
IEEE
[12] Singh, K., Kaur, R.: Hadoop: addressing challenges of big data. In: 2014
IEEE International Advance Computing Conference (IACC), pp. 686–689
(2014). IEEE
[13] Sardar, T.H., Ansari, Z.: Distributed big data clustering using mapreduce-
based fuzzy c-medoids. Journal of The Institution of Engineers (India):
Series B 103(1), 73–82 (2022)
[15] Sardar, T.H., Ansari, Z.: Mapreduce-based fuzzy c-means algorithm for
distributed document clustering. Journal of The Institution of Engineers
(India): Series B 103(1), 131–142 (2022)
[16] Xiong, K., He, Y.: Power-efficient resource allocation in mapreduce clusters. In: 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013) (2013). IEEE
[17] Hanif, M., Lee, C.: Jargon of hadoop mapreduce scheduling techniques: a
scientific categorization. The Knowledge Engineering Review 34 (2019)
[18] Lai, X., Gong, X., Han, L.: Genetic algorithm based k-medoids clustering
within mapreduce framework. Computer Science 44(03), 23–26 (2017)
[19] Zhang, W., Jiang, L.: Parallel computation algorithm for big data clus-
tering based on mapreduce. Application Research of Computers 37(1)
(2018)
[20] Li, H., Liu, R., Wang, J., Wu, Q.: An enhanced and efficient clustering
algorithm for large data using mapreduce. IAENG International Journal
of Computer Science 46(1), 61–67 (2019)
[22] Lu, W.: Improved k-means clustering algorithm for big data mining under
hadoop parallel framework. Journal of Grid Computing 18(2), 239–250
(2020)
[24] Huang, X., Cheng, S.: Optimization of k-means algorithm base on mapre-
duce. In: Journal of Physics: Conference Series, vol. 1881, p. 032069
(2021). IOP Publishing
[25] Chen, Y., Hu, P., Wang, W.: Improved k-means algorithm and its imple-
mentation based on mean shift. In: 2018 11th International Congress on
Image and Signal Processing, Biomedical Engineering and Informatics
(CISP-BMEI), pp. 1–5 (2018). IEEE
Fig. 4 Stability analysis. This paper compares the stability of the K-means, K-means++, and MS-Kmeans algorithms on three data sets. Figures (a), (b), and (c) show the stability on Iris in terms of Average Silhouette, DBI, and SSE; figures (d), (e), and (f) show the stability on Exp-nonscale in terms of Average Silhouette, DBI, and SSE; figure (g) shows the stability on KDD-Cup-1999 in terms of DBI. The X-axis represents the number of experiments and the Y-axis the value of each index.
Fig. 5 Run time. Figure (a) shows the running time of the three algorithms on Iris for different numbers of nodes; figure (b) shows the running time of the three algorithms on Exp-nonscale for different numbers of nodes; figure (c) shows the running time of the three algorithms on KDD-Cup-1999 for different numbers of nodes.