K-Means and MapReduce Algorithm

This document summarizes a retracted research article that applied a K-means clustering algorithm to energy data analysis. The publisher retracted the article after an investigation uncovered evidence of systematic manipulation in the publication process, including discrepancies in the scope, research described, and data availability as well as inappropriate citations and irrelevant content. The retraction notice aims to alert readers that the article's content is unreliable but does not investigate author involvement in the manipulation.


Hindawi

Wireless Communications and Mobile Computing


Volume 2023, Article ID 9768373, 1 page
https://doi.org/10.1155/2023/9768373

Retraction
Retracted: Application of K-Means Clustering Algorithm in Energy Data Analysis

Wireless Communications and Mobile Computing


Received 19 September 2023; Accepted 19 September 2023; Published 20 September 2023

Copyright © 2023 Wireless Communications and Mobile Computing. This is an open access article distributed under the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.

This article has been retracted by Hindawi following an investigation undertaken by the publisher [1]. This investigation has uncovered evidence of one or more of the following indicators of systematic manipulation of the publication process:

(1) Discrepancies in scope
(2) Discrepancies in the description of the research reported
(3) Discrepancies between the availability of data and the research described
(4) Inappropriate citations
(5) Incoherent, meaningless and/or irrelevant content included in the article
(6) Peer-review manipulation

The presence of these indicators undermines our confidence in the integrity of the article's content, and we cannot, therefore, vouch for its reliability. Please note that this notice is intended solely to alert readers that the content of this article is unreliable. We have not investigated whether authors were aware of or involved in the systematic manipulation of the publication process.

Wiley and Hindawi regret that the usual quality checks did not identify these issues before publication and have since put additional measures in place to safeguard research integrity.

We wish to credit our own Research Integrity and Research Publishing teams and anonymous and named external researchers and research integrity experts for contributing to this investigation.

The corresponding author, as the representative of all authors, has been given the opportunity to register their agreement or disagreement to this retraction. We have kept a record of any response received.

References

[1] Y. Zhou, "Application of K-Means Clustering Algorithm in Energy Data Analysis," Wireless Communications and Mobile Computing, vol. 2022, Article ID 5914893, 8 pages, 2022.
Hindawi
Wireless Communications and Mobile Computing
Volume 2022, Article ID 5914893, 8 pages
https://doi.org/10.1155/2022/5914893

Research Article
Application of K-Means Clustering Algorithm in Energy Data Analysis

Ying Zhou

Lanzhou Resources & Environment Voc-Tech University, Lanzhou, Gansu 730022, China

Correspondence should be addressed to Ying Zhou; 11233234@stu.wxic.edu.cn

Received 31 March 2022; Revised 1 May 2022; Accepted 16 May 2022; Published 31 May 2022

Academic Editor: Aruna K K

Copyright © 2022 Ying Zhou. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In order to solve the problem of how to explore the potential information in massive data and make effective use of it, this paper studies news text clustering and proposes a news clustering algorithm based on an improved K-Means. The MapReduce programming model is then used to parallelize the TIM-K-Means algorithm so that it can run on the Hadoop platform. Accuracy and error are used as measurement indicators, and experiments on collected datasets verify the correctness and effectiveness of the TI value and the TIM-K-Means algorithm. In addition, a Hadoop cluster is built on Alibaba Cloud servers, and the feasibility of the parallelization of the TIM-K-Means algorithm is verified through speedup comparisons. The results show that the parallelized TIM-K-Means achieves a good speedup ratio, can save about 30% of the running time under the same conditions, and can meet the actual needs of processing massive data in the context of big data. In multidocument automatic summarization, the news clustering algorithm can gather news items with the same topic and provide cleaner, more accurate data for visual automatic summarization, which is of great significance in fields such as public opinion supervision, hot topic discovery, and real-time tracking of emergencies.

1. Introduction

With the rapid development of network technology and cloud computing in recent years, the network has become an indispensable part of production and life, and the dissemination of information has become more and more rapid with the rise of the network [1]. People have more convenient and complete ways to obtain information [2]. The popularization of computer education has greatly reduced the technical threshold of software development. With the national attention to the information industry and the support of youth entrepreneurship, various media have sprung up rapidly. Online social media is loved by people of all ages and has a large number of users. With the huge number of users and high user activity, these network media not only speed up the dissemination of information but also produce a large amount of data. Traditional paper media have been impacted unprecedentedly, so they have invested a lot of resources to establish their own news portals or apps to deal with the crisis. In these media, most of the information is transmitted in the form of text. However, there is a certain limit to the amount of information people can obtain, which is far lower than the speed of information generation and dissemination, and this gap is expanding with the acceleration of network technology, resulting in the accumulation of massive information. How to mine valuable information from massive information and apply it to related fields has become a key research field [3].

The rapid development of data mining technology solves the problem of how to obtain potentially valuable information. Data mining uses relevant algorithms to analyze the data and get the valuable rules or information hidden behind it, so as to better find the potential value, optimize the production process, and provide useful information for scientific research. At present, data mining technology is quite active in the fields of social networks, recommendation systems, text analysis, and so on. Clustering is an important unsupervised learning method in data mining: the data is divided into several categories through the similarity between data points. It is widely used in the fields of biological information, medical health, artificial intelligence, and so on [4].
As a classical clustering algorithm, the K-Means algorithm has the advantages of being fast, simple, and easy to implement, but it also has some disadvantages: it uses a random method to select the initial center points, which can lead to local optimal solutions and to outliers being mistakenly selected as centers, reducing clustering accuracy and lengthening running time. This paper improves the clustering effect by optimizing a step in the calculation process of the K-Means algorithm. In addition, researchers also integrate the K-Means algorithm with other models and algorithms and apply it to various fields such as finance, medicine, and image processing [5].

In the era of big data, the volume of data and information is growing exponentially, and the data to be processed can reach the level of GB, TB, or even PB. Relying only on a single machine for data processing therefore requires high-performance hardware and takes a lot of time, and if the operation is unexpectedly interrupted due to machine problems, it needs to be rerun. Traditional parallel frameworks need a lot of equipment; although they can solve the problem of massive data processing to a certain extent, they have poor fault tolerance and scalability, and their cost is high. The emergence of Hadoop addresses the time-consuming problem of massive data processing and uses its own fault tolerance to ensure the smooth operation of programs to a certain extent. Hadoop uses the Hadoop Distributed File System (HDFS) to store files, distributes data to multiple servers through the MapReduce computing model for distributed computing, and schedules resources through YARN. The MapReduce computing model encapsulates the functions of data segmentation, task allocation, and fault-tolerant processing; users only need to write the task programs as required [6, 7]. Since its launch, Hadoop has been continuously improved and has developed into a complete ecosystem with multiple components. At present, Hadoop has become the mainstream distributed platform, and major Internet companies use it as the basic platform for offline and streaming data processing. In the field of scientific research, the MapReduce programming model has become the first choice for researchers who parallelize algorithms [8]. In the era of the mobile Internet, people get news anytime and anywhere through mobile phones or computers, and news has become one of the most important kinds of text information in daily life. News plays an extremely important role in spreading positive social energy, carrying forward traditional culture, setting social examples, and guiding public opinion. News clustering is still a kind of text clustering, which gathers texts with similar or even the same topics through cluster analysis. In information retrieval, the direct use of keyword matching leads to unsatisfactory retrieval results due to ambiguity and other factors; if we first cluster the text set and then search according to the keywords generated after clustering, we can retrieve text categories that better match the user's goals. In multidocument automatic summarization, the news clustering algorithm can gather news with the same topic and provide cleaner, more accurate data for visual automatic summarization, which is of great significance in the fields of public opinion supervision, hot topic discovery, real-time tracking of emergencies, and so on [9].

2. Literature Review

Text mining is the product of the combination of data mining and natural language processing. Through the analysis of text, we can get potentially valuable information. Text clustering is one of the important branches of text mining, and it mainly includes two processes: feature extraction and clustering calculation [10]. In terms of feature extraction, research mainly focuses on how to accurately express a text with words. The common methods are frequent itemset mining after word segmentation, inverse indexing, and the latent Dirichlet allocation (LDA) model, or combinations of these methods. In clustering, existing frequent itemset mining algorithms are used to form the feature vectors of text documents, which not only reduces the vector dimension but also retains the commonality between documents for similarity calculation. In addition, researchers use the obtained frequent phrase sequences to represent the text and use an association rule miner to find the binomial sets that meet the minimum support of the Apriori algorithm, which avoids the disadvantage that the traditional vector space model ignores word order and improves the accuracy of text clustering analysis. A text network can also be constructed from the frequent itemsets according to the similarity between texts and then divided with a community division algorithm, so as to achieve the purpose of clustering [11]. After using frequent itemsets to extract feature vectors, two similarity indexes can be combined to produce a new similarity index, with fuzzy logic used for the clustering rules; finally, the datasets are classified by a support vector machine to verify the accuracy of the proposed algorithm. Using labeled data to construct a strong category discrimination word set, the cosine similarity and the similarity based on the strong category discrimination word items are fused to form a new similarity calculation method, and a semisupervised short text clustering algorithm based on the improved similarity and class center vectors is formed. The harmony search algorithm is then used for feature selection to obtain useful information or new feature subsets, so as to reduce the impact of information loss and sparsity on text clustering and thus enhance the clustering effect. Four benchmark text datasets are used for experiments, and the enhancement brought by harmony-search-based unsupervised feature selection to the K-Means clustering algorithm is demonstrated by measuring the F value and accuracy [12, 13].

The K-Means algorithm is a process of repeatedly moving the center point of each class based on similarity, in which the selection of the center points and the definition of similarity are particularly important. Optimizations of the center selection include the maximum distance product algorithm, the minimum variance optimization method, and maximum-minimum similarity. In addition, K-Means has also been combined with LDA and other models to solve the problems of data space and semantic barriers [14, 15].
On the basis of weighted K-Means, the Minkowski metric can be used to measure distance, and the feature weights can be used as feature scaling factors in the traditional K-Means criterion; at the same time, anomaly clustering centers are used to initialize the centroids and feature weights of weighted K-Means. Through experiments on datasets from the UCI machine learning repository and on generated Gaussian clusters, it has been shown that the Minkowski metric plays an important role in the K-Means algorithm. The shortcoming that K-Means initial point selection affects the clustering effect has also been studied: the criteria are dynamically weighted according to the covariance of the dataset to avoid large differences within clusters, and simulation shows that this method has a certain effect. In addition, genetic algorithms have been used to optimize the selection of the initial cluster centers in the K-Means algorithm so as to improve clustering accuracy. There is also the method of using the FP-Growth algorithm to find frequent itemsets and using them to generate the initial clustering centroids and the cluster number K; the improved K-Means algorithm not only improves the accuracy but also speeds up the convergence of clustering. A new point-to-point distance, the s-distance, has been proposed and, combined with a heuristic, yields the s-K-Means algorithm; compared with the traditional K-Means algorithm using Euclidean distance, its clustering effect is significantly enhanced, especially in the case of irregular category distributions. According to the theory that "the farthest sample points are most unlikely to be divided into the same cluster," the maximum distance method has been proposed to select the initial centers.

For the application of the K-Means algorithm in text clustering, researchers have also proposed a variety of improvements and optimizations. Firstly, the particle swarm optimization algorithm is optimized, combining the strong global search ability of particle swarm optimization with the strong local search ability of the K-Means algorithm to improve the effect of text clustering [16]. A transformation formula between cosine similarity and Euclidean distance under standardized vectors has been proposed; based on this, a cosine clustering closely related to and similar in meaning to Euclidean distance is defined, and the selection of the initial centers of K-Means clustering is improved, so that convergence is accelerated and clustering accuracy is improved. Then, in the text preprocessing stage of text clustering, an alternative thesaurus is constructed according to the feature space of the document set, the text theme is obtained with the thesaurus, and the document terms are replaced according to the theme and the corresponding domain dictionary; in the clustering stage, an improved K-Means algorithm based on K-value optimization is proposed [17]. The cooccurrence word principle is then used to calculate text similarity and divide the texts into K + n class families according to the resulting value, and the K-Means algorithm is used to cluster these class clusters, which alleviates the sensitivity of K-Means to the K value. Fair operations and clone operations have been introduced to optimize the bee colony algorithm, which is combined with the K-Means algorithm to improve clustering quality [18].

In terms of applications, the rapid development of social networks provides rich data for text clustering, and many researchers have begun to pay attention to the application of clustering algorithms in social networks. The K-Means algorithm is applied not only in the field of text mining but also in other areas. It has been integrated into the minimum spanning tree algorithm to give a fast minimum spanning tree algorithm based on the N-point complete graph, which reduces the theoretical time complexity from O(N^2) to O(N^1.5) and overcomes the deficiency that the traditional minimum spanning tree algorithm cannot be applied to large datasets due to its time complexity. K-Means clustering can also be used to develop image compression methods on low-power embedded devices, that is, using the similarity of pixel colors to group pixels and compress the original image, so as to reduce the power consumption of wireless imaging sensor networks [19, 20].

3. Research Methods

3.1. K-Means Algorithm. In the K-Means algorithm, for a dataset containing n data points, where x denotes a data point in the dataset, the similarity calculation adopts the Euclidean distance, calculated as follows:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)

where n here is the number of attributes of each data point, and x_i and y_i represent the i-th attribute of data points x and y, respectively. The algorithm randomly selects the initial cluster centers, and in each iteration the average of all vectors assigned to a cluster in the previous pass is used as the new cluster center. The cluster center vector is calculated as follows:

u_i = \frac{1}{h} \sum_{x \in C_i} x    (2)

In the formula, C_i represents the i-th cluster after clustering, and h represents the total number of data points in this cluster. When judging whether clustering has finished, the criterion function adopts the sum of squared errors, calculated as follows:

E = \sum_{i=1}^{k} \sum_{x \in C_i} |x - u_i|^2    (3)

The specific flow of the K-Means algorithm is shown in Figure 1.

Figure 1: Flow chart of the K-Means algorithm (input the dataset D, the number of clusters K, and the convergence condition of the criterion function; randomly select the central vectors; calculate similarities and assign points to clusters; calculate the sum of squared errors; repeat until the criterion function meets the condition; output the clusters).
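For illustration, a minimal NumPy sketch of the loop described by formulas (1)–(3) might look as follows; the random initialization shown here is exactly the weakness that the maximum distance method and the TIM-K-Means algorithm later replace, and the blob data at the end is only a toy example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-Means: Euclidean distance (1), mean update (2), SSE criterion (3)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    prev_sse = np.inf
    for _ in range(max_iter):
        # Formula (1): Euclidean distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # assign each point to its nearest center
        # Formula (2): new center = mean of the points currently in the cluster
        centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
        # Formula (3): sum of squared errors used as the convergence criterion
        sse = ((X - centers[labels]) ** 2).sum()
        if abs(prev_sse - sse) < tol:
            break
        prev_sse = sse
    return labels, centers, sse

# Toy example: three Gaussian blobs in 2-D
X = np.vstack([np.random.randn(50, 2) + offset for offset in ([0, 0], [5, 5], [0, 5])])
labels, centers, sse = kmeans(X, k=3)
```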
3.2. Parallel Foundation. Hadoop is an open source distributed computing platform that was split out of its parent project as separate software in 2006. After a long period of development, Hadoop has formed an ecosystem covering services such as the computing model, data storage, workflow, and communication coordination between clusters [21]. Its core components are shown in Figure 2.

Figure 2: Core components of the Hadoop ecosystem (HDFS for distributed and reliable storage; the MapReduce programming model; the HBase NoSQL database; Zookeeper for coordination; Oozie for workflow; Pig and Hive as high-level abstractions; Sqoop and Flume for enterprise-level data integration).
In the Hadoop ecosystem, the Hadoop Distributed File System is the basic component; it distributes a large amount of data across the computer cluster. The data is written once but can be read many times for analysis. MapReduce is a programming model for distributed parallel data processing and is the main execution framework of Hadoop; it divides a whole job into two stages, map and reduce. HBase is a column-oriented NoSQL database, which can provide fast reading and writing of large amounts of data. Zookeeper and Oozie are mainly used for distributed coordination and workflow. Hive and Pig are abstraction layers, which can analyze data with HQL statements and Pig Latin statements, respectively. In addition, Hadoop also provides the frameworks Sqoop and Flume for enterprise-level data integration. Among them, Sqoop is often used to transfer data between different types of databases, such as MySQL and HBase, while Flume is used to efficiently collect, aggregate, and move large amounts of data from individual machines into HDFS [22].

3.3. File System HDFS. HDFS mainly includes four parts, the NameNode, DataNodes, the Client, and the SecondaryNameNode, and adopts the master-slave mode. The SecondaryNameNode backs up the operation logs and image files of the NameNode at regular intervals. In the HDFS system, the NameNode stores metadata, including information such as directories, data block locations, and data sizes, and persists this information to the local disk. At the same time, the NameNode is responsible for managing the cluster and judges each node and the data on it according to the heartbeat signal sent by the node. The main function of a DataNode is to store data. A DataNode periodically verifies its stored data blocks and periodically sends a heartbeat signal to the NameNode. The heartbeat signal has two main functions: on the one hand, it reports the storage information of the data blocks to the NameNode, and on the other hand, it indicates that the node is still working and has not gone down. The SecondaryNameNode receives the fsimage and edit log, merges them, sends the result back to the NameNode, and also saves the merged file locally to prevent data loss caused by a NameNode crash. The Client provides a file system interface for users and accesses files in HDFS through the NameNode and DataNodes.

3.4. MapReduce Model. The MapReduce model is composed of multiple parts. When using it, users only need to write their program as map and reduce functions according to the format given by the model and then use the driver to configure the required components (including the input and output formats, the combiner, and the partitioner). Most components can be customized according to user requirements. For example, InputFormat and OutputFormat define the input and output formats, RecordReader defines the data reader, and InputSplit controls the slice size. At the same time, adjusting the parameter settings of these components can optimize the execution of a MapReduce job, so as to improve the utilization of computing resources and reduce the consumption of task time.
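The paper targets the Java MapReduce API, but the map/reduce contract it describes can also be sketched with Hadoop Streaming, where the mapper and reducer are plain scripts that read standard input and emit tab-separated key/value pairs. The word-count task, script name, and paths below are illustrative assumptions, not the paper's code.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- illustrative Hadoop Streaming job (not the paper's code).
# Typical submission (jar location and HDFS paths are illustrative):
#   hadoop jar hadoop-streaming.jar -input /news/segmented -output /news/wordcount \
#     -mapper "python wordcount_streaming.py map" \
#     -reducer "python wordcount_streaming.py reduce" -file wordcount_streaming.py
import sys

def mapper():
    # Map: one line of pre-segmented text in, one "word<TAB>1" pair out per token.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer():
    # Reduce: the framework sorts/groups by key, so counts for a word arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```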
3.5. TIM-K-Means Algorithm. Text mining is a set of tools and methods that takes documents as data in order to find the potentially valuable information they contain. It uses knowledge from natural language processing to map documents into data and processes the data corresponding to each document with algorithms from data mining or machine learning, so as to find hidden laws or knowledge. Text mining is thus an extension of data mining, and data mining is the basis and essence of text mining. Text clustering is an important branch of text mining. Clustering analysis of news with the K-Means algorithm essentially mines the news content with a clustering algorithm, gathers similar news items together, and finds the valuable information hidden behind the news content. When extracting features, the TF-IDF value is often used to represent the weight of a word. The main idea of TF-IDF is that if a word appears frequently in an article but rarely in other texts, the word can represent the article to a certain extent and can be regarded as an important feature distinguishing it from other texts [23].

Term frequency (TF) refers to the frequency of a word in a file. The higher the TF value, the more often the word appears in the text, which means the word is more important in that text. The TF value is calculated as follows:

TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}    (4)

In the formula, n_{i,j} represents the number of occurrences of word t_i in document d_j, and the denominator is the sum of the occurrences of all words in document d_j.

Inverse document frequency (IDF) measures the universality of a word. The more widely a word is used across documents, the smaller its IDF value; such a word cannot distinguish one text from other texts and cannot be used as a distinguishing feature. The calculation formula is as follows:

IDF_i = \log \frac{|D|}{1 + |\{d \in D : t_i \in d\}|}    (5)

In the formula, |D| represents the total number of documents in the corpus, and |\{d \in D : t_i \in d\}| represents the number of documents containing the word t_i.

The TF-IDF value of the word is then calculated as follows:

TF\text{-}IDF_{i,j} = TF_{i,j} \times IDF_i    (6)

The flow chart of feature extraction is shown in Figure 3.

Figure 3: Flow chart of text feature vector extraction (text word segmentation; word frequency statistics; stop word filtering; calculation of TF-IDF values and selection of feature items).
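A minimal sketch of formulas (4)–(6) on already segmented documents; word segmentation, stop-word filtering, and the TI weighting described below are assumed to happen elsewhere, and the three toy documents are only for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists (already word-segmented and stop-word filtered).
    Returns one {word: TF-IDF} dict per document, following formulas (4)-(6)."""
    n_docs = len(docs)
    # Number of documents containing each word (denominator of formula (5))
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = float(sum(counts.values()))       # all word occurrences in d_j
        w = {}
        for word, n_ij in counts.items():
            tf = n_ij / total                              # formula (4)
            idf = math.log(n_docs / (1.0 + doc_freq[word]))  # formula (5)
            w[word] = tf * idf                             # formula (6)
        weights.append(w)
    return weights

docs = [["news", "cluster", "algorithm"], ["news", "title", "lead"], ["energy", "data"]]
print(tf_idf(docs))
```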
Aiming at the disadvantages of the traditional K-Means and value represents the sample point [24].
clustering algorithm, such as the clustering effect, K-value
sensitivity, randomness of initial clustering center selection, 4. Result Analysis
and possible local optimal solution, researchers propose a
new method to calculate the similarity of the initial classifi- 4.1. Experimental Environment. In this paper, the TIM-K-
cation points or new methods to improve the effect of clus- Means algorithm is parallelized. In order to make the
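A minimal pure-Python simulation of the map and reduce steps just described: the map step emits a <center id, sample> pair for the nearest center read from the center file, and the reduce step averages each group into a new center. The TI-weighted feature vectors and the maximum-distance initial centers that TIM-K-Means actually uses are assumed to be computed beforehand; this sketches the MapReduce data flow, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def kmeans_map(points, centers):
    """Map step: for each sample, emit a <key, value> pair where the key is the
    index of the nearest center (formula (1)) and the value is the sample itself."""
    for x in points:
        key = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
        yield key, x

def kmeans_reduce(pairs, k, old_centers):
    """Reduce step: average the samples sharing a key to get the new centers
    (formula (2)); a center that received no samples keeps its old position."""
    groups = defaultdict(list)
    for key, x in pairs:
        groups[key].append(x)
    return np.array([np.mean(groups[i], axis=0) if groups[i] else old_centers[i]
                     for i in range(k)])

# One simulated iteration over two "splits" of the dataset
centers = np.array([[0.0, 0.0], [5.0, 5.0]])          # stand-in for the center file
splits = [np.random.randn(100, 2), np.random.randn(100, 2) + 5]
pairs = [kv for split in splits for kv in kmeans_map(split, centers)]
centers = kmeans_reduce(pairs, k=2, old_centers=centers)
```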
4. Result Analysis

4.1. Experimental Environment. In this paper, the TIM-K-Means algorithm is parallelized. In order to make the experimental results more convincing, experiments are carried out in single-machine and distributed environments, respectively.

(1) Single-machine environment: Intel Core i5-6500 CPU, 8 GB of memory, 930 GB of disk, Windows 7 64-bit operating system, Java JDK 1.8.0_161, and Python 2.7.13

(2) Distributed environment: a Hadoop platform built on five Alibaba Cloud servers

4.2. Verification and Analysis of the TI Value. In the feature extraction of news text, this paper puts forward the concept of the TI value on the basis of the TF-IDF value; that is, the words in the news title and introduction are integrated into the calculation of the feature value, so that the obtained feature vector is more representative and accurate. For convenience of description, this paper calls the K-Means algorithm that only incorporates news headline factors in feature extraction the T-K-Means algorithm, the algorithm that only incorporates news lead factors the I-K-Means algorithm, and the algorithm that incorporates the full TI value the TI-K-Means algorithm.

In order to verify the TI value, this paper uses a dataset of 2000 news items covering military, NBA, and science and technology topics. After word segmentation of the news title, introduction, and body text, the distribution of words across these three structures is irregular. In order to determine the corresponding weights of the title and the introduction, this paper uses a progressive search with a step size of 0.5. The accuracy of each method is the average over 10 runs and is expressed as a percentage. m and n represent the weights of the news headline and lead, respectively, when calculating TI values. The experimental results are shown in Table 1.

Table 1: K-Means clustering accuracy under different coefficients (rows: title weight m; columns: lead weight n).

m \ n    0        0.5      1.0      1.5      2.0
0        55.46%   55.73%   55.24%   55.34%   54.82%
0.5      55.36%   54.14%   54.92%   57.64%   55.47%
1.0      55.48%   54.73%   55.78%   55.41%   54.86%
1.5      55.74%   54.85%   55.74%   55.34%   54.68%
2.0      55.69%   55.26%   55.49%   56.49%   54.86%

News headlines and leads have a certain impact on news clustering. When appropriate weights are given to the headline and lead, the feature vector constructed with the TI value helps improve clustering accuracy, which supports the correctness of the TI value. In the TI value, the weights of the title and introduction should be 0.5 and 1.5, respectively, that is, m = 0.5 and n = 1.5. This weighting is in line with the objective fact that the news lead contains more information than the title.
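The excerpt states that the title and lead weights (m = 0.5, n = 1.5) are added on top of a word's TF-IDF value but does not spell out the exact combination rule, so the sketch below is only one plausible reading of that description, with hypothetical helper names.

```python
def ti_value(tfidf, title_tokens, lead_tokens, m=0.5, n=1.5):
    """Hypothetical TI value: add an extra title weight m and lead weight n to the
    TF-IDF weight of words that also occur in the news title or lead.
    The exact combination rule is not given in this excerpt."""
    title, lead = set(title_tokens), set(lead_tokens)
    return {word: w + m * (word in title) + n * (word in lead)
            for word, w in tfidf.items()}

# Toy example with the weights the paper settles on (m = 0.5, n = 1.5):
# "cluster" appears in both the title and the lead, so it receives both boosts.
tfidf = {"cluster": 0.12, "energy": 0.08, "news": 0.05}
print(ti_value(tfidf, title_tokens=["cluster", "news"], lead_tokens=["cluster"]))
```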
4.3. Verification and Analysis of Algorithm Parallelization. In this paper, the K-Means, TIM-K-Means, and parallel TIM-K-Means algorithms are run on datasets of 50,000, 100,000, 250,000, and 500,000 news items, and the time taken to complete clustering is recorded [25]. The running times are shown in Table 2.

Table 2: Algorithm running time in seconds (W = 10,000 news items).

Algorithm               5 W       10 W     25 W      50 W
K-Means                 33.56     63.64    179.68    558.23
TIM-K-Means             32.46     48.35    168.98    540.16
Parallel TIM-K-Means    40.35     47.12    112.46    303.74

We also use different amounts of news data to run the parallel K-Means and parallel TIM-K-Means algorithms in cluster environments with 1, 2, 3, and 4 data nodes, respectively, and record the speedup ratios of the two algorithms, as shown in Figures 4 and 5.

Figure 4: Parallel TIM-K-Means speedup ratio (speedup versus number of data nodes for the 5 W, 10 W, 25 W, and 50 W datasets).

Figure 5: Parallel K-Means speedup ratio (speedup versus number of data nodes for the 5 W, 10 W, 25 W, and 50 W datasets).

The parallel K-Means algorithm and the parallel TIM-K-Means algorithm show similar speedup ratios.
This shows that the TIM-K-Means algorithm can still run stably after the parallelization transformation without destroying the original characteristics of the algorithm. Moreover, as the dataset size and the number of data nodes increase, the speedup of TIM-K-Means grows more markedly than that of K-Means. Therefore, from the point of view of the speedup ratio, the parallelization transformation of the TIM-K-Means algorithm is feasible: it accelerates the execution of the algorithm to a certain extent and addresses the time-consuming clustering of massive news information.
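For reference, the speedup ratio used above is simply the serial running time divided by the parallel running time; computing it from the Table 2 timings (TIM-K-Means versus its parallel version) shows that parallelization only pays off once the dataset is large enough.

```python
# Running times copied from Table 2 (seconds); W = 10,000 news items
serial   = {"5W": 32.46, "10W": 48.35, "25W": 168.98, "50W": 540.16}   # TIM-K-Means
parallel = {"5W": 40.35, "10W": 47.12, "25W": 112.46, "50W": 303.74}   # parallel TIM-K-Means

for size in serial:
    speedup = serial[size] / parallel[size]        # speedup ratio = T_serial / T_parallel
    saving = 1.0 - parallel[size] / serial[size]   # fraction of time saved (negative = slower)
    print("%4s  speedup %.2fx  time saved %6.1f%%" % (size, speedup, saving * 100))
```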
5. Conclusion

Based on an in-depth understanding of clustering analysis and the K-Means algorithm, this paper studies and improves news clustering. Firstly, this paper introduces the research background and significance of news clustering analysis and reviews the research status of both aspects. Secondly, the basic knowledge of clustering analysis and algorithm parallelization technology is introduced. Thirdly, according to the organizational structure of news text, the concept of the TI value is defined and optimized; combining the two, the TIM-K-Means algorithm is proposed and parallelized using the MapReduce programming model, so that it can adapt to a massive data environment. Finally, this paper verifies the above concepts and algorithms in stand-alone and distributed environments, respectively. The main research work of this paper is as follows:

(1) Combined with news headlines and leads, the concept of the TI value is defined, and the weights of headlines and leads in the TI value are determined. When extracting the features of news text, this paper gives different weights to the words in the news title and news lead and adds them to the TF-IDF value of the text feature word to obtain the TI value. Compared with the original feature extraction, the TI value fully considers the organizational structure of news, making feature words more representative. Experiments on Tencent News data show that when the weights of the news title and lead are 0.5 and 1.5, respectively, the TI value is the most representative, which improves the clustering accuracy to a certain extent

(2) The TIM-K-Means algorithm is parallelized. According to the calculation process of the TIM-K-Means algorithm, this paper derives the error calculation formula and obtains a method for calculating the clustering error in a distributed environment. The parallel transformation of TIM-K-Means is carried out using the MapReduce programming model. Experiments show that the parallelized TIM-K-Means has a good speedup ratio and can meet the actual needs of processing massive data in the context of big data

The TIM-K-Means news clustering algorithm proposed in this paper fully combines the organizational structure information of news and improves the selection of the initial clustering centers, which improves clustering accuracy and stability to a certain extent and reduces the clustering error, but there are still deficiencies in some aspects, and further investigation is needed. The main research direction in the future is as follows: how to accurately find the K value? The determination of the K value in news clustering requires certain prior knowledge, and in the absence of any prior knowledge, it can only be set manually based on the operator's experience. How to automatically discover the K value more accurately before news clustering needs further research.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author has no conflict of interest to declare.

References

[1] Y. Fan, Y. Liu, H. Qi, F. Liu, and X. Ji, "Anti-interference technology of surface acoustic wave sensor based on K-Means clustering algorithm," IEEE Sensors Journal, vol. 21, no. 7, pp. 8998–9007, 2021.
[2] D. Zheng, X. Sun, S. K. Damarla, A. Shah, J. Amalraj, and B. Huang, "Valve stiction detection and quantification using a K-Means clustering based moving window approach," Industrial & Engineering Chemistry Research, vol. 60, no. 6, pp. 2563–2577, 2021.
[3] B. S. Aski, A. T. Haghighat, and M. Mohsenzadeh, "Evaluating single web service trust employing a three-level neuro-fuzzy system considering K-Means clustering," Journal of Intelligent and Fuzzy Systems, vol. 40, no. 1, pp. 1–15, 2021.
[4] Z. Chen, "Using big data fuzzy K-Means clustering and information fusion algorithm in English teaching ability evaluation," Complexity, vol. 2021, no. 5, Article ID 5554444, 9 pages, 2021.
[5] D. Lou, M. Yang, D. Shi, G. Wang, and Y. Chen, "K-Means and C4.5 decision tree based prediction of long-term precipitation variability in the Poyang lake basin, China," Atmosphere, vol. 12, no. 7, p. 834, 2021.
[6] F. Tao, R. Suresh, J. Votion, and Y. Cao, "Graph based multi-layer K-Means++ (G-MLKM) for sensory pattern analysis in constrained spaces," Sensors, vol. 21, no. 6, p. 2069, 2021.
[7] P. Arjun and K. G. Manoj, "Improved hybrid bag-boost ensemble with K-Means-SMOTE-ENN technique for handling noisy class imbalanced data," The Computer Journal, vol. 65, p. 1, 2021.
[8] Z. Zhu and N. Liu, "Early warning of financial risk based on K-Means clustering algorithm," Complexity, vol. 2021, no. 24, Article ID 5571683, 12 pages, 2021.
[9] C. Y. Peng, U. Raihany, S. W. Kuo, and Y. Z. Chen, "Sound detection monitoring tool in CNC milling sounds by K-Means clustering algorithm," Sensors, vol. 21, no. 13, p. 4288, 2021.
[10] M. Zhao, H. Gao, Q. Han, J. Ge, W. Wang, and J. Qu, "Development of a driving cycle for Fuzhou using K-Means and AMPSO," Journal of Advanced Transportation, vol. 2021, no. 2, Article ID 5430137, 15 pages, 2021.
[11] V. Utomo and J.-S. Leu, "Automatic news-roundup generation using clustering, extraction, and presentation," Multimedia Systems, vol. 26, no. 2, pp. 201–221, 2020.
[12] B. Liang, N. Li, Z. He, Z. Wang, and T. Lu, "News video summarization combining SURF and color histogram features," Entropy, vol. 23, no. 8, p. 982, 2021.
[13] L. Wang, S. Li, W. Wang, W. Yang, and H. Wang, "A bank liquidity multilayer network based on media emotion," The European Physical Journal B, vol. 94, no. 2, pp. 1–23, 2021.
[14] D. Fuentealba, M. Lopez, and H. Ponce, "Effects on time and quality of short text clustering during real-time presentations," IEEE Latin America Transactions, vol. 19, no. 8, pp. 1391–1399, 2021.
[15] Z. Gou, Y. Li, and Z. Huo, "A method for constructing supervised time topic model based on variational auto encoder," Scientific Programming, vol. 2021, no. 12, Article ID 6623689, 11 pages, 2021.
[16] H. Li and D. Han, "A novel time-aware hybrid recommendation scheme combining user feedback and collaborative filtering," Mobile Information Systems, vol. 15, no. 4, 16 pages, 2021.
[17] C. Hu, Z. Pan, and T. Zhong, "Leaf and wood separation of poplar seedlings combining locally convex connected patches and K-Means++ clustering from terrestrial laser scanning data," Journal of Applied Remote Sensing, vol. 14, no. 1, p. 1, 2020.
[18] I. H. Hannah, A. T. Azar, and G. Jothi, "Leukemia image segmentation using a hybrid histogram-based soft covering rough K-Means clustering algorithm," Electronics, vol. 9, no. 1, p. 188, 2020.
[19] F. Deng, W. Gu, W. Zeng, Z. Zhang, and F. Wang, "Hazardous chemical accident prevention based on K-Means clustering analysis of incident information," IEEE Access, vol. 8, pp. 180171–180183, 2020.
[20] J. Wu, L. Shi, W. P. Lin, S. B. Tsai, and G. Xu, "An empirical study on customer segmentation by purchase behaviors using a RFM model and K-Means algorithm," Mathematical Problems in Engineering, vol. 2020, no. 6, Article ID 8884227, 7 pages, 2020.
[21] Z. Chen and W. Liu, "An efficient parameter adaptive support vector regression using K-Means clustering and chaotic slime mould algorithm," Access, vol. 8, pp. 156851–156862, 2020.
[22] M. Bradha, N. Balakrishnan, A. Suvitha et al., "Experimental, computational analysis of Butein and Lanceoletin for natural dye-sensitized solar cells and stabilizing efficiency by IoT," Environment, Development and Sustainability, vol. 24, no. 6, pp. 8807–8822, 2021.
[23] A. Sharma and R. Kumar, "A framework for pre-computated multi-constrained quickest QoS path algorithm," Journal of Telecommunication, Electronic and Computer Engineering (JTEC), vol. 9, 2017.
[24] R. Huang, S. Zhang, W. Zhang, and X. Yang, "Progress of zinc oxide-based nanocomposites in the textile industry," IET Collaborative Intelligent Manufacturing, vol. 3, no. 3, pp. 281–289, 2021.
[25] L. Xin, M. Chengyu, and Y. Chongyang, "Power station flue gas desulfurization system based on automatic online monitoring platform," Journal of Digital Information Management, vol. 13, no. 6, pp. 480–488, 2015.
