Text Similarity Measures in News Articles by Vector Space Model Using NLP
J. Inst. Eng. India Ser. B
https://doi.org/10.1007/s40031-020-00501-5
ORIGINAL CONTRIBUTION
Abstract  The present global size of online news websites is more than 200 million. According to MarketingProfs, more than 2 million articles are published on the web every day, and online news websites also curate editorial content that specifies which articles to display on their home pages and which to highlight, e.g., with larger text for the main news stories. Many of the articles posted on one news website are very similar to those on many other news websites. The selective reporting of top news headlines, and the similarity of news across various news organizations, is well recognized but not well quantified. This paper identifies the top news items on news sites and measures the similarity between two versions of the same news item, in two languages (Hindi and English), referring to the same event. To accomplish this, a highlighted-headline and link extractor was created to extract top news in both Hindi and English from Google's news feed. First, the Hindi news article is translated into English using Google Translate and then compared with the English news article. Second, cosine similarity, Jaccard similarity, and Euclidean distance measures are used to calculate a news similarity score. The frequency of nouns, and of the words following nouns, in the news articles is also extracted. Our methodology shows that we can efficiently identify top news articles and measure the similarity between news reports.

Keywords  Bilingual news article similarity · Cosine similarity · Jaccard similarity · Euclidean distance

Correspondence: Ritika Singh (ritikasingh2397@outlook.com), Satwinder Singh (satwinder.singh@cup.edu.in)
Department of Computer Science and Technology, Central University of Punjab-Bathinda, Bathinda, India

Introduction

A huge increase in the number of online newspapers being published is due to innovations in digital technology. In a modern world where so much information appears at tremendous speed, readers need to find out whether they are reading true or false news. False news and information can endanger and confuse not only a person's life but an entire society, so it is very important to find the source of a piece of information and compare it with other news. This study is therefore interested in mining online news platforms, specifically to measure the similarity of news articles across various sites. This article provides details about what news is being considered, and how it is being presented and highlighted on a website [1]. News articles published on one website usually appear in similar or modified form on several other websites. Similar and almost identical news is confusing for users. Similarity slows down the process of discovering new information about a topic and can lead to missed information if the user mistakenly treats two news items as similar when in fact one contains new data. Locating similar news items across websites is difficult because of the large amount of miscellaneous content on the pages: although the main article text may be similar on two different web pages, the extraneous material on the pages may not be the same. Therefore, traditional approaches to determining equivalent news would fail [2]. First, this paper developed a method for scraping top news headline text from web pages, i.e. Google news feed websites, which are present in
two different languages (Hindi and English), referring to the same event, and then uses the extracted text to classify news pairs with the same content, avoiding any irrelevant information in the articles. By measuring a similarity score for news pairs based on cosine similarity, Jaccard similarity, and Euclidean distance, this research can distinguish similar news articles from different ones. The purpose of this paper is also to discover bilingual news articles in a comparable corpus [3]. In particular, the study deals with the representation of news and the measurement of similarity among news articles. The experiment uses similarly named entities as representative features of the news. To assess the similarity between articles reporting the same news, this research proposes a new method based on a knowledge-base framework that aims to provide human knowledge about the value of the categories of named entities within the news [4]. In a comparable corpus with news in Hindi and English, we compared our approach to a traditional one and obtained better results. Similarity and distance measures reduce the similarity of two documents or sentences to a single numerical value and bring out the degree of semantic similarity [5] or distance from one another. Several similarity measures have been used by researchers, but not much work has been done on the similarity of newspapers. This study aims to compare the semantic similarity between two articles on the same news, present in two different languages (Hindi and English), to optimize human understanding. The basic concept for measuring news similarity is to identify feature vectors for the articles and then measure the difference between those features. A low distance between the features implies a high similarity value, while a large distance implies a low similarity value [6]. Euclidean distance, cosine distance, and the Jaccard coefficient are some of the metrics used in document similarity computation. This study explores two separate methods of generating features from the texts, (1) tf-idf vectors and (2) bag of words, and implements two methods for calculating textual similarity between news articles: (1) cosine similarity and Jaccard similarity with tf-idf vectors and (2) Euclidean distance using a bag of words.

Literature Review

In the literature, similarity measures have been used for various purposes. In this section, some proposals are reviewed.

Atkins et al. [1] describe a technique to assess the top news headline stories from a selected set of US-based news websites and then calculate correlations across them. To do this, they first created a headline and link extractor that parses selected news websites, and then searched ten US-based news site home pages for 3 months. They use a parser to extract, for k = 1, 3, 10, the maximum number of articles for each news site. Second, the authors use cosine similarity to quantify the similarity of news. They also provide techniques to assist in analyzing archived news web pages, introducing tools for parsing selected HTML news sites for hero and headline stories using CSS selectors. Their studies over 3 months showed that the overall similarity decreased as the number of articles increased, and indicated that they could line up synchronous stories for a given day alongside relevant national events. This approach can be used to further examine occasional elections as they are held.

Katarzyna Baraniak and Marcin Sydow work on tools to support the detection and analysis of information bias [7]. The authors use methods to automatically identify articles reporting on the same subject, event, or entity, in order to use them in comparative analysis or to construct test or training collections. In the paper, the authors explain representations of document text and similarity measures for text clustering, including cosine similarity, Euclidean distance, the Jaccard coefficient, the Pearson correlation coefficient, and averaged Kullback–Leibler divergence. They also apply a machine learning approach to recognize similar articles and develop a model that detects similar articles automatically. Identifying fragments of text concerning similar events, and identifying bias in them, is expected. The authors are also working to expand the study to other languages (e.g., Polish and English).

Maake Benard Magara et al. suggest a system using 220 artificial intelligence research papers written by 8 artificial intelligence experts [8]. This work uses Recursive Partitioning, Random Forest, and boosted machine learning algorithms, achieving an average accuracy of 80.73% and a timing efficiency of 2.354628 s; this algorithm typically performed quite well compared to the Boosted and even the Random Forest algorithms. More sophisticated models, such as Latent Semantic Analysis (LSA), can be used in future studies, since documents can then be identified as belonging to the same class even if they share no similar words and phrases. Vikas Thada and Vivek Jaglan used the cosine similarity, Dice coefficient, and Jaccard similarity algorithms [9]. The work was carried out on the first 10 pages of Google search results and will be expanded to 30–35 pages for a reliable efficiency estimate in future study. Cosine similarity was eventually concluded to be the best fitness measure for this dataset. In summary, while the initial
The major steps of the methodology are given below. Figure 1 presents the framework of this work. The textual news data are first pre-processed before being represented in a more structured format. The two representation methods for generating features from the text investigated in this study are tf-idf and bag of words. Once represented with these methods, each representation is compared using the similarity measures shown in Fig. 1, i.e. Cosine, Euclidean and

Article Scraping

'Newspaper' is a Python module used to extract and parse newspaper articles. Newspaper uses specialized web-scraping algorithms to extract all the valuable text from a website, and it works extremely well on the websites of online newspapers. This experiment has extracted links for both Hindi and English news, and their text is now also extracted using the Newspaper module.
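A minimal sketch of this scraping step, assuming the third-party newspaper3k package (URL selection and error handling are left out); calling the helper on each extracted link would yield the body text of the Hindi or English article:

```python
def fetch_article_text(url):
    """Download a news page and return its main article text.

    Relies on the third-party 'newspaper3k' package; the import is kept
    inside the function so the rest of the pipeline does not require it
    to be installed.
    """
    from newspaper import Article  # pip install newspaper3k

    article = Article(url)
    article.download()  # fetch the raw HTML of the page
    article.parse()     # strip page boilerplate, keep the article body
    return article.text
```

For Hindi articles, the returned text would then be passed through Google Translate before comparison with the English article.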
divided by the size of the union of the two sets. Jaccard similarity is determined using the formula [16] below:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)    (3)

where ∩ represents intersection and ∪ represents union. In this paper, A and B are bags of words that contain the news articles.

The Euclidean distance follows from the Pythagorean theorem:

AC² = AB² + BC²    (6)

AC = √(AB² + BC²)    (7)

AC = √((x₂ − x₁)² + (y₂ − y₁)²)    (8)

|X − Y| = √( Σᵢ₌₁ᵐ (Xᵢ − Yᵢ)² )    (9)
Table 1 Comparison of the pros and cons of different measures and their application area

Sl. | Similarity measure | Pros | Cons | Application area
1 | Cosine similarity | Both continuous and categorical variables may be used | Does not work effectively with nominal data [17] | Text mining, document similarity
2 | Jaccard coefficient | Both continuous and categorical variables may be used | Does not work effectively with nominal data | Document classification
3 | Euclidean distance | Easy to compute; works well with datasets with compact or isolated clusters [17] | Does not work efficiently with image data | Applications involving interval data, DNA analysis, k-means algorithm
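A stdlib-only sketch makes the contrast in Table 1 concrete: cosine similarity compares only the direction of two count vectors, so doubling every count leaves it unchanged, while Euclidean distance is sensitive to magnitude (the vectors below are illustrative):

```python
import math


def cosine_similarity(x, y):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0


short = [2, 1, 0, 1]    # word counts of a short article
doubled = [4, 2, 0, 2]  # the same article repeated twice

print(cosine_similarity(short, doubled))  # 1.0 — same direction
print(math.dist(short, doubled))          # sqrt(6) — magnitudes differ
```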
news articles * number of features (distinct words)] [16]. That tf-idf weight from the matrix was then used as a feature for every text, and the similarity among news articles was calculated using cosine similarity and Jaccard similarity. Sklearn's built-in cosine and Jaccard similarity modules were used to measure the similarity.

Bag of Words Euclidean Distance

The pre-processed documents are described as vectors of word frequencies, and how similar they are is compared via their bag-of-words vectors. This experiment uses the bag-of-words model because the computer processes vectors much faster than a vast file of
text for a lot of data [20]. So this paper loads all news articles into a list called corpus, then calculates the feature vectors from the documents, and finally computes the Euclidean distance to check how similar they are: the greater the distance, the less similar they are. This paper uses the machine learning library sklearn.

Table 4 Sample pair of completely dissimilar news

The experiment was performed on pairs of news headlines obtained from Google News [14]. The chosen news articles are listed in Tables 2, 3 and 4. The news articles were given to a human expert to judge their similarity and dissimilarity. The human expert determined that 6 pairs (pairs 1–6) are completely similar news, 5 pairs (pairs 7–11) are different news about the same topic, and the other 5 pairs (pairs 12–16) are completely dissimilar news. The expert judgment is used as a benchmark to evaluate the automatic similarity calculation on these news articles. Cosine similarity, the Jaccard coefficient, and Euclidean distance are applied, and the results of all three measures are shown in Tables 5, 6 and 7.

To provide a better understanding of the three compared measures, the results are shown as a bar graph in Fig. 4. Figure 5 shows the similarity-measure bar graph for different news stories about the same topic, and Fig. 6 shows the bar graph for completely dissimilar news.

The performance measures used in the experiment are accuracy, precision, recall and F-measure. These measures are calculated by determining the number of news articles correctly identified as similar or dissimilar compared to the decisions of the human expert [21]. In other words, using the human decisions as a benchmark, the following are determined: true positives (TP), actual similar news correctly identified as similar; true negatives (TN), actual dissimilar news correctly identified as dissimilar; false positives (FP), actual dissimilar news incorrectly identified as similar; and false negatives (FN), actual similar news incorrectly identified as dissimilar. Then, accuracy is calculated as (TP + TN)/all data, precision as TP/(TP + FP), recall as TP/(TP + FN), and the F-measure as the harmonic mean of precision and recall, which is equal to 2TP/(2TP + FP + FN) [21]. The results are presented in the next section.
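The evaluation just described can be sketched as a small helper; the confusion-matrix counts below are illustrative only, not the paper's reported results:

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix
    counts, following the formulas given in the text."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of P and R
    return accuracy, precision, recall, f_measure


# Illustrative counts only:
acc, prec, rec, f1 = evaluation_metrics(tp=10, tn=4, fp=1, fn=1)
print(acc, prec, rec, f1)
```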