Text Similarity Measures in News Articles by Vector Space Model Using NLP
J. Inst. Eng. India Ser. B
https://doi.org/10.1007/s40031-020-00501-5
ORIGINAL CONTRIBUTION
Abstract  The present global size of online news websites is more than 200 million. According to MarketingProfs, more than 2 million articles are published on the web every day, and online news websites also curate editorial content that specifies which articles to display on their home pages and which to highlight, e.g., with larger text for the main news stories. Many of the articles posted on one news website are very similar to those on many other news websites. The selective reporting of top news headlines, and the similarity of news across various news organizations, is well recognized but not well quantified. This paper identifies the top news items on news sites and measures the similarity between two versions of the same news item, in two languages (Hindi and English), referring to the same event. To accomplish this, a highlighted-headline and link extractor was created to extract top news in both Hindi and English from Google's news feed. First, the Hindi news article is translated into English using Google Translate and then compared with the English news article. Second, cosine similarity, Jaccard similarity, and Euclidean distance measures are used to calculate a news similarity score. The frequency of nouns, and of the words following nouns, in the news articles is also extracted. Our methodology shows that we can efficiently identify top news articles and measure the similarity between news reports.

Keywords  Bilingual news article similarity · Cosine similarity · Jaccard similarity · Euclidean distance

Correspondence: Ritika Singh (ritikasingh2397@outlook.com), Satwinder Singh (satwinder.singh@cup.edu.in)
Department of Computer Science and Technology, Central University of Punjab-Bathinda, Bathinda, India

Introduction

A huge increase in the number of online newspapers being published is due to innovations in digital technology. In a modern world where so much information appears at tremendous speed, readers need to find out whether they are reading true or false news. False news and information can endanger and confuse not only a person's life but an entire society, so it is very important to find the source of a piece of information and compare it with other news. This study is therefore interested in mining online news platforms, specifically to measure the similarity of news articles across various sites. This article provides details about what news is being considered, and how it is being presented and highlighted on a website [1]. News articles published on one website usually appear in similar or modified form on several other websites. Similar and almost identical news is confusing for users. Similarity slows down the process of discovering new information about a topic and can lead to missed information if the user mistakenly treats two news items as similar when in fact one contains new data. Locating similar news items across websites is difficult because of the large amount of miscellaneous content on the pages: although the main article text may be similar on two different web pages, the extraneous material on the pages may not be the same. Therefore, traditional approaches to determining equivalent news would fail [2]. First, this paper developed a method for scraping top news headline text from web pages, i.e. Google news feed websites, which are present in
two different languages (Hindi and English), referring to the same event, and then uses the extracted text to classify news pairs with the same content, avoiding any irrelevant information in the articles. By measuring a similarity score for news pairs based on cosine similarity, Jaccard similarity, and Euclidean distance, this research can distinguish similar news articles from different ones. The purpose of this paper is also to discover bilingual news articles in a comparable corpus [3]. In particular, the study deals with the representation of news and the measurement of similarity among news articles. The experiment uses similarly named entities as representative features of the news. To assess the similarity between articles reporting the same news, this research proposes a new method based on a knowledge-base framework that aims to provide human knowledge about the value of the categories of named entities within the news [4]. In a comparable corpus with news in Hindi and English, we compared our approach to a traditional one and obtained better results. Similarity and distance measures reduce the similarity of two documents or sentences to a single numerical value and bring out the degree of semantic similarity [5] or distance from one another. Several similarity measures have been used by researchers, but not much work has been done on the similarity of newspapers. This study aims to compare the semantic similarity between two articles on the same news, present in two different languages (Hindi and English), to optimize human understanding. The basic concept for measuring news similarity is to identify feature vectors for the articles and then measure the difference between those features. A low distance between the features implies a high similarity value, while a large distance implies a low similarity value [6]. Euclidean distance, cosine distance, and the Jaccard coefficient are some of the metrics used in document similarity computation. This study explores two separate methods of generating features from the texts, (1) tf-idf vectors and (2) bag of words, and implements two methods for calculating textual similarity between news articles: (1) cosine similarity and Jaccard similarity with tf-idf vectors and (2) Euclidean distance using a bag of words.

Literature Review

In the literature, similarity measures have been used for various purposes. In this section, some proposals are reviewed.

Atkins et al. [1] describe a technique to assess the top news headline stories from a selected set of US-based news websites and then calculate correlations across them. To do this, they first created a headline and link extractor that parses selected news websites, and then searched ten US-based news site home pages for 3 months. They use a parser to extract, for k = 1, 3, 10, the maximum number of articles for each news site. Second, the authors use cosine similarity to quantify the similarity of news. They also provide techniques to assist in analyzing archived news web pages, introducing tools for parsing selected HTML news sites for hero and headline stories using CSS selectors. Their studies over 3 months showed that the overall similarity decreased as the number of articles increased, and indicated that they could line up synchronous stories for a given day alongside relevant national events. This approach can be used to further examine occasional elections as they are held.

Katarzyna Baraniak and Marcin Sydow work on tools to support the detection and analysis of information bias [7]. The authors use methods to automatically identify articles reporting on the same subject, event, or entity, in order to use them in comparative analysis or to construct test or training collections. In the paper, the authors explain representations of document text and similarity measures for text clustering, including cosine similarity, Euclidean distance, the Jaccard coefficient, the Pearson correlation coefficient, and averaged Kullback–Leibler divergence. They also apply a machine learning approach to recognize similar articles and develop a model that detects similar articles automatically. Identifying fragments of text concerning similar events, and identifying bias in them, is expected. The authors are also working to expand the study to other languages (e.g., Polish and English).

Maake Benard Magara et al. suggest a system using 220 artificial intelligence research papers written by 8 artificial intelligence experts [8]. This work uses Recursive Partitioning, Random Forest, and boosted machine learning algorithms, achieving an average accuracy of 80.73% and a timing efficiency of 2.354628 s; this algorithm typically performed quite well compared to the Boosted and even the Random Forest algorithms. More sophisticated models, such as Latent Semantic Analysis (LSA), can be used in future studies, since documents can then be identified as belonging to the same class even if they share no similar words and phrases. Vikas Thada and Vivek Jaglan used the cosine similarity, Dice coefficient, and Jaccard similarity algorithms [9]. The work was carried out on the first 10 pages of Google search results and will be expanded to 30–35 pages for a reliable efficiency estimate in future study. Cosine similarity was eventually concluded to be the best fitness measure for this dataset. In summary, while the initial
The major steps of the methodology are given below. Figure 1 presents the framework of this work. The textual news data are first pre-processed before being represented in a more structured format. The two representation methods for generating features from the text investigated in this study are tf-idf and bag of words. Once represented with these methods, each representation is compared using the similarity measures shown in Fig. 1, i.e. Cosine, Euclidean and

Article Scraping

'Newspaper' is a Python module used to extract and parse newspaper articles. Newspaper uses specialized web-scraping algorithms to extract all the valuable text from a website, and it works extremely well on the websites of online newspapers. This experiment has extracted links for both Hindi and English news, and their text is now also extracted using the Newspaper module.
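A minimal sketch of this scraping step, assuming the third-party newspaper3k package (URL selection and error handling are left out); calling the helper on each extracted link would yield the body text of the Hindi or English article:

```python
def fetch_article_text(url):
    """Download a news page and return its main article text.

    Relies on the third-party 'newspaper3k' package; the import is kept
    inside the function so the rest of the pipeline does not require it
    to be installed.
    """
    from newspaper import Article  # pip install newspaper3k

    article = Article(url)
    article.download()  # fetch the raw HTML of the page
    article.parse()     # strip page boilerplate, keep the article body
    return article.text
```

For Hindi articles, the returned text would then be passed through Google Translate before comparison with the English article.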
divided by the size of the union of the two sets. Jaccard similarity is determined using the formula [16] below:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)    (3)

where ∩ represents intersection and ∪ represents union. In this paper, A and B are bags of words that contain the news articles.

The Euclidean distance follows from the Pythagorean theorem:

AC² = AB² + BC²    (6)

AC = √(AB² + BC²)    (7)

AC = √((x₂ − x₁)² + (y₂ − y₁)²)    (8)

|X − Y| = √( Σᵢ₌₁ᵐ (Xᵢ − Yᵢ)² )    (9)
Table 1 Comparison of the pros and cons of different measures and their application area

Sl. | Similarity measure | Pros | Cons | Application area
1 | Cosine similarity | Both continuous and categorical variables may be used | Does not work effectively with nominal data [17] | Text mining, document similarity
2 | Jaccard coefficient | Both continuous and categorical variables may be used | Does not work effectively with nominal data | Document classification
3 | Euclidean distance | Easy to compute; works well with datasets with compact or isolated clusters [17] | Does not work efficiently with image data | Applications involving interval data, DNA analysis, k-means algorithm
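A stdlib-only sketch makes the contrast in Table 1 concrete: cosine similarity compares only the direction of two count vectors, so doubling every count leaves it unchanged, while Euclidean distance is sensitive to magnitude (the vectors below are illustrative):

```python
import math


def cosine_similarity(x, y):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0


short = [2, 1, 0, 1]    # word counts of a short article
doubled = [4, 2, 0, 2]  # the same article repeated twice

print(cosine_similarity(short, doubled))  # 1.0 — same direction
print(math.dist(short, doubled))          # sqrt(6) — magnitudes differ
```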
news articles * number of features (distinct words)] [16]. That tf-idf weight from the matrix was then used as a feature for every text, and the similarity among news articles was calculated using cosine similarity and Jaccard similarity. Sklearn's built-in cosine and Jaccard similarity modules were used to measure the similarity.

Bag of Words Euclidean Distance

The pre-processed documents are described as vectors of word frequencies, and how similar they are is compared via their bag-of-words vectors. This experiment uses the bag-of-words model because the computer processes vectors much faster than a vast file of
text for a lot of data [20]. So this paper loads all news articles into a list called corpus, then calculates the feature vectors from the documents, and finally computes the Euclidean distance to check how similar they are: the greater the distance, the less similar they are. This paper uses the machine learning library sklearn.

Table 4 Sample pair of completely dissimilar news

The experiment was performed on pairs of news headlines obtained from Google News [14]. The chosen news articles are listed in Tables 2, 3 and 4. The news articles were given to a human expert to judge their similarity and dissimilarity. The human expert determined that 6 pairs (pairs 1–6) are completely similar news, 5 pairs (pairs 7–11) are different news about the same topic, and the other 5 pairs (pairs 12–16) are completely dissimilar news. The expert judgment is used as a benchmark to evaluate the automatic similarity calculation on these news articles. Cosine similarity, the Jaccard coefficient, and Euclidean distance are applied, and the results of all three measures are shown in Tables 5, 6 and 7.

To provide a better understanding of the three compared measures, the results are shown as a bar graph in Fig. 4. Figure 5 shows the similarity-measure bar graph for different news stories about the same topic, and Fig. 6 shows the bar graph for completely dissimilar news.

The performance measures used in the experiment are accuracy, precision, recall and F-measure. These measures are calculated by determining the number of news articles correctly identified as similar or dissimilar compared to the decisions of the human expert [21]. In other words, using the human decisions as a benchmark, the following are determined: true positives (TP), actual similar news correctly identified as similar; true negatives (TN), actual dissimilar news correctly identified as dissimilar; false positives (FP), actual dissimilar news incorrectly identified as similar; and false negatives (FN), actual similar news incorrectly identified as dissimilar. Then, accuracy is calculated as (TP + TN)/all data, precision as TP/(TP + FP), recall as TP/(TP + FN), and the F-measure as the harmonic mean of precision and recall, which is equal to 2TP/(2TP + FP + FN) [21]. The results are presented in the next section.
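The evaluation just described can be sketched as a small helper; the confusion-matrix counts below are illustrative only, not the paper's reported results:

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix
    counts, following the formulas given in the text."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of P and R
    return accuracy, precision, recall, f_measure


# Illustrative counts only:
acc, prec, rec, f1 = evaluation_metrics(tp=10, tn=4, fp=1, fn=1)
print(acc, prec, rec, f1)
```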