
2023 IEEE 12th International Conference on Communication Systems and Network Technologies (CSNT)

Fake News Spotting using Interrelated Feature Selection Model using Logistic Regression

V. Lakshman, [Link], P. Tritisha, SK. Firduas Fathima, Nafisa Begum
Department of Information Technology
Vignan's Nirula Institute of Technology and Science for Women
Peda Palakaluru, Guntur, Andhra Pradesh
lakshmanv58@[Link], vajjamounika5@[Link], trithisha999@[Link], Skff810@[Link], Nafisab8@[Link]

978-1-6654-6261-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/CSNT57126.2023.10134586

Abstract— This study aids in evaluating the validity of news by employing several classification strategies. Fake news has a major effect on our social lives, particularly in the fields of politics and education. By developing a fake news detection model with a distinct approach to categorization, a solution to the problems caused by fake news is proposed. When resources such as fake news detection datasets are taken into account, things get tricky. "Fake news," or news reports that are erroneous and come from questionable sources, can be identified with the use of applications developed using Natural Language Processing (NLP) techniques, which is what this research aims to do. Using a Term Frequency-Inverse Document Frequency (TF-IDF) matrix, which weights words based on how often they appear across the articles in a dataset, is merely a starting point for constructing a model; labelled datasets are one form of scarce resource. Logistic regression, the naïve Bayes classifier, and the Passive Aggressive Classifier were all employed in the classification processes of this model. The model's feature extraction employs approaches such as TF-IDF, and with logistic regression it reaches an accuracy of 97%.

Keywords— Fake News; Natural Language Processing; Term Frequency; Machine Learning; Feature Extraction

I. INTRODUCTION

For several years now, social media has been ubiquitous.[1] The dissemination of false information is rampant on social media.[2] False information poses a threat to business, government, education, democracy, and the economy.[3] Although fake news has always been an issue, the proliferation of social media has made it more likely that people will believe it and disseminate it further.[4] It is becoming increasingly difficult to tell true from misleading information, which leads to confusion and other issues. Recognizing fake news by hand is a challenging task that can only be accomplished by someone with exceptional knowledge of the news industry. People are increasingly likely to seek out and consume information from social media rather than conventional news organisations, as more and more of their time is spent communicating online through these platforms.[5] The characteristics of social media sites like Facebook, Twitter, and Instagram explain why people are shifting to reading news on them rather than in traditional media like newspapers and television: it is easier to share, discuss, and debate news stories with friends and other readers on these sites, and it is often more timely and less expensive to consume information there than through traditional media.[6] For instance, 62 percent of American adults relied on social media for news in 2016, up from 49 percent in 2012 [7]. It has also been found that social media has surpassed television as the primary source of news.[8] Technology advancements have made it simpler to fabricate and disseminate fake information, and it is now much more difficult to determine whether any given piece of information is real. Products and businesses can be affected by the spread of false information.[9] One's political career is not immune to the damaging effects of fake news.[10] According to Jency Jacob, managing director of the Mumbai-based fact-checking website BOOM, which collaborates with Facebook to examine stories and tag posts spreading misinformation on the platform, "2019 has been a unique year where truth checkers consistently kept shifting from one match to the other," making it the busiest year for them so far.[11] In this study, the naïve Bayes classifier, logistic regression, the Passive Aggressive Classifier, and the Support Vector Machine (SVM) are compared for supervised classification.[12] A dataset combining real and fake data is employed, with successful outcomes.

II. LITERATURE SURVEY

Conventional methods of identifying false information allow only two possible values (Real or Fake), whereas in practice it is often impossible to determine with absolute certainty whether a piece of data is genuine; it must instead be evaluated on a scale of confidence. The author saw this as a crucial consideration for organising data in social media.[13] A hybrid strategy employs both bag-of-words and n-gram approaches to represent data.[14]

For the purpose of identifying harmful websites, a static item detection method was suggested. It was previously believed that IP addresses were externally focused; rather than focusing on the inner workings, the developer highlighted exterior features such as IP addresses.[15] VSM, a vector-based framework for programmatic vectorization, is selected as the vector model for the URL.[16] These approaches were trained using either the article's title or its contents. Utilizing the PAC classifier, they achieve a best-case accuracy of 94.63%, while the accuracy of the logistic regression technique is 97%.

The primary goal is to identify an effective classification process for identifying false data and determining its accuracy.[17] The author researched several classification processes and implemented an SVM, a Passive Aggressive Classifier, and naïve Bayes in the model. Many existing works present fake news detection pipelines. In order to identify fake news, the authors of [18] propose a taxonomy of several truth evaluation methods that may be broken down into two main classes: linguistic cue methods that use machine learning, and network analysis methods. By contrasting two alternative feature extraction approaches and six different classification strategies, the authors of [19] advocate for a fake data detection model based on n-gram analysis and machine learning methods. The results of their studies demonstrate that the feature extraction method (TF-IDF) is crucial in achieving the desired high performance levels. They utilised a classifier with 92% accuracy, the Linear Support Vector Machine (LSVM). Since this model employs LSVM, it can only handle the special case of two linearly separable classes.

The authors of [5] present a simple approach to fake news identification with a naïve Bayesian classifier. This tactic is put to the test using data collected from Facebook status updates and achieves a 74% success rate. This accuracy is respectable but not the best, as many other works have achieved a higher rate with the use of alternative classifiers.

In [9], the authors explain how social media users might verify the authenticity of posts. They also detail the roles of journalists and researchers, as well as the standards to which legitimate organisations should adhere. This work does not settle the truth of any particular claim, but it does help people see the reality behind the news on social media. According to the authors of [9], there are a wide variety of methods and indices that can be used to measure the effectiveness of the various modalities (text, image, social information). They also demonstrate the value of mixing and merging various methods to verify information sharing.

The authors of [8] conduct a comprehensive and up-to-date analysis of the effectiveness of different strategies across three separate data sets. The authors of that study paid close attention to the data's textual content and the impression it conveyed, but they paid less attention to other factors, such as the data's source, author, or publication date, that can have a significant impact on a study's findings. Additionally, they show that including emotion in a detection approach does not add any useful data.

Other authors present a text-processing based machine learning approach for automatic identification of fake news with 87% accuracy, and they release a new public dataset of real news stories; the text itself appears to have taken a back seat to the emotions that emerge from reading the piece. In [10], the authors introduced LIAR, a new dataset for automatic identification of fake data. Classification, argument mining, topic modelling, rumour detection, and political NLP studies are only some of the other applications for this corpus. This benchmark has been adopted by the vast majority of related research. It is common knowledge, however, that it only includes political data, whereas other datasets include a wide range of records. The common disadvantage of these methods is that the encoding of particular pieces of information may not be correct.
III. PROPOSED MODEL

The conversion of textual facts into a form more suitable for statistical modelling requires pre-processing. Pre-processing is the process of removing unnecessary symbols and null values from the dataset. The empty values in the dataset are removed by applying an expectation-maximization model to clean the dataset for further processing.
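The paper names an expectation-maximization model for this cleaning step; as a simpler stand-in, a minimal pandas sketch of the null-handling stage might look as follows (the file name train.csv is an assumption for illustration, not taken from the paper):

```python
import pandas as pd

# Load the news dataset (file name assumed for illustration).
df = pd.read_csv("train.csv")

# Count the empty values per column before cleaning.
print(df.isnull().sum())

# Simplification: instead of expectation-maximization imputation,
# drop the rows that contain missing fields.
df = df.dropna().reset_index(drop=True)
```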
Natural language processing toolkits are among the many common approaches used to transform textual material, and this work employs one such toolkit (NLTK). The author employed Stop Words Removal, Punctuation Removal, and Stemming as pre-processing stages for the data sets containing news headlines and articles. By eliminating the unnecessary data present in the text, the author makes the data set smaller. Sentences are held together by various stop phrases, commas, and semicolons, whose usage as a criterion in textual content classification is meaningless; processing and filtering out stop words is therefore a crucial first step in natural language processing. Specifically, the author used the NLTK library to get rid of stop words such as adverbs and gerunds. Punctuation marks like commas are purely stylistic and are removed from the text because of their lack of relevance. The process of "stemming" removes prefixes and suffixes from a word, reducing it to its original, unaltered form: the root word. For instance, the words "knowing", "known", and "knows" will all be shortened to just "know".
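A minimal sketch of these three pre-processing stages with NLTK might look like the following; the word tokenizer and the Porter stemmer are assumptions, since the paper does not name its exact choices:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    # Punctuation removal: marks such as commas are purely stylistic.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Stop-word removal plus stemming, e.g. "knowing" -> "know".
    tokens = [stemmer.stem(tok) for tok in word_tokenize(text.lower())
              if tok not in stop_words]
    return " ".join(tokens)

print(preprocess("Knowing the facts, she knows what was reported."))
```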
The author looked at two separate methods for scoring the different features: Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). Words and phrases that appear frequently in a manuscript are prioritised by TF, which gives an account of a phrase's frequency of usage; as a result, a document is represented by a string of weighted words.

The IDF (Inverse Document Frequency) measure is used to find rare expressions or, to put it another way, to describe how unusual a phrase is. One method for determining which phrases are most significant in a document is Term Frequency-Inverse Document Frequency (TF-IDF): the frequency with which a phrase appears in a document raises its relevance, but the frequency with which the phrase appears across the corpus helps to mitigate this effect. Words with high TF-IDF values are crucial to the text.
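In standard notation, this weighting scheme is commonly written as shown below; this is the generic TF-IDF formulation, not an equation reproduced from the paper:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}
```

Here tf(t, d) counts how often term t occurs in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t, so rare terms receive larger weights.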
The author started off by performing some preliminary processing on the dataset, removing any superfluous or irrelevant words and characters. After that, the author moved on to Term Frequency-Inverse Document Frequency for the next phase of the process: feature extraction. The author looked into the Support Vector Machine (SVM), naïve Bayes, Passive Aggressive Classifier, and logistic regression techniques; Python's Natural Language Toolkit (NLTK) was used alongside these classifiers. The dataset was divided into a "preparation" portion and a "testing" portion: 80% of the data for training purposes and 20% for testing purposes. Here the Logistic Regression model is used for training the data.
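A minimal sketch of this extraction-and-training pipeline is given below, assuming scikit-learn (a library the paper does not explicitly name) and a tiny stand-in corpus instead of the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned article texts and their labels (0 = real, 1 = fake).
texts = [
    "government confirms new trade policy",
    "city council approves annual budget",
    "researchers publish peer reviewed study",
    "celebrity spotted riding dragon to work",
    "miracle pill cures every known disease",
    "aliens endorse local mayor for reelection",
]
labels = [0, 0, 0, 1, 1, 1]

# Build the TF-IDF matrix over the corpus.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# 80/20 split into training and testing portions, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=7)

# Train the logistic regression classifier on the training portion.
model = LogisticRegression()
model.fit(X_train, y_train)
```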

[Fig. 1. Proposed model for detecting faux news: news articles dataset → pre-processing steps → feature abstracting → divide dataset into train and test → classifier training → parametric validation → logistic regression.]

A. Logistic Regression:

Let us review the idea of Logistic Regression before going into the coding. Specifically, logistic regression is a method of statistical evaluation used to make predictions about a binary outcome, such as yes or no (binary classification), based on initial observations of an information set. It is a type of supervised statistical analysis used to determine the likelihood of a target variable. The parametric validation detects fake news by selecting a bag-of-words set, comparing the reviews with the bag of words, and identifying the true positives and false positives. The accuracy and recall metrics are then calculated to assess the model's performance levels. The diagram presented is a Sigmoid Function, also referred to as a Logit; it transforms probabilities into binary values, which can then be used in making forecasts.

[Figure: sigmoid (logit) curve.]
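Written out, the logit curve referred to here is the standard sigmoid, with a 0.5 cut-off applied to its output; this is the generic formulation rather than an equation copied from the paper:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\hat{y} =
\begin{cases}
1 & \text{if } \sigma(\mathbf{w}^{\top}\mathbf{x} + b) > 0.5,\\
0 & \text{otherwise,}
\end{cases}
```

which matches the class assignment described next.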
If the probability value falls below 0.5, the sample belongs to Class 0 on this graph, and if the value is larger than 0.5, it belongs to Class 1. The proposed model strictly considers and checks the correlation of features in the dataset, and fake news is detected using this correlation factor. This process helps in achieving high accuracy levels in fake news detection.

IV. RESULTS

The accuracy of 97% obtained by logistic regression is very high. The accuracy score on the training data is 98%, while on the testing data it is 97%. The following snippet depicts the accuracy scores for both the training data and the testing data.

[Figure: Accuracy score.]
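The snippet survives only as an image, so the following is a hedged reconstruction of what it most likely computes, assuming scikit-learn's accuracy_score together with the model and split from the earlier pipeline sketch:

```python
from sklearn.metrics import accuracy_score

# Accuracy on the training data (the paper reports about 98%).
train_pred = model.predict(X_train)
print("Accuracy score of the training data:", accuracy_score(y_train, train_pred))

# Accuracy on the testing data (the paper reports about 97%).
test_pred = model.predict(X_test)
print("Accuracy score of the testing data:", accuracy_score(y_test, test_pred))
```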

Making a predictive system:

For prediction, the author used a classifier called the logistic regression model, which is a binary classification model. If the prediction of the model is 0, the news is real; otherwise the news is considered fake news.
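A sketch of that predictive step is shown below, again assuming the fitted vectorizer and model from the earlier pipeline sketch, with 0 interpreted as real and anything else as fake:

```python
def predict_news(article: str) -> str:
    # Transform the raw article with the already-fitted TF-IDF vectorizer.
    features = vectorizer.transform([article])
    prediction = model.predict(features)[0]
    # Per the paper's convention: 0 means REAL, otherwise FAKE.
    return "REAL" if prediction == 0 else "FAKE"

print(predict_news("miracle pill cures every known disease"))
```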

[Figure: Prediction system.]

Feature extraction time level:

The model is trained using logistic regression. The dataset used consists of 20800 news articles, from which 5 features are extracted: id, title, author, text, and label. The feature extraction time level is low when compared to the existing model, and the feature extraction time level graph is represented as follows:

[Figure: Feature extraction time level.]
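For reference, a minimal pandas sketch of loading this dataset and applying the stated selection is given below; the file name is an assumption, while the shape and column names follow the paper:

```python
import pandas as pd

# The paper's dataset has shape (20800, 5); the file name is assumed.
df = pd.read_csv("train.csv")
print(df.shape)          # expected: (20800, 5)
print(list(df.columns))  # expected: ['id', 'title', 'author', 'text', 'label']

# The feature-selection step described below keeps only id and title.
selected = df[["id", "title"]]
```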

Feature extraction accuracy level:

The feature extraction accuracy level is high when compared to the existing model. The feature extraction accuracy level is represented as follows:

[Figure: Feature extraction accuracy level.]

Feature selection time level:

Two features are selected in our training model; the features selected for training the model are id and title. The time required is less when compared to the existing model in feature selection. The feature selection time level graph is represented as follows:

[Figure: Feature selection time level.]

Feature selection accuracy level:

The feature selection accuracy level is high when compared to the existing model. The feature selection accuracy level is represented as follows:

[Figure: Feature selection accuracy level.]

Error Rate:

The error rate is low for the logistic regression model when compared to the KNN classifier, which is the existing model. Because of its lower error rate, the logistic regression model is chosen for training and testing the data. The error rate is calculated using the formula

Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy
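A short sketch of computing this error rate from a confusion matrix is shown below, assuming scikit-learn and the test predictions from the earlier results sketch:

```python
from sklearn.metrics import confusion_matrix

# Tally the confusion-matrix cells over the test predictions.
tn, fp, fn, tp = confusion_matrix(y_test, test_pred).ravel()

# Error rate: misclassified samples over all samples, i.e. 1 - accuracy.
error_rate = (fp + fn) / (tp + tn + fp + fn)
print("Error rate:", error_rate)
```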
[Figure: Error rate.]

V. CONCLUSION

Here logistic regression is used to detect fake news, with an accuracy of 98% on the training data and 97% on the testing data, which is very high and effective for detecting fake news. A large data set of size (20800, 5) is used, containing 20800 news articles and 5 features. Logistic regression drives a predictive system that uses binary representations for prediction: if the prediction is 0 the news is REAL, and if the prediction is 1 the news is FAKE. In future, hybrid classifiers can be implemented to reduce the complexity levels and enhance the accuracy rate. Feature dimensionality reduction can also be applied to the model to reduce the time complexity levels.

REFERENCES:

[1]. Economic and Social Research Council. Using Social Media. Available at: [Link]media/using-social-media.
[2]. Gil, P. Available at: [Link]twitter-2483331, April 22, 2019.
[3]. E. C. Tandoc Jr et al., "Defining fake news: a typology of scholarly definitions," Digital Journalism, pp. 1–17, 2017.
[4]. J. Radiant et al., "An Overview of Public Concerns During the Recovery Period after a Major Earthquake: Nepal Twitter Analysis," HICSS '16: Proceedings of the 2016 49th Hawaii International Conference on System Sciences (HICSS), pp. 136-145, Washington, DC, USA: IEEE, 2016.
[5]. Alkhodair, S. A., Ding, S. H. H., Fung, B. C. M., Liu, J., "Detecting breaking news rumors of emerging topics in social media," Inf. Process. Manage., vol. 57, 102018, 2020.
[6]. Jeonghee Yi et al., "Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques," in Third IEEE International Conference on Data Mining (pp. 427-434), 2003.
[7]. Ranjan et al., "Part of speech tagging and local word grouping techniques for natural language parsing in Hindi," in Proceedings of the 1st International Conference on Natural Language Processing (ICON 2003), 2003.
[8]. M. Diab et al., "Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks," Proceedings of HLT-NAACL 2004: Short Papers (pp. 149–152), Boston, Massachusetts, USA: Association for Computational Linguistics, 2004.
[9]. Rouse, M. "Machine learning (ML)." Available at: [Link]machine-learning-ML, May 2018.
[10]. Lakshman Narayana, V., Lakshmi Patibandla, R.S.M., Pavani, V., Radhika, P. (2023). Optimized Nature-Inspired Computing Algorithms for Lung Disorder Detection. In: Raza, K. (eds) Nature-Inspired Intelligent Computing Techniques in Bioinformatics. Studies in Computational Intelligence, vol 1066. Springer, Singapore. [Link]7_6.
[11]. V. L. Narayana, S. Sirisha, G. Divya, N. L. S. Pooja and S. A. Nouf, "Mall Customer Segmentation Using Machine Learning," 2022 International Conference on Electronics and Renewable Systems (ICEARS), 2022, pp. 1280-1288, doi: 10.1109/ICEARS53579.2022.9752447.
[12]. R. S. M. Lakshmi Patibandla, B. Tarakeswara Rao, V. Lakshman Narayana, "Prediction of COVID-19 using machine learning techniques," in: Deepak Gupta, Utku Kose, Ashish Khanna, Valentina Emilia Balas (Eds.), Deep Learning for Medical Applications with Unique Data, Academic Press, 2022, pp. 219-231, ISBN 9780128241455, [Link]5.00007-1.
[13]. Patibandla, R.S.M.L., Vejendla, L.N. (2022), "Significance of Blockchain Technologies in Industry," EAI/Springer Innovations in Communication and Computing, 2022, pp. 19–31.
[14]. V. Lakshman Narayana (2021), "Secured resource allocation for authorized users using time specific blockchain methodology," International Journal of Safety and Security Engineering, Vol. 11, No. 2, 2021, pp. 201–205.

[15]. V. Pavani, M. N. Swetha, Y. Prasanthi, K. Kavya and M. Pavithra, "Drowsy Driver Monitoring Using Machine Learning and Visible Actions," 2022 International Conference on Electronics and Renewable Systems (ICEARS), 2022, pp. 1269-1279, doi: 10.1109/ICEARS53579.2022.9751890.
[16]. V. Pavani, N. M. Pujitha, P. V. Vaishnavi, K. Neha and D. S. Sahithi, "Feature Extraction based Online Job Portal," 2022 International Conference on Electronics and Renewable Systems (ICEARS), 2022, pp. 1676-1683, doi: 10.1109/ICEARS53579.2022.9752295.
[17]. V. Pavani, S. Sri. K, S. Krishna. P and V. L. Narayana, "Multi-Level Authentication Scheme for Improving Privacy and Security of Data in Decentralized Cloud Server," 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), 2021, pp. 391-394, doi: 10.1109/ICOSEC51865.2021.9591698.
[18]. Shaik, Sharmila, P. Sudhakar, and Shaik Khaja Mohiddin, "A Novel Framework for Image Inpainting," International Journal of Computer Trends and Technology (IJCTT) 14: 141-147.
[19]. Sharmila, Shaik, and Ch Aparna, "VMSSS: A proposed model for cloud forensic in cloud computing using VM snapshot server," Soft Computing for Problem Solving, Springer, Singapore, 2019, 483-493.

