A project report on
FAKE NEWS DETECTION
Review 3 TARP project report
Submitted by
Aditya (19BIT0139)
Prajesh Raj Singh (19BIT0422)
of
B.Tech. degree
in
Information Technology and Engineering
School of Information Technology and Engineering
April, 2023
CHAPTER 1
INTRODUCTION
1.1 GENERAL
Fake news is a type of yellow press that purposefully spreads disinformation or hoaxes through traditional print news media as well as more modern internet social media. Fake news has been around for a long time, dating back to the 1835 publication of the “Great Moon Hoax”. In recent years, the ability of users to post anything on online news platforms such as social media sites and news websites has led to the spread of false information, and as a result of the booming development of online social networks, fake news created for various commercial and political purposes has appeared in large numbers and spread throughout the online world. Members of online social networks can quickly be misled by online fake news that uses deceptive phrasing, which has already had a huge impact on offline culture.
When compared to regular suspicious material (e.g., spam), fake news differs significantly in several respects: (1) impact on society: spam is mainly found in personal emails or on specific review websites and only affects a limited number of people locally, whereas fake news on online social networks and websites can have a huge influence due to the vast number of users worldwide; (2) audiences’ initiative: instead of passively receiving spam emails, users of online social networks actively seek out, receive, and distribute news items without regard for their accuracy; and (3) identification difficulty: due to the lack of other comparable news pieces, spotting fake news that contains incorrect information is extremely difficult and requires both lengthy evidence-gathering and meticulous fact-checking. These qualities of fake news present new hurdles for identification. First, collecting fake news data and manually labelling fake news are both difficult tasks, since news that appears in online news feeds is often private data. So far, only a few large-scale public datasets for fake news identification exist in this context.
Hence, the contributions of this paper are summarized as follows:
• We collected multiple high-quality datasets and performed an in-depth analysis of the text from multiple perspectives.
• We used datasets in both English and Hindi in order to extract effective features for identifying fake news.
• A unified model is proposed to analyze news in multiple languages.
• The model proposed in this paper is an effective way to recognize fake news among the large volume of online information.
1.2. SYSTEM REQUIREMENTS
• Hardware
RAM: 4GB
Disk Space: 2GB
• Software
Python, scikit-learn (Sklearn), Pandas, NumPy, CountVectorizer, TF-IDF vectorizer
CHAPTER 2
OVERVIEW OF PROJECT
2.1 EXISTING WORK
In this paper [1], the authors explore the main textual features essential for developing a fake news detector, including language features, lexical features, semantic features, and subjectivity. In addition, the authors introduce a new set of features such as bias, credibility, and domain location. From the dataset, the authors discarded stories labelled as “non-factual” and merged those labelled as “mixture of true and false” and “mostly false” into a single class, henceforth referred to as “fake news”; the remaining stories were classified as the legitimate portion. In order to extract these features from the news articles, the authors first parsed all the news URLs and then extracted the domain information. If a URL happened to be unavailable, the official URL of the news outlet was associated with the article. The authors evaluated the discriminative power of these features using several state-of-the-art and classic classifiers, including Naïve Bayes (NB), k-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forests (RF), and XGBoost (XGB). For the mean AUC and F1, the authors computed 95% confidence intervals by performing a fivefold split between the test and training sets, repeated ten times with different shuffled versions of the original dataset. For each classifier, the authors learned a model from a set of previously labelled data and used it to classify new (unseen) news articles as “fake” or “not fake”. The best results were obtained by the XGB and RF classifiers, statistically tied with AUC values of 0.86 and 0.85, respectively. The issue addressed in this paper is that most existing works are concurrent efforts that propose new features for training classifiers based on ideas that have not been tested in combination, which makes it difficult to gauge the potential of the supervised models trained in recent studies. The authors attempt to resolve this by briefly surveying the existing studies on the topic, identifying the main features proposed for the task, implementing these features, and testing the effectiveness of a variety of supervised learning methods.
In this paper [2], the authors develop computational resources and models for fake news detection, presenting the construction of two novel datasets covering various domains. Using these datasets, the authors conducted several analyses, describing the collection, annotation, and validation process in detail in order to identify the linguistic properties predominantly present in fake content, and then built fake news detectors relying on these linguistic features. Domains like technology, politics, and education appeared to exhibit the linguistic features characteristic of fake content more strongly than domains like sports, business, and entertainment. On comparing the accuracy of the proposed model with human performance, it was found that humans were better at detecting fake content in the celebrity domain, while the proposed model outperformed humans in detecting more serious and diverse fake news. Issues such as serious fabrications, hoaxes, satire, and celebrity gossip are addressed across domains including politics, education, celebrity, sports, business, and entertainment.
In this paper [3], the authors propose an automatic fake news identification technique for which they performed a thorough investigation of fake news data, identifying many useful explicit features and hidden patterns in both the text and the images used in fake news. A model named TI-CNN (Text and Image information based Convolutional Neural Network) is proposed, which is trained with the text and image information simultaneously by projecting the explicit and latent features into a unified feature space. The proposed model provides consistent results, with an F1-measure of 0.92-0.93, and is readily extensible, as it can easily absorb other features of the news alongside the text and image information with their corresponding explicit and latent features. The issues addressed include handling sparse and high-order features, predicting whether news is fake or not from the image input itself, and the failure of traditional convolutional neural networks (CNNs) to capture explicit features. In the proposed system, the authors overcome these issues by using TI-CNN to combine the latent and explicit features of the image and text information in a unified space, and then using the learned features to identify fake news.
In this paper [4], the authors address the challenges introduced by the diverse connections among news articles, creators, and subjects, as well as the characteristics of fake news, using a novel automatic fake news credibility inference model, ‘FakeDetector’, which is based on a set of latent and explicit features extracted from the textual information. In order to learn the representations of news articles, creators, and subjects simultaneously, it builds a framework consisting of representation feature learning and credibility label inference, which together compose a deep diffusive network called FakeDetector. In terms of overall performance, FakeDetector achieves a higher accuracy score than state-of-the-art fake news detection models such as Hybrid, LIWC, TriFN, and CNN, as well as network structure-based models like DeepWalk, Propagation, and LINE and textual content-based methods such as SVM and RNN. The authors treat fake news detection as a credibility inference problem: the introduced framework formulates it as a credibility label inference problem which aims at learning a model to infer the credibility labels of news articles, creators, and subjects simultaneously.
In this paper [5], in order to detect fake news, the authors propose an optimized Convolutional Neural Network model (OPCNN-FAKE), comparing its performance with Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), and six regular ML techniques, namely Logistic Regression (LR), K-Nearest Neighbour (KNN), Random Forest (RF), Decision Trees (DT), Naïve Bayes (NB), and Support Vector Machine (SVM), using four fake news datasets. The approach consists of the following steps: fake news data collection, text processing, dataset splitting, feature extraction, model training and optimization, and model evaluation. The results show that, compared to the other models, OPCNN-FAKE achieved the best performance, with higher testing and cross-validation results, indicating that OPCNN-FAKE is significantly better than the other models for fake news detection. The authors address issues such as optimizing the existing fake news detection models and increasing the accuracy of their performance. These issues are resolved using novel approaches based on Machine Learning (ML) and Deep Learning (DL), where grid search and hyperopt optimization techniques are used to optimize the ML and DL parameters, respectively.
In this paper [6], in order to combat the situation, the authors created a fake news detection system that receives human input and determines whether it is true or false. Several NLP and machine learning methods were employed to do this. The model was trained using an appropriate dataset, and its performance was evaluated using a variety of performance indicators. To categorise news headlines or articles, the best model, i.e., the model with the highest accuracy, was utilised. For static search, the best model was Logistic Regression, which had an accuracy of 65%. The authors therefore applied grid search parameter optimisation to improve the performance of logistic regression, yielding 75% accuracy. As a result, if a user feeds a specific news story or its title into the algorithm, it has a 75% probability of being categorised according to its real nature. The consumer may also examine the news story or keywords online, as well as the website's validity. The dynamic system's accuracy is 93%, and it improves with each repetition.
In this paper [7], the authors conducted research on three datasets: a huge imbalanced dataset (FakeNews) and two smaller ones (PHEME and LIAR). First, various machine learning models such as Random Forest, Decision Tree, SVC, and Logistic Regression were compared, with Logistic Regression emerging as the top model in terms of efficiency and efficacy measurements. Compared to the other models, Logistic Regression has several advantages, including interpretability, quick execution time, and minimal parameters to tweak.
Deep learning models (such as Convolutional Neural Networks and BERT) were then compared in turn. Google's BERT achieved the greatest overall results because it performs word-level embedding based on context, despite being difficult to train. Lastly, a multimodal technique for fake image classification was examined further by integrating content and multimedia analysis. Using multimedia data yields the highest outcomes in terms of accuracy, precision, recall, and F1.
In this paper [8], the suggested model makes use of a deep learning framework that employs neural networks and the long short-term memory architecture. This neural network model was introduced in order to enhance the identification of false news. Before training the model, stopword removal was applied for data pre-processing, which improves the model's accuracy. A tokenization approach for feature extraction, or vectorization, was used, which maps tokens to word embeddings. For the word embeddings, the GloVe approach was employed to represent each word in vector form. The information is subsequently routed through the various layers of the architecture by the primary LSTM neural network. The long short-term memory model is used to build the model, which is primarily based on the notion of data categorization for sequential prediction. Lastly, the model is trained to differentiate between real and fake news. The suggested model's assessment metric is accuracy, and the proposed model was 99.88% accurate. The authors' future goal is to improve and expand on the existing work in order to establish an automated system for e-commerce websites, where identification of bogus news has become vital.
In this paper [9], the authors introduce a geometric deep learning strategy for detecting bogus news on the Twitter social network. The suggested technique naturally enables the integration of diverse data such as user profile and activity, social network structure, news propagation patterns, and content. The main advantage of employing deep learning over handcrafted features is its capacity to automatically learn task-specific features from data; here, the choice of geometric deep learning is justified by the graph-structured form of the data. In numerous demanding circumstances involving large-scale real data, the model exhibits very high accuracy and robust behaviour, indicating the great promise of geometric deep learning approaches for fake news identification. Experiments showed that social network structure and propagation are crucial factors in detecting bogus news with high accuracy (92.7% ROC AUC). Second, the authors found that bogus news may be successfully recognised early on, even after only a few hours of spread. Lastly, they examined the ageing of the model using training and testing data separated in time. The findings suggest that propagation-based techniques for false news identification are a promising alternative or complement to content-based strategies.
In this paper [10], the authors observed over-fitting as a practical consequence of training a model without testing its generalisation capabilities. The most evident effect of over-fitting is poor model performance on a fresh, previously unseen dataset. Moreover, over-fitted models show high complexity and assess far more information than is likely required to make a judgement. Lastly, over-fitted models cannot be transferred to a comparable job on a different dataset and must be re-trained from scratch, decreasing reusability. Another practical aspect of deep neural networks observed was that they specialise in specific activities, limiting their capacity to perform effectively on other tasks. The limitation of neural networks, namely that they tackle one problem at a time, becomes an advantage of the proposed hybrid technique, which combines a CNN network that learns spatial, hence conceptual, characteristics of text with an LSTM that captures the sequential flow of text. The fundamental hypothesis is that a hybrid Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) model can improve on state-of-the-art baselines for false news identification. Tests on two distinct types of real-world fake news datasets (100% accuracy on the ISOT dataset, 45,000 articles; 60% accuracy on the FA-KES dataset, 804 articles) show much better outcomes than non-hybrid baseline approaches.
CHAPTER 3
PROPOSED METHODOLOGY AND ARCHITECTURE
3.1 Proposed Methodology
CHAPTER 4
IMPLEMENTATION
4.1 Data Acquisition:
The data in this project were acquired from two types of datasets: English and Hindi. There were three different English datasets and one Hindi news dataset. The English datasets included:
FA-KES Dataset: This dataset mainly consists of news covering the conflict in Syria. It contains a set of articles that have been tagged as either 0 (fake) or 1 (real). Ground-truth information gathered from the Syrian Violations Documentation Center (VDC) is used to determine the credibility of the articles. Information extraction (e.g., date, place, number of casualties) for each article was crowdsourced using the crowdsourcing platform Figure Eight (formerly CrowdFlower). Those articles were then compared against the VDC database to determine whether they are genuine or not.
Fig 1: FA-KES Dataset
The columns ‘article title’ and ‘article content’ are aggregated into one column ‘text’, while the rest of the columns are dropped, leaving the ‘label’ column.
Fig 2: FA-KES Dataset after data reduction and aggregation
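A minimal pandas sketch of this aggregation step is given below; the file name and the column names ('article_title', 'article_content', 'labels') are assumptions based on the description above and may differ from the actual FA-KES CSV.

```python
import pandas as pd

# Load the FA-KES data (file name and encoding assumed)
fakes = pd.read_csv("FA-KES-Dataset.csv", encoding="latin-1")

# Combine the article title and the article content into a single 'text' column
fakes["text"] = fakes["article_title"].fillna("") + " " + fakes["article_content"].fillna("")

# Drop everything except the aggregated text and the 0/1 label
fakes = fakes[["text", "labels"]].rename(columns={"labels": "label"})
```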
ISOT Dataset: There are two sorts of articles in this dataset: fake news and legitimate news. The true articles were retrieved by crawling articles from Reuters.com (a news website). The fake news pieces came from a variety of places: they were gathered from untrustworthy websites flagged by PolitiFact (a fact-checking organization based in the United States) and Wikipedia. The dataset contains articles on a variety of themes, although the majority of them are about politics and world events.
The file ‘True.csv’ consists of more than 12,600 articles from Reuters.com, while the second file ‘Fake.csv’ consists of 12,600 articles obtained from various fake news outlets. Both files contain the following information: article title, text, type, and date of publication.
The table below gives a break-down of the numbers and categories of articles in the ‘True’ and ‘Fake’ news datasets.
Fig 3: ISOT Dataset break-down
Fig 4: ISOT Fake news dataset
Fig 5: ISOT True news dataset
Both of these datasets are assigned class labels according to their type: ‘1’ is assigned to the entries of the fake news dataset, while ‘0’ is assigned to the entries of the true news dataset. Also, the ‘title’ and ‘text’ columns are aggregated, while the rest of the columns are dropped.
Fig 6: After labelling and performing data reduction on the fake ISOT dataset
Fig 7: After labelling and performing data reduction on the true ISOT dataset
Kaggle Dataset: This English news dataset was acquired from Kaggle and consists of 5 features and about 20,800 news entries. The feature attributes in this dataset are: id, title, author, text, and label. Here:
id: unique id for a news article
title: the title of a news article
text: the text of the article; could be incomplete
label: a label that marks the article as potentially unreliable or reliable
1: unreliable
0: reliable
Fig 8: Dataset from Kaggle
Here the columns ‘title’ and ‘text’ are merged into one column ‘text’, while the rest of the columns are dropped, leaving the ‘label’ column.
Fig 9: After labelling and performing data reduction on the Kaggle dataset
All these English datasets were merged together, and a new ‘language’ column was added to the dataframe.
Fig 10: Final english dataset
Hindi Dataset: Meanwhile, the Hindi news dataset was obtained from a GitHub repository. It consisted of two files: one for the fake news and the other for the true news. Both the fake and the true Hindi news datasets consisted of 4 features: id, short_description, full_title, and long_description. Here:
id: unique id for a news article
short_description: a brief summary of the given news article
full_title: the title of the news article
long_description: the text of the article
The fake news dataset consists of 1,241 entries and the true news dataset consists of 439 entries.
Fig 11: Hindi fake news dataset
Fig 12: Hindi true news dataset
Labelling and Merging of the Hindi news dataset:
In order to obtain one complete Hindi news dataset, the entries in the Hindi fake news dataset were labelled as ‘1’ while the true news dataset entries were labelled as ‘0’. After labelling, both datasets were merged and shuffled. In this way, I obtained a Hindi news dataset containing both fake and true news entries, with 1,680 entries in total. A new column ‘language’ was added to the dataframe, to which the value ‘hindi’ was assigned.
Fig 13: Final hindi dataset
Now the final Hindi and final English datasets are merged together and shuffled to form
the final dataset for pre-processing.
Fig 14: Final overall dataset
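A sketch of the labelling, merging, and shuffling steps described above is given below; the dataframe names (english_df, hindi_fake, hindi_true) are placeholders for the frames prepared in the previous steps, and the fixed random seed is an assumption.

```python
import pandas as pd

# Label the Hindi frames: fake news -> 1, true news -> 0
hindi_fake["label"] = 1
hindi_true["label"] = 0

# Merge and shuffle the Hindi data, then tag its language
hindi_df = pd.concat([hindi_fake, hindi_true], ignore_index=True)
hindi_df = hindi_df.sample(frac=1, random_state=42).reset_index(drop=True)
hindi_df["language"] = "hindi"

# The merged English data is tagged the same way
english_df["language"] = "english"

# Final overall dataset: concatenate both languages and shuffle again
final_df = pd.concat([english_df[["text", "label", "language"]],
                      hindi_df[["text", "label", "language"]]],
                     ignore_index=True)
final_df = final_df.sample(frac=1, random_state=42).reset_index(drop=True)
```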
Data Cleaning and Pre-processing:
For data cleaning and pre-processing, some steps were common to the data entries of both languages, while others were applicable only to a specific language. Data cleaning steps such as the removal of punctuation marks and the removal of stopwords were common to both the English and the Hindi news data, while processes like lemmatization and stemming were language specific: lemmatization was used for the English news set, while stemming was used for the Hindi news set.
English News
For the English news set, text pre-processing starts with the removal of punctuation marks from the entered sentence, followed by tokenization of the string and the removal of stopwords. After this, lemmatization takes place, which gives the final pre-processed string as the result. This pre-processed string replaces the corresponding original news text in the dataset. Hence, the pre-processing of the English text covers:
Data Cleaning: In this step of text pre-processing, the punctuation marks were removed from the entered news text using regular expressions (regex). This step is essential because punctuation marks, which are used to divide text into sentences, paragraphs, and phrases, affect the results of any text processing approach, especially one that depends on the occurrence frequencies of words and phrases, since punctuation marks appear frequently in text.
Tokenization: After the removal of the punctuation marks, the target sentence is tokenized, i.e., the raw text is broken down into small chunks of words or sentences which are collectively termed tokens. These tokens assist in understanding the context, developing the model, and interpreting the meaning of the text by analyzing the sequence of words. Tokenizing the sentence also enables the removal of stopwords from it.
Stopwords removal: Stopword removal is an important step in text pre-processing because it removes low-level information from the text in order to give more focus to the important information. Stopwords are the words which are most common in the English language in terms of usage and which do not add much information to the text. Examples of English stopwords are: “the”, “a”, “an”, “so”, “what”, etc. Removing the stopwords from the tokenized sentences does not have any negative consequences for the model we train for our task; instead, the dataset size gets reduced, which reduces the training time due to the fewer number of tokens involved in the training.
Lemmatization: Lemmatization is the final and the most important step in the English text pre-processing because it returns the base or dictionary form of a word. It usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only. In this way, the target text gets simplified.
After the completion of the above steps, the final pre-processed text replaces its corresponding original text in the dataset.
Fig 15: Pre-processing of English data
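A minimal sketch of this English pipeline, assuming NLTK's tokenizer, stopword list, and WordNet lemmatizer (the exact libraries used in the project may differ):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_english(text: str) -> str:
    text = re.sub(r"[^\w\s]", " ", text.lower())          # data cleaning: strip punctuation
    tokens = word_tokenize(text)                          # tokenization
    tokens = [t for t in tokens if t not in stop_words]   # stopword removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatization
    return " ".join(tokens)

# english_df["text"] = english_df["text"].apply(preprocess_english)
```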
Hindi News
For the Hindi news set, the text pre-processing steps are the same as for the English news text; the only difference is that stemming is used instead of lemmatization. It starts with the removal of punctuation marks, followed by tokenization and stopword removal. After this, stemming takes place, which gives the final pre-processed string as the result. This pre-processed string replaces the original text in the dataset.
Stemming: In this step of Hindi text pre-processing, a word is reduced by removing its suffixes and prefixes in order to obtain its root form, whether or not the resulting word is a valid word. Stemming differs from lemmatization in that the stemming technique only looks at the form of the word, whereas the lemmatization technique looks at the meaning of the word; this means that after applying lemmatization we always get a valid word.
After stemming, the final pre-processed text replaces its corresponding original text in the dataset.
Fig 16: Pre-processing of Hindi data
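A sketch of the Hindi pipeline is given below. The stopword list and suffix list here are short, hand-picked illustrations only; the project may instead rely on a published Hindi stopword list and a full lightweight Hindi stemmer.

```python
import re

# Illustrative (not exhaustive) Hindi stopwords and inflectional suffixes
hindi_stopwords = {"के", "का", "की", "है", "और", "से", "को", "पर", "यह", "में"}
hindi_suffixes = ["ियों", "ाओं", "ाएं", "ों", "ें", "ीं", "ता", "ी", "े", "ा"]

def stem_hindi(word: str) -> str:
    # Strip the first matching suffix to approximate the root form
    for suffix in hindi_suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def preprocess_hindi(text: str) -> str:
    text = re.sub(r"[^\u0900-\u097F\s]", " ", text)   # keep only Devanagari characters
    tokens = [t for t in text.split() if t not in hindi_stopwords]
    return " ".join(stem_hindi(t) for t in tokens)

# hindi_df["text"] = hindi_df["text"].apply(preprocess_hindi)
```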
Feature Selection:
In various machine learning processes, it is desirable to reduce the number of input variables, both to reduce the computational cost of modelling and, in some cases, to improve the performance of the model. Feature selection is therefore an important process of reducing the number of input variables when developing a predictive model.
In the case of text processing, feature selection selects a subset of the terms occurring in the given pre-processed dataset. Since a machine cannot understand text, only numbers, the pre-processed text is converted into meaningful numbers that can be fed into machine learning algorithms. In this project, I have used two techniques to perform feature selection: CountVectorizer and TF-IDF vectorizer.
CountVectorizer: This is an NLP tool which transforms a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. It is helpful for data containing multiple texts, converting each text into a vector for use in further text analysis. It creates a matrix in which each unique word is represented by a column, and each text sample from the document is a row. The value of each cell is simply the count of the word in that particular text sample.
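A small illustration of CountVectorizer on a toy corpus (the sentences below are made up for demonstration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "government denies the report",
    "report claims the government denies everything",
]

count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(corpus)      # sparse document-term matrix

print(count_vec.get_feature_names_out())        # unique words = columns
print(X_counts.toarray())                       # rows = documents, cells = word counts
```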
TF-IDF vectorizer: Term frequency-inverse document frequency is also a text vectorizer that transforms the text into a usable vector. It combines two concepts, term frequency (TF) and document frequency (DF). The term frequency is the number of occurrences of a specific term in a document and indicates how important that term is in the document. It represents every text from the data as a matrix whose rows are the documents and whose columns are the distinct terms occurring throughout all documents. Document frequency is the number of documents containing a specific term and indicates how common the term is. Inverse document frequency (IDF) is the weight of a term; it aims to reduce the weight of a term whose occurrences are scattered throughout all the documents.
Fig 17: TF-IDF formula
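One common textbook form of the weighting shown in the figure above is the following (scikit-learn's TfidfVectorizer applies a smoothed variant of the same idea):
tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t)),
with tf(t, d) the number of occurrences of term t in document d, df(t) the number of documents containing t, and N the total number of documents.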
Fig 18: Feature extraction process
Fig 19: Feature extracted matrix
Train-Test split:
After performing feature selection on the pre-processed dataset, the available data needs to be split into two parts: the training part and the testing part. As per the usual practice of a machine learning project, I have split the data in the ratio of 80% to 20%, where 80% of the data is used for model training while the remaining 20% is used for model testing.
Fig 20: The splitting of the data in the 80:20 ratio
Fig 21: Training Data
Fig 22: Testing Data
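A sketch of this split using scikit-learn; the variable names X (feature matrix) and y (labels) and the fixed random seed are assumptions:

```python
from sklearn.model_selection import train_test_split

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
```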
Modelling:
Multi-Nomial Naïve-Bayes
Fig 23: Training and Testing using Multinomial Naïve-Bayes
Fig 24: The results of Multinomial Naïve Bayes
The Multinomial Naive Bayes method is a common Bayesian learning approach in Natural Language Processing (NLP). Using Bayes' rule, the program estimates the label of each input sample: it assesses the likelihood of each tag for a given sample and returns the tag with the highest probability. The Naive Bayes classifier is a family of algorithms which share one assumption: each feature being classified is independent of every other feature, so the presence or absence of one trait has no bearing on the inclusion or exclusion of another.
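A minimal sketch of training and evaluating this classifier on the extracted features, assuming the X_train/X_test/y_train/y_test split above:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

nb = MultinomialNB()
nb.fit(X_train, y_train)                       # learn class-conditional word probabilities

y_pred = nb.predict(X_test)                    # tag with the highest posterior probability
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["real", "fake"]))
```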
Logistic Regression
Fig 25: Training and Testing using Logistic Regression
Fig 26: The results of Logistic Regression
Logistic regression is the most widely used supervised machine learning algorithm for classification in natural language processing, and it has a tight relationship with neural networks. Here, it starts with a random initialization of the weight vector over the extracted features. The weights are then updated during training, which makes the model ready for deployment. To classify a sample, the dot product of the weights and the features is calculated; if the resulting score is above a specific threshold, the data point is assigned to class ‘1’ (fake news), otherwise to class ‘0’ (real news).
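The decision rule described here can be sketched directly in NumPy (illustrative only; in practice the project fits scikit-learn's LogisticRegression, which learns w and b from the training data):

```python
import numpy as np

def predict_label(x, w, b, threshold=0.5):
    """Classify one feature vector x with learned weights w and bias b."""
    z = np.dot(w, x) + b                  # weighted sum (dot product) of the features
    p_fake = 1.0 / (1.0 + np.exp(-z))     # sigmoid -> probability of class '1' (fake)
    return 1 if p_fake >= threshold else 0
```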
Decision Tree
Fig 27: Training and Testing using Decision Tree
Fig 28: The results of Decision Tree
Decision tree is a versatile tool that can be used in a variety of situations, and in NLP classification problems can be solved with decision trees. Here, the algorithm uses a tree-like flowchart to classify the texts on the basis of the extracted features through a sequence of feature-based splits, assigning ‘1’ for fake news and ‘0’ for true news.
Random Forest
Fig 29: Training and Testing using Random Forest
Fig 30: The results of Random Forest
A random forest is an ensemble classifier that makes predictions using a collection of decision trees. It works by fitting a number of decision tree classifiers to different subsamples of the dataset. In addition, each tree in the forest is constructed using a random subset of the attributes, and ensembling these trees yields the best subset of features among all the random subsets of features. Random forest is currently one of the best performing algorithms for many classification problems.
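A sketch of how the four models can be trained and compared on the same split to produce the accuracy comparison reported in the next chapter; the specific hyperparameters (max_iter, n_estimators, random seeds) are assumptions:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")
```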
CHAPTER 5
RESULTS AND CONCLUSION
Figure 31: Performance of all machine learning models
Most tasks in the twenty-first century are completed online. Newspapers, which were once favoured as tangible copies, are gradually being replaced by online platforms such as Facebook, Twitter, and news websites; forwards on WhatsApp are also a significant source. The growing problem of fake news only complicates matters by attempting to sway people's opinions and attitudes against the usage of digital technology. When a person is taken in by such news, they may begin to believe that their assumptions about a certain topic are correct. In order to combat this problem, a Fake News Detection system was created, which collects user input and classifies it as real or fake. Various NLP and Machine Learning techniques were employed to accomplish this. The model is trained using a variety of acceptable datasets in both English and Hindi, and its performance is assessed using a variety of performance indicators. To classify news headlines or articles, the best model, i.e., the model with the highest accuracy, is utilised. Our best model, as seen above, was the Decision Tree, which had an accuracy of 96.2%. As a result, pipelining was done with this approach, which was then incorporated into the website. I want to create my own dataset in the future, which will be kept up to date with the latest news. Using a web crawler and an online database, all live news and current data will be stored in a database.
REFERENCES
1. Reis, J. C. S., Correia, A., Murai, F., Veloso, A., & Benevenuto, F. (2019).
Supervised Learning for Fake News Detection. IEEE Intelligent Systems, 34(2), 76–
81. doi:10.1109/MIS.2019.2899143 (BASE PAPER)
2. Pérez-Rosas, V., Kleinberg, B., Lefevre, A., & Mihalcea, R. (2017). Automatic Detection of Fake News. arXiv [cs.CL]. Retrieved from http://arxiv.org/abs/1708.07104
3. Yang, Y., Zheng, L., Zhang, J., Cui, Q., Li, Z., & Yu, P. S. (2018). TI-CNN: Convolutional Neural Networks for Fake News Detection. arXiv [cs.CL]. Retrieved from http://arxiv.org/abs/1806.00749
4. Zhang, J., Dong, B., & Yu, P. S. (2019). FAKEDETECTOR: Effective Fake News Detection with Deep Diffusive Neural Network. arXiv [cs.SI]. Retrieved from http://arxiv.org/abs/1805.08751
5. Saleh, H., Alharbi, A., & Alsamhi, S. H. (2021). OPCNN-FAKE: Optimized
Convolutional Neural Network for Fake News Detection. IEEE Access, 9, 129471–
129489. doi:10.1109/ACCESS.2021.3112806
6. Sharma, U., Saran, S., & Patil, S. M. (2021). Fake News Detection using Machine Learning Algorithms. International Journal of Engineering Research & Technology (IJERT), NTASU – 2020, Volume 09, Issue 03.
7. Galli, A., Masciari, E., Moscato, V., & Sperlí, G. (2022). A comprehensive
Benchmark for fake news detection. Journal of Intelligent Information Systems,
59(1), 237-261.
8. Chauhan, T., & Palivela, H. (2021). Optimization and improvement of fake news
detection using deep learning approaches for societal benefit. International Journal of
Information Management Data Insights, 1(2), 100051.
9. Monti, F., Frasca, F., Eynard, D., Mannion, D., & Bronstein, M. M. (2019). Fake
news detection on social media using geometric deep learning. arXiv preprint
arXiv:1902.06673.
10. Nasir, J. A., Khan, O. S., & Varlamis, I. (2021). Fake news detection: A hybrid
CNN-RNN based deep learning approach. International Journal of Information
Management Data Insights, 1(1), 100007.