
Fake News Detection Using Machine Learning

Abstract
Identifying the veracity of a news article is an interesting problem, and automating the process is a challenging task. Detecting whether a news article is fake remains an open question, as it depends on many factors that current state-of-the-art models fail to incorporate. In this paper, we explore a subtask of fake news identification: stance detection. Given a news article, the task is to determine the relevance of the body to its claim. We present a novel idea that combines neural, statistical, and external features to provide an efficient solution to this problem. We compute neural embeddings from a deep recurrent model, statistical features from a weighted n-gram bag-of-words model, and hand-crafted external features using feature engineering heuristics. Finally, all the features are combined through a deep neural layer, classifying each headline-body news pair as agree, disagree, discuss, or unrelated. We compare our proposed technique with the current state-of-the-art models on the Fake News Challenge dataset. Through extensive experiments, we find that the proposed model outperforms all state-of-the-art techniques, including the submissions to the Fake News Challenge. Fake news is a story fabricated with the intention to misdirect or delude the reader. We present an approach to the task of fake news detection using deep learning architectures. The large number of fake news stories in circulation has accelerated their spread, and given the scale of this onslaught, individuals are, by and large, poor detectors of fake news. Consequently, efforts have been made to build automatic systems for fake news identification. The most common of these rely on "blacklists" of sources and authors that are not dependable. While such tools help build a more complete end-to-end solution, we also need to address the harder cases in which otherwise reliable sources and authors release fake news. The goal of this project was therefore to build a tool that recognizes the language patterns distinguishing fake from genuine news, using machine learning and natural language processing techniques. The results of this project demonstrate the significant potential of machine learning for this task. We have built a model that captures many natural signals of genuine and fake news, along with an application that aids in visualizing the classification decision.
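To make the feature-combination idea concrete, the following is a minimal sketch, not the authors' actual architecture: it substitutes a placeholder for the recurrent-model embedding, computes TF-IDF-weighted n-gram features, adds one hand-crafted headline-body overlap feature, and classifies the concatenation with a small feed-forward network. All names, dimensions, and the toy data are illustrative assumptions.

```python
# Minimal sketch of the feature-combination idea (illustrative only; not the
# authors' actual architecture). Dimensions and names are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

headlines = ["Pope endorses candidate", "Study finds coffee is healthy"]
bodies = ["Officials denied the claim on Monday.",
          "Researchers report moderate consumption is safe."]
labels = ["disagree", "discuss"]  # toy subset of the four stance labels

# Statistical features: weighted n-gram bag-of-words over headline + body.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=500)
stat_feats = tfidf.fit_transform(
    [h + " " + b for h, b in zip(headlines, bodies)]).toarray()

# Neural features: stand-in for embeddings from a deep recurrent model.
rng = np.random.default_rng(0)
neural_feats = rng.normal(size=(len(headlines), 64))  # placeholder

# External features: e.g. word-overlap ratio between headline and body.
def overlap(h, b):
    hw, bw = set(h.lower().split()), set(b.lower().split())
    return len(hw & bw) / max(len(hw), 1)

ext_feats = np.array([[overlap(h, b)] for h, b in zip(headlines, bodies)])

# Combine all three feature groups and classify with a dense neural layer.
X = np.hstack([neural_feats, stat_feats, ext_feats])
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300).fit(X, labels)
```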

CHAPTER 1:- Introduction
This project could be used in practice by any media company to automatically predict whether circulating news is fake, without requiring humans to manually review thousands of news articles. The Internet has become indispensable in our lives, and it is far easier to access now than it once was. There is no doubt that many young people prefer the Internet to newspapers, radio, and other media as a source of news [1]. The Internet offers many opportunities: we can search for anything to clear a doubt or for research purposes; simply put, we can hardly imagine life without it. As more people connect to the Internet, they obtain most of their information content through it. In a country like India, where Internet access has recently become cheap, many people access news through their digital devices. When it comes to news publishing, however, this creates many issues, because news spreads very quickly over the Internet. Fake news has many consequences and can harm innocent people. It may be created intentionally or accidentally to harm an individual or a group for any purpose, such as political or religious ends. There have been many incidents of people being hurt or even killed because of rumors on the Internet. The creation of fake news generally increases around election time in a country. The BBC news broadcaster conducted research on the 2014 Indian general election; the researchers [2] examined about 16,000 Twitter accounts and 3,000 Facebook pages to learn how fake news becomes polarized in India. This research indicates a "strong and coherent" proliferation of right-wing ideology, while networks promoting left-wing fake news were found to be less organized and effective. Another BBC study [3] found that nearly 72% of Indian citizens are unable to differentiate real facts from made-up ones. Together, these findings suggest that people in India need greater digital literacy to overcome the consequences of fake news in the country. In this paper, we first review the work done so far, then present the LIAR dataset [4], collected from POLITIFACT.COM, and explain the feature extraction performed on it. After this, we present an ensemble of multiple models and recast the problem as binary classification rather than the multi-class classification problem posed by the dataset. The evaluation and results are presented at the end of this paper.

CHAPTER 2:- LIAR Dataset

Most available datasets contain short statements, since the language used for political information broadcasting in TV interviews, Facebook posts, and tweets mostly consists of short statements; this is what makes fake news detection more challenging. In this work, we use a publicly available benchmark dataset (the LIAR dataset), collected from the website POLITIFACT.COM, which provides a detailed report and a URL to the source document of each sample, enabling the development of automatic fake news detection techniques. This labeled dataset consists of thirteen different features across 12.8k data samples about politics, each manually analyzed by the editors of POLITIFACT.COM and categorized according to its truthfulness. The dataset records the speaker of each statement ('speaker'), the speaker's job ('speakerjob'), the speaker's state ('stateinfo'), party affiliation ('partyaffiliation'), the context of the statement ('context'), and the subjects of the news ('subjects'). The statements are labeled in six categories: true, mostly-true, barely-true, half-true, false, and pants-on-fire. The number of data samples in each category ranges from 2,063 to 2,638, except for pants-on-fire, which has 1,050 samples in the whole dataset. The statements span the years 2007 to 2016, the speakers are a mix of Republicans and Democrats, and the dataset contains a sufficient number of samples from social media.
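A minimal loading sketch follows, assuming the tab-separated train.tsv layout of the public LIAR release; the column names below are the commonly documented ones, and the binary mapping is one plausible choice (the paper does not specify the exact split), so adjust both if your copy differs.

```python
# Minimal sketch for loading the LIAR dataset with pandas, assuming the
# standard tab-separated layout of the public release (train.tsv).
import pandas as pd

columns = [
    "id", "label", "statement", "subjects", "speaker", "speaker_job",
    "state_info", "party_affiliation", "barely_true_counts", "false_counts",
    "half_true_counts", "mostly_true_counts", "pants_on_fire_counts",
    "context",
]
liar = pd.read_csv("train.tsv", sep="\t", header=None, names=columns)

# Six-way truthfulness labels; exact label strings follow the public
# release, so verify them against your copy.
print(liar["label"].value_counts())

# Recast as binary classification, as described in the introduction.
# The grouping below is an assumption for illustration.
real = {"true", "mostly-true", "half-true"}
liar["binary_label"] = liar["label"].map(
    lambda l: "real" if l in real else "fake")
```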
2.1 Spam Detection
The problem of detecting non-genuine sources of information through content-based analysis is considered solvable, at least in the domain of spam detection [7]. Spam detection applies statistical machine learning techniques to classify text (e.g., tweets [8] or emails) as spam or legitimate. These techniques involve pre-processing the text, feature extraction (e.g., bag of words), and feature selection based on which features yield the best performance on a test dataset. The purpose of this project is not to decide for the reader whether a document is fake, but rather to alert them that certain documents require extra scrutiny. Fake news detection, unlike spam detection, has many nuances that are not easily detected by text analysis. For example, a human actually needs to apply knowledge of a particular subject to decide whether the news is true, and the "fakeness" of an article could be switched on or off simply by replacing one person's name with another's. Therefore, the best we can do from a content-based standpoint is to decide whether a document requires scrutiny. The idea is that the reader does the legwork of researching other articles on the topic to decide whether the article is actually fake, while the "flagging" alerts them to do so in appropriate circumstances.
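For concreteness, here is a minimal sketch of such a content-based spam pipeline in scikit-learn; the toy texts and labels are illustrative assumptions, not the project's actual corpus.

```python
# Minimal sketch of a content-based spam/legitimate classifier with
# scikit-learn; the toy data below is illustrative, not a real corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now",           # spam-like
    "meeting rescheduled to monday",  # legitimate-like
    "cheap pills limited offer",      # spam-like
    "see attached project report",    # legitimate-like
]
labels = ["spam", "legit", "spam", "legit"]

# Pre-processing + bag-of-words feature extraction + statistical classifier.
pipeline = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
pipeline.fit(texts, labels)

print(pipeline.predict(["free offer prize"]))  # expected: ['spam']
```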

CHAPTER 3:- PROPOSED METHODOLOGY

The fundamental idea is to build a model that predicts the trustworthiness of ongoing news events. The proposed approach consists of the following steps:
∙ Collection of Data
∙ Pre-Processing of Data
∙ Classification
∙ Result Analysis

The key phrases of the news events to be verified are collected, and the filtered data is then stored in a MongoDB database. The data pre-processing unit is responsible for preparing the data for the further processing that is required. Classification is based on the following features (a sketch of assembling such a feature vector follows the list):
∙ Number of tweets
∙ Number of hashtags
∙ Number of followers
∙ Verified user
∙ Sentiment score
∙ Number of retweets
∙ NLP methods
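As an illustration, a minimal sketch of turning these tweet-level signals into a feature vector for a classifier; the field names and the example values are assumptions, since the project's exact extraction is not specified here.

```python
# Minimal sketch: assembling the tweet-level features listed above into a
# numeric vector for classification. Field names are illustrative assumptions.
import numpy as np

def tweet_features(tweet: dict) -> np.ndarray:
    """Build a feature vector from one tweet's metadata."""
    return np.array([
        tweet["n_tweets_on_topic"],          # number of tweets
        tweet["n_hashtags"],                 # number of hashtags
        tweet["n_followers"],                # number of followers
        1.0 if tweet["verified"] else 0.0,   # verified user
        tweet["sentiment_score"],            # e.g. from an NLP sentiment model
        tweet["n_retweets"],                 # number of retweets
    ], dtype=float)

example = {
    "n_tweets_on_topic": 42, "n_hashtags": 3, "n_followers": 1200,
    "verified": True, "sentiment_score": -0.4, "n_retweets": 17,
}
print(tweet_features(example))
```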

3.1 Topic Dependency

As we suspected from a general overview of the makeup of the two datasets, there is a significant difference in the subjects written about in fake news and real news, even within the same time range and with the same current events unfolding. More specifically, the concentration of articles involving "Hillary", "Wikileaks", and "republican" is higher in fake news than in real news. This is not to say that these words did not appear in real news, but they were not among the most frequent words there. Additionally, words like "football" and "love" appear very frequently in the real news dataset, but these are topics one can imagine would rarely, if ever, be written about in fake news. The "hot topics" of fake news present another issue for this task: we do not want a model that simply chooses a classification based on the probability that a fake or real news article would be written on that topic, just as we would never tell a person that every article about Hillary is fake news or every article about love is real news. We accounted for these differences in the dataset by separating our training and test sets on the presence or absence of certain words. We tried this for a number of topics that were present in both fake and real news but in different proportions; the words we chose were "election", "war", and "email". To create a model that was not biased by the presence of one of these words, we extracted all body texts that did not contain the word and used this set as the training set; the remaining body texts, which did contain the target word, formed the test set. The model's accuracy on the test set then represents a form of transfer learning: the model was trained on articles about topics other than the target word and had to use what it learned to classify texts about the target word. The accuracies were still quite high, as demonstrated in Section 5, which shows that the model was learning language patterns beyond those specific words. This could mean that it learned similar words via the word embeddings, that it learned to "pay attention" to completely different words, or both.
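A minimal sketch of this topic-held-out split follows; the DataFrame columns ("text", "label") and the toy rows are assumptions for illustration.

```python
# Minimal sketch of the topic-held-out split described above: train on body
# texts without the target word, test on those containing it.
import pandas as pd

def split_by_word(df: pd.DataFrame, word: str):
    """Train set excludes the target word; test set contains it."""
    contains = df["text"].str.contains(word, case=False, regex=False)
    return df[~contains], df[contains]

articles = pd.DataFrame({
    "text": ["the election was close", "troops move as war looms",
             "a recipe for sunday dinner", "leaked email raises questions"],
    "label": ["fake", "real", "real", "fake"],
})

for word in ["election", "war", "email"]:
    train, test = split_by_word(articles, word)
    print(word, "-> train:", len(train), "test:", len(test))
```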

3.2 Cleaning
Pre-processing the data is a normal first step before training and evaluating a neural network. Machine learning algorithms are only as good as the data you feed them, so it is crucial that the data is formatted properly and that meaningful features are included, providing enough consistency to yield the best possible results. The goal of pre-processing is to strip away unimportant distinguishing features. By analogy with images, features like darkness or brightness are not beneficial in the task of labeling an image; similarly, there are portions of text that are not beneficial in the task of labeling a text as real or fake. Pre-processing is often an iterative task rather than a linear one. That was the case in this project, where we used a new and not yet standardized dataset: as we discovered unmeaningful features that the neural net was learning, we learned what more we needed to pre-process out of the data.
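A minimal cleaning sketch is shown below; the specific substitutions are common choices and illustrative assumptions, since the project's full pre-processing steps are not enumerated here.

```python
# Minimal sketch of text cleaning before training; the steps below are
# illustrative assumptions, not the project's exact pipeline.
import re

def clean_text(text: str) -> str:
    text = text.lower()                          # normalize case
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)        # keep letters only
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text

print(clean_text("BREAKING!!! Read more at http://example.com ..."))
# -> "breaking read more at"
```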

3.3 Experimental Results
The accuracy of the model we believe is most representative of how machine learning can handle the fake news / real news classification task based purely on language patterns is 95.8%. This accuracy corresponds to the following confusion matrix, which shows the counts for each category of predictions; the rest of the accuracies and confusion matrices can be found in Table 5.1 in the Appendix.

Table 5.1: Confusion matrix from our "best" model

                 Predicted Fake    Predicted Real
Actual Fake           2965                98
Actual Real            134              2307

To better understand which types of fake news were being properly classified and which were more difficult to classify, we used [20] to gather different "types" of fake news. According to [20], fake news is separate from other categories such as clickbait, junk science, rumor, hate, and satire. However, our dataset included sources listed as types other than straightforward "fake news." The majority of the 244 sources were listed in the OpenSources [20] mapping of sources to their corresponding categories.
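The reported 95.8% can be verified directly from the counts in the matrix above; a minimal check:

```python
# Sanity check: accuracy implied by the confusion matrix above.
tp, fn = 2965, 98    # actual fake: predicted fake / predicted real
fp, tn = 134, 2307   # actual real: predicted fake / predicted real

accuracy = (tp + tn) / (tp + fn + fp + tn)
print(f"{accuracy:.3f}")  # -> 0.958, matching the reported 95.8%
```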

CHAPTER 4:- Application
Perhaps the most generalizable contribution of this project was the creation of a visualization tool for the classification of fake and real news. While it is interesting to see summary statistics about the model, such as the prediction accuracy, the parameters, and even what each individual neuron is describing, it is perhaps more interesting to see how the model makes its decision for a given body text. This can be done by tracing back the most important trigrams. However, this does not tell us whether the removal of a certain trigram or word from the body text would change the classification label. We created an application that performs this tracking of the most important trigrams online. It highlights the "most real", "least real", "most fake", and "least fake" trigrams, as defined in the "Tracking Important Trigrams" section. Using this application, a user can test a body text and see the probability that it is real, the probability that it is fake, and which trigrams were most prominent in making that decision. Likewise, a user can see the resulting increase or decrease in the classification probability, or, more extremely, a change in the classification itself, when they edit the body text. This better demonstrates how holistic a view the model has of the article: for example, if changing the classification requires removing many trigrams, the model has a holistic view.
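As an illustration of ranking trigrams for such a tool, here is a minimal sketch; the scoring function is a toy stand-in, since the project's actual model-based importance tracing is not reproduced here.

```python
# Minimal sketch of extracting trigrams and ranking them by a per-trigram
# score, as a visualization tool might. The scorer is a toy stand-in for the
# model's actual importance values.
def trigrams(text: str):
    words = text.lower().split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

def toy_score(trigram: str) -> float:
    """Stand-in score in [-1, 1]: positive = 'real', negative = 'fake'."""
    fake_cues = {"you won't believe"}
    real_cues = {"according to researchers"}
    if trigram in fake_cues:
        return -1.0
    if trigram in real_cues:
        return 1.0
    return 0.0

body = "you won't believe this finding according to researchers today"
ranked = sorted(trigrams(body), key=toy_score)
print("most fake:", ranked[0])    # -> "you won't believe"
print("most real:", ranked[-1])   # -> "according to researchers"
```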

CHAPTER 5:- REQUIREMENTS
Python
numpy
pandas
itertools
matplotlib
sklearn

CHAPTER 6:- CONCLUSION
The main contribution of this project is support for the idea that machine learning could be useful in a novel way for the task of classifying fake news. Our findings show that, after much pre-processing of a relatively small dataset, a simple CNN is able to pick up on a diverse set of potentially subtle language patterns that a human may (or may not) be able to detect. Many of these language patterns are intuitively useful in a human's manner of classifying fake news. Some such intuitive patterns that our model found to indicate fake news include generalizations, colloquialisms, and exaggerations. Likewise, our model looks for indefinite or inconclusive words, referential words, and evidence words as patterns that characterize real news. Even if a human could detect these patterns, they cannot store as much information as a CNN model and therefore may not understand the complex relationships between the detection of these patterns and the classification decision. Furthermore, the model seems relatively unfazed by the exclusion of certain "giveaway" topic words from the training set, as it is able to pick up on trigrams that are less specific to a given topic when need be. As such, this seems to be a very good start on a tool that would usefully augment a human's ability to detect fake news. Other contributions of this project include the creation of a dataset for the task and the creation of an application that aids in the visualization and understanding of the neural net's classification of a given body text. This application could be a tool for humans trying to classify fake news, giving indications of which words might cue them toward the correct classification. It could also be useful to researchers trying to develop improved models through improved and enlarged datasets, different parameters, and so on. The application also provides a way to see manually how changes in the body text affect the classification.
