0% found this document useful (0 votes)
38 views16 pages

2023PCS2016 Report

Uploaded by

Surya S
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
38 views16 pages

2023PCS2016 Report

Uploaded by

Surya S
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Fake News detection using Machine Learning

THESIS REPORT

SUBMITTED BY

SURYA S

(2023PCS2016)

Under the Supervision of

Dr. Vandana Bhatia

(Assistant Professor, Department of Computer Science)

In the partial fulfillment for the award of degree

Of

MASTER OF TECHNOLOGY

IN

COMPUTER SCIENCE AND ENGINEERING


Table of Contents

1. ABSTRACT

2. INTRODUCTION

3. MOTIVATION

4. LITERATURE SURVEY

5. PROBLEM STATEMENT

6. OBJECTIVES

7. METHODOLOGY

8. CONCLUSION
Chapter 1

ABSTRACT

Fake news has emerged as a significant threat in the digital era, impacting public opinion, social
stability, and even political outcomes. With the proliferation of social media and online news
platforms, the rapid dissemination of false information has become a critical issue. Machine
learning offers a promising solution for automating the detection of fake news by leveraging
algorithms that can analyze and classify large volumes of data efficiently. This paper explores
various machine learning techniques, such as Support Vector Machines (SVM), Naïve Bayes,
Random Forest, and deep learning models, to detect fake news. These models are trained on
features extracted from textual content, social context, and metadata, allowing them to
differentiate between true and false information. The study highlights the challenges faced by
machine learning approaches, including the need for large labeled datasets, feature engineering,
and handling bias in data. Overall, machine learning-based systems show great potential in
enhancing the accuracy and speed of fake news detection, contributing to efforts aimed at
curbing the spread of misinformation online.
CHAPTER 2

INTRODUCTION

Fake news, defined as intentionally misleading information, has gained significant attention due
to its societal impact, particularly in the context of elections and public safety. With the rapid
spread of fake news on social media, developing robust detection systems is critical. Developing
automated systems to detect fake news is critical due to the large volume of user-generated content
and the complex nature of natural language. While traditional machine learning models have been
used for this task, they face computational limitations. The study proposes a genetic algorithm-
based approach that applies machine learning classifiers to detect fake news more effectively and
efficiently. This method reduces computational complexity while maintaining high detection
accuracy with a particular emphasis on fine-tuning and downstream neural network structures to
improve performance.
CHAPTER 3

MOTIVATION

Given the limitations of current fake news detection methods, including their high computational
cost and limited scalability, the motivation for this study stems from the need for more efficient
solutions. Genetic algorithms are known for their ability to find near-optimal solutions with lower
computational overhead. The adaptability and optimization capabilities of genetic algorithms
make them ideal candidates for enhancing machine learning classifiers in fake news detection,
driving the development of this novel approach.
CHAPTER 4

LITERATURE SURVEY

Past studies have explored various techniques for fake news detection, from content analysis using
machine learning to social network analysis. Several approaches have been employed for fake
news detection, ranging from traditional machine learning models like SVM and Logistic
Regression to advanced models like CNN and LSTM.

In this study, we investigate and compare the effectiveness of different pre-trained language
models (PLMs), specifically BERT and CT-BERT, by incorporating them with a range of
downstream neural network architectures. We explore both simple and sophisticated models,
testing their performance in two training strategies: a feature-based approach, where the PLM's
weights are frozen, and a fine-tuning approach, where all parameters are jointly trained for a
supervised task. Our primary objective is to predict the veracity of information (whether it is fake
or not) by adding a classification layer with a sigmoid function to each variant. The effectiveness
of these models is evaluated by comparing their predictive accuracy on the binary classification
task. [1]

In the proposed algorithm, genes are represented as binary strings, and the initial population is
randomly generated from input data. The LIAR dataset uses "Statement" as the feature with
labels "True" or "False", while the FJP dataset uses "description" as the feature and "fraudulent"
(0 for true, 1 for false) as the label. After preprocessing the data by removing stop words, a
population of 200 chromosomes is formed, each with 5000 unique features. The fitness of each
individual is determined using machine learning classifiers, including SVM, Naïve Bayes, Logistic
Regression, and Random Forest, which act as fitness functions. Higher fitness scores indicate
better solutions. Offspring are generated from selected high-quality parents through single-point
crossover and mutation (3% rate). The algorithm runs for 50 generations, after which the best
solution for detecting fake news is obtained. With 33% of the data reserved for testing, accuracy
is evaluated, and a confusion matrix is generated. Precision, recall, F1 score, AUC, and ROC curves
are calculated to assess model performance across datasets.[2]

A Fake news detection model using a Capsule network was introduced and was applied to the
ISOT and LIAR datasets, achieving improvements of 7.8% on ISOT, 3.1% on the LIAR validation
set, and 1% on the LIAR test set compared to state-of-the-art models. Palani employed the BERT
model to extract textual information and capture word semantic links, outperforming the
SpotFake+ model on the Politifact and Gossipcop datasets. Sun proposed a model combining bi-
GRU and capsule networks to preserve the spatial structure of the text and improve rumor
detection accuracy by 6.1% under static routing and 6.7% under dynamic routing on the Twitter
dataset. Braşoveanu and Andonie developed a hybrid model incorporating machine learning,
semantics, and NLP, showing that semantic features significantly enhance fake news
classification accuracy.[3]

MWPBert (Max Worth Parallel BERT) is a model designed for fake news detection that processes
both news headlines and relevant text spans using parallel BERT networks. The headline is
tokenized into 128 tokens and fed into one BERT network, while the MaxWorth algorithm selects
the most relevant portion of the news body (up to 512 tokens) for a second BERT network. The
outputs of both networks, capturing the semantics of the headline and news body, are merged
and passed through a dropout layer, followed by a linear neural network to predict whether the
news is true or false. The model focuses on efficiently utilizing BERT's encoder layers for learning
text structure and meaning, avoiding unnecessary complexity in the post-BERT network to reduce
computational cost and maintain the effectiveness of the BERT attention mechanism. [Error!
Reference source not found.]

A fake news detection model was developed using five machine learning algorithms to classify
tweets as fake or real. Logistic Regression (LR) utilized a stepwise method to reduce model
complexity, achieving optimal performance with an AIC value of 540.1684. The Classification and
Regression Tree (CART) model was optimized through repeated deviance adjustments and tree
pruning, resulting in the best model with six terminal nodes (deviance = 7.75). For the Neural
Network (NNET), the optimal configuration involved three hidden layers (size = 1, decay = 0.1).
The Support Vector Machine (SVM) used a nonlinear Radial Basis Function (RBF) with a Gaussian
Kernel to enhance model performance. The evaluation of these algorithms aimed to determine
the highest-performing fake news detection model. [Error! Reference source not found.]

The FDHN model is designed with a tripartite input structure that includes three distinct
modalities: a news text modality for the article's written content, a textual context modality for
accompanying contextual information, and a numerical context modality for relevant numerical
data. The model is organized into three main components—News Text, Textual Context, and
Numerical Context—each detailed in their respective sections. Additionally, the Fuzzy Layer, a
key component of the FDHN model, is discussed separately. The second-to-last layer produces
four feature representations from the news text, textual context, numerical context, and fuzzy layer
output, which are concatenated before being processed by the final Output Layer. This final layer
determines the most plausible classification for the news article. [Error! Reference source not
found.]

Data mining heavily relies on preprocessing to transform inconsistent and incomplete raw data
into a machine-readable format, as demonstrated through various text preprocessing activities on
the FNC-1 dataset. Feature extraction is essential, converting raw data into numerical features
while maintaining the original dataset's information, which is more effective than using raw data
for training. One employed method is HashingTF, which uses the MurmurHash3 algorithm to map
phrases to their word frequencies, avoiding the need for a term-to-index map and reducing
processing time, though it may encounter hash collisions. Inverse Document Frequency (IDF)
complements term frequency by downplaying the significance of frequently occurring words, thus
enhancing feature vector scaling. The study utilizes several classification models, including
Random Forest (RF), which constructs multiple decision trees to make predictions, Logistic
Regression (LR) for probability prediction in classification tasks, and Decision Trees (DT) for
decision-making. These methodologies collectively aim to improve the detection of fake news
through efficient data processing and classification techniques. [Error! Reference source not
found.]

Initial experiments conducted with the IFND dataset tested five algorithms for fake news detection.
While KNN underperformed, both Multinomial Naive Bayes and Logistic Regression yielded
promising results, with Logistic Regression achieving 99% accuracy on the IFND dataset. Given
the dataset's limited size, cross-dataset scenarios were explored by merging the IFND and Indian
Fake News datasets for comprehensive analysis. The models, built from the combined training
sets, were tested separately on both datasets, revealing that the Logistic Regression model
significantly outperformed others, with F1 scores of 93% for IFND and 99% for the Indian dataset.
This study introduced a robust pipeline for fake news detection, incorporating automatic language
detection and translation for non-English headlines, along with web scraping for data extraction,
preprocessing through tokenization and stop word removal, and feature extraction using TF-IDF.
The machine learning algorithms evaluated included Gradient Descent, Multinomial Naive Bayes,
KNN, AdaBoost, and Logistic Regression. [Error! Reference source not found.]
Chapter 5

PROBLEM STATEMENT

The main challenge is to detect fake news accurately and efficiently in social networks. Traditional
machine learning classifiers, though effective, face scalability issues and require high
computational resources. The goal is to leverage genetic algorithms to optimize these classifiers,
thereby improving their performance in detecting fake news on large datasets.

The challenge lies in dealing with the poor quality of user-generated content, high-dimensional
textual data, and the dynamic nature of fake news, which requires more advanced models capable
of understanding deep semantic structures.
Chapter 6

OBJECTIVES

Research objectives include:

• To design and implement a robust machine learning-based system for the automatic
detection of fake news articles.
• To assess the effectiveness of different datasets in training and testing the detection models,
including the impact of combining datasets in cross-dataset scenarios.
• To evaluate the performance of the detection models using various metrics, including
accuracy, F1 score, precision, and recall, to determine the most effective approach for fake
news detection.
• To apply effective data preprocessing methods, such as tokenization, stop word removal,
and feature extraction using TF-IDF, to improve the quality of input data for the models.
• To incorporate automatic language detection and translation tools to handle non-English
news articles, ensuring uniformity and accessibility in the detection process.
Chapter 7

METHODOLOGY

Data Collection
and Preprocessing

Feature Extraction

Model Training

Model Evaluation

Cross-Dataset
Testing

Prediction and
Evaluation

Fig 7.1 Process flow diagram

1. Data Collection
o Gather datasets containing news articles (both real and fake).
o Sources include news websites, social media, and existing datasets.
2. Data Preprocessing
o Text Cleaning: Remove irrelevant characters, HTML tags, and special symbols.
o Tokenization: Split text into individual words or tokens.
o Stop-word Removal: Eliminate common words that do not contribute to meaning
(e.g., "and," "the").
o Lemmatization/Stemming: Reduce words to their base or root form.
3. Feature Extraction
o TF-IDF: Calculate Term Frequency-Inverse Document Frequency to represent text
data numerically.
o HashingTF: Optionally apply hashing techniques to create feature vectors.
4. Language Detection and Translation (if applicable)
o Detect language of the headlines and translate non-English text to a uniform language
(usually English).
5. Model Training
o Split the dataset into training and testing sets (e.g., 75% training, 25% testing).
o Train selected machine learning algorithms (e.g., Logistic Regression, SVM,
Decision Trees) on the training data.
6. Model Evaluation
o Test the trained models on the testing dataset.
o Evaluate performance using metrics like accuracy, F1 score, precision, and recall.
7. Cross-Dataset Testing
o Test the models on combined datasets to assess robustness in different scenarios.
8. Prediction
o Use the trained model to classify new, incoming news articles as real or fake.
9. Results Reporting
o Generate a report summarizing model performance, including confusion matrices and
evaluation metrics.
Chapter-8

CONCLUSION AND FUTURE SCOPE

The results indicate that machine learning classifiers, particularly SVM and Random Forest, can
achieve high accuracy when trained on well-curated datasets. However, machine learning
approaches are not without challenges, including the need for extensive feature engineering,
handling imbalanced datasets, and ensuring that models generalize well to new, unseen data.
Additionally, machine learning algorithms rely heavily on the availability of high-quality labeled
datasets, which can be a limitation when tackling rapidly evolving misinformation topics.

The genetic algorithm-based approach significantly improves the efficiency and performance of
fake news detection models. SVM and Random Forest achieved the highest accuracy across
different datasets, with SVM performing best overall. The results demonstrate the potential of
genetic algorithms to optimize machine learning classifiers, making them more effective for large-
scale fake news detection.

Future Work

Future work will focus on refining the genetic algorithm by increasing the number of features and
population size to further enhance detection accuracy. Additionally, the approach will be tested on
larger datasets to validate its scalability and generalization across different types of fake news. The
study will also explore the integration of deep learning techniques with the genetic algorithm to
achieve even higher performance.
REFERENCES

[1] “Towards COVID-19 fake news detection using transformer-based models”, Jawaher
Alghamdi, Yuqing Lin, Suhuai Luo, 13 May 2023
https://www.sciencedirect.com/science/article/pii/S0950705123003921

[2] “A novel approach to fake news detection in social networks using genetic algorithm
applying machine learning classifiers”, Deepjyoti Choudhury, Tapodhir Acharjee2, 23
April 2022, https://link.springer.com/article/10.1007/s11042-022-12788-1

[3] ”Multilingual deep learning framework for fake news detection using capsule neural
network”, Rami Mohawesh, Sumbal Maqsood,Qutaibah Althebyan , 9 May 2023 ,
https://link.springer.com/article/10.1007/s10844-023-00788-y

[4] “Fake news detection using dual BERT deep neural networks”, Mahmood
Farokhian,Vahid Rafe,Hadi Veisi, 16 October 2023 ,
https://link.springer.com/article/10.1007/s11042-023-17115-w

[5] ”Constructing a User-Centered Fake News Detection Model by Using Classification


Algorithms in Machine Learning Techniques”, MINJUNG PARK AND SANGMI CHAI, 12 July
2023 , https://ieeexplore.ieee.org/document/10179862

[6] “An Enhanced Fake News Detection System With Fuzzy Deep Learning”, CHENG XU
AND M-TAHAR KECHADI , 15 June 2024
,https://ieeexplore.ieee.org/document/10568915

[7] “Big Data ML-Based Fake News Detection Using Distributed Learning“,ALAA
ALTHENEYAN AND ASEEL ALHADLAQ , 20 March 2023 ,
https://ieeexplore.ieee.org/document/10078408.

[8] “Generalized Multilingual AI-Powered System for Detecting Fake News in India: A
Comparative Analysis of Machine Learning Algorithms”, M Sai Mahesh,Thanusri V,Deepak
K,, September 22,2024, https://typeset.io/papers/generalized-multilingual-ai-powered-
system-for-detecting-38i9u2rtu6

You might also like