
FAKE NEWS DETECTION USING MACHINE LEARNING


PROJECT REPORT
Submitted by

KALPANADEVI G

21CSR074

KARTHICK P
21CSR078

INDRA B
21CSL259
in partial fulfillment of the requirements

for the award of the degree


of

BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


KONGU ENGINEERING COLLEGE
(Autonomous)

PERUNDURAI, ERODE - 638 060


NOVEMBER 2023

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

KONGU ENGINEERING COLLEGE


(Autonomous)

PERUNDURAI, ERODE - 638 060

NOVEMBER 2023

BONAFIDE CERTIFICATE

This is to certify that the Project report entitled FAKE NEWS DETECTION USING MACHINE LEARNING is the bonafide record of project work done by KALPANADEVI G (Register No.: 21CSR074), KARTHICK P (Register No.: 21CSR078) and INDRA B (Register No.: 21CSL259) in partial fulfillment of the requirements for the award of the Degree of Bachelor of Engineering in Computer Science and Engineering of Anna University, Chennai during the year 2022 - 2023.

SUPERVISOR HEAD OF THE DEPARTMENT


(Signature with seal)

Date :

Submitted for the end semester viva voce examination held on _

INTERNAL EXAMINER EXTERNAL EXAMINER



DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


KONGU ENGINEERING COLLEGE
(Autonomous)

PERUNDURAI, ERODE - 638 060


NOVEMBER 2023

DECLARATION

We affirm that the Project Report titled FAKE NEWS DETECTION USING MACHINE

LEARNING being submitted in partial fulfillment of the requirements for the award of Bachelor

of Engineering is the original work carried out by us. It has not formed part of any other project

report or dissertation on the basis of which a degree or award was conferred on an earlier occasion

on this or any other candidate.

Date :
KALPANADEVI G
(Reg. No.:21CSR074)

KARTHICK P
(Reg. No.: 21CSR078)

INDRA B
(Reg. No.: 21CSL259)

I certify that the declaration made by the above candidates is true to the best of my knowledge.

Date : Name and Signature of the Supervisor with seal



ABSTRACT

Fake news is false or misleading information presented as news. It often aims to damage a person's or organisation's reputation or to generate money through advertising revenue. The constant circulation of fake news directly or indirectly produces a huge negative impact on the vast majority of society. Readers, however, have difficulty sensing highly ambiguous fake news, which can be detected only after identifying its meaning and the latest related information. Fake news detection involves analysing various data types, such as textual or media content and social context. Machine learning plays a crucial role in fake news detection and in analysing large amounts of misinformation data. ML is a data-driven approach that learns from labelled data to make predictions: its algorithms discover patterns and relationships in the input data, enabling the model to make predictions or classifications. Supervised learning involves training a model on a labelled dataset containing examples of real and fake news articles. We have used Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Naïve Bayes (NB) classifiers to detect fake news, and the system compares the accuracy of these classification techniques. The results show that fake news with textual content can indeed be classified; the Logistic Regression classifier achieved the best performance, with an accuracy of around 99%.



ACKNOWLEDGEMENT

We express our sincere thanks and gratitude to our beloved Correspondent [Link], [Link], M.B.A., LL.B., and all other philanthropic trust members of Kongu Vellalar Institute of Technology Trust, who have always encouraged us in academic and co-curricular activities.

We are extremely thankful, with no words of formal nature, to the dynamic Principal Dr. V. BALUSAMY, [Link]., Ph.D., for providing the necessary facilities to complete our work.

We would like to express our sincere gratitude to [Link], M.E., Ph.D., Professor and Head of the Department, for providing the necessary facilities.

We extend our thanks to [Link] [Link], Assistant Professor, Department of Computer Science and Engineering, Project Coordinator, for her encouragement and valuable advice that helped us carry out the project work successfully.

We extend our gratitude to our Supervisor [Link], M.E., Assistant Professor (SRG), Department of Computer Science and Engineering, for her valuable ideas and suggestions, which have been very helpful in the project. We are grateful to all the faculty members of the Department of Computer Science and Engineering for their support.



TABLE OF CONTENTS

CHAPTER   TITLE

          ABSTRACT
          LIST OF FIGURES
          LIST OF ABBREVIATIONS
1         INTRODUCTION
          1.1 MOTIVATION OF THE PROJECT
          1.2 OBJECTIVE OF THE PROPOSED WORK
2         LITERATURE REVIEW
3         SYSTEM REQUIREMENTS
          3.1 HARDWARE REQUIREMENTS
          3.2 SOFTWARE REQUIREMENTS
          3.3 SOFTWARE DESCRIPTION
              3.3.1 Python
              3.3.2 OpenCV
              3.3.3 Matplotlib
              3.3.4 TensorFlow
4         PROPOSED SYSTEM
          4.1 MACHINE LEARNING
              4.1.1 Machine Learning Techniques
              4.1.2 Neural Networks
          4.2 KEY TECHNOLOGIES
              4.2.1 Natural Language Processing
              4.2.2 Machine Learning Algorithms
              4.2.3 Feature Extraction Techniques
              4.2.4 Data Augmentation
          4.3 MODULE DESCRIPTION
              4.3.1 Dataset Description
              4.3.2 Dataset Collection
              4.3.3 Working of Model
              4.3.4 Working of Logistic Regression
          4.4 FLOW DIAGRAM OF WORKING MODEL
5         RESULTS AND DISCUSSION
          5.1 PERFORMANCE EVALUATION
          5.2 VALIDATION AND RESULTS
          5.3 MODEL COMPARISON
6         CONCLUSION AND FUTURE WORK
          APPENDIX 1 CODING
          APPENDIX 2 SCREENSHOT
          REFERENCES

LIST OF FIGURES

FIGURE No.   FIGURE NAME

4.1          Images of Dataset
4.2          Fake News Detection Architecture
4.3          Logistic Regression Architecture
4.4          Flow Diagram for Working Model
4.5          Comparison between Confusion Matrices
4.6          Confusion Matrix for Best Model

LIST OF ABBREVIATIONS

ML : Machine Learning

NLP : Natural Language Processing

SVM : Support Vector Machine

RF : Random Forest

NB : Naïve Bayes

TF-IDF : Term Frequency-Inverse Document Frequency

API : Application Programming Interface

GPL : General Public License

CSV : Comma Separated Values

GUI : Graphical User Interface




CHAPTER 1

INTRODUCTION

Fake news refers to false or misleading information presented as news with the intent to

deceive or manipulate readers. It often serves to harm reputations, spread misinformation, or

generate revenue through advertising clicks. The rise of fake news in the digital age, amplified by

social media and other platforms, has had widespread negative consequences, leading to

misinformation on important topics and influencing public opinion and behavior. Detecting fake

news is a challenging task as it requires distinguishing between factual and deceptive content, which

often involves subtle differences in text, meaning, and the context in which the news is shared. Fake

news detection involves analyzing large datasets containing a variety of data types such as text,

images, and even social context, to determine the credibility of news articles. Machine learning (ML)

has emerged as a powerful tool for detecting fake news. Through supervised learning, ML models

are trained on labeled datasets consisting of real and fake news articles. These models use algorithms

to identify patterns in the data that help distinguish between genuine and false news. Several

classification techniques are employed to achieve this, including Logistic Regression (LR), Support

Vector Machines (SVM), Random Forest (RF), and Naïve Bayes (NB). These algorithms allow the

system to analyze the news content and predict whether a news article is real or fake. In this project,

we aim to classify fake news using different machine learning models. By leveraging these

techniques, we achieved a high accuracy of around 99%, showcasing the effectiveness of machine

learning in detecting misleading content.


1.1 MOTIVATION OF THE PROJECT

The increasing prevalence of fake news, particularly in the digital age, has caused significant

societal, political, and economic harm. Misinformation spreads rapidly on social media platforms, often

going viral before fact-checkers or legitimate sources can intervene. This not only misleads individuals

but also erodes public trust in reliable news sources. The motivation for this project stems from the

urgent need to address the growing threat of fake news and its impact on society. One of the main drivers

behind this project is the difficulty that people face in distinguishing fake news from real news,

especially when fake news is crafted to look authentic. Traditional methods of verifying information

manually are often time-consuming and insufficient given the sheer volume of content generated online.

Automated detection systems powered by machine learning offer a scalable solution that can analyze

large datasets and classify news articles quickly and accurately. This project also aims to highlight the

role machine learning plays in detecting fake news. By applying machine learning models to this

problem, we can explore how data-driven algorithms can be used to identify patterns, relationships, and

key indicators that separate real news from fake news. This can empower users and platforms with tools

to automatically flag or filter out misleading content. Ultimately, the success of this project has a broader

societal impact. Detecting and reducing the spread of fake news can contribute to a more informed

public, reduce the negative influence of misinformation on elections, public health, and other critical

areas, and restore trust in credible journalism. By achieving high accuracy with multiple machine

learning algorithms, this project demonstrates the potential of technology to counter misinformation at scale.



1.2 OBJECTIVE OF THE PROPOSED WORK

The main objectives of the proposed work are:

• To design and implement a system capable of automatically identifying and classifying news

articles as fake or real, thus reducing the manual effort required to fact-check information.

• To apply various machine learning algorithms, such as Logistic Regression, Support Vector

Machines (SVM), Random Forest, and Naive Bayes, to build robust classifiers for detecting

fake news based on patterns learned from labeled datasets.

• To optimize the detection process by using advanced feature extraction techniques like Term

Frequency-Inverse Document Frequency (TF-IDF) to transform textual data into meaningful

representations that machine learning algorithms can effectively process.

• To evaluate the performance of different machine learning models and compare their accuracy,

precision, recall, and F1 scores in classifying fake and real news. This helps identify the most

effective model for real-world applications.

• To provide a tool that contributes to the larger goal of controlling the spread of misinformation

and its adverse effects on society by enabling quick and accurate detection of fake news.

• To ensure that the model can handle large datasets efficiently, making it scalable and

applicable to real-time detection scenarios, such as social media platforms and news

aggregators.

By achieving these objectives, this project aims to develop a high-accuracy fake news detection system

that can help combat the spread of misinformation and contribute to a more informed and trustworthy media

environment.

CHAPTER 2

LITERATURE REVIEW

The detection of fake news has become an essential area of study due to the increasing spread of

misinformation on social media. Early research primarily focused on analyzing the textual content of

news articles, using techniques like Bag of Words (BoW) and Term Frequency-Inverse Document

Frequency (TF-IDF) to represent text in a machine-readable format. Castillo et al. (2011) explored

linguistic features to classify rumors, marking one of the first attempts to automate this process. Wang

(2017) introduced the "LIAR" dataset, a widely used benchmark that emphasized the role of text analysis

in fake news detection. Machine learning algorithms such as Logistic Regression (LR), Support Vector

Machines (SVM), and Random Forest (RF) have been employed for classification tasks, leveraging

textual features to distinguish between real and fake news. More recently, deep learning models like

Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks have shown

significant improvements in capturing contextual and semantic nuances in text. Hybrid approaches that

combine text-based features with user behavior and social context data, as suggested by Shu et al. (2017),

have further improved detection accuracy. Despite these advancements, challenges remain, such as

generalizing models across domains and handling evolving misinformation patterns. Future research aims to

address these issues by improving real-time detection capabilities.



CHAPTER 3

SYSTEM REQUIREMENTS

3.1 HARDWARE REQUIREMENTS

Processor : Intel i7
Processor Speed : 2.21 GHz
Hard Disk : 475 GB
RAM : 8.00 GB

3.2 SOFTWARE REQUIREMENTS

Language : Python 3.x
Operating System : Windows 11
Software : PyCharm, Pandas, TensorFlow, Seaborn, NumPy, Scikit-learn, Matplotlib, Time, OpenCV

3.3 SOFTWARE DESCRIPTION

PYTHON

Python is an interactive, object-oriented, interpreted, high-level programming language. Its source code is available under the GPL. It provides constructs that enable clear programming on both small and large scales. Python features dynamic typing and automatic memory management, supports multiple programming paradigms, including imperative, object-oriented, functional, and procedural programming, and has a large and comprehensive standard library.

Python is open-source software and has a community-based development model. Python allows programmers to build their own types using classes, which are most often used for object-oriented programming. Python is managed by the Python Software Foundation.

The features of Python include the following:

• Easy to learn: Python has simple keywords, a simple structure, and a clearly defined syntax, which allows a student to pick up the language quickly.

• Easy to read: Python code is clearly defined and readable.

• Easy to maintain: Python source code is fairly easy to maintain.

• A broad standard library: Python ships with a large, portable standard library that is cross-platform compatible.

• Interactive mode: Python supports an interactive mode that allows interactive testing and debugging of snippets of code.

• Portable: Python runs on a wide variety of hardware platforms and presents the same interface on all of them.

• Databases: Python provides interfaces to all major commercial databases.

• GUI programming: Python supports GUI applications that can be created and ported to many system calls, libraries, and window systems, such as Windows MFC, Macintosh, and the X Window System of Unix.

• Scalable: Python provides better structure and support for large programs than shell scripting. Libraries such as NumPy, a highly optimized library for numerical operations, make large numerical tasks easier.



NUMPY:

NumPy is a library for the Python programming language that provides tools for working with

large, multi-dimensional arrays and matrices. It also provides a variety of mathematical functions

for working with these arrays, including linear algebra, Fourier analysis, and random number

generation.

MATPLOTLIB:

Matplotlib is a plotting library for the Python programming language. It provides a variety

of plotting functions and tools for creating visualizations of data, including line plots, scatter plots,

bar plots, and histograms.

TIME:

Time is a module in the Python standard library that provides tools for working with time-

related functions. It can be used for measuring the performance of code, calculating time intervals,

and other time-related tasks.

OPENCV:

OpenCV supports various programming languages such as C++, Python, and Java, and is

available on various platforms such as Windows, Linux, OS X, Android, and iOS. Interfaces for

accelerated GPU operations based on CUDA and OpenCL are also actively developed. OpenCV

Python is a Python API for OpenCV that combines the best features of the OpenCV C++ API and the Python language. OpenCV-Python uses NumPy, a highly optimized library for numerical

operations with MATLAB-style syntax. All OpenCV array structures are converted to and from

numpy arrays. This also makes it easier to integrate with other libraries that use Numpy, such as

SciPy and Matplotlib.

OpenCV's application areas include:

Image Segmentation

Feature Extraction

Object Detection

Pattern Recognition

Visualization

TENSORFLOW:

TensorFlow is an open-source machine learning framework developed by Google. It

provides tools and libraries for building and training machine learning models, including deep neural

networks. TensorFlow is known for its ease of use, scalability, and flexibility.

PYTORCH:

PyTorch is an open-source machine learning framework developed by Facebook. It is based

on the Torch library and provides a dynamic computational graph, making it easier to debug and

optimize machine learning models. PyTorch is known for its simplicity and flexibility.

KERAS:

Keras is a high-level neural networks API, written in Python and capable of running on top

of TensorFlow, Theano, or CNTK. It provides a user-friendly interface for building and training

deep learning models.

EFFICIENTNET:

EfficientNet is a series of convolutional neural networks (CNNs) that were designed to provide

an optimal balance between model size (number of parameters) and model performance. These

models are known for their efficiency, meaning they achieve high accuracy while requiring fewer

computational resources compared to other architectures like VGG, ResNet, or Inception.


CHAPTER 4

PROPOSED SYSTEM

The proposed system for fake news detection employs machine learning techniques to analyze

and classify news articles as either real or fake. It will utilize classifiers such as Logistic Regression,

Support Vector Machines (SVM), Random Forest, and Naive Bayes, trained on a large and diverse

dataset of labeled articles to ensure high accuracy in distinguishing factual reporting from

misinformation. The system will incorporate data collection and preprocessing, feature extraction

using methods like TF-IDF, and comprehensive model training and evaluation to optimize

performance. A user-friendly interface will allow users to input articles for real-time analysis,

providing feedback on their authenticity along with confidence scores. Additionally, the system will

aim for seamless integration with existing news platforms to automatically flag potentially

misleading content. Ultimately, this project seeks to combat misinformation and enhance public

discernment of credible news sources, contributing to a more informed society.

4.1 MACHINE LEARNING:

Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on developing

algorithms and statistical models that enable computers to perform tasks without explicit

programming. It involves training a model on data to recognize patterns, make predictions, or

classify information. Here’s an overview of the main techniques used in machine learning:

4.1.1 MACHINE LEARNING TECHNIQUES:

Some of the various machine learning techniques are:

1. Supervised Learning

2. Unsupervised Learning

3. Reinforcement Learning
4.1.2 NEURAL NETWORKS:

Neural networks provide the ability to perform tasks such as classification and clustering. They are a set of algorithms that mimic the way the human brain recognizes relationships between data. In the brain, neurons are information carriers: they use electrical impulses and chemical signals to communicate information between themselves and other parts of the brain. A neural network here is a set of artificial neurons organized in layers.

Each artificial neuron (also called a node or unit) is a mathematical operation that takes inputs, multiplies them by weights, and passes the weighted sum through an activation function to the neurons of the next layer.

A neural network consists of three layers:

1. Input Layer

2. Hidden Layer

3. Output Layer

The input layer is responsible for getting input into the system for further processing in subsequent layers. These inputs can be read as vectors or from CSV files. Only one input layer can exist in the network.

The hidden layer sits between the model's input and output layers; it applies weights to its input values, passes them through an activation function, and forwards the results to subsequent layers. Hidden layers are very common in neural networks, but their usage and architecture often vary from case to case. Weights are initially assigned at random and are then tweaked and adjusted by the backpropagation process during training.


The output layer of a neural network is the final layer of neurons, responsible for generating the output. It takes input from the previous layer, performs its computations, and produces the final result. As with the input layer, there can be only one output layer, with any number of neurons.
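As an illustration of this three-layer structure, the following minimal sketch builds a tiny feed-forward network with Keras (one of the libraries described in Chapter 3). The input width of 5000 (e.g., the size of a TF-IDF vocabulary) and the layer sizes are illustrative assumptions, not values fixed by this project.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(5000,)),            # input layer: one feature vector per article
    Dense(64, activation='relu'),    # hidden layer: weighted sums + activation
    Dense(1, activation='sigmoid'),  # output layer: probability the article is fake
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()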

4.2 KEY TECHNOLOGIES:

4.2.1 NATURAL LANGUAGE PROCESSING:

NLP is a critical technology for understanding and processing human language. In the context of fake news detection, NLP techniques are used to analyze text data, extract features, and interpret the semantics of content. Common preprocessing steps include:

• Tokenization: splitting text into individual words or phrases for analysis.

• Stop word removal: eliminating common words that do not contribute to the meaning (e.g., "and," "the").

• Stemming and lemmatization: reducing words to their base or root form to unify different variations of the same word.
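A minimal sketch of these three steps, assuming NLTK is installed and its 'punkt' and 'stopwords' resources have been downloaded; the sample sentence and printed output are illustrative:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')       # tokenizer model
nltk.download('stopwords')   # stop-word list

text = "The markets reacted sharply to the fabricated report."
tokens = word_tokenize(text.lower())                                  # tokenization
tokens = [t for t in tokens if t.isalpha()]                           # drop punctuation
tokens = [t for t in tokens if t not in stopwords.words('english')]   # stop word removal
stems = [PorterStemmer().stem(t) for t in tokens]                     # stemming
print(stems)   # e.g. ['market', 'react', 'sharpli', 'fabric', 'report']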

4.2.2 MACHINE LEARNING ALGORITHMS:

Machine learning techniques are fundamental for building classifiers that can differentiate

between real and fake news. Common algorithms used include:

• Logistic Regression: A simple and effective classification algorithm for binary outcomes.

• Support Vector Machines (SVM): Effective for high-dimensional data and text classification.

• Random Forest: An ensemble method that combines multiple decision trees to improve

accuracy.

• Naive Bayes: Based on Bayes' theorem, this algorithm is particularly useful for text

classification.
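A hedged sketch of how these four classifiers can be trained and compared with scikit-learn. X_train, X_test, y_train, y_test are assumed to be TF-IDF features and labels prepared as described later; MultinomialNB is used here because it works directly on sparse TF-IDF matrices (the appendix code uses GaussianNB on dense arrays instead):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (linear kernel)': SVC(kernel='linear'),
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': MultinomialNB(),   # suits sparse TF-IDF features
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                # train on labeled articles
    print(name, clf.score(X_test, y_test))   # mean accuracy on held-out data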
4.2.3 FEATURE EXTRACTION TECHNIQUES:

Extracting meaningful features from text data is essential for improving model performance. Common techniques include:

• TF-IDF (Term Frequency-Inverse Document Frequency): a statistical measure used to evaluate the importance of a word in a document relative to a corpus.

• Word Embeddings: techniques like Word2Vec or GloVe transform words into dense vector representations that capture semantic meaning.

• BERT (Bidirectional Encoder Representations from Transformers): a pre-trained transformer model that captures contextual information from text and can be fine-tuned for specific tasks.
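A minimal sketch of TF-IDF vectorization with scikit-learn; the two sample documents and the max_features cap are illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["breaking news about the election results",
        "shocking secret the media will not tell you"]
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(docs)           # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())    # vocabulary learned from the corpus
print(X.shape)                               # (2, number_of_terms)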

4.2.4 DATA AUGMENTATION:

Data Augmentation is a technique used to increase the size of the training dataset by applying

random transformations to the data. This can improve the robustness and generalization of the

model, as it learns to recognize objects under different conditions. Common data augmentation

techniques for image data include random scaling, cropping, and flipping, which can help a model learn to recognize objects at different scales and orientations.
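The transforms above are image-oriented; for the textual data in this project a rough analogue is random word dropout. The helper below is a hypothetical illustration, not part of the report's pipeline:

import random

def word_dropout(text, p=0.1, seed=None):
    # Randomly drop a fraction p of the words, producing a perturbed
    # copy of the article for augmentation (illustrative only).
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return ' '.join(kept) if kept else text

print(word_dropout("the senator denied the fabricated claims entirely", p=0.2, seed=1))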

4.3 MODULE DESCRIPTION:

1) Introduction to Fake News Detection

Prevalence and impact

Types of Fake News

Significance of early detection


2) Neural networks and their components

Convolutional neural networks (CNNs) for text analysis

Recurrent neural networks (RNNs) for sequential data

Data Processing and Preparation

4.3.1 DATASET DESCRIPTION:

The dataset used for the fake news detection project consists of two primary categories: real

news and fake news, encompassing a total of 44,898 articles. The real news category contains 21,417

articles sourced from reputable news organizations, covering subjects such as world news and

politics, which ensures a comprehensive representation of legitimate reporting. In contrast, the fake

news category comprises 23,481 articles spanning topics such as government news, Middle East news, and left-leaning news, reflecting the diverse nature of misinformation present in online

platforms. Each article in the dataset is structured with key attributes: the title, the full text of the

article, the subject category, and a binary label indicating whether the article is real (0) or fake (1).

The articles have been meticulously curated and annotated to maintain high quality and accuracy,

with human annotators verifying each piece to ensure reliability. This rich dataset is instrumental in

training machine learning models to detect fake news by enabling them to learn from the linguistic

patterns and contextual features within the text, ultimately contributing to more effective

identification of misinformation in the digital age.
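A small sketch of loading and inspecting the dataset with pandas; the filenames True.csv and Fake.csv are assumptions based on the variable names used in Appendix 1, and the labels follow the convention above (real = 0, fake = 1):

import pandas as pd

true_df = pd.read_csv('True.csv')    # 21,417 real articles expected
fake_df = pd.read_csv('Fake.csv')    # 23,481 fake articles expected
true_df['label'] = 0                 # real = 0
fake_df['label'] = 1                 # fake = 1
data = pd.concat([true_df, fake_df], ignore_index=True)
print(data[['title', 'text', 'subject', 'label']].head())
print(data['label'].value_counts())  # class balance check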

4.3.2 DATASET COLLECTION:

The dataset for the fake news detection project was collected from a variety of reputable sources

to ensure diversity and relevance in the training data. For the real news category, articles were sourced

from established news organizations, online publications, and trusted journalism websites, which

provide accurate and credible information on various topics such as politics, world events, and local

news. In contrast, the fake news category was compiled from known misinformation websites, social
media platforms, and content specifically labeled as unreliable or misleading by fact-checking

organizations. This collection process involved scraping publicly available articles, as well as

leveraging datasets from previous research efforts dedicated to fake news detection. Each article was

carefully vetted for authenticity and labeled accordingly, creating a balanced dataset that includes both

real and fake news examples. This approach not only enhances the dataset's quality but also ensures

that the model can learn to identify nuanced differences in writing styles, factual accuracy, and

contextual cues that differentiate genuine news from fabricated stories.

Figure 4.1 Images of Dataset

4.3.3 WORKING OF MODEL:

The working of fake news detection involves several key processes that leverage machine
learning and natural language processing techniques to identify and classify news articles as either
real or fake. Here’s an overview of how the system operates:

1. Data Collection:

The process begins with the collection of a dataset that includes both real and fake news
articles. This dataset typically consists of various features such as titles, article texts, subjects, and
labels indicating the authenticity of each article.

2. Data Preprocessing:

The collected data undergoes preprocessing to clean and prepare it for analysis. This step
involves removing unnecessary elements such as HTML tags, special characters, and stop words,
as well as converting the text to a uniform format (e.g., lowercasing). Additionally, the text is often
tokenized, where sentences are broken down into individual words or phrases for further analysis.
3. Feature Extraction:

After preprocessing, relevant features are extracted from the text data. Techniques such as
Term Frequency-Inverse Document Frequency (TF-IDF) or word embeddings (like Word2Vec or
GloVe) are commonly used to convert textual data into numerical representations that machine
learning models can understand. This process captures the importance and context of words within
the articles.

4. Model Training:

The processed data is then split into training and testing sets. Various machine learning
algorithms, such as Logistic Regression, Support Vector Machines (SVM), Random Forests, and
Neural Networks, are trained on the training set. During this phase, the model learns to identify
patterns and relationships between the features and their corresponding labels (real or fake).

5. Model Evaluation:

After training, the model's performance is evaluated using the testing set. Metrics such as
accuracy, precision, recall, and F1-score are calculated to assess how well the model can correctly
classify unseen data. This evaluation helps to identify any weaknesses or biases in the model.

6. Prediction:

Once the model is deemed effective, it can be used to predict the authenticity of new, unseen
articles. When a new article is input into the system, it undergoes the same preprocessing and feature
extraction steps before being classified by the trained model.

7. Output and Reporting:

The system provides an output indicating whether the article is classified as real or fake, often
with a confidence score. This information can be further enhanced with additional insights, such as
key features that contributed to the classification decision.

8. Continuous Improvement:

As new articles are published and more data becomes available, the model can be updated
and retrained to adapt to evolving patterns of misinformation, ensuring its accuracy and
effectiveness over time.

By combining these processes, the fake news detection system can efficiently analyze large
volumes of content and help users identify misleading or false information in the digital landscape.
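The eight steps above can be condensed into a short scikit-learn pipeline. This is a hedged end-to-end sketch rather than the exact project code; texts and labels are assumed to be the prepared article strings and their 0/1 labels:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)  # step 1: split collected data

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),  # steps 2-3: preprocess and vectorize
    ('clf', LogisticRegression(max_iter=1000)),        # step 4: train the classifier
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))                  # step 5: evaluate on unseen articles
print(pipeline.predict(["Celebrity endorses miracle cure, doctors stunned"]))  # step 6: predict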
Figure 4.2 Fake News Detection Architecture

4.3.4 WORKING OF LOGISTIC REGRESSION:

Logistic regression operates as a statistical method used for binary classification, determining
the probability that a given input belongs to one of two categories. Initially, the model takes input
features and computes a linear combination of these features, applying weights to each input. The
key component of logistic regression is the logistic (or sigmoid) function, which transforms this
linear output into a probability value ranging between 0 and 1. This transformation allows the model
to express the likelihood that an instance belongs to the positive class, such as identifying a news
article as fake.

During the training phase, the model adjusts the weights assigned to each feature in order to
minimize the error between the predicted probabilities and the actual class labels in the training data.
This is typically achieved through an optimization process that uses algorithms like gradient descent,
iteratively updating the weights to reduce the loss function, which measures prediction accuracy.
Once trained, the logistic regression model can classify new instances by calculating the predicted
probability and applying a threshold (commonly set at 0.5) to assign a class label. For instance, if
the probability is above the threshold, the model predicts the instance as fake news; otherwise, it
classifies it as real news.

The performance of logistic regression is evaluated using metrics such as accuracy, precision,
recall, and F1-score, which help assess how well the model performs in distinguishing between the
two classes. Logistic regression is favored for its simplicity and interpretability, allowing users to
understand the impact of different features on the classification outcome. However, it may struggle
with complex relationships in the data, where more sophisticated models might be needed. Overall,
logistic regression remains a foundational technique in machine learning, particularly for tasks like
fake news detection, due to its effectiveness and ease of use.
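The decision rule described above amounts to applying the sigmoid function to a weighted sum of the features and thresholding at 0.5. A minimal numeric sketch, where the weights, bias, and feature values are illustrative rather than learned:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real number into (0, 1)

w = np.array([0.8, -1.2, 0.5])        # weights learned during training (illustrative)
b = -0.1                              # bias term (illustrative)
x = np.array([1.0, 0.3, 2.0])         # feature vector for one article

p_fake = sigmoid(np.dot(w, x) + b)    # probability the article is fake
label = 1 if p_fake >= 0.5 else 0     # threshold at 0.5
print(p_fake, label)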

Figure 4.3 Logistic Regression Architecture


4.4 FLOW DIAGRAM OF WORKING MODEL

The workflow model for Fake news detection is depicted below:

Figure 4.4 Flow Diagram for Working Model


CHAPTER 5

RESULTS AND DISCUSSION

5.1 PERFORMANCE EVALUATION

Performance evaluation is a crucial aspect of any machine learning model, including those

used for fake news detection, as it determines how well the model performs in classifying new,

unseen data. Several metrics are commonly employed to assess the effectiveness of the model,

providing insights into its accuracy and reliability. The most straightforward metric is accuracy,

which measures the proportion of correctly classified instances out of the total number of

instances. However, accuracy alone may be misleading, especially in cases of imbalanced

datasets where one class significantly outnumbers the other. To gain a more nuanced

understanding of the model's performance, additional metrics such as precision, recall, and F1-

score are utilized. Precision calculates the ratio of true positive predictions to the total predicted

positives, indicating the model's accuracy when it predicts an article as fake. Recall, on the other

hand, measures the ratio of true positives to the actual positives in the dataset, providing insights

into the model's ability to identify all relevant instances of fake news. The F1-score, which is the

harmonic mean of precision and recall, offers a single metric that balances both, making it

particularly useful when the classes are imbalanced. Additionally, confusion matrices are

employed to visualize the model's performance, showcasing the counts of true positives, true

negatives, false positives, and false negatives. This visualization helps identify specific areas

where the model excels or struggles, guiding further improvements. ROC (Receiver Operating

Characteristic) curves and AUC (Area Under the Curve) are also valuable tools for evaluating the

model's discriminatory power across various threshold settings, illustrating the trade-offs between

sensitivity and specificity.
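A small sketch of computing these metrics with scikit-learn; the label vectors are illustrative (1 = fake, 0 = real):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]     # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]     # model predictions (illustrative)

print(accuracy_score(y_true, y_pred))    # correct predictions / total
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]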


5.2 VALIDATION AND RESULTS

Validation and results are critical components in the development and assessment of

fake news detection models. Validation refers to the process of evaluating the model's

performance on a separate dataset that was not used during the training phase. This ensures

that the model can generalize well to new, unseen data and helps prevent overfitting, where

a model performs exceptionally well on training data but fails to perform similarly on new

data. Common validation techniques include k-fold cross-validation, where the dataset is

divided into k subsets, and the model is trained and tested k times, each time using a different

subset as the test set while the others serve as the training set. This method provides a robust

estimate of the model's performance by averaging the results over multiple folds. Once

validation is completed, the results are analyzed to determine how effectively the model

classifies news articles as real or fake. The evaluation metrics—such as accuracy, precision,

recall, and F1-score—offer a detailed understanding of the model's strengths and

weaknesses. For instance, a high accuracy rate might indicate good overall performance, but

low precision or recall could suggest that the model struggles with false positives or

negatives. Additionally, visual tools such as confusion matrices provide insights into

specific classification errors, allowing for targeted improvements in the model. Furthermore,

results from validation can guide further refinement of the model. Based on performance

insights, adjustments can be made to the feature set, hyperparameters, or even the choice of

algorithm to enhance accuracy and reliability. Ultimately, thorough validation and analysis

of results are essential for ensuring that the fake news detection model is not only effective

but also trustworthy, contributing to more informed and responsible dissemination of

information in today's media landscape.
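A hedged sketch of k-fold cross-validation with scikit-learn (k = 5 here); X and y are assumed to be the TF-IDF feature matrix and labels prepared in Chapter 4:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring='accuracy')   # train/test on 5 different folds
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of generalization performance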


5.3 MODEL COMPARISON:

Figure 4.5 Comparison between Confusion Matrices

CONFUSION MATRIX:

Figure 4.6 Confusion Matrix for Best Model


CHAPTER 6

CONCLUSION AND FUTURE WORK

In conclusion, the development of a robust fake news detection model leveraging machine

learning techniques represents a significant step toward addressing the pervasive issue of misinformation

in today’s digital landscape. Through the application of various algorithms, such as Logistic Regression,

Support Vector Machines, and Random Forests, the model demonstrated a high degree of accuracy and

reliability in classifying news articles as real or fake. The validation process revealed the model's ability

to generalize well to unseen data, ensuring that it can effectively assist in the identification of misleading

information. Furthermore, the use of diverse evaluation metrics provided valuable insights into the

model's performance, highlighting areas for improvement and confirming its practical applicability in

real-world settings. Looking ahead, future work will focus on enhancing the model's capabilities

through several key avenues. One area of improvement involves expanding the dataset to include a

broader range of news sources and topics, which will help the model better understand the nuances of

language and context in different types of articles. Additionally, exploring advanced deep learning

techniques, such as neural networks, could yield even more accurate predictions by capturing complex

patterns in the data. Incorporating user feedback mechanisms may also refine the model further, enabling

continuous learning and adaptation to evolving trends in misinformation. Lastly, addressing ethical

considerations around bias and fairness in model predictions will be paramount, ensuring that the tool is

equitable and effective across diverse populations and news genres. Through these efforts, the goal is to

develop a more sophisticated and impactful fake news detection system that can play a vital role in

fostering a more informed society.


APPENDIX 1

CODING:

import pandas as pd
import numpy as np
from google.colab import drive

drive.mount('/content/drive')

# Load the two CSV files with the 'python' engine, skipping malformed lines.
# The filenames are assumed from the variable names; adjust to the actual files.
true_data = pd.read_csv('/content/True.csv', engine='python', on_bad_lines='skip')
fake_data = pd.read_csv('/content/Fake.csv', engine='python', on_bad_lines='skip')

# Inspect the raw data.
print(true_data[0:10])
print(fake_data[0:4])
print(true_data['text'][0])
print(true_data.columns)
print(fake_data['text'][0])
print(fake_data.columns)
print(len(true_data), len(fake_data))

# Label the articles: fake = 1, real = 0.
fake_data['label'] = 1
true_data['label'] = 0
print(true_data.head())
print(fake_data.head())

# Combine the two sets and shuffle the rows.
all_data = pd.concat([fake_data, true_data])
random_permutation = np.random.permutation(len(all_data))
all_data = all_data.iloc[random_permutation]
print(all_data.columns)
print(all_data.head())

# Keep only the columns used for training.
filtered_data = all_data.loc[:, ['title', 'text', 'subject', 'label']]
print(filtered_data.head())
print(filtered_data.isnull().sum())

# Concatenate title, body text, and subject into a single training feature.
filtered_data['training_feature'] = (filtered_data['title'] + ' '
                                     + filtered_data['text'] + ' '
                                     + filtered_data['subject'])
print(filtered_data.head())

X = filtered_data['training_feature'].values
y = filtered_data['label']

# A smaller 1000-article subset for the slower classifiers.
l_X = filtered_data['training_feature'].values[0:1000]
l_Y = filtered_data['label'].values[0:1000]
print(l_X.shape, l_Y.shape)
print(type(l_X))
print(X[0:1])

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Convert the raw text into TF-IDF feature matrices.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)
l_vectorizer = TfidfVectorizer()
l_X = l_vectorizer.fit_transform(l_X)
print(type(X), X.shape)
print(type(l_X), l_X.shape)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
l_X_train, l_X_test, l_Y_train, l_Y_test = train_test_split(
    l_X, l_Y, test_size=0.2, random_state=42)
print(X_train.shape)

# Logistic Regression on the full dataset.
model = LogisticRegression()
model.fit(X_train, Y_train)
test_y_hat = model.predict(X_test)
print(accuracy_score(Y_test, test_y_hat))      # accuracy on the test set
train_y_hat = model.predict(X_train)
print(accuracy_score(Y_train, train_y_hat))    # accuracy on the training set

# Support Vector Machine classifiers on the smaller subset.
from sklearn import svm
model = svm.SVC(kernel='linear')               # linear kernel
clf_poly = svm.SVC(kernel='poly')              # polynomial kernel (not evaluated below)
model.fit(l_X_train, l_Y_train)
y_pred_svm = model.predict(l_X_test)
print(accuracy_score(l_Y_test, y_pred_svm))
print(precision_score(l_Y_test, y_pred_svm))
print(recall_score(l_Y_test, y_pred_svm))
print(f1_score(l_Y_test, y_pred_svm))

# Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(l_X_train, l_Y_train)
y_pred_rf = model.predict(l_X_test)
print(accuracy_score(l_Y_test, y_pred_rf))
print(precision_score(l_Y_test, y_pred_rf))
print(recall_score(l_Y_test, y_pred_rf))
print(f1_score(l_Y_test, y_pred_rf))

# Gaussian Naive Bayes (requires dense arrays).
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(l_X_train.toarray(), l_Y_train)
y_pred_nb = model.predict(l_X_test.toarray())
print(accuracy_score(l_Y_test, y_pred_nb))
print(precision_score(l_Y_test, y_pred_nb))
print(recall_score(l_Y_test, y_pred_nb))
print(f1_score(l_Y_test, y_pred_nb))

import matplotlib.pyplot as plt
import seaborn as sns

def plot_confusion(cm, title):
    # Class 0 is real ('True') and class 1 is fake, matching the label
    # order that confusion_matrix uses for its rows and columns.
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['True', 'Fake'], yticklabels=['True', 'Fake'])
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title(title)
    plt.show()

# Each model's own predictions are used for its confusion matrix.
plot_confusion(confusion_matrix(Y_test, test_y_hat),
               'Confusion Matrix for Logistic Regression')
plot_confusion(confusion_matrix(l_Y_test, y_pred_svm),
               'Confusion Matrix for SVM (Linear Kernel)')
plot_confusion(confusion_matrix(l_Y_test, y_pred_rf),
               'Confusion Matrix for Random Forest')
APPENDIX 2

SCREENSHOT
REFERENCES

[1] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu, "Fake News Detection on Social Media: A Data Mining Perspective," arXiv:1708.01967v3, 3 September 2017.

[2] Z. Khanam, B. N. Alwasel, H. Sirafi, and M. Rashid, "Fake News Detection Using Machine Learning Approaches," Journal of Physics: Conference Series, 1099(1), 012040.

[3] Shalini Pandey, Sankeerthi Prabhakaran, N. V. Subba Reddy, and Dinesh Acharya, "Fake News Detection from Online Media Using Machine Learning Classifiers," Journal of Physics: Conference Series, 2161(1), 012027.

[4] H. Ahmed, I. Traore, and S. Saad, "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques," in I. Traore, I. Woungang, and A. Awad (eds.), Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments (ISDDC 2017), Lecture Notes in Computer Science, vol. 10618, Springer, Cham, 2017, pp. 127-138.

[5] J. A. Nasir, O. S. Khan, and I. Varlamis, "Fake News Detection: A Hybrid CNN-RNN Based Deep Learning Approach," International Journal of Information Management Data Insights, 1(1), Article 100007, 2021.

[6] M. D. Ibrishimova and K. F. Li, "A Machine Learning Approach to Fake News Detection Using Knowledge Verification and Natural Language Processing," Advances in Intelligent Networking and Collaborative Systems, 2019, pp. 223-234.

[7] J. Y. Khan, M. T. Khondaker, S. Afroz, G. Uddin, and A. Iqbal, "A Benchmark Study of Machine Learning Models for Online Fake News Detection," Machine Learning with Applications, 4, Article 100032, 2021.

[8] S. Kumar, R. Asthana, S. Upadhyay, N. Upreti, and M. Akbar, "Fake News Detection Using Deep Learning Models: A Novel Approach," Transactions on Emerging Telecommunications Technologies, 31(2), e3767, 2020.
