ijhighschoolresearch.org
RESEARCH ARTICLE
A Comparative Analysis of Fake News Classifier Models
Nithin Sivakumar
Flower Mound High School, 3113 Sheryl Drive, Flower Mound, Texas, 75022, USA; [Link]@[Link]
Mentor: Jim Shepich
ABSTRACT: In this study, we trained three different machine learning algorithms on publicly accessible corpora of news articles labeled as real or fake and compared the models' performance in detecting which articles were real and which were fake. We compared a naïve Bayes classifier, a support vector classifier, and a random forest classifier using both cross-validation and external validation. We hypothesized that the random forest model would perform better because of its theoretically less biased approach. Under cross-validation, the naïve Bayes and random forest models performed significantly better (sensitivity and accuracy in the 0.894–0.921 range) than the support vector classifier (sensitivity and accuracy around 0.78). Our hypothesis that the random forest model would yield the optimal outcomes was not supported: under external validation, all three models performed about the same and experienced high levels of overfitting.
KEYWORDS: Robotics and Intelligent Machines; Machine Learning; Data Science; Fake News; Classification.
Introduction
The proliferation of social media has dramatically expanded the reach and influence of news outlets and information sources. A 2018 study by the Pew Research Center found that 62% of US adults get news on social media, up from 49% in 2012.¹ However, amidst the multitude of voices, some sources resort to spreading false or misleading information to advance their own agendas. A 2020 study by Stanford University found that false information was present in 76% of tweets related to the Covid-19 pandemic.² This phenomenon, known as fake news, undermines people's ability to make informed decisions and respond to the world around them.³ The problem is further exacerbated by social media platforms, mainly those popular among younger generations, which can make children especially vulnerable to being deceived by fake news. A 2019 study by the University of Oxford found that 75% of 11–18-year-olds in the UK had encountered fake news in the past year and that 26% of them believed the news they had seen to be true.⁴ Therefore, this project aims to design and test proactive measures to detect and halt the spread of fake news to safeguard the public.
To approach the task of detecting fake news, we will utilize supervised machine learning, which uses a set of training data to form predictions for new data. Machine learning (ML) will help us detect and discriminate between fake and real news. More specifically, this study includes three different ML models: Gaussian naïve Bayes, support vector machine (SVM), and random forest. Gaussian naïve Bayes assumes conditional independence between its features, treating them as statistically independent variables, and then fits a Gaussian distribution over each feature of the training set to calculate the terms used in Bayes' rule (specifically the likelihood and the normalizer; the priors are based on the proportion of each class in the training set). SVM builds a linear classification model: it sorts the input data into two groups by drawing a hyperplane that separates them based on patterns in a vector space, and new values are then classified according to their position on either side of the hyperplane.⁵ Finally, a random forest classifier constructs an ensemble of decision trees (usually limited to a very shallow depth so that they act as weak classifiers) and lets all the trees vote on which class to assign.⁶ This methodology stems from decision trees, which are branching models split according to rules that divide the input data for classification.
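To make the naïve Bayes description concrete, one standard way to write the class probability it computes is

    P(y = k \mid \mathbf{x}) = \frac{P(y = k)\,\prod_{j}\mathcal{N}\!\left(x_j \mid \mu_{k,j}, \sigma_{k,j}^{2}\right)}{P(\mathbf{x})}

where P(y = k) is the prior taken from the class proportions in the training set, the product over features j is the likelihood under the conditional-independence assumption (with μ and σ² the per-class, per-feature mean and variance estimated from the training data), and P(x) is the normalizer.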
This investigation aims to apply these three machine learning models to classify fake news and then to evaluate their performance to determine which model works best in each scenario. In terms of evaluation, we ran two separate testing methods: 5-fold cross-validation and external validation, in which we used different datasets to train and test the model to see how an algorithm performs on different inputs. Because the random forest classifier has a well-structured approach to dividing up its input data with minimal bias, we hypothesized that it would provide the best results in this scenario.
Section 2 will detail the datasets used in this study and how they were processed for testing. Section 3 will discuss the models and the experiments conducted with them on the datasets. Section 4 will present the experiment's results and discuss their implications. Finally, in Section 5, we will conclude the study's findings and propose additional studies that could further explore the topic.

Methods
All code used to conduct classification experiments and analyze their results was developed in Python (version 3.11.1). All datasets used for testing are from Kaggle, an open repository of datasets accompanied by notebooks and code.
Data:
This study used three corpora of predominantly political articles scraped from various web sources. First, dataset a was sourced from Kaggle and published in 2018; it consists of articles with labels indicating whether they were REAL or FAKE. Second, dataset b was also obtained from Kaggle, published in 2020, and comprised two sub-datasets, one containing articles labeled as fake and another labeled as true; these sub-datasets were consolidated to create a cohesive dataset b. Lastly, dataset c, also found on Kaggle and published in 2020, mirrored dataset a in that it featured a list of articles labeled 1 or 0, with 1 indicating true and 0 indicating false.
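As a concrete sketch of how such corpora could be loaded and given a common binary label, the following pandas snippet mirrors the structure described above; the file and column names are placeholders, not the exact Kaggle files used in this study.

    # Hypothetical loading step: file and column names are placeholders.
    import pandas as pd

    # Dataset a: one file with a REAL/FAKE label column
    df_a = pd.read_csv("dataset_a.csv")
    df_a["label"] = (df_a["label"] == "REAL").astype(int)  # 1 = real, 0 = fake

    # Dataset b: two sub-files, one of true articles and one of fake articles
    true_b = pd.read_csv("dataset_b_true.csv")
    fake_b = pd.read_csv("dataset_b_fake.csv")
    true_b["label"] = 1
    fake_b["label"] = 0
    df_b = pd.concat([true_b, fake_b], ignore_index=True)

    # Dataset c: already labeled 1 (true) / 0 (fake)
    df_c = pd.read_csv("dataset_c.csv")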
Text Processing/Pre-Processing/Vectorization:
To ready the datasets for the vectorizer, we first concatenated each document's title/header and body to compile all the text input necessary for each article entry. We then cleaned this column in each of the three datasets to strip away any unnecessary string characters: we removed any stop words in the string, converted the letters to lowercase, and removed any HTML tags, usernames, URLs, and numbers in a normalization process.
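A minimal version of this cleaning step might look like the following; the regular expressions, the NLTK stop-word list, and the column names are illustrative assumptions rather than the study's exact code.

    import re
    from nltk.corpus import stopwords  # assumes NLTK's English stop-word list is available

    STOP_WORDS = set(stopwords.words("english"))

    def clean_text(text: str) -> str:
        text = text.lower()
        text = re.sub(r"<[^>]+>", " ", text)            # HTML tags
        text = re.sub(r"@\w+", " ", text)               # usernames
        text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
        text = re.sub(r"\d+", " ", text)                # numbers
        return " ".join(w for w in text.split() if w not in STOP_WORDS)

    # Applied to the concatenated title + body column of each dataset
    # (column names are placeholders).
    df_a["text"] = (df_a["title"].fillna("") + " " + df_a["text"].fillna("")).apply(clean_text)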
After cleaning the text, we processed the values into usable input data for vectorization. In this experiment, we used count vectorization to create a vector space matrix, which converts documents into vectors where each component represents a specific word and its value represents the count/proportion of that word in the document. After vectorization, our corpus can be transformed and fit onto each ML model.
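With scikit-learn, this step corresponds to something like the sketch below; default CountVectorizer settings are assumed, since the study's exact parameters are not specified.

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    X_a = vectorizer.fit_transform(df_a["text"])  # sparse document-term count matrix
    y_a = df_a["label"]
    # Each column of X_a corresponds to one vocabulary word; each entry is the
    # number of times that word appears in the given document.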
Classification Algorithms:
To build the classifiers, we employed scikit-learn to fit and create models for each algorithm. Each of our models has its own set of parameters chosen to create what we considered a suitable testing environment. For example, the random forest classifier has one parameter, n_estimators, which we set to 200; this parameter represents the number of trees that the model will create. Given the quantity of the input data, 200 tree estimators gave us an efficient foundation for testing purposes. The naïve Bayes classifier, however, does not take any specific parameters and simply uses the default settings. Finally, our SVM classifier creates a pipeline combining a standard scaler and a support vector classification (SVC) model, which takes a single parameter, gamma, set to its automatic setting; this parameter defines how much influence a single training input has.
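Concretely, the three models described above could be instantiated in scikit-learn roughly as follows; this mirrors the parameters named in the text but is a sketch, not the study's exact code.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    models = {
        "random_forest": RandomForestClassifier(n_estimators=200),
        # GaussianNB uses its default settings; note that it requires a dense
        # array, so sparse count vectors need .toarray() before fitting.
        "naive_bayes": GaussianNB(),
        # with_mean=False lets StandardScaler operate on sparse input.
        "svm": make_pipeline(StandardScaler(with_mean=False), SVC(gamma="auto")),
    }

Whether the original experiments scaled or densified the count vectors is not stated in the text, so the two comments above are assumptions about how the pieces would fit together.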
To test the three different models on the three different datasets, we created two validation trials for each model, cross-validation and external validation, making a total of six trial procedures.
Evaluating Algorithms:
We used 5-fold cross-validation for internal validation and reported a confusion matrix, averaged over the five folds, that displays the number of samples falling into each category: true positive, false positive, true negative, and false negative. For external validation, we bring in the other two datasets, b and c, to see how the models perform on entirely different datasets. We create a nested loop that takes each dataset, trains the model on it, and evaluates the model on all three datasets. In other words, we train the model on a and test it on a, b, and c; train the model on b and test it on a, b, and c; and train the model on c and test it on a, b, and c. For each test run, we report the accuracy score and a confusion matrix.
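A sketch of this evaluation procedure is shown below. It pools the out-of-fold predictions into a single confusion matrix rather than averaging per-fold matrices, and it wraps the vectorizer and classifier in one pipeline so that the training corpus's vocabulary is reused at test time; both choices are assumptions about implementation details the text does not specify.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import make_pipeline

    pipe = make_pipeline(CountVectorizer(), RandomForestClassifier(n_estimators=200))

    # Internal validation: 5-fold cross-validation on a single corpus.
    y_pred = cross_val_predict(pipe, df_a["text"], df_a["label"], cv=5)
    print(confusion_matrix(df_a["label"], y_pred))

    # External validation: train on each corpus, evaluate on all three.
    corpora = {"a": df_a, "b": df_b, "c": df_c}
    for train_name, train_df in corpora.items():
        pipe.fit(train_df["text"], train_df["label"])
        for test_name, test_df in corpora.items():
            acc = accuracy_score(test_df["label"], pipe.predict(test_df["text"]))
            print(f"train {train_name} / test {test_name}: accuracy {acc:.3f}")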
Results and Discussion
In this section, we present the results of our experiment. The source code used to perform the analyses is included in the supporting documents.
Random Forest Classification:
Figure 1A: Results of cross-validation testing on the random forest model.
Figure 1B: Results of external validation testing on the random forest model.
From the metrics in Figure 1A, we see that the random forest classification model achieved high accuracy, precision, sensitivity, and specificity, in the 0.8–0.9 range. This indicates that the model could identify most of the positive and negative instances in the data and make accurate predictions for a large proportion of the data. Furthermore, the F1 score indicates that the model had a good balance between precision and recall, which is important for ensuring that the model is not overly biased towards one class. However, the accuracy metrics in Figure 1B show that the model undergoes high levels of overfitting. When the model was trained and tested on the same dataset, it achieved a perfect accuracy of 1; however, when it was trained on one dataset and tested on a different, unseen dataset, its performance dropped drastically, with the accuracy metric barely surpassing 50%. In a random forest classifier, overfitting might occur for several reasons. For one, the dataset might be too complex or large for the model, which causes the algorithm to learn the noise in the data rather than the underlying pattern. Overfitting can also occur when the model is not properly regularized or when there are too many trees in the forest. Finally, the model might also overfit when it has too many features or its trees are too deep. This causes the model to memorize the training data rather than generalize to new, unseen data.
Naïve Bayes Classification:
Figure 2A: Results of cross-validation testing on the naïve Bayes model.
Figure 2B: Results of external validation testing on the naïve Bayes model.
The naïve Bayes classification experiment results indicate that the model performed well overall, with a high level of accuracy, sensitivity, and precision, as seen in Figure 2A. This suggests that the model could identify and classify the positive instances in the data effectively. However, the slightly lower specificity score indicates that the model may have difficulty accurately identifying negative instances. The naïve Bayes classifier is a probabilistic classifier that makes assumptions about the independence of the features; this means that it assumes that the presence or absence of a feature does not depend on the presence or absence of any other feature. This assumption is not always true, and the classifier may perform poorly when it is not met, which could be the reason for the lower specificity score. While this does not significantly impact the model's overall performance, it may be an area to focus on to improve its performance further. In this case, the standard deviations for the accuracy, sensitivity, and precision metrics are relatively small, which suggests that the model's performance is consistent across different experiment runs. Regarding external validation, the naïve Bayes model performed similarly to the random forest model in that it also experienced overfitting, as seen in Figure 2B. This could likewise occur if the model is not regularized, so that the algorithm considers the noise of the dataset, similar to what was observed in the random forest classifier.
Support Vector Classification:
Figure 3A: Results of cross-validation testing on the support vector model.
Figure 3B: Results of external validation testing on the support vector model.
The performance of the SVM model in our classification experiment revealed some discrepancies in its ability to identify and classify instances of fake news accurately. While the model demonstrated a relatively high level of accuracy and specificity, it struggled to achieve strong scores in sensitivity, negative predictive value, and F1 score, as depicted in Figure 3A. This suggests that the model may have difficulty identifying specific types of fake news or may be prone to misclassifying certain instances. Additionally, the low F1 score indicates that the model may be biased toward precision rather than recall. These results may mean that the SVM model could benefit from further optimization or modification to improve its performance. Finally, as seen in Figure 3B, the SVM model experienced overfitting in its external validation tests; once again, this could occur due to over-complexification and bias. However, it is important to note that dataset b was not included in the testing for SVM because its large size made it impractical for use with this specific model.
Model Comparison:
This study evaluated the performance of three models – naïve Bayes, random forest, and SVM – for their effectiveness in classifying fake news articles.
The naïve Bayes and random forest models performed well in differentiating between real and fake news when given datasets similar in structure to the news in question. These models produced very high metrics in all categories of cross-validation, although they also fit their training data very tightly. Additionally, the random forest and naïve Bayes models are both relatively simple models that can be trained on large datasets with relatively low computational resources. This suggests that both models seem to be suitable for fake news classification.
However, all three models displayed overfitting: they produced better results when fit and tested on the same dataset. When trained and tested on different datasets, the models displayed accuracy of around 0.5, which is that of a null model, or simply no model at all. These coin-flip-like scores suggest that the models were essentially no help in predicting fake versus real news. Overfitting is a massive issue in fake news classification because it creates models that are not robust when translated to larger scales for predictive measures. This is problematic because the goal is to accurately identify and classify fake news in real-world scenarios, not just on the training data. For example, suppose a model overfits the training data. In that case, it may memorize specific patterns and features in the training data that do not generalize to new data, resulting in poor performance on new, unseen data. Additionally, overfitting can lead to a model that is too complex and may not be able to efficiently classify new instances, which can be a concern in real-time classification scenarios.
While overfitting may be a significant problem when scaling fake news detection to real-world and real-time applications, it shows a trend that could be useful in finding a solution. One way to address overfitting is to use regularization techniques, which help constrain the model and prevent it from becoming too complex. Additionally, ensemble methods that combine multiple models can be an excellent way to reduce overfitting. To improve performance in the future, a more sophisticated model could be developed that integrates various algorithms, allowing it to handle diverse inputs consistently.
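As an illustration of the kind of regularization this would involve (these settings were not evaluated in the present study and the values are arbitrary), the random forest can be constrained by limiting tree depth and leaf size, and the SVC can be given a smaller C to favor a simpler decision boundary:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Illustrative regularization settings, not tuned values from this study.
    regularized_rf = RandomForestClassifier(
        n_estimators=200,
        max_depth=10,         # cap how deep each tree can grow
        min_samples_leaf=5,   # require more samples before creating a leaf
    )
    regularized_svc = make_pipeline(
        StandardScaler(with_mean=False),
        SVC(C=0.1, gamma="auto"),  # smaller C penalizes overly complex boundaries
    )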
Conclusion
Our investigation into fake news classification yielded valuable insights into the most effective models for distinguishing between real and fake news articles. Initially, we hypothesized that the random forest model would outperform the other models tested. However, our experiments revealed that all three models experience extreme overfitting and do not produce robust results.
The advancement of machine learning algorithms requires a concerted effort to tackle the pervasive problem of overfitting. Therefore, future studies must prioritize finding solutions to this challenge by exploring various techniques, such as incorporating dropout into neural networks, conducting rigorous feature selection, and exploring innovative feature extraction methods. Through these endeavors, researchers aim to create more resilient models for handling complex datasets and delivering more precise predictions. Thus, addressing overfitting is critical to elevating the performance and trustworthiness of machine learning algorithms in the years to come.
Acknowledgment
I want to express my sincere gratitude to my mentor, Jim Shepich, for his unwavering support and guidance throughout this research project. Jim's extensive knowledge and expertise in the field have been invaluable to me, and I am deeply grateful for his willingness to share his insights and experience with me.
Jim's constructive feedback and insightful suggestions have helped me to refine my research questions, design my experiments, and analyze my results. In addition, his encouragement and motivation have kept me focused and determined to achieve my research goals.
I also want to thank Jim for his patience and understanding when I faced challenges and setbacks in my research. His unwavering support and encouragement helped me to stay on track and overcome obstacles.
Finally, I appreciate Jim's personal and professional mentorship during my research journey. His mentorship has not only helped me to grow as a researcher but also as an individual. I feel privileged to have had Jim as my mentor, and I will always be grateful for his guidance and support.
References
1. Matsa, K. E. News use across social media platforms 2018. https://[Link]/journalism/2018/09/10/news-use-across-social-media-platforms-2018/ (accessed Feb 3, 2023).
2. Gallagher, D.; et al. Tweeting False Information During the Covid-19 Pandemic: Content and Network Characteristics of Tweets from False Sources. 2020.
3. Olan, F.; Jayawickrama, U.; Arakpogun, E. O.; Suklan, J.; Liu, S. Fake News on Social Media: The Impact on Society. Information Systems Frontiers 2022.
4. Livingstone, S.; et al. Children's Exposure to and Evaluation of Online Misinformation. 2019.
5. Vapnik, V. N. The Nature of Statistical Learning Theory, 1995.
6. Breiman, L. Random Forests. Machine Learning 2001, 45 (1), 5-32.
Author
Nithin Sivakumar is a junior at Flower Mound High School in Dallas, Texas, who has always been interested in adding to the world of software. He hopes to expand his interest in artificial intelligence and machine learning by majoring in computer science in college.
Short Responses:
1. Title: A Comparative Analysis of Fake News Classifier Models
Author: Nithin Sivakumar
Publication Date: 2024
2. Why did the random forest model perform similarly to the other models (e.g., naïve Bayes), causing the authors' hypothesis to fail? And why not address the overfitting issue from the start using the regularization techniques the paper itself suggests?
3. “In this study, we trained three different machine learning algorithms on publicly
accessible corpora labeled real and fake news articles to compare the models’
performance in detecting which articles were real and fake. We compared the
performance of a naïve Bayes classifier, a support vector classifier, and a random forest
classifier using both cross-validation and external validation.” (Sivakumar, 114)
4. Paraphrased: They took three machine learning models - Naive Bayes, Support Vector
Classifier, and Random Forest Classifier and trained them on datasets containing
labeled data of real and fake news. They then compared the performance of the models.
5. I chose this quote because it presents the main object of this research paper and tells
me what it will do.