Generating Fake News Detection Model Using A Two-Stage Evolutionary Approach

Received 22 June 2023, accepted 6 August 2023, date of publication 7 August 2023, date of current version 15 August 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3303321
JEFFERY T. H. KONG 1, (Member, IEEE), W. K. WONG 1, FILBERT H. JUWONO 2, (Senior Member, IEEE), AND CATUR APRIONO 3, (Member, IEEE)
1 Department of Electrical and Computer Engineering, Curtin University Malaysia, Miri 98009, Malaysia
2 Department of Electrical and Electronic Engineering, Xi’an Jiaotong–Liverpool University, Suzhou 215000, China
3 Department of Electrical Engineering, Universitas Indonesia, Depok 16424, Indonesia
Corresponding author: Catur Apriono (catur@eng.ui.ac.id)
This work was supported in part by the Fundamental Research Grant Scheme (FRGS) under Grant FRGS/1/2020/ICT06/CURTIN/02/1;
and in part by Universitas Indonesia’s International Indexed Publication (PUTI) Q1 Grant, in 2022, under Grant
NKB-501/UN2.RST/HKP.05.00/2022.
ABSTRACT While fake news is morally reprehensible, irresponsible parties intentionally use it to achieve their goals by disseminating it to vulnerable and targeted groups. Machine learning techniques have been researched extensively to detect fake news, while evolutionary-based algorithms are now gaining popularity in the research community. In this study, a two-stage evolutionary approach is proposed to generate and optimize a mathematical equation for fake news detection. In the first stage, a tree-based Genetic Programming (GP) algorithm is used to generate mathematical expressions to detect correlations between the language-independent (Lang-IND) features extracted from the Fake.my-COVID19 dataset, a newly curated fake news dataset in mixed Malay-English language. The uniqueness of the proposed approach is that the mathematical expressions are formed from basic arithmetic operators, or extended to include complex arithmetic operators, drawn from addition, multiplication, subtraction, division, square, abs, log1p, sign, square root, and exponential, together with Lang-IND features as the variables. Prior to the second stage of the evolutionary approach, a sensitivity analysis is applied to shorten the best equation while maintaining the F1-score performance. In the second stage, Adaptive Differential Evolution (ADE) is used to fine-tune the mathematical model. The experimental results show that the proposed two-stage evolutionary approach can be applied to fake news detection and that the model can learn to predict using the Lang-IND features. Results from the first stage show that the equation from GP achieves an F1-score of 83.23% on the Fake.my-COVID19 dataset using complex arithmetic operators at a tree depth of 8. After the fine-tuning stage, the F1-score increases to 84.44%. The proposed two-stage evolutionary approach outperforms the baseline performance of six commonly used machine learning algorithms, of which Random Forest has the highest F1-score of 84.07%. The mathematical model is also tested separately on two other unseen datasets of a different domain topic or language and achieves acceptable F1-scores.
INDEX TERMS Fake news detection, evolutionary approach, genetic programming, differential evolution.
The associate editor coordinating the review of this manuscript and approving it for publication was Yu-Da Lin.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 11, 2023 85067

I. INTRODUCTION
As digital technology advances, people tend to spread unreliable news or fake news to their online contacts without verifying it. Fake news can be detrimental in some circumstances. A study mentions that fake news is a phenomenon that has a direct impact on how anxiety, panic, despair, fear, exhaustion, psychological distress, and emotional overload develop in people of all ages [1]. In the case of COVID-19, it may cause distrust in governments, researchers, and health professionals, which indirectly affects public decisions such as mandatory vaccination uptake [2]. In addition to political and
J. T. H. Kong et al.: Generating Fake News Detection Model Using A Two-Stage Evolutionary Approach
health implications, fake news can have an impact on businesses; thus, appropriate marketing strategies should be formulated in advance [3].
Governments from all over the world have been battling the COVID-19 infodemic in addition to the COVID-19 pandemic itself. However, multilingual nations like Malaysia may find this to be more difficult. Malaysian people speak 137 languages [4], which makes fake news detection more difficult. To overcome this issue, the Malaysian Communications and Multimedia Commission (MCMC) launched the sebenarnya.my portal to allow users to verify unconfirmed news before spreading it on social media, instant messaging, blogs, websites, and other online platforms. Through this portal, it is expected that Malaysia will be able to stop the propagation of fake news successfully.
To date, the majority of research publications on fake news detection focus on the English language, and relatively few focus on low resource languages. A language is low resource because datasets, and the Natural Language Processing (NLP) tools designed to interpret them, are absent. With the explosion of fake news created during the COVID-19 pandemic, the urgency of exploring these low resource languages in fake news detection increases. Moreover, detecting fake news on social media presents unique challenges. Firstly, fake news is intentionally written to mislead readers, so it is non-trivial to detect fake news simply based on its content. Secondly, social media data are large-scale, multi-modal, mostly user-generated, sometimes anonymous, and noisy. Thirdly, in countries whose population speaks more than one language, like Malaysia, a mixed language is commonly used in text postings. An example of the Malay-English mixed language is as follows: “Lol kalau nak tamatkan endemic phase, why hiv test kit still expensive than covid test?”. The sentence can be translated directly into English as “Laughing out loud (Lol), if we want to end the endemic phase, why is the HIV test kit still more expensive than the COVID-19 test?”.

A. RELATED WORK
As the COVID-19 pandemic hit the world at such alarming speed, massive misinformation on the virus spread through online platforms. Thus, researchers are rushing to perform in-depth research in all directions as a result of the increasing importance of fake news detection research. As outlined by [5] in the future research directions for fake news detection, there are four categories to be explored: data-oriented, feature-oriented, model-oriented, and application-oriented. Due to the availability of datasets and NLP packages intended for high resource languages like English, fake news detection models are frequently trained using the English language. In order to expand the data-oriented approaches, [6] mentions that it is desirable to apply machine learning models to datasets in languages other than English. The authors in [7] encouraged others to develop NLP methods for other languages to detect fake news. Recently, some researchers have proposed to apply fake news detection to low resource languages such as Bangla [8], [9], Arabic [10], Indonesian [11], [12], [13], and Slovak [14].
In the feature-oriented approach, static word embeddings, such as Word2Vec [14], [15], [16], GloVe [17], [18], [19], Bag-of-words (BoW) [12], Term Frequency (TF) [18], and Term Frequency - Inverse Document Frequency (TF-IDF) [12], [17], [18], [19], have been commonly used as features for detecting fake news in a specific language using machine learning classifiers. There are also fake news detection models that use dynamic word embeddings, such as Bidirectional Encoder Representations from Transformers (BERT) [17], [20], [21], XLNet, Efficiently Learning an Encoder that Classifies Token Replacement Accurately (ELECTRA), and Robustly Optimized BERT Pretraining Approach (RoBERTa) [21], [22]. Static features with Lang-IND characteristics are being used in research as multiple languages are being investigated for fake news detection. Note that the Lang-IND features focus on capturing high-level structures rather than taking into account specific terms from a language.
To detect fake news in three languages (English, Portuguese, and Spanish), the authors in [23] proposed Lang-IND features of stylometric, complexity, and psychological types. They found that the Random Forest (RF) model achieved the highest accuracy of 85.3% with stylometric features, Part-of-Speech (POS)-tag diversity, the ratio of named entities to text size, the ratio of quotation marks to text size, and the Out-of-Vocabulary (OOV) word frequency. Similarly, the authors in [24] used Lang-IND features to identify fake news spreaders who wrote in English and Spanish with accuracies of 78% and 87%, respectively. In particular, they used various stylistic and psychological features such as emojis, hashtags, upper phrases, user mentions, and neutral and negative polarity. The authors in [20] detected fake news in two Indic languages, i.e., Hindi and Bengali. The study found that the feature representations extracted from the Hindi and Bengali languages were highly transferable across Indic languages. Specifically, they extracted textual features like the number of uppercase characters in a tweet, number of question marks, number of exclamation marks, retweet count, and favourite count, and combined them with user features, fact verification score, and bias score. M-BERT embeddings were then used to create concatenated features as input to a classifier model. The authors in [25] conducted fake news detection in multiple language groups such as Latin, Germanic, and Slavic using weakly language-dependent features such as the proportion of uppercase characters, exclamation marks, question marks, number of unique words, sentences, characters, spelling errors, sentiment, and the proportion of adjectives, adverbs, and nouns. However, we note that features like spelling errors, sentiment, and the recognition of adjectives, adverbs, and nouns are highly dependent on NLP libraries trained for the particular languages. Therefore, these features are not suitable to be categorized as Lang-IND features.
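The surface statistics mentioned above (uppercase proportion, exclamation and question marks, unique words, character counts) can be computed without any language-specific tooling, which is what distinguishes them from features such as sentiment or POS tags. The following is a minimal sketch under stated assumptions: the helper name `surface_features` and the chosen statistics are illustrative only and are not the feature set of any cited work or of this paper.

```python
import re


def surface_features(text: str) -> dict:
    """Compute a few language-agnostic surface statistics for one post.

    Hypothetical helper for illustration; no language-specific library is used.
    """
    letters = [c for c in text if c.isalpha()]
    words = re.findall(r"\w+", text.lower())
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_unique_words": len(set(words)),
        "n_exclamation": text.count("!"),
        "n_question": text.count("?"),
        # Share of alphabetic characters that are uppercase (0.0 for empty text).
        "upper_ratio": (sum(c.isupper() for c in letters) / len(letters)) if letters else 0.0,
    }


# The mixed Malay-English example tweet quoted in the Introduction.
example = "Lol kalau nak tamatkan endemic phase, why hiv test kit still expensive than covid test?"
features = surface_features(example)
print(features)
```

Because the function only counts characters and tokens, the same code applies unchanged to Malay, English, or mixed-language text.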

Model-oriented research focuses on building fake news detection models incorporating machine learning and deep learning approaches. One pioneering work [26] adapted twenty-three supervised artificial intelligence algorithms in its fake news detection study. In a multilingual fake news detection study, [23] compared four machine learning algorithms, namely k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), RF, and Extreme Gradient Boosting (XGB), using the Lang-IND features. Reference [15] proposed a one-class SVM as a classifier model, using the linear kernel function to group the training samples into one class, while those that do not fit into the class are grouped into a new class (fake news). Meanwhile, [27] used k-NN as a classification model to label the problem instances into different classes, and the model achieved maximum accuracy with the value of K between 15 and 20. Reference [25] performed a fake news detection study on multiple platforms (Twitter or Sina Weibo) and language groups (Germanic, Latin, and Slavic) using four machine learning algorithms, namely k-NN, RF, Gaussian Naïve Bayes (Multinomial for bag-of-words), and SVM.
A subset of machine learning techniques called deep learning uses multi-layered neural networks to discover complex connections between the inputs and outputs. Deep learning classifiers have become popular in the field of fake news detection in recent years. The authors in [14] trained two neural network architectures, a one-dimensional Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network, and compared the two deep learning models for fake news detection. Reference [19] proposed modified versions of deep neural networks to detect fake news, specifically the Modified-LSTM and the Modified Gated Recurrent Units (GRU), by increasing the hidden layers from one layer to three layers. In similar research, [17] and [21] proposed a fake news detection model using a deep learning technique named Bidirectional LSTM (Bi-LSTM), a sequence of two LSTMs with one taking the input in a forward direction and the other in a backward direction. The Bi-LSTM model in [21] took pre-trained transformer embeddings such as BERT, XLNet, and RoBERTa, while the Bi-LSTM model in [17] took GloVe embeddings as input features.
The evolutionary approach has ushered in a new phase in the growth of machine learning and deep learning, which have continued to advance. Evolutionary approaches are a type of metaheuristic method. Metaheuristics are high-level, general-purpose search and optimization algorithms that are designed to find good, near-optimal solutions to difficult problems in a reasonable amount of time. They are called “metaheuristics” because they are designed to work on a wide range of problems and do not rely on any specific problem structure. Reference [28] adopted two metaheuristic algorithms, the Grey Wolf Optimization (GWO) and Salp Swarm Optimization (SSO), in their fake news detection study. Further to the above, [29] improved the SSO by introducing a non-linear decreasing coefficient and oscillating inertia weight strategies to find the best optimum solution for fake news detection. In a recent metaheuristic study, [30] proposed a hybrid multi-thread metaheuristic approach that runs three different swarm-based metaheuristic algorithms, GWO, Particle Swarm Optimization (PSO), and Dragonfly Optimization, in parallel for fake news detection.
Genetic Programming (GP) is a type of evolutionary algorithm and a subset of machine learning, as described in [31]. According to [32], GP has been researched and applied extensively to various types of engineering problems, in particular water resources engineering problems. In a more recent study, [33] proposed a GP approach to solve a binary classification problem and encouraged better feature engineering techniques to improve the accuracy.
Specific to the Evolutionary Approach (EA), the genetic algorithm is commonly used to reduce the number of features required in the machine learning classifier model, as proposed in [34] and [35]. Besides the genetic algorithm, other evolutionary approaches, such as Particle Swarm Optimization (PSO) and Salp Swarm Algorithm (SSA), are proposed as feature reduction techniques for fake news detection by [36]. Note that the evolutionary approach is still new in the fake news detection field. In a more recent study, [37] proposed machine learning classifiers, like SVM, Naïve Bayes, Logistic Regression, and RF, as the fitness function in a genetic algorithm. Their model employed TF-IDF as the input features and a confusion matrix to calculate evaluation metrics such as precision, recall, and F1-score.
In this work [38], the tree-based GP algorithm is proposed to discover the correlations between the Lang-IND features to form mathematical expressions that separate fake news from real news. In general, tree-based GP initializes a population of equation trees by randomly selecting operators (e.g., +, −, ×, ÷) and features to build the mathematical equations. Each tree, or mathematical equation, is evaluated against the fitness function to generate a fitness score. The fitness score is used to select the best trees for reproduction, mutation, and crossover to produce the next, better generation of the tree population. As fake news detection is a classification problem, the fitness function used is a maximization function that seeks the highest fitness score.
Lastly, application-oriented research focuses on two main directions: fake news diffusion and fake news intervention. Fake news diffusion refers to the patterns and channels of fake news propagation on social media platforms. Early research has demonstrated that, while spreading via online social networks, reliable and false information follow different patterns. In order to reduce the impacts caused by fake news, proactive intervention can minimize the spread area and reactive intervention can address the issues after the news goes viral. Proactive intervention methods include removing user accounts and labelling the news as fake. According to [39], reactive intervention methods focus on initiating news mitigation campaigns targeting a specific set of users when the infected set of users is known or

targeting the entire network when the infected set of users is unknown. In another study, [40] introduced a framework of marked temporal point processes that leverages public crowd efforts to detect and reduce the spread of fake news. The idea is to flag a probable fake news story and send it to independent organizations like Poynter for fact-checking. When the flagged news is confirmed as fake, the system lowers its appearance in users' feeds, thus reducing the number of people exposed to the misinformation. All of the works mentioned in the four approaches to fake news detection are listed in Table 1.

TABLE 1. Fake news papers in related works.

B. MOTIVATION AND CONTRIBUTIONS
Existing methods for fake news detection in multilingual settings have several limitations, such as a lack of training data in multiple languages, difficulties in detecting fake news in languages with different linguistic structures or styles of communication using models pre-trained on other languages, and the need for language-specific pre-processing and feature engineering. The current state-of-the-art method uses the Multilingual BERT (M-BERT) model, which is pre-trained on large monolingual Wikipedia corpora from 104 languages. However, the effectiveness of the M-BERT model or its variants may be limited for low-resource languages like Malay, thereby affecting its performance on tasks that require understanding of these languages.
Many research papers that have investigated fake news detection in multiple languages suffer from a similar limitation. Instead of training the models on a single-language dataset and evaluating their performance on a dataset in another language, as in [20], most papers have opted to train their models on either a dataset specific to each language or a combined dataset of multiple languages. Consequently, these models tend to achieve high accuracy as they are trained on data that is specific to the language or languages used in the dataset, which may not accurately represent the complexity of detecting fake news in multilingual settings.
In this paper, we propose to use a two-stage evolutionary approach for COVID-19 fake news detection in the mixed Malay-English dataset. To eliminate the requirement for language-specific pre-processing and feature engineering, we employed language-independent (Lang-IND) features during the model’s training. In particular, an optimal mathematical model is generated using the tree-based GP algorithm. Furthermore, the model is fine-tuned using the Adaptive Parameter Differential Evolution (ADE) method. Different from the existing work, the proposed model is evaluated on datasets with various languages and domains after being trained on the mixed Malay-English language

dataset. To summarize, the contributions of this paper are given as follows:
1) A COVID-19 fake news dataset (Fake.my-COVID19) in the mixed Malay-English language is created to contribute to the low resource Malay language. The dataset is made publicly available on GitHub (footnote 1).
2) A tree-based GP is used to generate a mathematical model which can show the correlation between the language-independent (Lang-IND) features used. The model is further optimized using ADE by introducing randomized weights to the model.
The rest of this paper is organized as follows. Section II outlines the data collection method and the annotation guidelines used to build the fake news dataset, and discusses the datasets acquired for the research. Section III describes the proposed two-stage evolutionary approach based on Lang-IND features in detail. Section IV discusses the experimental results from the GP algorithm and the fine-tuning results. Finally, Section V summarizes the work and provides an outlook for possible future research and investigation into this topic.

II. DATASETS
In this paper, three fake news datasets from Twitter are used. The first dataset (D1) is our own curated COVID-19 fake news dataset in the mixed Malay-English language, i.e., Fake.my-COVID19. From now on, the terms “D1” dataset and “Fake.my-COVID19” dataset are interchangeable. The second dataset (D2) is curated by Patwa et al. in [41] and is publicly available (footnote 2). The third dataset (D3) is published by P. Faustini and T. Ferreira Covoes and can also be publicly accessed (footnote 3). The idea is to take the equation generated by GP on the D1 dataset and test its F1-score on the D2 and D3 datasets to observe how well it generalizes. Note that, unlike the D1 and D2 datasets, the D3 dataset contains political fake news in the Portuguese language. Both the D2 and D3 datasets are published as tweet IDs only, due to Twitter’s privacy policy stating that the content of tweets cannot be shared. As such, the tweet IDs are hydrated to extract the tweet texts. The D2 dataset is selected as it is in the English language and contains COVID-19 news, similar to the D1 dataset. In contrast, the D3 dataset is chosen for its Portuguese language and for its domain unrelated to COVID-19 news. According to a previous study [42], Portuguese and English are genetic relatives as they both belong to the Indo-European language family. The study examined cross-language similarities by comparing the lexical distance of 500 high-frequency words in both languages using computational analysis. The resulting percentage of lexical similarity, or similarity index, was calculated to be 20.4%. The study concluded that despite the presence of loan words from Latin in English, the two languages are distantly related.

1 https://github.com/z3fei/Malaysia-COVID-19-Tweet-ID/tree/main/Fake.my-COVID19
2 https://competitions.codalab.org/competitions/26655
3 https://github.com/phfaustini/BRACIS2019_FAKENEWS

A. FAKE.MY-COVID19 DATASET
To date, there is no fake news dataset publicly available in the Malay language, due to the time and effort required [43]. On the other hand, the authors in [44] have encouraged work on fake news detection in Malay news.
We built a data collection program to collect COVID-19 related news posted in Malaysia via Twitter’s Standard search API. The collection timeline was set from 1st September 2021 until 31st March 2022. This period coincided with the time when Malaysia started to administer third doses to health frontliners and the elderly, as well as started vaccination for adolescents aged 12 to 17 years old. A total of 251,216 tweets were collected over 231 days. From the data collected, 68% of tweets are in Malay, 28% in English, and 4% in other languages (Chinese, Tamil, etc.). In the data collection program, there are two important search criteria, namely keywords and locations.

1) SEARCH CRITERIA: KEYWORD
In order to retrieve COVID-19 related tweets, we set the search parameter (q) in the ‘api.search’ method with a list of keywords. The search keywords were mainly related to vaccination and virus variants. The keywords were translated into Malay from English or vice versa using Google Neural Machine Translation (GNMT) [45]. Table 2 shows the search keywords used in the data collection. A few more keywords were added at a later date, for example “Omicron”, a new variant of concern that was recognized by the WHO on November 26, 2021. A tracing date was recorded for each keyword to track the date it was first used in the search.

TABLE 2. Search keywords used in data collection.

2) SEARCH CRITERIA: LOCATION
In order to ensure the collected tweets were the ones posted within Malaysia, geographic coordinates (geocode) were added into the search parameter. The latitude and longitude of each state capital city and other cities with large populations are plotted on the Malaysia map with a radius ranging from 3 km to 70 km, as listed in Table 3. Variations of radius circles

method were inspired by the theory of circles used to collect tweets in Great Britain [46]. As discussed in that paper, highly overlapping circles and larger circles cause a lot of duplicate tweets, which is known as the “circle coverage problem”. As we aimed to reduce both the overlaps between circles and their number as much as possible, and to solve the problem of retrieving duplicate tweets, the following control measure was applied before saving the tweets. The data collection program refers to a main file containing all the tweet IDs that have been retrieved. If a new tweet ID does not appear in the main file list, the program proceeds to save the tweet content. If the retrieved tweet ID is found in the main file list, the program stops processing that tweet ID and proceeds to the next one.

TABLE 3. Geocodes used in data collection.

3) ANNOTATION TASK
The annotation task involves finding claims made by users in a tweet and verifying the claims on reputable fact-checking websites. This task is based on the annotation guidelines stated in [47] to decide the binary labels (fake or real). The guidelines consider the context of the tweets and exclude tweets that are sarcastic and humorous. Likewise, tweets that express general opinions regarding the vaccine, official news, and appointment details of vaccination centers are not considered fake news. Additionally, this approach helps ensure that the manual annotation of data is accurate and the quality of the dataset is of a high standard. Labelled tweets carry binary labels, fake as ’1’ and real as ’0’. Tweets labelled as fake either contain unverifiable claims, are misleading, have the intention to deceive, or contain conspiracy theories that have no scientific backing. The labels were annotated by one person and verified by three other professionals to remove bias.
According to [41], a tweet should be marked as real news if it contains useful information on COVID-19 such as numbers, dates, vaccine progress, government policies, hotspots, etc. All tweets posted by the handles of government agencies, medical institutes, and official news media channels are also considered real news. Truthful reporting in digital journalism is adhered to by well-established print or radio outlets and by established, reputable citizen journalist blogs. Such sources can be considered genuine unless proven otherwise, retracted, or corrected, as agreed by [48].
Finding verifiable factual claims among the retrieved tweets is not an easy task. One would have to go through the tweets one by one. In [47], the authors removed tweets shorter than five words. Thus, in our annotation task, a minimum-length filter was applied to exclude any tweets shorter than 20 characters. The filter also excluded tweets that contain only emojis, emoticons, or greetings, which may not contain claims that lead to fake news. In addition, a language filter was added to only consider Malay and English tweets, which represent 96% of the tweets collected. Tweets posted in the Chinese and Tamil languages made up only about 2%. The tweets in other languages were insufficient in number to form a fake news dataset for a machine learning detection task.
It is not impossible that tweets with large numbers of retweets or “likes” are fake news. However, the majority of tweets with high retweet or like counts are actual news, celebrity posts, health ministry updates, and media news. During our data collection, we found certain keywords often used in fake news tweets, such as beksin and baksin. The actual words for beksin or baksin are vaksin (Malay) or vaccine (English). The anti-vaxxers who posted these tweets misspelled the word on purpose to avoid the authorities' screening of the actual word, vaccine. Other words were used instead of COVID, misspelt on purpose, such as kovid, kobid, convid, and konvid. Other than that, some words recorded in fake news tweets were Adverse Events Following Immunization (AEFI), ivermectin, haram (illegal), flu, fake, scam, deltacron, bunuh (kill), cipta (create), racun (poison), etc. A user who posted one false claim would usually continue spreading false claims for many weeks to come. The users that had been identified as fake news spreaders or anti-vaxxers were put into a list. All tweets posted by them were read, and the claims they made were further checked against trusted sources.

B. DATA DISTRIBUTION
Of the 251,216 tweets collected, a total of 3,068 tweets were labelled in accordance with the guidelines outlined in the annotation task and named the Fake.my-COVID19 dataset. In total, 1,422 tweets were labelled as fake news and 1,646 were labelled as real news. A dimensionality reduction technique called t-SNE (t-distributed stochastic neighbour embedding) [49] is used to visualise the Fake.my-COVID19 dataset. A position for every data point is assigned on the

two-dimensional plot, as shown in Fig. 1, to visualise the high-dimensional data. In particular, the plot models similar objects by neighbouring points and dissimilar objects by distant points. Note that the D2 dataset contains 5,600 fake news and 5,100 real news, while the D3 dataset contains 4,392 fake news and 4,580 real news. The labels of all three datasets are balanced. The binary class distribution for each dataset is presented in Table 4.

FIGURE 1. TSNE plot.

TABLE 4. Datasets distribution.

TABLE 5. List of 25 Lang-IND features.
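As a quick check of the class balance described above, the fake share of each dataset can be computed from the counts reported in the text. The dictionary layout and the `fake_share` helper are illustrative assumptions and are not part of the authors' code.

```python
# (fake, real) counts as reported in the text for the three datasets.
counts = {
    "D1 (Fake.my-COVID19)": (1422, 1646),
    "D2": (5600, 5100),
    "D3": (4392, 4580),
}


def fake_share(fake: int, real: int) -> float:
    """Fraction of samples labelled fake."""
    return fake / (fake + real)


shares = {name: fake_share(fake, real) for name, (fake, real) in counts.items()}
for name, share in shares.items():
    total = sum(counts[name])
    print(f"{name}: total={total}, fake share={share:.1%}")
```

With these counts, the fake share is roughly 46% for D1, 52% for D2, and 49% for D3, so no dataset deviates from an even split by more than a few percentage points.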
main steps: 1) feature engineering and 2) feature construction.

For the feature engineering process, we developed the pro-
gram to extract 25 Lang-IND features listed in TABLE 5 from
the tweet text alone. The features are focus on capturing high-
level structures. This way, the same set of features can be used
C. DATA PREPROCESSING in multilanguage domains [24]. Most Lang-IND features are
Data preprocessing is a process to prepare the tweets for learnt from the previous works but we added the proportion
feature engineering. Data preprocessing contains two steps, of each feature per sentence. The step includes tokeniza-
i.e., data cleaning and data formatting. To assure the tweets tion, removed @mentions, URLs, and emojis before counting
that are retrieved and stored are displayed exactly as they number of words and sentences. In feature construction step,
should, data cleaning is performed. During the data collection nine constant values [0.1, 0.2, 0.3 . . . 0.9] were added as the
process, HTML codes were found in the retrieved tweet predicted variations in weights to the 25 features. We chose
texts instead of the real characters, such as ampersand (&) relatively small variations in weights so that the algorithm
as ’&’, the greater than (>) as ’>’, the less than can converge to near optima faster. No normalization was
(<) symbol as ’<’. Similarly, UTF-8 encoding such as ’âe’ performed on the dataset, as the other benchmarking datasets
will also need to be corrected. These HTML codes and UTF-8 were made of tweets with the same maximum characters
encoding will distort the calculation on number of characters length.
if left untreated. Therefore, we replaced the HTML codes
with the correct symbols respectively, to reflect the exact III. METHODOLOGY
characters. Next, in the data formatting process, both the The overview of our proposed two-stage evolutionary
tweet text and label columns were copied into new file for approach is shown in Fig. 2, with each component of the
feature preparation steps. framework explained in the following. We use F1-score as
evaluation metric in all the experiments. The F1-score can be
D. FEATURE PREPARATION defined as the harmonic mean of the Precision and Recall as
Feature preparation involves preparing the inputs in their expressed by Equation (1), where TP, TN, FP, and FN denote
numerical form and in an acceptable manner to be input to GP True Positive, True Negative, False Positive, and False Nega-
algorithm. The feature preparation pipeline consists of two tive samples, respectively. In addition, Precision is calculated

FIGURE 2. Two-Stage Evolutionary Approach Model.
by using Equation (2) and Recall is calculated by using Equation (3).

$$ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1} $$

$$ \text{Precision} = \frac{TP}{TP + FP} \tag{2} $$

$$ \text{Recall} = \frac{TP}{TP + FN} \tag{3} $$

A. FIRST STAGE - GP
The tree-based GP [38] is a technique that uses an evolutionary algorithm to search for the best solution to a given problem. In the context of fake news detection, GP is used to automatically discover the correlations between the Lang-IND features and form mathematical expressions that best fit the data set. The mathematical expressions are represented as a tree structure, where the internal nodes of the tree represent the arithmetic operators and the leaf nodes, also known as terminal nodes, represent the variables or constants. The tree-based GP algorithm begins by creating an initial population of trees, or expressions, built from randomly chosen arithmetic operators, constants, and Lang-IND features. The basic arithmetic operators are (+, −, ×, ÷). Meanwhile, the complex arithmetic operators are (·)^2, abs(·), log1p(·), sign(·), √(·), and exp(·), together with the basic arithmetic operators. Note that log1p(x) = log(1 + x). Each tree can have a different depth level in each generation. The higher the tree depth level, the greater the number of features that can be included, as there will be more branches and nodes to hold more arithmetic operators and Lang-IND features.

One of the traditional ways of measuring the complexity of a tree model is counting the total number of nodes in all subtree models [50]. Hence, the deeper the tree depth level, the more parameters the model has, which adds to the model complexity [51]. In this paper, we set a maximum tree depth of 8 to restrict the program from bloating. These mathematical expressions are then evaluated based on how well they fit the data and are given a fitness score. The fitness function is the mathematical function that calculates the fitness score for each mathematical expression. As fake news detection is a classification problem, the fitness function used is a maximization function that seeks the highest fitness score. The tree with the highest fitness score is selected for the reproduction, point mutation, branch mutation, and crossover processes to make the next and better generation of the tree population. Fig. 3 shows an example of a crossover operation, swapping a tree point on a tree (Parent A) with a tree branch on another tree (Parent B). The offspring created by the crossover operator can be very similar to the parents, thereby leaving the new generation with low diversity. The mutation operators solve this problem by changing the value of some features in the offspring at random. During the training process, the GP algorithm iteratively modifies the structure and parameters of the trees based on the fitness of each tree to fit the training data. The evolution process stops after a predetermined number of generations.

In GP, the fitness function is a measure of the quality of a tree or mathematical expression. The fitness function is a crucial component of the GP algorithm as it drives the search process towards the best solution. The fitness score of a candidate solution is computed by evaluating the solution against the validation data set. The output of the mathematical expression is denoted by the result (R), a numerical value that is next evaluated by a piecewise function. For example, R = a + b − c, where a, b, and c are the numerical values of Lang-IND features. If the result value of the mathematical expression is greater than or equal to 0.5, then the predicted label is

FIGURE 3. Genetic Operation - Crossover.
fake news. Originally, the threshold was set at 0. The threshold was then increased to 0.5, and the model is trained towards this threshold value. Mathematically, the identifier of fake news is given by

$$ \text{fake} = g(\text{result}) = \begin{cases} \text{true}, & \text{if } R \geq 0.5 \\ \text{false}, & \text{if } R < 0.5 \end{cases} \tag{4} $$

The prediction is then compared against the actual label. The fitness score is incremented by 1 with each correct prediction. The goal of the GP algorithm is to maximize the fitness value of the candidate solutions. The output of the GP algorithm is a tree-like structure representing the best and final candidate solution. However, the characteristics of the training data set play a crucial role in determining the specific form of the generated GP tree. If the data set is large and diverse, it is likely that GP will find a more general and robust solution [52]. On the other hand, if the data set is small or not diverse enough, GP may find a solution that overfits the training data and performs poorly on new, unseen data.

B. SENSITIVITY ANALYSIS
One-at-a-time sensitivity analysis, also known as univariate sensitivity analysis, studies the sensitivity of the mathematical model's output to each input parameter individually, while holding all other input parameters constant. This method is simple to implement and can be useful for identifying the most important input parameters of the model. However, it is limited in its ability to capture interactions between input parameters, as it only considers one input parameter at a time.

Prior to the next evolutionary step, the terms that do not add to the model's F1-score are removed. Each mathematical term or constant is altered or removed during the sensitivity analysis to determine whether the F1-score persists. After each term augmentation or removal, the new equation is evaluated on the unseen data (the test set). The augmentation procedure changes one variable in a listed term to 1. This is done to assess how sensitive the performance of the equation is to that variable. Throughout the elimination process, an individual term may include a basic or complex arithmetic operator and a Lang-IND feature.
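As an illustration only, the thresholding rule of Equation (4), the F1-score of Equations (1) to (3), and the one-at-a-time elimination idea might be sketched as follows. The term functions, feature values, and labels are hypothetical toy data, and the helper names (`predict_fake`, `eliminate_terms`) are assumptions made for this sketch rather than the authors' implementation. The sketch covers elimination only and omits the augmentation step.

```python
from typing import Callable, Dict, List, Sequence

Features = Dict[str, float]
Term = Callable[[Features], float]


def predict_fake(terms: Sequence[Term], x: Features, threshold: float = 0.5) -> bool:
    """Equation (4): label a tweet as fake when R >= threshold."""
    r = sum(term(x) for term in terms)  # R is modelled as a sum of additive terms
    return r >= threshold


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Equations (1)-(3), treating 'fake' (True) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def score(terms: Sequence[Term], X: Sequence[Features], y: Sequence[bool]) -> float:
    return f1_score(y, [predict_fake(terms, x) for x in X])


def eliminate_terms(terms: List[Term], X, y) -> List[Term]:
    """Remove one term at a time; keep the removal only if F1 does not drop."""
    kept = list(terms)
    best = score(kept, X, y)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        trial_score = score(trial, X, y) if trial else 0.0
        if trial and trial_score >= best:
            kept, best = trial, trial_score  # term i was not needed
        else:
            i += 1  # term i matters: keep it and move on
    return kept


# Hypothetical toy model and data (not from the Fake.my-COVID19 dataset).
terms = [
    lambda f: 0.3 * f["exclam"],
    lambda f: -f["urls_ratio"],
    lambda f: 0.0 * f["hashtags"],  # contributes nothing, so it should be dropped
]
X = [
    {"exclam": 3.0, "urls_ratio": 0.0, "hashtags": 5.0},
    {"exclam": 2.0, "urls_ratio": 0.0, "hashtags": 1.0},
    {"exclam": 0.0, "urls_ratio": 0.5, "hashtags": 2.0},
    {"exclam": 3.0, "urls_ratio": 0.5, "hashtags": 0.0},
]
y = [True, True, False, False]

kept = eliminate_terms(terms, X, y)
```

In this toy case only the zero-contribution term is removed, because dropping either of the other two terms lowers the F1-score.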

The process of elimination begins with longer terms such as Term 1, Term 2, Term 3, and Term 4 in Equation (15), which may include a few smaller terms. As a result, it is possible to first eliminate the larger term that precedes a basic arithmetic operator from the mathematical equation. If the F1-score drops after the term is eliminated, the term is crucial to the solution and remains in the equation. The following large terms and individual terms in the mathematical expression, referred to as Terms in Table 8, are subjected to the next elimination process.

C. SECOND STAGE - ADE
ADE is an optimization algorithm that is based on the differential evolution (DE) algorithm. DE is a population-based optimization algorithm that is inspired by the concept of natural evolution and is used to find the global minimum or maximum of a function. Such methods are usually referred to as metaheuristics since they can search very broad regions of potential solutions and make few, if any, assumptions about the problem being optimized. According to [53], the performance of metaheuristic algorithms is sensitive to their control parameters. ADE therefore includes an adaptive mechanism that adjusts the control parameters of the algorithm based on the characteristics of the optimization problem. This automatic tuning relieves the user of the laborious and time-consuming task of manually setting the parameters.

The DE algorithm can be summarized as follows:
1) For each i-th solution vector $\vec{X}_i^G = [x_{i,1}^G, \cdots, x_{i,j}^G, \cdots, x_{i,N}^G]^T$, select three N-dimensional auxiliary vectors $\{\vec{X}_{r_1}^G, \vec{X}_{r_2}^G, \vec{X}_{r_3}^G\}$, where $i = 1, \cdots, P$, $r_1, r_2, r_3 \in \{1, \cdots, P\}$, $i \neq r_1 \neq r_2 \neq r_3$, $P$ is the population size, and $G = 1, \cdots, G_{max}$ is the generation index, with $G_{max}$ the maximum generation.
2) Form the mutated vector $\vec{V}_i^G = [v_{i,1}^G, \cdots, v_{i,j}^G, \cdots, v_{i,N}^G]^T$ using

$$ \vec{V}_i^G = \vec{X}_{r_1}^G + F(\vec{X}_{r_2}^G - \vec{X}_{r_3}^G), \tag{5} $$

where F is the differential weight.
3) Generate a trial vector from the mutated vector and the principal parent using

$$ u_{i,j}^G = \begin{cases} v_{i,j}^G, & \text{if } \text{rand}_{i,j}[0,1] \leq CR, \\ x_{i,j}^G, & \text{otherwise,} \end{cases} \tag{6} $$

where CR is the crossover rate and $\text{rand}_{i,j}[0,1]$ is a random number drawn from the standard uniform distribution for each j-th component of the i-th vector.
4) Calculate the next generation using

$$ \vec{X}_i^{G+1} = \begin{cases} \vec{U}_i^G, & \text{if } F(\vec{X}_i^G) \geq F(\vec{U}_i^G), \\ \vec{X}_i^G, & \text{if } F(\vec{X}_i^G) < F(\vec{U}_i^G), \end{cases} \tag{7} $$

where $F(\cdot)$ is the fitness function to be optimized.

It can be seen that the basic differential evolution algorithm has three main parameters: the population size, the crossover rate, CR, and the differential weight, F. An adaptive (i.e., automated) tuning approach was proposed in [54]. The adaptive F and CR are respectively given by

$$ F = \begin{cases} \hat{\alpha} + (1 - \hat{\alpha}) \times \sin\left(\dfrac{\pi \hat{t}}{maxiter} - \dfrac{\pi}{2}\right), & \text{if } \hat{t} \leq \dfrac{maxiter}{2}, \\ \hat{\alpha} - (1 - \hat{\alpha}) \times \cos\left(\dfrac{\pi}{2} - \dfrac{\pi \hat{t}}{maxiter}\right), & \text{otherwise,} \end{cases} \tag{8} $$

$$ CR = \begin{cases} \hat{\beta} + (1 - \hat{\beta}) \times \sin\left(\dfrac{\pi \hat{t}}{maxiter} - \dfrac{\pi}{2}\right), & \text{if } \hat{t} \leq \dfrac{maxiter}{2}, \\ \hat{\beta} - (1 - \hat{\beta}) \times \cos\left(\dfrac{\pi}{2} - \dfrac{\pi \hat{t}}{maxiter}\right), & \text{otherwise,} \end{cases} \tag{9} $$

where $\hat{\alpha}$ and $\hat{\beta}$ are constants, $\hat{t}$ is the current generation (iteration), and maxiter is the maximum number of iterations.

The ADE algorithm generates random weights for each term in the equation. The values range from 0 to 1. The weights are inserted into the optimized equation, and the best weights are recorded based on the result with the highest F1-score.

D. IMPLEMENTATION AND SIMULATION SETUP
The GP experiments on the Fake.my-COVID19 dataset were performed using Google Colab, a free online platform for running Jupyter notebooks in the cloud. Google Colab was installed with Python 3.8.16 at the time. The parameters used for the GP simulations are as follows: the initial population is set to 100 and the stopping criterion is set to 500 generations. The final output from GP is the tree with the highest fitness score. The GP algorithms are run at maximum tree depth levels of 4, 6, and 8. Each level is run 10 times. In particular, we run GP using the basic arithmetic operators and then repeat the whole process using the complex arithmetic operators. In total, there are 60 runs. From the 60 results, we compare the trees' F1-scores and select the tree with the highest F1-score for sensitivity analysis. The procedure of sensitivity analysis has been described above and is summarized here. In the sensitivity analysis, the tree holding the mathematical expression is split into a few large terms to be analyzed. One term is removed at a time to see whether the F1-score is affected. After removing the term, we tested the equation (without the removed term) on the training set (80%). If removing the term from the equation does not reduce the F1-score, the term is not important and can be removed from the mathematical expression. After checking all the individual terms, there may be a few terms that can be removed, thus shortening the original mathematical expression.

To further optimize the mathematical expression, we insert weights into the expression by adding a weight before each variable and constant. In the ADE algorithm, the weights are optimized using a maximization fitness function similar to that of the GP algorithm. The weights-added equations are evaluated against the training data set. The ADE algorithm is configured with an initial population of 50, a maximum of 1000 iterations, a crossover rate, CR, of 0.9, a differential weight, F, of 0.8, constant α of 0.8, and constant β of 0.75. The final weights are then added to the shortened

expression from the sensitivity analysis to form the optimized mathematical expression.

Furthermore, the tree-based GP is trained and validated with 64% and 16% of the data, respectively, leaving 20% of the data for testing the final equation. The share of trees selected for each genetic operator in a generation follows the default percentages given in the guide [38]: reproduction 10%, point mutation 10%, branch mutation 20%, and crossover 60%. Mutation introduces random variations, while crossover swaps different features among the trees. These operations help explore different regions of the search space and increase the chances of escaping local optima.

E. CONVENTIONAL MACHINE LEARNING ALGORITHMS
In addition to our proposed two-stage evolutionary approach, a baseline performance on the Fake.my-COVID19 dataset is created using six commonly-used machine learning algorithms. The baseline performance serves as a reference point against which the performance of more advanced models can be compared. The performance of each algorithm is evaluated by measuring accuracy, precision, recall, and F1-score. K-fold cross-validation is used to evaluate the machine learning models and helps to determine which model performs best on average across the folds. K-fold cross-validation provides a more reliable estimate of the model's performance than a single train-test split [55]. In the experiments, the D1 dataset is divided into K = 5 folds; each model is trained and evaluated K times, with each fold serving as the test set once. A brief explanation of each machine learning model is presented below.

1) SUPPORT VECTOR MACHINE (SVM)
SVM is a supervised learning model that was proposed by Vladimir Vapnik and colleagues at the AT&T Bell Laboratories. By utilizing the kernel approach and mapping inputs into high-dimensional feature spaces, SVM can efficiently perform non-linear classification. According to [19], each data object is plotted as a point in n-dimensional space during this procedure, with the value of each feature being the value of the corresponding coordinate. In the classification task, the SVM model aims to identify a hyperplane that effectively separates data objects belonging to different classes by maximizing the margins between them. In our simulations, the kernel parameter is set to linear.

2) DECISION TREE (DT)
DT is a supervised learning strategy that learns simple decision rules drawn from data attributes to create a model that predicts the value of a target variable. The DT contains nodes and branches, with the nodes representing the tests performed on each feature, the branches representing the results of the tests, and the leaf nodes representing the class labels [19]. The DT model in the experiment employs entropy as its criterion, which computes the Shannon entropy of the possible classes as described in detail in [56]. Moreover, it uses the class frequencies of the training data points that reached a certain leaf as a measure of their probability. This is similar to minimizing the log loss (also known as cross-entropy and multinomial deviance) between the true labels and the probabilistic predictions of the tree model.

3) RANDOM FOREST (RF)
RF is an ensemble machine learning algorithm that combines multiple decision trees to make a prediction. In [57], it is explained that each tree in the random forest is trained on a different subset of the training data. The class with the highest vote frequency is selected. RF introduces randomness in selecting features for each tree, which helps to reduce overfitting. RF is also known for its ability to handle high-dimensional data, deal with missing values [58], and provide measures of feature importance. In our simulations, the RF model is constrained to a maximum tree depth of 8 and a tree population of 100, which is consistent with the settings used in the GP algorithm.

4) MULTINOMIAL NAÏVE BAYES (MNB)
Naïve Bayes classifiers are simple probabilistic classifiers based on the application of Bayes' theorem. The Naïve Bayes classifier for multinomial models is appropriate for classification with discrete features (e.g., the occurrence of a word in text classification). The probabilistic categorizer MNB makes the assumption that a document is made up of words. According to [11], MNB determines the document class using term weights from the Term Frequency-Inverse Document Frequency (tf-idf) feature extraction approach, which takes into account both the frequency of a word's occurrence within a single document and throughout the provided corpus of documents.

5) K-NEAREST NEIGHBORS (K-NN)
k-NN is a non-parametric supervised learning technique that relies on the distance to the nearest neighbors for classification. In other words, the object is assigned to the class that is most common among its nearest neighbors. The term k in k-NN refers to the number of nearest neighbors included in the voting process to determine the class. It uses the historical instances that are most comparable to the new data for making predictions. According to [59], similar data points are near to each other and have a shorter distance. When a prediction is expected, the k examples that are closest to the input data are chosen using a distance metric. In our simulations, the Minkowski distance is used as the distance metric, which is equivalent to the Manhattan distance when the power parameter p = 1 and to the standard Euclidean distance when p = 2. Furthermore, k is set to 5.

6) LOGISTIC REGRESSION (LR)
LR is a supervised classification algorithm, also known as logit regression or maximum-entropy classification (MaxEnt)

or the log-linear classifier. The algorithm uses the logistic function to transform log-odds to probability. According to [60], it is a powerful binary classification algorithm. The authors describe that, when classifying text, the LR model first accepts a vector of variables, evaluates the coefficients for each input variable, and then predicts the class of the text represented as a word vector. In [61], it is noted that LR can be used to forecast probability values between 0 and 1 for binary classification problems. Labels are categorised as 1 or 0 depending on whether the predicted probability is greater than or less than 0.5. The maximum number of iterations for the solver to converge in our simulations is set to 1000. The lbfgs solver used in the model is an optimization algorithm that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm as described in [62].

IV. SIMULATION RESULTS AND DISCUSSION
A. CROSS-VALIDATION RESULT FOR THE MACHINE LEARNING (ML) MODELS
Each machine learning model's performance was evaluated using the 25 Lang-IND features extracted from the D1 dataset. To provide a brief overview, Table 6 presents the mean performance in terms of Accuracy, Precision, Recall, and F1-score for the six commonly-used machine learning algorithms. In the cross-validation results, RF achieves the highest mean F1-score of 84.07%, followed by LR with a mean F1-score of 82.28%, SVM with 81.36%, DT with 80.63%, k-NN with 77.21%, and lastly MNB with 70.06%. The benchmarked results will be compared with the results of the proposed two-stage evolutionary approach.

TABLE 6. Performance of the ML models.

B. RAW EQUATION (FIRST STAGE)
After all the simulations had been executed, we evaluated the equations that were generated using the basic and complex arithmetic operators. Based on the structure of the generated raw equations, the inclusion of complex arithmetic operators in the GP algorithm adds to the model complexity by mixing complex mathematical functions into the equation. This shows that the use of complex arithmetic operators allows the model to create more complex relationships between the Lang-IND features and constant values. Table 7 shows the metrics for the mean of the 10 equations and the best equation for each depth level using basic and complex arithmetic operators. It can be seen that the equations using complex arithmetic operators achieve a higher F1-score at tree depth levels 6 and 8 than the equations using basic arithmetic operators. Additionally, the F1-score improves substantially when the maximum tree depth is changed from 4 to 6. However, there is very little improvement, less than 0.2% on average, when the tree depth is changed from 6 to 8. Thus, the equation has its optimum F1-score at a maximum tree depth of 8. GP can be computationally intensive, especially when training on larger datasets with more features and rows, when complex arithmetic operators are included, and when the tree depth level is deeper. In our experiments, it took 4 hours to execute a GP experiment at tree depth level 8 with complex arithmetic operators, compared to 1 hour at tree depth level 4 with basic arithmetic operators. Thus, the mathematical model with complex arithmetic operators requires a large amount of computational power, as it needs to solve a set of nonlinear equations.

Keep in mind that the best equation is chosen for its highest F1-score and fitness score. In the event that a similar F1-score is achieved with the same fitness score, the F1-score for the fake label is considered. We note that the F1-score for the fake label is more important than the F1-score for the real label in the fake news detection task.

The best equations using basic arithmetic operators at maximum tree depths of 4, 6, and 8 are respectively given by Equations (10), (11), and (12). Meanwhile, the best equations using complex arithmetic operators at maximum tree depths of 4, 6, and 8 are given by Equations (13), (14), and (15), respectively. Table 7 shows that among all the best equations from each tree depth level, Equation (15), with complex arithmetic operators and tree depth 8, has the highest F1-score and fitness score of 84.39% and 417/491, respectively. The F1-score in Table 7 only reflects 16% of the dataset because 491 is the entire validation data, which makes up 16% of the dataset. An F1-score of 83.23% is attained when the best Equation (15) is evaluated on 80% of the data, which includes both the training and validation data. One of the main limitations of GP is that, like other optimization algorithms, it can get stuck in local optima, which are suboptimal solutions that are not the global best [63]. This can make it difficult for GP to find the global best solution to a problem. In other words, GP may generate similar mathematical expressions when stuck in local optima. Thus, it is recommended to run each experiment 10 times. This approach allows exploration from different starting points and helps escape local optima. We also noticed that GP can be sensitive to the initial population and can produce different results depending on how it is initialized. The initial population is set to 100 to form a larger population that allows for more diverse solutions and greater exploration, increasing the chances of finding better solutions. This, however, can make it difficult to reproduce similar results.

In this study, we observed that GP has a number of advantages, including automation, the ability to learn from examples, and no requirement for domain expertise. Firstly, GP uses

TABLE 7. Performance of the equations generated by genetic programming.
evolutionary search to solve problems involving high-dimensional data that are time-consuming for humans to solve manually. Secondly, GP can learn from the newly curated fake news dataset and generate mathematical expressions that achieve high performance. Thirdly, GP does not require domain-specific knowledge or expert input to generate programs, which makes it a useful tool for solving problems in areas where there is limited domain expertise.

C. SENSITIVITY ANALYSIS (SA)
Given that Equation (15) attains the highest F1-score, SA is applied to this equation. Examining Equation (15) before SA shows that the equation encompasses 13 unique Lang-IND features: avg_word_length, avg_words_sent, cnt_uniquewords, cnt_words, emojis_ratio, exclam_ratio, hashtags_ratio, question, quote, tweet_length, uppercase, urls, and urls_ratio. These features collectively appear 23 times throughout the equation, with certain features recurring more than twice. In SA, the terms, which consist of Lang-IND features and arithmetic operators in the equation, are either removed or augmented. These modified terms, along with their corresponding changes, are presented in Table 8. In Table 8, Step 0 shows the F1-score on 80% of the data before any term removal or augmentation. A total of 36 steps are performed. It is worth noting that eliminating terms
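$$ \begin{aligned} R ={} & -\text{alluppercase} + 0.3 \times \text{avg\_words\_sent} \times \text{exclam} + \text{avg\_words\_sent} \times \text{question\_ratio} + \text{cnt\_sentences} + \text{exclam} \\ & + 0.7 \times \text{quote} \times \text{uppercase\_ratio} - \text{uppercase} \times \text{urls\_ratio} + 0.7 \end{aligned} \tag{10} $$

$$ \begin{aligned} R ={} & \frac{\text{alluppercase}}{\text{avg\_word\_length} \times \text{uppercase}} + \text{avg\_word\_length}^2 \times \text{emojis\_ratio} \times \text{quote\_ratio} \times \text{tweet\_length} + \frac{1.6667 \times \text{exclam}}{\text{uppercase\_ratio}} \\ & + \frac{0.4286 \times \text{question\_ratio}}{\text{uppercase\_ratio} \times \text{urls\_ratio}} + \frac{0.4 \times \text{quote\_ratio}}{\text{urls\_ratio}} + \frac{0.7}{\text{tweet\_length}} + \frac{0.2857}{\text{cnt\_sentences} \times \text{uppercase\_ratio}^2} \\ & + \frac{\text{cnt\_sentences}}{\text{avg\_word\_length} \times \text{uppercase\_ratio}} + \frac{0.3333}{\text{avg\_word\_length} \times \text{tweet\_length}^2} + \frac{2.0 \times \text{question}}{\text{avg\_word\_length} \times \text{puncs\_ratio}} \end{aligned} \tag{11} $$

$$ \begin{aligned} R ={} & \frac{\text{TTR}}{\text{uppercase}} + \frac{\text{cnt\_sentences}}{\text{uppercase}} + \frac{3.5714 \times \text{exclam\_ratio}}{\text{uppercase\_ratio}} + \frac{\text{puncs\_ratio} \times \text{quote\_ratio} \times \text{uppercase}}{\text{tweet\_length}} + \text{question\_ratio} + 0.1 \times \text{quote} \\ & - \text{urls\_ratio} + 0.21 + \frac{0.1786 \times \text{hashtags}}{\text{avg\_words\_sent} \times \text{uppercase}} + \frac{55.5556 \times \text{emojis\_ratio} \times \text{quote} \times \text{uppercase}}{\text{avg\_words\_sent}^2 \times \text{hashtags} \times \text{tweet\_length} \times \text{urls}} \\ & + \frac{2.0 \times \text{question} \times \text{quote}}{\text{TTR} \times \text{hashtags} \times \text{tweet\_length}} \end{aligned} \tag{12} $$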
−uppercase_ratio
R = exclam_ratio − abs(hashtags_ratio) × uppercase_ratio2 + TTR2 × (uppercase_ratio2 ) abs(urls_ratio)
question × abs(question) × abs(question)−quote × sign(puncs_ratio) × sign(quote)

+ (13)
(puncs_ratio × urls_ratio)
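To make the use of a raw equation concrete, the sketch below evaluates Equation (10), the best depth-4 equation with basic arithmetic operators, on a single tweet and applies the 0.5 threshold of Equation (4). The feature values are hypothetical and chosen only for illustration, and the function name `equation_10` is an assumption of this sketch.

```python
def equation_10(f):
    """Raw GP output R of Equation (10) for a dict of Lang-IND feature values."""
    return (
        -f["alluppercase"]
        + 0.3 * f["avg_words_sent"] * f["exclam"]
        + f["avg_words_sent"] * f["question_ratio"]
        + f["cnt_sentences"]
        + f["exclam"]
        + 0.7 * f["quote"] * f["uppercase_ratio"]
        - f["uppercase"] * f["urls_ratio"]
        + 0.7
    )


# Hypothetical feature values for one tweet (not taken from the dataset).
tweet = {
    "alluppercase": 2.0,
    "avg_words_sent": 10.0,
    "exclam": 1.0,
    "question_ratio": 0.0,
    "cnt_sentences": 1.0,
    "quote": 0.0,
    "uppercase_ratio": 0.1,
    "uppercase": 5.0,
    "urls_ratio": 0.0,
}

r = equation_10(tweet)
is_fake = r >= 0.5  # Equation (4)
```

With these illustrative values the result lies well above the threshold, so the tweet would be labelled as fake.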

TABLE 8. Elimination/Augmentation of terms.

in steps 21, 22, 30, and 35 can greatly deteriorate the F1-score by at least 10%. Meanwhile, eliminating terms in steps 2, 7, 9, 10, 11, 31, 32, and 34 has less than 0.2% impact on the F1-score. Finally, the terms stated in steps 4, 6, 15, and 36 are selected for removal from the equation, as the F1-score remains the same after their elimination during the analysis. This involves the removal of two distinct Lang-IND features, namely, urls and cnt_words, which collectively reduces the appearance of 5 non-distinct features in Equation (15). To further condense Equation (15), the products of the log1p constants below are aggregated, with the outcomes being 0.0674, 0.0606, and 0.0321, respectively.
1) log1p(0.4) × log1p(0.7) × log1p(0.8) × log1p(0.9)
2) log1p(0.5) × log1p(0.7)^3
3) log1p(0.1) × log1p(0.4)
In this analysis, there are four eliminations and three aggregations of terms, which lead to the shortened complex Equation (16). The equation from the SA process is tested on the 20% test data of the D1 dataset and the F1-score obtained is 83.85%.

D. FINAL EQUATION (STAGE 2)
In the second stage, we fine-tune the equation by adding a weight to each term as shown in Equation (17), where w1, w2, w3, ..., w26 are the weights. The weights are generated by the ADE algorithm. The equation incorporating the weights is given by Equation (18). The constants in Equation (18) are aggregated, leading to the final equation, which is labelled as Equation (19). Using the final equation, we obtain a mean F1-score of 84.19% and a best F1-score of 84.44%, as shown in Table 10. We notice a slight improvement from adding the weights, from 83.85% to 84.44%. Table 10 also shows the standard deviation (Stdev) of the mean for several metrics as the representation of the model's parameter uncertainty [64].
√ √
p √ avg_word_length × avg_word_lengthpuncs_ratio × exclam × puncs_ratio × sign(tweet_length)
R = − avg_word_length × urls_ratio + 2
log1p(avg_word_length)
√
uppercase_ratio question_ratio×uppercase_ratio2 3.1623×quoteuppercase_ratio
− avg_word_length 2 + + −0.22 −urls_ratio2
log1p(0.2) log1p(avg_word_length)
sign(question) sign(exclam_ratio)
+ 1.2910 × sign(emojis) + sign(exclam_ratio) + sign(hashtags) + + √ √ (14)
log1p(0.4) ( tweet_length × urls)
√
avg_word_length × abs(question) × log1p(0.5)−exclam_ratio × log1p(emojis_ratio) × sign(avg_words_sent)−avg_word_length
R=
(abs(urls_ratio) × log1p(0.4) × log1p(0.7) × log1p(0.8) × log1p(0.9) × log1p(cnt_uniquewords) × log1p(hashtags_ratio) × uppercase2 × urls2 )
| {z }
Term 1
urls_ratio × log1p(0.5)−exclam_ratio
+
(log1p(0.5) × log1p(0.7)3 × log1p(avg_word_length)7 × log1p(tweet_length) × log1p(uppercase) × log1p(urls_ratio)3 × 0.72 )
| {z }
Term 2
log1p(quote)
+
(log1p(0.7) × log1p(avg_word_length)4 × 0.72 )
| {z }
Term 3
log1p(0.7)
+ (15)
(log1p(0.1) × log1p(0.4) × log1p(avg_word_length)3 × log1p(avg_words_sent) × avg_words_sent 2 × sign(cnt_words))
| {z }
Term 4
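The three aggregated constants reported in the sensitivity analysis (0.0674, 0.0606, and 0.0321) come from the constant log1p factors in the denominators of Term 1, Term 2, and Term 4 of Equation (15). They can be checked numerically with the short sketch below, which uses only Python's standard `math.log1p`; the variable names are chosen for this illustration.

```python
from math import log1p

# Constant factors in the denominators of Term 1, Term 2, and Term 4 of Equation (15).
c1 = log1p(0.4) * log1p(0.7) * log1p(0.8) * log1p(0.9)
c2 = log1p(0.5) * log1p(0.7) ** 3
c3 = log1p(0.1) * log1p(0.4)

print(round(c1, 4), round(c2, 4), round(c3, 4))  # 0.0674 0.0606 0.0321
```

The rounded values match the constants that replace these products in Equation (16).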

It can be observed that the amount of model parameter uncertainty is relatively low, with all metrics falling below 0.5%.

The final equation, denoted as Equation (19), encapsulates a comprehensive mathematical model that unveils the underlying relationships and dependencies between the Lang-IND features and their impact on the outcome, denoted as R. The numerators and denominators of the equation consist of several terms that involve different features and coefficients, and the ratios between them determine the value of R. The coefficients determine the magnitude and direction of the impact that each feature has on the equation. They allow more or less weight to be assigned to specific features, thereby emphasizing or reducing their impact on the overall result. An example of a coefficient that lowers the importance of a feature is 0.0091 × log1p(avg_word_length)^7 in Equation (19). The equation includes mathematical operations such as square roots, absolute values, exponentiation, and logarithmic transformations, which modify the feature values. The logarithmic transformation, specifically the log1p(·) function, is applied extensively to several terms across the equation. For example, it is used in the first numerator for terms such as 0.8585 × emojis_ratio and in the first denominator for terms like 0.3001 × cnt_uniquewords and 0.6099 × hashtags_ratio.

Both the log1p(·) function and the square root operation rescale the feature values in the equation. They compress larger feature values, reducing their magnitude; the square root additionally expands values between 0 and 1, increasing their magnitude, whereas log1p(·) leaves small positive values almost unchanged. The square root operation is applied only once, to the term involving 0.5948 × avg_word_length in the numerator. In total, there are eight instances of exponentiation in the final equation. Exponentiation raises the feature values to a certain power, which can modify their magnitude or emphasize their significance, as in (0.9771 × uppercase)^2. The absolute value operation is applied to the term involving 0.3376 × question in the numerator and to the term involving 0.7334 × urls_ratio in the denominator. This operation ensures that the resulting values are non-negative, regardless of the original signs of the expressions.
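The effect of the log1p(·) and square root operations on small and large inputs can be checked numerically. The following is a minimal sketch using only Python's standard `math` module; the helper name `rescale` and the sample values are illustrative assumptions and are not taken from the Fake.my-COVID19 dataset.

```python
import math

def rescale(x):
    """Return (log1p(x), sqrt(x)) for a non-negative feature value x."""
    return math.log1p(x), math.sqrt(x)

# Arbitrary illustrative values (hypothetical, not from the dataset).
for x in (0.01, 0.25, 1.0, 25.0, 400.0):
    lg, rt = rescale(x)
    print(f"x={x:>7}: log1p={lg:.4f}  sqrt={rt:.4f}")
```

For these inputs the square root exceeds x when 0 < x < 1 and falls below it when x > 1, while log1p(x) never exceeds x and stays close to x only for small values.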
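\begin{equation}
\begin{aligned}
R ={} & \frac{\sqrt{\mathrm{avg\_word\_length}} \times \operatorname{abs}(\mathrm{question}) \times \operatorname{log1p}(\mathrm{emojis\_ratio})}{\operatorname{abs}(\mathrm{urls\_ratio}) \times 0.0674 \times \operatorname{log1p}(\mathrm{cnt\_uniquewords}) \times \operatorname{log1p}(\mathrm{hashtags\_ratio}) \times \mathrm{uppercase}^{2}} \\
& + \frac{\mathrm{urls\_ratio} \times \operatorname{log1p}(0.5)^{-\mathrm{exclam\_ratio}}}{0.0606 \times \operatorname{log1p}(\mathrm{avg\_word\_length})^{7} \times \operatorname{log1p}(\mathrm{tweet\_length}) \times \operatorname{log1p}(\mathrm{uppercase}) \times \operatorname{log1p}(\mathrm{urls\_ratio})^{3} \times 0.7^{2}} \\
& + \frac{\operatorname{log1p}(\mathrm{quote})}{\operatorname{log1p}(0.7) \times \operatorname{log1p}(\mathrm{avg\_word\_length})^{4} \times 0.7^{2}} \\
& + \frac{\operatorname{log1p}(0.7)}{0.0321 \times \operatorname{log1p}(\mathrm{avg\_word\_length})^{3} \times \operatorname{log1p}(\mathrm{avg\_words\_sent}) \times \mathrm{avg\_words\_sent}^{2}}
\end{aligned}
\tag{16}
\end{equation}

\begin{equation}
\begin{aligned}
R ={} & \frac{\sqrt{w_{1} \times \mathrm{avg\_word\_length}} \times \operatorname{abs}(w_{2} \times \mathrm{question}) \times \operatorname{log1p}(w_{3} \times \mathrm{emojis\_ratio})}{\operatorname{abs}(w_{4} \times \mathrm{urls\_ratio}) \times (w_{5} \times 0.0674) \times \operatorname{log1p}(w_{6} \times \mathrm{cnt\_uniquewords}) \times \operatorname{log1p}(w_{7} \times \mathrm{hashtags\_ratio}) \times (w_{8} \times \mathrm{uppercase})^{2}} \\
& + \frac{w_{9} \times \mathrm{urls\_ratio} \times \operatorname{log1p}(w_{10} \times 0.5)^{(w_{11} \times -\mathrm{exclam\_ratio})}}{w_{12} \times 0.0606 \times \operatorname{log1p}(w_{13} \times \mathrm{avg\_word\_length})^{7} \times \operatorname{log1p}(w_{14} \times \mathrm{tweet\_length}) \times \operatorname{log1p}(w_{15} \times \mathrm{uppercase}) \times \operatorname{log1p}(w_{16} \times \mathrm{urls\_ratio})^{3} \times (w_{17} \times 0.7)^{2}} \\
& + \frac{\operatorname{log1p}(w_{18} \times \mathrm{quote})}{\operatorname{log1p}(w_{19} \times 0.7) \times \operatorname{log1p}(w_{20} \times \mathrm{avg\_word\_length})^{4} \times (w_{21} \times 0.7)^{2}} \\
& + \frac{\operatorname{log1p}(w_{22} \times 0.7)}{w_{23} \times 0.0321 \times \operatorname{log1p}(w_{24} \times \mathrm{avg\_word\_length})^{3} \times \operatorname{log1p}(w_{25} \times \mathrm{avg\_words\_sent}) \times (w_{26} \times \mathrm{avg\_words\_sent})^{2}}
\end{aligned}
\tag{17}
\end{equation}

\begin{equation}
\begin{aligned}
R ={} & \frac{\sqrt{0.5948 \times \mathrm{avg\_word\_length}} \times \operatorname{abs}(0.3376 \times \mathrm{question}) \times \operatorname{log1p}(0.8585 \times \mathrm{emojis\_ratio})}{\operatorname{abs}(0.7334 \times \mathrm{urls\_ratio}) \times 0.0509 \times \operatorname{log1p}(0.3001 \times \mathrm{cnt\_uniquewords}) \times \operatorname{log1p}(0.6099 \times \mathrm{hashtags\_ratio}) \times (0.9771 \times \mathrm{uppercase})^{2}} \\
& + \frac{0.1095 \times \mathrm{urls\_ratio} \times \operatorname{log1p}(0.6753 \times 0.5)^{(-\mathrm{exclam\_ratio})}}{0.0279 \times \operatorname{log1p}(\mathrm{avg\_word\_length})^{7} \times \operatorname{log1p}(0.1590 \times \mathrm{tweet\_length}) \times \operatorname{log1p}(0.8836 \times \mathrm{uppercase}) \times \operatorname{log1p}(0.8817 \times \mathrm{urls\_ratio})^{3} \times (0.8148 \times 0.7)^{2}} \\
& + \frac{\operatorname{log1p}(\mathrm{quote})}{\operatorname{log1p}(0.7) \times \operatorname{log1p}(\mathrm{avg\_word\_length})^{4} \times 0.7^{2}} \\
& + \frac{\operatorname{log1p}(0.6380 \times 0.7)}{0.0321 \times \operatorname{log1p}(0.5935 \times \mathrm{avg\_word\_length})^{3} \times \operatorname{log1p}(0.3146 \times \mathrm{avg\_words\_sent}) \times (0.7254 \times \mathrm{avg\_words\_sent})^{2}}
\end{aligned}
\tag{18}
\end{equation}

\begin{equation}
\begin{aligned}
R ={} & \frac{\sqrt{0.5948 \times \mathrm{avg\_word\_length}} \times \operatorname{abs}(0.3376 \times \mathrm{question}) \times \operatorname{log1p}(0.8585 \times \mathrm{emojis\_ratio})}{\operatorname{abs}(0.7334 \times \mathrm{urls\_ratio}) \times 0.0509 \times \operatorname{log1p}(0.3001 \times \mathrm{cnt\_uniquewords}) \times \operatorname{log1p}(0.6099 \times \mathrm{hashtags\_ratio}) \times (0.9771 \times \mathrm{uppercase})^{2}} \\
& + \frac{0.1095 \times \mathrm{urls\_ratio} \times \operatorname{log1p}(0.3377)^{(-\mathrm{exclam\_ratio})}}{0.0091 \times \operatorname{log1p}(\mathrm{avg\_word\_length})^{7} \times \operatorname{log1p}(0.1590 \times \mathrm{tweet\_length}) \times \operatorname{log1p}(0.8836 \times \mathrm{uppercase}) \times \operatorname{log1p}(0.8817 \times \mathrm{urls\_ratio})^{3}} \\
& + \frac{\operatorname{log1p}(\mathrm{quote})}{0.26 \times \operatorname{log1p}(\mathrm{avg\_word\_length})^{4}} \\
& + \frac{0.3692}{0.0321 \times \operatorname{log1p}(0.5935 \times \mathrm{avg\_word\_length})^{3} \times \operatorname{log1p}(0.3146 \times \mathrm{avg\_words\_sent}) \times (0.7254 \times \mathrm{avg\_words\_sent})^{2}}
\end{aligned}
\tag{19}
\end{equation}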
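Equation (19) can be transcribed directly into code for experimentation. The sketch below is a minimal transcription in Python using only the standard library; the function name `equation_19`, the dictionary-based feature interface, and the sample feature values are illustrative assumptions rather than part of the paper. It does not guard against zero denominators (for example when `urls_ratio` or `uppercase` is 0), and since the threshold applied to R for classification is not stated in this section, none is assumed.

```python
import math

def equation_19(f):
    """Evaluate R from Equation (19) for a dict of Lang-IND feature values."""
    log1p, sqrt = math.log1p, math.sqrt
    awl = f["avg_word_length"]
    # First term: square root, absolute value and log1p in the numerator.
    term1 = (
        sqrt(0.5948 * awl)
        * abs(0.3376 * f["question"])
        * log1p(0.8585 * f["emojis_ratio"])
    ) / (
        abs(0.7334 * f["urls_ratio"]) * 0.0509
        * log1p(0.3001 * f["cnt_uniquewords"])
        * log1p(0.6099 * f["hashtags_ratio"])
        * (0.9771 * f["uppercase"]) ** 2
    )
    # Second term: log1p(0.3377) raised to the power -exclam_ratio.
    term2 = (
        0.1095 * f["urls_ratio"] * log1p(0.3377) ** (-f["exclam_ratio"])
    ) / (
        0.0091 * log1p(awl) ** 7
        * log1p(0.1590 * f["tweet_length"])
        * log1p(0.8836 * f["uppercase"])
        * log1p(0.8817 * f["urls_ratio"]) ** 3
    )
    term3 = log1p(f["quote"]) / (0.26 * log1p(awl) ** 4)
    term4 = 0.3692 / (
        0.0321 * log1p(0.5935 * awl) ** 3
        * log1p(0.3146 * f["avg_words_sent"])
        * (0.7254 * f["avg_words_sent"]) ** 2
    )
    return term1 + term2 + term3 + term4

# Hypothetical feature values, chosen only so that every denominator is non-zero.
sample = {
    "avg_word_length": 5.0, "question": 1.0, "emojis_ratio": 0.1,
    "urls_ratio": 0.5, "cnt_uniquewords": 20.0, "hashtags_ratio": 0.2,
    "uppercase": 3.0, "exclam_ratio": 0.5, "tweet_length": 120.0,
    "quote": 1.0, "avg_words_sent": 10.0,
}
print(equation_19(sample))
```

Writing the four ratio terms as separate variables keeps the correspondence with Equation (19) easy to audit and makes it straightforward to inspect how much each term contributes to R for a given tweet.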

In particular, the final Equation (19) unveils a significant characteristic of fake news. Many existing machine learning models prioritize accuracy enhancement at the cost of model transparency. However, our proposed mathematical model provides deeper insights into how a certain tweet can be classified as fake news. The final Equation (19) reveals that out of the 25 Lang-IND features presented in the paper, only 11 features are utilized. Most notably, the model reveals repeated appearances of certain features, such as the average number of characters per word and URLs per sentence, which appear four times and three times, respectively. In addition, both the average number of words per sentence in the tweet and the number of uppercase characters in the tweet appear twice in the final equation. The repeated appearance of certain features indicates their importance or influence in the equation and suggests that they play a significant role in the model's predictions of fake news. This holds at least in the case of COVID-19 news. We note that the usage of fewer words by fake news spreaders serves as a notable indicator, as they attempt to justify their purpose with brevity. Additionally, these spreaders attach fewer URLs to their tweets and do not provide credible sources to shore up confidence. Such tendencies strongly suggest the presence of fake news. Another clear indication is the excessive use of exclamation marks in tweet sentences. Fake news spreaders tend to use more exclamation marks, whereas real messages exhibit a lesser tendency to do so. These observations offer valuable cues to distinguish between fake and real news.

TABLE 10. Stage 2 results.

E. PERFORMANCE ON UNSEEN D2 AND D3 DATASETS
This evaluation draws inspiration from a study [20] that introduced a zero-shot testing approach, in which an M-BERT model was pretrained on both English and Bengali and then tested on Hindi, a third language. The study's experimental results demonstrated that the zero-shot approach achieved comparable accuracy. Since the Indic languages Bengali and Hindi belong to the Indo-Aryan family of languages, they share similar syntactic constructs, which appears to facilitate cross-lingual transfer learning and contribute to the high accuracy achieved.

To evaluate the performance of our proposed approach, we measured the final equation on D2, a COVID-19 fake news dataset in English, and obtained an F1-score of 65.98%. We also evaluated the same final equation on D3, a political fake news dataset in Portuguese, and achieved an F1-score of 64.22%. These results are acceptable given that our model was not trained on the two datasets. The evaluation metrics for the D2 and D3 datasets are listed in Table 9. Comparatively, the TP and TN values are higher than the FP and FN values. The results indicate that the derived mathematical equation model can be domain-independent and language-independent.

TABLE 9. Evaluation metrics on D2 and D3 datasets.

While our equation was trained on the D1 dataset, which is in mixed Malay-English language, we chose to evaluate the final mathematical expression on a Portuguese fake news dataset (D3) to demonstrate the generalizability of our proposed approach to a different language. The language can affect the results due to differences in syntax, vocabulary, and grammar between the Malay, English and Portuguese languages. By testing the performance of our final equation on a Portuguese fake news dataset, this evaluation can provide insights into the potential for our approach to be adapted to other languages and may pave the way for its application in multilingual settings.

V. CONCLUSION AND FUTURE WORK
In this study, we have addressed fake news detection research from three different approaches: a data-oriented approach, a feature-oriented approach, and a model-oriented approach. In the data-oriented approach, we have created Fake.my-COVID19, the first fake news dataset in mixed Malay and English language, enabling other researchers to use it for further analysis. In the feature-oriented approach, we have designed 25 Lang-IND features as the input into the machine learning model. As for the model-oriented approach, we have proposed a two-stage evolutionary approach to detect fake news. Based on the experimental results, tree-based GP can learn from the Fake.my-COVID19 dataset and generate the best equation using complex operators, with an F1-score of 83.23%. It can be seen from the GP results that the equations using complex operators achieve a higher F1-score than the equations using basic operators. Moreover, increasing the tree depth generates longer equations and adds more complex combinations of terms to the equation to achieve the optimum F1-score. However, we see that increasing the tree depth beyond a level of 6 can be computationally intensive and does not improve the model performance much. It can also be concluded that GP does not require domain-specific knowledge or expert input to generate the mathematical equation model, which makes it a useful tool for solving problems in areas where there is limited domain expertise. In the sensitivity analysis, the length of the best equation is further reduced by 4 terms, while maintaining the F1-score. In the second stage, the ADE algorithm is applied to
fine-tune the shortened equation. A weight is applied in front of every term in the equation to further boost the F1-score to 84.44%. The high performance achieved by the mathematical equation in fake news detection shows that there are strong correlations between the Lang-IND features created in this work. Moreover, the two-stage evolutionary approach performs better than the baseline created using the six commonly used machine learning algorithms. Finally, the optimized final equation is evaluated on two unseen datasets, resulting in F1-scores of 65.98% and 64.22%, respectively. Thus, a mathematical equation model for fake news detection is shown to be feasible and can be domain-independent and language-independent.

This work presents a fake news detection model using a two-stage evolutionary approach with Lang-IND features on a new mixed-language fake news dataset. It can be extended by curating more fake news datasets (in mixed Malay and English language) in other domain topics. This will encourage more researchers with a limited timeframe to focus on building fake news detection models, in particular for low-resource languages like Malay. The work can also be extended from the feature-oriented approach by expanding the Lang-IND features or combining them with state-of-the-art word embeddings such as BERT, GloVe, and XLNet embeddings. From the model-oriented approach, the current work can be enhanced with a different fitness function in the tree-based GP. The idea is to find a better fitness function that can produce better generations of trees. Many machine learning and deep learning algorithms have been explored in the field of fake news detection. Unlike the approach proposed in this research, most current evolutionary approaches in fake news detection models are used for feature selection and not for classifying the text. We are of the opinion that other metaheuristic algorithms, in particular swarm intelligence algorithms such as Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), Grey Wolf Optimization (GWO) and Salp Swarm Optimization (SSO), can be explored and chosen to fine-tune and optimize the equation in future research work. In this work, the final equation has also been tested on other fake news datasets and achieved a reasonable performance considering that the model has not been trained on them. Hence, the aim is to build a generalized final equation that could be used on other fake news datasets to demonstrate the ultimate goal of domain-independence and language-independence of a fake news detection model.

REFERENCES
[1] Y. M. Rocha, G. A. de Moura, G. A. Desidério, C. H. de Oliveira, F. D. Lourenço, and L. D. de Figueiredo Nicolete, “The impact of fake news on social media and its influence on health during the COVID-19 pandemic: A systematic review,” J. Public Health, vol. 31, pp. 1–10, Oct. 2021.
[2] E. Bozzola, G. Spina, R. Russo, M. Bozzola, G. Corsello, and A. Villani, “Mandatory vaccinations in European countries, undocumented information, false news and the impact on vaccination uptake: The position of the Italian pediatric society,” Italian J. Pediatrics, vol. 44, no. 1, pp. 1–4, Dec. 2018.
[3] G. D. Domenico, J. Sit, A. Ishizaka, and D. Nunan, “Fake news, social media and marketing: A systematic review,” J. Bus. Res., vol. 124, pp. 329–341, Jan. 2021.
[4] M. K. David, C. Dealwis, and K. C. Hei, “Language policy and language use in multilingual Malaysia,” in In Pursuit of Societal Harmony: Reviewing the Experiences and Approaches in Officially Monolingual and Officially Multilingual Countries. South Africa: SUN MeDia Bloemfontein, 2017, p. 83.
[5] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake news detection on social media: A data mining perspective,” ACM SIGKDD Explor. Newslett., vol. 19, no. 1, pp. 22–36, 2017.
[6] T. Jiang, J. P. Li, A. U. Haq, and A. Saboor, “Fake news detection using deep recurrent neural networks,” in Proc. 17th Int. Comput. Conf. Wavelet Act. Media Technol. Inf. Process. (ICCWAMTIP), Dec. 2020, pp. 205–208.
[7] S. Verma, A. Paul, S. S. Kariyannavar, and R. Katarya, “Understanding the applications of natural language processing on COVID-19 data,” in Proc. 4th Int. Conf. Electron., Commun. Aerosp. Technol. (ICECA), Nov. 2020, pp. 1157–1162.
[8] M. G. Hussain, M. R. Hasan, M. Rahman, J. Protim, and S. Al Hasan, “Detection of Bangla fake news using MNB and SVM classifier,” 2020, arXiv:2005.14627.
[9] S. B. S. Mugdha, S. M. Ferdous, and A. Fahmin, “Evaluating machine learning algorithms for Bengali fake news detection,” in Proc. 23rd Int. Conf. Comput. Inf. Technol. (ICCIT), Dec. 2020, pp. 1–6.
[10] G. Jardaneh, H. Abdelhaq, M. Buzz, and D. Johnson, “Classifying Arabic tweets based on credibility using content and user features,” in Proc. IEEE Jordan Int. Joint Conf. Electr. Eng. Inf. Technol. (JEEIT), Apr. 2019, pp. 596–601.
[11] H. S. Al-Ash, M. F. Putri, P. Mursanto, and A. Bustamam, “Ensemble learning approach on Indonesian fake news classification,” in Proc. 3rd Int. Conf. Informat. Comput. Sci. (ICICoS), Oct. 2019, pp. 1–6.
[12] A. Rusli, J. C. Young, and N. M. S. Iswari, “Identifying fake news in Indonesian via supervised binary text classification,” in Proc. IEEE Int. Conf. Ind., Artif. Intell., Commun. Technol. (IAICT), Jul. 2020, pp. 86–90.
[13] A. Prasetyo, B. D. Septianto, G. F. Shidik, and A. Z. Fanani, “Evaluation of feature extraction TF-IDF in Indonesian hoax news classification,” in Proc. Int. Seminar Appl. Technol. Inf. Commun. (iSemantic), Sep. 2019, pp. 1–6.
[14] K. Ivancová, M. Sarnovský, and V. Maslej-Krešňáková, “Fake news detection in Slovak language using deep learning techniques,” in Proc. IEEE 19th World Symp. Appl. Mach. Intell. Informat. (SAMI), Jan. 2021, pp. 255–260.
[15] N. R. de Oliveira, D. S. V. Medeiros, and D. M. F. Mattos, “A sensitive stylistic approach to identify fake news on social networking,” IEEE Signal Process. Lett., vol. 27, pp. 1250–1254, 2020.
[16] P. K. Verma, P. Agrawal, I. Amorim, and R. Prodan, “WELFake: Word embedding over linguistic features for fake news detection,” IEEE Trans. Computat. Social Syst., vol. 8, no. 4, pp. 881–893, Aug. 2021.
[17] K. Hayawi, S. Shahriar, M. A. Serhani, I. Taleb, and S. S. Mathew, “ANTi-Vax: A novel Twitter dataset for COVID-19 vaccine misinformation detection,” Public Health, vol. 203, pp. 23–30, Feb. 2022.
[18] T. Jiang, J. P. Li, A. U. Haq, A. Saboor, and A. Ali, “A novel stacking approach for accurate detection of fake news,” IEEE Access, vol. 9, pp. 22626–22639, 2021.
[19] D. S. Abdelminaam, F. H. Ismail, M. Taha, A. Taha, E. H. Houssein, and A. Nabil, “CoAID-DEEP: An optimized intelligent framework for automated detecting COVID-19 misleading information on Twitter,” IEEE Access, vol. 9, pp. 27840–27867, 2021.
[20] D. Kar, M. Bhardwaj, S. Samanta, and A. P. Azad, “No rumours please! A multi-indic-lingual approach for COVID fake-tweet detection,” in Proc. Grace Hopper Celebration India (GHCI), Feb. 2021, pp. 1–5.
[21] A. Hande, K. Puranik, R. Priyadharshini, S. Thavareesan, and B. R. Chakravarthi, “Evaluating pretrained transformer-based models for COVID-19 fake news detection,” in Proc. 5th Int. Conf. Comput. Methodol. Commun. (ICCMC), Apr. 2021, pp. 766–772.
[22] S. D. Das, A. Basak, and S. Dutta, “A heuristic-driven ensemble framework for COVID-19 fake news detection,” in Proc. Int. Workshop Combating Online Hostile Posts Regional Lang. During Emergency Situation, 2021, pp. 164–176.
[23] H. Q. Abonizio, J. I. de Morais, G. M. Tavares, and S. B. Junior, “Language-independent fake news detection: English, Portuguese, and Spanish mutual features,” Future Internet, vol. 12, no. 5, p. 87, May 2020.
[24] I. Vogel and M. Meghana, “Detecting fake news spreaders on Twitter from a multilingual perspective,” in Proc. IEEE 7th Int. Conf. Data Sci. Adv. Anal. (DSAA), Oct. 2020, pp. 599–606.
[25] P. H. A. Faustini and T. F. Covões, “Fake news detection in multiple platforms and languages,” Expert Syst. Appl., vol. 158, Nov. 2020, Art. no. 113503.
[26] F. A. Ozbay and B. Alatas, “Fake news detection within online social media using supervised artificial intelligence algorithms,” Phys. A, Stat. Mech. Appl., vol. 540, Feb. 2020, Art. no. 123174.
[27] A. Kesarwani, S. S. Chauhan, and A. R. Nair, “Fake news detection on social media using K-nearest neighbor classifier,” in Proc. Int. Conf. Adv. Comput. Commun. Eng. (ICACCE), Jun. 2020, pp. 1–4.
[28] F. A. Ozbay and B. Alatas, “A novel approach for detection of fake news on social media using metaheuristic optimization algorithms,” Elektronika Elektrotechnika, vol. 25, no. 4, pp. 62–67, Aug. 2019.
[29] F. A. Ozbay and B. Alatas, “Adaptive salp swarm optimization algorithms with inertia weights for novel fake news detection model in online social media,” Multimedia Tools Appl., vol. 80, nos. 26–27, pp. 34333–34357, Nov. 2021.
[30] G. Yildirim, “A novel hybrid multi-thread metaheuristic approach for fake news detection in social media,” Appl. Intell., vol. 53, pp. 1–21, Sep. 2022.
[31] U. Bhowan, M. Johnston, and M. Zhang, “Developing new fitness functions in genetic programming for classification with unbalanced data,” IEEE Trans. Syst., Man, Cybern., B. Cybern., vol. 42, no. 2, pp. 406–421, 2011.
[32] A. D. Mehr, V. Nourani, E. Kahya, B. Hrnjica, A. M. A. Sattar, and Z. M. Yaseen, “Genetic programming in water resources engineering: A state-of-the-art review,” J. Hydrol., vol. 566, pp. 643–667, Nov. 2018.
[33] L. W. Santoso, B. Singh, S. S. Rajest, R. Regin, and K. H. Kadhim, “A genetic programming approach to binary classification problem,” EAI Endorsed Trans. Energy Web, vol. 8, no. 31, p. e11, 2021.
[34] M. Smith, A. Richardson, B. Brown, G. Dozier, M. King, and J. Morris, “A study of the impact of evolutionary-based feature selection for fake news detection,” in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Dec. 2020, pp. 1859–1865.
[35] V. Sabeeh, M. Zohdy, and R. Al Bashaireh, “Enhancing the fake news detection by applying effective feature selection based on semantic sources,” in Proc. Int. Conf. Comput. Sci. Comput. Intell. (CSCI), Dec. 2019, pp. 1365–1370.
[36] B. Al-Ahmad, A. M. Al-Zoubi, R. A. Khurma, and I. Aljarah, “An evolutionary fake news detection method for COVID-19 pandemic information,” Symmetry, vol. 13, no. 6, p. 1091, Jun. 2021.
[37] D. Choudhury and T. Acharjee, “A novel approach to fake news detection in social networks using genetic algorithm applying machine learning classifiers,” Multimedia Tools Appl., vol. 82, no. 6, pp. 9029–9045, Mar. 2023.
[38] K. Staats, E. Pantridge, M. Cavaglia, I. Milovanov, and A. Aniyan, “TensorFlow enabled genetic programming,” 2017, arXiv:1708.03157.
[39] S. Galal, N. Nagy, and M. E. El-Sharkawi, “CNMF: A community-based fake news mitigation framework,” Information, vol. 12, no. 9, p. 376, Sep. 2021.
[40] J. Kim, B. Tabibian, A. Oh, B. Schölkopf, and M. Gomez-Rodriguez, “Leveraging the crowd to detect and reduce the spread of fake news and misinformation,” in Proc. 11th ACM Int. Conf. Web Search Data Mining, Feb. 2018, pp. 324–332.
[41] P. Patwa, S. Sharma, S. Pykl, V. Guptha, G. Kumari, M. S. Akhtar, A. Ekbal, A. Das, and T. Chakraborty, “Fighting an infodemic: COVID-19 fake news dataset,” in Proc. Int. Workshop Combating Online Hostile Posts Regional Lang. During Emergency Situation, 2021, pp. 21–29.
[42] M. I. M. García and A. M. B. de Souza, “Lexical similarity level between English and Portuguese,” Estudios de Lingüística Inglesa Aplicada (ELIA), vol. 14, pp. 145–163, 2014, doi: 10.12795/elia.2014.i14.06.
[43] S. H. Kong, L. M. Tan, K. H. Gan, and N. H. Samsudin, “Fake news detection using deep learning,” in Proc. IEEE 10th Symp. Comput. Appl. Ind. Electron. (ISCAIE), Apr. 2020, pp. 102–107.
[44] S. A. Alameri and M. Mohd, “Comparison of fake news detection using machine learning and deep learning techniques,” in Proc. 3rd Int. Cyber Resilience Conf. (CRC), Jan. 2021, pp. 1–6.
[45] Y. Wu et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” 2016, arXiv:1609.08144.
[46] S. Schlosser, D. Toninelli, and M. Cameletti, “Comparing methods to collect and geolocate tweets in Great Britain,” J. Open Innov., Technol., Market, Complex., vol. 7, no. 1, p. 44, Mar. 2021.
[47] F. Alam, S. Shaar, F. Dalvi, H. Sajjad, A. Nikolov, H. Mubarak, G. Da San Martino, A. Abdelali, N. Durrani, K. Darwish, A. Al-Homaid, W. Zaghouani, T. Caselli, G. Danoe, F. Stolk, B. Bruntink, and P. Nakov, “Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society,” 2020, arXiv:2005.00033.
[48] V. L. Rubin, Y. Chen, and N. K. Conroy, “Deception detection for news: Three types of fakes,” Proc. Assoc. Inf. Sci. Technol., vol. 52, no. 1, pp. 1–4, Jan. 2015.
[49] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, no. 11, pp. 2579–2605, 2008.
[50] N. Le, H. N. Xuan, A. Brabazon, and T. P. Thi, “Complexity measures in genetic programming learning: A brief review,” in Proc. IEEE Congr. Evol. Comput. (CEC), Jul. 2016, pp. 2409–2416.
[51] C. Qi, X. Tang, X. Dong, Q. Chen, A. Fourie, and E. Liu, “Towards intelligent mining for backfill: A genetic programming-based method for strength forecasting of cemented paste backfill,” Minerals Eng., vol. 133, pp. 69–79, Mar. 2019.
[52] A. Cano and B. Krawczyk, “Evolving rule-based classifiers with genetic programming on GPUs for drifting data streams,” Pattern Recognit., vol. 87, pp. 248–268, Mar. 2019.
[53] R. D. Al-Dabbagh, F. Neri, N. Idris, and M. S. Baba, “Algorithmic design issues in adaptive differential evolution schemes: Review and taxonomy,” Swarm Evol. Comput., vol. 43, pp. 284–311, Dec. 2018.
[54] Z. Huang and Y. Chen, “An improved differential evolution algorithm based on adaptive parameter,” J. Control Sci. Eng., vol. 2013, pp. 1–5, Sep. 2013.
[55] T. Gunasegaran and Y.-N. Cheah, “Evolutionary cross validation,” in Proc. 8th Int. Conf. Inf. Technol. (ICIT), May 2017, pp. 89–95.
[56] C. E. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mobile Comput. Commun. Rev., vol. 5, no. 1, pp. 3–55, 2001.
[57] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[58] F. Tang and H. Ishwaran, “Random forest missing data algorithms,” Stat. Anal. Data Mining, ASA Data Sci. J., vol. 10, no. 6, pp. 363–377, Dec. 2017.
[59] A. Agarwal and A. Dixit, “Fake news detection: An ensemble learning approach,” in Proc. 4th Int. Conf. Intell. Comput. Control Syst. (ICICCS), May 2020, pp. 1178–1183.
[60] K. Shah, H. Patel, D. Sanghvi, and M. Shah, “A comparative analysis of logistic regression, random forest and KNN models for the text classification,” Augmented Hum. Res., vol. 5, no. 1, pp. 1–16, Dec. 2020.
[61] E. Tacchini, G. Ballarin, M. L. D. Vedova, S. Moret, and L. de Alfaro, “Some like it hoax: Automated fake news detection in social networks,” 2017, arXiv:1704.07506.
[62] R. Fletcher, Practical Methods of Optimization. Hoboken, NJ, USA: Wiley, 2013.
[63] R. Guha, M. Ghosh, S. Kapri, S. Shaw, S. Mutsuddi, V. Bhateja, and R. Sarkar, “Deluge based genetic algorithm for feature selection,” Evol. Intell., vol. 14, no. 2, pp. 357–367, Jun. 2021.
[64] K. Parasuraman, A. Elshorbagy, and B. C. Si, “Estimating saturated hydraulic conductivity using genetic programming,” Soil Sci. Soc. Amer. J., vol. 71, no. 6, pp. 1676–1684, Nov. 2007.

JEFFERY T. H. KONG (Member, IEEE) received the B.Sc. degree in computer science from the University of Hertfordshire, in 2003. He is currently pursuing the M.Phil. degree with Curtin University Malaysia. He was an IT Manager with Samling Group of Companies. His current research interests include explainable artificial intelligence (XAI), SQL database management, managing ERP system data, and network infrastructure.

W. K. WONG received the M.Eng. and Ph.D. degrees from Universiti Malaysia Sabah, in 2012 and 2016, respectively. Prior to joining academia, he was with the telecommunication and building services industry. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, Curtin University Malaysia. His current research interests include embedded system development, machine learning applications, and image processing.

FILBERT H. JUWONO (Senior Member, IEEE) received the B.Eng. degree in electrical engineering and the M.Eng. degree in telecommunication engineering from the University of Indonesia, Depok, Indonesia, in 2007 and 2009, respectively, and the Ph.D. degree in electrical and electronic engineering from The University of Western Australia, Perth, WA, Australia, in 2017. He is currently with Xi’an Jiaotong–Liverpool University. His current research interests include signal processing for communications, wireless communications, power-line communications, machine learning applications, and biomedical engineering. He was a recipient of the prestigious Australian Awards Scholarship, in 2012. Currently, he serves as an Associate Editor for IEEE ACCESS, a Review Editor for Frontiers in Signal Processing, and the Editor-in-Chief for a newly established journal Green Intelligent Systems and Applications.

CATUR APRIONO (Member, IEEE) received the B.Eng. and M.Eng. degrees in telecommunication engineering from the Department of Electrical Engineering, Universitas Indonesia, Indonesia, in 2009 and 2011, respectively, and the Ph.D. degree in nano vision technology from Shizuoka University, Japan, in 2015. Since 2018, he has been an Assistant Professor of telecommunication engineering with Universitas Indonesia, where he is currently a Lecturer with the Department of Electrical Engineering, Faculty of Engineering. His current research interests include antenna and microwave engineering, terahertz wave technology, and optical communications. He has been a member of the IEEE Antenna and Propagation Society (AP-S) and the IEEE Microwave Theory and Technique Society (MTT-S). He has been involved in the IEEE Joint Chapter MTT/AP Indonesia Section as a Secretary and a Treasurer, in 2017, 2018, and 2019, and has also been active in various chapter activities, such as the First Indonesia–Japan Workshop on Antennas and Wireless Technology (IJAWT) as a Secretary and the 2019 IEEE International Conference on Antenna Measurements Applications (CAMA), Bali, in October 2019, as a Treasurer.