This article has been accepted for publication in IEEE Transactions on Services Computing.
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TSC.2023.3234806
Adversarial Autoencoder Data Synthesis for Enhancing Machine Learning-based Phishing Detection Algorithms
Hossein Shirazi, Management Information Systems, San Diego State University, San Diego, CA, USA, hshirazi@sdsu.edu
Shashika R. Muramudalige, Department of Electrical & Computer Engineering, Colorado State University, Fort Collins, CO, USA, Shashika.Muramudalige@colostate.edu
Indrakshi Ray, Department of Computer Science, Colorado State University, Fort Collins, CO, USA, iray@colostate.edu
Anura P. Jayasumana, Department of Electrical & Computer Engineering, Colorado State University, Fort Collins, CO, USA, Anura.Jayasumana@colostate.edu
Haonan Wang, Department of Statistics, Colorado State University, Fort Collins, CO, USA, Haonan.Wang@colostate.edu
Supervised machine learning is often used to detect phishing websites. However, the scarcity of phishing data for training purposes
limits the classifier’s performance. Further, machine learning algorithms are prone to adversarial attacks: small perturbations on
attack data can bypass the classifier. These problems make machine learning less effective for phishing detection. We propose two
Generative Adversarial Network (GAN) based approaches that synthesize phishing and legitimate samples to mimic real-world
websites. Information about real-world datasets is obtained from ten publicly available phishing datasets which are used by the
AAE (Adversarial Autoencoder) and WGAN (Wasserstein GAN) for generating synthetic data. Using both real and synthesized data,
we demonstrate how to implement classifiers with higher performance and more resistance to adversarial attacks. We propose a
set of hypotheses and validate them through experiments to demonstrate: (i) indistinguishability of synthesized samples from actual
ones, (ii) susceptibility of classifiers to adversarial attacks, (iii) mitigating adversarial attacks by training on larger datasets that
include correctly labeled synthesized samples, and (iv) better performance of classifiers trained on large datasets. Our AAE and
WGAN have been trained on a wide range of datasets, making us optimistic about their widespread applicability.
Index Terms—Phishing Detection, Adversarial Attacks, Adversarial Auto-encoder
I. INTRODUCTION

Phishing attacks, even with sophisticated detection algorithms, still dominate the cyber-crime landscape. The FBI's Internet Crime Complaint Center (IC3) reports phishing (including its various forms such as vishing, smishing, and pharming) to be the most prevalent crime type by number in 2019, with an estimated 12.5 billion USD in financial losses worldwide between 2013 and 2018 [1, 2]. Adversaries learn from their previous attempts to (i) improve attacks and lure more victims and (ii) bypass existing detection algorithms to obtain users' sensitive information [3, 4] for nefarious purposes.

Social engineering attacks in general, and phishing attacks specifically, are problematic not because of vulnerabilities in systems but because of human misjudgment in distinguishing legitimate entities from fake ones. Consequently, several counter-measures have been studied in the literature, differing with respect to the methodologies and the types of attacks they protect against. Machine learning algorithms have shown promising results [5, 6, 7, 8]. Machine learning requires large volumes of labeled data for training the classifiers [9], which are then deployed for detecting phishing websites. Issues related to the unavailability of data in the phishing context are well known [10]. Machine learning algorithms are also prone to attacks, such as techniques devised by attackers for bypassing the classifiers. Our preliminary works on applying machine learning for phishing detection have demonstrated the following challenges.

Data Gathering. The complexities involved in gathering attack data and the reluctance of parties owning datasets to share them due to concerns such as privacy, confidentiality, and liability [11] are barriers that have prevented high-volume phishing datasets from becoming available. There exist repositories that collect links of phishing websites; examples include PhishTank.com and OpenPhish.com. However, such websites only provide a list of links. Creating a labeled phishing dataset involves accessing the links, visiting the malicious websites, extracting the features, and performing classification. These extra steps are complex tasks requiring expertise.

Data Volume. While the volume of the training dataset is critical for obtaining a high accuracy of the detection model, obtaining such a dataset is not an easy task. The small number of existing phishing datasets [10] does not allow the classifier to converge, which yields inconsistent accuracy values.

Adversarial Attack. Malicious users often attempt to poison the training set or bypass the detection algorithms. In the context of phishing detection, the adversary creates new phishing websites, e.g., by manipulating the feature set to bypass the model and evade being caught by the classifiers.

In this paper, we focus on two goals: (i) improving the F1
score of classification algorithms by augmenting the dataset, and (ii) making the detection algorithm robust against adversarial sampling attacks. With respect to the first goal, we propose a deep-learning approach to synthesize new samples that preserve existing data properties. This obviates the need for actual data collection. These samples are added to the training datasets. This is needed when data is unavailable, or the collection process is laborious or infeasible. For the second goal, we use synthesized samples to demonstrate adversarial attacks on the classifier model. We can make the classifiers significantly more attack-resistant by injecting labeled synthesized samples into the training set and re-training the models.

A. Our Approach

We advocate the use of two synthesized data generation algorithms, namely, Adversarial Autoencoder (AAE) and Wasserstein Generative Adversarial Network (WGAN), to mimic websites that match the ones generated by actual attackers. We generate both phishing and legitimate samples, which are used to augment real-world datasets.

The use of synthesized samples solves multiple problems. First, it addresses the scarcity of phishing data. Second, constructing a dataset from phishing sites requires visiting the website and processing the data to make it amenable for analysis. This is labor-intensive and requires visiting malicious, often illegal, websites. Samples gathered from such sites are often very limited in size and, therefore, insufficient for machine learning algorithms. The synthesized data from multiple sites can have novel malicious combinations not present in the individual sites. Third, some synthesized samples can be correctly labeled and inserted into the training dataset to make the algorithms more robust against adversarial attacks.

We formulate several hypotheses to demonstrate different aspects of our work. Our first task is to evaluate if a learning algorithm can distinguish synthesized samples from original ones. We next evaluate if the synthesized phishing samples can circumvent the trained model. This demonstrates the propensity of learning models to exploratory attacks [12], by which an attacker perturbs some features to test if the sample will bypass the classifier. Subsequently, we check if injecting a small portion of synthesized samples, labeled correctly, into the training set results in learning algorithms more resistant to exploratory attacks. Finally, we evaluate if using a smaller number of samples negatively affects the learning scores and whether adding synthesized samples (both legitimate and phishing) improves the F1 score of models, even for original datasets not generated by our AAE and WGAN models.

We conducted experiments on ten phishing datasets [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] and seven classification algorithms, namely Decision Tree (DT), Gradient Boosting (GB), k-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine with two kernels, Linear (SVM(l)) and Gaussian (SVM(r)), and a Deep Neural Network (DNN). The results demonstrate that, with our proposed deep-learning generative approaches, all of the defined hypotheses hold.

B. Key Contributions

Key contributions of this paper are given below.
• We develop two generative models, AAE and WGAN, that can synthesize phishing and legitimate websites that mimic original ones to augment the training dataset. AAE outperforms WGAN in several respects.
• We exemplify the widespread applicability of our approach on ten publicly available phishing datasets and seven different classification algorithms.
• We define a measure to evaluate the closeness of original measured data to synthetic data generated by the proposed algorithms. We show that synthesized data are less likely to be distinguished from original data.
• We quantify the improvement of the F1 score of models by using synthesized data. We demonstrate that adding labeled synthesized samples to the training set increases the performance of classification algorithms.
• We discuss how to design robust classifiers using synthetic data that can withstand adversarial attacks.

The rest of the paper is organized as follows. In Section II, we describe related work on machine learning based phishing detection and provide some background on generative networks. In Section III, we model the attacker and describe how to produce synthesized samples using two different generative networks. In Section IV, we create the experimental configuration and discuss our results. In Section V, we conclude the paper and mention some future work.

II. RELATED WORK

Machine learning algorithms are well-suited for detecting whether a given website is phishing. Earlier machine learning approaches [5, 6, 7, 8, 25, 26] used features from diverse perspectives using public datasets or their own ones. The models were trained on phishing and legitimate datasets to predict whether unknown instances are genuine or phishing.

A. Phishing Detection By Machine Learning

Content-based features. Zhou et al. [27] extracted 154 features based on the content of a webpage using four time-based, two search-based, and 11 heuristic features to create a labeled dataset. They created a balanced dataset with 8180 instances. Zhou et al. concluded that Random Tree was the best classifier, achieving a precision of 99.4% and a 0.1% false positive rate.

Niakanlahiji et al. [5] introduced PhishMon, which uses features derived from HTTP responses, SSL certificates, HTML documents, and JavaScript files. It does not rely on third-party services to extract features, is language agnostic, and detects phishing instances in real-time. The authors reported an accuracy of 95% on their datasets.

Subasi et al. [28] used one of the existing datasets that we have also used in our experiments to compare the performance of the Adaboost and Multiboosting algorithms on phishing detection. They demonstrated that Adaboost outperformed Multiboosting, achieving accuracy scores as high as 97.61%.

Visual similarity. Mao et al. [7] studied visual similarity of phishing and legitimate websites by automatically comparing
the respective Cascading Style Sheets (CSS). The authors proposed a learning-based aggregation analysis mechanism for distinguishing phishing websites from legitimate ones.

Shirazi et al. [29] defined the fingerprint of a legitimate website. The fingerprint uniquely represents a legitimate website by considering both its visual and textual characteristics. Machine learning techniques were used to detect the similarity of suspicious websites with fingerprints of genuine websites. The approach fingerprinted 14 legitimate websites and was tested against 1446 unique phishing samples. The authors reported an accuracy of at least 98% for all legitimate websites. The goal of that work was detecting whether a given website is being targeted by phishers. The goal of our current paper, like most other works on phishing detection, is to check whether a given website is phishing or genuine.

URL-based detection. Phishing instance detection by analyzing the URL of phishing websites has also been proposed. Hong et al. [25] consider only the URL of the website and collect a handful of lexical features that have been proposed by other researchers and combine them with features obtained from the blacklisted domains. The results show F-1 scores of 84%.

Sahinguz et al. [6] used a set of natural language processing based features of URLs of websites. They ran seven different classifiers for detecting phishing websites and achieved a 97.98% accuracy rate. This study is language independent and can detect phishing websites in real-time without needing third-party services.

Al-Ahmadi et al. [30] trained a Long Short-Term Memory (LSTM) network as the generator of a Generative Adversarial Network (GAN) to synthesize new URLs. They also trained a Convolutional Neural Network (CNN) as a discriminator to decide whether the URLs are phishing or not. These two components work together to improve the overall performance. The authors reported a high accuracy of 97.5%.

Kamran et al. [31] proposed a conditional GAN for synthesizing adversarial examples and also detecting phishing URLs. The authors used a game-theory perspective to understand the rationale for the decision-making processes of the attacker and the defender.

Haynes et al. [32] concluded that using deep learning neural networks on URL-based features alone failed to achieve high accuracy in detecting phishing websites. In a separate experiment, the authors used language transformers that represent context-dependent text sequences for detecting phishing websites. These transformers were able to learn directly from the text in URLs and were able to distinguish between legitimate and malicious websites without feature definition and extraction. Transformer-based approaches outperformed other approaches and achieved an accuracy of more than 95% in all cases.

Hybrid approaches. Jain et al. [8] extracted 19 different features from the URL and the source code of websites to distinguish phishing websites from legitimate ones. The features are extracted from the client side and do not rely on third-party services. They achieved a 99.39% true positive rate and an overall accuracy of 99.09%.

B. Data Generation Approaches

Dataset quality and quantity also affect the performance of machine learning algorithms. Shirazi et al. [13] observed that datasets used by researchers are often biased with respect to the features based on the URL or content. Moreover, some of the features become obsolete with time or as new attacks emerge. Sometimes the authors extract features for the first page of legitimate websites, but not the other pages. Machine learning algorithms must be trained on enough data samples, but there is no simple way to estimate the needed dataset size. The right size often depends on the complexity of the problem and that of the learning algorithm and falls under the sample size determination problem.

Figueroa et al. described a sample size prediction algorithm that conducted weighted fitting of learning curves in an active learning algorithm [33]. Active learning systems attempt to minimize the amount of required labeled data and maximize the accuracy of the model by asking queries in the form of unlabeled instances to be marked by another agent such as the domain expert [34].

Small datasets often create inaccurate learning models, so the right dataset size is critically important. Data gathering and labeling are challenging and oftentimes expensive operations. Since getting enough attack data may be infeasible, many data augmentation techniques have been proposed and used in the literature [35, 36, 37, 38]. However, our approach focuses exclusively on phishing samples, and has been tested on a large set of datasets for evaluation.

Shirazi et al. [35] used an adversarial algorithm to generate new synthesized samples for increasing the dataset size. This work further showed how these synthesized phishing samples can evade the classifier. The authors used a heuristic algorithm for feature manipulation in order to generate samples. Our current paper extends the AAE network so that it is capable of generating more sophisticated samples with a well-studied algorithm which ensures that the sample matches real-world data. In addition, we also demonstrate the use of WGAN for generating synthetic samples in this work. Our previous work [35] also does not provide any solution that protects against exploratory attacks. Our current work demonstrates how to train the model to make it resilient to exploratory attacks.

Other domains, including social analytics [39, 40, 41], privacy [42], health informatics [43], and video traffic classification [44], also face the issue of limited data availability and data incompleteness. In many cases, data collection and maintenance require effort and pose challenges due to data privacy, confidentiality, and liability issues. Behavioral and social network data are inherently sparse and incomplete because sometimes the behavioral indicators are not shown or recorded [45]. Muramudalige et al. [46] showed an adversarial data generation technique with novel feature mapping techniques to synthesize sparse, incomplete, and small datasets while mapping into complex objects. The proposed method was validated via three real-world datasets, which were small and incomplete.

Adversarial deep-learning approaches for data generation have been used in several domains due to their accuracy and
efficacy. Goodfellow et al. [47] presented the use of GAN for data generation without requiring a comprehensive problem-specific theoretical basis or empirical verification [48]. The first such GAN architecture [47] is only capable of capturing the precise distribution of continuous and complete data but cannot be used for learning the distribution of discrete variables [49].

Since there is a critical need to capture data distribution with discrete features for diverse application domains such as phishing, medical, crime data, etc., Makhzani et al. [50] proposed the AAE, which is a probabilistic autoencoder that uses the GAN framework as a variational inference algorithm for both discrete and continuous latent variables. Choi et al. [49] focused on learning the distribution of discrete features, such as diagnosis or medication codes, using a combination of an autoencoder and the adversarial framework. Wasserstein GAN (WGAN) [51] is another well-known technique used in various domains for both continuous and discrete distributions, such as image generation [52] and Internet traffic generation [53]. WGAN has a unique loss function (Earth-Mover (EM) distance or Wasserstein-1) to calculate the difference between actual and generated data distributions. Kattadige et al. [44] applied the same feature mapping techniques proposed in [46] with WGAN to synthesize video traffic data for more accurate video type classification. In this work, we use AAE and WGAN to generate more realistic phishing and legitimate website samples.

Chen et al. [54] presented an attack-agnostic defense mechanism for detecting poisoning attacks, which means it is not designed to detect specific types of attack. In that work, the authors proposed two novel designs. The first is a synthetic data generation step that uses a conditional GAN (cGAN). In the next step, a WGAN is set up to learn the distribution present in the predictions related to the synthesized data. By defining a detection boundary, attack samples can be distinguished from original samples.

III. OUR APPROACH

In this section, we first present our threat model. We then discuss the synthetic data generation techniques using AAE and WGAN which we use to produce phishing and legitimate samples. We then discuss our experimental methodology. Finally, we briefly explain the ML classifiers that we used in our experiments.

A. Threat Model

We define the threat model by stating our assumptions in terms of goal, knowledge, and influence of an attacker [35].

Attacker's Goal. We assume an attacker will aim to attack the integrity of the system by making the system label a phishing instance as legitimate.

Attacker's Knowledge. We assume an attacker only knows about the features of the phishing instances but not the learning model parameters. This is a realistic assumption as an attacker may have access to the definition of existing datasets but not the specific implementation of a classifier. The attacker does not have any information about other system parameters like the algorithms that have been used, dataset instances, or learning parameters.

Shirazi et al. [35] demonstrated the vulnerabilities of learning models against adversarial sampling attacks using a feature manipulation approach. In the current work, we likewise focus on evaluating adversarial samples and their effect on phishing detection, as in [35], and additionally address the problem of the inadequate volume of phishing datasets.

Attacker's Influence. Ling et al. [55] discuss two types of attacks: (a) Causative Attacks and (b) Exploratory Attacks. In Causative Attacks, the attacker mislabels a portion or the entire training data to affect the algorithm. In other words, the attacker poisons the training data. In Exploratory Attacks, the attacker crafts samples so as to evade the classifier without direct influence. In this study, we assume the adversary carries out exploratory attacks and targets the integrity of the system; the adversary cannot inject adversarial samples into the training set to carry out causative attacks.

B. Adversarial Autoencoder (AAE) for Synthesized Data Generation

We utilize the AAE for synthesizing both phishing and legitimate samples. Since AAE can generate both continuous and discrete data distributions, it is very suitable for generating discrete feature sets of datasets described in Subsection IV-A. The high-level architecture of the AAE is depicted in Figure 1. The autoencoder derives a compressed knowledge representation of the original input, which reconstructs the same data distribution.

$$ q(z) = \int_{x} q(z|x)\, p_d(x)\, dx \tag{1} $$

An aggregated posterior distribution q(z) on the latent code is defined with the encoding function q(z|x) and the data distribution p_d(x) as shown in Eq. 1, where x denotes the real phishing dataset. In this work, we synthesize phishing and legitimate samples separately, where we train two different AAEs for each dataset.

The operating principle of AAE is that the autoencoder seeks to minimize the reconstruction error while the adversarial network attempts to minimize the adversarial cost. The reconstruction phase and the regularization phase are two simultaneous phases that arise during training. In the reconstruction phase, the autoencoder's data reconstruction error, often referred to as the loss, is minimized. The regularization phase relates to the adversarial component of the network. It minimizes the adversarial cost to fool the discriminator by regularizing the aggregated posterior distribution q(z) as closely as possible to the prior distribution p(z).

The simultaneous training process leads the discriminative adversarial network into believing that the samples from the hidden code q(z) come from the prior distribution p(z) [50]. A normal distribution is used as the arbitrary prior p(z) in this work. After the training process, the adversarial network synthesizes samples similar to the actual samples through the prior distribution p(z).
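As an illustration of Eq. 1, the aggregated posterior q(z) can be sampled in two steps: draw x from the data distribution p_d(x), then draw z from the encoder q(z|x). The following is only a minimal sketch under stated assumptions, not the paper's implementation: `toy_encoder`, its noise level, and the three feature vectors are hypothetical stand-ins for a trained encoder network and a real dataset.

```python
import random


def sample_aggregated_posterior(data, encode, n_samples, rng):
    """Draw z ~ q(z) = integral of q(z|x) p_d(x) dx by ancestral sampling."""
    codes = []
    for _ in range(n_samples):
        x = rng.choice(data)          # x ~ p_d(x): empirical data distribution
        codes.append(encode(x, rng))  # z ~ q(z|x): stochastic encoder
    return codes


def toy_encoder(x, rng, noise_std=0.1):
    # Hypothetical 1-D encoder (assumption): feature mean plus Gaussian noise.
    mean = sum(x) / len(x)
    return rng.gauss(mean, noise_std)


rng = random.Random(0)
# Hypothetical feature vectors, assumed to be already scaled to [-1, 1].
data = [[-1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [0.0, 1.0, 1.0]]
codes = sample_aggregated_posterior(data, toy_encoder, 1000, rng)
```

In an actual AAE the discriminator would compare such codes against draws from the prior p(z); here the codes are only collected so that the mixture structure of q(z) is visible.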
Fig. 1: The high-level architecture of the synthetic data generation approach. The adversarial autoencoder (AAE) generates both phishing and legitimate samples. The top row represents the standard autoencoder that reconstructs the data from the latent code z. The next row shows the discriminator network that predicts whether the samples emerge from the hidden code of the autoencoder q(z) or the user-defined prior distribution p(z) [50]. p_d(x) denotes the data distribution. q(z|x) and p(x|z) denote the encoding and decoding distributions, respectively. After the data generation, a machine learning classifier (f_c) described in Subsection III-E is applied for different classification tasks.
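The two AAE training objectives, reconstruction and adversarial regularization, can be written as plain loss functions. The sketch below is illustrative only: the text does not state which concrete losses are used, so mean squared error and binary cross-entropy are assumptions, and all function names are hypothetical. The discriminator outputs are taken to be probabilities that a code came from the prior p(z).

```python
import math


def reconstruction_loss(x, x_hat):
    """Mean squared error between an input and its reconstruction (assumed loss)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def _bce(p, target, eps=1e-12):
    # Binary cross-entropy for one probability, clipped for numerical safety.
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def discriminator_loss(d_prior, d_posterior):
    """Discriminator should output 1 for z ~ p(z) and 0 for z ~ q(z)."""
    losses = [_bce(p, 1) for p in d_prior] + [_bce(p, 0) for p in d_posterior]
    return sum(losses) / len(losses)


def encoder_adversarial_loss(d_posterior):
    """Encoder is rewarded when the discriminator outputs 1 on its codes."""
    return sum(_bce(p, 1) for p in d_posterior) / len(d_posterior)


# Usage with hypothetical discriminator outputs.
rec = reconstruction_loss([1.0, -1.0], [1.0, 1.0])
d_loss = discriminator_loss(d_prior=[0.5], d_posterior=[0.5])
fooled = encoder_adversarial_loss([0.9])
caught = encoder_adversarial_loss([0.1])
```

The encoder's adversarial loss falls as the discriminator becomes more convinced that the codes come from the prior, which is the sense in which the regularization phase "fools" the discriminator.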
The synthesized dataset has the characteristics of phishing samples generated by real-world attackers and the characteristics of legitimate samples collected from real websites. We feed them into a classification algorithm that can distinguish phishing samples from legitimate ones. This classifier is unaware of whether the samples are synthesized or actual. The instances are labeled as legitimate or phishing, and the classifier subsequently predicts these labels.

C. Wasserstein Generative Adversarial Network (WGAN)

Similar to AAE, we use the WGAN for synthesizing both phishing and legitimate samples. Since WGAN is capable of generating both continuous and discrete data distributions, it is very suitable for generating discrete feature sets of datasets described in Subsection IV-A. The high-level architecture of the WGAN is depicted in Figure 2.

Fig. 2: The high-level architecture of Wasserstein Generative Adversarial Network (WGAN)

Let P_r and P_g be the actual and generated distributions, respectively. Typically in a GAN architecture, instead of evaluating the density of the distribution (P_r), we can define a random variable Z with a known noise distribution p_z(z) and send it through a parametric function g_θ : Z → X that is capable of synthesizing samples from a certain distribution P_θ [51]. To achieve the objective in WGAN, two deep neural networks, i.e., a generator (G) and a discriminator, compete with each other in the training phase. In WGAN, the discriminator is called the critic (C). C distinguishes real samples from fake samples generated from p_z(z), and G confuses C by convincing it that synthesized samples come from the real distribution (P_r). Eventually, G will be capable of generating data samples that are similar to real samples by mapping its distribution (P_g) to P_θ. The contest between G and C is a two-player minimax game with value function V(C, G) [47]:

$$ \min_{G} \max_{C} V(C, G) = \mathbb{E}_{x \sim p_r(x)}[\log C(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - C(G(z)))] \tag{2} $$

Further, WGAN has a distinct loss function to compute the difference between actual (P_r) and generated (P_g) data distributions compared to the ordinary GAN. WGAN uses the Earth-Mover (EM) distance or Wasserstein-1 as the loss function [51], while GAN calculates the loss via the standard cross-entropy [47]. The Earth-Mover (EM) distance can be defined as follows.

$$ W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert] \tag{3} $$

where Π(P_r, P_g) is the set of all joint distributions γ(x, y) whose marginal distributions are P_r and P_g. In other words, the similarity between actual and generated data is calculated by finding the infimum of the expected values of distances between data points from the distributions of actual and generated data [53].

Further, training of WGANs does not require maintaining a careful balance of G and C, and also does not require a cautious design of the network architecture. The Wasserstein
loss function is also capable of providing a continuous and usable gradient compared to many other loss functions [51]. Therefore, WGAN is useful for these types of contexts where only small and discrete training data is available.

In both AAE and WGAN, we train separate generative models for phishing and legitimate samples in each dataset, which enables capturing different underlying trends of phishing and legitimate samples with different sets of distinct features. The feature values span many different ranges. Thus, the values are normalized between -1 and 1 before feeding to the AAE or WGAN and are denormalized after data generation. The synthesized phishing and legitimate samples are subsequently integrated into a single dataset. Now we have two datasets:
1) Original Dataset that has both phishing and legitimate samples and was obtained from publicly available datasets. The original dataset is used to generate synthesized samples.
2) Synthesized Dataset that consists of new synthesized phishing and legitimate samples generated by the AAE or WGAN.

After we synthesize data, we apply both original and synthesized datasets in our experiments in Section IV.

D. Experimental Methodology

We define five hypotheses based on the goals introduced in Section I. We define five scores, corresponding to the five hypotheses. Each score empirically evaluates the corresponding hypothesis.

Hypothesis-1. The classification algorithm has acceptable performance on the dataset without considering any synthesized sample, i.e., the performance is acceptable on the dataset that contains only original samples. This hypothesis serves to demonstrate that we are using the most appropriate classification algorithm for the datasets. Hypothesis-1 states that the classification algorithms reproduce an accuracy close to that reported by the original authors of the respective datasets. In other words, we need to evaluate that our classification algorithms outperform, or are at least as good as, the performance reported by the authors of those datasets. For cases where the authors do not report any results, the accuracy needs to be in an acceptable range. We train and test our classifiers without considering synthesized samples and compare results with the authors' results to test this hypothesis. ∆1, which evaluates Hypothesis-1, is defined as follows.

∆1 is the difference between the accuracy we obtained in our experiments and the accuracy reported by the original authors of the dataset. Positive values, or values close to zero, are desired, as they indicate that the accuracy of our model is better than or close to what the original authors reported.

Hypothesis-2. Synthesized samples are indistinguishable from actual data with regard to the machine learning classification algorithm. Hypothesis-2 demonstrates that a machine learning algorithm will not be able to distinguish synthesized samples [...] synthesized samples. The label of each sample indicates whether the sample is original or synthesized, and we do not care if the samples are phishing or legitimate. We then train a classifier to distinguish original samples from synthesized ones. ∆2 is the F1 score of this experiment. We denote the best F1 as ∆2Max. We denote the average F1 over all classifiers as ∆2Avg. The average F1 over different classifiers is important because a classifier may perform well on a few datasets, but do poorly on others.

Hypothesis-3. Synthesized samples, generated by the AAE and WGAN, will evade the classifier and be mislabeled more than original samples. Mislabeling may happen for both phishing and legitimate samples. In other words, synthesized phishing samples will be incorrectly labeled as legitimate more often than original phishing samples. Note that, if this is indeed the case, then attackers can create such synthesized samples that will easily evade the classifiers.

We first train a classifier with only the original samples in the dataset. This guarantees that the algorithms do not have information about synthesized samples. We then test classifiers with two sets of samples: once with the original dataset and then with synthesized samples generated by each of the synthesizer algorithms. The difference between these two sets of results is defined as the ∆3 score.

∆3 specifies the difference in the F1 score of a model when it is tested on original samples and on synthesized samples. Lower values of ∆3 show that the F1 of the model against synthesized samples is lower than the F1 of the model against original samples. In other words, they indicate that the pair of classifier and dataset is more vulnerable to synthesized samples. Note that this vulnerability is not due to the classifier alone but to the pair of classifier and dataset together. In other words, a different classifier or an extended dataset may mitigate this vulnerability.

Hypothesis-4. Re-training classifiers on datasets that have been injected with synthesized samples will improve the F1 score of the models with regard to synthesized samples. This hypothesis concerns mitigating the vulnerability. Injecting synthesized samples into the training set will improve the F1 score to the level observed when there were no synthesized samples in the training or testing datasets.

∆4 calculates the difference between the F1 scores of a classifier when it is tested with synthesized samples: once when it is trained only with original samples, and once when it is trained with both original and synthesized samples. Higher values of ∆4 are desired.

Hypothesis-5. Augmenting the dataset with synthesized samples improves the F1 score with respect to the original samples. A useful application of the proposed approach is to increase the size of the training set without the need to gather real data. ∆5 is defined for this purpose.

∆5 defines the improvement of the F1 on the original samples, i.e., we calculate the difference between the F1 scores of two classifiers when they are tested with only original samples: once it is trained on only original samples and, once it is trained
from actual ones. For this purpose, a classification algorithm on both original and synthesized samples. This score helps
is designed to distinguish synthesized samples from original to understand if adding synthesized samples can improve the
ones. We construct a dataset including original samples and accuracy of the classifier with regards to the original samples.
Table I summarises the hypotheses and scores we defined in this section.

E. Machine Learning Classifier
We consider six statistical classification algorithms available in the Scikit-learn tool [56] to train and evaluate our model. These are Decision Tree (DT), Gradient Boosting (GB), k-Nearest Neighbors (KNN), Random Forest (RF), Gaussian Naive Bayes (GNB), and Support Vector Machine with two kernels: Linear (SVM(l)) and Gaussian (SVM(r)) kernel. These algorithms are widely used in the literature for phishing detection and have achieved promising results. In addition to these statistical algorithms, we have also implemented a deep neural network model to compare the results. This lets us evaluate the vulnerability of Deep Neural Network (DNN) algorithms to poisoning attacks in the context of phishing detection. For each experiment, we optimize specific parameters of each classifier to obtain the best results and prevent overfitting. We fine-tuned each algorithm over wide ranges of hyper-parameters to get the best performing model.
For DT, we varied the maximum depth from 2 to 20. For the GB algorithm, we checked learning rates from 0.05 to 1 with 20 estimators and a maximum depth of 10. We checked the RF algorithm with different numbers of estimators, varying between 10 and 200. For the KNN algorithm, we varied the number of neighbors from 3 to 25. For SVM with linear kernel, we checked the C parameter with the following values: {0.001, 0.01, 0.1, 1, 10}. For SVM with Gaussian kernel, in addition to the C parameter, we checked the Gamma parameter with the values {0.01, 0.1, 1}.
For DNN, we used a network with one input layer, two hidden layers, and one output layer. The first hidden layer includes 32 hidden nodes, and the second hidden layer has 16 hidden nodes and the ReLU activation function. For the output layer, we used the sigmoid activation function. We used dropout regularization with a rate of 0.2 and the binary cross-entropy loss function. We trained each network for 500 epochs with an early stopping method with a patience of 200.

IV. EXPERIMENTS AND EVALUATION
We conduct a set of experiments to empirically test the hypotheses that are defined in Section III. We start by introducing the datasets used, followed by the experiments we conducted for each hypothesis and their results.

A. Summary of Phishing Datasets
We use ten phishing datasets publicly available on the Internet. We list the total number of instances for each dataset, and also the number of phishing and legitimate instances. In addition, we explain the number and types of features in each dataset. Types of features are essential as they explain the characteristics of datasets on which our algorithms can be executed. We also mention the highest accuracy reported by the original authors of the dataset, if it was available, for comparison purposes.
DS1: Shirazi et al. [13] phishing dataset focuses on a subset of domain-name-based features without requiring third-party services. This dataset includes 7 features with 1K legitimate samples and 1.2K phishing samples, for a total of 2.2K samples. The reported accuracy varies between 97 and 98 on the validation set and is unknown for live phishing URLs.
DS2: Rami et al. [14] dataset was shared through the UCI machine learning repository [57]. The authors identified characteristics that help discern phishing websites from legitimate ones, including long URL, IP address in URL, adding prefixes and suffixes to the domain, and request URL. The authors defined 30 features that can be categorized as follows: URL-based, abnormal-based, HTML-based, JavaScript-based, and domain name-based features. Abnormal-based features are those extracted from abnormalities in URLs, e.g., DNS records not found in the WHOIS database. The authors also analyzed the most significant features of the detection algorithm. The authors reported an accuracy of 92.2 on this dataset. We used all 30 features in our experiments, regardless of their relative importance.
DS3: Abdelhamid et al. [15] dataset is listed in the UCI machine learning repository [57]. The essential features include HTML content-based features and some that require third-party service inquiries, such as DNS servers that perform domain-name age lookup. The best accuracy reported for this dataset is 97.
DS4: Tan et al. [16] published their dataset on the Mendeley dataset library. It includes 48 URL-based and HTML-based features. The authors integrated a feature selection phase with a training phase and chose the ten best features with a random forest classifier. We used all 48 features in our experiments. The best accuracy reported is 96.
DS5: Hannousse et al. [17] dataset includes more than 11,000 phishing and legitimate URLs with 87 extracted features from three different categories: 56 extracted from the structure and syntax of URLs, 24 extracted from the page contents, and seven extracted by querying external services. The dataset consists of 50% phishing and 50% legitimate instances. The best accuracy reported by the authors is 96.6.
DS6: Vrbancic et al. [18, 19] dataset has 111 URL-based features without considering the contents of webpages or using third-party services. This dataset includes more than 146,000 instances of phishing and legitimate websites. The best accuracy reported for this dataset is 94.39.
DS7: Moruf et al. [20, 21] dataset consists of 7,200 phishing and 5,800 legitimate websites with 35 features. The authors added image identity, page style, layout identity, and text identity features. The authors reported an accuracy of 98.3.
DS8: Mahmodi et al. [22, 23] dataset of 8,000 legitimate and phishing instances has 75 URL and content-based features. The authors reported an accuracy of 97.0.
DS9: Muhammad et al. [24] dataset of 15,000 instances of phishing and legitimate websites has 79 URL-based features. The best reported accuracy for this dataset is 96.5.
DS10: Marchal et al. [58, 59] released a URL-based phishing dataset with more than 96 thousand instances. The authors argued that phishing URLs usually have few relationships between the URL part that must be registered (low-level domain) and the remaining part of the URL (upper-level domain, path, query). They defined the concept of intra-URL relatedness and, to
TABLE I: Summary of Hypotheses. For each hypothesis, we defined two experiments (Exp.1 and Exp.2) and calculated difference. Org means only original data was used for training or testing and Syn indicates use of synthesized data. Syn. ∪ Org. indicates use of both original and synthesized data. Target specifies result of classification labels for experiment. Phi vs. Leg indicates classifier distinguished phishing samples from legitimate ones and Syn vs. Org indicates distinguishing synthesized samples from original ones.

Hypothesis                                                 Training                  Testing
#      Description                                         Exp1         Exp2         Exp1         Exp2    Target
Hyp-1  Reproducing datasets authors accuracy               Org          Org          Org          Org     Leg vs. Phi
Hyp-2  Distinguishing synthesized and original samples     Syn. ∪ Org.  –            Syn. ∪ Org.  –       Syn vs. Org
Hyp-3  Mislabeling of synthesized samples                  Org          Org          Org          Syn     Leg vs. Phi
Hyp-4  Recovery performance for synthesized samples        Org          Syn. ∪ Org.  Syn          Syn     Leg vs. Phi
Hyp-5  Increased performance for original samples          Syn ∪ Org    Org          Org          Org     Leg vs. Phi
TABLE II: Summary of datasets. This table summarises the datasets we used in our experiments, including the Name we use in this paper, release Year, the authors' reported Accuracy (Acc.), dataset instance sizes in thousands (number of Legitimate (L), Phishing (P), and Total (T) instances), number of Features (F), and types of features marked with X (URL-based, page-content-based (Pg.), and inquiring 3rd-party services (3rd.)).

Dataset             Size (K)             Features
Name  Year  Acc.    L      P      T      F     URL / Pg. / 3rd.
DS1   2018  98.0    1      1.2    2.2    7     X X
DS2   2012  92.2    6.2    4.9    11.1   30    X X
DS3   2014  97      0.6    0.7    1.3    9     X X
DS4   2018  96      5.0    5.0    10.0   48    X X
DS5   2020  96.6    5.7    5.7    11.4   87    X X X
DS6   2020  -       58.0   30.7   88.7   111   X
DS7   2020  98.3    5.9    7.2    13.1   35    X
DS8   2020  97.0    2.2    1.8    4.0    76    X X
DS9   2020  -       7.8    7.6    15.8   79    X
DS10  2014  -       48.0   47.9   95.9   12    X X

TABLE III: Evaluation for Hyp-1. This table reports the accuracy of all classifiers on all datasets when trained to distinguish phishing samples from legitimate samples without considering synthesized samples. The best performance among the classifiers for each dataset is marked with an asterisk. The average performance of each classifier over the datasets is also reported. Auth. is the performance reported by the original authors of each dataset, and ∆1Acc is the difference between our best performance and the performance reported by the original authors.

      DT      GB      KNN     RF      SVC(l)  SVM(r)  DNN     Auth.   ∆1Acc
DS1   95.9    96.8    96.2    97.3*   95      95.2    96.2    98      -0.7
DS2   96.7    97      94.8    97.3*   93.2    96.3    96.7    92.2    5.1
DS3   91.6    93.6    93.2    92.8    90      94.0*   90.8    97      -3
DS4   97.6    98.2    87.3    98.6*   94.6    91.8    97.4    96      2.6
DS5   94.2    95.8    93      96.5*   95.1    95.3    95.5    96.6    -0.1
DS6   95.6    95.2    88.1    97.1*   93      94.7    96      94.39   2.71
DS7   99.1    99.1    99      99.2*   93.9    94.2    99.1    98.3    0.9
DS8   99.1    99.9*   97.9    99.6    98.8    80.6    99      97      2.9
DS9   97.2    98.3    97      98.5*   96.3    96.9    97.9    96.5    2
DS10  93.6    92.9    89.4    95.5*   83.1    87.5    89.6    96.28   -1.33
Avg.  96.06   96.68   93.59   97.24   93.3    92.65   95.82   96.44   1.11
evaluate it, extracted 12 features from words that compose a URL based on query data from Google and Yahoo search engines. For this dataset, the best-reported accuracy is 96.28.
Table II summarizes the number of instances, features, and the portion of legitimate vs. phishing instances in each dataset. We also specify whether these datasets have URL-based, page-content-based, and third-party-based features.

B. Hyp-1: Reproducing results cited by original authors
We evaluate Hypothesis-1, which asks whether our proposed method can reproduce an accuracy close to the accuracy reported by the original authors of the datasets without considering any synthesized samples. We use 80% of the data for training and 20% for testing in five-fold cross-validation. We run all experiments ten times and report the mean accuracy. Table III summarises the accuracy scores we achieved for all ten datasets and seven classification algorithms. It also lists the accuracy reported by the authors. To calculate ∆1Acc, we selected the maximum accuracy we obtained among the seven classifiers (declared in bold font) and then subtracted the accuracy reported by the authors.
Positive values for ∆1Acc indicate that our model outperformed the accuracy of the original authors. For 6 out of 10 datasets, we report positive ∆1Acc values. For the four datasets for which we report negative values of ∆1Acc (DS1, DS3, DS5, and DS10), the results are not statistically significant. On average, the accuracy we report is 1.11% better than the accuracy reported by the original authors of the datasets. In addition, the best average accuracy among our classifiers, which belongs to the RF classifier, is higher than the average accuracy reported by the original authors. These results support Hypothesis-1.

C. Hyp-2: Distinguishing synthesized samples
We evaluate Hypothesis-2 to see if synthesized samples are distinguishable from original samples. Both the AAE and WGAN algorithms generated 10K phishing and 10K legitimate samples for each dataset. We train our set of classifiers on synthesized samples (positive labels) and original samples (negative labels). We test each classifier to evaluate its performance in predicting the label of each given sample. We use 80% of the data for training and 20% for testing in five-fold cross-validation. Although some datasets are imbalanced, we report the best performing F1 scores in Table IV, separately for the AAE and WGAN algorithms.
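The labelling used for Hypothesis-2 can be sketched in a few lines: synthesized rows receive label 1, original rows receive label 0, and the detector is scored with F1 on the positive class. This is an illustration only; the helper names build_detection_set and f1_score are hypothetical, the toy rows are made up, and F1 is computed by hand here rather than with the library used in the experiments.

```python
from typing import List, Sequence, Tuple


def build_detection_set(
    original: Sequence[Sequence[float]],
    synthesized: Sequence[Sequence[float]],
) -> Tuple[List[Sequence[float]], List[int]]:
    """Label synthesized rows 1 (positive) and original rows 0 (negative)."""
    features = list(synthesized) + list(original)
    labels = [1] * len(synthesized) + [0] * len(original)
    return features, labels


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 of the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    original_rows = [[0, 1], [1, 1]]             # toy feature vectors
    synthesized_rows = [[1, 0], [0, 0], [1, 1]]  # toy feature vectors
    X, y = build_detection_set(original_rows, synthesized_rows)
    # A detector that labels everything "original" never finds a synthesized row.
    print(f1_score(y, [0] * len(y)))
```

A low F1 in this setting means the detector cannot tell synthesized samples from original ones, which is the outcome Hypothesis-2 predicts.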
DS    DT            GB            KNN           RF            SVC(l)        SVC(r)        DNN           Max
      A      W      A      W      A      W      A      W      A      W      A      W      A      W      A      W
DS1   56     54.5   54.3   51.1   46.2   39.2   51     48.8   0      0      47     32.8   38.3   29.2   56     54.5
DS2   82.9   84     88.1   88.4   72.5   77.3   90.7   91.5   9.8    9.7    89.9   88.3   82.8   84.8   90.7   91.5
DS3   12.7   20.8   14.2   27     14.4   24.8   15.3   26.6   0      0      13.9   20.5   11     27.2   15.3   27.2
DS4   94.2   92.3   96.2   96     89     92.1   97.5   97.4   54.2   11     93.7   93.8   96.7   97.6   97.5   97.6
DS5   99.9   99.5   99.8   100    74.5   85.2   99.9   100    74.4   20.6   92.9   96.4   96.7   99     99.9   100
DS6   99.3   100    99.1   100    90.3   96.6   99.8   100    87.7   98.4   96.7   99.9   99.6   100    99.8   100
DS7   90.2   95.3   93.1   96.7   85.7   88.5   97.2   98.2   17.1   0      78.7   72.4   91.6   95.3   97.2   98.2
DS8   99.1   98.9   99.2   99.5   73.8   86.3   99.6   99.9   28.4   40.1   75.3   41.4   89.9   97.4   99.6   99.9
DS9   99.9   99.8   100    100    93.3   96     100    100    88.7   90     98.7   92.8   99.5   99.8   100    100
DS10  97.9   99.7   98.9   99.8   91.8   94.1   98.9   99.9   70.4   73.1   93.3   95.3   97.1   99.7   98.9   99.9
Avg.  83.21  84.48  84.29  85.85  73.15  78.01  84.99  86.23  43.07  34.29  78.01  73.36  80.32  83     84.99  86.23

TABLE IV: F1 score for Hyp-2. Table reports results of F1 scores for different classifiers for two synthesizer algorithms of AAE and WGAN for all datasets. It also reports Maximum F1 score for each dataset and Average for each classifiers.
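The Max columns of Table IV can be recomputed from the per-classifier scores. The sketch below does this for the DS1 and DS3 rows only; the dictionary layout and the function name max_f1 are illustrative assumptions, while the numbers are copied from the table in the order DT, GB, KNN, RF, SVC(l), SVC(r), DNN.

```python
from typing import Dict, List

# F1 scores from Table IV, ordered DT, GB, KNN, RF, SVC(l), SVC(r), DNN.
AAE_F1: Dict[str, List[float]] = {
    "DS1": [56, 54.3, 46.2, 51, 0, 47, 38.3],
    "DS3": [12.7, 14.2, 14.4, 15.3, 0, 13.9, 11],
}
WGAN_F1: Dict[str, List[float]] = {
    "DS1": [54.5, 51.1, 39.2, 48.8, 0, 32.8, 29.2],
    "DS3": [20.8, 27, 24.8, 26.6, 0, 20.5, 27.2],
}


def max_f1(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Best F1 over all classifiers for each dataset (the Max column)."""
    return {dataset: max(values) for dataset, values in scores.items()}


if __name__ == "__main__":
    print(max_f1(AAE_F1))
    print(max_f1(WGAN_F1))
```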
We fine-tune all classifiers over the datasets in this experiment. Lower ∆2MaxF1 scores demonstrate the lack of ability to distinguish synthesized samples from original ones and support our Hypothesis-2. Table IV summarises the ∆2MaxF1 scores as we defined them in Section III. For each triple of classifier, dataset, and synthesizer algorithm (AAE or WGAN), we report the F1 score. We also report the best F1 score for each pair of dataset and algorithm and the average F1 score for each pair of classifier and algorithm. As Table IV shows, the best F1 scores for DS1 and DS3, given by ∆2MaxF1, are very low. For the other datasets, the ∆2MaxF1 scores are reasonably high, the lowest being those for DS2: 90.7 for AAE and 91.5 for WGAN. While these ∆2 values are high, the average over the different classifiers and datasets is lower. The highest average score belongs to the RF classifier for both AAE and WGAN synthesizers, 84.99 and 86.23. In other words, on average, the RF classifier is able to detect synthesized samples, no matter what algorithm was used, better than any other classification algorithm we tested. These results show that while classifiers may successfully discriminate synthesized samples from original ones in some datasets, the average results are low and support our hypothesis that synthesized samples are difficult to distinguish from original ones on average.

D. Hyp-3: Performance of synthesized samples
To test Hypothesis-3, we checked the F1 score of classifiers against synthesized samples, with the two generators AAE and WGAN, using models trained exclusively on original samples. This demonstrates whether synthesized samples can bypass the models and go undetected, and which algorithm's samples are more likely to bypass them. We calculated ∆3 as the difference in the F1 score of classifiers trained exclusively on original samples and tested once on synthesized samples and once only on original samples. For reporting results, we averaged the difference between the F1 scores of models trained on original samples when the models were tested once against original samples and once against synthesized samples on different datasets for both algorithms. That number shows how the F1 score changes between these two testing sets, either increasing or decreasing on average. We have used 2000 synthesized phishing samples and 2000 legitimate samples for testing purposes.
Figure 3 depicts the ∆3 scores for different classifiers, datasets, and algorithms. All classifiers show a decrease of at least 5%, and three of them of around 10%. SVM with Gaussian kernel is the worst, with 10% and 20% for the AAE and WGAN algorithms, respectively. This is a clear sign that all tested algorithms in this experiment are vulnerable to synthesized samples. In addition, AAE was able to synthesize samples that evade classifiers more than those of WGAN, a sign that AAE is more successful.
In addition, Figure 3 shows ∆3 for the different datasets. On average, the score decreased by around 5% for all datasets, and by up to 15% for some datasets; a clear sign that synthesized samples can evade the classifier. Among the datasets, DS4, DS5, DS9, and DS10 show a larger decrease in learning scores.
These results demonstrate that our synthesized phishing and legitimate samples are able to evade trained classifiers, a clear sign of vulnerability of the models for both our classification algorithms and datasets. Our experiments used ten public phishing datasets, seven conventional machine learning classifiers, and two synthesizing algorithms. They were carried out on a wide range of datasets and classification algorithms, which demonstrates the problem of evading classifiers.

E. Hyp-4: Mitigating against adversarial samples
In order to test Hypothesis-4, we injected synthesized samples to measure whether the F1 score increased. We define ∆4 as the difference in F1 score between models trained only with original samples and models trained with both original and synthesized samples. For each dataset, we injected 80% of the synthesized samples into the training set and reserved 20% for testing. Figure 4 shows the results of the experiments for ∆4 by classifier, dataset, and synthesizing algorithm.
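The 80%/20% injection used for Hypothesis-4 could look like the following minimal sketch, which assumes samples are held in plain Python lists. The function name inject_synthesized, the fixed seed, and the toy tuples are hypothetical; only the 80/20 ratio comes from the text.

```python
import random
from typing import List, Sequence, Tuple


def inject_synthesized(
    original_train: Sequence[tuple],
    synthesized: Sequence[tuple],
    train_fraction: float = 0.8,
    seed: int = 0,
) -> Tuple[List[tuple], List[tuple]]:
    """Add a fraction of the synthesized samples to the training set.

    Returns the augmented training set and the held-out synthesized samples
    that are reserved for testing.
    """
    rng = random.Random(seed)
    shuffled = list(synthesized)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    augmented_train = list(original_train) + shuffled[:cut]
    held_out_synthesized = shuffled[cut:]
    return augmented_train, held_out_synthesized


if __name__ == "__main__":
    original = [("org", i) for i in range(10)]      # toy original samples
    synthesized = [("syn", i) for i in range(20)]   # toy synthesized samples
    train, test = inject_synthesized(original, synthesized)
    print(len(train), len(test))
```

Keeping the held-out synthesized samples disjoint from the injected ones ensures that ∆4 is measured on synthesized samples the re-trained model has not seen.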
Fig. 3: Percentage decrease of accuracy and F1 score for models trained exclusively on original samples when tested on synthesized samples, relative to the same models tested only on original samples. (a) Classifiers performance; (b) Datasets performance. Series: AAE and WGAN. Horizontal axes: DT, GB, KNN, RF, SVM(l), SVM(r), DNN in (a) and DS1 to DS10 in (b). Vertical axes: 0 to -25.
Fig. 4: This figure depicts the percentage of increase in F1 score when the model is trained with both original and synthesized samples, compared with when it is trained exclusively with original samples without considering synthesized samples. In both cases, models are tested with synthesized samples. (a) Classifiers performances; (b) Datasets performances. Series: AAE and WGAN. Horizontal axes: DT, GB, KNN, RF, SVM(l), SVM(r), DNN in (a) and DS1 to DS10 in (b). Vertical axes: 0 to 20 in (a) and 0 to 25 in (b).
Figure 4 depicts the improvements in F1 score that occurred after injecting synthesized samples into the training set. Note that ∆4 is significantly high when we consider the average over datasets and classifiers. The ∆4 scores have improved in all cases. Among different classifiers, KNN and SVM with Gaussian kernel.
The average of ∆4 is 7.4% and 6.1% respectively, while the decrease in F1 score in the previous experiment was 7.1% and 5.6%. This shows that the F1 score has improved by about the same amount by which it degraded in the previous experiment. These results support Hypothesis-4, as the proposed approach was able to recover the model to the level that we had when we were using only original samples.

F. Hyp-5: Improving F1 score for original dataset.
In previous experiments, we showed that injecting synthesized samples into the training set mitigates the vulnerability against synthesized samples. However, this is not the only benefit of using our approach. We want to analyze whether augmenting the training dataset with synthesized samples improves the F1 score of the models over those trained only on original samples. We denote this difference as ∆5. Table V summarizes the results for the two algorithms, AAE and WGAN.
Table V demonstrates that the F1 score has been slightly improved for different datasets and algorithms. As the original scores for these models are significantly high, even a small improvement is considerable. Besides, the improvement for specific algorithms is significant. For example, the F1 score has increased for the DS1 dataset in all cases. DS1 is one of the
DS    DT            GB            KNN           RF            SVC(l)        SVC(r)        DNN           Max
      A      W      A      W      A      W      A      W      A      W      A      W      A      W      A      W
DS1   1      1.6    0.4    1.6    1.4    1.6    0.2    1      0.8    0.4    1      0.7    1.6    0.8    0.85   1.09
DS2   -1     -0.4   -1.2   -1.3   1      0.8    0.3    0      0.1    0.1    0.3    -0.4   -0.8   -1.3   0.4    -0.16
DS3   0.9    -0.9   0.4    0.6    -0.7   -0.9   -0.7   -3.7   0.9    4.1    -1.1   -1.1   2.7    -1.5   0.28   -0.2
DS4   -1.5   -0.9   -1.2   -0.3   1      1.9    -0.3   -0.7   -0.8   -1.1   -0.3   2.3    -0.5   0.1    -0.48  -0.36
DS5   0      0.2    -0.5   -0.8   -1.5   1.1    0.3    -0.4   -0.4   -1.6   3.2    -0.5   -0.3   -1.2   0.3    -0.44
DS6   0.3    -0.1   -1.3   0.2    -0.1   0.5    0.1    0.2    -0.3   -0.3   10     0      0.2    0.1    2.42   1.24
DS7   0      0.2    -0.2   -0.6   0      0.1    -0.1   0      -0.2   -0.8   -0.3   0.1    -0.1   -0.7   -0.19  -0.11
DS8   0.1    -1.1   -0.4   -0.7   -0.1   -1     0.2    -0.3   1      -0.4   1.5    0      1.3    -0.1   0.65   -0.49
DS9   0.2    -0.5   -0.2   -0.5   -0.6   -1     -0.4   0.2    -1.5   -0.7   0      -2.6   -0.7   0.1    -0.53  -0.7
DS10  -0.5   -0.3   -0.1   -0.5   -0.9   0      0.7    -0.2   -5.3   0.5    -2.1   -1.2   -2.7   -0.1   -1.53  -0.15
Avg.  -0.05  -0.22  -0.43  -0.23  -0.05  0.31   0.03   -0.39  -0.57  0.02   1.22   -0.27  0.07   -0.38

TABLE V: F1 score for Hyp-5. Table reports results of F1 scores for different classifiers for two synthesizer algorithms of AAE and WGAN for all datasets. It also reports Maximum F1 score for each dataset and Average for each classifiers.
datasets with the lowest number of instances. This shows that our approach can be useful to enhance the size of the dataset and achieve higher performance. On average, the results of SVM with the Gaussian kernel (SVM(r)) have increased by more than 1.2% with AAE.
The improvement in F1 score that the ∆5 scores demonstrate supports Hypothesis-5. In other words, extending the dataset with synthesized samples helped to improve the F1 score of the system for both synthesized and original samples.

V. CONCLUSION AND FUTURE WORK
Supervised machine learning is a promising approach for phishing detection. An adequate amount of data about phishing websites is often infeasible to obtain for reasons of privacy, confidentiality, and liability. In order to address this problem, we develop AAE and WGAN based techniques for generating data that mimic phishing and genuine websites. We ensure that the features of the synthesized phishing samples can be realistically produced by an attacker. We use 10 publicly available datasets (with different feature sets) for our experiments. Our experiments show that injecting synthesized data in the training set improved the F1 score of the learning algorithms. Moreover, including some correctly labeled synthesized data in the training set produced algorithms that were significantly more robust to exploratory attacks. Our future work involves the use of the AAE and WGAN for other security related domains where it is hard to obtain attack data, e.g. generating attack data for Internet of Things or Cyber Physical Systems. In this study, we evaluated only statistical learning models. In future, we plan to explore the vulnerability of neural networks against adversarial attacks. We also plan to explore different attack types and make learning models more robust against a wider range of attacks.

ACKNOWLEDGMENT
We thank Sid Sutton, who helped us with implementation. This work was supported in part by NSF with award numbers DMS 1923142, DMS 2123761, CNS 1932413, IIS 2027750, CNS 1715458, and CNS 1822118 and by American Megatrends, Statnett, Cyber Risk Research, ARL, NIST, and The NewPush. This work was also supported by the U.S. Department of Justice, Office of Justice Programs/National Institute of Justice under Award 2017-ZA-CX-0002. Opinions or points of view expressed in this article are those of the authors and do not necessarily reflect the official position or policies of the funding agencies.

REFERENCES
[1] Grant Ho, Asaf Cidon, Lior Gavish, Marco Schweighauser, Vern Paxson, Stefan Savage, Geoffrey M Voelker, and David Wagner. Detecting and characterizing lateral phishing at scale. In USENIX Security Symposium, 2019.
[2] Federal Bureau of Investigation (FBI). Business e-mail compromise 12 billion dollar scam. https://www.ic3.gov/media/2018/180712.aspx, (accessed July 2, 2020).
[3] Shivender Singh, Anil K Sarje, and Manoj Misra. Client-side counter phishing application using adaptive neuro-fuzzy inference system. In International Conference on Computational Intelligence and Communication Networks, 2012.
[4] Rachna Dhamija, J Doug Tygar, and Marti Hearst. Why phishing works. In Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 581–590, 2006.
[5] Amirreza Niakanlahiji, Bei-Tseng Chu, and Ehab Al-Shaer. Phishmon: A machine learning framework for detecting phishing webpages. In Intelligence and Security Informatics, 2018.
[6] Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, and Banu Diri. Machine learning based phishing detection from urls. Expert Systems with Applications, 2019.
[7] Jian Mao, Jingdong Bian, Wenqian Tian, Shishi Zhu, Tao Wei, Aili Li, and Zhenkai Liang. Phishing page detection via learning classifiers from page layout feature. EURASIP Journal on Wireless Communications and Networking, 2019.
[8] Ankit Kumar Jain and Brij B Gupta. Towards detection of phishing websites on client-side using machine learning based approach. Telecommunication Systems, 2018.
[9] Jens Kirchner, Andreas Heberle, and Welf Löwe. Classification vs. regression-machine learning approaches for service recommendation based on measured consumer experiences. In IEEE World Congress on Services, 2015.
[10] Z. Dou, I. Khalil, A. Khreishah, A. Al-Fuqaha, and M. Guizani. Systematization of knowledge (sok): A systematic review of software-based web phishing detection. IEEE Communications Surveys Tutorials, 2017.
[11] Sebastian Abt and Harald Baier. Are we missing labels? a study of the availability of ground-truth in network security research. In 2014 Third International Workshop on Building Analysis Datasets and Gathering Experience Returns for Security, pages 40–55, 2014.
[12] Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and J Doug Tygar. Adversarial machine learning. In ACM workshop on Security and artificial intelligence, 2011.
[13] Hossein Shirazi, Bruhadeshwar Bezawada, and Indrakshi Ray. "kn0w thy doma1n name": Unbiased phishing detection using domain name based features. In Access Control Models and Technologies, 2018.
[14] Rami M Mohammad, Fadi Thabtah, and Lee McCluskey. An assessment of features related to phishing websites using an automated technique. In Internet Technology And Secured Transactions, 2012.
[15] Neda Abdelhamid, Aladdin Ayesh, and Fadi Thabtah. Phishing detection based associative classification data mining. Expert Systems with Applications, 2014.
[16] Choon Lin Tan. Phishing dataset for machine learning: Feature evaluation, 2018. https://data.mendeley.com/datasets/h3cgnj8hft/1 (Accessed 2019-05-12).
[17] Abdelhakim Hannousse and Salima Yahiouche. Web page phishing detection, 2020. https://data.mendeley.com/datasets/c2gw7fy2j4/2 (Accessed 2020-09-12).
[18] Grega Vrbančič. Phishing dataset for machine learning: Feature evaluation, 2020. http://dx.doi.org/10.17632/72ptz43s9v.1 (Accessed 2020-10-12).
[19] Grega Vrbančič, Iztok Fister, and Vili Podgorelec. Datasets for phishing websites detection. Data in Brief, 33:106438, 2020.
[20] Moruf Adebowale. Phishing dataset for machine learning: Feature evaluation, 2019. https://data.mendeley.com/datasets/gt7xdbs3kt/2 (Accessed 2020-09-12).
[21] M.A. Adebowale, K.T. Lwin, E. Sánchez, and M.A. Hossain. Intelligent web-phishing detection and protection scheme using integrated features of images, frames and text. Expert Systems with Applications, 115:300–313, 2019.
[22] Hadi Sadoghi Yazdi, Emad Mahmodi, and Abbas Ghaemi Bafghi. Data for: An online minimal uncertainty drift-aware method for anomaly detection in social networking, 2020. https://data.mendeley.com/datasets/zw7knrxpy5/1 (Accessed 2020-09-12).
[23] Hadi Sadoghi Yazdi, Abbas Ghaemi Bafghi, et al. A drift aware adaptive method based on minimum uncertainty for anomaly detection in social networking. Expert Systems with Applications, 162:113881, 2020.
[24] Muhammad Hanif. Malware webpages data, 2020. https://data.mendeley.com/datasets/zsj5pgrsg9/1 (Accessed 2020-09-12).
[25] Jiwon Hong, Taeri Kim, Jing Liu, Noseong Park, and Sang-Wook Kim. Phishing url detection with lexical features and blacklisted domains. In Adaptive Autonomous Secure Cyber Systems, pages 253–267. Springer, 2020.
[26] Vaibhav Patil, Pritesh Thakkar, Chirag Shah, Tushar Bhat, and SP Godse. Detection and prevention of phishing websites using machine learning approach. In International Conference on Computing Communication Control and Automation, 2018.
[27] Xin Zhou and Rakesh Verma. Phishing sites detection from a web developer's perspective using machine learning. In 53rd International Conference on System Sciences, 2020.
[28] Abdulhamit Subasi and Emir Kremic. Comparison of adaboost with multiboosting for phishing website detection. Procedia Computer Science, 168:272–278, 2020.
[29] Hossein Shirazi, Landon Zweigle, and Indrakshi Ray. A machine-learning based unbiased phishing detection approach. In Proceedings of the 17th International Joint Conference on e-Business and Telecommunications, pages 423–430. SciTePress, 2020.
[30] Saad Al-Ahmadi, Afrah Alotaibi, and Omar Alsaleh. Pdgan: Phishing detection with generative adversarial networks. IEEE Access, 10:42459–42468, 2022.
[31] Sharif Amit Kamran, Shamik Sengupta, and Alireza Tavakkoli. Semi-supervised conditional gan for simultaneous generation and detection of phishing urls: A game theoretic perspective. arXiv preprint arXiv:2108.01852, 2021.
[32] Katherine Haynes, Hossein Shirazi, and Indrakshi Ray. A machine-learning based unbiased phishing detection approach. In Proceedings of the 18th International Conference on Mobile Systems and Pervasive Computing, 2021.
[33] Rosa L Figueroa, Qing Zeng-Treitler, Sasikiran Kandula, and Long H Ngo. Predicting sample size required for classification performance. BMC Medical Informatics and Decision Making, 2012.
[34] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
[35] Hossein Shirazi, Bruhadeshwar Bezawada, Indrakshi Ray, and Charles Anderson. Adversarial sampling attacks against phishing detection. In IFIP Annual Conference on Data and Applications Security and Privacy, 2019.
[36] Briland Hitaj, Paolo Gasti, Giuseppe Ateniese, and Fernando Perez-Cruz. Passgan: A deep learning approach for password guessing. In Applied Cryptography and
13
Network Security, 2019. 2016.

[37] Weiwei Hu and Ying Tan. Generating adversarial mal- [51] Martin Arjovsky, Soumith Chintala, and Léon Bot-
ware examples for black-box attacks based on gan, 2017. tou. Wasserstein generative adversarial networks. In
[38] Vahid Mirjalili, Sebastian Raschka, Anoop Namboodiri, International conference on machine learning, pages
and Arun Ross. Semi-adversarial networks: Convolu- 214–223, 2017.
tional autoencoders for imparting privacy to face images. [52] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vin-
In Conference on Biometrics, 2018. cent Dumoulin, and Aaron C Courville. Improved train-
[39] Benjamin WK Hung, Anura P Jayasumana, and Vidar- ing of wasserstein gans. Advances in neural information
shana W Bandara. INSiGHT: A system to detect violent processing systems, 30:5767–5777, 2017.
extremist radicalization trajectories in dynamic graphs. [53] Sina Fathi-Kazerooni and Roberto Rojas-Cessa. Gan
Data & Knowledge Engineering, 2018. tunnel: Network traffic steganography by using gans
[40] Benjamin WK Hung, Anura P Jayasumana, and Vidar- to counter internet traffic classifiers. IEEE Access,
shana W Bandara. Finding emergent patterns of behav- 8:125345–125359, 2020.
iors in dynamic heterogeneous social networks. IEEE [54] Jian Chen, Xuxin Zhang, Rui Zhang, Chen Wang,
Transactions on Computational Social Systems, 2019. and Ling Liu. De-pois: An attack-agnostic defense
[41] S. R. Muramudalige, B. W. K. Hung, A. P. Jayasumana, against data poisoning attacks. IEEE Transactions
and I. Ray. Investigative graph search using graph on Information Forensics and Security, 16:3412–3425,
databases. In 2019 First International Conference on 2021.
Graph Computing (GC), pages 60–67, 2019. [55] Ling Huang, Anthony D. Joseph, Blaine Nelson, Ben-
[42] Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris jamin I.P. Rubinstein, and J. D. Tygar. Adversarial
Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian machine learning. In ACM Workshop on Security and
Byrd, and Casey S. Greene. Privacy-preserving genera- Artificial Intelligence, 2011.
tive deep neural networks support clinical data sharing. [56] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gram-
bioRxiv, 2018. fort, Vincent Michel, Bertrand Thirion, Olivier Grisel,
[43] Choong Ho Lee and Hyung-Jin Yoon. Medical big data: Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent
promise and challenges. Kidney research and clinical Dubourg, et al. Scikit-learn: Machine learning in python.
practice, 36(1):3, 2017. Journal of Machine Learning Research, 2011.
[44] C. M. Kattadige, S. R. Muramudalige, K. N. Choi, [57] Dua Dheeru and Efi Karra Taniskidou. UCI machine
G. Jourjon, H. Wang, A. P. Jayasumana, and K. Thi- learning repository. http://archive.ics.uci.edu/ml (Ac-
lakarathna. Videotrain: A generative adversarial frame- cessed 2019-05-12).
work for synthetic video traffic generation. In IEEE [58] Samuel Marchal. Phishstorm - phishing /
International Symposium on a World of Wireless, Mobile legitimate url dataset. aalto university, 2014.
and Multimedia Networks, 2021. https://data.mendeley.com/datasets/c2gw7fy2j4/2
[45] Jytte Klausen, Rosanne Libretti, Benjamin W. K. Hung, (Accessed 2020-09-12) .
and Anura P. Jayasumana. Radicalization trajectories: An [59] Samuel Marchal, Jérôme François, Radu State, and
evidence-based computational approach to dynamic risk Thomas Engel. Phishstorm: Detecting phishing with
assessment of homegrown jihadists. Studies in Conflict streaming analytics. IEEE Transactions on Network and
& Terrorism, 2018. Service Management, 11(4):458–471, 2014.
[46] Shashika R. Muramudalige, Anura P. Jayasumana, and
Haonan Wang. A comparative study of complex data
object generation with likelihood and deep generative
approaches. 2021. In Progress.
[47] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza,
Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adver-
sarial networks. International Conference on Neural
Information Processing Systems, 2014.
[48] Junchi Yan. Recent advance in temporal point process:
from machine learning perspective. SJTU Technical
Report, 2019.
[49] Edward Choi, Siddharth Biswal, Bradley Malin, Jon
Duke, Walter F Stewart, and Jimeng Sun. Generating
multi-label discrete patient records using generative ad-
versarial networks. arXiv preprint arXiv:1703.06490,
2017.
[50] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly,
and Ian Goodfellow. Adversarial autoencoders. In
International Conference on Learning Representations,
