Project Sample
BY
(17/52HA066)
JANUARY, 2022
CERTIFICATION
This is to certify that this project work was carried out by MAKINDE,
________________________ _______________________
Dr. J. B. Awotunde Date
(Project Supervisor)
________________________ ________________________
Dr. Oluwakemi C. Abikoye Date
(Head of Department)
________________________ ________________________
External Examiner Date
DEDICATION
This project is dedicated to God, and to my parents and siblings, who have contributed to my
academic growth in all ramifications.
CHAPTER ONE
INTRODUCTION
The internet has evolved into a natural phenomenon that cannot be separated from people's
daily lives, and a great deal of data must be protected against many sorts of criminal attempts. The
intention of data security is to create and implement protective computer models that guard against
attempts to jeopardize the three primary security goals of integrity, availability and confidentiality. Confidentiality involves
ensuring that the information transferred is only available to those who should have access to it. Integrity is
concerned with ensuring that the data itself is not intercepted or altered. It ensures that
the receiver receives exactly what the sender intended. Availability ensures that a
system resource or network is usable and accessible when required by an authorized system or user. One of the tools used to enforce these goals is
systems that are capable of detecting intrusions (IDS). The main principle behind an IDS is to monitor network or system activity and raise alerts when unusual or malicious behaviour is detected.
Intrusion detection systems are classified based on their environments (Kajal & Nandal,
2020) and their detection mechanisms (Amiri et al., 2011). Based on their environment,
IDSs are further classified as Host-based Intrusion Detection Systems (HIDS) and
Network-based Intrusion Detection Systems (NIDS). Signature-based Intrusion Detection Systems (SIDS) and Anomaly-based Intrusion Detection Systems (AIDS) are
the two types of Intrusion Detection Systems based on detection mechanism. The ability to
detect new patterns has led researchers to focus more on anomaly-based IDS (Akyol,
Hacibeyoglu & Karlik 2016). Various intrusion detection systems (IDS) which have
been created are still vulnerable to attacks (Ayo, Folorunso, Abayomi-Alli, Adekunle
& Awotunde 2020). Despite the use of many machine learning algorithms to increase
detection performance, existing systems continue to struggle to achieve good performance (Zhou, Cheng, Jiang, & Dai, 2020).
Feature selection is one technique used to improve the performance of an IDS. In Machine Learning, Feature Selection is grouped into three parts: Filter,
Wrapper and Embedded methods. Feature selection plays an important role in Intrusion Detection Systems (Akyol et al., 2016). In most anomaly detection systems, data
preparation was motivated by a desire for greater accuracy and a low false alarm rate
(Ayo et al., 2020). The dataset is preprocessed to remove unnecessary features and
noise, leaving a smaller set of features for building a high-performance model, while
the base classifier uses the vital features to predict attack types.
Data mining algorithms, fuzzy logic and neural networks are amongst the different
approaches that have been applied to intrusion detection. Several algorithms have been proposed that use both traditional and hybridized Machine
Learning classifiers (Zhou et al., 2020; Gautam & Doegar, 2018), Neural Networks (Chiba,
Abghour, Moussaid, El Omri, & Rida, 2018) and Support Vector Machines (Ganapathy et
al., 2013; Jha, Ragha, & Ph, 2013; Wang, Gu, & Wang, 2017).
This study, however, proposes an IDS that uses Correlation Based - Sequential Forward
Search (CFS-SFS) multi-level feature selection with the Random Forest Classifier to detect intrusions. The IDS will be built in phases,
starting the first phase with the necessary data preprocessing techniques, then followed
by selecting the first set of relevant features using the filter method, which is the
Correlation-based Feature Selection; it filters out irrelevant features and passes its output features as input to the wrapper method,
which is the Sequential Forward Feature Selection, whose output finally serves as an
input to the succeeding stage, where the Random Forest classifier is used. The final model is then evaluated.
Intrusions are malicious attacks that spread swiftly over networks, and detecting them
is crucial owing to the potential damage they can inflict. Security in networks is
currently a major worldwide issue in computer security and defense. Breaches, threats and attacks
can compromise sensitive data, lowering an organization's efficiency, productivity and quality (Chiba et al., 2018).
As a result, Intrusion Detection Systems (IDS) are regarded as critical tools for
continuously monitoring malicious activities and detecting threats that could jeopardize
the network's integrity, privacy, or availability (Kajal and Nandal, 2020). According to
Waskle et al., (2020), to detect intruders, the IDS must be accurate and efficient. For
detection of intrusion, different machine learning algorithms were used, some of which
are K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Naïve Bayes
(NB). Despite the use of many Machine Learning algorithms, existing intrusion
detection systems continue to struggle to achieve good performance (Zhou et al., 2020).
Existing systems that neglected the use of Feature Selection have been observed to
attain a low accuracy and high training time (Teng, Wu, Zhu, Teng, & Zhang, 2018;
Taghavinejad, Taghavinejad, Shahmiri, Zavvar & Zavvar, 2020). Also, SVM has been
found not to be a good choice for large network traffic, as its performance degrades as
data size increases (Ahmad, Basheri, Iqbal, & Rahim, 2018; Wang et al., 2017; Teng et
al., 2018; Aburomman, Bin, & Reaz, 2016; Kuang, Xu, Zhang, 2014). The employment
of Naive Bayes and KNN has also been found to achieve low accuracy (Halimaa &
Sundarakantham, 2019), as well as poor detection rates and false alarm rates (Ayo et al., 2020; Waskle et al., 2020). The
identified flaws provide a rationale for recommending an IDS that minimizes the false
alarm rate and improves the accuracy of the IDS using Correlation Based - Sequential Forward Search multi-level feature selection with the Random Forest classifier.
The aim of this project is to develop an Intrusion Detection System using multi-level
feature selection and the Random Forest Classifier to detect intrusions. The specific objectives of the project follow from this aim.
The need to detect intrusions accurately grows daily as attackers keep getting
more sophisticated, hence the need for an Intrusion Detection System capable enough to detect attacks and notify the system of the attacks
accurately.
The proposed method can also be of immense importance when applied to the
following fields:
This research is focused on using Correlation Based - Sequential Forward Search (CFS-
SFS) multi-level feature selection and a Random Forest classifier to develop an Intrusion
Detection System that detects intrusions on a given dataset, yielding high accuracy and a
low rate of false alarms. This study is limited to using data obtained through secondary means
and machine learning feature selection techniques, with Random Forest as the base
classifier.
ii. Intrusion Detection System: it supervises the activities of a network and detects whether an attack has occurred or not.
iii. Network: is a set of computers linked together with the aim of sharing data and
files.
iv. Data Mining: is the process of finding valuable information hidden in a large volume of data.
v. Machine Learning: is a division of computer science that focuses on the use of data as input to algorithms
with the aim of predicting an outcome, searching for patterns or grouping entities
into clusters.
vi. Features: with respect to machine learning, features are individual independent
variables which are used as input to make predictions for the dependent variable,
vii. Feature Selection: process of filtering and deleting irrelevant features to get
relevant features.
xi. Base Classifier: It is used to indicate the classifier that has been selected for the final model building.
xii. Multi-level Feature Selection: It is the term used to describe the application of two or more feature selection methods in succession.
The project structure follows a five-chapter standard and the rest of the project is organized as follows:
CHAPTER TWO: This chapter presents a review of the literature with respect to related
results.
of this project.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
Intrusion Detection System. It also contains a number of related works that were carried
out by various researchers. An intrusion is an attempt to damage, corrupt, steal or modify the valuable information stored within a system. This
infiltration into any system has the potential to harm the system. The term "intrusion"
has come to mean a serious attempt to breach the system’s security, and with the help of an
Intrusion Detection System, this intrusion into any system can be detected. One of the
most lethal diseases on the planet is cancer; just as early detection of cancer is critical, an
intrusion detection system is one of the most needed tools for ensuring network security
(Teng et al., 2018). Identifying malicious acts or behaviors occurring in networks requires
the use of an IDS. An Intrusion Detection System (IDS) is a system that supervises traffic
for unusual activity and sends alerts when such activity is detected. It is critical for providing security
and reducing harm to the information system, as well as the network and computer
systems. It detects whether an attack has or has not occurred and alerts or notifies the network administrator or system of malicious
behaviors or detected attacks (Velusamy, Ghosh, Debnath, Metia, & Dutta, 2014).
While the primary functions of intrusion detection systems (IDSs) are to detect and
report anomalies, certain IDSs can also take immediate action when anomalous traffic
or malicious activity is detected, such as blocking the traffic received from the suspicious source.
The drawbacks of Intrusion Detection Systems are False Positives (predicting an attack
when none exists), False Negatives (failing to predict an attack when one exists) and Data overload (the inability to effectively and efficiently analyze large volumes of traffic).
Intrusion detection systems are categorized broadly into two major groups based on:
a. Environment or nature of data collection (Ghosh et al., 2014; Kajal & Nandal,
2020).
i. Host-based Intrusion Detection System (HIDS): is installed on a host and is responsible for
identifying malicious actions on the host computer. It can detect attacks and
vulnerabilities to the system that the Network-based Intrusion Detection System
cannot detect. The HIDS only protects the hosts assigned to it. The ability of HIDS
to detect attacks that NIDS failed to detect, as well as detect malicious traffic that emanates from
the host itself, is an advantage of HIDS over NIDS (Ashraf, Ahmad, & Ashraf, 2018;
Agah, 2017).
ii. Network Intrusion Detection System (NIDS): monitors the entire network from
where it is installed and has a more powerful detection mechanism for identifying
malicious network traffic; its sensors are strategically placed throughout the network to monitor traffic (Chirag et al., 2013;
Agah, 2017).
b. Detection mechanism (Akyol et al., 2016).
i. Anomaly-based Intrusion Detection System (AIDS): builds a profile of normal
system routines and flags observed behaviors as anomalous if they deviate from normal
patterns. This system uses machine learning to construct the model. Such IDSs
function more effectively at detecting new kinds of attacks; regrettably, one major
drawback is that they cannot avoid a high false positive rate (Zhou et al., 2020).
ii. Signature-based Intrusion Detection System (SIDS): the False Alarm
Rate (FAR) for signature-based detection is extremely low. Although this type of
detection performs well on known attacks, it cannot identify new attacks whose signatures are not yet stored.
Uncovering correlations, trends and patterns hidden in massive quantities of data is the essence of data mining; it converts
raw data into useful knowledge. The data mining approach can address IDS issues or drawbacks by
utilizing one of the following techniques: data summarization with statistics and
visualization is used to solve the problem of data overload. Clustering, which is the
task associated with grouping similar observations into clusters or categories, can also
be used, as can classification, which can be used to predict the category of an attack
based on historical data. A variety of algorithms have been made available from disciplines such as statistics and machine learning.
Data mining tasks are broadly classified into two parts (Namrata, Bijendra & Rajkumar, 2015):
ii. Anomaly Detection: It recognizes observations that are distinct from the
rest of the data in some way. The goal of this task is to find anomalies or outliers in the data.
b. Descriptive tasks: These are tasks that involve determining patterns, clusters, or
relationships that sum up the underlying data relationships. Descriptive tasks are a form of
unsupervised learning. The following sub-tasks fall under the descriptive tasks:
ii. Clustering tasks: aims to discover clusters of observations that are closely related or similar to one another.
a. Classification
i. Support Vector Machine (SVM): separates the data with a hyperplane, with the points
from one group placed on one side and the points from the other group placed on the second side. There could be
more than one hyperplane, but SVM aims to use the hyperplane that best separates the two classes.
The distance is known as the margin, and the points within the margin are known as support vectors.
SVM searches for the best splitting hyperplane, but not all data can be linearly separated in the
original input space; kernel functions are used to address this issue. To accomplish this, the Kernel employs a transformed feature space
2012).
ii. K-Nearest Neighbors: is one of the most fundamental supervised Machine Learning algorithms. A new data
point is assigned to a class based on the majority vote of its nearest neighbors.
iii. Decision Tree: chooses the variable that gives the greatest purity in its equation as its root node or decision node. The
leaves at the decision tree's end are the tree's terminal nodes, and when one is
reached a prediction is made. Splits aim to produce pure or uniform buckets with more of one class. One of the mathematical
parameters used to calculate bucket splits is the Gini Index (see the sketch after this list). Variables are
chosen according to which split yields the purest buckets.
b. Clustering
Hierarchical clustering repeatedly finds the closest pair of points by calculating the distance separating a point and all other
points, calculates their centroid, and repeats the process until all points are merged;
objects are then grouped based on the shortest distance. K-Means clustering requires the number of
clusters to be specified in advance; it assigns data points to centroids after calculating their distances and, based on the
nearest centroid, iteratively updates the clusters.
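To make the Gini index mentioned in the decision-tree item above concrete, the following short Python sketch (an assumed illustration, not code from this project; the toy label lists are made up) computes the impurity of a bucket and of a candidate split:

    from collections import Counter

    def gini(labels):
        # Gini impurity of one bucket: 1 - sum over classes of (class proportion)^2
        total = len(labels)
        if total == 0:
            return 0.0
        return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

    def split_gini(left, right):
        # Weighted Gini impurity of a split into two buckets
        n = len(left) + len(right)
        return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

    print(gini(["attack", "attack"]))    # 0.0 (pure bucket)
    print(gini(["attack", "normal"]))    # 0.5 (completely mixed bucket)
    print(split_gini(["attack", "attack"], ["normal", "normal", "attack"]))  # about 0.27

A split with a lower weighted impurity is preferred, which is the criterion the decision tree uses when choosing its splitting variables.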
High dimensionality is one of the challenges machine learning algorithms are faced
with. Having datasets with numerous fields or columns is common these days.
Datasets used in Intrusion Detection Systems are generally huge, and using machine
learning algorithms to train the system on datasets with numerous features can lead to
high time consumption; it can also increase learning complexity, and the use of unrelated
features can negatively impact the system. Feature selection or feature extraction
therefore helps in removing unwanted features or noise from the data thereby improving
classification rate. The major difference between feature selection and extraction is
feature selection selects features without changing the initial data or generating new
features or columns to be used while extraction generates new features or columns from
existing features which sometimes might misinterpret the data. Feature selection has
been confirmed by researchers to be an important and crucial step in Machine Learning.
It selects subsets of the original features based on given criteria, and the features
selected are used by the classifier to build a model. Machine learning feature selection methods are commonly grouped into three:
a. Filter Method: applies statistical measures to features and, based on a threshold,
generates subsets without putting into consideration the base classifier; that is,
it is independent of the classifier, and it has the tendency to select large subsets of
features. Independence from the classifier and generality are also advantages, since it evaluates the properties of the
data. Information Gain, Correlation-based and the Chi-squared test are some of the filter methods.
b. Wrapper Method: uses a learning algorithm to evaluate candidate subsets, so different combinations of
features are tested. It evaluates feature subsets based on their accuracy. It has the
advantage of taking into account the interaction between the wrapper and the classifier. Aside from the wrapper method being computationally
expensive, it also risks overfitting to the chosen classifier.
c. Embedded Method: these methods combine the benefits of both the wrapper and
filter methods at a more reasonable computational cost. Careful extraction of the features that impart the most to the
training for that iteration is done for every iteration of the training process,
which is handled by iterative methods. The embedded technique generates a large
number of subsets from the dataset. It chooses features for the model at random
and attempts to perform all permutations and combinations; whichever subset has
the highest accuracy will be chosen as the subset of features to be given to the dataset
for training.
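The contrast between the filter and wrapper approaches can be sketched as follows. This is an assumed, minimal scikit-learn illustration on synthetic data; the choice of mutual information as the filter score, the k=10 filter size and the 5-feature wrapper target are arbitrary and are not values taken from this project:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, mutual_info_classif

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

    # Filter method: scores every feature independently of any classifier.
    filter_selector = SelectKBest(score_func=mutual_info_classif, k=10)
    X_filtered = filter_selector.fit_transform(X, y)

    # Wrapper method: searches feature subsets using a classifier's cross-validated accuracy.
    wrapper_selector = SequentialFeatureSelector(
        RandomForestClassifier(random_state=42),
        n_features_to_select=5, direction="forward", cv=3)
    X_selected = wrapper_selector.fit_transform(X_filtered, y)
    print(X_selected.shape)   # (500, 5)

Chaining the two, as done here, mirrors the multi-level idea adopted later in this project: the cheap filter narrows the field before the costlier wrapper search runs.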
Various researchers have proposed various methods for the detection of intrusions.
Khan (2021) proposed the HCRNNIDS, a hybrid model that classifies malevolent attacks within the network using a convolutional neural network
in addition to a recurrent neural network (CRNN). Within the HCRNNIDS, the
convolutional neural network (CNN) captured native attributes, while the recurrent
neural network (RNN) captured features that were temporal to enhance the accuracy
and classification of the ID system. Studies were conducted using the CIC-DS 2018
ID data set. Results achieved after the analysis of the hybrid (HCRNNIDS) intrusion
detection system proved the system to be effective, achieving a detection rate of 97.7%.
Abishek (2020) used correlation-based feature selection to prioritize the features based
solely on the strongest correlation between the class outcome and the features, together
with an artificial neural network. When compared with other cutting-edge techniques, the model outperformed them in terms of sensitivity, accuracy and
specificity. Nevertheless, there were a few drawbacks to using ANN in the model. The
IDS training was time-consuming and necessitated a large amount of time and training
data. Better results were achieved when more layers were added; however, the model
was sensitive to the parameters of the artificial neural network, like the number of hidden layers and neurons per layer.
Salih and Abdulazeez (2021) reviewed 17 existing approaches from 2018-2020 that
have been proposed by various researchers. Deciding on the algorithm to use has not
been an easy task, and testing the performance of different classifiers seemed like the best
method to the researchers, which led to the review of existing systems. The result of the
review states that feature selection has good effects on the performance of the model
because it decreases both training and testing time by filtering out irrelevant features.
Also, hybrid classifiers could provide an optimal solution to the detection of attacks.
Finally, Random Forest achieved the best accuracy, while Particle Swarm Optimization (PSO) achieved the best result for feature selection.
In another study, several machine learning algorithms were applied to reason about network traffic. J48, Random Tree, BayesNet,
PART, Logistic, Random Forest, REPTree, IBK, and JRip were found to be acceptable
for the classification approach. Additional focus was placed on Bagging, Boosting, and
Blending (Stacking) machine learning methods and comparisons were done on their
accuracies. The WEKA tool was used to compare these algorithms, and also the results
were shown by performance criteria. Cross validation of 10 folds was utilized to
simulate the categorization models using NSL-KDD dataset. BayesNet and Random
Forest were found to be effective by the researchers. Boosting achieved a much better
performance among Bagging, Boosting and Blending. The researchers further expressed
that the proposed algorithms may be used to build effective network intrusion detection systems.
Researchers here considered the SMO (SVM), Naive Bayes, Multilayer Perceptron
(NN), Logistic Regression, J48 (DT) and IBK (KNN) algorithms. The models were
built for different phases. At phase one, the selected classifiers were used with their
default parameters and no preprocessing was done on the dataset used. The classifiers
were trained on the set of data assigned for training provided by the NSL-KDD dataset, and
the difference in performance between the training set and test set depicts over-fitting for all classifiers. Preprocessing of data was
performed at the second phase. The InfoGainAttributeEval algorithm was used for the selection of features
for each classifier. Though the accuracy of the second phase was better than that of the
first phase, overfitting was also detected. The third phase worked on diminishing the
dataset class imbalance by over sampling the minority classes and under sampling the
majority classes. Mitigating the imbalanced classes helped reduce the limitations in
the detection of U2R and R2L attacks which was not the case in the first and second
phase.
Kajal and Nandal, (2020) implemented Genetic Algorithm and Discrete Wavelet
Transform (DWT) with Artificial Bee Colony (ABC) for feature selection. The aim of the system was
detecting attacks, which was limited to Denial of Service (DDoS) attacks, using the KDD
dataset. The system made use of hybridized Artificial Neural Network and Support
Vector Machine (ANN-SVM) as the base classifier. The system was compared to three
different existing systems that addressed intrusion detection using the KDD dataset
against DDoS attacks, and the proposed system was seen to have surpassed the other existing systems in terms of accuracy.
Waskle et al. (2020) proposed in their study the application of principal component
analysis (PCA) and the random forest classifier to develop a model for an efficient
Intrusion Detection System. In specific, the research offered a method for developing
an efficient IDS: the principal component analysis assisted in extracting features from
the data by decreasing the data dimensions, while the random forest assisted in
classification of attacks into normal and intrusions using the Knowledge Discovery
Dataset (KDD). Their proposed model began with the extraction of features using PCA
then classification of attacks done by random forest and finally evaluation of the model.
The study of Kasongo and Sun (2020) adopted a Feed-forward Deep Neural Network
(FFDNN) together with a Wrapper-based Feature Extraction Unit (WFEU). The WFEU method used the Extra Trees algorithm to extract
optimal features. The effectiveness and efficiency of the model were tested using the
UNSW and AWID datasets. Comparison of the system was done with respect to other
existing machine learning algorithms including Random Forest (RF), Naive Bayes
(NB), K-Nearest Neighbors (KNN), Decision Tree (DT) and Support Vector Machine
(SVM). The WFEU was used to extract features, and its output was then used as input for the
FFDNN classifier and the other algorithms for both binary and multi-class classification.
The WFEU-FFDNN approach achieved a greater accuracy than the other approaches,
and in comparison to existing works, the findings demonstrate that the suggested system performs well.
Zhou et al. (2020) employed Correlation-based Feature Selection with the Bat Algorithm
(CFS-BA), which is a heuristic algorithm for the purpose of reducing the dimensionality
of the dataset due to the numerous redundant features in data sets. Furthermore, an
ensemble of the Random Forest, Forest by Penalizing Attributes and C4.5 algorithms was used, owing to the fact that a
single classifier may not be efficient in detecting all variants of attacks. The goal of the study
is to create an unbiased model that improves the reliability and stability of the Intrusion
Detection System while requiring very little time and computational complexity. The
NSL-KDD dataset, AWID dataset, and CIC-IDS dataset were used for evaluation, and
cross validation was performed on them. The relevant features gotten from the
implementation of the CFS-BA algorithm show that the approach worked well, as it
reduced the dimensionality drastically and achieved the aim it was implemented for:
the NSL-KDD dataset reduced from 41 features to 10 features, the AWID dataset reduced
from 155 features to 8 features and CIC-IDS reduced from 78 features to 13 features.
In the work of Anish and Sundarakantham (2019), Naive Bayes and SVM were contrasted in terms of prediction accuracy and misclassification rate in
three different scenarios using 19,000 records of the NSL-KDD dataset. In the first
scenario, the models predicted the dataset which did not go through any pre-processing
phase. In the second scenario, CfsSubsetEval was used for feature selection and the
models predicted using the features gotten. At scenario three, the researchers
normalized the dataset and fed the normalized data to the models for prediction. In all
three scenarios, the classifiers were compared on their accuracies and misclassification rates. The accuracy of SVM was higher and its misclassification rate lower.
Bhati and Rai (2019) used support vector machine (SVM) variants, namely Fine
Gaussian (98.7%), Medium Gaussian (98.5%), Quadratic (96.1%) and Linear (98.6%) kernels.
The implementation of the IDS involved four major steps, namely, data collection,
preprocessing, division into train and test sets, and finally the evaluation
of the model based on the accuracy, Receiver Operating Characteristic (ROC) and
Confusion Matrix metrics. The results state that the Fine Gaussian SVM variant provided
the highest accuracy and also the smallest error rate for the intrusion detection.
Bindra and Sood (2019) compared the Logistic Regression, Gaussian Naive Bayes, Linear
SVM, Linear Discriminant, KNN and Random Forest ML algorithms for identifying DDoS attacks in networks. The CIC-
IDS-2017 dataset was utilized to train and evaluate the algorithms compared in this
work. Random Forest was the most accurate, with a 96.2% accuracy rate. The study
credits the effective outcome to the cross-validation applied on the algorithm. The
experiments also highlight the need of lowering the dataset's dimensionality; the
SelectPercentile technique was used to reduce the features from 85 to 12, and the median was also used.
Chu, Lin, and Chang (2019) investigated the result of Machine Learning (ML)
techniques on the NSL-KDD dataset utilizing SVM, Naïve Bayes, Decision Tree, and
Artificial Neural Network (ANN) with MLP for identifying Remote to Local, DoS,
User to Root attacks and Probe. SVM had a precision of 97.72 percent, ANN had a
precision of 97.82 percent, Naïve Bayes had a precision of 90 percent, and the J48 had
a precision of 59.3 percent. According to the researchers, the high accuracy of SVM
was reached by making adjustments to gamma and c parameters, but the best accuracy
of the Artificial Neural Network was attained with four layers. Despite the fact that
there was no significant difference in the findings of the two algorithms that performed
better, the researchers noted that the speed of classification increased significantly after
using principal component analysis to minimize the feature space of the data applied to the models.
Kim, Shin, and Choi (2019) used the CSE-CIC-IDS 2018 dataset to construct an
Intrusion Detection System model using a Convolutional Neural Network (CNN) and
compared it with a Recurrent Neural Network (RNN). Preprocessing was done on the data, and features were chosen. A CNN needs the dataset
to be transformed into image form; convolutional layers, max pooling layers and a fully
connected layer make up a model. Maxpooling was deployed behind each convolutional
layer in order to implement the model. Although the max pooling layer is not required
for a CNN model, it was necessary since the transformed pictures only comprise
quantitative data and no invisible signatures, thus there is very little risk of losing
important features from the max pooling. In addition, for each convolutional layer, the
activation function 'relu' was employed. In an effort to minimize overfitting, dropout
was performed after each phase of the max pooling. Finally, below the last max-
pooling layer, a fully connected layer is placed. When applied to the CIC-2018 dataset,
the CNN model appeared to be more accurate than the RNN model in label
categorization. Furthermore, preparing the dataset with an appropriate ratio of benign and attack records was also emphasized.
The work of Patgiri, Varshney, Akutota, and Kunde (2019) investigated the application
of machine learning to intrusion detection using Support Vector Machine and Random Forest. In an effort to decrease the computational
cost, Recursive Feature Elimination was used to select relevant features for both classifiers. The NSL-KDD dataset was used for the evaluation
of the models developed. The extensive experimentation carried out showed that the accuracy
achieved by Random Forest was stronger than that of SVM before the Recursive Feature
Elimination method was applied. On the contrary, SVM performance was better than
Random Forest after using Recursive Feature Elimination Cross Validation (RFECV).
Taher, Jisan and Rahman (2019), in their work, evaluated the performance of Artificial
Neural Network and SVM using the NSL-KDD dataset. They aimed to discover the
classifier with the best success rate and accuracy. The proposed models were developed
with the features selected by a wrapper method, which reduced the feature set from 35 features to 17
features. The work also aimed to discover the best number of hidden layers and learning
rate, and the results state that 3 hidden layers and a 0.1 learning rate were best. The
evaluation was done comparing the two classifiers before feature selection was applied
and after feature selection was employed. In both scenarios, the detection accuracy was
better when feature selection was employed; however, in both cases, the Artificial Neural
Network performed better than SVM both before and after feature selection.
Xu, Przystupa, Fang, Marciniak, Kochan and Beshley (2019) proposed a combination
strategy of feature selection and weighted K-Nearest Neighbor. The IDS proposed was examined using the KDDCup ID dataset, and
the outcomes testify that the weighted KNN improved efficiency while sacrificing a
small amount of accuracy. Although only KNN was studied to enhance the performance
of network intrusion detection, the researchers decided that Naïve Bayes and SVM can
also be used for this purpose and indicated that they plan to use SVM and Naïve Bayes in future work.
Yulianto, Sukarno and Suwastika (2019) worked on improving the AdaBoost-based Intrusion Detection System that was previously in place (Aburomman and Reaz, 2016).
Ensemble Feature Selection (glm, gbm, treebag, ridge and lasso) and Principal
Component Analysis (PCA) were used to choose significant characteristics. For the
label classification, the AdaBoost classifier was applied. The proposed technique began
with the cleaning and scaling of the data, which totaled 225,745 records. The dataset
was then split into training data (70 percent of the data, or 158,022 records) and testing
data (the remaining 30 percent of the data, resulting in 67,723 records). SMOTE was
used to oversample the imbalanced classes, and the Ensemble technique (comprising
gbm, glm, lasso, ridge, and treebag) was used to choose features, with 25 being chosen
(16 features). The model was created with AdaBoost and five rounds of cross-validation
were performed. Finally, the model's efficiency was assessed using the accuracy,
precision, and recall criteria. According to the results, the suggested technique outperformed the existing IDS it was compared against.
Haripriya and Jabba (2018) reviewed various machine learning algorithms that have
indeed been suggested by existing works in order to determine the best techniques for
implementing an Intrusion Detection System. The review involved existing work that
used single classifiers, hybrid classifiers and ensemble classifiers. The researchers
found that each algorithm has its own importance and contributions when contrasted to
other methods, and as a result a particular technique could not be selected as the best.
They also noted that it can be difficult to train algorithms when a certain amount of traffic data is not available.
Hebatallah, Farouk and Abdel-Hamid (2018) aimed to identify the fewest set of features
that results in maximum accuracy. Researchers here proposed an IDS based on filter,
wrapper method and the J48 and Naive Bayes classifiers as the different base classifiers.
The system selected the best model after performing a series of experiments that were
evaluated using the UNSW-NB15 dataset. The models were partitioned into two layers.
The first layer was for selection of features and it was divided into five strategies. The
first strategy used all features. The second strategy used wrapper method to select the
features. The filter method which used different evaluators that summed up to six was
employed in the third strategy. The fourth strategy merged the single evaluators. The
fifth strategy used a combination of the best subset and evaluator. The output from the
first layer, which consists of different features, was then given as input to the second layer,
which applied both J48 and Naive Bayes on each strategy separately. The results of the
experiments state that the GR ranking method used with the J48 classifier achieved the best accuracy.
In the study of Ashraf, Ahmad and Ashraf (2018), the detection rate and accuracy of IDSs
were computed using J48, Random Forest and Naïve Bayes classifiers. The NSL KDD
dataset was utilized in the experiments. On the 20% NSL KDD dataset, the
classification performance of J48, Random Forest and Naïve Bayes were analyzed in
the research. Based on the findings, a conclusion was attained that Random Forest
outperformed Naïve Bayes in terms of detection rate and accuracy. Since all three
classifiers achieved up to 90% precision and recall, a hybrid model consisting of all
three classifiers was suggested.
Chiba, Abghour, Moussaid, El Omri and Rida (2018) developed models which summed up to 48 models. Two IDSs which proved to be best based on
false positive rate, detection rate, F-score and AUC were chosen during the evaluation
process. The initial stage aimed towards determining the important parameters to be used
in the classifier's construction, and the parameters they determined to be the most
important were those of the Backpropagation Neural Network, most especially the nodes to be used in the layers, the momentum term, the learning
rate and the transfer function. Generation of the combinations of different parameters was done
in the third stage. Implementation of the IDS was done in stage four. Finally, in the fifth
stage, there was a comparison based on the efficiency of the models that were developed,
and the study selected two. In future work, an optimized algorithm which searches for
optimal arguments that will influence the performance of the model will be applied.
Hajisalem and Babaie (2018) suggested a new artificial bee colony (ABC) and artificial
fish swarm (AFS) hybrid classification technique. The structure of the implemented
approach was as follows: division of the training datasets, feature selection, rule
generation, and hybrid classification. The framework was based on the ABC-AFS that
was proposed. In order to split the data assigned for training and eliminate the redundant
features, the Fuzzy C-Means Clustering (FCM) and Correlation-based Feature Selection
(CFS) algorithms were used. In addition to using the CART technique, attempts to
distinguish between normal and abnormal records involved the use of If-Then rules
that were constructed based on the selected attributes. The rules generated were also
used to train the presented hybrid technique. Outcomes on the UNSW-NB15 and NSL-
KDD datasets state that the method proposed achieved a detection rate of 99 percent
and a false positive rate of 0.01 percent in terms of performance metrics. Additionally,
a comparison of time and computational cost revealed that the overhead of the model was low.
Idhammad et al. (2017) used a multi-layer perceptron with a hidden layer to discern DDoS attacks. The proposed
system was examined based on the UNSW-NB15 and NSL-KDD data sets. The arguments
supplied to their models were adjusted and their outcomes were stated. The results were
satisfactory when compared to state-of-the-art DoS detection methods.
In the work of Al-yaseen, Othman, Zakree, and Nazri (2016), they introduced a system
framework based on a multi-level hybrid of extreme learning machine and support vector machine.
The researchers stated that SVM consumes a lot of training time, and due to that, a
modified K-means was employed to minimize the dataset size, thereby getting smaller
samples of the dataset (10% of the KDD data), which results in a small, high-quality dataset.
Four SVM classifiers and one ELM classifier were employed. The results showed that
their system outperformed the state-of-the-art methods and large fluctuations were not seen in
the detection performance. Furthermore, the results indicate enhanced accuracy and
short training time due to the reduction of the dataset. However, the use of more than
one classifier resulted in a longer testing time.
Lin (2015) suggested a KNN model based on Cluster Centers and Nearest Neighbors
(CANN), in which each record is represented by a new one-dimensional distance-based feature; this
newly acquired dataset was utilized to evaluate the base classifier. The results reveal that
the CANN classifier outperforms the KNN and SVM classifiers in the dataset's first
dimension. The CANN had a better rate of correct identification and a lower percentage
of false alarms. When tested on two datasets, KNN and SVM were shown to be
computationally intensive, while CANN was computationally light. In terms of the study's
constraint, CANN was unable to recognize the R2L and U2R attacks since the one-dimensional
representation was unable to appropriately depict them.
Maharaj and Khanna (2014) worked on the REPTree algorithm and Voting Feature Interval
(VFI). According to the researchers, most of the existing systems only detect whether there
is an attack or not, which does not give detailed information about the kind of attack, and
this gave them a reason to direct the aim of their study towards concluding on a classifier that
is suitable for the classification of more than two classes. Evaluation was done using the KDDCUP dataset based on
the Receiving Operating Characteristic (ROC) curve which gave details about the Area
Under Curve (AUC), False Positive Rate (FPR) and True Positive Rate (TPR). After
the evaluation, the study therefore concluded that the REPTree learning algorithm is
Golmah (2014) implemented a hybrid IDS that used Data Mining algorithms. The
researcher combined the SVM and C4.5, which yielded an improved accuracy when compared to existing approaches.
In the Ghosh et al. (2014) study, the purpose was to improve the performance of
classifiers for detecting intrusions by applying multi-level feature selection and hybridized
K-Nearest Neighbors and Neural Networks (KNN-NN). Prior to the classification, there
were four steps. In the first step, preprocessing was done on the NSL-KDD data set
which was used for the experimental analysis. In the next step, the Rough Set Theory
(RST) wrapper method was selected for the first feature selection, prior to using the
RST, Normalization and Discretization were performed. And then the Information Gain
(IG) filter method was done for the second level feature selection process in step three.
After all these were done, then the classification phase came in and it was done in two
phases by applying KNN first and using the output as an input for the Neural Networks.
Jongsuebsuk, Wattanapongsakorn and Charnsripinyo (2013) presented an IDS using a fuzzy rule algorithm and a genetic
algorithm; it was developed using the KDD99 dataset as well as a dataset provided by the
researchers. Assessment of the IDS in terms of false alarm rate, detection speed and
detection rate proved that network attacks could be detected in real time within a very
short time.
The study of Patel et al., (2012) reviewed the K-Nearest Neighbors (KNN), Artificial
Neural Network (ANN), Support Vector Machine (SVM), Decision Tree (DT) and
Naive Bayes (NB) data mining techniques. The researchers further highlighted the
advantages and disadvantages of the algorithms. In this study, the performance of all
the algorithms implemented was low; although their dataset statistics were not stated,
the results gotten from the evaluation of the algorithms made the researchers conclude
that combining more than one algorithm may be used to nullify the disadvantages of
one another, since different algorithms have different insights regarding the situation.
Summary of the reviewed studies (S/N, Author(s), Year, Title, Methodology/Contributions, Results):

1. Khan (2021) - HCRNNIDS: Hybrid Convolutional Recurrent Neural Network-Based Network Intrusion Detection System.
   Methodology/Contributions: Hybrid IDS (Convolutional Neural Networks and Recurrent Neural Networks).
   Results: Performance achieved was higher in comparison to existing IDSs and a detection rate of 97.7% was achieved.

2. Thaseen, Banu, Lavanya, Ghalib & Abishek (2021) - An integrated intrusion detection system using correlation-based attribute selection and artificial neural network.
   Methodology/Contributions: Correlation feature selection integrated with Artificial Neural Networks.
   Results: Proposed model performed better than state-of-the-art methods; however, training of the IDS was slow, required a lot of training time and was not memory- or computationally efficient.

3. Salih & Abdulazeez (2021) - Evaluation of Classification Algorithms for Intrusion Detection System: A Review.
   Methodology/Contributions: Review of existing classifiers.
   Results: Feature selection has good effects on performance; hybrid classifiers could provide an optimal solution; Random Forest achieved the best accuracy while Particle Swarm Optimization achieved the best result for feature selection.

5. Kajal & Nandal (2020) - A Hybrid Approach for Cybersecurity: Improved Intrusion Detection System Using ANN-SVM.
   Methodology/Contributions: GA and DWT with ABC for feature selection; hybridized ANN-SVM for classification.
   Results: The proposed system was seen to have outperformed the other existing systems in terms of accuracy.

6. Waskle, Parashar & Singh (2020) - Intrusion Detection System Using PCA with Random Forest Approach.
   Methodology/Contributions: PCA for feature extraction; Random Forest as base classifier.
   Results: Comparison was done based on performance time, accuracy and error rate with other classifiers (SVM, NB, DT); the PCA with Random Forest approach performed better in all metrics used.

7. Kasongo & Sun (2020) - A deep learning method with wrapper based feature extraction for wireless intrusion detection system.
   Methodology/Contributions: Wrapper-based feature extraction unit (WFEU) used with Extra Trees for feature selection and a Feed Forward Deep Neural Network (FFDNN) as base classifier.
   Results: The proposed system achieved a greater accuracy compared to other existing methods for both binary and multi-class classification.

8. Zhou, Cheng, Jiang & Dai (2020) - Building an Efficient Intrusion Detection System Based on Feature Selection and Ensemble Classifier.
   Methodology/Contributions: Correlation-based Feature Selection - Bat Algorithm (CFS-BA) for feature selection; voting technique using C4.5, RF and Forest by Penalizing Attributes for the model building.
   Results: An unbiased model having low computational complexity was achieved.

9. Anish and Sundarakantham (2019) - Machine Learning based Intrusion Detection System.
   Methodology/Contributions: CfsSubsetEval to select features; Naive Bayes and SVM as base classifiers.
   Results: In all the different scenarios, SVM proved to have an upper hand.

10. Bhati & Rai (2019) - Analysis of Support Vector Machine-based Intrusion Detection Techniques.
    Methodology/Contributions: Variants of SVM (Quadratic, Linear, Fine Gaussian and Medium Gaussian).
    Results: Quadratic (96.1%), Linear (98.6%), Fine Gaussian (98.7%) and Medium Gaussian (98.5%); Fine Gaussian had the best accuracy and the least error.

11. Bindra & Sood (2019) - Detecting DDoS attacks using machine learning techniques and contemporary intrusion detection dataset.
    Methodology/Contributions: Comparison involving Logistic Regression, KNN, Gaussian Naïve Bayes, Random Forest, Linear SVM and Linear Discriminant.
    Results: Random Forest appeared to be more accurate (96.2%).

12. Chu, Lin & Chang (2019) - Detection and classification of advanced persistent threats and attacks using the support vector machine.
    Methodology/Contributions: Investigation on the performance of SVM, Naïve Bayes, Decision Tree and Artificial Neural Networks (with Multi-layer Perceptron).
    Results: SVM attained a high performance owing to the adjustments of the c and gamma parameters; however, ANN performed better with four layers.

13. Kim, Shin & Choi (2019) - An Intrusion Detection Model based on a Convolutional Neural Network.

14. Taher, Jisan & Rahman (2019) - Network Intrusion Detection using Supervised Machine Learning Technique with Feature Selection.
    Methodology/Contributions: Wrapper method for feature selection; ANN and SVM classifiers.
    Results: Feature selection improved the performance of the classifiers, though ANN performed better both before and after feature selection; results also state that for the ANN classifier the best number of hidden layers was 3 and the best learning rate was 0.1.

15. Xu, Przystupa, Fang, Marciniak, Kochan & Beshley (2019) - A Combination Strategy of Feature Selection Based on an Integrated Optimization Algorithm and Weighted K-Nearest Neighbor to Improve the Performance of Network Intrusion Detection.
    Methodology/Contributions: Weighted KNN.
    Results: Efficiency was increased at the expense of a small amount of accuracy.

16. Yulianto, Sukarno & Suwastika (2019) - Improving AdaBoost-based Intrusion Detection System (IDS) Performance on CIC IDS 2017 Dataset.
    Methodology/Contributions: SMOTE for the imbalanced dataset; PCA for feature extraction; ensemble feature selection (gbm, glm, lasso, ridge and treebag).
    Results: The proposed model outperformed the selected existing IDS (Aburomman & Reaz, 2016).

17. Haripriya and Jabba (2018) - Role of Machine Learning in Intrusion Detection System: Review.
    Methodology/Contributions: Review of existing works including single, hybrid and ensemble classifiers.
    Results: The researchers could not make a decision, as the study showed that each algorithm has its own importance and contributions.

18. Hebatallah, Farouk & Abdel-Hamid (2018).
    Methodology/Contributions: Filter- and wrapper-based methods for feature selection; J48 and Naive Bayes classifiers.
    Results: The results of the experiments state that the GR ranking method used with the J48 classifier achieved the best accuracy.

19. Patgiri, Varshney, Akutota & Kunde (2018) - An Investigation on Intrusion Detection System Using Machine Learning.
    Methodology/Contributions: Recursive feature elimination for feature selection; Random Forest and SVM as base classifiers.
    Results: The application of RFECV helped SVM perform better than Random Forest.

20. Ashraf, Ahmad & Ashraf (2018) - A Comparative Study of Data Mining Algorithms for High Detection Rate in Intrusion Detection System.
    Methodology/Contributions: Comparison of Naïve Bayes, J48 and Random Forest classifiers.
    Results: Random Forest outperformed; all three classifiers achieved accuracy over 90%, and a hybrid model consisting of all three may be proposed.

21. Chiba, Abghour, Moussaid, El Omri & Rida (2018) - A Novel Architecture Combined with Optimal Parameters for Backpropagation Neural Networks Applied to Anomaly Network Intrusion Detection.
    Methodology/Contributions: Back propagation Neural Network (BPNN).
    Results: Results state that the aim of the proposed system, having very high detection rates and accuracy and a low rate of false positives, was attained.

22. Hajisalem & Babaie (2018) - A hybrid intrusion detection system based on ABC-AFS algorithm for misuse and anomaly detection.
    Methodology/Contributions: Artificial Bee Colony and Fish Swarm hybrid IDS; Fuzzy C-Means Clustering (FCM) and Correlation Feature Selection for removing irrelevant features.
    Results: The hybrid model achieved an accuracy of 99% and a false positive rate of 0.01%.

23. Idhammad et al. (2017) - DoS detection method based on artificial neural networks.
    Methodology/Contributions: Artificial Neural Networks.
    Results: The obtained results were satisfactory when compared to the state-of-the-art DoS detection methods.

24. Al-yaseen, Othman, Zakree & Nazri (2016) - Multi-Level Hybrid Support Vector Machine and Extreme Learning Machine Based on Modified K-means for Intrusion Detection System.
    Methodology/Contributions: K-means for dataset reduction; hybridized SVM (4) and extreme learning machine.
    Results: The proposed system outperformed the state-of-the-art methods, but the use of more than one classifier resulted in a longer testing time.

25. Lin (2015) - An intrusion detection system based on combining cluster centers and nearest neighbors.
    Methodology/Contributions: KNN, SVM and Cluster Center and Nearest Neighbors (CANN).
    Results: CANN classified labels correctly and had a low false alarm rate; KNN and SVM were computationally intensive, while CANN was computationally light but could not recognize R2L and U2R attacks.

26. Maharaj & Khanna (2014) - A Comparative Analysis of Different Classification Techniques for Intrusion Detection System.
    Methodology/Contributions: REPTree and Voting Feature Interval.
    Results: The REPTree learning algorithm performed better and is more efficient for Intrusion Detection Systems.

27. Velusamy et al. (2014) - An Efficient Hybrid Multilevel Intrusion Detection System in Cloud Environment.
    Methodology/Contributions: Rough Set Theory and Information Gain for feature selection; hybridized KNN and NN as base classifier.
    Results: Information Gain gave the best result compared to Rough Set Theory; the hybridized classifier performed better than the classifiers used separately.

28. Golmah (2014) - An Efficient Hybrid Intrusion Detection System based on C5.0 and SVM.
    Methodology/Contributions: Hybridized SVM and C4.5 algorithm.
    Results: Improved accuracy in comparison to existing approaches.

29. Jongsuebsuk, Wattanapongsakorn & Charnsripinyo (2013) - Real-time Intrusion Detection with Fuzzy Genetic Algorithm.
    Methodology/Contributions: Fuzzy rule and Genetic algorithm.
    Results: The detection rate was approximately over 97.5%.

30. Patel, Thakkar & Ganatra (2012) - A Survey and Comparative Analysis of Data Mining Techniques for Network Intrusion Detection Systems.
    Methodology/Contributions: SVM, KNN, ANN, NB, DT.
    Results: Different algorithms have different advantages and disadvantages; with respect to the contribution they make to an Intrusion Detection System, different algorithms can be combined so as to nullify the disadvantages of one another.
CHAPTER THREE
RESEARCH METHODOLOGY
3.1 Introduction
This chapter describes the method chosen to achieve the project's objectives, the
programming language used, the data set used, and the program tools used. The proposed
method is in various stages. The techniques and methods required to achieve the objectives are described in the sections that follow.
The system proposed is grouped into three major phases: data collection, pre-processing
and feature selection; model development and detection; and evaluation.
For the selection of important features, the proposed system will use Correlation-based
Feature Selection and Sequential Forward Selection. The Random Forest Classifier will
be tested using the NSL-KDD dataset. The dataset will be partitioned into two portions:
training and testing. The training set will contain 80% of the data, allowing the model
to learn and predict outcomes correctly, while the testing set will contain 20% of the
data. The proposed method will be implemented in stages. The first stage would be for
data preprocessing, followed by the first round of feature selection in the second stage, which checks for multi-collinearity in the data. The
relevant features would be re-selected in the third stage. The model would then be
developed using the base classifier in the fourth stage. In the final stage, the model
would be evaluated. Correlation-based Feature Selection, which checks
for multi-collinearity in data, was chosen as a first-level filter for selecting features for
intrusion prediction. However, because filter methods are not affected by the classifier,
Sequential Forward Selection was chosen as a second-level wrapper method for the
final feature selection that takes the classifier into account. Finally, the Random Forest classifier would be used as the base classifier to build the detection model.
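The staged flow described above can be summarised in the following minimal Python sketch. It is an assumed illustration using pandas and scikit-learn: the file name nsl_kdd.csv, the label column name, the 0.9 correlation cut-off and the 10-feature target are placeholders rather than the project's final choices, and the data is assumed to be already numerically encoded.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Stage 1: load the preprocessed (numerically encoded) NSL-KDD data.
    data = pd.read_csv("nsl_kdd.csv")                    # placeholder file name
    X, y = data.drop(columns=["label"]), data["label"]   # placeholder label column

    # 80% training / 20% testing partition, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # Stage 2: first-level correlation filter (drop one of each highly correlated pair).
    corr = X_train.corr().abs()
    to_drop = [col for i, col in enumerate(corr.columns) if any(corr.iloc[:i][col] > 0.9)]
    X_train, X_test = X_train.drop(columns=to_drop), X_test.drop(columns=to_drop)

    # Stage 3: second-level wrapper selection around the base classifier.
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    sfs = SequentialFeatureSelector(rf, n_features_to_select=10, direction="forward", cv=3)
    X_train_sel = sfs.fit_transform(X_train, y_train)

    # Stages 4 and 5: train the Random Forest and evaluate on the held-out test set.
    rf.fit(X_train_sel, y_train)
    print(accuracy_score(y_test, rf.predict(sfs.transform(X_test))))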
3.3 Methodology
The NSL-KDD data will be used to examine the proposed system. The NSL-KDD
dataset is a refined version of the KDD CUP 99 dataset.
The dataset was labeled KDD'99 or KDD CUP 99 after its use in "The Third
International Knowledge Discovery and Data Mining Tools Competition" and quickly became the most used dataset for the evaluation of intrusion detection systems.
The dataset was created by the Defense Advanced Research Projects Agency (DARPA)
in 1998 and contains roughly 4,900,000 records and 41 attributes. The KDD dataset has
a high level of redundancy, with about 78 percent and 75 percent of the train and test sets being duplicated, respectively, leading
to the creation of the NSL-KDD data set, a revised version of the KDD dataset. Attacks in the
dataset are grouped into the following categories:
a. Denial of Service (DoS): An attack in which the attacker overloads the host system,
preventing authorized users from gaining access to data or services. The attacker
floods the network with unnecessary packet requests and depletes
computing resources, making legitimate requests impossible to process. Neptune,
Smurf, teardrop, pod and mailbomb are among the DoS attacks.
b. User to Root Attack (U2R): the intruder begins the breach by gaining entry to the
system using a regular user account. Following that, the attacker uses privilege
escalation (when an attacker exploits a system bug or design flaw to gain access to
resources normally reserved for the root or administrator).
c. Remote to Local Attack (R2L): an attacker who is able to send packets to a machine
connected to the network but has no identity on that system commits this type of
intrusion. The perpetrator exploits some loopholes to attain remote access to the
machine as a user. Guess_passwd, imap and spy are among the R2L attacks.
Data preprocessing is carried out to make the data interpretable and easily understood by
the machine and is used to make the model more efficient. The following steps are involved:
i. Data Cleaning: This step is concerned with filling in missing values or deleting
rows with missing values, obtaining a grid of errors, and detecting and dealing with outliers.
ii. Data Encoding: the dataset contains categorical variables that can be nominal or
ordinal. This step will deal with encoding those variables so that they can be processed by the learning algorithms.
iii. Data Transformation or Normalization: Because the units and magnitudes of the data
features differ, the values are scaled to a common range using min-max normalization:
    Xnorm = (X - Xmin) / (Xmax - Xmin)
A short sketch of this scaling is given after this list.
iv. Feature Selection: It entails using the methods for selecting features that are stated in the subsections that follow.
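A small sketch of the min-max scaling in step iii (an assumed illustration; scikit-learn's MinMaxScaler is one common way to apply the formula, fitted on the training portion only to avoid information leakage):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X_train = np.array([[0.0, 100.0],
                        [5.0, 200.0],
                        [10.0, 400.0]])
    scaler = MinMaxScaler()                # applies (X - Xmin) / (Xmax - Xmin) per column
    print(scaler.fit_transform(X_train))   # every column is rescaled into [0, 1]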
The CFS is a filter-based feature selection method that selects features based on a
measure that evaluates subsets of features depending on how correlated their features
are with the classification. The purpose of correlation-based feature selection is to retain
features that are highly correlated with the class or dependent feature but not with each other. The CFS is based on the
correlation coefficient, where a correlation of 0.5 or more is considered strong,
whereas a correlation less than 0.5 is considered weak. The correlation of the features
with the class and with one another is therefore examined when evaluating candidate subsets.
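The intuition above can be sketched with pandas as follows. This is an assumed illustration, not the exact CFS merit function: features are ranked by their absolute correlation with a numerically encoded class label, and a feature is discarded when it correlates too strongly with one already kept (the 0.5 cut-off mirrors the weak/strong threshold mentioned above, while the 0.1 class-correlation floor is arbitrary).

    import pandas as pd

    def correlation_filter(X: pd.DataFrame, y: pd.Series, class_thresh=0.1, inter_thresh=0.5):
        # Keep features correlated with the class but not with each other.
        class_corr = X.apply(lambda col: col.corr(y)).abs()
        ranked = class_corr[class_corr >= class_thresh].sort_values(ascending=False).index
        selected = []
        for feat in ranked:
            # Reject the feature if it is strongly correlated with one already selected.
            if all(abs(X[feat].corr(X[kept])) < inter_thresh for kept in selected):
                selected.append(feat)
        return selected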
Sequential Forward Selection (SFS), on the other hand, is a wrapper-based method
that initiates with an empty set and incrementally adds features chosen at each
iteration from the residual features that have not yet been added to the set of relevant
features.
As a result, the first selected feature should be capable of producing the lowest error in comparison
to every other single feature. Next, combinations of two features are formed using one of the residual features
together with the best feature already selected, and the combination that emerges best is
chosen. A group of three attributes is then created using one of the remaining features and the
best pair, and the process continues in this manner.
SFS begins with an empty model and then gradually fits the model with an individual feature,
one at a time, selecting the feature which achieves the highest accuracy. It then iteratively fits the
model with two features by experimenting with combinations of the previously selected
feature and all other leftover features, and it picks the feature that attains the highest accuracy at each step.
1. accuracy = 0
2. subset = null
3. while ~isempty(features) do
4.     state = 0
5.     for i = 1 to length(features) do
6.         acc_i = evaluate(classifier, subset ∪ {features(i)})
7.         if acc_i > accuracy then
8.             accuracy = acc_i; best = features(i); state = 1
9.         end if
10.    end for
11.    if state == 1 then
12.        subset = subset ∪ {best}; remove best from features
13.    else
14.        break
15.    end if
16. end while
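A runnable counterpart to the procedure above is sketched below (an assumed illustration; the Random Forest estimator, 3-fold cross-validated accuracy and the stopping rule are reasonable defaults, not prescriptions from this project):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def sequential_forward_selection(X, y, max_features):
        # Greedy SFS: repeatedly add the single feature that most improves CV accuracy.
        remaining = list(X.columns)
        subset, best_accuracy = [], 0.0
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        while remaining and len(subset) < max_features:
            scores = {f: cross_val_score(clf, X[subset + [f]], y, cv=3).mean()
                      for f in remaining}
            best_feature, acc = max(scores.items(), key=lambda kv: kv[1])
            if acc <= best_accuracy:
                break                    # no remaining feature improves accuracy
            best_accuracy = acc
            subset.append(best_feature)
            remaining.remove(best_feature)
        return subset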
The Random Forest classifier builds each of its trees on a random sample of records and a random
set of feature subsets. As a result, it is less prone to overfitting. Random Forest has the decision tree as its base learner and
is called a Forest because the model being developed consists of a large number of
decision trees, which could range from hundreds to thousands, hence the name Forest,
which is similar to a forest in the real world. Using a technique known as Bagging
(Bootstrap Aggregating), Random Forest creates its own training and testing data implicitly.
Bagging allows for the random selection of records with replacement,
with the training data taking two-thirds of the data and the testing data taking one-third
(called the Out of Bag sample). Random Forest, in addition to bagging, performs random
variable selection. The decision tree's goal is to minimize the impurity in a bucket, and
when choosing a variable for use in its equation, it chooses the variable that results in
the purest bucket. Random Forest, on the other hand, takes a random sample of
variables using the formula √p, where p is the number of variables. It then searches for
the variable that gives the purest bucket and applies it to the node in the tree. The
bagging method and random variable selection gave rise to the name 'Random', and multiple trees give rise to the name 'Forest'.
Figure 1: Random Forest classification. Bootstrap samples S1, S2, S3, ..., Sn are passed to Tree1, Tree2, Tree3, ..., Treen, which produce Result1, Result2, Result3, ..., Resultn; voting for the final class is based on the mode of these results.
Figure 1 depicts the deployment of the random forest classifier in the proposed system's detection phase.
Each tree produces a classification outcome, and the classifier's outcome is determined
by majority voting. The sample is assigned to the class that receives the most votes.
The bagging method and random variable selection are used in the Algorithm so that
different trees can learn different things from the data. If all trees are fed the same set
of records and variables (features), their predictions will be the same because there will
be no significant difference.
Generating trees
1. for i in number of trees
7. end for
8. create a node ‘d’ for the feature with minimum Gini index.
9. end for
17. Assign the attack class with the highest frequency to the instance.
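For concreteness, the following assumed scikit-learn sketch (not the hand-written procedure above) trains a Random Forest with bootstrap sampling and √p random feature selection at each split, with the final prediction taken as the majority vote of the trees:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=16, n_informative=6, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # bootstrap=True -> bagging; max_features="sqrt" -> about √p variables tried per split;
    # predictions are the majority vote (mode) over the individual trees.
    forest = RandomForestClassifier(n_estimators=500, bootstrap=True, max_features="sqrt",
                                    oob_score=True, random_state=0)
    forest.fit(X_train, y_train)
    print("out-of-bag accuracy:", forest.oob_score_)
    print("test accuracy:", forest.score(X_test, y_test))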
The model will be evaluated using metrics such as accuracy, recall and precision. These metrics are based on the True Positive Rate, False Positive Rate,
True Negative Rate and False Negative Rate obtained from the predictions made by the model.
True Positive Rate (TPR): The outcome of the TPR shows the correctly predicted positive class.
TPR represents when the outcome is Yes and the actual value is Yes.
True Negative Rate (TNR): The TNR depicts an outcome where the model’s
prediction for the negative class is correctly predicted. TNR represents when the
outcome is No and the actual value is No.
False Positive Rate (FPR): The model’s prediction on the positive class is incorrect.
It measures the number of instances incorrectly classified in the positive class. FPR
represents when the outcome is Yes and the actual value is No.
False Negative Rate (FNR): It shows the count of instances wrongly classified in
the negative class. The outcome depicts an instance incorrectly predicted as the
negative class. FNR represents when the outcome is No and the actual value is
Yes.
The Confusion Matrix displays the statistics of the right and wrong predictions made by
the model in relation to the actual results. A confusion matrix for multiple classes is shown below.
                    Predicted
                A      B      C      D      E
    Actual  A   TPA    EAB    EAC    EAD    EAE
            B   EBA    TPB    EBC    EBD    EBE
            C   ECA    ECB    TPC    ECD    ECE
            D   EDA    EDB    EDC    TPD    EDE
            E   EEA    EEB    EEC    EED    TPE
i. Accuracy: it is used to measure the ratio of the TPR and TNR to the total number of instances:
    Accuracy = (TP + TN) / Total instances
ii. Recall (Sensitivity): it is a measure of the TPR over the total of true positives and false negatives:
    Recall = TP / (TP + FN)
iii. Precision: it is a measure of the TPR over the total true and false positives:
    Precision = TP / (TP + FP)
iv. Specificity: it is a measure of the TNR over the true negatives and the false positives:
    Specificity = TN / (TN + FP)
v. F1 score: is a measure of the balance between the recall and precision, denoted as
    F1 = 2 * (Precision * Recall) / (Precision + Recall)
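As a quick check of these formulas, the counts can be read off a confusion matrix and plugged in directly (an assumed sketch; the toy label vectors are made up for demonstration):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 = attack, 0 = normal (toy labels)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    recall      = tp / (tp + fn)
    precision   = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1          = 2 * precision * recall / (precision + recall)
    print(accuracy, recall, precision, specificity, f1)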
3.5 Framework of the Proposed System
The framework of the proposed system comprises the following stages: Data Collection, Data Processing, Prediction and Model Evaluation.
REFERENCES