
Project Sample

This document describes a student's project on developing an intrusion detection system using multi-level feature selection and the random forest algorithm. The student proposes using correlation-based forward sequential feature selection (CFS-SFS) to select relevant features in multiple levels, combined with a random forest classifier. The system will first preprocess data, then use CFS to filter irrelevant features before inputting the output to sequential forward feature selection. The final output will be input to the random forest classifier. The final stage will evaluate the model's performance based on accuracy, sensitivity, specificity, F1 score and precision. The goal of the project is to build an efficient IDS that can accurately detect intrusions.

INTERNET OF THINGS BASED INTRUSION DETECTION

SYSTEM USING MULTI-LEVEL FEATURE SELECTION AND


RANDOM FOREST ALGORITHM

BY

MAKINDE, OLUWAFUNMILAYO REBECCA

(17/52HA066)

A PROJECT SUBMITTED TO THE DEPARTMENT OF


COMPUTER SCIENCE, FACULTY OF COMMUNICATION AND
INFORMATION SCIENCES, UNIVERSITY OF ILORIN, ILORIN.

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR


THE AWARD OF BACHELOR OF SCIENCE (B.Sc) DEGREE IN
COMPUTER SCIENCE.

JANUARY, 2022
CERTIFICATION
This is to certify that this project work was carried out by MAKINDE,

OLUWAFUNMILAYO REBECCA with matriculation number 17/52HA066 in the

department of Computer Science, University of Ilorin, Ilorin, Nigeria.

________________________ _______________________
Dr. J. B. Awotunde Date
(Project Supervisor)

________________________ ________________________
Dr. Oluwakemi C. Abikoye Date
(Head of Department)

________________________ ________________________
External Examiner Date
DEDICATION
This project is dedicated to God, my parents and siblings who have contributed to my
academic growth in all ramifications.
CHAPTER ONE

INTRODUCTION

1.1 Background to the Study

The internet has evolved into a natural phenomenon that cannot be separated from people's

daily lives, and a great deal of data must be protected against many sorts of criminal

activities (Waskle, Parashar, & Singh, 2020). Valuable information is always appealing

to attackers and therefore vulnerable to concentrated network attacks. The primary

intention of data security is to create and implement protective computer models that

are resistant to security breaches, use, disclosure, interruption, modification, or damage

(Ganapathy et al., 2013). Furthermore, information security minimizes behaviours that

attempt to jeopardize the three primary security goals of integrity, availability and

confidentiality (Patel, Thakkar, & Ganatra, 2012). Confidentiality is concerned with

ensuring that the information transferred is only available to those who should have

access to it. To prevent unauthorized access, the data is encrypted. Integrity is

concerned with ensuring that the data itself is not intercepted or altered. It ensures that

the receiver receives exactly what the sender intended. Availability ensures that a

system resource or network is usable and accessible when required by a system that is

authorized. When a breach occurs, the security of a system is compromised. These

intrusions, which have a variety of consequences, have prompted researchers to develop

Intrusion Detection Systems (IDS) that are capable of detecting them. The main principle behind

developing Intrusion Detection Systems is to differentiate a malicious attack from a

benign or non-malicious one.

Intrusion detection systems are classified based on their environments (Kajal & Nandal,

2020) and their detection mechanisms (Amiri et al., 2011). Based on their environment,
IDSs are further classified as Host-based Intrusion Detection Systems (HIDS) and

Network-based Intrusion Detection Systems (NIDS). Signature-based Intrusion

Detection Systems (SIDS) and Anomaly-based Intrusion Detection Systems (AIDS) are

the two types of Intrusion Detection Systems based on detection mechanism. The inability to

detect new patterns is a disadvantage of signature-based IDS, which leads researchers

to focus more on anomaly-based IDS, which can detect new patterns (Akyol,

Hacibeyoglu, & Karlik, 2016). Many of the intrusion detection systems (IDS) that have

been created are still vulnerable to attacks (Ayo, Folorunso, Abayomi-Alli, Adekunle,

& Awotunde, 2020). Despite the use of many machine learning algorithms to increase

the efficacy of intrusion detection systems, existing intrusion detection systems continue to

struggle to achieve good performance (Zhou, Cheng, Jiang, & Dai, 2020).

Researchers claim that to reduce computational complexity, feature selection should be

employed during the pre-processing phase in machine learning algorithms to delete

unimportant features or attributes while maintaining or improving the performance of the

IDS. In machine learning, feature selection methods are grouped into three categories: filter

methods, wrapper methods, and embedded methods. Feature selection

is an effective approach for improving the accuracy of Intrusion

Detection Systems (Akyol et al., 2016). In most anomaly detection systems, data

preparation was motivated by a desire for greater accuracy and a low false alarm rate

(Ayo et al., 2020). The dataset is preprocessed to remove unnecessary features and

noise, leaving a smaller set of features for building a high-performance model; the base

classifier then uses these vital features to predict attack types.

Data mining algorithms, fuzzy logic, and neural networks are among the different

methods utilized in the implementation of Intrusion Detection Systems. Data mining

algorithms have been proposed that use both traditional and hybridized Machine

Learning algorithms, such as Random Forest (Waskle et al., 2020); Ensemble

Classifier (Zhou et al., 2020; Gautam & Doegar, 2018); Neural Networks (Chiba,

Abghour, Moussaid, El Omri, & Rida, 2018); Support Vector Machine (Ganapathy et

al., 2013; Jha, Ragha, & Ph, 2013; Wang, Gu, & Wang, 2017).

This study, however, proposes an IDS that uses Correlation-Based Sequential Forward

Selection (CFS-SFS) for multi-level feature selection, combined with the Random Forest

classifier, to detect intrusions. The IDS will be built in phases. The first phase applies

the necessary data preprocessing techniques. The second phase selects the first set of

relevant features using a filter method, Correlation-based Feature Selection, which

filters out irrelevant features. The third phase gives these output features as input to

a wrapper method, Sequential Forward Feature Selection, whose output in turn serves as

the input to the succeeding stage, where the Random Forest classifier is used. The final

stage then deals with the evaluation of the model.
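
The phases described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the project's implementation. The filter level uses a plain correlation threshold rather than full CFS. The wrapper level scores subsets with a leave-one-out 1-nearest-neighbour classifier, used here as a lightweight stand-in for Random Forest so that the example needs only the standard library. The toy data, the threshold of 0.5 and all function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(vx * vy)


def correlation_filter(X, y, threshold):
    """Level 1 (filter): keep features whose |correlation| with the label meets the threshold."""
    return [j for j in range(len(X[0]))
            if abs(pearson([row[j] for row in X], y)) >= threshold]


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-NN classifier restricted to `features`.

    Assumption: 1-NN stands in for the Random Forest base classifier.
    """
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = None, None
        for k, xk in enumerate(X):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in features)
            if best_dist is None or d < best_dist:
                best_dist, best_label = d, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def sequential_forward_selection(X, y, candidates, score):
    """Level 2 (wrapper): greedily add the feature that most improves the score."""
    selected, best = [], 0.0
    remaining = list(candidates)
    while remaining:
        scored = [(score(X, y, selected + [j]), j) for j in remaining]
        top_score, top_j = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining feature improves the classifier
        selected.append(top_j)
        remaining.remove(top_j)
        best = top_score
    return selected, best


# Hypothetical toy data: three features per record, label 1 = attack, 0 = normal.
X = [
    [0.1, 0.5, 0.20], [0.2, 0.4, 0.10], [0.0, 0.6, 0.30], [0.3, 0.5, 0.45],
    [0.9, 0.5, 0.60], [1.0, 0.6, 0.70], [0.8, 0.4, 0.50], [0.7, 0.5, 0.80],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

kept = correlation_filter(X, y, threshold=0.5)        # filter level
selected, accuracy = sequential_forward_selection(    # wrapper level
    X, y, kept, loo_accuracy)
print(kept, selected, accuracy)
```

On this toy data the filter discards feature 1, which is uncorrelated with the label, and the wrapper then keeps only feature 0, because adding feature 2 does not improve the stand-in classifier.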


1.2 Statement of the Problem

Intrusions are malicious attacks that spread swiftly over networks, and detecting them

is crucial owing to the potential damage they can inflict. Network security is

currently a major worldwide issue in computer security and defense. Breaches, threats,

or intrusions in network infrastructures typically result in large leaks of sensitive

data, lowering an organization's efficiency and productivity (Chiba et al., 2018).

As a result, Intrusion Detection Systems (IDS) are regarded as critical tools for

continuously monitoring malicious activities and detecting threats that could jeopardize

the network's integrity, privacy, or availability (Kajal & Nandal, 2020). According to

Waskle et al. (2020), to detect intruders, the IDS must be accurate and efficient.

Different machine learning algorithms have been used for intrusion detection, including

K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Naïve Bayes

(NB). Despite the use of many Machine Learning algorithms, existing intrusion

detection systems continue to struggle to achieve good performance (Zhou et al., 2020).

Existing systems that neglected the use of Feature Selection have been observed to

attain low accuracy and high training time (Teng, Wu, Zhu, Teng, & Zhang, 2018;

Taghavinejad, Taghavinejad, Shahmiri, Zavvar, & Zavvar, 2020). Also, SVM has been

found not to be a good choice for large network traffic, as its performance degrades as

data size increases (Ahmad, Basheri, Iqbal, & Rahim, 2018; Wang et al., 2017; Teng et

al., 2018; Aburomman, Bin, & Reaz, 2016; Kuang, Xu, & Zhang, 2014). Naive Bayes

and KNN have also been found to achieve low accuracy (Halimaa

& Sundarakantham, 2019; Patel et al., 2012).


According to these findings, using Feature Selection results in improved accuracy and

detection rates and in lower false alarm rates (Ayo et al., 2020; Waskle et al., 2020). The

identified flaws provide a rationale for proposing an IDS that minimizes the false alarm

rate and improves the accuracy of the IDS using Correlation-Based

Sequential Forward Feature Selection and the Random Forest classification algorithm.

1.3 Aim and Objectives

The aim of this project is to develop an Intrusion Detection System using multi-level

feature selection and the Random Forest Classifier to detect intrusions. The specific

objectives for accomplishing this aim are to:

1. Design a framework for the Intrusion Detection System.

2. Implement the proposed system using the Correlation based-Sequential

Forward Feature Selection and Random Forest Classifier.

3. Evaluate the performance of the proposed intrusion detection system based on


accuracy, sensitivity, specificity, F1 score and precision.
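
The evaluation measures named in objective 3 all follow from the four cells of a binary confusion matrix. The sketch below shows one way to compute them in plain Python. It assumes a binary labelling in which 1 means attack and 0 means normal, and the example predictions are made up purely for illustration. The false alarm rate is included because the study also refers to it.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute common IDS evaluation measures from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Guard against empty denominators (e.g. no positive samples at all).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)          # also called recall or detection rate
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "false_alarm_rate": ratio(fp, fp + tn),
    }


# Hypothetical predictions: 4 attacks and 6 normal records.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here three of the four attacks are caught (sensitivity 0.75) and two of the six normal records raise false alarms (specificity of about 0.67), which gives an accuracy of 0.7.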

1.4 Significance of the Study

The need to detect intrusions accurately grows daily as attackers become more

sophisticated. The significance of this study lies in developing an Intrusion

Detection System that can detect attacks accurately and notify the system of

them.

The proposed method can also be valuable when applied to the

following fields:

● Cancer prediction for the health care system.


● Student's performance prediction system.

● Credit card fraud detection in the banking sector.

1.5 Scope and Limitations

This research is focused on using Correlation-Based Sequential Forward Selection

(CFS-SFS) multi-level feature selection and the Random Forest classifier to develop an

Intrusion Detection System that detects intrusions on a given dataset with high accuracy

and a low false alarm rate. This study is limited to the use of secondary data

and machine learning feature selection techniques, with Random Forest as the base

classifier.

1.6 Definition of Terms Used

i. Intrusion: the act of entering a system or place without authorization.

ii. Intrusion Detection System: a system that monitors the activities of a network and

detects malicious behavior.

iii. Network: a set of computers linked together with the aim of sharing data and

files.

iv. Data Mining: the process of finding valuable information hidden in a large

dataset. It uncovers hidden patterns and relationships.

v. Machine Learning: a branch of artificial intelligence (AI), which is itself a

division of computer science, that focuses on the use of data as input to algorithms

with the aim of predicting an outcome, searching for patterns or grouping entities

into clusters.
vi. Features: with respect to machine learning, features are individual independent

variables which are used as input to make predictions for the dependent variable,

form clusters, find patterns or perform analysis.

vii. Feature Selection: process of filtering and deleting irrelevant features to get

relevant features.

viii. Random Forest Classifier: is a machine learning classification algorithm made up

of several decision trees.

ix. Normalization: the process of rescaling real-valued numeric attributes

into the range of 0 to 1. Data normalization is used to reduce the sensitivity of

the training model to feature scale.

x. Classification: the process of predicting the category of data based on previously

labelled classes of data. It is a supervised Machine Learning (ML) task.

xi. Base Classifier: It is used to indicate the classifier that has been selected for the

implementation of the model.

xii. Multi-level Feature Selection: the term used to describe the application of two

feature selection methods sequentially.
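
As an illustration of the normalization defined in item ix, the sketch below applies min-max scaling to a single numeric column. The function name and the sample values are hypothetical, and a constant column is mapped to zeros by convention here, since its range is zero.

```python
def min_max_normalize(column):
    """Rescale a list of numbers into the range 0 to 1 using min-max scaling."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0 by convention.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical feature column, e.g. connection durations in seconds.
durations = [10, 20, 40]
scaled = min_max_normalize(durations)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and the value 20 lands one third of the way between them, so features measured on very different scales become directly comparable.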

1.7 Organization of the Report

The project follows a standard five-chapter structure. The rest of the report is organized as follows:

CHAPTER TWO: This chapter reviews the literature with respect to related

works and concepts.

CHAPTER THREE: This chapter discusses the research methodology chosen to

develop the system (data used, algorithms, and programming tools for implementation).


CHAPTER FOUR: This chapter presents the implementation and the evaluation of

results.

CHAPTER FIVE: This chapter consists of the summary, conclusion and recommendations

of this project.
CHAPTER TWO

LITERATURE REVIEW

2.1 Introduction

This chapter introduces important concepts involved in detecting intrusions using an

Intrusion Detection System. It also reviews a number of related works that were carried

out on intrusion detection systems using various algorithms.

2.2 Review of Concepts Relating to the Work

2.2.1 Intrusion Detection System

Intrusion, according to Waskle et al. (2020), is a term that refers to an attacker

entering a system without permission and forwarding malicious packets in order to

damage, corrupt, steal or modify the valuable information stored within. This

infiltration into any system has the potential to harm the system. The term "intrusion"

has come to mean a serious attempt to breach the system's security, and with the help of an

Intrusion Detection System, this intrusion into any system can be detected. Cancer is one

of the most lethal diseases on the planet, and intrusion is its analogue when viewed

through the lens of cybersecurity.

Intrusion detection is an important method for managing network security, and an

intrusion detection system is one of the most needed tools for ensuring network security

(Teng et al., 2018). Identifying malicious acts or behaviors occurring in networks requires

the use of an IDS. An Intrusion Detection System (IDS) is a system that monitors traffic
for unusual activity and sends alerts when it detects it. It is critical for providing security

and reducing harm to the information system, as well as the network and computer

security systems. It monitors different events in a network to ascertain whether or not an

intrusion has occurred and notifies the network administrator or system of malicious

behaviors or detected attacks (Velusamy, Ghosh, Debnath, Metia, & Dutta, 2014).

While the primary functions of intrusion detection systems (IDSs) are to detect and

report anomalies, certain IDSs can also take immediate action when anomalous traffic

or malicious activity is detected, such as blocking the traffic received from the

suspected internet protocol address.

The drawbacks of Intrusion Detection Systems are false positives (predicting the

presence of an attack incorrectly), false negatives (indicating the absence of an attack

when one exists) and data overload (inability to analyze data effectively and

efficiently) (Patel et al., 2012).

2.2.2 Classification of Intrusion Detection Systems

Intrusion detection systems are broadly categorized into two major groups based on:

a. Environment or nature of data collection (Ghosh et al., 2014; Kajal & Nandal,

2020).

i. Host-based Intrusion Detection System (HIDS): it is in charge of detecting and

identifying malicious actions on the host computer. It can detect attacks and

vulnerabilities to the system that the Network based Intrusion Detection System

cannot detect. The HIDS only protects the hosts assigned to it. The ability of HIDS

to detect suspicious attacks generated inside the organization or malicious activity

that NIDS failed to detect, as well as detect malicious traffic that emanates from
the host itself, is an advantage of HIDS over NIDS (Ashraf, Ahmad, & Ashraf, 2018;

Agah, 2017).

ii. Network Intrusion Detection System (NIDS): monitors the entire network from

where it is installed and has a more powerful detection mechanism for identifying

suspicious intruders. It is capable of recognizing network intrusion by comparing

current behavior with previously observed behavior. NIDS is responsible for detecting and

identifying unauthorized intrusions before they reach critical systems. It is

strategically placed throughout the network to monitor traffic (Chirag et al., 2013;

Agah, 2017).

b. Detection mechanisms or approach of detecting attacks (Zhou et al., 2020; Akyol

et al., 2016).

i. Anomaly-based Intrusion Detection System (AIDS): sometimes referred to as

Behavior-based Intrusion Detection System (BIDS), it is a powerful method for

detecting unknown malicious actions and new attacks by recognizing

deviations from the behavior of a normal profile. It profiles normal network and

system routines and flags activity as anomalous if it deviates from these normal

patterns. This system uses machine learning to construct the model. Such IDSs

are more effective at detecting new kinds of attacks; regrettably, one major

drawback is that they cannot avoid a high false positive rate (Zhou et al., 2020).

ii. Signature-based Intrusion Detection System (SIDS): also referred to as

Knowledge-based Intrusion Detection System (KIDS) or Misuse-based Intrusion

Detection System (MIDS), it analyzes known attacks to extract their discriminating

characteristics and patterns, which are referred to as signatures.

To detect intrusions, these signatures are compared against the traffic. The False Alarm

Rate (FAR) for signature-based detection is extremely low. Although this type of

IDS is effective at detecting known attack patterns, it cannot detect unfamiliar attacks or

variants of known attacks. As a result, researchers are focusing on anomaly

detection.
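
The contrast between the two detection mechanisms can be illustrated with a short sketch. It is a deliberately simplified example under stated assumptions, not a description of any real IDS. The signature strings, the request-rate profile and the z-score threshold of 3 are all hypothetical, and real anomaly detectors learn far richer profiles than a single mean and standard deviation.

```python
import statistics

# Hypothetical signatures of known attacks (substrings of malicious payloads).
KNOWN_SIGNATURES = {"' OR 1=1 --", "../../etc/passwd"}


def signature_alert(payload, signatures=KNOWN_SIGNATURES):
    """Signature-based check: alert only if a known pattern appears in the payload."""
    return any(signature in payload for signature in signatures)


def anomaly_alert(value, normal_profile, z_threshold=3.0):
    """Anomaly-based check: alert if the value deviates strongly from the normal profile."""
    mean = statistics.mean(normal_profile)
    std = statistics.pstdev(normal_profile)
    if std == 0:
        return value != mean
    return abs(value - mean) / std > z_threshold


# Hypothetical profile: requests per minute observed during normal operation.
normal_requests_per_minute = [100, 102, 98, 101, 99]

known_attack = signature_alert("GET /index.php?id=' OR 1=1 --")
novel_attack = signature_alert("GET /index.php?id=1 UNION SELECT secret")
traffic_burst = anomaly_alert(150, normal_requests_per_minute)
usual_traffic = anomaly_alert(101, normal_requests_per_minute)
print(known_attack, novel_attack, traffic_burst, usual_traffic)
```

The signature check misses the second payload because no stored signature matches it, which mirrors the weakness of SIDS against unfamiliar attacks. The anomaly check flags the burst without knowing anything about the attack itself, at the cost of also flagging any legitimate traffic that happens to be unusual.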

2.2.3 Data Mining

Data mining involves discovering information or searching for relationships,

correlations, trends and patterns hidden in massive quantities of data. It converts

unstructured data into actionable information. Machine Learning includes algorithms

used in data mining. A data mining approach can address IDS issues or drawbacks by

utilizing one of the following techniques. Data summarization with statistics and

visualization is used to solve the problem of data overload. Clustering, which is the

task of grouping similar observations into clusters or categories, can also

be used, as can classification, which can be used to predict the category of an attack

based on historical data. A variety of algorithms is available from disciplines

such as Statistics, Machine Learning, and Pattern Recognition.

2.2.4 Data Mining Tasks

Data mining tasks are broadly classified into two parts (Namrata, Bijendra, & Rajkumar,

2015):

a. Predictive tasks: These tasks involve predicting the outcome of an attribute or

feature depending on values of other variables. The feature to be predicted is

referred to as the dependent variable, response, or target variable, whereas the

attributes on which the target variable is dependent are termed independent

variables or predictors. In machine learning, predictive tasks are known to be


tasks performed under supervised learning. The following sub tasks fall under

the category of predictive tasks:

i. Predictive Modelling: deals with generating a function of the independent

variables for developing a model for the target variable. Predictive

modelling is classified into two types.

● Classification: it is used when the target variable is discrete, with binary

or multi-class predictions. Predicting whether an action is normal or

malicious is one instance.

● Regression: it is used when the target variable is continuous. For

instance, predicting an employee's salary.

ii. Anomaly Detection: It recognizes observations that are distinct from the

rest of the data in some way. The goal of this task is to find anomalies

while avoiding labeling normal observations as anomalies.

b. Descriptive tasks: These are tasks that involve determining patterns, clusters, or

relationships that summarize the underlying structure of the data. Descriptive tasks are

often exploratory in nature, necessitating post-processing techniques for

explaining the results. In machine learning, descriptive tasks are referred to as

unsupervised learning. The following sub tasks fall under the descriptive tasks:

i. Association tasks: describes patterns that characterize strongly associated

features in the data.

ii. Clustering tasks: aim to discover clusters of observations that are closely

related, so that observations with similar characteristics or

attributes are grouped together. Recommendation engines are a great example

of a task that requires clustering.

2.2.5 Data Mining Techniques used for Intrusion Detection Systems

a. Classification

i. Support Vector Machine: Support Vector Machines represent each object to be

classified as a point in an n-dimensional space, and the coordinates of

these points are commonly referred to as features. SVM starts by drawing

a hyperplane with the points from one group on one side of the hyperplane and the

points from the other group on the other side. There could be

more than one such hyperplane, but SVM aims to use the hyperplane that best

separates the categories by maximizing its distance to the nearest points of each.

This distance is known as the margin, and the points lying on the margin boundary are

known as the support vectors. SVM creates a classifier using a linear

separating hyperplane, but not all data can be linearly separated in the

initial input space. SVM employs a function usually described as a kernel to

address this issue. The kernel implicitly maps the data into a feature space,

which turns a linear algorithm into a nonlinear one. There are

numerous kernel functions, such as two-layer sigmoid neural nets,

polynomial, and radial basis. A major disadvantage of SVM is that it

natively handles only binary classification,

whereas intrusion detection requires multi-class detection (Patel et al.,

2012).
ii. K-Nearest Neighbors: is one of the most fundamental supervised Machine

Learning algorithms. It computes the distance between data points, and a

point is assigned to a class based on the majority vote of its k nearest

neighbors. If k is set to 1, the data point is predicted to be in the class of

its nearest neighbor.

iii. Decision Tree: the DT is classified as a supervised classification algorithm

in machine learning. A DT is a tree whose root node, or decision node, tests the

variable that provides the greatest purity. The

outcome of the test at a parent node determines the choice of a

particular branch leading to an intermediate or child node. The nodes at

the decision tree's end are the tree's terminal nodes, and when one is

reached, a decision is made. The goal of a decision tree is to create

pure or uniform buckets with more of one class. One of the mathematical

measures used to evaluate bucket splits is the Gini Index. Variables are

chosen based on the level of purity they provide.
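
To make the notion of purity concrete, the sketch below computes the Gini impurity of a bucket of class labels and the weighted impurity of a candidate split. Lower values mean purer buckets. The labels are hypothetical, and the code only evaluates one given split. A full decision tree learner would search over many candidate splits and pick the one with the lowest weighted impurity.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a bucket: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Impurity of a two-way split, weighted by the size of each bucket."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


pure = gini_impurity(["normal"] * 4)                              # one class only
mixed = gini_impurity(["normal", "normal", "attack", "attack"])   # evenly mixed
split = split_gini(["normal", "normal", "normal"],
                   ["attack", "attack", "attack", "normal"])
print(pure, mixed, split)
```

A bucket holding a single class has impurity 0, an evenly mixed two-class bucket has impurity 0.5, and the example split scores about 0.21 because only its right-hand bucket is still mixed.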

b. Clustering

i. Hierarchical clustering: it is a clustering technique that finds the nearest

pair of points by calculating the distance separating a point and all other

points using Euclidean or Manhattan distance, merges them into one,

calculates their centroid, and repeats the process until all points are

merged and a dendrogram (hierarchical tree) is generated. The number of

clusters is determined by drawing a horizontal line that intersects some

vertical lines in the generated dendrogram. The number of clusters chosen

equals the number of intersected vertical lines. The horizontal

line is usually placed where it can move up and down over the largest range

without crossing any horizontal line of the dendrogram. Because the distances between

all pairs of points must be computed, hierarchical clustering is at a significant

disadvantage on large datasets.

ii. K-means clustering: is an unsupervised exploratory data analysis Machine

Learning technique that uses a non-hierarchical method of clustering.

The number of clusters is determined first and initial centroids are chosen.

The algorithm then calculates the Euclidean distance from each point to every

centroid, assigns each point to the centroid at the minimum distance, and

recalculates each centroid as the mean of the points assigned to it. These

steps are repeated until the assignments no longer change, so that similar

points end up grouped together.
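
The k-means procedure can be sketched in plain Python as follows. This is a minimal illustration that assumes the initial centroids are supplied by the caller, whereas practical implementations usually pick them at random or with a seeding heuristic. The two-dimensional points and all names are hypothetical.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Cluster points by alternating assignment and centroid update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


# Hypothetical data: two well-separated groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centers = kmeans(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels, centers)
```

With these starting centroids the first three points form one cluster and the last three the other, and the centroids settle at the mean of each group after a single update.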

2.2.6 Feature Selection

High dimensionality is one of the challenges machine learning algorithms are faced

with. Datasets with numerous fields or columns are now common.

Datasets used in Intrusion Detection Systems are generally huge, and training

machine learning algorithms on datasets with numerous features can be

time-consuming and can increase learning complexity, while the use of unrelated

features can negatively impact the system. Feature selection or feature extraction

therefore helps in removing unwanted features or noise from the data, thereby improving

the classification rate. The major difference between feature selection and extraction is that

feature selection selects features without changing the initial data or generating new

features or columns, while extraction generates new features or columns from

existing features, which can sometimes make the data harder to interpret. Feature selection has
been confirmed by researchers to be an important and crucial step in Machine Learning.

It selects subsets of the original features based on given criteria, and the selected

features are used by the classifier to build a model. Machine learning feature selection

techniques are broadly divided into three categories:

a. Filter Method: applies statistical measures to features and, based on a given

threshold, a feature is either selected or discarded. A disadvantage of the filter method

is that it generates subsets without taking the base classifier into consideration, that

is, it is independent of the classifier, and it has a tendency to select large subsets of

features. Fast execution is an advantage, since it requires no interaction with the

classifier, and generality is also an advantage, since it evaluates the properties of the

data. Information Gain, Correlation-based selection and the Chi-squared test are some of

the filter methods applied to datasets for feature selection.

b. Wrapper Method: takes the classifier into consideration. It is dependent on the

classifier and it is usually computationally intensive, as different combinations of

features are tested. It evaluates feature subsets based on their accuracy. It has the

advantage of achieving a high accuracy, since there is an interaction between the

wrapper and the classifier. Aside from the wrapper method being computationally

intensive and slow to execute, another disadvantage is the lack of generality. The

subset of features will only be specific to the classifier under consideration.

c. Embedded Method: these methods combine the benefits of both the wrapper and

filter methods by incorporating feature associations while significantly reducing

computational cost. Feature selection is handled by iterative methods within the

training process: at every iteration, the features that contribute the most to the

training for that iteration are carefully extracted.
The embedded technique generates a large

number of subsets from the dataset. It chooses features for the model at random

and tries different combinations of them. Whichever subset has

the highest accuracy is chosen as the subset of features used

for training.
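
For the correlation-based filter used in this study, the usual way of scoring a candidate subset is the CFS merit heuristic. The formula below is stated from the general literature on correlation-based feature selection rather than from this report, and the symbols are introduced here only for illustration: $k$ is the number of features in subset $S$, $\overline{r_{cf}}$ is the mean feature-class correlation and $\overline{r_{ff}}$ is the mean feature-feature correlation.

```latex
\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}
```

As a worked example with made-up values, take $k = 3$ and $\overline{r_{cf}} = 0.6$. If the features are only weakly related to one another ($\overline{r_{ff}} = 0.2$), the merit is $1.8 / \sqrt{4.2} \approx 0.88$. If they are highly redundant ($\overline{r_{ff}} = 0.8$), it drops to $1.8 / \sqrt{7.8} \approx 0.64$. The heuristic therefore rewards subsets whose features correlate with the class but not with each other.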

2.3 Review of Related Works

Summary of Related Works on Intrusion Detection System

Researchers have proposed various methods for the detection of intrusions

based on data mining techniques.

Khan (2021) implemented a hybrid deep learning ID framework (HCRNNIDS) that

classifies malicious attacks within the network using a convolutional neural network

combined with a recurrent neural network (CRNN). Within the HCRNNIDS, the

convolutional neural network (CNN) captured local features, while the recurrent

neural network (RNN) captured temporal features to enhance the accuracy

and classification of the ID system. Experiments were conducted using the CIC-DS 2018

ID data set. The results of the analysis showed the hybrid (HCRNNIDS) intrusion

detection system to be effective, achieving a detection rate of 97.7%

and surpassing the ID methodologies it was compared with by the researchers.

An integrated ID system was developed by Thaseen, Banu, Lavanya, Ghalib and

Abishek (2020) using correlation-based feature selection to prioritize the features

with the strongest correlation with the class outcome,

integrated with an Artificial Neural Network (ANN). Experimental analysis was

performed on the NSL-KDD and UNSW-NB ID datasets. The model outperformed several

state-of-the-art techniques in terms of sensitivity, accuracy and

specificity. Nevertheless, there were a few drawbacks to using ANN in the model.

Training the IDS was time-consuming and required a large amount of training

data. Better results were achieved when more layers were added; however, the model

still struggled with computational and memory efficiency. Increasing the

hyperparameters of the artificial neural network, such as the number of hidden layers

and neurons per layer, improved system performance while also increasing time complexity.

Salih and Abdulazeez (2021) reviewed 17 existing approaches from 2018-2020 that

have been proposed by various researchers. Deciding on the algorithm to use has not

been an easy task, and testing the performance of different classifiers seemed the best

method to the researchers, which led to the review of existing systems. The result of the

review states that feature selection has good effects on the performance of the model

because it decreases both training and testing time by filtering out irrelevant features.

Also, hybrid classifiers could provide an optimal solution to the detection of attacks.

Finally, Random Forest achieved the best accuracy, while Particle Swarm Optimization (PSO)

achieved the best result for selection of features.

Choudhury and Bhowal (2020) investigated a range of classification techniques and

machine learning algorithms for classifying network traffic. J48, Random Tree, BayesNet,

PART, Logistic, Random Forest, REPTree, IBK, and JRip were found to be suitable

for the classification approach. Additional focus was placed on the Bagging, Boosting, and

Blending (Stacking) machine learning methods, and their accuracies were

compared. The WEKA tool was used to compare these algorithms, and the results
were reported using performance criteria. 10-fold cross-validation was used to

evaluate the classification models on the NSL-KDD dataset. BayesNet and Random

Forest were found to be effective by the researchers. Among Bagging, Boosting and

Blending, Boosting achieved the best performance. The researchers further

stated that the proposed algorithms may be used to build effective network intrusion

detection devices for security.

Mahfouz, Venugopal and Shiva (2020) analyzed classifiers along

different dimensions, namely feature selection, hyper-parameter selection and class

imbalance. Their analysis was limited to supervised machine learning methods.

The researchers considered the SMO (SVM), Naive Bayes, Multilayer Perceptron

(NN), Logistic Regression, J48 (DT) and IBK (KNN) algorithms. The models were

built in three phases. In phase one, the selected classifiers were used with their

default parameters and no preprocessing was done on the dataset. The classifiers

were trained on the training set provided with the NSL-KDD dataset

using stratified 10-fold cross-validation. The accuracies on the

training set and test set indicated over-fitting for all classifiers. Preprocessing of data was

performed in the second phase. The InfoGainAttributeEval algorithm was used for

feature selection and CVParameterSelection for hyperparameter optimization

of each classifier. Though the accuracy of the second phase was better than that of the

first phase, overfitting was still detected. The third phase reduced the

class imbalance in the dataset by oversampling the minority classes and undersampling the

majority classes. Mitigating the class imbalance helped reduce the limitations in

the detection of U2R and R2L attacks, which was not the case in the first and second

phases.

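
The class rebalancing described for the third phase can be sketched as follows. This is a minimal illustration that assumes plain random resampling to a fixed per-class target; the `rebalance` helper, the target size and the toy class counts are hypothetical, and the authors may have used a different resampling technique.

```python
import random
from collections import Counter, defaultdict

def rebalance(records, labels, target, seed=0):
    """Resample every class to exactly `target` records.

    Classes with fewer than `target` records are oversampled with replacement;
    classes with more are undersampled without replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record, label in zip(records, labels):
        by_class[label].append(record)

    new_records, new_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            chosen = rng.sample(items, target)               # undersample
        else:
            extra = rng.choices(items, k=target - len(items))
            chosen = items + extra                           # oversample
        new_records.extend(chosen)
        new_labels.extend([label] * target)
    return new_records, new_labels

# Toy imbalanced data (hypothetical counts): many "normal", few "R2L"/"U2R".
labels = ["normal"] * 90 + ["R2L"] * 8 + ["U2R"] * 2
records = list(range(len(labels)))
balanced_records, balanced_labels = rebalance(records, labels, target=20)
print(Counter(balanced_labels))
```

Resampling should be applied to the training data only, so that the test set keeps the original class distribution.
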
Kajal and Nandal, (2020) implemented Genetic Algorithm and Discrete Wavelet

Transform (DWT) with Artificial Bee Colony (ABC) for feature selection. The aim of

their research was to propose an intrusion detection system capable of accurately

detecting attacks; the study was limited to Distributed Denial of Service (DDoS) attacks and

used the KDD dataset. The system made use of a hybridized Artificial Neural Network and Support

Vector Machine (ANN-SVM) as the base classifier. The system was compared to three

different existing systems that addressed intrusion detection using the KDD dataset

against DDoS attacks, and the proposed system was seen to have surpassed the other

existing systems in terms of effectiveness.

Waskle et al. (2020) applied principal component

analysis (PCA) and the random forest classifier to develop a model for an efficient

Intrusion Detection System. Specifically, the research offered a method for developing

an efficient IDS in which principal component analysis assisted in extracting features from

the data by decreasing the data dimensions, while the random forest assisted in

classifying traffic as normal or intrusion using the Knowledge Discovery

Dataset (KDD). Their proposed model began with the extraction of features using PCA,

followed by classification of attacks by random forest and finally evaluation of the model.

The study of Kasongo and Sun, (2020) adopted Feed-forward Deep Neural Network

(FFDNN) wireless Intrusion Detection System by applying a Wrapper based Feature

Extraction Unit (WFEU). The WFEU method used the Extra Trees algorithm to extract

optimal features. The effectiveness and efficiency of the model were tested using the

UNSW and AWID dataset. Comparison of the system was done with respect to other
existing machine learning algorithms including Random Forest (RF), Naive Bayes

(NB), K-Nearest Neighbors (KNN), Decision Tree (DT) and Support Vector Machine

(SVM). The features extracted by the WFEU were then used as input for the

FFDNN classifier and the other algorithms for both binary and multi-class classification.

The WFEU-FFDNN approach achieved greater accuracy than the other approaches,

and in comparison to existing works, the findings demonstrate that the suggested system

performed exceptionally well.

Zhou et al. (2020) proposed the Correlation-based Feature Selection - Bat Algorithm

(CFS-BA), a heuristic algorithm for reducing the dimensionality

of the dataset, motivated by the numerous redundant features in data sets. Furthermore, the

researchers introduced a voting technique that combined the probability distributions

of the Random Forest, Forest by Penalizing Attributes and C4.5 algorithms using the

average of probabilities rule. These algorithms were combined into an ensemble because

a single classifier may not be efficient in detecting all variants of attacks. The goal of the study

was to create an unbiased model that improves the reliability and stability of the Intrusion

Detection System while requiring very little time and computational complexity. The

NSL-KDD dataset, AWID dataset, and CIC-IDS dataset were used for evaluation and

cross validation was performed on them. The relevant features obtained from the

CFS-BA algorithm show that the approach worked well, as it

reduced the dimensionality drastically and achieved its aim:

the NSL-KDD dataset was reduced from 41 features to 10 features, the AWID dataset

from 155 features to 8 features and the CIC-IDS dataset from 78 features to 13 features.

The evaluations, comparisons and results show that the CFS-BA-Ensemble method

offers a strong advantage for Intrusion Detection Systems.
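
The average of probabilities voting rule mentioned above can be illustrated with a short sketch. The per-class probabilities below are invented for a single record, and the variable names standing for the three base classifiers are hypothetical; the code only shows how the averaged distribution determines the final label.

```python
def average_of_probabilities(distributions):
    """Average per-class probabilities from several classifiers and pick the top class."""
    classes = distributions[0].keys()
    averaged = {
        c: sum(d[c] for d in distributions) / len(distributions) for c in classes
    }
    winner = max(averaged, key=averaged.get)
    return winner, averaged

# Hypothetical outputs of three base classifiers for one connection record.
rf_probs  = {"normal": 0.40, "dos": 0.50, "probe": 0.10}   # Random Forest
fpa_probs = {"normal": 0.60, "dos": 0.30, "probe": 0.10}   # Forest by Penalizing Attributes
c45_probs = {"normal": 0.20, "dos": 0.70, "probe": 0.10}   # C4.5

label, averaged = average_of_probabilities([rf_probs, fpa_probs, c45_probs])
print(label, averaged)
```

Unlike a simple majority vote, this rule lets a confident classifier outweigh less confident ones.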


In the study of Anish & Sundarakantham (2019), two prediction models, Naive Bayes

and SVM, were contrasted in terms of prediction accuracy and misclassification rate in

three different scenarios using 19,000 records of the NSL-KDD dataset. In the first

scenario, the models predicted the dataset which did not go through any pre-processing

phase. In the second scenario, CfsSubsetEval was used for feature selection and the

models predicted using the selected features. In scenario three, the researchers

normalized the dataset and fed the normalized data to the models for prediction. In all

scenarios there were significant differences between their accuracy and

misclassification rates. The accuracy of SVM was higher and the misclassification rate

was lower than that of Naive Bayes in all scenarios.

Bhati and Rai (2019) used support vector machine (SVM) variants, namely Fine

Gaussian (98.7%), Medium Gaussian (98.5%), Quadratic (96.1%) and Linear (98.6%),

to evaluate the success of SVM techniques on the NSL-KDD dataset. The

implementation of the IDS involved four major steps, namely data

collection, preprocessing, division into train set and test set and finally the evaluation

of the model based on the accuracy, Receiver Operating Characteristic (ROC) and

Confusion Matrix metrics. The results state that the Fine Gaussian SVM variant provided

the highest accuracy and also the smallest error rate for intrusion detection.
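
Accuracy and error rate, together with the other metrics that recur in this review, follow directly from the confusion matrix. The sketch below assumes a binary attack/normal setting; the function name and the counts are hypothetical and are not taken from any of the reviewed studies.

```python
def binary_metrics(tp, fp, tn, fn):
    """Common IDS evaluation metrics from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # also called recall or detection rate
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts: 90 attacks caught, 10 missed, 95 normal passed, 5 false alarms.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
print(metrics)
```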

Bindra and Sood (2019) compared Logistic Regression, Gaussian Naive Bayes, Linear

Discriminant Analysis, K-Nearest Neighbor (KNN), Linear SVM and

Random Forest ML algorithms for identifying DDoS attacks in networks. The

CIC-IDS-2017 dataset was utilized to train and evaluate the algorithms compared in this

work. Random Forest was the most accurate, with a 96.2% accuracy rate. The study

credits the effective outcome to the cross-validation applied on the algorithm. The

experiments also highlight the need to lower the dataset's dimensionality: the

SelectPercentile technique was used to reduce the features from 85 to 12, and the median

of an attribute was used to replace NaN (not a number) values.
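
Replacing NaN values with the median of an attribute can be sketched in a few lines. The column name and values are hypothetical; the example only illustrates the imputation step, not the authors' full preprocessing pipeline.

```python
import math
import statistics

def impute_median(column):
    """Replace NaN entries in a numeric column with the median of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    median = statistics.median(observed)
    return [median if math.isnan(v) else v for v in column]

# Hypothetical flow-rate column with two missing values and one outlier.
flow_rate = [10.0, float("nan"), 30.0, 20.0, float("nan"), 1000.0]
imputed = impute_median(flow_rate)
print(imputed)
```

The median is a common choice here because, unlike the mean, it is barely affected by extreme values such as the 1000.0 above.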

Chu, Lin, and Chang (2019) investigated the result of Machine Learning (ML)

techniques on the NSL-KDD dataset utilizing SVM, Naive Bayes, Decision Tree, and

Artificial Neural Network (ANN) with MLP for identifying Remote to Local, DoS,

User to Root and Probe attacks. SVM had a precision of 97.72 percent, ANN had a

precision of 97.82 percent, Naive Bayes had a precision of 90 percent, and J48 had

a precision of 59.3 percent. According to the researchers, the high accuracy of SVM

was reached by making adjustments to the gamma and C parameters, while the best accuracy

of the Artificial Neural Network was attained with four layers. Although

there was no notable difference in the findings of the two algorithms that performed

better, the researchers noted that the speed of classification increased significantly after

using principal component analysis to reduce the feature space of the

NSL-KDD data set.

Kim, Shin, and Choi (2019) used the CSE-CIC-IDS 2018 dataset to construct an

Intrusion Detection System model using Convolutional Neural Network (CNN) and

evaluated it by comparing it to a model generated with a Recurrent Neural Network (RNN).

Preprocessing was done on the data, and features were chosen. A CNN needs the dataset

to be transformed into images. Convolutional layers, max-pooling layers, and a fully

connected layer make up the model. Max pooling was applied after each convolutional

layer. Although the max pooling layer is not required

for a CNN model, it was used because the transformed images comprise only

quantitative data and no invisible signatures, so there is very little risk of losing

important features through max pooling. In addition, for each convolutional layer, the

activation function 'relu' was employed. To minimize overfitting, dropout

was performed after each max pooling step. Finally, after the last max-pooling

layer, a fully connected layer was placed. When applied to the CIC-2018 dataset,

the CNN model appeared to be more accurate than the RNN model in label

categorization. Furthermore, preparing the dataset with a ratio of benign and attack-

labeled data was proposed as a technique to improve the model's performance.

Patgiri, Varshney, Akutota, and Kunde (2019) investigated the application

of Machine Learning for intrusion detection by applying two different algorithms:

Support Vector Machine and Random Forest. To decrease the computational

time, Recursive Feature Elimination (RFE) was employed to select

relevant features for both classifiers. The NSL-KDD dataset was used for the evaluation

of the models developed. The extensive experimentation carried out was observed by

the researchers to be time consuming and performance degrading. The performance

achieved by Random Forest was stronger than that of SVM before the Recursive Feature

Elimination method was applied. In contrast, SVM performance was better than

Random Forest after using Recursive Feature Elimination with Cross-Validation (RFECV)

with the classifiers.

Taher, Jisan and Rahman (2019) evaluated the performance of Artificial

Neural Network and SVM using the NSL-KDD dataset. They aimed to discover the

classifier with the best success rate and accuracy. The proposed models were developed

with the features selected by a wrapper method, which reduced the feature set from 35 to 17

features. The work also aimed to discover the best number of hidden layers and learning

rate, and the results state that 3 hidden layers and a 0.1 learning rate were best. The

evaluation compared the two classifiers before feature selection was applied

and after feature selection was employed. The detection accuracy was

better when feature selection was employed; however, in both cases, the Artificial Neural

Network performed better.

In an attempt to improve network intrusion detection performance, Xu, Przystupa, Fang,

Marciniak, Kochan, and Beshley (2019) proposed an anomaly-based Intrusion

Detection System that combined an optimization algorithm and weighted KNN with

feature selection. The IDS proposed was examined using the KDDCup ID dataset, and

the outcomes show that the weighted KNN improved efficiency while sacrificing a

small amount of accuracy. Although only KNN was studied to enhance the performance

of network intrusion detection, the researchers concluded that Naïve Bayes and SVM can

also be used for this purpose and indicated that they plan to use SVM and Naïve Bayes

parameter optimization in the future.

Yulianto, Sukarno, and Suwastika (2019) increased the performance of an AdaBoost-based

Intrusion Detection System that was previously in place (Aburomman and Reaz,

2016), to address the imbalance of training data (CIC-IDS 2017 Monday-Working-

Hours-DDoS-Attack dataset). Synthetic Minority Oversampling Technique (SMOTE),

Ensemble Feature Selection (glm, gbm, treebag, ridge and lasso) and Principal

Component Analysis (PCA) were used to choose significant characteristics. For the
label classification, the AdaBoost classifier was applied. The proposed technique began

with the cleaning and scaling of the data, which totaled 225,745 records. The dataset

was then split into training data (70 percent of the data, or 158,022 records) and testing

data (the remaining 30 percent of the data, or 67,723

records). SMOTE was used to oversample the imbalanced classes, and the Ensemble

technique (comprising gbm, glm, lasso, ridge, and treebag) was used to choose features,

with 25 being chosen (16 features). The model was created with AdaBoost and five

rounds of cross-validation were performed. Finally, the model's efficiency was assessed

using the accuracy, precision, and recall criteria. According to the results, the suggested

strategy outperformed that of Aburomman and Reaz (2016).
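
The reported split sizes can be checked with simple arithmetic. The sketch below assumes that the test portion is rounded down and that the remainder goes to training, a rounding rule chosen here only because it reproduces the reported counts; the `split_counts` helper is hypothetical.

```python
def split_counts(total, test_percent):
    """Return (train, test) sizes using integer arithmetic; the test size is rounded down."""
    test = total * test_percent // 100
    return total - test, test

train, test = split_counts(225_745, 30)
print(train, test)
```

The two parts add up to the 225,745 records in the cleaned dataset.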

Haripriya and Jabba, (2018) reviewed various machine learning algorithms that have

been suggested by existing works in order to determine the best techniques for

implementing an Intrusion Detection System. The review involved existing work that

used single classifiers, hybrid classifiers and ensemble classifiers. The researchers

found that each algorithm has its own importance and contributions when contrasted to

other methods and as a result a particular technique could not be selected for the

implementation of Intrusion Detection System. The researchers also observed that it

can be difficult to train algorithms when a certain amount of traffic data is not available.

Hebatallah, Farouk and Abdel-Hamid (2018) aimed to identify the smallest set of features

that results in maximum accuracy. The researchers proposed an IDS based on filter and

wrapper methods with the J48 and Naive Bayes classifiers as the base classifiers.

The system selected the best model after performing a series of experiments that were

evaluated using the UNSW-NB15 dataset. The models were partitioned into two layers.

The first layer was for selection of features and it was divided into five strategies. The

first strategy used all features. The second strategy used the wrapper method to select the

features. The filter method, which used six different evaluators, was

employed in the third strategy. The fourth strategy merged the single evaluators. The

fifth strategy used a combination of the best subset and evaluator. The output from the

first layer, which consists of different features, was then given as input to the second layer,

which applied both J48 and Naive Bayes on each strategy separately. The results of the

experiments state that the GR ranking method used with the J48 classifier achieved

the best accuracy.

In the study of Ashraf, Ahmad and Ashraf (2018), the detection rate and accuracy of IDSs

were computed using J48, Random Forest and Naïve Bayes classifiers. The NSL KDD

dataset was utilized in the experiments. On the 20% NSL KDD dataset, the

classification performance of J48, Random Forest and Naïve Bayes were analyzed in

the research. Based on the findings, it was concluded that Random Forest

outperformed Naïve Bayes in terms of detection rate and accuracy. Since all three

classifiers achieved up to 90% precision and recall, a hybrid model consisting of all

three may be proposed in future.

Chiba et al. (2018) suggested an efficient strategy for developing an effective NIDS

using Back Propagation Neural Network (BPNN) on the KDD CUP'99 dataset.

Combinations of implementation parameters were used to develop

a total of 48 models. Two IDSs which proved to be best based on

false positive rate, detection rate, F-score and AUC were chosen during the evaluation

process. The initial stage aimed at determining the important parameters to be used

in the classifier's construction, and the parameters they determined to be the most

important were the count of attributes, normalization, the architecture of the Neural

Network (especially the nodes to be used in the layers), momentum term, learning

rate and transfer function. Generation of the combinations of different parameters was done

in the third stage. Implementation of the IDS was done in stage four. Finally, in the fifth

stage, the models that were developed were compared based on their efficiency

and the study selected two. In future work, an optimization algorithm that searches for

optimal parameters that influence the performance of the model will be applied.

Hajisalem and Babaie (2018) suggested a new artificial bee colony (ABC) and artificial

fish swarm (AFS) hybrid classification technique. The structure of the implemented

approach was as follows: Division of the training datasets, feature selection, rule

generation, and hybrid classification. The framework was based on the ABC-AFS that

was proposed. In order to split the data assigned for training and eliminate the redundant

features, Fuzzy C-Means Clustering (FCM) and Correlation-based Feature Selection

(CFS) algorithms were used. In addition to using the CART technique, attempts to

distinguish between normal and abnormal records involved the use of If-Then rules

that were constructed based on the selected attributes. The rules generated were also

used to train the presented hybrid technique. Results on the UNSW-NB15 and

NSL-KDD datasets show that the proposed method achieved a detection rate of 99 percent

and a false positive rate of 0.01 percent. Additionally,

a comparison of time and computational cost revealed that the overhead of the model

is the same as that of competing alternatives.


Idhammad et al. (2017) proposed a feed-forward artificial neural network based on

a multi-layer perceptron with a hidden layer to detect DDoS attacks. The proposed

system was examined on the UNSW-NB15 and NSL-KDD data sets. The parameters

supplied to their models were adjusted and the outcomes were reported. The results were

satisfactory in comparison to state-of-the-art methods.

Al-yaseen, Othman, Zakree, and Nazri (2016) introduced a system

framework based on a multi-level hybrid extreme learning and support vector machine.

The researchers stated that SVM consumes a lot of training time; for that reason, a

modified K-means was employed to reduce the dataset size, yielding smaller

samples amounting to 10% of the KDD data that form a small, high-quality data set.

Four SVM classifiers and one ELM classifier were employed. The results showed that

their system outperformed the state-of-the-art methods and large fluctuations were not seen in

the detection performance. Furthermore, the results indicate enhanced accuracy and

short training time due to the reduction of the dataset. However, the use of more than

one classifier resulted in a longer testing time.

Lin (2015) suggested a KNN model based on Cluster Center and Nearest Neighbors

(CANN) to reduce the dataset's feature characterization to a single dimension. The

newly acquired dataset was utilized to evaluate the base classifier. The results reveal that

the CANN classifier outperforms the KNN and SVM classifiers in the dataset's first

dimension. The CANN had a better rate of correct identification and a lower percentage

of false alarms. When tested on two datasets, KNN and SVM were shown to be

computationally demanding, while CANN was found to be computationally light. In

terms of the study's constraint, CANN was unable to recognize the R2L and U2R
attacks since the one-dimensional representation was unable to appropriately depict

such attacks. In addition, CANN necessitated additional calculations in order to remove

the separation-based characteristics.

Maharaj and Khanna (2014) worked on the REPTree algorithm and Voting Feature Interval

(VFI). According to the researchers, most of the existing systems only detect whether there

is an attack or not, which does not give detailed information about the kind of attack.

This led them to direct their study towards identifying a classifier that

does multiclass classification, as some classification algorithms do not support the prediction

of more than two classes. Evaluation was done using the KDDCUP dataset based on

the Receiver Operating Characteristic (ROC) curve, which gave details about the Area

Under Curve (AUC), False Positive Rate (FPR) and True Positive Rate (TPR). After

the evaluation, the study concluded that the REPTree learning algorithm is

better and more efficient for Intrusion Detection Systems.
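
To make the relationship between the ROC curve, TPR, FPR and AUC concrete, the sketch below computes them for a handful of scored records. It assumes a binary attack/normal labelling and uses invented scores; the helper names are hypothetical and the example is unrelated to the KDDCUP results reported in the study.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical classifier scores; label 1 = attack, 0 = normal.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
print(points, area)
```

An AUC of 1.0 would mean every attack is scored above every normal record, while 0.5 corresponds to random ranking.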

Golmah, (2014) implemented a hybrid IDS that used Data Mining algorithms. The

researcher combined SVM and C4.5, which yielded improved accuracy when

compared to that of systems modeled using SVM.

The purpose of the study by Ghosh et al. (2014) was to improve the performance of a

classifier for detecting intrusions by applying multilevel feature selection and hybridized

K-Nearest Neighbors and Neural Networks (KNN-NN). Prior to the classification, there

were four steps. In the first step, preprocessing was done on the NSL-KDD data set,

which was used for the experimental analysis. In the next step, the Rough Set Theory

(RST) wrapper method was selected for the first feature selection; prior to using the

RST, Normalization and Discretization were performed. Then the Information Gain

(IG) filter method was applied for the second-level feature selection process in step three.

After all these were done, the classification phase followed, and it was done in two

phases by applying KNN first and using the output as an input for the Neural Networks.
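
The Information Gain filter used in the second selection level scores a feature by how much knowing its value reduces the entropy of the class. A minimal sketch on discretized toy data follows; the feature names and values are hypothetical and the code is not the authors' implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the class minus its expected entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy discretized data (hypothetical).
labels = ["attack", "attack", "normal", "normal"]
protocol = ["icmp", "icmp", "tcp", "tcp"]    # separates the classes perfectly
service = ["http", "ftp", "http", "ftp"]     # says nothing about the class
ig_protocol = information_gain(protocol, labels)
ig_service = information_gain(service, labels)
print(ig_protocol, ig_service)
```

A filter step would rank all features by this score and keep those above a chosen threshold.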

In order to classify network attack data, Jongsuebsuk, Wattanapongsakorn, and

Charnsripinyo (2012) presented an IDS using a fuzzy rule algorithm and a genetic

algorithm, developed using the KDD99 dataset as well as a dataset provided by the

researchers. Assessment of the IDS in terms of false alarm rate, detection speed and

detection rate showed that network attacks could be detected in real time within a very

short time.

The study of Patel et al., (2012) reviewed the K-Nearest Neighbors (KNN), Artificial

Neural Network (ANN), Support Vector Machine (SVM), Decision Tree (DT) and

Naive Bayes (NB) data mining techniques. The researchers further highlighted the

advantages and disadvantages of the algorithms. In this study, the performance of all

the algorithms implemented was low, though their dataset statistics were not stated.

The results of the evaluation of the algorithms led the researchers to conclude

that combining more than one algorithm may nullify the disadvantages of

each, since different algorithms have different insights regarding the situation,

and may also increase the performance of the system.


Table 1: Review of Related Works

| S/N | Author's Name | Year | Title | Methodology/Contributions | Results |
|---|---|---|---|---|---|
| 1 | Khan | 2021 | HCRNNIDS: Hybrid Convolutional Recurrent Neural Network-Based Network Intrusion Detection System. | Hybrid IDS (Convolutional Neural Networks and Recurrent Neural Networks) | Performance achieved was higher in comparison to existing IDSs and a detection rate of 97.7% was achieved. |
| 2 | Thaseen, Banu, Lavanya, Ghalib & Abishek | 2021 | An integrated intrusion detection system using correlation-based attribute selection and artificial neural network | Correlation feature selection integrated with Artificial Neural Networks. | Proposed model performed better than state-of-the-art methods. However, training of the IDS was slow, required a lot of training time and was not memory or computationally efficient. |
| 3 | Salih & Abdulazeez | 2021 | Evaluation of Classification Algorithms for Intrusion Detection System: A Review. | Review of existing classifiers. | Feature selection has good effects on performance. Hybrid classifiers could provide an optimal solution. Random Forest achieved the best accuracy while Particle Swarm Optimization achieved the best result for feature selection. |
| 4 | Mahfouz, Venugopal & Shiva | 2020 | Comparative Analysis of ML Classifiers for Network Intrusion Detection. | Analysis of classifiers (NB, LR, NN, SVM, KNN, DT) | Overfitting detected from accuracy of classifiers. |
| 5 | Kajal & Nandal | 2020 | A Hybrid Approach for Cybersecurity: Improved Intrusion Detection System Using ANN-SVM. | GA and DWT with ABC for feature selection. Hybridized ANN-SVM for classification. | The proposed system was seen to have outperformed the other existing systems in terms of accuracy. |
| 6 | Waskle, Parashar & Singh | 2020 | Intrusion Detection System Using PCA with Random Forest Approach. | PCA for feature extraction. Random Forest as base classifier. | Comparison was done based on performance time, accuracy and error rate with other classifiers (SVM, NB, DT). The PCA with Random Forest approach performed better in all metrics used. |
| 7 | Kasongo & Sun | 2020 | A deep learning method with wrapper based feature extraction for wireless intrusion detection system | Wrapper based feature extraction unit (WFEU) used with Extra Trees for feature selection and Feed Forward Deep Neural Network (FFDNN) as base classifier. | The proposed system achieved greater accuracy than other existing methods for both binary and multi-class classification. |
| 8 | Zhou, Cheng, Diang & Dai | 2020 | Building an Efficient Intrusion Detection System Based on Feature Selection and Ensemble Classifier. | Correlation-based Feature Selection - Bat Algorithm (CFS-BA) for feature selection. Voting technique using C4.5, RF and Forest by Penalizing Attributes for the model building. | An unbiased model having low computational complexity was achieved. |
| 9 | Anish & Sundarakantham | 2019 | Machine Learning based Intrusion Detection System | CfsSubsetEval to select features. Naive Bayes and SVM as base classifiers. | In all scenarios, SVM proved to have the upper hand. |
| 10 | Bhati & Rai | 2019 | Analysis of Support Vector Machine-based Intrusion Detection Techniques. Arab. | Variants of SVM (Quadratic, Linear, Fine Gaussian and Medium Gaussian) | Quadratic (96.1%), Linear (98.6%), Fine Gaussian (98.7%), and Medium Gaussian (98.5%). Fine Gaussian had the best accuracy and the least error. |
| 11 | Bindra & Sood | 2019 | Detecting DDoS attacks using machine learning techniques and contemporary intrusion detection dataset | Comparison involving Logistic Regression, KNN, Gaussian Naïve Bayes, Random Forest, Linear SVM and Linear Discriminant Analysis. | Random Forest appeared to be more accurate (96.2%). |
| 12 | Chu, Lin & Chang | 2019 | Detection and classification of advanced persistent threats and attacks using the support vector machine | Investigation on performance of SVM, Naïve Bayes, Decision Tree, Artificial Neural Networks (with Multi-layer Perceptron). | SVM attained a high performance owing to the adjustments of the C and gamma parameters. However, ANN performed better with four layers. |
| 13 | Kim, Shin & Choi | 2019 | An Intrusion Detection Model based on a Convolutional Neural Network | | |
| 14 | Taher, Jisan & Rahman | 2019 | Network Intrusion Detection using Supervised Machine Learning Technique with Feature Selection | Wrapper method for feature selection. ANN and SVM. | Feature selection improved the performance of the classifiers, though ANN performed better both before and after feature selection. Results also state that, for the ANN classifier, the best number of hidden layers was 3 and the best learning rate was 0.1. |
| 15 | Xu, Przystupa, Fang, Marciniak, Kochan & Beshley | 2019 | A Combination Strategy of Feature Selection Based on an Integrated Optimization Algorithm and Weighted K-Nearest Neighbor to Improve the Performance of Network Intrusion Detection | Weighted KNN | Efficiency was improved at the expense of a small amount of accuracy. |
| 16 | Yulianto, Sukarno & Suwastika | 2019 | Improving AdaBoost-based Intrusion Detection System (IDS) Performance on CIC IDS 2017 Dataset. | SMOTE for imbalanced dataset. PCA for feature extraction. Ensemble for feature selection (gbm, glm, lasso, ridge and treebag). | Proposed model outperformed selected existing IDS (Aburomman & Reaz, 2016). |
| 17 | Haripriya & Jabba | 2018 | Role of Machine Learning in Intrusion Detection System: Review. | Review of existing works including single, hybrid and ensemble classifiers. | The researchers could not make a decision as the study showed that each algorithm has its own importance and contributions. |
| 18 | Hebatallah, Farouk & Abdel-Hamid | 2018 | | Filter and wrapper based method for feature selection. J48 and Naive Bayes classifiers. | The results of the experiments state that the GR ranking method used with the J48 classifier achieved the best accuracy. |
| 19 | Patgiri, Varshney, Akukota & Kunde | 2018 | An Investigation on Intrusion Detection System Using Machine Learning. | Recursive feature elimination for feature selection. Random Forest and SVM for base classifier. | The application of RFECV helped SVM perform better than Random Forest. |
| 20 | Ashraf, Ahmad & Ashraf | 2018 | A Comparative Study of Data Mining Algorithms for High Detection Rate in Intrusion Detection System | Comparison of Naïve Bayes, J48 and Random Forest classifiers | Random Forest outperformed Naïve Bayes. All three classifiers achieved accuracy over 90% and a hybrid model consisting of all three may be proposed. |
| 21 | Chiba, Abghour, Moussaid, Omri & Rida | 2018 | A Novel Architecture Combined with Optimal Parameters for Backpropagation Neural Networks Applied to Anomaly Network Intrusion Detection. | Back Propagation Neural Network (BPNN) | Results state that the aim of the proposed system, namely very high detection and accuracy rates and a low false positive rate, was attained. |
| 22 | Hajisalem & Babaie | 2018 | A hybrid intrusion detection system based on ABC-AFS algorithm for misuse and anomaly detection | Artificial Bee Colony and Fish Swarm hybrid IDS. Fuzzy C-Means Clustering (FCM) and Correlation Feature Selection for removing irrelevant features. | The hybrid model achieved a detection rate of 99% and a false positive rate of 0.01%. |
| 23 | Idhammad et al. | 2017 | DoS detection method based on artificial neural networks | Artificial Neural Networks | The obtained results were satisfactory when compared to the state-of-the-art DoS detection methods. |
| 24 | Al-yaseen, Othman, Zakree & Nazri | 2016 | Multi-Level Hybrid Support Vector Machine and Extreme Learning Machine Based on Modified K-means for Intrusion Detection System | K-means for dataset reduction. Hybridized SVM (4) and extreme learning. | Proposed system outperformed the state-of-the-art methods but the use of more than one classifier resulted in longer testing time. |
| 25 | Lin | 2015 | An intrusion detection system based on combining cluster centers and nearest neighbors. | KNN, SVM & Cluster Center and Nearest Neighbors. | CANN classified labels correctly and had a low false alarm rate. KNN and SVM were computationally intensive while CANN was computationally light but could not recognize R2L and U2R attacks. |
| 26 | Maharaj & Khanna | 2014 | A comparative Analysis of Different Classification Techniques for Intrusion Detection System. | REPTree and Voting Feature Interval | The REPTree learning algorithm performed better and is more efficient for Intrusion Detection Systems. |
| 27 | (Velusamy et al., n.d.) | 2014 | An Efficient Hybrid Multilevel Intrusion Detection System in Cloud Environment. | Rough Set Theory and Information Gain for feature selection. Hybridized KNN and NN for base classifier. | Information Gain gave the best result compared to the Rough Set Theory. The hybridized classifier performed better than the classifiers used separately. |
| 28 | Golmah | 2014 | An Efficient Hybrid Intrusion Detection system based on c5.0 and SVM. | Hybridized SVM and C4.5 algorithm. | Improved accuracy in comparison to existing approaches. |
| 29 | Jongsuebsuk, Wattanapongsakorn & Charnsripinyo | 2013 | Real-time Intrusion Detection with Fuzzy Genetic Algorithm | Fuzzy rule and Genetic algorithm. | The detection rate was approximately over 97.5%. |
| 30 | Patel, Thakkar & Ganatra | 2012 | A Survey and Comparative Analysis of Data Mining Techniques for Network Intrusion Detection Systems | SVM, KNN, ANN, NB, DT | Different algorithms have different advantages and disadvantages in the contribution they make to an Intrusion Detection System; different algorithms can be combined so as to nullify the disadvantages of one another. |

CHAPTER THREE
RESEARCH METHODOLOGY

3.1 Introduction
This chapter describes the method chosen to achieve the project's objectives, together with the programming language, dataset, and software tools used. The proposed method proceeds in several stages; the techniques and methods required to achieve the specified goals are described below.

3.2 Proposed System

The proposed system is grouped into three major phases: (1) data collection, pre-processing and feature selection; (2) model development and detection; and (3) evaluation. Important features will be selected using Correlation-based Feature Selection (CFS) followed by Sequential Forward Selection (SFS), and the Random Forest classifier will be tested using the NSL-KDD dataset. The dataset will be partitioned into two portions: a training set containing 80% of the data, from which the model learns to predict outcomes, and a testing set containing the remaining 20%.

The proposed method will be implemented in five stages. The first stage preprocesses the data. In the second stage, relevant features are selected using CFS, which checks for multicollinearity in the data. In the third stage, the relevant features are re-selected using SFS. In the fourth stage, the model is developed using the base classifier. In the final stage, the model is evaluated. CFS was chosen as a first-level filter for selecting features for intrusion prediction. However, because filter methods do not take the classifier into account, SFS was chosen as a second-level wrapper method so that the final feature selection reflects the classifier's performance. Finally, the Random Forest classifier will be employed to distinguish between types of traffic. The implementation will use Python as the programming language.
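The 80/20 partition described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_test_split`, the fixed seed and the list of integers standing in for dataset rows are assumptions made for the example, not part of the proposed system.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of `records` and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))   # stand-in for 100 dataset rows
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set made up only of the last rows of the file, which matters when records are stored grouped by class.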

3.3 Methodology

3.3.1 Dataset Acquisition

The NSL-KDD dataset will be used to evaluate the proposed system. The dataset was obtained from the UCI Machine Learning Repository.

3.3.2 KDD'99 and NSL-KDD Datasets

The KDD'99 (or KDD CUP 99) dataset takes its name from its use in "The Third International Knowledge Discovery and Data Mining Tools Competition," and it quickly became the most widely used dataset for the evaluation of intrusion detection systems. The dataset was created by the Defense Advanced Research Projects Agency (DARPA) in 1998 and contains roughly 4,900,000 records and 41 attributes. Researchers have confirmed that the KDD dataset contains a large number of redundant records, with 78 percent of the training set and 75 percent of the test set being duplicates. This led to the creation of the NSL-KDD dataset, a revised version of the KDD dataset. Attacks in the dataset can be grouped into four types.

a. Denial of Service (DoS): An attack in which the attacker overloads the host system, preventing authorized users from gaining access to data or services. The attacker floods the network with unnecessary packet requests and depletes computing resources, making legitimate requests impossible to process. Neptune, smurf, teardrop, pod and mailbomb are among the DoS attacks.

b. User to Root Attack (U2R): The intruder begins the breach by gaining entry to the system through a regular user account. The attacker then uses privilege escalation (exploiting a system bug or design flaw to gain access to resources or data that should normally be unavailable) to obtain root access to the system. User-to-root attacks include buffer overflow, loadmodule and perl.

c. Remote to Local Attack (R2L): This type of intrusion is committed by an intruder who is able to send packets to a machine connected to the network but has no account on that system. The perpetrator exploits some loophole to attain remote access to the machine as a user. Guess_passwd, imap and spy are examples of Remote to Local attacks.

d. Probing Attack: An effort to gather information about a system by probing it in order to discover vulnerabilities that can be used to compromise it later. Probing attacks include ipsweep, portsweep, nmap and satan.
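The four attack categories above can be turned into a lookup that maps each raw record label to its category before training. The sketch below covers only the attack names listed in this section; the label spellings (for example `buffer_overflow`, `mailbomb`) and the `"Unknown"` fallback are assumptions made for the example.

```python
# Attack labels named in the text, grouped by category.
ATTACK_CATEGORIES = {
    "DoS": ["neptune", "smurf", "teardrop", "pod", "mailbomb"],
    "U2R": ["buffer_overflow", "loadmodule", "perl"],
    "R2L": ["guess_passwd", "imap", "spy"],
    "Probe": ["ipsweep", "portsweep", "nmap", "satan"],
}

# Invert the grouping into a label -> category lookup.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}

def categorize(label):
    """Map a raw label to its category; labels not listed here give 'Unknown'."""
    label = label.strip().lower()
    if label == "normal":
        return "Normal"
    return LABEL_TO_CATEGORY.get(label, "Unknown")

print(categorize("Smurf"), categorize("nmap"), categorize("normal"))
```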

3.3.3 Dataset Preprocessing

Data preprocessing is the process of cleaning, encoding, or transforming data so that it can be interpreted by the machine learning algorithm and so that the resulting model is more efficient.

The pre-processing will consist of the following steps:

i. Data Cleaning: This step is concerned with filling in missing values or deleting rows with missing values, getting rid of errors, detecting and dealing with outliers, and resolving data inconsistencies.


ii. Feature Encoding: Most machine learning models cannot handle categorical variables directly, whether nominal or ordinal. This step encodes those variables numerically so that the model can process them.

iii. Data Transformation or Normalization: Because the units and magnitudes of the features vary, feature scaling is an important part of feature engineering. Min-max normalization rescales each feature to the range 0 to 1; in scikit-learn it is implemented by MinMaxScaler.

\[ X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \]

iv. Feature Selection: This step applies the feature selection methods described in the following sections to retain the most informative attributes.
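The encoding and normalization steps listed above can be illustrated with a small standard-library sketch. One-hot encoding is used here as one possible encoding for nominal variables; the helper names and the sample values are assumptions made for the example, and a real implementation would more likely rely on a library such as scikit-learn.

```python
def one_hot_encode(values):
    """Return (categories, rows) where each row is a 0/1 indicator list."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def min_max_normalize(values):
    """Rescale numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                 # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

protocols = ["tcp", "udp", "icmp", "tcp"]   # a nominal feature
durations = [0, 5, 10, 20]                  # a numeric feature
categories, encoded = one_hot_encode(protocols)
print(categories, encoded)
print(min_max_normalize(durations))         # [0.0, 0.25, 0.5, 1.0]
```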

3.3.4 Correlation-based Feature Selection (CFS)

CFS is a filter-based feature selection method that evaluates subsets of features according to how strongly their features are correlated with the class. The purpose of correlation-based feature selection is to avoid multicollinearity: a feature subset is considered good if it contains features that are highly correlated with the class (the dependent feature) but not with each other. CFS is based on the Pearson Correlation Coefficient. A correlation with an absolute value between 0.5 and 1 is considered strong, whereas one below 0.5 is considered weak. The correlation between two variables x and y can be calculated using the following equation:

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \]
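The Pearson coefficient and the 0.5 threshold described above can be sketched in plain Python. Note that `correlation_filter` is a simplified threshold-based filter written for illustration: it keeps features strongly correlated with the class and drops those strongly correlated with an already kept feature. It is not the full merit-based subset search of CFS, and the feature names and values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:    # constant column: r is undefined, treat as 0
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_filter(features, target, threshold=0.5):
    """Keep features strongly correlated with target but not with a kept feature."""
    ranked = sorted(features, key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if abs(pearson(features[name], target)) < threshold:
            continue   # weakly related to the class
        if any(abs(pearson(features[name], features[k])) >= threshold for k in kept):
            continue   # redundant with an already selected feature
        kept.append(name)
    return kept

features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],   # perfectly correlated with f1, hence redundant
    "f3": [5, 1, 4, 2, 3],    # essentially uncorrelated with the class
}
target = [0, 0, 1, 1, 1]
print(correlation_filter(features, target))
```

In this toy data only one of `f1` and `f2` survives, because the second is redundant given the first, and `f3` is dropped for being weakly correlated with the class.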
3.3.5 Sequential Forward Selection (SFS)

The Sequential Forward Selection (SFS) algorithm is a bottom-up search procedure that starts with an empty set and incrementally adds features chosen by an evaluation function. At each iteration, the feature to be added is chosen from the remaining features that have not yet been added to the set of relevant features.

At each step, the feature added is the one whose inclusion produces the lowest error compared with adding any other remaining feature.

• First, a single feature is chosen (using some criterion function).
• Next, pairs of features are formed by combining the chosen feature with each of the remaining features, and the best pair is chosen.
• Then, groups of three features are formed by combining the best pair with each of the remaining features, and the best triplet is chosen.
• The procedure is repeated until a predetermined number of features has been chosen.

SFS begins with an empty model and fits the model with each feature individually, selecting the feature that achieves the highest accuracy. It then fits the model with two features, trying each combination of the previously selected feature with every remaining feature, and again picks the feature that attains the highest accuracy. The process is repeated until the stopping criterion is met.


3.3.5.1 Algorithm for Sequential forward selection

Input: Features (features) (IV1, IV2, IV3, …, IVk, DVc)

Output: Subset of features (subset)

accuracy = 0
subset = empty
while features is not empty do
    state = 0
    for i = 1 to length(features) do
        new_subset = add(copy(subset), features[i])
        new_accuracy = evaluate(classifier, data restricted to new_subset)
        if new_accuracy > accuracy then
            index = i, accuracy = new_accuracy, state = 1
    if state then
        subset = add(subset, features[index])
        features = delete(features, features[index])
    else
        break    (no remaining feature improves the accuracy)
return subset
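A runnable Python sketch of the algorithm above is given below. The `evaluate` argument stands in for "train the classifier on this feature subset and return its accuracy"; here it is replaced by a hypothetical scoring function (`toy_accuracy`) with made-up gains so that the example runs without a dataset. The feature names are only illustrative.

```python
def sequential_forward_selection(features, evaluate, max_features=None):
    """Greedy SFS: repeatedly add the feature that most improves evaluate(subset)."""
    remaining = list(features)
    subset, best_score = [], 0.0
    while remaining and (max_features is None or len(subset) < max_features):
        # Score every candidate subset formed by adding one remaining feature.
        scored = [(evaluate(subset + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break   # no remaining feature improves the score
        subset.append(feature)
        remaining.remove(feature)
        best_score = score
    return subset, best_score

# Hypothetical scorer: each useful feature adds a fixed gain, and every
# feature costs a small penalty so that uninformative ones are rejected.
USEFUL = {"src_bytes": 0.5, "service": 0.3, "flag": 0.1}

def toy_accuracy(subset):
    gain = sum(USEFUL.get(f, 0.0) for f in subset)
    return gain - 0.01 * len(subset)

selected, score = sequential_forward_selection(
    ["duration", "flag", "service", "src_bytes"], toy_accuracy
)
print(selected, round(score, 2))
```

Unlike the filter stage, this wrapper stage calls the evaluation function once per candidate feature per round, so its cost grows roughly with the square of the number of features; running it after a filter keeps that cost manageable.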


3.3.6 Random Forest Classifier
Random Forest is an ensemble classifier and one of the most powerful machine learning methods. It works by building n decision trees, each from a different random subset of the records and features, and combining their predictions. As a result, it achieves high accuracy while being less prone to overfitting than a single decision tree. It is called a forest because the model consists of a large number of decision trees, ranging from hundreds to thousands. Using a technique known as bagging (bootstrap aggregating), Random Forest implicitly creates its own training and testing data. Bagging selects records at random with replacement, so that roughly two-thirds of the distinct records appear in a tree's training sample, while the remaining third (called the out-of-bag sample) serves as testing data. In addition to bagging, Random Forest performs random variable selection. A decision tree aims to minimize the impurity in a bucket, and when choosing a variable for a split it picks, from all variables, the one that results in the purest bucket. Random Forest, on the other hand, takes a random sample of √p variables, where p is the number of variables, searches this sample for the variable that gives the purest bucket, and applies it to the node in the tree. The bagging method and random variable selection give rise to the name Random, and the multiple trees to the name Forest, resulting in Random Forest.
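The two sources of randomness described above, bagging and the √p variable sample, can be sketched with the standard library. The function names, the seed and the placeholder feature names `f0` to `f40` are assumptions for the example; the record count is arbitrary.

```python
import math
import random

def bootstrap_sample(n_records, rng):
    """Draw n_records indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_records) for _ in range(n_records)]
    out_of_bag = sorted(set(range(n_records)) - set(in_bag))
    return in_bag, out_of_bag

def random_feature_subset(feature_names, rng):
    """Pick about sqrt(p) of the p features, without replacement."""
    k = max(1, round(math.sqrt(len(feature_names))))
    return rng.sample(feature_names, k)

rng = random.Random(0)
in_bag, oob = bootstrap_sample(10_000, rng)
print(f"out-of-bag fraction: {len(oob) / 10_000:.3f}")   # close to 1/e, about 0.37

features = [f"f{i}" for i in range(41)]       # 41 attributes, as in the dataset
print(random_feature_subset(features, rng))   # 6 names, since sqrt(41) is about 6.4
```

The out-of-bag fraction approaches 1/e (about 36.8%) for large samples, which is where the "two-thirds in, one-third out" rule of thumb comes from.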


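[Figure 1 shows the input data drawn into samples S1, S2, S3, …, Sn; each sample is used to build one tree (Tree1, Tree2, Tree3, …, Treen), which produces a result (Result1, Result2, Result3, …, Resultn); the final class is decided by voting based on the mode of the results.]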

Figure 1: Random Forest Architecture

Figure 1 depicts the use of the random forest classifier for data classification in the proposed system. The random forest model is supplied with the pre-processed data, from which n samples are drawn. RF generates n distinct trees, each built from a different combination of records and features. Each tree produces a classification outcome, and the classifier's outcome is determined by majority voting: the sample is assigned to the class that receives the most votes.

The bagging method and random variable selection are used in the algorithm so that different trees can learn different things from the data. If all trees were fed the same set of records and variables (features), their predictions would be the same, and combining them would bring no benefit.
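The majority-voting step can be sketched in a few lines. The list of per-tree predictions is made up for the example; in the real system each entry would come from one trained tree.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most frequent class label among the trees' predictions."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label

votes = ["DoS", "Normal", "DoS", "Probe", "DoS"]   # one prediction per tree
print(majority_vote(votes))   # DoS
```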

3.3.6.1 Algorithm for Random Forest classifier.

Input: Training set (IV1, IV2, IV3, …, IVk, DVc)

Output: Class of attack

Generating trees
for each tree i in the number of trees
    select a sample set using the bagging method (sampling with replacement)
    for each node j in the number of nodes
        randomly pick k features from the M features, where k < M
        for each feature in the set of k features
            calculate the Gini index
        end for
        create a node d using the feature with the minimum Gini index
    end for
end for

Deciding the attack class
for each tree
    for each class_label
        count the tree's vote for that class_label
    end for
end for
compute the mode of the class labels
assign the attack class with the highest frequency to the instance
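The node-building step of the algorithm above (compute the Gini index for each candidate feature and keep the minimum) can be sketched as follows. The threshold-based binary split, the helper names and the tiny dataset are assumptions made for the example; library implementations differ in how they enumerate split points.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on values <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

def best_feature(columns, labels, candidate_names):
    """Among the candidate features, return the one whose best split has minimum Gini."""
    def best_split(name):
        values = columns[name]
        return min(split_gini(values, labels, t) for t in set(values))
    return min(candidate_names, key=best_split)

columns = {
    "count": [1, 2, 8, 9],      # separates the two classes at count <= 2
    "duration": [5, 1, 5, 1],   # carries no information about the class
}
labels = ["normal", "normal", "attack", "attack"]
print(gini_impurity(labels))                                  # 0.5
print(best_feature(columns, labels, ["count", "duration"]))   # count
```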

3.4 Model Evaluation

The proposed model is evaluated on accuracy, sensitivity (recall), precision, specificity, and F1 score. These metrics are computed from the numbers of True Positives, True Negatives, False Positives, and False Negatives.

• True Positives (TP): the number of instances of the positive class that the model correctly predicts as positive. A TP occurs when the predicted outcome is Yes and the actual value is Yes.
• True Negatives (TN): the number of instances of the negative class that the model correctly predicts as negative. A TN occurs when the predicted outcome is No and the actual value is No.
• False Positives (FP): the number of instances incorrectly classified into the positive class. An FP occurs when the predicted outcome is Yes and the actual value is No.
• False Negatives (FN): the number of instances incorrectly classified into the negative class. An FN occurs when the predicted outcome is No and the actual value is Yes.
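For a binary problem, the four outcomes above can be counted directly from the actual and predicted labels. In this sketch the positive class is assumed to be labelled `"attack"`, and the two label lists are made up.

```python
def confusion_counts(actual, predicted, positive="attack"):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p != positive and a != positive:
            tn += 1
        elif p == positive and a != positive:
            fp += 1   # predicted Yes, actually No
        else:
            fn += 1   # predicted No, actually Yes
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

actual    = ["attack", "attack", "normal", "normal", "attack", "normal"]
predicted = ["attack", "normal", "normal", "attack", "attack", "normal"]
print(confusion_counts(actual, predicted))
```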

3.4.1 Confusion Matrix

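The confusion matrix displays the counts of correct and incorrect predictions made by the model against the actual results. A confusion matrix for five classes A to E is shown below, where rows correspond to the actual class, columns to the predicted class, TP_X is the number of correctly classified instances of class X, and E_XY is the number of instances of class X misclassified as class Y.

                 Predicted
Actual    A      B      C      D      E
A         TP_A   E_AB   E_AC   E_AD   E_AE
B         E_BA   TP_B   E_BC   E_BD   E_BE
C         E_CA   E_CB   TP_C   E_CD   E_CE
D         E_DA   E_DB   E_DC   TP_D   E_DE
E         E_EA   E_EB   E_EC   E_ED   TP_E
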
i. Accuracy: the ratio of correctly classified instances (TP and TN) to the total number of instances, calculated using the formula below.

\[ \text{Accuracy} = \frac{TP + TN}{\text{Total instances}} \]

ii. Recall (Sensitivity): the ratio of true positives to the sum of true positives and false negatives.

\[ \text{Recall} = \frac{TP}{TP + FN} \]

iii. Precision: the ratio of true positives to the sum of true positives and false positives.

\[ \text{Precision} = \frac{TP}{TP + FP} \]

iv. Specificity: the ratio of true negatives to the sum of true negatives and false positives.

\[ \text{Specificity} = \frac{TN}{TN + FP} \]

v. F1 score: a measure of the balance between recall and precision (their harmonic mean), given by

\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
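The metrics above can be computed per class from a multi-class confusion matrix by treating one class as positive and all others as negative (one-vs-rest). The sketch below assumes rows are actual classes and columns are predicted classes; the three class names and the label lists are made up for the example.

```python
def confusion_matrix(actual, predicted, classes):
    """matrix[i][j] = number of instances of classes[i] predicted as classes[j]."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def class_metrics(matrix, k):
    """One-vs-rest metrics for the class at index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # actual k, predicted as another class
    fp = sum(row[k] for row in matrix) - tp   # actual other class, predicted as k
    tn = total - tp - fn - fp
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, total),
        "recall": recall,
        "precision": precision,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

classes = ["Normal", "DoS", "Probe"]
actual    = ["Normal", "Normal", "DoS", "DoS", "DoS", "Probe", "Probe", "Normal"]
predicted = ["Normal", "DoS", "DoS", "DoS", "Normal", "Probe", "DoS", "Normal"]
matrix = confusion_matrix(actual, predicted, classes)
print(matrix)
print(class_metrics(matrix, classes.index("DoS")))
```

For the DoS class in this toy data, TP = 2, FN = 1, FP = 2 and TN = 3, which gives a recall of 2/3, a precision of 1/2 and an F1 score of 4/7.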
3.5 Framework of the Proposed System

[Figure 2 is a flowchart with the following stages, in order:]
1. Data Collection
2. Data Processing: Data Cleaning and Scaling
3. Filter method feature selection using correlation-based feature selection
4. Wrapper method feature selection using sequential forward selection
5. Splitting the dataset into a Training Set and a Testing Set
6. Model building and training
7. Prediction
8. Model Evaluation

Figure 2: Theoretical Framework of the System

REFERENCES

A, A. H., & Sundarakantham, K. (2019). Machine Learning Based Intrusion. 2019 3rd International Conference on Trends in Electronics and
Informatics (ICOEI), 916–920.
Aburomman, A. A., & Ibne Reaz, M. Bin. (2016). A novel SVM-kNN-PSO ensemble
method for intrusion detection system. Applied Soft Computing Journal, 38,
360–372. https://doi.org/10.1016/j.asoc.2015.10.011
Agah, S. A. (2017). Investigating Identification Techniques of Attacks in Intrusion
Detection Systems Using Data Mining Algorithms. 17(5), 174–181.
Ahmad, I., Basheri, M., Iqbal, M. J., & Rahim, A. (2018). Performance Comparison
of Support Vector Machine , Random Forest , and Extreme Learning Machine
for Intrusion Detection. IEEE Access, 6, 33789–33795.
https://doi.org/10.1109/ACCESS.2018.2841987
Al-yaseen, W. L., Othman, Z. A., Zakree, M., & Nazri, A. (2016). Multi-Level Hybrid Support Vector Machine and Extreme Learning Machine Based on Modified K-means for Intrusion Detection System. Expert Systems
With Applications. https://doi.org/10.1016/j.eswa.2016.09.041
Ashraf, N., Ahmad, W., & Ashraf, R. (2018). A Comparative Study of Data Mining
Algorithms for High Detection Rate in Intrusion Detection System. 2(1), 49–57.
https://doi.org/10.33166/AETiC.2018.01.005
Chiba, Z., Abghour, N., Moussaid, K., El Omri, A., & Rida, M. (2018). A novel
architecture combined with optimal parameters for back propagation neural
networks applied to anomaly network intrusion detection. Computers and
Security, 75, 36–58. https://doi.org/10.1016/j.cose.2018.01.023
Ganapathy, S., Kulothungan, K., Muthurajkumar, S., Vijayalakshmi, M., Yogesh, L.,
& Kannan, A. (2013). Intelligent feature selection and classification techniques
for intrusion detection in networks: A survey. Eurasip Journal on Wireless
Communications and Networking, 2013(1), 1–16. https://doi.org/10.1186/1687-1499-2013-271
Kajal, A., & Nandal, S. K. (2020). A hybrid approach for cyber security: Improved
intrusion detection system using ann-svm. Indian Journal of Computer Science
and Engineering, 11(4), 412–425.
https://doi.org/10.21817/indjcse/2020/v11i4/201104300
Kasongo, S. M., & Sun, Y. (2020). A deep learning method with wrapper based
feature extraction for wireless intrusion detection system. Computers and
Security, 92. https://doi.org/10.1016/j.cose.2020.101752
Mahfouz, A. M., Venugopal, D., & Shiva, S. G. (n.d.). Comparative Analysis of ML
Classifiers for Network Intrusion Detection.
Member, S. (2016). Design of Multilevel Hybrid Classifier with Variant Feature Sets
for. (7), 1810–1821.
Patel, R., Thakkar, A., & Ganatra, A. (2012). A Survey and Comparative Analysis of
Data Mining Techniques for Network Intrusion Detection Systems. (1), 265–271.
Patgiri, R., Varshney, U., Akutota, T., & Kunde, R. (2019). An Investigation on
Intrusion Detection System Using Machine Learning. Proceedings of the 2018
IEEE Symposium Series on Computational Intelligence, SSCI 2018, 1684–1691.
Institute of Electrical and Electronics Engineers Inc.
https://doi.org/10.1109/SSCI.2018.8628676
Salih, A. A., & Abdulazeez, A. M. (2021). Evaluation of Classification Algorithms
for Intrusion Detection System: A Review. Journal of Soft Computing and Data
Mining, 02(01). https://doi.org/10.30880/jscdm.2021.02.01.004
Taher, K. A. (2019). Network Intrusion Detection using Supervised Machine
Learning Technique with Feature Selection. 2019 International Conference on
Robotics,Electrical and Signal Processing Techniques (ICREST), 643–646.
Teng, S., Wu, N., Zhu, H., Teng, L., & Zhang, W. (2018). SVM-DT-Based Adaptive
and Collaborative. IEEE/CAA Journal of Automatica Sinica, 5(1), 108–118.
https://doi.org/10.1109/JAS.2017.7510730
Velusamy, K., Ghosh, P., Debnath, C., Metia, D., & Dutta, R. (2014). An Efficient
Hybrid Multilevel Intrusion Detection System in Cloud Environment (Vol. 16).
Ver. VII. Retrieved from Ver. VII website: www.iosrjournals.org
Wang, H., Gu, J., & Wang, S. (2017). An effective intrusion detection framework
based on SVM with feature augmentation. Knowledge-Based Systems,
(September). https://doi.org/10.1016/j.knosys.2017.09.014
Waskle, S., Parashar, L., & Singh, U. (2020). Intrusion Detection System Using PCA
with Random Forest Approach. Proceedings of the International Conference on
Electronics and Sustainable Communication Systems, ICESC 2020, 803–808. https://doi.org/10.1109/ICESC48915.2020.9155656
Zhou, Y., Cheng, G., Jiang, S., & Dai, M. (2019). Building an Efficient Intrusion
Detection System Based on Feature Selection and Ensemble Classifier.
https://doi.org/10.1016/j.comnet.2020.107247
