Malware Detection using Machine Learning

Conference Paper · December 2021

DOI: 10.20533/ICITST.2021.0008

3 authors, including: Olaniyi Ayeni and Otasowie Owolafe (Federal University of Technology)


Journal of Internet Technology and Secured Transactions (JITST), Volume 10, Issue 1, 2022

A Supervised Machine Learning Algorithm for Detecting Malware

Olaniyi Abiodun Ayeni


Department of Cyber Security, School of Computing
Federal University of Technology, Nigeria

Copyright © 2022, Infonomics Society | DOI: 10.20533/jitst.2046.3723.2022.0094

Abstract

The proliferation of malware is a threat to computing systems and their security, which is why the need for malware detection using machine learning arises. This work was motivated by the limitations of [1] and [2] (‘Malware Detection Module using Machine Learning Algorithms’). The objective of this research is to develop a security system for the detection of malware using supervised machine learning algorithms and to carry out a performance evaluation. Feature selection (Filter method) was used to reduce a dataset of 100,000 rows and 35 feature columns to 20 features, after which three classifier algorithms were employed: K-Nearest Neighbor, Decision Tree and Random Forest. The classifiers were trained and tested using the dataset (malware.csv) obtained from the Malware Detection dataset on Kaggle. The accuracies of the algorithms (K-Nearest Neighbor, Decision Tree and Random Forest) are 96.53%, 97.79% and 99.90% respectively. The results were also compared with the work of other researchers [3] who used the same three classifiers: the results of Maqsood 2020 for Random Forest, Decision Tree and K-Nearest Neighbor are 96.39%, 100% (overfit) and 99.4% respectively, while the results of Sarang et al. 2013 for Random Forest, Decision Tree and K-Nearest Neighbor are 99.57%, 99.23% and 99.06% respectively. This indicates that Random Forest is the most effective of the three classifiers for malware detection using machine learning. Moreover, the study can serve as a base for further research in the field of malware analysis with machine learning methods.

Keywords: Malware, Supervised Learning, Decision Tree, K-Nearest Neighbour, Random Forest, Feature Method, Computer Security

1. Introduction

The concept behind malware dates back to 1949, when John von Neumann described self-replicating programs, and many malicious programs have been created since. Antivirus companies are continually searching for the most effective ways of identifying malware, and one of the best-known methods is signature-based detection. Furthermore, the skill level required for malware development is decreasing because of the large number of attack tools now available on the internet. The high availability of anti-detection techniques and the ability to buy malware on the black market mean that anyone can become an attacker, regardless of skill level. Current studies show that more attacks are being launched by script kiddies or are automated [4]. Therefore, protecting computers against malware is an essential cybersecurity task for individual users and businesses, since an attack can lead to compromised data and significant losses. Massive losses and frequent attacks also drive the need for accurate and timely detection methods. However, current static and dynamic methods do not provide efficient detection, especially when dealing with zero-day attacks. Hence, machine learning-based techniques can be used. This paper therefore discusses the main points and concerns of machine learning-based malware detection, and looks for the best feature representation and classification methods.

2. Research Motivation

The existing literature on malware detection shows that there is a need for an efficient malware detection system, especially since the use of the internet is becoming increasingly important.

The existing framework ‘Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks’ [2] focuses only on detection and classification and neglects home users because it is processor heavy. In addition, in ‘Detection of malicious code by applying machine learning classifiers on static features’ by Shabtai et al. [5], the models were not trained properly, which resulted in running inefficient algorithms. This provided the motivation to create a malware detection system using machine learning that is well trained, has high accuracy and a low false-positive rate, and can protect one’s system by flagging incoming malicious files and preventing them from affecting one’s computer.

2.1. Problem Statement

With the development of technology, the number of malware samples is increasing daily. Malware is now designed with mutation characteristics, which causes an enormous growth in the number of variations. Also, with the help of automated malware generation
tools, novice malware authors can now quickly generate new variants. With this growth, traditional signature-based malware detection has proven ineffective against the vast number of malware variants. Machine learning methods, however, have proven effective against new malware.

Objective: The objectives of this research work are to:

• Design a security framework for malware detection using a supervised machine learning algorithm.
• Implement the designed framework.
• Evaluate the performance of the system.

2.2. Malware Types

i. Backdoor: A malware type that bypasses standard authentication procedures to access a system. As a result, remote access is granted to resources within an application, such as databases and file servers, giving perpetrators the ability to issue system commands remotely and update malware.

ii. Rootkit: Its functionality enables the attacker to access data with higher permissions than are allowed. For example, it can be used to give an unauthorized user administrative access. Rootkits typically hide their existence and are often unnoticeable on the system, making detection and, therefore, removal extremely hard [6].

iii. Keylogger: The idea behind this malware class is to log all the keys pressed by the user and store all data, including passwords, bank card numbers, and other sensitive information [7].

iv. Ransomware: This type of malware aims to encrypt all the data on the machine and ask the victim to transfer money to obtain the decryption key. Usually, a machine infected by ransomware is “frozen”, as the user cannot open any file, and the desktop picture is used to provide information on the attacker’s demands [8].

3. Related Works

The state-of-the-art survey of malware detection approaches using data mining techniques [1] presents a survey of malware detection approaches divided into two categories:

• Signature-based methods
• Behavior-based methods.

However, the survey does not provide either a review of the most recent deep learning approaches or a taxonomy of the types of features used in data mining techniques for malware detection and classification.

The research is motivated by a serious present-day threat, malicious executables, which are designed to damage computer systems and some of which spread over the network without the knowledge of the system’s owner.

Objective(s): To present a survey of malware detection approaches.

3.1. Methodology

The survey provides a summary of the current challenges related to malware detection approaches in data mining, presents a systematic and categorized overview of current machine learning approaches in data mining, explores the structure of the methods that are significant in malware detection, and discusses the important factors of malware classification approaches in data mining so that their problems can be addressed in the future.

3.2. Contribution to Knowledge

It sheds more light on how to approach malware detection using machine learning. In “Malware detection using statistical analysis of byte-level file content” [9], the authors used several machine learning techniques to detect malware files. They claimed that their techniques can properly classify any malware regardless of its obfuscation, using a multi-class classification technique to detect seven classes including benign. The novelty of the authors’ approach lies in the ability to detect obfuscated and packed malware. The difficulty in detecting obfuscated malware lies in the obscured structure of the malware file itself: the malware writer intentionally rewrites the code of the file to make it difficult for antimalware software to catch. The content set collected for the experiment comprises 12,111 files (1,800 benign and 10,311 malicious). However, only 50 files per class were used as the training set. The features generated for the experiment are statistical features derived from byte-sequence n-grams of the executables.

A static malware detection system using data mining methods proposed an extraction method based on PE headers, DLLs, and API functions, and detection methods based on Naive Bayes, J48 Decision Trees, and Support Vector Machines. The highest overall accuracy was achieved with the J48 algorithm (99% with the PE header feature type and the hybrid PE header and API function feature type, 99.1% with the API
function feature type) [10].

Motivation of the research: The research is motivated by a serious threat, malicious executables, which are designed to damage computer systems and some of which spread over the network without the knowledge of the system’s owner.

Objective(s): The objective of the research is to create a static malware detection system with a high accuracy rate.

Methodology: A static malware detection system was built using data mining techniques such as information gain, principal component analysis, and three classifiers: SVM, J48, and Naive Bayes. To overcome the shortcomings of conventional anti-virus products, static analysis methods were used to extract valuable features from Windows PE files.

Contribution to knowledge: A static malware detection system with a detection rate of 99.6% was created.

Limitations: It is not suitable for home users because it is processor heavy.

In Zero-day malware detection based on supervised learning algorithms of API call signatures, the API functions were again used for feature representation. The best result was achieved with the Support Vector Machines algorithm with a normalized poly kernel. A precision of 97.6% was achieved, with a false-positive rate of 0.025 [11].

Motivation of the research: The research is motivated by antivirus detectors being unable to detect new malware.

Objective(s): To develop a machine learning framework using eight different classifiers to detect unknown malware and to achieve a high accuracy rate.

Methodology: Large data sets were used to train the classifiers, and the performance of the various data mining algorithms adopted for the study was analysed using a fully automated tool developed in the research to conduct the experimental investigations and evaluation.

Contribution to knowledge: The machine learning framework developed achieved a promising accuracy rate of 98.5%.

Limitations: API call sequences can be extracted from most devices but not all, which limits the algorithm to some devices.

The paper Survey on the usage of machine learning techniques for malware analysis [12] identifies three main methods for detecting malicious software:

• Signature-based methods.
• Heuristic-based methods.
• Behavior-based methods.

In addition, the authors [12] investigate some features for malware detection and discuss concealment techniques used by malware to evade detection. Nonetheless, the aforementioned research does not consider either dynamic or hybrid approaches.

Motivation of the research: The research was motivated by malware becoming more and more challenging, given its relentless growth in complexity and volume.

Objective(s): It aims to provide an overview of the way machine learning has been used so far in the context of malware analysis in Windows environments.

Methodology: For the analysis of portable executables, the surveyed papers were systematized according to their objectives, the information about malware they specifically use, and the machine learning techniques they employ to process the input and produce the output.

Contribution to knowledge: It provided an overview of how machine learning algorithms can be employed in malware analysis.

Limitations: It highlighted that if the models are not properly trained, the result is inefficient algorithms and limited predictions.

4. Machine Learning Method

The machine learning process consists of the following 5 stages:

i. Data intake. At first, the dataset is loaded from the file and saved in memory.

ii. Data transformation. At this point, the data that was loaded at step 1 is transformed, cleaned, and normalized to be suitable for the algorithm. Data is converted so that it lies in the same range and has the same format. Feature extraction and selection, which are discussed further, are performed at this point as well. In addition, the data is separated into two sets, the 'training set' and the 'test set'. Data from the training set is used to build the model, which is later evaluated using the test set.

iii. Model Training. At this stage, a model is built using the selected algorithm.

iv. Model Testing. The model that was built or trained during step 3 is tested using the test data set, and the produced result is used for building a new model that would consider previous models, i.e., "learn" from them.

v. Model Deployment. At this stage, the best model is selected (either after the defined number of iterations or as soon as the needed result is achieved).
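The five stages above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the dataset is a small synthetic one generated in memory (standing in for a file such as malware.csv), the k-nearest-neighbour classifier is written from scratch with the standard library only, and all function names (`make_toy_dataset`, `min_max_scale`, `train_test_split`, `knn_predict`, `accuracy`, `select_best_k`) are hypothetical names chosen for illustration.

```python
import math
import random
from collections import Counter


# (i) Data intake: a tiny synthetic dataset stands in for a real file.
# Each row is (feature_1, feature_2, label); 1 = "malicious", 0 = "benign".
# The values are invented for illustration only.
def make_toy_dataset(n_per_class=50, seed=0):
    rng = random.Random(seed)
    rows = []
    for _ in range(n_per_class):
        rows.append((rng.gauss(2.0, 0.5), rng.gauss(200.0, 30.0), 0))
        rows.append((rng.gauss(6.0, 0.5), rng.gauss(600.0, 30.0), 1))
    rng.shuffle(rows)
    return rows


# (ii) Data transformation: min-max scale every feature to [0, 1] ...
def min_max_scale(rows):
    n_features = len(rows[0]) - 1
    lows = [min(r[i] for r in rows) for i in range(n_features)]
    highs = [max(r[i] for r in rows) for i in range(n_features)]
    scaled = []
    for r in rows:
        feats = [
            (r[i] - lows[i]) / (highs[i] - lows[i]) if highs[i] > lows[i] else 0.0
            for i in range(n_features)
        ]
        scaled.append((feats, r[-1]))
    return scaled


# ... and separate the samples into a training set and a test set.
def train_test_split(samples, test_fraction=0.3):
    cut = len(samples) - round(len(samples) * test_fraction)
    return samples[:cut], samples[cut:]


# (iii) Model training: for k-nearest neighbours, "training" only stores the data.
# A prediction is a majority vote among the k closest training samples.
def knn_predict(train, point, k):
    nearest = sorted(train, key=lambda s: math.dist(s[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# (iv) Model testing: fraction of test samples classified correctly.
def accuracy(train, test, k):
    correct = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return correct / len(test)


# (v) Model selection: keep the candidate k with the best test accuracy.
def select_best_k(train, test, candidates=(1, 3, 5)):
    scores = {k: accuracy(train, test, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    data = min_max_scale(make_toy_dataset())
    train_set, test_set = train_test_split(data)
    best_k, scores = select_best_k(train_set, test_set)
    print("accuracy per k:", scores)
    print("selected k:", best_k)
```

Scaling before the split follows the order given in stage ii. A stricter variant would compute the scaling bounds on the training set only, so that no information from the test set reaches the model.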

5. Application of Machine Learning Methods

The purpose of this section is to analyse the data and use it to train the prediction model. Figure 1 shows a sample of the selected features after the feature extraction method has been applied.

Figure 1. Sample of the selected features

After the features were extracted and selected, the algorithms were applied to the data obtained. The machine learning methods applied are K-Nearest Neighbors, Decision Tree, and Random Forest. The results of the models used are shown below.

• The training and testing accuracy for K-Nearest Neighbor is shown in Figure 2: the accuracy is 0.9667 on the training data and 0.9653 on the testing data.

Figure 2. Implementation of K-Nearest Neighbor Algorithm

• Figure 3 shows the training and testing accuracy for Decision Tree: the accuracy is 0.9793 on the training data and 0.9779 on the testing data.

Figure 3. Implementation of Decision Tree Algorithm

• The training and testing accuracy for Random Forest is shown in Figure 4: the accuracy is 0.9999 on the training data and 0.9999 on the testing data.

Figure 4. Implementation of Random Forest Algorithm

Figure 5. The confusion matrix for K-Nearest Neighbor

Figure 6. The confusion matrix for Decision Tree

Confusion Matrix: The confusion matrix for each of K-Nearest Neighbor, Decision Tree and Random Forest
is shown in Figure 5, Figure 6 and Figure 7 respectively.

Figure 7. The confusion matrix for Random Forest

7. Result of the Dataset Analysis

The overall accuracy of the algorithm is calculated below:

Table 1. Performance Evaluation Compared with other Researchers Results

Figure 8. Algorithms performance analysis and Graphical representation of the nslkdd.csv results

Table 1 shows the performance of our research compared with that of other researchers who used the same machine learning algorithms (RF, DT and KNN). This work performed better than the other two [13], [14].

Figure 8 shows that Random Forest performed better than the other two machine learning algorithms.

8. Discussion

Figure 1 shows the original features in the dataset; after the feature selection method was applied, the features were pruned to 20. Feature selection was used to remove redundant and irrelevant features in order to improve the accuracy of the prediction. The Python programming language was used for performing the feature selection and applying the machine learning methods. The process reduced the number of features to 20 from a total of 35. The method used in this research is the Filter method.

Data preprocessing (data encoding and checking for missing data) was used to prepare (clean and organize) the raw data to make it suitable for building and training the machine learning models. Data preprocessing is needed because it is the first step of the process. Typically, real-world data is incomplete, inconsistent, inaccurate (contains errors or outliers), and often lacks specific attribute values or trends. Preprocessing helps to clean, format, and organize the raw data, thereby making it ready for the machine learning models.

Figure 2 presents the result of the KNN implementation, showing the accuracy on the training and testing data. KNN is a non-parametric algorithm, meaning that it does not make any assumptions about the data structure. In real-world problems, data rarely obeys general theoretical assumptions, making non-parametric algorithms a good solution for such problems. The KNN model representation is as simple as the dataset: no learning is required, and the entire training set is stored.

Figures 3 and 4 show the results of implementing the Decision Tree and Random Forest algorithms on the dataset. Decision trees are data structures that have the structure of a tree. The training dataset is used to create the tree, which is subsequently used for making predictions on the test data. In this algorithm, the goal is to achieve the most accurate result with the least number of decisions. The training accuracy of DT was 0.979, while the testing accuracy was 0.977. Random Forests are collections of decision trees that produce a better prediction accuracy. That is why the method is called a ‘forest’: it is basically a set of decision trees. The basic idea is to grow multiple decision trees based on independent subsets of the dataset. At each node, n variables out of the feature set are selected randomly, and the best split on these variables is found. The training accuracy for RF was 0.99 and the testing accuracy was 0.997. Figures 5, 6 and 7 show the confusion matrices of the three machine learning algorithms deployed (KNN, DT, RF). Figure 8 is the graphical representation of the comparative performance of the three
algorithms, which shows that Random Forest is the best of the three.

9. Conclusion

Malware detection is a major issue if our computing systems and their infrastructure are to be kept secure in this modern age. A dataset from Kaggle.com was used for this research; the dataset, which comprises 100,000 rows and 35 columns of features, was reduced to 20 features by the feature selection method to improve the accuracy of the algorithms. The results showed that the Random Forest machine learning technique is the best classifier for our data, with 99.9% accuracy.

10. References

[1] Souri, A., and Hosseini, R. (2018). A state-of-the-art survey of malware detection approaches using data mining techniques.

[2] Singhal, P., and Nataasha, R. (2015). Malware detection module using machine learning algorithms to assist in centralized security in enterprise networks. International Journal of Network Security and Its Applications, 4(1), 61-67. https://arxiv.org/abs/1205.3062 (Access Date: 23 December 2021).

[3] Jehad Ali, Rehanullah Khan, Nasir Ahmad, and Imran Maqsood. (2012). Random forests and decision trees. International Journal of Computer Science Issues (IJCSI) 9, 5.

[4] Aliyev, V. (2010). Using honeypots to study skill level of attackers based on the exploited vulnerabilities in the network (Corpus ID 107947677) [Master’s thesis, Chalmers University of Technology]. https://www.semanticscholar.org/paper/usinghoneypots-to-study-skill-level-of-attackersAliyev/62cea777b89e3cc069744f5201a46d64bcafbe0. (Access Date: 21 December 2021).

[5] Shabtai, A., Kanonov, U., Elovici, Y., Glezer, C. and Weiss, Y. (2009). Andromaly: a behavioral malware detection framework for android devices.

[6] Chuvakin, A. (2003). An overview of unix rootkits. iDEFENCE inc.

[7] Lopez, W., Humberto, G., Enio, P., Erick, B., and Juan, S. (2013). Keyloggers. Florida International University.

[8] Savage, K., Peter, C., and Hon, L. (2015). The evolution of ransomware. Symantec corporation. http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/the-evolution-ofransomware.pdf (Access Date: 21 December 2021).

[9] Tabish, M., Zubair Shafiq, M., and Farooq, M. (2009). Malware detection using statistical analysis of byte-level file content. In Proceedings of the ACM SIGKDD Workshop on Cyber Security and Intelligence Informatics, pages 23–31. ACM.

[10] Baldangombo, U., Nyamjav J., and Shi-Jinn, H. (2013). A static malware detection system using data mining methods. International Journal of Artificial Intelligence and Application, 4(4), 113-126. DOI: 10.5121/IJAIA.2013.4411.

[11] Alazab, M., Sitalakshmi, V., Paul, W., and Moutaz, A. (2011). Zero-day malware detection based on supervised learning algorithms of API call signatures. In Proceedings of the Ninth Australasian Data Mining Conference, 121, 171-182. Australian Computer Society. DOI: 10.5555/2483628.2483648.

[12] Bazrafshan, Z., Hashemi, H., Fard, S.M.H. and Hamze, A. (2013). A survey on heuristic malware detection techniques. Conference: Information and Knowledge Technology (IKT).

[13] Jehad Ali, Rehanullah Khan, Nasir Ahmad, and Imran Maqsood. (2012). Random forests and decision trees. International Journal of Computer Science Issues (IJCSI) 9, 5(2012), 272.

[14] Sarang Na, S. and Kwon, T. (2013). A Rolling Image based Virtual Keyboard Resilient to Spyware on Smartphones. Journal of the Korea Institute of Information Security and Cryptology. 23(6):1219-1223.
