Project Report Final
SUBHASHREE BASU
Project Report
Submitted in partial fulfillment of the requirements for the degree of B.Tech in
Information Technology
Affiliated to
June, 2021
This is to certify that the work in preparing the project entitled "Detection of different viruses based on neural network algorithms" has been carried out by SOUMYA BANERJEE, SANJOY SEBAIT, DEEP SHIKHA under my guidance during the session 2020-21 and is accepted in partial fulfillment of the requirement for the degree of Bachelor of Technology.
Acknowledgement
We are submitting the project on malware detection by neural network algorithms under the guidance of Subhashree Basu, who supported us at every stage of the report and guided us with her excellent knowledge. We would also like to thank our project review teachers, who instructed us on what to add to our project and what improvements needed to be made.
Vision
To evolve itself into an industry oriented research based recognized hub of creative solutions in
various fields of engineering by establishing progressive teaching- learning process with an
ultimate objective of meeting technological challenges faced by the nation and the society.
Mission
• To incubate students to grow into industry-ready professionals, proficient research scholars
and enterprising entrepreneurs.
• To create a learner-centric environment that motivates the students in adopting emerging
technologies of the rapidly changing information society.
• To promote social, environmental and technological responsiveness among the members
of the faculty and students.
PEO1: Exhibit the skills and knowledge required to design, develop and implement IT solutions
for real life problems.
PROGRAM OUTCOMES (POs)
2. Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information
to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools including prediction and modeling to complex engineering activities with an understanding
of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms
of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and
design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
3 2 1 2 3 2 1 1 3 3 2 3
Justification :
In this project, we apply our engineering knowledge and design system components using modern IT tools, including prediction and modelling. Through this project, we understand professional engineering solutions in societal and environmental contexts. We demonstrate knowledge and understanding of engineering principles, communicating with the engineering community and with society at large, for example by writing comprehensible design documentation, making effective presentations and giving clear instructions. We also recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
PSO3 ( Software Engineering ) : Understand and analyze a big complex problem and
decompose it into relatively smaller and independent modules either algorithmically or in an
object oriented way choosing correct life cycle model and using effective test cases.
Project Mapping with Program Specific Outcomes
PSO1 PSO2 PSO3
2    2    3
Justification :
Understand and analyze a big complex problem and decompose it into smaller and independent
modules either algorithmically or in an object-oriented way choosing correct life cycle model
and using effective test cases.
Topic                              Page No
Introduction                       1
Chapter 1                          2-11
Chapter 2                          12-19
Chapter 3                          20-22
Chapter 4 (Sample Codes)           23-25
Chapter 5                          26-29
Chapter 6                          30
  6.2 Conclusion
Annexure
References / Bibliography          31
Introduction
Technology, with its constant upgrades and updates, is advancing day by day and has a great impact on our daily lives. Everyone is surrounded by technology that makes life much easier. But this is also where cybercriminals find many opportunities to sabotage new technology by spreading malware in various forms, such as through emails, links and documents. This happens because development teams in the IT industry mainly focus on the interface and design of the software instead of securing their networks or adopting methods that prevent malicious files from entering. Some companies, however, spend billions to have their software secured by security organizations. Cybersecurity researchers and developers are also doing excellent work establishing new security approaches to prevent cyber-attacks.
To explain the term, malware is nothing but a piece of software developed by a cyber attacker with the intent to damage a device, for example by stealing information or credentials. Malware comes in different types, such as viruses, worms, spyware, botnets, ransomware and trojans. Malware is created by attackers mainly to make money or gain profit by other means; they also profit by selling it on the dark web to the highest bidder. These are the harms of creating malware, but one use of malware is that companies also buy it to test the security of the software they develop. Malware, on the other hand, causes great damage to a company's reputation and heavy losses by revealing or stealing sensitive information. According to McAfee, a new type of malware is detected every four seconds. Even antivirus software is not always able to detect all types of malware; according to research, 69,477,489 unique malicious samples have gone undetected. Malware does not need to be created the same way each time; it can change its form, stay stealthy and go undetected.
To mitigate these kinds of problems, a better approach is the use of machine learning techniques and their different algorithms, which can considerably improve the malware detection rate. The software that will be created will run machine instructions written in Python and give us accurate results. In this project we download a dataset that will be trained to detect malware, and the use of the K-means clustering algorithm will be helpful to accurately cluster the malware accordingly.
Chapter 1
1.1 Problem Statement
Detection of different viruses based on neural network algorithms.
1.3 Objective
Our proposed approach automatically and simultaneously learns both discriminative features and
classifier for virus particle detection by machine learning, in contrast to existing methods that are
based on handcrafted features that yield many false positives and require several postprocessing
steps. For this project, we have a malware dataset that is derived using the open-source software
tools. PE header analysis tools were used to extract the required features from the malware files, which are then used in the further steps of the proposed system. The ten malware categories used in this research include virus, trojan, adware, backdoor, ransomware, etc. Since our method is based on
supervised learning that requires both the input dataset and their corresponding alert level, it is
basically used for detection of already-known viruses. However, the method is highly flexible, and the
convolutional networks can adapt themselves to any virus particles by learning automatically from an
annotated dataset.
Python is the only programming language required for this project till now.
In the paper by Santosh Joshi, Himanshu Upadhyay, Leonel Lagos, Naga Suryamitra Akkipeddi and Valerie Guerra, the authors use machine learning classifiers to detect the behaviour of malware. They use the Random Forest algorithm, a traditional way to detect and analyse the malware that causes cyber threats. The proposal classifies a given sample as benign or malware with high accuracy and low computational overhead. The work is carried out in a Linux environment, and the data is extracted using LibVMI, which is specially made for Virtual Machine Introspection. The authors use virtualization because it provides a better approach: a virtual machine is created and the malware is run inside it to analyse its behaviour, so that the executed malware cannot affect the host machine. A Xen hypervisor-based virtualization platform is used to host the system; the data is captured by virtual machine introspection using the LibVMI library and then everything is transferred into a database. The dataset that was used to build the model
consisted of a mixture of malicious malware and benign data extracted from the process-list data of the Linux VM.
For training, they split the data into two parts: 70% of the data was used for training and 30% for testing; along with this they used a confusion matrix for the evaluation and plotted the accuracy. The authors store the data in a database server on the virtualization platform and then execute the Random Forest algorithm on the machine learning server. The model that has been created is stored in the database for prediction. To improve the accuracy of the model, the authors make minor changes to the random forest algorithm, such as changing the number of trees and the number of variables tried at each split. According to the testing and training, the authors achieve an accuracy of 90.9%. Finally, the authors conclude that machine learning techniques are very promising for detecting malware in this technological world; various models have been proposed, but without testing and measuring accuracy no one can decide which one to employ. [1]
In contrast to the above-proposed method, Ihab Shhadat, Bara Bataineh, Amena Hayajneh, and Ziad
A. AL-Sharif have proposed the detection of unknown malware using machine learning to advance the
detection and classification. The authors state that the purpose of this research is to detect unknown malware using machine learning techniques; they apply different types of algorithms to the dataset to check which gives better accuracy with binary and multi-class classifiers. The authors mention that they achieve this by using the Random Forest classifier for feature selection and by cross-validating the data. The data that was collected had 1,156 files in total, out of which 984 were malicious files and 172 benign files, and the data was collected in different formats such as .exe, .pdf and .docx. The authors consider various main malware families in their study, such as Dridex, Locky, Zeus, TeslaCrypt, etc. They estimate the performance based on accuracy, precision, recall and the F1-score, which are calculated with the help of the confusion matrix. In this research paper, feature extraction has been conducted using a heuristic strategy,
which is a dynamic analysis in which malicious files are executed in a virtual environment and then
information is gathered based on the malware characteristics. Here the authors keep only the information that is needed, which gives a better accuracy rate and better performance, and reject the rest of the information, which is not needed because it can harm the accuracy rate and performance. The Random Forest algorithm is used because it often identifies the important features accurately; it is a tree-based technique that ranks features by the impurity of the nodes. Cross-validation is done because evaluation is very important in machine learning: k-fold cross-validation divides the initial set into k subsets of similar size and then performs the training. According to the research paper, the important features were identified with the help of Random Forest and the cross-validation was done using k-fold cross-validation. The authors applied different types of models provided by sklearn. [2]
The highest accuracy was achieved by the Decision Tree (98%), followed by Random Forest (97.8%), Hard Voting (97%), KNN (96.1%), SVM (96.1%) and LR (95%). The lowest accuracy was achieved by Bernoulli NB, which is 91%. The authors conclude that, among the machine learning techniques studied, the decision tree achieved 98.2% accuracy, which outperforms the rest of the algorithms.
In this paper, the authors Deepak Gupta and Rinkle Rani have proposed a method for improving malware detection using big data and ensemble learning. The authors propose two methods to improve the detection of malware: the first one is a weighted voting strategy of ensemble learning
and the second one is choosing an optimal set of base classifiers to stack. The two methods that are
mentioned above are based on ensemble learning and big data which improves malware detection at
a very large scale. It is mentioned that the malware and the benign files are collected from the public
repositories and clean windows machine, to perform static and dynamic analysis author has set up an
automated environment to generate reports which are used as a raw dataset. These data are stored
in a distributed storage system and processed using Apache spark to extract the features. The dataset
for the proposed method consists of 100,200 malware files and 98,150 benign files [3]. The malware files were collected from sources like VXHeaven, VirusShare, etc., and the benign files were collected from clean Windows XP, 7 and 10 installation directories. The authors also state that the dataset is nearly balanced, because the difference between the numbers of malicious and benign files is small, so they are almost equal, which allows better accuracy and detection to be achieved. An automated environment is used to perform the static and dynamic analysis; in this setup the host machine runs Ubuntu as the OS and Windows runs as a guest machine created in Oracle VM VirtualBox. The binary files are executed in the virtual environment, generating a full analysis report in JSON format. The features for detection were decided based on the reports of the dynamic and static analysis, and the extracted features are: file metadata, file size, packer detection, sections information, dynamically linked libraries, dropped files, Windows API calls, mutex operations, file activities, registry activities, network activities and process activities [3].
These are the features based on which further detection will take place. The algorithms on which the
dataset has been applied in this paper are Naïve Bayes, K-nearest neighbor, Decision table, SVM, and
Random Forest. After applying the dataset in every algorithm the author says that Random Forest has
given the highest accuracy which is 98.1% followed by the Decision table algorithm which is 94.9%.
The author concludes that the proposed system that is the weighted voting technique gives the
highest accuracy of 99.5%; hence the proposed system enhances the detection of malware.
In contrast to the above paper, the authors Priyank Singhal and Natasha Raul have proposed a malware detection module using machine learning algorithms to assist centralized security in enterprise networks. They propose a new and more complex antivirus engine that can scan harmful files; this is done by extracting the API calls made by different types of normal and malicious executables. Machine learning is used to improve classification and also to rank the files based on their security risk. This system is heavy on the processor, but it is very effective and can be used in enterprise networks to detect threats. In this research paper, the virus-detection solution works at the firewall level of the enterprise network. For detection, the authors extract numerous infected and normal PE executables using the Import Address Table (IAT) and then store the extracted data in a data mine. The information gain is derived for each function.
For the further implementation, Random Forest is used as a classifier to detect the malware; the random forest's output prediction is the most frequent class output of the individual trees. The authors state that, to check whether the proposed model provides results, they extracted over 5,000 executable files consisting of both normal and infected samples, and by using the information gain algorithm they chose only the 80% of functions that are more likely to be harmful. The results in the paper show the accuracy achieved using different algorithms: Decision Tree 90%, Naïve Bayes 95%, Random Forest 97%, and the proposed method 98%. The authors conclude that the proposed model can detect malware based on
advanced data mining and machine learning techniques. However, this model cannot be used by home users; it can only be used at an enterprise level.
In this research paper, the authors J. Zico Kolter and Marcus A. Maloof have proposed a method to detect and classify malicious executables in the wild. The authors collected 1,971 benign files and 1,651 malicious executables, and each of them was encoded as a training example using n-grams as features. After selecting the features, the authors applied various algorithms for evaluation, such as Naïve Bayes, decision trees, support vector machines and boosting [4].
As stated in the paper, boosted decision trees outperformed the other algorithms with an area under the ROC curve of 0.996. The authors observe that their methodology can scale to a larger number of executables and also that their method can classify executables based on their payload. As mentioned above, the data consisted of 1,971 benign files and 1,651 malicious files collected in the Windows PE format, and the benign files were collected from the folders of Windows XP and Windows 2000 machines. A hex dump was used to convert the executables into hexadecimal codes in ASCII format so that the machine can process them. [5]
Then the respective n-grams were created by combining each four-byte sequence into a single term; for example, if a sequence is ff 00 ab 3e 12 b3, then the corresponding n-grams will be ff00ab3e, 00ab3e12 and ab3e12b3. The experimental results show a false positive rate of 0.05 and that the boosted decision tree detects 98% of malicious files, with 6 of the 291 malicious files going undetected; the authors note that if those 6 are a major issue, then by accepting a false positive rate of 0.1 the method can detect with perfect accuracy. The authors conclude that, after all the evaluation, boosted J48 outperforms the other algorithms and is the best detector because of its area under the ROC curve of 0.996.
In this paper, the authors Yanfang Ye, Dingding Wang and Dongyi Ye have proposed a method called an Intelligent Malware Detection System (IMDS). By analysing the Windows API execution sequences called by PE files, they implement an intelligent malware detection system using objective-oriented association mining-based classification. An experimental study with a large collection of PE files obtained from the anti-virus laboratory of Kingsoft Corporation is used to compare various malware detection approaches. The authors state that the IMDS system outperforms popular antivirus software such as Norton and McAfee VirusScan, as well as other commonly used algorithms. The results show that IMDS achieves the highest accuracy, which is 93.07%, and outperforms other algorithms such as Naïve Bayes, SVM and J4.8. The authors conclude that they implemented the model successfully with a large collection of 12,212 benign samples and 17,366 malicious files.
In relation to the other papers, the authors C.P. Patidar and Harshita Khandelwal have proposed a method to detect zero-day attacks using machine learning techniques. The authors divide this into four phases, namely the malware datasets, analysing the MDS, the correlation algorithm, and the detection methods for malware [6].
First, samples of malware are collected, forming the datasets; then a correlation algorithm is used to find relationships between the malware samples so that future behaviour can be predicted. The final phase applies the detection model, which detects the malware and gives the appropriate output.
From [7] we came across a detailed description of the algorithm, how we can apply the algorithm to our dataset, and what its features are; it also helped us understand the behaviour of the malware time frame by time frame.
Reference [8] helped us implement our project and follow the step-by-step procedure of the program, and to learn more about the support vector machine. From [9] we went through the random forest algorithm and its different kinds of implementation, and we understood decision trees, the actual value for a data point and the final prediction results very well. From [10] we understood how to choose the number of neighbours, count the number of data points in each category, calculate the distance and then assign the new data point to the category for which the number of neighbours is maximum.
With so many sophisticated malware samples appearing, plenty of research has concentrated on proposing miscellaneous malware detection methods to mitigate the rapid growth of malware. Malware detection can be divided into two main methods: static malware detection and dynamic malware detection [11].
Static malware detection also refers to signature-based malware detection, which examines the content of a malicious binary without actually executing malware samples. Signature-based malware detection is able to obtain the full execution path. However, it can be easily evaded by obfuscation techniques. In addition, signature-based malware detection requires prior knowledge of malware samples. [12]
Dynamic malware detection analyses sample behaviours during execution and is generally called behaviour-based malware detection. Behaviour-based malware detection methods include virtual machine and function call monitoring, information flow tracking, and dynamic binary instrumentation. The Windows Application Programming Interface (API) call graph-based method has been considered a good prospect in behaviour-based malware detection for a long time [13].
Concretely, the detection process can be divided into two stages. The first stage preprocesses the malware sample data: it takes the binary form of a Windows executable file, generates a grayscale image, and extracts the opcode sequence and metadata features with a decompilation tool.
This stage therefore generates the appropriate data format as the input of the follow-up algorithms. The second stage applies the core process of the KNN and SVM models, respectively learning from the grayscale image and the opcode sequence. To optimize the detection performance, we use a stacking ensemble to integrate the two models' outputs and the metadata features and get the final prediction
result. Many algorithms are used in this field; we use three algorithms for this project. They are:
1) Random Forest
2) KNN classification
3) SVM
“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used
for both classification or regression challenges. However, it is mostly used in classification problems.
In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is number of
features you have) with the value of each feature being the value of a particular coordinate. Then, we
perform classification by finding the hyper-plane that differentiates the two classes very well.
The training of traditional SVM method requires the solution of quadratic programming, and consumes
high memory and has low speed for large data training. Incremental learning is one of the meaningful
methods to continuously update the data for learning; it keeps the previous learning results, re-learning only on the additional data, so as to form a continuous learning process. This paper will study
the support vector machine based on incremental learning method and its application in the malware
detection. The experiments carried out in the Internet Security Laboratory at Kingsoft Corporation
suggested that, for large number of virus samples, our method can rapidly and effectively update the
sample features, which avoids duplication of learning history samples and ensures the malware
prediction ability for the detection model.
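As an illustration of this idea, the following minimal sketch (not the Kingsoft experiment itself) uses scikit-learn's SGDClassifier with a hinge loss as a stand-in for an incrementally trained linear SVM: partial_fit keeps the previously learned weights and re-learns only on the additional data. All data, shapes and parameters here are placeholders, not the report's dataset.

# Illustrative sketch only: incremental (batch-wise) learning with a linear SVM
# objective, using SGDClassifier(loss="hinge") and partial_fit.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_old = rng.normal(size=(1000, 20))          # previously seen samples (placeholder)
y_old = rng.integers(0, 2, size=1000)        # 1 = malware, 0 = benign (placeholder)
X_new = rng.normal(size=(200, 20))           # newly collected samples (placeholder)
y_new = rng.integers(0, 2, size=200)

svm = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)
svm.partial_fit(X_old, y_old, classes=np.array([0, 1]))   # initial training

# Incremental update: re-learn only on the additional data, starting from the
# previously learned weights instead of re-training on the full history.
svm.partial_fit(X_new, y_new)
print("accuracy on the new batch:", svm.score(X_new, y_new))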
Random forest
The first part is the collection of datasets. A large collection of malware apps as well as benign apps has been collected to form the dataset for training the model using a machine learning algorithm. Both the malicious-app and benign-app datasets are merged to form one large dataset.
Training of Dataset: The generated dataset is used for training the model with the given parameters using the random forest algorithm. Data preprocessing is an important part of this process, where the data must be organized. Also, the relation between these datasets must be found in order to extract features.
Extracting of Features: Feature extraction is important for prediction, and features are selected meticulously in order to achieve faster computation and low memory consumption. For the collected malicious and benign apps, the features are taken from the source code of decompiled files. The installation package of an Android app is basically the .apk file, which can be decompiled using Apktool; this recovers the main files, organizing the source code in a particular folder structure.
Deployment of Random Forest Algorithm: After obtaining the extracted features, the random forest algorithm is used for classification. During the training process, a set of labels is used to determine the type of each app: 1 denotes malware and 0 denotes a benign app. The construction of a random forest consists of a collection of decision trees, where the number of decision trees is set manually. Every decision tree is developed in a top-down approach, starting from the root.
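A minimal sketch of this deployment step is shown below, assuming the extracted features have already been turned into a numeric matrix; the data here is synthetic and only illustrates the labelling convention (1 = malware, 0 = benign) and the manually chosen number of trees.

# Sketch of the random-forest deployment step described above (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.random((500, 30))                      # placeholder feature matrix
y = rng.integers(0, 2, size=500)               # placeholder labels: 1 = malware, 0 = benign

rf = RandomForestClassifier(n_estimators=100)  # number of trees set manually
rf.fit(X, y)
print(rf.predict(X[:5]))                       # predicted labels for a few samples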
K Nearest Neighbor
It is a popular class of supervised machine learning techniques. We can detect malware using KNN and
feature analysis to identify changes within malware families over time. K-Nearest Neighbor is one of
the simplest Machine Learning algorithms based on Supervised Learning technique.
K-NN algorithm assumes the similarity between the new case/data and available cases and put the new
case into the category that is most similar to the available categories. This algorithm stores all the
available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
This algorithm can be used for regression as well as for classification, but it is mostly used for classification problems. Suppose there are two categories, i.e. Category A and Category B, and we have a new data point x1: in which of these categories will this data point lie? To solve this type of problem, we need the K-NN algorithm.
Chapter 2
Figure 4: The schematic diagram of SVM-based malware detection
In this section, we formally describe our design of the SVM-based approach. Before the description, we first define the notation.
Notation
There are mainly three kinds of set, i.e. training set Tr, testing set Te and recommended set R. Each set
is composed of two parts, i.e. a feature property set and a class property set. The detailed notation is listed in Table 1.
Table 1: Notation
Symbol   Meaning
Tr       Training set
Te       Testing set
L        Labeled sample
U        Unlabeled sample
R        Recommended sample
X        Feature property
Y        Class property
vd       Vector dimension
In order to get two mutually independent sub-feature set of malware, we firstly preprocess labeled
dataset L with ICA and get dataset Tr, and then split Tr into two sets Tr1 and Tr2. With the help of ICA,
the independence of Tr1 and Tr2 is guaranteed. We also handle unlabeled dataset U with ICA and get
testing set Te, and obtain Te1 and Te2 correspondingly;
We train two classifiers, i.e. classifier C1 and classifier C2 by training with Tr1 and Tr2. Although the
training set is the same one, the two classifiers are absolutely unrelated as the two sub-feature sets Tr1
and Tr2 are totally independent. Then we test the testing set Te by using classifier C1 and classifier C2
and obtain two different results Ye1 and Ye2;
The next step concerns how to form a new training set from recommended samples. Recommended samples are selected depending on the distance between a sample and the hyper-plane. Firstly, we sort the results Ye1 and Ye2 based on this distance and then select the top k to form the recommended datasets R1 and R2 respectively. Finally, we combine R1 with L2 and combine R2 with L1 to form the new training sets.
We then train two classifiers with the new training sets and obtain classifier C1’ and classifier C2’. Then we make a second test on the testing dataset Te with the two new classifiers and obtain the ultimate results Ye1’ and Ye2’, from which we compute the precision, recall rate, F-measure and accuracy rate at the end.
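The rough sketch below illustrates this two-view procedure under several assumptions: FastICA stands in for the ICA preprocessing, the absolute value of SVC's decision_function is used as the distance to the hyper-plane, and the other classifier's predictions supply labels for the recommended samples. Sizes, names and data are illustrative only and do not reproduce the report's actual pipeline.

# Sketch: ICA preprocessing, split into two independent sub-feature sets,
# two SVM classifiers, and retraining after adding the top-k most confident
# samples (those farthest from each hyper-plane) to the other view.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(300, 20)); y_lab = rng.integers(0, 2, 300)   # labelled set L
X_unl = rng.normal(size=(200, 20))                                    # unlabelled set U
k = 20

ica = FastICA(n_components=10, random_state=0).fit(X_lab)
Tr, Te = ica.transform(X_lab), ica.transform(X_unl)
Tr1, Tr2 = Tr[:, :5], Tr[:, 5:]          # two independent sub-feature views
Te1, Te2 = Te[:, :5], Te[:, 5:]

C1 = SVC(kernel="linear").fit(Tr1, y_lab)
C2 = SVC(kernel="linear").fit(Tr2, y_lab)

# Distance from the hyper-plane; top-k most confident samples form R1 and R2,
# each added (with the predicting classifier's labels) to the other view.
d1 = np.abs(C1.decision_function(Te1)); d2 = np.abs(C2.decision_function(Te2))
R1, R2 = np.argsort(-d1)[:k], np.argsort(-d2)[:k]

Tr1_new = np.vstack([Tr1, Te1[R2]]); y1_new = np.concatenate([y_lab, C2.predict(Te2[R2])])
Tr2_new = np.vstack([Tr2, Te2[R1]]); y2_new = np.concatenate([y_lab, C1.predict(Te1[R1])])

C1p = SVC(kernel="linear").fit(Tr1_new, y1_new)   # classifier C1'
C2p = SVC(kernel="linear").fit(Tr2_new, y2_new)   # classifier C2'
Ye1p, Ye2p = C1p.predict(Te1), C2p.predict(Te2)   # final predictions on Te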
X2 Statistic
The X2 statistic was initially proposed in (Pearson, 1900) and is used to compare
an observed frequency distribution to an expected frequency distribution
(Plackett, 1983). In mathematical terms, the X² statistic is a normalized sum of squared deviations between the observed and expected frequency distributions. This statistic is calculated as

X² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed frequency and Eᵢ is the expected frequency of the i-th outcome.

Figure 6: X² statistic
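As an aside, the sketch below shows one common way the X² score is used in practice for feature selection, via scikit-learn's SelectKBest with the chi2 scoring function; it assumes a non-negative feature matrix (e.g. counts), uses synthetic data, and is not necessarily the exact procedure followed in this report.

# Sketch: ranking features with the X² statistic using scikit-learn.
# chi2 requires non-negative features (e.g. counts); the data here is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(400, 15))     # e.g. counts of API calls / opcodes
y = rng.integers(0, 2, size=400)            # 1 = malware, 0 = benign

selector = SelectKBest(score_func=chi2, k=5).fit(X, y)
print("X² scores:", np.round(selector.scores_, 2))
print("selected feature indices:", selector.get_support(indices=True))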
Step1 − First, start with the selection of random samples from a given dataset.
Step2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the
prediction result from every decision tree.
Step3 − In this step, voting will be performed for every predicted result.
Step4 − At last, select the most voted prediction result as the final prediction result.
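The small sketch below illustrates Steps 2-4 using the individual trees inside a fitted RandomForestClassifier; the data is synthetic, and note that scikit-learn itself averages class probabilities rather than taking hard votes, so its prediction can occasionally differ from the simple majority vote shown here.

# Sketch of Steps 2-4: each tree predicts, and the most voted class wins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.random((200, 10)); y = rng.integers(0, 2, 200)        # synthetic data
rf = RandomForestClassifier(n_estimators=5, random_state=7).fit(X, y)

sample = X[:1]
votes = np.array([tree.predict(sample)[0] for tree in rf.estimators_])
print("individual tree votes:", votes)
print("majority vote:", np.bincount(votes.astype(int)).argmax())
print("forest prediction:", rf.predict(sample)[0])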
Random forest overcomes the problem of overfitting by averaging or combining the results of different decision trees, and it works well for a larger range of data items than a single decision tree does.
The construction of a random forest is harder and more time-consuming than that of a decision tree.
The Random Forest algorithm is used in banking, medicine, the stock market and e-commerce.
Working process
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
Firstly, we will choose the number of neighbours, so we will choose k = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as: Euclidean distance between A1 and B2 = √((x2 − x1)² + (y2 − y1)²).
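A worked sketch of these steps with k = 5 is shown below; the points and categories are made up purely for illustration.

# Worked sketch of the KNN steps with k = 5 (synthetic points, two categories).
import numpy as np

X_train = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7],
                    [8, 6], [2, 1], [7, 5], [3, 2], [6, 6]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])     # Category A = 0, Category B = 1
x1 = np.array([5.0, 4.0])                               # new data point
k = 5

dist = np.sqrt(((X_train - x1) ** 2).sum(axis=1))       # Euclidean distances
nearest = np.argsort(dist)[:k]                          # k nearest neighbours
counts = np.bincount(y_train[nearest])                  # votes per category
print("assigned category:", counts.argmax())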
SVM
• The main reason to use an SVM instead is because the problem might not be linearly
separable. In that case, we will have to use an SVM with a non linear kernel (e.g. RBF).
• Another related reason to use SVMs is if you are in a highly dimensional space. For
example, SVMs have been reported to work better for text classification.
But it requires a lot of time for training. So, it is not recommended when we have a large number of
training examples.
kNN
• It is robust to noisy training data and is effective in case of large number of training
examples.
But for this algorithm, we have to determine the value of the parameter K (number of nearest neighbours) and the type of distance to be used. The computation time is also high, as we need to compute the distance from each query instance to all training samples.
Random Forest
• Random Forest is nothing more than a bunch of Decision Trees combined. They can
handle categorical features very well.
• This algorithm can handle high dimensional spaces as well as large number of training
examples.
Random Forests can almost work out of the box and that is one reason why they are very popular.
Chapter 3
Design and Methodology
Collection of Dataset
The first part is the collection of datasets. According to the research conducted, there are many open-source datasets available on different websites such as Kaggle, VirusTotal, VirusShare, etc., but when compared to each other, the datasets available on Kaggle are much better and the data is balanced.
The dataset was downloaded from the Kaggle website. The file is in .csv format, so we could load it in Excel and do the further operations. The loaded dataset had columns such as the name of the malware, AlertLevel, AvSigVersion, Type, etc., and it contained almost everything on the basis of which the detection will be performed. We needed only some of the columns, so preprocessing of the data and labelling had to be done before we proceeded with the further steps of detection.
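A possible sketch of this loading and labelling step is shown below; the file name, the labelling rule and the exact column names (taken from the description above) are assumptions and may differ from the real Kaggle dataset.

# Sketch of the data-collection step: load the CSV and keep only the columns
# needed for detection. The path and column names are assumptions.
import pandas as pd

df = pd.read_csv("malware_dataset.csv")                      # hypothetical file name
cols = ["Name", "AlertLevel", "AvSigVersion", "Type"]        # columns described above
df = df[cols].dropna()                                       # basic cleaning
df["label"] = (df["Type"] != "benign").astype(int)           # illustrative labelling rule
print(df.head())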
●Training of Dataset
The generated dataset is used for training the model with the given parameters using the Support Vector Machine, random forest and KNN algorithms. Data preprocessing is an important part of this process, where the data must be organized. Also, the relation between these datasets must be found in order to extract features. Data cleaning is also a major part, to find out data with missing features and values.
●Extracting of Feature
Feature extraction is important for prediction, and features are selected meticulously in order to achieve faster computation and low memory consumption. With the collected malicious dataset, the features are taken from the source code of decompiled files, which recovers the main files and organizes the source code in a particular folder structure. Then, from the methods implemented in the source of the malicious dataset, the numbers of permission-related APIs are extracted as the features to train the classifier.
●Implementation
In the implementation step, the dataset that we created was fed into each machine learning algorithm/model. To achieve the best results and accuracy we performed hyperparameter tuning, which means changing small parameters of the model to achieve better accuracy and performance. Moving on to applying the models, we split the dataset into two parts: 70% of the data is allotted to training and 30% to test data. This is done so that we can test the performance of the model instead of using all the data for training. We did the split by applying train_test_split, which is imported from the sklearn.model_selection module in Python. Each model, i.e. the KNN model, the SVM model, the Random Forest model, and the Random Forest with decision tree model, is explained in detail below.
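A minimal sketch of the 70/30 split is given below, assuming X and y are the preprocessed feature matrix and labels from the previous steps (placeholder data is used here).

# 70/30 train/test split as described above; X and y are placeholders for the
# preprocessed feature matrix and labels.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 12)          # placeholder features
y = np.random.randint(0, 2, 1000)     # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)    # 70% train, 30% test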
Support Vector Machines are perhaps one of the most popular and talked about machine learning
algorithms.
In this section we cover: the various names given to support vector machines; the representation an SVM uses when the model is stored in memory; how to produce predictions for new data using a trained SVM model; how to train an SVM model from training data; and how to prepare data for the SVM algorithm.
SVM differs from earlier classification methods in that it selects the decision boundary that maximizes the distance to the nearest data points of all classes. An SVM not only discovers a decision boundary, it identifies the best decision boundary: the most optimal decision boundary is the one with the greatest margin between the nearest points of all the classes.
A decision tree is an understandable concept that is used to generate a random forest. A decision tree
can be thought of as a sequence of yes/no questions regarding our data that eventually leads to a
projected class (or continuous value in the case of regression). This model is explainable because it
generates classifications in the same way that we do: we ask questions about the data we have. Our
data consists of only two features (predictor variables), x1 and x2, and six data points (samples)
separated into two labels. Although the task is straightforward, it is not linearly separable, which
means we can't classify the points by drawing a single straight line through the data. However, we can
draw a sequence of straight lines to divide the data points into boxes, which we'll refer to as
nodes. During training we give the model both the features and the labels so it can learn to classify
points based on the features.
The KNN method is one of the most well-known machine learning algorithms, and it's a must-have in
any machine learning toolkit. Python is the most popular programming language for machine learning,
so what better way to learn about KNN than with NumPy and scikit-learn, two of Python's most well-
known packages. The first determining property of machine learning algorithms is the split
between supervised and unsupervised models. The difference between supervised and unsupervised
models is the problem statement.
In supervised models, you have two types of variables at the same time:
1. A target variable, which is also called the dependent variable or the y variable.
2. Independent variables, which are also known as x variables or explanatory variables.
The target variable is the variable that you want to predict. It depends on the independent variables
and it isn’t something that you know ahead of time. The independent variables are variables that you
do know ahead of time. You can plug them into an equation to predict the target variable. In this way,
it’s relatively similar to the y = ax + b case.
Chapter 4
Sample Code:
Support Vector Machine (SVM) model
The Support Vector Machine helps to find the hyperplane in an N-dimensional space that classifies the data points. Our main objective is to find the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of each class. We then apply the dataset to the model, as sketched below.
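Since the report's own code appears only in figures, the following is a hedged stand-in showing the general shape of this SVM step with synthetic data; the real pipeline uses the preprocessed Kaggle dataset and its own parameters.

# Stand-in for the SVM step (the report's own code is in a figure). Synthetic data.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((700, 12)), rng.integers(0, 2, 700)
X_test,  y_test  = rng.random((300, 12)), rng.integers(0, 2, 300)

svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))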
From the SVM model we get an accuracy of around 51%, which is not satisfactory, so we move on to another approach.
The random forest algorithm then prepares a number of trees, resembling a forest. Each tree is different, as for each split in a tree the variables are chosen randomly. The remaining data, apart from the training set, is used for evaluating the trees in the forest; the tree that makes the best classification of data points, i.e. the one with the most predictive power, is given as output. At each node of a decision tree, the training set is split into two subsets with different labels by minimizing the uncertainty of the class labels.
A random forest has a group of trees in which each tree classifies and gives results based on a few attributes. Trees which have the maximum number of nodes are chosen. The model was implemented on the dataset with a few parameters: the maximum depth, the number of estimators, and the split criterion, as sketched below.
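Again as a stand-in for the figure, the sketch below shows the general shape of this step, exposing the parameters mentioned above (number of estimators, maximum depth and split criterion); the data is synthetic.

# Stand-in for the random-forest step, with the parameters mentioned above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X_train, y_train = rng.random((700, 12)), rng.integers(0, 2, 700)
X_test,  y_test  = rng.random((300, 12)), rng.integers(0, 2, 300)

rf = RandomForestClassifier(n_estimators=200, max_depth=10, criterion="gini")
rf.fit(X_train, y_train)
print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))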
Random forest with a decision tree: Random forest combined with a decision tree should give more precision and accuracy, but according to the evaluation it did not, due to the imbalance in the data. The dataset was applied to the model with some parameters, considering the tree size, the number of estimators, and more.
Here we can see that the accuracy we get is better compared to the Support Vector Machine and Random Forest algorithms.
Chapter 5
Testing, Result, Discussion On Results
Evaluation:
The figure below shows the accuracy rate and the confusion matrix of the SVM model.
The next figure shows the code and accuracy rate of the Random Forest model.
The error-rate graph of the KNN model is decreasing, which represents a positive result.
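A sketch of how such an error-rate curve is typically produced is shown below: fit a KNN model for each value of k and record the test error; the data here is synthetic.

# Sketch: KNN error rate versus k, plotted with matplotlib (synthetic data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X, y = rng.random((1000, 12)), rng.integers(0, 2, 1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

errors = []
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors.append(np.mean(knn.predict(X_te) != y_te))

plt.plot(range(1, 31), errors, marker="o")
plt.xlabel("k (number of neighbours)")
plt.ylabel("error rate")
plt.title("KNN error rate vs. k")
plt.show()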
Now we can see that the accuracy has become better than the previous one we got.
The evaluation in machine learning is done by calculating metrics with the help of a confusion matrix, which is available in the sklearn.metrics package. The graphs and ROC curves are plotted with the help of the confusion matrix for the KNN model and the other models that are compared.
When we run the models, it is an important step to calculate the metrics with the help of the confusion matrix. This is a table or a graph that represents the Accuracy, Precision, F1-score and Recall of the model that we implemented. The figure below shows what a confusion matrix looks like.
The values in the boxes, which are True Positive, False Positive, False Negative and True Negative, are explained in detail below.
1. True Positive (TP): A true positive means the model or algorithm correctly predicts what it has been asked to predict. E.g., in our implementation, the model is asked to detect malware, so the number of applications correctly detected as malware is shown here.
2. False Positive (FP): A false positive means the model incorrectly predicts the positive class, i.e. a benign sample is flagged as malware.
3. False Negative (FN): Here the model incorrectly predicts the negative class, i.e. an actual malware sample is missed and labelled as benign.
4. True Negative (TN): In this, the model correctly predicts or detects the negatives.
The parameters that are considered in the confusion matrix such as Accuracy, precision, recall, and
F1- score are explained below and also the equation that is used to calculate those.
1. Precision: A metric that quantifies the number of correct positive detections made. It captures the accuracy on the minority class and is calculated as the correctly detected positives divided by the total number of positives detected.
TP / (TP + FP) …………………. (1)
28
Department of Information Technology
St. Thomas’ College of Engineering and Technology
2. Accuracy: Accuracy measures how many of the model's predictions are correct overall. It is calculated by adding TP and TN and dividing by all the positives and negatives.
(TP + TN) / (TP + TN + FP + FN) …………………. (2)
3. F1-Score: The F1-score is a metric in which both precision and recall are taken into account. The equation to calculate it is as follows:
2 × (Precision × Recall) / (Precision + Recall) …………………. (3)
4. Recall: Recall is the sensitivity, or the true positive rate, for which the equation is given below.
TP / (TP + FN) …………………. (4)
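A short sketch of this evaluation step, computing the confusion matrix and the four metrics with sklearn.metrics, is given below; y_test and y_pred are placeholders for the real test labels and model predictions.

# Sketch of the evaluation step: confusion matrix plus accuracy, precision,
# recall and F1-score via sklearn.metrics (placeholder labels and predictions).
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_test = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # placeholder ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # placeholder model output

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))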
Discussion on Result
The sort of machine learning found in a lot of anti-malware software tries to learn which files are malicious and which are benign based on databases of known malicious files. The AI involved tries to make decisions about whether or not analysed code is harmful based on a series of traits, and some traits may rank higher than others. So code that is determined to be benign might still have some traits that the software considers a possible indication of malware. Malware is evolving rapidly, so the algorithms must evolve rapidly as well; it is a constant, ongoing process.
After implementation we achieved 51% accuracy with SVM, 87% with Random Forest and 98% with the KNN algorithm. From the evaluation and the results achieved by the models considered, we can say that the KNN algorithm has outperformed the other algorithms with a greater percentage of accuracy. Our model can detect a large number of malware samples and classify them accordingly; the KNN model has given us a good accuracy of 98%.
We use perfmon log data from all machines in an experiment, apart from the data from the machine for which we want to determine whether it has a virus or not. We average the counter values from each of these machines for a specific instant in time (after the gradient had first been applied), and perform this process for all instances. By averaging the output from many machines we attempt to create a representation of normality, minimizing the effect of the virus on those machines which have been infected. We then use these averaged instances as the training set for the novelty classifier, and the instances from the machine we are examining are used as the test set.
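A rough sketch of this averaging scheme is given below; OneClassSVM is used here only as an example novelty detector, and the array shapes and data are illustrative.

# Sketch: average perfmon counter values across the other machines for each
# time instance, train a novelty detector on those averages, and score the
# machine under test. All data and shapes are illustrative placeholders.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
# counters[machine, time_instance, counter]
others = rng.normal(size=(9, 100, 15))       # logs from the other machines
target = rng.normal(size=(100, 15))          # logs from the machine under test

train_set = others.mean(axis=0)              # per-instance average = "normality"
novelty = OneClassSVM(nu=0.05, gamma="scale").fit(train_set)

scores = novelty.predict(target)             # +1 = normal, -1 = anomalous
print("suspicious instances:", int((scores == -1).sum()), "of", len(scores))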
Chapter 6
6.1 Scope for future improvement
We were not able to implement this as a tool that scans individual files, but we have certainly detected the malware. Although we achieved good results with our proposed models, our priority was to
implement different machine learning models to compare, rather than spending too much time
tweaking to achieve marginal gains. Machine learning has a lot of room for experimenting and tuning.
It is very possible that even better results could be achieved by tweaking the proposed solutions.
Windows is not the only operating system that is targeted by malware. A possible future improvement
could be creating a solution that works with multiple executable formats. Microsoft’s PE format is
relatively similar to the ELF format used by Linux executables, and a uniform machine learning model
should therefore not be impossible to achieve. Another thing to keep in mind is that as malware detection gets cleverer, so does the malware. It is a perpetual battle where the malware keeps changing; as such, a solution that works well today might not work well in the future.
In the future, we will work on obtaining the opcode and analyzing the behavior of malware types and
try to detect the malware in real-time.
6.2 Conclusion :
Malware detection is necessary in order to detect zero-day attacks, which open a platform for various malware to perform an attack or to overload the system. It plays a vital role in the field of security, where many attacks have been generated. The rapid growth of malware directs most researchers to implement new approaches to defeat the attacks and develop countermeasures; on this ground we used machine learning.
Both static and dynamic analysis yield very high accuracies in the reviewed literature, although there
are some suggestions that static approaches can struggle on newer types of malware. Certain difficulties for dynamic methods were also highlighted by the literature. Various forms of neural
networks seem to perform very well across the board. Classifiers like SVM, KNN and decision tree-
based models also get good results.
Based on various articles, converting a malware binary into a gray scale image appears to be a very
solid approach to malware classification using static analysis with some clear advantages. It is easy to
process, appears to be resilient against obfuscation and it can take advantage of image analysis
techniques. While several different articles obtain very good results when classifying malware into
various families, we would have liked to see how such a technique performs for malware detection.
References:
[1]. Santosh Joshi, Himanshu Upadhyay, Leonel Lagos, Naga Suryamitra Akkipeddi, Valerie Guerra,
“Machine learning Approach for Malware Detection Using Random Forest Classifier on Process List
Data Structure”.
[2]. I. Shhadat, B. Bataineh, A. Hayajneh, and Z. A. Al-Sharif, “The Use of Machine Learning Techniques
to Advance the Detection and Classification of Unknown Malware,” Procedia Comput. Sci., vol. 170,
pp. 917–922, Jan. 2020, doi: 10.1016/j.procs.2020.03.110.
[3]. D. Gupta and R. Rani, “Improving malware detection using big data and ensemble learning,”
Comput. Electr. Eng., vol. 86, p. 106729, Sep. 2020, doi: 10.1016/j.compeleceng.2020.106729.
[4] P. Singhal, “Malware Detection Module using Machine Learning Algorithms to Assist in Centralized
Security in Enterprise Networks,” Int. J. Netw. Secur. Its Appl., vol. 4, no. 1, pp. 61–67, Jan. 2012, doi:
10.5121/ijnsa.2012.4106.
[5] J. Z. Kolter and M. A. Maloof, “Learning to detect malicious executables in the wild,” in Proceedings
of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’04,
Seattle, WA, USA, 2004, p. 470, doi: 10.1145/1014052.1014105.
[7]. https://www.researchgate.net/publication/336611638_Detecting_Malware_Evolution_Using_
Support_Vector_Machines.
[8]. https://www.sciencedirect.com/science/article/pii/S1877050917305987.
[9]. https://www.sciencedirect.com/science/article/pii/S1084804519303868.
[10]. http://asnugroho.net/papers/act2010-firdausi.pdf.
[11] S. Cesare, Y. Xiang, and W. Zhou, “Control flow-based malware variant detection,” IEEE
Transactions on Dependable and Secure Computing, vol. 11, no. 4, pp. 304–317, 2014.
[12] H. S. Galal, Y. B. Mahdy, and M. A. Atiea, “Behavior-based features model for malware
detection,” Journal in Computer Virology and Hacking Techniques, vol. 12, no. 2, pp. 59–67, 2016.
[13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolutional networks for biomedical image
segmentation,” in Proceedings of the International Conference on Medical Image Computing and
Computer-Assisted Intervention (MICCAI '15), vol. 9351 of Lecture Notes in Computer Science, pp.
234–241, Springer, Cham, Switzerland, November 2015.