Project Report Final
SUBHASHREE BASU
Project Report
Submitted in partial fulfillment of the requirements for the degree of B.Tech in
Information Technology
Affiliated to
June, 2021
This is to certify that the work in preparing the project entitled "Detection of different viruses based on neural network algorithms" has been carried out by SOUMYA BANERJEE, SANJOY SEBAIT, DEEP SHIKHA under my guidance during the session 2020-21 and is accepted in partial fulfillment of the requirement for the degree of Bachelor of Technology.
Acknowledgement
We are submitting the project on malware detection by neural network algorithms under the guidance of Subhashree Basu, who supported us at every stage of the report and guided us with her excellent knowledge. We would also like to thank our project review teachers, who instructed us on what to add to our project and what improvements needed to be made.
Vision
To evolve itself into an industry oriented research based recognized hub of creative solutions in
various fields of engineering by establishing progressive teaching- learning process with an
ultimate objective of meeting technological challenges faced by the nation and the society.
Mission
• To incubate students to grow into industry-ready professionals, proficient research scholars
and enterprising entrepreneurs.
• To create a learner-centric environment that motivates the students in adopting emerging
technologies of the rapidly changing information society.
• To promote social, environmental and technological responsiveness among the members
of the faculty and students.
PEO1: Exhibit the skills and knowledge required to design, develop and implement IT solutions
for real life problems.
PROGRAM OUTCOMES (POs)
2. Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information
to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools including prediction and modeling to complex engineering activities with an understanding
of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms
of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and
design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
3 2 1 2 3 2 1 1 3 3 2 3
Justification :
In this project, we apply our engineering knowledge and design system components using modern IT tools, including prediction and modelling. Through this project, we understand professional engineering solutions in societal and environmental contexts. We demonstrate knowledge and understanding of engineering principles, communicating with the engineering community and with society at large, for example by writing comprehensible design documentation, making effective presentations and giving clear instructions. We also recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
PSO3 ( Software Engineering ) : Understand and analyze a big complex problem and
decompose it into relatively smaller and independent modules either algorithmically or in an
object oriented way choosing correct life cycle model and using effective test cases.
Project Mapping with Program Specific Outcomes
PSO1 PSO2 PSO3
2    2    3
Justification :
Understand and analyze a big complex problem and decompose it into smaller and independent
modules either algorithmically or in an object-oriented way choosing correct life cycle model
and using effective test cases.
Topic                              Page No
Introduction                       1
Chapter 1                          2-11
Chapter 2                          12-19
Chapter 3                          20-22
Chapter 4 (Sample Codes)           23-25
Chapter 5                          26-29
Chapter 6                          30
  6.2 Conclusion
Annexure
References / Bibliography          31
Introduction
Technology, with its constant upgrades and updates, is advancing day by day and has a great impact on our daily lives. Everyone is surrounded by technology that makes life much easier. But this is also where cybercriminals find many opportunities to sabotage new technology by spreading malware in various forms, such as through emails, links and documents. This happens because development teams in the IT industry mainly focus on the interface and design of the software instead of securing their networks or adopting methods that prevent malicious files from entering. Some companies, however, spend billions to have their software secured by security organizations. Cybersecurity researchers and developers are also doing excellent work establishing new security approaches to prevent cyber-attacks.
To explain the term, malware is nothing but a piece of software developed by a cyber attacker with the intent to damage a device, for example by stealing information or credentials. Malware comes in different types, such as viruses, worms, spyware, botnets, ransomware and trojans. Malware is created by attackers mainly to make money or gain profit by other means; they also profit by selling it on the dark web to the highest bidder. These are the harms of creating malware, but one use of malware is that companies also buy it to test the security of the software they develop. Malware, on the other hand, causes great damage to a company's reputation and heavy losses by revealing or stealing sensitive information. According to McAfee, a new type of malware is detected every four seconds. Even antivirus software is not always able to detect all types of malware; according to research, 69,477,489 unique malicious samples have gone undetected. Malware does not need to be created the same way each time; it can change its form, stay stealthy and go undetected.
To mitigate these kinds of problems, a better approach is the use of machine learning techniques and their different algorithms, which can considerably improve the malware detection rate. The software that will be created will run machine instructions written in Python and give us accurate results. In this project we download a dataset that will be trained to detect malware, and the use of the K-means clustering algorithm will be helpful to accurately cluster the malware accordingly.
Chapter 1
1.1 Problem Statement
Detection of different viruses based on neural network algorithms.
1.3 Objective
Our proposed approach automatically and simultaneously learns both discriminative features and
classifier for virus particle detection by machine learning, in contrast to existing methods that are
based on handcrafted features that yield many false positives and require several postprocessing
steps. For this project, we have a malware dataset that is derived using the open-source software
tools. PE header analysis tools were used to extract the required features from the malware files, which are then used in the further steps of the proposed system. The ten malware categories used in this research include virus, trojan, adware, backdoor, ransomware, etc. Since our method is based on
supervised learning that requires both the input dataset and their corresponding alert level, it is
basically used for detection of already-known viruses. However, the method is highly flexible, and the
convolutional networks can adapt themselves to any virus particles by learning automatically from an
annotated dataset.
Python is the only programming language required for this project till now.
In the paper by Santosh Joshi, Himanshu Upadhyay, Leonel Lagos, Naga Suryamitra Akkipeddi and Valerie Guerra, the authors use machine learning classifiers to detect the behaviour of malware. They use the Random Forest algorithm, a traditional way to detect and analyse the malware that causes cyber threats. The proposal classifies a given sample as benign or malware with high accuracy and low computational overhead. The work is carried out in a Linux environment, and the data is extracted using LibVMI, which is specially made for Virtual Machine Introspection. The authors use virtualization because it provides a better approach: a virtual machine is created and the malware is run inside it to analyse its behaviour, so that the executed malware cannot affect the host machine. A Xen hypervisor-based virtualization platform is used to host the system; the data is captured by virtual machine introspection using the LibVMI library and then everything is transferred into a database. The dataset that was used to build the model
consisted of a mixture of malicious malware and benign data extracted from the process-list data of the Linux VM.
For training, they split the data into two parts: 70% of the data was used for training and 30% for testing; along with this they used a confusion matrix for the evaluation and plotted the accuracy. The authors store the data in a database server on the virtualization platform and then execute the Random Forest algorithm on the machine learning server. The model that has been created is stored in the database for prediction. To improve the accuracy of the model, the authors make minor changes to the random forest algorithm, such as changing the number of trees and the number of variables tried at each split. According to the testing and training, the authors achieve an accuracy of 90.9%. Finally, the authors conclude that machine learning techniques are very promising for detecting malware in this technological world; various models have been proposed, but without testing and measuring accuracy no one can decide which one to employ. [1]
In contrast to the above-proposed method, Ihab Shhadat, Bara Bataineh, Amena Hayajneh, and Ziad
A. AL-Sharif have proposed the detection of unknown malware using machine learning to advance the
detection and classification. The authors state that the purpose of this research is to detect unknown malware using machine learning techniques; they apply different types of algorithms to the dataset to check which gives better accuracy with binary and multi-class classifiers. The authors mention that they achieve this by using the Random Forest classifier for feature selection and by cross-validating the data. The data that was collected had 1,156 files in total, out of which 984 were malicious files and 172 benign files, and the data was collected in different formats such as .exe, .pdf and .docx. The authors consider various main malware families in their study, such as Dridex, Locky, Zeus, TeslaCrypt, etc. They estimate the performance based on accuracy, precision, recall and the F1-score, which are calculated with the help of the confusion matrix. In this research paper, feature extraction has been conducted using a heuristic strategy,
which is a dynamic analysis in which malicious files are executed in a virtual environment and then
information is gathered based on the malware characteristics. Here the authors keep only the information that is needed, which gives a better accuracy rate and better performance, and reject the rest of the information, which is not needed because it can harm the accuracy rate and performance. The Random Forest algorithm is used because it often identifies the important features accurately; it is a tree-based technique that ranks features by the impurity of the nodes. Cross-validation is done because evaluation is very important in machine learning: k-fold cross-validation divides the initial set into k subsets of similar size and then performs the training. According to the research paper, the important features were identified with the help of Random Forest and the cross-validation was done using k-fold cross-validation. The authors applied different types of models provided by sklearn. [2]
The highest accuracy was achieved by the Decision Tree (98%), followed by Random Forest (97.8%), Hard Voting (97%), KNN (96.1%), SVM (96.1%) and LR (95%). The lowest accuracy was achieved by Bernoulli NB, which is 91%. The authors conclude that, among the machine learning techniques studied, the decision tree achieved 98.2% accuracy, which outperforms the rest of the algorithms.
In this paper, the authors Deepak Gupta and Rinkle Rani have proposed a method for improving malware detection using big data and ensemble learning. The authors propose two methods to improve the detection of malware: the first one is a weighted voting strategy of ensemble learning
and the second one is choosing an optimal set of base classifiers to stack. The two methods that are
mentioned above are based on ensemble learning and big data which improves malware detection at
a very large scale. It is mentioned that the malware and the benign files are collected from the public
repositories and clean windows machine, to perform static and dynamic analysis author has set up an
automated environment to generate reports which are used as a raw dataset. These data are stored
in a distributed storage system and processed using Apache spark to extract the features. The dataset
for the proposed method consists of 100,200 malware files and 98,150 benign files [3]. The malware files were collected from sources like VXHeaven, VirusShare, etc., and the benign files were collected from clean Windows XP, 7 and 10 installation directories. The authors also state that the dataset is nearly balanced, because the difference between the numbers of malicious and benign files is small, so they are almost equal, which allows better accuracy and detection to be achieved. An automated environment is used to perform the static and dynamic analysis; in this setup the host machine runs Ubuntu as the OS and Windows runs as a guest machine created in Oracle VM VirtualBox. The binary files are executed in the virtual environment, generating a full analysis report in JSON format. The features for detection were decided based on the reports of the dynamic and static analysis, and the extracted features are: file metadata, file size, packer detection, sections information, dynamically linked libraries, dropped files, Windows API calls, mutex operations, file activities, registry activities, network activities and process activities [3].
These are the features based on which further detection will take place. The algorithms on which the
dataset has been applied in this paper are Naïve Bayes, K-nearest neighbor, Decision table, SVM, and
Random Forest. After applying the dataset in every algorithm the author says that Random Forest has
given the highest accuracy which is 98.1% followed by the Decision table algorithm which is 94.9%.
The author concludes that the proposed system that is the weighted voting technique gives the
highest accuracy of 99.5%; hence the proposed system enhances the detection of malware.
In contrast to the above paper, the authors Priyank Singhal and Natasha Raul have proposed a malware detection module using machine learning algorithms to assist centralized security in enterprise networks. They propose a new and more complex antivirus engine that can scan harmful files; this is done by extracting the API calls made by different types of normal and malicious executables. Machine learning is used to improve classification and also to rank the files based on their security risk. This system is heavy on the processor, but it is very effective and can be used in enterprise networks to detect threats. In this research paper, the virus-detection solution works at the firewall level of the enterprise network. For detection, the authors extract numerous infected and normal PE executables using the Import Address Table (IAT) and then store the extracted data in a data mine. The information gain is derived for each function.
For the further implementation, Random Forest is used as a classifier to detect the malware; the random forest's output prediction is the most frequent class output of the individual trees. The authors state that, to check whether the proposed model provides results, they extracted over 5,000 executable files consisting of both normal and infected samples, and by using the information gain algorithm they chose only the 80% of functions that are more likely to be harmful. The results in the paper show the accuracy achieved using different algorithms: Decision Tree 90%, Naïve Bayes 95%, Random Forest 97%, and the proposed method 98%. The authors conclude that the proposed model can detect malware based on
advanced data mining and machine learning techniques. However, this model cannot be used by home users; it can only be used at an enterprise level.
In this research paper, the authors J. Zico Kolter and Marcus A. Maloof have proposed a method to detect and classify malicious executables in the wild. The authors collected 1,971 benign files and 1,651 malicious executables, and each of them was encoded as a training example using n-grams as features. After selecting the features, the authors applied various algorithms for evaluation, such as Naïve Bayes, decision trees, support vector machines and boosting [4].
As stated in the paper, boosted decision trees outperformed the other algorithms with an area under the ROC curve of 0.996. The authors observe that their methodology can scale to a larger number of executables and also that their method can classify executables based on their payload. As mentioned above, the data consisted of 1,971 benign files and 1,651 malicious files collected in the Windows PE format, and the benign files were collected from the folders of Windows XP and Windows 2000 machines. A hex dump was used to convert the executables into hexadecimal codes in ASCII format so that the machine can process them. [5]
Then the respective n-grams were created by combining each four-byte sequence into a single term; for example, if a sequence is ff 00 ab 3e 12 b3, then the corresponding n-grams will be ff00ab3e, 00ab3e12 and ab3e12b3. The experimental results show a false positive rate of 0.05 and that the boosted decision tree detects 98% of malicious files, with 6 of the 291 malicious files going undetected; the authors note that if those 6 are a major issue, then by accepting a false positive rate of 0.1 the method can detect with perfect accuracy. The authors conclude that, after all the evaluation, boosted J48 outperforms the other algorithms and is the best detector because of its area under the ROC curve of 0.996.
In this paper, the authors Yanfang Ye, Dingding Wang and Dongyi Ye have proposed a method called an Intelligent Malware Detection System (IMDS). By analysing the Windows API execution sequences called by PE files, they implement an intelligent malware detection system using objective-oriented association mining-based classification. An experimental study with a large collection of PE files obtained from the anti-virus laboratory of Kingsoft Corporation is used to compare various malware detection approaches. The authors state that the IMDS system outperforms popular antivirus software such as Norton and McAfee VirusScan, as well as other commonly used algorithms. The results show that IMDS achieves the highest accuracy, which is 93.07%, and outperforms other algorithms such as Naïve Bayes, SVM and J4.8. The authors conclude that they implemented the model successfully with a large collection of 12,212 benign samples and 17,366 malicious files.
In relation to the other papers, the authors C.P. Patidar and Harshita Khandelwal have proposed a method to detect zero-day attacks using machine learning techniques. The authors divide this into four phases, namely the malware datasets, analysing the MDS, the correlation algorithm, and the detection methods for malware [6].
First, samples of malware are collected, forming the datasets; then a correlation algorithm is used to find relationships between the malware samples so that future behaviour can be predicted. The final phase applies the detection model, which detects the malware and gives the appropriate output.
From [7] we came across a detailed description of the algorithm, how we can apply the algorithm to our dataset, and what its features are; it also helped us understand the behaviour of the malware time frame by time frame.
Reference [8] helped us implement our project and follow the step-by-step procedure of the program, and to learn more about the support vector machine. From [9] we went through the random forest algorithm and its different kinds of implementation, and we understood decision trees, the actual value for a data point and the final prediction results very well. From [10] we understood how to choose the number of neighbours, count the number of data points in each category, calculate the distance and then assign the new data point to the category for which the number of neighbours is maximum.
With so many sophisticated malware samples appearing, plenty of research has concentrated on proposing miscellaneous malware detection methods to mitigate the rapid growth of malware. Malware detection can be divided into two main methods: static malware detection and dynamic malware detection [11].
Static malware detection also refers to signature-based malware detection, which examines the content of a malicious binary without actually executing malware samples. Signature-based malware detection is able to obtain the full execution path. However, it can be easily evaded by obfuscation techniques. In addition, signature-based malware detection requires prior knowledge of malware samples. [12]
Dynamic malware detection analyses sample behaviours during execution and is generally called behaviour-based malware detection. Behaviour-based malware detection methods include virtual machine and function call monitoring, information flow tracking, and dynamic binary instrumentation. The Windows Application Programming Interface (API) call graph-based method has been considered a good prospect in behaviour-based malware detection for a long time [13].
Concretely, the detection process can be divided into two stages. The first stage preprocesses the malware sample data: it takes the binary form of a Windows executable file, generates a grayscale image, and extracts the opcode sequence and metadata features with a decompilation tool.
This stage therefore generates the appropriate data format as the input of the follow-up algorithms. The second stage applies the core process of the KNN and SVM models, respectively learning from the grayscale image and the opcode sequence. To optimize the detection performance, we use a stacking ensemble to integrate the two models' outputs and the metadata features and get the final prediction
result. Many algorithms are used in this field; we use three algorithms for this project. They are:
1) Random Forest
2) KNN classification
3) SVM
“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used
for both classification or regression challenges. However, it is mostly used in classification problems.
In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is number of
features you have) with the value of each feature being the value of a particular coordinate. Then, we
perform classification by finding the hyper-plane that differentiates the two classes very well.
The training of traditional SVM method requires the solution of quadratic programming, and consumes
high memory and has low speed for large data training. Incremental learning is one of the meaningful
methods to continuously update the data for learning; it keeps the previous learning results, re-learning only on the additional data, so as to form a continuous learning process. This paper will study
the support vector machine based on incremental learning method and its application in the malware
detection. The experiments carried out in the Internet Security Laboratory at Kingsoft Corporation
suggested that, for large number of virus samples, our method can rapidly and effectively update the
sample features, which avoids duplication of learning history samples and ensures the malware
prediction ability for the detection model.
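As an illustration of this idea, the following minimal sketch (not the Kingsoft experiment itself) uses scikit-learn's SGDClassifier with a hinge loss as a stand-in for an incrementally trained linear SVM: partial_fit keeps the previously learned weights and re-learns only on the additional data. All data, shapes and parameters here are placeholders, not the report's dataset.

# Illustrative sketch only: incremental (batch-wise) learning with a linear SVM
# objective, using SGDClassifier(loss="hinge") and partial_fit.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_old = rng.normal(size=(1000, 20))          # previously seen samples (placeholder)
y_old = rng.integers(0, 2, size=1000)        # 1 = malware, 0 = benign (placeholder)
X_new = rng.normal(size=(200, 20))           # newly collected samples (placeholder)
y_new = rng.integers(0, 2, size=200)

svm = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)
svm.partial_fit(X_old, y_old, classes=np.array([0, 1]))   # initial training

# Incremental update: re-learn only on the additional data, starting from the
# previously learned weights instead of re-training on the full history.
svm.partial_fit(X_new, y_new)
print("accuracy on the new batch:", svm.score(X_new, y_new))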
Random forest
The first part is the collection of datasets. A large collection of malware apps as well as benign apps has been collected to form the dataset for training the model using a machine learning algorithm. Both the malicious-app and benign-app datasets are merged to form one large dataset.
Training of Dataset: The generated dataset is used for training the model with the given parameters using the random forest algorithm. Data preprocessing is an important part of this process, where the data must be organized. Also, the relation between these datasets must be found in order to extract features.
Extracting of Features: Feature extraction is important for prediction, and features are selected meticulously in order to achieve faster computation and low memory consumption. For the collected malicious and benign apps, the features are taken from the source code of decompiled files. The installation package of an Android app is basically the .apk file, which can be decompiled using Apktool; this recovers the main files, organizing the source code in a particular folder structure.
Deployment of Random Forest Algorithm: After obtaining the extracted features, the random forest algorithm is used for classification. During the training process, a set of labels is used to determine the type of each app: 1 denotes malware and 0 denotes a benign app. The construction of a random forest consists of a collection of decision trees, where the number of decision trees is set manually. Every decision tree is developed in a top-down approach, starting from the root.
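A minimal sketch of this deployment step is shown below, assuming the extracted features have already been turned into a numeric matrix; the data here is synthetic and only illustrates the labelling convention (1 = malware, 0 = benign) and the manually chosen number of trees.

# Sketch of the random-forest deployment step described above (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.random((500, 30))                      # placeholder feature matrix
y = rng.integers(0, 2, size=500)               # placeholder labels: 1 = malware, 0 = benign

rf = RandomForestClassifier(n_estimators=100)  # number of trees set manually
rf.fit(X, y)
print(rf.predict(X[:5]))                       # predicted labels for a few samples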
K Nearest Neighbor
It is a popular class of supervised machine learning techniques. We can detect malware using KNN and
feature analysis to identify changes within malware families over time. K-Nearest Neighbor is one of
the simplest Machine Learning algorithms based on Supervised Learning technique.
K-NN algorithm assumes the similarity between the new case/data and available cases and put the new
case into the category that is most similar to the available categories. This algorithm stores all the
available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
This algorithm can be used for regression as well as for classification, but it is mostly used for classification problems. Suppose there are two categories, i.e. Category A and Category B, and we have a new data point x1: in which of these categories will this data point lie? To solve this type of problem, we need the K-NN algorithm.
Chapter 2
Figure 4: The schematic diagram of SVM-based malware detection
In this section, we formally describe our design of the SVM-based approach. Before the description, we first define the notation.
Notation
There are mainly three kinds of set, i.e. training set Tr, testing set Te and recommended set R. Each set
is composed of two parts, i.e. a feature property set and a class property set. The detailed notation is listed in Table 1.
Table 1: Notation
Symbol   Meaning
Tr       Training set
Te       Testing set
L        Labeled sample
U        Unlabeled sample
R        Recommended sample
X        Feature property
Y        Class property
vd       Vector dimension
In order to get two mutually independent sub-feature set of malware, we firstly preprocess labeled
dataset L with ICA and get dataset Tr, and then split Tr into two sets Tr1 and Tr2. With the help of ICA,
the independence of Tr1 and Tr2 is guaranteed. We also handle unlabeled dataset U with ICA and get
testing set Te, and obtain Te1 and Te2 correspondingly;
We train two classifiers, i.e. classifier C1 and classifier C2 by training with Tr1 and Tr2. Although the
training set is the same one, the two classifiers are absolutely unrelated as the two sub-feature sets Tr1
and Tr2 are totally independent. Then we test the testing set Te by using classifier C1 and classifier C2
and obtain two different results Ye1 and Ye2;
The next step concerns how to form a new training set from recommended samples. Recommended samples are selected depending on the distance between a sample and the hyper-plane. Firstly, we sort the results Ye1 and Ye2 based on this distance and then select the top k to form the recommended datasets R1 and R2 respectively. Finally, we combine R1 with L2 and combine R2 with L1 to form the new training sets.
We then train two classifiers with the new training sets and obtain classifier C1’ and classifier C2’. Then we make a second test on the testing dataset Te with the two new classifiers and obtain the ultimate results Ye1’ and Ye2’, from which we compute the precision, recall rate, F-measure and accuracy rate at the end.
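The rough sketch below illustrates this two-view procedure under several assumptions: FastICA stands in for the ICA preprocessing, the absolute value of SVC's decision_function is used as the distance to the hyper-plane, and the other classifier's predictions supply labels for the recommended samples. Sizes, names and data are illustrative only and do not reproduce the report's actual pipeline.

# Sketch: ICA preprocessing, split into two independent sub-feature sets,
# two SVM classifiers, and retraining after adding the top-k most confident
# samples (those farthest from each hyper-plane) to the other view.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(300, 20)); y_lab = rng.integers(0, 2, 300)   # labelled set L
X_unl = rng.normal(size=(200, 20))                                    # unlabelled set U
k = 20

ica = FastICA(n_components=10, random_state=0).fit(X_lab)
Tr, Te = ica.transform(X_lab), ica.transform(X_unl)
Tr1, Tr2 = Tr[:, :5], Tr[:, 5:]          # two independent sub-feature views
Te1, Te2 = Te[:, :5], Te[:, 5:]

C1 = SVC(kernel="linear").fit(Tr1, y_lab)
C2 = SVC(kernel="linear").fit(Tr2, y_lab)

# Distance from the hyper-plane; top-k most confident samples form R1 and R2,
# each added (with the predicting classifier's labels) to the other view.
d1 = np.abs(C1.decision_function(Te1)); d2 = np.abs(C2.decision_function(Te2))
R1, R2 = np.argsort(-d1)[:k], np.argsort(-d2)[:k]

Tr1_new = np.vstack([Tr1, Te1[R2]]); y1_new = np.concatenate([y_lab, C2.predict(Te2[R2])])
Tr2_new = np.vstack([Tr2, Te2[R1]]); y2_new = np.concatenate([y_lab, C1.predict(Te1[R1])])

C1p = SVC(kernel="linear").fit(Tr1_new, y1_new)   # classifier C1'
C2p = SVC(kernel="linear").fit(Tr2_new, y2_new)   # classifier C2'
Ye1p, Ye2p = C1p.predict(Te1), C2p.predict(Te2)   # final predictions on Te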
X2 Statistic
The X2 statistic was initially proposed in (Pearson, 1900) and is used to compare
an observed frequency distribution to an expected frequency distribution
(Plackett, 1983). In mathematical terms, the X² statistic is a normalized sum of squared deviations between the observed and expected frequency distributions. This statistic is calculated as

X² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed frequency and Eᵢ is the expected frequency of the i-th outcome.

Figure 6: X² statistic
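As an aside, the sketch below shows one common way the X² score is used in practice for feature selection, via scikit-learn's SelectKBest with the chi2 scoring function; it assumes a non-negative feature matrix (e.g. counts), uses synthetic data, and is not necessarily the exact procedure followed in this report.

# Sketch: ranking features with the X² statistic using scikit-learn.
# chi2 requires non-negative features (e.g. counts); the data here is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(400, 15))     # e.g. counts of API calls / opcodes
y = rng.integers(0, 2, size=400)            # 1 = malware, 0 = benign

selector = SelectKBest(score_func=chi2, k=5).fit(X, y)
print("X² scores:", np.round(selector.scores_, 2))
print("selected feature indices:", selector.get_support(indices=True))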
Step1 − First, start with the selection of random samples from a given dataset.
Step2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the
prediction result from every decision tree.
Step3 − In this step, voting will be performed for every predicted result.
Step4 − At last, select the most voted prediction result as the final prediction result.
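The small sketch below illustrates Steps 2-4 using the individual trees inside a fitted RandomForestClassifier; the data is synthetic, and note that scikit-learn itself averages class probabilities rather than taking hard votes, so its prediction can occasionally differ from the simple majority vote shown here.

# Sketch of Steps 2-4: each tree predicts, and the most voted class wins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.random((200, 10)); y = rng.integers(0, 2, 200)        # synthetic data
rf = RandomForestClassifier(n_estimators=5, random_state=7).fit(X, y)

sample = X[:1]
votes = np.array([tree.predict(sample)[0] for tree in rf.estimators_])
print("individual tree votes:", votes)
print("majority vote:", np.bincount(votes.astype(int)).argmax())
print("forest prediction:", rf.predict(sample)[0])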
Random forest overcomes the problem of overfitting by averaging or combining the results of different decision trees, and it works well for a larger range of data items than a single decision tree does.
The construction of a random forest is harder and more time-consuming than that of a decision tree.
The Random Forest algorithm is used in banking, medicine, the stock market and e-commerce.
Working process
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
Firstly, we will choose the number of neighbours, so we will choose k = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as: Euclidean distance between A1 and B2 = √((x2 − x1)² + (y2 − y1)²).
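A worked sketch of these steps with k = 5 is shown below; the points and categories are made up purely for illustration.

# Worked sketch of the KNN steps with k = 5 (synthetic points, two categories).
import numpy as np

X_train = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7],
                    [8, 6], [2, 1], [7, 5], [3, 2], [6, 6]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])     # Category A = 0, Category B = 1
x1 = np.array([5.0, 4.0])                               # new data point
k = 5

dist = np.sqrt(((X_train - x1) ** 2).sum(axis=1))       # Euclidean distances
nearest = np.argsort(dist)[:k]                          # k nearest neighbours
counts = np.bincount(y_train[nearest])                  # votes per category
print("assigned category:", counts.argmax())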
SVM
• The main reason to use an SVM instead is because the problem might not be linearly
separable. In that case, we will have to use an SVM with a non linear kernel (e.g. RBF).
• Another related reason to use SVMs is if you are in a highly dimensional space. For
example, SVMs have been reported to work better for text classification.
But it requires a lot of time for training. So, it is not recommended when we have a large number of
training examples.
kNN
• It is robust to noisy training data and is effective in case of large number of training
examples.
But for this algorithm, we have to determine the value of the parameter K (number of nearest neighbours) and the type of distance to be used. The computation time is also high, as we need to compute the distance from each query instance to all training samples.
Random Forest
• Random Forest is nothing more than a bunch of Decision Trees combined. They can
handle categorical features very well.
• This algorithm can handle high dimensional spaces as well as large number of training
examples.
Random Forests can almost work out of the box and that is one reason why they are very popular.
Chapter 3
Design and Methodology
Collection of Dataset
The first part is the collection of datasets. According to the research conducted, there are many open-source datasets available on different websites such as Kaggle, VirusTotal, VirusShare, etc., but when compared to each other, the datasets available on Kaggle are much better and the data is balanced.
The dataset was downloaded from the Kaggle website. The file is in .csv format, so we could load it in Excel and do the further operations. The loaded dataset had columns such as the name of the malware, AlertLevel, AvSigVersion, Type, etc., and it contained almost everything on the basis of which the detection will be performed. We needed only some of the columns, so preprocessing of the data and labelling had to be done before we proceeded with the further steps of detection.
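A possible sketch of this loading and labelling step is shown below; the file name, the labelling rule and the exact column names (taken from the description above) are assumptions and may differ from the real Kaggle dataset.

# Sketch of the data-collection step: load the CSV and keep only the columns
# needed for detection. The path and column names are assumptions.
import pandas as pd

df = pd.read_csv("malware_dataset.csv")                      # hypothetical file name
cols = ["Name", "AlertLevel", "AvSigVersion", "Type"]        # columns described above
df = df[cols].dropna()                                       # basic cleaning
df["label"] = (df["Type"] != "benign").astype(int)           # illustrative labelling rule
print(df.head())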
●Training of Dataset
The generated dataset is used for training the model with the given parameters using the Support Vector Machine, random forest and KNN algorithms. Data preprocessing is an important part of this process, where the data must be organized. Also, the relation between these datasets must be found in order to extract features. Data cleaning is also a major part, to find out data with missing features and values.
●Extracting of Feature
Feature extraction is important for prediction, and features are selected meticulously in order to achieve faster computation and low memory consumption. With the collected malicious dataset, the features are taken from the source code of decompiled files, which recovers the main files and organizes the source code in a particular folder structure. Then, from the methods implemented in the source of the malicious dataset, the numbers of permission-related APIs are extracted as the features to train the classifier.
●Implementation
In the implementation step, the dataset that we created was fed into each machine learning algorithm/model. To achieve the best results and accuracy we performed hyperparameter tuning, which means changing small parameters of the model to achieve better accuracy and performance. Moving on to applying the models, we split the dataset into two parts: 70% of the data is allotted to training and 30% to test data. This is done so that we can test the performance of the model instead of using all the data for training. We did the split by applying train_test_split, which is imported from the sklearn.model_selection module in Python. Each model, i.e. the KNN model, the SVM model, the Random Forest model, and the Random Forest with decision tree model, is explained in detail below.
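A minimal sketch of the 70/30 split is given below, assuming X and y are the preprocessed feature matrix and labels from the previous steps (placeholder data is used here).

# 70/30 train/test split as described above; X and y are placeholders for the
# preprocessed feature matrix and labels.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 12)          # placeholder features
y = np.random.randint(0, 2, 1000)     # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)    # 70% train, 30% test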
Support Vector Machines are perhaps one of the most popular and talked about machine learning
algorithms.
In this section we cover: the various names given to support vector machines; the representation an SVM uses when the model is stored in memory; how to produce predictions for new data using a trained SVM model; how to train an SVM model from training data; and how to prepare data for the SVM algorithm.
SVM differs from earlier classification methods in that it selects the decision boundary that maximizes the distance to the nearest data points of all classes. An SVM not only discovers a decision boundary, it identifies the best decision boundary: the most optimal decision boundary is the one with the greatest margin between the nearest points of all the classes.
A decision tree is an understandable concept that is used to generate a random forest. A decision tree
can be thought of as a sequence of yes/no questions regarding our data that eventually leads to a
projected class (or continuous value in the case of regression). This model is explainable because it
generates classifications in the same way that we do: we ask questions about the data we have. Our
data consists of only two features (predictor variables), x1 and x2, and six data points (samples)
separated into two labels. Although the task is straightforward, it is not linearly separable, which
means we can't classify the points by drawing a single straight line through the data. However, we can
draw a sequence of straight lines to divide the data points into boxes, which we'll refer to as
nodes. During training we give the model both the features and the labels so it can learn to classify
points based on the features.
The KNN method is one of the most well-known machine learning algorithms, and it's a must-have in
any machine learning toolkit. Python is the most popular programming language for machine learning,
so what better way to learn about KNN than with NumPy and scikit-learn, two of Python's most well-
known packages. The first determining property of machine learning algorithms is the split
between supervised and unsupervised models. The difference between supervised and unsupervised
models is the problem statement.
In supervised models, you have two types of variables at the same time:
1. A target variable, which is also called the dependent variable or the y variable.
2. Independent variables, which are also known as x variables or explanatory variables.
The target variable is the variable that you want to predict. It depends on the independent variables
and it isn’t something that you know ahead of time. The independent variables are variables that you
do know ahead of time. You can plug them into an equation to predict the target variable. In this way,
it’s relatively similar to the y = ax + b case.
Chapter 4
Sample Code:
Support Vector Machine (SVM) model
The Support Vector Machine helps to find the hyperplane in an N-dimensional space that classifies the data points. Our main objective is to find the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of each class. We then apply the dataset to the model, as sketched below.
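Since the report's own code appears only in figures, the following is a hedged stand-in showing the general shape of this SVM step with synthetic data; the real pipeline uses the preprocessed Kaggle dataset and its own parameters.

# Stand-in for the SVM step (the report's own code is in a figure). Synthetic data.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((700, 12)), rng.integers(0, 2, 700)
X_test,  y_test  = rng.random((300, 12)), rng.integers(0, 2, 300)

svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))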
From the SVM model we get an accuracy of around 51%, which is not satisfactory, so we move on to another approach.
The random forest algorithm then prepares a number of trees, resembling a forest. Each tree is different, as for each split in a tree the variables are chosen randomly. The remaining data, apart from the training set, is used for evaluating the trees in the forest; the tree that makes the best classification of data points, i.e. the one with the most predictive power, is given as output. At each node of a decision tree, the training set is split into two subsets with different labels by minimizing the uncertainty of the class labels.
A random forest has a group of trees in which each tree classifies and gives results based on a few attributes. Trees which have the maximum number of nodes are chosen. The model was implemented on the dataset with a few parameters: the maximum depth, the number of estimators, and the split criterion, as sketched below.
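Again as a stand-in for the figure, the sketch below shows the general shape of this step, exposing the parameters mentioned above (number of estimators, maximum depth and split criterion); the data is synthetic.

# Stand-in for the random-forest step, with the parameters mentioned above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X_train, y_train = rng.random((700, 12)), rng.integers(0, 2, 700)
X_test,  y_test  = rng.random((300, 12)), rng.integers(0, 2, 300)

rf = RandomForestClassifier(n_estimators=200, max_depth=10, criterion="gini")
rf.fit(X_train, y_train)
print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))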
Random forest with a decision tree: Random forest combined with a decision tree should give more precision and accuracy, but according to the evaluation it did not, due to the imbalance in the data. The dataset was applied to the model with some parameters, considering the tree size, the number of estimators, and more.
Here we can see that the accuracy we get is better compared to the Support Vector Machine and Random Forest algorithms.
Chapter 5
Testing, Result, Discussion On Results
Evaluation:
The figure below shows the accuracy rate and the confusion matrix of the SVM model.
The next figure shows the code and accuracy rate of the Random Forest model.
The error-rate graph of the KNN model is decreasing, which represents a positive result.
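A sketch of how such an error-rate curve is typically produced is shown below: fit a KNN model for each value of k and record the test error; the data here is synthetic.

# Sketch: KNN error rate versus k, plotted with matplotlib (synthetic data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X, y = rng.random((1000, 12)), rng.integers(0, 2, 1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

errors = []
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors.append(np.mean(knn.predict(X_te) != y_te))

plt.plot(range(1, 31), errors, marker="o")
plt.xlabel("k (number of neighbours)")
plt.ylabel("error rate")
plt.title("KNN error rate vs. k")
plt.show()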
Now we can see that the accuracy has become better than the previous one we got.
The evaluation in machine learning is done by calculating metrics with the help of a confusion matrix, which is available in the sklearn.metrics package. The graphs and ROC curves are plotted with the help of the confusion matrix for the KNN model and the other models that are compared.
When we run the models, it is an important step to calculate the metrics with the help of the confusion matrix. This is a table or a graph that represents the Accuracy, Precision, F1-score and Recall of the model that we implemented. The figure below shows what a confusion matrix looks like.
The values in the boxes, which are True Positive, False Positive, False Negative and True Negative, are explained in detail below.
1. True Positive (TP): A true positive means the model or algorithm correctly predicts what it has been asked to predict. E.g., in our implementation, the model is asked to detect malware, so the number of applications correctly detected as malware is shown here.
2. False Positive (FP): A false positive means the model incorrectly predicts the positive class, i.e. a benign sample is flagged as malware.
3. False Negative (FN): Here the model incorrectly predicts the negative class, i.e. an actual malware sample is missed and labelled as benign.
4. True Negative (TN): In this, the model correctly predicts or detects the negatives.
The parameters that are considered in the confusion matrix such as Accuracy, precision, recall, and
F1- score are explained below and also the equation that is used to calculate those.
1. Precision: A metric that quantifies the number of correct positive detections made. It captures the accuracy on the minority class and is calculated as the correctly detected positives divided by the total number of positives detected.
TP / (TP + FP) …………………. (1)
28
Department of Information Technology
St. Thomas’ College of Engineering and Technology
2. Accuracy: Accuracy measures how many of the model's predictions are correct overall. It is calculated by adding TP and TN and dividing by all the positives and negatives.
(TP + TN) / (TP + TN + FP + FN) …………………. (2)
3. F1-Score: The F1-score is a metric in which both precision and recall are taken into account. The equation to calculate it is as follows:
2 × (Precision × Recall) / (Precision + Recall) …………………. (3)
4. Recall: Recall is the sensitivity, or the true positive rate, for which the equation is given below.
TP / (TP + FN) …………………. (4)
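A short sketch of this evaluation step, computing the confusion matrix and the four metrics with sklearn.metrics, is given below; y_test and y_pred are placeholders for the real test labels and model predictions.

# Sketch of the evaluation step: confusion matrix plus accuracy, precision,
# recall and F1-score via sklearn.metrics (placeholder labels and predictions).
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_test = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # placeholder ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # placeholder model output

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))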
Discussion on Result
The sort of machine learning found in a lot of anti-malware software tries to learn which files are malicious and which are benign based on databases of known malicious files. The AI involved tries to make decisions about whether or not analysed code is harmful based on a series of traits, and some traits may rank higher than others. So code that is determined to be benign might still have some traits that the software considers a possible indication of malware. Malware is evolving rapidly, so the algorithms must evolve rapidly as well; it is a constant, ongoing process.
After implementation we achieved 51% accuracy with SVM, 87% with Random Forest and 98% with the KNN algorithm. From the evaluation and the results achieved by the models considered, we can say that the KNN algorithm has outperformed the other algorithms with a greater percentage of accuracy. Our model can detect a large number of malware samples and classify them accordingly; the KNN model has given us a good accuracy of 98%.
We use perfmon log data from all machines in an experiment, apart from the data from the machine for which we want to determine whether it has a virus or not. We average the counter values from each of these machines for a specific instant in time (after the gradient had first been applied), and perform this process for all instances. By averaging the output from many machines we attempt to create a representation of normality, minimizing the effect of the virus on those machines which have been infected. We then use these averaged instances as the training set for the novelty classifier, and the instances from the machine we are examining are used as the test set.
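A rough sketch of this averaging scheme is given below; OneClassSVM is used here only as an example novelty detector, and the array shapes and data are illustrative.

# Sketch: average perfmon counter values across the other machines for each
# time instance, train a novelty detector on those averages, and score the
# machine under test. All data and shapes are illustrative placeholders.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
# counters[machine, time_instance, counter]
others = rng.normal(size=(9, 100, 15))       # logs from the other machines
target = rng.normal(size=(100, 15))          # logs from the machine under test

train_set = others.mean(axis=0)              # per-instance average = "normality"
novelty = OneClassSVM(nu=0.05, gamma="scale").fit(train_set)

scores = novelty.predict(target)             # +1 = normal, -1 = anomalous
print("suspicious instances:", int((scores == -1).sum()), "of", len(scores))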
Chapter 6
6.1 Scope for future improvement
We were not able to implement this as a tool that scans individual files, but we have certainly detected the malware. Although we achieved good results with our proposed models, our priority was to
implement different machine learning models to compare, rather than spending too much time
tweaking to achieve marginal gains. Machine learning has a lot of room for experimenting and tuning.
It is very possible that even better results could be achieved by tweaking the proposed solutions.
Windows is not the only operating system that is targeted by malware. A possible future improvement
could be creating a solution that works with multiple executable formats. Microsoft’s PE format is
relatively similar to the ELF format used by Linux executables, and a uniform machine learning model
should therefore not be impossible to achieve. Another thing to keep in mind is that as malware detection gets cleverer, so does the malware. It is a perpetual battle where the malware keeps changing; as such, a solution that works well today might not work well in the future.
In the future, we will work on obtaining the opcode and analyzing the behavior of malware types and
try to detect the malware in real-time.
6.2 Conclusion :
Malware detection is necessary in order to detect zero-day attacks, which open a platform for various malware to perform an attack or to overload the system. It plays a vital role in the field of security, where many attacks have been generated. The rapid growth of malware directs most researchers to implement new approaches to defeat the attacks and develop countermeasures; on this ground we used machine learning.
Both static and dynamic analysis yield very high accuracies in the reviewed literature, although there
are some suggestions that static approaches can struggle on newer types of malware. Certain difficulties for dynamic methods were also highlighted by the literature. Various forms of neural
networks seem to perform very well across the board. Classifiers like SVM, KNN and decision tree-
based models also get good results.
Based on various articles, converting a malware binary into a gray scale image appears to be a very
solid approach to malware classification using static analysis with some clear advantages. It is easy to
process, appears to be resilient against obfuscation and it can take advantage of image analysis
techniques. While several different articles obtain very good results when classifying malware into
various families, we would have liked to see how such a technique performs for malware detection.
References:
[1]. Santosh Joshi, Himanshu Upadhyay, Leonel Lagos, Naga Suryamitra Akkipeddi, Valerie Guerra,
“Machine learning Approach for Malware Detection Using Random Forest Classifier on Process List
Data Structure”.
[2]. I. Shhadat, B. Bataineh, A. Hayajneh, and Z. A. Al-Sharif, “The Use of Machine Learning Techniques
to Advance the Detection and Classification of Unknown Malware,” Procedia Comput. Sci., vol. 170,
pp. 917–922, Jan. 2020, doi: 10.1016/j.procs.2020.03.110.
[3]. D. Gupta and R. Rani, “Improving malware detection using big data and ensemble learning,”
Comput. Electr. Eng., vol. 86, p. 106729, Sep. 2020, doi: 10.1016/j.compeleceng.2020.106729.
[4] P. Singhal, “Malware Detection Module using Machine Learning Algorithms to Assist in Centralized
Security in Enterprise Networks,” Int. J. Netw. Secur. Its Appl., vol. 4, no. 1, pp. 61–67, Jan. 2012, doi:
10.5121/ijnsa.2012.4106.
[5] J. Z. Kolter and M. A. Maloof, “Learning to detect malicious executables in the wild,” in Proceedings
of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’04,
Seattle, WA, USA, 2004, p. 470, doi: 10.1145/1014052.1014105.
[7]. https://www.researchgate.net/publication/336611638_Detecting_Malware_Evolution_Using_
Support_Vector_Machines.
[8]. https://www.sciencedirect.com/science/article/pii/S1877050917305987.
[9]. https://www.sciencedirect.com/science/article/pii/S1084804519303868.
[10]. http://asnugroho.net/papers/act2010-firdausi.pdf.
[11] S. Cesare, Y. Xiang, and W. Zhou, “Control flow-based malware variant detection,” IEEE
Transactions on Dependable and Secure Computing, vol. 11, no. 4, pp. 304–317, 2014.
[12] H. S. Galal, Y. B. Mahdy, and M. A. Atiea, “Behavior-based features model for malware
detection,” Journal in Computer Virology and Hacking Techniques, vol. 12, no. 2, pp. 59–67, 2016.
[13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolutional networks for biomedical image
segmentation,” in Proceedings of the International Conference on Medical Image Computing and
Computer-Assisted Intervention (MICCAI '15), vol. 9351 of Lecture Notes in Computer Science, pp.
234–241, Springer, Cham, Switzerland, November 2015.