On the Limitations of Continual Learning for Malware Classification

This repository contains the code and data for the paper On the Limitations of Continual Learning for Malware Classification, accepted at the First Conference on Lifelong Learning Agents (CoLLAs). They can be used to reproduce the results presented in the paper.

[Paper] [Video]

Code Credit

We would like to convey our special thanks to Gido Van De Ven. We reused and adapted his implementations of several continual learning techniques from the continual-learning and brain-inspired-replay repositories.

Dependencies & Required Packages

Please make sure you have all the dependencies available and installed.

NVIDIA GPU
CentOS Linux 7 (Ubuntu should also work)
python3-venv or conda
CUDA Version: 10.2
Python Version: 3.7.X

Install the required packages with the following command:

pip install -r requirements.txt

We suggest creating a Python virtual environment or a conda environment and installing the required packages inside it:

conda create -n cl_malware python=3.7
conda activate cl_malware
conda install numpy=1.20.2
conda install pytorch=1.6.0 torchvision=0.7.0 cudatoolkit=10.2 -c pytorch

To use the given conda environment

Using Anaconda 3 on Linux, run the following:

conda env create -f cl_malware.yml
conda activate cl_malware
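After activating the environment, it can be useful to confirm that the pinned versions were actually picked up. The following is a minimal sketch, not part of the repository: the helper name report_environment is hypothetical, and it only reports what is importable rather than validating the CUDA setup.

```python
import importlib
import sys


def report_environment(packages=("numpy", "torch", "torchvision")):
    """Return the Python version and each package's version (None if missing)."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is not installed in the active environment.
            report[name] = None
        else:
            # Most packages expose __version__; fall back to "unknown" otherwise.
            report[name] = getattr(module, "__version__", "unknown")
    return report
```

Calling report_environment() inside the activated cl_malware environment and comparing the result against the versions pinned above (Python 3.7.X, numpy 1.20.2, pytorch 1.6.0, torchvision 0.7.0) is a quick way to catch a wrong interpreter or a missing package before running the experiments.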

Dataset

Drebin dataset:

We cannot share the data, in order to honor the requirements of the data publisher. However, anyone can request access from the Drebin data publisher. The dataset is available at [Drebin Dataset]. After downloading, please put the data into the drebin_data folder. Afterwards, running drebin_data/Drebin_Data_Process-CoLLAs-2022.py will generate the training and testing data for the top 18 families, which can be used for the Class-IL and Task-IL experiments.
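To illustrate what selecting the "top 18 families" means, here is a small sketch of frequency-based family selection. It is an assumption about the general idea, not the logic of the processing script itself; the function names and the min_samples parameter are hypothetical.

```python
from collections import Counter


def select_top_families(family_labels, top_k=18, min_samples=1):
    """Return the top_k most frequent families with at least min_samples samples."""
    counts = Counter(family_labels)
    # most_common() sorts by descending count; ties keep first-seen order.
    eligible = [family for family, n in counts.most_common() if n >= min_samples]
    return eligible[:top_k]


def filter_by_family(samples, family_labels, keep):
    """Keep only the (sample, label) pairs whose family is in `keep`."""
    keep = set(keep)
    return [(x, y) for x, y in zip(samples, family_labels) if y in keep]
```

With top_k=100 and min_samples=400, the same helper would mirror the family selection described for EMBER in the next subsection.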

EMBER 2018 dataset:

The EMBER dataset is public and can be downloaded from the [EMBER repository]. Please note that we use the 2018 version of the data with feature version 2. For Domain-IL experiments, we process the data into a month-by-month train-test split from January 2018 to December 2018. For Task-IL and Class-IL, we process the data by malware family. The EMBER dataset has 2900 malware families. We select the top 100 families, each with at least 400 samples.

The EMBER data needs to be processed twice: i) once for the Task-IL and Class-IL experiments, and ii) once for the Domain-IL experiments.
Task-IL and Class-IL data: run ember_data/EMBER_2018_TASK_CLASS_IL_FAMILY-CoLLAs-2022.py
Domain-IL data: run ember_data/EMBER_2018_DOMAIN_IL_data_process_with_family_labels-CoLLAs-2022.py
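The month-by-month split used for Domain-IL can be pictured with the sketch below. It assumes each record is a dict carrying an "appeared" field in YYYY-MM form, as in the raw EMBER metadata; the function names, the test_fraction value, and the random shuffling are hypothetical and may differ from what the processing script does.

```python
import random
from collections import defaultdict


def group_by_month(records, year="2018"):
    """Group records by their 'appeared' month (YYYY-MM) within the given year."""
    by_month = defaultdict(list)
    for record in records:
        month = record.get("appeared", "")
        if month.startswith(year + "-"):
            by_month[month].append(record)
    # Chronological order: each month becomes one Domain-IL task.
    return [(month, by_month[month]) for month in sorted(by_month)]


def split_train_test(records, test_fraction=0.1, seed=0):
    """Shuffle one month's records and split them into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Applying split_train_test to each month returned by group_by_month yields one train-test pair per month, from January 2018 to December 2018.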

Experiments

Drebin Experiments:

Task-IL: run drebin_exps/Drebin_Task_IL.sh
Class-IL: run drebin_exps/Drebin_Class_IL.sh

EMBER Experiments:

Domain-IL: run ember_domain_exps/EMBER_Domain_IL.sh
Class-IL: run ember_class_task_exps/EMBER_Class_IL.sh
Task-IL: run ember_class_task_exps/EMBER_Task_IL.sh
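For readers new to the scenario names, the sketch below shows how Task-IL and Class-IL differ at prediction time, following the commonly used three-scenario framing of continual learning. It is illustrative only: the function names and the number of classes per task are hypothetical, and the experiment scripts above define the actual configuration. Domain-IL, by contrast, keeps one fixed label set across all tasks.

```python
def split_classes_into_tasks(class_ids, classes_per_task):
    """Partition an ordered list of class ids into consecutive tasks."""
    return [class_ids[i:i + classes_per_task]
            for i in range(0, len(class_ids), classes_per_task)]


def candidate_classes(tasks, sample_task, num_tasks_seen, scenario):
    """Return the classes a prediction may be chosen from for one test sample."""
    if scenario == "task":
        # Task-IL: the task identity is given, so only that task's classes compete.
        return list(tasks[sample_task])
    if scenario == "class":
        # Class-IL: no task identity, so every class seen so far competes.
        return [c for task in tasks[:num_tasks_seen] for c in task]
    raise ValueError("scenario must be 'task' or 'class'")
```

For example, with six classes split two per task and two tasks seen so far, a sample from the first task is classified among 2 classes under Task-IL but among 4 under Class-IL, which is why Class-IL is usually the harder setting.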

Reference Format

@inproceedings{continual-learning-malware,
  title = {{On the Limitations of Continual Learning for Malware Classification}},
  author = {Rahman, Mohammad Saidur and Coull, Scott E. and Wright, Matthew},
  booktitle = {First Conference on Lifelong Learning Agents (CoLLAs)},
  year = {2022},
  publisher = {Proceedings of Machine Learning Research (PMLR)}
}

Acknowledgements

We thank our anonymous reviewers for helpful suggestions and comments. This work was funded by the National Science Foundation under Grant Number 1816851.
