This repository contains the code and data for the paper On the Limitations of Continual Learning for Malware Classification, accepted at the First Conference on Lifelong Learning Agents (CoLLAs). The code and dataset can be used to reproduce the results presented in the paper.
We would like to convey our special thanks to Gido van de Ven. We reused and adapted his implementations of different continual learning techniques from the continual-learning and brain-inspired-replay repositories.
Please make sure you have all the dependencies available and installed.
- NVIDIA GPU
- CentOS Linux 7 (Ubuntu should also work)
- python3-venv or conda
- CUDA Version: 10.2
- Python Version: 3.7.X
Please install the required packages using the following command:
pip install -r requirements.txt
We suggest that users create a Python virtual environment or a conda environment and install the required packages in it.
conda create -n cl_malware python=3.7
conda activate cl_malware
conda install numpy=1.20.2
conda install pytorch=1.6.0 torchvision=0.7.0 cudatoolkit=10.2 -c pytorch
Using Anaconda 3 on Linux, run the following:
conda env create -f cl_malware.yml
conda activate cl_malware
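After activating the environment, it can help to confirm that the interpreter and PyTorch match the versions listed above. The following is a minimal sketch, not part of this repository: the helper names `version_matches` and `check_environment` are hypothetical, and the expected versions are simply copied from the dependency list in this README.

```python
import importlib.util
import sys

# Versions copied from the dependency list in this README.
EXPECTED_PYTHON = "3.7"
EXPECTED_TORCH = "1.6.0"


def version_matches(installed: str, expected_prefix: str) -> bool:
    """Return True if the leading components of `installed` equal `expected_prefix`.

    A local build suffix such as "+cu102" is ignored.
    """
    installed_parts = installed.split("+")[0].split(".")
    expected_parts = expected_prefix.split(".")
    return installed_parts[: len(expected_parts)] == expected_parts


def check_environment() -> dict:
    """Collect a small report about the active interpreter and PyTorch install."""
    python_version = "{}.{}.{}".format(*sys.version_info[:3])
    report = {
        "python": python_version,
        "python_ok": version_matches(python_version, EXPECTED_PYTHON),
        "torch": None,
        "torch_ok": False,
        "cuda_available": False,
    }
    # Import torch only if it is installed, so the check itself never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = torch.__version__
        report["torch_ok"] = version_matches(torch.__version__, EXPECTED_TORCH)
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `python_ok`, `torch_ok`, or `cuda_available` is reported as False, the environment probably differs from the one the experiments were run in.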
To honor the data publisher's requirements, we cannot share the Drebin data. However, anyone can request access from the Drebin data publisher. The dataset is available at [Drebin Dataset]. After downloading, please put the data into the drebin_data folder. Then run drebin_data/Drebin_Data_Process-CoLLAs-2022.py to generate the training and testing data for the top 18 families, which can be used for the Class-IL and Task-IL experiments.
The EMBER dataset is public and can be downloaded from the [EMBER repository]. Please note that we use the 2018 version of the data with feature version 2. For Domain-IL experiments, we process the data into a month-by-month train-test split from January 2018 to December 2018. For Task-IL and Class-IL experiments, we process the data by malware family. The EMBER dataset has 2900 malware families; we select the top 100 families, each with at least 400 samples.
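The family selection described above (most frequent families, subject to a minimum sample count) can be sketched as follows. This is an illustrative sketch only and is not the processing script shipped in this repository; the function names `select_top_families` and `filter_samples` are hypothetical, and the defaults merely mirror the numbers stated above.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def select_top_families(
    family_labels: Sequence[Hashable],
    top_k: int = 100,
    min_samples: int = 400,
) -> List[Hashable]:
    """Return up to `top_k` of the most frequent families with at least `min_samples` samples."""
    counts = Counter(family_labels)
    # most_common() sorts by descending count; ties keep first-seen order.
    eligible = [fam for fam, n in counts.most_common() if n >= min_samples]
    return eligible[:top_k]


def filter_samples(features, family_labels, kept_families):
    """Keep only the (feature, label) pairs whose family was selected."""
    kept = set(kept_families)
    return [(x, y) for x, y in zip(features, family_labels) if y in kept]


if __name__ == "__main__":
    # Toy labels; real family labels would come from the dataset metadata.
    labels = ["a"] * 5 + ["b"] * 3 + ["c"] * 1
    print(select_top_families(labels, top_k=2, min_samples=2))  # ['a', 'b']
```

Families below the sample threshold are dropped before the top-k cut, so a small `top_k` never pulls in an under-represented family.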
EMBER data needs two separate processing steps: one for Task-IL and Class-IL experiments, and another for Domain-IL experiments.
- Task-IL and Class-IL data: run ember_data/EMBER_2018_TASK_CLASS_IL_FAMILY-CoLLAs-2022.py
- Domain-IL data: run ember_data/EMBER_2018_DOMAIN_IL_data_process_with_family_labels-CoLLAs-2022.py
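To illustrate what a month-by-month train-test split for Domain-IL might look like, here is a minimal sketch. It is not the repository's processing script. It assumes each record is a dict carrying a "YYYY-MM" month string under a key such as `appeared`; the function name `split_by_month`, the `test_fraction` value, and the random seed are all assumptions made for illustration, since the split ratio is not stated here.

```python
import random
from collections import defaultdict
from typing import Dict, List


def split_by_month(
    records: List[dict],
    month_key: str = "appeared",   # assumed field holding a "YYYY-MM" string
    test_fraction: float = 0.1,    # assumed ratio, not taken from the paper
    seed: int = 0,
) -> Dict[str, Dict[str, List[dict]]]:
    """Group records by month, then split each month into train and test parts."""
    by_month = defaultdict(list)
    for rec in records:
        by_month[rec[month_key]].append(rec)

    rng = random.Random(seed)
    splits = {}
    # "YYYY-MM" strings sort chronologically when sorted lexicographically.
    for month in sorted(by_month):
        items = list(by_month[month])
        rng.shuffle(items)
        n_test = int(len(items) * test_fraction)
        splits[month] = {"train": items[n_test:], "test": items[:n_test]}
    return splits


if __name__ == "__main__":
    toy = [{"appeared": f"2018-{m:02d}", "id": i} for m in (1, 2) for i in range(10)]
    for month, parts in split_by_month(toy, test_fraction=0.2).items():
        print(month, len(parts["train"]), len(parts["test"]))
```

Each month then acts as one task in the sequence, with its own held-out test set.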
To reproduce the experiments, run the following scripts:
- Drebin Task-IL: run drebin_exps/Drebin_Task_IL.sh
- Drebin Class-IL: run drebin_exps/Drebin_Class_IL.sh
- EMBER Domain-IL: run ember_domain_exps/EMBER_Domain_IL.sh
- EMBER Class-IL: run ember_class_task_exps/EMBER_Class_IL.sh
- EMBER Task-IL: run ember_class_task_exps/EMBER_Task_IL.sh
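For readers new to these scenarios, the sketch below illustrates how Task-IL and Class-IL differ at test time under the usual definitions: in Task-IL the task identity is given, so only that task's classes compete, while in Class-IL the model must choose among all classes seen so far. It is not taken from the experiment scripts; the function names and the number of classes per task are assumptions for illustration.

```python
from typing import List, Optional, Sequence


def make_tasks(classes: Sequence[int], classes_per_task: int) -> List[List[int]]:
    """Split an ordered class list into consecutive tasks."""
    if classes_per_task <= 0:
        raise ValueError("classes_per_task must be positive")
    return [
        list(classes[i : i + classes_per_task])
        for i in range(0, len(classes), classes_per_task)
    ]


def candidate_classes(
    tasks: List[List[int]],
    current_task: int,
    scenario: str,
    task_id: Optional[int] = None,
) -> List[int]:
    """Classes the model may predict after training on tasks[0..current_task]."""
    if scenario == "task":
        # Task-IL: the task identity is provided, so only its classes compete.
        if task_id is None:
            raise ValueError("Task-IL requires a task_id at test time")
        return list(tasks[task_id])
    if scenario == "class":
        # Class-IL: no task identity; every class seen so far competes.
        return [c for task in tasks[: current_task + 1] for c in task]
    raise ValueError(f"unknown scenario: {scenario!r}")


if __name__ == "__main__":
    # 18 classes as in the Drebin setup; 2 classes per task is an assumed value.
    tasks = make_tasks(range(18), classes_per_task=2)
    print(candidate_classes(tasks, current_task=2, scenario="class"))
    print(candidate_classes(tasks, current_task=2, scenario="task", task_id=1))
```

In Domain-IL, by contrast, the set of output classes stays the same across tasks and only the input distribution changes from one month to the next.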
@inproceedings{continual-learning-malware,
title = {{On the Limitations of Continual Learning for Malware Classification}},
author = {Rahman, Mohammad Saidur and Coull, Scott E. and Wright, Matthew},
booktitle = {First Conference on Lifelong Learning Agents (CoLLAs)},
year = {2022},
publisher = {Proceedings of Machine Learning Research (PMLR)}
}
We thank our anonymous reviewers for helpful suggestions and comments. This work was funded by the National Science Foundation under Grant Number 1816851.