Code for the paper - Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graph

sigmoid-bar/vul-LMGGNN

 
 


Vul-LMGGNN

Code for the paper - Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graph

Introduction

In this work, we propose Vul-LMGNN, a unified model that combines pre-trained code language models with code property graphs for code vulnerability detection. Vul-LMGNN first constructs a code property graph, then uses a pre-trained code language model to extract local semantic features as the node embeddings of that graph. We further introduce a gated code Graph Neural Network (GNN). By jointly training the code language model and the gated code GNN modules, Vul-LMGNN leverages the strengths of both mechanisms. Finally, we use a pre-trained CodeBERT as an auxiliary classifier. The proposed method outperformed six state-of-the-art approaches.
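
To make the idea of a gated graph update more concrete, here is a minimal, dependency-free sketch of one gated message-passing step over node embeddings. It is an illustration only and not the code in this repository: the function name `gated_step`, the scalar weights, and the elementwise GRU-style gate are all assumptions chosen to keep the example short. In the actual model the node embeddings would come from the code language model and the weights would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_step(h, edges, w_msg=0.5, w_self=0.5, g_msg=1.0, g_self=1.0):
    """One simplified gated update (hypothetical helper, not the repo's API).

    h     : dict mapping node id -> list of floats (node embedding)
    edges : list of (src, dst) pairs; messages flow from src to dst
    The four scalar weights stand in for learned weight matrices.
    """
    dim = len(next(iter(h.values())))

    # Aggregate incoming messages by summing the neighbours' embeddings.
    msgs = {v: [0.0] * dim for v in h}
    for src, dst in edges:
        for i in range(dim):
            msgs[dst][i] += h[src][i]

    new_h = {}
    for v, hv in h.items():
        out = []
        for i in range(dim):
            # The gate z decides how much of the candidate replaces the old state.
            z = sigmoid(g_msg * msgs[v][i] + g_self * hv[i])
            cand = math.tanh(w_msg * msgs[v][i] + w_self * hv[i])
            out.append((1.0 - z) * hv[i] + z * cand)
        new_h[v] = out
    return new_h


# Toy graph: 0 -> 1 -> 2, with 2-dimensional embeddings.
h = {0: [0.1, -0.2], 1: [0.3, 0.4], 2: [0.0, 0.0]}
edges = [(0, 1), (1, 2)]
new_h = gated_step(h, edges)
```

Node 2 starts with an all-zero embedding and picks up a non-zero state purely from the message sent by node 1, which is the behaviour a graph layer is meant to provide.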

Getting Started

Create environment and install required packages for LMGGNN

Install packages

The experiments were executed on a single NVIDIA A100 80GB GPU. The system used NVIDIA driver version 525.85.12 and CUDA version 11.8.

Dataset

We evaluated the performance of our model using four publicly available datasets. The composition of the datasets is as follows, and you can click on the dataset names to download them. Please note that you need to modify the code in the CPG_generator function in run.py to adapt to different dataset formats.

| Dataset | #Vulnerable | #Non-Vulnerable | Source |
| --- | --- | --- | --- |
| DiverseVul | 18,945 | 330,492 | Snyk, Bugzilla |
| Devign | 11,888 | 14,149 | GitHub |
| VDSIC | 82,411 | 1,191,955 | GitHub, Debian |
| ReVeal | 1,664 | 16,505 | Chrome, Debian |
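
The datasets differ considerably in class balance, which is worth keeping in mind when adapting the pipeline to a new one. The short sketch below computes the share of vulnerable samples per dataset from the counts listed for each dataset. The dictionary and function names are illustrative, and the VDSIC non-vulnerable count is taken as 1,191,955.

```python
# Counts as (vulnerable, non_vulnerable), copied from the dataset table.
# Assumption: the VDSIC non-vulnerable count is 1,191,955.
DATASETS = {
    "DiverseVul": (18_945, 330_492),
    "Devign": (11_888, 14_149),
    "VDSIC": (82_411, 1_191_955),
    "ReVeal": (1_664, 16_505),
}


def vulnerable_fraction(vulnerable, non_vulnerable):
    """Return the share of vulnerable samples in a dataset."""
    return vulnerable / (vulnerable + non_vulnerable)


fractions = {
    name: vulnerable_fraction(*counts) for name, counts in DATASETS.items()
}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} vulnerable")
```

Devign is close to balanced, while fewer than one in ten samples is vulnerable in each of the other three datasets.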

Usage

Some tips:
  • Modifications to the configs.json structure should be updated in the configs.py script.
  • Joern processing may be slow or potentially freeze your OS, depending on your system’s specs. To prevent this, reduce the chunk size processed during the CPG_generation process by adjusting the "slice_size" value in the "create" section of the configs.json file.
  • Within the "slice_size" parameter, nodes exceeding the configured size limit will be filtered out and discarded.
  • Follow the instructions on Joern's documentation page and install Joern's command-line tools under 'project'\joern\joern-cli\.
  • You can find the implementation code of the baselines mentioned in the paper in the baselines.zip, which consists of four Jupyter notebooks.
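
As a rough illustration of the "slice_size" tip above, the sketch below lowers the value in a configs.json-style document and shows how a smaller slice size splits a dataset into more, smaller chunks. Only the "create" section and the "slice_size" key come from this README; the helper names, the example JSON, and the chunking function are assumptions and do not reproduce the repository's own code.

```python
import json


def set_slice_size(config_text, new_size):
    """Return config JSON text with create.slice_size set to new_size."""
    cfg = json.loads(config_text)
    cfg.setdefault("create", {})["slice_size"] = new_size
    return json.dumps(cfg, indent=2)


def iter_slices(items, slice_size):
    """Yield consecutive chunks of at most slice_size items."""
    for start in range(0, len(items), slice_size):
        yield items[start:start + slice_size]


# Hypothetical minimal config; the real configs.json has more sections.
example_config = '{"create": {"slice_size": 100}}'
updated = json.loads(set_slice_size(example_config, 25))

# 60 placeholder samples are split into chunks of 25, 25 and 10.
samples = list(range(60))
slices = list(iter_slices(samples, updated["create"]["slice_size"]))
```

Smaller chunks mean each Joern run handles fewer functions at a time, at the cost of more runs overall.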
Preparing the CPG:
python run.py -cpg -embed -mode train -path /your/model/path

-cpg runs Joern to extract the CPG of the code, and -embed generates the corresponding embeddings. -path specifies where the model is saved.

Training and Testing:
python run.py -mode test -path /your/model/saved/path

-mode specifies whether only the training process is executed or both the training and testing processes are performed. -path specifies the path where the model is saved.
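
For readers who want to see how flags of this shape can be handled, here is a small `argparse` sketch that accepts the same four options. It is not the parser from run.py: the function name `build_parser`, the help strings, the default values, and the restriction of -mode to "train" and "test" are assumptions based only on the commands shown in this README.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(
        description="Illustrative parser; not the repository's run.py."
    )
    # Boolean switches: present means the step is enabled.
    parser.add_argument("-cpg", action="store_true",
                        help="extract code property graphs with Joern")
    parser.add_argument("-embed", action="store_true",
                        help="generate embeddings for the extracted graphs")
    # Assumed choices, taken from the two example commands.
    parser.add_argument("-mode", choices=["train", "test"], default="train",
                        help="run training only, or training and testing")
    parser.add_argument("-path", default=None,
                        help="path where the model is saved")
    return parser


# Parse the CPG-preparation command line from this README.
args = build_parser().parse_args(
    ["-cpg", "-embed", "-mode", "train", "-path", "/your/model/path"]
)
```

Passing an explicit argument list to `parse_args` keeps the example independent of the real command line, which makes it easy to try in an interactive session.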

Fine-tuning process:

This command is used to fine-tune CodeBERT on a specific dataset and then generate embeddings for subsequent nodes. Pre-trained CodeBERT weights need to be downloaded from here.

python fine-tune.py

Main Results

Here only the accuracy results are displayed; for other metrics, please refer to the paper.

| Model | DiverseVul | VDSIC | Devign | ReVeal |
| --- | --- | --- | --- | --- |
| BERT | 91.99 | 79.41 | 60.58 | 86.88 |
| CodeBERT | 92.40 | 83.13 | 64.80 | 88.64 |
| GraphCodeBERT | 92.96 | 83.98 | 64.80 | 89.25 |
| TextCNN | 92.16 | 66.54 | 60.38 | 85.43 |
| TextGCN | 91.50 | 67.55 | 60.47 | 87.25 |
| Devign | 70.21 | 59.30 | 57.66 | 65.47 |
| Ours | 93.06 | 84.38 | 65.70 | 90.80 |

Acknowledgement

Parts of the code for data preprocessing and graph construction using Joern are adapted from Devign. We appreciate their excellent work!
