Code for the paper - Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graph
In this work, we propose Vul-LMGNN, a unified model that combines pre-trained code language models with code property graphs for code vulnerability detection. Vul-LMGNN constructs a code property graph, then uses a pre-trained code language model to extract local semantic features as node embeddings in that graph. Furthermore, we introduce a gated code Graph Neural Network (GNN). By jointly training the code language model and the gated code GNN modules, Vul-LMGNN efficiently leverages the strengths of both mechanisms. Finally, we use a pre-trained CodeBERT as an auxiliary classifier. The proposed method demonstrated superior performance compared to six state-of-the-art approaches.
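The description above does not spell out how the auxiliary classifier's output is combined with the GNN branch. One common option, assumed here purely for illustration, is a linear interpolation of the two class-probability vectors. The function names and the weight `lam` are hypothetical and are not taken from the repository.

```python
from typing import List, Sequence


def interpolate(gnn_probs: Sequence[float],
                lm_probs: Sequence[float],
                lam: float = 0.5) -> List[float]:
    """Blend two class-probability vectors; `lam` weights the GNN branch.

    Assumption: both vectors cover the same classes in the same order.
    """
    if len(gnn_probs) != len(lm_probs):
        raise ValueError("probability vectors must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return [lam * g + (1.0 - lam) * c for g, c in zip(gnn_probs, lm_probs)]


def predict(probs: Sequence[float]) -> int:
    """Return the index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up probabilities for one function: [non-vulnerable, vulnerable]
blended = interpolate([0.30, 0.70], [0.60, 0.40], lam=0.5)
print(blended, predict(blended))
```

With `lam = 1.0` only the GNN branch decides, and with `lam = 0.0` only the auxiliary classifier does, so the weight controls how much each branch contributes.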
Create environment and install required packages for LMGGNN
- transformers (=3.3.1)
The experiments were executed on a single NVIDIA A100 80GB GPU. The system specifications comprised NVIDIA driver version 525.85.12 and CUDA version 11.8.
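A quick way to confirm that the pinned package version is present in the active environment is sketched below. It assumes the requirement refers to the Hugging Face `transformers` package; the helper names are hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package: str, expected: str) -> bool:
    """Report whether `package` is installed at exactly `expected`."""
    found = installed_version(package)
    if found is None:
        print(f"{package} is not installed")
        return False
    if found != expected:
        print(f"{package} {found} is installed, expected {expected}")
        return False
    print(f"{package} {found} OK")
    return True


# Assumption: the package name on PyPI is "transformers".
check_pin("transformers", "3.3.1")
```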
We evaluated the performance of our model using four publicly available datasets. The composition of the datasets is as follows, and you can click on the dataset names to download them. Please note that you need to modify the code in the `CPG_generator` function in `run.py` to adapt to different dataset formats.
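One way to limit the changes needed per dataset is to normalize every dataset into the same record shape before it reaches the CPG generation step. The sketch below is an illustration only: `load_samples`, the output keys `func` and `target`, and the default column names are assumptions, so check them against the actual dataset files and against what `CPG_generator` expects.

```python
import json
from pathlib import Path
from typing import Dict, List, Union


def load_samples(path: Union[str, Path],
                 code_key: str = "func",
                 label_key: str = "target") -> List[Dict[str, object]]:
    """Read a .json (list of records) or .jsonl (one record per line) file
    and return records with the uniform keys "func" and "target".

    `code_key` and `label_key` name the source columns; both are assumptions.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # One JSON object per non-empty line
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        records = json.loads(text)
    return [{"func": str(r[code_key]), "target": int(r[label_key])} for r in records]
```

A dataset that stores its source code under `code` and its label under `label` would then be loaded with `load_samples("data.jsonl", code_key="code", label_key="label")`, where the file name is a placeholder.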
Dataset | #Vulnerable | #Non-Vulnerable | Source |
---|---|---|---|
DiverseVul | 18,945 | 330,492 | Snyk,Bugzilla |
Devign | 11,888 | 14,149 | GitHub |
VDSIC | 82,411 | 1,191,955 | GitHub, Debian |
ReVeal | 1,664 | 16,505 | Chrome, Debian |
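The counts in the table imply very different class balances across the datasets. The short calculation below derives the share of vulnerable samples for each one. It reads the VDSIC non-vulnerable count as 1,191,955, which is an interpretation of the digits given in the table.

```python
# (vulnerable, non-vulnerable) counts copied from the table above
datasets = {
    "DiverseVul": (18_945, 330_492),
    "Devign": (11_888, 14_149),
    "VDSIC": (82_411, 1_191_955),  # assumption: digits read as 1,191,955
    "ReVeal": (1_664, 16_505),
}


def vulnerable_share(vulnerable: int, non_vulnerable: int) -> float:
    """Fraction of samples labeled vulnerable."""
    return vulnerable / (vulnerable + non_vulnerable)


for name, (vul, non_vul) in datasets.items():
    print(f"{name:<11} {vulnerable_share(vul, non_vul):6.2%} vulnerable")
```

Only Devign is close to balanced; in the other three datasets fewer than one sample in ten is vulnerable, which is worth keeping in mind when reading accuracy figures.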
- Modifications to the `configs.json` structure should be updated in the `configs.py` script.
- Joern processing may be slow or potentially freeze your OS, depending on your system’s specs. To prevent this, reduce the chunk size processed during the CPG generation process by adjusting the `"slice_size"` value in the `"create"` section of the `configs.json` file.
- Within the `"slice_size"` parameter, nodes exceeding the configured size limit will be filtered out and discarded.
- Follow the instructions on Joern's documentation page and install Joern's command line tools under `'project'\joern\joern-cli\`.
- You can find the implementation code of the baselines mentioned in the paper in `baselines.zip`, which consists of four Jupyter notebooks.
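The `"slice_size"` adjustment mentioned in the notes above can be scripted. The sketch below assumes that `"create"` is a top-level key of `configs.json` and that `"slice_size"` sits directly inside it; both helper functions are hypothetical and only illustrate the idea of processing the data in smaller chunks.

```python
import json
from pathlib import Path
from typing import Iterator, List, Sequence, Union


def set_slice_size(config_path: Union[str, Path], new_size: int) -> int:
    """Set configs["create"]["slice_size"] and return the previous value."""
    if new_size <= 0:
        raise ValueError("slice size must be positive")
    path = Path(config_path)
    configs = json.loads(path.read_text(encoding="utf-8"))
    previous = configs["create"]["slice_size"]
    configs["create"]["slice_size"] = new_size
    path.write_text(json.dumps(configs, indent=4) + "\n", encoding="utf-8")
    return previous


def slices(items: Sequence, slice_size: int) -> Iterator[List]:
    """Yield consecutive chunks of at most `slice_size` items."""
    for start in range(0, len(items), slice_size):
        yield list(items[start:start + slice_size])
```

Calling `set_slice_size("configs.json", 50)` would lower the chunk size to 50, a value chosen arbitrarily here. Smaller chunks mean Joern handles fewer functions per invocation, at the cost of more invocations.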
python run.py -cpg -embed -mode train -path /your/model/path
`-cpg` and `-embed` respectively represent using `joern` to extract the code's `CPG` and generating the corresponding embeddings. `-path` is used to specify the path for saving the model.
python run.py -mode test -path /your/model/saved/path
`-mode` specifies whether only the training process is executed or both the training and testing processes are performed. `-path` specifies the path where the model is saved.
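For readers who want to see how the flags in the two commands above fit together, here is a minimal `argparse` sketch. It is not the parser from `run.py`: the allowed values for `-mode`, its default, and the help strings are assumptions based only on the commands shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Sketch of the run.py flags")
    parser.add_argument("-cpg", action="store_true",
                        help="extract code property graphs with Joern")
    parser.add_argument("-embed", action="store_true",
                        help="generate embeddings for the extracted graphs")
    # Assumption: only the two values that appear in the README are allowed.
    parser.add_argument("-mode", choices=["train", "test"], default="train",
                        help="which stage to run")
    parser.add_argument("-path", required=True,
                        help="where the model is saved")
    return parser


args = build_parser().parse_args(
    ["-cpg", "-embed", "-mode", "train", "-path", "/your/model/path"]
)
print(args)
```

Because `-cpg` and `-embed` are plain switches, leaving them out (as in the test command) skips graph extraction and embedding generation in this sketch.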
This command is used to fine-tune CodeBERT on a specific dataset and then generate embeddings for subsequent nodes. Pre-trained CodeBERT weights need to be downloaded from here.
python fine-tune.py
Here only the accuracy results are displayed; for other metrics, please refer to the paper.
Model | DiverseVul | VDSIC | Devign | ReVeal |
---|---|---|---|---|
BERT | 91.99 | 79.41 | 60.58 | 86.88 |
CodeBERT | 92.40 | 83.13 | 64.80 | 88.64 |
GraphCodeBERT | 92.96 | 83.98 | 64.80 | 89.25 |
TextCNN | 92.16 | 66.54 | 60.38 | 85.43 |
TextGCN | 91.50 | 67.55 | 60.47 | 87.25 |
Devign | 70.21 | 59.30 | 57.66 | 65.47 |
Ours | 93.06 | 84.38 | 65.70 | 90.80 |
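To make the table easier to read, the following snippet computes, for each dataset, the strongest baseline and the margin by which the proposed model exceeds it. All numbers are copied from the table above; the variable names are illustrative.

```python
columns = ["DiverseVul", "VDSIC", "Devign", "ReVeal"]

# Baseline accuracies (%) in the column order above
baselines = {
    "BERT": (91.99, 79.41, 60.58, 86.88),
    "CodeBERT": (92.40, 83.13, 64.80, 88.64),
    "GraphCodeBERT": (92.96, 83.98, 64.80, 89.25),
    "TextCNN": (92.16, 66.54, 60.38, 85.43),
    "TextGCN": (91.50, 67.55, 60.47, 87.25),
    "Devign": (70.21, 59.30, 57.66, 65.47),
}
# Accuracies (%) of the proposed model (last table row)
proposed = (93.06, 84.38, 65.70, 90.80)

margins = {}
for i, dataset in enumerate(columns):
    # Strongest baseline on this dataset (first one wins on ties)
    best_model = max(baselines, key=lambda m: baselines[m][i])
    best_acc = baselines[best_model][i]
    margins[dataset] = round(proposed[i] - best_acc, 2)
    print(f"{dataset:<11} best baseline: {best_model} ({best_acc:.2f}), "
          f"margin: {margins[dataset]:+.2f}")
```

The margins are in percentage points of accuracy; for the other metrics the paper remains the reference.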
Parts of the code for data preprocessing and graph construction using `Joern` are adapted from Devign. We appreciate their excellent work!