Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
Official implementation of "Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention" (Paper).
- 2025.02: add checkpoints for the TPAMI version
- 2024.10: our paper has been recognized as a "Best Paper Candidate" (Milano, Italy, ECCV 2024)
For simplicity, you can directly run bash install.sh, which includes the following steps:
- install pytorch 1.9.1 and other dependencies, e.g.,
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html # this might need to be changed due to cuda driver version
pip install -r requirements.txt
- install GroundingDINO and download pre-trained weights
cd GroundingDINO && python3 setup.py install
mkdir $PWD/GroundingDINO/weights/
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth -O $PWD/GroundingDINO/weights/groundingdino_swint_ogc.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth -O $PWD/GroundingDINO/weights/groundingdino_swinb_cogcoor.pth
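After the installation steps, a quick sanity check can confirm that both GroundingDINO checkpoints are where the training commands expect them. The following is a minimal sketch: the helper name `missing_weights` is hypothetical and not part of this repository, and the default path assumes the script is run from the repository root.

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED_WEIGHTS = (
    "groundingdino_swint_ogc.pth",
    "groundingdino_swinb_cogcoor.pth",
)


def missing_weights(weights_dir="GroundingDINO/weights"):
    """Return the expected checkpoint files that are absent or empty."""
    root = Path(weights_dir)
    missing = []
    for name in EXPECTED_WEIGHTS:
        path = root / name
        # An empty file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_weights()
    if absent:
        print("Missing checkpoints:", ", ".join(absent))
    else:
        print("All GroundingDINO checkpoints found.")
```

If the check reports a missing file, re-running the corresponding wget command should be enough.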
- VG150
- COCO
Prepare the datasets under the folder data, following the instructions.
To train OvSGTR (w. Swin-T) on VG150, run this command
bash scripts/DINO_train_dist.sh vg ./config/GroundingDINO_SwinT_OGC_full.py ./data ./logs/ovsgtr_vg_swint_full ./GroundingDINO/weights/groundingdino_swint_ogc.pth
or
bash scripts/DINO_train_dist.sh vg ./config/GroundingDINO_SwinB_full.py ./data ./logs/ovsgtr_vg_swinb_full ./GroundingDINO/weights/groundingdino_swinb_cogcoor.pth
to use the Swin-B backbone.
You might need to change the default CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 in the script to match your available devices.
Note that the actual batch size = batch size (default 4 in config files) * number of GPUs.
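As a worked example of the batch size rule above, the sketch below derives the actual batch size from a CUDA_VISIBLE_DEVICES string. The function name `effective_batch_size` is hypothetical, and the calculation assumes one training process per visible GPU.

```python
def effective_batch_size(per_gpu_batch_size, cuda_visible_devices):
    """Per-GPU batch size multiplied by the number of visible GPUs."""
    devices = [d for d in cuda_visible_devices.split(",") if d.strip()]
    return per_gpu_batch_size * len(devices)


# Defaults from the text: batch size 4 per GPU, eight visible devices.
print(effective_batch_size(4, "0,1,2,3,4,5,6,7"))  # 32
# The same config on two GPUs gives a much smaller actual batch.
print(effective_batch_size(4, "0,1"))  # 8
```

When training on fewer GPUs than the default eight, the actual batch size shrinks accordingly, which may affect how closely the reported numbers can be reproduced.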
For inference, run this command
bash scripts/DINO_eval.sh vg [config file] [data path] [output path] [checkpoint]
or
bash scripts/DINO_eval_dist.sh vg [config file] [data path] [output path] [checkpoint]
with multiple GPUs (the results output by DINO_eval.sh and DINO_eval_dist.sh differ slightly due to data dividing and gathering).
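One common way that dividing and gathering data can shift a metric is sketched below. It assumes the evaluation set is padded with repeated samples so that every GPU receives the same number of images, which is how PyTorch's DistributedSampler behaves by default; whether this repository's scripts do exactly this is an assumption. The scores and the helper `shard_indices` are purely illustrative.

```python
import math


def shard_indices(num_samples, world_size):
    """Split sample indices across ranks, padding by wrapping around so
    that every rank receives the same number of samples."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # duplicated padding samples
    return [indices[rank:total:world_size] for rank in range(world_size)]


def mean(values):
    return sum(values) / len(values)


# Hypothetical per-image recall scores for 10 images.
scores = [1.0, 0.0, 0.5, 0.5, 1.0, 0.0, 0.0, 1.0, 0.5, 0.0]

single_gpu = mean(scores)

# With 4 GPUs, 10 images are padded to 12, so images 0 and 1 count twice.
shards = shard_indices(len(scores), world_size=4)
gathered = [scores[i] for shard in shards for i in shard]
multi_gpu = mean(gathered)

print(shards)
print(single_gpu, multi_gpu)
```

In this toy case the two averages differ only because two images are counted twice after gathering; on a full test set the effect would be correspondingly small.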
backbone | R@20/50/100 | Checkpoint | Config |
---|---|---|---|
Swin-T | 26.97 / 35.82 / 41.38 | link | config/GroundingDINO_SwinT_OGC_full.py |
Swin-T (w. pre-trained on [MegaSG](https://arxiv.org/pdf/2411.15435) dataset) | 27.34 / 36.27 / 41.95 | link | config/GroundingDINO_SwinT_OGC_full.py |
Swin-B | 27.75 / 36.44 / 42.35 | link | config/GroundingDINO_SwinB_full.py |
Swin-B (w.o. frequency bias, focal loss) | 27.53 / 36.18 / 41.79 | link | config/GroundingDINO_SwinB_full_open.py |
Swin-B (w. pre-trained on MegaSG dataset) | 28.61 / 37.58 / 43.41 | link | config/GroundingDINO_SwinB_full_open.py |
For OvD-SGG mode, set sg_ovd_mode = True in the config file (e.g., config/GroundingDINO_SwinT_OGC_ovd.py).
Following "Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning" and VS3, we split VG150 into two parts, i.e., base objects in VG150_BASE_OBJ_CATEGORIES and novel objects in VG150_NOVEL2BASE.
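To illustrate what the base/novel object split means for evaluation, the sketch below separates ground-truth triplets according to whether they involve a novel object. The category sets here are toy stand-ins, not the contents of VG150_BASE_OBJ_CATEGORIES or VG150_NOVEL2BASE, and `involves_novel_object` is a hypothetical helper; it shows one plausible reading of the split rather than the repository's evaluation code.

```python
# Toy stand-ins; the real category lists live in the repository.
BASE_OBJECTS = {"man", "horse", "tree"}
NOVEL_OBJECTS = {"surfboard", "zebra"}


def involves_novel_object(triplet):
    """True if the subject or the object of (subj, predicate, obj) is novel."""
    subj, _, obj = triplet
    return subj in NOVEL_OBJECTS or obj in NOVEL_OBJECTS


triplets = [
    ("man", "riding", "horse"),
    ("man", "holding", "surfboard"),
    ("zebra", "near", "tree"),
]

base_only = [t for t in triplets if not involves_novel_object(t)]
novel = [t for t in triplets if involves_novel_object(t)]
print(len(base_only), len(novel))  # 1 2
```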
For PREDCLS, please set use_gt_box=True when calling the inference scripts.
backbone | R@20/50/100 (Base+Novel) | R@20/50/100 (Novel) | Checkpoint | Config |
---|---|---|---|---|
Swin-T | 12.34 / 18.14 / 23.20 | 6.90 / 12.06 / 16.49 | link | config/GroundingDINO_SwinT_OGC_ovd.py |
Swin-B | 15.43 / 21.35 / 26.22 | 10.21 / 15.58 / 19.96 | link | config/GroundingDINO_SwinB_ovd.py |
Swin-T (w. pre-trained on MegaSG dataset) | 14.33 / 20.91 / 25.98 | 10.52 / 17.30 / 22.90 | link | config/GroundingDINO_SwinT_OGC_ovd.py |
Swin-B (w. pre-trained on MegaSG dataset) | 15.21 / 21.21 / 26.12 | 10.31 / 15.78 / 20.47 | link | config/GroundingDINO_SwinT_OGC_ovd.py |
For OvR-SGG mode, set sg_ovr_mode = True in the config file (e.g., config/GroundingDINO_SwinT_OGC_ovr.py).
Base predicate categories VG150_BASE_PREDICATE and novel predicate categories VG150_NOVEL_PREDICATE can be found in datasets/vg.py.
backbone | R@20/50/100 (Base+Novel) | R@20/50/100 (Novel) | Checkpoint | Config | Pre-trained checkpoint | Pre-trained config |
---|---|---|---|---|---|---|
Swin-T | 15.85 / 20.50 / 23.90 | 10.17 / 13.47 / 16.20 | link | config/GroundingDINO_SwinT_OGC_ovr.py | | config/GroundingDINO_SwinT_OGC_pretrain.py |
Swin-B | 17.63 / 22.90 / 26.68 | 12.09 / 16.37 / 19.73 | link | config/GroundingDINO_SwinB_ovr.py | link | config/GroundingDINO_SwinB_pretrain.py |
Swin-T (pretrained on MegaSG) | 19.38 / 25.40 / 29.71 | 12.23 / 17.02 / 21.15 | link | config/GroundingDINO_SwinT_OGC_ovr.py | | config/GroundingDINO_SwinT_OGC_pretrain.py |
Swin-B (pretrained on MegaSG) | 21.09 / 27.92 / 32.74 | 16.59 / 22.86 / 27.73 | link | config/GroundingDINO_SwinB_ovr.py | | config/GroundingDINO_SwinB_pretrain.py |
For OvD+R-SGG mode, set both sg_ovd_mode = True and sg_ovr_mode = True (e.g., config/GroundingDINO_SwinT_OGC_ovdr.py).
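The two flags combine into four settings, summarized in the sketch below. Only the flag names sg_ovd_mode and sg_ovr_mode come from the config files; the helper `describe_mode` and the label used when neither flag is set are illustrative assumptions.

```python
def describe_mode(sg_ovd_mode=False, sg_ovr_mode=False):
    """Map the two config flags to the setting they select."""
    if sg_ovd_mode and sg_ovr_mode:
        return "OvD+R-SGG"
    if sg_ovd_mode:
        return "OvD-SGG"
    if sg_ovr_mode:
        return "OvR-SGG"
    # Label chosen for illustration; neither open-vocabulary flag is set.
    return "standard SGG"


for ovd in (False, True):
    for ovr in (False, True):
        print(f"sg_ovd_mode={ovd}, sg_ovr_mode={ovr} -> {describe_mode(ovd, ovr)}")
```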
backbone | R@20/50/100 (Joint) | R@20/50/100 (Novel Object) | R@20/50/100 (Novel Relation) | Checkpoint | Config | Pre-trained checkpoint | Pre-trained config |
---|---|---|---|---|---|---|---|
Swin-T | 10.02 / 13.50 / 16.37 | 10.56 / 14.32 / 17.48 | 7.09 / 9.19 / 11.18 | link | config/GroundingDINO_SwinT_OGC_ovdr.py | | config/GroundingDINO_SwinT_OGC_pretrain.py |
Swin-B | 12.37 / 17.14 / 21.03 | 12.63 / 17.58 / 21.70 | 10.56 / 14.62 / 18.22 | link | config/GroundingDINO_SwinB_ovdr.py | link | config/GroundingDINO_SwinB_pretrain.py |
Swin-T (pretrained on MegaSG) | 10.67 / 15.15 / 18.82 | 8.22 / 12.49 / 16.29 | 9.62 / 13.68 / 17.19 | link | config/GroundingDINO_SwinT_OGC_ovdr.py | | config/GroundingDINO_SwinT_OGC_pretrain.py |
Swin-B (pretrained on MegaSG) | 12.54 / 17.84 / 21.95 | 10.29 / 15.66 / 19.84 | 12.21 / 17.15 / 21.05 | link | config/GroundingDINO_SwinB_ovdr.py | | config/GroundingDINO_SwinB_pretrain.py |
We thank Scene-Graph-Benchmark.pytorch and GroundingDINO for their awesome code and models.
Please cite OvSGTR in your publications if it helps your research:
@inproceedings{chen2024expanding,
title={Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention},
author={Chen, Zuyao and Wu, Jinlin and Lei, Zhen and Zhang, Zhaoxiang and Chen, Changwen},
booktitle={European Conference on Computer Vision (ECCV)},
pages={108--124},
year={2024}
}