By Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki.
Official implementation of "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", accepted by ECCV 2022.
Note:
This is the code for the 3D BUTD-DETR. For the 2D version, check the branch bdetr2d.
We showcase the installation for CUDA 11.1 and torch==1.10.2, which is what we used for our experiments.
If you need to use a different version, you can try to modify environment.yml accordingly.
- Install environment:
conda env create -f environment.yml --name bdetr3d
- Activate environment:
conda activate bdetr3d
- Install torch:
pip install -U torch==1.10.2 torchvision==0.11.3 --extra-index-url https://download.pytorch.org/whl/cu111
- Compile the CUDA layers for PointNet++, which we used in the backbone network:
sh init.sh
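After the steps above, a quick sanity check of the installed versions can save time before compiling or training. The snippet below is a minimal sketch, not part of the repository: the helper name describe_torch is hypothetical, and it only reports what torch itself exposes.

```python
def describe_torch():
    """Report the installed torch version and CUDA build, or None if torch is missing.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,          # expected to start with 1.10.2 for this setup
        "cuda_build": torch.version.cuda,    # CUDA version torch was built against, e.g. 11.1
        "cuda_available": torch.cuda.is_available(),
    }


info = describe_torch()
if info is None:
    print("torch is not installed in this environment")
else:
    print(info)
```

If cuda_available is False, the CUDA layers compiled by init.sh will most likely not be usable at runtime.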
- Download ScanNet v2 data HERE. Let DATA_ROOT be the path to the folder that contains the downloaded annotations. Under DATA_ROOT there should be a folder scans. Under scans there should be folders with names like scene0001_01. We provide a script to download only the relevant annotations for our task. Run python scripts/download_scannet_files.py. Note that the original ScanNet script is written for python2.
- Download ReferIt3D annotations following the instructions HERE. Place all .csv files under DATA_ROOT/refer_it_3d/.
- Download ScanRefer annotations following the instructions HERE. Place all files under DATA_ROOT/scanrefer/.
- Download object detector's outputs. Unzip inside DATA_ROOT. Here is the group-free checkpoint we used to get these boxes in case you need it.
- Download span predictor's outputs inside DATA_ROOT: ScanRefer_train, ScanRefer_val, SR3D, NR3D.
- (optional) Download PointNet++ checkpoint into DATA_ROOT.
- Run python prepare_data.py --data_root DATA_ROOT, specifying your DATA_ROOT. This will create two .pkl files and only has to run once.
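Before running prepare_data.py, it can help to confirm that DATA_ROOT has the folder layout described above. The following is a minimal sketch under that description only: check_data_root is a hypothetical helper (not shipped with this code), and it checks just the scans, refer_it_3d and scanrefer folders, not the detector or span predictor outputs.

```python
import os


def check_data_root(data_root):
    """Return a list of layout problems; an empty list means the basic layout looks fine.

    Hypothetical helper: it only checks the folders named in the steps above.
    """
    problems = []

    # DATA_ROOT/scans should contain folders such as scene0001_01.
    scans = os.path.join(data_root, "scans")
    if not os.path.isdir(scans):
        problems.append("missing folder: scans")
    elif not any(
        name.startswith("scene") and os.path.isdir(os.path.join(scans, name))
        for name in os.listdir(scans)
    ):
        problems.append("no scene folders under scans")

    # DATA_ROOT/refer_it_3d should contain the ReferIt3D .csv files.
    referit = os.path.join(data_root, "refer_it_3d")
    if not os.path.isdir(referit) or not any(
        name.endswith(".csv") for name in os.listdir(referit)
    ):
        problems.append("no .csv files under refer_it_3d")

    # DATA_ROOT/scanrefer should contain the ScanRefer annotation files.
    scanrefer = os.path.join(data_root, "scanrefer")
    if not os.path.isdir(scanrefer) or not os.listdir(scanrefer):
        problems.append("no files under scanrefer")

    return problems
```

Calling check_data_root("/path/to/DATA_ROOT") and printing the result lists anything that still needs to be downloaded or moved.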
- sh scripts/train_test_det.sh to train/test BUTD-DETR. You need to modify the script by providing DATA_ROOT.
- sh scripts/train_test_cls.sh to train/test BUTD-DETR with ground-truth boxes (not classes). Again, you need to modify the script by providing DATA_ROOT.
The above scripts will run training and evaluation on SR3D. You can edit the following to customize training:
- Use TRAIN_DATASET (can be sr3d, nr3d, scanrefer, scannet, sr3d+) to change the training dataset.
- Use TEST_DATASET (does not have to be the same as TRAIN_DATASET) to change the validation dataset.
- Add --eval to skip training and just evaluate.
- To train on multiple datasets, e.g. on SR3D and NR3D simultaneously, set --TRAIN_DATASET sr3d nr3d.
- NR3D and ScanRefer need many more training epochs to converge. It's better to monitor the validation accuracy and decrease the learning rate accordingly. For example, in the det setup, we decrease the lr at epochs 80 and 90 for NR3D and at epoch 65 for ScanRefer. To disable automatic learning rate decay, you can remove --lr_decay_epochs from the train script and manually decrease the learning rate when the validation accuracy converges. Be sure to add the --reduce_lr flag when decreasing the learning rate and continuing from a checkpoint, so that the optimizers are loaded correctly.
- (Optional) To train a span predictor, cd src and run python text_cls.py --dataset DATASET.
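To make the learning rate schedule above concrete, here is a minimal sketch of step decay at fixed epochs. It is an illustration, not the training code: lr_at_epoch is a hypothetical helper, and both the base learning rate of 1e-4 and the decay factor of 0.1 are assumptions, since neither is stated here.

```python
def lr_at_epoch(base_lr, epoch, decay_epochs, gamma=0.1):
    """Learning rate after applying one decay step per milestone already reached.

    gamma (the decay factor) is an assumed value; check the train script for the real one.
    """
    steps = sum(1 for milestone in decay_epochs if epoch >= milestone)
    return base_lr * gamma ** steps


# NR3D-style schedule from the text: decay at epochs 80 and 90.
for epoch in (0, 79, 80, 89, 90):
    print(epoch, lr_at_epoch(1e-4, epoch, decay_epochs=[80, 90]))
```

With these assumed values, the learning rate stays at its base value until epoch 79, drops by a factor of 10 at epoch 80, and by another factor of 10 at epoch 90.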
Download our checkpoints for SR3D_det, NR3D_det, ScanRefer_det, SR3D_cls, NR3D_cls. Add --checkpoint_path CKPT_NAME to the above scripts in order to utilize the stored checkpoints.
Note that these checkpoints were stored while using DistributedDataParallel. To use these checkpoints without DistributedDataParallel, take a look here.
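A common way to handle this is to strip the module. prefix that DistributedDataParallel adds to every parameter name. The snippet below is a minimal sketch of that idea: strip_ddp_prefix is a hypothetical helper, and the assumption that the checkpoint keys carry a module. prefix should be verified against the actual files.

```python
from collections import OrderedDict


def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with the DistributedDataParallel prefix removed.

    Keys that do not start with the prefix are kept unchanged.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Hypothetical usage with torch (the "model" key inside the checkpoint is an assumption):
#   ckpt = torch.load(CKPT_NAME, map_location="cpu")
#   model.load_state_dict(strip_ddp_prefix(ckpt["model"]))
```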
Lastly, we also release the checkpoints for span prediction (ScanRefer, SR3D, NR3D).
- For each object query, we compute per-token confidence scores and regress bounding boxes.
- Given a target span, we keep the most confident query for it. This is our model's best guess.
- We compute the IoU of the corresponding box and the ground-truth box.
- We check whether this IoU is greater than the thresholds (0.25, 0.5).
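The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the evaluation code of this repository: boxes are assumed to be axis-aligned and given as (xmin, ymin, zmin, xmax, ymax, zmax), the per-token scores are assumed to be already reduced to one confidence per query for the target span, and the function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes; returns 0.0 when they do not overlap."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    return inter / (box_volume(box_a) + box_volume(box_b) - inter)


def grounding_hits(query_scores, query_boxes, gt_box, thresholds=(0.25, 0.5)):
    """Keep the most confident query and test its IoU against each threshold."""
    best = max(range(len(query_scores)), key=lambda i: query_scores[i])
    iou = box_iou_3d(query_boxes[best], gt_box)
    return {t: iou > t for t in thresholds}


# Toy example with made-up numbers: two queries, one ground-truth box.
hits = grounding_hits(
    query_scores=[0.9, 0.1],
    query_boxes=[(0, 0, 0, 2, 2, 1), (5, 5, 5, 6, 6, 6)],
    gt_box=(0, 0, 0, 2, 2, 2),
)
print(hits)
```

Averaging the per-threshold hits over all test utterances gives the accuracy at each IoU threshold.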
Parts of this code were based on the codebase of Group-Free. The loss implementation (Hungarian matching and criterion class) is based on the codebase of MDETR.
If you find BUTD-DETR useful in your research, please consider citing:
@misc{https://doi.org/10.48550/arxiv.2112.08879,
doi = {10.48550/ARXIV.2112.08879},
url = {https://arxiv.org/abs/2112.08879},
author = {Jain, Ayush and Gkanatsios, Nikolaos and Mediratta, Ishita and Fragkiadaki, Katerina},
keywords = {Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), FOS: Computer and information sciences},
title = {Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds},
publisher = {arXiv},
year = {2021},
copyright = {Creative Commons Attribution 4.0 International}
}
The majority of BUTD-DETR code is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: MDETR is licensed under the Apache 2.0 license, and Group-Free is licensed under the MIT license.