LOVA3: Learning to Visual Question Answering, Asking and Assessment


Paper PDF · Project Page · Models · EvalQABench Dataset

TL;DR: LOVA3 is a new training paradigm that advances multimodal training by incorporating two new capabilities: asking questions and assessing VQA triplets.

📢 Update

  • [03/03/2025] We release four additional models from the paper for testing. Have fun!
  • [10/16/2024] We release the webpage.
  • [09/26/2024] LOVA3 is accepted by NeurIPS 2024.
  • [07/01/2024] Related work Genixer is accepted by ECCV 2024.
  • [05/24/2024] We release the code of LOVA3, the EvalQABench, the training dataset Mixed_VQA_GenQA_EvalQA_1.5M.jsonl, and the checkpoint LOVA3-llava-v1.5-7b.
  • [05/23/2024] We release the LOVA3 paper.

⚒️ Install

conda create -n LOVA python=3.10
conda activate LOVA
pip install --upgrade pip
pip install -e .
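
Note that pip install -e . must be run from the repository root, so clone the repo first if you have not already (the URL below is inferred from the repository name showlab/LOVA3):

git clone https://github.com/showlab/LOVA3.git
cd LOVA3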

Model weights

| Model Name | Size | Checkpoint | EvalQA Data Filtered By |
|---|---|---|---|
| LOVA3-llava-v1.5-7b | 7B | checkpoint | Fuyu-8B |
| LOVA3-llava-v1.5-7b-gemini | 7B | checkpoint | Gemini-1.5-Flash |
| LOVA3-llava-v1.5-phi1.5-baseline | 1.5B | checkpoint | - |
| LOVA3-llava-v1.5-phi1.5-fuyu | 1.5B | checkpoint | Fuyu-8B |
| LOVA3-llava-v1.5-phi1.5-gemini | 1.5B | checkpoint | Gemini-1.5-Flash |

Download from Hugging Face:

git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b
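
Hugging Face model repositories store the weights via Git LFS, so enable Git LFS before cloning (otherwise only pointer files are fetched). Cloning directly into checkpoints/ matches the layout expected by the Evaluation section below; a minimal sketch:

git lfs install
git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b checkpoints/LOVA3-llava-v1.5-7b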

Data

Data JSON: the mixed training annotation file Mixed_VQA_GenQA_EvalQA_1.5M.jsonl (downloaded under the folder data, as described in the Training section below).

Image Datasets

Please download the images from the constituent source datasets referenced by the annotation file.
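
Once the annotation file is in place under data/, a quick sanity check of the download (a minimal sketch; python -m json.tool simply pretty-prints the first record, no particular schema is assumed):

wc -l data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl
head -n 1 data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl | python -m json.tool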

💃 Evaluation

  1. Download LOVA3-llava-v1.5-7b and place it under the folder checkpoints.

  2. Download the CLIP vision encoder clip-vit-large-patch14-336 and place it under the folder checkpoints (see the download sketch below).

  3. Run the evaluation scripts under the folder scripts/v1_5/eval. They cover 12 multimodal datasets and benchmarks.
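
For step 2, the vision encoder is the standard OpenAI CLIP release on Hugging Face; step 1 is the clone shown in the Model weights section above. A minimal download sketch:

git clone https://huggingface.co/openai/clip-vit-large-patch14-336 checkpoints/clip-vit-large-patch14-336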

Taking VizWiz as an example, the evaluation commands are as follows:

modelname=LOVA3-llava-v1.5-7b

python -m llava.eval.model_vqa_loader \
    --model-path checkpoints/$modelname \
    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --image-folder /yourpath/vizwiz/test/ \
    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

python scripts/convert_vizwiz_for_submission.py \
    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json

Training

  1. Download the pretrained MLP adapter weights llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and put them under the folder checkpoints (see the download sketch below).

  2. Download the CLIP vision encoder clip-vit-large-patch14-336 and put it under the folder checkpoints.

  3. Download the language model weights vicuna-7b-v1.5 and put them under the folder checkpoints.

  4. Download the training data Mixed_VQA_GenQA_EvalQA_1.5M.jsonl and put it under the folder data.

  5. Run the training script:

bash scripts/v1_5/finetune.sh
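
A hedged download sketch for steps 1-4: the Vicuna and CLIP repository IDs below are the standard Hugging Face ones, while the MLP projector is assumed to be the LLaVA-released liuhaotian repository (verify the exact repo before relying on it), and the training JSONL comes from the link in the Data section above.

mkdir -p checkpoints data
git lfs install
# Step 1: pretrained MLP projector (assumed to be the LLaVA release)
git clone https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
# Step 2: CLIP vision encoder
git clone https://huggingface.co/openai/clip-vit-large-patch14-336 checkpoints/clip-vit-large-patch14-336
# Step 3: Vicuna-7B language model
git clone https://huggingface.co/lmsys/vicuna-7b-v1.5 checkpoints/vicuna-7b-v1.5
# Step 4: place Mixed_VQA_GenQA_EvalQA_1.5M.jsonl under data/ (download link in the Data section above)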

🙏 Acknowledgement

  • LLaVA: The codebase we built upon.
  • LAVIS: We use its scripts to download some of the datasets.

🎓 Citation

If you find LOVA3 useful, please cite using this BibTeX:

@misc{zhao2024lova3learningvisualquestion,
      title={LOVA3: Learning to Visual Question Answering, Asking and Assessment}, 
      author={Henry Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},
      year={2024},
      eprint={2405.14974},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2405.14974}, 
}