TL;DR: LOVA3 is a new training paradigm that advances multimodal models by adding two new capabilities beyond answering: asking questions and assessing VQA triplets.
- [03/03/2025] We update the repo with four models from the paper for testing, have fun!
- [10/16/2024] We release the webpage.
- [09/26/2024] LOVA3 is accepted by NeurIPS 2024.
- [07/01/2024] Related work Genixer is accepted by ECCV 2024.
- [05/24/2024] We release the code of LOVA3, EvalQABench, the training dataset Mixed_VQA_GenQA_EvalQA_1.5M.jsonl, and the checkpoint LOVA3-llava-v1.5-7b.
- [05/23/2024] We release the LOVA3 paper.
```bash
conda create -n LOVA python=3.10
conda activate LOVA
pip install --upgrade pip
pip install -e .
```
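To confirm the editable install worked, a minimal sanity check is to import the package (the `llava` module name is inferred from the evaluation commands later in this README):

```bash
# Prints a confirmation message if the editable install succeeded.
python -c "import llava; print('llava imported successfully')"
```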
Model Name | Size | Checkpoint | EvalQA Data Filtered By |
---|---|---|---|
LOVA3-llava-v1.5-7b | 7B | checkpoint | Fuyu-8B |
LOVA3-llava-v1.5-7b-gemini | 7B | checkpoint | Gemini-1.5-Flash |
LOVA3-llava-v1.5-phi1.5-baseline | 1.5B | checkpoint | - |
LOVA3-llava-v1.5-phi1.5-fuyu | 1.5B | checkpoint | Fuyu-8B |
LOVA3-llava-v1.5-phi1.5-gemini | 1.5B | checkpoint | Gemini-1.5-Flash |
Download from Hugging Face:

```bash
git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b
```
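If you prefer not to use git-lfs, a hedged alternative (assuming a recent `huggingface_hub` with its CLI installed) is to download the checkpoint directly into the `checkpoints` folder expected by the evaluation scripts:

```bash
# Assumes `pip install -U huggingface_hub`; downloads the 7B checkpoint into checkpoints/.
huggingface-cli download hhenryz/LOVA3-llava-v1.5-7b --local-dir checkpoints/LOVA3-llava-v1.5-7b
```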
- Training Data: Mixed_VQA_GenQA_EvalQA_1.5M.jsonl (a quick way to inspect a record is shown below).
- EvalQABench Data: EvalQABench
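As a quick sketch of how to inspect the training mixture, the first record of the JSONL can be pretty-printed; this assumes the file has been placed under `data/`, as described in the training section below:

```bash
# Each line of the .jsonl file is one JSON record; print the first one to see its fields.
head -n 1 data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl | python -m json.tool
```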
Please download the images from the constituent datasets:
- COCO: train2014 (see the example command after this list)
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg`
- AOKVQA: download script
- TextVQA: train_val_images
- VisualGenome: part1, part2
- LLaVA-Instruct: huggingface
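For example, the COCO train2014 images can be fetched from the official COCO mirror as sketched below; the `/yourpath/coco` destination is only a placeholder, so point it at wherever you keep your image folders:

```bash
# Download and unpack COCO train2014; adjust /yourpath/coco to your image root.
mkdir -p /yourpath/coco
wget http://images.cocodataset.org/zips/train2014.zip -P /yourpath/coco
unzip /yourpath/coco/train2014.zip -d /yourpath/coco
```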
- Download LOVA3-llava-v1.5-7b under the folder `checkpoints`.
- Download the CLIP vision encoder clip-vit-large-patch14-336 under the folder `checkpoints`.
- Run the evaluation scripts under the folder `scripts/v1_5/eval`. There are 12 multimodal datasets and benchmarks awaiting evaluation.
Taking VizWiz as an example, the commands are as follows:
```bash
modelname=LOVA3-llava-v1.5-7b

# Step 1: generate answers on the VizWiz test split.
python -m llava.eval.model_vqa_loader \
    --model-path checkpoints/$modelname \
    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --image-folder /yourpath/vizwiz/test/ \
    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

# Step 2: convert the answers into the VizWiz submission format.
python scripts/convert_vizwiz_for_submission.py \
    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json
```
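The same two-step pipeline works for the other checkpoints in the table above; as a hedged sketch (assuming the `vicuna_v1` conversation template also applies to the Gemini-filtered 7B model), only the model name changes:

```bash
# Generate VizWiz answers for both 7B checkpoints with the same settings as above.
for modelname in LOVA3-llava-v1.5-7b LOVA3-llava-v1.5-7b-gemini; do
    python -m llava.eval.model_vqa_loader \
        --model-path checkpoints/$modelname \
        --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
        --image-folder /yourpath/vizwiz/test/ \
        --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
        --temperature 0 \
        --conv-mode vicuna_v1
done
```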
- Download the pretrained MLP adapter weights llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and put them under the folder `checkpoints`.
- Download the model weight clip-vit-large-patch14-336 under the folder `checkpoints`.
- Download the model weight vicuna-7b-v1.5 under the folder `checkpoints`.
- Download the training data Mixed_VQA_GenQA_EvalQA_1.5M.jsonl under the folder `data`.
- Run the training script:

```bash
bash scripts/v1_5/finetune.sh
```
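Before launching, it is worth checking that the paths inside `scripts/v1_5/finetune.sh` point at the assets downloaded above. The flag names below follow the LLaVA-1.5 training script that LOVA3 builds on and are assumptions here, so verify them against the actual script:

```bash
# List the path-related arguments in the training script so they can be checked and edited.
grep -nE "model_name_or_path|data_path|image_folder|vision_tower|pretrain_mm_mlp_adapter" scripts/v1_5/finetune.sh
```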
If you find LOVA3 useful, please cite using this BibTeX:
```bibtex
@misc{zhao2024lova3learningvisualquestion,
      title={LOVA3: Learning to Visual Question Answering, Asking and Assessment},
      author={Henry Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},
      year={2024},
      eprint={2405.14974},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2405.14974},
}
```