Skip to content

Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective (ACL 2024)

License

Notifications You must be signed in to change notification settings

yuezih/less-is-more

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Checklist

What you can find in this repo:

  • Selective EOS Supervision
    • training code
    • trained model checkpoints
  • Scoring EOS Supervision
    • data filtering code
    • filtered data
  • CHAIR evaluation
    • evaluation scripts
    • our test set data
  • others

Selective EOS Supervision

Training

Following the instruction of LLaVA to prepare the environment, data (LLaVA-Instruction-150K) and pretraining models (e.g., LLaVA-1.5-7b).

Train the model with Selective EOS Supervision. The default configuration is set to train the llava-1.5-7b model with Detail23k for one epoch.

cd LLaVA
bash scripts/v1_5/selective_eos_finetune.sh

The main modifications to the original LLaVA code for Selective EOS Supervision are detailed in ./assets/selective-eos-supervision.md.

Checkpoint

Our models (LoRA weights) finetuned with Selective EOS Supervision:

Basic Model Finetuning Data Checkpoint
llava-1.5-7b Detail23k llava-v1.5-7b-selective-23k-lora
llava-1.5-7b LLaVA-Instruction-150K llava-v1.5-7b-selective-150k-lora

Scoring EOS Supervision

Data Scoring

For the LLaVA codebase, due to some constraints related to deepspeed, currently I have no idea about how to efficiently score a dataset with a standalone script. Our scoring relies on the training process, i.e., for each training step:

  • Score the data in the minibatch and save the scores;
  • Cancel loss backward (can be achieved by modifying the trainer code).

The core code for data scoring are provided in ./LLaVA/llava/model/language_model/llava_llama_filter.py.

Filtered Data

Our data filtered with Scoring EOS Supervision:

Basic Data Filtered Data
LLaVA-Instruction-150K LLaVA-Instruction-150K-filtered [OneDrive]

Training

Instruction tune the LLaVA-7b model on our filtered data with:

cd LLaVA
bash scripts/finetune_qlora_filtered.sh

CHAIR Evaluation

Data

The test set used in our paper for CHAIR evaluation is provided in ./CHAIR-eval/data/chair-500.jsonl. The data is randomly sampled from the MSCOCO validation set with a random seed of 0.

CHAIR Images

We provide two ways to collect test set images:

  • a python script to collect images from the original MSCOCO images with softlinks. Please specify the path of your own MSCOCO image path. The script will create a folder ./CHAIR-eval/data/chair-500 for the CHAIR images.
    python ./CHAIR-eval/prepare_data.py
  • a OneDrive link to download the 500 images. Unzip the images to ./CHAIR-eval/data/chair-500.

MSCOCO Annotation

Use the following command to download the annotation files of MSCOCO detection, which will be used for CHAIR evaluation:

cd ./CHAIR-eval/data
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
mkdir MSCOCO
unzip -d MSCOCO/annotation annotations_trainval2014.zip

Evaluation

We provide a script for CHAIR inference and evaluation.
Set your model in the following script and then run it:

bash ./CHAIR-eval/eval.sh

MODEL_NAME: lora weights, e.g., yuezih/llava-v1.5-7b-selective-23k-lora
MODEL_BASE: base model checkpoint, e.g., liuhaotian/llava-v1.5-7b

The first-time evaluation can be slow because of the ground-truth object set construction. Subsequent evaluations will be faster with the cache.

Citation

If you find this repo helpful, please consider citing our paper:

@misc{yue2024less,
      title={Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective}, 
      author={Zihao Yue and Liang Zhang and Qin Jin},
      year={2024},
      eprint={2402.14545},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgement

This repo is built on LLaVA (models) and OPERA (CHAIR evaluation). Many thanks for their efforts. The use of our code should also follow the original licenses.

About

Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective (ACL 2024)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published