GitHub - haoyiq114/VALOR: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models (ACL-Findings 2024)

VALOR-Eval: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

Haoyi Qiu*, Wenbo Hu*, Zi-Yi Dou, Nanyun Peng

University of California, Los Angeles

*Equal contribution, listed in alphabetical order by first name.

Introduction

🔍 Large Vision Language Models (LVLMs) suffer from hallucination problems, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability.

📚 A comprehensive quantitative evaluation is necessary to identify and understand the extent of hallucinations in these models. However, existing benchmarks are often limited in scope, focusing mainly on object hallucinations. Furthermore, current evaluation methods struggle to effectively address the subtle semantic distinctions between model outputs and reference data, as well as the balance between hallucination and informativeness.

📌 To address these issues, we introduce a multi-dimensional benchmark (VALOR-Bench) covering objects, attributes, and relations, with challenging images selected based on associative bias.

⚖️ Moreover, we propose an LLM-based two-stage evaluation framework (VALOR-Eval) that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation.
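As a rough illustration of the two quantities, the sketch below computes faithfulness and coverage from a set of generated features and a set of reference features. It is a simplification under stated assumptions, not the repository's implementation: exact string matching stands in for the LLM-based matching, faithfulness is taken as the share of generated features supported by the reference, and coverage as the share of reference features the output mentions. The function name `faithfulness_and_coverage` and the example features are hypothetical.

```python
def faithfulness_and_coverage(generated, reference):
    """Return (faithfulness, coverage) for two collections of feature strings.

    ASSUMPTION: exact (case-insensitive) string matching replaces the
    LLM-based matching stage; the definitions below are a CHAIR-style
    simplification, not the repository's code.
    """
    gen = {g.strip().lower() for g in generated}
    ref = {r.strip().lower() for r in reference}
    matched = gen & ref

    # Faithfulness: fraction of generated features found in the reference
    # (1 minus a CHAIR-style hallucination rate). Empty output -> 1.0 by convention.
    faithfulness = len(matched) / len(gen) if gen else 1.0

    # Coverage: fraction of reference features the output mentions.
    # Empty reference -> 1.0 by convention.
    coverage = len(matched) / len(ref) if ref else 1.0
    return faithfulness, coverage


# Hypothetical example: "car" is hallucinated; "grass" and "person" are missed.
generated = ["dog", "frisbee", "tree", "car"]
reference = ["dog", "frisbee", "grass", "person", "tree"]
f, c = faithfulness_and_coverage(generated, reference)
print(f"faithfulness={f:.2f} coverage={c:.2f}")  # faithfulness=0.75 coverage=0.60
```

Reporting both numbers matters because either one can be inflated alone: a very short caption can be perfectly faithful while covering little, and a long caption can cover everything while hallucinating freely.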

🚧 We evaluate 10 established LVLMs within our framework, and the results show that it provides a more comprehensive and human-correlated evaluation than existing work.

The results of 10 mainstream LVLMs evaluated by VALOR-Eval.

🌟 Through this work, we highlight the critical balance between faithfulness and coverage of model outputs, and we hope our work encourages future progress on addressing hallucinations in LVLMs while keeping their outputs informative.

Start with Our Code

In a Linux environment, clone this repository and navigate to the VALOR folder:

git clone https://github.com/haoyiq114/VALOR
cd VALOR

Install Package

conda create -n valor python=3.10 -y
conda activate valor
pip install --upgrade pip
pip install -e .

Dataset

Please refer to the datasets folder to prepare the images used in our benchmark.

Evaluation

Prepare your OpenAI API key and set it up here.

To run inference with the 10 LVLMs evaluated in the VALOR benchmark, please refer to model-generation. Note that some of the models need to be downloaded and installed separately; for details, please refer to their official implementation pages.

To evaluate your own model on our benchmark, first obtain the generated captions from your model, then name the output file following the template used in our scripts. For example, to run the evaluation on object existence, name the output file "your_model_name_long_caps.json", then set the model name here:

evaluated_model="your_model_name"

Then run:

bash scripts/evaluate_object_existence.sh
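A small helper along the following lines could produce a caption file with the expected name. This is only a sketch: the "<model>_long_caps.json" pattern comes from the example above, but the JSON layout (a mapping from image ID to caption) and the names `caption_file_name` and `save_captions` are assumptions for illustration. Check the evaluation scripts for the schema they actually read.

```python
import json
import tempfile
from pathlib import Path


def caption_file_name(model_name: str) -> str:
    # Follows the "<model>_long_caps.json" template quoted in the text above.
    return f"{model_name}_long_caps.json"


def save_captions(captions: dict, model_name: str, out_dir: str = ".") -> Path:
    """Write captions to <out_dir>/<model_name>_long_caps.json and return the path.

    ASSUMPTION: an {image_id: caption} mapping is a hypothetical layout;
    the repository's scripts may expect a different structure.
    """
    path = Path(out_dir) / caption_file_name(model_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(captions, f, ensure_ascii=False, indent=2)
    return path


# Usage with made-up data, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = save_captions({"img_0001": "A dog catches a frisbee."}, "your_model_name", tmp)
    print(out.name)  # your_model_name_long_caps.json
```

The name passed to `save_captions` should be the same string assigned to `evaluated_model`, so that the script can locate the file.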

Citation

If you find this work useful, consider giving this repository a star and citing our paper as follows:

@misc{qiu2024valoreval,
      title={VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models}, 
      author={Haoyi Qiu and Wenbo Hu and Zi-Yi Dou and Nanyun Peng},
      year={2024},
      eprint={2404.13874},
      archivePrefix={arXiv}
}

