BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models (ECCV 2024)

Authors: Moon Ye-Bin*, Nam Hyeon-Woo*, Wonseok Choi, Tae-Hyun Oh

Project Page | Dataset | YouTube | Paper

This repository is the official implementation of the ECCV 2024 paper, "BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models". The key idea of the BEAF benchmark is to manipulate visual scene information and to design metrics based on how the model's answers change as the scene changes.

Abstract: Large vision language models (LVLMs) perceive the world through a combination of a visual encoder and large language models (LLMs). The visual encoder, pre-trained on large-scale vision-text datasets, provides zero-shot generalization to visual data, and LLMs endow the high reasoning ability to LVLMs. It leads LVLMs to achieve high performance on wide benchmarks without fine-tuning, known as zero or few-shot capability of LLMs. However, recent studies show that LVLMs are vulnerable to hallucination. This undesirable behavior degrades reliability and credibility, thereby making users unable to fully trust the output from LVLMs. To enhance trustworthiness and better tackle the hallucination of LVLMs, we curate a new evaluation dataset, called the BEfore-AFter hallucination dataset (BEAF), and introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID). Unlike prior works that focus only on constructing questions and answers, the key idea of our benchmark is that we manipulate visual scene information by image editing models and design the metrics based on scene changes. This allows us to clearly assess whether LVLMs correctly understand a given scene by observing the ability to perceive changes. We also visualize the correctness heatmap by virtue of our two-axis view: vision and text. Upon evaluating LVLMs with our dataset, we observed that our metrics can reveal different aspects of LVLM hallucination.

Evaluation

1. BEAF dataset download

* [07/18] We released BEAF dataset ver0; it will be re-filtered and refined as ver1 soon.
* [10/15] We updated the BEAF dataset to ver1, and the QnA JSON file is updated accordingly. Note that the numbers of images and questions differ from ver0 (the values reported in the paper): ver1 contains 2,224 images and 26,064 image-question pairs.

Original + Manipulated images: download from here
The original images are sourced from the COCO dataset

2. Get your model's answer

Image names, questions, GT answers, and additional metadata are in the ./beaf_qna.json file.
The format and questions of BEAF are inspired by the POPE dataset.

The model output should be organized in a JSON file in the following format:

[
  {"id": 0, "answer": "No."}, 
  {"id": 1, "answer": "Yes."}, 
  ... 
  {"id": 26063, "answer": "No."}
]

Please refer to answer_gpt4o.json as an example of a model answer.
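A minimal sketch of producing a file in this format is shown below. It assumes the model replies are already collected in a list ordered by question id, and that replies should be reduced to "Yes." or "No." as in the example above. The helper names (normalize_answer, build_answer_records, save_answers) are hypothetical and not part of this repository, and treating every reply that does not start with "yes" as "No." is a simplification you may want to change.

```python
import json


def normalize_answer(raw: str) -> str:
    """Map a free-form reply to "Yes." or "No." (hypothetical helper).

    Assumption: any reply that does not start with "yes" counts as "No.".
    """
    text = raw.strip().lower()
    return "Yes." if text.startswith("yes") else "No."


def build_answer_records(raw_answers):
    """Turn replies, ordered by question id, into {"id", "answer"} records."""
    return [
        {"id": idx, "answer": normalize_answer(reply)}
        for idx, reply in enumerate(raw_answers)
    ]


def save_answers(records, path):
    """Write the records as a JSON list, matching the format shown above."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```

For example, save_answers(build_answer_records(replies), "your_model_answer.json") would write one record per question, provided replies holds one entry per id starting from 0.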

3. Evaluation

Compute both our metrics (TU, IG, SBp, SBn, ID) and traditional metrics (Accuracy, Precision, Recall, F1)
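The change-based metrics can be illustrated with a small sketch. It assumes, based on the metric names and the paper's description, that for a question about the removed object the ground truth is "yes" on the original image and "no" on the manipulated image, so that TU means both answers are correct, IG means both are wrong, and SBp/SBn mean the answer stays "yes" or "no". The function names are hypothetical, ID (answer changes on questions about unmanipulated objects) is omitted, and beaf_metric.py remains the reference for the actual computation.

```python
from collections import Counter


def is_yes(answer: str) -> bool:
    """Treat any reply starting with "yes" as a positive answer."""
    return answer.strip().lower().startswith("yes")


def categorize_pair(before: str, after: str) -> str:
    """Classify one before/after answer pair about the removed object.

    Assumed ground truth: "yes" before the edit, "no" after it.
    """
    yes_before, yes_after = is_yes(before), is_yes(after)
    if yes_before and not yes_after:
        return "TU"   # correct on both images
    if not yes_before and yes_after:
        return "IG"   # wrong on both images
    # The answer did not change: stubbornly "yes" or stubbornly "no".
    return "SBp" if yes_before else "SBn"


def pair_rates(pairs):
    """Return the fraction of pairs falling into each category."""
    counts = Counter(categorize_pair(b, a) for b, a in pairs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts[name] / total for name in ("TU", "IG", "SBp", "SBn")}
```

Because every pair falls into exactly one of the four categories, the four rates in this sketch sum to 1.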

Run beaf_metric.py with:

python beaf_metric.py --model-answers {your_model_answer.json}
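As a quick cross-check on the traditional metrics, the sketch below computes accuracy, precision, recall, and F1 for yes/no answers. It assumes "yes" is the positive class and that predictions and ground-truth answers are given as two aligned lists of strings. The function names are hypothetical and this is not the code in beaf_metric.py.

```python
def is_yes(answer: str) -> bool:
    """Treat any reply starting with "yes" as a positive answer."""
    return answer.strip().lower().startswith("yes")


def binary_metrics(predictions, ground_truths):
    """Accuracy, precision, recall, and F1 with "yes" as the positive class."""
    tp = fp = tn = fn = 0
    for pred, gt in zip(predictions, ground_truths):
        p, g = is_yes(pred), is_yes(gt)
        if p and g:
            tp += 1
        elif p and not g:
            fp += 1
        elif not p and g:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Zero denominators are mapped to 0.0 here so that a model that never answers "yes" does not raise an error; other conventions are possible.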

Citation

If you use BEAF in a research paper, please cite our work and related works as follows:

@inproceedings{yebin2024beaf,
  title     = {BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models},
  author    = {Ye-Bin, Moon and Hyeon-Woo, Nam and Choi, Wonseok and Oh, Tae-Hyun},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024},
}

@inproceedings{lin2014microsoft,
  title     = {Microsoft {COCO}: Common objects in context},
  author    = {Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2014},
}

@inproceedings{Li-hallucination-2023,
  title     = {Evaluating Object Hallucination in Large Vision-Language Models},
  author    = {Li, Yifan and Du, Yifan and Zhou, Kun and Wang, Jinpeng and Zhao, Wayne Xin and Wen, Ji-Rong},
  booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing},
  year      = {2023},
}

Acknowledgment

This work was partly supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-02068, Artificial Intelligence Innovation Hub; No.2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities; No.RS-2019-II191906, Artificial Intelligence Graduate School Program (POSTECH)).
