Official implementation of paper 'Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models'.

1zhou-Wang/MemVR

If you like our project, please give us a star ⭐ on GitHub for the latest update.

📣 News

  • [2024/10/7] ⭐️ Paper of MemVR uploaded. Please check out this link for details.
  • [2024/10/7] 🚀 Code will be released. Watch 👀 this repository for the latest updates.
  • [2024/10/23] 🚀 Source code released! We're now working on extending MemVR to more MLLMs.

🎯 Overview

We propose Memory-Space Visual Retracing (MemVR), a novel hallucination mitigation paradigm that requires neither external knowledge retrieval nor additional fine-tuning. MemVR has two significant advantages:

  • First, MemVR significantly mitigates hallucination issues across various MLLMs and excels in general benchmarks, emphasizing its potential for widespread applicability.
  • Second, MemVR is a plug-and-play solution without incurring added time overhead.

It’s a game-changer for effectiveness and efficiency.

Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucination issues across various MLLMs and excels in general benchmarks without incurring added time overhead.

🕹️ Usage

Installation

  1. We recommend using LLaVA as the working environment. Clone the LLaVA repository and set up the environment by running:
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA
conda create -n memvr python==3.10
conda activate memvr
pip install --upgrade pip
pip install -e .
  2. After setting up, clone the MemVR repository and move all of its contents (except README.md) into the main directory of LLaVA.
LLaVA/
├── llava/
│ ├── eval/ # merge here in the next step
│ ├── .../
├── eval_scripts/
│ ├── llava/
│ ├── qwen/
│ ├── glm/
├── memvr.py
├── inference.py
├── images/
│ ├── ...
└── ...

Then merge the contents of the eval folder into the directory

/LLaVA/llava/eval/
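
The copy-and-merge step above can also be scripted. The following is a minimal sketch, not part of MemVR: `merge_memvr_into_llava` is a hypothetical helper, it copies rather than moves, and it assumes the MemVR checkout has a top-level `eval` folder that should be merged into `LLaVA/llava/eval/`.

```python
import shutil
from pathlib import Path


def merge_memvr_into_llava(memvr_dir, llava_dir, skip=("README.md", ".git")):
    """Copy a MemVR checkout into a LLaVA checkout (hypothetical helper)."""
    memvr, llava = Path(memvr_dir), Path(llava_dir)
    copied = []
    for entry in sorted(memvr.iterdir()):
        if entry.name in skip:
            continue  # leave LLaVA's own README.md untouched
        if entry.name == "eval":
            # Assumption: MemVR ships a top-level eval/ folder whose files
            # belong in LLaVA/llava/eval/.
            target = llava / "llava" / "eval"
        else:
            target = llava / entry.name
        if entry.is_dir():
            # dirs_exist_ok=True merges into an existing directory (Python 3.8+).
            shutil.copytree(entry, target, dirs_exist_ok=True)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(entry, target)
        copied.append(target)
    return copied
```

Calling `merge_memvr_into_llava("MemVR", "LLaVA")` from the parent directory of both checkouts would perform the merge. Files that exist in both trees are overwritten by the MemVR version, so run it on a fresh LLaVA clone.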

Downloading Checkpoints

Under the main directory of LLaVA:

  1. Download the checkpoint of LLaVA v1.5 here.
  2. Download the checkpoint of Qwen-VL-Chat here. Replace the downloaded modeling_qwen.py with the modeling_qwen.py provided in this repository to enable MemVR on the Qwen-VL-Chat model.
  3. Download the checkpoint of glm-4v-9b here. Replace the downloaded modeling_chatglm.py with the modeling_chatglm.py provided in this repository to enable MemVR on the GLM-4V-9b model.
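
When swapping the modeling files in steps 2 and 3, it can help to keep the original for comparison or rollback. A minimal sketch, assuming the patched file has the same name as the one in the checkpoint folder; `replace_modeling_file` and the `.orig` backup suffix are illustrative choices, not part of MemVR.

```python
import shutil
from pathlib import Path


def replace_modeling_file(checkpoint_dir, patched_file):
    """Swap a checkpoint's modeling file for a patched one, keeping a backup."""
    patched = Path(patched_file)
    target = Path(checkpoint_dir) / patched.name
    if target.exists():
        backup = target.parent / (target.name + ".orig")
        if not backup.exists():  # never overwrite an earlier backup
            shutil.copy2(target, backup)
    shutil.copy2(patched, target)
    return target
```

The checkpoint directory and the location of the patched file depend on where you downloaded each of them, so both are left as arguments.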

To check that your environment works, run

python inference.py

Evaluation

Follow Evaluation.md in LLaVA to prepare the benchmark materials. We also recommend using GPUs with at least 40 GB of VRAM. To test on a benchmark, run the corresponding script, for example

bash eval_scripts/llava/mme.sh 

Please note that you may need to fill in your own OpenAI API key for GPT-based evaluations such as LLaVA-Bench or MM-Vet.

Here are some notes on the parameters in the scripts:

    --retracing-ratio 0.12 \
    --entropy-threshold 0.75 \
    --starting-layer 5 \
    --ending-layer 16 \

where

  • [retracing-ratio] is the fraction of visual tokens to be retraced in a given layer. It directly affects the model's performance.
  • [entropy-threshold] is the minimum layer-wise entropy that triggers visual information retracing.
  • [starting-layer] and [ending-layer] set the range of layers in which visual information retracing is allowed.
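
To make the interaction of these four parameters concrete, here is a minimal sketch of the gating logic they describe. It is an illustration under stated assumptions, not the code in memvr.py: the function names are hypothetical, the entropy is assumed to be normalized to [0, 1], the layer range is assumed to be inclusive at both ends, and the ratio is read literally as a fraction of the visual tokens.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(len(probs))."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def should_retrace(layer, probs, entropy_threshold=0.75,
                   starting_layer=5, ending_layer=16):
    """Return True if retracing would be triggered at this layer.

    Assumptions: the range is inclusive, and retracing triggers when the
    normalized entropy reaches the threshold.
    """
    if not starting_layer <= layer <= ending_layer:
        return False  # outside the allowed layer range
    return normalized_entropy(probs) >= entropy_threshold


def num_retraced_tokens(num_visual_tokens, retracing_ratio=0.12):
    """Number of visual tokens retraced, reading the ratio as a fraction."""
    return round(num_visual_tokens * retracing_ratio)
```

Under these assumptions, a near-uniform (uncertain) distribution at layer 8 triggers retracing, the same distribution at layer 3 does not, and a sharply peaked distribution never does. With 576 visual tokens and a ratio of 0.12, `num_retraced_tokens(576)` returns 69.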

🏅 Experiments

Figure 5. Results on MMBench. MemVR enhances comprehensive performance on diverse tasks.

📌 Examples

Figure 9. Visualization of uncertainty across layers without and with MemVR. MemVR effectively reduces uncertainty after the 8th layer, contributing to hallucination mitigation.

Figure 10. A case study in long text generation. MemVR effectively mitigates hallucinations.

✏️ Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@article{zou2024memvr,
  title={Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models}, 
  author={Xin Zou and Yizhou Wang and Yibo Yan and Sirui Huang and Kening Zheng and Junkai Chen and Chang Tang and Xuming Hu},
  journal={arXiv preprint arXiv:2410.03577},
  year={2024}
}

📝 Related Projects

  • OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
  • VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
  • DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
  • Contrastive Decoding: Open-ended Text Generation as Optimization
  • GLM-4V: ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
  • Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  • LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
