
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Yuhao Dong*,1  Zuyan Liu*,2,3  Hai-Long Sun2,4  Jingkang Yang1

Winston Hu2  Yongming Rao2,3,✉  Ziwei Liu1,✉

1S-Lab, NTU  2Tencent  3Tsinghua University  4Nanjing University

* Equal Contribution  ✉ Corresponding Author

arXiv Paper: arXiv:2411.14432

Model Checkpoints: Insight-V-checkpoints

📢 News

  • [11/2024] 🔧🔨 Training & Inference Scripts Release! Try Insight-V on your own!
  • [11/2024] 🔥🚀 Introducing Insight-V! An early attempt to explore long-chain visual reasoning with MLLMs.
    • [Paper]: Detailed introduction of Insight-V, including structured, long-chain data generation pipeline and effective multi-agent system design!
    • [Checkpoints]: We release model checkpoints on LLaVA-NeXT-LLaMA3 and our base model.

🚀 Introducing Insight-V

Main idea of Insight-V

Insight-V is an early effort to explore long-chain visual reasoning with MLLMs.

Insight-V offers 1) a scalable data generation pipeline for long-chain, high-quality reasoning data, 2) a multi-agent system that decomposes visual reasoning tasks into reasoning and summarization, and 3) a two-stage training pipeline to enhance visual reasoning capabilities. Together, these contributions address key challenges in visual reasoning, providing a solid foundation for future research in MLLM reasoning.

Overview of Data Generation Pipeline

The reasoning processes are generated progressively through a reasoning generator, and then fed into a multi-granularity assessment system to ensure high-quality reasoning.
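Below is a minimal sketch of how such a pipeline could be wired together. The callables reasoning_generator, judge_answer, and score_steps are hypothetical placeholders standing in for the released scripts, not the repository's actual API; see the paper and training code for the real implementation.

from typing import Callable, List, Optional

def generate_reasoning_data(
    question: str,
    image,
    reasoning_generator: Callable[..., dict],   # hypothetical: extends the chain by one step
    judge_answer: Callable[[str, str], bool],   # hypothetical: coarse, answer-level check
    score_steps: Callable[[List[str]], float],  # hypothetical: fine-grained, step-level score in [0, 1]
    ground_truth: str,
    max_steps: int = 16,
    min_step_score: float = 0.8,
) -> Optional[List[str]]:
    """Progressively generate a reasoning chain, then keep it only if it
    passes both answer-level and step-level assessment."""
    steps: List[str] = []
    answer: Optional[str] = None

    # Progressive generation: each call appends one reasoning step,
    # optionally terminating with a final answer.
    for _ in range(max_steps):
        out = reasoning_generator(question=question, image=image, history=steps)
        steps.append(out["step"])
        if out.get("final_answer") is not None:
            answer = out["final_answer"]
            break

    # Multi-granularity assessment: discard chains whose answer is wrong
    # or whose intermediate steps score poorly.
    if answer is None or not judge_answer(answer, ground_truth):
        return None
    if score_steps(steps) < min_step_score:
        return None
    return steps + [answer]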

Overview of Multi-Agent System

We derive a multi-agent system from a single model: the task is decomposed into reasoning and summarization, and the two agents collaborate to enhance the overall reasoning capability.
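A rough sketch of how the two agents could cooperate at inference time is shown below. The reasoning_agent and summary_agent callables and the prompt wording are illustrative assumptions, not the repository's actual interface.

from typing import Callable

def insight_v_style_inference(
    question: str,
    image,
    reasoning_agent: Callable[[str, object], str],  # hypothetical: produces a long reasoning chain
    summary_agent: Callable[[str, object], str],    # hypothetical: produces the final answer
) -> str:
    """Decompose the task: one agent reasons at length, the other summarizes
    the chain (conditioned on the image) into a concise final answer."""
    # Step 1: the reasoning agent emits a detailed, step-by-step rationale.
    reasoning = reasoning_agent(
        f"Question: {question}\nThink step by step and explain your reasoning.",
        image,
    )

    # Step 2: the summary agent answers the question, selectively using
    # (and if necessary correcting) the rationale.
    final_answer = summary_agent(
        f"Question: {question}\nReasoning draft:\n{reasoning}\n"
        "Give a concise final answer, correcting the draft where it is wrong.",
        image,
    )
    return final_answer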

✅ TODO List

  • Release paper on arXiv.
  • Release Insight-V models.
  • Demo code for generation.
  • All the training and inference code.
  • Evaluation code for visual reasoning benchmarks.
  • Insight-V SFT Data.
  • Insight-V with stronger MLLMs.

📃 Main Results

Results on Visual Reasoning Benchmarks

Results on Other Image Benchmarks

Qualitative Results

Citation

If you find Insight-V useful for your research and applications, please cite our paper using this BibTeX:

@article{dong2024insight,
  title={Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models},
  author={Dong, Yuhao and Liu, Zuyan and Sun, Hai-Long and Yang, Jingkang and Hu, Winston and Rao, Yongming and Liu, Ziwei},
  journal={arXiv preprint arXiv:2411.14432},
  year={2024}
}

Acknowledgement

  • Our codebase is built upon LLaVA.

  • The data generation pipeline is adapted from g1.

  • Thanks to the lmms-eval team for building such a useful evaluation system!
