Yuhao Dong*,1 Zuyan Liu*,2,3 Hai-Long Sun2,4 Jingkang Yang1
Winston Hu2 Yongming Rao2,3,✉ Ziwei Liu1,✉
1S-Lab, NTU 2Tencent 3Tsinghua University 4Nanjing University
* Equal Contribution ✉ Corresponding Author
- [11/2024] 🔧🔨 Training & inference scripts released! Try Insight-V on your own!
- [11/2024] 🔥🚀 Introducing Insight-V! An early attempt to explore long-chain visual reasoning with MLLMs.
- [Paper]: A detailed introduction to Insight-V, including the structured, long-chain data generation pipeline and the effective multi-agent system design!
- [Checkpoints]: We release model checkpoints built on LLaVA-NeXT-LLaMA3, as well as our base model.
Insight-V is an early effort to explore long-chain visual reasoning with MLLMs.
Insight-V offers:
- a scalable data generation pipeline for long-chain, high-quality reasoning data;
- a multi-agent system that decomposes visual reasoning tasks into reasoning and summarization; and
- a two-stage training pipeline to enhance visual reasoning capabilities (sketched below).

Together, these contributions address key challenges in visual reasoning and provide a solid foundation for future research in MLLM reasoning.
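A minimal schematic of the two-stage training idea, assuming a supervised fine-tuning stage followed by iterative preference optimization (DPO) over sampled reasoning chains. `ToyModel`, its methods, and the correctness-based preference rule are hypothetical placeholders, not the Insight-V training code:

```python
# Schematic of a two-stage recipe: SFT, then iterative preference optimization.
# ToyModel stands in for an MLLM; all methods are hypothetical placeholders.
import random

class ToyModel:
    def step_sft(self, example):
        """Stage-1 placeholder: one supervised gradient step on curated data."""

    def step_dpo(self, prompt, chosen, rejected):
        """Stage-2 placeholder: one preference-optimization (DPO) step."""

    def sample_chain(self, prompt):
        """Placeholder: sample a reasoning chain and whether it is correct."""
        return {"chain": f"reasoning for {prompt!r}", "correct": random.random() > 0.5}

def train(model, sft_data, prompts, dpo_rounds=2, n_samples=4):
    # Stage 1: supervised fine-tuning on curated long-chain reasoning data.
    for example in sft_data:
        model.step_sft(example)
    # Stage 2: iterative preference optimization; among sampled chains,
    # prefer those that reach a correct answer over those that do not.
    for _ in range(dpo_rounds):
        for prompt in prompts:
            chains = [model.sample_chain(prompt) for _ in range(n_samples)]
            chosen = max(chains, key=lambda c: c["correct"])
            rejected = min(chains, key=lambda c: c["correct"])
            model.step_dpo(prompt, chosen, rejected)
    return model
```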
The reasoning processes are generated progressively by a reasoning generator and then fed into a multi-granularity assessment system that retains only high-quality reasoning data.
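A minimal sketch of that generate-then-assess loop. `generate_step`, `final_answer`, and `score_steps` are hypothetical stand-ins for the reasoning generator and the assessment models, not the actual pipeline:

```python
# Sketch: progressive reasoning generation plus multi-granularity assessment.
# All three helper functions are hypothetical stand-ins, not the real pipeline.

def generate_step(question, steps_so_far):
    """Placeholder for the reasoning generator emitting the next step."""
    return f"step {len(steps_so_far) + 1} toward answering {question!r}"

def final_answer(question, steps):
    """Placeholder for deriving a final answer from the reasoning chain."""
    return "placeholder answer"

def score_steps(steps):
    """Fine-grained assessment: a scoring model rates each reasoning step."""
    return [1.0 for _ in steps]

def build_sample(question, ground_truth, max_steps=8, min_step_score=0.5):
    # Progressively grow the reasoning chain, one step at a time.
    steps = []
    for _ in range(max_steps):
        steps.append(generate_step(question, steps))
    answer = final_answer(question, steps)
    # Coarse granularity: discard chains whose final answer is wrong.
    if answer.strip().lower() != ground_truth.strip().lower():
        return None
    # Fine granularity: discard chains containing low-quality steps.
    if min(score_steps(steps)) < min_step_score:
        return None
    return {"question": question, "steps": steps, "answer": answer}
```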
We derive a multi-agent system from a single model by decomposing the task into reasoning and summarization; the two agents collaborate to enhance the overall reasoning capability.
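A minimal sketch of how the two agents could cooperate at inference time. `chat` is a hypothetical wrapper around a single multimodal model call (image + prompt → text), and the prompts are illustrative, not the ones used in the paper:

```python
# Sketch of the reasoning/summarization decomposition at inference time.
# `chat` is a hypothetical single-model call; both agents share one backbone.

def chat(image, prompt):
    """Placeholder for one MLLM call taking an image and a text prompt."""
    return f"<model output for: {prompt[:40]}...>"

REASONING_PROMPT = (
    "Generate a detailed, step-by-step reasoning process for the question.\n"
    "Question: {question}"
)
SUMMARY_PROMPT = (
    "Question: {question}\n"
    "Candidate reasoning: {reasoning}\n"
    "Judge whether the reasoning is trustworthy, then answer concisely."
)

def answer(image, question):
    # Agent 1 (reasoning): produce a long, detailed reasoning chain.
    reasoning = chat(image, REASONING_PROMPT.format(question=question))
    # Agent 2 (summarization): selectively use the chain to produce the answer.
    return chat(image, SUMMARY_PROMPT.format(question=question, reasoning=reasoning))
```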
- Release paper on arXiv.
- Release Insight-V models.
- Release demo code for generation.
- Release all training and inference code.
- Release evaluation code for visual reasoning benchmarks.
- Release Insight-V SFT data.
- Release Insight-V with stronger MLLMs.
If you find Insight-V useful for your research or applications, please cite our paper using the following BibTeX:
@article{dong2024insight,
  title={Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models},
  author={Dong, Yuhao and Liu, Zuyan and Sun, Hai-Long and Yang, Jingkang and Hu, Winston and Rao, Yongming and Liu, Ziwei},
  journal={arXiv preprint arXiv:2411.14432},
  year={2024}
}