Skip to content
/ MATH-V Public

MATH-Vision dataset and code to measure Multimodal Mathematical Reasoning capabilities.

License

Notifications You must be signed in to change notification settings

mathllm/MATH-V

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Measuring Multimodal Mathematical Reasoning with the MATH-Vision🔥 Dataset

MathQA Mathematical Reasoning Multimodal Reasoning

ChatGPT GPT-4 GPT-4V GPT-4V Gemini

🌟 This is the official repository for the paper "Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset", which contains both evaluation code and data for the MATH-V benchmark.

[🌐 Homepage] [🤗 Huggingface Dataset] [📊 Leaderboard ] [🔍 Visualization] [📖 ArXiv Paper]

💥 News

  • [2024-09-27] MATH-V is accepted by NeurIPS DB Track, 2024! 🎉🎉🎉
  • [2024-08-29] 🔥🔥🔥 Qwen2-VL-72B achieves new open-sourced SOTA on MATH-Vision with 25.9! 🎉 Congratulations! Learn more at the Qwen2-VL blog.
  • [2024-07-19] open-compass/VLMEvalKit now supports MATH-V, utilizing LLMs for more accurate answer extraction!🔥🔥
  • [2024-05-19] OpenAI's GPT-4o scores 30.39% on MATH-V, considerable advancement in short time! 💥
  • [2024-03-01] InternVL-Chat-V1-2-Plus achieves 16.97%, establishing itself as the new best-performing open-sourced model. 🎉 Congratulations!
  • [2024-02-23] Our dataset is now accessible at huggingface.
  • [2024-02-22] The top-performing model, GPT-4V only scores 23.98% on MATH-V, while human performance is around 70%.
  • [2024-02-22] Our paper is now accessible at ArXiv Paper.

👀 Introduction

Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs.


The accuracies of four prominent Large Multimodal Models (LMMs), random chance, and human performance are evaluated on our proposed MATH-Vision (MATH-V) across 16 subjects.

Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs.

You can refer to our project homepage and the paper for more details.

📐 Dataset Examples

Some examples of MATH-V on three subjects: analytic geometry, topology, and graph theory.

Analytic geometry


Topology


Graph Geometry


You can refer to the Appendix D.3 of the paper for example images of 16 subjects.

🏆 Leaderboard

The leaderboard is available here.

📈 Evaluation

Generating Outputs of Different Models

Gemini

python models/Gemini.py --in_path ./data/test.jsonl --save_path ./Gemini.jsonl

This will run the Gemini API and save the outputs to ./Gemini.jsonl path. You can modify the system prompt, max tokens, etc. in the benchmark_gemini function.

GPT_with_caption

Generate image captions using GPT-4V:

python models/GPT_with_caption.py --model gpt-4-vision-preview --in_path ./data/test.jsonl --save_path ./data/gpt4v-captions.jsonl

Generate outputs using ChatGPT-3.5 or GPT-4 with image captions:

python models/GPT_with_caption.py --model gpt-3.5-turbo-0125 (gpt-4-turbo-preview) --in_path ./data/test.jsonl --save_path ./gpt3.5_caption.jsonl (./gpt4_caption.jsonl)

Evaluation of Model Outputs

Once all the model outputs have been generated, execute the python evaluation/evaluate.py function to assess these outputs. This script will examine all outputs located in the outputs/ directory, computing overall accuracy as well as accuracy for each subject and level.

You can refer to the Appendix E and F of the paper for some evaluation results of the above models.

📝 Citation

If you find this benchmark useful in your research, please consider citing this BibTex:

@misc{wang2024measuring,
      title={Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset}, 
      author={Ke Wang and Junting Pan and Weikang Shi and Zimu Lu and Mingjie Zhan and Hongsheng Li},
      year={2024},
      eprint={2402.14804},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🧠 Related Work

About

MATH-Vision dataset and code to measure Multimodal Mathematical Reasoning capabilities.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages