The Massive Multitask Agent Understanding (MMAU) benchmark is designed to evaluate the performance of large language models (LLMs) as agents across a wide variety of tasks. It provides comprehensive insights into the capabilities and limitations of these models by featuring extensive offline tasks that eliminate the need for complex environment setups.
MMAU evaluates models across five key domains:
- Tool-use
- Directed Acyclic Graph (DAG) QA
- Data Science and Machine Learning Coding
- Contest-level Programming
- Mathematics
These domains cover five essential capabilities:
- Understanding
- Reasoning
- Planning
- Problem-solving
- Self-correction
With a total of 20 meticulously designed tasks encompassing over 3,000 distinct prompts, MMAU offers a robust framework for assessing the strengths and weaknesses of LLM agents.
- Comprehensive Evaluation: MMAU evaluates models along both application scenarios and fundamental capabilities. This dual approach offers a holistic framework for understanding the strengths and limitations of LLM agents.
- Simplified Evaluation Process: Evaluation on MMAU is straightforward and unified over a static dataset. This avoids the instability issues that can arise from interactive evaluations, ensuring more reliable and consistent results.
- Open Access: We release our evaluation dataset and scripts to the public. This transparency aims to set a new standard for performance assessment in the AI landscape, encouraging further research and development.
Table 1: Comparison of benchmarks in evaluating core capabilities of LLM agents. “En.” and “Dis.” represent entangled and disentangled, respectively. Understand.: understanding, Reason.: reasoning, Plan.: planning, Prob.-solv.: problem-solving, Self-corr.: self-correction, MM: multimodal grounding.
Benchmarks | Understand. En. | Understand. Dis. | Reason. En. | Reason. Dis. | Plan. En. | Plan. Dis. | Prob.-solv. En. | Prob.-solv. Dis. | Self-corr. | MM |
---|---|---|---|---|---|---|---|---|---|---|
AgentBench | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ |
AgentBoard | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ |
PlanBench | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
MMLU | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
MMMU | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ |
MMAU | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
MMAU's extensive and meticulously curated dataset, along with its robust evaluation framework, provides a valuable resource for assessing and advancing the capabilities of large language models.
The construction of MMAU encompasses both breadth and depth of data, drawn from a diverse range of sources to ensure comprehensive evaluation across various domains.
Our dataset is constructed from heterogeneous sources, including:
- In-house Tool-use Data: Utilized for tasks under tool-use and Directed Acyclic Graph (DAG) QA, providing a robust foundation for evaluating these specific capabilities.
- Kaggle Datasets: Rewritten to design tasks for Data Science and Machine Learning (DS & ML) coding, ensuring practical and relevant challenges that reflect real-world scenarios.
- CodeContest: Sourced for tasks under contest-level programming, offering high-level coding challenges that test the advanced problem-solving skills of LLMs.
- DeepMind Math Dataset: Used for math tasks, providing a rigorous set of problems to evaluate mathematical reasoning and computational abilities.
MMAU involves meticulous curation and rewriting of these data sources to create 20 diverse tasks encompassing over 3,000 distinct prompts. This extensive and varied dataset allows for a thorough assessment of LLM agents, covering five essential capabilities: understanding, reasoning, planning, problem-solving, and self-correction.
Make sure you are using Python 3.10+. To get started quickly, install the package with the mmau extra:

```bash
pip install "axlearn[mmau]@git+https://github.com/apple/axlearn.git@main"
```
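Optionally, the same install can be done inside a fresh virtual environment; a minimal sketch using the standard `venv` module (the `.venv` directory name is just an example):

```bash
# Confirm the interpreter version (3.10+ is required).
python3 --version

# Create and activate an isolated environment.
python3 -m venv .venv
source .venv/bin/activate

# Install the MMAU evaluation dependencies from the main branch.
pip install "axlearn[mmau]@git+https://github.com/apple/axlearn.git@main"
```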
Install from repository
- Clone the repository:
  ```bash
  git clone https://github.com/apple/axlearn.git
  ```
- Navigate to the repository directory:
  ```bash
  cd axlearn
  ```
- Install the necessary dependencies:
  ```bash
  # Install for all mmau metrics.
  pip install ".[mmau]"
  ```
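Either install path can be sanity-checked by importing the evaluator module used in the commands below (a quick, optional check; it should exit silently if the mmau extras resolved):

```bash
python3 -c "import axlearn.open_api.evaluator"
```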
The dataset is available on the Hugging Face Hub.
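If you want to browse the raw data directly, it can be pulled with the standard `huggingface-cli download` command; the repository ID below is a placeholder, so check the Hub page for the exact name:

```bash
# <mmau_dataset_repo_id> is a placeholder for the actual dataset repository on the Hub.
huggingface-cli download <mmau_dataset_repo_id> --repo-type dataset --local-dir ./mmau_data
```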
Here is a simple command to run generation and evaluation for one of the MMAU eval sets.
OpenAI Models
```bash
export OPENAI_API_KEY=<your_openai_key>
python3 -m axlearn.open_api.evaluator \
    --model gpt-3.5-turbo-0125 \
    --eval_set_name mmau_tool_use_plan \
    --client_name openai
```
Open Source Models
Use vLLM or another inference framework to start an OpenAI-compatible server. In this case, `--client_name openai` can be reused.
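For example, a minimal sketch of serving the model below with vLLM's OpenAI-compatible server (only `--model`, `--served-model-name`, and `--port` are shown; everything else is left at vLLM defaults, and the port must match `OPENAI_BASE_URL` below):

```bash
# Serve Mistral-7B-Instruct-v0.3 at http://127.0.0.1:8000/v1.
python3 -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --served-model-name Mistral-7B-Instruct-v0.3 \
    --port 8000
```

With the server running, point the evaluator at the local endpoint: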
```bash
export OPENAI_API_KEY=EMPTY
export OPENAI_BASE_URL=http://127.0.0.1:8000/v1
export GRADER_OPENAI_API_KEY=<your_openai_key>
python3 -m axlearn.open_api.evaluator \
    --model Mistral-7B-Instruct-v0.3 \
    --eval_set_name mmau_tool_use_plan \
    --client_name openai
```
All available `eval_set_name` values are:
`eval_set_name` | Requires `GRADER_OPENAI_API_KEY` | Number of Examples |
---|---|---|
mmau_tool_use_execution | ❌ | ~1000 |
mmau_tool_use_plan | ❌ | ~450 |
mmau_code_contests | ❌ | ~250 |
mmau_math | ✅ | ~1000 |
mmau | ✅ | ~2700 |
`--eval_set_name mmau` is a collection of `mmau_tool_use_execution` (requires a function-calling endpoint), `mmau_tool_use_plan`, `mmau_code_contests`, and `mmau_math`; its score is the average of the four category scores. Note that we will add `mmau_code_kaggle` later as well.
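As an illustration of that averaging (a plain unweighted mean over the category scores, per the description above), the gpt-4o-2024-08-06 row in the results table below would aggregate to (80.0 + 74.5 + 63.1 + 23.4) / 4 = 60.25. To run the full collection against an OpenAI model, a sketch reusing the OpenAI setup from above (the grader key is required per the table):

```bash
export OPENAI_API_KEY=<your_openai_key>
export GRADER_OPENAI_API_KEY=<your_openai_key>
python3 -m axlearn.open_api.evaluator \
    --model gpt-3.5-turbo-0125 \
    --eval_set_name mmau \
    --client_name openai
```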
You may find `mmau_code_contests` to be slow since it executes the generated code.
Gemini or Anthropic Models
Gemini
```bash
export VERTEX_AI_PROJECT=<your_project_name>
export VERTEX_AI_LOCATION=<your_project_location>
export GRADER_OPENAI_API_KEY=<your_openai_key>
python3 -m axlearn.open_api.evaluator \
    --model gemini-1.5-flash-001 \
    --eval_set_name mmau_tool_use_plan \
    --client_name gemini
```
Anthropic
```bash
export ANTHROPIC_API_KEY=<your_anthropic_key>
export GRADER_OPENAI_API_KEY=<your_openai_key>
python3 -m axlearn.open_api.evaluator \
    --model claude-3-haiku-20240307 \
    --eval_set_name mmau_tool_use_plan \
    --client_name anthropic \
    --decode_parameters '{"add_system_parallel_tools": true}'
```
Arguments
- Pass `--concurrency 16` to increase request throughput.
- Pass `--max_instances 32` to run only the first few instances for debugging.
- Pass `--output_file ./output` to save responses and metrics.
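These flags compose with any of the commands above; for instance, a sketch based on the OpenAI example:

```bash
python3 -m axlearn.open_api.evaluator \
    --model gpt-3.5-turbo-0125 \
    --eval_set_name mmau_tool_use_plan \
    --client_name openai \
    --concurrency 16 \
    --max_instances 32 \
    --output_file ./output
```

The table below reports reference scores for several models across the four MMAU eval sets.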
Model | MMAU Tool Use Plan | MMAU Tool Use Execution | MMAU Math | MMAU Code Contests |
---|---|---|---|---|
OpenAI/gpt-3.5-turbo-0125 | 26.1 | 68.2 | 20.7 | 6.5 |
OpenAI/gpt-4o-mini-2024-07-18 | 68.5 | 68.9 | 45.7 | 12.3 |
OpenAI/gpt-4o-2024-08-06 | 80.0 | 74.5 | 63.1 | 23.4 |
Google/gemini-1.0-pro-001 | 46.7 | 32.9 | 27.9 | 5.4 |
Google/gemini-1.5-pro-001 | 75.7 | 56.1 | 37.8 | 6.9 |
Anthropic/claude-3-haiku-20240307 | 34.6 | 45.1 | 31.1 | 6.9 |
Anthropic/claude-3-5-sonnet-20240620 | 74.6 | 68.3 | 48.1 | 13.4 |
Microsoft/phi-3.5-mini-instruct | 26.5 | N/A | 30.5 | 3.1 |
Meta/llama3.1-8b-instruct | 35.7 | N/A | 27.2 | 7.7 |
Meta/llama3.1-70b-instruct | 79.3 | N/A | 31.0 | 16.8 |
Meta/llama3.1-405b-instruct | 85.2 | N/A | 50.9 | 25.7 |
- Add more open API clients.
- Add code kaggle into the `mmau` eval set.
If you use MMAU in your research, please cite our paper:
```bibtex
@article{patil2024mmau,
  title={MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains},
  author={Guoli Yin and Felix Bai and Shuang Ma and Zirui Wang and others},
  journal={arXiv preprint arXiv:2407.18961},
  url={https://github.com/apple/axlearn/docs/research/mmau},
  year={2024},
}
```