Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
📑 Paper | 🤗 Hugging Face Models | 🤗 Spaces Demo | 📝 Slides | 🕹️ OpenBayes贝式计算 Demo
🤗 Datasets | 💬 X (Twitter) | 🖥️ Computer Use | 📖 GUI Paper List | 🤖 ModelScope
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
Show Lab @ National University of Singapore, Microsoft
- [2025.1.20] Support navigation tasks: training and evaluation for Mind2Web, AITW, and Miniwob.
- [2025.1.17] Support API calling via the Gradio Client; simply run `python3 api.py`.
- [2025.1.5] Release the `ShowUI-web` dataset.
- [2024.12.28] Update GPT-4o annotation recaptioning scripts.
- [2024.12.27] Update training code and instructions.
- [2024.12.23] Update `showui` for UI-guided token selection implementation.
- [2024.12.15] ShowUI received the Outstanding Paper Award at the NeurIPS 2024 Open-World Agents workshop.
- [2024.12.9] Support int8 Quantization.
- [2024.12.5] Major Update: ShowUI is integrated into OOTB for local run!
- [2024.12.1] We support iterative refinement to improve grounding accuracy. Try it at HF Spaces demo.
- [2024.11.27] We release the arXiv paper, HF Spaces demo, and `ShowUI-desktop`.
- [2024.11.16] `showlab/ShowUI-2B` is available on Hugging Face.
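The iterative refinement mentioned in the 2024.12.1 entry can be pictured as a crop-and-requery loop. The sketch below is an assumption about how such a loop is commonly built, not the project's implementation: `predict` is a hypothetical callable that returns a normalized `(x, y)` point inside the region it is shown, and `crop_ratio` is an illustrative parameter.

```python
def crop_box(point, image_size, crop_ratio=0.5):
    """Pixel box (left, top, right, bottom) centered on a normalized point, clamped to the image."""
    w, h = image_size
    cw, ch = w * crop_ratio, h * crop_ratio
    cx, cy = point[0] * w, point[1] * h
    left = min(max(cx - cw / 2, 0.0), w - cw)
    top = min(max(cy - ch / 2, 0.0), h - ch)
    return left, top, left + cw, top + ch


def to_full_image(point_in_box, box, image_size):
    """Map a point normalized to `box` back to coordinates normalized to the full image."""
    w, h = image_size
    left, top, right, bottom = box
    x = (left + point_in_box[0] * (right - left)) / w
    y = (top + point_in_box[1] * (bottom - top)) / h
    return x, y


def refine(predict, image_size, query, iterations=2, crop_ratio=0.5):
    """Query the full image first, then re-query on crops centered on the previous answer."""
    w, h = image_size
    box = (0.0, 0.0, float(w), float(h))
    point = to_full_image(predict(query, box), box, image_size)
    for _ in range(iterations - 1):
        box = crop_box(point, image_size, crop_ratio)
        point = to_full_image(predict(query, box), box, image_size)
    return point
```

Each pass shows the model a smaller region at a higher effective resolution, which is why a second look may sharpen the predicted location.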
Run `python3 api.py` and provide a screenshot and a query.
Because the script is built on the Hugging Face Gradio client, you don't need a GPU to deploy the model locally 🤗
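As a rough sketch of what such a call can look like with the `gradio_client` package: the Space id `showlab/ShowUI`, the endpoint name `/on_submit`, and the `image` / `query` keyword names below are assumptions for illustration and should be checked against `api.py` or the Space's API page before use.

```python
def build_request(screenshot, query, api_name="/on_submit"):
    """Collect keyword arguments for a Gradio `predict` call (argument names are assumed)."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return {"image": screenshot, "query": query, "api_name": api_name}


def ask_showui(client, screenshot, query):
    """Send one screenshot/query pair through any object exposing `predict(**kwargs)`."""
    return client.predict(**build_request(screenshot, query))


def run_live_example(screenshot_path="screenshot.png", query="Click the search bar"):
    """Call the hosted demo; needs network access and `pip install gradio_client`."""
    # Imported lazily so the helpers above stay dependency-free.
    from gradio_client import Client, handle_file

    client = Client("showlab/ShowUI")  # assumed Space id
    return ask_showui(client, handle_file(screenshot_path), query)
```

Calling `run_live_example()` sends the request to the remote Space, so inference runs there rather than on the local machine.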
See Computer Use OOTB for using ShowUI to control your PC.
See Quick Start for local model usage.
See Gradio for installation.
Our training codebase supports:
- Grounding and Navigation training: Mind2Web, AITW, Miniwob
- Self-customized model: ShowUI, Qwen2VL
- Efficient Training: DeepSpeed, BF16, QLoRA, SDPA / FlashAttention2, Liger-Kernel
- Mixed training across multiple datasets
- Interleaved data streaming
- Random image resizing (crop, pad)
- Wandb training monitoring
See Train for training setup.
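Mixed training over several datasets, as listed above, is commonly implemented by weighted sampling. The sketch below is a minimal illustration under that assumption; `mixed_stream`, the dataset names, and the weights are hypothetical and are not taken from the ShowUI training code.

```python
import random
from typing import Dict, Iterator, List, Tuple


def mixed_stream(
    datasets: Dict[str, List],
    weights: Dict[str, float],
    num_samples: int,
    seed: int = 0,
) -> Iterator[Tuple[str, object]]:
    """Yield (dataset_name, sample) pairs, picking the dataset by weight at every step."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    names = list(datasets)
    probs = [weights[name] for name in names]
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, rng.choice(datasets[name])


# Hypothetical toy datasets; real entries would be screenshot/annotation records.
toy = {"web": ["w1", "w2"], "desktop": ["d1"], "mobile": ["m1", "m2", "m3"]}
for name, sample in mixed_stream(toy, {"web": 2.0, "desktop": 1.0, "mobile": 1.0}, 5):
    print(name, sample)
```

Sampling the dataset per step, rather than concatenating everything, keeps a small dataset from being drowned out by a large one.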
Try `test.ipynb`, which seamlessly supports Qwen2VL models.
Try `recaption.ipynb`, where we provide instructions on how to recaption the original annotations using GPT-4o.
We extend our gratitude to SeeClick for providing their code and datasets.
Special thanks to Siyuan for assistance with the Gradio demo and OOTB support.
If you find our work helpful, please kindly consider citing our paper.
@misc{lin2024showui,
title={ShowUI: One Vision-Language-Action Model for GUI Visual Agent},
author={Kevin Qinghong Lin and Linjie Li and Difei Gao and Zhengyuan Yang and Shiwei Wu and Zechen Bai and Weixian Lei and Lijuan Wang and Mike Zheng Shou},
year={2024},
eprint={2411.17465},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.17465},
}
If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.