Stars
ChatArena (or Chat Arena) is a library of multi-agent language game environments for LLMs. The goal is to develop communication and collaboration capabilities in AIs.
VisualWebArena is a benchmark for multimodal agents.
O1 Replication Journey: A Strategic Progress Report – Part I
Code for ROICtrl: Boosting Instance Control for Visual Generation
Repository for ShowUI: One Vision-Language-Action Model for GUI Visual Agent
[NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
Out-of-the-box (OOTB) GUI Agent for Windows and macOS
[ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision).
MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
(ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator
(NeurIPS 2024) Learning to Visual Question Answering, Asking and Assessment
[NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.
[ICML 2024 Oral] Official code repository for MLLM-as-a-Judge.
[NeurIPS 2023] Official implementation and model release of the paper "What Makes Good Examples for Visual In-Context Learning?"
CVPR and NeurIPS poster examples and templates. May we have in-person poster sessions soon!
Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents"
A programming framework for agentic AI 🤖 (PyPI: autogen-agentchat)
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
[ICLR 2024] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
The paper list of the 86-page paper "The Rise and Potential of Large Language Model Based Agents: A Survey" by Zhiheng Xi et al.
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web"
Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.