National University of Singapore
- https://waxnkw.github.io/
Stars
Long Context Transfer from Language to Vision
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Open-Sora: Democratizing Efficient Video Production for All
MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o-level performance
The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model
Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval (ICCV 2023 Oral)
[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | A bilingual Chinese-English multimodal large model series built on the CPM foundation models
Official repo for VideoComposer: Compositional Video Synthesis with Motion Controllability
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, B…
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts
Codes for VPGTrans: Transfer Visual Prompt Generator across LLMs. VL-LLaMA, VL-Vicuna.
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Running large language models on a single GPU for throughput-oriented scenarios.
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Code for the ECCV 2022 (Oral) paper "Fine-Grained Scene Graph Generation with Data Transfer"
Code repository for "It's About Time: Analog Clock Reading in the Wild"
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21)
Visual Relation Grounding in Videos (ECCV'20, Spotlight)
Video as Conditional Graph Hierarchy for Multi-Granular Question Answering (AAAI'22, Oral)
The source code of our COLING'18 paper "Few-Shot Charge Prediction with Discriminative Legal Attributes".