Stars
(NeurIPS 2024 Oral 🔥) Improved Distribution Matching Distillation for Fast Image Synthesis
A family of versatile, state-of-the-art video tokenizers.
NOVA: Autoregressive Video Generation without Vector Quantization
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations. It includes practical examples for both text and image modalities.
A generative world for general-purpose robotics & embodied AI learning.
Liquid: Language Models are Scalable Multi-modal Generators
[ECCV 2024] Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting
🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation".
Infinity ∞ : Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
The code and models for the paper: Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
XQ-GAN🚀: An Open-source Image Tokenization Framework for Autoregressive Generation
CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
ElasticTok: Adaptive Tokenization for Image and Video
The paper collections for the autoregressive models in vision.
DIAMOND (DIffusion As a Model Of eNvironment Dreams) is a reinforcement learning agent trained in a diffusion world model. NeurIPS 2024 Spotlight.
Movie Gen Bench - two media generation evaluation benchmarks released with Meta Movie Gen
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
[NeurIPS 2024 Spotlight] Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
Official PyTorch Implementation of Our CVPR2023 Paper: "Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization"
Official inference repo for FLUX.1 models
This is the official implementation for ControlVAR.
Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, with good general video understanding capability.
Implements VAR+CLIP for text-to-image (T2I) generation
🔥stable, simple, state-of-the-art VQVAE toolkit & cookbook
[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions"
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization