We are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. We are training for higher resolution (>1024) and longer duration (>10s) videos; here is a preview of the next release. The previews shown on GitHub are compressed .gif files, so some quality is lost.
Thanks to the HUAWEI Ascend NPU Team for supporting us.
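Inference is now supported on Chinese-made AI chips (Huawei Ascend, with more domestic compute chips hopefully to follow). Training on domestic compute will be supported next; for details, see the Ascend branch, hw branch.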
This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI"). We hope the open-source community will contribute to this project. Pull requests are welcome!!!
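This project hopes to reproduce Sora through the power of the open-source community. It was jointly initiated by the PKU-Tuzhan AIGC Joint Lab. The current version is still far from the goal and needs continuous improvement and rapid iteration. Pull requests are welcome!!!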
Project stages:
- Primary
  - Set up the codebase and train an unconditional model on a landscape dataset.
  - Train models that boost resolution and duration.
- Extensions
  - Conduct text2video experiments on a landscape dataset.
  - Train the 1080p model on a video2text dataset.
  - Control the model with more conditions.
[2024.04.09] 🚀 Excited to share our latest exploration on metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos. Here is the training dataset (still being updated): Open-Sora-Dataset.
[2024.04.07] 🔥🔥🔥 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.
[2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed video in this demonstration as follows. The text-to-video model is on the way.
[2024.03.10] 🚀🚀🚀 This repo supports training a latent size of 225×90×90 (t×h×w), which means we are able to train 1 minute of 1080P video with 30FPS (2× interpolated frames and 2× super resolution) under class-condition.
[2024.03.08] We support text-conditioned training with 16 frames at 512x512. The code is mainly borrowed from Latte.
[2024.03.07] We support training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512.
[2024.03.05] See our latest todo; pull requests are welcome.
[2024.03.04] We reorganized and modularized our code to make it easier to contribute to the project. To contribute, please see the Repo structure.
[2024.03.03] We open some discussions to clarify several issues.
[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.
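The durations quoted in the 2024.03.07 entry above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a 30 FPS source video and that the sample rate of 3 also applies to the 64-frame case, neither of which is stated explicitly in the entry. The function name `clip_duration_seconds` is hypothetical and not part of this repo.

```python
def clip_duration_seconds(num_frames: int, sample_rate: int, source_fps: float = 30.0) -> float:
    """Duration of source video covered by a training clip.

    Taking one frame out of every `sample_rate` source frames means that
    `num_frames` sampled frames span about `num_frames * sample_rate`
    source frames. The 30 FPS default is an assumption.
    """
    if num_frames <= 0 or sample_rate <= 0 or source_fps <= 0:
        raise ValueError("num_frames, sample_rate and source_fps must be positive")
    return num_frames * sample_rate / source_fps


if __name__ == "__main__":
    # 128 frames at sample rate 3: roughly the "about 13 seconds" case.
    print(round(clip_duration_seconds(128, 3), 1))  # 12.8
    # 64 frames at sample rate 3: roughly the "about 6 seconds" case.
    print(round(clip_duration_seconds(64, 3), 1))   # 6.4
```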
- Fix typos & Update readme. 🤝 Thanks to @mio2333, @CreamyLong, @chg0901, @Nyx-177, @HowardLi1984, @sennnnn, @Jason-fan20
- Setup environment. 🤝 Thanks to @nameless1117
- Add docker file. ⌛ [WIP] 🤝 Thanks to @Mon-ius, @SimonLeeGit
- Enable type hints for functions. 🤝 Thanks to @RuslanPeresy, 🙏 [Need your contribution]
- Resume from checkpoint.
- Add Video-VQVAE model, which is borrowed from VideoGPT.
- Support variable aspect ratios, resolutions, durations training on DiT.
- Support Dynamic mask input inspired by FiT.
- Add class-conditioning on embeddings.
- Incorporate Latte as the main codebase.
- Add VAE model, which is borrowed from Stable Diffusion.
- Joint dynamic mask input with VAE.
- Add VQVAE from VQGAN. 🙏 [Need your contribution]
- Make the codebase ready for cluster training. Add SLURM scripts. 🙏 [Need your contribution]
- Refactor VideoGPT. 🤝 Thanks to @qqingzheng, @luo3300612, @sennnnn
- Add sampling script.
- Add DDP sampling script. ⌛ [WIP]
- Use accelerate on multi-node. 🤝 Thanks to @sysuyy
- Incorporate SiT. 🤝 Thanks to @khan-yin
- Add evaluation scripts (FVD, CLIP score). 🤝 Thanks to @rain305f
- Add PI to support out-of-domain size. 🤝 Thanks to @jpthu17
- Add 2D RoPE to improve generalization ability as FiT. 🤝 Thanks to @jpthu17
- Compress KV according to PixArt-sigma.
- Support deepspeed for videogpt training. 🤝 Thanks to @sennnnn
- Train a low-dimensional Video-AE, whether it is VAE or VQVAE.
- Extract offline features.
- Train with offline features.
- Add frame interpolation model. 🤝 Thanks to @yunyangge
- Add super resolution model. 🤝 Thanks to @Linzy19
- Add accelerate to automatically manage training.
- Joint training with images.
- Implement MaskDiT technique for fast training. 🙏 [Need your contribution]
- Incorporate NaViT. 🙏 [Need your contribution]
- Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]
- Load pretrained weights from Latte.
- Implement PeRFlow for improving the sampling process. 🙏 [Need your contribution]
- Finish data loading, pre-processing utils.
- Add T5 support.
- Add CLIP support. 🤝 Thanks to @Ytimed2020
- Add text2image training script.
- Add prompt captioner.
- Collect training data.
- Need video-text pairs with caption. 🙏 [Need your contribution]
- Extract multi-frame descriptions by large image-language models. 🤝 Thanks to @HowardLi1984
- Extract video description by large video-language models. 🙏 [Need your contribution]
- Integrate captions to get a dense caption by using a large language model, such as GPT-4. 🤝 Thanks to @HowardLi1984
- Train a captioner to refine captions. 🚀 [Require more computation]
- Collect training data.
- Looking for a suitable dataset, welcome to discuss and recommend. 🙏 [Need your contribution]
- Add synthetic video created by game engines or 3D representations. 🙏 [Need your contribution]
- Finish data loading, and pre-processing utils.
- Support memory friendly training.
- Add flash-attention2 from pytorch.
- Add xformers. 🤝 Thanks to @jialin-zhao
- Support mixed precision training.
- Add gradient checkpoint.
- Support for ReBased and Ring attention. 🤝 Thanks to @kabachuha
- Train using the deepspeed engine. 🤝 Thanks to @sennnnn
- Train with a text condition. Here we could conduct different experiments: 🚀 [Require more computation]
- Train with T5 conditioning.
- Train with CLIP conditioning.
- Train with CLIP + T5 conditioning (probably costly during training and experiments).
- Incorporate ControlNet. ⌛ [WIP] 🙏 [Need your contribution]
├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   ├── super_resolution
│   │   └── text_encoder
│   ├── sample
│   ├── train                      -> Training code
│   └── utils
- Clone this repository and navigate to the Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
- Install required packages
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- Install optional requirements such as static type checking:
pip install -e '.[dev]'
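After installation, a quick check like the one below can confirm that the interpreter and the optional training packages are visible. This is a minimal sketch, not a script shipped with the repo: the helper name `check_environment` is hypothetical, the 3.8 threshold simply mirrors the version pinned in the conda command above, and the default module list (`flash_attn`, `xformers`, `deepspeed`) is an assumption about which optional packages you care about.

```python
import importlib.util
import sys


def check_environment(optional_modules=("flash_attn", "xformers", "deepspeed")):
    """Return a dict describing the Python version and optional modules.

    `python_ok` is True when the interpreter is at least 3.8 (the version
    used in the conda command above). Every other key is a top-level module
    name mapped to whether it can be found without importing it.
    """
    report = {"python_ok": sys.version_info >= (3, 8)}
    for name in optional_modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing entry only means the package cannot be found in the active environment; it does not tell you whether that package is required for your use case.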
We highly recommend trying out our web demo with the following command. We also provide an online demo in Hugging Face Spaces.
🤝 Thanks to @camenduru, who created further community demos and generously supports our research!
python -m opensora.serve.gradio_web_server
sh scripts/text_condition/sample_video.sh
Refer to Data.md
Refer to the document EVAL.md.
Example:
python examples/rec_imvi_vae.py --video_path test_video.mp4 --rec_path output_video.mp4 --fps 24 --resolution 512 --crop_size 512 --num_frames 128 --sample_rate 1 --ae CausalVAEModel_4x8x8 --model_path pretrained_488_release --enable_tiling --enable_time_chunk
Parameter explanation:
- --enable_tiling: A flag that enables tiled convolution.
- --enable_time_chunk: A flag that enables time chunking. The video is split into blocks along the temporal dimension and each block is reconstructed in turn, which makes it possible to reconstruct long videos. This operation is performed only in video space, not in latent space, and cannot be used for training.
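The idea behind time chunking can be sketched as follows. This is an illustrative sketch only, not the code behind `--enable_time_chunk`: the function `reconstruct_in_time_chunks`, its `chunk_size` parameter, and the stand-in `reconstruct_chunk` callable are all hypothetical, and the real implementation may handle chunk boundaries differently (for example with overlap).

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def reconstruct_in_time_chunks(
    frames: Sequence[Frame],
    reconstruct_chunk: Callable[[Sequence[Frame]], Sequence[Frame]],
    chunk_size: int,
) -> List[Frame]:
    """Reconstruct a long video by processing blocks of frames in order.

    The video is split along the temporal axis into blocks of at most
    `chunk_size` frames. Each block is passed to `reconstruct_chunk`
    (a stand-in for an encode-then-decode call) and the results are
    concatenated, so only one block has to be processed at a time.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    output: List[Frame] = []
    for start in range(0, len(frames), chunk_size):
        block = frames[start:start + chunk_size]
        output.extend(reconstruct_chunk(block))
    return output


if __name__ == "__main__":
    video = list(range(10))  # ten placeholder "frames"
    # An identity function stands in for the autoencoder here.
    restored = reconstruct_in_time_chunks(video, lambda block: list(block), chunk_size=4)
    print(len(restored))
```

Because every block is handled independently in video space, peak memory depends on the block length rather than on the full video length, which fits the note above that this is an inference-time operation.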
Please refer to the document CausalVideoVAE.
Please refer to the document VQVAE.
sh scripts/text_condition/train_videoae_17x256x256.sh
sh scripts/text_condition/train_videoae_65x256x256.sh
sh scripts/text_condition/train_videoae_65x512x512.sh
Compared with the original implementation, we add several features that speed up training and save memory: gradient checkpointing, mixed precision training, pre-extracted features, xformers, and deepspeed. Some data points using a batch size of 1 on an A100:
gradient checkpointing | mixed precision | xformers | feature pre-extraction | deepspeed config | compress kv | training speed | memory |
---|---|---|---|---|---|---|---|
✔ | ✔ | ✔ | ✔ | ❌ | ❌ | 0.64 steps/sec | 43G |
✔ | ✔ | ✔ | ✔ | Zero2 | ❌ | 0.66 steps/sec | 14G |
✔ | ✔ | ✔ | ✔ | Zero2 | ✔ | 0.66 steps/sec | 15G |
✔ | ✔ | ✔ | ✔ | Zero2 offload | ❌ | 0.33 steps/sec | 11G |
✔ | ✔ | ✔ | ✔ | Zero2 offload | ✔ | 0.31 steps/sec | 12G |
gradient checkpointing | mixed precision | xformers | feature pre-extraction | deepspeed config | compress kv | training speed | memory |
---|---|---|---|---|---|---|---|
✔ | ✔ | ✔ | ✔ | ❌ | ❌ | 0.08 steps/sec | 77G |
✔ | ✔ | ✔ | ✔ | Zero2 | ❌ | 0.08 steps/sec | 41G |
✔ | ✔ | ✔ | ✔ | Zero2 | ✔ | 0.09 steps/sec | 36G |
✔ | ✔ | ✔ | ✔ | Zero2 offload | ❌ | 0.07 steps/sec | 39G |
✔ | ✔ | ✔ | ✔ | Zero2 offload | ✔ | 0.07 steps/sec | 33G |
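For readers unfamiliar with the "Zero2" and "Zero2 offload" entries in the deepspeed config column, the sketch below builds a minimal DeepSpeed configuration for each. It is an assumption-laden illustration, not the configuration files used for the measurements above: the helper `make_zero2_config` is hypothetical, and the fp16 setting and gradient accumulation value are placeholders. Only standard DeepSpeed config keys are used.

```python
import json


def make_zero2_config(offload_optimizer: bool = False, micro_batch_size: int = 1) -> dict:
    """Build a minimal DeepSpeed config dict using ZeRO stage 2.

    With `offload_optimizer=True`, optimizer states are moved to CPU
    memory, which usually lowers GPU memory use at the cost of speed.
    """
    zero = {"stage": 2}
    if offload_optimizer:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,  # placeholder value
        "fp16": {"enabled": True},         # assumption: fp16 mixed precision
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    # Print both variants as JSON, the format DeepSpeed config files use.
    print(json.dumps(make_zero2_config(), indent=2))
    print(json.dumps(make_zero2_config(offload_optimizer=True), indent=2))
```

The offload variant trades throughput for memory, which matches the general pattern in the tables: the offload rows use less memory but run at a lower or equal steps/sec.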
We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better than it is now!
For more details, please refer to the Contribution Guidelines.
- Latte: The main codebase we built upon; it is a wonderful video generation model.
- PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
- VideoGPT: Video Generation using VQ-VAE and Transformers.
- DiT: Scalable Diffusion Models with Transformers.
- FiT: Flexible Vision Transformer for Diffusion Model.
- Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.
- See LICENSE for details.
@software{pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109,
author = {PKU-Yuan Lab and Tuzhan AI etc.},
title = {Open-Sora-Plan},
month = apr,
year = 2024,
publisher = {GitHub},
doi = {10.5281/zenodo.10948109},
url = {https://doi.org/10.5281/zenodo.10948109}
}