Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Fan, Weichen; Si, Chenyang; Song, Junhao; Yang, Zhenyu; He, Yinan; Zhuo, Long; Huang, Ziqi; Dong, Ziyue; He, Jingwen; Pan, Dongwei; Wang, Yi; Jiang, Yuming; Wang, Yaohui; Gao, Peng; Chen, Xinyuan; Li, Hengjie; Lin, Dahua; Qiao, Yu; Liu, Ziwei

arXiv:2501.08453 (cs)

[Submitted on 14 Jan 2025]

Abstract: We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework that incorporates hybrid parallelism and other memory reduction techniques, enabling efficient training on long video sequences in distributed systems. (3) Additionally, our enhanced data processing pipeline produces Vchitect T2V DataVerse, a high-quality million-scale training dataset built through rigorous annotation and aesthetic evaluation. Extensive benchmarking demonstrates that Vchitect-2.0 outperforms existing methods in video quality, training efficiency, and scalability, serving as a suitable base for high-fidelity video generation.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2501.08453 [cs.CV]
	(or arXiv:2501.08453v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.08453

Submission history

From: Weichen Fan
[v1] Tue, 14 Jan 2025 21:53:11 UTC (22,224 KB)
