CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Abstract
We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. This pipeline significantly enhances the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX achieves state-of-the-art performance on multiple machine metrics and in human evaluations. The model weights of both the 3D Causal VAE and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
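To make the idea of compressing a video "along both spatial and temporal dimensions" concrete, the sketch below computes the latent grid size for a clip. It is only an illustration: the compression factors (4x temporal, 8x spatial), the causal layout in which the first frame is encoded on its own, and the function names `latent_shape` and `position_reduction` are all assumptions made for this example and are not stated in the abstract.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    """Return (latent_frames, latent_height, latent_width) for a video clip.

    Assumptions (not taken from the abstract):
    - t_factor and s_factor are illustrative compression factors.
    - The VAE is causal: the first frame is encoded on its own and the
      remaining frames are grouped in blocks of `t_factor`.
    """
    if frames < 1:
        raise ValueError("frames must be >= 1")
    if height % s_factor or width % s_factor:
        raise ValueError("height and width must be divisible by s_factor")
    latent_frames = 1 + (frames - 1) // t_factor
    return latent_frames, height // s_factor, width // s_factor


def position_reduction(frames, height, width, **factors):
    """Ratio of pixel-space positions to latent positions (channels ignored)."""
    lf, lh, lw = latent_shape(frames, height, width, **factors)
    return (frames * height * width) / (lf * lh * lw)


if __name__ == "__main__":
    # A hypothetical 49-frame clip at 480x720 resolution.
    print(latent_shape(49, 480, 720))  # (13, 60, 90)
```

Under these assumed factors, the diffusion transformer would attend over a grid of 13 x 60 x 90 latent positions rather than 49 x 480 x 720 pixels, which is why compression on both axes matters for long clips.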
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VidGen-1M: A Large-Scale Dataset for Text-to-video Generation (2024)
- OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation (2024)
- VIMI: Grounding Video Generation through Multi-modal Instruction (2024)
- ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models (2024)
- MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (2024)
Question from the community
@tengjiayan @keg-yzy @zwd125: Please ask them if they or someone else can make it available in Open WebUI via ComfyUI, Auto1111, or other methods, so that we can run it locally on our machines.