Magic 1-For-1: Generating One Minute Video Clips within One Minute
Abstract
In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task converges faster than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training image-to-video (I2V) models from three aspects: 1) speeding up model convergence with multi-modal prior condition injection; 2) reducing inference latency with adversarial step distillation; and 3) reducing inference memory cost with parameter sparsification. With these techniques, we can generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we can generate a minute-long video within one minute, with significantly improved visual quality and motion dynamics, spending on average less than one second to generate each one-second clip. We conduct a series of preliminary explorations to identify the optimal trade-off between computational cost and video quality during diffusion step distillation, and we hope this can serve as a good foundation model for open-source exploration. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
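For readers skimming the abstract, a rough pseudocode sketch of the factorized pipeline may help. All class names, method names, and signatures below are hypothetical placeholders, not the actual Magic 1-For-1 API (see the repository above for the real code); the sketch only illustrates the two-stage factorization and the test-time sliding window described in the abstract.

```python
# Minimal sketch of the factorized T2V pipeline, assuming hypothetical
# `t2i_model` and `i2v_model` objects with a `generate` method.

def generate_minute_long_video(prompt, t2i_model, i2v_model,
                               clip_seconds=5, total_seconds=60):
    """Factorize text-to-video into text-to-image + image-to-video,
    then extend at test time with a sliding window in which each new
    clip is conditioned on the last frame of the previous one."""
    # Stage 1: text-to-image (assumed to be a step-distilled model).
    first_frame = t2i_model.generate(prompt)

    clips = []
    cond_frame = first_frame
    for _ in range(total_seconds // clip_seconds):
        # Stage 2: image-to-video in a few diffusion steps
        # (adversarial step distillation keeps this fast).
        clip = i2v_model.generate(image=cond_frame, prompt=prompt,
                                  seconds=clip_seconds)
        clips.append(clip)
        # Slide the window: condition the next clip on the last frame.
        cond_frame = clip[-1]

    # Flatten the list of clips (each a list of frames) into one video.
    return [frame for clip in clips for frame in clip]
```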
Community
Most people will probably want to see the project page: https://magic-141.github.io/Magic-141/
The following papers, similar to this one, were recommended by the Semantic Scholar API (via Librarian Bot):
- Open-Sora: Democratizing Efficient Video Production for All (2024)
- Diffusion Adversarial Post-Training for One-Step Video Generation (2025)
- Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers (2025)
- CascadeV: An Implementation of Wurstchen Architecture for Video Generation (2025)
- Autoregressive Video Generation without Vector Quantization (2024)
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation (2024)
- Fast Video Generation with Sliding Tile Attention (2025)