DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
Abstract
Recent advances in music generation have attracted significant attention, yet existing approaches face critical limitations. Some generative models can synthesize only the vocal track or only the accompaniment track. Models that do generate combined vocals and accompaniment typically rely on meticulously designed multi-stage cascading architectures and intricate data pipelines, which hinder scalability. Additionally, most systems are restricted to generating short musical segments rather than full-length songs, and widely used language-model-based methods suffer from slow inference. To address these challenges, we propose DiffRhythm, the first latent diffusion-based song generation model capable of synthesizing complete songs, with both vocals and accompaniment, of up to 4m45s in only ten seconds, while maintaining high musicality and intelligibility. Despite these capabilities, DiffRhythm is simple and elegant: it eliminates the need for complex data preparation, employs a straightforward model structure, and requires only lyrics and a style prompt at inference. Its non-autoregressive structure ensures fast inference and makes the approach scalable. We also release the complete training code and a model pre-trained on large-scale data to promote reproducibility and further research.
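The abstract spells out the inference interface (lyrics plus a style prompt in, a full-song waveform out) and the non-autoregressive latent-diffusion design. Below is a minimal sketch of how such a pipeline could fit together; every name here (`style_encoder`, `lyric_encoder`, `dit`, `vae`, the Euler sampler, `num_steps`) is a hypothetical illustration of the general recipe, not the authors' actual API or sampler.

```python
import torch

@torch.no_grad()
def generate_song(lyrics: str, style_prompt: str,
                  style_encoder, lyric_encoder, dit, vae,
                  num_steps: int = 32, duration_frames: int = 2048):
    """Hypothetical non-autoregressive generation sketch: every latent
    frame of the song is denoised jointly, so runtime scales with
    num_steps rather than with song length."""
    style_emb = style_encoder(style_prompt)   # global style conditioning
    lyric_emb = lyric_encoder(lyrics)         # lyric conditioning tokens

    # Start from Gaussian noise covering the full song latent at once.
    x = torch.randn(1, duration_frames, dit.latent_dim)

    # Simple Euler integration of a flow from noise (t=0) toward data
    # (t=1); the paper's exact training objective and sampler may differ.
    for i in range(num_steps):
        t = torch.full((1,), i / num_steps)
        v = dit(x, t, style_emb, lyric_emb)   # predicted velocity field
        x = x + v / num_steps

    return vae.decode(x)                      # song latent -> waveform
```

Because all frames are updated in parallel at each step, a 4m45s song costs the same number of network passes as a short clip, which is consistent with the ten-second inference time the abstract claims.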
Community
Such a clever yet simple technique for generating songs! It works remarkably well. Kudos to the authors for this research, and for building such an impressive model and releasing it with open weights!
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation (2025)
- InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation (2025)
- Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis (2025)
- TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching (2025)
- UniForm: A Unified Diffusion Transformer for Audio-Video Generation (2025)
- SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion (2025)
- A Comprehensive Survey on Generative AI for Video-to-Music Generation (2025)