Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Abstract
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned, high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves a new state of the art among public video-to-audio models in audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
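For readers unfamiliar with flow matching, the sketch below shows one conditional flow-matching (rectified-flow) training step over audio latents with a per-frame multimodal condition. It is a minimal illustration only: the module, tensor shapes, and names (`VelocityNet`, `cond`, `flow_matching_step`) are assumptions made for this example, not MMAudio's actual architecture or API.

```python
# Minimal sketch of a conditional flow-matching training step, assuming a
# rectified-flow formulation x_t = (1 - t) * x0 + t * x1 and a toy
# velocity-prediction network. Names and shapes are illustrative only.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy stand-in for a conditional transformer that predicts velocity."""
    def __init__(self, audio_dim=64, cond_dim=128):
        super().__init__()
        self.proj = nn.Linear(audio_dim + cond_dim + 1, audio_dim)

    def forward(self, x_t, t, cond):
        # Broadcast the scalar timestep to every latent frame.
        t_feat = t[:, None, None].expand(x_t.shape[0], x_t.shape[1], 1)
        return self.proj(torch.cat([x_t, cond, t_feat], dim=-1))

def flow_matching_step(model, audio_latent, cond, optimizer):
    """One training step: regress the straight-line velocity x1 - x0."""
    b = audio_latent.shape[0]
    x0 = torch.randn_like(audio_latent)            # noise sample
    t = torch.rand(b, device=audio_latent.device)  # uniform timesteps in [0, 1)
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * audio_latent
    target_v = audio_latent - x0                   # constant velocity field
    pred_v = model(x_t, t, cond)
    loss = torch.mean((pred_v - target_v) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Joint-training intuition (assumption, not the paper's exact recipe):
# audio-only (text-audio) batches can reuse the same objective by feeding a
# learned "empty video" condition, so both corpora share one training loop.
model = VelocityNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
audio_latent = torch.randn(4, 32, 64)   # (batch, latent frames, channels)
cond = torch.randn(4, 32, 128)          # fused per-frame video/text condition
print(flow_matching_step(model, audio_latent, cond, opt))
```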
Community
Huggingface space demo: https://huggingface.co/spaces/hkchengrex/MMAudio
Project page: https://hkchengrex.com/MMAudio
Code: https://github.com/hkchengrex/MMAudio
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text (2024)
- VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation (2024)
- Video-Guided Foley Sound Generation with Multimodal Controls (2024)
- Gotta Hear Them All: Sound Source Aware Vision to Audio Generation (2024)
- Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment (2024)
- SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation (2024)
- LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos (2024)
Models citing this paper: 1
Datasets citing this paper: 0