---
license: cc-by-nc-nd-4.0
---

# ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

This page shares the official model checkpoints of the paper \
*ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation* \
from the Microsoft Applied Sciences Group and UC Berkeley, \
by [Yatong Bai](https://bai-yt.github.io),
[Trung Dang](https://www.microsoft.com/applied-sciences/people/trung-dang),
[Dung Tran](https://www.microsoft.com/applied-sciences/people/dung-tran),
[Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida),
and [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi/).

**[[🤗 Live Demo](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA)]**
**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**
**[[Project Homepage](https://consistency-tta.github.io)]**
**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**
**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**
**[[Generation Examples](https://consistency-tta.github.io/demo.html)]**

## Description

**2024/06 Updates:**

- An interactive live demo of ConsistencyTTA is now hosted on [🤗 Hugging Face](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA).
- ConsistencyTTA has been accepted to ***INTERSPEECH 2024***! We look forward to meeting you on Kos Island.

This work proposes a *consistency distillation* framework to train
text-to-audio (TTA) generation models that require only a single neural network query,
reducing the computation of the core step of diffusion-based TTA models by a factor of 400.
By incorporating *classifier-free guidance* into the distillation framework,
our models retain diffusion models' impressive generation quality and diversity.
Furthermore, the non-recurrent differentiable structure of the consistency model
allows for end-to-end fine-tuning with novel loss functions, such as the CLAP score, further boosting performance.
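
To see where the roughly 400x saving in the core generation step comes from, consider the following toy sketch (not the actual architecture; `model` is a hypothetical stand-in network): a diffusion sampler queries the denoising network once per step, whereas the distilled consistency model maps noise to the output in a single query.

```python
# Toy illustration of the query-count gap between iterative diffusion sampling
# and single-step consistency generation. `model` only counts invocations.
call_count = 0

def model(x, t, text_emb):
    """Stand-in for one core network query; counts invocations."""
    global call_count
    call_count += 1
    return x  # placeholder "denoised" output

def diffusion_sample(x, text_emb, steps=400):
    """Iterative sampling: one network query per step."""
    for t in reversed(range(steps)):
        x = model(x, t, text_emb)
    return x

def consistency_sample(x, text_emb):
    """Single-step generation: one network query in total."""
    return model(x, 0, text_emb)

diffusion_sample(0.0, None)
n_diffusion, call_count = call_count, 0
consistency_sample(0.0, None)
n_consistency = call_count
print(n_diffusion, n_consistency)  # 400 1
```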

<center>
<img src="main_figure_.png" alt="ConsistencyTTA Results" title="Results" width="480"/>
</center>

## Model Details

We share three model checkpoints:

- [ConsistencyTTA directly distilled from a diffusion model](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA.zip);
- [ConsistencyTTA fine-tuned by optimizing the CLAP score](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA_CLAPFT.zip);
- [The diffusion teacher model from which ConsistencyTTA is distilled](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/LightweightLDM.zip).

The first two models are capable of high-quality single-step text-to-audio generation. All generations are 10 seconds long.

After downloading and unzipping the files, place them in the `saved` directory.
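
As an illustration, the archives can be fetched and unpacked with a short script like the one below. This is a sketch, not official tooling: it assumes the `huggingface_hub` package is installed and uses the repository and file names from the checkpoint list above.

```python
# Sketch: download a ConsistencyTTA checkpoint archive and unpack it into the
# `saved` directory that the training/inference scripts expect.
import zipfile
from pathlib import Path


def unpack_checkpoint(zip_path: str, dest: str = "saved") -> Path:
    """Extract a downloaded checkpoint archive into the destination directory."""
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    return out_dir


def download_checkpoint(filename: str = "ConsistencyTTA.zip") -> str:
    """Fetch one of the three archives from this model repository."""
    from huggingface_hub import hf_hub_download  # assumed extra dependency
    return hf_hub_download(repo_id="Bai-YT/ConsistencyTTA", filename=filename)


# Usage (requires network access):
# unpack_checkpoint(download_checkpoint("ConsistencyTTA_CLAPFT.zip"))
```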

The training and inference code is available in our [GitHub repository](https://github.com/Bai-YT/ConsistencyTTA); please refer to it for usage details.