---
license: cc-by-nc-nd-4.0
---

# ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

This page shares the official model checkpoints of the paper \
*ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation* \
from the Microsoft Applied Sciences Group and UC Berkeley, \
by [Yatong Bai](https://bai-yt.github.io),
[Trung Dang](https://www.microsoft.com/applied-sciences/people/trung-dang),
[Dung Tran](https://www.microsoft.com/applied-sciences/people/dung-tran),
[Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida),
and [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi/).

**[[🤗 Live Demo](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA)]**
**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**
**[[Project Homepage](https://consistency-tta.github.io)]**
**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**
**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**
**[[Generation Examples](https://consistency-tta.github.io/demo.html)]**

## Description

**2024/06 Updates:**

- An interactive live demo of ConsistencyTTA is now hosted on [🤗 Hugging Face](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA).
- ConsistencyTTA has been accepted to ***INTERSPEECH 2024***! We look forward to meeting you on Kos Island.

This work proposes a *consistency distillation* framework to train
text-to-audio (TTA) generation models that require only a single neural network query,
reducing the computation of the core step of diffusion-based TTA models by a factor of 400.
By incorporating *classifier-free guidance* into the distillation framework,
our models retain diffusion models' impressive generation quality and diversity.
Furthermore, the non-recurrent differentiable structure of the consistency model
allows for end-to-end fine-tuning with novel loss functions, such as the CLAP score, further boosting performance.
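
To see where the roughly 400x saving in the core generation step comes from, consider the following toy sketch (not the actual architecture; `model` is a hypothetical stand-in network): a diffusion sampler queries the denoising network once per step, whereas the distilled consistency model maps noise to the output in a single query.

```python
# Toy illustration of the query-count gap between iterative diffusion sampling
# and single-step consistency generation. `model` only counts invocations.
call_count = 0

def model(x, t, text_emb):
    """Stand-in for one core network query; counts invocations."""
    global call_count
    call_count += 1
    return x  # placeholder "denoised" output

def diffusion_sample(x, text_emb, steps=400):
    """Iterative sampling: one network query per step."""
    for t in reversed(range(steps)):
        x = model(x, t, text_emb)
    return x

def consistency_sample(x, text_emb):
    """Single-step generation: one network query in total."""
    return model(x, 0, text_emb)

diffusion_sample(0.0, None)
n_diffusion, call_count = call_count, 0
consistency_sample(0.0, None)
n_consistency = call_count
print(n_diffusion, n_consistency)  # 400 1
```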

<center>
<img src="main_figure_.png" alt="ConsistencyTTA Results" title="Results" width="480"/>
</center>

## Model Details

We share three model checkpoints:

- [ConsistencyTTA directly distilled from a diffusion model](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA.zip);
- [ConsistencyTTA fine-tuned by optimizing the CLAP score](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA_CLAPFT.zip);
- [The diffusion teacher model from which ConsistencyTTA is distilled](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/LightweightLDM.zip).

The first two models are capable of high-quality single-step text-to-audio generation. All generations are 10 seconds long.

After downloading and unzipping the files, place them in the `saved` directory.
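
As an illustration, the archives can be fetched and unpacked with a short script like the one below. This is a sketch, not official tooling: it assumes the `huggingface_hub` package is installed and uses the repository and file names from the checkpoint list above.

```python
# Sketch: download a ConsistencyTTA checkpoint archive and unpack it into the
# `saved` directory that the training/inference scripts expect.
import zipfile
from pathlib import Path


def unpack_checkpoint(zip_path: str, dest: str = "saved") -> Path:
    """Extract a downloaded checkpoint archive into the destination directory."""
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    return out_dir


def download_checkpoint(filename: str = "ConsistencyTTA.zip") -> str:
    """Fetch one of the three archives from this model repository."""
    from huggingface_hub import hf_hub_download  # assumed extra dependency
    return hf_hub_download(repo_id="Bai-YT/ConsistencyTTA", filename=filename)


# Usage (requires network access):
# unpack_checkpoint(download_checkpoint("ConsistencyTTA_CLAPFT.zip"))
```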

The training and inference code is available in our [GitHub repository](https://github.com/Bai-YT/ConsistencyTTA); please refer to it for usage details.