Text-to-Audio
Transformers
English
Inference Endpoints
soujanyaporia committed · Commit 4422be4 · verified · 1 Parent(s): 255f8e3

Update README.md

Files changed (1): README.md +5 -16
README.md CHANGED
@@ -8,37 +8,34 @@ pipeline_tag: text-to-audio
  tags:
  - text-to-audio
  ---
- # TANGO: Text to Audio using iNstruction-Guided diffusiOn
+ # Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization

- **TANGO** is a latent diffusion model for text-to-audio generation. **TANGO** can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. We outperform current state-of-the-art models for audio generation across both objective and subjective metrics. We release our model, training and inference code, and pre-trained checkpoints for the research community.
+ 🎵 We developed **Tango 2** building on **Tango** for text-to-audio generation. **Tango 2** was initialized with the **Tango-full-ft** checkpoint and underwent alignment training using DPO on **audio-alpaca**, a dataset of pairwise audio preferences. 🎶

- 📣 We are releasing [**Tango-Full-FT-Audiocaps**](https://huggingface.co/declare-lab/tango-full-ft-audiocaps), which was first pre-trained on [**TangoPromptBank**](https://huggingface.co/datasets/declare-lab/TangoPromptBank), a collection of diverse text-audio pairs. We later fine-tuned this checkpoint on AudioCaps. This checkpoint obtained state-of-the-art results for text-to-audio generation on AudioCaps.

  ## Code

  Our code is released here: [https://github.com/declare-lab/tango](https://github.com/declare-lab/tango)

- We uploaded several **TANGO**-generated samples here: [https://tango-web.github.io/](https://tango-web.github.io/)

  Please follow the instructions in the repository for installation, usage and experiments.

  ## Quickstart Guide

- Download the **TANGO** model and generate audio from a text prompt:
+ Download the **Tango 2** model and generate audio from a text prompt:

  ```python
  import IPython
  import soundfile as sf
  from tango import Tango

- tango = Tango("declare-lab/tango")
+ tango = Tango("declare-lab/tango2-full")

  prompt = "An audience cheering and clapping"
  audio = tango.generate(prompt)
  sf.write(f"{prompt}.wav", audio, samplerate=16000)
  IPython.display.Audio(data=audio, rate=16000)
  ```
- [An audience cheering and clapping.webm](https://user-images.githubusercontent.com/13917097/233851915-e702524d-cd35-43f7-93e0-86ea579231a7.webm)

  The model will be automatically downloaded and saved in the cache. Subsequent runs will load the model directly from the cache.
 
@@ -49,9 +46,7 @@ prompt = "Rolling thunder with lightning strikes"
  audio = tango.generate(prompt, steps=200)
  IPython.display.Audio(data=audio, rate=16000)
  ```
- [Rolling thunder with lightning strikes.webm](https://user-images.githubusercontent.com/13917097/233851929-90501e41-911d-453f-a00b-b215743365b4.webm)

- <!-- [MachineClicking](https://user-images.githubusercontent.com/25340239/233857834-bfda52b4-4fcc-48de-b47a-6a6ddcb3671b.mp4 "sample 1") -->

  Use the `generate_for_batch` function to generate multiple audio samples for a batch of text prompts:
 
@@ -63,10 +58,4 @@ prompts = [
  ]
  audios = tango.generate_for_batch(prompts, samples=2)
  ```
- This will generate two samples for each of the three text prompts.
-
- ## Limitations
-
- TANGO is trained on the small AudioCaps dataset, so it may not generate good audio samples for concepts that it has not seen in training (e.g., _singing_). For the same reason, TANGO is not always able to finely control its generations with textual prompts. For example, the generations from TANGO for the prompts _Chopping tomatoes on a wooden table_ and _Chopping potatoes on a metal table_ are very similar. _Chopping vegetables on a table_ also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings.
-
- We are training another version of TANGO on larger datasets to enhance its generalization, compositional and controllable generation ability.
+ This will generate two samples for each of the three text prompts.
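
For context on the alignment step the updated card describes: DPO fine-tunes the policy to prefer the winning audio of each (preferred, rejected) pair over the losing one, relative to a frozen reference model. Below is a minimal sketch of the pairwise objective; `dpo_pair_loss` and its inputs are illustrative names, not the repository's API, and a latent diffusion model like Tango 2 would in practice approximate these log-likelihoods via its denoising objective.

```python
import torch
import torch.nn.functional as F

def dpo_pair_loss(policy_logp_win: torch.Tensor,
                  policy_logp_lose: torch.Tensor,
                  ref_logp_win: torch.Tensor,
                  ref_logp_lose: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Log-likelihood margins of the trainable policy over the frozen
    # reference model, for the preferred ("win") and rejected ("lose")
    # audio generated for the same prompt.
    win_margin = policy_logp_win - ref_logp_win
    lose_margin = policy_logp_lose - ref_logp_lose
    # Widen the policy's preference gap between the two samples;
    # beta scales the implicit KL penalty toward the reference model.
    return -F.logsigmoid(beta * (win_margin - lose_margin)).mean()
```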
 
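The second hunk keeps the `steps` argument of `generate`, which sets the number of denoising steps. A quick sweep makes the quality/latency trade-off easy to compare; this sketch uses only the calls shown in the quickstart, and the output file names are arbitrary.

```python
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango2-full")
prompt = "Rolling thunder with lightning strikes"

# More denoising steps generally yields cleaner audio, at the cost of
# proportionally longer sampling time.
for steps in (50, 100, 200):
    audio = tango.generate(prompt, steps=steps)
    sf.write(f"thunder_{steps}_steps.wav", audio, samplerate=16000)
```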
 
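Likewise, `generate_for_batch` returns `samples` generations per prompt. One way to persist all of them is sketched below; the prompts are arbitrary examples, and the nested return shape is an assumption based on the card's description.

```python
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango2-full")

prompts = [
    "A car engine revving",
    "A dog barking in the distance",
    "Water trickling down a stream",
]
audios = tango.generate_for_batch(prompts, samples=2)

# Assumed shape: audios[i][j] is the j-th sample for prompts[i].
for prompt, samples in zip(prompts, audios):
    for j, audio in enumerate(samples):
        sf.write(f"{prompt} ({j}).wav", audio, samplerate=16000)
```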