Text-to-Audio
Transformers
English
Inference Endpoints
soujanyaporia committed · Commit 4422be4 · verified · 1 Parent(s): 255f8e3

Update README.md

Files changed (1): README.md +5 -16
README.md CHANGED
@@ -8,37 +8,34 @@ pipeline_tag: text-to-audio
  tags:
  - text-to-audio
  ---
- # TANGO: Text to Audio using iNstruction-Guided diffusiOn
+ # Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization

- **TANGO** is a latent diffusion model for text-to-audio generation. **TANGO** can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. We outperform current state-of-the-art models for audio generation across both objective and subjective metrics. We release our model, training and inference code, and pre-trained checkpoints for the research community.
+ 🎵 We developed **Tango 2** building on **Tango** for text-to-audio generation. **Tango 2** was initialized with the **Tango-full-ft** checkpoint and underwent alignment training using DPO on **audio-alpaca**, a dataset of pairwise audio preferences. 🎶

- 📣 We are releasing [**Tango-Full-FT-Audiocaps**](https://huggingface.co/declare-lab/tango-full-ft-audiocaps), which was first pre-trained on [**TangoPromptBank**](https://huggingface.co/datasets/declare-lab/TangoPromptBank), a collection of diverse text-audio pairs. We later fine-tuned this checkpoint on AudioCaps. This checkpoint obtained state-of-the-art results for text-to-audio generation on AudioCaps.

  ## Code

  Our code is released here: [https://github.com/declare-lab/tango](https://github.com/declare-lab/tango)

- We uploaded several **TANGO**-generated samples here: [https://tango-web.github.io/](https://tango-web.github.io/)

  Please follow the instructions in the repository for installation, usage and experiments.

  ## Quickstart Guide

- Download the **TANGO** model and generate audio from a text prompt:
+ Download the **Tango 2** model and generate audio from a text prompt:

  ```python
  import IPython
  import soundfile as sf
  from tango import Tango

- tango = Tango("declare-lab/tango")
+ tango = Tango("declare-lab/tango2-full")

  prompt = "An audience cheering and clapping"
  audio = tango.generate(prompt)
  sf.write(f"{prompt}.wav", audio, samplerate=16000)
  IPython.display.Audio(data=audio, rate=16000)
  ```
- [An audience cheering and clapping.webm](https://user-images.githubusercontent.com/13917097/233851915-e702524d-cd35-43f7-93e0-86ea579231a7.webm)

  The model will be automatically downloaded and saved in the cache. Subsequent runs will load the model directly from the cache.
 
@@ -49,9 +46,7 @@ prompt = "Rolling thunder with lightning strikes"
  audio = tango.generate(prompt, steps=200)
  IPython.display.Audio(data=audio, rate=16000)
  ```
- [Rolling thunder with lightning strikes.webm](https://user-images.githubusercontent.com/13917097/233851929-90501e41-911d-453f-a00b-b215743365b4.webm)

- <!-- [MachineClicking](https://user-images.githubusercontent.com/25340239/233857834-bfda52b4-4fcc-48de-b47a-6a6ddcb3671b.mp4 "sample 1") -->

  Use the `generate_for_batch` function to generate multiple audio samples for a batch of text prompts:
 
@@ -63,10 +58,4 @@ prompts = [
  ]
  audios = tango.generate_for_batch(prompts, samples=2)
  ```
- This will generate two samples for each of the three text prompts.
-
- ## Limitations
-
- TANGO is trained on the small AudioCaps dataset, so it may not generate good audio samples for concepts that it has not seen in training (e.g., _singing_). For the same reason, TANGO is not always able to finely control its generations with textual prompts. For example, the generations from TANGO for the prompts _Chopping tomatoes on a wooden table_ and _Chopping potatoes on a metal table_ are very similar. _Chopping vegetables on a table_ also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings.
-
- We are training another version of TANGO on larger datasets to enhance its generalization, compositional and controllable generation ability.
+ This will generate two samples for each of the three text prompts.
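
For context on the alignment step the updated card describes: DPO fine-tunes the policy to prefer the winning audio of each (preferred, rejected) pair over the losing one, relative to a frozen reference model. Below is a minimal sketch of the pairwise objective; `dpo_pair_loss` and its inputs are illustrative names, not the repository's API, and a latent diffusion model like Tango 2 would in practice approximate these log-likelihoods via its denoising objective.

```python
import torch
import torch.nn.functional as F

def dpo_pair_loss(policy_logp_win: torch.Tensor,
                  policy_logp_lose: torch.Tensor,
                  ref_logp_win: torch.Tensor,
                  ref_logp_lose: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Log-likelihood margins of the trainable policy over the frozen
    # reference model, for the preferred ("win") and rejected ("lose")
    # audio generated for the same prompt.
    win_margin = policy_logp_win - ref_logp_win
    lose_margin = policy_logp_lose - ref_logp_lose
    # Widen the policy's preference gap between the two samples;
    # beta scales the implicit KL penalty toward the reference model.
    return -F.logsigmoid(beta * (win_margin - lose_margin)).mean()
```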
 
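The second hunk keeps the `steps` argument of `generate`, which sets the number of denoising steps. A quick sweep makes the quality/latency trade-off easy to compare; this sketch uses only the calls shown in the quickstart, and the output file names are arbitrary.

```python
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango2-full")
prompt = "Rolling thunder with lightning strikes"

# More denoising steps generally yields cleaner audio, at the cost of
# proportionally longer sampling time.
for steps in (50, 100, 200):
    audio = tango.generate(prompt, steps=steps)
    sf.write(f"thunder_{steps}_steps.wav", audio, samplerate=16000)
```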
 
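Likewise, `generate_for_batch` returns `samples` generations per prompt. One way to persist all of them is sketched below; the prompts are arbitrary examples, and the nested return shape is an assumption based on the card's description.

```python
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango2-full")

prompts = [
    "A car engine revving",
    "A dog barking in the distance",
    "Water trickling down a stream",
]
audios = tango.generate_for_batch(prompts, samples=2)

# Assumed shape: audios[i][j] is the j-th sample for prompts[i].
for prompt, samples in zip(prompts, audios):
    for j, audio in enumerate(samples):
        sf.write(f"{prompt} ({j}).wav", audio, samplerate=16000)
```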