soujanyaporia committed on
Commit
a04d5c8
•
1 Parent(s): 386c9b1

Create README.md

Files changed (1)
  1. README.md +98 -0
README.md ADDED
@@ -0,0 +1,98 @@
---
license: apache-2.0
datasets:
- amaai-lab/MusicBench
tags:
- music
---

<div align="center">

# Mustango: Toward Controllable Text-to-Music Generation

[Demo]() [Model](https://huggingface.co/declare-lab/mustango) [Website and Examples](https://amaai-lab.github.io/mustango/) [Paper](https://arxiv.org/abs/2311.08355) [Dataset](https://huggingface.co/datasets/amaai-lab/MusicBench)
</div>

Meet Mustango, a new addition to the landscape of multimodal large language models, designed for controllable music generation. Mustango combines a Latent Diffusion Model (LDM), Flan-T5, and musical features to generate music from text.

<div align="center">
<img src="img/mustango.jpg" width="500"/>
</div>

## Quickstart Guide

Generate music from a text prompt:

```python
import IPython
import soundfile as sf
from mustango import Mustango

# Load the released Mustango checkpoint.
model = Mustango("declare-lab/mustango")

prompt = "This is a new age piece. There is a flute playing the main melody with a lot of staccato notes. The rhythmic background consists of a medium tempo electronic drum beat with percussive elements all over the spectrum. There is a playful atmosphere to the piece. This piece can be used in the soundtrack of a children's TV show or an advertisement jingle."

# Generate the waveform, then save and play it at 16 kHz.
audio = model.generate(prompt)
# A short, fixed filename avoids filesystem name-length limits.
sf.write("mustango_output.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
```

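A full prompt can be much longer than the filename length that many filesystems accept (commonly 255 bytes), so it is safer to derive a short name from the prompt when saving several outputs. The following is a minimal sketch, assuming a hypothetical helper `prompt_to_filename` that is not part of the Mustango package:

```python
import re


def prompt_to_filename(prompt: str, max_words: int = 8,
                       max_chars: int = 64, ext: str = ".wav") -> str:
    """Build a short, filesystem-safe filename from a free-text prompt.

    Hypothetical helper for illustration; not part of the Mustango package.
    """
    # Keep only lowercase alphanumeric words; punctuation and spaces are dropped.
    words = re.findall(r"[a-z0-9]+", prompt.lower())
    # Join the first few words and cap the stem length.
    stem = "_".join(words[:max_words])[:max_chars]
    # Fall back to a generic stem if the prompt has no usable characters.
    return (stem or "output") + ext


name = prompt_to_filename(
    "This is a new age piece. There is a flute playing the main melody."
)
print(name)
```

The result can be passed to `sf.write` in place of a fixed filename, for example `sf.write(prompt_to_filename(prompt), audio, samplerate=16000)`.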
## Datasets

The [MusicBench](https://huggingface.co/datasets/amaai-lab/MusicBench) dataset contains 52k music fragments, each with a rich, music-specific text caption.

## Subjective Evaluation by Expert Listeners

| **Model** | **Dataset** | **Pre-trained** | **Relevance** ↑ | **Chord Match** ↑ | **Tempo Match** ↑ | **Audio Quality** ↑ | **Musicality** ↑ | **Rhythmic Presence and Stability** ↑ | **Harmony and Consonance** ↑ |
|-----------|-------------|:-----------------:|:-----------:|:-----------:|:-----------:|:----------:|:----------:|:----------:|:----------:|
| Tango | MusicCaps | ✓ | 4.35 | 2.75 | 3.88 | 3.35 | 2.83 | 3.95 | 3.84 |
| Tango | MusicBench | ✓ | 4.91 | 3.61 | 3.86 | 3.88 | 3.54 | 4.01 | 4.34 |
| Mustango | MusicBench | ✓ | 5.49 | 5.76 | 4.98 | 4.30 | 4.28 | 4.65 | 5.18 |
| Mustango | MusicBench | ✗ | 5.75 | 6.06 | 5.11 | 4.80 | 4.80 | 4.75 | 5.59 |

## Training

We use the `accelerate` package from Hugging Face for multi-GPU training. Run `accelerate config` from a terminal and set up your run configuration by answering the questions asked.

You can now train **Mustango** on the MusicBench dataset using:

```bash
accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config_munet.json" \
--model_type Mustango --freeze_text_encoder --uncondition_all --uncondition_single \
--drop_sentences --random_pick_text_column --snr_gamma 5
```

The `--model_type` flag lets you train either Mustango or Tango with the same code. Note that you must also set `--unet_model_config` to the matching config: `diffusion_model_config_munet` for Mustango, or `diffusion_model_config` for Tango.

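
The pairing between `--model_type` and `--unet_model_config` can be kept in one place so the two never drift apart. The sketch below is only an illustration: `build_train_command` and `UNET_CONFIGS` are hypothetical names, and the Tango config path is an assumption inferred from the config name mentioned above. It builds only the core arguments and leaves the dropout flags to be added as needed.

```python
import shlex

# Assumed mapping from model type to U-Net config. The Mustango path comes from
# the training command; the Tango path is inferred and may differ in the repo.
UNET_CONFIGS = {
    "Mustango": "configs/diffusion_model_config_munet.json",
    "Tango": "configs/diffusion_model_config.json",
}


def build_train_command(model_type: str) -> list[str]:
    """Return the core `accelerate launch` arguments for the given model type."""
    if model_type not in UNET_CONFIGS:
        raise ValueError(f"Unknown model type: {model_type!r}")
    return [
        "accelerate", "launch", "train.py",
        "--text_encoder_name=google/flan-t5-large",
        "--scheduler_name=stabilityai/stable-diffusion-2-1",
        f"--unet_model_config={UNET_CONFIGS[model_type]}",
        "--model_type", model_type,
    ]


# Print a shell-ready command line for training Tango.
print(shlex.join(build_train_command("Tango")))
```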
The arguments `--uncondition_all`, `--uncondition_single`, and `--drop_sentences` control the dropout functions described in Section 5.2 of our paper. The `--random_pick_text_column` argument randomly picks between two input text prompts. In the case of MusicBench, we pick between the ChatGPT-rephrased captions and the original enhanced MusicCaps prompts, as depicted in Figure 1 of our paper.

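
As a rough illustration of what caption picking and sentence dropping can look like, here is a minimal sketch. It is not the training code from this repository: the column names `main_caption` and `alt_caption`, the helper names, and the `keep_prob` value are all assumptions, and the actual dropout scheme is the one described in Section 5.2 of the paper.

```python
import random


def pick_caption(row: dict, rng: random.Random,
                 columns=("main_caption", "alt_caption")) -> str:
    """Randomly choose one of the non-empty caption columns (names are assumed)."""
    available = [row[c] for c in columns if row.get(c)]
    if not available:
        raise ValueError("row has no caption in the given columns")
    return rng.choice(available)


def drop_sentences(caption: str, rng: random.Random, keep_prob: float = 0.8) -> str:
    """Keep each sentence with probability keep_prob, always keeping at least one.

    keep_prob is an illustrative value, not the one used in the paper.
    """
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    if not sentences:
        return caption
    kept = [s for s in sentences if rng.random() < keep_prob]
    if not kept:
        kept = [rng.choice(sentences)]
    return ". ".join(kept) + "."


rng = random.Random(0)  # Seeded for reproducibility.
row = {
    "main_caption": "A calm piano piece. The tempo is slow. The key is C major.",
    "alt_caption": "Soft, slow piano music in C major.",
}
print(drop_sentences(pick_caption(row, rng), rng))
```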
When training from scratch on MusicBench, we recommend at least 40 epochs.

## Model Zoo

We have released the following models:

Mustango Pretrained: https://huggingface.co/declare-lab/mustango

Mustango: Coming soon!

## Citation

Please consider citing the following article if you find our work useful:

```
@misc{melechovsky2023mustango,
  title={Mustango: Toward Controllable Text-to-Music Generation},
  author={Jan Melechovsky and Zixun Guo and Deepanway Ghosal and Navonil Majumder and Dorien Herremans and Soujanya Poria},
  year={2023},
  eprint={2311.08355},
  archivePrefix={arXiv},
}
```