sino committed
Commit ced1e40 · 1 Parent(s): e9cd07a

Update README.md

Files changed (1)
  1. README.md +79 -85
README.md CHANGED
@@ -2,120 +2,114 @@
  language:
  - zh
  - en
- tags:
- - qwen
  pipeline_tag: text-generation
- inference: false
  ---

- # Qwen-Audio

  <br>
-
- <p align="center">
-     <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/audio_logo.jpg" width="400"/>
- </p>
- <br>
-
- <p align="center">
-     Qwen-Audio <a href="https://www.modelscope.cn/models/qwen/QWen-Audio/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio">🤗</a>&nbsp; | Qwen-Audio-Chat <a href="https://www.modelscope.cn/models/qwen/QWen-Audio-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio-Chat">🤗</a>&nbsp; | &nbsp;&nbsp;Demo <a href="https://modelscope.cn/studios/qwen/Qwen-Audio-Chat-Demo/summary">🤖</a> | <a href="https://huggingface.co/spaces/Qwen/Qwen-Audio">🤗</a>
- <br>
- &nbsp;&nbsp;<a href="https://qwen-audio.github.io/Qwen-Audio/">Homepage</a>&nbsp; | &nbsp;<a href="http://arxiv.org/abs/2311.07919">Paper</a> | &nbsp;<a href="https://huggingface.co/papers/2311.07919">🤗</a>
  </p>
- <br><br>
-
- **Qwen-Audio** (Qwen Large Audio Language Model) is the multimodal version of the large model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music, and song) and text as inputs and produces text as output. The contributions of Qwen-Audio include:
-
- - **Fundamental audio models**: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
- - **Multi-task learning framework for all types of audio**: To scale up audio-language pre-training, we address the challenge of variation in textual labels associated with different datasets by proposing a multi-task training framework, enabling knowledge sharing and avoiding one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show that the model achieves strong performance.
- - **Strong performance**: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound.
- - **Flexible multi-turn chat from audio and text input**: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage for speech editing.
-
- **Qwen-Audio** is a large-scale audio-language model developed by Alibaba Cloud. Qwen-Audio accepts diverse audio (including human speech, natural sound, music, and singing) and text as inputs, and produces text as output. Highlights of the Qwen-Audio series include:
-
- - **Fundamental audio model**: Qwen-Audio is a high-performing, general-purpose audio understanding model that supports various tasks, languages, and audio types. Building on Qwen-Audio, we developed Qwen-Audio-Chat through instruction fine-tuning, supporting multi-turn, multilingual dialogue. Both Qwen-Audio and Qwen-Audio-Chat are open-sourced.
- - **A multi-task learning framework for diverse, complex audio**: To avoid the one-to-many audio-to-text interference caused by differences in data sources and task types, we propose a multi-task training framework that shares knowledge across similar tasks while minimizing interference between different tasks. With this framework, Qwen-Audio is trained on more than 30 different audio tasks.
- - **Strong performance**: Without any task-specific fine-tuning, Qwen-Audio achieves leading results on a wide range of benchmark tasks. Specifically, it reaches state of the art on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound.
- - **Multi-turn audio and text dialogue across diverse speech scenarios**: Qwen-Audio-Chat supports sound understanding and reasoning, music appreciation, multi-audio analysis, multi-turn audio-text interleaved dialogue, and the use of external speech tools (such as speech editing).
-
-
- We release Qwen-Audio and Qwen-Audio-Chat, which are the pretrained model and the Chat model, respectively. For more details about Qwen-Audio, please refer to our [GitHub repo](https://github.com/QwenLM/Qwen-Audio/tree/main). This repo is the one for Qwen-Audio.
  <br>

- We currently provide two models, Qwen-Audio and Qwen-Audio-Chat, which are the pretrained model and the Chat model, respectively. For more information, please visit the [GitHub repository](https://github.com/QwenLM/Qwen-Audio/tree/main). This repo is for Qwen-Audio.


  ## Requirements
- * Python 3.8 and above
- * PyTorch 1.12 and above; 2.0 and above is recommended
- * CUDA 11.4 and above is recommended (for GPU users)
- * FFmpeg (a quick check of these requirements is sketched below)
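As a quick way to confirm the requirements above are met, a minimal check could look like the following sketch. It assumes only the Python standard library plus an already-installed PyTorch, and looks FFmpeg up on PATH; it is an illustrative aid, not part of the upstream setup.

```python
# Minimal environment check against the requirements listed above.
import shutil
import sys

import torch

print("python :", sys.version.split()[0])          # want 3.8+
print("torch  :", torch.__version__)               # want 1.12+, 2.0+ recommended
print("cuda   :", torch.version.cuda,              # want 11.4+ for GPU users
      "| available:", torch.cuda.is_available())
print("ffmpeg :", shutil.which("ffmpeg") or "not found on PATH")
```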
 
  <br>

  ## Quickstart
- Below, we provide simple examples to show how to use Qwen-Audio with 🤗 Transformers.
-
- Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.
-
- ```bash
- pip install -r requirements.txt
- ```
- For more details, please refer to the [tutorial](https://github.com/QwenLM/Qwen-Audio).

  #### 🤗 Transformers

- To use Qwen-Audio for inference, all you need to do is run a few lines of code as demonstrated below. However, **please make sure that you are using the latest code.**

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from transformers.generation import GenerationConfig
  import torch
- torch.manual_seed(1234)
-
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)
-
- # use bf16
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, bf16=True).eval()
- # use fp16
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, fp16=True).eval()
- # use cpu only
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cpu", trust_remote_code=True).eval()
- # use cuda device
- model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cuda", trust_remote_code=True).eval()
-
- # Specify hyperparameters for generation (no need to do this if you are using transformers>4.32.0)
- # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)
- audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac"
- sp_prompt = "<|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>"
- query = f"<audio>{audio_url}</audio>{sp_prompt}"
- audio_info = tokenizer.process_audio(query)
- inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)
- inputs = inputs.to(model.device)
- pred = model.generate(**inputs, audio_info=audio_info)
- response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
- print(response)
- # <audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac</audio><|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>mister quilting is the apostle of the middle classes and we are glad to welcome his gospel<|endoftext|>
- ```

- ## License Agreement
- Researchers and developers are free to use the code and model weights of Qwen-Audio. We also allow its commercial use. Check our license at [LICENSE](https://github.com/QwenLM/Qwen-Audio/blob/main/LICENSE.txt) for more details.
- <br>

  ## Citation
  If you find our paper and code useful in your research, please consider giving it a star and a citation.

  ```BibTeX
- @article{Qwen-Audio,
- title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
- author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
- journal={arXiv preprint arXiv:2311.07919},
  year={2023}
  }
  ```
  <br>

- ## Contact Us
-
- If you are interested in leaving a message for either our research team or product team, feel free to send an email to [email protected].
-
 
  language:
  - zh
  - en
  pipeline_tag: text-generation
  ---

+ # JMLA

  <br>
+ <p align="center"><a href="https://arxiv.org/pdf/2310.10159.pdf">Paper</a>
  </p>
  <br>

+ Music tagging is the task of predicting the tags of music recordings. However, previous music tagging research primarily focuses on closed-set music tagging tasks, which cannot generalize to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (**JMLA**) model to address the open-set music tagging problem. The **JMLA** model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by Falcon7B.
+ We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings, and we add dense attention connections between encoder and decoder layers to improve the information flow between them. We collect a large-scale music and description dataset from the internet and use ChatGPT to convert the raw descriptions into formalized and diverse descriptions for training the **JMLA** models. Our proposed **JMLA** system achieves a zero-shot audio tagging accuracy of 64.82% on the GTZAN dataset, outperforming previous zero-shot systems, and achieves results comparable to previous systems on the FMA and MagnaTagATune datasets.
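In essence, the perceiver resampler mentioned above is a small set of learned latent queries that cross-attend to the variable-length audio features, so the decoder always receives a fixed number of audio embeddings. The sketch below only illustrates this idea; the dimensions, the single attention block, and the class name are illustrative assumptions rather than the actual JMLA implementation.

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Illustrative resampler: N learned queries cross-attend to audio frames.

    Hypothetical sizes; the real JMLA resampler may differ in depth and width.
    """
    def __init__(self, dim=768, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_feats):  # audio_feats: (batch, n_frames, dim), n_frames varies
        b = audio_feats.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)       # (batch, num_latents, dim)
        x, _ = self.cross_attn(q, audio_feats, audio_feats)   # latents attend to audio frames
        x = self.norm(x + q)
        return x + self.ffn(x)                                # fixed-length output

# Any number of input frames maps to the same fixed number of output embeddings.
resampler = PerceiverResamplerSketch()
feats_short = torch.randn(1, 250, 768)    # a few seconds of audio features
feats_long = torch.randn(1, 3000, 768)    # a much longer clip
print(resampler(feats_short).shape, resampler(feats_long).shape)
# torch.Size([1, 64, 768]) torch.Size([1, 64, 768])
```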

  ## Requirements
+ * conda create -n SpectPrompt python=3.9
+ * pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+ * pip install transformers datasets librosa einops_exts einops mmcls peft ipdb torchlibrosa
+ * pip install -U openmim
+ * mim install mmcv==1.7.1
  <br>

  ## Quickstart
+ Below, we provide a simple example to show how to use **JMLA** with 🤗 Transformers.

  #### 🤗 Transformers

+ To use JMLA for inference, all you need to do is run a few lines of code, as demonstrated below.

  ```python
+ from transformers import AutoModel, AutoTokenizer
  import torch
+ import numpy as np
+
+ model = AutoModel.from_pretrained('Tabgac/SpectPrompt', trust_remote_code=True)
+ device = model.device
+ # sample rate: 16 kHz
+ music_path = '/path/to/music.wav'
+
+ # extract a log-mel spectrogram
+ # 1. parameters
+ class FFT_parameters:
+     sample_rate = 16000
+     window_size = 400
+     n_fft = 400
+     hop_size = 160
+     n_mels = 80
+     f_min = 50
+     f_max = 8000
+ prms = FFT_parameters()
+ # 2. extraction
+ import nnAudio.Spectrogram
+ import librosa
+ to_spec = nnAudio.Spectrogram.MelSpectrogram(
+     sr=prms.sample_rate,
+     n_fft=prms.n_fft,
+     win_length=prms.window_size,
+     hop_length=prms.hop_size,
+     n_mels=prms.n_mels,
+     fmin=prms.f_min,
+     fmax=prms.f_max,
+     center=True,
+     power=2,
+     verbose=False,
+ )
+ wav, ori_sr = librosa.load(music_path, mono=True, sr=prms.sample_rate)
+ lms = to_spec(torch.tensor(wav))
+ lms = (lms + torch.finfo().eps).log()  # keep on CPU until the final .to(device)
+ # 3. processing
+ import os
+ from torch.nn.utils.rnn import pad_sequence
+ import random
+ # get the file transforms.py from https://github.com/taugastcn/SpectPrompt.git
+ from transforms import Normalize, SpecRandomCrop, SpecPadding, SpecRepeat
+
+ transforms = [Normalize(-4.5, 4.5), SpecRandomCrop(target_len=2992), SpecPadding(target_len=2992), SpecRepeat()]
+ lms = lms.numpy()
+
+ for trans in transforms:
+     lms = trans(lms)
+
+ # template of the model input
+ input = dict()
+ input['filenames'] = [music_path.split('/')[-1]]
+ input['ans_crds'] = [0]
+ input['audio_crds'] = [0]
+ input['attention_mask'] = torch.tensor([[1, 1, 1, 1, 1]]).to(device)
+ input['input_ids'] = torch.tensor([[1, 694, 5777, 683, 13]]).to(device)
+ input['spectrogram'] = torch.from_numpy(lms).unsqueeze(dim=0).to(device)
+ # generation
+ model.eval()
+ gen_ids = model.forward_test(input)
+ gen_text = model.neck.tokenizer.batch_decode(gen_ids.clip(0))

+ ```
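`batch_decode` returns one decoded string per input example, so the generated tags/description can be inspected directly. A minimal usage note (the exact output formatting depends on the model's prompt):

```python
# gen_text holds one decoded string per input example.
print(input['filenames'][0], '->', gen_text[0])
```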

  ## Citation
  If you find our paper and code useful in your research, please consider giving it a star and a citation.

  ```BibTeX
+ @article{JMLA,
+ title={Joint Music and Language Attention Models for Zero-Shot Music Tagging},
+ author={Xingjian Du and Zhesong Yu and Jiaju Lin and Bilei Zhu and Qiuqiang Kong},
+ journal={arXiv preprint arXiv:2310.10159},
  year={2023}
  }
  ```
  <br>