sino committed
Commit ced1e40 · 1 Parent(s): e9cd07a

Update README.md

Files changed (1)
  1. README.md +79 -85
README.md CHANGED
@@ -2,120 +2,114 @@
  language:
  - zh
  - en
- tags:
- - qwen
  pipeline_tag: text-generation
- inference: false
  ---

- # Qwen-Audio

  <br>
-
- <p align="center">
-     <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/audio_logo.jpg" width="400"/>
- </p>
- <br>
-
- <p align="center">
-     Qwen-Audio <a href="https://www.modelscope.cn/models/qwen/QWen-Audio/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio">🤗</a>&nbsp; | Qwen-Audio-Chat <a href="https://www.modelscope.cn/models/qwen/QWen-Audio-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-Audio-Chat">🤗</a>&nbsp; | &nbsp;&nbsp;Demo <a href="https://modelscope.cn/studios/qwen/Qwen-Audio-Chat-Demo/summary">🤖</a> | <a href="https://huggingface.co/spaces/Qwen/Qwen-Audio">🤗</a>
- <br>
- &nbsp;&nbsp;<a href="https://qwen-audio.github.io/Qwen-Audio/">Homepage</a>&nbsp; | &nbsp;<a href="http://arxiv.org/abs/2311.07919">Paper</a> | &nbsp;<a href="https://huggingface.co/papers/2311.07919">🤗</a>
  </p>
- <br><br>
-
- **Qwen-Audio** (Qwen Large Audio Language Model) is the multimodal version of the large model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music, and song) and text as inputs and produces text as output. The contributions of Qwen-Audio include:
-
- - **Fundamental audio models**: Qwen-Audio is a fundamental multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
- - **Multi-task learning framework for all types of audio**: To scale up audio-language pre-training, we address the challenge of variation in textual labels associated with different datasets by proposing a multi-task training framework, enabling knowledge sharing and avoiding one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show that the model achieves strong performance.
- - **Strong performance**: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound.
- - **Flexible multi-turn chat from audio and text input**: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage for speech editing.
-
- **Qwen-Audio** is a large-scale audio-language model developed by Alibaba Cloud. Qwen-Audio accepts diverse audio (including human speech, natural sound, music, and singing) and text as inputs, and produces text as output. Highlights of the Qwen-Audio series include:
-
- - **Fundamental audio model**: Qwen-Audio is a high-performing, general-purpose audio understanding model that supports various tasks, languages, and audio types. Building on Qwen-Audio, we developed Qwen-Audio-Chat through instruction fine-tuning, supporting multi-turn, multilingual dialogue. Both Qwen-Audio and Qwen-Audio-Chat are open-sourced.
- - **A multi-task learning framework for diverse, complex audio**: To avoid the one-to-many audio-to-text interference caused by differences in data sources and task types, we propose a multi-task training framework that shares knowledge across similar tasks while minimizing interference between different tasks. With this framework, Qwen-Audio is trained on more than 30 different audio tasks.
- - **Strong performance**: Without any task-specific fine-tuning, Qwen-Audio achieves leading results on a wide range of benchmark tasks. Specifically, it reaches state of the art on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound.
- - **Multi-turn audio and text dialogue across diverse speech scenarios**: Qwen-Audio-Chat supports sound understanding and reasoning, music appreciation, multi-audio analysis, multi-turn audio-text interleaved dialogue, and the use of external speech tools (such as speech editing).
-
-
- We release Qwen-Audio and Qwen-Audio-Chat, which are the pretrained model and the Chat model, respectively. For more details about Qwen-Audio, please refer to our [GitHub repo](https://github.com/QwenLM/Qwen-Audio/tree/main). This repo is the one for Qwen-Audio.
  <br>

- We currently provide two models, Qwen-Audio and Qwen-Audio-Chat, which are the pretrained model and the Chat model, respectively. For more information, please visit the [GitHub repository](https://github.com/QwenLM/Qwen-Audio/tree/main). This repo is for Qwen-Audio.


  ## Requirements
- * Python 3.8 and above
- * PyTorch 1.12 and above; 2.0 and above is recommended
- * CUDA 11.4 and above is recommended (for GPU users)
- * FFmpeg (a quick check of these requirements is sketched below)
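As a quick way to confirm the requirements above are met, a minimal check could look like the following sketch. It assumes only the Python standard library plus an already-installed PyTorch, and looks FFmpeg up on PATH; it is an illustrative aid, not part of the upstream setup.

```python
# Minimal environment check against the requirements listed above.
import shutil
import sys

import torch

print("python :", sys.version.split()[0])          # want 3.8+
print("torch  :", torch.__version__)               # want 1.12+, 2.0+ recommended
print("cuda   :", torch.version.cuda,              # want 11.4+ for GPU users
      "| available:", torch.cuda.is_available())
print("ffmpeg :", shutil.which("ffmpeg") or "not found on PATH")
```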
 
  <br>

  ## Quickstart
- Below, we provide simple examples to show how to use Qwen-Audio with 🤗 Transformers.
-
- Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.
-
- ```bash
- pip install -r requirements.txt
- ```
- For more details, please refer to the [tutorial](https://github.com/QwenLM/Qwen-Audio).

  #### 🤗 Transformers

- To use Qwen-Audio for inference, all you need to do is run a few lines of code as demonstrated below. However, **please make sure that you are using the latest code.**

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from transformers.generation import GenerationConfig
  import torch
- torch.manual_seed(1234)
-
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)
-
- # use bf16
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, bf16=True).eval()
- # use fp16
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, fp16=True).eval()
- # use cpu only
- # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cpu", trust_remote_code=True).eval()
- # use cuda device
- model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cuda", trust_remote_code=True).eval()
-
- # Specify hyperparameters for generation (no need to do this if you are using transformers>4.32.0)
- # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)
- audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac"
- sp_prompt = "<|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>"
- query = f"<audio>{audio_url}</audio>{sp_prompt}"
- audio_info = tokenizer.process_audio(query)
- inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)
- inputs = inputs.to(model.device)
- pred = model.generate(**inputs, audio_info=audio_info)
- response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
- print(response)
- # <audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac</audio><|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>mister quilting is the apostle of the middle classes and we are glad to welcome his gospel<|endoftext|>
- ```

- ## License Agreement
- Researchers and developers are free to use the code and model weights of Qwen-Audio. We also allow its commercial use. Check our license at [LICENSE](https://github.com/QwenLM/Qwen-Audio/blob/main/LICENSE.txt) for more details.
- <br>

  ## Citation
  If you find our paper and code useful in your research, please consider giving it a star and a citation.

  ```BibTeX
- @article{Qwen-Audio,
- title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
- author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
- journal={arXiv preprint arXiv:2311.07919},
  year={2023}
  }
  ```
  <br>

- ## Contact Us
-
- If you are interested in leaving a message for either our research team or product team, feel free to send an email to [email protected].
-
 
  language:
  - zh
  - en
  pipeline_tag: text-generation
  ---

+ # JMLA

  <br>
+ <p align="center"><a href="https://arxiv.org/pdf/2310.10159.pdf">Paper</a>
  </p>
  <br>

+ Music tagging is the task of predicting the tags of music recordings. However, previous music tagging research primarily focuses on closed-set music tagging tasks, which cannot generalize to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (**JMLA**) model to address the open-set music tagging problem. The **JMLA** model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by Falcon7B.
+ We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings, and we add dense attention connections between encoder and decoder layers to improve the information flow between them. We collect a large-scale music and description dataset from the internet and use ChatGPT to convert the raw descriptions into formalized and diverse descriptions for training the **JMLA** models. Our proposed **JMLA** system achieves a zero-shot audio tagging accuracy of 64.82% on the GTZAN dataset, outperforming previous zero-shot systems, and achieves results comparable to previous systems on the FMA and MagnaTagATune datasets.
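In essence, the perceiver resampler mentioned above is a small set of learned latent queries that cross-attend to the variable-length audio features, so the decoder always receives a fixed number of audio embeddings. The sketch below only illustrates this idea; the dimensions, the single attention block, and the class name are illustrative assumptions rather than the actual JMLA implementation.

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Illustrative resampler: N learned queries cross-attend to audio frames.

    Hypothetical sizes; the real JMLA resampler may differ in depth and width.
    """
    def __init__(self, dim=768, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_feats):  # audio_feats: (batch, n_frames, dim), n_frames varies
        b = audio_feats.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)       # (batch, num_latents, dim)
        x, _ = self.cross_attn(q, audio_feats, audio_feats)   # latents attend to audio frames
        x = self.norm(x + q)
        return x + self.ffn(x)                                # fixed-length output

# Any number of input frames maps to the same fixed number of output embeddings.
resampler = PerceiverResamplerSketch()
feats_short = torch.randn(1, 250, 768)    # a few seconds of audio features
feats_long = torch.randn(1, 3000, 768)    # a much longer clip
print(resampler(feats_short).shape, resampler(feats_long).shape)
# torch.Size([1, 64, 768]) torch.Size([1, 64, 768])
```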

  ## Requirements
+ * conda create -n SpectPrompt python=3.9
+ * pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+ * pip install transformers datasets librosa einops_exts einops mmcls peft ipdb torchlibrosa
+ * pip install -U openmim
+ * mim install mmcv==1.7.1
  <br>

  ## Quickstart
+ Below, we provide a simple example to show how to use **JMLA** with 🤗 Transformers.

  #### 🤗 Transformers

+ To use JMLA for inference, all you need to do is run a few lines of code, as demonstrated below.

  ```python
+ from transformers import AutoModel, AutoTokenizer
  import torch
+ import numpy as np
+
+ model = AutoModel.from_pretrained('Tabgac/SpectPrompt', trust_remote_code=True)
+ device = model.device
+ # sample rate: 16 kHz
+ music_path = '/path/to/music.wav'
+
+ # extract a log-mel spectrogram
+ # 1. parameters
+ class FFT_parameters:
+     sample_rate = 16000
+     window_size = 400
+     n_fft = 400
+     hop_size = 160
+     n_mels = 80
+     f_min = 50
+     f_max = 8000
+ prms = FFT_parameters()
+ # 2. extraction
+ import nnAudio.Spectrogram
+ import librosa
+ to_spec = nnAudio.Spectrogram.MelSpectrogram(
+     sr=prms.sample_rate,
+     n_fft=prms.n_fft,
+     win_length=prms.window_size,
+     hop_length=prms.hop_size,
+     n_mels=prms.n_mels,
+     fmin=prms.f_min,
+     fmax=prms.f_max,
+     center=True,
+     power=2,
+     verbose=False,
+ )
+ wav, ori_sr = librosa.load(music_path, mono=True, sr=prms.sample_rate)
+ lms = to_spec(torch.tensor(wav))
+ lms = (lms + torch.finfo().eps).log()  # keep on CPU until the final .to(device)
+ # 3. processing
+ import os
+ from torch.nn.utils.rnn import pad_sequence
+ import random
+ # get the file transforms.py from https://github.com/taugastcn/SpectPrompt.git
+ from transforms import Normalize, SpecRandomCrop, SpecPadding, SpecRepeat
+
+ transforms = [Normalize(-4.5, 4.5), SpecRandomCrop(target_len=2992), SpecPadding(target_len=2992), SpecRepeat()]
+ lms = lms.numpy()
+
+ for trans in transforms:
+     lms = trans(lms)
+
+ # template of the model input
+ input = dict()
+ input['filenames'] = [music_path.split('/')[-1]]
+ input['ans_crds'] = [0]
+ input['audio_crds'] = [0]
+ input['attention_mask'] = torch.tensor([[1, 1, 1, 1, 1]]).to(device)
+ input['input_ids'] = torch.tensor([[1, 694, 5777, 683, 13]]).to(device)
+ input['spectrogram'] = torch.from_numpy(lms).unsqueeze(dim=0).to(device)
+ # generation
+ model.eval()
+ gen_ids = model.forward_test(input)
+ gen_text = model.neck.tokenizer.batch_decode(gen_ids.clip(0))

+ ```
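`batch_decode` returns one decoded string per input example, so the generated tags/description can be inspected directly. A minimal usage note (the exact output formatting depends on the model's prompt):

```python
# gen_text holds one decoded string per input example.
print(input['filenames'][0], '->', gen_text[0])
```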

  ## Citation
  If you find our paper and code useful in your research, please consider giving it a star and a citation.

  ```BibTeX
+ @article{JMLA,
+ title={Joint Music and Language Attention Models for Zero-Shot Music Tagging},
+ author={Xingjian Du and Zhesong Yu and Jiaju Lin and Bilei Zhu and Qiuqiang Kong},
+ journal={arXiv preprint arXiv:2310.10159},
  year={2023}
  }
  ```
  <br>