license: mit
inference: false
tags:
- music
# Introduction to our model series
The development log of our Music Audio Pre-training (m-a-p) model family:
- 17/03/2023: we released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
- 14/03/2023: we retrained the MERT-v0 model on an open-source-only music dataset and released it as [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
- 29/12/2022: we released a music understanding model, [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), trained with the **MLM** paradigm, which performs better on downstream tasks.
- 29/10/2022: we released a pre-trained MIR model, [music2vec](https://huggingface.co/m-a-p/music2vec-v1), trained with the **BYOL** paradigm.
The table below summarizes the models for quick selection:
| Name | Pre-train Paradigm | Training Data (hours) | Pre-train Context (seconds) | Model Size | Transformer Layer-Dimension | Feature Rate | Sample Rate | Release Date |
| ------------------------------------------------------------ | ------------------ | -------------------- | ---------------------------- | ---------- | --------------------------- | ------------ | ----------- | ------------ |
| [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M) | MLM | 160K | 5 | 330M | 24-1024 | 75 Hz | 24K Hz | 17/03/2023 |
| [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) | MLM | 20K | 5 | 95M | 12-768 | 75 Hz | 24K Hz | 17/03/2023 |
| [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public) | MLM | 900 | 5 | 95M | 12-768 | 50 Hz | 16K Hz | 14/03/2023 |
| [MERT-v0](https://huggingface.co/m-a-p/MERT-v0) | MLM | 1000 | 5 | 95M | 12-768 | 50 Hz | 16K Hz | 29/12/2022 |
| [music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1) | BYOL | 1000 | 30 | 95M | 12-768 | 50 Hz | 16K Hz | 30/10/2022 |
## Explanation
The m-a-p models share a similar architecture; the most significant difference between them is the pre-training paradigm. Beyond that, there are a few technical configuration details to be aware of before use:
- **Model Size**: the number of parameters that are loaded into memory. Please select a size that fits your hardware.
- **Transformer Layer-Dimension**: the number of transformer layers and the dimension of the features each layer outputs. This is highlighted because features extracted from **different layers can perform differently depending on the task**.
- **Feature Rate**: the number of feature vectors the model outputs per second of input audio.
- **Sample Rate**: the sampling rate of the audio the model was trained on.
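As a rough illustration of how these columns relate, the sketch below does the arithmetic for a few rows of the model table. The dictionary layout and the helper names are hypothetical and not part of any m-a-p API, and the memory figure counts weights only, assuming 4 bytes per parameter (fp32). The real number of output features may differ by a frame or so because of how the convolutional front end handles clip boundaries.

```python
# Values copied from the model table; the structure is only for illustration.
MODEL_SPECS = {
    "MERT-v1-330M": {"params": 330e6, "feature_rate_hz": 75, "sample_rate_hz": 24_000},
    "MERT-v1-95M": {"params": 95e6, "feature_rate_hz": 75, "sample_rate_hz": 24_000},
    "MERT-v0": {"params": 95e6, "feature_rate_hz": 50, "sample_rate_hz": 16_000},
}


def samples_per_feature(spec):
    """Number of input audio samples covered by one output feature vector."""
    return spec["sample_rate_hz"] // spec["feature_rate_hz"]


def approx_num_features(spec, duration_seconds):
    """Approximate number of feature vectors for a clip of the given length."""
    return int(duration_seconds * spec["feature_rate_hz"])


def approx_weight_memory_mb(spec, bytes_per_param=4):
    """Approximate memory for the weights alone (assumes fp32 by default)."""
    return spec["params"] * bytes_per_param / 1e6


if __name__ == "__main__":
    for name, spec in MODEL_SPECS.items():
        print(
            name,
            samples_per_feature(spec),
            approx_num_features(spec, duration_seconds=5),
            approx_weight_memory_mb(spec),
        )
```

Both the 24K Hz / 75 Hz and the 16K Hz / 50 Hz configurations work out to the same number of samples per feature, so the higher feature rate of MERT-v1 follows directly from its higher sample rate.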
# Introduction to MERT-v1
Compared to MERT-v0, MERT-v1 pre-training introduces several changes:
- The pseudo labels are changed to 8 codebooks from [EnCodec](https://github.com/facebookresearch/encodec), which potentially have higher quality and enable our model to support music generation.
- MLM prediction with in-batch noise mixture.
- Training at a higher audio sample rate (24K Hz).
- Training on more audio data (up to 160 thousand hours).
- More available model sizes: 95M and 330M.
More details will be provided in our upcoming paper.
# Model Usage
```python
# from transformers import Wav2Vec2Processor
from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# load our model weights
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
# load the corresponding preprocessor config
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)

# load demo audio and read its sampling rate
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

resample_rate = processor.sampling_rate
# make sure the sample rates are aligned
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

# audio file is decoded on the fly
if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    # cast to float32, the dtype used by the resampler's default kernel
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).float())

inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape, there are 13 layers of representation
# each layer performs differently in different downstream tasks, you should choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13 layer, Time steps, 768 feature_dim]

# for utterance-level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# you can even use a learnable weighted average representation
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```
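The model table lists a 5-second pre-training context for the MERT models. One possible way to handle longer recordings, assuming you want to feed the model windows of roughly that length, is to split the waveform into fixed-size chunks and extract features chunk by chunk. The helper below is a hypothetical sketch, not something the model card prescribes; it assumes a 1-D mono waveform that supports `len()` and slicing (a list, NumPy array, or 1-D tensor).

```python
def split_into_chunks(waveform, sample_rate, chunk_seconds=5.0, drop_short_tail=False):
    """Split a 1-D waveform into consecutive, non-overlapping chunks.

    chunk_seconds defaults to 5.0 to mirror the pre-training context in the
    model table; this default is an assumption, not a requirement.
    """
    chunk_len = int(round(chunk_seconds * sample_rate))
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sample_rate must be at least one sample")

    chunks = [
        waveform[start:start + chunk_len]
        for start in range(0, len(waveform), chunk_len)
    ]
    # Optionally discard a final chunk that is shorter than the others.
    if drop_short_tail and chunks and len(chunks[-1]) < chunk_len:
        chunks.pop()
    return chunks


if __name__ == "__main__":
    # 12 seconds of silence at 24 kHz, standing in for real audio.
    dummy_audio = [0.0] * (12 * 24_000)
    for chunk in split_into_chunks(dummy_audio, sample_rate=24_000):
        print(len(chunk))
```

Each chunk could then be passed through the processor and model as in the usage example, with the per-chunk features averaged or concatenated depending on the downstream task.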
# Citation
```bibtex
@article{li2022large,
  title={Large-Scale Pretrained Model for Self-Supervised Music Audio Representation Learning},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
  year={2022}
}
```