---
library_name: transformers
tags: []
---
# Huggingface Implementation of AV-HuBERT on the MuAViC Dataset
This repository contains a Huggingface implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, specifically trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model designed for audio-visual speech recognition, leveraging both audio and visual modalities to achieve robust performance, especially in noisy environments.
Key features of this repository include:
- Pre-trained Models: Access pre-trained AV-HuBERT models fine-tuned on the MuAViC dataset. The pre-trained model has been exported from the [MuAViC](https://github.com/facebookresearch/muavic) repository.
- Inference scripts: Run inference easily through Huggingface's `generate` interface.
- Data preprocessing scripts: Normalize the video frame rate and extract the lip region and audio track.
### Inference code
```sh
git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt
python run_example.py
```
```python
from src.model.avhubert2text import AV2TextForConditionalGeneration
from src.dataset.load_data import load_feature
from transformers import Speech2TextTokenizer
import torch

if __name__ == "__main__":
    # Load the pretrained English AVSR model and its tokenizer
    model = AV2TextForConditionalGeneration.from_pretrained('nguyenvulebinh/AV-HuBERT')
    tokenizer = Speech2TextTokenizer.from_pretrained('nguyenvulebinh/AV-HuBERT')

    # Move the model to GPU and switch to inference mode
    model = model.cuda().eval()

    # Load normalized input features (lip-region video + audio)
    sample = load_feature(
        './example/lip_movement.mp4',
        "./example/noisy_audio.wav"
    )

    # Move input features to GPU
    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
    # All-False attention mask: nothing is padded in this single-sample batch
    attention_mask = torch.BoolTensor(audio_feats.size(0), audio_feats.size(-1)).fill_(False).cuda()

    # Generate the output token sequence using the Huggingface generate() interface
    output = model.generate(
        audio_feats,
        attention_mask=attention_mask,
        video=video_feats,
    )

    # Decode the token ids back to text
    print(tokenizer.batch_decode(output, skip_special_tokens=True))

    # Sanity-check against the expected output for the bundled example
    assert output.detach().cpu().numpy().tolist() == [[ 2, 16, 130, 516, 8, 339, 541, 808, 210, 195, 541, 79, 130, 317, 269, 4, 2]]
    print("Example ran successfully")
```
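The script above assumes a CUDA device. A minimal, device-agnostic variation is sketched below; it only changes where the tensors live and is otherwise the same pipeline. This is a sketch under the assumption that `load_feature` and `generate` behave the same on CPU as in the GPU example.

```python
import torch
from src.model.avhubert2text import AV2TextForConditionalGeneration
from src.dataset.load_data import load_feature
from transformers import Speech2TextTokenizer

# Pick whatever device is available; the rest of the pipeline is unchanged
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = AV2TextForConditionalGeneration.from_pretrained('nguyenvulebinh/AV-HuBERT').to(device).eval()
tokenizer = Speech2TextTokenizer.from_pretrained('nguyenvulebinh/AV-HuBERT')

sample = load_feature('./example/lip_movement.mp4', './example/noisy_audio.wav')
audio_feats = sample['audio_source'].to(device)
video_feats = sample['video_source'].to(device)

# All-False mask: nothing is padded in this single-sample batch
attention_mask = torch.zeros(audio_feats.size(0), audio_feats.size(-1), dtype=torch.bool, device=device)

with torch.no_grad():
    output = model.generate(audio_feats, attention_mask=attention_mask, video=video_feats)

print(tokenizer.batch_decode(output, skip_special_tokens=True))
```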
### Data preprocessing scripts
```sh
# Download the face-landmark resources used for lip extraction
mkdir model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat
cd ..

# Raw input video is only supported at a 4:3 aspect ratio for now
cp raw_video.mp4 ./example/
python src/dataset/video_to_audio_lips.py
```
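For reference, the normalization this step performs can be sketched as resampling the video to a fixed frame rate and extracting a mono 16 kHz audio track before lip cropping. The snippet below is only an illustration using `ffmpeg` through `subprocess`; the frame rate, sample rate, and the helper name `normalize_av` are assumptions for this sketch, not the repository's actual API. The real pipeline lives in `src/dataset/video_to_audio_lips.py`.

```python
import subprocess

def normalize_av(in_video, out_video, out_audio, fps=25, sample_rate=16000):
    """Illustrative only: resample video to a fixed frame rate and extract a
    mono 16 kHz wav track with ffmpeg. The values actually used by
    video_to_audio_lips.py may differ."""
    # Re-encode the video at the target frame rate, dropping its audio stream
    subprocess.run(
        ['ffmpeg', '-y', '-i', in_video, '-r', str(fps), '-an', out_video],
        check=True,
    )
    # Extract the audio track as mono 16-bit PCM at the target sample rate
    subprocess.run(
        ['ffmpeg', '-y', '-i', in_video, '-vn', '-ac', '1',
         '-ar', str(sample_rate), '-acodec', 'pcm_s16le', out_audio],
        check=True,
    )

normalize_av('example/raw_video.mp4', 'example/video_25fps.mp4', 'example/audio_16k.wav')
```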
### Pretrained model
<table align="center">
<tr>
<th>Task</th>
<th>Languages</th>
<th>Huggingface</th>
</tr>
<tr>
<td rowspan="10">AVSR</td>
<th>ar</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>de</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>el</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>en</th>
<th><a href="nguyenvulebinh/AV-HuBERT">English Chekpoint</a></th>
</tr>
<tr>
<th>es</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>fr</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>it</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>pt</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>ru</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>ar,de,el,es,fr,it,pt,ru</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<td rowspan="13">AVST</td>
<th>en-el</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>en-es</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>en-fr</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>en-it</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>en-pt</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>en-ru</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>el-en</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>es-en</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>fr-en</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>it-en</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>pt-en</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>ru-en</th>
<th><a href="todo">TODO</a></th>
</tr>
<tr>
<th>{el,es,fr,it,pt,ru}-en</th>
<th><a href="todo">TODO</a></th>
</tr>
</table>
## Acknowledgments
**AV-HuBERT**: A significant portion of the codebase in this repository has been adapted from the original AV-HuBERT implementation.
**MuAViC Repository**: We also gratefully acknowledge the creators of the MuAViC dataset and repository for providing the pre-trained models used in this project.
## License
CC-BY-NC 4.0
## Citation
```bibtex
@article{anwar2023muavic,
title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
journal={arXiv preprint arXiv:2303.00628},
year={2023}
}
``` |