---
library_name: transformers
tags: []
---

# Huggingface Implementation of AV-HuBERT on the MuAViC Dataset

This repository contains a Huggingface implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, specifically trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model designed for audio-visual speech recognition, leveraging both audio and visual modalities to achieve robust performance, especially in noisy environments.


Key features of this repository include:

- Pre-trained Models: Access pre-trained AV-HuBERT models fine-tuned on the MuAViC dataset. The pre-trained models have been exported from the [MuAViC](https://github.com/facebookresearch/muavic) repository.

- Inference scripts: Easy-to-use inference pipelines built on Huggingface’s interface.

- Data preprocessing scripts: Including frame-rate normalization and lip/audio extraction.

### Inference code

```sh
git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt
python run_example.py
```

```python
from src.model.avhubert2text import AV2TextForConditionalGeneration
from src.dataset.load_data import load_feature
from transformers import Speech2TextTokenizer
import torch

if __name__ == "__main__":
    # Load the pretrained English model and tokenizer from the Huggingface Hub
    model = AV2TextForConditionalGeneration.from_pretrained('nguyenvulebinh/AV-HuBERT')
    tokenizer = Speech2TextTokenizer.from_pretrained('nguyenvulebinh/AV-HuBERT')

    # Move the model to GPU and switch to inference mode
    model = model.cuda().eval()
    
    # Load normalized input data
    sample = load_feature(
        './example/lip_movement.mp4',
        "./example/noisy_audio.wav"
    )
    
    # Move the input features to GPU; the all-False attention mask marks every audio frame as valid (no padding)
    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
    attention_mask = torch.BoolTensor(audio_feats.size(0), audio_feats.size(-1)).fill_(False).cuda()
    
    # Generate output sequence using HF interface
    output = model.generate(
        audio_feats,
        attention_mask=attention_mask,
        video=video_feats,
    )

    # Decode the generated token ids back to text
    print(tokenizer.batch_decode(output, skip_special_tokens=True))
    
    # Sanity-check against the expected token ids for the bundled example
    assert output.detach().cpu().numpy().tolist() == [[  2,  16, 130, 516,   8, 339, 541, 808, 210, 195, 541,  79, 130, 317, 269,   4,   2]]
    print("Example ran successfully")
```
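
The example above assumes a CUDA device is available. As a minimal device-agnostic sketch (not part of the original example; it relies only on standard PyTorch calls plus the `generate` signature shown above), the same pipeline can fall back to CPU:

```python
import torch

# Pick a GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device).eval()
audio_feats = sample['audio_source'].to(device)
video_feats = sample['video_source'].to(device)

# All-False boolean mask, same shape convention as in the example above
attention_mask = torch.zeros(
    audio_feats.size(0), audio_feats.size(-1), dtype=torch.bool, device=device
)

output = model.generate(audio_feats, attention_mask=attention_mask, video=video_feats)
```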

### Data preprocessing scripts

```sh
mkdir model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat
cd ..

# Raw video input currently only supports a 4:3 aspect ratio
cp raw_video.mp4 ./example/

python src/dataset/video_to_audio_lips.py
```
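
For reference, here is a rough sketch of what the first two preprocessing steps amount to. The frame rate and sample rate below (25 fps video, 16 kHz mono audio) are assumptions based on the usual AV-HuBERT input format, and `normalize_video_and_extract_audio` is an illustrative helper, not a function from this repository; lip-ROI cropping additionally relies on the dlib landmark model and mean face downloaded above, and is handled inside `src/dataset/video_to_audio_lips.py`:

```python
import subprocess

def normalize_video_and_extract_audio(video_in, video_out, audio_out,
                                      fps=25, sample_rate=16000):
    """Re-encode video at a fixed frame rate and extract mono audio.

    fps=25 and sample_rate=16000 are assumed defaults, not values
    confirmed by this repository.
    """
    # Normalize the video frame rate
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_in, "-r", str(fps), video_out],
        check=True,
    )
    # Extract mono audio at the target sample rate
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_in, "-ar", str(sample_rate), "-ac", "1", audio_out],
        check=True,
    )

normalize_video_and_extract_audio(
    "example/raw_video.mp4", "example/video_25fps.mp4", "example/audio_16k.wav"
)
```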

### Pretrained model

<table align="center">
    <tr>
        <th>Task</th>
        <th>Languages</th>
        <th>Huggingface</th>
    </tr>
    <tr>
        <td rowspan="10">AVSR</td>
        <th>ar</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>de</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>el</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>en</th>
        <th><a href="https://huggingface.co/nguyenvulebinh/AV-HuBERT">English Checkpoint</a></th>
    </tr>
    <tr>
        <th>es</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>fr</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>it</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>pt</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>ru</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>ar,de,el,es,fr,it,pt,ru</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <td rowspan="13">AVST</td>
        <th>en-el</th>
        <th><a href="todo">TODO</a></th>
    </tr>
     <tr>
        <th>en-es</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>en-fr</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>en-it</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>en-pt</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>en-ru</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>el-en</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>es-en</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>fr-en</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>it-en</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>pt-en</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>ru-en</th>
        <th><a href="todo">TODO</a></th>
    </tr>
    <tr>
        <th>{el,es,fr,it,pt,ru}-en</th>
        <th><a href="todo">TODO</a></th>
    </tr>
</table>
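
Once the remaining checkpoints are published, switching language or task should only require swapping the repository id passed to `from_pretrained`. The repo id below is hypothetical, shown purely for illustration:

```python
from src.model.avhubert2text import AV2TextForConditionalGeneration
from transformers import Speech2TextTokenizer

# Hypothetical repo id; substitute the real one from the table once it is released
repo_id = "nguyenvulebinh/AV-HuBERT-fr"
model = AV2TextForConditionalGeneration.from_pretrained(repo_id)
tokenizer = Speech2TextTokenizer.from_pretrained(repo_id)
```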


## Acknowledgments

**AV-HuBERT**: A significant portion of the codebase in this repository has been adapted from the original [AV-HuBERT](https://github.com/facebookresearch/av_hubert) implementation.

**MuAViC Repository**: We also gratefully acknowledge the creators of the MuAViC dataset and repository for providing the pre-trained models used in this project.

## License

CC-BY-NC 4.0

## Citation

```bibtex
@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}
```