nguyenvulebinh committed · 1d0fb34 · Parent: a63603e

Update README.md

README.md CHANGED
[Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)

### Model description

[Our model](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) was pre-trained on 13k hours of unlabeled YouTube audio and fine-tuned on 250 hours of labeled speech from the [VLSP ASR dataset](https://vlsp.org.vn/vlsp2020/eval/asr), all sampled at 16kHz.
We use the wav2vec2 architecture for the pre-trained model. Following the wav2vec2 paper:

> We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

In the fine-tuning phase, wav2vec2 is trained with Connectionist Temporal Classification (CTC), an algorithm used to train neural networks on sequence-to-sequence problems, mainly automatic speech recognition and handwriting recognition.
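To make the CTC output rule concrete, here is a minimal sketch of greedy CTC decoding with a made-up vocabulary and made-up per-frame predictions (not the model's real alphabet): repeated labels are merged, then blank tokens are dropped.

```python
# Minimal sketch of greedy CTC decoding; the vocabulary and frame
# predictions below are invented for illustration only.
import itertools

vocab = ["<pad>", "x", "i", "n", " ", "c", "h", "a", "o"]  # index 0 = CTC blank
frame_ids = [1, 1, 0, 2, 3, 3, 0, 4, 5, 6, 0, 7, 7, 8]    # per-frame argmax ids

collapsed = [label for label, _ in itertools.groupby(frame_ids)]  # merge repeats
decoded = "".join(vocab[i] for i in collapsed if i != 0)          # drop blanks

print(decoded)  # xin chao
```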
| Model | #params | Pre-training data | Fine-tune data |
|---|---|---|---|
| [base](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 13k hours | 250 hours |
A formal ASR system requires two components: an acoustic model and a language model. Here, the CTC fine-tuned wav2vec model works as the acoustic model. For the language model, we provide a [4-grams model](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/blob/main/vi_lm_4grams.bin.zip) trained on 2GB of spoken text.

### Benchmark WER result (with 4-grams LM)
| [VIVOS](https://ailab.hcmus.edu.vn/vivos) | [VLSP-T1](https://vlsp.org.vn/vlsp2020/eval/asr) | [VLSP-T2](https://vlsp.org.vn/vlsp2020/eval/asr) |
|---|---|---|
| 6.1 | 9.1 | 40.8 |
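The numbers above are word error rates in percent: the word-level edit distance between hypothesis and reference, divided by the number of reference words. For reference, a quick sketch of the computation using the jiwer package (an assumed third-party helper, not part of this repo):

```python
# WER = (substitutions + deletions + insertions) / reference word count.
# jiwer is an assumed dependency: pip install jiwer
from jiwer import wer

reference = "xin chao viet nam"
hypothesis = "xin chao viec nam"

print(wer(reference, hypothesis))  # 1 substitution over 4 words -> 0.25
```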
### Example usage

When using the model, make sure that your speech input is sampled at 16kHz. Follow the Colab link below to use the combination of the CTC-wav2vec acoustic model and the 4-grams LM.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1pVBY46gSoWer2vDf0XmZ6uNV3d8lrMxx?usp=sharing)
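If your audio is recorded at a different rate, resample it to 16kHz first. A minimal sketch using torchaudio (an assumed dependency; the file path is a placeholder):

```python
# Resample an arbitrary-rate recording to the 16kHz the model expects.
# torchaudio is an assumed dependency; the path below is a placeholder.
import torchaudio

waveform, sample_rate = torchaudio.load("your_audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
speech = waveform.mean(dim=0)  # downmix to mono if needed
```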
To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch

# load the pre-trained processor and the fine-tuned model
processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

# read a 16kHz speech file (placeholder path; use your own audio)
speech, sample_rate = sf.read("your_audio_16k.wav")

# tokenize the raw waveform
input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values

# retrieve logits from the acoustic model
logits = model(input_values).logits

# take argmax and decode (greedy, no language model)
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
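The snippet above uses greedy decoding without the language model. As a rough sketch (the Colab notebook is the reference pipeline), the released 4-grams LM can be plugged in through a KenLM-backed beam-search decoder such as pyctcdecode; the unzipped `vi_lm_4grams.bin` path and the pyctcdecode/kenlm dependencies are assumptions here:

```python
# Sketch: beam-search decoding with the released 4-gram KenLM model.
# Assumes `pip install pyctcdecode` plus kenlm, and that vi_lm_4grams.bin.zip
# from the model repo has been downloaded and unzipped.
from pyctcdecode import build_ctcdecoder

# labels must be ordered by token id to match the CTC output dimension;
# depending on the tokenizer, the word delimiter "|" may need mapping to " "
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="vi_lm_4grams.bin")

# reuse the logits computed above: shape (batch, time, vocab)
lm_transcription = decoder.decode(logits.detach().numpy()[0])
```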
# License

This model follows the [CC-BY-NC-4.0](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/CC-BY-NC-SA-4.0.txt) license. It is therefore freely available for academic purposes or individual research, but restricted for commercial use.