nguyenvulebinh committed on
Commit
1d0fb34
·
1 Parent(s): a63603e

Update README.md

Files changed (1)
  1. README.md +28 -11
README.md CHANGED
@@ -20,12 +20,36 @@ widget:
20
 
21
  [Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)
22
 
23
- The base model was pretrained and fine-tuned on 250 hours of the [VLSP ASR dataset](https://vlsp.org.vn/vlsp2020/eval/asr) on 16 kHz sampled speech audio. When using the model
24
- make sure that your speech input is also sampled at 16 kHz.
25
 
26
- # Usage
27
 
28
- To transcribe audio files the model can be used as a standalone acoustic model as follows:
29
 
30
  ```python
31
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
@@ -58,13 +82,6 @@ logits = model(input_values).logits
58
  predicted_ids = torch.argmax(logits, dim=-1)
59
  transcription = processor.batch_decode(predicted_ids)
60
  ```
61
-
62
- *Result WER (with 4-grams LM)*:
63
-
64
- | VIVOS | VLSP-T1 | VLSP-T2 |
65
- |---|---|---|
66
- | 6.1 | 9.1 | 40.8 |
67
-
68
  # License
69
 
70
  This model follows the [CC-BY-NC-4.0](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/CC-BY-NC-SA-4.0.txt) license. It is therefore freely available for academic purposes or individual research but restricted for commercial use.
 
20
 
21
  [Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)
22
 
23
+ ### Model description
 
24
 
25
+ [Our model](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) was pre-trained on 13k hours of unlabeled YouTube audio and fine-tuned on 250 hours of labeled data from the [VLSP ASR dataset](https://vlsp.org.vn/vlsp2020/eval/asr), using speech audio sampled at 16 kHz.
26
+
27
+ We use the wav2vec2 architecture for the pre-trained model. According to the wav2vec2 paper:
28
+
29
+ >We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
30
+
31
+ In the fine-tuning phase, wav2vec2 is trained with Connectionist Temporal Classification (CTC), an algorithm for training neural networks on sequence-to-sequence problems that is used mainly in automatic speech recognition and handwriting recognition.
32
+
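To make the CTC step more concrete, here is a minimal sketch of greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The toy vocabulary, the `blank_id` value, and the function name `ctc_greedy_decode` are assumptions chosen for illustration; the model's real vocabulary and decoding live in its processor.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse a per-frame argmax sequence into text, CTC style."""
    # Step 1: merge runs of the same id (one spoken symbol spans many frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank symbol, which only separates repeated characters.
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "c", 2: "a", 3: "t"}
toy_frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
decoded = ctc_greedy_decode(toy_frames, toy_vocab)
```

The blank is what allows genuine double letters: the frames `[2, 0, 2]` decode to two separate `a` characters, while `[2, 2]` decodes to one.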
33
+ | Model | #params | Pre-training data | Fine-tune data |
34
+ |---|---|---|---|
35
+ | [base](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 13k hours | 250 hours |
36
+
37
+ A complete ASR system requires two components: an acoustic model and a language model. Here the CTC fine-tuned wav2vec2 model serves as the acoustic model. For the language model, we provide a [4-gram model](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/blob/main/vi_lm_4grams.bin.zip) trained on 2GB of spoken text.
38
+
39
+
40
+ ### Benchmark WER results (with 4-gram LM):
41
+
42
+ | [VIVOS](https://ailab.hcmus.edu.vn/vivos) | [VLSP-T1](https://vlsp.org.vn/vlsp2020/eval/asr) | [VLSP-T2](https://vlsp.org.vn/vlsp2020/eval/asr) |
43
+ |---|---|---|
44
+ | 6.1 | 9.1 | 40.8 |
45
+
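For readers unfamiliar with the metric, word error rate can be sketched as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function below is an illustrative implementation, not the evaluation script behind the table, and it assumes the table reports WER as a percentage, which the README does not state explicitly.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref_words)


# One of four reference words is missing from the hypothesis.
example_wer = word_error_rate("xin chào các bạn", "xin chào bạn")
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.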
46
+
47
+ ### Example usage
48
+
49
+ When using the model, make sure that your speech input is also sampled at 16 kHz. Follow the Colab link below to use the combination of CTC-wav2vec and the 4-gram LM.
50
+
51
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1pVBY46gSoWer2vDf0XmZ6uNV3d8lrMxx?usp=sharing)
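The 16 kHz requirement can be checked before running inference. The sketch below reads the sample rate from a WAV header using only the Python standard library; the helper names `wav_sample_rate` and `is_16khz` are hypothetical, and other formats (MP3, FLAC) or actual resampling would need an audio library instead.

```python
import wave

EXPECTED_RATE = 16_000  # sample rate the model was trained on, in Hz


def wav_sample_rate(path):
    """Read the sample rate (Hz) from the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path):
    """Return True if the WAV file matches the rate the model expects."""
    return wav_sample_rate(path) == EXPECTED_RATE
```

If `is_16khz` returns `False`, resample the audio to 16 kHz before passing it to the processor.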
52
 
 
53
 
54
  ```python
55
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
 
82
  predicted_ids = torch.argmax(logits, dim=-1)
83
  transcription = processor.batch_decode(predicted_ids)
84
  ```
85
  # License
86
 
87
  This model follows the [CC-BY-NC-4.0](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/CC-BY-NC-SA-4.0.txt) license. It is therefore freely available for academic purposes or individual research but restricted for commercial use.