bofenghuang committed
Commit c89fcb5
Parent(s): e3b2570
up
Files changed:
- .gitattributes +1 -0
- README.md +153 -0
- assets/bench.png +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.nemo filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
---
license: apache-2.0
language: fr
library_name: nemo
datasets:
- mozilla-foundation/common_voice_13_0
- multilingual_librispeech
- facebook/voxpopuli
- google/fleurs
- gigant/african_accented_french
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- CTC
- Transformer
- pytorch
- NeMo
- hf-asr-leaderboard
model-index:
- name: stt_fr_fastconformer_hybrid_large
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: fr
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 9.16
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 4.82
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      config: french
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 9.23
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      config: fr_fr
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 8.65
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      config: fr
      split: test
      args:
        language: fr
    metrics:
    - name: WER
      type: wer
      value: 6.55
---

# FastConformer-Hybrid Large (fr)

<style>
img {
 display: inline;
}
</style>

| [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)

This model aims to replicate [nvidia/stt_fr_fastconformer_hybrid_large_pc](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc), but predicts only the lowercase French alphabet, the hyphen, and the apostrophe. While this choice sacrifices broader functionality such as predicting casing, numbers, and punctuation, it can improve accuracy for specific use cases.

Like its sibling, this is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters). It is a hybrid model trained with two loss functions: Transducer (the default) and CTC.

## Performance

We evaluated this model on the following datasets and re-ran the evaluation on other models for comparison. Note that the reported WER is computed after converting numbers to text, removing punctuation (except apostrophes and hyphens), and lowercasing all characters.

![Benchmarks](https://huggingface.co/bofenghuang/stt_fr_fastconformer_hybrid_large/resolve/main/assets/bench.png)

All the evaluation results can be found [here](https://drive.google.com/drive/folders/1adZTgGAptYx2ut9jddjmlj5--dkY2XWZ?usp=sharing).
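The casing and punctuation part of this normalization can be sketched in a few lines. This is an illustrative helper, not the exact script used to produce the numbers above; the number-to-text step would additionally require a library such as num2words, and the accepted character set below is an assumption.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase and keep only French letters, apostrophes, hyphens, and spaces."""
    text = text.lower()
    # Replace everything except a-z, common French accented letters,
    # apostrophe, hyphen, and space with a space (to avoid joining words)
    text = re.sub(r"[^a-zàâäæçéèêëîïôöœùûüÿ' -]", " ", text)
    # Collapse the whitespace runs left behind by removed punctuation
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("Bonjour, le Monde !"))  # bonjour le monde
print(normalize_text("C'est l'été."))         # c'est l'été
```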

## Usage

The model is available in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

```python
# Install NeMo
# !pip install nemo_toolkit['all']

import nemo.collections.asr as nemo_asr

model_name = "bofenghuang/stt_fr_fastconformer_hybrid_large"
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

# Path to your 16 kHz mono-channel audio file
audio_path = "/path/to/your/audio/file"

# Transcribe with the default Transducer decoder
asr_model.transcribe([audio_path])

# (Optional) Switch to the CTC decoder
asr_model.change_decoding_strategy(decoder_type="ctc")

# (Optional) Transcribe with the CTC decoder
asr_model.transcribe([audio_path])
```
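If you want to score the resulting transcripts against references yourself, WER is the word-level edit distance divided by the number of reference words. A minimal self-contained sketch (libraries such as jiwer, or NeMo's own metrics, provide the same computation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("bonjour tout le monde", "bonjour le monde"))  # 0.25
```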

## Datasets

This model was trained on a composite dataset comprising over 2,500 hours of French speech audio and transcriptions, including [Common Voice 13.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Fleurs](https://huggingface.co/datasets/google/fleurs), and more.

## Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade on speech containing technical terms or vernacular that the model has not been trained on. It might also perform worse on accented speech.

The model exclusively generates the lowercase French alphabet, the hyphen, and the apostrophe. It may therefore not perform well in situations that also require uppercase characters and additional punctuation.
assets/bench.png
ADDED