bofenghuang committed on
Commit
c89fcb5
1 Parent(s): e3b2570
Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +153 -0
  3. assets/bench.png +0 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *.nemo filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,156 @@
1
  ---
2
  license: apache-2.0
3
+ language: fr
4
+ library_name: nemo
5
+ datasets:
6
+ - mozilla-foundation/common_voice_13_0
7
+ - multilingual_librispeech
8
+ - facebook/voxpopuli
9
+ - google/fleurs
10
+ - gigant/african_accented_french
11
+ thumbnail: null
12
+ tags:
13
+ - automatic-speech-recognition
14
+ - speech
15
+ - audio
16
+ - Transducer
17
+ - FastConformer
18
+ - CTC
19
+ - Transformer
20
+ - pytorch
21
+ - NeMo
22
+ - hf-asr-leaderboard
23
+ model-index:
24
+ - name: stt_fr_fastconformer_hybrid_large
25
+ results:
26
+ - task:
27
+ name: Automatic Speech Recognition
28
+ type: automatic-speech-recognition
29
+ dataset:
30
+ name: Common Voice 13.0
31
+ type: mozilla-foundation/common_voice_13_0
32
+ config: fr
33
+ split: test
34
+ args:
35
+ language: fr
36
+ metrics:
37
+ - name: WER
38
+ type: wer
39
+ value: 9.16
40
+ - task:
41
+ type: automatic-speech-recognition
42
+ name: Automatic Speech Recognition
43
+ dataset:
44
+ name: Multilingual LibriSpeech (MLS)
45
+ type: facebook/multilingual_librispeech
46
+ config: french
47
+ split: test
48
+ args:
49
+ language: fr
50
+ metrics:
51
+ - name: WER
52
+ type: wer
53
+ value: 4.82
54
+ - task:
55
+ type: automatic-speech-recognition
56
+ name: Automatic Speech Recognition
57
+ dataset:
58
+ name: VoxPopuli
59
+ type: facebook/voxpopuli
60
+ config: french
61
+ split: test
62
+ args:
63
+ language: fr
64
+ metrics:
65
+ - name: WER
66
+ type: wer
67
+ value: 9.23
68
+ - task:
69
+ type: automatic-speech-recognition
70
+ name: Automatic Speech Recognition
71
+ dataset:
72
+ name: Fleurs
73
+ type: google/fleurs
74
+ config: fr_fr
75
+ split: test
76
+ args:
77
+ language: fr
78
+ metrics:
79
+ - name: WER
80
+ type: wer
81
+ value: 8.65
82
+ - task:
83
+ type: automatic-speech-recognition
84
+ name: Automatic Speech Recognition
85
+ dataset:
86
+ name: African Accented French
87
+ type: gigant/african_accented_french
88
+ config: fr
89
+ split: test
90
+ args:
91
+ language: fr
92
+ metrics:
93
+ - name: WER
94
+ type: wer
95
+ value: 6.55
96
  ---
97
+
98
+ # FastConformer-Hybrid Large (fr)
99
+
100
+ <style>
101
+ img {
102
+ display: inline;
103
+ }
104
+ </style>
105
+
106
+ | [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
107
+ | [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
108
+ | [![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)
109
+
110
+ This model aims to replicate [nvidia/stt_fr_fastconformer_hybrid_large_pc](https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc) with the goal of predicting only the lowercase French alphabet, hyphen, and apostrophe. While this choice sacrifices broader functionalities like predicting casing, numbers, and punctuation, it can enhance accuracy for specific use cases.
111
+
112
+ Similar to its sibling, this is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters). It's a hybrid model trained using two loss functions: Transducer (default) and CTC.
113
+
114
+ ## Performance
115
+
116
+ We evaluated our model on the following datasets and re-ran the evaluation on other models for comparison. The reported WER is computed after converting numbers to text, removing punctuation (except for apostrophes and hyphens), and converting all characters to lowercase.
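The text normalization described above can be sketched roughly as follows. This is an illustrative approximation, not the evaluation script used for the reported numbers: the helper names `normalize` and `word_error_rate` are hypothetical, the number-to-text step is left out (it would need an external library), and mapping the typographic apostrophe to the ASCII one is an assumption.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and drop punctuation except apostrophes and hyphens.

    Number-to-text conversion is deliberately omitted from this sketch.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    # Assumption: treat the typographic apostrophe like the ASCII one.
    text = text.replace("\u2019", "'")
    # Replace everything that is not a word character, whitespace,
    # apostrophe or hyphen with a space.
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    # Rolling-row dynamic programming over substitutions, insertions, deletions.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

With this normalization, a reference such as "Peut-être, demain." and a hypothesis "peut-être demain" count as identical, so casing and punctuation differences do not contribute to the score.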
117
+
118
+ ![Benchmarks](https://huggingface.co/bofenghuang/stt_fr_fastconformer_hybrid_large/resolve/main/assets/bench.png)
119
+
120
+ All the evaluation results can be found [here](https://drive.google.com/drive/folders/1adZTgGAptYx2ut9jddjmlj5--dkY2XWZ?usp=sharing).
121
+
122
+ ## Usage
123
+
124
+ The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
125
+
126
+ ```python
127
+ # Install nemo
128
+ # !pip install nemo_toolkit['all']
129
+
130
+ import nemo.collections.asr as nemo_asr
131
+
132
+ model_name = "bofenghuang/stt_fr_fastconformer_hybrid_large"
133
+ asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
134
+
135
+ # Path to your 16kHz mono-channel audio file
136
+ audio_path = "/path/to/your/audio/file"
137
+
138
+ # Transcribe with default transducer decoder
139
+ asr_model.transcribe([audio_path])
140
+
141
+ # (Optional) Switch to CTC decoder
142
+ asr_model.change_decoding_strategy(decoder_type="ctc")
143
+
144
+ # (Optional) Transcribe with CTC decoder
145
+ asr_model.transcribe([audio_path])
146
+ ```
147
+
148
+ ## Datasets
149
+
150
+ This model has been trained on a composite dataset comprising over 2,500 hours of French speech audio and transcriptions, including [Common Voice 13.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Fleurs](https://huggingface.co/datasets/google/fleurs), and more.
151
+
152
+ ## Limitations
153
+
154
+ Since this model was trained on publicly available speech datasets, its performance might degrade on speech that includes technical terms or vernacular it has not been trained on. The model might also perform worse on accented speech.
155
+
156
+ The model outputs only the lowercase French alphabet, hyphen, and apostrophe. It is therefore a poor fit for use cases that also require uppercase characters or additional punctuation.
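To gauge whether target text fits this output alphabet, a quick check along these lines may help. It is only a sketch: `ASSUMED_CHARSET` and `unsupported_chars` are hypothetical names, and the exact set of accented letters and ligatures is an assumption that should be verified against the model's actual vocabulary.

```python
# Assumed output alphabet: a-z, common French accented letters and ligatures,
# plus space, hyphen and apostrophe. Not taken from the model's tokenizer.
ASSUMED_CHARSET = set("abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ '-")


def unsupported_chars(text: str) -> set:
    """Return the characters in `text` that fall outside the assumed alphabet."""
    return set(text) - ASSUMED_CHARSET
```

For instance, `unsupported_chars("Paris, 2024")` flags the capital letter, the comma, and the digits, which suggests such text would need normalization before being compared with the model's output.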
assets/bench.png ADDED