Lauler committed (verified)
Commit d4c8b9e · 1 Parent(s): 61729bb

Update README.md

Files changed (1):
  1. README.md +132 -3

README.md CHANGED

@@ -25,8 +25,16 @@ The National Library of Sweden releases a new suite of Whisper models trained on
  | [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | **KBLab** | **5.4** | **4.1** | **5.2** |
  | | OpenAI | 7.8 | 9.5 | 11.3 |

+ Table: **Word Error Rate (WER)** comparison between KBLab's Whisper models and the corresponding OpenAI versions.
+
  ### Usage

+ We provide checkpoints in different formats: `Hugging Face`, `whisper.cpp` (GGML), `onnx`, and `ctranslate2` (used in `faster-whisper` and `WhisperX`).
+
+ #### Hugging Face
+
+ Inference example for using `KB-Whisper` with Hugging Face:
+
  ```python
  import torch
  from datasets import load_dataset

@@ -58,11 +66,124 @@ res = pipe("audio.mp3",
             generate_kwargs={"task": "transcribe", "language": "sv"})
  ```

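Only the opening imports and the final call of the Hugging Face example are visible in this diff. For orientation, a minimal self-contained sketch of the same `pipeline` pattern is given below; the model id, dtype handling, `chunk_length_s`, and cache directory are assumptions rather than the exact values in the README.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Assumed model id and settings -- adjust to the checkpoint you want to use.
model_id = "KBLab/kb-whisper-large"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, cache_dir="cache"
)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Long-form transcription of a Swedish audio file.
res = pipe("audio.mp3", chunk_length_s=30,
           generate_kwargs={"task": "transcribe", "language": "sv"})
print(res["text"])
```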
+ #### Faster-whisper
+
+ [Faster-whisper](https://github.com/SYSTRAN/faster-whisper) provides fast and efficient inference via a reimplementation of Whisper using `ctranslate2`.
+
+ ```python
+ #### faster-whisper model ####
+ from faster_whisper import WhisperModel
+
+ model_id = "KBLab/kb-whisper-base"
+ model = WhisperModel(
+     model_id,
+     device="cuda",
+     compute_type="float16",
+     download_root="cache",  # cache directory
+     # condition_on_previous_text=False  # Can reduce hallucinations if we don't use prompts
+ )
+
+ # Transcribe audio.wav (convert to 16khz mono wav first via ffmpeg)
+ segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
+ print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
+
+ for segment in segments:
+     print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
+ ```
+
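The comment in the block above assumes `audio.wav` is already 16 kHz mono. A typical `ffmpeg` conversion (file names here are placeholders) would be:

```
ffmpeg -i input.mp3 -ar 16000 -ac 1 audio.wav
```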
+ #### WhisperX
+
+ [WhisperX](https://github.com/m-bain/whisperX) provides a convenient method of getting accurate word level timestamps. The library combines (force aligns) the text output of Whisper with the accurate timestamps of Wav2vec2. We provide an example below of how to use `KB-Whisper` together with [KBLab/wav2vec2-large-voxrex-swedish](https://huggingface.co/KBLab/wav2vec2-large-voxrex-swedish).
+
+ ```python
+ import whisperx
+
+ device = "cuda"
+ audio_file = "audio.wav"
+ batch_size = 16  # reduce if low on GPU mem
+ compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)
+
+ # 1. Transcribe with original whisper (batched)
+ model = whisperx.load_model(
+     "KBLab/kb-whisper-base", device, compute_type=compute_type, download_root="cache"  # cache_dir
+ )
+
+ audio = whisperx.load_audio(audio_file)
+ result = model.transcribe(audio, batch_size=batch_size)
+ print(result["segments"])  # before alignment
+
+ # delete model if low on GPU resources
+ # import gc; gc.collect(); torch.cuda.empty_cache(); del model
+
+ # 2. Align whisper output
+ model_a, metadata = whisperx.load_align_model(
+     language_code=result["language"],
+     device=device,
+     model_name="KBLab/wav2vec2-large-voxrex-swedish",
+     model_dir="cache",  # cache_dir
+ )
+ result = whisperx.align(
+     result["segments"], model_a, metadata, audio, device, return_char_alignments=False
+ )
+
+ print(result["segments"])  # word level timestamps after alignment
+ ```
+
+ #### Whisper.cpp / GGML
+
+ We provide GGML checkpoints used in the apps `whisper.cpp` and `MacWhisper`. To use our model with `whisper.cpp`, first clone the repository and build the library:
+
+ ```
+ git clone https://github.com/ggerganov/whisper.cpp.git
+ cd whisper.cpp
+ cmake -B build
+ cmake --build build --config Release
+ ```
+
+ To use the model, you need to download one of the GGML checkpoints we have uploaded. You can either press the download buttons [here](https://huggingface.co/KBLab/kb-whisper-base/tree/main), or download using `wget`:
+
+ ```
+ wget https://huggingface.co/KBLab/kb-whisper-base/resolve/main/ggml-model-q5_0.bin # Quantized version
+ # wget https://huggingface.co/KBLab/kb-whisper-base/resolve/main/ggml-model.bin # Non-quantized version
+ ```
+
+ Run inference by specifying the model path after the argument `-m`, along with the path to the audio file as the last positional argument.
+
+ ```
+ ./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav
+ ```
+
+ #### onnx (optimum) and transformers.js usage
+
+ You can use the `onnx` checkpoints via Hugging Face's `optimum` library in the following manner:
+
+ ```python
+ import soundfile as sf
+ from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
+ from transformers import AutoProcessor
+
+ model_id = "KBLab/kb-whisper-base"
+ processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
+ model = ORTModelForSpeechSeq2Seq.from_pretrained(
+     model_id,
+     cache_dir="cache",
+     subfolder="onnx",
+ )
+
+ audio = sf.read("audio.wav")
+
+ inputs = processor.feature_extractor(audio[0], sampling_rate=16000, return_tensors="pt")
+ gen_tokens = model.generate(**inputs, max_length=300)
+ processor.decode(gen_tokens[0], skip_special_tokens=True)
+ ```
+
+ An example of an app that runs inference locally in the browser with `transformers.js` and `KB-Whisper` can be found at [https://whisper.mesu.re/](https://whisper.mesu.re/) (created by Pierre Mesure). A template for setting up such an app with JavaScript can be found at [https://github.com/xenova/whisper-web](https://github.com/xenova/whisper-web).
+
  ### Training data

  Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in 2 stages, each characterized by the application of different quality filters and thresholds for said filters.

- Stage 1 employed low threshold values (0.15 to 0.30 BLEU), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of first and last 10 characters `<= 0.2`).
+ Stage 1 employed low threshold values (0 to 0.30 BLEU depending on dataset), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of first and last 10 characters `<= 0.2`).

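The Stage 2 thresholds above amount to a per-segment keep/drop predicate over (model transcript, reference text) pairs. The sketch below only illustrates those thresholds; it is not the project's actual filtering code, it uses `sacrebleu` and `jiwer` as stand-in metric implementations, and it omits the weighted ROUGE-N check.

```python
import sacrebleu
import jiwer

def keep_for_stage2(transcript: str, reference: str) -> bool:
    """Illustrative Stage 2 filter: keep a segment only if the transcript
    closely matches its reference text (thresholds from the README)."""
    bleu = sacrebleu.sentence_bleu(transcript, [reference]).score / 100  # rescale to 0-1
    cer_first = jiwer.cer(reference[:10], transcript[:10])   # CER of first 10 characters
    cer_last = jiwer.cer(reference[-10:], transcript[-10:])  # CER of last 10 characters
    return bleu >= 0.7 and cer_first <= 0.2 and cer_last <= 0.2
```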
 
  | Dataset | Continued pretraining (h) -- Stage 1 | Finetuning (h) -- Stage 2 |
  |-------------|--------------------------|--------------|

@@ -72,10 +193,12 @@ Stage 1 employed low threshold values (0.15 to 0.30 BLEU), whereas Stage 2 used
  | NST | 250 | 250 |
  | **Total** | **56,514** | **8,533** |

- The default when loading our models through Hugging Face is **Stage 2**. We have however also uploaded the checkpoints of our continued pretraing and tagged them. You can load these other checkpoints by specifying the `revision`. For example: [`pretrained-checkpoint`](https://huggingface.co/KBLab/kb-whisper-large/tree/pretrained-checkpoint). The Stage 2 default model's tag is named `standard`.
+ The default when loading our models through Hugging Face is **Stage 2**. We have, however, also uploaded continued pretraining checkpoints and tagged them. You can load these other checkpoints by specifying the `revision` in `.from_pretrained()`. The pretrained checkpoint's tag can, for example, be found here: [`pretrained-checkpoint`](https://huggingface.co/KBLab/kb-whisper-large/tree/pretrained-checkpoint). The Stage 2 default model's tag is named `standard`. We also supply a different Stage 2 checkpoint -- with a more condensed style of transcribing -- under the name `subtitle`.

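In practice, selecting one of the tagged checkpoints mentioned above comes down to passing `revision` to `from_pretrained()`. A minimal sketch follows; the model class and cache directory are assumptions, while the revision names are the tags listed in the paragraph.

```python
from transformers import AutoModelForSpeechSeq2Seq

# Available tags: "standard" (Stage 2 default), "subtitle" (condensed Stage 2),
# "pretrained-checkpoint" (Stage 1 continued pretraining).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "KBLab/kb-whisper-large",
    revision="pretrained-checkpoint",
    cache_dir="cache",
)
```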
 
  ### Evaluation

+
+ #### WER
  | Model size | | FLEURS | CommonVoice | NST |
  |------------|---------|--------|-------------|------|
  | [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | **KBLab** | **13.2** | **12.9** | **11.2** |

@@ -90,6 +213,7 @@ The default when loading our models through Hugging Face is **Stage 2**. We have
  | | OpenAI | 7.8 | 9.5 | 11.3 |


+ #### BLEU Score
  | Model size | | FLEURS | CommonVoice | NST |
  |------------|---------|--------|-------------|------|
  | tiny | KBLab | **76.6** | **73.7** | **74.3** |

@@ -101,4 +225,9 @@ The default when loading our models through Hugging Face is **Stage 2**. We have
  | medium | KBLab | **87.6** | **85.0** | **80.2** |
  | | OpenAI | 77.1 | 70.1 | 68.9 |
  | large-v3 | KBLab | **89.8** | **87.2** | **81.1** |
- | | OpenAI | 84.9 | 79.1 | 75.1 |
+ | | OpenAI | 84.9 | 79.1 | 75.1 |
+
+
+ ### Citation
+
+ Paper reference coming soon.