Update README.md
README.md CHANGED

| Model size | | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| ... | ... | ... | ... | ... |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | **KBLab** | **5.4** | **4.1** | **5.2** |
| | OpenAI | 7.8 | 9.5 | 11.3 |

Table: **Word Error Rate (WER)** comparison between KBLab's Whisper models and the corresponding OpenAI versions.

### Usage

We provide checkpoints in different formats: `Hugging Face`, `whisper.cpp` (GGML), `onnx`, and `ctranslate2` (used in `faster-whisper` and `WhisperX`).
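
All of these formats live in the individual model repositories on the Hugging Face Hub. As an illustrative sketch (one option among several), a single format-specific file can also be fetched programmatically with `huggingface_hub`; the filename below is the quantized GGML checkpoint referenced in the whisper.cpp section further down:

```python
from huggingface_hub import hf_hub_download

# Download one file from the model repo into a local cache directory.
ggml_path = hf_hub_download(
    repo_id="KBLab/kb-whisper-base",
    filename="ggml-model-q5_0.bin",
    cache_dir="cache",
)
print(ggml_path)  # local path to the downloaded checkpoint
```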

#### Hugging Face

Inference example for using `KB-Whisper` with Hugging Face:

```python
import torch
from datasets import load_dataset
...
generate_kwargs={"task": "transcribe", "language": "sv"})
```
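
A minimal, self-contained version of such a pipeline call looks roughly as follows; the model size, device handling, and chunking parameters here are illustrative choices rather than the exact lines of the example above:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "KBLab/kb-whisper-large"  # any KB-Whisper size works the same way
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=30,
)

res = pipe("audio.mp3", generate_kwargs={"task": "transcribe", "language": "sv"})
print(res["text"])
```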

#### Faster-whisper

[Faster-whisper](https://github.com/SYSTRAN/faster-whisper) provides fast and efficient inference via a reimplementation of Whisper using `ctranslate2`.

```python
#### faster-whisper model ####
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-base"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache",  # cache directory
    # condition_on_previous_text = False  # Can reduce hallucinations if we don't use prompts
)

# Transcribe audio.wav (convert to 16khz mono wav first via ffmpeg)
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
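
Note that `model.transcribe` returns a lazy generator, so segments are only decoded as you iterate over them. A small follow-up sketch, reusing the `model` object from above, that materializes the segments once and joins them into a single transcript string:

```python
# Run transcription and keep all segments in memory
# (the generator can only be consumed once).
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
segments = list(segments)

full_text = " ".join(segment.text.strip() for segment in segments)
print(full_text)
```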

#### WhisperX

[WhisperX](https://github.com/m-bain/whisperX) provides a convenient method of getting accurate word-level timestamps. The library combines (force aligns) the text output of Whisper with the accurate timestamps of Wav2vec2. We provide an example below of how to use `KB-Whisper` together with [KBLab/wav2vec2-large-voxrex-swedish](https://huggingface.co/KBLab/wav2vec2-large-voxrex-swedish).

```python
import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-base", device, compute_type=compute_type, download_root="cache"  # cache_dir
)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache_dir
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)

print(result["segments"])  # word-level timestamps after alignment
```
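
After alignment, each entry in `result["segments"]` also carries per-word timings. A short sketch of reading them out; the `words` field and its keys reflect how whisperX structures aligned output, so treat the exact schema as an assumption to verify against your installed version:

```python
# Print one line per word with its start and end time in seconds.
for segment in result["segments"]:
    for word in segment.get("words", []):
        # "start"/"end" can be missing for tokens the aligner could not place.
        print(f'{word.get("start")} -> {word.get("end")}: {word["word"]}')
```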

#### Whisper.cpp / GGML

We provide GGML checkpoints used in apps such as `whisper.cpp` and `MacWhisper`. To use our model with `whisper.cpp`, first clone the repository and build the library:

```
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release
```

To use the model you need to download one of the GGML checkpoints we have uploaded. You can either press the download buttons [here](https://huggingface.co/KBLab/kb-whisper-base/tree/main), or download them using `wget`:

```
wget https://huggingface.co/KBLab/kb-whisper-base/resolve/main/ggml-model-q5_0.bin # Quantized version
# wget https://huggingface.co/KBLab/kb-whisper-base/resolve/main/ggml-model.bin # Non-quantized version
```

Run inference by specifying the model path after the argument `-m`, along with the path to the audio file as the last positional argument.

```
./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav
```
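
`whisper-cli` typically expects 16 kHz mono WAV input (the same format mentioned in the Faster-whisper comment above). If your audio is in another format, one way to convert it in Python, assuming `librosa` and `soundfile` are installed:

```python
import librosa
import soundfile as sf

# Resample any input file to 16 kHz mono and write it as WAV.
audio, sr = librosa.load("audio.mp3", sr=16000, mono=True)
sf.write("audio.wav", audio, 16000)
```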

#### onnx (optimum) and transformers.js usage

You can use the `onnx` checkpoints via Hugging Face's `optimum` library in the following manner:

```python
import soundfile as sf
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "KBLab/kb-whisper-base"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

audio = sf.read("audio.wav")

inputs = processor.feature_extractor(audio[0], sampling_rate=16000, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))
```

An example of an app that runs inference locally in the browser with `transformers.js` and `KB-Whisper` can be found at [https://whisper.mesu.re/](https://whisper.mesu.re/) (created by Pierre Mesure). A template for setting up such an app with JavaScript can be found at [https://github.com/xenova/whisper-web](https://github.com/xenova/whisper-web).

### Training data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in two stages, each characterized by the application of different quality filters and thresholds for those filters.

Stage 1 employed low threshold values (0 to 0.30 BLEU depending on the dataset), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of first and last 10 characters `<= 0.2`).

| Dataset | Continued pretraining (h) -- Stage 1 | Finetuning (h) -- Stage 2 |
|-------------|--------------------------|--------------|
| ... | ... | ... |
| NST | 250 | 250 |
| **Total** | **56,514** | **8,533** |

The default when loading our models through Hugging Face is **Stage 2**. We have, however, also uploaded continued pretraining checkpoints and tagged them. You can load these other checkpoints by specifying the `revision` argument in `.from_pretrained()`. The pretrained checkpoint tag can, for example, be found here: [`pretrained-checkpoint`](https://huggingface.co/KBLab/kb-whisper-large/tree/pretrained-checkpoint). The Stage 2 default model tag is named `standard`. We supply a different Stage 2 checkpoint -- with a more condensed style of transcribing -- under the name `subtitle`.
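
For example, a minimal sketch of selecting one of these tags through the `revision` argument of `.from_pretrained()`:

```python
from transformers import AutoModelForSpeechSeq2Seq

# Load the Stage 1 (continued pretraining) checkpoint instead of the default
# Stage 2 "standard" tag; "subtitle" selects the condensed-style variant.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "KBLab/kb-whisper-large",
    revision="pretrained-checkpoint",
    cache_dir="cache",
)
```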

### Evaluation

#### WER

| Model size | | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | **KBLab** | **13.2** | **12.9** | **11.2** |
| ... | ... | ... | ... | ... |
| | OpenAI | 7.8 | 9.5 | 11.3 |

#### BLEU Score

| Model size | | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| tiny | KBLab | **76.6** | **73.7** | **74.3** |
| ... | ... | ... | ... | ... |
| medium | KBLab | **87.6** | **85.0** | **80.2** |
| | OpenAI | 77.1 | 70.1 | 68.9 |
| large-v3 | KBLab | **89.8** | **87.2** | **81.1** |
| | OpenAI | 84.9 | 79.1 | 75.1 |

### Citation

Paper reference coming soon.