Automatic Speech Recognition
Transformers
Safetensors
Japanese
whisper
audio
hf-asr-leaderboard
Eval Results
Inference Endpoints
asahi417 committed
Commit 645cd6f • 1 Parent(s): c9c5c56

Update README.md

Files changed (1): README.md (+16, -13)
README.md CHANGED
@@ -63,16 +63,15 @@ model-index:
 _Kotoba-Whisper_ is a collection of distilled [Whisper](https://arxiv.org/abs/2212.04356) models for Japanese ASR. Following the original work of distil-whisper ([Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430)),
 we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model, and a student model that consists of the full encoder of the
 teacher Whisper model and a decoder with two layers initialized from the first and last layers of the teacher's decoder.
+
 As the initial version, we release ***kotoba-whisper-v1.0***, trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech),
 which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec of audio with 18 text tokens on average) after
-those transcriptions more than 10 WER are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter)).
+transcriptions with more than 10 WER are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for details).
 The model was trained for 8 epochs with batch size 256 at a sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
 
-
-Kotoba-whisper-v1.0 achieves better CER and WER than the [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) in the in-domain held-out test set from ReazonSpeech, and
-achieves competitive CER and WER on the out-of-domain test set including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
-the Japanese subset from [CommonVoice 8.0](https://huggingface.co/datasets/common_voice).
-
+Kotoba-whisper-v1.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set
+from ReazonSpeech, and achieves competitive CER and WER on out-of-domain test sets including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
+the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice) (see [Evaluation](#evaluation) for details).
 
 - ***CER***
 
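The WER filtering described in the hunk above can be sketched in plain Python. This is an illustrative reconstruction, not the actual kotoba-whisper pipeline code; the 10-WER threshold follows the text, while the function names and the sample layout are assumptions:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming (one rolling row)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (free when the words match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate in percent between a reference and a pseudo-label."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def wer_filter(samples, threshold=10.0):
    """Keep only samples whose pseudo-label is within the WER threshold."""
    return [s for s in samples if wer(s["reference"], s["pseudo_label"]) <= threshold]
```
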
@@ -302,12 +301,8 @@ See [https://huggingface.co/distil-whisper/distil-large-v3#model-details](https:
 
 
 ## Evaluation
-
-The following code-snippets demonstrates how to evaluate the Distil-Whisper model on the LibriSpeech validation-clean
-dataset with [streaming mode](https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet), meaning no
-audio data has to be downloaded to your local device.
-
-First, we need to install the required packages, including 🤗 Datasets to stream and load the audio data, and 🤗 Evaluate to
+The following code snippet demonstrates how to evaluate the kotoba-whisper model on the Japanese subset of CommonVoice 8.0.
+First, we need to install the required packages, including 🤗 Datasets to load the audio data, and 🤗 Evaluate to
 perform the WER calculation:
 
 ```bash
@@ -326,6 +321,7 @@ from tqdm import tqdm
 
 # config
 model_id = "kotoba-tech/kotoba-whisper-v1.0"
+dataset_name = "japanese-asr/ja_asr.common_voice_8_0"
 torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 audio_column = 'audio'
@@ -338,7 +334,7 @@ model.to(device)
 processor = AutoProcessor.from_pretrained(model_id)
 
 # load the dataset and resample the audio to 16kHz
-dataset = load_dataset("japanese-asr/ja_asr.common_voice_8_0", split="test")
+dataset = load_dataset(dataset_name, split="test")
 dataset = dataset.cast_column(audio_column, features.Audio(sampling_rate=processor.feature_extractor.sampling_rate))
 dataset = dataset.select([0, 1, 2, 3, 4, 5, 6])
 
@@ -375,6 +371,13 @@ cer = 100 * cer_metric.compute(predictions=all_transcriptions, references=all_re
 print(cer)
 ```
 
+The Hugging Face links to the major Japanese ASR evaluation datasets are summarized [here](https://huggingface.co/collections/japanese-asr/japanese-asr-evaluation-dataset-66051a03d6ca494d40baaa26).
+For example, to evaluate the model on JSUT Basic5000, change the `dataset_name`:
+
+```diff
+- dataset_name = "japanese-asr/ja_asr.common_voice_8_0"
++ dataset_name = "japanese-asr/ja_asr.jsut_basic5000"
+```
 
 ## Acknowledgements
 * OpenAI for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
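Since only `dataset_name` differs between evaluation runs, the swap shown in the diff generalizes to a loop over the evaluation sets. A minimal sketch; the `run_eval` callback is hypothetical and stands in for the evaluation script in this README:

```python
# Dataset IDs from the Japanese ASR evaluation collection linked in the README.
EVAL_DATASETS = [
    "japanese-asr/ja_asr.common_voice_8_0",
    "japanese-asr/ja_asr.jsut_basic5000",
]

def evaluate_all(run_eval, dataset_names=EVAL_DATASETS):
    """Map each dataset ID through an evaluation callback and collect the CERs.

    `run_eval` is expected to wrap the evaluation script with `dataset_name`
    as its only argument and return the computed CER.
    """
    return {name: run_eval(name) for name in dataset_names}
```

Wrapping the script body as `run_eval` avoids hand-editing `dataset_name` between runs.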