---
datasets:
- japanese-asr/ja_asr.jsut_basic5000
- litagin/Galgame_Speech_ASR_16kHz
language:
- ja
metrics:
- cer
base_model:
- openai/whisper-large-v3-turbo
library_name: transformers
---

# Whisper Large V3 Japanese Phone Accent

This Whisper model transcribes Japanese speech into katakana with pitch accent annotations. It is built on whisper-large-v3-turbo and was fine-tuned on a 1/20 subset of the Galgame-Speech dataset and on the JSUT-5000 dataset.

## Training Data:

- **Stage 1**: Audio from the Galgame-Speech dataset was used. The text was converted into katakana sequences with pitch accent annotations using pyopenjtalk.
- **Stage 2**: The JSUT-5000 dataset was used, taking its original training set with pitch accent annotations. This data was split into 90% for training and 10% for evaluation.
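
The 90/10 split in Stage 2 can be illustrated with a small, seeded shuffle. This is only a sketch: the helper name `split_train_eval`, the seed, and the `utt_0001`-style IDs are hypothetical placeholders, and the actual split procedure used for this model is not documented here.

```python
import random


def split_train_eval(items, eval_fraction=0.1, seed=0):
    """Shuffle a copy of `items` and return (train, eval) lists."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical utterance IDs standing in for 5,000 annotated utterances.
utterance_ids = [f"utt_{i:04d}" for i in range(1, 5001)]
train_ids, eval_ids = split_train_eval(utterance_ids)
print(len(train_ids), len(eval_ids))  # 4500 500
```

Fixing the seed means the same utterances land in the evaluation split on every run, which keeps CER numbers comparable between training runs.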

## Evaluation Results:

- The model achieves a character error rate (CER) of approximately 4% on the JSUT-5000 test set, compared with 7% for pyopenjtalk.
- Training with Stage 1 alone gave a CER of 13%; errors included specific misreadings and confusion between on'yomi (音読み) and kun'yomi (訓読み) readings. Stage 2 improved on this.
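
For readers unfamiliar with the metric, CER is the character-level edit distance between the model output and the reference, divided by the number of reference characters. The sketch below is a plain-Python illustration, not the evaluation script used for the numbers above. The function names are hypothetical, and the example strings are plain katakana without accent marks, chosen only for demonstration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Illustrative strings only; one long-vowel mark is missing in the second hypothesis.
refs = ["コンニチワ", "サヨーナラ"]
hyps = ["コンニチワ", "サヨナラ"]
print(cer(refs, hyps))  # 0.1
```

Because the score is computed over the annotated sequence, an accent mark placed incorrectly would count as an error in the same way as a wrong kana, assuming the marks are encoded as ordinary characters.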

We are currently seeking Japanese datasets annotated with pitch accent. If you have such data, please reach out!